This project involves implementing and comparing word embedding models using Singular Value Decomposition (SVD) and Skip-Gram with Negative Sampling (SGNS). The analysis focuses on the differences in the quality of the embeddings produced and their effectiveness in downstream tasks.
Many NLP systems employ modern distributional semantic algorithms, known as word embedding algorithms, to generate meaningful numerical representations for words. These algorithms aim to create embeddings where words with similar meanings are represented closely in a mathematical space. Word embeddings fall into two main categories: frequency-based and prediction-based.
- Frequency-based embeddings: Utilize vectorization methods such as Count Vector, TF-IDF Vector, and Cooccurrence Matrix.
- Prediction-based embeddings: Exemplified by Word2Vec, which uses models such as Continuous Bag of Words (CBOW) and Skip-Gram (SG).
Implemented a frequency-based word embedding model: trained word vectors by first building a co-occurrence matrix and then applying SVD, as sketched below.
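A minimal sketch of this pipeline on a toy corpus; the names (`build_cooccurrence`, `window_size`, `embed_dim`) are illustrative, not the project's actual code:

```python
from collections import defaultdict
import numpy as np

def build_cooccurrence(sentences, window_size=2):
    # Map each vocabulary word to a row/column index.
    vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}
    M = np.zeros((len(vocab), len(vocab)), dtype=np.float32)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window_size), min(len(sent), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    M[vocab[word], vocab[sent[j]]] += 1.0
    return vocab, M

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, M = build_cooccurrence(sentences, window_size=2)

# Keep the top-k singular directions as dense word embeddings.
embed_dim = 2
U, S, Vt = np.linalg.svd(M, full_matrices=False)
word_embeddings = U[:, :embed_dim] * S[:embed_dim]   # shape: (|V|, embed_dim)
```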
Implemented the Word2Vec model and trained word vectors using the Skip-Gram model with Negative Sampling; a minimal sketch of the objective follows.
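A minimal sketch of Skip-Gram with Negative Sampling in PyTorch, assuming (center, context) index pairs and sampled negatives are already prepared; the class and parameter names (`SGNS`, `vocab_size`, `embed_dim`) are illustrative:

```python
import torch
import torch.nn as nn

class SGNS(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)   # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, embed_dim)  # context-word vectors

    def forward(self, center, context, negatives):
        v = self.in_embed(center)                 # (B, D)
        u_pos = self.out_embed(context)           # (B, D)
        u_neg = self.out_embed(negatives)         # (B, K, D)
        pos_score = torch.sum(v * u_pos, dim=-1)                    # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(-1)).squeeze(-1)   # (B, K)
        # Negative-sampling objective: maximize log sigma(pos) + sum log sigma(-neg).
        loss = -(torch.nn.functional.logsigmoid(pos_score)
                 + torch.nn.functional.logsigmoid(-neg_score).sum(dim=-1))
        return loss.mean()

# Example batch with random indices, just to show the shapes involved.
model = SGNS(vocab_size=5000, embed_dim=100)
center = torch.randint(0, 5000, (8,))
context = torch.randint(0, 5000, (8,))
negatives = torch.randint(0, 5000, (8, 5))   # 5 negative samples per pair
loss = model(center, context, negatives)
loss.backward()
```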
Trained the models on the given CSV files: the News Classification Dataset. Note: Used the Description column of train.csv for training word vectors; the label/index column is used for the downstream classification task.
After successfully creating word vectors using the above two methods, evaluated them on a downstream classification task. Used the same RNN architecture and hyperparameters across both vectorization methods, as sketched below.
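A minimal sketch of such a downstream classifier in PyTorch, assuming the pretrained word vectors are loaded into an `nn.Embedding` layer; names like `RNNClassifier`, `hidden_dim`, and `num_classes` are illustrative:

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, pretrained_embeddings, hidden_dim=128, num_classes=4):
        super().__init__()
        # Initialize the embedding layer from the pretrained word vectors
        # (SVD or Skip-Gram); only the vectors differ between the two runs.
        self.embed = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)
        self.rnn = nn.RNN(pretrained_embeddings.size(1), hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):            # token_ids: (B, T)
        x = self.embed(token_ids)            # (B, T, D)
        _, h_n = self.rnn(x)                 # h_n: (1, B, H)
        return self.fc(h_n.squeeze(0))       # (B, num_classes)

# Example forward pass with random data, just to show the shapes.
embeddings = torch.randn(5000, 100)          # stand-in for the trained vectors
model = RNNClassifier(embeddings)
logits = model(torch.randint(0, 5000, (8, 20)))
```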
Compared and analyzed which of the two word vectorization methods performs better using performance metrics such as accuracy, F1 score, precision, recall, and the confusion matrix on both the train and test sets. Wrote a detailed report on why one technique might perform better than the other, including the possible shortcomings of both techniques (SVD and Word2Vec).
Experimented with three different context window sizes. Reported performance metrics for all three context window configurations. Mentioned which configuration performs the best and discussed possible reasons for it.
To execute any file, use `python3 <filename>`.

To load the pretrained models, use `torch.load("<filename>.pt")`.
Loading `svd-word-vectors.pt` and `skip-gram-word-vectors.pt` gives us a dictionary. From this dictionary, we can access `words_to_ind` using `dic["words_to_ind"]` and `word_embeddings` using `dic["word_embeddings"]`.

To get the word embedding for a token:
- Get the index (`idx`) using `words_to_ind[token]`.
- Get the word embedding using `word_embeddings[idx]`.
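Putting this together, a minimal sketch of retrieving one embedding; the token `"market"` is just an example, and `word_embeddings` is assumed to be indexable by integer position:

```python
import torch

dic = torch.load("svd-word-vectors.pt")       # or "skip-gram-word-vectors.pt"
words_to_ind = dic["words_to_ind"]
word_embeddings = dic["word_embeddings"]

token = "market"                              # any token present in the vocabulary
idx = words_to_ind[token]
embedding = word_embeddings[idx]
```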
Loading `svd-classification-model.pt` and `skip-gram-classification-model.pt` gives us a model which provides the class index when given a sentence.
- `svd-classification` means the model is trained using word embeddings obtained by the SVD method.
- Similarly, `skip-gram-classification` refers to the model trained using word embeddings obtained by the Skip-Gram with Negative Sampling method.
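A minimal sketch of running one of these models, assuming the sentence must first be mapped to token indices via `words_to_ind` and that the model returns class logits; both assumptions are illustrative and may differ from the saved model's actual interface:

```python
import torch

model = torch.load("svd-classification-model.pt")   # or "skip-gram-classification-model.pt"
model.eval()

dic = torch.load("svd-word-vectors.pt")
words_to_ind = dic["words_to_ind"]

sentence = "stocks rallied after the earnings report"
# Keep only in-vocabulary tokens; batch dimension of 1.
token_ids = torch.tensor([[words_to_ind[w] for w in sentence.split() if w in words_to_ind]])

with torch.no_grad():
    logits = model(token_ids)
    class_index = logits.argmax(dim=-1).item()
print(class_index)
```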