Mikolov et al. introduced the world to the power of word vectors with two main methods: Skip-Gram and Continuous Bag of Words (CBOW). Soon after, two more popular word embedding methods that build on these ideas were introduced.
In this post, we’ll talk about GloVe and fastText, which are extremely popular word vector models in the NLP world.
Global Vectors (GloVe)
Pennington et al. argue that the online scanning approach used by word2vec is suboptimal since it does not fully exploit the global statistical information regarding word co-occurrences.
In the model they call Global Vectors (GloVe), they say: “The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.”
To understand how GloVe works, we need to understand the two families of methods it builds on: global matrix factorization and local context windows.
In NLP, global matrix factorization is the process of using matrix factorization methods from linear algebra to reduce the dimensionality of large term-frequency matrices. These matrices usually represent the occurrence or absence of words in a document. Applying global matrix factorization to term-frequency matrices is known as Latent Semantic Analysis (LSA).
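To make the factorization step concrete, here is a minimal sketch of LSA using a truncated SVD. It assumes NumPy is available, and the tiny vocabulary, the counts, and the choice of k=2 are made up purely for illustration.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# The vocabulary and counts are invented for illustration.
terms = ["cat", "dog", "pet", "stock", "market"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Factorize X = U @ diag(s) @ Vt and keep only the k largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]      # one k-dimensional vector per term
doc_vectors = Vt[:k, :].T * s[:k]    # one k-dimensional vector per document

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, dog, stock = (term_vectors[terms.index(t)] for t in ("cat", "dog", "stock"))
print("cat vs dog:  ", round(cosine(cat, dog), 3))
print("cat vs stock:", round(cosine(cat, stock), 3))
```

In this toy matrix, terms that appear in the same documents end up pointing in the same direction in the reduced space, while terms from unrelated documents do not.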
Local context window methods are CBOW and Skip–Gram. These were discussed in detail in the previous post. Skip-gram works well with small amounts of training data and represents even words that are considered rare, whereas CBOW trains several times faster and has slightly better accuracy for frequent words.
The authors of the paper observe that, instead of learning the raw co-occurrence probabilities, it is more useful to learn ratios of these co-occurrence probabilities. The ratios better discriminate the subtleties in term-term relevance and boost performance on word analogy tasks.
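A small worked example may help show why ratios are informative. The sketch below echoes the ice/steam illustration from the GloVe paper, but the counts are invented and the helper names are hypothetical.

```python
# Toy co-occurrence counts: counts[w][k] is how often context word k appears near w.
# The numbers are invented for illustration, not taken from a real corpus.
counts = {
    "ice":   {"solid": 19, "gas": 1,  "water": 30, "fashion": 1},
    "steam": {"solid": 1,  "gas": 18, "water": 32, "fashion": 1},
}

def cooccurrence_prob(word, context):
    """P(context | word) = count(word, context) / total count for word."""
    total = sum(counts[word].values())
    return counts[word][context] / total

def prob_ratio(word_a, word_b, context):
    """Ratio P(context | word_a) / P(context | word_b)."""
    return cooccurrence_prob(word_a, context) / cooccurrence_prob(word_b, context)

for k in ["solid", "gas", "water", "fashion"]:
    print(f"{k:8s} ratio = {prob_ratio('ice', 'steam', k):.2f}")
```

With these toy counts, the ratio is large for a context word related only to "ice", small for one related only to "steam", and close to 1 for words that relate to both or to neither. The raw probabilities alone do not separate these cases as cleanly.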
This is how it works: instead of extracting the embeddings from a neural network that is designed to perform a different task, like predicting the focus word (CBOW) or predicting neighboring words (Skip-Gram), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words occur near each other.
For example, if the two words “cat” and “dog” occur in the context of each other, say 20 times in a 10-word window in the document corpus, then:
Vector(cat) . Vector(dog) = log(20)
This forces each word vector to encode the frequency distribution of the words that occur near that word across the whole corpus, rather than within a single context window.
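The idea above can be sketched in a few lines. This is a simplified illustration, not the published model: the full GloVe objective also includes bias terms and a weighting function over the counts, both omitted here, and the function names are hypothetical.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=3):
    """Count how often each unordered word pair appears within `window` tokens."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def squared_error(vec_a, vec_b, count):
    """Squared gap between the dot product and log(count)."""
    return (dot(vec_a, vec_b) - math.log(count)) ** 2

tokens = "the cat chased the dog and the dog chased the cat".split()
counts = cooccurrence_counts(tokens, window=3)
print(counts[("cat", "dog")])   # "cat" and "dog" fall within 3 tokens of each other twice

# For the example in the text (20 co-occurrences), the target dot product is log(20).
print(round(math.log(20), 3))
```

Training then amounts to adjusting the vectors so that `squared_error` is small for every word pair with a non-zero count.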
fastText is another word embedding method that extends the word2vec model. Instead of learning vectors for words directly, fastText represents each word as a bag of character n-grams. For example, for the word “artificial” with n=3, the fastText representation is <ar, art, rti, tif, ifi, fic, ici, cia, ial, al>, where the angle brackets mark the beginning and end of the word.
This helps capture the meaning of shorter words and allows the embeddings to understand suffixes and prefixes. Once a word has been represented by its character n-grams, a skip-gram model is trained to learn the embeddings. The model is considered a bag-of-words model with a sliding window over the word, because no internal structure of the word is taken into account: as long as the characters fall within the window, the order of the n-grams doesn’t matter.
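The n-gram extraction described above can be sketched as follows. The function name is hypothetical, and this version extracts a single n-gram length, whereas the fastText library extracts a range of lengths.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, with < and > marking its boundaries."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("artificial"))
# ['<ar', 'art', 'rti', 'tif', 'ifi', 'fic', 'ici', 'cia', 'ial', 'al>']
```

Because of the boundary markers, the prefix n-gram "<ar" is distinct from an "ar" that appears in the middle of another word.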
fastText works well with rare words. So even if a word wasn’t seen during training, it can be broken down into n-grams to get its embeddings.
Word2vec and GloVe, by contrast, cannot provide a vector for a word that is not in the model's dictionary, so this is a major advantage of fastText.
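Here is a minimal sketch of how an unseen word can still get a vector by summing the vectors of its known n-grams. The n-gram table and its two-dimensional values are made up for illustration. The actual library stores n-grams in hashed buckets rather than a plain dictionary, so treat this as a picture of the idea only.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, with < and > marking its boundaries."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

# Hypothetical n-gram vectors, as if learned from words such as "kind" and "unfair".
ngram_vectors = {
    "<un": [0.5, 0.0],
    "kin": [0.0, 0.25],
    "ind": [0.25, 0.25],
    "nd>": [0.0, 0.5],
}

def word_vector(word, ngram_vectors, dim=2, n=3):
    """Sum the vectors of the word's known n-grams; unknown n-grams are skipped."""
    vec = [0.0] * dim
    for gram in char_ngrams(word, n):
        gram_vec = ngram_vectors.get(gram)
        if gram_vec is None:
            continue
        for i, value in enumerate(gram_vec):
            vec[i] += value
    return vec

# "unkind" was never seen as a whole word, but four of its n-grams are known.
print(word_vector("unkind", ngram_vectors))   # [0.75, 1.0]
```

A word-level lookup table, as used by word2vec and GloVe, has no entry to fall back on in this situation.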
Here are some references for the models described here:
- GloVe: Global Vectors for Word Representation: This paper shows you the internal workings of the GloVe model.
- Pre-Trained GloVe Models: You can find word vectors pre-trained on Wikipedia here.
- Enriching Word Vectors with Subword Information: This paper builds on word2vec and shows how you can use sub-word information in order to build word vectors.
- fastText Library by Facebook: This contains word2vec models and a pre-trained model which you can use for tasks like sentence classification.
We’ve now seen the different word vector methods that are out there. GloVe showed us how we can leverage the global statistical information contained in a document corpus, whereas fastText builds on the word2vec models but considers sub-words instead of whole words.
You might ask which one of the different models is best. Well, that depends on your data and the problem you’re trying to solve!