Word2vec

Word2vec is a set of algorithms for producing word embeddings, which are vector representations of words. The idea behind word2vec, and word embeddings in general, is to learn from the context of surrounding words: words that occur in similar contexts tend to be semantically similar, so they end up in the same neighbourhood in vector space.

Word2vec algorithms are based on shallow neural networks. Such a neural network might be optimizing for a well-defined task but the real goal is to produce word embeddings that can be used in NLP tasks.

Word2vec was invented at Google in 2013. It simplified computation compared to previous word embedding models. Since then, it has been widely adopted for many NLP tasks. Airbnb, Alibaba and Spotify have used it to power recommendation engines.

Discussion

What's the key insight that led to the invention of word2vec?
Before word2vec, a feedforward neural network was used to jointly learn the language model and word embeddings. This network had input, projection, hidden and output layers. Its complexity is dominated by the mapping from the projection layer to the hidden layer. For N (=10) previous words, D-dimensional word vectors giving a projection layer of size N x D (500-2000), and hidden layer size H (500-1000), complexity is N x D x H.
A recurrent neural network model removes the projection layer. The hidden layer connects to itself with a time delay. Complexity is now H x H.
Word2vec does away with the non-linear hidden layer that was a bottleneck in earlier models. There's a tradeoff. We lose precise representation but training becomes more efficient. This simpler model is used to learn word vectors. The task of learning a language model is considered separately using these word vectors. Finally, not just past words but also future words are considered for context. When input words are projected, their vectors are averaged at the projection layer, unlike earlier models.
Further simplification of the softmax layer computation enabled word2vec to be trained on 30 billion words, a scale that was not possible with earlier models.
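To make the comparison concrete, the dominant terms can be evaluated for some illustrative sizes. This is only a sketch: the values of N, D, H and V below are assumptions chosen for illustration, and the CBOW term N x D + D x log2(V) is taken from the original word2vec paper and assumes hierarchical softmax at the output.

```python
import math

# Illustrative sizes (assumptions, not values from a specific experiment).
N = 10         # number of context words
D = 100        # word vector dimensionality
H = 500        # hidden layer size
V = 1_000_000  # vocabulary size

feedforward = N * D * H          # projection-to-hidden mapping dominates
recurrent = H * H                # hidden-to-hidden recurrence dominates
cbow = N * D + D * math.log2(V)  # no hidden layer, hierarchical softmax output

for name, cost in [("feedforward", feedforward),
                   ("recurrent", recurrent),
                   ("CBOW", cbow)]:
    print(f"{name:12s} {cost:12,.0f}")
```

With these assumed sizes, the feedforward term is 500,000 and the recurrent term is 250,000, while the CBOW term is roughly 3,000, which shows where the savings from dropping the hidden layer come from.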
What are the main models that are part of word2vec?
Two word2vec models for obtaining word embeddings. Source: Rong 2016, fig. 2 and 3.
Let's use a vocabulary of V words, a context of C words, dense N-dimensional word vectors, an embedding matrix W of dimensions VxN at the input and a context matrix W' of dimensions NxV at the output.
Word2vec has two models for deriving word embeddings:
- Continuous Bag-of-Words (CBOW): We take words surrounding a given word and try to predict the latter. Each word is a one-hot coded vector. Via an embedding matrix, this is transformed into an N-dimensional vector that's the average of C word vectors. From this vector, we compute probabilities for each word in the vocabulary. The word with the highest probability is the predicted word.
- Continuous Skip-gram: We take one word and try to predict words that occur around it. At the output, we try to predict C different words.
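The forward pass of both models can be sketched in a few lines of plain Python. This is a minimal illustration with an untrained, randomly initialised model: the toy vocabulary, the dimension N = 3 and the helper names (`cbow_forward`, `skipgram_forward`, `output_probs`) are assumptions made for this example, not part of any word2vec implementation.

```python
import math
import random

random.seed(0)
vocab = ["the", "quick", "brown", "fox", "jumps"]
V, N = len(vocab), 3  # vocabulary size, embedding dimension
index = {w: i for i, w in enumerate(vocab)}

# W is V x N (input embeddings); W_out plays the role of W' (N x V).
W = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def output_probs(h):
    """Project an N-dimensional vector onto the vocabulary via W'."""
    scores = [sum(h[k] * W_out[k][j] for k in range(N)) for j in range(V)]
    return softmax(scores)

def cbow_forward(context_words):
    # A one-hot vector times W selects one row, so we index rows directly.
    rows = [W[index[w]] for w in context_words]
    h = [sum(col) / len(rows) for col in zip(*rows)]  # average of C vectors
    return output_probs(h)

def skipgram_forward(center_word):
    # One distribution; training compares it with each of the C context words.
    return output_probs(W[index[center_word]])

probs = cbow_forward(["the", "quick", "fox", "jumps"])
predicted = vocab[probs.index(max(probs))]
```

Since the weights are random, `predicted` is meaningless here. Training would adjust W and W' so that the true centre word (for CBOW) or the true context words (for skip-gram) receive high probability.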
Could you describe the details of how word2vec learns word embeddings?
Use of sliding window in word2vec skip-gram model. Source: Alammar 2019.
Word2vec uses a neural network model based on word-context pairs. With each training step, the weights are adjusted with the goal of minimizing the loss function, that is, minimize the error between predicted output and actual output. An iteration uses one word-context pair. Training on the entire input corpus may be considered one training epoch.
Consider the skip-gram model. A sliding window around the current input word is used to predict the words within the window. Once this iteration adjusts the weights, the window slides to the next word in the corpus.
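The sliding window can be sketched as a function that turns a token sequence into (input word, context word) training pairs. The function name `skipgram_pairs` and the toy sentence are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs from a sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # clip window at the start
        hi = min(len(tokens), i + window + 1)   # clip window at the end
        for j in range(lo, hi):
            if j != i:                          # skip the input word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

Each pair is one training example: the first word is fed to the network and the second is the expected output.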
Word2vec is not a deep learning technique. In fact, there are no non-linear hidden layers, although it's common to refer to the embedding layer as the hidden layer or the projection layer. A typical pipeline involves selecting the vocabulary from a text corpus, sliding the window to select context, performing extra tasks to simplify softmax computation, and iterating through the neural network model.
Why is the softmax layer of word2vec considered computationally difficult?
Word2vec uses a softmax layer. Source: Parellada 2017.
The softmax layer treats the problem of selecting the most probable word as a multiclass classification problem. It computes the probability of each word being the actual word. The probabilities of all words add up to 1. For the skip-gram model, it does this for each context word.
Consider a vocabulary of K words, and input and output vectors $v_w$ and $v'_w$ of word w. For skip-gram, the softmax function gives the probability of an output word given the input word: $$p(w_O|w_I) = \frac{e^{{v'_{w_O}}^T\,v_{w_I}}}{\sum_{w=1}^{K} e^{{v'_w}^T\,v_{w_I}}}$$
With a vocabulary of hundreds of thousands of words, computing the softmax probability for each word at each iteration is computationally expensive. Hierarchical Softmax addresses this by arranging the vocabulary as a binary tree, so that a word's probability is computed from the inner units on the path from the root to that word, and these inner units are shared across words. Negative Sampling is an alternative. It selects a few negative samples and updates only these and the actual output, scoring each with a sigmoid rather than normalizing over the full vocabulary. Both simplify computation without much loss of accuracy.
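A minimal sketch of the negative sampling loss for a single training pair follows. It uses the objective published in the original word2vec work, where the true output word and each sampled negative word are scored with a logistic (sigmoid) function. The vectors and the function name `negative_sampling_loss` are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(v_in, v_out_pos, v_out_negs):
    """Loss for one (input, output) pair with k sampled negative words."""
    # Pull the true output vector towards the input vector...
    loss = -math.log(sigmoid(dot(v_out_pos, v_in)))
    # ...and push each sampled negative vector away from it.
    for v_neg in v_out_negs:
        loss -= math.log(sigmoid(-dot(v_neg, v_in)))
    return loss

v_in = [0.5, -0.2]                      # input vector of the input word
positive = [1.0, 0.0]                   # output vector of the true context word
negatives = [[0.0, 1.0], [-0.5, 0.5]]   # k = 2 sampled negative words
loss = negative_sampling_loss(v_in, positive, negatives)
```

Only k + 1 output vectors are touched per training pair, instead of all K vectors in the denominator of the full softmax.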
Sebastian Ruder gives a detailed explanation of different softmax approximation techniques.
What are other improvements to word2vec?
The word2vec implementation can use a dynamic window size, uniformly sampled from the range [1, k]. This has the effect of giving more weight to closer words. Smaller window sizes tend to produce embeddings in which similar words are interchangeable. Larger window sizes tend to produce embeddings in which similar words are topically related.
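The weighting effect of a dynamic window can be checked with a small simulation. This is a sketch under the assumption that one window size is drawn per input word. The function name and the synthetic tokens are hypothetical.

```python
import random

def dynamic_window_pairs(tokens, k, rng):
    """Return (input_word, context_word, distance) with a per-word window."""
    pairs = []
    for i, center in enumerate(tokens):
        b = rng.randint(1, k)  # window size sampled uniformly from [1, k]
        for j in range(max(0, i - b), min(len(tokens), i + b + 1)):
            if j != i:
                pairs.append((center, tokens[j], abs(i - j)))
    return pairs

rng = random.Random(0)
tokens = ["w%d" % i for i in range(10000)]
pairs = dynamic_window_pairs(tokens, k=5, rng=rng)

# Count how often a context word at each distance is used for training.
counts = {d: sum(1 for _, _, dist in pairs if dist == d) for d in range(1, 6)}
print(counts)
```

A word at distance d is used only when the sampled window is at least d, which happens with probability (k - d + 1) / k. Adjacent words are therefore always used, while words at distance k are used about one time in k.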
Word2vec can also ignore rare words. In fact, rare words are discarded before the context window is applied. This increases the effective window size for some words. In addition, we can subsample frequent words with the insight that being frequent, they are less informative. The net effect is that the window can reach words that were originally farther apart. Such words may still be topically similar, and this is captured in the embeddings.
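Both filters can be sketched as a preprocessing step applied before any window is formed. The keep probability sqrt(t / f(w)) follows the subsampling formula published in the word2vec paper. Actual implementations may differ in detail, and the threshold used here is far larger than the paper's suggested value of around 1e-5 because the toy corpus is tiny. The function name `filter_tokens` and its parameters are assumptions.

```python
import math
import random
from collections import Counter

def filter_tokens(tokens, min_count=2, t=0.05, rng=None):
    """Drop rare words, then randomly subsample frequent ones."""
    rng = rng or random.Random(0)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        if counts[w] < min_count:
            continue  # rare word: removed before windows are formed
        freq = counts[w] / total
        keep_prob = min(1.0, math.sqrt(t / freq))  # 1 - P(discard)
        if rng.random() < keep_prob:
            kept.append(w)
    return kept

tokens = ["the"] * 900 + ["cat"] * 50 + ["dog"] * 49 + ["zyzzyva"]
random.Random(1).shuffle(tokens)
kept = filter_tokens(tokens)
```

Because tokens are removed from the sequence itself, words that were separated by a discarded token become neighbours, which is what widens the effective window.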
What are some tips for those trying to use word2vec?
Developers can read sample TensorFlow code for the CBOW model, sample NumPy code, or sample Gensim code.
Designed by Xin Rong, wevi is a useful tool to visualize how word2vec learns word embeddings.
Sebastian Ruder gives a number of tips. Use Skip-Gram Negative Sampling (SGNS) as a baseline. Use many negative samples for better results. Use context distribution smoothing before selecting negative samples so that frequent words are not sampled quite so frequently. He also considers SGNS a better technique than CBOW.
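Context distribution smoothing can be sketched as raising word counts to a power below 1 before normalising. The exponent 0.75 is the value used in the word2vec paper. The function name and the toy counts are assumptions for illustration.

```python
import random
from collections import Counter

def smoothed_distribution(tokens, alpha=0.75):
    """Unigram distribution with counts raised to the power alpha."""
    counts = Counter(tokens)
    weights = {w: c ** alpha for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

tokens = ["the"] * 90 + ["fox"] * 9 + ["zebra"]
raw = smoothed_distribution(tokens, alpha=1.0)      # plain unigram frequencies
smooth = smoothed_distribution(tokens, alpha=0.75)  # smoothed version

# Draw negative samples from the smoothed distribution.
rng = random.Random(0)
words = list(smooth)
negatives = rng.choices(words, weights=[smooth[w] for w in words], k=5)
```

Smoothing moves probability mass from frequent words such as "the" to rarer words such as "zebra", so negative samples are less dominated by the most common words.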

Milestones

2005

Example binary tree for hierarchical softmax. Source: Rong 2016, fig. 4.

Morin and Bengio come up with the idea of hierarchical softmax. A word is modelled as a composition of inner units, which are then arranged as a binary tree. Given a vocabulary of V words, the probability of an output word is computed from the softmax computation of the inner units that lead to the word from the root of the tree. This reduces complexity from O(V) to O(log(V)). This idea becomes important later in word2vec models. In 2009, Mnih and Hinton explore different ways to construct the tree.
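The path-based computation can be sketched for a toy balanced tree over four words. Each inner unit makes a binary left/right decision with a sigmoid, and a word's probability is the product of the decisions along its path. The tree layout, vectors and function name are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_probability(h, path, inner_vectors):
    """Probability of a leaf word as a product of binary decisions.

    path lists (inner_node_id, direction) from the root, where
    direction is +1 for the left child and -1 for the right child.
    """
    p = 1.0
    for node, direction in path:
        score = sum(a * b for a, b in zip(inner_vectors[node], h))
        p *= sigmoid(direction * score)
    return p

# Root 0 has inner node 1 on the left and inner node 2 on the right.
inner_vectors = {0: [0.3, -0.1], 1: [0.2, 0.4], 2: [-0.5, 0.1]}
paths = {
    "a": [(0, +1), (1, +1)],
    "b": [(0, +1), (1, -1)],
    "c": [(0, -1), (2, +1)],
    "d": [(0, -1), (2, -1)],
}
h = [0.7, -0.3]  # vector of the input word
probs = {w: hs_probability(h, p, inner_vectors) for w, p in paths.items()}
```

Since sigmoid(x) + sigmoid(-x) = 1 at every inner unit, the leaf probabilities sum to 1 without ever normalising over the whole vocabulary, and each word needs only about log2(V) inner units.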

2012

Gutmann and Hyvärinen introduce Noise Contrastive Estimation (NCE) as an alternative to hierarchical softmax. The basic idea is that a good model can differentiate data from noise using logistic regression. Mnih and Teh apply NCE to language modelling. This is similar to the hinge loss proposed by Collobert and Weston in 2008 to rank data above noise.

Jan 2013

At Google, Mikolov et al. develop word2vec, although this name refers to a software implementation rather than the models. They propose two models: continuous bag-of-words and continuous skip-gram. They improve on earlier state-of-the-art models by removing the non-linear hidden layer, which makes these log-linear models. They also make use of hierarchical softmax, with a Huffman binary tree representing the vocabulary. They note that this speeds up evaluation by about 2X.
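A Huffman tree gives frequent words short paths from the root, which is why it helps hierarchical softmax. The sketch below computes only the path length of each word from assumed toy frequencies. The function name is hypothetical and the actual word2vec code is organised differently.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {word: path length} for a Huffman tree built from frequencies."""
    tiebreak = count()  # avoids comparing word lists when frequencies tie
    heap = [(f, next(tiebreak), [w]) for w, f in freqs.items()]
    heapq.heapify(heap)
    depth = {w: 0 for w in freqs}
    while len(heap) > 1:
        f1, _, words1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, words2 = heapq.heappop(heap)
        for w in words1 + words2:
            depth[w] += 1  # every merged word moves one level deeper
        heapq.heappush(heap, (f1 + f2, next(tiebreak), words1 + words2))
    return depth

freqs = {"the": 50, "of": 25, "fox": 12, "jumps": 8, "zebra": 5}
lengths = huffman_code_lengths(freqs)
total = sum(freqs.values())
average = sum(freqs[w] * lengths[w] for w in freqs) / total
```

Here the most frequent word sits one step from the root while the rarest words sit four steps away, so the frequency-weighted average path is shorter than in a balanced tree over the same five words.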

Oct 2013

Mikolov et al. improve on their earlier models by proposing negative sampling, which is a simplification of NCE. This is possible because NCE tries to maximize the log probability of the softmax whereas we are more interested in the word embeddings. Negative sampling is simpler and faster than hierarchical softmax. For small datasets, 5-20 negative samples may be required. For large datasets, 2-5 negative samples may be enough.

Jan 2019

Inspired by word2vec, some researchers produce code embeddings, vector representations of snippets of software code. Called code2vec, this work could enable us to apply neural networks to programming tasks such as automated code reviews and API discovery. This is just one example of many advances due to word2vec. Another example is doc2vec from 2014.

Jun 2019

Scaling word2vec using multiple GPUs. Source: Li et al. 2019, fig. 9.

Word2vec is sequential due to strong dependencies across word-context pairs. Researchers show how word2vec can be trained on a GPU cluster by reducing dependency within a large training batch. Without loss of accuracy, they achieve 7.5 times acceleration using 16 GPUs. They also note that using the Chainer framework, it's easy to implement CNN-based subword-level models.

Cite As

Devopedia. 2020. "Word2vec." Version 4, September 5. Accessed 2023-11-12. https://devopedia.org/word2vec

Contributed by
1 author

Last updated on
2020-09-05 08:34:49
