Word2vec
- Summary
- Discussion
- What's the key insight that led to the invention of word2vec?
- What are the main models that are part of word2vec?
- Could you describe the details of how word2vec learns word embeddings?
- Why is the softmax layer of word2vec considered computationally difficult?
- What are other improvements to word2vec?
- What are some tips for those trying to use word2vec?
- Milestones
- References
- Further Reading
Word2vec is a set of algorithms for producing word embeddings, that is, vector representations of words. The idea behind word2vec, and word embeddings in general, is to use the context of surrounding words to identify semantically similar words, since such words tend to lie close together in vector space.
Word2vec algorithms are based on shallow neural networks. The network is trained on a well-defined prediction task, but the real goal is to produce word embeddings that can be used in downstream NLP tasks.
Word2vec was invented at Google in 2013. Word2vec simplified computation compared to previous word embedding models. Since then, it has been popularly adopted by others for many NLP tasks. Airbnb, Alibaba and Spotify have used it to power recommendation engines.
Discussion
What's the key insight that led to the invention of word2vec? Before word2vec, a feedforward neural network was used to jointly learn the language model and word embeddings. This network had input, projection, hidden and output layers. Its complexity is dominated by the mapping from the projection layer to the hidden layer. For N (=10) previous words, D-dimensional word vectors (500-2000 dimensions), and hidden layer size H (500-1000), the complexity is N x D x H.
A recurrent neural network model removes the projection layer. The hidden layer connects to itself with a time delay. Complexity is now dominated by H x H.
Word2vec does away with the non-linear hidden layer that was a bottleneck in earlier models. There's a tradeoff: we lose some representational precision but training becomes much more efficient. This simpler model is used only to learn word vectors; the task of learning a language model is handled separately using these word vectors. Finally, not just past words but also future words are used as context. When input words are projected, their vectors are averaged at the projection layer, unlike in earlier models.
Further simplification of the softmax layer computation enabled word2vec to be trained on 30 billion words, a scale that was not possible with earlier models.
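As a rough, back-of-the-envelope comparison, the snippet below plugs the example sizes above into the per-example complexity expressions reported by Mikolov et al. (2013a), assuming a hierarchical-softmax output layer costing log2(V) per example; the exact values are illustrative only.

```python
# Rough per-example training cost for the three architectures, using the
# example sizes from the text. Output-layer cost assumes hierarchical
# softmax (log2(V) tree nodes) rather than a full V-way softmax.
import math

N, D, H, V = 10, 500, 500, 1_000_000   # context words, vector dim, hidden size, vocabulary

nnlm  = N * D * H + H * math.log2(V)   # feedforward NNLM: projection-to-hidden term dominates
rnnlm = H * H + H * math.log2(V)       # recurrent NNLM: hidden-to-hidden recurrence dominates
cbow  = N * D + D * math.log2(V)       # word2vec CBOW: no hidden layer at all

print(f"NNLM : {nnlm:,.0f}")   # ~2.5 million operations
print(f"RNNLM: {rnnlm:,.0f}")  # ~260 thousand
print(f"CBOW : {cbow:,.0f}")   # ~15 thousand
```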
What are the main models that are part of word2vec? Let's use a vocabulary of V words, a context of C words, dense N-dimensional word vectors, an embedding matrix W of dimensions VxN at the input, and a context matrix W' of dimensions NxV at the output.
Word2vec has two models for deriving word embeddings:
- Continuous Bag-of-Words (CBOW): We take the words surrounding a given word and try to predict the latter. Each word is a one-hot coded vector. Via the embedding matrix, the context is transformed into an N-dimensional vector that's the average of the C word vectors. From this vector, we compute probabilities for each word in the vocabulary. The word with the highest probability is the predicted word (see the sketch after this list).
- Continuous Skip-gram: We take one word and try to predict words that occur around it. At the output, we try to predict C different words.
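To make the CBOW data flow concrete, here's a minimal NumPy sketch using the matrices W and W' defined above. The toy sizes, random weights and word indices are purely illustrative.

```python
# Minimal CBOW forward pass: average C context embeddings, score every
# vocabulary word, and pick the most probable one.
import numpy as np

V, N, C = 10, 4, 2                       # vocabulary size, embedding dim, context size
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, N))  # input embedding matrix (V x N)
Wp = rng.normal(scale=0.1, size=(N, V))  # output/context matrix (N x V)

context_ids = [3, 7]                     # indices of the C context words
h = W[context_ids].mean(axis=0)          # averaged projection of the context
scores = h @ Wp                          # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

predicted = int(np.argmax(probs))        # word with the highest probability
print(predicted, probs[predicted])
```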
Could you describe the details of how word2vec learns word embeddings? Word2vec uses a neural network model trained on word-context pairs. With each training step, the weights are adjusted to minimize the loss function, that is, to minimize the error between the predicted output and the actual output. An iteration uses one word-context pair. Training on the entire input corpus may be considered one training epoch.
Consider the skip-gram model. A sliding window around the current input word is used to predict the words within the window. Once this iteration adjusts the weights, the window slides to the next word in the corpus.
Word2vec is not a deep learning technique. In fact, there are no hidden layers, although it's common to refer to the embedding layer as a hidden layer or projection layer. A typical pipeline involves selecting the vocabulary from a text corpus, sliding the window to select context, performing extra tasks to simplify softmax computation, and iterating through the neural network model.
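To illustrate the sliding window, here is a small Python sketch (a hypothetical helper, not code from the word2vec implementation) that yields skip-gram word-context training pairs from a tokenized corpus; dynamic window sizes, subsampling and negative sampling are omitted.

```python
# Generate (centre, context) skip-gram training pairs with a fixed window.
def skipgram_pairs(tokens, window=2):
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield centre, tokens[j]

corpus = "the quick brown fox jumps over the lazy dog".split()
print(list(skipgram_pairs(corpus, window=2))[:6])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...]
```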
Why is the softmax layer of word2vec considered computationally difficult? The softmax layer treats the problem of selecting the most probable word as a multiclass classification problem. It computes the probability of each word being the actual word, with the probabilities of all words adding up to 1. For the skip-gram model, it does this for each context word.
Consider a vocabulary of K words, and input and output vectors \(v_w\) and \(v'_w\) of word w. For skip-gram, the softmax function gives the probability of an output word given the input word, $$p(w_O|w_I) = \frac{e^{{v'_{w_O}}^T\,v_{w_I}}}{\sum_{w=1}^{K} e^{{v'_w}^T\,v_{w_I}}}$$
With a vocabulary of hundreds of thousands of words, computing the softmax probability for every word in every iteration is computationally expensive. Hierarchical Softmax solves this problem by doing computations on word parts and reusing the results. Negative Sampling is an alternative: it selects a few negative samples and computes the loss only over these and the actual output words rather than the full vocabulary. Both of these simplify computation without much loss of accuracy.
Sebastian Ruder gives a detailed explanation of different softmax approximation techniques.
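For concreteness, here is a hedged NumPy sketch of the skip-gram negative-sampling (SGNS) objective for a single word pair, using input and output vectors as in the formula above; the vectors, indices and negative samples are made up, and drawing negatives from a smoothed unigram distribution is omitted.

```python
# SGNS loss for one (input, output) pair plus k negative samples:
# maximize log sigma(v'_o . v_i) + sum_k log sigma(-v'_k . v_i).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v, v_prime, w_input, w_output, negatives):
    pos = np.log(sigmoid(v_prime[w_output] @ v[w_input]))
    neg = sum(np.log(sigmoid(-v_prime[w_k] @ v[w_input])) for w_k in negatives)
    return -(pos + neg)   # minimized instead of the full K-way softmax

rng = np.random.default_rng(0)
K, D = 1000, 50                                  # vocabulary size, embedding dimension
v = rng.normal(scale=0.1, size=(K, D))           # input word vectors
v_prime = rng.normal(scale=0.1, size=(K, D))     # output (context) word vectors
print(sgns_loss(v, v_prime, w_input=3, w_output=17, negatives=[5, 42, 99, 256, 734]))
```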
What are other improvements to word2vec? The word2vec implementation can use a dynamic window size, uniformly sampled in the range [1, k]. This has the effect of giving more weight to closer words. Smaller window sizes yield embeddings where similar words are interchangeable; larger window sizes yield embeddings where similar words are topically related.
Word2vec can also ignore rare words. In fact, rare words are discarded before the context is set, which increases the effective window size for some words. In addition, we can subsample frequent words, on the insight that frequent words are less informative. The net effect is that words that are far apart but topically similar can still be captured in the embeddings.
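As a sketch of frequent-word subsampling, the snippet below implements the keep probability sqrt(t/f(w)) described by Mikolov et al. (2013b), where f(w) is the word's corpus frequency and t is a small threshold; the example frequencies are made up.

```python
# Each occurrence of word w is kept with probability sqrt(t / f(w)),
# capped at 1, so very frequent words are aggressively discarded.
import math

def keep_probability(freq, t=1e-5):
    return min(1.0, math.sqrt(t / freq))

for word, freq in [("the", 0.05), ("learning", 1e-4), ("word2vec", 1e-6)]:
    print(word, round(keep_probability(freq), 4))
# "the" is kept only ~1.4% of the time; rare words are always kept
```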
What are some tips for those trying to use word2vec? Developers can read sample TensorFlow code for the CBOW model, sample NumPy code, or sample Gensim code.
Designed by Xin Rong, wevi is a useful tool to visualize how word2vec learns word embeddings.
Sebastian Ruder gives a number of tips. Use Skip-Gram Negative Sampling (SGNS) as a baseline. Use many negative samples for better results. Use context distribution smoothing before selecting negative samples so that frequent words are not sampled quite so frequently. SGNS is a better technique than CBOW.
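As a starting point, here is a minimal Gensim example (assuming the Gensim 4.x API) that trains skip-gram with negative sampling on a toy corpus and queries nearest neighbours; real use needs a much larger corpus and tuned hyperparameters.

```python
from gensim.models import Word2Vec

sentences = [
    ["word2vec", "learns", "word", "embeddings"],
    ["embeddings", "capture", "word", "similarity"],
]
model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # maximum context window (sampled dynamically)
    min_count=1,       # keep rare words in this toy corpus
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # number of negative samples
)
print(model.wv["word2vec"][:5])               # first few dimensions of one embedding
print(model.wv.most_similar("word", topn=2))  # nearest neighbours by cosine similarity
```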
Milestones
Morin and Bengio come up with the idea of hierarchical softmax. A word is modelled as a composition of inner units, which are arranged as a binary tree. Given a vocabulary of V words, the probability of an output word is computed from softmax computations over the inner units on the path from the root of the tree to that word. This reduces complexity from O(V) to O(log(V)). This idea becomes important later in word2vec models. In 2009, Mnih and Hinton explore different ways to construct the tree.
Gutmann and Hyvärinen introduce Noise Contrastive Estimation (NCE) as an alternative to hierarchical softmax. The basic idea is that a good model can differentiate data from noise using logistic regression. Mnih and Teh later apply NCE to language modelling. This is similar to the hinge loss proposed by Collobert and Weston in 2008 to rank data above noise.
2013
At Google, Mikolov et al. develop word2vec, although this name refers to a software implementation rather than the models. They propose two models: continuous bag-of-words and continuous skip-gram. They improve on earlier state-of-the-art models by removing the non-linear hidden layer, making these log-linear models. They also make use of hierarchical softmax, with a Huffman binary tree representing the vocabulary, which they note speeds up evaluation by about 2X.
2013
Mikolov et al. improve on their earlier models by proposing negative sampling, a simplification of NCE. This simplification is possible because NCE approximately maximizes the log probability of the softmax, whereas word2vec is mainly interested in good word embeddings. Negative sampling is simpler and faster than hierarchical softmax. For small datasets, 5-20 negative samples may be required. For large datasets, 2-5 negative samples may be enough.
2019
Inspired by word2vec, some researchers produce code embeddings, vector representations of snippets of software code. Called code2vec, this work could enable us to apply neural networks to programming tasks such as automated code reviews and API discovery. This is just one example of many advances due to word2vec. Another example is doc2vec from 2014.
2019
Word2vec training is sequential due to strong dependencies across word-context pairs. Researchers show how word2vec can be trained on a GPU cluster by reducing the dependency within a large training batch. Without loss of accuracy, they achieve a 7.5 times speedup using 16 GPUs. They also note that with the Chainer framework, it's easy to implement CNN-based subword-level models.
References
- Alammar, Jay. 2019. "The Illustrated Word2vec." March 27. Accessed 2019-10-07.
- Alon, Uri, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. "code2vec: Learning Distributed Representations of Code." Proc. ACM Program. Lang., vol. 3, no. POPL, Article 40, January. Accessed 2019-10-07.
- Goldberg, Yoav and Omer Levy. 2014. "word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method." arXiv, v1, February 15. Accessed 2019-10-07.
- Google Developers. 2019. "Embeddings: Obtaining Embeddings." Machine Learning Crash Course, Google. Accessed 2019-10-07.
- Google Developers. 2019b. "Multi-Class Neural Networks: Softmax." Machine Learning Crash Course, Google. Accessed 2019-10-07.
- Li, Bofang, Aleksandr Drozd, Yuhe Guo, Tao Liu, Satoshi Matsuoka, and Xiaoyong Du. 2019. "Scaling Word2Vec on Big Corpus." Data Science and Engineering, vol. 4, no. 2, pp. 157-175, June. Accessed 2019-10-07.
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. "Efficient Estimation of Word Representations in Vector Space." arXiv, v3, September 07. Accessed 2019-10-07.
- Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. "Distributed Representations of Words and Phrases and their Compositionality." arXiv, v1, October 16. Accessed 2019-09-30.
- Parellada, Antoni. 2017. "Derivative of Softmax with respect to weights." CrossValidated, StackExchange, March 16. Accessed 2019-10-07.
- Rong, Xin. 2015. "wevi: Word Embedding Visual Inspector." University of Michigan. Accessed 2019-10-07.
- Rong, Xin. 2016. "word2vec Parameter Learning Explained." arXiv, v3, January 30. Accessed 2019-10-07.
- Ruder, Sebastian. 2016a. "On word embeddings - Part 1." April 11. Accessed 2019-10-07.
- Ruder, Sebastian. 2016b. "On word embeddings - Part 2: Approximating the Softmax." June 13. Accessed 2019-10-07.
- Ruder, Sebastian. 2016c. "On word embeddings - Part 3: The secret ingredients of word2vec." September 24. Accessed 2019-10-07.
- Ruder, Sebastian. 2018. "A Review of the Recent History of Natural Language Processing." October 01. Accessed 2019-09-26.
- Udacity. 2016. "Word2Vec Details." Udacity, on YouTube, June 06. Accessed 2019-10-07.
Further Reading
- Colyer, Adrian. 2016. "The amazing power of word vectors." Blog, The Morning Paper, April 21. Accessed 2019-10-07.
- Alammar, Jay. 2019. "The Illustrated Word2vec." March 27. Accessed 2019-10-07.
- Lakhey, Munesh. 2019. "Word2Vec - Negative Sampling made easy." mc.ai, May 27. Accessed 2019-10-07.
- Li, Zhi. 2019. "A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model." Towards Data Science, via Medium, May 31. Accessed 2019-10-07.
- TensorFlow. 2019. "Word embeddings." Tutorials, October 04. Accessed 2019-10-07.
- Chia, Derek. 2018. "An implementation guide to Word2Vec using NumPy and Google Sheets." Towards Data Science, via Medium, December 06. Accessed 2019-10-07.
See Also
- Word Embedding
- GloVe
- Singular Value Decomposition
- Neural Networks for NLP
- Natural Language Processing
- Long Short-Term Memory