• Books on Wikipedia clustered by genre in just two dimensions. Source: Koehrsen 2018.
• Neural network with word vector C(i) for ith word. Source: Bengio et al. 2003, fig. 1.
• Words to features to word vectors via lookup tables. Source: Collobert and Weston 2008, fig. 1.
• Aligning word embeddings of English and Italian to enable language translation. Source: Conneau et al. 2018, fig. 1.
• Word embeddings reduce the number of dimensions. Source: Goldberg 2015, fig. 1.
• Similar words are near one another in the vector space. Source: Lynn 2018.
• Word embeddings capture useful relationships. Source: Lynn 2018.

# Word Embedding

Created by arvindpdmn on 2019-09-28 10:49:16. Last updated by arvindpdmn on 2019-10-12 09:15:26.

## Summary

A word embedding is a vector representation of a word, with the vector containing real numbers. Since languages typically contain at least tens of thousands of words, simple binary word vectors can become impractical due to the high number of dimensions. Word embeddings solve this problem by providing dense representations of words in a low-dimensional vector space.

Since the mid-2010s, word embeddings have been applied to neural network-based NLP tasks. Among the well-known embeddings are word2vec (Google), GloVe (Stanford) and FastText (Facebook).

## Milestones

1950

In contrast to the formal linguistics of Noam Chomsky, researchers in the 1950s explore the idea that context can be useful for linguistic representation. This is based on structuralist linguistics. Distributional Hypothesis by Zellig Harris states that word meanings are associated with context. Another linguist John Firth states,

"You shall know a word by the company it keeps!"

1960

Early attempts are made in the 1960s to construct features to represent semantic similarities. Hand-crafted features are used. Charles Osgood's semantic differential is an example.

1990

Deerwester et al. note that words and the documents in which they occur have a semantic structure. They exploit this structure for information retrieval to match documents based on concepts rather than keywords. They map words to documents as a matrix of word counts, giving a sparse representation. They then reduce this matrix to at most 100 dimensions using Singular Value Decomposition (SVD). They coin the term Latent Semantic Indexing (LSI).

2003

Bengio et al. propose a language model based on neural networks, though they're not the first ones to do so. They use a feed-forward NN with one hidden layer. Words are represented as feature vectors. The model learns both the vectors and the joint probability function of word sequences. However, they don't use the term "word embeddings". Instead, they use the term distributed representation of words. Note that here we're interested in similar words whereas LSI is about similar documents due to its application to information retrieval.

2008

Collobert and Weston show the usefulness of pretrained word embeddings. Using such word embeddings, they show that a number of downstream NLP tasks can be learned by a neural network. They consider both syntactic tasks (POS tagging, chunking, parsing) and semantic tasks (named entity recognition, semantic role labelling, word sense disambiguation). In their approach, a word is decomposed into features and then converted to vectors using lookup tables.

2013

At Google, Mikolov et al. develop word2vec, which helps in learning standalone word embeddings from a text corpus. Efficiency comes from removing the hidden layer and approximating the objective. Word2vec enables large-scale training. Embeddings from the skip-gram model are shown to give state-of-the-art results for sentence completion, analogy and sentiment analysis.

2014

Stanford researchers release GloVe word embeddings. These have vectors of 25-300 dimensions learned from up to 840 billion tokens.

2018

Conneau et al. apply word embeddings to language translation by aligning monolingual word embedding spaces in an unsupervised way. They don't require parallel corpora or character-level information. This can therefore benefit low-resource languages. They achieve better results as compared to supervised methods. Earlier in 2016, Sebastian Ruder published a useful survey of many cross-lingual word embeddings.

## Discussion

• Why do we need word embeddings?

Consider an example where we have to encode "the dog DET", which describes the word 'dog', its previous word 'the', and the part of speech determiner (DET). If we represent every word and every part of speech in its own dimension, we would require a high-dimensional vector since our vocabulary will have lots of words. The vector will mostly be zeros except in three places that represent 'dog', 'the' and 'DET'. Called One-Hot Encoding, this is a sparse representation.

Instead, word embeddings give a dense representation in a lower-dimensional space. Each entity gets a unique representation in this vector space. As shown in the figure, both words have six dimensions each and the part of speech has four dimensions. The entire vector representation is now only 16 dimensions. This makes it practical for further processing.
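
The contrast between the two encodings can be sketched in a few lines of Python. This is a minimal illustration, not taken from the cited figure: the toy vocabulary, the lookup tables `word_table` and `pos_table`, and all numeric values are assumptions chosen only to reproduce the 6 + 6 + 4 = 16 layout described above.

```python
# Toy vocabulary; the words and tags are illustrative assumptions.
vocab = ["the", "dog", "cat", "runs", "DET", "NOUN", "VERB"]
index = {token: i for i, token in enumerate(vocab)}

def one_hot(tokens):
    """Sparse encoding: one dimension per vocabulary entry."""
    vec = [0] * len(vocab)
    for token in tokens:
        vec[index[token]] = 1
    return vec

# Hypothetical dense lookup tables: 6 dimensions per word, 4 per POS tag.
word_table = {
    "the": [0.1, -0.3, 0.7, 0.0, 0.2, -0.5],
    "dog": [0.6, 0.4, -0.1, 0.9, -0.2, 0.3],
}
pos_table = {"DET": [0.5, -0.4, 0.1, 0.8]}

def dense(current_word, previous_word, pos):
    """Dense encoding: concatenate the looked-up vectors."""
    return word_table[current_word] + word_table[previous_word] + pos_table[pos]

sparse_vec = one_hot(["dog", "the", "DET"])
dense_vec = dense("dog", "the", "DET")
print(len(sparse_vec), sum(sparse_vec))  # length grows with the vocabulary
print(len(dense_vec))                    # 6 + 6 + 4 dimensions
```

With a realistic vocabulary the sparse vector would have tens of thousands of entries, while the dense vector would stay at 16.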

More importantly, word embeddings capture similarities. For example, even if the word 'cat' is not seen during training, its embedding would be similar to that of 'dog'. Likewise, different tenses of the same verb are correlated.

• Is word embedding related to distributional semantics?

Yes. The term "word embedding" has been popularized by the deep learning community. In computational linguistics, the preferred term is Distributional Semantic Model (DSM), which comes from the theory of distributional semantics. Other equivalent terms include distributed representation, semantic vector space or word space.

Essentially, words are not represented as a single number or symbol. Rather, the representation is distributed in a vector space of many dimensions. The notion of semantics emerges because two words that are close to each other in the vector space are somehow semantically related. Similar words form clusters in the vector space.
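
Closeness in the vector space is commonly measured with cosine similarity. The sketch below assumes three made-up 3-dimensional vectors; real embeddings would come from a trained model and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen so that 'dog' and 'cat' point the same way.
vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_dog_cat = cosine_similarity(vectors["dog"], vectors["cat"])
sim_dog_car = cosine_similarity(vectors["dog"], vectors["car"])
print(round(sim_dog_cat, 3), round(sim_dog_car, 3))
```

In this toy space 'dog' scores much higher against 'cat' than against 'car', which is the clustering behaviour described above.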

• How can we extract semantic relationships captured within word embeddings?

Word embeddings are produced in an unsupervised manner. We don't inform the model anything about syntactic or semantic relationships among words. Yet, word embeddings seem to capture these relationships. For example, country names and their capital cities form a relationship. Relations due to gender or verb tense of words are other examples.

To see this in practice, consider the following vector equations:

$$\begin{aligned}\text{king} - \text{man} + \text{woman} &= \text{queen}\\ \text{Paris} - \text{France} + \text{Germany} &= \text{Berlin}\end{aligned}$$

The relationship between 'king' and 'man' is the same as that between 'queen' and 'woman'. This is captured in the vector space. This means that given the word vectors for Paris, France and Germany, we can find the capital of Germany. The term word analogy is often used to refer to this phenomenon. In practice the equality is approximate: the result of the arithmetic is a point whose nearest word vector is the answer.
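
The analogy lookup can be sketched as vector arithmetic followed by a nearest-neighbour search. The 2-dimensional vectors below are assumptions constructed so that one axis loosely encodes royalty and the other gender; the helper `analogy` is a hypothetical name, not a library function.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy embeddings: [royalty, gender].
vectors = {
    "king":   [0.9, 0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1, 0.8],
    "woman":  [0.1, -0.8],
    "prince": [0.8, 0.7],
    "apple":  [-0.7, 0.1],
}

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

Excluding the three input words from the candidates is a common convention, since otherwise one of them is often the nearest neighbour of the target.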

• What are some applications of word embeddings?

Word embeddings have become useful in many downstream NLP tasks. Word embeddings along with neural networks have been applied successfully for text classification, thereby improving customer service, spam detection, and document classification. Machine translations have improved. Analyzing survey responses or verbatim comments from customers are specific examples.

Word embeddings help in adapting a model from one domain to another, such as from legal documents to news articles. In general, this is called domain adaptation, which is useful for machine translation and transfer learning. In addition, pretrained word vectors can be adapted to domains where large training datasets are not available.

In recommendation systems, such as suggesting a playlist of songs, word embeddings can figure out what songs go well together in a particular context.

In search and information retrieval applications, word embeddings have been shown to be insensitive to spelling errors and exact keyword matches are not required. Even for words not seen during training, machine learning models work well provided such words are in the vector space. Thus, word embeddings are being preferred over older approaches such as TF-IDF or bag-of-words.

• What traits do we expect from a good word embedding?

Different models give different word embeddings. A good representation should aim for the following:

• Non-conflation: A word can occur in different contexts giving rise to variants (tense, plural, etc.). Embedding should represent these differences and not conflate them.
• Unambiguous: All meanings of the word should be represented. For example, for the word 'bow', the difference between "the bow of a ship" and "bow and arrows" should be represented.
• Multifaceted: Words have multiple facets: phonetic, morphological, syntactic, etc. Representation should change when tense changes or a prefix is added.
• Reliable: During training, word vectors are randomly initialized. This will lead to different word embeddings from the same dataset. In any case, the final output should be reliable and show consistent performance.
• Good Geometry: There should be a good spread of words in the vector space. In general, rare words should cluster around frequent words. Frequent unrelated words should be spread out.

• Which are the well-known word embeddings?

One of the first word embeddings is the Neural Network Language Model (NNLM) in which word embeddings are learnt jointly with the language model. Embeddings can also be learnt using Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA).

NNLM has high complexity due to non-linear hidden layers. A tradeoff is to first learn the word vectors using a neural network with a single hidden layer, which is then used to train the NNLM. Other log-linear models are Continuous Bag-of-Words (CBOW) and Continuous Skip-gram. An improved version of the latter is Skip-gram with Negative Sampling (SGNS). These are part of the word2vec implementation.
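
The difference between the two log-linear models lies in how training examples are formed from a sliding context window. The sketch below only generates those examples; it does not train anything. The function names and the sample sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word predicts each word in its context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words jointly predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)
print(pairs[:3])    # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(examples[1])  # (['the', 'brown'], 'quick')
```

Negative sampling, as in SGNS, would additionally pair each center word with a few randomly drawn words labelled as non-context; that step is omitted here.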

CBOW and Skip-gram models use only local information. Global Vectors (GloVe) is an approach that considers global statistical information as well. Word-to-word co-occurrence counts are used. GloVe combines LSA and word2vec.
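
Word-to-word co-occurrence counts are the raw statistic that such a global approach starts from. The following is a simplified sketch: it counts every pair within a fixed window with equal weight and does not reproduce GloVe's own weighting or training objective. The two-sentence corpus is an assumption.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` positions of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1  # keep the table symmetric
    return counts

corpus = [
    "ice is cold and solid".split(),
    "steam is hot and gas".split(),
]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("ice", "cold")], counts[("steam", "hot")], counts[("ice", "hot")])  # 1 1 0
```

Unlike a sliding-window predictor that sees one context at a time, these counts are accumulated over the whole corpus before any vectors are learned.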

Rare words can be poorly estimated. FastText overcomes this by using subword information.
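
Subword information here means character n-grams. A minimal sketch of extracting them is shown below, assuming the boundary markers `<` and `>` and an n-gram range of 3 to 6 characters; the function name `char_ngrams` is hypothetical. In a subword model, a word's vector is then built from the vectors of its n-grams.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word wrapped in boundary markers."""
    marked = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# A misspelled or unseen word still shares n-grams with a known word.
shared = set(char_ngrams("venice")) & set(char_ngrams("venicia"))
print(sorted(shared))
```

Because rare or misspelled words share many n-grams with frequent words, they can still receive a reasonable vector.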

Other models include ngram2vec and dict2vec. Embeddings from Language Models (ELMo) is a representation that captures sentence-level information. Following ELMo, BERT and OpenAI GPT are two pretrained models that have proven effective for other NLP tasks.

• What's the role of the embedding layer and the softmax layer?

The general architecture of word embeddings using neural networks involves the following:

• Embedding Layer: Generates word embeddings from an index vector and a word embedding matrix.
• Hidden Layers: These produce intermediate representations of the input. LSTMs could be used here. In word2vec, there are no hidden layers.
• Softmax Layer: The final layer that gives the distribution over all words in the vocabulary. This is the most computationally expensive layer and much work has gone into simplifying this. Two broad categories are softmax-based approaches and sampling-based approaches.
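
The layers listed above can be sketched with plain Python lists. This is an illustration under stated assumptions: the 3-word vocabulary, the matrix values and the function names are all made up, there is no hidden layer (as in word2vec), and no training takes place.

```python
import math

# Hypothetical embedding matrix: one row per vocabulary word, 3 dimensions each.
vocab = ["the", "dog", "cat"]
embedding_matrix = [
    [0.2, -0.1, 0.4],
    [0.7, 0.3, -0.2],
    [0.6, 0.4, -0.1],
]
# Hypothetical output weights: one row of 3 weights per vocabulary word.
output_weights = [
    [0.1, 0.5, -0.3],
    [0.4, -0.2, 0.6],
    [0.3, 0.1, 0.2],
]

def embed(word):
    """Embedding layer: a row lookup, equivalent to a one-hot vector times the matrix."""
    return embedding_matrix[vocab.index(word)]

def softmax(scores):
    """Turn raw scores into a probability distribution over the vocabulary."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(word):
    """Embed the input word, score every vocabulary word, then normalize."""
    hidden = embed(word)
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    return softmax(scores)

probs = predict("dog")
print(dict(zip(vocab, probs)))
```

The scoring loop touches every row of `output_weights`, so its cost grows with the vocabulary size. This is why the softmax layer is the expensive part.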

• What's the process for generating word embeddings?

We can use a neural network on a supervised task to learn word embeddings. The embeddings are weights that are tuned to minimize the loss on the task. For example, given 50K words from a collection of movie reviews, we might obtain a 100-dimensional embedding to predict sentiment. Words signifying positive sentiment will be closer in the vector space. Since embeddings are tuned for a task, selecting the right task is important.

Word embeddings can be learnt from a standalone model and then applied to different tasks. Or it could be learnt jointly with a task-specific model. For good embeddings, we would need to train on millions or even billions of words. An easier approach is to use pretrained word embeddings (word2vec or GloVe). They can be used "as is" if they suit the task at hand. If not, they can be updated while training your own model.
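
Pretrained vectors are often distributed as plain text, with one word per line followed by its space-separated components. The sketch below parses that layout from an in-memory sample. The sample values are made up for illustration and are not real pretrained vectors; `load_vectors` and `lookup` are hypothetical helper names.

```python
import io

# Made-up sample in a "word v1 v2 v3" text layout; not real pretrained values.
SAMPLE = """the 0.418 0.24968 -0.41242
dog 0.11008 -0.38781 -0.5762
cat 0.23088 0.28283 0.6318
"""

def load_vectors(lines):
    """Parse lines of 'word v1 v2 ...' into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def lookup(word, vectors, dim):
    """Use the pretrained vector "as is", or a zero vector for unknown words."""
    return vectors.get(word, [0.0] * dim)

pretrained = load_vectors(io.StringIO(SAMPLE))
print(lookup("dog", pretrained, dim=3))
print(lookup("unicorn", pretrained, dim=3))  # [0.0, 0.0, 0.0]
```

To use a real file, the same function could be given an open file handle instead of the `io.StringIO` object. Updating the vectors during training would require loading them into a trainable embedding layer instead of a read-only dictionary.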

In biomedical NLP, it was noted that bigger corpora don't necessarily result in better embeddings. Sometimes intrinsic and extrinsic evaluation methods don't agree well. Hyperparameters that we can tune include negative sampling size, context window size, and vector dimension. Gains plateau at about 200 dimensions.

• Could you share some practical tips for applying word embeddings?

Predictive neural network models (word2vec) and count-based distributional semantic models (GloVe) are different means to achieve the same goal. There's no qualitative difference. Word2vec has proven to be robust across a range of semantic tasks.

For syntactic tasks such as named entity recognition or POS tagging, a small dimensionality is adequate. For semantic tasks, higher dimensionality may prove more effective. It's also been noted that pretrained embeddings give better results. It's been commented that 8 dimensions might suffice for small datasets and as many as 1024 dimensions for large datasets.

In 2018, selecting the optimal dimensionality was still considered an open problem. With too few dimensions, embeddings are not expressive. With too many dimensions, embeddings overfit and the model becomes complex. Commonly, 300 dimensions are used.

• What are some challenges with word embeddings?

Models such as ELMo and BERT capture surrounding context within word embeddings. However, word embeddings don't capture "context of situation" the way linguist J.R. Firth defined it in the 1960s. To achieve true NLU, we would have to combine the statistical approach of word embeddings with the older linguistic approach.

More generally, it's been said that deep learning isn't sample efficient. Perhaps we need something better than deep learning to tackle language with compositional properties.

Word embeddings don't capture phrases and sentences. For example, it would be misleading to combine word vectors to represent "Boston Globe". Embeddings for "good" and "bad" might be similar, causing problems for sentiment analysis.
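
One way to see the limitation is to combine word vectors by averaging, a common baseline for phrases. The sketch below uses made-up 2-dimensional vectors. It illustrates a related weakness, the loss of word order, and not the "Boston Globe" case itself.

```python
def average(vector_list):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vector_list)
    return [sum(column) / n for column in zip(*vector_list)]

# Hypothetical toy embeddings.
toy = {
    "dog":   [0.75, 0.25],
    "bites": [0.25, 0.5],
    "man":   [0.5, 0.5],
}

phrase_a = average([toy[w] for w in "dog bites man".split()])
phrase_b = average([toy[w] for w in "man bites dog".split()])
print(phrase_a == phrase_b)  # True: two different sentences, one vector
```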

Word embeddings don't capture some linguistic traits. For example, vectors for 'house' and 'home' may be similar but vectors of 'like' and 'love' are not. In general, when a word has multiple meanings (as with homographs or polysemy), its vector is an average over those meanings. One solution is to consider both the word and its part of speech. Inflections also cause problems. For example, 'find' and 'locate' are close to each other but not 'found' and 'located'. Lemmatization can help before training the word vectors.

• What software tools are available for word embeddings?

Both word2vec and GloVe implementations are available online. In some frameworks such as Spark MLlib or DL4J, word2vec is readily available.

Some frameworks that support word embeddings are S-Space and SemanticVectors (Java); Gensim, PyDSM and DISSECT (Python).

Deeplearning4j provides the SequenceVectors class, an abstraction above word vectors. This allows us to extract features from any data that can be described as a sequence, be it transactions, proteins or social media profiles.

A tutorial explaining word embeddings in TensorFlow is available.

You can also download pretrained word embeddings. Note that many use lemmatization while learning word embeddings.

## Sample Code

# Source: https://nlpforhackers.io/word-embeddings/
# Accessed: 2019-09-29

# -------------------------------------------------------
# Example using gensim word2vec
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus

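# Train on the text8 corpus, assumed here to be downloaded to ./text8 (the path is an assumption).
# We take only words that appear at least 150 times for doing a visualization later
w2v_model2 = Word2Vec(Text8Corpus('text8'), min_count=150)
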
# Getting most similar vectors
print(w2v_model2.wv.most_similar('paris'))
# [('louvre', 0.7243613004684448),
#  ('venice', 0.7047281265258789),
#  ('vienna', 0.7043783068656921),
#  ('montparnasse', 0.7016372680664062),
#  ('le', 0.6870340704917908),
#  ('sur', 0.6818796396255493),
#  ('chapelle', 0.6787714958190918),
#  ('rodin', 0.6766049265861511),
#  ('bologna', 0.6761612892150879),
#  ('munich', 0.6749240159988403)]

# "King" - "Man" + "Woman" == "Queen"
print(w2v_model2.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))
# [('queen', 0.6777610778808594), ('throne', 0.6143913269042969), ('elizabeth', 0.593910813331604)]

# "Father" - "Boy" + "Girl" == "Mother"
print(w2v_model2.wv.most_similar(positive=['girl', 'father'], negative=['boy'], topn=3))
# [('mother', 0.7972878813743591), ('wife', 0.7469687461853027), ('grandmother', 0.7419005632400513)]

# "Paris" - "France" + "Italy" = "Rome"
print(w2v_model2.wv.most_similar(positive=['paris', 'italy'], negative=['france'], topn=3))
# [('venice', 0.7461134195327759), ('vienna', 0.7134193778038025), ('florence', 0.7019181251525879)]

# -------------------------------------------------------
# Example using gensim FastText
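from gensim.models import FastText

# Train FastText on the same corpus (same assumed ./text8 path and frequency threshold)
ft_model = FastText(sentences=Text8Corpus('text8'), min_count=150)
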
# Getting most similar vectors
print(ft_model.wv.most_similar('paris'))
# [('vienna', 0.7305958271026611),
#  ('venice', 0.7068097591400146),
#  ('florence', 0.6955196261405945),
#  ('brussels', 0.682724118232727),
#  ('leipzig', 0.6486526131629944),
#  ('francesco', 0.6461360454559326),
#  ('amsterdam', 0.6385960578918457),
#  ('france', 0.6323560476303101),
#  ('cemetery', 0.6285153031349182),
#  ('hamburg', 0.6284394264221191)]

# "King" - "Man" + "Woman" == "Queen"
print(ft_model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))
# [('emperor', 0.68890380859375), ('queen', 0.6823415160179138), ('princess', 0.6764928102493286)]

# "Father" - "Boy" + "Girl" == "Mother"
print(ft_model.wv.most_similar(positive=['girl', 'father'], negative=['boy'], topn=3))
# [('mother', 0.7996115684509277), ('grandfather', 0.7629683613777161), ('wife', 0.7478234767913818)]

# "Paris" - "France" + "Italy" = "Rome"
print(ft_model.wv.most_similar(positive=['paris', 'italy'], negative=['france'], topn=3))
# [('vienna', 0.6932680606842041), ('venice', 0.652579128742218), ('moscow', 0.6098273992538452)]

# Misspell something similar to Venice and we still get a vector ...
print(ft_model.wv['veniciaaaaaa'])
# [-6.31419778e-01  9.52705503e-01  1.35608479e-01  4.74076539e-01 ...

# Let's see if indeed it understood we're trying to say Venice
print(ft_model.wv.most_similar('veniciaaaaaa', topn=3))
# [('venice', 0.7861752510070801), ('brussels', 0.771102786064148), ('francesco', 0.7474006414413452)]

# What?
print(ft_model.wv.most_similar('whaaaa', topn=3))
# [('what', 0.8659393787384033), ('whatever', 0.7308462858200073), ('why', 0.6594464778900146)]

## Cite As

Devopedia. 2019. "Word Embedding." Version 5, October 12. Accessed 2020-09-18. https://devopedia.org/word-embedding