Word Sense Disambiguation

Different senses of four words. Source: Liu et al. 2004, table 3.

Many words have multiple meanings or senses. For example, the word bass has at least eight different senses. The correct sense can be established by looking at the context of use. This is easy for humans because we know from experience how the world works. Word Sense Disambiguation (WSD) is about enabling computers to do the same.

WSD involves the use of syntax, semantics and word meanings in context. It's therefore a part of computational lexical semantics. WSD is considered an AI-complete problem, which means that it's as hard as the most difficult problems in AI.

Both supervised and unsupervised algorithms are available. The term Word Sense Induction (WSI) is sometimes used for unsupervised algorithms. In the 2010s, word embeddings became popular. Such embeddings, used with neural network models, represent the current state of the art for WSD.

Discussion

  • Could you explain Word Sense Disambiguation with an example?
    In English-to-Spanish machine translation, we need the correct sense of 'bass'. Source: Jurafsky 2015, slide 9.

    Consider the sentences "I can hear bass sounds" and "They liked grilled bass". The word 'bass' means low-frequency tones in the first sentence and a type of fish in the second. The word alone is not sufficient to determine the correct sense. When we consider the word in the context of surrounding words, the sense becomes clear. Using WSD, the second sentence can be sense-tagged as "They like/ENJOY grilled/COOKED bass/FISH".

    Context comes from the words 'sounds' and 'grilled'. It also helps to know that these collocated words are a noun and an adjective respectively, one coming after 'bass' and the other before it. Such syntactic relations give additional information. In general, pre-processing steps such as POS tagging and parsing help WSD.

    A difficult example is "the astronomer married the star".

    There's no universally agreed set of senses for a word. Senses can also vary with domains. WSD also relies on knowledge: without it, determining the correct sense is impossible. However, knowledge resources are expensive to build. Back in the 1990s, this was seen as the knowledge acquisition bottleneck.

  • Could you mention some applications of WSD?
    WSD applied to image retrieval for the word 'cup'. Source: Saenko and Darrell 2009, fig. 1.

    WSD is usually seen as an "intermediate task", as a means to an end. Obtaining the correct word sense is helpful in many NLP applications. Exactly how WSD is used is application specific.

    Here are some examples:

    • Machine Translation: An English translation of the French word 'grille' can be railings, bar, grid, scale, schedule, etc. Correct word sense disambiguation is therefore necessary.
    • Information Retrieval: When searching for judicial references with the word 'court', we wish to avoid matches pertaining to royalty.
    • Thematic Analysis: Themes are identified based on word distribution but we include only words of the relevant sense.
    • Grammatical Analysis: In POS tagging or syntactic analysis, WSD is useful. In the French sentence "L'étagère plie sous les livres", livres refers to 'books' and not 'pounds'.
    • Speech Processing: WSD helps in obtaining the correct phonetization in speech synthesis.
    • Text Processing: For inserting diacritics, WSD helps in correcting the French word 'comte' to 'comté'. For case changes, WSD corrects 'HE READ THE TIMES' to 'He read the Times'. To "Wikify" online documents, WSD helps.
  • What are some essential terms to know about word senses?
    Example noun and verb senses in WordNet. Source: Jurafsky and Martin 2009a, fig. 19.2-19.3.

    Consider the word 'bank'. This can refer to a financial institution or a sloping mound. These two senses of the same word are unrelated but they look and sound the same. We call them homonyms. The sense relation is called homonymy. Typically, homonyms have different origins and different dictionary entries.

    A bank can also refer to the building that houses the financial institution. These senses are semantically related. The sense relation is called polysemy.

    Synonyms are different words with the same or nearly the same meaning. Antonyms are words with opposite meanings.

    Consider two words in which one is a subclass of the other, a type-of relation, such as mango and fruit. Mango is a hyponym of fruit. Fruit is a hypernym of mango.

    Consider two words that form a part-whole relation, such as wheel and car. Wheel is a meronym of car. Car is a holonym of wheel.
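
    These relations can be explored programmatically. Here's a minimal sketch using NLTK's WordNet interface (assuming the WordNet corpus has been downloaded); the exact synsets returned depend on the WordNet version.

```python
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Homonymy/polysemy: 'bank' has many synsets (senses)
for synset in wn.synsets('bank')[:3]:
    print(synset.name(), '-', synset.definition())

# Hypernymy/hyponymy: more general concepts for one 'mango' synset
mango = wn.synsets('mango')[0]
print(mango.hypernyms())

# Meronymy/holonymy: parts of a car (includes its wheels)
car = wn.synset('car.n.01')
print(car.part_meronyms())
```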

  • Which are the essential elements for doing WSD?
    A co-occurrence graph for the word 'bar'. Source: Navigli 2009, fig. 7.

    WSD requires two main sources of information: context and knowledge. Context is established from neighbouring words and the domain of discourse. Sense-tagged corpora provide knowledge, leading to data-driven or corpus-based WSD. The use of lexicons or encyclopaedias leads to knowledge-driven WSD.

    We also need an inventory of possible word senses. This can be enumerative, with WordNet being an example. Alternatively, a generative model underspecifies senses until context is considered: rules that capture regularities in sense creation are used to generate the senses.

    In the lexical sample task, WSD is applied to a sample of pre-selected words, for which a supervised ML approach based on a hand-labelled corpus is possible. In the all-words task, WSD is applied to all words, for which a supervised approach is not practical. Dictionary-based approaches or bootstrapping techniques are more suitable.

    A bag-of-words approach can be used to represent context. To preserve word order, collocational features can be used when forming feature vectors. Such a vector might include each word's root form and its POS. Syntactic relations, distance from the target word and selectional preferences are other approaches.
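
    As a concrete illustration, here's a minimal sketch of such feature extraction; the function name, window size and POS tags are illustrative choices, not from the source.

```python
def extract_features(tokens, pos_tags, target_index, window=2):
    """Build WSD features for the word at target_index.
    tokens and pos_tags are parallel lists for one sentence."""
    # Bag-of-words: unordered context words within the window
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    bow = {tokens[i].lower() for i in range(lo, hi) if i != target_index}

    # Collocational features: words and POS tags at fixed relative positions
    colloc = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        if 0 <= i < len(tokens):
            colloc[f'word_{offset:+d}'] = tokens[i].lower()
            colloc[f'pos_{offset:+d}'] = pos_tags[i]
    return bow, colloc

# "They liked grilled bass": the target word 'bass' is at index 3
tokens = ['They', 'liked', 'grilled', 'bass']
pos = ['PRP', 'VBD', 'JJ', 'NN']
print(extract_features(tokens, pos, 3))
```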

  • Could you describe some algorithms for WSD?
    Simplified Lesk Algorithm looks at overlapping words. Source: Jurafsky 2015, slide 32.

    A simple supervised approach is to use a naive Bayes classifier. We maximize the probability of a word sense given a feature vector. The problem is simplified by applying Bayes' rule and assuming that features are independent of one another. Another approach is a decision list classifier.
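
    The following is a minimal sketch of the naive Bayes decision rule for WSD, computed in log space with add-one smoothing; the toy training examples are made up purely for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (features, sense) pairs, where features is a list of strings."""
    sense_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, sense in examples:
        sense_counts[sense] += 1
        for f in features:
            feature_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feature_counts, vocab

def predict(features, sense_counts, feature_counts, vocab, alpha=1.0):
    """Pick argmax_s P(s) * prod_f P(f|s), with add-alpha smoothing, in log space."""
    total = sum(sense_counts.values())
    best_sense, best_score = None, float('-inf')
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denom = sum(feature_counts[sense].values()) + alpha * len(vocab)
        for f in features:
            score += math.log((feature_counts[sense][f] + alpha) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

train = [(['hear', 'sounds', 'low'], 'bass/TONE'),
         (['grilled', 'fish', 'dinner'], 'bass/FISH')]
model = train_naive_bayes(train)
print(predict(['grilled', 'lake'], *model))   # -> bass/FISH
```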

    The Corpus Lesk algorithm uses a sense-tagged corpus. We also use the definition or gloss of each sense from a dictionary. Examples from the corpus and the gloss become the signature of the sense. Then we compute the number of overlapping words between the signature and the context. Inverse Document Frequency (IDF) weighting is applied to discount function words (the, of, etc.). The simplified Lesk algorithm uses only the gloss for signature and doesn't use weights.
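
    Here's a minimal sketch of the simplified Lesk algorithm, assuming NLTK's WordNet supplies the glosses and example sentences; stop-word removal and morphology are ignored for brevity.

```python
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_sentence):
    """Pick the WordNet sense of `word` whose gloss and examples
    overlap most with the words of the context sentence."""
    context = set(w.lower() for w in context_sentence.split())
    best_sense, best_overlap = None, -1
    for synset in wn.synsets(word):
        signature = set(synset.definition().lower().split())
        for example in synset.examples():
            signature |= set(example.lower().split())
        overlap = len(signature & context)
        if overlap > best_overlap:
            best_sense, best_overlap = synset, overlap
    return best_sense

sense = simplified_lesk('bank', 'I deposited money at the bank branch')
print(sense, '-', sense.definition())
```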

    For evaluation, the most frequent sense is used as a baseline. Frequencies can be taken from a sense-tagged corpus such as SemCor. The Lesk algorithm is also a suitable baseline. Senseval and SemEval have standardized sense evaluation.

  • How are word embeddings relevant to the problem of WSD?
    F1 scores on different all-words WSD datasets. Source: Iacobacci et al. 2016, table 3.

    Schütze (1998) proposed the use of word vectors and context vectors. These are high-dimensional and often sparse vectors. To make them practical for computation, Singular Value Decomposition (SVD) reduces the number of dimensions. Within such a vector space model, Latent Semantic Analysis (LSA) helps to determine semantic relations. Word embeddings are a modern alternative.
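
    A minimal sketch of the dimensionality reduction step follows; the tiny co-occurrence counts are made up purely for illustration.

```python
import numpy as np

# Toy word-by-word co-occurrence matrix (rows and columns are words)
cooc = np.array([
    [0, 2, 1, 0],   # 'bass'
    [2, 0, 0, 3],   # 'fish'
    [1, 0, 0, 1],   # 'guitar'
    [0, 3, 1, 0],   # 'grilled'
], dtype=float)

# Truncated SVD keeps the top-k dimensions, giving dense low-dimensional vectors
U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]   # reduced word representations
print(word_vectors)
```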

    Word embeddings were proposed by Bengio et al. (2003). These are low-dimensional dense vectors that capture semantic information. However, words with multiple senses are reduced to a single vector. This is not directly useful for WSD. To overcome this, Trask et al. (2015) proposed sense2vec, where representations are of senses, not words. Sense2vec improved the accuracy of other NLP tasks such as named entity recognition and neural dependency parsing.

    Iacobacci et al. (2016) explored the direct use of word embeddings. Using the It Makes Sense (IMS) framework along with Word2vec, they improved F1 scores on various WSD datasets. Word embeddings of the target word and its surrounding words are combined using various strategies: concatenation, average, fractional decay, and exponential decay.
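
    Below is a minimal sketch of two such combination strategies (average and exponential decay), assuming pre-trained embeddings are available as a dict; the decay value and toy vectors are arbitrary choices.

```python
import numpy as np

def context_vector(vectors, context_words, strategy='average', decay=0.9):
    """Combine embeddings of words surrounding the target into one vector.
    `vectors` maps word -> np.ndarray; `context_words` are ordered
    nearest-to-farthest from the target word."""
    embs = [vectors[w] for w in context_words if w in vectors]
    if not embs:
        return None
    if strategy == 'average':
        return np.mean(embs, axis=0)
    if strategy == 'exp_decay':
        # Weight each word by decay^distance so nearer words count more
        weights = np.array([decay ** i for i in range(len(embs))])
        return np.average(embs, axis=0, weights=weights)
    raise ValueError(strategy)

# Toy 3-dimensional "embeddings" just to show the call
toy = {'grilled': np.array([0.9, 0.1, 0.0]),
       'dinner':  np.array([0.7, 0.2, 0.1]),
       'lake':    np.array([0.1, 0.8, 0.1])}
print(context_vector(toy, ['grilled', 'dinner', 'lake'], 'exp_decay'))
```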

  • What are some neural network approaches to WSD?
    Seq2seq architecture with multiple attentions. Source: Ahmed et al. 2018, fig. 3.

    Neural network approaches to WSD became popular in the 2010s. Wiriyathammabhum et al. (2012) applied Deep Belief Networks (DBN). They pre-trained the hidden layers using various knowledge sources, layer by layer. They then used a separate fine-tuning step for better discrimination.

    We lose sequential and syntactic information when averaging word vectors. Instead, Yuan et al. (2016) proposed a semi-supervised LSTM model with label propagation. To capture contexts from both sides, Kågebäck and Salomonsson (2016) applied Bidirectional LSTM.
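
    Here's a minimal PyTorch sketch of a BiLSTM sense classifier in this spirit: the sentence is encoded bidirectionally and the target word's contextualized state is classified into senses. Layer sizes and names are illustrative assumptions, not taken from the cited papers.

```python
import torch
import torch.nn as nn

class BiLSTMWSD(nn.Module):
    """Sketch: encode the sentence with a BiLSTM, then classify the
    sense of the target word from its contextualized hidden state."""
    def __init__(self, vocab_size, num_senses, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_senses)

    def forward(self, token_ids, target_index):
        # token_ids: (batch, seq_len); target_index: (batch,)
        states, _ = self.lstm(self.embed(token_ids))          # (batch, seq_len, 2*hidden)
        target_state = states[torch.arange(states.size(0)), target_index]
        return self.classifier(target_state)                  # sense logits

model = BiLSTMWSD(vocab_size=10000, num_senses=8)
logits = model(torch.randint(0, 10000, (1, 6)), torch.tensor([3]))
print(logits.shape)   # torch.Size([1, 8])
```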

    Many models consider only context. Knowledge sources such as WordNet are ignored. Gloss-Augmented WSD (GAS) considers both context and glosses (sense definitions) and uses BiLSTM.

    One attention-based approach is an encoder-decoder model with multiple attentions on different linguistic features. Another is GlossBERT, which uses BERT, encodes context-gloss pairs of all possible senses, and treats WSD as a sentence-pair classification problem.
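
    The following is a rough sketch of GlossBERT-style scoring with the Hugging Face transformers library. It loads plain bert-base-uncased, so the classification head here is untrained; GlossBERT fine-tunes such a model on context-gloss pairs first. The glosses and the choice of label index 1 as the "match" class are illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.eval()

context = "They liked grilled bass"
glosses = {  # candidate senses and illustrative dictionary glosses
    'bass/FISH': "the lean flesh of a saltwater fish",
    'bass/TONE': "the lowest part of the musical range",
}

scores = {}
with torch.no_grad():
    for sense, gloss in glosses.items():
        # Each candidate sense yields one (context, gloss) sentence pair
        inputs = tokenizer(context, gloss, return_tensors='pt', truncation=True)
        logits = model(**inputs).logits
        scores[sense] = torch.softmax(logits, dim=-1)[0, 1].item()  # score for "pair matches"

print(max(scores, key=scores.get))  # highest-scoring sense wins
```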

  • What are some resources to do WSD?
    It Makes Sense (IMS): a flexible framework for WSD. Source: Zhong and Ng 2010, fig. 1.

    A number of datasets and sense-annotated corpora are available to train WSD models: Senseval and SemEval tasks (all-words, lexical sample, WSI), AIDA CoNLL-YAGO, MASC, SemCor, and WebCAGe. Likewise, word sense inventories include WordNet, TWSI, Wiktionary, Wikipedia, FrameNet, OmegaWiki, VerbNet, and more. These are supported by the modular Java framework DKPro WSD.

    UKB is an open-source toolkit for knowledge-based WSD. Agirre et al. (2018) caution that it should be run with its recommended default parameters to get the best results.

    In January 2017, Google released word sense annotations on MASC and SemCor datasets. Senses are from New Oxford American Dictionary (NOAD). NOAD senses are also mapped to WordNet.

    ACLWiki has curated a list of useful WSD resources. This includes papers, inventories, annotated corpora and software.

    Ruder captures the current state of the art in WSD with links to relevant papers. Papers With Code also maintains a list of recent papers on WSD.

Milestones

Jul
1949

Warren Weaver considers the task of using computers to translate text from one language to another. He recognizes the importance of context and meaning. He makes references to cryptography and statistical semantic studies as possible approaches to obtaining the correct meaning of a word. Weaver also notes that a word mostly has only one meaning within a particular domain.

1953

Oswald and Lawson propose microglossaries for machine translation. These are glossaries assembled for a particular domain. Through the 1950s, researchers produce many such domain-specific glossaries to aid machine translation. For example, a microglossary for mathematics would define 'triangle' as a geometric shape and not as a musical instrument.

1955

Erwin Reifler defines what he calls semantic coincidences between a word and its context. He also notes that syntactic relations can be used to disambiguate. For example, the word 'kept' can have an object that's gerund (He kept eating), adjectival phrase (He kept calm), or noun phrase (He kept a record).

1957

Masterman makes use of synonyms, near synonyms and associated words from Roget's Thesaurus. She gives the example of "flowering plant". Using the thesaurus, we can determine that 'vegetable' is the only common sense for the words 'flowering' and 'plant'. This is therefore the correct sense of the word 'plant' in this context.

1965

Madhu and Lytle propose the use of what they call Figure of Merit. This is a probabilistic measure that's useful when grammatical structure alone is unable to disambiguate. They focus on scientific and engineering literature and identify ten groups. The group or context is determined using words with single meaning. Then the most probable meaning of words with multiple meanings is selected given the context. Paradoxically, this is also the time when interest in machine translation declines.

1970
An example semantic network from WordNet. Source: Navigli 2009, fig. 3.

In the 1970s, some notable approaches include Semantic Networks of Quillian and Simmons; Preferential Semantics of Wilks; word-based understanding of Riesbeck. Early semantic networks can be traced to the late 1950s.

1986

Lesk uses a machine-readable dictionary (MRD) for WSD. In general, the 1980s see large-scale lexical resources (such as WordNet) being built for automated knowledge extraction. This is also when the focus shifts from linguistic theories to empirical methods.

1990
Network topology for the word 'pen'. Source: Véronis and Ide 1990, fig. 2.

The use of neural networks had been suggested in the early 1980s but was limited to a few words and hand-coded. Véronis and Ide extend this idea by using the machine-readable Collins English Dictionary. A network is formed from dictionary entries and the words used to define them. Word nodes activate sense nodes. Feedback allows competing senses to inhibit one another.

1991
Aligning sentences between English and French to aid WSD. Source: Brown et al. 1991, fig. 1.

Brown et al. show that it's possible to disambiguate by aligning sentences in two languages. A word in one language might translate into different words in another language, each with a unique sense. In 1992, Gale et al. extend this idea using the Canadian Hansards (parliamentary debates), which are available in more than one language. This avoids an expensive hand-labelled corpus.

1995
Unsupervised WSD for the word 'plant'. Source: Yarowsky 1995, fig. 1-3.

Supervised WSD algorithms require sense-annotated corpora, which are expensive and laborious to create. Yarowsky proposes an unsupervised WSD algorithm. The algorithm uses two useful constraints: one sense per collocation and one sense per discourse. It's a bootstrapping procedure that starts from a small number of seed sense annotations. The algorithm then determines and iteratively improves the senses for other occurrences of the word. An example is to disambiguate 'plant', which can refer to plant life or a manufacturing plant.

1998
Word vector space and clustering for WSD. Source: Schütze 1998, fig. 2-3.

Schütze proposes a vector space approach to WSD via clustering. Senses are seen as clusters of similar contexts: an occurrence of a word is assigned to the cluster its context is closest to. Since the technique is unsupervised, senses are induced from a corpus. Word vectors are calculated from co-occurrences. Word vectors are sparse, but context vectors, formed by combining them, are dense. The dimensions of both are reduced using Singular Value Decomposition (SVD).

1999

Mihalcea and Moldovan make use of WordNet for WSD. They rank different senses using WordNet's semantic density for a word-pair and web mining for word pair cooccurrences. In 2004, Peter Turney also employs web mining to calculate cooccurrence probabilities that are used to generate semantic features for WSD.

2010

This decade sees an increasing use of word embeddings and neural network models for WSD. Some of these include Gloss-Augmented WSD (GAS), GlossBERT, and use of BiLSTM.

Mar
2018
Evolution of word senses for the word 'game'. Source: Ramiro et al. 2018, fig. 5.

Ramiro et al. study the evolution of word senses. They note that cognitive efficiency drives this evolution through a process called nearest-neighbour chaining. For new word senses, reuse of existing words (polysemy) is more common than new word forms.

References

  1. ACLWiki. 2014. "Word sense disambiguation resources." ACLWiki, December 12. Accessed 2019-12-26.
  2. Agirre, Eneko, Oier López de Lacalle, and Aitor Soroa. 2018. "The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD." Proceedings of Workshop for NLP Open Source Software (NLP-OSS), ACL, pp. 29-33, July. Accessed 2019-12-28.
  3. Ahmed, Mahtab, Muhammad Rifayat Samee, and Robert E. Mercer. 2018. "A Novel Neural Sequence Model with Multiple Attentions for Word Sense Disambiguation." arXiv, v1, September 4. Accessed 2019-12-28.
  4. Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1991. "Word sense disambiguation using statistical methods." Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 264-270. Accessed 2019-12-26.
  5. DKPro. 2015. "DKPro WSD - Welcome." V1.2.0, March 03. Accessed 2019-12-25.
  6. Evans, Colin and Dayu Yuan. 2017. "A Large Corpus for Supervised Word-Sense Disambiguation." Google AI Blog, January 18. Accessed 2019-12-25.
  7. Gale, William A., Kenneth W. Church, and David Yarowsky. 1992. "Using Bilingual Materials to Develop Word Sense Disambiguation Methods." Proceedings of International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 101-112. Accessed 2019-12-26.
  8. Hasa. 2016. "Difference Between Polysemy and Homonymy." Pediaa, July 7. Accessed 2019-12-26.
  9. Huang, Luyao, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. "GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge." arXiv, v3, November 9. Accessed 2019-12-28.
  10. Iacobacci, Ignacio, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. "Embeddings for Word Sense Disambiguation: An Evaluation Study." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, pp. 897-907, August. Accessed 2019-12-28.
  11. Ide, Nancy and Jean Véronis. 1998. "Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art." Computational Linguistics, vol. 24, no. 1, pp. 1-40, March. Accessed 2019-12-25.
  12. Jurafsky, Daniel. 2015. "Word Sense Disambiguation." Slides, Stanford University, August. Accessed 2019-12-25.
  13. Jurafsky, Daniel and James H. Martin. 2009a. "Lexical Semantics." Chapter 19 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-12-25.
  14. Jurafsky, Daniel and James H. Martin. 2009b. "Computational Lexical Semantics." Chapter 20 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-12-25.
  15. Liu, Hongfang, Virginia Teller, and Carol Friedman. 2004. "A Multi-aspect Comparison Study of Supervised Word Sense Disambiguation." J Am Med Inform Assoc, vol. 11, no. 4, pp. 320–331, Jul-Aug. Accessed 2019-12-26.
  16. Luo, Fuli, Tianyu Liu, Qiaolin Xia, Baobao Chang, and Zhifang Sui. 2018. "Incorporating Glosses into Neural Word Sense Disambiguation." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, pp. 2473-2482, July. Accessed 2019-12-28.
  17. Madhu, Swaminathan and Dean W. Lytle. 1965. "A Figure of Merit Technique for the Resolution of Non-Grammatical Ambiguity." Mechanical Translation, vol. 8, no. 2, pp. 9-13, February. Accessed 2019-12-26.
  18. Masterman, M.M. 1957. "The Thesaurus in Syntax and Semantics." Mechanical Translation, vol. 4, nos. 1 and 2, pp. 35-43, November. Accessed 2019-12-26.
  19. Mihalcea, Rada and Andras Csomai. 2007. "Wikify! Linking Documents to Encyclopedic Knowledge." CIKM'07, November 6-8. Accessed 2019-12-25.
  20. Mihalcea, Rada and Dan I. Moldovan. 1999. "A Method for Word Sense Disambiguation of Unrestricted Text." Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 152-158, June. Accessed 2019-12-26.
  21. Orkphol, Korawit and Wu Yang. 2019. "Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet." Future Internet, 11(5), 114. Accessed 2019-12-25.
  22. Ramiro, Christian, Mahesh Srinivasan, Barbara C. Malt, and Yang Xu. 2018. "Algorithms in the historical emergence of word senses." PNAS, 115(10), March 6. Accessed 2019-12-25.
  23. Ruder, Sebastian. 2019. "Word Sense Disambiguation." NLP-progress, October 24. Accessed 2019-12-25.
  24. Saenko, Kate and Trevor Darrell. 2009. "Filtering Abstract Senses From Image Search Results." Advances in Neural Information Processing Systems 22, pp. 1589-1597. Accessed 2019-12-25.
  25. Schütze, Hinrich. 1998. "Automatic Word Sense Discrimination." Computational Linguistics, vol. 24, no. 1, pp. 97-123. Accessed 2019-12-26.
  26. Trask, Andrew, Phil Michalak, and John Liu. 2015. "sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings." arXiv, v1, November 19. Accessed 2019-12-28.
  27. Turney, Peter D. 2004. "Word Sense Disambiguation by Web Mining for Word Co-occurrence Probabilities." arXiv, v1, July 29. Accessed 2019-12-26.
  28. Véronis, Jean, and Nancy M. Ide. 1990. "Word Sense Disambiguation with Very Large Neural Networks Extracted from Machine Readable Dictionaries." COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics, pp. 389-394. Accessed 2019-12-26.
  29. Weaver, Warren. 1949. "Translation." The Rockefeller Foundation, July 15. Accessed 2019-12-26.
  30. Wiriyathammabhum, Peratham, Boonserm Kijsirikul, Hiroya Takamura, and Manabu Okumura. 2012. "Applying Deep Belief Networks to Word Sense Disambiguation." arXiv, v1, July 2. Accessed 2019-12-28.
  31. Yarowsky, David. 1995. "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods." 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189-195, June. Accessed 2019-12-25.
  32. Yuan, Dayu, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. "Semi-supervised Word Sense Disambiguation with Neural Models." arXiv, v2, November 5. Accessed 2019-12-28.
  33. Zhong, Zhi and Hwee Tou Ng. 2010. "It Makes Sense: A Wide-Coverage Word Sense Disambiguation System for Free Text." Proceedings of the ACL 2010 System Demonstrations, ACL, pp. 78-83, July. Accessed 2019-12-26.

Further Reading

  1. Navigli, Roberto. 2009. "Word Sense Disambiguation: A Survey." ACM Computing Surveys, vol. 41, no. 2, article 10, February. Accessed 2019-12-26.
  2. Ide, Nancy and Jean Véronis. 1998. "Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art." Computational Linguistics, vol. 24, no. 1, pp. 1-40, March. Accessed 2019-12-25.
  3. Jurafsky, Daniel and James H. Martin. 2009b. "Computational Lexical Semantics." Chapter 20 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-12-25.
  4. Ng, Hwee Tou and John Zelle. 1997. "Corpus-Based Approaches to Semantic Interpretation in Natural Language Processing." AI Magazine, vol. 18, no. 4, pp. 45-64, American Association for Artificial Intelligence. Accessed 2019-12-26.
  5. ACLWiki. 2014. "Word sense disambiguation resources." ACLWiki, December 12. Accessed 2019-12-26.

Cite As

Devopedia. 2021. "Word Sense Disambiguation." Version 3, June 28. Accessed 2023-11-13. https://devopedia.org/word-sense-disambiguation