Word Sense Disambiguation

Different senses of four words. Source: Liu et al. 2004, table 3.

Many words have multiple meanings or senses. For example, the word bass has at least eight different senses. The correct sense can be established by looking at the context of use. This is easy for humans because we know from experience how the world works. Word Sense Disambiguation (WSD) is about enabling computers to do the same.

WSD involves the use of syntax, semantics and word meanings in context. It's therefore a part of computational lexical semantics. WSD is considered an AI-complete problem, which means that it's as hard as the most difficult problems in AI.

Both supervised and unsupervised algorithms are available. The term Word Sense Induction (WSI) is sometimes used for unsupervised algorithms. In the 2010s, word embeddings became popular. Used with neural network models, such embeddings represent the current state of the art for WSD.

Discussion

  • Could you explain Word Sense Disambiguation with an example?
    In English-to-Spanish machine translation, we need the correct sense of 'bass'. Source: Jurafsky 2015, slide 9.

    Consider the sentences "I can hear bass sounds" and "They liked grilled bass". The meaning or sense of the word 'bass' is low frequency tones or a type of fish respectively. The word alone is not sufficient to determine the correct sense. When we consider the word in the context of surrounding words, the sense becomes clear. Using WSD, the second sentence can be sense-tagged as "They liked/ENJOY grilled/COOKED bass/FISH".

    Context comes from the words 'sounds' or 'grilled'. It's also helpful to know that these collocated words are noun and adjective respectively. One comes after 'bass' and the other comes before 'bass'. These syntactic relations give additional information. In general, pre-processing steps such as POS tagging and parsing help WSD.

    A difficult example is "the astronomer married the star".

    There are no universally agreed senses for a word. Senses can also vary with domains. WSD also relies on knowledge. Without knowledge, it's impossible to determine the correct sense. It's expensive to build knowledge resources. Back in the 1990s, this was seen as the knowledge acquisition bottleneck.

  • Could you mention some applications of WSD?
    WSD applied to image retrieval for the word 'cup'. Source: Saenko and Darrell 2009, fig. 1.

    WSD is usually seen as an "intermediate task", as a means to an end. Obtaining the correct word sense is helpful in many NLP applications. Exactly how WSD is used is application specific.

    Here are some examples:

    • Machine Translation: An English translation of the French word 'grille' can be railings, bar, grid, scale, schedule, etc. Correct word sense disambiguation is therefore necessary.
    • Information Retrieval: When searching for judicial references with the word 'court', we wish to avoid matches pertaining to royalty.
    • Thematic Analysis: Themes are identified based on word distribution but we include only words of the relevant sense.
    • Grammatical Analysis: In POS tagging or syntactic analysis, WSD is useful. In the French sentence "L'étagère plie sous les livres", livres refers to 'books' and not 'pounds'.
    • Speech Processing: WSD helps in obtaining the correct phonetization in speech synthesis.
    • Text Processing: For inserting diacritics, WSD helps in correcting the French word 'comte' to 'comté'. For case changes, WSD corrects 'HE READ THE TIMES' to 'He read the Times'. WSD also helps to "Wikify" online documents, that is, to link them to encyclopedic knowledge.

  • What are some essential terms to know about word senses?
    Example noun and verb senses in WordNet. Source: Jurafsky and Martin 2009a, fig. 19.2-19.3.

    Consider the word 'bank'. This can refer to a financial institution or a sloping mound. These two senses of the same word are unrelated but they look and sound the same. We call them homonyms. The sense relation is called homonymy. Typically, homonyms have different origins and different dictionary entries.

    A bank can also refer to the building that houses the financial institution. These senses are semantically related. The sense relation is called polysemy.

    Synonyms are different words with the same or nearly the same meaning. Antonyms are words with opposite meanings.

    Consider two words in which one is a subclass of the other, a type-of relation, such as mango and fruit. Mango is a hyponym of fruit. Fruit is a hypernym of mango.

    Consider two words that form a part-whole relation, such as wheel and car. Wheel is a meronym of car. Car is a holonym of wheel.

  • Which are the essential elements for doing WSD?
    A co-occurrence graph for the word 'bar'. Source: Navigli 2009, fig. 7.

    WSD requires two main sources of information: context and knowledge. Context is established from neighbouring words and the domain of discourse. Sense-tagged corpora provide knowledge, leading to data-driven or corpus-based WSD. Use of lexicons or encyclopaedias leads to knowledge-driven WSD.

    We also need to know the possible word senses. These can be enumerative, with WordNet being an example. Alternatively, a generative model underspecifies senses until context is considered. Rules then generate the senses, and these rules capture regularities in the creation of senses.

    In the lexical sample task, WSD is applied to a sample of pre-selected words. A supervised ML approach is possible, based on a hand-labelled corpus. In the all-words task, WSD is applied to all words, for which a supervised ML approach is not practical. Dictionary-based approaches or bootstrapping techniques are more suitable.

    A bag-of-words approach can be used for context. To preserve word ordering, collocation can be used when forming feature vectors. Such a vector might include the word's root form and its POS. Syntactic relations, distance from target and selectional preferences are other possible features.
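
    The difference between the two kinds of context features can be illustrated with a minimal sketch. The function names, window sizes and the hand-tagged toy sentence are assumptions made for illustration; a real pipeline would obtain POS tags from a tagger.

```python
def bag_of_words(tokens, target_index, window=3):
    """Unordered set of words within `window` positions of the target."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    return {tokens[i].lower() for i in range(lo, hi) if i != target_index}


def collocation_features(tagged, target_index, window=2):
    """Position-specific word and POS features, which preserve word order."""
    features = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        if 0 <= i < len(tagged):
            word, pos = tagged[i]
            features[f"w{offset:+d}"] = word.lower()
            features[f"pos{offset:+d}"] = pos
    return features


# Hand-tagged toy sentence (hypothetical tags, for illustration only).
tagged = [("They", "PRP"), ("liked", "VBD"), ("grilled", "JJ"), ("bass", "NN")]
tokens = [word for word, _ in tagged]
target = tokens.index("bass")

print(bag_of_words(tokens, target))
print(collocation_features(tagged, target))
```

    The bag-of-words set forgets where each neighbour occurred, whereas the collocation features record that 'grilled' is an adjective immediately before the target.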

  • Could you describe some algorithms for WSD?
    Simplified Lesk Algorithm looks at overlapping words. Source: Jurafsky 2015, slide 32.

    A simple supervised approach is to use a naive Bayes classifier. We maximize the probability of a word sense given a feature vector. The problem is simplified by using Bayes' Rule and assuming features are independent of one another. Another approach is a decision list classifier.
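
    A naive Bayes sense classifier can be sketched in a few lines. This is a toy illustration rather than a reference implementation: the training contexts, the sense labels and the choice of add-one smoothing are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (context_words, sense) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        for w in words:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab


def classify(words, model):
    """Return the sense maximizing log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        # Add-one smoothing (an assumption of this sketch).
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in words:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical hand-labelled contexts for the word 'bass'.
training = [
    (["hear", "low", "sounds"], "MUSIC"),
    (["play", "guitar", "sounds"], "MUSIC"),
    (["grilled", "fish", "dinner"], "FISH"),
    (["caught", "river", "fish"], "FISH"),
]
model = train_naive_bayes(training)
print(classify(["grilled", "dinner"], model))
print(classify(["hear", "sounds"], model))
```

    Summing log probabilities per feature is exactly where the independence assumption enters: each context word contributes to the score without regard to the others.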

    The Corpus Lesk algorithm uses a sense-tagged corpus. We also use the definition or gloss of each sense from a dictionary. Examples from the corpus and the gloss become the signature of the sense. Then we compute the number of overlapping words between the signature and the context. Inverse Document Frequency (IDF) weighting is applied to discount function words (the, of, etc.). The simplified Lesk algorithm uses only the gloss for signature and doesn't use weights.
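
    The simplified Lesk algorithm is small enough to sketch directly. The glosses below are invented paraphrases rather than entries from a real dictionary, and the stopword list and tie-breaking rule (first sense listed wins) are assumptions of this sketch.

```python
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
             "that", "is", "it", "by", "with", "for"}


def tokenize(text):
    """Lowercase word set with punctuation and stopwords removed."""
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return words - STOPWORDS - {""}


def simplified_lesk(context_sentence, sense_glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = tokenize(context_sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & tokenize(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical glosses for two senses of 'bass'.
bass_glosses = {
    "bass/MUSIC": "the lowest part of the musical range, with low frequency sounds",
    "bass/FISH": "an edible freshwater or sea fish, often grilled or fried",
}

print(simplified_lesk("They liked grilled bass", bass_glosses))
print(simplified_lesk("I can hear bass sounds", bass_glosses))
```

    Corpus Lesk would extend each signature with words from sense-tagged example sentences and weight each overlapping word by its IDF instead of counting it as 1.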

    For evaluation, the most frequent sense is used as a baseline. Frequencies can be taken from a sense-tagged corpus such as SemCor. The Lesk algorithm is also a suitable baseline. Senseval and SemEval have standardized sense evaluation.
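
    The most frequent sense baseline amounts to a lookup table built from a sense-tagged corpus. In the sketch below, the tiny corpus and the sense labels are made up for illustration; real frequencies would come from a resource such as SemCor.

```python
from collections import Counter, defaultdict


def most_frequent_sense(tagged_corpus):
    """tagged_corpus: iterable of (word, sense) pairs."""
    counts = defaultdict(Counter)
    for word, sense in tagged_corpus:
        counts[word][sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def accuracy(predictions, gold):
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)


# Hypothetical sense-tagged training and test data.
train = [("bass", "FISH"), ("bass", "MUSIC"), ("bass", "MUSIC"),
         ("bank", "FINANCE")]
test = [("bass", "MUSIC"), ("bass", "FISH"), ("bank", "FINANCE")]

mfs = most_frequent_sense(train)
predictions = [mfs[word] for word, _ in test]
gold = [sense for _, sense in test]
print(mfs, accuracy(predictions, gold))
```

    Any WSD system worth using should beat this context-blind baseline on the same test data.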

  • How are word embeddings relevant to the problem of WSD?
    F1 scores on different all-words WSD datasets. Source: Iacobacci et al. 2016, table 3.

    Schütze (1998) proposed the use of word vectors and context vectors. These are high-dimensional vectors and often sparse. To make them practical for computation, Singular Value Decomposition (SVD) reduces the number of dimensions. Within such a vector space model, Latent Semantic Analysis (LSA) helps to determine semantic relations. Word embeddings are a modern alternative.

    Word embeddings were proposed by Bengio et al. (2003). These are low-dimensional dense vectors that capture semantic information. However, words with multiple senses are reduced to a single vector. This is not directly useful for WSD. To overcome this, Trask et al. (2015) proposed sense2vec, where representations are of senses, not words. Sense2vec improved the accuracy of other NLP tasks such as named entity recognition and neural dependency parsing.

    Iacobacci et al. (2016) explored the direct use of word embeddings. Using the It Makes Sense (IMS) framework along with Word2vec, they improved F1 scores on various WSD datasets. Word embeddings of the target word and its surrounding words are converted into "sense vectors" using various methods: concatenation, average, fractional decay, exponential decay.
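
    The four ways of combining context embeddings can be sketched as follows. The exact weighting formulas, the default decay rate `alpha` and the toy 2-dimensional vectors are assumptions of this sketch; the paper should be consulted for the precise definitions used in the experiments.

```python
def context_positions(target, window, length):
    """Indices within `window` of the target, excluding the target itself."""
    return [j for j in range(target - window, target + window + 1)
            if j != target and 0 <= j < length]


def weighted_sum(vectors, target, window, weight):
    """Sum context vectors, each scaled by weight(distance from target)."""
    out = [0.0] * len(vectors[0])
    for j in context_positions(target, window, len(vectors)):
        w = weight(abs(target - j))
        for d, value in enumerate(vectors[j]):
            out[d] += w * value
    return out


def concatenation(vectors, target, window):
    out = []
    for j in context_positions(target, window, len(vectors)):
        out.extend(vectors[j])
    return out


def average(vectors, target, window):
    n = len(context_positions(target, window, len(vectors)))
    return weighted_sum(vectors, target, window, lambda dist: 1.0 / n)


def fractional_decay(vectors, target, window):
    # Assumed weighting: linear fall-off with distance from the target.
    return weighted_sum(vectors, target, window,
                        lambda dist: (window - dist) / window)


def exponential_decay(vectors, target, window, alpha=0.5):
    # Assumed weighting: immediate neighbours get 1, then (1 - alpha)^k.
    return weighted_sum(vectors, target, window,
                        lambda dist: (1 - alpha) ** (dist - 1))


# Toy embeddings for a five-word sentence; index 2 is the target word.
vectors = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0], [2.0, 2.0], [4.0, 0.0]]
print(average(vectors, 2, 2))
print(fractional_decay(vectors, 2, 2))
print(exponential_decay(vectors, 2, 2))
print(concatenation(vectors, 2, 2))
```

    Concatenation keeps positional information at the cost of a larger vector, while the two decay schemes keep the original dimensionality and give nearby words more influence.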

  • What are some neural network approaches to WSD?
    Seq2seq architecture with multiple attentions. Source: Ahmed et al. 2018, fig. 3.

    Neural network approaches to WSD became popular in the 2010s. Wiriyathammabhum et al. (2012) applied Deep Belief Networks (DBN). They pre-trained the hidden layers using various knowledge sources, layer by layer. They then used a separate fine-tuning step for better discrimination.

    We lose sequential and syntactic information when averaging word vectors. Instead, Yuan et al. (2016) proposed a semi-supervised LSTM model with label propagation. To capture contexts from both sides, Kågebäck and Salomonsson (2016) applied Bidirectional LSTM.

    Many models consider only context. Knowledge sources such as WordNet are ignored. Gloss-Augmented WSD (GAS) considers both context and glosses (sense definitions) and uses BiLSTM.

    One attention-based approach is an encoder-decoder model with multiple attentions on different linguistic features. Another is GlossBERT. It uses BERT, encodes context-gloss pairs of all possible senses, and treats WSD as a sentence-pair classification problem.
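
    The sentence-pair framing can be illustrated without loading any model. The sketch below only builds the context-gloss pairs and picks the best-scoring one. The field names, the gloss formatting and the glosses themselves are assumptions, and `overlap_scorer` is a trivial stand-in for a fine-tuned pair classifier, not BERT.

```python
def build_context_gloss_pairs(context, target, sense_glosses, gold_sense=None):
    """One context-gloss pair per candidate sense, labelled 1 for the gold sense."""
    pairs = []
    for sense, gloss in sense_glosses.items():
        label = None if gold_sense is None else int(sense == gold_sense)
        pairs.append({
            "sense": sense,
            "text_a": context,
            "text_b": f"{target}: {gloss}",  # assumed gloss formatting
            "label": label,
        })
    return pairs


def predict_sense(pairs, pair_scorer):
    """Return the sense whose pair receives the highest score."""
    best = max(pairs, key=lambda p: pair_scorer(p["text_a"], p["text_b"]))
    return best["sense"]


def overlap_scorer(text_a, text_b):
    # Placeholder scorer: counts shared lowercase tokens.
    return len(set(text_a.lower().split()) & set(text_b.lower().split()))


# Hypothetical glosses for two senses of 'bass'.
glosses = {
    "bass/MUSIC": "the lowest part of the musical range",
    "bass/FISH": "an edible fish often grilled or fried",
}
pairs = build_context_gloss_pairs("They liked grilled bass", "bass",
                                  glosses, gold_sense="bass/FISH")
print([p["label"] for p in pairs])
print(predict_sense(pairs, overlap_scorer))
```

    In a trained system the scorer would be a classifier fine-tuned on such labelled pairs, and the pairs built with `gold_sense` would serve as its training data.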

  • What are some resources to do WSD?
    It Makes Sense (IMS): a flexible framework for WSD. Source: Zhong and Ng 2010, fig. 1.

    A number of datasets and sense-annotated corpora are available to train WSD models: Senseval and SemEval tasks (all-words, lexical sample, WSI), AIDA CoNLL-YAGO, MASC, SemCor, and WebCAGe. Likewise, word sense inventories include WordNet, TWSI, Wiktionary, Wikipedia, FrameNet, OmegaWiki, VerbNet, and more. These are supported by the modular Java framework DKPro WSD.

    UKB is an open-source toolkit that can be used for knowledge-based WSD. Agirre et al. (2018) caution that it should be run with its optimal default parameters, since sub-optimal use understates its performance.

    In January 2017, Google released word sense annotations on MASC and SemCor datasets. Senses are from New Oxford American Dictionary (NOAD). NOAD senses are also mapped to WordNet.

    ACLWiki has curated a list of useful WSD resources. This includes papers, inventories, annotated corpora and software.

    Ruder captures the current state of the art in WSD with links to relevant papers. Papers With Code also maintains a list of recent papers on WSD.

Milestones

Jul 1949

Warren Weaver considers the task of using computers to translate text from one language to another. He recognizes the importance of context and meaning. He makes references to cryptography and statistical semantic studies as possible approaches to obtaining the correct meaning of a word. Weaver also notes that a word mostly has only one meaning within a particular domain.

1953

Oswald and Lawson propose microglossaries for machine translation. These are glossaries assembled for a particular domain. Through the 1950s, researchers produce many such domain-specific glossaries to aid machine translation. For example, a microglossary for mathematics would define 'triangle' as a geometric shape and not as a musical instrument.

1955

Erwin Reifler defines what he calls semantic coincidences between a word and its context. He also notes that syntactic relations can be used to disambiguate. For example, the word 'kept' can have an object that's a gerund (He kept eating), an adjectival phrase (He kept calm), or a noun phrase (He kept a record).

1957

Masterman makes use of synonyms, near synonyms and associated words from Roget's Thesaurus. She gives the example of "flowering plant". Using the thesaurus, we can determine that 'vegetable' is the only common sense for the words 'flowering' and 'plant'. This is therefore the correct sense of the word 'plant' in this context.

1965

Madhu and Lytle propose the use of what they call Figure of Merit. This is a probabilistic measure that's useful when grammatical structure alone is unable to disambiguate. They focus on scientific and engineering literature and identify ten groups. The group or context is determined using words with a single meaning. Then the most probable meaning of words with multiple meanings is selected given the context. Paradoxically, this is also the time when interest in machine translation declines.

1970
An example semantic network from WordNet. Source: Navigli 2009, fig. 3.

In the 1970s, some notable approaches include Semantic Networks of Quillian and Simmons; Preferential Semantics of Wilks; word-based understanding of Riesbeck. Early semantic networks can be traced to the late 1950s.

1986

Lesk uses a machine-readable dictionary (MRD) for WSD. In general, the 1980s see the arrival of large-scale lexical resources (such as WordNet) for automated knowledge extraction. This is also when focus shifts from linguistic theories to empirical methods.

1990
Network topology for the word 'pen'. Source: Véronis and Ide 1990, fig. 2.

The use of neural networks had been suggested in the early 1980s but was limited to a few words and hand-coded. Véronis and Ide extend this idea by using a machine-readable version of the Collins English Dictionary. The network is formed using dictionary entries and the words used to define them. Word nodes activate sense nodes. Feedback allows competing senses to inhibit one another.

1991
Aligning sentences between English and French to aid WSD. Source: Brown et al. 1991, fig. 1.

Brown et al. show that it's possible to disambiguate by aligning sentences in two languages. A word in one language might translate into different words in another language, each with a unique sense. In 1992, Gale et al. extend this idea by using Canadian Hansards (parliamentary debates) that are available in more than one language. This avoids the need for an expensive hand-labelled corpus.

1995
Unsupervised WSD for the word 'plant'. Source: Yarowsky 1995, fig. 1-3.

Supervised WSD algorithms have the problem of requiring sense-annotated corpora, which are expensive and laborious to create. Yarowsky proposes an unsupervised WSD algorithm. The algorithm uses two useful constraints: one sense per collocation, one sense per discourse. It's a bootstrapping procedure that is seeded with a small number of sense annotations. The algorithm then determines and iteratively improves on the senses for other occurrences of the word. An example is to disambiguate 'plant', which can be about plant life or a manufacturing plant.

1998
Word vector space and clustering for WSD. Source: Schütze 1998, fig. 2-3.

Schütze proposes a vector space approach to WSD via clustering. Senses are seen as clusters of similar contexts. A sense of a particular word is the cluster to which it's closest. Since the technique is unsupervised, senses are induced from a corpus. Word vectors are calculated using cooccurrences. Word vectors are sparse but context vectors are dense. The dimensions of both vectors are reduced using Singular Value Decomposition (SVD).
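
The assignment step of this approach, choosing the cluster closest to a context, can be sketched as follows. The word vectors, the sense names and the two example clusters for 'plant' are hypothetical, and the sketch assumes the cluster centroids are already given. Inducing them by clustering and reducing dimensions with SVD are left out.

```python
import math


def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_sense(context_words, word_vectors, sense_centroids):
    """Build a context vector from known words and pick the closest cluster."""
    known = [word_vectors[w] for w in context_words if w in word_vectors]
    if not known:
        return None
    context_vector = centroid(known)
    return max(sense_centroids,
               key=lambda s: cosine(context_vector, sense_centroids[s]))


# Hypothetical 2-dimensional word vectors.
word_vectors = {
    "factory": [1.0, 0.0], "workers": [0.9, 0.1],
    "leaves": [0.0, 1.0], "garden": [0.1, 0.9],
}
# Assumed clusters of contexts for the word 'plant'.
sense_centroids = {
    "plant/INDUSTRIAL": centroid([word_vectors["factory"], word_vectors["workers"]]),
    "plant/LIVING": centroid([word_vectors["leaves"], word_vectors["garden"]]),
}

print(nearest_sense(["the", "garden", "leaves"], word_vectors, sense_centroids))
print(nearest_sense(["factory"], word_vectors, sense_centroids))
```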

1999

Mihalcea and Moldovan make use of WordNet for WSD. They rank different senses using WordNet's semantic density for a word-pair and web mining for word pair cooccurrences. In 2004, Peter Turney also employs web mining to calculate cooccurrence probabilities that are used to generate semantic features for WSD.

2010

This decade sees an increasing use of word embeddings and neural network models for WSD. Some of these include Gloss-Augmented WSD (GAS), GlossBERT, and use of BiLSTM.

Mar 2018
Evolution of word senses for the word 'game'. Source: Ramiro et al. 2018, fig. 5.

Ramiro et al. study the evolution of word senses. They note that cognitive efficiency drives this evolution through a process called nearest-neighbour chaining. For new word senses, reuse of existing words (polysemy) is more common than new word forms.

References

  1. ACLWiki. 2014. "Word sense disambiguation resources." ACLWiki, December 12. Accessed 2019-12-26.
  2. Agirre, Eneko, Oier López de Lacalle, and Aitor Soroa. 2018. "The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD." Proceedings of Workshop for NLP Open Source Software (NLP-OSS), ACL, pp. 29-33, July. Accessed 2019-12-28.
  3. Ahmed, Mahtab, Muhammad Rifayat Samee, and Robert E. Mercer. 2018. "A Novel Neural Sequence Model with Multiple Attentions for Word Sense Disambiguation." arXiv, v1, September 4. Accessed 2019-12-28.
  4. Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1991. "Word sense disambiguation using statistical methods." Proceedings of Annual Meeting of the Association for Computational Linguistics, pp. 264-270. Accessed 2019-12-26.
  5. DKPro. 2015. "DKPro WSD - Welcome." V1.2.0, March 03. Accessed 2019-12-25.
  6. Evans, Colin and Dayu Yuan. 2017. "A Large Corpus for Supervised Word-Sense Disambiguation." Google AI Blog, January 18. Accessed 2019-12-25.
  7. Gale, William A., Kenneth W. Church, and David Yarowsky. 1992. "Using Bilingual Materials to Develop Word Sense Disambiguation Methods." Proceedings of International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 101-112. Accessed 2019-12-26.
  8. Hasa. 2016. "Difference Between Polysemy and Homonymy." Pediaa, July 7. Accessed 2019-12-26.
  9. Huang, Luyao, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. "GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge." arXiv, v3, November 9. Accessed 2019-12-28.
  10. Iacobacci, Ignacio, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. "Embeddings for Word Sense Disambiguation: An Evaluation Study." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, pp. 897-907, August. Accessed 2019-12-28.
  11. Ide, Nancy and Jean Véronis. 1998. "Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art." Computational Linguistics, vol. 24, no. 1, pp. 1-40, March. Accessed 2019-12-25.
  12. Jurafsky, Daniel. 2015. "Word Sense Disambiguation." Slides, Stanford University, August. Accessed 2019-12-25.
  13. Jurafsky, Daniel and James H. Martin. 2009a. "Lexical Semantics." Chapter 19 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-12-25.
  14. Jurafsky, Daniel and James H. Martin. 2009b. "Computational Lexical Semantics." Chapter 20 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-12-25.
  15. Liu, Hongfang, Virginia Teller, and Carol Friedman. 2004. "A Multi-aspect Comparison Study of Supervised Word Sense Disambiguation." J Am Med Inform Assoc, vol. 11, no. 4, pp. 320–331, Jul-Aug. Accessed 2019-12-26.
  16. Luo, Fuli, Tianyu Liu, Qiaolin Xia, Baobao Chang, and Zhifang Sui. 2018. "Incorporating Glosses into Neural Word Sense Disambiguation." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, pp. 2473-2482, July. Accessed 2019-12-28.
  17. Madhu, Swaminathan and Dean W. Lytle. 1965. "A Figure of Merit Technique for the Resolution of Non-Grammatical Ambiguity." Mechanical Translation, vol. 8, no. 2, pp. 9-13, February. Accessed 2019-12-26.
  18. Masterman, M.M. 1957. "The Thesaurus in Syntax and Semantics." Mechanical Translation, vol. 4, nos. 1 and 2, pp. 35-43, November. Accessed 2019-12-26.
  19. Mihalcea, Rada and Andras Csomai. 2007. "Wikify! Linking Documents to Encyclopedic Knowledge." CIKM'07, November 6-8. Accessed 2019-12-25.
  20. Mihalcea, Rada and Dan I. Moldovan. 1999. "A Method for Word Sense Disambiguation of Unrestricted Text." Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 152-158, June. Accessed 2019-12-26.
  21. Orkphol, Korawit and Wu Yang. 2019. "Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet." Future Internet, 11(5), 114. Accessed 2019-12-25.
  22. Ramiro, Christian, Mahesh Srinivasan, Barbara C. Malt, and Yang Xu. 2018. "Algorithms in the historical emergence of word senses." PNAS, 115(10), March 6. Accessed 2019-12-25.
  23. Ruder, Sebastian. 2019. "Word Sense Disambiguation." NLP-progress, October 24. Accessed 2019-12-25.
  24. Saenko, Kate and Trevor Darrell. 2009. "Filtering Abstract Senses From Image Search Results." Advances in Neural Information Processing Systems 22, pp. 1589-1597. Accessed 2019-12-25.
  25. Schütze, Hinrich. 1998. "Automatic Word Sense Discrimination." Computational Linguistics, vol. 24, no. 1, pp. 97-123. Accessed 2019-12-26.
  26. Trask, Andrew, Phil Michalak, and John Liu. 2015. "sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings." arXiv, v1, November 19. Accessed 2019-12-28.
  27. Turney, Peter D. 2004. "Word Sense Disambiguation by Web Mining for Word Co-occurrence Probabilities." arXiv, v1, July 29. Accessed 2019-12-26.
  28. Véronis, Jean, and Nancy M. Ide. 1990. "Word Sense Disambiguation with Very Large Neural Networks Extracted from Machine Readable Dictionaries." COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics, pp. 389-394. Accessed 2019-12-26.
  29. Weaver, Warren. 1949. "Translation." The Rockefeller Foundation, July 15. Accessed 2019-12-26.
  30. Wiriyathammabhum, Peratham, Boonserm Kijsirikul, Hiroya Takamura, and Manabu Okumura. 2012. "Applying Deep Belief Networks to Word Sense Disambiguation." arXiv, v1, July 2. Accessed 2019-12-28.
  31. Yarowsky, David. 1995. "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods." 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189-195, June. Accessed 2019-12-25.
  32. Yuan, Dayu, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. "Semi-supervised Word Sense Disambiguation with Neural Models." arXiv, v2, November 5. Accessed 2019-12-28.
  33. Zhong, Zhi and Hwee Tou Ng. 2010. "It Makes Sense: A Wide-Coverage Word Sense Disambiguation System for Free Text." Proceedings of the ACL 2010 System Demonstrations, ACL, pp. 78-83, July. Accessed 2019-12-26.

Further Reading

  1. Navigli, Roberto. 2009. "Word Sense Disambiguation: A Survey." ACM Computing Surveys, vol. 41, no. 2, article 10, February. Accessed 2019-12-26.
  2. Ide, Nancy and Jean Véronis. 1998. "Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art." Computational Linguistics, vol. 24, no. 1, pp. 1-40, March. Accessed 2019-12-25.
  3. Jurafsky, Daniel and James H. Martin. 2009b. "Computational Lexical Semantics." Chapter 20 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-12-25.
  4. Ng, Hwee Tou and John Zelle. 1997. "Corpus-Based Approaches to Semantic Interpretation in Natural Language Processing." AI Magazine, vol. 18, no. 4, pp. 45-64, American Association for Artificial Intelligence. Accessed 2019-12-26.
  5. ACLWiki. 2014. "Word sense disambiguation resources." ACLWiki, December 12. Accessed 2019-12-26.

Cite As

Devopedia. 2021. "Word Sense Disambiguation." Version 3, June 28. Accessed 2024-06-25. https://devopedia.org/word-sense-disambiguation
Contributed by
1 author


Last updated on
2021-06-28 15:59:55