Lemmatization

Lemmatization involves morphological analysis. Source: Bitext 2018.

Consider the words 'am', 'are', and 'is'. These come from the same root word 'be'. Likewise, 'dinner' and 'dinners' can be reduced to 'dinner'. Variations of a word are called wordforms or surface forms. It's often complex to handle all such variations in software. By reducing these wordforms to a common root, we simplify the input. The root form is called a lemma. An algorithm or program that determines lemmas from wordforms is called a lemmatizer.
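The idea can be sketched with a lookup table. This is only a minimal illustration: `LEMMA_TABLE` and `lemmatize` are hypothetical names, and real lemmatizers rely on far larger lexicons and on morphological analysis rather than a hand-written dictionary.

```python
# Toy lookup table mapping wordforms to lemmas (illustrative only).
LEMMA_TABLE = {
    "am": "be",
    "are": "be",
    "is": "be",
    "dinners": "dinner",
}


def lemmatize(wordform):
    """Return the lemma for a wordform, or the wordform itself if unknown."""
    key = wordform.lower()
    return LEMMA_TABLE.get(key, key)


if __name__ == "__main__":
    for w in ["am", "are", "is", "dinner", "dinners"]:
        print(w, "->", lemmatize(w))
```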

To give a sense of scale, the Oxford English Dictionary of 1989 has about 615K lemmas, which can be taken as an upper bound. Shakespeare's works have about 880K words, 29K wordforms, and 18K lemmas.
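The distinction between words, wordforms and lemmas can be made concrete by counting all three in a short text. The sketch below assumes a toy lemma table (`TOY_LEMMAS`, a made-up name) and whitespace tokenization; it is not how the figures above were obtained.

```python
# Toy lemma table; a real system would use a full lexicon or a trained model.
TOY_LEMMAS = {"is": "be", "are": "be", "was": "be", "dinners": "dinner"}


def corpus_counts(text):
    """Return (number of words, distinct wordforms, distinct lemmas)."""
    tokens = text.lower().split()
    wordforms = set(tokens)
    lemmas = {TOY_LEMMAS.get(t, t) for t in wordforms}
    return len(tokens), len(wordforms), len(lemmas)


if __name__ == "__main__":
    sample = "dinner is late and dinners are late and dinner was late"
    print(corpus_counts(sample))
```

For this sample the function counts 11 words, 7 wordforms and 4 lemmas.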

Lemmatization involves word morphology, which is the study of word forms. Typically, we identify the morphological tags of a word before selecting the lemma.

Discussion

  • Why do we need to find the lemma of a word?
    Part of speech helps in identifying the correct lemma. Source: McCloud 2019.

    Many NLP tasks can benefit from lemmatization. For instance, topic modelling looks at word distribution in a document. By normalizing words to a common form, we get better results. In word embeddings, that is, representing words as real-valued vectors, removing inflected wordforms can improve downstream NLP tasks.

    For information retrieval (IR), lemmatization helps with query expansion so that suitable matches are returned even if there's not an exact word match. In document clustering, it's useful to reduce the number of tokens. It also helps in machine translation.
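A rough sketch of lemma-based matching for retrieval follows. It assumes a toy lemma table and, for simplicity, normalizes both documents and queries to lemmas instead of expanding the query; all names (`TOY_LEMMAS`, `build_index`, `search`) are hypothetical.

```python
from collections import defaultdict

# Toy lemma table (an assumption for illustration).
TOY_LEMMAS = {"cars": "car", "drove": "drive", "driving": "drive", "drives": "drive"}


def lemma_of(word):
    """Lowercase the word and map it to its lemma if the table knows it."""
    key = word.lower()
    return TOY_LEMMAS.get(key, key)


def build_index(documents):
    """Map each lemma to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[lemma_of(word)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every lemma in the query."""
    lemma_sets = [index.get(lemma_of(w), set()) for w in query.split()]
    return set.intersection(*lemma_sets) if lemma_sets else set()


if __name__ == "__main__":
    docs = {1: "She drove two cars", 2: "He drives a truck", 3: "Cats sleep"}
    idx = build_index(docs)
    print(search(idx, "driving car"))
```

The query 'driving car' matches document 1 even though neither word occurs there in that exact form.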

    Ultimately, the decision to use lemmas is application dependent. We should use lemmas only if they show better performance.

  • What are the challenges with lemmatization?
    Lemma ambiguity is high for Arabic and Urdu. Source: Bergmanis and Goldwater 2018, fig. 3.

    Out-of-vocabulary (OOV) words are a challenge. For example, WordNet, which the NLTK package uses for lemmatization, doesn't have the word 'carmaking'. The lemmatizer therefore doesn't relate it to 'carmaker'.
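The OOV problem can be illustrated with a dictionary-only lookup. The lexicon below is a made-up stand-in and does not reflect WordNet's actual contents; `lookup_lemma` is a hypothetical helper.

```python
# Toy lexicon of known wordforms (illustrative; not WordNet's actual contents).
LEXICON = {"carmakers": "carmaker", "carmaker": "carmaker", "cars": "car"}


def lookup_lemma(wordform):
    """Return (lemma, known), where known is False for OOV wordforms."""
    key = wordform.lower()
    if key in LEXICON:
        return LEXICON[key], True
    # OOV: a dictionary-only lemmatizer can do no better than echo the input.
    return key, False


if __name__ == "__main__":
    for w in ["carmakers", "carmaking"]:
        print(w, "->", lookup_lemma(w))
```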

    It's difficult to construct rules for irregular word inflections. The word 'bring' might look like an inflected form of 'to bre' but it's not. Even more challenging is a word such as 'gehört' in German. It's a participle of 'hören' (to hear) or of 'gehören' (to belong). Both are valid but only the context of usage can help us derive the correct lemma. It's for these reasons that neural network approaches to learning rules are preferred over hand-crafted rules.
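Both difficulties can be seen in a small sketch: a naive suffix rule that misfires on 'bring', and a wordform with more than one candidate lemma. The exception list, the ambiguity table and the function names are assumptions made for illustration.

```python
# Words that merely end in "ing" and must not be stripped (toy exception list).
EXCEPTIONS = {"bring", "sing", "thing"}

# Ambiguous wordforms map to several candidate lemmas.
AMBIGUOUS = {"gehört": ["hören", "gehören"]}


def naive_ing_rule(word):
    """Strip a trailing 'ing' with no further checks."""
    return word[:-3] if word.endswith("ing") else word


def candidate_lemmas(word):
    """Return all candidate lemmas; more than one means context is needed."""
    if word in AMBIGUOUS:
        return AMBIGUOUS[word]
    if word in EXCEPTIONS:
        return [word]
    return [naive_ing_rule(word)]


if __name__ == "__main__":
    print(naive_ing_rule("bring"))     # the unchecked rule over-applies
    print(candidate_lemmas("bring"))
    print(candidate_lemmas("gehört"))
```

Hand-maintained exception lists like this one grow quickly, which is part of the motivation for learning the rules from data.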

    When content comes from the Internet or social media, it's impractical to use a predefined dictionary. This is another reason for a neural network approach with an open vocabulary.

    Even with neural networks, some inflected forms might never occur in training, such as 'forbade', the past tense of 'forbid'. Many training corpora come from newspaper texts where verbs in the second person are rare. This can impact lemmatization of such forms.

  • How is lemmatization different from stemming?
    Stemming versus lemmatization. Source: Kushwah 2019.

    Given a wordform, stemming is a simpler way to get to its root form. Stemming simply removes prefixes and suffixes. Lemmatization on the other hand does morphological analysis, uses dictionaries and often requires part of speech information. Thus, lemmatization is a more complex process.

    Stems need not be dictionary words but lemmas always are. Another way to say this is that "a lemma is the base form of all its inflectional forms, whereas a stem isn't".

    Wordforms are either inflectional (change of tense, singular/plural) or derivational (change of part of speech or meaning). Lemmatization usually collapses inflectional forms whereas stemming does this for derivational forms.

    Stemming may suffice for many use cases in English. For morphologically complex languages such as Arabic, lemmatization is essential.

    There are two types of problems with stemming that lemmatization can solve:

    • Two wordforms with different lemmas may stem to the same result. E.g., 'universal' and 'university' result in the same stem 'univers'.
    • Two wordforms of the same lemma may end up as two different stems. E.g., 'good' and 'better' have the same lemma 'good' but don't share a stem.
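Both problems can be reproduced with a toy suffix-stripping stemmer set against a toy lemma table. The suffix list and the table are assumptions chosen to keep the example small; real stemmers such as Porter's apply many more rules.

```python
# Suffixes removed by the toy stemmer, tried in this order (illustrative only).
SUFFIXES = ["ity", "al", "s"]

# Toy lemma table (illustrative only).
LEMMAS = {"better": "good", "best": "good", "universities": "university"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up in the lemma table; unknown words map to themselves."""
    return LEMMAS.get(word, word)


if __name__ == "__main__":
    for w in ["universal", "university", "good", "better"]:
        print(f"{w:>12}  stem={toy_stem(w):<10} lemma={toy_lemma(w)}")
```

The stemmer merges 'universal' and 'university' and keeps 'good' and 'better' apart, while the lemma lookup does the opposite in both cases.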
  • Which are the available models for lemmatization?
    Derivational rules are used by an FST lemmatizer. Source: Matuszek and Papalaskari 2015, slide 23.

    The classical approach is to use a Finite State Transducer (FST). FST models encode vocabulary and string rewrite rules. Where there are multiple encoding rules, there's ambiguity. We can think of an FST as reading the surface form from an input tape and writing the lexical form to an output tape. There could be intermediate tapes for spelling changes, etc.
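The two-tape picture can be sketched with a tiny deterministic transducer for regular English nouns. This is a simplified illustration and not how toolkits such as HFST are implemented: the transition table, the `+N+SG`/`+N+PL` tags and the function names are assumptions, there are no intermediate tapes, and the builder assumes that no stem equals another stem plus 's'.

```python
END = "#"  # end-of-input marker read from the input tape


def build_noun_fst(stems):
    """Build transitions {(state, symbol): (next_state, output)} for noun stems.

    Each stem is copied symbol by symbol. At the end of a stem the machine
    accepts either end-of-input (singular) or 's' plus end-of-input (plural).
    """
    transitions = {}
    next_free = 1   # state 0 is the start state
    final = -1      # single accepting state
    for stem in stems:
        state = 0
        for ch in stem:
            if (state, ch) not in transitions:
                transitions[(state, ch)] = (next_free, ch)
                next_free += 1
            state = transitions[(state, ch)][0]
        plural_state = next_free
        next_free += 1
        transitions[(state, END)] = (final, "+N+SG")
        transitions[(state, "s")] = (plural_state, "")
        transitions[(plural_state, END)] = (final, "+N+PL")
    return transitions, final


def transduce(word, transitions, final):
    """Read the surface form and return the lexical form, or None if rejected."""
    state, output = 0, []
    for symbol in list(word) + [END]:
        if (state, symbol) not in transitions:
            return None
        state, emitted = transitions[(state, symbol)]
        output.append(emitted)
    return "".join(output) if state == final else None


if __name__ == "__main__":
    fst, final_state = build_noun_fst(["cat", "dog"])
    for w in ["cat", "cats", "dogs", "cow"]:
        print(w, "->", transduce(w, fst, final_state))
```

The lemma is the part of the output before the first tag, so 'cats' yields 'cat' together with the analysis noun, plural.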

    One well-known tool is the Helsinki Finite State Toolkit (HFST), which makes use of other open source tools such as SFST and OpenFST.

    Chrupała formalized lemmatization in 2006 by treating it as a string-to-string transduction task. Given a word w, we get its morphological attributes m. To obtain the lemma l, we calculate the probability P(l|w,m), using features based on (l,w,m). The approach trains Maximum-Entropy Markov Models (MEMMs), one each for POS tags and lemmas. Müller improved on this by using Conditional Random Fields (CRFs) to jointly learn tags and lemmas.
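In symbols, the selection of the lemma can be written as below. The second expression assumes the standard maximum-entropy (log-linear) form, which the description above implies but doesn't spell out; the feature functions f_i and weights λ_i are generic placeholders rather than the specific features used in the original work.

```latex
\hat{l} = \arg\max_{l} P(l \mid w, m),
\qquad
P(l \mid w, m) =
  \frac{\exp\left(\sum_i \lambda_i f_i(l, w, m)\right)}
       {\sum_{l'} \exp\left(\sum_i \lambda_i f_i(l', w, m)\right)}
```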

    One researcher combined the best of stemming and lemmatization.

  • Which are the neural network approaches to lemmatization?
    Seq2seq model for lemmatization. Source: Fonseca 2019.

    A well-known model is the Sequence-to-Sequence (seq2seq) neural network. Words and their lemmas are processed character by character. Input can include POS tags. Every input is represented using word embeddings.
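The sketch below shows one plausible way to prepare input and target sequences for such a character-level model. The angle-bracket tag symbols and the `</s>` end marker are assumptions for illustration; actual systems define their own formats, and the neural network itself is omitted.

```python
def encode_input(word, tags):
    """Turn a word and its tags into one input sequence for the encoder.

    Characters come first, followed by one symbol per tag. Tag symbols are
    kept whole so that they can't be confused with characters.
    """
    return list(word) + [f"<{tag}>" for tag in tags]


def encode_target(lemma):
    """The decoder is trained to emit the lemma one character at a time."""
    return list(lemma) + ["</s>"]


if __name__ == "__main__":
    print(encode_input("dogs", ["NOUN", "Number=Plur"]))
    print(encode_target("dog"))
```

Each symbol in these sequences would then be mapped to an embedding vector before being fed to the network.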

    To deal with lemma ambiguity, we need to make use of the context. Bidirectional LSTM networks, which are based on RNNs, are able to do this. They take in a sequence of words to produce context-sensitive vectors. Then the lemmatizer uses automatically generated rules (pretrained by another neural network) to arrive at the lemma. However, such ambiguity is so rare that a seq2seq architecture may be more efficient. An encoder-decoder architecture using GRUs is another approach to handle unseen or ambiguous words.

    Turku NLP, which is based on neural networks, provides one of the state-of-the-art lemmatizers. Other good ones are UDPipe Future and Stanford NLP, although the latter performs poorly for low-resource languages, for which CUNI x-ling excels.

  • Could you mention some tools that can do lemmatization?
    MorphAdorner is an online lemmatizer. Source: NUIT 2019.

    In Python, NLTK has the WordNetLemmatizer class to determine lemmas. It includes the option to pass the part of speech to help us obtain the correct lemmas. Other Python-based lemmatizers are in the packages spaCy, TextBlob, Pattern and Gensim.

    Stanford's LemmaProcessor is another Python-based lemmatizer. It allows us to select a seq2seq model, a dictionary model or a trivial identity model. For Chinese, a dictionary model is adequate. In Vietnamese, the lemma is identical to the original word. Hence, the identity model will suffice.

    TreeTagger does POS tagging and also gives lemma information. It supports 20 natural languages. Another multilingual framework is GATE DictLemmatizer, which is based on HFST and word-lemma dictionaries available from Wiktionary. We can use wiktextract to download and process Wiktionary data dumps.

    LemmaGen is an open source multilingual platform with implementations or bindings in C++, C# and Python. In .NET, there's LemmaGenerator. There's a Java implementation of Morpha.

Milestones

1968

It's in the 1960s that morphological analysis is formalized. Chomsky and Halle show that an ordered sequence of rewrite rules converts abstract phonological forms to surface forms through intermediate representations.

1972

Douglas C. Johnson shows that pairs of input/output can be modelled by finite state transducers. However, this result is overlooked and rediscovered later in 1981 by Ronald M. Kaplan and Martin Kay.

1983

Thus far, rules have been applied in a cascade. Kimmo Koskenniemi invents two-level morphology, where rules can be applied in parallel. Rules are seen as symbol-by-symbol constraints. Lexical lookup and morphological analysis are done in tandem. It's only in 1985 that the first two-level rules compiler is invented.

1997

Karttunen et al. show how we can compile regular expressions to create finite state transducers. The use of finite state transducers for morphological analysis and generation is well known but their application in other areas of NLP is not. The authors show how to use them for date parsing, date validation and tokenization.

2000

In a problem related to lemmatization, Minnen et al. at the University of Sussex show how to generate words in English based on the lemma, the POS tag and the inflectional form to be generated. They write high-level descriptions or rules as regular expressions, which Flex compiles into finite-state automata.
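The flavour of rule-based generation can be sketched with a few ordered regular-expression rules for English noun plurals. These rules and the irregular-form table are illustrative assumptions and are not the authors' rule set; Python's `re` module stands in for the Flex-compiled automata.

```python
import re

# Ordered (pattern, replacement) rules for regular English noun plurals.
PLURAL_RULES = [
    (re.compile(r"(s|x|z|ch|sh)$"), r"\1es"),   # box -> boxes
    (re.compile(r"([^aeiou])y$"), r"\1ies"),     # city -> cities
    (re.compile(r"$"), "s"),                     # dog -> dogs
]

# Irregular forms are looked up before any rule is tried.
IRREGULAR_PLURALS = {"child": "children", "mouse": "mice"}


def generate_plural(lemma):
    """Generate the plural wordform of a noun lemma."""
    if lemma in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[lemma]
    for pattern, replacement in PLURAL_RULES:
        if pattern.search(lemma):
            # Apply only the first rule whose pattern matches.
            return pattern.sub(replacement, lemma, count=1)
    return lemma


if __name__ == "__main__":
    for noun in ["dog", "box", "city", "day", "church", "child"]:
        print(noun, "->", generate_plural(noun))
```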

2006
Edit tree used for lemmatization. Source: Müller et al. 2015, fig. 1.

Grzegorz Chrupała publishes Simple data-driven context-sensitive lemmatization. Lemmatization is modelled as a classification problem where the algorithm chooses one of many "edit trees" that can transform a word to its lemma. Such trees are induced from wordform-lemma pairs. This work leads to a PhD dissertation in 2008 and the system is named Morfette.
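A simplified sketch of the edit-tree idea follows. A tree is induced from one wordform-lemma pair by recursively splitting around the longest common substring, and it can then be applied to other wordforms that fit the same pattern. The tuple representation and function names are assumptions for illustration, and the classifier that chooses among candidate trees is left out.

```python
def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest common substring."""
    best = (0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best[2]:
                best = (i, j, k)
    return best


def build_edit_tree(form, lemma):
    """Induce an edit tree from one wordform-lemma pair."""
    i, j, n = longest_common_substring(form, lemma)
    if n == 0:
        # Nothing in common: the leaf replaces one string with another.
        return ("sub", form, lemma)
    prefix_len, suffix_len = i, len(form) - i - n
    left = build_edit_tree(form[:i], lemma[:j])
    right = build_edit_tree(form[i + n:], lemma[j + n:])
    return ("node", prefix_len, suffix_len, left, right)


def apply_edit_tree(tree, form):
    """Apply a tree to a wordform; return None if the tree doesn't fit."""
    if tree[0] == "sub":
        return tree[2] if form == tree[1] else None
    _, prefix_len, suffix_len, left, right = tree
    if prefix_len + suffix_len > len(form):
        return None
    split = len(form) - suffix_len
    new_prefix = apply_edit_tree(left, form[:prefix_len])
    new_suffix = apply_edit_tree(right, form[split:])
    if new_prefix is None or new_suffix is None:
        return None
    # The middle part, which matched the common substring, is kept unchanged.
    return new_prefix + form[prefix_len:split] + new_suffix


if __name__ == "__main__":
    tree = build_edit_tree("umgeschaut", "umschauen")
    print(apply_edit_tree(tree, "angeschaut"))
    print(apply_edit_tree(build_edit_tree("walked", "walk"), "jumped"))
```

Because a tree stores only lengths and substitutions, one tree learned from a single pair generalizes to every wordform that inflects the same way.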

2015
Second-order linear chain CRF to predict lemma. Source: Müller et al. 2015, fig. 2.

Many NLP systems take a pipeline approach. They do tagging followed by lemmatization, since POS tags can help the lemmatizer disambiguate. But there's a mutual dependency between tagging and lemmatization. Müller et al. present a system called Lemming that jointly does POS tagging and lemmatization using Conditional Random Fields (CRFs). It can also analyze OOV words. This work sets a new baseline for lemmatization on six languages.

2016

At the SIGMORPHON Shared Task, it's noted that various neural sequence-to-sequence models give the best results. In 2018, a seq2seq model is used along with a novel context representation. This model, used within TurkuNLP in the CoNLL-18 Shared Task, gives the best performance on lemmatization.

2018

Bergmanis and Goldwater use an encoder-decoder NN architecture in a lemmatizer they name Lematus. Both encoder and decoder are 2-layer Gated Recurrent Units (GRUs). They compare its performance against context-free systems. They use character contexts of each form to be lemmatized. Thus, fewer training resources are needed. They note that context-free systems may be adequate if a language has many unseen words but few ambiguous words.
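The notion of character context can be sketched as follows: the word to be lemmatized is surrounded by a fixed number of characters from its neighbours. The marker symbols `<lc>`, `<rc>` and `<s>`, and the function name, are assumptions for illustration and may differ from the actual Lematus input format.

```python
def with_char_context(tokens, index, n_chars):
    """Return the input symbols for tokens[index] with n_chars of context.

    "<lc>" and "<rc>" mark where the left and right contexts meet the word,
    and "<s>" stands for a space between context words.
    """
    left_text = " ".join(tokens[:index])
    right_text = " ".join(tokens[index + 1:])
    # Guard n_chars == 0, since text[-0:] would return the whole string.
    left = left_text[-n_chars:] if n_chars > 0 else ""
    right = right_text[:n_chars]

    def to_symbols(text):
        return ["<s>" if ch == " " else ch for ch in text]

    return (to_symbols(left) + ["<lc>"] + list(tokens[index])
            + ["<rc>"] + to_symbols(right))


if __name__ == "__main__":
    sentence = ["es", "gehört", "mir"]
    print(with_char_context(sentence, 1, 3))
```

With zero context characters the input reduces to the word alone, which corresponds to the context-free setting.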

2019
Words, morphological tags and lemmas. Source: Malaviya et al. 2019, fig. 1.

Malaviya et al. use a NN model to jointly learn morphological tags and lemmas. They use an encoder-decoder model with a hard attention mechanism. In particular, they use a 2-layer LSTM for morphological tagging. For the lemmatizer, they use a 2-layer BiLSTM encoder and a 1-layer LSTM decoder. They compare their results with other state-of-the-art models: Lematus, UDPipe, Lemming, and Morfette.

Sample Code

  • # Source: https://timmccloud.net/blog-natural-language-processing/
    # Accessed: 2019-10-11
    # Requires the WordNet corpus; run nltk.download('wordnet') once beforehand.

    from nltk.stem import WordNetLemmatizer

    input_words = ['writing', 'calves', 'be', 'branded', 'horse', 'randomize',
                   'possibly', 'provision', 'hospital', 'kept', 'scratchy', 'code']

    lemmatizer = WordNetLemmatizer()

    # Right-align three columns: the input word and one column per lemmatizer
    lemmatizer_names = ['NOUN LEMMATIZER', 'VERB LEMMATIZER']
    formatted_text = '{:>24}' * (len(lemmatizer_names) + 1)
    print('\n', formatted_text.format('INPUT WORD', *lemmatizer_names),
          '\n', '='*75)

    # Lemmatize each word twice, once as a noun ('n') and once as a verb ('v')
    for word in input_words:
        output = [word, lemmatizer.lemmatize(word, pos='n'),
                  lemmatizer.lemmatize(word, pos='v')]
        print(formatted_text.format(*output))
     

References

  1. Aker, Ahmet, Johann Petrak, and Firas Sabbah. 2017. "An Extensible Multilingual Open Source Lemmatizer." Proceedings of Recent Advances in Natural Language Processing, pp. 40–45, September 4-6. Accessed 2019-10-11.
  2. Bergmanis, Toms, and Sharon Goldwater. 2018. "Context Sensitive Neural Lemmatization with Lematus." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1391-1400, June. Accessed 2019-10-11.
  3. Bergs, Alexander, and Laurel J Brinton, eds. 2012. "English Historical Linguistics: An International Handbook." Volume 1, De Gruyter Mouton. Accessed 2019-10-10.
  4. Bitext. 2018. "What is the difference between stemming and lemmatization?" Blog, Bitext, February 28. Accessed 2019-09-24.
  5. Chrupała, Grzegorz. 2008. "Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing." PhD Dissertation, Dublin City University, April. Accessed 2019-10-10.
  6. Dobbins, Scott N. 2018. "Re-learning English: Pluralization and Lemmatization." Searching for 伯樂, January 10. Accessed 2019-10-11.
  7. Fonseca, Erick. 2019. "State-of-the-art Multilingual Lemmatization." Towards Data Science, via Medium, March 14. Accessed 2019-10-11.
  8. Jabeen, Hafsa. 2018. "Stemming and Lemmatization in Python." DataCamp, October 23. Accessed 2019-10-11.
  9. Jurafsky, Daniel and James H. Martin. 2019. "Regular Expressions, Text Normalization, Edit Distance." Chapter 2 In: Speech and Language Processing, Third Edition draft, October 02. Accessed 2019-10-10.
  10. Kanerva, Jenna, Filip Ginter, and Tapio Salakoski. 2019. "Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks." August 04. Accessed 2019-10-10.
  11. Karttunen, Lauri, and Kenneth R. Beesley. 2001. "A Short History of Two-Level Morphology." August 12. Accessed 2019-10-11.
  12. Karttunen, Lauri, Jean-Pierre Chanod, Gregory Grefenstette, and Anne Schiller. 1997. "Regular expressions for language engineering." Natural Language Engineering, vol. 2, no. 4, pp. 305-329. Accessed 2019-10-11.
  13. Koskenniemi, Kimmo. 2008. "HFST: Modular Compatibility for Open Source Finite-state Tools." University of Helsinki, June 09. Accessed 2019-10-11.
  14. Kushwah, Devendra. 2019. "What is difference between stemming and lemmatization?" Quora, May 16. Accessed 2019-10-11.
  15. LemmaGen. 2019. "Homepage." LemmaGen. Accessed 2019-10-11.
  16. Liberman, Mark and Ellen Prince. 1998. "Morphology II." LING 001: Introduction to Linguistics, University of Pennsylvania, September. Accessed 2019-10-10.
  17. Machine Learning Plus. 2018. "Lemmatization Approaches with Examples in Python." Machine Learning Plus, October 02. Accessed 2019-10-11.
  18. Malaviya, Chaitanya, Shijie Wu, and Ryan Cotterell. 2019. "A Simple Joint Model for Improved Contextual Neural Lemmatization." arXiv, v2, April 05. Accessed 2019-10-11.
  19. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. "Introduction to Information Retrieval." Cambridge University Press. Accessed 2019-09-24.
  20. Matuszek, Paula and Mary-Angela Papalaskari. 2015. "Lecture 3: Morphology, Finite State Transducers." CSC 9010: Natural Language Processing, University of Colorado. Accessed 2019-10-11.
  21. McCloud, Tim. 2019. "Blog: Natural Language Processing." May 09. Accessed 2019-10-11.
  22. Minnen, Guido, John Carroll, and Darren Pearce. 2000. "Robust, applied morphological generation." INLG’2000 Proceedings of the First International Conference on Natural Language Generation, pp. 201-208, June. Accessed 2019-10-10.
  23. Müller, Thomas, Ryan Cotterell, Alexander Fraser, and Hinrich Schütze. 2015. "Joint Lemmatization and Morphological Tagging with Lemming." Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2268–2274, September. Accessed 2019-10-11.
  24. NUIT. 2019. "English Lemmatizer Example." MorphAdorner V2.0., Northwestern University Information Technology. Accessed 2019-10-11.
  25. NuGet. 2014. "LemmaGenerator." v1.1.0, NuGet, June 23. Accessed 2019-10-11.
  26. Schumacher, Alex. 2019. "When (not) to Lemmatize or Remove Stop Words in Text Preprocessing." Open Data Group, March 21. Accessed 2019-10-11.
  27. UD. 2018. "CoNLL 2018 Shared Task." Universal Dependencies, October 28. Accessed 2019-10-10.

Further Reading

  1. Heidenreich, Hunter. 2018. "Stemming? Lemmatization? What?" Towards Data Science, via Medium, December 21. Accessed 2019-10-11.
  2. Machine Learning Plus. 2018. "Lemmatization Approaches with Examples in Python." Machine Learning Plus, October 02. Accessed 2019-10-11.
  3. Fonseca, Erick. 2019. "State-of-the-art Multilingual Lemmatization." Towards Data Science, via Medium, March 14. Accessed 2019-10-11.
  4. Kanerva, Jenna, Filip Ginter, and Tapio Salakoski. 2019. "Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks." August 04. Accessed 2019-10-10.
  5. Malaviya, Chaitanya, Shijie Wu, and Ryan Cotterell. 2019. "A Simple Joint Model for Improved Contextual Neural Lemmatization." arXiv, v2, April 05. Accessed 2019-10-11.
  6. Liu, Haibin, Tom Christiansen, William A Baumgartner Jr, and Karin Verspoor. 2012. "BioLemmatizer: a lemmatization tool for morphological processing of biomedical text." J Biomed Semantics, vol. 3, no. 3. Accessed 2019-10-10.

Cite As

Devopedia. 2019. "Lemmatization." Version 2, October 11. Accessed 2024-06-25. https://devopedia.org/lemmatization

Last updated on
2019-10-11 18:00:39