Lemmatization

Article Info

Contributed by
1 author

Last updated on
2019-10-11 18:00:39

Article Versions

2 2019-10-11 18:00:39

By arvindpdmn

Completing article and publishing.
1 2019-10-11 17:19:47

By arvindpdmn

First version, no content.


Lemmatization involves morphological analysis. Source: Bitext 2018.

Consider the words 'am', 'are', and 'is'. These come from the same root word 'be'. Likewise, 'dinner' and 'dinners' can be reduced to 'dinner'. Variations of a word are called wordforms or surface forms. It's often complex to handle all such variations in software. By reducing these wordforms to a common root, we simplify the input. The root form is called a lemma. An algorithm or program that determines lemmas from wordforms is called a lemmatizer.

To give a sense of scale, the Oxford English Dictionary of 1989 has about 615K lemmas as an upper bound. Shakespeare's works have about 880K words, 29K wordforms, and 18K lemmas.

Lemmatization involves word morphology, which is the study of word forms. Typically, we identify the morphological tags of a word before selecting the lemma.
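
The idea of tagging first and then selecting the lemma can be sketched with a toy lookup keyed by wordform and part of speech. The table entries, tag names and the function `lemmatize` are illustrative assumptions, not the API of any real library.

```python
# Toy lookup table keyed by (wordform, part of speech).
# All entries and tag names here are illustrative assumptions.
LEMMAS = {
    ("am", "VERB"): "be",
    ("are", "VERB"): "be",
    ("is", "VERB"): "be",
    ("dinners", "NOUN"): "dinner",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def lemmatize(wordform, pos):
    """Return the lemma of a tagged wordform, or the lowercased wordform if unknown."""
    key = (wordform.lower(), pos)
    return LEMMAS.get(key, wordform.lower())

# The same surface form 'saw' gets a different lemma depending on its tag.
tagged = [("I", "PRON"), ("saw", "VERB"), ("the", "DET"), ("saw", "NOUN")]
lemmas = [lemmatize(word, pos) for word, pos in tagged]
print(lemmas)
```

A real lemmatizer would obtain the tags from a tagger and would cover far more than a hand-written table, but the dependence of the lemma on the tag is the same.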

Discussion

Why do we need to find the lemma of a word?
Part of speech helps in identifying the correct lemma. Source: McCloud 2019.
Many NLP tasks can benefit from lemmatization. For instance, topic modelling looks at word distribution in a document. By normalizing words to a common form, we get better results. In word embeddings, that is, representing words as real-valued vectors, removing inflected wordforms can improve downstream NLP tasks.
For information retrieval (IR), lemmatization helps with query expansion so that suitable matches are returned even if there's not an exact word match. In document clustering, it's useful to reduce the number of tokens. It also helps in machine translation.
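
As a rough illustration of the retrieval case, the sketch below indexes documents by lemma so that a query matches even when the exact wordform differs. It normalizes both documents and queries rather than expanding the query, which has a similar effect. The table `LEMMA_OF` and the helper names are hypothetical; a real system would call a proper lemmatizer.

```python
from collections import defaultdict

# Hypothetical wordform-to-lemma table standing in for a real lemmatizer.
LEMMA_OF = {
    "dinners": "dinner", "is": "be", "are": "be", "was": "be",
    "served": "serve", "serves": "serve",
}

def to_lemma(token):
    return LEMMA_OF.get(token, token)

def build_index(docs):
    """Map each lemma to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[to_lemma(token)].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every lemma of the query."""
    lemmas = [to_lemma(token) for token in query.lower().split()]
    if not lemmas:
        return set()
    result = set(index.get(lemmas[0], set()))
    for lemma in lemmas[1:]:
        result &= index.get(lemma, set())
    return result

docs = {
    1: "Dinner is served at eight",
    2: "The hotel serves two dinners daily",
    3: "Breakfast was late",
}
index = build_index(docs)
hits = search(index, "dinners served")
print(sorted(hits))
```

Neither document contains both 'dinners' and 'served' literally, yet both are returned because matching happens on lemmas.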
Ultimately, the decision to use lemmas is application dependent. We should use lemmas only if they show better performance.
What are the challenges with lemmatization?
Lemma ambiguity is high for Arabic and Urdu. Source: Bergmanis and Goldwater 2018, fig. 3.
Out-of-vocabulary (OOV) words are a challenge. For example, WordNet, which the NLTK package uses for lemmatization, doesn't have the word 'carmaking'. The lemmatizer therefore doesn't relate this to 'carmaker'.
It's difficult to construct rules for irregular word inflections. The word 'bring' might look like an inflected form of 'to bre' but it's not. Even more challenging is a word such as 'gehört' in German. It's a participle of 'hören' (to hear) or of 'gehören' (to belong). Both are valid but only the context of usage can help us derive the correct lemma. It's for these reasons that neural network approaches to learning rules are preferred over hand-crafted rules.
When content comes from the Internet or social media, it's impractical to use a predefined dictionary. This is another reason for a neural network approach with an open vocabulary.
Even with neural networks, some inflected forms might never occur in training, such as 'forbade', the past tense of 'forbid'. Many training corpora come from newspaper texts where verbs in second person are rare. This can impact lemmatization of such forms.
How is lemmatization different from stemming?
Stemming versus lemmatization. Source: Kushwah 2019.
Given a wordform, stemming is a simpler way to get to its root form. Stemming simply removes prefixes and suffixes. Lemmatization on the other hand does morphological analysis, uses dictionaries and often requires part of speech information. Thus, lemmatization is a more complex process.
Stems need not be dictionary words but lemmas always are. Another way to say this is that "a lemma is the base form of all its inflectional forms, whereas a stem isn't".
Wordforms are either inflectional (change of tense, singular/plural) or derivational (change of part of speech or meaning). Lemmatization usually collapses inflectional forms whereas stemming does this for derivational forms.
Stemming may suffice for many use cases in English. For morphologically complex languages such as Arabic, lemmatization is essential.
There are two types of problems with stemming that lemmatization can solve:
- Two wordforms with different lemmas may stem to the same result. E.g., 'universal' and 'university' result in the same stem 'univers'.
- Two wordforms of the same lemma may end up as two different stems. E.g., 'good' and 'better' have the same lemma 'good' but a stemmer leaves them as distinct stems.
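
Both problems can be reproduced with a deliberately crude sketch. The suffix list and the lemma table below are assumptions chosen for illustration. `toy_stem` is not the Porter algorithm, and `toy_lemma` stands in for a dictionary-backed lemmatizer.

```python
# Suffixes are tried in this order; the list is a made-up minimal example.
SUFFIXES = ("ity", "al", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

# Hypothetical dictionary of irregular or inflected forms.
LEMMAS = {"better": "good", "universities": "university"}

def toy_lemma(word):
    """Look the word up in the dictionary; unknown words are their own lemma."""
    return LEMMAS.get(word, word)

for word in ["universal", "university", "good", "better"]:
    print(f"{word:12} stem={toy_stem(word):10} lemma={toy_lemma(word)}")
```

The stemmer merges 'universal' and 'university' into 'univers' and keeps 'good' and 'better' apart, whereas the dictionary lookup does the opposite in both cases.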
Which are the available models for lemmatization?
Derivational rules are used by an FST lemmatizer. Source: Matuszek and Papalaskari 2015, slide 23.
The classical approach is to use a Finite State Transducer (FST). FST models encode vocabulary and string rewrite rules. Where there are multiple encoding rules, there's ambiguity. We can think of FST as reading the surface form from an input tape and writing the lexical form to an output tape. There could be intermediate tapes for spelling changes, etc.
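
The combination of a vocabulary with string rewrite rules, and the ambiguity that arises when several rules apply, can be approximated as follows. This is not a compiled transducer and does not use HFST or any other toolkit. The lexicon, the rules and the tag names are assumptions made for the sketch.

```python
# Lexicon: lemma -> parts of speech it can take (illustrative entries only).
LEXICON = {"fly": {"N", "V"}, "fox": {"N"}, "walk": {"N", "V"}}

# Each rule rewrites a surface suffix into a lexical suffix and emits tags:
# (surface suffix, lexical suffix, part of speech, feature tag)
RULES = [
    ("ies", "y", "N", "+PL"),
    ("ies", "y", "V", "+3SG"),
    ("es", "", "N", "+PL"),
    ("s", "", "N", "+PL"),
    ("s", "", "V", "+3SG"),
    ("ed", "", "V", "+PAST"),
    ("", "", "N", "+SG"),
    ("", "", "V", "+INF"),
]

def analyze(surface):
    """Return every lexical form that some rule and the lexicon both license."""
    analyses = []
    for surface_suffix, lexical_suffix, pos, tag in RULES:
        if not surface.endswith(surface_suffix):
            continue
        stem = surface[: len(surface) - len(surface_suffix)]
        lemma = stem + lexical_suffix
        if pos in LEXICON.get(lemma, ()):
            analyses.append(f"{lemma}+{pos}{tag}")
    return analyses

print(analyze("flies"))   # two rules apply, so the analysis is ambiguous
print(analyze("foxes"))
print(analyze("walked"))
```

A real FST compiles the lexicon and rules into a single network and reads the input symbol by symbol. The sketch only mirrors the input/output behaviour on a handful of words.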
One well-known tool is the Helsinki Finite State Toolkit (HFST), which makes use of other open source tools such as SFST and OpenFST.
Chrupała formalized lemmatization in 2006 by treating it as a string-to-string transduction task. Given a word w, we get its morphological attributes m. To obtain the lemma l, we calculate the probability P(l|w,m). This uses features based on (l,w,m). It then trains a Maximum-Entropy Markov Model (MEMM), one each for POS tags and lemmas. Müller improved on this by using Conditional Random Fields (CRFs) for jointly learning tags and lemmas.
One researcher combined the best of stemming and lemmatization.
Which are the neural network approaches to lemmatization?
Seq2seq model for lemmatization. Source: Fonseca 2019.
A well-known model is the Sequence-to-Sequence (seq2seq) neural network. Words and their lemmas are processed character by character. Input can include POS tags. Every input is represented using word embeddings.
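
How a wordform, an optional POS tag and a lemma might be turned into character-level sequences for such a model is sketched below. The special symbols `<pad>`, `<s>`, `</s>` and the `<TAG>` convention are assumptions, not the format of any particular system, and no neural network is trained here.

```python
def encode_input(word, pos=None):
    """Split a wordform into characters and optionally append a POS symbol."""
    symbols = list(word)
    if pos is not None:
        symbols.append(f"<{pos}>")
    return symbols

def encode_target(lemma):
    """Wrap the lemma's characters in start and end symbols for the decoder."""
    return ["<s>"] + list(lemma) + ["</s>"]

def build_vocab(sequences):
    """Assign an integer id to every symbol; ids 0-2 are reserved symbols."""
    vocab = {"<pad>": 0, "<s>": 1, "</s>": 2}
    for sequence in sequences:
        for symbol in sequence:
            vocab.setdefault(symbol, len(vocab))
    return vocab

# Hypothetical training triples: (wordform, POS tag, lemma).
pairs = [("running", "VERB", "run"), ("mice", "NOUN", "mouse")]
inputs = [encode_input(word, pos) for word, pos, _ in pairs]
targets = [encode_target(lemma) for _, _, lemma in pairs]

vocab = build_vocab(inputs + targets)
input_ids = [[vocab[symbol] for symbol in sequence] for sequence in inputs]

print(inputs[0])
print(targets[0])
print(input_ids[0])
```

In an actual model, the integer ids would index an embedding table and the encoder-decoder would learn to map each input sequence to its target sequence.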
To deal with lemma ambiguity, we need to make use of the context. Bidirectional LSTM networks, which are a type of RNN, are able to do this. They take in a sequence of words to produce context-sensitive vectors. Then the lemmatizer uses automatically generated rules (pretrained by another neural network) to arrive at the lemma. However, such ambiguity is so rare that seq2seq architecture may be more efficient. Encoder-decoder architecture using GRU is another approach to handle unseen or ambiguous words.
The Turku NLP pipeline, which is based on neural networks, provides one of the state-of-the-art lemmatizers. Other good ones are UDPipe Future and Stanford NLP, although the latter performs poorly for low-resource languages, for which CUNI x-ling excels.
Could you mention some tools that can do lemmatization?
MorphAdorner is an online lemmatizer. Source: NUIT 2019.
In Python, NLTK has WordNetLemmatizer class to determine lemmas. It includes the option to pass the part of speech to help us obtain the correct lemmas. Other Python-based lemmatizers are in packages spaCy, TextBlob, Pattern and GenSim.
Stanford's LemmaProcessor is another Python-based lemmatizer. It allows us to select a seq2seq model, a dictionary model or a trivial identity model. For Chinese, a dictionary model is adequate. In Vietnamese, the lemma is identical to the original word. Hence, the identity model will suffice.
TreeTagger does POS tagging plus gives lemma information. It supports 20 natural languages. Another multilingual framework is GATE DictLemmatizer, which is based on HFST and word-lemma dictionaries available from Wiktionary. We can use wiktextract to download and process Wiktionary data dumps.
LemmaGen is an open source multilingual platform with implementations or bindings in C++, C# and Python. In .NET, there's LemmaGenerator. There's a Java implementation of Morpha.

Milestones

1968

It's in the 1960s that morphological analysis is formalized. Chomsky and Halle show that an ordered sequence of rewrite rules converts abstract phonological forms to surface forms through intermediate representations.

1972

Douglas C. Johnson shows that pairs of input/output can be modelled by finite state transducers. However, this result is overlooked and rediscovered later in 1981 by Ronald M. Kaplan and Martin Kay.

1983

Thus far, rules have been applied in a cascade. Kimmo Koskenniemi invents two-level morphology, where rules can be applied in parallel. Rules are seen as symbol-by-symbol constraints. Lexical lookup and morphological analysis are done in tandem. It's only in 1985 that the first two-level rules compiler is invented.

1997

Karttunen et al. show how we can compile regular expressions to create finite state transducers. The use of finite state transducers for morphological analysis and generation is well known, but their application in other areas of NLP is not. The authors show how to use them for date parsing, date validation and tokenization.

2000

In a problem related to lemmatization, Minnen et al. at the University of Sussex show how to generate words in English based on lemma, POS tag and inflection form to be generated. They write high-level descriptions or rules as regular expressions, which Flex compiles into finite-state automata.

2006

Grzegorz Chrupała publishes Simple data-driven context-sensitive lemmatization. Lemmatization is modelled as a classification problem where the algorithm chooses one of many "edit trees" that can transform a word to its lemma. Such trees are induced from wordform-lemma pairs. This work leads to a PhD dissertation in 2008 and the system is named Morfette.
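
The idea of inducing transformations from wordform-lemma pairs and then choosing one for a new word can be sketched in a much reduced form. The code below is not Morfette. It replaces edit trees with simple suffix edit scripts and replaces the trained classifier with a frequency count per word ending. All function names and the training pairs are assumptions.

```python
from collections import Counter, defaultdict

def edit_script(wordform, lemma):
    """Return (n, s): strip n trailing characters from the wordform, then append s."""
    prefix = 0
    while (prefix < min(len(wordform), len(lemma))
           and wordform[prefix] == lemma[prefix]):
        prefix += 1
    return (len(wordform) - prefix, lemma[prefix:])

def apply_script(wordform, script):
    strip, append = script
    return wordform[: len(wordform) - strip] + append

def train(pairs, suffix_len=3):
    """Count which edit script each word ending is associated with."""
    counts = defaultdict(Counter)
    for wordform, lemma in pairs:
        counts[wordform[-suffix_len:]][edit_script(wordform, lemma)] += 1
    return counts

def lemmatize(wordform, counts, suffix_len=3):
    """Apply the most frequent script seen for this ending, if any."""
    seen = counts.get(wordform[-suffix_len:])
    if not seen:
        return wordform
    script, _ = seen.most_common(1)[0]
    return apply_script(wordform, script)

pairs = [("walking", "walk"), ("talking", "talk"),
         ("studies", "study"), ("cats", "cat")]
model = train(pairs)

print(edit_script("studies", "study"))
print(lemmatize("jumping", model))
print(lemmatize("parties", model))
```

Because the model predicts a transformation rather than a lemma string, it can handle wordforms such as 'jumping' that never appeared in training.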

2015

Many NLP systems take a pipeline approach. They do tagging followed by lemmatization, since POS tags can help the lemmatizer disambiguate. But there's a mutual dependency between tagging and lemmatization. Müller et al. present a system called Lemming that jointly does POS tagging and lemmatization using Conditional Random Fields (CRFs). It can also analyze OOV words. This work sets a new baseline for lemmatization on six languages.

2016

At the SIGMORPHON Shared Task, it's noted that various neural sequence-to-sequence models give the best results. In 2018, a seq2seq model is used along with a novel context representation. This model, used within TurkuNLP in the CoNLL-18 Shared Task, gives the best performance on lemmatization.

2018

Bergmanis and Goldwater use an encoder-decoder NN architecture in a lemmatizer they name Lematus. Both encoder and decoder are 2-layer Gated Recurrent Units (GRUs). They compare its performance against context-free systems. They use character contexts of each form to be lemmatized. Thus, fewer training resources are needed. They note that context-free systems may be adequate if a language has many unseen words but few ambiguous words.
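
One way to attach character context to a form before feeding it to an encoder is sketched below. The boundary markers `<lc>` and `<rc>`, the default context width and the function name are assumptions for illustration. The sketch should not be read as the exact input format of Lematus.

```python
def with_char_context(tokens, index, n_chars=5):
    """Surround tokens[index] with up to n_chars characters of sentence context."""
    left_text = " ".join(tokens[:index])
    right_text = " ".join(tokens[index + 1:])
    # Keep only the characters nearest to the target form on each side.
    left = left_text[max(0, len(left_text) - n_chars):]
    right = right_text[:n_chars]
    return list(left) + ["<lc>"] + list(tokens[index]) + ["<rc>"] + list(right)

tokens = ["she", "saw", "it"]
print(with_char_context(tokens, 1, n_chars=3))
print(with_char_context(tokens, 1, n_chars=0))
```

Setting `n_chars=0` yields a context-free input, which makes it easy to compare the two settings on the same data.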

2019

Malaviya et al. use a NN model to jointly learn morphological tags and lemmas. They use an encoder-decoder model with a hard attention mechanism. In particular, they use a 2-layer LSTM for morphological tagging. For the lemmatizer, they use a 2-layer BiLSTM encoder and a 1-layer LSTM decoder. They compare their results with other state-of-the-art models: Lematus, UDPipe, Lemming, and Morfette.

Sample Code

# Source: https://timmccloud.net/blog-natural-language-processing/
# Accessed: 2019-10-11
# Requires the WordNet corpus: run nltk.download('wordnet') once beforehand.

from nltk.stem import WordNetLemmatizer
input_words = ['writing', 'calves', 'be', 'branded', 'horse', 'randomize', 
               'possibly', 'provision', 'hospital', 'kept', 'scratchy', 'code']
 
lemmatizer = WordNetLemmatizer()
 
lemmatizer_names = ['NOUN LEMMATIZER', 'VERB LEMMATIZER'] 
formatted_text = '{:>24}' * (len(lemmatizer_names) + 1) 
print('\n', formatted_text.format('INPUT WORD', *lemmatizer_names), 
      '\n', '='*75)
 
for word in input_words: 
    output = [word, lemmatizer.lemmatize(word, pos='n'), 
              lemmatizer.lemmatize(word, pos='v')] 
    print(formatted_text.format(*output))

# Source: https://stanfordnlp.github.io/stanfordnlp/lemma.html
# Accessed: 2019-10-11
# Requires the English models: run stanfordnlp.download('en') once beforehand.

import stanfordnlp
 
nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos,lemma')
doc = nlp("Barack Obama was born in Hawaii.")
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

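# Source: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
# Accessed: 2019-10-10
# Note: gensim.utils.lemmatize depends on the Pattern package and was removed
# in gensim 4.0, so this example needs gensim 3.x.

from gensim.utils import lemmatize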
 
sentence = "The striped bats were hanging on their feet and ate best fishes"
lemmatized_out = [wd.decode('utf-8').split('/')[0] for wd in lemmatize(sentence)]
#> ['striped', 'bat', 'be', 'hang', 'foot', 'eat', 'best', 'fish']

Cite As

Devopedia. 2019. "Lemmatization." Version 2, October 11. Accessed 2023-11-12. https://devopedia.org/lemmatization
