Stemming

Article Info

Contributed by
1 author

Last updated on
2019-09-28 18:08:50

Article Versions

3 2019-09-28 18:08:50
1646,1635 3,1646

By arvindpdmn

Clarifying compute of measure m
2 2019-09-25 05:04:48
1635,1627 2,1635

By arvindpdmn

Adding content. Possibly more milestones could be added.
1 2019-09-23 15:48:28
1,1627

By arvindpdmn

First version with some questions

Chat Room

Submitting ...

You are editing an existing chat message.

Consider different forms of a word, such as organize, organizes, and organizing. Consider also words that are closely related, such as democracy, democratic, and democratization. In many applications, it may be inefficient to handle all the variations individually. Stemming reduces them to a common form. Algorithms that do this are called stemmers. The output of a stemmer is called the stem, which is the root word.

Stemming may be seen as a crude heuristic process that simply chops off ends of words. Unlike lemmatization, stemming doesn't involve dictionary lookup or morphological analysis. It's not even required that the stem be a valid word or identical to its morphological root. The goal is to reduce related words to the same stem.

Discussion

What are some applications of stemming?
Variations of verb forms to include for better SEO. Source: Ives 2011, table 3.
Stemming is a pre-processing task that's done before other application-specific tasks are invoked. One application of stemming is to count the use of emotion words and perform basic sentiment analysis. When used with a dictionary or spell checker such as Hunspell, stemmers can be used to suggest corrections when wrong spelling is encountered.
One of the first applications of stemming was in Information Retrieval (IR). Searching with keyword "explosion" would fail to retrieve documents indexed by the word "explosives". Stemming solves this problem since indexing would be done using stem words.
Online search engines such as Google Search use stemming. An analysis from 2009 of Google Search showed that some suffixes such as '-s', '-ed', and '-ing' are considered to be strongly correlated with the stems. Suffixes '-able', '-tive', '-ly', and '-ness' are considered less correlated. For better SEO, forms that are poorly correlated could be added to improve search ranking. However, Google may penalize the page if it appears unnatural.
What are some essential terms concerning stemming?
Comparing derivational and inflectional morphemes. Source: Liberman and Prince 1998.
Stemming is based on the assumption that words have a structure, based on a root word and modifications of the root. The study of words and their parts is called morphology. In IR systems, given a word, stemming is really about finding morphological variants. The term conflation indicates the combining of variants to a common stem.
Words may contain prefixes and suffixes, which generally are called affixes. Stemming usually concerns itself with suffixes. Suffixes themselves are of two types:
- Inflectional: Form is varied to express some grammatical feature such as singular/plural or present/past/future tense. Inflections don't change part of speech or meaning. For example, 'boy' and 'boys'; 'big', 'bigger' and 'biggest'.
- Derivational: New forms are created from words. New ones have a different part of speech or meaning. For example, 'creation' from 'create'. They can occur between the stem and an inflectional suffix, such as 'governments', where '-ment' precedes '-s'. Another example is 'rationalizations', where '-al', '-iz' and '-ation' are derivational, and '-s' is inflectional.
What are the typical errors while stemming?
Search for 'Withings' wrongly shows items containing 'with'. Source: Larochelle 2014.
Errors occur because rules fail for some special cases. The worst error is when two different concepts conflate to the same stem. For example, when 'Withings' is stemmed to 'with' on a web portal, wrong items are presented to the user. Porter stemmer stems 'meanness' and 'meaning' to 'mean', though they relate to different concepts. On the contrary, 'goose' and 'geese' are equivalent but they're stemmed as 'goos' and 'gees' respectively.
Typical errors of stemming are the following:
- Overstemming: Happens when too much is removed. For example, 'wander' becomes 'wand'; 'news' becomes 'new'; or 'universal', 'universe', 'universities', and 'university' are all reduced to 'univers'. A better result would be 'univers' for the first two and 'universi' for the last two.
- Understemming: Happens when words are from the same root but are not seen that way. For example, 'knavish' remains as 'knavish'; 'data' and 'datum' become 'dat' and 'datu' although they're from the same root.
- Misstemming: Usually not a problem unless it leads for false conflations. For example, 'relativity' becomes 'relative'.
Could you give an overview of algorithms for stemming?
Types of stemming algorithms. Source: Singh and Gupta 2017, fig. 1.
Stemming algorithms broadly fall into one of two categories:
- Rule-based: Table lookup is a brute force approach where inflected or derived forms are mapped to stems. For Turkish that has lots of inflected forms, this will lead to large tables. Another approach is to apply a series of rules to strip affixes to get to the stem. Yet another approach is to use word morphology, including part of speech. All these approaches require particular knowledge of the language.
- Statistical: By training a model, suffix-stripping rules are implicitly derived. This is good for languages with complex morphology. No expert knowledge of the language is needed. Models can be trained from a lexicon, a corpus or character-based n-grams of the language's words.
Among the rule-based stemmers are Lovins stemmer, Dawson stemmer, Porter stemmer, Paice/Husk stemmer (aka Lancaster stemmer), Krovetz stemmer and Xerox stemmer. Porter stemmer in its Snowball implementation is commonly used. Lancaster stemmer is more aggressive, leading to overstemming.
Could you explain Porter's algorithm?
Computing word measure, resulting in m=4. Source: Snowball 2019a.
Porter stemmer applies a set of rules in five steps involving 51 suffixes and 60 rules.
Each rule is given by the form $(condition) \, S1 \to S2$, where S1 and S2 are suffixes. If the condition is matched or null, S1 is replaced with S2. When multiple rules are applicable in a given sub-step, only the rule with the longest S1 is obeyed. For example, step-1a has four rules with null condition:
$$SSES \to SS\\IES \to I\\SS \to SS\\S \to$$
In this case, 'caresses' becomes 'caress' since the first rule gives the longest match for S1; 'caress' remains caress (third rule); 'cares' becomes 'care' (fourth rule).
Porter also defined the generic structure of a word as $[C] (VC)^{m} [V]$, where [ ] denotes arbitrary presence, C is one or more consonants, V is one or more vowels and m is the word's measure. Measure is computed on what precedes the rule's suffix. We could use it to ignore rules when stem is too short. For example, the following rule changes 'agreed' to 'agree' (m=1) but retains 'feed' as 'feed' (m=0): $(m>0) \, EED \to EE$
What are the common approaches used by stochastic stemmers?
One approach is to segment words into roots and suffixes and selecting the best possible segmentation. This is based on counts of letter successor varieties. Minimum Description Length (MDL) stemmer is also about finding the optimal split between root and suffix. MDL was found to be computationally intensive and gave 83% accuracy for both English and French.
Another approach is to discover suffixes based on their frequencies. Frequencies are adjusted for partial suffixes so that '-ng' doesn't include '-ing'.
Hidden Markov Model (HMM) has been applied to stemming. Word letters are considered as hidden states and the transition from a root state to a suffix state is taken as the split point.
Yet Another Suffix Stripper (YASS) treats stemming as a word clustering problem. Two words having long common prefix are considered similar. Threshold value must be selected carefully to produce meaningful clusters. YASS was shown to perform similar to rule-based stemmers.
Other approaches make use of graphs with weighted edges, word co-occurrences, distribution similarity, and queries along with context.
How is the performance of stemming?
Algorithmic stemmers are typically fast. For example, a million words can be stemmed in 6 seconds on a 500 MHz personal computer. It's more efficient not to use a dictionary. Rather than complicate the algorithm, it's simpler to ignore irregular forms.
Lemmatization uses a dictionary such as WordNet to get to the correct base forms. However, lemmatization has its limitations too. For example, 'happiness' and 'happy' are left unchanged while the Porter stemmer stems them correctly to 'happi'.
In IR systems, stemming reduces storage requirements by over 50% since we store stems rather than full terms.
What are some tools to do stemming?
For a quick demo of stemming, online tools are available:
- JavaScript-based Porter Stemmer: Highlights removed suffixes.
- JavaScript-based Snowball Stemmer: Includes open source code and unit tests.
- NLTK-based Stemmers: Two sources are text-processing.com and TextAnalysisOnline. They support Porter, Lancaster and Snowball stemmers. These tools also include lemmatization.
For developers, R has the corpus package for stemming. This is based on Snowball. This has support for multiple natural languages. In Python, NLTK and TextBlob are two packages that support stemming.
Martin Porter has shared a list of many language implementations of the Porter stemmer. For test purpose, he also shares a sample vocabulary and the expected output.

Milestones

Jun
1968

Julie Beth Lovins of MIT publishes details of one of the earliest stemmers in the context of information retrieval. The system has 10,000 documents in the field of materials science and engineering. The algorithm has two phases: (a) remove longest possible ending to obtain the stem; (b) handle spelling exceptions. For example, 'absorption' becomes 'absorpt' and 'absorbing' becomes 'absorb' in phase-1; phase-2 matches 'absorpt' and 'absorb' via recoding.

1980

Martin Porter invents an algorithmic stemmer based on rules for suffix stripping. The algorithm runs in five steps. He finds that in a vocabulary of 10,000 words the stemmer gives a size reduction of 33%, with most reduction coming from step-1 that deals with plurals and past participles. The stemmer is implemented in BCPL. This later becomes known as Porter stemmer.

1990

Chris Paice at Lancaster University invents a new stemmer that iteratively removes suffixes based on matching rules. Some rules replaces suffixes so that spelling exceptions don't occur. This stemmer is usually called Paice/Husk stemmer or Lancaster stemmer.

1994

Stemmers are often evaluated based on their effectiveness towards information retrieval. This however does not help in improving the stemmer. To solve this, Chris Paice proposes using predefined concept groups with sample words. The method counts overstemming and understemming errors. It then computes the metrics overstemming index (OI), understemming index (UI), stemming weight and error rate relative to truncation (ERRT). These help us compare the performance of different stemmers.

2001

Martin Porter finds that many implementations of Porter stemmer contain errors. He also notes that the Kraaij-Pohlmann stemmer for Dutch is released as open source ANSI C software but lacks a description of the algorithm. To solve these issues, Porter invents Snowball, a system in which rules can be expressed in a natural way. Then a compiler transforms the description into C code. The Snowball implementation for English is also called Porter2 stemmer. By 2019, Snowball supports more than a dozen natural languages with code generated in C, Java, Python, Go, Rust, C#, and more.

2007

State transition for the Fairweather stemmer. Source: Mitosystems 2014b, fig. 1.

John Fairweather files a patent application that treats stemming as a shortest-path problem. Words are modelled in the form [Prefix] Root {[Infix] Root} [Suffix], where [ ] is zero or more occurrences, { } is one or more occurrences. Cost of each path is computed from word component and number of characters in it. The algorithm is language independent.

2009

Ahmet Uyar of Mersin University, Turkey, investigates stemming mechanisms used by Google. Using 18,000 different words, he finds that Google uses a document-based algorithm. Documents are indexed based on word forms that are semantically strongly correlated. This includes singular and plural forms but rarely suffixes of the type '-able' or '-tively'.

Sample Code

R
python

# Source: https://cran.r-project.org/web/packages/corpus/vignettes/stemmer.html
# Accessed: 2019-09-24
 
library(corpus)
 
# A simple example
text <- "love loving lovingly loved lover lovely love"
text_tokens(text, stemmer = "en") # english stemmer
 
 
# Emotion word usage in the text of Wizard of Oz
data <- gutenberg_corpus(55, verbose = FALSE)
text_filter(data)$stemmer <-
    with(affect_wordnet,
        new_stemmer(term, interaction(category, emotion),
                    default = NA, duplicates = "omit"))
print(term_stats(data), -1)
text_sample(data, "Joy.Positive")

# Source: http://jonathansoma.com/lede/algorithms-2017/classes/more-text-analysis/counting-and-stemming/
# Accessed: 2019-09-24
 
from nltk.stem.porter import PorterStemmer
from textblob import TextBlob
 
# Use TextBlob
def textblob_tokenizer(str_input):
    blob = TextBlob(str_input.lower())
    tokens = blob.words
    words = [token.stem() for token in tokens]
    return words
 
# Use NLTK's PorterStemmer
def stemming_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

References

Article Stats

1918

Words

Authors

Edits

Chats

Likes

10K

Hits

Cite As

Devopedia. 2019. "Stemming." Version 3, September 28. Accessed 2023-11-12. https://devopedia.org/stemming

Contributed by
1 author

Last updated on
2019-09-28 18:08:50

algorithms natural language processing linguistics

Stemming

Discussion

Milestones

Sample Code

References

Further Reading

Article Stats

Cite As

See Also

Stemming

Discussion

Milestones

Sample Code

References

Further Reading

Article Stats

Author-wise Stats for Article Edits

Cite As

See Also

Login