N-Gram Model
Summary
Given a sequence of N-1 words, an N-gram model predicts the most probable word that might follow this sequence. It's a probabilistic model that's trained on a corpus of text. Such a model is useful in many NLP applications including speech recognition, machine translation and predictive text input.
An N-gram model is built by counting how often word sequences occur in corpus text and then estimating the probabilities. Since a simple N-gram model has limitations, improvements are often made via smoothing, interpolation and backoff.
An N-gram model is one type of Language Model (LM), which estimates a probability distribution over sequences of words.
Discussion
Could you explain N-gram models with an example?
Consider two sentences: "There was heavy rain" vs. "There was heavy flood". From experience, we know that the former sentence sounds better. An N-gram model will tell us that "heavy rain" occurs much more often than "heavy flood" in the training corpus. Thus, the first sentence is more probable and will be selected by the model.
A model that simply relies on how often a word occurs, without looking at previous words, is called a unigram model. If a model considers only the previous word to predict the current word, then it's a bigram model. If two previous words are considered, then it's a trigram model.
An n-gram model for the above example would calculate the following probability:
\(P(\text{There was heavy rain}) = P(\text{There}, \text{was}, \text{heavy}, \text{rain}) = P(\text{There})\,P(\text{was} \mid \text{There})\,P(\text{heavy} \mid \text{There was})\,P(\text{rain} \mid \text{There was heavy})\)
Since it's impractical to calculate such conditional probabilities over ever-longer histories, we apply the Markov assumption and approximate this with a bigram model:
\(P(\text{There was heavy rain}) \approx P(\text{There})\,P(\text{was} \mid \text{There})\,P(\text{heavy} \mid \text{was})\,P(\text{rain} \mid \text{heavy})\)
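To make this concrete, here's a minimal Python sketch (an illustration, not from any particular toolkit) that estimates bigram probabilities by counting word pairs in a toy corpus:

```python
from collections import Counter

def bigram_prob(corpus_sentences):
    """Estimate P(w2 | w1) by maximum likelihood: count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        tokens = ['<s>'] + sentence.lower().split() + ['</s>']
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    def prob(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return prob

corpus = ["there was heavy rain",
          "there was heavy rain and wind",
          "there was heavy flood"]
p = bigram_prob(corpus)
print(p('heavy', 'rain'))   # 0.667: 'rain' follows 'heavy' in 2 of 3 cases
print(p('heavy', 'flood'))  # 0.333
```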
What are typical applications of N-gram models?
In speech recognition, input may be noisy and this can lead to wrong speech-to-text conversions. N-gram models can correct this based on their knowledge of the probabilities. Likewise, N-gram models are used in machine translation to produce more natural sentences in the target language.
When correcting for spelling errors, sometimes dictionary lookups will not help. For example, in the phrase "in about fifteen minuets", the word 'minuets' is a valid dictionary word but it's incorrect in this context. N-gram models can correct such errors.
N-gram models are usually applied at the word level. They have also been used at the character level to do stemming, that is, to separate the root word from the suffix. By looking at N-gram statistics, we can also classify languages or differentiate between US and UK spellings. For example, 'sz' is common in Polish and Hungarian; 'gb' and 'kp' are common in Igbo.
In general, many NLP applications benefit from N-gram models including part-of-speech tagging, natural language generation, word similarity, sentiment extraction and predictive text input.
How do we evaluate an N-gram model?
The best way to evaluate a model is to check how well it predicts in end-to-end application testing. This approach is called extrinsic evaluation, but it's time consuming and expensive. An alternative is to define a suitable metric and evaluate the model independently of the application. This is called intrinsic evaluation. It doesn't guarantee application performance, but it's a quick first step to check algorithmic performance.
A common metric is perplexity, often written as PP. Given a test set \(W = w_1 w_2 \dots w_N\), \(PP(W) = P(w_1 w_2 \dots w_N)^{-1/N}\). Because of the inverse relationship with probability, minimizing perplexity implies maximizing the test set probability. Perplexity can also be related to the concept of entropy in information theory.
It's important in any N-gram model to include markers at the start and end of sentences. This ensures that the total probability of the whole language sums to one. Note that calculations should include the end markers, but not the start markers, in the count of word tokens.
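As an illustration of the metric, the following Python sketch (assuming a bigram probability function like the one sketched earlier) computes perplexity in log space, counting end markers but not start markers:

```python
import math

def perplexity(test_tokens, bigram_p):
    """PP(W) = P(w1 ... wN) ** (-1/N), computed in log space to avoid underflow."""
    tokens = ['<s>'] + test_tokens + ['</s>']
    log_prob = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob = bigram_p(w1, w2)
        if prob == 0.0:
            return float('inf')   # an unseen bigram makes perplexity infinite without smoothing
        log_prob += math.log(prob)
    n = len(tokens) - 1           # count the </s> marker but not <s>
    return math.exp(-log_prob / n)

# Usage, assuming the bigram_prob() sketch shown earlier:
# print(perplexity("there was heavy rain".split(), bigram_prob(corpus)))
```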
What is smoothing in the context of N-gram modelling?
It's quite possible that some word sequences occur in test data that were never seen during training. When this happens, the model assigns the sequence a probability of zero. Evaluation also becomes difficult since the perplexity metric goes to infinity.
The usual way to solve this is to give non-zero counts to N-grams that are seen in testing but not in training. We can't simply add 1 to all the zero counts, since the overall probability distribution would no longer be normalized. Instead, we remove some probability mass from the non-zero counts (called discounting) and redistribute it to the zero counts. The overall process is called smoothing.
The simplest technique is Laplace Smoothing, where we add 1 to all counts, including the non-zero ones. An alternative is to add k, with k tuned on held-out data. Other techniques include Good-Turing Discounting, Witten-Bell Discounting, and Kneser-Ney Smoothing. All of these try to estimate the count of things never seen from the count of things seen once. Kneser-Ney Smoothing provides a good baseline and is based on absolute discounting.
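A minimal sketch of add-k smoothing (Laplace when k=1) for bigram estimates; the toy corpus and variable names are illustrative:

```python
from collections import Counter

def addk_bigram_prob(corpus_sentences, k=1.0):
    """Add-k smoothed estimate: P(w2|w1) = (count(w1 w2) + k) / (count(w1) + k*V)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        tokens = ['<s>'] + sentence.lower().split() + ['</s>']
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)        # V: number of distinct word types seen in training
    def prob(w1, w2):
        return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size)
    return prob

p = addk_bigram_prob(["there was heavy rain", "there was heavy flood"])
print(p('heavy', 'rain'))    # discounted below the unsmoothed 0.5
print(p('heavy', 'snow'))    # unseen bigram now gets a small non-zero probability
```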
How do backoff and interpolation techniques help N-gram models?
Smoothing solves the zero-count problem, but there are further techniques that help us better estimate the probabilities of unseen N-gram sequences.
Suppose we want the trigram probability of a word sequence whose trigram never occurs in the training data. We can estimate it using the bigram probability. If that's also unavailable, we use the unigram probability. This technique is called backoff. A popular variant is Katz Backoff.
Interpolation is another technique in which we estimate an n-gram probability as a linear combination of all lower-order probabilities. For instance, a 4-gram probability can be estimated using a combination of trigram, bigram and unigram probabilities. The weights with which these are combined can be estimated by reserving some part of the corpus for this purpose.
While backoff considers each lower order one at a time, interpolation considers all the lower order probabilities together.
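A minimal sketch of simple linear interpolation for a trigram model; the fixed lambda weights are illustrative and would normally be tuned on held-out data:

```python
def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation: P(w3|w1,w2) ~ l1*P(w3) + l2*P(w3|w2) + l3*P(w3|w1,w2)."""
    l1, l2, l3 = lambdas              # weights sum to 1
    def prob(w1, w2, w3):
        return l1 * p_uni(w3) + l2 * p_bi(w2, w3) + l3 * p_tri(w1, w2, w3)
    return prob

# Usage: pass unigram, bigram and trigram estimators built from counts, e.g.
# p = interpolated_prob(p_uni, p_bi, p_tri)
# p('there', 'was', 'heavy')
```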
What are some limitations of N-gram models?
A model trained on the works of Shakespeare will not give good predictions when applied to another genre. We therefore need to ensure that the training corpus looks similar to the test corpus.
There's also the problem of Out of Vocabulary (OOV) words. These are words that appear during testing but not in training. One way to handle this is to start with a fixed vocabulary and convert OOV words in the training data to the UNK pseudo-word.
In one study on sentiment analysis, a bigram model outperformed a unigram model but the number of features doubled. Thus, scaling N-gram models to larger datasets or moving to a higher N requires good feature selection techniques.
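A minimal sketch of the UNK approach, with an illustrative frequency threshold for fixing the vocabulary:

```python
from collections import Counter

def replace_oov(sentences, min_count=2, unk='<UNK>'):
    """Fix a vocabulary from frequent training words; map everything else to the UNK pseudo-word."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    vocab = {w for w, c in counts.items() if c >= min_count}
    mapped = [[w if w in vocab else unk for w in s.lower().split()] for s in sentences]
    return mapped, vocab

train = ["there was heavy rain", "there was heavy flood", "heavy rain again"]
mapped, vocab = replace_oov(train)
print(mapped[1])    # ['there', 'was', 'heavy', '<UNK>'] since 'flood' is rare
```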
N-gram models poorly capture longer-distance context. It's been shown that beyond 6-grams, performance gains are limited. Other language models such as cache LMs, topic-based LMs and latent semantic indexing do better.
What software tools are available to do N-gram modelling?
R has a few useful packages including ngram, tm, tau and RWeka. The tidytext package also has functions to do N-gram analysis.
In Python, NLTK has the function nltk.util.ngrams(). A more comprehensive package is nltk.lm. Outside NLTK, the ngram package can compute n-gram string similarity.
Written in C++ and open sourced, SRILM is a useful toolkit for building language models. This includes ngram-format, for reading and writing N-gram models in the popular ARPA backoff format, which was invented by Doug Paul at MIT Lincoln Labs.
A demo of an N-gram predictive model implemented in R Shiny can be tried out online.
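As a quick example of the NLTK function mentioned above (a minimal sketch, assuming the nltk package is installed):

```python
from nltk.util import ngrams

tokens = "there was heavy rain".split()
print(list(ngrams(tokens, 2)))
# [('there', 'was'), ('was', 'heavy'), ('heavy', 'rain')]
```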
Ngram Viewer is a useful research tool by Google. It's based on material collected for Google Books. Given a sequence of words, it shows how the N-gram counts have changed over the years.
Milestones
1948
Claude Shannon's "A Mathematical Theory of Communication" uses n-gram (bigram and trigram) approximations to model English text, laying the groundwork for statistical language modelling.
1980
Although smoothing techniques can be traced back to Lidstone (1920), or even earlier to Laplace (18th century), an early application of smoothing to n-gram models for NLP is by Jelinek and Mercer (1980). A better smoothing technique is due to Katz (1987). More smoothing techniques are proposed in the 1990s.
N-gram models look at the preceding (n-1) words but for larger n, there's a data sparsity problem. Huang et al. propose a skipping n-gram model in which some preceding words may be ignored or skipped. For example, in the phrase "Show John a good time", the last word would be predicted based on P(time|Show __ a good) rather than P(time|Show John a good). Many such skipping models are proposed through the 1990s.
1995
Kneser and Ney propose improved backoff and discounting. It's based on the concept of absolute discounting, in which a small constant is removed from all non-zero counts. Kneser-Ney Smoothing improves on absolute discounting by estimating the count of a word in a new context based on the number of different contexts in which the word has already appeared.
1999
Sample Code
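A hedged sketch of building a bigram language model with NLTK's nltk.lm module (mentioned above); the toy corpus is illustrative and the API calls reflect recent NLTK versions, so treat the details as assumptions:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy training corpus: a list of tokenized sentences.
text = [['there', 'was', 'heavy', 'rain'],
        ['there', 'was', 'heavy', 'rain', 'and', 'wind'],
        ['there', 'was', 'heavy', 'flood']]

n = 2
train_data, vocab = padded_everygram_pipeline(n, text)   # pads sentences and builds n-grams
lm = MLE(n)                                              # unsmoothed maximum-likelihood model
lm.fit(train_data, vocab)

print(lm.score('rain', ['heavy']))            # P(rain | heavy) = 2/3 on this toy corpus
print(lm.generate(4, text_seed=['there']))    # sample a few words following 'there'
```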
References
- Ablimit, Mijit, Tatsuya Kawahara, and Askar Hamdulla. 2015. "Lexicon Optimization Approaches for Language Models of Agglutinative Language." SlidePlayer. Accessed 2019-09-25.
- Chen, Stanley F. and Joshua Goodman. 1998. "An Empirical Study of Smoothing Techniques for Language Modeling." Harvard Computer Science Group Technical Report TR-10-98. Accessed 2019-09-26.
- Franz, Alex and Thorsten Brants. 2006. "All Our N-gram are Belong to You." Google AI Blog, August 03. Accessed 2019-09-26.
- Ganapathiraju, M., D. Weisser, R. Rosenfeld, J. Carbonell, R. Reddy, and J. Klein-Seetharaman. 2002. "Comparative n-gram analysis of whole-genome protein sequences." Proceedings of HLT 2002, Second International Conference on Human Language Technology Research, M. Marcus, ed., Morgan Kaufmann, San Francisco. Accessed 2019-09-26.
- Goodman, Joshua. 2001. "A Bit of Progress in Language Modeling." arXiv, v1, August 9. Accessed 2020-01-25.
- Jelinek, F., B. Merialdo, S. Roukos, and M. Strauss. 1991. "A Dynamic Language Model for Speech Recognition." Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California, pp. 293-295, February 19-22. Accessed 2019-09-26.
- Jurafsky, Daniel. 2019. "Introduction to N grams." Via André Ribeiro Miranda, on YouTube, February 24. Accessed 2023-03-01.
- Jurafsky, Daniel and James H. Martin. 2009. "N-grams." Chapter 4 in: Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-09-07.
- Machine Learning TV. 2018. "NLP: Understanding the N-gram language models." Machine Learning TV, on YouTube, August 12. Accessed 2019-09-25.
- Mayfield, James, and Paul McNamee. 2003. "Single n-gram stemming." Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 415-416, Toronto, Canada, July 28 - August 01. Accessed 2019-09-25.
- Mehmood, Arshad. 2019. "Generate Unigrams Bigrams Trigrams Ngrams Etc In Python." March 19. Accessed 2019-09-26.
- NGram. 2017. "NGram Module Documentation." v3.3.2, via Python Hosted, June 20. Accessed 2019-09-25.
- Ogbuji, Uche. 2018. "Word analysis and N-grams in a variety of practical applications." IBM Developer, April 18. Accessed 2019-09-25.
- Osmanbeyoglu, Hatice Ulku and Madhavi K Ganapathiraju. 2011. "N-gram analysis of 970 microbial organisms reveals presence of biological language models." BMC Bioinformatics, vol. 12, no. 12. Accessed 2019-09-26.
- Potapenko, Anna. 2018. "Natural Language Processing." National Research University, Higher School of Economics, via Coursera. Accessed 2019-09-25.
- SRILM Docs. 2019. "Version History." SRI International, September 09. Accessed 2019-09-26.
- SRILM Manpage. 2004. "ngram-format." SRI International. Accessed 2019-09-26.
- Schmidt, Drew and Christian Heckendorf. 2017. "Guide to the ngram Package." V3.0.4, CRAN, November 17. Accessed 2019-09-25.
- Shannon, C. E. 1948. "A Mathematical Theory of Communication." The Bell System Technical Journal, vol. 27, pp. 379–423, July. Accessed 2019-09-26.
- Silge, Julia and David Robinson. 2019. "Relationships between words: n-grams and correlations." Chapter 4 In: Text Mining with R, A Tidy Approach, February 02. Accessed 2019-09-26.
- Sookocheff, Kevin. 2015. "Modeling Natural Language with N-Gram Models." July 25. Accessed 2019-09-25.
- Stolcke, Andreas. 2002. "SRILM--An Extensible Language Modeling Toolkit." Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002). Accessed 2019-09-26.
- Thiel, Kilian. 2016. "Sentiment Analysis with N-grams." Blog, KNIME, January 08. Accessed 2019-09-25.
- Yeung, Albert Au. 2018. "Generating N-grams from Sentences Python." June 03. Accessed 2019-09-25.
Further Reading
- Jurafsky, Daniel and James H. Martin. 2009. "N-grams." Chapter 4 in: Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-09-07.
- Ogbuji, Uche. 2018. "Word analysis and N-grams in a variety of practical applications." IBM Developer, April 18. Accessed 2019-09-25.
- Yeung, Albert Au. 2018. "Generating N-grams from Sentences Python." June 03. Accessed 2019-09-25.
- Mariño, José B., Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. 2006. "N-gram-based Machine Translation." J. Computational Linguistics, MIT Press, vol. 32, no. 4, pp. 527-549, December. Accessed 2019-09-26.
- Wang, Xuerui, Andrew McCallum, and Xing Wei. 2007. "Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval." Seventh IEEE International Conference on Data Mining (ICDM 2007), October 28-31. Accessed 2019-09-26.
See Also
- Language Modelling
- Hidden Markov Model
- Text Corpus for NLP
- Kneser-Ney Smoothing
- Information Theory
- Katz Backoff