N-Gram Model

Given a sequence of N-1 words, an N-gram model predicts the most probable word that might follow this sequence. It's a probabilistic model that's trained on a corpus of text. Such a model is useful in many NLP applications including speech recognition, machine translation and predictive text input.

An N-gram model is built by counting how often word sequences occur in corpus text and then estimating the probabilities. Since a simple N-gram model has limitations, improvements are often made via smoothing, interpolation and backoff.

An N-gram model is one type of a Language Model (LM), which is about finding the probability distribution over word sequences.

Discussion

• Could you explain N-gram models with an example?

Consider two sentences: "There was heavy rain" vs. "There was heavy flood". From experience, we know that the former sentence sounds better. An N-gram model will tell us that "heavy rain" occurs much more often than "heavy flood" in the training corpus. Thus, the first sentence is more probable and will be selected by the model.

A model that simply relies on how often a word occurs without looking at previous words is called unigram. If a model considers only the previous word to predict the current word, then it's called bigram. If two previous words are considered, then it's a trigram model.

An n-gram model for the above example would calculate the following probability:

P('There was heavy rain') = P('There', 'was', 'heavy', 'rain') = P('There')P('was'|'There')P('heavy'|'There was')P('rain'|'There was heavy')

Since it's impractical to calculate these conditional probabilities, using Markov assumption, we approximate this to a bigram model:

P('There was heavy rain') ~ P('There')P('was'|'There')P('heavy'|'was')P('rain'|'heavy')

• What are typical applications of N-gram models?

In speech recognition, input may be noisy and this can lead to wrong speech-to-text conversions. N-gram models can correct this based on their knowledge of the probabilities. Likewise, N-gram models are used in machine translation to produce more natural sentences in the target language.

When correcting for spelling errors, sometimes dictionary lookups will not help. For example, in the phrase "in about fifteen mineuts" the word 'minuets' is a valid dictionary word but it's incorrect in this context. N-gram models can correct such errors.

N-gram models are usually at word level. It's also been used at character level to do stemming, that is, separate the root word from the suffix. By looking at N-gram statistics, we could also classify languages or differentiate between US and UK spellings. For example, 'sz' is common in Czech; 'gb' and 'kp' are common in Igbo.

In general, many NLP applications benefit from N-gram models including part-of-speech tagging, natural language generation, word similarity, sentiment extraction and predictive text input.

• How do we evaluate an N-gram model?

The best way to evaluate a model is to check how well it predicts in end-to-end application testing. This approach is called extrinsic evaluation but it's time consuming and expensive. An alternative approach is to define a suitable metric and evaluate independent of the application. This is called intrinsic evaluation. This doesn't guarantee application performance but it's a quick first step to check algorithmic performance.

A common metric is to use perplexity, often written as PP. Given a test set $$W = w_1 w_2 \dots w_n$$, $$PP(W) = P(w_1 w_2 \dots w_n)^{-1/N}$$. Because of the inverse relationship with probability, minimizing perplexity implies maximizing the test set probability. Perplexity can also be related to the concept of entropy in information theory.

It's important in any N-gram model to include markers at start and end of sentences. This ensures that the total probability of the whole language sums to one. However, all calculations should include the end markers but not the start markers in the count of word tokens.

• What is smoothing in the context of N-gram modelling?

It's quite possible that some word sequences occur in test data that were never seen during training. When this happens, the probability of the sequence equals zero. Evaluation is also difficult since perplexity metric becomes infinite.

The usual way to solve this is to give non-zero counts to N-grams that are seen in testing but not in training. We can't just add 1 to all the zero counts since the overall probability distribution will not be normalized. Instead, we remove some probability mass from non-zero counts (called discounting) and add them to the zero counts. The overall process is called smoothing.

The simplest technique is Laplace Smoothing where we add 1 to all counts including non-zero counts. An alternative is to add k, with k tuned using test data. Other techniques include Good-Turing Discounting, Witten-Bell Discounting, and Kneser-Ney Smoothing. All of these try to estimate the count of things never seen based on count of things seen once. Kneser-Ney Smoothing provides a good baseline and it's based on absolute discounting.

• How do backoff and interpolation techniques help N-gram models?

Smoothing solves the zero-count problem but there are techniques to help us better estimate the probabilities of unseen n-gram sequences.

Suppose we want to get trigram probability of a certain word sequence that never occurs. We can estimate this using the bigram probability. If the latter is also not possible, we use unigram probability. This technique is called backoff. One such technique that's popular is called Katz Backoff.

Interpolation is another technique in which we can estimate an n-gram probability based on a linear combination of all lower-order probabilities. For instance, a 4-gram probability can be estimated using a combination of trigram, bigram and unigram probabilities. The weights in which these are combined can also be estimated by reserving some part of the corpus for this purpose.

While backoff considers each lower order one at a time, interpolation considers all the lower order probabilities together.

• What are some limitations of N-gram models?

A model trained on the works of Shakespeare will not give good predictions when applied to another genre. We need to therefore ensure that the training corpus looks similar to the test corpus.

There's also the problem of Out of Vocabulary (OOV) words. These are words that appear during testing but not in training. One way to solve this is to start with a fixed vocabulary and convert OOV words in training to UNK pseudo-word.

In one study, when applied to sentiment analysis, a bigram model outperformed a unigram model but the number of features doubled. Thus, scaling N-gram models to larger datasets or moving to a higher N needs good feature selection techniques.

N-gram models poorly capture longer-distance context. It's been shown that after 6-grams, performance gains are limited. Other language models such cache LM, topic-based LM and latent semantic indexing do better.

• What software tools are available to do N-gram modelling?

R has a few useful packages including ngram, tm, tau and RWeka. Package tidytext has functions to do N-gram analysis.

In Python, NTLK has the function nltk.utils.ngrams(). A more comprehensive package is nltk.lm. Outside NLTK, the ngram package can compute n-gram string similarity.

Written in C++ and open sourced, SRILM is a useful toolkit for building language models. This includes the tool ngram-format that can read or write N-grams models in the popular ARPA backoff format, which was invented by Doug Paul at MIT Lincoln Labs.

A demo of an N-gram predictive model implemented in R Shiny can be tried out online.

Ngram Viewer is a useful research tool by Google. It's based on material collected for Google Books. Given a sequence of words, it shows how the N-gram counts have changed over the years.

Milestones

Jul
1948

In his paper titled A Mathematical Theory of Communication, Claude Shannon describes an example in which the next letter depends on the previous one based on defined probabilities. This is an application of Markov process to natural languages.

1980

Although smoothing techniques can be traced back to Lidstone (1920), or even earlier to Laplace (18th century), an early application of smoothing to n-gram models for NLP is by Jelinek and Mercer (1980). A better smoothing technique is due to Katz (1987). More smoothing techniques are proposed in the 1990s.

1991

Jelinek and team at the IBM Research Division adapt a trigram LM to match the current document better. By caching a recent history of words, they propose Cache Trigram Language Model (CTLM). They show that for long text of 500-800 words, there's a drop in error rate by about 24%.

1993

N-gram models look at the preceding (n-1) words but for larger n, there's a data sparsity problem. Huang et al. propose a skipping n-gram model in which some preceding words may be ignored or skipped. For example, in the phrase "Show John a good time", the last word would be predicted based on P(time|Show __ a good) rather than P(time|Show John a good). Many such skipping models are proposed through the 1990s.

1995

Kneser and Ney propose improved backoff and discounting. It's based on the concept of absolute discounting in which a small constant is removed from all non-zero counts. Kneser-Ney Smoothing improves on absolute discounting by estimating the count of a word in a new context based on the number of different contexts in which the word has already appeared.

1998

Chen and Goodman at Harvard University give an empirical comparison of different smoothing techniques. Other papers from them around the late 1990s become influential in this area of research. Due to their work, Interpolated Kneser-Ney has become a popular language model.

Jul
1999

Version 0.99 of the SRILM toolkit is open sourced for public use. Earlier versions of SRILM can be traced back to 1995. Version 1.00 is released in June 2000. Version 1.7.3 comes out in September 2019.

2002

Researchers at Carnegie Mellon University apply N-grams to model whole-genome protein sequences. Their research triggers further application of N-gram models to computational biology.

Sep
2006

For the benefit of research in linguistics, Google releases a dataset of counts of about a billion five-word sequences that appear at least 40 times from a text of trillion words. The dataset becomes available via the Linguistic Data Consortium (LDC).

Sample Code

• # Source: http://www.albertauyeung.com/post/generating-ngrams-python/
# Accessed: 2019-09-26

import re
from nltk.util import ngrams

s = "Natural-language processing (NLP) is an area of computer science " \
"and artificial intelligence concerned with the interactions " \
"between computers and human (natural) languages."

s = s.lower()
s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
tokens = [token for token in s.split(" ") if token != ""]
output = list(ngrams(tokens, 5))

Author
No. of Edits
No. of Chats
DevCoins
7
0
1665
1716
Words
9
Likes
29448
Hits

Cite As

Devopedia. 2021. "N-Gram Model." Version 7, March 29. Accessed 2022-04-25. https://devopedia.org/n-gram-model
Contributed by
1 author

Last updated on
2021-03-29 03:16:46
• Site Map