# Language Modelling

Consider the phrase "I am going ___". If we analyse large amounts of English text, the missing word is more likely to be 'home' than 'house'. This suggests that we can assign probabilities to sequences of words.

A Language Model (LM) captures the probability of a sequence of words in the language. Equivalently, it tells us how likely a given word is to follow a sequence of words.

Traditionally, N-gram models and their variants were used as language models. Since the early 2010s, Neural Language Models (NLMs) have been an active research area. By 2019, pre-trained LMs were being used for transfer learning to improve the performance of many downstream NLP tasks.

## Discussion

• Why do we need a language model when languages have well-defined grammar?

While natural languages have grammar, they also have huge vocabularies. There's also considerable flexibility in how words can be combined. Ambiguities occur naturally in human communication. Usage also changes with time. The end result is that grammatical rules and syntactic structures can't be specified for all use cases.

Language models therefore attempt to learn the "structure" of the language by analysing vast amounts of text. The approach is statistical rather than being based on brittle rules. In some sense, language models capture language syntax, semantics and even common sense knowledge.

While we commonly speak of a word sequence for language modelling, we can abstract the idea to a sequence of tokens. A token could be a sentence, a phrase, a word, an n-gram, a morpheme, or a letter. Where lemmatization is used, multiple surface forms are reduced to a single token. An LM specifies its method of tokenization.

• What are some applications of language models?

Language models are useful in applications that deal with text generation. Some examples are optical character recognition, handwriting recognition, speech recognition, machine translation, spelling correction, image captioning, and text summarization.

In speech recognition, the phrases "no eye deer" and "no idea" may sound similar. In spelling correction, "fill the from" may show no spelling errors but in fact "fill the form" is the correct phrase. In both these examples, an LM selects the more probable phrase. In machine translation, an LM can tell that "high winds tonight" is a better translation than "large winds tonight".

Gmail's Smart Reply, which suggests short responses to emails, relies on an LM. In healthcare, LMs were shown to obtain generic representations from data of all patients. This led to better prediction models. LMs have been used for paraphrasing text.

Language modelling tasks themselves (such as predicting a word given surrounding words) have been useful in obtaining efficient word embeddings. These embeddings help represent words at the input of a neural network model.

• Which are the main approaches to language modelling?

There are two broad categories of language models:

• Count-based: These are traditional statistical models such as n-gram models. Word co-occurrences are counted to estimate probabilities. Variants of n-gram models have also been proposed. Clustering models attempt to exploit similarities among words. Caching models exploit the fact that once a word is used, it's likely to appear again in that text. Sentence mixture models build different models for different sentence types.
• Continuous-space: These use neural networks. They use word embeddings as dense word representations in a real-valued vector space. Words that are semantically similar are typically close together in the vector space. Such embeddings solve the data sparsity problem of n-gram models. Unlike n-gram models, these scale well as vocabulary size increases.

• Could you briefly describe n-gram models?

Suppose a sentence S has N words, $$w_1...w_N$$. Since language modelling is about finding the probability of S, this is a joint probability over all N words. Typically, this is decomposed into a product of conditional probabilities $$P(S)=\prod_{i=1}^N P(w_i|w_1,...,w_{i-1})$$, where each term is the probability of a word given the previous words in the sentence.

To simplify the problem, we apply the Markov assumption. This is an approximation in which only some recent words matter. For a bigram model, a word is predicted based on only the preceding word. For an n-gram model, only the preceding (n-1) words are considered. For instance, given a bigram model for the phrase "the cat sat on the mat", $$P(S)=P(the)\cdot P(cat|the)\cdot P(sat|cat)\cdot P(on|sat)\cdot P(the|on)\cdot P(mat|the)$$. We can get these probabilities by counting word co-occurrences. For example, $$P(cat|the)=P(the\,cat)/P(the)$$.
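The bigram estimation above can be sketched in a few lines of Python. This is a minimal illustration using maximum likelihood counts only; a real model would add sentence start/end markers and handle unseen pairs.

```python
from collections import Counter

def bigram_probs(tokens):
    """Estimate bigram probabilities P(w2|w1) by maximum likelihood,
    i.e. count(w1 w2) / count(w1)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

tokens = "the cat sat on the mat".split()
probs = bigram_probs(tokens)
# probs[("the", "cat")] == 0.5, since count("the cat")=1 and count("the")=2
```

Any bigram absent from the training text is simply missing from this table, which is exactly the data sparsity problem discussed below.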

One problem with n-gram models is data sparsity: word sequences not seen in training may be encountered in real applications, leading to zero probability. Techniques to solve this problem include smoothing, backoff and interpolation.
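As an illustration of the simplest smoothing technique, add-one (Laplace) smoothing gives every unseen bigram a small non-zero probability. This is a minimal sketch; practical models use stronger methods such as Kneser-Ney.

```python
from collections import Counter

def laplace_bigram(tokens, vocab):
    """Add-one (Laplace) smoothed bigram probability:
    P(w2|w1) = (count(w1 w2) + 1) / (count(w1) + |V|).
    Unseen bigrams get a small non-zero probability instead of zero."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(vocab)
    def prob(w1, w2):
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
    return prob

tokens = "the cat sat on the mat".split()
prob = laplace_bigram(tokens, set(tokens))
# seen: prob("the", "cat") == 2/7; unseen: prob("the", "dog") == 1/7
```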

Typically, 5-gram models are a compromise between computational complexity and performance.

• How can I train or make use of a neural language model?

A neural language model can be learned in an unsupervised or semi-supervised manner but it needs lots of input text. Easy availability of text online (billions of words) has made this feasible. However, words at the input of a neural network must be represented as numbers. This is where word embeddings provide efficient representations.

To train the LM itself, we need a task for the model to learn. One task is to predict a word given its surrounding words; another is to predict the surrounding words given the current word. In fact, these two tasks were used when creating word2vec word embeddings. Training an LM in this manner is called pre-training.
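These two pre-training tasks can be illustrated by generating (context, target) training pairs from raw text. The function below is a simplified sketch, not the actual word2vec pipeline (which also applies subsampling and negative sampling or hierarchical softmax).

```python
def training_pairs(tokens, window=2):
    """Generate (context, target) pairs as used in the two word2vec
    pre-training tasks: CBOW predicts the target from its context;
    skip-gram predicts each context word from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = training_pairs(tokens)
# pairs[2] == (['the', 'quick', 'fox', 'jumps'], 'brown')
```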

A pre-trained LM can then be applied to a variety of NLP tasks. However, since each task is different, we do task-specific fine-tuning of the LM.

This two-phase approach is practical since a single pre-trained LM can be fine-tuned as the task demands. While pre-training is done on huge volumes of text, fine-tuning takes a lot less effort.

• Could you describe some well-known pre-trained neural language models?

Among the well-known NLMs are ELMo, ULMFiT, BERT, GPT, and GPT-2. BERT in particular has spawned many variants: XLM, RoBERTa, XLNet, MT-DNN, TinyBERT, ALBERT, DistilBERT, and more.

While ELMo and ULMFiT use LSTMs, GPT-2 and BERT are based on the transformer architecture. ULMFiT and GPT-2 are unidirectional while BERT is bidirectional. Most models can be applied to any downstream NLP task.

LM pre-training tasks themselves differ across models:

• Causal LM: Used by GPT-2. Current prediction is based on previous hidden state.
• Masked LM: Used by BERT. Some input words are masked and the task is to predict them. Since the model is bidirectional, masking improves performance.
• Translation LM: Used by XLM for better machine translation. An input sequence contains tokens from both languages, each with its language embeddings and position embeddings.
• Permutation LM: Used by XLNet. It uses permutation to capture bidirectional context.
• Multi-Task LM: Used by MT-DNN. Model is trained on multiple tasks such as classification, text similarity and pairwise ranking. This regularizes the model better.
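As an illustration of the masked LM idea, the sketch below randomly masks input tokens and records the originals as prediction targets. This is simplified: BERT's actual scheme masks about 15% of tokens and sometimes keeps the original token or substitutes a random word instead of the mask symbol.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Sketch of the masked-LM pre-training task: randomly replace a
    fraction of input tokens with a mask symbol; the model must then
    predict the original token at each masked position."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok          # label to predict at position i
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split())
```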

• Which are the common techniques used in neural language models?

Models with more parameters or memory units perform better. Increasing the embedding size improves performance but causes an undesirable increase in the number of parameters. LSTMs are better than RNNs, and much better than n-grams on rare words. Models tend to overfit on training data, for which dropout helps (10% for small models, 25% for large models). Character-level embeddings and softmax can reduce the number of parameters. They're also better at handling out-of-vocabulary words.

To predict the next word, we need to compute the softmax probability. This is expensive for a large vocabulary. Among the different approaches to simplify this are hierarchical softmax, noise contrastive estimation, importance sampling, and self-normalizing partition functions.
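The cost being avoided is easy to see in a plain softmax: every one of the |V| vocabulary scores enters the normalizing sum. A minimal sketch:

```python
import math

def softmax(logits):
    """Full softmax over the vocabulary: cost grows linearly with
    vocabulary size, since every score contributes to the normalizer."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# one score per vocabulary word; each entry needs one exp and one division,
# which is what hierarchical softmax and the other approximations avoid
probs = softmax([2.0, 1.0, 0.1])
```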

To handle rare words, there are neural LMs that make use of morphemes, word shape information (such as capitalization), or annotations (such as POS tags). The use of morphemes has led to morpheme embeddings, which can be composed by an RNN into word embeddings. Some LMs use character-level embeddings at the input and/or output. This approach avoids morphological analysis.

• How can I evaluate the performance of language models?

The common measure for LM evaluation is perplexity. It's the geometric average of the inverse probability of the words predicted by the model. Thus, a lower perplexity implies a better model. The logarithm (base 2) of perplexity, called cross-entropy, is also a common measure. As a rule of thumb, a reduction of 10-20% in perplexity is noteworthy.
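Perplexity and cross-entropy can be computed directly from the per-word probabilities a model assigns to a test text. A minimal sketch (the probabilities below are illustrative, not from a real model):

```python
import math

def perplexity(word_probs):
    """Perplexity: geometric mean of the inverse word probabilities,
    equivalently 2 ** cross_entropy with log base 2."""
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** cross_entropy

# probabilities the model assigned to the words of a test sentence
ppl = perplexity([0.5, 0.25, 0.25])
# ppl == 2 ** (5/3), roughly 3.17; a lower value means a better model
```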

In practice, an LM is measured by how it performs in an actual application. This is called extrinsic evaluation, as opposed to perplexity that's seen as intrinsic evaluation. For example, in speech recognition, Word Error Rate (WER) is an extrinsic measure of an LM.

It's been difficult to compare LMs because they use different training corpora or evaluation benchmarks. Some published results are unclear about computational complexity. Sometimes single-model performance numbers are not reported; only the performance of ensemble models is reported. Language modelling would benefit from a standardized pre-training corpus. Performance should be compared along with model size and resource consumption.

## Milestones

1980

Although smoothing techniques can be traced back to Lidstone (1920), or even earlier to Laplace (18th century), an early application of smoothing to n-gram models for NLP is by Jelinek and Mercer (1980). A better smoothing technique is due to Katz (1987). More smoothing techniques are proposed in the 1990s.

Jul
1989

Bahl et al. propose decision trees for language modelling in the domain of speech recognition. Each node has a yes/no question about preceding words. Each leaf has a probability distribution over the allowable vocabulary. Years later, it's noted that tree-based methods may outperform n-gram models but finding the right partitions is hard due to high computational cost and data sparsity.

1992

In decision tree approaches, as the tree grows, each leaf contains fewer data points. This data fragmentation issue can be solved by exponential models. Pietra et al. propose one such model using Maximum Entropy distribution. Similar models are proposed in the following years. In general, these models are computationally intensive.

1993

N-gram models look at the preceding (n-1) words but for larger n, there's a data sparsity problem. Huang et al. propose a skipping n-gram model in which some preceding words may be ignored or skipped. For example, in the phrase "Show John a good time", the last word would be predicted based on P(time|Show __ a good) rather than P(time|Show John a good). Many such skipping models are proposed through the 1990s.

1995

Due to the success of n-gram models, researchers ignored knowledge-based approaches. The statistical approach eclipsed the linguistic approach. N-gram models worked but had little knowledge of language or its deep structures. Well-known researcher Fred Jelinek notes that a combination of statistical and linguistic approaches may be required. He notes that we must "put language back into language modeling".

Aug
1998

As a smoothing technique for LMs, the Kneser-Ney method was proposed in 1995. Chen and Goodman introduce a modification of this and name it Modified Kneser-Ney Smoothing. Unlike the single discount of Kneser-Ney, the modified method uses different discounts for one, two and more than two counts. Subsequently, Kneser-Ney smoothing on a 5-gram model becomes a popular baseline among researchers.

2001

Bengio et al. point out the curse of dimensionality where the large vocabulary size of natural languages makes computations difficult. They propose a Feedforward Neural Network (FNN) that jointly learns the language model and vector representations of words. They refine their methods in a follow-up paper from 2003.

2010

Since n-grams and FNNs use a fixed-length context, Mikolov et al. propose using a Recurrent Neural Network (RNN) for language modelling. Using cyclic connections, information in RNNs is retained for a longer time. RNNs can therefore capture long-term dependencies. Only the size of the hidden context layers needs to be fixed. In 2018, Noaman et al. extend this approach to better suit languages with rich morphology or large vocabulary. They tokenize a word into prefix, stem and suffix.

Jan
2013

At Google, Mikolov et al. develop a word embedding called word2vec. This is created by training the model on one of two LM tasks: continuous bag-of-words (predict current word based on surrounding words) or continuous skip-gram (predict surrounding words given current word). These are log-linear models; hierarchical softmax keeps the output computation efficient.

Aug
2015

Kim et al. use character-level input embeddings. Input is fed into a CNN followed by a highway network. An LSTM layer does the predictions, which are still at word level. They show that these character-level models have fewer parameters and outperform word-level models, particularly for languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian). In 2016, Jozefowicz et al. explore CharCNN and character-level LSTM at the prediction layer.

Jul
2018

Word embeddings such as word2vec have been popular since their release in 2013. However, they can't handle polysemy (same word, different meanings). This is because they produce a single representation for the word. They capture semantic relations but are poor at higher-level concepts such as anaphora, long-term dependencies, agreement, and negation. This is where LMs become useful. A good LM should capture lexical, syntactic, semantic and pragmatic aspects. NLP researcher Sebastian Ruder notes,

It is very likely that in a year's time NLP practitioners will download pretrained language models rather than pretrained word embeddings.

Oct
2018

Devlin et al. from Google publish details of an LM they call BERT. It's deeply bidirectional, meaning that it uses both left and right contexts in all layers. In November, Google open sources pre-trained BERT models, along with TensorFlow code that does this pre-training. These models are for English. Later in November, Google releases multilingual BERT that supports about 100 different languages.

Jan
2019

Lample and Conneau adapt BERT to propose a cross-lingual LM. The model is trained on both monolingual data (unsupervised) and parallel data (supervised). At the input, each language gets its own language and position embeddings. The model uses Byte-Pair Encoding (BPE), in which sub-words are the tokens. This improves the alignment of embedding spaces across languages. They obtain state-of-the-art results.

Dec
2019

Here are some new applications of LMs. An LM enables end-to-end Named Entity Recognition and Relation Extraction, thereby avoiding external NLP tools such as a dependency parser. LMs have been applied to zero-shot text classification, suggesting that they can be used for meta-learning. A convolutional quantum-like LM is used for product rating prediction. Another LM uses an RNN along with a deep topic model to capture both syntax and global semantic structure.

## Cite As

Devopedia. 2022. "Language Modelling." Version 4, February 15. Accessed 2022-10-09. https://devopedia.org/language-modelling