Language Modelling
Consider the phrase "I am going ___". If we analyse large amounts of English text, the missing word is more likely to be 'home' than 'house'. This also implies that we can estimate the probability of a sequence of words.
A Language Model (LM) captures the probability of a sequence of words in the language. Equivalently, it tells us how likely a given word is to follow a sequence of words.
Traditionally, N-gram models and their variants were used as language models. Since the early 2010s, Neural Language Models (NLMs) have been widely researched. By 2019, pre-trained LMs were being used for transfer learning to improve performance on many downstream NLP tasks.
Discussion
- Why do we need a language model when languages have well-defined grammar?
While natural languages have grammar, they also have a huge vocabulary. There's also considerable flexibility in how words can be combined. Ambiguities occur naturally in human communication. Usage also changes with time. The end result is that grammatical rules and syntactic structures can't be specified for all use cases.
Language models therefore attempt to learn the "structure" of the language by analysing vast amounts of text. The approach is statistical rather than being based on brittle rules. In some sense, language models capture language syntax, semantics and even common sense knowledge.
While we commonly speak of a word sequence for language modelling, we can abstract the idea to a sequence of tokens. A token could be a sentence, a phrase, a word, an n-gram, morpheme, or letter. Where lemmatization is used, multiple surface forms are reduced to a single token. An LM specifies its method of tokenization.
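The choice of token can be illustrated with a minimal sketch. The function names and the simple regular expression below are assumptions made for illustration; real LMs ship their own tokenizers.

```python
import re

def word_tokens(text):
    """Split text into lowercase word tokens (a crude regex-based rule)."""
    return re.findall(r"[a-z']+", text.lower())

def char_tokens(text):
    """Treat every character as a token."""
    return list(text)

def char_ngrams(word, n=3):
    """Break a word into overlapping character n-grams."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(word_tokens("I am going home"))
print(char_tokens("home"))
print(char_ngrams("going"))
```

The same text yields very different sequence lengths and vocabulary sizes depending on the token type, which is why an LM must state its tokenization.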
- What are some applications of language models?
Language models are useful in applications that deal with text generation. Some examples are optical character recognition, handwriting recognition, speech recognition, machine translation, spelling correction, image captioning, and text summarization.
In speech recognition, the phrases "no eye deer" and "no idea" may sound similar. In spelling correction, "fill the from" may show no spelling errors but in fact "fill the form" is the correct phrase. In both these examples, an LM selects the more probable phrase. In machine translation, an LM can tell that "high winds tonight" is a better translation than "large winds tonight".
Gmail's Smart Reply, which suggests short responses to emails, relies on an LM. In healthcare, LMs were shown to obtain generic representations from data of all patients. This led to better prediction models. LMs have been used for paraphrasing text.
Language modelling tasks themselves (such as predicting a word given surrounding words) have been useful in obtaining efficient word embeddings. These embeddings help represent words at the input of a neural network model.
- Which are the main approaches to language modelling?
There are two broad categories of language models:
- Count-based: These are traditional statistical models such as n-gram models. Word co-occurrences are counted to estimate probabilities. Variants of n-gram models have also been proposed. Clustering models attempt to exploit similarities among words. Caching models exploit the fact that once a word is used, it's likely to appear again in that text. Sentence mixture models build different models for different sentence types.
- Continuous-space: These use neural networks. They use word embeddings as dense word representations in a real-valued vector space. Words that are semantically similar are typically close together in the vector space. Such embeddings solve the data sparsity problem of n-gram models. Unlike n-gram models, these scale well as vocabulary size increases.
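To make "close together in the vector space" concrete, here is a small sketch using cosine similarity. The three-dimensional vectors are invented for illustration; real embeddings are learned from text and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy, hand-made embeddings (assumed values, not learned ones).
embeddings = {
    "home": [0.9, 0.8, 0.1],
    "house": [0.8, 0.9, 0.2],
    "wind": [0.1, 0.2, 0.9],
}

print(cosine_similarity(embeddings["home"], embeddings["house"]))
print(cosine_similarity(embeddings["home"], embeddings["wind"]))
```

With these toy values, 'home' scores much closer to 'house' than to 'wind'. Because similar words share nearby vectors, what the model learns about one word can transfer to its neighbours, which is how embeddings ease data sparsity.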
- Could you briefly describe n-gram models?
Suppose a sentence S has N words, \(w_1,\ldots,w_N\). Since an LM is about finding the probability of S, this is a joint probability measure over all N words. Typically, this is decomposed into a product of conditional probabilities \(P(S)=\prod_{i=1}^N P(w_i|w_1,\ldots,w_{i-1})\), where each term is the probability of a word given the previous words in the sentence.
To simplify the problem, we apply the Markov assumption. This is an approximation in which only the most recent words matter. For a bigram model, a word is predicted based on only the preceding word. For an n-gram model, only the preceding (n-1) words are considered. For instance, given a bigram model for the phrase "the cat sat on the mat", \(P(S)=P(\text{the})\cdot P(\text{cat}|\text{the})\cdot P(\text{sat}|\text{cat})\cdot P(\text{on}|\text{sat})\cdot P(\text{the}|\text{on})\cdot P(\text{mat}|\text{the})\). We can estimate these probabilities by counting word co-occurrences in a corpus. For example, \(P(\text{cat}|\text{the})=P(\text{the cat})/P(\text{the})\), which in practice is estimated as \(\text{count}(\text{the cat})/\text{count}(\text{the})\).
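The counting approach can be sketched in a few lines. The three-sentence corpus is made up for illustration. As an assumption beyond the text above, the sketch adds `<s>` and `</s>` boundary markers, so the first word is conditioned on `<s>` rather than using a unigram probability.

```python
from collections import Counter

# Toy corpus (assumed for illustration).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw the dog",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """Maximum likelihood estimate: count(prev word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Multiply bigram probabilities along the sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("the", "cat"))
print(sentence_prob("the cat sat on the mat"))
print(sentence_prob("the mat sat on the cat"))
```

The last call returns zero because the bigram "mat sat" never occurs in the toy corpus, even though the sentence is grammatical.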
One problem with n-gram models is data sparsity. This means that word sequences not seen in training may be encountered in real applications, leading to zero probability. Techniques to solve this problem include smoothing, backoff and interpolation.
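As a rough sketch of two of these techniques, the code below compares the unsmoothed estimate with add-one (Laplace) smoothing and with simple linear interpolation between bigram and unigram estimates. The toy text and the interpolation weight `lam=0.7` are assumptions; in practice the weight is tuned on held-out data.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
unigram_counts = Counter(tokens)
context_counts = Counter(tokens[:-1])  # words that have a successor
bigram_counts = Counter(zip(tokens, tokens[1:]))

def mle(prev, word):
    """Unsmoothed estimate; zero for any unseen bigram."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def add_one(prev, word):
    """Add-one smoothing: pretend every bigram was seen once more."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

def interpolated(prev, word, lam=0.7):
    """Mix the bigram estimate with the unigram estimate."""
    unigram = unigram_counts[word] / len(tokens)
    return lam * mle(prev, word) + (1 - lam) * unigram

# "cat mat" never occurs in the toy text.
print(mle("cat", "mat"))
print(add_one("cat", "mat"))
print(interpolated("cat", "mat"))
```

Both smoothed estimates give the unseen bigram a small non-zero probability while still summing to one over the vocabulary for a seen context word.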
Typically, 5-gram models are a compromise between computational complexity and performance.
- How can I train or make use of a neural language model?
A neural language model can be learned in an unsupervised or semi-supervised manner but it needs lots of input text. Easy availability of text online (billions of words) has made this feasible. However, words at the input of a neural network must be represented as numbers. This is where word embeddings provide efficient representations.
To train the LM itself, we need a task on which the model has to learn. One task is to predict a word given its surrounding words; or predict the surrounding words given the current word. In fact, these two LM tasks were used when creating word2vec word embeddings. Training an LM in this manner is called pre-training.
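The two prediction tasks differ only in how training examples are cut from raw text. The sketch below builds the examples; the function names and the window size are illustrative assumptions, and no neural network is trained here.

```python
def cbow_examples(tokens, window=2):
    """(surrounding words, current word): predict a word from its neighbours."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    """(current word, neighbour): predict each neighbour from the word."""
    return [(target, neighbour)
            for context, target in cbow_examples(tokens, window)
            for neighbour in context]

tokens = "the cat sat on the mat".split()
print(cbow_examples(tokens, window=1)[2])
print(skipgram_examples(tokens, window=1)[:3])
```

No labels are needed since the text itself supplies both inputs and targets. This is why pre-training can use large unlabelled corpora.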
A pre-trained LM can then be applied to a variety of NLP tasks. However, since each task is different, we do task-specific fine-tuning of the LM.
This two-phase approach is practical since a single pre-trained LM can be fine-tuned as the task demands. While pre-training is done on huge volumes of text, fine-tuning takes far less effort.
- Could you describe some well-known pre-trained neural language models?
Among the well-known NLMs are ELMo, ULMFiT, BERT, GPT, and GPT-2. BERT in particular has spawned many variants: XLM, RoBERTa, XLNet, MT-DNN, TinyBERT, ALBERT, DistilBERT, and more.
While ELMo and ULMFiT use LSTMs, GPT-2 and BERT are based on the transformer architecture. ULMFiT and GPT-2 are unidirectional while BERT is bidirectional. Most models can be applied to any downstream NLP task.
LM pre-training tasks themselves differ across models:
- Causal LM: Used by GPT-2. Each token is predicted from only the tokens that precede it.
- Masked LM: Used by BERT. Some input words are masked and the task is to predict them. Since the model is bidirectional, masking prevents a word from trivially seeing itself in its own context.
- Translation LM: Used by XLM for better machine translation. An input sequence contains tokens from both languages, each with its language embeddings and position embeddings.
- Permutation LM: Used by XLNet. It uses permutation to capture bidirectional context.
- Multi-Task LM: Used by MT-DNN. Model is trained on multiple tasks such as classification, text similarity and pairwise ranking. This regularizes the model better.
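The difference between the causal and masked objectives shows up in how training examples are built. The sketch below is a simplification: the `[MASK]` string, the 15% masking rate and the fixed seed are assumptions for illustration, and actual models apply further rules when choosing and replacing tokens.

```python
import random

def causal_examples(tokens):
    """Causal LM: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, mask_rate=0.15, seed=0):
    """Masked LM: hide some tokens; the model sees both sides of each gap."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok  # position -> original token to predict
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat".split()
print(causal_examples(tokens)[:2])
print(masked_example(tokens, mask_rate=0.3))
```

In the causal case the context is always a prefix, whereas in the masked case the context includes words on both sides of each hidden position.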
- Which are the common techniques used in neural language models?
Models with more parameters or memory units perform better. Increasing the embedding size improves performance but causes an undesirable increase in the number of parameters. LSTMs are better than simple RNNs. LSTMs are much better than n-grams on rare words. Models tend to overfit on training data, for which dropout helps (10% for small models, 25% for large models). Character-level embeddings and softmax can reduce the number of parameters. They're also better at out-of-vocabulary words.
To predict the next word, we need to compute the softmax probability. This is expensive for a large vocabulary. Among the different approaches to simplify this are hierarchical softmax, noise contrastive estimation, importance sampling, and self-normalizing partition functions.
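A minimal softmax sketch shows where the cost comes from. The four-word vocabulary and the scores are invented for illustration; in a real model the scores come from the network's output layer and the vocabulary may hold hundreds of thousands of words.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)  # normalisation touches every vocabulary entry
    return [e / total for e in exps]

vocab = ["home", "house", "out", "away"]
scores = [3.0, 1.0, 2.0, 0.5]  # assumed model outputs for "I am going ___"
probs = softmax(scores)
for word, p in zip(vocab, probs):
    print(word, round(p, 3))
```

The normalising sum runs over the entire vocabulary for every prediction. The approaches listed above reduce or approximate that sum.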
To handle rare words, there are neural LMs that make use of morphemes, word shape information (such as capitalization), or annotations (such as POS tags). The use of morphemes has led to morpheme embeddings. When combined with an RNN, we obtain word embeddings. Some LMs use character-level embeddings at the input, the output, or both. This approach avoids morphological analysis.
- How can I evaluate the performance of language models?
The common measure of LM evaluation is called perplexity. It's a geometric average of the inverse probability of words predicted by the model. Thus, a lower perplexity implies a better model. Logarithm (base 2) of perplexity is also a common measure. This is called cross-entropy. As a rule of thumb, a reduction of 10-20% in perplexity is noteworthy.
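The relation between the two measures can be checked with a short sketch. The input is assumed to be the list of probabilities that a model assigned to each actual word of a test text; the example values are made up.

```python
import math

def cross_entropy(word_probs):
    """Average negative log2 probability per word, in bits."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """Geometric mean of inverse probabilities, i.e. 2 ** cross-entropy."""
    return 2 ** cross_entropy(word_probs)

# Assumed per-word probabilities from some model on a four-word text.
word_probs = [0.5, 0.125, 0.25, 0.25]
print(cross_entropy(word_probs))
print(perplexity(word_probs))
```

A perplexity of 4 can be read as the model being, on average, as uncertain as if it were choosing uniformly among four words at each step.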
In practice, an LM is measured by how it performs in an actual application. This is called extrinsic evaluation, as opposed to perplexity that's seen as intrinsic evaluation. For example, in speech recognition, Word Error Rate (WER) is an extrinsic measure of an LM.
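WER is commonly computed as the word-level edit distance between a reference transcript and the system output, divided by the reference length. The sketch below follows that common definition and assumes a non-empty reference; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("no idea", "no eye deer"))
print(word_error_rate("fill the form", "fill the from"))
```

Since insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.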
It's been difficult to compare LMs because they use different training corpora or evaluation benchmarks. Some published results are unclear about the computational complexity. Sometimes single-model performance numbers are not reported; only the performance of ensemble models is reported. Language modelling can benefit from a standardized pre-training corpus. Performance should be compared along with model size and resource consumption.
Milestones
Although smoothing techniques can be traced back to Lidstone (1920), or even earlier to Laplace (18th century), an early application of smoothing to n-gram models for NLP is by Jelinek and Mercer (1980). A better smoothing technique is due to Katz (1987). More smoothing techniques are proposed in the 1990s.
1989
Bahl et al. propose a decision tree for language modelling in the domain of speech recognition. Each node has a yes/no question about preceding words. Each leaf has a probability distribution over the allowable vocabulary. Years later, it's noted that tree-based methods may outperform n-gram models but finding the right partitions is hard due to high computational cost and data sparsity.
In decision tree approaches, as the tree grows, each leaf contains fewer data points. This data fragmentation issue can be solved by exponential models. Della Pietra et al. propose one such model using a Maximum Entropy distribution. Similar models are proposed in the following years. In general, these models are computationally intensive.
N-gram models look at the preceding (n-1) words but for larger n, there's a data sparsity problem. Huang et al. propose a skipping n-gram model in which some preceding words may be ignored or skipped. For example, in the phrase "Show John a good time", the last word would be predicted based on P(time|Show __ a good) rather than P(time|Show John a good). Many such skipping models are proposed through the 1990s.
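The idea of skipping can be sketched by enumerating contexts with some words blanked out. This only illustrates how skipped contexts are formed. It is not the formulation used by Huang et al., and the function name, placeholder and `max_skips` parameter are assumptions.

```python
from itertools import combinations

def skip_contexts(context, max_skips=1, placeholder="__"):
    """All versions of a context with up to max_skips words blanked out."""
    variants = []
    for k in range(max_skips + 1):
        for positions in combinations(range(len(context)), k):
            variants.append(tuple(placeholder if i in positions else w
                                  for i, w in enumerate(context)))
    return variants

for variant in skip_contexts(("Show", "John", "a", "good")):
    print(variant)
```

Counts collected for a skipped context such as ("Show", "__", "a", "good") pool evidence from "Show John a good", "Show Mary a good" and so on, which helps when the full context is rare.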
Due to the success of n-gram models, researchers ignored knowledge-based approaches. The statistical approach eclipsed the linguistic approach. N-gram models worked but had little knowledge of language or its deep structures. Well-known researcher Fred Jelinek notes that a combination of statistical and linguistic approaches may be required. He notes that we must "put language back into language modeling".
1998
As a smoothing technique for LMs, the Kneser-Ney method was proposed in 1995. Chen and Goodman introduce a modification of this and name it Modified Kneser-Ney Smoothing. Unlike the single discount of Kneser-Ney, the modified method uses different discounts for one, two and more than two counts. Subsequently, Kneser-Ney smoothing on a 5-gram model becomes a popular baseline among researchers.
Bengio et al. point out the curse of dimensionality where the large vocabulary size of natural languages makes computations difficult. They propose a Feedforward Neural Network (FNN) that jointly learns the language model and vector representations of words. They refine their methods in a follow-up paper from 2003.
2010
Since n-grams and FNNs use a fixed-length context, Mikolov et al. propose using a Recurrent Neural Network (RNN) for language modelling. Using cyclic connections, information in RNNs is retained for a longer time. RNNs can therefore capture long-term dependencies. Only the size of the hidden context layers needs to be fixed. In 2018, Noaman et al. extend this approach to better suit languages with rich morphology or large vocabulary. They tokenize a word into prefix, stem and suffix.
2013
At Google, Mikolov et al. develop a word embedding called word2vec. This is created by training the model on one of two LM tasks: continuous bag-of-words (predict current word based on surrounding words) or continuous skip-gram (predict surrounding words given current word). These are log-linear models since they drop the non-linear hidden layer of earlier neural LMs. Hierarchical softmax keeps the output computation efficient.
2015
Kim et al. use character-level input embeddings. Input is fed into a CNN followed by a highway network. An LSTM layer does the predictions, which are still at word level. They show that these character-level models have fewer parameters and outperform word-level models, particularly for languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian). In 2016, Jozefowicz et al. explore CharCNN and character-level LSTM at the prediction layer.
2018
Word embeddings such as word2vec have been popular since their release in 2013. However, they can't handle polysemy (same word, different meanings). This is because they produce a single representation for the word. They capture semantic relations but are poor at higher-level concepts such as anaphora, long-term dependencies, agreement, and negation. This is where LMs become useful. A good LM should capture lexical, syntactic, semantic and pragmatic aspects. NLP researcher Sebastian Ruder notes,
"It is very likely that in a year's time NLP practitioners will download pretrained language models rather than pretrained word embeddings."
2018
Devlin et al. from Google publish details of an LM they call BERT. It's deeply bidirectional, meaning that it uses both left and right contexts in all layers. In November, Google open sources pre-trained BERT models, along with TensorFlow code that does this pre-training. These models are for English. Later in November, Google releases multilingual BERT that supports about 100 different languages.
2019
Lample and Conneau adapt BERT to propose a cross-lingual LM. The model is trained on both monolingual data (unsupervised) and parallel data (supervised). At the input, each language gets its own language and position embeddings. The model uses Byte-Pair Encoding (BPE) in which sub-words are the tokens. This improves the alignment of embedding spaces across languages. They obtain state-of-the-art results.
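How BPE arrives at sub-word tokens can be sketched as repeated merging of the most frequent adjacent symbol pair. The word frequencies below are invented, and this is a generic illustration of BPE learning rather than the exact procedure or vocabulary used for the cross-lingual model.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy word frequencies (assumed); each word starts as single characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, list(words))
```

Frequent character sequences become single tokens after a few merges, while rare words remain decomposable into smaller known pieces.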
2019
Here are some new applications of LMs. An LM enables end-to-end Named Entity Recognition and Relation Extraction, thereby avoiding external NLP tools such as a dependency parser. An LM is applied to zero-shot text classification, and this work suggests that LMs can be used for meta-learning. A convolutional quantum-like LM is used for product rating prediction. Another LM combines an RNN with a deep topic model to capture both syntax and global semantic structure.
References
- Arora, Kushal, and Anand Rangarajan. 2016. "Contrastive Entropy: A new evaluation metric for unnormalized language models." arXiv, v2, March 31. Accessed 2020-01-24.
- Aßenmacher, Matthias, and Christian Heumann. 2020. "On the comparability of Pre-trained Language Models." arXiv, v1, January 2020. Accessed 2020-01-24.
- Bahl, L.R., P.F. Brown, P.V. de Souza, and R.L. Mercer. 1989. "A tree-based statistical language model for natural language speech recognition." IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1001-1008, July. Accessed 2020-01-24.
- Bansal, Shivam. 2018. "Language Modelling and Text Generation using LSTMs — Deep Learning for NLP." Medium, March 26. Accessed 2020-01-21.
- Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. "A Neural Probabilistic Language Model." Journal of Machine Learning Research, vol. 3, pp. 1137–1155, February. Accessed 2020-01-21.
- Brownlee, Jason. 2017. "Gentle Introduction to Statistical Language Modeling and Neural Language Models." Machine Learning Mastery, November 1. Updated 2019-08-07. Accessed 2020-01-21.
- Casas, Noe. 2019. "Contextual Token Representations." Accessed 2020-01-21.
- Chen, Stanley F. and Joshua Goodman. 1998. "An Empirical Study of Smoothing Techniques for Language Modeling." Harvard Computer Science Group Technical Report TR-10-98, August. Accessed 2020-01-24.
- Chromiak, Michał. 2017. "NLP: Explaining Neural Language Modeling." November 30. Accessed 2020-01-21.
- Devlin, Jacob and Ming-Wei Chang. 2018. "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing." Google AI Blog, November 02. Accessed 2020-01-24.
- Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv, v2, May 24. Accessed 2019-11-30.
- Elhadad, Michael. 2017. "Language Modeling." Topics in Natural Language Processing, Dept. of CS, Ben-Gurion University, October 29. Accessed 2020-01-21.
- Giorgi, John, Xindi Wang, Nicola Sahar, Won Young Shin, Gary D. Bader, and Bo Wang. 2019. "End-to-end Named Entity Recognition and Relation Extraction using Pre-trained Language Models." arXiv, v1, December 20. Accessed 2020-01-21.
- Goodman, Joshua. 2001. "A Bit of Progress in Language Modeling." arXiv, v1, August 9. Accessed 2020-01-21.
- Google Research GitHub. 2019. "TensorFlow code and pre-trained models for BERT." google-research/bert, GitHub, October 18. Accessed 2020-01-25.
- Guo, Dandan, Bo Chen, Ruiying Lu, and Mingyuan Zhou. 2019. "Recurrent Hierarchical Topic-Guided Neural Language Models." arXiv, v1, December 21. Accessed 2020-01-21.
- Horev, Rani. 2019. "XLM — Enhancing BERT for Cross-lingual Language Model." Towards Data Science, on Medium, February 12. Accessed 2020-01-21.
- Jozefowicz, Rafal, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. "Exploring the Limits of Language Modeling." arXiv, v2, February 11. Accessed 2020-01-21.
- Kim, Yoon, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. "Character-Aware Neural Language Models." arXiv, v4, December 1. Accessed 2020-01-24.
- Koehn, Philipp. 2009. "Chapter 7: Language Models." Slides based on: Statistical Machine Translation, Cambridge University Press. Accessed 2020-01-21.
- Lample, Guillaume, and Alexis Conneau. 2019. "Cross-lingual Language Model Pretraining." arXiv, v1, January 22. Accessed 2020-01-21.
- Ma, Edward. 2019. "Cross-lingual Language Model." Towards AI, on Medium, July 16. Accessed 2020-01-21.
- Mikolov, Tomáš, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. "Recurrent Neural Network Based Language Model." INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pp. 1045-1048, September 26-30. Accessed 2020-01-21.
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. "Efficient Estimation of Word Representations in Vector Space." arXiv, v3, September 07. Accessed 2019-10-07.
- Noaman, Hatem M., Shahenda S. Sarhan, and Mohsen A. A. Rashwan. 2018. "Enhancing recurrent neural network-based language models by word tokenization." Human-centric Computing and Information Sciences, vol. 8, article no. 12, April. Accessed 2020-01-21.
- Osborne, Sterling. 2019. "Learning NLP Language Models with Real Data." Towards Data Science, on Medium, January 27. Accessed 2020-01-21.
- Phy, Vitou. 2019. "Language Model Concept behind Word Suggestion Feature." Towards Data Science, on Medium, November 2. Accessed 2020-01-21.
- Ping, Qing, and Chaomei Chen. 2019. "Convolutional Quantum-Like Language Model with Mutual-Attention for Product Rating Prediction." arXiv, v1, December 25. Accessed 2020-01-21.
- Puri, Raul, and Bryan Catanzaro. 2019. "Zero-shot Text Classification With Generative Language Models." arXiv, v1, December 10. Accessed 2020-01-21.
- Rathore, Mohit. 2018. "Introduction to Language Models." April 3. Accessed 2020-01-21.
- Rong, Xin. 2016. "word2vec Parameter Learning Explained." arXiv, v4, June 5. Accessed 2020-01-24.
- Rosenfeld, Ronald. 2000. "Two Decades of Statistical Language Modeling: Where Do We Go from Here?" Proceedings of the IEEE, vol. 88, no. 8, pp. 1270-1278, August. Accessed 2020-01-21.
- Ruder, Sebastian. 2018a. "NLP's ImageNet moment has arrived." July 12. Accessed 2020-01-24.
- Ruder, Sebastian. 2018b. "A Review of the Recent History of Natural Language Processing." October 01. Accessed 2019-09-26.
- Sieg, Adrien. 2019. "FROM Pre-trained Word Embeddings TO Pre-trained Language Models — Focus on BERT." Medium, August 29. Accessed 2020-01-21.
- Steinberg, Ethan, Ken Jung, Jason A. Fries, Conor K. Corbin, Stephen R. Pfohl, and Nigam H. Shah. 2020. "Language Models Are An Effective Patient Representation Learning Technique For Electronic Health Record Data." arXiv, v1, January 6. Accessed 2020-01-21.
- Synced. 2017. "Language Model: A Survey of the State-of-the-Art Technology." Synced, on Medium, September 10. Accessed 2020-01-21.
- Synced. 2019. "Microsoft’s New MT-DNN Outperforms Google BERT." Synced, on Medium, February 16. Accessed 2020-01-21.
- Weng, Lilian. 2019. "Generalized Language Models." Lil'Log, on GitHub.io, January 31. Accessed 2020-01-21.
- Witteveen, Sam, and Martin Andrews. 2019. "Paraphrasing with Large Language Models." arXiv, v1, November 21. Accessed 2020-01-21.
- Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. "XLNet: Generalized Autoregressive Pretraining for Language Understanding." arXiv, v1, June 19. Accessed 2019-12-03.
- thunlp. 2019. "Must-read papers on pre-trained language models." PLMPapers, thunlp on GitHub, November. Accessed 2020-01-24.
Further Reading
- Weng, Lilian. 2019. "Generalized Language Models." Lil'Log, on GitHub.io, January 31. Accessed 2020-01-21.
- Goodman, Joshua. 2001. "A Bit of Progress in Language Modeling." arXiv, v1, August 9. Accessed 2020-01-21.
- Jozefowicz, Rafal, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. "Exploring the Limits of Language Modeling." arXiv, v2, February 11. Accessed 2020-01-21.
- Rosenfeld, Ronald. 2000. "Two Decades of Statistical Language Modeling: Where Do We Go from Here?" Proceedings of the IEEE, vol. 88, no. 8, pp. 1270-1278, August. Accessed 2020-01-21.
- Faltl, Sandra, Michael Schimpk, and Constantin Hackober. 2019. "Universal Language Model Fine-Tuning (ULMFiT): State-of-the-Art in Text Analysis." Blog, Chair of Information System at HU-Berlin, February 7. Accessed 2020-01-21.
- Rizvi, Mohd Sanad Zaki. 2019. "A Comprehensive Guide to Build your own Language Model in Python!" Blog, Analytics Vidhya, August 8. Accessed 2020-01-21.