BERT (Language Model)

BERT uses many layers of bidirectional transformers. Source: Adapted from Devlin et al. 2019, fig. 3.

NLP involves a number of distinct tasks each of which typically needs its own set of training data. Often each task has only a few thousand samples of labelled data, which is not adequate to train a good model. However, there's plenty of unlabelled data readily available online. This data can be used to train a baseline model that can be reused across NLP tasks. Bidirectional Encoder Representations from Transformers (BERT) is one such model.

BERT is pre-trained using unlabelled data on language modelling tasks. For a specific NLP task, the pre-trained model can then be fine-tuned. Pre-trained BERT models, and their variants, have been open sourced. This makes it easier for NLP researchers to fine-tune BERT and quickly advance the state of the art for their tasks.


  • What's the typical process for using BERT?
    Fine-tuning BERT for various NLP tasks. Source: Devlin et al. 2019, fig. 4.

    BERT builds on self-attention and the transformer architecture, which are becoming popular in neural network models. BERT is an encoder-only transformer. It's deeply bidirectional, meaning that it uses both left and right contexts in all layers.

    BERT involves two stages: unsupervised pre-training followed by supervised task-specific fine-tuning. Once a BERT model is pre-trained, it can be shared. This enables downstream tasks to do further training on a much smaller dataset. Different NLP tasks can thus benefit from a single shared baseline model. In some sense, this is similar to transfer learning that's been common in computer vision.

    While pre-training takes a few days on many Cloud TPUs, fine-tuning takes only 30 minutes on a single Cloud TPU.

    For fine-tuning, one or more output layers are typically added to BERT. Likewise, input embeddings reflect the task. For question answering, an input sequence contains the question and the passage that holds the answer, while the model is trained to predict where the answer starts and ends. For classification, the output at the [CLS] token is fed into a classification layer.
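
    To make the question-answering case concrete, the sketch below picks an answer span from per-token start and end scores, such as a fine-tuned output layer might produce. The function name best_span, the max_len limit and the example scores are illustrative assumptions, not part of any BERT release.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) maximizing start_scores[start] + end_scores[end], with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider spans that end at or after the start and are not too long.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Hypothetical scores for a 6-token passage.
start = [0.1, 2.5, 0.3, 0.2, 0.1, 0.0]
end = [0.0, 0.2, 0.4, 3.1, 0.3, 0.1]
print(best_span(start, end))  # (1, 3)
```

    The constraint that the end cannot precede the start is what keeps the highest start score and the highest end score from being combined into an invalid span.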

  • Could you describe the tasks on which BERT is pre-trained?
    BERT does basic token masking whereas ERNIE uses three types of masking. Source: Liu 2019, fig. 2.

    BERT is pre-trained on two tasks:

    • Masked Language Model (MLM): Given a sequence of tokens, some of them are masked. The objective is then to predict the masked tokens. Masking allows the model to be trained using both left and right contexts. Specifically, 15% of tokens are randomly chosen for prediction. Of these, 80% are replaced with the [MASK] token, 10% are replaced with a random word, and 10% are left unchanged.
    • Next Sentence Prediction (NSP): Given two sentences, the model predicts if the second one logically follows the first one. This task captures the relationship between sentences, which language modelling alone doesn't do.
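
    The 15% selection and the 80/10/10 split can be sketched as follows. This is a simplified illustration, not the released data-generation code: it decides per token with independent random draws, and the names mask_tokens, toy_vocab and the example sentence are assumptions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue  # special tokens are never selected
        if rng.random() >= select_prob:
            continue  # about 85% of tokens are left alone
        labels[i] = tok  # the model must predict the original token here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels


toy_vocab = ["river", "money", "tree", "walk"]  # hypothetical tiny vocabulary
sentence = ["[CLS]", "she", "opened", "a", "bank", "account", "[SEP]"]
print(mask_tokens(sentence, toy_vocab, random.Random(1)))
```

    Keeping some selected tokens unchanged, or swapping in a random one, means the model cannot rely on seeing [MASK] at every position it has to predict.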

    Unlike word embeddings such as word2vec or GloVe, BERT produces contextualized embeddings. This means that BERT produces multiple embeddings of a word, each representing the context around the word. For example, word2vec embedding for the word 'bank' would not differentiate between the phrases "bank account" and "bank of the river" but BERT can tell the difference.

  • Which are some possible applications of BERT?
    For sentiment analysis, Target-Dependent BERT uses tokens at target positions rather than [CLS]. Source: Gao et al. 2019, fig. 2.

    In October 2019, Google Search started using BERT to better understand the intent behind search queries. Another application of BERT is to recommend products based on a descriptive user request. Use of BERT for question answering on SQuAD and NQ datasets is well known. BERT has also been used for document retrieval.

    BERT has been used for aspect-based sentiment analysis. Xu et al. use BERT for both sentiment analysis and comprehending product reviews so that questions on those products can be answered automatically.

    Among classification tasks, BERT has been used for fake news classification and sentence pair classification.

    To aid teachers, BERT has been used to generate questions on grammar or vocabulary based on a news article. The model frames a question and presents some choices, only one of which is correct.

    BERT is still new and many novel applications may emerge in future. It's possible to use BERT for quantitative trading. BERT can also be applied to specific domains, but this requires domain-specific pre-trained models. SciBERT and BioBERT are two examples.

  • Which are the essential parameters or technical details of BERT model?

    BERT pre-trained models are available in two sizes:

    • Base: 12 layers, 768 hidden size, 12 self-attention heads, 110M parameters.
    • Large: 24 layers, 1024 hidden size, 16 self-attention heads, 340M parameters.

    Each of the above took 4 days to train on 4 Cloud TPUs (Base) or 16 Cloud TPUs (Large).
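
    As a rough check on the parameter counts above, the sketch below tallies the weights of an encoder with the stated layer count and hidden size. Several values are assumptions not given in this article: a vocabulary of 30,522 tokens, 512 positions, 2 segment types, a feed-forward size of four times the hidden size, and a final pooling layer on [CLS]. The number of attention heads doesn't change the total, since the heads share the same projection matrices.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Approximate parameter count of a BERT encoder under the assumed layout."""
    ffn = 4 * hidden  # assumed feed-forward (intermediate) size
    layer_norm = 2 * hidden  # scale and bias vectors
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden  # dense layer applied to the [CLS] output
    return embeddings + layers * (attention + feed_forward) + pooler


print(bert_param_count(12, 768))   # 109482240
print(bert_param_count(24, 1024))  # 335141888
```

    Under these assumptions the totals land close to the rounded 110M and 340M figures quoted for Base and Large.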

    For pre-training, a batch size of 256 sequences was used. Each sequence contained 512 tokens, implying 128K tokens per batch. The corpus for pre-training BERT had 3.3 billion words: 800M from BooksCorpus and 2500M from Wikipedia. This resulted in 40 epochs for 1M training steps. Dropout of 0.1 was used on all layers. GELU activation was used.
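
    The batch and epoch figures above follow from simple arithmetic, sketched below. Treating one token as one corpus word is an approximation made here for illustration; sub-word tokenization yields somewhat more tokens than words.

```python
batch_sequences = 256
tokens_per_sequence = 512
training_steps = 1_000_000
corpus_words = 3_300_000_000  # 800M from BooksCorpus + 2500M from Wikipedia

tokens_per_batch = batch_sequences * tokens_per_sequence
tokens_seen = tokens_per_batch * training_steps
approx_epochs = tokens_seen / corpus_words  # assumes one token per word

print(tokens_per_batch)         # 131072, i.e. 128K
print(round(approx_epochs, 1))  # 39.7
```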

    For fine-tuning, batch sizes of 16 or 32 are recommended, and only 2-4 epochs are needed. The learning rate is task-specific and differs from the one used for pre-training. Dropout is the same as in pre-training.

  • How do I represent the input to BERT?
    Input embeddings in BERT. Source: Devlin et al. 2019, fig. 2.

    BERT's input embedding is the sum of three parts:

    • Token: Tokens are basically words. BERT uses a fixed vocabulary of about 30K tokens. To handle rare words or those not in token vocabulary, they're broken into sub-words and then mapped to tokens. The first token of a sequence is [CLS] that's useful for classification tasks. During MLM pre-training, some tokens are masked.
    • Segment/Sentence: An input sequence of tokens can be a single segment or two segments. A segment is a contiguous span of text, not an actual linguistic sentence. Since two segments are packed into the same sequence, each segment has its own embedding. Each segment is terminated by a [SEP] token. For example, in question answering, the question is the first segment and the passage containing the answer is the second.
    • Position: This represents the token's position within the sequence.
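
    The sub-word splitting mentioned under Token can be sketched as a greedy longest-match-first lookup, which is how WordPiece-style tokenizers are commonly described. The toy vocabulary and the function name wordpiece are assumptions for illustration; a real vocabulary has about 30K entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-word tokens by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches: the whole word maps to the unknown token
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"phil", "##am", "##mon", "play", "##ing"}  # hypothetical vocabulary
print(wordpiece("philammon", toy_vocab))  # ['phil', '##am', '##mon']
print(wordpiece("playing", toy_vocab))    # ['play', '##ing']
```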

    In practice, input embeddings can also contain an input mask. Since sequence length is fixed, the final sequence may involve padding. Input mask is used to differentiate between actual inputs and padding.
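
    A minimal sketch of how one padded input might be assembled is shown below. The function name build_inputs, the [PAD] token string, the max_len of 16 and the example segments are assumptions for illustration. In a real model the tokens would be mapped to integer ids, and the token, segment and position ids would each index an embedding table whose vectors are summed.

```python
def build_inputs(tokens_a, tokens_b=None, max_len=16):
    """Assemble tokens, segment ids, position ids and input mask for one sequence."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # first segment, including [CLS] and its [SEP]
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second segment and its [SEP]
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len; truncate the segments first")
    input_mask = [1] * len(tokens)  # 1 marks a real token
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    input_mask += [0] * pad  # 0 marks padding
    position_ids = list(range(max_len))
    return tokens, segment_ids, position_ids, input_mask


question = ["who", "wrote", "hamlet"]
passage = ["shakespeare", "wrote", "hamlet"]
tokens, segment_ids, position_ids, input_mask = build_inputs(question, passage)
print(tokens)
print(segment_ids)
print(input_mask)
```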

    Beginners may wish to look at a visual explanation of BERT input embeddings.

  • What are some variants of BERT?
    BERT has spawned many variants. Source: thunlp 2019.

    BERT has inspired many variants: RoBERTa, XLNet, MT-DNN, SpanBERT, VisualBERT, K-BERT, HUBERT, and more. Some variants attempt to compress the model: TinyBERT, ALBERT, DistilBERT, and more. We describe a few of the variants that outperform BERT in many tasks:

    • RoBERTa: Showed that the original BERT was undertrained. RoBERTa is trained longer, on more data; with bigger batches and longer sequences; without NSP; and dynamically changes the masking pattern.
    • ALBERT: Uses parameter reduction techniques to yield a smaller model. To utilize inter-sentence coherence, ALBERT uses Sentence-Order Prediction (SOP) instead of NSP.
    • XLNet: Doesn't do masking but uses permutation to capture bidirectional context. It combines the best of denoising autoencoding of BERT and autoregressive language modelling of Transformer-XL.
    • MT-DNN: Uses BERT with additional multi-task training on NLU tasks. Cross-task data leads to regularization and more general representations.

  • Could you share some resources for developers to learn BERT?

    Developers can study the TensorFlow code for BERT. This follows the main paper by Devlin et al. (2019). This is also the source for downloading BERT pre-trained models.

    Google has shared TensorFlow code that fine-tunes BERT for Natural Questions.

    McCormick and Ryan show how to fine-tune BERT in PyTorch. HuggingFace provides transformers Python package with implementations of BERT (and alternative models) in both PyTorch and TensorFlow. They also provide a script to convert a TensorFlow checkpoint to PyTorch.

    IBM has shared a deployable BERT model for question answering. An online demo of BERT is available from Pragnakalp Techlabs.


Encoder self-attention distribution for the word 'it' in different contexts. Source: Uszkoreit 2017.

Vaswani et al. propose the transformer model in which they use a seq2seq model without RNN. The transformer model relies only on self-attention, although they're not the first to use self-attention. Self-attention is about attending to different tokens of the sequence. This would later prove to be the building block on which BERT is created.
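
Self-attention can be illustrated with a small scaled dot-product example. This is a simplified sketch: a real transformer derives queries, keys and values from learned linear projections and uses several heads, whereas here the raw input vectors serve as all three. The function names and the toy vectors are assumptions.

```python
import math


def softmax(xs):
    """Turn scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how strongly one token attends to every token in the sequence, including itself.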


Peters et al. use many layers of bidirectional LSTM trained on a language model objective. The final embeddings are based on all the hidden layers. Thus, their embeddings are deeply contextual. They call it Embeddings from Language Models (ELMo). They show that higher-level LSTM states capture semantics while lower-level states capture syntax.


Devlin et al. from Google publish on arXiv a paper titled BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.


Google open sources pre-trained BERT models, along with TensorFlow code that does this pre-training. These models are for English. Later in the month, Google releases multilingual BERT that supports about 100 different languages. The multilingual model preserves case. The model for Chinese is separate. It uses character-level tokenization.


NAACL announces that the BERT paper by Devlin et al. has won the best long paper award. The annual NAACL conference itself is held in June.


Whole Word Masking is introduced in BERT. This can be enabled with the option --do_whole_word_mask=True during data generation. For example, when the word 'philammon' is split into sub-tokens 'phil', '##am' and '##mon', either all three are masked or none at all. The overall masking rate is not affected. As before, each sub-token is predicted independently of the others.
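
The grouping behind whole word masking can be sketched as below, using the 'philammon' example. This is a simplified illustration rather than the released code: it selects words with independent random draws and always substitutes [MASK], leaving out the 80/10/10 replacement rule. The function names and the example sentence are assumptions.

```python
import random


def whole_word_groups(tokens):
    """Group token indices so a word's ## continuation pieces stay with its first piece."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue  # special tokens are never masked
        if tok.startswith("##") and groups:
            groups[-1].append(i)  # continuation of the previous word
        else:
            groups.append([i])  # start of a new word
    return groups


def whole_word_mask(tokens, rng, mask_prob=0.15):
    """Mask every sub-token of each selected word, or none of them."""
    masked = list(tokens)
    for group in whole_word_groups(tokens):
        if rng.random() < mask_prob:
            for i in group:
                masked[i] = "[MASK]"
    return masked


tokens = ["[CLS]", "phil", "##am", "##mon", "led", "the", "way", "[SEP]"]
print(whole_word_groups(tokens))  # [[1, 2, 3], [4], [5], [6]]
print(whole_word_mask(tokens, random.Random(3), mask_prob=0.5))
```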

Asynchronous memory copy overlaps with computation. Source: Mukherjee et al. 2019, fig. 3.

Since a BERT model has 12 or 24 layers with multi-head attention, using it in a real-time application is often a challenge. To make this practical for applications such as conversational AI, NVIDIA releases TensorRT optimizations for BERT. In particular, the transformer layer has been optimized. Q, K and V are fused into a single tensor, thus locating them together in memory and improving model throughput. Latency is 2.2ms on T4 GPUs, well below the 10ms acceptable latency budget.

An example showing how BERT improves query understanding. Source: Nayak 2019.

Google Search starts using BERT for 10% of English queries. Since BERT looks at the bidirectional context of words, it helps in understanding the intent behind search queries. Particularly for conversational queries, prepositions such as "for" and "to" matter. BERT's bidirectional self-attention mechanism takes these into account. Because of the model's complexity, for the first time, Search uses Cloud TPUs.


  1. Agapiev, Aleksandar. 2019. "New Applications for Google’s BERT In Quantitative Trading Algorithms." Accessed 2019-12-01.
  2. Alberti, Chris, Kenton Lee, and Michael Collins. 2019. "A BERT Baseline for the Natural Questions." arXiv, v2, March 21. Accessed 2019-12-01.
  3. Devlin, Jacob and Ming-Wei Chang. 2018. "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing." Google AI Blog, November 02. Accessed 2019-11-30.
  4. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv, v2, May 24. Accessed 2019-11-30.
  5. Elvis. 2019. "XLNet outperforms BERT on several NLP Tasks." Medium, June 24. Accessed 2019-11-30.
  6. Gao, Zhengjie, Ao Feng, Xinyu Song, and Xi Wu. 2019. "Target-Dependent Sentiment Classification With BERT." IEEE Access, vol. 7, pp. 154290-154299, October 11. Accessed 2019-11-30.
  7. GluonNLP. 2019. "Fine-tuning Sentence Pair Classification with BERT." Tutorial, GluonNLP. Accessed 2019-12-01.
  8. Google Research GitHub. 2019. "TensorFlow code and pre-trained models for BERT." google-research/bert, GitHub, October 18. Accessed 2019-11-30.
  9. Goutham, Ramsri. 2019. "Practical AI : Using pretrained BERT to generate grammar and vocabulary Multiple Choice Questions (MCQs) from any news article or story." Medium, October 02. Accessed 2019-12-01.
  10. Horev, Rani. 2018. "BERT Explained: State of the art language model for NLP." Towards Data Science, on Medium, November 17. Accessed 2019-11-30.
  11. Hu, Zhangning. 2019. "Question Answering on SQuAD with BERT." CS224N Report, Stanford University. Accessed 2019-12-01.
  12. HuggingFace. 2019a. "Transformers." Accessed 2019-11-30.
  13. HuggingFace. 2019b. "Converting Tensorflow Checkpoints." Accessed 2019-11-30.
  14. Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." arXiv, v3, October 30. Accessed 2019-12-03.
  15. Liu, Bang. 2019. "NLP Pretraining - from BERT to XLNet." July 01. Accessed 2019-12-03.
  16. Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019a. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv, v1, July 26. Accessed 2019-12-03.
  17. Liu, Xiaodong, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. "Multi-Task Deep Neural Networks for Natural Language Understanding." arXiv, v2, May 30. Accessed 2019-12-03.
  18. Ma, Edward. 2019. "Some examples of applying BERT in specific domain." Towards Data Science, on Medium, April 03. Accessed 2019-12-01.
  19. McCormick, Chris and Nick Ryan. 2019. "BERT Fine-Tuning Tutorial with PyTorch." July 22. Accessed 2019-11-30.
  20. Mukherjee, Purnendu, Eddie Weill, Rohit Taneja, Davide Onofrio, Young-Jun Ko and Siddharth Sharma. 2019. "Real-Time Natural Language Understanding with BERT Using TensorRT." NVIDIA Developer Blog, August 13. Accessed 2019-11-30.
  21. Nayak, Pandu. 2019. "Understanding searches better than ever before." Google Blog, October 25. Accessed 2019-11-30.
  22. Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. "Deep contextualized word representations." arXiv, v2, March 22. Accessed 2019-11-30.
  23. Pierre, Sadrach. 2019. "Fake News Classification with BERT." Towards Data Science, on Medium, November 20. Accessed 2019-12-01.
  24. Shukri, Mohd. 2019. "Why BERT has 3 Embedding Layers and Their Implementation Details." Medium, February 19. Accessed 2019-11-30.
  25. Sohrabi, Reza. 2019. "Give Me Jeans not Shoes: How BERT Helps Us Deliver What Clients Want." MultiThreaded, Blog, Stitch Fix, July 15. Accessed 2019-11-30.
  26. thunlp. 2019. "Must-read papers on pre-trained language models." PLMPapers, thunlp on GitHub, November. Accessed 2019-12-01.
  27. Uszkoreit, Jakob. 2017. "Transformer: A Novel Neural Network Architecture for Language Understanding." Google AI Blog, August 31. Accessed 2019-11-30.
  28. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." arXiv, v5, December 06. Accessed 2019-11-30.
  29. Wan, Reina Qi. 2019. "NAACL 2019 | Google BERT Wins Best Long Paper." Synced, April 11. Accessed 2019-12-01.
  30. Weng, Lilian. 2018. "Attention? Attention!" Lil'Log, June 24. Accessed 2019-11-30.
  31. Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." arXiv, v2, October 08. Accessed 2019-06-13.
  32. Xu, Hu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. "BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis." arXiv, v2, May 04. Accessed 2019-12-01.
  33. Yang, Wei, Haotian Zhang, and Jimmy Lin. 2019a. "Simple Applications of BERT for Ad Hoc Document Retrieval." arXiv, v1, March 26. Accessed 2019-12-01.
  34. Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019b. "XLNet: Generalized Autoregressive Pretraining for Language Understanding." arXiv, v1, June 19. Accessed 2019-12-03.

Further Reading

  1. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv, v2, May 24. Accessed 2019-11-30.
  2. Alammar, Jay. 2019. "A Visual Guide to Using BERT for the First Time." November 26. Accessed 2019-11-30.
  3. Seth, Yashu. 2019. "BERT Explained – A list of Frequently Asked Questions." Blog, Let the Machines Learn, June 11. Accessed 2019-11-30.
  4. Anderson, Dawn. 2019. "A deep dive into BERT: How BERT launched a rocket into natural language understanding." Search Engine Land, November 05. Accessed 2019-11-30.
  5. Alberti, Chris, Kenton Lee, and Michael Collins. 2019. "A BERT Baseline for the Natural Questions." arXiv, v2, March 21. Accessed 2019-11-30.
  6. Vig, Jesse. 2018. "Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters." Towards Data Science, on Medium, December 18. Accessed 2019-12-03.

Cite As

Devopedia. 2019. "BERT (Language Model)." Version 4, December 3. Accessed 2020-11-24.
Contributed by
1 author

Last updated on
2019-12-03 14:01:26