• BERT uses many layers of bidirectional transformers. Source: Adapted from Devlin et al. 2019, fig. 3.
• Encoder self-attention distribution for the word 'it' in different contexts. Source: Uszkoreit 2017.
• Asynchronous memory copy overlaps with computation. Source: Mukherjee et al. 2019, fig. 3.
• An example showing how BERT improves query understanding. Source: Nayak 2019.
• Fine-tuning BERT for various NLP tasks. Source: Devlin et al. 2019, fig. 4.
• BERT does basic token masking whereas ERNIE uses three types of masking. Source: Liu 2019, fig. 2.
• For sentiment analysis, Target-Dependent BERT uses tokens at target positions rather than [CLS]. Source: Gao et al. 2019, fig. 2.
• Input embeddings in BERT. Source: Devlin et al. 2019, fig. 2.
• BERT has spawned many variants. Source: thunlp 2019.

# BERT (Language Model)

Created by arvindpdmn on 2019-11-30. Last updated by arvindpdmn on 2019-12-03.

## Summary

NLP involves a number of distinct tasks each of which typically needs its own set of training data. Often each task has only a few thousand samples of labelled data, which is not adequate to train a good model. However, there's plenty of unlabelled data readily available online. This data can be used to train a baseline model that can be reused across NLP tasks. Bidirectional Encoder Representations from Transformers (BERT) is one such model.

BERT is pre-trained using unlabelled data on language modelling tasks. For specific NLP tasks, the pretrained model can be fine-tuned for that task. Pre-trained BERT models, and their variants, have been open sourced. This makes it easier for NLP researchers to fine-tune BERT and quickly advance the state of the art for their tasks.

## Milestones

Jun
2017

Vaswani et al. propose the transformer model, a seq2seq model without RNN. The transformer model relies only on self-attention, although they're not the first to use self-attention. Self-attention means attending to different tokens of the same sequence. The transformer would later prove to be the building block on which BERT is built.

Feb
2018

Peters et al. use many layers of bidirectional LSTM trained on a language model objective. The final embeddings are based on all the hidden layers. Thus, their embeddings are deeply contextual. They call it Embeddings from Language Models (ELMo). They show that higher-level LSTM states capture semantics while lower-level states capture syntax.

Oct
2018

Devlin et al. from Google publish on arXiv a paper titled BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Nov
2018

Google open sources pre-trained BERT models, along with TensorFlow code that does this pre-training. These models are for English. Later in the month, Google releases multilingual BERT that supports about 100 different languages. The multilingual model preserves case. The model for Chinese is separate. It uses character-level tokenization.

Apr
2019

NAACL announces the best long paper award to the BERT paper by Devlin et al. The annual NAACL conference itself is held in June.

May
2019

Whole Word Masking is introduced in BERT. This can be enabled with the option --do_whole_word_mask=True during data generation. For example, when the word 'philammon' is split into sub-tokens 'phil', '##am' and '##mon', then either all three are masked or none at all. Overall masking rate is not affected. As before, each sub-token is predicted independent of the others.
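The all-or-nothing behaviour described above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual `create_pretraining_data.py` implementation; the function name and the simplified word-budget logic are assumptions.

```python
import random

def whole_word_mask(tokens, mask_rate=0.15, seed=None):
    """Sketch of whole-word masking: sub-tokens (prefixed '##') are
    grouped with the word they belong to, and each word is masked
    all-or-nothing."""
    rng = random.Random(seed)
    # Group token indices into words: a '##' sub-token extends the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    n_to_mask = max(1, round(mask_rate * len(tokens)))
    rng.shuffle(words)
    output = list(tokens)
    masked = 0
    for word in words:
        if masked >= n_to_mask:
            break
        for i in word:
            output[i] = "[MASK]"
        masked += len(word)
    return output

# 'philammon' is split into three sub-tokens; they are masked together or not at all.
print(whole_word_mask(["he", "saw", "phil", "##am", "##mon", "today"],
                      mask_rate=0.5, seed=0))
```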

Aug
2019

Since a BERT model has 12 or 24 layers with multi-head attention, using it in a real-time application is often a challenge. To make this practical for applications such as conversational AI, NVIDIA releases TensorRT optimizations for BERT. In particular, the transformer layer has been optimized: Q, K and V are fused into a single tensor, thus locating them together in memory and improving model throughput. Latency is 2.2ms on T4 GPUs, well below the 10ms acceptable latency budget.

Oct
2019

Google Search starts using BERT for 10% of English queries. Since BERT looks at the bidirectional context of words, it helps in understanding the intent behind search queries. Particularly for conversational queries, prepositions such as "for" and "to" matter. BERT's bidirectional self-attention mechanism takes these into account. Because of the model's complexity, for the first time, Search uses Cloud TPUs.

## Discussion

• What's the typical process for using BERT?

BERT builds on the self-attention and transformer architecture that's become popular for neural network models. BERT is an encoder-only transformer. It's deeply bidirectional, meaning that it uses both left and right contexts in all layers.

BERT involves two stages: unsupervised pre-training followed by supervised task-specific fine-tuning. Once a BERT model is pre-trained, it can be shared. This enables downstream tasks to do further training on a much smaller dataset. Different NLP tasks can thus benefit from a single shared baseline model. In some sense, this is similar to transfer learning that's been common in computer vision.

While pre-training takes a few days on many Cloud TPUs, fine-tuning takes only 30 minutes on a single Cloud TPU.

For fine-tuning, one or more output layers are typically added to BERT. Likewise, input embeddings reflect the task. For question answering, an input sequence will contain the question and the answer while the model is trained to learn the start and end of answers. For classification, the [CLS] token at the output is fed into a classification layer.
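The classification setup above can be sketched with NumPy: the final-layer [CLS] hidden state is fed through an added linear-plus-softmax layer, whose weights are learned during fine-tuning. The function name and the random stand-in for BERT's output are assumptions for illustration.

```python
import numpy as np

def classify_from_cls(cls_vector, W, b):
    """Feed the final-layer [CLS] hidden state through an added
    classification layer (linear + softmax), as done when
    fine-tuning BERT for classification."""
    logits = cls_vector @ W + b           # shape: (num_labels,)
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

hidden_size, num_labels = 768, 2
rng = np.random.default_rng(0)
cls_vector = rng.standard_normal(hidden_size)        # stand-in for BERT's [CLS] output
W = rng.standard_normal((hidden_size, num_labels)) * 0.02
b = np.zeros(num_labels)
probs = classify_from_cls(cls_vector, W, b)
print(probs)  # class probabilities summing to 1
```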

• Could you describe the tasks on which BERT is pre-trained?

BERT is pre-trained on two tasks:

• Masked Language Model (MLM): Given a sequence of tokens, some of them are masked. The objective is then to predict the masked tokens. Masking allows the model to be trained using both left and right contexts. Specifically, 15% of tokens are randomly chosen for masking. Of these, 80% are masked, 10% are replaced with a random word, 10% are retained.
• Next Sentence Prediction (NSP): Given two sentences, the model predicts if the second one logically follows the first one. This task is used for capturing relationship between sentences since language modelling doesn't do this.
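The MLM corruption scheme (15% of tokens chosen; of those, 80% masked, 10% replaced, 10% kept) can be sketched as follows. This is a simplified illustration, not the original data-generation code; the function name and toy vocabulary are assumptions.

```python
import random

def mlm_mask(tokens, vocab, seed=None):
    """Sketch of BERT's MLM corruption: 15% of positions are chosen
    as prediction targets; of those, 80% -> [MASK], 10% -> a random
    token, 10% left unchanged."""
    rng = random.Random(seed)
    output = list(tokens)
    labels = [None] * len(tokens)         # original tokens the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                output[i] = "[MASK]"
            elif r < 0.9:
                output[i] = rng.choice(vocab)
            # else: keep the original token
    return output, labels

vocab = ["the", "cat", "sat", "mat", "dog", "ran"]
out, labels = mlm_mask(vocab * 200, vocab, seed=1)
frac = sum(l is not None for l in labels) / len(labels)
print(round(frac, 2))  # close to the 15% target rate
```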

Unlike word embeddings such as word2vec or GloVe, BERT produces contextualized embeddings. This means that BERT produces multiple embeddings of a word, each representing the context around the word. For example, word2vec embedding for the word 'bank' would not differentiate between the phrases "bank account" and "bank of the river" but BERT can tell the difference.

• Which are some possible applications of BERT?

In October 2019, Google Search started using BERT to better understand the intent behind search queries. Another application of BERT is to recommend products based on a descriptive user request. Use of BERT for question answering on SQuAD and NQ datasets is well known. BERT has also been used for document retrieval.

BERT has been used for aspect-based sentiment analysis. Xu et al. use BERT for both sentiment analysis and comprehending product reviews so that questions on those products can be answered automatically.

Among classification tasks, BERT has been used for fake news classification and sentence pair classification.

To aid teachers, BERT has been used to generate questions on grammar or vocabulary based on a news article. The model frames a question and presents some choices, only one of which is correct.

BERT is still new and many novel applications might emerge in future. It's possible to use BERT for quantitative trading. BERT can be applied to specific domains but we would need domain-specific pre-trained models. SciBERT and BioBERT are two examples.

• Which are the essential parameters or technical details of BERT model?

BERT pre-trained models are available in two sizes:

• Base: 12 layers, 768 hidden size, 12 self-attention heads, 110M parameters.
• Large: 24 layers, 1024 hidden size, 16 self-attention heads, 340M parameters.

Each of the above took 4 days to train on 4 Cloud TPUs (Base) or 16 Cloud TPUs (Large).
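A rough back-of-envelope calculation reproduces the 110M and 340M figures from the layer count and hidden size. The sketch below assumes a WordPiece vocabulary of 30522, 512 positions, 2 segment types, and a feed-forward size of 4H, and ignores biases and LayerNorm, which are comparatively small.

```python
def approx_bert_params(layers, hidden, vocab=30522, max_pos=512):
    """Rough parameter count for a BERT-style encoder
    (ignoring biases and LayerNorm)."""
    embeddings = (vocab + max_pos + 2) * hidden   # token + position + segment
    attention = 4 * hidden * hidden               # Q, K, V and output projections
    ffn = 2 * hidden * (4 * hidden)               # two layers, intermediate size 4H
    return embeddings + layers * (attention + ffn)

print(f"Base:  ~{approx_bert_params(12, 768) / 1e6:.0f}M")    # close to 110M
print(f"Large: ~{approx_bert_params(24, 1024) / 1e6:.0f}M")   # close to 340M
```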

For pre-training, a batch size of 256 sequences was used. Each sequence contained 512 tokens, implying 128K tokens per batch. The corpus for pre-training BERT had 3.3 billion words: 800M from BooksCorpus and 2500M from Wikipedia. This resulted in 40 epochs for 1M training steps. Dropout of 0.1 was used on all layers. GELU activation was used.
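The arithmetic above checks out (treating tokens and words interchangeably, as the paper's estimate does):

```python
# Checking the pre-training arithmetic quoted above.
batch_size, seq_len = 256, 512
tokens_per_batch = batch_size * seq_len
print(tokens_per_batch)            # 131072, i.e. 128K

steps = 1_000_000
corpus_words = 3.3e9               # BooksCorpus (800M) + Wikipedia (2500M)
epochs = tokens_per_batch * steps / corpus_words
print(round(epochs))               # roughly 40 passes over the corpus
```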

For fine-tuning, batch sizes of 16 or 32 are recommended, and only 2-4 epochs are needed. The learning rate differs from pre-training and is also task specific. Dropout is the same as in pre-training.

• How do I represent the input to BERT?

BERT's input embedding is the sum of three parts:

• Token: Tokens are basically words. BERT uses a fixed vocabulary of about 30K tokens. Words that are rare or not in the vocabulary are broken into sub-words and then mapped to tokens. The first token of a sequence is [CLS], which is useful for classification tasks. During MLM pre-training, some tokens are masked.
• Segment/Sentence: An input sequence of tokens can be a single segment or two segments. A segment is a contiguous span of text, not an actual linguistic sentence. Since two segments are packed into the same sequence, each segment has its own embedding. Each segment ends with a [SEP] token. For example, in question answering, the question is the first segment and the answer is the second.
• Position: This represents the token's position within the sequence.

In practice, input embeddings can also contain an input mask. Since sequence length is fixed, the final sequence may involve padding. Input mask is used to differentiate between actual inputs and padding.
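The packing described above can be sketched in plain Python. This is an illustrative sketch of the input layout only, not a tokenizer; the function name, the [PAD] token string, and the short maximum length are assumptions.

```python
def encode_pair(tokens_a, tokens_b, max_len=16):
    """Sketch of BERT's input packing: [CLS] A [SEP] B [SEP],
    segment ids 0/1 per segment, input mask 1 for real tokens
    and 0 for padding."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_mask = [1] * len(tokens)
    while len(tokens) < max_len:          # pad to the fixed sequence length
        tokens.append("[PAD]")
        segment_ids.append(0)
        input_mask.append(0)
    return tokens, segment_ids, input_mask

toks, segs, mask = encode_pair(["where", "is", "it"], ["in", "paris"])
print(toks)
print(segs)
print(mask)
```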

Beginners may wish to look at a visual explanation of BERT input embeddings.

• What are some variants of BERT?

BERT has inspired many variants: RoBERTa, XLNet, MT-DNN, SpanBERT, VisualBERT, K-BERT, HUBERT, and more. Some variants attempt to compress the model: TinyBERT, ALBERT, DistilBERT, and more. We describe a few of the variants that outperform BERT in many tasks:

• RoBERTa: Showed that the original BERT was undertrained. RoBERTa is trained longer, on more data; with bigger batches and longer sequences; without NSP; and dynamically changes the masking pattern.
• ALBERT: Uses parameter reduction techniques to yield a smaller model. To utilize inter-sentence coherence, ALBERT uses Sentence-Order Prediction (SOP) instead of NSP.
• XLNet: Doesn't do masking but uses permutation to capture bidirectional context. It combines the best of denoising autoencoding of BERT and autoregressive language modelling of Transformer-XL.
• Could you share some resources for developers to learn BERT?

Developers can study the TensorFlow code for BERT. This follows the main paper by Devlin et al. (2019). This is also the source for downloading BERT pre-trained models.

Google has shared TensorFlow code that fine-tunes BERT for Natural Questions.

McCormick and Ryan show how to fine-tune BERT in PyTorch. HuggingFace provides transformers Python package with implementations of BERT (and alternative models) in both PyTorch and TensorFlow. They also provide a script to convert a TensorFlow checkpoint to PyTorch.

IBM has shared a deployable BERT model for question answering. An online demo of BERT is available from Pragnakalp Techlabs.



## Cite As

Devopedia. 2019. "BERT (Language Model)." Version 4, December 3. Accessed 2020-01-28. https://devopedia.org/bert-language-model