• Example of self-attention within a word sequence. Source: Weng 2018.
• A sequence-to-sequence model for machine translation. Source: Weng 2018.
• Encoder-decoder architecture with attention. Source: Weng 2018, fig. 4.
• Encoder self-attention distribution for the word 'it' in different contexts. Source: Uszkoreit 2017.
• Variations of self-attention in decoder-only transformer. Source: Liu et al. 2018, fig. 1.
• GPT's transformer (left) and fine-tuning tasks (right). Source: Radford et al. 2018, fig. 1.
• Architecture of MT-DNN. Source: Liu et al. 2019, fig. 1.
• Transformer versus Evolved Transformer. Source: So et al. 2019, fig. 3.
• Machine translation using transformer. Source: Bradbury 2017, fig. 2.
• Transformer architecture showing encoder (left) and decoder (right). Source: Vaswani et al. 2017, fig. 1.
• Attention is computed using query, key and value vectors. Source: Vaswani et al. 2017, fig. 1.
• BERT improves Google search results. Source: Nayak 2019.
• BERT is bidirectional while GPT (and GPT-2) is not. Source: Devlin et al. 2019, fig. 3.
• Transformer-XL uses segment-level recurrence. Source: Dai et al. 2019, fig. 2.

# Transformer Neural Network Architecture

Created by arvindpdmn on 2019-11-06. Last updated 2019-11-13.

## Summary

Given a word sequence, we recognize that some words within it are more closely related to one another than to others. This gives rise to the concept of self-attention, in which a given word "attends to" other words in the sequence. Essentially, attention is about representing context by giving weights to word relations.

The transformer is a neural network architecture that makes use of self-attention. It replaces earlier approaches based on LSTMs or CNNs that used attention between encoder and decoder. The transformer showed that a feed-forward network combined with self-attention is sufficient.

Influential language models such as BERT and GPT-2 are based on the transformer architecture. By 2019, the transformer architecture had become an active area of research and application. While initially created for NLP, it's being used in other domains where problems can be cast as sequence modelling.

## Milestones

2014

Sutskever et al. at Google apply a sequence-to-sequence (seq2seq) model to machine translation: a sequence of words in a source language is translated to a sequence of words in a target language. They use an encoder-decoder architecture with separate 4-layered LSTMs for encoder and decoder. The encoder produces a fixed-length context vector, which is used to initialize the decoder. The main limitation is that the context vector is unable to adequately represent long sentences.

2015

Bahdanau et al. apply the concept of attention to the seq2seq model used in machine translation. This helps the decoder to "pay attention" to important parts of the source sentence. The encoder is a bidirectional RNN. Unlike the seq2seq model of Sutskever et al., which uses only the encoder's last hidden state, the attention mechanism uses all hidden states of encoder and decoder to generate the context vector. It also aligns the input and output sequences, with the alignment score parameterized by a feed-forward network.

Jun
2017

Vaswani et al. propose the transformer model, a seq2seq model without RNNs. The transformer relies only on self-attention, although they're not the first to use self-attention. Self-attention relates different tokens of the same sequence to one another.

Jan
2018

For multi-document summarization, Liu et al. propose a decoder-only transformer architecture that can attend to sequences longer than encoder-decoder architectures can handle. Input and output sequences are combined into a single sequence and used to train the decoder. During inference, output is generated auto-regressively. They also propose variations of attention to handle longer sequences.

Jun
2018

OpenAI publishes the Generative Pre-trained Transformer (GPT). It's inspired by unsupervised pre-training and the transformer architecture. The transformer is trained on a large amount of data without supervision. It's then fine-tuned on smaller task-specific datasets with supervision. Pre-training uses a standard language modelling objective and Liu et al.'s decoder-only transformer. In February 2019, OpenAI announces an improved model named GPT-2. Compared to GPT, GPT-2 is trained on 10x the data and has 10x the parameters.

Oct
2018

Google open sources Bidirectional Encoder Representations from Transformers (BERT), which is a pre-trained language model. It's deeply bidirectional and unsupervised. It improves the state of the art on many NLP tasks. It's trained on two tasks: (a) Masked Language Model (MLM), predicting words that are masked in a sequence; (b) Next Sentence Prediction (NSP), a binary classification that predicts whether one sentence follows another.

Jan
2019

Combining Multi-Task Learning (MTL) and a pretrained language model, Liu et al. propose the Multi-Task Deep Neural Network (MT-DNN). Lower layers of the architecture are shared across tasks and use BERT. Higher layers do task-specific training. They show that this approach outperforms BERT on many tasks even without fine-tuning.

Jan
2019

Just as AutoML has been used in computer vision, Google researchers use an evolution-based neural architecture search (NAS) to discover what they call the Evolved Transformer (ET). It performs better than the original transformer of Vaswani et al. ET turns out to be a hybrid, combining the best of self-attention and wide convolution.

## Discussion

• How is the transformer network better than CNNs, RNNs or LSTMs?

Words in a sentence come one after another. The context of the current word is established by the words surrounding it. RNNs are suited to modelling such a time-sequential structure. But an RNN has trouble remembering long sequences. LSTM is an RNN variant that does better in this regard. CNN architectures such as WaveNet, ByteNet and ConvS2S have also been used for sequence-to-sequence learning.

Moreover, RNNs and LSTMs consider only the words that came before (although bidirectional LSTMs exist). Self-attention models the context by looking at words both before and after the current word. For instance, the word "bank" in the sentence "I arrived at the bank after crossing the river" doesn't refer to a financial institution. The transformer can figure out this meaning because it looks at subsequent words as well.

The sequential nature of RNNs means their computation can't be parallelized on GPUs and TPUs. The transformer's encoder self-attention can be parallelized. While CNNs are less sequential, the number of steps needed to relate distant positions still grows logarithmically with distance. It's worse for RNNs, where this grows linearly. With transformers, the number of sequential operations is constant.

• What's the architecture of the transformer?

The transformer of Vaswani et al. basically follows the encoder-decoder model, with attention passed from encoder to decoder. Both encoder and decoder stack multiple identical layers. Each encoder layer uses self-attention to represent context. Each decoder layer uses attention in two sub-layers: self-attention over the output generated so far, and attention over the encoder's output. While the encoder's self-attention uses both left and right context, the decoder's self-attention masks out future positions while predicting the current position.
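The decoder's masking of future positions can be sketched in a few lines of NumPy (a toy illustration, not code from the paper): attention scores above the diagonal are set to negative infinity before the softmax, so future positions receive zero weight.

```python
import numpy as np

def causal_mask(n):
    """True above the diagonal: position i may attend to positions 0..i only."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

# Toy uniform scores for a 4-token sequence; masked (future) positions get -inf
scores = np.zeros((4, 4))
scores[causal_mask(4)] = -np.inf

# After softmax, future positions receive exactly zero attention weight
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(weights.round(2))
```

Each row still sums to 1, but row `i` spreads its weight only over positions 0 to `i`.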

In each layer we find some common elements. Each sub-layer (self-attention or feed-forward) is wrapped in a residual connection: its input is added to its output and the sum is normalized. There are no recurrent networks, only position-wise fully connected feed-forward networks.
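These common elements can be sketched as follows (a minimal NumPy illustration; the learned gain and bias of layer normalization are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's features to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: two linear layers with ReLU between
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def sublayer(x, fn):
    # Residual connection followed by normalization: LayerNorm(x + Sublayer(x))
    return layer_norm(x + fn(x))

seq_len, d_model, d_ff = 5, 8, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = sublayer(x, lambda h: feed_forward(h, W1, b1, W2, b2))
print(out.shape)  # (5, 8)
```

The same `sublayer` wrapper applies equally around an attention function in place of the feed-forward network.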

At the input, source and target sequences are represented as embeddings, enhanced with positional encodings. At the output, a linear layer followed by a softmax produces the probability of each word.

The transformer's encoder can work on the input sequence in parallel but the decoder is auto-regressive. Each output is influenced by previous output symbols. Output symbols are generated one at a time.
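Auto-regressive generation can be sketched as a loop that feeds each generated token back into the input (a toy stand-in replaces the trained decoder here; in practice `step_fn` would be the full transformer decoder):

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len=20):
    """step_fn(tokens) -> logits over the vocabulary for the next token.
    Generates one token at a time, feeding each output back as input."""
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = int(np.argmax(step_fn(tokens)))
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Toy stand-in for a trained decoder: always predicts (last token + 1) mod 5
toy_model = lambda tokens: np.eye(5)[(tokens[-1] + 1) % 5]
print(greedy_decode(toy_model, bos_id=0, eos_id=4))  # [0, 1, 2, 3, 4]
```

Real systems typically replace the greedy `argmax` with beam search or sampling, but the one-token-at-a-time structure is the same.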

• How is self-attention computed in a transformer network?

Every word is projected onto three vectors: query, key and value. The respective weight matrices $$W$$ that do this projection are learned during training. Suppose we're calculating the attention for a particular word. Its query vector is dotted with the key vector of each word in the sequence. These dot products are scaled by $$1/\sqrt{d_k}$$ to compensate for large dot-product values. The value vectors are weighted by the resulting (softmaxed) scores and then summed.

For better results, multi-head attention is used. Each head learns a different attention distribution, similar to having multiple filters in a CNN. For example, if the model dimension is 512, instead of a single large attention layer, we use 8 parallel attention heads, each operating in 64 dimensions. Outputs from the heads are concatenated to derive the final attention. Mathematically, we have the following:

$$MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O\\head_i = Attention(QW^{Q}_i, KW^{K}_i, VW^{V}_i)\\Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$
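A minimal NumPy sketch of these formulas, using random (untrained) matrices in place of the learned projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(x, h=8, d_model=512):
    d_k = d_model // h  # 64 dimensions per head, as in the example above
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        # Learned in a real model; random projections here
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        heads.append(attention(x @ Wq, x @ Wk, x @ Wv))
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_o

x = np.random.default_rng(1).normal(size=(10, 512))  # 10 token embeddings
print(multi_head(x).shape)  # (10, 512)
```

Each head projects the same input into a smaller subspace, attends there, and the concatenated results are mixed by the output projection $$W^O$$.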

The original transformer of Vaswani et al. uses self-attention within encoder and decoder, but also transfers attention from encoder to decoder as is common in traditional sequence-to-sequence models.

• How does the transformer network capture the position of words?

In RNNs, the sequential structure accounts for position. In CNNs, positions are considered within the kernel size. In transformers, self-attention ignores the position of tokens within the sequence. To overcome this limitation, transformers explicitly add positional encodings. These are added to the input or output embeddings before the sum goes into the first attention layer.

Positional encodings can either be learned or fixed. In the latter case, Vaswani et al. used sine and cosine functions for even and odd dimensions respectively. They used different frequencies across dimensions to make it easier for the model to learn the positions:

$$PE_{(pos,2i)}=sin(pos/10000^{2i/d_{model}})\\PE_{(pos,2i+1)}=cos(pos/10000^{2i/d_{model}})$$
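These functions can be computed directly (a short NumPy sketch of the formulas above):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos/10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)   # (50, 512)
print(pe[0, :4])  # position 0: [sin(0), cos(0), sin(0), cos(0)] = [0, 1, 0, 1]
```

The resulting matrix is simply added to the token embeddings before the first layer.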

While Vaswani et al. (2017) considered absolute positions, Shaw et al. (2018) looked at the distance between tokens in a sequence, that is, relative positioning. They showed that this leads to better results for machine translation with the trade-off of 7% decrease in steps per second.

• Could you share some applications of the transformer network?

In October 2019, Google announced the use of BERT for 10% of its English language searches. Search attempts to understand queries the way users naturally ask them, as opposed to parsing the query as a bag of keywords. Thus, words such as "to" or phrases such as "for someone" are important to meaning, and BERT picks these up.

We can use transformers to generate synthetic text. Starting from a small prompt, the GPT-2 model is able to generate long sequences and paragraphs of text that are realistic and coherent. This text also adapts to the style of the input.

For correcting grammar, transformers provide competitive baseline performance. For sequence generation, Insertion Transformer and Levenshtein Transformer have been proposed.

Transformers have been used beyond NLP, such as for image generation where self-attention is restricted to local neighbourhoods. Music Transformer applied self-attention to generate long pieces of music. While the original transformer used absolute positions, the music transformer used relative attention, allowing the model to create music in a consistent style.

• Which are the well-known transformer networks?

BERT is an encoder-only transformer. It's the first deeply bidirectional model, meaning that it uses both left and right contexts in all layers. BERT showed that as a pretrained language model it can be fine-tuned easily to obtain state-of-the-art models for many specific tasks. BERT has inspired many variants: RoBERTa, XLNet, MT-DNN, SpanBERT, VisualBERT, K-BERT, HUBERT, and more. Some variants attempt to compress the model: TinyBERT, ALBERT, DistilBERT, and more.

The other competitive model is GPT-2. Unlike BERT, GPT-2 is not bidirectional and is a decoder-only transformer. However, the training includes both unsupervised pretraining and supervised fine-tuning. The training objective combines both of these to improve generalization and convergence. This approach of training on specific tasks is also seen in MT-DNN.

GPT-2 is auto-regressive. Each output token is generated one by one. Once a token is generated, it's added to the input sequence. BERT is not auto-regressive but instead uses context from both sides. XLNet is auto-regressive while also using context from both sides.

• What are some variations of the transformer network?

Compared to the original transformer of Vaswani et al., we note the following variations:

• Transformer-XL: Overcomes the limitation of fixed-length context. It makes use of segment-level recurrence and relative positional encoding.
• DS-Init & MAtt: Stacking many layers is problematic due to vanishing gradients. Therefore, depth-scaled initialization and merged attention sublayer are proposed.
• Average Attention Network (AAN): With the original transformer, decoder's self-attention is slow due to its auto-regressive nature. Speed is improved by replacing self-attention with an averaging layer followed by a gating layer.
• Dialogue Transformer: Picks out conversations with multiple overlapping topics. Self-attention is applied over the dialogue turns.
• Tensor-Product Transformer: Uses novel TP-Attention to explicitly encode relations and applies it to math problem solving.
• Tree Transformer: Puts a constraint on the encoder to follow tree structures that are more intuitive to humans. This also helps us learn grammatical structures from unlabelled data.
• Tensorized Transformer: Multi-head attention is difficult to deploy in a resource-limited setting. Hence, multi-linear attention with Block-Term Tensor Decomposition (BTD) is proposed.

• For a developer, what resources are out there to learn transformer networks?

To get a feel of transformers in action, you can try out Talk to Transformer, which is based on the full-sized GPT-2.

HuggingFace provides implementations of many transformer architectures in both TensorFlow and PyTorch. You can also convert them to CoreML models for iOS devices. The spaCy package also interfaces with HuggingFace.

TensorFlow code and pretrained models for BERT are available. There's also code for Transformer-XL, MT-DNN and GPT-2.

TensorFlow has provided an implementation for machine translation. Lilian Weng's implementation of the transformer is worth studying. Samuel Lynn-Evans has shared his implementation with explanations. The Annotated Transformer is another useful resource to learn the concepts along with the code.

## Cite As

Devopedia. 2019. "Transformer Neural Network Architecture." Version 6, November 13. Accessed 2019-12-12. https://devopedia.org/transformer-neural-network-architecture