Transformer Neural Network Architecture

Example of self-attention within a word sequence. Source: Weng 2018.

Given a word sequence, we recognize that some words within it are more closely related to one another than others. This gives rise to the concept of self-attention, in which a given word "attends to" other words in the sequence. Essentially, attention is about representing context by giving weights to word relations.

Transformer is a neural network architecture that makes use of self-attention. It replaces earlier approaches of LSTMs or CNNs that used attention between encoder and decoder. Transformer showed that a feed-forward network used with self-attention is sufficient.

Influential language models such as BERT and GPT-2 are based on the transformer architecture. By 2019, the transformer architecture had become an active area of research and application. While initially created for NLP, it's being used in other domains where problems can be cast as sequence modelling.


  • How is the transformer network better than CNNs, RNNs or LSTMs?
    Machine translation using transformer. Source: Bradbury 2017, fig. 2.

    Words in a sentence come one after another. The context of the current word is established by the words surrounding it. RNNs are suited to model such a time-sequential structure. But an RNN has trouble remembering long sequences. LSTM is an RNN variant that does better in this regard. CNN architectures WaveNet, ByteNet and ConvS2S have also been used for sequence-to-sequence learning.

    Moreover, RNNs and LSTMs consider only words that have gone before (although bidirectional LSTMs exist). Self-attention models the context by looking at words before and after the current word. For instance, the word "bank" in the sentence "I arrived at the bank after crossing the river" doesn't refer to a financial institution. Transformer can figure out this meaning because it looks at subsequent words as well.

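    The sequential nature of RNNs implies that computation within a sequence can't be parallelized on GPUs and TPUs. Transformer's encoder self-attention can be parallelized. While CNNs are less sequential, the number of operations needed to relate two distant positions still grows with the distance between them: linearly for ConvS2S and logarithmically for ByteNet. It's worse for RNNs, where the number of sequential operations grows linearly with sequence length. With transformers, the number of sequential operations is constant.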

  • What's the architecture of the transformer?
    Transformer architecture showing encoder (left) and decoder (right). Source: Vaswani et al. 2017, fig. 1.

    The transformer of Vaswani et al. basically follows the encoder-decoder model with attention passed from encoder to decoder. Both encoder and decoder stack multiple identical layers. Each encoder layer uses self-attention to represent context. Each decoder layer also uses self-attention in two sub-layers. While the encoder's self-attention uses both left and right context, the lower sub-layer of decoder masks out the future positions while predicting the current position.
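
    The masking of future positions in the decoder can be illustrated with a minimal sketch. It assumes raw attention scores are already available as a square matrix; the function names are illustrative and not taken from any library. Scores for future positions are set to negative infinity before the softmax, so they receive zero weight.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """Mask future positions in an n x n score matrix, then softmax each row.

    scores[i][j] is the raw score of position i attending to position j.
    Entries with j > i are replaced by -inf, so position i can only
    attend to itself and to earlier positions.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        weights.append(softmax(masked))
    return weights


if __name__ == "__main__":
    uniform_scores = [[0.0] * 3 for _ in range(3)]
    for row in causal_attention_weights(uniform_scores):
        print(row)
```

    With uniform scores, the first position attends only to itself, while the last position spreads its weight equally over all three positions.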

    In each layer we find some common elements. Residual connections are made. These are added and normalized with connections flowing via the self-attention sub-layers. There are no recurrent networks, only a fully connected feed-forward network.
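
    The "add and normalize" step around each sub-layer can be sketched as follows. This is a simplified illustration for a single position vector: the learned gain and bias of layer normalization are omitted, and `feed_forward` is a hypothetical stand-in for a real sub-layer.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer):
    """Residual connection followed by normalization: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def feed_forward(x):
    """Hypothetical stand-in for a sub-layer: an element-wise ReLU."""
    return [max(0.0, v) for v in x]


if __name__ == "__main__":
    print(add_and_norm([1.0, -2.0, 3.0], feed_forward))
```

    The residual path lets the input bypass the sub-layer, which helps gradients flow when many layers are stacked.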

    At the input, source and target sequences are represented as embeddings. These are enhanced with positional encodings. At the output, a linear layer is followed with softmax.

    The transformer's encoder can work on the input sequence in parallel but the decoder is auto-regressive. Each output is influenced by previous output symbols. Output symbols are generated one at a time.
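
    Auto-regressive generation can be sketched as a simple loop. The sketch assumes greedy decoding, and `toy_next_token` is a hypothetical stand-in for a trained decoder; a real decoder would also attend to the encoder output.

```python
def greedy_decode(next_token_fn, start_token, end_token, max_len=10):
    """Generate tokens one at a time, feeding each output back as input.

    next_token_fn takes the tokens generated so far and returns the next one.
    Generation stops at end_token or when max_len tokens have been produced.
    """
    output = [start_token]
    while len(output) < max_len:
        token = next_token_fn(output)
        output.append(token)
        if token == end_token:
            break
    return output


def toy_next_token(prefix):
    """Hypothetical decoder: counts upwards, then emits the end token -1."""
    last = prefix[-1]
    return last + 1 if last < 3 else -1


if __name__ == "__main__":
    print(greedy_decode(toy_next_token, start_token=0, end_token=-1))
```

    Each call to the decoder sees all previously generated tokens, which is why decoding can't be parallelized across output positions in the way encoding can.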

  • How is self-attention computed in a transformer network?
    Attention is computed using query, key and value vectors. Source: Vaswani et al. 2017, fig. 1.

    Every word is projected on to three vectors: query, key and value. The respective weight matrices \(W\) that do this projection are learned during training. Suppose we're calculating the attention on a particular word. The dot product of its query vector with the key vector of each word is calculated. Each dot product is scaled by \(1/\sqrt{d_k}\), where \(d_k\) is the dimension of the key vectors, to compensate for large dot-product values. A softmax turns the scaled scores into weights. The value vectors are weighted with these weights and then summed.

    For better results, multi-head attention is used. Each head learns a different attention distribution, similar to having multiple filters in a CNN. For example, if the model dimension is 512, instead of a single large attention layer, we use 8 parallel attention layers, each operating in 64 dimensions. Outputs from the layers are concatenated and projected to derive the final attention. Mathematically, we have the following:

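    $$\begin{aligned}\text{MultiHead}(Q,K,V) &= \text{Concat}(\text{head}_1,\ldots,\text{head}_h)W^O\\\text{head}_i &= \text{Attention}(QW^{Q}_i, KW^{K}_i, VW^{V}_i)\\\text{Attention}(Q,K,V) &= \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\end{aligned}$$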

    The original transformer of Vaswani et al. uses self-attention within encoder and decoder, but also transfers attention from encoder to decoder as is common in traditional sequence-to-sequence models.
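
    The attention formulas above can be sketched in plain Python. This is a minimal illustration using lists of lists as matrices, not an efficient implementation; the helper names and the tiny example matrices are assumptions made for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(x, heads, w_o):
    """Multi-head self-attention over input x.

    heads is a list of (W_q, W_k, W_v) projection matrices, one triple per
    head. Head outputs are concatenated per position and projected by w_o.
    """
    outs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
            for wq, wk, wv in heads]
    concat = [sum((o[i] for o in outs), []) for i in range(len(x))]
    return matmul(concat, w_o)


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(attention(x, x, v))
```

    In the usage example, each query matches its own key best, so each output row is pulled towards the corresponding row of the value matrix while still mixing in the other row.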

  • How does the transformer network capture the position of words?

    In RNNs, the sequential structure accounts for position. In CNNs, positions are considered within the kernel size. In transformers, self-attention ignores the position of tokens within the sequence. To overcome this limitation, transformers explicitly add positional encodings. These are added to the input or output embeddings before the sum goes into the first attention layer.

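    Positional encodings can either be learned or fixed. In the latter case, Vaswani et al. used sine and cosine functions for even and odd dimensions of the encoding respectively. They also used a different frequency for each pair of dimensions to make it easier for the model to learn the positions:

    $$\begin{aligned}PE_{(pos,2i)} &= \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\\PE_{(pos,2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\end{aligned}$$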

    While Vaswani et al. (2017) considered absolute positions, Shaw et al. (2018) looked at the distance between tokens in a sequence, that is, relative positioning. They showed that this leads to better results for machine translation with the trade-off of 7% decrease in steps per second.
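
    The fixed sinusoidal encoding of Vaswani et al. can be sketched as below. The function names are illustrative. The constant 10000 comes from the original paper, while the small sizes in the usage example are arbitrary.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings.

    Even dimensions use sine and odd dimensions use cosine. Each pair of
    dimensions shares a frequency that decreases with the dimension index.
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def add_positions(embeddings):
    """Add positional encodings element-wise to a list of embedding vectors."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(embeddings, pe)]


if __name__ == "__main__":
    for row in positional_encoding(max_len=3, d_model=4):
        print([round(v, 4) for v in row])
```

    Because the encoding depends only on position and dimension, the same table can be precomputed once and reused for every sequence.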

  • Could you share some applications of the transformer network?
    BERT improves Google search results. Source: Nayak 2019.

    In October 2019, Google announced the use of BERT for 10% of its English language search. Search attempts to understand queries the way users naturally ask them, as opposed to parsing the query as a bunch of keywords. Thus, words and phrases such as "to" or "for someone" are important for meaning and BERT picks these up.

    We can use transformers to generate synthetic text. Starting from a small prompt, the GPT-2 model is able to generate long sequences and paragraphs of text that are realistic and coherent. This text also adapts to the style of the input.

    For correcting grammar, transformers provide competitive baseline performance. For sequence generation, Insertion Transformer and Levenshtein Transformer have been proposed.

    Transformers have been used beyond NLP, such as for image generation where self-attention is restricted to local neighbourhoods. Music Transformer applied self-attention to generate long pieces of music. While the original transformer used absolute positions, the music transformer used relative attention, allowing the model to create music in a consistent style.

  • Which are the well-known transformer networks?
    BERT is bidirectional while GPT (and GPT-2) is not. Source: Devlin et al. 2019, fig. 3.

    BERT is an encoder-only transformer. It's the first deeply bidirectional model, meaning that it uses both left and right contexts in all layers. BERT showed that as a pretrained language model it can be fine-tuned easily to obtain state-of-the-art models for many specific tasks. BERT has inspired many variants: RoBERTa, XLNet, MT-DNN, SpanBERT, VisualBERT, K-BERT, HUBERT, and more. Some variants attempt to compress the model: TinyBERT, ALERT, DistilBERT, and more.

    The other competitive model is GPT-2. Unlike BERT, GPT-2 is not bidirectional and is a decoder-only transformer. Its training includes both unsupervised pretraining and supervised fine-tuning. The training objective combines both of these to improve generalization and convergence. This approach of training on specific tasks is also seen in MT-DNN.

    GPT-2 is auto-regressive. Each output token is generated one by one. Once a token is generated, it's added to the input sequence. BERT is not auto-regressive but instead uses context from both sides. XLNet is auto-regressive while also using context from both sides.

  • What are some variations of the transformer network?
    Transformer-XL uses segment-level recurrence. Source: Dai et al. 2019, fig. 2.

    Compared to the original transformer of Vaswani et al., we note the following variations:

    • Transformer-XL: Overcomes the limitation of fixed-length context. It makes use of segment-level recurrence and relative positional encoding.
    • DS-Init & MAtt: Stacking many layers is problematic due to vanishing gradients. Therefore, depth-scaled initialization and merged attention sublayer are proposed.
    • Average Attention Network (AAN): With the original transformer, decoder's self-attention is slow due to its auto-regressive nature. Speed is improved by replacing self-attention with an averaging layer followed by a gating layer.
    • Dialogue Transformer: Conversation that has multiple overlapping topics can be picked out. Self-attention is over the dialogue sequence turns.
    • Tensor-Product Transformer: Uses novel TP-Attention to explicitly encode relations and applies it to math problem solving.
    • Tree Transformer: Puts a constraint on the encoder to follow tree structures that are more intuitive to humans. This also helps us learn grammatical structures from unlabelled data.
    • Tensorized Transformer: Multi-head attention is difficult to deploy in a resource-limited setting. Hence, multi-linear attention with Block-Term Tensor Decomposition (BTD) is proposed.

  • For a developer, what resources are out there to learn transformer networks?

    To get a feel of transformers in action, you can try out Talk to Transformer, which is based on the full-sized GPT-2.

    HuggingFace provides implementation of many transformer architectures in both TensorFlow and PyTorch. You can also convert them to CoreML models for iOS devices. Package spaCy also interfaces to HuggingFace.

    TensorFlow code and pretrained models for BERT are available. There's also code for Transformer-XL, MT-DNN and GPT-2.

    TensorFlow has provided an implementation for machine translation. Lilian Weng's implementation of the transformer is worth studying. Samuel Lynn-Evans has shared his implementation with explanations. The Annotated Transformer is another useful resource to learn the concepts along with the code.


A sequence-to-sequence model for machine translation. Source: Weng 2018.

Sutskever et al. at Google apply a sequence-to-sequence model to the task of machine translation, that is, a sequence of words in the source language is translated to a sequence of words in the target language. They use an encoder-decoder architecture that has separate 4-layered LSTMs for encoder and decoder. The encoder produces a fixed-length context vector, which is used to initialize the decoder. The main limitation is that the context vector is unable to adequately represent long sentences.

Encoder-decoder architecture with attention. Source: Weng 2018, fig. 4.

Bahdanau et al. apply the concept of attention to the seq2seq model used in machine translation. This helps the decoder to "pay attention" to important parts of the source sentence. The encoder is a bidirectional RNN. Unlike the seq2seq model of Sutskever et al., which uses only the encoder's last hidden state, the attention mechanism uses all hidden states of the encoder together with the decoder's hidden state to generate the context vector. It also aligns the input and output sequences, with the alignment score parameterized by a feed-forward network.
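
The additive attention of Bahdanau et al. can be sketched as below, with the alignment score computed as v·tanh(W s + U h) for the previous decoder state s and each encoder state h. This is a minimal illustration: the weights `W`, `U` and `v` would be learned in practice, and the values used here are arbitrary assumptions.

```python
import math


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def additive_attention(s_prev, enc_states, W, U, v):
    """Return (attention weights, context vector) for one decoder step.

    The score for each encoder state h is v . tanh(W s_prev + U h).
    Scores are normalized with softmax, and the context vector is the
    weighted sum of the encoder states.
    """
    ws = matvec(W, s_prev)
    scores = []
    for h in enc_states:
        hidden = [math.tanh(a + b) for a, b in zip(ws, matvec(U, h))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states)) for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    weights, context = additive_attention(
        [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], identity, identity, [1.0, 1.0])
    print(weights, context)
```

Unlike a fixed-length context vector, the context here is recomputed at every decoder step, so different output words can focus on different source words.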

Encoder self-attention distribution for the word 'it' in different contexts. Source: Uszkoreit 2017.

Vaswani et al. propose the transformer model in which they use a seq2seq model without RNN. The transformer model relies only on self-attention, although they're not the first to use self-attention. Self-attention is about attending to different tokens of the sequence.

Variations of self-attention in decoder-only transformer. Source: Liu et al. 2018, fig. 1.

For multi-document summarization, Liu et al. propose a decoder-only transformer architecture that can attend to sequences longer than what the encoder-decoder architecture is capable of. Input and output sequences are combined into a single sequence and used to train the decoder. During inference, output is generated auto-regressively. They also propose variations of attention to handle longer sequences.

GPT's transformer (left) and fine-tuning tasks (right). Source: Radford et al. 2018, fig. 1.

OpenAI publishes Generative Pre-trained Transformer (GPT). It's inspired by unsupervised pre-training and the transformer architecture. The transformer is trained on a large amount of data without supervision. It's then fine-tuned on smaller task-specific datasets with supervision. Pre-training uses a standard language modelling objective and Liu et al.'s decoder-only transformer. In February 2019, OpenAI announces an improved model named GPT-2. Compared to GPT, GPT-2 is trained on 10x the data and has 10x the parameters.


Google open sources Bidirectional Encoder Representations from Transformers (BERT), which is a pre-trained language model. It's deeply bidirectional and unsupervised. It improves the state of the art in many NLP tasks. It's trained on two tasks: (a) Masked Language Model (MLM), predicting some words that are masked in a sequence; (b) Next Sentence Prediction (NSP), binary classification that predicts if the next sentence follows the current sentence.
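
How training inputs for the Masked Language Model task might be prepared can be sketched as below. This is a simplified illustration: every selected token is replaced with a mask token, whereas the BERT paper also sometimes keeps the token or substitutes a random one. The 15% default rate follows the BERT paper; the function name and the whitespace tokenization are assumptions.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask tokens and return (masked tokens, labels).

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = mask_token
            labels[i] = tok
    return masked, labels


if __name__ == "__main__":
    sentence = "the man went to the store to buy a gallon of milk".split()
    masked, labels = mask_tokens(sentence, mask_rate=0.3)
    print(masked)
    print(labels)
```

Since the model sees the unmasked words on both sides of each mask, it can use left and right context when predicting the hidden word.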

Architecture of MT-DNN. Source: Liu et al. 2019, fig. 1.

Combining Multi-Task Learning (MTL) and a pretrained language model, Liu et al. propose Multi-Task Deep Neural Network (MT-DNN). Lower layers of the architecture are shared across tasks and use BERT. Higher layers do task-specific training. They show that this approach outperforms BERT in many tasks even without fine-tuning.

Transformer versus Evolved Transformer. Source: So et al. 2019, fig. 3.

Just as AutoML has been used in computer vision, Google researchers use an evolution-based neural architecture search (NAS) to discover what they call Evolved Transformer (ET). It performs better than the original transformer of Vaswani et al. It's seen that ET is a hybrid, combining the best of self-attention and wide convolution.


References

  1. Abnar, Samira. 2019. "From Attention in Transformers to Dynamic Routing in Capsule Nets." March 27. Accessed 2019-11-13.
  2. Alammar, Jay. 2018. "The Illustrated Transformer." June 27. Accessed 2019-11-12.
  3. Alammar, Jay. 2019. "The Illustrated GPT-2 (Visualizing Transformer Language Models)." August 12. Accessed 2019-11-12.
  4. Alikaniotis, Dimitrios, and Vipul Raheja. 2019. "The Unreasonable Effectiveness of Transformer Language Models in Grammatical Error Correction." arXiv, v1, June 04. Accessed 2019-11-12.
  5. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2016. "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv, v7, May 19. Accessed 2019-11-09.
  6. Bradbury, James. 2017. "Fully-parallel text generation for neural machine translation." Blog, November 08. Updated 2019-08-15. Accessed 2019-11-12.
  7. Dai, Zihang, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." arXiv, v3, June 02. Accessed 2019-11-12.
  8. Devlin, Jacob and Ming-Wei Chang. 2018. "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing." Google AI Blog, November 02. Accessed 2019-11-09.
  9. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv, v2, May 24. Accessed 2019-11-09.
  10. Giacaglia, Giuliano. 2019. "How Transformers Work." Towards Data Science, on Medium, March 11. Accessed 2019-11-12.
  11. Gu, Jiatao, Changhan Wang, and Jake Zhao. 2019. "Levenshtein Transformer." arXiv, v2, October 28. Accessed 2019-11-12.
  12. Honnibal, Matthew and Ines Montani. 2019. "spaCy meets Transformers: Fine-tune BERT, XLNet and GPT-2." Blog, August 02. Accessed 2019-11-12.
  13. Huang, Cheng-Zhi Anna, Ian Simon, and Monica Dinculescu. 2018. "Music Transformer: Generating Music with Long-Term Structure." Magenta, December 13. Updated 2019-09-16. Accessed 2019-11-12.
  14. Huggingface GitHub. 2019. "huggingface/transformers." November 12. Accessed 2019-11-12.
  15. Liu, Peter J., Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. "Generating Wikipedia by Summarizing Long Sequences." arXiv, v1, January 30. Accessed 2019-11-09.
  16. Liu, Xiaodong, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. "Multi-Task Deep Neural Networks for Natural Language Understanding." arXiv, v2, May 30. Accessed 2019-11-12.
  17. Ma, Xindian, Peng Zhang, Shuai Zhang, Nan Duan, Yuexian Hou, Dawei Song, and Ming Zhou. 2019. "A Tensorized Transformer for Language Modeling." arXiv, v3, November 06. Accessed 2019-11-12.
  18. Nayak, Pandu. 2019. "Understanding searches better than ever before." Google Blog, October 25. Accessed 2019-11-12.
  19. Nicholson, Chris. 2019. "A Beginner's Guide to Attention Mechanisms and Memory Networks." AI Wiki, Skymind. Accessed 2019-11-12.
  20. Parmar, Niki, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. "Image Transformer." arXiv, v3, June 15. Accessed 2019-11-12.
  21. Radford, Alec. 2018. "Improving Language Understanding with Unsupervised Learning." OpenAI Blog, June 11. Accessed 2019-11-09.
  22. Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. "Improving language understanding with unsupervised learning." Technical report, OpenAI. Accessed 2019-11-09.
  23. Radford, Alec, Jeffrey Wu, Dario Amodei, Daniela Amodei, Jack Clark, Miles Brundage, and Ilya Sutskever. 2019. "Better Language Models and Their Implications." OpenAI Blog, February 14. Accessed 2019-11-09.
  24. Schlag, Imanol, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. 2019. "Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving." arXiv, v1, October 15. Accessed 2019-11-12.
  25. Shaw, Peter, Jakob Uszkoreit, and Ashish Vaswani. 2018. "Self-Attention with Relative Position Representations." arXiv, v2, April 12. Accessed 2019-11-09.
  26. So, David. 2019. "Applying AutoML to Transformer Architectures." Google AI Blog, June 14. Accessed 2019-11-09.
  27. So, David R., Chen Liang, and Quoc V. Le. 2019. "The Evolved Transformer." arXiv, v4, May 17. Accessed 2019-11-09.
  28. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. "Sequence to Sequence Learning with Neural Networks." arXiv, v3, December 14. Accessed 2019-11-09.
  29. thunlp. 2019. "Must-read papers on pre-trained language models." PLMPapers, thunlp on GitHub, November. Accessed 2019-11-12.
  30. Uszkoreit, Jakob. 2017. "Transformer: A Novel Neural Network Architecture for Language Understanding." Google AI Blog, August 31. Accessed 2019-10-13.
  31. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." arXiv, v5, December 06. Accessed 2019-11-09.
  32. Vig, Jesse. 2019. "OpenAI GPT-2: Understanding Language Generation through Visualization." Towards Data Science, via Medium, March 05. Accessed 2019-10-13.
  33. Vlasov, Vladimir, Johannes E. M. Mosig, and Alan Nichol. 2019. "Dialogue Transformers." arXiv, v1, October 01. Accessed 2019-11-12.
  34. Wang, Yau-Shian, Hung-Yi Lee, and Yun-Nung Chen. 2019. "Tree Transformer: Integrating Tree Structures into Self-Attention." arXiv, v2, November 02. Accessed 2019-11-12.
  35. Weng, Lilian. 2018. "Attention? Attention!" Lil'Log, June 24. Accessed 2019-11-09.
  36. Yang, Zhilin and Quoc Le. 2019. "Transformer-XL: Unleashing the Potential of Attention Models." Google AI Blog, January 29. Accessed 2019-11-12.
  37. Zhang, Biao, Deyi Xiong, and Jinsong Su. 2018. "Accelerating Neural Transformer via an Average Attention Network." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, pp. 1789–1798, July. Accessed 2019-11-12.
  38. Zhang, Biao, Ivan Titov, and Rico Sennrich. 2019. "Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention." arXiv, v1, August 29. Accessed 2019-11-12.

Further Reading

  1. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." arXiv, v5, December 06. Accessed 2019-11-09.
  2. Weng, Lilian. 2018. "Attention? Attention!" Lil'Log, June 24. Accessed 2019-11-09.
  3. Alammar, Jay. 2018. "The Illustrated Transformer." June 27. Accessed 2019-11-12.
  4. Chromiak, Michal. 2017. "The Transformer – Attention is all you need." September 12. Updated 2017-10-30. Accessed 2019-11-12.
  5. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv, v2, May 24. Accessed 2019-11-09.
  6. Harvard NLP. 2018. "The Annotated Transformer." Harvard NLP, April 03. Accessed 2019-11-09.

Cite As

Devopedia. 2019. "Transformer Neural Network Architecture." Version 6, November 13. Accessed 2020-11-24.
Contributed by
1 author

Last updated on
2019-11-13 15:49:23