Attention Mechanism in Neural Networks

Context vectors (right) carry attention information from encoder to decoder. Source: Su 2018, fig. 15.

In machine translation, the encoder-decoder architecture is common. The encoder reads a sequence of words and represents it with a high-dimensional real-valued vector. This vector, often called the context vector, is given to the decoder, which then generates another sequence of words in the target language. If the input sequence is very long, a single vector from the encoder doesn't give enough information for the decoder.

Attention is about giving more contextual information to the decoder. At every decoding step, the decoder is informed how much "attention" it should give to each input word. While attention started this way in sequence-to-sequence modelling, it was later applied to words within the same sequence, giving rise to self-attention and transformer architecture.

Since the late 2010s, the attention mechanism has become popular, sometimes replacing CNNs, RNNs and LSTMs.


  • Could you explain attention with an example?
    Heatmap showing attention between source and target languages. Source: Bahdanau et al. 2016, fig. 3.

    Consider an example from machine translation. The sentence "The agreement on the European Economic Area was signed in August 1992" is to be translated to French, which might be "L'accord sur la zone économique européenne a été signé en août 1992". We can see that "Economic" becomes "économique" and "European" becomes "européenne", but their positions are swapped. The phrase "was signed" becomes "a été signé". Thus, translation depends not just on individual words but also their context within the sentence. Attention is meant to capture this context.

    In this example, attention is passed from the encoder to the decoder. The decoder generates the translated words one by one. Each output word is influenced by all input words in different amounts. Attention captures these weights.

    We can also visualize attention via heatmaps. In the figure, we map English words to translated French words. We note that sometimes a translated word is attended to by multiple English words. Lighter colours represent higher attention.

  • Could you describe the architecture of attention?
    Attention is calculated from hidden states of encoder and recent hidden state of decoder. Source: Karim 2019.

    Let's consider machine translation as explained by Bahdanau et al. (2014). The encoder is a bidirectional RNN while the decoder is an RNN. The input sequence is fed into the encoder, whose hidden states are exposed to the decoder via the attention layer. More specifically, the backward and forward hidden encoder states are concatenated. These states are weighted to give a context vector that's used by the decoder. Attention weights are calculated by aligning the decoder's last hidden state with the encoder hidden states.

    The decoder's current hidden state is a function of its previous hidden state, the previous output word and the context vector. Attention is passed via the context vector, which itself is based on the alignment of encoder and decoder states.

    Luong et al. proposed a slightly different architecture. Their encoder and decoder are each a 2-layer LSTM. It also uses a feedforward network for the final output. In Google's Neural Machine Translation, 8-layer LSTM is used in encoder and decoder. The first encoder layer is bidirectional. Both encoder and decoder include some residual connections.

  • What do you mean by "alignment" in the context of attention mechanism?
    Illustrating different alignment score functions. Source: Karim 2019.

    Bahdanau et al. align the decoder's sequence with the encoder's sequence. An alignment score quantifies how well output at position i is aligned to the input at position j. The context vector that goes to the decoder is based on the weighted sum of the encoder's RNN hidden states \(h_j\). These weights come from the alignment. Mathematically, given an alignment model a, alignment energy e, context vector c, and weights α, we have:

    $$e_{ij} = a(s_{i-1},h_j)\\ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}\\ c_i = \sum_{j=1}^{T_x}\alpha_{ij}h_j$$

    The decoder's hidden state is based on its previous hidden state \(s_{i-1}\), the previous predicted word and the current context vector. At each time step, the context vector is adjusted via the alignment model and attention. Thus, at each step, the decoder selectively attends to the input sequence via the encoder hidden states.
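
    As a rough illustration of the softmax and weighted sum described here, the sketch below turns a few alignment energies into attention weights and a context vector. The encoder states and energies are made-up toy values, and the alignment model \(a\) itself is left out: the energies are assumed to be already computed.

```python
import math

def softmax(scores):
    """Turn alignment energies e_ij into weights alpha_ij that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def context_vector(weights, encoder_states):
    """c_i = sum_j alpha_ij * h_j, computed one component at a time."""
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_states))
            for d in range(dim)]

# Toy values, assumed purely for illustration.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # h_1, h_2, h_3
energies = [2.0, 0.5, 0.5]                              # e_i1, e_i2, e_i3

alphas = softmax(energies)
context = context_vector(alphas, encoder_states)
print(alphas, context)
```

    The input position with the largest energy receives the largest weight, so its hidden state contributes most to the context vector for that decoding step.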

    Bahdanau et al. concatenated the forward and backward encoder hidden states and combined the result additively with the decoder hidden state to compute the score. Luong et al. proposed several alternative alignment score functions. Vaswani et al. proposed the scaled dot product.
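
    To make the comparison concrete, here is a minimal sketch of four commonly cited score functions: dot product, a "general" bilinear form, an additive form, and the scaled dot product. The function names, the weight matrices `W`, `W_s`, `W_h` and the vector `v` are illustrative assumptions; in a real model these parameters would be learned, and the exact parameterization varies between papers.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def score_dot(s, h):
    """Dot-product score: s . h"""
    return dot(s, h)

def score_general(s, h, W):
    """Bilinear ("general") score: s . (W h)"""
    return dot(s, matvec(W, h))

def score_additive(s, h, W_s, W_h, v):
    """Additive score: v . tanh(W_s s + W_h h)"""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W_s, s), matvec(W_h, h))]
    return dot(v, hidden)

def score_scaled_dot(q, k):
    """Scaled dot product: (q . k) / sqrt(d_k)"""
    return dot(q, k) / math.sqrt(len(k))

# Toy decoder state, encoder state and parameters (assumed values).
s = [1.0, 2.0]
h = [0.5, -1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

print(score_dot(s, h))
print(score_general(s, h, identity))
print(score_additive(s, h, identity, identity, v))
print(score_scaled_dot(s, h))
```

    Whichever score is used, the result plays the role of \(e_{ij}\) and is normalized with a softmax to obtain the attention weights.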

  • What is self-attention?
    Self-attention applied to sentiment analysis. Source: Cheng et al. 2016, fig. 5.

    Self-attention is about attending to words within the sequence, such as within the encoder or decoder. By seeing how one word attends to other words in the sequence, we're able to capture syntactical structures.

    Consider the sentence "The animal didn't cross the street because it was too tired". The word "it" refers to the animal. What happens if we replace "tired" with "wide"? The word "it" now refers to the street. Attention understands this. In the former case there's high attention linking "it" and "animal" but in the latter case high attention shifts to "street".

    Self-attention was earlier applied together with RNNs. Later, self-attention came to stand on its own. Vaswani et al.'s paper titled Attention is all you need showed how we can get rid of CNNs and RNNs. RNNs in particular are hard to parallelize on GPUs, which is a problem solved by self-attention.
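
    A stripped-down sketch of self-attention follows, using the scaled dot product as the score. As a simplifying assumption, each token vector serves directly as query, key and value; a transformer would first apply learned projections, and the token vectors here are toy values.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def self_attention(X):
    """Each token in X attends to every token in X, itself included.

    Returns the attended outputs and the attention weights per token.
    """
    d = len(X[0])
    outputs, all_weights = [], []
    for q in X:
        # Scaled dot-product score between this token and every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        # Output is the attention-weighted sum of all token vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, X)) for c in range(d)]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights

# Three toy token vectors; the first and third are identical.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(X)
for row in weights:
    print(row)
```

    Since every token's output depends only on the input vectors and not on a previous step's output, the loop over tokens could run in parallel, unlike the step-by-step recurrence of an RNN.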

  • Could you describe some applications of attention mechanism?
    Attention heatmaps of clinical events in ICUs. Source: Kaji et al. 2019, fig. 3.

    Beyond its early application to machine translation, the attention mechanism has been applied to other NLP tasks such as sentiment analysis, POS tagging, document classification, text classification, and relation classification. One study used human eye-tracking corpora to derive attention and enhance NLP tasks. In another study, semantic role labelling was improved using linguistically-informed self-attention.

    By combining CNN with self-attention, the Google Brain team achieved top results for image classification and object detection. In Visual Question Answering (VQA), where there's a need to focus on small areas or details of the image, attention mechanism is useful. Attention is also useful for image captioning.

    In speech recognition, attention aligns characters and audio.

    In one medical study, higher attention was given to abnormal heartbeats from ECG readings to more accurately detect specific heart conditions. In another study based on ICU data, feature-level attention was used rather than attention on embeddings. This provided physicians better interpretability.

  • What's the difference between global and local attention?
    Local attention using a Gaussian function. Source: Ramamoorthy 2018.

    The distinction between global versus local attention originated in Luong et al. (2015). In the task of neural machine translation, global attention implies we attend to all the input words, and local attention means we attend to only a subset of words.

    It's said that local attention is a combination of hard and soft attentions. Like hard attention, it focuses on a subset. Like soft attention, it's differentiable and hence easier to implement and train. It's computationally simpler than global or soft attentions.

    Given the decoder's current state, local attention first selects the best aligned position \(p_t\) in the input sequence. Note that selecting \(p_t\) is not directly influenced by the encoder's states. Local attention is also called window-based attention because it's about selecting a window of input tokens for attention distribution. This window is centred on \(p_t\). To keep the approach differentiable, a Gaussian distribution is applied on the window. Attention \(a_t\) is therefore focused around \(p_t\).
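
    The windowing and Gaussian weighting can be sketched as below. This is a simplified illustration, not a full implementation: the centre `p_t` and the alignment scores are assumed to be given, `D` is an assumed name for the half-width of the window, and the standard deviation is set to `D / 2`.

```python
import math

def local_attention_weights(scores, p_t, D):
    """Softmax over a window around p_t, damped by a Gaussian centred on p_t."""
    sigma = D / 2.0
    lo = max(0, math.ceil(p_t - D))                 # first position in window
    hi = min(len(scores) - 1, math.floor(p_t + D))  # last position in window
    window = range(lo, hi + 1)

    # Softmax restricted to positions inside the window.
    m = max(scores[s] for s in window)
    exps = {s: math.exp(scores[s] - m) for s in window}
    total = sum(exps.values())

    weights = [0.0] * len(scores)  # positions outside the window get zero
    for s in window:
        gaussian = math.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))
        weights[s] = (exps[s] / total) * gaussian
    return weights

# Seven input positions with equal scores, window centred on position 3.
weights = local_attention_weights([0.0] * 7, p_t=3.0, D=2)
print(weights)
```

    With equal scores, the weights fall off symmetrically around `p_t`. Note that after the Gaussian damping the weights no longer sum to one in this sketch.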



Even before the attention mechanism becomes popular in NLP in later years, it's used in computer vision. Mnih et al. propose a method to focus on important parts of an image, which are then processed at high resolution. Instead of processing the entire image at once, the image is processed sequentially, attending to different locations as relevant to the task.

Encoder-decoder architecture with attention. Source: Weng 2018, fig. 4.

Bahdanau et al. apply the concept of attention to the seq2seq model used in machine translation. This helps the decoder to "pay attention" to important parts of the source sentence. The encoder is a bidirectional RNN. Unlike earlier seq2seq models that use only the encoder's last hidden state, the attention mechanism uses all hidden states of encoder and decoder to generate the context vector. It also aligns the input and output sequences, with the alignment score parameterized by a feed-forward network.

Soft attention versus hard attention in computer vision. Source: Xu et al. 2016, fig. 2.

Xu et al. propose the use of visual attention for the task of image captioning. They distinguish between soft attention and hard attention. Soft deterministic attention is smooth and differentiable, and is trained by standard backpropagation. Hard stochastic attention is trained by maximizing an approximate variational lower bound. Soft attention is similar to Bahdanau et al.'s proposal.
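
The difference between the two can be sketched in a few lines. This is an illustration under simplifying assumptions, not the training procedure from the paper: the attention weights and region features are toy values, soft attention takes the weighted average of all regions, and hard attention samples a single region according to the weights.

```python
import random

def soft_attention(weights, features):
    """Deterministic: the expected feature vector under the attention weights."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)]

def hard_attention(weights, features, rng):
    """Stochastic: pick exactly one region, sampled according to the weights."""
    idx = rng.choices(range(len(features)), weights=weights, k=1)[0]
    return features[idx]

# Toy image-region features and attention weights (assumed values).
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [0.7, 0.2, 0.1]

soft = soft_attention(weights, features)
hard = hard_attention(weights, features, random.Random(0))
print(soft, hard)
```

The soft output changes smoothly as the weights change, which is why it can be trained with backpropagation. The sampling step in the hard variant is not differentiable, which is why a different training objective is needed.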

(a) Single layer model. (b) Multi-hop attention model. Source: Sukhbaatar et al. 2015, fig. 1.

Sukhbaatar et al. propose the concept of multi-hop attention. Each hop or layer contains attention weights. Input and output from each layer are fed to the next higher layer. Thus, no hard decisions are taken in each layer. Outputs from each layer are passed on in a "soft" manner until prediction after the last layer.
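
A minimal sketch of the multi-hop idea is given below. It assumes each hop attends over a set of memory vectors with a softmax, reads out a weighted sum, and adds that read-out to the query before the next hop. The memory contents, the function names and the reuse of the same memories at every hop are simplifications for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hop(u, memory_in, memory_out):
    """One attention hop: soft weights over memories, then a weighted read-out."""
    p = softmax([dot(u, m) for m in memory_in])
    o = [sum(pi * c[d] for pi, c in zip(p, memory_out))
         for d in range(len(u))]
    # The next hop's query combines the current query and the read-out.
    return [a + b for a, b in zip(u, o)], p

def multi_hop(u, memory_in, memory_out, hops):
    history = []
    for _ in range(hops):
        u, p = hop(u, memory_in, memory_out)
        history.append(p)
    return u, history

# Toy query and memories (assumed values).
query = [1.0, 0.0]
memory = [[1.0, 0.0], [0.0, 1.0]]
final_state, history = multi_hop(query, memory, memory, hops=3)
print(final_state, history)
```

Every hop produces a full distribution over the memories rather than a single choice, so information from all of them flows on to the next hop.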

Comparing global attention versus local attention. Source: Luong et al. 2015, fig. 2, 3.

Luong et al. distinguish between global attention versus local attention. In global attention, we attend to all the input words. In local attention, we attend to only a subset of words.

Attention-based VQA outperforms traditional VQA. Source: Chen et al. 2016, fig. 1.

The attention mechanism has been applied to computer vision. For Visual Question Answering (VQA), Chen et al. propose the Attention-Based Configurable Convolutional Neural Network (ABC-CNN). The query text guides the model to pay attention to relevant image regions. In traditional VQA models, visual processing and question understanding are done separately.

Visualizing attention at word level and sentence level. Source: Yang et al. 2016, fig. 6.

Yang et al. propose the Hierarchical Attention Network (HAN). This comes from their insight that documents have a hierarchical structure (words, sentences, document). HAN has two levels of attention: word level and sentence level. This enables the model to more accurately do document classification. It improves on the state-of-the-art when tested on Yelp, IMDb and Amazon datasets. Kränkel and Lee provide a good description of HAN.

Embed, encode, attend, and predict. Source: Honnibal 2016.

Matthew Honnibal describes what he calls the new neural network playbook for NLP. It's a four-step approach: embed, encode, attend, predict. This highlights the importance and usefulness of the attention mechanism. Word embeddings do the embed part at word level. Bidirectional RNNs do the encode part at sequence level. Using a context vector, the attend part produces a vector that's given to a feed-forward network. The predict part is done by this network.

Shallow or deep inter-attention and intra-attention fusion. Source: Cheng et al. 2016, fig. 3.

For the task of machine reading, Cheng et al. propose the use of both inter-attention (between encoder and decoder) and intra-attention (within encoder or decoder). Intra-attention (later to be called self-attention) is about attending to tokens within a sequence, thus uncovering lexical relations between tokens.

Encoder self-attention distribution for the word 'it' in different contexts. Source: Uszkoreit 2017.

Vaswani et al. propose the transformer model in which they use a seq2seq model without RNN. The transformer model relies only on self-attention. Self-attention is about attending to different tokens of the sequence. Self-attention leads to powerful language models including BERT (2018) and GPT-2 (2019).

Attention mechanism (left) and multi-head attention (right). Source: Veličković et al. 2018, fig. 1.

Veličković et al. apply attention to nodes in a graph. Nodes attend to neighbours. The computation can be parallelized.


  1. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2016. "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv, v7, May 19. Accessed 2019-11-13.
  2. Barrett, Maria, Joachim Bingel, Nora Hollenstein, Marek Rei, and Anders Søgaard. 2018. "Sequence Classification with Human Attention." Proceedings of the 22nd Conference on Computational Natural Language Learning, Association for Computational Linguistics, pp. 302-312, October. Accessed 2019-11-14.
  3. Chan, William, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2015. "Listen, Attend and Spell." arXiv, v2, August 20. Accessed 2019-11-14.
  4. Chen, Kan, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. 2016. "ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering." arXiv, v2, April 03. Accessed 2019-11-13.
  5. Cheng, Jianpeng, Li Dong, and Mirella Lapata. 2016. "Long Short-Term Memory-Networks for Machine Reading." Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 551-561, November. Accessed 2019-11-14.
  6. Culurciello, Eugenio. 2018. "The fall of RNN / LSTM." Towards Data Science, on Medium, April 13. Accessed 2019-11-13.
  7. Honnibal, Matthew. 2016. "Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models." Blog, November 10. Accessed 2019-11-14.
  8. Kaji, Deepak A., John R. Zech, Jun S. Kim, Samuel K. Cho, Neha S. Dangayach, Anthony B. Costa, and Eric K. Oermann. 2019. "An attention based deep learning model of clinical events in the intensive care unit." PLOS, February 13. Accessed 2019-11-13.
  9. Karim, Raimi. 2019. "Attn: Illustrated Attention." Towards Data Science, on Medium, January 20. Accessed 2019-11-13.
  10. Kembhavi, Aniruddha. 2019. "May I have your attention please?" AI2, on Medium, June 14. Accessed 2019-11-13.
  11. Kränkel, Maria, and Hee-Eun Lee. 2019. "Text Classification with Hierarchical Attention Networks." Seminar Information Systems, HU-Berlin, February 08. Accessed 2019-11-13.
  12. Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. 2015. "Effective Approaches to Attention-based Neural Machine Translation." arXiv, v5, September 20. Accessed 2019-11-14.
  13. Mnih, Volodymyr, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. "Recurrent Models of Visual Attention." arXiv, v1, June 24. Accessed 2019-11-14.
  14. Ramamoorthy, Suriyadeepan. 2018. "Attention Mechanism: Benefits and Applications." Blog, Saama, April 19. Accessed 2019-11-13.
  15. Reichman, Ran. 2019. "Attention Augmented Convolutional Networks." LyrnAI, May 03. Accessed 2019-11-13.
  16. Strubell, Emma, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. "Linguistically-Informed Self-Attention for Semantic Role Labeling." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 5027-5038, October-November. Accessed 2019-11-14.
  17. Su, Ta-Chun. 2018. "Seq2seq pay Attention to Self Attention: Part 1." Medium, October 03. Accessed 2019-11-13.
  18. Sukhbaatar, Sainbayar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. "End-To-End Memory Networks." arXiv, v5, November 24. Accessed 2019-11-14.
  19. Uszkoreit, Jakob. 2017. "Transformer: A Novel Neural Network Architecture for Language Understanding." Google AI Blog, August 31. Accessed 2019-11-14.
  20. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." arXiv, v5, December 06. Accessed 2019-11-13.
  21. Veličković, Petar, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. "Graph Attention Networks." arXiv, v3, February 04. Accessed 2019-11-13.
  22. Vig, Jesse. 2019. "OpenAI GPT-2: Understanding Language Generation through Visualization." Towards Data Science, via Medium, March 05. Accessed 2019-11-14.
  23. Weng, Lilian. 2018. "Attention? Attention!" Lil'Log, June 24. Accessed 2019-11-13.
  24. Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2016. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." arXiv, v3, April 19. Accessed 2019-11-14.
  25. Yang, Zichao, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. "Hierarchical Attention Networks for Document Classification." Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, pp. 1480-1489, June. Accessed 2019-11-13.
  26. Zhang, Yue and Jie Li. 2019. "Application of Heartbeat-Attention Mechanism for Detection of Myocardial Infarction Using 12-Lead ECG Records." Applied Sciences, vol. 9, no. 16, 3328. Accessed 2019-11-13.
  27. Zhang, Xiaobin, Fucai Chen, and Ruiyang Huang. 2018. "A Combination of RNN and CNN for Attention-based Relation Classification." Procedia Computer Science, vol. 131, pp. 911-917, Elsevier. Accessed 2019-11-13.

Further Reading

  1. Karim, Raimi. 2019. "Attn: Illustrated Attention." Towards Data Science, on Medium, January 20. Accessed 2019-11-13.
  2. Loye, Gabriel. 2019. "Attention Mechanism." Blog, FloydHub, September 15. Accessed 2019-11-13.
  3. Kim, Yoon, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. "Structured Attention Networks." Harvard NLP. Accessed 2019-11-13.
  4. Konstantinov, Michael. 2019. "Neural Machine Translation With Attention Mechanism: Step-by-step Guide." Eleks Labs, June 25. Accessed 2019-11-13.
  5. Tran, Trung. 2019. "Neural Machine Translation With Attention Mechanism." Machine Talk, March 29. Accessed 2019-11-13.
  6. Honnibal, Matthew. 2016. "Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models." Blog, November 10. Accessed 2019-11-14.

Cite As

Devopedia. 2019. "Attention Mechanism in Neural Networks." Version 3, November 16. Accessed 2020-11-24.
Contributed by
1 author

Last updated on
2019-11-16 10:40:47