# Attention Mechanism in Neural Networks

In machine translation, the encoder-decoder architecture is common. The encoder reads a sequence of words and represents it with a high-dimensional real-valued vector. This vector, often called the context vector, is given to the decoder, which then generates another sequence of words in the target language. If the input sequence is long, a single fixed-length vector from the encoder can't capture enough information for the decoder.

Attention is about giving more contextual information to the decoder. At every decoding step, the decoder is informed how much "attention" it should give to each input word. While attention started this way in sequence-to-sequence modelling, it was later applied to words within the same sequence, giving rise to self-attention and the transformer architecture.

Since the late 2010s, the attention mechanism has become popular, sometimes replacing CNNs, RNNs and LSTMs.

## Discussion

• Could you explain attention with an example?

Consider an example from machine translation. The sentence "The agreement on the European Economic Area was signed in August 1992" is to be translated to French, which might be "L'accord sur la zone économique européenne a été signé en août 1992". We can see that "Economic" becomes "économique" and "European" becomes "européenne", but their positions are swapped. The phrase "was signed" becomes "a été signé". Thus, translation depends not just on individual words but also their context within the sentence. Attention is meant to capture this context.

In this example, attention is passed from the encoder to the decoder. The decoder generates the translated words one by one. Each output word is influenced by all input words in different amounts. Attention captures these weights.

We can also visualize attention via heatmaps. In such a heatmap, English source words are mapped to translated French words. We note that sometimes a translated word is attended to by multiple English words. Lighter colours represent higher attention.
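As a rough illustration, here's a minimal matplotlib sketch that renders an attention matrix as a heatmap. The words and weights below are invented placeholders, not output from a trained model.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented attention weights: rows are French output words, columns are English input words
src = ["The", "agreement", "on", "the", "European", "Economic", "Area"]
tgt = ["L'", "accord", "sur", "la", "zone", "économique", "européenne"]
attn = np.random.default_rng(0).random((len(tgt), len(src)))
attn /= attn.sum(axis=1, keepdims=True)      # each output word's weights sum to 1

plt.imshow(attn, cmap="gray")                # lighter cells mean higher attention
plt.xticks(range(len(src)), src, rotation=45)
plt.yticks(range(len(tgt)), tgt)
plt.xlabel("Source (English)")
plt.ylabel("Target (French)")
plt.tight_layout()
plt.show()
```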

• Could you describe the architecture of attention?

Let's consider machine translation as explained by Bahdanau et al. (2014). The encoder is a bidirectional RNN while the decoder is a unidirectional RNN. The input sequence is fed into the encoder, whose hidden states are exposed to the decoder via the attention layer. More specifically, the backward and forward encoder hidden states are concatenated. These states are weighted to give a context vector that's used by the decoder. Attention weights are calculated by aligning the decoder's previous hidden state with the encoder hidden states.

The decoder's current hidden state is a function of its previous hidden state, the previous output word and the context vector. Attention is passed via the context vector, which itself is based on the alignment of encoder and decoder states.

Luong et al. proposed a slightly different architecture. Their encoder and decoder are each a 2-layer LSTM. Their model also uses a feedforward network for the final output. In Google's Neural Machine Translation, an 8-layer LSTM is used in both the encoder and the decoder. The first encoder layer is bidirectional. Both encoder and decoder include residual connections.

• What do you mean by "alignment" in the context of attention mechanism?

Bahdanau et al. align the decoder's sequence with the encoder's sequence. An alignment score quantifies how well the output at position $$i$$ is aligned with the input at position $$j$$. The context vector that goes to the decoder is a weighted sum of the encoder's RNN hidden states $$h_j$$. These weights come from the alignment. Mathematically, given an alignment model $$a$$, alignment energy $$e$$, context vector $$c$$, and weights $$\alpha$$, we have:

$$e_{ij} = a(s_{i-1}, h_j)$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

The decoder's hidden state is based on its previous hidden state $$s_{i-1}$$, the previous predicted word and the current context vector. At each time step, the context vector is adjusted via the alignment model and attention. Thus, at each step, the decoder selectively attends to the input sequence via the encoder hidden states.

Bahdanau et al. concatenate the forward and backward encoder hidden states to form $$h_j$$; their alignment model is a small feed-forward network that adds a projection of the previous decoder state to a projection of $$h_j$$. Luong et al. proposed several alternative alignment scores, such as dot, general and concat. Vaswani et al. proposed the scaled dot product.
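To make the formulas concrete, here's a minimal NumPy sketch of one step of additive (Bahdanau-style) attention. The parameter names `W_s`, `W_h` and `v` are illustrative stand-ins for the alignment model's learned weights, and the inputs are random toy values.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, H, W_s, W_h, v):
    """One step of additive (Bahdanau-style) attention.

    s_prev : (d_s,)     previous decoder hidden state s_{i-1}
    H      : (T_x, d_h) encoder hidden states h_1..h_Tx
    W_s, W_h, v         illustrative alignment-model weights
    """
    # Alignment energies e_ij = v^T tanh(W_s s_{i-1} + W_h h_j)
    e = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v   # shape (T_x,)
    alpha = softmax(e)                            # attention weights, sum to 1
    c = alpha @ H                                 # context vector c_i, shape (d_h,)
    return c, alpha

# Toy usage with random values
rng = np.random.default_rng(0)
T_x, d_h, d_s, d_a = 5, 8, 6, 4
c, alpha = additive_attention(
    rng.normal(size=d_s),            # s_{i-1}
    rng.normal(size=(T_x, d_h)),     # encoder states
    rng.normal(size=(d_a, d_s)),     # W_s
    rng.normal(size=(d_a, d_h)),     # W_h
    rng.normal(size=d_a))            # v
print(c.shape, alpha.sum())          # (8,) and weights summing to 1
```

In a full decoder, the returned context vector would be combined with the previous state and previous output word to produce the next hidden state.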

• What is self-attention?

Self-attention is about attending to words within the sequence, such as within the encoder or decoder. By seeing how one word attends to other words in the sequence, we're able to capture syntactical structures.

Consider the sentence "The animal didn't cross the street because it was too tired". The word "it" refers to the animal. What happens if we replace "tired" with "wide"? The word "it" now refers to the street. Attention understands this. In the former case there's high attention linking "it" and "animal" but in the latter case high attention shifts to "street".

Self-attention was earlier applied together with RNNs. Later, self-attention came to stand on its own. Vaswani et al.'s paper titled "Attention Is All You Need" showed how we can get rid of CNNs and RNNs. RNNs in particular are hard to parallelize on GPUs, a problem that self-attention solves.
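As an illustration, here's a minimal NumPy sketch of single-head scaled dot-product self-attention over one sequence. The projection matrices are random placeholders rather than trained weights.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence.

    X : (T, d_model) token embeddings
    W_q, W_k, W_v : (d_model, d_k) placeholder projections for queries, keys, values
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T, T): every token scores every token
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights                      # attended outputs and attention map

# Toy usage
rng = np.random.default_rng(1)
T, d_model, d_k = 6, 16, 8
X = rng.normal(size=(T, d_model))
out, attn = self_attention(X,
                           rng.normal(size=(d_model, d_k)),
                           rng.normal(size=(d_model, d_k)),
                           rng.normal(size=(d_model, d_k)))
print(out.shape, attn.shape)   # (6, 8) (6, 6)
```

Because every token's output is computed independently as a weighted sum over all tokens, the whole computation is a few matrix multiplications and parallelizes well on GPUs.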

• Could you describe some applications of attention mechanism?

Beyond its early application to machine translation, the attention mechanism has been applied to other NLP tasks such as sentiment analysis, POS tagging, document classification, text classification, and relation classification. One study used human eye-tracking corpora to derive attention and enhance NLP tasks. In another study, semantic role labelling was improved using linguistically-informed self-attention.

By combining CNN with self-attention, the Google Brain team achieved top results for image classification and object detection. In Visual Question Answering (VQA), where there's a need to focus on small areas or details of the image, attention mechanism is useful. Attention is also useful for image captioning.

In speech recognition, attention aligns characters and audio.

In one medical study, higher attention was given to abnormal heartbeats from ECG readings to more accurately detect specific heart conditions. In another study based on ICU data, feature-level attention was used rather than attention on embeddings. This provided physicians better interpretability.

• What's the difference between global and local attention?

The distinction between global versus local attention originated in Luong et al. (2015). In the task of neural machine translation, global attention implies we attend to all the input words, and local attention means we attend to only a subset of words.

Local attention can be seen as a combination of hard and soft attention. Like hard attention, it focuses on a subset of the input. Like soft attention, it's differentiable and hence easier to implement and train. It's also computationally cheaper than global or soft attention.

Given the decoder's current state, local attention first predicts the best aligned position $$p_t$$ in the input sequence. Note that selecting $$p_t$$ is not directly influenced by the encoder's states. Local attention is also called window-based attention because it selects a window of input tokens over which the attention distribution is computed. This window is centred on $$p_t$$. To keep the approach differentiable, a Gaussian distribution centred on $$p_t$$ is applied over the window. Attention $$a_t$$ is therefore focused around $$p_t$$.
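Here's a minimal NumPy sketch of this idea, loosely following Luong et al.'s predictive local attention. The parameters `W_p` and `v_p` are illustrative weights for predicting the window centre, and a simple dot score stands in for the alignment function.

```python
import numpy as np

def local_attention(h_t, H_enc, W_p, v_p, D=2):
    """Local (window-based) attention, loosely following Luong et al.'s local-p variant.

    h_t   : (d,)   current decoder hidden state
    H_enc : (S, d) encoder hidden states
    W_p, v_p       illustrative parameters for predicting the window centre p_t
    D              half-width of the attention window
    """
    S = H_enc.shape[0]
    # Predict the aligned position p_t in [0, S] from the decoder state alone
    p_t = S * (1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t)))))
    # Content-based alignment (dot score) followed by a softmax
    scores = H_enc @ h_t
    align = np.exp(scores - scores.max())
    align /= align.sum()
    # Gaussian centred on p_t keeps the windowing differentiable
    sigma = D / 2.0
    a_t = align * np.exp(-((np.arange(S) - p_t) ** 2) / (2 * sigma ** 2))
    context = a_t @ H_enc
    return context, a_t

# Toy usage
rng = np.random.default_rng(2)
S, d = 10, 8
context, a_t = local_attention(rng.normal(size=d), rng.normal(size=(S, d)),
                               rng.normal(size=(d, d)), rng.normal(size=d))
print(context.shape)   # (8,)
```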

## Milestones

Jun
2014

Even before the attention mechanism becomes popular in NLP in later years, it's used in computer vision. Mnih et al. propose a method to focus on important parts of an image that are then processed at high resolution. Instead of processing the entire image at once, it's processed sequentially, attending to different locations as relevant to the task.

Sep
2014

Bahdanau et al. apply the concept of attention to the seq2seq model used in machine translation. This helps the decoder to "pay attention" to important parts of the source sentence. The encoder is a bidirectional RNN. Unlike earlier seq2seq models that use only the encoder's last hidden state, the attention mechanism uses all encoder hidden states, weighted by their alignment with the decoder state, to generate the context vector. It thus aligns the input and output sequences, with the alignment score parameterized by a feed-forward network.

Feb
2015

Xu et al. propose the use of visual attention for the task of image captioning. They distinguish between soft attention and hard attention. Soft deterministic attention is smooth and differentiable, and is trained by standard backpropagation. Hard stochastic attention is trained by maximizing an approximate variational lower bound. Soft attention is similar to Bahdanau et al.'s proposal.

Mar
2015

Sukhbaatar et al. propose the concept of multi-hop attention. Each hop or layer contains attention weights. Input and output from each layer are fed to the next higher layer. Thus, no hard decisions are taken in each layer. Outputs from each layer are passed on in a "soft" manner until prediction after the last layer.

Aug
2015

Luong et al. distinguish between global attention versus local attention. In global attention, we attend to all the input words. In local attention, we attend to only a subset of words.

Nov
2015

The attention mechanism is also applied to computer vision tasks. For Visual Question Answering (VQA), Chen et al. propose the Attention-Based Configurable Convolutional Neural Network (ABC-CNN). The query text guides the model to pay attention to relevant image regions. In traditional VQA models, visual processing and question understanding are done separately.

Jun
2016

Yang et al. propose the Hierarchical Attention Network (HAN). This comes from their insight that documents have a hierarchical structure (words, sentences, document). HAN has two levels of attention: word level and sentence level. This enables the model to more accurately do document classification. It improves on the state-of-the-art when tested on Yelp, IMDb and Amazon datasets. Kränkel and Lee provide a good description of HAN.

Nov
2016

Matthew Honnibal describes what he calls the new neural network playbook for NLP. It's a four-step approach: embed, encode, attend, predict. This highlights the importance and usefulness of the attention mechanism. Word embeddings do the embed part at word level. Bidirectional RNNs do the encode part at sequence level. Using a context vector, the attend part produces a single vector that's given to a feed-forward network. The predict part is done by this network.

Nov
2016

For the task of machine reading, Cheng et al. propose the use of both inter-attention (between encoder and decoder) and intra-attention (within encoder or decoder). Intra-attention (later to be called self-attention) is about attending to tokens within a sequence, thus uncovering lexical relations between tokens.

Jun
2017

Vaswani et al. propose the transformer, a seq2seq model that does away with RNNs entirely. The model relies only on self-attention, that is, attending to other tokens of the same sequence. Self-attention leads to powerful language models including BERT (2018) and GPT-2 (2019).

Oct
2017

Veličković et al. apply attention to nodes in a graph. Nodes attend to neighbours. The computation can be parallelized.
