Bidirectional RNN

Many applications are sequential in nature. One input follows another in time. Dependencies among these give us important clues as to how they should be processed. Since Recurrent Neural Networks (RNNs) model the flow of time, they're suited for these applications.

An RNN has the limitation that it processes inputs in strict temporal order: the current input has context from previous inputs but not from future ones. A Bidirectional RNN (BRNN) duplicates the RNN processing chain so that inputs are processed in both forward and reverse time order. This allows a BRNN to look at future context as well.

Two common variants of RNN include GRU and LSTM. LSTM does better than RNN in capturing long-term dependencies. Bidirectional LSTM (BiLSTM) in particular is a popular choice in NLP. These variants are also within the scope of this article.

Discussion

  • Could you explain Bidirectional RNN with an example?
    Bidirectional RNN has forward and backward RNNs. Source: MLWhiz 2018.

    Consider the phrase 'He said, "Teddy ___"'. From these three opening words it's difficult to conclude whether the sentence is about Teddy bears or Teddy Roosevelt. This is because the context that clarifies Teddy comes later. RNNs (including GRUs and LSTMs) obtain context only in one direction, from the preceding words. They're unable to look ahead into future words.

    Bidirectional RNNs solve this problem by processing the sequence in both directions. Typically, two separate RNNs are used: one for the forward direction and one for the reverse direction. This gives two hidden states per time step, one from each RNN, which are usually concatenated to form a single hidden state.

    The final hidden state goes to a decoder, such as a fully connected network followed by softmax. Depending on the design of the neural network, the output from a BRNN can be either the complete sequence of hidden states or only the state from the last time step. If a single hidden state is given to the decoder, it's typically the concatenation of the forward RNN's final state and the backward RNN's final state (which corresponds to the first time step).
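
    To make the mechanics concrete, here's a minimal NumPy sketch (not from any of the cited papers; the weight shapes and random initialization are purely illustrative) of two simple RNNs processing the same sequence in opposite directions, with their hidden states concatenated at each time step:

        import numpy as np

        def rnn_pass(inputs, Wx, Wh, b):
            """Run a simple tanh RNN over a time-major sequence and return all hidden states."""
            h = np.zeros(Wh.shape[0])
            states = []
            for x in inputs:
                h = np.tanh(Wx @ x + Wh @ h + b)
                states.append(h)
            return states

        T, n_in, n_hid = 5, 8, 16
        rng = np.random.default_rng(0)
        seq = rng.normal(size=(T, n_in))

        # Separate parameters for the forward and backward RNNs.
        params_fwd = (rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid)), np.zeros(n_hid))
        params_bwd = (rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid)), np.zeros(n_hid))

        h_fwd = rnn_pass(seq, *params_fwd)              # processes t = 0 .. T-1
        h_bwd = rnn_pass(seq[::-1], *params_bwd)[::-1]  # processes t = T-1 .. 0, re-aligned to time order

        # Concatenate per time step: each step now carries both past and future context.
        h_bi = np.concatenate([np.stack(h_fwd), np.stack(h_bwd)], axis=-1)
        print(h_bi.shape)  # (5, 32)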

  • What are some applications of Bidirectional RNN?
    Use of BiLSTM and CRF for NER. Source: Lample et al. 2016, fig. 1.

    BiLSTM has become a popular architecture for many NLP tasks. An early application of BiLSTM was in the domain of speech recognition. Other applications include sentence classification, sentiment analysis, review generation, and even medical event detection in electronic health records.

    BiLSTM has been used for POS tagging and Word Sense Disambiguation (WSD). For Named Entity Recognition (NER), Lample et al. used word representations that captured both character-level characteristics and word-level context. These were fed into a BiLSTM encoder layer. The sequence of hidden states was decoded by a CRF layer. A simplified sketch of this encoder pattern appears at the end of this answer.

    For lemmatization, one study used two-layer bidirectional GRUs for the encoder. The decoder was a conditional GRU plus another GRU layer. Another study used a two-layer BiLSTM encoder and a one-layer LSTM decoder. A stack of four BiLSTMs has been used for Semantic Role Labelling (SRL).

    In general, the paradigm of embed-encode-attend-predict has become popular in NLP work. The encode part benefits from BiLSTM, which has been shown to capture position-sensitive features.

    Beyond NLP, BiLSTM has been applied to image processing applications such as OCR.
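
    As a rough illustration of the BiLSTM-encoder pattern for sequence tagging, here's a short Keras sketch. It's a simplification, not Lample et al.'s model: a per-token softmax stands in for their CRF decoder, character-level features are omitted, and the vocabulary, tag set and layer sizes are made up.

        import tensorflow as tf
        from tensorflow.keras import layers

        vocab_size, num_tags, max_len = 10000, 9, 50   # illustrative sizes

        tagger = tf.keras.Sequential([
            layers.Input(shape=(max_len,), dtype="int32"),                        # token ids
            layers.Embedding(vocab_size, 100),                                    # word embeddings
            layers.Bidirectional(layers.LSTM(128, return_sequences=True)),        # BiLSTM encoder
            layers.TimeDistributed(layers.Dense(num_tags, activation="softmax")), # per-token tag scores
        ])
        tagger.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        tagger.summary()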

  • What are merge modes in Bidirectional RNN?
    Log loss vs number of time steps for some merge modes. Source: Brownlee 2017.

    Merge mode specifies how the forward and backward hidden states are combined before being passed on to the next layer. In the Keras package, the supported modes are summation, multiplication, concatenation and averaging. The default mode is concatenation, which is also what most research papers use.

    In MathWorks' MATLAB (bilstmLayer), as of December 2019, only concatenation was supported.
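
    A short Keras sketch (with illustrative sizes, not taken from the article) showing how the merge mode is selected and how it affects the output shape; note that only concatenation doubles the feature dimension:

        import tensorflow as tf
        from tensorflow.keras import layers

        x = tf.random.normal((2, 10, 8))   # (batch, time steps, features)
        for mode in ["concat", "sum", "mul", "ave"]:
            bi = layers.Bidirectional(layers.LSTM(16, return_sequences=True), merge_mode=mode)
            print(mode, bi(x).shape)       # concat -> (2, 10, 32); sum/mul/ave -> (2, 10, 16)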

  • What are some limitations of Bidirectional RNN?

    One limitation with BRNN is that the entire sequence must be available before we can make predictions. For some applications such as real-time speech recognition, the entire utterance may not be available and BRNN may not be adequate.

    In the case of language models, the task is to predict the next word given the preceding words. A BRNN is clearly not suitable since it expects future words as well; applying a BRNN to this task will give poor accuracy. Moreover, a BRNN is slower to train than an RNN since the results of the forward pass must be available before the backward pass can proceed. Gradients therefore have a long dependency chain.

    LSTMs capture long-term dependencies better than plain RNNs and also mitigate the exploding/vanishing gradient problem. However, stacking many BiLSTM layers reintroduces the vanishing gradient problem. Deep architectures, so successful with CNNs, are therefore not as successful with BiLSTMs.

Milestones

Nov
1997
BRNN unfolded in time for three time steps. Source: Schuster and Paliwal 1997, fig. 3.

Schuster and Paliwal propose the Bidirectional Recurrent Neural Network (BRNN) as an extension of the standard RNN. Since the forward and backward RNNs don't interact, they can be trained similarly to a standard RNN. On regression and classification experiments, they observe better results with BRNN.

2005
BiLSTM recognizes the phonemes better than either forward or backward LSTM alone. Source: Graves and Schmidhuber 2005, fig. 1.

For phoneme classification in speech recognition, Graves and Schmidhuber use Bidirectional LSTM and obtain good results. It's based on the insight that humans often understand sounds and words only after hearing the future context. In particular, we often don't require an output immediately upon receiving an input; we can afford to wait for a sequence of inputs before producing the output. They also show that a BRNN takes eight times longer to converge than a BiLSTM.

Sep
2016
Google use BiLSTM for Neural Machine Translation. Source: Wu et al. 2016, fig. 1.

Google replaces its phrase-based translation system with Neural Machine Translation (NMT). It uses a deep LSTM network with 8 encoder and 8 decoder layers. The first encoder layer is a BiLSTM while all other layers are unidirectional LSTMs.
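
A rough Keras sketch of this encoder layout, and only the layout: a bidirectional first layer followed by unidirectional layers. GNMT details such as residual connections are omitted and the sizes are illustrative.

    import tensorflow as tf
    from tensorflow.keras import layers

    def gnmt_style_encoder(num_layers=8, units=512, feat_dim=512):
        """First encoder layer bidirectional, remaining layers unidirectional (illustrative only)."""
        inputs = tf.keras.Input(shape=(None, feat_dim))
        x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(inputs)
        for _ in range(num_layers - 1):
            x = layers.LSTM(units, return_sequences=True)(x)
        return tf.keras.Model(inputs, x)

    gnmt_style_encoder().summary()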

Apr
2017
Pre-trained LM used to initialize hidden states between two bidirectional GRUs. Source: Peters et al. 2017, fig. 2.

Pre-trained word embeddings are commonly used in neural networks for NLP. However, they don't capture context, and learning context through supervised training is limited by the available labelled data. To overcome this limitation, Peters et al. use a BiLSTM to learn a language model (LM) and feed its embeddings into a neural network for sequence tagging. This network uses two-layer bidirectional GRUs. They experiment with NER and chunking. They find the best results when LM embeddings are used at the output of the first layer.

Jul
2017
Illustrating the use of two BiLSTMs for Semantic Role Labelling. Source: He et al. 2017, fig. 1.

For the task of Semantic Role Labelling (SRL), He et al. use an eight-layer network consisting of four BiLSTMs. Their network includes highway connections and transform gates that control inter-layer information flow. Output prediction is done by a softmax layer.

Feb
2018
Comparing the architectures of Deep Stacked BiLSTM vs. Densely Connected BiLSTM. Source: Ding et al. 2018, fig. 2.

When BRNNs are stacked, they suffer from vanishing gradients and overfitting. Ding et al. propose a Densely Connected BiLSTM (DC-BiLSTM) as a solution. This essentially means that a layer's hidden state includes the hidden states of all preceding layers. They show that the proposed architecture can handle up to 20 layers while improving performance over BiLSTM.
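
A minimal Keras sketch of the dense-connection idea (not the authors' implementation; layer counts and sizes are illustrative): each BiLSTM layer reads the concatenation of the original input and the outputs of all earlier layers.

    import tensorflow as tf
    from tensorflow.keras import layers

    def dc_bilstm(num_layers=4, units=64, feat_dim=100):
        inputs = tf.keras.Input(shape=(None, feat_dim))
        outputs = [inputs]
        for _ in range(num_layers):
            # Each layer sees the original input plus every preceding layer's hidden states.
            x = outputs[0] if len(outputs) == 1 else layers.Concatenate()(outputs)
            h = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
            outputs.append(h)
        return tf.keras.Model(inputs, outputs[-1])

    dc_bilstm().summary()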

Jun
2018

Peters et al. publish details of a language model called Embeddings from Language Models (ELMo). ELMo representations are deep, that is, they're a linear combination of the states of all LSTM layers rather than only the top layer's representation. They show that higher layers capture context-dependent semantics whereas lower layers capture syntax. While their model uses both forward and backward LSTMs, the forward LSTM stack is independent of the backward LSTM stack. Representations at each layer of the two stacks are concatenated. For this reason, they use the term Bidirectional Language Model (BiLM).
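
A small NumPy sketch of how the ELMo representation for one sentence is formed as a task-weighted combination of the layer representations, with softmax-normalized layer weights and a task-specific scale. The tensors here are random stand-ins for the actual BiLM layer outputs.

    import numpy as np

    L, T, D = 3, 10, 1024                  # layers (incl. token embedding layer), time steps, dims
    layer_reps = np.random.randn(L, T, D)  # stand-in for the concatenated BiLM layer outputs

    s = np.random.randn(L)
    s = np.exp(s) / np.exp(s).sum()        # softmax-normalized layer weights (learned with the task)
    gamma = 1.0                            # task-specific scale (also learned)

    elmo = gamma * np.einsum("l,ltd->td", s, layer_reps)  # weighted sum over layers
    print(elmo.shape)                      # (10, 1024): one deep contextual vector per token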

References

  1. Bergmanis, Toms, and Sharon Goldwater. 2018. "Context Sensitive Neural Lemmatization with Lematus." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1391-1400, June. Accessed 2019-10-11.
  2. Brownlee, Jason. 2017. "How to Develop a Bidirectional LSTM For Sequence Classification in Python with Keras." Machine Learning Mastery, June 16. Updated 2019-08-14. Accessed 2019-11-17.
  3. Ding, Zixiang, Rui Xia, Jianfei Yu, Xiang Li, and Jian Yang. 2018. "Densely Connected Bidirectional LSTM with Applications to Sentence Classification." arXiv, v1, February 3. Accessed 2020-02-24.
  4. Eric, Mihail. 2018. "Deep Contextualized Word Representations with ELMo." October. Accessed 2020-02-24.
  5. Graves, A. and J. Schmidhuber. 2005. "Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures." Neural Networks, vol. 18, no. 5-6, pp. 602–610, June/July. Accessed 2019-11-17.
  6. Gupta, Raunak. 2019. "What is the merge mode of Bidirectional LSTM?" MATLAB Answers, MathWorks, December 5. Accessed 2020-02-24.
  7. He, Luheng, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. "Deep Semantic Role Labeling: What Works and What’s Next." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 473-483, July. Accessed 2019-12-28.
  8. Honnibal, Matthew. 2016. "Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models." Blog, Explosion, November 10. Accessed 2020-02-24.
  9. Jagannatha, Abhyuday N, and Hong Yu. 2016. "Bidirectional RNN for Medical Event Detection in Electronic Health Records." Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 473-482, June. Accessed 2020-02-24.
  10. Keras Docs. 2019. "Trains a Bidirectional LSTM on the IMDB sentiment classification task." Keras Documentation, October 13. Accessed 2020-02-24.
  11. Lample, Guillaume, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. "Neural Architectures for Named Entity Recognition." arXiv, v3, April 7. Accessed 2020-02-24.
  12. Lee, Ceshine. 2017. "Understanding Bidirectional RNN in PyTorch." Towards Data Science, on Medium, November 13. Accessed 2020-02-24.
  13. Luo, Fuli, Tianyu Liu, Qiaolin Xia, Baobao Chang, and Zhifang Sui. 2018. "Incorporating Glosses into Neural Word Sense Disambiguation." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, pp. 2473-2482, July. Accessed 2019-12-28.
  14. MLWhiz. 2018. "What Kagglers are using for Text Classification." MLWhiz, December 17. Accessed 2019-11-16.
  15. Malaviya, Chaitanya, Shijie Wu, and Ryan Cotterell. 2019. "A Simple Joint Model for Improved Contextual Neural Lemmatization." arXiv, v2, April 05. Accessed 2019-10-11.
  16. MathWorks. 2019. "bilstmLayer." R2019b, Help Center, MathWorks. Accessed 2020-02-24.
  17. Ng, Andrew. 2019. "Bidirectional RNN." Recurrent Neural Networks, Sequence Models, Deeplearning.ai, on Coursera. Accessed 2019-11-16.
  18. Peters, Matthew E., Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. "Semi-supervised sequence tagging with bidirectional language models." arXiv, v1, April 29. Accessed 2020-02-24.
  19. Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. "Deep Contextualized Word Representations." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227-2237, June. Accessed 2020-02-24.
  20. Rybalkin, Vladimir, Norbert Wehn, Mohammad Reza Yousefi, and Didier Stricker. 2017. "Hardware architecture of bidirectional long short-term memory neural network for optical character recognition." Proceedings of the Conference on Design, Automation & Test in Europe, pp. 1394-1399, March. Accessed 2020-02-24.
  21. Schuster, Mike and Kuldip K. Paliwal. 1997. "Bidirectional Recurrent Neural Networks." IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2673-2681, November. Accessed 2019-11-16.
  22. Uppal, Akshay. 2019. "Sentence classification using Bi-LSTM." Towards Data Science, on Medium, March 28. Accessed 2020-02-24.
  23. Veerakumar, Karthick. 2019. "Review generation using Bidirectional Long Short Term Memory(LSTM)." mc.ai, April 11. Accessed 2020-02-24.
  24. Wang, Peilu, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. 2015. "Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network." arXiv, v1, October 21. Accessed 2020-02-24.
  25. Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." v1, arXiv, September 26. Updated 2016-10-08. Accessed 2019-06-13.
  26. Zaki, Amr. 2019. "Multilayer Bidirectional LSTM/GRU for text summarization made easy (tutorial 4)." Hackernoon, March 30. Accessed 2020-02-24.
  27. Zhang, Aston, Zack C. Lipton, Mu Li, and Alex J. Smola. 2019. "Section 9.4: Bidirectional Recurrent Neural Networks." In: Dive into Deep Learning, Preview Version V0.7, December 5. Accessed 2020-02-24.

Further Reading

  1. Schuster, Mike and Kuldip K. Paliwal. 1997. "Bidirectional Recurrent Neural Networks." IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2673-2681, November. Accessed 2019-11-16.
  2. Brownlee, Jason. 2017. "How to Develop a Bidirectional LSTM For Sequence Classification in Python with Keras." Machine Learning Mastery, June 16. Updated 2019-08-14. Accessed 2019-11-17.
  3. Zhang, Aston, Zack C. Lipton, Mu Li, and Alex J. Smola. 2019. "Section 9.4: Bidirectional Recurrent Neural Networks." In: Dive into Deep Learning, Preview Version V0.7, December 5. Accessed 2020-02-24.
