Bidirectional RNN

Many applications are sequential in nature. One input follows another in time. Dependencies among these give us important clues as to how they should be processed. Since Recurrent Neural Networks (RNNs) model the flow of time, they're suited for these applications.

An RNN has the limitation that it processes inputs in strict temporal order: the current input has context from previous inputs but not from future ones. A Bidirectional RNN (BRNN) duplicates the RNN processing chain so that inputs are processed in both forward and reverse time order. This allows a BRNN to look at future context as well.

Two common variants of RNN include GRU and LSTM. LSTM does better than RNN in capturing long-term dependencies. Bidirectional LSTM (BiLSTM) in particular is a popular choice in NLP. These variants are also within the scope of this article.

Discussion

  • Could you explain Bidirectional RNN with an example?
    Bidirectional RNN has forward and backward RNNs. Source: MLWhiz 2018.

    Consider the phrase 'He said, "Teddy ___"'. From these three opening words it's difficult to conclude whether the sentence is about Teddy bears or Teddy Roosevelt. This is because the context that clarifies Teddy comes later. RNNs (including GRUs and LSTMs) obtain context only in one direction, from the preceding words. They're unable to look ahead into future words.

    Bidirectional RNNs solve this problem by processing the sequence in both directions. Typically, two separate RNNs are used: one for the forward direction and one for the reverse direction. This gives two hidden states per time step, one from each RNN, which are usually concatenated to form a single hidden state.

    The final hidden state goes to a decoder, such as a fully connected network followed by softmax. Depending on the design of the neural network, the output from a BRNN can be either the complete sequence of hidden states or only the state from the last time step. If a single hidden state is given to the decoder, it's typically the concatenation of the forward RNN's final state and the backward RNN's final state (which corresponds to the first time step).
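
    To make the mechanics concrete, here's a minimal NumPy sketch (not from any of the cited papers; the weight shapes and random initialization are purely illustrative) of two simple RNNs processing the same sequence in opposite directions, with their hidden states concatenated at each time step:

        import numpy as np

        def rnn_pass(inputs, Wx, Wh, b):
            """Run a simple tanh RNN over a time-major sequence and return all hidden states."""
            h = np.zeros(Wh.shape[0])
            states = []
            for x in inputs:
                h = np.tanh(Wx @ x + Wh @ h + b)
                states.append(h)
            return states

        T, n_in, n_hid = 5, 8, 16
        rng = np.random.default_rng(0)
        seq = rng.normal(size=(T, n_in))

        # Separate parameters for the forward and backward RNNs.
        params_fwd = (rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid)), np.zeros(n_hid))
        params_bwd = (rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid)), np.zeros(n_hid))

        h_fwd = rnn_pass(seq, *params_fwd)              # processes t = 0 .. T-1
        h_bwd = rnn_pass(seq[::-1], *params_bwd)[::-1]  # processes t = T-1 .. 0, re-aligned to time order

        # Concatenate per time step: each step now carries both past and future context.
        h_bi = np.concatenate([np.stack(h_fwd), np.stack(h_bwd)], axis=-1)
        print(h_bi.shape)  # (5, 32)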

  • What are some applications of Bidirectional RNN?
    Use of BiLSTM and CRF for NER. Source: Lample et al. 2016, fig. 1.

    BiLSTM has become a popular architecture for many NLP tasks. An early application of BiLSTM was in the domain of speech recognition. Other applications include sentence classification, sentiment analysis, review generation, and even medical event detection in electronic health records.

    BiLSTM has been used for POS tagging and Word Sense Disambiguation (WSD). For Named Entity Recognition (NER), Lample et al. used word representations that captured both character-level characteristics and word-level context. These were fed into a BiLSTM encoder layer. The sequence of hidden states was decoded by a CRF layer. A simplified sketch of this encoder pattern appears at the end of this answer.

    For lemmatization, one study used two-layer bidirectional GRUs for the encoder. The decoder was a conditional GRU plus another GRU layer. Another study used a two-layer BiLSTM encoder and a one-layer LSTM decoder. A stack of four BiLSTMs has been used for Semantic Role Labelling (SRL).

    In general, the paradigm of embed-encode-attend-predict has become popular in NLP work. The encode part benefits from BiLSTM, which has been shown to capture position-sensitive features.

    Beyond NLP, BiLSTM has been applied to image processing applications such as OCR.
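
    As a rough illustration of the BiLSTM-encoder pattern for sequence tagging, here's a short Keras sketch. It's a simplification, not Lample et al.'s model: a per-token softmax stands in for their CRF decoder, character-level features are omitted, and the vocabulary, tag set and layer sizes are made up.

        import tensorflow as tf
        from tensorflow.keras import layers

        vocab_size, num_tags, max_len = 10000, 9, 50   # illustrative sizes

        tagger = tf.keras.Sequential([
            layers.Input(shape=(max_len,), dtype="int32"),                        # token ids
            layers.Embedding(vocab_size, 100),                                    # word embeddings
            layers.Bidirectional(layers.LSTM(128, return_sequences=True)),        # BiLSTM encoder
            layers.TimeDistributed(layers.Dense(num_tags, activation="softmax")), # per-token tag scores
        ])
        tagger.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        tagger.summary()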

  • What are merge modes in Bidirectional RNN?
    Log loss vs number of time steps for some merge modes. Source: Brownlee 2017.

    Merge mode specifies how the forward and backward hidden states are combined before being passed on to the next layer. In the Keras package, the supported modes are summation, multiplication, concatenation and averaging. The default mode is concatenation, which is also what most research papers use.

    In MathWorks' MATLAB (bilstmLayer), as of December 2019, only concatenation was supported.
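
    A short Keras sketch (with illustrative sizes, not taken from the article) showing how the merge mode is selected and how it affects the output shape; note that only concatenation doubles the feature dimension:

        import tensorflow as tf
        from tensorflow.keras import layers

        x = tf.random.normal((2, 10, 8))   # (batch, time steps, features)
        for mode in ["concat", "sum", "mul", "ave"]:
            bi = layers.Bidirectional(layers.LSTM(16, return_sequences=True), merge_mode=mode)
            print(mode, bi(x).shape)       # concat -> (2, 10, 32); sum/mul/ave -> (2, 10, 16)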

  • What are some limitations of Bidirectional RNN?

    One limitation with BRNN is that the entire sequence must be available before we can make predictions. For some applications such as real-time speech recognition, the entire utterance may not be available and BRNN may not be adequate.

    In the case of language models, the task is to predict the next word given the preceding words. A BRNN is clearly not suitable since it expects future words as well; applying a BRNN to this task will give poor accuracy. Moreover, a BRNN is slower to train than an RNN since the results of the forward pass must be available before the backward pass can proceed. Gradients therefore have a long dependency chain.

    LSTMs capture long-term dependencies better than plain RNNs and also mitigate the exploding/vanishing gradient problem. However, stacking many BiLSTM layers reintroduces the vanishing gradient problem. Deep architectures, so successful with CNNs, are therefore not as successful with BiLSTMs.

Milestones

Nov
1997
BRNN unfolded in time for three time steps. Source: Schuster and Paliwal 1997, fig. 3.

Schuster and Paliwal propose the Bidirectional Recurrent Neural Network (BRNN) as an extension of the standard RNN. Since the forward and backward RNNs don't interact, they can be trained similarly to a standard RNN. On regression and classification experiments, they observe better results with BRNN.

2005
BiLSTM recognizes the phonemes better than either forward or backward LSTM alone. Source: Graves and Schmidhuber 2005, fig. 1.

For phoneme classification in speech recognition, Graves and Schmidhuber use Bidirectional LSTM and obtain good results. It's based on the insight that humans often understand sounds and words only after hearing the future context. In particular, we often don't require an output immediately upon receiving an input; we can afford to wait for a sequence of inputs before producing the output. They also show that a BRNN takes eight times longer to converge than a BiLSTM.

Sep
2016
Google use BiLSTM for Neural Machine Translation. Source: Wu et al. 2016, fig. 1.

Google replaces its phrase-based translation system with Neural Machine Translation (NMT). It uses a deep LSTM network with 8 encoder and 8 decoder layers. The first encoder layer is a BiLSTM while all other layers are unidirectional LSTMs.
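
A rough Keras sketch of this encoder layout, and only the layout: a bidirectional first layer followed by unidirectional layers. GNMT details such as residual connections are omitted and the sizes are illustrative.

    import tensorflow as tf
    from tensorflow.keras import layers

    def gnmt_style_encoder(num_layers=8, units=512, feat_dim=512):
        """First encoder layer bidirectional, remaining layers unidirectional (illustrative only)."""
        inputs = tf.keras.Input(shape=(None, feat_dim))
        x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(inputs)
        for _ in range(num_layers - 1):
            x = layers.LSTM(units, return_sequences=True)(x)
        return tf.keras.Model(inputs, x)

    gnmt_style_encoder().summary()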

Apr
2017
Pre-trained LM used to initialize hidden states between two bidirectional GRUs. Source: Peters et al. 2017, fig. 2.

Pre-trained word embeddings are commonly used in neural networks for NLP. However, they don't capture context, and learning context through supervised training is limited by the available labelled data. To overcome this limitation, Peters et al. use a BiLSTM to learn a language model (LM) and feed its embeddings into a neural network for sequence tagging. This network uses two-layer bidirectional GRUs. They experiment with NER and chunking. They find the best results when LM embeddings are used at the output of the first layer.

Jul
2017
Illustrating the use of two BiLSTMs for Semantic Role Labelling. Source: He et al. 2017, fig. 1.

For the task of Semantic Role Labelling (SRL), He et al. use an eight-layer network consisting of four BiLSTMs. Their network includes highway connections and transform gates that control inter-layer information flow. Output prediction is done by a softmax layer.

Feb
2018
Comparing the architectures of Deep Stacked BiLSTM vs. Densely Connected BiLSTM. Source: Ding et al. 2018, fig. 2.

When BRNNs are stacked, they suffer from vanishing gradients and overfitting. Ding et al. propose a Densely Connected BiLSTM (DC-BiLSTM) as a solution. This essentially means that a layer's hidden state includes the hidden states of all preceding layers. They show that the proposed architecture can handle up to 20 layers while improving performance over BiLSTM.
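
A minimal Keras sketch of the dense-connection idea (not the authors' implementation; layer counts and sizes are illustrative): each BiLSTM layer reads the concatenation of the original input and the outputs of all earlier layers.

    import tensorflow as tf
    from tensorflow.keras import layers

    def dc_bilstm(num_layers=4, units=64, feat_dim=100):
        inputs = tf.keras.Input(shape=(None, feat_dim))
        outputs = [inputs]
        for _ in range(num_layers):
            # Each layer sees the original input plus every preceding layer's hidden states.
            x = outputs[0] if len(outputs) == 1 else layers.Concatenate()(outputs)
            h = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
            outputs.append(h)
        return tf.keras.Model(inputs, outputs[-1])

    dc_bilstm().summary()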

Jun
2018

Peters et al. publish details of a language model called Embeddings from Language Models (ELMo). ELMo representations are deep, that is, they're a linear combination of the states of all LSTM layers rather than only the top layer's representation. They show that higher layers capture context-dependent semantics whereas lower layers capture syntax. While their model uses both forward and backward LSTMs, the forward LSTM stack is independent of the backward LSTM stack. Representations at each layer of the two stacks are concatenated. For this reason, they use the term Bidirectional Language Model (BiLM).
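
A small NumPy sketch of how the ELMo representation for one sentence is formed as a task-weighted combination of the layer representations, with softmax-normalized layer weights and a task-specific scale. The tensors here are random stand-ins for the actual BiLM layer outputs.

    import numpy as np

    L, T, D = 3, 10, 1024                  # layers (incl. token embedding layer), time steps, dims
    layer_reps = np.random.randn(L, T, D)  # stand-in for the concatenated BiLM layer outputs

    s = np.random.randn(L)
    s = np.exp(s) / np.exp(s).sum()        # softmax-normalized layer weights (learned with the task)
    gamma = 1.0                            # task-specific scale (also learned)

    elmo = gamma * np.einsum("l,ltd->td", s, layer_reps)  # weighted sum over layers
    print(elmo.shape)                      # (10, 1024): one deep contextual vector per token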

References

  1. Bergmanis, Toms, and Sharon Goldwater. 2018. "Context Sensitive Neural Lemmatization with Lematus." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1391-1400, June. Accessed 2019-10-11.
  2. Brownlee, Jason. 2017. "How to Develop a Bidirectional LSTM For Sequence Classification in Python with Keras." Machine Learning Mastery, June 16. Updated 2019-08-14. Accessed 2019-11-17.
  3. Ding, Zixiang, Rui Xia, Jianfei Yu, Xiang Li, and Jian Yang. 2018. "Densely Connected Bidirectional LSTM with Applications to Sentence Classification." arXiv, v1, February 3. Accessed 2020-02-24.
  4. Eric, Mihail. 2018. "Deep Contextualized Word Representations with ELMo." October. Accessed 2020-02-24.
  5. Graves, A. and J. Schmidhuber. 2005. "Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures." Neural Networks, vol. 18, no. 5-6, pp. 602–610, June/July. Accessed 2019-11-17.
  6. Gupta, Raunak. 2019. "What is the merge mode of Bidirectional LSTM?" MATLAB Answers, MathWorks, December 5. Accessed 2020-02-24.
  7. He, Luheng, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. "Deep Semantic Role Labeling: What Works and What’s Next." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 473-483, July. Accessed 2019-12-28.
  8. Honnibal, Matthew. 2016. "Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models." Blog, Explosion, November 10. Accessed 2020-02-24.
  9. Jagannatha, Abhyuday N, and Hong Yu. 2016. "Bidirectional RNN for Medical Event Detection in Electronic Health Records." Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 473-482, June. Accessed 2020-02-24.
  10. Keras Docs. 2019. "Trains a Bidirectional LSTM on the IMDB sentiment classification task." Keras Documentation, October 13. Accessed 2020-02-24.
  11. Lample, Guillaume, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. "Neural Architectures for Named Entity Recognition." arXiv, v3, April 7. Accessed 2020-02-24.
  12. Lee, Ceshine. 2017. "Understanding Bidirectional RNN in PyTorch." Towards Data Science, on Medium, November 13. Accessed 2020-02-24.
  13. Luo, Fuli, Tianyu Liu, Qiaolin Xia, Baobao Chang, and Zhifang Sui. 2018. "Incorporating Glosses into Neural Word Sense Disambiguation." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, pp. 2473-2482, July. Accessed 2019-12-28.
  14. MLWhiz. 2018. "What Kagglers are using for Text Classification." MLWhiz, December 17. Accessed 2019-11-16.
  15. Malaviya, Chaitanya, Shijie Wu, and Ryan Cotterell. 2019. "A Simple Joint Model for Improved Contextual Neural Lemmatization." arXiv, v2, April 05. Accessed 2019-10-11.
  16. MathWorks. 2019. "bilstmLayer." R2019b, Help Center, MathWorks. Accessed 2020-02-24.
  17. Ng, Andrew. 2019. "Bidirectional RNN." Recurrent Neural Networks, Sequence Models, Deeplearning.ai, on Coursera. Accessed 2019-11-16.
  18. Peters, Matthew E., Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. "Semi-supervised sequence tagging with bidirectional language models." arXiv, v1, April 29. Accessed 2020-02-24.
  19. Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. "Deep Contextualized Word Representations." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227-2237, June. Accessed 2020-02-24.
  20. Rybalkin, Vladimir, Norbert Wehn, Mohammad Reza Yousefi, and Didier Stricker. 2017. "Hardware architecture of bidirectional long short-term memory neural network for optical character recognition." Proceedings of the Conference on Design, Automation & Test in Europe, pp. 1394-1399, March. Accessed 2020-02-24.
  21. Schuster, Mike and Kuldip K. Paliwal. 1997. "Bidirectional Recurrent Neural Networks." IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2673-2681, November. Accessed 2019-11-16.
  22. Uppal, Akshay. 2019. "Sentence classification using Bi-LSTM." Towards Data Science, on Medium, March 28. Accessed 2020-02-24.
  23. Veerakumar, Karthick. 2019. "Review generation using Bidirectional Long Short Term Memory(LSTM)." mc.ai, April 11. Accessed 2020-02-24.
  24. Wang, Peilu, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. 2015. "Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network." arXiv, v1, October 21. Accessed 2020-02-24.
  25. Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." v1, arXiv, September 26. Updated 2016-10-08. Accessed 2019-06-13.
  26. Zaki, Amr. 2019. "Multilayer Bidirectional LSTM/GRU for text summarization made easy (tutorial 4)." Hackernoon, March 30. Accessed 2020-02-24.
  27. Zhang, Aston, Zack C. Lipton, Mu Li, and Alex J. Smola. 2019. "Section 9.4: Bidirectional Recurrent Neural Networks." In: Dive into Deep Learning, Preview Version V0.7, December 5. Accessed 2020-02-24.

Further Reading

  1. Schuster, Mike and Kuldip K. Paliwal. 1997. "Bidirectional Recurrent Neural Networks." IEEE Trans. Signal Processing, vol. 45, no. 11, pp. 2673-2681, November. Accessed 2019-11-16.
  2. Brownlee, Jason. 2017. "How to Develop a Bidirectional LSTM For Sequence Classification in Python with Keras." Machine Learning Mastery, June 16. Updated 2019-08-14. Accessed 2019-11-17.
  3. Zhang, Aston, Zack C. Lipton, Mu Li, and Alex J. Smola. 2019. "Section 9.4: Bidirectional Recurrent Neural Networks." In: Dive into Deep Learning, Preview Version V0.7, December 5. Accessed 2020-02-24.
