Neural Networks for NLP

State-of-the-art NN for NLP in 2018. Source: Devlin and Chang 2018.

The use of statistics in NLP started in the 1980s and heralded the birth of what we called Statistical NLP or Computational Linguistics. Since then, many machine learning techniques have been applied to NLP. These include naïve Bayes, k-nearest neighbours, hidden Markov models, conditional random fields, decision trees, random forests, and support vector machines.

The use of neural networks for NLP did not start until the early 2000s. But by the end of the 2010s, neural networks had transformed NLP, enhancing or even replacing earlier techniques. This was made possible because we now have more data to train neural network models and more powerful computing systems to do so.

In traditional NLP, features were often hand-crafted, incomplete, and time-consuming to create. Neural networks can learn multilevel features automatically, and on many tasks they also give better results.


  • Which are the main innovations in the application of NN to NLP?

    Two main innovations have enabled the use of neural networks in NLP:

    • Word Embeddings: These let us represent words as dense real-valued vectors. Instead of a sparse representation with one dimension per vocabulary word, word embeddings represent words in a much lower-dimensional space. We can identify similar words by their closeness in this vector space, or use analogies to exploit semantic relationships between words.
    • NN Architectures: These had evolved in other domains such as computer vision and were adapted to NLP. This started with language modelling, and they were later applied to morphology, POS tagging, coreference resolution, parsing, and semantics. From these core areas, neural networks spread to applications: sentiment analysis, speech recognition, information retrieval/extraction, text classification/generation, summarization, question answering, and machine translation. These architectures are usually not as deep (in number of hidden layers) as those found in computer vision.
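
    The closeness and analogy ideas behind word embeddings can be sketched in a few lines. This is only an illustration: the three-dimensional vectors in EMBEDDINGS are hypothetical and hand-picked, whereas real embeddings are learned from text and have hundreds of dimensions. The helper names cosine and analogy are also made up for this sketch.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration (not learned).
EMBEDDINGS = {
    "king":  [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity: near 1.0 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Answer 'a is to b as c is to ?' via the offset vec(b) - vec(a) + vec(c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the query words themselves, then pick the closest remaining word.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))
```

    With these toy vectors, analogy("man", "king", "woman") returns "queen", and cosine ranks "king" closer to "man" than to "woman".
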
  • Which are the NN architectures that have been used for NLP?
    Example of 2-layer BiLSTM-based ELMo. Source: Horan 2019.

    Early language models used feedforward or convolutional NN architectures, but these didn't capture context very well. Context is how one word occurs in relation to surrounding words in the sentence. To capture context, recurrent NNs (RNNs) were applied. LSTM, a variant of RNN, was then used to capture long-distance context. Bidirectional LSTM (BiLSTM) improves upon LSTM by reading word sequences in both forward and backward directions.
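
    The bidirectional idea can be sketched without the full LSTM machinery. The example below is a deliberate simplification and not an LSTM: it uses a plain tanh RNN with scalar inputs, a scalar hidden state, and made-up weights (w_in, w_rec). It only shows how each position ends up with one state summarizing the words before it and another summarizing the words after it.

```python
import math


def rnn_pass(inputs, w_in, w_rec):
    """Run a plain tanh RNN over scalar inputs; return one hidden state per step."""
    h = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_states(inputs, w_in=0.5, w_rec=0.8):
    """Pair a left-to-right state with a right-to-left state for every position."""
    forward = rnn_pass(inputs, w_in, w_rec)
    # Run over the reversed sequence, then flip back so positions line up.
    backward = rnn_pass(inputs[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))
```

    At position 0, the forward state depends only on the first input, while the backward state changes whenever a later input changes. This is the extra context a bidirectional model provides.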

    Typically, the dimensionality of input and output must be known and fixed. This is problematic for machine translation. For example, the best translation of a 10-word English sentence might be a 12-word French sentence. This problem is solved by a sequence-to-sequence model that's based on an encoder-decoder architecture. The essence of the encoder is to encode an entire input sequence into a large fixed-dimensional vector, called the context vector. The decoder implements a language model conditioned on the input sequence.

    Encoding all contextual information in a single context vector is difficult. This gave rise to the idea of attention, where the decoder is given access to more information from the encoder than a single vector. From here, the transformer model evolved.

  • What's been the general trend in NLP research with neural networks?

    Language modelling has been essential for the progress of NLP. Because of the ready availability of text, it's been easy to train complex models in an unsupervised manner on lots of training data. The intent is to train the model to learn about words and the contexts in which they occur. For example, the model should learn a vector representation of "bank" and also discriminate between a river bank and a financial institution.

    A pretrained language model, first proposed in 2015, can save us expensive training on vast amounts of data. However, such a pretrained model may need some amount of training on domain-specific data. Then the model can be applied to many downstream NLP tasks. This approach is similar to the earlier use of pretrained word embeddings, except that those embeddings didn't capture context.

    The use of a pretrained language model in another downstream task is called transfer learning, a concept that's also common in computer vision.

    It's expected that the transformer model will dominate over RNNs. Pretrained models will get better. It'll be easier to fine-tune models. Transfer learning will become more important.

  • Could you share some real-world examples of NN in NLP?
    NN model for Gmail's Smart Compose. Source: Wu 2018.

    In 2018, Google introduced Smart Compose in Gmail. A seq2seq model using email subject and previous email body gave good results but failed to meet latency constraints. They finally settled on a hybrid of bag-of-words (BoW) and RNN-LM. Average embeddings are fed to RNN-LM.

    At Amazon, they've used a lightweight version of ELMo to augment Alexa functions. While ELMo uses a stack of BiLSTM, they use a single layer since Alexa transactions are linguistically more uniform. They trained the embeddings in an unsupervised manner, and then trained on two tasks (intent classification and slot tagging) in a supervised manner while only slowly adjusting the embeddings. They also did transfer learning on new tasks.

    Uber has used NLP to filter tickets related to map data. Using word2vec, they trained word embeddings on one million tickets. This had the limitation that all words are treated equally. They then experimented with WordCNN and LSTM networks. They got the best results with word2vec trained on customer tickets and used with WordCNN. For future work, they suggested character-level (CharCNN) embeddings that are more resilient to typos.


Neural network with word vector C(i) for ith word. Source: Bengio et al. 2003, fig. 1.

Bengio et al. point out the curse of dimensionality where the large vocabulary size of natural languages makes computations difficult. They propose a feedforward neural network that jointly learns the language model and vector representations of words. They refine their methods in a follow-up paper from 2003.

Words to features to word vectors via lookup tables. Source: Collobert and Weston 2008, fig. 1.

Collobert and Weston train a language model in an unsupervised manner from Wikipedia data. They use supervised training for both syntactic tasks (POS tagging, chunking, parsing) and semantic tasks (named entity recognition, semantic role labelling, word sense disambiguation). To model long-distance dependencies, they use a Time-Delay Neural Network (TDNN) inspired by CNN. They use multiple layers to move from local features to global features. Moreover, two models can share word embeddings, an approach called multitask learning.


Mikolov et al. use a recurrent neural network (RNN) for language modelling and apply this for speech recognition. They show better results than traditional n-gram models.


Dahl et al. combine a deep neural network with a hidden Markov model (HMM) for large-vocabulary speech recognition.

Recursive NN (for sentiment analysis) exploits the hierarchical structure of language. Source: Socher et al. 2013, fig. 1.

Going beyond just word embeddings, Kalchbrenner and Blunsom map an entire input sentence to a vector. They use this for machine translation without relying on alignments or phrasal translation units. In other research, LSTM is found to capture long-range context and is therefore suitable for generating sequences. In general, 2013 is the year when research focuses on using CNN, RNN/LSTM and recursive NN for NLP.

Using CNN for NLP tasks. Source: Kim 2014, fig. 1.

Using pretrained word2vec embeddings, Yoon Kim uses a CNN for sentence classification. Also in 2014, Sutskever et al. at Google apply a sequence-to-sequence model to the task of machine translation. They use separate 4-layered LSTMs for encoder and decoder. Reversing the order of words in source sentences allows the LSTM to exploit short-term dependencies and therefore do well on long sentences. Seq2seq models are suited for NLG tasks such as captioning images or describing source code changes.

Encoder-decoder model with attention. Source: Weng 2018, fig. 4.

Bahdanau et al. apply the concept of attention to the seq2seq model used in machine translation. This helps the decoder to "pay attention" to important parts of the source sentence. It doesn't force the encoder to pack all information into a single context vector. Effectively, the model does a soft alignment of input to output words.
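
The soft alignment can be sketched as follows. This is a minimal illustration of additive attention scoring, score = v · tanh(W_dec s + W_enc h), assuming the weights are given rather than learned. The names additive_attention, W_dec, W_enc and v are chosen for this sketch, and the encoder and decoder networks themselves are left out.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Return (alignment weights, context vector) for one decoder step."""
    projected_dec = matvec(W_dec, decoder_state)
    scores = []
    for h in encoder_states:
        projected_enc = matvec(W_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_dec, projected_enc)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    # The context vector is a weighted average of all encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)
    ]
    return weights, context
```

Because the context vector is recomputed at every decoder step from all encoder states, the encoder no longer has to compress the whole source sentence into one fixed vector.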


Dai and Le propose a two-step procedure of unsupervised pre-training followed by supervised training for text classification. This semi-supervised approach works well. Pre-training helps to initialize the model for supervised training and generalization. More unlabelled data during pre-training is seen to improve supervised learning. In later years, this approach becomes important.

Google's model for NMT. Source: Wu et al. 2016, fig. 1.

Google replaces its phrase-based translation system with Neural Machine Translation (NMT). This reduces translation errors by an average of 60%. It uses a deep LSTM network with 8 encoder and 8 decoder layers. The first layer of the encoder is a BiLSTM. The model also uses residual connections among the LSTM layers.

Use of convolution and attention in seq2seq modelling. Source: Gehring et al. 2017, fig. 1.

Recurrent architectures can't be fully parallelized over the elements of a sequence because each step depends on the previous one. Gehring et al. therefore propose using CNNs for seq2seq modelling since CNNs can be parallelized and make the best use of GPU hardware. The model uses gated linear units, residual connections, and attention in each decoder layer.
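
A gated linear unit is simple enough to show directly. The sketch below assumes the output of a convolution has already been computed and flattened into a plain list. Its first half is treated as content and its second half as gates, giving content × sigmoid(gate). The function names are chosen for this illustration.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_unit(values):
    """Split the input in half: the first half is content, the second half gates it."""
    if len(values) % 2 != 0:
        raise ValueError("GLU needs an even number of inputs")
    half = len(values) // 2
    content, gate = values[:half], values[half:]
    # A gate near 1 passes the content through; a gate near 0 suppresses it.
    return [a * sigmoid(b) for a, b in zip(content, gate)]
```

For instance, gated_linear_unit([2.0, -3.0, 0.0, 0.0]) halves both content values, since sigmoid(0) is 0.5.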

Encoder self-attention distribution for the word 'it' in different contexts. Source: Uszkoreit 2017.

Vaswani et al. propose the transformer, a seq2seq model that uses no RNN. Instead of recurrence, it relies on attention, in particular self-attention. By 2018, the transformer leads to state-of-the-art models such as OpenAI GPT and BERT.
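
Self-attention can be sketched as scaled dot-product attention in which every token attends to every token of the same sequence. The example is simplified on purpose: queries, keys and values are the raw token vectors, whereas the actual transformer applies learned projections and uses multiple heads. The function names are assumptions made for this sketch.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention without learned projections.

    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, all_weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted mix of all token vectors in the sequence.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights
```

Each row of weights can be computed independently of the others, which is why this formulation parallelizes better than a recurrent network.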


Researchers at the Allen Institute for Artificial Intelligence introduce ELMo (Embeddings from Language Models). While earlier work derived contextualized word vectors, this was limited to the top LSTM layer. ELMo's word representations use all layers of a bidirectional language model. This allows ELMo to model syntax, semantics and polysemy. Such a language model can be pretrained on a large scale and then used for a number of downstream tasks.
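
How ELMo uses all layers can be sketched as a weighted sum. For one token, the vectors from each layer of the language model are combined as gamma × Σ softmax(w)_j × h_j, where the layer weights w and the scale gamma are learned per downstream task. In the sketch below these values are simply passed in, and the name mix_layers is hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, layer_logits, gamma=1.0):
    """Combine one token's per-layer vectors into a single representation."""
    weights = softmax(layer_logits)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * h[i] for w, h in zip(weights, layer_vectors))
        for i in range(dim)
    ]
```

With equal logits every layer contributes equally. Raising one layer's logit shifts the representation towards that layer, which lets each task decide how much to draw from each layer.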


OpenAI GPT-2 shows its power in natural language generation. Trained on 8 million web pages, it has 1.5 billion parameters. The full model is initially not released to the public due to concerns of misuse (such as fake news generation). However, in November 2019, the full 1.5 billion parameter GPT-2 is released.


  1. Alammar, Jay. 2018. "The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)." December 03. Accessed 2019-10-13.
  2. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2016. "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv, v7, May 19. Accessed 2019-10-13.
  3. Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. "A Neural Probabilistic Language Model." Journal of Machine Learning Research, vol. 3, pp. 1137–1155. Accessed 2019-09-28.
  4. Collobert, Ronan and Jason Weston. 2008. "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning." Proceedings of the 25th International Conference on Machine Learning. Accessed 2019-09-27.
  5. Dahl, G. E., D. Yu, L. Deng, and A. Acero. 2012. "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition." IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, January. Accessed 2019-10-14.
  6. Dai, Andrew M., and Quoc V. Le. 2015. "Semi-supervised Sequence Learning." arXiv, v1, November 04. Accessed 2019-11-12.
  7. Devlin, Jacob and Ming-Wei Chang. 2018. "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing." Google AI Blog, November 02. Accessed 2019-10-13.
  8. Gehring, Jonas, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. "Convolutional Sequence to Sequence Learning." arXiv, v3, July 25. Accessed 2019-11-12.
  9. Goyal, Anuj. 2019. "Leveraging Unannotated Data to Bootstrap Alexa Functions More Quickly." Alexa Blogs, Amazon, January 22. Accessed 2019-10-13.
  10. Graves, Alex. 2014. "Generating Sequences With Recurrent Neural Networks." arXiv, v5, June 05. Accessed 2019-10-13.
  11. Honnibal, Matthew. 2016. "Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models." Blog, Explosion, November 10. Accessed 2019-10-13.
  12. Horan, Cathal. 2019. "Ten trends in Deep learning NLP." FloydHub Blog, March 12. Accessed 2019-10-13.
  13. Kalchbrenner, Nal, and Phil Blunsom. 2013. "Recurrent Continuous Translation Models." Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1700-1709, October. Accessed 2019-10-13.
  14. Kim, Yoon. 2014. "Convolutional Neural Networks for Sentence Classification." arXiv, v2, September 03. Accessed 2019-10-14.
  15. Kuo, Chun-Chen, Livia Yanez, and Jeffrey Yun. 2018. "Applying Customer Feedback: How NLP & Deep Learning Improve Uber’s Maps." Uber Engineering, October 22. Accessed 2019-10-13.
  16. Lazy Programmer. 2018. "Deep Learning: Advanced NLP and RNNs." Lazy Programmer, on YouTube, April 29. Accessed 2019-10-10.
  17. Lopez, Marc Moreno and Jugal Kalita. 2017. "Deep Learning applied to NLP." arXiv, v1, March 09. Accessed 2019-10-10.
  18. Mikolov, Tomáš, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. "Recurrent neural network based language model." 11th Annual Conference of the International Speech Communication Association, September 26-30. Accessed 2019-10-13.
  19. Otter, Daniel W., Julian R. Medina, and Jugal K. Kalita. 2019. "A Survey of the Usages of Deep Learning in Natural Language Processing." arXiv, v2, September 11. Accessed 2019-10-10.
  20. Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. "Deep contextualized word representations." arXiv, v2, March 22. Accessed 2019-10-14.
  21. Ruder, Sebastian. 2018. "A Review of the Recent History of Natural Language Processing." October 01. Accessed 2019-09-26.
  22. Ruder, Sebastian. 2018b. "NLP's ImageNet moment has arrived." July 12. Accessed 2019-10-14.
  23. Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank." Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642, October. Accessed 2019-10-14.
  24. Solaiman, Irene, Jack Clark, and Miles Brundage. 2019. "GPT-2: 1.5B Release." OpenAI Blog, November 05. Accessed 2019-11-12.
  25. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. "Sequence to Sequence Learning with Neural Networks." arXiv, v3, December 14. Accessed 2019-10-13.
  26. Uszkoreit, Jakob. 2017. "Transformer: A Novel Neural Network Architecture for Language Understanding." Google AI Blog, August 31. Accessed 2019-10-13.
  27. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." arXiv, v5, December 06. Accessed 2019-10-12.
  28. Vig, Jesse. 2019. "OpenAI GPT-2: Understanding Language Generation through Visualization." Towards Data Science, via Medium, March 05. Accessed 2019-10-13.
  29. Weng, Lilian. 2018. "Attention? Attention!" Lil'Log, June 24. Accessed 2019-10-13.
  30. Wu, Yonghui. 2018. "Smart Compose: Using Neural Networks to Help Write Emails." Google AI Blog, May 16. Accessed 2019-10-13.
  31. Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." arXiv, v1, September 26. Updated 2016-10-08. Accessed 2019-06-13.
  32. Young, Tom, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. "Recent Trends in Deep Learning Based Natural Language Processing." arXiv, v8, November 25. Accessed 2019-10-10.

Further Reading

  1. Otter, Daniel W., Julian R. Medina, and Jugal K. Kalita. 2019. "A Survey of the Usages of Deep Learning in Natural Language Processing." arXiv, v2, September 11. Accessed 2019-10-10.
  2. Alammar, Jay. 2018. "The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)." December 03. Accessed 2019-10-13.
  3. Horan, Cathal. 2019. "Ten trends in Deep learning NLP." FloydHub Blog, March 12. Accessed 2019-10-13.
  4. Lopez, Marc Moreno and Jugal Kalita. 2017. "Deep Learning applied to NLP." arXiv, v1, March 09. Accessed 2019-10-10.
  5. Ruder, Sebastian. 2018. "A Review of the Recent History of Natural Language Processing." October 01. Accessed 2019-09-26.
  6. Elvis. 2018. "Deep Learning for NLP: An Overview of Recent Trends." Medium, August 24. Accessed 2019-10-10.

Cite As

Devopedia. 2019. "Neural Networks for NLP." Version 6, November 12. Accessed 2020-11-24.
Contributed by
1 author

Last updated on
2019-11-12 13:03:45