Neural Networks for NLP

Article Info

Contributed by
1 author

Last updated on
2019-11-12 13:03:45

Article Versions

Version 6, 2019-11-12 13:03:45, by arvindpdmn: Added milestone about CNN.
Version 5, 2019-11-12 12:22:53, by arvindpdmn: Added milestone on semi-supervised approach.
Version 4, 2019-11-12 10:25:30, by arvindpdmn: GPT-2 release update
Version 3, 2019-10-14 14:37:12, by arvindpdmn: Completing content and publishing.
Version 2, 2019-10-11 18:10:44, by arvindpdmn: Work in progress. Added Summary and one answer.


State-of-the-art NN for NLP in 2018. Source: Devlin and Chang 2018.

The use of statistics in NLP started in the 1980s and heralded the birth of what is called Statistical NLP or Computational Linguistics. Since then, many machine learning techniques have been applied to NLP. These include naïve Bayes, k-nearest neighbours, hidden Markov models, conditional random fields, decision trees, random forests, and support vector machines.

The use of neural networks for NLP did not start until the early 2000s. But by the end of the 2010s, neural networks had transformed NLP, enhancing or even replacing earlier techniques. This has been made possible because we now have more data to train neural network models and more powerful computing systems to do so.

In traditional NLP, features were often hand-crafted, incomplete, and time-consuming to create. Neural networks can learn multilevel features automatically. They also tend to give better results.

Discussion

Which are the main innovations in the application of NN to NLP?
Two main innovations have enabled the use of neural networks in NLP:
- Word Embeddings: These enable us to represent words as real-valued vectors. Instead of a sparse representation, word embeddings represent words in a much lower-dimensional space. We can identify similar words by their closeness in this vector space, or use analogies to exploit semantic relationships between words.
- NN Architectures: These had evolved in other domains such as computer vision and were adapted to NLP. They were first applied to language modelling, and later to morphology, POS tagging, coreference resolution, parsing, and semantics. From these core areas, neural networks moved on to applications: sentiment analysis, speech recognition, information retrieval/extraction, text classification/generation, summarization, question answering, and machine translation. These architectures are usually not as deep (that is, they have fewer hidden layers) as those found in computer vision.
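To make the word-embedding idea concrete, here is a minimal sketch of closeness and analogy in a vector space. The three-dimensional vectors are made up for illustration (real embeddings are learned and typically have hundreds of dimensions), and the helper names `cosine`, `nearest` and `analogy` are hypothetical.

```python
import math

# Toy 3-dimensional embeddings; the values are invented for illustration only.
EMBEDDINGS = {
    "king":  [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
    "apple": [0.1, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is most similar to vector."""
    candidates = (w for w in EMBEDDINGS if w not in exclude)
    return max(candidates, key=lambda w: cosine(vector, EMBEDDINGS[w]))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' using the vector arithmetic b - a + c."""
    target = [vb - va + vc for va, vb, vc in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    return nearest(target, exclude={a, b, c})

print(analogy("man", "king", "woman"))
```

With these toy values, "king" is closer to "man" than to "apple", and the analogy resolves to "queen". Learned embeddings show the same kind of behaviour only approximately.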
Which are the NN architectures that have been used for NLP?
Example of 2-layer BiLSTM-based ELMo. Source: Horan 2019.
Early language models used feedforward or convolutional NN architectures, but these didn't capture context very well. Context is how a word occurs in relation to the surrounding words in the sentence. To capture context, recurrent NNs were applied. LSTM, a variant of RNN, was then used to capture long-distance context. Bidirectional LSTM (BiLSTM) improves upon LSTM by reading word sequences in both forward and backward directions.
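The difference between one-directional and bidirectional reading can be sketched with a deliberately simplified recurrent network. This is a plain scalar RNN, not an LSTM (it has no gates), and the weights `w_x` and `w_h` are arbitrary illustrative values rather than learned parameters.

```python
import math

def rnn_states(inputs, w_x=0.5, w_h=0.8):
    """Run a scalar Elman-style RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def birnn_states(inputs):
    """Pair each position's forward state with its backward state."""
    forward = rnn_states(inputs)
    # Read the sequence right to left, then flip so positions line up again.
    backward = rnn_states(inputs[::-1])[::-1]
    return list(zip(forward, backward))

# Changing only the last input leaves the forward state at position 0
# untouched, but changes the backward state at position 0.
a = birnn_states([1.0, 2.0, 3.0])
b = birnn_states([1.0, 2.0, -3.0])
print(a[0], b[0])
```

The forward state at a position summarizes only the words to its left, while the backward state summarizes the words to its right. Pairing the two gives each position a view of the whole sentence.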
Typically, the dimensionality of input and output must be known and fixed. This is problematic for machine translation. For example, the best translation of a 10-word English sentence might be a 12-word French sentence. This problem is solved by a sequence-to-sequence model that's based on encoder-decoder architecture. The essence of the encoder is to encode an entire input sequence into a large fixed-dimensional vector, called the context vector. The decoder implements a language model conditioned on the input sequence.
It's difficult to encode all contextual information in a single context vector. This gave rise to the idea of attention, where the decoder is given access to more information than just that one vector. From here, the transformer model evolved.
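One common way to realize attention is to let the decoder score every encoder state and take a weighted average of them. The sketch below uses dot-product scoring, which is an assumption made for brevity (other scoring functions exist), and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights and the resulting context vector."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# One encoder state per source word (toy values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([0.0, 5.0], encoder_states)
print(weights, context)
```

Here the decoder state aligns with the second and third encoder states, so they receive almost all of the weight. The context vector is recomputed at every decoding step instead of being fixed once for the whole sentence.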
What's been the general trend in NLP research with neural networks?
Language modelling has been essential for the progress of NLP. Because of the ready availability of text, it's been easy to train complex models in an unsupervised manner on lots of training data. The intent is to train the model to learn about words and the contexts in which they occur. For example, the model should learn a vector representation of "bank" and also discriminate between a river bank and a financial institution.
A pretrained language model, first proposed in 2015, can save us expensive training on vast amounts of data. However, such a pretrained model may need some additional training on domain-specific data. Then the model can be applied to many downstream NLP tasks. This approach is similar to using pretrained word embeddings, except that those embeddings didn't capture context.
The use of a pretrained language model in another downstream task is called transfer learning, a concept that's also common in computer vision.
It's expected that the transformer model will dominate over RNN. Pretrained models will get better. It'll be easier to fine-tune models. Transfer learning will become more important.
Could you share some real-world examples of NN in NLP?
NN model for Gmail's Smart Compose. Source: Wu 2018.
In 2018, Google introduced Smart Compose in Gmail. A seq2seq model using email subject and previous email body gave good results but failed to meet latency constraints. They finally settled on a hybrid of bag-of-words (BoW) and RNN-LM. Average embeddings are fed to RNN-LM.
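The bag-of-words side of such a hybrid can be sketched as averaging word embeddings per field. This is only an illustration under assumptions: the embeddings are toy values, the function names are hypothetical, and concatenating the two averages is one plausible way to form the input to the RNN-LM, not a confirmed detail of Google's system.

```python
def average_embedding(words, embeddings, dim):
    """Average the embeddings of known words; word order is ignored (BoW)."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def context_features(subject, previous_body, embeddings, dim):
    """Concatenate per-field averages into one fixed-size feature vector.

    Assumption: the fields are joined by concatenation before being fed
    to the language model.
    """
    return (average_embedding(subject, embeddings, dim)
            + average_embedding(previous_body, embeddings, dim))

toy_embeddings = {"lunch": [1.0, 0.0], "friday": [0.0, 1.0]}
print(context_features(["lunch", "friday"], ["lunch"], toy_embeddings, 2))
```

Averaging is much cheaper than running a full encoder over the subject and previous email, which is consistent with the latency concern mentioned above.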
At Amazon, they've used a lightweight version of ELMo to augment Alexa functions. While ELMo uses a stack of BiLSTM, they use a single layer since Alexa transactions are linguistically more uniform. They trained the embeddings in an unsupervised manner, and then trained on two tasks (intent classification and slot tagging) in a supervised manner while only slowly adjusting the embeddings. They also did transfer learning on new tasks.
Uber has used NLP to filter tickets related to map data. Using word2vec, they trained word embeddings on one million tickets. This had the limitation that all words are treated equally. They then experimented with WordCNN and LSTM networks. They got the best results with word2vec trained on customer tickets and used with WordCNN. For future work, they suggested character-level (CharCNN) embeddings that are more resilient to typos.

Milestones

2001

Neural network with word vector C(i) for ith word. Source: Bengio et al. 2003, fig. 1.

Bengio et al. point out the curse of dimensionality where the large vocabulary size of natural languages makes computations difficult. They propose a feedforward neural network that jointly learns the language model and vector representations of words. They refine their methods in a follow-up paper from 2003.
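A forward pass through this kind of model can be sketched as follows. It's a simplified illustration, not the authors' implementation: weights are random and untrained, biases and the optional direct input-to-output connections are omitted, and the vocabulary and layer sizes are arbitrary.

```python
import math
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "mat"]
EMBED_DIM, CONTEXT, HIDDEN = 3, 2, 4

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]

C = rand_matrix(len(VOCAB), EMBED_DIM)        # word feature vectors C(i)
H = rand_matrix(HIDDEN, CONTEXT * EMBED_DIM)  # input-to-hidden weights
U = rand_matrix(len(VOCAB), HIDDEN)           # hidden-to-output weights

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def next_word_probs(context_words):
    """Return P(next word | previous CONTEXT words) over the vocabulary."""
    # Look up and concatenate the feature vectors of the context words.
    x = [value for word in context_words for value in C[VOCAB.index(word)]]
    hidden = [math.tanh(a) for a in matvec(H, x)]
    logits = matvec(U, hidden)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = next_word_probs(["the", "cat"])
print(dict(zip(VOCAB, probs)))
```

During training, gradients flow into `C` as well as `H` and `U`, which is how the word vectors and the language model are learned jointly.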

2008

Collobert and Weston train a language model in an unsupervised manner from Wikipedia data. They use supervised training for both syntactic tasks (POS tagging, chunking, parsing) and semantic tasks (named entity recognition, semantic role labelling, word sense disambiguation). To model long-distance dependencies, they use a Time-Delay Neural Network (TDNN) inspired by CNN. They use multiple layers to move from local features to global features. Moreover, two models can share word embeddings, an approach called multitask learning.

2010

Mikolov et al. use a recurrent neural network (RNN) for language modelling and apply this for speech recognition. They show better results than traditional n-gram models.

2012

Dahl et al. combine deep neural network with hidden Markov model (HMM) for large vocabulary speech recognition.

2013

Going beyond just word embeddings, Kalchbrenner and Blunsom map an entire input sentence to a vector. They use this for machine translation without relying on alignments or phrasal translation units. In other research, LSTM is found to capture long-range context and is therefore suitable for generating sequences. In general, 2013 is the year when research focuses on using CNN, RNN/LSTM and recursive NN for NLP.

2014

Using pretrained word2vec embeddings, Yoon Kim uses CNN for sentence classification. Also in 2014, Sutskever et al. at Google apply a sequence-to-sequence model to the task of machine translation. They use separate 4-layered LSTMs for the encoder and decoder. Reversing the order of source sentences allows the LSTM to exploit short-term dependencies and therefore do well on long sentences. Seq2seq models are suited for NLG tasks such as captioning images or describing source code changes.

Sep 2014

Encoder-decoder model with attention. Source: Weng 2018, fig. 4.

Bahdanau et al. apply the concept of attention to the seq2seq model used in machine translation. This helps the decoder to "pay attention" to important parts of the source sentence. It doesn't force the encoder to pack all information into a single context vector. Effectively, the model does a soft alignment of input to output words.

Nov 2015

Dai and Le propose a two-step procedure of unsupervised pre-training followed by supervised training for text classification. This semi-supervised approach works well. Pre-training helps to initialize the model for supervised training and generalization. More unlabelled data during pre-training is seen to improve supervised learning. In later years, this approach becomes important.

Sep 2016

Google's model for NMT. Source: Wu et al. 2016, fig. 1.

Google replaces its phrase-based translation system with Neural Machine Translation (NMT). This reduces translation errors by 60%. It uses a deep LSTM network with 8 encoder and 8 decoder layers. The first layer of the encoder is a BiLSTM. The model also uses residual connections among the LSTM layers.

May 2017

Use of convolution and attention in seq2seq modelling. Source: Gehring et al. 2017, fig. 1.

Recurrent architectures are hard to parallelize because each step depends on the previous one. Gehring et al. therefore propose using CNNs for seq2seq modelling since CNNs can be parallelized and make the best use of GPU hardware. The model uses gated linear units, residual connections, and attention in each decoder layer.
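Two of these building blocks are easy to sketch in isolation. A gated linear unit splits its input into halves A and B and outputs A multiplied elementwise by sigmoid(B), and a residual connection adds a block's input back to its output. The function names are hypothetical, and the `transformed` argument stands in for the output of a convolution, which isn't implemented here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def glu(values):
    """Split values into halves A and B and return A * sigmoid(B).

    Assumes len(values) is even; B acts as a learned gate on A.
    """
    half = len(values) // 2
    a, b = values[:half], values[half:]
    return [x * sigmoid(g) for x, g in zip(a, b)]

def residual_glu_block(inputs, transformed):
    """Add the block input back to the gated output (residual connection)."""
    return [x + y for x, y in zip(inputs, glu(transformed))]

# A gate value of 0 gives sigmoid(0) = 0.5, so half of A passes through.
print(glu([2.0, 4.0, 0.0, 0.0]))
print(residual_glu_block([1.0, 1.0], [2.0, 4.0, 0.0, 0.0]))
```

Note that the gated output has half the width of the block's pre-gate output, so the transformation feeding the gate must produce twice as many values as the block's input for the residual sum to line up.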

Dec 2017

Encoder self-attention distribution for the word 'it' in different contexts. Source: Uszkoreit 2017.

Vaswani et al. propose the transformer model in which they use a seq2seq model without using RNN. The transformer model relies only on self-attention. By 2018, the transformer leads to state-of-the-art models such as OpenAI GPT and BERT.
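Self-attention lets every position in a sequence attend to every other position of the same sequence. The sketch below shows scaled dot-product self-attention in its barest form. As a simplifying assumption, the queries, keys and values are the input vectors themselves; the transformer also applies learned projections and uses multiple heads, which are omitted here.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights

outputs, weights = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(weights[0])
```

Each output row depends only on the inputs, not on a previous step's output, so all positions can be computed in parallel. This is the key contrast with recurrent models.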

2018

Researchers at the Allen Institute for Artificial Intelligence introduce ELMo (Embeddings from Language Models). While earlier work derived contextualized word vectors, this was limited to the top LSTM layer. ELMo's word representations use all layers of a bidirectional language model. This allows ELMo to model syntax, semantics and polysemy. Such a language model can be pretrained on a large scale and then used for a number of downstream tasks.
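The idea of using all layers can be sketched as a weighted sum of per-layer vectors for a single token, with softmax-normalized layer weights and an overall scale factor. The layer vectors below are toy values, and `elmo_combine` is a hypothetical helper name. In practice the layer scores and scale are learned per downstream task.

```python
import math

def elmo_combine(layer_vectors, layer_scores, gamma=1.0):
    """Collapse one token's per-layer vectors into a single vector.

    layer_scores are unnormalized; softmax turns them into weights that
    sum to 1, and gamma rescales the whole combined vector.
    """
    m = max(layer_scores)
    exps = [math.exp(s - m) for s in layer_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_vectors[0])
    return [gamma * sum(w * layer[i] for w, layer in zip(weights, layer_vectors))
            for i in range(dim)]

# Three layers (for example, token embedding plus two BiLSTM layers), toy values.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(elmo_combine(layers, [0.0, 0.0, 0.0]))
```

With equal scores, the result is the plain average of the layers. A task that benefits more from one layer's information can learn to raise that layer's score.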

Feb 2019

OpenAI GPT-2 shows its power in natural language generation. Trained on 8 million web pages, it has 1.5 billion parameters. The model is initially not released to the public due to concerns about misuse (such as fake news generation). However, in November 2019, GPT-2 is released.


Cite As

Devopedia. 2019. "Neural Networks for NLP." Version 6, November 12. Accessed 2023-11-12. https://devopedia.org/neural-networks-for-nlp
