# Named Entity Recognition

Named Entity Recognition (NER) is an essential task within the more general discipline of Information Extraction (IE). To obtain structured information from unstructured text, we wish to identify named entities. Anything with a proper name is a named entity. This includes names of people, places, organizations, vehicles, facilities, and so on.

While temporal and numerical expressions are not about entities per se, they're important in understanding unstructured text. Hence, such expressions are included in NER.

The essence of NER is to identify named entities and also classify them. NLP techniques used for POS tagging and syntactic chunking are applicable to NER.

NER models trained on general newswire text are not suitable for specialized domains, such as law or medicine. Domain-specific training is required.

## Discussion

• What are the common entity types identified by NER?

At its simplest, NER recognizes three entity classes: location, person, and organization. In practice, at least for newswire text, there's value in extracting geo-political entities, date, time, currency, and percent. For more fine-grained NER, systems may identify ordinal/cardinal numbers, events, works of art, facilities, products, and so on.

Temporal expressions can be absolute or relative. For example, 'summer of 1977' and '10:15 AM' are absolute, whereas 'yesterday' and 'last quarter' are relative. The expression 'four hours' is an example of a duration.
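As a toy illustration, temporal expression spotting can be sketched with regular expressions. The patterns below are hypothetical and cover only the examples given here; real systems use much richer grammars or sequence models.

```python
import re

# Hypothetical patterns: absolute expressions pin down a point in time,
# while relative ones depend on when the text was written.
ABSOLUTE = re.compile(r"\d{1,2}:\d{2}\s*(?:AM|PM)|summer of \d{4}", re.I)
RELATIVE = re.compile(r"\byesterday\b|\blast (?:week|month|quarter|year)\b", re.I)

def find_temporal(text):
    """Return (absolute, relative) temporal expressions found in text."""
    return ABSOLUTE.findall(text), RELATIVE.findall(text)
```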

Specialized domains often require additional or alternative entity types. In the biomedical domain, for example, we can broadly define six types: Cell, Chemical, Disease, Gene (DNA or RNA), Protein, and Species. For question answering, Sekine defined a hierarchy of more than 200 entity types.

• Could you describe some applications of NER?

Given the huge volume of content published online each day, NER helps in categorizing it, thus easing search and content discovery. In fact, for information retrieval, named entities can be used as search indices to speed up search. Likewise, NER helps organize and categorize research publications. For example, from the thousands of papers on machine learning, we might be interested only in face detection using CNNs.

Content recommendation is possible with NER. Based on the entities in the current article, the reader can be recommended related articles. Another NER application is online customer support. NER can identify product type, model number, store location, and more. Opinion mining uses NER as a pre-processing task. NER can help link related concepts and entities for the Semantic Web.

One developer collected 30,000 online recipes but found that the descriptions didn't fit any particular pattern. He therefore applied NER with entity types Name, Quantity, and Unit, manually labelling 1,500 samples of about 10,000 tokens.

Among the NLP tasks that benefit from NER are question generation, relation extraction, and coreference resolution.

• What are the typical challenges with NER?

Ambiguities make NER a challenging task. For example, 'JFK' can refer to former US President John F. Kennedy or to his son. These are different entities of the same type; coreference resolution is the NLP task that resolves this ambiguity. More common in NER is when the same name refers to different types. For example, 'JFK' can refer to the airport in New York or to Oliver Stone's 1991 movie.

Since entities can span multiple words/tokens, NER needs to identify the start and end of multi-token entities. A name within a name (such as Cecil H. Green Library) is also a challenge. When a word is a qualifier, it may be wrongly tagged. For example, in 'Clinton government', an NER system may tag Clinton as PER without recognizing the noun phrase.

The same entity can appear in different forms. Differences could be typographical (nucleotide binding vs NUCLEOTIDE-BINDING), morphological (localize vs localization), syntactic (DNA translocation vs translocation of DNA), due to reduction (secretion process vs secretion/ATP-binding activity), or abbreviated (type IV secretory system vs T4SS).
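A minimal sketch of normalizing such variants, handling only the typographical and simple 'X of Y' syntactic cases (morphological and abbreviation variants would need stemming or dedicated dictionaries):

```python
import re

def normalize(mention):
    """Map simple surface variants of a mention to one canonical form."""
    m = mention.lower()
    m = re.sub(r"[-_/]", " ", m)                 # typographical: hyphens, slashes
    m = re.sub(r"^(.*) of (.*)$", r"\2 \1", m)   # syntactic: 'translocation of DNA'
    return re.sub(r"\s+", " ", m).strip()
```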

• What's the typical processing pipeline in NER?

NER is typically a supervised task: it needs annotated training data and a set of defined features. From the annotated data, the ML model learns how to map these features to entity labels. NER can be seen as a sequence labelling problem since it identifies a span of tokens and classifies it as a named entity. Thus, NER can be solved in a manner similar to POS tagging or syntactic phrase chunking. Statistical models such as HMM, MEMM or CRF can be used.

Annotating data manually is slow. A semi-automated approach is to supply a file containing a set of patterns or rules that pre-label the data. Another approach is to start with an existing model and manually correct its mistakes.

A practical approach is for a model to label unambiguous entities in the first pass. Subsequent passes make use of already labelled entities to resolve ambiguities about other entities.

Essentially, iterative approaches that start with patterns, dictionaries or unambiguous entities start with high precision but low recall. Recall is then improved with already tagged entities and feedback. Manually annotated data is typically reserved for model evaluation.
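The multi-pass idea can be sketched as follows, with a hypothetical two-entry gazetteer: the first pass labels only exact, unambiguous matches (high precision), and the second pass reuses those labels to tag shorter variants of already-found names (improving recall).

```python
# Hypothetical gazetteer entries for illustration.
GAZETTEER = {"John F. Kennedy": "PER", "New York": "LOC"}

def first_pass(text):
    """Label only unambiguous, exact gazetteer matches (high precision)."""
    return {name: etype for name, etype in GAZETTEER.items() if name in text}

def second_pass(text, found):
    """Propagate labels to shorter variants of already-labelled names."""
    remaining = text
    for name in found:                    # mask full-name mentions first
        remaining = remaining.replace(name, " ")
    labels = dict(found)
    for name, etype in found.items():
        last = name.split()[-1]           # e.g. 'Kennedy' for 'John F. Kennedy'
        if last in remaining:             # shorter variant appears on its own
            labels[last] = etype
    return labels
```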

• What sort of features are useful for training an NER model?

The token and its stem are basic features for NER. In addition, POS tag and phrase chunk label give useful information. Context from surrounding tokens, and their POS tags and chunk labels, are also useful inputs. For example, 'Rev.', 'MD', or 'Inc.' are good indicators of entities that precede or follow them.

The token's shape or orthography is particularly useful. This includes presence of numbers, punctuation, hyphenation, mixed case, all caps, etc. For example, A9, Yahoo!, IRA, and eBay are entities that can be identified by their shape. While this is useful for newswire text, it's less useful for blogs or text transcribed automatically from speech. Also, there's no case information in Chinese.

Gazetteers maintain lists of names of places and people. Presence of a token in such a list can be a feature. Maintaining such a list is often difficult, and these lists are less useful for identifying persons and organizations. Likewise, domain-specific knowledge bases can be used, and string matches against these bases can identify entities. Examples include DBpedia, DBLP and ScienceWise.
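A minimal sketch of such a feature extractor, combining the token, its context, its orthographic shape, and a hypothetical gazetteer lookup:

```python
# Hypothetical gazetteer of place names for illustration.
GAZETTEER = {"London", "Paris", "Berlin"}

def word_shape(token):
    """Map characters to a coarse shape, e.g. 'eBay' -> 'xXxx', 'A9' -> 'Xd'."""
    shape = []
    for ch in token:
        if ch.isupper():   shape.append("X")
        elif ch.islower(): shape.append("x")
        elif ch.isdigit(): shape.append("d")
        else:              shape.append(ch)
    return "".join(shape)

def token_features(tokens, pos_tags, i):
    """Feature dict for token i, using its context and orthography."""
    return {
        "token": tokens[i],
        "pos": pos_tags[i],
        "shape": word_shape(tokens[i]),
        "all_caps": tokens[i].isupper(),
        "in_gazetteer": tokens[i] in GAZETTEER,
        "prev_token": tokens[i - 1] if i > 0 else "<BOS>",
        "next_token": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }
```

In a CRF-based pipeline, such dicts would be computed for every token position and fed to the sequence model.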

• How have neural networks influenced NER?

Linear statistical models used hand-crafted features and gazetteers. These are hard to adapt to new tasks or domains. Non-linear neural network models make use of word embeddings. Early NN models used these embeddings along with hand-crafted features.

A neural network for NER typically has three components: a word embedding layer, a context encoder layer, and a decoder layer. Since NER is basically a sequence labelling task, RNNs are suitable; BiLSTM in particular is good at capturing context. Since 2019, the transformer architecture has been successfully applied to NER. Multi-task learning (MTL) is an approach in which the model is trained on different tasks and representations are shared across tasks.

For NER, character-level representations obtained via CNN or transformer are useful since they encode morphological features, alleviate out-of-vocabulary issues, and overcome the data sparsity problem.

Neural networks are used by NeuroNER (LSTM), spaCy (CNN), and AllenNLP (Bidirectional LM).

• How are words or phrases tagged with their entity types in NER?

Entities have to be tagged in a manner that's suitable for algorithms to process. There are many tagging schemes available, some of which evolved from those used in chunking:

• IO: The simplest scheme, where I_X identifies the named entity type X and O indicates no entity.
• IOB: Where an entity has multiple tokens, this tags the begin token (B) differently from an inner token (I). A variation of this is IOB2 or BIO.
• BMEWO: This distinguishes the begin token (B), middle token (M), end token (E), and a single-token entity (W). An extension of this with more tags is called BMEWO+. Another similar scheme is BILOU, where the letters stand for begin, inner, last, out and unit.

These encoding schemes have different complexities. For N entities, IO, BIO and BMEWO have complexity of N+1, 2N+1 and 4N+1 tags respectively.

While BIO has been a popular choice, more recent work has shown that BILOU outperforms BIO. spaCy supports IOB and BILUO schemes.
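The schemes can be illustrated with a small encoder that tags one entity span under either BIO or BILOU (span end is exclusive):

```python
def encode(tokens, span, etype, scheme="BIO"):
    """Tag an entity span (start, end) of type etype under the given scheme."""
    tags = ["O"] * len(tokens)
    start, end = span
    if scheme == "BIO":
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    elif scheme == "BILOU":
        if end - start == 1:                  # single-token entity: unit tag
            tags[start] = "U-" + etype
        else:
            tags[start] = "B-" + etype
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + etype
            tags[end - 1] = "L-" + etype      # last token gets its own tag
    return tags
```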

• How do we evaluate algorithms for NER?

Recall and precision are the basic metrics for evaluating NER models. Recall gives the ratio of correctly labelled entities to the total that should have been labelled. Precision is the ratio of correctly labelled entities to the total labelled. F-measure is a single metric that combines both recall and precision.

One caveat in using these metrics is scope. In a real application, they have to be applied at the level of actual named entities. Typical ML models optimize performance at the tag level, based on a particular encoding scheme. Performance at these two levels can be quite different.
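Entity-level evaluation can be sketched as set comparison, where each entity is a (start, end, type) tuple and an entity counts as correct only if both its span and type match the gold annotation:

```python
def evaluate(gold, predicted):
    """Entity-level precision, recall and F1 from gold and predicted entities."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)           # exact span-and-type matches
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```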

Tagging details also matter. For example, in one study, training data did not include titles (Prime Minister, Pope, etc.) but Amazon Comprehend included these, thus leading to higher recall but lower precision. Different annotated datasets sometimes disagree about entities. For example, in "Baltimore defeated the Yankees", MUC-7 tags Baltimore as LOC whereas CoNLL2003 tags it as ORG.

CoNLL2003 and OntoNotes 5.0 are commonly used for training and evaluating models. AllenNLP used the One Billion Word Benchmark.

• What resources are available for research into NER?

As part of the GATE framework (University of Sheffield, UK) for text processing, ANNIE is an NER pipeline. Researchers can try out the online demo plus a free API.

displaCy from Explosion is a useful visualization tool. Prodigy is an annotation tool for creating training data. NER is one of the supported tasks. A 2018 survey lists English NER datasets and tools.

Cloud providers also offer APIs for NER. Google's Natural Language API is one example; it can also determine sentiment about the identified entities. An equivalent offering from Amazon is Amazon Comprehend.

Stanford NER is a Java implementation. It's also called CRFClassifier because it's based on a linear chain CRF sequence model. Via Stanford CoreNLP, it can be invoked from other languages.

More generally, packages that support HMM, MEMM, or CRF can be used to train an NER model. Such support is available in Mallet, NLTK, and Stanford NER. A Scikit-Learn compatible package that's useful is sklearn-crfsuite. Polyglot is capable of doing NER for 40 different languages.

## Milestones

Feb 1991

Lisa Rau implements an algorithm to extract company names from financial news. It's a combination of heuristics, exception lists and extensive corpus analysis. In subsequent retrieval tasks, the algorithm also looks at most likely variations of names. In 1992, she files for a US patent, which is granted in 1994.

1996

The term Named Entity is first used at the 6th Message Understanding Conference (MUC). The first MUC was held in 1987 with the aim of automating analysis of text messages in the military. While early MUC events focused mainly on template filling, MUC-6 looks at sub-tasks that would aid information extraction. It's in this context that named entities become relevant. SGML is used to markup entities. During 1996-2008, NER is talked about in MUC, CoNLL and ACE conferences.

2003

Hammerton applies Long Short-Term Memory (LSTM) neural network to NER. He finds significantly better performance for German but a disappointing baseline performance for English. Words and their sequences are represented using SARDNET. The algorithm operates in two passes. In the first pass, information is gathered. In the second pass, the algorithm disambiguates and outputs the named entities.

2008

NER is dropped from international evaluation forums. In fact, by 2005, NER was considered a solved problem, with models achieving recall and precision exceeding 90%. However, the best score on ACE 2008 is only about 50%, which is significantly lower than the scores for MUC and CoNLL2003 tasks. This suggests that NER is not a solved problem.

Jun 2009

Ratinov and Roth address some design challenges for NER. They note that NER is knowledge intensive in nature. Therefore, they use 30 gazetteers, 16 of which are extracted from Wikipedia. In general, these are high-precision, low-recall lists. They are shown to be effective for webpages, where there's less contextual information. Expressive features and gazetteers enable unsupervised learning. They also note the BILOU encoding significantly outperforms BIO.

Jun 2011

Tweets have the problem of insufficient information. NER models trained on news articles also do poorly on tweets. This can be solved by domain adaptation or semi-supervised learning from lots of unlabelled data. Liu et al. approach this problem with a combination of K-Nearest Neighbour (KNN) classifier and Conditional Random Field (CRF) labeller. KNN captures global coarse evidence while CRF captures fine-grained information from a single tweet. The classifier is retrained based on recently labelled tweets. Gazetteers are also used.

Aug 2011

To avoid task-specific engineering and hand-crafted features, Collobert et al. propose a unified neural network approach that can perform part-of-speech tagging, chunking, named entity recognition, and semantic role labelling. The model learns representations from lots of unlabelled data. The best results for NER are obtained with word embeddings from a trained language model and optimizing for sentence-level log-likelihood. This work motivates other researchers towards neural networks in preference to hand-crafted features.

Jul 2015

Santos and Guimarães extend the work of Collobert et al. by considering character-level representations using a convolutional layer. They note that "word-level embeddings capture syntactic and semantic information, character-level embeddings capture morphological and shape information". The use of both gives best results.

Aug 2015

Huang et al. study the use of LSTM and CRF for sequence labelling. They find that BiLSTM with a CRF layer gives state-of-the-art results for POS tagging, chunking and NER. With BiLSTM, past (forward states) and future (backward states) features are used. CRF works at the sentence level to predict the current entity label based on past and future labels. This model shows good performance even when Collobert et al.'s word embedding is not used. For faster training, spelling and context features bypass the BiLSTM and go directly to the CRF layer.

2016

Ma and Hovy achieve a state-of-the-art F1 score of 91.21 for NER on the CoNLL 2003 dataset. Their approach requires no feature engineering or specific data pre-processing. They use a CNN to obtain character-level representations that capture morphological features such as prefixes and suffixes. GloVe word embeddings are used. Character embeddings are randomly initialized and then used to obtain character-level representations. Chiu and Nichols present similar work that also uses word-level features.

Jul 2018

Güngör et al. study NER for morphologically rich languages such as Turkish, Finnish, Czech and Spanish. Word morphology is important for these languages. This is in contrast to English that gets useful information from syntax and word n-grams. They therefore propose a model that uses morphological embedding. This is combined with character-based and word embeddings. Both character-based and morphological embeddings are derived using separate BiLSTMs.

Jul 2019

Francis et al. make use of BERT language model for transfer learning in NER. For fine tuning, either softmax or CRF layer is used. They find BERT representations perform best in combination with character-level representation and word embeddings. Without BERT, they also show competitive performance when character-level representation and word embeddings are gated via an attention layer.

Dec 2019

Yan et al. note that the traditional transformer architecture is not quite as good for NER as it is for other NLP tasks. They customize the transformer architecture and achieve state-of-the-art results, beating prevailing BiLSTM models. They call it Transformer Encoder for NER (TENER). While the traditional transformer uses position embeddings, directionality is lost. TENER uses relative position encoding to capture both distance and direction. The smooth, scaled attention of the traditional transformer is seen to attend to noisy information; TENER therefore uses sharp, unscaled attention.

## Cite As

Devopedia. 2020. "Named Entity Recognition." Version 5, February 4. Accessed 2022-09-23. https://devopedia.org/named-entity-recognition