Named Entity Recognition

Author: arvindpdmn
Created: 2020-01-25 14:44:27
Last updated: 2020-02-04 08:03:48

Summary

Sample text tagged with named entities (some tags are wrong). Source: Vyas 2018.

Named Entity Recognition (NER) is an essential task of the more general discipline of Information Extraction (IE). To obtain structured information from unstructured text we wish to identify named entities. Anything with a proper name is a named entity. This would include names of people, places, organizations, vehicles, facilities, and so on.

While temporal and numerical expressions are not about entities per se, they're important in understanding unstructured text. Hence, such expressions are included in NER.

The essence of NER is to identify the named entities and also classify them. NLP techniques used for POS tagging and syntactic chunking are applicable to NER.

NER models trained on general newswire text are not suitable for specialized domains, such as law or medicine. Domain-specific training is required.

Milestones

Feb
1991
Features for extracting company names. Source: Rau 1994.

Lisa Rau implements an algorithm to extract company names from financial news. It's a combination of heuristics, exception lists and extensive corpus analysis. In subsequent retrieval tasks, the algorithm also looks at most likely variations of names. In 1992, she files for a US patent, which is granted in 1994.

1996
Sample annotation of named entities using SGML. Source: Grishman and Sundheim 1996, fig. 1.

The term Named Entity is first used at the 6th Message Understanding Conference (MUC). The first MUC was held in 1987 with the aim of automating analysis of text messages in the military. While early MUC events focused mainly on template filling, MUC-6 looks at sub-tasks that would aid information extraction. It's in this context that named entities become relevant. SGML is used to mark up entities. During 1996-2008, NER is talked about in MUC, CoNLL and ACE conferences.

2003

Hammerton applies Long Short-Term Memory (LSTM) neural network to NER. He finds significantly better performance for German but a disappointing baseline performance for English. Words and their sequences are represented using SARDNET. The algorithm operates in two passes. In the first pass, information is gathered. In the second pass, the algorithm disambiguates and outputs the named entities.

2008

NER is dropped from international evaluation forums. In fact, by 2005, NER was considered a solved problem, with models achieving recall and precision exceeding 90%. However, the best score on ACE 2008 is only about 50%, which is significantly lower than the scores for MUC and CoNLL2003 tasks. This suggests that NER is not a solved problem.

Jun
2009

Ratinov and Roth address some design challenges for NER. They note that NER is knowledge intensive in nature. Therefore, they use 30 gazetteers, 16 of which are extracted from Wikipedia. In general, these are high-precision, low-recall lists. They are shown to be effective for webpages, where there's less contextual information. Expressive features and gazetteers enable unsupervised learning. They also note that the BILOU encoding significantly outperforms BIO.

Jun
2011

Tweets have the problem of insufficient information. NER models trained on news articles also do poorly on tweets. This can be solved by domain adaptation or semi-supervised learning from lots of unlabelled data. Liu et al. approach this problem with a combination of K-Nearest Neighbour (KNN) classifier and Conditional Random Field (CRF) labeller. KNN captures global coarse evidence while CRF captures fine-grained information from a single tweet. The classifier is retrained based on recently labelled tweets. Gazetteers are also used.

Aug
2011

To avoid task-specific engineering and hand-crafted features, Collobert et al. propose a unified neural network approach that can perform part-of-speech tagging, chunking, named entity recognition, and semantic role labelling. The model learns representations from lots of unlabelled data. The best results for NER are obtained with word embeddings from a trained language model and optimizing for sentence-level log-likelihood. This work motivates other researchers towards neural networks in preference to hand-crafted features.

Jul
2015

Santos and Guimarães extend the work of Collobert et al. by considering character-level representations using a convolutional layer. They note that "word-level embeddings capture syntactic and semantic information, character-level embeddings capture morphological and shape information". The use of both gives best results.

Aug
2015

Huang et al. study the use of LSTM and CRF for sequence labelling. They find that BiLSTM with a CRF layer gives state-of-the-art results for POS tagging, chunking and NER. With BiLSTM, past (forward states) and future (backward states) features are used. CRF works at the sentence level to predict the current entity label based on past and future labels. This model shows good performance even when Collobert et al.'s word embedding is not used. For faster training, spelling and context features bypass the BiLSTM and go directly to the CRF layer.

2016
Neural network for NER using CNN, LSTM, CRF and char+word embeddings. Source: Ma and Hovy 2016, fig. 1 and 3.

Ma and Hovy achieve state-of-the-art F1 score of 91.21 for NER on CoNLL 2003 dataset. Their approach requires no feature engineering or specific data pre-processing. They use CNN to obtain character-level representations to capture morphological features such as prefixes and suffixes. GloVe word embeddings are used. Character embeddings are randomly initialized and then used to obtain character-level representations. Chiu and Nichols present a similar work that also uses word-level features.

Jul
2018
Concatenation of three different embeddings. Source: Güngör et al. 2018, fig.3.

Güngör et al. study NER for morphologically rich languages such as Turkish, Finnish, Czech and Spanish. Word morphology is important for these languages. This is in contrast to English, which gets useful information from syntax and word n-grams. They therefore propose a model that uses morphological embedding. This is combined with character-based and word embeddings. Both character-based and morphological embeddings are derived using separate BiLSTMs.

Jul
2019
Different architectures based on character/word/BERT representations. Source: Francis et al. 2019, fig. 1.

Francis et al. make use of BERT language model for transfer learning in NER. For fine tuning, either softmax or CRF layer is used. They find BERT representations perform best in combination with character-level representation and word embeddings. Without BERT, they also show competitive performance when character-level representation and word embeddings are gated via an attention layer.

Dec
2019
Transformer architecture of TENER for character-level encoding and word-level context. Source: Yan et al. 2019, fig. 2.

Yan et al. note that the traditional transformer architecture is not quite as good for NER as it is for other NLP tasks. They customize transformer architecture and achieve state-of-the-art results, beating prevailing BiLSTM models. They call it Transformer Encoder for NER (TENER). While traditional transformer uses position embedding, directionality is lost. TENER uses relative position encoding to capture distance and direction. Smoothing and scaling of traditional transformer is seen to attend to noisy information. TENER therefore uses sharp unscaled attention.
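
To make the last point concrete, the sketch below compares scaled and unscaled dot-product attention on toy vectors. It's not TENER's implementation: the vectors and the helper names `softmax` and `attention_weights` are made up for illustration, and relative position encoding is left out. It only shows that dropping the 1/sqrt(d) scaling gives a sharper distribution of attention weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, scale=True):
    """Dot-product attention weights of one query over a list of keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scale:
        # Standard transformer scaling by sqrt(d) flattens the distribution.
        scores = [s / math.sqrt(d) for s in scores]
    return softmax(scores)


# Toy vectors (assumed values, not taken from any model).
query = [1.0, 2.0, 0.5, 1.5]
keys = [
    [1.0, 2.0, 0.5, 1.5],
    [0.5, 0.1, 0.2, 0.3],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
]

scaled = attention_weights(query, keys, scale=True)
unscaled = attention_weights(query, keys, scale=False)
print("scaled:  ", [round(w, 3) for w in scaled])
print("unscaled:", [round(w, 3) for w in unscaled])
```

With these toy values, the unscaled weights put more mass on the best-matching key than the scaled weights do.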

Discussion

  • Which are the common entity types identified by NER?
    Named entity types used in the spaCy Python package. Source: spaCy API 2020.

    At its simplest, NER recognizes three entity classes: location, person, and organization. In practice, at least for newswire text, there's value in extracting geo-political entities, date, time, currency, and percent. For more fine-grained NER, systems may identify ordinal/cardinal numbers, events, works of art, facilities, products, and so on.

    Temporal expressions can be absolute or relative. For example, 'summer of 1977' and '10:15 AM' are absolute whereas 'yesterday' and 'last quarter' are relative. Expression 'four hours' is an example of duration.

    Specialized domains often require additional or alternative entity types. In biomedical for example, we can broadly define six types: Cell, Chemical, Disease, Gene (DNA or RNA), Protein and Species. For question answering, Sekine defined a hierarchy of more than 200 entities.

  • Could you describe some applications of NER?
    Tweet analysis includes named entities (green nodes). Source: Lyon 2017.

    Given the huge volume of content published online each day, NER helps in categorizing it and thus eases search and content discovery. In fact, for information retrieval, named entities can be used as search indices to speed up search. Likewise, NER helps organize and categorize research publications. For example, from the thousands of papers on machine learning, we might be interested only in face detection that uses CNN.

    Content recommendation is possible with NER. Based on the entities in the current article, the reader can be recommended related articles. Another NER application is online customer support. NER can identify product type, model number, store location, and more. Opinion mining uses NER as a pre-processing task. NER can help link related concepts and entities for the Semantic Web.

    One developer took 30,000 online recipes but found that the descriptions didn't fit any particular pattern. He therefore applied NER with entity types Name, Quantity, and Unit. He manually labelled 1500 samples of about 10,000 tokens.

    Among the NLP tasks that benefit from NER are question generation, relation extraction, and coreference resolution.

  • What are the typical challenges with NER?
    The name 'Washington' can refer to different entity types. Source: Jurafsky and Martin 2009, fig. 22.4.

    Ambiguities make NER a challenging task. For example, 'JFK' can refer to former US President John F. Kennedy or his son. These are different entities of the same type. Coreference resolution is an NLP task that resolves this ambiguity. More common to NER is when the same name refers to different types. For example, 'JFK' can refer to the airport in New York or Oliver Stone's 1991 movie.

    Since entities can span multiple words/tokens, NER needs to identify start and end of multi-token entities. A name within a name (such as Cecil H. Green Library) is also a challenge. When a word is a qualifier, it may be wrongly tagged. For example, in Clinton government, an NER system may tag Clinton as PER without recognizing the noun phrase.

    The same entity can appear in different forms. Differences could be typographical (nucleotide binding vs NUCLEOTIDE-BINDING), morphological (localize vs localization), syntactic (DNA translocation vs translocation of DNA), reduction (secretion process vs secretion/ATP-binding activity), or abbreviated (type IV secretory system vs T4SS).

  • What's the typical processing pipeline in NER?
    Steps in statistical sequential approach to NER. Source: Jurafsky and Martin 2009, fig. 22.10.

    NER is typically a supervised task. It needs annotated training data. Features are defined. From the annotated data, the ML model learns how to map these features to the entities. NER can be seen as a sequence labelling problem since it identifies a span of tokens and classifies it as a named entity. Thus, NER can be solved in a manner similar to POS tagging or syntactic phrase chunking. Statistical models such as HMM, MEMM or CRF can be used.
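
    Decoding in such sequence models can be sketched as below. This is a minimal Viterbi decoder over hand-picked scores, not a trained HMM or CRF: the emission and transition scores, the label set, and the function name `viterbi` are assumptions for illustration.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: list (one entry per token) of dicts mapping label -> score
    transitions: dict mapping (prev_label, label) -> score; missing pairs score 0
    """
    # best[i][label] = (score of best path ending in label at token i, previous label)
    best = [{lab: (emissions[0][lab], None) for lab in labels}]
    for i in range(1, len(emissions)):
        row = {}
        for lab in labels:
            prev = max(
                labels,
                key=lambda p: best[i - 1][p][0] + transitions.get((p, lab), 0.0),
            )
            score = best[i - 1][prev][0] + transitions.get((prev, lab), 0.0)
            row[lab] = (score + emissions[i][lab], prev)
        best.append(row)
    # Backtrack from the best final label.
    last = max(labels, key=lambda lab: best[-1][lab][0])
    path = [last]
    for i in range(len(emissions) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


labels = ["O", "B-PER", "I-PER"]
tokens = ["Lisa", "Rau", "wrote"]
# Made-up per-token scores; a real model would learn these from features.
emissions = [
    {"O": 0.1, "B-PER": 2.0, "I-PER": 0.5},
    {"O": 1.0, "B-PER": 0.8, "I-PER": 0.9},
    {"O": 2.0, "B-PER": 0.1, "I-PER": 0.1},
]
# Made-up label-to-label scores; O -> I-PER is heavily penalized.
transitions = {("B-PER", "I-PER"): 1.5, ("I-PER", "I-PER"): 0.5, ("O", "I-PER"): -10.0}

path = viterbi(emissions, transitions, labels)
greedy = [max(labels, key=e.get) for e in emissions]
print(list(zip(tokens, path)))
print(list(zip(tokens, greedy)))
```

    With these scores, picking the best label per token independently tags 'Rau' as O, whereas decoding over the whole sentence keeps 'Lisa Rau' together as one PER entity.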

    Annotating data manually is slow. A semi-automated approach is to give a file containing a set of patterns or rules. Another approach is to start with an existing model and manually correct mistakes.

    A practical approach is for a model to label unambiguous entities in the first pass. Subsequent passes make use of already labelled entities to resolve ambiguities about other entities.
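
    A toy version of this multi-pass idea is sketched below. The name list, the example sentence and the function `label_document` are hypothetical; real systems weigh much more evidence before reusing a label.

```python
def label_document(tokens, unambiguous):
    """Two-pass labelling: exact full-name matches first, then label reuse.

    unambiguous: dict mapping a tuple of tokens (a full name) -> entity type
    """
    labels = [None] * len(tokens)
    # Pass 1: label exact matches of names from a high-precision list.
    for name, etype in unambiguous.items():
        n = len(name)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == name:
                labels[i:i + n] = [etype] * n
    # Pass 2: an unlabelled token that occurred inside a labelled entity
    # elsewhere in the document gets the same label.
    seen = {tok: lab for tok, lab in zip(tokens, labels) if lab is not None}
    for i, tok in enumerate(tokens):
        if labels[i] is None and tok in seen:
            labels[i] = seen[tok]
    return labels


tokens = "Hillary Clinton spoke in Ohio . Later , Clinton left .".split()
labels = label_document(tokens, {("Hillary", "Clinton"): "PER"})
print(list(zip(tokens, labels)))
```

    Here the second mention 'Clinton' is labelled PER only because the full name was labelled in the first pass. 'Ohio' stays unlabelled since nothing in the list covers it, which illustrates the high-precision, low-recall starting point.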

    Essentially, iterative approaches that start with patterns, dictionaries or unambiguous entities start with high precision but low recall. Recall is then improved with already tagged entities and feedback. Manually annotated data is typically reserved for model evaluation.

  • What sort of features are useful for training an NER model?
    Some features useful for NER. Source: Jurafsky and Martin 2009, fig. 22.6.

    The token and its stem are basic features for NER. In addition, POS tag and phrase chunk label give useful information. Context from surrounding tokens, and their POS tags and chunk labels, are also useful inputs. For example, 'Rev.', 'MD', or 'Inc.' are good indicators of entities that precede or follow them.

    The token's shape or orthography is particularly useful. This includes presence of numbers, punctuation, hyphenation, mixed case, all caps, etc. For example, A9, Yahoo!, IRA, and eBay are entities that can be identified by their shape. While this is useful for newswire text, it's less useful for blogs or text transcribed automatically from speech. Also, there's no case information in Chinese.
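
    One common way to encode shape is to map each character to a class symbol. The sketch below is one possible encoding; the symbols X, x and d and the function names are illustrative choices, not a standard.

```python
import re


def word_shape(token):
    """Map characters to X (upper), x (lower), d (digit); keep other characters."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def short_shape(token):
    """Collapse runs of the same shape symbol, e.g. 'Xxxxx!' becomes 'Xx!'."""
    return re.sub(r"(.)\1+", r"\1", word_shape(token))


for tok in ["A9", "Yahoo!", "IRA", "eBay"]:
    print(tok, word_shape(tok), short_shape(tok))
```

    For the examples above, this yields the shapes Xd, Xxxxx!, XXX and xXxx. The collapsed form is shorter and generalizes across tokens of different lengths.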

    Gazetteers maintain a list of names of places and people. Presence of a token in such a list can be a feature. Maintaining such a list is often difficult. Such a list is less useful for identifying persons and organizations. Likewise, domain-specific knowledge bases can be used and string matches with these bases can identify entities. Examples include DBpedia, DBLP and ScienceWise.
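
    A gazetteer feature can be as simple as a per-token flag. The sketch below uses a tiny made-up gazetteer and a hypothetical function `gazetteer_features` that prefers the longest match, assuming lowercased entries stored as token tuples.

```python
def gazetteer_features(tokens, gazetteer, max_len=3):
    """Return one boolean per token: True if covered by a gazetteer entry.

    gazetteer: set of lowercased token tuples; longer matches are tried first.
    """
    in_gaz = [False] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in gazetteer:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                in_gaz[j] = True
            i += matched
        else:
            i += 1
    return in_gaz


# Tiny illustrative gazetteer of place names.
gazetteer = {("new", "york"), ("washington",), ("san", "francisco")}
tokens = ["She", "flew", "from", "New", "York", "to", "Washington"]
flags = gazetteer_features(tokens, gazetteer)
print(list(zip(tokens, flags)))
```

    The flag is only a hint for the model: 'Washington' matches the list whether it refers to a place or a person, so context features are still needed.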

  • How have neural networks influenced NER?
    Taxonomy of Deep Learning based NER. Source: Li et al. 2018, fig. 3.

    Linear statistical models used hand-crafted features and gazetteers. These are hard to adapt to new tasks or domains. Non-linear neural network models make use of word embeddings. Early NN models used these embeddings along with hand-crafted features.

    A neural network for NER typically has three components: word embedding layer, context encoder layer, and decoder layer. Since NER is basically a sequence labelling task, RNN is suitable. BiLSTM in particular is good at capturing context. Since 2019, transformer architecture has been successfully applied to NER. Multi-task learning (MTL) is an approach in which the model is trained on different tasks and representations are shared across tasks.

    For NER, character-level representations obtained via CNN or transformer are useful since they encode morphological features, alleviate out-of-vocabulary issues and overcome data sparsity problem.

    Neural networks are used by NeuroNER (LSTM), spaCy (CNN), and AllenNLP (Bidirectional LM).

  • How are words or phrases tagged with their identities in NER?
    Illustrating some tagging schemes for NER. Source: Baldwin 2009.

    Entities have to be tagged in a manner that's suitable for algorithms to process. There are many tagging schemes available, some of which evolved from those used in chunking:

    • IO: The simplest scheme, where I_X identifies the named entity type X and O indicates no entity.
    • IOB: Where an entity has multiple tokens, this differently tags the begin token (B) from an inner token (I). A variation of this is IOB2 or BIO.
    • BMEWO: This differently tags begin token (B), middle token (M), end token (E), and a single-token entity (W). An extension of this with more tags is called BMEWO+. Another scheme that's similar is called BILOU, where the letters stand for begin, inner, last, out and unit.

    These encoding schemes have different complexities. For N entity types, IO, BIO and BMEWO need N+1, 2N+1 and 4N+1 tags respectively.
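
    The schemes can be compared by encoding the same entity spans in each. In this sketch, the `PREFIX-TYPE` tag format, the span representation and the function names are assumptions for illustration. BILOU is used in place of BMEWO since the two differ only in letter names.

```python
def encode(num_tokens, spans, scheme="BIO"):
    """Encode entity spans as per-token tags.

    spans: list of (start, end_exclusive, entity_type)
    scheme: one of "IO", "BIO", "BILOU"
    """
    tags = ["O"] * num_tokens
    for start, end, etype in spans:
        for i in range(start, end):
            if scheme == "IO":
                prefix = "I"
            elif scheme == "BIO":
                prefix = "B" if i == start else "I"
            elif scheme == "BILOU":
                if end - start == 1:
                    prefix = "U"  # single-token entity
                elif i == start:
                    prefix = "B"
                elif i == end - 1:
                    prefix = "L"
                else:
                    prefix = "I"
            else:
                raise ValueError(f"unknown scheme: {scheme}")
            tags[i] = f"{prefix}-{etype}"
    return tags


def tagset_size(num_types, scheme):
    """Number of distinct tags: prefixes per type times types, plus O."""
    per_type = {"IO": 1, "BIO": 2, "BILOU": 4}[scheme]
    return per_type * num_types + 1


tokens = ["John", "F.", "Kennedy", "visited", "Washington"]
spans = [(0, 3, "PER"), (4, 5, "LOC")]
for scheme in ("IO", "BIO", "BILOU"):
    print(scheme, encode(len(tokens), spans, scheme), tagset_size(2, scheme))
```

    Note that IO can't separate two adjacent entities of the same type, which is one reason the richer schemes exist.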

    While BIO has been a popular choice, more recent work has shown that BILOU outperforms BIO. spaCy supports IOB and BILUO schemes.

  • How do we evaluate algorithms for NER?
    F1 scores on English NER datasets. Source: Yan et al. 2019, table 3.

    Recall and precision are the basic metrics for evaluating NER models. Recall gives the ratio of correctly labelled entities to the total that should have been labelled. Precision is the ratio of correctly labelled entities to the total labelled. F-measure is a single metric that combines both recall and precision.

    One caveat in using these metrics is the scope. In a real application, they have to be applied to actual named entities. Typical ML models optimize performance at the tag level, based on a particular encoding scheme. Performance at these two levels can be quite different.
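
    The gap between the two levels shows up even on a single sentence. The sketch below assumes BIO tags and exact-match scoring of entities; the function names and the example tags are made up for illustration.

```python
def extract_entities(tags):
    """Collect (start, end_exclusive, type) spans from a list of BIO tags."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        inside = tag.startswith("I-") and start is not None and tag[2:] == etype
        if not inside:
            if start is not None:
                entities.add((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return entities


def precision_recall_f1(gold, pred):
    """Exact-match precision, recall and F1 over two sets of entity spans."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold_tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]

tag_accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)
entity_scores = precision_recall_f1(
    extract_entities(gold_tags), extract_entities(pred_tags)
)
print("tag accuracy:", tag_accuracy)
print("entity P/R/F1:", entity_scores)
```

    One wrong tag out of five gives a tag accuracy of 0.8, yet the truncated PER span counts as a miss, so entity-level precision, recall and F1 are all 0.5.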

    Tagging details also matter. For example, in one study, training data did not include titles (Prime Minister, Pope, etc.) but Amazon Comprehend included these, thus leading to higher recall but lower precision. Different annotated datasets sometimes disagree about entities. For example, in "Baltimore defeated the Yankees", MUC-7 tags Baltimore as LOC whereas CoNLL2003 tags it as ORG.

    CoNLL2003 and OntoNotes 5.0 are commonly used for training and evaluating models. AllenNLP used One Billion Word Benchmark.

  • What resources are available for research into NER?
    ELI5 Python package helps us visualize CRF model weights. Source: Li 2018.

    As part of the GATE framework (University of Sheffield, UK) for text processing, ANNIE is an NER pipeline. Researchers can try out the online demo plus a free API.

    displaCy from Explosion is a useful visualization tool. Prodigy is an annotation tool for creating training data. NER is one of the supported tasks. A 2018 survey lists English NER datasets and tools.

    Cloud providers also offer APIs for NER. Google's Natural Language API is an example. This API can also figure out the sentiments about the entities. An equivalent offering from Amazon is Amazon Comprehend.

    Stanford NER is a Java implementation. It's also called CRFClassifier because it's based on linear chain CRF sequence model. Via Stanford CoreNLP, this can be invoked from other languages.

    More generally, packages that support HMM, MEMM, or CRF can be used to train an NER model. Such support is available in Mallet, NLTK, and Stanford NER. A Scikit-Learn compatible package that's useful is sklearn-crfsuite. Polyglot is capable of doing NER for 40 different languages.

References

  1. Ananiadou, Sophia, Dan Sullivan, William Black, Gina-Anne Levow, Joseph J. Gillespie, Chunhong Mao, Sampo Pyysalo, BalaKrishna Kolluru, Junichi Tsujii, and Bruno Sobral. 2011. "Named Entity Recognition for Bacterial Type IV Secretion Systems." PLoS ONE 6(3): e14780. Accessed 2020-01-26.
  2. Baldwin, Breck. 2009. "Coding Chunkers as Taggers: IO, BIO, BMEWO, and BMEWO+." LingPipe Blog, October 14. Accessed 2020-01-26.
  3. Bird, Steven, Ewan Klein, and Edward Loper. 2020. "Natural Language Processing with Python–Analyzing Text with the Natural Language Toolkit." Accessed 2020-02-01.
  4. Brocklehurst, George. 2017. "Named Entity Recognition." Blog, Thoughtbot, September 28. Accessed 2020-01-26.
  5. Chiu, Jason P.C., and Eric Nichols. 2016. "Named Entity Recognition with Bidirectional LSTM-CNNs." Transactions of the Association for Computational Linguistics, vol. 4, pp. 357-370. Accessed 2020-01-26.
  6. Collobert, Ronan, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. "Natural Language Processing (Almost) from Scratch." Journal of Machine Learning Research, vol. 12, pp. 2493-2537, August. Accessed 2020-01-26.
  7. Francis, Sumam, Jordy Van Landeghem, and Marie-Francine Moens. 2019. "Transfer Learning for Named Entity Recognition in Financial and Biomedical Documents." Information, vol. 10, no. 8, 248, July 26. Accessed 2020-01-26.
  8. GATE Cloud. 2020. "English Named Entity Recognizer." GATE Cloud, University of Sheffield. Accessed 2020-02-01.
  9. Giannetti, Frederic. 2018. "Named Entity Recognition: Challenges and Solutions." Blog, Doculayer, April 10. Accessed 2020-02-01.
  10. Google Cloud. 2020. "Analyzing Entities." How-to Guides, Natural Language API, Google Cloud. Accessed 2020-01-26.
  11. Grishman, Ralph, and Beth Sundheim. 1996. "Message Understanding Conference-6: a brief history." COLING '96: Proceedings of the 16th conference on Computational linguistics, vol. 1, pp. 466-471, August. Accessed 2020-01-26.
  12. Gupta, Shashank. 2018. "Named Entity Recognition: Applications and Use Cases." Towards Data Science, on Medium, February 6. Accessed 2020-01-26.
  13. Güngör, Onur, Tunga Güngör, and Suzan Uskudarli. 2018. "The effect of morphology in named entity recognition with sequence tagging." Natural Language Engineering, Cambridge University Press, vol. 25, no. 1, pp. 147–169. Accessed 2020-01-26.
  14. Hammerton, James. 2003. "Named Entity Recognition with Long Short-Term Memory." Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, pp. 172-175. Accessed 2020-01-26.
  15. Huang, Zhiheng, Wei Xu, and Kai Yu. 2015. "Bidirectional LSTM-CRF Models for Sequence Tagging." arXiv, v1, August 9. Accessed 2020-02-01.
  16. Jurafsky, Daniel and James H. Martin. 2009. "Information Extraction." Chapter 22 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2020-01-26.
  17. Li, Susan. 2018. "Named Entity Recognition and Classification with Scikit-Learn." Towards Data Science, on Medium, August 27. Accessed 2020-01-26.
  18. Li, Jing, Aixin Sun, Jianglei Han, and Chenliang Li. 2018. "A Survey on Deep Learning for Named Entity Recognition." arXiv, v1, December 22. Accessed 2020-01-26.
  19. Liu, Xiaohua, Shaodian Zhang, Furu Wei, and Ming Zhou. 2011. "Recognizing Named Entities in Tweets." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL, pp. 359-367, June. Accessed 2020-02-01.
  20. Lyon, William. 2017. "Applying NLP and Entity Extraction To The Russian Twitter Troll Tweets In Neo4j (and more Python!)." November 15. Accessed 2020-01-26.
  21. Ma, Xuezhe, and Eduard Hovy. 2016. "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, pp. 1064-1074, August. Accessed 2020-01-26.
  22. Marrero, M., J. Urbano, S. Sánchez-Cuadrado, J. Morato, and J. M. Gómez-Berbís. 2013. "Named Entity Recognition: Fallacies, Challenges and Opportunities." Journal of Computer Standards and Interfaces, vol. 35, no. 5, pp. 482-489. Accessed 2020-02-01.
  23. Ong, Donovan. 2018. "Tagging Scheme for NER." December 31. Accessed 2020-01-26.
  24. Polyglot. 2016. "polyglot 16.7.4." PyPi, July 4. Accessed 2020-02-01.
  25. Prodigy Docs. 2020. "Named Entity Recognition." Prodigy Docs. Accessed 2020-02-01.
  26. Prokofyev, Roman, Gianluca Demartini, and Philippe Cudre-Mauroux. 2014. "Effective Named Entity Recognition for Idiosyncratic Web Collections." University of Fribourg, April 10. Accessed 2020-01-26.
  27. Ramachandran, Akshitha. 2018. "Evaluating Solutions for Named Entity Recognition." Blog, Novetta, August 27. Accessed 2020-01-26.
  28. Ratinov, Lev, and Dan Roth. 2009. "Design Challenges and Misconceptions in Named Entity Recognition." Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL), ACL, pp. 147–155, June. Accessed 2020-02-01.
  29. Rau, L.F. 1991. "Extracting company names from text." Proc. of the Seventh IEEE Conference on Artificial Intelligence Application, pp. 29-32, February 24-28. Accessed 2020-01-26.
  30. Rau, Lisa F. 1994. "Method for extracting company names from text." US Patent US5287278A, February 15. Filed 1992-01-27. Accessed 2020-01-26.
  31. Santos, Cícero dos, and Victor Guimarães. 2015. "Boosting Named Entity Recognition with Neural Character Embeddings." Proceedings of the Fifth Named Entity Workshop, ACL, pp. 25-33, July. Accessed 2020-01-26.
  32. Stanford NLP Group. 2018. "Stanford Named Entity Recognizer (NER)." V3.9.2, Stanford NLP Group, October 16. Accessed 2020-01-26.
  33. Sundar V. 2019. "Entity Linking: A primary NLP task for Information Extraction." Analytics Vidhya, on Medium, September 14. Accessed 2020-01-26.
  34. Trelle, Tobias. 2018. "Google Cloud Natural Language API." Blog, Codecentric, May 7. Accessed 2020-01-26.
  35. Vyas, Meena. 2018. "spaCy – Named Entity and Dependency Parsing Visualizers." June 10. Accessed 2020-01-26.
  36. Wang, Xi, Jiagao Lyu, Li Dong, and Ke Xu. 2019. "Multitask learning for biomedical named entity recognition with cross-sharing structure." BMC Bioinformatics, vol. 20, article no. 427, August 16. Accessed 2020-02-01.
  37. Wikipedia. 2020. "JFK (film)." Wikipedia, February 1. Accessed 2020-02-01.
  38. Yan, Hang, Bocao Deng, Xiaonan Li, and Xipeng Qiu. 2019. "TENER: Adapting Transformer Encoder for Named Entity Recognition." arXiv, v3, December 10. Accessed 2020-01-26.
  39. spaCy API. 2020. "Annotation Specifications." spaCy API. Accessed 2020-01-26.

Milestones

Feb
1991
Features for extracting company names. Source: Rau 1994.

Lisa Rau implements an algorithm to extract company names from financial news. It's a combination of heuristics, exception lists and extensive corpus analysis. In subsequent retrieval tasks, the algorithm also looks at most likely variations of names. In 1992, she files for a US patent, which is granted in 1994.

1996
Sample annotation of named entities using SGML. Source: Grishman and Sundheim 1996, fig. 1.

The term Named Entity is first used at the 6th Message Understanding Conference (MUC). The first MUC was held in 1987 with the aim of automating analysis of text messages in the military. While early MUC events focused on mainly template filling, MUC-6 looks at sub-tasks that would aid information extraction. It's in this context that named entities become relevant. SGML is used to markup entities. During 1996-2008, NER is talked about in MUC, CoNLL and ACE conferences.

2003

Hammerton applies Long Short-Term Memory (LSTM) neural network to NER. He finds significantly better performance for German but a disappointing baseline performance for English. Words and their sequences are represented using SARDNET. The algorithm operates in two passes. In the first pass, information is gathered. In the second pass, the algorithm disambiguates and outputs the named entities.

2008

NER is dropped from international evaluation forums. In fact, by 2005, NER was considered a solved problem, with models achieving recall and precision exceeding 90%. However, the best score on ACE 2008 is only about 50%, which is significantly lower that the scores for MUC and CoNLL2003 tasks. This suggests that NER is not a solved problem.

Jun
2009

Ratinov and Roth address some design challenges for NER. They note that NER is knowledge intensive in nature. Therefore, they use 30 gazetteers, 16 of which are extracted from Wikipedia. In general, these are high-precision, low-recall lists. They are shown to be effective for webpages, where there's less contextual information. Expressive features and gazetteers enable unsupervised learning. They also note the BILOU encoding significantly outperforms BIO.

Jun
2011

Tweets have the problem of insufficient information. NER models trained on news articles also do poorly on tweets. This can be solved by domain adaptation or semi-supervised learning from lots of unlabelled data. Liu et al. approach this problem with a combination of K-Nearest Neighbour (KNN) classifier and Conditional Random Field (CRF) labeller. KNN captures global coarse evidence while CRF captures fine-grained information from a single tweet. The classifier is retrained based on recently labelled tweets. Gazetteers are also used.

Aug
2011

To avoid task-specific engineering and hand-crafted features, Collobert et al. propose a unified neural network approach that can perform part-of-speech tagging, chunking, named entity recognition, and semantic role labelling. The model learns representations from lots of unlabelled data. The best results for NER are obtained with word embeddings from a trained language model and optimizing for sentence-level log-likelihood. This work motivates other researchers towards neural networks in preference to hand-crafted features.

Jul
2015

Santos and Guimarães extend the work of Collobert et al. by considering character-level representations using a convolutional layer. They note that "word-level embeddings capture syntactic and semantic information, character-level embeddings capture morphological and shape information". Using both gives the best results.
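
The idea of a convolutional layer over characters can be sketched in plain Python. This is only an illustration under stated assumptions: the dimensions, the alphabet, the random (untrained) weights and the function name `char_cnn` are all hypothetical and are not taken from the paper. Each filter slides over windows of character embeddings, and max-pooling keeps its strongest response, giving a fixed-size vector for a word of any length.

```python
import random

rng = random.Random(0)
EMB_DIM, NUM_FILTERS, WIDTH = 4, 5, 3  # illustrative sizes, not from the paper
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Randomly initialised character embeddings and filters (untrained).
char_emb = {c: [rng.uniform(-1, 1) for _ in range(EMB_DIM)] for c in ALPHABET}
filters = [[rng.uniform(-1, 1) for _ in range(WIDTH * EMB_DIM)]
           for _ in range(NUM_FILTERS)]


def char_cnn(word):
    """Return one fixed-size vector for a word made of letters a-z."""
    zero = [0.0] * EMB_DIM
    # Pad both ends so that even a one-letter word yields some windows.
    embedded = ([zero] * (WIDTH - 1)
                + [char_emb[c] for c in word.lower()]
                + [zero] * (WIDTH - 1))
    # Flatten each run of WIDTH consecutive character vectors into one window.
    windows = [
        [x for vec in embedded[i:i + WIDTH] for x in vec]
        for i in range(len(embedded) - WIDTH + 1)
    ]
    # Apply every filter to every window, then max-pool over positions.
    return [max(sum(w * x for w, x in zip(f, win)) for win in windows)
            for f in filters]


print(len(char_cnn("running")), len(char_cnn("ran")))  # prints: 5 5
```

In a full model this vector would be concatenated with the word embedding before tagging, and the embeddings and filters would be learned during training.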

Aug
2015

Huang et al. study the use of LSTM and CRF for sequence labelling. They find that BiLSTM with a CRF layer gives state-of-the-art results for POS tagging, chunking and NER. With BiLSTM, past (forward states) and future (backward states) features are used. CRF works at the sentence level to predict the current entity label based on past and future labels. This model shows good performance even when Collobert et al.'s word embedding is not used. For faster training, spelling and context features bypass the BiLSTM and go directly to the CRF layer.
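
The role of the CRF layer at prediction time can be illustrated with Viterbi decoding, which picks the label sequence with the best total score rather than the best label per token. The sketch below is an assumption-laden illustration: the function name `viterbi`, the three labels and all the scores are made up, with emission scores standing in for what a BiLSTM would output. It covers decoding only, not CRF training.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][y]        : score of label y at position t (e.g. from a BiLSTM)
    transitions[y_prev][y] : score of moving from label y_prev to label y
    """
    n = len(labels)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for y in range(n):
            # Best previous label for arriving at label y.
            best_prev = max(range(n), key=lambda p: score[p] + transitions[p][y])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(n), key=lambda y: score[y])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [labels[y] for y in path]


labels = ["O", "B-PER", "I-PER"]
# Hypothetical per-token scores for the tokens "Mr", "Smith", "spoke".
emissions = [[2.0, 0.5, 0.0],
             [0.2, 1.0, 1.2],
             [2.0, 0.1, 0.3]]
# Hypothetical transition scores; O -> I-PER is heavily penalised.
transitions = [[0.0, 0.0, -10.0],
               [0.0, -1.0, 1.0],
               [0.0, -1.0, 0.5]]

greedy = [labels[max(range(3), key=row.__getitem__)] for row in emissions]
print(greedy)                                    # ['O', 'I-PER', 'O']
print(viterbi(emissions, transitions, labels))   # ['O', 'B-PER', 'O']
```

Picking the best label per token yields an I-PER that follows O, which is not a valid BIO sequence. Scoring the whole sentence with transition scores avoids this.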

2016
Neural network for NER using CNN, LSTM, CRF and char+word embeddings. Source: Ma and Hovy 2016, fig. 1 and 3.

Ma and Hovy achieve a state-of-the-art F1 score of 91.21 for NER on the CoNLL 2003 dataset. Their approach requires no feature engineering or specific data pre-processing. They use a CNN to obtain character-level representations that capture morphological features such as prefixes and suffixes. GloVe word embeddings are used. Character embeddings are randomly initialized and then used to obtain character-level representations. Chiu and Nichols present similar work that also uses word-level features.

Jul
2018
Concatenation of three different embeddings. Source: Güngör et al. 2018, fig.3.

Güngör et al. study NER for morphologically rich languages such as Turkish, Finnish, Czech and Spanish. Word morphology is important for these languages. This is in contrast to English, which gets useful information from syntax and word n-grams. They therefore propose a model that uses morphological embeddings. These are combined with character-based and word embeddings. Both character-based and morphological embeddings are derived using separate BiLSTMs.

Jul
2019
Different architectures based on character/word/BERT representations. Source: Francis et al. 2019, fig. 1.

Francis et al. make use of the BERT language model for transfer learning in NER. For fine-tuning, either a softmax or a CRF layer is used. They find that BERT representations perform best in combination with character-level representations and word embeddings. Without BERT, they also show competitive performance when character-level representations and word embeddings are gated via an attention layer.

Dec
2019
Transformer architecture of TENER for character-level encoding and word-level context. Source: Yan et al. 2019, fig. 2.

Yan et al. note that the traditional transformer architecture is not quite as good for NER as it is for other NLP tasks. They customize the transformer architecture and achieve state-of-the-art results, beating prevailing BiLSTM models. They call it Transformer Encoder for NER (TENER). Although the traditional transformer uses position embeddings, directionality is lost. TENER uses relative position encoding to capture both distance and direction. The smoothed, scaled attention of the traditional transformer is seen to attend to noisy information. TENER therefore uses sharp, unscaled attention.
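
Both observations can be checked numerically with a small sketch. It is not TENER itself: the embedding size, positions, attention scores and helper names are illustrative assumptions. The first part shows that the dot product of two standard sinusoidal position embeddings is the same whether the other token is k steps ahead or k steps behind, so direction is not captured. The second part shows that dividing attention scores by the square root of the key dimension flattens the softmax, whereas unscaled scores give a sharper distribution.

```python
import math


def sinusoidal(pos, dim=16):
    """Standard sinusoidal position embedding built from sin/cos pairs."""
    vec = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        vec.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Part 1: the dot product depends on the distance k but not on its sign.
t, k = 10, 3
ahead = dot(sinusoidal(t), sinusoidal(t + k))
behind = dot(sinusoidal(t), sinusoidal(t - k))
print(abs(ahead - behind) < 1e-9)  # True

# Part 2: scaled attention is smoother, unscaled attention is sharper.
d_k = 64                           # hypothetical key dimension
scores = [8.0, 4.0, 2.0, 0.0]      # hypothetical query-key dot products
scaled = softmax([s / math.sqrt(d_k) for s in scores])
unscaled = softmax(scores)
print(max(unscaled) > max(scaled))  # True
```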

Further Reading

  1. Jurafsky, Daniel and James H. Martin. 2009. "Information Extraction." Chapter 22 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2020-01-26.
  2. Marrero, M., J. Urbano, S. Sánchez-Cuadrado, J. Morato, and J. M. Gómez-Berbís. 2013. "Named Entity Recognition: Fallacies, Challenges and Opportunities." Journal of Computer Standards and Interfaces, vol. 35, no. 5, pp. 482-489. Accessed 2020-02-01.
  3. Nadeau, David and Satoshi Sekine. 2007. "A survey of named entity recognition and classification." Linguisticae Investigationes, vol. 30, pp. 3–26. Accessed 2020-01-26.
  4. Yadav, Vikas, and Steven Bethard. 2018. "A Survey on Recent Advances in Named Entity Recognition from Deep Learning models." Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, pp. 2145-2158, August. Accessed 2020-01-26.
  5. Li, Jing, Aixin Sun, Jianglei Han, and Chenliang Li. 2018. "A Survey on Deep Learning for Named Entity Recognition." arXiv, v1, December 22. Accessed 2020-01-26.
  6. Li, Susan. 2018. "Named Entity Recognition and Classification with Scikit-Learn." Towards Data Science, on Medium, August 27. Accessed 2020-01-26.

Cite As

Devopedia. 2020. "Named Entity Recognition." Version 5, February 4. Accessed 2020-07-07. https://devopedia.org/named-entity-recognition