Entity Linking

Finding knowledge is one of the most common tasks of Internet users. In many cases, query results are a mix of pages about different entities that share the same name.

Entity linking is the process of connecting entity mentions in text to their counterparts in a knowledgebase. Information extraction, information retrieval, and knowledgebase population are some applications of entity linking. The task, however, is difficult due to entity ambiguity and name variations. Because a large number of web applications produce and consume knowledgebase data, entity linking has become a major research area.

In many retrieval systems, a user would simply enter an entity or concept name, and search results would be clustered by the various entities/concepts that share that name. Adding entity details to the indexed records is one way to implement such a framework.

Discussion

  • What are the main steps in the entity linking process?
    Linking Paris and France to their Wikipedia pages. Source: Aparravi 2019.

    Entity linking, in most cases, involves three sub-tasks executed in this specific order:

    • Information Extraction: Extract information from unstructured data.
    • Named Entity Recognition (NER): Individuals, places, organizations, and other real-world objects are examples of named entities. NER recognizes and classifies named entity occurrences in text into pre-defined categories. NER is typically modeled as the task of assigning a tag to each word in a sentence.
    • Named Entity Linking (NEL): Each entity identified by NER will be assigned a unique identity by NEL. NEL then attempts to link each entity to its description in a knowledgebase. The knowledgebase to be used depends on the program, but we may use Wikipedia-derived knowledgebases for open-domain text, such as Wikidata, DBpedia, or YAGO. Wikification refers to the process of connecting entities to Wikipedia.

    Entity linking can be end-to-end, involving both recognition and disambiguation. If gold standard named entities are available at the input, entity linking does only disambiguation.
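
    A minimal sketch of this recognize-then-link pipeline is shown below, assuming spaCy with its small English model is installed; the toy kb dictionary and its candidate titles are illustrative stand-ins, not a real knowledgebase API.

    ```python
    import spacy

    nlp = spacy.load("en_core_web_sm")  # install: python -m spacy download en_core_web_sm

    # Toy knowledgebase: surface form -> candidate Wikipedia titles (assumed data).
    kb = {
        "Paris": ["Paris", "Paris (mythology)", "Paris, Texas"],
        "France": ["France"],
    }

    doc = nlp("Paris is the capital of France.")
    for ent in doc.ents:                   # NER: recognized mentions
        candidates = kb.get(ent.text, [])  # NEL: candidate retrieval
        print(ent.text, ent.label_, "->", candidates)
    ```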

  • What are the main issues with entity linking?
    Entity 'bass' can mean two completely different things. Source: Adapted from dave_anon 2016.

    There are two main issues with entity linking:

    • A phrase or word can match multiple entities in the knowledgebase. For example, the mention Japan could refer to Japan (national football team), Japan (country), Japan (band), etc. Likewise, as the figure shows, bass can refer to a fish or to low-frequency sound in music.
    • A single entity in the knowledgebase can be referred to by multiple names, such as aliases, abbreviations, and alternate spellings, which must all be resolved to the same entry.

    The challenge for entity linking is to use the surrounding context to resolve these ambiguities and link each mention to the most suitable entry in the knowledgebase.
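
    As a toy illustration of context-based disambiguation, the sketch below picks the candidate whose description shares the most words with the mention's context, in the spirit of Lesk-style overlap; the candidate descriptions are invented stand-ins for knowledgebase entries.

    ```python
    def disambiguate(context, candidates):
        """Return the candidate whose description overlaps most with the context."""
        context_words = set(context.lower().split())
        def overlap(description):
            return len(context_words & set(description.lower().split()))
        return max(candidates, key=lambda name: overlap(candidates[name]))

    candidates = {
        "Bass (fish)": "freshwater and marine fish species caught by anglers",
        "Bass (sound)": "low frequency sound musical instrument guitar music",
    }
    print(disambiguate("he played a bass guitar solo at the concert", candidates))
    # -> Bass (sound)
    ```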

  • Which are the main entity recognition paradigms?

    For Named Entity Recognition and Classification (NERC), we have the following machine learning approaches:

    • Supervised Learning: Relies on distinctive features that separate positive and negative examples. From earlier handcrafted rules, supervised learning has evolved to systems that automatically infer rules or sequence labeling models from a set of training examples. Currently a popular approach, it has many variants: Hidden Markov Models (HMM), Decision Trees, Maximum Entropy Models, Support Vector Machines (SVM), and Conditional Random Fields (CRF). A CRF-based sketch follows this list. To solve entity ambiguity, Milne and Witten used Wikipedia entities as training data. Other approaches collected training data based on unambiguous synonyms.
    • Semi-Supervised Learning: Also called "weakly supervised", the most common approach is bootstrapping. The learning process begins with a small amount of supervision, such as a collection of seeds. From these few samples, the model learns contextual clues that are then applied to the rest of the data in the next iteration. With many iterations, the model sees more and more examples to learn from. Some semi-supervised approaches are known to rival baseline supervised approaches.
    • Unsupervised Learning: Not very common, but one possible approach is to exploit semantic relations present in the data.
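
    Here is a minimal sketch of the supervised paradigm as CRF-based sequence labeling, using the third-party sklearn-crfsuite package. The two training sentences and hand-crafted features are toy assumptions; real systems train on corpora such as CoNLL-2003.

    ```python
    import sklearn_crfsuite

    def word_features(sent, i):
        """Hand-crafted per-token features typical of CRF-based NER."""
        word = sent[i]
        return {
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),
            "word.isupper": word.isupper(),
            "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
            "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
        }

    # Toy training data in BIO tagging: B-X begins an entity of type X, O is outside.
    train = [
        (["Paris", "is", "in", "France", "."], ["B-LOC", "O", "O", "B-LOC", "O"]),
        (["John", "visited", "Berlin", "."], ["B-PER", "O", "B-LOC", "O"]),
    ]
    X = [[word_features(sent, i) for i in range(len(sent))] for sent, _ in train]
    y = [labels for _, labels in train]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X)[0])  # predicted tags for the first sentence
    ```
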
  • How can we use knowledge graphs for the entity linking task?
    Use of a knowledge graph for entity linking. Source: Dexter 2021.

    Modern entity linking systems use broad knowledge graphs built from knowledgebases like Wikipedia instead of textual features generated from input documents or text corpora. These systems extract complex features that take advantage of the knowledge graph topology or exploit multi-step relations between entities, which would otherwise go undetected by simple text analysis. Furthermore, developing multilingual entity linking systems based on natural language processing (NLP) is inherently difficult, as it necessitates either broad text corpora, which are often lacking for many languages, or hand-crafted grammar rules, which vary greatly between languages.

    Han et al. proposed a graph-based collective entity linking method to model global topical interdependence (rather than pairwise interdependence) among the entity linking decisions in a single document. They first proposed the Referent Graph, a graph-based representation that models both textual context similarity and global topical interdependence between entity linking decisions in its graph structure. They then jointly inferred the mapping entities for all entity mentions in the same document using a purely collective inference algorithm over the Referent Graph.
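
    A simplified sketch in the spirit of graph-based collective linking (not Han et al.'s exact algorithm): mentions and candidate entities form a weighted graph, and personalized PageRank via networkx propagates evidence so that mutually coherent entities reinforce each other. All nodes and edge weights are toy assumptions.

    ```python
    import networkx as nx

    G = nx.Graph()
    # Mention-candidate edges, weighted by local textual compatibility.
    G.add_edge("m:Jordan", "e:Michael Jordan", weight=0.6)
    G.add_edge("m:Jordan", "e:Jordan (country)", weight=0.4)
    G.add_edge("m:NBA", "e:National Basketball Association", weight=0.9)
    # Entity-entity edge, weighted by relatedness in the knowledge graph.
    G.add_edge("e:Michael Jordan", "e:National Basketball Association", weight=0.8)

    # Personalized PageRank seeded at the mentions spreads evidence globally,
    # so "Michael Jordan" benefits from the coherent "NBA" decision.
    scores = nx.pagerank(G, personalization={"m:Jordan": 0.5, "m:NBA": 0.5})
    for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if node.startswith("e:"):
            print(node, round(score, 3))
    ```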

  • What are the main steps in implementing an entity linking system?

    An entity linking system will need the following:

    • Recognize: Recognize the entities mentioned in the text. For each entity mention m ∈ M, the entity linking system filters out irrelevant entities in the knowledgebase and retrieves a candidate entity set Em that contains the possible entities mention m may refer to.
    • Rank: Rank each candidate. In most cases, the size of the candidate entity set Em is larger than one. Researchers leverage different kinds of evidence to rank the candidate entities in Em, trying to find the entity e ∈ Em that is the most likely link for mention m (see the sketch after this list).
    • Link: Link the recognized entities to the corresponding entities in the knowledge graph.
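
    A minimal sketch of the rank step, combining an assumed anchor-text prior with TF-IDF context similarity via scikit-learn; the candidate descriptions and prior values are illustrative, not real statistics.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy candidate set Em for the mention "Japan": (description, prior P(e|m)).
    candidates = {
        "Japan (country)": ("island country in East Asia capital Tokyo", 0.85),
        "Japan (band)": ("English new wave band formed in London", 0.05),
    }
    mention_context = "The band Japan released their debut album in London."

    docs = [mention_context] + [desc for desc, _ in candidates.values()]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()

    # Rank by prior x context similarity: here the band entity outranks the
    # higher-prior country entity because the context mentions "band" and "London".
    for (name, (_, prior)), sim in zip(candidates.items(), sims):
        print(name, "score:", round(prior * sim, 4))
    ```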

    New facts are created and digitally expressed on the web as the world evolves. For semantic web and knowledge management strategies, automatically populating and enriching existing knowledgebases with newly derived facts has become a key problem. Entity linking is regarded as a critical subtask in knowledgebase population: it helps a knowledgebase to grow.

  • What datasets are available for entity linking?

    YAGO is a high-coverage, high-quality open-domain knowledgebase that combines Wikipedia and WordNet. It's similar to Wikipedia in size but uses WordNet's clean taxonomy of concepts. YAGO currently contains over 10 million entities (such as individuals, organizations, and places) and 120 million facts about these entities, including the Is-A hierarchy (type and subclass-of relations) as well as non-taxonomic relations. YAGO includes a means relation that relates strings to entities. For example, "Harry" denotes Harry Potter. Hoffart et al. used YAGO relations to create candidate entities.

    DBpedia is a multilingual knowledgebase built by extracting structured data from Wikipedia, such as categorization details, geo-coordinates, and links to external web pages. English DBpedia contains 4 million entities. Furthermore, it adapts to Wikipedia's changes automatically.
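
    Candidate entities can be retrieved from DBpedia's public SPARQL endpoint, for instance with the third-party SPARQLWrapper package as sketched below; endpoint availability and exact results may vary.

    ```python
    from SPARQLWrapper import SPARQLWrapper, JSON

    # Look up DBpedia entities whose English label matches a surface form.
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT DISTINCT ?entity WHERE {
            ?entity rdfs:label "Japan"@en .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["entity"]["value"])  # e.g. http://dbpedia.org/resource/Japan
    ```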

    Freebase is a broad online knowledgebase generated primarily through collaboration by its users. Non-programmers can edit its structured data through a user interface. Freebase compiles information from a variety of sources, including Wikipedia. It currently holds over 43 million entities and 2.4 billion facts about them.

  • What evaluation metrics are suited for entity linking?

    Where only disambiguation is done, we have the following (the sketch after these lists contrasts the two averaging modes):

    • Micro-Precision: Fraction of correctly disambiguated named entities in the full corpus.
    • Macro-Precision: Fraction of correctly disambiguated named entities, averaged by document.

    Where both entity recognition and disambiguation are done, we have:

    • Gerbil Micro-F1 – Strong Matching: InKB micro F1 score for correctly linked and disambiguated mentions in the full corpus as computed using the Gerbil platform. InKB means only mentions with valid KB entities are used for evaluation.
    • Gerbil Macro-F1 – Strong Matching: InKB macro F1 score for correctly linked and disambiguated mentions in the full corpus as computed using the Gerbil platform. InKB means only mentions with valid KB entities are used for evaluation.
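
    Below is a minimal sketch contrasting micro- and macro-averaged precision on toy per-document outcomes, where True means a mention was correctly disambiguated.

    ```python
    # Toy corpus: one inner list of per-mention outcomes per document.
    docs = [
        [True, True, False, True],   # document 1: 3/4 correct
        [True, False],               # document 2: 1/2 correct
    ]

    # Micro: pool all mentions in the corpus, then take the fraction correct.
    micro = sum(sum(d) for d in docs) / sum(len(d) for d in docs)    # 4/6 ~ 0.667

    # Macro: compute per-document precision first, then average over documents.
    macro = sum(sum(d) / len(d) for d in docs) / len(docs)           # (0.75+0.5)/2 = 0.625

    print(f"micro-precision: {micro:.3f}, macro-precision: {macro:.3f}")
    ```
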
  • What's the current state-of-the-art (SOTA) in entity linking?

    Mulang et al. 2020 is the current SOTA for the CoNLL-AIDA dataset.

    Raiman is the current SOTA in cross-lingual entity linking for the WikiDisamb30 and TAC KBP 2010 datasets. They construct a type system and use it to constrain the outputs of a neural network to respect the symbolic structure. They achieve this by reformulating the design problem as a mixed integer problem: create a type system and subsequently train a neural network with it. They propose a two-step algorithm: (1) heuristic search or stochastic optimization over the discrete variables that define a type system, informed by an Oracle and a Learnability heuristic; (2) gradient descent to fit the classifier parameters.

    They apply DeepType to the problem of entity linking on three standard datasets (WikiDisamb30, CoNLL (YAGO), TAC KBP 2010) and find that it outperforms all existing solutions by a wide margin, including approaches that rely on a human-designed type system or recent deep learning-based entity embeddings. Explicitly using symbolic information lets it integrate new entities without retraining.
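
    The key idea can be caricatured in a few lines: a type prediction constrains the candidate set before link scores decide. The types, candidates, and scores below are invented for illustration and do not reflect the paper's actual model.

    ```python
    # Toy candidates for the mention "Jaguar" with assumed types and link scores.
    candidates = {
        "Jaguar (animal)": {"type": "Animal",       "link_score": 0.30},
        "Jaguar Cars":     {"type": "Organization", "link_score": 0.65},
        "Jaguar (band)":   {"type": "Organization", "link_score": 0.05},
    }

    def link(predicted_type, candidates):
        """Keep only candidates of the predicted type, then pick the best link score."""
        allowed = {name: c for name, c in candidates.items()
                   if c["type"] == predicted_type}
        return max(allowed, key=lambda name: allowed[name]["link_score"])

    # If the type classifier reads "The jaguar prowled the rainforest." and
    # predicts Animal, the type constraint overrides the higher raw link scores.
    print(link("Animal", candidates))   # -> Jaguar (animal)
    ```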

Milestones

2006

Bunescu and Paşca propose using Wikipedia for Named Entity Disambiguation (NED). Their model makes use of Wikipedia's redirect pages, disambiguation pages, categories and hyperlinks. They apply some rules to construct a dictionary of named entities from Wikipedia. For context-article similarity they use cosine similarity with vectors formed from TF-IDF of words in the vocabulary. Similar work is due to Cucerzan (2007).

Nov
2007
System architecture for automatic text wikification. Source: Mihalcea and Csomai 2007, fig. 2.

Mihalcea and Csomai propose Wikify!, a system that recognizes key phrases or concepts and links them to suitable Wikipedia pages. The two main tasks in the process are keyword extraction and word sense disambiguation (WSD). They adopt unsupervised keyword extraction that involves candidate extraction and ranking. For WSD, they evaluate a knowledge-based approach (inspired by the Lesk algorithm) and a data-driven approach (using a Naive Bayes classifier).

2008

Given two phrases, their semantic relatedness is usually computed using external knowledge sources. Instead, Milne and Witten propose using both incoming and outgoing links in Wikipedia pages to measure semantic relatedness. This is useful for WSD and hence for entity linking as well. To determine relatedness, the authors compare each potential candidate to the document's surrounding background, which is created by the other candidates. The use of Wikipedia for semantic relatedness was previously studied by Strube and Ponzetto (2006) and Gabrilovich and Markovitch (2007).
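
Their measure adapts the Normalized Google Distance to sets of in-linking Wikipedia articles. A minimal sketch with toy in-link sets is given below; 0 means closely related, and larger values (often clipped to [0, 1]) mean less related.

```python
from math import log

def relatedness(links_a, links_b, num_articles):
    """Milne-Witten style relatedness from sets of in-linking articles."""
    common = len(links_a & links_b)
    if common == 0:
        return 1.0  # no shared in-links: treat as unrelated
    a, b = len(links_a), len(links_b)
    return (log(max(a, b)) - log(common)) / (log(num_articles) - log(min(a, b)))

# Toy in-link sets for two articles in a hypothetical 1,000,000-article Wikipedia.
jaguar = {"Cat", "Predator", "Big cat", "Amazon rainforest"}
leopard = {"Cat", "Predator", "Big cat", "Africa"}
print(round(relatedness(jaguar, leopard, 1_000_000), 3))  # small value -> related
```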

2011
Local φ and global ψ measures used for disambiguation. Source: Ratinov et al. 2011, fig. 1.

Ratinov et al. propose the GLOW system, which approximates joint disambiguation using mutual information for semantic relatedness. To emphasize coherence among candidates, they extract from the document additional named entity mentions and noun phrases that were previously used as link anchor texts in Wikipedia. Candidates are then retrieved by querying an anchor-title index that maps each link target in Wikipedia to its various link anchor texts and vice versa, augmenting the given query mentions with this collection.

2012
A constructed semantic network helps in disambiguating terms 'Michael Jordan' and 'NBA'. Source: Shen et al. 2012, fig. 1.

Shen et al. present LINDEN, a framework that uses YAGO to link named entity mentions. They consider coherence among potential candidate entities. They consider the semantic similarity of candidates to the types in the YAGO ontology, assuming that candidate senses are organized into a tree structure of categories. They also consider the global coherence of candidates across the document's mentions, where a candidate's global coherence is the average of its semantic relatedness to the candidates of the other mentions.

2016

Tsai and Roth consider the problem of linking entity mentions in non-English language text to the English Wikipedia. They address this using multilingual embeddings of titles and words. Their system doesn't handle the case of an English Wikipedia entry that doesn't have an equivalent entry in a foreign language. They also release a Wikipedia dataset across 12 languages.

References

  1. Aparravi. 2019. "File:Entity Linking - Short Example.png." Wikimedia Commons, July 2. Accessed 2021-05-10.
  2. Bollacker, Kurt, Robert Cook, and Patrick Tufts. 2007. "Freebase: A Shared Database of Structured General Human Knowledge." AAAI'07: Proceedings of the 22nd national conference on Artificial intelligence, vol. 2, pp. 1962–1963, July. Accessed 2021-04-27.
  3. Bunescu, Razvan, and Marius Paşca. 2006. "Using Encyclopedic Knowledge for Named Entity Disambiguation." 11th Conference of the European Chapter of the Association for Computational Linguistics, April. Accessed 2021-05-10.
  4. Cucerzan, Silviu. 2007. "Large-Scale Named Entity Disambiguation Based on Wikipedia Data." Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), ACL, pp. 708–716, June. Accessed 2021-05-10.
  5. Dai, H., C.-Y. Wu, R. Tzong, R. T.-H. Tsai, and W.-L. Hsu. 2012. "From Entity Recognition to Entity Linking: A Survey of Advanced Entity Linking Techniques." The 26th Annual Conference of the Japanese Society for Artificial Intelligence, June 12-15. Accessed 2021-04-27.
  6. Dexter. 2021. "Dexter, an Open Source Framework for Entity Linking." Dexter. Accessed 2021-05-10.
  7. Hachey, B., W. Radford, J. Nothman, M. Honnibal, and J. R. Curran. 2013. "Evaluating Entity Linking with Wikipedia." Artificial Intelligence, Elsevier, vol. 194, pp. 130–150, January. doi: 10.1016/j.artint.2012.04.005. Accessed 2021-04-27.
  8. Kim, Youngsik, and Key-Sun Choi. 2015. "Entity Linking Korean Text: An Unsupervised Learning Approach using Semantic Relations." Proceedings of the Nineteenth Conference on Computational Natural Language Learning, ACL, pp. 132-141, July. Accessed 2021-04-27.
  9. Mihalcea, Rada and Andras Csomai. 2007. "Wikify! Linking Documents to Encyclopedic Knowledge." CIKM'07, November 6-8. doi: 10.1145/1321440.1321475. Accessed 2021-05-10.
  10. Miller, George A. 1995. "WordNet: A Lexical Database for English." Comm. of the ACM, vol. 38, no. 11, pp. 39-41, November. Accessed 2021-04-27.
  11. Milne, David, and Ian H. Witten. 2008. "An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links." In: Technical Report WS-08-15, Wikipedia and Artificial Intelligence: An Evolving Synergy, AAAI Workshop. Accessed 2021-05-10.
  12. Nadeau, D. and S. Sekine. 2007. "A survey of named entity recognition and classification." Lingvisticæ Investigationes, vol. 30, no. 1, pp. 3–26, August. doi: 10.1075/li.30.1.03nad. Accessed 2021-04-27.
  13. Pilz, A. 2016. "Entity Linking to Wikipedia." PhD Dissertation, Rheinischen Friedrich-Wilhelms-Universität Bonn. Accessed 2021-04-27.
  14. Ratinov, Lev, Dan Roth, Doug Downey, and Mike Anderson. 2011. "Local and Global Algorithms for Disambiguation to Wikipedia." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 1375–1384, June 19-24. Accessed 2021-05-10.
  15. Rebele, T., F. Suchanek, J. Hoffart, J. Biega, E. Kuzey, and G. Weikum. 2016. "YAGO: A Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames." In: Groth P. et al. (eds) The Semantic Web – ISWC 2016. ISWC 2016. Lecture Notes in Computer Science, vol. 9982. Springer, Cham. doi: 10.1007/978-3-319-46547-0_19. Accessed 2021-04-27.
  16. Ruder, Sebastian. 2021. "Entity Linking." NLP-progress, on GitHub, April 16. Accessed 2021-05-11.
  17. Shen, Wei, Jianyong Wang, Ping Luo, and Min Wang. 2012. "LINDEN: Linking Named Entities with Knowledge Base via Semantic Knowledge." WWW '12: Proceedings of the 21st international conference on World Wide Web, pp. 449-458, April. doi: 10.1145/2187836.2187898. Accessed 2021-05-11.
  18. Shen, W., J. Wang, and J. Han. 2015. "Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions." IEEE Trans. Knowl. Data Eng., vol. 27, no. 2, pp. 443–460, February. doi: 10.1109/TKDE.2014.2327028. Accessed 2021-05-10.
  19. Tsai, Chen-Tse, and Dan Roth. 2016. "Cross-lingual Wikification Using Multilingual Embeddings." Proceedings of NAACL-HLT, ACL, pp. 589–598. Accessed 2021-05-10.
  20. dave_anon. 2016. "Same Spelling But Different Meaning?" EnglishForward, June 16. Accessed 2021-05-10.

Further Reading

  1. Al-Moslmi, Tareq, Marc Gallofré Ocaña, Andreas L. Opdahl, and Csaba Veres. 2020. "Named Entity Extraction for Knowledge Graphs: A Literature Overview." IEEE Access, pp. 32862-32881, February 14. doi: 10.1109/ACCESS.2020.2973928. Accessed 2021-05-10.
  2. Mihalcea, Rada and Andras Csomai. 2007. "Wikify! Linking Documents to Encyclopedic Knowledge." CIKM'07, November 6-8. doi: 10.1145/1321440.1321475. Accessed 2021-05-10.
  3. Nadeau, D. and S. Sekine. 2007. "A survey of named entity recognition and classification." Lingvisticæ Investigationes, vol. 30, no. 1, pp. 3–26, August. doi: 10.1075/li.30.1.03nad. Accessed 2021-04-27.
  4. Dai, H., C.-Y. Wu, R. Tzong, R. T.-H. Tsai, and W.-L. Hsu. 2012. "From Entity Recognition to Entity Linking: A Survey of Advanced Entity Linking Techniques." The 26th Annual Conference of the Japanese Society for Artificial Intelligence, June 12-15. Accessed 2021-04-27.
  5. Rao, Delip, Paul McNamee, and Mark Dredze. 2013. "Entity Linking: Finding Extracted Entities in a Knowledge Base." In: Poibeau T., Saggion H., Piskorski J., Yangarber R. (eds), Multi-source, Multilingual Information Extraction and Summarization, Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg, pp. 93-115. doi: 10.1007/978-3-642-28569-1_5. Accessed 2021-05-10.
  6. Provatorova, Vera, Svitlana Vakulenko, Evangelos Kanoulas, Koen Dercksen, and Johannes M van Hulst. 2020. "Named Entity Recognition and Linking on Historical Newspapers: UvA.ILPS & REL." Conference and Labs of the Evaluation Forum, September 22-25. Accessed 2021-05-10.
