Information Retrieval

Architecture of an ad hoc IR system. Source: Jurafsky and Martin 2009, sec. 23.2.

Information Retrieval (IR) is the task of finding relevant information in a collection of data. The data may be documents, articles, images, audio or video. Traditional approaches break the data into subsets or clusters along various dimensions and then find the items relevant to the problem statement.

There are various ways to retrieve information from unstructured data. Early systems used rule-based models. More recently, neural networks are being used.

Humans understand information in far more complex ways than machines do. Machines work by finding similarities, differences or links within the input data, and many-to-many relationships are hard for them to interpret. Humans, on the other hand, relate any new image, text or video to everything already stored in memory.

Discussion

  • What's the difference between Information Retrieval (IR) and Information Extraction (IE)?

    IR is about searching and retrieving unstructured documents that fulfil a specific search query. IE is about obtaining structured representation of key information from unstructured documents. For example, IR can help us find all documents about fishing whereas IE can tell who went fishing and what type of fish was caught.

    Information retrieval is the process of gathering information or sources that are appropriate to the topic from a collection of raw text data.

    Information extraction is an automated process, often rule-based, that obtains structured data from an unstructured source. It uses NLP techniques to clean the text and arrange the extracted information in a structured form such as a matrix. It also involves activities such as automatic annotation and content extraction.

  • Could you mention some applications of IR?

    Web search is an example of IR. A search engine might search billions of web documents to respond to a query such as "Italian restaurants near me" or "flights to London tomorrow". A more personal example is searching through emails. IR in this case is typically executed by a mail program such as Gmail.

    Searching through online discussion forums or Q&A archives is another IR application. Text classification is typically used to classify posts, questions and answers.

    In an enterprise system, we might need to retrieve patent documents, research papers or other publications based on a keyword or phrase. In a library, we might want to obtain books with specific words in their titles, or by author name, or by genre. In an e-commerce site, we might want to apply various criteria and see a filtered list of products.

    As an example of IR for multimedia, MPEG-7 is a standard that's used to describe multimedia content. This metadata can be searched to retrieve audio/video clips that are of interest to the user.

  • What are some essential terms used in information retrieval?
    A systematic approach to information retrieval. Source: Lalmas et al. 2001, fig. 2.

    While the term "record" is common in databases, the equivalent term in IR is document. It refers to a basic unit of data stored in the system. A document can be a newspaper article, an encyclopaedia entry, a paragraph in a report, a web page, an entire website, etc. It really depends on the application.

    A set of documents indexed and stored is called a collection. Alternative names include archive, corpus, and digital library.

    A query is what the user submits to the system to meet an information need. A query is composed of one or more terms. A term is a lexical item, but it could be a phrase as well.

    Searching through every document at query time is slow and inefficient. Instead, each document is indexed in advance. The index is a structured representation of the document, and searching through this representation is much faster. A similar representation of the query aids the retrieval function.

    In practice, IR systems use an inverted index that maps each index term to the list of documents in which it occurs.
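
    As a rough illustration of the inverted index described above, here is a minimal Python sketch. The toy collection, the whitespace tokenizer and the function name build_inverted_index are assumptions made for this example. Production indexes also store positions, frequencies and compressed postings.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection, not taken from the article.
docs = {
    1: "fishing trip on the lake",
    2: "lake house for rent",
    3: "deep sea fishing",
}
index = build_inverted_index(docs)
print(index["fishing"])  # [1, 3]
print(index["lake"])     # [1, 2]
```

    A query term is then answered by a single dictionary lookup rather than a scan of every document.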

  • Does information retrieval search through structured or unstructured documents?

    Information retrieval typically concerns itself with unstructured documents. IR started with a focus on text. More recently, due to machine learning and improved processing capability, other types such as image, audio, and video can also be retrieved.

    While documents may be unstructured, often they are accompanied by structured metadata. Metadata is data about the document. For example, an email will have fields date, from, to, subject, and body. This is useful structured information. Another example is a photograph taken with a digital camera. Useful metadata might include date, camera model, exposure setting, and picture dimensions.

    Sometimes documents are semi-structured in the sense that specific pieces of information are in "standard" locations that heuristic algorithms can identify. Text documents with sections and sub-sections are also semi-structured. XML documents can likewise be seen as semi-structured, and IR systems can exploit that structure.

  • What are the various models of IR?
    Taxonomy of Classic IR Models. Source: Chen 2008, slide 2.

    IR models can be categorized into Classic, Structured and Browsing models. Classic models include Boolean, Vector Space and Probabilistic models.

    Boolean models let users combine query terms with the Boolean primitives AND, OR and NOT according to their information need. Composing a complex query this way requires some skill. Because terms are matched as exact strings, differences in vocabulary between queries and documents cause matching problems. Even so, this is the most used approach.
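
    The Boolean primitives map naturally onto set operations over postings. The sketch below assumes a toy collection and a hypothetical helper named postings. It shows exact-match retrieval only and does no ranking.

```python
def postings(docs):
    """Build a mapping from each term to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

# Hypothetical toy collection, not taken from the article.
docs = {
    1: "italian restaurants in london",
    2: "flights to london tomorrow",
    3: "italian cooking classes",
}
idx = postings(docs)
all_ids = set(docs)

italian = idx.get("italian", set())
london = idx.get("london", set())

both = italian & london          # italian AND london
either = italian | london        # italian OR london
italian_only = italian - london  # italian AND NOT london
not_london = all_ids - london    # NOT london

print(both)          # {1}
print(italian_only)  # {3}
```

    Note that a document either matches or it doesn't. There's no notion of one match being better than another.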

    Vector space models represent documents, phrases or terms as vectors. Each term corresponds to a dimension of the document vector and carries a weight that reflects its high or low resolving power. Comparing vectors shows which documents are similar or dissimilar to a query, and this is the basis for retrieval.
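
    One common way to compare vectors in this model is cosine similarity. The sketch below assumes raw term counts as weights and a toy collection. The names vectorize and cosine are illustrative.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy collection, not taken from the article.
docs = {
    "d1": "deep sea fishing",
    "d2": "fishing trip on the lake",
    "d3": "lake house for rent",
}
query = vectorize("sea fishing")

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda d: cosine(query, vectorize(docs[d])), reverse=True)
print(ranked)  # ['d1', 'd2', 'd3']
```

    Unlike the Boolean model, every document gets a score, so partial matches such as d2 are still returned, just lower in the ranking.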

    Probabilistic models use probability distributions to estimate the similarity between documents and queries. They can be one of two types: similarity-based and expected-utility-based. Approaches such as TF-IDF and Hidden Markov Models (HMM) can be cast in a common framework that models how query words are generated, which makes comparison easier. Miller et al. (1999) report better accuracy for an HMM-based system than for traditional TF-IDF models.

    Two broad matching strategies are:

    • Literal Term: Vector Space Model (VSM), Hidden Markov Model (HMM), Language Model (LM)
    • Concept: Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Topical Mixture Model (TMM)

  • How is information represented in IR?
    Document representation in vector space model. Source: Baazeem 2015, fig. 4.

    Information here means unstructured data that carries insights about a particular domain. It needs to be converted into a format that can then be parsed into a structured form. For text, there are two representations to consider: document representation and query representation.

    The content of a document can be represented as a collection of terms such as words, phrases or other units. Each term has a weight that indicates its importance to that document. For better IR, terms must be weighted properly, taking into account words, phrases and names, and using statistical, linguistic and other means.

    Conventionally, a query is formulated as a keyword or a set of keywords. But in several cases the user lacks the context needed to select appropriate keywords. In that case, an input document can serve as the query, and maximal frequent sequences are used to retrieve relevant documents. A bag-of-words model can also be used to identify keywords and retrieve similar documents.

  • Which are some basic techniques used in information retrieval?

    Let's say a query has the term 'processed'. A strict match for this word will leave out documents that contain variants such as 'processing' and 'process'. Stemming or lemmatization solves this, though each creates its own problems.

    Often it doesn't make sense to index words such as 'to', 'be' and 'for'. These high-frequency words are collected in a stop list and are usually omitted from the index.
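
    The two preprocessing steps above can be sketched as follows. The stop list and the suffix rules here are assumptions chosen for illustration. The crude suffix stripper is a toy stand-in for a real stemmer such as Porter's, not an implementation of it.

```python
# Hypothetical stop list; real systems use much longer lists.
STOP_WORDS = {"to", "be", "for", "the", "a", "of", "and", "is"}

# Suffixes are tried in this order; only the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words like 'process' intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, drop stop words, then stem what remains."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(crude_stem("processed"))   # process
print(crude_stem("processing"))  # process
print(index_terms("Data to be processed for indexing"))
# ['data', 'process', 'index']
```

    The same sketch also hints at the problems stemming creates: crude_stem("news") returns "new", conflating two unrelated words.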

    Suppose the user searches for 'high blood pressure'. The IR system may also search for 'hypertension' and 'hypertension, renal'. This is called query expansion. Query expansion usually involves pre-generating a suitable thesaurus using term clustering.
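
    A minimal sketch of query expansion follows, assuming a small hand-written thesaurus in place of one pre-generated by term clustering. The names THESAURUS and expand_query are hypothetical.

```python
# Hypothetical thesaurus; a real one might be pre-generated by term clustering.
THESAURUS = {
    "high blood pressure": ["hypertension", "hypertension, renal"],
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the original query followed by any related phrases."""
    key = query.lower().strip()
    return [query] + thesaurus.get(key, [])

print(expand_query("high blood pressure"))
# ['high blood pressure', 'hypertension', 'hypertension, renal']
```

    Each phrase in the expanded list would then be searched, and the results merged.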

    When a query has multiple terms, some terms may be more important than others. Term weighting accounts for this. TF-IDF is a popular approach.
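
    TF-IDF has many variants. The sketch below assumes one simple form, raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The toy collection and function name are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}

    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    weights = {}
    for d, tokens in tokenized.items():
        tf = Counter(tokens)
        weights[d] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

# Hypothetical toy collection, not taken from the article.
docs = {
    1: "fishing fishing lake",
    2: "lake house",
    3: "lake view",
}
weights = tf_idf(docs)
print(weights[1]["lake"])               # 0.0
print(round(weights[1]["fishing"], 3))  # 2.197
```

    Here 'lake' occurs in every document and so gets zero weight, whereas 'fishing' is frequent in one document and rare in the collection, so it's weighted highly.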

    To improve IR performance, relevance feedback can be used, typically in VSM. The user is shown a small set of retrieved documents and asked which ones are relevant and which are not. The IR system then refines the query to improve performance.
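
    The article doesn't name a specific method for refining the query. One classic choice in the vector space model is the Rocchio algorithm, sketched below. The alpha, beta and gamma values are commonly quoted defaults and should be treated as assumptions, as should the toy vectors.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector towards relevant documents and away from others.

    All vectors are sparse dicts of {term: weight}.
    """
    refined = defaultdict(float)
    for term, weight in query.items():
        refined[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            refined[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            refined[term] -= gamma * weight / len(non_relevant)
    # Negative weights are dropped, a common simplification.
    return {term: w for term, w in refined.items() if w > 0}

# Hypothetical feedback: the user wants the animal, not the car.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]

refined = rocchio(query, relevant, non_relevant)
print(sorted(refined))  # ['cat', 'jaguar']
```

    The refined query gains the term 'cat' from the relevant document, while 'car' ends up with a negative weight and is dropped.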

  • What are some useful resources on Information Retrieval?
    Some collections for IR research. Source: Melucci 2015, slide 47.

    Greengrass (2000) offers a comprehensive survey of IR. Nyamisa et al. (2017) is another survey paper. Zhou et al. (2006) introduces basic terms and concepts. There's also a comprehensive online IR glossary.

    Stanford University's CS276 course on IR is worth studying. Among the books for IR, you can look into Manning et al. (2009), Croft et al. (2015) and Frakes et al. (1992).

    For IR research, you'll need collections. Wiki Small/Large and CACM collections are available for free download. TIPSTER Complete is another useful collection. Other important collections include the Cranfield collection, TREC, GOV2, Reuters-21578, Reuters-RCV1, and 20 Newsgroups. For cross-language IR, there's NTCIR and CLEF.

    To keep track of latest developments, ACM's Special Interest Group on Information Retrieval (SIGIR) is the place to visit.

Milestones

Dec
1931
Search plate containing 'GE MN' looks for documents that match this metadata. Source: Goldberg 1931.

Emanuel Goldberg receives a US patent for a machine that does information retrieval. The machine is opto-electric in nature. Light is passed through a negative search plate containing the search terms. Document matches are picked up by activating a photoelectric cell. All matches are recorded on a photographic plate.

Mar
1950

Calvin Mooers coins the term Information Retrieval.

1952
Punched holes of search card must match with those of record cards. Source: Luhn 1952, p. 7.

H. P. Luhn describes an early IR system from IBM. It makes use of punched cards and a photoelectric scanning unit. Information is indexed and encoded into record cards. A query is input as a search card. Plug connections on a switchboard control how search terms are to be combined. This is therefore one of the earliest implementations of the Boolean Model for IR.

1957
A dictionary of notions to aid information retrieval. Source: Luhn 1957, fig. 2.

H. P. Luhn proposes a statistical approach to IR. Content can be identified based on the frequency of occurrence of words. This also means that documents can be indexed automatically based on the words they contain. This is therefore the first attempt at automatic indexing. The system uses statistical information to group words into "notional" families.

1959

Calvin Mooers notes that using information is often not rewarded. Those at work are valued for getting the job done rather than for fussing about information. Having information means you have to read it, understand it, store it carefully and make decisions based on it. These observations lead Mooers to define Mooers' Law:

An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have it.

1960

Maron and Kuhns publish a paper titled On Relevance, Probabilistic Indexing and Information Retrieval. Where previous IR systems returned matching documents, the idea in this paper is to rank documents by computing probability of relevance. The paper also introduces the notion of query expansion. In years to come, this goes on to become one of the most influential papers in the field of IR.

1965

Salton and Lesk propose the SMART retrieval system. It goes beyond Luhn's work of 1957. It uses the original search words along with their word stems. It uses synonym dictionaries to account for variations in vocabulary. It uses relations between words and surrounding words to determine the nature of content. Over the years, the SMART system develops or uses many important concepts such as the vector model, term weighting, relevance feedback, and the cosine similarity measure.

1976

Robertson and Sparck Jones apply a probabilistic model to IR. From the distribution of index terms in documents, they determine term weights. This work is perhaps the first real application of the probabilistic approach to IR that Maron and Kuhns proposed in 1960.

1982
Illustrating the traditional Boolean model. Source: Khilfeh 2014, fig. 1.

Salton et al. propose the extended Boolean model. Output from a traditional Boolean model is hard to control and retrieved documents are not ranked. The extended model assigns weights to terms in both queries and documents. This enables the model to rank retrieved documents based on similarity to queries.

1988

Dumais et al. note two common problems in IR: the same word can have many meanings, and different words can express the same concept. These result in irrelevant documents being retrieved and relevant documents being missed, respectively. They therefore propose a novel method that maps terms and documents into a lower dimensional semantic space. In later years, this method is named Latent Semantic Analysis (LSA).

1992

In the US, the Text Retrieval Conference (TREC) is launched to facilitate research and collaboration in IR. It goes on to become an almost annual event. The growth of the web and the need for large-scale IR make TREC all the more relevant. TREC produces test collections that are useful for IR research.

Aug
1998

At SIGIR'98, Ponte and Croft propose the use of Language Modelling (LM) for the task of IR. They note that typical IR systems lack a good indexing model and that existing indexing models make unwarranted parametric assumptions. They therefore propose a single non-parametric model for both indexing and retrieval. They show better performance compared to TF-IDF weighting. In 2001, Zhai and Lafferty study different smoothing techniques for such language models.

Sep
1998

Google is founded for web search. By now, many other search engines are already popular: Lycos (1994), Yahoo! Search (1995), Excite (1995), AltaVista (1995), Yandex (1997). These engines become the best instantiations of IR. They use features that until now were only experimental.

2016
Two NN architectures to score query-document relevance: (A) Two mirror models with shared weights; (B) Joint model. Source: Zhang et al. 2017, fig. 4.

There's growing interest in applying neural networks to IR. This is called Neural IR. The network learns representations of queries and documents. In one application of Neural IR from 2019, two deep models are trained, one for images and one for text. This allows a user to retrieve images based on a textual query, or vice versa. Earlier work on applying deep learning to content-based image retrieval can be traced to Wan et al. (2014).

References

  1. Agarwal, Shubham. 2017. "Information Retrieval in Natural Language Processing — Part 1." Medium, November 9. Accessed 2019-11-23.
  2. Baazeem, Ibtehal Salem. 2015. "Analysing the Effects of Latent Semantic Analysis Parameters on Plain Language Visualisation." Thesis, The School of Information Technology and Electrical Engineering, The University of Queensland, June 15. Accessed 2019-12-16.
  3. Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. 1999. "Glossary." In: Modern Information Retrieval, Addison Wesley Longman Publishing Co. Inc. Accessed 2020-01-07.
  4. Chen, Berlin. 2008. "Latent Semantic Approaches for Information Retrieval and Language Modeling." Department of Computer Science & Information Engineering, National Taiwan Normal University, July. Accessed 2019-11-23.
  5. Dumais, Susan. 2007. "LSA and information retrieval: Getting back to basics." Chapter 16 in: Handbook of Latent Semantic Analysis, CRC Press, pp. 293-321. Accessed 2020-01-07.
  6. Goldberg, E. 1931. "Statistical machine." US Patent 1,838,389, filed 1928-04-05, December 29. Accessed 2020-01-07.
  7. Greengrass, Ed. 2000. "Information Retrieval: A Survey." Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, November 30. Accessed 2020-01-07.
  8. Hoogeveen, Doris, Li Wang, Timothy Baldwin, and Karin M. Verspoor. 2017. "Web Forum Retrieval and Text Analytics: a Survey." Foundations and Trends in Information Retrieval, now Publishers Inc., Preprint, pp. 1-163. Accessed 2020-01-07.
  9. Hua, Yan and Jianhe Du. 2019. "Uniting Image and Text Deep Networks via Bi-directional Triplet Loss for Retrieval." 2019 IEEE 9th International Conference on Electronics Information and Emergency Communication (ICEIEC), pp. 297-300, July 12-14. Accessed 2019-11-18.
  10. Jurafsky, Daniel and James H. Martin. 2009. "Question Answering and Summarization." Chapter 23 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2020-01-07.
  11. Khilfeh, Rabah. 2014. "Information Retrieval – Boolean Retrieval." On Wordpress, January 30. Accessed 2019-11-18.
  12. Lalmas, Mounia, Benoit Mory, Katerina Moutogianni, Wolfgang Putz, and Thomas Rölleke. 2001. "Searching Multimedia Data Using MPEG-7 Descriptions in a Broadcast Terminal." ResearchGate. Accessed 2019-11-23.
  13. Luhn, H. P. 1952. "The IBM Electronic Information Searching System." Presented at the Symposium on Machine Techniques for Information Selection, MIT Industrial Liaison Program, June 10-11. Accessed 2020-01-07.
  14. Luhn, H. P. 1957. "A Statistical Approach to Mechanized Encoding and Searching of Literary Information." IBM Journal, pp. 309-317, October. Accessed 2020-01-07.
  15. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2009. "Introduction to Information Retrieval." Online Draft, Cambridge University Press, April 1. Accessed 2020-01-07.
  16. Maron, M. E. and K. L. Kuhns. 1960. "On Relevance, Probabilistic Indexing and Information Retrieval." Journal of the Association for Computing Machinery, vol. 7, no. 4, pp. 216-244, July. Accessed 2020-01-07.
  17. Melucci, Massimo. 2009. "Boolean Model." In: Liu, L. and M. Tamer Özsu (eds), Encyclopedia of Database Systems, Springer, Boston, MA. Accessed 2020-01-07.
  18. Melucci, Massimo. 2015. "Information Retrieval and Machine Learning." CIMI School in Machine Learning. Accessed 2020-01-07.
  19. Merlo-Galeazzi, R., J. A. Carrasco-Ochoa, J. F. Martínez-Trinidad and J. A. Olvera-López. 2013. "Information Retrieval Based on a Query Document Using Maximal Frequent Sequences." 32nd International Conference of the Chilean Computer Science Society (SCCC), Temuco, 2013, pp. 58-62. Accessed 2019-11-20.
  20. Miller, David R. H., Tim Leek, and Richard M. Schwartz. 1999. "A Hidden Markov Model Information Retrieval System." SIGIR '99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, August, pp. 214–221. Accessed 2019-11-20.
  21. Mooers, Calvin N. 1996. "Mooers' Law or Why Some Retrieval Systems Are Used and Others Are Not." Bulletin of the American Society for Information Science and Technology, October/November. Accessed 2020-01-07.
  22. Nyamisa, Mang'are Fridah, Waweru Mwangi, and Wilson Cheruiyot. 2017. "A Survey of Information Retrieval Techniques." Advances in Networks, vol. 5, no. 2, pp. 40-46, November. Accessed 2019-11-23.
  23. O'Riordan, Colm, and Humphrey Sorensen. 1997. "Information Filtering and Retrieval: An Overview." Proceedings of the 16th Annual International Conference of the IEEE, Atlanta, GA, USA, pp. A42-A49. Accessed 2019-11-20.
  24. Ponte, Jay M. and W. Bruce Croft. 1998. "A Language Modeling Approach to Information Retrieval." SIGIR'98, Melbourne, Australia, pp. 275-281, August. Accessed 2020-01-07.
  25. Robertson, S. E. and K. Sparck Jones. 1976. "Relevance Weighting of Search Terms." Journal of the American Society for Information Science, vol. 27, no. 3, pp. 129-146, May-June. Accessed 2020-01-07.
  26. Salton, G. and M. E. Lesk. 1965. "The SMART automatic document retrieval systems—an illustration." Communications of the ACM, vol. 8, no. 6. Accessed 2019-11-18.
  27. Salton, Gerard, Edward A. Fox, and Harry Wu. 1982. "Extended Boolean Information Retrieval." Technical Report TR 82-511, Cornell University, August. Accessed 2019-11-18.
  28. Srivastava, Tavish. 2015. "Information Retrieval System explained in simple terms!" Blog, Analytics Vidhya, April 7. Accessed 2019-11-23.
  29. Strzalkowski, Tomek. 1994. "Document Representation in Natural Language Text Retrieval." Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, pp. 364-369, March 8-11. Accessed 2019-11-20.
  30. Thompson, Paul. 2008. "Looking back: On relevance, probabilistic indexing and information retrieval." Information Processing and Management, vol. 44, pp. 963–970, Elsevier. Accessed 2020-01-07.
  31. Wan, Ji, Dayong Wang, Steven C.H. Hoi, Pengcheng Wu, Jianke Zhu, Yongdong Zhang, and Jintao Li. 2014. "Deep Learning for Content-Based Image Retrieval: A Comprehensive Study." Proceedings of the ACM International Conference on Multimedia, Orlando. pp. 157-166, November 3-7. Accessed 2020-01-07.
  32. Wikipedia. 2019a. "Calvin Mooers." Wikipedia, October 28. Accessed 2020-01-07.
  33. Wikipedia. 2019b. "Information retrieval." Wikipedia, December 18. Accessed 2020-01-07.
  34. Wikipedia. 2019c. "Exif." Wikipedia, December 12. Accessed 2020-01-07.
  35. Wikipedia. 2019d. "Information extraction." Wikipedia, December 17. Accessed 2020-01-07.
  36. Wikipedia. 2019e. "Text Retrieval Conference." Wikipedia, December 5. Accessed 2020-01-07.
  37. Wikipedia. 2019f. "Web search engine." Wikipedia, January 5. Accessed 2020-01-07.
  38. Xiong, Chenyan. 2016. "Knowledge Based Text Representation for Information Retrieval." Thesis, Language Technologies Institute, May. Accessed 2019-11-22.
  39. Zhai, Chengxiang, and John Lafferty. 2001. "A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval." SIGIR'01, New York, NY, USA, pp. 334–342. Accessed 2020-01-07.
  40. Zhang, Ye, Md Mustafizur Rahman, Alex Braylan, Brandon Dang, Heng-Lu Chang, Henna Kim, Quinten McNamara, Aaron Angert, Edward Banner, Vivek Khetan, Tyler McDonnell, An Thanh Nguyen, Dan Xu, Byron C. Wallace, and Matthew Lease. 2017. "Neural Information Retrieval: A Literature Review." arXiv, v3, March 3. Accessed 2019-11-18.
  41. Zhou, Wei, Neil Smalheiser, and Clement Wu. 2006. "A tutorial on information retrieval: Basic terms and concepts." Journal of Biomedical Discovery and Collaboration, vol. 1, February. Accessed 2019-11-23.

Further Reading

  1. Zhou, Wei, Neil Smalheiser, and Clement Wu. 2006. "A tutorial on information retrieval: Basic terms and concepts." Journal of Biomedical Discovery and Collaboration, vol. 1, February. Accessed 2019-11-23.
  2. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2009. "Introduction to Information Retrieval." Online Draft, Cambridge University Press, April 1. Accessed 2020-01-07.
  3. Croft, W. Bruce, Donald Metzler, and Trevor Strohman. 2015. "Search engines: Information Retrieval in Practice." Previously published by Pearson Education, Inc. Accessed 2020-01-07.
  4. Frakes, William B, and Ricardo Baeza-Yates. 1992. "Information Retrieval: Data Structures And Algorithms." Prentice Hall. Accessed 2020-01-07.
  5. Manning, Christopher, and Pandu Nayak. 2019. "CS 276 / LING 286: Information Retrieval and Web Search." Stanford University. Accessed 2020-01-07.
  6. Greengrass, Ed. 2000. "Information Retrieval: A Survey." Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, November 30. Accessed 2020-01-07.

Cite As

Devopedia. 2022. "Information Retrieval." Version 15, February 15. Accessed 2024-06-26. https://devopedia.org/information-retrieval

Contributed by 3 authors. Last updated on 2022-02-15 11:54:36.