
Question Answering

2 authors have contributed to this article: arvindpdmn and abhip.
Last updated by arvindpdmn on 2020-02-24. Created by arvindpdmn on 2020-02-08.

Summary

Comparing different QA datasets. Source: Choi et al. 2018, table 1.

Search engines, and information retrieval systems in general, help us obtain relevant documents for any search query. In reality, people want answers. Question Answering (QA) is about giving a direct answer in the form of a grammatically correct sentence.

QA is a subfield of Computer Science. It's predominantly based on Information Retrieval and Natural Language Processing. Both questions and answers are in natural language.

QA is also related to an NLP subfield called text summarization. When answers are long and descriptive, they're typically summarized from multiple sources. In this case, QA is also called focused summarization or query-based summarization.

Many datasets are available to train and evaluate QA models. By the late 2010s, neural network models had achieved state-of-the-art results.

Milestones

1961
An example specification list for a question. Source: Green et al. 1961.

MIT researchers implement a program named Baseball. It reads a question from a punched card. It references a dictionary of words and idioms to generate a "specification list", which is a canonical expression of what the question is asking. Content analysis involves syntactic phrase structures.

1963

Bertram Raphael at MIT publishes a memo titled Operation of a Semantic Question-Answering System. He describes a QA model that accepts a restricted form of English. Factual information comes from a relational model. The program is written in LISP. Raphael credits LISP's list-processing capability for making the implementation much easier.

Dec
1993

Developed at MIT, START goes online. This is probably the world's first web-based QA system. It can answer questions on places, people, movies, dictionary definitions, etc.

Jun
1997

With the growth of the web, AskJeeves is launched as an online QA system. However, it essentially does pattern matching against a knowledge base of questions and returns curated answers. If there's no match, it falls back to a web search. In February 2006, the system is rebranded as Ask.

Nov
1999

At the 8th Text REtrieval Conference (TREC-8), a Question Answering track is introduced to foster research in QA. TREC-8 focuses only on open-domain closed-class questions (fact-based short answers). At future TREC events, the QA track continues to produce datasets for training and evaluation.

2002
Coarse classes (bold) and fine classes from TREC-10 dataset. Source: Li and Roth 2002, table 1.

It's helpful to identify the type of question being asked. Li and Roth propose a machine learning approach to question classification. Such a classification imposes constraints on potential answers. Due to ambiguity, their model allows for multiple classes for a single question. For example, "What do bats eat?" could belong to three classes: food, plant, animal. The features used for learning include words, POS tags, chunks, head chunks, named entities, semantically related words, n-grams, and relations.
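
The classification idea can be sketched with a toy rule-based version. The keyword table and class names below are invented for illustration (loosely inspired by the TREC taxonomy); Li and Roth's actual system is a learned classifier over much richer features:

```python
# Hypothetical keyword-to-class table; NOT Li and Roth's actual model.
KEYWORD_CLASSES = {
    "who": ["HUMAN"],
    "where": ["LOCATION"],
    "when": ["NUMERIC:date"],
    "how many": ["NUMERIC:count"],
    "eat": ["ENTITY:food", "ENTITY:animal", "ENTITY:plant"],  # ambiguous
}

def classify_question(question):
    """Return all candidate answer-type classes for a question.

    Multiple labels are allowed, mirroring the paper's handling of
    ambiguous questions like "What do bats eat?".
    """
    q = question.lower()
    classes = []
    for keyword, labels in KEYWORD_CLASSES.items():
        if keyword in q:
            classes.extend(labels)
    return classes or ["DESCRIPTION"]  # fallback coarse class

print(classify_question("What do bats eat?"))
# A real classifier scores classes using features such as POS tags,
# chunks, head chunks, and named entities instead of raw keywords.
```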

2010
Architecture of IBM's DeepQA. Source: Ferrucci et al. 2010, fig. 6.

After about three years of effort, IBM Watson competes at human expert levels in terms of precision, confidence and speed at the Jeopardy! quiz show. Its DeepQA architecture integrates many content sources and NLP techniques. Answer candidates come with confidence measures. They're then scored using supporting evidence. Watson wins Jeopardy! in February 2011.

Dec
2014
Use of CNN to obtain a sentence representation. Source: Yu et al. 2014, fig. 1.

Yu et al. look at the specific task of answer selection. Using distributed representations, they look for answers that are semantically similar to the question. This is a departure from a classification approach that uses hand-crafted syntactic and semantic features. They use a bigram model with a convolutional layer and an average pooling layer. These capture syntactic structures and long-term dependencies without relying on external parse trees.

Jul
2017
Embeddings, BiLSTMs and attention used in DrQA. Source: Jurafsky and Martin 2019, fig. 25.7.

Chen et al. use Wikipedia as the knowledge source for open-domain QA. Answers are predicted as text spans. Earlier research typically considered a short piece of already identified text. Since the present approach searches over multiple large documents, they call it "machine reading at scale". Called DrQA, this system integrates document retrieval and document reading. Bigram features and bag-of-words weighted with TF-IDF are used for retrieval. The reader uses a BiLSTM each for the question and the passages, with attention between the two.

Oct
2018
Fine-tuning of BERT for question answering. Source: Devlin et al. 2019, fig. 4c.

Researchers at Google release BERT, a pre-trained language model trained on 3.3 billion words of unlabelled text. As a sample task, they fine-tune BERT for question answering using the SQuAD v1.1 and v2.0 datasets. The question and the text containing the answer are concatenated to form the input sequence. Start and end tokens of the answer are predicted using softmax. For questions without answers, the start/end tokens point to the [CLS] token.
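
The decoding step can be sketched as follows. The token sequence and logits below are invented stand-ins for what a fine-tuned model would emit; picking the (start, end) pair with the highest combined score is a simplification of BERT's actual decoding:

```python
import math

def softmax(scores):
    """Convert logits to probabilities."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_len=10):
    """Pick (start, end) with the highest start+end logit, end >= start."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best_score, best = s + end_logits[j], (i, j)
    return best

# Invented tokens and logits for illustration. Index 0 is [CLS]; for
# unanswerable questions both distributions would peak there.
tokens = ["[CLS]", "when", "was", "bert", "released", "[SEP]",
          "bert", "was", "released", "in", "october", "2018"]
start_logits = [0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3.2, 1.0]
end_logits   = [0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.5, 4.1]

i, j = best_span(start_logits, end_logits)
print(tokens[i:j + 1])                     # predicted answer span
print(round(softmax(start_logits)[0], 3))  # probability mass on [CLS] start
```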

Jan
2019

Google releases the Natural Questions (NQ) dataset. It has 300K question-answer pairs plus 16K questions with answers from five different annotators. Each answer comes from a Wikipedia page and the model is required to read the entire page. The questions themselves are based on real, anonymized, aggregated queries from Google Search. Answers can be yes/no, long, long and short, or no answer.

2019
SQuAD 2.0 entry on Steam_engine with some question-answer pairs. Source: SQuAD 2020b.

On the SQuAD 2.0 dataset, many implementations start surpassing human performance. Many of these are based on the transformer neural network architecture, including BERT, RoBERTa, XLNet, and ALBERT. Note that SQuAD 2.0 combines 100K questions from SQuAD 1.1 with 50K unanswerable questions. When there's no answer, models are required to abstain from answering.

Jun
2019
Unsupervised question generation to train a QA model. Source: Lewis et al. 2019, fig. 1.

Since datasets are available only for some domains and languages, Lewis et al. propose a method to synthesize questions to train QA models. Passages are randomly selected from documents. Random noun phrases or named entities are picked as answers. "Fill-in-the-blanks" questions are generated. Using neural machine translation (NMT), these are converted into natural questions.
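
A toy version of the cloze step can be sketched as follows. A naive capitalized-phrase heuristic stands in for the named-entity recognizer, and the NMT rewriting into a natural question is skipped:

```python
import re

def make_cloze(passage):
    """Blank out a heuristic answer phrase to form a cloze question.

    A capitalized phrase preceded by a lowercase word stands in for a
    named entity; real systems use an NER model, then rewrite the
    cloze into a natural question with neural machine translation.
    """
    matches = re.findall(r"(?<=[a-z] )[A-Z][a-z]+(?: [A-Z][a-z]+)*", passage)
    if not matches:
        return None, None
    answer = max(matches, key=len)          # prefer the longest phrase
    question = passage.replace(answer, "_____", 1)
    return question, answer

q, a = make_cloze("The steam engine was improved by James Watt in 1776.")
print(q)  # The steam engine was improved by _____ in 1776.
print(a)  # James Watt
```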

Feb
2020
In Finnish, words 'day' and 'week' are represented differently in question and answer. Source: Clark 2020.

Google Research releases TyDi QA, a typologically diverse multilingual dataset. It has 200K question-answer pairs from 11 languages. To avoid shared words within a pair, annotators framed questions without knowing the answers; Google Search then identified a suitable Wikipedia article, and the annotator marked the answer in it. Researchers expect their model to generalize well to many languages.

Discussion

  • Which are the broad categories of questions answered by QA systems?
    Conceptually, a question type has three facets. Source: Oh et al. 2011, fig. 1.

    Factoid questions are the simplest. An example of this is "What is the population of the Bahamas?" Answers are short and factual, often identified by named entities. Variations of factoid questions include single answer, list of answers (such as "Which are the official languages of Singapore?"), or yes/no. Questions typically ask what, where, when, which, who, or is.

    QA research started with factoid questions. Later, research progressed to questions that sought descriptive answers. "Why is the sky blue?" requires an explanation. "What is global warming?" requires a definition. Questions typically ask why, how or what.

    Closed-domain questions are about a specific domain such as medicine, environment, baseball, or algebra. Open-domain questions are not restricted to any domain. Open-domain QA systems use large collections of documents or knowledge bases covering diverse domains.

    When the system is given a single document to answer a question, we call it reading comprehension. If information has to be searched in multiple documents across domains, the term open-context open-domain QA has been used.

  • What are the main approaches or techniques used in question answering?
    Architecture of a knowledge-based QA system with attention computed between question and candidate answers. Source: Zhang et al. 2016, fig. 1.

    QA systems rely on external sources from where answers can be determined. Broad approaches are the following:

    • Information Retrieval-based: Extends traditional IR pipeline. Reading comprehension is applied on each retrieved document to select a suitable named entity, sentence or paragraph. This has also been called open domain QA. The web (or CommonCrawl), PubMed and Wikipedia are possible sources.
    • Knowledge-based: Facts are stored in knowledge bases. Questions are converted (by semantic parsers) into semantic representations, which are then used to query the knowledge bases. Knowledge could be stored in relational databases or as RDF triples. This has also been called semantic parsing-based QA. DBpedia and Freebase are possible knowledge sources.
    • Hybrid: IBM's DeepQA is an example that combines both IR and knowledge approaches.
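
The retrieval stage of the IR-based route can be sketched with TF-IDF weighted bag-of-words ranking. The three-document corpus below is invented; production systems index millions of documents:

```python
import math
from collections import Counter

docs = [
    "the steam engine converts heat into mechanical work",
    "chichen itza is a mayan city in mexico",
    "bats eat insects fruit and nectar",
]

def tfidf_vectors(corpus):
    """One sparse TF-IDF vector per document."""
    n = len(corpus)
    tokenized = [d.split() for d in corpus]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def best_doc(question, corpus):
    """Index of the document scoring highest against the question terms."""
    vectors = tfidf_vectors(corpus)
    q_terms = Counter(question.lower().split())
    scores = [sum(q_terms[t] * w for t, w in v.items()) for v in vectors]
    return max(range(len(corpus)), key=scores.__getitem__)

print(docs[best_doc("where is chichen itza", docs)])
```

A reading-comprehension model would then extract the answer span from the top-ranked document.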
  • What are some variations of question answering systems?
    Question answering in dialogue context. Source: Choi et al. 2018, fig. 1.

    We note the following variations or specializations of QA systems:

    • Visual QA (VQA): Input is an image (or video) rather than text. VQA is at the intersection of computer vision and NLP.
    • Conversational QA: In dialogue systems, there's a continuity of context. The current question may be incomplete or ambiguous but it can be resolved by looking at past interactions. CoQA and QuAC are two datasets for this purpose.
    • Compositional QA: Complex questions are decomposed into smaller parts, each answered individually, and then the final answer is composed. This technique is used in VQA as well.
    • Domain-Specific QA: Biomedical QA is a specialized field where both domain patterns and knowledge can be exploited. AQuA is a dataset specific to algebra.
    • Context-Specific QA: Social media texts are informal. Models that do well on newswire QA have been shown to do poorly on tweets. Community forums (Quora, StackOverflow) provide multi-sentence questions with often long answers that are upvoted or downvoted.
  • What are the key challenges faced by question answering systems?

    QA systems face two challenges: question complexity (depth) and domain size (breadth). Systems are good at either of these but not both. An example of depth is "What's the cheapest bus to Chichen Itza leaving tomorrow?" A much simpler question is "Where is Chichen Itza?"

    Common sense reasoning is challenging. For example, 'longest river' requires reverse sorting by length; 'by a margin of' involves some sort of comparison; 'at least' implies a lower cut-off. Temporal or spatial questions require reasoning about time or space relations.

    Lexical gap means that a concept can be expressed using different words. For example, we're looking for a 'city' but the question asks about a 'venue'. Approaches to solving this include string normalization, query expansion, and entailment.
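
One common fix for the lexical gap, query expansion, can be sketched with a hand-made synonym table (the table is invented for illustration):

```python
# Hypothetical synonym table; real systems derive such mappings from
# WordNet, embeddings, or entailment resources.
SYNONYMS = {
    "venue": ["city", "place", "location"],
    "film": ["movie", "picture"],
}

def expand_query(query):
    """Append synonyms of each query term so that documents using a
    different word for the same concept can still match."""
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(expand_query("which venue hosted the olympics"))
```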

    Ambiguity occurs when a word or phrase can have multiple meanings, only one of which is intended in a given context. The correct meaning can be obtained via corpus-based methods (distributional hypothesis) or resource-based methods.

    Sometimes the answer is distributed across different sources. QA systems need to align different knowledge ontologies. An alternative is to decompose the question into simpler queries and combine the answers later.

  • What are the steps in a typical question answering pipeline?
    Pipeline of IR-based factoid QA systems. Source: Jurafsky and Martin 2019, fig. 25.2.

    In IR-based factoid QA, tokens from the question or the question itself forms the query to the IR system. Sometimes stopwords may be removed, the query rephrased or expanded. From the retrieved documents, relevant sentences or passages are extracted. Named entities, n-gram overlap, question keywords, and keyword proximity are some techniques at this stage. Finally, a suitable answer is picked. We can train classifiers to extract an answer. Features include answer type, matching pattern, number of matching keywords, keyword distance, punctuation location, etc. Neural network models are also common for answer selection.
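
The passage-extraction step can be sketched by scoring candidate sentences on question-keyword overlap, one of the techniques mentioned above (the stopword list and passage below are invented):

```python
STOPWORDS = {"the", "a", "in", "of", "did", "when", "what", "were", "was"}

def keywords(text):
    """Lowercased, punctuation-stripped, stopword-filtered token set."""
    return {w.strip(".,?!").lower() for w in text.split()} - STOPWORDS

def best_sentence(question, sentences):
    """Sentence sharing the most keywords with the question."""
    q = keywords(question)
    return max(sentences, key=lambda s: len(q & keywords(s)))

passage = [
    "Steam engines were widely used in trains.",
    "The first commercial steam engine appeared in 1712.",
]
print(best_sentence("When did the first steam engine appear?", passage))
```

Real systems would also weigh keyword proximity, named-entity types, and the expected answer type before final answer selection.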

    For knowledge-based QA, the first step is to invoke a semantic parser to obtain a logical form for querying. Such a parser could be rule-based to extract common relations, or it could be learned via supervised machine learning. More commonly, semi-supervised or unsupervised methods are used based on web content. Such methods help us discover new knowledge relations in unstructured text. Relevant techniques include distant supervision, open information extraction and entity linking.

  • How are neural networks being used in question answering?

    Widespread use of neural networks for NLP started with distributed representation for words. A feedforward model learned the representation as it was being trained on a language modelling task. In these representations, semantically similar words will be close to one another. The next development was towards compositional distributional semantics, where sentence-level representations are composed from word representations. These were more useful for question answering.

    Iyyer et al. reduced dependency parse trees to vector representations that were used to train an RNN. Yu et al. used a CNN for answer selection. A common approach to answer selection is to look at the similarity between question and answer in the semantic space. Later models added an attention layer between the question and its candidate answers. Tan et al. evaluated BiLSTMs with attention and CNN. Dynamic Coattention Network (DCN) is also based on attention. Facebook researchers combined a seq2seq model with multitasking.
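
The similarity idea can be sketched with toy word vectors. The 3-dimensional vectors below are invented; real systems learn embeddings from data. Question and candidate answer are each averaged into one vector and compared by cosine similarity:

```python
import math

# Invented 3-d word vectors; real systems use learned embeddings.
VEC = {
    "capital": [0.9, 0.1, 0.0],
    "city":    [0.8, 0.2, 0.1],
    "paris":   [0.7, 0.3, 0.2],
    "banana":  [0.0, 0.9, 0.4],
    "fruit":   [0.1, 0.8, 0.5],
}

def sentence_vector(words):
    """Average the word vectors of the known words."""
    vecs = [VEC[w] for w in words if w in VEC]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

q = sentence_vector(["capital", "city"])
a1 = sentence_vector(["paris"])
a2 = sentence_vector(["banana", "fruit"])
print(cosine(q, a1) > cosine(q, a2))  # the semantically closer answer wins
```

Attention layers refine this picture by letting each question word weigh different parts of the candidate answer instead of a single averaged vector.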

    The transformer architecture has also been applied to QA. In fact, QA was one of the tasks on which BERT was fine-tuned (on SQuAD) and evaluated. BERTserini used fine-tuned BERT along with information retrieval from Wikipedia.

  • What are some useful datasets for training or evaluating question answering models?
    Distribution of trigram prefixes in questions of SQuAD and CoQA datasets. Source: Heidenreich 2018, fig. 3.

    Datasets are used for training and evaluating QA systems. Based on the design and makeup, each dataset might evaluate different aspects of the system better.

    Among the well-known datasets are the Stanford Question Answering Dataset (SQuAD), Natural Questions (NQ), Question Answering in Context (QuAC) and HotpotQA. All four are based on Wikipedia content. Conversational Question Answering (CoQA) is a dataset that's based on Wikipedia plus other sources. Wikipedia often presents data in tables. WikiTableQuestions is a dataset in which answers are in tables rather than freeform text. TyDi QA is a multilingual dataset. TweetQA takes its data from Twitter.

    Question Answering over Linked Data (QALD) is a series of datasets created from knowledge bases such as DBpedia, MusicBrainz, Drugbank and LinkedSpending.

    Other datasets to note are ELI5, ShARC, MS MARCO, NewsQA, CMU Wikipedia Factoid QA, CNN/DailyMail QA, Microsoft WikiQA, Quora Question Pairs, CuratedTREC, WebQuestions, WikiMovies, GeoQuery and ATIS.

    Papers With Code lists dozens of datasets along with their respective state-of-the-art models.

References

  1. Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. "A Neural Probabilistic Language Model." Journal of Machine Learning Research, vol. 3, pp. 1137–1155. Accessed 2020-02-23.
  2. Bordes, Antoine, Jason Weston, and Nicolas Usunier. 2014. "Open Question Answering with Weakly Supervised Embedding Models." arXiv, v1, April 16. Accessed 2020-02-21.
  3. Chen, Danqi, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. "Reading Wikipedia to Answer Open-Domain Questions." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1870-1879, July. Accessed 2020-02-21.
  4. Choi, Eunsol, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. "QuAC: Question Answering in Context." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2174-2184, October-November. Accessed 2020-02-08.
  5. Choudhury, Ambika. 2019. "10 Question-Answering Datasets To Build Robust Chatbot Systems." Analytics India Magazine, September 27. Accessed 2020-02-08.
  6. Clark, Jonathan. 2020. "TyDi QA: A Multilingual Question Answering Benchmark." Google AI Blog, February 6. Accessed 2020-02-21.
  7. Couto, Javier. 2018. "Introduction to Visual Question Answering: Datasets, Approaches and Evaluation." Blog, TryoLabs, March 1. Accessed 2020-02-08.
  8. DeepMind. 2017. "deepmind / AQuA." GitHub, November 2. Accessed 2020-02-21.
  9. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv, v2, May 24. Accessed 2020-02-22.
  10. Diefenbach, Dennis, Vanessa Lopez, Kamal Singh, and Pierre Maret. 2017. "Core techniques of question answering systems over knowledge bases: a survey." Knowledge and Information Systems, vol. 55, no. 3, pp. 529-569. Accessed 2020-02-08.
  11. Fan, Angela, Yacine Jernite, and Michael Auli. 2019. "Introducing long-form question answering." Blog, Facebook AI, July 25. Accessed 2020-02-08.
  12. Ferrucci, David, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. "Building Watson: An Overview of the DeepQA Project." AI Magazine, Association for the Advancement of Artificial Intelligence, pp. 60-79. Accessed 2020-02-21.
  13. Green, Bert F., Alice K. Wolf, Carol Chomsky, and Kenneth Laughery. 1961. "Baseball: an automatic question-answerer." Western Joint IRE-AIEE-ACM Computer Conference, pp. 219-224, May. doi:10.1145/1460690.1460714. Accessed 2020-02-24.
  14. Heidenreich, Hunter. 2018. "CoQA: A Conversational Question Answering Challenge." Blog, August 24. Accessed 2020-02-08.
  15. Heidenreich, Hunter. 2018b. "QuAC: Question Answering in Context." Chatbots Life, on Medium, August 24. Accessed 2020-02-08.
  16. Hudson, Drew A. and Christopher D. Manning. 2019. "GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering." arXiv, v3, May 10. Accessed 2020-02-21.
  17. Höffner, Konrad, Sebastian Walter, Edgard Marx, Ricardo Usbeck, Jens Lehmann, and Axel-Cyrille Ngonga Ngomo. 2017. "Survey on Challenges of Question Answering in the Semantic Web." Semantic Web, IOS Press, vol. 8, no. 6, pp. 895-920, August. Accessed 2020-02-08.
  18. InfoLab Group. 2019. "START: Natural Language Question Answering System." InfoLab Group, CSAIL, MIT. Accessed 2020-02-08.
  19. Iyyer, Mohit, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. "A Neural Network for Factoid Question Answering over Paragraphs." Proceedings of EMNLP, pp. 633-644, October. Accessed 2020-02-08.
  20. Jacob. 2018. "Question Answering Datasets." StreamHacker, January 9. Accessed 2020-02-21.
  21. Jurafsky, Daniel, and James H. Martin. 2009. "Question Answering and Summarization." Chapter 23 in: Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2020-02-20.
  22. Jurafsky, Daniel and James H. Martin. 2019. "Speech and Language Processing." Third Edition draft, October 16. Accessed 2020-02-21.
  23. Klein, Dan. 2009. "Lecture 25: Question Answering." Statistical NLP, UC Berkeley, Spring. Accessed 2020-02-08.
  24. Kwiatkowski, Tom and Michael Collins. 2019. "Natural Questions: a New Corpus and Challenge for Question Answering Research." Google AI Blog, January 23. Accessed 2020-02-08.
  25. Kwiatkowski, Tom, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. "Natural Questions: a Benchmark for Question Answering Research." Transactions of the Association of Computational Linguistics, vol. 7, pp. 453-466. Accessed 2020-02-09.
  26. Lewis, Patrick, Ludovic Denoyer, and Sebastian Riedel. 2019. "Unsupervised Question Answering by Cloze Translation." arXiv, v2, June 27. Accessed 2020-02-21.
  27. Li, Xin and Dan Roth. 2002. "Learning Question Classifiers." COLING 2002: The 19th International Conference on Computational Linguistics, August 24 - September 1. Accessed 2020-02-21.
  28. Markoff, John. 2011. "Computer Wins on ‘Jeopardy!’: Trivial, It’s Not." NY Times, February 16. Accessed 2020-02-21.
  29. Oh, Hyo-Jung, Ki-Youn Sung, Myung-Gil Jang, and Sung Hyon Myaeng. 2011. "Compositional question answering: A divide and conquer approach." Information Processing & Management, Elsevier, vol. 47, no. 6, pp. 808-824, November. Accessed 2020-02-21.
  30. Papers With Code. 2020. "Question Answering." Papers With Code. Accessed 2020-02-21.
  31. Pasupat, Ice. 2016. "WikiTableQuestions: a Complex Real-World Question Understanding Dataset." Research Blog, Stanford NLP Group, November 2. Accessed 2020-02-08.
  32. Patra, Barun. 2017. "A survey of Community Question Answering." arXiv, v1, May 11. Accessed 2020-02-08.
  33. Qi, Peng. 2019. "Answering Complex Open-domain Questions at Scale." The Stanford AI Lab Blog, October 21. Accessed 2020-02-08.
  34. Raphael, Bertram. 1963. "Operation of a Semantic Question-Answering System." AIM-059, MIT, November 1. Accessed 2020-02-21.
  35. Reddy, Siva, Danqi Chen, and Christopher D. Manning. 2019. "CoQA: A Conversational Question Answering Challenge." arXiv, v2, March 29. Accessed 2020-02-21.
  36. SQuAD. 2020. "SQuAD 2.0: The Stanford Question Answering Dataset." Accessed 2020-02-21.
  37. SQuAD. 2020b. "Steam_engine." SQuAD Explorer, SQuAD 2.0. Accessed 2020-02-21.
  38. Satapathy, Ranjan. 2018. "Question Answering in Natural Language Processing [Part-I]." Medium, August 11. Accessed 2020-02-08.
  39. Tan, Ming, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2016. "LSTM-based Deep Learning Models for Non-factoid Answer Selection." arXiv, v4, March 28. Accessed 2020-02-21.
  40. Voorhees, Ellen M., and Dawn M. Tice. 2000. "Building a Question Answering Test Collection." SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 200-207, July. doi:10.1145/345508.345577. Accessed 2020-02-21.
  41. Wasim, Muhammad, Waqar Mahmood, and Usman Ghani Khan. 2017. "A Survey of Datasets for Biomedical Question Answering Systems." International Journal of Advanced Computer Science and Applications, vol. 8, no. 7. Accessed 2020-02-08.
  42. Wikipedia. 2019. "Question answering." Wikipedia, December 27. Accessed 2020-02-21.
  43. Wikipedia. 2020. "Ask.com." Wikipedia, January 31. Accessed 2020-02-09.
  44. Xiong, Caiming, Victor Zhong, and Richard Socher. 2018. "Dynamic Coattention Networks For Question Answering." arXiv, v4, March 6. Accessed 2020-02-09.
  45. Xiong, Wenhan, Jiawei Wu, Hong Wang, Vivek Kulkarni, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2019. "TWEETQA: A Social Media Focused Question Answering Dataset." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5020-5031, July. Accessed 2020-02-08.
  46. Yang, Wei, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. "End-to-End Open-Domain Question Answering with BERTserini." arXiv, v2, September 18. Accessed 2020-02-21.
  47. Yu, Lei, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. "Deep Learning for Answer Sentence Selection." arXiv, v1, December 4. Accessed 2020-02-21.
  48. Zhang, Yuanzhe, Kang Liu, Shizhu He, Guoliang Ji, Zhanyi Liu, Hua Wu, and Jun Zhao. 2016. "Question Answering over Knowledge Base with Neural Attention Combining Global Knowledge Information." arXiv, v1, June 3. Accessed 2020-02-08.


Further Reading

  1. Jurafsky, Daniel and James H. Martin. 2019. "Chapter 25: Question Answering." In: Speech and Language Processing, Third Edition draft, October 16. Accessed 2020-02-21.
  2. Breja, Manvi, and Sanjay Kumar Jain. 2019. "A Survey on Why-Type Question Answering Systems." arXiv, v1, November 12. Accessed 2020-02-08.
  3. Soares, Marco Antonio Calijorne, and Fernando Silva Parreiras. 2018. "A literature review on question answering techniques, paradigms and systems." Journal of King Saud University, August. Accessed 2020-02-08.
  4. Ojokoh, Bolanle, and Emmanuel Adebisi. 2018. "A Review of Question Answering System." Journal of Web Engineering, vol. 17, no. 8, pp. 717-758, December. Accessed 2020-02-08.
  5. Bouziane, Abdelghani, Djelloul Bouchiha, Noureddine Doumi, and Mimoun Malki. 2015. "Question Answering Systems: Survey and Trends." Procedia Computer Science, vol. 73, pp. 366-375. Accessed 2020-02-08.
  6. Wang, Mengqui. 2006. "A Survey of Answer Extraction Techniques in Factoid Question Answering." Computational Linguistics, ACL, vol. 1, no. 1. Accessed 2020-02-08.


Cite As

Devopedia. 2020. "Question Answering." Version 12, February 24. Accessed 2020-05-25. https://devopedia.org/question-answering