• Comparing different QA datasets. Source: Choi et al. 2018, table 1.
• An example specification list for a question. Source: Green et al. 1961.
• Coarse classes (bold) and fine classes from TREC-10 dataset. Source: Li and Roth 2002, table 1.
• Architecture of IBM's DeepQA. Source: Ferrucci et al. 2010, fig. 6.
• Use of CNN to obtain a sentence representation. Source: Yu et al. 2014, fig. 1.
• Embeddings, BiLSTMs and attention used in DrQA. Source: Jurafsky and Martin 2019, fig. 25.7.
• Fine-tuning of BERT for question answering. Source: Devlin et al. 2019, fig. 4c.
• Unsupervised question generation to train a QA model. Source: Lewis et al. 2019, fig. 1.
• In Finnish, the words 'day' and 'week' are represented differently in the question and the answer. Source: Clark 2020.
• Conceptually, a question type has three facets. Source: Oh et al. 2011, fig. 1.
• Architecture of a knowledge-based QA system with attention computed between question and candidate answers. Source: Zhang et al. 2016, fig. 1.
• Question answering in dialogue context. Source: Choi et al. 2018, fig. 1.
• Pipeline of IR-based factoid QA systems. Source: Jurafsky and Martin 2019, fig. 25.2.
• Distribution of trigram prefixes in questions of SQuAD and CoQA datasets. Source: Heidenreich 2018, fig. 3.

Created by arvindpdmn on 2020-02-08 15:25:17
Last updated by arvindpdmn on 2020-02-24 07:38:57

## Summary

Search engines, and information retrieval systems in general, help us obtain documents relevant to a search query. In reality, people want answers. Question Answering (QA) is about giving a direct answer in the form of a grammatically correct sentence.

QA is a subfield of Computer Science. It's predominantly based on Information Retrieval and Natural Language Processing. Both questions and answers are in natural language.

QA is also related to an NLP subfield called text summarization. When answers are long and descriptive, they're often summarized from multiple sources. In this case, QA is also called focused summarization or query-based summarization.

Many datasets are available to train and evaluate QA models. By the late 2010s, neural network models had achieved state-of-the-art results.

## Milestones

1961

MIT researchers implement a program named Baseball. It reads a question from a punched card. It references a dictionary of words and idioms to generate a "specification list", which is a canonical expression of what the question is asking. Content analysis involves syntactic phrase structures.

1963

Bertram Raphael at MIT publishes a memo titled Operation of a Semantic Question-Answering System. He describes a QA system that accepts a restricted form of English. Factual information comes from a relational model. The program is written in LISP. Raphael credits LISP's list-processing capability with making the implementation much easier.

Dec
1993

Developed at MIT, START goes online. This is probably the world's first web-based QA system. It can answer questions on places, people, movies, dictionary definitions, etc.

Jun
1997

With the growth of the web, AskJeeves is launched as an online QA system. However, it basically does pattern matching against a knowledge base of questions and returns curated answers. If there's no match, it falls back to a web search. In February 2006, the system is rebranded as Ask.

Nov
1999

At the 8th Text REtrieval Conference (TREC-8), a Question Answering track is introduced. This is to foster research in QA. TREC-8 focuses on only open-domain closed-class questions (fact-based short answers). At future TREC events, the QA track continues to produce datasets for training and evaluation.

2002

It's helpful to identify the type of question being asked. Li and Roth propose a machine learning approach to question classification. Such a classification imposes constraints on potential answers. Due to ambiguity, their model allows multiple classes for a single question. For example, "What do bats eat?" could belong to three classes: food, plant, animal. The features used for learning include words, POS tags, chunks, head chunks, named entities, semantically related words, n-grams, and relations.

2010

After about three years of effort, IBM Watson competes at human expert levels in terms of precision, confidence and speed on the Jeopardy! quiz show. Its DeepQA architecture integrates many content sources and NLP techniques. Answer candidates come with confidence measures. They're then scored using supporting evidence. Watson wins Jeopardy! in February 2011.

Dec
2014

Yu et al. look at the specific task of answer selection. Using distributed representations, they look for answers that are semantically similar to the question. This is a departure from a classification approach that uses hand-crafted syntactic and semantic features. They use a bigram model with a convolutional layer and an average pooling layer. These capture syntactic structures and long-range dependencies without relying on external parse trees.

Jul
2017

Chen et al. use Wikipedia as the knowledge source for open-domain QA. Answers are predicted as text spans. Earlier research typically considered a short piece of already identified text. Since their approach searches over multiple large documents, they call it "machine reading at scale". Called DrQA, this system integrates document retrieval and document reading. Bigram features and bag-of-words weighted with TF-IDF are used for retrieval. The reader uses a BiLSTM each for the question and the passages, with attention between the two.

Oct
2018

Researchers at Google release BERT, a pre-trained language model trained on 3.3 billion words of unlabelled text. As a sample task, they fine-tune BERT for question answering on the SQuAD v1.1 and v2.0 datasets. The question and the text containing the answer are concatenated to form the input sequence. Start and end tokens of the answer are predicted using softmax. For questions without answers, start/end tokens point to the [CLS] token.

Jan
2019

Google releases the Natural Questions (NQ) dataset. It has 300K question-answer pairs plus 16K questions with answers from five different annotators. Each answer comes from a Wikipedia page and the model is required to read the entire page. The questions themselves are based on real, anonymized, aggregated queries from Google Search. Answers can be yes/no, long, long and short, or no answer.

2019

On the SQuAD 2.0 dataset, many implementations start surpassing human performance. Many of these are based on the transformer architecture, including BERT, RoBERTa, XLNet, and ALBERT. Note that SQuAD 2.0 combines 100K questions from SQuAD 1.1 with 50K unanswerable questions. When there's no answer, models are required to abstain from answering.

Jun
2019

Since datasets are available only for some domains and languages, Lewis et al. propose a method to synthesize questions to train QA models. Passages are randomly selected from documents. Random noun phrases or named entities are picked as answers. "Fill-in-the-blanks" questions are generated. Using neural machine translation (NMT), these are converted into natural questions.
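
A minimal sketch of the fill-in-the-blank step under this approach (the NMT conversion into a natural question is not shown; the passage and noun-phrase list are illustrative, not from the paper):

```python
# Hypothetical sketch of cloze-style question generation: a randomly picked
# noun phrase becomes the answer and is blanked out of the passage.
import random

passage = "The Nile flows through Egypt into the Mediterranean Sea."
noun_phrases = ["The Nile", "Egypt", "the Mediterranean Sea"]

answer = random.choice(noun_phrases)      # random noun phrase as the answer
cloze = passage.replace(answer, "_____")  # fill-in-the-blank question
print(cloze)
```

A second model would then rewrite the cloze, e.g. "The Nile flows through _____ ...", as a natural question such as "Which country does the Nile flow through?".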

Feb
2020

Google Research releases TyDi QA, a typologically diverse multilingual dataset. It has 200K question-answer pairs from 11 languages. To avoid shared words within a pair, annotators were asked to frame questions for which they didn't know the answer. Google Search then identified a suitable Wikipedia article, and the annotator marked the answer. Researchers expect models trained on it to generalize well to many languages.

## Discussion

• Which are the broad categories of questions answered by QA systems?

Factoid questions are the simplest. An example of this is "What is the population of the Bahamas?" Answers are short and factual, often identified by named entities. Variations of factoid questions include single answer, list of answers (such as "Which are the official languages of Singapore?"), or yes/no. Questions typically ask what, where, when, which, who, or is.

QA research started with factoid questions. Later, research progressed to questions that sought descriptive answers. "Why is the sky blue?" requires an explanation. "What is global warming?" requires a definition. Questions typically ask why, how or what.

Closed-domain questions are about a specific domain such as medicine, environment, baseball, algebra, etc. Open-domain questions are regardless of the domain. Open-domain QA systems use large collections of documents or knowledge bases covering diverse domains.

When the system is given a single document to answer a question, we call it reading comprehension. If information has to be searched in multiple documents across domains, the term open-context open-domain QA has been used.

• What are the main approaches or techniques used in question answering?

QA systems rely on external sources from where answers can be determined. Broad approaches are the following:

• Information Retrieval-based: Extends traditional IR pipeline. Reading comprehension is applied on each retrieved document to select a suitable named entity, sentence or paragraph. This has also been called open domain QA. The web (or CommonCrawl), PubMed and Wikipedia are possible sources.
• Knowledge-based: Facts are stored in knowledge bases. Questions are converted (by semantic parsers) into semantic representations, which are then used to query the knowledge bases. Knowledge could be stored in relational databases or as RDF triples. This has also been called semantic parsing-based QA. DBpedia and Freebase are possible knowledge sources.
• Hybrid: IBM's DeepQA is an example that combines both IR and knowledge approaches.
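
The IR-based approach can be sketched in miniature as follows, assuming a toy two-document corpus, whitespace tokenization, and a simple TF-IDF-style overlap score (all texts and names here are illustrative):

```python
# Hypothetical sketch of IR-based QA retrieval: rank documents by
# TF-IDF-weighted overlap with the question and return the best match.
import math
from collections import Counter

docs = [
    "Chichen Itza is a Maya city in Yucatan, Mexico.",
    "The Bahamas has a population of about 400,000 people.",
]

def tokenize(text):
    return [w.strip(".,?").lower() for w in text.split()]

def tf_idf_score(query, doc, corpus):
    q, d = tokenize(query), Counter(tokenize(doc))
    n = len(corpus)
    score = 0.0
    for term in q:
        df = sum(1 for c in corpus if term in tokenize(c))
        if df:
            # rare shared terms contribute more than common ones
            score += d[term] * math.log(n / df + 1)
    return score

def retrieve(question, corpus):
    return max(corpus, key=lambda doc: tf_idf_score(question, doc, corpus))

print(retrieve("Where is Chichen Itza?", docs))
```

A reading-comprehension model would then extract the answer span from the retrieved document; the knowledge-based approach would instead query structured facts directly.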
• What are some variations of question answering systems?

We note the following variations or specializations of QA systems:

• Visual QA (VQA): Input is an image (or video) rather than text. VQA is at the intersection of computer vision and NLP.
• Conversational QA: In dialogue systems, there's a continuity of context. The current question may be incomplete or ambiguous but it can be resolved by looking at past interactions. CoQA and QuAC are two datasets for this purpose.
• Compositional QA: Complex questions are decomposed into smaller parts, each answered individually, and then the final answer is composed. This technique is used in VQA as well.
• Domain-Specific QA: Biomedical QA is a specialized field where both domain patterns and knowledge can be exploited. AQuA is a dataset specific to algebra.
• Context-Specific QA: Social media texts are informal. Models that do well on newswire QA have been shown to do poorly on tweets. Community forums (Quora, StackOverflow) provide multi-sentence questions with often long answers that are upvoted or downvoted.
• What are the key challenges faced by question answering systems?

QA systems face two challenges: question complexity (depth) and domain size (breadth). Systems are good at either of these but not both. An example of depth is "What's the cheapest bus to Chichen Itza leaving tomorrow?" A much simpler question is "Where is Chichen Itza?"

Common sense reasoning is challenging. For example, 'longest river' requires reverse sorting by length; 'by a margin of' involves some sort of comparison; 'at least' implies a lower cut-off. Temporal or spatial questions require reasoning about time or space relations.
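
The phrase-to-operation mappings above can be sketched over a toy fact table (the data and function names are illustrative, not from any real system):

```python
# Hypothetical sketch: 'longest' maps to a reverse sort by length,
# 'at least' maps to a lower cut-off filter over the same facts.
rivers = {"Nile": 6650, "Amazon": 6400, "Yangtze": 6300}  # lengths in km

def longest(facts):
    # 'longest X' implies sorting by the numeric attribute, descending
    return max(facts, key=facts.get)

def at_least(facts, cutoff):
    # 'at least N' implies filtering with a lower bound
    return [name for name, length in facts.items() if length >= cutoff]

print(longest(rivers))
print(at_least(rivers, 6400))
```

The hard part, of course, is learning these mappings from natural language rather than hand-coding them.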

Lexical gap means that a concept can be expressed using different words. For example, we're looking for a 'city' but the question asks about a 'venue'. Approaches to solving this include string normalization, query expansion, and entailment.
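
Query expansion for the 'city' vs. 'venue' example can be sketched with a hand-written synonym map (illustrative; real systems draw on resources such as WordNet):

```python
# Hypothetical sketch of query expansion to bridge the lexical gap:
# question terms are expanded with related words so that documents
# using different vocabulary can still match.
synonyms = {"venue": ["city", "location", "place"]}

def expand(query_terms):
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(synonyms.get(term, []))  # add known related terms
    return expanded

print(expand(["which", "venue"]))
```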

Ambiguity occurs when a word or phrase can have multiple meanings, only one of which is intended in a given context. The correct meaning can be obtained via corpus-based methods (distributional hypothesis) or resource-based methods.

Sometimes the answer is distributed across different sources. QA systems need to align different knowledge ontologies. An alternative is to decompose the question into simpler queries and combine the answers later.

• What are the steps in a typical question answering pipeline?

In IR-based factoid QA, the question itself, or keywords extracted from it, forms the query to the IR system. Sometimes stopwords are removed, or the query is rephrased or expanded. From the retrieved documents, relevant sentences or passages are extracted. Named entities, n-gram overlap, question keywords, and keyword proximity are some techniques used at this stage. Finally, a suitable answer is picked. We can train classifiers to extract an answer, with features such as answer type, matching pattern, number of matching keywords, keyword distance, and punctuation location. Neural network models are also common for answer selection.
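
The passage-extraction step can be sketched with keyword overlap alone (document and question are illustrative; real systems add named entities, proximity, and learned features):

```python
# Hypothetical sketch of sentence extraction: rank sentences of a retrieved
# document by how many question keywords each one shares.
def tokens(text):
    return {w.strip(".,?").lower() for w in text.split()}

def best_sentence(question, document):
    q = tokens(question)
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # score each sentence by keyword overlap with the question
    return max(sentences, key=lambda s: len(q & tokens(s)))

doc = ("Chichen Itza is a large pre-Columbian city. "
       "Chichen Itza was built by the Maya people. "
       "The site is located in Yucatan, Mexico")
print(best_sentence("Who built Chichen Itza?", doc))
```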

For knowledge-based QA, the first step is to invoke a semantic parser to obtain a logical form for querying. Such a parser could be rule-based to extract common relations, or it could be learned via supervised machine learning. More commonly, semi-supervised or unsupervised methods are used based on web content. Such methods help us discover new knowledge relations in unstructured text. Relevant techniques include distant supervision, open information extraction and entity linking.
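
A rule-based semantic parser over a toy triple store can be sketched as follows (the pattern, relation names, and facts are all illustrative):

```python
# Hypothetical sketch of knowledge-based QA: a hand-written rule maps a
# question pattern to a (subject, relation) query over a tiny triple store.
import re

triples = {
    ("Bahamas", "population"): "400,000",
    ("Singapore", "official_language"): "English, Malay, Mandarin, Tamil",
}

def parse(question):
    # one rule: "What is the population of X?" -> (X, population)
    m = re.match(r"What is the population of (?:the )?(\w+)\?", question)
    if m:
        return (m.group(1), "population")
    return None

def answer(question):
    query = parse(question)
    return triples.get(query, "unknown")

print(answer("What is the population of the Bahamas?"))
```

Learned semantic parsers replace the hand-written rule with models trained via supervision or distant supervision.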

• How are neural networks being used in question answering?

Widespread use of neural networks in NLP started with distributed representations for words. A feedforward model learned these representations as it was trained on a language modelling task. In such representations, semantically similar words are close to one another. The next development was compositional distributional semantics, where sentence-level representations are composed from word representations. These were more useful for question answering.

Iyyer et al. reduced dependency parse trees to vector representations that were used to train an RNN. Yu et al. used a CNN for answer selection. A common approach to answer selection is to look at the similarity between question and answer in the semantic space. Later models added an attention layer between the question and its candidate answers. Tan et al. evaluated BiLSTMs with attention and CNN. Dynamic Coattention Network (DCN) is also based on attention. Facebook researchers combined a seq2seq model with multitasking.
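
Similarity-based answer selection can be sketched with cosine distance in a semantic space (the 3-d vectors below are made up, standing in for learned sentence representations):

```python
# Hypothetical sketch of answer selection: the candidate whose vector is
# most similar (by cosine) to the question vector is chosen.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

question_vec = [0.9, 0.1, 0.0]
candidates = {
    "candidate_a": [0.8, 0.2, 0.1],  # semantically close to the question
    "candidate_b": [0.0, 0.1, 0.9],  # unrelated
}

best = max(candidates, key=lambda k: cosine(question_vec, candidates[k]))
print(best)
```

Attention layers refine this idea by letting the question representation weight the parts of each candidate that matter.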

Transformer architecture has been applied for QA. In fact, QA was one of the tasks to which BERT was fine-tuned (on SQuAD) and evaluated. BERTserini used fine-tuned BERT along with information retrieval from Wikipedia.
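
The span-prediction head used in BERT-style extractive QA can be sketched as follows; the tokens and scores are made-up numbers standing in for the model's start/end logits:

```python
# Hypothetical sketch of span prediction: combine per-token start and end
# scores and return the highest-scoring span within a length limit.
tokens = ["[CLS]", "why", "is", "the", "sky", "blue", "[SEP]",
          "rayleigh", "scattering", "makes", "the", "sky", "blue"]
start_scores = [0.1, 0, 0, 0, 0, 0, 0, 2.5, 0.3, 0, 0, 0, 0]
end_scores   = [0.1, 0, 0, 0, 0, 0, 0, 0.2, 2.9, 0, 0, 0, 0]

def best_span(start, end, max_len=8):
    best, best_score = (0, 0), float("-inf")
    for i in range(len(start)):
        for j in range(i, min(i + max_len, len(end))):
            score = start[i] + end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best  # span (0, 0), i.e. [CLS], would signal "no answer"

i, j = best_span(start_scores, end_scores)
print(tokens[i:j + 1])
```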

• What are some useful datasets for training or evaluating question answering models?

Datasets are used for training and evaluating QA systems. Based on the design and makeup, each dataset might evaluate different aspects of the system better.

Among the well-known datasets are the Stanford Question Answering Dataset (SQuAD), Natural Questions (NQ), Question Answering in Context (QuAC) and HotpotQA. All four are based on Wikipedia content. Conversational Question Answering (CoQA) is based on Wikipedia plus other sources. Wikipedia often presents data in tables: WikiTableQuestions is a dataset in which answers are in tables rather than freeform text. TyDi QA is a multilingual dataset. TweetQA takes its data from Twitter.

Question Answering over Linked Data (QALD) is a series of datasets created from knowledge bases such as DBpedia, MusicBrainz, Drugbank and LinkedSpending.

Other datasets to note are ELI5, ShARC, MS MARCO, NewsQA, CMU Wikipedia Factoid QA, CNN/DailyMail QA, Microsoft WikiQA, Quora Question Pairs, CuratedTREC, WebQuestions, WikiMovies, GeoQuery and ATIS.

Papers With Code lists dozens of datasets along with their respective state-of-the-art models.

