# Question Similarity

In some applications, there's a need to compare two questions and determine how similar they are. In Information Retrieval (IR), it may be necessary to compare the incoming query against questions stored in the system database. This helps the IR system give a suitable response.

Question similarity involves a few basic aspects: pre-processing to reduce words and phrases to a form suited to the task, representing questions in efficient vector forms, defining features and selecting models to bring out the similarity, and defining a similarity measure to objectively compare the questions. The task benefits from lexical, syntactic and semantic features.

Question similarity is part of a more general NLP task called Semantic Textual Similarity (STS). STS involves comparing two sentences, two paragraphs or even two documents. Question similarity is also closely related to the task of question answering.

## Discussion

• Could you explain question similarity with some examples?

Consider the pair "How old are you?" and "What is your age?" These are semantically equivalent. Consider the pair "How are you?" and "How old are you?" These are semantically different even though they differ by only one word. Thus, question similarity is usually about the meaning or intent of the questions, not about how many common words they share.

Determining semantic equivalence is a challenging problem. Consider the pair "Which Star Wars episode should I watch?" and "I have so much of a life that I'm at home watching Star Wars." The shared topic "Star Wars" can mislead a model into thinking these two are similar.

Another example is the pair "How to declare an array in Python?" and "How to create a fix size list in python?" In Python terminology, arrays are often referred to as lists. A model that lacks this knowledge, when asked the first question, might instead match "How do I declare and initialize an array in Java?"
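The point about shared words can be checked with a small sketch. It scores the example pairs above using Jaccard similarity over word sets; the tokenizer regex and the function names `word_set` and `jaccard` are illustrative assumptions, not part of any particular system.

```python
import re

def word_set(question):
    """Lowercase the question and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", question.lower()))

def jaccard(q1, q2):
    """Jaccard similarity: shared words divided by all distinct words."""
    a, b = word_set(q1), word_set(q2)
    if not a and not b:
        return 1.0  # two empty questions are treated as identical
    return len(a & b) / len(a | b)

# Semantically equivalent pair with no words in common.
equivalent = jaccard("How old are you?", "What is your age?")
# Semantically different pair that shares three of four words.
different = jaccard("How are you?", "How old are you?")

print(f"equivalent pair: {equivalent:.2f}")
print(f"different pair:  {different:.2f}")
```

The equivalent pair scores lower than the different pair, which is why purely lexical overlap is a weak signal for this task.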

• What are some applications of question similarity?

On knowledge-sharing platforms such as Quora and StackExchange, people post questions. Question similarity detects similar questions, possibly already answered. Moderators may review this information to mark duplicates. These sites also list related questions as a useful feature for readers.

In chatbot applications or with voice assistants, a user may ask a question but that exact question may be unavailable in the system. Question similarity identifies questions that are closest in terms of semantics or user intent. The user is then presented with these interpretations or, if possible, given an answer directly.

In general, any Information Retrieval (IR) system can benefit from question similarity. For example, when a user asks Google "how to bake a cake", Google will provide search results but also include a section titled People also ask. This section lists a handful of related questions such as "How do you bake a cake in 10 steps?" and "What are the ingredients used in baking cake?"

Where exam papers use automatically generated questions or pick questions from a question bank, question similarity ensures that similar questions are not picked for the same paper.

• How should I prepare the data for question similarity?

Like other NLP tasks, question similarity involves data collection, exploratory data analysis, pre-processing, feature extraction, model training and evaluation. The idea of pre-processing data is to make it suitable for a given task and also remove unclean or wrong data. Other steps of the process might also influence the selection of pre-processing steps.

For example, if the Jaccard similarity measure is used, we may do case normalization, tokenization, stopword removal, punctuation removal, and lemmatization. Where there are many stopwords, Soft Cosine similarity will be higher if they're not removed. However, removing stopwords gives poorer results with BM25 (used in IR applications) and the Translation-Based Language Model.

On short-form Twitter data, Dey et al. suggested topic phrase removal, slang normalization (e.g. aaf vs as a friend), named entity boundary correction (e.g. Colorado Ravens and Ravens are probably the same concept), named entity tag cleaning, and synonym/hypernym replacement using WordNet (e.g. quick vs fast).
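Some of the pre-processing steps mentioned above can be sketched with the Python standard library alone. The stopword list here is a tiny made-up sample, and lemmatization is left out because it needs a linguistic resource; in practice both would likely come from a library such as NLTK or spaCy.

```python
import string

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "in", "do", "i", "of"}

def preprocess(question, remove_stopwords=True):
    """Case-normalize, strip punctuation, tokenize, optionally drop stopwords."""
    lowered = question.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()  # whitespace tokenization
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

tokens = preprocess("How to declare an array in Python?")
print(tokens)
```

Stopword removal is kept behind a flag because, as noted above, some measures and models do better when stopwords are retained.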

• How can I represent questions for the task of question similarity?

Before questions can be compared, we need to represent them in a form that algorithms can efficiently process and compare. This is done after data pre-processing. The simplest techniques are TF-IDF, bag-of-words model, and n-gram model.
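As a rough illustration of the simplest representations, the sketch below builds TF-IDF vectors over a bag-of-words vocabulary in plain Python. The smoothed IDF formula is one common variant among several, and the function name `tfidf_vectors` is hypothetical; libraries such as Scikit-Learn offer ready-made vectorizers.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed inverse document frequency (one common variant).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # raw term frequency within this document
        vectors.append([counts[tok] * idf[tok] for tok in vocab])
    return vocab, vectors

docs = [
    "how old are you".split(),
    "what is your age".split(),
    "how are you".split(),
]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 2) for x in vec])
```

In the first question, the rarer word "old" receives a larger weight than "how", which occurs in two of the three questions.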

Since the mid-2010s, word embeddings have been popular. Word embeddings are multi-dimensional vectors containing real values. Words that are similar tend to occur close to one another in the vector space. Such embeddings are learned by training on huge amounts of real-world text. GloVe and word2vec are well-known word embeddings. Contextualized word embeddings such as BERT and ELMo capture context as well.

Sentence embeddings can be learned from a weighted sum of word embeddings. By using Principal Component Analysis (PCA) and removing the projections of the average vectors on their first principal components, such embeddings have been shown to perform better on text similarity tasks.
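The idea can be sketched in plain Python under several assumptions: the word vectors and weights below are made-up toy values, the first principal direction is taken as the top singular direction of the uncentered sentence matrix, and it is approximated by power iteration rather than a full PCA. Real pipelines would typically use pretrained embeddings and a linear algebra library.

```python
import math

# Hypothetical 3-dimensional word vectors and per-word weights.
WORD_VECTORS = {
    "how": [0.9, 0.1, 0.0], "old": [0.2, 0.8, 0.1], "are": [0.8, 0.2, 0.1],
    "you": [0.7, 0.1, 0.3], "what": [0.9, 0.0, 0.1], "is": [0.8, 0.1, 0.0],
    "your": [0.7, 0.2, 0.3], "age": [0.1, 0.9, 0.2],
}
WORD_WEIGHTS = {"old": 1.0, "age": 1.0}  # rarer words weigh more
DEFAULT_WEIGHT = 0.2

def weighted_average(tokens):
    """Weighted average of the word vectors of a tokenized sentence."""
    weights = [WORD_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in tokens]
    vectors = [WORD_VECTORS[t] for t in tokens]
    dim = len(vectors[0])
    total = sum(weights)
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

def first_principal_direction(rows, iterations=100):
    """Approximate the top singular direction of `rows` by power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # Multiply v by X^T X: project each row on v, then recombine rows.
        projections = [sum(r[i] * v[i] for i in range(dim)) for r in rows]
        v = [sum(p * r[i] for p, r in zip(projections, rows)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def remove_component(rows, direction):
    """Subtract from each row its projection on the unit vector `direction`."""
    cleaned = []
    for r in rows:
        p = sum(a * b for a, b in zip(r, direction))
        cleaned.append([a - p * b for a, b in zip(r, direction)])
    return cleaned

sentences = [
    ["how", "old", "are", "you"],
    ["what", "is", "your", "age"],
    ["how", "are", "you"],
]
sentence_vectors = [weighted_average(s) for s in sentences]
direction = first_principal_direction(sentence_vectors)
cleaned = remove_component(sentence_vectors, direction)
```

After this step, every sentence vector is orthogonal to the removed direction, so what remains is what distinguishes the sentences rather than what they all share.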

Universal Sentence Encoder is another approach to sentence or question embeddings. Cer et al. have shown that transfer learning via sentence embeddings performs better than via word embeddings. Sentence-BERT used BERT to learn sentence embeddings.

• How should I select features for text similarity?

Features can be lexical, either character-level or word-level n-grams. Syntactic features include words along with their POS tags, named entities and verb similarity. Semantic features include words and phrases that semantically overlap, with help from WordNet.

Some basic features could include number of words or characters, uppercase words, last words, first words, adjectives, nouns, ratio of the common words to words, fuzz ratio (from Python's FuzzyWuzzy package), etc. Exploratory data analysis may show that only some of these are actually useful. Such features are commonly employed with traditional ML models such as SVM.
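A few of these hand-crafted features could be computed as in the sketch below. The feature names are hypothetical, the common-word ratio is defined here as shared words over all distinct words, and `difflib.SequenceMatcher` from the standard library is used as a stand-in for a fuzz ratio rather than the FuzzyWuzzy package itself.

```python
from difflib import SequenceMatcher

def basic_features(q1, q2):
    """Compute simple lexical features for a pair of non-empty questions."""
    w1, w2 = q1.lower().split(), q2.lower().split()
    s1, s2 = set(w1), set(w2)
    return {
        "len_words_q1": len(w1),
        "len_words_q2": len(w2),
        "len_chars_q1": len(q1),
        "len_chars_q2": len(q2),
        "same_first_word": w1[0] == w2[0],
        "same_last_word": w1[-1] == w2[-1],
        # Shared words as a fraction of all distinct words in the pair.
        "common_word_ratio": len(s1 & s2) / len(s1 | s2),
        # Character-level match ratio in [0, 1]; stand-in for a fuzz ratio.
        "char_match_ratio": SequenceMatcher(None, q1.lower(), q2.lower()).ratio(),
    }

features = basic_features("How are you?", "How old are you?")
for name, value in features.items():
    print(name, value)
```

A dictionary like this, computed for every question pair, could then serve as the input row for a traditional classifier.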

With neural networks, extensive feature engineering is not required. For example, when using Sentence-BERT, the sentence embeddings are used as features. In fact, these embeddings and neural networks are meant to capture context and semantics effectively. However, enabling neural networks to use syntactic information is an area of ongoing research.

Where embeddings such as GloVe are used, correcting spelling errors may give poorer results since the embeddings themselves were trained without correcting for spelling.

• When are two questions said to be semantically equivalent?

Various similarity and distance measures exist to compare two vectors, each representing one question: cosine distance, Euclidean distance, Word Mover's Distance, Minkowski distance, Jaccard distance, and many more. We note the calculations for cosine, Euclidean and weighted Manhattan:

$$d_{cos}(h_1, h_2) = \frac{h_1 \cdot h_2}{\lVert h_1 \rVert \, \lVert h_2 \rVert}$$

$$d_{euc}(h_1, h_2) = \lVert h_1 - h_2 \rVert$$

$$d_{w\text{-}man}(h_1, h_2) = w \cdot |h_1 - h_2|$$
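The three formulas above translate directly into code. This is a minimal sketch on plain Python lists; the function names and the example vectors are illustrative. Note that the cosine expression, as written, grows with similarity, whereas the other two grow with dissimilarity.

```python
import math

def norm(h):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in h))

def cosine(h1, h2):
    """Dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(h1, h2))
    return dot / (norm(h1) * norm(h2))

def euclidean(h1, h2):
    """L2 norm of the difference vector."""
    return norm([a - b for a, b in zip(h1, h2)])

def weighted_manhattan(h1, h2, w):
    """Dot product of the weights with the element-wise absolute difference."""
    return sum(wi * abs(a - b) for wi, a, b in zip(w, h1, h2))

h1 = [1.0, 0.0, 1.0]
h2 = [1.0, 1.0, 0.0]
w = [0.5, 1.0, 2.0]

print(cosine(h1, h2))
print(euclidean(h1, h2))
print(weighted_manhattan(h1, h2, w))
```

In the weighted Manhattan measure, the weight vector `w` decides how much each dimension contributes, which is what makes it a candidate for learning from data.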

More generally, in STS, many methodologies are available to calculate semantic similarity:

• Knowledge-based: Such as the use of WordNet and distance measures associated with WordNet graph nodes.
• Statistical or Corpus-based: Techniques include Latent Semantic Analysis (LSA), Pointwise Mutual Information-Information Retrieval (PMI-IR), Normalized Google Distance (NGD).
• Character-based: Longest Common Substring (LCS), Damerau-Levenshtein, Jaro, Needleman-Wunsch, Smith-Waterman, n-gram.
• Term-based: Manhattan, Euclidean, Jaccard, Cosine similarity, Soft Cosine Similarity, Sorensen-Dice index, Simple Matching Coefficient (SMC).
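Two of the listed measures are simple enough to sketch in a few lines: Longest Common Substring from the character-based group and the Sorensen-Dice index from the term-based group. The implementations below are illustrative, not optimized, and the function names are assumptions.

```python
def longest_common_substring(s1, s2):
    """Return the longest contiguous substring shared by s1 and s2."""
    best_len, best_end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return s1[best_end - best_len:best_end]

def dice(tokens1, tokens2):
    """Sorensen-Dice index: twice the overlap over the sum of set sizes."""
    a, b = set(tokens1), set(tokens2)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

lcs = longest_common_substring("how are you", "how old are you")
dice_score = dice("how are you".split(), "how old are you".split())
print(repr(lcs), len(lcs))
print(round(dice_score, 3))
```

Both measures score this pair as quite similar even though the questions differ in meaning, which again illustrates the limits of surface-level comparison.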

While different distance measures are available, it's hard to know which of these is ideal for measuring semantic similarity. Homma et al. addressed this with a 2-layer neural network that learned the correct distance function.

• Which datasets are useful for benchmarking or training Question Similarity models?

We note a few of these:

• CQADupStack: Questions and answers from 12 subforums of StackExchange, each subforum with at least 10,000 threads and 500 duplicate questions. Another StackExchange dataset covering 19 domains is also available. The StackExchange data dump is also useful. This dataset is part of SemEval-2017 Task 3.
• glue/qqp: General Language Understanding Evaluation (GLUE) is a benchmark for many NLP tasks. One of them involves Quora Question Pairs2 dataset of about 364K training, 40K validation and 391K testing data samples. GLUE's glue/stsb dataset might also be useful. The first Quora dataset is also available.
• MQS: Medical Question Similarity dataset related to COVID-19. Contains 3,048 labelled medical question pairs.
• Microsoft Research Paraphrase Corpus: 5,800 pairs of sentences from web news sources including annotations to indicate semantic similarity.

Since question answering (QA) is a closely related task, QA datasets can also be used. 719 QA pairs from the web apps domain of StackExchange is one such dataset. Google's Natural Questions has 307,373 training examples, 7,830 development examples, and 7,842 test examples.

Other datasets include SQuAD, SuperGLUE, TriviaQA, TREC-QA, WikiQA, YahooCQA, and SemEval CQA (2015, 2016, 2017).

• What are some classical machine learning techniques used in question similarity?

Question similarity can be seen as a classification problem. Many classical ML techniques have been applied to this problem including Support Vector Machine (SVM), Decision Tree, Logistic Regression, Extreme Gradient Boosting, Random Forests, and Adaptive Boosting. Dey et al. (2016) showed good results using SVMs with hand-picked features and pre-processed data.

Similarity is determined either statistically (TF-IDF) or semantically (meaning and context). Semantic similarity requires knowledge representation for which WordNet can provide shallow lexical semantics. Other NLP tasks such as word sense disambiguation and relation extraction can also be useful.

• What are some neural network approaches used in question similarity?

Till the mid-2010s, neural networks performed worse than SVMs due to limited training on noisy datasets. Subsequently, neural networks started to perform better. One approach is to use a "Siamese" neural network architecture based on CNN, RNN or GRU. Each question is encoded individually using the same network and then compared. For better interaction between the questions, the bilateral multi-perspective matching LSTM model is an option.
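The Siamese idea of encoding both questions with the same network can be shown in a deliberately simplified sketch. Instead of a CNN, RNN or GRU, the shared encoder here averages toy word vectors and applies a single tanh layer; the embeddings and weights are made-up values that a real model would learn during training.

```python
import math

# Hypothetical 2-dimensional embeddings and layer weights (not learned).
EMBEDDINGS = {
    "how": [0.9, 0.1], "old": [0.2, 0.8], "are": [0.8, 0.2],
    "you": [0.7, 0.3], "what": [0.9, 0.0], "is": [0.8, 0.1],
    "your": [0.7, 0.2], "age": [0.1, 0.9],
}
SHARED_WEIGHTS = [[0.5, -0.2], [0.1, 0.9]]

def encode(tokens):
    """Shared encoder: average known word vectors, then one tanh layer."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * len(SHARED_WEIGHTS)
    dim = len(known[0])
    mean = [sum(v[i] for v in known) / len(known) for i in range(dim)]
    return [math.tanh(sum(w * x for w, x in zip(row, mean)))
            for row in SHARED_WEIGHTS]

def siamese_similarity(tokens1, tokens2):
    """Encode both questions with the same encoder, then compare by cosine."""
    h1, h2 = encode(tokens1), encode(tokens2)  # identical weights for both
    dot = sum(a * b for a, b in zip(h1, h2))
    n1 = math.sqrt(sum(a * a for a in h1))
    n2 = math.sqrt(sum(b * b for b in h2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)

q1 = "how old are you".split()
q2 = "what is your age".split()
score = siamese_similarity(q1, q2)
print(round(score, 3))
```

Because the two branches share weights, swapping the questions does not change the score, and a question compared with itself always scores 1.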

Where sufficient labelled data is not available, domain adaptation is possible. Shah et al. did this for some StackExchange forums using labelled data from other forums. Joty et al. used Adversarial Neural Networks across languages English and Arabic. Hazem et al. built a multi-domain framework using a StackExchange dataset covering 19 domains.

Transformer models such as BERT have been used for question similarity. In general, model training benefits from not just question-question pairs but also question-answer pairs and other NLP tasks.

Papers With Code curates the latest developments in Question Similarity.

• What tools or libraries are available for question similarity?

In Python, NumPy, Pandas and Scikit-Learn are commonly used libraries for machine learning. NLTK, TextBlob, spaCy, Gensim, Pattern, and Stanford CoreNLP (in Java with Python wrappers) are more specific to NLP. A GitHub repo has a curated list of Python NLP libraries.

For neural network models, PyTorch or TensorFlow libraries are commonly used. Hugging Face is an active community and library that bundles PyTorch, TensorFlow, and models based on them. Embeddings, BERT and its many variants are available within Hugging Face.

For semantic similarity, Java libraries such as Semantic Measures Library, SEMILAR and SEMILAT, and DISCO Builder/API could be useful. There's a Perl module for WordNet-based similarity. UMLS is another Perl module specific to the medical domain. RxNLP has an API for text similarity.

## Milestones

1993

To compare the similarity of two signatures, Bromley et al. propose using two identical time delay neural networks that share one output that captures the similarity of the two inputs. They call this the Siamese Neural Network. This technique of combining two neural networks is later used specifically for the task of question similarity.

Jul 2015

Bogdanova et al. compare Support Vector Machine (SVM) with their proposed approach of Convolutional Neural Network (CNN). They train their CNN with positive and negative pairs of semantically equivalent questions. The dataset is the Ask Ubuntu forum on StackExchange. Their definition of a question consists of both title and body. CNN captures local features around each word, which are then combined to produce a fixed-size vector representation for each question. Their results show that CNN outperforms SVM. Moreover, results are better when embeddings are learned from in-domain data.

Sep 2015

Sanborn and Skryzalin try out both Recurrent Neural Network (RNN) and Recursive Neural Network within a Siamese architecture. They use SemEval-2015 Task 2 as the dataset. For baselines, they use cosine similarity between bag-of-words vectors, cosine similarity between GloVe-based sentence vectors, and Jaccard similarity between sets of words. They find that RNN with 100-dimensional word vectors and 20% dropout gives the best performance, although not the state of the art.

2016

Dey et al. apply SVM to short-text Twitter data based on a SemEval 2015 dataset. They obtain an F1-score of 0.741 for semantic similarity.

Feb 2017

Bonadiman et al. recognize that question-question similarity relates to question-comment similarity and new question-comment similarity. They therefore train a deep neural network jointly on these three tasks of SemEval 2016 Task 3, a technique that's called Multi-Task Learning (MTL). They use CNN with max pooling as the sentence model. This is followed by a Multi-Layer Perceptron (MLP) with sigmoid output layer. For feature embedding, they use word overlap.

Jul 2017

Conneau et al. show that NLP tasks can benefit from pre-trained sentence-level embeddings rather than just word embeddings as in GloVe or word2vec. They consider seven different encoder architectures and find the best performance with a BiLSTM with max pooling, with embeddings trained in a supervised manner on Natural Language Inference datasets. They obtain better results than earlier sentence embeddings obtained via unsupervised methods such as SkipThought (2015) and FastSent (2016).

2018

Cer et al. propose the Universal Sentence Encoder (USE) via transformer-based encoding with multi-task learning. They also propose a simpler Deep Averaging Network (DAN). The former gives better performance at the cost of higher compute. They also find that transfer learning via sentence embeddings performs better than via word embeddings. For sentence embeddings, USE is publicly available via TensorFlow Hub. It can be fine tuned for specific tasks.

Aug 2019

Reimers and Gurevych propose Sentence-BERT (SBERT) as a modification of pretrained BERT to train sentence embeddings. They use a Siamese network architecture and fine tune SBERT using Natural Language Inference (NLI) data. On seven Semantic Textual Similarity (STS) tasks and also SentEval, they achieve better performance than Universal Sentence Encoder.

Aug 2020

In the medical domain, specifically COVID-19 FAQs, McCreery et al. show that double fine-tuning during pretraining yields good performance. They try four different pretraining tasks: question-question pairs, question-answer pairs, answer-answer pairs and question-category pairs. They observe best results when the network is pretrained with medical question-answer pairs and then fine-tuned on medical question-question pairs. On the question similarity task, the model gives 84.5% accuracy. Even with a much smaller training set, it achieves 80% accuracy.

## Cite As

Devopedia. 2021. "Question Similarity." Version 10, May 3. Accessed 2021-05-03. https://devopedia.org/question-similarity
Contributed by
2 authors

Last updated on
2021-05-03 06:27:16