Revisit Semantic Representation and Tree Search for Similar Question Retrieval

Tong Guo¹  Huilin Gao²
¹Rokid AI Lab  ²China Electronic Technology Group Corporation Information Science Academy, Beijing, China

This paper studies the performance of BERT combined with a tree-based structure on the short sentence ranking task. In a retrieval-based question answering system, we retrieve the most similar question to the query question by ranking all the questions in the dataset. Ranking all the sentences with a neural ranker requires scoring every sentence pair, which consumes a large amount of time. We therefore combine tree-based search with sentence embeddings computed in advance. We fine-tune BERT on the training data to obtain sentence embeddings for the test data, build a tree over all test-set embeddings with k-means, and run beam search over the tree at prediction time given a query sentence. We experiment on the semantic textual similarity dataset Quora Question Pairs, processed for sentence ranking. Experimental results show that our methods outperform the strong baselines, and our tree accelerates prediction speed by 500%-1000% without losing much ranking accuracy.

Keywords: Deep Learning · Information Retrieval · Question Answering

1 Introduction

In a retrieval-based question answering system, we retrieve the answer or a similar question from a large set of question-answer pairs [2]. In this paper we discuss similar question retrieval. At prediction time, given a new question, we find the most similar stored question by ranking and return the corresponding answer. We consider this a short sentence ranking problem, which is also a kind of information retrieval task.

Neural information retrieval has developed several ways to solve this problem. The task is typically solved in two steps: a fast algorithm such as TF-IDF or BM25 retrieves roughly 10-100 or more candidate similar questions, and a neural ranker then re-ranks these candidates by computing question-question pair similarity scores. The weakness of this two-step framework is that if the first, fast retrieval step fails to recall the right similar questions, the second re-ranking step is useless. One way to address this weakness is to score all question-question pairs with the neural ranker, but that consumes a large amount of time. See Fig. 1 for the pipeline illustration.

In this paper, to find the truly most similar question among all questions, our strategy is to compute the semantic vectors of all sentences with the neural ranker offline, and to encode the new question with the neural ranker online. To accelerate vector distance computation without losing ranking accuracy, we build a tree with k-means, borrowing the idea from [4] and [12]. Previous research [1][3] shows that the original BERT [10] cannot output good sentence embeddings, so we design a cosine-based loss and a fine-tuning architecture for BERT to obtain better sentence embeddings. The code is available.

In summary, our paper makes two contributions. First, we fine-tune BERT to produce better sentence embeddings, since the embeddings from the original BERT are poor. Second, to accelerate prediction, we build a specific tree and search over all the embeddings, instead of computing every vector pair distance for a query.

Figure 1: The pipeline for retrieval-based question answering. The left is the classical pipeline and the right is our approach.

2 Related Work

In recent years, neural information retrieval and neural question answering research has developed several effective ways to improve ranking accuracy. Interaction-based neural rankers match a query-document pair with an attention-based deep model; representation-based neural rankers output sentence representations and use cosine distance to score sentence pairs. Effective representation-based models include DSSM [5], CLSM [6] and LSTM-RNN [7]; effective interaction-based models include DRMM [8], Match-SRNN [9] and BERT [10]. Our deep model belongs to the representation-based family, which outputs a final semantic representation vector for each sentence.

Sentence embeddings are an important topic in this research area. Skip-Thought [13] takes one sentence as input to predict its previous and next sentences. InferSent [18] outperforms Skip-Thought. [14] constructs sentence vectors from unsupervised word vectors [19] and is a strong baseline. Universal Sentence Encoder [15] presents two models for producing sentence embeddings that transfer well to a number of other NLP tasks.

BERT is a very deep transformer-based [11] model. It is first pre-trained on a very large corpus with the masked language model loss and the next-sentence loss, and can then be fine-tuned on a variety of specific tasks such as text classification, text matching and natural language inference. Because BERT is a very large model, its inference time is too long to rank all the sentences.

We follow the BERT convention of input format for encoding the natural language questions. For the single sentence classification task, a question is encoded as:

[CLS] q1 q2 ... qn [SEP]

For the sentence pair classification task, question 1 and question 2 are encoded as:

[CLS] q1 ... qn [SEP] q'1 ... q'm [SEP]

where [CLS] is a special symbol added in front of every input example, [SEP] is a special separator token, qi is the i-th token, and n (resp. m) is the token number. Our fine-tuning follows the single sentence classification format.
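As a concrete illustration, the two input formats above can be sketched in plain Python (token lists stand in for BERT's WordPiece tokenization, which is omitted here; the function names are ours):

```python
def encode_single(question_tokens):
    # Single-sentence task: [CLS] q1 ... qn [SEP]
    return ["[CLS]"] + question_tokens + ["[SEP]"]

def encode_pair(tokens_1, tokens_2):
    # Sentence-pair task: [CLS] q1 ... qn [SEP] q'1 ... q'm [SEP]
    return ["[CLS]"] + tokens_1 + ["[SEP]"] + tokens_2 + ["[SEP]"]
```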

3 Problem Statement

In this section, we illustrate the short sentence ranking task. At training time, we have a set of question pairs labeled 1 for similar and 0 for not similar. Our goal is to learn a model that precisely predicts whether a question pair is similar. However, we cannot follow the sentence pair classification setup of BERT, because we want to output a separate sentence embedding for each sentence. At prediction time, we have a set of questions, each with a labeled most similar question in the same set. Our goal is to use each question in the set as a query and find its top N similar questions from the set. Although the single most similar question is the one that matters most in a question answering system, the top N results can also be applied to scenarios such as similar question recommendation.

4 Approach

In this section we describe our deep model and the tree building methods.

Figure 2: The fine-tune training architecture

4.1 Fine-tune Training

In this subsection we describe our fine-tuning method for BERT. A sketch is shown in Fig. 2. We feed the two questions to the same BERT separately, without concatenating them, and obtain two vector representations. In detail, we use three ways to extract a representation from BERT:

1. The output of the [CLS] token. We take the output vector at the [CLS] position for each of the two input questions.

2. The max pooling strategy. We apply max pooling over the last layer of BERT and use the result as the representation.

3. The mean pooling strategy. We apply mean pooling over the last layer of BERT and use the result as the representation.
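A minimal sketch of the three strategies, assuming `last_hidden` is the (seq_len, hidden) matrix of BERT's last layer and `mask` marks the real (non-padding) tokens; both names are our illustrative choices, not from the paper:

```python
import numpy as np

def cls_pooling(last_hidden):
    # strategy 1: the output vector at the [CLS] position (index 0)
    return last_hidden[0]

def max_pooling(last_hidden, mask):
    # strategy 2: element-wise max over the real tokens
    return last_hidden[mask.astype(bool)].max(axis=0)

def mean_pooling(last_hidden, mask):
    # strategy 3: average over the real tokens
    return last_hidden[mask.astype(bool)].mean(axis=0)
```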

Then the cosine similarity of the two output vectors from BERT is fed into a mean square error loss:

loss = (cos(u, v) − y)²

where u and v are the two sentence vectors and y is the 0/1 label. The full procedure is shown in Algorithm 1.
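The loss can be sketched with plain NumPy as follows (a real training implementation would backpropagate this loss through BERT, which the sketch omits):

```python
import numpy as np

def cosine_mse_loss(u, v, y):
    u, v = np.asarray(u, float), np.asarray(v, float)
    # cosine similarity between the two sentence embeddings
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # squared error against the 0/1 similarity label
    return (cos - y) ** 2
```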

  init BERT model BERT-A
  for epoch = 1 to epoch_num do
     for question_pair in train_question_pairs do
        input question_pair and fine-tune BERT-A into BERT-B
     end for
  end for
  all_embeddings = set()
  for question in test_questions do
     compute the embedding of question with BERT-B and add it to all_embeddings
  end for
  use all_embeddings to init the tree Tree-A
  for question in test_questions do
     result = Tree-A.beam_search(question_embedding, 20)
  end for
Algorithm 1 Algorithm Pipeline

4.2 Tree Building

Figure 3: The k-means clustering for building the tree with K=3

After all the embeddings of the test data are computed, we build the tree with k-means. The outline of tree building is shown in Fig. 3. We cluster the embeddings recursively and use the k-means centers as the non-leaf nodes; the sentence embeddings are all in the leaf nodes. We also tried sampling keywords or sample sentences as the representation of non-leaf nodes, but did not observe good performance. The non-leaf node representation is important for tree search, as it paves the way to the right leaf nodes. We consider the clustering centers a good solution for non-leaf node representation, since it is hard to derive an exact parent representation from the child nodes.
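A minimal sketch of the recursive construction, with a tiny k-means written out for self-containment. The dictionary node layout, `leaf_size`, and the iteration count are our illustrative choices, not the paper's:

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign every vector to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def build_tree(X, ids, k=3, leaf_size=4):
    # leaf node: store the sentence ids (the embeddings live here)
    if len(ids) <= max(leaf_size, k):
        return {"leaf": True, "ids": list(ids)}
    centers, labels = kmeans(X, k)
    if len(np.unique(labels)) == 1:  # clustering failed to split the set
        return {"leaf": True, "ids": list(ids)}
    children = []
    for j in range(k):
        m = labels == j
        if m.any():
            # non-leaf child keyed by its k-means center
            children.append({"center": centers[j],
                             "node": build_tree(X[m], ids[m], k, leaf_size)})
    return {"leaf": False, "children": children}
```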

Figure 4: The beam search strategy (beam size = 2): dark green nodes are the final choices and light green nodes are the candidates

4.3 Test

At test time, we use beam search to get the top N nearest vectors to the given query vector. Then we evaluate the top N sentences with Mean Average Precision (MAP), Precision@1 (P@1), Normalized Discounted Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR) and MRR@10. The search strategy is shown in Fig. 4.
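The beam search can be sketched as follows, assuming the hypothetical node layout from the tree-building sketch above (leaf nodes carry "ids"; non-leaf nodes carry "children", each child pairing a k-means "center" with a subtree):

```python
import numpy as np

def beam_search(tree, query, beam=2):
    frontier, results = [tree], []
    while frontier:
        candidates = []
        for node in frontier:
            if node["leaf"]:
                results.extend(node["ids"])  # reached stored sentences
            else:
                candidates.extend(node["children"])
        # keep only the `beam` children whose centers are nearest the query
        candidates.sort(key=lambda c: np.linalg.norm(c["center"] - query))
        frontier = [c["node"] for c in candidates[:beam]]
    return results
```

The returned candidate ids would then be scored exactly against the query embedding to produce the final top N ranking.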

5 Experiments

In this section, we describe the experimental details and the results.

5.1 Fine-tune Training

We use the publicly released pre-trained BERT-base model. The inputs for the mean square error loss are the cosine similarity score and the gold label. The max sequence length is 64 and the batch size is 32. The hidden dimension of BERT is 768. We use the Adam optimizer with learning rate 2e−5 and a linear learning rate warm-up over the first 10% of the training data. The training set contains 384,348 question pairs.

5.2 Tree Building

We choose 5, 8 and 10 as the cluster number K for k-means. The tree depth is 5 levels for the 36,735 vectors. At prediction time, the K=5 tree is the slowest but most accurate, the K=10 tree is the fastest but least accurate, and the K=8 tree lies between them.
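The trade-off between the cluster number K and the tree depth follows from simple arithmetic: a K-way tree over N vectors needs roughly log_K(N) levels, so a larger K gives a shallower tree with fewer levels to walk but coarser clusters. A sketch of this relation (`leaf_size`, the number of vectors a leaf may hold, is our illustrative parameter):

```python
import math

def needed_depth(n_vectors, k, leaf_size=1):
    # levels required so K-way branching covers n_vectors / leaf_size leaves
    return math.ceil(math.log(n_vectors / leaf_size, k))
```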

5.3 Data Description

We evaluate the performance on the Quora Question Pairs dataset. We combine its dev data and test data into a dataset of 20,000 question pairs, containing 10,000 pairs with label 1 and 10,000 pairs with label 0. After removing duplicate questions, we obtain a set of 36,735 questions and compute all their embeddings offline. We then use the 10,000 questions with label 1 as queries. Each query requires 36,735 cosine distance computations if we loop over all 36,735 questions. We take the top 20 retrieved questions for the ranking evaluation.

5.4 Result

The BM25 baseline is implemented with Lucene, and the word-vector baseline follows [14]. The detailed comparisons are shown in Table 1 and Table 2. The compute-all result means we score all the vector pairs sequentially from beginning to end. Compute-all is evaluated with both cosine distance and Euclidean distance; the k-d tree uses Euclidean distance. The speed comparison is shown in Table 3.

Methods MAP P@1 MRR NDCG MRR@10
wordvec[14] 0.072 0.042 0.073 0.097 0.070
BM25 0.138 0.086 0.138 0.170 0.137
our BERT [CLS] output 0.132 0.084 0.133 0.168 0.131
our BERT mean pooling strategy 0.138 0.088 0.140 0.175 0.138
our BERT max pooling strategy 0.135 0.086 0.136 0.172 0.135
Table 1: Our 5-K tree result compared to the baselines
Methods MAP P@1 MRR NDCG MRR@10
our 10-K tree 0.132 0.084 0.135 0.167 0.131
our 8-K tree 0.134 0.085 0.136 0.169 0.133
our 5-K tree 0.138 0.088 0.140 0.175 0.138
k-d tree 0.153 0.097 0.155 0.192 0.152
compute-all (cosine) 0.152 0.097 0.155 0.192 0.152
compute-all (euclidean) 0.153 0.097 0.155 0.192 0.152
Table 2: Analysis of ranking accuracy loss: our tree, k-d tree and compute-all results
Methods distance computations
our 5-K tree 6000-7000
our 8-K tree 3000-4000
our 10-K tree 2000-3000
k-d tree about 24000
compute-all 36735
Table 3: Number of vector distance computations per query over the 36,735 candidates at prediction time

5.5 Case Study and Error Analysis

We show some examples from the evaluation results to demonstrate the ability of our methods. Table 4 shows the top-3 retrieval results of BM25 and our tree for the query "How do I get funding for my web based startup idea?". The comparison shows that our method captures more semantic information than BM25. Table 5 shows the top-3 retrieval results of compute-all and our tree for the query "Who is the best bodybuilder of all time?". The results suggest that the ranking accuracy loss may be caused by errors in the non-leaf representations, as our tree's results are far from the query question; we believe a non-leaf node led the search to the wrong children.

6 Conclusion

In this paper, we study the problem of short sentence ranking for question answering. In order to score all the questions given a query, we compute the representations of all questions in advance and build a tree with k-means to accelerate prediction. The experimental results show that our methods beat the strong baseline of [14] and are comparable to the BM25 baseline on the large information retrieval dataset we construct. The sentence embedding quality could be improved by a better BERT [17] or by XLNet [16]; in future work we will explore more powerful non-leaf node embeddings for tree search and evaluate on other datasets [20].

Methods result Label
our tree rank 1 How do I get funding from investors for my business idea ? 0
our tree rank 2 How can I get funds to turn my idea into a reality ? 0
our tree rank 3 How can I get funds for my business idea ? 1
BM25 rank 1 Where can my web-based startup find funding or investors ? 0
BM25 rank 2 Can I get funded based on my startup idea ? 0
BM25 rank 3 How do I get funding for my startup idea before we have a prototype ? 0
Table 4: Case study for query: How do I get funding for my web based startup idea ?
Methods result Label
our tree rank 1 How much money do professional strongmen make ? 0
our tree rank 2 Why did Indians want Independence from Britain ? 0
our tree rank 3 Do you think the Indian marriage traditions needs a change ? 0
compute-all rank 1 Who is the best bodybuilder ? 1
compute-all rank 2 Who is the most skillful fighter in Game of Thrones ? 0
compute-all rank 3 Which is the best website maker for an online shop ? 0
Table 5: Case study for query: Who is the best bodybuilder of all time ?

