Revisit Semantic Representation and Tree Search for Similar Question Retrieval
This paper studies the performance of BERT combined with a tree-based structure on a short sentence ranking task. In a retrieval-based question answering system, we retrieve the most similar question to the query by ranking all the questions in the dataset. Ranking all the sentences with a neural ranker requires scoring every sentence pair, which consumes a large amount of time. We therefore combine tree-based search with sentence embeddings computed in advance to solve this problem. We fine-tune BERT on the training data to obtain semantic vectors, i.e. sentence embeddings, for the test data. We use all the sentence embeddings of the test data to build a tree based on k-means and perform beam search at prediction time when given a sentence as query. We run experiments on a semantic textual similarity dataset, Quora Question Pairs, processed for sentence ranking. Experimental results show that our method outperforms a strong baseline, and our tree accelerates prediction by a factor of 5 to 10 without losing much ranking accuracy.
Keywords: Deep Learning, Information Retrieval, Question Answering
1 Introduction
In a retrieval-based question answering system, we retrieve the answer or a similar question from a large collection of question-answer pairs. In this paper we focus on similar question retrieval. At prediction time, when given a new question, we find the most similar question in the collection by ranking and then return the corresponding answer. We treat this as a short sentence ranking problem, which is also a kind of information retrieval task.
Neural information retrieval has developed several ways to address this problem. The task is usually solved in two steps: a fast algorithm such as TF-IDF or BM25 retrieves roughly 10-100 (or more) candidate similar questions, and then a neural ranker re-ranks these candidates by computing question-question pair similarity scores. The weakness of this two-step framework is that if the first, fast retrieval step fails to return the right similar questions, the second re-ranking step is useless. One way to avoid this weakness is to score all question-question pairs with the neural ranker, but that consumes a large amount of time. See Fig. 1 for the pipeline illustration.
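For concreteness, a minimal sketch of the conventional two-step pipeline described above. The names `bm25_retrieve` and `neural_score` are hypothetical placeholders for a fast lexical retriever and a neural pair scorer, not functions from the paper:

```python
def two_step_rank(query, all_questions, bm25_retrieve, neural_score, k=100):
    # Step 1: cheap lexical retrieval of k candidate questions.
    candidates = bm25_retrieve(query, all_questions, k=k)
    # Step 2: expensive neural re-ranking of only those k candidates.
    # If the right question is missing from `candidates`, re-ranking cannot recover it.
    scored = [(q, neural_score(query, q)) for q in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```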
In this paper, to find the truly most similar question among all questions, our strategy is to compute the semantic vectors for all sentences with the neural ranker offline, and then encode the new question with the neural ranker online. To accelerate the vector distance computation without losing ranking accuracy, we build a tree by k-means for the distance computation, borrowing the idea of tree-based indexes from deep models for recommender systems. Previous research shows that the original BERT cannot output good sentence embeddings, so we design a cosine-based loss and a fine-tuning architecture for BERT to obtain better sentence embeddings. The code is available at https://github.com/guotong1988/Semantic-Tree-Search.
In summary, our paper makes two contributions. First, we succeed in fine-tuning BERT to obtain better sentence embeddings, since the original embeddings from BERT are poor. Second, to accelerate prediction, we build a specific tree to search over all the embeddings, since otherwise we would need to compute every vector pair distance for each query.
2 Related Work
In recent years, neural information retrieval and neural question answering research has developed several effective ways to improve ranking accuracy. Interaction-based neural rankers match the query and document pair with an attention-based deep model; representation-based neural rankers output sentence representations and use cosine distance to score the sentence pairs. Effective representation-based models include DSSM, CLSM and LSTM-RNN, and effective interaction-based models include DRMM, Match-SRNN and BERT. Our deep model belongs to the representation-based models, which output a final semantic representation vector for each sentence.
Sentence embeddings are an important topic in this research area. Skip-Thought feeds in one sentence to predict its previous and next sentences. InferSent outperforms Skip-Thought. Another strong baseline constructs sentence vectors from unsupervised word vectors. The Universal Sentence Encoder presents two models for producing sentence embeddings that demonstrate good transfer to a number of other NLP tasks.
BERT is a very deep transformer-based model. It is first pre-trained on a very large corpus using the masked language model loss and the next-sentence prediction loss, and can then be fine-tuned on a variety of specific tasks such as text classification, text matching and natural language inference. As BERT is a very large model, its inference time is too long to rank all the sentences directly.
We follow the BERT convention of data input format for encoding the natural language question. For the single sentence classification task, the question is encoded as:

[CLS] tok_1 tok_2 ... tok_n [SEP]

For the sentence pair classification task, question 1 and question 2 are encoded as:

[CLS] tok_1 ... tok_n [SEP] tok'_1 ... tok'_m [SEP]

where [CLS] is a special symbol added in front of every input example, [SEP] is a special separator token, and n (and m) is the number of tokens. Our fine-tuning follows the single sentence classification format.
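For illustration only, a hedged sketch of the two input formats using the Hugging Face `transformers` tokenizer; the paper itself uses the original google-research/bert code, so this is not the authors' implementation:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Single-sentence input: [CLS] tok_1 ... tok_n [SEP]
single = tokenizer("How do I get funding for my startup idea?")

# Sentence-pair input: [CLS] tok_1 ... tok_n [SEP] tok'_1 ... tok'_m [SEP]
pair = tokenizer("How do I get funding for my startup idea?",
                 "Can I get funded based on my startup idea?")

print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
```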
3 Problem Statement
In this section, we define the short sentence ranking task. At training time, we have a set of question pairs, labeled 1 for similar and 0 for not similar. Our goal is to learn a model that precisely predicts whether a question pair is similar. However, we cannot follow BERT's sentence pair classification setup directly, because we want to output a sentence embedding for each sentence separately. At prediction time, we have a set of questions, each of which has a labeled most similar question in the same set. Our goal is to use each question in the set as a query and find its top N similar questions from the set. Although the most similar question is the one that matters most in a question answering system, the top N results can also be applied to scenarios such as similar question recommendation.
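As a concrete illustration of the task, a minimal sketch of ranking by cosine similarity over precomputed embeddings and scoring P@1. The `embeddings` matrix (one row per question) and the `gold` mapping from query index to its labeled most similar question are assumptions for the example, not artifacts released with the paper:

```python
import numpy as np

def rank_all(query_idx, embeddings):
    """Rank every question by cosine similarity to the query embedding."""
    q = embeddings[query_idx]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q) + 1e-9
    sims = embeddings @ q / norms
    sims[query_idx] = -np.inf              # exclude the query itself
    return np.argsort(-sims)               # most similar first

def precision_at_1(query_indices, gold, embeddings):
    """Fraction of queries whose top-ranked question is the labeled one."""
    hits = sum(rank_all(q, embeddings)[0] == gold[q] for q in query_indices)
    return hits / len(query_indices)
```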
4 Our Methods
In this section we describe our deep model and the tree building method.
4.1 Fine-tune Training
In this subsection we describe our fine-tuning method for BERT; a sketch is shown in Fig. 2. We feed the two questions to the same BERT without concatenating them and obtain two vector representations. In detail, we use three ways to get the representation from BERT, as sketched after the list:
1. The output of the [CLS] token. We take the two output vectors of the [CLS] token of BERT for the two input questions.
2. The max pooling strategy. We apply max pooling over the last layer of BERT and use the result as the representation.
3. The mean pooling strategy. We apply mean pooling over the last layer of BERT and use the result as the representation.
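A minimal sketch of the three pooling strategies, written with the Hugging Face `transformers` library as an assumption (the paper fine-tunes the original BERT-base checkpoint with its own code):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode(question, strategy="mean"):
    """Return one sentence embedding per question using the chosen strategy."""
    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=64)
    hidden = bert(**inputs).last_hidden_state            # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
    if strategy == "cls":
        return hidden[:, 0]                               # output of the [CLS] token
    if strategy == "max":
        return hidden.masked_fill(mask == 0, -1e9).max(dim=1).values
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling
```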
The two output vectors from BERT are then compared by cosine similarity, which is fed into a mean squared error loss:

$$\mathcal{L} = \big(\cos(u, v) - y\big)^2$$

where $u$ and $v$ are the two vectors and $y$ is the label. The full algorithm is shown in Algorithm 1.
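A minimal sketch of this objective in PyTorch, assuming `u` and `v` are the two pooled question vectors from the previous sketch and `label` holds the 0/1 similarity labels:

```python
import torch
import torch.nn.functional as F

def cosine_mse_loss(u, v, label):
    """Mean squared error between the cosine similarity of the two
    question embeddings and the 0/1 similarity label."""
    cos = F.cosine_similarity(u, v, dim=-1)
    return F.mse_loss(cos, label.float())
```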
4.2 Tree Building
After all the embeddings of the test data are computed, we build the tree by k-means. The outline of tree building is shown in Fig. 3. We cluster the embeddings recursively and use the k-means centers as the representations of the non-leaf nodes; the sentence embeddings are all in the leaf nodes. We also tried sampling keywords or sampling sentences as the representation of the non-leaf nodes, but did not observe good performance. The non-leaf node representation is important for tree search, as it paves the way to the right leaf nodes. We consider the clustering centers a good solution for the non-leaf node representation, as it is hard to derive an exact parent representation from the child nodes.
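A minimal sketch of the recursive k-means tree construction, using scikit-learn's KMeans as an assumption (the paper does not specify the clustering implementation); the leaf capacity `max_leaf` and the file name in the usage comment are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

class Node:
    def __init__(self):
        self.center = None    # k-means cluster center (non-leaf representation)
        self.children = []    # child nodes
        self.items = None     # sentence indices, stored only at leaf nodes

def build_tree(embeddings, indices, k=8, max_leaf=20):
    """Recursively cluster the embeddings; leaves hold sentence indices."""
    node = Node()
    if len(indices) <= max_leaf:
        node.items = indices
        return node
    km = KMeans(n_clusters=k, n_init=10).fit(embeddings[indices])
    for c in range(k):
        child_indices = indices[km.labels_ == c]
        if len(child_indices) == 0:
            continue
        child = build_tree(embeddings, child_indices, k, max_leaf)
        child.center = km.cluster_centers_[c]
        node.children.append(child)
    return node

# Example usage over all offline test-set embeddings:
# embeddings = np.load("question_embeddings.npy")   # e.g. shape (36735, 768)
# root = build_tree(embeddings, np.arange(len(embeddings)), k=8)
```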
At test time, we use beam search to get the nearest top N vectors for a given query vector. We then evaluate the top N sentences with Mean Average Precision (MAP), Precision@1 (P@1), Normalized Discounted Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR) and MRR@10. The detailed search strategy is shown in Fig. 4.
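A minimal sketch of the beam search over the tree built above, reusing the Node structure from the previous sketch and cosine similarity as the distance; the beam width is an assumed parameter, as the paper does not state it:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def beam_search(root, query_vec, embeddings, beam_width=20, top_n=20):
    """Descend the k-means tree level by level, keeping only the `beam_width`
    non-leaf nodes closest to the query, then rank the sentences collected
    from the reached leaf nodes."""
    frontier = [root]
    leaves = []
    while frontier:
        next_frontier = []
        for node in frontier:
            for child in node.children:
                if child.items is not None:
                    leaves.append(child)
                else:
                    next_frontier.append(child)
        next_frontier.sort(key=lambda n: cosine(n.center, query_vec), reverse=True)
        frontier = next_frontier[:beam_width]
    candidates = [i for leaf in leaves for i in leaf.items]
    candidates.sort(key=lambda i: cosine(embeddings[i], query_vec), reverse=True)
    return candidates[:top_n]
```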
5 Experiments
In this section, we describe the experimental parameters in detail and present the experimental results.
5.1 Fine-tune Training
We use the pre-trained BERT-base model file from https://github.com/google-research/bert. The inputs to the mean squared error loss are the cosine similarity score and the gold label. The max sequence length is 64 and the batch size is 32. The hidden dimension of BERT is 768. We use the Adam optimizer with learning rate 2e-5 and a linear learning rate warm-up over 10% of the training data. The training set contains 384,348 pairs of questions.
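A hedged sketch of the corresponding optimizer setup with `transformers`; the paper uses the original BERT code, so this only mirrors the stated hyperparameters, and the number of epochs and the stand-in model are assumptions:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Hyperparameters stated in the paper.
batch_size = 32
learning_rate = 2e-5
num_pairs = 384348
num_epochs = 1                                    # assumption: not stated in the paper
total_steps = (num_pairs // batch_size) * num_epochs
warmup_steps = int(0.1 * total_steps)             # warm-up over 10% of the training data

model = torch.nn.Linear(768, 768)                 # stand-in for the BERT encoder
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
```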
5.2 Tree Building
We choose 5, 8 and 10 as the number of clusters for k-means. The tree depth is 5 levels for 36,735 vectors. At prediction time, the 5-K tree is the slowest but most accurate, the 10-K tree is the fastest but least accurate, and the 8-K tree lies in between.
5.3 Data Description
We evaluate the performance on the Quora Question Pairs dataset. We combine its dev and test data to obtain 20,000 question pairs, 10,000 with label 1 and 10,000 with label 0. After removing duplicate questions, we obtain a set of 36,735 questions and compute embeddings for all of them offline. We then use the 10,000 questions with label 1 as 10,000 queries. For each query, looping over all questions requires 36,735 cosine distance computations. We take the top 20 questions for the ranking evaluation.
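A hedged sketch of this evaluation-set construction with pandas, assuming the standard Quora Question Pairs columns `question1`, `question2` and `is_duplicate`; the file names are placeholders, not the paper's actual splits:

```python
import pandas as pd

# Placeholders for the dev/test splits of Quora Question Pairs.
dev = pd.read_csv("quora_dev.tsv", sep="\t")
test = pd.read_csv("quora_test.tsv", sep="\t")
pairs = pd.concat([dev, test], ignore_index=True)      # 20,000 pairs in the paper

# Deduplicated pool of questions to embed offline (36,735 in the paper).
questions = pd.unique(pd.concat([pairs["question1"], pairs["question2"]]))

# Queries: one side of every positive (label 1) pair; its counterpart
# is treated as the labeled most similar question.
positives = pairs[pairs["is_duplicate"] == 1]
queries = positives["question1"].tolist()
gold = positives["question2"].tolist()
```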
5.4 Results
The BM25 baseline is implemented with Lucene, and the sentence embedding baseline is taken from https://github.com/peter3125/sentence2vec. The detailed comparison results are shown in Table 1 and Table 2. The compute-all result means we score all vector pairs sequentially from beginning to end. The vector distance computation for compute-all uses cosine distance and Euclidean distance, while the k-d tree uses Euclidean distance. The speed comparison is shown in Table 3.
Table 1. Ranking results of the three BERT representation strategies (compute-all):

| Method | MAP | P@1 | NDCG | MRR | MRR@10 |
|---|---|---|---|---|---|
| our BERT [CLS] output | 0.132 | 0.084 | 0.133 | 0.168 | 0.131 |
| our BERT mean pooling strategy | 0.138 | 0.088 | 0.140 | 0.175 | 0.138 |
| our BERT max pooling strategy | 0.135 | 0.086 | 0.136 | 0.172 | 0.135 |

Table 2. Ranking results of the tree search with different cluster numbers:

| Method | MAP | P@1 | NDCG | MRR | MRR@10 |
|---|---|---|---|---|---|
| our 10-K tree | 0.132 | 0.084 | 0.135 | 0.167 | 0.131 |
| our 8-K tree | 0.134 | 0.085 | 0.136 | 0.169 | 0.133 |
| our 5-K tree | 0.138 | 0.088 | 0.140 | 0.175 | 0.138 |
Table 3. Speed comparison (lower is faster):

| Method | Search cost |
|---|---|
| our 5-K tree | 6000-7000 |
| our 8-K tree | 3000-4000 |
| our 10-K tree | 2000-3000 |
| k-d tree | about 24000 |
5.5 Case Study and Error Analysis
We show some examples from the evaluation results to demonstrate the behavior of our method. Table 4 shows the top 3 retrieval results for the query question "How do I get funding for my web based startup idea?" for BM25 and our tree. The comparison shows that our method captures more semantic information than BM25. Table 5 shows the top 3 retrieval results for the query question "Who is the best bodybuilder of all time?" for compute-all and our tree. The results suggest that the loss in ranking accuracy may be caused by errors in the non-leaf representations, as the results returned by our tree are far from the query question; we believe a non-leaf node led the search to the wrong children.
6 Conclusion
In this paper, we study the problem of short sentence ranking for question answering. In order to score all questions when given a question as query, we compute the representations of all questions in advance and build a tree by k-means to accelerate prediction. The experimental results show that our method beats the strong sentence embedding baseline and is comparable to the BM25 baseline on the large information retrieval dataset we construct. The quality of the sentence embeddings can be improved by a better BERT or by XLNet, and in future work we will explore more powerful non-leaf node embeddings for the tree search and evaluate on other datasets.
Table 4. Top 3 retrieval results for the query "How do I get funding for my web based startup idea?":

| Method | Retrieved question | Label |
|---|---|---|
| our tree rank 1 | How do I get funding from investors for my business idea? | 0 |
| our tree rank 2 | How can I get funds to turn my idea into a reality? | 0 |
| our tree rank 3 | How can I get funds for my business idea? | 1 |
| BM25 rank 1 | Where can my web-based startup find funding or investors? | 0 |
| BM25 rank 2 | Can I get funded based on my startup idea? | 0 |
| BM25 rank 3 | How do I get funding for my startup idea before we have a prototype? | 0 |
Table 5. Top 3 retrieval results for the query "Who is the best bodybuilder of all time?":

| Method | Retrieved question | Label |
|---|---|---|
| our tree rank 1 | How much money do professional strongmen make? | 0 |
| our tree rank 2 | Why did Indians want Independence from Britain? | 0 |
| our tree rank 3 | Do you think the Indian marriage traditions needs a change? | 0 |
| compute-all rank 1 | Who is the best bodybuilder? | 1 |
| compute-all rank 2 | Who is the most skillful fighter in Game of Thrones? | 0 |
| compute-all rank 3 | Which is the best website maker for an online shop? | 0 |
References
-  Qiao Y, Xiong C, Liu Z, et al. Understanding the Behaviors of BERT in Ranking[J]. arXiv preprint arXiv:1904.07531, 2019.
-  Guo J, Fan Y, Pang L, et al. A deep look into neural ranking models for information retrieval[J]. arXiv preprint arXiv:1903.06902, 2019.
-  Xu P, Ma X, Nallapati R, et al. Passage Ranking with Weak Supervision[J]. arXiv preprint arXiv:1905.05910, 2019.
-  Zhu H, Li X, Zhang P, et al. Learning Tree-based Deep Model for Recommender Systems[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018: 1079-1088.
-  P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, L. Heck, Learning deep structured semantic models for web search using clickthrough data, in: Proceedings of the 22Nd ACM International Conference on Information & Knowledge Management, CIKM ’13, ACM, New York, NY, USA, 2013, pp. 2333–2338.
-  Y. Shen, X. He, J. Gao, L. Deng, G. Mesnil, A latent semantic model with convolutional-pooling structure for information retrieval, in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM ’14, ACM, New York, NY, USA, 2014, pp. 101–110.
-  H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, R. Ward, Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval, IEEE/ACM Trans. Audio, Speech and Lang. Proc. 24 (4) (2016) 694–707.
-  J. Guo, Y. Fan, Q. Ai, W. B. Croft, A deep relevance matching model for ad-hoc retrieval, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM ’16, ACM, New York, NY, USA, 2016, pp. 55–64.
-  S. Wan, Y. Lan, J. Xu, J. Guo, L. Pang, X. Cheng, Match-srnn: Modeling the recursive matching structure with spatial rnn, in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, AAAI Press, 2016, pp. 2922–2928.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
-  Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in neural information processing systems. 2017: 5998-6008.
-  Zhu H, Chang D, Xu Z, et al. Joint Optimization of Tree-based Index and Deep Model for Recommender Systems[J]. arXiv preprint arXiv:1902.07565, 2019.
-  Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-Thought Vectors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3294–3302. Curran Associates, Inc.
-  Arora S, Liang Y, Ma T. A simple but tough-to-beat baseline for sentence embeddings[J]. 2016.
-  Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder. arXiv preprint arXiv:1803.11175.
-  Yang Z, Dai Z, Yang Y, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding[J]. arXiv preprint arXiv:1906.08237, 2019.
-  Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
-  Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
-  Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
-  Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada.