Learning to Focus when Ranking Answers
One of the main challenges in ranking is embedding the query and document pairs into a joint feature space, which can then be fed to a learning-to-rank algorithm. To achieve this representation, conventional state-of-the-art approaches perform extensive feature engineering that encodes the similarity of the query-answer pair. Recently, deep-learning solutions have shown that it is possible to achieve comparable performance, in some settings, by learning the similarity representation directly from data. Unfortunately, previous models perform poorly on longer texts, on texts with a significant portion of irrelevant information, or on texts that are grammatically incorrect. To overcome these limitations, we propose a novel ranking algorithm for question answering, QARAT, which uses an attention mechanism to learn on which words and phrases to focus when building the mutual representation. We demonstrate superior ranking performance on several real-world question-answer ranking datasets, and provide visualizations of the attention mechanism to offer more insight into how our models of attention could benefit ranking for difficult question answering challenges.
1. Introduction
One of the main challenges in ranking is representing a query and document pair in a joint feature space, which can then be fed to a ranking algorithm. Over the last decade, supervised learning-to-rank (LTR) approaches have been shown to perform best for many difficult ranking tasks, including question answering. However, state-of-the-art conventional LTR approaches, such as (Surdeanu et al., 2011), require extensive feature engineering, which includes different similarity metrics, such as manually curated lexical, syntactic and semantic similarities between the query and document.
Recently, deep learning models have obtained significant success in ranking for question answering. However, so far, these models were shown to be successful only in modeling relatively short query and answer pairs, roughly a sentence in length (Severyn and Moschitti, 2013). One of the reasons these models under-perform on longer texts is that longer answers often contain irrelevant information, which is also incorporated into the similarity representation. In this work, we explore the use of the attention mechanism for answer ranking to overcome this limitation.
Attention mechanisms in deep learning were originally inspired by the human visual attention mechanism, which helps us perceive large amounts of information at once by focusing on parts of the information. For example, when looking at an image, the mechanism allows us to focus on some part of it in high detail while paying less attention to the rest. Similarly, for ranking, attention mechanisms do not try to encode the entire query and document pair into a fixed-length vector, but rather learn the interdependence between the two while focusing only on the important words. While in theory deep-learning architectures such as LSTM are designed to deal with long-range dependencies, in practice they show poor results on representing and matching long texts (Liu et al., 2015).
In this work, we present QARAT (Question-Answering Ranking with Attention), a novel deep-learning ranking algorithm for question answering, which employs attention mechanisms to identify the main question and answer terms to use for the joint representation. We evaluate QARAT on two popular retrieval tasks, TREC answer selection and answer ranking for LiveQA, and show superior results on both. We complement the main results with an analysis of performance with respect to answer length, and empirically show that QARAT indeed provides significantly superior results on longer answers, and even on short answers that contain irrelevant information or are grammatically incorrect.
As an added benefit, modeling attention explicitly enables us to visualize what QARAT learns. We report visualizations of the focus of the algorithm when embedding a query and document pair. Figure 1 shows one such visualization, where the most important words for modeling the interdependence between the question and the answer terms are marked in green. In the example on the left, the phrases “in 1981” and “company” are learned to be similar to the phrase “chairman” in the query, and therefore receive higher attention for ranking. Similarly, in the example on the right, the phrases “born in” and “Florence” are represented most prominently in the embedding due to their similarity to the query.
In summary, the contributions of this work are threefold. First, we propose QARAT, a novel method that improves answer ranking for long question-answer pairs via attention mechanisms. Second, we present visualizations of our method that make it more interpretable and offer insight into how it works. Finally, we present strong empirical performance of our method on several challenging question answering datasets for answer ranking.
2. Related Work
Question-answer selection and ranking has been an active area of research for decades, and many solutions have been proposed. For example, Wang et al. (2007) modeled question-answer relations over parse trees, such that questions and their relevant answers are connected via syntactic transformations. Others, e.g., (Heilman and Smith, 2010; Severyn and Moschitti, 2013), focused on parse tree editing for a question-answer pair, searching for minimal tree edit operations using heuristics, probabilities, and also automatic creation of features. Deep learning solutions have also been proposed for this task. Yu et al. (2014) created a deep learning model and learned to match questions and answers by using their semantic structure. Other works learned the question and answer representations and matched them using a similarity metric. For example, Severyn and Moschitti (2015) created a convolutional neural network (CNN) that receives as input vectors of question and answer pairs and returns a score for each pair. They manually added word indicator features marking whether a word appeared in both the question and the answer. Tan et al. (2015) built an LSTM-based model, using a biLSTM network with three gates (input, forget and output) separately for the question and the answer. They then used cosine similarity to compute the score. A variant of this model with CNN filters was presented as well.
Attention mechanisms have been applied in several domains, including image tasks (Denil et al., 2011; Larochelle and Hinton, 2010) and other natural language processing tasks, such as machine translation (Bahdanau et al., 2014a) and textual entailment (Yin et al., 2015). However, to the best of our knowledge, our work is the first to use attention mechanisms for the task of ranking answers to questions. Our model is based on a feed-forward network with an attention layer that aims to focus on the relevant parts of the question and the answer, and to overcome the weaknesses of other models in dealing with long or confusing answers.
3. Question Answering Ranking with Attention
In this section, we provide the problem definition of ranking answers for a question and present QARAT, our deep learning model with the attention mechanism.
3.1. Problem Definition
Given a question $q$ and a set of possible answers $a_1, \dots, a_n$ for this question, we aim to rank the candidate answers by their relevance to $q$. We adopt a common pointwise method for ranking. The method requires training a binary classifier on training instances composed of tuples of the form $(\phi(q, a_i), y_i)$, where $y_i$ is a label indicating whether the answer $a_i$ is relevant for $q$, and $\phi$ is a function mapping the query-answer pair to a feature vector. In this work, we mainly focus on finding the best $\phi$ to represent the query-answer pairs.
Most previous approaches focused on manually defining $\phi$. In this work, we present a deep learning method that learns $\phi$ from data. This method has been explored before for related tasks, and showed promising results on short texts (Severyn and Moschitti, 2015). Building on these efforts, we explore a new architecture and attention mechanism to learn $\phi$, which we show performs more robustly than the previous models.
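As a rough illustration of the pointwise setup described above, the following sketch scores each candidate answer independently and sorts by score. The names `rank_answers`, `toy_phi` and `toy_classifier` are hypothetical and do not come from the paper; the toy feature map and classifier are simple stand-ins for the learned mapping function and the trained binary classifier.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a feature map phi(q, a) and a binary classifier that
# returns an estimate of P(relevant | features). Neither is the paper's model.
FeatureMap = Callable[[str, str], Sequence[float]]
Classifier = Callable[[Sequence[float]], float]


def rank_answers(question: str, answers: List[str],
                 phi: FeatureMap, classifier: Classifier) -> List[Tuple[str, float]]:
    """Score every candidate independently (pointwise) and sort by score."""
    scored = [(a, classifier(phi(question, a))) for a in answers]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins, for illustration only.
def toy_phi(q: str, a: str) -> List[float]:
    """Two hand-made features: word-overlap ratio and inverse answer length."""
    q_words, a_words = set(q.lower().split()), set(a.lower().split())
    overlap = len(q_words & a_words)
    return [overlap / max(len(q_words), 1), 1.0 / max(len(a_words), 1)]


def toy_classifier(features: Sequence[float]) -> float:
    """A fixed linear scorer clipped to 1.0 (not a trained model)."""
    return min(1.0, 0.8 * features[0] + 0.2 * features[1])


if __name__ == "__main__":
    ranked = rank_answers(
        "who founded the company",
        ["it rains often", "the company was founded by smith"],
        toy_phi, toy_classifier,
    )
    for answer, score in ranked:
        print(f"{score:.3f}  {answer}")
```

In this framing, only the feature map changes between a feature-engineered system and a learned one; the ranking step itself stays the same.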
3.2. Attention Model for Answer Ranking
An attention model allows the network to sequentially focus on a subset of the input, process it, and then change its focus to another part of the input. This method makes it easier to process the data sequentially, even if the data is not sequential in nature. In our case, the model scans the answer sequentially in order to embed it relative to the query.
In general terms, our attention model computes a context vector for each “time stamp” of the sequence. The context vector is a weighted mean of the sequence states. More specifically, each state receives a weight by applying a SoftMax function over the output of an activation unit on the state. The sum of each weight, multiplied by its state value, creates the context vector. This context vector will serve as the embedding of the question and answer.
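The computation described in general terms above (a score per state, a softmax over the scores, and a weighted sum of the states) can be sketched as follows. This is a minimal sketch under stated assumptions: the scoring unit `tanh(w . h_t + b)` and the names `attention_context`, `w` and `b` are illustrative choices, not the paper's exact activation unit or parameters.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(states: Sequence[Sequence[float]],
                      w: Sequence[float],
                      b: float) -> Tuple[List[float], List[float]]:
    """Return (context_vector, weights) for a sequence of state vectors.

    Assumed scoring unit: e_t = tanh(w . h_t + b).
    """
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b)
              for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    # Weighted sum of the states: c[d] = sum_t weights[t] * states[t][d]
    context = [sum(weights[t] * states[t][d] for t in range(len(states)))
               for d in range(dim)]
    return context, weights


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    context, weights = attention_context(states, w=[1.0, 1.0], b=0.0)
    print("weights:", weights)
    print("context:", context)
```

Because the weights sum to one, the context vector has the same dimensionality as a single state regardless of sequence length, which is what lets long answers be summarized into a fixed-size embedding.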
To build the mapping function $\phi$ from a query-answer pair to a feature vector, we propose a feed-forward network (shown in Figure 2) that utilizes an attention layer. The inputs to the model are sentences, each representing an answer or a question. The network comprises several layers: the first layers perform separate transformations on the question and the answer and then embed them into a single representation, which is fed to layers that output the probability that the answer responds to the input question.
Specifically, the first five layers work separately on the questions and answers and subsequently merge into one layer:
An embedding layer: replaces each token of the sentence with its word2vec representation. The word2vec model was trained beforehand on the same training data that serves as input to our model (i.e., it was not built on the test data). In addition, as suggested by Severyn and Moschitti (2015), the layer concatenates to the representation a set of boolean features that mark words appearing in both the question and the answer.
An attention layer, given by Equations (1) to (4),
where and are the sentence and the bias, is the parameters matrix, and is a word iterator over the sentence. is an activation function which is based on and is defined as:
The layer is further illustrated in Figure 3.
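A non-linearity layer with a tanh-based activation function.
A non-linearity layer with a leaky ReLU (Lrelu) activation function: $\mathrm{Lrelu}(x) = \max(x, \alpha x)$, where $0 < \alpha < 1$ is the Lrelu parameter.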
Pooling layer used to reduce the representation: max pool was used, taking the maximum activation value.
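The simpler per-sentence layers listed above (the boolean overlap features, the tanh and Lrelu non-linearities, and max pooling) can be sketched as below. This is only an illustrative sketch: the function names are hypothetical, the matrices are plain nested lists with one row per token, and the standard leaky ReLU definition is assumed for Lrelu. The default `alpha` reuses the Lrelu parameter value reported in the experimental setup.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # one row per token, one column per feature


def overlap_flags(sentence: Sequence[str], other: Sequence[str]) -> List[float]:
    """1.0 for tokens that also occur in the other sentence, else 0.0."""
    other_set = set(other)
    return [1.0 if tok in other_set else 0.0 for tok in sentence]


def tanh_layer(x: Matrix) -> Matrix:
    """Element-wise tanh non-linearity."""
    return [[math.tanh(v) for v in row] for row in x]


def leaky_relu_layer(x: Matrix, alpha: float = 0.01002) -> Matrix:
    """Leaky ReLU: identity for positive inputs, slope alpha otherwise."""
    return [[v if v > 0 else alpha * v for v in row] for row in x]


def max_pool(x: Matrix) -> List[float]:
    """Max over tokens for each feature column, giving a fixed-length vector."""
    return [max(col) for col in zip(*x)]


if __name__ == "__main__":
    answer = ["born", "in", "florence"]
    question = ["where", "was", "he", "born"]
    print(overlap_flags(answer, question))
    features = [[1.0, -2.0], [-1.0, 3.0]]
    print(max_pool(leaky_relu_layer(tanh_layer(features))))
```

Max pooling over the token axis is what makes the output size independent of the sentence length, so question and answer vectors can later be concatenated into a vector of fixed size.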
The next layers take the outputs of the question and answer layers as input, and are based on part of the network suggested by Severyn and Moschitti (2015):
A pairwise layer: takes an answer and a question output vectors from the previous layers and concatenates them to a single vector.
A non-linearity layer, with an activation function.
A softmax layer, used in order to get a score for the question-answer pair.
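The joint layers listed above can be sketched as a concatenation followed by a hidden non-linearity and a two-class softmax. Several details here are assumptions made for illustration: the hidden activation is taken to be tanh (the text does not name it), the weight matrices are hypothetical hand-set values rather than trained parameters, and the helper names `dense` and `score_pair` are not from the paper.

```python
import math
from typing import List, Sequence


def dense(x: Sequence[float], weights: Sequence[Sequence[float]],
          bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def score_pair(q_vec: Sequence[float], a_vec: Sequence[float],
               w_hidden: Sequence[Sequence[float]], b_hidden: Sequence[float],
               w_out: Sequence[Sequence[float]], b_out: Sequence[float]) -> float:
    """Return the estimated probability that the answer is relevant."""
    joint = list(q_vec) + list(a_vec)  # pairwise layer: concatenation
    # Non-linearity layer (tanh is an assumption of this sketch)
    hidden = [math.tanh(v) for v in dense(joint, w_hidden, b_hidden)]
    # Softmax layer over two classes: [irrelevant, relevant]
    probs = softmax(dense(hidden, w_out, b_out))
    return probs[1]


if __name__ == "__main__":
    score = score_pair(
        q_vec=[1.0, 0.0], a_vec=[0.0, 1.0],
        w_hidden=[[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]], b_hidden=[0.0, 0.0],
        w_out=[[-1.0, 0.0], [1.0, 0.0]], b_out=[0.0, 0.0],
    )
    print(f"relevance score: {score:.3f}")
```

The second softmax output plays the role of the pointwise relevance score used to sort candidate answers.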
4. Empirical Evaluation
4.1. Experimental Setup
In order to test our model, we report results on two datasets:
The TREC-QA answer sentence selection dataset (Wang et al., 2007), which contains 53,417 question-answer pairs (1,229 unique questions) from the entire TREC 8-12 collection and comes with an automatic judgment tool based on manual judgment (and not regular expressions).
The LiveQA 2015 dataset (Agichtein et al., 2015), which contains 22,227 question-answer pairs (1,187 valid questions). This dataset is characterized by long and verbose answers, which are often not grammatically correct, and might therefore be challenging for many models.
Based on a separate development set, we perform parameter tuning and set the batch size to 50, the Lrelu parameter to 0.01002, and the learning rate to 0.1, with parameter initialization of with , where is the size of the longest answer. Due to the different nature of the datasets, we set the vector size to 300 for TREC-QA and 100 for LiveQA.
4.2. Main Results
Tables 1 and 2 summarize our results on the TREC-QA and LiveQA datasets, respectively. We compare our model to the state-of-the-art model (Severyn and Moschitti, 2015) using the standard MRR and NDCG metrics. As TREC-QA provides only binary labels, we calculated only MRR for that dataset. The results show that QARAT outperforms the state of the art on all metrics. Statistically significant results are shown in bold.
Table 1: Results on TREC-QA.
|Model|MRR|
|Severyn and Moschitti (2015)|0.81|

Table 2: Results on LiveQA.
|Model|MRR|NDCG|
|Severyn and Moschitti (2015)|0.46|0.7974|
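For reference, the two metrics reported in the tables can be computed along the following lines. This is a sketch, not the evaluation script used in the paper: it assumes each query's candidates are already sorted by model score, and it uses the linear-gain, log2-discount form of NDCG, which is one of several common variants.

```python
import math
from typing import Sequence


def reciprocal_rank(labels_in_ranked_order: Sequence[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, label in enumerate(labels_in_ranked_order, start=1):
        if label > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Sequence[Sequence[int]]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains_in_ranked_order: Sequence[float]) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains_in_ranked_order, reverse=True))
    return dcg(gains_in_ranked_order) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Binary labels per query, in the order the model ranked the answers.
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
    # Graded labels for one query, in ranked order.
    print(ndcg([1, 3, 2]))
```

MRR only needs binary relevance, which is why it is the single metric reported for TREC-QA, whereas NDCG makes use of graded labels.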
4.3. Effect of Answer and Question Length
To further understand when our algorithm outperforms the state of the art, we compared the two models across different answer lengths. Figure 4 shows the results on TREC-QA. It is evident that the model outperforms the baseline in a statistically significant manner for answer lengths above 25. This aligns with the strength of attention mechanisms in dealing with longer texts. On the LiveQA dataset (Figures 5 and 6), the model shows statistically significant improvements for nearly all answer lengths, and specifically for those above 110 and below 30. When investigating the answers of length less than 30, we observe that those, unlike the answers in TREC-QA, contain many phrases that are grammatically incorrect. We conclude that attention mechanisms bring value both for long texts and for short but confusing texts.
4.4. Visualization of the Model
In Figure 1, we present an example of two questions and their respective answers. The attention weights that QARAT created for each answer are presented to visualize where the algorithm “attended” the most when performing the embedding of the query-answer pair. Interestingly, the most relevant part of the answer received the highest “attention” weight. In the left example, observe that the phrases “in 1981” and “company” receive the highest attention weights, probably due to their relevance to the “chairman” phrase in the query. Our model gave this correct answer a score of 0.93 whereas the baseline gave the answer a score of 0.001. The main reason for this low score is the abundance of additional irrelevant information, which led the baseline to score this answer low. A similar phenomenon occurs in the example on the right, where the phrase “born in Florence” receives the highest attention weights when performing the embedding of the query and the answer. Our model gave this correct answer a score of 0.93 whereas the baseline gave the answer a score of 0.66.
5. Conclusions and Future Work
In this work, we presented QARAT, a deep learning ranking algorithm with attention mechanisms for answer ranking, and showed its superiority over state-of-the-art deep learning methods. Specifically, we observed that QARAT performs significantly better on longer, confusing texts with an abundance of information. To build a better intuition for the model's performance, we visualized the attention mechanism to show on which words and phrases the model focuses. We believe the use of our proposed model will help to advance question answering research, and aid in the adaptation of deep learning models for ranking.
References
- Agichtein et al. (2015) Eugene Agichtein, David Carmel, Dan Pelleg, Yuval Pinter, and Donna Harman. 2015. Overview of the TREC 2015 LiveQA Track. In Proceedings of TREC 2015.
- Bahdanau et al. (2014a) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014a. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR (2014).
- Bahdanau et al. (2014b) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014b. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
- Denil et al. (2011) Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. 2011. Learning where to Attend with Deep Architectures for Image Tracking. CoRR (2011).
- Heilman and Smith (2010) Michael Heilman and Noah A Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In Proceedings of NAACL.
- Larochelle and Hinton (2010) Hugo Larochelle and Geoffrey E Hinton. 2010. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.). 1243–1251.
- Liu et al. (2015) Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuanjing Huang. 2015. Multi-Timescale Long Short-Term Memory Neural Network for Modelling Sentences and Documents. In Proceedings of EMNLP.
- Raffel and Ellis (2015) Colin Raffel and Daniel PW Ellis. 2015. Feed-forward networks with attention can solve some long-term memory problems. arXiv preprint arXiv:1512.08756 (2015).
- Severyn and Moschitti (2013) Aliaksei Severyn and Alessandro Moschitti. 2013. Automatic Feature Engineering for Answer Selection and Extraction. In Proceedings of EMNLP, Vol. 13. 458–467.
- Severyn and Moschitti (2015) Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of SIGIR. ACM, 373–382.
- Surdeanu et al. (2011) Mihai Surdeanu, Massimiliano Ciaramita, and Hugo Zaragoza. 2011. Learning to Rank Answers to Non-factoid Questions from Web Collections. Comput. Linguist. 37, 2 (June 2011), 351–383.
- Tan et al. (2015) Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. LSTM-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108 (2015).
- Wang et al. (2007) Mengqiu Wang, Noah A Smith, and Teruko Mitamura. 2007. What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA. In Proceedings of EMNLP-CoNLL, Vol. 7. 22–32.
- Yin et al. (2015) Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2015. Abcnn: Attention-based convolutional neural network for modeling sentence pairs. arXiv preprint arXiv:1512.05193 (2015).
- Yu et al. (2014) Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632 (2014).