Hyperbolic Representation Learning for Fast and Efficient Neural Question Answering
The dominant neural architectures in question answer retrieval are based on recurrent or convolutional encoders configured with complex word matching layers. Given that recent architectural innovations are mostly new word interaction layers or attention-based matching mechanisms, it is widely assumed that these components are mandatory for good performance. Unfortunately, the memory and computation costs incurred by these complex mechanisms are undesirable for practical applications. As such, this paper tackles the question of whether it is possible to achieve competitive performance with simple neural architectures. We propose a simple but novel deep learning architecture for fast and efficient question-answer ranking and retrieval. More specifically, our proposed model, HyperQA, is a parameter-efficient neural network that outperforms other parameter-intensive models such as Attentive Pooling BiLSTMs and Multi-Perspective CNNs on multiple QA benchmarks. The novelty behind HyperQA is a pairwise ranking objective that models the relationship between question and answer embeddings in Hyperbolic space instead of Euclidean space. This empowers our model with a self-organizing ability and enables automatic discovery of latent hierarchies while learning embeddings of questions and answers. Our model requires no feature engineering, no similarity matrix matching, no complicated attention mechanisms and no over-parameterized layers, and yet it outperforms or remains competitive with many models that have these functionalities on multiple benchmarks.
Neural ranking models are commonplace in many modern question answering (QA) systems (Severyn and Moschitti, 2015; He and Lin, 2016). In these applications, the problem of question answering is concerned with learning to rank candidate answers in response to questions. Intuitively, this is reminiscent of document retrieval, albeit with shorter text, which aggravates the long-standing problem of the lexical chasm (Berger et al., 2000). For this purpose, a wide assortment of neural ranking architectures has been proposed. The basic intuition behind many of these models is as follows. Firstly, representations of questions and answers are learned via a neural encoder such as the long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) network or convolutional neural network (CNN). Secondly, these representations of questions and answers are composed by an interaction function to produce an overall matching score.
The design of the interaction function between question and answer representations lives at the heart of deep learning QA research. While it is simply possible to combine QA representations with simple feed forward neural networks or other composition functions (Qiu and Huang, 2015; Tay et al., 2017a), a huge bulk of recent work is concerned with designing novel word interaction layers that model the relationship between the words in the QA pairs. For example, similarity matrix based matching (Wan et al., 2016), soft attention alignment (Parikh et al., 2016) and attentive pooling (dos Santos et al., 2016) are highly popular techniques for improving the performance of neural ranking models. Apparently, it seems to be well-established that grid-based matching is essential to good performance. Notably, these new innovations come with trade-offs such as huge computational cost that lead to significantly longer training times and also a larger memory footprint. Additionally, it is good to consider that the base neural encoder employed also contributes to the computational cost of these neural ranking models, e.g., LSTM networks are known to be over-parameterized and also incur a parameter and runtime cost of quadratic scale. It also seems to be a well-established fact that a neural encoder (such as the LSTM, Gated Recurrent Unit (GRU), CNN, etc.) must be first selected for learning individual representations of questions and answers and is generally treated as mandatory for good performance.
In this paper, we propose an extremely simple neural ranking model for question answering that achieves highly competitive results on several benchmarks with only a fraction of the runtime and only 40K-90K parameters (as opposed to millions). Our neural ranking model captures the relationships between QA pairs in Hyperbolic space instead of Euclidean space. Hyperbolic space is an embedding space with a constant negative curvature in which distances increase exponentially towards the boundary. Intuitively, this makes it suitable for learning embeddings that reflect a natural hierarchy (e.g., networks, text, etc.), which we believe might benefit neural ranking models for QA. Notably, our work is inspired by the recently proposed Poincaré embeddings (Nickel and Kiela, 2017), which demonstrate the effectiveness of inducing a structural (hierarchical) bias in the embedding space for improved generalization. In our early empirical experiments, we discovered that a simple feed forward neural network trained in Hyperbolic space is capable of outperforming more sophisticated models on several standard benchmark datasets. We believe that this can be attributed to two reasons. Firstly, latent hierarchies are prominent in QA. Aside from the natural hierarchy of questions and answers, conceptual hierarchies also exist. Secondly, natural language is inherently hierarchical, which can be traced to power law distributions such as Zipf’s law (Ravasz and Barabási, 2003). The key contributions in this paper are as follows:
We propose a new neural ranking model for ranking question answer pairs. Our proposed model, HyperQA, performs matching of questions and answers in Hyperbolic space. While hyperbolic geometry and embeddings have been explored in the domains of complex networks or graphs (Krioukov et al., 2010), our work is, to the best of our knowledge, the first to investigate the suitability of this metric space for question answering.
HyperQA is an extremely fast and parameter-efficient model that achieves very competitive results on multiple QA benchmarks such as TrecQA, WikiQA and YahooCQA. The efficiency and speed of HyperQA are attributable to the fact that we do not use any sophisticated neural encoder and have no complicated word interaction layer. In fact, HyperQA is a mere single layered neural network with only 90K parameters. Surprisingly, HyperQA outperforms many state-of-the-art models such as Attentive Pooling BiLSTMs (dos Santos et al., 2016; Zhang et al., 2017) and Multi-Perspective CNNs (He and Lin, 2016). We believe that this invites us to reconsider whether many of these complex word interaction layers are really necessary for good performance.
We conduct extensive qualitative analysis of both the learned QA embeddings and word embeddings. We discover several interesting properties of QA embeddings in Hyperbolic space. Due to its compositional nature, we find that our model learns to self-organize not only at the QA level but also at the word-level. Our qualitative studies enable us to gain a better intuition pertaining to the good performance of our model.
2. Related Work
Many prior works have established the fact that there are mainly two key ingredients to a powerful neural ranking model. First, an effective neural encoder and second, an expressive word interaction layer. The first ingredient is often treated as a given, i.e., the top performing models always use a neural encoder such as the CNN or LSTM. In fact, many top performing models adopt convolutional encoders for sentence representation (He et al., 2015; Qiu and Huang, 2015; Severyn and Moschitti, 2015; He and Lin, 2016; Zhang et al., 2017; Shen et al., 2014). The usage of recurrent models is also notable (Mueller and Thyagarajan, 2016; Severyn and Moschitti, 2015; Tay et al., 2017a, b).
The key component in which many recent models differ is the interaction layer. Early works often combined QA embeddings ‘as is’, i.e., representations are learned first and then combined. For example, Yu et al. (Yu et al., 2014) used CNN representations as feature inputs to a logistic regression model. The end-to-end CNN-based model of Severyn and Moschitti (Severyn and Moschitti, 2015) combines the CNN encoded representations of question and answer using a multi-layered perceptron (MLP). Recently, a myriad of composition functions have been proposed as well, e.g., tensor layers in Qiu et al. (Qiu and Huang, 2015) and holographic layers in Tay et al. (Tay et al., 2017a).
It has recently become fashionable to model the relationships between question and answer using similarity matrices. Intuitively, this enables more fine-grained matching across words in question and answer sentences. The Multi-Perspective CNN (MP-CNN) (He et al., 2015) compared two sentences via a wide diversity of pooling functions and filter widths, aiming to capture ‘multi-perspectives’ between two sentences. The attention based neural matching (aNMM) model of Yang et al. (Yang et al., 2016) performed soft-attention alignment by first measuring the pairwise word similarity between each word in question and answer. The attentive pooling models of dos Santos et al. (dos Santos et al., 2016) (AP-BiLSTM and AP-CNN) utilized this soft-attention alignment to learn weighted representations of question and answer that are dependent on each other. Zhang et al. (Zhang et al., 2017) extended AP-CNN to 3D tensor-based attentive pooling (AI-CNN). A recent work, the Cross Temporal Recurrent Network (CTRN) (Tay et al., 2017b), proposed a pairwise gating mechanism for joint learning of QA pairs.
Unfortunately, these models actually introduce a prohibitive computational cost to the model usually for a very marginal performance gain. Notably, it is easy to see that similarity matrix based matching incurs a computational cost of quadratic scale. Representation ability such as dimension size of word or CNN/RNN embeddings are naturally also quite restricted, i.e., increasing any of these dimensions can cause computation or memory requirements to explode. Moreover, it is not uncommon for models such as AI-CNN or AP-BiLSTM to spend more than minutes on a single epoch on QA datasets that are only medium sized. Let us not forget that these models still have to be extensively tuned which aggravates the impracticality problem posed by some of these models.
In this paper, we seek a new paradigm for neural ranking for QA. While many recent works try to out-stack each other with new layers, we strip down our network instead. Our work is inspired by the very recent Poincaré embeddings (Nickel and Kiela, 2017), which demonstrate the superiority and efficiency of generalization in Hyperbolic space. Moreover, this alleviates many overfitting and complexity issues that Euclidean embeddings might face, especially if the data has intrinsic hierarchical structure. It is good to note that power-law distributions, such as Zipf’s law, are known to arise from innate hierarchical structure (Ravasz and Barabási, 2003). Specifically, the defining characteristic of Hyperbolic space is a much quicker expansion relative to that of Euclidean space, which makes it naturally equipped for modeling hierarchical structure. The concept of Hyperbolic spaces has been applied to domains such as complex network modeling (Krioukov et al., 2010), social networks (Verbeek and Suri, 2016) and geographic routing (Kleinberg, 2007).
There are several key geometric intuitions regarding Hyperbolic spaces. Firstly, the concept of distance and area is warped in Hyperbolic spaces. Specifically, each tile in Figure 1(a) is of equal area in Hyperbolic space but diminishes towards zero in Euclidean space towards the boundary. Secondly, Hyperbolic spaces are conformal, i.e., angles in Hyperbolic spaces and Euclidean spaces are identical. In Figure 1(b), the arcs on the curve are parallel lines that are orthogonal to the boundary. Finally, hyperbolic spaces can be regarded as larger spaces relative to Euclidean spaces due to the fact that the concept of relative distance can be expressed much better, i.e., not only does the distance between two vectors encode information but also where a vector is placed in Hyperbolic space. This enables efficient representation learning.
Nickel and Kiela (Nickel and Kiela, 2017) applied the hyperbolic distance (specifically, the Poincaré distance) to model taxonomic entities and graph nodes. Notably, our work, to the best of our knowledge, is the only work that learns QA embeddings in Hyperbolic space. Moreover, questions and answers introduce an interesting layer of complexity to the problem since QA embeddings are in fact compositions of their constituent word embeddings. On the other hand, nodes in a graph and taxonomic entities in (Nickel and Kiela, 2017) are already in their most abstract form, i.e., symbolic objects. As such, we believe it would be interesting to investigate the impact of modeling QA in Hyperbolic space in light of this added compositional nature.
3. Our Proposed Approach
This section outlines the overall architecture of our proposed model. Similar to many neural ranking models for QA, our network has ‘two’ sides with shared parameters, i.e., one for question and another for answer. However, since we optimize for a pairwise ranking loss, the model takes in a positive (correct) answer and a negative (wrong) answer and aims to maximize the margin between the scores of the correct QA pair and the negative QA pair. Figure 2 depicts the overall model architecture.
3.1. Embedding Layer
Our model accepts three sequences as input, i.e., the question (denoted as $q$), the correct answer (denoted as $a$) and a randomly sampled corrupted answer (denoted as $a'$). Each sequence consists of $M$ words, where $M_q$ and $M_a$ are predefined maximum sequence lengths for questions and answers respectively. Each word is represented as a one-hot vector (representing a word in the vocabulary). As such, this layer is a look-up layer that converts each word into a low-dimensional vector by indexing onto the word embedding matrix. In our implementation, we initialize this layer with pretrained word embeddings (Pennington et al., 2014). Note that this layer is not updated during training. Instead, we utilize a projection layer that learns a task-specific projection of the embeddings.
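As a rough illustration of the look-up step, the sketch below maps tokens to fixed vectors and pads or truncates each sequence to a maximum length. The toy vocabulary, the 3-dimensional vectors, the zero-vector padding, the out-of-vocabulary fallback and the names `EMBEDDINGS` and `embed` are all assumptions made for illustration; they are not taken from the HyperQA implementation.

```python
# Toy "pretrained" table (hypothetical values); in practice this would hold GloVe vectors.
EMBEDDINGS = {
    "<pad>": [0.0, 0.0, 0.0],
    "what": [0.1, -0.2, 0.3],
    "is": [0.0, 0.4, -0.1],
    "gross": [0.5, 0.1, 0.2],
}


def embed(tokens, max_len):
    """Pad or truncate to max_len, then look up each token's frozen vector."""
    padded = (list(tokens) + ["<pad>"] * max_len)[:max_len]
    # Assumption: out-of-vocabulary words fall back to the padding vector.
    return [list(EMBEDDINGS.get(tok, EMBEDDINGS["<pad>"])) for tok in padded]


# A three-word question padded to a maximum length of five.
question = embed(["what", "is", "gross"], max_len=5)
```

Because the table is never modified, all task-specific learning has to happen in the layers that follow.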
3.2. Projection Layer
In order to learn a task-specific representation for each word, we utilize a projection layer. The projection layer is essentially a single layered neural network that is applied to each word in all three sequences.
$$x = \sigma(W_p\, z + b_p)$$
where $W_p \in \mathbb{R}^{d \times n}$, $z \in \mathbb{R}^{n}$, $x \in \mathbb{R}^{d}$ and $\sigma$ is a non-linear function such as the rectified linear unit (ReLU). The output of this layer is a sequence of $d$-dimensional embeddings for each sequence (question, positive answer and negative answer). Note that the parameters of this projection layer are shared for both question and answer.
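A minimal sketch of such a shared single-layer projection is given below, assuming a ReLU non-linearity. The weight values, the dimensions and the function names are hypothetical and only serve to show that the same weights are applied to every word of every sequence.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def project_word(z, W, b):
    """Compute ReLU(W z + b) for one word vector z; W is a list of d rows of length n."""
    return relu([sum(w * zj for w, zj in zip(row, z)) + bi for row, bi in zip(W, b)])


def project_sequence(sequence, W, b):
    """Apply the same (shared) projection to every word of a sequence."""
    return [project_word(z, W, b) for z in sequence]


# Hypothetical 2x3 weights: projects 3-dimensional inputs down to 2 dimensions.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, -0.1]
projected = project_sequence([[0.2, 0.1, 0.4], [0.3, 0.3, 0.0]], W, b)
```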
3.3. Learning QA Representations
In order to learn question and answer representations, we simply take the sum of all word embeddings in the sequence.
$$y = \sum_{i=1}^{M} x_i$$
where $y \in \mathbb{R}^{d}$. $M$ is the predefined max sequence length (specific to question and answer) and $x_1, x_2, \dots, x_M$ are $d$-dimensional embeddings of the sequence. This is essentially the neural bag-of-words (NBoW) representation. Unlike popular neural encoders such as LSTM or CNN, the NBoW representation does not add any parameters and is much more efficient. Additionally, we constrain the question and answer embeddings to the unit ball before passing them to the next layer, i.e., $\|y_q\| \le 1$ and $\|y_a\| \le 1$. This is easily done via $y = y / \|y\|$ when $\|y\| > 1$. Note that this projection of QA embeddings onto the unit ball is mandatory and absolutely crucial for HyperQA to work.
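The summation and the unit-ball constraint can be sketched as follows. One detail here is an assumption rather than something stated above: the sketch rescales to a norm slightly below 1 (controlled by a hypothetical `eps`), because a vector lying exactly on the boundary would make the hyperbolic distance undefined.

```python
import math


def nbow(word_vectors):
    """Neural bag-of-words: sum the word embeddings dimension by dimension."""
    return [sum(dims) for dims in zip(*word_vectors)]


def clip_to_unit_ball(y, eps=1e-5):
    """Rescale y so that its Euclidean norm stays strictly below 1."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm >= 1.0:
        # Assumption: keep a small margin from the boundary for numerical stability.
        scale = (1.0 - eps) / norm
        return [v * scale for v in y]
    return list(y)


sentence = nbow([[0.6, 0.2], [0.7, 0.5]])   # sums to a vector of norm > 1
constrained = clip_to_unit_ball(sentence)    # pulled back inside the ball
```

Vectors that are already inside the ball are returned unchanged, so only the direction of an over-long sum is preserved.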
3.4. Hyperbolic Representations of QA Pairs
Neural ranking models are mainly characterized by the interaction function between question and answer representations. In our work, we mainly adopt the hyperbolic distance function to model the relationships between questions and answers. (While there exist multiple models of Hyperbolic geometry such as the Beltrami-Klein model or the Hyperboloid model, we adopt the Poincaré ball / disk due to its ease of differentiability and freedom from constraints (Nickel and Kiela, 2017).) Formally, let $\mathcal{B}^d = \{x \in \mathbb{R}^d : \|x\| < 1\}$ be the open $d$-dimensional unit ball. Our model corresponds to the Riemannian manifold ($\mathcal{B}^d, g_x$) and is equipped with the Riemannian metric tensor given as follows:
$$g_x = \left(\frac{2}{1 - \|x\|^2}\right)^2 g^E$$
where $g^E$ is the Euclidean metric tensor. The hyperbolic distance function between question and answer is defined as:
$$d(q, a) = \operatorname{arcosh}\left(1 + 2\,\frac{\|q - a\|^2}{(1 - \|q\|^2)(1 - \|a\|^2)}\right)$$
where $\|\cdot\|$ denotes the Euclidean norm and $q, a \in \mathbb{R}^d$ are the question and answer embeddings respectively. Note that $\operatorname{arcosh}$ is the inverse hyperbolic cosine function, i.e., $\operatorname{arcosh}(x) = \ln(x + \sqrt{x^2 - 1})$. Notably, $d(q, a)$ changes smoothly with respect to the position of $q$ and $a$, which enables the automatic discovery of latent hierarchies. As mentioned earlier, the distance increases exponentially as the norm of the vectors approaches 1. As such, the latent hierarchies of QA embeddings are captured through the norm of the vectors. From a geometric perspective, the origin can be seen as the root of a tree that branches out towards the boundaries of the hyperbolic ball. This self-organizing ability of the hyperbolic distance is visually and qualitatively analyzed in later sections.
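To make the behaviour near the boundary concrete, the sketch below implements the standard Poincaré distance in plain Python and evaluates it for two pairs of points separated by the same Euclidean gap. The function name and the example points are illustrative choices, not part of the original model code.

```python
import math


def poincare_distance(q, a):
    """Poincaré distance between two points strictly inside the unit ball."""
    sq_diff = sum((qi - ai) ** 2 for qi, ai in zip(q, a))
    alpha = 1.0 - sum(qi * qi for qi in q)
    beta = 1.0 - sum(ai * ai for ai in a)
    return math.acosh(1.0 + 2.0 * sq_diff / (alpha * beta))


# Both pairs are 0.1 apart in Euclidean terms.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
```

The pair close to the boundary is several times further apart in hyperbolic terms than the pair at the origin, which is the property that lets the norm of an embedding encode its depth in a hierarchy.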
3.4.1. Gradient Derivation
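Amongst the models of Hyperbolic geometry, the Poincaré distance is differentiable. Let $\theta$ and $x$ be two points in the Poincaré ball. The partial derivative of $d(\theta, x)$ w.r.t. $\theta$ is defined as:
$$\frac{\partial d(\theta, x)}{\partial \theta} = \frac{4}{\beta\sqrt{\gamma^2 - 1}}\left(\frac{\|x\|^2 - 2\langle\theta, x\rangle + 1}{\alpha^2}\,\theta - \frac{x}{\alpha}\right)$$
where $\alpha = 1 - \|\theta\|^2$, $\beta = 1 - \|x\|^2$ and $\gamma = 1 + \frac{2}{\alpha\beta}\|\theta - x\|^2$.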
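One way to gain confidence in a closed-form gradient of this kind is to compare it against a finite-difference estimate. The sketch below does so for the Poincaré distance, using the partial derivative given by Nickel and Kiela (2017); the helper names, the test points and the step size `h` are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def poincare_distance(theta, x):
    diff = [t - xi for t, xi in zip(theta, x)]
    alpha = 1.0 - dot(theta, theta)
    beta = 1.0 - dot(x, x)
    return math.acosh(1.0 + 2.0 * dot(diff, diff) / (alpha * beta))


def analytic_grad(theta, x):
    """Closed-form partial derivative of the distance with respect to theta."""
    diff = [t - xi for t, xi in zip(theta, x)]
    alpha = 1.0 - dot(theta, theta)
    beta = 1.0 - dot(x, x)
    gamma = 1.0 + 2.0 * dot(diff, diff) / (alpha * beta)
    coeff = 4.0 / (beta * math.sqrt(gamma * gamma - 1.0))
    theta_scale = (dot(x, x) - 2.0 * dot(theta, x) + 1.0) / (alpha * alpha)
    return [coeff * (theta_scale * t - xi / alpha) for t, xi in zip(theta, x)]


def numeric_grad(theta, x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grad.append((poincare_distance(up, x) - poincare_distance(down, x)) / (2.0 * h))
    return grad


theta, x = [0.3, -0.2], [-0.1, 0.5]
max_gap = max(abs(a - n) for a, n in zip(analytic_grad(theta, x), numeric_grad(theta, x)))
```

The closed form is undefined when the two points coincide (the square root vanishes), so such a check should only be run on distinct points.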
3.5. Similarity Scoring Layer
Finally, we pass the hyperbolic distance through a linear transformation described as follows:
$$s(q, a) = w_f\, d(q, a) + b_f$$
where $w_f \in \mathbb{R}$ and $b_f \in \mathbb{R}$ are scalar parameters of this layer. The choice of this layer is empirically motivated by its performance; it was selected over other variants, including non-linear activations such as the sigmoid function and the raw hyperbolic distance.
3.6. Optimization and Learning
This section describes the optimization and learning process of HyperQA. Our model learns via a pairwise ranking loss, which is well suited for metric-based learning algorithms.
3.6.1. Pairwise Hinge Loss
Our network minimizes the pairwise hinge loss which is defined as follows:
$$L = \sum_{(q, a) \in \Delta_q} \sum_{(q, a') \notin \Delta_q} \max\bigl(0,\; s(q, a) + \lambda - s(q, a')\bigr)$$
where $\Delta_q$ is the set of all QA pairs for question $q$, $s(q, a)$ is the score between $q$ and $a$, and $\lambda$ is the margin which controls the extent of discrimination between positive QA pairs and corrupted QA pairs. The adoption of the pairwise hinge loss is motivated by the good empirical results demonstrated in Rao et al. (Rao et al., 2016). Additionally, we also adopt the mix sampling strategy for sampling negative samples as described in their work.
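The scoring layer and the pairwise hinge loss can be sketched together in a few lines. The sketch assumes a distance-like score, i.e., a lower score for a better-matching pair, so that the loss is zero once the corrupted pair scores at least one margin above the correct pair; with a similarity-like score the two score terms would swap. All names and numbers are illustrative.

```python
def score(distance, w_f, b_f):
    """Linear transformation of a hyperbolic distance with two scalar parameters."""
    return w_f * distance + b_f


def pairwise_hinge(pos_score, neg_score, margin):
    """Zero once the corrupted pair scores at least `margin` above the correct pair."""
    return max(0.0, pos_score + margin - neg_score)


def batch_loss(pos_scores, neg_scores, margin):
    """Sum of hinge terms over aligned (correct, corrupted) score pairs."""
    return sum(pairwise_hinge(p, n, margin) for p, n in zip(pos_scores, neg_scores))


# First pair is already separated by more than the margin; second pair is not.
loss = batch_loss([0.2, 1.0], [2.0, 1.5], margin=1.0)
```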
3.6.2. Gradient Conversion
Since our network learns in Hyperbolic space, parameters have to be learned via stochastic Riemannian optimization methods such as RSGD (Bonnabel, 2013).
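$$\theta_{t+1} = \mathfrak{R}_{\theta_t}\bigl(-\eta\, \nabla_R \ell(\theta_t)\bigr)$$
where $\mathfrak{R}_{\theta_t}$ denotes a retraction onto $\mathcal{B}$ at $\theta$. $\eta$ is the learning rate and $\nabla_R \ell(\theta_t)$ is the Riemannian gradient with respect to $\theta_t$. Fortunately, the Riemannian gradient can be easily derived from the Euclidean gradient in this case (Bonnabel, 2013). In order to do so, we can simply scale the Euclidean gradient by the inverse of the metric tensor $g_\theta^{-1}$. Overall, the final gradients used to update the parameters are:
$$\nabla_R = \frac{(1 - \|\theta_t\|^2)^2}{4}\, \nabla_E \qquad (9)$$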
Due to the lack of space, we refer interested readers to (Nickel and Kiela, 2017; Bonnabel, 2013) for more details. For practical purposes, we simply utilize the automatic gradient feature of TensorFlow but convert the gradients with Equation (9) before updating the parameters.
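A single update of this kind can be sketched in plain Python as below. The sketch rescales a Euclidean gradient by $(1 - \|\theta\|^2)^2 / 4$, takes an ordinary gradient step, and then pulls the result back inside the unit ball; the additive retraction and the final projection follow the formulation of Nickel and Kiela (2017), while the function name and the `eps` margin are assumptions. It is not the TensorFlow code used for the experiments.

```python
import math


def rsgd_step(theta, euclidean_grad, lr, eps=1e-5):
    """One Riemannian SGD step on the Poincaré ball for a single parameter vector."""
    sq_norm = sum(t * t for t in theta)
    # Inverse of the Poincaré metric: shrinks the step as theta nears the boundary.
    scale = (1.0 - sq_norm) ** 2 / 4.0
    updated = [t - lr * scale * g for t, g in zip(theta, euclidean_grad)]
    norm = math.sqrt(sum(u * u for u in updated))
    if norm >= 1.0:
        # Assumption: project back to just inside the ball after the step.
        updated = [u * (1.0 - eps) / norm for u in updated]
    return updated


at_origin = rsgd_step([0.0, 0.0], [1.0, 0.0], lr=0.1)
near_edge = rsgd_step([0.9, 0.0], [1.0, 0.0], lr=0.1)
```

With the same Euclidean gradient and learning rate, the point near the boundary moves far less than the point at the origin, which reflects how much "larger" the space is out there.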
This section describes our empirical evaluation and its results.
In the spirit of experimental rigor, we conduct our empirical evaluation based on four popular and well-studied benchmark datasets for question answering.
YahooCQA - This is a benchmark dataset for community-based question answering that was collected from Yahoo Answers. In this dataset, the answer lengths are relatively longer than TrecQA and WikiQA. Therefore, we filtered answers that have more than words and less than characters. The train-dev-test splits for this dataset are provided by (Tay et al., 2017a).
WikiQA - This is a recently popular benchmark dataset (Yang et al., 2015) for open-domain question answering based on factual questions from Wikipedia and Bing search logs.
SemEvalCQA - This is a well-studied benchmark dataset from SemEval-2016 Task 3 Subtask A (CQA). This is a real world dataset obtained from Qatar Living Forums. In this dataset, there are ten answers in each question ‘thread’ which are marked as ‘Good’, ‘Potentially Useful’ or ‘Bad’. We treat ‘Good’ as positive and anything else as negative labels.
TrecQA - This is the benchmark dataset provided by Wang et al. (Wang et al., 2007). This dataset was collected from TREC QA tracks 8-13 and is comprised of factoid based questions which mainly answer the ‘who’, ‘what’, ‘where’, ‘when’ and ‘why’ types of questions. There are two versions, namely clean and raw, as noted by (Rao et al., 2016) which we evaluate our models on.
Statistics pertaining to each dataset are given in Table 1.
4.2. Compared Baselines
In this section, we introduce the baselines for comparison.
YahooCQA - The key competitors of this dataset are the Neural Tensor LSTM (NTN-LSTM) and HD-LSTM from Tay et al. (Tay et al., 2017a) along with their implementation of the Convolutional Neural Tensor Network (Qiu and Huang, 2015), vanilla CNN model, and the Okapi BM-25 (Robertson et al., 1994) benchmark. Additionally, we also report our own implementations of QA-BiLSTM, QA-CNN, AP-BiLSTM and AP-CNN on this dataset based on our experimental setup.
WikiQA - The key competitors of this dataset are the Paragraph Vector (PV) and PV + Cnt models of Le and Mikolov (Le and Mikolov, 2014), the CNN + Cnt model from Yu et al. (Yu et al., 2014) and LCLR (Yih et al., 2013). These baselines are reported in the original WikiQA paper (Yang et al., 2015), which also includes variations with handcrafted features. Additional strong baselines include QA-BiLSTM and QA-CNN from (dos Santos et al., 2016) along with AP-BiLSTM and AP-CNN, which are attentive pooling improvements of the former. Finally, we also report the Pairwise Ranking MP-CNN from Rao et al. (Rao et al., 2016).
SemEvalCQA - The key competitors of this dataset are the CNN-based ARC-I/II architecture by Hu et al. (Hu et al., 2014), the Attentive Pooling CNN (dos Santos et al., 2016), Kelp (Filice et al., 2016) a feature engineering based SVM method, ConvKN (Barrón-Cedeño et al., 2016) a combination of convolutional tree kernels with CNN and finally AI-CNN (Attentive Interactive CNN) (Zhang et al., 2017), a tensor-based attentive pooling neural model. A comparison with AI-CNN (with features) is also included.
TrecQA - The key competitors on the dataset are mainly the CNN model of Severyn and Moschitti (S&M) (Severyn and Moschitti, 2015), the Attention-based Neural Matching Model (aNMM) of Yang et al. (Yang et al., 2016), HD-LSTM (Tay et al.) (Tay et al., 2017a) and Multi-Perspective CNN (MP-CNN) (He et al., 2015) proposed by He et al. Lastly, we also compare with the pairwise ranking adaption of MP-CNN (Rao et al.) (Rao et al., 2016). Additionally and due to long standing nature of this dataset, there have been a huge number of works based on traditional feature engineering approaches (Wang et al., 2007; Heilman and Smith, 2010; Severyn et al., 2014; Yao et al., 2013) which we also report. For the clean version of this dataset, we also compare with AP-CNN and QA-BiLSTM/CNN (dos Santos et al., 2016).
Since the training splits are standard, we are able to directly report the results from the original papers.
4.3. Evaluation Protocol
This section describes the key evaluation protocol / metrics and implementation details of our experiments.
We adopt a dataset specific evaluation protocol in which we follow the prior work in their evaluation protocols. Specifically, TrecQA and WikiQA adopt the Mean Reciprocal Rank (MRR) and MAP (Mean Average Precision) metrics which are commonplace in IR research. On the other hand, YahooCQA and SemEvalCQA evaluate on MAP and Precision@1 (abbreviated P@1) which is determined based on whether the top predicted answer is the ground truth. For all competitor methods, we report the performance results from the original paper.
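For reference, the three ranking metrics can be computed from the relevance labels of each question's candidate answers, listed in the order the model ranks them. The sketch below uses the usual textbook definitions; the function names and the toy label lists are illustrative and do not correspond to any particular evaluation script.

```python
def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant answer, or 0 if none is relevant."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_labels):
    """Mean of precision@k taken at every rank k that holds a relevant answer."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def precision_at_1(ranked_labels):
    """1 if the top-ranked answer is relevant, else 0."""
    return 1.0 if ranked_labels and ranked_labels[0] else 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# One list of 0/1 labels per question, ordered by predicted score.
rankings = [[0, 1, 0], [1, 0, 1]]
mrr = mean(reciprocal_rank(r) for r in rankings)
map_score = mean(average_precision(r) for r in rankings)
p_at_1 = mean(precision_at_1(r) for r in rankings)
```

When every question has exactly one relevant answer, average precision and reciprocal rank coincide, so MAP and MRR give the same number.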
4.3.2. Training Time & Parameter Size
Additionally, we report the parameter size and runtime (seconds per epoch) of selected models. We selectively re-implement some of the key competitors with the best performance and benchmark their training time on our machine/GPU (a single Nvidia GTX1070). For reporting the parameter size and training time, we try our best to follow the hyperparameters stated in the original papers. As such, the same model can have different training time and parameter size on different datasets.
HyperQA is implemented in TensorFlow. We adopt the AdaGrad (Duchi et al., 2011) optimizer with initial learning rate tuned amongst . The batch size is tuned amongst . Models are trained for epochs and the model parameters are saved each time the performance on the validation set is topped. The dimension of the projection layer is tuned amongst . L2 regularization is tuned amongst . The negative sampling rate is tuned from to . Finally, the margin is tuned amongst . For TrecQA, WikiQA and YahooCQA, we initialize the embedding layer with GloVe (Pennington et al., 2014) and use the version with and trained on 840 billion words. For SemEvalCQA, we train our own Skipgram model using the unannotated corpus provided by the task. In this case, the embedding dimension is tuned amongst . Embeddings are not updated during training. For the SemEvalCQA dataset, we concatenated the raw QA embeddings before passing into the final layer since we found that it improves performance.
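The parameter counts quoted in this paper can be sanity-checked with simple arithmetic. The sketch below assumes that the only trainable parameters are the projection weights and biases plus the two scalars of the scoring layer, that the frozen input embeddings are 300-dimensional, and that the projection sizes are 300 and 150; these dimensions are assumptions chosen for illustration rather than reported settings.

```python
def hyperqa_param_count(embedding_dim, projection_dim):
    """Trainable parameters: projection weights and biases plus two scoring scalars."""
    projection = embedding_dim * projection_dim + projection_dim
    scoring = 2  # scalar weight and scalar bias of the final linear layer
    return projection + scoring


full_size = hyperqa_param_count(300, 300)   # 90302, i.e. roughly 90K
half_size = hyperqa_param_count(300, 150)   # 45152, i.e. roughly 45K
```

Under these assumptions the totals land on the 90K and 45K figures, and the frozen embedding table contributes nothing because it is not updated during training.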
4.4. Results and Analysis
In this section, we present our empirical results on all datasets. For all reported results, the best result is in boldface and the second best is underlined.
4.4.1. Experimental Results on WikiQA
Table 2 reports our results on the WikiQA dataset. Firstly, we observe that HyperQA outperforms a myriad of complex neural architectures. Notably, we obtain a clear performance gain in terms of MAP/MRR against models such as AP-CNN (0.712/0.727 versus 0.688/0.696) and AP-BiLSTM (0.671/0.684). Our model also outperforms MP-CNN, which is heavily equipped with parameterized word matching mechanisms. We achieve competitive results relative to the Rank MP-CNN. Finally, HyperQA is extremely efficient and fast, clocking 2s per epoch compared to 33s per epoch for Rank MP-CNN. The parameter cost is also 90K versus 10 million, which is a significant improvement.
Table 2. Experimental results on WikiQA.
| Model | MAP | MRR | #Params | Time per epoch |
| --- | --- | --- | --- | --- |
| PV + Cnt | 0.599 | 0.609 | - | - |
| CNN + Cnt | 0.652 | 0.665 | - | - |
| QA-BiLSTM (Santos et al.) | 0.656 | 0.670 | - | - |
| QA-CNN (Santos et al.) | 0.670 | 0.682 | - | - |
| AP-BiLSTM (Santos et al.) | 0.671 | 0.684 | - | - |
| AP-CNN (Santos et al.) | 0.688 | 0.696 | - | - |
| MP-CNN (He et al.) | 0.693 | 0.709 | 10.0M | 35s |
| Rank MP-CNN (Rao et al.) | 0.701 | 0.718 | 10.0M | 33s |
| HyperQA (This work) | 0.712 | 0.727 | 90K | 2s |
4.4.2. Experimental Results on YahooCQA
Table 3 reports the experimental results on YahooCQA. First, we observe that HyperQA outperforms AP-BiLSTM and AP-CNN significantly. Specifically, HyperQA scores 0.683 and 0.801 on the two reported metrics, against 0.568 and 0.731 for AP-BiLSTM, the runner-up model. Notably, HyperQA is 32 times faster than AP-BiLSTM and has 20 times fewer parameters. Our approach shows that complicated attentive pooling mechanisms are not necessary for good performance.
Table 3. Experimental results on YahooCQA.
| Model | P@1 | MRR | #Params | Time per epoch |
| --- | --- | --- | --- | --- |
| CNTN (Qiu et al.) | 0.465 | 0.632 | - | - |
| NTN-LSTM (Tay et al.) | 0.545 | 0.731 | - | - |
| HD-LSTM (Tay et al.) | 0.557 | 0.735 | - | - |
| QA-BiLSTM (Santos et al.) | 0.508 | 0.683 | 1.40M | 440s |
| QA-CNN (Santos et al.) | 0.564 | 0.727 | 90.9K | 60s |
| AP-CNN (Santos et al.) | 0.560 | 0.726 | 540K | 110s |
| AP-BiLSTM (Santos et al.) | 0.568 | 0.731 | 1.80M | 640s |
| HyperQA (This work) | 0.683 | 0.801 | 90.0K | 20s |
4.4.3. Experimental Results on SemEvalCQA
Table 4 reports the experimental results on SemEvalCQA. Our proposed approach achieves highly competitive performance on this dataset. Specifically, we have obtained the best P@1 performance overall (0.809), outperforming the state-of-the-art AI-CNN model (0.763 without features and 0.769 with features). The performance of our model on MAP is marginally short of the best performing model. Notably, AI-CNN has benefited from external handcrafted features. As such, comparing AI-CNN (w/o features) with HyperQA shows that our proposed model is a superior neural ranking model. Next, we draw the reader’s attention to the time cost of AI-CNN. Its training time is 3250s per epoch, which is about 300 times longer than that of our model (10s). AI-CNN is extremely cost prohibitive, i.e., attentive pooling is already very expensive and yet AI-CNN performs 3D attentive pooling. Evidently, its performance can be matched with a much smaller training time and parameter cost. This raises questions about the effectiveness of the 3D attentive pooling mechanism.
Table 4. Experimental results on SemEvalCQA.
| Model | P@1 | MAP | #Params | Time per epoch |
| --- | --- | --- | --- | --- |
| ARC-I (Hu et al.) | 0.741 | 0.771 | - | - |
| ARC-II (Hu et al.) | 0.753 | 0.780 | - | - |
| AP-CNN (Santos et al.) | 0.755 | 0.771 | - | - |
| Kelp (Filice et al.) | 0.751 | 0.792 | - | - |
| ConvKN (Barrón-Cedeño et al.) | 0.755 | 0.777 | - | - |
| AI-CNN (Zhang et al.) | 0.763 | 0.792 | 140K | 3250s |
| AI-CNN + Feats (Zhang et al.) | 0.769 | 0.801 | 140K | 3250s |
| HyperQA (This work) | 0.809 | 0.795 | 45K | 10s |
Table 5. Experimental results on TrecQA (raw).
| Model | MAP | MRR | #Params | Time per epoch |
| --- | --- | --- | --- | --- |
| Wang et al. (2007) | 0.603 | 0.685 | - | - |
| Heilman et al. (2010) | 0.609 | 0.692 | - | - |
| Wang et al. (2010) | 0.595 | 0.695 | - | - |
| Severyn and Moschitti (2013) | 0.678 | 0.736 | - | - |
| Yih et al (2014) | 0.709 | 0.770 | - | - |
| CNN (Yu et al) | 0.711 | 0.785 | - | - |
| BLSTM + BM25 (Wang & Nyberg) | 0.713 | 0.791 | - | - |
| CNN (Severyn & Moschitti) | 0.746 | 0.808 | - | - |
| aNMM (Yang et al.) | 0.750 | 0.811 | - | - |
| HD-LSTM (Tay et al.) | 0.750 | 0.815 | - | - |
| MP-CNN (He et al.) | 0.762 | 0.822 | 10.0M | 141s |
| Rank MP-CNN (Rao et al.) | 0.780 | 0.830 | 10.0M | 130s |
| HyperQA (This work) | 0.770 | 0.825 | 90K | 12s |
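Table 6. Experimental results on TrecQA (clean).
| Model | MAP | MRR | #Params | Time per epoch |
| --- | --- | --- | --- | --- |
| QA-LSTM / CNN (Santos et al.) | 0.728 | 0.832 | - | - |
| AP-CNN (Santos et al.) | 0.753 | 0.851 | - | - |
| MP-CNN (He et al.) | 0.777 | 0.836 | 10M | 141s |
| Rank MP-CNN (Rao et al.) | 0.801 | 0.877 | 10M | 130s |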
4.4.4. Experimental Results on TrecQA
Table 5 reports the results on TrecQA (raw). HyperQA achieves very competitive performance on both MAP and MRR metrics. Specifically, HyperQA outperforms the basic CNN model of (S&M) in terms of MAP/MRR (0.770/0.825 versus 0.746/0.808). Moreover, the CNN (S&M) model uses handcrafted features which HyperQA does not require. Similarly, the aNMM model and HD-LSTM also benefit from additional features but are outperformed by HyperQA. HyperQA also outperforms MP-CNN while being around 10 times faster and having around 100 times fewer parameters. MP-CNN consists of a huge number of filter banks and utilizes heavy parameterization to match multiple perspectives of questions and answers. On the other hand, our proposed HyperQA is merely a single layered neural network with 90K parameters and yet outperforms MP-CNN. Table 6 reports the results on TrecQA (clean). Here, HyperQA also outperforms MP-CNN, AP-CNN and QA-CNN. On both datasets, the performance of HyperQA is competitive with Rank MP-CNN.
4.4.5. Overall analysis
Overall, we summarize the key findings of our experiments.
It is possible to achieve very competitive performance with small parameterization, and no word matching or interaction layers. HyperQA outperforms complex models such as MP-CNN and AP-BiLSTM on multiple datasets.
The relative performance of HyperQA is significantly better on large datasets, e.g., YahooCQA (253K training pairs) as opposed to smaller ones like WikiQA (5.9K training pairs). We believe that this is due to the fact that Hyperbolic space is seemingly larger than Euclidean space.
HyperQA is extremely fast and trains around 10 times faster than complex models like MP-CNN. Note that if CPUs are used instead of GPUs (which speed convolutions up significantly), this disparity would be significantly larger.
Our proposed approach does not require handcrafted features and yet outperforms models that benefit from them. This is evident on all datasets, i.e., HyperQA outperforms the CNN model with features (TrecQA and WikiQA) and, in terms of P@1, AI-CNN + features on SemEvalCQA.
|HyperQA compared against|Performance|Parameters|Speed|
|---|---|---|---|
|AP-BiLSTM|1-7% better|20x less|32x faster|
|AP-CNN|1-12% better|Same|3x faster|
|AI-CNN|Competitive|3x less|300x faster|
|MP-CNN|1-2% better|100x less|10x faster|
|Rank MP-CNN|Competitive|100x less|10x faster|
4.5. Effects of QA Embedding Size
In this section, we study the effects of the QA embedding size on performance. Figure 3 describes the relationship between QA embedding size and MAP on the WikiQA dataset. Additionally, we include a simple baseline (CosineQA), which is exactly the same as HyperQA but uses cosine similarity instead of hyperbolic distance. The MAP scores of three other models (MP-CNN, CNN-Cnt and PV-Cnt) are also reported for reference. Firstly, we notice the disparity between HyperQA and CosineQA in terms of performance. This is also observed across other datasets but is not reported due to lack of space. While CosineQA maintains a stable performance across embedding sizes, the performance of HyperQA improves rapidly once the embedding size passes a certain point. In fact, the performance of HyperQA with only 45K parameters is already similar to that of the Multi-Perspective CNN (He et al., 2015), which contains 10 million parameters. Moreover, HyperQA outperforms MP-CNN at the larger embedding sizes.
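To make the difference between the two scoring functions concrete, the sketch below contrasts cosine similarity with a hyperbolic distance. It assumes the Poincaré-ball distance popularized by Nickel and Kiela (2017), which this section does not restate, so the exact form used by HyperQA should be checked against the model description. The function names and toy vectors are illustrative.

```python
import math
from typing import Sequence


def _sq_norm(x: Sequence[float]) -> float:
    return sum(v * v for v in x)


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Distance between two points inside the open unit ball:
    arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    nu, nv = _sq_norm(u), _sq_norm(v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - nu) * (1.0 - nv)))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle-only similarity: ignores how far the vectors are from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(_sq_norm(u)) * math.sqrt(_sq_norm(v)))


if __name__ == "__main__":
    # Two orthogonal pairs: one near the origin, one near the boundary.
    near = ([0.1, 0.0], [0.0, 0.1])
    far = ([0.9, 0.0], [0.0, 0.9])
    for name, (u, v) in (("near origin", near), ("near boundary", far)):
        print(name, cosine_similarity(u, v), poincare_distance(u, v))
```

Both pairs have the same cosine similarity, but the hyperbolic distance is much larger for the pair near the boundary. This sensitivity to the norm is what allows distance from the origin to act as a hierarchical signal, something a purely angular measure cannot express.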
5. Discussion and Analysis
This section delves into qualitative analysis of our model and aims to investigate the following research questions:
RQ1: Is there any hierarchical structure learned in the QA embeddings? How are QA embeddings organized in the final embedding space of HyperQA?
RQ2: What are the impacts of embedding compositional embeddings in hyperbolic space? Is there an impact on the constituent word embeddings?
RQ3: Are we able to derive any insight about how word interaction and matching happens in HyperQA?
|Question| |H1|H2|H3|H4|H5|
|---|---|---|---|---|---|---|
|What is the gross sale of Burger King|Q|are|sales, today|gross|is, what|burger, king|
| |A|based|sales, 14, billion, 183|diageo|contributed|burger, corp|
|What is Florence Nightingale famous for|Q|in, the|for|famous|what|florence, nightingale|
| |A|of, in|was|nursing|founder, modern, born|nightingale, italy|
|Who is the founder of twitter?|Q|the, of|-|twitter, founder|-|who, is|
| |A|and, the|networking, launched|twitter, jack dorsey|match, social|-|
|Vector norm|Words|
|---|---|
|0-1|to, and, an, on, in, of, its, the, had, or, go|
|1-2|be, a, was, up, put, said, but|
|2-3|judging, returning, volunteered, managing, meant, cited|
|3-4|responsibility, engineering, trading, prosecuting|
|4-5|turkish, autonomous, cowboys, warren, seven, what|
|5-6|ebay, magdalena, spielberg, watson, nova|
5.1. Analysis of QA Embeddings
Figure 4(a) shows a visualization of QA embeddings on the TrecQA test set, projected into 3-dimensional space using t-SNE (van der Maaten, 2009). QA embeddings are extracted from the network as discussed in Section 3.3. We observe that question embeddings form a ‘sphere’ over answer embeddings. In contrast, this is not exhibited when cosine similarity is used, as shown in Figure 4(b). It is important to note that these are embeddings from the test set, on which the model has not been trained, and therefore the model is not explicitly told whether a particular textual input is a question or an answer. This demonstrates the innate ability of HyperQA to self-organize and learn latent hierarchies, which directly answers RQ1. Additionally, Figure 5(a) shows a histogram of the vector norms of question and answer embeddings. We can clearly see that questions in general have a higher vector norm (we extract QA embeddings right before the constraining / normalization layer) and are at a different hierarchical level from answers. To further understand what the model is doing, we delve deeper into the visualization at word level.
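The norm comparison described above can be quantified with a few lines of code. The sketch below is a hypothetical analysis helper, not the authors' script: it assumes the embeddings have already been extracted before the normalization layer, and it uses made-up toy vectors. It reports the mean norm of each group and the fraction of question-answer pairs in which the question has the larger norm.

```python
import math
from typing import List, Sequence


def l2_norm(vec: Sequence[float]) -> float:
    return math.sqrt(sum(v * v for v in vec))


def mean_norm(embeddings: List[Sequence[float]]) -> float:
    """Average L2 norm of a set of embeddings."""
    return sum(l2_norm(e) for e in embeddings) / len(embeddings)


def prob_question_norm_larger(
    questions: List[Sequence[float]], answers: List[Sequence[float]]
) -> float:
    """Fraction of all (question, answer) pairs where the question norm
    is strictly larger than the answer norm. A value near 1.0 means the
    two groups are well separated by norm alone."""
    q_norms = [l2_norm(q) for q in questions]
    a_norms = [l2_norm(a) for a in answers]
    wins = sum(1 for qn in q_norms for an in a_norms if qn > an)
    return wins / (len(q_norms) * len(a_norms))


if __name__ == "__main__":
    # Toy embeddings (hypothetical values, 2-dimensional for readability).
    questions = [[3.0, 4.0], [0.0, 6.0]]
    answers = [[1.0, 0.0], [0.0, 2.0], [0.0, 5.5]]
    print(mean_norm(questions), mean_norm(answers))
    print(prob_question_norm_larger(questions, answers))
```

A pairwise fraction is used instead of only comparing means because it is less sensitive to a few outliers with very large norms.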
5.2. Analysis of Word Embeddings
Table 9 shows some examples of words at each hierarchical level of the sphere on TrecQA. Recall that the vector norms (note that the word embeddings themselves are not norm-constrained) allow us to infer the distance of a word embedding from the origin, which depicts its hierarchical level in our context. Interestingly, we found that HyperQA exhibits self-organizing ability even at word level. Specifically, we notice that the words closer to the origin are common words such as ‘to’ and ‘and’, which do not have much semantic value for QA problems. At the middle of the hierarchy, we notice that there are more verbs. Finally, as we move towards the surface of the ‘sphere’, the words become rarer and reflect more domain-specific words such as ‘ebay’ and ‘spielberg’. Moreover, we also found many names and proper nouns occurring at this hierarchical level.
Additionally, we observe that words such as ‘where’ or ‘what’ have relatively high vector norms and are located quite high up in the hierarchy. This is consistent with Figure 4, which shows that the question embeddings form a sphere around the answer embeddings. Lastly, we parsed QA pairs word-by-word according to hierarchical level (based on their vector norm). Table 8 reports the outcome of this experiment, where H1 to H5 are hierarchical levels based on vector norms. First, we find that questions often start with the overall context and drill down into more specific query words. Take the first sample in Table 8, for example: it begins at the top level with ‘burger king’ and then drills down progressively to ‘what is gross sales?’. Similarly, the second example begins with ‘florence nightingale’ and drills down to ‘famous’ at H3, where a match is found with ‘nursing’ at the same hierarchical level. Overall, based on our qualitative analysis, we observe that HyperQA builds two hierarchical structures at the word level (in vector space) that extend towards the middle, which strongly facilitates word-level matching. Pertaining to answers, it seems that the model builds a hierarchy by splitting on conjunctive words (‘and’), i.e., the root node of the tree starts at conjunctive words and splits sentences into semantic phrases. Overall, Figure 6 depicts our key intuitions regarding the inner workings of HyperQA, which addresses both RQ2 and RQ3. This is also supported by Figure 5(b), which shows that the majority of the word norms are clustered in the middle of the range. This is reasonable considering that the leaf nodes of both question and answer hierarchies would reside in the middle.
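The word-by-word parsing described above can be sketched as a simple binning of tokens by the norm of their word embeddings. The following is a minimal illustration under stated assumptions: the bin edges, the per-word norms and the function name are all hypothetical, since the text does not specify how norms were mapped to the levels H1 to H5.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def parse_by_level(
    tokens: Sequence[str],
    word_norms: Dict[str, float],
    edges: Sequence[float],
) -> Dict[str, List[str]]:
    """Group tokens into levels H1..Hn by embedding norm.

    `edges` are ascending norm thresholds; len(edges) + 1 levels result.
    Tokens without a known norm are skipped.
    """
    levels: Dict[str, List[str]] = {
        f"H{k}": [] for k in range(1, len(edges) + 2)
    }
    for tok in tokens:
        norm = word_norms.get(tok)
        if norm is None:
            continue
        # bisect_right counts how many thresholds the norm has reached.
        levels[f"H{bisect_right(edges, norm) + 1}"].append(tok)
    return levels


if __name__ == "__main__":
    # Hypothetical norms and thresholds, chosen only for illustration.
    norms = {"what": 4.2, "is": 0.8, "florence": 5.3,
             "nightingale": 5.1, "famous": 2.6, "for": 1.4}
    question = "what is florence nightingale famous for".split()
    print(parse_by_level(question, norms, edges=[1.0, 2.0, 3.0, 4.5]))
```

Applying the same function to a question and its answer and then comparing the word lists level by level gives a rough view of where matches such as ‘famous’ and ‘nursing’ fall in the hierarchy.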
6. Conclusion
We proposed a new neural ranking model for question answering. Our proposed HyperQA achieves very competitive performance on four well-studied benchmark datasets. Our model is light-weight, fast and efficient, outperforming many state-of-the-art models with complex word interaction layers, attentive mechanisms or rich neural encoders. Our model only has 40K-90K parameters as opposed to millions of parameters which plague many competitor models. Moreover, we derive qualitative insights pertaining to our model which enable us to further understand its inner workings. Finally, we observe that the superior generalization of our model (despite small parameters) can be attributed to self-organizing properties of not only question and answer embeddings but also word embeddings.
- Barrón-Cedeño et al. (2016) Alberto Barrón-Cedeño, Giovanni Da San Martino, Shafiq R. Joty, Alessandro Moschitti, Fahad Al-Obaidli, Salvatore Romeo, Kateryna Tymoshenko, and Antonio Uva. 2016. ConvKN at SemEval-2016 Task 3: Answer and Question Selection for Question Answering on Arabic and English Fora. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016. 896–903.
- Berger et al. (2000) Adam L. Berger, Rich Caruana, David Cohn, Dayne Freitag, and Vibhu O. Mittal. 2000. Bridging the lexical chasm: statistical approaches to answer-finding. In SIGIR. 192–199. DOI:http://dx.doi.org/10.1145/345508.345576
- Bonnabel (2013) Silvere Bonnabel. 2013. Stochastic Gradient Descent on Riemannian Manifolds. IEEE Trans. Automat. Contr. 58, 9 (2013), 2217–2229. DOI:http://dx.doi.org/10.1109/TAC.2013.2254619
- dos Santos et al. (2016) Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive Pooling Networks. CoRR abs/1602.03609 (2016).
- Duchi et al. (2011) John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12 (2011), 2121–2159.
- Filice et al. (2016) Simone Filice, Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2016. KeLP at SemEval-2016 Task 3: Learning Semantic Relations between Questions and Answers. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016. 1116–1123.
- He et al. (2015) Hua He, Kevin Gimpel, and Jimmy J. Lin. 2015. Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. 1576–1586.
- He and Lin (2016) Hua He and Jimmy J. Lin. 2016. Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016. 937–948.
- Heilman and Smith (2010) Michael Heilman and Noah A. Smith. 2010. Tree Edit Models for Recognizing Textual Entailments, Paraphrases, and Answers to Questions. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA. 1011–1019.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
- Hu et al. (2014) Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. 2042–2050.
- Kleinberg (2007) Robert Kleinberg. 2007. Geographic Routing Using Hyperbolic Space. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 6-12 May 2007, Anchorage, Alaska, USA. 1902–1909. DOI:http://dx.doi.org/10.1109/INFCOM.2007.221
- Krioukov et al. (2010) Dmitri V. Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. 2010. Hyperbolic Geometry of Complex Networks. CoRR abs/1006.5169 (2010).
- Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). 1188–1196.
- Mueller and Thyagarajan (2016) Jonas Mueller and Aditya Thyagarajan. 2016. Siamese Recurrent Architectures for Learning Sentence Similarity. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. 2786–2792.
- Nickel and Kiela (2017) Maximilian Nickel and Douwe Kiela. 2017. Poincaré Embeddings for Learning Hierarchical Representations. CoRR abs/1705.08039 (2017).
- Parikh et al. (2016) Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. 2249–2255.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. 1532–1543.
- Qiu and Huang (2015) Xipeng Qiu and Xuanjing Huang. 2015. Convolutional Neural Tensor Network Architecture for Community-Based Question Answering. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015. 1305–1311.
- Rao et al. (2016) Jinfeng Rao, Hua He, and Jimmy J. Lin. 2016. Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016. 1913–1916. DOI:http://dx.doi.org/10.1145/2983323.2983872
- Ravasz and Barabási (2003) Erzsébet Ravasz and Albert-László Barabási. 2003. Hierarchical organization in complex networks. Physical Review E 67, 2 (2003), 026112.
- Robertson et al. (1994) Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. In Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994. 109–126.
- Severyn and Moschitti (2015) Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015. 373–382. DOI:http://dx.doi.org/10.1145/2766462.2767738
- Severyn et al. (2014) Aliaksei Severyn, Alessandro Moschitti, Manos Tsagkias, Richard Berendsen, and Maarten de Rijke. 2014. A syntax-aware re-ranker for microblog retrieval. In The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’14, Gold Coast , QLD, Australia - July 06 - 11, 2014. 1067–1070. DOI:http://dx.doi.org/10.1145/2600428.2609511
- Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, 101–110.
- Tay et al. (2017a) Yi Tay, Minh C. Phan, Anh Tuan Luu, and Siu Cheung Hui. 2017a. Learning to Rank Question Answer Pairs with Holographic Dual LSTM Architecture. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017. 695–704. DOI:http://dx.doi.org/10.1145/3077136.3080790
- Tay et al. (2017b) Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2017b. Cross Temporal Recurrent Networks for Ranking Question Answer Pairs. (2017). arXiv:1711.07656
- van der Maaten (2009) Laurens van der Maaten. 2009. Learning a Parametric Embedding by Preserving Local Structure. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009. 384–391.
- Verbeek and Suri (2016) Kevin Verbeek and Subhash Suri. 2016. Metric embedding, hyperbolic space, and social networks. Computational Geometry 59 (2016), 1–12.
- Wan et al. (2016) Shengxian Wan, Yanyan Lan, Jiafeng Guo, Jun Xu, Liang Pang, and Xueqi Cheng. 2016. A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. 2835–2841.
- Wang et al. (2007) Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic. 22–32.
- Yang et al. (2016) Liu Yang, Qingyao Ai, Jiafeng Guo, and W. Bruce Croft. 2016. aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016. 287–296. DOI:http://dx.doi.org/10.1145/2983323.2983818
- Yang et al. (2015) Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. 2013–2018.
- Yao et al. (2013) Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013. Answer Extraction as Sequence Tagging with Tree Edit Distance. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA. 858–867.
- Yih et al. (2013) Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question Answering Using Enhanced Lexical Semantic Models. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers. 1744–1753.
- Yu et al. (2014) Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep Learning for Answer Sentence Selection. CoRR abs/1412.1632 (2014).
- Zhang et al. (2017) Xiaodong Zhang, Sujian Li, Lei Sha, and Houfeng Wang. 2017. Attentive Interactive Neural Networks for Answer Selection in Community Question Answering. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. 3525–3531.