Hyperbolic Representation Learning for Fast and Efficient Neural Question Answering
Abstract.
The dominant neural architectures in question answer retrieval are based on recurrent or convolutional encoders configured with complex word matching layers. Given that recent architectural innovations are mostly new word interaction layers or attention-based matching mechanisms, it seems to be a well-established fact that these components are mandatory for good performance. Unfortunately, the memory and computation costs incurred by these complex mechanisms are undesirable for practical applications. As such, this paper tackles the question of whether it is possible to achieve competitive performance with simple neural architectures. We propose a simple but novel deep learning architecture for fast and efficient question-answer ranking and retrieval. More specifically, our proposed model, HyperQA, is a parameter-efficient neural network that outperforms other parameter-intensive models such as Attentive Pooling BiLSTMs and Multi-Perspective CNNs on multiple QA benchmarks. The novelty behind HyperQA is a pairwise ranking objective that models the relationship between question and answer embeddings in Hyperbolic space instead of Euclidean space. This empowers our model with a self-organizing ability and enables automatic discovery of latent hierarchies while learning embeddings of questions and answers. Our model requires no feature engineering, no similarity matrix matching, no complicated attention mechanisms and no over-parameterized layers, and yet it outperforms or remains competitive with many models that have these functionalities on multiple benchmarks.
1. Introduction
Neural ranking models are commonplace in many modern question answering (QA) systems (Severyn and Moschitti, 2015; He and Lin, 2016). In these applications, the problem of question answering is concerned with learning to rank candidate answers in response to questions. Intuitively, this is reminiscent of document retrieval, albeit with shorter text, which aggravates the long-standing problem of the lexical chasm (Berger et al., 2000). For this purpose, a wide assortment of neural ranking architectures has been proposed. The basic intuition behind many of these models is as follows. First, representations of questions and answers are learned via a neural encoder such as the long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) or a convolutional neural network (CNN). Second, these representations of questions and answers are composed by an interaction function to produce an overall matching score.
The design of the interaction function between question and answer representations lies at the heart of deep learning QA research. While it is possible to combine QA representations with simple feed-forward neural networks or other composition functions (Qiu and Huang, 2015; Tay et al., 2017a), the bulk of recent work is concerned with designing novel word interaction layers that model the relationship between the words in the QA pairs. For example, similarity matrix based matching (Wan et al., 2016), soft attention alignment (Parikh et al., 2016) and attentive pooling (dos Santos et al., 2016) are highly popular techniques for improving the performance of neural ranking models. It appears to be well-established that grid-based matching is essential to good performance. Notably, these innovations come with trade-offs such as a huge computational cost, which leads to significantly longer training times and a larger memory footprint. Additionally, the base neural encoder also contributes to the computational cost of these neural ranking models, e.g., LSTM networks are known to be over-parameterized and also incur a parameter and runtime cost of quadratic scale. It also seems to be a well-established fact that a neural encoder (such as the LSTM, Gated Recurrent Unit (GRU), CNN, etc.) must first be selected for learning individual representations of questions and answers, and it is generally treated as mandatory for good performance.
In this paper, we propose an extremely simple neural ranking model for question answering that achieves highly competitive results on several benchmarks with only a fraction of the runtime and only 40K-90K parameters (as opposed to millions). Our neural ranking model captures the relationships between QA pairs in Hyperbolic space instead of Euclidean space. Hyperbolic space is an embedding space with a constant negative curvature in which the distance towards the border increases exponentially. Intuitively, this makes it suitable for learning embeddings that reflect a natural hierarchy (e.g., networks, text, etc.), which we believe might benefit neural ranking models for QA. Notably, our work is inspired by the recently proposed Poincaré embeddings (Nickel and Kiela, 2017), which demonstrate the effectiveness of inducing a structural (hierarchical) bias in the embedding space for improved generalization. In our early empirical experiments, we discovered that a simple feed-forward neural network trained in Hyperbolic space is capable of outperforming more sophisticated models on several standard benchmark datasets. We believe that this can be attributed to two reasons. Firstly, latent hierarchies are prominent in QA. Aside from the natural hierarchy of questions and answers, conceptual hierarchies also exist. Secondly, natural language is inherently hierarchical, which can be traced to power law distributions such as Zipf's law (Ravasz and Barabási, 2003). The key contributions in this paper are as follows:

We propose a new neural ranking model for ranking question answer pairs. Our proposed model, HyperQA, performs matching of questions and answers in Hyperbolic space. While hyperbolic geometry and embeddings have been explored in the domains of complex networks or graphs (Krioukov et al., 2010), our work is, to the best of our knowledge, the first to model QA pairs in Hyperbolic space and to investigate the suitability of this metric space for question answering.

HyperQA is an extremely fast and parameter-efficient model that achieves very competitive results on multiple QA benchmarks such as TrecQA, WikiQA and YahooCQA. The efficiency and speed of HyperQA are attributable to the fact that we do not use any sophisticated neural encoder and have no complicated word interaction layer. In fact, HyperQA is a mere single layered neural network with only 90K parameters. Very surprisingly, HyperQA actually outperforms many state-of-the-art models such as Attentive Pooling BiLSTMs (dos Santos et al., 2016; Zhang et al., 2017) and Multi-Perspective CNNs (He and Lin, 2016). We believe that this invites us to reconsider whether many of these complex word interaction layers are really necessary for good performance.

We conduct extensive qualitative analysis of both the learned QA embeddings and word embeddings. We discover several interesting properties of QA embeddings in Hyperbolic space. Due to its compositional nature, we find that our model learns to self-organize not only at the QA level but also at the word level. Our qualitative studies enable us to gain a better intuition for the good performance of our model.
2. Related Work
Many prior works have established the fact that there are mainly two key ingredients to a powerful neural ranking model. First, an effective neural encoder and second, an expressive word interaction layer. The first ingredient is often treated as a given, i.e., the top performing models always use a neural encoder such as the CNN or LSTM. In fact, many top performing models adopt convolutional encoders for sentence representation (He et al., 2015; Qiu and Huang, 2015; Severyn and Moschitti, 2015; He and Lin, 2016; Zhang et al., 2017; Shen et al., 2014). The usage of recurrent models is also notable (Mueller and Thyagarajan, 2016; Severyn and Moschitti, 2015; Tay et al., 2017a, b).
The key component in which many recent models differ is the interaction layer. Early works often combined QA embeddings 'as is', i.e., representations are learned first and then combined. For example, Yu et al. (Yu et al., 2014) used CNN representations as feature inputs to a logistic regression model. The end-to-end CNN-based model of Severyn and Moschitti (Severyn and Moschitti, 2015) combines the CNN encoded representations of question and answer using a multi-layered perceptron (MLP). Recently, a myriad of composition functions have been proposed as well, e.g., tensor layers in Qiu et al. (Qiu and Huang, 2015) and holographic layers in Tay et al. (Tay et al., 2017a).
It has recently become fashionable to model the relationships between question and answer using similarity matrices. Intuitively, this enables more fine-grained matching across words in question and answer sentences. The Multi-Perspective CNN (MP-CNN) (He et al., 2015) compared two sentences via a wide diversity of pooling functions and filter widths, aiming to capture 'multi-perspectives' between two sentences. The attention based neural matching (aNMM) model of Yang et al. (Yang et al., 2016) performed soft-attention alignment by first measuring the pairwise word similarity between each word in question and answer. The attentive pooling models of dos Santos et al. (dos Santos et al., 2016) (AP-BiLSTM and AP-CNN) utilized this soft-attention alignment to learn weighted representations of question and answer that are dependent on each other. Zhang et al. (Zhang et al., 2017) extended AP-CNN to 3D tensor-based attentive pooling (AI-CNN). A recent work, the Cross Temporal Recurrent Network (CTRN) (Tay et al., 2017b), proposed a pairwise gating mechanism for joint learning of QA pairs.
Unfortunately, these models introduce a prohibitive computational cost, usually for a very marginal performance gain. Notably, it is easy to see that similarity matrix based matching incurs a computational cost of quadratic scale. Representation ability, such as the dimension size of word or CNN/RNN embeddings, is naturally also quite restricted, i.e., increasing any of these dimensions can cause computation or memory requirements to explode. Moreover, it is not uncommon for models such as AI-CNN or AP-BiLSTM to spend more than minutes on a single epoch on QA datasets that are only medium sized. Let us not forget that these models still have to be extensively tuned, which aggravates the impracticality problem posed by some of these models.
In this paper, we seek a new paradigm for neural ranking for QA. While many recent works try to outstack each other with new layers, we strip down our network instead. Our work is inspired by the very recent Poincaré embeddings (Nickel and Kiela, 2017), which demonstrate the superiority and efficiency of generalization in Hyperbolic space. Moreover, this alleviates many overfitting and complexity issues that Euclidean embeddings might face, especially if the data has intrinsic hierarchical structure. It is good to note that power-law distributions, such as Zipf's law, are known to arise from innate hierarchical structure (Ravasz and Barabási, 2003). Specifically, the defining characteristic of Hyperbolic space is a much quicker expansion relative to that of Euclidean space, which makes it naturally equipped for modeling hierarchical structure. The concept of Hyperbolic spaces has been applied to domains such as complex network modeling (Krioukov et al., 2010), social networks (Verbeek and Suri, 2016) and geographic routing (Kleinberg, 2007).
There are several key geometric intuitions regarding Hyperbolic spaces. Firstly, the concept of distance and area is warped in Hyperbolic spaces. Specifically, each tile in Figure 1(a) is of equal area in Hyperbolic space but diminishes towards zero in Euclidean space towards the boundary. Secondly, Hyperbolic spaces are conformal, i.e., angles in Hyperbolic spaces and Euclidean spaces are identical. In Figure 1(b), the arcs on the curve are parallel lines that are orthogonal to the boundary. Finally, hyperbolic spaces can be regarded as larger spaces relative to Euclidean spaces due to the fact that the concept of relative distance can be expressed much better, i.e., not only does the distance between two vectors encode information but also where a vector is placed in Hyperbolic space. This enables efficient representation learning.
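As a worked illustration of this expansion, consider the distance from the origin to a point of Euclidean norm $r$ in the Poincaré ball. The derivation below is a sketch that assumes the Poincaré distance adopted in Section 3.4 with curvature $-1$; the sample radii are chosen only for illustration.

```latex
% Distance from the origin to a point x with Euclidean norm r = ||x|| < 1
d(0, x) = \operatorname{arcosh}\!\left(1 + \frac{2 r^{2}}{1 - r^{2}}\right)
        = \operatorname{arcosh}\!\left(\frac{1 + r^{2}}{1 - r^{2}}\right)
        = \ln\frac{1 + r}{1 - r}
% Sample values (rounded):
%   r = 0.5   ->  ln 3    ~ 1.10
%   r = 0.9   ->  ln 19   ~ 2.94
%   r = 0.99  ->  ln 199  ~ 5.29
% The distance diverges as r -> 1, so equal Euclidean steps cover ever more
% hyperbolic distance near the boundary.
```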
In Nickel et al. (Nickel and Kiela, 2017), the authors applied the hyperbolic distance (specifically, the Poincaré distance) to model taxonomic entities and graph nodes. Notably, our work, to the best of our knowledge, is the only work that learns QA embeddings in Hyperbolic space. Moreover, questions and answers introduce an interesting layer of complexity to the problem since QA embeddings are in fact compositions of their constituent word embeddings. On the other hand, nodes in a graph and taxonomic entities in (Nickel and Kiela, 2017) are already in their most abstract form, i.e., symbolic objects. As such, we believe it would be interesting to investigate the impact of modeling QA in Hyperbolic space in light of the added compositional nature.
3. Our Proposed Approach
This section outlines the overall architecture of our proposed model. Similar to many neural ranking models for QA, our network has ‘two’ sides with shared parameters, i.e., one for question and another for answer. However, since we optimize for a pairwise ranking loss, the model takes in a positive (correct) answer and a negative (wrong) answer and aims to maximize the margin between the scores of the correct QA pair and the negative QA pair. Figure 2 depicts the overall model architecture.
3.1. Embedding Layer
Our model accepts three sequences as input, i.e., the question (denoted as $q$), the correct answer (denoted as $a$) and a randomly sampled corrupted answer (denoted as $a'$). Each sequence consists of $L$ words, where $L_q$ and $L_a$ are predefined maximum sequence lengths for questions and answers respectively. Each word is represented as a one-hot vector (representing a word in the vocabulary). As such, this layer is a look-up layer that converts each word into a low-dimensional vector by indexing onto the word embedding matrix. In our implementation, we initialize this layer with pretrained word embeddings (Pennington et al., 2014). Note that this layer is not updated during training. Instead, we utilize a projection layer that learns a task-specific projection of the embeddings.
3.2. Projection Layer
In order to learn a task-specific representation for each word, we utilize a projection layer. The projection layer is essentially a single layered neural network that is applied to each word in all three sequences.
\[ x = \sigma(W_p z + b_p) \tag{1} \]
where $W_p \in \mathbb{R}^{d \times n}$, $z \in \mathbb{R}^{n}$, $x \in \mathbb{R}^{d}$ and $\sigma$ is a nonlinear function such as the rectified linear unit (ReLU). The output of this layer is a sequence of $d$-dimensional embeddings for each sequence (question, positive answer and negative answer). Note that the parameters of this projection layer are shared for both question and answer.
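The embedding look-up and the shared projection can be sketched together as follows. This is a minimal NumPy illustration rather than the authors' TensorFlow code; the function name `project_words`, the toy sizes and the random stand-in for pretrained vectors are all assumptions made for the example.

```python
import numpy as np

def project_words(token_ids, embeddings, W_p, b_p):
    """Look up frozen word vectors, then apply x = ReLU(W_p z + b_p) to each word."""
    z = embeddings[token_ids]                  # (L, n): embedding look-up, never updated
    return np.maximum(0.0, z @ W_p.T + b_p)    # (L, d): projection shared by q, a and a'

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
vocab_size, n, d = 50, 8, 4
embeddings = rng.normal(size=(vocab_size, n))  # stands in for pretrained vectors
W_p = rng.normal(scale=0.1, size=(d, n))
b_p = np.zeros(d)

question_ids = np.array([3, 17, 42])           # hypothetical token indices
x_q = project_words(question_ids, embeddings, W_p, b_p)
print(x_q.shape)  # (3, 4)
```

The same `W_p` and `b_p` would be reused for the positive and negative answers, which is what keeps the parameter count independent of the vocabulary size.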
3.3. Learning QA Representations
In order to learn question and answer representations, we simply take the sum of all word embeddings in the sequence.
\[ y^{*} = \sum_{i=1}^{L_{*}} x^{*}_{i} \tag{2} \]
where $* \in \{q, a, a'\}$. $L_{*}$ is the predefined max sequence length (specific to question and answer) and $x^{*}_{1}, \dots, x^{*}_{L_{*}}$ are the $d$-dimensional embeddings of the sequence. This is essentially the neural bag-of-words (NBoW) representation. Unlike popular neural encoders such as the LSTM or CNN, the NBoW representation does not add any parameters and is much more efficient. Additionally, we constrain the question and answer embeddings to the unit ball before passing them to the next layer, i.e., $\|y^{*}\| \le 1$. This is easily done via $y^{*} = y^{*} / \|y^{*}\|$ when $\|y^{*}\| > 1$. Note that this projection of QA embeddings onto the unit ball is mandatory and absolutely crucial for HyperQA to work.
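A minimal sketch of the NBoW composition and the unit-ball constraint is given below. The small margin `eps` is an assumption added for the example: it keeps the vector strictly inside the ball, where the hyperbolic distance is finite, whereas the text only states a rescaling to the unit ball. The function name is likewise hypothetical.

```python
import numpy as np

def nbow_in_unit_ball(x, eps=1e-5):
    """Sum word embeddings (NBoW) and rescale the result if it leaves the unit ball."""
    y = x.sum(axis=0)                  # (d,): no parameters are involved
    norm = np.linalg.norm(y)
    if norm >= 1.0:
        y = y / norm * (1.0 - eps)     # eps is an assumed safety margin
    return y

# A sequence whose sum is already inside the ball is left unchanged.
inside = nbow_in_unit_ball(np.array([[0.1, 0.2], [0.1, 0.1]]))
# A sequence whose sum has norm > 1 is pulled back just inside the boundary.
clipped = nbow_in_unit_ball(np.array([[0.6, 0.0], [0.6, 0.8]]))
```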
3.4. Hyperbolic Representations of QA Pairs
Neural ranking models are mainly characterized by the interaction function between question and answer representations. In our work, we adopt the hyperbolic distance function to model the relationships between questions and answers. (While there exist multiple models of Hyperbolic geometry, such as the Beltrami-Klein model or the Hyperboloid model, we adopt the Poincaré ball / disk due to its ease of differentiability and freedom from constraints (Nickel and Kiela, 2017).) Formally, let $\mathcal{B}^{d} = \{x \in \mathbb{R}^{d} : \|x\| < 1\}$ be the open $d$-dimensional unit ball. Our model corresponds to the Riemannian manifold $(\mathcal{B}^{d}, g_x)$ and is equipped with the Riemannian metric tensor given as follows:
\[ g_x = \left(\frac{2}{1 - \|x\|^{2}}\right)^{2} g^{E} \tag{3} \]
where $g^{E}$ is the Euclidean metric tensor. The hyperbolic distance function between question and answer is defined as:
\[ d(q, a) = \operatorname{arcosh}\left(1 + 2\,\frac{\|q - a\|^{2}}{(1 - \|q\|^{2})(1 - \|a\|^{2})}\right) \tag{4} \]
where $\|\cdot\|$ denotes the Euclidean norm and $q, a \in \mathbb{R}^{d}$ are the question and answer embeddings respectively. Note that $\operatorname{arcosh}$ is the inverse hyperbolic cosine function, i.e., $\operatorname{arcosh}(x) = \ln(x + \sqrt{x^{2} - 1})$. Notably, $d(q, a)$ changes smoothly with respect to the position of $q$ and $a$, which enables the automatic discovery of latent hierarchies. As mentioned earlier, the distance increases exponentially as the norm of the vectors approaches 1. As such, the latent hierarchies of QA embeddings are captured through the norm of the vectors. From a geometric perspective, the origin can be seen as the root of a tree that branches out towards the boundaries of the hyperbolic ball. This self-organizing ability of the hyperbolic distance is visually and qualitatively analyzed in later sections.
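The Poincaré distance can be sketched in a few lines of NumPy. The function name and the sample points are assumptions made for illustration; both inputs are assumed to lie strictly inside the unit ball.

```python
import numpy as np

def poincare_distance(q, a):
    """Poincare distance between two points strictly inside the unit ball."""
    q, a = np.asarray(q, dtype=float), np.asarray(a, dtype=float)
    sq_dist = np.sum((q - a) ** 2)
    alpha = 1.0 - np.sum(q ** 2)
    beta = 1.0 - np.sum(a ** 2)
    return np.arccosh(1.0 + 2.0 * sq_dist / (alpha * beta))

# The same Euclidean gap (0.1) costs far more hyperbolic distance near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
print(near_origin < near_boundary)  # True
```

Because the distance depends on the norms of both points and not only on their difference, where an embedding sits in the ball carries information in addition to how far it is from its partner.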
3.4.1. Gradient Derivation
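Amongst the models of Hyperbolic geometry, the Poincaré distance is readily differentiable. The partial derivative of $d(q, a)$ with respect to $q$ is:
\[ \frac{\partial d(q, a)}{\partial q} = \frac{4}{\beta \sqrt{\gamma^{2} - 1}} \left( \frac{\|a\|^{2} - 2\langle q, a \rangle + 1}{\alpha^{2}}\, q - \frac{a}{\alpha} \right) \tag{5} \]
where $\alpha = 1 - \|q\|^{2}$, $\beta = 1 - \|a\|^{2}$ and $\gamma = 1 + \frac{2}{\alpha \beta} \|q - a\|^{2}$.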
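One way to sanity-check the closed-form derivative of the Poincaré distance is to compare it against central finite differences. The sketch below writes the gradient in terms of alpha = 1 - ||q||^2, beta = 1 - ||a||^2 and gamma = 1 + 2||q - a||^2 / (alpha * beta); the function names, step size and test points are assumptions for the example.

```python
import numpy as np

def poincare_distance(q, a):
    alpha = 1.0 - q @ q
    beta = 1.0 - a @ a
    return np.arccosh(1.0 + 2.0 * np.sum((q - a) ** 2) / (alpha * beta))

def poincare_distance_grad_q(q, a):
    """Closed-form gradient of the Poincare distance with respect to q."""
    alpha = 1.0 - q @ q
    beta = 1.0 - a @ a
    gamma = 1.0 + 2.0 / (alpha * beta) * np.sum((q - a) ** 2)
    coeff = 4.0 / (beta * np.sqrt(gamma ** 2 - 1.0))
    return coeff * (((a @ a - 2.0 * (q @ a) + 1.0) / alpha ** 2) * q - a / alpha)

def numerical_grad_q(q, a, h=1e-6):
    """Central finite differences, used only to check the formula above."""
    grad = np.zeros_like(q)
    for i in range(q.size):
        step = np.zeros_like(q)
        step[i] = h
        grad[i] = (poincare_distance(q + step, a) - poincare_distance(q - step, a)) / (2 * h)
    return grad

q = np.array([0.3, -0.2])
a = np.array([-0.1, 0.4])
analytic = poincare_distance_grad_q(q, a)
numeric = numerical_grad_q(q, a)
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

Note that the gradient is undefined when q equals a, since gamma is then 1 and the square root vanishes.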
3.5. Similarity Scoring Layer
Finally, we pass the hyperbolic distance through a linear transformation described as follows:
\[ s(q, a) = w_f\, d(q, a) + b_f \tag{6} \]
where $w_f \in \mathbb{R}$ and $b_f \in \mathbb{R}$ are scalar parameters of this layer. The design of this layer is empirically motivated and was selected amongst other variants such as , nonlinear activations such as the sigmoid function or the raw hyperbolic distance.
3.6. Optimization and Learning
This section describes the optimization and learning process of HyperQA. Our model learns via a pairwise ranking loss, which is well suited for metric-based learning algorithms.
3.6.1. Pairwise Hinge Loss
Our network minimizes the pairwise hinge loss which is defined as follows:
\[ L = \sum_{(q, a) \in \Delta_q} \; \sum_{(q, a') \notin \Delta_q} \max\left(0,\; s(q, a) + \lambda - s(q, a')\right) \tag{7} \]
where $\Delta_q$ is the set of all QA pairs for question $q$, $s(q, a)$ is the score between $q$ and $a$, and $\lambda$ is the margin which controls the extent of discrimination between positive QA pairs and corrupted QA pairs. The adoption of the pairwise hinge loss is motivated by the good empirical results demonstrated in Rao et al. (Rao et al., 2016). Additionally, we also adopt the mix sampling strategy for sampling negative samples as described in their work.
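The scoring layer and the pairwise hinge loss can be sketched as follows. The sign convention is an assumption made for this example: the score is treated as distance-like (lower means a better match), so the loss is zero once the positive pair scores at least one margin below the corrupted pair. The function names and toy numbers are hypothetical.

```python
import numpy as np

def score(dist, w_f, b_f):
    """Linear scoring layer applied to a hyperbolic distance (scalar parameters)."""
    return w_f * dist + b_f

def pairwise_hinge_loss(s_pos, s_neg, margin):
    """Sum of max(0, s_pos + margin - s_neg) over aligned positive/negative pairs.

    Assumes lower scores indicate better matches.
    """
    s_pos, s_neg = np.asarray(s_pos, dtype=float), np.asarray(s_neg, dtype=float)
    return float(np.sum(np.maximum(0.0, s_pos + margin - s_neg)))

# Two (positive, negative) pairs: the first already satisfies the margin,
# the second violates it by 0.5.
loss = pairwise_hinge_loss(s_pos=[1.0, 2.0], s_neg=[3.0, 2.5], margin=1.0)
print(loss)  # 0.5
```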
3.6.2. Gradient Conversion
Since our network learns in Hyperbolic space, parameters have to be learned via stochastic Riemannian optimization methods such as RSGD (Bonnabel, 2013).
\[ \theta_{t+1} = \mathfrak{R}_{\theta_t}\left(-\eta\, \nabla_R\, \ell(\theta_t)\right) \tag{8} \]
where $\mathfrak{R}_{\theta_t}$ denotes a retraction onto $\mathcal{B}^{d}$ at $\theta_t$. $\eta$ is the learning rate and $\nabla_R\, \ell(\theta_t)$ is the Riemannian gradient with respect to $\theta_t$. Fortunately, the Riemannian gradient can be easily derived from the Euclidean gradient in this case (Bonnabel, 2013). In order to do so, we can simply scale the Euclidean gradient by the inverse of the metric tensor $g_{\theta}^{-1}$. Overall, the final gradients used to update the parameters are:
\[ \nabla_R = \frac{(1 - \|\theta_t\|^{2})^{2}}{4}\, \nabla_E \tag{9} \]
Due to the lack of space, we refer interested readers to (Nickel and Kiela, 2017; Bonnabel, 2013) for more details. For practical purposes, we simply utilize the automatic gradient feature of TensorFlow but convert the gradients with Equation (9) before updating the parameters.
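A single update of this kind can be sketched as below. This is an illustration under stated assumptions, not the authors' TensorFlow code: the Euclidean gradient is rescaled by (1 - ||theta||^2)^2 / 4, and the retraction is approximated by a plain step followed by a projection back inside the ball, in the style of Nickel and Kiela (2017). The function name and the margin `eps` are hypothetical.

```python
import numpy as np

def rsgd_step(theta, euclid_grad, lr, eps=1e-5):
    """One Riemannian SGD step on the Poincare ball (sketch)."""
    scale = (1.0 - theta @ theta) ** 2 / 4.0      # inverse of the metric tensor
    theta_new = theta - lr * scale * euclid_grad  # step along the rescaled gradient
    norm = np.linalg.norm(theta_new)
    if norm >= 1.0:                               # project back strictly inside the ball
        theta_new = theta_new / norm * (1.0 - eps)
    return theta_new

# Near the boundary the scale factor shrinks, so updates become smaller.
updated = rsgd_step(np.array([0.5, 0.0]), np.array([1.0, 0.0]), lr=0.1)
print(updated)  # [0.4859375 0.       ]
```

In an automatic-differentiation framework, the same rescaling would be applied to the computed gradients before the optimizer update.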
4. Experiments
This section describes our empirical evaluation and its results.
4.1. Datasets
In the spirit of experimental rigor, we conduct our empirical evaluation based on four popular and well-studied benchmark datasets for question answering.

YahooCQA - This is a benchmark dataset for community-based question answering that was collected from Yahoo Answers. In this dataset, the answer lengths are relatively longer than in TrecQA and WikiQA. Therefore, we filtered answers that have more than words and less than characters. The train-dev-test splits for this dataset are provided by (Tay et al., 2017a).

WikiQA - This is a recently popular benchmark dataset (Yang et al., 2015) for open-domain question answering based on factual questions from Wikipedia and Bing search logs.

SemEvalCQA - This is a well-studied benchmark dataset from SemEval-2016 Task 3 Subtask A (CQA). This is a real world dataset obtained from Qatar Living Forums. In this dataset, there are ten answers in each question 'thread' which are marked as 'Good', 'Potentially Useful' or 'Bad'. We treat 'Good' as positive and anything else as negative labels.

TrecQA - This is the benchmark dataset provided by Wang et al. (Wang et al., 2007). This dataset was collected from TREC QA tracks 8-13 and is comprised of factoid based questions which mainly answer the 'who', 'what', 'where', 'when' and 'why' types of questions. There are two versions, namely clean and raw, as noted by (Rao et al., 2016), and we evaluate our models on both.
Statistics pertaining to each dataset are given in Table 1.

Table 1. Statistics of datasets.

Statistic     YahooCQA  WikiQA  SemEvalCQA  TrecQA
Train Qns     50.1K     94      4.8K        1229
Dev Qns       6.2K      65      224         82
Test Qns      6.2K      68      327         100
Train Pairs   253K      5.9K    36K         53
Dev Pairs     31.7K     1.1K    2.4K        1.1K
Test Pairs    31.7K     1.4K    3.2K        1.5K
4.2. Compared Baselines
In this section, we introduce the baselines for comparison.

YahooCQA - The key competitors on this dataset are the Neural Tensor LSTM (NTN-LSTM) and HD-LSTM from Tay et al. (Tay et al., 2017a) along with their implementation of the Convolutional Neural Tensor Network (Qiu and Huang, 2015), a vanilla CNN model, and the Okapi BM25 (Robertson et al., 1994) benchmark. Additionally, we also report our own implementations of QA-BiLSTM, QA-CNN, AP-BiLSTM and AP-CNN on this dataset based on our experimental setup.

WikiQA - The key competitors on this dataset are the Paragraph Vector (PV) and PV + Cnt models of Le and Mikolov (Le and Mikolov, 2014), the CNN + Cnt model from Yu et al. (Yu et al., 2014) and LCLR (Yih et al., 2013). These baselines are reported in the original WikiQA paper (Yang et al., 2015), which also includes variations with handcrafted features. Additional strong baselines include QA-BiLSTM and QA-CNN from (dos Santos et al., 2016) along with AP-BiLSTM and AP-CNN, which are attentive pooling improvements of the former. Finally, we also report the Pairwise Ranking MP-CNN from Rao et al. (Rao et al., 2016).

SemEvalCQA - The key competitors on this dataset are the CNN-based ARC-I/II architecture by Hu et al. (Hu et al., 2014), the Attentive Pooling CNN (dos Santos et al., 2016), Kelp (Filice et al., 2016), a feature engineering based SVM method, ConvKN (Barrón-Cedeño et al., 2016), a combination of convolutional tree kernels with CNN, and finally AI-CNN (Attentive Interactive CNN) (Zhang et al., 2017), a tensor-based attentive pooling neural model. A comparison with AI-CNN (with features) is also included.

TrecQA - The key competitors on this dataset are mainly the CNN model of Severyn and Moschitti (S&M) (Severyn and Moschitti, 2015), the Attention-based Neural Matching Model (aNMM) of Yang et al. (Yang et al., 2016), HD-LSTM (Tay et al., 2017a) and the Multi-Perspective CNN (MP-CNN) proposed by He et al. (He et al., 2015). Lastly, we also compare with the pairwise ranking adaptation of MP-CNN (Rao et al., 2016). Additionally, due to the long-standing nature of this dataset, there have been a huge number of works based on traditional feature engineering approaches (Wang et al., 2007; Heilman and Smith, 2010; Severyn et al., 2014; Yao et al., 2013) which we also report. For the clean version of this dataset, we also compare with AP-CNN and QA-BiLSTM/CNN (dos Santos et al., 2016).
Since the training splits are standard, we are able to directly report the results from the original papers.
4.3. Evaluation Protocol
This section describes the key evaluation protocol / metrics and implementation details of our experiments.
4.3.1. Metrics
We adopt a dataset-specific evaluation protocol, following the protocols used in prior work. Specifically, TrecQA and WikiQA adopt the Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) metrics which are commonplace in IR research. On the other hand, YahooCQA and SemEvalCQA evaluate on MAP and Precision@1 (abbreviated P@1), which is determined based on whether the top predicted answer is the ground truth. For all competitor methods, we report the performance results from the original paper.
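For reference, the three metrics can be computed per question as sketched below and then averaged over questions to obtain MAP, MRR and P@1. This sketch assumes that a higher score means a better candidate (a distance-like score would be negated first) and that labels are binary; the function names and toy data are illustrative only.

```python
def rank_labels(scores, labels):
    """Return relevance labels ordered by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def average_precision(ranked):
    """Mean of precision values taken at the rank of each relevant answer."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(ranked):
    """Inverse rank of the first relevant answer, or 0 if none is relevant."""
    for rank, relevant in enumerate(ranked, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def precision_at_1(ranked):
    """1 if the top-ranked answer is relevant, else 0."""
    return 1.0 if ranked and ranked[0] else 0.0

# One question with four candidate answers (toy data).
ranked = rank_labels(scores=[0.2, 0.9, 0.5, 0.1], labels=[1, 0, 1, 0])
print(ranked)                   # [0, 1, 1, 0]
print(reciprocal_rank(ranked))  # 0.5
print(precision_at_1(ranked))   # 0.0
```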
4.3.2. Training Time & Parameter Size
Additionally, we report the parameter size and runtime (seconds per epoch) of selected models. We selectively reimplement some of the key competitors with the best performance and benchmark their training time on our machine/GPU (a single Nvidia GTX1070). For reporting the parameter size and training time, we try our best to follow the hyperparameters stated in the original papers. As such, the same model can have different training time and parameter size on different datasets.
4.3.3. Hyperparameters
HyperQA is implemented in TensorFlow. We adopt the AdaGrad (Duchi et al., 2011) optimizer with initial learning rate tuned amongst . The batch size is tuned amongst . Models are trained for epochs and the model parameters are saved each time the performance on the validation set is topped. The dimension of the projection layer is tuned amongst . L2 regularization is tuned amongst . The negative sampling rate is tuned from to . Finally, the margin is tuned amongst . For TrecQA, WikiQA and YahooCQA, we initialize the embedding layer with GloVe (Pennington et al., 2014) and use the version with and trained on 840 billion words. For SemEvalCQA, we train our own Skip-gram model using the unannotated corpus provided by the task. In this case, the embedding dimension is tuned amongst . Embeddings are not updated during training. For the SemEvalCQA dataset, we concatenated the raw QA embeddings before passing them into the final layer since we found that this improves performance.
4.4. Results and Analysis
In this section, we present our empirical results on all datasets. For all reported results, the best result is in boldface and the second best is underlined.
4.4.1. Experimental Results on WikiQA
Table 2 reports our results on the WikiQA dataset. Firstly, we observe that HyperQA outperforms a myriad of complex neural architectures. Notably, we obtain a clear performance gain of in terms of MAP/MRR against models such as AP-CNN or AP-BiLSTM. Our model also outperforms MP-CNN, which is heavily equipped with parameterized word matching mechanisms. We achieve competitive results relative to the Rank MP-CNN. Finally, HyperQA is extremely efficient and fast, clocking 2s per epoch compared to 33s per epoch for Rank MP-CNN. The parameter cost is also 90K vs 10 million, which is a significant improvement.
Table 2. Experimental results on WikiQA.

Model                       MAP    MRR    #Params  Time
PV                          0.511  0.516  -        -
PV + Cnt                    0.599  0.609  -        -
LCLR                        0.599  0.609  -        -
CNN + Cnt                   0.652  0.665  -        -
QA-BiLSTM (Santos et al.)   0.656  0.670  -        -
QA-CNN (Santos et al.)      0.670  0.682  -        -
AP-BiLSTM (Santos et al.)   0.671  0.684  -        -
AP-CNN (Santos et al.)      0.688  0.696  -        -
MP-CNN (He et al.)          0.693  0.709  10.0M    35s
Rank MP-CNN (Rao et al.)    0.701  0.718  10.0M    33s
HyperQA (This work)         0.712  0.727  90K      2s
4.4.2. Experimental Results on YahooCQA
Table 3 reports the experimental results on YahooCQA. First, we observe that HyperQA outperforms AP-BiLSTM and AP-CNN significantly. Specifically, we outperform AP-BiLSTM, the runner-up model, by in terms of MRR and in terms of MAP. Notably, HyperQA is 32 times faster than AP-BiLSTM and has 20 times fewer parameters. Our approach shows that complicated attentive pooling mechanisms are not necessary for good performance.
Table 3. Experimental results on YahooCQA.

Model                       P@1    MRR    #Params  Time
Random Guess                0.200  0.457  -        -
BM25                        0.225  0.493  -        -
CNN                         0.413  0.632  -        -
CNTN (Qiu et al.)           0.465  0.632  -        -
LSTM                        0.465  0.669  -        -
NTN-LSTM (Tay et al.)       0.545  0.731  -        -
HD-LSTM (Tay et al.)        0.557  0.735  -        -
QA-BiLSTM (Santos et al.)   0.508  0.683  1.40M    440s
QA-CNN (Santos et al.)      0.564  0.727  90.9K    60s
AP-CNN (Santos et al.)      0.560  0.726  540K     110s
AP-BiLSTM (Santos et al.)   0.568  0.731  1.80M    640s
HyperQA (This work)         0.683  0.801  90.0K    20s
4.4.3. Experimental Results on SemEvalCQA
Table 4 reports the experimental results on SemEvalCQA. Our proposed approach achieves highly competitive performance on this dataset. Specifically, we have obtained the best P@1 performance overall, outperforming the state-of-the-art AI-CNN model by in terms of P@1. The performance of our model on MAP falls marginally short of the best performing model. Notably, AI-CNN has benefited from external handcrafted features. As such, comparing AI-CNN (w/o features) with HyperQA shows that our proposed model is a superior neural ranking model. Next, we draw the reader's attention to the time cost of AI-CNN. The training time is 3250s per epoch, which is about 300 times longer than our model. AI-CNN is extremely cost prohibitive, i.e., attentive pooling is already very expensive and yet AI-CNN performs 3D attentive pooling. Evidently, its performance can be easily superseded with a much smaller training time and parameter cost. This raises questions about the effectiveness of the 3D attentive pooling mechanism.
Table 4. Experimental results on SemEvalCQA.

Model                           P@1    MAP    #Params  Time
ARC-I (Hu et al.)               0.741  0.771  -        -
ARC-II (Hu et al.)              0.753  0.780  -        -
AP-CNN (Santos et al.)          0.755  0.771  -        -
Kelp (Filice et al.)            0.751  0.792  -        -
ConvKN (Barrón-Cedeño et al.)   0.755  0.777  -        -
AI-CNN (Zhang et al.)           0.763  0.792  140K     3250s
AI-CNN + Feats (Zhang et al.)   0.769  0.801  140K     3250s
HyperQA (This work)             0.809  0.795  45K      10s
Table 5. Experimental results on TrecQA (raw).

Model                          MAP    MRR    #Params  Time
Wang et al. (2007)             0.603  0.685  -        -
Heilman et al. (2010)          0.609  0.692  -        -
Wang et al. (2010)             0.595  0.695  -        -
Yao (2013)                     0.631  0.748  -        -
Severyn and Moschitti (2013)   0.678  0.736  -        -
Yih et al. (2014)              0.709  0.770  -        -
CNN (Yu et al.)                0.711  0.785  -        -
BLSTM + BM25 (Wang & Nyberg)   0.713  0.791  -        -
CNN (Severyn & Moschitti)      0.746  0.808  -        -
aNMM (Yang et al.)             0.750  0.811  -        -
HD-LSTM (Tay et al.)           0.750  0.815  -        -
MP-CNN (He et al.)             0.762  0.822  10.0M    141s
Rank MP-CNN (Rao et al.)       0.780  0.830  10.0M    130s
HyperQA (This work)            0.770  0.825  90K      12s
Table 6. Experimental results on TrecQA (clean).

Model                          MAP    MRR    #Params  Time
QA-LSTM / CNN (Santos et al.)  0.728  0.832  -        -
AP-CNN (Santos et al.)         0.753  0.851  -        -
MP-CNN (He et al.)             0.777  0.836  10M      141s
Rank MP-CNN (Rao et al.)       0.801  0.877  10M      130s
HyperQA (This work)            0.784  0.865  90K      12s
4.4.4. Experimental Results on TrecQA
Table 5 reports the results on TrecQA (raw). HyperQA achieves very competitive performance on both MAP and MRR metrics. Specifically, HyperQA outperforms the basic CNN model of (S&M) by in terms of MAP/MRR. Moreover, the CNN (S&M) model uses handcrafted features which HyperQA does not require. Similarly, the aNMM model and HD-LSTM also benefit from additional features but are outperformed by HyperQA. HyperQA also outperforms MP-CNN while being around 10 times faster and having 100 times fewer parameters. MP-CNN consists of a huge number of filter banks and utilizes heavy parameterization to match multiple perspectives of questions and answers. On the other hand, our proposed HyperQA is merely a single layered neural network with 90K parameters and yet outperforms MP-CNN. Table 6 reports the results on TrecQA (clean). Here, HyperQA also outperforms MP-CNN, AP-CNN and QA-CNN. On both datasets, the performance of HyperQA is competitive with Rank MP-CNN.
4.4.5. Overall analysis
Overall, we summarize the key findings of our experiments.

It is possible to achieve very competitive performance with small parameterization, and no word matching or interaction layers. HyperQA outperforms complex models such as MPCNN and APBiLSTM on multiple datasets.

The relative performance of HyperQA is significantly better on large datasets, e.g., YahooCQA (253K training pairs) as opposed to smaller ones like WikiQA (5.9K training pairs). We believe that this is due to the fact that Hyperbolic space is seemingly larger than Euclidean space.

HyperQA is extremely fast and trains around 10 times faster than complex models like MPCNN. Note that if CPUs were used instead of GPUs (which speed up convolutions significantly), this disparity would be significantly larger.

Our proposed approach does not require handcrafted features and yet outperforms models that benefit from them. This is evident on all datasets, i.e., HyperQA outperforms the CNN model with features (TrecQA and WikiQA) and AICNN + features on SemEvalCQA.
Table 7. Overall comparison of HyperQA (ours) against other models.
Ours against  Performance   Params     Speed
APBiLSTM      1-7% better   20x less   32x faster
APCNN         1-12% better  Same       3x faster
AICNN         Competitive   3x less    300x faster
MPCNN         1-2% better   100x less  10x faster
Rank MPCNN    Competitive   100x less  10x faster
4.5. Effects of QA Embedding Size
In this section, we study the effects of the QA embedding size on performance. Figure 3 shows the relationship between the QA embedding size and MAP on the WikiQA dataset. Additionally, we include a simple baseline (CosineQA), which is exactly the same as HyperQA but uses cosine similarity instead of hyperbolic distance. The MAP scores of three other models (MPCNN, CNNCnt and PVCnt) are also reported for reference. We first notice the disparity between HyperQA and CosineQA in terms of performance. This is also observed across other datasets but is not reported due to lack of space. While CosineQA maintains stable performance across embedding sizes, the performance of HyperQA improves rapidly as the embedding size increases. In fact, HyperQA with only 45K parameters already performs similarly to the Multi-Perspective CNN (He et al., 2015), which contains 10 million parameters. Moreover, HyperQA outperforms MPCNN at a larger embedding size.
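To make the difference between the two scoring functions concrete, the following is a minimal Python sketch. It assumes that the hyperbolic distance is the standard Poincaré-ball distance used by Nickel and Kiela (2017); the function names and the toy 2-dimensional vectors are illustrative assumptions, and the full model may apply further transformations to the distance before ranking.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(q, a):
    """CosineQA-style score: depends only on the angle between q and a."""
    return dot(q, a) / (math.sqrt(dot(q, q)) * math.sqrt(dot(a, a)))


def poincare_distance(q, a):
    """Distance in the Poincare ball; both inputs must have norm < 1."""
    diff = [x - y for x, y in zip(q, a)]
    denom = (1.0 - dot(q, q)) * (1.0 - dot(a, a))
    return math.acosh(1.0 + 2.0 * dot(diff, diff) / denom)


# Hypothetical embeddings pointing in the same direction, one near the
# origin and one near the boundary of the unit ball.
near_origin = [0.1, 0.0]
near_boundary = [0.9, 0.0]

cos_score = cosine_similarity(near_origin, near_boundary)
hyp_dist = poincare_distance(near_origin, near_boundary)
print("cosine similarity:", cos_score)
print("hyperbolic distance:", hyp_dist)
```

Cosine similarity ignores vector norms, so it treats the two toy vectors as a perfect match. The hyperbolic distance is sensitive to how far each vector lies from the origin, which is the property that allows norms to encode hierarchical levels.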
5. Discussion and Analysis
This section delves into qualitative analysis of our model and aims to investigate the following research questions:

RQ1: Is there any hierarchical structure learned in the QA embeddings? How are QA embeddings organized in the final embedding space of HyperQA?

RQ2: What is the impact of learning compositional embeddings in hyperbolic space? Is there an impact on the constituent word embeddings?

RQ3: Are we able to derive any insight about how word interaction and matching happens in HyperQA?
Table 8. QA pairs parsed word-by-word into hierarchical levels H1 to H5 (based on vector norms).
Question                                 Q/A  H1        H2                       H3                    H4                     H5
What is the gross sale of Burger King    Q    are       sales, today             gross                 is, what               burger, king
                                         A    based     sales, 14, billion, 183  diageo                contributed            burger, corp
What is Florence Nightingale famous for  Q    in, the   for                      famous                what                   florence, nightingale
                                         A    of, in    was                      nursing               founder, modern, born  nightingale, italy
Who is the founder of twitter?           Q    the, of   -                        twitter, founder      -                      who, is
                                         A    and, the  networking, launched     twitter, jack dorsey  match, social          -
Table 9. Examples of words at each hierarchical level (vector norm range) on TrecQA.
Norm of w  Words (w)
0-1        to, and, an, on, in, of, its, the, had, or, go
1-2        be, a, was, up, put, said, but
2-3        judging, returning, volunteered, managing, meant, cited
3-4        responsibility, engineering, trading, prosecuting
4-5        turkish, autonomous, cowboys, warren, seven, what
5-6        ebay, magdalena, spielberg, watson, nova
5.1. Analysis of QA Embeddings
Figure 4(a) shows a visualization of QA embeddings on the test set of TrecQA, projected into 3-dimensional space using t-SNE (van der Maaten, 2009). QA embeddings are extracted from the network as discussed in Section 3.3. We observe that question embeddings form a ‘sphere’ over answer embeddings. In contrast, this is not exhibited when cosine similarity is used, as shown in Figure 4(b). It is important to note that these are embeddings from the test set, which the model has not been trained on, and that the model is never explicitly told whether a particular textual input is a question or an answer. This demonstrates the innate ability of HyperQA to self-organize and learn latent hierarchies, which directly answers RQ1. Additionally, Figure 5(a) shows a histogram of the vector norms of question and answer embeddings. We can clearly see that questions in general have a higher vector norm (footnote 2: we extract QA embeddings right before the constraining / normalization layer) and are at a different hierarchical level from answers. To further understand what the model is doing, we delve deeper into the visualization at the word level.
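A norm histogram of the kind described for Figure 5(a) can be sketched in a few lines of Python. The embeddings below are hypothetical placeholders rather than values from our model, and the bin width is an arbitrary choice; the sketch only illustrates how norms separate the two groups when questions lie farther from the origin.

```python
import math
from collections import Counter


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def norm_histogram(embeddings, bin_width=0.5):
    """Count embeddings per norm bin; keys are the lower bin edges."""
    counts = Counter()
    for vec in embeddings:
        counts[math.floor(norm(vec) / bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical pre-normalization embeddings (not taken from the paper).
question_embs = [[3.0, 4.2], [2.8, 4.1], [0.0, 5.2]]
answer_embs = [[1.0, 1.0], [0.5, 0.7], [1.5, 2.1]]

question_hist = norm_histogram(question_embs)
answer_hist = norm_histogram(answer_embs)
print("questions:", question_hist)
print("answers:  ", answer_hist)
```

With these toy inputs every question falls into a higher norm bin than every answer, mirroring the separation between questions and answers discussed above.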
5.2. Analysis of Word Embeddings
Table 9 shows some examples of words at each hierarchical level of the sphere on TrecQA. Recall that the vector norms (footnote 3: note that word embeddings are not norm-constrained) allow us to infer the distance of a word embedding from the origin, which depicts its hierarchical level in our context. Interestingly, we found that HyperQA exhibits self-organizing ability even at the word level. Specifically, we notice that the words closer to the origin are common words such as ‘to’ and ‘and’, which do not have much semantic value for QA problems. In the middle of the hierarchy, we notice that there are more verbs. Finally, as we move towards the surface of the ‘sphere’, the words become rarer and more domain-specific, such as ‘ebay’ and ‘spielberg’. Moreover, we also found many names and proper nouns occurring at this hierarchical level.
Additionally, we observe that words such as ‘where’ or ‘what’ have relatively high vector norms and are located quite high up in the hierarchy. This is in concert with Figure 4, which shows that the question embeddings form a sphere around the answer embeddings.
Finally, we parsed QA pairs word-by-word according to hierarchical level (based on vector norm). Table 8 reports the outcome of this experiment, where H1 to H5 are hierarchical levels based on vector norms. First, we find that questions often start with the overall context and drill down into more specific query words. Take the first sample in Table 8 as an example: it begins at the top level with ‘burger king’ and then drills down progressively to ‘what is gross sales?’. Similarly, the second example begins with ‘florence nightingale’ and drills down to ‘famous’ at H3, where a match is found with ‘nursing’ at the same hierarchical level. Overall, based on our qualitative analysis, we observe that HyperQA builds two hierarchical structures at the word level (in vector space) towards the middle, which strongly facilitates word-level matching. Pertaining to answers, it seems that the model builds a hierarchy by splitting on conjunctive words (‘and’), i.e., the root node of the tree starts with conjunctive words and splits sentences into semantic phrases.
Overall, Figure 6 depicts our key intuitions regarding the inner workings of HyperQA, which addresses both RQ2 and RQ3. This is also supported by Figure 5(b), which shows that the majority of the word norms are clustered in the middle of the range. This is reasonable considering that the leaf nodes of both question and answer hierarchies would reside in the middle.
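The word-by-word parse by hierarchical level can be illustrated with a small Python sketch. The norm thresholds, the 1-dimensional toy word vectors and the function name parse_by_level are all assumptions made for illustration; the actual norms and level boundaries are not reproduced here.

```python
import math
from collections import defaultdict


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def parse_by_level(tokens, word_vectors, edges):
    """Group tokens into levels H1..Hk using ascending norm thresholds."""
    levels = defaultdict(list)
    for tok in tokens:
        n = norm(word_vectors[tok])
        # Level index = 1 + number of thresholds the norm has reached.
        idx = 1 + sum(n >= e for e in edges)
        levels[f"H{idx}"].append(tok)
    return dict(levels)


# Hypothetical word vectors whose norms are invented for this example.
word_vectors = {
    "what": [3.6], "is": [3.2], "the": [0.4], "gross": [2.5],
    "sale": [1.5], "of": [0.3], "burger": [4.6], "king": [4.8],
}
tokens = ["what", "is", "the", "gross", "sale", "of", "burger", "king"]
edges = [1.0, 2.0, 3.0, 4.0]  # assumed boundaries, giving levels H1..H5

parsed = parse_by_level(tokens, word_vectors, edges)
for level in sorted(parsed):
    print(level, parsed[level])
```

In this toy setting the entity words end up at the outermost level and the function words at the innermost one, which is the kind of grouping that Table 8 reports.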
6. Conclusion
We proposed a new neural ranking model for question answering. Our proposed HyperQA achieves very competitive performance on four well-studied benchmark datasets. Our model is lightweight, fast and efficient, outperforming many state-of-the-art models with complex word interaction layers, attentive mechanisms or rich neural encoders. Our model has only 40K-90K parameters, as opposed to the millions of parameters that plague many competitor models. Moreover, we derive qualitative insights pertaining to our model, which enable us to further understand its inner workings. Finally, we observe that the superior generalization of our model (despite its small number of parameters) can be attributed to the self-organizing properties of not only question and answer embeddings but also word embeddings.
References
 Barrón-Cedeño et al. (2016) Alberto Barrón-Cedeño, Giovanni Da San Martino, Shafiq R. Joty, Alessandro Moschitti, Fahad Al-Obaidli, Salvatore Romeo, Kateryna Tymoshenko, and Antonio Uva. 2016. ConvKN at SemEval-2016 Task 3: Answer and Question Selection for Question Answering on Arabic and English Fora. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016. 896–903.
 Berger et al. (2000) Adam L. Berger, Rich Caruana, David Cohn, Dayne Freitag, and Vibhu O. Mittal. 2000. Bridging the lexical chasm: statistical approaches to answer-finding. In SIGIR. 192–199. DOI:http://dx.doi.org/10.1145/345508.345576
 Bonnabel (2013) Silvere Bonnabel. 2013. Stochastic Gradient Descent on Riemannian Manifolds. IEEE Trans. Automat. Contr. 58, 9 (2013), 2217–2229. DOI:http://dx.doi.org/10.1109/TAC.2013.2254619
 dos Santos et al. (2016) Cícero Nogueira dos Santos, Ming Tan, Bing Xiang, and Bowen Zhou. 2016. Attentive Pooling Networks. CoRR abs/1602.03609 (2016).
 Duchi et al. (2011) John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12 (2011), 2121–2159.
 Filice et al. (2016) Simone Filice, Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2016. KeLP at SemEval-2016 Task 3: Learning Semantic Relations between Questions and Answers. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016. 1116–1123.
 He et al. (2015) Hua He, Kevin Gimpel, and Jimmy J. Lin. 2015. Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. 1576–1586.
 He and Lin (2016) Hua He and Jimmy J. Lin. 2016. Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016. 937–948.
 Heilman and Smith (2010) Michael Heilman and Noah A. Smith. 2010. Tree Edit Models for Recognizing Textual Entailments, Paraphrases, and Answers to Questions. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA. 1011–1019.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
 Hu et al. (2014) Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. 2042–2050.
 Kleinberg (2007) Robert Kleinberg. 2007. Geographic Routing Using Hyperbolic Space. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 6-12 May 2007, Anchorage, Alaska, USA. 1902–1909. DOI:http://dx.doi.org/10.1109/INFCOM.2007.221
 Krioukov et al. (2010) Dmitri V. Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Marián Boguñá. 2010. Hyperbolic Geometry of Complex Networks. CoRR abs/1006.5169 (2010).
 Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). 1188–1196.
 Mueller and Thyagarajan (2016) Jonas Mueller and Aditya Thyagarajan. 2016. Siamese Recurrent Architectures for Learning Sentence Similarity. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. 2786–2792.
 Nickel and Kiela (2017) Maximilian Nickel and Douwe Kiela. 2017. Poincaré Embeddings for Learning Hierarchical Representations. CoRR abs/1705.08039 (2017).
 Parikh et al. (2016) Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. 2249–2255.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. 1532–1543.
 Qiu and Huang (2015) Xipeng Qiu and Xuanjing Huang. 2015. Convolutional Neural Tensor Network Architecture for Community-Based Question Answering. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015. 1305–1311.
 Rao et al. (2016) Jinfeng Rao, Hua He, and Jimmy J. Lin. 2016. Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016. 1913–1916. DOI:http://dx.doi.org/10.1145/2983323.2983872
 Ravasz and Barabási (2003) Erzsébet Ravasz and Albert-László Barabási. 2003. Hierarchical organization in complex networks. Physical Review E 67, 2 (2003), 026112.
 Robertson et al. (1994) Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. In Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994. 109–126.
 Severyn and Moschitti (2015) Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015. 373–382. DOI:http://dx.doi.org/10.1145/2766462.2767738
 Severyn et al. (2014) Aliaksei Severyn, Alessandro Moschitti, Manos Tsagkias, Richard Berendsen, and Maarten de Rijke. 2014. A syntax-aware reranker for microblog retrieval. In The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’14, Gold Coast, QLD, Australia, July 06-11, 2014. 1067–1070. DOI:http://dx.doi.org/10.1145/2600428.2609511
 Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, 101–110.
 Tay et al. (2017a) Yi Tay, Minh C. Phan, Anh Tuan Luu, and Siu Cheung Hui. 2017a. Learning to Rank Question Answer Pairs with Holographic Dual LSTM Architecture. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017. 695–704. DOI:http://dx.doi.org/10.1145/3077136.3080790
 Tay et al. (2017b) Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2017b. Cross Temporal Recurrent Networks for Ranking Question Answer Pairs. (2017). arXiv:1711.07656
 van der Maaten (2009) Laurens van der Maaten. 2009. Learning a Parametric Embedding by Preserving Local Structure. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009. 384–391.
 Verbeek and Suri (2016) Kevin Verbeek and Subhash Suri. 2016. Metric embedding, hyperbolic space, and social networks. Computational Geometry 59 (2016), 1–12.
 Wan et al. (2016) Shengxian Wan, Yanyan Lan, Jiafeng Guo, Jun Xu, Liang Pang, and Xueqi Cheng. 2016. A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. 2835–2841.
 Wang et al. (2007) Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic. 22–32.
 Yang et al. (2016) Liu Yang, Qingyao Ai, Jiafeng Guo, and W. Bruce Croft. 2016. aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016. 287–296. DOI:http://dx.doi.org/10.1145/2983323.2983818
 Yang et al. (2015) Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. 2013–2018.
 Yao et al. (2013) Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013. Answer Extraction as Sequence Tagging with Tree Edit Distance. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA. 858–867.
 Yih et al. (2013) Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question Answering Using Enhanced Lexical Semantic Models. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers. 1744–1753.
 Yu et al. (2014) Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. 2014. Deep Learning for Answer Sentence Selection. CoRR abs/1412.1632 (2014).
 Zhang et al. (2017) Xiaodong Zhang, Sujian Li, Lei Sha, and Houfeng Wang. 2017. Attentive Interactive Neural Networks for Answer Selection in Community Question Answering. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. 3525–3531.