Variational Reasoning for Question Answering with Knowledge Graph
Knowledge graphs (KGs) are known to be helpful for question answering (QA), since they provide well-structured relational information between entities and allow one to further infer indirect facts. However, it is challenging to build QA systems that learn to reason over knowledge graphs from question-answer pairs alone. First, when people ask questions, their expressions are noisy (for example, typos in text, or variations in pronunciation), which makes it non-trivial for the QA system to match the mentioned entities to the knowledge graph. Second, many questions require multi-hop logic reasoning over the knowledge graph to retrieve the answers. To address these challenges, we propose a novel and unified deep learning architecture, together with an end-to-end variational learning algorithm, which can handle noise in questions and learn multi-hop reasoning simultaneously. Our method achieves state-of-the-art performance on a recent benchmark dataset in the literature. We also derive a series of new benchmark datasets, including questions for multi-hop reasoning, questions paraphrased by a neural translation model, and questions in human voice. Our method yields very promising results on all these challenging datasets.
1 Introduction
Question answering (QA) has been a long-standing research problem in Machine Learning and Artificial Intelligence. Thanks to the creation of large-scale knowledge graphs such as DBPedia [1] and Freebase [2], QA systems can be armed with well-structured knowledge on specific and open domains. Many traditional approaches for KG-powered QA are based on semantic parsers [3, 4, 5, 6], which first map a question to a formal meaning representation (e.g. a logical form) and then translate it to a KG query. The answer to the question can be retrieved by executing the query. One disadvantage of these approaches is that the model is not trained end-to-end, so errors may cascade from one stage to the next.
With the recent success of deep learning, some end-to-end solutions based on neural networks have been proposed and show very promising performance on benchmark datasets, such as Memory Networks [7], Key-Value Memory Networks [8] and Gated Graph Sequence Neural Networks [9]. However, these neural approaches treat the KG as a flattened big table of itemized knowledge records, which makes it hard to exploit the structural information in the graph and leaves them weak at logic reasoning. When the answer is not a direct neighbor of the topic entity in the question (i.e. there are multiple hops between the question and answer entities in the KG), logic reasoning over the KG is required, and the neural approaches usually perform poorly. For instance, it is easy to handle single-hop questions like “Who wrote the paper titled …?” by querying itemized knowledge records in triples (paper_title, authored_by, author_name). However, logic reasoning on the KG is required for multi-hop questions such as “Who have co-authored papers with …?”. With the KG, we start from the mentioned author and follow the connecting edges (from the author to their papers, and from those papers to the other authors) to find the answers. A common remedy is the so-called knowledge graph completion: create new relations for non-neighbor entity pairs in the KG [10, 11, 12]. However, multi-hop reasoning is combinatorial in nature, i.e. the number of multi-hop relations grows explosively as the number of hops increases. For example, if we create new relation types like friend-of-friend and friend-of-friend-of-friend, the number of edges in the KG will explode, which is intractable for both storage and computation.
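To make the two-hop co-author example concrete, the sketch below answers it by direct traversal over (paper_title, authored_by, author_name) triples. The toy triples and the helper name `co_authors` are hypothetical and serve only as an illustration; this is not the method proposed in the paper.

```python
def co_authors(triples, author):
    """Two-hop traversal: author -> their papers -> other authors of those papers."""
    # Hop 1: papers written by the given author.
    papers = {p for (p, rel, a) in triples if rel == "authored_by" and a == author}
    # Hop 2: everyone else who authored one of those papers.
    return {a for (p, rel, a) in triples
            if rel == "authored_by" and p in papers and a != author}


# Hypothetical toy knowledge graph.
toy_triples = [
    ("paper_1", "authored_by", "alice"),
    ("paper_1", "authored_by", "bob"),
    ("paper_2", "authored_by", "alice"),
    ("paper_2", "authored_by", "carol"),
    ("paper_3", "authored_by", "dave"),
]
alice_co_authors = co_authors(toy_triples, "alice")
```

Traversing at query time avoids storing an explicit co_author edge for every author pair, which is the storage blow-up that knowledge graph completion runs into as the number of hops grows.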
Another key challenge is how to locate topic entities in the KG. Most existing works assume that the topic entity in the question can be located by simple string matching [8, 13, 9, 5], which is often not true. When people ask questions, either in text or in speech, various kinds of noise can be introduced in the expressions. For example, people are likely to make typos or use ambiguous names in a question. In the even harder case of audio questions, the same entity may be pronounced differently in different questions, even by the same person. Due to this noise, it is hard to locate topic entities by exact matching. For text questions, broad matching techniques (e.g. hand-crafted rules, regular expressions, edit distance, etc.) are widely used for entity recognition [14]. However, they require domain experts and a lot of human effort. For speech questions, it is even harder to match topic entities directly. Most existing QA systems first perform speech recognition to convert the audio to text, and then match entities in the text. Unfortunately, speech recognition systems typically have a high error rate when recognizing entities in voice, such as human names or street addresses. Since this pipeline is not end-to-end, the errors of the speech recognition system may cascade to the downstream QA system.
Typically, the training data for a QA system is provided as question-answer pairs, where fine-grained annotation of these pairs is not available, or is only available for a few of them. More specifically, there are very few explicit annotations of the exact entity present in the question, the type of the question, and the exact logic reasoning steps along the knowledge graph leading to the answer. Thus it is challenging to simultaneously learn to locate the topic KG entity in the question and figure out the unknown reasoning steps pointing to the answer, based on training question-answer pairs alone.
To address the challenges mentioned above, we propose an end-to-end learning framework for question answering with knowledge graph, named variational reasoning network (VRN), which has the following new features:
- We build a probabilistic modeling framework for an end-to-end QA system, which can simultaneously handle an uncertain topic entity and multi-hop reasoning.
- We propose a novel propagation-like deep learning architecture over the knowledge graph to perform logic inference in the probabilistic model.
- We apply the REINFORCE algorithm with a variance reduction technique to make the system end-to-end trainable.
- We derive a series of new challenging benchmark datasets, MetaQA (MoviE Text Audio QA), intended for research on question-answering systems. The MetaQA dataset collection is publicly available at https://goo.gl/f3AmcY. These datasets contain over 400K questions for both single- and multi-hop reasoning. To test QA systems in more realistic (and more difficult) scenarios, MetaQA also provides neural-translation-model-paraphrased datasets and text-to-speech-based audio datasets.
Extensive experiments show that our method achieves state-of-the-art performance on both single- and multi-hop datasets, demonstrating its capability for multi-hop reasoning. Moreover, we obtain promising results on the challenging audio QA datasets, showing the effectiveness of the end-to-end learning framework. With the rise of virtual assistant tools (e.g. Alexa, Cortana, Google Assistant and Siri), QA systems are now even closer to our daily life. This paper is one step towards more realistic QA systems, which can handle noisy question input in both text and speech, and learn from examples to reason over the knowledge graph.
2 Related Work
QA with semantic parser: Most traditional approaches for KG-powered QA are based on semantic parsers, which map the question to a certain meaning representation or logical form [3, 4, 15, 5, 6], or directly map the question to an executable program. These approaches require domain-specific grammars, rules, or fine-grained annotations. Also, they are not designed to handle noisy questions, and do not support end-to-end training since they use separate stages for question parsing and logic reasoning.
Neural approaches for QA: The family of memory networks achieves state-of-the-art performance in various kinds of QA tasks. Some of them are able to reason within a local context [17, 18] using an attention mechanism. For QA with KG, Miller et al. [8] achieve state-of-the-art performance, outperforming previous works [20, 7] on benchmark datasets. Recent work uses a neural programmer model for QA over a single knowledge table. However, the multi-hop reasoning capability of these approaches depends on recurrent attention, and there is no explicit traversal over the KG.
Graph embedding: Recently, researchers have built deep architectures to embed structured data, such as trees [22, 23, 24] or graphs [25, 26, 27]. Some works [9, 28] also extend this to sequential settings such as multi-step reasoning. However, these approaches only work on small instances like sentences or molecules. Instead, our work embeds the reasoning graph from the source entity to every target entity in a large-scale knowledge graph.
Multi-hop reasoning: There are some other works on knowledge graph completion with traversal, which require path sampling [12, 29] or dynamic programming. Our work can handle QA with natural language or human speech, and the reasoning-graph embeddings can represent complicated reasoning rules.
In summary, most of the existing approaches have separate stages for entity locating, such as keyword matching, frequency-based methods, and domain-specific methods. Since these stages are not jointly trained with the reasoning part, the errors in entity locating (e.g. an incorrectly recognized named entity from a speech recognition system) will cascade to the downstream QA system.
3.1 Problem definition
Knowledge base/graph (KG): A knowledge graph is a directed graph where the entities and their relations are represented by nodes and edges, respectively, i.e. $G(V, E)$. Furthermore, each edge from $E$ is a triplet $(a_i, r, a_j)$, representing a directed relation $r$ between subject entity $a_i$ and object entity $a_j$, both from the node set $V$. Each entity in the knowledge graph can also contain additional information such as type and text description. For instance, entity $a_i$ is described as actor Jennifer Lawrence, and entity $a_j$ is movie Passengers. Then a relation in the knowledge graph can be (Jennifer Lawrence, acted_in, Passengers), where the corresponding $r$ is acted_in. In this work, we assume that the knowledge graph is given.
Question answering with KG: Given a question $q$, the algorithm is asked to output an entity $a$ in the knowledge graph which properly answers the question. For example, $q$ can be a question like “who acted in the movie Passengers?”, and one possible answer $a$ is Jennifer Lawrence, which is an entity in the KG. In a more challenging setting, $q$ can even be an audio segment reading the same question. The training set $\{(q_i, a_i)\}_{i=1}^{N}$ contains $N$ pairs of questions and answers. Note that fine-grained annotation is not present, such as the exact entity present in the question, the question type, or the exact logic reasoning steps along the knowledge graph leading to the answer. Thus, a QA system with KG should be able to handle noisy entities in questions and learn multi-hop reasoning directly from question-answer pairs.
3.2 Overall formulation
To address both key challenges in a unified probabilistic framework, we propose the variational reasoning network (VRN). The overall architecture is shown in Fig 1. VRN consists of two probabilistic modules, as described below.
Module for topic entity recognition: Recognizing the topic entity (or the entity mentioned in the question) is the first step in performing logic reasoning over the knowledge graph. (In this paper, we consider the case with a single topic entity in each question.) For example, the topic entity mentioned in Sec 3.1 is the movie Passengers. We denote such an entity as $y$, and model the compatibility of this entity with the question $q$ as a probabilistic model $P_{\theta_1}(y|q)$, which gives the probability of the KG entity $y$ being mentioned in the question $q$. Depending on the question form (text or audio), the parameterization of $P_{\theta_1}(y|q)$ may be different; details can be found in Sec 3.3.
Module for logic reasoning over knowledge graph: Given the topic entity $y$, one needs to reason over the knowledge graph to find the answer $a$. As described in Sec 3.1, the algorithm should learn to use the reasoning rule for that question. Since there are no annotations for such reasoning steps, the QA system has to learn them only from question-answer pairs. Thus we model the likelihood of an answer $a$ being correct given entity $y$ and question $q$ as $P_{\theta_2}(a|y,q)$. The parameterization of $P_{\theta_2}(a|y,q)$ needs to capture traversal or reasoning over the knowledge graph, which is explained in detail in Sec 3.4.
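Since the topic entity in the question is not annotated, it is natural to formulate the problem by treating the topic entity $y$ as a latent variable. With the two probabilistic components above, we model the probability of answer $a$ being correct given question $q$ as $\sum_{y \in V} P_{\theta_1}(y|q)\, P_{\theta_2}(a|y,q)$, which sums out all possibilities of the latent variable. Given a training set $\{(q_i, a_i)\}_{i=1}^{N}$ of $N$ question-answer pairs, the set of parameters $\theta_1$ and $\theta_2$ can be estimated by maximizing the log-likelihood of this latent variable model:
$$\max_{\theta_1, \theta_2} \; \frac{1}{N} \sum_{i=1}^{N} \log \Big( \sum_{y \in V} P_{\theta_1}(y|q_i)\, P_{\theta_2}(a_i|y,q_i) \Big) \tag{1}$$
Next we will describe our parametrization of $P_{\theta_1}(y|q)$ and $P_{\theta_2}(a|y,q)$, and the algorithms for learning and inference based on that.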
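As a minimal numerical sketch of the latent-variable likelihood, the snippet below marginalizes over candidate topic entities with a log-sum-exp for numerical stability. The entity names, probabilities and the function name `log_marginal_likelihood` are hypothetical toy choices, not values from the paper.

```python
import math


def log_marginal_likelihood(log_p_entity, log_p_answer_given_entity):
    """Compute log sum_y P(y|q) * P(a|y,q) stably with log-sum-exp."""
    terms = [lp_y + log_p_answer_given_entity[y] for y, lp_y in log_p_entity.items()]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical toy distributions for one question-answer pair.
log_p_entity = {"y1": math.log(0.7), "y2": math.log(0.3)}   # log P(y|q)
log_p_answer = {"y1": math.log(0.9), "y2": math.log(0.1)}   # log P(a|y,q)
log_likelihood = log_marginal_likelihood(log_p_entity, log_p_answer)
```

Here the marginal probability is 0.7 * 0.9 + 0.3 * 0.1 = 0.66; averaging such log-likelihoods over the training pairs gives the objective being maximized.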
3.3 Probabilistic module for topic entity recognition
Most existing QA approaches assume that topic entities are annotated, or can be simply found via string matching. However, for more realistic questions or even audio questions, a more general approach is to build a recognizer that can be trained jointly with the logic reasoning engine.
To handle unlabeled topic entities, we notice that the full context of the question can be helpful. For example, Michael could either be the name of a movie or an actor. It is hard to tell which one relates to the question by merely looking at this entity name. However, we should be able to resolve the unique entity by checking the surrounding words in the question. Similarly, in the knowledge graph there could be multiple entities with the same name, but the connected edges (relations) of the entity nodes are different, which helps to resolve the unique entity. For example, as a movie name, Michael may be connected with a directed_by edge pointing to an entity of director; while as an actor name, Michael may be connected with birthday and height edges.
Specifically, we use a neural network $f_{ent}(\cdot)$ which can represent the question $q$ in a $d$-dimensional vector. Depending on the question form (text or audio), this neural network can be a simple embedding network mapping bag-of-words to a vector, a recurrent neural network to embed sentences, or a convolutional neural network to embed audio questions. Thus the probability of having $y$ in $q$ is
$$P_{\theta_1}(y|q) = \mathrm{softmax}\big(W_y^\top f_{ent}(q)\big) = \frac{\exp\big(W_y^\top f_{ent}(q)\big)}{\sum_{y' \in V} \exp\big(W_{y'}^\top f_{ent}(q)\big)} \tag{3}$$
where $W_y$, $y \in V$, are the weights in the last classification layer. This parameterization avoids heuristic keyword matching for the entity as is done in previous work [8, 20], and makes the entity recognition process differentiable and end-to-end trainable.
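The softmax entity recognizer can be sketched in a few lines. As a simplifying assumption, the question encoder here is just a raw bag-of-words count vector (the paper's encoder is a learned network), and the vocabulary, the per-entity weight vectors and the names `bag_of_words` and `entity_probs` are hypothetical.

```python
import math


def bag_of_words(question, vocab):
    """Map a whitespace-tokenized question to a count vector over `vocab`."""
    vec = [0.0] * len(vocab)
    for token in question.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec


def entity_probs(question_vec, entity_weights):
    """Softmax over entities of the inner product between weights and question vector."""
    scores = {y: sum(w * x for w, x in zip(w_y, question_vec))
              for y, w_y in entity_weights.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


# Hypothetical vocabulary and hand-picked weights for two entities named "Michael".
vocab = {"directed": 0, "acted": 1, "michael": 2}
entity_weights = {
    "Michael (movie)": [2.0, 0.0, 1.0],
    "Michael (actor)": [0.0, 2.0, 1.0],
}
probs = entity_probs(bag_of_words("who directed michael", vocab), entity_weights)
```

Both entities match the token "michael" equally, but the context word "directed" shifts the probability mass toward the movie, which is the kind of disambiguation by surrounding words described above.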
3.4 Probabilistic module for logic reasoning over knowledge graph
Parameterizing the reasoning model $P_{\theta_2}(a|y,q)$ is challenging, since 1) the knowledge graph can be very large; 2) the required logic reasoning is unknown and can be multi-step. In other words, retrieving the answer requires multi-step traversal over a gigantic graph. Thus in this paper, we propose a reasoning-graph embedding architecture, where all the inference rules and their complex combinations are represented as nonlinear embeddings in vector space and are learned from data.
Scope $G_y$ of $y$. More specifically, we assume the maximum number of steps (or hops), $T$, of the logic reasoning is known to the algorithm. Starting from a topic entity $y$, we perform topological sort (ignoring the original edge direction) for all entities within $T$ hops according to the knowledge graph. After that, we get an ordered list of entities and their relations from the knowledge graph. We call this subgraph with ordered nodes the scope of $y$, denoted $G_y$. Fig 2 shows an example of a 2-hop scope, where entities are labeled with their topological distance to the source entity.
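One way to read the scope construction is as a breadth-first traversal that ignores edge direction and stops at the hop limit, which yields both the ordered entity list and each entity's distance to the topic entity. The sketch below follows that reading; the function name `build_scope` and the toy triples are hypothetical, and the paper's own ordering procedure may differ in detail.

```python
from collections import deque


def build_scope(triples, topic, max_hops):
    """Return (ordered entities, hop distance per entity) within `max_hops` of `topic`."""
    # Undirected adjacency, since the original edge direction is ignored.
    neighbors = {}
    for subj, _rel, obj in triples:
        neighbors.setdefault(subj, []).append(obj)
        neighbors.setdefault(obj, []).append(subj)

    dist = {topic: 0}
    order = [topic]
    queue = deque([topic])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                order.append(nxt)
                queue.append(nxt)
    return order, dist


# Hypothetical chain A - B - C - D; with 2 hops, D falls outside the scope of A.
toy_triples = [("A", "r1", "B"), ("B", "r2", "C"), ("C", "r3", "D")]
order, dist = build_scope(toy_triples, "A", max_hops=2)
```

Because entities are appended in order of discovery, every entity appears after at least one neighbor that is one hop closer to the topic entity, which is the property a forward pass over the scope relies on.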
Reasoning graph $G_{y \to a}$ to $a$. Given a potential answer $a$ in the scope $G_y$, we denote $G_{y \to a}$ to be the minimum subgraph that contains all the paths from $y$ to $a$ in $G_y$. The actual logic reasoning leading to answer $a$ for question $q$ is unknown but hidden in the reasoning graph. Thus we will learn a vector representation (or embedding) for $G_{y \to a}$, denoted as $g(G_{y \to a})$, for scoring the compatibility of the question type and the hidden path in the reasoning graph.
More specifically, suppose the question is embedded using a neural network $f_{qt}(q)$, which captures the question type and implies the type of logic reasoning we need to perform over the knowledge graph. Then the compatibility (or likelihood) of answer $a$ being correct can be computed using the embedded reasoning graph and the scope $G_y$ as
$$P_{\theta_2}(a|y,q) = \mathrm{softmax}\big(f_{qt}(q)^\top g(G_{y \to a})\big) = \frac{\exp\big(f_{qt}(q)^\top g(G_{y \to a})\big)}{\sum_{a' \in V(G_y)} \exp\big(f_{qt}(q)^\top g(G_{y \to a'})\big)} \tag{4}$$
We note that the normalization in the likelihood requires the embedding of the reasoning graphs for all entities in the scope $G_y$. This may involve thousands of reasoning graphs or even more, depending on the KG and the number of hops. Computing these embeddings separately can be very computationally expensive. Instead, we develop a neural architecture which can compute these embeddings jointly and share intermediate computations.
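Joint embedding of reasoning graphs. More specifically, we propose a “forward graph embedding” architecture, which is analogous to forward filtering in a Hidden Markov Model or Bayesian Network. The embedding of the reasoning graph for $a$ is computed recursively using its parents’ embeddings:
$$g(G_{y \to a}) = \frac{1}{\#\mathrm{Parent}(a)} \sum_{\substack{a_j \in \mathrm{Parent}(a) \\ (a_j, r, a)\ \text{or}\ (a, r, a_j) \in G_y}} \sigma\big(M \times [\, g(G_{y \to a_j}),\ \vec{e}_r \,]\big) \tag{6}$$
where $\vec{e}_r$ is the one-hot encoding of relation type $r$, $M$ are the model parameters, $\sigma(\cdot)$ is a nonlinear function such as ReLU, and $\#\mathrm{Parent}(a)$ counts the number of parents of $a$ in $G_y$. The only boundary case is the topic entity itself, $a = y$, where $g(G_{y \to y}) = \vec{0}$. Overall, computing the embedding for all $a \in V(G_y)$ takes $O(|V(G_y)| + |E(G_y)|)$ time, which is proportional to the number of nodes and edges in the scope $G_y$.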
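The following sketch shows one forward pass that embeds the reasoning graph of every entity in a scope and then scores candidate answers with a softmax. It rests on several assumptions that are not taken from the paper: a parent is any neighbor one hop closer to the topic entity, the topic entity starts from a zero vector, messages are averaged, and the weight matrix, question vector and toy graph are hand-picked for illustration.

```python
import math


def relu(vec):
    return [max(0.0, x) for x in vec]


def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def forward_embeddings(order, dist, triples, rel_index, weights, dim):
    """Embed the reasoning graph of every scope entity in a single ordered pass."""
    # Assumed boundary case: the topic entity (first in `order`) gets a zero embedding.
    g = {order[0]: [0.0] * dim}
    # Assumption: a parent of `a` is a neighbor one hop closer to the topic entity.
    parents = {a: [] for a in order}
    for subj, rel, obj in triples:
        if subj in dist and obj in dist:
            if dist[subj] + 1 == dist[obj]:
                parents[obj].append((subj, rel))
            elif dist[obj] + 1 == dist[subj]:
                parents[subj].append((obj, rel))
    for a in order[1:]:  # parents always precede their children in `order`
        messages = []
        for parent, rel in parents[a]:
            one_hot = [0.0] * len(rel_index)
            one_hot[rel_index[rel]] = 1.0
            # Nonlinear map of [parent embedding, one-hot relation].
            messages.append(relu(matvec(weights, g[parent] + one_hot)))
        g[a] = [sum(col) / len(messages) for col in zip(*messages)]
    return g


def answer_probs(question_vec, g):
    """Softmax over scope entities of <question embedding, reasoning-graph embedding>."""
    scores = {a: sum(q * x for q, x in zip(question_vec, emb)) for a, emb in g.items()}
    top = max(scores.values())
    exps = {a: math.exp(s - top) for a, s in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


# Hypothetical 2-hop scope: topic "y", two 1-hop entities, one 2-hop entity "d".
order = ["y", "b", "c", "d"]
dist = {"y": 0, "b": 1, "c": 1, "d": 2}
triples = [("y", "r1", "b"), ("y", "r2", "c"), ("b", "r3", "d"), ("c", "r3", "d")]
rel_index = {"r1": 0, "r2": 1, "r3": 2}
weights = [[1.0, 0.0, 1.0, 0.0, 0.0],   # dim x (dim + number of relations)
           [0.0, 1.0, 0.0, 1.0, 1.0]]
g = forward_embeddings(order, dist, triples, rel_index, weights, dim=2)
probs = answer_probs([0.0, 1.0], g)
```

Each edge is used once and each entity is visited once, so the pass is linear in the size of the scope. The entity "d" combines the messages from both of its parents, mirroring how an entity reached along two paths aggregates both.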
This formulation is able to capture various reasoning rules. Take Fig 2 as an example: the embedding of the entity Killing Them Softly sums up the two embeddings propagated from its parents. Thus it tends to match the reasoning paths from the parent entities. Note that this formulation is significantly different from the work in [25, 26, 27], where an embedding is computed for each small molecular graph separately. Furthermore, those graph embedding methods often contain iterative processes which visit each node multiple times.
4 End-to-end Learning
In this section, we describe the algorithm for learning the parameters in $P_{\theta_1}(y|q)$ and $P_{\theta_2}(a|y,q)$. The overall learning algorithm is described in Algorithm 1.
4.1 Variational method with inverse reasoning-graph embedding
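The EM algorithm is often used to learn latent variable models. However, performing exact EM updates for the objective in (1) is intractable since the posterior cannot be computed in closed form. Instead, we use variational inference and optimize the negative Helmholtz variational free energy:
$$\max_{\varphi, \theta_1, \theta_2} L(\varphi, \theta_1, \theta_2) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{Q_\varphi(y|q_i,a_i)} \big[ \log P_{\theta_1}(y|q_i) + \log P_{\theta_2}(a_i|y,q_i) - \log Q_\varphi(y|q_i,a_i) \big] \tag{7}$$
where the variational posterior $Q_\varphi(y|q,a)$ is jointly learned with the model. Note that (7) is essentially optimizing a lower bound of (1). Thus, to reduce the approximation error, a powerful family of posterior distributions is necessary.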
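The lower-bound property can be checked numerically on a toy example: for any distribution over the latent entity, the variational objective stays below the exact log-likelihood, and the two coincide when that distribution equals the true posterior. All probabilities and names below are hypothetical.

```python
import math


def elbo(q_post, p_entity, p_answer):
    """E_Q[log P(y|q) + log P(a|y,q) - log Q(y|q,a)] for a discrete latent y."""
    return sum(q * (math.log(p_entity[y]) + math.log(p_answer[y]) - math.log(q))
               for y, q in q_post.items() if q > 0.0)


# Hypothetical toy model for one question-answer pair.
p_entity = {"y1": 0.7, "y2": 0.3}   # P(y|q)
p_answer = {"y1": 0.9, "y2": 0.1}   # P(a|y,q)
log_marginal = math.log(sum(p_entity[y] * p_answer[y] for y in p_entity))

uniform_q = {"y1": 0.5, "y2": 0.5}
exact_q = {y: p_entity[y] * p_answer[y] / math.exp(log_marginal) for y in p_entity}

bound_uniform = elbo(uniform_q, p_entity, p_answer)
bound_exact = elbo(exact_q, p_entity, p_answer)
```

The gap between the bound and the log-likelihood is the KL divergence from the variational posterior to the true posterior, which is why a more expressive posterior family tightens the bound.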
Variational posterior. $Q_\varphi(y|q,a)$ computes the likelihood of the topic entity $y$ for a question $q$, with the additional information of the answer $a$. Thus besides the direct text or acoustic compatibility of $y$ and $q$, we can also introduce a logic match with the help of $a$. Similar to the forward propagation architecture used in Sec 3.4, here we can define the scope $G_a$ for answer $a$, the inverse reasoning graph $G_{a \to y}$, and the inverse embedding architecture to efficiently compute the embedding $\tilde{g}(G_{a \to y})$. Finally, the variational posterior consists of two parts:
$$Q_\varphi(y|q,a) \propto \exp\big( \tilde{W}_y^\top \tilde{f}_{ent}(q) + \tilde{f}_{qt}(q)^\top \tilde{g}(G_{a \to y}) \big) \tag{8}$$
where the normalization is done over all entities $y$ in the scope $G_a$. Furthermore, the embedding operators $\tilde{f}_{ent}$, $\tilde{f}_{qt}$, $\tilde{g}$ and the parameters $\tilde{W}$ are defined in the same way as in (4) and (6), but with a different set of parameters. One can also share the parameters to obtain a more compact model.
4.2 REINFORCE with variance reduction
Since the latent variable $y$ in the variational objective (7) takes discrete values, so that sampling it is not differentiable with respect to $\varphi$, we use the REINFORCE algorithm with variance reduction to tackle this problem.
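First, using the likelihood ratio trick, the gradient of $L$ with respect to the posterior parameters $\varphi$ can be computed as (for simplicity of notation, we assume that there is only one training instance, i.e. $N = 1$):
$$\nabla_\varphi L = \mathbb{E}_{Q_\varphi(y|q,a)} \big[ \nabla_\varphi \log Q_\varphi(y|q,a)\, A(y,a,q) \big] \tag{9}$$
where $A(y,a,q) = \log P_{\theta_1}(y|q) + \log P_{\theta_2}(a|y,q) - \log Q_\varphi(y|q,a)$ can be treated as the learning signal in policy gradient.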
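Second, to reduce the variance of the gradient, we center and normalize the signal $A(y,a,q)$ and also subtract a baseline function $c(a,q)$. Finally, the gradient in (9) can be approximated by the Monte Carlo method using $K$ samples $y_j$ of the latent variable from $Q_\varphi(y|q,a)$:
$$\nabla_\varphi L \approx \frac{1}{K} \sum_{j=1}^{K} \nabla_\varphi \log Q_\varphi(y_j|q,a) \Big( \frac{A(y_j,a,q) - \mu}{\sigma} - c(a,q) \Big)$$
where $\mu$ and $\sigma$ estimate the mean and standard deviation of $A(y,a,q)$ with a moving average. $c(a,q)$ is another neural network that fits the expected normalized learning signal. In our experiments, we simply build a two-layer perceptron with concatenated one-hot answer and question features. Here $c(a,q)$ tries to fit $(A(y,a,q) - \mu)/\sigma$ by minimizing the square loss. For the other parameters $\theta_1$ and $\theta_2$ in $P_{\theta_1}(y|q)$ and $P_{\theta_2}(a|y,q)$ respectively, the gradients are computed in the normal way.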
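A minimal sketch of the score-function estimator with a centered and normalized learning signal is given below. To keep it self-contained, the variational posterior is a plain softmax over a few logits and the baseline is a constant scalar, whereas the paper uses a learned network for it; the decay rate, the `eps` term and all names are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MovingStats:
    """Moving-average estimates of the mean and variance of the learning signal."""

    def __init__(self, decay=0.9):
        self.decay, self.mean, self.var = decay, 0.0, 1.0

    def update(self, signals):
        batch_mean = sum(signals) / len(signals)
        batch_var = sum((s - batch_mean) ** 2 for s in signals) / len(signals)
        self.mean = self.decay * self.mean + (1 - self.decay) * batch_mean
        self.var = self.decay * self.var + (1 - self.decay) * batch_var

    def normalize(self, signal, eps=1e-8):
        return (signal - self.mean) / (math.sqrt(self.var) + eps)


def reinforce_gradient(logits, log_joint, num_samples, stats, baseline, rng):
    """Monte Carlo gradient of the variational objective w.r.t. the posterior logits."""
    q = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=q, k=num_samples)
    # Learning signal: log P(y|q) + log P(a|y,q) - log Q(y|q,a) for each sample.
    signals = [log_joint[y] - math.log(q[y]) for y in samples]
    stats.update(signals)
    grad = [0.0] * len(logits)
    for y, signal in zip(samples, signals):
        advantage = stats.normalize(signal) - baseline
        for i in range(len(logits)):
            indicator = 1.0 if i == y else 0.0
            # d log softmax(logits)[y] / d logits[i] = indicator - q[i]
            grad[i] += advantage * (indicator - q[i]) / num_samples
    return grad


rng = random.Random(0)
stats = MovingStats()
# Hypothetical log P(y|q) + log P(a|y,q) for three candidate entities.
log_joint = [math.log(0.63), math.log(0.03), math.log(0.01)]
grad = reinforce_gradient([0.0, 0.0, 0.0], log_joint, 16, stats, baseline=0.0, rng=rng)
```

Subtracting a quantity that does not depend on the sampled entity leaves the estimator unbiased in expectation while shrinking its variance, which is the motivation for both the centering and the baseline.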
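5 Inference
During inference, we are only given the question $q$, and ideally we want to find the answer by computing $\arg\max_{a} \log \big( \sum_{y} P_{\theta_1}(y|q)\, P_{\theta_2}(a|y,q) \big)$. However, this computation is quadratic in the number of entities and thus too expensive. Alternatively, we can approximate it via beam search. So we select $k$ candidate entities $Y_k$ with top scores from $P_{\theta_1}(y|q)$, and then the answer is given by
$$a^{*} = \arg\max_{y \in Y_k,\ a \in V(G_y)} \; \log P_{\theta_1}(y|q) + \log P_{\theta_2}(a|y,q)$$
In our experiments, we found that $k = 1$ (equivalent to greedy inference) can already achieve good performance.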
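The beam-search approximation can be sketched as follows: keep the top-scoring topic entities, then pick the entity-answer pair with the highest joint log-probability. The toy distributions and the function name `infer_answer` are hypothetical.

```python
import math


def infer_answer(log_p_entity, log_p_answer, k=1):
    """Return the (topic entity, answer) pair maximizing log P(y|q) + log P(a|y,q)
    over the k most probable topic entities."""
    candidates = sorted(log_p_entity, key=log_p_entity.get, reverse=True)[:k]
    pairs = [(y, a) for y in candidates for a in log_p_answer[y]]
    return max(pairs, key=lambda ya: log_p_entity[ya[0]] + log_p_answer[ya[0]][ya[1]])


# Hypothetical toy distributions.
log_p_entity = {"y1": math.log(0.6), "y2": math.log(0.4)}
log_p_answer = {
    "y1": {"a1": math.log(0.6), "a2": math.log(0.4)},
    "y2": {"a3": math.log(0.99), "a4": math.log(0.01)},
}
greedy = infer_answer(log_p_entity, log_p_answer, k=1)
beam = infer_answer(log_p_entity, log_p_answer, k=2)
```

In this toy case the greedy choice commits to the most likely entity and returns ("y1", "a1") with joint probability 0.36, while a beam of two also considers "y2" and finds ("y2", "a3") with 0.396. A wider beam costs one extra scope evaluation per additional candidate.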
| Bordes et al.’s QA system | 95.7 | 81.8 | 28.4 | 39.5 | 38.3 | 26.9 |
| Bordes et al.’s QA system | 32.5 | 32.3 | 25.3 | 18.5 | 19.3 | 15.3 |
6.1 The MetaQA benchmark
There is an existing public QA dataset named WikiMovies (available at https://research.fb.com/downloads/babi), which consists of question-answer pairs in the domain of movies and provides a medium-sized knowledge graph. However, it has several limitations: 1) all of its questions are single-hop, so it cannot evaluate the ability of reasoning; 2) there is no noise on the topic entity in the question, so the entity can be easily located in the knowledge graph; 3) it is generated from a very limited number of text templates, which models can easily exploit and which limits its practical value. Some small datasets like WebQuestions are mostly for single-hop questions, while WikiTableQuestions involves a tiny knowledge table for each question, instead of one large-scale knowledge graph shared among all questions.
Thus in this paper, we introduce a new challenging question-answer benchmark: MetaQA (MoviE Text Audio QA). It contains more than 400K questions for both single and multi-hop reasoning, and provides more realistic text and audio versions. MetaQA serves as a comprehensive extension of WikiMovies. Due to the page limit, we briefly list the datasets included in MetaQA below, and put more details in Appendix A.
Vanilla: We have the original WikiMovies as the Vanilla 1-hop dataset. For multi-hop reasoning, we design 21 types of 2-hop questions and 15 types of 3-hop questions, and generate them by random sampling from a text template pool. Details and question examples are in Appendix B.
NTM: Thanks to the recent breakthrough in neural translation models (NTM), we can introduce more variations over the Vanilla datasets. We use an NTM trained by dual learning techniques to paraphrase each question by first translating it from English to French, and then sampling translations back to English with beam search. The questions in the NTM dataset have different wordings but keep the same meaning. This dataset also contains 1-hop, 2-hop and 3-hop categories.
Audio: To make the benchmark even more practical and challenging, we generate audio datasets with the help of a text-to-speech (TTS) system. We use the Google TTS service to read all the questions in Vanilla. We also provide extracted MFCC features for each question. The Audio dataset also contains 1-hop, 2-hop and 3-hop categories. Note that although the audio is machine-generated, it is still much less regulated than text-template-generated data, and has a lot of variation in waveforms. For example, even for the same word, the TTS system can produce different intonations depending on the word position in the question and on other context words. Visualization of the audio data can be found in Appendix C.
6.2 Competitor methods
We have three competitor methods: 1) as discussed in Sec 2, Miller et al. [8] proposed Key-Value Memory Networks (KV-MemNN), and reported state-of-the-art results at that time on WikiMovies; 2) Bordes et al.’s QA system also tries to embed the inference subgraph for reasoning, but the representation is simply an unordered bag of relationships and neighbor entities; 3) the “supervised embedding” is considered as yet another baseline method, which is a simple approach but often works surprisingly well, as reported in Dodge et al. [13].
We implement the baseline methods with Tensorflow. Our results on Vanilla 1-hop are consistent with the previously reported performance. We take whichever is higher and report it in Table 1. For example, our KV-MemNN obtains 95.8% test accuracy, while the original paper reports 93.9% on the same dataset, so we report 95.8% in the table.
When training KV-MemNN, we use the same number of “internal hops” as the hop number of that dataset. We also tried using more “internal hops” than the dataset hop number, but this did not help. Also, we insert knowledge items within 3 hops of the located topic entity into the memory slots, which ensures that if the topic entity is correctly matched, the answer exists somewhere in the memory array.
6.3 Experimental settings
We use all the datasets in MetaQA for experiments. We follow the same train/validation/test split for all datasets. The number of questions in each part is listed in the Appendix (Table 3). We tune hyperparameters on the validation set for all methods. In both Vanilla and NTM, we use a bag-of-words representation of the entity name to parameterize $W_y$ in (3).
For Vanilla, we have two different settings: 1) provide the entity labels in all questions, so that we can compare with KV-MemNN under the same setting as Miller et al. [8] on the Vanilla 1-hop dataset; 2) only provide entity labels for 5% of all questions, named Vanilla-EU (EU stands for topic entity unlabeled). We make all the methods use a bag-of-words representation of the question, and avoid hard entity matching. This setting is more of a sanity check of how much each method depends on labeled topic entities. In practice, hard matching can always be an option on text data, but it is not feasible for audio data.
To make the task more realistic and challenging, we experiment with the EU setting for the NTM and Audio datasets. For NTM-EU, topic entity labels are provided for only 5% of all questions. For Audio-EU, we use a higher labeled ratio of 20%, since it is much more difficult than text data. To handle the variable length of audio questions, we use a simple convolutional neural network (CNN) with three convolutional layers and three max-pooling layers to embed the audio questions into fixed-dimension vectors. We put more details about the CNN embedding in Appendix D.
For all the EU settings above, the small set of entity-labeled questions is used to initialize a topic entity recognizer. After that, all methods train on the entire dataset but without the entity labels. For VRN, we show that this pretrained recognizer is further improved by variational joint training; for the other baselines, the entity recognizer is fixed.
6.4 Results and discussions
Vanilla: Since all the topic entities are labeled, Vanilla mainly evaluates the ability of logic reasoning. Note that Vanilla 1-hop is the same as WikiMovies, which is included as a sanity check. All the baseline methods achieve performance similar to that reported in the original papers [8, 20], while our method performs the best. Clearly, 2- and 3-hop questions are harder, leading to a significant accuracy drop for all methods. Nevertheless, our method still achieves promising results and leads competitors by a large margin. We notice that KV-MemNN does not perform well on multi-hop reasoning, perhaps due to the explosion of relevant knowledge items.
Vanilla-EU: Without topic entity labels, all reasoning-based methods get worse on multi-hop questions. However, supervised embedding gets better in this case, since it just learns to remember the pairs of question and answer entities. According to the statistics in the Appendix (Table 4), a large portion of questions can be answered by just memorizing the pairs in the training data. That explains why supervised embedding behaves differently on this dataset.
NTM-EU: The questions in this dataset are paraphrased by a neural translation model, which increases the variety of wordings and makes the task harder. It is reasonable that all methods get slightly worse results compared to Vanilla-EU. The same explanation applies to supervised embedding, which is not reasoning but memorizing all the pairs. This is indeed weak generalization that takes advantage of the nature of this dataset, and it is not likely to perform well on new entity pairs.
Audio-EU: This audio dataset is the most challenging one. As mentioned in Sec 6.1, even the same word can be pronounced with a variety of intonations. It is hard both to recognize the entity in audio data and to tell the question type. It is not surprising that all methods perform worse than on text data. Our method achieves 37% on 1-hop audio questions, which is very promising. For 2-hop and 3-hop questions, our method still outperforms the other methods. Clearly, there is large room for improvement on audio QA. We leave it as future work, and hope that the MetaQA benchmark can encourage more researchers to work on QA systems.
6.5 Model ablation
Since our framework uses a variational method to jointly learn the entity recognizer and the reasoning graph embedding, we perform a model ablation to answer the following two questions: 1) is the reasoning graph embedding approach necessary for inference? 2) is the variational method helpful for joint training?
Importance of reasoning graph embedding: As the results in Table 1 show, our proposed VRN outperforms all the other baselines, especially in the 3-hop setting. Since this experiment only compares the reasoning ability, it clearly shows that simply representing the inference rule as a linear combination of reasoning graph entities is not enough.
Improvement of entity recognition with joint training: In Fig 3 we show that our joint training framework with variance-reduced REINFORCE further improves the entity recognition performance without the corresponding topic entity label supervision. For 1-hop and 2-hop questions, our model improves greatly. For 3-hop questions, since the inference task is much harder, the improvement is only marginal. For audio data, we improve by 10% in the 1-hop case, and it is hard to improve further for multiple hops. In Table 1 the baselines perform much worse in the EU setting, due to the absence of joint training.
6.6 Inspection of learning and inference
We study the convergence of our learning algorithm in Appendix E.1. It shows that the variance reduction technique helps convergence significantly, and that simpler tasks converge better. We also present an example inference path with the highest score in the reasoning graph in Appendix E.2. To answer “What are the main languages in David Mandel films?”, the model learns to first find the movie EuroTrip through the directed or wrote relationships, and then follow in_language to get the correct answer German.
- Auer et al.  Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. The semantic web, 2007.
- Bollacker et al.  Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 2008.
- Clarke et al.  James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. Driving semantic parsing from the world’s response. In Proceedings of the fourteenth conference on computational natural language learning, 2010.
- Liang et al.  Percy Liang, Michael I Jordan, and Dan Klein. Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, 2011.
- Berant et al.  Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013.
- Yih et al.  Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015.
- Weston et al.  Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
- Miller et al.  Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126, 2016.
- Li et al.  Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
- Socher et al. [2013a] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934, 2013a.
- Dong et al.  Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In SIGKDD, 2014.
- Guu et al.  Kelvin Guu, John Miller, and Percy Liang. Traversing knowledge graphs in vector space. arXiv preprint arXiv:1506.01094, 2015.
- Dodge et al.  Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931, 2015.
- Rao et al.  Delip Rao, Paul McNamee, and Mark Dredze. Entity linking: Finding extracted entities in a knowledge base. In Multi-source, multilingual information extraction and summarization, pages 93–115. Springer, 2013.
- Kwiatkowski et al.  Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP, 2013.
- Liang et al.  Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. arXiv preprint arXiv:1611.00020, 2016.
- Kumar et al.  Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. arXiv preprint arXiv:1506.07285, 2015.
- Sukhbaatar et al.  Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448, 2015.
- Yang et al.  Z. Yang, X. He, J. Gao, L. Deng, and A.J. Smola. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274, 2015.
- Bordes et al.  Antoine Bordes, Sumit Chopra, and Jason Weston. Question answering with subgraph embeddings. arXiv preprint arXiv:1406.3676, 2014.
- Neelakantan et al.  Arvind Neelakantan, Quoc V Le, Martin Abadi, Andrew McCallum, and Dario Amodei. Learning a natural language interface with neural programmer. arXiv preprint arXiv:1611.08945, 2016.
- Socher et al. [2013b] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013b.
- Irsoy and Cardie  Ozan Irsoy and Claire Cardie. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems, pages 2096–2104, 2014.
- Mou et al.  Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- Duvenaud et al.  David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2215–2223, 2015.
- Dai et al.  Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In ICML, 2016.
- Atwood and Towsley  James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
- Johnson  Daniel D. Johnson. Learning graphical state transitions. In Proceedings of the 5th International Conference on Learning Representations, ICLR’17, 2017.
- Neelakantan et al.  Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. Compositional vector space models for knowledge base completion. arXiv preprint arXiv:1504.06662, 2015.
- Toutanova et al.  Kristina Toutanova, Xi Victoria Lin, Wen-tau Yih, Hoifung Poon, and Chris Quirk. Compositional learning of embeddings for relation paths in knowledge bases and text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1434–1444, 2016.
- Yang and Chang  Yi Yang and Ming-Wei Chang. S-mart: Novel tree-based structured learning algorithms applied to tweet entity linking. arXiv preprint arXiv:1609.08075, 2016.
- Williams  Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
- Mnih and Gregor  Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
- Pasupat and Liang  Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305, 2015.
- He et al.  Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.
- Abadi et al.  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. Tensorflow: Large-scale machine learning on heterogeneous systems. 2015.
Appendix A Details of the MetaQA benchmark
Vanilla 1-hop dataset: Our Vanilla 1-hop dataset is derived from the WikiMovies dataset. Following the settings in , we use the wiki_entities branch of WikiMovies. To make the dataset easier to use, we label entities automatically. Specifically, we parse each question from left to right, greedily matching the longest entity name first and falling back to normal words, and we mark the entity in the question with a pair of square brackets. A few entity names are identical to normal words, which causes “fake entities” to be labeled in the question. For simplicity, we remove those ambiguous questions, which makes our Vanilla 1-hop text dataset slightly smaller than WikiMovies. We also provide the corresponding question type identifier files for the train / validation / test sets.
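The labeling step described above can be sketched as follows. This is only a minimal illustration: the whitespace tokenization, the case-sensitive matching, the helper names `label_entities` and `is_ambiguous`, and the rule "exactly one labeled span" for detecting ambiguity are all assumptions, since the text does not spell out these details.

```python
def label_entities(question, entity_names):
    """Greedy left-to-right longest match; matched entity names are wrapped in [ ]."""
    tokens = question.split()
    # Index entity names by their token tuples (case-sensitive, an assumption).
    entities = {tuple(name.split()) for name in entity_names}
    max_len = max((len(e) for e in entities), default=0)
    out, found, i = [], [], 0
    while i < len(tokens):
        match = None
        # Try the longest span first so that longer names win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in entities:
                match = tokens[i:i + n]
                break
        if match:
            name = " ".join(match)
            out.append("[" + name + "]")
            found.append(name)
            i += len(match)
        else:
            # No entity starts here: consume one normal word.
            out.append(tokens[i])
            i += 1
    return " ".join(out), found


def is_ambiguous(found):
    # Assumed filter: a question is kept only if exactly one entity was labeled.
    return len(found) != 1


labeled, found = label_entities("what movies did Tom Hanks star in", ["Tom", "Tom Hanks"])
print(labeled)  # what movies did [Tom Hanks] star in
```

A question such as "who directed Her" with both "Her" and "who" in the entity list would produce two labeled spans and would therefore be dropped by this filter.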
Vanilla 2-hop dataset: The WikiMovies dataset has 1-hop questions only, which inspires us to generate new datasets for 2-hop and 3-hop reasoning. For 2-hop questions, we design 21 question types in total:
Actor / Writer / Director to Movie to Actor / Writer / Director / Year / Language / Genre (18 types)
Movie to Actor / Writer / Director to Movie (3 types)
Following the way WikiMovies was generated, we design 10 question templates and sample from them uniformly at random when generating questions.
Vanilla 3-hop dataset: For 3-hop questions, we exclude the awkward question types Actor / Writer / Director to Movie to Actor / Writer / Director to Movie, since they are counter-intuitive and confusing. For example: The directors of the movies acted by [@entity] have directed which movies? Instead, we construct meaningful 3-hop questions in 15 question types:
Movie to Actor / Writer / Director to Movie to Actor / Writer / Director / Year / Language / Genre (when the target is a person, its role must differ from the intermediate role)
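The counts of 21 two-hop types and 15 three-hop types follow directly from the patterns listed above. The short enumeration below is a sketch that checks the arithmetic; the names `PEOPLE`, `ATTRIBUTES`, `two_hop_types` and `three_hop_types` are illustrative and not part of the released dataset.

```python
from itertools import product

PEOPLE = ["Actor", "Writer", "Director"]
ATTRIBUTES = ["Year", "Language", "Genre"]


def two_hop_types():
    # Person -> Movie -> any person role or movie attribute (3 x 6 = 18 types).
    types = [(p, "Movie", t) for p, t in product(PEOPLE, PEOPLE + ATTRIBUTES)]
    # Movie -> Person -> Movie (3 types).
    types += [("Movie", p, "Movie") for p in PEOPLE]
    return types


def three_hop_types():
    # Movie -> Person -> Movie -> target; a person target must differ
    # from the intermediate role, which leaves 3 x 5 = 15 types.
    return [("Movie", p, "Movie", t)
            for p, t in product(PEOPLE, PEOPLE + ATTRIBUTES)
            if t != p]


print(len(two_hop_types()), len(three_hop_types()))  # 21 15
```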
NTM datasets: We use one of the state-of-the-art machine translation models, named dual learning for neural translation model , to generate 1-hop, 2-hop and 3-hop NTM datasets. We first translate the corresponding Vanilla dataset from English to French, and then translate it back to English with beam search. We guarantee that the topic entity can still be found in the question. This procedure paraphrases the questions automatically, which introduces variation in question wording and thus leads to a more realistic scenario.
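The round trip can be sketched as below. The text does not say how the topic entity is guaranteed to survive translation, so the simple "discard if the entity is missing" check is an assumption, and `paraphrase`, `en_to_fr` and `fr_to_en` are hypothetical names. The dictionary lookups merely stand in for a real translation model.

```python
def paraphrase(question, topic_entity, en_to_fr, fr_to_en):
    """Round-trip a question through French; return None if the topic entity is lost."""
    french = en_to_fr(question)
    back = fr_to_en(french)
    # Assumed check: keep the paraphrase only when the entity mention survives verbatim.
    return back if topic_entity in back else None


# Stub translators standing in for a real English<->French model (hypothetical).
EN_FR = {"who starred in [Blade]": "qui a joué dans [Blade]"}
FR_EN = {"qui a joué dans [Blade]": "who played in [Blade]"}

result = paraphrase("who starred in [Blade]", "Blade", EN_FR.get, FR_EN.get)
print(result)  # who played in [Blade]
```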
Audio datasets: To generate 1-hop, 2-hop and 3-hop audio datasets, we use the Google text-to-speech API (via https://github.com/pndurette/gTTS) to read all questions in the Vanilla datasets and save the audio as mp3 files. Because processing hundreds of thousands of audio files takes time, we also provide extracted MFCC features for each question.
|Statistic||1-hop||2-hop||3-hop|
|# of Questions|
|New entities in validation (%)||20.0||5.0||0.1|
|New entities in test (%)||20.6||4.7||0.1|
|New entity pairs in validation (%)||34.1||28.1||32.1|
|New entity pairs in test (%)||34.5||28.6||32.2|
Appendix B Question samples
|Question type||Count||Example question|
|Movie to Actor to Movie||11,709||The actor of Ruby Cairo also starred in which films?|
|Movie to Director to Movie||11,412||Which films share the same director of Vampires Suck?|
|Movie to Writer to Movie||8,817||Which movies have the same screenwriter of The Pianist?|
|Actor to Movie to Actor||9,547||Who co-starred with Joel Evans?|
|Actor to Movie to Director||9,241||Who co-starred with Carlo Ninchi?|
|Actor to Movie to Genre||8,548||What are the genres of the movies acted by Melora Hardin?|
|Actor to Movie to Language||3,067||What are the main languages in Molly Windsor starred movies|
|Actor to Movie to Writer||8,499||Who wrote the movies acted by James Madio?|
|Actor to Movie to Year||10,072||When did the movies acted by Masato Hagiwara release?|
|Director to Movie to Actor||4,800||Who acted in the films directed by Jerry London?|
|Director to Movie to Director||1,797||Who directed movies together with Chad Stahelski?|
|Director to Movie to Genre||5,205||What types are the films directed by Phillip Noyce?|
|Director to Movie to Language||1,850||What are the languages spoken in the movies directed by Peter Sellers?|
|Director to Movie to Writer||3,688||Who wrote the movies directed by Gary McKendry?|
|Director to Movie to Year||6,026||When were the films directed by Jonathan Kahn released?|
|Writer to Movie to Actor||8,447||Who acted in the films written by Travis Milloy?|
|Writer to Movie to Director||7,342||Who directed the films written by Nick Damici?|
|Writer to Movie to Genre||8,633||What types are the movies written by Amza Pellea?|
|Writer to Movie to Language||2,629||What are the primary languages in the films written by John Musker?|
|Writer to Movie to Writer||7,142||Who wrote films together with Jonah Hill?|
|Writer to Movie to Year||10,226||When did the movies directed by Paul Linke release?|
|Movie to Actor to Movie to Director||11,600||Who directed films that share actors with the film Last Passenger?|
|Movie to Actor to Movie to Genre||11,513||What types are the films starred by actors in Jack Reacher?|
|Movie to Actor to Movie to Language||8,735||What are the languages spoken in the films starred by Blade actors?|
|Movie to Actor to Movie to Writer||11,516||Which person wrote the movies starred by the actors in Ludwig?|
|Movie to Actor to Movie to Year||11,688||When did the movies starred by Witchboard actors release?|
|Movie to Director to Movie to Actor||10,784||Who acted in the films directed by the director of The Road?|
|Movie to Director to Movie to Genre||10,822||What types are the films directed by the director of Holly?|
|Movie to Director to Movie to Language||5,909||The movies that share directors with the movie Effi Briest were in which languages?|
|Movie to Director to Movie to Writer||11,005||Who wrote films that share directors with the film Male and Female?|
|Movie to Director to Movie to Year||11,350||When did the films release whose directors also directed Date Movie?|
|Movie to Writer to Movie to Actor||8,216||Who acted in the movies written by the writer of Bottle Rocket?|
|Movie to Writer to Movie to Director||8,734||Who directed films for the writer of Sugar?|
|Movie to Writer to Movie to Genre||8,212||What types are the movies written by the screenwriter of The Gospel?|
|Movie to Writer to Movie to Language||3,908||What languages are the films that share writers with The Bat in?|
|Movie to Writer to Movie to Year||8,752||When did the films release whose screenwriters also wrote Crash?|
Appendix C Visualization of audio data
We visualize the MFCC features of two questions sharing the same entity in Fig 6. The entity part (highlighted by red dotted lines) is clearly similar but not identical, which illustrates the difficulty of handling audio questions.
Appendix D Details of CNN embedding
To answer audio questions, we need information about both the topic entity in the question and the question type, i.e. what is being asked about that entity. We therefore train two CNNs with different objectives: one predicts the topic entity, and the other predicts the question type. Both CNNs take the same input (MFCC features of the audio questions) and are fit on the training data only. We treat the activations of the second-to-last layer (before the softmax layer) as the embeddings of the audio questions.
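The idea of reading the embedding from the second-to-last layer can be sketched with a tiny forward pass in NumPy. Everything concrete here is an assumption made for illustration: the architecture (one 1-D convolution over time, global max pooling, one hidden layer), the layer sizes, the random weights, and the names `init_cnn` and `forward`. The text does not describe the actual CNN architecture, and training is omitted.

```python
import numpy as np


def init_cnn(rng, n_mfcc, n_filters, width, hidden, n_classes):
    """Randomly initialised parameters for a tiny 1-D CNN (all sizes are illustrative)."""
    return {
        "conv": rng.normal(0.0, 0.1, (n_filters, width, n_mfcc)),
        "W_h": rng.normal(0.0, 0.1, (n_filters, hidden)),
        "W_o": rng.normal(0.0, 0.1, (hidden, n_classes)),
    }


def forward(params, mfcc):
    """mfcc has shape (frames, n_mfcc). Returns (embedding, class probabilities)."""
    conv = params["conv"]
    width = conv.shape[1]
    # Sliding windows over time: shape (frames - width + 1, width, n_mfcc).
    windows = np.stack([mfcc[t:t + width] for t in range(mfcc.shape[0] - width + 1)])
    feats = np.maximum(np.einsum("twc,fwc->tf", windows, conv), 0.0)  # convolution + ReLU
    pooled = feats.max(axis=0)                    # global max pooling over time
    embedding = np.tanh(pooled @ params["W_h"])   # second-to-last layer, used as embedding
    logits = embedding @ params["W_o"]
    probs = np.exp(logits - logits.max())         # numerically stable softmax
    return embedding, probs / probs.sum()


rng = np.random.default_rng(0)
mfcc = rng.normal(size=(50, 13))                  # fake MFCC features for one question
entity_cnn = init_cnn(rng, 13, 8, 5, 16, 100)     # would be trained to predict the topic entity
type_cnn = init_cnn(rng, 13, 8, 5, 16, 10)        # would be trained to predict the question type
entity_emb, entity_probs = forward(entity_cnn, mfcc)
type_emb, type_probs = forward(type_cnn, mfcc)
```

Both networks receive the same MFCC input, and only `entity_emb` and `type_emb` would be passed on as the question embeddings.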
Appendix E More experiment results
E.1 Convergence of VRN
We visualize the training loss in Fig 4 to show how training converges. With the variance reduction technique, training converges very fast. As expected, simpler tasks that involve fewer inference steps converge to better solutions.
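To illustrate why variance reduction matters for REINFORCE, the sketch below computes the exact mean and variance of the single-sample gradient estimator for a small categorical distribution, with and without a baseline. Subtracting the expected reward as a baseline is a standard technique and is used here as an assumption; this section does not state the exact variance reduction scheme used in training, and the logits and rewards are made-up toy values.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def estimator_stats(logits, rewards, baseline):
    """Exact mean and total variance of the REINFORCE gradient w.r.t. the logits.

    For a sampled action a, the single-sample estimate is
    (rewards[a] - baseline) * (onehot(a) - p).
    """
    p = softmax(logits)
    k = len(p)
    mean = [0.0] * k
    second_moment = 0.0
    for a in range(k):
        g = [(rewards[a] - baseline) * ((1.0 if i == a else 0.0) - p[i]) for i in range(k)]
        for i in range(k):
            mean[i] += p[a] * g[i]
        second_moment += p[a] * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


logits = [0.0, 0.5, -0.5]      # toy policy parameters
rewards = [1.0, 0.8, 0.9]      # toy rewards, one per action
p = softmax(logits)
expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))

mean_plain, var_plain = estimator_stats(logits, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(logits, rewards, baseline=expected_reward)
print(var_plain, var_base)
```

In this toy case both estimators have the same expectation, while the baseline version has a much smaller variance because the centered rewards are close to zero.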
E.2 Visualization of the learned reasoning rule
To check what the reasoning graph has learned, we visualize the highest-scoring inference path in the reasoning graph. Specifically, for 1-hop answers, we simply check the compatibility between the edge type and the question embedding; for multi-hop answers, we traverse from the answer back to the topic entity, taking at each step the edge whose embedding has maximum compatibility with the question.
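The backward traversal can be sketched as follows. The data layout (an `incoming` map from each node to its `(source, relation)` edges), the dot product as the compatibility score, the toy embeddings, and the name `best_path` are all assumptions for illustration; the greedy choice can also hit a dead end that a full search would avoid.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def best_path(answer, topic, incoming, rel_emb, q_emb):
    """Trace back from the answer to the topic entity, greedily picking the
    incoming edge whose relation embedding best matches the question embedding."""
    path, node, seen = [], answer, {answer}
    while node != topic:
        candidates = [(src, rel) for src, rel in incoming.get(node, []) if src not in seen]
        if not candidates:
            return None  # no route back to the topic entity
        src, rel = max(candidates, key=lambda e: dot(rel_emb[e[1]], q_emb))
        path.append((src, rel, node))
        seen.add(src)
        node = src
    # Reverse so the path reads from the topic entity to the answer.
    return list(reversed(path))


# Toy reasoning graph; embeddings are made-up numbers.
incoming = {
    "EuroTrip": [("David Mandel", "directed"), ("David Mandel", "wrote")],
    "German": [("EuroTrip", "in_language")],
}
rel_emb = {"directed": [0.9, 0.1], "wrote": [0.5, 0.5], "in_language": [0.1, 1.0]}
q_emb = [1.0, 0.2]

path = best_path("German", "David Mandel", incoming, rel_emb, q_emb)
print(path)
# [('David Mandel', 'directed', 'EuroTrip'), ('EuroTrip', 'in_language', 'German')]
```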
We show one 2-hop inference result in Fig 5. To answer the NTM question correctly, one needs to use either the directed or the wrote relation to find the movie EuroTrip first, and then follow in_language to reach the correct answer, German.