Dynamically Pruned Message Passing Networks for Large-Scale Knowledge Graph Reasoning
We propose Dynamically Pruned Message Passing Networks (DPMPN) for large-scale knowledge graph reasoning. In contrast to existing models, whether embedding-based or path-based, we learn an input-dependent subgraph to explicitly model a sequential reasoning process. Each subgraph is dynamically constructed, expanding itself selectively under a flow-style attention mechanism. In this way, we can not only construct graphical explanations to interpret predictions, but also prune message passing in Graph Neural Networks (GNNs) to scale with the size of graphs. We take inspiration from the consciousness prior proposed by Bengio [4] to design a two-GNN framework that encodes a global, input-invariant graph-structured representation and learns a local, input-dependent one, coordinated by an attention module. Experiments show that our model provides clear graphical explanations while predicting results accurately, outperforming most state-of-the-art methods on knowledge base completion tasks.
1 Introduction
Modern deep learning systems need to acquire reasoning capability beyond their black-box nature to produce interpretable predictions [41, 5]. The form in which we model a reasoning process deserves more thought than merely obtaining a final prediction. Intuitively, a reasoning process can be regarded as a sequence of steps that use existing facts to establish new knowledge and finally draw conclusions, constructing explanations as well as making predictions. It therefore requires explicit modeling to identify and organize reasoning steps into a clear, interpretable representation during prediction. A natural idea is to use a graph-structured representation, where a semantic unit or pairwise relation is explicitly represented by a node or an edge as a building block to support graph-based reasoning, a more flexible form than rigid deductive logical reasoning [2, 59].
Graph-based reasoning can be applied to a wide variety of real-world scenarios. Here, we choose to explore knowledge graph-related tasks because they are representative. In knowledge base completion (KBC) tasks, embedding-based models [6, 61, 16, 50, 46, 27] can easily obtain very competitive scores by fitting data with various neural network techniques. However, they lack explicit modeling that constructs explanations by directly exploiting graph structure, which prevents them from being interpretable, a critical property of reasoning, since a Euclidean embedding space does not produce a clearly stated, human-readable representation.
Recent work on knowledge graph (KG) reasoning focuses on path-based [53, 57, 12, 45, 9, 32] or logic-like models [10, 62]. Most of them construct an explicit path to model an iterative decision-making process using reinforcement learning and recurrent networks. However, a question remains: is there a better form, more flexible and interpretable, to express reasoning in the graph context than one or several paths? To this end, we propose to learn a subgraph that starts from a head node and expands itself conditionally and selectively according to a query relation, where a tail node is predicted after the last expansion. To better explain how the tail is determined by the expansion, we weigh, prune, and save the intermediate nodes selected at each step to capture long-range dependence and yield a concise, compact subgraph explanation for the tail prediction, as shown in Figure 1.
Graph reasoning can be powered by Graph Neural Networks. Graph reasoning demands a way to effectively learn about entities, relations, and rules for composing them, that is, an ability for combinatorial generalization by manipulating structured knowledge and producing structured explanations. Graph Neural Networks (GNNs) provide such structured representation and computation and also inherit powerful data-fitting capacity from deep neural networks [44, 2]. Specifically, GNNs follow a neighborhood aggregation scheme, recursively aggregating and transforming neighboring nodes' representations to update the representation of each node. Therefore, after $k$ iterations of aggregation, each node can carry the structured information within the node's $k$-hop neighborhood [19, 58].
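The neighborhood aggregation scheme described above can be illustrated with a minimal sketch. It is not the model's implementation: instead of learned vectors, each node state is simply the set of nodes whose information has reached it, so the growth of the receptive field with the number of iterations is visible directly. The function names `aggregate` and `receptive_field` are hypothetical.

```python
from collections import defaultdict


def aggregate(states, edges):
    """One aggregation round: every node merges the states of its in-neighbors."""
    incoming = defaultdict(set)
    for src, dst in edges:
        incoming[dst].add(src)
    new_states = {}
    for node, reached in states.items():
        merged = set(reached)
        for neighbor in incoming[node]:
            merged |= states[neighbor]
        new_states[node] = merged
    return new_states


def receptive_field(nodes, edges, num_iterations):
    """Return, for each node, the set of nodes it has heard from after the given rounds."""
    states = {v: {v} for v in nodes}
    for _ in range(num_iterations):
        states = aggregate(states, edges)
    return states


if __name__ == "__main__":
    chain_nodes = [0, 1, 2, 3]
    chain_edges = [(0, 1), (1, 2), (2, 3)]  # directed chain 0 -> 1 -> 2 -> 3
    for k in range(4):
        print(k, sorted(receptive_field(chain_nodes, chain_edges, k)[3]))
```

On the directed chain, node 3 only hears from node 0 after three rounds, which is the set-valued analogue of a representation carrying information from the neighborhood reachable within that many hops.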
GNNs need graphical attention expression to interpret. Neighborhood attention operation is a popular way to implement attention mechanism on graphs [52, 22] by using multi-head self-attention to focus on specific interactions with neighbors when aggregating messages. However, we argue that graphical attention expression should be designed instead not only to facilitate structured computation but also to construct dynamically pruned structured explanations. We present three considerations: (1) selecting nodes based on currently operated subgraphs, that is, first attending over nodes within subgraphs to pick a smaller set and then attending over the picked nodes’ neighbors to expand subgraphs, (2) breaking isolation of attention operations used for each step and propagating attention across steps like a flow to produce long-range influence, and (3) that such flow-style attention mechanism should model a changing node probability distribution, that is, a Markov process driven by step-varying transition matrices. Besides, we should use an attention module disentangled from representation aggregating and transforming in GNNs to explicitly model a reasoning process on graphs out of low-level representation computing.
GNNs need input-dependent pruning to scale. GNNs are notorious for their poor scalability due to their heavy computational complexity. Consider, for example, one message passing iteration performed over a graph with $|V|$ nodes and $|E|$ edges. It has quadratic complexity in the number of nodes, $O(|V|^2)$, if the graph is fully connected. Even if the graph is sparse so that the complexity can be reduced to $O(|E|)$ by exploiting structural sparsity, it is still problematic for large graphs with millions of nodes and edges. Besides, mini-batch based training with batch size $B$ and high dimensions $D$ makes things worse, leading to a complexity of $O(BD|E|)$. We argue that this situation can be avoided by learning input-dependent pruning, as in most cases an input example uses only a small fraction of the entire graph, and it is wasteful to perform homogeneous structured computation over the full graph for each input. Therefore, we propose to prune message passing depending on inputs and to run on dynamic computation graphs instead of a static computation graph.
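To make the scaling argument concrete, the following back-of-the-envelope sketch counts unit operations for one message passing iteration, assuming (as the paper's own cost model does) that one example, one edge, and one dimension cost one unit of time. The function name and all sizes are hypothetical and chosen only for illustration.

```python
def message_passing_cost(num_edges, batch_size, dim, num_steps=1):
    """Unit-cost estimate: one unit per example, per edge, per dimension, per step."""
    return num_steps * batch_size * dim * num_edges


if __name__ == "__main__":
    # Hypothetical sizes: a full graph with one million edges versus
    # an input-dependent subgraph with two thousand edges per example.
    full = message_passing_cost(num_edges=1_000_000, batch_size=64, dim=100)
    pruned = message_passing_cost(num_edges=2_000, batch_size=64, dim=100)
    print(full, pruned, full // pruned)
```

Under these assumed sizes the per-iteration cost shrinks by the same factor as the edge count, which is the saving that input-dependent pruning aims for.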
Cognitive intuition of the consciousness prior. The notion of attentive awareness is shared by several theories in the cognitive science community [15, 47]. Bengio [4] brought this notion into deep learning models in his consciousness prior proposal. He described a process of disentangling high-level abstract factors from the full underlying representation to form a low-dimensional combination of a few selected factors that constitutes a conscious thought, and emphasized the role of attention in expressing awareness during this process. Bengio proposed to use two recurrent neural networks (RNNs) to encode two types of state: the unconscious state, represented by a full high-dimensional vector before applying attention, and the conscious state, represented by a derived low-dimensional vector after applying attention.
In our work, we use two GNNs to encode such states into node representation vectors. However, standard message passing runs globally, so that messages gathered by a node can come from everywhere and get further entangled by aggregation operations. Therefore, we draw an input-dependent, context-aware local subgraph to constrain message passing. We also want to access global information about the graph structure to get a broader view before focusing on a local subgraph. Inspired by the consciousness prior, we apply an attention mechanism to the two GNNs, where the lower one, called Inattentive GNN (IGNN), performs input-invariant standard message passing globally, and the upper one, called Attentive GNN (AGNN), performs input-dependent pruned message passing locally. The intuition is that the Inattentive GNN supports the Attentive GNN by providing raw representation, entangled but rich, while the Attentive GNN captures various input-dependent subgraphs consisting of a few selected nodes and their edges, cohesive with sharp semantics and disentangled from the full graph. Nodes within such a subgraph are more densely connected and form a small community that exchanges information and collectively decides how to grow the subgraph next. In experiments, we find that our model can run on very large graphs with millions of edges, such as the YAGO3-10 dataset, even on a laptop, without causing out-of-memory errors. Our prediction results on KBC tasks attain very competitive scores on HITS@1,3 and the mean reciprocal rank (MRR) compared to the best embedding-based method so far. In addition, we provide interpretations that embedding-based methods do not offer.
2 Problem Formulation
Notation. We use a supervised setting with training data where is an input and is a target. We denote a full graph by with node set and edge set , and denote an input-dependent subgraph by with node set and edge set . We also denote the set of edge types (or relation types) by . We require each subgraph to hold , so that we can define if and define if . We define the boundary of a subgraph as if where means the union of neighbors of all the nodes in . We also define high-order boundaries such as if . Trainable parameters include node embeddings , relation type embeddings , and neural network weights used in two GNNs and an attention module. When performing standard or pruned message passing, node embeddings and relation type embeddings will be indexed according to the operated graph, and thus we denote them by or . We denote batch size by and dimensions by . For IGNN, we use of size to denote node hidden states at step ; for AGNN, we use of size to denote.
We define the objective based on our two GNNs as , where is dynamically constructed. First, we write the standard message passing in IGNN as
where represents all involved operations in one message passing iteration over , including: (1) computing messages along each edge with the complexity of , (2) aggregating messages received at each node with the complexity of , and (3) updating node hidden states with the complexity of . Here we assume the per-example, per-edge, per-dimension time cost to be one unit of time. For a -step propagation, we get the per-batch complexity of . Considering that backpropagation requires intermediate computation results to be saved during one pass, this complexity applies to both time and space. However, since IGNN is input-invariant, its node representations can be shared across input examples in one batch so that can be removed to get . If we sample a smaller set from to run such that , we can further reduce the complexity to .
The pruned message passing in AGNN can be written as
Its complexity can be computed similarly as above. However, we cannot remove . Fortunately, subgraph is not . If we let be a node , grows from a single node, i.e., , and expands itself each step, leading to a sequence of . Here, we describe the expansion behavior as consecutive expansion, which means no jumping across neighborhood allowed, so that we can ensure that
Many real-world graphs follow the small-world pattern, and the six degrees of separation implies . The upper bound of can grow exponentially in , and there is no guarantee that will not explode.
Given a graph (undirected or directed in both directions), we assume the probability of the degree of an arbitrary node being less than or equal to is larger than , i.e., . Considering a sequence of consecutively expanding subgraphs , starting with , for all , we can ensure
The proposition implies the guarantee of upper-bounding becomes exponentially looser and weaker as gets larger even if the given assumption has a small and a large (close to 1). We define graph increment at step as such that . To prevent from explosion, we need to constrain . We propose several sampling strategies:
, which means we sample nodes from the boundary of .
, which means we take the boundary of sampled nodes from .
, which means we sample nodes from the boundary of .
, which means we sample nodes from .
Obviously, we have and . Further, we let and be the maximum number of sampled nodes in and the last sampling of respectively and let be per-node maximum sampled neighbors in , and then we can obtain a much tighter guarantee as follows:
and for .
By , we can guarantee . To constrain the growth of , we can decrease either or . However, smaller sample size means less area explored and less chance to hit target nodes. We thus use attention operations to do the top- selection instead of random sampling when has to be small. We change to where represents the operation of attending over nodes and picking the top-. There are two types of attention operations, one applied to and the other applied to . Note that the size of might be much larger than if we want to sample more nodes with larger to sufficiently explore the boundary, . Nevertheless, we can address this problem by using small dimensions to compute attention scores, since attention carried by each node is just a scalar, much smaller than a node representation vector computed during message passing over .
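The contrast between unconstrained consecutive expansion and capped expansion can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: it uses uniform random sampling in place of attention-based top-K selection, and the names `full_tree`, `expand`, `max_from`, and `max_per_node` are hypothetical. With the caps in place, each step adds at most `max_from * max_per_node` nodes, so the subgraph size grows at most linearly in the number of steps.

```python
import random


def full_tree(branching, depth):
    """Build an undirected tree as an adjacency dict {node: set(neighbors)}."""
    adj = {0: set()}
    frontier, next_id = [0], 1
    for _ in range(depth):
        new_frontier = []
        for parent in frontier:
            for _ in range(branching):
                adj[parent].add(next_id)
                adj[next_id] = {parent}
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return adj


def expand(adj, start, steps, max_from=None, max_per_node=None, seed=0):
    """Grow a visited set step by step and return its size after each step.

    max_from caps how many visited nodes may act as expansion sources,
    max_per_node caps how many neighbors are sampled per source.
    """
    rng = random.Random(seed)
    visited = {start}
    sizes = [1]
    for _ in range(steps):
        sources = sorted(visited)
        if max_from is not None and len(sources) > max_from:
            sources = rng.sample(sources, max_from)
        increment = set()
        for v in sources:
            neighbors = sorted(adj[v])
            if max_per_node is not None and len(neighbors) > max_per_node:
                neighbors = rng.sample(neighbors, max_per_node)
            increment.update(neighbors)
        visited |= increment
        sizes.append(len(visited))
    return sizes


if __name__ == "__main__":
    tree = full_tree(branching=3, depth=4)
    print(expand(tree, 0, steps=3))  # unconstrained: [1, 4, 13, 40]
    print(expand(tree, 0, steps=3, max_from=2, max_per_node=2))
```

Random sampling with small caps explores little of the boundary, which is the motivation given above for replacing it with attention-based top-K selection.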
3 Model Implementation
3.1 Architecture design for knowledge graph reasoning
Our model architecture as shown in Figure 2 consists of:
IGNN module: performs standard message passing to compute full-graph node representations.
AGNN module: performs a batch of pruned message passing to compute input-dependent node representations which also make use of low-level representations from IGNN.
Attention Module: performs a flow-style attention transition process, conditioned on node representations from both IGNN and AGNN but only affecting AGNN.
We let denote a knowledge graph where is a set of entities and is a set of relations. Each edge or relation is represented by a triple , where is the head entity, is the tail entity, and is their relation type. The goal is to predict potential unknown links, i.e., which entity is likely to be the tail given a query with the head and the relation type specified.
IGNN module. We implement it using standard message passing mechanism . If the full graph has an extremely large number of edges, we sample a subset of edges, , randomly each step. For a batch of input queries, we let node representations from IGNN be shared across queries, containing no batch dimension. Thus, its complexity does not scale with batch size and the saved resources can be allocated to sampling more edges. Each node has a state at step , where the initial . Each edge produces a message, denoted by at step . The computation components include:
Message function: , where .
Message aggregation: , where .
Node state update function: , where .
We compute messages only for the sampled edges, , each step. Functions and are each implemented by a two-layer MLP (using for the first layer and for the second) that takes its input arguments concatenated. Messages are aggregated by dividing the sum by the square root of , the number of sampled neighbors that send messages to , preserving the scale of variance. We use a residual addition to update each node state instead of a GRU or an LSTM. After running for steps, we output a pooling result or simply the last state, denoted by , to feed into downstream modules.
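The aggregation and update rules described above can be sketched in plain Python. This is a simplified illustration, not the released implementation: `message_fn` and `update_fn` stand in for the two-layer MLPs and are passed in by the caller, vectors are plain lists, and the name `ignn_step` is hypothetical. The two details the sketch is meant to show are the division of the message sum by the square root of the number of senders and the residual update.

```python
import math


def ignn_step(states, edges, rel_emb, message_fn, update_fn):
    """One message passing step over sampled edges given as (src, rel, dst) triples."""
    dim = len(next(iter(states.values())))
    summed = {v: [0.0] * dim for v in states}
    counts = {v: 0 for v in states}
    for src, rel, dst in edges:
        msg = message_fn(states[src], rel_emb[rel], states[dst])
        summed[dst] = [a + m for a, m in zip(summed[dst], msg)]
        counts[dst] += 1
    new_states = {}
    for v, h in states.items():
        # Divide by sqrt(number of senders) to keep the variance scale stable;
        # nodes that receive nothing keep a zero aggregated message.
        scale = math.sqrt(counts[v]) if counts[v] else 1.0
        aggregated = [s / scale for s in summed[v]]
        delta = update_fn(h, aggregated)
        new_states[v] = [a + d for a, d in zip(h, delta)]  # residual update
    return new_states


if __name__ == "__main__":
    # Placeholder functions (assumptions): message = source state + relation embedding,
    # update = the aggregated message itself.
    message_fn = lambda h_src, r, h_dst: [a + b for a, b in zip(h_src, r)]
    update_fn = lambda h, m: m
    states = {0: [1.0, 1.0], 1: [0.0, 0.0], 2: [1.0, 1.0], 3: [1.0, 1.0], 4: [1.0, 1.0]}
    edges = [(0, "r", 1), (2, "r", 1), (3, "r", 1), (4, "r", 1)]
    print(ignn_step(states, edges, {"r": [0.0, 0.0]}, message_fn, update_fn)[1])
```

In the example, four identical messages of all ones are summed to fours and divided by two, so the receiving node moves from zeros to twos while the sending nodes stay unchanged.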
AGNN module. AGNN is input-dependent, which means node states depend on input query , denoted by . We implement pruned message passing, running on small subgraphs each conditioned on an input query. We leverage the sparsity and only save for visited nodes . When , we start from node with . When computing messages, denoted by , we use a sampling-attending procedure, explained in Section 3.2, to constrain the number of computed edges. The computation components include:
Message function: , where , and represents a context vector. In practice, we can use a smaller set of edges than to pass messages, as discussed in Section 3.2.
Message aggregation: , where .
Node state attending function: , where is an attention score.
Node state update function: , where also represents a context vector.
Query context is defined by its head and relation type embeddings, i.e., and . We introduce a node state attending function to pass node representation information from IGNN to AGNN weighted by a scalar attention score and projected by a learnable matrix . We initialize for node , treating the rest as zero states.
Attention module. Attention over steps is represented by a sequence of node probability distributions, denoted by . The initial distribution is a one-hot vector with . To spread attention, we need to compute transition matrices each step. Since it is conditioned on both IGNN and AGNN, we capture two types of interaction between and : , and . The former favors visited nodes, while the latter is used to attend to unseen nodes.
where and are two learnable matrices. Each MLP uses one single layer with the activation. To reduce the complexity for computing , we use nodes , which contains nodes with the -largest attention scores at step , and use nodes sampled from ’s neighbors to compute attention transition for the next step. Due to the fact that nodes result from the top- pruning, the loss of attention may occur to diminish the total amount. Therefore, we use a renormalized version, , to compute new attention scores. We use attention scores at the final step as the probability to predict the tail node.
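One step of the flow-style attention transition, including top-K pruning and renormalization, might look like the sketch below. It assumes the transition probabilities out of each kept node are obtained by a softmax over some per-neighbor scores; how those scores are computed from IGNN and AGNN states is not modeled here, and the names `softmax` and `transition_step` are hypothetical.

```python
import math


def softmax(scores):
    """Turn a dict of raw scores into a probability distribution."""
    top = max(scores.values())
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def transition_step(attention, transitions, top_k):
    """Prune to the top-k attended nodes, renormalize, then push attention along transitions.

    attention:   dict node -> attention score at the current step
    transitions: dict node -> dict of destination node -> transition probability
    """
    kept = sorted(attention, key=attention.get, reverse=True)[:top_k]
    kept_total = sum(attention[v] for v in kept)
    renormalized = {v: attention[v] / kept_total for v in kept}
    new_attention = {}
    for v, a in renormalized.items():
        for dst, p in transitions[v].items():
            new_attention[dst] = new_attention.get(dst, 0.0) + a * p
    return new_attention


if __name__ == "__main__":
    attention = {"a": 0.5, "b": 0.3, "c": 0.2}
    transitions = {
        "a": softmax({"x": 0.0, "y": 0.0}),  # equal scores give 0.5 each
        "b": {"y": 1.0},
        "c": {"z": 1.0},
    }
    print(transition_step(attention, transitions, top_k=2))
```

Without the renormalization, the attention dropped with node "c" would be lost and the total would shrink at every step; with it, the distribution over nodes still sums to one after the transition.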
3.2 Complexity reduction by iterative sampling and attending
AGNN deals with local subgraphs for each input so that only a few selected nodes are kept in , called visited nodes, and is much smaller than . The initial contains only one node , and then is enlarged each step by adding new nodes. When propagating messages, we can just consider the one-step neighborhood each step. However, the expansion proceeds so rapidly that it covers almost all nodes after a few steps. The key to addressing this problem is to constrain the scope of nodes from which we can expand the boundary, i.e., the core nodes that determine where we can go next. We call it the attending-from horizon, , selected according to attention scores . Given this horizon, we still need edge sampling over its neighborhood instead of using the whole in case of a hub node of extremely high degree. Here, we face a trade-off between coverage and complexity when sampling over the neighborhood. We also need node representations within each subgraph to keep their information coherent and to avoid the noise caused by random sampling. Therefore, we introduce an attending-to horizon inside the sampling horizon. We denote the sampling horizon by and the attending-to horizon by . The attention module runs within the sampling horizon with smaller dimensions in order to sample more neighbors for larger coverage. Then, we prune the sampling horizon to obtain the attending-to horizon, which contains a subset of nodes selected according to newly computed attention scores . The current message passing iteration at step in AGNN can be further constrained to edges between and , a smaller set than . We illustrate this procedure in Figure 3.
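The sampling-attending procedure described above can be sketched as a single selection function. This is a minimal sketch under stated assumptions: `score_fn` is a stand-in for the newly computed attention scores, neighbor sampling is uniform, and the names `select_horizons`, `max_from`, `max_sample_per_node`, and `max_to` are hypothetical.

```python
import random


def select_horizons(visited_attention, adj, score_fn,
                    max_from, max_sample_per_node, max_to, seed=0):
    """Pick the attending-from, sampling, and attending-to horizons for one step.

    Returns the three node collections plus the edges kept for pruned message passing.
    """
    rng = random.Random(seed)
    # 1. Attending-from horizon: visited nodes with the largest attention scores.
    attend_from = sorted(visited_attention, key=visited_attention.get, reverse=True)[:max_from]
    # 2. Sampling horizon: neighbors sampled from each attending-from node.
    sampled_edges = []
    for v in attend_from:
        neighbors = sorted(adj[v])
        if len(neighbors) > max_sample_per_node:
            neighbors = rng.sample(neighbors, max_sample_per_node)
        sampled_edges.extend((v, u) for u in neighbors)
    sampling_horizon = {u for _, u in sampled_edges}
    # 3. Attending-to horizon: sampled nodes with the largest new scores.
    scores = {u: score_fn(u) for u in sampling_horizon}
    attend_to = set(sorted(scores, key=scores.get, reverse=True)[:max_to])
    # 4. Message passing is restricted to edges that end inside the attending-to horizon.
    pruned_edges = [(v, u) for v, u in sampled_edges if u in attend_to]
    return attend_from, sampling_horizon, attend_to, pruned_edges


if __name__ == "__main__":
    adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
    result = select_horizons({0: 1.0}, adj, score_fn=lambda u: float(u),
                             max_from=1, max_sample_per_node=10, max_to=2)
    print(result)
```

Because attention scores are scalars, scoring a wide sampling horizon is cheap compared with passing full message vectors over it, which is why the sampling cap can be set larger than the attending-to cap.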
4 Experiments
Datasets. We use six large KG datasets: FB15K, FB15K-237, WN18, WN18RR, NELL995, and YAGO3-10. FB15K-237 is sampled from FB15K with redundant relations removed, and WN18RR is a subset of WN18 with the triples that cause test leakage removed. Thus, they are both considered more challenging. NELL995 has separate datasets for 12 query relations, each corresponding to a single-query-relation KBC task. YAGO3-10 contains the largest KG, with millions of edges. Their statistics are shown in Table 1. We find some statistical differences between train and validation (or test). In a KG with all training triples as its edges, a triple is considered a multi-edge triple if the KG contains other triples that also connect and ignoring the direction. We notice that FB15K-237 is a special case compared to the others, as there are no edges in its KG directly linking any pair of and in validation (or test). Therefore, when using training triples as queries to train our model, given a batch, for FB15K-237 we cut off from the KG all triples connecting the head-tail pairs in the given batch, ignoring relation types and edge directions, which forces the model to learn a composite reasoning pattern rather than a single-hop pattern. For the remaining datasets, we only remove the triples of this batch and their inverses from the KG, to avoid information leakage before training on this batch. This choice can be regarded as a hyperparameter that controls whether to force multi-hop reasoning, leading to a performance boost of about in HITS@1 on FB15K-237.
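The two per-batch edge-removal policies described above can be sketched as follows. The representation of the KG as a list of (head, relation, tail) tuples and the `_inv` suffix for inverse relations are assumptions made for illustration, as are the function names.

```python
def inverse(relation):
    """Hypothetical naming convention: 'r' <-> 'r_inv'."""
    return relation[:-4] if relation.endswith("_inv") else relation + "_inv"


def prune_kg_for_batch(kg_triples, batch, drop_all_links_between_pairs):
    """Remove edges that would leak the answers of the current training batch.

    drop_all_links_between_pairs=True  -> drop every edge between a batch head and its tail,
                                          whatever the relation type or direction.
    drop_all_links_between_pairs=False -> drop only the batch triples and their inverses.
    """
    if drop_all_links_between_pairs:
        pairs = {frozenset((h, t)) for h, _, t in batch}
        return [tr for tr in kg_triples if frozenset((tr[0], tr[2])) not in pairs]
    banned = set()
    for h, r, t in batch:
        banned.add((h, r, t))
        banned.add((t, inverse(r), h))
    return [tr for tr in kg_triples if tr not in banned]


if __name__ == "__main__":
    kg = [("a", "r1", "b"), ("b", "r1_inv", "a"),
          ("a", "r2", "b"), ("b", "r2_inv", "a"),
          ("a", "r1", "c"), ("c", "r1_inv", "a")]
    batch = [("a", "r1", "b")]
    print(prune_kg_for_batch(kg, batch, drop_all_links_between_pairs=False))
    print(prune_kg_for_batch(kg, batch, drop_all_links_between_pairs=True))
```

The stricter policy also removes the `r2` edges between the same head and tail, so the model cannot answer the query through any direct link and has to compose a multi-hop pattern.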
Experimental settings. We use the same data split protocol as in many papers [16, 57, 12]. For each dataset except NELL995, we create a KG, a directed graph, consisting of all training triples with their inverses added; NELL995 already includes reciprocal relations. In addition, every node in a KG has a self-loop edge. We also add inverse relations to the validation and test sets to evaluate both directions. For evaluation metrics, we use HITS@1,3,10 and the mean reciprocal rank (MRR) in the filtered setting for FB15K-237, WN18RR, FB15K, WN18, and YAGO3-10, and use the mean average precision (MAP) for NELL995's single-query-relation KBC tasks. For NELL995, we follow the same evaluation procedure as in [57, 12, 45], ranking the answer entities against the negative examples given in their experiments. We run our experiments on a 12 GB GPU, TITAN X (Pascal), with an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. Our code is written in Python based on TensorFlow 2.0 and NumPy 1.16 and is available at https://github.com/netpaladinx/DPMPN. We run each hyperparameter setting three times per dataset and report the means and standard deviations. See hyperparameter details in the appendix.
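For readers unfamiliar with the filtered setting, the metrics can be sketched as below. This is a generic illustration of filtered ranking, HITS@k, and MRR, not the evaluation code of the released repository; the function names and the toy scores are hypothetical. In the filtered setting, other entities known to be correct answers for the same query are excluded before the target is ranked.

```python
def filtered_rank(scores, target, known_answers):
    """Rank of the target among candidates, skipping other known correct answers."""
    target_score = scores[target]
    rank = 1
    for entity, score in scores.items():
        if entity == target or entity in known_answers:
            continue
        if score > target_score:
            rank += 1
    return rank


def hits_at_k(ranks, k):
    """Fraction of queries whose target is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


if __name__ == "__main__":
    scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
    print(filtered_rank(scores, "c", known_answers=set()))   # raw rank
    print(filtered_rank(scores, "c", known_answers={"a"}))   # filtered rank
    ranks = [1, 2, 4]
    print(hits_at_k(ranks, 1), hits_at_k(ranks, 3), mean_reciprocal_rank(ranks))
```

In the toy example, entity "a" is another correct answer, so filtering it out improves the rank of "c" from third to second.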
|Dataset||#Entities||#Rels||#Train||#Valid||#Test||PME (tr)||PME (va)||AL (va)|
|Model||FB15K-237 H@1||FB15K-237 H@3||FB15K-237 H@10||FB15K-237 MRR||WN18RR H@1||WN18RR H@3||WN18RR H@10||WN18RR MRR|
|DistMult||20.6 (.4)||31.8 (.2)||-||29.0 (.2)||38.4 (.4)||42.4 (.3)||-||41.3 (.3)|
|ComplEx||20.8 (.2)||32.6 (.5)||-||29.6 (.2)||38.5 (.3)||43.9 (.3)||-||42.2 (.2)|
|ConvE||23.3 (.4)||33.8 (.3)||-||30.8 (.2)||39.6 (.3)||44.7 (.2)||-||43.3 (.2)|
|NeuralLP||18.2 (.6)||27.2 (.3)||-||24.9 (.2)||37.2 (.1)||43.4 (.1)||-||43.5 (.1)|
|MINERVA||14.1 (.2)||23.2 (.4)||-||20.5 (.3)||35.1 (.1)||44.5 (.4)||-||40.9 (.1)|
|M-Walk||16.5 (.3)||24.3 (.2)||-||23.2 (.2)||41.4 (.1)||44.5 (.2)||-||43.7 (.1)|
|DPMPN||28.6 (.1)||40.3 (.1)||53.0 (.3)||36.9 (.1)||44.4 (.4)||49.7 (.8)||55.8 (.5)||48.2 (.5)|
Baselines. We compare our model against embedding-based approaches, including TransE, TransR, DistMult, ConvE, ComplEx, HolE, RotatE, and ComplEx-N3; path-based approaches that use RL methods, including DeepPath, MINERVA, and M-Walk; and an approach that uses learned neural logic, NeuralLP.
Comparison results and analysis. We report the comparison on FB15K-237 and WN18RR in Table 2. Our model DPMPN significantly outperforms all the baselines in HITS@1,3 and MRR. Compared to the best baseline, we lose only a few points in HITS@10 but gain considerably in HITS@1,3. We speculate that it is the reasoning capability that helps DPMPN make sharp predictions by exploiting graph-structured composition locally and conditionally. When a target becomes too vague to predict, reasoning may lose its advantage over embedding-based models. However, path-based baselines, which have some ability to reason, perform worse than we expected. We argue that it might be inappropriate to treat reasoning, a sequential decision process, as equivalent to a sequence of nodes. The average lengths of the shortest paths between heads and tails, as shown in Table 1, suggest very short paths, which leaves little motivation for using a path. The reasoning pattern should instead be modeled as a dynamic, local, graph-structured pattern whose nodes are densely connected with each other to produce a decision collectively. We also run our model on FB15K, WN18, and YAGO3-10, and the comparison results in the appendix show that DPMPN is very competitive with the best state-of-the-art methods. We summarize the comparison on NELL995's tasks in the appendix. DPMPN performs the best on five tasks and is competitive on the rest.
Convergence analysis. Our model converges very fast during training. We may use only half of the training queries to train the model and still generalize, as shown in Figure 4(A). Compared to less expensive embedding-based models, our model needs to traverse a number of edges when training on one input, which consumes more time per batch, but it does not need a second epoch, thus saving a lot of training time. The reason may be that training queries also belong to the KG's edges, and some might be exploited to construct subgraphs while training on other queries.
Component analysis. If we do not run message passing in IGNN, is just the initial embedding of node , and we can still run pruned message passing in AGNN as usual. We want to know whether IGNN is actually useful. Considering that long-range propagated messages might bring in noisy features, we compare running IGNN for two steps against totally shutting it down. The result in Figure 4(B) shows that IGNN brings a small gain in each metric on WN18RR.
Horizon analysis. The sampling, attending-to, attending-from, and searching (i.e., propagation steps) horizons determine how large an area a subgraph can expand over. These factors affect computational complexity as well as prediction performance. Intuitively, enlarging the explored area by sampling more, attending more, and searching longer may increase the chance of hitting a target and thus improve performance. However, the experimental results in Figure 4(C)(D) show that this is not always the case. In Figure 4(E), we can see that increasing the maximum number of attending-from nodes per step is useful, but typical GPUs with limited memory do not allow for an arbitrarily large number because of the heavy intermediate data produced during the forward pass. Figure 4(F) suggests that the number of propagation steps of AGNN should not go below four.
Attention flow analysis. If the flow-style attention really captures the way we reason about the world, its process should be conducted in a diverging-converging thinking pattern. Intuitively, first, for the diverging thinking phase, we search and collect ideas as much as we can; then, for the converging thinking phase, we try to concentrate our thoughts on one point. To check whether the attention flow has such a pattern, we measure the average entropy of attention distributions changing along steps and also the proportion of attention concentrated at the top-1,3,5 nodes. As we expect, attention is more focused at the final step and the beginning.
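The two statistics used in this analysis, the entropy of an attention distribution and the share of attention held by the top-ranked nodes, can be computed as in the sketch below. The attention distributions in the example are made up for illustration and are not results from the paper; the function names are hypothetical.

```python
import math


def entropy(distribution):
    """Shannon entropy (in nats) of a probability distribution given as a list."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


def top_k_mass(distribution, k):
    """Total probability held by the k largest entries."""
    return sum(sorted(distribution, reverse=True)[:k])


if __name__ == "__main__":
    # Made-up attention distributions over five nodes at three steps:
    # focused at the start, spread out in the middle, focused again at the end.
    steps = [
        [1.0, 0.0, 0.0, 0.0, 0.0],
        [0.2, 0.2, 0.2, 0.2, 0.2],
        [0.05, 0.8, 0.05, 0.05, 0.05],
    ]
    for t, dist in enumerate(steps):
        print(t, round(entropy(dist), 3), round(top_k_mass(dist, 1), 3))
```

A diverging-converging pattern would show up as entropy that rises in the middle steps and falls again toward the final step, with the top-1 mass moving in the opposite direction.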
Time cost analysis. The time cost is affected not only by the scale of a dataset but also by the horizon setting. For each dataset, we list in the appendix the training time for one epoch under our standard hyperparameter settings. Note that there is always a trade-off between complexity and performance. We thus study whether we can substantially reduce the time cost at the price of a small loss in performance. We plot the one-epoch training time in Figure 6(A)-(D), using the same settings as in the horizon analysis. We can see that Max-attending-from-per-step and #Steps-in-AGNN affect the training time significantly, while Max-sampling-per-node and Max-attending-to-per-step affect it only slightly. Therefore, we can use smaller values of Max-sampling-per-node and Max-attending-to-per-step in order to afford a larger batch size, making the computation more efficient, as shown in Figure 6(E).
Visualization. To further demonstrate the reasoning capability, we show visualization results of some pruned subgraphs on NELL995’s test data for 12 separate tasks. We avoid using the training data in order to show generalization of the learned reasoning capability. We show the visualization results in Figure 1. See the appendix for detailed analysis and more visualization results.
5 Related Work
Knowledge graph reasoning. Early work, including TransE and its analogues [55, 33, 23], DistMult, ConvE, and ComplEx, focuses on learning embeddings of entities and relations. Some recent works in this line [46, 27] achieve high accuracy. Another line aims to learn inference paths [29, 18, 20, 34, 49, 13] for knowledge graph reasoning, especially DeepPath, MINERVA, and M-Walk, which use RL to learn multi-hop relational paths. However, these approaches, based on policy gradients or Monte Carlo tree search, often suffer from low sample efficiency and sparse rewards, requiring a large number of rollouts and sophisticated reward function design. Other efforts include learning soft logical rules [10, 62] or compositional programs.
Relational reasoning in Graph Neural Networks. Relational reasoning is regarded as the key for combinatorial generalization, taking the form of entity- and relation-centric organization to reason about the composition structure of the world [11, 28]. A multitude of recent implementations  encode relational inductive biases into neural networks to exploit graph-structured representation, including graph convolution networks (GCNs) [8, 21, 17, 24, 14, 39, 25, 7] and graph neural networks [44, 30, 43, 3, 19]. Variants of GNN architectures have been developed. Relation networks  use a simple but effective neural module to model relational reasoning, and its recurrent versions [42, 40] do multi-step relational inference for long periods; Interaction networks  provide a general-purpose learnable physics engine, and two of its variants are visual interaction networks  and vertex attention interaction networks ; Message passing neural networks  unify various GCNs and GNNs into a general message passing formalism by analogy to the one in graphical models.
Attention mechanism on graphs. Neighborhood attention operation can enhance GNNs’ representation power [52, 22, 54, 26]. These approaches often use multi-head self-attention to focus on specific interactions with neighbors when aggregating messages, inspired by [1, 35, 51]. Most graph-based attention mechanisms attend over neighborhood in a single-hop fashion, and  claims that the multi-hop architecture does not help to model high-order interaction in experiments. However, a flow-style design of attention in  shows a way to model long-range attention, stringing isolated attention operations by transition matrices.
6 Conclusion
We introduce Dynamically Pruned Message Passing Networks (DPMPN) and apply them to large-scale knowledge graph reasoning tasks. We propose to learn an input-dependent local subgraph that is progressively and selectively constructed to model a sequential reasoning process in knowledge graphs. We use graphical attention expression, a flow-style attention mechanism, to guide and prune the underlying message passing, making it scalable to large graphs and also providing clear graphical interpretations. We also take inspiration from the consciousness prior to develop a two-GNN framework that boosts experimental performance.
- (2015) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
- (2018) Relational inductive biases, deep learning, and graph networks. CoRR abs/1806.01261.
- (2016) Interaction networks for learning about objects, relations and physics. In NIPS.
- (2017) The consciousness prior. CoRR abs/1709.08568.
- (2018-11) Challenges for deep learning towards human-level AI.
- (2013) Translating embeddings for modeling multi-relational data. In NIPS.
- (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34, pp. 18–42.
- (2014) Spectral networks and locally connected networks on graphs. CoRR abs/1312.6203.
- (2018) Variational knowledge graph reasoning. In NAACL-HLT.
- (2016) TensorLog: a differentiable deductive database. CoRR abs/1605.06523.
- (1952) The nature of explanation.
- (2018) Go for a walk and arrive at the answer: reasoning over paths in knowledge bases using reinforcement learning. CoRR abs/1711.05851.
- (2017) Chains of reasoning over entities, relations, and text using recurrent neural networks. In EACL.
- (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS.
- (1998) A neuronal model of a global workspace in effortful cognitive tasks. Proceedings of the National Academy of Sciences of the United States of America 95(24), pp. 14529–34.
- (2018) Convolutional 2d knowledge graph embeddings. In AAAI.
- (2015) Convolutional networks on graphs for learning molecular fingerprints. In NIPS.
- (2014) Incorporating vector space similarity in random walk inference over knowledge bases. In EMNLP.
- (2017) Neural message passing for quantum chemistry. In ICML.
- (2015) Traversing knowledge graphs in vector space. In EMNLP.
- (2015) Deep convolutional networks on graph-structured data. CoRR abs/1506.05163.
- (2017) VAIN: attentional multi-agent predictive modeling. In NIPS.
- (2015) Knowledge graph embedding via dynamic mapping matrix. In ACL.
- (2016) Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design 30(8), pp. 595–608.
- (2017) Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907.
- (2018) Attention solves your TSP, approximately.
- (2018) Canonical tensor decomposition for knowledge base completion. In ICML.
- (2017) Building machines that learn and think like people. The Behavioral and brain sciences 40, pp. e253.
- (2011) Random walk inference and learning in a large scale knowledge base. In EMNLP.
- (2016) Gated graph sequence neural networks. CoRR abs/1511.05493.
- (2016) Neural symbolic machines: learning semantic parsers on freebase with weak supervision. In ACL.
- (2018) Multi-hop knowledge graph reasoning with reward shaping. In EMNLP.
- (2015) Learning entity and relation embeddings for knowledge graph completion. In AAAI.
- (2015) Modeling relation paths for representation learning of knowledge bases. In EMNLP.
- (2017) A structured self-attentive sentence embedding. CoRR abs/1703.03130.
- (2014) YAGO3: a knowledge base from multilingual wikipedias. In CIDR.
- (2018) A novel embedding model for knowledge base completion based on convolutional neural network. In NAACL-HLT.
- (2016) Holographic embeddings of knowledge graphs. In AAAI.
- (2016) Learning convolutional neural networks for graphs. In ICML.
- (2018) Recurrent relational networks. In NeurIPS.
- (2018) The book of why: the new science of cause and effect. Basic Books.
- (2018) Relational recurrent neural networks. In NeurIPS.
- (2017) A simple neural network module for relational reasoning. In NIPS.
- (2009) The graph neural network model. IEEE Transactions on Neural Networks 20, pp. 61–80.
- (2018) M-walk: learning to walk over graphs using monte carlo tree search. In NeurIPS.
- (2018) RotatE: knowledge graph embedding by relational rotation in complex space. CoRR abs/1902.10197.
- (2016) Integrated information theory: from consciousness to its physical substrate. Nature Reviews Neuroscience 17, pp. 450–461.
- (2015) Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality.
- (2016) Compositional learning of embeddings for relation paths in knowledge base and text. In ACL.
- (2016) Complex embeddings for simple link prediction. In ICML.
- (2017) Attention is all you need. In NIPS.
- (2018) Graph attention networks. CoRR abs/1710.10903.
- (2018) Knowledge graph reasoning: recent advances.
- (2018) Non-local neural networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–7803.
- (2014) Knowledge graph embedding by translating on hyperplanes. In AAAI.
- (2017) Visual interaction networks: learning a physics simulator from video. In NIPS.
- (2017) DeepPath: a reinforcement learning method for knowledge graph reasoning. In EMNLP.
- (2018) How powerful are graph neural networks?. ArXiv abs/1810.00826.
- (2019) What can neural networks reason about?. ArXiv abs/1905.13211.
- (2018) Modeling attention flow on graphs. CoRR abs/1811.00497.
- (2015) Embedding entities and relations for learning and inference in knowledge bases. CoRR abs/1412.6575.
- (2017) Differentiable learning of logical rules for knowledge base reasoning. In NIPS.
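Proposition. Given a graph $G$ (undirected or directed in both directions), we assume the probability of the degree of an arbitrary node being less than or equal to $d$ is larger than $p$, i.e., $P(\deg(v) \le d) > p$ for all $v \in V$. Considering a sequence of consecutively expanding subgraphs $(G^0, G^1, \ldots, G^T)$, starting with $G^0 = \{v\}$, for all $t \ge 1$, we can ensure
$$P\left(|V_{G^t}| \le \frac{d(d-1)^t - 2}{d-2}\right) > p^{\frac{d(d-1)^{t-1} - 2}{d-2}}.$$
Proof. We consider the extreme case of greedy consecutive expansion, where $G^t = G^{t-1} \cup \Delta G^t = G^{t-1} \cup \partial G^{t-1}$, since if this case satisfies the inequality, any case of consecutive expansion also satisfies it. By definition, all the subgraphs $G^t$ are connected graphs. Here, we use $\Delta V^t$ to denote $V_{\Delta G^t}$ for short. In the extreme case, we can ensure that the nodes $\Delta V^t$ newly added at step $t$ only belong to the neighborhood of the last added nodes $\Delta V^{t-1}$. Since for $t \ge 2$ each node in $\Delta V^{t-1}$ already has at least one edge within $G^{t-1}$ due to the definition of connected graphs, we have
$$P\left(|\Delta V^t| \le |\Delta V^{t-1}|\,(d-1)\right) > p^{|\Delta V^{t-1}|}.$$
For $t = 1$, we have $P(|\Delta V^1| \le d) > p$ and thus
$$P\left(|V_{G^1}| \le 1 + d\right) > p.$$
For $t \ge 2$, based on $|V_{G^t}| = 1 + |\Delta V^1| + \cdots + |\Delta V^t|$, we obtain
$$P\left(|V_{G^t}| \le 1 + d + d(d-1) + \cdots + d(d-1)^{t-1}\right) > p^{1 + d + d(d-1) + \cdots + d(d-1)^{t-2}},$$
which is
$$P\left(|V_{G^t}| \le \frac{d(d-1)^t - 2}{d-2}\right) > p^{\frac{d(d-1)^{t-1} - 2}{d-2}}.$$
We can find that $t = 1$ also satisfies this inequality. ∎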
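The counting behind the greedy expansion argument can be checked numerically. The sketch below is illustrative only and uses notation assumed here: `d` for the degree bound and `t` for the number of expansion steps. It assumes the start node contributes at most `d` new nodes, and that every later frontier node contributes at most `d - 1`, since one of its edges already lies inside the subgraph. It then compares the resulting sum with the closed form $(d(d-1)^t - 2)/(d-2)$, which requires `d > 2`.

```python
def greedy_node_bound(d: int, t: int) -> int:
    """Upper bound on subgraph size after t greedy steps, by direct summation."""
    total = 1       # the start node
    frontier = 1    # nodes added at the previous step
    for step in range(1, t + 1):
        # The start node may add up to d neighbors; a later frontier node
        # adds at most d - 1, because one edge points back into the subgraph.
        frontier = d if step == 1 else frontier * (d - 1)
        total += frontier
    return total


def closed_form_bound(d: int, t: int) -> int:
    """Closed form of 1 + d + d(d-1) + ... + d(d-1)^(t-1), valid for d > 2."""
    if d <= 2:
        raise ValueError("the closed form divides by d - 2, so d must exceed 2")
    return (d * (d - 1) ** t - 2) // (d - 2)


# Compare both expressions on a small grid of hypothetical values.
agreement = all(
    greedy_node_bound(d, t) == closed_form_bound(d, t)
    for d in range(3, 8)
    for t in range(1, 6)
)
```

For `t = 1` the closed form reduces to `d + 1`, matching the base case of one start node plus at most `d` neighbors.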
2 Hyperparameter Settings
| max_sampling_per_step (in IGNN) | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
| max_sampling_per_node (in AGNN) | 200 | 200 | 200 | 200 | 200 | 1000 |
| One-epoch training time (h) | 25.7 | 63.7 | 4.3 | 8.5 | 185.0 | 0.12 |
The hyperparameters can be categorized into three groups:
- Normal hyperparameters, including batch_size, n_dims_att, n_dims, learning_rate, grad_clipnorm, and n_epochs. We set a smaller dimension, n_dims_att, for computation in the attention module, because it uses more edges than message passing does in AGNN and, intuitively, it does not need to propagate high-dimensional messages but only computes scalar scores over a sampled neighborhood, in line with the idea behind the key-value mechanism. We set n_epochs = 1 in most cases, indicating that our model can be trained well in a single epoch thanks to its fast convergence.
- Hyperparameters governing the sampling-attending horizon. In IGNN, max_sampling_per_step controls the maximum number of edges sampled per step. In AGNN, max_sampling_per_node controls the maximum number of neighbors sampled for each selected node per step per input, max_attending_from_per_step controls the maximum number of selected attending-from nodes per step per input, and max_attending_to_per_step controls the maximum number of nodes selected from a sampled neighborhood as attending-to nodes per step per input.
- Hyperparameters governing the searching horizon: n_steps_in_IGNN, the number of propagation steps of standard message passing in IGNN, and n_steps_in_AGNN, the number of propagation steps of pruned message passing in AGNN.
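As a rough illustration of how the three AGNN horizon hyperparameters could interact within one step, here is a minimal sketch under stated assumptions. It is not the authors' implementation. The `Horizon` container, the helper names, the uniform neighbor sampling, the scoring rule (attention of the source times an edge score), and the toy values are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Horizon:
    max_sampling_per_node: int        # neighbors sampled per attending-from node
    max_attending_from_per_step: int  # top-attention nodes kept as sources
    max_attending_to_per_step: int    # top-scored sampled neighbors kept as targets


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k keys with the highest scores."""
    return sorted(scores, key=lambda n: scores[n], reverse=True)[:k]


def sampling_attending_step(
    attention: Dict[str, float],
    neighbors: Dict[str, List[str]],
    edge_score: Callable[[str, str], float],
    horizon: Horizon,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Select the edges on which pruned message passing would run this step."""
    # 1) Attending-from: keep only the nodes with the highest attention.
    sources = top_k(attention, horizon.max_attending_from_per_step)
    # 2) Sampling: draw a bounded number of neighbors for each source.
    sampled: List[Tuple[str, str]] = []
    for src in sources:
        nbrs = neighbors.get(src, [])
        k = min(len(nbrs), horizon.max_sampling_per_node)
        sampled.extend((src, dst) for dst in rng.sample(nbrs, k))
    # 3) Attending-to: score sampled neighbors and keep the best ones.
    candidate: Dict[str, float] = {}
    for src, dst in sampled:
        candidate[dst] = candidate.get(dst, 0.0) + attention[src] * edge_score(src, dst)
    targets = set(top_k(candidate, horizon.max_attending_to_per_step))
    return [(src, dst) for src, dst in sampled if dst in targets]


# Toy example with hypothetical values, far smaller than realistic settings.
toy_neighbors = {"q": ["a", "b", "c"]}
toy_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
kept_edges = sampling_attending_step(
    attention={"q": 1.0},
    neighbors=toy_neighbors,
    edge_score=lambda src, dst: toy_scores[dst],
    horizon=Horizon(max_sampling_per_node=3,
                    max_attending_from_per_step=1,
                    max_attending_to_per_step=2),
    rng=random.Random(0),
)
```

In this sketch the number of sampled edges per step and input is at most `max_attending_from_per_step * max_sampling_per_node`, which is why these settings bound the cost of a step independently of the full graph size.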
Note that we tune these hyperparameters according to not only their performances but also the computation resources available to us. In some cases, to deal with a very large knowledge graph with limited resources, we need to make a trade-off between efficiency and effectiveness. For example, each of NELL995’s single-query-relation tasks has a small training set, though still with a large graph, so we can reduce the batch size in favor of affording larger dimensions and a larger sampling-attending horizon without any concern for waiting too long to finish one epoch.
3 More Experimental Results
| DPMPN | 72.6 (.4) | 78.4 (.4) | 83.4 (.5) | 76.4 (.4) | 91.6 (.8) | 93.6 (.4) | 94.9 (.4) | 92.8 (.6) |
| AthletePlaysForTeam | 83.9 (0.5) | 84.7 (1.3) | 82.7 (0.8) | 72.1 (1.2) | 62.7 | 67.3 |
| AthletePlaysInLeague | 97.5 (0.1) | 97.8 (0.2) | 95.2 (0.8) | 92.7 (5.3) | 77.3 | 91.2 |
| AthleteHomeStadium | 93.6 (0.1) | 91.9 (0.1) | 92.8 (0.1) | 84.6 (0.8) | 71.8 | 72.2 |
| AthletePlaysSport | 98.6 (0.0) | 98.3 (0.1) | 98.6 (0.1) | 91.7 (4.1) | 87.6 | 96.3 |
| TeamPlaysSport | 90.4 (0.4) | 88.4 (1.8) | 87.5 (0.5) | 69.6 (6.7) | 76.1 | 81.4 |
| OrgHeadQuarteredInCity | 94.7 (0.3) | 95.0 (0.7) | 94.5 (0.3) | 79.0 (0.0) | 62.0 | 65.7 |
| WorksFor | 86.8 (0.0) | 84.2 (0.6) | 82.7 (0.5) | 69.9 (0.3) | 67.7 | 69.2 |
| PersonBornInLocation | 84.1 (0.5) | 81.2 (0.0) | 78.2 (0.0) | 75.5 (0.5) | 71.2 | 81.2 |
| PersonLeadsOrg | 88.4 (0.1) | 88.8 (0.5) | 83.0 (2.6) | 79.0 (1.0) | 75.1 | 77.2 |
| OrgHiredPerson | 84.7 (0.8) | 88.8 (0.6) | 87.0 (0.3) | 73.8 (1.9) | 71.9 | 73.7 |
4 More Visualization Results
4.1 Case study on the AthletePlaysForTeam task
In the case shown in Figure 9, the query is (concept_personnorthamerica_michael_turner, concept:athleteplaysforteam, ?) and a true answer is concept_sportsteam_falcons. From Figure 9, we can see that our model learns that (concept_personnorthamerica_michael_turner, concept:athletehomestadium, concept_stadiumoreventvenue_georgia_dome) and (concept_stadiumoreventvenue_georgia_dome, concept:teamhomestadium_inv, concept_sportsteam_falcons) are two important facts supporting the answer concept_sportsteam_falcons. In addition, other facts, such as (concept_athlete_joey_harrington, concept:athletehomestadium, concept_stadiumoreventvenue_georgia_dome) and (concept_athlete_joey_harrington, concept:athleteplaysforteam, concept_sportsteam_falcons), provide a vivid example that a person or an athlete with concept_stadiumoreventvenue_georgia_dome as his or her home stadium might play for the team concept_sportsteam_falcons. There is more than one such example, including concept_athlete_roddy_white and concept_athlete_quarterback_matt_ryan. The entity concept_sportsleague_nfl cannot help us differentiate the true answer from other NFL teams, but it can at least exclude non-NFL teams. In short, our subgraph-structured representation captures the relational and compositional reasoning pattern well.
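The compositional pattern in this case study can be sketched on the four facts quoted above. This is only an illustration of how two facts chain into a supporting path, not the model's attention-based selection. The `two_hop_paths` helper and the tiny in-memory triple list are assumptions made for the example.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# The four facts mentioned in the case study.
TRIPLES: List[Triple] = [
    ("concept_personnorthamerica_michael_turner",
     "concept:athletehomestadium",
     "concept_stadiumoreventvenue_georgia_dome"),
    ("concept_stadiumoreventvenue_georgia_dome",
     "concept:teamhomestadium_inv",
     "concept_sportsteam_falcons"),
    ("concept_athlete_joey_harrington",
     "concept:athletehomestadium",
     "concept_stadiumoreventvenue_georgia_dome"),
    ("concept_athlete_joey_harrington",
     "concept:athleteplaysforteam",
     "concept_sportsteam_falcons"),
]


def two_hop_paths(triples: List[Triple], head: str,
                  tail: str) -> List[Tuple[str, str, str]]:
    """List (relation 1, intermediate entity, relation 2) chains from head to tail."""
    paths = []
    for h1, r1, mid in triples:
        if h1 != head:
            continue
        for h2, r2, t2 in triples:
            if h2 == mid and t2 == tail:
                paths.append((r1, mid, r2))
    return paths


support = two_hop_paths(
    TRIPLES,
    head="concept_personnorthamerica_michael_turner",
    tail="concept_sportsteam_falcons",
)
```

Here `support` holds the single chain through concept_stadiumoreventvenue_georgia_dome, the same two facts the case study identifies as the main evidence for the answer.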
For the AthletePlaysForTeam task
4.2 More results
For the AthletePlaysInLeague task
For the AthleteHomeStadium task
For the AthletePlaysSport task
For the TeamPlaysSport task
For the OrganizationHeadQuarteredInCity task
For the WorksFor task