Dynamically Pruned Message Passing Networks for Large-Scale Knowledge Graph Reasoning
Abstract
We propose Dynamically Pruned Message Passing Networks (DPMPN) for large-scale knowledge graph reasoning. In contrast to existing models, embedding-based or path-based, we learn an input-dependent subgraph to explicitly model a sequential reasoning process. Each subgraph is dynamically constructed, expanding itself selectively under a flow-style attention mechanism. In this way, we can not only construct graphical explanations to interpret predictions, but also prune message passing in Graph Neural Networks (GNNs) to scale with the size of graphs. Inspired by the consciousness prior proposed by Bengio [4], we design a two-GNN framework that encodes a global input-invariant graph-structured representation and learns a local input-dependent one, coordinated by an attention module. Experiments show that our model provides clear graphical explanations while predicting results accurately, outperforming most state-of-the-art methods on knowledge base completion tasks.
1 Introduction
Modern deep learning systems need to acquire reasoning capability beyond their black-box nature to produce interpretable predictions [41, 5]. The form in which we model a reasoning process deserves more thought than merely obtaining a final prediction. Intuitively, a reasoning process can be regarded as a sequence of using existing facts to establish new knowledge, step by step, and finally drawing conclusions in the form of constructing explanations as well as making predictions. Therefore, it needs explicit modeling to identify and organize reasoning steps into a clear, interpretable representation during prediction. A natural idea is to use a graph-structured representation, where a semantic unit or pairwise relation can be explicitly represented by a node or an edge as building blocks to support graph-based reasoning, a more flexible form in contrast to rigid deductive logical reasoning [2, 59].
Graph-based reasoning can be applied to a wide variety of real-world scenarios. Here, we choose to explore knowledge graph-related tasks because they are representative. In knowledge base completion (KBC) tasks, embedding-based models [6, 61, 16, 50, 46, 27] can easily obtain very competitive scores by fitting data using various neural network techniques. However, they lack explicit modeling that constructs explanations by directly exploiting graph structure, which prevents them from being interpretable, a critical property of reasoning, since a Euclidean embedding space does not produce a clearly stated and human-readable representation.
Recent work on knowledge graph (KG) reasoning focuses on path-based [53, 57, 12, 45, 9, 32] or logic-like models [10, 62]. Most of them construct an explicit path to model an iterative decision-making process using reinforcement learning and recurrent networks. However, a question remains: is there a better form, more flexible and interpretable, to express reasoning in the graph context than one or several paths? To this end, we propose to learn a subgraph that starts from a head node and expands itself conditionally and selectively according to a query relation, where a tail node is predicted after the last expansion. To better explain how the tail is determined by the expansion, we weigh, prune and save intermediate nodes selected at each step to capture long-range dependence and yield a concise and compact subgraph explanation for the tail prediction, as shown in Figure 1.
Graph reasoning can be powered by Graph Neural Networks. Graph reasoning demands a way to effectively learn about entities, relations, and rules for composing them, that is, an ability for combinatorial generalization by manipulating structured knowledge and producing structured explanations. Graph Neural Networks (GNNs) provide such structured representation and computation and also inherit powerful data-fitting capacity from deep neural networks [44, 2]. Specifically, GNNs follow a neighborhood aggregation scheme, recursively aggregating and transforming neighboring nodes’ representations to update the representation of each node. Therefore, after $k$ iterations of aggregation, each node can carry the structured information within the node’s $k$-hop neighborhood [19, 58].
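As a minimal sketch of this neighborhood aggregation scheme, the following plain-Python example uses scalar node states and sum aggregation. Both are simplifying assumptions and not the model used in this paper, and the function names and the toy path graph are hypothetical. It shows that a signal placed on a source node reaches a node three hops away only after three aggregation rounds.

```python
from collections import defaultdict


def aggregate_once(states, edges):
    """One aggregation round: every node adds the states of its in-neighbors."""
    incoming = defaultdict(float)
    for src, dst in edges:
        incoming[dst] += states[src]
    # Nodes without in-neighbors receive 0.0 and therefore keep their state.
    return {node: value + incoming[node] for node, value in states.items()}


def run_rounds(states, edges, rounds):
    """Apply the aggregation round `rounds` times without mutating the input."""
    for _ in range(rounds):
        states = aggregate_once(states, edges)
    return states


# Toy directed path 0 -> 1 -> 2 -> 3; only node 0 carries a signal at first.
PATH_EDGES = [(0, 1), (1, 2), (2, 3)]
INITIAL = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}

after_two = run_rounds(INITIAL, PATH_EDGES, 2)
after_three = run_rounds(INITIAL, PATH_EDGES, 3)
print(after_two[3], after_three[3])  # node 3 is untouched after 2 rounds, reached after 3
```

A real GNN replaces the scalar sum with learned message and update functions over vectors, but the hop-by-hop spread of information is the same.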
GNNs need graphical attention expression to interpret. Neighborhood attention operation is a popular way to implement attention mechanisms on graphs [52, 22], using multi-head self-attention to focus on specific interactions with neighbors when aggregating messages. However, we argue that graphical attention expression should instead be designed not only to facilitate structured computation but also to construct dynamically pruned structured explanations. We present three considerations: (1) selecting nodes based on currently operated subgraphs, that is, first attending over nodes within subgraphs to pick a smaller set and then attending over the picked nodes’ neighbors to expand subgraphs; (2) breaking the isolation of attention operations used at each step and propagating attention across steps like a flow to produce long-range influence; and (3) having such a flow-style attention mechanism model a changing node probability distribution, that is, a Markov process driven by step-varying transition matrices. Besides, we should use an attention module disentangled from representation aggregation and transformation in GNNs to explicitly model a reasoning process on graphs, separate from low-level representation computing.
GNNs need input-dependent pruning to scale. GNNs are notorious for their poor scalability due to their heavy computational complexity. Consider, for example, one message passing iteration performed over a graph with $|\mathcal{V}|$ nodes and $|\mathcal{E}|$ edges. It has quadratic complexity in the number of nodes, $O(|\mathcal{V}|^2)$, if the graph is fully connected. Even if the graph is sparse so that the complexity can be reduced to $O(|\mathcal{E}|)$ by exploiting structural sparsity, it is still problematic for large graphs with millions of nodes and edges. Besides, mini-batch training with batch size $B$ and high dimensions $D$ makes things worse, leading to a complexity of $O(BD|\mathcal{E}|)$. We argue that this situation can be avoided by learning input-dependent pruning, since in most cases an input example uses only a small fraction of the entire graph, and it is wasteful to perform homogeneous structured computation over the full graph for each input. Therefore, we propose to prune message passing depending on inputs and run on dynamic computation graphs instead of a static computation graph.
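To make the scaling argument concrete, here is a small back-of-the-envelope sketch. It assumes a unit cost per example, per edge, and per dimension for one message passing iteration. The sizes below are hypothetical round numbers chosen for illustration, not measurements from the experiments.

```python
def message_passing_cost(batch_size, dim, num_edges):
    """Unit-cost count for one iteration: one unit per example, edge and dimension."""
    return batch_size * dim * num_edges


# Hypothetical sizes: a full graph with a million edges versus a pruned,
# input-dependent subgraph with a few thousand edges per example.
FULL_EDGES = 1_000_000
PRUNED_EDGES = 2_000
BATCH, DIM = 100, 100

full_cost = message_passing_cost(BATCH, DIM, FULL_EDGES)
pruned_cost = message_passing_cost(BATCH, DIM, PRUNED_EDGES)
print(full_cost, pruned_cost, full_cost // pruned_cost)
```

Under these assumed sizes, pruning reduces the per-iteration cost by the ratio of full to pruned edge counts, which is the saving the paragraph above argues for.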
Cognitive intuition of the consciousness prior. The notion of attentive awareness has been shared by cognitive science communities in several theories [15, 47]. Bengio [4] brought this notion into deep learning models in his consciousness prior proposal. He pointed out a process of disentangling high-level abstract factors from the full underlying representation to form a low-dimensional combination of a few selected factors that constitutes a conscious thought, and emphasized the role of attention in expressing awareness during this process. Bengio proposed to use two recurrent neural networks (RNNs) to encode two types of state: the unconscious state, represented by a full high-dimensional vector before applying attention, and the conscious state, represented by a derived low-dimensional vector after applying attention.
In our work, we use two GNNs to encode such states into node representation vectors. However, standard message passing runs globally, so messages gathered by a node can come from everywhere and get further entangled by aggregation operations. Therefore, we draw an input-dependent, or context-aware, local subgraph to constrain message passing. We also want to access global information about the graph structure to get a broader view before focusing on a local subgraph. Inspired by the consciousness prior, we apply an attention mechanism to the two GNNs, where the bottom one performs input-invariant standard message passing globally, called the Inattentive GNN (IGNN), and the upper one performs input-dependent pruned message passing locally, called the Attentive GNN (AGNN). The intuition is that the Inattentive GNN can support the Attentive GNN by providing raw representation, entangled but rich, while the Attentive GNN captures various input-dependent subgraphs consisting of a few selected nodes and their edges, cohesive with sharp semantics, disentangled from the full graph. Nodes within such a subgraph are more densely connected, forming a small community that further exchanges information and decides collectively how to grow the subgraph next. In experiments, we find our model can run on very large graphs with millions of edges, such as the YAGO3-10 dataset, even on a laptop without causing out-of-memory errors. Our prediction results on KBC tasks attain very competitive scores on HITS@1,3 and the mean reciprocal rank (MRR) compared to the best embedding-based method so far. Besides, we provide interpretations that embedding-based methods do not offer.
2 Problem Formulation
Notation. We use a supervised setting with training data where is an input and is a target. We denote a full graph by with node set and edge set , and denote an input-dependent subgraph by with node set and edge set . We also denote the set of edge types (or relation types) by . We require each subgraph to hold , so that we can define if and define if . We define the boundary of a subgraph as if where means the union of neighbors of all the nodes in . We also define high-order boundaries such as if . Trainable parameters include node embeddings , relation type embeddings , and neural network weights used in two GNNs and an attention module. When performing standard or pruned message passing, node embeddings and relation type embeddings will be indexed according to the operated graph, and thus we denote them by or . We denote batch size by and dimensions by . For IGNN, we use of size to denote node hidden states at step ; for AGNN, we use of size to denote.
We define the objective based on our two GNNs as , where is dynamically constructed. First, we write the standard message passing in IGNN as
(1) 
where represents all involved operations in one message passing iteration over , including: (1) computing messages along each edge with the complexity of , (2) aggregating messages received at each node with the complexity of , and (3) updating node hidden states with the complexity of . Here, we assume the per-example, per-edge, per-dimension time cost as a unit time. For a step propagation, we get the per-batch complexity of . Considering that backpropagation requires intermediate computation results to be saved during one pass, this complexity counts for both time and space. However, since IGNN is input-invariant, its node representations can be shared across input examples in one batch so that can be removed to get . If we sample a smaller set from to run such that , we can further reduce the complexity to .
The pruned message passing in AGNN can be written as
(2) 
Its complexity can be computed similarly as above. However, we cannot remove . Fortunately, subgraph is not . If we let be a node , grows from a single node, i.e., , and expands itself each step, leading to a sequence of . Here, we describe the expansion behavior as consecutive expansion, which means no jumping across neighborhood allowed, so that we can ensure that
(3) 
Many real-world graphs follow the small-world pattern, and the six degrees of separation implies . The upper bound of can grow exponentially in , and there is no guarantee that will not explode.
Proposition.
Given a graph (undirected or directed in both directions), we assume the probability of the degree of an arbitrary node being less than or equal to is larger than , i.e., . Considering a sequence of consecutively expanding subgraphs , starting with , for all , we can ensure
(4) 
The proposition implies the guarantee of upperbounding becomes exponentially looser and weaker as gets larger even if the given assumption has a small and a large (close to 1). We define graph increment at step as such that . To prevent from explosion, we need to constrain . We propose several sampling strategies:

, which means we sample nodes from the boundary of .

, which means we take the boundary of sampled nodes from .

, which means we sample nodes from the boundary of .

, which means we sample nodes from .
Obviously, we have and . Further, we let and be the maximum number of sampled nodes in and the last sampling of respectively and let be per-node maximum sampled neighbors in , and then we can obtain a much tighter guarantee as follows:

for .

and for .

for .
By , we can guarantee . To constrain the growth of , we can decrease either or . However, smaller sample size means less area explored and less chance to hit target nodes. We thus use attention operations to do the top selection instead of random sampling when has to be small. We change to where represents the operation of attending over nodes and picking the top. There are two types of attention operations, one applied to and the other applied to . Note that the size of might be much larger than if we want to sample more nodes with larger to sufficiently explore the boundary, . Nevertheless, we can address this problem by using small dimensions to compute attention scores, since attention carried by each node is just a scalar, much smaller than a node representation vector computed during message passing over .
3 Model Implementation
3.1 Architecture design for knowledge graph reasoning
Our model architecture, as shown in Figure 2, consists of:

IGNN module: performs standard message passing to compute full-graph node representations.

AGNN module: performs a batch of pruned message passing to compute input-dependent node representations, which also make use of low-level representations from IGNN.

Attention module: performs a flow-style attention transition process, conditioned on node representations from both IGNN and AGNN but only affecting AGNN.
We let denote a knowledge graph where is a set of entities and is a set of relations. Each edge or relation is represented by a triple , where is the head entity, is the tail entity, and is their relation type. The goal is to predict potential unknown links, i.e., which entity is likely to be the tail given a query with the head and the relation type specified.
IGNN module. We implement it using standard message passing mechanism [19]. If the full graph has an extremely large number of edges, we sample a subset of edges, , randomly each step. For a batch of input queries, we let node representations from IGNN be shared across queries, containing no batch dimension. Thus, its complexity does not scale with batch size and the saved resources can be allocated to sampling more edges. Each node has a state at step , where the initial . Each edge produces a message, denoted by at step . The computation components include:

Message function: , where .

Message aggregation: , where .

Node state update function: , where .
We compute messages only for the sampled edges, , each step. Functions and are implemented by a two-layer MLP (using for the first layer and for the second) with input arguments concatenated respectively. Messages are aggregated by dividing the sum by the square root of , the number of sampled neighbors that send messages to , preserving the scale of variance. We use a residual connection to update each node state instead of a GRU or an LSTM. After running for steps, we output a pooling result or simply the last state, denoted by , to feed into downstream modules.
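The aggregation and update rules described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions. The message and update functions are passed in as placeholders standing in for the two-layer MLPs. The argument order of the message function (source state, relation, destination state), the function names, and the toy graph are hypothetical.

```python
import math


def ignn_step(states, sampled_edges, message_fn, update_fn):
    """One IGNN-style step over sampled edges given as (src, relation, dst)."""
    dim = len(next(iter(states.values())))
    summed = {node: [0.0] * dim for node in states}
    senders = {node: 0 for node in states}
    for src, rel, dst in sampled_edges:
        msg = message_fn(states[src], rel, states[dst])
        summed[dst] = [a + b for a, b in zip(summed[dst], msg)]
        senders[dst] += 1
    new_states = {}
    for node, state in states.items():
        # Divide the summed messages by sqrt(number of sending neighbors).
        scale = math.sqrt(senders[node]) if senders[node] else 1.0
        aggregated = [value / scale for value in summed[node]]
        delta = update_fn(state, aggregated)
        # Residual update: add the computed change to the previous state.
        new_states[node] = [s + d for s, d in zip(state, delta)]
    return new_states


def copy_source(src_state, rel, dst_state):
    """Placeholder message function: forwards the source state unchanged."""
    return list(src_state)


def identity_update(state, aggregated):
    """Placeholder update function: uses the aggregated message as the change."""
    return aggregated


toy_states = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
toy_edges = [("a", "r1", "c"), ("b", "r2", "c")]
new_states = ignn_step(toy_states, toy_edges, copy_source, identity_update)
print(new_states["c"])  # each component equals 1 / sqrt(2)
```

Dividing by the square root of the sender count, rather than by the count itself, keeps the variance of the aggregated message roughly independent of how many neighbors were sampled.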
AGNN module. AGNN is input-dependent, which means node states depend on input query , denoted by . We implement pruned message passing, running on small subgraphs each conditioned on an input query. We leverage the sparsity and only save for visited nodes . When , we start from node with . When computing messages, denoted by , we use a sampling-attending procedure, explained in Section 3.2, to constrain the number of computed edges. The computation components include:

Message function: , where , and represents a context vector. In practice, we can use a smaller set of edges than to pass messages, as discussed in Section 3.2.

Message aggregation: , where .

Node state attending function: , where is an attention score.

Node state update function: , where also represents a context vector.
Query context is defined by its head and relation type embeddings, i.e., and . We introduce a node state attending function to pass node representation information from IGNN to AGNN weighted by a scalar attention score and projected by a learnable matrix . We initialize for node , treating the rest as zero states.
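The node state attending function can be illustrated with a short sketch: the IGNN state of a node is projected by a matrix and scaled by that node's scalar attention score before entering AGNN. The example below uses plain Python lists, and the function name, the weight values, and the state values are hypothetical.

```python
def attend_node_state(attention_score, ignn_state, weight):
    """Project an IGNN node state with `weight`, then scale it by the attention score."""
    projected = [sum(w * h for w, h in zip(row, ignn_state)) for row in weight]
    return [attention_score * p for p in projected]


# Hypothetical 2x2 projection matrix and 2-dimensional IGNN state.
toy_weight = [[1.0, 0.0], [0.0, 2.0]]
toy_ignn_state = [3.0, 4.0]

attended = attend_node_state(0.5, toy_ignn_state, toy_weight)
unattended = attend_node_state(0.0, toy_ignn_state, toy_weight)
print(attended, unattended)
```

A node with zero attention contributes nothing from IGNN, so only attended nodes carry full-graph information into the pruned subgraph.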
Attention module. Attention over steps is represented by a sequence of node probability distributions, denoted by . The initial distribution is a one-hot vector with . To spread attention, we need to compute transition matrices each step. Since it is conditioned on both IGNN and AGNN, we capture two types of interaction between and : , and . The former favors visited nodes, while the latter is used to attend to unseen nodes.
(5) 
where and are two learnable matrices. Each MLP uses one single layer with the activation. To reduce the complexity for computing , we use nodes , which contains nodes with the largest attention scores at step , and use nodes sampled from ’s neighbors to compute attention transition for the next step. Because these nodes result from keeping only the top-scoring nodes, some attention is lost in pruning and the total amount diminishes. Therefore, we use a renormalized version, , to compute new attention scores. We use attention scores at the final step as the probability to predict the tail node.
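The pruning, renormalization, and flow of attention can be sketched as follows. This is a toy illustration, not the paper's implementation: the transition probabilities are given by hand here, whereas in the model they are computed from node representations, and all names and numbers are hypothetical.

```python
def prune_and_renormalize(attention, k):
    """Keep the k highest-scoring nodes and rescale their attention to sum to 1."""
    kept = sorted(attention, key=attention.get, reverse=True)[:k]
    total = sum(attention[node] for node in kept)
    return {node: attention[node] / total for node in kept}


def flow_step(attention, transition):
    """Move attention along transitions; transition[u][v] is the share sent from u to v."""
    next_attention = {}
    for src, amount in attention.items():
        for dst, prob in transition[src].items():
            next_attention[dst] = next_attention.get(dst, 0.0) + amount * prob
    return next_attention


toy_attention = {"a": 0.5, "b": 0.3, "c": 0.2}
# Hand-written transition rows, each summing to 1 (an assumption for this sketch).
toy_transition = {"a": {"a": 0.5, "d": 0.5}, "b": {"d": 1.0}}

pruned = prune_and_renormalize(toy_attention, k=2)
flowed = flow_step(pruned, toy_transition)
print(pruned, flowed)
```

Without the renormalization, the attention dropped with node "c" would be lost for good and the total mass would shrink at every step.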
3.2 Complexity reduction by iterative sampling and attending
AGNN deals with local subgraphs for each input so that only a few selected nodes are kept in , called visited nodes, and is much smaller than . The initial contains only one node , and then is enlarged each step by adding new nodes. When propagating messages, we can just consider the one-step neighborhood each step. However, the expansion goes so rapidly that it covers almost all nodes after a few steps. The key to addressing the problem is to constrain the scope of nodes we can expand the boundary from, i.e., the core nodes which determine where we can go next. We call it the attending-from horizon, , selected according to attention scores . Given this horizon, we still need edge sampling over its neighborhood instead of using the whole in case of a hub node of extremely high degree. Here, we face a trade-off between coverage and complexity when sampling over the neighborhood. Also, we need node representations within each subgraph to keep their information coherent and avoid possible noise caused by random sampling. Therefore, we introduce an attending-to horizon inside the sampling horizon. We denote the sampling horizon by and the attending-to horizon by . The attention module runs within the sampling horizon with smaller dimensions in order to sample more neighbors for a larger coverage. Then, we prune the sampling horizon to obtain the attending-to horizon, which contains a subset of nodes selected according to newly computed attention scores . The current message passing iteration at step in AGNN can be further constrained on edges between and , a smaller set than . We illustrate this procedure in Figure 3.
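One expansion step of this sampling-attending procedure can be sketched as below. This is a simplified illustration under stated assumptions. The scoring function is a hypothetical stand-in for the attention module. The parameter names `max_from`, `max_sample`, and `max_to` are made up for the sketch, and the toy graph is a single hub node with fifty neighbors.

```python
import random


def expand_once(adj, attention, visited, score_fn, max_from, max_sample, max_to, rng):
    """One expansion: attending-from -> sampling horizon -> attending-to horizon."""
    # (1) Attending-from horizon: visited nodes with the largest attention.
    attend_from = sorted(visited, key=lambda v: attention.get(v, 0.0), reverse=True)[:max_from]
    # (2) Sampling horizon: cap the number of neighbors taken per node.
    sampled_edges = []
    for src in attend_from:
        neighbors = adj.get(src, [])
        picked = neighbors if len(neighbors) <= max_sample else rng.sample(neighbors, max_sample)
        sampled_edges.extend((src, dst) for dst in picked)
    # (3) Score sampled nodes (placeholder for the attention module).
    new_scores = {}
    for src, dst in sampled_edges:
        new_scores[dst] = new_scores.get(dst, 0.0) + attention.get(src, 0.0) * score_fn(src, dst)
    # (4) Attending-to horizon: keep only the best-scoring sampled nodes.
    attend_to = set(sorted(new_scores, key=new_scores.get, reverse=True)[:max_to])
    # (5) Messages are passed only on edges into the attending-to horizon.
    kept_edges = [(src, dst) for src, dst in sampled_edges if dst in attend_to]
    return kept_edges, attend_to, visited | attend_to


# Toy hub graph: node 0 has fifty neighbors, far more than we want to touch.
hub_adj = {0: list(range(1, 51))}
kept_edges, attend_to, new_visited = expand_once(
    hub_adj, {0: 1.0}, {0}, lambda src, dst: 1.0 / dst,
    max_from=1, max_sample=10, max_to=3, rng=random.Random(0),
)
print(len(kept_edges), len(new_visited))
```

Cheap scoring runs over the ten sampled neighbors, while the more expensive message passing only touches the three edges that survive pruning, which is the coverage versus complexity trade-off described above.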
4 Experiments
Datasets. We use six large KG datasets: FB15K, FB15K-237, WN18, WN18RR, NELL995, and YAGO3-10. FB15K-237 [48] is sampled from FB15K [6] with redundant relations removed, and WN18RR [16] is a subset of WN18 [6] removing triples that cause test leakage. Thus, they are both considered more challenging. NELL995 [57] has separate datasets for 12 query relations, each corresponding to a single-query-relation KBC task. YAGO3-10 [36] contains the largest KG with millions of edges. Their statistics are shown in Table 1. We find some statistical differences between train and validation (or test). In a KG with all training triples as its edges, a triple is considered as a multi-edge triple if the KG contains other triples that also connect and ignoring the direction. We notice that FB15K-237 is a special case compared to the others, as there are no edges in its KG directly linking any pair of and in validation (or test). Therefore, when using training triples as queries to train our model, given a batch, for FB15K-237, we cut off from the KG all triples connecting the head-tail pairs in the given batch, ignoring relation types and edge directions, forcing the model to learn a composite reasoning pattern rather than a single-hop pattern. For the remaining datasets, we only remove the triples of this batch and their inverses from the KG, to avoid information leakage, before training on this batch. This can be regarded as a hyperparameter controlling whether to force multi-hop reasoning or not, leading to a performance boost of about in HITS@1 on FB15K-237.
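The two batch-wise edge removal modes described above can be sketched as follows. The `_inv` suffix for inverse relations, the function names, and the toy triples are assumptions made for this illustration and are not taken from the released code.

```python
def inverse_relation(relation):
    """Hypothetical naming convention: an inverse relation carries an '_inv' suffix."""
    return relation[:-4] if relation.endswith("_inv") else relation + "_inv"


def prune_graph_for_batch(kg_triples, batch, cut_all_links):
    """Return the KG edges that remain usable while training on `batch`."""
    if cut_all_links:
        # Drop every edge between a head-tail pair of the batch, ignoring
        # relation types and edge directions (forces multi-hop reasoning).
        pairs = {frozenset((head, tail)) for head, _, tail in batch}
        return [(h, r, t) for h, r, t in kg_triples if frozenset((h, t)) not in pairs]
    # Otherwise drop only the batch triples themselves and their inverses.
    blocked = set(batch) | {(t, inverse_relation(r), h) for h, r, t in batch}
    return [triple for triple in kg_triples if triple not in blocked]


toy_kg = [
    ("a", "born_in", "b"),
    ("b", "born_in_inv", "a"),
    ("a", "lives_in", "b"),
    ("a", "knows", "c"),
]
toy_batch = [("a", "born_in", "b")]

strict = prune_graph_for_batch(toy_kg, toy_batch, cut_all_links=True)
lenient = prune_graph_for_batch(toy_kg, toy_batch, cut_all_links=False)
print(strict, lenient)
```

In the strict mode the parallel edge `lives_in` between the same pair is also removed, so the model cannot answer the query through a single-hop shortcut.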
Experimental settings. We use the same data split protocol as in many papers [16, 57, 12]. For each dataset, we create a KG, a directed graph, consisting of all train triples with their inverses added, except for NELL995, since it already includes reciprocal relations. Besides, every node in a KG has a self-loop edge. We also add inverse relations to the validation and test sets to evaluate both directions. For evaluation metrics, we use HITS@1,3,10 and the mean reciprocal rank (MRR) in the filtered setting for FB15K-237, WN18RR, FB15K, WN18, and YAGO3-10, and use the mean average precision (MAP) for NELL995’s single-query-relation KBC tasks. For NELL995, we follow the same evaluation procedure as in [57, 12, 45], ranking the answer entities against the negative examples given in their experiments. We run our experiments on a 12G-memory GPU, TITAN X (Pascal), with an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. Our code is written in Python based on TensorFlow 2.0 and NumPy 1.16 and can be found at https://github.com/netpaladinx/DPMPN. We run each hyperparameter setting three times per dataset to report the means and standard deviations. See hyperparameter details in the appendix.
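For readers unfamiliar with the filtered setting, the metrics can be sketched as below: when ranking the target tail, other entities already known to be correct tails for the same query are skipped. This is a generic illustration with hypothetical scores and names, not the evaluation script of this paper.

```python
def filtered_rank(scores, target, known_tails):
    """Rank of `target` among candidates, ignoring other known correct tails."""
    target_score = scores[target]
    better = sum(
        1 for entity, score in scores.items()
        if score > target_score and entity not in known_tails
    )
    return better + 1


def hits_at_k(ranks, k):
    """Fraction of queries whose target is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
raw = filtered_rank(toy_scores, "c", known_tails=set())
filtered = filtered_rank(toy_scores, "c", known_tails={"a"})

toy_ranks = [1, 2, 4]
print(raw, filtered, hits_at_k(toy_ranks, 3), mean_reciprocal_rank(toy_ranks))
```

Here entity "a" outscores the target but is itself a correct answer, so filtering moves the target from rank 3 to rank 2.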


Table 1: Dataset statistics. PME: proportion of multi-edge triples in train (tr) or validation (va). AL (va): average length of the shortest paths between heads and tails in validation.
Dataset  #Entities  #Rels  #Train  #Valid  #Test  PME (tr)  PME (va)  AL (va)
FB15K  14,951  1,345  483,142  50,000  59,071  81.2%  80.6%  1.22
FB15K-237  14,541  237  272,115  17,535  20,466  38.0%  0%  2.25
WN18  40,943  18  141,442  5,000  5,000  93.1%  94.0%  1.18
WN18RR  40,943  11  86,835  3,034  3,134  34.5%  35.5%  2.84
NELL995  74,536  200  149,678  543  2,818  100%  31.1%  2.00
YAGO3-10  123,188  37  1,079,040  5,000  5,000  56.4%  56.0%  1.75




Table 2: Comparison results on FB15K-237 (first four metric columns) and WN18RR (last four metric columns). A dash marks a value that is not reported.
Metric (%)  H@1  H@3  H@10  MRR  H@1  H@3  H@10  MRR
TransE []  -  -  46.5  29.4  -  -  50.1  22.6
DistMult []  15.5  26.3  41.9  24.1  39  44  49  43
DistMult []  20.6 (.4)  31.8 (.2)  -  29.0 (.2)  38.4 (.4)  42.4 (.3)  -  41.3 (.3)
ComplEx []  15.8  27.5  42.8  24.7  41  46  51  44
ComplEx []  20.8 (.2)  32.6 (.5)  -  29.6 (.2)  38.5 (.3)  43.9 (.3)  -  42.2 (.2)
ConvE []  23.7  35.6  50.1  32.5  40  44  52  43
ConvE []  23.3 (.4)  33.8 (.3)  -  30.8 (.2)  39.6 (.3)  44.7 (.2)  -  43.3 (.2)
RotatE []  24.1  37.5  53.3  33.8  42.8  49.2  57.1  47.6
ComplEx-N3 []  -  -  56  37  -  -  57  48
NeuralLP []  18.2 (.6)  27.2 (.3)  -  24.9 (.2)  37.2 (.1)  43.4 (.1)  -  43.5 (.1)
MINERVA []  14.1 (.2)  23.2 (.4)  -  20.5 (.3)  35.1 (.1)  44.5 (.4)  -  40.9 (.1)
MINERVA []  -  -  45.6  -  41.3  45.6  51.3  -
M-Walk []  16.5 (.3)  24.3 (.2)  -  23.2 (.2)  41.4 (.1)  44.5 (.2)  -  43.7 (.1)
DPMPN  28.6 (.1)  40.3 (.1)  53.0 (.3)  36.9 (.1)  44.4 (.4)  49.7 (.8)  55.8 (.5)  48.2 (.5)

Baselines. We compare our model against embedding-based approaches, including TransE [6], TransR [33], DistMult [61], ConvE [16], ComplEx [50], HolE [38], RotatE [46], and ComplEx-N3 [27]; path-based approaches that use RL methods, including DeepPath [57], MINERVA [12], and M-Walk [45]; and an approach that uses learned neural logic, NeuralLP [62].
Comparison results and analysis. We report the comparison on FB15K-237 and WN18RR in Table 2. Our model DPMPN outperforms all the baselines in HITS@1,3 on both datasets and is on par with the best baseline in MRR (36.9 vs. 37 on FB15K-237 and 48.2 vs. 48 on WN18RR, both against ComplEx-N3). Compared to the best baseline, we lose only a few points in HITS@10 but gain a lot in HITS@1,3. We speculate that it is the reasoning capability that helps DPMPN make a sharp prediction by exploiting graph-structured composition locally and conditionally. When a target becomes too vague to predict, reasoning may lose its advantage against embedding-based models. However, path-based baselines, with a certain ability to do reasoning, perform worse than we expected. We argue that it might be inappropriate to treat reasoning, a sequential decision process, as equivalent to a sequence of nodes. The average lengths of the shortest paths between heads and tails, as shown in Table 1, suggest very short paths, which undermines the motivation for using a path. The reasoning pattern should instead be modeled as a dynamic local graph-structured pattern with nodes densely connected with each other to produce a decision collectively. We also run our model on FB15K, WN18, and YAGO3-10, and the comparison results in the appendix show that DPMPN achieves a very competitive position against the state of the art. We summarize the comparison on NELL995’s tasks in the appendix. DPMPN performs the best on five tasks and is competitive on the rest.
Convergence analysis. Our model converges very fast during training. Half of the training queries may be enough to train a model that generalizes, as shown in Figure 4(A). Compared to less expensive embedding-based models, our model needs to traverse a number of edges when training on one input, consuming much time per batch, but it does not need a second epoch, thus saving a lot of training time. The reason may be that training queries also belong to the KG’s edges, and some might be exploited to construct subgraphs during training on other queries.
Component analysis. If we do not run message passing in IGNN, is just the initial embedding of node , and we can still run pruned message passing in AGNN as usual. We want to know whether IGNN is actually useful. Considering that long-range propagated messages might bring in noisy features, we compare running IGNN for two steps against shutting it down entirely. The result in Figure 4(B) shows that IGNN brings a small gain in each metric on WN18RR.
Horizon analysis. The sampling, attending-to, attending-from and searching (i.e., propagation steps) horizons determine how large an area a subgraph can expand over. These factors affect computation complexity as well as prediction performance. Intuitively, enlarging the explored area by sampling more, attending more, and searching longer may increase the chance of hitting a target and thus improve performance. However, the experimental results in Figure 4(C)(D) show that this is not always the case. In Figure 4(E), we can see that increasing the maximum number of attending-from nodes per step is useful, but normal GPUs with limited memory do not allow for an arbitrarily large number due to the heavy intermediate data produced during feedforward computing. Figure 4(F) suggests that the propagation steps of AGNN should not go below four.
Attention flow analysis. If the flow-style attention really captures the way we reason about the world, its process should follow a diverging-converging thinking pattern. Intuitively, in the diverging phase, we search for and collect as many ideas as we can; then, in the converging phase, we try to concentrate our thoughts on one point. To check whether the attention flow has such a pattern, we measure the average entropy of attention distributions across steps and also the proportion of attention concentrated at the top-1,3,5 nodes. As we expect, attention is more focused at the beginning and at the final step.
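The two measurements used in this analysis, entropy and top-k attention mass, can be sketched as follows. The three attention distributions below are hypothetical values made up to illustrate a diverging-converging pattern. They are not results from the experiments.

```python
import math


def entropy(distribution):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


def top_k_mass(distribution, k):
    """Share of attention held by the k most attended nodes."""
    return sum(sorted(distribution, reverse=True)[:k])


# Hypothetical attention over four nodes at three successive steps:
# one-hot start, spread-out middle, re-concentrated end.
toy_steps = [
    [1.0, 0.0, 0.0, 0.0],
    [0.4, 0.3, 0.2, 0.1],
    [0.85, 0.05, 0.05, 0.05],
]

entropies = [entropy(step) for step in toy_steps]
top1 = [top_k_mass(step, 1) for step in toy_steps]
print(entropies, top1)
```

Low entropy together with a high top-1 share indicates focused attention, while high entropy indicates the diverging phase.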
Time cost analysis. The time cost is affected not only by the scale of a dataset but also by the horizon setting. For each dataset, we list the training time for one epoch corresponding to our standard hyperparameter settings in the appendix. Note that there is always a trade-off between complexity and performance. We thus study whether we can substantially reduce the time cost at the price of a small loss in performance. We plot the one-epoch training time in Figure 6(A)-(D), using the same settings as in the horizon analysis. We can see that Max-attending-from-per-step and #Steps-in-AGNN affect the training time significantly, while Max-sampling-per-node and Max-attending-to-per-step affect it only slightly. Therefore, we can use smaller Max-sampling-per-node and Max-attending-to-per-step in order to gain a larger batch size, making the computation more efficient, as shown in Figure 6(E).
Visualization. To further demonstrate the reasoning capability, we show visualization results of some pruned subgraphs on NELL995’s test data for 12 separate tasks. We avoid using the training data in order to show generalization of the learned reasoning capability. We show the visualization results in Figure 1. See the appendix for detailed analysis and more visualization results.
5 Related Work
Knowledge graph reasoning. Early work, including TransE [6] and its analogues [55, 33, 23], DistMult [61], ConvE [16] and ComplEx [50], focuses on learning embeddings of entities and relations. Some recent works along this line [46, 27] achieve high accuracy. Another line aims to learn inference paths [29, 18, 20, 34, 49, 13] for knowledge graph reasoning, especially DeepPath [57], MINERVA [12], and M-Walk [45], which use RL to learn multi-hop relational paths. However, these approaches, based on policy gradients or Monte Carlo tree search, often suffer from low sample efficiency and sparse rewards, requiring a large number of rollouts and sophisticated reward function design. Other efforts include learning soft logical rules [10, 62] or compositional programs [31].
Relational reasoning in Graph Neural Networks. Relational reasoning is regarded as the key to combinatorial generalization, taking the form of entity- and relation-centric organization to reason about the compositional structure of the world [11, 28]. A multitude of recent implementations [2] encode relational inductive biases into neural networks to exploit graph-structured representation, including graph convolution networks (GCNs) [8, 21, 17, 24, 14, 39, 25, 7] and graph neural networks [44, 30, 43, 3, 19]. Variants of GNN architectures have been developed. Relation networks [43] use a simple but effective neural module to model relational reasoning, and their recurrent versions [42, 40] do multi-step relational inference for long periods; interaction networks [3] provide a general-purpose learnable physics engine, and two of their variants are visual interaction networks [56] and vertex attention interaction networks [22]; message passing neural networks [19] unify various GCNs and GNNs into a general message passing formalism by analogy to the one in graphical models.
Attention mechanism on graphs. Neighborhood attention operation can enhance GNNs’ representation power [52, 22, 54, 26]. These approaches often use multi-head self-attention to focus on specific interactions with neighbors when aggregating messages, inspired by [1, 35, 51]. Most graph-based attention mechanisms attend over the neighborhood in a single-hop fashion, and [22] claims that the multi-hop architecture does not help to model high-order interaction in experiments. However, a flow-style design of attention in [60] shows a way to model long-range attention, stringing isolated attention operations together by transition matrices.
6 Conclusion
We introduce Dynamically Pruned Message Passing Networks (DPMPN) and apply them to large-scale knowledge graph reasoning tasks. We propose to learn an input-dependent local subgraph which is progressively and selectively constructed to model a sequential reasoning process in knowledge graphs. We use graphical attention expression, a flow-style attention mechanism, to guide and prune the underlying message passing, making it scalable to large-scale graphs and also providing clear graphical interpretations. We also take inspiration from the consciousness prior to develop a two-GNN framework to boost experimental performance.
References
 [1] (2015) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. Cited by: §5.
 [2] (2018) Relational inductive biases, deep learning, and graph networks. CoRR abs/1806.01261. Cited by: §1, §1, §5.
 [3] (2016) Interaction networks for learning about objects, relations and physics. In NIPS, Cited by: §5.
 [4] (2017) The consciousness prior. CoRR abs/1709.08568. Cited by: Dynamically Pruned Message Passing Networks for LargeScale Knowledge Graph Reasoning, §1, 1st item.
 [5] (2018-11) Challenges for deep learning towards human-level AI. Cited by: §1.
 [6] (2013) Translating embeddings for modeling multi-relational data. In NIPS, Cited by: §1, §4, §4, §5.
 [7] (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34, pp. 18–42. Cited by: §5.
 [8] (2014) Spectral networks and locally connected networks on graphs. CoRR abs/1312.6203. Cited by: §5.
 [9] (2018) Variational knowledge graph reasoning. In NAACL-HLT, Cited by: §1.
 [10] (2016) TensorLog: a differentiable deductive database. CoRR abs/1605.06523. Cited by: §1, §5.
 [11] (1952) The nature of explanation. Cited by: §5.
 [12] (2018) Go for a walk and arrive at the answer: reasoning over paths in knowledge bases using reinforcement learning. CoRR abs/1711.05851. Cited by: §1, Table 2, §4, §4, §5.
 [13] (2017) Chains of reasoning over entities, relations, and text using recurrent neural networks. In EACL, Cited by: §5.
 [14] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, Cited by: §5.
 [15] (1998) A neuronal model of a global workspace in effortful cognitive tasks. Proceedings of the National Academy of Sciences of the United States of America 95 (24), pp. 14529–34. Cited by: §1.
 [16] (2018) Convolutional 2d knowledge graph embeddings. In AAAI, Cited by: §1, Table 4, Table 5, Table 2, §4, §4, §4, §5.
 [17] (2015) Convolutional networks on graphs for learning molecular fingerprints. In NIPS, Cited by: §5.
 [18] (2014) Incorporating vector space similarity in random walk inference over knowledge bases. In EMNLP, Cited by: §5.
 [19] (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §1, §3.1, §5.
 [20] (2015) Traversing knowledge graphs in vector space. In EMNLP, Cited by: §5.
 [21] (2015) Deep convolutional networks on graph-structured data. CoRR abs/1506.05163. Cited by: §5.
 [22] (2017) VAIN: attentional multi-agent predictive modeling. In NIPS, Cited by: §1, §5, §5.
 [23] (2015) Knowledge graph embedding via dynamic mapping matrix. In ACL, Cited by: §5.
 [24] (2016) Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design 30 (8), pp. 595–608. Cited by: §5.
 [25] (2017) Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907. Cited by: §5.
 [26] (2018) Attention solves your TSP, approximately. Cited by: §5.
 [27] (2018) Canonical tensor decomposition for knowledge base completion. In ICML, Cited by: §1, Table 4, Table 5, Table 2, §4, §5.
 [28] (2017) Building machines that learn and think like people. The Behavioral and brain sciences 40, pp. e253. Cited by: §5.
 [29] (2011) Random walk inference and learning in a large scale knowledge base. In EMNLP, Cited by: §5.
 [30] (2016) Gated graph sequence neural networks. CoRR abs/1511.05493. Cited by: §5.
 [31] (2016) Neural symbolic machines: learning semantic parsers on freebase with weak supervision. In ACL, Cited by: §5.
 [32] (2018) Multi-hop knowledge graph reasoning with reward shaping. In EMNLP, Cited by: §1.
 [33] (2015) Learning entity and relation embeddings for knowledge graph completion. In AAAI, Cited by: §4, §5.
 [34] (2015) Modeling relation paths for representation learning of knowledge bases. In EMNLP, Cited by: §5.
 [35] (2017) A structured self-attentive sentence embedding. CoRR abs/1703.03130. Cited by: §5.
 [36] (2014) YAGO3: a knowledge base from multilingual wikipedias. In CIDR, Cited by: §4.
 [37] (2018) A novel embedding model for knowledge base completion based on convolutional neural network. In NAACL-HLT, Cited by: Table 2.
 [38] (2016) Holographic embeddings of knowledge graphs. In AAAI, Cited by: Table 4, §4.
 [39] (2016) Learning convolutional neural networks for graphs. In ICML, Cited by: §5.
 [40] (2018) Recurrent relational networks. In NeurIPS, Cited by: §5.
 [41] (2018) The book of why: the new science of cause and effect. Basic Books. Cited by: §1.
 [42] (2018) Relational recurrent neural networks. In NeurIPS, Cited by: §5.
 [43] (2017) A simple neural network module for relational reasoning. In NIPS, Cited by: §5.
 [44] (2009) The graph neural network model. IEEE Transactions on Neural Networks 20, pp. 61–80. Cited by: §1, §5.
 [45] (2018) M-Walk: learning to walk over graphs using Monte Carlo tree search. In NeurIPS, Cited by: §1, Table 6, Table 2, §4, §4, §5.
 [46] (2018) RotatE: knowledge graph embedding by relational rotation in complex space. CoRR abs/1902.10197. Cited by: §1, Table 4, Table 2, §4, §5.
 [47] (2016) Integrated information theory: from consciousness to its physical substrate. Nature Reviews Neuroscience 17, pp. 450–461. Cited by: §1.
 [48] (2015) Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, Cited by: §4.
 [49] (2016) Compositional learning of embeddings for relation paths in knowledge base and text. In ACL, Cited by: §5.
 [50] (2016) Complex embeddings for simple link prediction. In ICML, Cited by: §1, §4, §5.
 [51] (2017) Attention is all you need. In NIPS, Cited by: §5.
 [52] (2018) Graph attention networks. CoRR abs/1710.10903. Cited by: §1, §5.
 [53] (2018) Knowledge graph reasoning: recent advances. Cited by: §1.
 [54] (2018) Non-local neural networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §5.
 [55] (2014) Knowledge graph embedding by translating on hyperplanes. In AAAI, Cited by: §5.
 [56] (2017) Visual interaction networks: learning a physics simulator from video. In NIPS, Cited by: §5.
 [57] (2017) DeepPath: a reinforcement learning method for knowledge graph reasoning. In EMNLP, Cited by: §1, §4, §4, §4, §5.
 [58] (2018) How powerful are graph neural networks?. ArXiv abs/1810.00826. Cited by: §1.
 [59] (2019) What can neural networks reason about?. ArXiv abs/1905.13211. Cited by: §1.
 [60] (2018) Modeling attention flow on graphs. CoRR abs/1811.00497. Cited by: §5.
 [61] (2015) Embedding entities and relations for learning and inference in knowledge bases. CoRR abs/1412.6575. Cited by: §1, §4, §5.
 [62] (2017) Differentiable learning of logical rules for knowledge base reasoning. In NIPS, Cited by: §1, Table 4, §4, §5.
Appendix
1 Proof
Proposition.
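Given a graph $G$ (undirected or directed in both directions), we assume the probability of the degree of an arbitrary node being less than or equal to $d$ is larger than $p$, i.e., $P(\deg(v) \le d) > p, \forall v \in V$. Considering a sequence of consecutively expanding subgraphs $(G^0, G^1, \ldots, G^T)$, starting with $G^0 = \{v\}$, for all $t \ge 1$, we can ensure
$$P\left(|V_{G^t}| \le \frac{d(d-1)^t - 2}{d-2}\right) > p^{\frac{d(d-1)^{t-1} - 2}{d-2}}. \tag{6}$$
Proof.
We consider the extreme case of greedy consecutive expansion, where $G^t = G^{t-1} \cup \Delta G^t = G^{t-1} \cup \partial G^{t-1}$, since if this case satisfies the inequality, any case of consecutive expansion also satisfies it. By definition, all the subgraphs $G^t$ are connected graphs. Here, we use $\Delta V^t$ to denote $V_{\Delta G^t}$ for short. In the extreme case, we can ensure that the newly added nodes $\Delta V^t$ at step $t$ only belong to the neighborhood of the last added nodes $\Delta V^{t-1}$. Since for $t \ge 2$ each node in $\Delta V^{t-1}$ already has at least one edge within $G^{t-1}$ due to the definition of connected graphs, we have
$$P\left(|\Delta V^t| \le |\Delta V^{t-1}|(d-1)\right) > p^{|\Delta V^{t-1}|}. \tag{7}$$
For $t = 1$, we have $P(|\Delta V^1| \le d) > p$ and thus
$$P\left(|V_{G^1}| \le 1 + d\right) > p. \tag{8}$$
For $t \ge 2$, based on $|V_{G^t}| = 1 + |\Delta V^1| + \ldots + |\Delta V^t|$, we obtain
$$P\left(|V_{G^t}| \le 1 + d + d(d-1) + \ldots + d(d-1)^{t-1}\right) > p^{1 + d + d(d-1) + \ldots + d(d-1)^{t-2}}, \tag{9}$$
which is
$$P\left(|V_{G^t}| \le \frac{d(d-1)^t - 2}{d-2}\right) > p^{\frac{d(d-1)^{t-1} - 2}{d-2}}. \tag{10}$$
We can find that $t = 1$ also satisfies this inequality. ∎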
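The counting argument behind the proof can be checked numerically. The sketch below assumes that every node has degree at most `d` (a name chosen here for illustration) and that greedy expansion adds, at each step, every possible new neighbor of the most recently added nodes. The start node contributes up to `d` nodes, and each later node contributes up to `d - 1`, because one of its edges already lies inside the subgraph. The code compares the step-by-step count with the closed form of the resulting geometric sum, which is valid for `d > 2`.

```python
def worst_case_size(d, t):
    """Largest node count after t greedy expansion steps with max degree d."""
    total, frontier = 1, 1          # start from a single node
    for step in range(1, t + 1):
        # The start node can add d neighbors; later nodes at most d - 1 each.
        frontier = frontier * (d if step == 1 else d - 1)
        total += frontier
    return total

def closed_form(d, t):
    # Closed form of 1 + d + d(d-1) + ... + d(d-1)^(t-1), valid for d > 2.
    return (d * (d - 1) ** t - 2) // (d - 2)

for d in (3, 5, 10):
    for t in (1, 2, 4):
        assert worst_case_size(d, t) == closed_form(d, t)
```

The bound grows exponentially in the number of steps, which is why an unpruned expansion quickly becomes expensive on graphs with high-degree nodes.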
2 Hyperparameter Settings


Hyperparameter  FB15K-237  FB15K  WN18RR  WN18  YAGO3-10  NELL-995
batch_size  80  80  100  100  100  10 
n_dims_att  50  50  50  50  50  200 
n_dims  100  100  100  100  100  200 
max_sampling_per_step (in IGNN)  10000  10000  10000  10000  10000  10000 
max_attending_from_per_step  20  20  20  20  20  100 
max_sampling_per_node (in AGNN)  200  200  200  200  200  1000 
max_attending_to_per_step  200  200  200  200  200  1000 
n_steps_in_IGNN  2  1  2  1  1  1 
n_steps_in_AGNN  6  6  8  8  6  5 
learning_rate  0.001  0.001  0.001  0.001  0.0001  0.001 
optimizer  Adam  Adam  Adam  Adam  Adam  Adam 
grad_clipnorm  1  1  1  1  1  1 
n_epochs  1  1  1  1  1  3 
One-epoch training time (h)  25.7  63.7  4.3  8.5  185.0  0.12

The hyperparameters can be categorized into three groups:


Normal hyperparameters, including batch_size, n_dims_att, n_dims, learning_rate, grad_clipnorm, and n_epochs. We set a smaller dimension, n_dims_att, for computation in the attention module, as it uses more edges than the message passing in AGNN does. Intuitively, it also does not need to propagate high-dimensional messages but only to compute scalar scores over a sampled neighborhood, in concert with the idea in the key-value mechanism [4]. We set n_epochs to 1 in most cases, indicating that our model can be trained well in a single epoch owing to its fast convergence.

The hyperparameters in charge of the sampling-attending horizon. In IGNN, max_sampling_per_step controls the maximum number of edges to sample per step. In AGNN, max_sampling_per_node controls the maximum number of neighbors to sample for each selected node per step per input, max_attending_from_per_step controls the maximum number of selected nodes for attending-from per step per input, and max_attending_to_per_step controls the maximum number of selected nodes in a sampled neighborhood for attending-to per step per input.

The hyperparameters in charge of the searching horizon, including n_steps_in_IGNN representing the number of propagation steps to run standard message passing in IGNN, and n_steps_in_AGNN representing the number of propagation steps to run pruned message passing in AGNN.
Note that we tune these hyperparameters according to not only their performance but also the computation resources available to us. In some cases, to deal with a very large knowledge graph with limited resources, we need to make a trade-off between efficiency and effectiveness. For example, each of NELL-995’s single-query-relation tasks has a small training set, though still with a large graph, so we can reduce the batch size in favor of larger dimensions and a larger sampling-attending horizon without one epoch taking too long to finish.
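To make the role of the sampling-attending caps concrete, the sketch below applies them to one pruning step on a toy graph. The default field values are taken from the FB15K-237 column of the table above. The `Horizon` class, the `prune_step` function, the uniform random sampling, and the top-k selection by score are assumptions made for illustration and may differ from the actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Horizon:
    # Defaults follow the FB15K-237 column of the hyperparameter table.
    max_attending_from_per_step: int = 20
    max_sampling_per_node: int = 200
    max_attending_to_per_step: int = 200

def top_k(scores, k):
    # Keep the k highest-scoring nodes (assumed selection rule).
    return sorted(scores, key=scores.get, reverse=True)[:k]

def prune_step(attention, neighbors, score, horizon, rng):
    """Select attending-from nodes, sample their neighbors, keep top attending-to nodes."""
    sources = top_k(attention, horizon.max_attending_from_per_step)
    candidates = {}
    for node in sources:
        nbrs = neighbors.get(node, [])
        k = min(len(nbrs), horizon.max_sampling_per_node)
        for nbr in rng.sample(nbrs, k):          # cap on neighbors per node
            candidates[nbr] = max(candidates.get(nbr, float("-inf")), score(node, nbr))
    return sources, top_k(candidates, horizon.max_attending_to_per_step)

# Toy example with tiny caps so that the pruning is visible.
horizon = Horizon(max_attending_from_per_step=1, max_sampling_per_node=2,
                  max_attending_to_per_step=1)
attention = {"a": 0.9, "b": 0.1}
neighbors = {"a": ["c", "d"], "b": ["e"]}
sources, targets = prune_step(attention, neighbors,
                              score=lambda u, v: {"c": 0.2, "d": 0.7}.get(v, 0.0),
                              horizon=horizon, rng=random.Random(0))
```

With these toy caps only node `a` is attended from and only its best-scoring neighbor is kept, so the work per step is bounded by the caps rather than by the size of the graph.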
3 More Experimental Results



Metric (%)  FB15K H@1  FB15K H@3  FB15K H@10  FB15K MRR  WN18 H@1  WN18 H@3  WN18 H@10  WN18 MRR
TransE  29.7  57.8  74.9  46.3  11.3  88.8  94.3  49.5
HolE  40.2  61.3  73.9  52.4  93.0  94.5  94.9  93.8
DistMult  54.6  73.3  82.4  65.4  72.8  91.4  93.6  82.2
ComplEx  59.9  75.9  84.0  69.2  93.6  93.6  94.7  94.1
ConvE  55.8  72.3  83.1  65.7  93.5  94.6  95.6  94.3
RotatE  74.6  83.0  88.4  79.7  94.4  95.2  95.9  94.9
ComplEx-N3  -  -  91  86  -  -  96  95
NeuralLP  -  -  83.7  76  -  -  94.5  94
DPMPN  72.6 (.4)  78.4 (.4)  83.4 (.5)  76.4 (.4)  91.6 (.8)  93.6 (.4)  94.9 (.4)  92.8 (.6)
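For reference, H@k and MRR are rank-based metrics: H@k is the fraction of test queries whose true answer is ranked within the top k, and MRR is the mean of the reciprocal ranks. The sketch below computes both as percentages; the ranks are hypothetical and are not taken from any experiment in this paper.

```python
def hits_at_k(ranks, k):
    """Percentage of queries whose true answer is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries, as a percentage."""
    return 100.0 * sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 2, 4, 10]   # hypothetical 1-based ranks of the true answers
h1 = hits_at_k(ranks, 1)
h3 = hits_at_k(ranks, 3)
mrr = mean_reciprocal_rank(ranks)
```

MRR rewards answers ranked near the top more smoothly than H@k, which only counts whether the answer falls inside the cutoff.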




YAGO3-10
Metric (%)  H@1  H@3  H@10  MRR
DistMult  24  38  54  34
ComplEx  26  40  55  36
ConvE  35  49  62  44
ComplEx-N3  -  -  71  58
DPMPN  48.4  59.5  67.9  55.3




Tasks  NeuCFlow  M-Walk  MINERVA  DeepPath  TransE  TransR
AthletePlaysForTeam  83.9 (0.5)  84.7 (1.3)  82.7 (0.8)  72.1 (1.2)  62.7  67.3
AthletePlaysInLeague  97.5 (0.1)  97.8 (0.2)  95.2 (0.8)  92.7 (5.3)  77.3  91.2
AthleteHomeStadium  93.6 (0.1)  91.9 (0.1)  92.8 (0.1)  84.6 (0.8)  71.8  72.2
AthletePlaysSport  98.6 (0.0)  98.3 (0.1)  98.6 (0.1)  91.7 (4.1)  87.6  96.3
TeamPlaysSport  90.4 (0.4)  88.4 (1.8)  87.5 (0.5)  69.6 (6.7)  76.1  81.4
OrgHeadQuarteredInCity  94.7 (0.3)  95.0 (0.7)  94.5 (0.3)  79.0 (0.0)  62.0  65.7
WorksFor  86.8 (0.0)  84.2 (0.6)  82.7 (0.5)  69.9 (0.3)  67.7  69.2
PersonBornInLocation  84.1 (0.5)  81.2 (0.0)  78.2 (0.0)  75.5 (0.5)  71.2  81.2
PersonLeadsOrg  88.4 (0.1)  88.8 (0.5)  83.0 (2.6)  79.0 (1.0)  75.1  77.2
OrgHiredPerson  84.7 (0.8)  88.8 (0.6)  87.0 (0.3)  73.8 (1.9)  71.9  73.7
AgentBelongsToOrg  89.3 (1.2)  -  -  -  -  -
TeamPlaysInLeague  97.2 (0.3)  -  -  -  -  -

4 More Visualization Results
4.1 Case study on the AthletePlaysForTeam task
In the case shown in Figure 9, the query is (concept_personnorthamerica_michael_turner, concept:athleteplaysforteam, ?) and a true answer is concept_sportsteam_falcons. From Figure 9, we can see that our model learns that (concept_personnorthamerica_michael_turner, concept:athletehomestadium, concept_stadiumoreventvenue_georgia_dome) and (concept_stadiumoreventvenue_georgia_dome, concept:teamhomestadium_inv, concept_sportsteam_falcons) are two important facts supporting the answer concept_sportsteam_falcons. Besides, other facts, such as (concept_athlete_joey_harrington, concept:athletehomestadium, concept_stadiumoreventvenue_georgia_dome) and (concept_athlete_joey_harrington, concept:athleteplaysforteam, concept_sportsteam_falcons), provide a vivid example that a person or an athlete with concept_stadiumoreventvenue_georgia_dome as his or her home stadium might play for the team concept_sportsteam_falcons. There is more than one such example, including concept_athlete_roddy_white and concept_athlete_quarterback_matt_ryan. The entity concept_sportsleague_nfl cannot help us differentiate the true answer from other NFL teams, but it can at least exclude non-NFL teams. In short, our subgraph-structured representation captures the relational and compositional reasoning pattern well.
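The two supporting facts in this case form a two-hop path from the query entity to the answer. The sketch below stores the triples quoted above in a small adjacency map and enumerates two-hop paths from the head entity. It only illustrates the path structure and is not the model's inference procedure; the helper names are hypothetical.

```python
from collections import defaultdict

TRIPLES = [
    ("concept_personnorthamerica_michael_turner", "concept:athletehomestadium",
     "concept_stadiumoreventvenue_georgia_dome"),
    ("concept_stadiumoreventvenue_georgia_dome", "concept:teamhomestadium_inv",
     "concept_sportsteam_falcons"),
    ("concept_athlete_joey_harrington", "concept:athletehomestadium",
     "concept_stadiumoreventvenue_georgia_dome"),
    ("concept_athlete_joey_harrington", "concept:athleteplaysforteam",
     "concept_sportsteam_falcons"),
]

def build_index(triples):
    # Map each head entity to its outgoing (relation, tail) pairs.
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((relation, tail))
    return index

def two_hop_paths(index, start):
    """Enumerate (relation1, middle, relation2, end) paths starting at `start`."""
    return [(r1, mid, r2, end)
            for r1, mid in index.get(start, [])
            for r2, end in index.get(mid, [])]

paths = two_hop_paths(build_index(TRIPLES), "concept_personnorthamerica_michael_turner")
```

On these four triples the only two-hop path from the query entity passes through the home stadium and ends at the team, matching the explanation read off the figure.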
For the AthletePlaysForTeam task
4.2 More results
For the AthletePlaysInLeague task
For the AthleteHomeStadium task
For the AthletePlaysSport task
For the TeamPlaysSport task
For the OrganizationHeadQuarteredInCity task
For the WorksFor task