Direct Multi-hop Attention based Graph Neural Networks



Introducing self-attention mechanism in graph neural networks (GNNs) achieved state-of-the-art performance for graph representation learning. However, at every layer, attention is only computed between two connected nodes and depends solely on the representation of both nodes. This attention computation cannot account for the multi-hop neighbors which supply graph structure context information and have influence on the node representation learning as well. In this paper we propose Direct multi-hop Attention based Graph neural Network (DAGN) for graph representation learning, a principled way to incorporate multi-hop neighboring context into attention computation, enabling long-range interactions at every layer. To compute attention between nodes that are multiple hops away in a single layer, DAGN diffuses the attention scores from neighboring nodes to non-neighboring nodes, increasing the receptive field for every message passing layer. Unlike previous methods, DAGN uses a diffusion prior on attention values, to efficiently account for all paths between the pair of nodes when computing attention weights. This helps DAGN capture large-scale structural information in every layer, and learn more informative attention distribution. Experimental results on standard node classification as well as the knowledge graph completion benchmarks show that DAGN achieves state-of-the-art results: DAGN achieves up to relative error reduction over the previous state-of-the-art on Cora, Citeseer, and Pubmed. DAGN also obtains the best performance on a large-scale Open Graph Benchmark dataset. On knowledge graph completion DAGN advances state-of-the-art on WN18RR and FB15k-237 across four different performance metrics.

1 Introduction

The introduction of the self-attention mechanism (Bahdanau et al., 2015), especially the Transformer architecture (Vaswani et al., 2017), has pushed the state-of-the-art in many natural language processing tasks (Radford et al., 2019; Devlin et al., 2019; Liu et al., 2019a; Lan et al., 2019). Graph Attention Network (GAT) (Veličković et al., 2018) and related models (Li et al., 2018; Wang et al., 2019a; Liu et al., 2019b; Oono & Suzuki, 2020) apply the attention mechanism to graph neural networks. They compute attention scores only between nodes that are directly connected by an edge, allowing the model to weight the messages on edges according to their attention scores.

However, such attention computation on pairs of nodes connected by edges implies that a node can only attend to its immediate neighbors to compute its next layer representation. The receptive field of a single message passing layer is therefore restricted to the one-hop graph structure. Although stacking multiple GAT layers can enlarge the receptive field to multi-hop neighbors and learn non-neighboring interactions, these deep GATs usually suffer from the oversmoothing problem (Wang et al., 2019a; Liu et al., 2019b; Oono & Suzuki, 2020). Furthermore, edge attention weights in a single GAT layer are based solely on the node representations themselves, and do not depend on the neighborhood context of the graph structure. In short, the one-hop attention mechanism in GATs limits their ability to explore the correlation between graph structure information and attention weights. Previous works (Xu et al., 2018; Klicpera et al., 2019b) have shown advantages in performing multi-hop message-passing in a single layer, which indicates that exploring graph structure information in a single layer is beneficial. However, these approaches are not graph-attention based. Therefore, incorporating multi-hop neighboring context into the attention computation in graph neural networks has not been explored.

Here we present Direct Multi-hop Attention based Graph Neural Network (DAGN), an effective and efficient multi-hop self-attention computation for relational graph data via a novel graph attention diffusion layer (Figure 1). We achieve this by first computing attention weights on edges (represented by solid arrows), and then computing other self-attention weights (dotted arrows) through an attention diffusion process using the attention weights on edges.

Our model has two main advantages. 1) DAGN captures long-range interactions between nodes multiple hops away at every message-passing layer. Thus the model enables effective long-range message passing, from important nodes multiple hops away. 2) The attention computation in DAGN is context-dependent. The attention value in GATs only depends on node representations of the previous layer between connected nodes, and is 0 between unconnected nodes. In contrast, for any pair of reachable nodes1 within chosen multi-hop neighborhood, DAGN computes attention by aggregating the attention scores on all the possible paths (length ) between the pair of nodes. In addition, inspired from the transformer architecture, DAGN also demonstrates that the use of layer normalization and feed-forward layers further boosts the performance.

Theoretically we demonstrate that DAGN places a Personalized PageRank (PPR) prior on the attention values, based on the graph structure. We also use spectral graph analysis to show that DAGN is capable of emphasizing large-scale graph structure and lowering high-frequency noise in graphs. Specifically, DAGN enlarges the lower Laplacian eigenvalues, which correspond to the large-scale structure in the graph, and suppresses the higher Laplacian eigenvalues, which correspond to more noisy and fine-grained information in the graph.

Figure 1: Multi-hop Diffused Attention. Left: a single GAT layer only computes one-hop neighbor based attention and thus , and the attention between and only depends on their node representations (see ); Right: DAGN is able to 1) capture the information of two-hop neighbor to via the diffused attention , and 2) enhance graph structure learning by considering all paths between nodes via diffused attention (see over two paths: “” and “”) based on powers of graph adjacency matrix in a single layer.

We perform experiments on standard datasets for semi-supervised node classification as well as the knowledge graph completion. Experiments show that DAGN achieves state-of-the-art results: DAGN achieves up to relative error reduction over previous state-of-the-art on Cora, Citeseer, and Pubmed. DAGN also obtains better performance on a large-scale Open Graph Benchmark dataset. On knowledge graph completion DAGN advances state-of-the-art on WN18RR and FB15k-237 across four metrics, with the largest gain of 7.1% in the metric of Hit at 1.

Furthermore, our ablation study reveals the synergistic effect of the essential components of DAGN, including layer normalization and multi-hop diffused attention. We show that DAGN benefits from an increase in model depth, while the performance of baselines plateaus at much smaller depths. We further observe that, compared to GAT, the attention values learned by DAGN have higher diversity, indicating the ability to better pay attention to important nodes.

2 Direct Multi-hop Attention based Graph Neural Network

We first discuss the background and then explain Direct Multi-hop Attention based Graph Neural Network’s new attention diffusion module and its overall model architecture.

2.1 Preliminaries

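Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a given graph, where $\mathcal{V}$ is the set of nodes, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges connecting pairs of nodes in $\mathcal{V}$. Each node $v \in \mathcal{V}$ and each edge $e \in \mathcal{E}$ are associated with their type mapping functions: $\phi: \mathcal{V} \to \mathcal{T}$ and $\psi: \mathcal{E} \to \mathcal{R}$. Here $\mathcal{T}$ and $\mathcal{R}$ denote the sets of node types (labels) and edge types. Our framework supports learning on heterogeneous graphs with multiple elements in $\mathcal{R}$.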

Figure 2: DAGN Architecture. Each DAGN block consists of attention computation, attention diffusion, layer normalization, feed forward layers, and 2 residual connections for each block. DAGN blocks can be stacked to constitute a deep model. As illustrated on the right, context-dependent attention is achieved via the attention diffusion process. Here are nodes in the graph.

A general Graph neural Network (GNN) approach learns an embedding that maps nodes and/or edge types into a continuous vector space. Let and be the node embedding and edge type embedding, where , , and represent the embedding dimension of node and edge types, each row represents the embedding of node (), and represents the embedding of relation ().

DAGN builds on GNNs, while bringing together the benefits of Graph Attention and Diffusion techniques. The core of DAGN is Multi-hop Attention Diffusion, a principled way to learn attention between any pair of nodes in a scalable way, taking into account the graph structure and enabling context-dependent attention.

The key challenge here is how to allow for flexible but scalable context-dependent multi-hop attention, where any node can influence embedding of any other node in the same layer (even if they are far away in the underlying network). Simply learning attention scores over all node pairs is infeasible and would lead to overfitting and poor generalization.

2.2 Multi-hop Attention Diffusion

We first introduce attention diffusion, which operates on DAGN's attention scores at each layer. The input to the attention diffusion operator is a set of triples $(v_i, r_k, v_j)$, where $v_i, v_j$ are nodes and $r_k$ is the edge type. DAGN first computes the attention scores on all edges. The attention diffusion module then computes the attention values between pairs of nodes that are not directly connected by an edge, based on the edge attention scores, via a diffusion process. The attention diffusion module can then be used as a component in the DAGN architecture, which we will further elaborate in Section 2.3.

Edge Attention Computation. At each layer , a vector message is computed for each triple . To compute representation of at layer , all messages from triples incident to are aggregated into a single message, which is then used to update .

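In the first stage, the attention score $s^{(l)}_{i,k,j}$ of an edge $(v_i, r_k, v_j)$ is computed by the following:

$$ s^{(l)}_{i,k,j} = \delta\left( v_a^{(l)} \tanh\left( W_h^{(l)} h_i^{(l)} \,\|\, W_t^{(l)} h_j^{(l)} \,\|\, W_r^{(l)} r_k \right) \right), \tag{1} $$

where $\delta = \mathrm{LeakyReLU}$, $W_h^{(l)}, W_t^{(l)} \in \mathbb{R}^{d^{(l)} \times d^{(l)}}$, $W_r^{(l)} \in \mathbb{R}^{d^{(l)} \times d_r}$ and $v_a^{(l)} \in \mathbb{R}^{1 \times 3d^{(l)}}$ are the trainable weights shared by the $l$-th layer. $h_i^{(l)} \in \mathbb{R}^{d^{(l)}}$ represents the embedding of node $i$ at the $l$-th layer, and $h_i^{(0)} = x_i$. $r_k$ is the trainable embedding of the $k$-th relation type and $a \| b$ denotes concatenation of embedding vectors $a$ and $b$. For graphs with no relation type, we treat the relation type as a degenerate categorical distribution with 1 category.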

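Applying Eq. 1 on each edge of the graph $\mathcal{G}$, we obtain an attention score matrix $S^{(l)}$:

$$ S^{(l)}_{i,j} = \begin{cases} s^{(l)}_{i,k,j}, & \text{if } (v_i, r_k, v_j) \text{ appears in } \mathcal{G}, \\ -\infty, & \text{otherwise,} \end{cases} \tag{2} $$

where $s^{(l)}_{i,k,j}$ denotes the Eq. 1 score of the edge $(v_i, r_k, v_j)$. Subsequently we obtain the attention matrix $A^{(l)}$ by performing row-wise softmax over the score matrix $S^{(l)}$: $A^{(l)} = \mathrm{softmax}(S^{(l)})$. $A^{(l)}_{i,j}$ denotes the attention value at layer $l$ when aggregating the message from node $j$ to node $i$.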
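The masking step can be illustrated with a minimal NumPy sketch. It assumes the raw edge scores are already available (how they are produced is outside this sketch), that every node has at least one incoming edge such as a self-loop, and that the function and variable names are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def edge_softmax(num_nodes, edges, scores):
    """Row-wise softmax over edge scores; non-edges are set to -inf so they get weight 0.

    edges:  list of (i, j) pairs, meaning node i aggregates a message from node j.
    scores: one raw attention score per edge (assumed to be given).
    Assumes every row has at least one edge, otherwise the row would be all -inf.
    """
    S = np.full((num_nodes, num_nodes), -np.inf)
    for (i, j), s in zip(edges, scores):
        S[i, j] = s
    # Subtract the row maximum for numerical stability; exp(-inf) evaluates to 0.
    row_max = S.max(axis=1, keepdims=True)
    E = np.exp(S - row_max)
    return E / E.sum(axis=1, keepdims=True)

# Toy path graph 0 - 1 - 2 with self-loops (hypothetical scores).
edges = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
scores = [0.0, 1.0, 0.5, 0.0, -0.5, 2.0, 0.0]
A = edge_softmax(3, edges, scores)
```

In this toy graph nodes 0 and 2 share no edge, so their one-hop attention is exactly zero, which is the limitation the diffusion step addresses.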

Attention Diffusion for Multi-hop Neighbors. In the second stage, we further enable attention between nodes that are not directly connected in the network. We achieve this via the following attention diffusion procedure. The procedure computes the attention scores of multi-hop neighbors via graph diffusion based on the powers of the 1-hop attention matrix $A$:

$$ \mathcal{A} = \sum_{i=0}^{\infty} \theta_i A^i, \qquad \text{where } \sum_{i=0}^{\infty} \theta_i = 1 \text{ and } \theta_i > 0, \tag{3} $$

where $\theta_i$ is the attention decay factor and $\theta_i > \theta_{i+1}$. The $i$-th power of the attention matrix, $A^i$, accumulates the products of edge attention values over all relation paths of length $i$ between two nodes, so the sum increases the receptive field of the attention (Figure 1). Importantly, the mechanism allows the attention between two nodes to depend not only on their previous layer representations, but also on the paths between the nodes, effectively creating attention shortcuts between nodes that are not connected (Figure 1). Attention through each path is also weighted differently, depending on $\theta$ and the path length.

In our implementation we utilize the geometric distribution $\theta_i = \alpha(1-\alpha)^i$, where $\alpha \in (0, 1]$. The choice is based on the inductive bias that nodes further away should be weighted less in message aggregation, and nodes with different relation path lengths to the target node are sequentially weighted in an independent manner. In addition, notice that if we define $\theta_0 = \alpha \in (0, 1]$, $A^0 = I$, then Eq. 3 gives the Personalized PageRank (PPR) procedure on the graph with the attention matrix $A$ and teleport probability $\alpha$. Hence the diffused attention weights, $\mathcal{A}_{i,j}$, can be seen as the influence of node $j$ on node $i$. We further elaborate the significance of this observation in Section 4.3.
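As an illustration, the geometric diffusion can be sketched in NumPy. Here `A` stands for a row-stochastic one-hop attention matrix and `alpha` for the teleport probability. The function name, the toy matrix and the truncation at `K` terms are assumptions made for this sketch. The closed form used for comparison is the standard sum of a matrix geometric series.

```python
import numpy as np

def diffuse_attention(A, alpha, K):
    """Truncated attention diffusion: sum_{i=0}^{K} alpha * (1 - alpha)**i * A**i."""
    n = A.shape[0]
    power = np.eye(n)                      # A^0 = I
    out = np.zeros_like(A, dtype=float)
    for i in range(K + 1):
        out += alpha * (1.0 - alpha) ** i * power
        power = power @ A                  # next power of A
    return out

# Toy row-stochastic attention matrix on a 3-node path graph (hypothetical values).
A = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
alpha = 0.15
approx = diffuse_attention(A, alpha, K=200)
# Closed form of the infinite series, valid because (1 - alpha) * A has spectral radius < 1.
exact = alpha * np.linalg.inv(np.eye(3) - (1.0 - alpha) * A)
```

Nodes 0 and 2 are not adjacent (`A[0, 2]` is zero), yet the diffused matrix assigns them a positive attention weight through the two-hop path via node 1.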

We can also view $\mathcal{A}_{i,j}$ as the attention value of node $j$ to $i$ since $\sum_{j} \mathcal{A}_{i,j} = 1$. We then define the graph attention diffusion based feature aggregation as

$$ \mathrm{AttDiffusion}(\mathcal{G}, H^{(l)}, \Theta) = \mathcal{A} H^{(l)}, \tag{4} $$

where $\Theta$ is the set of parameters for computing attention. Thanks to the diffusion process defined in Eq. 3, DAGN uses the same number of parameters as if we were only computing attention between nodes connected via edges. This ensures runtime efficiency as well as good model generalization.
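The claim that each row of the diffused attention matrix sums to one can be checked in a single line. This assumes the geometric weights with teleport probability $\alpha \in (0, 1]$ and a row-stochastic one-hop attention matrix $A$, so that $A\mathbf{1} = \mathbf{1}$ for the all-ones vector $\mathbf{1}$:

```latex
\mathcal{A}\mathbf{1}
  = \sum_{i=0}^{\infty} \alpha (1-\alpha)^i A^i \mathbf{1}
  = \alpha \sum_{i=0}^{\infty} (1-\alpha)^i \mathbf{1}
  = \frac{\alpha}{1-(1-\alpha)} \mathbf{1}
  = \mathbf{1}.
```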

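Approximate Computation for Attention Diffusion. For large graphs computing the exact attention diffusion matrix $\mathcal{A}$ using Eq. 3 may be prohibitively expensive, due to computing the powers of the attention matrix (Klicpera et al., 2019a). To resolve this bottleneck, we proceed as follows: Let $H^{(l)}$ be the input entity embedding of the $l$-th layer ($H^{(0)} = X$) and $\alpha \in (0, 1)$. Since DAGN only requires aggregation via $\mathcal{A}H^{(l)}$, we can approximate $\mathcal{A}H^{(l)}$ by defining a sequence $Z^{(K)}$ which converges to the true value of $\mathcal{A}H^{(l)}$ (Proposition 1) as $K \to \infty$:

$$ Z^{(0)} = H^{(l)}, \qquad Z^{(k+1)} = (1-\alpha) A Z^{(k)} + \alpha Z^{(0)}, \quad k = 0, 1, \dots, K-1. \tag{5} $$

Proposition 1. $\lim_{K \to \infty} Z^{(K)} = \mathcal{A}H^{(l)}$.

In the Appendix we give the proof which relies on the expansion of Eq. 5.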
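The iteration can be sketched as follows, assuming a row-stochastic attention matrix `A`, node features `H`, and teleport probability `alpha`. The names and the toy inputs are hypothetical. A dense matrix is used only for brevity, and a sparse `A` would be the natural choice on large graphs.

```python
import numpy as np

def approx_diffusion(A, H, alpha, K):
    """K steps of Z <- (1 - alpha) * A @ Z + alpha * Z0, starting from Z0 = H."""
    Z0 = H
    Z = H
    for _ in range(K):
        Z = (1.0 - alpha) * (A @ Z) + alpha * Z0
    return Z

A = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
alpha = 0.15
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))

# Reference value from the closed form alpha * (I - (1 - alpha) A)^{-1} H.
exact = alpha * np.linalg.inv(np.eye(3) - (1.0 - alpha) * A) @ H

def max_error(K):
    """Largest absolute deviation of the K-step iterate from the closed form."""
    return np.abs(approx_diffusion(A, H, alpha, K) - exact).max()

err_3, err_50, err_200 = max_error(3), max_error(50), max_error(200)
```

Each step needs only one product of `A` with a feature matrix, so the powers of `A` are never formed explicitly, and the error shrinks geometrically with the number of steps.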

Using the above approximation, the complexity of attention computation with diffusion is still , with a constant factor corresponding to the number of hops . In practice, we find that choosing the values of such that results in good model performance. Many real-world graphs exhibit small-world property, in which case even a smaller value of is sufficient. For graphs with larger diameter, we choose larger , and lower the value of .

2.3 Direct Multi-hop Attention based GNN Architecture

Figure 2 provides an architecture overview of the DAGN Block that can be stacked multiple times.

Multi-head Graph Attention Diffusion Layer. Multi-head attention (Vaswani et al., 2017; Veličković et al., 2018) is used to allow the model to jointly attend to information from different representation sub-spaces at different viewpoints. In Eq. 6, the attention diffusion for each head is computed separately with Eq. 4, and aggregated:


where denotes concatenation and are the parameters in Eq. 1 for the -th head ().

Deep Aggregation. Moreover our DAGN block contains a fully connected feed-forward sub-layer, which consists of a two-layer feedforward network. We also add the layer normalization and residual connection in both sub-layers, allowing for a more expressive aggregation step for each block:


DAGN generalizes GAT. DAGN extends GAT via the diffusion process. The feature aggregation in GAT is $H^{(l+1)} = \sigma(A H^{(l)} W^{(l)})$, where $\sigma$ represents the activation function. We can divide the GAT layer into two components as follows:

$$ H^{(l+1)} = \underbrace{\sigma}_{(2)} \big( \underbrace{A H^{(l)}}_{(1)} W^{(l)} \big). \tag{8} $$

In component (1), DAGN removes the restriction of attending to direct neighbors, without requiring additional parameters as $\mathcal{A}$ is induced from $A$. For component (2) DAGN uses layer normalization and deep aggregation, which achieves significant gains according to the ablation studies in Table 1.

3 Attention Diffusion Analysis

In this section, we investigate the benefits of DAGN from the viewpoint of discrete signal processing on graphs. Our first result demonstrates that DAGN can better capture large-scale structural information. Our second result explores the relation between DAGN and Personalized PageRank (PPR).

3.1 Spectral Properties of Graph Attention Diffusion

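We view the attention matrix $A$ of GAT, and $\mathcal{A}$ of DAGN as weighted adjacency matrices, and apply Graph Fourier transform and spectral analysis (details in Appendix) to show the effect of DAGN as a graph low-pass filter, being able to more effectively capture large-scale structure in graphs. By Eq. 3, the sum of each row of either $A$ or $\mathcal{A}$ is 1. Hence the normalized graph Laplacians are $\hat{L}_{sym} = I - \mathcal{A}$ and $L_{sym} = I - A$ for $\mathcal{A}$ and $A$ respectively. We can get the following proposition:

Proposition 2.

Let $\hat{\lambda}_i^g$ and $\lambda_i^g$ be the $i$-th eigenvalues of $\hat{L}_{sym}$ and $L_{sym}$. Then

$$ \frac{\hat{\lambda}_i^g}{\lambda_i^g} = \frac{1 - \frac{\alpha}{1-(1-\alpha)(1-\lambda_i^g)}}{\lambda_i^g} = \frac{1}{\frac{\alpha}{1-\alpha} + \lambda_i^g}. \tag{9} $$

Refer to Appendix for the proof. We additionally have $\lambda_i^g \in [0, 2]$ (proved by Ng et al., 2002). Eq. 9 shows that when $\lambda_i^g$ is small such that $\frac{\alpha}{1-\alpha} + \lambda_i^g < 1$, then $\hat{\lambda}_i^g > \lambda_i^g$, whereas for large $\lambda_i^g$, $\hat{\lambda}_i^g < \lambda_i^g$. This relation indicates that the use of $\mathcal{A}$ increases smaller eigenvalues and decreases larger eigenvalues. See Section 4.3 for its empirical evidence. The low-pass effect increases with smaller $\alpha$.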
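The eigenvalue relation can be checked numerically on a toy example. The sketch below assumes a symmetric row-stochastic attention matrix (chosen so that eigenvalues are real and `numpy.linalg.eigvalsh` applies) and uses the closed-form diffused matrix. The specific matrix and variable names are hypothetical.

```python
import numpy as np

alpha = 0.15
# Symmetric, row-stochastic toy attention matrix (an assumption for real eigenvalues).
A = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
I = np.eye(3)

# Diffused attention in closed form, symmetrized to remove floating-point asymmetry.
A_diff = alpha * np.linalg.inv(I - (1.0 - alpha) * A)
A_diff = (A_diff + A_diff.T) / 2.0

lam = np.linalg.eigvalsh(I - A)            # Laplacian eigenvalues of A, ascending
lam_hat = np.linalg.eigvalsh(I - A_diff)   # Laplacian eigenvalues of the diffused matrix

# Predicted mapping lam -> (1 - alpha) * lam / (alpha + (1 - alpha) * lam); it is
# monotone increasing for lam >= 0, so the sorted orders correspond.
predicted = (1.0 - alpha) * lam / (alpha + (1.0 - alpha) * lam)

# Ratio for the non-zero eigenvalues, to compare with 1 / (alpha / (1 - alpha) + lam).
ratio = lam_hat[1:] / lam[1:]
```

In this example the smaller non-zero eigenvalue is amplified (ratio above 1) and the larger one is suppressed (ratio below 1), matching the low-pass reading of the proposition.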

The eigenvalues of the low-frequency signals describe the large-scale structure in the graph (Ng et al., 2002) and have been shown to be crucial in graph tasks (Klicpera et al., 2019b). As $\lambda_i^g \in [0, 2]$ (Ng et al., 2002) and $\alpha \in (0, 1)$, the reciprocal format in Eq. 9 will amplify the ratio of lower eigenvalues to the sum of all eigenvalues. In contrast, high eigenvalues corresponding to noise are suppressed.

3.2 Personalized PageRank Meets Graph Attention Diffusion

We can also view the attention matrix $A$ as a random walk matrix on graph $\mathcal{G}$ since $\sum_{j} A_{i,j} = 1$ and $A_{i,j} \ge 0$. If we perform Personalized PageRank (PPR) with parameter $\alpha \in (0, 1]$ on $\mathcal{G}$ with transition matrix $A$, the fully Personalized PageRank (Lofgren, 2015) is defined as:

$$ A_{ppr} = \alpha \left( I - (1-\alpha) A \right)^{-1}. \tag{10} $$

Using the power series expansion for the matrix inverse, we obtain

$$ A_{ppr} = \alpha \sum_{i=0}^{\infty} (1-\alpha)^i A^i. \tag{11} $$

Comparing to the diffusion Equation 3 with $\theta_i = \alpha(1-\alpha)^i$, we have the following proposition.

Proposition 3.

Graph attention diffusion defines a personalized PageRank with parameter $\alpha \in (0, 1]$ on $\mathcal{G}$ with transition matrix $A$, i.e., $\mathcal{A} = A_{ppr}$.

The parameter $\alpha$ in DAGN is equivalent to the teleport probability of PPR. PPR provides a good relevance score between nodes in a weighted graph (the weights from the attention matrix $A$). In summary, DAGN places a PPR prior over node pairwise attention scores: the diffused attention between node $i$ and $j$ depends on the attention scores on the edges of all paths between $i$ and $j$.

4 Experiments

We evaluate DAGN on two classical tasks. (1) On node classification we achieve an average of relative error reduction; (2) On knowledge graph completion we achieve relative improvement in the Hit at 1 metric. We compare with numbers reported by baseline papers when available.

4.1 Task 1: Node Classification

Datasets. We employ four benchmark datasets for node classification: (1) the standard citation network benchmarks Cora, Citeseer and Pubmed (Sen et al., 2008; Kipf & Welling, 2016); and (2) the benchmark dataset ogbn-arxiv with 170k nodes and 1.2M edges from the Open Graph Benchmark (Hu et al., 2020). We follow the standard data splits for all datasets. Further information about these datasets is summarized in the Appendix.

Baselines. We compare against a comprehensive suite of state-of-the-art GNN methods including: GCNs (Kipf & Welling, 2016), Chebyshev filter based GCNs (Defferrard et al., 2016), DualGCN (Zhuang & Ma, 2018), JKNet (Xu et al., 2018), LGCN (Gao et al., 2018), Diffusion-GCN (Klicpera et al., 2019b), Graph U-Nets (g-U-Nets) (Gao & Ji, 2019), and GAT (Veličković et al., 2018).

Experimental Setup. For datasets Cora, Citeseer and Pubmed, we use 6 DAGN blocks with hidden dimension 512 and 8 attention heads. For the large-scale ogbn-arxiv dataset, we use 2 DAGN blocks with hidden dimension 128 and 8 attention heads. Refer to Appendix for detailed description of all hyper-parameters and evaluation settings.

| Models | Cora | Citeseer | Pubmed |
|---|---|---|---|
| GCN (Kipf & Welling, 2016) | 81.5 | 70.3 | 79.0 |
| Chebyshev (Defferrard et al., 2016) | 81.2 | 69.8 | 74.4 |
| DualGCN (Zhuang & Ma, 2018) | 83.5 | 72.6 | 80.0 |
| JKNet (Xu et al., 2018) | 81.1 | 69.8 | 78.1 |
| LGCN (Gao et al., 2018) | 83.3 ± 0.5 | 73.0 ± 0.6 | 79.5 ± 0.2 |
| Diffusion-GCN (Klicpera et al., 2019b) | 83.6 ± 0.2 | 73.4 ± 0.3 | 79.6 ± 0.4 |
| g-U-Nets (Gao & Ji, 2019) | 84.4 ± 0.6 | 73.2 ± 0.5 | 79.6 ± 0.2 |
| GAT (Veličković et al., 2018) | 83.0 ± 0.7 | 72.5 ± 0.7 | 79.0 ± 0.3 |
| No LayerNorm | 83.8 ± 0.6 | 71.1 ± 0.5 | 79.8 ± 0.2 |
| No Diffusion | 83.0 ± 0.4 | 71.6 ± 0.4 | 79.3 ± 0.3 |
| No Feed-Forward | 84.9 ± 0.4 | 72.2 ± 0.3 | 80.9 ± 0.3 |
| DAGN | 85.4 ± 0.6 | 73.7 ± 0.5 | 81.4 ± 0.2 |

Table 1: Node classification accuracy on Cora, Citeseer, Pubmed. The upper rows are baselines; the last four rows are DAGN and its ablations. DAGN achieves state-of-the-art.
| Data | GCN (Kipf & Welling, 2016) | GraphSAGE (Hamilton et al., 2017) | Node2vec (Grover & Leskovec, 2016) | MLP | DAGN |
|---|---|---|---|---|---|
| ogbn-arxiv | 71.74 ± 0.29 | 71.49 ± 0.27 | 70.07 ± 0.13 | 55.50 ± 0.23 | 72.76 ± 0.14 |

Table 2: Node classification accuracy on the OGB Arxiv dataset.

Results. We report node classification accuracies on the benchmarks. Results are summarized in Tables 1 and 2. DAGN improves over all methods and achieves the new state-of-the-art on all datasets.
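For readers who want to relate accuracies to the relative error reduction quoted in the text, the sketch below uses the usual definition (the fraction of the baseline's error that is removed). The paper does not spell out its formula, so this definition is an assumption, and the function name is hypothetical. The Cora accuracies are the g-U-Nets and DAGN entries of Table 1.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline error removed; both accuracies are given in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err

# Cora: strongest baseline in Table 1 (g-U-Nets, 84.4) versus DAGN (85.4).
cora_reduction = relative_error_reduction(84.4, 85.4)
```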

Ablation study. We report (Table 1) the model performance after removing each component of DAGN (layer normalization, attention diffusion and deep aggregation feed-forward layers) from every layer of DAGN. Note that the model is equivalent to GAT without these three components. We observe that both diffusion and layer normalization play a crucial role in improving the node classification performance on all datasets. While layer normalization alone does not benefit GNNs, its use in conjunction with the attention diffusion module significantly boosts DAGN's performance. Since DAGN computes many attention values, layer normalization is crucial in ensuring training stability.

4.2 Task 2: Knowledge Graph Completion

Datasets. We evaluate DAGN on standard benchmark knowledge graphs: WN18RR (Dettmers et al., 2018) and FB15K-237 (Toutanova & Chen, 2015). Refer to Appendix for statistics of these knowledge graphs.

Baselines. We compare DAGN with state-of-the-art baselines, including (1) translational distance based KG embedding models: TransE (Bordes et al., 2013) and its latest extension RotatE (Sun et al., 2019) and OTE (Tang et al., 2020), and ROTH (Chami et al., 2020); (2) semantic matching based models: DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), QuatE (Zhang et al., 2019) and TuckER (Balazevic et al., 2019); (3) GNN-based models: R-GCN (Schlichtkrull et al., 2018), SACN (Shang et al., 2019) and A2N (Bansal et al., 2019).

Training procedure. We use the standard training procedure used in previous KG embedding models (Balazevic et al., 2019; Dettmers et al., 2018) (Appendix for details). We follow an encoder-decoder framework: The encoder applies the proposed DAGN model to compute the entity embeddings. The decoder then makes link prediction given the embeddings, and existing decoders in prior models can be applied. To show the power of DAGN, we employ the DistMult decoder (Yang et al., 2015), a simple decoder without extra parameters.
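For reference, the DistMult decoder scores a triple with a tri-linear dot product of the head, relation and tail embeddings. The sketch below is a minimal NumPy version under the assumption that entity embeddings come from some encoder. The array values and helper names are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def distmult_score(h, r, t):
    """DistMult score: sum over dimensions of h * r * t (broadcasts over candidates)."""
    return np.sum(h * r * t, axis=-1)

def rank_tail_candidates(entity_emb, head_idx, r_vec):
    """Return entity indices sorted from most to least plausible tail for (head, r, ?)."""
    scores = distmult_score(entity_emb[head_idx], r_vec, entity_emb)
    return np.argsort(-scores)

# Toy embeddings standing in for encoder output (hypothetical values).
entity_emb = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [2.0, 1.0]])
r_vec = np.array([2.0, 0.5])
order = rank_tail_candidates(entity_emb, head_idx=0, r_vec=r_vec)
```

Because the score is symmetric in head and tail, DistMult on its own cannot distinguish the direction of a relation, so in this setup any asymmetry has to come from the encoder's embeddings.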

| Models | WN18RR MR | WN18RR MRR | WN18RR H@1 | WN18RR H@3 | WN18RR H@10 | FB15k-237 MR | FB15k-237 MRR | FB15k-237 H@1 | FB15k-237 H@3 | FB15k-237 H@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| TransE (Bordes et al., 2013) | 3384 | .226 | - | - | .501 | 357 | .294 | - | - | .465 |
| RotatE (Sun et al., 2019) | 3340 | .476 | .428 | .492 | .571 | 177 | .338 | .241 | .375 | .533 |
| OTE (Tang et al., 2020) | - | .491 | .442 | .511 | .583 | - | .361 | .267 | .396 | .550 |
| ROTH (Chami et al., 2020) | - | .496 | .449 | .514 | .586 | - | .344 | .246 | .380 | .535 |
| ComplEx (Trouillon et al., 2016) | 5261 | .44 | .41 | .46 | .51 | 339 | .247 | .158 | .275 | .428 |
| QuatE (Zhang et al., 2019) | 2314 | .488 | .438 | .508 | .582 | - | .366 | .271 | .401 | .556 |
| CoKE (Wang et al., 2019b) | - | .475 | .437 | .490 | .552 | - | .361 | .269 | .398 | .547 |
| ConvE (Dettmers et al., 2018) | 4187 | .43 | .40 | .44 | .52 | 244 | .325 | .237 | .356 | .501 |
| DistMult (Yang et al., 2015) | 5110 | .43 | .39 | .44 | .49 | 254 | .241 | .155 | .263 | .419 |
| TuckER (Balazevic et al., 2019) | - | .470 | .443 | .482 | .526 | - | .358 | .266 | .392 | .544 |
| R-GCN (Schlichtkrull et al., 2018) | - | - | - | - | - | - | .249 | .151 | .264 | .417 |
| SACN (Shang et al., 2019) | - | .47 | .43 | .48 | .54 | - | .35 | .26 | .39 | .54 |
| A2N (Bansal et al., 2019) | - | .45 | .42 | .46 | .51 | - | .317 | .232 | .348 | .486 |
| DAGN + DistMult | 2545 | .502 | .459 | .519 | .589 | 138 | .369 | .275 | .409 | .563 |

Table 3: Knowledge Graph Completion on WN18RR and FB15k-237. DAGN achieves state of the art.

Evaluation. We use the standard split for the benchmarks, and the standard testing procedure of predicting the tail (head) entity given the head (tail) entity and relation type. We follow exactly the evaluation used by previous works, namely the Mean Reciprocal Rank (MRR), Mean Rank (MR), and hit rate at K (H@K). See Appendix for a detailed description of this standard setup.
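These ranking metrics can be computed from the rank of the correct entity in each test query. The sketch below assumes 1-based ranks, and the function name and toy ranks are hypothetical.

```python
import numpy as np

def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Mean Rank, Mean Reciprocal Rank and Hits@K from 1-based ranks of the true entity."""
    ranks = np.asarray(ranks, dtype=float)
    out = {"MR": float(ranks.mean()), "MRR": float((1.0 / ranks).mean())}
    for k in ks:
        # Fraction of queries whose correct entity is ranked within the top k.
        out[f"H@{k}"] = float((ranks <= k).mean())
    return out

# Four toy queries whose correct entities were ranked 1st, 2nd, 5th and 20th.
metrics = ranking_metrics([1, 2, 5, 20])
```

Lower is better for MR, while higher is better for MRR and H@K, which is why Table 3 mixes both directions.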

Results. DAGN achieves a new state-of-the-art in knowledge graph completion on all four metrics (Table 3). DAGN compares favourably to both the most recent shallow embedding methods (QuatE) and deep embedding methods (SACN). Note that with the same decoder (DistMult), DAGN using its own embeddings achieves drastic improvements over using the corresponding DistMult embeddings.

4.3 DAGN Model Analysis

Here we present (1) the spectral analysis results, (2) effect of the hyper-parameters on DAGN performance, and (3) attention distribution analysis to show the strengths of DAGN.

Spectral Analysis: Why does DAGN work for node classification? We compute the eigenvalues of the graph Laplacian of the attention matrix $A$, $\lambda_i^g$, and compare them to those of the diffused matrix $\mathcal{A}$, $\hat{\lambda}_i^g$. Figure 3 (a) shows the ratio $\hat{\lambda}_i^g / \lambda_i^g$ on the Cora dataset. Low eigenvalues corresponding to large-scale structure in the graph are amplified (up to a factor of 8), while high eigenvalues corresponding to eigenvectors with noisy information are suppressed (Klicpera et al., 2019b).

DAGN Model Depth. Here we conduct experiments by varying the number of GCN, GAT and our DAGN layers to be 3, 6, 8, 12, and 24 for node classification on Cora. Results in Figure  3 (b) show that both deep GCN and deep GAT (even with residual connection) suffer from degrading performance, due to the over-smoothing problem (Li et al., 2018; Wang et al., 2019a). In contrast, the DAGN model achieves consistent best results even with 24 layers, making deep DAGN model robust and expressive.

Figure 3: Analysis of DAGN. (a) Influence of DAGN on Laplacian eigenvalues. (b) Effect of depth on performance. (c) Effect of iteration steps on performance. (d) Effect of teleport probability.

Effect of and . Figures 3 (c) and (d) report the effect of iteration steps and teleport probability on model performance. We observe significant increase in performance when considering multi-hop neighbors information (). However, increasing the iteration steps has a diminishing returns, for . Moreover, we find that the optimal is correlated with the largest node average shortest path distance (e.g., 5.27 for Cora). This provides a guideline for choosing the best .

We also observe that the accuracy drops significantly for larger $\alpha$. This is because a small $\alpha$ increases the low-pass effect (Figure 3 (a)). However, when $\alpha$ is too small the model focuses only on large-scale graph structure and ignores too much high-frequency information.

Attention Distribution. Last we also analyze the learned attention scores of GAT and DAGN.

Figure 4: Attention weights on Cora.

We first define a discrepancy metric over the attention matrix for node as  (Shanthamallu et al., 2020), where is the uniform distribution score for the node . gives a measure of how much the learnt attention deviates from an uninformative uniform distribution. Large indicates more meaningful attention scores. Fig. 4 shows the distribution of the discrepancy metric of the attention matrix of the 1st head w.r.t. the first layer of DAGN and GAT. Observe that attention scores learned in DAGN have much larger discrepancy. This shows that DAGN is more powerful than GAT in distinguishing important and non-important nodes and assign attention scores accordingly.

5 Related Work

Our proposed DAGN belongs to the family of Graph Neural Network (GNN) models (Battaglia et al., 2018; Wu et al., 2020; Kipf & Welling, 2016; Hamilton et al., 2017), while taking advantage of graph attention and diffusion techniques.

Graph Attention Neural Networks (GATs) generalize attention operation to graph data. GATs allow for assigning different importance to nodes of the same neighborhood at the feature aggregation step (Veličković et al., 2018). Based on such framework, different attention-based GNNs have been proposed, including GaAN (Zhang et al., 2018), AGNN (Thekumparampil et al., 2018), GeniePath (Liu et al., 2019b). However, these models only consider direct neighbors for each layer of feature aggregation, and suffer from over-smoothing when they go deep (Wang et al., 2019a).

Diffusion based Graph Neural Network. Recently Graph Diffusion Convolution (GDC) (Klicpera et al., 2019b, a) proposes to aggregate information from a larger (multi-hop) neighborhood at each layer, by sparsifying a generalized form of graph diffusion. This idea was also explored in (Liao et al., 2019; Luan et al., 2019; Xhonneux et al., 2019) for multi-scale deep Graph Convolutional Networks. However, these methods do not incorporate attention mechanisms which proves to have a significant gain in model performance, and do not make use of edge embeddings (e.g., Knowledge graph) (Klicpera et al., 2019b). Our approach defines a novel multi-hop context-dependent self-attention GNN which resolves the over-smoothing issue of GAT architectures (Wang et al., 2019a).

6 Conclusion

We proposed Direct Multi-hop Attention based Graph Neural Network (DAGN), which brings together benefits of graph attention and diffusion techniques in a single layer through attention diffusion, layer normalization and deep aggregation. DAGN enables context-dependent attention between any pair of nodes in the graph, enhances large-scale structural information, and learns more informative attention distribution. DAGN improves over all state-of-the-art methods on the standard tasks of node classification and knowledge graph completion.

Appendix A Attention Diffusion Approximation Proposition

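As mentioned in Section 2.2, we use the following equation and proposition to efficiently approximate the attention diffused feature aggregation $\mathcal{A}H^{(l)}$.

$$ Z^{(0)} = H^{(l)}, \qquad Z^{(k+1)} = (1-\alpha) A Z^{(k)} + \alpha Z^{(0)}. $$

Proposition 4. $\lim_{K \to \infty} Z^{(K)} = \mathcal{A}H^{(l)}$.

Let $K$ be the total number of iterations and we approximate $\mathcal{A}H^{(l)}$ by $Z^{(K)}$. After the $K$-th iteration, we can get

$$ Z^{(K)} = \left( (1-\alpha)^K A^K + \alpha \sum_{i=0}^{K-1} (1-\alpha)^i A^i \right) H^{(l)}. $$

The term $(1-\alpha)^K A^K$ converges to 0 as $\alpha \in (0, 1]$ and the entries of $A^K$ lie in $[0, 1]$ when $K \to \infty$, and thus $\lim_{K \to \infty} Z^{(K)} = \left( \sum_{i=0}^{\infty} \alpha (1-\alpha)^i A^i \right) H^{(l)} = \mathcal{A}H^{(l)}$. ∎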

Appendix B Connection to Transformer

Given a sequence of tokens, the Transformer architecture makes use of multi-head attention between all pairs of tokens, and can be viewed as performing message-passing on a fully connected graph between all tokens. A naive application of Transformer on graphs would require computation of all pairwise attention values. Such an approach, however, would not make effective use of the graph structure, and could not scale to large graphs. In contrast, Graph Attention Network (Veličković et al., 2018) leverages the graph structure and only computes attention values and performs message passing between direct neighbors. However, it has a limited receptive field (restricted to the one-hop neighborhood) and a fixed attention score (Figure 1) that is independent of the context for prediction.

Transformer consists of a self-attention layer followed by a feed-forward layer. We can organize the self-attention layer in Transformer as the following:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d}} \right) V, $$

where $Q$, $K$ and $V$ are all computed from the same input sequence. The softmax part can be viewed as an attention matrix computed by scaled dot-product attention over a complete graph with self-loops. Because computing attention over a complete graph is expensive, Transformers are usually limited by a fixed-length context (e.g., 512 in BERT (Devlin et al., 2019)) in the setting of language modeling, and thus cannot handle large graphs. Therefore direct application of the Transformer model cannot capture the graph structure in a scalable way.

In the past, graph structure is usually encoded implicitly by special position embeddings (Zhang et al., 2020) or well-designed attention computation (Shiv & Quirk, 2019; Wang et al., 2019c; Nguyen et al., 2020). However, none of the methods can compute attention between any pair of nodes at each layer.

In contrast, essentially DAGN places a prior over the attention values via personalized PageRank, allowing it to compute the attention between any pair of two nodes via attention diffusion, without any impact on its scalability. In particular, DAGN can handle large graphs as they are usually quite sparse and the graph diameter is usually quite smaller than graph size in practice, resulting in very efficient attention diffusion computation.

Appendix C Spectral Analysis Background and Proof for Proposition 2

Graph Fourier Transform. Suppose $A \in \mathbb{R}^{N \times N}$ represents the attention matrix of graph $G$ with $A_{i,j} \geq 0$ and $\sum_{j=1}^{N} A_{i,j} = 1$. Let $A = V \Lambda V^{-1}$ be the Jordan decomposition of the graph attention matrix $A$, where $V$ is the square $N \times N$ matrix whose $i$-th column is the eigenvector $v_i$ of $A$, and $\Lambda$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues, i.e., $\Lambda_{i,i} = \lambda_i$. Then, for a given vector $x$, its Graph Fourier Transform (Sandryhaila & Moura, 2013a) is defined as

$$\hat{x} = V^{-1} x,$$

where $V^{-1}$ is denoted as the graph Fourier transform matrix. The Inverse Graph Fourier Transform is defined as $x = V \hat{x}$, which reconstructs the signal from its spectrum. Based on the graph Fourier transform, we can define a graph convolution operation on $G$ as $x \otimes_G y = V\left(V^{-1}x \odot V^{-1}y\right)$, where $\odot$ denotes the element-wise product.
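A small numerical sketch of the transform pair may help. It assumes a diagonalizable, row-stochastic attention matrix; here a uniformly weighted path graph with self-loops is used as a stand-in for learned attention, and the function names are illustrative only.

```python
import numpy as np

# Uniform attention on a path graph with self-loops (a stand-in, not learned).
n = 5
adj = np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
A = adj / adj.sum(axis=1, keepdims=True)

# Eigendecomposition A = V diag(lam) V^{-1}; columns of V are eigenvectors.
lam, V = np.linalg.eig(A)
V_inv = np.linalg.inv(V)

def gft(x):
    """Graph Fourier transform: project the signal onto the eigenbasis."""
    return V_inv @ x

def inverse_gft(x_hat):
    """Inverse transform: rebuild the signal from its spectrum."""
    return V @ x_hat

def graph_conv(x, y):
    """Graph convolution as an element-wise product in the spectral domain."""
    return inverse_gft(gft(x) * gft(y))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.5, -1.0, 0.0, 2.0, 1.0])
x_rec = inverse_gft(gft(x))
```

Applying the inverse transform to the spectrum recovers `x` up to floating-point error, and `graph_conv` is symmetric in its arguments because the spectral product is element-wise.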

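Graph Attention Diffusion Acts as a Polynomial Graph Filter. A graph filter $h$ (Tremblay et al., 2018; Sandryhaila & Moura, 2013a, b) acts on $x$ as $h(A)x = V h(\Lambda) V^{-1} x$, where $h(\Lambda) = \mathrm{diag}(h(\lambda_1), \ldots, h(\lambda_N))$. A common choice for $h$ in the literature is a polynomial filter of order $M$, since it is linear and shift invariant (Sandryhaila & Moura, 2013a, b):

$$h(A) = \sum_{\ell=0}^{M} \beta_\ell A^{\ell} = V \left(\sum_{\ell=0}^{M} \beta_\ell \Lambda^{\ell}\right) V^{-1}.$$

Comparing with the graph attention diffusion $\mathcal{A} = \sum_{\ell=0}^{\infty} \theta_\ell A^{\ell}$, if we set $\beta_\ell = \theta_\ell = \alpha(1-\alpha)^{\ell}$, we can view graph attention diffusion as a polynomial filter.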
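As an illustrative check that a polynomial filter can be applied either in the vertex domain or in the spectral domain, the sketch below compares the two on a toy graph. The coefficients $\alpha(1-\alpha)^{\ell}$, the filter order, and the helper name `polynomial_filter` are assumptions made for this example.

```python
import numpy as np

def polynomial_filter(A, x, beta):
    """Apply sum_l beta[l] * A^l to x using repeated matrix-vector products."""
    out = np.zeros_like(x, dtype=float)
    power = x.astype(float)            # holds A^l x, starting with l = 0
    for b in beta:
        out += b * power
        power = A @ power
    return out

# Row-stochastic stand-in attention matrix on a path graph with self-loops.
n = 5
adj = np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
A = adj / adj.sum(axis=1, keepdims=True)

alpha, order = 0.15, 10
beta = [alpha * (1.0 - alpha) ** l for l in range(order + 1)]
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

# Vertex domain: only matrix-vector products with A are needed.
vertex = polynomial_filter(A, x, beta)

# Spectral domain: scale each spectral coefficient by h(lambda_i).
lam, V = np.linalg.eig(A)
h_lam = sum(b * lam ** l for l, b in enumerate(beta))
spectral = V @ (h_lam * (np.linalg.inv(V) @ x))
```

The two results agree up to floating-point error. The vertex-domain form is the one that scales, since it never requires an eigendecomposition.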

Spectral Analysis. The eigenvectors of the power matrix $A^2$ are the same as those of $A$, since $A^2 = (V\Lambda V^{-1})(V\Lambda V^{-1}) = V\Lambda (V^{-1}V) \Lambda V^{-1} = V\Lambda^{2}V^{-1}$. By the same argument, $A^{n} = V\Lambda^{n}V^{-1}$. Therefore, the sum of the power series of $A$ has the same eigenvectors as $A$. By the properties of eigenvectors and Equation 3, we obtain:

Proposition 5.

The sets of eigenvectors of $A$ and $\mathcal{A}$ are the same.

Lemma 1.

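Let $\lambda_i$ and $\hat{\lambda}_i$ be the $i$-th eigenvalues of $A$ and $\mathcal{A}$, respectively. Then, we have

$$\hat{\lambda}_i = \sum_{\ell=0}^{\infty} \theta_\ell \lambda_i^{\ell} = \sum_{\ell=0}^{\infty} \alpha (1-\alpha)^{\ell} \lambda_i^{\ell} = \frac{\alpha}{1 - (1-\alpha)\lambda_i}.$$

Proof. The symmetric normalized graph Laplacian of $G$ is $L_{sym} = I - D^{-1/2} A D^{-1/2}$, where $D = \mathrm{diag}(d_1, \ldots, d_N)$ and $d_i = \sum_{j=1}^{N} A_{i,j}$. As $A$ is the attention matrix of graph $G$, $d_i = 1$ and thus $D = I$. Therefore, $L_{sym} = I - A$. Let $\lambda_i$ be the eigenvalues of $A$; then the eigenvalues of the symmetric normalized Laplacian $L_{sym}$ of $G$ are $\bar{\lambda}_i = 1 - \lambda_i$. Meanwhile, for every eigenvalue $\bar{\lambda}_i$ of the normalized graph Laplacian $L_{sym}$, we have $0 \leq \bar{\lambda}_i \leq 2$ (Mohar et al., 1991), and thus $-1 \leq \lambda_i \leq 1$. As $0 < \alpha < 1$, we have $|(1-\alpha)\lambda_i| \leq 1-\alpha < 1$. Therefore, $((1-\alpha)\lambda_i)^{K} \to 0$ when $K \to \infty$, and $\hat{\lambda}_i = \lim_{K \to \infty} \sum_{\ell=0}^{K} \alpha(1-\alpha)^{\ell}\lambda_i^{\ell} = \frac{\alpha}{1-(1-\alpha)\lambda_i}$. ∎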

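Section 3 further defines the eigenvalues of the Laplacian matrices, $\hat{\lambda}_i^{g}$ and $\lambda_i^{g}$, respectively. They satisfy $\hat{\lambda}_i^{g} = 1 - \hat{\lambda}_i$ and $\lambda_i^{g} = 1 - \lambda_i$, and $\lambda_i^{g} \in [0, 2]$ (proved by Ng et al., 2002).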
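The eigenvalue relation in this appendix can be checked numerically on a small example. The sketch below assumes personalized PageRank weights $\alpha(1-\alpha)^{\ell}$, so that the diffused matrix has the closed form $\alpha(I - (1-\alpha)A)^{-1}$ and each eigenvalue $\lambda$ of the attention matrix maps to $\alpha / (1 - (1-\alpha)\lambda)$. The random edge weights are a stand-in for learned attention.

```python
import numpy as np

alpha = 0.15
n = 6

# Row-stochastic attention on a path graph with self-loops, random edge weights.
adj = np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
rng = np.random.default_rng(1)
weights = np.where(adj > 0, np.exp(rng.normal(size=(n, n))), 0.0)
A = weights / weights.sum(axis=1, keepdims=True)

# Diffused attention matrix in closed form (only practical for tiny graphs).
diffused = alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * A)

# Eigenvalues of both matrices, sorted so that they can be compared pairwise.
lam = np.sort(np.linalg.eigvals(A).real)
lam_hat = np.sort(np.linalg.eigvals(diffused).real)

# The map below is increasing on [-1, 1], so sorted order is preserved.
predicted = alpha / (1.0 - (1.0 - alpha) * lam)

# Eigenvalues of the Laplacian I - A.
laplacian_eigs = 1.0 - lam
```

On this example `lam_hat` matches `predicted`, all entries of `lam` lie in $[-1, 1]$, and the Laplacian eigenvalues lie in $[0, 2]$, in line with the bounds used in the proof.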

Appendix D Graph Learning Tasks

Node classification and knowledge graph link prediction are two representative and common tasks in graph learning. We first define the task of node classification:

Definition 1.

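Node classification. Suppose that $X \in \mathbb{R}^{N \times d}$ represents the node input features, where each row $x_i = X_{i:}$ is a $d$-dimensional vector of attribute values of node $v_i \in \mathcal{V}$ ($1 \leq i \leq N$). $\mathcal{V}_l \subset \mathcal{V}$ consists of a set of labeled nodes, and the labels are from $\mathcal{T}$. Node classification is to learn the map function $f: (X, G) \to \mathcal{T}$, which predicts the labels of the remaining unlabeled nodes $\mathcal{V} \setminus \mathcal{V}_l$.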

A knowledge graph (KG) is a heterogeneous graph describing entities and their typed relations to each other. A KG is defined by a set of entities (nodes) $v_i \in \mathcal{V}$ and a set of relations (edges) $e = (v_i, r_k, v_j)$ connecting nodes $v_i$ and $v_j$ via relation $r_k \in \mathcal{R}$. We then define the task of knowledge graph completion:

Definition 2.

KG completion refers to the task of predicting an entity that has a specific relation with another given entity (Wang et al., 2017), i.e., predicting head $h$ given a pair of relation and entity $(r, t)$, or predicting tail $t$ given a pair of head and relation $(h, r)$.

Appendix E Dataset Statistics

Node classification. We show the dataset statistics of the node classification benchmark datasets in Table 4.

Name | Nodes | Edges | Classes | Features | Train/Dev/Test
Cora | 2,708 | 5,429 | 7 | 1,433 | 140/500/1,000
Citeseer | 3,327 | 4,732 | 6 | 3,703 | 120/500/1,000
Pubmed | 19,717 | 88,651 | 3 | 500 | 60/500/1,000
ogbn-arxiv | 169,343 | 1,166,243 | 40 | 128 | 90,941/29,799/48,603
Table 4: Statistical Information on Node Classification Benchmarks

Knowledge Graph Link Prediction. We show the dataset statistics of the knowledge graph benchmark datasets in Table 5.

Dataset | #Entities | #Relations | #Train | #Dev | #Test | Avg. Degree
WN18RR | 40,943 | 11 | 86,835 | 3,034 | 3,134 | 2.19
FB15k-237 | 14,541 | 237 | 272,115 | 17,535 | 20,466 | 18.17
Table 5: Statistical Information on Benchmarks

Appendix F Knowledge Graph Training and Evaluation

Training. The standard training procedure for the knowledge graph completion task is as follows. We add the reverse-direction triple $(t, r^{-1}, h)$ for each triple $(h, r, t)$ to construct an undirected knowledge graph $G$. Following the training procedure introduced in (Balazevic et al., 2019; Dettmers et al., 2018), we use 1-N scoring, i.e., we simultaneously score entity-relation pairs $(h, r)$ and $(t, r^{-1})$ with all entities, respectively. We use the KL divergence loss with label smoothing as the optimization objective.
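The training setup described above can be sketched in plain Python. This is only an illustration under stated assumptions: inverse relations are encoded by offsetting the relation id by the number of relations, the smoothed target spreads a small mass `eps` uniformly over all entities, and the loss is the KL divergence from that target to the softmax of the 1-N scores. The paper does not spell out these details, and all function names are hypothetical.

```python
import math

def add_reverse_triples(triples, num_relations):
    """Append (t, r + num_relations, h) for every (h, r, t)."""
    return triples + [(t, r + num_relations, h) for (h, r, t) in triples]

def smoothed_target(true_entities, num_entities, eps):
    """Label-smoothed target distribution over all candidate entities."""
    base = eps / num_entities
    share = (1.0 - eps) / len(true_entities)
    return [base + (share if e in true_entities else 0.0)
            for e in range(num_entities)]

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def kl_loss(scores, target):
    """KL(target || softmax(scores)) for one (entity, relation) query."""
    log_p = log_softmax(scores)
    return sum(q * (math.log(q) - lp)
               for q, lp in zip(target, log_p) if q > 0.0)

# Two toy triples over 4 entities and 2 relations.
triples = [(0, 0, 1), (2, 1, 3)]
augmented = add_reverse_triples(triples, num_relations=2)

# 1-N scoring: one score per candidate entity for the query (h=0, r=0).
target = smoothed_target({1}, num_entities=4, eps=0.1)
good = kl_loss([0.0, 5.0, 0.0, 0.0], target)   # correct tail scored highest
bad = kl_loss([5.0, 0.0, 0.0, 0.0], target)    # wrong entity scored highest
```

Scoring against all entities at once means that a single forward pass yields the full distribution needed by the loss, and the loss is lower when the correct tail receives the highest score.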

Inference time procedure. For each test triplet $(h, r, t)$, the head $h$ is removed and replaced by each of the entities appearing in the KG. Afterward, we remove from the corrupted triplets all the ones that appear in the training, validation or test set. Finally, we score these corrupted triplets with the link prediction model and sort them in descending order; the rank of $(h, r, t)$ is then stored. This whole procedure is repeated while removing the tail $t$ instead of the head $h$, and averaged metrics are reported. We report mean reciprocal rank (MRR), mean rank (MR) and the proportion of correct triplets in the top $K$ ranks (Hits@$K$) for $K$ = 1, 3 and 10. Lower values of MR and larger values of MRR and Hits@$K$ mean better performance.
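The filtered ranking protocol and the reported metrics can be sketched as follows. The function names and the toy scores are illustrative assumptions; `known_true` stands for the other entities that form a valid triplet with the same query in the training, validation or test set.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` after discarding other entities known to be correct."""
    target_score = scores[target]
    rank = 1
    for entity, score in enumerate(scores):
        if entity == target or entity in known_true:
            continue                   # filtered setting: skip known triplets
        if score > target_score:
            rank += 1
    return rank

def summarize(ranks, ks=(1, 3, 10)):
    """Mean rank, mean reciprocal rank and Hits@K over a list of ranks."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"Hits@{k}"] = sum(r <= k for r in ranks) / n
    return metrics

# Toy query: entity 4 is the held-out answer, entity 1 is another known answer.
scores = [0.1, 0.9, 0.8, 0.3, 0.7]
raw = filtered_rank(scores, target=4, known_true=set())
filtered = filtered_rank(scores, target=4, known_true={1})

# Ranks collected over head and tail corruption are simply pooled and averaged.
metrics = summarize([1, 2, 4, 12])
```

In the toy query the raw rank is 3 and the filtered rank is 2, because the higher-scoring entity 1 is itself a correct answer and is therefore not counted against the model.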

Experimental Setup. We use the multi-layer DAGN as the encoder for both FB15k-237 and WN18RR. We randomly initialize the entity and relation embeddings as the input of the encoder, and set the dimensionality of the initialized entity/relation vectors to 100, as used in DistMult (Yang et al., 2015). We select the other DAGN model hyper-parameters, including number of layers, hidden dimension, head number, top-$k$, learning rate, number of power iteration steps, teleport probability and dropout ratios (see the settings of these parameters in Appendix H), by a random search during training.

Appendix G Hyper-parameter Settings for Node Classification

The best models are selected according to the classification accuracy on the validation set by early stopping with window size 200.

For each dataset, the hyper-parameters are determined by a random search (Bergstra & Bengio, 2012), including learning rate, number of power iteration steps, teleport probability and dropout ratios. The hyper-parameter search space is shown in Tables 6 (for Cora, Citeseer and Pubmed) and 7 (for ogbn-arxiv).

Hyper-parameters | Search Space | Type
Hidden Dimension | 512 | Fixed
Head Number | 8 | Fixed
Layer Number | 6 | Fixed
Learning rate | [] | Range
Number of power iteration steps | [] | Choice
Teleport probability | [] | Range
Dropout (attention, feature) | [] | Range
Weight Decay | [] | Range
Optimizer | Adam | Fixed
  • Fixed: a constant value;

  • Range: a value range with lower bound and higher bound;

  • Choice: a set of values.

Table 6: Hyper-parameter search space used for node classification on Cora, Citeseer and Pubmed
Hyper-parameters | Search Space | Type
Hidden Dimension | 128 | Fixed
Head Number | 8 | Fixed
Layer Number | 2 | Fixed
Learning rate | [] | Range
Number of power iteration steps | [] | Choice
Teleport probability | [] | Range
Dropout (attention, feature) | [] | Range
Weight Decay | [] | Range
Optimizer | Adam | Fixed
Table 7: Hyper-parameter search space used for node classification on ogbn-arxiv
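A random search over a space made of Fixed, Range and Choice entries can be sketched as below. The fixed values follow Table 6, but every bound and choice list in this snippet is a hypothetical placeholder, because the actual ranges are not recoverable from the tables above. Sampling the learning rate and weight decay on a log scale is likewise an assumption.

```python
import math
import random

# Fixed values follow Table 6; all ranges and choices are placeholders.
SEARCH_SPACE = {
    "hidden_dim": ("fixed", 512),
    "num_heads": ("fixed", 8),
    "num_layers": ("fixed", 6),
    "learning_rate": ("range", (1e-4, 1e-2)),
    "power_iteration_steps": ("choice", [3, 6, 10]),
    "teleport_probability": ("range", (0.05, 0.6)),
    "dropout": ("range", (0.1, 0.6)),
    "weight_decay": ("range", (1e-6, 1e-3)),
}
LOG_SCALE = {"learning_rate", "weight_decay"}   # assumed log-uniform sampling

def sample_config(space, rng):
    """Draw one configuration from a Fixed / Range / Choice search space."""
    config = {}
    for name, (kind, spec) in space.items():
        if kind == "fixed":
            config[name] = spec
        elif kind == "range":
            low, high = spec
            if name in LOG_SCALE:
                log_value = rng.uniform(math.log(low), math.log(high))
                config[name] = math.exp(log_value)
            else:
                config[name] = rng.uniform(low, high)
        elif kind == "choice":
            config[name] = rng.choice(spec)
        else:
            raise ValueError(f"unknown entry type: {kind}")
    return config

rng = random.Random(0)
trials = [sample_config(SEARCH_SPACE, rng) for _ in range(20)]
```

Each sampled configuration would then be trained and compared by validation accuracy; that training loop is omitted here because it depends on the model code.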

Appendix H Hyper-parameter Setting for Link Prediction on KG

For each KG, the hyper-parameters are determined by a random search (Bergstra & Bengio, 2012), including number of layers, learning rate, hidden dimension, batch size, head number, number of power iteration steps, teleport probability and dropout ratios. The hyper-parameter search space is shown in Table 8.

Hyper-parameters | Search Space | Type
Initial Entity/Relation Dimension | 100 | Fixed
Number of layers | [] | Choice
Learning rate | [] | Range
Hidden Dimension | [] | Choice
Batch size | [] | Choice
Head Number | [] | Choice
Number of power iteration steps | [] | Choice
Teleport probability | [] | Range
Dropout (attention, feature) | [] | Range
Weight Decay | [] | Range
Optimizer | Adam | Fixed
Table 8: Hyper-parameter search space used for link prediction on KG


  1. Node $j$ is reachable from node $i$ iff there exists a path which starts with $i$ and ends with $j$.
  2. Obtained by the attention definition and Eq. 3.
  3. The eigenvalues of $A$ and $\mathcal{A}$ correspond to the same eigenvectors, as shown in Proposition 5 in Appendix C.
  4. All datasets used are public, and the code will be released at the time of publication.
  5. Please see the definitions of these two tasks in Appendix D.
  6. All nodes are connected with each other.


  1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  2. Ivana Balazevic, Carl Allen, and Timothy Hospedales. Tucker: Tensor factorization for knowledge graph completion. In EMNLP, 2019.
  3. Trapit Bansal, Da-Cheng Juan, Sujith Ravi, and Andrew McCallum. A2n: Attending to neighbors for knowledge graph inference. In ACL, 2019.
  4. Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  5. James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. JMLR, pp. 281–305, 2012.
  6. Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NeurIPS, 2013.
  7. Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, and Christopher Ré. Low-dimensional hyperbolic knowledge graph embeddings. In ACL, 2020.
  8. Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, 2016.
  9. Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In AAAI, 2018.
  10. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  11. Hongyang Gao and Shuiwang Ji. Graph u-nets. In ICML, 2019.
  12. Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Large-scale learnable graph convolutional networks. In KDD, 2018.
  13. Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
  14. Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.
  15. Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2016.
  16. Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In ICLR, 2019a.
  17. Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In NeurIPS, 2019b.
  18. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In ICLR, 2019.
  19. Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.
  20. Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard S. Zemel. Lanczosnet: Multi-scale deep graph convolutional networks. In ICLR, 2019.
  21. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019a.
  22. Ziqi Liu, Chaochao Chen, Longfei Li, Jun Zhou, Xiaolong Li, Le Song, and Yuan Qi. Geniepath: Graph neural networks with adaptive receptive paths. In AAAI, 2019b.
  23. Peter Lofgren. Efficient Algorithms for Personalized PageRank. PhD thesis, Stanford University, 2015.
  24. Sitao Luan, Mingde Zhao, Xiao-Wen Chang, and Doina Precup. Break the ceiling: Stronger multi-scale deep graph convolutional networks. In NeurIPS, 2019.
  25. Bojan Mohar, Y Alavi, G Chartrand, and OR Oellermann. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications, pp. 871–898, 1991.
  26. Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NeurIPS, 2002.
  27. Xuan-Phi Nguyen, Shafiq Joty, Steven CH Hoi, and Richard Socher. Tree-structured attention with hierarchical accumulation. In ICLR, 2020.
  28. Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In ICLR, 2020.
  29. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
  30. Aliaksei Sandryhaila and José MF Moura. Discrete signal processing on graphs: Graph fourier transform. In ICASSP, 2013a.
  31. Aliaksei Sandryhaila and José MF Moura. Discrete signal processing on graphs. TSP, pp. 1644–1656, 2013b.
  32. Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In ESWC, 2018.
  33. Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, pp. 93–106, 2008.
  34. Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xiaodong He, and Bowen Zhou. End-to-end structure-aware convolutional networks for knowledge base completion. In AAAI, 2019.
  35. Uday Shankar Shanthamallu, Jayaraman J Thiagarajan, and Andreas Spanias. A regularized attention mechanism for graph attention networks. In ICASSP, 2020.
  36. Vighnesh Shiv and Chris Quirk. Novel positional encodings to enable tree-based transformers. In NeurIPS, 2019.
  37. Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. Rotate: Knowledge graph embedding by relational rotation in complex space. In ICLR, 2019.
  38. Yun Tang, Jing Huang, Guangtao Wang, Xiaodong He, and Bowen Zhou. Orthogonal relation transforms with graph context modeling for knowledge graph embedding. In ACL, 2020.
  39. Kiran K Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735, 2018.
  40. Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. In CVSC-WS, 2015.
  41. Nicolas Tremblay, Paulo Gonçalves, and Pierre Borgnat. Design of graph filters and filterbanks. In Cooperative and Graph Signal Processing, pp. 299–324. Elsevier, 2018.
  42. Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, 2016.
  43. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  44. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
  45. Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Improving graph attention networks with large margin-based constraints. In NeurIPS-WS, 2019a.
  46. Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. TKDE, pp. 2724–2743, 2017.
  47. Quan Wang, Pingping Huang, Haifeng Wang, Songtai Dai, Wenbin Jiang, Jing Liu, Yajuan Lyu, Yong Zhu, and Hua Wu. Coke: Contextualized knowledge graph embedding. arXiv preprint arXiv:1911.02168, 2019b.
  48. Yaushian Wang, Hung-Yi Lee, and Yun-Nung Chen. Tree transformer: Integrating tree structures into self-attention. In EMNLP, 2019c.
  49. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
  50. Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst, 2020.
  51. Louis-Pascal A. C. Xhonneux, Meng Qu, and Jian Tang. Continuous graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
  52. Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.
  53. Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In ICLR, 2015.
  54. Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. In UAI, 2018.
  55. Jiawei Zhang, Haopeng Zhang, Li Sun, and Congying Xia. Graph-bert: Only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140, 2020.
  56. Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. Quaternion knowledge graph embedding. In NeurIPS, 2019.
  57. Chenyi Zhuang and Qiang Ma. Dual graph convolutional networks for graph-based semi-supervised classification. In WWW, 2018.