Direct Multi-hop Attention based Graph Neural Networks
Abstract
Introducing the self-attention mechanism in graph neural networks (GNNs) achieved state-of-the-art performance for graph representation learning. However, at every layer, attention is only computed between two connected nodes and depends solely on the representations of both nodes. This attention computation cannot account for multi-hop neighbors, which supply graph-structure context and also influence node representation learning. In this paper we propose Direct multi-hop Attention based Graph neural Network (DAGN) for graph representation learning, a principled way to incorporate multi-hop neighboring context into attention computation, enabling long-range interactions at every layer. To compute attention between nodes that are multiple hops away in a single layer, DAGN diffuses the attention scores from neighboring nodes to non-neighboring nodes, increasing the receptive field of every message-passing layer. Unlike previous methods, DAGN uses a diffusion prior on attention values to efficiently account for all paths between a pair of nodes when computing attention weights. This helps DAGN capture large-scale structural information in every layer and learn a more informative attention distribution. Experimental results on standard node classification as well as knowledge graph completion benchmarks show that DAGN achieves state-of-the-art results: DAGN achieves a relative error reduction over the previous state-of-the-art on Cora, Citeseer, and Pubmed. DAGN also obtains the best performance on a large-scale Open Graph Benchmark dataset. On knowledge graph completion DAGN advances the state-of-the-art on WN18RR and FB15k-237 across four different performance metrics.
1 Introduction
The introduction of the self-attention mechanism (Bahdanau et al., 2015), especially the Transformer architecture (Vaswani et al., 2017), has pushed the state-of-the-art in many natural language processing tasks (Radford et al., 2019; Devlin et al., 2019; Liu et al., 2019a; Lan et al., 2019). Graph Attention Network (GAT) (Veličković et al., 2018) and related models (Li et al., 2018; Wang et al., 2019a; Liu et al., 2019b; Oono & Suzuki, 2020) apply the attention mechanism to graph neural networks. They compute attention scores between nodes that are directly connected by an edge, allowing the model to attend to messages on edges according to their attention scores.
However, such attention computation on pairs of nodes connected by edges implies that a node can only attend to its immediate neighbors to compute its next-layer representation. This means the receptive field of a single message-passing layer is restricted to the one-hop graph structure. Although stacking multiple GAT layers can enlarge the receptive field to multi-hop neighbors and learn non-neighboring interactions, such deep GATs usually suffer from the over-smoothing problem (Wang et al., 2019a; Liu et al., 2019b; Oono & Suzuki, 2020). Furthermore, edge attention weights in a single GAT layer are based solely on the node representations themselves, and do not depend on the neighborhood context of the graph structure. In short, the one-hop attention mechanism in GATs limits their ability to explore the correlation between graph structure and attention weights. Previous works (Xu et al., 2018; Klicpera et al., 2019b) have shown advantages of performing multi-hop message passing in a single layer, indicating that exploring graph structure information in a single layer is beneficial. However, these approaches are not graph-attention based. Therefore, incorporating multi-hop neighboring context into the attention computation in graph neural networks has not been explored.
Here we present Direct Multi-hop Attention based Graph Neural Network (DAGN), an effective and efficient multi-hop self-attention computation for relational graph data via a novel graph attention diffusion layer (Figure 1). We achieve this by first computing attention weights on edges (represented by solid arrows), and then computing the remaining self-attention weights (dotted arrows) through an attention diffusion process based on the edge attention weights.
Our model has two main advantages.
1) DAGN captures long-range interactions between nodes multiple hops away at every message-passing layer. The model thus enables effective long-range message passing from important nodes multiple hops away.
2) The attention computation in DAGN is context-dependent. The attention value in GATs only depends on the previous-layer representations of connected nodes, and is 0 between unconnected nodes. In contrast, for any pair of reachable nodes, DAGN's diffused attention additionally depends on the graph structure context, i.e., on the attention along all paths connecting the two nodes.
Theoretically, we demonstrate that DAGN places a Personalized PageRank (PPR) prior on the attention values, based on the graph structure. We also use spectral graph analysis to show that DAGN is able to emphasize large-scale graph structure and suppress high-frequency noise in graphs. Specifically, DAGN enlarges the lower Laplacian eigenvalues, which correspond to large-scale structure in the graph, and suppresses the higher Laplacian eigenvalues, which correspond to noisier and more fine-grained information in the graph.
We perform experiments on standard datasets for semi-supervised node classification as well as knowledge graph completion. Experiments show that DAGN achieves state-of-the-art results: DAGN achieves a relative error reduction over the previous state-of-the-art on Cora, Citeseer, and Pubmed. DAGN also obtains better performance on a large-scale Open Graph Benchmark dataset. On knowledge graph completion DAGN advances the state-of-the-art on WN18RR and FB15k-237 across four metrics, with the largest gain of 7.1% in Hits@1.
Furthermore, our ablation study reveals the synergistic effect of the essential components of DAGN, including layer normalization and multi-hop diffused attention. We show that DAGN benefits from increased model depth, while the performance of baselines plateaus at much smaller model depths. We further observe that, compared to GAT, the attention values learned by DAGN have higher diversity, indicating a better ability to attend to important nodes.
2 Direct Multi-hop Attention based Graph Neural Network
We first discuss the background and then explain Direct Multi-hop Attention based Graph Neural Network's new attention diffusion module and its overall model architecture.
2.1 Preliminaries
Let $G = (V, E)$ be a given graph, where $V$ is the set of nodes and $E$ is the set of edges connecting pairs of nodes in $V$. Each node $v \in V$ and each edge $e \in E$ are associated with type mapping functions $\phi: V \to \mathcal{T}_v$ and $\psi: E \to \mathcal{T}_e$, where $\mathcal{T}_v$ and $\mathcal{T}_e$ denote the sets of node types (labels) and edge types. Our framework supports learning on heterogeneous graphs with multiple elements in $\mathcal{T}_e$.
A general Graph Neural Network (GNN) approach learns an embedding that maps nodes and/or edge types into a continuous vector space. Let $X \in \mathbb{R}^{|V| \times d}$ and $R \in \mathbb{R}^{|\mathcal{T}_e| \times d_r}$ be the node embedding matrix and edge-type embedding matrix, where $d$ and $d_r$ represent the embedding dimensions of nodes and edge types, each row $x_v$ represents the embedding of node $v$ ($v \in V$), and each row $r_k$ represents the embedding of relation $k$ ($k \in \mathcal{T}_e$).
DAGN builds on GNNs, while bringing together the benefits of graph attention and diffusion techniques. The core of DAGN is Multi-hop Attention Diffusion, a principled way to learn attention between any pair of nodes in a scalable way, taking the graph structure into account and enabling context-dependent attention.
The key challenge here is how to allow for flexible but scalable context-dependent multi-hop attention, where any node can influence the embedding of any other node in the same layer (even if they are far apart in the underlying network). Simply learning attention scores over all node pairs is infeasible and would lead to overfitting and poor generalization.
2.2 Multi-hop Attention Diffusion
We first introduce attention diffusion, which operates on DAGN's attention scores at each layer. The input to the attention diffusion operator is a set of triples $(h, r, t)$, where $h, t \in V$ are nodes and $r \in \mathcal{T}_e$ is the edge type. DAGN first computes the attention scores on all edges. The attention diffusion module then computes the attention values between pairs of nodes that are not directly connected by an edge, based on the edge attention scores, via a diffusion process. The attention diffusion module can then be used as a component of the DAGN architecture, which we further elaborate in Section 2.3.
Edge Attention Computation. At each layer $l$, a vector message is computed for each triple $(h, r, t)$. To compute the representation of node $t$ at layer $l+1$, all messages from triples incident to $t$ are aggregated into a single message, which is then used to update $x_t^{(l+1)}$.
In the first stage, the attention score of an edge $(h, r, t)$ is computed by the following:

$$s^{(l)}(h, r, t) = \mathrm{LeakyReLU}\!\left(v_a^{(l)\top}\left[W_h^{(l)} x_h^{(l)} \,\Vert\, W_t^{(l)} x_t^{(l)} \,\Vert\, W_r^{(l)} r\right]\right) \qquad (1)$$

where $W_h^{(l)}$, $W_t^{(l)}$, $W_r^{(l)}$, and $v_a^{(l)}$ are the trainable weights shared by the $l$-th layer, $x_v^{(l)}$ represents the embedding of node $v$ at the $l$-th layer, and $x_v^{(0)} = x_v$. $r$ is the trainable relation embedding and $\Vert$ denotes concatenation of embedding vectors. For graphs with no relation type, we treat $\mathcal{T}_e$ as a degenerate categorical distribution with 1 category.
Applying Eq. 1 to each edge of the graph $G$, we obtain an attention score matrix $S^{(l)}$:

$$S^{(l)}_{t,h} = \begin{cases} s^{(l)}(h, r, t) & \text{if } (h, r, t) \in E \\ -\infty & \text{otherwise} \end{cases} \qquad (2)$$

Subsequently we obtain the attention matrix $A^{(l)}$ by performing a row-wise softmax over the score matrix: $A^{(l)} = \mathrm{softmax}(S^{(l)})$. $A^{(l)}_{t,h}$ denotes the attention value at layer $l$ when aggregating the message from node $h$ to node $t$.
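The two-stage edge attention computation can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes a GAT-style LeakyReLU score without relation embeddings, dense matrices, and a self-loop on every node so that each softmax row is well defined; all parameter names are illustrative.

```python
import numpy as np

def edge_attention(H, edges, W_h, W_t, v, neg_slope=0.2):
    """Row-stochastic 1-hop attention matrix; zero weight off-edge."""
    n = H.shape[0]
    # Scores only on edges; -inf elsewhere so softmax assigns them zero mass.
    S = np.full((n, n), -np.inf)
    for t, h in [(j, i) for (i, j) in edges]:
        z = v @ np.concatenate([W_h @ H[h], W_t @ H[t]])
        S[t, h] = z if z > 0 else neg_slope * z          # LeakyReLU
    S = S - S.max(axis=1, keepdims=True)                 # stabilized row-wise softmax
    A = np.exp(S)
    return A / A.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to 1, and entries for node pairs with no edge are exactly zero, matching Eq. 2's $-\infty$ masking.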
Attention Diffusion for Multi-hop Neighbors. In the second stage, we further enable attention between nodes that are not directly connected in the network. We achieve this via the following attention diffusion procedure, which computes the attention scores of multi-hop neighbors via graph diffusion based on the powers of the 1-hop attention matrix $A^{(l)}$:

$$\mathcal{A}^{(l)} = \sum_{i=0}^{\infty} \theta_i \left(A^{(l)}\right)^i \qquad (3)$$

where $\theta_i$ is the attention decay factor, with $\theta_i \ge 0$ and $\sum_{i=0}^{\infty} \theta_i = 1$. The powers of the attention matrix, $(A^{(l)})^i$, account for all relation paths of length up to $i$ between two nodes, increasing the receptive field of the attention (Figure 1). Importantly, this mechanism allows the attention between two nodes to depend not only on their previous-layer representations, but also on the paths between the nodes, effectively creating attention shortcuts between nodes that are not connected (Figure 1). Attention through each path is also weighted differently, depending on $\theta_i$ and the path length.
In our implementation we utilize the geometric distribution $\theta_i = \alpha(1-\alpha)^i$, where $\alpha \in (0, 1]$. The choice is based on the inductive bias that nodes farther away should be weighted less in message aggregation, and that nodes at different relation-path lengths from the target node are weighted in a sequentially independent manner. In addition, notice that with $\theta_i = \alpha(1-\alpha)^i$, Eq. 3 gives the Personalized PageRank (PPR) procedure on the graph with attention matrix $A^{(l)}$ and teleport probability $\alpha$. Hence the diffused attention weight $\mathcal{A}^{(l)}_{i,j}$ can be seen as the influence of node $j$ on node $i$. We further elaborate the significance of this observation in Section 4.3.
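A small numerical sketch of Eq. 3 with the geometric weights, truncated at $K$ hops (function name and the toy path graph are illustrative, not from the paper): diffusion assigns non-zero attention to node pairs that share no edge.

```python
import numpy as np

def attention_diffusion(A, alpha=0.15, K=10):
    """Truncated Eq. 3 with geometric weights theta_k = alpha * (1-alpha)^k."""
    n = A.shape[0]
    diffused = np.zeros_like(A)
    A_power = np.eye(n)                        # A^0
    for k in range(K + 1):
        diffused += alpha * (1 - alpha) ** k * A_power
        A_power = A_power @ A                  # A^{k+1}
    return diffused

# Toy path graph 0-1-2 with self-loops: no edge (hence no 1-hop attention)
# between nodes 0 and 2.
A = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
D = attention_diffusion(A)
assert A[0, 2] == 0.0 and D[0, 2] > 0.0        # diffusion creates a 2-hop shortcut
```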
We can also view $\mathcal{A}^{(l)}_{i,j}$ as the attention value of node $j$ to node $i$, since $\sum_j \mathcal{A}^{(l)}_{i,j} = 1$. The attention-diffusion-based feature aggregation is then:

$$\mathrm{AttDiff}\left(H^{(l)}; \Theta^{(l)}\right) = \mathcal{A}^{(l)} H^{(l)} \qquad (4)$$

where $\Theta^{(l)}$ is the set of parameters for computing attention. Thanks to the diffusion process defined in Eq. 3, DAGN uses the same number of parameters as if we were only computing attention between nodes connected by edges. This ensures runtime efficiency as well as good model generalization.
Approximate Computation for Attention Diffusion. For large graphs, computing the exact attention diffusion matrix via Eq. 3 may be prohibitively expensive, due to the powers of the attention matrix (Klicpera et al., 2019a). To resolve this bottleneck, we proceed as follows. Let $H^{(l)}$ be the input entity embedding matrix of the $l$-th layer ($H^{(0)} = X$) and define $Z^{(0)} = H^{(l)}$. Since DAGN only requires the aggregation $\mathcal{A}^{(l)} H^{(l)}$, we can approximate it by a sequence $Z^{(k)}$ that converges to the true value of $\mathcal{A}^{(l)} H^{(l)}$ (Proposition 1) as $k \to \infty$:

$$Z^{(k+1)} = (1-\alpha)\, A^{(l)} Z^{(k)} + \alpha\, Z^{(0)} \qquad (5)$$
Proposition 1.
$\lim_{k \to \infty} Z^{(k)} = \mathcal{A}^{(l)} H^{(l)}$.
In the Appendix we give the proof, which relies on the expansion of Eq. 5.
Using the above approximation, attention computation with diffusion still costs $O(|E|)$ per propagation step, with a constant factor corresponding to the number of hops $K$. In practice, we find that choosing $K$ close to the graph's average shortest-path distance results in good model performance. Many real-world graphs exhibit the small-world property, in which case an even smaller value of $K$ is sufficient. For graphs with larger diameter, we choose larger $K$ and lower the value of $\alpha$.
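The approximation in Eq. 5 can be sketched as a simple fixed-point iteration (a minimal NumPy illustration, not the paper's implementation); the closed form $\alpha (I - (1-\alpha) A)^{-1} H$ used for comparison follows from the PPR view developed in Section 3.2.

```python
import numpy as np

def approx_diffused_aggregation(A, H, alpha=0.15, K=30):
    """Iterate Z <- (1-alpha) * A Z + alpha * Z0; each step costs one
    sparse-matrix product, so K steps cost O(K |E|) in a sparse setting."""
    Z0 = H.copy()
    Z = H.copy()
    for _ in range(K):
        Z = (1 - alpha) * (A @ Z) + alpha * Z0
    return Z
```

The fixed point of the iteration is exactly $\alpha (I - (1-\alpha)A)^{-1} H$, and convergence is geometric with rate $1-\alpha$, so a moderate $K$ already gives a tight approximation.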
2.3 Direct Multi-hop Attention based GNN Architecture
Figure 2 provides an architecture overview of the DAGN Block that can be stacked multiple times.
Multi-head Graph Attention Diffusion Layer. Multi-head attention (Vaswani et al., 2017; Veličković et al., 2018) is used to allow the model to jointly attend to information from different representation subspaces. In Eq. 6, the attention diffusion for each head is computed separately with Eq. 4, and the heads are aggregated:

$$\mathrm{MultiHead}\left(H^{(l)}\right) = \Big\Vert_{m=1}^{M} \mathrm{AttDiff}\left(H^{(l)}; \Theta_m^{(l)}\right) \qquad (6)$$

where $\Vert$ denotes concatenation and $\Theta_m^{(l)}$ are the parameters in Eq. 1 for the $m$-th head ($1 \le m \le M$).
Deep Aggregation. Moreover, our DAGN block contains a fully connected feed-forward sub-layer, which consists of a two-layer feed-forward network. We also add layer normalization and residual connections in both sub-layers, allowing for a more expressive aggregation step in each block:

$$\hat{H}^{(l)} = \mathrm{LayerNorm}\left(H^{(l)} + \mathrm{MultiHead}\left(H^{(l)}\right)\right), \qquad H^{(l+1)} = \mathrm{LayerNorm}\left(\hat{H}^{(l)} + \mathrm{FFN}\left(\hat{H}^{(l)}\right)\right) \qquad (7)$$

where $\mathrm{FFN}$ denotes the two-layer feed-forward network.
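A hedged NumPy sketch of one DAGN block per Eq. 7, assuming a single pre-computed diffused attention matrix in place of the multi-head layer and a ReLU feed-forward network (both simplifying assumptions; parameter names and shapes are illustrative).

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # Normalize each row (node embedding) to zero mean, unit variance.
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def dagn_block(H, diffused_attn, W1, W2):
    # Sub-layer 1: diffused attention aggregation + residual + LayerNorm.
    H1 = layer_norm(H + diffused_attn @ H)
    # Sub-layer 2: two-layer feed-forward network + residual + LayerNorm.
    ffn = np.maximum(H1 @ W1, 0.0) @ W2        # ReLU is an assumption here
    return layer_norm(H1 + ffn)
```

Stacking several such blocks reproduces the overall architecture of Figure 2, with each block widening the receptive field via the diffused attention matrix rather than via depth alone.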
DAGN generalizes GAT. DAGN extends GAT via the diffusion process. The feature aggregation in GAT is $H^{(l+1)} = \sigma\left(A^{(l)} H^{(l)} W^{(l)}\right)$, where $\sigma$ represents the activation function. We can divide the GAT layer into two components as follows:

$$H^{(l+1)} = \underbrace{\sigma}_{(2)}\Big(\underbrace{A^{(l)} H^{(l)} W^{(l)}}_{(1)}\Big) \qquad (8)$$

In component (1), DAGN removes the restriction of attending only to direct neighbors, without requiring additional parameters, as $\mathcal{A}^{(l)}$ is induced from $A^{(l)}$. For component (2), DAGN uses layer normalization and deep aggregation, which achieve significant gains according to the ablation studies in Table 1.
3 Attention Diffusion Analysis
In this section, we investigate the benefits of DAGN from the viewpoint of discrete signal processing on graphs. Our first result demonstrates that DAGN can better capture largescale structural information. Our second result explores the relation between DAGN and Personalized PageRank (PPR).
3.1 Spectral Properties of Graph Attention Diffusion
We view the attention matrix $A$ of GAT and $\mathcal{A}$ of DAGN as weighted adjacency matrices, and apply the graph Fourier transform and spectral analysis (details in Appendix) to show that DAGN acts as a graph low-pass filter, able to more effectively capture large-scale structure in graphs. By Eq. 3, the sum of each row of either $A$ or $\mathcal{A}$ is 1. Hence the normalized graph Laplacians are $L = I - A$ and $\tilde{L} = I - \mathcal{A}$ for $A$ and $\mathcal{A}$, respectively. We obtain the following proposition:
Proposition 2.
Let $\lambda_i$ and $\tilde{\lambda}_i$ be the $i$-th eigenvalues of $L$ and $\tilde{L}$, respectively. Then:

$$\tilde{\lambda}_i = \frac{(1-\alpha)\lambda_i}{\alpha + (1-\alpha)\lambda_i} \qquad (9)$$
Refer to Appendix for the proof.
We additionally have $0 \le \lambda_i \le 2$ (proved by (Ng et al., 2002)).
Eq. 9 shows that when $\lambda_i$ is small, such that $\lambda_i < \frac{1-2\alpha}{1-\alpha}$, we have $\tilde{\lambda}_i > \lambda_i$, whereas for large $\lambda_i$, $\tilde{\lambda}_i < \lambda_i$. This relation indicates that the use of $\mathcal{A}$ increases smaller eigenvalues and decreases larger eigenvalues.
The low eigenvalues, corresponding to low-frequency signals, describe the large-scale structure in the graph (Ng et al., 2002) and have been shown to be crucial in graph tasks (Klicpera et al., 2019b). As $0 \le \lambda_i \le 2$ (Ng et al., 2002) and $0 < \alpha \le 1$, the reciprocal form of Eq. 9 amplifies the ratio of the lower eigenvalues to the sum of all eigenvalues. In contrast, high eigenvalues, corresponding to noise, are suppressed.
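Proposition 2 can be checked numerically on a toy attention matrix. The sketch below (not from the paper) assumes a symmetric edge-weight matrix before row normalization, which makes the spectrum real so that eigenvalues can be sorted and matched.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.random((6, 6))
W = W + W.T                                    # symmetric weights => real spectrum
A = W / W.sum(axis=1, keepdims=True)           # row-stochastic attention matrix
alpha = 0.2

# Exact diffusion via the PPR closed form (Section 3.2).
A_diff = alpha * np.linalg.inv(np.eye(6) - (1 - alpha) * A)

lam = np.sort(np.linalg.eigvals(np.eye(6) - A).real)          # eigenvalues of L
lam_tilde = np.sort(np.linalg.eigvals(np.eye(6) - A_diff).real)  # eigenvalues of L~
predicted = (1 - alpha) * lam / (alpha + (1 - alpha) * lam)   # Eq. 9
assert np.allclose(lam_tilde, predicted, atol=1e-8)
```

Because the map in Eq. 9 is monotone increasing on $[0, 2]$, sorting both spectra preserves the eigenvalue pairing.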
3.2 Personalized PageRank Meets Graph Attention Diffusion
We can also view the attention matrix $A$ as a random-walk transition matrix on graph $G$, since $A_{i,j} \ge 0$ and $\sum_j A_{i,j} = 1$. If we perform Personalized PageRank (PPR) with teleport probability $\alpha$ on $G$ with transition matrix $A$, the fully personalized PageRank matrix (Lofgren, 2015) is defined as:

$$\Pi^{\mathrm{ppr}} = \alpha\left(I - (1-\alpha)A\right)^{-1} \qquad (10)$$

Using the power series expansion of the matrix inverse, we obtain

$$\Pi^{\mathrm{ppr}} = \alpha \sum_{k=0}^{\infty} (1-\alpha)^k A^k \qquad (11)$$

Comparing this to the diffusion in Eq. 3 with $\theta_k = \alpha(1-\alpha)^k$, we have the following proposition.
Proposition 3.
Graph attention diffusion defines a personalized PageRank with teleport probability $\alpha$ on graph $G$ with transition matrix $A$, i.e., $\mathcal{A} = \Pi^{\mathrm{ppr}}$.
The parameter $\alpha$ in DAGN is equivalent to the teleport probability of PPR. PPR provides a good relevance score between nodes in a weighted graph (with weights given by the attention matrix $A$). In summary, DAGN places a PPR prior over pairwise node attention scores: the diffused attention between nodes $i$ and $j$ depends on the attention scores on the edges of all paths between $i$ and $j$.
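The equivalence in Proposition 3 is easy to verify numerically; the sketch below (illustrative, not from the paper) compares the truncated geometric series of Eq. 3 against the closed form of Eq. 10.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((5, 5))
A /= A.sum(axis=1, keepdims=True)               # row-stochastic attention matrix
alpha = 0.15

# Closed form: fully personalized PageRank matrix (Eq. 10).
ppr = alpha * np.linalg.inv(np.eye(5) - (1 - alpha) * A)

# Truncated power series with theta_k = alpha * (1 - alpha)^k (Eqs. 3 and 11).
series = sum(alpha * (1 - alpha) ** k * np.linalg.matrix_power(A, k)
             for k in range(300))

assert np.allclose(ppr, series, atol=1e-10)
assert np.allclose(ppr.sum(axis=1), 1.0)        # rows remain a distribution
```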
4 Experiments
We evaluate DAGN on two classical tasks: node classification and knowledge graph completion.
4.1 Task 1: Node Classification
Datasets. We employ four benchmark datasets for node classification: (1) the standard citation network benchmarks Cora, Citeseer and Pubmed (Sen et al., 2008; Kipf & Welling, 2016); and (2) the ogbn-arxiv benchmark with 170K nodes and 1.2M edges from the Open Graph Benchmark (Hu et al., 2020). We follow the standard data splits for all datasets. Further information about these datasets is summarized in the Appendix.
Baselines. We compare against a comprehensive suite of state-of-the-art GNN methods including: GCN (Kipf & Welling, 2016), Chebyshev-filter-based GCN (Defferrard et al., 2016), DualGCN (Zhuang & Ma, 2018), JKNet (Xu et al., 2018), LGCN (Gao et al., 2018), Diffusion-GCN (Klicpera et al., 2019b), Graph U-Nets (g-U-Nets) (Gao & Ji, 2019), and GAT (Veličković et al., 2018).
Experimental Setup. For datasets Cora, Citeseer and Pubmed, we use 6 DAGN blocks with hidden dimension 512 and 8 attention heads. For the largescale ogbnarxiv dataset, we use 2 DAGN blocks with hidden dimension 128 and 8 attention heads. Refer to Appendix for detailed description of all hyperparameters and evaluation settings.
Table 1: Node classification accuracy (%).

| Models | Cora | Citeseer | Pubmed |
|---|---|---|---|
| GCN (Kipf & Welling, 2016) | 81.5 | 70.3 | 79.0 |
| Chebyshev (Defferrard et al., 2016) | 81.2 | 69.8 | 74.4 |
| DualGCN (Zhuang & Ma, 2018) | 83.5 | 72.6 | 80.0 |
| JKNet (Xu et al., 2018)^⋆ | 81.1 | 69.8 | 78.1 |
| LGCN (Gao et al., 2018) | 83.3 ± 0.5 | 73.0 ± 0.6 | 79.5 ± 0.2 |
| Diffusion-GCN (Klicpera et al., 2019b) | 83.6 ± 0.2 | 73.4 ± 0.3 | 79.6 ± 0.4 |
| g-U-Nets (Gao & Ji, 2019) | 84.4 ± 0.6 | 73.2 ± 0.5 | 79.6 ± 0.2 |
| GAT (Veličković et al., 2018) | 83.0 ± 0.7 | 72.5 ± 0.7 | 79.0 ± 0.3 |
| DAGN, no LayerNorm (ablation) | 83.8 ± 0.6 | 71.1 ± 0.5 | 79.8 ± 0.2 |
| DAGN, no Diffusion (ablation) | 83.0 ± 0.4 | 71.6 ± 0.4 | 79.3 ± 0.3 |
| DAGN, no Feed-Forward (ablation)^⋄ | 84.9 ± 0.4 | 72.2 ± 0.3 | 80.9 ± 0.3 |
| DAGN | 85.4 ± 0.6 | 73.7 ± 0.5 | 81.4 ± 0.2 |

⋆: based on the implementation in https://github.com/DropEdge/DropEdge; ⋄: replaces the feed-forward layer with the ELU activation used in GAT.
Table 2: Node classification accuracy (%) on ogbn-arxiv.

| Data | GCN (Kipf & Welling, 2016) | GraphSAGE (Hamilton et al., 2017) | Node2vec (Grover & Leskovec, 2016) | MLP | DAGN |
|---|---|---|---|---|---|
| ogbn-arxiv | 71.74 ± 0.29 | 71.49 ± 0.27 | 70.07 ± 0.13 | 55.50 ± 0.23 | 72.76 ± 0.14 |
Results. We report node classification accuracies on the benchmarks. Results are summarized in Tables 1 and 2. DAGN improves over all methods and achieves the new stateoftheart on all datasets.
Ablation study. We report (Table 1) the model performance after removing each component of DAGN (layer normalization, attention diffusion, and the deep-aggregation feed-forward layer) from every layer of DAGN. Note that without all three components the model is equivalent to GAT. We observe that both diffusion and layer normalization play a crucial role in improving node classification performance on all datasets. While layer normalization alone does not benefit GNNs, its use in conjunction with the attention diffusion module significantly boosts DAGN's performance. Since DAGN computes many attention values, layer normalization is crucial for ensuring training stability.
4.2 Task 2: Knowledge Graph Completion
Datasets. We evaluate DAGN on standard benchmark knowledge graphs: WN18RR (Dettmers et al., 2018) and FB15k-237 (Toutanova & Chen, 2015). Refer to the Appendix for statistics of these knowledge graphs.
Baselines. We compare DAGN with state-of-the-art baselines, including (1) translational-distance-based KG embedding models: TransE (Bordes et al., 2013) and its recent extensions RotatE (Sun et al., 2019), OTE (Tang et al., 2020), and RotH (Chami et al., 2020); (2) semantic-matching-based models: DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), QuatE (Zhang et al., 2019) and TuckER (Balazevic et al., 2019); (3) GNN-based models: R-GCN (Schlichtkrull et al., 2018), SACN (Shang et al., 2019) and A2N (Bansal et al., 2019).
Training procedure. We use the standard training procedure used in previous KG embedding models (Balazevic et al., 2019; Dettmers et al., 2018) (Appendix for details). We follow an encoderdecoder framework: The encoder applies the proposed DAGN model to compute the entity embeddings. The decoder then makes link prediction given the embeddings, and existing decoders in prior models can be applied. To show the power of DAGN, we employ the DistMult decoder (Yang et al., 2015), a simple decoder without extra parameters.
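The DistMult decoder scores a triple as a trilinear product of head, relation and tail embeddings. The sketch below uses random placeholder embeddings where the DAGN encoder outputs would normally go, and illustrates the 1-N scoring pattern described above; all names are illustrative.

```python
import numpy as np

def distmult_score(e_h, w_r, e_t):
    # Trilinear DistMult score: sum_d e_h[d] * w_r[d] * e_t[d].
    return np.sum(e_h * w_r * e_t, axis=-1)

rng = np.random.default_rng(5)
E = rng.normal(size=(10, 4))     # placeholder entity embeddings (encoder output)
R = rng.normal(size=(3, 4))      # placeholder relation embeddings

# 1-N scoring: score one (head, relation) pair against all candidate tails at once.
scores = distmult_score(E[2], R[1], E)   # shape (10,)
```

Because DistMult adds no parameters beyond the embeddings, any gain over the DistMult baseline can be attributed to the encoder.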
Table 3: Knowledge graph completion on WN18RR (left five metric columns) and FB15k-237 (right five metric columns).

| Models | MR | MRR | H@1 | H@3 | H@10 | MR | MRR | H@1 | H@3 | H@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| TransE (Bordes et al., 2013) | 3384 | .226 | – | – | .501 | 357 | .294 | – | – | .465 |
| RotatE (Sun et al., 2019) | 3340 | .476 | .428 | .492 | .571 | 177 | .338 | .241 | .375 | .533 |
| OTE (Tang et al., 2020) | – | .491 | .442 | .511 | .583 | – | .361 | .267 | .396 | .550 |
| RotH (Chami et al., 2020) | – | .496 | .449 | .514 | .586 | – | .344 | .246 | .380 | .535 |
| ComplEx (Trouillon et al., 2016) | 5261 | .44 | .41 | .46 | .51 | 339 | .247 | .158 | .275 | .428 |
| QuatE (Zhang et al., 2019) | 2314 | .488 | .438 | .508 | .582 | – | .366 | .271 | .401 | .556 |
| CoKE (Wang et al., 2019b) | – | .475 | .437 | .490 | .552 | – | .361 | .269 | .398 | .547 |
| ConvE (Dettmers et al., 2018) | 4187 | .43 | .40 | .44 | .52 | 244 | .325 | .237 | .356 | .501 |
| DistMult (Yang et al., 2015) | 5110 | .43 | .39 | .44 | .49 | 254 | .241 | .155 | .263 | .419 |
| TuckER (Balazevic et al., 2019) | – | .470 | .443 | .482 | .526 | – | .358 | .266 | .392 | .544 |
| R-GCN (Schlichtkrull et al., 2018) | – | – | – | – | – | – | .249 | .151 | .264 | .417 |
| SACN (Shang et al., 2019) | – | .47 | .43 | .48 | .54 | – | .35 | .26 | .39 | .54 |
| A2N (Bansal et al., 2019) | – | .45 | .42 | .46 | .51 | – | .317 | .232 | .348 | .486 |
| DAGN + DistMult | 2545 | .502 | .459 | .519 | .589 | 138 | .369 | .275 | .409 | .563 |
Evaluation. We use the standard split for the benchmarks, and the standard testing procedure of predicting the tail (head) entity given the head (tail) entity and the relation type. We follow exactly the evaluation used by all previous works, namely Mean Reciprocal Rank (MRR), Mean Rank (MR), and hit rate at K (H@K). See the Appendix for a detailed description of this standard setup.
Results. DAGN achieves a new state-of-the-art in knowledge graph completion on all four metrics (Table 3). DAGN compares favourably to both the most recent shallow embedding methods (QuatE) and deep embedding methods (SACN). Note that with the same decoder (DistMult), DAGN's learned embeddings achieve drastic improvements over the original DistMult embeddings.
4.3 DAGN Model Analysis
Here we present (1) the spectral analysis results, (2) effect of the hyperparameters on DAGN performance, and (3) attention distribution analysis to show the strengths of DAGN.
Spectral Analysis: Why does DAGN work for node classification? We compute the eigenvalues $\lambda_i$ of the graph Laplacian of the attention matrix $A$, and compare them to the eigenvalues $\tilde{\lambda}_i$ of the Laplacian of the diffused matrix $\mathcal{A}$. Figure 3 (a) shows the ratio $\tilde{\lambda}_i / \lambda_i$ on the Cora dataset. Low eigenvalues, corresponding to large-scale structure in the graph, are amplified (up to a factor of 8), while high eigenvalues, corresponding to eigenvectors with noisy information, are suppressed (Klicpera et al., 2019b).
DAGN Model Depth. Here we conduct experiments varying the number of GCN, GAT and DAGN layers among 3, 6, 8, 12, and 24 for node classification on Cora. Results in Figure 3 (b) show that both deep GCN and deep GAT (even with residual connections) suffer from degrading performance, due to the over-smoothing problem (Li et al., 2018; Wang et al., 2019a). In contrast, DAGN consistently achieves the best results even with 24 layers, making deep DAGN models robust and expressive.
Effect of $K$ and $\alpha$. Figures 3 (c) and (d) report the effect of the number of diffusion steps $K$ and the teleport probability $\alpha$ on model performance. We observe a significant increase in performance when considering multi-hop neighbor information ($K > 1$). However, further increasing the number of steps yields diminishing returns. Moreover, we find that the optimal $K$ is correlated with the graph's average shortest-path distance (e.g., 5.27 for Cora). This provides a guideline for choosing the best $K$.
We also observe that the accuracy drops significantly for larger $\alpha$. This is because a small $\alpha$ increases the low-pass effect (Figure 3 (a)). However, an $\alpha$ that is too small results in the model focusing only on large-scale graph structure and ignoring too much high-frequency information.
Attention Distribution. Lastly, we analyze the learned attention scores of GAT and DAGN.
We first define a discrepancy metric $\Delta_v$ over the attention matrix for node $v$ as the deviation of node $v$'s attention scores from the uniform distribution over $v$'s neighborhood (Shanthamallu et al., 2020). $\Delta_v$ gives a measure of how much the learned attention deviates from an uninformative uniform distribution; a large $\Delta_v$ indicates more meaningful attention scores. Figure 4 shows the distribution of the discrepancy metric of the first head's attention matrix in the first layer of DAGN and GAT. Observe that the attention scores learned by DAGN have a much larger discrepancy. This shows that DAGN is more powerful than GAT in distinguishing important from non-important nodes and assigning attention scores accordingly.
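One plausible instantiation of the discrepancy metric can be sketched as follows, assuming an L2 deviation from the uniform distribution over a node's attention support (the exact norm used by Shanthamallu et al. may differ; the function name is illustrative).

```python
import numpy as np

def discrepancy(attn_row):
    """L2 distance between a node's attention row and the uniform
    distribution over its support (its attended neighbors)."""
    support = attn_row > 0
    uniform = support / support.sum()
    return np.linalg.norm(attn_row - uniform)
```

A perfectly uniform attention row scores 0, while a sharply peaked row scores close to 1, so plotting the per-node values gives the kind of distribution shown in Figure 4.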
5 Related Work
Our proposed DAGN belongs to the family of Graph Neural Network (GNN) models (Battaglia et al., 2018; Wu et al., 2020; Kipf & Welling, 2016; Hamilton et al., 2017), while taking advantage of graph attention and diffusion techniques.
Graph Attention Neural Networks (GATs) generalize the attention operation to graph data. GATs allow assigning different importance to nodes of the same neighborhood at the feature aggregation step (Veličković et al., 2018). Based on this framework, different attention-based GNNs have been proposed, including GaAN (Zhang et al., 2018), AGNN (Thekumparampil et al., 2018), and GeniePath (Liu et al., 2019b). However, these models only consider direct neighbors in each layer of feature aggregation, and suffer from over-smoothing when made deep (Wang et al., 2019a).
Diffusion based Graph Neural Networks. Recently, Graph Diffusion Convolution (GDC) (Klicpera et al., 2019b, a) proposed to aggregate information from a larger (multi-hop) neighborhood at each layer, by sparsifying a generalized form of graph diffusion. This idea was also explored in (Liao et al., 2019; Luan et al., 2019; Xhonneux et al., 2019) for multi-scale deep Graph Convolutional Networks. However, these methods do not incorporate attention mechanisms, which prove to yield significant gains in model performance, and do not make use of edge embeddings (e.g., in knowledge graphs) (Klicpera et al., 2019b). Our approach defines a novel multi-hop context-dependent self-attention GNN which resolves the over-smoothing issue of GAT architectures (Wang et al., 2019a).
6 Conclusion
We proposed Direct Multi-hop Attention based Graph Neural Network (DAGN), which brings together the benefits of graph attention and diffusion techniques in a single layer through attention diffusion, layer normalization and deep aggregation. DAGN enables context-dependent attention between any pair of nodes in the graph, enhances large-scale structural information, and learns a more informative attention distribution. DAGN improves over all state-of-the-art methods on the standard tasks of node classification and knowledge graph completion.
Appendix A Attention Diffusion Approximation Proposition
As mentioned in Section 2.2, we use the following equation and proposition to efficiently approximate the attention-diffused feature aggregation $\mathcal{A}^{(l)} H^{(l)}$.

$$Z^{(k+1)} = (1-\alpha)\, A^{(l)} Z^{(k)} + \alpha\, Z^{(0)}, \qquad Z^{(0)} = H^{(l)} \qquad (12)$$
Proposition 4.
$\lim_{K \to \infty} Z^{(K)} = \mathcal{A}^{(l)} H^{(l)}$.
Proof.
Let $K$ be the total number of iterations; we approximate $\mathcal{A}^{(l)} H^{(l)}$ by $Z^{(K)}$. After the $K$-th iteration, we get

$$Z^{(K)} = \alpha \sum_{k=0}^{K-1} (1-\alpha)^k \left(A^{(l)}\right)^k Z^{(0)} + (1-\alpha)^K \left(A^{(l)}\right)^K Z^{(0)} \qquad (13)$$

The term $(1-\alpha)^K \left(A^{(l)}\right)^K Z^{(0)}$ converges to 0 as $K \to \infty$ when $0 < \alpha \le 1$, since the entries of $\left(A^{(l)}\right)^K$ are bounded (its rows sum to 1). Thus $\lim_{K \to \infty} Z^{(K)} = \alpha \sum_{k=0}^{\infty} (1-\alpha)^k \left(A^{(l)}\right)^k H^{(l)} = \mathcal{A}^{(l)} H^{(l)}$. ∎
Appendix B Connection to Transformer
Given a sequence of tokens, the Transformer architecture makes use of multi-head attention between all pairs of tokens, and can be viewed as performing message passing on a fully connected graph over all tokens. A naïve application of the Transformer to graphs would require computing all pairwise attention values. Such an approach, however, would not make effective use of the graph structure, and could not scale to large graphs. In contrast, Graph Attention Network (Veličković et al., 2018) leverages the graph structure and only computes attention values and performs message passing between direct neighbors. However, it has a limited receptive field (restricted to the one-hop neighborhood) and a fixed attention score (Figure 1) that is independent of the context for prediction.
The Transformer consists of a self-attention layer followed by a feed-forward layer. We can write the self-attention layer of the Transformer as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \qquad (14)$$

where $Q = HW_Q$, $K = HW_K$, $V = HW_V$, and $d_k$ is the dimension of the keys. The softmax factor can be viewed as an attention matrix computed by scaled dot-product attention over a complete graph.
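Eq. 14 in NumPy (an illustrative sketch; single head, no masking): the assertion at the end highlights the point of this section, namely that the softmax produces a strictly positive attention matrix, i.e., attention over a complete graph of tokens.

```python
import numpy as np

def self_attention(H, W_q, W_k, W_v):
    """Scaled dot-product self-attention; returns output and attention matrix."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    S = Q @ K.T / np.sqrt(K.shape[-1])          # scaled dot-product scores
    S = S - S.max(axis=-1, keepdims=True)       # stabilized row-wise softmax
    A = np.exp(S)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = rng = np.random.default_rng(6)
H = rng.normal(size=(5, 8))                     # 5 tokens, dimension 8
W = [rng.normal(size=(8, 8)) for _ in range(3)]
out, A = self_attention(H, *W)
assert np.all(A > 0)                            # every token attends to every token
```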
In the past, graph structure has usually been encoded implicitly by special position embeddings (Zhang et al., 2020) or carefully designed attention computations (Shiv & Quirk, 2019; Wang et al., 2019c; Nguyen et al., 2020). However, none of these methods can compute attention between any pair of nodes at each layer.
In contrast, DAGN essentially places a prior over the attention values via Personalized PageRank, allowing it to compute attention between any pair of nodes via attention diffusion without sacrificing scalability. In particular, DAGN can handle large graphs, as real-world graphs are usually quite sparse and their diameter is usually much smaller than the graph size, resulting in very efficient attention diffusion computation.
Appendix C Spectral Analysis Background and Proof for Proposition 2
Graph Fourier Transform. Suppose $A$ represents the attention matrix of graph $G$, with $A_{i,j} \ge 0$ and $\sum_j A_{i,j} = 1$. Let $A = U \Lambda U^{-1}$ be the Jordan decomposition of the graph attention matrix $A$, where $U$ is the square matrix whose $i$-th column is the eigenvector $u_i$ of $A$, and $\Lambda$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues, i.e., $\Lambda_{ii} = \mu_i$. Then, for a given signal $x$, its graph Fourier transform (Sandryhaila & Moura, 2013a) is defined as

$$\hat{x} = U^{-1} x \qquad (15)$$

where $U^{-1}$ is denoted as the graph Fourier transform matrix. The inverse graph Fourier transform is defined as $x = U \hat{x}$, which reconstructs the signal from its spectrum. Based on the graph Fourier transform, we can define a graph convolution operation on $G$ as $x *_G y = U\left((U^{-1} x) \odot (U^{-1} y)\right)$, where $\odot$ denotes the element-wise product.
Graph Attention Diffusion Acts as a Polynomial Graph Filter. A graph filter (Tremblay et al., 2018; Sandryhaila & Moura, 2013a, b) acts on a signal $x$ as $h(A)\,x = U\,h(\Lambda)\,U^{-1}x$, where $h(\Lambda)$ applies $h$ to each diagonal entry of $\Lambda$. A common choice for $h$ in the literature is a polynomial filter of order $K$, since it is linear and shift-invariant (Sandryhaila & Moura, 2013a, b):

$$h(A) = \sum_{k=0}^{K} \theta_k A^k \qquad (16)$$

Comparing this with the graph attention diffusion $\mathcal{A} = \sum_{i=0}^{\infty} \theta_i A^i$, if we set $\theta_k = \alpha(1-\alpha)^k$ and let $K \to \infty$, we can view graph attention diffusion as a polynomial filter.
Spectral Analysis. The eigenvectors of the power matrix $A^k$ are the same as those of $A$, since $A^k = (U \Lambda U^{-1})^k = U \Lambda^k U^{-1}$. By extension, $\mathcal{A} = \sum_{i=0}^{\infty} \theta_i A^i = U\left(\sum_{i=0}^{\infty} \theta_i \Lambda^i\right) U^{-1}$, so the summation of the power series of $A$ has the same eigenvectors as $A$. Therefore, by the properties of eigenvectors and Eq. 3, we obtain:
Proposition 5.
The sets of eigenvectors of $A$ and $\mathcal{A}$ are the same.
Lemma 1.
Let $\lambda_i$ and $\tilde{\lambda}_i$ be the $i$-th eigenvalues of $L = I - A$ and $\tilde{L} = I - \mathcal{A}$, respectively. Then, we have

$$\tilde{\lambda}_i = \frac{(1-\alpha)\lambda_i}{\alpha + (1-\alpha)\lambda_i} \qquad (17)$$
Proof.
Since each row of the attention matrix $A$ sums to 1, the normalized graph Laplacian of $A$ is $L = I - A$. Let $\mu_i$ be the eigenvalues of $A$; then the eigenvalues of $L$ are $\lambda_i = 1 - \mu_i$. For every eigenvalue of the normalized graph Laplacian we have $0 \le \lambda_i \le 2$ (Mohar et al., 1991), and thus $|\mu_i| = |1 - \lambda_i| \le 1$. By Proposition 5, the eigenvalues of $\mathcal{A} = \sum_{k=0}^{\infty} \alpha (1-\alpha)^k A^k$ are $\sum_{k=0}^{\infty} \alpha (1-\alpha)^k \mu_i^k = \frac{\alpha}{1 - (1-\alpha)\mu_i}$, where the geometric series converges because $|(1-\alpha)\mu_i| \le 1-\alpha < 1$. Hence the eigenvalues of $\tilde{L} = I - \mathcal{A}$ are $\tilde{\lambda}_i = 1 - \frac{\alpha}{1 - (1-\alpha)(1-\lambda_i)} = \frac{(1-\alpha)\lambda_i}{\alpha + (1-\alpha)\lambda_i}$. ∎
Section 3 further uses the eigenvalues $\lambda_i$ and $\tilde{\lambda}_i$ of the Laplacian matrices $L$ and $\tilde{L}$, respectively. They satisfy $\tilde{\lambda}_i = \frac{(1-\alpha)\lambda_i}{\alpha + (1-\alpha)\lambda_i}$ and $0 \le \lambda_i \le 2$ (proved by (Ng et al., 2002)).
Appendix D Graph Learning Tasks
Node classification and knowledge graph link prediction are two representative and common tasks in graph learning. We first define the task of node classification:
Definition 1.
Node classification. Suppose that $X \in \mathbb{R}^{|V| \times d}$ represents the node input features, where each row $x_v$ is a $d$-dimensional vector of attribute values of node $v$ ($v \in V$). $V_L \subset V$ consists of a set of labeled nodes whose labels are from the label set $\mathcal{Y}$. Node classification is the task of learning a mapping function $f: V \to \mathcal{Y}$ that predicts the labels of the remaining unlabeled nodes $V \setminus V_L$.
A knowledge graph (KG) is a heterogeneous graph describing entities and their typed relations to each other. A KG is defined by a set of entities (nodes) $\mathcal{E}$ and a set of relations (edges) $\mathcal{R}$, where a triple $(h, r, t)$ connects the nodes $h$ and $t$ via the relation $r$. We then define the task of knowledge graph completion:
Definition 2.
KG completion refers to the task of predicting an entity that has a specific relation with another given entity (Wang et al., 2017), i.e., predicting the head $h$ given a pair of relation and tail entity $(r, t)$, or predicting the tail $t$ given a pair of head entity and relation $(h, r)$.
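For illustration, a tail-prediction query can be sketched with a DistMult-style score $\langle e_h, w_r, e_t \rangle$ (DistMult is referenced later in Appendix F; the tiny random embedding tables and the query indices here are hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, dim = 5, 2, 8

# Hypothetical embedding tables for a tiny KG.
E = rng.normal(size=(n_ent, dim))   # entity embeddings
R = rng.normal(size=(n_rel, dim))   # relation embeddings

def score(h, r, t):
    """DistMult-style score <e_h, w_r, e_t> for a triple (h, r, t)."""
    return float(np.sum(E[h] * R[r] * E[t]))

# Tail prediction: given (h, r), rank all candidate tail entities by score.
h, r = 0, 1
scores = np.array([score(h, r, t) for t in range(n_ent)])
ranking = np.argsort(-scores)       # best candidate first
```

Head prediction works symmetrically by fixing $(r, t)$ and scoring every candidate head.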
Appendix E Dataset Statistics
Node classification. We show the dataset statistics of the node classification benchmark datasets in Table 4.
Name  Nodes  Edges  Classes  Features  Train/Dev/Test 
Cora  2,708  5,429  7  1,433  140/500/1,000 
Citeseer  3,327  4,732  6  3,703  120/500/1,000 
Pubmed  19,717  88,651  3  500  60/500/1,000 
ogbn-arxiv^{†}  169,343  1,166,243  40  128  90,941/29,799/48,603 

The data is available at https://ogb.stanford.edu/docs/nodeprop/.
Knowledge Graph Link Prediction. We show the dataset statistics of the knowledge graph benchmark datasets in Table 5.
Dataset  #Entities  #Relations  #Train  #Dev  #Test  #Avg. Degree 
WN18RR  40,943  11  86,835  3,034  3,134  2.19 
FB15k-237  14,541  237  272,115  17,535  20,466  18.17 
Appendix F Knowledge Graph Training and Evaluation
Training. The standard knowledge graph completion training procedure is as follows. For each triple $(h, r, t)$ we add the reverse-direction triple $(t, r^{-1}, h)$ to construct an undirected knowledge graph. Following the training procedure introduced in (Balazevic et al., 2019; Dettmers et al., 2018), we use 1-N scoring, i.e., we simultaneously score the entity-relation pairs $(h, r)$ and $(t, r^{-1})$ against all entities. We use the KL divergence loss with label smoothing as the optimization objective.
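A minimal sketch of the loss for one $(h, r)$ query under 1-N scoring with label smoothing; the entity count, the set of true tails, the random logits, and the smoothing value $\epsilon = 0.1$ are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_ent = 6
logits = rng.normal(size=n_ent)          # 1-N scoring: scores of (h, r) against all entities

# Multi-label target: suppose entities 2 and 4 are true tails for (h, r) in the training graph.
target = np.zeros(n_ent)
target[[2, 4]] = 1.0

# Label smoothing: move eps of the probability mass toward the uniform distribution.
eps = 0.1
smoothed = (1 - eps) * target / target.sum() + eps / n_ent

# KL-divergence loss between the smoothed target distribution and the model distribution.
log_probs = logits - np.log(np.sum(np.exp(logits)))   # log-softmax over all entities
loss = float(np.sum(smoothed * (np.log(smoothed) - log_probs)))
assert loss >= 0.0                       # KL divergence is non-negative
```

Scoring one query against all $N$ entities at once is what makes 1-N training far cheaper than sampling negative triples one at a time.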
Inference-time procedure. For each test triple $(h, r, t)$, the head $h$ is removed and replaced in turn by each entity appearing in the KG. Afterward, we remove from these corrupted triples all that appear in the training, validation, or test set. Finally, we score the remaining corrupted triples with the link prediction model and sort them in descending order; the rank of the correct triple is recorded. This whole procedure is repeated while corrupting the tail $t$ instead of the head $h$, and averaged metrics are reported. We report the mean reciprocal rank (MRR), mean rank (MR), and the proportion of correct triples ranked in the top $k$ (Hits@$k$) for $k$ = 1, 3, and 10. Lower values of MR and larger values of MRR and Hits@$k$ indicate better performance.
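The filtered evaluation protocol above can be sketched as follows; the candidate scores, the gold entity, and the set of known-true entities are hypothetical toy values:

```python
import numpy as np

def filtered_rank(scores, true_idx, known_idxs):
    """Rank of the true entity after masking other known-true entities (filtered setting)."""
    s = scores.copy()
    mask = [i for i in known_idxs if i != true_idx]
    s[mask] = -np.inf                    # drop corrupted triples seen in train/dev/test
    # rank = 1 + number of candidates scoring strictly higher than the true entity
    return 1 + int(np.sum(s > s[true_idx]))

# Toy example: 5 candidate entities, entity 3 is the gold answer, entity 1 is another known true.
scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
rank = filtered_rank(scores, true_idx=3, known_idxs=[1, 3])

ranks = np.array([rank])                 # in practice, collected over all test triples
mrr = float(np.mean(1.0 / ranks))
mr = float(np.mean(ranks))
hits_at_1 = float(np.mean(ranks <= 1))
```

Masking entity 1 is what distinguishes the filtered from the raw setting: without it, a correct but different answer would unfairly push the gold entity down the ranking.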
Experimental Setup. We use the multi-layer DAGN as the encoder for both FB15k-237 and WN18RR. We randomly initialize the entity and relation embeddings as the input of the encoder, and set the dimensionality of the initialized entity/relation vectors to 100, as used in DistMult (Yang et al., 2015). We select the other DAGN model hyperparameters, including the number of layers, hidden dimension, number of heads, top-$k$, learning rate, number of power iteration steps, teleport probability, and dropout ratios (see the settings of these parameters in the Appendix), by a random search during training.
Appendix G Hyperparameter Settings for Node Classification
The best models are selected according to the classification accuracy on the validation set by early stopping with window size 200.
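The early-stopping rule can be sketched as a patience window over validation accuracy; the window of 5 and the simulated accuracy curve below are toy values standing in for the paper's window size of 200:

```python
# Early stopping on validation accuracy with a patience window: training stops once
# no improvement over the best score has been seen within `window` epochs.
def train_with_early_stopping(val_accuracies, window):
    best_acc, best_epoch = -1.0, -1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= window:
            break                        # no improvement within the window: stop
    return best_epoch, best_acc

# Simulated validation curve: improves for a few epochs, then plateaus.
curve = [0.60, 0.65, 0.71, 0.70, 0.69, 0.70, 0.69, 0.68, 0.70, 0.69]
best_epoch, best_acc = train_with_early_stopping(curve, window=5)
```

The model checkpoint from `best_epoch` is the one kept for evaluation, matching the "best model selected on the validation set" criterion above.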
For each dataset, the hyperparameters are determined by a random search (Bergstra & Bengio, 2012), including the learning rate, number of power iteration steps, teleport probability, and dropout ratios. The hyperparameter search space is shown in Tables 6 (for Cora, Citeseer, and Pubmed) and 7 (for ogbn-arxiv).
Hyperparameters  Search Space  Type 
Hidden Dimension  512  Fixed^{⊢} 
Head Number  8  Fixed 
Layer Number  6  Fixed 
Learning rate  []  Range^{⋆} 
Number of power iteration steps  []  Choice^{⋄} 
Teleport probability  []  Range 
Dropout (attention, feature)  []  Range 
Weight Decay  []  Range 
Optimizer  Adam  Fixed 

^{⊢} Fixed: a constant value;
^{⋆} Range: a value range with a lower bound and an upper bound;
^{⋄} Choice: a set of values.
Hyperparameters  Search Space  Type 
Hidden Dimension  128  Fixed 
Head Number  8  Fixed 
Layer Number  2  Fixed 
Learning rate  []  Range 
Number of power iteration steps  []  Choice 
Teleport probability  []  Range 
Dropout (attention, feature)  []  Range 
Weight Decay  []  Range 
Optimizer  Adam  Fixed 
Appendix H Hyperparameter Setting for Link Prediction on KG
For each KG, the hyperparameters are determined by a random search (Bergstra & Bengio, 2012), including the number of layers, learning rate, hidden dimension, batch size, number of heads, number of power iteration steps, teleport probability, and dropout ratios. The hyperparameter search space is shown in Table 8.
Hyperparameters  Search Space  Type 
Initial Entity/Relation Dimension  100  Fixed 
Number of layers  []  Choice 
Learning rate  []  Range 
Hidden Dimension  []  Choice 
Batch size  []  Choice 
Head Number  []  Choice 
Number of power iteration steps  []  Choice 
Teleport probability  []  Range 
Dropout (attention, feature)  []  Range 
Weight Decay  []  Range 
Optimizer  Adam  Fixed 
Footnotes
 Node $u$ is reachable from node $v$ iff there exists a path that starts at $v$ and ends at $u$.
 Obtained by the attention definition and Eq. 3.
 The eigenvalues of $A$ and $\mathcal{A}$ correspond to the same eigenvectors, as shown in Proposition 5 in the Appendix.
 All datasets used are public, and the code will be released at the time of publication.
 Please see the definitions of these two tasks in Appendix.
 All nodes are connected with each other.
References
 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
 Ivana Balazevic, Carl Allen, and Timothy Hospedales. TuckER: Tensor factorization for knowledge graph completion. In EMNLP, 2019.
 Trapit Bansal, Da-Cheng Juan, Sujith Ravi, and Andrew McCallum. A2N: Attending to neighbors for knowledge graph inference. In ACL, 2019.
 Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
 James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. JMLR, pp. 281–305, 2012.
 Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NeurIPS, 2013.
 Ines Chami, Adva Wolf, Da-Cheng Juan, Frederic Sala, Sujith Ravi, and Christopher Ré. Low-dimensional hyperbolic knowledge graph embeddings. In ACL, 2020.
 Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, 2016.
 Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2D knowledge graph embeddings. In AAAI, 2018.
 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
 Hongyang Gao and Shuiwang Ji. Graph U-Nets. In ICML, 2019.
 Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Large-scale learnable graph convolutional networks. In KDD, 2018.
 Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
 Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.
 Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2016.
 Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized PageRank. In ICLR, 2019a.
 Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In NeurIPS, 2019b.
 Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR, 2019.
 Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.
 Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard S. Zemel. LanczosNet: Multi-scale deep graph convolutional networks. In ICLR, 2019.
 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019a.
 Ziqi Liu, Chaochao Chen, Longfei Li, Jun Zhou, Xiaolong Li, Le Song, and Yuan Qi. GeniePath: Graph neural networks with adaptive receptive paths. In AAAI, 2019b.
 Peter Lofgren. Efficient Algorithms for Personalized PageRank. PhD thesis, Stanford University, 2015.
 Sitao Luan, Mingde Zhao, Xiao-Wen Chang, and Doina Precup. Break the ceiling: Stronger multi-scale deep graph convolutional networks. In NeurIPS, 2019.
 Bojan Mohar, Y. Alavi, G. Chartrand, and O. R. Oellermann. The Laplacian spectrum of graphs. Graph Theory, Combinatorics, and Applications, pp. 871–898, 1991.
 Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NeurIPS, 2002.
 Xuan-Phi Nguyen, Shafiq Joty, Steven C. H. Hoi, and Richard Socher. Tree-structured attention with hierarchical accumulation. In ICLR, 2020.
 Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In ICLR, 2020.
 Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
 Aliaksei Sandryhaila and José M. F. Moura. Discrete signal processing on graphs: Graph Fourier transform. In ICASSP, 2013a.
 Aliaksei Sandryhaila and José M. F. Moura. Discrete signal processing on graphs. TSP, pp. 1644–1656, 2013b.
 Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In ESWC, 2018.
 Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, pp. 93–106, 2008.
 Chao Shang, Yun Tang, Jing Huang, Jinbo Bi, Xiaodong He, and Bowen Zhou. End-to-end structure-aware convolutional networks for knowledge base completion. In AAAI, 2019.
 Uday Shankar Shanthamallu, Jayaraman J. Thiagarajan, and Andreas Spanias. A regularized attention mechanism for graph attention networks. In ICASSP, 2020.
 Vighnesh Shiv and Chris Quirk. Novel positional encodings to enable tree-based transformers. In NeurIPS, 2019.
 Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In ICLR, 2019.
 Yun Tang, Jing Huang, Guangtao Wang, Xiaodong He, and Bowen Zhou. Orthogonal relation transforms with graph context modeling for knowledge graph embedding. In ACL, 2020.
 Kiran K. Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735, 2018.
 Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. In CVSC-WS, 2015.
 Nicolas Tremblay, Paulo Gonçalves, and Pierre Borgnat. Design of graph filters and filterbanks. In Cooperative and Graph Signal Processing, pp. 299–324. Elsevier, 2018.
 Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, 2016.
 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
 Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
 Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Improving graph attention networks with large margin-based constraints. In NeurIPS-WS, 2019a.
 Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. TKDE, pp. 2724–2743, 2017.
 Quan Wang, Pingping Huang, Haifeng Wang, Songtai Dai, Wenbin Jiang, Jing Liu, Yajuan Lyu, Yong Zhu, and Hua Wu. CoKE: Contextualized knowledge graph embedding. arXiv preprint arXiv:1911.02168, 2019b.
 Yaushian Wang, Hung-Yi Lee, and Yun-Nung Chen. Tree Transformer: Integrating tree structures into self-attention. In EMNLP, 2019c.
 Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
 Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
 Louis-Pascal A. C. Xhonneux, Meng Qu, and Jian Tang. Continuous graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
 Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In ICML, 2018.
 Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In ICLR, 2015.
 Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. GaAN: Gated attention networks for learning on large and spatiotemporal graphs. In UAI, 2018.
 Jiawei Zhang, Haopeng Zhang, Li Sun, and Congying Xia. Graph-BERT: Only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140, 2020.
 Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. Quaternion knowledge graph embedding. In NeurIPS, 2019.
 Chenyi Zhuang and Qiang Ma. Dual graph convolutional networks for graph-based semi-supervised classification. In WWW, 2018.