Diffusion Improves Graph Learning

Johannes Klicpera, Stefan Weißenberger, Stephan Günnemann
Technical University of Munich
{klicpera,stefan.weissenberger,guennemann}@in.tum.de
Abstract

Graph convolution is the core of most Graph Neural Networks (GNNs) and is usually approximated by message passing between direct (one-hop) neighbors. In this work, we remove the restriction of using only the direct neighbors by introducing a powerful, yet spatially localized graph convolution: Graph diffusion convolution (GDC). GDC leverages generalized graph diffusion, examples of which are the heat kernel and personalized PageRank. It alleviates the problem of noisy and often arbitrarily defined edges in real graphs. We show that GDC is closely related to spectral-based models and thus combines the strengths of both spatial (message passing) and spectral methods. We demonstrate that replacing message passing with graph diffusion convolution consistently leads to significant performance improvements across a wide range of models on both supervised and unsupervised tasks and a variety of datasets. Furthermore, GDC is not limited to GNNs but can trivially be combined with any graph-based model or algorithm (e.g. spectral clustering) without requiring any changes to the latter or affecting its computational complexity. Our implementation is available online at https://www.kdd.in.tum.de/gdc.

1 Introduction

When people started using graphs for evaluating chess tournaments in the middle of the 19th century they only considered each player’s direct opponents, i.e. their first-hop neighbors. Only later was the analysis extended to recursively consider higher-order relationships via $A^2$, $A^3$, etc. and finally generalized to consider all exponents at once, using the adjacency matrix’s dominant eigenvector (Landau, 1895; Vigna, 2016). The field of Graph Neural Networks (GNNs) is currently in a similar state. Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017), also referred to as Message Passing Neural Networks (MPNNs) (Gilmer et al., 2017), are the prevalent approach in this field, but they only pass messages between neighboring nodes in each layer. These messages are then aggregated at each node to form the embedding for the next layer. While MPNNs do leverage higher-order neighborhoods in deeper layers, limiting each layer’s messages to one-hop neighbors seems arbitrary. Edges in real graphs are often noisy or defined using an arbitrary threshold (Tang et al., 2018), so we can clearly improve upon this approach.

Since MPNNs only use the immediate neighborhood information, they are often referred to as spatial methods. Spectral-based models, on the other hand, do not just rely on first-hop neighbors and capture more complex graph properties (Defferrard et al., 2016). However, while being theoretically more elegant, these methods are routinely outperformed by MPNNs on graph-related tasks (Kipf & Welling, 2017; Veličković et al., 2018; Xu et al., 2019b) and do not generalize to previously unseen graphs. This shows that message passing is a powerful framework worth extending. To reconcile these two separate approaches and combine their strengths we propose a novel technique of performing message passing inspired by spectral methods: Graph diffusion convolution (GDC). Instead of aggregating information only from the first-hop neighbors, GDC aggregates information from a larger neighborhood. This neighborhood is constructed via a new graph generated by sparsifying a generalized form of graph diffusion. We show how graph diffusion is expressed as an equivalent polynomial filter and how GDC is closely related to spectral-based models while addressing their shortcomings. GDC is spatially localized, scalable, can be combined with message passing, and generalizes to unseen graphs. Furthermore, since GDC generates a new sparse graph it is not limited to MPNNs and can trivially be combined with any existing graph-based model or algorithm in a plug-and-play manner, i.e. without requiring any change to the model or affecting its computational complexity. We show that GDC consistently improves performance across a wide range of models on both supervised and unsupervised tasks and various homophilic datasets. In summary, this paper’s core contributions are:

  1. Proposing graph diffusion convolution (GDC), a more powerful and general, yet spatially localized alternative to message passing that uses a sparsified generalized form of graph diffusion. GDC is not limited to GNNs and can be combined with any graph-based model or algorithm.

  2. Analyzing the spectral properties of GDC and graph diffusion. We show how graph diffusion is expressed as an equivalent polynomial filter and analyze GDC’s effect on the graph spectrum.

  3. Comparing and evaluating several specific variants of GDC and demonstrating its wide applicability to supervised and unsupervised learning on graphs.

2 Generalized graph diffusion

We consider an undirected graph $G = (\mathcal{V}, \mathcal{E})$ with node set $\mathcal{V}$ and edge set $\mathcal{E}$. We denote with $N = |\mathcal{V}|$ the number of nodes and $A \in \mathbb{R}^{N \times N}$ the adjacency matrix. We define generalized graph diffusion via the diffusion matrix

$$S = \sum_{k=0}^{\infty} \theta_k T^k, \qquad (1)$$

with the weighting coefficients $\theta_k$ and the generalized transition matrix $T$. The choice of $\theta_k$ and $T$ must at least ensure that Eq. 1 converges. In this work we will consider somewhat stricter conditions and require that $\sum_{k=0}^{\infty} \theta_k = 1$, $\theta_k \in [0, 1]$, and that the eigenvalues of $T$ are bounded by $\lambda_i \in [0, 1]$, which together are sufficient to guarantee convergence. Note that regular graph diffusion commonly requires $T$ to be column- or row-stochastic.

Transition matrix. Examples for $T$ in an undirected graph include the random walk transition matrix $T_{rw} = A D^{-1}$ and the symmetric transition matrix $T_{sym} = D^{-1/2} A D^{-1/2}$, where the degree matrix $D$ is the diagonal matrix of node degrees, i.e. $D_{ii} = \sum_{j=1}^{N} A_{ij}$. Note that in our definition $T_{rw}$ is column-stochastic. We furthermore adjust the random walk by adding (weighted) self-loops to the original adjacency matrix, i.e. use $\tilde{A} = A + w_{\text{loop}} I_N$, with the self-loop weight $w_{\text{loop}} \in \mathbb{R}^{+}$. This is equivalent to performing a lazy random walk with a probability of staying at node $i$ of $p_{\text{stay},i} = w_{\text{loop}} / (D_{ii} + w_{\text{loop}})$.
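For illustration, the following minimal NumPy sketch computes these transition matrices with (weighted) self-loops added; the function and variable names are ours and not part of the reference implementation.

```python
import numpy as np

def transition_matrices(A, w_loop=1.0):
    """Random-walk and symmetric transition matrices of a dense, symmetric,
    non-negative adjacency matrix A, with weighted self-loops added.
    Illustrative sketch only."""
    A_tilde = A + w_loop * np.eye(A.shape[0])   # A + w_loop * I_N
    d = A_tilde.sum(axis=0)                     # node degrees incl. self-loops
    T_rw = A_tilde / d                          # column-stochastic A_tilde D^{-1}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    T_sym = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # D^{-1/2} A_tilde D^{-1/2}
    return T_rw, T_sym
```

Setting the self-loop weight to zero recovers the plain $T_{rw}$ and $T_{sym}$ defined above.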

Special cases. Two popular examples of graph diffusion are personalized PageRank (PPR) (Page et al., 1998) and the heat kernel (Kondor & Lafferty, 2002). PPR corresponds to choosing $T = T_{rw}$ and $\theta_k = \alpha (1 - \alpha)^k$, with teleport probability $\alpha \in (0, 1)$ (Chung, 2007). The heat kernel uses $T = T_{rw}$ and $\theta_k = e^{-t} \frac{t^k}{k!}$, with the diffusion time $t$ (Chung, 2007). Another special case of generalized graph diffusion is the approximated graph convolution introduced by Kipf & Welling (2017), which translates to $\theta_1 = 1$ and $\theta_k = 0$ for $k \neq 1$ and uses $T = \tilde{T}_{sym}$ with $w_{\text{loop}} = 1$.

Weighting coefficients. We compute the series defined by Eq. 1 either in closed form, if possible, or by restricting the sum to a finite number $K$ of terms. Both the coefficients defined by PPR and the heat kernel give a closed-form solution for this series that we found to perform well for the tasks considered. Note that we are not restricted to $T_{rw}$ and can use any generalized transition matrix along with the coefficients $\theta_k^{\text{PPR}}$ or $\theta_k^{\text{HK}}$ and the series still converges. We can furthermore choose $\theta_k$ by repurposing the graph-specific coefficients obtained by methods that optimize coefficients analogous to $\theta_k$ as part of their training process. We investigated this approach using label propagation (Chen et al., 2013; Berberidis et al., 2019) and node embedding models (Abu-El-Haija et al., 2018). However, we found that the simple coefficients defined by PPR or the heat kernel perform better than those learned by these models (see Fig. 7 in Sec. 6).
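The closed forms for PPR and the heat kernel, as well as a truncated sum for arbitrary coefficients, might look as follows. This is a sketch only; the function name and default hyperparameter values are our own illustrative choices.

```python
import numpy as np
from scipy.linalg import expm

def diffusion_matrix(T, method="ppr", alpha=0.15, t=5.0, theta=None):
    """Generalized graph diffusion S = sum_k theta_k T^k (illustrative sketch).
    PPR and the heat kernel have closed forms; arbitrary coefficients theta
    are handled by truncating the series after len(theta) terms."""
    N = T.shape[0]
    if method == "ppr":                   # theta_k = alpha * (1 - alpha)^k
        return alpha * np.linalg.inv(np.eye(N) - (1.0 - alpha) * T)
    if method == "heat":                  # theta_k = e^{-t} * t^k / k!
        return expm(-t * (np.eye(N) - T))
    S, T_k = np.zeros_like(T), np.eye(N)  # truncated generalized diffusion
    for theta_k in theta:
        S += theta_k * T_k
        T_k = T_k @ T
    return S
```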

3 Graph diffusion convolution

Figure 1: Illustration of graph diffusion convolution (GDC). We transform a graph via graph diffusion and sparsification into a new graph and run the given model on this graph instead. (Panels: graph diffusion → density defines edges → sparsify edges → new graph.)

Essentially, graph diffusion convolution (GDC) exchanges the normal adjacency matrix $A$ with a sparsified version $\tilde{S}$ of the generalized graph diffusion matrix $S$, as illustrated by Fig. 1. This matrix defines a weighted and directed graph, and the model we aim to augment is applied to this graph instead. We found that the calculated edge weights are beneficial for the tasks considered. However, GDC even works when the weights are ignored after sparsification. This enables us to use GDC with models that only support unweighted edges, such as the degree-corrected stochastic block model (DCSBM). If required, we make the graph undirected by using $(\tilde{S} + \tilde{S}^T)/2$, e.g. for spectral clustering. With these adjustments GDC is applicable to any graph-based model or algorithm.
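A minimal sketch of these plug-and-play adjustments, assuming a (dense) sparsified diffusion matrix S_tilde; the helper names are hypothetical.

```python
import numpy as np

def to_undirected(S_tilde):
    """Symmetrize the sparsified diffusion matrix, e.g. for spectral clustering."""
    return (S_tilde + S_tilde.T) / 2.0

def to_unweighted(S_tilde):
    """Drop the edge weights, e.g. for models such as DCSBM that only
    support unweighted graphs."""
    return (S_tilde > 0).astype(float)
```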

Intuition. The general intuition behind GDC is that graph diffusion smooths out the neighborhood over the graph, acting as a kind of denoising filter similar to Gaussian filters on images. This helps with graph learning since both features and edges in real graphs are often noisy. Previous works also highlighted the effectiveness of graph denoising. Berberidis & Giannakis (2018) showed that PPR is able to reconstruct the underlying probability matrix of a sampled stochastic block model (SBM) graph. Kloumann et al. (2017) and Ragain (2017) showed that PPR is optimal in recovering the SBM and DCSBM clusters in the space of landing probabilities. These results confirm the intuition that graph diffusion-based smoothing indeed recovers meaningful neighborhoods from noisy graphs.

Sparsification. Most graph diffusions result in a dense matrix $S$. This happens even if we do not sum to $k \to \infty$ in Eq. 1, due to the "four/six degrees of separation" in real-world graphs (Backstrom et al., 2012). However, the values in $S$ represent the influence between all pairs of nodes, which is typically highly localized (Nassar et al., 2015). This is a major advantage over spectral-based models since the spectral domain does not provide any notion of locality. Spatial localization allows us to simply truncate small values of $S$ and recover sparsity, resulting in the matrix $\tilde{S}$. In this work we consider two options for sparsification: 1. top-$k$: use the $k$ entries with the highest mass per column, 2. threshold $\epsilon$: set entries below $\epsilon$ to zero. Sparsification would still require calculating a dense matrix $S$ during preprocessing. However, many popular graph diffusions can be approximated efficiently and accurately in linear time and space. Most importantly, there are fast approximations for both PPR (Andersen et al., 2006; Wei et al., 2018) and the heat kernel (Kloster & Gleich, 2014), with which GDC achieves a linear runtime $\mathcal{O}(N)$. Furthermore, top-$k$ truncation generates a regular graph, which is amenable to batching methods and solves problems related to widely varying node degrees (Decelle et al., 2011). Empirically, we even found that sparsification slightly improves prediction accuracy (see Fig. 7 in Sec. 6). After sparsification we calculate the (symmetric or random walk) transition matrix on the resulting graph, e.g. via $T^{\text{sym}}_{\tilde{S}} = D_{\tilde{S}}^{-1/2} \tilde{S} D_{\tilde{S}}^{-1/2}$.
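A sketch of the two sparsification options and the renormalization step, using dense matrices for clarity (the efficient approximations cited above avoid materializing $S$); names are illustrative, not the reference implementation.

```python
import numpy as np

def sparsify(S, eps=None, k=None):
    """Truncate small values of the diffusion matrix S: either zero out
    entries below the threshold eps, or keep only the k entries with the
    highest mass per column (illustrative sketch)."""
    S_tilde = S.copy()
    if eps is not None:
        S_tilde[S_tilde < eps] = 0.0
    else:
        N = S_tilde.shape[0]
        # indices of all but the k largest entries in each column
        drop = np.argpartition(S_tilde, N - k, axis=0)[: N - k]
        np.put_along_axis(S_tilde, drop, 0.0, axis=0)
    return S_tilde

def transition_on_sparsified(S_tilde):
    """Symmetric transition matrix on the sparsified graph, which replaces
    the original adjacency matrix in the downstream model."""
    d = S_tilde.sum(axis=0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ S_tilde @ D_inv_sqrt
```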

Limitations. GDC is based on the assumption of homophily, i.e. "birds of a feather flock together" (McPherson et al., 2001). Many methods share this assumption and most common datasets adhere to this principle. However, this limitation is often overlooked and does not seem straightforward to overcome. One way of extending GDC to heterophily, i.e. "opposites attract", might be negative edge weights (Ma et al., 2016; Derr et al., 2018). Furthermore, we suspect that GDC does not perform well in settings with more complex edges (e.g. knowledge graphs) or in graph reconstruction tasks such as link prediction. Preliminary experiments showed that GDC indeed does not improve link prediction performance.

4 Spectral analysis of GDC

Even though GDC is a spatial-based method it can also be interpreted as a graph convolution and analyzed in the graph spectral domain. In this section we show how generalized graph diffusion is expressed as an equivalent polynomial filter and vice versa. Additionally, we perform a spectral analysis of GDC, which highlights the tight connection between GDC and spectral-based models.

Spectral graph theory. To employ the tools of spectral theory to graphs we exchange the regular Laplace operator with either the unnormalized Laplacian $L_{un} = D - A$, the random-walk normalized $L_{rw} = I_N - D^{-1}A$, or the symmetric normalized graph Laplacian $L_{sym} = I_N - D^{-1/2} A D^{-1/2}$ (von Luxburg, 2007). The Laplacian’s eigendecomposition is $L = U \Lambda U^T$, where both $U$ and $\Lambda$ are real-valued. The graph Fourier transform of a vector $x$ is then defined via $\hat{x} = U^T x$ and its inverse as $x = U \hat{x}$. Using this we define a graph convolution on $G$ as $x *_G y = U\left((U^T x) \odot (U^T y)\right)$, where $\odot$ denotes the Hadamard product. Hence, a filter $g_\xi$ with parameters $\xi$ acts on $x$ as $g_\xi(L) x = U \hat{g}_\xi(\Lambda) U^T x$, where $\hat{g}_\xi(\Lambda) = \operatorname{diag}(\hat{g}_\xi(\lambda_1), \ldots, \hat{g}_\xi(\lambda_N))$. A common choice for $\hat{g}_\xi$ in the literature is a polynomial filter of order $J$, since it is localized and has a limited number of parameters (Hammond et al., 2011; Defferrard et al., 2016):

$$\hat{g}_\xi(\lambda_i) = \sum_{j=0}^{J} \xi_j \lambda_i^j. \qquad (2)$$
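For concreteness, the following NumPy sketch (our illustration; the function and variable names are not from the paper's implementation) applies such a polynomial filter to a node signal, once in the spectral domain via the eigendecomposition and once directly as a matrix polynomial in $L_{sym}$; both views give the same result.

```python
import numpy as np

def polynomial_filter(A, x, xi):
    """Apply a polynomial filter g_xi(L) = sum_j xi_j * L^j to a node signal x,
    using the symmetric normalized Laplacian. Illustrative sketch only."""
    N = A.shape[0]
    d = A.sum(axis=0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt          # L_sym
    # spectral view: U diag(g(lambda_1), ..., g(lambda_N)) U^T x
    lam, U = np.linalg.eigh(L)
    g_lam = sum(xi_j * lam**j for j, xi_j in enumerate(xi))
    x_spectral = U @ (g_lam * (U.T @ x))
    # spatial view: the same filter evaluated without an eigendecomposition
    x_spatial = sum(xi_j * np.linalg.matrix_power(L, j) @ x
                    for j, xi_j in enumerate(xi))
    assert np.allclose(x_spectral, x_spatial)
    return x_spatial
```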

Graph diffusion as a polynomial filter. Comparing Eq. 1 with Eq. 2 shows the close relationship between polynomial filters and generalized graph diffusion, since we only need to exchange $T$ with $L$ to go from one to the other. To make this relationship more specific and find a direct correspondence between GDC with $T_{sym}$ and a polynomial filter with parameters $\xi_j$ we need to find parameters $\xi_j$ that solve

$$\sum_{j=0}^{J} \xi_j L_{sym}^j = \sum_{k=0}^{\infty} \theta_k T_{sym}^k. \qquad (3)$$

To find these parameters we choose the Laplacian corresponding to $T_{sym}$, i.e. $L_{sym} = I_N - T_{sym}$, resulting in (see App. A)

$$\xi_j = \sum_{k=j}^{\infty} \binom{k}{j} (-1)^j \theta_k, \qquad (4)$$

which shows the direct correspondence between graph diffusion and spectral methods. Note that we need to set $J \to \infty$. Solving Eq. 4 for the coefficients $\theta_k$ corresponding to the heat kernel and PPR leads to

$$\xi_j^{\text{Heat}} = \frac{(-t)^j}{j!}, \qquad \xi_j^{\text{PPR}} = \left(1 - \frac{1}{\alpha}\right)^j, \qquad (5)$$

showing how the heat kernel and PPR are expressed as polynomial filters. Note that PPR’s corresponding polynomial filter converges only if $\alpha > 0.5$. This is caused by changing the order of summation when deriving $\xi_j^{\text{PPR}}$, which results in an alternating series. However, if the series does converge it gives the exact same transformation as the equivalent graph diffusion.
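This correspondence can be checked numerically. The sketch below (our illustration, with hypothetical hyperparameter values) compares the closed-form diffusion matrices with the truncated polynomial filters on a small toy graph; the PPR case uses $\alpha > 0.5$, as required for convergence.

```python
import numpy as np
from math import factorial
from scipy.linalg import expm

# Small deterministic test graph: a ring of 20 nodes with additional chords.
N = 20
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
    A[i, (i + 5) % N] = A[(i + 5) % N, i] = 1.0
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=0)))
T = D_inv_sqrt @ A @ D_inv_sqrt        # T_sym
L = np.eye(N) - T                      # L_sym = I - T_sym

J = 60                                 # truncation order of the polynomial

# Heat kernel: theta_k = e^{-t} t^k / k!   <->   xi_j = (-t)^j / j!
t = 2.0
S_heat = expm(-t * L)
P_heat = sum((-t) ** j / factorial(j) * np.linalg.matrix_power(L, j)
             for j in range(J))
print(np.abs(S_heat - P_heat).max())   # ~0 up to truncation error

# PPR: theta_k = alpha (1 - alpha)^k   <->   xi_j = (1 - 1/alpha)^j, alpha > 0.5
alpha = 0.75
S_ppr = alpha * np.linalg.inv(np.eye(N) - (1 - alpha) * T)
P_ppr = sum((1 - 1 / alpha) ** j * np.linalg.matrix_power(L, j)
            for j in range(J))
print(np.abs(S_ppr - P_ppr).max())     # small, since alpha > 0.5
```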

Spectral properties of GDC. We will now extend the discussion to all parts of GDC and analyze how they transform the graph Laplacian’s eigenvalues. GDC consists of four steps: 1. calculate the transition matrix $T$, 2. take the sum in Eq. 1 to obtain $S$, 3. sparsify the resulting matrix by truncating small values, resulting in $\tilde{S}$, and 4. calculate the transition matrix $T_{\tilde{S}}$.

1. Transition matrix. The transition matrix $T_{sym}$ is related to the corresponding Laplacian via $T_{sym} = I_N - L_{sym}$ and thus its eigenvalues are $1 - \lambda_i$. Adding self-loops to obtain $\tilde{T}_{sym}$ does not preserve the eigenvectors and its effect therefore cannot be calculated precisely. Wu et al. (2019) empirically found that adding self-loops shrinks the graph’s eigenvalues.

2. Sum over $T^k$. Summation does not affect the eigenvectors of the original matrix, since $S v_i = \sum_{k=0}^{\infty} \theta_k T^k v_i = \sum_{k=0}^{\infty} \theta_k \lambda_i^k v_i$ for the eigenvector $v_i$ of $T$ with associated eigenvalue $\lambda_i$. This also shows that the eigenvalues are transformed as

$$\tilde{\lambda}_i = \sum_{k=0}^{\infty} \theta_k \lambda_i^k. \qquad (6)$$

Since the eigenvalues of $T$ are bounded by 1 we can use the geometric series to derive a closed-form expression for PPR, i.e. $\tilde{\lambda}_i = \frac{\alpha}{1 - (1 - \alpha)\lambda_i}$. For the heat kernel we use the exponential series, resulting in $\tilde{\lambda}_i = e^{t(\lambda_i - 1)}$. The combined transformation of steps 1 and 2 of GDC is illustrated in Fig. 2a. Both PPR and the heat kernel act as low-pass filters. Low eigenvalues corresponding to large-scale structure in the graph (e.g. clusters (Ng et al., 2002)) are amplified, while high eigenvalues corresponding to fine details but also noise are suppressed.
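These closed-form eigenvalue maps can be evaluated directly. The small sketch below (with illustrative, hypothetical values for $\alpha$ and $t$) tabulates how eigenvalues at the low and high end of the Laplacian spectrum are rescaled, showing the low-pass behavior:

```python
import numpy as np

alpha, t = 0.15, 3.0                     # illustrative hyperparameters
lam_L = np.linspace(0.0, 2.0, 5)         # eigenvalues of L_sym lie in [0, 2]
lam_T = 1.0 - lam_L                      # corresponding eigenvalues of T_sym

ppr = alpha / (1.0 - (1.0 - alpha) * lam_T)   # closed-form PPR map (Eq. 6)
heat = np.exp(t * (lam_T - 1.0))              # closed-form heat kernel map
for l, p, h in zip(lam_L, ppr, heat):
    print(f"lambda_L = {l:.2f}:  PPR -> {p:.3f},  heat -> {h:.4f}")
```

Low Laplacian eigenvalues are kept near 1 while high ones are strongly suppressed, matching the low-pass interpretation above.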

Figure 2: Influence of different parts of GDC on the eigenvalues $\lambda_i$. (a) Graph diffusion as a filter: both PPR and the heat kernel act as low-pass filters. (b) Sparsification with threshold $\epsilon$ of PPR on Cora: eigenvalues are almost unchanged. (c) Transition matrix on the sparsified graph $\tilde{S}$: medium eigenvalues are amplified.

3. Sparsification. Sparsification changes both the eigenvalues and the eigenvectors, which means that there is no direct correspondence between the eigenvalues of $S$ and $\tilde{S}$ and we cannot analyze its effect analytically. However, we can use eigenvalue perturbation theory (Stewart & Sun (1990), Corollary 4.13) to derive the upper bound

$$\sqrt{\sum_{i=1}^{N} (\tilde{\lambda}_i - \lambda_i)^2} \le \|E\|_F \le \epsilon N, \qquad (7)$$

with the perturbation matrix $E = \tilde{S} - S$ and the threshold $\epsilon$. This bound significantly overestimates the perturbation since PPR and the heat kernel both exhibit strong localization on real-world graphs, and hence the change in eigenvalues empirically does not scale with $N$ (or, rather, $\epsilon N$). By ordering the eigenvalues we see that, empirically, the typical thresholds for sparsification have almost no effect on the eigenvalues, as shown in Fig. 2b and in the close-up in Fig. 11 in App. B.2. We find that the small changes caused by sparsification mostly affect the highest and lowest eigenvalues. The former correspond to very large clusters and long-range interactions, which are undesired for local graph smoothing. The latter correspond to spurious oscillations, which are not helpful for graph learning either and are most likely affected because of the abrupt cutoff at $\epsilon$. These changes indicate why sparsification can actually improve the model’s accuracy.

4. Transition matrix on $\tilde{S}$. As a final step we calculate the transition matrix $T_{\tilde{S}}$ on the resulting graph $\tilde{S}$. This step does not just change which Laplacian we consider, since we have already switched to using the transition matrix in step 1. It furthermore does not preserve the eigenvectors and is thus again best investigated empirically by ordering the eigenvalues. Fig. 2c shows that this step amplifies medium eigenvalues, which correspond to medium-sized clusters. Such clusters most likely form the most informative neighborhoods for node prediction, which probably explains why using the transition matrix $T_{\tilde{S}}$ instead of $\tilde{S}$ improves performance.

Limitations of spectral-based models. While there are tight connections between GDC and spectral-based models, GDC is actually spatial-based and therefore does not share their limitations. It does not require an expensive eigenvalue decomposition, preserves locality on the graph, and is not limited to a single graph after training, i.e. typically the same coefficients $\theta_k$ can be used across graphs. The choice of coefficients $\theta_k$ depends on the type of graph at hand and does not change significantly between similar graphs. Moreover, the hyperparameters $\alpha$ of PPR and $t$ of the heat kernel usually fall within a narrow range that is rather insensitive to both the graph and the model (see Fig. 8 in Sec. 6).

5 Related work

Graph diffusion and random walks have been extensively studied in classical graph learning (Kondor & Lafferty, 2002; Lafon & Lee, 2006; Chen et al., 2013; Chung, 2007), especially for clustering (Kloster & Gleich, 2014), semi-supervised classification (Fouss et al., 2012; Buchnik & Cohen, 2018), and recommendation systems (Ma et al., 2016). For an overview of existing methods see Masuda et al. (2017) and Fouss et al. (2012).

The first models similar in structure to current Graph Neural Networks (GNNs) were proposed by Sperduti & Starita (1997) and Baskin et al. (1997), and the name GNN first appeared in (Gori et al., 2005; Scarselli et al., 2009). However, they only became widely adopted in recent years, when they started to outperform classical models in many graph-related tasks (Duvenaud et al., 2015; Klicpera et al., 2019; Ying et al., 2018; Li et al., 2018b). In general, GNNs are classified into spectral-based models (Defferrard et al., 2016; Kipf & Welling, 2017; Bruna et al., 2014; Henaff et al., 2015; Li et al., 2018a), which are based on the eigendecomposition of the graph Laplacian, and spatial-based methods (Gilmer et al., 2017; Hamilton et al., 2017; Li et al., 2016; Veličković et al., 2018; Monti et al., 2017; Niepert et al., 2016; Pham et al., 2017), which use the graph directly and form new representations by aggregating the representations of a node and its neighbors. Deep learning also inspired a variety of unsupervised node embedding methods. Most models use random walks to learn node embeddings in a similar fashion as word2vec (Mikolov et al., 2013) (Perozzi et al., 2014; Grover & Leskovec, 2016) and have been shown to implicitly perform a matrix factorization (Qiu et al., 2018). Other unsupervised models learn Gaussian distributions instead of vectors (Bojchevski & Günnemann, 2018), use an auto-encoder (Kipf & Welling, 2016), or train an encoder by maximizing the mutual information between local and global embeddings (Velickovic et al., 2019).

There have been some isolated efforts of using extended neighborhoods for aggregation in GNNs and graph diffusion for node embeddings. PPNP (Klicpera et al., 2019) propagates the node predictions generated by a neural network using personalized PageRank, DCNN (Atwood & Towsley, 2016) extends node features by concatenating features aggregated using the transition matrices of $k$-hop random walks, GraphHeat (Xu et al., 2019a) uses the heat kernel and PAN (Ma et al., 2019) the transition matrix of maximal entropy random walks to aggregate over nodes in each layer, PinSage (Ying et al., 2018) uses random walks for neighborhood aggregation, and MixHop (Abu-El-Haija et al., 2019) concatenates embeddings aggregated using the transition matrices of $k$-hop random walks before each layer. VERSE (Tsitsulin et al., 2018) learns node embeddings by minimizing KL-divergence from the PPR matrix to a low-rank approximation. Attention walk (Abu-El-Haija et al., 2018) uses a similar loss to jointly optimize the node embeddings and diffusion coefficients $\theta_k$. None of these works considered sparsification, generalized graph diffusion, spectral properties, or using preprocessing to generalize across models.

6 Experimental results

Experimental setup. We take extensive measures to prevent any kind of bias in our results. We optimize the hyperparameters of all models on all datasets with both the unmodified graph and all GDC variants separately using a combination of grid and random search on the validation set. Each result is averaged across 100 data splits and random initializations for supervised tasks and 20 random initializations for unsupervised tasks, as suggested by Klicpera et al. (2019) and Shchur et al. (2018). We report performance on a test set that was used exactly once. We report all results as averages with confidence intervals calculated via bootstrapping.

We use the symmetric transition matrix with self-loops $\tilde{T}_{sym}$ for GDC and the symmetric transition matrix $T^{\text{sym}}_{\tilde{S}}$ on the sparsified graph $\tilde{S}$. We present two simple and effective choices for the coefficients $\theta_k$: the heat kernel and PPR. The diffusion matrix $S$ is sparsified using either an $\epsilon$-threshold or top-$k$.

Datasets and models. We evaluate GDC on six datasets: the citation graphs Citeseer (Sen et al., 2008), Cora (McCallum et al., 2000), and PubMed (Namata et al., 2012), the co-author graph Coauthor CS (Shchur et al., 2018), and the co-purchase graphs Amazon Computers and Amazon Photo (McAuley et al., 2015; Shchur et al., 2018). We only use their largest connected components. We show how GDC impacts the performance of 9 models. Graph Convolutional Network (GCN) (Kipf & Welling, 2017), Graph Attention Network (GAT) (Veličković et al., 2018), jumping knowledge network (JK) (Xu et al., 2018), Graph Isomorphism Network (GIN) (Xu et al., 2019b), and ARMA (Bianchi et al., 2019) are supervised models. The degree-corrected stochastic block model (DCSBM) (Karrer & Newman, 2011), spectral clustering (Ng et al., 2002), DeepWalk (Perozzi et al., 2014), and Deep Graph Infomax (DGI) (Velickovic et al., 2019) are unsupervised models. Note that DGI uses node features while the other unsupervised models do not. We use $k$-means clustering to generate clusters from node embeddings. Dataset statistics and hyperparameters are reported in App. B.

Figure 3: Node classification accuracy (%) of GNNs (GCN, GAT, JK, GIN, ARMA) with and without GDC (none, heat kernel, PPR) on Cora, Citeseer, PubMed, Coauthor CS, Amazon Computers, and Amazon Photo. GDC consistently improves accuracy across models and datasets. It is able to fix models whose accuracy otherwise breaks down.

Semi-supervised node classification. In this task the goal is to label nodes based on the graph, the node features, and a subset of labeled nodes. The main goal of GDC is improving the performance of MPNN models. Fig. 3 shows that GDC consistently and significantly improves the accuracy of a wide variety of state-of-the-art models across multiple diverse datasets. Note how GDC is able to fix the performance of GNNs that otherwise break down on some datasets (e.g. GAT). We also surpass or match the previous state of the art on all datasets investigated (see App. B.2).

Figure 4: Clustering accuracy (%) of unsupervised models (DCSBM, spectral clustering, DeepWalk, DGI) with and without GDC (none, heat kernel, PPR) on Cora, Citeseer, PubMed, Coauthor CS, Amazon Computers, and Amazon Photo. GDC consistently improves the accuracy across a diverse set of models and datasets.

Clustering. We highlight GDC’s ability to be combined with any graph-based model by reporting the performance of a diverse set of models that use a wide range of paradigms. Fig. 4 shows the unsupervised accuracy obtained by matching clusters to ground-truth classes using the Hungarian algorithm. Accuracy consistently and significantly improves for all models and datasets. Note that spectral clustering uses the graph’s eigenvectors, which are not affected by the diffusion step itself. Still, its performance improves considerably. Results in tabular form are presented in App. B.2.

In this work we concentrate on node-level prediction tasks in a transductive setting. However, GDC can just as easily be applied to inductive problems or different tasks like graph classification. In our experiments we found promising, yet not as consistent results for graph classification (e.g. with GCN on the DD dataset (Dobson & Doig, 2003)). We found no improvement in the inductive setting on PPI (Menche et al., 2015), which is rather unsurprising since the underlying data used for graph construction already includes graph diffusion-like mechanisms (e.g. regulatory interactions, protein complexes, and metabolic enzyme-coupled interactions). We furthermore conducted experiments to answer five important questions:

Figure 5: GCN+GDC accuracy (%, using PPR and top-$k$ sparsification) as a function of the average degree on Cora, Citeseer, and Amazon Computers. Lines indicate original accuracy and degree. GDC surpasses the original accuracy at around the same degree, independent of the dataset. Sparsification often improves accuracy.

(Figure: accuracy (%) as a function of the self-loop weight.)