Stability and Generalization of Graph Convolutional Neural Networks

Saurabh Verma
Department of Computer Science, University of Minnesota, Twin Cities
verma076@cs.umn.edu

Zhi-Li Zhang
Department of Computer Science, University of Minnesota, Twin Cities
zhzhang@cs.umn.edu
Abstract.

Inspired by convolutional neural networks on 1D and 2D data, graph convolutional neural networks (GCNNs) have been developed for various learning tasks on graph data, and have shown superior performance on real-world datasets. Despite their success, there is a dearth of theoretical explorations of GCNN models, for instance of their generalization properties. In this paper, we take a first step towards developing a deeper theoretical understanding of GCNN models by analyzing the stability of single-layer GCNN models and deriving their generalization guarantees in a semi-supervised graph learning setting. In particular, we show that the algorithmic stability of a GCNN model depends upon the largest absolute eigenvalue of its graph convolution filter. Moreover, to ensure the uniform stability needed to provide strong generalization guarantees, the largest absolute eigenvalue must be independent of the graph size. Our results shed new light on the design of new and improved graph convolution filters with guaranteed algorithmic stability. We evaluate the generalization gap and stability on various real-world graph datasets and show that the empirical results indeed support our theoretical findings. To the best of our knowledge, we are the first to study stability bounds on graph learning in a semi-supervised setting and derive generalization bounds for GCNN models.

Deep learning, graph convolutional neural networks, graph mining, stability, generalization guarantees
journalyear: 2019; copyright: acmlicensed; conference: The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 4–8, 2019, Anchorage, Alaska, USA; booktitle: In KDD '19: The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 4–8, 2019, Anchorage, Alaska, USA; ccs: Computing methodologies, Neural networks; Theory of computation, Graph algorithms analysis; Theory of computation, Semi-supervised learning

1. Introduction

Building upon the huge success of deep learning in computer vision (CV) and natural language processing (NLP), Graph Convolutional Neural Networks (GCNNs) (Kipf and Welling, 2016a) have recently been developed for tackling various learning tasks on graph-structured datasets. These models have shown superior performance on real-world datasets from various domains, such as node labelling on social networks (Kipf and Welling, 2016b), link prediction in knowledge graphs (Schlichtkrull et al., 2018) and molecular graph classification in quantum chemistry (Gilmer et al., 2017). Due to the versatility of graph-structured data representation, GCNN models have been incorporated in many diverse applications, e.g., question-answering systems (Song et al., 2018) in NLP and image semantic segmentation (Qi et al., 2017) in CV. While various versions of GCNN models have been proposed, there is a dearth of theoretical explorations of these models, especially in terms of their generalization properties and (algorithmic) stability ((Xu et al., 2018) is one of the few exceptions, exploring the discriminative power of GCNN models). The latter is of particular importance, as the stability of a learning algorithm plays a crucial role in generalization.

The generalization of a learning algorithm can be explored in several ways. One of the earliest and most popular approaches is Vapnik–Chervonenkis (VC) theory (Blumer et al., 1989), which establishes generalization errors in terms of the VC-dimension of a learning algorithm. Unfortunately, VC theory is not applicable to learning algorithms with unbounded VC-dimension, such as neural networks. Another way to show generalization is to perform a Probably Approximately Correct (PAC) analysis (Haussler, 1990), which is generally difficult to do in practice. The third approach, which we adopt, relies on deriving stability bounds of a learning algorithm, often known as algorithmic stability (Bousquet and Elisseeff, 2002). The idea behind algorithmic stability is to understand how the learned function changes with small changes in the input data. Over the past decade, several definitions of algorithmic stability have been developed (Agarwal and Niyogi, 2005, 2009; Bousquet and Elisseeff, 2002; Elisseeff et al., 2005; Mukherjee et al., 2006), including uniform stability, hypothesis stability, pointwise hypothesis stability, error stability and cross-validation stability, each yielding either a tighter or looser bound on the generalization error. For instance, learning algorithms based on Tikhonov regularization satisfy the uniform stability criterion (the strongest stability condition among all existing forms of stability) and are thus generalizable.

In this paper, we take a first step towards developing a deeper theoretical understanding of GCNN models by analyzing the (uniform) stability of GCNN models and thereby deriving their generalization guarantees. For simplicity of exposition, we focus on single-layer GCNN models in a semi-supervised learning setting. The main result of this paper is that (single-layer) GCNN models with stable graph convolution filters can satisfy the strong notion of uniform stability and thus are generalizable. More specifically, we show that the stability of a (single-layer) GCNN model depends upon the largest absolute eigenvalue (the eigenvalue with the largest absolute value) of the graph filter it employs, or more generally the largest singular value if the graph filter is asymmetric, and that the uniform stability criterion is met if the largest absolute eigenvalue (or singular value) is independent of the graph size, i.e., the number of nodes in the graph. As a consequence of our analysis, we establish that (appropriately) normalized graph convolution filters, such as the symmetric normalized graph Laplacian or random walk based filters, are all uniformly stable and thus are generalizable. In contrast, graph convolution filters based on the unnormalized graph Laplacian or adjacency matrix do not enjoy algorithmic stability, as their largest absolute eigenvalues grow with the graph size. Empirical evaluations based on real-world datasets support our theoretical findings: the generalization gap and the weight-parameter instability in the case of unnormalized graph filters are significantly higher than those of the normalized filters. Our results shed new light on the design of new and improved graph convolution filters with guaranteed algorithmic stability.

We remark that our GCNN generalization bounds obtained from algorithmic stability are non-asymptotic in nature, i.e., they do not assume any form of data distribution. Nor do they hinge upon the complexity of the hypothesis class, unlike most uniform convergence bounds. We only assume that the activation and loss functions employed are Lipschitz-continuous and smooth. These criteria are readily satisfied by several popular activation functions such as ELU (with $\alpha = 1$), Sigmoid and Tanh. To the best of our knowledge, we are the first to study stability bounds on graph learning in a semi-supervised setting and derive generalization bounds for GCNN models. Our analysis framework is general enough to be extended to theoretical stability analyses of GCNN models beyond the semi-supervised learning setting (where there is a single, fixed underlying graph structure), for instance to graph classification (where there are multiple graphs).

In summary, the major contributions of our paper are:

  • We provide the first generalization bound on single layer GCNN models based on analysis of their algorithmic stability. We establish that GCNN models which employ graph filters with bounded eigenvalues that are independent of the graph size can satisfy the strong notion of uniform stability and thus are generalizable.

  • Consequently, we demonstrate that many existing GCNN models that employ normalized graph filters satisfy the strong notion of uniform stability. We also justify the importance of employing batch-normalization in a GCNN architecture.

  • Empirical evaluations of the generalization gap and stability using real-world datasets support our theoretical findings.

The paper is organized as follows. Section 2 reviews key generalization results for deep learning as well as regularized graphs and briefly discusses existing GCNN models. The main result is presented in Section 3 where we introduce the needed background and establish the GCNN generalization bounds step by step. In Section 4, we apply our results to existing graph convolution filters and GCNN architecture designs. In Section 5 we conduct empirical studies which complement our theoretical analysis. The paper is concluded in Section 6 with a brief discussion of future work.

2. Related Work

Generalization Bounds on Deep Learning: Many theoretical studies have been devoted to understanding the representational power of neural networks by analyzing their capability as universal function approximators as well as their depth efficiency (Cohen and Shashua, 2016; Telgarsky, 2016; Eldan and Shamir, 2016; Mhaskar and Poggio, 2016; Delalleau and Bengio, 2011). In (Delalleau and Bengio, 2011) the authors show that the number of hidden units in a shallow network has to grow exponentially (as opposed to linearly in a deep network) in order to represent the same function; thus depth yields a much more compact representation of a function than breadth. It is shown in (Cohen and Shashua, 2016) that convolutional neural networks with the ReLU activation function are universal function approximators with max pooling, but not with average pooling. The authors of (Neyshabur et al., 2017) explore which complexity measure is most appropriate for explaining the generalization power of deep learning. The work closest to ours is (Hardt et al., 2015), where the authors derive upper bounds on the generalization errors of stochastic gradient methods. While also utilizing the notion of uniform stability (Bousquet and Elisseeff, 2002), their analysis is concerned with the impact of SGD learning rates. More recently, through empirical evaluations on real-world datasets, it has been argued in (Zhang et al., 2016) that the traditional measures of model complexity are not sufficient to explain the generalization ability of neural networks. Likewise, in (Kawaguchi et al., 2017) several open-ended questions are posed regarding the (yet unexplained) generalization capability of neural networks, despite their possible algorithmic instability, non-robustness, and sharp minima.

Generalization Bounds on Regularized Graphs: Another line of work concerns generalization bounds on regularized graphs in transductive settings (Belkin et al., 2004; Cortes et al., 2008; Ando and Zhang, 2007; Sun et al., 2014). Of most interest to us is (Belkin et al., 2004), where the authors provide theoretical guarantees for the generalization error based on Laplacian regularization, which are also derived using the notion of algorithmic stability. Their generalization estimate is inversely proportional to the second smallest eigenvalue of the graph Laplacian. Unfortunately, this estimate may not yield a desirable guarantee, since the second smallest eigenvalue depends on both the graph structure and its size, and it is in general difficult to remove this dependency via normalization. In contrast, our estimates are directly proportional to the largest absolute eigenvalue (or the largest singular value of an asymmetric graph filter), and can easily be made independent of the graph size by performing appropriate Laplacian normalization.

Graph Convolution Neural Networks: The development of GCNNs can be traced back to graph signal processing (Shuman et al., 2013), in terms of learning the filter parameters of the graph Fourier transform (Bruna et al., 2013; Henaff et al., 2015). Various GCNN models have since been proposed (Kipf and Welling, 2016a; Atwood and Towsley, 2016; Li et al., 2018; Duvenaud et al., 2015; Puy et al., 2017; Dernbach et al., 2018; Zhang et al., 2019) that mainly attempt to improve the basic GCNN model along two aspects: 1) enhancing the graph convolution operation by developing novel graph filters; and 2) designing appropriate graph pooling operations. For instance, (Levie et al., 2017) employs complex graph filters via Cayley rather than Chebyshev polynomials, whereas (Fey et al., 2018) introduces B-splines as a basis for the filtering operation instead of the graph Laplacian. Similarly, (Li et al., 2018) parameterizes graph filters using a residual Laplacian matrix, while (Such et al., 2017) simply uses polynomials of the adjacency matrix. Random walk and quantum walk based graph convolutions have also been proposed recently (Puy et al., 2017; Dernbach et al., 2018; Zhang et al., 2019). The authors of (Hamilton et al., 2017; Veličković et al., 2018) have also applied graph convolution to large graphs. In terms of graph pooling operations, pre-computed graph coarsening layers via the Graclus multilevel clustering algorithm are employed in (Defferrard et al., 2016), while a differentiable pooling operation that can generate hierarchical representations of a graph is developed in (Ying et al., 2018). In (Lei et al., 2017; Gilmer et al., 2017; Dai et al., 2016; García-Durán and Niepert, 2017), message passing neural networks (MPNNs) are developed, which can be viewed as equivalent to GCNN models, since the underlying notion of the graph convolution operation is essentially the same.

3. Stability and Generalization Guarantees For GCNNs

To derive generalization guarantees of GCNNs based on algorithmic stability analysis, we adopt the strategy devised in (Bousquet and Elisseeff, 2002). It relies on bounding the output difference of a loss function due to a single data point perturbation. As stated earlier, there exist several different notions of algorithmic stability (Bousquet and Elisseeff, 2002; Mukherjee et al., 2006). In this paper, we focus on the strong notion of uniform stability (see Definition 1).

3.1. Graph Convolution Neural Networks

Notations: Let $G = (V, E)$ be a graph where $V$ is the vertex set, $E$ the edge set and $A$ the adjacency matrix, with $N = |V|$ the graph size. We define the standard graph Laplacian as $L = D - A$, where $D$ is the degree matrix. We define a graph filter $g(L)$ as a function of the graph Laplacian $L$ or a normalized (using $D$) version of it. Let $L = U \Lambda U^{T}$ be the eigen decomposition of $L$, with $\Lambda$ the diagonal matrix of $L$'s eigenvalues. Then $g(L) = U g(\Lambda) U^{T}$, and its eigenvalues are $\{g(\lambda_i)\}_{i=1}^{N}$. We define $\lambda^{G}_{\max} = \max_{i} |g(\lambda_i)|$, referred to as the largest absolute eigenvalue of the graph filter $g(L)$ (this definition is valid when $g(L)$ is symmetric or, more generally, a normal matrix; otherwise, $\lambda^{G}_{\max}$ is defined as the largest singular value of $g(L)$). Let $m$ be the number of training samples, which depends on the graph size $N$ as $m \leq N$.

Let $X \in \mathbb{R}^{N \times d}$ be the node feature matrix ($d$ is the input dimension) and $\theta$ be the learning parameters. With a slight abuse of notation, we will represent both a node (index) $i$ in the graph and its feature values by $\mathbf{x}_i$. $\mathcal{N}(i)$ denotes the set of neighbor indices at most $1$-hop distance away from node $i$ (including $i$ itself). Here the $1$-hop neighbors are determined using the filter matrix, i.e., $j \in \mathcal{N}(i)$ whenever $[g(L)]_{ij} \neq 0$. Finally, $G_i$ represents the ego-graph extracted at node $i$ from $G$.

Single Layer GCNN (Full Graph View): The output function of a single layer GCNN model, on all graph nodes together, can be written in a compact matrix form as follows,

(1)   $f(X, \theta) = \sigma\big(g(L)\, X\, \theta\big),$

where $g(L)$ is a graph filter and $\sigma(\cdot)$ is the activation function. Some commonly used graph filters are a linear function of the adjacency matrix, such as $g(L) = A + I$ (Xu et al., 2018) (here $I$ is the identity matrix), or a Chebyshev polynomial of the graph Laplacian $L$ (Defferrard et al., 2016).
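To make the matrix form in Equation (1) concrete, here is a minimal NumPy sketch of a single-layer GCNN forward pass. The choice of the symmetric normalized filter, the tanh activation, and all variable names are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def sym_norm_filter(A):
    """One possible graph filter: g(L) = D~^{-1/2} (A + I) D~^{-1/2},
    where D~ is the degree matrix of A + I (an assumed, commonly used choice)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcnn_forward(A, X, theta, sigma=np.tanh):
    """Single-layer GCNN output on all nodes at once: f(X, theta) = sigma(g(L) X theta)."""
    return sigma(sym_norm_filter(A) @ X @ theta)

# Toy usage: a 4-node path graph with 3-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
theta = np.random.randn(3, 1)
print(gcnn_forward(A, X, theta).shape)  # (4, 1): one output per node
```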

Single Layer GCNN (Ego-Graph View): We will work with the notion of an ego-graph $G_i$ for each node $i$ (extracted from $G$), as it contains the complete information needed for computing the output of a single layer GCNN model at node $i$. We can rewrite Equation (1) for a single node prediction as,

(2)   $f(\mathbf{x}_i, \theta) = \sigma\Big(\sum_{j \in \mathcal{N}(i)} e_{ij}\,\mathbf{x}_j\,\theta\Big),$

where $e_{ij} = [g(L)]_{ij}$ is the weighted edge (value) between node $i$ and its neighbor $j$; $e_{ij} \neq 0$ if and only if $j \in \mathcal{N}(i)$. The size of an ego-graph depends upon the support of the filter $g(L)$. We assume that the filters are localized to $1$-hop neighbors, but our analysis is applicable to $K$-hop neighbors. For further notational clarity, we will consider the case $d = 1$, and thus $\theta \in \mathbb{R}$; our analysis holds for the general $d$-dimensional case.
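The ego-graph view can be checked against the full-graph view numerically. The sketch below is only an illustration (an unnormalized A + I filter and tanh activation are assumed for simplicity): it computes a single node's output from the nonzero entries of its filter row and verifies that it matches Equation (1).

```python
import numpy as np

def ego_output(gL, X, theta, i, sigma=np.tanh):
    """Node i's output from its ego-graph only (Equation (2)):
    f(x_i, theta) = sigma(sum_{j in N(i)} e_ij x_j theta), with e_ij = [g(L)]_ij."""
    nbrs = np.flatnonzero(gL[i])       # N(i): columns with nonzero filter weight in row i
    conv = gL[i, nbrs] @ X[nbrs]       # graph convolution restricted to the ego-graph
    return sigma(conv @ theta)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
gL = A + np.eye(3)                     # filter A + I, used here purely for illustration
X, theta = rng.standard_normal((3, 2)), rng.standard_normal((2, 1))
full = np.tanh(gL @ X @ theta)         # full-graph view, Equation (1)
assert np.allclose(ego_output(gL, X, theta, 1), full[1])
```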

3.2. Main Result

The main result of the paper is stated in Theorem 1, which provides a bound on the generalization gap for single layer GCNN models. This gap is defined as the difference between the generalization error and empirical error (see definitions in Section 3.3).

Theorem 1 (GCNN Generalization Gap).

Let $f$ be a single layer GCNN model equipped with the graph convolution filter $g(L)$, and trained on a dataset $S$ of size $m$ using the SGD algorithm for $T$ iterations. Let the loss and activation functions be Lipschitz-continuous and smooth. Then the following expected generalization gap holds with probability at least $1 - \delta$, with $\delta \in (0, 1)$,

$\mathbb{E}_{\mathrm{SGD}}\big[R(A_S) - R_{\mathrm{emp}}(A_S)\big] \;\le\; \frac{1}{m}\,\mathcal{O}\Big(\big(\lambda^{G}_{\max}\big)^{2T}\Big) + \Big(\mathcal{O}\Big(\big(\lambda^{G}_{\max}\big)^{2T}\Big) + M\Big)\sqrt{\frac{\log\frac{1}{\delta}}{2m}},$

where the expectation is taken over the randomness inherent in SGD, $m$ is the number of training samples, and $M$ is a constant depending on the loss function (an upper bound on the loss values).

Remarks: Theorem 1 establishes a key connection between the generalization gap and the graph filter eigenvalues. A GCNN model is uniformly stable if the bound converges to zero as $m \to \infty$. In particular, we see that if $\lambda^{G}_{\max}$ is independent of the graph size $N$, the generalization gap decays at the rate of $\mathcal{O}(1/\sqrt{m})$, yielding the tightest bound possible. Theorem 1 thus sheds light on the design of stable graph filters with generalization guarantees.

Proof Strategy: We need to tackle several technical challenges in order to obtain the generalization bound in Theorem 1.

  1. Analyzing GCNN Stability w.r.t. Graph Convolution: We analyze the stability of the graph convolution function under a single data-point perturbation. For this purpose, we separate the contribution of the graph convolution operation from the difference in the weight parameters in the GCNN output function, and bound each factor individually.

  2. Analyzing GCNN Stability w.r.t. the SGD Algorithm: GCNNs employ the randomized stochastic gradient descent (SGD) algorithm for optimizing the weight parameters. Thus, we need to bound the expected difference in the learned weight parameters under a single data perturbation and establish stability bounds. For this, we analyze the uniform stability of SGD in the context of GCNNs. We adopt the same strategy as in (Hardt et al., 2015) to obtain uniform stability of GCNN models, but with fewer assumptions compared with the general case (Hardt et al., 2015).

3.3. Preliminaries

Basic Setup: Let $\mathcal{X}$ and $\mathcal{Y}$ be subsets of a Hilbert space and define $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. We refer to $\mathcal{X}$ as the input space and $\mathcal{Y}$ as the output space. Let $S = \{z_1 = (\mathbf{x}_1, y_1), \ldots, z_m = (\mathbf{x}_m, y_m)\}$ be a training set of size $m$ drawn from $\mathcal{Z}$. We introduce two more notations below:

Removing the $i^{\text{th}}$ data point from the set $S$ is represented as
$S^{\setminus i} = \{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m\}.$

Replacing the $i^{\text{th}}$ data point in $S$ by $z_i'$ is represented as
$S^{i} = \{z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_m\}.$


General Data Sampling Process: Let $\mathcal{D}$ denote an unknown distribution from which data points are sampled to form a training set $S$. Throughout the paper, we assume all samples (including the replacement sample) are i.i.d. unless mentioned otherwise. Let $\mathbb{E}_S[f]$ denote the expectation of a function $f$ when $m$ samples are drawn from $\mathcal{D}$ to form the training set $S$. Likewise, let $\mathbb{E}_z[f]$ denote the expectation of $f$ when $z$ is sampled according to $\mathcal{D}$.

Graph Node Sampling Process: At first, it may not be clear how to describe the sampling procedure of nodes from a graph in the context of GCNNs for performing semi-supervised learning. For our purpose, we consider the ego-graph formed by the $1$-hop neighbors of each node as a single data point. This ego-graph is necessary and sufficient to compute the single layer GCNN output, as shown in Equation (2). We assume node data points are sampled in an i.i.d. fashion by first choosing a node and then extracting its neighbors from $G$ to form an ego-graph.

Generalization Error: Let $A_S$ be a learning algorithm trained on a dataset $S$; $A_S$ is defined as a function from $\mathcal{X}$ to $\mathcal{Y}$. For GCNNs, we set $A_S(\mathbf{x}) = f(\mathbf{x}, \theta_S)$, where $\theta_S$ denotes the parameters learned on $S$. Then the generalization error or risk $R(A_S)$ with respect to a loss function $\ell$ is defined as
$R(A_S) := \mathbb{E}_z\big[\ell(A_S, z)\big] = \mathbb{E}_z\big[\ell\big(A_S(\mathbf{x}), y\big)\big].$

Empirical Error: The empirical risk is defined as
$R_{\mathrm{emp}}(A_S) := \frac{1}{m}\sum_{j=1}^{m} \ell\big(A_S, z_j\big) = \frac{1}{m}\sum_{j=1}^{m} \ell\big(A_S(\mathbf{x}_j), y_j\big).$

Generalization Gap: When $A_S$ is a randomized algorithm, we consider the expected generalization gap as shown below,
$\epsilon_{\mathrm{gen}} := \mathbb{E}_{A}\big[R(A_S) - R_{\mathrm{emp}}(A_S)\big].$

Here the expectation is taken over the inherent randomness of $A_S$. For instance, most learning algorithms employ stochastic gradient descent (SGD) to learn the weight parameters, and SGD introduces randomness through the random order in which it chooses samples for batch processing. In our analysis, we only consider the randomness in $A_S$ due to SGD and ignore the randomness introduced by parameter initialization. Hence, we will replace $\mathbb{E}_{A}[\cdot]$ with $\mathbb{E}_{\mathrm{SGD}}[\cdot]$.
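The expected generalization gap above can be approximated empirically by averaging over several SGD runs; a minimal sketch follows. The routines `train_with_sgd` and `avg_loss` are hypothetical placeholders (not part of this paper), and the test loss is used as a proxy for the true risk, as in Section 5.

```python
import numpy as np

def expected_generalization_gap(train_with_sgd, avg_loss, train_set, test_set, seeds=range(5)):
    """Monte-Carlo estimate of E_SGD[ R(A_S) - R_emp(A_S) ]: the expectation over
    SGD's randomness is approximated by averaging over random seeds (sample orders)."""
    gaps = []
    for s in seeds:
        model = train_with_sgd(train_set, seed=s)  # randomness: SGD's sample order
        gaps.append(abs(avg_loss(model, test_set) - avg_loss(model, train_set)))
    return float(np.mean(gaps))
```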

Uniform Stability of Randomized Algorithm: For a randomized algorithm, uniform stability is defined as follows,

Definition 1 (Uniform Stability).

A randomized learning algorithm $A_S$ is $\beta_m$-uniformly stable with respect to a loss function $\ell$ if, for all $S \in \mathcal{Z}^m$ and all $i \in \{1, \ldots, m\}$, it satisfies
$\sup_{z}\;\big|\mathbb{E}_{A}[\ell(A_S, z)] - \mathbb{E}_{A}[\ell(A_{S^{\setminus i}}, z)]\big| \;\le\; \beta_m.$

For our convenience, we will work with the following form of uniform stability,
$\sup_{z}\;\big|\mathbb{E}_{A}[\ell(A_S, z)] - \mathbb{E}_{A}[\ell(A_{S^{i}}, z)]\big| \;\le\; 2\beta_m,$

which follows immediately from the triangle inequality,
$\big|\mathbb{E}_{A}[\ell(A_S, z)] - \mathbb{E}_{A}[\ell(A_{S^{i}}, z)]\big| \;\le\; \big|\mathbb{E}_{A}[\ell(A_S, z)] - \mathbb{E}_{A}[\ell(A_{S^{\setminus i}}, z)]\big| + \big|\mathbb{E}_{A}[\ell(A_{S^{\setminus i}}, z)] - \mathbb{E}_{A}[\ell(A_{S^{i}}, z)]\big|.$

Remarks: Uniform stability imposes an upper bound on the difference in losses due to the removal (or replacement) of a single data point from the set $S$ (of size $m$), for all possible combinations of $S$, $i$ and $z$. Here, $\beta_m$ is a function of $m$ (the number of training samples). Note that there is a subtle difference between Definition 1 above and the uniform stability of randomized algorithms defined in (Elisseeff et al., 2005) (see the corresponding definition therein). The authors in (Elisseeff et al., 2005) are concerned with random elements associated with the cost function, such as those induced by bootstrapping, bagging or the initialization process. In contrast, we focus on the randomness due to the learning procedure, i.e., SGD.
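For intuition, the supremum in this definition can be probed (very crudely) by retraining with each point removed and recording the worst observed change in expected loss. This is only an illustrative sketch and not part of the paper's methodology; `train` and `loss` are hypothetical user-supplied routines, and a finite grid of evaluation points only lower-bounds the true supremum.

```python
import numpy as np

def empirical_stability_probe(train, loss, S, z_grid, seeds=range(3)):
    """Estimate sup_{i,z} |E_A[l(A_S, z)] - E_A[l(A_{S\\i}, z)]| on a training list S.
    The expectation over the algorithm's randomness is replaced by an average over seeds."""
    def exp_loss(data, z):
        return np.mean([loss(train(data, seed=s), z) for s in seeds])  # ~ E_A[l(A_data, z)]
    beta = 0.0
    for i in range(len(S)):
        S_minus_i = S[:i] + S[i + 1:]          # remove the i-th data point
        beta = max(beta, max(abs(exp_loss(S, z) - exp_loss(S_minus_i, z)) for z in z_grid))
    return beta
```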

Stability Guarantees: A randomized learning algorithm with uniform stability $\beta_m$ yields the following bound on the generalization gap:

Theorem 2 (Stability Guarantees).

A $\beta_m$-uniformly stable randomized algorithm $A_S$, with a loss function bounded as $0 \le \ell(A_S, z) \le M$, satisfies the following generalization bound with probability at least $1 - \delta$ over the random draw of $S$, with $\delta \in (0, 1)$,
$\mathbb{E}_{A}\big[R(A_S) - R_{\mathrm{emp}}(A_S)\big] \;\le\; 2\beta_m + \big(4m\beta_m + M\big)\sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$

Proof: The proof of Theorem 2 mirrors that of Theorem 12 in (Bousquet and Elisseeff, 2002), shown there for deterministic learning algorithms. For the sake of completeness, we include the proof in the Appendix, based on our definition of uniform stability for randomized algorithms.

Remarks: The generalization bound is meaningful if it converges to $0$ as $m \to \infty$. This occurs when $\beta_m$ decays faster than $\mathcal{O}(1/\sqrt{m})$; otherwise the generalization gap does not approach zero as $m \to \infty$. Furthermore, the generalization gap yields the tightest bound when $\beta_m$ decays at $\mathcal{O}(1/m)$, which is the most stable state possible for a learning algorithm.

Lipschitz Continuous and Smooth Activation Function: Our bounds hold for all activation functions $\sigma(\cdot)$ which are Lipschitz-continuous and smooth. An activation function is $\alpha_\sigma$-Lipschitz-continuous if $|\sigma(x) - \sigma(y)| \le \alpha_\sigma |x - y|$ for all $x, y$, or equivalently, $|\sigma'(x)| \le \alpha_\sigma$. We further require $\sigma(\cdot)$ to be $\nu_\sigma$-smooth, namely, $|\sigma'(x) - \sigma'(y)| \le \nu_\sigma |x - y|$. The smoothness assumption is more strict but necessary for establishing the strong notion of uniform stability. Some common activation functions satisfying the above conditions are ELU (with its parameter set to $1$), Sigmoid, and Tanh.
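As a sanity check (not a proof), the Lipschitz and smoothness constants of ELU with parameter 1 can be estimated numerically via finite differences; both come out at most 1 on a dense grid.

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.linspace(-10.0, 10.0, 100001)
lip = np.max(np.abs(np.diff(elu(x)) / np.diff(x)))          # finite-difference estimate of sup|sigma'|
smooth = np.max(np.abs(np.diff(elu_grad(x)) / np.diff(x)))  # finite-difference estimate of sup|sigma''|
print(lip <= 1.0 + 1e-6, smooth <= 1.0 + 1e-6)              # both print True for alpha = 1
```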

Lipschitz Continuous and Smooth Loss Function: We also assume that the loss function is Lipschitz-continuous and smooth with respect to its first argument, i.e.,
$\big|\ell(f(\mathbf{x}), y) - \ell(f'(\mathbf{x}), y)\big| \;\le\; \alpha_\ell\,\big|f(\mathbf{x}) - f'(\mathbf{x})\big| \quad \text{and} \quad \big|\ell'(f(\mathbf{x}), y) - \ell'(f'(\mathbf{x}), y)\big| \;\le\; \nu_\ell\,\big|f(\mathbf{x}) - f'(\mathbf{x})\big|,$
where $\ell'$ denotes the derivative of $\ell$ with respect to its first argument.

Unlike in (Hardt et al., 2015), we define Lipschitz-continuity with respect to the function argument rather than the weight parameters, a relatively weak assumption.

3.4. Uniform Stability of GCNN Models

The crux of our main result relies on showing that GCNN models are uniformly stable as stated in Theorem 3 below.

Theorem 3 (GCNN Uniform Stability).

Let the loss and activation functions be Lipschitz-continuous and smooth. Then a single layer GCNN model trained using the SGD algorithm for $T$ iterations is $\beta_m$-uniformly stable, where
$\beta_m \;\le\; \frac{1}{m}\,\mathcal{O}\Big(\big(\lambda^{G}_{\max}\big)^{2T}\Big),$
with the constants hidden in $\mathcal{O}(\cdot)$ depending on the SGD learning rate and on the Lipschitz and smoothness constants of the loss and activation functions.

Remarks: Plugging the bound on $\beta_m$ into Theorem 2 yields the main result of our paper.

Before we proceed to prove this theorem, we first explain what is meant by training a single layer GCNN using SGD on the two datasets $S$ and $S^{i}$, which differ in one data point, following the same line of reasoning as in (Hardt et al., 2015). Let $(z_{i_1}, \ldots, z_{i_T})$ be a sequence of samples, where $z_{i_t}$ is an i.i.d. sample drawn from $S$ at the $t^{\text{th}}$ iteration of SGD during a training run of the GCNN (one way to generate the sample sequence is to choose a node index uniformly at random from $\{1, \ldots, m\}$ at each step $t$; alternatively, one can first choose a random permutation of $\{1, \ldots, m\}$ and then process the samples accordingly; our analysis holds for both cases). Training the same GCNN using SGD on $S^{i}$ means that we supply the same sample sequence to the GCNN, except that whenever $z_{i_t} = z_i$ for some $t$, we replace it with $z'_i$, where $i$ is the (node) index at which $S$ and $S^{i}$ differ. Let $(\theta_0, \theta_1, \ldots, \theta_T)$ and $(\theta'_0, \theta'_1, \ldots, \theta'_T)$ denote the corresponding sequences of the weight parameters learned by running SGD on $S$ and $S^{i}$, respectively. Since the parameter initialization is kept the same, $\theta_0 = \theta'_0$. In addition, if $t_0$ is the first time that the two sample sequences differ, then $\theta_t = \theta'_t$ at each step $t < t_0$, while at step $t_0$ and subsequent steps, $\theta_t$ and $\theta'_t$ may diverge. The key in establishing the uniform stability of a GCNN model is to bound the difference in losses when training the GCNN using SGD on $S$ vs. $S^{i}$. As stated earlier in the proof strategy, we proceed in two steps.
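The construction of $S$ vs. $S^{i}$ with a shared sample sequence is easy to simulate. The toy sketch below (a random graph, squared loss, tanh activation and a scalar parameter are all illustrative assumptions, unlike the cross-entropy setup used in Section 5) trains two runs with the identical SGD sample order on datasets differing in one label and reports the parameter divergence $|\theta_T - \theta'_T|$, the quantity bounded in Lemma 6.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
A = (rng.random((N, N)) < 0.1).astype(float); A = np.triu(A, 1); A = A + A.T
A_t = A + np.eye(N); d = A_t.sum(1)
gL = np.diag(d ** -0.5) @ A_t @ np.diag(d ** -0.5)      # symmetric normalized filter
x = rng.standard_normal(N); x /= np.abs(x).max()        # scalar features, kept normalized
y = rng.standard_normal(N)
conv = gL @ x                                           # graph-convolved features

def sgd_run(conv, y, order, T=200, eta=0.1):
    """Single-parameter GCNN f(x_i, theta) = tanh(conv_i * theta) with squared loss,
    trained by SGD using a fixed sample order (the same sequence for both runs)."""
    theta = 0.0
    for t in range(T):
        i = order[t % len(order)]
        pred = np.tanh(conv[i] * theta)
        grad = 2.0 * (pred - y[i]) * (1.0 - pred ** 2) * conv[i]
        theta -= eta * grad
    return theta

order = list(range(N)); rng.shuffle(order)              # identical SGD sample sequence
y_pert = y.copy(); y_pert[0] += 1.0                     # S^i: one label surgically altered
print(abs(sgd_run(conv, y, order) - sgd_run(conv, y_pert, order)))  # |theta_T - theta'_T|
```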

Proof Part I (Single Layer GCNN Bound): We first bound the expected loss difference by separating the factor due to the graph convolution operation from the expected difference in the filter weight parameters learned via SGD on the two datasets $S$ and $S^{i}$.

Let $\theta_T$ and $\theta'_T$ represent the final GCNN filter weights learned on the training sets $S$ and $S^{i}$, respectively, and define $\Delta\theta_T := |\theta_T - \theta'_T|$. Using the facts that the loss and activation functions are Lipschitz-continuous and that $f(\mathbf{x}, \theta) = \sigma\big(\sum_{j \in \mathcal{N}(\mathbf{x})} e_{\mathbf{x}j}\,\mathbf{x}_j\,\theta\big)$, we have,

(3)   $\big|\mathbb{E}_{\mathrm{SGD}}[\ell(f(\mathbf{x}, \theta_T), y)] - \mathbb{E}_{\mathrm{SGD}}[\ell(f(\mathbf{x}, \theta'_T), y)]\big| \;\le\; \alpha_\ell\,\alpha_\sigma\,g_\lambda\;\mathbb{E}_{\mathrm{SGD}}\big[\Delta\theta_T\big],$

where $g_\lambda$ is defined as $g_\lambda := \sup_{\mathbf{x}} \big|\sum_{j \in \mathcal{N}(\mathbf{x})} e_{\mathbf{x}j}\,\mathbf{x}_j\big|$. We will bound $g_\lambda$ in terms of the largest absolute eigenvalue of the graph convolution filter later. Note that $\sum_{j \in \mathcal{N}(\mathbf{x})} e_{\mathbf{x}j}\,\mathbf{x}_j$ is nothing but a graph convolution operation. As such, reducing $g_\lambda$ will be the contributing factor in improving the generalization performance.

Proof Part II (SGD Based Bounds For GCNN Weights): What remains is to bound $\mathbb{E}_{\mathrm{SGD}}[\Delta\theta_T]$, which accounts for the randomness inherent in SGD. This is proved through a series of three lemmas. We first note that, on a given training set $S$, a GCNN minimizes the following objective function,

(4)   $\min_{\theta}\;\frac{1}{m}\sum_{j=1}^{m} \ell\big(f(\mathbf{x}_j, \theta), y_j\big).$

For this, at each iteration $t$, SGD performs the following update:

(5)   $\theta_{t+1} = \theta_t - \eta\,\nabla_{\theta}\,\ell\big(f(\mathbf{x}_{i_t}, \theta_t), y_{i_t}\big),$

where $\eta > 0$ is the learning rate and $z_{i_t} = (\mathbf{x}_{i_t}, y_{i_t})$ is the sample picked at iteration $t$.

Given the two sequences of weight parameters, $(\theta_0, \ldots, \theta_T)$ and $(\theta'_0, \ldots, \theta'_T)$, learned by the GCNN running SGD on $S$ and $S^{i}$, respectively, we first find a bound on $\Delta\theta_{t+1} = |\theta_{t+1} - \theta'_{t+1}|$ at each iteration step $t$ of SGD.

There are two scenarios to consider. 1) At step $t$, SGD picks a sample which is identical in $S$ and $S^{i}$; this occurs with probability $1 - \frac{1}{m}$. From Equation (5), we then have $\Delta\theta_{t+1} \le \Delta\theta_t + \eta\,\big|\nabla_\theta\,\ell(f(\mathbf{x}_{i_t}, \theta_t), y_{i_t}) - \nabla_\theta\,\ell(f(\mathbf{x}_{i_t}, \theta'_t), y_{i_t})\big|$; we bound this term in Lemma 4 below. 2) At step $t$, SGD picks the only sample on which $S$ and $S^{i}$ differ; this occurs with probability $\frac{1}{m}$. Then $\Delta\theta_{t+1} \le \Delta\theta_t + \eta\,\big|\nabla_\theta\,\ell(f(\mathbf{x}_{i}, \theta_t), y_{i}) - \nabla_\theta\,\ell(f(\mathbf{x}'_{i}, \theta'_t), y'_{i})\big|$; we bound this second term in Lemma 5 below.

Lemma 4 (GCNN Same Sample Loss Stability Bound).

The difference in the loss derivatives of (single-layer) GCNN models trained with the SGD algorithm for $T$ iterations on the two training datasets $S$ and $S^{i}$, respectively, with respect to the same sample $z = (\mathbf{x}, y)$, satisfies at every iteration $t \le T$,
$\big|\nabla_\theta\,\ell(f(\mathbf{x}, \theta_t), y) - \nabla_\theta\,\ell(f(\mathbf{x}, \theta'_t), y)\big| \;\le\; \big(\nu_\ell\,\alpha_\sigma^{2} + \alpha_\ell\,\nu_\sigma\big)\,g_\lambda^{2}\;\Delta\theta_t.$

Proof: The first-order derivative of the single-layer GCNN output function $f(\mathbf{x}, \theta)$ with respect to $\theta$ is given by,

(6)   $\nabla_{\theta} f(\mathbf{x}, \theta) = \sigma'\Big(\sum_{j \in \mathcal{N}(\mathbf{x})} e_{\mathbf{x}j}\,\mathbf{x}_j\,\theta\Big)\,\Big(\sum_{j \in \mathcal{N}(\mathbf{x})} e_{\mathbf{x}j}\,\mathbf{x}_j\Big),$

where $\sigma'(\cdot)$ is the first-order derivative of the activation function.

Using Equation (6) and the facts that the loss and activation functions are Lipschitz-continuous and smooth, we have,

This completes the proof of Lemma 4.

Note: Without the smoothness assumption, it would not be possible to derive the above bound in terms of $\Delta\theta_t$, which is necessary for showing uniform stability. Unfortunately, this constraint excludes the ReLU activation from our analysis.

Lemma 5 (GCNN Different Sample Loss Stability Bound).

The difference in the loss derivatives of (single-layer) GCNN models trained with the SGD algorithm for $T$ iterations on the two training datasets $S$ and $S^{i}$, respectively, with respect to the differing samples $z_i = (\mathbf{x}_i, y_i)$ and $z'_i = (\mathbf{x}'_i, y'_i)$, satisfies at every iteration $t \le T$,
$\big|\nabla_\theta\,\ell(f(\mathbf{x}_i, \theta_t), y_i) - \nabla_\theta\,\ell(f(\mathbf{x}'_i, \theta'_t), y'_i)\big| \;\le\; 2\,\alpha_\ell\,\alpha_\sigma\,g_\lambda.$

Proof: Again using Equation (6), the fact that the loss and activation functions are Lipschitz-continuous, and the triangle inequality $|a - b| \le |a| + |b|$ for any $a, b$, we have,

(7)   $\big|\nabla_\theta\,\ell(f(\mathbf{x}_i, \theta_t), y_i) - \nabla_\theta\,\ell(f(\mathbf{x}'_i, \theta'_t), y'_i)\big| \;\le\; \big|\nabla_\theta\,\ell(f(\mathbf{x}_i, \theta_t), y_i)\big| + \big|\nabla_\theta\,\ell(f(\mathbf{x}'_i, \theta'_t), y'_i)\big| \;\le\; 2\,\alpha_\ell\,\alpha_\sigma\,g_\lambda.$

This completes the proof of Lemma 5.

Summing over all iteration steps, and taking expectations over all possible sample sequences drawn from $S$ and $S^{i}$, we have:

Lemma 6 (GCNN SGD Stability Bound).

Let the loss and activation functions be Lipschitz-continuous and smooth. Let $\theta_T$ and $\theta'_T$ denote the graph filter parameters of (single-layer) GCNN models trained using SGD for $T$ iterations on the two training datasets $S$ and $S^{i}$, respectively. Then the expected difference in the filter parameters is bounded by,
$\mathbb{E}_{\mathrm{SGD}}\big[\Delta\theta_T\big] \;\le\; \frac{2\,\eta\,\alpha_\ell\,\alpha_\sigma\,g_\lambda}{m}\,\sum_{t=1}^{T}\Big(1 + \eta\,\big(\nu_\ell\,\alpha_\sigma^{2} + \alpha_\ell\,\nu_\sigma\big)\,g_\lambda^{2}\Big)^{t-1}.$

Proof: From Equation (5), and taking into account the probabilities of the two scenarios considered in Lemma 4 and Lemma 5 at step $t$, we have,

(8)   $\mathbb{E}\big[\Delta\theta_{t+1}\big] \;\le\; \Big(1 - \tfrac{1}{m}\Big)\,\mathbb{E}\Big[\Delta\theta_t + \eta\,\big|\nabla_\theta\,\ell(f(\mathbf{x}_{i_t}, \theta_t), y_{i_t}) - \nabla_\theta\,\ell(f(\mathbf{x}_{i_t}, \theta'_t), y_{i_t})\big|\Big] + \tfrac{1}{m}\,\mathbb{E}\Big[\Delta\theta_t + \eta\,\big|\nabla_\theta\,\ell(f(\mathbf{x}_{i}, \theta_t), y_{i}) - \nabla_\theta\,\ell(f(\mathbf{x}'_{i}, \theta'_t), y'_{i})\big|\Big].$

Plugging the bounds from Lemma 4 and Lemma 5 into Equation (8), we have,
$\mathbb{E}\big[\Delta\theta_{t+1}\big] \;\le\; \Big(1 + \eta\,\big(\nu_\ell\,\alpha_\sigma^{2} + \alpha_\ell\,\nu_\sigma\big)\,g_\lambda^{2}\Big)\,\mathbb{E}\big[\Delta\theta_t\big] + \frac{2\,\eta\,\alpha_\ell\,\alpha_\sigma\,g_\lambda}{m}.$

Lastly, solving this first-order recursion with $\Delta\theta_0 = 0$ yields,
$\mathbb{E}\big[\Delta\theta_T\big] \;\le\; \frac{2\,\eta\,\alpha_\ell\,\alpha_\sigma\,g_\lambda}{m}\,\sum_{t=1}^{T}\Big(1 + \eta\,\big(\nu_\ell\,\alpha_\sigma^{2} + \alpha_\ell\,\nu_\sigma\big)\,g_\lambda^{2}\Big)^{t-1}.$

This completes the proof of Lemma 6.

Bound on $g_\lambda$: We now bound $g_\lambda$ in terms of the largest absolute eigenvalue of the graph filter matrix $g(L)$. We first note that at each node $i$, the ego-graph $G_i$ can be represented as a sub-matrix of $g(L)$. Let $g(L)_{G_i}$ be the submatrix of $g(L)$ whose row and column indices are from the set $\mathcal{N}(i)$. The ego-graph size is $|\mathcal{N}(i)|$. We use $\mathbf{x}_{G_i}$ to denote the graph signals (node features) on the ego-graph $G_i$. Without loss of generality, we will assume that node $i$ is represented by index $0$ in $G_i$. Thus, we can compute $f(\mathbf{x}_i, \theta) = \sigma\big([\,g(L)_{G_i}\,\mathbf{x}_{G_i}]_0\,\theta\big)$, a scalar value. Here $[\cdot]_0$ represents the value of a vector at index $0$, i.e., corresponding to node $i$. Then the following holds (assuming the graph signals are normalized, i.e., $\|\mathbf{x}_{G_i}\| \le 1$),

(9)   $\big|[\,g(L)_{G_i}\,\mathbf{x}_{G_i}]_0\big| \;\le\; \big\|g(L)_{G_i}\,\mathbf{x}_{G_i}\big\| \;\le\; \big\|g(L)_{G_i}\big\|_2\,\big\|\mathbf{x}_{G_i}\big\| \;\le\; \lambda^{G_i}_{\max},$

where the second inequality follows from the Cauchy–Schwarz inequality, $\|\cdot\|_2$ is the matrix operator norm, and $\|g(L)_{G_i}\|_2$ equals the largest singular value of the matrix $g(L)_{G_i}$. For a normal matrix (such as a symmetric graph filter $g(L)_{G_i}$), $\|g(L)_{G_i}\|_2 = \lambda^{G_i}_{\max}$, the largest absolute eigenvalue of $g(L)_{G_i}$.
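The chain of inequalities in Equation (9) can be verified numerically for a random symmetric submatrix; the sketch below is only a sanity check under the stated normalization assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
M = rng.standard_normal((n, n)); M = (M + M.T) / 2.0   # a symmetric stand-in for g(L)_{G_i}
x = rng.standard_normal(n); x /= np.linalg.norm(x)     # normalized graph signal, ||x|| <= 1

lhs = abs((M @ x)[0])                                  # |[g(L)_{G_i} x_{G_i}]_0|
op_norm = np.linalg.norm(M, 2)                         # largest singular value of M
lam_max = np.max(np.abs(np.linalg.eigvalsh(M)))        # largest absolute eigenvalue (M symmetric)
print(lhs <= op_norm + 1e-12, np.isclose(op_norm, lam_max))   # both print True
```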

Lemma 7 (Ego-Graph Eigenvalue Bound).

Let $G$ be an (un)directed graph with a (either symmetric or non-negative) weighted adjacency matrix, and let $\lambda^{G}_{\max}$ be the maximum absolute eigenvalue of this matrix. Let $G_i$ be the ego-graph of a node $i \in G$ with corresponding maximum absolute eigenvalue $\lambda^{G_i}_{\max}$. Then the following eigenvalue (singular value) bound holds for every $i$,
$\lambda^{G_i}_{\max} \;\le\; \lambda^{G}_{\max}.$

Proof: Notice that the adjacency matrix of $G_i$ is a principal submatrix of the adjacency matrix of $G$. As a result, the above bound follows from the eigenvalue interlacing theorem for normal/Hermitian matrices and their principal submatrices (Laffey and Šmigoc, 2008; Haemers, 1995).
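A quick numerical illustration of Lemma 7: every ego-graph submatrix of a random symmetric adjacency matrix has its largest absolute eigenvalue bounded by that of the full matrix. The graph model and its size are arbitrary choices made only for this check.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20
A = (rng.random((N, N)) < 0.2).astype(float); A = np.triu(A, 1); A = A + A.T
lam_G = np.max(np.abs(np.linalg.eigvalsh(A)))          # largest absolute eigenvalue of A

for i in range(N):
    idx = np.r_[i, np.flatnonzero(A[i])]               # node i plus its 1-hop neighbors
    sub = A[np.ix_(idx, idx)]                          # ego-graph adjacency: a principal submatrix
    assert np.max(np.abs(np.linalg.eigvalsh(sub))) <= lam_G + 1e-10
print("all ego-graph eigenvalues bounded by", lam_G)
```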

Finally, plugging $g_\lambda \le \lambda^{G}_{\max}$ (which follows from Equation (9) and Lemma 7, since the bound holds uniformly over all ego-graphs) and Lemma 6 into Equation (3) yields the remaining result,
$2\beta_m \;\le\; \frac{2\,\eta\,\alpha_\ell^{2}\,\alpha_\sigma^{2}\,\big(\lambda^{G}_{\max}\big)^{2}}{m}\,\sum_{t=1}^{T}\Big(1 + \eta\,\big(\nu_\ell\,\alpha_\sigma^{2} + \alpha_\ell\,\nu_\sigma\big)\,\big(\lambda^{G}_{\max}\big)^{2}\Big)^{t-1},$
which is of order $\frac{1}{m}\,\mathcal{O}\big((\lambda^{G}_{\max})^{2T}\big)$.

This completes the full proof of Theorem 3.

4. Revisiting Graph Convolutional Neural Network Architecture

(a) Generalization Gap on Citeseer Dataset
(b) Generalization Gap on Cora Dataset
(c) Generalization Gap on Pubmed Dataset
Figure 1. The above figures show the generalization gap for the three datasets. The generalization gap is measured with respect to the loss function, i.e., |training error - test error|. In this experiment, the cross-entropy loss is used.
(a) Parameter Norm Difference on Citeseer
(b) Parameter Norm Difference on Cora
(c) Parameter Norm Difference on Pubmed
Figure 2. The above figures show the divergence in the weight parameters of a single layer GCNN, measured using the $\ell_2$ norm, on the three datasets. We surgically alter one sample point at a fixed index in the training set $S$ to generate $S^{i}$ and run the SGD algorithm.

In this section, we discuss the implications of our results for designing graph convolution filters, and revisit the importance of employing batch-normalization layers in a GCNN architecture.

Unnormalized Graph Filters: One of the most popular graph convolution filters is $g(L) = A + I$ (Xu et al., 2018). The eigen spectrum of this unnormalized filter is bounded by the maximal node degree, i.e., $\lambda^{G}_{\max} \le d_{\max} + 1$, where $d_{\max}$ is the maximum node degree. This is concerning: as $d_{\max}$ grows towards $N$, $\lambda^{G}_{\max}$ can grow as $\mathcal{O}(N)$. As a result, the generalization gap of such a GCNN model is not guaranteed to converge.

Normalized Graph Filters: Numerical instabilities with the unnormalized adjacency matrix have already been suspected in (Kipf and Welling, 2016a). Therefore, the symmetric normalized graph filter has been adopted: $g(L) = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, where $\tilde{A} = A + I$ and $\tilde{D}$ is the degree matrix of $\tilde{A}$. The eigen spectrum of this filter is bounded within $[-1, 1]$. As a result, such a GCNN model is uniformly stable (assuming that the graph features are also normalized appropriately, e.g., $\|\mathbf{x}_i\| \le 1$).

Random Walk Graph Filters: Another graph filter that has been widely used is based on random walks, $g(L) = D^{-1}A$ (Puy et al., 2017). The eigenvalues of this filter are spread out in the interval $[-1, 1]$, and thus such a GCNN model is also uniformly stable.
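The contrast between these filter families can be observed by computing $\lambda^{G}_{\max}$ on random graphs of increasing size; the sketch below uses Erdős–Rényi graphs purely as an illustration, and the growth rate depends on the assumed edge density.

```python
import numpy as np

def lam_max(M):
    """Largest absolute eigenvalue of a filter matrix."""
    return np.max(np.abs(np.linalg.eigvals(M)))

rng = np.random.default_rng(4)
for N in (50, 200, 800):
    A = (rng.random((N, N)) < 0.1).astype(float); A = np.triu(A, 1); A = A + A.T
    A_t = A + np.eye(N); d = A_t.sum(1)
    unnorm = A_t                                         # g(L) = A + I
    sym = np.diag(d ** -0.5) @ A_t @ np.diag(d ** -0.5)  # symmetric normalized filter
    rw = np.diag(1.0 / d) @ A_t                          # random-walk filter
    print(N, round(lam_max(unnorm), 2), round(lam_max(sym), 2), round(lam_max(rw), 2))
# The unnormalized filter's lambda_max grows with N (roughly the expected degree),
# while the normalized and random-walk filters stay bounded by about 1.
```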

Importance of Batch-Normalization in GCNN: Recall that $g_\lambda \le \lambda^{G}_{\max}$, and notice that in Equation (9) we assume that the graph signals are normalized, i.e., $\|\mathbf{x}_{G_i}\| \le 1$, in order to bound $g_\lambda$. This can easily be accomplished by normalizing the features during the data pre-processing phase for a single layer GCNN. However, for a multi-layer GCNN, the intermediate feature outputs are not guaranteed to remain normalized. Thus, to ensure stability, it is crucial to employ batch-normalization layers in GCNN models. This has already been reported in (Xu et al., 2018) as an important factor in keeping the GCNN outputs stable.
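The sketch below illustrates why intermediate normalization matters in deeper stacks: without a per-layer re-normalization (used here as a simple stand-in for batch normalization), the hidden node features no longer satisfy the $\|\mathbf{x}_i\| \le 1$ assumption made in Equation (9). The architecture, depth and weight scaling are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 100, 16
A = (rng.random((N, N)) < 0.1).astype(float); A = np.triu(A, 1); A = A + A.T
gL = A + np.eye(N)                                         # unnormalized filter, for illustration
H = rng.standard_normal((N, d))
H /= np.linalg.norm(H, axis=1, keepdims=True)              # input features start with unit norm

def renorm(H):
    """Stand-in for batch/feature normalization: rescale each node's features to unit norm."""
    return H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-9)

H_plain, H_norm = H.copy(), H.copy()
for layer in range(4):
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    H_plain = np.tanh(gL @ H_plain @ W)                    # norms are no longer guaranteed <= 1
    H_norm = renorm(np.tanh(gL @ H_norm @ W))              # re-normalized features keep ||x_i|| <= 1
    print(layer, np.linalg.norm(H_plain, axis=1).max(), np.linalg.norm(H_norm, axis=1).max())
```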

5. Experimental Evaluation

In this section, we empirically evaluate the effect of graph filters on the GCNN stability bounds using four different GCNN filters. We employ three citation network datasets: Citeseer, Cora and Pubmed (see (Kipf and Welling, 2016a) for details about the datasets).

Experimental Setup: We extract the $1$-hop ego-graph of each node in a given dataset to create samples, and normalize the node features such that $\|\mathbf{x}_i\| \le 1$ in the data pre-processing step. We run the SGD algorithm with a fixed learning rate and batch size for a fixed number of epochs on all datasets. We employ ELU (with its parameter set to $1$) as the activation function and cross-entropy as the loss function.

Measuring Generalization Gap: In this experiment, we quantitatively measure the generalization gap, defined as the absolute difference between the training and test errors. From Figure 1, it is clear that unnormalized graph convolution filters such as $A + I$ show a significantly higher generalization gap than normalized ones such as $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ or random walk based graph filters. The results hold consistently across the three datasets. We note that the generalization gap becomes constant after a certain number of iterations. While this phenomenon is not reflected in our bounds, it can plausibly be explained by treating the bounding parameters as variables (i.e., as functions of the SGD iterations). This hints at the pessimistic nature of our bounds.

Measuring GCNN Learned Filter-Parameter Stability Under the SGD Optimizer: In this experiment, we evaluate the difference between the learned weight parameters of two single layer GCNN models trained on datasets $S$ and $S^{i}$ which differ in precisely one sample point. We generate $S^{i}$ by surgically altering a single sample point of $S$ at a fixed node index. For this experiment, we initialize the GCNN models on both datasets with the same parameters and random seeds, and then run the SGD algorithm. After each epoch, we measure the $\ell_2$ norm difference between the weight parameters of the respective models. From Figure 2, it is evident that for the unnormalized graph convolution filters the weight parameters tend to deviate by a large amount, and therefore the network is less stable, whereas for the normalized graph filters the norm difference converges quickly to a fixed value. These empirical observations are reinforced by our stability bounds. However, the decreasing trend in the norm difference after a certain number of iterations, before convergence, remains unexplained, owing to the pessimistic nature of our bounds.

6. Conclusion and Future Work

We have taken the first steps towards establishing a deeper theoretical understanding of GCNN models by analyzing their stability and establishing their generalization guarantees. More specifically, we have shown that the algorithmic stability of GCNN models depends upon the largest absolute eigenvalue of their graph convolution filters. To ensure uniform stability, and thereby generalization guarantees, the largest absolute eigenvalue must be independent of the graph size. Our results shed new light on the design of new and improved graph convolution filters with guaranteed algorithmic stability. Furthermore, applying our results to existing GCNN models, we provide a theoretical justification for the importance of employing batch-normalization in a GCNN architecture. We have also conducted empirical evaluations based on real-world datasets which support our theoretical findings. To the best of our knowledge, we are the first to study stability bounds on graph learning in a semi-supervised setting and derive generalization bounds for GCNN models.

As part of our ongoing and future work, we will extend our analysis to multi-layer GCNN models. For a multi-layer GCNN, we need to bound the difference in weights at each layer according to the back-propagation algorithm; therefore, the main technical challenge is to study the stability of the full-fledged back-propagation algorithm. Furthermore, we plan to study the stability and generalization properties of non-localized convolution filters designed based on rational polynomials of the graph Laplacian. We also plan to generalize our analysis framework beyond semi-supervised learning to provide generalization guarantees in learning settings where multiple graphs are present, e.g., for graph classification.

References

  • Agarwal and Niyogi (2005) Shivani Agarwal and Partha Niyogi. 2005. Stability and generalization of bipartite ranking algorithms. In International Conference on Computational Learning Theory. Springer, 32–47.
  • Agarwal and Niyogi (2009) Shivani Agarwal and Partha Niyogi. 2009. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research 10, Feb (2009), 441–474.
  • Ando and Zhang (2007) Rie K Ando and Tong Zhang. 2007. Learning on graph with Laplacian regularization. In Advances in neural information processing systems. 25–32.
  • Atwood and Towsley (2016) James Atwood and Don Towsley. 2016. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems. 1993–2001.
  • Belkin et al. (2004) Mikhail Belkin, Irina Matveeva, and Partha Niyogi. 2004. Regularization and semi-supervised learning on large graphs. In International Conference on Computational Learning Theory. Springer, 624–638.
  • Blumer et al. (1989) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. 1989. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM) 36, 4 (1989), 929–965.
  • Bousquet and Elisseeff (2002) Olivier Bousquet and André Elisseeff. 2002. Stability and generalization. Journal of Machine Learning Research 2, Mar (2002), 499–526.
  • Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
  • Cohen and Shashua (2016) Nadav Cohen and Amnon Shashua. 2016. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning. 955–963.
  • Cortes et al. (2008) Corinna Cortes, Mehryar Mohri, Dmitry Pechyony, and Ashish Rastogi. 2008. Stability of transductive regression algorithms. In Proceedings of the 25th international conference on Machine learning. ACM, 176–183.
  • Dai et al. (2016) Hanjun Dai, Bo Dai, and Le Song. 2016. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning. 2702–2711.
  • Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3837–3845.
  • Delalleau and Bengio (2011) Olivier Delalleau and Yoshua Bengio. 2011. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems. 666–674.
  • Dernbach et al. (2018) Stefan Dernbach, Arman Mohseni-Kabir, Siddharth Pal, and Don Towsley. 2018. Quantum Walk Neural Networks for Graph-Structured Data. In International Workshop on Complex Networks and their Applications. Springer, 182–193.
  • Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems. 2224–2232.
  • Eldan and Shamir (2016) Ronen Eldan and Ohad Shamir. 2016. The power of depth for feedforward neural networks. In Conference on Learning Theory. 907–940.
  • Elisseeff et al. (2005) Andre Elisseeff, Theodoros Evgeniou, and Massimiliano Pontil. 2005. Stability of randomized learning algorithms. Journal of Machine Learning Research 6, Jan (2005), 55–79.
  • Fey et al. (2018) Matthias Fey, Jan Eric Lenssen, Frank Weichert, and Heinrich Müller. 2018. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 869–877.
  • García-Durán and Niepert (2017) Alberto García-Durán and Mathias Niepert. 2017. Learning Graph Representations with Embedding Propagation. arXiv preprint arXiv:1710.03059 (2017).
  • Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212 (2017).
  • Haemers (1995) Willem H Haemers. 1995. Interlacing eigenvalues and graphs. Linear Algebra and its applications 226 (1995), 593–616.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
  • Hardt et al. (2015) Moritz Hardt, Benjamin Recht, and Yoram Singer. 2015. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240 (2015).
  • Haussler (1990) David Haussler. 1990. Probably approximately correct learning. University of California, Santa Cruz, Computer Research Laboratory.
  • Henaff et al. (2015) Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015).
  • Kawaguchi et al. (2017) Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. 2017. Generalization in deep learning. arXiv preprint arXiv:1710.05468 (2017).
  • Kipf and Welling (2016a) Thomas N Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Kipf and Welling (2016b) Thomas N Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
  • Laffey and Šmigoc (2008) Thomas J Laffey and Helena Šmigoc. 2008. Spectra of principal submatrices of nonnegative matrices. Linear Algebra Appl. 428, 1 (2008), 230–238.
  • Lei et al. (2017) Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2017. Deriving neural architectures from sequence and graph kernels. arXiv preprint arXiv:1705.09037 (2017).
  • Levie et al. (2017) Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. 2017. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. arXiv preprint arXiv:1705.07664 (2017).
  • Li et al. (2018) Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. 2018. Adaptive Graph Convolutional Neural Networks. arXiv preprint arXiv:1801.03226 (2018).
  • Mhaskar and Poggio (2016) Hrushikesh N Mhaskar and Tomaso Poggio. 2016. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications 14, 06 (2016), 829–848.
  • Mukherjee et al. (2006) Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. 2006. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics 25, 1-3 (2006), 161–193.
  • Neyshabur et al. (2017) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. 2017. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems. 5947–5956.
  • Puy et al. (2017) Gilles Puy, Srdan Kitic, and Patrick Pérez. 2017. Unifying local and non-local signal processing with graph CNNs. arXiv preprint arXiv:1702.07759 (2017).
  • Qi et al. (2017) Xiaojuan Qi, Renjie Liao, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. 2017. 3d graph neural networks for rgbd semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5199–5208.
  • Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593–607.
  • Shuman et al. (2013) David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30, 3 (2013), 83–98.
  • Song et al. (2018) Linfeng Song, Zhiguo Wang, Mo Yu, Yue Zhang, Radu Florian, and Daniel Gildea. 2018. Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040 (2018).
  • Such et al. (2017) Felipe Petroski Such, Shagan Sah, Miguel Alexander Dominguez, Suhas Pillai, Chao Zhang, Andrew Michael, Nathan D Cahill, and Raymond Ptucha. 2017. Robust spatial filtering with graph convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing 11, 6 (2017), 884–896.
  • Sun et al. (2014) Shiliang Sun, Zakria Hussain, and John Shawe-Taylor. 2014. Manifold-preserving graph reduction for sparse semi-supervised learning. Neurocomputing 124 (2014), 13–21.
  • Telgarsky (2016) Matus Telgarsky. 2016. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485 (2016).
  • Veličković et al. (2018) Petar Veličković, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2018. Deep graph infomax. arXiv preprint arXiv:1809.10341 (2018).
  • Xu et al. (2018) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How Powerful are Graph Neural Networks? arXiv preprint arXiv:1810.00826 (2018).
  • Ying et al. (2018) Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. 2018. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems. 4805–4815.
  • Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016).
  • Zhang et al. (2019) Zhihong Zhang, Dongdong Chen, Jianjia Wang, Lu Bai, and Edwin R Hancock. 2019. Quantum-based subgraph convolutional neural networks. Pattern Recognition 88 (2019), 38–49.

Appendix A Appendix

Proof of Theorem 2: To derive generalization bounds for uniformly stable randomized algorithms, we utilize McDiarmid's concentration inequality. Let $x_1, \ldots, x_m$ be independent random variables and let $f$ be a function satisfying the bounded-difference condition $\sup_{x_1, \ldots, x_m,\, x_i'} |f(x_1, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)| \le c_i$ for every $i$; then the inequality is given as,

(10)   $\Pr\big[f(x_1, \ldots, x_m) - \mathbb{E}\big[f(x_1, \ldots, x_m)\big] \ge \epsilon\big] \;\le\; \exp\Big(\frac{-2\epsilon^{2}}{\sum_{i=1}^{m} c_i^{2}}\Big).$

We will now derive some expressions that will be helpful in computing the quantities needed to apply McDiarmid's inequality.

Since the samples are i.i.d., we have

(11)

Using Equation (11) and renaming the variables, one can show that

(12)

Using Equation (12) and uniform stability, we obtain

(13)
(14)
(15)

Let .

Using Equations (14) and (15), we have

(16)