The Power of Graph Convolutional Networks to Distinguish Random Graph Models: Short Version
Graph convolutional networks (GCNs) are a widely used method for graph representation learning. We investigate the power of GCNs, as a function of their number of layers, to distinguish between different random graph models on the basis of the embeddings of their sample graphs. In particular, the graph models that we consider arise from graphons, which are the most general possible parameterizations of infinite exchangeable graph models and which are the central objects of study in the theory of dense graph limits. We exhibit an infinite class of graphons that are well-separated in terms of cut distance and are indistinguishable by a GCN with nonlinear activation functions coming from a certain broad class if its depth is at least logarithmic in the size of the sample graph. These results theoretically match empirical observations of several prior works. Finally, we show a converse result that for pairs of graphons satisfying a degree profile separation property, a very simple GCN architecture suffices for distinguishability. To prove our results, we exploit a connection to random walks on graphs.
In applications ranging from drug discovery and design to proteomics to neuroscience to social network analysis, inputs to machine learning methods take the form of graphs. In order to leverage the empirical success of deep learning and other methods that operate on vectors in finite-dimensional Euclidean spaces for supervised learning tasks in this domain, a plethora of graph representation learning schemes have been proposed and used. One particularly effective such method is the graph convolutional network (GCN) architecture [6, 7]. A graph convolutional network works by associating with each node of an input graph a vector of features and passing these node features through a sequence of layers, resulting in a final set of node vectors, called node embeddings. To generate a vector representing the entire graph, these final embeddings are sometimes averaged. Each layer of the network consists of a graph diffusion step, in which a node's feature vector is averaged with those of its neighbors; a feature transformation step, in which each node's vector is transformed by a weight matrix; and, finally, the application of an elementwise nonlinearity such as the ReLU or sigmoid function. The weight matrices are trained from data, so that the metric structure of the resulting embeddings is (one hopes) tailored to a particular classification task.
While GCNs and other graph representation learning methods have been successful in practice, numerous theoretical questions about their capabilities and the roles of their hyperparameters remain unexplored. In this paper, we give results on the ability of GCNs to distinguish between samples from different random graph models. We focus on the roles that the number of layers and the presence or absence of nonlinearity play. The random graph models that we consider are those that are parameterized by graphons , which are functions from the unit square to the interval that essentially encode edge density among a continuum of vertices. Graphons are the central objects of study in the theory of dense graph limits and, by the Aldous-Hoover theorem  exactly parameterize the class of infinite exchangeable random graph models – those models whose samples are invariant in distribution under permutation of vertices.
I-A Prior Work
A survey of modern graph representation learning methods is provided in . Graph convolutional networks were first introduced in , and many variants have since been proposed. For instance, the polynomial convolutional filters of the original work were replaced by linear convolutions in . The authors of  modified the original architecture to include gated recurrent units for working with dynamic graphs. These and other variants have been used in a range of applications, e.g., [11, 12, 13, 14].
Theoretical work on GCNs has proceeded from a variety of perspectives. In , the authors investigated the generalization and stability properties of GCNs. Several works, including [16, 17, 18], have drawn connections between the representational capabilities of GCNs and the distinguishing power of the Weisfeiler-Lehman (WL) algorithm for graph isomorphism testing . These papers implicitly study the injectivity properties of the mapping from graphs to vectors induced by GCNs. However, they do not address its metric/analytic properties, which are important to performance in representation learning . Finally, at least one work has considered the performance of untrained GCNs on community detection . The authors of that paper provide a heuristic calculation based on the mean-field approximation from statistical physics and demonstrate through numerical experiments the ability of untrained GCNs to detect the presence of clusters and to recover the ground-truth community assignments of vertices in the stochastic block model. They empirically show that the regime of graph model parameters in which an untrained GCN is successful at this task agrees well with the analytically derived detection threshold. The authors also conjecture that training GCNs does not significantly affect their community detection performance.
I-B Our Contributions
We first establish a convergence result for GCN embedding vectors, which yields a lower bound on the probability of error of any test that attempts to distinguish between two graphons on the basis of slightly perturbed $k$-layer GCN embedding matrices of sample graphs of size $n$, provided that $k$ is at least logarithmic in $n$. In particular, we exhibit a family of pairs of graphons that are hard for any test to distinguish on the basis of these embeddings. This is the content of Theorems 1 and 2.
We then show a converse achievability result in Theorem 3 that says, roughly, that provided that the number of layers is sufficiently large (at least logarithmic in $n$), there exists a linear GCN architecture with a very simple sequence of weight matrices and a choice of initial embedding matrix such that pairs of graphons whose expected degree statistics differ by a sufficiently large amount are distinguishable from the noise-perturbed GCN embeddings of their sample graphs. In other words, the family of difficult-to-distinguish graphons alluded to above is essentially the only sort of case in which a nonlinear GCN architecture could be necessary (though, as Theorem 2 shows, for several choices of activation functions, these graphons remain indistinguishable).
Our proofs rely on concentration of measure results and techniques from the theory of Markov chain mixing times and spectral graph theory .
Relations between probability of error lower and upper bounds
Our probability of error lower bounds give theoretical backing to a phenomenon that has been observed empirically in graph classification problems: adding arbitrarily many layers (more than order $\log n$) to a GCN can substantially degrade classification performance. This is an implication of Theorem 2. On the other hand, Theorem 3 shows that this is not always the case: for many pairs of graphons, adding more layers improves classification performance. We suspect that the set of pairs of graphons for which adding arbitrarily many layers does not help forms a set of measure zero, though this does not imply that such examples never arise in practice.
The factor that determines whether adding layers will improve or degrade the performance of a GCN in distinguishing between two graphons $W_1$ and $W_2$ is the distance between the stationary distributions of the random walks on the sample graphs from $W_1$ and $W_2$. This, in turn, is determined by the normalized degree profiles of the sample graphs.
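To make the role of the stationary distribution concrete, here is a small numerical check (our own illustration, not code from the paper): for a connected undirected graph, the stationary distribution of the random walk is the normalized degree sequence, so graphs with matching normalized degree profiles drive repeated diffusion toward the same limit.

```python
import numpy as np

# The stationary distribution of the random walk on an undirected graph is
# proportional to the degree sequence: pi_i = deg(i) / (sum of degrees).
def stationary_distribution(A):
    d = A.sum(axis=1)
    return d / d.sum()

# A small connected, non-bipartite graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)      # random walk matrix
pi = stationary_distribution(A)           # [0.2, 0.3, 0.3, 0.2]
assert np.allclose(pi @ P, pi)            # pi is a left fixed point of P
mu = np.array([1.0, 0.0, 0.0, 0.0])       # arbitrary starting distribution
for _ in range(100):
    mu = mu @ P                           # run the walk
assert np.allclose(mu, pi, atol=1e-6)     # the walk mixes to pi
```

Since the limit of the diffusion depends only on the degree profile, two sample graphs with matching normalized degree profiles become hard to tell apart once the number of diffusion steps exceeds the mixing time.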
An extended version of this paper is available on arXiv .
II Notation and Model
II-A Graph Convolutional Networks
We start by defining the model and relevant notation. A $k$-layer graph convolutional network (GCN) is a function mapping graphs to vectors over $\mathbb{R}^d$. It is parameterized by a sequence of weight matrices $W_j \in \mathbb{R}^{d \times d}$, $1 \leq j \leq k$, where $d$ is the embedding dimension, a hyperparameter. From an input graph $G$ on $n$ vertices with adjacency matrix $A$ and random walk matrix $P$ (i.e., $P$ is $A$ with every row normalized by the sum of its entries), and starting with an initial embedding matrix $M_0 \in \mathbb{R}^{n \times d}$, the $j$th embedding matrix $M_j$ is defined as follows:

$$M_j = \sigma\left( P M_{j-1} W_j \right), \qquad (1)$$

where $\sigma$ is a fixed nonlinear activation function and is applied element-wise to an input matrix. An embedding vector $v \in \mathbb{R}^d$ is then produced by averaging the rows of $M_k$:

$$v = \frac{1}{n} \sum_{i=1}^{n} (M_k)_{i, \cdot}.$$
Typical examples of activation functions in neural network and GCN contexts include the ReLU, sigmoid, and hyperbolic tangent functions. Empirical work has given evidence that the performance of GCNs on certain classification tasks is unaffected by replacing the nonlinear activation functions with the identity . Our results lend theoretical credence to this observation.
Frequently, $P$ is replaced by either the normalized adjacency matrix $D^{-1/2} A D^{-1/2}$, where $D$ is the diagonal matrix with the degrees of the vertices of the graph on the diagonal, or some variant of the Laplacian matrix $L = D - A$. For simplicity, in this paper we will consider only the choice of $P$.
The defining equation (1) has the following interpretation: multiplication on the left by $P$ has the effect of replacing each node's embedding vector with the average of its neighbors' vectors. Multiplication on the right by the weight matrix $W_j$ has the effect of replacing each coordinate (corresponding to a feature) of each node embedding vector with a linear combination of the values of the node's features from the previous layer.
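The layer computation just described can be sketched in a few lines (a minimal illustration in our own notation, not the paper's implementation). The final assertion also checks the diffusion-only behavior underlying the random-walk viewpoint: with identity weights and no nonlinearity, deep diffusion on a regular graph drives every row toward the common row average.

```python
import numpy as np

# One GCN layer: diffuse by the random walk matrix P, mix features with W_j,
# apply an elementwise nonlinearity; the graph vector is the row average.
def gcn_embedding(A, M0, weights, sigma=np.tanh):
    P = A / A.sum(axis=1, keepdims=True)   # random walk matrix
    M = M0
    for W in weights:
        M = sigma(P @ M @ W)               # equation (1): M_j = sigma(P M_{j-1} W_j)
    return M.mean(axis=0)                  # average the node embeddings

rng = np.random.default_rng(0)
n, d, k = 6, 3, 4
A = np.ones((n, n)) - np.eye(n)            # complete graph K_6 (5-regular)
M0 = rng.standard_normal((n, d))
v = gcn_embedding(A, M0, [0.1 * rng.standard_normal((d, d)) for _ in range(k)])
assert v.shape == (d,)

# Diffusion alone (no weights, no nonlinearity) forgets everything except the
# stationary average: on a regular graph all rows converge to the column
# means of M0.
M = M0.copy()
P = A / A.sum(axis=1, keepdims=True)
for _ in range(60):
    M = P @ M
assert np.allclose(M, M0.mean(axis=0), atol=1e-10)
```
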
II-B Graphons

In order to probe the ability of GCNs to distinguish random graph models from samples, we consider the task of distinguishing random graph models induced by graphons. A graphon is a symmetric, Lebesgue-measurable function $W : [0,1]^2 \to [0,1]$. To each graphon $W$ is associated a natural exchangeable random graph model as follows: to generate a graph on $n$ vertices, one chooses $n$ points $x_1, \dots, x_n$ uniformly at random from $[0,1]$. The edge between vertices $i$ and $j$ is present with probability $W(x_i, x_j)$, independently of all other edge events. We use the notation $G \sim W$ to denote that $G$ is a random sample graph from the model induced by $W$. The number of vertices will be clear from context.
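The sampling procedure just described can be sketched as follows (the example graphon is our own choice, not one from the paper):

```python
import numpy as np

# Sample an n-vertex graph from a graphon W: draw latent points x_i ~ Unif[0,1],
# then include each edge {i, j} independently with probability W(x_i, x_j).
def sample_from_graphon(W, n, rng):
    x = rng.uniform(size=n)
    probs = W(x[:, None], x[None, :])
    U = rng.uniform(size=(n, n))
    A = np.triu((U < probs).astype(float), k=1)  # decide each pair once
    return A + A.T                               # symmetric, zero diagonal

W = lambda x, y: (x + y) / 2                     # example graphon on [0,1]^2
rng = np.random.default_rng(1)
A = sample_from_graphon(W, 500, rng)
assert np.allclose(A, A.T) and not np.any(np.diag(A))
# The edge density concentrates near the mean of W, here E[W(x, y)] = 1/2.
assert abs(A.sum() / (500 * 499) - 0.5) < 0.05
```
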
One commonly studied class of models that may be defined equivalently in terms of sampling from graphons is the class of stochastic block models. A stochastic block model on $n$ vertices with two blocks is parameterized by four quantities: $c \in (0,1)$ and $p_1, p_2, q \in [0,1]$. The two blocks of vertices have sizes $\lceil cn \rceil$ and $n - \lceil cn \rceil$, respectively. Edges between two vertices in block $i$, $i \in \{1, 2\}$, appear with probability $p_i$, independently of all other edges. Edges between vertices in block $1$ and in block $2$ appear independently with probability $q$. We will write this model as $\mathrm{SBM}(c, p_1, p_2, q)$, suppressing $n$.
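A direct sampler for the two-block model is a short sketch (parameter names mirror the description above; the concrete values are our own):

```python
import numpy as np

# Two-block SBM: intra-block edge probabilities p1, p2; inter-block probability q.
def sample_sbm(n1, n2, p1, p2, q, rng):
    block = np.array([0] * n1 + [1] * n2)
    same = block[:, None] == block[None, :]
    intra = np.where(block[:, None] == 0, p1, p2)  # p1 or p2 by row's block
    probs = np.where(same, intra, q)
    U = rng.uniform(size=probs.shape)
    A = np.triu((U < probs).astype(float), k=1)    # decide each pair once
    return A + A.T

rng = np.random.default_rng(2)
A = sample_sbm(150, 150, 0.6, 0.4, 0.2, rng)
assert A.shape == (300, 300) and np.allclose(A, A.T)
# The empirical intra-block density of block 1 concentrates near p1 = 0.6.
d1 = A[:150, :150].sum() / (150 * 149)
assert abs(d1 - 0.6) < 0.05
```
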
An important metric on graphons is the cut distance $\delta_\square$. It is induced by the cut norm, which is defined as follows: fix a graphon $W$. Then

$$\|W\|_\square = \sup_{S, T \subseteq [0,1]} \left| \int_{S \times T} W(x, y) \, dx \, dy \right|,$$

where the supremum is taken over all measurable subsets $S, T$ of $[0,1]$, and the integral is taken with respect to the Lebesgue measure. For finite graphs, this translates to taking the pair of subsets of vertices that has the maximum between-subset edge density. The cut distance between graphons $W_1, W_2$ is then defined as

$$\delta_\square(W_1, W_2) = \inf_{\phi} \|W_1 - W_2 \circ \phi\|_\square, \quad \text{where } (W_2 \circ \phi)(x, y) = W_2(\phi(x), \phi(y)),$$
where the infimum is taken over all measure-preserving bijections $\phi$ of $[0,1]$. In the case of finite graphs, this intuitively corresponds to ignoring vertex labelings. The cut distance generates the same topology on the space of graphons as convergence of subgraph homomorphism densities (i.e., left convergence), and so it is a central part of the theory of graph limits.
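For intuition, the cut norm of a small weighted graph (viewed as a step-function graphon) can be computed by brute force. This sketch is our own illustration and is exponential in the number of vertices:

```python
import itertools
import numpy as np

# Brute-force cut norm of an n x n matrix B (e.g., a difference of adjacency
# matrices): max over vertex subsets S, T of |sum_{i in S, j in T} B_ij| / n^2.
def cut_norm(B):
    n = B.shape[0]
    best = 0.0
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            col = B[list(S), :].sum(axis=0)
            # Given S, the best T collects the columns of one common sign.
            best = max(best, col[col > 0].sum(), -col[col < 0].sum())
    return best / n ** 2

# For a nonnegative matrix the optimum takes S = T = all vertices:
assert cut_norm(np.ones((3, 3))) == 1.0
# The cut norm of a difference of adjacency matrices measures graph similarity:
K4 = np.ones((4, 4)) - np.eye(4)               # complete graph on 4 vertices
C4 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]], dtype=float)     # 4-cycle
assert cut_norm(K4 - C4) == 0.25               # K4 - C4 is two disjoint edges
```
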
II-C Main Hypothesis Testing Problem
We may now state the hypothesis testing problem under consideration. Fix two graphons $W_1, W_2$. A fair coin flip determines which of the two models is in force, and then a graph $G$ on $n$ vertices is sampled from the chosen model. Next, $G$ is passed through $k$ layers of a GCN, resulting in a matrix $M_k$ whose rows are node embedding vectors. The graph embedding vector $v$ is then defined to be the average of the rows of $M_k$. As a final step, the embedding vector is perturbed in each entry by adding an independent, uniformly random number in an interval of length $\lambda$ centered at $0$, for a parameter $\lambda$ that may depend on $n$. This results in a vector $\hat{v}$. We note that this perturbation step has precedent in studies of the performance of neural networks in the presence of numerical imprecision . For our purposes, it allows us to translate convergence results into information-theoretic lower bounds.
Our goal is to study the effect of the number of layers and the presence or absence of nonlinearities on the representation properties of GCNs and on the probability of error of optimal tests that identify which graphon generated the sample. Throughout, we will consider the case where the embedding dimension $d$ is fixed. We will frequently use two particular norms: the $\ell_\infty$ norm for vectors and matrices, which is the maximum absolute entry; and the operator norm induced by $\ell_\infty$ for matrices: for a matrix $B$,

$$\|B\|_{\infty} = \max_{i} \sum_{j} |B_{i,j}|.$$
III Main Results
To state our results, we need a few definitions. For a graphon $W$, we define the degree function $d_W : [0,1] \to [0,1]$ to be

$$d_W(x) = \int_0^1 W(x, y) \, dy,$$

and define the total degree

$$D_W = \int_0^1 \int_0^1 W(x, y) \, dx \, dy.$$

We will assume in what follows that all graphons $W$ have the property that there is some $\delta > 0$ for which $d_W(x) \geq \delta$ for all $x \in [0,1]$.
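These quantities are easy to approximate numerically for a given graphon. The sketch below uses an example graphon of our own choosing, for which $d_W$ and $D_W$ have closed forms:

```python
import numpy as np

# Midpoint-rule approximations of the degree function d_W(x) = ∫ W(x,y) dy
# and the total degree D_W = ∫∫ W(x,y) dx dy.
GRID = 2000
_mid = (np.arange(GRID) + 0.5) / GRID    # midpoints of a uniform grid on [0,1]

def degree_function(W, x):
    return W(x, _mid).mean()

def total_degree(W):
    return np.mean([degree_function(W, x) for x in _mid])

W = lambda x, y: x * y                   # example: d_W(x) = x/2 and D_W = 1/4
assert abs(degree_function(W, 0.8) - 0.4) < 1e-6
assert abs(total_degree(W) - 0.25) < 1e-6
```
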
For any $\epsilon \geq 0$, we say that two graphons $W_1, W_2$ are an $\epsilon$-exceptional pair if

$$\left\| \frac{d_{W_1}}{D_{W_1}} - \frac{d_{W_2} \circ \phi}{D_{W_2}} \right\|_\infty \leq \epsilon$$

for some measure-preserving bijection $\phi$. If a pair of graphons is not $\epsilon$-exceptional, then we say that they are $\epsilon$-separated.
We define the following class of activation functions:
Definition 1 (Nice activation functions).
We define $\mathcal{A}$ to be the class of activation functions $\sigma$ satisfying the following conditions: $\sigma(0) = 0$, $\sigma$ is nondecreasing and Lipschitz, and $|\sigma(x)| \leq |x|$ for all $x \in \mathbb{R}$.
For simplicity, in Theorems 1 and 2 below, we will consider activations in the above class; however, some of the conditions may be relaxed without changing our results: in particular, we may remove the requirement that $\sigma(0) = 0$, and we may relax the condition that holds for all $x \in \mathbb{R}$ to hold only for $x$ in some constant-length interval around $0$. This expanded class includes activation functions such as the GELU  and the swish and SELU  functions:

$$\mathrm{swish}(x) = \frac{x}{1 + e^{-x}}, \qquad \mathrm{SELU}(x) = \lambda_0 \begin{cases} x & x > 0, \\ \alpha (e^x - 1) & x \leq 0, \end{cases}$$

for fixed positive constants $\lambda_0, \alpha$.
We also make the following stipulation about the parameters of the GCN: the initial embedding matrix $M_0$ and the weight matrices $W_1, \dots, W_k$ satisfy

$$\|M_0\|_{\ell_\infty} \leq C_1 \quad \text{and} \quad \|W_j\|_{\infty} \leq C_2, \quad 1 \leq j \leq k,$$

for some fixed positive constants $C_1$ and $C_2$.
Theorem 1 (Convergence of embedding vectors for a large class of graphons and for a family of nonlinear activations).
Let $W_1, W_2$ denote two $\epsilon$-exceptional graphons, for some fixed $\epsilon \geq 0$.
Let $k$ satisfy $k \geq C \log n$, for some large enough constant $C$ that is a function of $\delta$ and the constants $C_1, C_2$. Consider the GCN with $k$ layers and output embedding matrix $M_k$, with the additional properties stated before the theorem.
Suppose that $\epsilon = 0$. Then, in any coupling of the graphs $G_1 \sim W_1$ and $G_2 \sim W_2$, as $n \to \infty$, we have that the embedding vectors $v_1$ and $v_2$ satisfy $\|v_1 - v_2\|_{\infty} \to 0$ with high probability.
If $\epsilon > 0$, then the same convergence holds up to an additive error controlled by $\epsilon$, and the per-coordinate bound holds for a $(1 - \epsilon)$-fraction of coordinates.
Theorem 1 can be translated, with some effort, to the following result.
Theorem 2 (Probability of error lower bound).
Consider again the setting of Theorem 1, and suppose further that the pair is $0$-exceptional. Let $\lambda$ additionally satisfy $\lambda \geq n^{-\alpha}$, for an arbitrarily small fixed $\alpha > 0$. Then there exist two sequences $\{\mu_{1,n}\}$, $\{\mu_{2,n}\}$ of random graph models such that
with probability $1 - o(1)$, samples converge in cut distance to $W_1$ and $W_2$, respectively,
When , the probability of error of any test in distinguishing between the two models based on $\hat{v}$, the $\lambda$-uniform perturbation of $v$, is at least
When , the probability of error lower bound becomes
When and , the error probability lower bound (12) is exponentially decaying to . On the other hand, when and , it becomes , which is .
When and , the probability of error lower bound in (13) is .
We next turn to a positive result demonstrating the distinguishing capabilities of very simple, linear GCNs.
Theorem 3 (Distinguishability result).
Let $W_1, W_2$ denote two $\epsilon$-separated graphons. Then there exists a test that distinguishes with probability $1 - o(1)$ between samples $G_1 \sim W_1$ and $G_2 \sim W_2$ based on the $\lambda$-perturbed embedding vector from a GCN with $k$ layers, identity initial embedding and weight matrices, and ReLU activation functions, provided that $k \geq C \log n$ for a sufficiently large constant $C$ and that $\lambda$ is sufficiently small as a function of $\epsilon$.
Finally, we exhibit a family of stochastic block models that are difficult to distinguish and are such that infinitely many pairs of them have large cut distance.
To define the family of models, we consider the following density parameter set: we pick a base point $(p_1, p_2, q)$ with all-positive entries and then define
where is the lexicographic partial order, and . We have defined this parameter family because the corresponding SBMs all have equal expected degree sequences.
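The equal-expected-degree property can be checked numerically for a plausible instantiation of such a family (our guess at the construction, assuming equal block sizes): shifting the intra-block densities down by $t$ and the inter-block density up by $t$ leaves the step graphon's degree function unchanged.

```python
import numpy as np

# Two-block step graphon with equal blocks [0, 1/2) and [1/2, 1]; its degree
# function is (p1 + q)/2 on block 1 and (p2 + q)/2 on block 2.
def sbm_graphon(p1, p2, q):
    def W(x, y):
        same = (np.asarray(x) < 0.5) == (np.asarray(y) < 0.5)
        intra = np.where(np.asarray(x) < 0.5, p1, p2)
        return np.where(same, intra, q)
    return W

def degree_fn(W, x, grid=4000):
    y = (np.arange(grid) + 0.5) / grid       # midpoint rule on [0,1]
    return np.mean(W(x, y))

p1, p2, q = 0.5, 0.3, 0.2
for t in (0.0, 0.05, 0.1):
    Wt = sbm_graphon(p1 - t, p2 - t, q + t)  # the shifted parameters
    assert abs(degree_fn(Wt, 0.25) - (p1 + q) / 2) < 1e-9  # block-1 degree
    assert abs(degree_fn(Wt, 0.75) - (p2 + q) / 2) < 1e-9  # block-2 degree
```

By Theorem 1, such pairs with matching degree functions are exactly the ones on which deep GCN embeddings collapse together.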
For any pair from the family of stochastic block models parameterized by this set, and for $k \geq C \log n$, for some large enough positive constant $C$, the following statements hold:
Convergence of embedding vectors: In any coupling of the graphs $G_1$ and $G_2$, as $n \to \infty$, we have that the embedding vectors $v_1$ and $v_2$ satisfy
with probability $1 - o(1)$.
Probability of error lower bound: Let $\lambda$ additionally satisfy $\lambda \geq n^{-\alpha}$, for an arbitrarily small fixed $\alpha > 0$. Then there exist two sequences $\{\mu_{1,n}\}$, $\{\mu_{2,n}\}$ of random graph models such that
with probability $1 - o(1)$, samples converge in cut distance to the two corresponding graphons, respectively,
the probability of error of any test in distinguishing between $\mu_{1,n}$ and $\mu_{2,n}$ based on $\hat{v}$, the $\lambda$-uniform perturbation of $v$, is lower bounded by
IV Conclusions and Future Work
We have shown conditions under which GCNs are information-theoretically capable/incapable of distinguishing between sufficiently well-separated graphons.
It is worthwhile to discuss what lies ahead for the theory of graph representation learning in relation to the problem of distinguishing distributions on graphs. As the present paper is a first step, we have left several directions for future exploration. Most immediately, although we have proven impossibility results for GCNs with nonlinear activation functions, we lack a complete understanding of the benefits of more general ways of incorporating nonlinearity. We have shown that architectures with too many layers cannot be used to distinguish between graphons coming from a certain exceptional class. It would be of interest to determine if more general ways of incorporating nonlinearity are able to generically distinguish between any sufficiently well-separated pair of graphons, whether or not they come from the exceptional class. To this end, we are exploring results indicating that replacing the random walk matrix in the GCN architecture with the transition matrix of a related Markov chain with the same graph structure as the input graph results in a linear GCN that is capable of distinguishing graphons generically.
Furthermore, a clear understanding of the role played by the embedding dimension would be of interest. In particular, we suspect that decreasing the embedding dimension results in worse graphon discrimination performance. Moreover, a more precise understanding of how performance parameters scale with the embedding dimension would be valuable in GCN design. Finally, we note that in many application domains, graphs are typically sparse. Thus, we intend to generalize our theory to the sparse graph setting by replacing graphons, which inherently generate dense graphs, with suitable nonparametric sparse graph models, e.g., graphexes.
This research was partially supported by grants from ARO W911NF-19-1026, ARO W911NF-15-1-0479, and ARO W911NF-14-1-0359 and the Blue Sky Initiative from the College of Engineering at the University of Michigan.
- M. Sun, S. Zhao, C. Gilvary, O. Elemento, J. Zhou, and F. Wang, “Graph convolutional networks for computational drug development and discovery,” Briefings in bioinformatics, 2019.
- M. Randić and S. C. Basak, “A comparative study of proteomics maps using graph theoretical biodescriptors,” Journal of chemical information and computer sciences, vol. 42, no. 5, pp. 983–992, 2002.
- O. Sporns, “Graph theory methods for the analysis of neural connectivity patterns,” in Neuroscience databases. Springer, 2003, pp. 171–185.
- J. A. Barnes and F. Harary, “Graph theory in network analysis,” 1983.
- W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” IEEE Data Eng. Bull., vol. 40, pp. 52–74, 2017.
- T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
- M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. USA: Curran Associates Inc., 2016, pp. 3844–3852. [Online]. Available: http://dl.acm.org/citation.cfm?id=3157382.3157527
- L. Lovász, Large Networks and Graph Limits, ser. Colloquium Publications. American Mathematical Society, 2012, vol. 60.
- D. J. Aldous, “Representations for partially exchangeable arrays of random variables,” Journal of Multivariate Analysis, vol. 11, no. 4, pp. 581 – 598, 1981. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0047259X81900993
- L. Ruiz, F. Gama, and A. Ribeiro, “Gated graph convolutional recurrent neural networks,” arXiv preprint arXiv:1903.01888, 2019.
- T. S. Jepsen, C. S. Jensen, and T. D. Nielsen, “Graph convolutional networks for road networks,” arXiv preprint arXiv:1908.11567, 2019.
- C. W. Coley, W. Jin, L. Rogers, T. F. Jamison, T. S. Jaakkola, W. H. Green, R. Barzilay, and K. F. Jensen, “A graph-convolutional neural network model for the prediction of chemical reactivity,” Chemical science, vol. 10, no. 2, pp. 370–377, 2019.
- W. Yao, A. S. Bandeira, and S. Villar, “Experimental performance of graph neural networks on random instances of max-cut,” in Wavelets and Sparsity XVIII, vol. 11138. International Society for Optics and Photonics, 2019, p. 111380S.
- D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in neural information processing systems, 2015, pp. 2224–2232.
- S. Verma and Z.-L. Zhang, “Stability and generalization of graph convolutional neural networks,” arXiv preprint arXiv:1905.01004, 2019.
- C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe, “Weisfeiler and lehman go neural: Higher-order graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4602–4609.
- Z. Chen, S. Villar, L. Chen, and J. Bruna, “On the equivalence between graph isomorphism testing and function approximation with gnns,” arXiv preprint arXiv:1905.12560, 2019.
- K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
- B. Y. Weisfeiler and A. A. Lehman, “Reduction of a graph to a canonical form and an algebra arising during this reduction (in Russian),” Nauchno-Technicheskaya Informatsia, Seriya, vol. 2, no. 9, pp. 12–16, 1968.
- S. Arora and A. Risteski, “Provable benefits of representation learning,” CoRR, vol. abs/1706.04601, 2017.
- T. Kawamoto, M. Tsubaki, and T. Obuchi, “Mean-field theory of graph neural networks in graph partitioning,” in Proceedings of the 32Nd International Conference on Neural Information Processing Systems, ser. NIPS’18. USA: Curran Associates Inc., 2018, pp. 4366–4376. [Online]. Available: http://dl.acm.org/citation.cfm?id=3327345.3327349
- L. Lovász and B. Szegedy, “Limits of dense graph sequences,” Journal of Combinatorial Theory, Series B, vol. 96, no. 6, pp. 933 – 957, 2006. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0095895606000517
- C. Borgs, J. Chayes, L. Lovász, V. Sós, and K. Vesztergombi, “Convergent sequences of dense graphs i: Subgraph frequencies, metric properties and testing,” Advances in Mathematics, vol. 219, no. 6, pp. 1801 – 1851, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0001870808002053
- C. Borgs, J. Chayes, L. Lovász, V. Sós, and K. Vesztergombi, “Convergent sequences of dense graphs. ii. multiway cuts and statistical physics,” Annals of Mathematics. Second Series, vol. 1, 07 2012.
- S. H. Chan and E. M. Airoldi, “A consistent histogram estimator for exchangeable graph models,” in Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ser. ICML’14. JMLR.org, 2014, pp. I–208–I–216. [Online]. Available: http://dl.acm.org/citation.cfm?id=3044805.3044830
- C. Gao, Y. Lu, and H. H. Zhou, “Rate-optimal graphon estimation,” Ann. Statist., vol. 43, no. 6, pp. 2624–2652, 12 2015. [Online]. Available: https://doi.org/10.1214/15-AOS1354
- O. Klopp and N. Verzelen, “Optimal graphon estimation in cut distance,” Probability Theory and Related Fields, vol. 174, no. 3, pp. 1033–1090, Aug 2019. [Online]. Available: https://doi.org/10.1007/s00440-018-0878-1
- D. A. Levin, Y. Peres, and E. L. Wilmer, Markov chains and mixing times. American Mathematical Society, 2006.
- A. Magner, M. Baranwal, and A. O. Hero III, “The power of graph convolutional networks to distinguish random graph models,” arXiv preprint arXiv:1910.12954, 2019.
- F. Wu, A. H. Souza, T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” in ICML, 2019.
- S. Janson, “Graphons, cut norm and distance, couplings, and rearrangements,” New York Journal of Mathematics, vol. 4, pp. 1–76, 2013.
- C. Sakr, Y. Kim, and N. Shanbhag, “Analytical guarantees on numerical precision of deep neural networks,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 3007–3016. [Online]. Available: http://proceedings.mlr.press/v70/sakr17a.html
- D. Hendrycks and K. Gimpel, “Bridging nonlinearities and stochastic regularizers with gaussian error linear units,” ArXiv, vol. abs/1606.08415, 2017.
- G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in neural information processing systems, 2017, pp. 971–980.