A Collective Learning Framework to Boost GNN Expressiveness


Graph Neural Networks (GNNs) have recently been used for node and graph classification tasks with great success. Unfortunately, existing GNNs are not universal (i.e., most-expressive) graph representations. In this work, we propose collective learning, a general collective classification Monte Carlo approach for graph representation learning that provably increases the representation power of existing GNNs. We show that our use of Monte Carlo sampling is key in these results. Our experiments consider the task of inductive node classification across partially-labeled graphs using five real-world network datasets and demonstrate a consistent, significant boost in node classification accuracy when our framework is combined with a variety of state-of-the-art GNNs.

1 Introduction

Graph Neural Networks (GNNs) have recently shown great success at node and graph classification tasks (Hamilton et al., 2017; Kipf and Welling, 2016; Luan et al., 2019; Xu et al., 2018). GNNs have been applied in both transductive settings (where the test nodes are embedded in the training graph) and inductive settings (where the training and test graphs are disjoint). Despite their success, existing GNNs are no more powerful than the Weisfeiler-Lehman (WL) graph isomorphism test and thus inherit its shortcomings, i.e., they are not universal (most-expressive) graph representations (Chen et al., 2019; Morris et al., 2019; Murphy et al., 2019; Xu et al., 2018). In other words, these GNNs (which we refer to as WL-GNNs, and which include GCNs (Kipf and Welling, 2016)) are not expressive enough for some node classification tasks, since their representations can provably fail to distinguish non-isomorphic nodes with different labels.

At the same time, a large body of work in relational learning has focused on strengthening poorly-expressive (i.e., local) classifiers in relational models (e.g., relational logistic regression, naive Bayes, decision trees (Neville et al., 2003a)) in collective classification frameworks, by incorporating dependencies among node labels and propagating inferences during classification to improve performance, particularly in semi-supervised settings (Koller et al., 2007; Pfeiffer III et al., 2015; Xiang and Neville, 2008).

In this work, we theoretically and empirically investigate the hypothesis that, by explicitly incorporating label dependencies among neighboring nodes via predicted label sampling—akin to how collective classification improves not-so-expressive classifiers—it is possible to devise an add-on training and inference procedure that can improve the expressiveness of any existing WL-GNN for inductive node classification tasks, which we denote collective learning.

Contributions. We first show that collective classification is provably unnecessary if one can obtain GNNs that are most-expressive. Then, because current WL-GNNs are not most-expressive, we propose an add-on general collective learning framework for GNNs that provably boosts their expressiveness beyond that of an optimal WL-GNN. Our framework, which we call CL-GNN, uses self-supervised learning and sampled embeddings to incorporate node labels during inductive learning, and it can be implemented with any component GNN. In addition to being strictly more expressive than optimal WL-GNNs, we also show how collective learning improves finite $K$-layer WL-GNNs in practice by extending their power to distinguish non-isomorphic nodes from $K$-hop neighborhoods to $2K$-hop neighborhoods. We also show that, in contrast to our proposed framework, attempts to incorporate collective classification ideas into WL-GNNs without sampled embeddings (e.g., Fan and Huang (2019); Qu et al. (2019); Vijayan et al. (2018)) cannot increase expressivity beyond that of an optimal WL-GNN. Our empirical evaluation shows that CL-GNN achieves a consistent improvement in node classification accuracy, across a variety of state-of-the-art WL-GNNs, for tasks involving unlabeled and partially-labeled test graphs.

2 Problem formulation

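We consider the problem of inductive node classification across partially-labeled graphs, which takes as input a graph $G^{(tr)} = (V^{(tr)}, E^{(tr)}, X^{(tr)}, Y_L^{(tr)})$ for training, where $V^{(tr)}$ is a set of $n^{(tr)}$ vertices, $E^{(tr)} \subseteq V^{(tr)} \times V^{(tr)}$ is a set of edges with adjacency matrix $A^{(tr)}$, $X^{(tr)}$ is an $n^{(tr)} \times p$ matrix containing node attributes as $p$-dimensional vectors, and $Y_L^{(tr)}$ is a set of observed labels (with $C$ classes) of a connected set of nodes $V_L^{(tr)} \subset V^{(tr)}$, where $V_L^{(tr)}$ is assumed to be a proper subset of $V^{(tr)}$, noting that $V_L^{(tr)} \neq \emptyset$. Let $Y_U^{(tr)}$ be the unknown labels of nodes $V_U^{(tr)} = V^{(tr)} \setminus V_L^{(tr)}$. The goal is to learn a joint model of $Y_U^{(tr)} \sim P(Y_U \mid G^{(tr)})$ and apply this same model to predict hidden labels $Y_U^{(te)}$ in another test graph $G^{(te)}$, i.e., $Y_U^{(te)} \sim P(Y_U \mid G^{(te)})$. The test graph can be partially labeled or unlabeled, so $Y_L^{(te)}$ may be empty.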

Graph Neural Networks (GNNs), which aggregate node attribute information to produce node representations, have been successfully used for this task. At the same time, relational machine learning (RML) methods, which use collective inference to boost the performance of local node classifiers via (predicted) label dependencies, have also been successfully applied to this task.

Since state-of-the-art GNNs are not most-expressive (Morris et al., 2019; Murphy et al., 2019; Xu et al., 2018), collective classification ideas may help to improve the expressiveness of GNNs. In particular, collective inference methods often sample predicted labels (conditioned on observed labels) to improve the local representation around nodes and approximate the joint distribution of the unobserved labels. We also know from recent research that sampling randomized features can boost GNN expressiveness (Murphy et al., 2019). This leads to the key conjecture of this work, Hypothesis 1, which we prove theoretically in Section 4 and validate empirically by extensive experimentation in Section 5.

Hypothesis 1.

Since current Graph Neural Networks (e.g. GCN, GraphSAGE, TK-GNN) cannot produce most expressive graph representations, collective learning (which takes label dependencies into account via Monte Carlo sampling) can improve the accuracy of node classification by producing a more expressive graph representation.

Why? Because WL-GNNs can extract more information about local neighborhood dependencies via sampling Murphy et al. (2019), and sampling predicted labels allows the GNN to pay attention to the relationship between node attributes, the graph topology, and label dependencies in local neighborhoods. With collective learning, GNNs will be able to incorporate more information into the estimated joint label distribution. Next, we describe our collective learning framework.

3 Proposed framework: Collective Learning to boost GNNs

In this section, we outline CL-GNN. It is a general framework that can incorporate any GNN, and it combines a self-supervised learning approach with Monte Carlo embedding sampling in an iterative process to improve inductive learning on partially labeled graphs.

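Specifically, we are given a partially labeled training graph $G^{(tr)} = (V^{(tr)}, E^{(tr)}, X^{(tr)}, Y_L^{(tr)})$ with adjacency matrix $A^{(tr)}$ and a partially-labeled test graph $G^{(te)} = (V^{(te)}, E^{(te)}, X^{(te)}, Y_L^{(te)})$ with adjacency matrix $A^{(te)}$. The goal of the inductive node classification task is to train a joint model on $G^{(tr)}$ to learn $P(Y_U \mid G^{(tr)})$ and apply it to $G^{(te)}$ by replacing the input graph $G^{(tr)}$ with $G^{(te)}$. If the graphs have $|V^{(tr)}| = n^{(tr)}$ and $|V^{(te)}| = n^{(te)}$ nodes and there are $C$ classes, we can define $Y_L^{(tr)}$ as a binary (0-1) matrix of dimension $n^{(tr)} \times C$, and $Y_L^{(te)}$ as one of dimension $n^{(te)} \times C$, where the rows correspond to the one-hot encoding of the (available) labels.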

(Background) GNN and representation learning. Given a partially labeled graph $G = (V, E, X, Y_L)$ with adjacency matrix $A$, WL-GNNs generate node representations by propagating feature information throughout the graph. Specifically,

$$ P(Y_v \mid X, A) = \sigma(W Z_v + b), \qquad Z_v = \mathrm{GNN}(X, A; \Theta)_v, \tag{1} $$

where $Z_v$ is the GNN representation of node $v$, $\sigma$ is the softmax activation, and $W$, $b$ and $\Theta$ are model parameters, which are learned by minimizing the cross-entropy loss between true labels and the predicted labels.

The collective learning framework. Following Hypothesis 1, we propose Collective Learning GNNs (CL-GNN), which include label information as input to GNNs to produce a more expressive representation. The overall framework follows three steps: (Step 1) include labels in the input using a random mask; (Step 2) sample predicted labels for whatever is masked, use them as input to a WL-GNN that also takes all node labels as input, and average the representations of the WL-GNN over the sampled predicted labels; (Step 3) perform one optimization step by minimizing a negative log-likelihood upper bound (per Section 4.3). Collective learning for WL-GNNs then consists of iterating over Steps 1-3 for $T$ iterations. Finally, once optimized, we perform inference via Monte Carlo estimates.

Step 1. Random masking and self-supervised learning. The input to GNNs is typically the full graph $G$. If we included the observed labels directly in the input, then it would be trivial to learn a model that predicts part of the input. Instead, we either (scenario test-unlabeled) mask all label inputs if the test graph is expected to be unlabeled; or (scenario test-partial), if the test graph is expected to have partial labels, we apply a mask to the labels we wish to predict in training so they do not appear in the input.

In (scenario test-partial), where the test graph is expected to have some observed labels, at each stochastic gradient descent step we randomly sample a binary mask $M$ from a set of masks $\mathcal{M}$, where $M$ is a binary (0-1) matrix with the same $n^{(tr)}$-dimensional vector in each column. By applying the mask to the observed labels $Y_L^{(tr)}$, the set of true labels is effectively partitioned into two parts: one part, $M \odot Y_L^{(tr)}$, is used as input to CL-GNN, and the other part, $\bar{M} \odot Y_L^{(tr)}$, is used as the optimization target, where $\bar{M}$ is the bitwise negated matrix of $M$.
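The masking step can be illustrated with a minimal sketch. It assumes labels are stored as a node-to-class dictionary and that each labeled node is kept as input independently with probability `keep_prob`; the function names, the dictionary layout, and `keep_prob` are hypothetical choices made for illustration, not part of the framework's specification.

```python
import random

def sample_label_mask(labeled_nodes, keep_prob=0.5, rng=None):
    """Randomly split labeled nodes into (input_nodes, target_nodes).

    keep_prob is a hypothetical knob: the chance a label stays visible as input.
    """
    rng = rng or random.Random()
    input_nodes, target_nodes = [], []
    for v in labeled_nodes:
        (input_nodes if rng.random() < keep_prob else target_nodes).append(v)
    return input_nodes, target_nodes

def masked_one_hot(labels, visible_nodes, num_nodes, num_classes):
    """Build the n x C label-input matrix: one-hot rows for visible nodes, zeros elsewhere."""
    visible = set(visible_nodes)
    rows = [[0] * num_classes for _ in range(num_nodes)]
    for v, c in labels.items():
        if v in visible:
            rows[v][c] = 1
    return rows

# Toy graph with 6 nodes and 3 classes; nodes 1 and 4 are unlabeled.
labels = {0: 1, 2: 0, 3: 2, 5: 1}
input_nodes, target_nodes = sample_label_mask(sorted(labels), rng=random.Random(0))
label_input = masked_one_hot(labels, input_nodes, num_nodes=6, num_classes=3)
```

Every labeled node lands in exactly one of the two parts, so a label that serves as an optimization target never appears in the model input for that gradient step.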

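Step 2. CL-GNN loss and its representation averaging. At the $t$-th step of our optimization (these steps can be coarser than a gradient step), we either (scenario test-partial) sample a mask $M$ or (scenario test-unlabeled) set $M = \mathbf{0}$. For now, we assume we can sample $\hat{Y}^{(t-1)}$ from an estimate of the label distribution; we will come back to this assumption soon. Let $\tilde{Y}^{(t-1)}_M$ be the matrix concatenation between $M \odot Y_L^{(tr)}$ and $\bar{M} \odot \hat{Y}^{(t-1)}$, where again $\bar{M}$ is the bitwise negated matrix of $M$. Let

$$ Z_v^{(t)} = \mathbb{E}_{\hat{Y}^{(t-1)}}\Big[ \mathrm{GNN}\big(X^{(tr)}, \tilde{Y}^{(t-1)}_M, A^{(tr)}; \Theta\big)_v \Big], \tag{2} $$

where GNN represents an arbitrary graph neural network model and $Z_v^{(t)}$ is the CL-GNN's representation obtained for node $v$ at step $t$.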
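The representation averaging of Equation 2 can be sketched as a Monte Carlo estimate. The sketch below is a toy under stated assumptions: `mean_aggregate` is a one-layer neighbor-averaging stand-in for an arbitrary GNN, nodes whose labels are not visible receive a label sampled from a given predicted distribution, and all function and variable names are hypothetical.

```python
import random

def sample_one_hot(probs, rng):
    """Draw one class from a categorical distribution and one-hot encode it."""
    c = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1.0 if i == c else 0.0 for i in range(len(probs))]

def mean_aggregate(rows, adj):
    """Toy one-layer stand-in for a GNN: average each node's row with its neighbors' rows."""
    out = []
    for v, nbrs in enumerate(adj):
        group = [v] + list(nbrs)
        out.append([sum(rows[u][d] for u in group) / len(group)
                    for d in range(len(rows[v]))])
    return out

def averaged_representation(X, adj, true_labels, visible, pred_probs, num_samples, rng):
    """Monte Carlo estimate of the expected representation over sampled labels."""
    total = None
    for _ in range(num_samples):
        rows = []
        for v in range(len(X)):
            if v in visible:
                label_part = true_labels[v]                      # unmasked observed label
            else:
                label_part = sample_one_hot(pred_probs[v], rng)  # sampled prediction
            rows.append(list(X[v]) + list(label_part))
        Z = mean_aggregate(rows, adj)
        total = Z if total is None else [[a + b for a, b in zip(r, s)]
                                         for r, s in zip(total, Z)]
    return [[x / num_samples for x in row] for row in total]

# Path graph 0 - 1 - 2 with one feature and two classes; only node 0's label is visible.
X = [[1.0], [0.0], [0.0]]
adj = [[1], [0, 2], [1]]
true_labels = {0: [1.0, 0.0]}
pred_probs = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
Z = averaged_representation(X, adj, true_labels, {0}, pred_probs, 200, random.Random(0))
```

Each sample requires one forward pass of the component model, which is why the cost of the averaged representation grows linearly with the number of samples.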

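Our optimization is defined over the expectation of $Z_v^{(t)}$ w.r.t. the sampled predicted labels $\hat{Y}^{(t-1)}$ (Equation 2) and over a loss averaged over all sampled masks (noting that the case where $M = \mathbf{0}$ is trivial):

$$ \hat{\Theta}_t, \hat{W}_t, \hat{b}_t = \arg\min_{\Theta, W, b} \; -\mathbb{E}_{M}\Big[ \sum_{v \in V_L^{(tr)}} \bar{M}_v \, \log \sigma\big(W Z_v^{(t)} + b\big)_{y_v} \Big], \tag{3} $$

where again $\sigma$ is the softmax activation function, $y_v$ is the observed label of node $v$, $\bar{M}_v \in \{0, 1\}$ indicates whether the label of $v$ is an optimization target, and $V_L^{(tr)}$ are the labeled nodes in the training graph.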

Obtaining . At iteration , we use the learned CL-GNN model parameter to obtain according to Equation 2 and use the CL-GNN model parameters to obtain the label prediction recursively


starting the recursion with .

Step 3. Stochastic optimization of Equation 3. Equation 3 is based on a pseudolikelihood, where the joint distribution of the labels is decomposed into marginal distributions, resulting in the sum over the labeled nodes in Equation 3. In order to optimize Equation 3, we compute gradient estimates w.r.t. the model parameters using the following sampling procedure.

1. We first need to compute an unbiased estimate of $Z_v^{(t)}$ in Equation 2 using $S$ i.i.d. samples of $\hat{Y}^{(t-1)}$ from the model obtained at time step $t-1$ (as described above). Note that the time/space complexity of CL-GNN is $S$ times the time/space complexity of the corresponding GNN model, as we have to compute $S$ representations for each node at each stochastic gradient step.

2. Next, we need an unbiased estimate of the expectation over $M$ in Equation 3. In (scenario test-partial) the unbiased estimates are obtained by sampling a mask $M$; in (scenario test-unlabeled) the value obtained is exact since $M = \mathbf{0}$. Section 4.3 shows that the above procedure optimizes a proper surrogate upper bound of the loss function.
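The pseudolikelihood structure of the objective, a sum of per-node terms restricted to the nodes held out as targets, can be sketched as follows. This is a minimal illustration assuming per-node logits are already available; `masked_nll` and the toy values are hypothetical and do not reproduce the full sampling procedure.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_nll(logits, labels, target_nodes):
    """Negative log pseudolikelihood: sum of per-node terms over the target nodes only."""
    return -sum(math.log(softmax(logits[v])[labels[v]]) for v in target_nodes)

# Toy per-node logits for three labeled nodes and two classes.
logits = {0: [2.0, 0.0], 1: [0.0, 0.0], 2: [0.0, 3.0]}
labels = {0: 0, 1: 1, 2: 1}
loss_all = masked_nll(logits, labels, [0, 1, 2])
loss_targets = masked_nll(logits, labels, [1])  # nodes 0 and 2 were used as inputs
```

Averaging `masked_nll` over independently sampled masks gives an unbiased estimate of the expectation over masks, in the spirit of item 2 above.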

Inference with learned model. We apply the same procedure as in obtaining , but transferring the learned parameters to a different test graph. Specifically, at iteration , CL-GNN parameters are learned according to Equation 3 on the training graph . Suppose the iteration contains gradient steps, thus sampled masks are used. Given an any-size attributed graph , we sample another masks of size , either (scenario test-partial) sampling or (scenario test-unlabeled) set . We also sample predicted labels from the predicted probability distribution where . The node representations for are obtained using and :


where again, is the number of gradient steps per iteration and is the number of Monte Carlo samples of . Then the label predictions are obtained using the learned CL-GNN parameters :

Note that the test label predictions are also recursively updated, and the recursion starts with .

4 Collective learning analysis

Is collective classification able to better represent target label distributions than graph representation learning? The answer to this question is both yes (for WL-GNNs) and no (for most-expressive representations). Theorem 1 shows that a most-expressive graph representation (Murphy et al., 2019; Maron et al., 2019; Srinivasan and Ribeiro, 2019) would not benefit from a collective learning boost. All proofs can be found in the Appendix.

Theorem 1 (Collective classification can be unnecessary).

Consider the task of predicting node labels when no labels are available in test data. Let $\Gamma^\star(v, G)$ be a most-expressive representation of node $v$ in graph $G$. Then, for any collective learning procedure predicting the class label of $v$, there exists a classifier that takes $\Gamma^\star(v, G)$ as input and predicts the label of $v$ with equal or higher accuracy.

While Theorem 1 shows that the most-expressive graph representation does not need collective classification, WL-GNNs are not most-expressive (Morris et al., 2019; Murphy et al., 2019; Xu et al., 2018). Indeed, Section 4.1 and Section 4.2 show that CL-GNN boosts the expressiveness of the optimal WL-GNN and of practical WL-GNNs, respectively. Then, we show that the stochastic optimization in Step 3 optimizes a surrogate upper bound of the loss.

4.1 Expressive power of CL-GNN

Morris et al. (2019) and Xu et al. (2018) show that WL-GNNs are no more powerful in distinguishing non-isomorphic graphs and nodes than the standard Weisfeiler-Lehman graph isomorphism test (1-WL, or just the WL test). Two nodes are assumed isomorphic by the WL test if they have the same color assignment in the stable coloring. The node-expressivity of a parameterized graph representation $\Gamma$ (with parameter $\theta$) can then be determined by the set of graphs for which $\Gamma$ can identify non-isomorphic nodes:

$$ \mathcal{I}(\Gamma) = \{ G \in \mathcal{G} : \exists\, \theta \text{ such that } \Gamma(u, G; \theta) \neq \Gamma(v, G; \theta) \text{ for all non-isomorphic } u, v \in V \}, $$

where $\mathcal{G}$ is the set of all any-size attributed graphs and $V$ is the set of nodes in graph $G$. We call $\mathcal{I}(\Gamma)$ the identifiable set of graph representation $\Gamma$.
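As a concrete illustration of the limitation inherited from the WL test, the sketch below runs 1-WL color refinement on a standard textbook example: a 6-cycle next to two disjoint triangles. Nodes on the 6-cycle and nodes on a triangle are not isomorphic, yet refinement assigns them the same stable color. The function names are hypothetical, and node attributes are assumed to be uninformative (all nodes start with the same color).

```python
def wl_stable_colors(adj, init_colors=None):
    """1-WL color refinement; returns the stable coloring as a list of ints."""
    n = len(adj)
    colors = list(init_colors) if init_colors is not None else [0] * n
    while True:
        signatures = [(colors[v], tuple(sorted(colors[u] for u in adj[v])))
                      for v in range(n)]
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures)))}
        new_colors = [palette[sig] for sig in signatures]
        # Refinement only ever splits color classes, so an unchanged count means stable.
        if len(set(new_colors)) == len(set(colors)):
            return new_colors
        colors = new_colors

def cycle(nodes):
    """Adjacency lists for a cycle over the given node ids."""
    k = len(nodes)
    return {nodes[i]: [nodes[(i - 1) % k], nodes[(i + 1) % k]] for i in range(k)}

# Disjoint union of a 6-cycle (nodes 0..5) and two triangles (nodes 6..8 and 9..11).
edges = {}
edges.update(cycle([0, 1, 2, 3, 4, 5]))
edges.update(cycle([6, 7, 8]))
edges.update(cycle([9, 10, 11]))
union_adj = [edges[v] for v in range(12)]
union_colors = wl_stable_colors(union_adj)

# A 3-node path: the endpoints share a color, the middle node gets a different one.
path_colors = wl_stable_colors([[1], [0, 2], [1]])
```

All twelve nodes of the union end up with one color, so any representation bounded by the WL test must map a 6-cycle node and a triangle node to the same embedding.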

The most expressive graph representation $\Gamma^\star$ has an identifiable set of all any-size attributed graphs, i.e., $\mathcal{I}(\Gamma^\star) = \mathcal{G}$. We refer to the WL-GNN that is as expressive as the WL test as the optimal WL-GNN (or $\text{WL-GNN}^\star$), which is at least as expressive as all other WL-GNNs.

In this section we show that collective learning can boost the optimal $\text{WL-GNN}^\star$, i.e., the identifiable set of $\text{WL-GNN}^\star$ is a proper subset of that of collective learning over $\text{WL-GNN}^\star$ (denoted $\text{CL-GNN}^\star$):

$$ \mathcal{I}(\text{WL-GNN}^\star) \subsetneq \mathcal{I}(\text{CL-GNN}^\star). $$

Theorem 2 (CL-GNN expressive power).

Let $\text{WL-GNN}^\star$ be an optimal WL-GNN. Then, the collective learning representation of Equation 2, using $\text{WL-GNN}^\star$ as the GNN component (denoted $\text{CL-GNN}^\star$), is strictly more expressive than $\text{WL-GNN}^\star$ applied to the same tasks.

Section 4.1 answers Hypothesis 1 by showing that, by incorporating collective learning and sampling procedures, CL-GNN can boost the expressiveness of WL-GNNs, including the optimal WL-GNN.

Corollary 1.

Consider a graph representation learning method that, at iteration $t$, replaces the sampled labels $\hat{Y}^{(t-1)}$ in Equations 2 and 4 with a deterministic function over the previous predictions, e.g., a softmax function that outputs predicted class probabilities. Then, such a method will be no more expressive than the optimal WL-GNN and, hence, less expressive than CL-GNN built on the optimal WL-GNN.

Corollary 1 shows that existing graph representation methods that may, on the surface, even look like CL-GNN but do not perform the crucial step of sampling $\hat{Y}^{(t-1)}$ are, unfortunately, no more expressive than WL-GNNs. Examples of such methods include Fan and Huang (2019); Moore and Neville (2017); Qu et al. (2019); Vijayan et al. (2018).

Next, we see that the practical benefits of collective learning are even greater when the WL-GNN has limited expressive power due to a constraint on the number of message-passing layers.

4.2 How CL-GNN further expands the power of few-layer WL-GNNs

A $K$-layer WL-GNN will only aggregate neighborhood information within $K$ hops of any given node (i.e., over a $K$-hop egonet, defined as the graph representing the connections among all nodes that are at most $K$ hops away from the center node). In practice, mostly for computational reasons, WL-GNNs have many fewer layers than the graph's diameter $\delta$, i.e., $K < \delta$. For instance, GCN (Kipf and Welling, 2016) and GraphSAGE (Hamilton et al., 2017) both used $K = 2$ in their experiments. Hence, they cannot differentiate two non-isomorphic nodes that are isomorphic within their $K$-hop neighborhood. We now show that CL-GNN can gather $2K$-hop neighborhood information with a $K$-layer WL-GNN.


Proposition 1.

Let $G_v^{K}$ be the $K$-hop egonet of a node $v$ in graph $G$ with diameter $\delta$. Let $v_1$ and $v_2$ be two non-isomorphic nodes whose $K$-hop egonets are isomorphic (i.e., $G_{v_1}^{K}$ is isomorphic to $G_{v_2}^{K}$) but whose $2K$-hop egonets are not isomorphic. Then, a WL-GNN representation with $K$ layers will generate identical representations for $v_1$ and $v_2$, while CL-GNN is capable of giving distinct node representations.

Section 4.2 shows that collective learning has yet another benefit: CL-GNN further boosts the power of WL-GNNs with limited message-passing layers by gathering neighborhood information within a larger radius. Specifically, CL-GNN built on a WL-GNN with $K$ layers can enlarge the effective neighborhood radius from $K$ to $2K$ in Equation 2, while a WL-GNN would have to stack $2K$ layers to achieve the same neighborhood radius, which in practice may cause optimization challenges (i.e., $K = 2$ is a common hyperparameter value in the literature).
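The notion of a hop-limited egonet can be made concrete with a short breadth-first search. The sketch below returns only the node set of the egonet and uses a hypothetical adjacency-dictionary layout; it illustrates the neighborhood radius a message-passing model with a given number of layers can reach, not the CL-GNN procedure itself.

```python
from collections import deque

def k_hop_nodes(adj, center, k):
    """Return the set of nodes within k hops of center (the node set of its k-hop egonet)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond the hop limit
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

# A 5-node path 0 - 1 - 2 - 3 - 4, viewed from node 0.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
two_hop = k_hop_nodes(path, 0, 2)
four_hop = k_hop_nodes(path, 0, 4)
```

With a radius of 2, node 0 sees nothing beyond node 2; doubling the radius exposes the rest of the path, which is the kind of additional structure a larger effective neighborhood makes available.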

4.3 Optimization of CL-GNN


Proposition 2.

If the gradient of the loss w.r.t. the model parameters is bounded (e.g., via gradient clipping), then the optimization in Equation 3, with the unbiased sampling of $\hat{Y}^{(t-1)}$ and $M$ described above, results in a Robbins-Monro (Robbins and Monro, 1951) stochastic optimization algorithm that optimizes a surrogate upper bound of the loss in Equation 3.

Since the optimization objective in Equation 3 is computationally impractical, as it requires computing all possible binary masks and label predictions, Section 4.3 shows that the sampling procedure used in CL-GNN, which considers samples of label predictions and a random mask at each gradient step, is a feasible approach to estimating an unbiased upper bound of the objective.
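For readers unfamiliar with Robbins-Monro stochastic approximation, the toy sketch below shows the basic mechanism on a one-dimensional problem: minimizing $\mathbb{E}[(\theta - x)^2 / 2]$ from noisy gradients with step sizes $a_t = 1/t$, which satisfy $\sum_t a_t = \infty$ and $\sum_t a_t^2 < \infty$. The problem, the function name, and the step-size schedule are illustrative assumptions and are unrelated to the actual CL-GNN loss.

```python
import random

def robbins_monro_mean(samples, theta0=0.0):
    """Minimize E[(theta - x)^2 / 2] from noisy gradients with step size 1/t."""
    theta = theta0
    for t, x in enumerate(samples, start=1):
        noisy_grad = theta - x           # unbiased estimate of the true gradient
        theta -= (1.0 / t) * noisy_grad  # diminishing Robbins-Monro step
    return theta

rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(5000)]
theta_hat = robbins_monro_mean(samples)
```

With this particular schedule the iterate equals the running mean of the samples, so it converges to the minimizer even though each individual gradient is noisy.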

5 Experiments

5.1 Experiment Setup

Datasets. We use five datasets: Cora, Pubmed, Friendster, Facebook, and Protein. The largest dataset, Friendster (Teixeira et al., 2019), has 43,880 nodes; it is a social network of users where the node attributes include numerical features (e.g., number of photos posted) and categorical features (e.g., gender, college, music interests) encoded as binary one-hot features. The node labels represent one of five age groups. Please refer to Appendix F for more details.

Train/Test split. To conduct inductive learning tasks, for each dataset we split the graphs for training and testing, and the nodes are sampled to guarantee that there is no overlap between any two of the training, validation, and test node sets. In our experiments, we tested two different label rates in the test graph: 0% (unlabeled) and 50%. We run five trials for all the experiments, and in each trial we randomly split the nodes/graphs for training/validation and testing.

As our method can be applied to any GNN models, we tested three GNNs as examples:

  • GCN Kipf and Welling (2016) which includes two graph convolutional layers. Here we implemented an inductive variant of the original GCN model for our tasks.

  • Supervised GraphSage Hamilton et al. (2017) (denoted by GS) with Mean pooling aggregator. We use sample size of 5 for neighbor sampling.

  • Truncated Krylov GCN (Luan et al., 2019) (denoted by TK), a recent GNN model that leverages multi-scale information in different ways and is scalable in depth. TK has stronger expressive power and achieved state-of-the-art performance on node classification tasks. We implemented the Snowball architecture, which achieved performance comparable to the other truncated Krylov architecture according to the original paper.

For each of GNNs, we compare its baseline performance (on its own) to the performance achieved using collective learning in CL-GNN. Note that we set the number of layers to be for GCN Kipf and Welling (2016) and GraphSage Hamilton et al. (2017) as set in their original papers, and use layers for TK Luan et al. (2019). For a fair comparison, the baseline GNN and CL-GNN are trained with the same hyper-parameters, e.g. number of layers, hidden dimensions, learning rate, early-stopping procedures. Please refer to Appendix F for details.

In addition, we also compare to three relational classifiers: ICA (Lu and Getoor, 2003), PL-EM (Pfeiffer III et al., 2015) and GMNN (Qu et al., 2019). The first two models apply collective learning and inference with simple local classifiers: Naive Bayes for PL-EM and logistic regression for ICA. GMNN is the state-of-the-art collective model with GNNs, which uses two GCN models to model label dependency and node attribute dependency, respectively. All three models take true labels in their input, thus we use the observed training labels for training and the observed test labels for testing.

We report the average accuracy score and standard error of five trials for the baseline models, and compute the absolute improvement of accuracy of our method over the corresponding base GNN. We report the balanced accuracy scores on Friendster dataset as the labels are highly imbalanced. To evaluate the significance of our model’s improvements, we performed a paired t-test with five trials.
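The two evaluation quantities mentioned above can be sketched with the standard library alone. The sketch assumes the usual definitions: the paired t statistic is the mean per-trial difference divided by its standard error, and balanced accuracy is the unweighted mean of per-class recall. The per-trial accuracies in the example are hypothetical and are not taken from Table 1.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_err, n - 1

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical per-trial accuracies over five trials (not taken from Table 1).
with_cl = [52.0, 55.0, 51.0, 57.0, 56.0]
baseline = [51.0, 53.0, 48.0, 53.0, 51.0]
t_stat, dof = paired_t_statistic(with_cl, baseline)
bal_acc = balanced_accuracy([0, 0, 0, 1], [0, 0, 1, 1])
```

With five trials there are four degrees of freedom, so, for instance, a two-sided test at the 5% level would compare the absolute t statistic against the critical value 2.776.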

5.2 Results

The node classification accuracy of all the models is shown in Table 1. Our proposed collective learning boost is denoted as +CL (for Collective Learning) in the results, and our model performance (absolute improvement over the corresponding baseline GNN) is shown in the shaded area. Numbers in bold represent significant improvement over the baseline GNN based on a paired t-test, and a marker denotes the best performing method in each column.

Comparison with baseline GNN models. Table 1 shows that our method improves the corresponding non-collective GNN models for all three model architectures (i.e., GCN, GraphSage and TK). Although all the models have large variances over multiple trials (because different parts of the graphs are used for training and testing in different trials), our model consistently improves the baseline GNN. A paired t-test comparing the performance of our method and the corresponding non-collective GNN shows that the improvement is almost always significant (marked in bold), with only four exceptions. Comparing the gains on different datasets, our method achieved smaller gains on Friendster. This is because the Friendster graph is much sparser than all other graphs (e.g., the edge density of Friendster is 1.5e-4 while that of Cora is 1.44e-3 (Teixeira et al., 2019)), which makes it hard for any model to propagate label information and capture the dependency.
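For reference, edge density can be computed as below. The sketch assumes the common definition for simple undirected graphs, the number of edges divided by the number of possible node pairs; the text does not state which definition was used, and the example sizes are hypothetical.

```python
def edge_density(num_nodes, num_edges):
    """Fraction of possible undirected edges (no self-loops) that are present."""
    return 2.0 * num_edges / (num_nodes * (num_nodes - 1))

triangle_density = edge_density(3, 3)       # complete graph on 3 nodes
sparse_density = edge_density(1000, 1500)   # hypothetical sparse graph
```

Under this definition, density falls roughly in inverse proportion to the number of nodes when the average degree stays fixed, so large social graphs tend to be very sparse.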

Moreover, comparing the improvements over GCN and TK, we observe that in general our method adds larger gains to GCN. For example, with 3% training labels on Cora and an unlabeled test graph, our method yields an average improvement of 6.29% over GCN, 2.35% over GraphSage, and 0.96% over TK. This is in line with Hypothesis 1, that collective learning can help GNNs produce a more expressive representation. As GCN is provably less expressive than TK (Luan et al., 2019), there is more room to increase its expressiveness.

Note that we use different trials for the two test label rates. The gains are generally larger when 50% of the test labels are available. For example, when combined with GCN, the improvements of our method are 6.29% and 1.72% for unlabeled Cora and Facebook test sets, but with partially-labeled test data the improvements are 15.69% and 2.95%, respectively. This shows the importance of modeling label dependency, especially when some test labels are observed.

Comparison with other relational classifiers. The two baseline non-GNN relational models, i.e., PL-EM and ICA, generally perform worse than the three GNNs, the only exception being the Protein dataset. This could be because the two non-GNN models generally need a larger labeled set to train the weak local classifier, whereas GNNs use a neural network architecture as the "local classifier", which is better at representation learning by transforming and aggregating node attribute information. However, when the model is trained with a large training set (e.g., with 30% of the nodes on the Protein dataset), modeling the label dependency becomes crucial. At the same time, our method is still able to improve the performance of the corresponding GNNs.

For GMNN, the collective GNN model, we see that it achieves better performance than its non-collective base model, GCN, and that our model combined with GCN achieves comparable or slightly better performance than GMNN. When combined with other, more powerful GNNs, our model can easily outperform it; e.g., on the Cora, Pubmed and Facebook datasets, TK performs better than GMNN and our method adds extra gains over TK.

Training labels per dataset: Cora 85 (3.21%), Pubmed 300 (1.52%), Friendster 641 (1.47%), Facebook 80 (1.76%), Protein 7607 (30%). Column headers give the dataset and the percentage of labels available in the test graph.

| Model | Cora 0% | Cora 50% | Pubmed 0% | Pubmed 50% | Friendster 0% | Friendster 50% | Facebook 0% | Facebook 50% | Protein 0% | Protein 50% |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 14.28 (0.00) | 14.28 (0.00) | 33.33 (0.00) | 33.33 (0.00) | 20.00 (0.00) | 20.00 (0.00) | 50.00 (0.00) | 50.00 (0.00) | 50.00 (0.00) | 50.00 (0.00) |
| GCN | 45.90 (3.26) | 36.38 (1.35) | 52.68 (2.36) | 54.11 (4.86) | 29.34 (0.55) | 28.44 (0.56) | 65.85 (1.01) | 63.13 (2.12) | 75.86 (1.11) | 77.54 (1.09) |
| GCN + CL | +6.29 (1.49) | +15.69 (3.20) | +4.48 (2.33) | +5.62 (1.17) | +0.81 (0.10) | +0.90 (0.32) | +1.72 (0.48) | +2.95 (0.84) | +1.22 (0.51) | +0.75 (0.33) |
| GS | 50.69 (1.50) | 48.42 (2.82) | 59.34 (3.47) | 58.52 (5.42) | 28.10 (0.59) | 28.10 (0.48) | 64.56 (0.92) | 62.99 (0.88) | 73.85 (1.12) | 73.01 (2.28) |
| GS + CL | +2.35 (0.56) | +4.52 (0.84) | +1.48 (0.41) | +2.42 (0.27) | +0.31 (0.15) | +0.73 (0.23) | +2.38 (0.77) | +2.05 (0.04) | +0.84 (0.12) | +1.47 (0.63) |
| TK | 63.74 (2.61) | 55.68 (2.08) | 61.13 (5.03) | 63.05 (5.15) | 28.89 (0.10) | 29.30 (0.15) | 67.63 (1.03) | 65.80 (1.16) | 73.65 (1.69) | 78.94 (1.50) |
| TK + CL | +0.96 (0.30) | +7.18 (1.88) | +1.00 (0.21) | +1.91 (0.75) | +0.55 (0.17) | +0.45 (0.08) | +0.63 (0.26) | +2.37 (0.80) | +1.31 (0.27) | +1.36 (0.94) |
| PL-EM | 20.70 (0.05) | 20.35 (0.05) | 38.05 (4.85) | 31.70 (4.78) | 23.26 (0.01) | 26.30 (0.25) | 56.17 (7.42) | 54.56 (6.17) | 78.46 (1.45) | 77.95 (1.56) |
| ICA | 26.20 (0.51) | 31.17 (3.66) | 44.40 (1.92) | 33.38 (4.69) | 25.14 (0.03) | 25.08 (0.17) | 47.93 (6.04) | 59.39 (3.69) | 84.88 (3.35) | 84.39 (4.08) |
| GMNN | 49.05 (1.86) | 49.36 (2.22) | 58.03 (3.26) | 62.16 (4.40) | 22.20 (0.07) | 28.53 (0.64) | 65.82 (1.30) | 63.45 (2.15) | 76.75 (0.74) | 75.96 (0.76) |

Table 1: Node classification accuracy with unlabeled and partially-labeled test data. Rows marked "+ CL" report the absolute improvement over the corresponding baseline GNN. Numbers in bold represent significant improvement in a paired t-test, and a marker denotes the best performing method in each column.

We conducted two ablation studies to investigate the usage of predicted labels (detailed in Appendix H), which showed that (1) adding predicted labels to the model input had extra value compared to using true labels only, and (2) the gain of our framework comes from using samples of the predicted labels rather than random one-hot vectors.

Runtime analysis. CL-GNN computes embeddings at each stochastic gradient step, therefore overall, per-gradient step, CL-GNN is slower than its component WL-GNN. Overall, after iterations of Steps 1-3, CL-GNN total runtime increases by over the original runtime of its component WL-GNN. Our experiments are conducted on a single Nvidia GTX 1080Ti with GB of shared memory. CL-GNN built on GCN on the largest dataset (i.e. Friendster with 43K nodes) takes minutes for training and inference (the longest time), with a total of iterations, while the corresponding GCN takes only minutes for the same operations. In the same dataset, CL-GNN built on TK takes minutes for training and inference, while the corresponding TK takes minutes. We give a detailed account of running times on smaller graphs in Appendix F. We note that we spent nearly no time engineering CL-GNN for speed or for improving our results. Our interest in this paper lies entirely on the gains of a direct application of CL-GNN. We fully expect that further engineering advances can significantly reduce the performance penalty and increase accuracy gains. For instance, parallelism can significantly reduce the time to collect samples in CL-GNN.

6 Related work

On collective learning and neural networks. There has been work on applying deep learning to collective classification. For example, Moore and Neville (2017) proposed to use LSTM-based RNNs for classification tasks on graphs. They transform each node and its set of neighbors into an unordered sequence and use an RNN to predict the class label as the output of that sequence. Pham et al. (2017) designed a deep learning model for collective classification in multi-relational domains, which learns local and relational features simultaneously to encode multi-relations between any two instances.

The closest work to ours is Fan and Huang (2019), which proposed a recurrent collective classification (RCC) framework, a variant of ICA (Lu and Getoor, 2003) including dynamic relational features encoding label information. Unlike our framework, this method does not sample labels $\hat{Y}$, opting instead for an end-to-end training procedure. Vijayan et al. (2018) opt for a similar no-sample RCC end-to-end training method as Fan and Huang (2019), combining a differentiable graph kernel with an iterative stage. Graph Markov Neural Network (GMNN) (Qu et al., 2019) is another promising approach that applies statistical relational learning to GNNs. GMNNs model the joint label distribution with a conditional random field trained with the variational EM algorithm. GMNNs are trained by alternating between an E-step and an M-step, and two WL-GCNs are trained for the two steps, respectively. These studies represent different ideas for bringing the power of collective classification to neural networks. Unfortunately, Corollary 1 shows that, without sampling $\hat{Y}$, the above methods are still WL-GNNs, and hence their use of collective classification fails to deliver any increase in expressiveness beyond an optimal WL-GNN (e.g., Xu et al. (2018)).

In parallel to our work, Jia and Benson (2020) consider regression tasks by modeling the joint GNN residual of a target set as a multivariate Gaussian, defining the loss function as the marginal likelihood only over labeled nodes. In contrast, by using the more general foundation of collective classification, our framework can seamlessly model both classification and regression tasks, and include model predictions over the entire graph as CL-GNN's input, thus affecting both the model prediction and the GNN training in inductive node classification tasks.

On self-supervised learning and semi-supervised learning. Self-supervised learning is closely related to semi-supervised learning. In fact, self-supervised learning can be seen as a self-imposed semi-supervised learning task, where part of the input is masked (or transformed) and must be predicted back by the model (Doersch et al., 2015; Noroozi and Favaro, 2016; Lee et al., 2017; Misra et al., 2016). Recently, self-supervised learning has been broadly applied to achieve state-of-the-art accuracy in computer vision (Hénaff et al., 2019; Gidaris et al., 2019) and natural language processing Devlin et al. (2018) supervised learning tasks. The use of self-supervised learning in graph representation learning is intimately related to the use of pseudolikelihood to approximate true likelihood functions.

Collective classification for semi-supervised learning tasks. Conventional relational machine learning (RML) developed methods to learn joint models from labeled graphs (Lu and Getoor, 2003; Neville and Jensen, 2000), and applied the learned classifier to jointly infer the labels for unseen examples with collective inference. When the goal is to learn and predict within a partially-labeled graph, RML methods have considered semi-supervised formulations (Koller et al., 2007; Xiang and Neville, 2008; Pfeiffer III et al., 2015) to model the joint probability distribution $P(Y_U \mid Y_L, X, A)$.

In this case, RML methods use both collective learning and collective inference procedures for semi-supervised learning.

RML methods typically consider a Markov assumption to simplify the above expressions: every node is considered conditionally independent of the rest of the network given its Markov blanket. For undirected graphs, this is often simply the set of the node's immediate neighbors. Given the Markov blanket assumption, RML methods typically use a local conditional model (e.g., relational Naive Bayes (Neville et al., 2003b), relational logistic regression (Popescul et al., 2002)) to learn and infer labels within the network.

Semi-supervised RML methods utilize the unlabeled data to make better predictions within the network. Given the estimated values of unlabeled examples, the local model parameters can be learned by maximizing the pseudolikelihood of the labeled part:


The key difference between Equation 6 and the GNN objective in Equation 1 is that the RML model is always conditioned on the labels (either true labels or estimated labels), even when there are no observed labels in the test data.

The most common form of semi-supervised RML utilizes expectation maximization (EM) (Xiang and Neville, 2008; Pfeiffer III et al., 2015), which iteratively relearns the parameters given the expected values of the unlabeled examples. For instance, the PL-EM algorithm (Pfeiffer III et al., 2015) optimizes the pseudolikelihood over the entire graph:

In comparison to semi-supervised RML, our proposed framework CL-GNN performs collective learning to strengthen a GNN, which is a more powerful and flexible “local” classifier than the weak local classifiers typically used in RML methods (e.g., relational Naive Bayes). For parameter learning (M-step), CL-GNN also optimizes the pseudolikelihood, but incorporates Monte Carlo sampling from the label predictions instead of directly using the predicted probabilities of unlabeled nodes.
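The contrast between feeding predicted probabilities directly and feeding Monte Carlo label samples can be illustrated with a small sketch. It is not the CL-GNN implementation: the function names, the two-node example, and the number of draws are hypothetical. Two nodes with identical predicted distributions receive identical soft inputs, whereas sampled one-hot labels can differ between them in individual draws.

```python
import random

def sample_one_hot(probs, rng):
    """Draw one class from `probs` and return it as a one-hot vector."""
    k = rng.choices(range(len(probs)), weights=probs)[0]
    return [1.0 if c == k else 0.0 for c in range(len(probs))]

def soft_inputs(pred_probs):
    """No-sampling variant: the predicted probabilities are used as-is."""
    return {v: list(p) for v, p in pred_probs.items()}

def sampled_inputs(pred_probs, num_samples, seed=0):
    """Monte Carlo variant: each draw assigns a sampled one-hot label to every node."""
    rng = random.Random(seed)
    return [
        {v: sample_one_hot(p, rng) for v, p in pred_probs.items()}
        for _ in range(num_samples)
    ]

# Hypothetical predictions for two nodes that a model cannot tell apart.
pred_probs = {"u": [0.5, 0.5], "v": [0.5, 0.5]}
soft = soft_inputs(pred_probs)
draws = sampled_inputs(pred_probs, num_samples=50)
# Number of draws in which the two nodes received different labels.
num_asymmetric = sum(d["u"] != d["v"] for d in draws)
```

With soft inputs the two nodes stay indistinguishable; with sampled inputs a downstream model sees asymmetric label configurations in some of the draws.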

Collective inference. When the model is applied to make predictions on unlabeled nodes (in the E-step), joint (i.e., collective) inference methods such as variational inference or Gibbs sampling must be applied in order to use the conditionals from Equation 6. This combines the local conditional probabilities with global inference to estimate the joint distribution over the unlabeled vertices, e.g.,:

where each component is iteratively updated.

Alternatively, a Gibbs sampler iteratively draws a label from the corresponding conditional distributions of the unlabeled vertices:

In Gibbs sampling, it is the sampling of labels that allows us to sample from the joint distribution, which is what enriches the simple models often used in collective inference.
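A minimal Gibbs sampler over node labels can be sketched as follows. The local conditional used here (probability proportional to the exponentiated count of neighbors with each label) is a hypothetical stand-in for a learned local model, and the path graph, coupling strength, and sweep counts are illustrative assumptions.

```python
import math
import random

def local_conditional(node, labels, adj, num_classes, coupling=1.0):
    """Hypothetical local model: P(y_node = c | neighbor labels) is
    proportional to exp(coupling * number of neighbors labeled c)."""
    counts = [0] * num_classes
    for nb in adj[node]:
        counts[labels[nb]] += 1
    weights = [math.exp(coupling * c) for c in counts]
    total = sum(weights)
    return [w / total for w in weights]

def gibbs_sample(adj, observed, num_classes, sweeps=200, burn_in=50, seed=0):
    """Estimate label marginals of unlabeled nodes by Gibbs sampling."""
    rng = random.Random(seed)
    # Observed labels stay fixed; unlabeled nodes start from a random label.
    labels = {v: observed.get(v, rng.randrange(num_classes)) for v in adj}
    unlabeled = [v for v in adj if v not in observed]
    tallies = {v: [0] * num_classes for v in unlabeled}
    for sweep in range(sweeps):
        for v in unlabeled:
            probs = local_conditional(v, labels, adj, num_classes)
            labels[v] = rng.choices(range(num_classes), weights=probs)[0]
        if sweep >= burn_in:
            for v in unlabeled:
                tallies[v][labels[v]] += 1
    kept = sweeps - burn_in
    return {v: [t / kept for t in tallies[v]] for v in unlabeled}

# Path graph 0-1-2-3 whose two end nodes are observed with class 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
marginals = gibbs_sample(adj, observed={0: 0, 3: 0}, num_classes=2)
```

Each unlabeled node is resampled conditioned on the current labels of its neighbors, so the retained sweeps approximate draws from the joint distribution implied by the local conditionals.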

7 Conclusion

In this work, we answer the question “can collective learning and sampling techniques improve the expressiveness of state-of-the-art GNNs in inductive node classification tasks?” We first show that with the most expressive GNNs there is no need to do collective learning; however, since we do not have the most expressive models, we present the collective learning framework (exemplified in CL-GNN), which can be combined with any existing GNN to provably improve WL-GNN expressiveness. We considered inductive node classification tasks across graphs, and showed through an extensive empirical study that our collective learning significantly boosts GNN performance on all the tasks.

8 Potential biases

This work presents an algorithm for node classification that works on arbitrary graphs, including real-world social networks, citation networks, etc. Any classification algorithm that learns from data runs the risk of producing biased predictions reflective of the training data; our work, which learns an inductive node classifier from training graph examples, is no exception. However, our main focus in this paper is on introducing a general algorithmic framework that can increase the expressivity of existing graph representation learning methods, rather than particular real-world applications.

Supplementary Material of collective learning GNN

Appendix A Proof of Section 4

We restate the theorem for completeness.


Let be a classifier function that takes the most expressive representation of node as input and outputs a predicted class label for .

Let be the set of predicted labels at iteration of collective classification and let be the true label of node . Then either (1) , or (2) .

Case (1): Given the classifier and the most expressive representation , the true label of is independent of the labels predicted with collective classification. In this case, the predicted labels of ’s neighbors offer no additional information and, thus, collective classification is unnecessary.

Case (2): In this case, the true label of is not independent of the predicted labels. By Theorem 1 of Srinivasan and Ribeiro (2019), we know that for any random variable attached to node , it must be that a measurable function independent of s.t.

where is an noise source exogenous to (pure noise), and a.s. implies almost sure equality. Defining ,

which means must either be dependent on or contain domain knowledge information about the function that is not in . Since is a vector of random variables fully determined by and , it cannot depend on an exogenous variable. Thus, the predictions must contain domain knowledge of . Hence, we can directly incorporate this domain knowledge into another classifier s.t. , for instance is a function of . In this case, will predict the label of with equal or higher accuracy than collective classification based on predicted labels , which finishes our proof.

Appendix B Proof of Section 4.1

Figure 1: Training/testing graphs. Colors represent available node labels, and testing nodes are marked with question marks. WL-GNN cannot differentiate between the red and green nodes.



As defined, WLGNN is a most-expressive WL-GNN. We need to prove We will do that by first showing and then showing that .

: First, we need to show that for any mask , such that . This is clearly true since, for labeled tests, in Equation 2 we can always construct a for a CL-GNN


that ignores the inputs. Similarly, for unlabeled tests, in Equation 2 we can always construct a for a CL-GNN

that ignores the inputs.

: Let be the graph in Figure 1. We will first consider the case where the test data has partial labels. The case without labels follows directly from it. Using the graph in Figure 1(a) (training) and Figure 1(c) (partially-labeled testing) we show that a WL-GNN is unable, in test, to correctly give representations to the left-most nodes that are distinct from the right-most nodes (the same happens for the unlabeled test graph in Figure 1(b)). We then show that the representation of Equation 7 is able to distinguish these two sets of nodes.

WL-GNN is not powerful enough to give distinct representations to nodes in Figure 1(c): Consider giving an arbitrary feature value (say, the “color white”) to all uncolored nodes in Figure 1(c). We start by showing that the 1-WL test is unable to give different colors to the nodes in this graph. Since WL-GNNs are no more expressive than the 1-WL test (Xu et al., 2018; Morris et al., 2019), showing that the above is a stable coloring for nodes in the 1-WL test proves the first part of our result. A stable 1-WL coloring is defined as a coloring scheme on the graph that has a 1-to-1 correspondence with the colors in the previous step of the 1-WL algorithm. The input to the hash function of the 1-WL test is the same for all of nodes : the node itself has color white while the color set of the neighbors is the set . In the next 1-WL round, all the white nodes will be mapped to the same color by the hash function. The colors of node will not be the same as . Hence, the initial coloring of all nodes white and yellow is a stable coloring for 1-WL. Consequently, WL-GNN will give the same representation to all nodes in .
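The notion of a stable 1-WL coloring used above can be made concrete with a short color-refinement sketch. The graphs below are hypothetical toy examples rather than the graph of Figure 1, and `wl_refine` is an illustrative name.

```python
def wl_refine(adj, colors):
    """1-WL color refinement: repeatedly recolor each node by the pair
    (own color, sorted multiset of neighbor colors) until the partition
    of nodes into color classes stops changing."""
    colors = dict(colors)
    for _ in range(len(adj)):
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Map every distinct signature to a fresh integer color.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: palette[signatures[v]] for v in adj}
        # Refinement only ever splits classes, so an unchanged class
        # count means the coloring is stable.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
    return colors

# A 6-cycle in which every node starts with the same color ("white").
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
cycle_colors = wl_refine(cycle, {v: "white" for v in cycle})

# A 3-node path: the middle node has a different degree from the ends.
path = {0: [1], 1: [0, 2], 2: [1]}
path_colors = wl_refine(path, {v: "white" for v in path})
```

On the cycle every node has the same signature in every round, so the all-white coloring is already stable and all nodes share one color; on the path the end nodes are separated from the middle node after one round.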

CL-GNN gives the same representations within the sets and : At iteration of CL-GNN, we start with the base of the recursion . Now consider a given mask . Note that to sample for we apply into Equation 2 to obtain , and then apply into Equation 4, defining , which will give us , and any classes has a non-zero probability of being sampled since our output is a softmax.

Since nodes all get the same representation in the above WLGNN, their respective sampled , , will have the same distribution but possibly not the same values (due to sampling). Note that the nodes will get the same average in Equation 7 since , , has the same distribution and the nodes are isomorphic (even given the colors on nodes 9 and 10). Similarly, the nodes will also get the same average in Equation 7.

CL-GNN gives distinct representations across the sets and : Finally, we now prove that there exists a WL-GNN, which we will denote , such that for and . We will show that there is a joint sample of where there is no symmetry between the representations of nodes in and . Since each layer of WLGNN can have different parameters, we can easily encode differences in the number of hops it takes to reach a certain color. Moreover, at any WLGNN layer, the representation of a node can perfectly encode its own last-layer representation and the last-layer representation of its neighbors through a most-expressive multiset representation function (Xu et al., 2018).

It is enough for us to show that for a sampled the sets of nodes and can get distinct unique representations under WLGNN. By unique, we mean, can get representations in WLGNN that cannot be obtained by the nodes in . This representation uniqueness makes sure the averages in Equation 2 are different. Without loss of generality we will consider giving a special sampled label to only one node in one of the sets. The sampled labels , while all other nodes get red, will happen with non-zero probability, hence, they must be part of the expectation in Equation 2. Note that node (for ) and (for ) will feel the effects of the green color in their neighbors differently. That is, for there is a parameter choice for the layers of where the representation of node 6 uniquely encodes that the color green is within hops 1 (node 5) and 3 (from node 5 through nodes 9 and 10) of node 6 (if 6 treats its own representation differently from its neighbors). For , node 2’s representation will encode that green is observed hops 1 (node 1) and 2 (from node 1 through node 9) (similarly, 2 treats its own representation differently from its neighbors). Hence, these representations can be made unique by , i.e., no other assignments will create the same patterns for nodes 2 and 6, and thus, since has most-expressive multiset representations, it can give a unique representation to nodes 2 and 6 for these two unique configurations. These unique representations are enough to ensure for any , which concludes our proof.
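The symmetry-breaking effect of giving a single node a distinct sampled label can be checked with 1-WL on a standard pair of graphs. This sketch does not use the graph of Figure 1; the 6-cycle and the two disjoint triangles are a hypothetical substitute, chosen because 1-WL cannot separate them when all nodes carry the same label.

```python
from collections import Counter

def wl_refine(adj, colors):
    """1-WL color refinement run until the color partition is stable."""
    colors = dict(colors)
    for _ in range(len(adj)):
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: palette[signatures[v]] for v in adj}
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
    return colors

def class_sizes(adj, colors):
    """Sorted sizes of the stable color classes (a graph-level summary)."""
    stable = wl_refine(adj, colors)
    return sorted(Counter(stable.values()).values())

hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

# All nodes share one label: the two graphs look identical to 1-WL.
uniform = {v: "red" for v in range(6)}
# One node receives a different sampled label, all others stay red.
sampled = dict(uniform)
sampled[0] = "green"

print(class_sizes(hexagon, uniform), class_sizes(triangles, uniform))
print(class_sizes(hexagon, sampled), class_sizes(triangles, sampled))
```

With uniform labels both graphs collapse to a single color class. Once node 0 is labeled green, the cycle splits by hop distance from the green node while the triangles split into the green node, its two neighbors, and the untouched triangle, so the two label configurations yield different patterns.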

Appendix C Proof of Section 4.2

(a) Training graph
(b) 2nd-order neighborhood for label prediction
Figure 2: WL-GNN using 2nd-order neighborhoods cannot differentiate nodes 1 and 2, but CL-GNN built on this WLGNN can break the local isomorphism.



Let be the graph in Figure 2(a) with no node features, and let WLGNN be of order , meaning it will generate node embeddings based on 2nd-order neighborhoods (shown in (b)). Since node and have the same 2nd-order neighborhood structure, WLGNN will generate identical node representation for them, which gives random label predictions. Meanwhile, as nodes and have distinct 2nd-order neighborhood structures, WLGNN generates different node representations for them, which enables the model to learn from the labels and . We can assume the predicted label probability and . For CL-GNN, at iteration , we sample from the WLGNN output and use the samples as input. In the worst case, nodes and get the same distribution and sampled labels (i.e. = , = ). Since the distribution of and are different, the samples of and are different, which breaks the tie between the 2nd-order neighborhood of nodes and . Therefore, CL-GNN will produce different node representation starting from iteration for nodes and , which enables the model to learn from the training label and , and thus gives more accurate predictions. ∎

The advantage of collective inference is clearer when it is used to strengthen less-expressive local classifiers, e.g., logistic regression. Although GNNs are much more powerful than these local classifiers because they aggregate high(er)-order graph information, collective learning can still help if a GNN fails to make use of “global” information in the graph (or equivalently, if the order of the GNN is smaller than the graph diameter). Previous work investigating the power of collective inference (Jensen et al., 2004) also showed that methods for collective inference benefit from a clever factoring of the space of dependencies, arguing that these methods benefit from information propagated from outside of their local neighborhood. Predictions about the class labels of other objects essentially “bundle information” about the graph beyond the immediate neighborhood.

Appendix D Proof of Section 4.3




In our optimization, we only need to sample two variables and . We obtain unbiased bounded-variance estimates of the derivative of the loss function if we sample (and exact values when ). We can now compound that with unbiased bounded-variance estimates of the derivative if we estimate the expectation in Equation 2 by i.i.d. sampling . The loss in Equation 3 is convex on since the negative log-likelihood of the multi-class logistic regression is convex on , which means it is also convex on as the loss is defined on the affine transformation . The expectation of the loss always exists, since we assume is bounded for all . Hence, as the loss is convex w.r.t. , the expectation w.r.t. exists, and we obtain an unbiased estimate of , we can apply Jensen’s inequality to show that the resulting Robbins-Monro stochastic optimization optimizes an upper bound of the loss in Equation 3. ∎
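The Jensen's inequality step can be checked numerically. The sketch below assumes, for illustration only, that the sampled quantity is a vector of class logits drawn from a Gaussian; the softmax negative log-likelihood is convex in the logits, so the loss at the averaged logits is never larger than the average of the per-sample losses.

```python
import math
import random

def nll(logits, target):
    """Softmax negative log-likelihood of `target`, convex in `logits`."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

rng = random.Random(0)
# Hypothetical Monte Carlo samples of a 3-class logit vector.
samples = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(1000)]

mean_logits = [sum(s[c] for s in samples) / len(samples) for c in range(3)]
loss_of_mean = nll(mean_logits, target=0)                             # loss at the average
mean_of_loss = sum(nll(s, target=0) for s in samples) / len(samples)  # average of the losses

print(loss_of_mean, mean_of_loss)
```

Minimizing the average of the sampled losses therefore minimizes an upper bound on the loss evaluated at the averaged quantity, which is the role Jensen's inequality plays in the argument above.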

Appendix E Additional information on datasets and experiment setup

Datasets. We use five datasets for evaluation, with the dataset statistics shown in Table 2.

| Dataset | # Nodes | # Attributes | # Classes | # Test |
|---|---|---|---|---|
| Cora | 2708 | 1433 | 7 | 1000 |
| Pubmed | 19717 | 500 | 3 | 1000 |
| Friendster | 43880 | 644 | 5 | 6251 |
| Facebook | 4556 | 3 | 2 | 1000 |
| Protein | 12679 | 29 | 2 | 2376 |

Table 2: Dataset statistics

  • Cora and Pubmed are benchmark datasets for node classification tasks from Sen et al. (2008). They are citation networks with nodes representing publications and edges representing citation relations. Node attributes are bag-of-words features of each document, and the predicted label is the corresponding research field.

  • Facebook (Yang et al., 2017) is a social network of Facebook users from Purdue University, where nodes represent users and edges represent friendship. The features of the nodes are: religious views, gender, and whether the user’s hometown is in Indiana. The predicted label is political view.

  • Friendster (Teixeira et al., 2019) is a social network. Nodes represent users and edges represent friendship. The node attributes include numerical features (e.g., number of photos posted) and categorical features (e.g., gender, college, music interests), encoded as binary one-hot features. The node labels represent one of the five age groups: 0-24, 25-30, 36-40, 46-50 and over 50. This version of the graph contains 40K nodes, 25K of which are labeled.

  • Protein is a collection of protein graphs from Borgwardt et al. (2005). Each node is labeled with a functional role of the protein, and has a 29 dimensional feature vector. We use 85 graphs with an average size of 150 nodes.

Data splits. To conduct inductive learning tasks, we have to properly split the graphs into labeled and unlabeled parts. For datasets containing only one graph (Cora, Pubmed, Facebook and Friendster), we randomly sample a connected component to be , and then sample a test set () from the remainder nodes (). To make partially-labeled test data available, we sample another connected component as with the same size as . The nodes are sampled to guarantee that there is no overlapping between any two sets of , and . Here and have the same graph structure but with different labeled nodes.
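One way to draw a connected training set and a disjoint random test set is sketched below. The breadth-first growth from a random seed node is an assumption made for illustration; the text above does not specify the exact sampling procedure, and the function names and the ring graph are hypothetical.

```python
import random
from collections import deque

def sample_connected_nodes(adj, size, seed=0):
    """Grow a connected node set by breadth-first search from a random node."""
    rng = random.Random(seed)
    start = rng.choice(sorted(adj))
    chosen = {start}
    frontier = deque([start])
    while frontier and len(chosen) < size:
        node = frontier.popleft()
        neighbors = [u for u in adj[node] if u not in chosen]
        rng.shuffle(neighbors)
        for u in neighbors:
            if len(chosen) >= size:
                break
            chosen.add(u)
            frontier.append(u)
    return chosen

def is_connected(adj, nodes):
    """Check that `nodes` induce a connected subgraph of `adj`."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u in nodes and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen == nodes

# Toy ring graph with 10 nodes.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
train_nodes = sample_connected_nodes(ring, size=4)
# Test nodes are drawn at random from the nodes not used for training.
remaining = sorted(set(ring) - train_nodes)
test_nodes = set(random.Random(1).sample(remaining, 3))
```

Sampling the test nodes only from the remainder guarantees that the training and test sets do not overlap.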

For the protein dataset, as we have 85 disjoint graphs, we randomly choose 51 (60%) graphs for training, 17 (20%) graphs for validation and the remaining 17 (20%) graphs for testing. To simulate semi-supervised learning settings, we mask out 50% of true labels on the training graphs. For the tasks with partially-labeled test data, we randomly select 50% of the nodes in the test graphs as labeled nodes, and test on the remaining 50% nodes. We run five trials for all the experiments, and in each trial we randomly split the nodes/graphs as described.
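The graph-level split and label masking described for the protein dataset can be sketched as follows. The helper names and the integer graph and node identifiers are hypothetical; only the 60/20/20 proportions and the 50% masking rate come from the text.

```python
import random

def split_graphs(graph_ids, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly split graph ids into train / validation / test lists."""
    rng = random.Random(seed)
    ids = list(graph_ids)
    rng.shuffle(ids)
    n_train = round(train_frac * len(ids))
    n_val = round(val_frac * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def mask_labels(node_ids, keep_frac=0.5, seed=0):
    """Keep the labels of a random fraction of nodes and hide the rest."""
    rng = random.Random(seed)
    nodes = list(node_ids)
    rng.shuffle(nodes)
    n_keep = int(keep_frac * len(nodes))
    return set(nodes[:n_keep]), set(nodes[n_keep:])

# 85 disjoint graphs split into 51 / 17 / 17.
train, val, test = split_graphs(range(85))
# Within one graph of 10 nodes, half of the labels are kept.
labeled, hidden = mask_labels(range(10))
```

Calling both helpers with a different `seed` in each trial would reproduce the repeated random splits over five trials.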

As seen in Section 5.1, to approximate an inductive learning setting, we use a different train/test data split procedure (i.e. connected training set) on Cora and Pubmed networks from the public version (i.e. random training set) used in most of the existing GNN models (Kipf and Welling, 2016; Luan et al., 2019). This is illustrated in Figure 3, where the random training set of the traditional GNN evaluation methods (in e.g., (Kipf and Welling, 2016; Luan et al., 2019)) is shown on the left, contrasted with our harder task of connected training set shown on the right. This difference in task is the reason why the model performance reported in our paper is not directly comparable with the reported results in previous GNN papers, even though we used the same implementations and hyperparameter search procedures.

Figure 3: The different data splits between the traditional GNN train/test split evaluation (left) and our (more realistic) connected train/random test split evaluation (right).

Hyperparameter setting. For hyperparameter tuning, we searched the initial learning rate within {0.005, 0.01, 0.05} with weight decay of . Dropout is applied to all the layers with . Hidden units are searched within {16, 32} if the dataset wasn’t used by the original GNN paper, or set as the same number as originally chosen in the GNN paper. The number of layers is set to for both GCN Kipf and Welling (2016) and GraphSage Hamilton et al. (2017) as used in their paper, and we use layers for TK Xu et al. (2018). For GraphSage Hamilton et al. (2017), the neighborhood sample size is set to . We use the same GNN structure (i.e. layers, hidden units, neighborhood sample size) for the non-collective version and in CL-GNN for fair comparison.
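The search space above can be enumerated as a small grid. This sketch covers only the values that are stated explicitly (learning rates and hidden units); the configuration keys and the scoring function are hypothetical, and settings whose values are not given here (weight decay, dropout) are left out.

```python
from itertools import product

LEARNING_RATES = [0.005, 0.01, 0.05]
HIDDEN_UNITS = [16, 32]

def search_grid(learning_rates=LEARNING_RATES, hidden_units=HIDDEN_UNITS):
    """Enumerate every (learning rate, hidden units) combination."""
    return [{"lr": lr, "hidden": h} for lr, h in product(learning_rates, hidden_units)]

def select_best(configs, validation_accuracy):
    """Return the configuration with the highest validation accuracy."""
    return max(configs, key=validation_accuracy)

configs = search_grid()
# Hypothetical scoring function standing in for a full training run.
best = select_best(configs, lambda cfg: -abs(cfg["lr"] - 0.01) + cfg["hidden"] / 1000)
```

In practice `validation_accuracy` would train the model with the given configuration and return its accuracy on the validation set.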

For CL-GNN, the additional hyperparameters are (1) the sample size of predicted labels (), and (2) the number of model iterations (). We set the sample size for the Friendster dataset and for all other datasets. For label rate of , the model is trained for iterations, and each iteration contains epochs. Note that we sample a new binary mask for each epoch as described in Section 3. For label rate of , the model is trained for iterations, and each iteration contains up to epochs, which can be stopped early if the validation accuracy decreases for a specified number of consecutive epochs. The numbers of iterations are empirically determined, as only marginal improvements are observed after iterations for unlabeled test data and iterations for partially-labeled test data. The validation accuracy is used to choose the best epoch.
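The per-epoch masking and early-stopping logic of one iteration can be sketched as below. The masking probability, the patience value, and the callback `train_epoch` are assumptions introduced for illustration; the validation curve is made-up data.

```python
import random

def sample_binary_mask(num_nodes, rng, keep_prob=0.5):
    """Draw a fresh 0/1 mask over nodes (keep_prob is an assumed value)."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(num_nodes)]

def run_iteration(train_epoch, max_epochs, patience, num_nodes, seed=0):
    """Run up to `max_epochs` epochs, stopping once the validation accuracy
    has decreased for `patience` consecutive epochs. Returns the best epoch,
    its validation accuracy, and the number of epochs actually run."""
    rng = random.Random(seed)
    best_acc, best_epoch = float("-inf"), None
    previous_acc, declines, epochs_run = None, 0, 0
    for epoch in range(max_epochs):
        mask = sample_binary_mask(num_nodes, rng)  # new mask every epoch
        val_acc = train_epoch(epoch, mask)
        epochs_run += 1
        if val_acc > best_acc:
            best_acc, best_epoch = val_acc, epoch
        if previous_acc is not None and val_acc < previous_acc:
            declines += 1
        else:
            declines = 0
        previous_acc = val_acc
        if declines >= patience:
            break
    return best_epoch, best_acc, epochs_run

# Made-up validation curve: accuracy peaks at epoch 2, then drops twice.
curve = [0.50, 0.60, 0.70, 0.65, 0.60, 0.90]
result = run_iteration(lambda epoch, mask: curve[epoch],
                       max_epochs=len(curve), patience=2, num_nodes=8)
```

Here training stops after the second consecutive decrease, and the epoch with the highest validation accuracy seen so far is the one selected.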

Note that the hyper-parameter tuning could be done more aggressively to further boost the performance of CL-GNN, e.g. using more layers for TK Xu et al. (2018), but our main goal is to evaluate the relative improvements of CL-GNN on the corresponding non-collective GNNs.

Appendix F Running times for GNN models on multiple datasets

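Running time (minutes):

| Dataset | GNN structure | Non-collective GNN | CL-GNN (unlabeled) | CL-GNN (partially-labeled) |
|---|---|---|---|---|
| Cora | GCN | 0.09 | 0.83 | 3.65 |
| Cora | TK | 0.12 | 1.91 | 5.74 |
| Pubmed | GCN | 0.49 | 5.38 | 21.87 |
| Pubmed | TK | 0.52 | 7.82 | 51.62 |
| Friendster | GCN | 1.04 | 17.93 | 66.31 |
| Friendster | TK | 1.93 | 30.17 | 132.33 |
| Facebook | GCN | 0.02 | 1.44 | 5.37 |
| Facebook | TK | 0.05 | 2.41 | 7.22 |

Table 3: The running time (in minutes) for CL-GNN and its corresponding GNNs.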

Table 3 shows the running times for CL-GNN and the corresponding non-collective GNNs on various datasets. As mentioned in , for partially-labeled , CL-GNN applied random masks at each epoch, and ran for iterations, whereas for unlabeled , CL-GNN ran for iterations.

Appendix G CL-GNN performance with varying training label rates

To investigate the impact of the training label rate on node classification accuracy, we repeated the experiments on the Cora and Pubmed datasets with various numbers of training labels, on unlabeled test data and partially-labeled test data. Table 4 and Table 5 show the results for test label rates of 0% and 50%, respectively. We can see that, in general, the CL-GNN framework achieves a larger improvement when fewer labels are available in the training graph. For example, on Pubmed with GCN, the improvement of our framework shrinks as the training label rate grows from 1.52% to 1.90% and 3.04%. This shows that the CL-GNN framework is especially useful when only a small number of labels are available in training, which is the common use case of GNNs.

| # train labels | Cora 85 (3.21%) | Cora 105 (3.88%) | Cora 140 (5.17%) | Pubmed 300 (1.52%) | Pubmed 375 (1.90%) | Pubmed 600 (3.04%) |
|---|---|---|---|---|---|---|
| % test labels | 0% | 0% | 0% | 0% | 0% | 0% |
| Random | 14.28 (0.00) | 14.28 (0.00) | 14.28 (0.00) | 33.33 (0.00) | 33.33 (0.00) | 33.33 (0.00) |
| GCN | 45.90 (3.26) | 47.54 (3.50) | 61.92 (1.50) | 52.68 (2.36) | 55.76 (3.32) | 70.38 (2.31) |
| GCN + CL | +6.29 (1.49) | +5.20 (1.12) | +5.18 (0.66) | +4.48 (2.33) | +3.30 (1.52) | +0.98 (0.23) |
| GS | 50.69 (1.50) | 56.24 (2.08) | 66.08 (0.96) | 59.34 (3.47) | 64.37 (3.70) | 72.08 (1.87) |
| GS + CL | +2.35 (0.56) | +2.78 (0.59) | +1.95 (0.45) | +1.48 (0.41) | +0.62 (0.21) | +0.65 (0.25) |
| TK | 63.74 (2.61) | 70.01 (1.93) | 74.45 (0.34) | 61.13 (5.03) | 63.09 (5.57) | 75.46 (1.46) |
| TK + CL | +0.96 (0.30) | +1.08 (0.37) | +0.30 (0.11) | +1.00 (0.21) | +1.34 (0.20) | +1.03 (0.22) |
| PL-EM | 20.70 (0.05) | 24.65 (0.38) | 30.46 (1.48) | 38.05 (4.85) | 44.85 (5.75) | 51.25 (3.06) |
| ICA | 26.20 (0.51) | 41.05 (0.50) | 49.51 (1.90) | 44.40 (1.92) | 45.62 (0.86) | 54.26 (2.09) |
| GMNN | 49.05 (1.86) | 54.55 (1.15) | 67.16 (1.86) | 58.03 (3.62) | 62.50 (3.77) | 71.03 (4.54) |

Table 4: Node classification accuracy with unlabeled test data and a varying number of training labels on the Cora and Pubmed datasets. Numbers in bold represent significant improvement in a paired t-test at the level, and numbers with represent the best performing method in each column.

| # train labels | Cora 85 (3.21%) | Cora 105 (3.88%) | Cora 140 (5.17%) | Pubmed 300 (1.52%) | Pubmed 375 (1.90%) | Pubmed 600 (3.04%) |
|---|---|---|---|---|---|---|
| % test labels | 50% | 50% | 50% | 50% | 50% | 50% |
| Random | 14.28 (0.00) | 14.28 (0.00) | 14.28 (0.00) | 33.33 (0.00) | 33.33 (0.00) | 33.33 (0.00) |
| GCN | 36.38 (1.35) | 48.31 (2.58) | 64.02 (1.54) | 54.11 (4.86) | 56.31 (3.10) | 68.13 (1.84) |
| GCN + CL | +15.69 (3.20) | +14.02 (3.38) | +6.31 (0.89) | +5.62 (1.17) | +5.06 (3.24) | +4.60 (2.50) |
| GS | 48.42 (2.82) | 57.52 (2.15) | 65.04 (0.79) | 58.52 (5.42) | 59.77 (4.68) | 75.01 (4.86) |
| GS + CL | +4.52 (0.84) | +3.06 (0.20) | +2.18 (0.21) | +2.42 (0.27) | +1.49 (0.10) | +2.67 (0.56) |
| TK | 55.68 (2.08) | 61.51 (2.45) | 67.95 (0.45) | 63.05 (5.15) | 67.95 (0.45) | 74.01 (3.58) |
| TK + CL | +7.18 (1.88) | +3.04 (1.07) | +2.75 (0.47) | +1.91 (0.75) | +0.54 (0.44) | +3.23 (0.78) |
| PL-EM | 20.35 (0.05) | 25.25 (0.35) | 31.45 (1.95) | 31.70 (4.78) | 34.92 (5.87) | 48.70 (5.72) |
| ICA | 31.17 (3.66) | 42.07 (1.29) | 57.14 (1.81) | 33.38 (4.69) | 45.93 (5.48) | 46.97 (5.19) |
| GMNN | 49.36 (2.22) | 56.58 (2.96) | 67.83 (1.91) | 62.16 (4.40) | 63.42 (4.82) | 74.78 (3.63) |

Table 5: Node classification accuracy with partially-labeled test data and a varying number of training labels on the Cora and Pubmed datasets. Numbers in bold represent significant improvement in a paired t-test at the level, and numbers with represent the best performing method in each column.
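The paired t-test mentioned in the table captions compares the accuracies of two methods across the same trials. The sketch below computes the test statistic with the standard library only; the per-trial accuracies are made-up illustrative numbers (not results from the tables), and the two-sided 0.05 significance level is an assumption, since the level is not recoverable from the captions.

```python
import math

def paired_t_statistic(baseline, improved):
    """t statistic of the paired differences improved[i] - baseline[i]."""
    diffs = [b - a for a, b in zip(baseline, improved)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical accuracies from five trials of a base model and its CL variant.
baseline = [60.0, 62.0, 61.0, 63.0, 59.0]
improved = [64.0, 65.0, 66.0, 66.0, 64.0]

t_stat = paired_t_statistic(baseline, improved)
# Two-sided critical value of Student's t with 4 degrees of freedom at 0.05.
T_CRITICAL_DF4 = 2.776
is_significant = abs(t_stat) > T_CRITICAL_DF4
```

Because the same random splits are used for both methods in each trial, pairing the differences removes the split-to-split variation from the comparison.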

Appendix H Ablation study

H.1 With or without predicted labels as input

Figure 4: CL-GNN performance with and without predicted labels on Cora and Pubmed. X-axis refers to iteration number in Section 3

To investigate whether adding predicted labels to the model input adds extra information with partially-labeled test data, we tested the performance of a model variant that uses only true labels as input with the same node masking procedure. Figure 4 shows two examples, Cora with GCN (Figure 4(a)) and Pubmed with TK (Figure 4(b)), where including predicted labels achieves better performance. We run the model 10 times and calculate the average and standard deviation (shown as the shaded area) of the classification accuracy at each iteration as described in Section 3. We can see that adding predicted labels starts to improve the performance after the first iteration and achieves consistent gains.

| # labels | Cora 85 (3.21%) | Cora 105 (3.88%) | Cora 140 (5.17%) | Pubmed 300 (1.52%) | Pubmed 375 (1.90%) | Pubmed 600 (3.04%) | Friendster 641 (1.47%) | Facebook 80 (1.76%) | Protein 7607 (30%) |
|---|---|---|---|---|---|---|---|---|---|
| Random | 14.28 (0.00) | 14.28 (0.00) | 14.28 (0.00) | 33.33 (0.00) | 33.33 (0.00) | 33.33 (0.00) | 20.00 (0.00) | 50.00 (0.00) | 50.00 (0.00) |
| GCN | 45.15 (3.73) | 52.35 (2.01) | 65.11 (1.95) | 53.21 (4.04) | 57.15 (3.61) | 70.81 (3.47) | 29.80 (0.48) | 65.89 (0.68) | 73.03 (2.14) |
| GCN + CL-random | +0.02 (0.65) | -1.83 (1.05) | +0.27 (0.18) | +2.29 (0.34) | +1.35 (0.58) | +1.05 (0.96) | -0.14 (0.40) | +1.16 (0.25) | +0.16 (0.80) |
| GS | 46.38 (1.62) | 52.87 (1.03) | 63.46 (1.38) | 55.38 (3.48) | 57.61 (4.21) | 68.81 (4.15) | 28.05 (0.56) | 65.20 (0.40) | 71.05 (0.40) |
| GS + CL-random | -2.45 (0.22) | +0.46 (0.45) | -0.23 (0.67) | -0.02 (0.32) | +0.42 (0.31) | +0.34 (0.53) | +0.21 (0.39) | +1.65 (0.15) | +0.01 (0.22) |
| TK | 61.99 (3.07) | 67.88 (1.80) | 73.04 (0.42) | 61.00 (4.93) | 61.91 (5.16) | 73.87 (3.99) | 29.44 (0.39) | 67.75 (0.40) | 73.38 (0.57) |
| TK + CL-random | -3.95 (1.08) | -2.54 (0.63) | -2.28 (0.84) | -0.65 (0.56) | -0.78 (0.38) | -1.17 (0.78) | +0.26 (0.38) | -0.05 (0.19) | -0.13 (0.53) |

Table 6: Node classification accuracy with unlabeled test data using uniform sampling. Numbers in bold represent significant improvement in a paired t-test at the level.

H.2 Sampling from predicted labels or random ids

Creating more expressive GNN representations by averaging out random features was first proposed by Murphy et al. (2019). Murphy et al. (2019) shows a whole-graph classification application, Circulant Skip Links (CSL) graphs, where such randomized feature averaging is provably (and empirically) more expressive than GNNs. Our Monte Carlo collective learning method can be seen as a type of feature averaging GNN representation though, unlike Murphy et al. (2019), the feature sampling is not at random, but rather driven by our own model recursively. Hence, it is fair to ask if our performance gains are simply because random feature averaging is beneficial to GNN representations? Or does collective learning sampling actually improve performance? We need an ablation study.

Therefore, in this section we investigate whether the gains of our method for unlabeled test data come from incorporating feature randomness, or from sampling w.r.t. predicted labels (collective learning). To do so, at each gradient step in CL-GNN we replace the samples drawn from the previous predictions with samples drawn uniformly from the set of class labels. The results are shown in Table 6. Clearly, the random features are not able to consistently improve the model performance as our method does (contrast Table 6 with Table 1 and Table 4). In summary, collective learning goes beyond the purely randomized approach of Murphy et al. (2019), providing much larger, statistically significant gains.
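The only difference between the two ablation arms is the distribution from which input labels are drawn, which can be sketched as a single switch. The function name, the `mode` argument, and the example prediction are hypothetical.

```python
import random

def draw_label(pred_probs, rng, mode="predicted"):
    """Draw one input label for a node.

    mode="predicted": sample from the model's previous prediction
                      (collective learning).
    mode="uniform":   sample uniformly from the class set, ignoring
                      the prediction (random-feature ablation).
    """
    num_classes = len(pred_probs)
    if mode == "predicted":
        return rng.choices(range(num_classes), weights=pred_probs)[0]
    if mode == "uniform":
        return rng.randrange(num_classes)
    raise ValueError(f"unknown mode: {mode}")

rng = random.Random(0)
pred_probs = [0.9, 0.05, 0.05]  # hypothetical confident prediction for class 0
n = 2000
freq_predicted = sum(draw_label(pred_probs, rng, "predicted") == 0 for _ in range(n)) / n
freq_uniform = sum(draw_label(pred_probs, rng, "uniform") == 0 for _ in range(n)) / n
```

Samples drawn in the `predicted` mode carry the information in the model's own predictions, while samples drawn in the `uniform` mode inject randomness only.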


  1. We use the term optimal WL-GNN to refer to the most expressive version of a GNN, i.e., one that has the same distinguishing power as a Weisfeiler-Lehman test. Note that this is not a universal graph representation.


References

  1. Protein function prediction via graph kernels. Bioinformatics 21 (suppl_1), pp. i47–i56.
  2. On the equivalence between graph isomorphism testing and function approximation with GNNs. In Advances in Neural Information Processing Systems, pp. 15868–15876.
  3. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  4. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430.
  5. Recurrent collective classification. Knowledge and Information Systems 60 (2), pp. 741–755.
  6. Boosting few-shot visual learning with self-supervision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8059–8068.
  7. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
  8. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272.
  9. Why collective inference improves relational classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 593–598.
  10. Outcome correlation in graph neural network regression. arXiv preprint arXiv:2002.08274.
  11. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  12. Introduction to statistical relational learning. MIT Press.
  13. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676.
  14. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 496–503.
  15. Break the ceiling: stronger multi-scale deep graph convolutional networks. arXiv preprint arXiv:1906.02174.
  16. On the universality of invariant networks. arXiv preprint arXiv:1901.09342.
  17. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544.
  18. Deep collective inference. In Thirty-First AAAI Conference on Artificial Intelligence.
  19. Weisfeiler and Leman go neural: higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4602–4609.
  20. Relational pooling for graph representations. arXiv preprint arXiv:1903.02541.
  21. Learning relational probability trees. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 625–630.
  22. Simple estimators for relational Bayesian classifiers. In Third IEEE International Conference on Data Mining, pp. 609–612.
  23. Iterative classification in relational data. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data, pp. 13–20.
  24. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
  25. Overcoming relational learning biases to accurately predict preferences in large scale networks. In Proceedings of the 24th International Conference on World Wide Web, pp. 853–863.
  26. Column networks for collective classification. In Thirty-First AAAI Conference on Artificial Intelligence.
  27. Towards structural logistic regression: combining relational and statistical learning. Departmental Papers (CIS), pp. 134.
  28. GMNN: graph Markov neural networks. arXiv preprint arXiv:1905.06214.
  29. A stochastic approximation method. Ann. Math. Statist. 22 (3), pp. 400–407.
  30. Collective classification in network data. AI Magazine 29 (3), pp. 93–93.