A Collective Learning Framework to Boost GNN Expressiveness
Abstract
Graph Neural Networks (GNNs) have recently been used for node and graph classification tasks with great success. Unfortunately, existing GNNs are not universal (i.e., most-expressive) graph representations. In this work, we propose collective learning, a general collective classification Monte Carlo approach for graph representation learning that provably increases the representation power of existing GNNs. We show that our use of Monte Carlo sampling is key to these results. Our experiments consider the task of inductive node classification across partially-labeled graphs using five real-world network datasets and demonstrate a consistent, significant boost in node classification accuracy when our framework is combined with a variety of state-of-the-art GNNs.
1 Introduction
Graph Neural Networks (GNNs) have recently shown great success at node and graph classification tasks (Hamilton et al., 2017; Kipf and Welling, 2016; Luan et al., 2019; Xu et al., 2018). GNNs have been applied in both transductive settings (where the test nodes are embedded in the training graph) and inductive settings (where the training and test graphs are disjoint). Despite their success, existing GNNs are no more powerful than the Weisfeiler-Lehman (WL) graph isomorphism test and thus inherit its shortcomings, i.e., they are not universal (most-expressive) graph representations Chen et al. (2019); Morris et al. (2019); Murphy et al. (2019); Xu et al. (2018). In other words, these GNNs (which we refer to as WLGNNs, and which include GCNs Kipf and Welling (2016)) are not expressive enough for some node classification tasks, since their representations can provably fail to distinguish non-isomorphic nodes with different labels.
At the same time, a large body of work in relational learning has focused on strengthening poorly-expressive (i.e., local) classifiers in relational models (e.g., relational logistic regression, naive Bayes, decision trees (Neville et al., 2003a)) in collective classification frameworks, by incorporating dependencies among node labels and propagating inferences during classification to improve performance, particularly in semi-supervised settings (Koller et al., 2007; Pfeiffer III et al., 2015; Xiang and Neville, 2008).
In this work, we theoretically and empirically investigate the hypothesis that, by explicitly incorporating label dependencies among neighboring nodes via predicted-label sampling (akin to how collective classification improves less expressive classifiers), it is possible to devise an add-on training and inference procedure, which we denote collective learning, that can improve the expressiveness of any existing WLGNN for inductive node classification tasks.
Contributions. We first show that collective classification is provably unnecessary if one can obtain GNNs that are most-expressive.
Then, because current WLGNNs are not most-expressive, we propose a general add-on collective learning framework for GNNs that provably boosts their expressiveness beyond that of an optimal WLGNN.
2 Problem formulation
We consider the problem of inductive node classification across partially-labeled graphs, which takes as input a training graph G = (V, E, X, Y_L), where V is a set of vertices, E is a set of edges with adjacency matrix A, X is a matrix containing node attributes as d-dimensional vectors, and Y_L is a set of observed labels (with c classes) of a connected set of nodes L, where L is assumed to be a proper subset of V. Let Y_U be the unknown labels of the nodes U = V \ L. The goal is to learn a joint model of the labels given the graph and apply this same model to predict the hidden labels of another test graph G'. The test graph can be partially labeled or unlabeled, so its observed-label set may be empty.
Graph Neural Networks (GNNs), which aggregate node attribute information to produce node representations, have been successfully used for this task. At the same time, relational machine learning (RML) methods, which use collective inference to boost the performance of local node classifiers via (predicted) label dependencies, have also been successfully applied to this task.
Since state-of-the-art GNNs are not most-expressive Morris et al. (2019); Murphy et al. (2019); Xu et al. (2018), collective classification ideas may help to improve the expressiveness of GNNs. In particular, collective inference methods often sample predicted labels (conditioned on observed labels) to improve the local representation around nodes and approximate the joint label distribution. We also know from recent research that sampling randomized features can boost GNN expressiveness Murphy et al. (2019). This leads to the key conjecture of this work, Hypothesis 1, which we prove theoretically in Section 4 and validate empirically through extensive experimentation in Section 5.
Hypothesis 1.
Since current Graph Neural Networks (e.g., GCN, GraphSAGE, TK-GNN) cannot produce most-expressive graph representations, collective learning (which takes label dependencies into account via Monte Carlo sampling) can improve the accuracy of node classification by producing a more expressive graph representation.
Why? Because WLGNNs can extract more information about local neighborhood dependencies via sampling Murphy et al. (2019), and sampling predicted labels allows the GNN to attend to the relationships between node attributes, the graph topology, and label dependencies in local neighborhoods. With collective learning, GNNs are able to incorporate more information into the estimated joint label distribution. Next, we describe our collective learning framework.
3 Proposed framework: Collective Learning to boost GNNs
In this section, we outline CLGNN, a general framework that can incorporate any GNN and that combines a self-supervised learning approach with Monte Carlo embedding sampling in an iterative process to improve inductive learning on partially-labeled graphs.
Specifically, we are given a partially-labeled training graph G with adjacency matrix A and a partially-labeled test graph G' with adjacency matrix A'. The goal of the inductive node classification task is to train a joint model on G and apply it to G' by replacing the input graph. Given the two graphs, we can encode their labels as binary (0-1) matrices with one row per node and one column per class, where each row is the one-hot encoding of the corresponding (available) label.
(Background) GNN and representation learning. Given a partially-labeled graph, WLGNNs generate node representations by propagating feature information throughout the graph. Specifically,
ŷ_v = softmax(W h_v + b),  with  h_v = GNN(A, X; Θ)_v,    (1)
where h_v is the GNN representation of node v, softmax is the softmax activation, and W, b, and Θ are model parameters, which are learned by minimizing the cross-entropy loss between true labels and the predicted labels.
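The per-node readout of Equation 1 (a softmax over a message-passing representation) can be sketched in NumPy. Mean aggregation, the toy path graph, and all names below are illustrative choices of ours, not the paper's exact architecture:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def wl_gnn_forward(A, X, Ws, W_out):
    # Minimal WL-GNN: each layer mean-aggregates neighbor features and
    # applies a linear map plus ReLU; a final softmax layer maps each
    # node representation h_v to predicted class probabilities.
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0              # guard isolated nodes
    H = X
    for W in Ws:                     # one entry per message-passing layer
        H = np.maximum(0.0, (A @ H) / deg @ W)
    return softmax(H @ W_out)        # per-node label distribution

# Tiny 3-node path graph, 2-dim attributes, 2 classes.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = rng.normal(size=(3, 2))
P = wl_gnn_forward(A, X, [rng.normal(size=(2, 4))], rng.normal(size=(4, 2)))
```

Each row of `P` is a valid class distribution, matching the softmax readout in Equation 1.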
The collective learning framework. Following Hypothesis 1, we propose Collective Learning GNNs (CLGNN), which include label information as input to GNNs to produce a more expressive representation. The overall framework follows three steps: (Step 1) include labels in the input using a random mask; (Step 2) sample predicted labels for whatever is masked, use them as input to a WLGNN that also takes all visible node labels as input, and average the representations of the WLGNN over the sampled predicted labels; (Step 3) perform one optimization step by minimizing a negative log-likelihood upper bound (per Section 4.3). Collective learning for WLGNNs then consists of iterating over Steps 1-3. Finally, once optimized, we perform inference via Monte Carlo estimates.
Step 1. Random masking and self-supervised learning. The input to GNNs is typically the full graph. If we included the observed labels directly in the input, then it would be trivial to learn a model that predicts part of its own input. Instead, we either (scenario test-unlabeled) mask all label inputs if the test graph is expected to be unlabeled; or (scenario test-partial), if the test graph is expected to have partial labels, apply a mask to the labels we wish to predict in training so they do not appear in the input.
In (scenario test-partial), where the test graph is expected to have some observed labels, at each stochastic gradient descent step we randomly sample a binary mask from a set of masks, where a mask is a binary (0-1) matrix repeating the same node-wise vector in each column. By applying the mask to the observed labels, the set of true labels is effectively partitioned into two parts: the visible part is used as input to CLGNN, and the remaining part, selected by the bitwise-negated mask, is used as the optimization target.
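A minimal sketch of this node-wise masking, under our own naming assumptions (`sample_label_mask`, `keep_prob` are illustrative, not the paper's API):

```python
import numpy as np

def sample_label_mask(Y_obs, keep_prob, rng):
    # Draw one bit per node and repeat it across class columns, matching
    # a 0-1 mask matrix with the same node-wise vector in every column.
    # Visible rows feed the model input; masked rows become the target.
    n, c = Y_obs.shape
    bits = (rng.random(n) < keep_prob).astype(float)
    M = np.repeat(bits[:, None], c, axis=1)
    visible = M * Y_obs            # labels shown to CLGNN
    target = (1.0 - M) * Y_obs     # labels held out as the training target
    return visible, target, M

rng = np.random.default_rng(1)
Y = np.eye(3)[[0, 1, 2, 0, 1]]     # 5 labeled nodes, 3 classes, one-hot rows
visible, target, M = sample_label_mask(Y, keep_prob=0.5, rng=rng)
```

By construction `visible + target` recovers the full observed label matrix, which is exactly the partition described above.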
Step 2. CLGNN loss and its representation averaging. At the t-th step of our optimization (these steps can be coarser than a gradient step) we either (scenario test-partial) sample a mask or (scenario test-unlabeled) mask all labels. For now, we assume we can sample predicted labels from an estimate of their distribution; we will come back to this assumption soon. The GNN's label input is the concatenation of the visible true labels with sampled predicted labels for the masked entries (again selected via the bitwise-negated mask). Let
H_v^(t) = (1/k) Σ_{i=1..k} GNN(A, [X, Y_M ⊕ Ŷ^(t-1,i)])_v,    (2)
where GNN represents an arbitrary graph neural network model, Y_M ⊕ Ŷ^(t-1,i) denotes the visible labels combined with the i-th sample of predicted labels for the masked entries, k is the number of Monte Carlo samples, and H_v^(t) is the CLGNN's representation obtained for node v at step t.
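The Monte Carlo averaging above can be sketched as follows, with a stand-in "GNN" and a toy uniform label sampler (both hypothetical placeholders for the trained components):

```python
import numpy as np

def mc_representation(gnn, A, X, Y_visible, draw_labels, k, rng):
    # Equation-2-style estimate: average the base GNN's output over k
    # Monte Carlo samples of the (masked) predicted labels, which are
    # concatenated with the node attributes and the visible labels.
    total = None
    for _ in range(k):
        Y_hat = draw_labels(rng)                       # one label sample
        inp = np.concatenate([X, Y_visible + Y_hat], axis=1)
        out = gnn(A, inp)
        total = out if total is None else total + out
    return total / k

rng = np.random.default_rng(2)
A = np.eye(4)                       # trivial graph: each node sees itself
X = rng.normal(size=(4, 3))
Y_visible = np.zeros((4, 2))        # scenario test-unlabeled: nothing visible
toy_gnn = lambda A, H: A @ H        # stand-in for a real GNN
draw = lambda rng: np.eye(2)[rng.integers(0, 2, size=4)]  # uniform one-hots
H_bar = mc_representation(toy_gnn, A, X, Y_visible, draw, k=50, rng=rng)
```

With uniform one-hot samples, the averaged label columns converge to the sampler's mean, illustrating that the representation is an expectation over sampled labels rather than a single hard assignment.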
Our optimization is defined over the expectation of the representation w.r.t. the sampled predicted labels (Equation 2) and over a loss averaged over all sampled masks (noting that the case with nothing masked is trivial):
minimize  E_M [ − Σ_{v ∈ L_M} log softmax(W H_v^(t) + b)_{y_v} ],    (3)
where again softmax is the softmax activation function and L_M are the labeled nodes in the training graph held out by mask M.
Obtaining the predicted labels. At iteration t, we use the learned CLGNN model parameters to obtain H^(t) according to Equation 2 and to obtain the label predictions recursively
p_v^(t) = softmax(W H_v^(t) + b),  with  Ŷ^(t,i) sampled from p^(t),    (4)
starting the recursion with an initial label estimate at t = 0.
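The recursion of Equation 4 (predictions at step t feed the model at step t+1) has the flavor of a fixed-point iteration. A toy sketch with a hypothetical smoothing predictor standing in for the trained CLGNN:

```python
import numpy as np

def iterate_label_predictions(predict, Y_init, n_iters):
    # Equation-4-style recursion: feed each iteration's label
    # predictions back in as the next iteration's input.
    Y_hat = Y_init
    history = [Y_init]
    for _ in range(n_iters):
        Y_hat = predict(Y_hat)
        history.append(Y_hat)
    return Y_hat, history

# Toy predictor that smooths the current estimate toward a fixed target
# distribution (an illustrative stand-in for the trained model).
target = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
predict = lambda Y: 0.5 * Y + 0.5 * target
Y0 = np.full((2, 2), 0.5)           # start from an uninformative estimate
Y_final, history = iterate_label_predictions(predict, Y0, n_iters=20)
```

Under this contractive toy predictor the recursion converges geometrically to the target distribution, which is the behavior the iterative refinement aims for.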
Step 3. Stochastic optimization of Equation 3. Equation 3 is based on a pseudo-likelihood, in which the joint distribution of the labels is decomposed into marginal distributions, resulting in the sum over labeled nodes in Equation 3. In order to optimize Equation 3, we compute gradient estimates using the following sampling procedure.
1. We first need to compute an unbiased estimate of the representation in Equation 2 using i.i.d. samples from the model obtained at the previous time step (as described above). Note that the time/space complexity of CLGNN is a multiple of the time/space complexity of the corresponding GNN model, as we have to compute one representation per label sample for each node at each stochastic gradient step.
2. Next, we need an unbiased estimate of the expectation over masks in Equation 3. In (scenario test-partial) the unbiased estimates are obtained by sampling masks, while in (scenario test-unlabeled) the value obtained is exact since everything is masked. Section 4.3 shows that the above procedure yields a proper surrogate upper bound of the loss function.
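A minimal sketch of the resulting loss estimate: average the predicted probabilities over the Monte Carlo label samples, then take the negative log-likelihood on the held-out nodes. All names and shapes below are our own illustrative assumptions:

```python
import numpy as np

def surrogate_loss(prob_samples, Y_target, node_mask):
    # prob_samples: (k, n, c) predicted class probabilities, one slice
    # per Monte Carlo label sample; Y_target: (n, c) one-hot targets;
    # node_mask: (n,) with 1.0 marking held-out labeled nodes.
    p_bar = prob_samples.mean(axis=0)                   # MC average, (n, c)
    nll = -(Y_target * np.log(p_bar + 1e-12)).sum(axis=1)
    return (nll * node_mask).sum() / max(node_mask.sum(), 1.0)

# Two held-out nodes, k = 3 identical sampled prediction matrices.
probs = np.stack([np.array([[0.7, 0.3],
                            [0.4, 0.6]])] * 3)
Y_tgt = np.array([[1., 0.],
                  [0., 1.]])
loss = surrogate_loss(probs, Y_tgt, np.array([1., 1.]))
```

Averaging probabilities before taking the log is what makes the Monte Carlo estimate an upper-bound surrogate (by Jensen's inequality) rather than an exact likelihood.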
Inference with the learned model. We apply the same procedure as in obtaining the training-time predictions, but transfer the learned parameters to a different test graph. Specifically, CLGNN parameters are learned according to Equation 3 on the training graph. Each iteration contains a number of gradient steps, and thus the same number of sampled masks. Given an any-size attributed test graph, we sample the same number of masks, either (scenario test-partial) sampling them as in training or (scenario test-unlabeled) masking everything. We also sample predicted labels from the predicted probability distribution. The node representations are then obtained using the sampled masks and labels:
H'_v = (1/(s·k)) Σ_{j=1..s} Σ_{i=1..k} GNN(A', [X', Y'_{M_j} ⊕ Ŷ'^(i)])_v,    (5)
where again s is the number of gradient steps (and thus masks) per iteration and k is the number of Monte Carlo samples of the predicted labels. Then the label predictions are obtained from these representations using the learned CLGNN parameters, as in Equation 4.
Note that the test label predictions are also recursively updated, and the recursion starts with an initial label estimate.
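Putting the inference loop together as a sketch: average predictions over s sampled masks and, for each mask, k sampled label matrices. The predictor below is a hypothetical stand-in for the trained CLGNN, not the paper's implementation:

```python
import numpy as np

def collective_inference(predict, sample_mask, sample_labels, s, k, rng):
    # Monte Carlo inference: average model predictions over s masks
    # and, for each mask, k samples of the predicted labels.
    total = None
    for _ in range(s):
        M = sample_mask(rng)
        for _ in range(k):
            out = predict(M, sample_labels(rng))
            total = out if total is None else total + out
    return total / (s * k)

rng = np.random.default_rng(3)
n, c = 4, 2
# Stand-in predictor: mixes the sampled labels with a uniform prior.
predict = lambda M, Y: 0.5 * Y + 0.5 * np.full((n, c), 1.0 / c)
sample_mask = lambda rng: (rng.random(n) < 0.5).astype(float)
sample_labels = lambda rng: np.eye(c)[rng.integers(0, c, size=n)]
P = collective_inference(predict, sample_mask, sample_labels, s=4, k=25, rng=rng)
```

Because each per-sample prediction is a valid distribution, the Monte Carlo average is one as well, so the final per-node output can be read directly as class probabilities.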
4 Collective learning analysis
Is collective classification able to better represent target label distributions than graph representation learning? The answer to this question is both yes (for WLGNNs) and no (for most-expressive representations). The following theorem shows that a most-expressive graph representation (Murphy et al., 2019; Maron et al., 2019; Srinivasan and Ribeiro, 2019) would not benefit from a collective learning boost. All proofs can be found in the Appendix.

Theorem (Collective classification can be unnecessary). Consider the task of predicting node labels when no labels are available in test data. Let a most-expressive representation of a node in a graph be given. Then, for any collective learning procedure predicting the class label of that node, there exists a classifier that takes the most-expressive representation as input and predicts the label with equal or higher accuracy.

While this theorem shows that a most-expressive graph representation does not need collective classification, WLGNNs are not most-expressive Morris et al. (2019); Murphy et al. (2019); Xu et al. (2018). Indeed, Section 4.1 and Section 4.2 show that CLGNN boosts the expressiveness of the optimal WLGNN and of practical WLGNNs, respectively. Then, we show in Section 4.3 that the stochastic optimization in Step 3 optimizes a surrogate upper bound of the loss.
4.1 Expressive power of CLGNN
Morris et al. (2019) and Xu et al. (2018) show that WLGNNs are no more powerful in distinguishing non-isomorphic graphs and nodes than the standard Weisfeiler-Lehman graph isomorphism test (1-WL, or simply the WL test). Two nodes are deemed isomorphic by the WL test if they receive the same color assignment in the stable coloring. The node-expressivity of a parameterized graph representation can then be measured by the set of any-size attributed graphs in which the representation can distinguish all non-isomorphic nodes; we call this the identifiable set of the graph representation.
The most-expressive graph representation has an identifiable set equal to the set of all any-size attributed graphs. We refer to a WLGNN that is exactly as expressive as the WL test as the optimal WLGNN, which is at least as expressive as all other WLGNNs.
In this section we show that collective learning can boost the optimal WLGNN, i.e., the identifiable set of the optimal WLGNN is a proper subset of the identifiable set obtained by applying collective learning on top of it.
Theorem (CLGNN expressive power). Let an optimal WLGNN be given. Then the collective learning representation of Equation 2, using this optimal WLGNN as the GNN component, is strictly more expressive than the optimal WLGNN applied to the same tasks.
Section 4.1 answers Hypothesis 1 by showing that, by incorporating collective learning and sampling procedures, CLGNN can boost the expressiveness of WLGNNs, including the optimal WLGNN.
Corollary 1.
Consider a graph representation learning method that, at iteration t, replaces the sampled predicted labels in Equations 2 and 4 with a deterministic function of the predicted distribution, e.g., a softmax function that outputs class probabilities. Then such a method will be no more expressive than the optimal WLGNN and, hence, less expressive than CLGNN.
Corollary 1 shows that existing graph representation methods that may, on the surface, look like CLGNN, but do not perform the crucial step of sampling predicted labels, are unfortunately no more expressive than WLGNNs. Examples of such methods include Fan and Huang (2019); Moore and Neville (2017); Qu et al. (2019); Vijayan et al. (2018).
Next, we see that the practical benefits of collective learning are even greater when the WLGNN has limited expressive power due to a constraint on the number of messagepassing layers.
4.2 How CLGNN further expands the power of few-layer WLGNNs
An L-layer WLGNN will only aggregate neighborhood information within L hops of any given node (i.e., over an L-hop egonet, defined as the graph representing the connections among all nodes that are at most L hops away from the center node). In practice, mostly for computational reasons, WLGNNs have many fewer layers than the graph's diameter. For instance, GCN Kipf and Welling (2016) and GraphSAGE Hamilton et al. (2017) both used two layers in their experiments. Hence, they cannot differentiate two non-isomorphic nodes that are isomorphic within their L-hop neighborhoods. We now show that CLGNN can gather neighborhood information beyond L hops with an L-layer WLGNN.
Proposition. Consider a graph with diameter greater than L, and let two non-isomorphic nodes have isomorphic L-hop egonets but non-isomorphic larger egonets. Then a WLGNN representation with L layers will generate identical representations for the two nodes, while CLGNN is capable of giving them distinct node representations.
Section 4.2 shows that collective learning has yet another benefit: CLGNN further boosts the power of WLGNNs with limited message-passing layers by gathering neighborhood information within a larger radius. Specifically, CLGNN built on a WLGNN with L layers can enlarge the effective neighborhood radius beyond L in Equation 2, while a WLGNN would have to stack additional layers to achieve the same neighborhood radius, which in practice may cause optimization challenges (a small number of layers is a common hyperparameter choice in the literature).
4.3 Optimization of CLGNN
Proposition. If the gradient of the loss is bounded (e.g., via gradient clipping), then the optimization in Equation 3, with the unbiased sampling of masks and predicted labels described above, results in a Robbins-Monro (Robbins and Monro, 1951) stochastic optimization algorithm that optimizes a surrogate upper bound of the loss in Equation 3.

Since the optimization objective in Equation 3 is computationally impractical, as it requires computing all possible binary masks and label predictions, this proposition shows that the sampling procedure used in CLGNN, which considers samples of label predictions and a random mask at each gradient step, is a feasible approach to estimating an unbiased upper bound of the objective.
5 Experiments
5.1 Experiment Setup
Datasets. We use five real-world datasets: Cora, Pubmed, Friendster, Facebook, and Protein. The largest dataset, Friendster Teixeira et al. (2019), has 43,880 nodes; it is a social network of users where the node attributes include numerical features (e.g., number of photos posted) and categorical features (e.g., gender, college, music interests) encoded as binary one-hot features. The node labels represent one of five age groups. Please refer to Appendix F for more details.
Train/Test split. To conduct inductive learning tasks, for each dataset we split the graphs for training and testing, and the nodes are sampled to guarantee that there is no overlap between the training, validation, and test sets. In our experiments, we tested two different label rates in the test graph: 0% (unlabeled) and 50%. We run five trials for all the experiments, and in each trial we randomly split the nodes/graphs for training/validation and testing.
As our method can be applied to any GNN models, we tested three GNNs as examples:

GCN Kipf and Welling (2016), which includes two graph convolutional layers. Here we implemented an inductive variant of the original GCN model for our tasks.

Supervised GraphSage Hamilton et al. (2017) (denoted by GS) with the mean-pooling aggregator. We use a sample size of 5 for neighbor sampling.

Truncated Krylov GCN Luan et al. (2019) (denoted by TK), a recent GNN model that leverages multi-scale information in different ways and is scalable in depth. TK has stronger expressive power and achieved state-of-the-art performance on node classification tasks. We implemented the Snowball architecture, which achieved performance comparable to the other truncated Krylov architecture according to the original paper.
For each GNN, we compare its baseline performance (on its own) to the performance achieved using collective learning in CLGNN. Note that we set the number of layers for GCN Kipf and Welling (2016) and GraphSage Hamilton et al. (2017) as in their original papers, and likewise follow the original paper for TK Luan et al. (2019). For a fair comparison, the baseline GNN and CLGNN are trained with the same hyperparameters, e.g., number of layers, hidden dimensions, learning rate, and early-stopping procedures. Please refer to Appendix F for details.
In addition, we also compare to three relational classifiers: ICA Lu and Getoor (2003), PLEM Pfeiffer III et al. (2015), and GMNN Qu et al. (2019). The first two models apply collective learning and inference with simple local classifiers (Naive Bayes for PLEM and logistic regression for ICA). GMNN is the state-of-the-art collective model with GNNs, which uses two GCN models to model label dependency and node attribute dependency, respectively. All three models take true labels in their input, thus we provide them the available true labels during both training and testing.
We report the average accuracy score and standard error over five trials for the baseline models, and compute the absolute improvement in accuracy of our method over the corresponding base GNN. We report balanced accuracy scores on the Friendster dataset as its labels are highly imbalanced. To evaluate the significance of our model's improvements, we performed a paired t-test over the five trials.
5.2 Results
The node classification accuracy of all models is shown in Table 1. Our proposed collective learning boost is denoted as +CL (for Collective Learning) in the results, and our model's performance (absolute improvement over the corresponding baseline GNN) is reported in the +CL rows. Numbers in bold represent significant improvement over the baseline GNN based on a paired t-test, and a separate marker indicates the best performing method in each column.
Comparison with baseline GNN models. Table 1 shows that our method improves the corresponding non-collective GNN models for all three model architectures (i.e., GCN, GraphSage, and TK). Although all the models have large variances over multiple trials (because different parts of the graphs are trained and tested on in different trials), our model consistently improves the baseline GNN. The results of a paired t-test comparing the performance of our method and the corresponding non-collective GNN show that the improvement is almost always significant (marked in bold), with only four exceptions. Comparing the gains on different datasets, our method achieved smaller gains on Friendster. This is because the Friendster graph is much sparser than all the other graphs (e.g., the edge density of Friendster is 1.5e-4 while that of Cora is 1.44e-3 Teixeira et al. (2019)), which makes it hard for any model to propagate label information and capture the dependency.
Moreover, comparing the improvements over GCN and TK, we can observe that in general our method adds larger gains to GCN. For example, with 3% training labels on Cora, our method has an average improvement of 6.29% over GCN, 2.35% over GraphSage, and 0.96% over TK. This is in line with Hypothesis 1, that collective learning can help GNNs produce a more expressive representation: as GCN is provably less expressive than TK Luan et al. (2019), there is more room to increase its expressiveness.
Note that while we use different trials for the two test label rates, the gains are generally larger when 50% of the labels are available. For example, when combined with GCN, the improvements of our method are 6.29% and 1.72% on the unlabeled Cora and Facebook test sets, but with partially-labeled test data the improvements are 15.69% and 2.95%, respectively. This shows the importance of modeling label dependency, especially when some test labels are observed.
Comparison with other relational classifiers. The two baseline non-GNN relational models, PLEM and ICA, generally perform worse than the three GNNs, with the only exception on the Protein dataset. This could be because the two non-GNN models generally need a larger portion of labeled data to train their weak local classifiers, whereas GNNs utilize a neural network architecture as the "local classifier", which is better at representation learning by transforming and aggregating node attribute information. However, when the model is trained with a large training set (e.g., with 30% of the nodes on the Protein dataset), modeling the label dependency becomes crucial. At the same time, our method is still able to improve the performance of the corresponding GNNs.
GMNN, the collective GNN model, achieved better performance than its non-collective base model (GCN), and our model combined with GCN achieved comparable or slightly better performance than GMNN. When combined with other more powerful GNNs, our model easily outperforms it; e.g., on the Cora, Pubmed, and Facebook datasets, TK performs better than GMNN and our method adds extra gains over TK.
# train labels: Cora 85 (3.21%), Pubmed 300 (1.52%), Friendster 641 (1.47%), Facebook 80 (1.76%), Protein 7607 (30%). Column pairs give the percentage of labels observed in the test graph (0% or 50%).

| Model | Cora 0% | Cora 50% | Pubmed 0% | Pubmed 50% | Friendster 0% | Friendster 50% | Facebook 0% | Facebook 50% | Protein 0% | Protein 50% |
| Random | 14.28 (0.00) | 14.28 (0.00) | 33.33 (0.00) | 33.33 (0.00) | 20.00 (0.00) | 20.00 (0.00) | 50.00 (0.00) | 50.00 (0.00) | 50.00 (0.00) | 50.00 (0.00) |
| GCN | 45.90 (3.26) | 36.38 (1.35) | 52.68 (2.36) | 54.11 (4.86) | 29.34 (0.55) | 28.44 (0.56) | 65.85 (1.01) | 63.13 (2.12) | 75.86 (1.11) | 77.54 (1.09) |
| + CL | +6.29 (1.49) | +15.69 (3.20) | +4.48 (2.33) | +5.62 (1.17) | +0.81 (0.10) | +0.90 (0.32) | +1.72 (0.48) | +2.95 (0.84) | +1.22 (0.51) | +0.75 (0.33) |
| GS | 50.69 (1.50) | 48.42 (2.82) | 59.34 (3.47) | 58.52 (5.42) | 28.10 (0.59) | 28.10 (0.48) | 64.56 (0.92) | 62.99 (0.88) | 73.85 (1.12) | 73.01 (2.28) |
| + CL | +2.35 (0.56) | +4.52 (0.84) | +1.48 (0.41) | +2.42 (0.27) | +0.31 (0.15) | +0.73 (0.23) | +2.38 (0.77) | +2.05 (0.04) | +0.84 (0.12) | +1.47 (0.63) |
| TK | 63.74 (2.61) | 55.68 (2.08) | 61.13 (5.03) | 63.05 (5.15) | 28.89 (0.10) | 29.30 (0.15) | 67.63 (1.03) | 65.80 (1.16) | 73.65 (1.69) | 78.94 (1.50) |
| + CL | +0.96 (0.30) | +7.18 (1.88) | +1.00 (0.21) | +1.91 (0.75) | +0.55 (0.17) | +0.45 (0.08) | +0.63 (0.26) | +2.37 (0.80) | +1.31 (0.27) | +1.36 (0.94) |
| PLEM | 20.70 (0.05) | 20.35 (0.05) | 38.05 (4.85) | 31.70 (4.78) | 23.26 (0.01) | 26.30 (0.25) | 56.17 (7.42) | 54.56 (6.17) | 78.46 (1.45) | 77.95 (1.56) |
| ICA | 26.20 (0.51) | 31.17 (3.66) | 44.40 (1.92) | 33.38 (4.69) | 25.14 (0.03) | 25.08 (0.17) | 47.93 (6.04) | 59.39 (3.69) | 84.88 (3.35) | 84.39 (4.08) |
| GMNN | 49.05 (1.86) | 49.36 (2.22) | 58.03 (3.26) | 62.16 (4.40) | 22.20 (0.07) | 28.53 (0.64) | 65.82 (1.30) | 63.45 (2.15) | 76.75 (0.74) | 75.96 (0.76) |
We did two ablation studies to investigate the usage of predicted labels (detailed in Appendix H), which showed that (1) adding predicted labels to the model input has extra value compared to using true labels only, and (2) the gain of our framework comes from using samples of the predicted labels rather than random one-hot vectors.
Runtime analysis. CLGNN computes multiple sampled embeddings at each stochastic gradient step; therefore, per gradient step, CLGNN is slower than its component WLGNN, and over the iterations of Steps 1-3 the total runtime increases by a corresponding factor over the original runtime of its component WLGNN. Our experiments are conducted on a single Nvidia GTX 1080Ti. On the largest dataset (Friendster, with 43K nodes), CLGNN built on GCN takes the longest time for training and inference, substantially longer than the corresponding GCN for the same operations; the same holds for CLGNN built on TK relative to TK. We give a detailed account of running times on smaller graphs in Appendix F. We note that we spent nearly no time engineering CLGNN for speed or for improving our results; our interest in this paper lies entirely in the gains of a direct application of CLGNN. We fully expect that further engineering can significantly reduce the performance penalty and increase accuracy gains. For instance, parallelism can significantly reduce the time needed to collect samples in CLGNN.
6 Related work
On collective learning and neural networks. There has been work on applying deep learning to collective classification. For example, Moore and Neville (2017) proposed to use LSTM-based RNNs for classification tasks on graphs. They transform each node and its set of neighbors into an unordered sequence and use an RNN to predict the class label as the output of that sequence. Pham et al. (2017) designed a deep learning model for collective classification in multi-relational domains, which learns local and relational features simultaneously to encode the multiple relations between any two instances.
The closest work to ours is Fan and Huang (2019), which proposed a recurrent collective classification (RCC) framework, a variant of ICA Lu and Getoor (2003) including dynamic relational features that encode label information. Unlike our framework, this method does not sample labels, opting for an end-to-end training procedure. Vijayan et al. (2018) opts for a similar no-sample RCC end-to-end training method, combining a differentiable graph kernel with an iterative stage. Graph Markov Neural Network (GMNN) Qu et al. (2019) is another promising approach that applies statistical relational learning to GNNs. GMNNs model the joint label distribution with a conditional random field trained with the variational EM algorithm, alternating between an E-step and an M-step, with two GCNs trained for the two steps respectively. These studies represent different ideas for bringing the power of collective classification to neural networks. Unfortunately, Corollary 1 shows that, without sampling predicted labels, the above methods are still WLGNNs, and hence their use of collective classification fails to deliver any increase in expressiveness beyond an optimal WLGNN (e.g., Xu et al. (2018)).
In parallel to our work, Jia and Benson (2020) considers regression tasks by modeling the joint GNN residual of a target set as a multivariate Gaussian, defining the loss function as the marginal likelihood over labeled nodes only. In contrast, by using the more general foundation of collective classification, our framework can seamlessly model both classification and regression tasks, and include model predictions over the entire graph as CLGNN's input, thus affecting both the model prediction and the GNN training in inductive node classification tasks.
On selfsupervised learning and semisupervised learning. Selfsupervised learning is closely related to semisupervised learning. In fact, selfsupervised learning can be seen as a selfimposed semisupervised learning task, where part of the input is masked (or transformed) and must be predicted back by the model (Doersch et al., 2015; Noroozi and Favaro, 2016; Lee et al., 2017; Misra et al., 2016). Recently, selfsupervised learning has been broadly applied to achieve stateoftheart accuracy in computer vision (Hénaff et al., 2019; Gidaris et al., 2019) and natural language processing Devlin et al. (2018) supervised learning tasks. The use of selfsupervised learning in graph representation learning is intimately related to the use of pseudolikelihood to approximate true likelihood functions.
Collective classification for semi-supervised learning tasks. Conventional relational machine learning (RML) developed methods to learn joint models from labeled graphs (Lu and Getoor, 2003; Neville and Jensen, 2000), and applied the learned classifier to jointly infer the labels of unseen examples with collective inference. When the goal is to learn and predict within a partially-labeled graph, RML methods have considered semi-supervised formulations (Koller et al., 2007; Xiang and Neville, 2008; Pfeiffer III et al., 2015) to model the joint probability distribution of the unknown labels given the observed labels, attributes, and graph structure.
In this case RML methods use both collective learning and collective inference procedures for semisupervised learning.
RML methods typically consider a Markov assumption to simplify the above expressions: every node is considered conditionally independent of the rest of the network given its Markov blanket, which for undirected graphs is often simply the set of immediate neighbors of the node. Given the Markov blanket assumption, RML methods typically use a local conditional model (e.g., relational Naive Bayes (Neville et al., 2003b), relational logistic regression (Popescul et al., 2002)) to learn and infer labels within the network.
Semi-supervised RML methods utilize the unlabeled data to make better predictions within the network. Given the estimated values of the unlabeled examples, the local model parameters can be learned by maximizing the pseudo-likelihood of the labeled part:
max_θ Σ_{v ∈ L} log P(y_v | MB(v), X; θ).    (6)
The key difference between Equation 6 and the GNN objective in Equation 1 is that the RML model is always conditioned on the labels (either true labels $y_j$ or estimated labels $\hat{y}_j$) even when there are no observed labels in the test data, i.e., even when the labeled test set is empty.
The most common form of semi-supervised RML utilizes expectation maximization (EM) (Xiang and Neville, 2008; Pfeiffer III et al., 2015), which iteratively relearns the parameters given the expected values of the unlabeled examples. For instance, the PL-EM algorithm (Pfeiffer III et al., 2015) optimizes the pseudo-likelihood over the entire graph, labeled and unlabeled nodes alike.
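For concreteness, a minimal skeleton of such a pseudo-likelihood EM loop might look as follows. This is an illustrative sketch, not the paper's (or PL-EM's) actual implementation: `fit` and `predict_proba` are hypothetical hooks standing in for the local conditional model (e.g., relational logistic regression), and neighbor label distributions are used as relational features.

```python
import numpy as np

def relational_em(fit, predict_proba, feats, adj, y, labeled_mask, n_iters=10):
    """Skeleton of a pseudo-likelihood EM loop in the spirit of PL-EM.

    Hypothetical hooks (assumptions, not the paper's API):
      fit(feats, rel_feats, y, labeled_mask) -> model
      predict_proba(model, feats, rel_feats) -> (n, n_classes) array
    `adj` is a list of neighbor index lists (Markov blankets).
    """
    n_classes = int(y[labeled_mask].max()) + 1
    # Initialize unlabeled nodes with uniform label distributions,
    # labeled nodes with one-hot encodings of their true labels.
    probs = np.full((len(y), n_classes), 1.0 / n_classes)
    probs[labeled_mask] = np.eye(n_classes)[y[labeled_mask]]
    for _ in range(n_iters):
        # Relational features: expected label distribution of each node's neighbors.
        rel = np.stack([probs[nb].mean(axis=0) for nb in adj])
        model = fit(feats, rel, y, labeled_mask)   # M-step: relearn on labeled part
        new = predict_proba(model, feats, rel)     # E-step: infer unlabeled labels
        probs[~labeled_mask] = new[~labeled_mask]  # labeled nodes keep true labels
    return probs
```

On a small partially-labeled graph, the expected labels of unlabeled nodes are refined at each iteration and fed back in as relational features, which is the essential EM structure described above.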
In comparison to semi-supervised RML, our proposed framework CL-GNN performs collective learning to strengthen a GNN, which is a more powerful and flexible “local” classifier than the typically weak local classifiers used in RML methods (e.g., relational naive Bayes). For parameter learning (the M-step), CL-GNN also optimizes a pseudo-likelihood, but incorporates Monte Carlo samples of the label predictions instead of directly using the predicted probabilities of the unlabeled nodes.
Collective inference. When the model is applied to make predictions on unlabeled nodes (in the E-step), joint (i.e., collective) inference methods such as variational inference or Gibbs sampling must be applied in order to use the conditionals from Equation 6. These methods combine the local conditional probabilities with global inference to estimate the joint distribution over the unlabeled vertices, e.g., via a factorized approximation of the joint label distribution where each component is iteratively updated.
Alternatively, a Gibbs sampler iteratively draws a label for each unlabeled vertex from its corresponding conditional distribution. In Gibbs sampling, it is this sampling of labels that allows us to draw from the joint distribution, and it is what enriches the simple local models often used in collective inference.
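A minimal Gibbs sampler over the unlabeled nodes can be sketched as follows. This is a generic illustration, not the paper's code: `cond_prob` is a hypothetical stand-in for the local conditional of Equation 6, and the burn-in and sample counts are arbitrary choices.

```python
import numpy as np

def gibbs_collective_inference(adj, cond_prob, y, labeled_mask, n_classes,
                               burn_in=100, n_samples=500, rng=None):
    """Collective inference via Gibbs sampling (illustrative sketch).

    adj          : list of neighbor index lists (Markov blankets)
    cond_prob    : hypothetical hook (i, y) -> length-n_classes probability
                   vector, the local conditional given i's neighbor labels
    y            : initial label array (labeled entries stay fixed)
    labeled_mask : boolean array, True where the label is observed
    Returns estimated label marginals for the unlabeled nodes.
    """
    rng = np.random.default_rng(rng)
    y = y.copy()
    counts = np.zeros((len(y), n_classes))
    unlabeled = np.flatnonzero(~labeled_mask)
    for t in range(burn_in + n_samples):
        for i in unlabeled:
            p = cond_prob(i, y)                 # conditional given current labels
            y[i] = rng.choice(n_classes, p=p)   # resample node i's label
        if t >= burn_in:                        # accumulate marginals after burn-in
            counts[unlabeled, y[unlabeled]] += 1
    return counts / n_samples
```

The resampling step is exactly the label sampling discussed above: each unlabeled node's label is redrawn from its local conditional, so the chain's stationary distribution is the joint distribution over unlabeled labels.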
7 Conclusion
In this work, we answer the question “can collective learning and sampling techniques improve the expressiveness of state-of-the-art GNNs in inductive node classification tasks?” We first show that with a most-expressive GNN there would be no need for collective learning; however, since most-expressive models are not available in practice, we present a collective learning framework (exemplified by CL-GNN) that can be combined with any existing GNN to provably improve WL-GNN expressiveness. We considered inductive node classification tasks across graphs, and showed through an extensive empirical study that collective learning significantly boosts GNN performance on all the tasks.
8 Potential biases
This work presents an algorithm for node classification that works on arbitrary graphs, including real-world social networks, citation networks, etc. Any classification algorithm that learns from data runs the risk of producing biased predictions reflective of the training data; our work, which learns an inductive node classifier from training graph examples, is no exception. However, our main focus in this paper is on introducing a general algorithmic framework that can increase the expressivity of existing graph representation learning methods, rather than on particular real-world applications.
Supplementary Material: Collective Learning GNN
Appendix A Proof of Section 4
We restate the theorem for completeness. \thmCCun*
Proof.
Let $f$ be a classifier function that takes the most-expressive representation of a node as input and outputs a predicted class label for that node. Let $\hat{Y}^{(t)}$ be the set of predicted labels at iteration $t$ of collective classification, and let $y_v$ denote the true label of node $v$. Then either (1) $y_v$ is independent of $\hat{Y}^{(t)}$ given the node's most-expressive representation, or (2) it is not.
Case (1): Given the classifier and the most-expressive representation, the true label of the node is independent of the labels predicted with collective classification. In this case, the predicted labels of the node's neighbors offer no additional information and, thus, collective classification is unnecessary.
Case (2): In this case, the true label of the node is not independent of the predicted labels. By Theorem 1 of Srinivasan and Ribeiro (2019), we know that any random variable attached to a node can be reconstructed, almost surely (a.s.), by a measurable function of the node's most-expressive representation and a noise source exogenous to the graph (pure noise). Hence, the predicted labels must either depend on this exogenous noise or contain domain knowledge about the reconstruction function that is not captured by the most-expressive representation. Since the predictions are a vector of random variables fully determined by the graph and the classifier, they cannot depend on an exogenous variable; thus, the predictions must contain domain knowledge of the reconstruction function. Hence, we can directly incorporate this domain knowledge into another classifier that composes it with the most-expressive representation. In this case, that classifier will predict the label of the node with equal or higher accuracy than collective classification based on the predicted labels, which finishes our proof.
∎
Appendix B Proof of Section 4.1
\thmclpower*
Proof.
As defined, the optimal WL-GNN is a most-expressive WL-GNN. We need to prove that CL-GNN is strictly more expressive than the optimal WL-GNN; we will do that by first showing that CL-GNN is at least as expressive, and then showing that the converse does not hold.
(CL-GNN is at least as expressive): First, we need to show that, for any mask, there exists a CL-GNN whose representation matches that of the optimal WL-GNN. This is clearly true since, for labeled test data, in Equation 2 we can always construct for CL-GNN the representation

$$\bar{h}(v) \;=\; \frac{1}{K}\sum_{k=1}^{K} \text{WL-GNN}\big(G, \mathbf{X}, \hat{\mathbf{Y}}^{(k)}\big)_v \qquad (7)$$

with a WL-GNN that ignores the sampled-label inputs $\hat{\mathbf{Y}}^{(k)}$. Similarly, for unlabeled test data, in Equation 2 we can always construct for CL-GNN a representation that ignores the label inputs.
(CL-GNN is strictly more expressive): Let $G$ be the graph in Figure 1. We will first consider the case where the test data has partial labels; the case without labels follows directly from it. Using the graph in Figure 1(a) (training) and Figure 1(c) (partially-labeled testing), we show that a WL-GNN is unable, at test time, to give the leftmost nodes representations distinct from those of the rightmost nodes (the same happens for the unlabeled test graph in Figure 1(b)). We then show that the representation of Equation 7 is able to distinguish these two sets of nodes.
A WL-GNN is not powerful enough to give distinct representations to the two sets of nodes in Figure 1(c): Consider giving an arbitrary feature value (say, the “color white”) to all uncolored nodes in Figure 1(c). We start by showing that the 1-WL test is unable to give different colors to these nodes. Since WL-GNNs are no more expressive than the 1-WL test (Xu et al., 2018; Morris et al., 2019), showing that the above is a stable coloring for the 1-WL test proves the first part of our result. A stable 1-WL coloring is a coloring of the graph that has a one-to-one correspondence with the colors in the previous step of the 1-WL algorithm. The input to the hash function of the 1-WL test is the same for all of the white nodes: the node itself has color white, and the multiset of neighbor colors is the same. In the next 1-WL round, all the white nodes will therefore be mapped to the same color by the hash function, and their colors will remain distinct from those of the yellow nodes. Hence, the initial coloring, with all remaining nodes white and the colored nodes yellow, is a stable coloring for 1-WL. Consequently, a WL-GNN will give the same representation to all the white nodes.
CL-GNN gives the same representations within each of the two sets: At iteration $t=0$ of CL-GNN, we start with the base of the recursion. Now consider a given mask. Note that to sample the labels we apply the current representation to Equation 2, and then apply the result to Equation 4, which gives us the predicted label distribution; any class has a nonzero probability of being sampled since our output is a softmax.
Since the nodes within each set all get the same representation in the above WL-GNN, their respective sampled labels will have the same distribution, but possibly not the same values (due to sampling). Note that the nodes within a set will get the same average in Equation 7, since their sampled labels have the same distribution and the nodes are isomorphic (even given the colors on nodes 9 and 10). Similarly, the nodes of the other set will also get the same average in Equation 7.
CL-GNN gives distinct representations across the two sets: Finally, we now prove that there exists a WL-GNN such that the CL-GNN averages differ between the two sets. We will show that there is a joint sample of the labels under which there is no symmetry between the representations of nodes in the two sets. Since each layer of a WL-GNN can have different parameters, we can easily encode differences in the number of hops it takes to reach a certain color. Moreover, at any WL-GNN layer, the representation of a node can perfectly encode its own last-layer representation and the last-layer representations of its neighbors through a most-expressive multiset representation function (Xu et al., 2018).
It is enough for us to show that, for some sampled labels, the two sets of nodes can get distinct unique representations under the WL-GNN. By unique, we mean representations that cannot be obtained by the nodes in the other set. This representation uniqueness ensures that the averages in Equation 2 are different. Without loss of generality, we consider giving a special sampled label to only one node in one of the sets. The event that this node samples green while all other nodes sample red happens with nonzero probability; hence, it must be part of the expectation in Equation 2. Note that node 6 (in one set) and node 2 (in the other) will feel the effects of the green color in their neighborhoods differently. That is, there is a parameter choice for the layers of the WL-GNN where the representation of node 6 uniquely encodes that the color green is within hops 1 (node 5) and 3 (from node 5 through nodes 9 and 10) of node 6 (if node 6 treats its own representation differently from its neighbors'). Node 2's representation, in turn, will encode that green is observed at hops 1 (node 1) and 2 (from node 1 through node 9) (similarly, node 2 treats its own representation differently from its neighbors'). Hence, these representations can be made unique: no other label assignment will create the same patterns for nodes 2 and 6, and thus, since the WL-GNN has most-expressive multiset representations, it can give a unique representation to nodes 2 and 6 for these two unique configurations. These unique representations are enough to ensure that the averages differ for any number of samples, which concludes our proof.
∎
Appendix C Proof of Section 4.2
*
Proof.
Let $G$ be the graph in Figure 2(a) with no node features, and let the WL-GNN be of order 2, meaning it generates node embeddings based on 2nd-order neighborhoods (shown in (b)). Since two of the nodes have the same 2nd-order neighborhood structure, the WL-GNN will generate identical representations for them, which gives random label predictions. Meanwhile, as two other nodes have distinct 2nd-order neighborhood structures, the WL-GNN generates different representations for them, which enables the model to learn from their labels. We can assume their predicted label probabilities differ. For CL-GNN, at iteration $t \geq 1$, we sample labels from the WL-GNN output and use the samples as input. In the worst case, the first pair of nodes gets the same distribution and the same sampled labels. Since the distributions of the other two nodes are different, their samples differ, which breaks the tie between the 2nd-order neighborhoods of the first pair. Therefore, CL-GNN will produce different representations for these nodes starting from iteration $t=1$, which enables the model to learn from the training labels and thus gives more accurate predictions. ∎
The advantage of collective inference is clearer when it is used to strengthen less-expressive local classifiers, e.g., logistic regression. Although GNNs are much more powerful than these local classifiers, aggregating high(er)-order graph information, collective learning can still help if a GNN fails to make use of “global” information in the graph (or, equivalently, if the order of the GNN is smaller than the graph diameter). Previous work investigating the power of collective inference (Jensen et al., 2004) also showed that collective inference methods benefit from a clever factoring of the space of dependencies, arguing that these methods benefit from information propagated from outside their local neighborhood. Predictions about the class labels of other objects essentially “bundle information” about the graph beyond the immediate neighborhood.
Appendix D Proof of Section 4.3
\propopt*
Proof.
In our optimization, we only need to sample two random quantities: the mask and the predicted labels. We obtain unbiased, bounded-variance estimates of the derivative of the loss function when we sample the mask (and exact values when no mask is sampled). We can compound that with unbiased, bounded-variance estimates of the derivative if we estimate the expectation in Equation 2 by i.i.d. sampling of the labels. The loss in Equation 3 is convex in the representation, since the negative log-likelihood of multiclass logistic regression is convex in its logits and the loss is defined on an affine transformation of the representation. The expectation of the loss always exists, since we assume the representation is bounded. Hence, as the loss is convex, its expectation exists, and we obtain an unbiased estimate of the gradient, we can apply Jensen's inequality to show that the resulting Robbins-Monro stochastic optimization optimizes an upper bound of the loss in Equation 3. ∎
Appendix E Additional information on datasets and experiment setup
Datasets. We use five datasets for evaluation, with the dataset statistics shown in Table 2.
Dataset  # Nodes  # Attributes  # Classes  # Test
Cora  2708  1433  7  1000
Pubmed  19717  500  3  1000
Friendster  43880  644  5  6251
Facebook  4556  3  2  1000
Protein  12679  29  2  2376
Cora and Pubmed are benchmark datasets for node classification from Sen et al. (2008). They are citation networks with nodes representing publications and edges representing citation relations. Node attributes are bag-of-words features of each document, and the predicted label is the corresponding research field.

Facebook (Yang et al., 2017) is a social network of Facebook users from Purdue University, where nodes represent users and edges represent friendships. The node features are: religious views, gender, and whether the user's hometown is in Indiana. The predicted label is political view.

Friendster (Teixeira et al., 2019) is a social network. Nodes represent users and edges represent friendships. The node attributes include numerical features (e.g., number of photos posted) and categorical features (e.g., gender, college, music interests), encoded as binary one-hot features. The node labels represent one of five age groups: 0-24, 25-30, 36-40, 46-50, and over 50. This version of the graph contains 40K nodes, 25K of which are labeled.

Protein is a collection of protein graphs from Borgwardt et al. (2005). Each node is labeled with a functional role of the protein and has a 29-dimensional feature vector. We use 85 graphs with an average size of 150 nodes.
Data splits. To conduct inductive learning tasks, we must properly split the graphs into labeled and unlabeled parts. For datasets containing only one graph (Cora, Pubmed, Facebook, and Friendster), we randomly sample a connected component as the training set, and then sample a test set from the remaining nodes. To make partially-labeled test data available, we sample another connected component of the same size as the test set. The nodes are sampled to guarantee that there is no overlap between any two of the training, test, and partially-labeled sets; the two test settings share the same graph structure but differ in which nodes are labeled.
For the Protein dataset, as we have 85 disjoint graphs, we randomly choose 51 (60%) graphs for training, 17 (20%) for validation, and the remaining 17 (20%) for testing. To simulate semi-supervised learning settings, we mask out 50% of the true labels on the training graphs. For the tasks with partially-labeled test data, we randomly select 50% of the nodes in the test graphs as labeled, and test on the remaining 50%. We run five trials for all experiments, and in each trial we randomly split the nodes/graphs as described.
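A connected training set as described above can be grown by breadth-first search from a random seed node. The following is a simplified sketch under that assumption; the function name and the BFS strategy are our illustrative choices, not the exact sampling code used in the experiments.

```python
import numpy as np
from collections import deque

def sample_connected_train_set(adj, k, rng=None):
    """Sample a connected set of up to k nodes by BFS from a random seed.

    adj : list of neighbor index lists for each node
    k   : desired size of the connected training set
    """
    rng = np.random.default_rng(rng)
    seed = int(rng.integers(len(adj)))
    seen, order, queue = {seed}, [seed], deque([seed])
    while queue and len(order) < k:
        u = queue.popleft()
        for v in adj[u]:                 # expand the BFS frontier
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
                if len(order) == k:      # stop once k nodes are collected
                    break
    return order[:k]                     # nodes form a connected BFS tree
```

The test set would then be drawn from the nodes outside this set, mirroring the no-overlap constraint described above.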
As described in Section 5.1, to approximate an inductive learning setting, we use a different train/test split procedure (i.e., a connected training set) on the Cora and Pubmed networks than the public version (i.e., a random training set) used by most existing GNN models (Kipf and Welling, 2016; Luan et al., 2019). This is illustrated in Figure 3: the random training set of traditional GNN evaluation (e.g., Kipf and Welling, 2016; Luan et al., 2019) is shown on the left, contrasted with our harder task with a connected training set shown on the right. This difference in task is why the model performance reported in our paper is not directly comparable with the results reported in previous GNN papers, even though we used the same implementations and hyperparameter search procedures.
Hyperparameter setting. For hyperparameter tuning, we searched the initial learning rate within {0.005, 0.01, 0.05} with weight decay. Dropout is applied to all the layers. Hidden units are searched within {16, 32} if the dataset was not used by the original GNN paper, or otherwise set to the number originally chosen in that paper. The number of layers is set as in the original papers for GCN (Kipf and Welling, 2016), GraphSage (Hamilton et al., 2017), and TK (Xu et al., 2018). For GraphSage (Hamilton et al., 2017), the neighborhood sample size follows the original paper. We use the same GNN structure (i.e., layers, hidden units, neighborhood sample size) for the non-collective version and for CL-GNN, for a fair comparison.
For CL-GNN, the additional hyperparameters are (1) the sample size of predicted labels and (2) the number of model iterations. We set the sample size separately for the Friendster dataset and for all other datasets. For partially-labeled test data, the model is trained for a fixed number of iterations, each containing a fixed number of epochs; note that we sample a new binary mask for each epoch, as described in Section 3. For unlabeled test data, the model is trained for a fixed number of iterations, each containing up to a maximum number of epochs, with early stopping if the validation accuracy decreases for a specified number of consecutive epochs. The numbers of iterations were determined empirically, as only marginal improvements are observed beyond them for either test setting. The validation accuracy is used to choose the best epoch.
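The early-stopping rule described above can be sketched as a small helper; the patience value shown is an illustrative choice, not the paper's exact setting.

```python
class EarlyStopper:
    """Stop when validation accuracy fails to improve for `patience`
    consecutive epochs (illustrative sketch of the rule above)."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = -float("inf")   # best validation accuracy so far
        self.bad = 0                # consecutive non-improving epochs

    def step(self, val_acc):
        """Record one epoch's validation accuracy; return True to stop."""
        if val_acc > self.best:
            self.best, self.bad = val_acc, 0
        else:
            self.bad += 1
        return self.bad >= self.patience
```

Inside the training loop, `step` would be called once per epoch, and the best-epoch checkpoint would be selected by validation accuracy as stated above.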
Note that hyperparameter tuning could be done more aggressively to further boost the performance of CL-GNN, e.g., by using more layers for TK (Xu et al., 2018), but our main goal is to evaluate the relative improvement of CL-GNN over the corresponding non-collective GNNs.
Appendix F Running times for GNN models on multiple datasets
Dataset  GNN structure  GNN (minutes)  CL-GNN unlabeled (minutes)  CL-GNN partially-labeled (minutes)
Cora  GCN  0.09  0.83  3.65
Cora  TK  0.12  1.91  5.74
Pubmed  GCN  0.49  5.38  21.87
Pubmed  TK  0.52  7.82  51.62
Friendster  GCN  1.04  17.93  66.31
Friendster  TK  1.93  30.17  132.33
Facebook  GCN  0.02  1.44  5.37
Facebook  TK  0.05  2.41  7.22
Table 3 shows the running times for CL-GNN and the corresponding non-collective GNNs on various datasets. As mentioned in Appendix E, for partially-labeled test data, CL-GNN applies random masks at each epoch and runs for more iterations, whereas for unlabeled test data it runs for fewer iterations.
Appendix G CL-GNN performance with varying training label rates
To investigate the impact of training label rates on node classification accuracy, we repeated the experiments on the Cora and Pubmed datasets with various numbers of training labels, on unlabeled and partially-labeled test data. Table 4 and Table 5 show the results for test label rates of 0% and 50%, respectively. We can see that, in general, the CL-GNN framework achieves a larger improvement when fewer labels are available in the training graph. For example, with training label rates of 1.52%, 1.90%, and 3.04% on Pubmed, the improvements of our framework combined with GCN are +4.48%, +3.30%, and +0.98%, respectively (Table 4). This shows that the CL-GNN framework is especially useful when only a small number of labels are available in training, which is the common use case of GNNs.
Cora  Pubmed  

# train labels  85 (3.21%)  105 (3.88%)  140 (5.17%)  300 (1.52%)  375 (1.90%)  600 (3.04%)  
% test labels  0%  0%  0%  0%  0%  0%  
Random  14.28 (0.00)  14.28 (0.00)  14.28 (0.00)  33.33 (0.00)  33.33 (0.00)  33.33 (0.00)  
GCN    45.90 (3.26)  47.54 (3.50)  61.92 (1.50)  52.68 (2.36)  55.76 (3.32)  70.38 (2.31)
+ CL  +6.29 (1.49)  +5.20 (1.12)  +5.18 (0.66)  +4.48 (2.33)  +3.30 (1.52)  +0.98 (0.23)  
GS    50.69 (1.50)  56.24 (2.08)  66.08 (0.96)  59.34 (3.47)  64.37 (3.70)  72.08 (1.87) 
+ CL  +2.35 (0.56)  +2.78 (0.59)  +1.95 (0.45)  +1.48 (0.41)  +0.62 (0.21)  +0.65 (0.25)  
TK    63.74 (2.61)  70.01 (1.93)  74.45 (0.34)  61.13 (5.03)  63.09 (5.57)  75.46 (1.46) 
+ CL  +0.96 (0.30)  +1.08 (0.37)  +0.30 (0.11)  +1.00 (0.21)  +1.34 (0.20)  +1.03 (0.22)  
PLEM    20.70 (0.05)  24.65 (0.38)  30.46 (1.48)  38.05 (4.85)  44.85 (5.75)  51.25 (3.06) 
ICA    26.20 (0.51)  41.05 (0.50)  49.51 (1.90)  44.40 (1.92)  45.62 (0.86)  54.26 (2.09) 
GMNN    49.05 (1.86)  54.55 (1.15)  67.16 (1.86)  58.03 (3.62)  62.50 (3.77)  71.03 (4.54) 
Cora  Pubmed  

# train labels  85 (3.21%)  105 (3.88%)  140 (5.17%)  300 (1.52%)  375 (1.90%)  600 (3.04%)  
% test labels  50%  50%  50%  50%  50%  50%  
Random  14.28 (0.00)  14.28 (0.00)  14.28 (0.00)  33.33 (0.00)  33.33 (0.00)  33.33 (0.00)  
GCN    36.38 (1.35)  48.31 (2.58)  64.02 (1.54)  54.11 (4.86)  56.31 (3.10)  68.13 (1.84) 
+ CL  +15.69 (3.20)  +14.02 (3.38)  +6.31 (0.89)  +5.62 (1.17)  +5.06 (3.24)  +4.60 (2.50)  
GS    48.42 (2.82)  57.52 (2.15)  65.04 (0.79)  58.52 (5.42)  59.77 (4.68)  75.01 (4.86) 
+ CL  +4.52 (0.84)  +3.06 (0.20)  +2.18 (0.21)  +2.42 (0.27)  +1.49 (0.10)  +2.67 (0.56)  
TK    55.68 (2.08)  61.51 (2.45)  67.95 (0.45)  63.05 (5.15)  67.95 (0.45)  74.01 (3.58) 
+ CL  +7.18 (1.88)  +3.04 (1.07)  +2.75 (0.47)  +1.91 (0.75)  +0.54 (0.44)  +3.23 (0.78)  
PLEM    20.35 (0.05)  25.25 (0.35)  31.45 (1.95)  31.70 (4.78)  34.92 (5.87)  48.70 (5.72) 
ICA    31.17 (3.66)  42.07 (1.29)  57.14 (1.81)  33.38 (4.69)  45.93 (5.48)  46.97 (5.19) 
GMNN    49.36 (2.22)  56.58 (2.96)  67.83 (1.91)  62.16 (4.40)  63.42 (4.82)  74.78 (3.63) 
Appendix H Ablation study
H.1 With or without predicted labels as input
To investigate whether adding predicted labels to the model input provides extra information with partially-labeled test data, we tested the performance of a model variant that only uses true labels as input, with the same node masking procedure. Figure 4 shows two examples, on Cora with GCN (Figure 4(a)) and Pubmed with TK (Figure 4(b)), where including predicted labels achieves better performance. We run the model 10 times and calculate the average and standard deviation (shown as the shaded area) of classification accuracy at each iteration, as described in Section 3. We can see that adding predicted labels starts to improve performance after the first iteration and achieves consistent gains.
Cora  Pubmed  Friendster  Protein  

# labels  85 (3.21%)  105 (3.88%)  140 (5.17%)  300 (1.52%)  375 (1.90%)  600 (3.04%)  641 (1.47%)  80 (1.76%)  7607 (30%)  
Random  14.28 (0.00)  14.28 (0.00)  14.28 (0.00)  33.33 (0.00)  33.33 (0.00)  33.33 (0.00)  20.00 (0.00)  50.00 (0.00)  50.00 (0.00)  
GCN    45.15 (3.73)  52.35 (2.01)  65.11 (1.95)  53.21 (4.04)  57.15 (3.61)  70.81 (3.47)  29.80 (0.48)  65.89 (0.68)  73.03 (2.14) 
+ CL-random  +0.02 (0.65)  -1.83 (1.05)  +0.27 (0.18)  +2.29 (0.34)  +1.35 (0.58)  +1.05 (0.96)  -0.14 (0.40)  +1.16 (0.25)  +0.16 (0.80)  
GS    46.38 (1.62)  52.87 (1.03)  63.46 (1.38)  55.38 (3.48)  57.61 (4.21)  68.81 (4.15)  28.05 (0.56)  65.20 (0.40)  71.05 (0.40) 
+ CL-random  -2.45 (0.22)  +0.46 (0.45)  -0.23 (0.67)  -0.02 (0.32)  +0.42 (0.31)  +0.34 (0.53)  +0.21 (0.39)  +1.65 (0.15)  +0.01 (0.22)  
TK    61.99 (3.07)  67.88 (1.80)  73.04 (0.42)  61.00 (4.93)  61.91 (5.16)  73.87 (3.99)  29.44 (0.39)  67.75 (0.40)  73.38 (0.57) 
+ CL-random  -3.95 (1.08)  -2.54 (0.63)  -2.28 (0.84)  -0.65 (0.56)  -0.78 (0.38)  -1.17 (0.78)  +0.26 (0.38)  -0.05 (0.19)  -0.13 (0.53)
H.2 Sampling from predicted labels or random IDs
Creating more expressive GNN representations by averaging out random features was first proposed by Murphy et al. (2019), who show a whole-graph classification application, Circulant Skip Links (CSL) graphs, where such randomized feature averaging is provably (and empirically) more expressive than GNNs. Our Monte Carlo collective learning method can be seen as a type of feature-averaging GNN representation, though, unlike Murphy et al. (2019), the feature sampling is not at random, but rather driven recursively by our own model. Hence, it is fair to ask whether our performance gains come simply from random feature averaging being beneficial to GNN representations, or whether collective learning sampling actually improves performance. We need an ablation study.
Therefore, in this section we investigate whether the gains of our method on unlabeled test data come from incorporating feature randomness, or from sampling w.r.t. predicted labels (collective learning). To do so, we replace the samples drawn from the previous prediction with samples drawn uniformly from the set of class labels at each gradient step in CL-GNN. The results are shown in Table 6. Clearly, the random features are not able to consistently improve model performance the way our method does (contrast Table 6 with Table 1 and Table 4). In summary, collective learning goes beyond the purely randomized approach of Murphy et al. (2019), providing much larger, statistically significant gains.
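The two sampling schemes compared in this ablation can be sketched side by side. This is a hedged illustration: the vectorized inverse-CDF trick and the function name are our choices, not necessarily the paper's implementation.

```python
import numpy as np

def sample_labels(probs, rng, uniform=False):
    """Draw one-hot label samples, either from the model's predicted
    distributions (collective learning) or uniformly at random
    (the CL-random ablation baseline).

    probs : (n, c) array of predicted class probabilities per node
    """
    n, c = probs.shape
    if uniform:
        idx = rng.integers(c, size=n)          # ignore the predictions entirely
    else:
        # Vectorized categorical sampling via the inverse CDF:
        # count how many cumulative-probability bins fall below u.
        u = rng.random((n, 1))
        idx = (probs.cumsum(axis=1) < u).sum(axis=1)
    return np.eye(c)[idx]                      # one-hot encode the draws
```

Flipping the `uniform` flag reproduces the contrast studied in Table 6: identical pipelines that differ only in whether the sampled labels carry information from the model's own predictions.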
Footnotes
 We use the term optimal WL-GNN to refer to the most expressive version of a GNN, one that has the same distinguishing power as a Weisfeiler-Lehman test. Note this is not a universal graph representation.
References
 Protein function prediction via graph kernels. Bioinformatics 21 (suppl_1), pp. i47–i56.
 On the equivalence between graph isomorphism testing and function approximation with GNNs. In Advances in Neural Information Processing Systems, pp. 15868–15876.
 BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
 Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430.
 Recurrent collective classification. Knowledge and Information Systems 60 (2), pp. 741–755.
 Boosting few-shot visual learning with self-supervision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8059–8068.
 Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
 Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272.
 Why collective inference improves relational classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 593–598.
 Outcome correlation in graph neural network regression. arXiv preprint arXiv:2002.08274.
 Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
 Introduction to Statistical Relational Learning. MIT Press.
 Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676.
 Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 496–503.
 Break the ceiling: stronger multi-scale deep graph convolutional networks. arXiv preprint arXiv:1906.02174.
 On the universality of invariant networks. arXiv preprint arXiv:1901.09342.
 Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544.
 Deep collective inference. In Thirty-First AAAI Conference on Artificial Intelligence.
 Weisfeiler and Leman go neural: higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4602–4609.
 Relational pooling for graph representations. arXiv preprint arXiv:1903.02541.
 Learning relational probability trees. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 625–630.
 Simple estimators for relational Bayesian classifiers. In Third IEEE International Conference on Data Mining, pp. 609–612.
 Iterative classification in relational data. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data, pp. 13–20.
 Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
 Overcoming relational learning biases to accurately predict preferences in large scale networks. In Proceedings of the 24th International Conference on World Wide Web, pp. 853–863.
 Column networks for collective classification. In Thirty-First AAAI Conference on Artificial Intelligence.
 Towards structural logistic regression: combining relational and statistical learning. Departmental Papers (CIS), pp. 134.
 GMNN: graph Markov neural networks. arXiv preprint arXiv:1905.06214.
 A stochastic approximation method. Annals of Mathematical Statistics 22 (3), pp. 400–407.
 Collective classification in network data. AI Magazine 29 (3), pp. 93–93.