GraphMix: Regularized Training of Graph Neural Networks for Semi-Supervised Learning
Abstract
We present GraphMix, a regularization technique for Graph Neural Network based semi-supervised object classification, leveraging the recent advances in the regularization of classical deep neural networks. Specifically, we propose a unified approach in which we train a fully-connected network jointly with the graph neural network via parameter sharing, interpolation-based regularization and self-predicted targets. Our proposed method is architecture-agnostic in the sense that it can be applied to any variant of graph neural networks which applies a parametric transformation to the features of the graph nodes. Despite its simplicity, with GraphMix we can consistently improve results and achieve or closely match state-of-the-art performance using even simpler architectures such as Graph Convolutional Networks, across three established graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as three newly proposed datasets: Cora-Full, Coauthor-CS and Coauthor-Physics.
1 Introduction
Due to the presence of graph-structured data across a wide variety of domains, such as biological networks, social networks and telecommunication networks, there have been several attempts to design neural networks that can process arbitrarily structured graphs. Early work includes (gori; scarselli), which propose neural networks that can directly process most types of graphs, e.g., acyclic, cyclic, directed, and undirected graphs. More recent approaches include (bruna; henaff; defferrard; kipf2016variational; gilmer2017neural; hamilton2017inductive; velivckovic2018graph; velickovic2018deep; gmnn; ugnet; ma2019disentangled), among others. Many of these approaches are designed to address the important problem of semi-supervised learning over graph-structured data (gnn_review). However, much of this research effort has been dedicated to developing novel architectures.
Unlike many existing works which try to devise new architectures, we focus on architecture-agnostic regularization techniques for graph neural network based semi-supervised object classification. Data augmentation based regularization has been shown to be very effective for other types of neural networks, but how to apply these techniques to graph neural networks is still under-explored. Our proposed method GraphMix (code available at https://github.com/vikasverma1077/GraphMix) is inspired by interpolation based data augmentation techniques (mixup; manifold_mixup), but is changed appropriately to make it suitable for graph-structured data. Furthermore, GraphMix also utilizes self-target-prediction (laine2016temporal; meanteacher; ict; mixmatch) based data augmentation. We show that with our proposed regularization techniques, we can achieve state-of-the-art performance even when using simpler graph neural network architectures such as Graph Convolutional Networks (kipf2016semi), and without incurring any significant additional computation cost.
2 Problem Definition and Preliminaries
2.1 Problem Setup
We are interested in the problem of semi-supervised object classification using graph-structured data. We can formally define such data as a graph $G = (V, E)$, where $V$ represents the set of nodes and $E$ is the set of edges between the nodes of $V$.
Each node $v \in V$ has a corresponding $d$-dimensional feature vector $\mathbf{x}_v$. The feature vectors of all the nodes are stacked together to form the feature matrix $\mathbf{X} \in \mathbb{R}^{|V| \times d}$. Each node belongs to one out of $C$ classes and can be labeled with a $C$-dimensional one-hot vector $\mathbf{y}_v$. Given the labels of a small subset of labeled nodes $V_l \subset V$, the task is to predict the labels of the remaining nodes $V_u = V \setminus V_l$.
2.2 Graph Neural Networks
Graph Neural Networks (GNNs) learn the layer-wise representation of a node $v$ by leveraging the representations of the nodes in the neighbourhood of $v$. This is done using an aggregation function that takes as input the representations of all the nodes and the graph structure, and outputs the aggregated representations. The aggregation function can be defined using the Graph Convolution layer (kipf2016semi), the Graph Attention layer (velivckovic2018graph), or any general message passing layer (gilmer2017neural). Formally, let $\mathbf{H}^{(k)} \in \mathbb{R}^{|V| \times d_k}$ be a matrix containing the $d_k$-dimensional representations of the nodes in the $k$-th layer; then:

$$\mathbf{H}^{(k+1)} = \mathrm{AGG}\big(\mathbf{H}^{(k)} \mathbf{W}^{(k)},\; G\big) \qquad (1)$$

where $\mathbf{W}^{(k)} \in \mathbb{R}^{d_k \times d_{k+1}}$ is a linear transformation matrix, $d_{k+1}$ is the dimension of layer $k+1$, and $\mathrm{AGG}$ is the aggregation function that utilizes the graph structure.
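As a concrete instance of Eq. (1), the following is a minimal numpy sketch of a graph convolution layer in the style of (kipf2016semi), using a symmetrically normalized adjacency matrix with self-loops; the sizes, values, and the function name `gcn_layer` are illustrative, not the paper's implementation.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph convolution: aggregate neighbour features with the
    symmetrically normalized adjacency (self-loops added), then apply
    the linear transformation W and a ReLU non-linearity."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)         # ReLU(A_norm H W)

# Toy example: 3 nodes on a path graph, 2 input features, 4 hidden units.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.ones((3, 2))
W = np.ones((2, 4))
H_next = gcn_layer(H, A, W)   # shape (3, 4)
```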
2.3 Interpolation Based Regularization Techniques
Recently, interpolation-based techniques have been proposed for regularizing neural networks. We briefly describe some of these techniques here. Mixup (mixup) trains a neural network on the convex combination of inputs and targets, whereas Manifold Mixup (manifold_mixup) trains a neural network on the convex combination of the hidden states of a randomly chosen hidden layer and the targets. While Mixup regularizes a neural network by enforcing the constraint that the model output should change linearly in between examples in the input space, Manifold Mixup regularizes the neural network by learning better (more discriminative) hidden states.
Suppose $g_\theta$ is a function that maps input samples to hidden states, $f_\theta$ is a function that maps hidden states to predicted outputs, $\lambda$ is a random variable drawn from a $\mathrm{Beta}(\alpha, \alpha)$ distribution, $\mathrm{Mix}_\lambda(a, b) = \lambda \cdot a + (1 - \lambda) \cdot b$ is an interpolation function, $\mathcal{D}$ is the data distribution and $\ell$ is a loss function such as cross-entropy; then the Manifold Mixup loss is defined as:

$$\mathcal{L}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\; \mathbb{E}_{(x', y') \sim \mathcal{D}}\; \mathbb{E}_{\lambda \sim \mathrm{Beta}(\alpha, \alpha)}\; \ell\Big(f_\theta\big(\mathrm{Mix}_\lambda(g_\theta(x), g_\theta(x'))\big),\; \mathrm{Mix}_\lambda(y, y')\Big) \qquad (2)$$
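The interpolation itself is straightforward; a minimal sketch (the function name `mixup` and the toy vectors are ours, for illustration):

```python
import numpy as np

def mixup(a, b, lam):
    """Convex combination used by Mixup / Manifold Mixup."""
    return lam * a + (1.0 - lam) * b

rng = np.random.default_rng(0)
alpha = 1.0
lam = rng.beta(alpha, alpha)          # lambda ~ Beta(alpha, alpha)

# Hidden states g(x), g(x') of two samples and their one-hot targets.
h1, h2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

h_mix = mixup(h1, h2, lam)            # fed to the remaining layers f
y_mix = mixup(y1, y2, lam)            # soft target used in the loss
```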
3 GraphMix
3.1 Motivation
Data augmentation is arguably the simplest and most efficient technique for regularizing a neural network. In some domains, such as computer vision, speech and text, there exist efficient data augmentation techniques, for example, random cropping, translation or Cutout (cutout) for computer vision, (audioaf; specaugment) for speech, and (dataNoising) for text. However, data augmentation for graph-structured data remains under-explored. There exists some recent work along these lines, but the prohibitive additional computation cost (see Section 5.3) introduced by these methods makes them impractical for real-world large graph datasets. Based on these limitations, our main objective is to propose an efficient data augmentation technique for graph datasets.
Recent work on interpolation-based data augmentation (mixup; manifold_mixup) has seen sizable improvements in regularization performance across a number of tasks. However, these techniques are not directly applicable to graphs for an important reason: although we can create additional nodes by interpolating the features and corresponding labels, it remains unclear how these new nodes should be connected to the original nodes via synthetic edges such that the structure of the whole graph is preserved. In this work, we explore how this limitation can be addressed. Furthermore, drawing inspiration from the success of self-supervised semi-supervised learning algorithms (algorithms based on self-predicted targets, which can also be interpreted as a form of data augmentation) (ict; mixmatch), we explore self-supervision in the training of GNNs. We note that self-supervision has already been explored for unsupervised representation learning from graph-structured data (velickovic2018deep), but not for semi-supervised object classification over graph-structured data. Based on these challenges and motivations, we present our proposed approach GraphMix for training Graph Neural Networks in the following section.
3.2 Method
GraphMix augments the vanilla GNN with a Fully-Connected Network (FCN) via parameter sharing. The FCN loss is computed using Manifold Mixup as discussed in Section 2.3, and the GNN loss is computed in the standard way. Both losses are optimized in an alternating fashion during training. Since Manifold Mixup has been shown to learn better features, training the FCN with it facilitates learning better features, which are passed to the GNN via parameter sharing. Conversely, the predicted targets from the GNN are used to augment the training set of the FCN. In this way, the FCN and the GNN facilitate each other's learning process. At inference time, predictions are made using only the GNN. A diagrammatic representation of GraphMix is presented in Figure 1 and the full algorithm is presented in Algorithm 1.
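A minimal sketch of the parameter sharing between the two branches (sizes, the two-layer form, and all names are illustrative; the FCN is simply the GCN with the neighbourhood-aggregation step dropped):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: 4 nodes, 3 features, 2 classes (illustrative sizes only).
X = rng.normal(size=(4, 3))
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                          # self-loops
d = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d[:, None] * d[None, :]       # symmetric normalization

# The two branches share the same parameters W1, W2.
W1 = rng.normal(size=(3, 8))
W2 = rng.normal(size=(8, 2))
relu = lambda z: np.maximum(z, 0.0)

def fcn_forward(X):
    """FCN branch: ignores the graph; trained with Manifold Mixup."""
    return relu(X @ W1) @ W2

def gcn_forward(X):
    """GCN branch: same weights, but aggregates over neighbours."""
    return A_norm @ relu(A_norm @ X @ W1) @ W2

# During training the two losses are optimized alternately;
# at inference time only gcn_forward is used.
logits_fcn = fcn_forward(X)    # shape (4, 2)
logits_gcn = gcn_forward(X)    # shape (4, 2)
```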
Some implementation considerations. For the Manifold Mixup training of the FCN, we apply mixup only in the hidden layer. Note that in (manifold_mixup), the authors recommend applying mixup in a randomly chosen layer (which can include the input layer) at each training update. However, we observed underfitting when applying mixup randomly at either the input or the hidden layer. Applying mixup only in the input layer also resulted in underfitting and did not improve test accuracy.
The GraphMix framework can be applied to any underlying GNN, as long as it applies parametric transformations to the node features. In our experiments, we show improvements over GCN (kipf2016semi) and GAT (velivckovic2018graph) using GraphMix; however, this framework can also be applied to more recent GNNs such as Graph UNet (ugnet) and DisenGCN (ma2019disentangled), which may improve the state-of-the-art even further.
The performance of self-supervision based algorithms such as GraphMix is greatly affected by the accuracy of the predicted targets. To improve this accuracy, we average the model predictions over random perturbations of an input sample, as discussed in Section 3.2.1, and apply sharpening, as described in Section 3.2.2. Further, we discuss the similarities and differences between GraphMix and the Co-training framework in Section 3.2.3.
3.2.1 Accurate Target Prediction for Unlabeled data
Recent state-of-the-art semi-supervised learning methods use a teacher model to accurately predict targets for the unlabeled data. These predicted targets on the unlabeled data are used as "true labels" for further training of the model. The teacher model can be realized as a temporal ensemble of the student model (the model being trained) (laine2016temporal) or by using an Exponential Moving Average (EMA) of the parameters of the student model (meanteacher). Another recently proposed method for accurate target prediction on unlabeled data is to use the average of the predicted targets across random augmentations of the input sample (mixmatch). Along these lines, in this work, we compute the predicted target as the average of predictions on multiple dropout versions of the input sample. We also experimented with an EMA of the student model, but it did not improve test accuracy across all the datasets (see Section 4.4 for details).
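A minimal sketch of this averaging, using input dropout as the random perturbation on a single linear-softmax predictor (the model and all names here are illustrative; GraphMix applies the same idea to the GNN's predictions):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_with_dropout(x, W, p=0.5):
    """One stochastic forward pass: input dropout, then a softmax."""
    mask = rng.random(x.shape) > p
    z = (x * mask / (1.0 - p)) @ W       # inverted dropout scaling
    e = np.exp(z - z.max())
    return e / e.sum()

def averaged_target(x, W, k=10):
    """Predicted target = average of k dropout-perturbed predictions."""
    return np.mean([predict_with_dropout(x, W) for _ in range(k)], axis=0)

x = np.array([1.0, -2.0, 0.5, 3.0])
W = rng.normal(size=(4, 3))
q = averaged_target(x, W)                # soft target for an unlabeled node
```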
3.2.2 Entropy Minimization
Many recent semi-supervised learning algorithms (laine2016temporal; miyato2017vat; meanteacher; ict) are based on the cluster assumption (chapple), which posits that the class boundary should pass through the low-density regions of the marginal data distribution. One way to enforce this assumption is to explicitly minimize the entropy of the model's predictions on unlabeled data by adding an extra term to the original loss (entmin). Entropy minimization can also be achieved implicitly by modifying the model's predictions on the unlabeled data so that they have low entropy, and using these low-entropy predictions as targets for further training of the model. Examples include "Pseudo-labels" (pseudolabel) and "Sharpening" (mixmatch). In this work, we use Sharpening for entropy minimization. The Sharpening function over the model prediction $p$ can be formally defined as follows (mixmatch), where $T$ is the temperature hyperparameter and $C$ is the number of classes:

$$\mathrm{Sharpen}(p, T)_i = p_i^{1/T} \Big/ \sum_{j=1}^{C} p_j^{1/T} \qquad (3)$$
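Eq. (3) can be implemented directly; for example:

```python
import numpy as np

def sharpen(p, T):
    """Sharpen a categorical distribution p with temperature T (Eq. 3)."""
    p_pow = p ** (1.0 / T)
    return p_pow / p_pow.sum()

p = np.array([0.4, 0.35, 0.25])
q = sharpen(p, T=0.5)
# q is still a valid distribution, with more mass on the argmax;
# as T -> 0 the output approaches a one-hot vector.
```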
3.2.3 Connection to Co-training
The GraphMix approach can be seen as a special instance of the Co-training framework (cotraining). Co-training assumes that the description of an example can be partitioned into two distinct views, either of which would be sufficient for learning if we had enough labeled data. In this framework, two learning algorithms are trained separately on each view, and then the predictions of each learning algorithm on the unlabeled data are used to enlarge the training set of the other. Our method has some important similarities to and differences from the Co-training framework. Similar to Co-training, we train two neural networks, and the predictions from the GNN are used to enlarge the training set of the FCN. The important difference is that, instead of using the predictions of the FCN to enlarge the training set of the GNN, we employ parameter sharing to pass the learned information from the FCN to the GNN. In our experiments, directly using the predictions of the FCN for GNN training resulted in reduced accuracy. This is because the number of labeled samples for training the FCN is quite low, and hence the FCN does not make sufficiently accurate predictions. Another important difference is that, unlike the Co-training framework, the FCN and the GNN do not use completely distinct views of the data: the FCN uses only the node features $\mathbf{X}$, while the GNN uses both the node features $\mathbf{X}$ and the edge connectivity $E$.
4 Experiments
We present results for the GraphMix algorithm on standard benchmark datasets with the standard architecture in Sections 4.2 and 4.3. We also conduct an ablation study on GraphMix in Section 4.4 to understand the contribution of its various components. Refer to Appendix A.3 for implementation and hyperparameter details.
4.1 Datasets
We use three standard benchmark citation network datasets for semi-supervised node classification, namely Cora, Citeseer and Pubmed. In all these datasets, nodes correspond to documents and edges correspond to citations. Node features correspond to the bag-of-words representation of the document. Each node belongs to exactly one class. During training, the algorithm has access to the feature vectors and edge connectivity of all the nodes, but to the class labels of only a few of the nodes.
For semi-supervised link classification, we use two datasets, Bitcoin Alpha and Bitcoin OTC, from (kumar2016edge; kumar2018rev2). The nodes in these datasets correspond to bitcoin users, and the edge weights between them correspond to the degree of trust between the users. Following (gmnn), we treat edges with weights greater than 3 as positive instances and edges with weights less than 3 as negative ones. Given a few labeled edges, the task is to predict the labels of the remaining edges. The statistics of these datasets, as well as the number of training/validation/test nodes, are presented in Appendix A.1.
4.2 Semi-supervised Node Classification
For baselines, we choose GCN (kipf2016semi) and the recent state-of-the-art methods GAT (velivckovic2018graph), GMNN (gmnn) and Graph UNet (ugnet). To underline the importance of the shared parameters between the FCN and GCN in GraphMix, we used two additional baselines: in the first, we trained the GCN with self-generated predicted targets, and in the second, we trained the FCN with self-generated predicted targets; these are named "GCN (with predicted targets)" and "FCN (with predicted targets)" respectively in Table 1. GraphMix(GCN) and GraphMix(GAT) refer to the methods where the underlying GNN is GCN and GAT respectively. Refer to Appendix A.3 for implementation and hyperparameter details.
We observe that for Cora, GraphMix(GCN) performs close to the current state-of-the-art method, Graph UNet. For Citeseer, GraphMix(GCN) achieves state-of-the-art performance. For Pubmed, GraphMix improves upon GCN and GAT but is worse than GMNN. Interestingly, we obtained the best results for Pubmed by just using GCN (with predicted targets). Importantly, GraphMix always improves the performance of the underlying GNN (GCN or GAT) across all the datasets. We further present results using random partitioning of the data, results with fewer labeled samples, and results on larger datasets (Cora-Full, Coauthor-CS and Coauthor-Physics) in Section 4.2.1, Section 4.2.2 and Section A.7 respectively.
4.2.1 Random Partitioning of the datasets
The authors of (pitfalls) demonstrated that the performance of current state-of-the-art Graph Neural Network approaches on the standard train/validation/test splits of popular benchmark datasets (such as Cora, Citeseer and Pubmed) is significantly different from their performance on random splits. For fair evaluation, they recommend using multiple random partitions of the datasets. Along these lines, we created random splits of Cora, Citeseer and Pubmed with the same number of train/validation/test samples as in the standard split. Our results in Table 2 show that GraphMix significantly outperforms GCN across all the datasets.
4.2.2 Results with fewer labeled samples
We further evaluate the effectiveness of GraphMix in learning regimes with fewer labeled samples. For each class, we randomly sampled a small number of labeled samples for training and the same number for validation, and used all the remaining labeled samples as the test set. We repeated this process multiple times. The results in Table 3 show that GraphMix achieves even larger improvements when fewer labeled samples are available (refer to Table 1 for results with the standard number of training samples per class).
Table 1: Results of semi-supervised node classification (% test accuracy).

Algorithm  Cora  Citeseer  Pubmed
GCN* (kipf2016semi)  81.5  70.3  79.0
GAT* (velivckovic2018graph)  83.0  72.5  79.0
GMNN* (gmnn)  83.7  73.1  81.8
Graph UNet* (ugnet)  84.4  73.2  79.6
GCN  81.30±0.66  70.61±0.22  79.86±0.34
GAT  82.70±0.21  70.40±0.35  79.05±0.64
GCN (with predicted targets)  82.03±0.43  73.38±0.35  82.42±0.36
FCN (with predicted targets)  80.30±0.75  71.50±0.80  77.40±0.37
GraphMix (GCN)  83.94±0.57  74.52±0.59  80.98±0.55
GraphMix (GAT)  83.32±0.18  73.08±0.23  81.10±0.78
Table 2: Results (% test accuracy) on random partitions of the datasets.

Algorithm  Cora  Citeseer  Pubmed
GCN  77.84±1.45  72.56±2.46  78.74±0.99
GraphMix (GCN)  82.07±1.17  76.45±1.57  80.72±1.08
Table 3: Results (% test accuracy) with fewer labeled samples. For each dataset, the two columns correspond to the two reduced labeled-data budgets used (fewer / more labeled samples per class).

Algorithm  Cora  Cora  Citeseer  Citeseer  Pubmed  Pubmed
GCN  66.39±4.26  72.91±3.10  55.61±5.75  64.19±3.89  66.06  75.57±1.58
GraphMix (GCN)  71.99±6.46  79.30±1.36  58.55±2.26  70.78±1.41  67.66±3.90  77.13±3.60
4.3 Semi-supervised Link Classification
In the semi-supervised link classification problem, the task is to predict the labels of the remaining links, given a graph and the labels of a few links. Following (taskar2004link), we formulate the link classification problem as a node classification problem. Specifically, given an original graph $G = (V, E)$, we construct a dual graph $G' = (V', E')$. The node set $V'$ of the dual graph corresponds to the link set $E$ of the original graph. Nodes in the dual graph are connected if their corresponding links in the original graph share a node. The attributes of a node in the dual graph are defined using the indices of the nodes of the corresponding link in the original graph. Using this formulation, we present link classification results on the Bit OTC and Bit Alpha benchmark datasets in Table 4. As the numbers of positive and negative edges are strongly imbalanced, we report the F1 score. Our results show that GraphMix(GCN) improves over the baseline GCN for both datasets. Furthermore, the results of GraphMix(GCN) are comparable with the recently proposed state-of-the-art method GMNN (gmnn).
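The dual-graph construction can be sketched as follows (the function name and the plain edge-tuple representation are ours, for illustration):

```python
from itertools import combinations

def dual_graph(edges):
    """Build the dual (line) graph: each original edge becomes a node;
    two dual nodes are connected iff their edges share an endpoint."""
    dual_nodes = list(edges)
    dual_edges = [
        (i, j)
        for (i, e1), (j, e2) in combinations(enumerate(dual_nodes), 2)
        if set(e1) & set(e2)          # the two edges share a node
    ]
    return dual_nodes, dual_edges

# Toy graph: a path 0-1-2 plus the edge 1-3.
nodes, links = dual_graph([(0, 1), (1, 2), (1, 3)])
# All three original edges share node 1, so the dual graph is a triangle.
```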
Table 4: Results of semi-supervised link classification (F1 score).

Algorithm  Bit OTC  Bit Alpha
DeepWalk (perozzi2014deepwalk)  63.20  62.71
GMNN* (gmnn)  66.93  65.86
GCN  65.72±0.38  64.00±0.19
GraphMix (GCN)  66.35±0.41  65.34±0.19
4.4 Ablation Study
Since GraphMix consists of various components, some of which are shared with the existing semi-supervised learning literature, we study their effects by systematically removing or adding individual components of GraphMix. We measure the effect of the following:

- Removing Manifold Mixup and the predicted targets from the FCN training.
- Removing the predicted targets from the FCN training.
- Removing Manifold Mixup from the FCN training.
- Removing the sharpening of the predicted targets.
- Removing the averaging of predictions over random perturbations of the input sample.
- Using the EMA (meanteacher) of the GNN for target prediction.
The results for semi-supervised node classification are presented in Table 5. We did not do any hyperparameter tuning for the ablation study and used the best-performing hyperparameters found for the results presented in Table 1. We observe that all the components of GraphMix contribute to its performance. Furthermore, since the EMA is an ensemble model, it would be expected to produce more accurate predicted targets and hence improve test accuracy across all the datasets. However, we observe that using the EMA model (meanteacher) for computing the predicted targets results in improved performance for Citeseer but decreased performance for Cora and Pubmed. This may be an effect of not re-tuning the hyperparameters after adding the EMA to GraphMix. We leave this exploration for future work.
Table 5: Ablation study results (% test accuracy).

Ablation  Cora  Citeseer  Pubmed
GraphMix  83.94±0.57  74.52±0.59  80.98±0.55
without Manifold Mixup, without predicted targets  79.98±0.27  70.80±0.46  79.05±0.26
without predicted targets  81.86±0.41  71.30±0.14  79.66±0.14
without Manifold Mixup  83.57±0.79  73.96±0.76  80.90±0.45
no sharpening  80.20±0.23  71.30±0.27  80.06±0.18
no averaging of predictions  83.32±0.27  73.47±0.33  80.52±0.59
with EMA  83.82±0.76  74.92±0.57  80.38±0.59
4.5 Visualization of the Learned Features
In this section, we analyze the features learned by GraphMix on the Cora dataset. Specifically, we present 2D visualizations of the hidden states using t-SNE (tsne) in Figures 1(a) and 1(b). We observe that GraphMix learns hidden states which are better separated and more condensed. We further evaluate the SoftRank (refer to Appendix A.5) of the class-specific hidden states to demonstrate that GraphMix(GCN) makes the class-specific hidden states more concentrated, as shown in Figure 1(c). Refer to Appendix A.6 for the 2D representations of the other datasets.
5 Related Work
5.1 Semi-supervised Learning over Graph Data
There exists a long line of work on semi-supervised learning over graph data. Earlier work used a graph Laplacian regularizer to enforce local smoothness of the predicted targets over the nodes (lp; zhu2003semi; belkin2006manifold). Another line of work learns node embeddings in an unsupervised way (perozzi2014deepwalk), which can then be used as input to any classifier, or learns the node embeddings and the target prediction jointly (yang2016revisiting). Many of the recent Graph Neural Network based approaches (refer to (gnn_review) for a review), inspired by the success of Convolutional Neural Networks in the image and text domains, define convolution operators using the neighbourhood information of the nodes (kipf2016semi; velivckovic2018graph; defferrard). These convolution-operator-based methods exhibit state-of-the-art results for semi-supervised learning over graph data; hence much of the recent attention is dedicated to proposing architectural changes to them (gmnn; ugnet; ma2019disentangled). Unlike these methods, we propose a regularization technique that can be applied to any Graph Neural Network that uses a parameterized transformation on the node features.
5.2 Data Augmentation
It is well known that the generalization of a learning algorithm can be improved by enlarging the training set. Because labeling more samples is labour-intensive and costly, data augmentation has become the de facto technique for enlarging the training set, especially in computer vision applications such as image classification. Notable data augmentation techniques include Cutout (cutout) and DropBlock (dropBlock): in Cutout, a contiguous part of the input is zeroed out; DropBlock further extends Cutout to the hidden states. In another line of research, such as Mixup and BC-learning (mixup; bclearning), additional training samples are generated by interpolating samples and their corresponding targets. Manifold Mixup (manifold_mixup) augments the data in the hidden states and shows that this learns more discriminative features for supervised learning. Furthermore, ICT (ict) and MixMatch (mixmatch) extend Mixup to semi-supervised learning by computing predicted targets for the unlabeled data and applying Mixup on the unlabeled data and their corresponding predicted targets. For unsupervised learning, ACAI (acai) and AMR (amr) explore interpolation techniques for autoencoders: ACAI interpolates the hidden states of an autoencoder and uses a critic network to constrain the reconstructions of these interpolated states to be realistic, while AMR explores ways of combining the hidden states of an autoencoder other than convex combinations. Unlike all of these techniques, which were proposed for fixed-topology datasets, in this work we propose interpolation-based data augmentation techniques for graph-structured data.
5.3 Regularizing Graph Neural Networks
Regularizing Graph Neural Networks has drawn some attention recently. GraphSGAN (graphscan) first uses an embedding method such as DeepWalk (perozzi2014deepwalk) and then trains generator-classifier networks in an adversarial learning setting to generate fake samples in the low-density regions between subgraphs. BVAT (BVAT) and (graphadvtraining) generate adversarial perturbations to the features of the graph nodes while taking the graph structure into account. While these methods improve generalization on graph-structured data, they introduce significant additional computation cost: GraphSGAN requires computing node embeddings as a preprocessing step, while BVAT and (graphadvtraining) require additional gradient computations for the adversarial perturbations. Unlike these methods, GraphMix does not introduce any significant additional computation, since it is based on interpolation techniques and self-generated predicted targets.
6 Discussion
We presented GraphMix, a simple and efficient regularization technique for graph neural networks. GraphMix is a general technique that can be applied to any graph neural network that uses a parameterized transformation on the feature vectors of the graph nodes. Through extensive experiments, we demonstrated state-of-the-art or near state-of-the-art performance with this simple regularization technique on various benchmark datasets. More importantly, GraphMix improves the test accuracy over the vanilla GNN across all the datasets, even without extensive hyperparameter search. Further, we conducted a systematic ablation study to understand the effect of the different components on the performance of GraphMix. Our results suggest that, in parallel to designing new architectures, exploring better regularization for graph-structured data is a promising avenue for research.
Acknowledgments
The authors thank Petar Veličković and David Lopez-Paz for helpful discussions and comments. The authors also thank Compute Canada for providing the computational resources used in this work.
Appendix A Appendix
A.1 Datasets
The statistics of these datasets, as well as the number of training/validation/test nodes, are presented in Table 6.
Dataset  # Nodes  # Edges  # Features  # Classes  # Training  # Validation  # Test 

Cora  2,708  5,429  1,433  7  140  500  1,000 
Citeseer  3,327  4,732  3,703  6  120  500  1,000 
Pubmed  19,717  44,338  500  3  60  500  1,000 
Bitcoin Alpha  3,783  24,186  3,783  2  100  500  3,221 
Bitcoin OTC  5,881  35,592  5,881  2  100  500  5,947 
A.2 Comparison with State-of-the-art Methods
A comparison of GraphMix with recent state-of-the-art methods, as well as earlier methods, is presented in Table 7.
Method  Cora  Citeseer  Pubmed 
MLP  55.1%  46.5%  71.4% 
ManiReg (belkin2006manifold)  59.5%  60.1%  70.7% 
SemiEmb (weston2012deep)  59.0%  59.6%  71.7% 
LP (zhu2003semi)  68.0%  45.3%  63.0% 
DeepWalk (perozzi2014deepwalk)  67.2%  43.2%  65.3% 
ICA (lu2003link)  75.1%  69.1%  73.9% 
Planetoid (yang2016revisiting)  75.7%  64.7%  77.2% 
Chebyshev (defferrard)  81.2%  69.8%  74.4% 
GCN (kipf2016semi)  81.5%  70.3%  79.0% 
MoNet (monti2016geometric)  81.7±0.5%  —  78.8±0.3%
GAT (velivckovic2018graph)  83.0±0.7%  72.5±0.7%  79.0±0.3%
GraphSGAN (graphscan)  83.3±1.3%  73.1±1.8%  —
GMNN (gmnn)  83.7%  73.1%  81.8%
DisenGCN (ma2019disentangled)  83.7%  73.4%  80.5%
Graph UNet (ugnet)  84.4%  73.2%  79.6%
BVAT (BVAT)  83.6±0.5%  74.0±0.6%  79.9±0.4%
GraphMix (GCN)  83.94±0.57%  74.52±0.59%  80.98±0.55%
GraphMix (GAT)  83.32±0.18%  73.08±0.23%  81.10±0.78%
A.3 Implementation and Hyperparameter Details
We use the standard benchmark architecture used in GCN (kipf2016semi), GAT (velivckovic2018graph) and GMNN (gmnn), among others. This architecture has one hidden layer, and the graph convolution is applied twice: on the input layer and on the output of the hidden layer. The FCN in GraphMix shares its parameters with the GCN.
GraphMix introduces four additional hyperparameters: the parameter $\alpha$ of the $\mathrm{Beta}(\alpha, \alpha)$ distribution used in the Manifold Mixup training of the FCN; the max-consistency coefficient, which controls the trade-off between the supervised loss and the unsupervised loss (computed using the pseudo-labels) of the FCN; the temperature $T$ used in sharpening; and the number of random perturbations applied to the input data for averaging the predictions.
We conducted a minimal hyperparameter search over only $\alpha$ and the max-consistency coefficient, and fixed the sharpening temperature and the number of random perturbations to constant values. The other hyperparameters were set to the best values for the underlying GNN (GCN or GAT), including the learning rate, the weight decay, the number of units in the hidden layer, etc. We observed that GraphMix is not very sensitive to the values of $\alpha$ and the max-consistency coefficient, and similar values of these hyperparameters work well across all the benchmark datasets. Refer to Appendices A.3 and A.4 for the hyperparameter values and the procedure used for selecting the best hyperparameters.
A.3.1 For results reported in Sections 4.2 and 4.2.1
For GCN and GraphMix(GCN), we used the Adam optimizer with weight decay 5e-4; the dropout rates of the input and hidden layers were set to fixed values. For GAT and GraphMix(GAT), we used the Adam optimizer with weight decay 5e-4, and the dropout rates of the input and hidden layers were searched over a small set of candidate values.
For $\alpha$ and the max-consistency coefficient of GraphMix(GCN) and GraphMix(GAT), we searched over small sets of candidate values. For GraphMix(GCN), a single value of $\alpha$ works best across all the datasets, while the best value of the max-consistency coefficient differs between Cora/Citeseer and Pubmed. For GraphMix(GAT), the best values of $\alpha$ and the max-consistency coefficient likewise differ between Cora/Citeseer and Pubmed; input and hidden dropout rates of 0.5 work best for Cora and Citeseer, and of 0.2 for Pubmed.
A.3.2 For results reported in Section 4.2.2
For $\alpha$ of GraphMix(GCN), we searched over a set of candidate values and found a single value that works best across all the datasets. For the max-consistency coefficient, we likewise searched over a set of candidate values and found the best-performing values across all the datasets. The rest of the details for GraphMix(GCN) and GCN are the same as in Section A.3.1.
A.3.3 For results reported in Section 4.3
For $\alpha$ of GraphMix(GCN), we searched over a set of candidate values and found a single value that works best for both datasets; the same holds for the max-consistency coefficient. We ran all the experiments for 150 epochs. The value of the consistency coefficient (line 13 in Algorithm 1) is increased from 0 to its maximum value between epochs 75 and 125 using the sigmoid ramp-up of Mean Teacher (meanteacher).
For both GraphMix(GCN) and GCN, we used the Adam optimizer; the learning rate, the weight decay, the number of units in the hidden layer, and the input-layer dropout rate were set to fixed values.
A.4 Hyperparameter Selection
For each configuration of hyperparameters, we run the experiments with multiple random seeds. We select the hyperparameter configuration which has the best validation accuracy averaged over these trials. With this best configuration, we train the model again over several random seeds and use the validation set for model selection (i.e., we report the test accuracy at the epoch which has the best validation accuracy).
A.5 SoftRank
Let H be a matrix containing the hidden states of all the samples from a particular class. The SoftRank of H is defined as the sum of the singular values of the matrix divided by its largest singular value. A lower SoftRank implies fewer dimensions with substantial variability, providing a continuous analogue to the notion of rank from matrix algebra. This gives evidence that the concentration of class-specific states observed when using GraphMix in Figure 3 can be measured directly from the hidden states and is not an artifact of the t-SNE visualization.
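A minimal sketch of this quantity, assuming H is stored as a NumPy array with one row per sample:

```python
import numpy as np

def soft_rank(H):
    """SoftRank: sum of the singular values of H divided by the
    largest singular value (a continuous analogue of matrix rank)."""
    s = np.linalg.svd(H, compute_uv=False)
    return s.sum() / s.max()
```

For example, an orthogonal (full-rank) matrix attains the maximum value min(n, d), while a rank-1 matrix yields a SoftRank of 1.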
A.6 Feature Visualization
We present the 2D visualization of the hidden states learned using GCN and GraphMix(GCN) for the Cora, Pubmed and Citeseer datasets in Figure 3. We observe that for Cora and Citeseer, GraphMix learns substantially better hidden states than GCN. For Pubmed, although there is no clear separation between classes, the "Green" and "Red" classes overlap less when using GraphMix, resulting in better hidden states.
A.7 Results on Larger Datasets
In this section, we provide results on three recently proposed datasets which are relatively larger than the standard benchmark datasets (Cora/Citeseer/Pubmed). Specifically, we use the CoraFull dataset proposed in bojchevski2018deep and the CoauthorCS and CoauthorPhysics datasets proposed in pitfalls. We used the processed versions of these datasets available here ^{2}^{2}2https://github.com/shchur/gnnbenchmark. The statistics of these datasets are given in Table 9. We created 10 random train/validation/test splits of the data. For classes with more than 100 samples, we chose 20 samples per class for training, 30 samples per class for validation, and the remaining samples as test data. For classes with fewer than 100 samples, we chose 20% of the samples per class for training, 30% for validation, and the remainder for testing. For each split we ran experiments using 100 random seeds. The results are presented in Table 8. We observe that GraphMix(GCN) improves the results over GCN for all three datasets. We note that we did only a minimal hyperparameter search for GraphMix(GCN), as mentioned in Section A.7.1, and a more rigorous hyperparameter search could further improve the performance of GraphMix.
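The per-class splitting procedure described above can be sketched as follows (the function name and the tie-breaking behavior at exactly 100 samples are illustrative):

```python
import random
from collections import defaultdict

def split_per_class(labels, seed=0):
    """Train/val/test split for the larger datasets: for classes with
    more than 100 samples, take 20 train / 30 val nodes per class; for
    smaller classes, take 20% train / 30% val; the rest is test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        if len(idxs) > 100:
            n_train, n_val = 20, 30
        else:
            n_train = int(0.2 * len(idxs))
            n_val = int(0.3 * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test
```

Each of the 10 splits would be produced with a different seed.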
Table 8: Results on the larger datasets (test accuracy ± std).

Method          | CoraFull     | CoauthorCS   | CoauthorPhysics
GCN*            | 62.2 ± 0.6   | 91.1 ± 0.5   | 92.8 ± 1.0
GAT*            | 51.9 ± 1.5   | 90.5 ± 0.6   | 92.5 ± 0.9
MoNet*          | 59.8 ± 0.8   | 90.8 ± 0.6   | 92.5 ± 0.9
GS-Mean*        | 58.6 ± 1.6   | 91.3 ± 2.8   | 93.0 ± 0.8
GCN             | 60.13 ± 0.57 | 91.27 ± 0.56 | 92.90 ± 0.92
GraphMix (GCN)  | 61.80 ± 0.54 | 91.83 ± 0.51 | 94.49 ± 0.84
Table 9: Statistics of the larger datasets.

Datasets        | Classes | Features | Nodes | Edges
CoraFull        | 67      | 8710     | 18703 | 62421
CoauthorCS      | 15      | 6805     | 18333 | 81894
CoauthorPhysics | 5       | 8415     | 34493 | 247962
A.7.1 Hyperparameter Details for Results in Table 8
For all the experiments, we used the standard architecture mentioned in Section A.3 and the Adam optimizer with learning rate 0.001 and 64 units in the hidden layer. For CoauthorCS and CoauthorPhysics, we trained the network for 2000 epochs. For CoraFull, we trained the network for 5000 epochs because we observed that the training loss on the CoraFull dataset takes longer to converge.
For CoauthorCS and CoauthorPhysics: we set the input layer dropout rate to 0.5 and the weight decay to 0.0005, both for GCN and GraphMix(GCN). We did not conduct any hyperparameter search over the GraphMix hyperparameters , , temperature , and the number of random permutations applied to the input data for GraphMix(GCN) for these two datasets, and set these values to , , and , respectively.
For the CoraFull dataset: we found an input layer dropout rate of 0.2 and weight decay of 0.0 to work best for both GCN and GraphMix(GCN). For GraphMix(GCN), we fixed , temperature and the number of random permutations to and , respectively. For , we searched over and found that works best.