GraphMix: Regularized Training of Graph Neural Networks for Semi-Supervised Learning
We present GraphMix, a regularization technique for Graph Neural Network based semi-supervised object classification, leveraging the recent advances in the regularization of classical deep neural networks. Specifically, we propose a unified approach in which we train a fully-connected network jointly with the graph neural network via parameter sharing, interpolation-based regularization and self-predicted-targets. Our proposed method is architecture agnostic in the sense that it can be applied to any variant of graph neural networks which applies a parametric transformation to the features of the graph nodes. Despite its simplicity, with GraphMix we can consistently improve results and achieve or closely match state-of-the-art performance using even simpler architectures such as Graph Convolutional Networks, across three established graph benchmarks (the Cora, Citeseer and Pubmed citation network datasets) as well as three newly proposed datasets: Cora-Full, Co-author-CS and Co-author-Physics.
Due to the presence of graph structured data across a wide variety of domains, such as biological networks, social networks and telecommunication networks, there have been several attempts to design neural networks that can process arbitrarily structured graphs. Early work includes (gori; scarselli), which propose a neural network that can directly process most types of graphs, e.g., acyclic, cyclic, directed, and undirected graphs. More recent approaches include (bruna; henaff; defferrard; kipf2016variational; gilmer2017neural; hamilton2017inductive; velivckovic2018graph; velickovic2018deep; gmnn; u-g-net; ma2019disentangled), among others. Many of these approaches are designed for addressing the important problem of semi-supervised learning over graph structured data (gnn_review). However, much of this research effort has been dedicated to developing novel architectures.
Unlike many existing works that propose new architectures, we focus on architecture-agnostic regularization techniques for graph neural network based semi-supervised object classification. Data augmentation based regularization has been shown to be very effective in other types of neural networks, but how to apply these techniques in graph neural networks is still under-explored. Our proposed method GraphMix (code available at https://github.com/vikasverma1077/GraphMix) is inspired by interpolation based data augmentation techniques (mixup; manifold_mixup) but is adapted to make it suitable for graph structured data. Furthermore, GraphMix also utilizes self-target-prediction (laine2016temporal; meanteacher; ict; mixmatch) based data augmentation. We show that with our proposed regularization techniques, we can achieve state-of-the-art performance even when using simpler graph neural network architectures such as Graph Convolutional Networks (kipf2016semi) and without incurring any significant additional computation cost.
2 Problem Definition and Preliminaries
2.1 Problem Setup
We are interested in the problem of semi-supervised object classification using graph structured data. We can formally define such graph structured data as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ represents the set of nodes $\{v_1, \dots, v_n\}$, and $\mathcal{E}$ is the set of edges between the nodes of $\mathcal{V}$.
Each node $v_i$ in the graph has a corresponding $d$-dimensional feature vector $x_i \in \mathbb{R}^{d}$. The feature vectors of all the nodes are stacked together to form the entire feature matrix $X \in \mathbb{R}^{n \times d}$. Each node belongs to one out of $C$ classes and can be labeled with a $C$-dimensional one-hot vector $y_i \in \{0, 1\}^{C}$. Given the labels of a few labeled nodes $\mathcal{V}_L \subset \mathcal{V}$, the task is to predict the labels of the remaining nodes $\mathcal{V}_U = \mathcal{V} \setminus \mathcal{V}_L$.
2.2 Graph Neural Networks
Graph Neural Networks (GNN) learn the $l$-th layer representations of a sample $i$ by leveraging the representations of the samples in the neighbourhood of $i$. This is done by using an aggregation function that takes as an input the representations of all the samples and the graph structure and outputs the aggregated representation. The aggregation function can be defined using the Graph Convolution layer (kipf2016semi), Graph Attention Layer (velivckovic2018graph), or any general message passing layer (gilmer2017neural). Formally, let $H^{(l)} \in \mathbb{R}^{n \times k}$ be a matrix containing the $k$-dimensional representation of $n$ nodes in the $l$-th layer, then:
$$H^{(l+1)} = \sigma\left(\mathrm{AGGREGATE}\left(H^{(l)} W, A\right)\right)$$
where $W \in \mathbb{R}^{k \times k'}$ is a linear transformation matrix, $k'$ is the dimension of the $(l+1)$-th layer, $\sigma$ is an elementwise nonlinearity and $\mathrm{AGGREGATE}$ is the aggregation function that utilizes the graph structure, represented here by the adjacency matrix $A$.
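As a concrete illustration of the layer described above, the following is a minimal NumPy sketch that assumes the GCN-style aggregation (symmetric normalization of the adjacency matrix with self-loops) and a ReLU nonlinearity. This is only one possible choice of aggregation function; the function names `normalized_adjacency` and `gnn_layer` and the toy graph are illustrative and do not come from the paper.

```python
import numpy as np

def normalized_adjacency(A):
    """GCN-style normalization: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1)                 # node degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gnn_layer(H, W, A_norm):
    """One layer: linear transform H W, neighbourhood aggregation, then ReLU."""
    return np.maximum(A_norm @ (H @ W), 0.0)

# Toy path graph 0 - 1 - 2 with 2-dimensional node features (assumed data).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H0 = np.array([[1., 0.],
               [0., 1.],
               [1., 1.]])
W = np.array([[1., -1., 0.5, 0.],
              [0., 1., 0.5, 2.]])           # maps k = 2 to k' = 4

A_norm = normalized_adjacency(A)
H1 = gnn_layer(H0, W, A_norm)               # shape (3, 4)
```

Swapping `A_norm @ (...)` for an attention-weighted sum would give a GAT-style layer with the same parametric transformation `H @ W`, which is the part GraphMix relies on.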
2.3 Interpolation Based Regularization Techniques
Recently, interpolation-based techniques have been proposed for regularizing neural networks. We briefly describe some of these techniques here. Mixup (mixup) trains a neural network on the convex combination of input and targets, whereas Manifold Mixup (manifold_mixup) trains a neural network on the convex combination of the hidden states of a randomly chosen hidden layer and the targets. While Mixup regularizes a neural network by enforcing a constraint that the model output should change linearly in between the examples in the input space, Manifold Mixup regularizes the neural network by learning better (more discriminative) hidden states.
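Suppose $g(\cdot)$ is a function that maps input samples to hidden states, $f(\cdot)$ is a function that maps hidden states to predicted output, $\lambda$ is a random variable drawn from the $\mathrm{Beta}(\alpha, \alpha)$ distribution, $\mathrm{Mix}_{\lambda}(a, b) = \lambda a + (1 - \lambda) b$ is an interpolation function, $\mathcal{D}$ is the data distribution and $\ell$ is a loss function such as cross-entropy loss, then the Manifold Mixup Loss is defined as:
$$\mathcal{L}_{\mathrm{MM}} = \mathbb{E}_{(x, y) \sim \mathcal{D}}\, \mathbb{E}_{(x', y') \sim \mathcal{D}}\, \mathbb{E}_{\lambda \sim \mathrm{Beta}(\alpha, \alpha)}\; \ell\left(f\left(\mathrm{Mix}_{\lambda}(g(x), g(x'))\right), \mathrm{Mix}_{\lambda}(y, y')\right)$$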
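The Manifold Mixup loss can be sketched for a single mini-batch as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the second sample of each pair is obtained by permuting the batch (a common practical shortcut), `g` is a single ReLU layer, `f` is a linear layer followed by softmax, and all names and toy shapes are hypothetical.

```python
import numpy as np

def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b."""
    return lam * a + (1.0 - lam) * b

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def manifold_mixup_loss(X, Y, W1, W2, alpha, rng):
    lam = rng.beta(alpha, alpha)            # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(X.shape[0])      # pair each sample with another one
    H = np.maximum(X @ W1, 0.0)             # g: inputs -> hidden states
    H_mix = mix(H, H[perm], lam)            # interpolate hidden states
    Y_mix = mix(Y, Y[perm], lam)            # interpolate one-hot targets
    P = softmax(H_mix @ W2)                 # f: hidden states -> predictions
    # Cross-entropy between mixed targets and predictions, averaged over batch.
    return float(-(Y_mix * np.log(P + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))                 # toy batch: 8 samples, 5 features
Y = np.eye(3)[rng.integers(0, 3, size=8)]   # toy one-hot labels, 3 classes
W1 = rng.normal(scale=0.1, size=(5, 4))
W2 = rng.normal(scale=0.1, size=(4, 3))
loss = manifold_mixup_loss(X, Y, W1, W2, alpha=1.0, rng=rng)
```

Replacing `H` with `X` in the mixing step (and applying `g` afterwards) would give input-space Mixup instead.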
Data Augmentation is arguably the simplest and most efficient technique for regularizing a neural network. In some domains, such as computer vision, speech and text, there exist efficient data augmentation techniques, for example, random cropping, translation or Cutout (cutout) for computer vision, audioaf and specaugment for speech and dataNoising for the text domain. However, data augmentation for graph-structured data remains under-explored. There exists some recent work along these lines, but the prohibitive additional computation cost (see Section 5.3) introduced by these methods makes them impractical for real-world large graph datasets. Based on these limitations, our main objective is to propose an efficient data augmentation technique for graph datasets.
Recent work based on interpolation-based data augmentation (mixup; manifold_mixup) has seen sizable improvements in regularization performance across a number of tasks. However, these techniques are not directly applicable to graphs for an important reason: Although we can create additional nodes by interpolating the features and corresponding labels, it remains unclear how these new nodes must be connected to the original nodes via synthetic edges such that the structure of the whole graph is preserved. In this work, we explore how this limitation can be addressed. Furthermore, drawing inspiration from the success of self-supervised semi-supervised learning algorithms (self-predicted-targets based algorithms which can be also interpreted as a form of data-augmentation techniques) (ict; mixmatch), we explore self-supervision in the training of GNNs. We note that self-supervision has already been explored for unsupervised representation learning from graph structured data (velickovic2018deep), but not for semi-supervised object classification over graph structured data. Based on these challenges and motivations we present our proposed approach GraphMix for training Graph Neural Networks in the following Section.
GraphMix augments the vanilla GNN with a Fully Connected Network (FCN) via parameter sharing. The FCN loss is computed using the Manifold Mixup as discussed in Section 2.3 and the GNN loss is computed in the standard way. Both of these losses are optimized in an alternating fashion during training. Manifold Mixup has been shown to learn better features. The use of Manifold Mixup for FCN training facilitates learning better features, which are used in the GNN training via parameter sharing. The predicted targets from the GNN are used to augment the training set of the FCN. In this way, both FCN and GNN facilitate each other’s learning process. At inference time, the predictions are made using only GNN. The diagrammatic representation of GraphMix is presented in Figure 1 and the full algorithm is presented in Algorithm 1.
Some implementation considerations. For Manifold Mixup training of FCN, we apply mixup only in the hidden layer. Note that in manifold_mixup, the authors recommended applying mixing in a randomly chosen layer (which also includes the input layer) at each training update. However, we observed under-fitting when applying mixup randomly at the input layer or hidden layer. Applying mixup only in the input layer also resulted in underfitting and did not improve test accuracy.
The GraphMix framework can be applied to any underlying GNN as long as the underlying GNN applies parametric transformations to the node features. In our experiments, we show improvements over GCN (kipf2016semi) and GAT (velivckovic2018graph) using GraphMix; however, this framework can also be applied to more recent GNNs such as Graph U-Net (u-g-net) and DisenGCN (ma2019disentangled), which may improve the state of the art even further.
The performance of self-supervision based algorithms such as GraphMix is greatly affected by the accuracy of the predicted targets. To improve the accuracy of the predicted targets, we applied the average of the model prediction on random perturbations of an input sample as discussed in Section 3.2.1 and sharpening as described in Section 3.2.2. Further, we discuss the similarities and differences between GraphMix and the Co-training framework in Section 3.2.3.
3.2.1 Accurate Target Prediction for Unlabeled data
Recent state-of-the-art semi-supervised learning methods use a teacher model to accurately predict targets for the unlabeled data. These predicted targets on the unlabeled data are used as "true labels" for further training of the model. The teacher model can be realized as a temporal ensemble of the student model (the model being trained) (laine2016temporal) or by using an Exponential Moving Average (EMA) of the parameters of the student model (meanteacher). Another recently proposed method for accurate target predictions for unlabeled data is to use the average of the predicted targets across random augmentations of the input sample (mixmatch). Along these lines, in this work, we compute the predicted target as the average of predictions on drop-out versions of the input sample. We also used the EMA of the student model but it did not improve test accuracy across all the datasets (see Section 4.4 for details).
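The averaging of predictions over dropout-perturbed inputs can be sketched as below. This is a minimal illustration, assuming inverted dropout on the input features and a stand-in linear softmax classifier in place of the GNN; the values `K=10` and `rate=0.5` and all function names are illustrative assumptions rather than settings taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dropout(X, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    mask = rng.random(X.shape) >= rate
    return X * mask / (1.0 - rate)

def averaged_prediction(X, predict, K, rate, rng):
    """Average the class probabilities over K dropout-perturbed copies of X."""
    preds = [predict(dropout(X, rate, rng)) for _ in range(K)]
    return np.mean(preds, axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 3))                  # stand-in model parameters

def predict(X):
    return softmax(X @ W)                    # stand-in for the GNN forward pass

X_unlabeled = rng.normal(size=(4, 6))        # toy unlabeled node features
targets = averaged_prediction(X_unlabeled, predict, K=10, rate=0.5, rng=rng)
```

Because each perturbed prediction is a probability vector, their average is one as well, so it can be used directly as a soft target.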
3.2.2 Entropy Minimization
Many recent semi-supervised learning algorithms (laine2016temporal; miyato2017vat; meanteacher; ict) are based on the cluster assumption (chapple), which posits that the class boundary should pass through the low-density regions of the marginal data distribution. One way to enforce this assumption is to explicitly minimize the entropy of the model’s predictions on unlabeled data by adding an extra loss term to the original loss term (entmin). Entropy minimization can also be achieved implicitly by modifying the model’s prediction on the unlabeled data such that the prediction has low entropy and using these low-entropy predictions as targets for the further training of the model. Examples include "Pseudolabels" (pseudolabel) and "Sharpening" (mixmatch). In this work, we use Sharpening for entropy minimization. The Sharpening function over the model prediction $p$ can be formally defined as follows (mixmatch), where $T$ is the temperature hyperparameter and $C$ is the number of classes:
$$\mathrm{Sharpen}(p, T)_i := \frac{p_i^{1/T}}{\sum_{j=1}^{C} p_j^{1/T}}$$
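Sharpening as defined in MixMatch raises each class probability to the power 1/T and renormalizes, so temperatures below 1 push the distribution towards its most likely class. The snippet below is a small pure-Python sketch of that operation; the probability vector and the temperature `T=0.5` are illustrative values only.

```python
def sharpen(p, T):
    """Raise each probability to 1/T and renormalize so the result sums to 1."""
    powered = [pi ** (1.0 / T) for pi in p]
    total = sum(powered)
    return [v / total for v in powered]

p = [0.5, 0.3, 0.2]          # toy model prediction over 3 classes
q = sharpen(p, T=0.5)        # T < 1 lowers the entropy of the prediction
```

With `T=1.0` the prediction is returned unchanged, and as `T` approaches 0 the output approaches a one-hot vector on the most likely class.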
3.2.3 Connection to Co-training
The GraphMix approach can be seen as a special instance of the Co-training framework (cotraining). Co-training assumes that the description of an example can be partitioned into two distinct views and either of these views would be sufficient for learning if we had enough labeled data. In this framework, two learning algorithms are trained separately on each view and then the prediction of each learning algorithm on the unlabeled data is used to enlarge the training set of the other. Our method has some important similarities to and differences from the Co-training framework. Similar to Co-training, we train two neural networks and the predictions from the GNN are used to enlarge the training set of the FCN. The important difference is that instead of using the predictions from the FCN to enlarge the training set for the GNN, we employ parameter sharing for passing the learned information from the FCN to the GNN. In our experiments, directly using the predictions of the FCN for GNN training resulted in reduced accuracy. This is because the number of labeled samples for training the FCN is low and hence the FCN does not make accurate enough predictions. Another important difference is that, unlike the Co-training framework, the FCN and GNN do not use completely distinct views of the data: the FCN uses the feature vectors, and the GNN uses the feature vectors together with the edge connectivity.
We present results for the GraphMix algorithm using standard benchmark datasets and the standard architecture in Section 4.2 and 4.3. We also conduct an ablation study on GraphMix in Section 4.4 to understand the contribution of various components to its performance. Refer to Appendix A.3 for implementation and hyperparameter details.
We use three standard benchmark citation network datasets for semi-supervised node classification, namely Cora, Citeseer and Pubmed. In all these datasets, nodes correspond to documents and edges correspond to citations. Node features correspond to the bag-of-words representation of the document. Each node belongs to one of $C$ classes. During training, the algorithm has access to the feature vectors and edge connectivity of all the nodes but has access to the class labels of only a few of the nodes.
For semi-supervised link classification, we use two datasets, Bitcoin Alpha and Bitcoin OTC, from (kumar2016edge; kumar2018rev2). The nodes in these datasets correspond to Bitcoin users and the edge weights between them correspond to the degree of trust between users. Following (gmnn), we treat edges with weights greater than 3 as positive instances and edges with weights less than -3 as negative ones. Given a few labeled edges, the task is to predict the labels of the remaining edges. The statistics of these datasets as well as the number of training/validation/test nodes are presented in Appendix A.1.
4.2 Semi-supervised Node Classification
For baselines, we choose GCN (kipf2016semi), and the recent state-of-the-art methods GAT (velivckovic2018graph), GMNN (gmnn) and Graph U-Net (u-g-net). To underline the importance of the shared parameters between the FCN and GCN in GraphMix, we used two additional baselines: in the first one, we trained the GCN with self-generated predicted targets, and in the second one, we trained the FCN with self-generated predicted targets, named “GCN (with predicted-targets)” and “FCN (with predicted-targets)” respectively in Table 1. GraphMix(GCN) and GraphMix(GAT) refer to the methods where the underlying GNNs are GCN and GAT respectively. Refer to Appendix Section A.3 for implementation and hyperparameter details.
We observe that for Cora, GraphMix(GCN) performs close to the current state-of-the-art method, Graph U-Net. For Citeseer, GraphMix(GCN) achieves state-of-the-art performance. For Pubmed, GraphMix improved upon GCN and GAT but was worse than GMNN. More interestingly, we obtained the best results for Pubmed by just using the GCN (with predicted-targets). Importantly, GraphMix always improves the performance of the underlying GNN (GCN or GAT) across all the datasets. We further present results using random partitioning of data, results with fewer labeled samples and results on larger datasets (Cora-Full, Co-author-CS and Co-author-Physics) in Section 4.2.1, Section 4.2.2 and Section A.7 respectively.
4.2.1 Random Partitioning of the datasets
pitfalls has demonstrated that the performance of the current state-of-the-art Graph Neural Network approaches on the standard train/validation/test split of the popular benchmark datasets (such as Cora, Citeseer, Pubmed, etc.) is significantly different from their performance on random splits. For fair evaluation, they recommend using multiple random partitions of the datasets. Along these lines, we created random splits of the Cora, Citeseer and Pubmed datasets with the same number of train/validation/test samples as in the standard split. Our results in Table 2 show that GraphMix significantly outperforms GCN across all the datasets.
4.2.2 Results with fewer labeled samples
We further evaluate the effectiveness of GraphMix in the learning regimes where fewer labeled samples exist. For each class, we randomly sampled samples for training and the same number of samples for the validation. We used all the remaining labeled samples as the test set. We repeated this process for times. The results in Table 3 show that GraphMix achieves even better improvements when the labeled samples are fewer (refer to Table 1 for results with training samples per class).
|Algorithm||Cora||Citeseer||Pubmed|
GCN * (kipf2016variational)
|GAT * (velivckovic2018graph)||83.0||72.5||79.0|
|GMNN * (gmnn)||83.7||73.1||81.8|
|Graph U-Net * (u-g-net)||84.4||73.2||79.6|
GCN (with predicted-targets)
|FCN (with predicted-targets)||80.30 ± 0.75||71.50 ± 0.80||77.40 ± 0.37|
4.3 Semi-supervised Link Classification
In the semi-supervised link classification problem, the task is to predict the labels of the remaining links, given a graph and labels of a few links. Following (taskar2004link), we can formulate the link classification problem as a node classification problem. Specifically, given an original graph, we construct a dual graph. The node set of the dual graph corresponds to the link set of the original graph. The nodes in the dual graph are connected if their corresponding links in the original graph share a node. The attributes of a node in the dual graph are defined as the indices of the nodes of the corresponding link in the original graph. Using this formulation, we present results on link classification on the Bit OTC and Bit Alpha benchmark datasets in Table 4. As the numbers of positive and negative edges are strongly imbalanced, we report the F1 score. Our results show that GraphMix(GCN) improves the performance over the baseline GCN method for both datasets. Furthermore, the results of GraphMix(GCN) are comparable with the recently proposed state-of-the-art method GMNN (gmnn).
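The dual-graph construction described above can be sketched in a few lines of pure Python. This is an illustrative, quadratic-time version for small graphs; the function name `dual_graph` and the toy edge list are assumptions, and a practical implementation on the Bitcoin datasets would likely index links by endpoint to avoid comparing every pair.

```python
from itertools import combinations

def dual_graph(edges):
    """Turn each original link into a dual node; connect dual nodes whose links share an endpoint."""
    # The attributes of a dual node are the indices of its link's two endpoints.
    dual_nodes = [tuple(e) for e in edges]
    dual_edges = [
        (i, j)
        for i, j in combinations(range(len(dual_nodes)), 2)
        if set(dual_nodes[i]) & set(dual_nodes[j])   # shared endpoint
    ]
    return dual_nodes, dual_edges

# Toy original graph: a 4-cycle 0-1-2-3 plus a separate link 4-5.
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (4, 5)]
dual_nodes, dual_edges = dual_graph(edges)
```

Labels attached to the original links then become node labels in the dual graph, so any node classifier can be applied unchanged.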
|Algorithm||Bit OTC||Bit Alpha|
4.4 Ablation Study
Since GraphMix consists of various components, some of which are common with the existing literature of semi-supervised learning, we set out to study the effect of various components by systematically removing or adding a component from GraphMix. We measure the effect of the following:
Removing the Manifold Mixup and predicted targets from the FCN training.
Removing the predicted targets from the FCN training.
Removing the Manifold Mixup from the FCN training.
Removing the Sharpening of the predicted targets.
Removing the Average of predictions for random perturbations of the input sample.
Using the EMA (meanteacher) of GNN for target prediction.
The results for semi-supervised node classification are presented in Table 5. We did not do any hyperparameter tuning for the ablation study and used the best performing hyperparameters found for the results presented in Table 1. We observe that all the components of GraphMix contribute to its performance. Furthermore, since EMA is an ensemble model, it is expected to produce more accurate predicted-targets and hence improve the test accuracy over all the datasets. However, we observe that using the EMA model (meanteacher) for computing the predicted-targets results in improved performance for Citeseer but decreased performance for Cora and Pubmed. This may be an effect of not performing a hyperparameter search when adding the EMA to GraphMix. We leave this exploration for future work.
|-without Manifold Mixup, without predicted targets||79.98 ± 0.27||70.80 ± 0.46||79.05 ± 0.26|
|-without Manifold Mixup||83.57 ± 0.79||73.96 ± 0.76||80.90 ± 0.45|
|-no Averaging of predictions||83.32 ± 0.27||73.47 ± 0.33||80.52 ± 0.59|
4.5 Visualization of the Learned Features
In this section, we present the analysis of the features learned by GraphMix for the Cora dataset. Specifically, we present the 2D visualization of the hidden states using t-SNE (tsne) in Figure 1(a) and 1(b). We observe that GraphMix learns hidden states which are better separated and condensed. We further evaluate the Soft-rank (refer to Appendix A.5) of the class-specific hidden states to demonstrate that GraphMix(GCN) makes the class-specific hidden states more concentrated, as shown in 1(c). Refer to Appendix A.6 for the 2D representation of other datasets.
5 Related Work
5.1 Semi-supervised Learning over Graph Data
There exists a long line of work on semi-supervised learning over graph data. Earlier work included using a Graph Laplacian Regularizer for enforcing local smoothness over the predicted targets for the nodes (lp; zhu2003semi; belkin2006manifold). Another line of work learns node embeddings in an unsupervised way (perozzi2014deepwalk), which can then be used as an input to any classifier, or learns the node embedding and target prediction jointly (yang2016revisiting). Many of the recent Graph Neural Network based approaches (refer to gnn_review for a review of these methods) are inspired by the success of Convolutional Neural Networks in image and text domains and define the convolutional operators using the neighbourhood information of the nodes (kipf2016semi; velivckovic2018graph; defferrard). These convolution operator based methods exhibit state-of-the-art results for semi-supervised learning over graph data, hence much of the recent attention is dedicated to proposing architectural changes to these methods (gmnn; u-g-net; ma2019disentangled). Unlike these methods, we propose a regularization technique that can be applied to any of these Graph Neural Networks that use a parameterized transformation on the node features.
5.2 Data Augmentation
It is well known that the generalization of a learning algorithm can be improved by enlarging the training data size. Because labeling more samples is labour-intensive and costly, data augmentation has become the de facto technique for enlarging the training data size, especially in computer vision applications such as image classification. Some of the notable data augmentation techniques include Cutout (cutout) and DropBlock (dropBlock). In Cutout, a contiguous part of the input is zeroed out. DropBlock further extends Cutout to the hidden states. In another line of research, such as Mixup and BC-learning (mixup; bclearning), additional training samples are generated by interpolating the samples and their corresponding targets. Manifold Mixup (manifold_mixup) proposes to augment the data in the hidden states and shows that it learns more discriminative features for supervised learning. Furthermore, ICT (ict) and MixMatch (mixmatch) extend the Mixup technique to semi-supervised learning by computing the predicted targets for the unlabeled data and applying Mixup on the unlabeled data and their corresponding predicted targets. Even further, for unsupervised learning, ACAI (acai) and AMR (amr) explore interpolation techniques for autoencoders. ACAI interpolates the hidden states of an autoencoder and uses a critic network to constrain the reconstruction of these interpolated states to be realistic. AMR explores different ways of combining the hidden states of an autoencoder other than the convex combinations of the hidden states. Unlike all of these techniques, which have been proposed for fixed topology datasets, in this work we propose interpolation based data augmentation techniques for graph structured data.
5.3 Regularizing Graph Neural Networks
Regularizing Graph Neural Networks has drawn some attention recently. GraphSGAN (graphscan) first uses an embedding method such as DeepWalk (perozzi2014deepwalk) and then trains generator-classifier networks in the adversarial learning setting to generate fake samples in the low-density region between sub-graphs. BVAT (BVAT) and graphadvtraining generate adversarial perturbations to the features of the graph nodes while taking graph structure into account. While these methods improve generalization in graph-structured data, they introduce significant additional computation cost: GraphSGAN requires computing node embeddings as a preprocessing step, while BVAT and graphadvtraining require additional gradient computation for computing adversarial perturbations. Unlike these methods, GraphMix does not introduce any significant additional computation since it is based on interpolation-based techniques and self-generated predicted targets.
We presented GraphMix, a simple and efficient regularization technique for graph neural networks. GraphMix is a general technique that can be applied to any graph neural network that uses a parameterized transformation on the feature vector of the graph nodes. Through extensive experiments, we demonstrated state-of-the-art or close to state-of-the-art performance using this simple regularization technique on various benchmark datasets. More importantly, GraphMix improves test accuracy over the vanilla GNN across all the datasets, even without any extensive hyperparameter search. Further, we conducted a systematic ablation study to understand the effect of different components on the performance of GraphMix. These results suggest that, in parallel to designing new architectures, exploring better regularization for graph structured data is a promising avenue for research.
Authors thank Petar Veličković and David Lopez-Paz for helpful discussions and comments. Authors also thank Compute Canada for providing computational resources used in this work.
Appendix A Appendix
The statistics of these datasets as well as the number of training/validation/test nodes are presented in Table 6.
|Dataset||# Nodes||# Edges||# Features||# Classes||# Training||# Validation||# Test|
A.2 Comparison with State-of-the-art Methods
The comparison of GraphMix with recent state-of-the-art methods as well as earlier methods is presented in Table 7.
|MoNet (monti2016geometric)||81.7 ± 0.5%||—||78.8 ± 0.3%|
|GAT (velivckovic2018graph)||83.0 ± 0.7%||72.5 ± 0.7%||79.0 ± 0.3%|
|GraphSGAN (graphscan)||83.3 ± 1.3||73.1 ± 1.8||—|
|Graph U-Net (u-g-net)||84.4%||73.2%||79.6%|
A.3 Implementation and Hyperparameter Details
We use the standard benchmark architecture as used in GCN (kipf2016semi), GAT (velivckovic2018graph) and GMNN (gmnn), among others. This architecture has one hidden layer and the graph convolution is applied twice: on the input layer and on the output of the hidden layer. The FCN in GraphMix shares the parameters with the GCN.
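The parameter sharing between the FCN and the GCN can be pictured with the following NumPy sketch, in which both networks use the same two weight matrices and differ only in whether an aggregation step is applied. This is a simplified illustration under stated assumptions (no biases, no dropout, ReLU hidden activation, a toy aggregation matrix); the names `fcn_forward` and `gcn_forward` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fcn_forward(X, W1, W2):
    """Fully connected network: uses node features only."""
    H = np.maximum(X @ W1, 0.0)
    return softmax(H @ W2)

def gcn_forward(X, A_norm, W1, W2):
    """GCN with one hidden layer: aggregation applied at the input and hidden layers."""
    H = np.maximum(A_norm @ (X @ W1), 0.0)
    return softmax(A_norm @ (H @ W2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 7))                   # 5 nodes, 7 features (toy data)
W1 = rng.normal(scale=0.1, size=(7, 4))       # shared by FCN and GCN
W2 = rng.normal(scale=0.1, size=(4, 3))       # shared by FCN and GCN
A_norm = np.full((5, 5), 0.2)                 # toy aggregation: average over all nodes

p_gcn = gcn_forward(X, A_norm, W1, W2)
p_fcn = fcn_forward(X, W1, W2)
```

With the identity matrix as the aggregation operator the two forward passes coincide, which makes explicit that the FCN is the GCN with the graph structure removed.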
GraphMix introduces four additional hyperparameters, namely the $\alpha$ parameter of the Beta distribution used in Manifold Mixup training of the FCN, the max-consistency coefficient $\gamma_{max}$ which controls the trade-off between the supervised loss and the unsupervised loss (loss computed using the pseudolabels) of the FCN, the temperature $T$ in sharpening and the number of random perturbations $K$ applied to the input data for the averaging of the predictions.
We conducted minimal hyperparameter search over only and and fixed the hyperparameters and to and respectively. The other hyperparameters were set to the best values for underlying GNN (GCN or GAT), including the learning rate, the decay rate, number of units in the hidden layer etc. We observed that GraphMix is not very sensitive to the values of and and similar values of these hyperparameters work well across all the benchmark datasets. Refer to Appendix A.3 and A.4 for the details about the hyperparameter values and the procedure used for the best hyperparameters selection.
For GCN and GraphMix(GCN), we used Adam optimizer with learning rate and $L_2$-decay 5e-4, the number of units in the hidden layer , dropout rate in the input layer and hidden layer was set to and , respectively. For GAT and GraphMix(GAT), we used Adam optimizer with learning rate and $L_2$-decay 5e-4, the number of units in the hidden layer , and the dropout rate in the input layer and hidden layer was searched from the values .
For and of GraphMix(GCN) and GraphMix(GAT) , we searched over the values in the set and respectively.
For GraphMix(GCN) : works best across all the datasets. works best for Cora and Citeseer and works best for Pubmed.
For GraphMix(GAT): works best for Cora and Citeseer and works best for Pubmed. works best for Cora and Citeseer and works best for Pubmed. An input dropout rate of 0.5 and a hidden dropout rate of 0.5 work best for Cora and Citeseer, and an input dropout rate of 0.2 and a hidden dropout rate of 0.2 work best for Pubmed.
A.3.2 For results reported in Section 4.2.2
For of GraphMix(GCN), we searched over the values in the set and found that works best across all the datasets. For , we searched over the values in the set and found that and works best across all the datasets. The rest of the details for GraphMix(GCN) and GCN are the same as in Section A.3.1.
A.3.3 For results reported in Section 4.3
For of GraphMix(GCN) , we searched over the values in the set and found that works best for both the datasets. For , we searched over the values in the set and found that works best for both the datasets. We conducted all the experiments for 150 epochs. The value of consistency coefficient (line 13 in Algorithm 1) is increased from to its maximum value from epoch 75 to 125 using the sigmoid ramp-up of Mean-Teacher (meanteacher).
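The ramp-up schedule for the consistency coefficient can be sketched as follows, using the sigmoid-shaped ramp-up exp(-5 (1 - t)^2) from Mean Teacher. The starting value of the coefficient is missing from the text above, so this sketch assumes the weight is zero before epoch 75; that assumption, the function names, and the example value `gamma_max=1.0` are illustrative only.

```python
import math

def sigmoid_rampup(step, length):
    """Mean-Teacher-style ramp-up from exp(-5) to 1 over `length` steps."""
    if length == 0:
        return 1.0
    t = min(max(step, 0.0), length) / length   # progress clipped to [0, 1]
    return math.exp(-5.0 * (1.0 - t) ** 2)

def consistency_weight(epoch, gamma_max, start=75, end=125):
    """Weight of the unsupervised loss at a given epoch."""
    if epoch < start:
        return 0.0                              # assumption: no unsupervised loss before `start`
    return gamma_max * sigmoid_rampup(epoch - start, end - start)

# Weights over a 150-epoch run for an assumed gamma_max of 1.0.
schedule = [consistency_weight(epoch, gamma_max=1.0) for epoch in range(150)]
```

The schedule stays at its maximum from epoch 125 onwards, so the last 25 of the 150 epochs are trained with the full unsupervised weight.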
Both for GraphMix(GCN) and GCN, we use Adam optimizer with learning rate and -decay , the number of units in the hidden layer , dropout rate in the input layer was set to .
A.4 Hyperparameter Selection
For each configuration of hyperparameters, we run the experiments with random seeds. We select the hyperparameter configuration which has the best validation accuracy averaged over these trials. With this best hyperparameter configuration, for random seeds, we train the model again and use the validation set for model selection (i.e., we report the test accuracy at the epoch which has the best validation accuracy).
Let H be a matrix containing the hidden states of all the samples from a particular class. The Soft-Rank of matrix H is defined as the sum of the singular values of the matrix divided by the largest singular value. A lower Soft-Rank implies fewer dimensions with substantial variability, and it provides a continuous analogue to the notion of rank from matrix algebra. This provides evidence that the concentration of class-specific states observed when using GraphMix in Figure 3 can be measured directly from the hidden states and is not an artifact of the t-SNE visualization.
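The Soft-Rank definition above translates directly into a few lines of NumPy. The sketch below is illustrative; the function name `soft_rank` and the two toy matrices are assumptions chosen to show the extremes of the measure.

```python
import numpy as np

def soft_rank(H):
    """Sum of the singular values of H divided by its largest singular value."""
    s = np.linalg.svd(H, compute_uv=False)   # singular values only
    return float(s.sum() / s.max())

# All directions equally important: Soft-Rank equals the true rank.
full = soft_rank(np.eye(4))

# Every row is a multiple of the same vector: Soft-Rank is (numerically) 1.
concentrated = soft_rank(np.outer([1.0, 2.0, 3.0], [1.0, 1.0]))
```

For class-specific hidden states, values closer to 1 indicate that the states of a class vary mostly along a single direction.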
A.6 Feature Visualization
We present the 2D visualization of the hidden states learned using GCN and GraphMix(GCN) for Cora, Pubmed and Citeseer datasets in Figure 3. We observe that for Cora and Citeseer, GraphMix learns substantially better hidden states than GCN. For Pubmed, we observe that although there is no clear separation between classes, "Green" and "Red" classes overlap less using the GraphMix, resulting in better hidden states.
A.7 Results on Larger Datasets
In this section, we provide results on three recently proposed datasets which are relatively larger than the standard benchmark datasets (Cora/Citeseer/Pubmed). Specifically, we use the Cora-Full dataset proposed in bojchevski2018deep and the Coauthor-CS and Coauthor-Physics datasets proposed in pitfalls. We took the processed versions of these datasets available at https://github.com/shchur/gnn-benchmark. We did 10 random splits of the data into train/validation/test sets. For the classes which had more than 100 samples, we chose 20 samples per class for training, 30 samples per class for validation and the remaining samples as test data. For the classes which had less than 100 samples, we chose 20% of the samples per class for training, 30% for validation and the remaining for testing. For each split, we ran experiments using 100 random seeds. The statistics of these datasets are presented in Table 9 and the results are presented in Table 8. We observe that GraphMix(GCN) improves the results over GCN for all three datasets. We note that we did minimal hyperparameter search for GraphMix(GCN), as mentioned in Section A.7.1, and a more rigorous hyperparameter search could further improve the performance of GraphMix.
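The per-class splitting rule described above can be sketched as follows. The text does not say how classes with exactly 100 samples are handled or how fractional counts are rounded, so this sketch assumes such classes fall into the percentage-based case and that counts are truncated; the function name `split_by_class` and the toy labels are also illustrative.

```python
import random
from collections import defaultdict

def split_by_class(labels, rng, threshold=100, n_train=20, n_val=30,
                   frac_train=0.2, frac_val=0.3):
    """Split node indices into train/val/test sets, class by class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train, val, test = [], [], []
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        if len(idxs) > threshold:
            k_train, k_val = n_train, n_val          # fixed counts for large classes
        else:
            # Assumption: truncate fractional counts for small classes.
            k_train = int(frac_train * len(idxs))
            k_val = int(frac_val * len(idxs))
        train += idxs[:k_train]
        val += idxs[k_train:k_train + k_val]
        test += idxs[k_train + k_val:]
    return train, val, test

# Toy labels: one class with 150 nodes and one with 50 nodes.
labels = [0] * 150 + [1] * 50
train, val, test = split_by_class(labels, random.Random(0))
```

Calling the function with differently seeded `random.Random` instances would produce the multiple random splits described in the text.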
A.7.1 Hyperparameter Details for Results in Table 8
For all the experiments we use the standard architecture mentioned in Section A.3 and used Adam optimizer with learning rate 0.001 and 64 hidden units in the hidden layer. For Coauthor-CS and Coauthor-Physics, we trained the network for 2000 epochs. For Cora-Full, we trained the network for 5000 epochs because we observed the training loss of Cora-Full dataset takes longer to converge.
For Coauthor-CS and Coauthor-Physics: We set the input layer dropout rate to 0.5 and weight-decay to 0.0005, both for GCN and GraphMix(GCN). We did not conduct any hyperparameter search over the GraphMix hyperparameters , , temperature and number of random perturbations applied to the input data for GraphMix(GCN) for these two datasets, and set these values to , , and respectively.
For Cora-Full dataset: We found an input layer dropout rate of 0.2 and a weight-decay of 0.0 to be best for both GCN and GraphMix(GCN). For GraphMix(GCN) we fixed , temperature and number of random perturbations to and respectively. For , we did search over and found that works best.