New GCNN-Based Architecture for Semi-Supervised Node Classification
Abstract
The nodes of a graph that lie in a common cluster are more likely to connect to each other than to the other nodes of the graph. Consequently, once some node labels are revealed, the structure of the graph (its edges) provides an opportunity to infer information about the remaining nodes. From this perspective, this paper revisits the node classification task in a semi-supervised scenario using a graph convolutional neural network. The goal is to benefit from the flow of information that circulates around the revealed node labels. To this end, this paper proposes a new graph convolutional neural network architecture that efficiently exploits the revealed training nodes, the node features, and the graph structure. On the other hand, in many applications, non-graph observations (side information) exist beside a given graph realization. These non-graph observations are usually independent of the graph structure. This paper shows that the proposed architecture is also powerful in combining a graph realization with independent non-graph observations. For both cases, experiments on synthetic and real-world datasets demonstrate that the proposed architecture achieves higher prediction accuracy than existing state-of-the-art methods for the node classification task.
Keywords: Graph Convolutional Neural Network · Semi-Supervised Classification · Graph Inference
1 Introduction
Node classification in graphs is generally an unsupervised learning task that refers to clustering (grouping) nodes with similar features. Revealing the labels of a small proportion of the nodes turns this unsupervised task into a semi-supervised learning problem. Semi-supervised node classification on a purely graphical observation (a graph realization) has been investigated for real-world networks with a variety of methods; see Section 2 for a brief survey.
Under the transductive semi-supervised learning setting, the goal is to predict the labels of unlabeled nodes given the adjacency matrix of a graph, the feature matrix containing a set of features for all nodes, and a few revealed node labels. Various methods exist for inferring the unlabeled nodes, such as [1, 2]. Most of the prominent existing methods use either graph-based regularization, graph embedding, or graph convolutional neural networks, operating in either the node domain or a spectral domain.
The structure of a graph (its edges) allows a graph convolutional neural network to use a set of fixed training nodes to predict the unlabeled nodes. Increasing the number of fixed training nodes improves the accuracy of the predictions, but in practice only a few training nodes are available. In this paper, we comprehensively investigate how the predicted labels can be effectively involved in the training procedure to increase the prediction accuracy.
On the other hand, in many applications, non-graph observations (side information) exist beside a given graph realization and its node feature matrix. See [3] and the references therein for a brief introduction to the effects of side information on community detection for generative models. In practice, the feature matrix is not independent of the graph structure, while the non-graph observations may be independent. Combining the feature matrix with the non-graph observations is important, especially when the quality of the side information is not known to the estimator.
In this paper, we propose a novel graph convolutional neural network architecture that benefits from the predicted unlabeled nodes to improve the accuracy of prediction. Our proposed architecture is also able to combine the provided side information with the graph structure and its feature matrix; this combination achieves higher accuracy than the existing state-of-the-art methods. To the best of our knowledge, this is the first time that the predicted labels in a graph are revisited by a graph convolutional neural network to improve the accuracy. In addition, this is the first time that the performance of graph convolutional neural networks has been investigated in the presence of independent non-graph observations (side information).
2 Related Work
Graph-based semi-supervised methods are typically classified into explicit and implicit learning methods. In this section, we review related work in both classes, while the focus of this paper is mainly on graph convolutional neural networks, which belong to the latter class.
2.1 Explicit Graph-Based Learning
In graph-based regularization methods, it is assumed that the data samples lie on a low-dimensional manifold, and a regularizer is used to combine the low-dimensional data with the graph. The objective function is a linear combination of a supervised loss on the labeled nodes and a graph-based regularization term, weighted by a hyperparameter that trades off the two terms. The graph Laplacian regularizer is widely used in the literature: a label propagation algorithm based on Gaussian random fields [4], a variant of label propagation [5], a regularization framework relying on local or global consistency [6], manifold regularization [7], a unified optimization framework for smoothing language models on graph structures [8], and deep semi-supervised embedding [9].
Besides graph Laplacian regularization, other methods are based on graph embedding: DeepWalk [10], which uses the neighborhood of nodes to learn embeddings; LINE [11] and node2vec [12], two extensions of DeepWalk that use biased and more complex random-walk algorithms; and Planetoid [13], which uses a random-walk-based sampling algorithm instead of a graph Laplacian regularizer to acquire context information.
2.2 Implicit Graph-Based Learning
Graph convolutional neural networks, an implicit graph-based semi-supervised learning approach, have recently attracted increasing attention. Several graph convolutional neural network methods have been proposed in the literature: a diffusion-based convolution method that produces tensors as inputs for a neural network [14], a scalable and shallow graph convolutional neural network that encodes both the graph structure and the node features [1], a multi-scale graph convolution [15], adaptive graph convolutional networks [16], graph attention networks [2], a variant of attention-based graph neural networks for semi-supervised learning [17], and dual graph convolutional networks [18].
3 Proposed Semi-Supervised Node Classification Architecture
In this section, we start by stating some quick intuitions that clarify how revealing some node labels may help the estimator classify other nodes. We then define the semi-supervised graph convolutional neural network problem, analyze our idea for revealed node labels, and propose our semi-supervised node classification architecture. The section closes with a technique for extracting side information in the proposed architecture from the adjacency matrix.
3.1 Intuition
We start with a simple example to illustrate how revealed node labels may help an estimator predict the labels of unlabeled nodes. Assume that in a given graph with K classes, the labels of all nodes are revealed except for two nodes u and v. The goal is to classify node u. A Bayesian hypothesis testing problem with K hypotheses is considered. Let T be a vector of random variables whose k-th element denotes the number of edges from node u to other nodes with revealed labels in cluster k. Also, let S be a vector whose k-th element denotes the number of edges from node u to other unlabeled nodes (node v in this example) in cluster k. Since the estimator does not know which class node v belongs to, S is also an unknown random vector taking values in a finite set. For node u, we want to infer the true label x_u by observing a realization of T. We then have to select the most likely hypothesis conditioned on T, i.e.,

x̂_u = arg max_{k ∈ {1, …, K}} P(x_u = k | T),
which is the Maximum A Posteriori (MAP) estimator. Let A denote the adjacency matrix of the graph. With no prior distribution on x_u, when node u has no edge to the other unlabeled node v, the MAP estimator is reorganized as

x̂_u = arg max_k P(T | x_u = k),  (1)

which can be solved by pairwise comparisons. When node u does have an edge to v, the likelihood also involves S, whose distribution depends on the unknown label x_v. Assume there exists no prior distribution on x_v. Then the MAP estimator is reorganized as

x̂_u = arg max_k Σ_{k'} P(T, S | x_u = k, x_v = k').  (2)

A comparison between (1) and (2) shows how revealing true node labels reduces the complexity of the optimum estimator.
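As a concrete illustration, under the usual assumption that intra-class edges are more likely than inter-class ones, the pairwise comparisons behind (1) reduce to picking the class whose revealed members the node connects to most. The sketch below uses illustrative names only and is not the paper's implementation:

```python
import numpy as np

def map_like_classify(adj, labels, node):
    """Classify `node` by counting its edges into each labeled cluster.

    Assumes intra-class edges are more likely than inter-class ones,
    so the most-connected cluster is the most likely class. `labels`
    holds -1 for unlabeled nodes. All names are illustrative.
    """
    classes = sorted({c for c in labels if c >= 0})
    # counts[k]: number of edges from `node` to revealed nodes in class k
    counts = {k: sum(adj[node][j] for j in range(len(labels))
                     if labels[j] == k and j != node)
              for k in classes}
    # The pairwise comparison reduces to taking the arg-max of the counts.
    return max(classes, key=lambda k: counts[k])

# Toy graph: nodes 0-2 form one cluster, 3-5 another; node 2 is unlabeled.
adj = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
labels = [0, 0, -1, 1, 1, 1]
print(map_like_classify(adj, labels, 2))  # -> 0 (2 edges to class 0, 1 to class 1)
```

The toy node has two edges into the revealed part of class 0 and one into class 1, so the comparison selects class 0.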
3.2 Problem Definition & Analysis
The focus of this paper is on graph-based semi-supervised node classification. For a given graph with n nodes, let A denote the n × n adjacency matrix and X denote the n × d feature matrix, where d is the number of features. Under a transductive learning setting, the goal is to infer the unknown labels given the adjacency matrix A, the feature matrix X, and m revealed labels (fixed training nodes). Without loss of generality, assume the first m nodes of the graph are the revealed ones. Then x = (x_1, …, x_n) denotes the vector of all node labels (labeled and unlabeled). On the other hand, assume there exists a genie that gives us a vector of side information y of length n such that, conditioned on the true labels, y is independent of the graph edges. Without loss of generality, it is assumed that the entries of y are a noisy version of the true labels. In this paper, we suppose that the feature matrix depends on the graph, conditioned on the true labels. To infer the unlabeled nodes, the Maximum A Posteriori (MAP) estimator for this configuration is

x̂ = arg max_x P(x | A, X, y),

where each node label is drawn uniformly from the set of labels, i.e., there is no prior distribution on node labels. Since y is conditionally independent of (A, X), we are then interested in the optimal solution of the following maximization:

x̂ = arg max_x P(A, X | x) P(y | x).

Assume x̂_1 and x̂_2 are the primal optimal solutions of maximizing P(A, X | x) and P(A, X | x) P(y | x), respectively. Then

P(A, X | x̂_1) ≥ P(A, X | x̂_2)  and  P(A, X | x̂_2) P(y | x̂_2) ≥ P(A, X | x̂_1) P(y | x̂_1),

or equivalently

1 ≤ P(A, X | x̂_1) / P(A, X | x̂_2) ≤ P(y | x̂_2) / P(y | x̂_1).

To squeeze this ratio from above and below, it suffices to provide an algorithm that makes P(y | x̂_1) and P(y | x̂_2) as close as possible by changing the entries of the training labels. Recall the assumption that there exists a genie that provides graph-independent side information. This assumption can be relaxed: the side information can be extracted from either the feature matrix or the adjacency matrix of the graph. Note that extracting side information from both the feature matrix and the adjacency matrix would make the side information completely dependent on both inputs of the graph convolutional neural network.
3.3 Proposed Model
In this paper, the side information is either given directly or generated from the feature matrix or the adjacency matrix. Figure 1 shows our proposed architecture, which consists of three blocks.
The GCN block is a variant of the classical graph convolutional neural network [1] that takes A and X as inputs and returns Z, an n × K matrix, where K is the number of classes. Note that Z_{ik} determines the probability that node i belongs to class k in the graph. The correlated recovery block is applied when the side information is not given directly. In community detection, correlated recovery refers to recovering the node labels better than random guessing. The input of the correlated recovery block is either the feature matrix X or a function of the adjacency matrix A; its output is the side-information vector y of length n. The decision maker decides how to combine the provided side information y with the output Z of the GCN block. The decision maker returns the vector of predicted labels x̂, of length n, and a set I of node indices that are used for defining the loss function. The loss function for this architecture is then defined as

Loss = Σ_{i ∈ I} ℓ(x̂_i, Z_i),

where ℓ is the cross-entropy loss function, and the index i in x̂_i and Z_i refers to the i-th entry of x̂ and the i-th row of Z.
Let t index the epochs during the training procedure, and let t_0 denote the epoch at which the decision maker starts to change the number of training nodes.

Phase (1): when t < t_0, the decision maker embeds the fixed training labels inside the side information y, resulting in y_i = x_i for all i ∈ {1, …, m}. The decision maker returns x̂ and the set of training nodes {1, …, m}.

Phase (2): when t ≥ t_0, the decision maker first embeds the fixed training labels inside the side information y. Then the decision maker uses Z to determine a set of nodes S_1 such that each element of S_1 belongs to a specific class with probability at least τ. Note that τ is a threshold that evaluates the quality of the selected nodes. On the other hand, the decision maker obtains a set of nodes S_2 such that, for each element of S_2, both the corresponding side information and the prediction of the graph convolutional neural network refer to the same class. The new training set is then formed from the fixed training nodes together with S_1 and S_2.
Phase (2) continues until the prediction accuracy on the fixed training nodes is greater than a threshold β; otherwise, the training continues based on the last obtained set. In this procedure, t_0, τ, and β are three hyperparameters that should be tuned.
Assume that at epoch t, the optimal solution of the maximization is x̂(t), which is extracted from Z. The decision maker uses x̂(t) and y to obtain a set of nodes that is used for the next training iteration. Also, since neural networks are robust to noisy labels [19, 20, 21], the selected nodes will have enough quality to be involved in the training process, provided an appropriate value of τ is chosen. Note that the hyperparameter τ determines the quality of the selected nodes. Then at epoch t + 1, the training is based on a new training set which includes the fixed training labels. Let x̂(t+1) be the optimal solution at epoch t + 1. Note that the updated side information is more similar to x̂(t) than the original side information was; hence x̂(t+1) tracks x̂(t) more closely, and the idea follows.
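The two-phase behavior of the decision maker can be sketched as follows. The exact rule for merging the confidence-selected set and the agreement set is not spelled out above, so the combination below (intersect the two sets, then add the fixed training nodes) is one plausible reading, not the authors' implementation; all names are illustrative:

```python
import numpy as np

def decision_maker(probs, side_info, train_idx, train_labels, epoch, t0, tau):
    """Two-phase node selection, a sketch of the decision-maker block.

    probs:      n x K softmax output of the GCN block
    side_info:  length-n vector of side-information labels
    t0, tau:    epoch threshold and confidence threshold (hyperparameters)
    Returns the predicted labels and the node set used for the loss.
    """
    side = side_info.copy()
    side[train_idx] = train_labels          # embed fixed training labels
    pred = probs.argmax(axis=1)
    if epoch < t0:                          # phase (1): fixed training nodes only
        return pred, np.asarray(train_idx)
    # phase (2): nodes the GCN assigns to some class with prob >= tau ...
    confident = np.where(probs.max(axis=1) >= tau)[0]
    # ... and nodes where side information and GCN prediction agree.
    agree = np.where(pred == side)[0]
    selected = np.intersect1d(confident, agree)
    return pred, np.union1d(train_idx, selected)

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]])
side = np.array([0, 1, 1])
pred, nodes = decision_maker(probs, side, [0], np.array([0]),
                             epoch=30, t0=20, tau=0.7)
print(nodes.tolist())  # -> [0, 1]: node 2 is neither confident nor agreeing
```

In phase (2), node 2 is excluded both because its maximum probability (0.55) falls below τ = 0.7 and because its prediction disagrees with the side information.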
3.4 Extracting Side Information
For extracting side information that is as independent as possible of the output of the GCN block, the side information is extracted either from the given feature matrix or from the adjacency matrix of the graph. Define the neighborhood matrix M_r entrywise as

M_r(i, j) = |N_r(i) ∩ N_r(j)|,

where N_r(i) is the set of nodes within distance (radius) r of node i. For extracting side information from the adjacency matrix, a classifier is trained on the neighborhood matrix and the training nodes, where r is a hyperparameter that must be tuned. A similar idea is presented in [22], in which the authors use a variant of neighborhood matrices and solve a set of linear equations to determine theoretically whether a pair of nodes are in the same cluster. On the other hand, for extracting side information from the feature matrix, a classifier is trained directly on the feature matrix and the training nodes.
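A minimal sketch of the neighborhood matrix, assuming (as described in the conclusion) that its entries count the common nodes lying within radius r of a pair of nodes; the elided equation in the original may differ in detail:

```python
import numpy as np

def neighborhood_matrix(adj, r):
    """Common-neighborhood counts within radius r.

    reach[i, j] is 1 when node j lies within graph distance r of node i
    (including i itself); entry (i, k) of the returned matrix counts the
    nodes that the radius-r balls around i and k share.
    """
    n = adj.shape[0]
    reach = np.eye(n, dtype=int)            # start with distance-0 balls
    for _ in range(r):                      # grow every ball one hop at a time
        reach = ((reach + reach @ adj) > 0).astype(int)
    return reach @ reach.T                  # inner products count common nodes

# Path graph 0-1-2-3: with r = 1, nodes 0 and 2 share only node 1.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
M = neighborhood_matrix(adj, r=1)
print(M[0, 2])  # -> 1
```

With r = 1 the ball around node 0 is {0, 1} and around node 2 is {1, 2, 3}, so their overlap has size 1; the matrix is symmetric by construction.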
4 Experiments
The proposed architecture in Section 3 is tested in a number of experiments on synthetic and real-world datasets: semi-supervised document classification on three real citation networks, semi-supervised node classification under stochastic block models with different numbers of classes, and semi-supervised node classification in the presence of noisy-labels side information that is independent of the graph edges, for both the synthetic and real datasets.
4.1 Datasets & Side Information
Citation Networks: Cora, Citeseer, and Pubmed are three common citation networks that have been investigated in previous studies. In these networks, articles are considered as nodes, and article citations determine the edges connected to the corresponding node. Also, a sparse bag-of-words vector, extracted from the title and abstract of each article, is used as the feature vector of that node. Table 1 shows the properties of these real datasets in detail.
Real Dataset  Nodes  Edges  Classes  Features  Training Nodes 

Cora  2,708  5,429  7  1,433  140
Citeseer  3,327  4,732  6  3,703  120
Pubmed  19,717  44,338  3  500  60
Stochastic Block Model (SBM): The stochastic block model is a generative model for random graphs that produces graphs containing clusters. Here, we consider a stochastic block model with n nodes and K classes. Without loss of generality, assume the true label of each node is drawn uniformly from the set {1, …, K}. Under this model, if two nodes belong to the same class, an edge is drawn between them with probability p; otherwise, the nodes are connected with probability q. Table 2 summarizes the properties of the stochastic block models in our experiments, and Figure 2 shows three realizations of the described generative model with the parameters in Table 2. In this paper, a realization of the stochastic block model based on the parameters in Table 2 with K classes is briefly called the SBM dataset.
Synthetic Dataset  Nodes  Classes  p  q  Training Nodes 

SBM 
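An SBM realization with the stated intra-/inter-class edge probabilities can be generated as follows; the parameter values below are placeholders, since the values of Table 2 are not reproduced here:

```python
import numpy as np

def sample_sbm(n, k, p, q, rng=None):
    """Draw an SBM adjacency matrix and its labels.

    Labels are uniform over {0, ..., k-1}; same-class pairs are joined
    with probability p, different-class pairs with probability q.
    """
    rng = np.random.default_rng(rng)
    labels = rng.integers(0, k, size=n)
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p, q)             # per-pair edge probability
    upper = rng.random((n, n)) < prob
    adj = np.triu(upper, 1)                 # keep the strict upper triangle
    adj = (adj | adj.T).astype(int)         # symmetrize; no self-loops
    return adj, labels

# Placeholder parameters; the experiments use the values of Table 2.
adj, labels = sample_sbm(n=200, k=2, p=0.5, q=0.05, rng=0)
```

Sampling only the upper triangle and mirroring it keeps the graph simple and undirected, matching the model description.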
Noisy Labels Side Information: We consider a noisy version of the true label of each node as synthetic side information. This information is given to the decision maker to investigate the effect of a non-graph observation that is completely independent of the graph edges. Under the noisy-labels side information, the decision maker observes the true label of each node with probability α; otherwise, it observes a value drawn uniformly from the incorrect labels.
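The noisy-labels side information can be sketched as below; `alpha` stands in for the paper's noise parameter, whose experimental values are listed with the results:

```python
import numpy as np

def noisy_side_info(labels, k, alpha, rng=None):
    """Noisy-label side information.

    Each node's true label survives with probability alpha; otherwise a
    uniformly random *incorrect* label is reported, matching the model
    described in the text. Names are illustrative.
    """
    rng = np.random.default_rng(rng)
    out = np.array(labels, copy=True)
    flip = rng.random(len(labels)) >= alpha  # nodes whose label is corrupted
    for i in np.where(flip)[0]:
        wrong = [c for c in range(k) if c != labels[i]]
        out[i] = rng.choice(wrong)           # uniform over incorrect labels
    return out

labels = np.zeros(1000, dtype=int)
side = noisy_side_info(labels, k=3, alpha=0.8, rng=0)
```

Because corrupted nodes never report their true label, the fraction of correct entries concentrates around alpha.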
4.2 Experimental Settings
For the GCN block in Figure 1, a two-layer graph convolutional neural network is trained with ReLU and softmax activation functions at the hidden and output layers, respectively. For the real datasets, we exactly follow the data splits in [1], including 20 nodes per class for training, 500 nodes for validation, and 1,000 nodes for testing. For the SBM datasets, we follow a similar data split, randomly selecting a fixed number of nodes per class for training together with separate validation and test sets. The weights of the neural networks are initialized with the Glorot initialization [23]. The Adam optimizer [24] with specific learning rates for phase (1) and phase (2) is applied, and the cross-entropy loss is used for all datasets. Table 3 summarizes the hyperparameter values picked for each dataset in the experiments.
Hyperparameters  Cora  Citeseer  Pubmed  SBM 
Neurons  
Maximum Epochs  
L2 Regularization Factor  
Learning Rate for Phase 1  
Learning Rate for Phase 2  
Correlated Recovery Input(s)  
Correlated Recovery Classifier  GBC  GBC  GBC  GCNN 
Throughout this paper, a gradient boosting classifier (GBC) and a graph convolutional neural network (GCNN) classifier are used as the correlated recovery classifier for the real and synthetic datasets, respectively.
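The forward pass of the two-layer GCN block of [1], with ReLU and softmax activations as described above, can be sketched as follows; the weight matrices here are random placeholders rather than trained values:

```python
import numpy as np

def gcn_forward(adj, X, W0, W1):
    """Two-layer GCN forward pass of [1]:
    softmax(A_hat @ relu(A_hat @ X @ W0) @ W1),
    where A_hat = D^{-1/2} (A + I) D^{-1/2} is the renormalized adjacency.
    """
    A = adj + np.eye(adj.shape[0])          # add self-loops
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(d, d))     # symmetric renormalization
    H = np.maximum(A_hat @ X @ W0, 0.0)     # hidden layer with ReLU
    Z = A_hat @ H @ W1                      # output-layer logits
    Z = np.exp(Z - Z.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True) # row-wise softmax

rng = np.random.default_rng(0)
adj = (rng.random((5, 5)) < 0.4).astype(int)
adj = np.triu(adj, 1)
adj = adj + adj.T                           # toy undirected graph
X = rng.standard_normal((5, 3))
probs = gcn_forward(adj, X,
                    rng.standard_normal((3, 4)),   # placeholder W0
                    rng.standard_normal((4, 2)))   # placeholder W1
```

Each row of the output sums to one and plays the role of the class-probability matrix Z consumed by the decision maker.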
4.3 Baselines
For the synthetic dataset, either with or without synthetic side information, the proposed architecture is compared with the architecture in [1]. When synthetic side information is not available, our architecture extracts the side information via correlated recovery. For the real datasets, the architecture is compared with several state-of-the-art methods, listed in Table 7, including graph Laplacian regularized methods [25, 4, 6, 13] and deep graph embedding methods [2, 18, 26, 15]. The comparisons are based on the prediction accuracy reported in each paper for each dataset.
5 Results
In this section, we report the average prediction accuracy on the test set for the proposed architecture over repeated runs with random initializations for each dataset.
Table 4 compares the prediction accuracy of various classifiers in the correlated recovery block of Figure 1. In Table 4, for each dataset, either the neighborhood matrix or the feature matrix is considered as the classifier input. For each classifier and each dataset, the radius r and the other classifier hyperparameters have been chosen to maximize the accuracy on the validation set. Note that for the SBM datasets the feature matrix does not exist, i.e., it is set to the identity matrix, so side information extracted only from the feature matrix is not reliable.
Classifier  Input(s)  Cora  Citeseer  Pubmed  SBM  

Neural Network  
Neural Network  
Gradient Boosting  
Gradient Boosting  
Graph Convolution Network  
Graph Convolution Network 
Table 5 summarizes the results comparing the proposed method with GCN [1] on both real and synthetic datasets. The results show that, even without independent side information, the proposed method outperforms the classical GCN because it benefits from the extracted side information. Table 5 also compares the quality of the extracted side information against synthetic noisy-labels side information with various noise parameters.
Method  Synthetic Side Information  Cora  Citeseer  Pubmed  SBM  

GCN [1]  without  
Active GCN (ours)  without  
Active GCN (ours)  
Active GCN (ours)  
Active GCN (ours) 
Note that in Table 5 the synthetic side information is not combined with the feature matrix, because the quality of the side information is assumed to be unknown. If the synthetic side information has acceptable quality, it can be embedded in the feature matrix; this embedding improves the accuracy of both the classical GCN and the proposed architecture. But if the side information lacks sufficient quality, the embedding dramatically reduces the accuracy of both methods. Considering this fact, Table 6 shows the results when the synthetic side information is combined with the feature matrix for both the classical GCN and the proposed architecture. We then need to create a new feature matrix by combining the side information with the feature matrix. For the real datasets, the new feature matrix is created by stacking the one-hot representation of the synthetic side information onto the given feature matrix. For the synthetic datasets, the one-hot representation of the side information is used as the new feature matrix in place of the identity matrix.
Method  Synthetic Side Information  Cora  Citeseer  Pubmed  SBM  

GCN [1]  
Active GCN (ours)  
GCN [1]  
Active GCN (ours)  
GCN [1]  
Active GCN (ours) 
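The feature-matrix combination used for Table 6 (stacking a one-hot encoding of the side information onto the feature matrix for the real datasets) can be sketched as follows; function and variable names are illustrative:

```python
import numpy as np

def stack_side_info(X, side_info, k):
    """Append a one-hot encoding of the side information to the features.

    For the real datasets the one-hot block is stacked onto X; for the
    synthetic datasets (where X is the identity) the one-hot block alone
    would replace X.
    """
    one_hot = np.eye(k)[side_info]          # n x k one-hot matrix
    return np.hstack([X, one_hot])

X = np.random.default_rng(0).random((4, 3))
side = np.array([0, 2, 1, 2])
X_new = stack_side_info(X, side, k=3)       # shape (4, 3 + 3)
```

Advanced integer indexing into the identity matrix is a compact way to build the one-hot block without an explicit loop.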
Finally, the accuracy of the proposed architecture is compared with the reported accuracy of several state-of-the-art methods; the results are summarized in Table 7. The proposed architecture achieves higher accuracy than all of these methods on the Cora, Citeseer, and Pubmed datasets. The results support the idea proposed in Section 3: revealing more node labels, and thereby letting the nodes of the graph access more information about one another, improves the prediction accuracy.
Method  Cora  Citeseer  Pubmed 

Modularity Clustering [25]  
SemiEmb [9]  
DeepWalk [10]  
Gaussian Fields [4]  
Graph Embedding (Planetoid) [13]  
DCNN [14]    
GCN [1]  
MoNet [27]    
N-GCN [15]  
GAT [2]  
AGNN [17]  
TAGCN [26]  
DGCN [18]  
LSM-GAT [28]  
SBM-GCN [28]  
Active GCN (ours)  84.7  74.8  81.0 
6 Conclusion & Future Work
In this paper, we proposed a novel architecture for the semi-supervised node classification task. We introduced a new metric, based on the adjacency matrix, that counts the common nodes lying within a specific distance of a pair of nodes. The architecture provides more information to the graph convolutional neural network for estimating the unknown labels, using the adjacency matrix, the feature matrix, and the revealed labels to train the model. We also showed how the proposed architecture outperforms the basic GCN [1] in the presence of both a graph realization and graph-independent side information.
One of the main contributions of this paper is the idea of revealing the labels of some nodes to the other nodes and using the graph structure to recover the labels of the unlabeled nodes. We mainly investigated this idea on the classical GCN [1]; it remains an open problem to investigate it with other basic semi-supervised classification methods. Also, investigating the role of graph-independent side information (especially in a general form consisting of multiple features with finite cardinality) is still open for other state-of-the-art methods.
Footnotes
 The code is available at https://github.com/mohammadesmaeili/GCNN.
References
 Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
 Hussein Saad and Aria Nosratinia. Community detection with side information: Exact recovery under the stochastic block model. IEEE Journal of Selected Topics in Signal Processing, 12(5):944–958, 2018.
 Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.
 Partha Pratim Talukdar and Koby Crammer. New regularized algorithms for transductive learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 442–457. Springer, 2009.
 Dengyong Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in neural information processing systems, pages 321–328, 2004.
 Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7(Nov):2399–2434, 2006.
 Qiaozhu Mei, Duo Zhang, and ChengXiang Zhai. A general optimization framework for smoothing language models on graph structures. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 611–618, 2008.
 Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
 Bryan Perozzi, Rami AlRfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
 Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077, 2015.
 Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864, 2016.
 Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.
 James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
 Sami Abu-El-Haija, Amol Kapoor, Bryan Perozzi, and Joonseok Lee. N-GCN: Multi-scale graph convolution for semi-supervised node classification. arXiv preprint arXiv:1802.08888, 2018.
 Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Kiran K. Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735, 2018.
 Chenyi Zhuang and Qiang Ma. Dual graph convolutional networks for graph-based semi-supervised classification. In Proceedings of the 2018 World Wide Web Conference, pages 499–508, 2018.
 David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.
 Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in neural information processing systems, pages 10456–10465, 2018.
 Aritra Ghosh, Himanshu Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Emmanuel Abbe and Colin Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 670–688. IEEE, 2015.
 Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
 Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Gorke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. On modularity clustering. IEEE transactions on knowledge and data engineering, 20(2):172–188, 2007.
 Jian Du, Shanghang Zhang, Guanhang Wu, José MF Moura, and Soummya Kar. Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370, 2017.
 Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5115–5124, 2017.
 Jiaqi Ma, Weijing Tang, Ji Zhu, and Qiaozhu Mei. A flexible generative framework for graph-based semi-supervised learning. In Advances in Neural Information Processing Systems, pages 3276–3285, 2019.