New GCNN-Based Architecture for Semi-Supervised Node Classification
Nodes in the same cluster of a graph are more likely to connect to each other than to nodes outside the cluster. Consequently, once information about some nodes is revealed, the structure of the graph (the graph edges) makes it possible to learn more about the remaining nodes. From this perspective, this paper revisits the node classification task in a semi-supervised scenario using graph convolutional neural networks. The goal is to benefit from the flow of information that circulates around the revealed node labels. To this end, this paper proposes a new graph convolutional neural network architecture that efficiently exploits the revealed training nodes, the node features, and the graph structure. In many applications, non-graph observations (side information) exist alongside a given graph realization, and these observations are usually independent of the graph structure. This paper shows that the proposed architecture is also powerful in combining a graph realization with independent non-graph observations. In both cases, experiments on synthetic and real-world datasets demonstrate that the proposed architecture achieves higher prediction accuracy than existing state-of-the-art methods for the node classification task.
Keywords: Graph Convolutional Neural Network, Semi-Supervised Classification, Graph Inference
1 Introduction
Node classification in graphs is generally an unsupervised learning task that amounts to clustering (grouping) nodes with similar features. Revealing the labels of a small proportion of nodes transforms this unsupervised task into a semi-supervised learning problem. Semi-supervised node classification on a purely graphical observation (a graph realization) has been studied on real-world networks with a variety of methods; see Section 2 for a brief survey.
Under the transductive semi-supervised learning setting, the goal is to predict the labels of unlabeled nodes given the adjacency matrix of a graph, the feature matrix containing a set of features for all nodes, and a few revealed node labels. There exist various methods for inferring the unlabeled nodes such as [1, 2]. Most of the prominent existing methods use either graph-based regularization, graph embedding, or graph convolutional neural networks in a node domain or a spectral domain.
The structure of a graph (graph edges) allows a graph convolutional neural network to use a fixed set of training nodes to predict the unlabeled nodes. Increasing the number of training nodes improves the accuracy of the predictions, but in practice only a few training nodes are available. In this paper, we comprehensively investigate how the predicted labels can be effectively involved in the training procedure to increase the prediction accuracy.
On the other hand, in many applications, non-graph observations (side information) exist alongside a given graph realization and its node feature matrix. See [3] and references therein for a brief introduction to the effects of side information on community detection for generative models. In practice, the feature matrix is not independent of the graph structure, while the non-graph observations may be independent. Combining the feature matrix with the non-graph observations is important, especially when the quality of the side information is not known to the estimator.
In this paper, we propose a novel graph convolutional neural network architecture that benefits from the predicted unlabeled nodes and improves the accuracy of prediction. Our proposed architecture is also able to combine the provided side information with the graph structure and its feature matrix. This combination achieves higher accuracy in comparison to the existing state-of-the-art methods. To the best of our knowledge, this is the first time that the predicted labels in a graph are revisited by a graph convolutional neural network to improve the accuracy. In addition, this is the first time that the performance of graph convolutional neural networks has been investigated in the presence of independent non-graph observations (side information).
2 Related Work
Graph-based semi-supervised methods are typically classified into explicit and implicit learning methods. In this section, we review the related work in both classes while the focus of this paper is mainly on the graph convolutional neural network which belongs to the latter.
2.1 Explicit Graph-Based Learning
Graph-based regularization methods assume that the data samples lie on a low-dimensional manifold and use a regularizer to combine the low-dimensional data with the graph. The objective function is a linear combination of a supervised loss on the labeled nodes and a graph-based regularization term weighted by a hyperparameter, which trades off the supervised loss against the regularization term. The graph Laplacian regularizer is widely used in the literature: a label propagation algorithm based on Gaussian random fields [4], a variant of label propagation [5], a regularization framework relying on local and global consistency [6], manifold regularization [7], a unified optimization framework for smoothing language models on graph structures [8], and deep semi-supervised embedding [9].
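As an illustration of the objective described above, the following Python sketch evaluates a loss of the form "supervised loss plus a hyperparameter times a graph Laplacian smoothness term" for scalar node scores. The function names, the squared-error supervised loss, and the toy graph are assumptions made for illustration and are not taken from any of the cited methods.

```python
def laplacian_regularizer(edges, scores):
    """Sum of squared score differences over undirected edges (equals f^T L f)."""
    return sum((scores[i] - scores[j]) ** 2 for i, j in edges)


def regularized_loss(edges, scores, labels, lam):
    """Supervised squared error on labeled nodes plus lam times the smoothness term.

    `labels` maps a labeled node index to its target score; all other nodes
    are unlabeled and only enter through the regularizer.
    """
    supervised = sum((scores[i] - y) ** 2 for i, y in labels.items())
    return supervised + lam * laplacian_regularizer(edges, scores)


# Toy path graph 0 - 1 - 2 with only the end nodes labeled (hypothetical data).
edges = [(0, 1), (1, 2)]
labels = {0: 0.0, 2: 1.0}
smooth = [0.0, 0.5, 1.0]   # scores vary gradually along the path
rough = [0.0, 5.0, 1.0]    # the unlabeled node disagrees with both neighbors
loss_smooth = regularized_loss(edges, smooth, labels, lam=1.0)
loss_rough = regularized_loss(edges, rough, labels, lam=1.0)
```

Both score vectors fit the two labeled nodes exactly, so the difference between the two losses comes entirely from the regularization term, which penalizes the assignment that changes abruptly across edges.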
Besides graph Laplacian regularization, there exist other methods based on graph embedding: DeepWalk [10], which uses the neighborhood of nodes to learn embeddings; LINE [11] and node2vec [12], two extensions of DeepWalk that use a biased and complex random walk algorithm; and Planetoid [13], which uses a random walk-based sampling algorithm instead of a graph Laplacian regularizer to acquire context information.
2.2 Implicit Graph-Based Learning
Graph convolutional neural networks have recently attracted increasing attention as an implicit graph-based semi-supervised learning method. Several graph convolutional neural network methods have been proposed in the literature: a diffusion-based convolution method that produces tensors as inputs for a neural network [14], a scalable and shallow graph convolutional neural network that encodes both the graph structure and the node features [1], multi-scale graph convolution [15], adaptive graph convolutional networks [16], graph attention networks [2], a variant of attention-based graph neural networks for semi-supervised learning [17], and dual graph convolutional networks [18].
3 Proposed Semi-Supervised Node Classification Architecture
In this section, we start by stating some quick intuitions to clarify how revealing some node labels may help the estimator to classify other nodes. We define the graph convolutional neural network semi-supervised problem and analyze our idea for revealed node labels. Then we propose our semi-supervised node classification architecture. This section is finished by providing a technique for extracting side information in the proposed architecture based on the adjacency matrix.
We start by a simple example to illustrate how revealed node labels may help an estimator to predict the labels of unlabeled nodes. Assume in a given graph with classes, the labels of all nodes are revealed except for two nodes and . The goal is to classify node . A Bayesian hypothesis testing problem with hypotheses is considered. Let be a vector of random variables such that its -th element denotes the number of edges from node to other nodes with revealed labels in the cluster . Also, let be a vector whose -th element denotes the number of edges from node to other unlabeled nodes (node in this example) in the cluster . Since the estimator does not know that node belongs to which class, is also an unknown random variable. The random variable takes the values in the set . For node , we want to infer the value of by observing a realization of . Then we have to select the most likely hypothesis conditioned on , i.e.,
which is the Maximum A Posteriori (MAP) estimator. Let denote the adjacency matrix of the graph. With no prior distribution on , when , the MAP estimator is reorganized as
which can be solved by pairwise comparisons. When ,
where . Assume there exists no prior distribution on . Then the MAP estimator is reorganized as
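The simplest case of the example above, in which every other label is revealed and there is no prior on the class of the query node, can be sketched in Python as follows. The sketch assumes an SBM-like model in which the query node connects to each same-class node with probability `p` and to each other node with probability `q`; the names `map_class`, `edge_counts`, `cluster_sizes`, `p`, and `q` are hypothetical and do not come from the text. It does not cover the case in which some neighbors are themselves unlabeled.

```python
import math


def map_class(edge_counts, cluster_sizes, p, q):
    """Return the class maximizing the likelihood of the observed edge counts.

    edge_counts[j]  : edges from the query node to labeled nodes in cluster j
    cluster_sizes[j]: number of labeled nodes in cluster j
    With a uniform prior, the MAP decision is the maximum-likelihood one.
    """
    def log_likelihood(k):
        total = 0.0
        for j, (d, n) in enumerate(zip(edge_counts, cluster_sizes)):
            prob = p if j == k else q  # edge probability under hypothesis k
            # Binomial coefficients are identical for every k and are omitted.
            total += d * math.log(prob) + (n - d) * math.log(1.0 - prob)
        return total

    return max(range(len(edge_counts)), key=log_likelihood)


# Query node with 6 edges into cluster 0, 1 into cluster 1, 2 into cluster 2.
predicted = map_class([6, 1, 2], [20, 20, 20], p=0.3, q=0.05)
```

With equally sized clusters and `p > q`, this rule reduces to picking the cluster that receives the most edges from the query node, which matches the intuition that revealed neighbors carry information about an unlabeled node.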
3.2 Problem Definition & Analysis
The focus of this paper is on the graph-based semi-supervised node classification. For a given graph with nodes, let denote an adjacency matrix and denote an feature matrix, where is the number of features. Under a transductive learning setting, the goal is to infer unknown labels , given the adjacency matrix , the feature matrix , and revealed labels denoted by (fixed training nodes). Without loss of generality, assume the first nodes of the graph are the revealed labels. Then denotes the vector of all node labels (labeled and unlabeled nodes). On the other hand, assume there exists a genie that gives us a vector of side information with length such that conditioned on the true labels, is independent of the graph edges. Without loss of generality, it is assumed that the entries of are a noisy version of the true labels. In this paper, we suppose that the feature matrix depends on the graph, conditioned on the true labels. To infer the unlabeled nodes, the Maximum A Posteriori (MAP) estimator for this configuration is
where is drawn uniformly from the set of labels, i.e., there is no prior distribution on node labels. Then we are interested in the optimal solution of the following maximization:
Assume and are the primal optimal solutions of maximizing and , respectively. Then,
For squeezing from above and below, it suffices to provide an algorithm to make and as close as possible by changing the entries of training labels . Recall the assumption that there exists a genie that provides an independent graph side information. This assumption can be relaxed and the side information can be extracted from either the feature matrix or the adjacency matrix of a graph. Note that extracting side information from both the feature matrix and the adjacency matrix makes the side information completely dependent on both inputs of a graph convolutional neural network.
3.3 Proposed Model
In this paper, the side information either is given directly or generated from the feature or the adjacency matrix. Figure 1 shows our proposed architecture with three blocks.
The GCN block is a variant of classical graph convolutional neural network  that takes and as inputs and returns which is an matrix, where is the number of classes. Note that determines the probability that the node belongs to the class in the graph. The correlated recovery block is applied when the side information is not given directly. In community detection, the correlated recovery refers to the recovering of node labels better than random guessing. The input of the correlated recovery is either the feature matrix or a function of the adjacency matrix . The output of the correlated recovery block is which is a vector with length . The decision maker decides how to combine the provided side information and the output of the GCN block . The decision maker returns the predicted labels which is a vector with length and a set of node indices that are used for defining the loss function. Then the loss function for this architecture is defined as
where is the cross-entropy loss function, and the index in and refers to -th entry.
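A minimal Python sketch of a loss of this kind is given below: cross-entropy between the GCN output and the labels returned by the decision maker, accumulated only over the node indices returned by the decision maker. The function name, the argument names, and the choice of an unnormalized sum are assumptions for illustration.

```python
import math


def masked_cross_entropy(probs, labels, indices):
    """Sum of cross-entropy terms over the selected node indices only.

    probs[i][k] : predicted probability that node i belongs to class k
    labels[i]   : label used as the training target for node i
    indices     : node indices that contribute to the loss
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for i in indices:
        total -= math.log(probs[i][labels[i]] + eps)
    return total


# Three nodes, two classes; only nodes 0 and 1 are used for training.
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 1, 0]
loss = masked_cross_entropy(probs, labels, indices=[0, 1])
```

Node 2 does not influence the loss because its index is not in the selected set, even though a target label is stored for it.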
Let index the epochs during the training procedure and denote the epoch in which the decision maker starts to make a change in the number of training nodes.
Phase (1): when , the decision maker embeds the fixed training labels inside the side information , resulting in for all . The decision maker returns and the set of training nodes .
Phase (2): when , the decision maker first embeds the fixed training labels inside the side information . Then the decision maker uses and determines a set of nodes such that each element of belongs to a specific class with a probability at least . Note that is a threshold that evaluates the quality of the selected nodes. On the other hand, the decision maker obtains a set of nodes such that for each element of both the corresponding side information and the prediction of the graph convolutional neural network refer to the same class. Then,
Phase (2) continues until the prediction accuracy for the fixed training nodes is greater than ; otherwise, the training continues based on the last obtained set . In this procedure, , , and are three hyperparameters that should be tuned.
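The node selection performed by the decision maker in phase (2) can be sketched in Python as follows. Because the exact rule for combining the two node sets is not reproduced here, the sketch assumes that a node is selected when it is both confidently predicted and consistent with the side information, and that the fixed training nodes are always kept; the function and argument names are hypothetical.

```python
def select_training_nodes(probs, side_info, fixed_labels, threshold):
    """Return (targets, indices) for the next training iteration.

    probs[i][k]  : GCN output probability that node i belongs to class k
    side_info[i] : class suggested by the side information for node i
    fixed_labels : dict mapping fixed training node index -> revealed label
    threshold    : minimum GCN confidence for a node to be selected
    """
    # Embed the fixed training labels inside the side information.
    targets = list(side_info)
    for i, y in fixed_labels.items():
        targets[i] = y

    selected = set(fixed_labels)  # fixed training nodes are always kept
    for i, row in enumerate(probs):
        predicted = max(range(len(row)), key=row.__getitem__)
        confident = row[predicted] >= threshold
        agrees = predicted == targets[i]
        if confident and agrees:  # assumed rule: intersection of both sets
            selected.add(i)
    return targets, sorted(selected)


probs = [[0.6, 0.4], [0.95, 0.05], [0.1, 0.9], [0.97, 0.03]]
side_info = [1, 0, 1, 1]
targets, indices = select_training_nodes(probs, side_info, {0: 0}, threshold=0.9)
```

In this toy run, node 3 is predicted with high confidence but contradicts its side information, so it is left out of the next training set.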
Assume at epoch , the optimal solution for maximizing is which is extracted from . The decision maker uses and to obtain a set of nodes that is used for the next training iteration. Also, since the neural networks are robust to the noisy labels [19, 20, 21], the selected nodes will have enough quality to be involved in the training process by choosing an appropriate value for . Note that the hyperparameter determines the quality of the selected nodes. Then at epoch , the training is based on a new training set which includes the fixed training labels in . Let be the optimal solution for maximizing . Note that the side information is more similar to than . Then is more similar to than and the idea follows.
3.4 Extracting Side Information
For extracting side information that is as much as possible independent from the output of the GCN block, the side information is extracted either from the given feature matrix or the adjacency matrix of the graph. Define the -neighborhood matrix as
where is the set of nodes that are in a distance with radius of node . For extracting side information from the adjacency matrix, a classifier is trained on the -neighborhood matrix and the training nodes, where is a hyperparameter that must be tuned. A similar idea is presented in [22], in which the authors use a variant of -neighborhood matrices and solve a set of linear equations to theoretically determine whether a pair of nodes are in the same cluster or not. On the other hand, for extracting side information from the feature matrix, a classifier is trained directly on the feature matrix and the training nodes.
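One plausible reading of the neighborhood matrix, consistent with the description in the conclusion of a metric counting the nodes that a pair of nodes have in common within a specific distance, is sketched below in Python. The exact definition is an assumption: entry (i, j) is taken to be the number of nodes within distance `r` of both node i and node j, with each node counted as being within distance zero of itself. The function names are hypothetical.

```python
from collections import deque


def ball(adj, source, r):
    """Nodes within graph distance r of `source` (the source itself included)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not expand beyond radius r
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def r_neighborhood_matrix(adj, r):
    """Entry (i, j) counts the nodes lying within distance r of both i and j."""
    balls = [ball(adj, i, r) for i in range(len(adj))]
    return [[len(bi & bj) for bj in balls] for bi in balls]


# Path graph 0 - 1 - 2 - 3 given as adjacency lists.
adj = [[1], [0, 2], [1, 3], [2]]
M = r_neighborhood_matrix(adj, r=1)
```

Each row of such a matrix can then serve as the input feature vector of a node for the classifier in the correlated recovery block.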
4 Experiments
The architecture proposed in Section 3 is evaluated in a number of experiments on synthetic and real-world datasets: semi-supervised document classification on three real citation networks, semi-supervised node classification under stochastic block models with different numbers of classes, and semi-supervised node classification in the presence of noisy labels side information that is independent of the graph edges, for both the synthetic and real datasets.
4.1 Datasets & Side Information
Citation Networks: Cora, Citeseer, and Pubmed are three common citation networks that have been investigated in previous studies. In these networks, articles are considered as nodes. The article citations determine the edges connected to the corresponding node. Also, a sparse bag-of-words vector, extracted from the title and the abstract of each article, is used as a vector of features for that node. Table 1 shows the properties of these real datasets in detail.
|Real Dataset||Nodes||Edges||Classes||Features||Training Nodes|
Stochastic Block Model (SBM): The stochastic block model is a generative model for random graphs which produces graphs containing clusters. Here, we consider a stochastic block model with nodes and classes. Without loss of generality, assume the true label for each node is drawn uniformly from the set . Under this model, if two nodes belong to the same class, then an edge is drawn between them with probability p; otherwise, these nodes are connected to each other with probability q. Table 2 summarizes the properties of the stochastic block models in our experiments. Also, Figure 2 shows three realizations of the described generative model with the parameters in Table 2. In this paper, a realization of the stochastic block model, based on the parameters in Table 2 with classes, is briefly called -SBM dataset.
|Synthetic Dataset||Nodes||Classes||p||q||Training Nodes|
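The generative model described above can be sketched in Python as follows. The function name and the parameter values in the example are hypothetical and are not the values used in Table 2.

```python
import random


def sample_sbm(n, k, p, q, seed=0):
    """Sample labels and an undirected edge list from a symmetric SBM."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in range(n)]  # uniform class per node
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Same-class pairs connect with probability p, others with q.
            prob = p if labels[i] == labels[j] else q
            if rng.random() < prob:
                edges.append((i, j))
    return labels, edges


labels, edges = sample_sbm(n=200, k=3, p=0.3, q=0.02, seed=1)
within = sum(labels[i] == labels[j] for i, j in edges)
```

When `p` is much larger than `q`, most sampled edges join nodes of the same class, which is the cluster structure that the node classification task exploits.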
Noisy Labels Side Information: We consider a noisy version of the true label of each node as synthetic side information. This information is given to the decision maker to investigate the effect of a non-graph observation that is completely independent of the graph edges. Under the noisy labels side information, the decision maker observes the true label of each node with probability ; otherwise, the decision maker observes a value drawn uniformly from the incorrect labels.
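A short Python sketch of this noise model is shown below. The function name and the parameter name `keep_prob` (the probability of observing the true label) are assumptions for illustration.

```python
import random


def noisy_labels(true_labels, k, keep_prob, seed=0):
    """Keep each label with probability keep_prob, else pick a wrong one uniformly."""
    rng = random.Random(seed)
    observed = []
    for y in true_labels:
        if rng.random() < keep_prob:
            observed.append(y)
        else:
            wrong = [c for c in range(k) if c != y]  # the k - 1 incorrect labels
            observed.append(rng.choice(wrong))
    return observed


truth = [0, 1, 2, 1, 0, 2]
side_info = noisy_labels(truth, k=3, keep_prob=0.7, seed=42)
```

Because the noise is drawn independently for each node and depends only on the true label, the resulting side information is independent of the graph edges conditioned on the true labels.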
4.2 Experimental Settings
For the GCN block in Figure 1, a two-layer graph convolutional neural network is trained with ReLU and softmax activation functions at the hidden and output layers, respectively. For real datasets, we follow exactly the same data splits as in  including nodes per class for training, nodes for validation, and nodes for testing. For -SBM datasets, we follow a data split similar to the one used for the real datasets: we randomly select nodes per class for training, nodes for validation, and nodes for testing. The weights of the neural networks are initialized by the Glorot initialization in [23]. The Adam [24] optimizer is applied with specific learning rates for phase (1) and phase (2). The cross-entropy loss is used for all datasets. Table 3 summarizes the hyperparameter values chosen for each dataset in the experiments.
|L2 Regularization Factor|
|Learning Rate for Phase 1|
|Learning Rate for Phase 2|
|Correlated Recovery Input(s)|
|Correlated Recovery Classifier||GBC||GBC||GBC||GCNN|
Throughout this paper, a gradient boosting classifier (GBC) and a graph convolutional neural network (GCNN) classifier are used as the classifier in the correlated recovery block for the real and synthetic datasets, respectively.
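The forward pass of a two-layer graph convolutional network with ReLU and softmax activations and Glorot-initialized weights, as described in the experimental settings, can be sketched with NumPy as follows. The symmetric normalization with self-loops follows the commonly used GCN formulation and is an assumption here, since the GCN block is described only as a variant of the classical network; the layer sizes and the toy graph are also hypothetical.

```python
import numpy as np


def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def glorot(rng, fan_in, fan_out):
    """Glorot (Xavier) uniform initialization."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def gcn_forward(A, X, W0, W1):
    """Two-layer GCN: softmax(A_norm @ relu(A_norm @ X @ W0) @ W1)."""
    A_norm = normalize_adjacency(A)
    H = np.maximum(A_norm @ X @ W0, 0.0)            # hidden layer with ReLU
    logits = A_norm @ H @ W1
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)     # row-wise softmax


rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)                   # identity features for a featureless graph
W0 = glorot(rng, 4, 8)          # 8 hidden units (illustrative choice)
W1 = glorot(rng, 8, 3)          # 3 classes (illustrative choice)
P = gcn_forward(A, X, W0, W1)   # one row of class probabilities per node
```

Each row of `P` is a probability vector over the classes for one node; training would then minimize the cross-entropy between these rows and the selected training labels.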
For the synthetic datasets, either with or without the synthetic side information, the proposed architecture is compared with the architecture in [1]. When the synthetic side information is not available, our architecture relies on correlated recovery to extract the side information. For the real datasets, the architecture is compared with several state-of-the-art methods, listed in Table 7, including graph Laplacian regularized methods [25, 4, 6, 13] and deep graph embedding methods [2, 18, 26, 15]. The comparisons are based on the prediction accuracy reported in each paper for each dataset.
In this section, we report the average prediction accuracy of the proposed architecture on the test set, obtained over repeated runs with random initializations for each dataset.
Table 4 compares the prediction accuracy of various classifiers in the correlated recovery block in Figure 1. In Table 4, for each dataset, either the -neighborhood matrix or the feature matrix is considered as the classifier input. For each classifier and each dataset, and other classifier hyperparameters have been chosen appropriately to maximize the accuracy on the validation set. Note that for -SBM datasets the feature matrix does not exist, i.e., . Then the extracted side information only based on the feature matrix is not reliable.
|Graph Convolution Network|
|Graph Convolution Network|
Table 5 compares the proposed method with the GCN [1] on both real and synthetic datasets. The results show that, without independent side information, the proposed method outperforms the traditional GCN method because it benefits from the extracted side information. Table 5 also compares the quality of the extracted side information with that of synthetic noisy labels side information for various noise parameters .
|Method||Synthetic Side Information||Cora||Citeseer||Pubmed||-SBM|
|Active GCN (ours)||without|
|Active GCN (ours)|
|Active GCN (ours)|
|Active GCN (ours)|
Note that in Table 5, the synthetic side information is not combined with the feature matrix because it is assumed that the quality of the side information is unknown. If the synthetic side information has enough and acceptable quality, it can be embedded in the feature matrix. This embedding improves the accuracy of both the classical GCN and the proposed architecture. But if the side information does not have enough quality, embedding reduces the accuracy of both methods dramatically. Considering this fact, Table 6 shows the results when the synthetic side information is combined with the feature matrix for both classical GCN and the proposed architecture. Then we need to create a new feature matrix by combining the side information with the feature matrix . Therefore, for real datasets, the new feature matrix is created by stacking the one-hot representation of synthetic side information to the given feature matrix. Also, for synthetic datasets, the one-hot representation of side information is used as a newly created feature matrix instead of the identity matrix.
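The construction of the new feature matrix described above can be sketched in Python as follows. The function names and the toy data are hypothetical.

```python
def one_hot(label, k):
    """One-hot encode a class index as a list of length k."""
    return [1.0 if c == label else 0.0 for c in range(k)]


def augment_features(X, side_info, k):
    """Append the one-hot side information to each node's feature row."""
    return [list(row) + one_hot(s, k) for row, s in zip(X, side_info)]


X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]   # given features (real datasets)
side_info = [2, 0, 1]                      # side information, one class per node
X_new = augment_features(X, side_info, k=3)

# For a synthetic graph without features, the one-hot side information
# replaces the identity matrix as the feature matrix.
X_synthetic = [one_hot(s, 3) for s in side_info]
```

With this construction, the number of feature columns grows by the number of classes for real datasets, while for synthetic datasets the feature matrix has exactly one column per class.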
|Method||Synthetic Side Information||Cora||Citeseer||Pubmed||-SBM|
|Active GCN (ours)|
|Active GCN (ours)|
|Active GCN (ours)|
Finally, the accuracy of the proposed architecture is compared with the reported accuracy of several state-of-the-art methods. The results are summarized in Table 7. The proposed architecture achieves higher accuracy in comparison to all existing methods for Cora, Citeseer, and Pubmed datasets. The results verify the proposed idea in Section 3 that improves the prediction accuracy by revealing more node labels and allowing the nodes of a graph to access more information about the other nodes.
|Modularity Clustering |
|Gaussian Fields |
|Graph Embedding (Planetoid) |
|Active GCN (ours)||84.7||74.8||81.0|
6 Conclusion & Future Work
In this paper, we proposed a novel architecture for the semi-supervised node classification task. We introduced a new metric based on the adjacency matrix that captures the number of nodes that a pair of nodes have in common within a specific distance. This architecture provides more information to the graph convolutional neural network for estimating the unknown labels from the adjacency matrix, the feature matrix, and the revealed labels. We also showed that the proposed architecture outperforms the basic GCN [1] method in the presence of both a graph realization and side information that is independent of the graph.
One of the main contributions of this paper is the idea of revealing the labels of some nodes to other nodes and using the graph structure to recover the labels of the unlabeled nodes. We mainly investigated this idea on the classical GCN [1], while it remains an open problem to investigate it with other basic semi-supervised classification methods. Investigating the role of side information that is independent of the graph (especially in a general form consisting of multiple features with finite cardinality) is also still an open problem for other state-of-the-art methods.
- The code is available at https://github.com/mohammadesmaeili/GCNN.
[1] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[2] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[3] Hussein Saad and Aria Nosratinia. Community detection with side information: Exact recovery under the stochastic block model. IEEE Journal of Selected Topics in Signal Processing, 12(5):944–958, 2018.
[4] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.
[5] Partha Pratim Talukdar and Koby Crammer. New regularized algorithms for transductive learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 442–457. Springer, 2009.
[6] Dengyong Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems, pages 321–328, 2004.
[7] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
[8] Qiaozhu Mei, Duo Zhang, and ChengXiang Zhai. A general optimization framework for smoothing language models on graph structures. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 611–618, 2008.
[9] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
[10] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.
[11] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077, 2015.
[12] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.
[13] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.
[14] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
[15] Sami Abu-El-Haija, Amol Kapoor, Bryan Perozzi, and Joonseok Lee. N-GCN: Multi-scale graph convolution for semi-supervised node classification. arXiv preprint arXiv:1802.08888, 2018.
[16] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[17] Kiran K Thekumparampil, Chong Wang, Sewoong Oh, and Li-Jia Li. Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735, 2018.
[18] Chenyi Zhuang and Qiang Ma. Dual graph convolutional networks for graph-based semi-supervised classification. In Proceedings of the 2018 World Wide Web Conference, pages 499–508, 2018.
[19] David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.
[20] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, pages 10456–10465, 2018.
[21] Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under label noise for deep neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[22] Emmanuel Abbe and Colin Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 670–688. IEEE, 2015.
[23] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Gorke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2):172–188, 2007.
[26] Jian Du, Shanghang Zhang, Guanhang Wu, José MF Moura, and Soummya Kar. Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370, 2017.
[27] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5115–5124, 2017.
[28] Jiaqi Ma, Weijing Tang, Ji Zhu, and Qiaozhu Mei. A flexible generative framework for graph-based semi-supervised learning. In Advances in Neural Information Processing Systems, pages 3276–3285, 2019.