Collective Vertex Classification Using Recursive Neural Network
Abstract
Collective classification of vertices is the task of assigning categories to each vertex in a graph based on both vertex attributes and link structure. Nevertheless, some existing approaches do not use the features of neighbouring vertices properly, due to the noise introduced by these features. In this paper, we propose a graph-based recursive neural network framework for collective vertex classification. In this framework, we generate hidden representations from both attributes of vertices and representations of neighbouring vertices via recursive neural networks. Under this framework, we explore two types of recursive neural units: the naive recursive neural unit and the long short-term memory unit. We have conducted experiments on four real-world network datasets. The experimental results show that our framework with the long short-term memory unit outperforms several competitive baseline methods.
Qiongkai Xu The Australian National University Data61 CSIRO Xu.Qiongkai@data61.csiro.au Qing Wang The Australian National University qing.wang@anu.edu.au
Chenchen Xu The Australian National University Data61 CSIRO Xu.Chenchen@data61.csiro.au Lizhen Qu The Australian National University Data61 CSIRO Qu.Lizhen@data61.csiro.au
Introduction
In everyday life, graphs are ubiquitous, e.g. social networks, sensor networks, and citation networks. Mining useful knowledge from graphs and studying properties of various kinds of graphs have been gaining popularity in recent years. Many studies formulate their graph problems as predictive tasks such as vertex classification (?), link prediction (?), and graph classification (?).
In this paper, we focus on the vertex classification task, which studies the properties of vertices by categorising them. Algorithms for classifying vertices are widely adopted in web page analysis, citation analysis and social network analysis (?). Naive approaches for vertex classification use traditional machine learning techniques to classify a vertex based only on the attributes or features provided by this vertex, e.g. words of web pages or user profiles in a social network. Another series of approaches is collective vertex classification, where instances are classified simultaneously as opposed to independently. Based on the observation that information from neighbouring vertices may help classify the current vertex, some approaches incorporate attributes of neighbouring vertices into the classification process, which however introduces noise at the same time and results in reduced performance (?; ?). Other approaches incorporate the labels of a vertex's neighbours. For instance, the Iterative Classification Approach (ICA) integrates the label distribution of neighbouring vertices to assist classification (?) and the Label Propagation approach (LP) fine-tunes predictions of the vertex using the labels of its neighbouring vertices (?). However, labels of neighbouring vertices are not representative enough for learning sophisticated relationships of vertices, while using their attributes directly would involve noise. We thus need an approach that is capable of capturing information from neighbouring vertices while, in the meantime, reducing the noise of attributes. Utilizing representations learned from neural networks instead of neighbouring attributes or labels is one of the possible approaches (?). As graphs normally provide rich structural information, we exploit neural networks with sufficient complexity for capturing such structures.
Recurrent neural networks were developed to utilize sequence structures by processing input in order, in which the representation of the previous node is used to generate the representation of the current node. Recursive neural networks exploit representation learning on tree structures. Both approaches achieve success in learning representations of data with implicit structures which indicate the order of processing vertices (?; ?). Following this work, we explore the possibility of integrating graph structures into recursive neural networks. However, graph structures, especially cyclic graphs, do not provide such a processing order. In this paper, we propose a graph-based recursive neural network (GRNN), which allows us to transform graph structures to tree structures and use recursive neural networks to learn representations for the vertices to classify. This framework consists of two main components, as illustrated in Figure 1:

Tree construction: For each vertex to classify $v_t$, we generate a search tree rooted at $v_t$. Starting from $v_t$, we add its neighbouring vertices into the tree layer by layer.

Recursive neural network construction: We build a recursive neural network for the constructed tree, by augmenting each vertex with one recursive neural unit. The inputs of each vertex are its features and hidden states of its child vertices. The output of a vertex is its hidden states.
Our main contributions are: (1) We introduce recursive neural networks to solve the collective vertex classification problem. (2) We propose a method that transforms the graph around each vertex to classify into a locally constructed tree. Based on the tree, a recursive neural network can extract representations for target vertices. (3) Our experimental results show that the proposed approach outperforms several baseline methods. In particular, we demonstrate that including information from neighbouring vertices can improve the performance of classification.
Related Work
There has been a growing trend to represent data using graphs (?). Discovering knowledge from graphs has become an exciting research area, with tasks such as vertex classification (?) and graph classification (?). Graph classification analyzes the properties of the graph as a whole, while vertex classification focuses on predicting labels of vertices in the graph. In this paper, we discuss the problem of vertex classification. The mainstream approaches for vertex classification are collective vertex classification approaches (?), which classify vertices using information provided by neighbouring vertices. The iterative classification approach (?) models neighbours' label distribution as link features to facilitate classification. The label propagation approach (?) assigns a probabilistic label to each vertex and then fine-tunes the probability using graph structure. However, labels of neighbouring vertices are not representative enough to include all useful information. Some researchers tried to introduce attributes from neighbouring vertices to improve classification performance. Nevertheless, as reported in (?; ?), naively incorporating these features may reduce the performance of classification when the original features of neighbouring vertices are too noisy.
Recently, some researchers analysed graphs using deep neural network technologies. Deepwalk (?) is an unsupervised learning algorithm that learns vertex embeddings using link structure, while the content of each vertex is not considered. Convolutional neural network for graphs (?) learns feature representations for graphs as a whole. Recurrent neural collective classification (?) encodes neighbouring vertices via a recurrent neural network, which makes it hard to capture information from vertices that are more than several steps away.
Recursive neural networks (RNN) are a series of models that deal with tree-structured information. RNN has been applied to natural scene parsing (?) and tree-structured sentence representation learning (?). Under this framework, representations can be learned from both input features and representations of child nodes. Graph structures are more widely used and more complicated than tree or sequence structures. Due to the lack of a notable order for processing vertices in a graph, few studies have investigated the vertex classification problem using recursive neural network techniques. The graph-based recursive neural network framework proposed in this paper can generate the processing order for the neural network according to the vertex to classify and the local graph structure.
Graph-based Recursive Neural Networks
In this section, we present the framework of Graph-based Recursive Neural Networks (GRNN). A graph $G = (V, E)$ consists of a set of vertices $V$ and a set of edges $E \subseteq V \times V$. Graphs may contain cycles, where a cycle is a path from a vertex back to itself. Let $X = \{x_1, \dots, x_{|V|}\}$ be a set of feature vectors, where each $x_i \in X$ is associated with a vertex $v_i \in V$, $L$ be a set of labels, and $v_t \in V$ be a vertex to be classified, called the target vertex. Then, the collective vertex classification problem is to predict the label $\hat{y}_t$ of $v_t$, such that
$$\hat{y}_t = \arg\max_{y \in L} P_\theta(y \mid v_t, G, X) \tag{1}$$
using a recursive neural network with parameters $\theta$.
Tree Construction
In a neural network, neurons are arranged in layers and different layers are processed following a predefined order. For example, recurrent neural networks process inputs in sequential order and recursive neural networks deal with tree structures in a bottom-up manner. However, graphs, particularly cyclic graphs, do not have an explicit order. How to construct an ordered structure from a graph is challenging.
Given a graph $G = (V, E)$, a target vertex $v_t \in V$ and tree depth $d$, we can construct a tree $T = (V_T, E_T)$ rooted at $v_t$ using breadth-first search, where $V_T$ is a vertex set, $E_T$ is an edge set, and $(v, w) \in E_T$ means an edge from parent vertex $v$ to child vertex $w$. The depth of a vertex $v$ in $T$ is the length of the path from $v_t$ to $v$, denoted as $T$.depth($v$). The depth of a tree $T$ is the maximum depth of vertices in $T$. We use $G$.outgoingVertices($v$) to denote the set of outgoing vertices from $v$, i.e. $\{w \mid (v, w) \in E\}$. The tree construction algorithm is described in Algorithm 1. Firstly, a first-in-first-out queue (Q) is initialized with $v_t$ (lines 1-3). The algorithm iteratively checks the vertices in Q. If there is a vertex whose depth is less than $d$, we pop it out from Q (line 6), add all its neighbouring vertices in $G$ as its children in $T$ and push them to the end of Q (lines 9-11).
In general, there are two approaches to deal with cycles in a graph. One is to remove vertices that have been visited, and the other is to keep duplicate vertices. Fig 2.a describes an example of a graph with a cycle between $v_1$ and $v_2$. Let us start with the target vertex $v_1$. In the first approach, there will be no child vertex for $v_2$, since $v_1$ is already visited. The corresponding tree is shown in Fig 2.b. In the second approach, we will add $v_1$ as a child vertex to $v_2$ iteratively and terminate after a certain number of steps. The corresponding tree is illustrated in Fig 2.c. When generating the representation of a vertex, say $v_2$, any information from its neighbours may help. We thus include $v_1$ as a child vertex of $v_2$. In this paper, we use the second approach for tree construction.
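The breadth-first construction with duplicate vertices can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1: the function name `build_tree`, the adjacency-dictionary input and the node layout are assumptions made for the example.

```python
from collections import deque

def build_tree(adj, target, depth):
    """Breadth-first tree rooted at `target`, keeping duplicate vertices.

    adj maps each vertex to a list of its outgoing neighbours.
    Returns a list of nodes; each node is a dict with keys
    'vertex', 'depth' and 'children' (indices into the list).
    """
    nodes = [{"vertex": target, "depth": 0, "children": []}]
    queue = deque([0])  # FIFO queue of node indices
    while queue:
        idx = queue.popleft()
        node = nodes[idx]
        if node["depth"] >= depth:
            continue  # nodes at the maximum depth are not expanded
        for w in adj.get(node["vertex"], []):
            # A vertex already in the tree is added again as a new node.
            nodes.append({"vertex": w, "depth": node["depth"] + 1, "children": []})
            node["children"].append(len(nodes) - 1)
            queue.append(len(nodes) - 1)
    return nodes

# Two vertices joined by a cycle.
adj = {"a": ["b"], "b": ["a"]}
tree = build_tree(adj, "a", depth=2)
print([(n["vertex"], n["depth"]) for n in tree])
```

With depth 2, the root reappears as a grandchild of itself, which is the behaviour of the duplicate-keeping approach; the depth bound is what guarantees termination on cyclic graphs.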
Recursive Neural Network Construction
Now we construct a recursive neural unit (RNU) for each vertex $v_k$ in the constructed tree. Each RNU takes a feature vector $x_k$ and the hidden states of its child vertices as input. We explore two kinds of recursive neural units which are discussed in (?; ?).
Naive Recursive Neural Unit (NRNU)
Each NRNU for a vertex $v_k$ takes a feature vector $x_k$ and the aggregation of the hidden states from all children of $v_k$. The transition equations of NRNU are given as follows:
$$\tilde{h}_k = \max_{v_r \in C(v_k)} \{h_r\} \tag{2}$$
$$h_k = \tanh\left(W^{(h)} x_k + U^{(h)} \tilde{h}_k + b^{(h)}\right) \tag{3}$$
where $C(v_k)$ is the set of child vertices of $v_k$, $W^{(h)}$ and $U^{(h)}$ are weight matrices, and $b^{(h)}$ is the bias. The generated hidden state $h_k$ of $v_k$ is related to the input vector $x_k$ and the aggregated hidden state $\tilde{h}_k$. Different from summing up all hidden states as in (?), we use max pooling for $\tilde{h}_k$ (see Eq 2). This is because, in real-life situations, the number of neighbours for a vertex can be very large and some of them are irrelevant for the vertex to classify (?). We use GNRNN to refer to the graph-based naive recursive neural network which incorporates NRNU as recursive neural units.
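A single NRNU step can be sketched in plain Python as follows. The tanh nonlinearity, the zero vector used as the aggregate for leaves, and the toy parameter values are assumptions of this sketch; the helper names `matvec`, `max_pool` and `nrnu` are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def max_pool(vectors, size):
    """Element-wise maximum over child hidden states; zeros for a leaf (assumption)."""
    if not vectors:
        return [0.0] * size
    return [max(column) for column in zip(*vectors)]

def nrnu(x, child_states, W, U, b):
    """One naive recursive unit step: h = tanh(W x + U h_tilde + b)."""
    h_tilde = max_pool(child_states, len(b))
    wx, uh = matvec(W, x), matvec(U, h_tilde)
    return [math.tanh(p + q + r) for p, q, r in zip(wx, uh, b)]

# Toy parameters: 2 input features, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.4, 0.0], [0.0, 0.4]]
b = [0.0, 0.0]

# Evaluate bottom-up: leaves first, then their parent.
leaf_1 = nrnu([1.0, 0.0], [], W, U, b)
leaf_2 = nrnu([0.0, 1.0], [], W, U, b)
root = nrnu([1.0, 1.0], [leaf_1, leaf_2], W, U, b)
print(root)
```

Because max pooling is symmetric in its arguments, the result does not depend on the order in which children are listed, and its magnitude does not grow with the number of children.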
Long Short-Term Memory Unit (LSTMU)
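LSTMU is one variation on RNU, which can handle the long-term dependency problem by introducing memory cells and gated units (?). LSTMU is composed of an input gate $i_k$, a forget gate $f_{kr}$, an output gate $o_k$, a memory cell $c_k$ and a hidden state $h_k$. The transition equations of LSTMU are given as follows:
$$\tilde{h}_k = \max_{v_r \in C(v_k)} \{h_r\} \tag{4}$$
$$i_k = \sigma\left(W^{(i)} x_k + U^{(i)} \tilde{h}_k + b^{(i)}\right) \tag{5}$$
$$f_{kr} = \sigma\left(W^{(f)} x_k + U^{(f)} h_r + b^{(f)}\right) \tag{6}$$
$$o_k = \sigma\left(W^{(o)} x_k + U^{(o)} \tilde{h}_k + b^{(o)}\right) \tag{7}$$
$$u_k = \tanh\left(W^{(u)} x_k + U^{(u)} \tilde{h}_k + b^{(u)}\right) \tag{8}$$
$$c_k = i_k \odot u_k + \sum_{v_r \in C(v_k)} f_{kr} \odot c_r \tag{9}$$
$$h_k = o_k \odot \tanh(c_k) \tag{10}$$
where $C(v_k)$ is the set of child vertices of $v_k$, $v_r \in C(v_k)$, $x_k$ is the feature vector of the vertex $v_k$, $\odot$ is element-wise multiplication and $\sigma$ is the sigmoid function, $W^{(\cdot)}$ and $U^{(\cdot)}$ are weight matrices, and $b^{(\cdot)}$ are the biases. $\tilde{h}_k$ is a vector aggregated from the hidden states of the child vertices. We use GLSTM to refer to the graph-based long short-term memory network which incorporates LSTMU as recursive neural units.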
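The gated unit can be sketched in the same style. The sketch below follows the child-sum tree-LSTM pattern with one forget gate per child and max pooling for the aggregated hidden state; the exact gate wiring, the zero aggregate for leaves, and the shared toy parameters are assumptions, and the names `affine`, `gate` and `lstmu` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h)) + bi
            for w_row, u_row, bi in zip(W, U, b)]

def gate(params, name, x, h, act):
    """Apply activation `act` to the affine map of the named gate."""
    W, U, b = params[name]
    return [act(z) for z in affine(W, x, U, h, b)]

def lstmu(x, children, params):
    """One tree-LSTM unit step; `children` is a list of (h_r, c_r) pairs."""
    n = len(params["i"][2])
    hs = [h for h, _ in children]
    # Max pooling over child hidden states; zeros for a leaf (assumption).
    h_tilde = [max(col) for col in zip(*hs)] if hs else [0.0] * n
    i = gate(params, "i", x, h_tilde, sigmoid)
    o = gate(params, "o", x, h_tilde, sigmoid)
    u = gate(params, "u", x, h_tilde, math.tanh)
    c = [ik * uk for ik, uk in zip(i, u)]
    for h_r, c_r in children:
        f = gate(params, "f", x, h_r, sigmoid)  # one forget gate per child
        c = [ck + fk * crk for ck, fk, crk in zip(c, f, c_r)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Toy parameters shared by all gates: 2 input features, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.4, 0.0], [0.0, 0.4]]
params = {g: (W, U, [0.0, 0.0]) for g in "ifou"}

leaf_1 = lstmu([1.0, 0.0], [], params)
leaf_2 = lstmu([0.0, 1.0], [], params)
h_root, c_root = lstmu([1.0, 1.0], [leaf_1, leaf_2], params)
print(h_root, c_root)
```

The per-child forget gate lets the unit decide, child by child, how much of each child's memory cell to carry upward, which is the mechanism usually credited for handling longer dependencies.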
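After constructing GRNN, we calculate the hidden states of all vertices in $T$ from leaves to root, then we use a softmax classifier to predict the label of the target vertex $v_t$ using its hidden state $h_t$ (see Eq 11).
$$P_\theta(y \mid v_t, G, X) = \mathrm{softmax}\left(W^{(s)} h_t + b^{(s)}\right) \tag{11}$$
$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log P_\theta(y_i \mid v_i, G, X) \tag{12}$$
Here $W^{(s)}$ and $b^{(s)}$ are the weight matrix and bias of the softmax layer. Cross-entropy $J(\theta)$ is used as the cost function, where $N$ is the number of vertices in a training set and $y_i$ is the true label of training vertex $v_i$.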
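The classification step and its cost can be illustrated with a small stdlib sketch. The score values below stand in for the output of the linear layer on top of the root hidden state and are made up for the example; averaging the negative log-likelihood over the training vertices is an assumption consistent with the description of $N$ above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(prob_rows, labels):
    """Mean negative log-likelihood of the true labels."""
    return -sum(math.log(p[y]) for p, y in zip(prob_rows, labels)) / len(labels)

# Hypothetical class scores for two target vertices and three labels.
scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
probs = [softmax(s) for s in scores]
predictions = [max(range(len(p)), key=p.__getitem__) for p in probs]
loss = cross_entropy(probs, [0, 2])
print(predictions, loss)
```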
Experimental Setup
To verify the effectiveness of our approach, we have conducted experiments on four datasets and compared our approach with three baseline methods. We will describe the datasets, baseline methods, and experimental settings.
Datasets
We have tested our approach on four real-world network datasets.

Cora (?) is a citation network dataset which is composed of 2708 scientific publications and 5429 citations between publications. All publications are classified into seven classes: Rule Learning (RU), Genetic Algorithms (GE), Reinforcement Learning (RE), Neural Networks (NE), Probabilistic Methods (PR), Case Based (CA) and Theory (TH).

Citeseer (?) is another citation network dataset which is larger than Cora. Citeseer is composed of 3312 scientific publications and 4723 citations. All publications are classified into six classes: Agents, AI, DB, IR, ML and HCI.

WebKB (?) is a website network collected from four computer science departments in different universities, which consists of 877 web pages and 1608 hyperlinks between web pages. All web pages are classified into five classes: faculty, students, project, course and other.

WebKBsim is a network dataset which is generated from WebKB based on the cosine similarity between each vertex and its top 3 similar vertices according to their feature vectors (?). We use the same feature vectors as the ones in WebKB. This dataset is used to demonstrate the effectiveness of our framework on datasets which may not have explicit relationships represented as edges between vertices, but can be treated as graphs whose edges are based on some metric such as the similarity of vertices.
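A similarity graph of the kind described for WebKBsim can be sketched as follows. The text states only that each vertex is related to its top 3 most similar vertices under cosine similarity; the directed edge representation, the tie-breaking rule and the function names are assumptions of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_edges(features, k=3):
    """Link every vertex to its k most similar other vertices."""
    edges = []
    for i, x in enumerate(features):
        scored = [(cosine(x, y), j) for j, y in enumerate(features) if j != i]
        # Highest similarity first; ties broken by smaller vertex index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        edges.extend((i, j) for _, j in scored[:k])
    return edges

# Four toy bag-of-words vectors; k=1 keeps the example small.
features = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
print(top_k_edges(features, k=1))
```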
We use abstracts of publications in Cora and Citeseer, and contents of web pages in WebKB to generate features of vertices. For the above datasets, all words are stemmed first, then stop words and words with document frequency less than 10 are discarded. A dictionary is generated by including all the remaining words. The resulting dictionaries contain 1433, 3793, and 1703 words for Cora, Citeseer and WebKB, respectively. Each vertex is represented by a bag-of-words vector where each dimension indicates the absence or occurrence of a word in the dictionary of the corresponding dataset. Cora, Citeseer and WebKB can be downloaded from LINQS. We will publish WebKBsim along with our code.
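The feature construction can be illustrated on a toy corpus. The sketch assumes tokens are already stemmed and stop words removed, and it uses a document-frequency threshold of 2 so that the tiny example keeps some words; the function names are hypothetical.

```python
from collections import Counter

def build_dictionary(documents, min_df):
    """Keep words whose document frequency is at least min_df."""
    df = Counter(word for doc in documents for word in set(doc))
    return sorted(word for word, count in df.items() if count >= min_df)

def binary_bow(doc, dictionary):
    """1 where a dictionary word occurs in the document, else 0."""
    present = set(doc)
    return [1 if word in present else 0 for word in dictionary]

# Toy corpus of already stemmed, stop-word-free tokens.
docs = [["graph", "neural", "network"], ["graph", "tree"], ["neural", "graph"]]
dictionary = build_dictionary(docs, min_df=2)  # the text above uses 10
vectors = [binary_bow(doc, dictionary) for doc in docs]
print(dictionary, vectors)
```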
Baseline Methods
We have implemented the following three baseline methods:

Logistic Regression (LR) (?) predicts the label of a vertex using its own attributes through a logistic regression model.

Iterative classification approach (ICA) (?; ?) utilizes the combination of link structure and vertex features as input of a statistical machine learning model. We use two variants of ICA: ICA-binary uses the occurrence of labels of neighbouring vertices, and ICA-count uses the frequency of labels of neighbouring vertices.

Label propagation (LP) (?; ?) uses a statistical machine learning model to give a label probability for each vertex, then propagates the label probability to all its neighbours. The propagation steps are repeated until all label probabilities converge.
To make experiments consistent, logistic regression is used for all statistical machine learning components in ICA and LP. We run 5 iterations for each ICA experiment and 20 iterations for each LP experiment, since, according to our preliminary experiments, LP converges slower than ICA.
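The difference between the two ICA variants lies in how neighbouring labels are summarised as link features. A minimal sketch, interpreting "frequency" as raw label counts (an assumption) and using a hypothetical function name:

```python
def neighbour_label_features(neighbour_labels, classes, mode="count"):
    """Summarise neighbours' labels as a fixed-length feature vector."""
    counts = [sum(1 for label in neighbour_labels if label == c) for c in classes]
    if mode == "binary":
        # Occurrence only: 1 if any neighbour carries the label.
        return [1 if n > 0 else 0 for n in counts]
    return counts

classes = ["course", "faculty", "student"]
labels = ["student", "student", "course"]
print(neighbour_label_features(labels, classes, "binary"))
print(neighbour_label_features(labels, classes, "count"))
```

In an ICA loop these vectors would be concatenated with the vertex's own attributes and recomputed after every round of predictions.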
Experimental Settings
In our experiments, we split each dataset into two parts: training set and testing set, with different proportions ( to for training). For each proportion setting, we randomly generate 5 pairs of training and testing sets. For each experiment on a pair of training and testing sets, we run 10 epochs on the training set and record the highest Micro-F1 score (?) on the testing set. Then we report the averaged results from the experiments with the same proportion setting. According to preliminary experiments, the learning rate is set to 0.1 for LR, ICA and LP and 0.01 for all GRNN models. We empirically set the number of hidden states to 200 for all GRNN models. Adagrad (?) is used as the optimization method in our experiments.
Results and Discussion
Model and Parameter Selection
Figure 3 illustrates the performance of GNRNN and GLSTM on four datasets. We use GNRNN_di and GLSTM_di to refer to GNRNN and GLSTM over trees of depth $i$, respectively, where $i \in \{0, 1, 2\}$.
For the experiments on GNRNN and GLSTM over trees of different depths, models with $d = 1$ and $d = 2$ outperform those with $d = 0$ in most cases (when $d = 0$, each constructed tree contains only one vertex). In particular, the models with $d = 1$ and $d = 2$ improve on $d = 0$ by more than 2% on Cora, and they enjoy a consistent improvement over $d = 0$ on Citeseer and WebKBsim. This performance difference is also obvious on WebKB when the training proportion is larger than 85%. These results mean that introducing neighbouring vertices can improve the performance of classification, and that more neighbouring information can be obtained by increasing the depth of trees. Using the same RNU setting, $d = 2$ outperforms $d = 1$ in most experiments on Cora, Citeseer and WebKBsim. However, for WebKB, $d = 2$ does not always outperform $d = 1$. That is to say, introducing more layers of vertices may help improve the performance, while the choice of the best tree depth depends on the application.
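| Method | Pooling strategy | Cora | Citeseer | WebKB | WebKBsim |
|---|---|---|---|---|---|
| GLSTM_d1 | sum | 83.05 | 74.81 | 86.21 | 87.42 |
| GLSTM_d1 | mean | 84.18 | 74.77 | 86.21 | 87.58 |
| GLSTM_d1 | max | 83.83 | 74.89 | 87.12 | 87.42 |
| GLSTM_d2 | sum | 84.03 | 75.05 | 85.61 | 87.42 |
| GLSTM_d2 | mean | 84.47 | 75.33 | 86.21 | 87.42 |
| GLSTM_d2 | max | 84.72 | 75.45 | 86.06 | 87.73 |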
In Table 1, we compare three different pooling strategies used in GLSTM: sum, mean and max pooling (as GNRNN gives similar results, we only report results for GLSTM here). We use 85% for training for all datasets here. In general, mean and max outperform sum, which is used in (?). This is probably because the number of neighbours for a vertex can be very large, and summing them up can make the aggregated hidden state large in some extreme cases. max slightly outperforms mean in our experiments, probably because max pooling can select the most influential information of child vertices, which filters out noise to some extent.
Baseline Comparison
We compare our approach with the baseline methods on four network datasets. As our models with $d = 2$ provide better performance, we choose GNRNN_d2 and GLSTM_d2 as representative models.
In Figure 4.a and Figure 4.b, we compare our approach with the baseline methods on two citation networks, Cora and Citeseer. The collective vertex classification approaches, i.e. LP, ICA, GNRNN and GLSTM, largely outperform LR, which only uses attributes of the target vertex. Both GNRNN_d2 and GLSTM_d2 consistently outperform all baseline methods on the citation networks, which indicates the effectiveness of our approach. In Figure 4.c and Figure 4.d, we compare our approach with the baseline methods on WebKB and WebKBsim. For WebKB, our method obtains competitive results in comparison with ICA. LP performs poorly on WebKB, where its Micro-F1 score is less than 0.7. For this reason, it is not presented in Figure 4.c, and we will discuss this in detail in the next subsection. For WebKBsim, our approaches consistently outperform all baseline methods, and GLSTM_d2 outperforms GNRNN_d2 when the training proportion is larger than 80%.
In general, GLSTM_d2 outperforms GNRNN_d2. This is likely due to the LSTMU’s capability of memorizing information using memory cells. GLSTM can thus better capture correlations between representations with long dependencies (?).
Dataset Comparison
To analyse the co-occurrence of neighbouring labels, we compute the transition probability from target vertices to their neighbouring vertices. We first calculate the label co-occurrence matrix $M^{(d)}$, where $M^{(d)}_{ij}$ indicates the number of times label $l_i$ of a target vertex co-occurs with label $l_j$ of a vertex $d$ steps away. Then we obtain a transition probability matrix $P^{(d)}$, where $P^{(d)}_{ij} = M^{(d)}_{ij} / \sum_{k} M^{(d)}_{ik}$. The heat maps of $P^{(d)}$ on four datasets are shown in Figure 5.
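The computation can be sketched as a count followed by a row normalisation. Normalising each row so that it sums to one is an assumption about how the transition probabilities are obtained from the co-occurrence counts; labels are represented by integer indices and the function name is hypothetical.

```python
def transition_matrix(pairs, num_labels):
    """Row-normalise label co-occurrence counts.

    pairs: iterable of (target_label, neighbour_label) index pairs,
    one per (target vertex, vertex d steps away) combination.
    """
    counts = [[0] * num_labels for _ in range(num_labels)]
    for i, j in pairs:
        counts[i][j] += 1
    probs = []
    for row in counts:
        total = sum(row)
        # Labels that never occur on a target vertex get an all-zero row.
        probs.append([c / total if total else 0.0 for c in row])
    return probs

pairs = [(0, 0), (0, 0), (0, 1), (1, 1)]
matrix = transition_matrix(pairs, 2)
print(matrix)
```

A matrix with most of its mass on the diagonal corresponds to the situation described for the citation networks, where neighbouring vertices tend to share labels.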
For Cora and Citeseer, neighbouring vertices tend to share the same labels. When $d$ increases to 2, labels are still tightly correlated. That is probably why ICA, LP, GNRNN and GLSTM all work well on Cora and Citeseer. In this situation, GRNN integrates features of vertices $d$ steps away, which may directly help classify a target vertex. For WebKB, the correlation of labels is not clear, and some labels can be strongly related to more than two labels, e.g. student connects to course, project and student. Introducing vertices which are more steps away makes the correlation even worse for WebKB, e.g. all labels are most related to student. In this situation, LP totally fails, while ICA can learn the correlation of labels that are not the same, i.e. student may relate to course instead of student itself. For this dataset, GRNN still achieves results competitive with the best baseline approach. For WebKBsim, although student is still the label with the highest frequency, the correlation between labels is clearer than in WebKB, i.e. project relates to project and student. That is probably the reason why the performance of our approach is good on WebKBsim for both settings, and why GLSTM_d2 achieves better results than GNRNN_d2 when the training proportion is larger.
Conclusions and Future work
In this paper, we have presented a graph-based recursive neural network framework (GRNN) for vertex classification on graphs. We have compared two recursive units, NRNU and LSTMU, within this framework. It turns out that LSTMU works better than NRNU in most experiments. Finally, our proposed methods outperformed several state-of-the-art statistical machine learning based methods.
In the future, we intend to extend this work in several directions. We aim to apply GRNN to large scale graphs. We also aim to improve the efficiency of GRNN and conduct time complexity analysis.
References
 [Angles and Gutierrez 2008] Angles, R., and Gutierrez, C. 2008. Survey of graph database models. ACM Computing Surveys (CSUR) 40(1):1.
 [Baeza-Yates, Ribeiro-Neto, and others 1999] Baeza-Yates, R.; Ribeiro-Neto, B.; et al. 1999. Modern information retrieval, volume 463. ACM press New York.
 [Chakrabarti, Dom, and Indyk 1998] Chakrabarti, S.; Dom, B.; and Indyk, P. 1998. Enhanced hypertext categorization using hyperlinks. In ACM SIGMOD Record, volume 27, 307–318. ACM.
 [Craven et al. 1998] Craven, M.; DiPasquo, D.; Freitag, D.; and McCallum, A. 1998. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the 15th National Conference on Artificial Intelligence, 509–516. American Association for Artificial Intelligence.
 [Duchi, Hazan, and Singer 2011] Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121–2159.
 [Giles, Bollacker, and Lawrence 1998] Giles, C. L.; Bollacker, K. D.; and Lawrence, S. 1998. Citeseer: An automatic citation indexing system. In Proceedings of the 3rd ACM conference on Digital libraries, 89–98. ACM.
 [Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
 [Hosmer Jr and Lemeshow 2004] Hosmer Jr, D. W., and Lemeshow, S. 2004. Applied logistic regression. John Wiley & Sons.
 [London and Getoor 2014] London, B., and Getoor, L. 2014. Collective classification of network data. Data Classification: Algorithms and Applications 399.
 [Lu and Getoor 2003] Lu, Q., and Getoor, L. 2003. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning, volume 3, 496–503.
 [McCallum et al. 2000] McCallum, A. K.; Nigam, K.; Rennie, J.; and Seymore, K. 2000. Automating the construction of internet portals with machine learning. Information Retrieval 3(2):127–163.
 [Monner and Reggia 2013] Monner, D. D., and Reggia, J. A. 2013. Recurrent neural collective classification. IEEE transactions on neural networks and learning systems 24(12):1932–1943.
 [Myaeng and Lee 2000] Myaeng, S. H., and Lee, M.-H. 2000. A practical hypertext categorization method using links and incrementally available class information. In Proceedings of the 23rd ACM International Conference on Research and Development in Information Retrieval. ACM.
 [Niepert, Ahmed, and Kutzkov 2016] Niepert, M.; Ahmed, M.; and Kutzkov, K. 2016. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning.
 [Perozzi, Al-Rfou, and Skiena 2014] Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–710. ACM.
 [Schmidhuber 2015] Schmidhuber, J. 2015. Deep learning in neural networks: An overview. Neural Networks 61:85–117.
 [Socher et al. 2011] Socher, R.; Lin, C. C.; Manning, C.; and Ng, A. Y. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning, 129–136.
 [Tai, Socher, and Manning 2015] Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics. ACL.
 [Tang et al. 2015] Tang, J.; Chang, S.; Aggarwal, C.; and Liu, H. 2015. Negative link prediction in social media. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 87–96. ACM.
 [Tang, Qin, and Liu 2015] Tang, D.; Qin, B.; and Liu, T. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1422–1432.
 [Wang and Zhang 2008] Wang, F., and Zhang, C. 2008. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering 20(1):55–67.
 [Zhang et al. 2013] Zhang, J.; Liu, B.; Tang, J.; Chen, T.; and Li, J. 2013. Social influence locality for modeling retweeting behaviors. In Proceeding of the 23rd International Joint Conference on Artificial Intelligence, volume 13, 2761–2767. Citeseer.