Neural Graph Machines: Learning Neural Networks Using Graphs
Abstract
Label propagation is a powerful and flexible semi-supervised learning technique on graphs. Neural networks, on the other hand, have proven track records in many supervised learning tasks. In this work, we propose a training framework with a graph-regularised objective, namely Neural Graph Machines, that can combine the power of neural networks and label propagation. This work generalises previous literature on graph-augmented training of neural networks, enabling it to be applied to multiple neural architectures (Feedforward NNs, CNNs and LSTM RNNs) and a wide range of graphs. The new objective allows the neural networks to harness both labeled and unlabeled data by: (a) allowing the network to train using labeled data as in the supervised setting, (b) biasing the network to learn similar hidden representations for neighboring nodes on a graph, in the same vein as label propagation. Such architectures with the proposed objective can be trained efficiently using stochastic gradient descent and scaled to large graphs, with a runtime that is linear in the number of edges. The proposed joint training approach convincingly outperforms many existing methods on a wide range of tasks (multi-label classification on social graphs, news categorization, document classification and semantic intent classification), with multiple forms of graph inputs (including graphs with and without node-level features) and using different types of neural networks.
University of Cambridge, United Kingdom; Google Research, Mountain View, CA, USA
1 Introduction
Semi-supervised learning is a powerful machine learning paradigm that can improve prediction performance compared to techniques that use only labeled data, by leveraging a large amount of unlabeled data. The need for semi-supervised learning arises in many problems in computer vision, natural language processing or social networks, in which getting labeled datapoints is expensive or unlabeled data is abundant and readily available.
There exist a plethora of semi-supervised learning methods. The simplest one uses bootstrapping techniques to generate pseudo-labels for unlabeled data from a system trained on labeled data. However, this suffers from label error feedback (Lee, 2013). In a similar vein, autoencoder-based methods often need to rely on a two-stage approach: train an autoencoder using unlabeled data to generate an embedding mapping, and use the learnt embeddings for prediction. In practice, this procedure is often costly and inaccurate. Another example is transductive SVMs (Joachims, 1999), which are too computationally expensive to be used for large datasets. Methods that are based on generative models and amortized variational inference (Kingma et al., 2014) can work well for images and videos, but it is not immediately clear how to extend such techniques to handle sparse and multi-modal inputs or graphs over the inputs.
In contrast to the methods above, graph-based techniques such as label propagation (Zhu & Ghahramani; Bengio et al., 2006) often provide a versatile, scalable, and yet effective solution to a wide range of problems. These methods construct a smooth graph over the unlabeled and labeled data. Graphs are also often a natural way to describe the relationships between nodes, such as similarities between embeddings, phrases or images, or connections between entities on the web or relations in a social network. Edges in the graph connect semantically similar nodes or datapoints, and if present, edge weights reflect how strong such similarities are. Given a set of labeled nodes, such techniques iteratively refine the node labels by aggregating information from neighbours and propagating these labels to the nodes’ neighbours. In practice, these methods often converge quickly and can be scaled to large datasets with a large label space (Ravi & Diao, 2016). We build upon the principle behind label propagation for our method.
Another key motivation of our work is the recent advances in neural networks and their performance on a wide variety of supervised learning tasks such as image and speech recognition or sequence-to-sequence learning (Krizhevsky et al., 2012; Hinton et al., 2012; Sutskever et al., 2014). Such results are however conditioned on training very large networks on large datasets, which may need millions of labeled training input-output pairs. This raises the question: can we harness previous state-of-the-art semi-supervised learning techniques to jointly train neural networks using limited labeled data and unlabeled data, and thereby improve their performance?
Contributions:
We propose a discriminative training objective for neural networks with graph augmentation, that can be trained with stochastic gradient descent and efficiently scaled to large graphs. The new objective has a regularization term for generic neural network architectures that enforces similarity between nodes in the graphs, which is inspired by the objective function of label propagation. In particular, we show that:

Graph-augmented neural network training can work for a wide range of neural networks, such as feedforward, convolutional and recurrent networks. Additionally, this technique can be used in both inductive and transductive settings. It also helps learning in the low-sample regime (small number of labeled nodes), which cannot be handled by vanilla neural network training.

The framework can handle multiple forms of graphs, either naturally given or constructed based on embeddings and knowledge bases.

Using graphs and neighbourhood information alone as direct inputs to neural networks in this joint training framework permits fast and simple inference, yet provides competitive performance with current state-of-the-art approaches which employ a two-step method of first training a node embedding representation from the graph and then using it as feature input to train a classifier separately (see section 4.1).

As a by-product, our proposed framework provides a simple technique for finding smaller and faster neural networks that offer competitive performance with larger and slower non-graph-augmented alternatives (see section 4.2).
We experimentally show that the proposed training framework outperforms or performs favourably against state-of-the-art methods on a variety of prediction tasks and datasets, involving text features and/or graph inputs, and on many different neural network architectures (see section 4).
2 Background and related works
In this section, we will lay out the groundwork for our proposed training objective in section 3.
2.1 Neural network learning
Neural networks are a class of non-linear mappings from inputs to outputs, composed of multiple layers that can potentially learn useful representations for predicting the outputs. We will view various models such as feedforward neural networks, recurrent neural networks and convolutional networks under the same umbrella. Given a set of $N$ training input-output pairs $\{x_n, y_n\}_{n=1}^{N}$, such neural networks are often trained by performing maximum likelihood learning, that is, tuning their parameters so that the networks’ outputs are close to the ground truth under some criterion,
$$C_{\mathrm{NN}}(\theta) = \sum_{n=1}^{N} c\big(g_\theta(x_n), y_n\big), \tag{1}$$
where $g_\theta(\cdot)$ denotes the overall mapping, parameterized by $\theta$, and $c(\cdot)$ denotes a loss function such as $\ell_2$ for regression or cross entropy for classification. The cost function $c$ and the mapping $g$ are typically differentiable w.r.t. $\theta$, which facilitates optimisation via gradient descent. Importantly, this can be scaled to a large number of training instances by employing stochastic training using minibatches of data. However, it is not clear how unlabeled data, if available, can be treated using this objective, or whether extra information about the training set, such as relational structures, can be used.
2.2 Graph-based semi-supervised learning
In this section, we provide a concise introduction to graph-based semi-supervised learning using label propagation and its training objective. Suppose we are given a graph $G = (V, E, W)$ where $V$ is the set of nodes, $E$ the set of edges and $W$ the edge weight matrix. Let $V_l$ and $V_u$ be the labeled and unlabeled nodes in the graph. The goal is to predict a soft assignment of labels for each node in the graph, $\hat{Y}$, given the training label distribution for the seed nodes, $Y$. Mathematically, label propagation performs minimization of the following convex objective function, for $L$ labels,
$$C_{\mathrm{LP}}(\hat{Y}) = \mu_1 \sum_{v \in V_l} \big\|\hat{Y}_v - Y_v\big\|_2^2 + \mu_2 \sum_{v \in V,\, u \in \mathcal{N}(v)} w_{u,v} \big\|\hat{Y}_v - \hat{Y}_u\big\|_2^2 + \mu_3 \sum_{v \in V} \big\|\hat{Y}_v - U\big\|_2^2, \tag{2}$$
subject to $\sum_{l=1}^{L} \hat{Y}_{vl} = 1$, where $\mathcal{N}(v)$ is the neighbour node set of the node $v$, $U$ is the prior distribution over all labels, $w_{u,v}$ is the edge weight between nodes $u$ and $v$, and $\mu_1$, $\mu_2$, and $\mu_3$ are hyperparameters that balance the contribution of individual terms in the objective. The terms in the objective function above encourage that: (a) the label distribution of seed nodes should be close to the ground truth, (b) the label distribution of neighbouring nodes should be similar, and, (c) if relevant, the label distribution should stay close to our prior belief. This objective function can be solved efficiently using iterative methods such as the Jacobi procedure. That is, in each step, each node aggregates the label distributions from its neighbours and adjusts its own distribution, which is then repeated until convergence. In practice, the iterative updates can be done in parallel or in a distributed fashion, which then allows large graphs with a large number of nodes and labels to be trained efficiently. Bengio et al. (2006) and Ravi & Diao (2016) provide good surveys on the topic for interested readers.
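To make the iterative procedure concrete, the following is a minimal sketch of a Jacobi-style update for a label propagation objective of the form in equation 2. It is an illustration under stated assumptions rather than the exact procedure of the cited works: the prior is taken to be uniform, the function and argument names (`label_propagation`, `neighbors`, `seeds`) are hypothetical, and any constant factor arising from counting each undirected edge twice is absorbed into `mu2`.

```python
def label_propagation(neighbors, seeds, num_labels,
                      mu1=1.0, mu2=0.5, mu3=0.01, num_iters=50):
    """Jacobi-style label propagation (illustrative sketch).

    neighbors: dict mapping node -> list of (neighbor, edge_weight)
    seeds:     dict mapping labeled node -> label distribution (list of floats)
    Returns a dict mapping every node -> estimated label distribution.
    """
    prior = [1.0 / num_labels] * num_labels  # uniform prior U (an assumption)
    y_hat = {v: list(seeds.get(v, prior)) for v in neighbors}
    for _ in range(num_iters):
        updated = {}
        for v, nbrs in neighbors.items():
            is_seed = v in seeds
            # Normaliser: total weight of the three terms acting on node v.
            norm = mu1 * is_seed + mu2 * sum(w for _, w in nbrs) + mu3
            dist = []
            for l in range(num_labels):
                value = mu3 * prior[l]
                if is_seed:
                    value += mu1 * seeds[v][l]
                value += mu2 * sum(w * y_hat[u][l] for u, w in nbrs)
                dist.append(value / norm)
            updated[v] = dist
        y_hat = updated  # Jacobi: every node reads the previous iterate
    return y_hat


# Toy chain a - b - c - d with one seed node at each end.
chain = {
    "a": [("b", 1.0)],
    "b": [("a", 1.0), ("c", 1.0)],
    "c": [("b", 1.0), ("d", 1.0)],
    "d": [("c", 1.0)],
}
result = label_propagation(chain, {"a": [1.0, 0.0], "d": [0.0, 1.0]}, num_labels=2)
```

Because each update is a convex combination of probability distributions, the sum-to-one constraint is preserved automatically, and the unlabeled nodes `b` and `c` end up leaning towards the label of their nearest seed.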
There are many variants of label propagation that could be viewed as optimising modified versions of eq. 2. For example, manifold regularization (Belkin et al., 2006) replaces the label distribution by a Reproducing Kernel Hilbert Space mapping from input features. Similarly, Weston et al. (2012) also employ such a mapping but use a feedforward neural network instead. Both methods can be classified as inductive learning algorithms, whereas the original label propagation algorithm is transductive (Yang et al., 2016).
These aforementioned methods are closest to our proposed approach; however, there are key differences. Our work generalizes previously proposed frameworks for graph-augmented training of neural networks (e.g., Weston et al. (2012)) and extends them to new settings, for example, when there is only graph input and no features are available. Unlike the previous works, we show that the graph-augmented training method can work with multiple neural network architectures (Feedforward NNs, CNNs, RNNs) and on multiple prediction tasks and datasets using natural as well as constructed graphs. The experimental results (see section 4) clearly validate the effectiveness of this method in all these different settings, in both inductive and transductive learning paradigms. Besides the methodology, our study also presents an important contribution towards assessing the effectiveness of graph-combined neural networks as a generic training mechanism for different architectures and problems, which was not well studied in previous work.
More recently, graph embedding techniques have been used to create node embeddings that encode local structures of the graph and the provided node labels (Perozzi et al., 2014; Yang et al., 2016). These techniques target learning better node representations to be used for other tasks such as node classification. In this work, we aim to directly learn better predictive models from the graph. We compare our method to these two-stage (embedding + classifier) techniques in several experiments in section 4.
Our work is also different from, and orthogonal to, recent works on using neural networks on graphs. For example, Defferrard et al. (2016) employ spectral graph convolution to create a neural-network-like classifier. However, these approaches require many approximations to arrive at a practical implementation. Here, we advocate a training objective that uses graphs to augment neural network learning, and works with many forms of graphs and with any type of neural network.
3 Neural graph machines
In this section, we devise a discriminative training objective for neural networks, that is inspired by the label propagation objective and uses both labeled and unlabeled data, and can be trained by stochastic gradient descent.
First, we take a close look at the two objective functions discussed in section 2. The label propagation objective (equation 2) encourages the predicted label distributions of neighbouring nodes to be similar, and those of labeled nodes to be close to the ground truth. For example, if a cat image and a dog image are strongly connected in a graph, and if the cat node is labeled as animal, the predicted probability of the dog node being animal is also high. In contrast, the neural network training objective (equation 1) only takes into account the labeled instances, and ensures correct predictions on the training set. As a consequence, a neural network trained on the cat image alone will not make an accurate prediction on the dog image.
This shortcoming of neural network training can be rectified by biasing the network using prior knowledge about the relationship between instances in the dataset. In particular, for the domains we are interested in, training instances (either labeled or unlabeled) that are connected in a graph, for example, dog and cat in the above example, should have similar predictions. This can be done by encouraging neighboring data points to have a similar hidden representation learnt by a neural network, resulting in a modified objective function for training neural network architectures using both labeled and unlabeled datapoints. We call architectures trained using this objective Neural Graph Machines, and schematically illustrate the concept in figure 1. The proposed objective function is a weighted sum of the neural network cost and the label propagation cost as follows,
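$$C_{\mathrm{NGM}}(\theta) = \sum_{n \in V_l} c\big(g_\theta(x_n), y_n\big) + \alpha_1 \sum_{(u,v) \in E_{LL}} w_{uv}\, d\big(h_\theta(x_u), h_\theta(x_v)\big) + \alpha_2 \sum_{(u,v) \in E_{LU}} w_{uv}\, d\big(h_\theta(x_u), h_\theta(x_v)\big) + \alpha_3 \sum_{(u,v) \in E_{UU}} w_{uv}\, d\big(h_\theta(x_u), h_\theta(x_v)\big), \tag{3}$$
where $E_{LL}$, $E_{LU}$, and $E_{UU}$ are sets of labeled-labeled, labeled-unlabeled and unlabeled-unlabeled edges correspondingly, $h_\theta(\cdot)$ represents the hidden representations of the inputs produced by the neural network, $d(\cdot)$ is a distance metric, and $\alpha_1$, $\alpha_2$, $\alpha_3$ are hyperparameters. Note that we have separated the terms based on the edge types, as these can affect the training differently.
In practice, we choose an $\ell_1$ or $\ell_2$ distance metric for $d(\cdot)$, and $h_\theta(x)$ to be the last layer of the neural network. However, these choices can be changed to a customized metric, or to an intermediate hidden layer instead.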
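As an illustration of how the objective in equation 3 combines a supervised term with edge-type-specific graph terms, the sketch below evaluates it for precomputed hidden representations and output distributions. All names (`ngm_objective`, `hidden`, `probs`) are hypothetical, the distance is taken to be $\ell_2$ and the supervised loss cross entropy, and no actual network or gradient computation is involved.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two hidden representations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cross_entropy(probs, label):
    """Cross entropy of a predicted distribution against a class index."""
    return -math.log(probs[label])


def ngm_objective(edges, labels, hidden, probs, alphas):
    """Evaluate a graph-regularised objective of the form in equation 3.

    edges:  list of (u, v, weight)
    labels: dict labeled node -> class index
    hidden: dict node -> hidden representation h(x)
    probs:  dict node -> predicted label distribution g(x)
    alphas: (alpha_LL, alpha_LU, alpha_UU)
    """
    alpha_ll, alpha_lu, alpha_uu = alphas
    by_labeled_endpoints = {2: alpha_ll, 1: alpha_lu, 0: alpha_uu}
    graph_cost = 0.0
    for u, v, w in edges:
        n_labeled = int(u in labels) + int(v in labels)
        graph_cost += (by_labeled_endpoints[n_labeled] * w
                       * l2_distance(hidden[u], hidden[v]))
    supervised_cost = sum(cross_entropy(probs[n], y) for n, y in labels.items())
    return supervised_cost + graph_cost


# Toy example: one labeled node "a" and two unlabeled nodes.
toy_edges = [("a", "b", 1.0), ("b", "c", 1.0)]
toy_labels = {"a": 0}
toy_hidden = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [3.0, 4.0]}
toy_probs = {"a": [0.5, 0.5], "b": [0.2, 0.8], "c": [0.2, 0.8]}

supervised_only = ngm_objective(toy_edges, toy_labels, toy_hidden, toy_probs,
                                alphas=(0.0, 0.0, 0.0))
graph_regularised = ngm_objective(toy_edges, toy_labels, toy_hidden, toy_probs,
                                  alphas=(0.1, 0.2, 0.3))
```

With all weights set to zero the value reduces to the supervised loss alone; with non-zero weights the labeled-unlabeled edge `a`-`b` adds a penalty proportional to the distance between the two hidden representations, while the edge `b`-`c` adds nothing because the representations coincide.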
3.1 Connections to previous methods
The graph-dependent $\alpha$ hyperparameters control the balance of the contributions of different edge types. When $\alpha_i = 0$ for all $i$, the proposed objective ignores the similarity constraint and becomes a supervised-only neural network objective as in equation 1. When only $\alpha_1 \neq 0$, the training cost has an additional term for labeled nodes, that acts as a regularizer. When $g_\theta(x) = h_\theta(x) = \hat{y}$, where $\hat{y}$ is the label distribution, the individual cost functions ($c$ and $d$) are squared $\ell_2$ norms, and the objective is trained using $\hat{y}$ directly instead of $\theta$, we arrive at the label propagation objective in equation 2. Therefore, the proposed objective could be thought of as a non-linear version of the label propagation objective, and a graph-regularized version of the neural network training objective.
3.2 Network inputs and graph construction
Similar to graph-based label propagation, the choice of the input graphs is critical to correctly bias the neural network’s prediction. Depending on the type of the graphs and nodes in the graph, they can be readily available, as with social networks or protein linking networks, or they can be constructed (a) using generic graphs such as Knowledge Bases, which consist of relationship links between entities, (b) using embeddings learnt by an unsupervised learning technique, or, (c) using sparse feature representations for each vertex. Additionally, the proposed training objective can be easily modified for directed graphs.
We have discussed using node-level features as inputs to the neural network. In the absence of such inputs, our training scheme can still be deployed using input features derived from the graph itself. We show in figure 2 and in experiments that neighbourhood information, such as rows in the adjacency matrix, is simple to construct, yet provides powerful inputs to the network. These features can also be combined with existing features.
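The construction of graph-derived inputs can be sketched as follows. This is a minimal illustration that assumes an undirected weighted edge list; the helper name `adjacency_rows` is hypothetical, and a dense row per node is used only for clarity (a sparse representation would be needed for large graphs).

```python
def adjacency_rows(nodes, edges):
    """Return, for each node, its row of the (weighted) adjacency matrix.

    nodes: list of node identifiers, fixing the column order
    edges: list of (u, v, weight) for an undirected graph
    """
    index = {n: i for i, n in enumerate(nodes)}
    rows = {n: [0.0] * len(nodes) for n in nodes}
    for u, v, w in edges:
        rows[u][index[v]] = w
        rows[v][index[u]] = w  # undirected: mirror the entry
    return rows


features = adjacency_rows(["a", "b", "c"], [("a", "b", 1.0), ("b", "c", 0.5)])
# features["b"] is the input vector for node b: [1.0, 0.0, 0.5]
```

Each row can be fed to the network directly, or concatenated with node-level features when these are available.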
When the number of graph nodes is high, this construction can have a high complexity and result in a large number of input features. This can be avoided in several ways: (i) clustering the nodes and using the cluster assignments and similarities, (ii) learning an embedding function of nodes (Perozzi et al., 2014), or (iii) sampling the neighbourhood/context (Yang et al., 2016). In practice, we observe that the input space can be bounded by a constant, even for massive graphs, with efficient scalable methods like unsupervised propagation (i.e., propagating node identity labels across the graph and selecting ones with highest support as input features to neural graph machines).
3.3 Optimization
The proposed objective function in equation 3 has several summations over the labeled points and edges, and can be equivalently written as follows,
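$$C_{\mathrm{NGM}}(\theta) = \sum_{(u,v) \in E_{LL}} \Big[\alpha_1 w_{uv}\, d\big(h_\theta(x_u), h_\theta(x_v)\big) + c_{uv}\Big] + \sum_{(u,v) \in E_{LU}} \Big[\alpha_2 w_{uv}\, d\big(h_\theta(x_u), h_\theta(x_v)\big) + c_{u}\Big] + \sum_{(u,v) \in E_{UU}} \alpha_3 w_{uv}\, d\big(h_\theta(x_u), h_\theta(x_v)\big), \tag{4}$$
where
$$c_{uv} = \frac{1}{|u|}\, c\big(g_\theta(x_u), y_u\big) + \frac{1}{|v|}\, c\big(g_\theta(x_v), y_v\big), \qquad c_{u} = \frac{1}{|u|}\, c\big(g_\theta(x_u), y_u\big),$$
and $|u|$ and $|v|$ are the number of edges incident to vertices $u$ and $v$, respectively. The objective in its new form enables stochastic training to be deployed by sampling edges. In particular, in each training iteration, we use a minibatch of edges and obtain the stochastic gradients of the objective. To further reduce noise and speed up learning, we sample edges within a neighbourhood region, that is, we make sure some sampled edges have shared end nodes.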
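The equivalence between the per-node and per-edge forms can be checked numerically. The sketch below spreads each labeled node's supervised loss over its incident edges with weight one over its degree, sums the per-edge costs, and compares the result with the per-node form; it also shows a scaled minibatch estimate. The names (`edge_cost`, `minibatch_estimate`) and the toy losses and distances are hypothetical, and the minibatch uses plain uniform edge sampling rather than the neighbourhood-based sampling described above. Note that the equivalence relies on every labeled node having at least one incident edge.

```python
import random


def degrees(edges):
    """Number of edges incident to each node."""
    deg = {}
    for u, v, _ in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def edge_cost(edge, node_loss, pair_dist, alphas, deg):
    """Per-edge summand: graph term plus degree-normalised supervised losses."""
    u, v, w = edge
    labeled = [n for n in (u, v) if n in node_loss]
    alpha = {2: alphas[0], 1: alphas[1], 0: alphas[2]}[len(labeled)]
    graph_part = alpha * w * pair_dist[(u, v)]
    supervised_part = sum(node_loss[n] / deg[n] for n in labeled)
    return graph_part + supervised_part


def minibatch_estimate(edges, batch_size, rng, node_loss, pair_dist, alphas):
    """Estimate of the full objective from a uniform sample of edges."""
    deg = degrees(edges)
    batch = rng.sample(edges, batch_size)
    scale = len(edges) / batch_size
    return scale * sum(edge_cost(e, node_loss, pair_dist, alphas, deg)
                       for e in batch)


# Toy graph: a and b are labeled, c and d are not.
edges = [("a", "b", 1.0), ("b", "c", 0.5), ("c", "d", 2.0)]
node_loss = {"a": 0.7, "b": 0.2}  # supervised loss per labeled node
pair_dist = {("a", "b"): 0.3, ("b", "c"): 1.1, ("c", "d"): 0.4}
alphas = (0.1, 0.2, 0.3)  # (LL, LU, UU)

deg = degrees(edges)
per_edge_total = sum(edge_cost(e, node_loss, pair_dist, alphas, deg)
                     for e in edges)
per_node_total = (0.1 * 1.0 * 0.3 + 0.2 * 0.5 * 1.1 + 0.3 * 2.0 * 0.4
                  + sum(node_loss.values()))
full_batch = minibatch_estimate(edges, len(edges), random.Random(0),
                                node_loss, pair_dist, alphas)
```

Both totals agree, and a minibatch that covers all edges recovers the full objective exactly; smaller batches give a noisy estimate whose expectation is the full objective under uniform sampling.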
3.4 Complexity
The complexity of each epoch in training using equation 4 is $O(M)$, where $M = |E|$ is the number of edges in the graph. In the case where there is a large number of unlabeled-unlabeled edges, they potentially do not help learning and could be ignored, leading to a lower complexity. One strategy to include them is self-training, that is, to grow seeds or labeled nodes as we train the networks. We experimentally demonstrate this technique in section 4.4. Predictions at inference time can be made at the same cost as that of vanilla neural networks.
4 Experiments
In this section, we provide several experiments showing the efficacy of the proposed training objective on a wide range of tasks, datasets and network architectures. All the experiments are done using a TensorFlow implementation (Abadi et al., 2015).
4.1 Multi-label Classification of Nodes on Graphs
We first demonstrate our approach using a multi-label classification problem on nodes in a relationship graph. In particular, the BlogCatalog dataset (Agarwal et al., 2009), a network of social relationships between bloggers, is considered. This graph has 10,312 nodes, 333,983 edges and 39 labels per node, which represent the bloggers, their social connections and the bloggers’ interests, respectively. Following previous approaches in the literature (Grover & Leskovec, 2016; Agarwal et al., 2009), we train and make predictions using multiple one-vs-rest classifiers.
Since there are no provided features for each node, we use the rows of the adjacency matrix as input features, as discussed in section 3.2. Feedforward neural networks (FFNNs) with one hidden layer of 50 units are employed to map the constructed inputs to the node labels. As we use the test set to construct the graph and augment the training objective, the training in this experiment is transductive. Critically, to combat the unbalanced training set, we employ weighted sampling during training, i.e. making sure each minibatch has both positive and negative examples. In this experiment, we fix $\alpha_1$, $\alpha_2$ and $\alpha_3$ to be equal, and experiment with and use the $\ell_2$ metric to compute the distance $d$ between the hidden representations of the networks. In addition, we create a range of train/test splits by varying the number of training points being presented to the networks.
We compare our method (NGM-FFNN) against a two-stage approach that first uses node2vec (Grover & Leskovec, 2016) to generate node embeddings and then uses a linear one-vs-rest classifier for classification. The methods are evaluated using two metrics: Macro F1 and Micro F1. The average results for different train/test splits using our method and the baseline are included in table 1. In addition, we compare NGM-FFNN with a non-augmented FFNN in which $\alpha_i = 0$, i.e. no edge information is used during training. We observe that the graph-augmented training scheme performs better (6% relative improvement on Macro F1 when the training set size is 20% and 50% of the dataset) or comparably (when the training size is 80%) to the vanilla neural networks trained with no edge information. Both methods significantly outperform the approach that uses node embeddings and linear classifiers. We observe the same improvement over node2vec on the Micro F1 metric, and NGM-FFNN is comparable to vanilla FFNN ($\alpha_i = 0$) but outperforms other methods on the recall metric.
Table 1: Results on BlogCatalog for different train/test splits.

| Train / Dataset | NGM-FFNN | node2vec |
|---|---|---|
| 20% | 0.191 | 0.168 |
| 50% | 0.242 | 0.174 |
| 80% | 0.262 | 0.177 |

Note: the node2vec results differ from those reported in (Grover & Leskovec, 2016), since we treat the classifiers (one per label) independently. Both methods shown here use the exact same setting and training/test data splits.
These results demonstrate that using the graph itself as direct inputs to the neural network and letting the network figure out a non-linear mapping directly from the raw graph is more effective than the two-stage approach considered. More importantly, the results also show that using the graph information improves the performance in the limited data regime (for example, when the training set is only 20% or 50% of the dataset).
4.2 Text Classification using Character-level CNNs
We evaluate the proposed objective function on a multi-class text classification task using a character-level convolutional neural network (CNN). We use the AG news dataset from (Zhang et al., 2015), where the task is to classify a news article into one of 4 categories. Each category has 30,000 examples for training and 1,900 examples for testing. In addition to the train and test sets, there are 111,469 examples that are treated as unlabeled examples.
As there is no provided graph structure linking the articles, we create such a graph based on the embeddings of the articles. We restrict the graph construction to only the train set and the unlabeled examples and keep the test set only for evaluation. We use the Google News word2vec corpus to calculate the average embedding for each news article and use the cosine similarity of document embeddings as a similarity metric. Each node is restricted to have a maximum of 5 neighbors.
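A graph of this kind can be built by linking each document to its most similar documents under cosine similarity. The sketch below is a brute-force illustration with hypothetical names (`knn_graph`, `doc_embeddings`) and toy two-dimensional vectors; it is not the construction pipeline used in the experiments, and an approximate nearest-neighbour method would be needed at the scale of the full dataset.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_graph(embeddings, k=5):
    """Link every document to at most k most similar documents.

    embeddings: dict doc_id -> averaged word embedding (list of floats)
    Returns a dict doc_id -> list of (neighbor_id, similarity), best first.
    """
    neighbors = {}
    for u, vec_u in embeddings.items():
        scored = sorted(
            ((cosine(vec_u, vec_v), v) for v, vec_v in embeddings.items() if v != u),
            reverse=True,
        )
        neighbors[u] = [(v, sim) for sim, v in scored[:k]]
    return neighbors


# Toy document embeddings (in practice: the mean of the word vectors).
doc_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
}
graph = knn_graph(doc_embeddings, k=1)
```

In this sketch the neighbour lists are directed, so each node has at most `k` listed neighbours; the similarity scores can serve as edge weights.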
We construct the CNN in the same way as (Zhang et al., 2015) and pick their competitive “small CNN” as our baseline for a more reasonable comparison to our setup. Our approach employs the same network, but with a significantly smaller number of convolutional layers and smaller layer sizes, as shown in table 2.
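Table 2: Settings of the baseline “small CNN” and our “tiny CNN”.

| Setting | Baseline | Our “tiny CNN” |
|---|---|---|
| # of conv. layers | 6 | 3 |
| Frame size in conv. layers | 256 | 32 |
| # of FC layers | 3 | 3 |
| Hidden units in FC layers | 1024 | 256 |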
The networks are trained with the same hyperparameters as reported in (Zhang et al., 2015). We observed that the model converged within 20 epochs (the model loss did not change much) and hence used this as a stopping criterion for this task. Experiments also showed that running the network for longer did not change the qualitative performance. We use the cross entropy loss on the final outputs of the network, that is $d\big(g_\theta(x_u), g_\theta(x_v)\big) = \text{cross-entropy}\big(g_\theta(x_u), g_\theta(x_v)\big)$, to compute the distance between nodes on an edge. In addition, we also experiment with a data augmentation technique using an English thesaurus, as done in (Zhang et al., 2015).
We compare the “tiny CNN” trained using the proposed objective function with the baseline using the accuracy on the test set in table 3. Our approach outperforms the baseline, and graph augmentation provides a 1.8% absolute and 2.1% relative improvement in accuracy over the same “tiny CNN” trained without it (86.90% versus 85.07%), despite using a much smaller network than the baseline. In addition, our model with graph augmentation trains much faster and produces results on par with or better than the performance of a significantly larger network, “large CNN” (Zhang et al., 2015), which has an accuracy of 87.18% without using a thesaurus, and 86.61% with the thesaurus.
Table 3: Test accuracy on the AG news dataset.

| Network | Accuracy % |
|---|---|
| Baseline | 84.35 |
| Baseline with thesaurus augmentation | 85.20 |
| Our “tiny” CNN | 85.07 |
| Our “tiny” CNN with NGM | 86.90 |
4.3 Semantic Intent Classification using LSTM RNNs
We compare the performance of our approach for training RNN sequence models (LSTM) on a semantic intent classification task as described in the recent work on SmartReply (Kannan et al., 2016) for automatically generating short email responses. One of the underlying tasks in SmartReply is to discover and map short response messages to semantic intent clusters (for details regarding SmartReply and how the semantic intent clusters are generated, see Kannan et al., 2016). We chose 20 intent classes and created a dataset comprising 5,483 samples (3,832 for training, 560 for validation and 1,091 for testing). Each sample instance corresponds to a short response message text paired with a semantic intent category that was manually verified by human annotators. For example, “That sounds awesome!” and “Sounds fabulous” belong to the “sounds good” intent cluster.
We construct a sparse graph in a similar manner as in the news categorization task, using word2vec embeddings over the message text and computing similarity to generate a response message graph with fixed node degree (k=10). We use $\ell_2$ for the distance metric $d(\cdot)$ and choose the $\alpha$ hyperparameters based on the development set.
We run the experiments for a fixed number of time steps and pick the best results on the development set. A multilayer LSTM architecture (2 layers, 100 dimensions) is used for the RNN sequence model. The LSTM model and its NGM variant are also compared against two other baseline systems: the Random baseline ranks the intent categories randomly, and the Frequency baseline ranks them in order of their frequency in the training corpus. To evaluate the intent prediction quality of different approaches, for each test instance $i$, we compute the rank $\mathrm{rank}_i$ of the actual intent category with respect to the ranking produced by the method and use this to calculate the Mean Reciprocal Rank:
$$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i},$$
where $N$ is the number of test instances.
We show in table 4 that LSTM RNNs with our proposed graph-augmented training objective function outperform standard baselines by achieving a better MRR.
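For reference, the Mean Reciprocal Rank used in this evaluation can be computed as in the short sketch below. The function name and the toy rankings are hypothetical; the sketch assumes that every ranking contains the gold category exactly once.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1 / rank of the gold category over all test instances.

    rankings: list of ranked category lists, best first
    gold:     list of the true category for each instance
    """
    total = 0.0
    for ranked, true_category in zip(rankings, gold):
        rank = ranked.index(true_category) + 1  # ranks start at 1
        total += 1.0 / rank
    return total / len(gold)


# Three toy test instances whose true intent is ranked 1st, 2nd and 3rd.
toy_rankings = [
    ["sounds good", "thanks", "no"],
    ["thanks", "sounds good", "no"],
    ["no", "thanks", "sounds good"],
]
toy_gold = ["sounds good", "sounds good", "sounds good"]
mrr = mean_reciprocal_rank(toy_rankings, toy_gold)  # (1 + 1/2 + 1/3) / 3
```

An MRR of 1 means the true category is always ranked first; lower values indicate that it tends to appear further down the ranking.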
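Table 4: Results for the semantic intent classification task.

| Model | Mean Reciprocal Rank (MRR) |
|---|---|
| Random | 0.175 |
| Frequency | 0.258 |
| LSTM | 0.276 |
| NGM-LSTM | 0.284 |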
4.4 Low-supervision Document Classification
Finally, we compare our method on a task with very limited supervision: the PubMed document classification problem (Sen et al., 2008). The task is to classify each document into one of 3 classes, with each document being described by a TF-IDF weighted word vector. The graph is available as a citation network: two documents are connected to each other if one cites the other. The graph has 19,717 nodes and 44,338 edges, with each class having 20 seed nodes and 1000 test nodes. In our experiments we exclude the test nodes from the graph entirely, training only on the labeled and unlabeled nodes.
We train a feedforward neural network (FFNN) with two hidden layers with 250 and 100 neurons, using the $\ell_2$ distance metric on the last hidden layer. The NGM-FFNN model is trained with , while the baseline FFNN is trained with $\alpha_i = 0$ (i.e., a supervised-only model). We use self-training to train the model, starting with just the 60 seed nodes (20 per class) as training data. The amount of training data is iteratively increased by assigning labels to the immediate neighbors of the labeled nodes and retraining the model. For the self-trained NGM-FFNN model, this strategy results in incrementally growing the neighborhood and thereby the $E_{LL}$ and $E_{LU}$ edges in the objective of equation 4.
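One round of the seed-growing step can be sketched as follows. This is an illustration under assumptions: the helper name `grow_seeds` is hypothetical, the new labels are taken to be the current model's predictions (the text does not specify how they are assigned), and the retraining between rounds is only indicated by a comment.

```python
def grow_seeds(labels, neighbors, predict):
    """Label the immediate unlabeled neighbours of the currently labeled nodes.

    labels:    dict labeled node -> class index
    neighbors: dict node -> list of adjacent nodes
    predict:   callable node -> class index (the current model; an assumption)
    """
    grown = dict(labels)
    for v in labels:
        for u in neighbors.get(v, ()):
            if u not in grown:
                grown[u] = predict(u)
    return grown


# Toy chain a - b - c - d - e with one seed at each end.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
seeds = {"a": 0, "e": 1}
toy_model = {"b": 0, "c": 0, "d": 1}.get  # stand-in for a trained network

round_1 = grow_seeds(seeds, chain, toy_model)
# ... retrain the network on round_1 before growing the seed set again ...
round_2 = grow_seeds(round_1, chain, toy_model)
```

Each round turns some unlabeled-unlabeled edges into labeled-unlabeled or labeled-labeled ones, so more of the graph contributes to the supervised part of the objective as training proceeds.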
We compare the final NGM-FFNN model against the FFNN baseline and other techniques reported in (Yang et al., 2016) including the Planetoid models (Yang et al., 2016), semi-supervised embedding (Weston et al., 2012), manifold regularization (Belkin et al., 2006), transductive SVM (Joachims, 1999), label propagation (Zhu et al., 2003), graph embeddings (Perozzi et al., 2014) and a linear softmax model. Full results are included in table 5.
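Table 5: Results for document classification on the PubMed dataset.

| Method | Accuracy |
|---|---|
| Linear + Softmax | 0.698 |
| Semi-supervised embedding | 0.711 |
| Manifold regularization | 0.707 |
| Transductive SVM | 0.622 |
| Label propagation | 0.630 |
| Graph embedding | 0.653 |
| Planetoid-I | 0.772 |
| Planetoid-G | 0.664 |
| Planetoid-T | 0.757 |
| Feedforward NN | 0.709 |
| NGM-FFNN | 0.759 |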
The results show that the NGM model (without any tuning) outperforms many baselines including FFNN, semi-supervised embedding, manifold regularization and Planetoid-G/Planetoid-T, and compares favorably to Planetoid-I. Most importantly, this result demonstrates that the graph augmentation scheme can lead to better regularised neural networks, especially in the low-sample regime (20 samples per class in this case). We believe that with tuning, NGM accuracy can be improved even further.
5 Conclusions
We have revisited graph-augmentation training of neural networks and proposed Neural Graph Machines as a general framework for doing so. Its objective function encourages the neural networks to make accurate node-level predictions, as in vanilla neural network training, and constrains the networks to learn similar hidden representations for nodes connected by an edge in the graph. Importantly, the objective can be trained by stochastic gradient descent and scaled to large graphs.
We validated the efficacy of the graph-augmented objective on various tasks including bloggers’ interest, text category and semantic intent classification problems, using a wide range of neural network architectures (FFNNs, CNNs and LSTM RNNs). The experimental results demonstrated that graph-augmented training almost always helps to find better neural networks that outperform other techniques in predictive performance, or even much smaller networks that are faster and easier to train. Additionally, the node-level input features can be combined with graph features as inputs to the neural networks. We showed that a neural network that simply takes the adjacency matrix of a graph and produces node labels can perform better than a recently proposed two-stage approach using sophisticated graph embeddings and a linear classifier. Our framework also excels when the neural network is small, or when there is limited supervision available.
While our objective can be applied to multiple graphs which come from different domains, we have not fully explored this aspect and leave this as future work. We expect the domain-specific networks can interact with the graphs to determine the importance of each domain/graph source in prediction. We also did not explore using graph regularisation for different hidden layers of the neural networks; we expect this is key for the multi-graph transfer setting (Yosinski et al., 2014). Another possible future extension is to use our objective on directed graphs, that is, to control the direction of influence between nodes during training.
Acknowledgements
We would like to thank the Google Expander team for insightful feedback.
References
 Abadi et al. (2015) Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Agarwal et al. (2009) Agarwal, Nitin, Liu, Huan, Murthy, Sudheendra, Sen, Arunabha, and Wang, Xufei. A social identity approach to identify familiar strangers in a social network. In 3rd International AAAI Conference on Weblogs and Social Media (ICWSM-09), 2009.
 Belkin et al. (2006) Belkin, Mikhail, Niyogi, Partha, and Sindhwani, Vikas. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
 Bengio et al. (2006) Bengio, Yoshua, Delalleau, Olivier, and Le Roux, Nicolas. Label propagation and quadratic criterion. In Chapelle, O, Schölkopf, B, and Zien, A (eds.), Semi-supervised learning, pp. 193–216. MIT Press, 2006.
 Defferrard et al. (2016) Defferrard, Michaël, Bresson, Xavier, and Vandergheynst, Pierre. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3837–3845, 2016.
 Grover & Leskovec (2016) Grover, Aditya and Leskovec, Jure. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 855–864, 2016.
 Hinton et al. (2012) Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 Joachims (1999) Joachims, Thorsten. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning, 1999.
 Kannan et al. (2016) Kannan, Anjuli, Kurach, Karol, Ravi, Sujith, Kaufmann, Tobias, Tomkins, Andrew, Miklos, Balint, Corrado, Greg, Lukacs, Laszlo, Ganea, Marina, Young, Peter, and Ramavajjala, Vivek. Smart reply: Automated response suggestion for email. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2016.
 Kingma et al. (2014) Kingma, Diederik P, Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
 Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
 Lee (2013) Lee, Dong-Hyun. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML 2013 Workshop: Challenges in Representation Learning (WREPL), 2013.
 Perozzi et al. (2014) Perozzi, Bryan, Al-Rfou, Rami, and Skiena, Steven. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. ACM, 2014.
 Ravi & Diao (2016) Ravi, Sujith and Diao, Qiming. Large scale distributed semi-supervised learning using streaming approximation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 519–528, 2016.
 Sen et al. (2008) Sen, Prithviraj, Namata, Galileo, Bilgic, Mustafa, Getoor, Lise, Galligher, Brian, and Eliassi-Rad, Tina. Collective classification in network data. AI Magazine, 29(3):93, 2008.
 Sutskever et al. (2014) Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
 Weston et al. (2012) Weston, Jason, Ratle, Frédéric, Mobahi, Hossein, and Collobert, Ronan. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. Springer, 2012.
 Yang et al. (2016) Yang, Zhilin, Cohen, William, and Salakhudinov, Ruslan. Revisiting semi-supervised learning with graph embeddings. In Proceedings of The 33rd International Conference on Machine Learning, pp. 40–48, 2016.
 Yosinski et al. (2014) Yosinski, Jason, Clune, Jeff, Bengio, Yoshua, and Lipson, Hod. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.
 Zhang et al. (2015) Zhang, Xiang, Zhao, Junbo, and LeCun, Yann. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649–657, 2015.
 Zhu et al. (2003) Zhu, X, Ghahramani, Z, and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-2003), volume 2, pp. 912–919. AAAI Press, 2003.
 (22) Zhu, Xiaojin and Ghahramani, Zoubin. Learning from labeled and unlabeled data with label propagation. Technical report, School of Computer Science, Carnegie Mellon University.