Neural Graph Machines: Learning Neural Networks Using Graphs

Thang D. Bui (work done during an internship at Google), Sujith Ravi, Vivek Ramavajjala

Label propagation is a powerful and flexible semi-supervised learning technique on graphs. Neural networks, on the other hand, have proven track records in many supervised learning tasks. In this work, we propose a training framework with a graph-regularised objective, namely Neural Graph Machines, that can combine the power of neural networks and label propagation. This work generalises previous literature on graph-augmented training of neural networks, enabling it to be applied to multiple neural architectures (Feed-forward NNs, CNNs and LSTM RNNs) and a wide range of graphs. The new objective allows the neural networks to harness both labeled and unlabeled data by: (a) allowing the network to train using labeled data as in the supervised setting, (b) biasing the network to learn similar hidden representations for neighboring nodes on a graph, in the same vein as label propagation. Such architectures with the proposed objective can be trained efficiently using stochastic gradient descent and scaled to large graphs, with a runtime that is linear in the number of edges. The proposed joint training approach convincingly outperforms many existing methods on a wide range of tasks (multi-label classification on social graphs, news categorization, document classification and semantic intent classification), with multiple forms of graph inputs (including graphs with and without node-level features) and using different types of neural networks.


University of Cambridge, United Kingdom; Google Research, Mountain View, CA, USA

1 Introduction

Semi-supervised learning is a powerful machine learning paradigm that can improve prediction performance over techniques that use only labeled data by leveraging a large amount of unlabeled data. The need for semi-supervised learning arises in many problems in computer vision, natural language processing and social networks, in which labeled datapoints are expensive to obtain or unlabeled data is abundant and readily available.

There exists a plethora of semi-supervised learning methods. The simplest one uses bootstrapping techniques to generate pseudo-labels for unlabeled data from a system trained on labeled data. However, this suffers from label error feedback (Lee, 2013). In a similar vein, autoencoder-based methods often need to rely on a two-stage approach: train an autoencoder using unlabeled data to generate an embedding mapping, and use the learnt embeddings for prediction. In practice, this procedure is often costly and inaccurate. Another example is transductive SVMs (Joachims, 1999), which are too computationally expensive to be used for large datasets. Methods that are based on generative models and amortized variational inference (Kingma et al., 2014) can work well for images and videos, but it is not immediately clear how to extend such techniques to handle sparse and multi-modal inputs or graphs over the inputs.

In contrast to the methods above, graph-based techniques such as label propagation (Zhu & Ghahramani; Bengio et al., 2006) often provide a versatile, scalable, and yet effective solution to a wide range of problems. These methods construct a smooth graph over the unlabeled and labeled data. Graphs are also often a natural way to describe the relationships between nodes, such as similarities between embeddings, phrases or images, or connections between entities on the web or relations in a social network. Edges in the graph connect semantically similar nodes or datapoints, and if present, edge weights reflect how strong such similarities are. By providing a set of labeled nodes, such techniques iteratively refine the node labels by aggregating information from neighbours and propagate these labels to the nodes’ neighbours. In practice, these methods often converge quickly and can be scaled to large datasets with a large label space (Ravi & Diao, 2016). We build upon the principle behind label propagation for our method.

Another key motivation of our work is the recent advances in neural networks and their performance on a wide variety of supervised learning tasks such as image and speech recognition or sequence-to-sequence learning (Krizhevsky et al., 2012; Hinton et al., 2012; Sutskever et al., 2014). Such results are, however, conditioned on training very large networks on large datasets, which may need millions of labeled training input-output pairs. This raises the question: can we harness previous state-of-the-art semi-supervised learning techniques to jointly train neural networks using limited labeled data and unlabeled data, and so improve their performance?


We propose a discriminative training objective for neural networks with graph augmentation, that can be trained with stochastic gradient descent and efficiently scaled to large graphs. The new objective has a regularization term for generic neural network architectures that enforces similarity between nodes in the graphs, which is inspired by the objective function of label propagation. In particular, we show that:

  • Graph-augmented neural network training can work for a wide range of neural networks, such as feed-forward, convolutional and recurrent networks. Additionally, this technique can be used in both inductive and transductive settings. It also helps learning in the low-sample regime (a small number of labeled nodes), which cannot be handled by vanilla neural network training.

  • The framework can handle multiple forms of graphs, either naturally given or constructed based on embeddings and knowledge bases.

  • Using graphs and neighbourhood information alone as direct inputs to neural networks in this joint training framework permits fast and simple inference, yet provides competitive performance with current state-of-the-art approaches which employ a two-step method of first training a node embedding representation from the graph and then using it as feature input to train a classifier separately (see section 4.1).

  • As a by-product, our proposed framework provides a simple technique for finding smaller and faster neural networks that offer competitive performance with larger and slower non graph-augmented alternatives (see section 4.2).

We experimentally show that the proposed training framework outperforms or performs favourably against state-of-the-art approaches on a variety of prediction tasks and datasets, involving text features and/or graph inputs, and on many different neural network architectures (see section 4).

The paper is organized as follows: we first review some background and literature, and relate them to our approach in section 2; we then detail the training objective and its properties in section 3; and finally we validate our approach on a range of experiments in section 4.

2 Background and related works

In this section, we will lay out the groundwork for our proposed training objective in section 3.

2.1 Neural network learning

Neural networks are a class of non-linear mappings from inputs to outputs, composed of multiple layers that can potentially learn useful representations for predicting the outputs. We will view various models such as feed-forward neural networks, recurrent neural networks and convolutional networks under the same umbrella. Given a set of $N$ training input-output pairs $\{x_n, y_n\}_{n=1}^{N}$, such neural networks are often trained by performing maximum likelihood learning, that is, tuning their parameters so that the networks’ outputs are close to the ground truth under some criterion,

$$\mathcal{C}_{\mathrm{NN}}(\theta) = \sum_{n} c(g_\theta(x_n), y_n), \tag{1}$$

where $g_\theta(\cdot)$ denotes the overall mapping, parameterized by $\theta$, and $c(\cdot)$ denotes a loss function such as $\ell_2$ for regression or cross entropy for classification. The cost function $c$ and the mapping $g$ are typically differentiable w.r.t. $\theta$, which facilitates optimisation via gradient descent. Importantly, this can be scaled to a large number of training instances by employing stochastic training using minibatches of data. However, it is not clear how unlabeled data, if available, can be treated using this objective, or whether extra information about the training set, such as relational structure, can be used.

2.2 Graph-based semi-supervised learning

In this section, we provide a concise introduction to graph-based semi-supervised learning using label propagation and its training objective. Suppose we are given a graph $G = (V, E, W)$ where $V$ is the set of nodes, $E$ the set of edges and $W$ the edge weight matrix. Let $V_l$ and $V_u$ be the labeled and unlabeled nodes in the graph. The goal is to predict a soft assignment of labels for each node in the graph, $\hat{Y}$, given the training label distribution for the seed nodes, $Y$. Mathematically, label propagation performs minimization of the following convex objective function, for $L$ labels,

$$\mathcal{C}_{\mathrm{LP}}(\hat{Y}) = \mu_1 \sum_{v \in V_l} \lVert \hat{Y}_v - Y_v \rVert_2^2 + \mu_2 \sum_{v \in V,\, u \in \mathcal{N}(v)} w_{u,v} \lVert \hat{Y}_v - \hat{Y}_u \rVert_2^2 + \mu_3 \sum_{v \in V} \lVert \hat{Y}_v - U \rVert_2^2, \tag{2}$$

subject to $\sum_{l=1}^{L} \hat{Y}_{vl} = 1$, where $\mathcal{N}(v)$ is the neighbour node set of the node $v$, $U$ is the prior distribution over all labels, $w_{u,v}$ is the edge weight between nodes $u$ and $v$, and $\mu_1$, $\mu_2$, and $\mu_3$ are hyperparameters that balance the contribution of individual terms in the objective. The terms in the objective function above encourage that: (a) the label distribution of seed nodes should be close to the ground truth, (b) the label distribution of neighbouring nodes should be similar, and (c) if relevant, the label distribution should stay close to our prior belief. This objective function can be solved efficiently using iterative methods such as the Jacobi procedure. That is, in each step, each node aggregates the label distributions from its neighbours and adjusts its own distribution, which is then repeated until convergence. In practice, the iterative updates can be done in parallel or in a distributed fashion, which then allows large graphs with a large number of nodes and labels to be trained efficiently. Bengio et al. (2006) and Ravi & Diao (2016) provide good surveys on the topic for interested readers.
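The Jacobi procedure described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper: it assumes a uniform prior, counts each undirected edge once per endpoint (any constant factor is absorbed into the smoothness weight), and the names `label_propagation`, `mu1`, `mu2` and `mu3` are chosen here for illustration. Each update is the closed-form minimiser of the three-term objective for one node with its neighbours held fixed.

```python
def label_propagation(neighbors, seeds, num_labels,
                      mu1=1.0, mu2=0.5, mu3=0.01, num_iters=50):
    """Jacobi-style label propagation (illustrative sketch).

    neighbors: dict mapping node -> list of (neighbor, edge_weight)
    seeds:     dict mapping labeled node -> label distribution (list of floats)
    """
    prior = [1.0 / num_labels] * num_labels  # uniform prior (an assumption)
    y_hat = {v: list(seeds.get(v, prior)) for v in neighbors}
    for _ in range(num_iters):
        updated = {}
        for v, nbrs in neighbors.items():
            seed_weight = mu1 if v in seeds else 0.0
            target = seeds.get(v, prior)  # ignored when seed_weight is 0
            denom = seed_weight + mu2 * sum(w for _, w in nbrs) + mu3
            updated[v] = [
                (seed_weight * target[l]
                 + mu2 * sum(w * y_hat[u][l] for u, w in nbrs)
                 + mu3 * prior[l]) / denom
                for l in range(num_labels)
            ]
        y_hat = updated  # Jacobi: every node reads the previous iterate
    return y_hat


# Toy chain 0 - 1 - 2 with the two end nodes labeled with different classes.
chain = {0: [(1, 1.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(1, 1.0)]}
result = label_propagation(chain, {0: [1.0, 0.0], 2: [0.0, 1.0]}, num_labels=2)
```

Because every update is a convex combination of distributions, each node's estimate stays normalised without an explicit projection step.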

There are many variants of label propagation that could be viewed as optimising modified versions of eq. 2. For example, manifold regularization (Belkin et al., 2006) replaces the label distribution by a Reproducing Kernel Hilbert Space mapping from input features. Similarly, Weston et al. (2012) also employ such a mapping but use a feed-forward neural network instead. Both methods can be classified as inductive learning algorithms, whereas the original label propagation algorithm is transductive (Yang et al., 2016).

These aforementioned methods are closest to our proposed approach; however, there are key differences. Our work generalizes previously proposed frameworks for graph-augmented training of neural networks (e.g., Weston et al. (2012)) and extends it to new settings, for example, when there is only graph input and no features are available. Unlike the previous works, we show that the graph augmented training method can work with multiple neural network architectures (Feed-forward NNs, CNNs, RNNs) and on multiple prediction tasks and datasets using natural as well as constructed graphs. The experiment results (see section 4) clearly validate the effectiveness of this method in all these different settings, in both inductive and transductive learning paradigms. Besides the methodology, our study also presents an important contribution towards assessing the effectiveness of graph combined neural networks as a generic training mechanism for different architectures and problems, which was not well studied in previous work.

More recently, graph embedding techniques have been used to create node embeddings that encode local structures of the graph and the provided node labels (Perozzi et al., 2014; Yang et al., 2016). These techniques target learning better node representations to be used for other tasks such as node classification. In this work, we aim to directly learn better predictive models from the graph. We compare our method to these two-stage (embedding + classifier) techniques in several experiments in section 4.

Our work is also different from, and orthogonal to, recent works on using neural networks on graphs. For example, Defferrard et al. (2016) employ spectral graph convolution to create a neural-network-like classifier. However, these approaches require many approximations to arrive at a practical implementation. Here, we advocate a training objective that uses graphs to augment neural network learning, and works with many forms of graphs and with any type of neural network.

3 Neural graph machines

Figure 1: A: An example of a graph and feature inputs. In this case, there are two labeled nodes and one unlabeled node, and two edges. The feature vectors, one for each node, are used as neural network inputs. B, C and D: Illustration of Neural Graph Machine for feed-forward, convolutional and recurrent networks respectively: the training flow ensures that the neural net makes accurate node-level predictions and biases the hidden representations/embeddings of neighbouring nodes to be similar. In this example, the hidden representations of two nodes are forced to be similar because there is an edge connecting those nodes.

In this section, we devise a discriminative training objective for neural networks, that is inspired by the label propagation objective and uses both labeled and unlabeled data, and can be trained by stochastic gradient descent.

First, we take a close look at the two objective functions discussed in section 2. The label propagation objective (equation 2) encourages the predicted label distributions of neighbouring nodes to be similar, and those of labeled nodes to be close to the ground truth. For example, if a cat image and a dog image are strongly connected in a graph, and if the cat node is labeled as animal, the predicted probability of the dog node being animal is also high. In contrast, the neural network training objective (equation 1) only takes into account the labeled instances, and ensures correct predictions on the training set. As a consequence, a neural network trained on the cat image alone will not make an accurate prediction on the dog image.

This shortcoming of neural network training can be rectified by biasing the network using prior knowledge about the relationship between instances in the dataset. In particular, for the domains we are interested in, training instances (either labeled or unlabeled) that are connected in a graph, such as the dog and cat in the above example, should have similar predictions. This can be done by encouraging neighboring data points to have a similar hidden representation learnt by a neural network, resulting in a modified objective function for training neural network architectures using both labeled and unlabeled datapoints. We call architectures trained using this objective Neural Graph Machines, and schematically illustrate the concept in figure 1. The proposed objective function is a weighted sum of the neural network cost and the label propagation cost as follows,

$$\mathcal{C}_{\mathrm{NGM}}(\theta) = \sum_{n \in V_l} c(g_\theta(x_n), y_n) + \alpha_1 \sum_{(u,v) \in E_{LL}} w_{uv}\, d(h_\theta(x_u), h_\theta(x_v)) + \alpha_2 \sum_{(u,v) \in E_{LU}} w_{uv}\, d(h_\theta(x_u), h_\theta(x_v)) + \alpha_3 \sum_{(u,v) \in E_{UU}} w_{uv}\, d(h_\theta(x_u), h_\theta(x_v)), \tag{3}$$

where $E_{LL}$, $E_{LU}$, and $E_{UU}$ are sets of labeled-labeled, labeled-unlabeled and unlabeled-unlabeled edges correspondingly, $h_\theta(\cdot)$ represents the hidden representations of the inputs produced by the neural network, $d(\cdot)$ is a distance metric, and $\alpha_1$, $\alpha_2$, $\alpha_3$ are hyperparameters. Note that we have separated the terms based on the edge types, as these can affect the training differently.

In practice, we choose an $\ell_1$ or $\ell_2$ distance metric for $d(\cdot)$, and $h_\theta(x)$ to be the last layer of the neural network. However, these choices can be changed to a customized metric, or to using an intermediate hidden layer instead.
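As a concrete reading of the objective described above (a supervised cost on labeled nodes plus weighted distances between the hidden representations of connected nodes, with a separate weight per edge type), here is a minimal sketch. The function and argument names (`ngm_objective`, `alphas`, `feats`) are hypothetical, the toy "network" is just a softmax over the raw features with the identity as its hidden representation, and the $\ell_2$ distance is one of the two metrics mentioned in the text.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cross_entropy(probs, label):
    return -math.log(probs[label])


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def ngm_objective(g, h, feats, labels, edges, alphas,
                  d=l2_distance, c=cross_entropy):
    """Supervised cost plus graph regulariser.

    g: maps features to output probabilities; h: maps features to a hidden vector
    labels: dict node -> class index (labeled nodes only)
    edges: list of (u, v, weight); alphas: weights for (LL, LU, UU) edges
    """
    alpha_ll, alpha_lu, alpha_uu = alphas
    total = sum(c(g(feats[n]), y) for n, y in labels.items())
    for u, v, w in edges:
        n_labeled = (u in labels) + (v in labels)  # 0, 1 or 2 labeled endpoints
        alpha = (alpha_uu, alpha_lu, alpha_ll)[n_labeled]
        total += alpha * w * d(h(feats[u]), h(feats[v]))
    return total


# Toy data: nodes 0 and 1 are labeled, node 2 is unlabeled.
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
labels = {0: 0, 1: 1}
edges = [(0, 1, 1.0), (0, 2, 1.0)]
identity = lambda x: x  # stand-in for the last hidden layer

supervised_only = ngm_objective(softmax, identity, feats, labels, edges,
                                (0.0, 0.0, 0.0))
regularised = ngm_objective(softmax, identity, feats, labels, edges,
                            (0.5, 1.0, 1.0))
```

With all three weights set to zero the graph term vanishes and only the supervised cost remains. In a real setting `g` and `h` would share parameters and the total would be minimised by a gradient-based optimiser.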

3.1 Connections to previous methods

The graph-dependent hyperparameters $\alpha_1$, $\alpha_2$ and $\alpha_3$ control the balance of the contributions of different edge types. When $\alpha_i = 0$ for all $i$, the proposed objective ignores the similarity constraint and becomes a supervised-only neural network objective as in equation 1. When only $\alpha_1 \neq 0$, the training cost has an additional term for labeled nodes that acts as a regularizer. When $g_\theta(x) = h_\theta(x) = \hat{y}$, where $\hat{y}$ is the label distribution, the individual cost functions ($c$ and $d$) are squared $\ell_2$ norms, and the objective is trained using $\hat{y}$ directly instead of $\theta$, we arrive at the label propagation objective in equation 2. Therefore, the proposed objective could be thought of as a non-linear version of the label propagation objective, and a graph-regularized version of the neural network training objective.
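The reduction to label propagation can be written out explicitly. The sketch below makes the stated substitutions and additionally assumes a single shared edge weight $\alpha$ and no prior term; the symbols $\mu_1$, $\mu_2$, $\mu_3$ stand for the three label propagation hyperparameters (seed fit, neighbour smoothness, prior), and the proportionality constant depends on whether each undirected edge is counted once or twice.

```latex
% Assumptions: g_\theta(x_n) = h_\theta(x_n) = \hat{y}_n, c and d are squared \ell_2 norms,
% \alpha_1 = \alpha_2 = \alpha_3 = \alpha, and E = E_{LL} \cup E_{LU} \cup E_{UU}.
\begin{aligned}
\mathcal{C}_{\mathrm{NGM}}(\hat{y})
  &= \sum_{n \in V_l} \lVert \hat{y}_n - y_n \rVert_2^2
   + \alpha \sum_{(u,v) \in E} w_{uv} \lVert \hat{y}_u - \hat{y}_v \rVert_2^2 \\
  &= \mathcal{C}_{\mathrm{LP}}(\hat{Y})
     \quad \text{with } \mu_1 = 1,\ \mu_2 \propto \alpha,\ \mu_3 = 0 .
\end{aligned}
```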

3.2 Network inputs and graph construction

As in graph-based label propagation, the choice of the input graph is critical for correctly biasing the neural network’s predictions. Depending on the type of the graphs and nodes in the graph, they can be readily available to use, such as social networks or protein linking networks, or they can be constructed (a) using generic graphs such as Knowledge Bases, which consist of relationship links between entities, (b) using embeddings learnt by an unsupervised learning technique, or (c) using sparse feature representations for each vertex. Additionally, the proposed training objective can be easily modified for directed graphs.

We have discussed using node-level features as inputs to the neural network. In the absence of such inputs, our training scheme can still be deployed using input features derived from the graph itself. We show in figure 2 and in experiments that the neighbourhood information such as rows in the adjacency matrix are simple to construct, yet powerful inputs to the network. These features can also be combined with existing features.

Figure 2: Illustration of how we can construct inputs to the neural network using the adjacency matrix. In this example, we have three nodes and two edges. The feature vector created for each node (shown on the right) has 1’s at its index and indices of nodes that it’s adjacent to.
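The construction in figure 2 is simple enough to state in code. The sketch below assumes an undirected, unweighted graph with integer node ids and, since the figure itself is not reproduced here, a three-node chain as the example; `adjacency_features` is a name chosen for illustration.

```python
def adjacency_features(num_nodes, edges):
    """One binary feature vector per node: 1 at its own index and at its neighbours'."""
    feats = [[0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        feats[i][i] = 1  # the node's own index
    for u, v in edges:
        feats[u][v] = 1  # undirected edge: mark both directions
        feats[v][u] = 1
    return feats


# Three nodes, two edges (assumed to form the chain 0 - 1 - 2).
features = adjacency_features(3, [(0, 1), (1, 2)])
```

Each feature vector has one entry per node in the graph, so the input dimension grows with the graph size.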

When the number of graph nodes is high, this construction can have a high complexity and result in a large number of input features. This can be avoided in several ways: (i) clustering the nodes and using the cluster assignments and similarities, (ii) learning an embedding function of nodes (Perozzi et al., 2014), or (iii) sampling the neighbourhood/context (Yang et al., 2016). In practice, we observe that the input space can be bounded by a constant, even for massive graphs, with efficient scalable methods like unsupervised propagation (i.e., propagating node identity labels across the graph and selecting the ones with highest support as input features to neural graph machines).

3.3 Optimization

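The proposed objective function in equation 3 has several summations over the labeled points and edges, and can be equivalently written as follows,

$$\mathcal{C}_{\mathrm{NGM}}(\theta) = \sum_{(u,v) \in E_{LL}} \left[ \alpha_1 w_{uv}\, d(h_\theta(x_u), h_\theta(x_v)) + c_{uv} \right] + \sum_{(u,v) \in E_{LU}} \left[ \alpha_2 w_{uv}\, d(h_\theta(x_u), h_\theta(x_v)) + c_u \right] + \sum_{(u,v) \in E_{UU}} \alpha_3 w_{uv}\, d(h_\theta(x_u), h_\theta(x_v)), \tag{4}$$

where

$$c_{uv} = \frac{1}{|u|} c(g_\theta(x_u), y_u) + \frac{1}{|v|} c(g_\theta(x_v), y_v), \qquad c_u = \frac{1}{|u|} c(g_\theta(x_u), y_u),$$

and $|u|$ and $|v|$ are the number of edges incident to vertices $u$ and $v$, respectively; in the $E_{LU}$ term, $u$ denotes the labeled endpoint. Each labeled node $u$ appears in $|u|$ edge terms, each carrying a $1/|u|$ share of its supervised cost, so the shares sum back to the supervised term of equation 3. The objective in its new form enables stochastic training to be deployed by sampling edges. In particular, in each training iteration, we use a minibatch of edges and obtain the stochastic gradients of the objective. To further reduce noise and speed up learning, we sample edges within a neighbourhood region, that is, we make sure some sampled edges have shared end nodes.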
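The edge-wise form can be checked numerically on a toy graph. In the sketch below, each edge carries its graph-regularisation term plus a share of the supervised cost of each labeled endpoint, divided by that endpoint's degree; summing over all edges then recovers the node-wise objective, provided every labeled node has at least one edge. The helper names and the toy softmax "network" are assumptions for illustration, and the neighbourhood-aware sampling mentioned above is replaced by plain uniform edge sampling.

```python
import math
import random


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cross_entropy(probs, label):
    return -math.log(probs[label])


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def edge_loss(edge, feats, labels, degree, g, h, alphas):
    """Loss contribution of one edge; alphas = (LL, LU, UU) weights."""
    u, v, w = edge
    n_labeled = (u in labels) + (v in labels)
    loss = alphas[2 - n_labeled] * w * l2_distance(h(feats[u]), h(feats[v]))
    for node in (u, v):
        if node in labels:
            # Share of the node's supervised cost carried by this edge.
            loss += cross_entropy(g(feats[node]), labels[node]) / degree[node]
    return loss


feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
labels = {0: 0, 1: 1}
edges = [(0, 1, 1.0), (0, 2, 1.0)]
alphas = (0.5, 1.0, 1.0)
identity = lambda x: x

degree = {n: 0 for n in feats}
for u, v, _ in edges:
    degree[u] += 1
    degree[v] += 1

# Summing the edge-wise losses over all edges ...
per_edge_total = sum(
    edge_loss(e, feats, labels, degree, softmax, identity, alphas) for e in edges)

# ... should match the node-wise objective computed directly.
direct_total = (
    sum(cross_entropy(softmax(feats[n]), y) for n, y in labels.items())
    + sum(alphas[2 - ((u in labels) + (v in labels))] * w
          * l2_distance(feats[u], feats[v]) for u, v, w in edges))

# One stochastic step would use a sampled minibatch of edges instead.
rng = random.Random(0)
batch = rng.sample(edges, 1)
batch_loss = sum(
    edge_loss(e, feats, labels, degree, softmax, identity, alphas) for e in batch)
```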

3.4 Complexity

The complexity of each epoch in training using equation 4 is $\mathcal{O}(M)$, where $M = |E|$ is the number of edges in the graph. In the case where there is a large number of unlabeled-unlabeled edges, they potentially do not help learning and could be ignored, leading to a lower complexity. One strategy to include them is self-training, that is, to grow the set of seeds or labeled nodes as we train the networks. We experimentally demonstrate this technique in section 4.4. Predictions at inference time can be made at the same cost as that of vanilla neural networks.

4 Experiments

In this section, we provide several experiments showing the efficacy of the proposed training objective on a wide range of tasks, datasets and network architectures. All the experiments are done using a TensorFlow implementation (Abadi et al., 2015).

4.1 Multi-label Classification of Nodes on Graphs

We first demonstrate our approach using a multi-label classification problem on nodes in a relationship graph. In particular, the BlogCatalog dataset (Agarwal et al., 2009), a network of social relationships between bloggers is considered. This graph has 10,312 nodes, 333,983 edges and 39 labels per node, which represent the bloggers, their social connections and the bloggers’ interests, respectively. Following previous approaches in the literature (Grover & Leskovec, 2016; Agarwal et al., 2009), we train and make predictions using multiple one-vs-rest classifiers.

Since there are no provided features for each node, we use the rows of the adjacency matrix as input features, as discussed in section 3.2. Feed-forward neural networks (FFNNs) with one hidden layer of 50 units are employed to map the constructed inputs to the node labels. As we use the test set to construct the graph and augment the training objective, the training in this experiment is transductive. Critically, to combat the unbalanced training set, we employ weighted sampling during training, i.e. making sure each minibatch has both positive and negative examples. In this experiment, we set the edge-type hyperparameters $\alpha_1$, $\alpha_2$ and $\alpha_3$ to a common value, and use the $\ell_2$ metric to compute the distance between the hidden representations of the networks. In addition, we create a range of train/test splits by varying the number of training points being presented to the networks.
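One simple way to guarantee that every minibatch contains both positive and negative examples is to draw from the two pools separately. The sketch below is an assumption about how such weighted sampling could look, not the scheme used in the experiments; `balanced_minibatch` is a hypothetical helper, it samples with replacement, and it assumes a batch size of at least 2.

```python
import random


def balanced_minibatch(positives, negatives, batch_size, rng=random):
    """Draw roughly half the batch from each class, with replacement."""
    n_pos = max(1, batch_size // 2)
    n_neg = max(1, batch_size - n_pos)
    batch = rng.choices(positives, k=n_pos) + rng.choices(negatives, k=n_neg)
    rng.shuffle(batch)
    return batch


# Toy pools: ids below 100 are positives, the rest are negatives.
rng = random.Random(0)
pos_ids = list(range(5))
neg_ids = list(range(100, 200))
batch = balanced_minibatch(pos_ids, neg_ids, 8, rng)
```

Sampling with replacement lets a rare positive class fill its half of the batch even when it has fewer examples than that.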

We compare our method (NGM-FFNN) against a two-stage approach that first uses node2vec (Grover & Leskovec, 2016) to generate node embeddings and then uses a linear one-vs-rest classifier for classification. The methods are evaluated using two metrics, Macro F1 and Micro F1. The average results for different train/test splits using our method and the baseline are included in table 1. In addition, we compare NGM-FFNN with a non-augmented FFNN in which $\alpha_i = 0$, i.e. no edge information is used during training. We observe that the graph-augmented training scheme performs better (6% relative improvement on Macro F1 when the training set size is 20% and 50% of the dataset) or comparably (when the training size is 80%) compared to the vanilla neural networks trained with no edge information. Both methods significantly outperform the approach that uses node embeddings and linear classifiers. We observe the same improvement over node2vec on the Micro F1 metric, and NGM-FFNN is comparable to vanilla FFNN ($\alpha_i = 0$) but outperforms other methods on the recall metric.

Train / Dataset    NGM-FFNN    node2vec*
20%                0.191       0.168
50%                0.242       0.174
80%                0.262       0.177

* These results are different from those in (Grover & Leskovec, 2016), since we treat the classifiers (one per label) independently. Both methods shown here use the exact same setting and training/test data splits.

Table 1: Macro F1 results for the BlogCatalog dataset averaged over 10 random splits. Higher is better. Graph-regularized neural networks outperform node2vec embeddings with a linear classifier in all training size settings.
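For reference, the two metrics differ only in where the averaging happens: Macro F1 averages the per-label F1 scores, while Micro F1 pools the counts across labels first. The sketch below assumes binary indicator matrices (one row per node, one column per label) and scores a label with no positives and no predictions as 0; the function names are chosen for illustration.

```python
def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # convention: empty label scores 0


def macro_micro_f1(y_true, y_pred):
    """y_true, y_pred: lists of equal-length 0/1 lists (rows = nodes, cols = labels)."""
    num_labels = len(y_true[0])
    counts = []
    for l in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[l] == 1 and p[l] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[l] == 0 and p[l] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[l] == 1 and p[l] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1_from_counts(*c) for c in counts) / num_labels
    micro = f1_from_counts(sum(c[0] for c in counts),
                           sum(c[1] for c in counts),
                           sum(c[2] for c in counts))
    return macro, micro


y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
macro, micro = macro_micro_f1(y_true, y_pred)
```

In this toy case one label is predicted perfectly and the other misses one positive, so the macro score is the mean of 1 and 2/3, while the micro score pools 3 true positives and 1 false negative.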

These results demonstrate that using the graph itself as direct inputs to the neural network and letting the network figure out a non-linear mapping directly from the raw graph is more effective than the two-stage approach considered. More importantly, the results also show that using the graph information improves the performance in the limited data regime (for example: when training set is only 20% or 50% of the dataset).

4.2 Text Classification using Character-level CNNs

We evaluate the proposed objective function on a multi-class text classification task using a character-level convolutional neural network (CNN). We use the AG news dataset from (Zhang et al., 2015), where the task is to classify a news article into one of 4 categories. Each category has 30,000 examples for training and 1,900 examples for testing. In addition to the train and test sets, there are 111,469 examples that are treated as unlabeled examples.

As there is no provided graph structure linking the articles, we create such a graph based on the embeddings of the articles. We restrict the graph construction to only the train set and the unlabeled examples and keep the test set only for evaluation. We use the Google News word2vec corpus to calculate the average embedding for each news article and use the cosine similarity of document embeddings as a similarity metric. Each node is restricted to have a maximum of 5 neighbors.
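The graph construction described above can be sketched as follows: average the word vectors of each document, then keep each document's most similar documents by cosine similarity. The tiny two-dimensional `word_vectors` table is made up for illustration (the experiments use Google News word2vec vectors), the function names are hypothetical, and the brute-force all-pairs search is only practical for small collections.

```python
import math


def average_embedding(tokens, word_vectors, dim):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_graph(doc_embeddings, k=5):
    """Map each document index to its k most similar documents and the similarities."""
    graph = {}
    for i, ei in enumerate(doc_embeddings):
        scored = [(cosine_similarity(ei, ej), j)
                  for j, ej in enumerate(doc_embeddings) if j != i]
        scored.sort(reverse=True)
        graph[i] = [(j, s) for s, j in scored[:k]]
    return graph


# Made-up 2-d word vectors and four tiny "documents".
word_vectors = {"stocks": [1.0, 0.0], "market": [0.9, 0.1],
                "match": [0.0, 1.0], "goal": [0.1, 0.9]}
docs = [["stocks", "market"], ["market"], ["match", "goal"], ["goal"]]
embeddings = [average_embedding(d, word_vectors, 2) for d in docs]
graph = knn_graph(embeddings, k=1)
```

The similarities stored with each neighbour could serve as edge weights in the training objective.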

We construct the CNN in the same way as (Zhang et al., 2015) and pick their competitive “small CNN” as our baseline for a more reasonable comparison to our set-up. Our approach employs the same network, but with a significantly smaller number of convolutional layers and smaller layer sizes, as shown in table 2.

Setting                       Baseline    Our “tiny CNN”
# of conv. layers             6           3
Frame size in conv. layers    256         32
# of FC layers                3           3
Hidden units in FC layers     1024        256

Table 2: Settings of CNNs for the text classification experiment, including the number of convolutional layers and their sizes. The baseline model is the small CNN from (Zhang et al., 2015) and is significantly larger than our model.

The networks are trained with the same hyper-parameters as reported in (Zhang et al., 2015). We observed that the model converged within 20 epochs (the model loss did not change much) and hence used this as a stopping criterion for this task. Experiments also showed that running the network for longer did not change the qualitative performance. We use the cross entropy loss on the final outputs of the network, that is, $d(g_\theta(x_u), g_\theta(x_v))$ is the cross entropy between the outputs for nodes $u$ and $v$, to compute the distance between nodes on an edge. In addition, we also experiment with a data augmentation technique using an English thesaurus, as done in (Zhang et al., 2015).

We compare the “tiny CNN” trained using the proposed objective function with the baseline using the accuracy on the test set in table 3. Our approach outperforms the baseline, and training with the proposed objective provides a 1.8% absolute and 2.1% relative improvement in accuracy over the same “tiny CNN” trained without it, despite using a much smaller network than the baseline. In addition, our model with graph augmentation trains much faster and produces results on par with or better than the performance of a significantly larger network, “large CNN” (Zhang et al., 2015), which has an accuracy of 87.18 without using a thesaurus, and 86.61 with the thesaurus.

Network                                 Accuracy %
Baseline                                84.35
Baseline with thesaurus augmentation    85.20
Our “tiny” CNN                          85.07
Our “tiny” CNN with NGM                 86.90

Table 3: Results for news article categorization using character-level CNNs. Our method gives better predictive accuracy, despite using a much smaller CNN compared to the “small CNN” baseline from (Zhang et al., 2015).
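The 1.8% absolute and 2.1% relative improvements quoted in the text appear to correspond to the two “tiny” CNN rows of table 3 (an inference from the table values, not something the text states explicitly). The arithmetic, with the reported figures rounded down:

```latex
\begin{aligned}
\text{absolute gain} &= 86.90 - 85.07 = 1.83 \ \text{percentage points}, \\
\text{relative gain} &= \frac{1.83}{85.07} \approx 0.0215 \ (\text{about } 2.15\%).
\end{aligned}
```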

4.3 Semantic Intent Classification using LSTM RNNs

We compare the performance of our approach for training RNN sequence models (LSTM) for a semantic intent classification task as described in the recent work on SmartReply (Kannan et al., 2016) for automatically generating short email responses. One of the underlying tasks in SmartReply is to discover and map short response messages to semantic intent clusters. (For details regarding SmartReply and how the semantic intent clusters are generated, refer to Kannan et al., 2016.) We chose 20 intent classes and created a dataset comprising 5,483 samples (3,832 for training, 560 for validation and 1,091 for testing). Each sample instance corresponds to a short response message text paired with a semantic intent category that was manually verified by human annotators. For example, “That sounds awesome!” and “Sounds fabulous” belong to the sounds good intent cluster.

We construct a sparse graph in a similar manner to the news categorization task, using word2vec embeddings over the message text and computing similarity to generate a response message graph with fixed node degree (k=10). We use the $\ell_2$ distance metric and choose $\alpha$ based on the development set.

We run the experiments for a fixed number of time steps and pick the best results on the development set. A multilayer LSTM architecture (2 layers, 100 dimensions) is used for the RNN sequence model. The LSTM model and its NGM variant are also compared against other baseline systems: the Random baseline ranks the intent categories randomly and the Frequency baseline ranks them in order of their frequency in the training corpus. To evaluate the intent prediction quality of different approaches, for each test instance, we compute the rank of the actual intent category with respect to the ranking produced by the method and use this to calculate the Mean Reciprocal Rank:

$$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i},$$

where $N$ is the number of test instances and $\mathrm{rank}_i$ is the rank of the actual intent category for the $i$-th instance.
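A minimal sketch of the metric, assuming each method returns a full ranking of the intent categories (best first) for every test instance. The function names are chosen for illustration. The second helper gives the expected score of a uniformly random ranking over a given number of classes, which for 20 classes is about 0.18, close to the 0.175 reported for the Random baseline in table 4.

```python
def mean_reciprocal_rank(rankings, targets):
    """rankings: list of ranked category lists (best first); targets: true categories."""
    total = 0.0
    for ranking, target in zip(rankings, targets):
        rank = ranking.index(target) + 1  # 1-based rank of the true category
        total += 1.0 / rank
    return total / len(targets)


def expected_random_mrr(num_classes):
    """Expected MRR when the true category's rank is uniform over 1..num_classes."""
    return sum(1.0 / r for r in range(1, num_classes + 1)) / num_classes


rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
mrr = mean_reciprocal_rank(rankings, targets)
random_mrr_20 = expected_random_mrr(20)
```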

We show in table 4 that LSTM RNNs with our proposed graph-augmented training objective function outperform standard baselines by achieving a better MRR.

Model        Mean Reciprocal Rank (MRR)
Random       0.175
Frequency    0.258
LSTM         0.276
NGM-LSTM     0.284

Table 4: Results for Semantic Intent Classification using graph-augmented LSTM RNNs and baselines. Higher MRR is better.

4.4 Low-supervision Document Classification

Finally, we compare our method on a task with very limited supervision—the PubMed document classification problem (Sen et al., 2008). The task is to classify each document into one of 3 classes, with each document being described by a TF-IDF weighted word vector. The graph is available as a citation network: two documents are connected to each other if one cites the other. The graph has 19,717 nodes and 44,338 edges, with each class having 20 seed nodes and 1000 test nodes. In our experiments we exclude the test nodes from the graph entirely, training only on the labeled and unlabeled nodes.

We train a feed-forward neural network (FFNN) with two hidden layers with 250 and 100 neurons, using the $\ell_2$ distance metric on the last hidden layer. The NGM-FFNN model is trained with graph regularisation enabled, while the baseline FFNN is trained with $\alpha_1 = \alpha_2 = \alpha_3 = 0$ (i.e., a supervised-only model). We use self-training to train the model, starting with just the 60 seed nodes (20 per class) as training data. The amount of training data is iteratively increased by assigning labels to the immediate neighbors of the labeled nodes and retraining the model. For the self-trained NGM-FFNN model, this strategy incrementally grows the labeled neighborhood and thereby the sets of labeled-labeled and labeled-unlabeled edges in the equation 4 objective.
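The self-training loop described above can be sketched as follows. How labels are assigned to the neighbours is not spelled out in the text; this sketch assumes the current model's prediction is used, and `train`, `grow_labels` and `self_train` are hypothetical names. The toy trainer simply predicts the majority class so that the example runs without a neural network.

```python
def grow_labels(labels, neighbors, predict):
    """Label the unlabeled immediate neighbours of currently labeled nodes."""
    grown = dict(labels)
    for node in labels:
        for nbr in neighbors.get(node, []):
            if nbr not in grown:
                grown[nbr] = predict(nbr)  # assumption: use the model's prediction
    return grown


def self_train(seed_labels, neighbors, train, rounds):
    """Alternate between retraining and growing the labeled set."""
    labels = dict(seed_labels)
    model = train(labels)
    for _ in range(rounds):
        labels = grow_labels(labels, neighbors, model)
        model = train(labels)
    return model, labels


def majority_trainer(labels):
    """Stand-in for network training: always predict the most common label."""
    values = list(labels.values())
    majority = max(set(values), key=values.count)
    return lambda node: majority


# Chain 0 - 1 - 2 - 3 with a single seed node.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
model, grown = self_train({0: "a"}, chain, majority_trainer, rounds=2)
```

Each round extends the labeled set by one hop, so more edges gain a labeled endpoint as training proceeds.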

We compare the final NGM-FFNN model against the FFNN baseline and other techniques reported in (Yang et al., 2016) including the Planetoid models (Yang et al., 2016), semi-supervised embedding (Weston et al., 2012), manifold regularization (Belkin et al., 2006), transductive SVM (Joachims, 1999), label propagation (Zhu et al., 2003), graph embeddings (Perozzi et al., 2014) and a linear softmax model. Full results are included in table 5.

Method                       Accuracy
Linear + Softmax             0.698
Semi-supervised embedding    0.711
Manifold regularization      0.707
Transductive SVM             0.622
Label propagation            0.630
Graph embedding              0.653
Planetoid-I                  0.772
Planetoid-G                  0.664
Planetoid-T                  0.757
Feed-forward NN              0.709
NGM-FFNN                     0.759

Table 5: Results for document classification on the PubMed dataset using neural networks. The top results are taken from (Yang et al., 2016). The bottom two rows are ours, with the NGM training outperforming all other baselines, except Planetoid-I. Please see text for relevant references.

The results show that the NGM model (without any tuning) outperforms many baselines including FFNN, semi-supervised embedding, manifold regularization and Planetoid-G/Planetoid-T, and is competitive with Planetoid-I (0.759 against 0.772). Most importantly, this result demonstrates that the graph augmentation scheme can lead to better regularised neural networks, especially in the low-sample regime (20 samples per class in this case). We believe that with tuning, NGM accuracy can be improved even further.

5 Conclusions

We have revisited graph-augmentation training of neural networks and proposed Neural Graph Machines as a general framework for doing so. Its objective function encourages the neural networks to make accurate node-level predictions, as in vanilla neural network training, as well as constrains the networks to learn similar hidden representations for nodes connected by an edge in the graph. Importantly, the objective can be trained by stochastic gradient descent and scaled to large graphs.

We validated the efficacy of the graph-augmented objective on various tasks, including bloggers’ interest, text category and semantic intent classification problems, using a wide range of neural network architectures (FFNNs, CNNs and LSTM RNNs). The experimental results demonstrated that graph-augmented training almost always helps to find better neural networks that outperform other techniques in predictive performance, or even much smaller networks that are faster and easier to train. Additionally, the node-level input features can be combined with graph features as inputs to the neural networks. We showed that a neural network that simply takes the adjacency matrix of a graph and produces node labels can perform better than a recently proposed two-stage approach using sophisticated graph embeddings and a linear classifier. Our framework also excels when the neural network is small, or when there is limited supervision available.

While our objective can be applied to multiple graphs that come from different domains, we have not fully explored this aspect and leave it as future work. We expect that the domain-specific networks can interact with the graphs to determine the importance of each domain/graph source in prediction. We also did not explore using graph regularisation for different hidden layers of the neural networks; we expect this to be key for the multi-graph transfer setting (Yosinski et al., 2014). Another possible future extension is to use our objective on directed graphs, that is, to control the direction of influence between nodes during training.


We would like to thank the Google Expander team for insightful feedback.


  • Abadi et al. (2015) Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL Software available from
  • Agarwal et al. (2009) Agarwal, Nitin, Liu, Huan, Murthy, Sudheendra, Sen, Arunabha, and Wang, Xufei. A social identity approach to identify familiar strangers in a social network. In 3rd International AAAI Conference on Weblogs and Social Media (ICWSM09), 2009.
  • Belkin et al. (2006) Belkin, Mikhail, Niyogi, Partha, and Sindhwani, Vikas. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7(Nov):2399–2434, 2006.
  • Bengio et al. (2006) Bengio, Yoshua, Delalleau, Olivier, and Le Roux, Nicolas. Label propagation and quadratic criterion. In Chapelle, O, Schölkopf, B, and Zien, A (eds.), Semi-supervised learning, pp. 193–216. MIT Press, 2006.
  • Defferrard et al. (2016) Defferrard, Michaël, Bresson, Xavier, and Vandergheynst, Pierre. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3837–3845, 2016.
  • Grover & Leskovec (2016) Grover, Aditya and Leskovec, Jure. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 855–864, 2016.
  • Hinton et al. (2012) Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
  • Joachims (1999) Joachims, Thorsten. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning, 1999.
  • Kannan et al. (2016) Kannan, Anjuli, Kurach, Karol, Ravi, Sujith, Kaufmann, Tobias, Tomkins, Andrew, Miklos, Balint, Corrado, Greg, Lukacs, Laszlo, Ganea, Marina, Young, Peter, and Ramavajjala, Vivek. Smart reply: Automated response suggestion for email. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2016.
  • Kingma et al. (2014) Kingma, Diederik P, Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
  • Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
  • Lee (2013) Lee, Dong-Hyun. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML 2013 Workshop : Challenges in Representation Learning (WREPL), 2013.
  • Perozzi et al. (2014) Perozzi, Bryan, Al-Rfou, Rami, and Skiena, Steven. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.
  • Ravi & Diao (2016) Ravi, Sujith and Diao, Qiming. Large scale distributed semi-supervised learning using streaming approximation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 519–528, 2016.
  • Sen et al. (2008) Sen, Prithviraj, Namata, Galileo, Bilgic, Mustafa, Getoor, Lise, Galligher, Brian, and Eliassi-Rad, Tina. Collective classification in network data. AI magazine, 29(3):93, 2008.
  • Sutskever et al. (2014) Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
  • Weston et al. (2012) Weston, Jason, Ratle, Frédéric, Mobahi, Hossein, and Collobert, Ronan. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. Springer, 2012.
  • Yang et al. (2016) Yang, Zhilin, Cohen, William, and Salakhutdinov, Ruslan. Revisiting semi-supervised learning with graph embeddings. In Proceedings of The 33rd International Conference on Machine Learning, pp. 40–48, 2016.
  • Yosinski et al. (2014) Yosinski, Jason, Clune, Jeff, Bengio, Yoshua, and Lipson, Hod. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.
  • Zhang et al. (2015) Zhang, Xiang, Zhao, Junbo, and LeCun, Yann. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649–657, 2015.
  • Zhu et al. (2003) Zhu, X, Ghahramani, Z, and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-2003), volume 2, pp. 912–919. AAAI Press, 2003.
  • Zhu & Ghahramani Zhu, Xiaojin and Ghahramani, Zoubin. Learning from labeled and unlabeled data with label propagation. Technical report, School of Computer Science, Carnegie Mellon University.