# Graph Partition Neural Networks for Semi-Supervised Classification

## Abstract

We present graph partition neural networks (GPNN), an extension of graph neural networks (GNNs) able to handle extremely large graphs. GPNNs alternate between locally propagating information between nodes in small subgraphs and globally propagating information between the subgraphs. To efficiently partition graphs, we experiment with several partitioning algorithms and also propose a novel variant for fast processing of large scale graphs. We extensively test our model on a variety of semi-supervised node classification tasks. Experimental results indicate that GPNNs are either superior or comparable to state-of-the-art methods on a wide variety of datasets for graph-based semi-supervised classification. We also show that GPNNs can achieve performance similar to that of standard GNNs with fewer propagation steps.

## 1 Introduction

Graphs are a flexible way of encoding data, and many tasks can be cast as learning from graph-structured inputs. Examples include prediction of properties of chemical molecules [[9]], answering questions about knowledge graphs [[25]], natural language processing with parse-structured inputs (trees or richer structures like Abstract Meaning Representations) [[4]], predicting properties of data structures or source code in programming languages [[22], [2]], and making predictions from scene graphs [[39]]. Sequence data can be seen as a special case of a simple chain-structured graph. Thus, we are interested in training high-capacity neural network-like models on these types of graph-structured inputs. Graph Neural Networks (GNNs) [[14], [29], [22], [28], [21], [11], [42]] are one of the best contenders, although there has been much recent interest in applying other neural network-like models to graph data, including generalizations of convolutional architectures [[9], [18], [30]]. Gilmer et al. [[12]] recently reviewed and unified many of these models.

An important issue that has not received much attention in GNN models is how
information gets propagated across the graph.
There are often scenarios in which information has to be propagated over long
distances across a graph, e.g., when we have long sequences augmented with
additional relationships between elements of the sequence, like in text,
programming language source code, or temporal streams.
The simplest approach, and the one adopted by almost all graph-based neural
networks, is to follow *synchronous message-passing
systems* [[3]] from distributed computing theory.
Specifically, inference is executed as a sequence of rounds: in each round,
every node sends messages to all of its neighbors, the messages are delivered
and every node does some computation based on the received messages.
While this approach has the benefit of being simple and easy to implement, it is
especially inefficient when the task requires spreading information across long
distances in the graph.
For example, in processing sequence data, if we were to employ the above
schedule for a sequence of length $N$, it would take $O(N^2)$ messages
to propagate information from the beginning of the sequence to the end, and during training all messages must be stored in memory.
In contrast, the common practice with sequence data is to use a forward pass
followed by a backward pass at a cost of $O(N)$ to propagate information from
end to end, as in hidden Markov models (HMMs) or bidirectional recurrent neural networks (RNNs), for example.

One possible approach for tackling this problem is to propagate information over the graph following some pre-specified sequential order, as in Bidirectional LSTMs. However, this sequential solution has several issues. First, if a graph used for training has large diameter, the unrolled GNN computational graph will be large (cf. Bidirectional LSTMs on long sequences). This leads to fundamental issues with learning (e.g., vanishing/exploding gradients) and implementation difficulties (i.e., resource constraints). Second, sequential schedules are typically less amenable to efficient acceleration on parallel hardware. More recently, Gilmer et al. [[12]] attempted to tackle the first problem by introducing a “dummy node” with connections to all nodes in the input graph, meaning that all nodes are at most two steps away from each other. However, we note that the graph structure itself often contains important information, which is modified by adding additional nodes and edges.

In this work, we propose graph partition neural networks (GPNN) that exploit a propagation schedule combining features of synchronous and sequential propagation schedules. Concretely, we first partition the graph into disjoint subgraphs and a cut set, and then alternate steps of synchronous propagation within subgraphs with synchronous propagation within the cut set. In Sect. 3, we discuss different propagation schedules on an example, showing that GPNNs can be substantially more efficient than standard GNNs, and then present our model formally. Finally, we evaluate our model in Sect. 4 on a variety of semi-supervised classification benchmarks. The empirical results suggest that our models are either superior to or comparable with state-of-the-art learning systems on graphs.

## 2 Related Work

There are many neural network models for handling graph-structured inputs. They can be roughly categorized into generalizations of recurrent neural networks (RNNs) [[13], [14], [29], [34], [37], [22], [25], [28], [21]] and generalizations of convolutional neural networks (CNNs) [[7], [9], [18], [30]]. Gilmer et al. [[12]] provide a good review and unification of many of these models, and they present some additional model variations that lead to strong empirical results in making predictions from molecule-structured inputs.

In RNN-like models, the standard approach is to propagate information using a synchronous schedule. In convolution-like models, the node updates mimic standard convolutions where all nodes in a layer are updated as functions of neighboring node states in the previous layer. This leads to information propagating across the graph in the same pattern as synchronous schedules. While our focus has been mainly on the RNN-like model of Li et al. [[22]], it would be interesting to apply our schedules to the other models as well.

Some of the RNN-based neural network models operate on restricted classes of graphs and employ sequential or sequential-like schedules. For example, recursive neural networks [[13], [33]] and tree-LSTMs [[37]] have bidirectional variants that use fully sequential schedules. The agent model of Sukhbaatar et al. [[35]] can be viewed as a GNN model with a sequential schedule, where messages are passed inwards towards a master node that aggregates messages from different agents, and then outwards from the master node to all the agents. The difference in our work is the focus on graphs with arbitrary structure (not necessarily a sequence or tree). Recently, Marino et al. [[25]] developed an attention-like mechanism to dynamically select a subset of graph nodes to propagate information from, but the propagation is synchronous amongst selected nodes.

Recently, Hamilton et al. [[16]] proposed a graph sample and aggregate (GraphSAGE) method. It first samples a neighborhood graph for each node, which can be regarded as overlapping partitions of the original graph. An improved graph convolutional network (GCN) [[18]] is then applied to each neighborhood graph independently. They show that this partition-based strategy facilitates unsupervised representation learning on large scale graphs.

An area where scheduling has been studied extensively is in the probabilistic inference literature. It is common to decompose a graph into a set of spanning trees and sequentially update the tree structures [[41]]. Graph partition based schedules have been explored in belief propagation (BP) [[26]], generalized belief propagation (GBP) [[48], [43]], generalized mean-field inference [[45], [46]] and dual decomposition based inference [[19], [49]]. In generalized mean-field inference [[45]], a graph partition algorithm, e.g., graph cut, is applied to obtain the clusters of nodes. A sequential update schedule among clusters is adopted to perform variational inference. Zhang et al. [[49]] adopt a partition-based strategy to better distribute the dual decomposition based message passing algorithm for high order MRF. The junction tree algorithm [[20]] can also be viewed as a partition based inference where the partition is obtained by finding the maximum spanning tree on the weighted clique graph. Each node of the junction tree corresponds to a cluster of nodes, i.e., maximal clique, in the original graph. A sequential update can then be executed on the junction tree. See also [[10], [38], [36]] for more discussion of sequential updates in the context of belief propagation. Finally, the question of sequential versus synchronous updates arises in numerical linear algebra. Jacobi iteration uses a synchronous update while Gauss-Seidel applies the same algorithm but according to a sequential schedule.

## 3 Model

In this section, we briefly recapitulate graph neural networks (GNNs) and then describe our graph partition neural networks (GPNN). A graph $G = (V, E)$ has nodes $V$ and edges $E \subseteq V \times V$. We focus on directed graphs, as our approach readily applies to undirected graphs by splitting any undirected edge into two directed edges. We denote the out-going neighborhood as $N_{out}(v) = \{u \in V \mid (v, u) \in E\}$, and similarly, the incoming neighborhood as $N_{in}(v) = \{u \in V \mid (u, v) \in E\}$. We associate an edge type $c_{(v,u)} \in \{1, \dots, C\}$ with every edge $(v, u)$, where $C$ is some pre-specified total number of edge types. Such edge types are used to encode different relationships between nodes. Note that one can also associate multiple edge types with the same edge, which results in a multi-graph. We assume one edge type per directed edge to simplify the notation here, but the model can be easily generalized to the multi-edge case.

### 3.1 Graph Neural Networks

Graph neural networks [[29], [22]] can be viewed as an extension of recurrent neural networks (RNNs) to arbitrary graphs. Each node $v$ in the graph is associated with an initial state vector $h_v^{(0)}$ at time step $0$. Initial state vectors can be observed features or annotations as in [[22]]. At time step $t$, an outgoing message is computed for each edge $(v, u)$ by transforming the source state according to the edge type, i.e.,

$$m_{(v,u)}^{(t)} = M_{c_{(v,u)}}\left(h_v^{(t)}\right), \tag{1}$$

where $M_{c_{(v,u)}}$ is a message function, which could be the identity or a fully connected neural network. Note the subscript $c_{(v,u)}$, the type of edge $(v, u)$, indicating that different edges of the same type share the same instance of the message function. We then aggregate all messages at the receiving nodes, i.e.,

$$m_v^{(t)} = A\left(\left\{ m_{(u,v)}^{(t)} \mid u \in N_{in}(v) \right\}\right), \tag{2}$$

where $N_{in}(v)$ is the incoming neighborhood of $v$ and $A$ is the aggregation function, which may be a summation, average or max-pooling function. Finally, every node will update its state vector based on its current state vector and the aggregated message, i.e.,

$$h_v^{(t+1)} = U\left(h_v^{(t)}, m_v^{(t)}\right), \tag{3}$$

where $U$ is the update function, which may be a gated recurrent unit (GRU), a long short term memory (LSTM) unit, or a fully connected network. Note that all nodes share the same instance of the update function. The described propagation step is repeatedly applied for a fixed number of time steps $T$ to obtain final state vectors $\{h_v^{(T)} \mid v \in V\}$. A node classification task can then be implemented by feeding these state vectors to a fully connected neural network which is shared by all nodes. Back-propagation through time (BPTT) is typically adopted for learning the model.
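To make the message, aggregation and update steps above concrete, here is a minimal pure-Python sketch of a synchronous propagation loop. It is not the authors' implementation: the function name `propagate`, the use of scalar states, and the rule that a node without incoming messages keeps its state are assumptions made for illustration.

```python
from collections import defaultdict


def propagate(states, edges, message_fns, aggregate, update, num_steps):
    """Run synchronous message passing for `num_steps` rounds.

    states:      dict node -> state (a float here, for brevity)
    edges:       list of (source, target, edge_type) triples
    message_fns: dict edge_type -> function(state) -> message
    aggregate:   function(list of messages) -> aggregated message
    update:      function(state, aggregated message) -> new state
    """
    for _ in range(num_steps):
        inbox = defaultdict(list)
        # Message step: each edge transforms the source state by edge type.
        for src, dst, edge_type in edges:
            inbox[dst].append(message_fns[edge_type](states[src]))
        # Aggregation and update, computed from the old states for all nodes.
        # Assumption: nodes with an empty inbox keep their current state.
        states = {
            node: update(state, aggregate(inbox[node])) if inbox[node] else state
            for node, state in states.items()
        }
    return states


# Directed chain a -> b -> c with identity messages, sum aggregation
# and an additive update.
chain_edges = [("a", "b", 0), ("b", "c", 0)]
result = propagate(
    {"a": 1, "b": 0, "c": 0},
    chain_edges,
    message_fns={0: lambda h: h},
    aggregate=sum,
    update=lambda h, m: h + m,
    num_steps=2,
)
# result == {"a": 1, "b": 2, "c": 1}
```

The value at `a` needs two rounds to reach `c`, which is the distance-per-round behavior that motivates the alternative schedules discussed next.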

### 3.2 Graph Partition Neural Networks

The above inference process is described from the perspective of an individual
node.
If we look at the same process from the graph view, we observe a
*synchronous schedule* in which all nodes receive and send messages at the
same time, cf. the illustration in Fig. 1(d).
A natural question is to consider different propagation schedules in which not
all nodes in the graph send messages at the same time, e.g., *sequential
schedules*, in which nodes are ordered in some linear sequence and
messages are sent only from one node at a time.
A mix of the two ideas leads to our Graph Partition Neural Networks (GPNN),
which we will discuss before elaborating on how to partition graphs
appropriately. Finally, we discuss how to handle initial node labels and node
classification tasks.

#### Propagation Model

We first consider the example graph in Fig. 1 (a). A corresponding computational graph that shows how information is propagated from time step $t$ to time step $t+1$ using the standard (synchronous) propagation schedule is shown in Fig. 1 (d). The example graph’s diameter is $5$, and it hence requires at least $5$ steps to propagate information over the graph. Fig. 1(c) instead shows two possible sequences along which information can be propagated between two pairs of nodes. These visualizations show that (i) a full synchronous propagation schedule requires significant computation at each step, and (ii) a sequential propagation schedule, in which we only propagate along sequences of nodes, results in very sparse and deep computational graphs. Moreover, experimentally, we found sequential schedules to require multiple propagation rounds across the whole graph, resulting in an even deeper computational graph.

In order to achieve both efficient propagation and tractable learning, we propose a new propagation schedule that follows a divide and conquer strategy. In particular, we first partition the graph into disjoint subgraphs. We will explain the details of how to compute graph partitions below. For now, we assume that we already have $K$ subgraphs such that each subgraph contains a subset of nodes and the edges induced by this subset. We will also have a cut set, i.e., the set of edges that connect different subgraphs. One possible partition of our example is visualized in Fig. 1 (b).

In GPNNs, we alternate between propagating information in parallel local to each subgraph (making use of highly parallel computing units such as GPUs) and propagating messages between subgraphs. Our propagation schedule is shown in Alg. 1. To understand the benefit of this schedule, consider a broadcasting problem over the example graph in Fig. 1. When information from any one node has reached all other nodes in the graph for the first time, this problem is considered as solved. We will compare the number of messages required to solve this problem for different propagation schedules.

Synchronous propagation: Fig. 1(d) shows that a synchronous step requires 10 messages. Broadcasting requires sufficient propagation steps to cover the graph diameter (in this case, 5), giving a total of $5 \times 10 = 50$ messages.

Partitioned propagation: For simplicity, we analyze the case $T_P = 1$, $T_S = D_S$, where $T_S$ and $T_P$ are the numbers of intra-subgraph and inter-subgraph propagation steps per outer loop iteration and $D_S$ is the maximum diameter of the subgraphs. Using the partitioning in Fig. 1(e), each step of intra-subgraph propagation requires 8 messages. After $T_S$ steps ($8 T_S$ messages) the broadcast problem is solved within each subgraph. Inter-subgraph propagation requires 2 messages in this example, giving $8 T_S + 2$ messages per outer loop iteration in Alg. 1. With $T$ outer iterations needed to broadcast between all subgraphs, this gives a total of $T (8 T_S + 2)$ messages.

In general, our propagation schedule requires no more messages than the synchronous schedule to solve broadcast (if the number of subgraphs is set to $1$ or to the number of nodes $N$, then our schedule reduces to the synchronous one). We analyze the number of messages required to solve the broadcast problem on chain graphs in detail in Sect. A.1. Overall, our method avoids the large number of messages required by synchronous schedules, while avoiding the very deep computational graphs required by sequential schedules. Our experiments in Sect. 4 show that this makes learning tractable even on extremely large graphs.
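The broadcast comparison can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not the paper's Alg. 1: it counts one message per directed edge per step, assumes a connected graph, and stops as soon as every node has heard from every other node. The helper names and the six-node bidirectional chain are hypothetical choices; the chain was picked because it has diameter 5 and 10 directed edges, and it is not claimed to be the graph of Fig. 1.

```python
def bidirectional_chain(n):
    """Nodes 0..n-1 and both directed edges between consecutive nodes."""
    nodes = list(range(n))
    edges = [(i, i + 1) for i in range(n - 1)]
    edges += [(i + 1, i) for i in range(n - 1)]
    return nodes, edges


def send(known, edges):
    """One synchronous step: each edge forwards what its source knew before."""
    new = {v: set(s) for v, s in known.items()}
    for u, v in edges:
        new[v] |= known[u]
    return new


def broadcast_done(known):
    everyone = set(known)
    return all(s == everyone for s in known.values())


def count_synchronous(nodes, edges):
    known = {v: {v} for v in nodes}
    messages = 0
    while not broadcast_done(known):
        known = send(known, edges)
        messages += len(edges)
    return messages


def count_partitioned(nodes, edges, assignment, intra_steps, inter_steps=1):
    """Alternate intra-subgraph steps with steps over the cut set."""
    intra = [(u, v) for u, v in edges if assignment[u] == assignment[v]]
    cut = [(u, v) for u, v in edges if assignment[u] != assignment[v]]
    known = {v: {v} for v in nodes}
    messages = 0
    while not broadcast_done(known):
        for _ in range(intra_steps):
            known = send(known, intra)
            messages += len(intra)
        if broadcast_done(known):
            break  # skip the trailing, unnecessary inter-subgraph step
        for _ in range(inter_steps):
            known = send(known, cut)
            messages += len(cut)
    return messages


nodes, edges = bidirectional_chain(6)
halves = {v: 0 if v < 3 else 1 for v in nodes}  # two sub-chains of 3 nodes
sync_total = count_synchronous(nodes, edges)
part_total = count_partitioned(nodes, edges, halves, intra_steps=2)
```

On this chain the synchronous schedule uses 50 messages and the two-way partition uses 34, because the simulation omits the final inter-subgraph step once broadcast is already solved.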

#### Graph Partition

We now investigate how to construct graph partitions. First, since partition problems in graph theory typically are NP-hard, we are only looking for approximations in practice. A simple approach is to re-use the classical spectral partition method. Specifically, we follow the normalized cut method in [[32]] and use the random walk normalized graph Laplacian matrix $L = I - D^{-1} W$, where $I$ is the identity matrix, $D$ is the degree matrix and $W$ is the weight matrix of the graph (i.e., the adjacency matrix if no weights are present).
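As a small illustration of the matrix involved, the sketch below builds the random walk normalized Laplacian (the identity minus the inverse degree matrix times the weight matrix) from a dense weight matrix given as nested lists. It covers only this construction step; the eigenvector computation and clustering of the spectral method are not shown, and treating zero-degree rows as having no outgoing weight is an assumption.

```python
def random_walk_laplacian(weights):
    """Return I - D^{-1} W for a square weight matrix given as nested lists."""
    size = len(weights)
    laplacian = []
    for i in range(size):
        degree = sum(weights[i])
        row = []
        for j in range(size):
            identity = 1.0 if i == j else 0.0
            # Assumption: isolated nodes (degree 0) contribute no transition mass.
            transition = weights[i][j] / degree if degree else 0.0
            row.append(identity - transition)
        laplacian.append(row)
    return laplacian


# Unweighted path graph on three nodes.
path_laplacian = random_walk_laplacian([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
```

Each row of the result sums to zero for nodes with nonzero degree, since the rows of $D^{-1} W$ are transition probabilities of a random walk.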

However, the spectral partition method is slow and hard to scale to large graphs [[40]]. For performance reasons, we developed the following heuristic method based on a multi-seed flood fill partition algorithm as listed in Alg. 2. We first randomly sample the initial seed nodes, biased towards nodes which are labeled and have a large out-degree. We maintain a global dictionary assigning nodes to subgraphs, and initially assign each selected seed node to its own subgraph. We then grow the dictionary using flood fill, attaching unassigned nodes that are direct neighbors of a subgraph to that subgraph. To avoid bias towards the first subgraph, we randomly permute the order of subgraphs at the beginning of each round. This procedure is repeatedly applied until no subgraph grows any more. There may still be disconnected components left in the graph, which we assign to the smallest subgraph found so far to balance subgraph sizes.
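The flood fill procedure described above can be sketched as follows. This is a simplified reading of the prose, not a transcription of Alg. 2: seeds are passed in rather than sampled with a bias towards labeled, high out-degree nodes, and leftover nodes are assigned one at a time to the currently smallest subgraph rather than component by component. The function name and signature are hypothetical.

```python
import random


def flood_fill_partition(neighbors, seeds, rng=None):
    """Assign every node to a subgraph index by multi-seed flood fill.

    neighbors: dict node -> list of neighboring nodes
    seeds:     list of distinct seed nodes, one per subgraph
    """
    rng = rng or random.Random(0)
    assignment = {seed: k for k, seed in enumerate(seeds)}
    sizes = [1] * len(seeds)
    frontier = [[seed] for seed in seeds]
    grew = True
    while grew:
        grew = False
        order = list(range(len(seeds)))
        rng.shuffle(order)  # avoid a bias towards the first subgraph
        for k in order:
            next_frontier = []
            for node in frontier[k]:
                for nb in neighbors[node]:
                    if nb not in assignment:
                        assignment[nb] = k
                        sizes[k] += 1
                        next_frontier.append(nb)
            frontier[k] = next_frontier
            grew = grew or bool(next_frontier)
    # Nodes unreachable from every seed go to the currently smallest subgraph.
    for node in neighbors:
        if node not in assignment:
            k = min(range(len(seeds)), key=lambda i: sizes[i])
            assignment[node] = k
            sizes[k] += 1
    return assignment


# Undirected chain 0-1-2-3-4-5 plus an isolated node 6, seeded at both ends.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4], 6: []}
parts = flood_fill_partition(chain, seeds=[0, 5])
```

Only newly attached nodes are kept in each frontier, because once a node has been processed all of its neighbors are already assigned.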

#### Node Features & Classification

In practice, problems using graph-structured data sometimes
(1) do not have observed features associated with every
node [[15]];
(2) have very high dimensional sparse features per node [[6]].
We develop two types of models for the initial node labels:
*embedding-input* and *feature-input*.
For *embedding-input*, we introduce learnable node embeddings into the
model to solve challenge (1), inspired by other graph embedding methods. For nodes with
observed features we initialize the embeddings to these observations, and all other nodes are initialized randomly.
All embeddings are fed to the propagation model and are treated as learnable parameters.
For *feature-input*, we apply a sparse fully-connected network to input
features to tackle challenge (2).
The dimension-reduced feature is then fed to the propagation model, and the
sparse network is jointly learned with the rest of the model.

We also empirically found that concatenating the input features with the final embedding produced by the propagation model is helpful in boosting the performance.

## 4 Experiments

We test our model on a variety of semi-supervised tasks. The datasets are summarized in Tab. 1.

Dataset | #Nodes | #Edges | #Classes | #Features | Label Rate |
---|---|---|---|---|---|
Citeseer | 3,327 | 4,732 | 6 | 3,703 | 0.036 |
Cora | 2,708 | 5,429 | 7 | 1,433 | 0.052 |
Pubmed | 19,717 | 44,338 | 3 | 500 | 0.003 |
NELL | 65,755 | 266,144 | 210 | 5,414 | 0.1, 0.01, 0.001 |
DIEL | 4,373,008 | 4,464,261 | 4 | 1,233,598 | 0.0095 |

### 4.1 Citation Networks

We first discuss experimental results on three citation networks: Citeseer, Cora and Pubmed [[31]]. The datasets contain sparse bag-of-words feature vectors for each document and a list of citation links between documents. Documents and citation links are regarded as nodes and edges while constructing the graph. $20$ instances are sampled for each class as labeled data, 1000 instances as test data, and the rest are used as unlabeled data. The goal is to classify each document into one of the predefined classes. We use the same data split as in [[47]] and [[18]]. We use an additional validation set of 500 labeled nodes for tuning hyperparameters as in [[18]].

The experimental results are shown in Tab. 2.
We report the results of baselines directly from [[47]] and
[[18]].
We see that GPNN is on par with other state-of-the-art methods on these small graphs.
We also conducted experiments with random splits and results are reported in the appendix.
We found these datasets easy to overfit due to their small size, and use
*feature-input* rather than *embedding-input*, as the latter case
increases the model capacity as well as the risk of overfitting.
We also show a t-SNE [[24]] visualization of node
representations produced by the propagation model of GGNN and GPNN on
the Cora dataset in Fig. 2 (a) and (b) respectively.
The visualizations show that the node representations of GPNN are better separated.

Method | Citeseer | Cora | Pubmed | NELL (0.1) | NELL (0.01) | NELL (0.001) |
---|---|---|---|---|---|---|
Feat[[47]] | 57.2 | 57.4 | 69.8 | 62.1 | 40.4 | 21.7 |
ManiReg[[5]] | 60.1 | 59.5 | 70.7 | 63.4 | 41.3 | 21.8 |
SemiEmb[[44]] | 59.6 | 59.0 | 71.1 | 65.4 | 43.8 | 26.7 |
LP[[50]] | 45.3 | 68.0 | 63.0 | 71.4 | 44.8 | 26.5 |
DeepWalk[[27]] | 43.2 | 67.2 | 65.3 | 79.5 | 72.5 | 58.1 |
ICA[[23]] | 69.1 | 75.1 | 73.9 | – | – | – |
Planetoid (Transductive)[[47]] | 64.9 | 75.7 | 75.7 | 84.5 | 75.7 | 61.9 |
Planetoid (Inductive)[[47]] | 64.7 | 61.2 | 77.2 | 70.2 | 59.8 | 45.4 |
GCN[[18]] | 70.3 | 81.5 | 79.0 | 83.0 | 67.0 | 54.2 |
GGNN[[22]] | 68.1 | 77.9 | 77.2 | 84.6 | 66.2 | 59.1 |
GPNN (Ours) | 69.7 | 81.8 | 79.3 | 84.4 | 74.7 | 63.9 |

### 4.2 Entity Classification

Next, we consider experimental results of the entity classification task on the NELL
dataset extracted from the knowledge graph first presented in
[[8]].
A knowledge graph consists of a set of entities and a set of directed edges
which have labels (i.e., different types of relation).
Following [[47]], each triplet $(e_1, r, e_2)$ of entities and relation in the knowledge graph is split into two tuples.
Specifically, we assign separate relation nodes $r_1$ and $r_2$ to each entity
and thus obtain $(e_1, r_1)$ and $(e_2, r_2)$.
Entity nodes are associated with sparse feature vectors.
We follow [[18]] to extend the number of features by assigning a
unique one-hot representation for every relation node.
This results in a -dim sparse feature vector per node.
An additional validation set of labeled nodes under the label rate
as in [[18]] is used for tuning hyperparameters.
The chosen hyperparameters are then used for other label rates.
The semi-supervised task here considers three different label rates $0.1$,
$0.01$, $0.001$ per class in the training set.
We run the released code of GCN with the reported hyperparameters in
[[18]].
Since we did not observe overfitting on this dataset, we choose the
*embedding-input* variant as the input model.
The results are shown in Tab. 2, where we see that our model outperforms competitors under the most
challenging label rate $0.001$ and obtains comparable results with
the state of the art on other label rates.

### 4.3 Distantly-Supervised Entity Extraction

Finally, we consider the DIEL (Distant Information Extraction using coordinate-term Lists) dataset [[6]]. This dataset constructs a bipartite graph where nodes are medical entities and texts (referred to as mentions and coordinate lists in the original paper). Texts contain some facts about the medical entities. Edges of the graph are links between entities and texts. Each entity is associated with a pre-extracted sparse feature vector. The goal is to extract medical entities from text given sparse feature vectors and the graph. As shown in Tab. 1, this dataset is very challenging due to its extremely large scale and very high-dimensional sparse features. Note that we attempted to run the released code of the GCN model on this dataset, but ran out of memory. Thus, we adapted the public implementation of GCN to make it run successfully on this dataset, and also implemented GCN with our partition-based schedule.

We follow the exact experimental setup as in [[6], [47]], including different data splits, preprocessing of entity mentions and coordinate lists, and evaluation. We randomly sample a portion of the training nodes as the validation set. We regard the top-$k$ entities returned by a model as positive instances and compute recall@$k$ as the evaluation metric, with $k$ chosen as in [[6], [47]]. Average recall over multiple runs is reported in Tab. 3, and we see that GPNN outperforms all other models. Note that since Freebase is used as ground truth and some entities are not present in texts, recall is bounded above by a value reported in [[6]].

Method | Recall@k |
---|---|
LP [[50]] | 16.20 |
DeepWalk [[27]] | 25.80 |
Feat [[47]] | 34.90 |
DIEL [[6]] | 40.50 |
ManiReg [[5]] | 47.70 |
SemiEmb [[44]] | 48.60 |
Planetoid (Transductive) [[47]] | 50.00 |
Planetoid (Inductive) [[47]] | 50.10 |
GCN [[18]] | 48.14 |
GCN + Partition | 48.47 |
GGNN [[22]] | 51.15 |
GPNN | 52.11 |
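The evaluation metric for this task, recall over the top-ranked entities, can be sketched as below. The function name, the dictionary-based inputs and the toy scores are hypothetical; the actual evaluation follows the setup of [[6], [47]], which is not reproduced here.

```python
def recall_at_k(scores, positives, k):
    """Fraction of ground-truth positives found among the k highest-scored entities.

    scores:    dict entity -> model score (higher means more likely positive)
    positives: set of ground-truth positive entities
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for entity in top_k if entity in positives)
    return hits / len(positives)


# Toy example: "e" is a ground-truth entity that never appears among the
# scored candidates, so recall cannot reach 1.0 no matter how large k is.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.05}
toy_positives = {"a", "c", "e"}
toy_recall = recall_at_k(toy_scores, toy_positives, k=2)
```

The toy example also shows why recall has an upper bound below 1 when some ground-truth entities are absent from the candidate set, as noted for this dataset.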

### 4.4 Comparison of Different Partition Methods

We now compare the two partition methods we considered for our model: spectral partition and our modified multi-seed flood fill. We use the NELL dataset as the benchmark and report the validation accuracy averaged over multiple runs in Tab. 4, in which we also report the average runtime of the partition process. The accuracies of the trained models do not allow for a clear conclusion as to which method to use, and in our further experiments they seem to depend strongly on the number of subgraphs, the connectivity of input graphs, optimization and other factors. However, our multi-seed flood fill partition method is substantially faster and is efficiently applicable to very large graphs.

Number of subgraphs | Spectral Partition | Modified Multi-seed Flood Fill |
---|---|---|
5 | 54.8% (2.5s) | 62.0% (0.36s) |
10 | 55.6% (4.2s) | 63.1% (0.36s) |
20 | 58.0% (12.2s) | 57.5% (0.43s) |
30 | 60.1% (3115.0s) | 59.9% (0.23s) |

### 4.5 Comparison of Different Propagation Schedules

Besides the synchronous and our partition based propagation schedules, we also investigated two further schedules based on a sequential order and a series of minimum spanning trees (MST).

To generate a sequential schedule, we first perform graph traversal via breadth first search (BFS) which gives us a visiting order. We then split the edges into those that follow the visiting order and those that violate it. The edges in each class construct a directed acyclic graph (DAG), and we construct a propagation schedule from each DAG following the principle that every node will send messages once it receives all messages from its parents and updates its own state. An example of the schedule is given in the appendix. Note that this sequential schedule reduces to a standard bidirectional recurrent neural network on a chain graph.
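The edge split behind the sequential schedule can be sketched as follows. This is an illustration only: it assumes a connected graph without self-loops, the helper names are hypothetical, and the final step of turning each DAG into a message-sending order is not shown.

```python
from collections import deque


def bfs_order(neighbors, root):
    """Map each node reachable from `root` to its position in BFS visiting order."""
    order = {}
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order[node] = len(order)
        for nb in neighbors[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order


def split_by_order(edges, order):
    """Split directed edges into those following and those violating the order.

    Each part is acyclic, because the position strictly increases (or strictly
    decreases) along every edge in it.
    """
    forward = [(u, v) for u, v in edges if order[u] < order[v]]
    backward = [(u, v) for u, v in edges if order[u] > order[v]]
    return forward, backward


# Bidirectional chain 0-1-2: the two DAGs are the left-to-right and
# right-to-left passes of a bidirectional RNN.
chain_neighbors = {0: [1], 1: [0, 2], 2: [1]}
chain_edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
forward, backward = split_by_order(chain_edges, bfs_order(chain_neighbors, 0))
```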

For the MST schedule, we find a sequence of minimum spanning trees as follows. We first assign random positive weights from a bounded range to every edge and then apply Kruskal’s algorithm to find an MST. Next we increase the weights by a constant for edges which are present in the MSTs we have found so far. This process is iterated until we find $T$ MSTs, where $T$ is the total number of propagation steps.
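A sketch of this iterated MST construction is given below. The weight range and the increment are not legible in the text above, so drawing initial weights from $(0, 1]$ and adding $1$ to the edges of each tree found are assumptions, as are the function names. Edges are treated as undirected pairs of integer node indices.

```python
import random


def kruskal(num_nodes, weighted_edges):
    """Return the edges of a minimum spanning tree (or forest) via union-find."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for _, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


def mst_sequence(num_nodes, edges, num_trees, increment=1.0, seed=0):
    """Find `num_trees` spanning trees, penalizing edges that were already used."""
    rng = random.Random(seed)
    # Assumption: initial weights in (0, 1]; 1 - random() avoids exact zeros.
    weights = {edge: 1.0 - rng.random() for edge in edges}
    trees = []
    for _ in range(num_trees):
        tree = kruskal(num_nodes, [(weights[e], e[0], e[1]) for e in edges])
        trees.append(tree)
        for edge in tree:
            weights[edge] += increment  # assumed increment
    return trees


triangle = [(0, 1), (1, 2), (0, 2)]
trees = mst_sequence(3, triangle, num_trees=2)
```

Raising the weights of used edges pushes later trees towards edges that have not carried messages yet, so successive propagation steps cover different parts of the graph.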

We compare all four schedules by varying the number of propagation steps on the Cora dataset. The validation accuracies are shown in Fig. 2 (c). To clarify, assuming the graph is singly connected, the numbers of edges used per propagation step by MST, Sequential, Synchronous and Partition in Fig. 2 (c) are $|V| - 1$, $|E|$, $|E|$ and $|E|$ respectively. Here, $V$ and $E$ are the sets of nodes and edges. We also show the average results of multiple runs with different random seeds on Cora in Tab. 5.

Prop Step | 1 | 3 | 5 |
---|---|---|---|
MST | 59.94% ± 0.89 | 71.83% ± 0.96 | 77.1% ± 0.72 |
Sequential | 73.04% ± 1.93 | 77.55% ± 0.65 | 74.89% ± 1.26 |
Synchronous | 67.36% ± 1.44 | 80.15% ± 0.80 | 80.06% ± 0.98 |
Partition | 68.1% ± 1.98 | 80.27% ± 0.78 | 80.12% ± 0.93 |

In these results, the meaning of one propagation step varies. For the synchronous schedule, a propagation step means that every node sent and received messages once and updated its state. For the sequential schedule, it means that messages from all roots of the two DAGs were sent to all the leaves. For the MST-based schedule, it means sending messages from the root to all leaves on one minimum spanning tree. For our partition schedules, it means one outer loop of the algorithm. In this sense, messages are propagated furthest through the graph for the sequential schedule within one propagation step. This becomes visible in the results on a single propagation step, in which the sequential schedule yields the highest accuracy. However, when increasing the number of propagation steps, the computation graph associated with the sequential schedule becomes extremely deep, making the learning problem very hard. Our proposed partition schedule performs similarly to the synchronous schedule (while requiring less computation), and better than other asynchronous schedules when using more than a single propagation step.

## 5 Conclusion

We presented graph partition neural networks, which extend graph neural networks. Relying on graph partitions, our model alternates between locally propagating information between nodes in small subgraphs and globally propagating information between the subgraphs. Moreover, we propose a modified multi-seed flood fill for fast partitioning of large scale graphs. Empirical results show that our model performs better than or comparably to state-of-the-art methods on a wide variety of semi-supervised node classification tasks. Moreover, in contrast to existing models, our GPNNs are able to handle extremely large graphs well.

There are quite a few exciting directions to explore in the future. One is to learn the graph partitioning as well as the GNN weights, using a soft partition assignment. Other types of propagation schedules which have proven useful in probabilistic graphical models are also worthwhile to explore in the context of GNNs. To further improve the efficiency of propagating information, different nodes within the graph could share some memory, which mimics the shared memory model in the theory of distributed computing. Perhaps most importantly, this work makes it possible to run GNN models on very large graphs, which potentially opens the door to many new applications.

## Appendix A Appendix

### A.1 Bi-directional Chain

In this section, we revisit the broadcast problem on bi-directional chain graphs. We show that our propagation schedule has advantages over the synchronous one via the following proposition.

###### Proposition 1.

Let $G$ be a bi-directional chain of size $N$. We have: (1) The synchronous propagation schedule requires $2(N-1)^2$ messages to solve the problem; (2) If we partition the chain evenly into $K$ sub-chains for $1 \le K \le N$, the GPNN propagation schedule can solve the problem with $2(N-K)^2 + 2(K-1)^2$ messages.

###### Proof.

We first analyze the case of the synchronous propagation schedule. At each round, it needs $2(N-1)$ messages to propagate messages one step away. Since it requires at least $N-1$ steps for a message from one endpoint of the chain to reach the other, the number of messages to solve broadcast is thus $2(N-1)^2$.

We now turn to our schedule. Since the chain is evenly partitioned, each sub-chain has $N/K$ nodes. We need to perform $N/K - 1$ propagation steps to traverse a sub-chain, so we set the number of intra-subgraph steps to $T_S = N/K - 1$. The number of messages required by a single sub-chain during the intra-subgraph propagation phase is $2(N/K - 1)^2$, and so all sub-chains collectively require $2K(N/K - 1)^2$ messages. Between intra-subgraph propagation phases, we perform $1$ step of inter-subgraph propagation to transfer messages over the cut edges between sub-chains. Each inter-subgraph step requires $2$ messages per cut edge, i.e., $2(K-1)$ messages in total. We need $K$ outer loops to ensure that a message from any node can reach any other node, and strictly speaking, the last inter-subgraph propagation step is unnecessary. So in total, we require $2K^2(N/K - 1)^2 + 2(K-1)^2 = 2(N-K)^2 + 2(K-1)^2$ messages, which proves the proposition. ∎

One can see from the above proposition that if we take $K = 1$ or $K = N$, the number of messages of our schedule matches the synchronous one. We can also derive the optimal value of $K$ as $(N+1)/2$, resulting in a factor of $2$ reduction in the total messages sent compared to the synchronous schedule.
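The counting argument of the proof can be checked numerically. The sketch below follows that argument under its own assumptions: a bi-directional chain of $n$ nodes has $2(n-1)$ directed edges, the chain splits evenly into $k$ sub-chains, each outer loop runs $n/k - 1$ intra-subgraph steps, and the final inter-subgraph step is skipped. The function names are hypothetical.

```python
def synchronous_messages(n):
    """Messages for broadcast on a chain: 2(n-1) per round, n-1 rounds."""
    return 2 * (n - 1) * (n - 1)


def partitioned_messages(n, k):
    """Messages for the partitioned schedule with k equal sub-chains."""
    if n % k != 0:
        raise ValueError("the chain must split evenly into k sub-chains")
    sub = n // k
    # One intra-subgraph phase: k sub-chains, (sub - 1) steps each,
    # 2 * (sub - 1) messages per step.
    intra_per_outer_loop = k * 2 * (sub - 1) * (sub - 1)
    # One inter-subgraph step: 2 messages for each of the k - 1 cut edges.
    inter_per_step = 2 * (k - 1)
    # k outer loops, but only k - 1 inter-subgraph steps are needed.
    return k * intra_per_outer_loop + (k - 1) * inter_per_step


counts = {k: partitioned_messages(12, k) for k in (1, 2, 3, 4, 6, 12)}
```

For $n = 12$ the counts at $k = 1$ and $k = 12$ both equal the synchronous count of 242, and the smallest count among the even splits is at $k = 6$, the divisor nearest to $(n+1)/2$.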

### A.2 Hyperparameters

We train all models using Adam [[17]] with a learning rate of . We also use early stopping with a window size of . We clip the gradient norm to ensure that it is no larger than . The maximum number of epochs for all experiments except NELL is set to . For NELL, it is . The weight decays for Cora, Citeseer, Pubmed, NELL and DIEL are set to , , , and respectively. The dimensions of the state vectors of GPNN for Cora, Citeseer, Pubmed, NELL and DIEL are set to , , , and . The output model for Cora, Citeseer and NELL is just a softmax layer. For Pubmed and DIEL, we add one hidden layer with activation function before the softmax, with dimension and respectively.

### a.3 Random Splits of Citation Networks

We include the results on citation networks with random splits in Table 6. From the table, we can see that our results are comparable with the state-of-the-art on these small scale datasets.

### a.4 Sequential Propagation Schedule

### a.5 Random Partition Schedule

We also experimented with schedules determined by random partitions of the graph. In particular, for -step propagation, we randomly sample proportion of edges from the whole edge set without replacement and use them for the update. We summarize the results ( runs) on the Cora dataset in Table 7.

Propagation Step | 2 | 3 | 5 | 10 |
---|---|---|---|---|
Avg Acc | 76.03 | 74.71 | 72.09 | 69.99 |
Std Acc | 1.55 | 1.31 | 1.81 | 2.26 |

From the results, we can see that the best average accuracy is 76.03, which is still lower than that of both the synchronous schedule and our partition-based schedule. Note that this result roughly matches the one obtained with spanning trees. The reason might be that random schedules typically need more propagation steps to spread information throughout the graph. However, more propagation steps in GNNs may lead to issues in learning with BPTT.
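The random schedule used in this experiment can be sketched as follows. The function name `random_edge_schedule` and its parameters are hypothetical, and the sampled proportion is left as an argument rather than fixed. The sketch also assumes that "without replacement" means the edge subsets of different propagation steps are disjoint, which is one possible reading.

```python
import random


def random_edge_schedule(edges, num_steps, proportion, seed=None):
    """Return one random edge subset per propagation step.

    Subsets are disjoint across steps (an assumed reading of sampling
    without replacement); each holds floor(proportion * len(edges)) edges.
    """
    per_step = int(proportion * len(edges))
    if per_step * num_steps > len(edges):
        raise ValueError("not enough edges to sample without replacement")
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    # Consecutive chunks of a shuffled list are disjoint random samples.
    return [shuffled[t * per_step:(t + 1) * per_step] for t in range(num_steps)]
```

At propagation step `t`, messages would then be passed only along the edges in the `t`-th subset.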

### a.6 Implementation

The released code of GGNN [[22]] is implemented in Torch. We implement both our own version of GGNN and our model in TensorFlow [[1]]. To ensure correctness, we first reproduced the experimental results of the paper on the bAbI artificial intelligence (AI) tasks with our implementation of GGNN. Our code will be released soon. One challenging part is the implementation of synchronous propagation within subgraphs. We implicitly implement the parallel part by building one separate branch of the computational graph for each subgraph (i.e., we use a Python for loop rather than tf.while_loop). This relies on the claim that TensorFlow optimizes the execution of the computational graph in such a way that independent branches of the graph are executed in parallel, as described in [[1]]. However, since we have no control over the optimization of the computational graph, this part could be improved by explicitly placing each branch on a separate computation device, as in the multi-tower solution for training convolutional neural networks (CNNs) on multiple GPUs.
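The branch-per-subgraph structure described above can be illustrated with a framework-free Python sketch. It is not the released implementation: the function names are hypothetical, the update rule (averaging a node's state with its incoming messages) is a toy stand-in for the learned GGNN update, and the loop here runs the branches one after another, whereas in a TensorFlow computational graph the same loop would only define independent branches whose execution order is left to the runtime.

```python
def propagate_subgraph(states, edges, num_steps):
    """Synchronous propagation restricted to a single subgraph.

    states: dict mapping node -> float state.
    edges: undirected (u, v) pairs with both endpoints inside the subgraph.
    """
    for _ in range(num_steps):
        incoming = {node: [] for node in states}
        for u, v in edges:
            incoming[u].append(states[v])
            incoming[v].append(states[u])
        # Toy update: average of the node's own state and its incoming messages.
        states = {
            node: (states[node] + sum(msgs)) / (1 + len(msgs))
            for node, msgs in incoming.items()
        }
    return states


def intra_subgraph_phase(states, subgraphs, num_steps):
    """Build one independent branch per subgraph with a plain Python for loop."""
    new_states = dict(states)
    for nodes, edges in subgraphs:
        local = {n: states[n] for n in nodes}  # branch only reads its own nodes
        new_states.update(propagate_subgraph(local, edges, num_steps))
    return new_states
```

Because each branch reads and writes only the nodes of its own subgraph, the branches have no data dependencies on one another, which is the property that allows them to be scheduled in parallel or placed on separate devices.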

### Footnotes

- Our code is released at https://github.com/Microsoft/graph-partition-neural-network-samples

### References

- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
- M. Allamanis, M. Brockschmidt, and M. Khademi. Learning to represent programs with graphs. In ICLR, 2018.
- H. Attiya and J. Welch. Distributed computing: fundamentals, simulations, and advanced topics, volume 19. John Wiley & Sons, 2004.
- L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, and N. Schneider. Abstract meaning representation (AMR) 1.0 specification.
- M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
- L. Bing, S. Chaudhari, R. Wang, and W. Cohen. Improving distant supervision for information extraction using label propagation through lists. In EMNLP, pages 524–529, 2015.
- J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. ICLR, 2014.
- A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, volume 5, page 3, 2010.
- D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
- G. Elidan, I. McGraw, and D. Koller. Residual belief propagation: informed scheduling for asynchronous message passing. In UAI, pages 165–173, 2006.
- V. Garcia and J. Bruna. Few-shot learning with graph neural networks. In ICLR, 2018.
- J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In ICML, 2017.
- C. Goller and A. Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks, volume 1, pages 347–352. IEEE, 1996.
- M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IJCNN, 2005.
- A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In KDD, pages 855–864, 2016.
- W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, pages 1025–1035, 2017.
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.
- N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):531–552, 2011.
- S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological), pages 157–224, 1988.
- R. Li, M. Tapaswi, R. Liao, J. Jia, R. Urtasun, and S. Fidler. Situation recognition with graph neural networks. 2017.
- Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. ICLR, 2016.
- Q. Lu and L. Getoor. Link-based classification. In ICML, volume 3, pages 496–503, 2003.
- L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
- K. Marino, R. Salakhutdinov, and A. Gupta. The more you know: Using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844, 2016.
- J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
- B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
- X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3D graph neural networks for RGBD semantic segmentation. In ICCV, pages 5199–5208, 2017.
- F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009.
- M. Schlichtkrull, T. N. Kipf, P. Bloem, R. vd Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103, 2017.
- P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93, 2008.
- J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
- R. Socher, E. H. Huang, J. Pennington, C. D. Manning, and A. Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS, pages 801–809, 2011.
- R. Socher, C. C. Lin, C. Manning, and A. Y. Ng. Parsing natural scenes and natural language with recursive neural networks. In ICML, pages 129–136, 2011.
- S. Sukhbaatar, R. Fergus, et al. Learning multiagent communication with backpropagation. In NIPS, pages 2244–2252, 2016.
- C. Sutton and A. McCallum. Improved dynamic schedules for belief propagation. arXiv preprint arXiv:1206.5291, 2012.
- K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. ACL, 2015.
- D. Tarlow, I. Givoni, R. Zemel, and B. Frey. Graph cuts is a max-product algorithm. In UAI, 2011.
- D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. arXiv preprint arXiv:1609.05600, 2016.
- U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.
- M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization for approximate inference on loopy graphs. In NIPS, pages 1001–1008, 2002.
- T. Wang, R. Liao, R. Zemel, J. Ba, and S. Fidler. Nervenet: Learning structured policy with graph neural networks. In ICLR, 2018.
- M. Welling. On the choice of regions for generalized belief propagation. In UAI, pages 585–592, 2004.
- J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
- E. P. Xing, M. I. Jordan, and S. Russell. A generalized mean field algorithm for variational inference in exponential families. In UAI, pages 583–591, 2002.
- E. P. Xing, M. I. Jordan, and S. Russell. Graph partition strategies for generalized mean field inference. In UAI, pages 602–610, 2004.
- Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016.
- J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Exploring artificial intelligence in the new millennium, 8:236–239, 2003.
- J. Zhang, A. Schwing, and R. Urtasun. Message passing inference for large scale graphical models with high order potentials. In NIPS, pages 1134–1142, 2014.
- X. Zhu, Z. Ghahramani, J. Lafferty, et al. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pages 912–919, 2003.