Communication-Efficient Sampling for Distributed Training of Graph Convolutional Networks
Training Graph Convolutional Networks (GCNs) is expensive as it needs to aggregate data recursively from neighboring nodes. To reduce the computation overhead, previous works have proposed various neighbor sampling methods that estimate the aggregation result based on a small number of sampled neighbors. Although these methods have successfully accelerated the training, they mainly focus on the single-machine setting. As real-world graphs are large, training GCNs in distributed systems is desirable. However, we found that the existing neighbor sampling methods do not work well in a distributed setting. Specifically, a naive implementation may incur a huge amount of communication of feature vectors among different machines. To address this problem, we propose a communication-efficient neighbor sampling method in this work. Our main idea is to assign higher sampling probabilities to the local nodes so that remote nodes are accessed less frequently. We present an algorithm that determines the local sampling probabilities and makes sure our skewed neighbor sampling does not affect much to the convergence of the training. Our experiments with node classification benchmarks show that our method significantly reduces the communication overhead for distributed GCN training with little accuracy loss.
Graph Convolutional Networks (GCNs) are powerful models for learning representations of attributed graphs. They have achieved great success in graph-based learning tasks such as node classification kipf2017semi; duran2017learning, link prediction zhang2017weisfeiler; zhang2018link, and graph classification ying2018hierarchical; 10.5555/3305381.3305512.
Despite the success of GCNs, training a deep GCN on large-scale graphs is challenging. To compute the embedding of a node, GCN needs to recursively aggregate the embeddings of the neighboring nodes. The number of nodes needed for computing a single sample can grow exponentially with respect to the number of layers. This has made mini-batch sampling ineffective to achieve efficient training of GCNs. To alleviate the computational burden, various neighbor sampling methods have been proposed hamilton2017inductive; ying2018graph; chen2018fastgcn; zou2019layer; AAAI1816642; chiang2019cluster; Zeng2020GraphSAINT. The idea is that, instead of aggregating the embeddings of all neighbors, they compute an unbiased estimation of the result based on a sampled subset of neighbors.
Although the existing neighbor sampling methods can effectively reduce the computation overhead of training GCNs, most of them assume a single-machine setting. The existing distributed GCN systems either perform neighbor sampling for each machine/GPU independently (e.g., PinSage ying2018graph, AliGraph yang2019aligraph, DGL wang2019deep) or perform a distributed neighbor sampling for all machines/GPUs (e.g., AGL 10.14778/3415478.3415539). If the sampled neighbors on a machine include nodes stored on other machines, the system needs to transfer the feature vectors of the neighboring nodes across the machines. This incurs a huge communication overhead. None of the existing sampling methods or the distributed GCN systems have taken this communication overhead into consideration.
In this work, we propose a communication-efficient neighbor sampling method for distributed training of GCNs. Our main idea is to assign higher sampling probabilities for local nodes so that remote nodes will be accessed less frequently. By discounting the embeddings with the sampling probability, we make sure that the estimation is unbiased. We present an algorithm to generate the sampling probability that ensures the convergence of training. To validate our sampling method, we conduct experiments with node classification benchmarks on different graphs. The experimental results show that our method significantly reduces the communication overhead with little accuracy loss.
2 Related Work
The idea of applying convolution operation to the graph domain is first proposed by bruna2013spectral. Later, kipf2017semi and defferrard2016convolutional simplify the convolution computation with localized filters. Most of the recent GCN models (e.g., GAT velickovic2018graph, GraphSAGE hamilton2017inductive, GIN xu2018how) are based on the GCN in kipf2017semi where the information is only from 1-hop neighbors in each layer of the neural network. In kipf2017semi, the authors only apply their GCN to small graphs and use full batch for training. This has been the major limitation of the original GCN model as full batch training is expensive and infeasible for large graphs. Mini-batch training does not help much since the number of nodes needed for computing a single sample can grow exponentially as the GCN goes deeper. To overcome this limitation, various neighbor sampling methods have been proposed to reduce the computation complexity of GCN training.
Node-wise Neighbor Sampling: GraphSAGE hamilton2017inductive proposes to reduce the receptive field size of each node by sampling a fixed number of its neighbors in the previous layer. PinSAGE ying2018graph adopts this node-wise sampling technique and enhances it by introducing an importance score to each neighbor. It leads to less information loss due to weighted aggregation. VR-GCN pmlr-v80-chen18p further restricts the neighbor sampling size to two and uses the historical activation of the previous layer to reduce variance. Although it achieves comparable convergence to GraphSAGE,VR-GCN incurs additional computation overhead for convolution operations on historical activation which can outweigh the benefit of reduced number of sampled neighbors. The problem with node-wise sampling is that, due to the recursive aggregation, it may still need to gather the information of a large number of nodes to compute the embeddings of a mini-batch.
Layer-wise Importance Sampling: To further reduce the sample complexity, FastGCN chen2018fastgcn proposes layer-wise importance sampling. Instead of fixing the number of sampled neighbors for each node, it fixes the number of sampled nodes in each layer. Since the sampling is conduced independently in each layer, it requires a large sample size to guarantee the connectivity between layers. To improve the sample density and reduce the sample size, huang2018adaptive and zou2019layer propose to restrict the sampling space to the neighbors of nodes sampled in the previous layer.
Subgraph Sampling: Layer-wise sampling needs to maintain a list of neighbors and calculate a new sampling distribution for each layer. It incurs an overhead that can sometime deny the benefit of sampling, especially for small graphs. GraphSAINT Zeng2020GraphSAINT proposes to simplify the sampling procedure by sampling a subgraph and performing full convolution on the subgraph. Similarly, ClusterGCN chiang2019cluster pre-partitions a graph into small clusters and constructs mini-batches by randomly selecting subsets of clusters during the training.
All of the existing neighbor sampling methods assume a single-machine setting. As we will show in the next section, a straightforward adoption of these methods to a distributed setting can lead to a large communication overhead.
3 Background and Motivation
In a -layer GCN, the -th convolution layer is defined as where represents the embeddings of all nodes at layer before activation, represents the feature vectors, is the activation function, is the normalized Laplacian matrix of the graph, and is the learnable weights at layer . The multiple convolution layers in the GCN can be represented as
The output embedding is given to some loss function for downstream learning tasks such as node classification or link prediction.
GCN as Multi-level Stochastic Compositional Optimization: As pointed out by cong2020minimal, training a GCN with neighbor sampling can be considered as a multi-level stochastic compositional optimization (SCO) problem (although their description is not accurate). Here, we give a more precise connection between GCN training and multi-level SCO. Since the convergence property of algorithms for multi-level SCO has been extensively studied yang2019multilevel; zhang2019multi; chen2020solving, this connection will allow us to study the convergence of GCN training with different neighbor sampling methods. We can define the graph convolution at layer as a function . The embedding approximation with neighbor sampling can be considered as a stochastic function where is a stochastic matrix with . Therefore, we have . The loss function of the GCN can be written as
Here, is the set of learnable weights at all layers , , and the stochastic function corresponds to the mini-batch sampling.
Distributed Training of GCN: As the real-world graphs are large and the compute/memory capacity of a single machine is limited, it is always desirable to perform distributed training of GCNs. A possible scenario would be that we train a GCN on a multi-GPU system. The global memory of a single GPU cannot accommodate the feature vectors of all nodes in the graph. It will be inefficient to store the feature vectors on the CPU main memory and move the feature vectors to GPU in each iteration of the training process because the data movement incurs a large overhead. We want to split the feature vectors and store them on multiple GPUs so that each GPU can perform calculation on its local data. Another possible scenario would be that we have a large graph with rich features which cannot be store on a single machine. For example, the e-commerce graphs considered in AliGraph yang2019aligraph can ‘contain tens of billions of nodes and hundreds of billions of edges with storage cost over 10TB easily’. Such graphs need to be partitioned and stored on different machines in a distributed system. Figure 1 shows an example of training a two-layer GCN on four GPUs. Suppose full neighbor convolution is used and each GPU computes the embeddings of its local nodes. GPU-0 needs to compute the embeddings of node A and B and obtain a stochastic gradient based on the loss function. GPU-1 needs to compute the embeddings of node C and D and obtain a stochastic gradient . Similarly, GPU-2 and GPU-3 compute the embeddings of their local nodes and obtain stochastic gradient and . The stochastic gradients obtained on different GPUs are then averaged and used to update the model parameters.
Communication of Feature Vectors: As shown in Figure 1, the computation of a node’s embedding may involve reading the feature vector of a remote node. To compute the embedding of node A on GPU-0, we need the intermediate embeddings of node B and E, which in turn need the feature vectors of node A, B, C, E and F (Note that the feature vector of node E itself is needed to compute its intermediate embedding; the same for node B). Since node C, E, F are not on GPU-0, we need to send the feature vectors of node C from GPU-1 and node E, F from GPU-2. Similarly, to compute the embedding of node B on GPU-0, we need feature vectors of node B, C, D, E, F and G, which means that GPU-0 needs to obtain data from all of the other three GPUs. This apparently incurs a large communication overhead. Even with neighbor sampling, the communication of the feature vectors among the GPUs are unavoidable. In fact, in our experiments on a four-GPU workstation, the communication can take more than 60% of the total execution time with a naive adoption of neighbor sampling. The problem is expected to be more severe on distributed systems with multiple machines. Therefore, reducing the communication overhead for feature vectors is critical to the performance of distributed training of GCNs.
4 Communication-Efficient Neighbor Sampling
To reduce the communication overhead of feature vectors, a straightforward idea is to skew the probability distribution for neighbor sampling so that local nodes are more likely to be sampled. More specifically, to estimate the aggregated embedding of node ’s neighbors (i.e., where denotes the neighbors of node , is the embedding of node , and is its weight), we can define a sequence of random variables Bernoulli() where is the probability that node in the neighbor list is sampled. We have an unbiased estimate of the result as
The expected communication overhead with this sampling strategy is
where is the set of remote nodes. Suppose we have a sampling budget and we denote all the local nodes as . We can let so that neighbors are sampled on average. It is apparent that, if we increase the sampling probability of local nodes (i.e., ), the expected communication overhead will be reduced. However, the local sampling probability cannot be increased arbitrarily. As an extreme case, if we let , only the local nodes will be sampled, but we will not be able to obtain an unbiased estimate of the result, which can lead to poor convergence of the training algorithm. We need a sampling strategy that can reduce the communication overhead while maintaining an unbiased approximation with small variance.
4.1 Variance of Embedding Approximation
Consider the neighbor sampling at layer . Suppose is the set of sampled nodes at layer . We sample from all the neighbors of nodes in and estimate the result for each of the node using (3). The total estimation variance is
Here is the sum of squared weights of edges from nodes in to node . Clearly, the smallest variance is achieved when , and it corresponds to full computation. Since we are given a sampling budget, we want to minimize under the constraint . The optimization problem is infeasible because the real value of is unknown during the sampling phase. Some prior work uses from the previous iteration of the training loop to obtain an optimal sampling probability distribution (e.g., cong2020minimal). This however incurs an extra overhead for storing for all the layers. A more commonly used approach is to consider as bounded by a constant and minimize the upper bound of chen2018fastgcn; zou2019layer. The problem can be written as a constrained optimization problem:
The Sampling Method Used in Previous Works: Although the solution to the above problem can be obtained, it requires expensive computations. For example, cong2020minimal give an algorithm that needs to sort and searches for the solution. As neighbor sampling is performed at each layer of the GCN and in each iteration of the training algorithm, finding the exact solution to (6) may significantly slowdown the training procedure. chen2018fastgcn and zou2019layer adopt a simpler sampling method. They define a discrete probability distribution over all nodes in and assign the probability of returning node as
They run the sampling for times (without replacement) to obtain neighbors. We call this sampling method linear weighted sampling. Intuitively, if a node is in the neighbor list of many nodes in (i.e., is large), it has a high probability of being sampled. More precisely, the probability of node being sampled is
Due to its easy calculation, we adopt this sampling strategy in our work, and we skew the sampling probability distribution to the local nodes so that the communication overhead can be reduced.
4.2 Skewed Linear Weighted Sampling
Our idea is to scale the sampling weights of local nodes by a factor . More specifically, we divide the nodes in into the local nodes and the remote nodes , and we define the sampling probability distribution as
Compared with (7), (10) achieves a smaller communication overhead because is smaller. We call our sampling method skewed linear weighted sampling. Clearly, the larger we use, the more communication we save. Our next task is to find that can ensure the convergence of the training.
Note that the variance does not necessarily increase with the skewed neighbor sampling (the last term of (11) is negative).
Since GCN training is equivalent to multi-level SCO as explained in the Background section, we can use the convergence analysis of multi-level SCO to study the convergence of GCN training with our skewed neighbor sampling. Although different algorithms for multi-level SCO achieve different convergence rates yang2019multilevel; zhang2019multi; chen2020solving, for general non-convex objective function , all of these algorithms have the optimality error or bounded by some terms that are linear to the upper bound of the approximation variance at each level. This means that if we can make sure , our skewed neighbor sampling will not affect the convergence of the training. In light of this, we have the following theorem for determining the scaler in (10).
If we set with some constant , we can calculate an exact upper bound of by solving the equations with (7) and (10). The upper bound is where , , . is the number of local nodes, is the number of remote nodes, and is the sampling budget. For simple computation, we ignore all the terms dependant on and , and it gives us a feasible solution where . ∎
Intuitively, if there are few remote nodes (i.e., is large), we can sample the local nodes more frequently, and (12) gives us a larger . If we have a large sampling budget , the estimation variance of the linear weighted sampling (9) is small. We will need to sample enough remote nodes to keep the variance small, and (12) gives us a smaller .
5 Experimental Results
We evaluate our communication-efficient sampling method in this section.
5.1 Experimental Setup
Platform: We conducted our experiments on two platforms: a workstation with four Nvidia RTX 2080 Ti GPUs, and eight machines each with an Nvidia P100 GPU in a HPC cluster. The four GPUs in the workstation are connected through PCIe 3.0 x16 slot. The nodes in the HPC cluster are connected with 100Gbps InfiniBand based on a fat-tree topology. Our code is implemented with PyTorch 1.6. We use CUDA-aware MPI for communication among the GPUs. To enable the send/recv primitive in PyTorch distributed library, we compile PyTorch from source with OpenMPI 4.0.5.
Datasets: We conduct our experiments on five graphs as listed in Table 5.1. Cora and CiteSeer are two small graphs that are widely used in previous works for evaluating GCN performance zou2019layer; chen2018fastgcn; Zeng2020GraphSAINT. Reddit is a medium-size graph with 233K nodes and an average degree of 492. Amazon is a large graph with more than 1.6M nodes and 132M edges Zeng2020GraphSAINT. We use the same configurations of training set, validation set, and test set for the graphs as in previous works zou2019layer; chen2018fastgcn; Zeng2020GraphSAINT. Youtube is a large graph with more than 1M nodes mislove-2007-socialnetworks. Each node represents a user, and the edges represent the links among users. The graph does not have feature vector or label information given. We generate the labels and feature vectors based on the group information of the nodes. More specifically, we choose the 64 largest groups in the graph as labels. The label of each node is a vector of length 64 with element value of 0 or 1 depending on whether the node belongs to the group. Only the nodes that belong to at least one group are labeled. For feature vector, we randomly select 2048 from the 4096 largest groups. If a node does not belong to any group, its feature vector is 0. We use 90% of the labeled nodes for training and the remaining 10% for testing.
Benchmark and Settings: We use a 5-layer standard GCN (as in Formula (1)) to perform node classification tasks on the graphs. For Cora, CiteSeer, Reddit and Amazon whose labels are single value, we use the conventional cross-entropy loss to perform multi-class classification. For YouTube, since the nodes’ labels are vectors, we use binary cross entropy loss with a rescaling weight of 50 to perform multi-label classification. The dimension of the hidden state is set to 256 for Cora, CiteSeer and Reddit. The dimension of the hidden state is set to 512 for YouTube and Amazon. For distributed training, we divide the nodes evenly among different GPUs. Each GPU runs sampling-based training independently, with the gradients averaged among GPUs in each iteration.
We incorporate our skewed sampling technique into LADIES zou2019layer and GraphSAINT Zeng2020GraphSAINT, both of which use linear weighted sampling as in (7). The only difference is that LADIES uses all neighbors of nodes at layer as when it samples nodes at layer , while GraphSAINT considers all training nodes as when it samples the subgraph.
We compare the performance of three versions. The first version (Full) uses the original linear weighted sampling and transfers sampled neighbors among GPUs. The second version (Local) also uses linear weighted sampling but only aggregates neighboring nodes on the same GPU. The third version (Our) uses our skewed sampling and transfers sampled neighbors among GPUs.
5.2 Results on Single-Machine with Multiple GPUs
We use LADIES to train the GCN with Cora, CiteSeer, Reddit and YouTube graph on the four-GPU workstation. The batch size on each GPU is set to 512, and the number of neighbor samples in each intermediate layer is also set to 512.
|Graph||Sampling Method||F1-Score (%)||Communication Data Size (#nodes)|
Cora and CiteSeer Results: Although these two graphs are small and can be easily trained on a single GPU, we apply distributed training to these two graphs and measure the total communication data size to show the benefit of our sampling method. Table 1 shows the best test accuracy and the total communication data size in 10 epochs of training. The results are collected in 10 different runs and we report the mean and deviation of the numbers. Compared with the full-communication version, our sampling method does not cause any accuracy loss for Cora with (in Formula (12)) set to . For CiteSeer, the mean accuracy in different runs decreases by about 1% with our sampling method. However, the best accuracy in different runs matches the best accuracy of full-communication version. Figure 2 shows the training loss over epochs with different sampling methods on Cora. We can see that local aggregation leads to poor convergence, due to the loss of information in the edges across GPUs. The other versions have the model converge to optimal after 3 epochs. The training loss on CiteSeer follows a similar pattern. The results indicate that our sampling method does not impair the convergence of training.
The execution times of different versions are almost the same on these two small graphs because the communication overhead is small and reducing the communication has little effect to the overall performance. Therefore, instead of reporting the execution time, we show the communication data size of different versions. The numbers in Table 1 are the numbers of nodes whose feature vectors are transferred among GPUs during the entire training process. We can see that the communication is indeed reduced. When , our sampling method saves the communication overhead by 1.48x on Cora and 1.32x on CiteSeer.
Reddit and YouTube Results: These are two larger graphs for which communication overhead is more critical to the performance of distributed training. Figure 3 shows the results on Reddit graph. We run the training for 50 epochs and compare the training loss, validation accuracy and execution time of different versions. The breakdown execution time is shown in Figure (c)c. We can see that communication takes more than 60% of the total execution time if we naively adopt the linear weighted sampling method. Our sampling method reduces the communication time by 1.4x, 2.5x and 3.5x with set to 4, 8, 16, respectively. The actual communication data size is reduced by 2.4x, 3.6x and 5.2x. From Figure (a)a, we can see that our sampling method converges at almost the same rate as the full-communication version when is set to 4 or 8. The training converges slower when , due to the large approximation variance. Figure (b)b shows the validation accuracy of different versions. The best accuracy achieved by full-communication version is 93.0%. Our sampling method achieves accuracy of 94.3%, 92.4%, 92.2% with set to 4, 8, 16, respectively. The figures also show that training with local aggregation leads to a significant accuracy loss.
Figure 4 shows the results on YouTube graph. We run the training for 300 epochs. As shown in Figure (c)c, the communication takes more than 70% of the total execution time in the full-communication version. Our sampling method effectively reduces the communication time. The larger we use, the more communication time we save. The actual communication data size is reduced by 3.3x, 4.6x, 6.7x with set to 4, 8, 16. Despite the communication reduction, our sampling method achieves almost the same convergence as full-communication version, as shown in Figure (a)a and (b)b. The full-communication version achieves a best accuracy of 34.0%, while our sampling method achieves best accuracy of 33.4%, 36.0%, 33.2% with set to 4, 8, 16. In contrast, local aggregation leads to a noticeable accuracy loss. The best accuracy it achieves is 28.5%.
Comparison with Centralized Training: As we described in Section 3, the motivation of partitioning the graph and splitting the nodes’ features among GPUs in a single machine is that each GPU cannot hold the feature vectors of the entire graph. As oppose to this distributed approach, an alternative implementation is to store all the features on CPU and copy the feature vectors of the sampled nodes to different GPUs in each iteration. For example, PinSage ying2018graph adopts this approach for training on large graphs. To justify our distributed storage of the feature vectors, we collect the performance results of GCN training with this centralized storage of feature vectors on CPU. Table 5.2 lists the training time for different graphs. Even compared with the full-communication baseline in Figure (c)c and (c)c, this centralized approach is 1.2x to 13x slower. This is because the centralized approach incurs a large data movement overhead for copying data from CPU to GPU. The results suggest that, for training large graphs on a single-machine with multiple GPUs, distributing the feature vectors onto different GPUs is more efficient than storing the graph on CPU.
5.3 Results on Multiple Machines
We incorporate our sampling method to GraphSAINT and train a GCN with Amazon graph on eight machines in a HPC cluster. The subgraph size is set to 4500. With our skewed sampling method, a subgraph is more likely to contain nodes on the same machine, and thus incurs less communication among machines. We run the training for 20 epochs. As shown in Figure (c)c, the communication takes more than 75% of the total execution time in the full-communication version. Our sampling method effectively reduces the communication overhead. The overall speedup is 1.7x, 2.8x, 4.2x with set to 4, 8, 16. As shown in Figure (a)a and (b)b, our sampling method achieves almost the same convergence as full-communication version. The full-communication version achieves a best accuracy of 79.31%, while our sampling method achieves best accuracy of 79.5%, 79.19%, 79.29% with set to 4, 8, 16. Although the training seems to converge with local aggregation on this dataset, there is a clear gap between the line of local aggregation and the lines of other versions. The best accuracy achieved by local aggregation is 76.12%.
In this work, we study the training of GCNs in a distributed setting. We find that the training performance is bottlenecked by the communication of feature vectors among different machines/GPUs. Based on the observation, we propose the first communication-efficient sampling method for distributed GCN training. The experimental results show that our sampling method effectively reduces the communication overhead while maintaining a good accuracy.