Layer-Wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees
Abstract
To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, namely pipelining and gradient sparsification, have been separately proposed to alleviate the impact of communication overheads. Yet, the gradient sparsification methods can only initiate the communication after the backpropagation, and hence miss the pipelining opportunity. In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, every worker independently selects a small set of "significant" gradients from each layer, whose size can be adapted to the communication-to-computation ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity of overlapping communications with computations, while its adaptive nature makes it flexible in controlling the communication time. We prove that LAGS-SGD has convergence guarantees and that it has the same order of convergence rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments are conducted to verify the analytical assumption and the convergence performance of LAGS-SGD. Experimental results on a 16-GPU cluster show that LAGS-SGD outperforms the original S-SGD and existing sparsified S-SGD without noticeable loss of model accuracy.
1 Introduction
With increasing data volumes and model sizes of deep neural networks (DNNs), distributed training is commonly adopted to accelerate the training process across multiple workers. Current distributed stochastic gradient descent (SGD) approaches can be categorized into three types: synchronous [9, 21], asynchronous [40] and stale-synchronous [15]. Synchronous SGD (S-SGD) with data-parallelism is the most widely used in distributed deep learning due to its good convergence properties [8, 12]. However, S-SGD requires iterative synchronization and communication of dense gradient/parameter aggregation among all the workers. Compared to the computing speed of modern accelerators (e.g., GPUs and TPUs), the network speed is usually slow, which makes communication a potential system bottleneck. Even worse, the communication time usually grows with the size of the cluster [37]. Many recent studies focus on alleviating the impact of communication in S-SGD to improve system scalability. These studies include system-level methods and algorithm-level methods.
On the system level, pipelining [4, 38, 12, 28, 22, 13, 17, 27] is used to overlap communications with computations by exploiting the layer-wise structure of backpropagation during the training of deep models. On the algorithmic level, researchers have proposed gradient quantization (fewer bits per number) and sparsification (zeroing out gradients that need not be communicated) techniques for S-SGD to reduce the communication traffic with negligible impact on the model convergence [2, 6, 35, 23, 36, 19]. Gradient sparsification is more aggressive than gradient quantization in reducing the communication size. For example, Top-k sparsification [1, 23] with error compensation can zero out the vast majority of local gradients without loss of accuracy, while quantization from 32-bit floating points to 1 bit yields at most a 32x size reduction. In this paper, we mainly focus on sparsification methods, while our proposed algorithm and analysis are also applicable to quantization methods.
A number of works have investigated the theoretical convergence properties of gradient sparsification schemes under different analytical assumptions [34, 32, 3, 18, 16, 19]. However, these gradient sparsification methods ignore the layer-wise structure of DNNs and treat all model parameters as a single vector to derive the convergence bounds, which implicitly requires a single-layer communication [37] at the end of each SGD iteration. Therefore, the current single-layer gradient sparsification S-SGD (denoted by SLGS-SGD hereafter) cannot overlap the gradient communications with backpropagation computations, which limits the system scaling efficiency. To tackle this challenge, we propose a new distributed optimization algorithm named LAGS-SGD, which exploits a layer-wise adaptive gradient sparsification (LAGS) scheme atop S-SGD to increase system scalability. We also derive the convergence bounds for LAGS-SGD. Our theoretical convergence results conclude that high compression ratios slow down the model convergence rate, which indicates that one should choose the compression ratios of different layers to be as low as possible. The adaptive nature of LAGS-SGD provides flexible options to choose the compression ratios according to the communication-to-computation ratios. We evaluate our proposed algorithm on various DNNs to verify the soundness of the weak analytical assumption and the convergence results. Finally, we demonstrate our system implementation of LAGS-SGD to show the wall-clock training time improvement on a 16-GPU cluster with a 10 Gbps Ethernet interconnect. The contributions of this work are summarized as follows.

We propose a new distributed optimization algorithm named LAGS-SGD with convergence guarantees. The proposed algorithm enables us to embrace the benefits of both pipelining and gradient sparsification.

We provide a thorough convergence analysis for LAGS-SGD on non-convex smooth optimization problems. The derived theoretical results indicate that LAGS-SGD has a convergence guarantee consistent with SLGS-SGD, and that it has the same order of convergence rate as S-SGD under a weak analytical assumption.

We empirically verify the analytical assumption and the convergence performance of LAGS-SGD on various deep neural networks, including CNNs and an LSTM, in a distributed setting.

We implement LAGS-SGD atop PyTorch, one of the most popular deep learning frameworks, and evaluate the training efficiency of LAGS-SGD on a 16-GPU cluster connected with 10 Gbps Ethernet. Experimental results show that LAGS-SGD outperforms S-SGD and SLGS-SGD with little impact on the model accuracy.
The rest of the paper is organized as follows. Section 2 introduces related work, and Section 3 presents preliminaries for our proposed algorithm and theoretical analysis. We propose the LAGS-SGD algorithm and provide the theoretical results in Section 4. The efficient system design for LAGS-SGD is illustrated in Section 5. Experimental results and discussions are presented in Section 6. Finally, we conclude the paper in Section 7.
2 Related Work
Many recent works have provided convergence analysis for distributed SGD with quantized or sparsified gradients, which can be biased or unbiased.
For unbiased quantized or sparsified gradients, researchers [2, 35] derived convergence guarantees for low-bit quantized gradients, where the quantization operator applied to the gradients must be unbiased to guarantee the theoretical results. For a gradient sparsification algorithm whose sparsification operator is also unbiased, Wangni et al. [34] derived similar theoretical results. However, empirical gradient sparsification methods (e.g., Top-k sparsification [23]) can be biased, which requires other analytical techniques to derive the bounds. In this paper, we also mainly focus on biased sparsification operators such as Top-k sparsification.
For biased quantized or sparsified gradients, Cordonnier [7] and Stich et al. [32] provided convergence bounds for top-k or random-k gradient sparsification algorithms, but only on convex problems. Jiang et al. [18] derived similar theoretical results, but they exploited another strong assumption that requires each worker to select the same components at each iteration, so that all components (of the model/gradient dimension) are exchanged after a certain number of iterations. Alistarh et al. [3] relaxed these strong assumptions on sparsified gradients, and further proposed an analytical assumption in which the norm of the difference between the top-k elements of the fully aggregated gradients and the aggregated results of the local top-k gradients is bounded. Though the assumption is relaxed, it is difficult to verify in real-world applications. Our convergence analysis is relatively close to the study [30], which provided a convergence analysis of biased Top-k sparsification under an easy-to-verify analytical assumption.
The above-mentioned studies, however, view all the model parameters (or gradients) as a single vector to derive the convergence bounds, while we propose a layer-wise gradient sparsification algorithm which breaks the full gradients into multiple pieces (i.e., multiple layers). Breaking a vector into pieces and selecting the top-k elements from each piece generates different results from selecting the top-k elements of the full vector, which makes the proofs of the bounds of LAGS-SGD non-trivial. Recently, [39] proposed block-wise SGD for quantized gradients, but it lacks convergence guarantees for sparsified gradients. Concurrently with our work, Dutta et al. [11] proposed layer-wise compression schemes and provided a different proof technique for the theoretical analysis.
3 Preliminaries
We consider the common setting of distributed synchronous SGD with data-parallelism on $P$ workers to minimize the non-convex objective function $f: \mathbb{R}^d \to \mathbb{R}$ by:

(1) $x_{t+1} = x_t - \eta_t \cdot \frac{1}{P}\sum_{p=1}^{P} g_t^p$,

where $x_t \in \mathbb{R}^d$ is the stacked layer-wise model parameters of the target DNN at iteration $t$, $g_t^p \in \mathbb{R}^d$ is the stochastic gradient of the DNN parameters at the $p$-th worker with locally sampled data, and $\eta_t$ is the step size (i.e., learning rate) at iteration $t$. Let $L$ denote the number of learnable layers of the DNN, and let $x^{(l)} \in \mathbb{R}^{d_l}$ denote the parameter vector of the $l$-th learnable layer with $d_l$ elements, so that

(2) $x = [x^{(1)\top}, x^{(2)\top}, \dots, x^{(L)\top}]^\top$ and $d = \sum_{l=1}^{L} d_l$.
Pipelining between communications and computations. Because the gradient computation of layer $l-1$ in the backpropagation algorithm has no dependency on the gradient aggregation of layer $l$, layer-wise communications can be pipelined with layer-wise computations [4, 38], as shown in Fig. 1(a). Some communication time can thus be overlapped with the computations so that the wall-clock iteration time is reduced. Note that the pipelining technique with full gradients has no side effect on the convergence, and it becomes very useful when the communication time is comparable to the computation time.
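The overlap described above can be sketched as a producer/consumer pattern: backpropagation produces per-layer gradients from the last layer down, while a separate thread consumes and communicates them. This is a minimal illustrative sketch, not the paper's implementation; `compute_grad` and `all_reduce` are hypothetical stand-ins for per-layer backpropagation and gradient aggregation.

```python
import queue
import threading

def train_iteration(num_layers, compute_grad, all_reduce):
    """Pipeline layer-wise gradient communication with backpropagation.

    Backprop runs from the last layer (num_layers - 1) down to layer 0;
    as soon as a layer's gradient is ready it is queued for communication,
    so sending layer l overlaps with computing layers l - 1, ..., 0.
    """
    pending = queue.Queue()
    communicated = []

    def comm_worker():
        while True:
            layer = pending.get()
            if layer is None:            # sentinel: backprop finished
                return
            all_reduce(layer)            # e.g., aggregate layer l's gradients
            communicated.append(layer)

    worker = threading.Thread(target=comm_worker)
    worker.start()
    for layer in reversed(range(num_layers)):
        compute_grad(layer)              # backprop for this layer
        pending.put(layer)               # start its communication immediately
    pending.put(None)
    worker.join()
    return communicated
```

The communication thread processes layers in the order they are produced, so with full (dense) gradients this changes nothing about the computed update, only the wall-clock schedule.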
Top-k sparsification. Among gradient sparsification methods, Top-k sparsification with error compensation [1, 23, 29] is promising in reducing communication traffic for distributed training, and its convergence property has been empirically verified [1, 23] and theoretically proved [3, 18, 32] under some assumptions. The model update formula of Top-k S-SGD can be represented by

(3) $x_{t+1} = x_t - \eta_t \cdot \frac{1}{P}\sum_{p=1}^{P} \mathrm{TopK}(g_t^p + \epsilon_t^p)$,

where $\mathrm{TopK}(g_t^p + \epsilon_t^p)$ is the selected top-$k$ gradients at worker $p$ and $\epsilon_t^p$ is the local error-compensation residual. For any vector $v \in \mathbb{R}^d$ and a given $k$, $\mathrm{TopK}(v) \in \mathbb{R}^d$ and its $i$-th ($1 \le i \le d$) element is:

(4) $(\mathrm{TopK}(v))_i = \begin{cases} v_i, & \text{if } |v_i| \ge \mathrm{thr}_k(v), \\ 0, & \text{otherwise}, \end{cases}$

where $v_i$ is the $i$-th element of $v$ and $\mathrm{thr}_k(v)$ is the $k$-th largest value of $|v|$. As shown in Fig. 1(b), in each iteration, at the end of the backpropagation pass, each worker selects the top-$k$ gradients from its whole set of $d$ gradients. The selected gradients are exchanged among all workers in the decentralized architecture, or sent to the parameter server in the centralized architecture, for averaging.
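The top-k operator of Eq. (4) together with error compensation can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's code; `topk_sparsify` is a hypothetical helper name.

```python
import numpy as np

def topk_sparsify(v, k):
    """Keep the k largest-magnitude elements of v; return (sparse, residual)."""
    idx = np.argpartition(np.abs(v), -k)[-k:]   # indices of the top-k |v_i|
    sparse = np.zeros_like(v)
    sparse[idx] = v[idx]
    return sparse, v - sparse                   # residual = what was dropped

# Error compensation: the residual left behind at one iteration is added
# to the next iteration's gradient before selecting the top-k again, so
# no gradient information is permanently discarded.
rng = np.random.default_rng(0)
residual = np.zeros(10)
for step in range(3):
    grad = rng.standard_normal(10)              # stand-in stochastic gradient
    accumulated = grad + residual               # compensate previous error
    sparse, residual = topk_sparsify(accumulated, k=2)
```

By construction `sparse + residual` always equals the compensated gradient, which is the invariant the convergence analyses of error-compensated sparsification rely on.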
4 Layer-Wise Adaptive Gradient Sparsification
4.1 Algorithm
To enjoy the benefits of both the pipelining technique and the gradient sparsification technique, we propose the LAGS-SGD algorithm, which exploits a layer-wise adaptive gradient sparsification (LAGS) scheme atop S-SGD.
In LAGS-SGD, we apply gradient sparsification with error compensation on each layer separately. Instead of selecting the top-$k$ values from all $d$ gradients to be communicated, each worker selects the top-$k_l$ gradients from layer $l$, so that it does not need to wait for the completion of the backpropagation pass before communicating the sparsified gradients. LAGS-SGD not only significantly reduces the communication traffic (hence the communication time) through gradient sparsification, but also makes use of the layered structure of DNNs to overlap communications with computations. As shown in Fig. 1(c), at each iteration, once the gradients of layer $l$ have been calculated, the top-$k_l$ of them are selected and exchanged among workers immediately. Formally, let $x_t^{(l)}$ denote the model parameters of layer $l$ and $\epsilon_t^{p,(l)}$ denote the local gradient residual of worker $p$ at iteration $t$. In LAGS-SGD on $P$ distributed workers, the update formula of the $l$-th layer's parameters becomes

(5) $x_{t+1}^{(l)} = x_t^{(l)} - \eta_t \cdot \frac{1}{P}\sum_{p=1}^{P} \mathrm{TopK}_{k_l}\big(g_t^{p,(l)} + \epsilon_t^{p,(l)}\big), \quad \epsilon_{t+1}^{p,(l)} = g_t^{p,(l)} + \epsilon_t^{p,(l)} - \mathrm{TopK}_{k_l}\big(g_t^{p,(l)} + \epsilon_t^{p,(l)}\big)$.
The pseudo-code of LAGS-SGD is shown in Algorithm 1.
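The layer-wise error-compensated update (5) can be simulated in a single process as follows. This is a minimal numpy sketch under our reading of the update rule, not the paper's Algorithm 1; `lags_sgd_step` and its argument layout are hypothetical.

```python
import numpy as np

def lags_sgd_step(params, grads, residuals, ks, lr):
    """One simulated LAGS-SGD iteration over P workers (illustrative sketch).

    params[l]       -- parameter vector of layer l (shared across workers)
    grads[p][l]     -- stochastic gradient of layer l on worker p
    residuals[p][l] -- error-compensation residual of worker p for layer l
    ks[l]           -- number of gradient entries kept for layer l
    """
    P = len(grads)
    for l in range(len(params)):
        agg = np.zeros_like(params[l])
        for p in range(P):
            acc = grads[p][l] + residuals[p][l]        # error compensation
            idx = np.argpartition(np.abs(acc), -ks[l])[-ks[l]:]
            sparse = np.zeros_like(acc)
            sparse[idx] = acc[idx]                     # local top-k_l of layer l
            residuals[p][l] = acc - sparse             # keep the rest locally
            agg += sparse                              # aggregate sparse updates
        params[l] -= lr * agg / P                      # averaged sparse update
    return params
```

In a real distributed run the inner aggregation over workers would be a sparse AllGather/AllReduce issued per layer, which is exactly what enables the overlap with backpropagation.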
4.2 Convergence Analysis
We first introduce some notations and assumptions for our convergence analysis, and then present the theoretical results on the convergence properties of LAGS-SGD.
Notations and Assumptions
Let $\|\cdot\|$ denote the $\ell_2$ norm. We mainly consider the case where the non-convex objective function $f$ is $L_f$-smooth, i.e.,

(6) $\|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\|, \quad \forall x, y \in \mathbb{R}^d$.

Let $f^*$ denote the optimal value of the objective function $f$. We assume that the sampled stochastic gradients are unbiased, i.e., $\mathbb{E}[g_t^p] = \nabla f(x_t)$. We also assume that the second moment of the average of the stochastic gradients has the following bound:

(7) $\mathbb{E}\Big\|\frac{1}{P}\sum_{p=1}^{P} g_t^p\Big\|^2 \le M^2$.
We make an analytical assumption on the aggregated results from the distributed sparsified vectors.
Assumption 1.
For any $P$ vectors $v_p \in \mathbb{R}^d$ ($p = 1, \dots, P$) on the $P$ workers, each vector is sparsified as $\mathrm{TopK}(v_p)$ locally. The aggregation of the local top-$k$ selections, $\sum_{p=1}^{P}\mathrm{TopK}(v_p)$, selects larger values than randomly selecting $k$ values from the accumulated vector, i.e.,

(8) $\mathbb{E}\Big\|\sum_{p=1}^{P}\mathrm{TopK}(v_p)\Big\|^2 \ge \mathbb{E}\Big\|\mathrm{RandK}\Big(\sum_{p=1}^{P} v_p\Big)\Big\|^2$,

where $\mathrm{RandK}(u) \in \mathbb{R}^d$ is a vector whose $k$ elements are randomly selected from $u$ following a uniform distribution, and whose other $d - k$ elements are zeros.
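Under our reading of inequality (8), the assumption can be checked numerically: the expected squared norm of a uniformly random k-out-of-d selection of a vector $u$ is $(k/d)\|u\|^2$, so one compares the squared norm captured by the aggregated local top-k selections against that quantity. The sketch below uses synthetic Gaussian vectors, not the paper's measured gradients.

```python
import numpy as np

def topk(v, k):
    """Zero all but the k largest-magnitude entries of v."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

# Synthetic check: P workers, d-dimensional Gaussian "gradients".
rng = np.random.default_rng(0)
P, d, k = 8, 1000, 50
vs = [rng.standard_normal(d) for _ in range(P)]

# Squared norm kept by aggregating local top-k selections ...
lhs = np.linalg.norm(sum(topk(v, k) for v in vs)) ** 2
# ... versus the expectation of a uniformly random k-of-d selection
# applied to the accumulated vector: (k/d) * ||sum_p v_p||^2.
rhs = (k / d) * np.linalg.norm(sum(vs)) ** 2
```

On such synthetic data the top-k aggregate captures far more squared norm than random selection, which is the intuition the assumption encodes; Section 6.2 verifies it on real training runs.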
Main Results
Here we present the major lemmas and theorems that prove the convergence of LAGS-SGD. Our results are mainly derivations of the standard bound in non-convex settings [5], i.e.,

(11) $\min_{t \in \{1, \dots, T\}} \mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{C}{\sqrt{T}}$

for some problem-dependent constant $C$ and number of iterations $T$.
Lemma 1.
For any $P$ vectors $v_p \in \mathbb{R}^d$, where each vector can be broken into $L$ pieces such that $v_p = [v_p^{(1)\top}, \dots, v_p^{(L)\top}]^\top$ with $v_p^{(l)} \in \mathbb{R}^{d_l}$ and $d = \sum_{l=1}^{L} d_l$, it holds that
(12) 
where , for , and .
Proof.
Corollary 1.
For any iteration and :
(13) 
Proof.
Let , and We have and . According to the update formulas of and , we have
where . Iterating the above inequality from yields:
Taking the expectation and using the second-moment bound on the stochastic gradients, we obtain
which concludes the proof. ∎
Corollary 1 implies that the parameters updated with sparsified layer-wise gradients stay within a bounded distance of those updated with dense gradients.
Theorem 1.
Under the assumptions on the objective function $f$, after running $T$ iterations of Algorithm 1, we have
(14) 
if one chooses a step size schedule such that

(15)

holds at any iteration $t$.
Proof.
We use the smoothness of $f$ and Corollary 1 to derive (14). First, by the smoothness of $f$, we have
Taking the expectation with respect to the sampling at iteration $t$ yields
Taking the expectation with respect to the gradients before iteration $t$ yields
Using Corollary 1, we obtain
If (15) holds, then we have
Rearranging the terms, we obtain
(16) 
We further apply the property of , that is
Together with (16), it yields
Summing up the inequality over $t = 1, \dots, T$ yields
Multiplying both sides by the normalizing factor concludes the proof. ∎
Theorem 1 indicates that if one chooses the step sizes to satisfy (15), then the right-hand side of (14) converges as $T \to \infty$, so Algorithm 1 is guaranteed to converge. The condition (15) is easily satisfied and holds for both constant and diminishing step sizes. Therefore, if the step sizes are further configured as

(17)

then the right-hand side of inequality (14) converges to zero, which ensures the convergence of Algorithm 1.
Corollary 2.
Proof.
In Corollary 2, if $T$ is large enough, the right-hand side of inequality (18) is dominated by the first term. This implies that Algorithm 1 has a convergence rate of $O(1/\sqrt{T})$, the same as vanilla SGD [9]. However, the second term of inequality (18) also indicates that higher compression ratios lead to a larger bound on the convergence rate. In real-world settings, one may have a fixed iteration budget to train the model, so high compression ratios could slow down the convergence. On the one hand, if we choose lower compression ratios, the algorithm has a faster convergence rate (fewer iterations). On the other hand, lower compression ratios imply a larger communication size and thus may result in a longer wall-clock time per iteration. The adaptive selection of compression ratios addresses this trade-off.
5 System Implementation and Optimization
The layer-wise sparsification nature of LAGS-SGD enables the pipelining technique to hide communication overheads, but an efficient system implementation of communication/computation parallelism with gradient sparsification is non-trivial for three reasons: 1) layer-wise communication of sparsified gradients implies many small messages traveling across the network, while collectives (e.g., AllReduce) with small messages are latency-sensitive; 2) gradient sparsification (especially top-k selection on GPUs) introduces extra computation time; 3) the convergence rate of LAGS-SGD is negatively affected by the compression ratio, so one should choose proper compression ratios to trade off the number of iterations to converge against the per-iteration wall-clock time.
First, to address the first problem, we exploit a heuristic method that merges extremely small sparsified tensors into a single one for efficient communication. Specifically, we use a memory buffer to temporarily store the sparsified gradients, and aggregate the buffered gradients once the buffer becomes full or the gradients of the first layer have been calculated. Second, we implement the double-sampling method [23] to approximately select the top-$k_l$ gradients, which significantly reduces the top-k selection time on GPUs. Finally, to balance the convergence rate and the training wall-clock time, we propose to select the layer-wise compression ratio according to the communication-to-computation ratio. To be specific, we select a compression ratio $\rho_l$ for layer $l$ such that its communication overhead is appropriately hidden by the computation. Given an upper bound on the compression ratio, the algorithm determines $\rho_l$ according to the following three metrics: 1) the backpropagation computation time of the layers that can be pipelined; 2) the communication time of the current layer under a specific compression ratio $\rho_l$, which can be predicted using the communication model of the AllGather or AllReduce collectives (e.g., [22, 25]) according to the size of the gradients and the interconnect (e.g., latency and bandwidth) between workers; and 3) the extra overhead introduced by the sparsification operator, which generally includes a pair of operations (compression and decompression). Therefore, the selected value of $\rho_l$ can be generalized as
(19) 
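A possible realization of this selection rule is sketched below under a simple timing model: `comm_time(rho)` is a user-supplied communication-cost function (e.g., latency plus `(1 - rho) * size / bandwidth`), and a lower compression ratio means more data is sent. All names here are hypothetical, not the paper's implementation.

```python
def select_compression_ratio(t_overlap, comm_time, t_sparse, candidates, rho_max):
    """Return the smallest compression ratio in `candidates` whose predicted
    communication time, plus the sparsification overhead t_sparse, fits in
    the overlappable backprop time t_overlap.

    Lower ratios send more data (faster convergence, longer communication),
    so we try them first; if nothing fits, fall back to the cap rho_max.
    """
    for rho in sorted(candidates):          # try the least compression first
        if rho <= rho_max and comm_time(rho) + t_sparse <= t_overlap:
            return rho
    return rho_max
```

For example, with `comm_time = lambda rho: 2.0 + (1.0 - rho) * 20.0` and 10 units of overlappable backprop time, the lowest candidate ratio already fits and is selected, favoring convergence speed.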
5.1 Bound of Pipelining Speedup
In LAGS-SGD, the sparsification technique reduces the overall communication time, and the pipelining technique further overlaps the already-reduced communication time with computation time. We can analyze the optimal speedup of LAGS-SGD over SLGS-SGD in terms of wall-clock time under the same compression ratios. Let $t_f$, $t_b$ and $t_c$ denote the forward computation, backward computation and gradient communication time of each iteration, respectively. We assume that the sparsification overhead can be ignored since we use the efficient sampling method. Compared to SLGS-SGD, LAGS-SGD reduces the wall-clock time by pipelining the communications with computations, and the maximum overlapped time is $\min(t_b, t_c)$ (i.e., either the backpropagation computations or the communications are completely overlapped). So the maximum speedup of LAGS-SGD over SLGS-SGD is $(t_f + t_b + t_c)/(t_f + t_b + t_c - \min(t_b, t_c))$. Let $r = t_c / t_b$ denote the communication-to-computation ratio. The ideal speedup can be represented by
(20) $S = \frac{t_f + t_b + t_c}{t_f + \max(t_b, t_c)}$
The equation shows that the maximum speedup of LAGS-SGD over SLGS-SGD mainly depends on the communication-to-computation ratio. If $r$ is close to $1$, then LAGS-SGD has the potential to achieve the highest speedup by completely hiding either the backpropagation computation or the communication time.
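The bound above is easy to evaluate numerically; the helper below is an illustrative sketch consistent with the derivation in the text (pipelining hides at most $\min(t_b, t_c)$, so $t_b + t_c - \min(t_b, t_c) = \max(t_b, t_c)$).

```python
def ideal_speedup(t_f, t_b, t_c):
    """Upper bound on the speedup of LAGS-SGD over SLGS-SGD: pipelining
    hides at most min(t_b, t_c), so the per-iteration time drops from
    t_f + t_b + t_c to t_f + max(t_b, t_c)."""
    return (t_f + t_b + t_c) / (t_f + max(t_b, t_c))
```

When the forward time is negligible and communication exactly matches backprop (r = 1), the bound reaches its maximum of 2x; a small or large r pushes the bound back toward 1x.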
6 Experiments
6.1 Experimental Settings
We conduct experiments similar to those of [23], covering two types of applications with three data sets: 1) image classification with convolutional neural networks (CNNs), namely ResNet-20 [14] and VGG-16 [31] on the Cifar-10 [20] data set and Inception-v4 [33] and ResNet-50 [14] on the ImageNet [10] data set; and 2) language modeling with a 2-layer LSTM model (LSTM-PTB) with 1500 hidden units per layer on the PTB [24] data set. On Cifar-10, the per-worker batch size is 32 and the base learning rate is 0.1; on ImageNet, the per-worker batch size is also 32 and the learning rate is 0.01; on PTB, the batch size and learning rate are 20 and 22, respectively. The compression ratios are set separately for the CNNs and the LSTM. In all compared algorithms, the hyper-parameters are kept the same, and the experiments are conducted on a 16-GPU cluster.
6.2 Verification of Assumption 1 and Convergences
To show the soundness of Assumption 1 and the convergence results, we conduct experiments with 16 workers to train the models. We define a metric $\delta_t^{(l)}$ for each learnable layer $l$ at each iteration $t$ of Algorithm 1 with

(21)

Assumption 1 holds if $\delta_t^{(l)} \ge 0$. We measure $\delta_t^{(l)}$ on ResNet-20, VGG-16 and LSTM-PTB during training, and the results are shown in Fig. 2. It is seen that $\delta_t^{(l)} \ge 0$ throughout the training process, which implies that Assumption 1 holds. The evaluated models all converge within a certain number of epochs, which verifies the convergence of LAGS-SGD.
6.3 Comparison of Convergence Rates
The convergence comparison under the same number of training epochs is shown in Fig. 3. The top-1 validation accuracy (the higher the better) on the CNNs and the validation perplexity (the lower the better) on the LSTM show that LAGS-SGD has very close convergence performance to SLGS-SGD. Compared to Dense-SGD, SLGS-SGD and LAGS-SGD both have slight accuracy losses; this could be resolved by training tricks such as momentum correction [23]. The final evaluation results are shown in Table 1. The nearly consistent convergence performance between LAGS-SGD and Dense-SGD verifies our theoretical results on the convergence rate.
Model  Dense-SGD  SLGS-SGD  LAGS-SGD

ResNet-20
VGG-16
ResNet-50
LSTM-PTB
6.4 Wallclock Time Performance and Discussions
We evaluate the average iteration time with CNNs including VGG-16-I (VGG-16 [31] for ImageNet), ResNet-50 and Inception-v4 on the large-scale ImageNet data set (over one million training images) on a 16-GPU cluster (four nodes, each with four Nvidia Tesla V100 PCIe 32GB GPUs) connected with 10 Gbps Ethernet (10GbE). The servers in the cluster are equipped with dual Intel Xeon E5-2698 v3 CPUs, Ubuntu 16.04 and CUDA 10.0. The main libraries used in our experiments are PyTorch v1.1, OpenMPI v4.0.0, Horovod v0.18.1 and NCCL v2.3.7. The experimental results, obtained with a fixed compression ratio for gradient sparsification, are shown in Table 2. They demonstrate that LAGS-SGD runs faster than SLGS-SGD, approaching the maximum speedup of Eq. (20), and that it also achieves an improvement over Dense-SGD.
Model  Dense  SLGS  LAGS

VGG-16-I
ResNet-50
Inception-v4
The achieved speedups of LAGS-SGD over SLGS-SGD in end-to-end training wall-clock time are minor, for three main reasons. First, as shown in Eq. (20), the improvement of LAGS-SGD over SLGS-SGD depends heavily on the communication-to-computation ratio $r$. In the conducted experiments, $r$ is small because transferring highly sparsified data over 10GbE is much faster than the computation on Nvidia Tesla V100 GPUs; the proposed method has greater potential improvement as $r$ increases, e.g., on lower-bandwidth networks. Second, the compression time is not negligible compared to the communication time. Even though we exploit the sampling method [23] to select the top-$k$ gradients, it is inefficient on GPUs and thus enlarges the computation time, which makes