Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

Abstract

To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, pipelining and gradient sparsification, have been separately proposed to alleviate the impact of communication overheads. Yet, existing gradient sparsification methods can only initiate the communication after the whole backpropagation pass, and hence miss the pipelining opportunity. In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, every worker selects a small set of "significant" gradients from each layer independently, whose size can be adapted to the communication-to-computation ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity of overlapping communications with computations, while its adaptive nature makes it flexible in controlling the communication time. We prove that LAGS-SGD has convergence guarantees and that it has the same order of convergence rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments are conducted to verify the analytical assumption and the convergence performance of LAGS-SGD. Experimental results on a 16-GPU cluster show that LAGS-SGD outperforms the original S-SGD and existing sparsified S-SGD without obvious loss of model accuracy.

1 Introduction

With increasing data volumes and model sizes of deep neural networks (DNNs), distributed training is commonly adopted to accelerate the training process among multiple workers. Current distributed stochastic gradient descent (SGD) approaches can be categorized into three types: synchronous [9, 21], asynchronous [40] and stale synchronous [15]. Synchronous SGD (S-SGD) with data-parallelism is the most widely used one in distributed deep learning due to its good convergence properties [8, 12]. However, S-SGD requires iterative synchronization and communication of dense gradient/parameter aggregation among all the workers. Compared to the computing speed of modern accelerators (e.g., GPUs and TPUs), the network speed is usually slow, which makes communication a potential system bottleneck. Even worse, the communication time usually grows with the size of the cluster [37]. Many recent studies focus on alleviating the impact of communications in S-SGD to improve the system scalability. These studies include system-level methods and algorithm-level methods.

On the system level, pipelining [4, 38, 12, 28, 22, 13, 17, 27] is used to overlap the communications with the computations by exploiting the layer-wise structure of backpropagation during the training process of deep models. On the algorithmic level, researchers have proposed gradient quantization (fewer bits per number) and sparsification (zeroing out gradients that are not necessary to be communicated) techniques for S-SGD to reduce the communication traffic with negligible impact on the model convergence [2, 6, 35, 23, 36, 19]. Gradient sparsification is more aggressive than gradient quantization in reducing the communication size. For example, Top-$k$ sparsification [1, 23] with error compensation can zero out the vast majority of local gradients without loss of accuracy, while quantization from 32-bit floating-point numbers to 1-bit achieves at most a 32$\times$ size reduction. In this paper, we mainly focus on the sparsification methods, while our proposed algorithm and analysis are also applicable to the quantization methods.

A number of works have investigated the theoretical convergence properties of gradient sparsification schemes under different analytical assumptions [34, 32, 3, 18, 16, 19]. However, these gradient sparsification methods ignore the layer-wise structure of DNNs and treat all model parameters as a single vector to derive the convergence bounds, which implicitly requires a single-layer communication [37] at the end of each SGD iteration. Therefore, the current gradient sparsification S-SGD (denoted by SLGS-SGD hereafter) cannot overlap the gradient communications with backpropagation computations, which limits the system scaling efficiency. To tackle this challenge, we propose a new distributed optimization algorithm named LAGS-SGD, which exploits a layer-wise adaptive gradient sparsification (LAGS) scheme atop S-SGD to increase the system scalability. We also derive the convergence bounds for LAGS-SGD. Our theoretical results on LAGS-SGD conclude that high compression ratios would slow down the model convergence rate, which indicates that one should choose the compression ratios for different layers as low as possible. The adaptive nature of LAGS-SGD provides flexible options to choose the compression ratios according to the communication-to-computation ratios. We evaluate our proposed algorithm on various DNNs to verify the soundness of the weak analytical assumption and the convergence results. Finally, we demonstrate our system implementation of LAGS-SGD to show the wall-clock training time improvement on a 16-GPU cluster with a 10Gbps Ethernet interconnect. The contributions of this work are summarized as follows.

  • We propose a new distributed optimization algorithm named LAGS-SGD with convergence guarantees. The proposed algorithm enables us to embrace the benefits of both pipelining and gradient sparsification.

  • We provide a thorough convergence analysis for LAGS-SGD on non-convex smooth optimization problems. The derived theoretical results indicate that LAGS-SGD has a consistent convergence guarantee with SLGS-SGD, and it has the same order of convergence rate as S-SGD under a weak analytical assumption.

  • We empirically verify the analytical assumption and the convergence performance of LAGS-SGD on various deep neural networks including CNNs and LSTM in a distributed setting.

  • We implement LAGS-SGD atop PyTorch, one of the most popular deep learning frameworks, and evaluate the training efficiency of LAGS-SGD on a 16-GPU cluster connected with 10Gbps Ethernet. Experimental results show that LAGS-SGD outperforms S-SGD and SLGS-SGD on the 16-GPU cluster with little impact on the model accuracy.

The rest of the paper is organized as follows. Section 2 introduces some related work, and Section 3 presents preliminaries for our proposed algorithm and theoretical analysis. We propose the LAGS-SGD algorithm and provide the theoretical results in Section 4. The efficient system design for LAGS-SGD is illustrated in Section 5. Experimental results and discussions are presented in Section 6. Finally, we conclude the paper in Section 7.

2 Related Work

Many recent works have provided convergence analysis for distributed SGD with quantized or sparsified gradients, which can be biased or unbiased.

For unbiased quantized or sparsified gradients, researchers [2, 35] derived convergence guarantees for low-bit quantized gradients, provided that the quantization operator applied to the gradients is unbiased. For gradient sparsification with an unbiased sparsification operator, Wangni et al. [34] derived similar theoretical results. However, practical gradient sparsification methods (e.g., Top-$k$ sparsification [23]) are generally biased, which requires other analytical techniques to derive the bounds. In this paper, we mainly focus on biased sparsification operators such as Top-$k$ sparsification.

For biased quantized or sparsified gradients, Cordonnier [7] and Stich et al. [32] provided convergence bounds for Top-$k$ and rand-$k$ gradient sparsification algorithms on convex problems only. Jiang et al. [18] derived similar theoretical results, but they exploited another strong assumption that requires each worker to select the same components at each iteration so that all $d$ (the dimension of the model/gradient) components are exchanged after a certain number of iterations. Alistarh et al. [3] relaxed these strong assumptions on sparsified gradients, and further proposed an analytical assumption in which the $\ell_2$-norm of the difference between the Top-$k$ elements of the fully aggregated gradients and the aggregation of the locally selected Top-$k$ gradients is bounded. Though the assumption is relaxed, it is difficult to verify in real-world applications. Our convergence analysis is relatively close to the study [30], which provided a convergence analysis of the biased Top-$k$ sparsification with an easy-to-verify analytical assumption.

The above-mentioned studies, however, view all the model parameters (or gradients) as a single vector to derive the convergence bounds, while we propose a layer-wise gradient sparsification algorithm that breaks the full gradient vector into multiple pieces (i.e., multiple layers). It is obvious that breaking a vector into pieces and selecting the top-$k^{(l)}$ elements from each piece generates results different from selecting the top-$k$ elements of the full vector, which makes the proofs of the bounds of LAGS-SGD non-trivial. Recently, [39] proposed a blockwise SGD for quantized gradients, but it lacks convergence guarantees for sparsified gradients. Concurrently with our work, Dutta et al. [11] proposed layer-wise compression schemes and provided a different way of proving the theoretical analysis.

3 Preliminaries

We consider the common settings of distributed synchronous SGD with data-parallelism on $P$ workers to minimize the non-convex objective function $f: \mathbb{R}^d \to \mathbb{R}$ by

$x_{t+1} = x_t - \eta_t \cdot \frac{1}{P}\sum_{p=1}^{P} g_t^p,$ (1)

where $x_t \in \mathbb{R}^d$ is the stacked layer-wise model parameters of the target DNN at iteration $t$, $g_t^p$ is the stochastic gradient of the DNN parameters computed at worker $p$ with its locally sampled data, and $\eta_t$ is the step size (i.e., learning rate) at iteration $t$. Let $L$ denote the number of learnable layers of the DNN, and $x^{(l)}$ denote the parameter vector of the $l$-th learnable layer with $d^{(l)}$ elements. Thus, the model parameter can be represented by the concatenation of layer-wise parameters. Using $\oplus$ as the concatenation operator, the stacked vector can be represented by

$x = x^{(1)} \oplus x^{(2)} \oplus \cdots \oplus x^{(L)}.$ (2)

Pipelining between communications and computations. Due to the fact that the gradient computation of layer $l-1$ using the backpropagation algorithm has no dependency on the gradient aggregation of layer $l$, the layer-wise communications can be pipelined with the layer-wise computations [4, 38] as shown in Fig. 1(a). It can be seen that some communication time can be overlapped with the computations so that the wall-clock iteration time is reduced. Note that the pipelining technique with full gradients has no side-effect on the convergence, and it becomes very useful when the communication time is comparable to the computing time.
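To make the pipelining idea concrete, the following sketch (a simplified illustration only, assuming torch.distributed has already been initialized with an NCCL or Gloo backend; the function and variable names are ours, not the paper's implementation) registers a gradient hook per parameter and launches a non-blocking all-reduce as soon as each layer's gradient is produced during loss.backward(), so that the dense-gradient communication overlaps with the remaining backpropagation:

    import torch
    import torch.distributed as dist

    def backward_with_overlap(model, loss, world_size):
        # Register a hook per parameter so that the all-reduce of layer l starts
        # as soon as its gradient is produced, overlapping with the ongoing
        # backpropagation of layers l-1, l-2, ...
        handles = []

        def make_hook(param):
            def hook(grad):
                buf = grad.detach().clone()  # communicate a copy to avoid racing with autograd
                work = dist.all_reduce(buf, op=dist.ReduceOp.SUM, async_op=True)
                handles.append((param, buf, work))
                return grad
            return hook

        hook_handles = [p.register_hook(make_hook(p))
                        for p in model.parameters() if p.requires_grad]

        loss.backward()  # hooks fire from the last layer to the first

        for param, buf, work in handles:
            work.wait()                        # communication finishes here at the latest
            param.grad = buf.div_(world_size)  # use the averaged dense gradient

        for h in hook_handles:
            h.remove()  # hooks are one-shot in this sketch

This corresponds to the dense pipelining of Fig. 1(a); production implementations (e.g., Horovod or PyTorch DDP) additionally bucket small tensors before communicating them.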

Top-$k$ sparsification. Among the gradient sparsification methods, Top-$k$ sparsification with error compensation [1, 23, 29] is promising in reducing the communication traffic for distributed training, and its convergence property has been empirically [1, 23] verified and theoretically [3, 18, 32] proved under some assumptions. The model update formula of Top-$k$ S-SGD can be represented by

$x_{t+1} = x_t - \frac{1}{P}\sum_{p=1}^{P} \mathrm{Top}_k\big(\epsilon_t^{p} + \eta_t g_t^{p}\big),$ (3)

where $\mathrm{Top}_k(\epsilon_t^{p} + \eta_t g_t^{p})$ is the sparsified update selected at worker $p$ with error compensation, and $\epsilon_t^{p}$ is the local residual that accumulates the unselected values. For any vector $u \in \mathbb{R}^d$ and a given $k \le d$, $\mathrm{Top}_k(u) \in \mathbb{R}^d$ and its $i$-th ($1 \le i \le d$) element is

$(\mathrm{Top}_k(u))_i = u_i$ if $|u_i| \ge thr$, and $0$ otherwise, (4)

where $u_i$ is the $i$-th element of $u$ and $thr$ is the $k$-th largest value of $|u|$. As shown in Fig. 1(b), in each iteration, at the end of the backpropagation pass, each worker selects the top-$k$ gradients from its whole set of gradients. The selected gradients are exchanged with all other workers in the decentralized architecture or sent to the parameter server in the centralized architecture for averaging.
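As a reference for Eqs. (3)-(4), below is a minimal PyTorch sketch of the Top-$k$ operator and the error-compensation bookkeeping (the helper names top_k and sparsify_with_residual are ours and only illustrate the technique):

    import torch

    def top_k(x, k):
        # Keep the k largest-magnitude elements of x and zero out the rest (Eq. (4)).
        flat = x.flatten()
        _, idx = torch.topk(flat.abs(), k)
        out = torch.zeros_like(flat)
        out[idx] = flat[idx]
        return out.view_as(x)

    def sparsify_with_residual(update, residual, k):
        # Error-compensated Top-k selection: the unselected values are kept in the
        # local residual and re-used at the next iteration.
        acc = residual + update            # compensate with the previous residual
        selected = top_k(acc, k)
        new_residual = acc - selected      # store what was not transmitted
        return selected, new_residual

In practice only the $k$ selected values and their indices are transmitted; the dense masked tensor is kept here for clarity.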

Figure 1: Comparison between three distributed training algorithms: (a) the pipeline of layer-wise gradient communications and backpropagation computations without gradient sparsification (Dense-SGD), (b) the single-layer gradient sparsification (SLGS) without pipelining, and (c) our proposed layer-wise adaptive gradient sparsification (LAGS) with pipelining.

4 Layer-Wise Adaptive Gradient Sparsification

4.1 Algorithm

To enjoy the benefits of the pipelining technique and the gradient sparsification technique, we propose the LAGS-SGD algorithm, which exploits a layer-wise adaptive gradient sparsification (LAGS) scheme atop S-SGD.

In LAGS-SGD, we apply gradient sparsification with error compensation to each layer separately. Instead of selecting the top-$k$ values from all gradients to be communicated, each worker selects the top $k^{(l)}$ gradients from layer $l$, so that it does not need to wait for the completion of the backpropagation pass before communicating the sparsified gradients. LAGS-SGD not only significantly reduces the communication traffic (hence the communication time) using gradient sparsification, but also makes use of the layered structure of DNNs to overlap the communications with the computations. As shown in Fig. 1(c), at each iteration, after the gradients of layer $l$ have been calculated, the sparsified gradient of layer $l$ is selected and exchanged among the workers immediately. Formally, let $x_t^{(l)}$ denote the model parameters of layer $l$ and $\epsilon_t^{p,(l)}$ denote the local gradient residual of layer $l$ at worker $p$ at iteration $t$. In LAGS-SGD on $P$ distributed workers, the update formula of the $l$-th layer's parameters becomes

$x_{t+1}^{(l)} = x_t^{(l)} - \frac{1}{P}\sum_{p=1}^{P} \mathrm{Top}_{k^{(l)}}\big(\epsilon_t^{p,(l)} + \eta_t g_t^{p,(l)}\big), \quad \epsilon_{t+1}^{p,(l)} = \epsilon_t^{p,(l)} + \eta_t g_t^{p,(l)} - \mathrm{Top}_{k^{(l)}}\big(\epsilon_t^{p,(l)} + \eta_t g_t^{p,(l)}\big).$ (5)

The pseudo-code of LAGS-SGD is shown in Algorithm 1.

Input: Stochastic gradients $g_t^{p,(l)}$ at worker $p$
Input: Configured numbers of selected gradients $k^{(1)}, \ldots, k^{(L)}$
Input: Configured learning rates $\eta_0, \eta_1, \ldots$

1:for $l = 1 \to L$ do
2:     Initialize $\epsilon_0^{p,(l)} = 0$;
3:for $t = 0, 1, 2, \ldots$ do
4:     Feed-forward computation;
5:     for $l = L \to 1$ do
6:          Backpropagation: compute $g_t^{p,(l)}$;
7:          $acc_t^{p,(l)} = \epsilon_t^{p,(l)} + \eta_t g_t^{p,(l)}$;
8:          $\epsilon_{t+1}^{p,(l)} = acc_t^{p,(l)} - \mathrm{Top}_{k^{(l)}}(acc_t^{p,(l)})$;
9:          $x_{t+1}^{(l)} = x_t^{(l)} - \frac{1}{P}\sum_{p=1}^{P}\mathrm{Top}_{k^{(l)}}(acc_t^{p,(l)})$; // non-blocking aggregation overlapped with line 6 of layer $l-1$
Algorithm 1 LAGS-SGD at worker $p$
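The sketch below illustrates one layer-wise update of Algorithm 1 following Eq. (5). It is a hedged, single-process simulation of $P$ workers that reuses the hypothetical top_k helper defined earlier and aggregates dense masked tensors instead of transmitting (value, index) pairs:

    import torch

    def lags_layer_update(x_l, grads_l, residuals_l, k_l, lr):
        # One LAGS-SGD update of a single layer (Eq. (5)).
        # x_l:         parameter tensor of layer l
        # grads_l:     list of P stochastic gradients of layer l (one per worker)
        # residuals_l: list of P local residual tensors of layer l
        # k_l:         number of gradient elements each worker communicates for layer l
        # lr:          learning rate eta_t
        P = len(grads_l)
        aggregated = torch.zeros_like(x_l)
        for p in range(P):
            acc = residuals_l[p] + lr * grads_l[p]  # error compensation
            selected = top_k(acc, k_l)              # layer-wise Top-k selection
            residuals_l[p] = acc - selected         # keep the unselected part locally
            aggregated += selected                  # in a real run: communicated among workers
        x_l -= aggregated / P                       # apply the averaged sparsified update
        return x_l

In a real distributed run, the selected values and indices of layer $l$ are exchanged with an AllGather or AllReduce collective while the backpropagation of layer $l-1$ proceeds.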

4.2 Convergence Analysis

We first introduce some notations and assumptions for our convergence analysis, and then present the theoretical results of the convergence properties of LAGS-SGD.

Notations and Assumptions

Let $\|\cdot\|$ denote the $\ell_2$-norm. We assume that the non-convex objective function $f$ is $\beta$-smooth, i.e.,

$\|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$ (6)

Let $x^*$ denote the optimal solution of the objective function $f$ and $f^* = f(x^*)$. We assume that the sampled stochastic gradients are unbiased, i.e., $\mathbb{E}[g_t^p] = \nabla f(x_t)$. We also assume that the second moment of the average of the stochastic gradients has the following bound:

$\mathbb{E}\Big\|\frac{1}{P}\sum_{p=1}^{P} g_t^p\Big\|^2 \le G^2.$ (7)

We make an analytical assumption on the aggregated results from the distributed sparsified vectors.

Assumption 1.

For any $P$ vectors $u^1, u^2, \ldots, u^P \in \mathbb{R}^d$ held by the $P$ workers, each vector is sparsified as $\mathrm{Top}_k(u^p)$ locally. The aggregation of the local $\mathrm{Top}_k$ results selects larger values than randomly selecting $k$ values from the accumulated vector, i.e.,

$\Big\|\hat{u} - \sum_{p=1}^{P}\mathrm{Top}_k(u^p)\Big\|^2 \le \mathbb{E}\big\|\hat{u} - \mathrm{rand}_k(\hat{u})\big\|^2, \quad \hat{u} = \sum_{p=1}^{P} u^p,$ (8)

where $\mathrm{rand}_k(\hat{u})$ is a vector whose $k$ elements are randomly selected from $\hat{u}$ following a uniform distribution, and the other elements are zeros.
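The right-hand side of (8) has a simple closed form under uniform sampling, which is why Assumption 1 directly yields the $(1 - k/d)$ contraction used in Lemma 1. A short standard derivation (our addition for clarity), assuming the $k$ indices are drawn uniformly without replacement:

    \mathbb{E}\left\|\hat{u} - \mathrm{rand}_k(\hat{u})\right\|^2
      = \sum_{i=1}^{d} \Pr[i \text{ not selected}]\, \hat{u}_i^2
      = \Big(1 - \frac{k}{d}\Big)\,\|\hat{u}\|^2 ,
    \qquad \hat{u} = \sum_{p=1}^{P} u^p .

Hence Assumption 1 implies $\big\|\hat{u} - \sum_{p=1}^{P}\mathrm{Top}_k(u^p)\big\|^2 \le (1 - \frac{k}{d})\|\hat{u}\|^2$.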

Similar to [3, 30], we introduce an auxiliary random variable $y_t$, which is updated by the non-sparsified gradients, i.e.,

$y_{t+1} = y_t - \eta_t \bar{g}_t,$ (9)

where $y_0 = x_0$ and $\bar{g}_t = \frac{1}{P}\sum_{p=1}^{P} g_t^p$. The error between $x_t$ and $y_t$ can be represented by

$x_t - y_t = \frac{1}{P}\sum_{p=1}^{P}\epsilon_t^p,$ (10)

where $\epsilon_t^p = \epsilon_t^{p,(1)} \oplus \cdots \oplus \epsilon_t^{p,(L)}$ is the stacked local residual of worker $p$.

Main Results

Here we present the major lemmas and theorems to prove the convergence of LAGS-SGD. Our results are mainly derivations of the standard bound in non-convex settings [5], i.e.,

$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{C_1}{\sqrt{T}} + C_2$ (11)

for some constants $C_1, C_2 \ge 0$ and the number of iterations $T$.

Lemma 1.

Under Assumption 1, for any $P$ vectors $u^1, \ldots, u^P \in \mathbb{R}^d$, where every vector can be broken down into $L$ pieces, that is $u^p = u^{p,(1)} \oplus \cdots \oplus u^{p,(L)}$ and $u^{p,(l)} \in \mathbb{R}^{d^{(l)}}$, it holds that

$\Big\|\hat{u} - \sum_{p=1}^{P}\bigoplus_{l=1}^{L}\mathrm{Top}_{k^{(l)}}\big(u^{p,(l)}\big)\Big\|^2 \le \gamma\,\|\hat{u}\|^2,$ (12)

where $\hat{u} = \sum_{p=1}^{P} u^p$, $\gamma^{(l)} = 1 - k^{(l)}/d^{(l)}$ for $l = 1, 2, \ldots, L$, and $\gamma = \max_{1\le l\le L}\gamma^{(l)}$.

Proof.

According to [32], for any vector $u \in \mathbb{R}^{d}$ with $k$ elements selected uniformly at random, $\mathbb{E}\|u - \mathrm{rand}_k(u)\|^2 = (1 - k/d)\|u\|^2$. Then, applying Assumption 1 to each piece $l$ and noting that the $L$ pieces are disjoint, we have

$\Big\|\hat{u} - \sum_{p=1}^{P}\bigoplus_{l=1}^{L}\mathrm{Top}_{k^{(l)}}(u^{p,(l)})\Big\|^2 = \sum_{l=1}^{L}\Big\|\hat{u}^{(l)} - \sum_{p=1}^{P}\mathrm{Top}_{k^{(l)}}(u^{p,(l)})\Big\|^2 \le \sum_{l=1}^{L}\Big(1 - \frac{k^{(l)}}{d^{(l)}}\Big)\|\hat{u}^{(l)}\|^2 \le \gamma\|\hat{u}\|^2,$

which concludes the proof. ∎

Inequality (12) is a sufficient condition to derive the convergence properties of Algorithm 1.

Corollary 1.

For any iteration $t$ and non-increasing step sizes $\eta_t \le \eta_0$:

$\mathbb{E}\|x_t - y_t\|^2 \le \frac{4\gamma\eta_0^2 G^2}{(1-\gamma)^2}.$ (13)
Proof.

Let $e_t = x_t - y_t$ and $\bar{g}_t = \frac{1}{P}\sum_{p=1}^{P} g_t^p$. We have $e_0 = 0$ and, by (10), $e_t = \frac{1}{P}\sum_{p=1}^{P}\epsilon_t^p$. According to the update formulas of $x_t$ and $y_t$ and Lemma 1, we have

$\|e_{t+1}\|^2 \le \gamma\,\|e_t + \eta_t\bar{g}_t\|^2 \le \frac{1+\gamma}{2}\|e_t\|^2 + \frac{\gamma(1+\gamma)}{1-\gamma}\eta_t^2\|\bar{g}_t\|^2,$

where the second inequality uses Young's inequality with parameter $\rho = \frac{1-\gamma}{2\gamma}$ and $\frac{1+\gamma}{2} < 1$. Iterating the above inequality from $t = 0$ yields:

$\|e_t\|^2 \le \frac{\gamma(1+\gamma)}{1-\gamma}\sum_{j=0}^{t-1}\Big(\frac{1+\gamma}{2}\Big)^{t-1-j}\eta_j^2\|\bar{g}_j\|^2.$

Taking the expectation and using the bound of the second moment of the stochastic gradients, $\mathbb{E}\|\bar{g}_j\|^2 \le G^2$, together with $\eta_j \le \eta_0$, we obtain

$\mathbb{E}\|e_t\|^2 \le \frac{\gamma(1+\gamma)}{1-\gamma}\,\eta_0^2 G^2\sum_{i=0}^{\infty}\Big(\frac{1+\gamma}{2}\Big)^{i} = \frac{2\gamma(1+\gamma)\eta_0^2 G^2}{(1-\gamma)^2} \le \frac{4\gamma\eta_0^2 G^2}{(1-\gamma)^2},$

which concludes the proof. ∎

Corollary 1 implies that the distance between the parameters updated with layer-wise sparsified gradients and those updated with dense gradients is bounded.

Theorem 1.

Under the assumptions on the objective function $f$, after running $T$ iterations with Algorithm 1, we have

$\frac{\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\|\nabla f(x_t)\|^2}{\sum_{t=0}^{T-1}\eta_t} \le \frac{2\big(f(x_0) - f^*\big)}{\sum_{t=0}^{T-1}\eta_t} + \frac{\beta G^2\sum_{t=0}^{T-1}\eta_t^2}{\sum_{t=0}^{T-1}\eta_t} + \frac{4\beta^2\gamma\eta_0^2 G^2}{(1-\gamma)^2}$ (14)

if one chooses a step size schedule such that $\eta_t > 0$ and $\eta_t \le \eta_{t-1}$, and

$\beta\eta_t \le \frac{1}{2}$ (15)

holds at any iteration $t$.

Proof.

We use the smoothness of $f$ and Corollary 1 to derive (14). First, with the smoothness of $f$ applied to the auxiliary sequence $y_{t+1} = y_t - \eta_t\bar{g}_t$, we have

$f(y_{t+1}) \le f(y_t) - \eta_t\langle\nabla f(y_t), \bar{g}_t\rangle + \frac{\beta\eta_t^2}{2}\|\bar{g}_t\|^2.$

Taking expectation with respect to the sampling at iteration $t$, it yields

$\mathbb{E}[f(y_{t+1})] \le f(y_t) - \eta_t\langle\nabla f(y_t), \nabla f(x_t)\rangle + \frac{\beta\eta_t^2}{2}\mathbb{E}\|\bar{g}_t\|^2.$

Using $\langle\nabla f(y_t), \nabla f(x_t)\rangle \ge \frac{1}{2}\|\nabla f(x_t)\|^2 - \frac{1}{2}\|\nabla f(y_t) - \nabla f(x_t)\|^2 \ge \frac{1}{2}\|\nabla f(x_t)\|^2 - \frac{\beta^2}{2}\|y_t - x_t\|^2$ and taking expectation with respect to the randomness before iteration $t$, it yields

$\mathbb{E}[f(y_{t+1})] \le \mathbb{E}[f(y_t)] - \frac{\eta_t}{2}\mathbb{E}\|\nabla f(x_t)\|^2 + \frac{\beta^2\eta_t}{2}\mathbb{E}\|x_t - y_t\|^2 + \frac{\beta\eta_t^2}{2}G^2.$

Using Corollary 1 (the step sizes are non-increasing and, by (15), bounded by $\eta_0 \le \frac{1}{2\beta}$), we obtain

$\mathbb{E}[f(y_{t+1})] \le \mathbb{E}[f(y_t)] - \frac{\eta_t}{2}\mathbb{E}\|\nabla f(x_t)\|^2 + \frac{2\beta^2\gamma\eta_0^2 G^2}{(1-\gamma)^2}\eta_t + \frac{\beta\eta_t^2}{2}G^2.$

Adjusting the order, we obtain

$\frac{\eta_t}{2}\mathbb{E}\|\nabla f(x_t)\|^2 \le \mathbb{E}[f(y_t)] - \mathbb{E}[f(y_{t+1})] + \frac{\beta\eta_t^2}{2}G^2 + \frac{2\beta^2\gamma\eta_0^2 G^2}{(1-\gamma)^2}\eta_t.$ (16)

We further apply the property of the optimal solution, that is, $\mathbb{E}[f(y_T)] \ge f^*$ and $y_0 = x_0$. Together with (16), summing up the inequality for $t = 0, 1, \ldots, T-1$ yields

$\frac{1}{2}\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\|\nabla f(x_t)\|^2 \le f(x_0) - f^* + \frac{\beta G^2}{2}\sum_{t=0}^{T-1}\eta_t^2 + \frac{2\beta^2\gamma\eta_0^2 G^2}{(1-\gamma)^2}\sum_{t=0}^{T-1}\eta_t.$

Multiplying $\frac{2}{\sum_{t=0}^{T-1}\eta_t}$ on both sides concludes the proof. ∎

Theorem 1 indicates that if one chooses the step sizes to satisfy (15), then the right-hand side of (14) converges as $T \to \infty$, so that Algorithm 1 is guaranteed to converge. If we let $\eta_0 \le \frac{1}{2\beta}$, which is easily satisfied, then (15) holds for both constant and diminishing step sizes. Therefore, if the step sizes are further configured as

$\eta_t = \frac{\theta}{\sqrt{T}}, \quad t = 0, 1, \ldots, T-1,$ (17)

for a constant $\theta > 0$, then the right-hand side of inequality (14) converges to zero, which ensures the convergence of Algorithm 1.

Corollary 2.

Under the same assumptions as in Theorem 1, if $\eta_t = \frac{\theta}{\sqrt{T}}$, where $\theta > 0$ is a constant, then we have the convergence rate bound for Algorithm 1 as

$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{2\big(f(x_0) - f^*\big) + \beta\theta^2 G^2}{\theta\sqrt{T}} + \frac{4\beta^2\gamma\theta^2 G^2}{(1-\gamma)^2\,T}$ (18)

if the total number of iterations $T$ is large enough.

Proof.

As $\eta_t = \frac{\theta}{\sqrt{T}}$, we simplify the notations by $\sum_{t=0}^{T-1}\eta_t = \theta\sqrt{T}$ and $\sum_{t=0}^{T-1}\eta_t^2 = \theta^2$. The left-hand side of (15) becomes $\frac{\beta\theta}{\sqrt{T}}$.

Let $T \ge 4\beta^2\theta^2$, then $\frac{\beta\theta}{\sqrt{T}} \le \frac{1}{2}$, and the step sizes are trivially non-increasing since they are constant during the run.

So (15) holds when the total number of iterations is large enough ($T \ge 4\beta^2\theta^2$). Applying Theorem 1 with $\eta_0 = \frac{\theta}{\sqrt{T}}$, we obtain

$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(x_t)\|^2 \le \frac{2\big(f(x_0) - f^*\big)}{\theta\sqrt{T}} + \frac{\beta\theta G^2}{\sqrt{T}} + \frac{4\beta^2\gamma\theta^2 G^2}{(1-\gamma)^2\,T},$

which concludes the proof. ∎

In Corollary 2, if $T$ is large enough, then the right-hand side of inequality (18) is dominated by the first term. It implies that Algorithm 1 has a convergence rate of $O(1/\sqrt{T})$, which is the same as vanilla SGD [9]. However, the second term of inequality (18) also indicates that higher compression (i.e., a larger $\gamma$, or equivalently a smaller $k^{(l)}/d^{(l)}$) leads to a larger bound on the convergence rate. In real-world settings, one may have a fixed iteration budget to train the model, so high compression ratios could slow down the convergence. On the one hand, if we choose lower compression ratios, then the algorithm has a faster convergence rate (fewer iterations to reach a target loss). On the other hand, lower compression ratios have a larger communication size and thus may result in a longer wall-clock time per iteration. Therefore, adaptively selecting the compression ratios tackles this trade-off properly.

5 System Implementation and Optimization

The layer-wise sparsification nature of LAGS-SGD enables the pipelining technique to hide the communication overheads, while an efficient system implementation of communication and computation parallelism with gradient sparsification is non-trivial for three reasons: 1) Layer-wise communication with sparsified gradients implies that many small messages need to be communicated across the network, while collectives (e.g., AllReduce) with small messages are latency-sensitive. 2) Gradient sparsification (especially top-$k$ selection on GPUs) introduces extra computation time. 3) The convergence rate of LAGS-SGD is negatively affected by the compression ratio, so one should decide proper compression ratios to trade off the number of iterations to converge against the per-iteration wall-clock time.

First, to address the first problem, we exploit a heuristic method that merges extremely small sparsified tensors into a single one for efficient communication. Specifically, we use a memory buffer to temporarily store the sparsified gradients, and aggregate the buffered gradients once the buffer becomes full or the gradients of the first layer have been calculated. Second, we implement the double sampling method [23] to approximately select the top-$k$ gradients, which can significantly reduce the top-$k$ selection time on GPUs. Finally, to achieve a balance between the convergence rate and the training wall-clock time, we propose to select the layer-wise compression ratio according to the communication-to-computation ratio. To be specific, we select a compression ratio $\rho^{(l)} = k^{(l)}/d^{(l)}$ for layer $l$ such that its communication overhead can be appropriately hidden by the computation. Given an upper bound of the compression ratio (e.g., $\hat{\rho}$), the algorithm determines $\rho^{(l)}$ according to the following three metrics: 1) the backpropagation computation time of the pipelined layer (i.e., $t_b^{(l-1)}$); 2) the communication time $t_{comm}^{(l)}(\rho)$ of the current layer under a specific compression ratio $\rho$, which can be predicted using the communication model of the AllGather or AllReduce collectives (e.g., [22, 25]) according to the size of the gradients and the inter-connection (e.g., latency and bandwidth) between workers; 3) an extra overhead $t_{sp}^{(l)}$ introduced by the sparsification operator, which generally includes a pair of operations (compression and de-compression). Therefore, the selected value of $\rho^{(l)}$ can be generalized as

$\rho^{(l)} = \min\Big\{\hat{\rho},\ \max\big\{\rho \in (0, 1] : t_{comm}^{(l)}(\rho) + t_{sp}^{(l)} \le t_b^{(l-1)}\big\}\Big\}.$ (19)
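A possible realization of the selection rule (19) is sketched below; the latency-bandwidth (alpha-beta) communication model, the 8-bytes-per-element assumption, the candidate list and all names are illustrative assumptions rather than the paper's actual implementation:

    def predict_comm_time(num_elements, alpha=50e-6, beta=8.0 / 10e9):
        # Simple latency-bandwidth (alpha-beta) model of a collective on the
        # sparsified values + indices (assumes 8 bytes per selected element).
        return alpha + beta * 8.0 * num_elements

    def select_layer_ratio(d_l, t_b_prev, t_sparse_l, rho_max=0.001, candidates=None):
        # Pick the largest density rho <= rho_max whose predicted communication
        # time (plus sparsification overhead) fits into the overlappable
        # backpropagation time t_b_prev of the pipelined layer.
        if candidates is None:
            candidates = [rho_max / (2 ** i) for i in range(10)]  # rho_max, rho_max/2, ...
        for rho in candidates:                                    # from large to small
            k = max(1, int(rho * d_l))
            if predict_comm_time(k) + t_sparse_l <= t_b_prev:
                return rho
        return candidates[-1]                                     # fall back to the smallest density

The candidate densities and timing model would be calibrated from profiling runs on the target cluster.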

5.1 Bound of Pipelining Speedup

In LAGS-SGD, the sparsification technique is used to reduce the overall communication time, and the pipelining technique is used to further overlap the already reduced communication time with the computation time. We can analyze the optimal speedup of LAGS-SGD over SLGS-SGD in terms of wall-clock time under the same compression ratios. Let $t_f$, $t_b$ and $t_c$ denote the feed-forward computation, backpropagation computation and (sparsified) gradient communication time of each iteration, respectively. We assume that the sparsification overhead can be ignored as we can use the efficient sampling method. Compared to SLGS-SGD, LAGS-SGD reduces the wall-clock time by pipelining the communications with the computations, and the maximum overlapped time is $\min\{t_b, t_c\}$ (i.e., either the backpropagation computations or the communications are completely overlapped). Let $r = t_c / t_b$ denote the communication-to-computation ratio. The ideal speedup of LAGS-SGD over SLGS-SGD can be represented by

$S = \frac{t_f + t_b + t_c}{t_f + t_b + t_c - \min\{t_b, t_c\}} = \frac{t_f + t_b + t_c}{t_f + \max\{t_b, t_c\}}.$ (20)

The equation shows that the maximum speedup of LAGS-SGD over SLGS-SGD mainly depends on the communication-to-computation ratio. If $r$ is close to $1$, then LAGS-SGD has the potential to achieve the highest speedup by completely hiding either the backpropagation computation or the communication time.
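For example, Eq. (20) can be evaluated directly from profiled per-iteration times (the numbers in the usage line are made-up placeholders):

    def ideal_speedup(t_f, t_b, t_c):
        # Maximum speedup of LAGS-SGD over SLGS-SGD for one iteration (Eq. (20));
        # the overlappable time is min(t_b, t_c).
        return (t_f + t_b + t_c) / (t_f + t_b + t_c - min(t_b, t_c))

    # e.g., t_f = 0.05 s, t_b = 0.10 s, t_c = 0.08 s -> communication fully hidden
    print(ideal_speedup(0.05, 0.10, 0.08))  # ~1.53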

6 Experiments

6.1 Experimental Settings

We conduct similar experiments to [23], covering two types of applications with three data sets: 1) image classification with convolutional neural networks (CNNs), including ResNet-20 [14] and VGG-16 [31] on the Cifar-10 [20] data set and Inception-v4 [33] and ResNet-50 [14] on the ImageNet [10] data set; 2) language modeling with a 2-layer LSTM model (LSTM-PTB) with 1500 hidden units per layer on the PTB [24] data set. On Cifar-10, the batch size for each worker is 32, and the base learning rate is 0.1; on ImageNet, the batch size for each worker is also 32, and the learning rate is 0.01; on PTB, the batch size and learning rate are 20 and 22, respectively. The compression ratios are configured separately for the CNNs and the LSTM. In all compared algorithms, the hyper-parameters are kept the same, and the experiments are conducted on a 16-GPU cluster.

Figure 2: The values of $\delta_t^{(l)}$ (only a subset of layers is displayed for better visualization), and the training loss of LAGS-SGD on 16 workers.

6.2 Verification of Assumption 1 and Convergences

To show the soundness of Assumption 1 and the convergence results, we conduct experiments with 16 workers to train the models. We define a metric $\delta_t^{(l)}$ ($l = 1, 2, \ldots, L$) for each learnable layer at each iteration $t$ of Algorithm 1:

$\delta_t^{(l)} = \Big(1 - \frac{k^{(l)}}{d^{(l)}}\Big)\big\|\hat{u}_t^{(l)}\big\|^2 - \Big\|\hat{u}_t^{(l)} - \sum_{p=1}^{P}\mathrm{Top}_{k^{(l)}}\big(u_t^{p,(l)}\big)\Big\|^2,$ (21)

where $u_t^{p,(l)} = \epsilon_t^{p,(l)} + \eta_t g_t^{p,(l)}$ and $\hat{u}_t^{(l)} = \sum_{p=1}^{P} u_t^{p,(l)}$. Assumption 1 holds if $\delta_t^{(l)} \ge 0$ ($l = 1, 2, \ldots, L$). We measure $\delta_t^{(l)}$ on ResNet-20, VGG-16 and LSTM-PTB during training, and the results are shown in Fig. 2. It is seen that $\delta_t^{(l)} \ge 0$ throughout the training process, which implies that Assumption 1 holds. The evaluated models all converge within a certain number of epochs, which verifies the convergence of LAGS-SGD.
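The metric in Eq. (21) can be monitored with a few lines of code; below is a sketch that uses the closed form of the $\mathrm{rand}_k$ term and reuses the hypothetical top_k helper introduced earlier (the function name and argument layout are ours):

    import torch

    def delta_metric(worker_vectors, k):
        # delta = (1 - k/d) * ||u_hat||^2 - ||u_hat - sum_p Top_k(u^p)||^2   (Eq. (21))
        # worker_vectors: per-worker accumulated vectors of one layer, i.e. eps + lr * grad.
        # Assumption 1 holds for this layer/iteration if the returned value is >= 0.
        u_hat = torch.stack(worker_vectors).sum(dim=0)       # aggregated accumulated vector
        d = u_hat.numel()
        agg_topk = sum(top_k(u, k) for u in worker_vectors)  # aggregation of local Top-k
        return (1.0 - k / d) * u_hat.norm() ** 2 - (u_hat - agg_topk).norm() ** 2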

6.3 Comparison of Convergence Rates

Figure 3: The comparison of convergence performance.

The convergence comparison under the same number of training epochs is shown in Fig. 3. The top-1 validation accuracy (the higher the better) on CNNs and the validation perplexity (the lower the better) on LSTM show that LAGS-SGD has very close convergence performance to SLGS-SGD. Compared to Dense-SGD, SLGS-SGD and LAGS-SGD both have slight accuracy losses. The problem could be resolved by some training tricks like momentum correction [23]. The final evaluation results are shown in Table 1. The nearly consistent convergence performance between LAGS-SGD and Dense-SGD verifies our theoretical results on the convergence rate.

Model Dense-SGD SLGS-SGD LAGS-SGD
ResNet-20
VGG-16
ResNet-50
LSTM-PTB
Table 1: Comparison of evaluation performance. Top-1 validation accuracy for CNNs and perplexity for LSTM-PTB.

6.4 Wall-clock Time Performance and Discussions

We evaluate the average iteration time of CNNs including VGG-16I (VGG-16 [31] for ImageNet), ResNet-50 and Inception-v4 on the large-scale ImageNet data set (over one million training images) on a 16-GPU cluster (four nodes, each with four Nvidia Tesla V100 PCIE-32G GPUs) connected with 10Gbps Ethernet (10GbE). The servers in the cluster are equipped with dual Intel Xeon E5-2698 v3 CPUs, Ubuntu-16.04 and CUDA-10.0. The main libraries used in our experiments are PyTorch-v1.1, OpenMPI-v4.0.0, Horovod-v0.18.1 and NCCL-v2.3.7. The experimental results with gradient sparsification are shown in Table 2. They demonstrate that LAGS-SGD runs faster than SLGS-SGD, approaching the maximum pipelining speedup, and that it also achieves an improvement over Dense-SGD.

Model Dense SLGS LAGS
VGG-16I
ResNet-50
Inception-v4
Table 2: The average iteration time in seconds over 1000 running iterations, together with the speedups of LAGS-SGD over Dense-SGD and over SLGS-SGD, and the maximum speedup of pipelining over SLGS-SGD given by Eq. (20).

The achieved speedups of LAGS-SGD over SLGS-SGD in the end-to-end training wall-clock time are minor, which is caused by three main reasons. First, as shown in Eq. (20), the improvement of LAGS-SGD over SLGS-SGD highly depends on the communication-to-computation ratio $r$. In the conducted experiments, $r$ is small because transferring highly sparsified data over 10GbE is much faster than the computation on Nvidia Tesla V100 GPUs, while the proposed method has higher potential improvement with an increased $r$, such as on lower-bandwidth networks. Second, the compression time is not negligible compared to the communication time. Even though we exploit the sampling method [23] to select the top-$k$ gradients, it is inefficient on GPUs and enlarges the computation time.