Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations


Debraj Basu (University of California, Los Angeles, USA), Deepesh Data (University of California, Los Angeles, USA), Can Karakus (Amazon Web Services; work done while at UCLA), Suhas Diggavi (University of California, Los Angeles, USA)
Abstract

The communication bottleneck has been identified as a significant issue in distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including different forms of gradient compression and computing local models that are mixed iteratively. In this paper we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation, along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze convergence of Qsparse-local-SGD in the distributed setting for smooth non-convex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet, and show that it results in significant savings over the state-of-the-art in the number of bits transmitted to reach a target accuracy.

Keywords: Distributed optimization and learning; stochastic optimization; communication efficient training methods.

1 Introduction

Stochastic Gradient Descent (SGD) [HM51] and its many variants have become the workhorse for modern large-scale optimization as applied to machine learning [Bot10, BM11]. We consider a setup in which SGD is applied in the distributed setting, where $R$ different nodes compute local stochastic gradients on their own datasets. Coordination between them is done by aggregating these local computations to update the overall parameter as

$$x_{t+1} = x_t - \eta_t\, \frac{1}{R}\sum_{r=1}^{R} \hat{g}^{(r)}_t,$$

where $\hat{g}^{(r)}_t$, for $r \in [R] := \{1, \ldots, R\}$, is the local stochastic gradient at the $r$'th machine for a local loss function $f_r$ of the parameter vector $x_t$, where $x_t \in \mathbb{R}^d$ and $\eta_t$ is the learning rate.

Training of high-dimensional models is typically performed at a large scale over bandwidth-limited networks. Therefore, despite the distributed processing gains, it is well understood by now that the exchange of full-precision gradients between nodes causes communication to be the bottleneck for many large-scale models [AHJ18, WXY17, BWAA18, SYKM17]. For example, consider training the ResNet-152 architecture [HZRS16], which has about 60 million parameters, on the ImageNet dataset that contains 14 million images. Each full-precision exchange between workers is around 240 MB. Such a communication bottleneck could be significant in emerging edge computation architectures suggested by federated learning [Kon17, MMR17, ABC16]. In such an architecture, data resides on, and can even be generated by, personal devices such as smart phones and other edge (IoT) devices, in contrast to data-center architectures. Learning is envisaged in such an ultra-large-scale, heterogeneous environment, with potentially unreliable or limited communication. These and other applications have led to many recently proposed methods, which are broadly based on three major approaches:

  1. Quantization of gradients, where nodes locally quantize the gradient (perhaps with randomization) to a small number of bits [AGL17, BWAA18, WHHZ18, WXY17, SYKM17].

  2. Sparsification of gradients, e.g., where nodes locally select only the largest values of the gradient in absolute value and transmit these at full precision [Str15, AH17, SCJ18, AHJ18, WHHZ18, LHM18], while maintaining errors at local nodes for later compensation.

  3. Skipping communication rounds, whereby nodes average their models after locally updating their models for several steps [YYZ18, Cop15, ZDW13, Sti19, CH16, WJ18].

In this work we propose a Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computations, along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD in a distributed setting, where the nodes perform computations on their local datasets. In our asynchronous model, the distributed nodes’ iterates evolve at the same rate, but update the gradients at arbitrary times; see Section 4 for more details. We analyze convergence for Qsparse-local-SGD in the distributed case, for smooth non-convex and smooth strongly-convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We implement Qsparse-local-SGD for ResNet-50 using the ImageNet dataset, and for a softmax multiclass classifier using the MNIST dataset, and we achieve target accuracies with about a factor of 15-20 savings over the state-of-the-art [AHJ18, SCJ18, Sti19], in the total number of bits transmitted.

1.1 Related Work

The use of quantization for communication-efficient gradient methods has a decades-long history [GMT73], and its recent use in training deep neural networks [SFD14, Str15] has re-ignited interest. Theoretically justified gradient compression using unbiased stochastic quantizers has been proposed and analyzed in [AGL17, WXY17, SYKM17]. Though the methods in [WWLZ18, WSL18] use the sparsity induced by quantizing the gradients, explicitly sparsifying the gradients more aggressively by retaining only a few components, e.g., the largest ones in magnitude, has been proposed [Str15, AH17, LHM18, AHJ18, SCJ18], combined with error compensation to ensure that all coordinates do get eventually updated as needed. [WHHZ18] analyzed error compensation for QSGD, without sparsification, while focusing on quadratic functions. Another approach for mitigating the communication bottleneck is to communicate infrequently, which has been popularly referred to in the literature as iterative parameter mixing, see [Cop15], and model averaging, see [Sti19, YYZ18, ZSMR16] and references therein. Our work is most closely related to and builds upon the recent theoretical results in [AHJ18, SCJ18, Sti19, YYZ18]. [SCJ18] considered the analysis for the centralized case (among other sparsifiers), and [AHJ18] analyzed a distributed version under the assumption that the aggregated gradients are close to the centralized case; see Assumption 1 in [AHJ18]. [Sti19, YYZ18] studied local-SGD, where several local iterations are done before sending the full gradients, and did not do any gradient compression beyond local iterations. Our work generalizes these works in several ways. We prove convergence for the distributed sparsification and error compensation algorithm, without the assumption of [AHJ18], by using the perturbed iterate method [MPP17, SCJ18]. We analyze non-convex as well as convex objectives for the distributed case with local computations. [SCJ18] gave a proof of sparsified SGD for convex objective functions and for the centralized case, without local computations.¹ Our techniques compose a (stochastic or deterministic 1-bit sign) quantizer with sparsification and local computations using error compensation. While our focus has only been on mitigating the communication bottleneck in training high-dimensional models over bandwidth-limited networks, this technique works for any compression operator satisfying a regularity condition (see Definition 3), including our composed operators.

¹ At the completion of our work, we found that in parallel to our work [KRSJ19] examined the use of sign-SGD quantization, without sparsification, for the centralized model. Another recent work [KSJ19] studies the decentralized case with sparsification for strongly convex functions. Our work, developed independently of these works, uses quantization, sparsification and local computations for the distributed case, for both non-convex and strongly convex objectives.

1.2 Contributions

We study a distributed setting with $R$ worker nodes, each of which performs computations on locally stored data, denoted by $\mathcal{D}_r$ for worker $r$. Consider the empirical-risk minimization of the loss function

$$f(x) := \frac{1}{R}\sum_{r=1}^{R} f_r(x),$$

where $f_r(x) := \mathbb{E}_{\zeta_r \sim \mathcal{D}_r}\big[f(x, \zeta_r)\big]$, and the expectation is over a random sample $\zeta_r$ chosen from the local dataset $\mathcal{D}_r$. Our setup can also handle different local functional forms, beyond dependence on the local dataset $\mathcal{D}_r$, which is not written explicitly for notational simplicity. For $r \in [R]$, we use $\hat{g}^{(r)}$ and $\nabla f_r$ to denote the local stochastic gradient and the local full gradient, respectively. The distributed nodes perform computations and provide updates to the master node, which is responsible for aggregation and the model update. We develop Qsparse-local-SGD, a distributed SGD framework composing gradient quantization and explicit sparsification (e.g., retaining the $\mathrm{Top}_k$ components), along with local iterations. We develop the algorithms and analysis for both synchronous as well as asynchronous operation, in which workers can communicate with the master at arbitrary time intervals. To the best of our knowledge, these are the first algorithms which combine quantization, aggressive sparsification, and local computations for distributed optimization. With some minor modifications, Qsparse-local-SGD can also be used in a peer-to-peer setting, where the aggregation is done without any help from the master node, and each worker exchanges its updates with all other workers.

Our main theoretical results are the convergence analyses of Qsparse-local-SGD for both non-convex as well as convex objectives; see Theorem 1 and Theorem 3 for the synchronous case, as well as Theorem 4 and Theorem 6 for the asynchronous operation. Our analyses also demonstrate the natural gains in convergence that distributed, mini-batch operation affords, and show convergence similar to equivalent vanilla SGD with local iterations (see Corollary 2 and Corollary 3), for both the non-convex case (with a convergence rate of $\mathcal{O}(1/\sqrt{RT})$ for a fixed learning rate) as well as the convex case (with a convergence rate of $\mathcal{O}(1/(RT))$ for a diminishing learning rate). We also demonstrate that quantizing and sparsifying the gradient, even after local iterations, asymptotically yields an almost "free" efficiency gain (also observed numerically, non-asymptotically, in Section 5). The numerical results, on the ImageNet dataset with a ResNet-50 architecture and, for the convex case, on multi-class logistic classification with the MNIST dataset [LBBH98], demonstrate that one can get significant communication savings while retaining state-of-the-art performance. The combination of quantization, sparsification, and local computations poses several challenges for the theoretical analysis, including the analysis of the impact of local iterations (block updates of parameters) on quantization and sparsification (see Lemmas 4 and 5 in Section 3), as well as of asynchronous updates and their combination with distributed compression (see Lemmas 9-12 in Section 4).

1.3 Organization

In Section 2, we demonstrate that composing certain classes of quantizers with sparsifiers satisfies a certain regularity condition that is needed for several convergence proofs for our algorithms. We describe the synchronous implementation of Qsparse-local-SGD in Section 3, and outline the main convergence results for it in Section 3.3, briefly giving the proof ideas in Section 3.4. We describe our asynchronous implementation of Qsparse-local-SGD and provide the theoretical convergence results in Section 4. The experimental results are given in Section 5. Many of the proof details are given in the appendices.

2 Communication Efficient Operators

Traditionally, distributed stochastic gradient descent sends full-precision (32- or 64-bit) unbiased gradient updates from the workers to their peers or to a central server that helps with aggregation. However, the communication bottlenecks that arise in bandwidth-limited networks limit the applicability of such an algorithm at a large scale, when the parameter size is massive or the data is distributed over a very large number of worker nodes. In such settings, one could think of updates which not only result in convergence, but also require less bandwidth, thus making the training process faster. In the following sections we discuss several useful operators from the literature and enhance their use by proposing a novel class of composed operators.

We first consider two different techniques used in the literature for mitigating the communication bottleneck in distributed optimization, namely, quantization and sparsification. In quantization, we reduce the precision of the gradient vector by mapping each of its components, via a deterministic [BWAA18, KRSJ19] or randomized [AGL17, WXY17, SYKM17, ZDJW13] map, to a finite number of quantization levels. In sparsification, we sparsify the gradient vector before using it to update the parameter vector, by taking its $k$ largest components in magnitude (denoted by $\mathrm{Top}_k$) or choosing $k$ components uniformly at random (denoted by $\mathrm{Rand}_k$) [SCJ18, KSJ19].

2.1 Quantization

SGD computes an unbiased estimate of the gradient, which can be used to update the model iteratively and is extremely useful in large-scale applications. It is well known that the first-order terms in the rate of convergence are affected by the variance of the gradients. While stochastic quantization of gradients could result in a variance blow-up, it preserves the unbiasedness of the gradients at low precision; therefore, when training over bandwidth-limited networks, the overall training can be much faster; see [AGL17, WXY17, SYKM17, ZDJW13].

Definition 1 (Randomized quantizer).

We say that $Q : \mathbb{R}^d \to \mathbb{R}^d$ is a randomized quantizer with $s$ quantization levels, if the following holds for every $x \in \mathbb{R}^d$: (i) $\mathbb{E}[Q(x)] = x$; (ii) $\mathbb{E}\,\|Q(x)\|_2^2 \le (1+\beta)\,\|x\|_2^2$, where $\beta$ could be a function of $s$ and $d$. Here the expectation is taken over the randomness of $Q$.

Examples of randomized quantizers include

  1. QSGD [AGL17, WXY17], which independently quantizes the components of $x$ into $s$ levels, with $\beta = \min(d/s^2, \sqrt{d}/s)$.

  2. Stochastic $k$-level Quantization [SYKM17, ZDJW13], which independently quantizes every component of $x$ into $k$ levels between the minimum and maximum components of $x$, with $\beta$ depending on $d$ and the number of levels (see [SYKM17]).

  3. Stochastic Rotated Quantization [SYKM17], which is stochastic quantization preprocessed by a random rotation, yielding a smaller $\beta$ (see [SYKM17]).
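
For concreteness, here is a minimal numpy sketch of a QSGD-style stochastic quantizer satisfying Definition 1 (unbiased, with a bounded blow-up of the second moment); the function name and interface are ours, not the paper's implementation.

```python
import numpy as np

def qsgd_quantize(x, s, rng=None):
    """Unbiased stochastic quantizer with s levels per coordinate (a sketch, not the paper's code)."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) / norm * s          # each entry lies in [0, s]
    lower = np.floor(scaled)               # lower quantization level
    levels = lower + (rng.random(x.shape) < (scaled - lower))  # randomized rounding up/down
    return np.sign(x) * levels * norm / s  # reconstruction; E[Q(x)] = x by construction
```

Averaging many independent draws of `qsgd_quantize(x, s)` recovers `x`, which is a quick empirical check of condition (i) in Definition 1.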

Instead of quantizing randomly into levels, we can take a deterministic approach and round off each component of the vector to the nearest level. In particular, we can just take the sign, which has shown promise in [BWAA18, KRSJ19].

Definition 2 (Deterministic Sign quantizer).

A deterministic 1-bit (sign) quantizer $\mathrm{Sign} : \mathbb{R}^d \to \{-1, +1\}^d$ is defined as follows: for every vector $x \in \mathbb{R}^d$, the $i$'th component of $\mathrm{Sign}(x)$, for $i \in [d]$, is defined as $\mathrm{Sign}(x)_i := \mathrm{sign}(x_i)$, i.e., $+1$ if $x_i \ge 0$ and $-1$ otherwise.

Such methods have drawn interest since Rprop [RB93], which used only the temporal behavior of the sign of the gradient; this is an example where a biased 1-bit quantizer, as in Definition 2, is used. This further inspired optimizers such as RMSprop [TH12] and Adam [KB15], which incorporate appropriate adaptive scaling with momentum acceleration and have demonstrated empirical superiority in non-convex applications.

2.2 Sparsification

As mentioned earlier, we consider two important examples of sparsification operators: $\mathrm{Top}_k$ and $\mathrm{Rand}_k$. For any $x \in \mathbb{R}^d$, $\mathrm{Top}_k(x)$ is equal to a $d$-length vector, which has at most $k$ non-zero components whose indices correspond to the indices of the $k$ largest components (in absolute value) of $x$. Similarly, $\mathrm{Rand}_k(x)$ is a $d$-length (random) vector, which is obtained by selecting $k$ components of $x$ uniformly at random. Both of these satisfy a so-called "contraction" property as defined below, with $\gamma = k/d$ [SCJ18].
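
The following is a minimal numpy sketch of the two sparsifiers, for concreteness; the function names are ours.

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng=None):
    """Keep k entries of x chosen uniformly at random and zero out the rest."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out
```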

Definition 3 (Contraction operator [SCJ18]).

A (possibly randomized) function $C : \mathbb{R}^d \to \mathbb{R}^d$ is called a contraction operator, if there exists a constant $\gamma \in (0, 1]$ (that may depend on $k$ and $d$), such that for every $x \in \mathbb{R}^d$, we have

$$\mathbb{E}\,\|x - C(x)\|_2^2 \;\le\; (1-\gamma)\,\|x\|_2^2, \qquad\qquad (1)$$

where the expectation is taken over the randomness of the contraction operator $C$.
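
To make the contraction property concrete, the following check (our own illustration, not from the paper) estimates the average of $\|x - C(x)\|_2^2 / \|x\|_2^2$ over random vectors for a given compressor $C$; the sparsifier sketches above are referenced only in the comments.

```python
import numpy as np

def avg_contraction_ratio(compressor, d=1000, trials=200, seed=0):
    """Average ||x - C(x)||^2 / ||x||^2 over random Gaussian vectors.
    Definition 3 says the expectation (over C's randomness) is at most 1 - gamma for every x."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        x = rng.standard_normal(d)
        total += np.linalg.norm(x - compressor(x)) ** 2 / np.linalg.norm(x) ** 2
    return total / trials

# With the top_k / rand_k sketches above and k = 100, d = 1000 (so gamma = k/d = 0.1):
#   avg_contraction_ratio(lambda x: top_k(x, 100))   # stays well below 1 - gamma = 0.9
#   avg_contraction_ratio(lambda x: rand_k(x, 100))  # concentrates around 0.9
```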

Note that stochastic quantizers, as defined in Definition 1, also satisfy the regularity condition in Definition 3, with $\gamma = 1 - \beta$, provided $\beta < 1$. Now we give a simple but important corollary, which allows us to apply different contraction operators to different coordinates of a vector. As an application, in the case of training neural networks, we can apply different operators to different layers.

Corollary 1 (Piecewise contraction).

Let $C_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_i}$, for $i \in [m]$, denote possibly different contraction operators with contraction coefficients $\gamma_i$. For $x = (x_1, \ldots, x_m)$ with $x_i \in \mathbb{R}^{d_i}$ for all $i \in [m]$, let $C(x) := (C_1(x_1), \ldots, C_m(x_m))$. Then $C$ is a contraction operator with the contraction coefficient being equal to $\min_{i \in [m]} \gamma_i$.

Proof.

Fix an arbitrary $x = (x_1, \ldots, x_m)$ with $x_i \in \mathbb{R}^{d_i}$ and consider the following:

$$\mathbb{E}\,\|x - C(x)\|_2^2 \;=\; \sum_{i=1}^{m} \mathbb{E}\,\|x_i - C_i(x_i)\|_2^2 \;\overset{(a)}{\le}\; \sum_{i=1}^{m} (1-\gamma_i)\,\|x_i\|_2^2 \;\le\; \Big(1 - \min_{i\in[m]}\gamma_i\Big) \sum_{i=1}^{m}\|x_i\|_2^2 \;=\; \Big(1 - \min_{i\in[m]}\gamma_i\Big)\,\|x\|_2^2.$$

Inequality (a) follows because each $C_i$ is a contraction operator with the contraction coefficient $\gamma_i$. ∎

Corollary 1 allows us to apply different contraction operators to different coordinates of the updates, which can be chosen based upon their dimensionality and sparsity patterns.
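
As an illustration of Corollary 1 in the neural-network setting mentioned above, the sketch below (our own, with hypothetical layer shapes) applies $\mathrm{Top}_k$ with a different $k$ to each layer; by Corollary 1, the concatenated output is a contraction with coefficient $\min_i k_i/d_i$.

```python
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def piecewise_compress(layer_grads, per_layer_k):
    """Apply Top_k with a layer-specific k to each layer's (flattened) gradient;
    by Corollary 1 the concatenation is a contraction with gamma = min_i k_i / d_i."""
    return [top_k(g.ravel(), k).reshape(g.shape)
            for g, k in zip(layer_grads, per_layer_k)]

# Illustrative layer shapes: a convolutional layer and a dense layer, compressed differently.
rng = np.random.default_rng(0)
grads = [rng.standard_normal((64, 3, 3, 3)), rng.standard_normal((512, 10))]
sparse_grads = piecewise_compress(grads, per_layer_k=[100, 50])
```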

2.3 Composition of Quantization and Sparsification

Now we show that we can compose deterministic or randomized quantizers with sparsifiers, and the resulting operator is still a contraction operator. First we compose a general stochastic quantizer with an explicit sparsifier such as $\mathrm{Top}_k$ or $\mathrm{Rand}_k$ and show that the resulting operator is a "contraction" operator. A proof is provided in Appendix A.1.

Lemma 1 (Contraction of a composed operator).

Let $k \in [d]$. Let $Q$ be a quantizer with parameter $\beta$ that satisfies Definition 1. Let $C : \mathbb{R}^d \to \mathbb{R}^d$ be defined as $C(x) := Q(\mathrm{Top}_k(x))$ (or with $\mathrm{Rand}_k$ in place of $\mathrm{Top}_k$) for every $x \in \mathbb{R}^d$. If $\beta$ is such that $\beta < 1$, then $C$ is a contraction operator with the contraction coefficient being equal to $(1-\beta)\frac{k}{d}$, i.e., for every $x \in \mathbb{R}^d$, we have

$$\mathbb{E}\,\|x - Q(\mathrm{Top}_k(x))\|_2^2 \;\le\; \left(1 - (1-\beta)\frac{k}{d}\right)\|x\|_2^2,$$

where the expectation is taken over the randomness of the contraction operator as well as of the quantizer $Q$.
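
The sketch below composes the $\mathrm{Top}_k$ and QSGD-style quantizer sketches from above, in the spirit of Lemma 1; it is our illustration of the composition, not the authors' implementation, and the communication-cost remark in the comment assumes the receiver knows $k$ and $s$.

```python
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def qsgd_quantize(x, s, rng=None):
    # Unbiased stochastic quantizer with s levels (same sketch as in Section 2.1).
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(x)
    if norm == 0.0:
        return np.zeros_like(x)
    scaled = np.abs(x) / norm * s
    lower = np.floor(scaled)
    levels = lower + (rng.random(x.shape) < (scaled - lower))
    return np.sign(x) * levels * norm / s

def quantized_top_k(x, k, s, rng=None):
    """Composed operator in the spirit of Lemma 1: sparsify first, then quantize the k survivors.
    Only k indices, k signed quantization levels, and one scalar norm need to be communicated."""
    return qsgd_quantize(top_k(x, k), s, rng=rng)
```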

For the different quantizers mentioned earlier, the conditions under which their composition with $\mathrm{Top}_k$ (or $\mathrm{Rand}_k$) yields a contraction, and the resulting contraction coefficients, are:

  1. QSGD: for , we get.

  2. Stochastic k-level Quantization: for , we get .

  3. Stochastic Rotated Quantization: for , we get .

Remark 1.

Observe that for a given stochastic quantizer that satisfies Definition 1, we have a prescribed operating regime of $\beta < 1$. This results in an upper bound on the coarseness of the quantizer, which happens because the quantization leads to a blow-up of the second moment; see condition (ii) of Definition 1. However, by employing Corollary 1, we show via an example that this can be alleviated to some extent.

Consider an operator as described in Lemma 1, where the quantizer in use is QSGD [AGL17, WXY17] and the sparsifier is $\mathrm{Top}_k$ [AHJ18, SCJ18]. Apply it to a vector in a piecewise manner, i.e., to smaller sub-vectors, as prescribed in Corollary 1. Define $\beta_i$ as the coefficient of the variance bound in Definition 1 for the quantizer used on the $i$'th piece. Observe that the regularity condition in Definition 3 can be satisfied by having $\beta_i < 1$ for every piece. Therefore, the piecewise contraction operator allows a coarser quantizer than when the operator is applied to the entire vector together, where we require $\beta < 1$ for the full $d$-dimensional vector (and for QSGD, $\beta$ grows with the dimension), thus providing a small gain in communication efficiency. For example, consider the composed operator being applied on a per-layer basis to a deep neural network. We can now afford to have a much coarser quantizer than when the operator is applied to all the parameters at once.

As discussed above, stochastic quantization results in a variance blow-up, which limits our regime of operation when we combine it with sparsification. However, it turns out that we can remove this restriction entirely by scaling the vector properly. We summarize the result in the following lemma, which is proved in Appendix A.2.

Lemma 2 (Composing sparsification with stochastic quantization).

Let $k \in [d]$. Let $Q$ be a stochastic quantizer with parameter $\beta$ that satisfies Definition 1. Let $C : \mathbb{R}^d \to \mathbb{R}^d$ be defined as $C(x) := \frac{1}{1+\beta}\,Q(\mathrm{Top}_k(x))$ (or with $\mathrm{Rand}_k$ in place of $\mathrm{Top}_k$) for every $x \in \mathbb{R}^d$. Then $C$ is a contraction operator with the contraction coefficient being equal to $\frac{k}{d(1+\beta)}$, i.e., for every $x \in \mathbb{R}^d$,

$$\mathbb{E}\left\|x - \tfrac{1}{1+\beta}\,Q(\mathrm{Top}_k(x))\right\|_2^2 \;\le\; \left(1 - \frac{k}{d(1+\beta)}\right)\|x\|_2^2.$$

Remark 2.

Note that, unlike the unscaled composition $Q(\mathrm{Top}_k(x))$ of Lemma 1, the scaled version $\frac{1}{1+\beta}Q(\mathrm{Top}_k(x))$ is always a contraction operator, for all values of $\beta$. Furthermore, observe that $\frac{k}{d(1+\beta)} \ge (1-\beta)\frac{k}{d}$ for every $\beta \ge 0$, which implies that even in the operating regime of $\beta < 1$, which is required in Lemma 1, the scaled composed operator of Lemma 2 gives better contraction than what we get from the unscaled composed operator of Lemma 1. So, scaling a composed operator properly is always a better choice for contraction.
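
For completeness, the comparison behind Remark 2 is a one-line calculation (our own verification of the coefficients stated above):

$$\frac{k}{d(1+\beta)} \;\ge\; (1-\beta)\frac{k}{d} \;\Longleftrightarrow\; \frac{1}{1+\beta} \;\ge\; 1-\beta \;\Longleftrightarrow\; 1 \;\ge\; (1-\beta)(1+\beta) \;=\; 1-\beta^2,$$

which holds for every $\beta \ge 0$, with equality only when $\beta = 0$.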

We can also compose a deterministic 1-bit quantizer with a sparsifier. For that we need some notation first. For $k \in [d]$ and a given vector $x \in \mathbb{R}^d$, consider the set of indices chosen for defining the sparsified vector: for $\mathrm{Top}_k$, it is the set of indices corresponding to the $k$ largest components (in absolute value) of $x$; for $\mathrm{Rand}_k$, it is a random set of $k$ indices in $[d]$. The composition of the sign quantizer with the sparsifier keeps only the coordinates in this set, and for $i \in [d]$, its $i$'th component is defined as

In the following lemma we show that this composed operator is a contraction operator; a proof is provided in Appendix A.3.

Lemma 3 (Composing sparsification with deterministic quantization).

For , the operator

for any is a contraction operator with the contraction coefficient being equal to

Remark 3.

Observe that for , depending on the value of , either of the terms inside the max can be bigger than the other term. For example, if , then , which implies that the second term inside the max is equal to , which is much smaller than the first term. On the other hand, if and the vector is dense, then the second term may be much bigger than the first term.
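
The following is a minimal sketch of one plausible instantiation of the sign-plus-$\mathrm{Top}_k$ composition: it keeps the signs of the $k$ retained coordinates, scaled by their mean absolute value so that the output carries magnitude information. The scaling factor is our assumption for illustration, not necessarily the exact scaling used in Lemma 3.

```python
import numpy as np

def sign_top_k(x, k):
    """Sign-quantized Top_k sketch: keep the signs of the k largest-magnitude entries,
    scaled by their mean absolute value (the scaling is our assumption, see the lead-in).
    Only k indices, k sign bits, and one full-precision scalar need to be communicated."""
    idx = np.argpartition(np.abs(x), -k)[-k:]
    scale = np.mean(np.abs(x[idx]))
    out = np.zeros_like(x)
    out[idx] = scale * np.sign(x[idx])
    return out
```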

3 Distributed Synchronous Operation

Let $I_T \subseteq [T]$ denote the set of indices at which a worker synchronizes with the master. In the synchronous setting, $I_T$ is the same for all the workers. For any $t$, let $\pi(t) := \max\{t' \in I_T \cup \{0\} : t' \le t\}$ denote the most recent synchronization step up to $t$. Every worker $r$ maintains a local parameter vector $x_t^r$, which is updated in each iteration $t$. If $t+1 \in I_T$, every worker sends to the master node a compressed and error-compensated update, computed on the net progress made since the last synchronization, and updates its local memory $m_{t+1}^r$. Upon receiving these updates, the master aggregates them, updates the global parameter vector, and sends the new model to all the workers; upon receiving it, they set their local parameter vectors equal to the global parameter vector. Our algorithm is summarized in Algorithm 1.

1:  Initialize $x_0^r := x_0$ and memory $m_0^r := 0$ for all workers $r \in [R]$. Suppose $\eta_t$ follows a certain learning rate schedule.
2:  for $t = 0$ to $T-1$ do
3:     On Workers:
4:     for $r = 1$ to $R$ do (in parallel)
5:         $x_{t+1/2}^r := x_t^r - \eta_t\, \hat{g}_t^r(x_t^r)$; $\hat{g}_t^r$ is computed on a mini-batch of size $b$ sampled uniformly in $\mathcal{D}_r$
6:         if $t + 1 \notin I_T$ then
7:            $x_{t+1}^r := x_{t+1/2}^r$, and $m_{t+1}^r := m_t^r$
8:         else
9:            $g_t^r := C\big(m_t^r + x_{\pi(t)} - x_{t+1/2}^r\big)$, send $g_t^r$ to the master
10:            $m_{t+1}^r := m_t^r + x_{\pi(t)} - x_{t+1/2}^r - g_t^r$
11:            Receive $x_{t+1}$ from the master and set $x_{t+1}^r := x_{t+1}$
12:         end if
13:     end for
14:     At Master:
15:     if $t + 1 \notin I_T$ then
16:         $x_{t+1} := x_t$
17:     else
18:         Receive $g_t^r$ from all workers and compute $x_{t+1} := x_{\pi(t)} - \frac{1}{R}\sum_{r=1}^{R} g_t^r$
19:         Broadcast $x_{t+1}$ to all workers
20:     end if
21:  end for
22:  Comment: $x_{t+1/2}^r$ is used to denote an intermediate variable between iterations $t$ and $t+1$
Algorithm 1 Qsparse-local-SGD
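
To make the worker-side error compensation in Algorithm 1 concrete, here is a minimal single-worker sketch in numpy; `grad_fn`, `compress`, and the scheduling are placeholders of ours, and the bookkeeping mirrors the prose description above rather than being the authors' implementation.

```python
import numpy as np

def run_worker_segment(x_sync, memory, grad_fn, compress, lr, local_steps, rng):
    """One inter-synchronization segment at a single worker (a sketch of Algorithm 1's worker side).

    x_sync:  model received at the last synchronization
    memory:  locally accumulated compression error
    returns: (compressed update to send to the master, new local memory)
    """
    x = x_sync.copy()
    for _ in range(local_steps):
        x = x - lr * grad_fn(x, rng)          # local SGD steps on the worker's own data
    progress = x_sync - x                      # net progress since the last synchronization
    update = compress(memory + progress)       # compressed, error-compensated update
    new_memory = memory + progress - update    # error feedback: keep whatever was not sent
    return update, new_memory

# The master averages the updates from all R workers, subtracts the mean update from the
# last synchronized model, and broadcasts the result; every worker then restarts from it,
# keeping its own local memory.
```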

3.1 Assumptions

All results in this paper use the following two standard assumptions.

  1. Smoothness: The local function $f_r$ at each worker $r \in [R]$ is $L$-smooth, i.e., for every $x, y \in \mathbb{R}^d$, we have $\|\nabla f_r(x) - \nabla f_r(y)\|_2 \le L\,\|x - y\|_2$.

  2. Bounded second moment: For every $x \in \mathbb{R}^d$, every $r \in [R]$, and for some constant $G > 0$, we have $\mathbb{E}\,\|\hat{g}^{(r)}(x)\|_2^2 \le G^2$. This is a standard assumption in [SSS07, NJLS09, RRWN11, HK14, RSS12, SCJ18, Sti19, YYZ18, KSJ19, AHJ18]. Relaxing the uniform boundedness of the gradient, to allow arbitrarily different gradients of local functions in heterogeneous settings as done for SGD in [NNvD18, WJ18], is left as future work. This assumption also imposes a bound on the variance: $\mathbb{E}\,\|\hat{g}^{(r)}(x) - \nabla f_r(x)\|_2^2 \le \sigma^2$ with $\sigma^2 \le G^2$, for every $r \in [R]$ and $x \in \mathbb{R}^d$.

In this section we present our main convergence results with synchronous updates, obtained by running Algorithm 1 for smooth functions, both non-convex and strongly convex. To state our results, we need the following definition from [Sti19].

Definition 4 (Gap [Sti19]).

Let $I_T = \{t_0, t_1, \ldots, t_k\}$, where $t_i < t_{i+1}$ for $i = 0, 1, \ldots, k-1$. The gap of $I_T$ is defined as $\mathrm{gap}(I_T) := \max_{i \in [k]} (t_i - t_{i-1})$, which is equal to the maximum difference between any two consecutive synchronization indices.

3.2 Error Compensation

Sparsified gradient methods, where workers send only the largest coordinates of their updates based on magnitude, have been investigated in the literature and serve as a communication-efficient strategy for distributed training of learning models. However, the resulting convergence rates are subpar compared to distributed vanilla SGD. Together with some form of error compensation, these methods have been empirically observed to converge as fast as vanilla SGD [Str15, AH17, LHM18, AHJ18, SCJ18]. In [AHJ18, SCJ18], sparsified SGD with such feedback schemes has been carefully analyzed. Under analytic assumptions, [AHJ18] proves the convergence of distributed SGD with error feedback. The net error in the system is accumulated by each worker locally on a per-iteration basis, and this is used as feedback for generating future updates. [SCJ18] carried out the analysis for centralized SGD with strongly convex objectives.

In Algorithm 1, the error introduced in every iteration is accumulated into the memory of each worker, which is compensated for in the future rounds of communication. This feedback is the key to recovering the convergence rates matching vanilla SGD. The operators employed provide a controlled way of using both the current update as well as the compression errors from the previous rounds of communication. Under the assumption of the uniform boundedness of the gradient we analyze the controlled evolution of memory through the optimization process; the results are summarized in Lemma 4 and Lemma 5 below.
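
In symbols, writing $u_t^r$ for the net (uncompressed) progress a worker would like to communicate at a synchronization step and $C$ for the contraction operator, the error-compensated update takes the form below (a restatement, in our notation, of the memory update used in Algorithm 1):

$$\text{sent:}\quad g_t^r = C\big(m_t^r + u_t^r\big), \qquad \text{kept:}\quad m_{t+1}^r = m_t^r + u_t^r - g_t^r,$$

so whatever the compression removes is never discarded; it re-enters the input of $C$ at the next synchronization round.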

3.2.1 Decaying Learning Rate

Here we show that if we run Algorithm 1 with a decaying learning rate $\eta_t$, then the local memory at each worker contracts and goes to zero as $t \to \infty$.

Lemma 4 (Memory contraction).

Let and , where is a constant and . Then there exists a constant , such that the following holds for every worker and for every :

(2)

We prove Lemma 4 in Appendix B.1. Note that for decaying , the memory decays as . This implies that the net error in the algorithm from the compression of updates in each round of communication is compensated for in the end.

3.2.2 Fixed Learning Rate

In the following lemma, which is proved in Appendix B.2, we show that if we run Algorithm 1 with a fixed learning rate $\eta$, then the local memory at each worker is bounded. It can be verified that the proof of Lemma 4 also holds for a fixed learning rate, and we can trivially bound the memory in this case by simply substituting the fixed learning rate in (2). However, we can get a better bound (saving a factor which is bigger than 4) by working directly with a fixed learning rate.

Lemma 5 (Bounded memory).

Let . Then the following holds for every worker and for every :

(3)

Note that, for a fixed learning rate $\eta$, the memory is upper bounded by a constant. Observe that, since the memory accumulates the past errors due to compression and local computation, in order to asymptotically reduce the memory to zero, the learning rate has to be reduced once in a while throughout the training process.

3.3 Main Results

We leverage the perturbed iterate analysis as in [MPP17, SCJ18] to provide convergence guarantees for Qsparse-local-SGD. Under the assumptions of Section 3.1, the following theorems hold when Algorithm 1 is run with any contraction operator (including our composed operators).

Theorem 1 (Smooth (non-convex) case with fixed learning rate).

Let be -smooth for every . Let be a contraction operator whose contraction coefficient is equal to . Let be generated according to Algorithm 1 with , for step sizes (where is a constant such that ) and . Then we have

Here is a random variable which samples a previous parameter with probability .

Corollary 2.

Let , where is a constant,² , and , we have

² Even classical SGD requires knowing an upper bound on this quantity in order to choose the learning rate. Smoothness of $f$ translates this to the difference of the function values.

In order to ensure that the compression does not affect the dominating terms while converging at a rate of , we would require .

Theorem 1 is proved in Appendix B.6 and provides non-asymptotic guarantees, where we observe that compression does not affect the first-order term. Here we are required to decide the horizon $T$ before running the algorithm. Therefore, in order to converge to a fixed point, the learning rate needs to follow a piecewise schedule (i.e., the learning rate has to be reduced once in a while throughout the training process), which is also the case in our numerics in Section 5.1. The corresponding asymptotic result (with a decaying learning rate) is given below.

Theorem 2 (Smooth (non-convex) case with decaying learning rate).

Let be -smooth for every . Let be a contraction operator whose contraction coefficient is equal to . Let be generated according to Algorithm 1 with , for step sizes and , where is such that, we have and . Then the following holds.

Here (i) ; (ii) , which is lower bounded as ; and (iii) is a random variable which samples a previous parameter with probability .

Note that Theorem 2 gives a convergence rate of . We prove it in Appendix B.7.

Theorem 3 (Smooth and strongly convex case with a decaying learning rate).

Let be -smooth and -strongly convex. Let be a contraction operator whose contraction coefficient is equal to . Let be generated according to Algorithm 1 with , for step sizes with , where is such that we have , . Then the following holds

Here (i) , , where ; (ii) , where ; and (iii) .

Corollary 3.

For , , and using from Lemma 2 in [RSS12], we have

In order to ensure that the compression does not affect the dominating terms while converging at a rate of , we would require .

Theorem 3 is proved in Appendix B.8. For no compression and only local computations, i.e., when the contraction operator is the identity map, and under the same assumptions, we recover/generalize a few recent results from the literature with similar convergence rates:

  1. We recover [YYZ18, Theorem 1], which does local SGD for the non-convex case;

  2. We generalize [Sti19, Theorem 2.2], which does local SGD for a strongly convex case and requires the unbiasedness assumption on the gradients,³ to the distributed case.

³ The unbiasedness of gradients at every worker can be ensured by assuming that each worker samples data points from the entire dataset.

We emphasize that unlike [YYZ18, Sti19], which only consider local computation, we combine quantization and sparsification with local computation, which poses several technical challenges; e.g., see proofs of Lemma 4, 5, 6.

3.4 Proof Outlines

In order to prove our results, we define virtual sequences for every worker and for all as follows:

(4)

Here can be taken to be decaying or fixed, depending on the result that we are proving. Let be the set of random sampling of the mini-batches at each worker . We define

  1. , ;

  2. , .

3.4.1 Proof Outline of Theorem 1

Proof.

Since is -smooth, we have from (4) (with fixed learning rate ) that

(5)

With some algebraic manipulations provided in Appendix B.6, for , we arrive at

(6)

Under Assumption 2, stated in Section 3.1, we have

(7)

To bound the remaining term on the RHS of (6), we first show below that the difference between the true and the virtual sequence is equal to the average memory; then we can use the bound on the local memory terms from Lemma 5.

Lemma 6 (Memory).

Let