Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations
Abstract
The communication bottleneck has been identified as a significant issue in distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including different forms of gradient compression and computing local models that are mixed iteratively. In this paper we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation, along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze convergence of Qsparse-local-SGD in the distributed setting for smooth non-convex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet, and show that it results in significant savings over the state-of-the-art in the number of bits transmitted to reach a target accuracy.
Keywords: Distributed optimization and learning; stochastic optimization; communication efficient training methods.
1 Introduction
Stochastic Gradient Descent (SGD) [HM51] and its many variants have become the workhorse for modern large-scale optimization as applied to machine learning [Bot10, BM11]. We consider a setup in which SGD is applied in the distributed setting, where different nodes compute local stochastic gradients on their own datasets. Coordination between them is done by aggregating these local computations to update the overall parameter as,
where , for , is the local stochastic gradient at the ’th machine for a local loss function of the parameter vector , where and is the learning rate.
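As a concrete reference point, the aggregation step above can be sketched as follows (a minimal NumPy sketch in our own notation; the function and variable names are ours, not the paper's):

```python
import numpy as np

def distributed_sgd_step(x, local_grads, lr):
    """One step of vanilla distributed SGD: the master averages the
    workers' local stochastic gradients and takes a gradient step."""
    g = np.mean(local_grads, axis=0)  # aggregate the workers' gradients
    return x - lr * g

# toy example: two workers holding quadratic losses
x = np.array([1.0, 2.0])
grads = [2.0 * x, 4.0 * x]  # stand-ins for local stochastic gradients
x_new = distributed_sgd_step(x, grads, lr=0.1)
```

The rest of the paper replaces the full-precision `local_grads` exchange with compressed, error-compensated, infrequent updates.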
Training of high-dimensional models is typically performed at large scale over bandwidth-limited networks. Therefore, despite the distributed processing gains, it is well understood by now that the exchange of full-precision gradients between nodes causes communication to be the bottleneck for many large-scale models [AHJ18, WXY17, BWAA18, SYKM17]. For example, consider training the ResNet-152 architecture [HZRS16], which has about 60 million parameters, on the ImageNet dataset, which contains 14 million images. Each full-precision exchange between workers is around 240 MB. Such a communication bottleneck could be significant in the emerging edge computation architectures suggested by federated learning [Kon17, MMR17, ABC16]. In such an architecture, data resides on, and can even be generated by, personal devices such as smartphones and other edge (IoT) devices, in contrast to datacenter architectures. Learning is envisaged in such an ultra-large-scale, heterogeneous environment, with potentially unreliable or limited communication. These and other applications have led to many recently proposed methods, which are broadly based on three major approaches:
In this work we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computations, along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD in a distributed setting, where the nodes perform computations on their local datasets. In our asynchronous model, the distributed nodes’ iterates evolve at the same rate, but update the gradients at arbitrary times; see Section 4 for more details. We analyze convergence of Qsparse-local-SGD in the distributed case, for smooth non-convex and smooth strongly convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We implement Qsparse-local-SGD for ResNet-50 on the ImageNet dataset, and for a softmax multi-class classifier on the MNIST dataset, and we achieve target accuracies with about a factor of 15-20 savings over the state-of-the-art [AHJ18, SCJ18, Sti19] in the total number of bits transmitted.
1.1 Related Work
The use of quantization for communication-efficient gradient methods has a rich, decades-long history [GMT73], and its recent use in training deep neural networks [SFD14, Str15] has reignited interest. Theoretically justified gradient compression using unbiased stochastic quantizers has been proposed and analyzed in [AGL17, WXY17, SYKM17]. Though the methods in [WWLZ18, WSL18] use the sparsity induced in the quantized gradients, explicitly sparsifying the gradients more aggressively by retaining only a few components, e.g., the largest ones, has been proposed [Str15, AH17, LHM18, AHJ18, SCJ18], combined with error compensation to ensure that all coordinates do eventually get updated as needed. [WHHZ18] analyzed error compensation for QSGD, without sparsification, while focusing on quadratic functions. Another approach for mitigating the communication bottleneck is infrequent communication, which has been popularly referred to in the literature as iterative parameter mixing, see [Cop15], and model averaging, see [Sti19, YYZ18, ZSMR16] and references therein. Our work is most closely related to, and builds upon, the recent theoretical results in [AHJ18, SCJ18, Sti19, YYZ18]. [SCJ18] considered the analysis for the centralized case (among other sparsifiers), and [AHJ18] analyzed a distributed version under the assumption that the aggregated gradients are close to those of the centralized case; see Assumption 1 in [AHJ18]. [Sti19, YYZ18] studied local SGD, where several local iterations are performed before sending the full gradients, and did not do any gradient compression beyond local iterations. Our work generalizes these works in several ways. We prove convergence of the distributed sparsification and error compensation algorithm, without the assumption of [AHJ18], by using perturbed iterate methods [MPP17, SCJ18]. We analyze non-convex as well as convex objectives for the distributed case with local computations.
[SCJ18] gave a proof of sparsified SGD for convex objective functions and for the centralized case, without local computations. (At the completion of our work, we found that, in parallel to our work, [KRSJ19] examined the use of signSGD quantization, without sparsification, for the centralized model. Another recent work, [KSJ19], studies the decentralized case with sparsification for strongly convex functions. Our work, developed independently of these works, uses quantization, sparsification, and local computations for the distributed case, for both non-convex and strongly convex objectives.) Our techniques compose a (stochastic or deterministic 1-bit sign) quantizer with sparsification and local computations using error compensation. While our focus has only been on mitigating the communication bottleneck in training high-dimensional models over bandwidth-limited networks, this technique works for any compression operator satisfying a regularity condition (see Definition 3), including our composed operators.
1.2 Contributions
We study a distributed set of worker nodes, each of which performs computations on locally stored data, denoted by . Consider the empirical-risk minimization of the loss function
where , where denotes expectation over a random sample chosen from the local dataset . Our setup can also handle different local functional forms, beyond dependence on the local dataset , which is not written explicitly for notational simplicity. For , we denote and . The distributed nodes perform computations and provide updates to the master node, which is responsible for aggregation and the model update. We develop Qsparse-local-SGD, a distributed SGD composing gradient quantization and explicit sparsification (e.g., taking the largest components), along with local iterations. We develop the algorithms and analysis for both synchronous and asynchronous operations, in which workers can communicate with the master at arbitrary time intervals. To the best of our knowledge, these are the first algorithms which combine quantization, aggressive sparsification, and local computations for distributed optimization. With some minor modifications, Qsparse-local-SGD can also be used in a peer-to-peer setting, where the aggregation is done without any help from the master node and each worker exchanges its updates with all other workers.
Our main theoretical results are the convergence analyses of Qsparse-local-SGD for both non-convex and convex objectives; see Theorem 1 and Theorem 3 for the synchronous case, as well as Theorem 4 and Theorem 6 for the asynchronous operation. Our analyses also demonstrate the natural gains in convergence that distributed, mini-batch operation affords, with convergence similar to that of equivalent vanilla SGD with local iterations (see Corollary 2 and Corollary 3), for both the non-convex case (with a fixed learning rate) and the convex case (with a diminishing learning rate). We also demonstrate that quantizing and sparsifying the gradient, even after local iterations, asymptotically yields an almost “free” efficiency gain (also observed non-asymptotically in the numerics of Section 5). The numerical results on the ImageNet dataset, implemented for a ResNet-50 architecture, and, for the convex case, on multi-class logistic classification on the MNIST [LBBH98] dataset, demonstrate that one can obtain significant communication savings while retaining state-of-the-art performance. The combination of quantization, sparsification, and local computations poses several challenges for theoretical analysis, including analyzing the impact of local iterations (block updates of parameters) on quantization and sparsification (see Lemmas 4-5 in Section 3), as well as asynchronous updates and their combination with distributed compression (see Lemmas 9-12 in Section 4).
1.3 Organization
In Section 2, we demonstrate that composing certain classes of quantizers with sparsifiers satisfies a regularity condition that is needed for several of the convergence proofs of our algorithms. We describe the synchronous implementation of Qsparse-local-SGD in Section 3, outline the main convergence results for it in Section 3.3, and briefly give the proof ideas in Section 3.4. We describe our asynchronous implementation of Qsparse-local-SGD and provide the theoretical convergence results in Section 4. The experimental results are given in Section 5. Many of the proof details are given in the appendices.
2 Communication Efficient Operators
Traditionally, distributed stochastic gradient descent sends full-precision (32- or 64-bit) unbiased gradient updates across workers, to peers or to a central server that helps with aggregation. However, the communication bottlenecks that arise in bandwidth-limited networks limit the applicability of such an algorithm at large scale, when the parameter size is massive or the data is distributed over a very large number of worker nodes. In such settings, one could think of updates which not only result in convergence, but also require less bandwidth, thus making the training process faster. In the following sections we discuss several useful operators from the literature and enhance their use by proposing a novel class of composed operators.
We first consider two different techniques used in the literature for mitigating the communication bottleneck in distributed optimization, namely, quantization and sparsification. In quantization, we reduce the precision of the gradient vector by mapping each of its components, by a deterministic [BWAA18, KRSJ19] or randomized [AGL17, WXY17, SYKM17, ZDJW13] map, to a finite number of quantization levels. In sparsification, we sparsify the gradient vector before using it to update the parameter vector, by taking its largest components (Top_k) or choosing k components uniformly at random (rand_k) [SCJ18, KSJ19].
2.1 Quantization
SGD computes an unbiased estimate of the gradient, which can be used to update the model iteratively and is extremely useful in large-scale applications. It is well known that the first-order terms in the rate of convergence are affected by the variance of the gradients. While stochastic quantization of gradients can result in a variance blow-up, it preserves the unbiasedness of the gradients at low precision; therefore, when training over bandwidth-limited networks, convergence is much faster; see [AGL17, WXY17, SYKM17, ZDJW13].
Definition 1 (Randomized quantizer).
We say that is a randomized quantizer with quantization levels, if the following holds for every : (i) ; (ii) , where could be a function of and . Here expectation is taken over the randomness of .
Examples of randomized quantizers include

Stochastic Rotated Quantization [SYKM17], which is a stochastic quantization, preprocessed by a random rotation, with .
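As an illustration of Definition 1, the following is a minimal sketch of a QSGD-style unbiased stochastic quantizer with s levels (our own simplified implementation, not the authors' code):

```python
import numpy as np

def stochastic_quantize(v, s, rng):
    """QSGD-style randomized quantizer [AGL17]: each |v_i|/||v|| is
    randomly rounded to one of s+1 levels so that E[Q(v)] = v."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    ratio = np.abs(v) / norm          # lies in [0, 1]
    lower = np.floor(ratio * s)       # nearest quantization level below
    # round up with probability equal to the distance to the lower
    # level, which makes the quantizer unbiased
    lower += rng.random(v.shape) < (ratio * s - lower)
    return norm * np.sign(v) * lower / s

rng = np.random.default_rng(1)
v = np.array([0.3, -0.7, 0.1])
avg = np.mean([stochastic_quantize(v, 4, rng) for _ in range(20000)], axis=0)
# avg is close to v, illustrating property (i) of Definition 1
```

Only the norm, the signs, and the integer levels need to be transmitted, which is where the communication savings come from.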
Instead of quantizing randomly into levels, we can take a deterministic approach and round each component of the vector to the nearest level. In particular, we can simply take the sign, which has shown promise in [BWAA18, KRSJ19].
Definition 2 (Deterministic Sign quantizer).
A deterministic sign quantizer is defined as follows: for every vector , the ’th component of , for , is defined as the sign of the ’th component of .
Such methods have drawn interest since Rprop [RB93], which used only the temporal behavior of the sign of the gradient. This is an example where the biased 1-bit quantizer of Definition 2 is used. This further inspired optimizers such as RMSprop [TH12] and Adam [KB15], which incorporate appropriate adaptive scaling with momentum acceleration and have demonstrated empirical superiority in non-convex applications.
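A minimal sketch of the 1-bit quantizer of Definition 2, in our own notation:

```python
import numpy as np

def sign_quantize(v):
    """Deterministic 1-bit sign quantizer: each coordinate of v is
    mapped to its sign, so a d-dimensional update needs only d bits
    instead of 32d or 64d at full precision."""
    return np.sign(v)

q = sign_quantize(np.array([0.3, -1.2, 0.0, 5.0]))
```

Unlike the stochastic quantizers above, this map is biased, which is why it is analyzed through the contraction property of Definition 3 rather than through unbiasedness.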
2.2 Sparsification
As mentioned earlier, we consider two important examples of sparsification operators: Top_k and rand_k. For any x, Top_k(x) is a length-d vector which has at most k non-zero components, whose indices correspond to the indices of the k largest components (in absolute value) of x. Similarly, rand_k(x) is a length-d (random) vector, obtained by selecting k components of x uniformly at random. Both of these satisfy a so-called “contraction” property as defined below, with contraction coefficient k/d [SCJ18].
Definition 3 (Contraction operator [SCJ18]).
A (randomized) function is called a contraction operator, if there exists a constant (that may depend on and ), such that for every , we have
(1) 
where expectation is taken over the randomness of the contraction operator .
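The two sparsifiers of Section 2.2 can be sketched as follows (a simple NumPy version in our own notation, with the contraction property (1) checked numerically for Top_k):

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]   # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out

def rand_k(v, k, rng):
    """Keep k coordinates of v chosen uniformly at random."""
    out = np.zeros_like(v)
    idx = rng.choice(len(v), size=k, replace=False)
    out[idx] = v[idx]
    return out

v = np.array([0.1, -3.0, 2.0, 0.5])
t = top_k(v, 2)                        # keeps -3.0 and 2.0
# contraction property with coefficient k/d = 1/2:
gap_sq = np.linalg.norm(v - t) ** 2    # <= (1 - 1/2) * ||v||^2
```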
Note that stochastic quantizers, as defined in Definition 1, also satisfy this regularity condition in Definition 3 for . Now we give a simple but important corollary, which allows us to apply different contraction operators to different coordinates of a vector. As an application, in the case of training neural networks, we can apply different operators to different layers.
Corollary 1 (Piecewise contraction).
Let for denote possibly different contraction operators with contraction coefficients . Let , where for all . Then is a contraction operator with the contraction coefficient being equal to .
Proof.
Fix an arbitrary and consider the following:
Inequality (a) follows because each is a contraction operator with the contraction coefficient . ∎
Corollary 1 allows us to apply different contraction operators to different coordinates of the updates, which can be chosen based upon their dimensionality and sparsity patterns.
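A sketch of Corollary 1 in code, applying Top-style sparsifiers with different k to two disjoint "layers" of a parameter vector (the names and the split are hypothetical, for illustration only):

```python
import numpy as np

def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def piecewise_compress(v, split, op_a, op_b):
    """Apply contraction op_a to v[:split] and op_b to v[split:]; by
    Corollary 1 the concatenation is again a contraction operator whose
    coefficient is the minimum of the two coefficients."""
    return np.concatenate([op_a(v[:split]), op_b(v[split:])])

v = np.array([1.0, -2.0, 0.5, 4.0, -0.1, 3.0])
# Top_1 on the first "layer" (3 coords), Top_2 on the second (3 coords):
c = piecewise_compress(v, 3, lambda u: top_k(u, 1), lambda u: top_k(u, 2))
```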
2.3 Composition of Quantization and Sparsification
Now we show that we can compose deterministic/randomized quantizers with sparsifiers, and the resulting operator is a contraction operator. First we compose a general stochastic quantizer with an explicit sparsifier such as Top_k or rand_k and show that the resulting operator is a “contraction” operator. A proof is provided in Appendix A.1.
Lemma 1 (Contraction of a composed operator).
Let . Let be a quantizer with parameter that satisfies Definition 1. Let be defined as for every . If are such that , then is a contraction operator with the contraction coefficient being equal to , i.e., for every , we have
where expectation is taken over the randomness of the contraction operator as well as of the quantizer .
For the different quantizers mentioned earlier, the conditions under which their composition with a sparsifier gives a contraction operator are:

QSGD: for , we get.

Stochastic klevel Quantization: for , we get .

Stochastic Rotated Quantization: for , we get .
Remark 1.
Observe that for a given stochastic quantizer that satisfies Definition 1, we have a prescribed operating regime. This results in an upper bound on the coarseness of the quantizer, which arises because quantization leads to a blow-up of the second moment; see condition (ii) of Definition 1. However, by employing Corollary 1, we show via an example that this can be alleviated to some extent.
Consider an operator as described in Lemma 1, where the quantizer in use is QSGD [AGL17, WXY17] and the sparsifier is Top_k [AHJ18, SCJ18]. Apply it to a vector in a piecewise manner, i.e., to smaller sub-vectors, as prescribed in Corollary 1. The coefficient of the variance bound of Definition 1 then applies to each piece separately, so the regularity condition in Definition 3 only needs to hold piecewise. Therefore, the piecewise contraction operator allows a coarser quantizer than when the operator is applied to the entire vector at once, thus providing a small gain in communication efficiency. For example, consider the composed operator being applied on a per-layer basis to a deep neural network. We can now afford a much coarser quantizer than when the operator is applied to all the parameters at once.
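A sketch of the composed operator of Lemma 1, sparsifying first and then quantizing the survivors (using our simplified QSGD-style quantizer from above, not the authors' implementation):

```python
import numpy as np

def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def stochastic_quantize(v, s, rng):
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    ratio = np.abs(v) / norm
    lower = np.floor(ratio * s)
    lower += rng.random(v.shape) < (ratio * s - lower)
    return norm * np.sign(v) * lower / s

def q_top_k(v, k, s, rng):
    """Composed operator of Lemma 1: sparsify, then quantize. The output
    is k-sparse and each surviving coordinate is low precision."""
    return stochastic_quantize(top_k(v, k), s, rng)

c = q_top_k(np.array([1.0, -2.0, 0.5, 4.0]), 2, 4, np.random.default_rng(0))
```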
As discussed above, stochastic quantization results in a variance blow-up, which limits our regime of operation when we combine it with sparsification. However, it turns out that we can remove this restriction on the regime of operation by scaling the vector properly. We summarize the result in the following lemma, which is proved in Appendix A.2.
Lemma 2 (Composing sparsification with stochastic quantization).
Let . Let be a stochastic quantizer with parameter that satisfies Definition 1. Let be defined as for every . Then is a contraction operator with the contraction coefficient being equal to , i.e., for every
Remark 2.
Note that, unlike the composed operator of Lemma 1, the scaled version is always a contraction operator, for all values of . Furthermore, observe that even in the operating regime required in Lemma 1, the scaled composed operator of Lemma 2 gives a better contraction than what we get from the unscaled composed operator of Lemma 1. So, scaling a composed operator properly is always a better choice for contraction.
We can also compose a deterministic 1-bit quantizer with a sparsifier. For that we need some notation first. For and a given vector , let denote the set of indices chosen for defining the sparsifier. For example, for Top_k, it denotes the set of indices corresponding to the k largest components (in absolute value) of ; for rand_k, it denotes a random set of k indices in . The composition of the sign quantizer with the sparsifier is denoted by , and for , the ’th component of is defined as
In the following lemma we show that is a contraction operator; a proof of which is provided in Appendix A.3.
Lemma 3 (Composing sparsification with deterministic quantization).
For , the operator
for any is a contraction operator with the contraction coefficient being equal to
Remark 3.
Observe that for , depending on the value of , either of the terms inside the max can be bigger than the other term. For example, if , then , which implies that the second term inside the max is equal to , which is much smaller than the first term. On the other hand, if and the vector is dense, then the second term may be much bigger than the first term.
3 Distributed Synchronous Operation
Let with denote the set of indices at which worker synchronizes with the master. In the synchronous setting, this set is the same for all the workers. Let for any . Every worker maintains a local parameter vector , which is updated in each iteration . If , every worker sends to the master node the compressed and error-compensated update computed on the net progress made since the last synchronization, and updates its local memory . Upon receiving these updates, the master aggregates them, updates the global parameter vector, and sends the new model to all the workers; upon receiving it, they set their local parameter vectors equal to the global parameter vector . Our algorithm is summarized in Algorithm 1.
3.1 Assumptions
All results in this paper use the following two standard assumptions.

Smoothness: The local function at each worker is smooth, i.e., for every , we have .

Bounded second moment: For every and for some constant , we have . This is a standard assumption in [SSS07, NJLS09, RRWN11, HK14, RSS12, SCJ18, Sti19, YYZ18, KSJ19, AHJ18]. Relaxing the uniform boundedness of the gradient to allow arbitrarily different gradients of local functions in heterogeneous settings, as done for SGD in [NNvD18, WJ18], is left as future work. This assumption also imposes a bound on the variance: , where for every .
In this section we present our main convergence results with synchronous updates, obtained by running Algorithm 1 for smooth functions, both nonconvex and strongly convex. To state our results, we need the following definition from [Sti19].
Definition 4 (Gap [Sti19]).
Let , where for . The gap of is defined as , which is equal to the maximum difference between any two consecutive synchronization indices.
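Definition 4 in code (a trivial helper, in our own notation):

```python
def gap(sync_indices):
    """Maximum difference between two consecutive synchronization
    indices in the sorted synchronization set."""
    return max(b - a for a, b in zip(sync_indices, sync_indices[1:]))

# workers synchronize at iterations 0, 5, 10 and 12:
g = gap([0, 5, 10, 12])   # 5
```

The gap controls how much local progress can accumulate, and hence how large the compression error in local memory can grow, between two synchronizations.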
3.2 Error Compensation
Sparsified gradient methods, in which workers send the largest coordinates of the updates based on their magnitudes, have been investigated in the literature and serve as a communication-efficient strategy for distributed training of learning models. However, their convergence rates are subpar compared to distributed vanilla SGD. Together with some form of error compensation, these methods have been empirically observed to converge as fast as vanilla SGD [Str15, AH17, LHM18, AHJ18, SCJ18]. In [AHJ18, SCJ18], sparsified SGD with such feedback schemes has been carefully analyzed. Under analytic assumptions, [AHJ18] proves the convergence of distributed SGD with error feedback. The net error in the system is accumulated by each worker locally on a per-iteration basis, and this is used as feedback when generating future updates. [SCJ18] did the analysis for centralized SGD for strongly convex objectives.
In Algorithm 1, the error introduced in every iteration is accumulated into the memory of each worker, and is compensated for in future rounds of communication. This feedback is the key to recovering convergence rates matching vanilla SGD. The operators employed provide a controlled way of using both the current update and the compression errors from previous rounds of communication. Under the assumption of uniform boundedness of the gradient, we analyze the controlled evolution of the memory through the optimization process; the results are summarized in Lemma 4 and Lemma 5 below.
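The memory update just described can be sketched as follows (a minimal single-worker sketch with Top_k as the compression operator; the names are ours, not the paper's):

```python
import numpy as np

def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def compress_with_feedback(update, memory, k):
    """One communication round with error compensation: compress the
    error-corrected update, transmit the compressed part, and keep the
    residual in local memory for future rounds."""
    corrected = memory + update
    sent = top_k(corrected, k)
    return sent, corrected - sent      # (transmitted update, new memory)

mem = np.zeros(4)
u = np.array([0.1, -3.0, 2.0, 0.5])
s1, mem = compress_with_feedback(u, mem, 1)             # sends -3.0
s2, mem = compress_with_feedback(np.zeros(4), mem, 1)   # sends 2.0 next
```

This illustrates why all coordinates are eventually updated: what is dropped in one round stays in memory and is retransmitted later, so nothing is lost, only delayed.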
3.2.1 Decaying Learning Rate
Here we show that if we run Algorithm 1 with a decaying learning rate , then the local memory at each worker contracts and goes to zero as .
Lemma 4 (Memory contraction).
Let and , where is a constant and . Then there exists a constant , such that the following holds for every worker and for every :
(2) 
We prove Lemma 4 in Appendix B.1. Note that for decaying , the memory decays as . This implies that the net error in the algorithm from the compression of updates in each round of communication is compensated for in the end.
3.2.2 Fixed Learning Rate
In the following lemma, which is proved in Appendix B.2, we show that if we run Algorithm 1 with a fixed learning rate , then the local memory at each worker is bounded. It can be verified that the proof of Lemma 4 also holds for a fixed learning rate, and we can trivially bound the memory in this case by simply substituting into (2). However, we can get a better bound (saving a factor of , which is bigger than 4) by working directly with a fixed learning rate.
Lemma 5 (Bounded memory).
Let . Then the following holds for every worker and for every :
(3) 
Note that, for fixed , the memory is upper bounded by a constant . Observe that, since the memory accumulates the past errors due to compression and local computation, in order to asymptotically reduce the memory to zero, the learning rate would have to be reduced periodically throughout the training process.
3.3 Main Results
We leverage perturbed iterate analysis, as in [MPP17, SCJ18], to provide convergence guarantees for Qsparse-local-SGD. Under the assumptions of Section 3.1, the following theorems hold when Algorithm 1 is run with any contraction operator (including our composed operators).
Theorem 1 (Smooth (non-convex) case with fixed learning rate).
Let be smooth for every . Let be a contraction operator whose contraction coefficient is equal to . Let be generated according to Algorithm 1 with , for step sizes (where is a constant such that ) and . Then we have
Here is a random variable which samples a previous parameter with probability .
Corollary 2.
Let , where is a constant (even classical SGD requires knowing an upper bound on this quantity in order to choose the learning rate; smoothness translates it to the difference of the function values), and , we have
In order to ensure that the compression does not affect the dominating terms while converging at a rate of , we would require .
Theorem 1 is proved in Appendix B.6 and provides non-asymptotic guarantees, where we observe that compression does not affect the first-order term. Here we are required to decide the horizon before running the algorithm. Therefore, in order to converge to a fixed point, the learning rate needs to follow a piecewise schedule (i.e., the learning rate is reduced periodically throughout the training process), which is also the case in our numerics in Section 5.1. The corresponding asymptotic result (with a decaying learning rate) is given below.
Theorem 2 (Smooth (non-convex) case with decaying learning rate).
Let be smooth for every . Let be a contraction operator whose contraction coefficient is equal to . Let be generated according to Algorithm 1 with , for step sizes and , where is such that, we have and . Then the following holds.
Here (i) ; (ii) , which is lower bounded as ; and (iii) is a random variable which samples a previous parameter with probability .
Note that Theorem 2 gives a convergence rate of . We prove it in Appendix B.7.
Theorem 3 (Smooth and strongly convex case with a decaying learning rate).
Let be smooth and strongly convex. Let be a contraction operator whose contraction coefficient is equal to . Let be generated according to Algorithm 1 with , for step sizes with , where is such that we have , . Then the following holds
Here (i) , , where ; (ii) , where ; and (iii) .
Corollary 3.
For , , and using from Lemma 2 in [RSS12], we have
In order to ensure that the compression does not affect the dominating terms while converging at a rate of , we would require .
Theorem 3 is proved in Appendix B.8. With no compression and only local computations, i.e., for , and under the same assumptions, we recover or generalize a few recent results from the literature with similar convergence rates:

We recover [YYZ18, Theorem 1], which does local SGD for the non-convex case;

We generalize [Sti19, Theorem 2.2], which does local SGD for a strongly convex case and requires the unbiasedness assumption on gradients (unbiasedness of the gradients at every worker can be ensured by assuming that each worker samples data points from the entire dataset), to the distributed case.
We emphasize that, unlike [YYZ18, Sti19], which only consider local computation, we combine quantization and sparsification with local computation, which poses several technical challenges; e.g., see the proofs of Lemmas 4, 5, and 6.
3.4 Proof Outlines
In order to prove our results, we define virtual sequences for every worker and for all as follows:
(4) 
Here can be taken to be decaying or fixed, depending on the result that we are proving. Let be the set of random sampling of the minibatches at each worker . We define

, ;

, .
3.4.1 Proof Outline of Theorem 1
Proof.
Since is smooth, we have from (4) (with fixed learning rate ) that
(5) 
With some algebraic manipulations provided in Appendix B.6, for , we arrive at
(6) 
Under Assumption 2, stated in Section 3.1, we have
(7) 
To bound on the RHS of (6), we first show below that , i.e., that the difference between the true and the virtual sequence is equal to the average memory; then we can use the bound on the local memory terms from Lemma 5.
Lemma 6 (Memory).
Let