cpSGD: Communication-efficient and differentially-private distributed SGD
Abstract
Distributed stochastic gradient descent is an important subroutine in distributed learning. A setting of particular interest is when the clients are mobile devices, where two important concerns are communication efficiency and the privacy of the clients. Several recent works have focused on reducing the communication cost or introducing privacy guarantees, but none of the proposed communication efficient methods are known to be privacy preserving and none of the known privacy mechanisms are known to be communication efficient. To this end, we study algorithms that achieve both communication efficiency and differential privacy. For $d$ variables and $n \approx d$ clients, the proposed method uses $O(\log \log(nd))$ bits of communication per client per coordinate and ensures constant privacy.
We also extend and improve previous analysis of the Binomial mechanism showing that it achieves nearly the same utility as the Gaussian mechanism, while requiring fewer representation bits, which can be of independent interest.
1 Introduction
1.1 Background
Distributed stochastic gradient descent (SGD) is a basic building block of modern machine learning [26, 11, 9, 29, 1, 28, 5]. In the typical scenario of synchronous distributed learning, in every round, each client obtains a copy of a global model which it updates based on its local data. The updates (usually in the form of gradients) are sent to a parameter server, where they are averaged and used to update the global model. Alternatively, without a central server, each client saves a global model, broadcasts the gradient to all other clients, and updates its model with the aggregated gradient.
Often, the communication cost of sending the gradient becomes the bottleneck [31, 23, 22]. To address this issue, several recent works have focused on reducing the communication cost of distributed learning algorithms via gradient quantization and sparsification [33, 17, 34, 20, 21, 4, 35]. These algorithms have been shown to improve communication cost and hence communication time in distributed learning. This is especially effective in the federated learning setting where clients are mobile devices with expensive uplink communication cost [27, 20].
While communication is a key concern in client based distributed machine learning, an equally important consideration is that of protecting the privacy of participating clients and their sensitive information. Providing rigorous privacy guarantees for machine learning applications has been an area of active recent interest [6, 36, 32]. Differentially private gradient descent algorithms in particular were studied in the work of [2]. A direct application of these mechanisms in distributed settings leads to algorithms with high communication costs. The key focus of our paper is to analyze mechanisms that achieve rigorous privacy guarantees as well as have communication efficiency.
1.2 Communication efficiency
We first describe synchronous distributed SGD formally. Let $F(w): \mathbb{R}^d \to \mathbb{R}$ be of the form
$$F(w) = \frac{1}{M} \sum_{i=1}^{M} f_i(w),$$
where each $f_i$ resides at the $i$-th client. For example, the $w$'s are weights of a neural network and $f_i(w)$ is the loss of the network on data located on client $i$. Let $w^0$ be the initial value. At round $t$, the server transmits $w^t$ to all the clients and asks a random set of $n$ (batch size / lot size) clients to transmit their local gradient estimates $g_i^t(w^t)$. Let $S$ be the subset of clients.
The server updates as follows
$$g^t(w^t) = \frac{1}{n} \sum_{i \in S} g_i^t(w^t), \qquad w^{t+1} \triangleq w^t - \gamma g^t(w^t)$$
for some suitable choice of $\gamma$. Other optimization algorithms such as momentum, Adagrad, or Adam can also be used instead of the SGD step above.
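The round structure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `sgd_round`, the toy quadratic losses, and the step size are hypothetical choices made here, not part of the protocol specification.

```python
import numpy as np

def sgd_round(w, client_grad_fns, n, gamma, rng):
    """One synchronous round: sample n clients, average their gradients, take a step."""
    num_clients = len(client_grad_fns)
    subset = rng.choice(num_clients, size=n, replace=False)   # random subset of clients
    grads = [client_grad_fns[i](w) for i in subset]           # local gradient estimates
    g = np.mean(grads, axis=0)                                # server-side average
    return w - gamma * g                                      # plain SGD step

# Toy problem (an assumption for illustration): f_i(w) = 0.5 * ||w - c_i||^2,
# so the local gradient is w - c_i and the global minimizer is the mean of the c_i.
rng = np.random.default_rng(0)
centers = rng.normal(size=(100, 5))
client_grad_fns = [lambda w, c=c: w - c for c in centers]

w = np.zeros(5)
for _ in range(200):
    w = sgd_round(w, client_grad_fns, n=10, gamma=0.1, rng=rng)
```

In this toy setting `w` ends up close to `centers.mean(axis=0)`, up to the noise caused by sampling only 10 of the 100 clients per round.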
Naively for the above protocol, each of the $n$ clients needs to transmit $d$ reals, typically using $O(d \cdot \log(1/\eta))$ bits. (Footnote 2: $\eta$ is the per-coordinate quantization accuracy. To represent a $d$-dimensional vector $X$ to a constant accuracy in Euclidean distance, each coordinate is usually quantized to an accuracy of $\eta = 1/\sqrt{d}$.)
This communication cost can be prohibitive, e.g., for a medium size PennTreeBank language model [38], the number of parameters $d > 10$ million and hence total cost is $\sim 38$MB (assuming 32 bit float), which is too large to be sent from a mobile phone to the server at every round.
Motivated by the need for communication-efficient protocols, various quantization algorithms have been proposed to reduce the communication cost [34, 20, 21, 37, 24, 35, 5]. In these protocols, the clients quantize the gradient by a function $q$ and send an efficient representation of $q(g_i^t(w^t))$ instead of the actual local gradient $g_i^t(w^t)$. The server computes the gradient as
$$\hat{g}^t(w^t) = \frac{1}{n} \sum_{i \in S} q(g_i^t(w^t)),$$
and updates $w^t$ as before. Specifically, [34] proposes a quantization algorithm which reduces the requirement of full (or floating point) arithmetic precision to a bit or few bits per value on average. There are many subsequent works, e.g., see [21], and in particular [5] showed that stochastic quantization and Elias coding [15] can be used to obtain communication-optimal SGD for convex functions. If the expected communication cost at every round $t$ is bounded by $c$, then the total communication cost of the modified gradient descent over $T$ rounds is at most
$$T \cdot c. \qquad (1)$$
All the previous papers relate the error in gradient compression to SGD convergence. We first state one such result for completeness for nonconvex functions and prove it in Appendix A. Similar (and stronger) results can be obtained for (strongly) convex functions using results in [16] and [30].
Corollary 1 ([16]).
Let be smooth and . Let satisfy . Let be a quantization scheme, and then after rounds
(2) 
and . The expectation in the above equations is over the randomness in gradients and quantization.
The above result relates the convergence of distributed SGD for non-convex functions to the worst-case mean square error (MSE) and bias in gradient mean estimates in Equation (2). Thus, the smaller the mean square error in gradient estimation, the better the convergence. Hence, we focus on the problem of distributed mean estimation (DME), where the goal is to estimate the mean of a set of vectors.
1.3 Differential privacy
While the above schemes reduce the communication cost, it is unclear what (if any) privacy guarantees they offer. We study privacy from the lens of differential privacy (DP). The notion of differential privacy [13] provides a strong notion of individual privacy while permitting useful data analysis in machine learning tasks. We refer the reader to [14] for a survey. Informally, for the output to be differentially private, the estimated model should be indistinguishable whether a particular client’s data was taken into consideration or not. We define this formally in Section 2.
In the context of client based distributed learning, we are interested in the privacy of the gradients aggregated from clients; differential privacy for the average gradients implies privacy for the resulting model since DP is preserved by post-processing.
The standard approach is to let the server add the noise to the averaged gradients (e.g., see [14, 2] and references within). However, the above only works under a restrictive assumption that the clients can trust the server. Our goal is to also minimize the need for clients to trust the central aggregator, and hence we propose the following model:
Clients add their share of the noise to their gradients before transmission. Aggregation of gradients at the server results in an estimate with noise equal to the sum of the noise added at each client.
This approach improves over server-controlled noise addition in several scenarios:
Clients do not trust the server: Even in the scenario when the server is not trustworthy, the above scheme can be implemented via cryptographically secure aggregation schemes [7], which ensure that the only information about the individual users the server learns is what can be inferred from the sum. Hence, differential privacy of the aggregate now ensures that the parameter server does not learn any individual user information. This will encourage clients to participate in the protocol even if they do not fully trust the server. We note that while secure aggregation schemes add to the communication cost (e.g., [7] adds $\log_2(k \cdot n)$ for $k$ levels of quantization), our proposed communication benefits still hold. For example, if $n = 1024$, a 4-bit quantization protocol would reduce communication cost by 67% compared to the 32 bit representation.
Server is negligent, but not malicious: the server may "forget" to add noise, but is not malicious and not interested in learning characteristics of individual users. However, if the server releases the learned model to the public, it needs to be differentially private.
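The 67% figure quoted above can be reproduced with simple arithmetic under one plausible reading. The sketch below assumes (our assumption, not a statement from [7]) that securely summing $n$ values costs an extra $\log_2 n$ bits per coordinate to avoid overflow, that $n = 1024$ clients participate, and that the 32-bit baseline pays the same overhead.

```python
import math

def bits_with_secure_aggregation(value_bits, n_clients):
    # Assumed overhead model: summing n values of `value_bits` bits each
    # without overflow needs value_bits + log2(n) bits per coordinate.
    return value_bits + math.log2(n_clients)

n = 1024                                            # assumed number of clients
quantized = bits_with_secure_aggregation(4, n)      # 4 + 10 = 14 bits
full = bits_with_secure_aggregation(32, n)          # 32 + 10 = 42 bits
saving = 1 - quantized / full                       # 1 - 14/42 = 2/3
print(f"{quantized:.0f} vs {full:.0f} bits per coordinate, saving {saving:.0%}")
```

Under these assumptions the quantized protocol uses 14 bits per coordinate instead of 42, i.e. a reduction of about 67%.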
A natural way to extend the results of [14, 2] is to let individual users add Gaussian noise to their gradients before transmission. Since the sum of Gaussians is Gaussian itself, differential privacy results follow. However, the transmitted values now are real numbers and the benefits of gradient compression are lost. Further, secure aggregation protocols [7] require discrete inputs. To resolve these issues, we propose that the clients add noise drawn from an appropriately parameterized Binomial distribution. We refer to this as the Binomial mechanism. Since Binomial random variables are discrete, they can be transmitted efficiently. Furthermore, the choice of the Binomial is convenient in the distributed setting because the sum of Binomials is also binomially distributed, i.e., if $Z_1 \sim \text{Bin}(N_1, p)$ and $Z_2 \sim \text{Bin}(N_2, p)$ are independent, then
$$Z_1 + Z_2 \sim \text{Bin}(N_1 + N_2, p).$$
Hence the total noise post aggregation can be analyzed easily, which is convenient for the distributed setting. (Footnote 3: Another choice is the Poisson distribution. Unlike the Poisson, the Binomial distribution has bounded support and has an easily analyzable communication complexity which is always bounded.) The Binomial mechanism can be of independent interest in other applications with discrete output as well. Furthermore, unlike the Gaussian, it avoids floating point representation issues.
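The closure of the Binomial family under summation can be checked empirically. The following is a small simulation sketch; the parameter values (`n_clients`, `m`, `p`, `trials`) are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, m, p, trials = 50, 20, 0.5, 50_000

# Each client draws its own Bin(m, p) noise share; the server only sees the sum.
shares = rng.binomial(m, p, size=(trials, n_clients))
total = shares.sum(axis=1)

# Direct draws from the claimed aggregate distribution Bin(n_clients * m, p).
direct = rng.binomial(n_clients * m, p, size=trials)

print(total.mean(), direct.mean())   # both close to n_clients * m * p = 500
print(total.var(), direct.var())     # both close to n_clients * m * p * (1 - p) = 250
```

Note also that the aggregate noise never exceeds `n_clients * m`, which is the bounded-support property that makes the communication cost easy to bound.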
1.4 Summary of our results
Binomial mechanism: We first study the Binomial mechanism as a generic mechanism to release discrete valued data. Previous analysis of the Binomial mechanism (where you add noise $\text{Bin}(N, p)$) was due to [12], who analyzed the $1$-dimensional case for $p = 1/2$ and showed that to achieve $(\varepsilon, \delta)$ differential privacy, $N$ needs to be $\geq 64 \log(2/\delta)/\varepsilon^2$. We improve the analysis in the following ways:

$d$-dimensions. We extend the analysis of the $1$-dimensional Binomial mechanism to $d$ dimensions. Unlike the Gaussian distribution, the Binomial is not rotation invariant, making the analysis more involved. The key fact utilized in this analysis is that the Binomial distribution is locally rotation-invariant around the mean.

Improvement. We improve the previous result and show that $N \geq 8 \log(2/\delta)/\varepsilon^2$ suffices for small $\varepsilon$, implying that the Binomial and Gaussian mechanisms perform identically as $\varepsilon \to 0$. We note that while this is a constant improvement, it is crucial in making differential privacy practical.
Differentiallyprivate distributed mean estimation (DME): A direct application of Gaussian mechanism requires reals and hence bits of communication. This can be prohibitive in practice. However, a direct application of quantization [34] and Binomial mechanism has high communication cost. We show that random rotation together with the notion of high probability sensitivity can significantly improve communication.
In particular, for , we provide an algorithm achieving equal privacy and error as that of the Gaussian mechanism with communication
per round of distributed SGD. Hence when , the number of bits is .
The rest of the paper is organized as follows. In Section 2, we review the notion of differential privacy and state our results for the Binomial mechanism. Motivated by the fact that the convergence of SGD can be reduced to the error in gradient estimate computation per round, we formally describe the problem of DME in Section 4 and state our results in Section 5.
In Section 5.2, we provide and analyze the implementation of the binomial mechanism in conjunction with quantization in the context of DME. The main idea is for each client to add noise drawn from an appropriately parameterized Binomial distribution to each quantized value before sending to the server. The server further subtracts the bias introduced by the noise to achieve an unbiased mean estimator. We further show in Section 5.3 that the rotation procedure proposed in [34] which reduces the MSE is helpful in reducing the additional error due to differential privacy.
2 Differential privacy
2.1 Notation
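We start by defining the notion of differential privacy. Formally, given a set of data sets $\mathcal{D}$ provided with a notion of neighboring data sets $\mathcal{N}_{\mathcal{D}} \subset \mathcal{D} \times \mathcal{D}$ and a query function $f: \mathcal{D} \to \mathcal{X}$, a mechanism $\mathcal{M}: \mathcal{X} \to \mathcal{O}$ to release the answer of the query is defined to be $(\varepsilon, \delta)$ differentially private if for any measurable subset $S \subseteq \mathcal{O}$ and two neighboring data sets $(D_1, D_2) \in \mathcal{N}_{\mathcal{D}}$,
$$\Pr\left(\mathcal{M}(f(D_1)) \in S\right) \leq e^{\varepsilon} \Pr\left(\mathcal{M}(f(D_2)) \in S\right) + \delta. \qquad (3)$$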
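Unless otherwise stated, for the rest of the paper, we will assume the output spaces $\mathcal{X}, \mathcal{O} \subseteq \mathbb{R}^d$. We consider the mean square error as a metric to measure the error of the mechanism $\mathcal{M}$. Formally,
$$\mathcal{E}(\mathcal{M}) \triangleq \max_{D \in \mathcal{D}} \mathbb{E}\left[\|\mathcal{M}(f(D)) - f(D)\|_2^2\right].$$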
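A key quantity in characterizing differential privacy for many mechanisms is the sensitivity of a query $f: \mathcal{D} \to \mathbb{R}^d$ in a given norm $\ell_q$. Formally this is defined as
$$\Delta_q \triangleq \max_{(D_1, D_2) \in \mathcal{N}_{\mathcal{D}}} \|f(D_1) - f(D_2)\|_q. \qquad (4)$$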
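The canonical mechanism to achieve $(\varepsilon, \delta)$ differential privacy is the Gaussian mechanism $\mathcal{M}_g^{\sigma}$ [14]:
$$\mathcal{M}_g^{\sigma}(f(D)) \triangleq f(D) + Z,$$
where $Z \sim \mathcal{N}(0, \sigma^2 I_d)$. We now state the well-known privacy guarantee of the Gaussian mechanism.
Lemma 1 ([14]).
For any $\delta$, $\ell_2$ sensitivity bound $\Delta_2$, and $\sigma$ such that $\sigma \geq \Delta_2 \sqrt{2 \log(1.25/\delta)}$,
$\mathcal{M}_g^{\sigma}$ is $\left(\frac{\Delta_2}{\sigma} \sqrt{2 \log(1.25/\delta)}, \delta\right)$ differentially private (Footnote 4: All logs are to base $e$ unless otherwise stated.) and the error is bounded by $d \cdot \sigma^2$.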
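As an illustration of how the Gaussian mechanism is typically calibrated, the sketch below uses the classical bound from [14], $\sigma = \Delta_2 \sqrt{2 \log(1.25/\delta)}/\varepsilon$ for $\varepsilon \leq 1$. The function names and the example values are hypothetical and chosen only for this example.

```python
import numpy as np

def gaussian_sigma(delta2, eps, delta):
    """Noise scale from the classical Gaussian-mechanism bound (stated for eps <= 1)."""
    if not 0 < eps <= 1:
        raise ValueError("the classical bound assumes 0 < eps <= 1")
    return delta2 * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

def gaussian_mechanism(query_value, sigma, rng):
    """Release query_value + Z with Z ~ N(0, sigma^2 I)."""
    return query_value + rng.normal(0.0, sigma, size=np.shape(query_value))

rng = np.random.default_rng(0)
x = np.array([0.2, -0.1, 0.4])                        # hypothetical query output
sigma = gaussian_sigma(delta2=1.0, eps=0.5, delta=1e-5)
released = gaussian_mechanism(x, sigma, rng)
expected_error = x.size * sigma ** 2                  # d * sigma^2, the MSE of the release
```

The released vector is real valued, which is exactly what makes this mechanism expensive to communicate and incompatible with secure aggregation over discrete inputs.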
2.2 High probability sensitivity
In this section we introduce a notion of high probability sensitivity which allows us to work with randomized queries which do not have a worst case bound on sensitivity but have bounded sensitivity with high probability. Let represent a set of natural numbers and represent a subset of real numbers. For two random vectors , the event is defined as
Definition 1 ( sensitivity).
Given a set of integers and values , we call a randomized function , sensitive, if for any two neighboring data sets , there exist coupled random variables such that the marginal distributions of are identical to that of and and
(5) 
We show the following result for high-probability sensitivity and the proof is provided in Appendix C.
Lemma 2.
Let be an differentially private mechanism for sensitivity and let be a sensitive function. Then the composed mechanism is differentially private.
3 Binomial Mechanism
We now define the Binomial mechanism for the case when the output space $\mathcal{X}$ of the query $f$ is $\mathbb{Z}^d$. The Binomial mechanism is parameterized by three quantities $N, p, s$ where $N \in \mathbb{N}$, $p \in (0, 1)$, and quantization scale $s = 1/j$ for some $j \in \mathbb{N}$ and is given by
$$\mathcal{M}_b^{N,p,s}(f(D)) \triangleq f(D) + (Z - Np) \cdot s, \qquad (6)$$
where for each coordinate $i$, $Z_i \sim \text{Bin}(N, p)$ and independent. The one dimensional Binomial mechanism was introduced by [12] for the case when $p = 1/2$. We analyze the mechanism for the general $d$-dimensional case and for any $p$. This analysis is involved as the Binomial mechanism is not rotation invariant. By carefully exploiting the local rotation invariant structure near the mean, we show that:
Theorem 1.
The proof is given in Appendix B. We make some remarks regarding the design and the guarantee for the Binomial Mechanism. Note that the privacy guarantee for the Binomial mechanism depends on all three sensitivity parameters $\Delta_1, \Delta_2, \Delta_\infty$ as opposed to the Gaussian mechanism which only depends on $\Delta_2$. The $\Delta_1$ and $\Delta_\infty$ terms can be seen as the added complexity due to discretization.
Secondly, setting $s = 1$ (i.e. providing no scale to the noise) in the expression (7), it can be readily seen that the terms involving and scale differently with respect to the variance of the noise. This motivates the use of the accompanying quantization scale $s$ in the mechanism. Indeed, it is possible that the resolution of the integer that is provided by the Binomial noise could potentially be too large for the problem, leading to worse guarantees. In this setting, the quantization parameter $s$ helps normalize the noise correctly. Further, it can be seen that as long as the variance of the random variable $s \cdot Z$ is fixed, increasing $N$ and decreasing $s$ makes the Binomial mechanism closer to the Gaussian mechanism. Formally, if we let and , then using the Cauchy-Schwarz inequality, the guarantee (7) can be rewritten as
The variance of the Binomial distribution is $Np(1-p)$ and the leading term in matches exactly the term in Gaussian mechanism. Furthermore, if is , then this mechanism is very similar to the Gaussian mechanism. This result agrees with the Berry-Esseen type Central limit theorems for the convergence of one dimensional Binomial distribution to the Gaussian distribution.
In Figure 2, we plot the error vs $\varepsilon$ for the Gaussian and Binomial mechanisms. Observe that as the scale is reduced, the error vs privacy tradeoff for the Binomial mechanism approaches that of the Gaussian mechanism.
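The role of the quantization scale can be illustrated with a short simulation. The sketch assumes the centered, scaled form of the mechanism, i.e. the release is $f(D) + s \cdot (Z - Np)$ with $Z_i \sim \text{Bin}(N, p)$; the function name and the three $(N, s)$ pairs are hypothetical choices that keep the noise variance $s^2 N p (1-p)$ fixed at 4.

```python
import numpy as np

def binomial_mechanism(query_value, N, p, s, rng):
    """Release f(D) + s * (Z - N p), with Z_i ~ Bin(N, p) independently per coordinate."""
    z = rng.binomial(N, p, size=np.shape(query_value))
    return query_value + s * (z - N * p)

rng = np.random.default_rng(0)
f_D = np.zeros(100_000)                  # hypothetical integer-valued query output
variances = []
for N, s in [(16, 1.0), (64, 0.5), (1024, 0.125)]:
    out = binomial_mechanism(f_D, N, 0.5, s, rng)
    variances.append(out.var())          # s^2 * N * p * (1 - p) = 4 in every case
print(variances)
```

All three settings inject noise of the same variance, but the finer scale with larger $N$ spreads it over many more distinct values, which is the regime in which the Binomial noise behaves most like Gaussian noise.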
4 Distributed mean estimation (DME)
We have related the SGD convergence rate to the MSE in approximating the gradient at each step in Corollary 1. Eq. (1) relates the communication cost of SGD to the communication cost of estimating gradient means. The advanced composition theorem (Thm. 3.5 [19]) or moments accounting [2] can be used to relate the privacy guarantee of SGD to that of the gradient mean estimate at each instance $t$. We also note that since SGD often samples the clients, standard privacy amplification results via sampling [2] can be used to get tighter bounds in this case.
Therefore, akin to [34], in the rest of the paper we just focus on the MSE and privacy guarantees of DME. The results for synchronous distributed GD follow from Corollary 1 (convergence), advanced composition theorem (privacy), and Eq. (1) (communication).
Formally, the problem of DME is defined as follows: given $n$ vectors $X^n \triangleq X_1, \ldots, X_n \in \mathbb{R}^d$ where $X_i$ is on client $i$, we wish to compute the mean
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
at a central server. For gradient descent at each round $t$, $X_i$ is set to $g_i^t$. DME is a fundamental building block for many distributed learning algorithms including distributed PCA/clustering [25].
While analyzing private DME we assume that each vector $X_i$ has bounded $\ell_2$ norm, i.e. $\|X_i\|_2 \leq D$. The reason to make such an assumption is to be able to define and analyze the privacy guarantees and is often enforced in practice by employing gradient clipping at each client. We note that this assumption appears in previous works on gradient descent and differentially private gradient descent (e.g. [2]). Since our results also hold for all gradients without any statistical assumptions, we get desired convergence results and privacy results for SGD.
4.1 Communication protocol
Our proposed communication algorithms are simultaneous and independent, i.e., the clients independently send data to the server at the same time. We allow the use of both private and public randomness. Private randomness refers to random values generated by each client separately, and public randomness refers to a sequence of random values that are shared among all parties. (Footnote 5: Public randomness can be emulated by the server communicating a random seed.)
We are given $n$ vectors $X^n$ where $X_i$ resides on client $i$. In any independent communication protocol, each client transmits a function of $X_i$ (say $q(X_i)$), and a central server estimates the mean by some function of $q(X_1), \ldots, q(X_n)$. Let $\pi$ be any such protocol and let $\mathcal{C}_i(\pi, X_i)$ be the expected number of transmitted bits by the $i$-th client during protocol $\pi$, where throughout the paper, expectation is over the randomness in protocol $\pi$.
4.2 Differential privacy
To state the privacy results for DME, we define the notion of data sets and neighbors as follows. A dataset is a collection of vectors . The notion of neighboring data sets typically corresponds to those differing only on the information of one user, i.e. are neighbors if they differ in one vector.
Note that this notion of neighbors for DME in the context of distributed gradient descent translates to two data sets
being neighbors if they differ in one function and corresponds to guaranteeing privacy for individual client’s data. The bound translates to assuming , ensured via gradient clipping.
5 Results for distributed mean estimation (DME)
In this section we describe our algorithms, the associated MSE, and the privacy guarantees in the context of DME. We first establish a baseline by stating the results for implementing the Gaussian mechanism by adding Gaussian noise on each client vector.
5.1 Gaussian protocol
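In the Gaussian mechanism, each client sends the vector
$$Y_i = X_i + Z_i,$$
where the $Z_i$s are i.i.d. distributed as $\mathcal{N}(0, \sigma^2 I_d)$. The server estimates the mean by
$$\hat{\bar{X}} = \frac{1}{n} \sum_{i=1}^{n} Y_i = \bar{X} + \frac{1}{n} \sum_{i=1}^{n} Z_i.$$
We refer to this protocol as $\pi_g$. Since $\frac{1}{n} \sum_{i=1}^{n} Z_i$ is distributed as $\mathcal{N}(0, \frac{\sigma^2}{n} I_d)$, the above mechanism is equivalent to applying the Gaussian mechanism on the output with variance $\sigma^2/n$. Since changing any of the $X_i$'s changes the norm of $\bar{X}$ by at most $2D/n$, the following theorem follows directly from Lemma 1.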
Theorem 2.
Under the Gaussian mechanism, the mean estimate is unbiased and communication cost is reals. Moreover, for any and , it is differentially private for
We remark that real numbers can be quantized to bits with insignificant effect on privacy. (Footnote 6: This follows by observing that quantizing all values to accuracy ensures minimum loss in privacy. In practice this is often implemented using 32 bits of quantization via float representation.) However, this is asymptotic and can be prohibitive in practice [20], where we have a small fixed communication budget and $d$ is of the order of millions. A natural way to reduce communication cost is via quantization, where each client quantizes the $Y_i$s before transmitting. However, how privacy guarantees degrade due to quantization of the Gaussian mechanism is hard to analyze, particularly under aggregation. Instead, we propose to use the Binomial mechanism, which we describe next.
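The Gaussian baseline is simple to simulate. The sketch below is illustrative only: the helper name `gaussian_protocol`, the synthetic client vectors, and the parameter values are assumptions made here, and clipping to norm $D$ stands in for gradient clipping.

```python
import numpy as np

def gaussian_protocol(X, sigma, rng):
    """Each client adds N(0, sigma^2 I) noise; the server averages the noisy vectors."""
    Y = X + rng.normal(0.0, sigma, size=X.shape)    # row i is the message of client i
    return Y.mean(axis=0)

rng = np.random.default_rng(0)
n, d, D, sigma = 1000, 20, 1.0, 0.5
X = rng.normal(size=(n, d))
# Clip every client vector to Euclidean norm at most D.
X *= D / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), D)

est = gaussian_protocol(X, sigma, rng)
mse = np.sum((est - X.mean(axis=0)) ** 2)           # expected value: d * sigma^2 / n
```

Here the expected squared error is $d \sigma^2 / n = 0.005$, but every client message consists of $d$ real numbers.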
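5.2 Stochastic $k$-level quantization + Binomial mechanism
We now define the mechanism $\pi_{sk}(\text{Bin}(m, p))$ based on $k$-bit stochastic quantization $\pi_{sk}$ proposed in [34] composed with the Binomial mechanism. It will be parameterized by three quantities $k, m, p$.
First, the server sends $X^{\max}$ to all the clients, with the hope that for all $i, j$, $-X^{\max} \leq X_i(j) \leq X^{\max}$. The clients then clip each coordinate of their vectors to the range $[-X^{\max}, X^{\max}]$. For every integer $r$ in the range $[0, k)$, let $B(r)$ represent a bin (one for each $r$), i.e.
$$B(r) \triangleq -X^{\max} + \frac{2 r X^{\max}}{k - 1}. \qquad (8)$$
The algorithm quantizes each coordinate into one of the bins stochastically and adds scaled Binomial noise. Formally, client $i$ computes the following quantities for every $j$:
$$U_i(j) = \begin{cases} B(r+1) & \text{w.p. } \frac{X_i(j) - B(r)}{B(r+1) - B(r)}, \\ B(r) & \text{otherwise,} \end{cases} \qquad Y_i(j) = U_i(j) + \frac{2 X^{\max}}{k-1} \cdot T_i(j), \qquad (9)$$
where $r$ is such that $X_i(j) \in [B(r), B(r+1)]$ and $T_i(j) \sim \text{Bin}(m, p)$. The client sends $Y_i$ to the server. The server now estimates $\bar{X}$ by
$$\hat{\bar{X}}_{\pi_{sk}(\text{Bin}(m,p))} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \frac{2 X^{\max} m p}{k - 1} \right). \qquad (10)$$
If $\forall j$, $X_i(j) \in [-X^{\max}, X^{\max}]$, then
$$\mathbb{E}\left[ Y_i - \frac{2 X^{\max} m p}{k - 1} \right] = X_i,$$
and $\hat{\bar{X}}_{\pi_{sk}(\text{Bin}(m,p))}$ will be an unbiased estimate of the mean.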
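The client and server steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the names `client_message` and `server_estimate` are hypothetical, the $k$ levels are taken to be evenly spaced on $[-X^{\max}, X^{\max}]$, and the messages are kept as integer level indices (quantized level plus Binomial noise) so that they are discrete.

```python
import numpy as np

def client_message(x, x_max, k, m, p, rng):
    """Stochastic k-level quantization plus Binomial(m, p) noise, as integer indices."""
    x = np.clip(x, -x_max, x_max)
    step = 2.0 * x_max / (k - 1)                 # spacing between adjacent levels
    pos = (x + x_max) / step                     # position in units of step, in [0, k-1]
    low = np.minimum(np.floor(pos), k - 2)       # index of the lower neighbouring level
    up = rng.random(x.shape) < (pos - low)       # round up w.p. proportional to distance
    level = (low + up).astype(np.int64)          # quantized level index in {0, ..., k-1}
    noise = rng.binomial(m, p, size=x.shape)     # discrete noise in {0, ..., m}
    return level + noise                         # integer in {0, ..., k-1+m}

def server_estimate(messages, x_max, k, m, p):
    """Average the messages, subtract the noise mean m*p, and map back to real values."""
    step = 2.0 * x_max / (k - 1)
    return -x_max + step * (messages.mean(axis=0) - m * p)

rng = np.random.default_rng(0)
n, d, x_max, k, m, p = 2000, 8, 1.0, 16, 32, 0.5
X = rng.uniform(-1.0, 1.0, size=(n, d))          # row i plays the role of client i
msgs = client_message(X, x_max, k, m, p, rng)
est = server_estimate(msgs, x_max, k, m, p)
```

Each transmitted coordinate is an integer between $0$ and $k - 1 + m$, so in this toy setting it fits in $\lceil \log_2(k + m) \rceil = 6$ bits, and `est` is an unbiased estimate of `X.mean(axis=0)`.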
With no prior information on $X^{\max}$, the natural choice is to set $X^{\max} = D$. With this value of $X^{\max}$ we characterize the MSE, sensitivity, and communication complexity of the Binomial mechanism below. To characterize the sensitivity of , we need a few definitions. For scalars , let
(11) 
where . We note that quantities are omitted from the LHS of the equations for the ease of notation. Combined with Theorem 1, this yields the privacy guarantees for the binomial mechanism.
Theorem 3.
We provide the proof in Appendix D. For , we bound the communication cost as follows.
Corollary 2.
There exists an implementation of , which achieves the same privacy and error as the full precision Gaussian mechanism with a total communication complexity of
Therefore our results provide precise nonasymptotic and asymptotic guarantees on the total communication with respect to k. The communication cost of the above algorithm is bits per coordinate per client, which can be prohibitive. In the next section we show that these bounds can be improved via rotation.
5.3 Error reduction via randomized rotation
As seen in Corollary 2, if has error and privacy same as that of the Gaussian mechanism, it has high communication cost. The proof reveals that this is due to the error being proportional to . Therefore MSE reduces when is small, e.g., when is uniform on the unit sphere, is (whp) [10]. [34] showed that the same effect can be observed by randomly rotating the vectors before quantization. Here we show that random rotation reduces the leading term as well as improves the privacy guarantee.
Using public randomness, all clients and the central server generate a random orthogonal matrix $R \in \mathbb{R}^{d \times d}$ according to some known distribution. Given a protocol $\pi$ for DME which takes inputs $X_1, \ldots, X_n$, we define $\text{Rot}(\pi, R)$ as the protocol where each client $i$ first computes
$$X'_i = R X_i,$$
and runs the protocol $\pi$ on $X'_1, \ldots, X'_n$. The server then obtains the mean estimate $\hat{\bar{X}}'$ in the rotated space using the protocol $\pi$ and then multiplies by $R^{-1}$ to obtain the coordinates in the original basis, i.e.,
$$\hat{\bar{X}} = R^{-1} \hat{\bar{X}}'.$$
Due to the fact that $d$ can be huge in practice, we need orthogonal matrices that permit fast matrix-vector products. Naive matrices that support fast multiplication such as block-diagonal matrices often result in high values of . Similar to [34], we propose to use a special type of orthogonal matrix $R = HA$, where $A$ is a random diagonal matrix with i.i.d. Rademacher entries ($\pm 1$ with probability $1/2$) and $H$ is a Walsh-Hadamard matrix [18]. The Walsh-Hadamard matrix of dimension $2^m$ for $m \in \mathbb{N}$ is given by the recursive formula,
$$H(2^1) = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad H(2^m) = \begin{bmatrix} H(2^{m-1}) & H(2^{m-1}) \\ H(2^{m-1}) & -H(2^{m-1}) \end{bmatrix}.$$
Applying both the rotation and its inverse takes $O(d \log d)$ time and $O(1)$ additional space (with an in-place algorithm).
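A fast Walsh-Hadamard rotation can be sketched as below. This is an illustrative implementation under stated assumptions: the function names are hypothetical, the transform is scaled by $1/\sqrt{d}$ so that it is exactly orthogonal (a normalization chosen here), and the dimension is assumed to be a power of two. For clarity the sketch works on a copy rather than fully in place.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a length-2^m vector in O(d log d)."""
    x = np.array(x, dtype=float)              # work on a copy
    d = x.size
    if d == 0 or d & (d - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):      # butterfly on blocks of size 2h
            a = x[start:start + h].copy()
            b = x[start + h:start + 2 * h].copy()
            x[start:start + h] = a + b
            x[start + h:start + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)                     # scaling makes the transform orthogonal

def rotate(x, signs):
    """Apply the random sign flips, then the normalized Hadamard transform."""
    return fwht(signs * x)

def unrotate(y, signs):
    """Invert rotate: the normalized Hadamard matrix and the sign matrix are self-inverse."""
    return signs * fwht(y)

rng = np.random.default_rng(0)
d = 1024
signs = rng.choice([-1.0, 1.0], size=d)       # Rademacher diagonal, shared via public randomness
x = np.zeros(d)
x[0] = 1.0                                    # a "spiky" vector: largest coordinate is 1
y = rotate(x, signs)
```

For this spiky input every coordinate of `y` has magnitude $1/\sqrt{d} = 1/32$ while the Euclidean norm is unchanged, which illustrates why rotation shrinks the per-coordinate range that the quantizer has to cover.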
The next theorem provides the MSE guarantees for .
Theorem 4 (Appendix E).
The following corollary bounds the communication cost for when .
Corollary 3.
There exists an implementation of , that achieves the same error and privacy of the full precision Gaussian mechanism with a total communication complexity:
Hence if , then has the same privacy and utilities as the Gaussian mechanism, but with just communication cost.
6 Discussion
We trained a three-layer model (60 hidden nodes each with ReLU activation) on the infinite MNIST dataset [8] with 25M data points and 25M clients. At each step 10,000 clients send their data to the server. This setting is close to real-world settings of federated learning where there are hundreds of millions of users. The results are in Figure 2. Note that the models achieve different levels of accuracy depending on communication cost and privacy parameter $\varepsilon$. We note that we trained the model with exactly one epoch, so each sample was used at most once in training. In this setting, the per batch $\varepsilon$ and the overall $\varepsilon$ are the same.
There are several interesting future directions. On the theoretical side, it is not clear if our analysis of Binomial mechanism is tight. Furthermore, it is interesting to have better privacy accounting for Binomial mechanism via a moments accountant. On the practical side, we plan to explore the effects of neural network topology, overparametrization, and optimization algorithms on the accuracy of the privately learned models.
7 Acknowledgements
The authors would like to thank Keith Bonawitz, Vitaly Feldman, Jakub Konečný, Ben Kreuter, Ilya Mironov, and Kunal Talwar for their valuable suggestions and inputs.
References
 [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 [2] Martín Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
 [3] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, 2006.
 [4] Dan Alistarh, Demjan Grubic, Jerry Liu, Ryota Tomioka, and Milan Vojnovic. Communication-efficient stochastic gradient descent, with applications to neural networks. 2017.
 [5] Dan Alistarh, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Randomized quantization for communication-optimal stochastic gradient descent. arXiv:1610.02132, 2016.
 [6] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464–473. IEEE, 2014.
 [7] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. pages 1175–1191, 2017.
 [8] Leon Bottou. The infinite mnist dataset, 2007.
 [9] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. Deep learning with COTS HPC systems. In International Conference on Machine Learning, pages 1337–1345, 2013.
 [10] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
 [11] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
 [12] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Eurocrypt, volume 4004, pages 486–503. Springer, 2006.
 [13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, volume 3876, pages 265–284. Springer, 2006.
 [14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407, August 2014.
 [15] Peter Elias. Universal codeword sets and representations of the integers. IEEE transactions on information theory, 21(2):194–203, 1975.
 [16] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 [17] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1737–1746, 2015.
 [18] Kathy J Horadam. Hadamard matrices and their applications. Princeton university press, 2012.
 [19] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. IEEE Transactions on Information Theory, 63(6):4037–4049, 2017.
 [20] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
 [21] Jakub Konečný and Peter Richtárik. Randomized distributed mean estimation: Accuracy vs communication. arXiv preprint arXiv:1611.07555, 2016.
 [22] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 1, page 3, 2014.
 [23] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014.
 [24] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. International Conference on Learning Representations, 2018.
 [25] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
 [26] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In HLT, 2010.
 [27] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
 [28] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Aguera y Arcas. Federated learning of deep networks using model averaging. arXiv:1602.05629, 2016.
 [29] Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. Parallel training of deep neural networks with natural gradient and parameter averaging. arXiv preprint, 2014.
 [30] Alexander Rakhlin, Ohad Shamir, Karthik Sridharan, et al. Making gradient descent optimal for strongly convex stochastic optimization. In ICML. Citeseer, 2012.
 [31] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
 [32] Anand D Sarwate and Kamalika Chaudhuri. Signal processing and machine learning with differential privacy: Algorithms and challenges for continuous data. IEEE signal processing magazine, 30(5):86–94, 2013.
 [33] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 [34] Ananda Theertha Suresh, Felix X Yu, Sanjiv Kumar, and H Brendan McMahan. Distributed mean estimation with limited communication. In International Conference on Machine Learning, pages 3329–3337, 2017.
 [35] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. arXiv preprint arXiv:1705.07878, 2017.
 [36] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1307–1322. ACM, 2017.
 [37] Yusuke Tsuzuku, Hiroto Imachi, and Takuya Akiba. Variance-based gradient compression for efficient distributed deep learning, 2018.
 [38] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
Appendix A Proof of biased SGD
The proof is similar to the SGD proof of [16], however we account for bias in gradient estimates. Define the random variable . By the definitions of and ,
where the last inequality uses the fact that . Rearranging the above inequality and summing over all we get that
Appendix B Binomial Mechanism: Proof of Theorem 1
To remind the reader, the binomial mechanism for releasing discrete valued queries on a database is defined as follows. Given a set of databases and an integer valued query , the binomial mechanism samples a vector such that all its coordinates are distributed as the binomial distribution with parameters , i.e.
The Binomial mechanism releases the vector as the output to the query. For the analysis the reader is referred to the definition of norm sensitivity for any defined in (4). The of interest to us for the Binomial mechanism will be . Since our requirement from the Binomial mechanism will be symmetric w.r.t. and , throughout this proof, we assume that .
To prove Theorem 1, we need a few auxiliary lemmas. We first state two inequalities which we use throughout the proof.
Lemma 3 (Bernstein’s inequality).
Let be independent random variables such that and w.p. 1. Let . Then for any ,
Lemma 4 (Efron-Stein inequality).
Let be a symmetric function of independent random variables . Let be an i.i.d. copy of , then
We use the above two results in the next two lemmas.
Lemma 5.
Let , , , . Then
Proof.
where the inequality follows from considering the two cases when can be positive or negative. ∎
Lemma 6.
Let be real numbers. Let independently such that . Let be the event that for some , such that . Then for any , with probability conditioned on ,
where is given by
(12) 
Proof.
Since and for any , ,
Hence we can bound the expectation as
where uses the fact that for any positive random variable and any event , . uses the fact that . Note that the function we are considering is a sum of functions of independent binomial random variables and hence we can apply Bernstein’s inequality. To this end, we bound and . Since is bounded,