# cpSGD: Communication-efficient and differentially-private distributed SGD

Naman Agarwal (Most of the research was performed while the author was on an internship at Google Research, New York.)
Princeton University
namana@cs.princeton.edu
Ananda Theertha Suresh
Felix Yu
Sanjiv Kumar
H. Brendan McMahan
###### Abstract

Distributed stochastic gradient descent is an important subroutine in distributed learning. A setting of particular interest is when the clients are mobile devices, where two important concerns are communication efficiency and the privacy of the clients. Several recent works have focused on reducing the communication cost or introducing privacy guarantees, but none of the proposed communication efficient methods are known to be privacy preserving and none of the known privacy mechanisms are known to be communication efficient. To this end, we study algorithms that achieve both communication efficiency and differential privacy. For $d$ variables and $n \approx d$ clients, the proposed method uses $O(\log\log(nd))$ bits of communication per client per coordinate and ensures constant privacy.

We also extend and improve previous analysis of the Binomial mechanism showing that it achieves nearly the same utility as the Gaussian mechanism, while requiring fewer representation bits, which can be of independent interest.

## 1 Introduction

### 1.1 Background

Distributed stochastic gradient descent (SGD) is a basic building block of modern machine learning [26, 11, 9, 29, 1, 28, 5]. In the typical scenario of synchronous distributed learning, in every round, each client obtains a copy of a global model which it updates based on its local data. The updates (usually in the form of gradients) are sent to a parameter server, where they are averaged and used to update the global model. Alternatively, without a central server, each client saves a global model, broadcasts the gradient to all other clients, and updates its model with the aggregated gradient.

Often, the communication cost of sending the gradient becomes the bottleneck [31, 23, 22]. To address this issue, several recent works have focused on reducing the communication cost of distributed learning algorithms via gradient quantization and sparsification [33, 17, 34, 20, 21, 4, 35]. These algorithms have been shown to improve communication cost and hence communication time in distributed learning. This is especially effective in the federated learning setting where clients are mobile devices with expensive up-link communication cost [27, 20].

While communication is a key concern in client based distributed machine learning, an equally important consideration is that of protecting the privacy of participating clients and their sensitive information. Providing rigorous privacy guarantees for machine learning applications has been an area of active recent interest [6, 36, 32]. Differentially private gradient descent algorithms in particular were studied in the work of [2]. A direct application of these mechanisms in distributed settings leads to algorithms with high communication costs. The key focus of our paper is to analyze mechanisms that achieve rigorous privacy guarantees as well as have communication efficiency.

### 1.2 Communication efficiency

We first describe synchronous distributed SGD formally. Let $F$ be of the form

$$F(w) = \frac{1}{M}\cdot\sum_{i=1}^{M} f_i(w),$$

where each $f_i$ resides at the $i$-th client. For example, the $w$'s are weights of a neural network and $f_i(w)$ is the loss of the network on data located on client $i$. Let $w^0$ be the initial value. At round $t$, the server transmits $w^t$ to all the clients and asks a random set of $n$ (batch size / lot size) clients to transmit their local gradient estimates $g_i^t(w^t)$. Let $S$ be the subset of clients. The server updates as follows:

$$g^t(w^t) = \frac{1}{n}\sum_{i\in S} g_i^t(w^t), \qquad w^{t+1} \triangleq w^t - \gamma\, g^t(w^t)$$

for some suitable choice of $\gamma$. Other optimization algorithms such as momentum, Adagrad, or Adam can also be used instead of the SGD step above.
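As an illustration, one round of the synchronous protocol above can be sketched in a few lines of Python. The quadratic client losses, the step size, and the helper name `sgd_round` are assumptions made here for the example and do not come from the paper.

```python
import random

def sgd_round(w, client_grads, n, gamma, rng):
    """One synchronous round: sample n clients, average their gradients, take a step."""
    sampled = rng.sample(range(len(client_grads)), n)  # the subset S of clients
    d = len(w)
    avg = [0.0] * d
    for i in sampled:
        g = client_grads[i](w)  # local gradient estimate g_i^t(w^t)
        for j in range(d):
            avg[j] += g[j] / n
    # w^{t+1} = w^t - gamma * g^t(w^t)
    return [w[j] - gamma * avg[j] for j in range(d)]

# Toy losses (assumed for illustration): f_i(w) = 0.5 * ||w - c_i||^2, so grad f_i(w) = w - c_i.
centers = [[1.0, 2.0], [3.0, 0.0], [-1.0, 1.0], [1.0, 1.0]]
client_grads = [lambda w, c=c: [w[j] - c[j] for j in range(len(w))] for c in centers]

rng = random.Random(0)
w = [0.0, 0.0]
for t in range(200):
    w = sgd_round(w, client_grads, n=4, gamma=0.1, rng=rng)
# w is now close to the minimizer of F, the mean of the centers, (1, 1).
```

Here every client is sampled in each round ($n = M = 4$), so the averaged gradient is exact; with $n < M$ the same function performs the sampled variant.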

Naively for the above protocol, each of the $n$ clients needs to transmit $d$ reals, typically using $O(d\cdot\log 1/\eta)$ bits. (Here $\eta$ is the per-coordinate quantization accuracy. To represent a $d$-dimensional vector to a constant accuracy in Euclidean distance, each coordinate is usually quantized to an accuracy of $\eta = 1/\sqrt{d}$.)

This communication cost can be prohibitive, e.g., for a medium size PennTreeBank language model [38], the number of parameters and hence total cost is MB (assuming 32 bit float), which is too large to be sent from a mobile phone to the server at every round.

Motivated by the need for communication efficient protocols, various quantization algorithms have been proposed to reduce the communication cost [34, 20, 21, 37, 24, 35, 5]. In these protocols, the clients quantize the gradient by a function $q$ and send an efficient representation of $q(g_i^t(w^t))$ instead of its actual local gradient $g_i^t(w^t)$. The server computes the gradient as

$$\tilde{g}^t(w^t) = \frac{1}{n}\sum_{i\in S} q(g_i^t(w^t)),$$

and updates $w^t$ as before. Specifically, [34] proposes a quantization algorithm which reduces the requirement of full (or floating point) arithmetic precision to a bit or few bits per value on average. There are many subsequent works (e.g., see [21]); in particular, [5] showed that stochastic quantization and Elias coding [15] can be used to obtain communication-optimal SGD for convex functions. If the expected communication cost at every round is bounded by $c$, then the total communication cost of the modified gradient descent over $T$ rounds is at most

$$T\cdot c. \tag{1}$$

All the previous papers relate the error in gradient compression to SGD convergence. We first state one such result for completeness for non-convex functions and prove it in Appendix A. Similar (and stronger) results can be obtained for (strongly) convex functions using results in [16] and [30].

###### Corollary 1 ([16]).

Let be -smooth and . Let satisfy . Let be a quantization scheme, and then after rounds

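$$\mathbb{E}_{t\sim\mathrm{Unif}[T]}\left[\|\nabla F(w^t)\|_2^2\right] \le \frac{2D_F L}{T} + \frac{2\sqrt{2}\,\sigma\sqrt{L D_F}}{\sqrt{T}} + DB,$$

$$\text{where } \sigma^2 = \max_{1\le t\le T} 2\,\mathbb{E}\left[\|g^t(w^t)-\nabla F(w^t)\|_2^2\right] + 2\max_{1\le t\le T}\mathbb{E}_q\left[\|g^t(w^t)-\tilde{g}^t(w^t)\|_2^2\right], \tag{2}$$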

and . The expectation in the above equations is over the randomness in gradients and quantization.

The above result relates the convergence of distributed SGD for non-convex functions to the worst-case mean square error (MSE) and bias in gradient mean estimates in Equation (2). Thus, the smaller the mean square error in gradient estimation, the better the convergence. Hence, we focus on the problem of distributed mean estimation (DME), where the goal is to estimate the mean of a set of vectors.

### 1.3 Differential privacy

While the above schemes reduce the communication cost, it is unclear what (if any) privacy guarantees they offer. We study privacy from the lens of differential privacy (DP). The notion of differential privacy [13] provides a strong notion of individual privacy while permitting useful data analysis in machine learning tasks. We refer the reader to [14] for a survey. Informally, for the output to be differentially private, the estimated model should be indistinguishable whether a particular client’s data was taken into consideration or not. We define this formally in Section 2.

In the context of client based distributed learning, we are interested in the privacy of the gradients aggregated from clients; differential privacy for the average gradients implies privacy for the resulting model since DP is preserved by post-processing.

The standard approach is to let the server add the noise to the averaged gradients (e.g., see [14, 2] and references within). However, the above only works under a restrictive assumption that the clients can trust the server. Our goal is to also minimize the need for clients to trust the central aggregator, and hence we propose the following model:

Clients add their share of the noise to their gradients before transmission. Aggregation of gradients at the server results in an estimate with noise equal to the sum of the noise added at each client.

This approach improves over server-controlled noise addition in several scenarios:

Clients do not trust the server: Even in the scenario when the server is not trustworthy, the above scheme can be implemented via cryptographically secure aggregation schemes [7], which ensures that the only information about the individual users the server learns is what can be inferred from the sum. Hence, differential privacy of the aggregate now ensures that the parameter server does not learn any individual user information. This will encourage clients to participate in the protocol even if they do not fully trust the server. We note that while secure aggregation schemes add to the communication cost (e.g., [7] adds for levels of quantization), our proposed communication benefits still hold. For example, if , a 4-bit quantization protocol would reduce communication cost by 67% compared to the 32 bit representation.

Server is negligent, but not malicious: the server may "forget" to add noise, but is not malicious and not interested in learning characteristics of individual users. However, if the server releases the learned model to the public, it needs to be differentially private.

A natural way to extend the results of [14, 2] is to let individual users add Gaussian noise to their gradients before transmission. Since the sum of Gaussians is Gaussian itself, differential privacy results follow. However, the transmitted values now are real numbers and the benefits of gradient compression are lost. Further, secure aggregation protocols [7] require discrete inputs. To resolve these issues, we propose that the clients add noise drawn from an appropriately parameterized Binomial distribution. We refer to this as the Binomial mechanism. Since Binomial random variables are discrete, they can be transmitted efficiently. Furthermore, the choice of the Binomial is convenient in the distributed setting because sum of Binomials is also binomially distributed i.e., if

$$Z_1\sim\mathrm{Bin}(N_1,p) \text{ and } Z_2\sim\mathrm{Bin}(N_2,p) \text{ are independent, then } Z_1+Z_2\sim\mathrm{Bin}(N_1+N_2,p).$$

Hence the total noise post aggregation can be analyzed easily, which is convenient for the distributed setting. (Another choice is the Poisson distribution. Different from Poisson, the Binomial distribution has bounded support and has an easily analyzable communication complexity which is always bounded.) The Binomial mechanism can be of independent interest in other applications with discrete output as well. Furthermore, unlike the Gaussian mechanism, it avoids floating point representation issues.
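The closure of the Binomial family under addition can be checked numerically. The following sketch convolves the probability mass functions of two independent Binomial variables with a common $p$ and compares the result with the mass function of a single Binomial with $N_1+N_2$ trials. The parameter values are arbitrary choices for the example.

```python
from math import comb

def binom_pmf(N, p):
    """Probability mass function of Bin(N, p) as a list indexed by k = 0..N."""
    return [comb(N, k) * p**k * (1 - p)**(N - k) for k in range(N + 1)]

def convolve(a, b):
    """PMF of the sum of two independent integer-valued random variables."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

N1, N2, p = 5, 7, 0.3
pmf_sum = convolve(binom_pmf(N1, p), binom_pmf(N2, p))   # distribution of Z1 + Z2
pmf_direct = binom_pmf(N1 + N2, p)                       # Bin(N1 + N2, p)
max_gap = max(abs(x - y) for x, y in zip(pmf_sum, pmf_direct))
```

The two mass functions agree up to floating point rounding, so the noise aggregated over clients is again Binomial with the trial counts added.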

### 1.4 Summary of our results

Binomial mechanism: We first study Binomial mechanism as a generic mechanism to release discrete valued data. Previous analysis of the Binomial mechanism (where you add noise ) was due to [12], who analyzed the -dimensional case for and showed that to achieve differential privacy, needs to be . We improve the analysis in the following ways:

• $d$-dimensions. We extend the analysis of the $1$-dimensional Binomial mechanism to $d$ dimensions. Unlike the Gaussian distribution, the Binomial is not rotation invariant, making the analysis more involved. The key fact utilized in this analysis is that the Binomial distribution is locally rotation-invariant around the mean.

• Improvement. We improve the previous result and show that suffices for small , implying that the Binomial and Gaussian mechanism perform identically as . We note that while this is a constant improvement , it is crucial in making differential privacy practical.

Differentially-private distributed mean estimation (DME): A direct application of Gaussian mechanism requires reals and hence bits of communication. This can be prohibitive in practice. However, a direct application of quantization [34] and Binomial mechanism has high communication cost. We show that random rotation together with the notion of high probability sensitivity can significantly improve communication.

In particular, for , we provide an algorithm achieving equal privacy and error as that of the Gaussian mechanism with communication

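$$\le n\cdot d\cdot\left(\log_2\left(1+\frac{d}{n}\right) + O\left(\log\log\left(\frac{nd}{\delta}\right)\right)\right) \text{ bits},$$

per round of distributed SGD. Hence when $d \approx n$, the number of bits is $n\cdot d\cdot O(\log\log(nd/\delta))$.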

The rest of the paper is organized as follows. In Section 2, we review the notion of differential privacy, and in Section 3 we state our results for the Binomial mechanism. Motivated by the fact that the convergence of SGD can be reduced to the error in gradient estimate computation per-round, we formally describe the problem of DME in Section 4 and state our results in Section 5.

In Section 5.2, we provide and analyze the implementation of the binomial mechanism in conjunction with quantization in the context of DME. The main idea is for each client to add noise drawn from an appropriately parameterized Binomial distribution to each quantized value before sending to the server. The server further subtracts the bias introduced by the noise to achieve an unbiased mean estimator. We further show in Section 5.3 that the rotation procedure proposed in [34] which reduces the MSE is helpful in reducing the additional error due to differential privacy.

## 2 Differential privacy

### 2.1 Notation

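We start by defining the notion of differential privacy. Formally, given a set of data sets $\mathcal{D}$ provided with a notion of neighboring data sets $\mathcal{N}_{\mathcal{D}}\subset\mathcal{D}\times\mathcal{D}$ and a query function $f$, a mechanism $\mathcal{M}$ to release the answer of the query is defined to be $(\varepsilon,\delta)$ differentially private if for any measurable subset $S$ of outputs and two neighboring data sets $(D_1,D_2)\in\mathcal{N}_{\mathcal{D}}$,

$$\Pr\left(\mathcal{M}(f(D_1))\in S\right) \le e^{\varepsilon}\Pr\left(\mathcal{M}(f(D_2))\in S\right)+\delta. \tag{3}$$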

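Unless otherwise stated, for the rest of the paper, we will assume that the query outputs and the mechanism outputs lie in $\mathbb{R}^d$. We consider the mean square error as a metric to measure the error of the mechanism $\mathcal{M}$. Formally,

$$\mathcal{E}(\mathcal{M}) \triangleq \max_{D\in\mathcal{D}} \mathbb{E}\left[\|\mathcal{M}(f(D))-f(D)\|_2^2\right].$$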

A key quantity in characterizing differential privacy for many mechanisms is the sensitivity of a query $f$ in a given norm $\ell_q$. Formally this is defined as

$$\Delta_q \triangleq \max_{(D_1,D_2)\in\mathcal{N}_{\mathcal{D}}} \|f(D_1)-f(D_2)\|_q. \tag{4}$$

The canonical mechanism to achieve differential privacy is the Gaussian mechanism [14]:

$$\mathcal{M}_g^{\sigma}(f(D)) \triangleq f(D)+Z,$$

where $Z\sim\mathcal{N}(0,\sigma^2 I_d)$. We now state the well-known privacy guarantee of the Gaussian mechanism.

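###### Lemma 1 ([14]).

For any $\delta$, $\ell_2$ sensitivity bound $\Delta_2$, and $\sigma$ such that

$$\sigma \ge \Delta_2\sqrt{2\log\frac{1.25}{\delta}},$$

$\mathcal{M}_g^{\sigma}$ is $\left(\frac{\Delta_2}{\sigma}\sqrt{2\log\frac{1.25}{\delta}},\ \delta\right)$ differentially private (all logs are to base $e$ unless otherwise stated) and the error is bounded by $d\cdot\sigma^2$.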
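For concreteness, the Gaussian mechanism and its privacy parameter can be sketched as follows. The sketch assumes the guarantee in the form $\varepsilon = \frac{\Delta_2}{\sigma}\sqrt{2\log(1.25/\delta)}$, which is the expression that reappears as the leading term in the comparison with the Binomial mechanism; the function names are hypothetical.

```python
import math
import random

def gaussian_epsilon(delta2, sigma, delta):
    """epsilon = (Delta_2 / sigma) * sqrt(2 ln(1.25 / delta)).

    Assumes the regime sigma >= Delta_2 * sqrt(2 ln(1.25 / delta)) of Lemma 1.
    """
    c = math.sqrt(2.0 * math.log(1.25 / delta))
    if sigma < delta2 * c:
        raise ValueError("sigma is too small for the stated guarantee to apply")
    return delta2 * c / sigma

def gaussian_mechanism(answer, sigma, rng):
    """Release f(D) + Z with Z ~ N(0, sigma^2 I_d)."""
    return [a + rng.gauss(0.0, sigma) for a in answer]

eps = gaussian_epsilon(delta2=1.0, sigma=10.0, delta=1e-5)
noisy = gaussian_mechanism([0.2, -0.4, 1.0], sigma=10.0, rng=random.Random(0))
```

With $\Delta_2 = 1$, $\sigma = 10$, and $\delta = 10^{-5}$ the resulting $\varepsilon$ is a little below $0.5$, and the expected squared error of the release is $d\sigma^2 = 300$.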

### 2.2 High probability sensitivity

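In this section we introduce a notion of high probability sensitivity which allows us to work with randomized queries which do not have a worst case bound on sensitivity but have bounded sensitivity with high probability. Let $Q = \{q_1, q_2, \ldots\}$ represent a set of natural numbers and $\Delta_Q = \{\Delta_{q_1}, \Delta_{q_2}, \ldots\}$ represent a subset of real numbers. For two random vectors $v_1, v_2$, the event $\|v_1-v_2\|_Q\le\Delta_Q$ is defined as

$$\left(\|v_1-v_2\|_Q\le\Delta_Q\right) \triangleq \bigcup_i \left(\|v_1-v_2\|_{q_i}\le\Delta_{q_i}\right)$$

###### Definition 1 ($(\Delta_Q,\delta)$ sensitivity).

Given a set of integers $Q$ and values $\Delta_Q$, we call a randomized function $f$ $(\Delta_Q,\delta)$ sensitive if, for any two neighboring data sets $(D_1,D_2)$, there exist coupled random variables $(X_1,X_2)$ such that the marginal distributions of $X_1, X_2$ are identical to those of $f(D_1)$ and $f(D_2)$, and

$$\Pr_{X_1,X_2}\left(\|X_1-X_2\|_Q\le\Delta_Q\right) \ge 1-\delta. \tag{5}$$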

We show the following result for high-probability sensitivity and the proof is provided in Appendix C.

###### Lemma 2.

Let be an differentially private mechanism for sensitivity and let be a sensitive function. Then the composed mechanism is differentially private.

## 3 Binomial Mechanism

We now define the Binomial mechanism for the case when the output space of the query is . The Binomial mechanism is parameterized by three quantities where , and quantization scale for some and is given by

$$\mathcal{M}_b^{N,p,s}(f(D)) \triangleq f(D)+(Z-Np)\cdot s, \tag{6}$$

where for each coordinate $j$, $Z(j)\sim\mathrm{Bin}(N,p)$ and independent. The one-dimensional Binomial mechanism was introduced by [12] for the case when $p = 1/2$. We analyze the mechanism for the general $d$-dimensional case and for any $p$. This analysis is involved as the Binomial mechanism is not rotation invariant. By carefully exploiting the local rotation invariant structure near the mean, we show that:

###### Theorem 1.

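For any $\delta$, parameters $N, p, s$ and sensitivity bounds $\Delta_1, \Delta_2, \Delta_\infty$ such that

$$Np(1-p) \ge \max\left(23\log(10d/\delta),\ 2\Delta_\infty/s\right),$$

the Binomial mechanism is $(\varepsilon,\delta)$ differentially private for

$$\varepsilon = \frac{\Delta_2\sqrt{2\log\frac{1.25}{\delta}}}{s\sqrt{Np(1-p)}} + \frac{\Delta_2 c_p\sqrt{\log\frac{10}{\delta}}+\Delta_1 b_p}{sNp(1-p)(1-\delta/10)} + \frac{\frac{2}{3}\Delta_\infty\log\frac{1.25}{\delta}+\Delta_\infty d_p\log\frac{20d}{\delta}\log\frac{10}{\delta}}{sNp(1-p)}, \tag{7}$$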

where , and are defined in  (17),  (12), and  (16) respectively, and for , , , and . The error of the mechanism is

$$d\cdot s^2\cdot Np(1-p).$$

The proof is given in Appendix B. We make some remarks regarding the design and the guarantee for the Binomial mechanism. Note that the privacy guarantee for the Binomial mechanism depends on all three sensitivity parameters $\Delta_1, \Delta_2, \Delta_\infty$, as opposed to the Gaussian mechanism, which only depends on $\Delta_2$. The $\Delta_1$ and $\Delta_\infty$ terms can be seen as the added complexity due to discretization.

Secondly, setting $s = 1$ (i.e. providing no scale to the noise) in the expression (7), it can be readily seen that the terms involving and scale differently with respect to the variance of the noise. This motivates the use of the accompanying quantization scale $s$ in the mechanism. Indeed, it is possible that the resolution of the integer that is provided by the Binomial noise could potentially be too large for the problem, leading to worse guarantees. In this setting, the quantization parameter $s$ helps normalize the noise correctly. Further, it can be seen that, as long as the variance of the random variable is fixed, increasing $N$ and decreasing $s$ makes the Binomial mechanism closer to the Gaussian mechanism. Formally, if we let and , then using the Cauchy-Schwarz inequality, the guarantee (7) can be rewritten as

$$\varepsilon = \frac{\Delta_2}{\sigma}\sqrt{2\log\frac{1.25}{\delta}}\left(1+O\left(\frac{1}{c}\right)\right).$$

The variance of the Binomial distribution is $Np(1-p)$ and the leading term in $\varepsilon$ matches exactly the term in the Gaussian mechanism. Furthermore, if is , then this mechanism is very similar to the Gaussian mechanism. This result agrees with the Berry-Esseen type central limit theorems for the convergence of the one-dimensional Binomial distribution to the Gaussian distribution.
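The mechanism in Eq. (6) and its error are simple enough to sketch directly. The code below draws the Binomial noise by summing Bernoulli trials, computes the error $d\cdot s^2\cdot Np(1-p)$, and evaluates only the leading term of the privacy bound, $\Delta_2\sqrt{2\log(1.25/\delta)}/(s\sqrt{Np(1-p)})$; the lower-order terms of (7) are deliberately omitted, and all function names are assumptions for this illustration.

```python
import math
import random

def binomial_mechanism(answer, N, p, s, rng):
    """Release f(D) + (Z - N p) * s with Z(j) ~ Bin(N, p) independently per coordinate."""
    out = []
    for a in answer:
        z = sum(1 for _ in range(N) if rng.random() < p)  # one Bin(N, p) draw
        out.append(a + (z - N * p) * s)
    return out

def binomial_error(d, N, p, s):
    """Mean square error of the mechanism: d * s^2 * N p (1 - p)."""
    return d * s**2 * N * p * (1 - p)

def leading_epsilon(delta2, N, p, s, delta):
    """Leading term of the privacy bound only; lower-order terms are not included."""
    sigma = s * math.sqrt(N * p * (1 - p))  # standard deviation of the scaled noise
    return delta2 * math.sqrt(2.0 * math.log(1.25 / delta)) / sigma

released = binomial_mechanism([3, -2], N=16, p=0.5, s=1.0, rng=random.Random(0))
err = binomial_error(d=10, N=400, p=0.5, s=0.5)
eps_lead = leading_epsilon(delta2=1.0, N=400, p=0.5, s=1.0, delta=1e-5)
```

With $N = 400$, $p = 1/2$, and $s = 1$ the scaled noise has standard deviation $10$, and the leading term equals the $\varepsilon$ of a Gaussian mechanism with $\sigma = 10$.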

In Figure 2, we plot the error vs $\varepsilon$ for the Gaussian and Binomial mechanisms. Observe that as the scale $s$ is reduced, the error vs privacy trade-off for the Binomial mechanism approaches that of the Gaussian mechanism.

## 4 Distributed mean estimation (DME)

We have related the SGD convergence rate to the MSE in approximating the gradient at each step in Corollary 1. Eq. (1) relates the communication cost of SGD to the communication cost of estimating gradient means. The advanced composition theorem (Thm. 3.5 [19]) or moments accounting [2] can be used to relate the privacy guarantee of SGD to that of the gradient mean estimate at each instance $t$. We also note that, since SGD often samples the clients, standard privacy amplification results via sampling [2] can be used to get tighter bounds in this case.

Therefore, akin to [34], in the rest of the paper we just focus on the MSE and privacy guarantees of DME. The results for synchronous distributed GD follow from Corollary 1 (convergence), advanced composition theorem (privacy), and Eq. (1) (communication).

Formally, the problem of DME is defined as follows: given $n$ vectors $X^n \triangleq X_1,\ldots,X_n\in\mathbb{R}^d$, where $X_i$ is on client $i$, we wish to compute the mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

at a central server. For gradient descent at each round $t$, $X_i$ is set to $g_i^t$. DME is a fundamental building block for many distributed learning algorithms including distributed PCA/clustering [25].

While analyzing private DME we assume that each vector $X_i$ has bounded $\ell_2$ norm, i.e. $\|X_i\|_2\le D$. The reason to make such an assumption is to be able to define and analyze the privacy guarantees, and it is often enforced in practice by employing gradient clipping at each client. We note that this assumption appears in previous works on gradient descent and differentially private gradient descent (e.g. [2]). Since our results also hold for all gradients without any statistical assumptions, we get desired convergence results and privacy results for SGD.

### 4.1 Communication protocol

Our proposed communication algorithms are simultaneous and independent, i.e., the clients independently send data to the server at the same time. We allow the use of both private and public randomness. Private randomness refers to random values generated by each client separately, and public randomness refers to a sequence of random values that are shared among all parties. (Public randomness can be emulated by the server communicating a random seed.)

Consider $n$ vectors $X^n$, where $X_i\in\mathbb{R}^d$ resides on client $i$. In any independent communication protocol, each client transmits a function of $X_i$ (say $q(X_i)$), and a central server estimates the mean by some function of $q(X_1), q(X_2), \ldots, q(X_n)$. Let $\pi$ be any such protocol and let $\mathcal{C}_i(\pi,X_i)$ be the expected number of transmitted bits by the $i$-th client during protocol $\pi$, where throughout the paper, expectation is over the randomness in protocol $\pi$.

The total number of bits transmitted by all clients with the protocol $\pi$ is

$$\mathcal{C}(\pi,X_1^n) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{n}\mathcal{C}_i(\pi,X_i).$$

Let the estimated mean be $\hat{\bar{X}}$. For a protocol $\pi$, the MSE of the estimate is

$$\mathcal{E}(\pi,X_1^n) = \mathbb{E}\left[\|\hat{\bar{X}}-\bar{X}\|_2^2\right].$$

We note that bounds on $\mathcal{E}(\pi,X_1^n)$ translate to bounds on gradient estimates in Eq. (2) and result in convergence guarantees via Corollary 1.

### 4.2 Differential privacy

To state the privacy results for DME, we define the notion of data sets and neighbors as follows. A dataset is a collection of vectors $X^n = \{X_1,\ldots,X_n\}$. The notion of neighboring data sets typically corresponds to those differing only on the information of one user, i.e. $X^n$ and $X'^n$ are neighbors if they differ in one vector.

Note that this notion of neighbors for DME in the context of distributed gradient descent translates to two data sets

$$F = f_1, f_2, \ldots, f_n \quad\text{and}\quad F' = f'_1, f'_2, \ldots, f'_n$$

being neighbors if they differ in one function, and corresponds to guaranteeing privacy for an individual client's data. The bound $\|X_i\|_2\le D$ translates to assuming $\|g_i^t\|_2\le D$, ensured via gradient clipping.

## 5 Results for distributed mean estimation (DME)

In this section we describe our algorithms, the associated MSE, and the privacy guarantees in the context of DME. We first establish a baseline by stating the results for implementing the Gaussian mechanism by adding Gaussian noise on each client vector.

### 5.1 Gaussian protocol

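In the Gaussian mechanism, each client sends the vector

$$Y_i = X_i + Z_i,$$

where the $Z_i$s are i.i.d. distributed as $\mathcal{N}(0,\sigma^2 I_d)$. The server estimates the mean by

$$\hat{\bar{X}} = \frac{1}{n}\cdot\sum_{i=1}^{n} Y_i.$$

We refer to this protocol as $\pi_g$. Since $\frac{1}{n}\sum_{i=1}^{n} Z_i$ is distributed as $\mathcal{N}(0,\frac{\sigma^2}{n} I_d)$, the above mechanism is equivalent to applying the Gaussian mechanism on the output with variance $\sigma^2/n$. Since changing any of the $X_i$'s changes the norm of $\bar{X}$ by at most $2D/n$, the following theorem follows directly from Lemma 1.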

###### Theorem 2.

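Under the Gaussian mechanism, the mean estimate is unbiased and the communication cost is $n\cdot d$ reals. Moreover, for any $\delta$ and $\sigma \ge \frac{2D}{\sqrt{n}}\cdot\sqrt{2\log\frac{1.25}{\delta}}$, it is $(\varepsilon,\delta)$ differentially private for

$$\varepsilon = \frac{2D}{\sqrt{n}\,\sigma}\sqrt{2\log\frac{1.25}{\delta}} \quad\text{and}\quad \mathcal{E}(\pi_g,X) = \frac{d\sigma^2}{n}.$$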

We remark that real numbers can be quantized to bits with insignificant effect to privacy. (This follows by observing that quantizing all values to accuracy ensures minimum loss in privacy. In practice this is often implemented using 32 bits of quantization via float representation.) However, this is asymptotic and can be prohibitive in practice [20], where we have a small fixed communication budget and $d$ is of the order of millions. A natural way to reduce communication cost is via quantization, where each client quantizes the $Y_i$s before transmitting. However, how privacy guarantees degrade due to quantization of the Gaussian mechanism is hard to analyze, particularly under aggregation. Instead we propose to use the Binomial mechanism, which we describe next.

### 5.2 Stochastic k-level quantization + Binomial mechanism

We now define the mechanism based on -bit stochastic quantization proposed in [34] composed with the Binomial mechanism. It will be parameterized by quantities .

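First, the server sends $X^{\max}$ to all the clients, with the hope that for all $i, j$, $-X^{\max}\le X_i(j)\le X^{\max}$. The clients then clip each coordinate of their vectors to the range $[-X^{\max},X^{\max}]$. For every integer $r$ in the range $[0,k)$, let $B(r)$ represent a bin (one for each $r$), i.e.

$$B(r) \stackrel{\mathrm{def}}{=} -X^{\max}+\frac{2rX^{\max}}{k-1}, \tag{8}$$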

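The algorithm quantizes each coordinate into one of the bins stochastically and adds scaled Binomial noise. Formally, client $i$ computes the following quantities for every $j$:

$$U_i(j) = \begin{cases} B(r+1) & \text{w.p. } \dfrac{X_i(j)-B(r)}{B(r+1)-B(r)} \\ B(r) & \text{otherwise,} \end{cases} \qquad Y_i(j) = U_i(j)+\frac{2X^{\max}}{k-1}\cdot T_i(j), \tag{9}$$

where $r$ is such that $X_i(j)\in[B(r),B(r+1)]$ and $T_i(j)\sim\mathrm{Bin}(m,p)$. The client sends $Y_i$ to the server. The server now estimates $\bar{X}$ by

$$\hat{\bar{X}}_{\pi_{sk}(\mathrm{Bin}(m,p))} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i-\frac{2X^{\max}mp}{k-1}\right). \tag{10}$$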

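If for all $j$, $X_i(j)\in[-X^{\max},X^{\max}]$, then

$$\mathbb{E}\left[Y_i-\frac{2X^{\max}mp}{k-1}\right] = X_i,$$

and $\hat{\bar{X}}_{\pi_{sk}(\mathrm{Bin}(m,p))}$ will be an unbiased estimate of the mean.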
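The client and server steps of Eqs. (8) to (10) can be sketched as follows. One observation used here, which is an implementation choice of this sketch rather than something stated in the text, is that $Y_i(j) = B(r') + \frac{2X^{\max}}{k-1}T_i(j)$ is determined by the single integer $r' + T_i(j)\in\{0,\ldots,k-1+m\}$, so the client can transmit that integer. The function names are hypothetical.

```python
import random

def stochastic_round(x, x_max, k, rng):
    """Return the index r' of the bin B(r') that x is stochastically rounded to."""
    step = 2.0 * x_max / (k - 1)
    x = min(max(x, -x_max), x_max)      # clip to [-x_max, x_max]
    pos = (x + x_max) / step            # position on the grid, in [0, k - 1]
    r = min(int(pos), k - 2)            # index of the lower bin B(r)
    frac = pos - r                      # (x - B(r)) / (B(r+1) - B(r))
    return r + (1 if rng.random() < frac else 0)

def client_message(x_vec, x_max, k, m, p, rng):
    """Per coordinate, send the integer r' + T with T ~ Bin(m, p)."""
    msg = []
    for x in x_vec:
        level = stochastic_round(x, x_max, k, rng)
        noise = sum(1 for _ in range(m) if rng.random() < p)
        msg.append(level + noise)       # integer in [0, k - 1 + m]
    return msg

def server_estimate(messages, x_max, k, m, p):
    """Decode Y_i = -x_max + step * message, subtract the noise mean, and average."""
    n, d = len(messages), len(messages[0])
    step = 2.0 * x_max / (k - 1)
    return [sum(-x_max + step * msg[j] - step * m * p for msg in messages) / n
            for j in range(d)]

rng = random.Random(0)
messages = [client_message([0.3, -0.7], 1.0, 4, 8, 0.5, rng) for _ in range(2000)]
estimate = server_estimate(messages, 1.0, 4, 8, 0.5)  # close to [0.3, -0.7]
```

In this toy run all clients hold the same vector, so the averaged estimate concentrates around that vector, which illustrates the unbiasedness of the estimator.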

With no prior information on $X^{\max}$, the natural choice is to set $X^{\max} = D$. With this value of $X^{\max}$ we characterize the MSE, sensitivity, and communication complexity of the Binomial mechanism below. To characterize the sensitivity of , we need a few definitions. For scalars , let

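$$\begin{aligned}
\Delta_\infty(X^{\max},D) &\stackrel{\mathrm{def}}{=} k+1 \\
\Delta_1(X^{\max},D) &\stackrel{\mathrm{def}}{=} \frac{\sqrt{d}\,D}{q}+\sqrt{\frac{2\sqrt{d}\,D\log(2/\delta)}{q}}+\frac{4}{3}\log\frac{2}{\delta} \\
\Delta_2(X^{\max},D) &\stackrel{\mathrm{def}}{=} \frac{D}{q}+\sqrt{\Delta_1+\sqrt{\frac{2\sqrt{d}\,D\log(2/\delta)}{q}}},
\end{aligned} \tag{11}$$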

where . We note that quantities are omitted from the LHS of the equations for the ease of notation. Combined with Theorem 1, this yields the privacy guarantees for the binomial mechanism.

###### Theorem 3.

If $X^{\max} = D$, then the mean estimate is unbiased and

$$\mathcal{E}(\pi_{sk}(\mathrm{Bin}(m,p)),X^n) \le \frac{dD^2}{n(k-1)^2}+\frac{d}{n}\cdot\frac{4mp(1-p)D^2}{(k-1)^2}.$$

Furthermore if , then for any , is differentially private where is given by Theorem 1 with sensitivity parameters (Eq. (11)). Furthermore,

$$\mathcal{C}(\pi_{sk}(\mathrm{Bin}(m,p)),X^n) = n\cdot\left(d\log_2(k+m)+\tilde{O}(1)\right).$$

($\tilde{O}$ is used to denote poly-logarithmic factors.)
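To get a feel for the $d\log_2(k+m)$ term in the communication cost, the following sketch counts the bits needed under a simple fixed-length encoding. Rounding $\log_2(k+m)$ up to an integer and ignoring the $\tilde{O}(1)$ overhead are simplifying assumptions of this example; the helper names are hypothetical.

```python
import math

def bits_per_coordinate(k, m):
    """Bits to index one of the k + m possible values of (quantization level + noise)."""
    return math.ceil(math.log2(k + m))

def total_bits(n, d, k, m):
    """Fixed-length cost for n clients and d coordinates, ignoring lower-order overhead."""
    return n * d * bits_per_coordinate(k, m)

per_coord = bits_per_coordinate(k=16, m=48)       # log2(64) = 6 bits
cost = total_bits(n=10, d=1000, k=16, m=48)       # 10 * 1000 * 6 bits
float_cost = 10 * 1000 * 32                       # the same vectors as 32-bit floats
```

For these example values the quantized messages take 6 bits per coordinate instead of 32.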

We provide the proof in Appendix D. For , we bound the communication cost as follows.

###### Corollary 2.

There exists an implementation of $\pi_{sk}(\mathrm{Bin}(m,p))$ which achieves the same privacy and error as the full precision Gaussian mechanism with a total communication complexity of

$$n\cdot d\cdot\left(\log_2\left(\sqrt{d}+\frac{d}{n\varepsilon^2}\right)+O\left(\log\log\left(\frac{nd}{\varepsilon\delta}\right)\right)\right) \text{ bits}.$$

Therefore our results provide precise non-asymptotic and asymptotic guarantees on the total communication with respect to k. The communication cost of the above algorithm is $\Omega(\log d)$ bits per coordinate per client, which can be prohibitive. In the next section we show that these bounds can be improved via rotation.

### 5.3 Error reduction via randomized rotation

As seen in Corollary 2, if has error and privacy same as that of the Gaussian mechanism, it has high communication cost. The proof reveals that this is due to the error being proportional to . Therefore MSE reduces when is small, e.g., when is uniform on the unit sphere, is (whp) [10]. [34] showed that the same effect can be observed by randomly rotating the vectors before quantization. Here we show that random rotation reduces the leading term as well as improves the privacy guarantee.

Using public randomness, all clients and the central server generate a random orthogonal matrix $R\in\mathbb{R}^{d\times d}$ according to some known distribution. Given a protocol $\pi$ for DME which takes inputs $X_1,\ldots,X_n$, we define $\mathrm{Rot}(\pi)$ as the protocol where each client $i$ first computes

$$X'_i = RX_i,$$

and runs the protocol $\pi$ on $X'_1,\ldots,X'_n$. The server then obtains the mean estimate $\hat{\bar{X}}'$ in the rotated space using the protocol $\pi$ and then multiplies by $R^{-1}$ to obtain the coordinates in the original basis, i.e.,

$$\hat{\bar{X}} = R^{-1}\hat{\bar{X}}'.$$

Due to the fact that $d$ can be huge in practice, we need orthogonal matrices that permit fast matrix-vector products. Naive matrices that support fast multiplication, such as block-diagonal matrices, often result in high values of $X^{\max}$. Similar to [34], we propose to use a special type of orthogonal matrix $R = \frac{1}{\sqrt{d}}HA$, where $A$ is a random diagonal matrix with i.i.d. Rademacher entries ($\pm 1$ with probability $0.5$) and $H$ is a Walsh-Hadamard matrix [18]. The Walsh-Hadamard matrix of dimension $2^m$ for $m\in\mathbb{N}$ is given by the recursive formula,

$$H(2^1) = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad H(2^m) = \begin{bmatrix} H(2^{m-1}) & H(2^{m-1}) \\ H(2^{m-1}) & -H(2^{m-1}) \end{bmatrix}.$$

Applying both the rotation and its inverse takes $O(d\log d)$ time and $O(1)$ additional space (with an in-place algorithm).
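A rotation of this kind can be sketched with the standard fast Walsh-Hadamard transform. The sketch assumes $R = \frac{1}{\sqrt{d}}HA$ with $d$ a power of two and Rademacher signs on the diagonal of $A$; it copies its input for clarity rather than operating in place, and the function names are hypothetical.

```python
import math
import random

def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform H v; len(v) must be a power of 2."""
    v = list(v)  # work on a copy for clarity (an in-place variant would skip this)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b  # butterfly step
        h *= 2
    return v

def rotate(x, signs):
    """Compute R x = (1 / sqrt(d)) * H * A * x, where A = diag(signs)."""
    d = len(x)
    return [y / math.sqrt(d) for y in fwht([s * xi for s, xi in zip(signs, x)])]

def rotate_back(y, signs):
    """Compute R^{-1} y = (1 / sqrt(d)) * A * H * y, using that H is symmetric and A^2 = I."""
    d = len(y)
    return [s * yi / math.sqrt(d) for s, yi in zip(signs, fwht(y))]

rng = random.Random(0)
signs = [rng.choice([-1, 1]) for _ in range(8)]   # shared via public randomness
x = [0.5, -1.0, 2.0, 0.0, 3.0, -2.5, 1.5, 4.0]
x_rot = rotate(x, signs)
x_back = rotate_back(x_rot, signs)                # recovers x up to rounding
```

Each transform performs $d\log_2 d$ additions and subtractions, and because $R$ is orthogonal the Euclidean norm of the vector is preserved.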

The next theorem provides the MSE guarantees for $\mathrm{Rot}(\pi_{sk}(\mathrm{Bin}(m,p)))$.

###### Theorem 4 (Appendix E).

For any , let , then

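$$\mathcal{E}(\mathrm{Rot}(\pi_{sk}(\mathrm{Bin}(m,p))),HA) \le \frac{2\log\frac{2nd}{\delta}\cdot D^2}{n(k-1)^2}+\frac{8\log\frac{2nd}{\delta}\cdot mp(1-p)D^2}{n(k-1)^2}+4D^2\delta^2$$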

and the bias is . Further if then is differentially private where is given by Theorem 1 with sensitivity parameters (Eq. (11)). Furthermore,

$$\mathcal{C}(\mathrm{Rot}(\pi_{sk}(\mathrm{Bin}(m,p))),X^n) = n\cdot\left(d\log_2(k+m)+\tilde{O}(1)\right).$$

The following corollary bounds the communication cost for when .

###### Corollary 3.

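There exists an implementation of $\mathrm{Rot}(\pi_{sk}(\mathrm{Bin}(m,p)))$ that achieves the same error and privacy as the full precision Gaussian mechanism with a total communication complexity of

$$n\cdot d\left(\log_2\left(1+\frac{d}{n\varepsilon^2}\right)+O\left(\log\log\frac{dn}{\varepsilon\delta}\right)\right) \text{ bits}.$$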

Hence if , then has the same privacy and utilities as the Gaussian mechanism, but with just communication cost.

## 6 Discussion

We trained a three-layer model (60 hidden nodes each with ReLU activation) on the infinite MNIST dataset [8] with 25M data points and 25M clients. At each step 10,000 clients send their data to the server. This setting is close to real-world settings of federated learning where there are hundreds of millions of users. The results are in Figure 2. Note that the models achieve different levels of accuracy depending on communication cost and privacy parameter $\varepsilon$. We note that we trained the model with exactly one epoch, so each sample was used at most once in training. In this setting, the per batch $\varepsilon$ and the overall $\varepsilon$ are the same.

There are several interesting future directions. On the theoretical side, it is not clear if our analysis of Binomial mechanism is tight. Furthermore, it is interesting to have better privacy accounting for Binomial mechanism via a moments accountant. On the practical side, we plan to explore the effects of neural network topology, over-parametrization, and optimization algorithms on the accuracy of the privately learned models.

## 7 Acknowledgements

The authors would like to thank Keith Bonawitz, Vitaly Feldman, Jakub Konečný, Ben Kreuter, Ilya Mironov, and Kunal Talwar for their valuable suggestions and inputs.

## References

• [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
• [2] Martín Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM, 2016.
• [3] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, 2006.
• [4] Dan Alistarh, Demjan Grubic, Jerry Liu, Ryota Tomioka, and Milan Vojnovic. Communication-efficient stochastic gradient descent, with applications to neural networks. 2017.
• [5] Dan Alistarh, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Randomized quantization for communication-optimal stochastic gradient descent. arXiv:1610.02132, 2016.
• [6] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464–473. IEEE, 2014.
• [7] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. pages 1175–1191, 2017.
• [8] Léon Bottou. The infinite MNIST dataset, 2007.
• [9] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. Deep learning with COTS HPC systems. In International Conference on Machine Learning, pages 1337–1345, 2013.
• [10] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
• [11] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
• [12] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Eurocrypt, volume 4004, pages 486–503. Springer, 2006.
• [13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, volume 3876, pages 265–284. Springer, 2006.
• [14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407, August 2014.
• [15] Peter Elias. Universal codeword sets and representations of the integers. IEEE transactions on information theory, 21(2):194–203, 1975.
• [16] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
• [17] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1737–1746, 2015.
• [18] Kathy J Horadam. Hadamard matrices and their applications. Princeton university press, 2012.
• [19] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. IEEE Transactions on Information Theory, 63(6):4037–4049, 2017.
• [20] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
• [21] Jakub Konečný and Peter Richtárik. Randomized distributed mean estimation: Accuracy vs communication. arXiv preprint arXiv:1611.07555, 2016.
• [22] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 1, page 3, 2014.
• [23] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014.
• [24] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. International Conference on Learning Representations, 2018.
• [25] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
• [26] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In HLT, 2010.
• [27] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
• [28] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Aguera y Arcas. Federated learning of deep networks using model averaging. arXiv:1602.05629, 2016.
• [29] Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. Parallel training of deep neural networks with natural gradient and parameter averaging. arXiv preprint, 2014.
• [30] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
• [31] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
• [32] Anand D Sarwate and Kamalika Chaudhuri. Signal processing and machine learning with differential privacy: Algorithms and challenges for continuous data. IEEE signal processing magazine, 30(5):86–94, 2013.
• [33] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
• [34] Ananda Theertha Suresh, Felix X Yu, Sanjiv Kumar, and H Brendan McMahan. Distributed mean estimation with limited communication. In International Conference on Machine Learning, pages 3329–3337, 2017.
• [35] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. arXiv preprint arXiv:1705.07878, 2017.
• [36] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1307–1322. ACM, 2017.
• [37] Yusuke Tsuzuku, Hiroto Imachi, and Takuya Akiba. Variance-based gradient compression for efficient distributed deep learning, 2018.
• [38] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.

## Appendix A Proof of biased SGD

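The proof is similar to the SGD proof of [16]; however, we account for bias in the gradient estimates. Define the random variable $\delta_t = \tilde{g}_t(w_t) - \nabla F(w_t)$. By the $L$-smoothness of $F$ and the definition of $w_{t+1}$ (that is, $w_{t+1} - w_t = -\gamma \tilde{g}_t(w_t)$),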

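$$
\begin{aligned}
F(w_{t+1}) - F(w_t) &\le \nabla F(w_t)^T (w_{t+1} - w_t) + \frac{L}{2}\|w_{t+1} - w_t\|^2 \\
&\le -\nabla F(w_t)^T \big(\gamma \tilde{g}_t(w_t)\big) + \frac{\gamma^2 L}{2}\|\tilde{g}_t(w_t)\|^2 \\
&\le -\gamma\Big(1 - \frac{\gamma L}{2}\Big)\|\nabla F(w_t)\|^2 + \gamma(1 - \gamma L)\|\nabla F(w_t)\|\,\|\delta_t\| + \frac{\gamma^2 L}{2}\|\delta_t\|^2,
\end{aligned}
$$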

where the last inequality uses the fact that . Rearranging the above inequality and summing over all we get that

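$$
\begin{aligned}
\mathbb{E}_{t \in \mathrm{Uniform}(T)}\big[\|\nabla F(w_t)\|^2\big]
&\le \frac{1}{T\gamma(2 - \gamma L)}\big(2D_F + T\gamma^2 L \sigma^2\big) + \frac{2\gamma(1 - \gamma L)}{\gamma(2 - \gamma L)}\,DB \\
&\le \frac{2D_F}{T}\max\Big\{L, \frac{\sigma\sqrt{LT}}{\sqrt{2D_F}}\Big\} + \frac{\sigma\sqrt{2LD_F}}{\sqrt{T}} + DB \\
&\le \frac{2D_F L}{T} + \frac{2\sqrt{2LD_F}\,\sigma}{\sqrt{T}} + DB.
\end{aligned}
$$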
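As a hedged aside, the last inequality of the first display in this appendix can be unpacked as follows. The sketch assumes $\delta_t = \tilde{g}_t(w_t) - \nabla F(w_t)$ (consistent with the terms that appear in the display) and a step size with $\gamma L \le 1$; both are assumptions made here for illustration. Substituting $\tilde{g}_t(w_t) = \nabla F(w_t) + \delta_t$ and expanding gives

$$
\begin{aligned}
-\gamma \nabla F(w_t)^T \tilde{g}_t(w_t) + \frac{\gamma^2 L}{2}\|\tilde{g}_t(w_t)\|^2
&= -\gamma\Big(1 - \frac{\gamma L}{2}\Big)\|\nabla F(w_t)\|^2 - \gamma(1 - \gamma L)\,\nabla F(w_t)^T \delta_t + \frac{\gamma^2 L}{2}\|\delta_t\|^2 \\
&\le -\gamma\Big(1 - \frac{\gamma L}{2}\Big)\|\nabla F(w_t)\|^2 + \gamma(1 - \gamma L)\,\|\nabla F(w_t)\|\,\|\delta_t\| + \frac{\gamma^2 L}{2}\|\delta_t\|^2,
\end{aligned}
$$

where the second line applies the Cauchy-Schwarz inequality to the cross term, using $1 - \gamma L \ge 0$.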

## Appendix B Binomial Mechanism - Proof of Theorem 1

To remind the reader, the Binomial mechanism for releasing discrete-valued queries on a database is defined as follows. Given a set of databases and an integer-valued query on it, the Binomial mechanism samples a vector $Z$ such that all its coordinates are distributed as the binomial distribution with parameters $N, p$, i.e.

$$Z(j) \sim \mathrm{Bin}(N, p)$$

The Binomial mechanism releases the vector as the output to the query. For the analysis the reader is referred to the definition of norm sensitivity for any defined in (4). The of interest to us for the Binomial mechanism will be . Since our requirement from the Binomial mechanism will be symmetric w.r.t. and , throughout this proof, we assume that .
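To make the sampling step concrete, here is a minimal Python sketch. It assumes the coordinates $Z(j)$ are drawn independently and that the released value is the query answer plus centered, scaled noise, `query + scale * (Z - N*p)`. That release rule, the `scale` parameter, and the function names `sample_binomial` and `binomial_mechanism` are hypothetical choices for illustration, not taken from the text above.

```python
import random


def sample_binomial(n, p, rng):
    """Draw one Bin(n, p) sample as a sum of n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def binomial_mechanism(query_value, n, p, scale=1.0, rng=None):
    """Add centered binomial noise to each coordinate of an integer query.

    ASSUMPTION: the release rule query + scale * (Z - n*p) is illustrative;
    the exact rule is not recoverable from the surrounding text.
    """
    rng = rng or random.Random()
    return [
        q + scale * (sample_binomial(n, p, rng) - n * p)
        for q in query_value
    ]


# Example: a 3-dimensional integer query answer with N = 100, p = 0.5.
released = binomial_mechanism([3, -1, 0], n=100, p=0.5, rng=random.Random(0))
print(released)
```

Because the noise is centered at $Np$, the released vector is an unbiased estimate of the query answer under this assumed rule, and each coordinate's noise takes one of only $N + 1$ values.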

To prove Theorem 1, we need a few auxiliary lemmas. We first state two inequalities which we use throughout the proof.

###### Lemma 3 (Bernstein’s inequality).

Let $X_1, \ldots, X_n$ be independent random variables such that $\mathbb{E}[X_i] = 0$ and $|X_i| \le M$ w.p. 1. Let $\sigma_i^2 = \mathbb{E}[X_i^2]$. Then for any $\delta > 0$,

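$$\Pr\left(\sum_i X_i \ge \sqrt{2\sum_i \sigma_i^2 \log\frac{1}{\delta}} + \frac{2}{3}\cdot M \log\frac{1}{\delta}\right) \le \delta.$$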
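The bound can be checked numerically on a simple case. The sketch below is an illustration only: it takes $X_i = B_i - p$ with $B_i$ i.i.d. Bernoulli($p$), so that $\mathbb{E}[X_i] = 0$, $|X_i| \le M = \max(p, 1-p)$, and $\mathbb{E}[X_i^2] = p(1-p)$, and compares the exact binomial tail with $\delta$. The function names and the parameter values are hypothetical.

```python
import math


def bernstein_threshold(variances, bound, delta):
    """Deviation level sqrt(2 * sum(var) * log(1/delta)) + (2/3) * M * log(1/delta)."""
    log_term = math.log(1.0 / delta)
    return math.sqrt(2.0 * sum(variances) * log_term) + (2.0 / 3.0) * bound * log_term


def centered_binomial_tail(n, p, threshold):
    """Exact Pr(sum_i (B_i - p) >= threshold) for i.i.d. B_i ~ Bernoulli(p)."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k - n * p >= threshold
    )


n, p, delta = 200, 0.3, 0.05
threshold = bernstein_threshold([p * (1 - p)] * n, max(p, 1 - p), delta)
tail = centered_binomial_tail(n, p, threshold)
print(threshold, tail, tail <= delta)
```

The exact tail sits well below $\delta$ here, which is expected since Bernstein's inequality is not tight for moderate deviations.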
###### Lemma 4 (Efron-Stein inequality).

Let $f$ be a symmetric function of $n$ independent random variables $X_1, X_2, \ldots, X_n$. Let $X_1'$ be an i.i.d. copy of $X_1$; then

$$\mathrm{Var}(f) \le \frac{n}{2} \cdot \mathbb{E}\left[\big(f(X_1, X_2, \ldots, X_n) - f(X_1', X_2, \ldots, X_n)\big)^2\right].$$
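As a quick sanity check (an illustration, not part of the original argument), take $f(X_1, \ldots, X_n) = \sum_{i=1}^n X_i$ with $X_i$ i.i.d. Bernoulli($q$), where $q$ is a hypothetical parameter introduced only for this example. Then $f(X_1, X_2, \ldots, X_n) - f(X_1', X_2, \ldots, X_n) = X_1 - X_1'$, and

$$
\mathrm{Var}(f) = nq(1-q), \qquad \frac{n}{2}\cdot\mathbb{E}\big[(X_1 - X_1')^2\big] = \frac{n}{2}\cdot 2q(1-q) = nq(1-q),
$$

so the inequality holds with equality for sums of i.i.d. variables.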

We use the above two results in the next two lemmas.

###### Lemma 5.

Let , , , . Then

$$\frac{\Pr(T = i - t)}{\Pr(T = i)} \le \exp\left(t \cdot \log\frac{(i+1)(1-p)}{(N-i+1)p}\right)$$
###### Proof.
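$$
\begin{aligned}
\frac{\Pr(T = i - t)}{\Pr(T = i)} &\triangleq \frac{\binom{N}{i-t}}{\binom{N}{i}} \cdot \frac{p^{i-t}(1-p)^{N-i+t}}{p^{i}(1-p)^{N-i}} \\
&= \frac{i!\,(N-i)!}{(i-t)!\,(N-i+t)!}\left(\frac{1-p}{p}\right)^{t} \\
&\le \left(\frac{(i+1)(1-p)}{(N-i+1)p}\right)^{t},
\end{aligned}
$$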

where the inequality follows from considering separately the two cases in which $t$ is positive or negative. ∎
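The ratio bound can be verified exhaustively for a small binomial. The following Python sketch assumes $T \sim \mathrm{Bin}(N, p)$ and checks every pair $(i, t)$ with $0 \le i \le N$ and $0 \le i - t \le N$; this range, the helper names, and the values $N = 30$, $p = 0.3$ are assumptions made for illustration, since the lemma's exact hypotheses are not restated here.

```python
import math


def binom_pmf(n, p, k):
    """Pr(T = k) for T ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def ratio_and_bound(n, p, i, t):
    """Return Pr(T = i - t) / Pr(T = i) and the exponential bound of Lemma 5."""
    ratio = binom_pmf(n, p, i - t) / binom_pmf(n, p, i)
    log_arg = (i + 1) * (1 - p) / ((n - i + 1) * p)
    return ratio, math.exp(t * math.log(log_arg))


N, p = 30, 0.3
violations = []
for i in range(N + 1):
    for t in range(i - N, i + 1):  # keeps 0 <= i - t <= N
        ratio, bound = ratio_and_bound(N, p, i, t)
        # Small relative tolerance to absorb floating-point rounding.
        if ratio > bound * (1 + 1e-9):
            violations.append((i, t))
print(violations)
```

Both signs of $t$ are covered by the loop, mirroring the two cases in the proof.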

###### Lemma 6.

Let be real numbers. Let independently such that . Let be the event that for some , such that . Then for any , with probability conditioned on ,

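$$
\begin{aligned}
\sum_{i=1}^{d} t_i \left(\log\frac{(v_i+1)(1-p)}{(N-v_i+1)p} - \frac{v_i+1}{Np} + \frac{N-v_i+1}{N(1-p)}\right)
\le{} & \frac{2\|t\|_1\,(p^2+(1-p)^2)}{3Np(1-p)\Pr(A)} + \frac{\|t\|_2\, c_p}{Np(1-p)\sqrt{\Pr(A)}} \cdot \sqrt{\log\frac{1}{\delta}} \\
& + \frac{4\|t\|_\infty(\beta+1)^2\,(p^2+(1-p)^2)}{9N^2p^2(1-p)^2}\log\frac{1}{\delta},
\end{aligned}
$$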

where $c_p$ is given by

$$c_p \triangleq \sqrt{2}\,\big(3p^3 + 3(1-p)^3 + 2p^2 + 2(1-p)^2\big). \tag{12}$$
###### Proof.

Since and for any , ,

$$
\left|\log\frac{(v_i+1)(1-p)}{(N-v_i+1)p} - \frac{v_i+1}{Np} + \frac{N-v_i+1}{N(1-p)}\right|
\le 1.953\left|\frac{v_i+1-Np}{Np}\right|^2 + 1.953\left|\frac{N-v_i+1-(N-Np)}{N(1-p)}\right|^2.
$$

Hence we can bound the expectation as

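$$
\begin{aligned}
\mathbb{E}\left[\log\frac{(v_i+1)(1-p)}{(N-v_i+1)p} - \frac{v_i+1}{Np} + \frac{N-v_i+1}{N(1-p)} \,\middle|\, A\right]
&\overset{(a)}{\le} \frac{1}{\Pr(A)} \cdot \mathbb{E}\left[1.953\left|\frac{v_i+1-Np}{Np}\right|^2 + 1.953\left|\frac{N-v_i+1-(N-Np)}{N(1-p)}\right|^2\right] \\
&\overset{(b)}{\le} \frac{1}{\Pr(A)} \cdot \frac{2(p^2+(1-p)^2)}{3Np(1-p)},
\end{aligned}
$$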

where $(a)$ uses the fact that for any positive random variable $X$ and any event $A$, $\mathbb{E}[X \mid A] \le \mathbb{E}[X]/\Pr(A)$, and $(b)$ uses the fact that . Note that the function we are considering is a sum of functions of independent binomial random variables and hence we can apply Bernstein's inequality. To this end, we bound and . Since is bounded,

 ∣∣∣log(vi+1)(1−p)(N−vi+1)p−vi+1Np+N−vi+1N(1−p)∣∣∣ ≤23∣∣∣vi+1−NpNp∣∣∣2+23∣∣∣N−vi+1−