Differentially-Private “Draw and Discard” Machine Learning
Abstract.
In this work, we propose a novel framework for privacy-preserving client-distributed machine learning. It is motivated by the desire to achieve differential privacy guarantees in the local model of privacy in a way that satisfies all systems constraints, using asynchronous client-server communication, and provides attractive model learning properties. We call it “Draw and Discard” because it relies on random sampling of models for load distribution (scalability), which also provides additional server-side privacy protections and improved model quality through averaging. We present the mechanics of the client and server components of “Draw and Discard” and demonstrate how the framework can be applied to learning Generalized Linear models. We then analyze the privacy guarantees provided by our approach against several types of adversaries and showcase experimental results that provide evidence for the framework’s viability in practical deployments.
1. Introduction
In this work, we propose a Machine Learning (ML) framework, unique in many ways, that touches on several aspects of practical deployment of locally differentially private ML, all of which are equally important. These aspects include feasibility, scalability, efficiency, spam protection and, of course, privacy. Ideally, all of them must work together in a mutually reinforcing manner. From that perspective, this work is as much a systems contribution as it is a privacy and machine learning one.
Machine learning has made our mobile devices “smart”. Applications span a wide range of seemingly indispensable features, such as personalized app recommendations, next-word suggestions, feed ranking, face and fingerprint recognition and many others. The downside is that they often come at the expense of the privacy of the users who share their personal data with the parties providing these services. However, as we demonstrate in this work, this does not necessarily need to be the case.
Historically, ML was developed with a server-centric view of first collecting data in a central place and then training models based on it. Logistic regression (Cox, 1958) and neural networks (Rosenblatt, 1958), introduced over half a century ago, follow a now-familiar paradigm of reading training data from the local disk and adjusting model weights until certain convergence criteria are met. With the widespread use of mobile devices capable of generating massive amounts of personal information, backed by the convenience of cloud data storage and infrastructure, the community adopted the server-centric worldview of ML simply because it was convenient to do so. As the training data grew in size and could no longer fit on a single machine or even several machines, we ended up collecting data from millions of devices on one network and sending it “sharded” for training to another network of thousands of machines. In the past, this duality of responsibilities could be justified by large disparities in hardware capabilities between the two networks, but this line is blurrier at the present time.
Sharing personal data that contributes to a global ML model and benefits everyone on the network (in many cases, the data collector most of all) can be viewed as undesirable by many privacy-sensitive users, due to distrust in the data collector or risks of subpoenas, data breaches and internal threats (Madden and Rainie, 2015, 2016). Following the deployment of RAPPOR (Erlingsson et al., 2014), there has been increased interest in finding ways for users to contribute data to improve the services they receive, but to do so in a provably private manner, even with respect to the data collector itself (Portnoy et al., 2016). This desire is often expressed by companies (Greenberg, 2016; WWDC 2016, 2016), presumably in part to minimize risks and exposures.
To address the privacy-utility tradeoff of improving products while preserving the privacy of user data even from the data collector itself, we propose a novel client-centric distributed “Draw and Discard” machine learning framework (DDML). It provides differential privacy guarantees in the local model of privacy in a way that satisfies all systems constraints, using asynchronous client-server communication. We call it “Draw and Discard” because it relies on randomly sampling and discarding models. Specifically, DDML maintains k versions (or instances) of the machine learning model on a server, from which an instance is randomly selected to be updated by a client; subsequently, the updated instance randomly replaces one of the k instances on the server. The update is made in a differentially private manner with users’ private data stored locally on their mobile devices.
We focus our analyses and experiments with DDML on Generalized Linear Models (GLMs) (Nelder and Wedderburn, 1972), which include regular linear and logistic regressions. GLMs provide widely deployed and effective solutions for many applications over non-media-rich data, such as event prediction and ranking. The convex nature of GLMs makes them perfect candidates to explore client-side machine learning without having to worry about the convergence issues of more complex ML models. Extending DDML to Neural Networks and other models optimized through iterative gradient updates is quite straightforward.
We demonstrate through modeling, analyses, experiments and practical deployment that DDML provides attractive privacy, model learning and systems properties (feasibility, scalability, efficiency and spam protection). Specifically,

Local differential privacy: Through carefully calibrated noise in the model update step, the DDML design ensures local differential privacy (Dwork et al., 2006).

Privacy amplification against other adversaries: Furthermore, in DDML the full model update is performed on the client, and only the updated model, rather than raw gradients, is sent to the server. This strengthens the privacy guarantees DDML provides against adversaries that are weaker, but more realistic, than the strongest possible adversary operating in the local model.

Efficient model training: Due to the variance-stabilizing property of DDML, its final model averaging, and its relatively frequent model updates, DDML has superior finite-sample performance relative to server-side batching.

Asynchronous training: continuous and scalable training, without pausing the process for averaging and updating on the server side.

Spam protection: having k different instances of the same model allows the server to assess whether an incoming update is fraudulent, without knowledge of users’ private data.

Limited server-side footprint: the server stores much less data at any given time, since k is usually much smaller (we use k = 20) than the server-side batch size adopted by gradient-averaging techniques (usually around 10,000).
These properties will become clearer as we define and demonstrate them more precisely in the following sections.
DDML has two major differences from Federated Learning, an approach adopted by Google (McMahan et al., 2017a), which relies on server-side gradient batching and averaging, with a possibility of server-side noise addition.
First, we ensure differential privacy in the more desirable local privacy model. We perform direct, noisy updates of model weights on clients, as opposed to sending raw and exact gradients back to the server. This change offers local differential privacy guarantees and, more importantly, requires an attacker to know the preimage of the model (the model sent to the client for an update) in order to make any inference about private user data. Separating in time the two critical pieces of knowledge necessary to make any inference, the pre- and post-update models, especially in a high-throughput environment with instances being continuously updated, poses significant practical challenges for an adversary observing a stream of updates on the server side. We discuss this in detail in Section 4.
Second, our radically different server-side model collection and handling of data improves model training efficiency in practice. We make substantially more updates to the weights than practical server-side deployments, which, when coupled with the variance-stabilizing property discussed in Section 3.4.1 and model averaging, produces superior finite-sample performance.
Beyond these two major considerations, DDML offers completely lock-free, asynchronous, and thus more efficient, communication between the server and clients, which is an absolute must when developing in a massively distributed environment (Delange, 2017), as well as a straightforward distributed way to prevent model spamming by malicious actors without sacrificing user privacy.
We have implemented DDML at a large tech company and successfully trained many ML models. Our applications focus on ranking items, from a few dozen to several thousand, as well as security-oriented services, such as predicting how likely a URL a user receives is to be a phishing link. Our largest models contain several thousand weights, and we find k = 20 to be the right tradeoff between efficiency and the scale needed to avoid the “hotspotting” issue (Google, 2018). Currently, at peak times across several different applications, we receive approximately 200 model updates per second.
The paper is organized as follows: Section 2 reviews differential privacy and related work. Section 3 presents a detailed overview of our framework and its features, including the variance stabilizing property in Section 3.4.1. Section 4 introduces our modeling of possible adversaries and discusses DDML’s privacy properties with respect to them. In Section 5, we present experimental evaluations of DDML’s performance, followed by discussion in Section 6.
2. Related Work
Differential privacy (DP) (Dwork et al., 2006) has become the de facto standard for privacy-preserving data analysis (Dwork and Roth, 2014; Dwork, 2011; European Association for Theoretical Computer Science, 2017).
A randomized algorithm A is (ε, δ)-differentially private if for all databases D and D′ differing in one user’s data, the following inequality is satisfied for all possible sets of outputs S:

Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ.

The parameter ε is called the privacy loss or privacy budget of the algorithm (Nissim et al., forthcoming), and measures the increase in risk due to choosing to participate in the DP data collection. The variant of DP with δ = 0 is the strongest possible differential privacy variant, called pure differential privacy, whereas δ > 0 allows differential privacy to fail with small probability and is called approximate differential privacy.
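As a concrete illustration (ours, not part of the framework), the canonical ε-DP primitive is the Laplace mechanism: for a query of sensitivity 1, adding Laplace noise of scale 1/ε satisfies the inequality above with δ = 0. A minimal sketch, with illustrative names:

```python
import numpy as np

def laplace_count(true_count, eps, rng=None):
    """Release a counting query (sensitivity 1) with eps-differential privacy."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(0.0, 1.0 / eps)
```

Smaller ε means a larger noise scale and therefore a stronger privacy guarantee.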
ML in the Trusted-Curator Model: Most prior work on differentially private machine learning assumes a trusted-curator model, where the data is first collected by the company and only then is a privacy-preserving computation run on it (Abadi et al., 2016; Song et al., 2013; Papernot et al., 2016; Chaudhuri and Monteleoni, 2009). The trusted-curator model is less than ideal from the user privacy perspective, as it does not provide privacy from the company collecting the data and, in particular, leaves user data fully vulnerable to security breaches, subpoenas and malicious employees. Furthermore, even in the trusted-curator model, differentially private deep learning that achieves good utility with reasonable privacy parameters has been an elusive goal (Shokri and Shmatikov, 2015; McSherry, 2017; Abadi et al., 2016). For example, the work of (Abadi et al., 2016) performs well on MNIST data but struggles utility-wise on CIFAR for reasonable privacy parameters.
The work closest to ours is Federated Learning (McMahan et al., 2017a, b). We already discussed the major differences between DDML and Federated Learning in the introduction. We will further elaborate on the distinctions and properties of DDML in Sections 3.4 and 4.
ML in the Local Model: The pioneering work of RAPPOR (Erlingsson et al., 2014) for industry deployment has been followed by several recent efforts to deploy DP in the local model, i.e., to guarantee DP to the user before the data reaches the collector. Privacy in the local model is more desirable from the user’s perspective (Greenberg, 2016; Madden and Rainie, 2015; Portnoy et al., 2016; WWDC 2016, 2016), as in that case the user does not have to trust the data collector, or worry about the data being subject to internal or external threats at the data collector.
Since the focus on differentially private computations in the local model is recent, most, if not all, efforts to date have been limited to learning aggregate statistics, rather than training more complex machine learning models (Fanti et al., 2016; Apple, 2017; Bassily and Smith, 2015; Bassily et al., 2017; Bun et al., 2017). There are also numerous results on the so-called sample complexity of the local model, showing that the number of data points needed to achieve comparable accuracy is significantly higher in the local model than in the trusted-curator model (Kairouz et al., 2014).
DDML can be considered an extension of the existing literature on locally private learning. In particular, it supplements the private histogram collection of RAPPOR (Erlingsson et al., 2014) and the learning of simple associations (Fanti et al., 2016) by allowing estimation of arbitrary conditional expectations. While RAPPOR allows one to estimate marginal and joint distributions of categorical variables, DDML provides a principled framework for estimating conditional distributions in a privacy-preserving manner. For example, one can estimate the average value of y given features x by fitting a regular linear model described by E[y | x] = x^T w.
3. Draw and Discard Machine Learning
In this section, we present our “Draw and Discard” machine learning framework, characterized by its two major components: client-side noise addition and the “Draw and Discard” server architecture. Together, these contribute strong differential privacy guarantees for client data while supporting efficient model training and client-server communication.
At the heart of DDML is the server-side idea of maintaining k model instances and randomly updating one of them. This architecture presents a number of interesting properties and contributes to many aspects of the framework’s scalability, privacy, and spam and abuse protections.
DDML is not a new ML framework per se. It is model-agnostic and, in principle, works with any supervised ML model, though the details of the client-side update and noise addition vary. The scope of this work is limited to Generalized Linear Models (GLMs), and we focus specifically on logistic regression as an example of an ML model that is very popular in practice. We give a brief overview of GLMs and fully describe the DDML client and server architectures next.
3.1. GLMs
In GLMs (Nelder and Wedderburn, 1972), the outcome or response variable y is assumed to be generated from a particular distribution in the exponential family, which includes the normal (regular linear model), binomial (logistic regression) and Poisson (Poisson regression) distributions, among many others. Mathematically, GLMs model the relationship between the response y and features x through a link function g, whose exact form depends on the assumed distribution:

g(E[y | x]) = x^T w.    (1)
To train GLM models on clients, we use Stochastic Gradient Descent (SGD) for maximum likelihood estimation, as discussed below.
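To make the GLM setup concrete, here is a small sketch (ours; function names are illustrative) of two common link functions, the identity link for linear regression and the logit link for logistic regression. For both, the per-example log-likelihood gradient used by the SGD steps below takes the form x(ŷ − y):

```python
import numpy as np

def predict_linear(w, x):
    # identity link: E[y | x] = x^T w
    return x @ w

def predict_logistic(w, x):
    # logit link: E[y | x] = 1 / (1 + exp(-x^T w))
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def glm_gradient(w, x, y, predict):
    # per-example gradient for canonical-link GLMs: x * (y_hat - y)
    return x * (predict(w, x) - y)
```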
3.2. DDML Client-Side Update
SGD is a widely used iterative procedure for minimizing an objective function

F(w) = (1/n) Σ_{i=1}^{n} f_i(w),    (2)

where w is the vector of weights to be estimated and f_i is the functional component associated with the i-th observation. Traditional optimization techniques require differentiating F(w), which, in turn, requires access to all n data points at once. SGD approximates the gradient ∇F(w) with ∇F_b(w), computed on a small batch of b observations available on a single client:

∇F_b(w) = (1/b) Σ_{i=1}^{b} ∇f_i(w).    (3)
To provide local privacy by adding random Laplace noise, a differentially-private SGD (DP-SGD) update step is performed on a client using the b observations stored locally:

w_{t+1} = w_t - η (∇F_b(w_t) + Lap(0, s/ε)),    (4)

where η is a learning rate and Lap(0, s/ε) denotes the Laplace distribution with mean 0 and scale s/ε; s is called the sensitivity in the differential privacy literature and ε is the privacy budget (Dwork and Roth, 2014).
For GLMs, assuming all features are normalized to the interval [0, 1] and the average gradients are clipped to [-1, 1] (indicated by square brackets), the differentially-private update step becomes

w_{t+1} = w_t - η ([∇F_b(w_t)] + Lap(0, 2/ε)), where ∇F_b(w_t) = (1/b) Σ_i x_i (ŷ_i - y_i).

Here, ŷ_i is the predicted value of y_i given a feature vector x_i and the model w_t. For logistic regression, if all features are normalized to [0, 1], no gradient clipping is necessary.
The DDML client-side architecture is shown in Algorithm 1.
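The client-side step can be sketched as follows (our illustration of the update for logistic regression; the names and the choices of learning rate and budget are illustrative). The client computes the average gradient on its local batch, clips it, adds Laplace noise, and returns the updated weights. The noise scale 2/ε assumed here corresponds to a clipped-gradient range of [-1, 1], i.e., sensitivity 2:

```python
import numpy as np

def client_update(w, X, y, eta=0.1, eps=1.0, rng=None):
    """One differentially private SGD step for logistic regression."""
    rng = rng or np.random.default_rng()
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (y_hat - y) / len(y)      # average gradient on the local batch
    grad = np.clip(grad, -1.0, 1.0)        # bound the sensitivity
    noise = rng.laplace(0.0, 2.0 / eps, size=w.shape)
    return w - eta * (grad + noise)
```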
3.3. DDML Server-Side Draw and Discard
While maintaining k model instances on a server (k versions of the same model with slightly different weights), we randomly “draw” one of the k instances, update it on a client, and put it back into the queue by “discarding” an instance chosen uniformly at random. With probability 1/k, we will replace the same instance, while with probability (k - 1)/k, we will replace a different one.
This seemingly simple scheme has significant practical implications for performance, quality, privacy, and antispam, which we discuss in Section 3.4.
The DDML server-side architecture is shown in Algorithm 2.
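The server side can be sketched in a few lines (our illustration; class and method names are ours). Draw and discard are independent uniform choices, so with probability 1/k the returned model overwrites its own preimage:

```python
import random

class DrawDiscardServer:
    def __init__(self, instances):
        self.instances = list(instances)   # the k model-weight vectors

    def draw(self):
        # serve a uniformly random instance to a requesting client
        return random.choice(self.instances)

    def discard(self, updated_model):
        # overwrite a uniformly random slot with the client's updated model
        self.instances[random.randrange(len(self.instances))] = updated_model
```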
Model Initialization: We initialize our k model instances randomly from a normal distribution, with means that are usually taken to be 0 in the absence of better starting values, and with variance ν calibrated to σ², the variance of the Laplace noise added on the client side.

Because of the variance-stabilizing property (to be discussed in detail in Section 3.4.1), ν will remain the same in expectation even after a large number of updates. It is crucial for our spam detection solution that this initialization happens correctly and that the right amount of initial noise is added, so that the update step on a client is calibrated with the variance of the k instances on the server side.
Model Averaging: We average the weights from all k instances for final predictions. Of course, depending on the application, another way of using the k versions of the same model could be preferred, such as averaging the predicted values from each instance.
3.4. Properties and Features of DDML
We now describe properties of DDML that distinguish it from existing solutions and make it feasible and scalable for practical deployments.
Variance-stabilizing Property of DDML
It is widely known that adding random noise with mean 0 and variance σ² at each update step leads to increased variance over time. Consequently, after n updates, the variance of such a process is equal to nσ². This unfortunate fact plagues any strictly sequential update mechanism (k = 1) and sometimes leads to accurate but less precise estimates as model training evolves. The most remarkable property of the “Draw and Discard” algorithm with k > 1 is its variance-stabilizing property, shown schematically in the first panel of Figure 1. We prove in Theorem 3.1 that the expected variance ν of the k instances of the model remains unchanged after an arbitrarily large number of updates when adding noise with mean 0 and variance σ².
Theorem 3.1.
Let there be k models where each weight
has mean μ and variance ν, with ν calibrated to the noise variance σ². Selecting one model at random, adding noise with mean 0 and variance σ² to each weight, and putting the model back with replacement does not change the expected variance of the weights (i.e., they remain distributed with variance ν).
The intuition behind this theorem is that with probability 1/k we replace the same model that was drawn, which increases the variance of the instances. This increase, however, is exactly offset by the decrease in variance in the cases when we replace a different model, which happens with probability (k - 1)/k, because the original and updated models are highly correlated.
Proof.
We use the Law of Total Variance, conditioning on whether the discarded model is the same as the drawn one.

Replacing the same model as drawn occurs with probability 1/k; in this case, the variance of the updated instance’s weight simply increases by the noise variance σ².

Replacing a different model partitions the k instances into the 2 models involved in the swap and the remaining k - 2 models, which makes the between-group component of the variance nonzero. The overall mean after the update becomes a mixture of the mean of the model selected and the mean of the model replaced, each of which is distributed with mean μ and variance ν.

Computing the conditional variances in each case and combining them shows that the increase from the first case is exactly offset by the decrease from the second. Note that the variance component must be computed with denominator k and not k - 1 because of the finite number of instances.

Putting it all together, the expected variance after the update remains ν.
∎
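The variance-stabilizing behavior is easy to check empirically. The sketch below (ours; parameter values are illustrative) tracks the across-instance variance of a single weight under repeated draw/discard updates with Laplace noise of variance σ². The variance settles around a constant instead of growing linearly with the number of updates, as it would for a strictly sequential (k = 1) process:

```python
import numpy as np

rng = np.random.default_rng(1)
k, sigma2, n_updates = 20, 1.0, 20_000
scale = np.sqrt(sigma2 / 2.0)          # Lap(0, b) has variance 2*b^2
w = np.zeros(k)                        # k instances of a single weight
history = []
for _ in range(n_updates):
    drawn = w[rng.integers(k)]                            # "draw" a random instance
    w[rng.integers(k)] = drawn + rng.laplace(0.0, scale)  # noisy update, then "discard"
    history.append(w.var())            # population variance across the k instances

print(round(float(np.mean(history[-5000:])), 2))  # hovers around a constant
```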
DDML, due to its “Draw and Discard” replacement, offers the ability to add noise at each update (and thus provide local privacy) while dissipating this additional variability through random model discarding, a step that may seem wasteful and inefficient at first glance.
Asynchronous Learning.
Maintaining k model instances allows for scalable, simple and asynchronous model updating, with thousands of update requests per second. It is trivial to implement, relies on its inherent randomness for load distribution, and requires no locking mechanism that would pause the learning process to aggregate or summarize results.
Differential Privacy.
Due to the random sampling of model instances, the DDML server architecture uniquely contributes to the differential privacy guarantees, as will be discussed in Section 4. Specifically, by keeping only the last k models from clients, discarding models at random, and avoiding server-side data batching, DDML fulfills the goal of keeping as little data as possible on the server. Through nuanced modeling of possible adversaries (Section 4.1) and corresponding privacy analyses, DDML not only provides privacy guarantees in the local model, but also improves these guarantees against weaker but realistic adversaries.
Ability to Prevent Spam without Sacrificing Privacy.
The k instances are instrumental in spam and abuse prevention, which is a real and ubiquitous pain point in all major client-server deployments. Nothing prevents a client from sending an arbitrary model back to the server. We could keep track of which original instance was sent to each client; however, this would negate the server-side privacy benefits and pose implementation challenges due to asynchronicity. In DDML, having k replicates of each weight allows us to compute their means and standard deviations and assess whether an updated model is consistent with these weight distributions (testing whether each updated value is within a few standard deviations of the mean), removing the need to make tradeoffs between privacy and anti-abuse.
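A sketch of this consistency check (ours; the threshold z is illustrative): compute per-weight means and standard deviations across the k stored instances and reject updates whose weights land outside z standard deviations.

```python
import numpy as np

def is_plausible_update(instances, updated, z=3.0):
    """Flag updates inconsistent with the per-weight distribution across instances."""
    W = np.asarray(instances)              # shape (k, num_weights)
    mu = W.mean(axis=0)
    sd = W.std(axis=0) + 1e-12             # guard against constant weights
    return bool(np.all(np.abs(np.asarray(updated) - mu) <= z * sd))
```

Note that the check uses only the stored instances, not any client’s private data.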
Improved Performance.
Lastly, averaging the k models for prediction naturally extends DDML into an ensemble model, and ensembles have been shown to perform well in practice. Currently, the best-performing models on the MNIST handwritten digit classification task are neural net committees.
4. Privacy of DDML
We now discuss the differential privacy guarantees provided by DDML. Our analyses are with respect to feature-level differential privacy, as discussed in Section 6.3.1, but they can be easily extended to model-level privacy by scaling up the noise by the number of features or by adjusting the norm of the gradient in Algorithm 1.
4.1. Adversary Modeling
The main innovation of our work with respect to privacy analyses comes from a more nuanced modeling of heterogeneous adversaries, and from the demonstration that the privacy guarantees a client obtains against the strongest possible adversary operating in the local model of privacy are further strengthened by DDML against weaker but realistic adversaries.
Our work introduces and considers three kinds of adversaries, differing in the power of their capabilities:
I (Channel Listener): is able to observe the communication channel between the client and the server in real time and, therefore, sees both the model instance sent to the client and the updated model instance sent from the client to the server.
II (Internal Threat): is able to observe the models on the server at a given point in time; i.e., this adversary can see model instances M_1 through M_k, but lacks knowledge of which of the k instances was the preimage for the latest model update, due to lack of visibility into the communication channel.
III (Opportunistic Threat): can observe a model instance at a random time, but has no knowledge of the state of the model weights over the preceding updates; i.e., this adversary can, for example, see models at regular time intervals whose length greatly exceeds the time between updates. Clients themselves are such threats, as they periodically receive a model to update.
The first adversary is the most powerful, and the privacy guarantees we provide against this adversary (Section 4.2) correspond to the local model of privacy commonly considered by the differential privacy community (Section 2).
The second adversary models the ability to access the k model instances from within the entity running DDML. It is reasonable to assume that such an adversary may be able to obtain several snapshots of the models, though it would be technically infeasible and/or prohibitively costly to obtain snapshots at a granularity that allows observation of the models before and after every single update. In other words, this adversary may see the models knowing that they have just been updated with a particular client’s data, but would not know which of the k models was the source, or preimage, for the latest update. The privacy guarantee against this adversary (Section 4.3) will be stronger than against the Channel Listener.
The third type of adversary is the weakest and also the most common. Occasional access to models allows attackers to obtain a snapshot of the k model instances (in the case of an internal threat) or a single model instance (in the case of a client who receives a model to update) after a reasonably large number of updates n. Because n independent noise additions have occurred in the meantime, each model instance has received an expected n/k updates, and therefore n/k independent noise additions, after a particular user’s update. Every user benefits from this additional noise to a different degree, depending on the order in which their data was ingested, and, in expectation and with high probability, enjoys significantly stronger differential privacy guarantees against this adversary than those of the local model, as will be shown in Section 4.4.
4.2. Privacy against Channel Listener (Adversary I)
4.3. Privacy against Internal Threat (Adversary II)
We now show that the DDML server-side design has privacy amplification properties whenever one considers an adversary of type II. We illustrate that by computing the expected privacy loss against adversary II, where the expectation is taken over the random coin tosses of the DDML server-side Algorithm 2 that chooses the model instances to serve and replace.
Lemma 4.1.
The expected privacy loss against adversary II is ((k - 1)/k) · ε, where the expectation is taken over the random coin tosses of the DDML server-side Algorithm 2 that chooses the model instances to serve and replace.
Proof.
Recall the DDML algorithm and the adversary model. Adversary II knows that either 1) the client’s update overwrote the model instance that was its preimage, so the preimage is no longer among the k instances they can see, or 2) the client updated an existing model whose preimage is still observable among the k, but the adversary doesn’t know which one it was. We will call the model that was sent to the client the preimage and the resulting returned model the update.
Because of the design of DDML, the first scenario occurs with probability 1/k and the second scenario occurs with probability (k - 1)/k. Moreover, if we are in the first scenario (i.e., the preimage is no longer among the k models due to the “discard” part of DDML), then the client has perfect privacy against adversary II. Indeed, due to the nature of the update step in GLMs, the returned model provides equal support for any client input when the preimage is unknown. In other words, the privacy loss in the first scenario, ε_1, is 0.
We now compute the privacy loss in the second scenario, when the client updated an existing model whose preimage is still observable among the k instances, but the adversary doesn’t know which one was updated. We first do the analysis for k = 2, and then generalize it to any k.
In this case, the privacy loss is defined as

ε_2 = max ln ( Pr[(M_1, M_2) | v] / Pr[(M_1, M_2) | v′] ),

where the maximum is over all possible outputs (M_1, M_2) in the range of Algorithm 1, the probability is taken over the random choices of Algorithms 1 and 2, and v and v′ are the private values of the client (in DDML’s case, the clipped average gradients in (4)).
Expanding to account for the uncertainty of adversary II as to whether M_2 is the updated model and M_1 its preimage, or vice versa, we have

Pr[(M_1, M_2) | v] = (1/2) Pr[M_2 is the update of M_1 | v] + (1/2) Pr[M_1 is the update of M_2 | v],

with the probabilities now being taken over the random choices of the client-side Algorithm 1.
Plugging in the density of the Laplace noise introduced by Algorithm 1, and using the properties of the Laplace distribution, a case analysis shows that the maximum is achieved when M_1 = M_2 and the two candidate private values v and v′ differ maximally under clipping. Thus, ε_2 = ε.
Therefore, the expected privacy loss for k = 2 is (1/2) · 0 + (1/2) · ε = ε/2.
Given the result for k = 2, it can be shown that the maximum in the case of k models is also achieved when all of the model instances are equal and the candidate private values differ maximally.
Hence, in the case of k models the overall expected privacy loss is ((k - 1)/k) · ε, as desired. ∎
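The two-scenario accounting in this proof reduces to a one-line computation (our sketch): with probability 1/k the preimage is discarded and the loss is 0; otherwise the loss is at most the local budget ε.

```python
def expected_privacy_loss(eps, k):
    # mixture of the two scenarios: (1/k) * 0 + ((k - 1)/k) * eps
    return (k - 1) / k * eps
```

For example, with k = 2 the expected loss is half the local budget; with k = 20 instances the amplification over the local guarantee is modest in expectation.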
Note that the largest privacy loss is achieved when all model instances are identical, which is consistent with intuition: when all model instances are identical, the privacy amplification against adversary II comes only from the “discard” step of DDML; whereas when the model instances held by the server are non-identical, there is additional benefit from the uncertainty introduced by the server-side model management of Algorithm 2 against adversary II. Specifically, the adversary does not know which model instance was the one returned and which was the preimage, providing additional privacy amplification.
At first, the privacy amplification from ε to ((k - 1)/k) · ε for adversary II over adversary I may seem insignificant. This is not the case, for two reasons: first, the constants in the privacy loss matter, since, by the definition of differential privacy, the privacy risk incurred by an individual choosing to contribute their data grows exponentially with the privacy loss parameter (Nissim et al., forthcoming; Dwork and Pappas, 2017). Since differential privacy started to gain traction in industry, significant debate has been devoted to establishing what is a reasonable privacy loss rate and to optimizing analyses and algorithms to reduce the privacy loss needed (Greenberg, 2017; Tang et al., 2017; Papernot et al., 2018).
Second, the privacy loss of Lemma 4.1 is very unlikely to be realized in practice, as the scenario of all the model instances being identical is unlikely. The probability of this event can be studied using the variance-stabilization Theorem 3.1. The argument would take the form: with high probability, due to variance stabilization, there are several model instances that are not identical and can therefore serve, interchangeably, as either preimage candidates or the instance returned by the client. The higher the number of plausible preimage candidates among the model instances, the less certainty the adversary has about the update, and therefore the smaller the privacy loss. We plan to formalize this intuition in future work.
4.4. Privacy against Opportunistic Threat (Adversary III)
Finally, we analyze the privacy guarantees DDML provides against adversary III, the one able to inspect a random model instance out of the k models after a user of interest to the adversary has submitted their model instance and an expected m updates to that model instance have occurred since. Note that in practice, the adversary may have an estimate of m, but not know it precisely, as it is difficult to measure how many updates have occurred to a model instance in a distributed system serving millions of clients, such as DDML.
The privacy amplification against this adversary comes from two sources: from the “discard” step, in that it contributes to the possibility that the model the user contributed to is discarded in the long term, and from the accumulation of noise, in that with each model update additional random noise is added, which contributes further privacy protection for the user whose update occurred in preceding steps. The analysis of the privacy amplification due to the “discard” step is presented in Lemma 4.2; the analysis due to noise accumulation, in Lemma 4.3.
Lemma 4.2.
After N updates, the probability that a contribution of a particular individual is no longer present in any of the k models approaches (k−1)/k as N → ∞, and is close to this limit already for moderate N.
Proof of Lemma 4.2.
Consider a particular model that initially appears once among the k models. At each update of Algorithm 2,
– with probability (k−1)/k², this model gets overwritten by another model;
– with probability ((k−1)² + 1)/k², there continues to be only one copy of a model derived from this model;
– with probability (k−1)/k², the number of copies of models derived from this model increases.
So the number of copies of models derived from this model increases and decreases with equal probability at every update: it is a fair (martingale) process that is eventually absorbed either at 0 (the contribution has disappeared from all instances) or at k (it is present in all of them), and absorption at k happens with probability only 1/k. Hence the probability that a contribution of a particular individual is no longer present in any of the k models after N updates approaches (k−1)/k as N grows.
∎
In particular, as N → ∞, the probability of disappearance tends to (k−1)/k, which is close to 1 even for moderate k.
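The discard dynamics analyzed in the proof above can be checked with a short Monte Carlo simulation. The sketch below is ours (the function name, k = 20, and the update counts are illustrative choices, not from the paper); it tracks whether any server-side instance still descends from a tagged contribution:

```python
import random

def disappearance_probability(k, n_updates, trials=1000, seed=0):
    """Monte Carlo estimate of the probability that a tagged contribution
    is no longer present in any of the k model instances after n_updates
    draw-and-discard steps."""
    rng = random.Random(seed)
    gone = 0
    for _ in range(trials):
        # True marks instances whose lineage includes the tagged contribution;
        # initially it lives in exactly one of the k instances.
        derived = [True] + [False] * (k - 1)
        for _ in range(n_updates):
            drawn = rng.randrange(k)      # client draws a random instance
            discarded = rng.randrange(k)  # server overwrites a random instance
            derived[discarded] = derived[drawn]
        gone += not any(derived)
    return gone / trials

# The estimated disappearance probability grows with the number of updates.
for n in (20, 200, 2000):
    print(n, disappearance_probability(20, n))
```

As the number of updates grows, the estimate approaches the (k−1)/k limit discussed above.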
Lemma 4.3.
With high probability, DDML guarantees a user (ε_m, δ)-differential privacy against adversary III,
where
ε_m ≈ ε·√(ln(1/(2δ))/m),
δ is an arbitrarily small constant (typically chosen as O(1/number of users)), and m is the number of updates made to the model instance between when a user submitted their instance update and when the adversary observes the k instances. The statement holds if m is sufficiently large.
Proof of Lemma 4.3.
We rely on a result from the concurrent and independent work of (Feldman et al., 2018), obtained in a different context, to analyze the privacy amplification in this case. Specifically, their result states that for any contractive noisy process, the privacy amplification is at least that of the identity contraction, which we analyze below.
The sum of m random variables drawn independently from the Laplace distribution with mean 0 will tend toward a normal distribution for sufficiently large m, by the Central Limit Theorem. In DDML’s case with Laplace noise of scale b, the variance of each random variable is 2b²; therefore, if we assume that the adversary observes the model instance after m updates to it, the variance of the noise added will be 2mb². This corresponds to a Gaussian with scale σ = b√(2m).
Lemma 1 from (Kenthapadi et al., 2013) states that for points in d-dimensional space that differ by at most w in the ℓ₂ norm, addition of noise drawn from N(0, σ²) to each coordinate, where σ ≥ (w/ε)·√(2(ln(1/(2δ)) + ε)), ensures (ε, δ)-differential privacy. We use the result of Lemma 1 from (Kenthapadi et al., 2013), rather than the more commonly referenced result from Theorem A.1 of (Dwork and Roth, 2014), because the latter result holds only for ε ≤ 1, which is not the privacy loss used in most practical applications.
We now ask the question: what approximate differential privacy guarantee is achieved by DDML against adversary III? To answer it, we fix a desired level of δ and use the approximation obtained from the Central Limit Theorem to solve for ε_m.
Solving the quadratic inequality σ²ε_m² ≥ 2w²(ln(1/(2δ)) + ε_m) for ε_m, we have:
ε_m ≥ (w² + w·√(w² + 2σ²·ln(1/(2δ))))/σ².
For large m, the additive terms in w² are negligible, so we have:
ε_m ≈ (w/σ)·√(2·ln(1/(2δ))).
In DDML, the client-side Laplace noise has scale b = w/ε and, after m updates, σ = b√(2m); therefore,
ε_m ≈ ε·√(ln(1/(2δ))/m).
∎
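The Central Limit Theorem step in the proof above is easy to sanity-check numerically; b = 0.5 and m = 100 below are arbitrary illustrative values of our choosing:

```python
import numpy as np

# Check the CLT step: noise accumulated over m updates, each adding
# independent Laplace(b) noise, has variance 2*m*b**2.
rng = np.random.default_rng(42)
b, m, trials = 0.5, 100, 50_000

# Each trial sums m independent Laplace(b) draws (one per model update).
totals = rng.laplace(scale=b, size=(trials, m)).sum(axis=1)

empirical_var = totals.var()
predicted_var = 2 * m * b**2  # Var[Laplace(b)] = 2*b^2, summed over m updates
print(empirical_var, predicted_var)
```

The two variances agree closely, supporting the Gaussian approximation with scale b√(2m) used in the proof.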
Consider a choice of δ = e⁻⁹/2 ≈ 0.00006. Then Lemma 4.3 states that ε_m ≈ 3ε/√m. In other words, if DDML guarantees ε pure differential privacy in the local model against an adversary who can observe the channel communication between the client and the server, then, with high probability, it provides a (3ε/√m, δ) approximate differential privacy guarantee against the weaker adversary who can observe the k models after m updates. Although pure and approximate differential privacy are not directly comparable, one interpretation of this result is that the privacy loss guarantee against the more realistic and more common adversary III improves inversely proportionally to the square root of m, the number of updates that occur before that adversary can observe the model instances – equivalently, with the speed with which the adversary can obtain the models after learning that the target user of interest has sent the server an instance updated with their data. Even though the inverse dependence of the privacy loss against adversary III on √m is only approximate, it is noteworthy: in practical applications, it may allow choosing a higher ε and thus improve the utility and convergence speed of the learning framework. Indeed, adversary I is extremely unlikely and/or requires significant effort to implement; therefore, it may be acceptable to tolerate a higher privacy loss against it while simultaneously maintaining a sensible privacy loss against the more realistic and less resource-intensive adversary III.
5. Experiments and Results
We study the performance of the DDML framework using multiclass logistic regression. We evaluate the impact of different choices of k (the number of model instances) and different levels of the desired privacy budget, ε, on both model loss (training set) and accuracy (holdout set). In addition, we compare the convergence properties of “Draw and Discard” with the standard server-side batching of the currently popular Federated Learning approach (McMahan et al., 2017a). By server-side batching, we mean a server model in which updates are streamed into temporary storage (on a server), accumulated (usually in the thousands), and then averaged to make a single model update.
5.1. Experiment Configurations
We conduct our study on the MNIST digit recognition challenge (LeCun et al., 1998) and use multiclass logistic regression with cross-entropy loss as our model across different DDML configurations. The MNIST dataset has a training set of 60,000 examples and a test set of 10,000 examples. Each 28x28 image is serialized to a 784-dimensional vector that serves as the features for predicting the 0–9 handwritten digits.
We set our learning rate identically for all experiments and standardize all features. We simulate “clients” by randomly assigning 10 examples to each one, resulting in 6,000 “clients”. We also make 20 passes over the training data for all configurations. Because the number of model updates differs across experiment configurations, we standardize our experiments by looking at the sample size, i.e., the number of data points ingested by the algorithm up to that point. The results of our experiments are visualized in Figure 2. For experiments where k > 1, we initialize the instances using a normal distribution with mean 0 and variance equal to the client-side noise variance. For cases when we do not add noise on the client side, we use a nominal ε in the client-side noise calculation for initialization.
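The simulated training loop can be sketched end-to-end on synthetic data. Everything in the snippet below is an illustrative assumption of ours rather than the paper’s exact configuration: a small binary logistic-regression dataset stands in for MNIST, and k = 10, η = 0.1, ε = 5, and the Laplace scale 2ηΔ/ε are example values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary logistic-regression data standing in for MNIST
# (the actual experiments use 784 features and 10 classes).
n_clients, d = 2000, 5
true_w = rng.normal(size=d)
X = rng.normal(size=(n_clients, d))
y = (1 / (1 + np.exp(-X @ true_w)) > rng.random(n_clients)).astype(float)

k = 10       # number of server-side model instances
eta = 0.1    # learning rate
delta = 1.0  # gradient clipping range [-delta, delta]
eps = 5.0    # per-feature privacy parameter; the Laplace scale
             # 2*eta*delta/eps is our assumed calibration
models = rng.normal(scale=0.01, size=(k, d))

for i in range(n_clients):
    j = rng.integers(k)                  # "draw": client gets a random instance
    w = models[j].copy()
    p = 1 / (1 + np.exp(-X[i] @ w))
    grad = (p - y[i]) * X[i]             # per-example logistic gradient
    grad = np.clip(grad, -delta, delta)  # bound sensitivity via clipping
    w -= eta * grad
    w += rng.laplace(scale=2 * eta * delta / eps, size=d)  # client-side noise
    models[rng.integers(k)] = w          # "discard": overwrite a random instance

avg_model = models.mean(axis=0)          # averaged model used for evaluation
acc = ((X @ avg_model > 0) == (y > 0.5)).mean()
print("train accuracy of averaged model:", acc)
```

Even with client-side noise on every update, the model averaged over the k instances learns the synthetic task well above chance, mirroring the qualitative behavior reported in Figure 2.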
Comparing k’s: Because the number of model instances k is so central to the DDML framework, its impact on model performance must be carefully evaluated. When studying the effect of k, we did not add any Laplace noise on the client side, to eliminate additional sources of entropy. Loss on the training set for different values of k up to 100 is shown in the first panel of Figure 2. Accuracy results on the test set using the model averaged over the k instances are shown in the first panel of the second row. The k = 1 configuration is equivalent to standard server-side model training (the darkest blue line) and clearly has the best performance. As we add more model instances, the number of updates to each model instance decreases (it is equal to n/k in expectation, where n is the sample size), and averaging over the k instances, though beneficial, is not sufficient to make up for the difference. Of course, as n, the sample size, goes to infinity, all configurations converge and have equivalent performance metrics.
Server-side batching: A commonly used solution to server scalability problems is to batch thousands or tens of thousands of gradients returned by clients on the server. In addition to being inferior in terms of privacy, because the batch size is usually orders of magnitude larger than our k, we empirically demonstrate that this approach is also inferior in terms of finite-sample performance whenever the server-side batch size exceeds k. The second column in Figure 2 compares DDML with four batch sizes of 10, 100, 1,000 and 10,000. It is remarkable how closely the curves for equal values of k and batch size overlap (this holds empirically for other combinations as well). Because the learning process must pause to average out gradients, the batch size is usually fixed at 10,000+ to accommodate high-throughput traffic, which can instead be easily handled with just k model instances and no interruptions to the learning process.
Privacy parameter ε: For this set of experiments, we fix k and vary the amount of noise added on the client side. Results are shown in the last column of Figure 2, with the blue line indicating model performance without noise (the same as in the first panel). We observe that for larger values of ε there is no substantial negative impact of providing client-side privacy on the model’s performance, while smaller privacy parameter choices do have some negative impact.
6. Discussion
6.1. Parameter Tuning and Clipping
Choosing the right learning rate η is critical for model convergence. If chosen too small, the learning process proceeds too slowly, while selecting a rate that is too large can lead to oscillating jumps around the true minimum. We recommend trying several values in parallel and evaluating model performance to select the best one. In the future, we plan to explore adaptive learning rate methods in which we systematically decrease η (and, therefore, the amount of noise added) as the model converges.
By clipping gradients to a fixed range [−Δ, Δ], we ensure that the sensitivity of our update is bounded and proportional to Δ. In practice, the vast majority of gradients, especially as the model becomes sufficiently accurate, are much smaller in absolute terms and, thus, could be clipped more aggressively. Clipping to a range one tenth as wide would reduce the sensitivity, and hence the noise required for the same ε, by a factor of 10.
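The interplay between the clipping range and the noise scale can be sketched as follows; `private_update` and the calibration scale 2ηC/ε are our illustrative assumptions rather than the paper’s exact Algorithm 1:

```python
import numpy as np

def private_update(grad, eta, clip, eps, rng):
    """Clip a gradient to [-clip, clip] and add Laplace noise calibrated
    to the resulting sensitivity. The scale 2*eta*clip/eps is an assumed
    calibration (update sensitivity 2*eta*clip), not the paper's exact
    constant."""
    g = np.clip(grad, -clip, clip)
    noise = rng.laplace(scale=2 * eta * clip / eps, size=g.shape)
    return -eta * g + noise

rng = np.random.default_rng(1)
grad = np.array([0.03, -4.0, 0.5])
# Tightening the clipping range from 1.0 to 0.1 shrinks the noise scale
# tenfold, at the cost of biasing the few gradients that exceed the range.
loose = private_update(grad, eta=0.05, clip=1.0, eps=2.0, rng=rng)
tight = private_update(grad, eta=0.05, clip=0.1, eps=2.0, rng=rng)
```

For the same ε, the tighter clipping range yields an update with one tenth the noise standard deviation, which is exactly the trade-off discussed above.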
6.2. Deployment
Having deployed DDML at scale at a large company with millions of daily active clients, we realized how critical a well-designed server-side architecture is to the client-side learning process. Due to the symmetric nature of draws and discards, with the number of reads equaling the number of writes, there must be sufficient redundancy in place to scale our serving infrastructure. The k model instances offer, besides increased privacy, an incredibly scalable and asynchronous solution to client-server communication.
We have successfully trained a phishing URL prediction model using logistic regression with about 1,000 features, achieving strong accuracy after 1 billion updates to the model.
Draw and Discard
One can easily argue that replacing model instances at random is “wasteful” from the model-training perspective. That is partially true. However, so is setting the wrong learning rate, mismanaging the batch size, and so on. We never utilize our data perfectly, even before moving to a distributed ML setting, where things only become more complicated from a learning perspective and additional performance sacrifices are not unreasonable. The focus should be not on what we are losing, but on what we are gaining in exchange. By making a small sacrifice in performance through the introduction of k instances, we gain scalability, spam detection and additional privacy. That is a lot to gain for an occasional loss of one of the model instances.
Other Learning Models
The DDML framework can easily be extended to support neural networks and any other model whose objective function can be written as a sum of differentiable functions. The very recent work of (Masters and Luschi, 2018) may provide guidance for parameter tuning in those cases. Extending DDML to decision trees appears harder, and further research into distributed optimization of trees is needed.
6.3. The Rate of Privacy Loss
A major struggle in the application of differential privacy in practice (Greenberg, 2017; Tang et al., 2017) is the question of how to keep the overall per-user privacy loss (Nissim et al., forthcoming; Dwork and Pappas, 2017) bounded over time. This is particularly challenging in the local model, as more data points are needed to achieve a level of utility comparable to that of the trusted curator model (Kairouz et al., 2014).
Our approach mimics the one taken by Apple’s deployment of DP (Apple, 2017; Tang et al., 2017): ensure that the privacy loss due to the collection of one piece of data from one user is bounded with a reasonable privacy parameter, but allow multiple collections from the same user over time. Formally, this corresponds to the privacy loss growing linearly with the number of items submitted, as per composition theorems (Kairouz et al., 2017; Dwork and Roth, 2014).
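As a toy illustration of this accounting: basic composition gives exactly the linear growth described above, while the advanced composition theorem (Theorem 3.20 of (Dwork and Roth, 2014)) trades a small additional δ for roughly square-root growth when the per-report ε is small. The function names and example values below are ours:

```python
import math

def basic_composition(eps, n):
    """n submissions, each eps-DP, compose to (n * eps)-DP."""
    return n * eps

def advanced_composition(eps, n, delta_prime):
    """Advanced composition (Dwork and Roth, 2014, Thm 3.20): n eps-DP
    mechanisms are (eps', n*delta + delta_prime)-DP with
    eps' = sqrt(2 n ln(1/delta')) * eps + n * eps * (e^eps - 1)."""
    return (math.sqrt(2 * n * math.log(1 / delta_prime)) * eps
            + n * eps * math.expm1(eps))

eps, n = 0.1, 100
print(basic_composition(eps, n))           # linear growth in n
print(advanced_composition(eps, n, 1e-6))  # sub-linear for small eps
```

For small per-report ε, the advanced bound is substantially tighter than the linear one, though both grow without limit as the number of submissions increases.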
Feature-level Privacy
We offer feature-level local differential privacy and, therefore, in a situation where features are correlated, the privacy loss scales with the number of features. In principle, if one would like to achieve model-level privacy, one needs to scale the noise up according to the number of features included in the model. Applications of differential privacy to very high-dimensional data, particularly in the local model, have not yet been adopted in practice. In theoretical work, the distinction is often mentioned, but the choice is left to industry practitioners. We believe that in practice, feature-level privacy combined with limited server-side model retention is sufficient to protect the privacy of our clients against most realistic adversaries.
Tighter Privacy Guarantees via Other Variants of Differential Privacy and Better Adversary Modeling
With respect to privacy guarantees, the main focus of this work has been to ensure the strongest possible form of privacy – pure differential privacy in the local model. We have also discussed more realistic adversary models and the ways in which DDML provides even better privacy guarantees against them. We are optimistic that further improvements, both in utility and in the tightness of the privacy analyses, are possible by switching from Laplace to Gaussian noise in the DDML client-side Algorithm 1, by modeling adversaries more precisely, and by performing the analyses using the variants of Rényi Differential Privacy (Mironov, 2017) or Concentrated Differential Privacy (Bun and Steinke, 2016). The optimism for the first claim stems from experience with other differentially private applications; for the second, from the similarities between DDML and the shuffling strategy of (Bittau et al., 2017) and the privacy amplification by sampling exploited by (Li et al., 2012; Smith, 2009; Abadi et al., 2016); for the third, from recent work in differentially private machine learning (Abadi et al., 2016; Geumlek et al., 2017; McMahan et al., 2017b; Wu et al., 2017) that benefits from analyses using such relaxations.
In a sense, DDML can be viewed as a system whose particular components, such as the approach chosen to ensure local privacy and the analysis under the chosen adversary model, can be varied depending on the application and the desired nuance of the privacy guarantee.
7. Conclusions
Client-side privacy-preserving machine learning is still in its infancy and will continue to be an active and important research area in both ML and privacy for the foreseeable future. We believe that the most important contribution of this work is a completely new server-side architecture with random “draws and discards” that offers unprecedented scalability with no interruptions to the learning process, client-side and server-side privacy guarantees and, finally, a simple, inexpensive and practical approach to client-side machine learning.
Our focus on simpler, yet useful in practice, linear models allowed us to experiment with client-side ML without having to worry about convergence and other ML-related issues. Instead, we had sufficient freedom to zoom in on privacy considerations, build simple and scalable infrastructure, and leverage this technology to improve mobile features for millions of our users.
Footnotes
 This is only approximately correct, since in a high-throughput environment, another client request could have updated the same model in the meantime.
 http://yann.lecun.com/exdb/mnist/
 We do not demonstrate this in practice. It follows from theoretical optimization results on convex functions.
References
 Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep Learning with Differential Privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS ’16). 308–318. http://doi.acm.org/10.1145/2976749.2978318
 Differential Privacy Team Apple. 2017. Learning with Privacy at Scale. Vol. 1. Apple Machine Learning Journal. Issue 8. https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html
 Raef Bassily, Kobbi Nissim, Uri Stemmer, and Abhradeep Guha Thakurta. 2017. Practical locally private heavy hitters. In Advances in Neural Information Processing Systems. 2285–2293.
 Raef Bassily and Adam Smith. 2015. Local, private, efficient protocols for succinct histograms. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing. ACM, 127–135.
 Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. 2017. PROCHLO: Strong Privacy for Analytics in the Crowd. In Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 441–459.
 Mark Bun, Jelani Nelson, and Uri Stemmer. 2017. Heavy Hitters and the Structure of Local Privacy. arXiv preprint arXiv:1711.04740 (2017).
 Mark Bun and Thomas Steinke. 2016. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference. Springer, 635–658.
 Kamalika Chaudhuri and Claire Monteleoni. 2009. Privacypreserving logistic regression. In Advances in Neural Information Processing Systems. 289–296.
 David R. Cox. 1958. The regression analysis of binary sequences (with discussion). J Roy Stat Soc B 20 (1958), 215–242.
 Julien Delange. 2017. Why using asynchronous communications? http://julien.gunnm.org/programming/linux/2017/04/15/comparison-sync-vs-async. (2017).
 Cynthia Dwork. 2011. A firm foundation for private data analysis. Commun. ACM 54, 1 (2011), 86–95.
 Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC). 265–284.
 Cynthia Dwork and George J. Pappas. 2017. Privacy in InformationRich Intelligent Infrastructure. CoRR abs/1706.01985 (2017). arXiv:1706.01985 http://arxiv.org/abs/1706.01985
 Cynthia Dwork and Aaron Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
 Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS ’14). 1054–1067.
 European Association for Theoretical Computer Science. 2017. 2017 Gödel Prize. https://eatcs.org/index.php/component/content/article/1-news/2450-2017-godel-prize.
 Giulia Fanti, Vasyl Pihur, and Úlfar Erlingsson. 2016. Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies 2016, 3 (2016), 41–61.
 Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. 2018. Privacy Amplification by Iteration. BIRS Workshop on Mathematical Foundations of Data Privacy. (2018).
 Joseph Geumlek, Shuang Song, and Kamalika Chaudhuri. 2017. Renyi Differential Privacy Mechanisms for Posterior Sampling. In Advances in Neural Information Processing Systems. 5295–5304.
 Google. 2018. Google Cloud Best Practices. https://cloud.google.com/datastore/docs/best-practices. (2018).
 Andy Greenberg. 2017. How one of Apple’s key privacy safeguards falls short. Wired (2017). https://www.wired.com/story/apple-differential-privacy-shortcomings/
 Andy Greenberg. June 13, 2016. Apple’s Differential Privacy Is About Collecting Your Data – But Not Your Data. In Wired.
 Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2014. Extremal mechanisms for local differential privacy. In Advances in Neural Information Processing Systems. 2879–2887.
 Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2017. The composition theorem for differential privacy. IEEE Transactions on Information Theory 63, 6 (2017), 4037–4049.
 Krishnaram Kenthapadi, Aleksandra Korolova, Ilya Mironov, and Nina Mishra. 2013. Privacy via the JohnsonLindenstrauss transform. Journal of Privacy and Confidentiality 5 (2013), 39–71. Issue 1.
 Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
 Ninghui Li, Wahbeh Qardaji, and Dong Su. 2012. On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security. 32–33.
 Mary Madden and Lee Rainie. 2015. Americans’ Attitudes About Privacy, Security and Surveillance. Pew Research Center. http://www.pewinternet.org/2015/05/20/americans-attitudes-about-privacy-security-and-surveillance/. (2015).
 Mary Madden and Lee Rainie. 2016. Privacy and Information Sharing. Pew Research Center. http://www.pewinternet.org/2016/01/14/privacy-and-information-sharing/. (2016).
 Dominic Masters and Carlo Luschi. 2018. Revisiting Small Batch Training for Deep Neural Networks. arXiv preprint arXiv:1804.07612 (2018).
 H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2017a. CommunicationEfficient Learning of Deep Networks from Decentralized Data. In AISTATS.
 H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. 2017b. Learning Differentially Private Language Models Without Losing Accuracy. CoRR abs/1710.06963 (2017). arXiv:1710.06963 http://arxiv.org/abs/1710.06963
 Frank McSherry. 2017. Deep learning and differential privacy. (2017). https://github.com/frankmcsherry/blog/blob/master/posts/2017-10-27.md
 Ilya Mironov. 2017. Renyi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF). 263–275.
 J. A. Nelder and R. W. M. Wedderburn. 1972. Generalized linear models. Journal of the Royal Statistical Society, Series A, General 135 (1972), 370–384.
 Kobbi Nissim, Thomas Steinke, Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, David O’Brien, and Salil Vadhan. Forthcoming. Differential Privacy: A Primer for a Non-technical Audience (Preliminary Version). Vanderbilt Journal of Entertainment and Technology Law (2018, Forthcoming).
 Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. 2016. Semi-supervised knowledge transfer for deep learning from private training data. 5th International Conference on Learning Representations (2016).
 Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. 2018. Scalable Private Learning with PATE. International Conference on Learning Representations (ICLR) (2018).
 Erica Portnoy, Gennie Gebhart, and Starchy Grant. Sep 27, 2016. In EFF DeepLinks Blog. www.eff.org/deeplinks/2016/09/facial-recognition-differential-privacy-and-tradeoffs-apples-latest-os-releases.
 F. Rosenblatt. 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review 65, 6 (1958), 386–408.
 Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 1310–1321.
 Adam Smith. 2009. Differential Privacy and the Secrecy of the Sample. https://adamdsmith.wordpress.com/2009/09/02/sample-secrecy/. (2009).
 Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. 2013. Stochastic gradient descent with differentially private updates. In Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 245–248.
 Jun Tang, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and XiaoFeng Wang. 2017. Privacy Loss in Apple’s Implementation of Differential Privacy on MacOS 10.12. CoRR abs/1709.02753 (2017). http://arxiv.org/abs/1709.02753
 Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. 2017. Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics. In Proceedings of the ACM International Conference on Management of Data. 1307–1322.
 WWDC 2016. June, 2016. WWDC 2016 Keynote. https://www.apple.com/appleevents/june2016/.