Privacy Accounting and Quality Control in the
Sage Differentially Private ML Platform
Abstract.
Companies increasingly expose machine learning (ML) models trained over sensitive user data to untrusted domains, such as end-user devices and wide-access model stores. This creates a need to control the data’s leakage through these models. We present Sage, a differentially private (DP) ML platform that bounds the cumulative leakage of training data through models. Sage builds upon the rich literature on DP ML algorithms and contributes pragmatic solutions to two of the most pressing systems challenges of global DP: running out of privacy budget and the privacy-utility tradeoff. To address the former, we develop block composition, a new privacy loss accounting method that leverages the growing database regime of ML workloads to keep training models endlessly on a sensitive data stream while enforcing a global DP guarantee for the stream. To address the latter, we develop privacy-adaptive training, a process that trains a model on growing amounts of data and/or with increasing privacy parameters until, with high probability, the model meets developer-configured quality criteria. Sage’s methods are designed to integrate with TensorFlow-Extended, Google’s open-source ML platform. They illustrate how a systems focus on characteristics of ML workloads enables pragmatic solutions that are not apparent when one focuses on individual algorithms, as most DP ML literature does.
1. Introduction
Machine learning (ML) is changing the origin and makeup of the code driving many of our applications, services, and devices. Traditional code is written by programmers and encodes algorithms that express business logic, plus a bit of configuration data. We keep sensitive data – such as passwords, keys, and user data – out of our code, because we often ship this code to untrusted locations, such as end-user devices and app stores. We mediate accesses to sensitive data with access control. When we do include secrets in code, or when our code is responsible for leaking user information to an unauthorized entity (e.g., through an incorrect access control decision), it is considered a major vulnerability.
With ML, “code” – in the form of ML models – is learned by an ML platform from a lot of training data. Learning code enables large-scale personalization, as well as powerful new applications like autonomous cars. Often, the data used for learning comes from users and is of a personal nature, including emails, searches, website visits, heartbeats, and driving behavior. And although ML “code” is derived from sensitive data, it is often handled as secret-free code. ML platforms, such as Google’s TensorFlow-Extended (TFX), routinely push models trained over sensitive data to servers all around the world (baylor2017tfx; hazelwood2018applied; Li:michelangelo; ravi2017onDeviceMachineIntelligence) and sometimes to end-user devices (wu2018machine; strimel2018statistical) for faster predictions. Some companies also push feature models – such as user embedding vectors and statistics of user activity – into model stores that are often widely accessible within the companies (hazelwood2018applied; Li:michelangelo; twitterembeddings). Such exposure would be inconceivable in a traditional application. Think of a word processor: it might push your documents to your device for faster access, but it would be outrageous if it pushed your documents to my (and everyone else’s) device!
There is perhaps a sense that because ML models aggregate data from multiple users, they “obfuscate” individuals’ data and warrant weaker protection than the data itself. However, this perception is succumbing to growing evidence that ML models can leak specifics about individual entries in their training sets. Language models trained over users’ emails for auto-complete have been shown to encode not only commonly used phrases but also social security numbers and credit card numbers that users may include in their communications (carlini2018theSecretSharer). Prediction APIs have been shown to enable testing for membership of a particular user or entry within a training set (shokri2017membership). Finally, it has long been established both theoretically and empirically that access to too many linear statistics from a dataset – as an adversary might have due to periodic releases of models, which often incorporate statistics used for featurization – is fundamentally non-private (backes2016membership; dinurNissim2003revealing; dwork2015robustTraceability; homer2008resolving).
As companies continue to disseminate many versions of models into untrusted domains, controlling the risk of data exposure becomes critical. We present Sage, an ML platform based on TFX that uses differential privacy (DP) (dwork2006differential) to bound the cumulative exposure of individual entries in a company’s sensitive data streams through all the models released from those streams. At a high level, DP randomizes a computation over a dataset (e.g., training one model) to bound the leakage of individual entries in the dataset through the output of the computation (the model). Each new DP computation increases the bound over data leakage, and can be seen as consuming part of a privacy budget that should not be exceeded; Sage makes the process that generates all models and statistics preserve a global DP guarantee.
Sage builds upon the rich literature on DP ML algorithms (e.g., (abadi2016deep; yu2019differentially; mcmahan2018aGeneral), see §2.3) and contributes pragmatic solutions to two of the most pressing systems challenges of global DP: (1) running out of privacy budget and (2) the privacy-utility tradeoff. Sage expects to be given training pipelines explicitly programmed to individually satisfy a parameterized DP guarantee. It acts as a new access control layer in TFX that: mediates all accesses to the data by these DP training pipelines; instantiates their DP parameters at runtime; and accounts for the cumulative privacy loss from all pipelines to enforce the global DP guarantee against the stream. At the same time, Sage provides developers with: control over the quality of the models produced by the DP training pipelines (thereby addressing challenge (2)); and the ability to release models endlessly without running out of privacy budget for the stream (thereby addressing challenge (1)).
The key to addressing both challenges is the realization that ML workloads operate on growing databases, a model of interaction that has been explored very little in DP, and only with a purely theoretical and far from practical approach (cummings2018differential). Most DP literature, largely focused on individual algorithms, assumes either static databases (which do not incorporate new data) or online streaming (where computations do not revisit old data). In contrast, ML workloads – which consist of many algorithms invoked periodically – operate on growing databases. Across invocations of different training algorithms, the workload both incorporates new data and reuses old data, often adaptively. It is in that adaptive reuse of old data coupled with new data that our design of Sage finds the opportunity to address the preceding two challenges in ways that are practical and integrate well with TFX-like platforms.
To address the running out of privacy budget challenge, we develop block composition, the first privacy accounting method that both allows efficient training on growing databases and avoids running out of privacy budget as long as the database grows fast enough. Block composition splits the data stream into blocks, for example by time (e.g., one day’s worth of data goes into one block) or by users (e.g., each user’s data goes into one block), depending on the unit of protection (event- or user-level privacy). Block composition lets training pipelines combine available blocks into larger datasets to train models effectively, but accounts for the privacy loss of releasing a model at the level of the specific blocks used to train that model. When the privacy loss for a given block reaches a pre-configured ceiling, the block is retired and will not be used again. However, new blocks from the stream arrive with zero privacy loss and can be used to train future models. Thus, as long as the database adds new blocks fast enough relative to the rate at which models arrive, Sage will never run out of privacy budget for the stream. Finally, block composition allows adaptivity in the choice of training computation, privacy parameters, and blocks to execute on, thereby modeling the most comprehensive form of adaptivity in DP literature.
To address the privacy-utility tradeoff, we develop privacy-adaptive training, a training procedure that controls the utility of DP-trained models by repeatedly and adaptively training them on growing data and/or DP parameters available from the stream. Models retrain until, with high probability, they meet programmer-specified quality criteria (e.g., an accuracy target). Privacy-adaptive training uses block composition’s support for adaptivity and integrates well with TFX’s design, which includes a model validation stage in training pipelines.
2. Background
Our effort builds upon an opportunity we observe in today’s companies: the rise of ML platforms, trusted infrastructures that provide key services for ML workloads in production, plus strong library support for their development. They can be thought of as operating systems for ML workloads. Google has TensorFlow-Extended (TFX) (baylor2017tfx); Facebook has FBLearner (hazelwood2018applied); Uber has Michelangelo (Li:michelangelo); and Twitter has DeepBird (leonardTwitterMeetsTensorFlow). The opportunity is to incorporate DP into these platforms as a new type of access control that constrains data leakage through the models a company disseminates.
2.1. ML Platforms
Fig. 1 shows our view of an ML platform; it is based on (baylor2017tfx; hazelwood2018applied; Li:michelangelo). The platform has several components: Training Pipelines (one for each model pushed into production), Serving Infrastructure, and a shared data store, which we call the Growing Database because it accumulates data from the company’s various streams. The access control policies on the Growing Database are exercised through Stream-level ACLs and are typically restrictive for sensitive streams.
The Training Pipeline trains a model on data from the Growing Database and verifies that it meets specific quality criteria before it is deployed for serving or shared with other teams. It is launched periodically (e.g., daily) on datasets containing samples from a representative time window (e.g., logs over the past month). It has three customizable modules: (1) Preprocessing loads the dataset from the Growing Database, transforms it into a format suitable for training and inference by applying feature transformation operators, and splits the transformed dataset into a training set and a testing set; (2) Training trains the model on a training set; and (3) Validation evaluates one or more quality metrics – such as accuracy for classification or mean squared error (MSE) for regression – on the testing set. It checks that the metrics reach specific quality targets to warrant the model’s push into serving. The targets can be fixed by developers or can be values achieved by a previous model. If the model meets all quality criteria, it is bundled with its feature transformation operators (a.k.a. features) and pushed into the Serving Infrastructure. The model+features bundle is what we call ML code.
The Serving Infrastructure manages the online aspects of the model. It distributes the model+features to inference servers around the world and to end-user devices, and continuously evaluates and partially updates it on new data. The model+features bundle is also often pushed into a company-wide Model and Feature Store, from where other teams within the company can discover it and integrate it into their own models. Twitter and Uber report sharing embedding models (twitterembeddings) and tens of thousands of summary statistics (Li:michelangelo) across teams through their Feature Stores. To enable such wide sharing, companies sometimes enforce more permissive access control policies on the Model and Feature Store than on the raw data.
2.2. Threat Model
We are concerned with the increase in sensitive data exposure that is caused by applying looser access controls to data-carrying ML code – models+features – than are typically applied to the data. This includes placing models+features in company-wide Model and Feature Stores, where they can be accessed by developers not authorized to access the raw data. It includes pushing models+features, or their predictions, to end-user devices and prediction servers that could be compromised by hackers or oppressive governments. And it includes opening the models+features to the world through prediction APIs that can leak training data if queried sufficiently (tramer2016stealing; shokri2017membership). Our goal is to “neutralize” the wider exposure of ML code by making the process of generating it DP across all models+features ever released from a sensitive stream.
We assume the following components are trusted and implemented correctly: Growing Database; Stream-level ACLs; the ML platform code running a Training Pipeline. We also trust the developer that instantiates the modules in each pipeline as long as the developer is authorized by Stream-level ACLs to access the data stream(s) used by the pipeline. However, we do not trust the wide-access Model and Feature Store or the locations to which the serving infrastructure disseminates the model+features or their predictions. Once a model/feature is pushed to those components, it is considered released to the untrusted domain and accessible to adversaries.
We focus on two classes of attacks against models and statistics (see Dwork (dwork2017exposed)): (1) membership inference, in which the adversary infers whether a particular entry is in the training set based on either white-box or black-box access to the model, features, and/or predictions (backes2016membership; dwork2015robustTraceability; homer2008resolving; shokri2017membership); and (2) reconstruction attacks, in which the adversary infers unknown sensitive attributes about entries in the training set based on similar white-box or black-box access (carlini2018theSecretSharer; dinurNissim2003revealing; dwork2017exposed).
2.3. Differential Privacy
DP is concerned with whether the output of a computation over a dataset – such as training an ML model – can reveal information about individual entries in the dataset. To prevent such information leakage, randomness is introduced into the computation to hide details of individual entries.
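Definition 2.1 (Differential Privacy (DP) (dwork2014algorithmic)).
A randomized algorithm $\mathcal{Q}: \mathcal{D} \rightarrow \mathcal{V}$ is $(\epsilon, \delta)$-DP if for any $D, D'$ with $|D \oplus D'| \leq 1$ and for any $\mathcal{S} \subseteq \mathcal{V}$, we have: $P(\mathcal{Q}(D) \in \mathcal{S}) \leq e^{\epsilon} P(\mathcal{Q}(D') \in \mathcal{S}) + \delta$.
The $\epsilon > 0$ and $\delta \in [0, 1]$ parameters quantify the strength of the privacy guarantee: small values imply that one draw from such an algorithm’s output gives little information about whether it ran on $D$ or $D'$. The privacy budget $\epsilon$ upper bounds an $(\epsilon, \delta)$-DP computation’s privacy loss with probability $(1-\delta)$. $\oplus$ is a dataset distance (e.g., the symmetric difference (Mcsherry:pinq)). If $|D \oplus D'| \leq 1$, $D$ and $D'$ are neighboring datasets.
Multiple mechanisms exist to make a computation DP. They add noise to the computation scaled by its sensitivity $s$, the maximum change in the computation’s output when run on any two neighboring datasets. Adding noise from a Laplace distribution with mean zero and scale $\frac{s}{\epsilon}$ (denoted $\mathrm{laplace}(0, \frac{s}{\epsilon})$) gives $(\epsilon, 0)$-DP. Adding noise from a Gaussian distribution scaled by $\frac{s}{\epsilon}\sqrt{2\ln(\frac{1.25}{\delta})}$ gives $(\epsilon, \delta)$-DP.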
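As a minimal sketch of the two mechanisms just described, the following Python code adds Laplace or Gaussian noise to a scalar query result. The function names are illustrative and not part of Sage or TFX; the Gaussian calibration used here is the standard one, which is usually stated for $\epsilon < 1$.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Release `value` with (epsilon, 0)-DP using Laplace(0, s/epsilon) noise."""
    return value + laplace_noise(sensitivity / epsilon, rng)


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise standard deviation (s/epsilon) * sqrt(2 ln(1.25/delta))."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Release `value` with (epsilon, delta)-DP using Gaussian noise."""
    return value + rng.gauss(0.0, gaussian_sigma(sensitivity, epsilon, delta))


rng = random.Random(0)
true_count = 1000.0  # a counting query has sensitivity 1
noisy_laplace = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
noisy_gauss = gaussian_mechanism(true_count, 1.0, 0.5, 1e-6, rng)
```

Smaller $\epsilon$ or larger sensitivity widens the noise, which is the source of the privacy-utility tradeoff discussed throughout the paper.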
DP is known to address the attacks in our threat model (shokri2017membership; dwork2017exposed; carlini2018theSecretSharer; 236254). At a high level, membership and reconstruction attacks work by finding data points that make the observed model more likely: if those points were in the training set, the likelihood of the observed output increases. DP prevents these attacks, as no specific data point can drastically increase the likelihood of the model outputted by the training procedure.
DP literature is very rich and mature, including in ML. DP versions exist for almost every popular ML algorithm, including: stochastic gradient descent (SGD) (abadi2016deep; yu2019differentially); various regressions (chaudhuri2008privacy; nikolaenko2013privacy; talwar2015nearlyoptimal; zhang2012functional; kifer2012convexERM); collaborative filtering (mcsherry2009differentially); language models (McMahan2018LearningDP); feature selection (chaudhuri2013nearlyoptimal); model selection (smith2013DPModelSelection); evaluation (boyd2015differential); and statistics, e.g., contingency tables (barak2007privacy) and histograms (xu2012differentially). The privacy module in TensorFlow v2 implements several SGD-based algorithms (mcmahan2018aGeneral).
A key strength of DP is its composition property, which, in its basic form, states that the process of running an $(\epsilon_1, \delta_1)$-DP and an $(\epsilon_2, \delta_2)$-DP computation on the same dataset is $(\epsilon_1 + \epsilon_2, \delta_1 + \delta_2)$-DP. Composition enables the development of complex DP computations – such as DP Training Pipelines – from piecemeal DP components, such as DP ML algorithms. Composition also lets one account for (and bound) the privacy loss resulting from a sequence of DP-computed outputs, such as the release of multiple models+features.
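Basic composition can be sketched in a few lines of Python. The stage names and budget values below are hypothetical, chosen only to illustrate how a pipeline’s total guarantee follows from its parts.

```python
def compose(*guarantees):
    """Basic composition: running (eps_i, delta_i)-DP computations on the
    same dataset is (sum of eps_i, sum of delta_i)-DP."""
    epsilon = sum(eps for eps, _ in guarantees)
    delta = sum(dlt for _, dlt in guarantees)
    return epsilon, delta


# Hypothetical per-stage budgets for one DP training pipeline.
feature_statistics = (0.25, 0.0)
model_training = (0.5, 1e-6)
model_validation = (0.25, 0.0)

pipeline_guarantee = compose(feature_statistics, model_training, model_validation)
```

Tighter (advanced) composition bounds exist, but the sum above is the form the text refers to.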
A distinction exists between user-level and event-level privacy. User-level privacy enforces DP on all data points contributed by a user toward a computation. Event-level privacy enforces DP on individual data points (e.g., individual clicks). User-level privacy is more meaningful than event-level privacy, but much more challenging to sustain on streams. Although Sage’s design can in theory be applied to user-level privacy (§4.4), we focus most of the paper on event-level privacy, which we deem practical enough to be deployed in big companies. §7 discusses the limitations of this semantic.
3. Sage Architecture
The Sage training platform enforces a global DP semantic over all models+features that have been, or will ever be, released from each sensitive data stream. The highlighted portions in Fig. 2 show the changes Sage brings to a typical ML platform. First, each Training Pipeline must be made to individually satisfy $(\epsilon, \delta)$-DP for some privacy parameters given by Sage at runtime (box DP Training Pipeline, §3.1). The developer is responsible for making this switch to DP, and while research is needed to ease DP programming, this paper leaves that challenge aside.
Second, Sage introduces an additional layer of access control beyond traditional stream-level ACLs (box Sage Access Control, §3.2). The new layer splits the data stream into blocks and accounts for the privacy loss of releasing a model+features bundle at the level of the specific blocks that were used to train that bundle. In theory, blocks can be defined by any insensitive attribute, with two attributes particularly relevant here: time (e.g., one day’s worth of data goes into one block) and user ID (e.g., a user’s data goes into the same block). Defining blocks by time provides event-level privacy; defining them by user ID accommodates user-level privacy. Because of our focus on the former, the remainder of this section assumes that blocks are defined by time; §4 discusses sharding by user ID and other attributes.
When the privacy loss for a block reaches the ceiling, the block is retired (blocks are retired in Fig. 2). However, new blocks arrive with a clean budget and can be used to train future models: as long as the database grows fast enough in new blocks, the system will never run out of privacy budget for the stream. Perhaps surprisingly, this privacy loss accounting method, which we call block composition, is the first practical approach to avoid running out of privacy budget while enabling effective training of ML models on growing databases. §3.2 gives the intuition of block composition while §4 formalizes it and proves it DP.
Third, Sage provides developers with control over the quality of models produced by the DP Training Pipelines. Such pipelines can produce less accurate models that fail to meet their quality targets more often than without DP. They can also push into production low-quality models whose validations succeed by mere chance. Both situations lead to operational headaches: the former gives more frequent notifications of failed training, the latter gives dissatisfied users. The issue is often referred to as the privacy-utility tradeoff of running under a DP regime. Sage addresses this challenge by wrapping the DP Training Pipeline into an iterative process that reduces the effects of DP randomness on the quality of the models and the semantic of their validation by invoking training pipelines repeatedly on increasing amounts of data and/or privacy budgets (box Privacy-Adaptive Training, §3.3).
3.1. Example DP Training Pipeline
Sage expects each pipeline submitted by the ML developer to satisfy a parameterized $(\epsilon, \delta)$-DP guarantee. Acknowledging that DP programming abstractions warrant further research, List. 1 illustrates the changes a developer would have to make at present to convert a non-DP training pipeline written for TFX to a DP training pipeline suitable for Sage. Removed/replaced code is stricken through and the added code is highlighted. The pipeline processes New York City Yellow Cab data (yellowCabData) and trains a model to predict the duration of a ride.
To integrate with TFX (non-DP version), the developer implements three TFX callbacks. (1) preprocessing_fn uses the dataset to compute aggregate features and make user-specified feature transformations. The model has three features: the distance of the ride; the hour of the day; and an aggregate feature representing the average speed of cab rides each hour of the day. (2) trainer_fn specifies the model that is to be trained: it configures the columns to be modeled, defines hyperparameters, and specifies the dataset. The model trains with a neural network regressor. (3) validator_fn validates the model by comparing test set MSE to a constant.
To integrate with Sage (DP version), the developer: (a) switches library calls to DP versions of the functions (which ideally would be available in the ML platform) and (b) splits the $(\epsilon, \delta)$ parameters, which are assigned by Sage at runtime, across the DP function calls. (1) preprocessing_fn replaces one call with a DP version that is implemented in Sage: the mean speed per day uses Sage’s dp_group_by_mean. This function (lines 32-40) computes the number of times each key appears and the sum of the values associated with each key. It makes both DP by adding draws from appropriately-scaled Laplace distributions to each sum and count. Each data point has exactly one key value, so the privacy budget usage composes in parallel across keys (Mcsherry:pinq). The privacy budget is split across the sum and count queries. We envision common functions like this being available in the DP ML platform. (2) trainer_fn replaces the call to the non-private regressor with the DP implementation, which in Sage is a simple wrapper around TensorFlow’s DP SGD-based optimizer. (3) validator_fn invokes Sage’s DP model validator (§3.3).
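The grouped DP mean described above can be sketched as follows. This is not Sage’s dp_group_by_mean from List. 1 (which is not reproduced here); the function name, the even budget split, the clipping bound `value_bound`, and the fixed public key domain `keys` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_group_by_mean_sketch(rows, keys, epsilon, value_bound, rng):
    """Return a noisy mean per key for an iterable of (key, value) rows.

    Assumptions: values are clipped to [0, value_bound] so each row changes
    a sum by at most value_bound; `keys` is a public, data-independent list.
    """
    eps_count = epsilon / 2.0  # half of the budget for the counts
    eps_sum = epsilon / 2.0    # half of the budget for the sums
    counts = defaultdict(float)
    sums = defaultdict(float)
    for key, value in rows:
        counts[key] += 1.0
        sums[key] += min(max(value, 0.0), value_bound)

    means = {}
    for key in keys:
        # Each row carries exactly one key, so noise is added per key and the
        # budget composes in parallel across keys.
        noisy_count = counts[key] + laplace_noise(1.0 / eps_count, rng)
        noisy_sum = sums[key] + laplace_noise(value_bound / eps_sum, rng)
        means[key] = noisy_sum / max(noisy_count, 1.0)
    return means


rng = random.Random(0)
rides = [(hour, 12.0 + hour % 3) for hour in range(24) for _ in range(500)]
speed_by_hour = dp_group_by_mean_sketch(rides, range(24), epsilon=1.0,
                                        value_bound=60.0, rng=rng)
```

Iterating over a fixed key domain rather than the keys observed in the data avoids revealing which keys are present, and flooring the noisy count at 1 keeps the ratio finite for sparse keys.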
3.2. Sage Access Control
Sage uses the composition property of DP to rigorously account for (and bound) the cumulative leakage of data from sensitive user streams across multiple releases of models+features learned from these streams. Specifically, for each sensitive stream, Sage maintains a pre-specified event-level $(\epsilon_g, \delta_g)$-DP guarantee across all uses of the stream. Unfortunately, traditional DP composition theory considers either static databases, which leads to wasteful privacy accounting; or purely online streaming, which is inefficient for many ML workloads, including deep neural network training. We thus developed our own composition theory, called block composition, which leverages characteristics of ML workloads running on growing databases to permit both efficient privacy accounting and efficient learning. §4 formalizes the new theory. This section describes the limitations of existing DP composition for ML and gives the intuition for block composition and how Sage uses it as a new form of access control in ML platforms.
Data Interaction in ML on Growing Databases. Fig. 3 shows an example of a typical workload as seen by an ML platform. Each training pipeline, or query in DP parlance, is denoted $Q_i$. We note two characteristics. First, a typical ML workload consists of multiple training pipelines, training over time on a continuously growing database, and on data subsets of various sizes. For instance, one query may train a large deep neural network requiring massive amounts of data to reach good performance, while another may train a linear model with smaller data requirements, or even a simple statistic like the mean of a feature over the past day. All pipelines are typically updated or retrained periodically on new data, with old data eventually being deemed irrelevant and ignored.
Second, the data given to a training pipeline – and for a DP model its DP parameters – are typically chosen adaptively. For example, the model trained by an early query on the blocks available at the time, with some DP budget, may give unsatisfactory performance. After a new block is collected, a developer may decide to retrain the same model in a later query on data that includes the new block, and with a higher DP budget. Adaptivity can also happen indirectly through the data. Suppose that later query successfully trained a recommendation model. Then, future data collected from the users (e.g., in the next block) may depend on the recommendations. Any subsequent query is potentially influenced by that model’s output.
These characteristics imply three requirements for a composition theory suitable for ML. It must support:
(R1) Queries on overlapping data subsets of diverse sizes.
(R2) Adaptivity in the choice of: queries, DP parameters, and data subsets the queries process.
(R3) Endless execution on a growing database.
Limitations of Existing Composition Theory for ML. No previous DP composition theory supports all three requirements. DP has mostly been studied for static databases, where (adaptively chosen) queries are assumed to compute over the entire database. Consequently, composition accounting is typically made at query level: each query consumes part of the total available privacy budget for the database. Query-level accounting has carried over even in extensions to DP theory that handle streaming databases (dwork2010differential) and partitioned queries (Mcsherry:pinq). There are multiple ways to apply query-level accounting to ML, but each trades off at least one of our requirements.
First, one can query overlapping data subsets (R1) adaptively across queries, the data used, and the DP parameters (rogers18privacy) (R2) by accounting for composition at the query level against the entire stream. The three queries in Fig. 3, with budgets $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$, would thus result in a total privacy loss of $\epsilon_1 + \epsilon_2 + \epsilon_3$ over the whole stream. This approach wastes privacy budget and leads to the problem of “running out of budget”. Once this sum reaches the global budget $\epsilon_g$, enforcing a global leakage bound of $\epsilon_g$ means that one must stop using the stream after the third query. This is true even though (1) not all queries run on all the data and (2) there will be new data coming into the system in the future. This violates requirement (R3) of endless execution on streams.
Second, one can restructure the queries to enable finer granularity with query-level accounting. The data stream is partitioned in blocks, as in Fig. 3. Each query is split into multiple ones, each running with DP on an individual block. The DP results are then aggregated, for instance by averaging model updates as in federated learning (McMahan2018LearningDP). Since each block is a separate dataset, traditional composition can account for privacy loss at the block level. This approach supports adaptivity (R2) and execution of the system on streams (R3) as new data blocks incur no privacy loss from past queries. However, it violates requirement (R1), resulting in unnecessarily noisy learning (duchi2019lower; duchi2018minimax). Consider computing a feature average. DP requires adding noise once, after summing all values on the combined blocks. But with independent queries over each block, we must add the same amount of noise to the sum over each block, yielding a noisier total. Additionally, several DP training algorithms (abadi2016deep; McMahan2018LearningDP) fundamentally rely on sampling small training batches from large datasets to amplify privacy, which cannot be done without combining blocks.
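The noise penalty of per-block queries in the feature-average example can be quantified with a short calculation, sketched here under the assumption that Laplace noise with scale $s/\epsilon$ is used in both cases. A Laplace draw with scale $b$ has variance $2b^2$, and variances of independent draws add.

```python
import math


def laplace_std(sensitivity, epsilon):
    """Standard deviation of Laplace(0, s/epsilon) noise: sqrt(2) * s / epsilon."""
    return math.sqrt(2.0) * sensitivity / epsilon


def noise_std_combined(sensitivity, epsilon):
    """One noise draw added to the sum over all combined blocks."""
    return laplace_std(sensitivity, epsilon)


def noise_std_per_block(sensitivity, epsilon, num_blocks):
    """One independent noise draw per block, then the noisy sums are added."""
    return math.sqrt(num_blocks) * laplace_std(sensitivity, epsilon)


# Thirty daily blocks queried separately versus as one combined dataset.
ratio = noise_std_per_block(1.0, 1.0, 30) / noise_std_combined(1.0, 1.0)
```

In both cases each block incurs the same privacy loss $\epsilon$, but the per-block total carries noise larger by a factor of the square root of the number of blocks.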
Third, one can consume the data stream online using streaming DP. A new data point is allocated to one of the waiting queries, which consumes its entire privacy budget. Because each point is used by one query and discarded, DP holds over the entire stream. New data can be adaptively assigned to any query (R2) and arrives with a fresh budget (R3). However, queries cannot use past data or overlapping subsets, violating R1 and rendering the approach impractical for large models.
Block Composition. Our new composition theory meets all three requirements. It splits the data stream into disjoint blocks (e.g., one day’s worth of data for event-level privacy), forming a growing database on which queries can run on overlapping and adaptively chosen sets of blocks (R1, R2). This lets pipelines combine blocks with available privacy budgets to assemble large datasets. Despite running overlapping queries, we can still account for the privacy loss of each individual block, where each query impacts the blocks it actually uses, not the entire data stream. Unused blocks, including future ones, incur no privacy loss. In Fig. 3, the first three blocks each incur one privacy loss while the last block incurs another, each being the sum of the budgets of the queries that used the block. The privacy loss of these three queries over the entire data stream will only be the maximum of these two values. Moreover, when the database grows (e.g., when a new block arrives), the new blocks’ privacy loss is zero. The system can thus run endlessly by training new models on new data (R3).
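The accounting rule can be sketched as a small ledger that charges each query to the blocks it reads and reports the stream’s loss as the maximum over blocks. The class name, block names, and budgets are hypothetical and do not reproduce the exact workload of Fig. 3.

```python
from collections import defaultdict


class BlockLedger:
    """Tracks cumulative (epsilon, delta) privacy loss per data block."""

    def __init__(self):
        self.loss = defaultdict(lambda: [0.0, 0.0])
        self.query_level_epsilon = 0.0  # what query-level accounting would charge

    def charge(self, blocks, epsilon, delta=0.0):
        """Charge one query's budget to every block the query reads."""
        for block in blocks:
            self.loss[block][0] += epsilon
            self.loss[block][1] += delta
        self.query_level_epsilon += epsilon

    def stream_loss(self):
        """Privacy loss over the whole stream: the maximum over blocks."""
        eps = max((entry[0] for entry in self.loss.values()), default=0.0)
        delta = max((entry[1] for entry in self.loss.values()), default=0.0)
        return eps, delta


ledger = BlockLedger()
ledger.charge(["d1", "d2"], epsilon=0.25)        # first query
ledger.charge(["d1", "d2", "d3"], epsilon=0.5)   # retrain on more data
ledger.charge(["d3", "d4"], epsilon=0.25)        # a later query on recent data
block_epsilon, _ = ledger.stream_loss()
```

In this toy workload the most heavily used block has lost 0.75, whereas query-level accounting against the whole stream would have charged the sum of all three budgets, 1.0. A block that arrives later starts at zero.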
Sage Access Control. With block composition, Sage controls data leakage from a stream by enforcing DP on its blocks. The company configures a desirable global $(\epsilon_g, \delta_g)$ policy for each sensitive stream. The Sage Access Control component tracks the available privacy budget for each data block. It allows access to a block until it runs out of budget, after which access to the block will never be granted again. When the Sage Iterator (described in §3.3) for a pipeline requests data, Sage Access Control only offers blocks with available privacy budget. The Iterator then determines the privacy parameters it will use for its iteration and informs Sage Access Control, which deducts them from the available privacy budgets of those blocks. Finally, the Iterator invokes the developer-supplied DP Training Pipeline, trusting it to enforce the chosen privacy parameters. §4 proves that this access control policy enforces DP for the stream.
The preceding operation is a DP-informed retention policy, but one can use block composition to define other access control policies. Suppose the company is willing to assume that its developers (or user devices and prediction servers in distinct geographies) will not collude to violate its customers’ privacy. Then the company could enforce a separate guarantee for each context (developer or geography) by maintaining separate lists of per-block available budgets.
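The bookkeeping performed by the access control component can be sketched as below. The class and method names are hypothetical, not Sage’s API; the sketch only shows per-block budgets, the offer of non-exhausted blocks, and all-or-nothing deduction.

```python
class AccessControlSketch:
    """Per-block privacy budgets with a fixed ceiling for every new block."""

    def __init__(self, eps_ceiling, delta_ceiling):
        self.eps_ceiling = eps_ceiling
        self.delta_ceiling = delta_ceiling
        self.remaining = {}  # block id -> [epsilon left, delta left]

    def add_block(self, block_id):
        """A new block arrives from the stream with a full budget."""
        self.remaining[block_id] = [self.eps_ceiling, self.delta_ceiling]

    def available_blocks(self):
        """Blocks that still have budget; exhausted blocks are retired."""
        return [b for b, (eps, _) in self.remaining.items() if eps > 0.0]

    def request(self, blocks, epsilon, delta=0.0):
        """Deduct (epsilon, delta) from every requested block, or refuse."""
        for block in blocks:
            eps_left, delta_left = self.remaining.get(block, (0.0, 0.0))
            if epsilon > eps_left or delta > delta_left:
                return False  # refuse without charging any block
        for block in blocks:
            self.remaining[block][0] -= epsilon
            self.remaining[block][1] -= delta
        return True


control = AccessControlSketch(eps_ceiling=1.0, delta_ceiling=1e-6)
control.add_block("day1")
control.add_block("day2")
granted = control.request(["day1", "day2"], epsilon=0.5)
```

Keeping one such object per context (developer or geography) would correspond to the separate per-context budget lists mentioned in the text.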
3.3. Privacy-Adaptive Training
Sage’s design adds reliability to the DP model training and validation processes, which are rendered imprecise by the DP randomness. We describe two novel techniques: (1) SLAed validation, which accounts for the effect of randomness in the validation process to ensure a high-probability guarantee of correctness (akin to a quality service level agreement, or SLA); and (2) privacy-adaptive training, which launches the DP Training Pipeline adaptively on increasing amounts of data from the stream, and/or with increased privacy parameters, until the validation succeeds. Privacy-adaptive training thus leverages the adaptivity properties of block composition to address DP’s privacy-utility tradeoff.
SLAed DP Validation. Fig. 2 shows the three possible outcomes of SLAed validation: Accept, Reject/timeout, and Retry. If SLAed validation returns Accept, then with high probability (e.g. 95%) the model reached its configured quality targets for prediction on new data from the same distribution. Under certain assumptions, it is also possible to give statistical guarantees of correct negative assessment, in which case SLAed validation returns Reject. We refer the reader to our extended technical report (sagetr, ) for this discussion. Sage also supports timing out a training procedure if it has run for too long. Finally, if SLAed validation returns Retry, it signals that more data is needed for an assessment.
We have implemented SLAed validators for three classes of metrics: loss metrics (e.g. MSE, log loss), accuracy, and absolute errors of sum-based statistics such as mean and variance. The technical report (sagetr, ) details the implementations and proves their statistical and DP guarantees. Here, we give the intuition and an example based on loss metrics. All validators follow the same logic. First, we compute a DP version of the test quantity (e.g. MSE) on a testing set. Second, we compute the worst-case impact of DP noise on that quantity for a given confidence probability; we call this a correction for DP impact. For example, suppose we add Laplace noise to the sum of squared errors over the test points, with the loss bounded in a known range. For any chosen failure probability, the Laplace tail gives a threshold such that the noisy sum is deflated by less than that threshold except with that probability, because a draw from this Laplace distribution is more negative than the threshold only with that probability. Third, we use known statistical concentration inequalities, also made DP and corrected for worst-case noise impact, to upper bound with high probability the loss on the entire distribution.
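As a worked illustration of the correction for DP impact, the sketch below computes the Laplace tail threshold and checks it empirically. The names `laplace_correction`, `scale`, and `eta` are illustrative assumptions, not the paper's notation; the only fact used is the Laplace tail bound P(X < -t) = exp(-t / scale) / 2 for t ≥ 0.

```python
import math
import random


def laplace_correction(scale, eta):
    """Return t such that a Laplace(0, scale) draw is below -t with probability eta.

    Uses the tail bound P(X < -t) = 0.5 * exp(-t / scale), valid for eta <= 0.5.
    """
    if not 0 < eta <= 0.5:
        raise ValueError("eta must be in (0, 0.5]")
    return scale * math.log(1.0 / (2.0 * eta))


def sample_laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


rng = random.Random(0)
scale, eta = 2.0, 0.05          # hypothetical noise scale and failure probability
t = laplace_correction(scale, eta)

# Empirically, about an eta fraction of draws should be more negative than -t.
draws = [sample_laplace(scale, rng) for _ in range(200_000)]
empirical_tail = sum(d < -t for d in draws) / len(draws)
```

Adding `t` to a noisy sum therefore yields an upper bound on the true sum that fails with probability `eta`, which is the role the correction plays in the validators.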
Example: Loss SLAed Validator. A loss function is a measure of erroneous predictions on a dataset (so lower is better). Examples include: mean squared error for regression, log loss for classification, and minus log likelihood for Bayesian generative models. List. 2 shows our loss validator. The validation function consists of two tests: Accept (described here) and Reject (described in the technical report (sagetr, )).
Consider a DP-trained model, a loss function with a bounded range, and a target loss. Lines 11-13 compute a DP estimate of the number of samples in the test set, corrected for the impact of DP noise to be a lower bound on the true value with high probability. Lines 14-17 compute a DP estimate of the loss sum, corrected for DP impact to be an upper bound on the true value with high probability. Lines 18-20 Accept the model if the upper bound is at most the target loss. The bounds are based on a standard concentration inequality (specifically, Bernstein’s inequality), which holds under very general conditions (ShalevShwartz:2014:UML:2621980, ). We show in (sagetr, ) that the Loss Accept Test satisfies DP and enjoys the following guarantee:
Proposition 3.1 (Loss Accept Test).
With probability at least the configured confidence level, the Accept test returns true only if the expected loss of the model is at most the target loss.
Privacy-Adaptive Training. Sage attempts to improve the quality of the model and its validation by supplying them with more data or privacy budgets so the SLAed validator can either Accept or Reject the model. Several ways exist to improve a DP model’s quality. First, we can increase the dataset’s size: at least in theory, it has been proven that one can compensate for the loss in accuracy due to any DP guarantee by increasing the training set size (kasiviswanathan2011can, ). Second, we can increase the privacy budget to decrease the noise added to the computation: this must be done within the available budgets of the blocks involved in the training and not too aggressively, because wasting privacy budget on one pipeline can prevent other pipelines from using those blocks.
Privacy-adaptive training searches for a configuration that can be either Accepted or Rejected by the SLAed validator. We have investigated several strategies for this search. Those that conserve privacy budget have proven the most efficient. Every time a new block is created, its budget is divided evenly across the ML pipelines currently waiting in the system. Allocated DP budget is reserved for the pipeline that received it, but privacy-adaptive training will not use all of it right away. It will try to Accept using as little of the budget as possible. When a pipeline is Accepted, its remaining budget is reallocated evenly across the models still waiting in Sage.
To conserve privacy budget, each pipeline will first train and test using a small configurable budget and a minimum window size for the model’s training. On Retry from the validator, the pipeline will be retrained, making sure to double either the privacy budget, if enough allocation is available to the Training Pipeline, or the number of samples available to the Training Pipeline, by accepting new data from the stream. This doubling of resources ensures that when a model is Accepted, the sum of budgets used by all failed iterations is at most equal to the budget used by the final, accepted iteration. This final budget also overshoots the best possible budget by at most a factor of two, since the model with half this final budget had a Retry. Overall, the resources used by this DP budget search are thus at most twice the budget of the final model, and hence at most four times the best possible budget. The evaluation (§5.4) shows that this conservative strategy improves performance when multiple Training Pipelines contend for the same blocks.
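The doubling argument above can be checked numerically. The sketch below is an illustration under simplifying assumptions: only the privacy budget is doubled (not the data), and the hypothetical parameter `needed` stands for the smallest budget at which validation would Accept.

```python
def doubling_search(initial_budget, needed):
    """Double the budget on each Retry until it reaches `needed`.

    Returns (budgets of the failed iterations, budget of the accepted iteration).
    """
    failed = []
    budget = initial_budget
    while budget < needed:
        failed.append(budget)   # this iteration ended in Retry
        budget *= 2             # retrain with twice the budget
    return failed, budget


needed = 0.3                    # hypothetical best possible budget
failed, final = doubling_search(initial_budget=0.01, needed=needed)
total = sum(failed) + final     # everything spent by the search
```

In this run the failed iterations together cost less than the accepted one, the accepted budget is less than twice `needed`, and the total stays under four times `needed`.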
4. Block Composition Theory
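(b) BlockCompose
1: the adversary gives two neighboring block datasets
2: for each query in turn do
   (the adversary's choices in an iteration depend on the results of earlier iterations)
3:   the adversary gives a DP query
4:   the adversary receives the query's result on the selected blocks
return the results of all queries

(c) AdaptiveStreamBlockCompose
1: the adversary gives the index of the block with the adversarially chosen change
2: for each query in turn do
   (the adversary's choices in an iteration depend on the results of earlier iterations)
3:   if a new block is created and it is the block with the adversarial change then
4:     the adversary gives the two neighboring versions of that block
5:   else if a new block is created and it is any other block then
6:     the new block is generated from the world and the previously released results
7:   the adversary gives the blocks to query, the privacy parameters, and a DP query
8:   if access control permits the query on every requested block then
9:     the adversary receives the query's result on the requested blocks
10:  else the adversary receives a no-op
return the results of all queries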
This section provides the theoretical backing for block composition, which we invent for Sage but which we believe has broader applications (§4.4). To analyze composition, one formalizes permissible interactions with the sensitive data in a protocol that facilitates the proof of the DP guarantee. This interaction protocol makes explicit the worst-case decisions that can be made by modeling them through an adversary. In the standard protocol (detailed shortly), an adversary picks the neighboring data sets and supplies the DP queries that will compute over one of these data sets; the choice between the two data sets is exogenous to the interaction. To prove that the interaction satisfies DP, one must show that given the results of the protocol, it is impossible to determine with high confidence which of the two neighboring data sets was used.
Fig. 4 describes three different interaction protocols of increasing sophistication. Alg. (3(a)) is the basic DP composition protocol. Alg. (3(b)) is a block-level protocol we propose for static databases. Alg. (3(c)) is the protocol adopted in Sage; it extends Alg. (3(b)) by allowing a streaming database and adaptive choices of blocks and privacy parameters. Highlighted are changes made to the preceding protocol.
4.1. Traditional Query-Level Accounting
QueryCompose (Alg. (3(a))) is the interaction protocol assumed in most analyses of composition of several DP interactions with a database. There are three important characteristics. First, the number of queries and the DP parameters are fixed in advance. However, the DP queries can be chosen adaptively. Second, the adversary adaptively chooses a pair of neighboring datasets for each query. This flexibility lets the protocol readily support adaptively evolving data (such as with data streams) where future data collected may be impacted by the adversary’s change to past data. Third, the adversary receives the results of running the DP queries on one dataset of each pair; which one is used is the exogenous choice of database and is unknown to the adversary. DP is guaranteed if the adversary cannot confidently learn this choice given the results.
A common tool to analyze DP protocols is privacy loss:
Definition 4.1 (Privacy Loss).
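Fix any outcome $v$ of the protocol, and let $V^b$ denote the protocol's random output under the exogenous choice $b \in \{0, 1\}$. The privacy loss of a Compose algorithm is:
$$\mathrm{Loss}(v) = \ln\left(\frac{P(V^0 = v)}{P(V^1 = v)}\right)$$
Bounding the privacy loss for any adversary with high probability implies DP (kasiviswanathan2014semantics, ). Suppose that for any adversary, with probability at least $1 - \delta$ over draws $v \sim V^0$, we have $|\mathrm{Loss}(v)| \le \epsilon$. Then the Compose algorithm is $(\epsilon, \delta)$-DP. This way, privacy loss and DP are defined in terms of distinguishing between two hypotheses indexed by $b$.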
Previous composition theorems (e.g. basic composition (Dwork:2006:CNS:2180286.2180305, ), strong composition (dwork2010boosting, ), and variations thereof (kairouz2015composition, )) analyze Alg. (3(a)) to derive various arithmetics for computing the overall DP semantic of interactions adhering to that protocol. In particular, the basic composition theorem (Dwork:2006:CNS:2180286.2180305, ) proves that QueryCompose is DP with privacy parameters equal to the sums of the per-query privacy parameters. These theorems form the basis of most ML DP work. However, because composition is accounted for at the query level, imposing a fixed global privacy budget means that one will “run out” of it and stop training models even on new data.
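The arithmetic of basic composition, and the way a fixed global budget is exhausted under query-level accounting, can be illustrated with a short sketch. The function names and the example budgets below are hypothetical; only the additive rule of basic composition is taken from the text.

```python
def basic_composition(queries):
    """Basic composition: the epsilons and deltas of the DP queries add up."""
    eps = sum(e for e, _ in queries)
    delta = sum(d for _, d in queries)
    return eps, delta


def queries_before_exhaustion(global_eps, per_query_eps):
    # Query-level accounting: every query draws on the same global budget,
    # no matter which data it touches.
    count, spent = 0, 0.0
    while spent + per_query_eps <= global_eps + 1e-12:
        spent += per_query_eps
        count += 1
    return count


total_eps, total_delta = basic_composition([(0.25, 1e-6)] * 4)
max_queries = queries_before_exhaustion(global_eps=1.0, per_query_eps=0.25)
```

With these numbers the global budget is gone after four queries, even if later queries would only touch data that arrived after the first four ran.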
4.2. Block Composition for Static Datasets
Block composition improves privacy accounting for workloads where interaction consists of queries that run on overlapping data subsets of diverse sizes. This is one of the characteristics we posit for ML workloads (requirement R1 in §3.2). Alg. (3(b)), BlockCompose, formalizes this type of interaction for a static dataset setting as a springboard to formalizing the full ML interaction. We make two changes to QueryCompose. First (line 1), the neighboring datasets are defined once and for all before interacting. This way, training pipelines accessing non-overlapping parts of the dataset cannot all be impacted by one entry’s change. Second (line 4), the data is split in blocks, and each DP query runs on a subset of the blocks.
We prove that the privacy loss over the entire dataset is the same as the maximum privacy loss on each block, accounting only for queries using this block:
Theorem 4.2 (Reduction to Blocklevel Composition).
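The privacy loss of BlockCompose is upper-bounded by the maximum privacy loss for any block, where the privacy loss of a block counts only the queries that use that block.
Proof.
Let the adversary pick the two neighboring datasets, and consider the only block in which they differ, so that every other block is identical in both datasets. For any result of Alg. (3(b)), the privacy loss splits into the contribution of the queries that use the changed block and the contribution of the queries that do not. The second contribution is zero: a query that does not use the changed block reads identical data in both datasets, hence the ratio of its output probabilities under the two hypotheses is 1. The remaining contribution is the privacy loss of the changed block, which is at most the maximum over all blocks.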
∎
Hence, unused data blocks allow training of other (adaptively chosen) ML models, and exhausting the DP budget of a block means we retire that block of data, and not the entire data set. This result, which can be extended to strong composition (see tech. report (sagetr, )), can be used to do tighter accounting than query-level accounting when the workload consists of queries on overlapping sets of data blocks (requirement R1). However, it does not support adaptivity in block choice or a streaming setting, violating R2 and R3.
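The gap between the two accounting methods can be seen on a toy workload. The sketch below is illustrative only: the three queries, their block sets, and their epsilons are made up, and only basic (additive) composition is used within each block.

```python
from collections import defaultdict


def query_level_loss(queries):
    # Every query is charged against the whole dataset.
    return sum(eps for _, eps in queries)


def block_level_loss(queries):
    # Each query is charged only to the blocks it reads; the dataset-wide
    # loss is the maximum over blocks (in the spirit of Theorem 4.2).
    per_block = defaultdict(float)
    for blocks, eps in queries:
        for b in blocks:
            per_block[b] += eps
    return max(per_block.values()), dict(per_block)


# (blocks read, epsilon) for three hypothetical queries over blocks 0..3
queries = [({0, 1}, 0.5), ({1, 2}, 0.25), ({3}, 0.5)]
q_loss = query_level_loss(queries)
b_loss, per_block = block_level_loss(queries)
```

Here query-level accounting charges the sum of all three epsilons, while block-level accounting charges only the most heavily used block (block 1), leaving budget on the other blocks for further models.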
4.3. Sage Block Composition
Alg. (3(c)), AdaptiveStreamBlockCompose, addresses the preceding limitations with two changes. First, supporting streams requires that datasets not be fixed before interacting, because future data depends on prior models trained and pushed into production. The highlighted portions of lines 1-10 in Alg. (3(c)) formalize the dynamic nature of data collection by having new data blocks explicitly depend on previously trained models, which are chosen by the adversary, in addition to other mechanisms of the world that are not impacted by the adversary. Fortunately, Theorem 4.2 still applies, because model training can only use blocks that existed at the time of training, which in turn only depend on prior blocks through DP trained models. Therefore, new data blocks can train new ML models, enabling endless operation on streams (R3).
Second, interestingly, supporting adaptive choices in the data blocks implies supporting adaptive choices in the queries’ DP budgets. For a given block, one can express a query’s choice of whether to use the block as using a privacy budget of either the query’s own budget or zero. Lines 7-8 in Alg. (3(c)) formalize the adaptive choice of both privacy budgets and blocks (requirement R2). They do so by leveraging recent work on DP composition under adaptive DP budgets (rogers18privacy, ). At each round, the adversary requests access to a group of blocks on which to run a DP query. Sage’s Access Control permits the query only if the privacy loss of each block in the group will remain below the global privacy budget. Applying our Theorem 4.2 and (rogers18privacy, )’s Theorem 3.3, we prove the following result (proof in (sagetr, )):
Theorem 4.3 (Composition for Sage Block Composition).
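AdaptiveStreamBlockCompose satisfies the global DP guarantee if, for every block, Access Control enforces that the sums of the privacy parameters of the queries granted on that block stay within the global privacy parameters.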
The implication of the preceding theorem is that under the access control scheme described in §3.2, Sage achieves event-level DP over the sensitive data stream.
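To see why per-block enforcement permits endless operation on a stream, consider the following simulation sketch. It is a simplification under stated assumptions: one query per time step, a fixed window of recent blocks, epsilon-only budgets, and additive composition within a block; the function `run_stream` and its parameters are hypothetical.

```python
def run_stream(steps, global_eps, query_eps, window):
    """Simulate one query per time step over the `window` most recent blocks.

    Returns (number of granted queries, largest epsilon spent on any block).
    """
    spent = []      # spent[k] = epsilon consumed on block k so far
    granted = 0
    for _ in range(steps):
        spent.append(0.0)                      # a new block arrives
        first = max(0, len(spent) - window)
        blocks = range(first, len(spent))      # the most recent blocks
        # Access control: every requested block must stay within budget.
        if all(spent[k] + query_eps <= global_eps for k in blocks):
            for k in blocks:
                spent[k] += query_eps
            granted += 1
    return granted, max(spent)


granted, worst_block = run_stream(steps=100, global_eps=1.0, query_eps=0.5, window=2)
```

In this run every one of the 100 queries is granted while no block ever exceeds the global budget, whereas query-level accounting with the same numbers would stop after two queries.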
4.4. Blocks Defined by User ID and Other Attributes
Block composition theory can be extended to accommodate user-level privacy and other use cases. The theory shows that one can split a static dataset (Theorem 4.2) or a data stream (Theorem 4.3) into disjoint blocks, and run DP queries adaptively on overlapping subsets of the blocks while accounting for privacy at the block level. The theory focused on time splits, but the same theorems can be written for splits based on any attribute whose possible values can be made public, such as geography, demographics, or user IDs. Consider a workload on a static dataset in which queries combine data from diverse and overlapping subsets of countries, e.g., they compute average salary in each country separately, but also at the level of continents and GDP-based country groups. For such a workload, block composition gives tighter privacy accounting across these queries than traditional composition, though the system will still run out of privacy budget eventually because no new blocks appear in the static database.
As another example, splitting a stream by user ID enables querying or ignoring all observations from a given user, adding support for user-level privacy. Splitting data over user ID requires extra care. If existing user IDs are not known, each query might select user IDs that do not exist yet, spending their DP budget without adding data. However, making user IDs public can leak information. One approach is to use incrementing user IDs (with this fact public), and periodically run a DP query computing the maximum user ID in use. This would ensure DP, while giving an estimate of the range of user IDs that can be queried. In such a setting, block composition enables fine-grained DP accounting over queries on any subset of the users. While our block theory supports this use case, it suffers from a major practical challenge. New blocks are now created only when new users join the system, so new users must be added at a high rate relative to the model release rate to avoid running out of budget. This is unlikely to happen for mature companies, but may be possible for emerging startups or hospitals, where the stream of incoming users/patients may be high enough to sustain modest workloads.
5. Evaluation
We ask four questions: (Q1) Does DP impact Training Pipeline reliability? (Q2) Does privacy-adaptive training increase DP Training Pipeline reliability? (Q3) Does block composition help over traditional composition? (Q4) How do ML workloads perform under Sage’s DP regime?
Taxi Regression Task
Pipelines | Configuration |
Linear Regression (LR) | DP Alg. | AdaSSP from (wang2018revisiting, ), DP
 | Config. | Regularization param
 | Budgets |
 | Targets | MSE
Neural Network (NN) | DP Alg. | DP SGD from (abadi2016deep, ), DP
 | | ReLU, 2 hidden layers (5000/100 nodes)
 | Config. | Learning rate: 0.01, Epochs: 3
 | | Batch: 1024, Momentum: 0.9
 | Budgets |
 | Targets | MSE
Avg.Speed x3* | Targets | Absolute error km/h

Criteo Classification Task
Pipelines | Configuration |
Logistic Regression (LG) | DP Alg. | DP SGD from (mcmahan2018general, ), DP
 | Config. | Learning rate: , Epochs: 3, Batch: 512
 | Budgets |
 | Targets | Accuracy
Neural Network (NN) | DP Alg. | DP SGD from (mcmahan2018general, ), DP
 | | ReLU, 2 hidden layers (1024/32 nodes)
 | Config. | Learning rate: 0.01, Epochs: 5
 | | Batch: 1024
 | Budgets |
 | Targets | Accuracy
Counts x26** | Targets | Absolute error
Methodology. We consider two datasets: 37M samples from three months of NYC taxi rides (yellowCabData, ) and 45M ad impressions from Criteo (criteoKaggle, ). On the Taxi dataset we define a regression task to predict the duration of each ride using 61 binary features derived from 10 contextual features. We implement pipelines for a linear regression (LR), a neural network (NN), and three statistics (average speeds at three time granularities). On the Criteo dataset we formulate a binary classification task predicting ad clicks from 13 numeric and 26 categorical features. We implement a logistic regression (LG), a neural network (NN), and histogram pipelines. Tab. 1 shows details.
Training: We make each pipeline DP using known algorithms, shown in Tab. 1. Validation: We use the loss, accuracy, and absolute error SLAed validators on the regression, classification, and statistics respectively. Experiments: Each model is assigned a quality target from a range of possible values, chosen between the best achievable model, and the performance of a naïve model (predicting the label mean on Taxi, with MSE , and the most common label on Criteo, with accuracy 74.3%). Most evaluation uses privacy-adaptive training, so privacy budgets are chosen by Sage, with an upper bound of . While no consensus exists on what a reasonable DP budget is, this value is in line with practical prior work (abadi2016deep, ; McMahan2018LearningDP, ). Where DP budgets must be fixed, we use values indicated in Tab. 1 which correspond to a large budget (), and a small budget that varies across tasks and models. Other defaults: 90%::10% train::test ratio; ; . Comparisons: We compare Sage’s performance to existing DP composition approaches described in §3.2. We ignore the first alternative, which violates the endless execution requirement R3 and cannot support ML workloads. We compare with the second and third alternatives, which we call query composition and streaming composition, respectively.
5.1. Unreliability of DP Training Pipelines in TFX (Q1)
We first evaluate DP’s impact on model training. Fig. 5 shows the loss or accuracy of each model when trained on increasing amounts of data and evaluated on 100K held-out samples from their respective datasets. Three versions are shown for each model: the non-DP version (NP), a large DP budget version (), and a small DP budget configuration with values that vary across the model and task. For both tasks, the NN requires the most data but outperforms the linear model in the private and non-private settings. The DP LRs catch up to the non-DP version with the full dataset, but the other models trained with SGD require more data. Thus, model quality is impacted by DP but the impact diminishes with more training data. This motivates privacy-adaptive training.
To evaluate DP’s impact on validation, we train and validate our models for both tasks, with and without DP. We use TFX’s vanilla validators, which compare the model’s performance on a test set to the quality metric (MSE for taxi, accuracy for Criteo). We then re-evaluate the models’ quality metrics on a separate, 100K-sample held-out set and measure the fraction of models accepted by TFX that violate their targets on the re-evaluation set. With non-DP pipelines (non-DP training and validation), the false acceptance rate is 5.1% and 8.2% for the Taxi and Criteo tasks respectively. With DP pipelines (DP training, DP validation), false acceptance rates hike to 37.9% and 25.2%, motivating SLAed validation.




5.2. Reliability of DP Training Pipelines in Sage (Q2)
Sage’s privacy-adaptive training and SLAed validation are designed to add reliability to DP model training and validation. However, they may come at a cost of increased data requirements over a non-DP test. We evaluate reliability and sample complexity for the SLAed validation Accept test.
Dataset | Confidence | No SLA | NP SLA | UC DP SLA | Sage SLA
Taxi | 0.01 | 0.379 | 0.0019 | 0.0172 | 0.0027
Taxi | 0.05 | 0.379 | 0.0034 | 0.0224 | 0.0051
Criteo | 0.01 | 0.2515 | 0.0052 | 0.0544 | 0.0018
Criteo | 0.05 | 0.2515 | 0.0065 | 0.0556 | 0.0023
Tab. 2 shows the fraction of Accepted models that violate their quality targets when re-evaluated on the 100K-sample held-out set. For two confidence values (0.01 and 0.05), we show: (1) No SLA, the vanilla TFX validation with no statistical rigor, but where a model’s quality is computed with DP. (2) NP SLA, a non-DP but statistically rigorous validation. This is the best we can achieve with statistical confidence. (3) UC DP SLA, a DP SLAed validation without the correction for DP impact. (4) Sage SLA, our DP SLAed validator, with correction. We make three observations. First, the NP SLA violation rates are much lower than the configured values because we use conservative statistical tests. Second, Sage’s DP-corrected validation accepts models with violation rates close to the NP SLA: slightly higher for the loss SLA and slightly lower for the accuracy SLA, but well below the configured error rates. Third, removing the correction increases the violation rate by 5x for the loss SLA and 20x for the accuracy SLA, violating the confidence thresholds in both cases, at least for the lower confidence value. These results confirm that Sage’s SLAed validation is reliable, and that correction for DP is critical to this reliability.
The increased reliability of SLAed validation comes at a cost: SLAed validation requires more data compared to a non-DP test. This additional data is supplied by Sage’s privacy-adaptive training. Fig. 5(a) and 5(b) show the amount of train+test data required to Accept a model under various loss targets for the Taxi regression task. Fig. 5(c) and 5(d) show the same for accuracy targets for the Criteo classification task. We make three observations. First, unsurprisingly, non-rigorous validation (No SLA) requires the least data but has a high failure rate because it erroneously accepts models on small sample sizes. Second, the best models accepted by Sage’s SLA validation are close to the best models accepted by No SLA. We observe the largest difference in Taxi LR where No SLA accepts MSE targets of while the Sage SLA accepts as low as . The best achievable model is slightly impacted by DP, although more data is required. Third, adding a statistical guarantee but no privacy to the validation (NP SLA) already substantially increases sample complexity. Adding DP to the statistical guarantee and applying the DP correction incurs limited additional overhead. The distinction between Sage and NP SLA is barely visible for all but the Taxi LR. For Taxi LR, adding DP accounts for half of the increase over No SLA, requiring twice as much data (one data growth step in privacy-adaptive training). Thus, privacy-adaptive training increases reliability of DP training pipelines for a reasonable increase in sample complexity.
5.3. Benefit of Block Composition (Q3)




Block composition lets us combine multiple blocks into a dataset, such that each DP query runs over all used blocks with only one noise draw. Without block composition a DP query is split into multiple queries, each operating on a single block, and receiving independent noise. The combined results are noisier. Fig. 6(a) and 6(c) show the model quality of the LR and NN models on the Taxi dataset, when operating on blocks of different sizes, 100K and 500K for the LR, and 5M for the NN. Fig. 6(b) and 6(d) show the SLAed validation sample complexity of the same models. We compare these configurations against Sage’s combined-block training that allows ML training and validation to operate on their full relevance windows. We can see that block composition helps both the training and validation stages. While LR training (Fig. 6(a)) performs nearly identically for Sage and block sizes of 100K or 500K (6h of data to a bit more than a day), validation is significantly impacted. The LR cannot be validated with any MSE better than with block sizes of 500K, and for blocks of size 100K. Additionally, those targets that can be validated require significantly more data without Sage’s block composition: 10x for blocks of size 500K, and almost 100x for blocks of 100K. The NN is more affected at training time. With blocks smaller than 1M points, it cannot even be trained. Even with an enormous block size of 5M, more than ten days of data (Fig. 6(c)), the partitioned model performs 8% worse than with Sage’s block composition. Although on such large blocks validation itself is not much affected, the worse performance means that models can be validated up to an MSE target of (against Sage’s ), and requires twice as much data as with block composition.
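The noise penalty of per-block queries can be quantified with a back-of-the-envelope sketch. It assumes a sum query whose per-block and combined versions each receive Laplace noise of the same scale (same sensitivity and privacy parameters); the function names are hypothetical, and the only fact used is that a Laplace(0, scale) draw has variance 2 * scale^2.

```python
import math


def noise_std_combined(scale):
    # One Laplace(0, scale) draw on the query over all blocks: variance 2 * scale^2.
    return math.sqrt(2.0) * scale


def noise_std_per_block(scale, num_blocks):
    # One independent Laplace(0, scale) draw per block; the variances add up.
    return math.sqrt(2.0 * num_blocks) * scale


scale = 1.0
ratio = noise_std_per_block(scale, 100) / noise_std_combined(scale)
```

Under these assumptions, splitting a query across 100 blocks inflates the noise standard deviation by a factor of about 10, the square root of the number of blocks.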
5.4. Multi-pipeline Workload Performance (Q4)
Last is an end-to-end evaluation of Sage with a workload consisting of a data stream and ML pipelines arriving over discrete time steps. At each step, a number of new data points corresponding approximately to 1 hour of data arrives (16K for Taxi, 267K for Criteo). The time between new pipelines is drawn from a Gamma distribution. When a new pipeline arrives, its sample complexity (number of data points required to reach the target) is drawn from a power law distribution, and a pipeline with the relevant sample complexity is chosen uniformly among our configurations and targets (Tab. 1). Under this workload, we compare the average model release in steady state for four different strategies. The first two leverage Query Composition and Streaming Composition from prior work, as explained in methodology and §3.3. The other two take advantage of Sage’s Block Composition. Both strategies uniformly divide the privacy budget of new blocks among all incomplete pipelines, but differ in how each pipeline uses its budget. Block/Aggressive uses as much privacy budget as is available when a pipeline is invoked. Block/Conserve (Sage) uses the Privacy-Adaptive Training strategy defined in §3.3.


Fig. 8 shows each strategy’s average model release time under increasing load (higher model arrival rate), as the system enforces DP over the entire stream. We make two observations. First, Sage’s block composition is crucial. Query Composition and Streaming Composition quickly degrade to off-the-charts release times: supporting more than one model every two hours is not possible and yields release times above three days. On the other hand, strategies leveraging Sage’s block composition both provide lower release times, and can support up to 0.7 model arrivals per hour (more than 15 new models per day) and release them within a day. Second, we observe consistently lower release times under the privacy budget conserving strategy. At higher rates, such as 0.7 new models per hour, the difference starts to grow: Block/Conserve has a release time 4x and 2x smaller than Block/Aggressive on Taxi (Fig. 7(a)) and Criteo (Fig. 7(b)) respectively. Privacy budget conservation reduces the amount of budget consumed by an individual pipeline, thus allowing new pipelines to use the remaining budget when they arrive.
6. Related Work
Sage’s main contribution – block composition – is related to DP composition theory. Basic (Dwork:2006:CNS:2180286.2180305, ) and strong (dwork2010boosting, ; kairouz2015composition, ) composition theorems give the DP guarantee for multiple queries with adaptively chosen computation. McSherry (Mcsherry:pinq, ) and Zhang et al. (Zhang:2018:EFD:3183713.3196921, ) show that non-adaptive queries over non-overlapping subsets of data can share the DP budget. Rogers et al. (rogers2016privacy, ) analyze composition under adaptive DP parameters, which is crucial to our block composition. These works all consider fixed datasets and query-level accounting.
Compared to all these works, our main contribution is to formalize the new block-level DP interaction model, which supports ML workloads on growing databases while enforcing a global DP semantic without running out of budget. This model sits between traditional DP interaction with static data, and streaming DP working only on current data. In proving our interaction model DP we leverage prior theoretical results and analysis methods. However, the most comprehensive prior interaction model (rogers2016privacy, ) did not support all our requirements, such as interactions with adaptively chosen data subsets, or future data being impacted by previous queries.
Streaming DP (Chan:2011:PCR:2043621.2043626, ; Dwork:2010:DPN:1873601.1873617, ; panprivatestreamingalgorithms, ; dwork2010differential, ) extends DP to data streams but is restrictive for ML. Data is consumed once and assumed to never be used again. This enables stronger guarantees, as data need not even be kept internally. However, training ML models often requires multiple passes over the data.
Cummings et al. (cummings2018differential, ) consider DP over growing databases. They focus on theoretical analysis and study two setups. In the first setup, they also run DP workloads on exponentially growing data sizes. However, their approach only supports linear queries, with a runtime exponential in the data dimension and hence impractical. In the second setup, they focus on training a single convex ML model and show that it can use new data to keep improving. Supporting ML workloads would require splitting the privacy budget for the whole stream among models, creating a running out of privacy budget challenge.
A few DP systems exist, but none focuses on streams or ML. PINQ (Mcsherry:pinq, ) and its generalization wPINQ (proserpio2014calibrating, ) give a SQL-like interface to perform DP queries. They introduce the partition operator allowing parallel composition, which resembles Sage’s block composition. However, this operator only supports non-adaptive parallel computations on non-overlapping partitions, which is insufficient for ML. Airavat (roy2010airavat, ) provides a MapReduce interface and supports a strong threat model against actively malicious developers. They adopt a perspective similar to ours, integrating DP with access control. GUPT (mohan2012gupt, ) supports automatic privacy budget allocation and lets programmers specify accuracy targets for arbitrary DP programs with a real-valued output; it is hence applicable to computing summary statistics but not to training ML models. All these works focus on static datasets and adopt a generic, query-level accounting approach that applies to any workload. Query-level accounting would force them to run out of privacy budget even if unused data were available. Block-level accounting avoids this limitation but applies to workloads with specific data interaction characteristics (§3.2).
7. Summary and Future Work
As companies disseminate ML models trained over sensitive data to untrusted domains, it is crucial to start controlling data leakage through these models. We presented Sage, the first ML platform that enforces a global DP guarantee across all models released from sensitive data streams. Its main contributions are its block-level accounting that permits endless operation on streams and its privacy-adaptive training that lets developers control DP model quality. The key enabler of both techniques is our systems focus on ML training workloads rather than DP ML’s typical focus on individual training algorithms. While individual algorithms see either a static dataset or an online training regime, workloads interact with growing databases. Across executions of multiple algorithms, new data becomes available (helping to renew privacy budgets and allow endless operation) and old data is reused (allowing training of models on increasingly large datasets to mitigate the effect of DP noise on model quality).
We believe that this systems perspective on DP ML presents further opportunities worth pursuing in the future. Chief among them is how to allocate data, privacy parameters, and compute resources to conserve privacy budget while training models efficiently to their quality targets. Sage proposes a specific heuristic for allocating the first two resources (§3.3), but leaves unexplored tradeoffs between data and compute resources. To conserve budgets, we use as much data as is available in the database when a model is invoked, with the lowest privacy budget. This gives us the best utilization of the privacy resource. But training on more data consumes more compute resources. Identifying principled approaches to perform these allocations is an open problem.
A key limitation of this work is its focus on event-level privacy, a semantic that is insufficient when groups of correlated observations can reveal sensitive information. The best known example of such correlation happens when a user contributes multiple observations, but other examples include repeated measurements of a phenomenon over time, or users and their friends on a social network. In such cases, observations are all correlated and can reveal sensitive information, such as a user’s demographic attributes, despite event-level DP. It should be noted that even in the face of correlated data DP holds for each individual observation: other correlated observations constitute side information, to which DP is known to be resilient. Still, to increase protection, an exciting area of future work is to add support for and evaluate user-level privacy. Our block accounting theory is amenable to this semantic (§4.4), but finding settings where the semantic can be sustained without running out of budget is an open challenge.
8. Acknowledgements
We thank our shepherd, Thomas Ristenpart, and the anonymous reviewers for the valuable comments. This work was funded through NSF CNS-1351089, CNS-1514437, and CCF-1740833, two Sloan Faculty Fellowships, a Microsoft Faculty Fellowship, a Google Ph.D. Fellowship, and funds from Schmidt Futures and Columbia Data Science Institute.
References
 [1] Kaggle display advertising challenge dataset. https://www.kaggle.com/c/criteodisplayadchallenge, 2014.
 [2] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proc. of the ACM Conference on Computer and Communications Security (CCS), 2016.
 [3] M. Backes, P. Berrang, M. Humbert, and P. Manoharan. Membership privacy in microRNA-based studies. In Proc. of the ACM Conference on Computer and Communications Security (CCS), 2016.
 [4] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2007.
 [5] D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, C. Y. Koo, L. Lew, C. Mewald, A. N. Modi, N. Polyzotis, S. Ramesh, S. Roy, S. E. Whang, M. Wicke, J. Wilkiewicz, X. Zhang, and M. Zinkevich. TFX: A TensorFlow-based production-scale machine learning platform. In Proc. of the International Conference on Knowledge Discovery and Data Mining (KDD), 2017.
 [6] K. Boyd, E. Lantz, and D. Page. Differential privacy for classifier evaluation. In Proc. of the ACM Workshop on Artificial Intelligence and Security, 2015.
 [7] N. Carlini, C. Liu, U. Erlingsson, J. Kos, and D. Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. arXiv:1802.08232, 2018.
 [8] T.-H. H. Chan, E. Shi, and D. Song. Private and continual release of statistics. ACM Transactions on Information and System Security, 2011.
 [9] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2008.
 [10] K. Chaudhuri, A. D. Sarwate, and K. Sinha. A near-optimal algorithm for differentially-private principal components. Journal of Machine Learning Research (JMLR), 14, 2013.
 [11] R. Cummings, S. Krehbiel, K. A. Lai, and U. Tantipongpipat. Differential privacy for growing databases. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2018.
 [12] I. Dinur and K. Nissim. Revealing information while preserving privacy. In Proc. of the International Conference on Principles of Database Systems (PODS), 2003.
 [13] J. Duchi and R. Rogers. Lower bounds for locally private estimation via communication complexity. arXiv:1902.00582, 2019.
 [14] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 2018.
 [15] C. Dwork. Differential privacy. In Automata, languages and programming. 2006.
 [16] C. Dwork. Differential privacy in new settings. In Proc. of the ACM Symposium on Discrete Algorithms (SODA), 2010.
 [17] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proc. of the Conference on Theory of Cryptography (TCC), 2006.
 [18] C. Dwork, M. Naor, T. Pitassi, G. N. Rothblum, and S. Yekhanin. Pan-private streaming algorithms. In Proc. of The Symposium on Innovations in Computer Science, 2010.
 [19] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In Proc. of the ACM Symposium on Theory of Computing (STOC), 2010.
 [20] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 2014.
 [21] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In Proc. of the IEEE Symposium on Foundations of Computer Science (FOCS), 2010.
 [22] C. Dwork, A. Smith, T. Steinke, and J. Ullman. Exposed! A survey of attacks on private data. Annual Review of Statistics and Its Application, 2017.
 [23] C. Dwork, A. Smith, T. Steinke, J. Ullman, and S. Vadhan. Robust traceability from trace amounts. In Proc. of the IEEE Symposium on Foundations of Computer Science (FOCS), 2015.
 [24] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang. Applied machine learning at Facebook: A datacenter infrastructure perspective. In Proc. of International Symposium on High-Performance Computer Architecture (HPCA), 2018.
 [25] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 1963.
 [26] N. Homer, S. Szelinger, M. Redman, D. Duggan, W. Tembe, J. Muehling, J. V. Pearson, D. A. Stephan, S. F. Nelson, and D. W. Craig. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics, 2008.
 [27] B. Jayaraman and D. Evans. Evaluating differentially private machine learning in practice. In Proc. of USENIX Security, 2019.
 [28] P. Kairouz, S. Oh, and P. Viswanath. The composition theorem for differential privacy. In International Conference on Machine Learning (ICML), 2015.
 [29] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? SIAM Journal on Computing, 2011.
 [30] S. P. Kasiviswanathan and A. Smith. On the ’semantics’ of differential privacy: A Bayesian formulation. Journal of Privacy and Confidentiality, 2014.
 [31] D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Proc. of the ACM Conference on Learning Theory (COLT), 2012.
 [32] M. Lecuyer, R. Spahn, K. Vodrahalli, R. Geambasu, and D. Hsu. Privacy accounting and quality control in the Sage differentially private ML platform. Online Supplements (also available on arXiv), 2019.
 [33] N. Leonard and C. M. Halasz. Twitter meets tensorflow. https://blog.twitter.com/engineering/en_us/topics/insights/2018/twittertensorflow.html, 2018.
 [34] L. E. Li, E. Chen, J. Hermann, P. Zhang, and L. Wang. Scaling machine learning as a service. In Proc. of The International Conference on Predictive Applications and APIs, 2017.
 [35] A. Maurer and M. Pontil. Empirical Bernstein Bounds and Sample Variance Penalization. 2009.
 [36] H. B. McMahan and G. Andrew. A general approach to adding differential privacy to iterative training procedures. arXiv:1812.06210, 2018.
 [37] H. B. McMahan and G. Andrew. A general approach to adding differential privacy to iterative training procedures. arXiv:1812.06210, 2018.
 [38] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private recurrent language models. In Proc. of the International Conference on Learning Representations (ICLR), 2018.
 [39] F. McSherry and I. Mironov. Differentially private recommender systems: Building privacy into the Netflix prize contenders. In Proc. of the International Conference on Knowledge Discovery and Data Mining (KDD), 2009.
 [40] F. D. McSherry. Privacy integrated queries: An extensible platform for privacypreserving data analysis. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2009.
 [41] P. Mohan, A. Thakurta, E. Shi, D. Song, and D. Culler. GUPT: Privacy preserving data analysis made easy. In Proc. of the 2012 ACM SIGMOD International Conference on Management of Data, 2012.
 [42] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft. Privacy-preserving ridge regression on hundreds of millions of records. In Proc. of IEEE Symposium on Security and Privacy (S&P), 2013.
 [43] NYC Taxi & Limousine Commission  trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml, 2018.
 [44] D. Proserpio, S. Goldberg, and F. McSherry. Calibrating data to sensitivity in private data analysis: a platform for differentially-private analysis of weighted datasets. Proc. of the International Conference on Very Large Data Bases (VLDB), 2014.
 [45] S. Ravi. On-device machine intelligence. https://ai.googleblog.com/2017/02/ondevicemachineintelligence.html, 2017.
 [46] R. Rogers, A. Roth, J. Ullman, and S. Vadhan. Privacy odometers and filters: Pay-as-you-go composition. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2018.
 [47] R. M. Rogers, A. Roth, J. Ullman, and S. Vadhan. Privacy odometers and filters: Pay-as-you-go composition. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2016.
 [48] I. Roy, S. T. Setty, A. Kilzer, V. Shmatikov, and E. Witchel. Airavat: Security and privacy for MapReduce. In Proc. of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.
 [49] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Appendix B. Cambridge University Press, New York, NY, USA, 2014.
 [50] D. Shiebler and A. Tayal. Making machine learning easy with embeddings. SysML http://www.sysml.cc/doc/115.pdf, 2010.
 [51] R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In Proc. of IEEE Symposium on Security and Privacy (S&P), 2017.
 [52] A. Smith and A. Thakurta. Differentially private model selection via stability arguments and the robustness of lasso. Journal of Machine Learning Research, 2013.
 [53] G. P. Strimel, K. M. Sathyendra, and S. Peshterliev. Statistical model compression for small-footprint natural language understanding. arXiv:1807.07520, 2018.
 [54] K. Talwar, A. Thakurta, and L. Zhang. Nearly-optimal private LASSO. In Proc. of the Conference on Neural Information Processing Systems (NeurIPS), 2015.
 [55] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Stealing machine learning models via prediction APIs. In Proc. of USENIX Security, 2016.
 [56] Y.-X. Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. arXiv:1803.02596, 2018.
 [57] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia, T. Leyvand, H. Lu, Y. Lu, L. Qiao, B. Reagen, J. Spisak, F. Sun, A. Tulloch, P. Vajda, X. Wang, Y. Wang, B. Wasti, Y. Wu, R. Xian, S. Yoo, and P. Zhang. Machine learning at Facebook: Understanding inference at the edge. In Proc. of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2019.
 [58] J. Xu, Z. Zhang, X. Xiao, Y. Yang, G. Yu, and M. Winslett. Differentially private histogram publication. In Proc. of the IEEE International Conference on Data Engineering (ICDE), 2012.
 [59] L. Yu, L. Liu, C. Pu, M. E. Gursoy, and S. Truex. Differentially private model publishing for deep learning. In Proc. of IEEE Symposium on Security and Privacy (S&P), 2019.
 [60] D. Zhang, R. McKenna, I. Kotsogiannis, M. Hay, A. Machanavajjhala, and G. Miklau. Ektelo: A framework for defining differentially-private computations. In Proc. of the ACM SIGMOD International Conference on Management of Data, 2018.
 [61] J. Zhang, Z. Zhang, X. Xiao, Y. Yang, and M. Winslett. Functional mechanism: Regression analysis under differential privacy. In Proc. of the International Conference on Very Large Data Bases (VLDB), 2012.
Appendix A Block Composition
This section clarifies and makes precise several aspects of the block composition theory in §4. It then ports prior strong composition results to our block accounting model, for both fixed and adaptive choices of blocks and DP parameters.
A.1. Clarifications and Precisions
Neighboring Datasets. We measure the distance between datasets using the symmetric difference: viewing a dataset as a multiset, two datasets $D$ and $D'$ are neighboring if their disjunctive union (the elements which are in one of the sets but not in their intersection) has size at most one. We write $|D \oplus D'|$ for the distance between $D$ and $D'$. This definition is not the most standard: most DP work uses the Hamming distance, which counts the number of records to change to go from $D$ to $D'$. Intuitively, under the symmetric difference an attacker can add or remove a record in the dataset. Changing a record corresponds to a symmetric difference of size $2$, but a Hamming distance of $1$. Changing a record can still be supported under the symmetric difference using group privacy (dwork2014algorithmic, ): the definition is thus slightly weaker but as general.
We chose to use the symmetric difference following PINQ (Mcsherry:pinq, ), as this definition is better suited to the analysis of DP composition over splits of the dataset (our blocks). Changing one record can indeed affect two blocks (e.g. when its timestamp is changed), while adding or removing a record only affects one.
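To make the difference between the two distances concrete, the following Python sketch counts the multiset symmetric difference for an added record and for a changed record, and reports which blocks each edit touches. The record layout `(day, value)`, the choice of one block per day, and the helper names are illustrative assumptions, not part of Sage.

```python
from collections import Counter


def symmetric_difference_size(d1, d2):
    """Size of the multiset symmetric difference between two datasets."""
    c1, c2 = Counter(d1), Counter(d2)
    # Counter subtraction keeps only positive counts, so each term counts
    # the records present in one multiset but not in the other.
    return sum((c1 - c2).values()) + sum((c2 - c1).values())


def blocks_touched(d1, d2, block_of):
    """Set of blocks that contain a record differing between d1 and d2."""
    c1, c2 = Counter(d1), Counter(d2)
    differing = list((c1 - c2).elements()) + list((c2 - c1).elements())
    return {block_of(record) for record in differing}


def day(record):
    # Hypothetical block assignment: one block per day of the timestamp.
    return record[0]


base = [(1, "a"), (1, "b"), (2, "c")]
added = base + [(2, "d")]                 # one record added
changed = [(1, "a"), (2, "b"), (2, "c")]  # timestamp of (1, "b") changed

added_distance = symmetric_difference_size(base, added)
changed_distance = symmetric_difference_size(base, changed)
added_blocks = blocks_touched(base, added, day)
changed_blocks = blocks_touched(base, changed, day)
```

In this example the added record yields a distance of 1 confined to a single block, while the changed timestamp yields a distance of 2 spread over two blocks, which is the case that group privacy has to cover.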
Neighboring Streams. This notion of neighboring dataset extends directly to streams of data (Chan:2011:PCR:2043621.2043626, ; Dwork:2010:DPN:1873601.1873617, ; panprivatestreamingalgorithms, ; dwork2010differential, ). Two streams $D$ and $D'$ indexed by $t$ are neighboring if there exists an index $T$ such that: for $t < T$ the streams are identical (i.e. $|D_{<T} \oplus D'_{<T}| = 0$), and for all $t \geq T$ the streams up to $t$ form neighboring datasets (i.e. $|D_{\leq t} \oplus D'_{\leq t}| \leq 1$). This is equivalent to our Algorithm (3(b)) where the data, though unknown, is fixed in advance with only one record being changed between the two streams.
This definition, however, is restrictive, because a change in a stream’s data will typically affect future data. This is especially true in an ML context, where a record changed in the stream will change the targeting or recommendation algorithms that are trained on this data, which in turn will impact the data collected in the future. Because of this, $D$ and $D'$ will probably differ in a large number of observations following the adversary’s change of one record. Interactions described in Algorithm (3(c)) model these dependencies. We show in Theorem (4.3) that if the data change impacts future data only through DP results (e.g. the targeting and recommendation models are DP) and mechanisms outside of the adversary’s control (our “world” variable $\mathcal{W}$), composition results are not affected.
Privacy Loss Semantics. Recall that bounding the privacy loss (Definition 4.1) with high probability implies DP (kasiviswanathan2014semantics, ): if with probability $\geq 1 - \delta$ over draws $v$ from $V^0$ (or $V^1$) we have $|\text{Loss}(v)| \leq \epsilon$, then the interaction generating $V^b$ is $(\epsilon, \delta)$-DP.
In this paper, we implicitly treated DP and bounded loss as equivalent by declaring the $Q_i$s as $(\epsilon, \delta)$-DP, but proving composition results using a bound on $Q_i$’s privacy loss. However, this is not exactly true in general, as $(\epsilon, \delta)$-DP implies a bound on privacy loss with weaker parameters, namely that with probability at least $1 - \frac{2\delta}{\epsilon e^{\epsilon}}$ the loss is bounded by $2\epsilon$. In practice, this small difference is not crucial, as the typical Laplace and Gaussian DP mechanisms (and those we use in Sage) do have the same DP parameters for their bounds on privacy loss (dwork2014algorithmic, ): the Laplace mechanism for $(\epsilon, 0)$-DP implies that the privacy loss is bounded by $\epsilon$, and achieving $(\epsilon, \delta)$-DP with the Gaussian mechanism implies that the privacy loss is bounded by $\epsilon$ with probability at least $1 - \delta$.
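As a small numerical illustration of the Laplace case, the sketch below samples outputs of a Laplace mechanism on a counting query over two neighboring datasets and checks that the absolute privacy loss never exceeds the DP parameter $\epsilon$. The query answers (100 and 101), the sensitivity of 1, and all function names are assumptions made for the example.

```python
import math
import random


def laplace_log_density(x, mean, scale):
    """Log-density of a Laplace(mean, scale) distribution at x."""
    return -abs(x - mean) / scale - math.log(2 * scale)


def privacy_loss(v, answer0, answer1, scale):
    """ln(P(V^0 = v) / P(V^1 = v)) for a Laplace mechanism output v."""
    return (laplace_log_density(v, answer0, scale)
            - laplace_log_density(v, answer1, scale))


def sample_laplace(mean, scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return mean + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


epsilon = 0.5
sensitivity = 1.0              # a count changes by at most 1 between neighbors
scale = sensitivity / epsilon  # standard Laplace calibration for (epsilon, 0)-DP
rng = random.Random(0)

# True answers on the two neighboring datasets (illustrative values).
answer0, answer1 = 100.0, 101.0
losses = [
    abs(privacy_loss(sample_laplace(answer0, scale, rng), answer0, answer1, scale))
    for _ in range(10_000)
]
max_loss = max(losses)
tail_loss = privacy_loss(90.0, answer0, answer1, scale)
```

The bound holds for every output, not just with high probability, because the two log-densities differ by at most `sensitivity / scale`; outputs far from both answers, such as 90, attain the bound exactly.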
A.2. Proof for Basic Composition
The paper omits the proof for basic composition to save space. Here, we restate the result and spell out the proof:
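Theorem 4.3 (Basic Sage Block Composition).
AdaptiveStreamBlockCompose($\mathcal{A}$, $b$, $r$, $\epsilon_g$, $\delta_g$, $\mathcal{W}$) is $(\epsilon_g, \delta_g)$-DP if for all $k$, $\text{AccessControl}^k_{\epsilon_g, \delta_g}$ enforces:
$$\sum_{i:\, k \in \text{blocks}_i} \epsilon_i(v_{<i}) \leq \epsilon_g \quad \text{and} \quad \sum_{i:\, k \in \text{blocks}_i} \delta_i(v_{<i}) \leq \delta_g.$$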
Proof.
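Denote by $l_i$ the highest block index that existed when query $i$ was run, and by $D^b_{\leq l_i}$ the data blocks that existed at that time. Recall that $v_{<i}$ denotes the results from all queries released previous to $i$. Query $i$ depends on both $v_{<i}$ and $D^b_{\leq l_i}$; the latter is a random variable that is fully determined by $v_{<i}$. Hence, the privacy loss for Alg. (3(c)) is:
$$\text{Loss}(v) = \ln \prod_{i=1}^{r} \frac{P(V^0_i = v_i \mid v_{<i}, D^0_{\leq l_i})}{P(V^1_i = v_i \mid v_{<i}, D^1_{\leq l_i})} = \ln \prod_{i=1}^{r} \frac{P(V^0_i = v_i \mid v_{<i})}{P(V^1_i = v_i \mid v_{<i})}.$$
After applying Theorem 4.2, we must show that, for all $k$,
$$\left| \ln \prod_{i:\, k \in \text{blocks}_i} \frac{P(V^0_i = v_i \mid v_{<i})}{P(V^1_i = v_i \mid v_{<i})} \right| \leq \epsilon_g$$
with probability $\geq 1 - \delta_g$. The justification follows:
$$\left| \ln \prod_{i:\, k \in \text{blocks}_i} \frac{P(V^0_i = v_i \mid v_{<i})}{P(V^1_i = v_i \mid v_{<i})} \right| \leq \sum_{i:\, k \in \text{blocks}_i} \left| \ln \frac{P(V^0_i = v_i \mid v_{<i})}{P(V^1_i = v_i \mid v_{<i})} \right|.$$
Since $Q_i$ is $(\epsilon_i(v_{<i}), \delta_i(v_{<i}))$-DP, $\left| \ln \frac{P(V^0_i = v_i \mid v_{<i})}{P(V^1_i = v_i \mid v_{<i})} \right| \leq \epsilon_i(v_{<i})$ with probability $\geq 1 - \delta_i(v_{<i})$. Applying a union bound over all queries for which $k \in \text{blocks}_i$ shows that the sum is at most $\sum_{i:\, k \in \text{blocks}_i} \epsilon_i(v_{<i}) \leq \epsilon_g$ with probability at least $1 - \sum_{i:\, k \in \text{blocks}_i} \delta_i(v_{<i}) \geq 1 - \delta_g$, which concludes the proof. ∎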
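The accounting rule that the theorem asks AccessControl to enforce can be sketched as a per-block ledger. This is a minimal Python illustration under basic composition, not Sage's implementation: the class name `BlockAccountant`, its methods, and the budget values are hypothetical.

```python
class BlockAccountant:
    """Tracks per-block privacy consumption under basic composition."""

    def __init__(self, epsilon_g, delta_g):
        self.epsilon_g = epsilon_g
        self.delta_g = delta_g
        self.spent = {}  # block id -> (epsilon spent, delta spent)

    def add_block(self, block_id):
        # A newly arrived block of stream data starts with an untouched budget.
        self.spent.setdefault(block_id, (0.0, 0.0))

    def request(self, blocks, epsilon, delta):
        """Charge (epsilon, delta) to every requested block, or refuse."""
        for k in blocks:
            eps_k, delta_k = self.spent[k]
            if eps_k + epsilon > self.epsilon_g or delta_k + delta > self.delta_g:
                return False  # some block would exceed the global bound
        # Only charge once every requested block has been checked.
        for k in blocks:
            eps_k, delta_k = self.spent[k]
            self.spent[k] = (eps_k + epsilon, delta_k + delta)
        return True


accountant = BlockAccountant(epsilon_g=1.0, delta_g=1e-6)
accountant.add_block(0)
accountant.add_block(1)

first = accountant.request([0, 1], epsilon=0.5, delta=0.0)
second = accountant.request([0], epsilon=0.5, delta=0.0)
third = accountant.request([0, 1], epsilon=0.5, delta=0.0)   # block 0 is exhausted

accountant.add_block(2)  # new data arrives on the stream
fourth = accountant.request([1, 2], epsilon=0.5, delta=0.0)
```

Because budgets are tracked per block, the last request succeeds on a newly arrived block even though block 0 is exhausted; a single query-level budget of the same size would have refused it.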
A.3. Strong Composition with Block-Level Accounting
Fixed Blocks and DP Parameters: We now show how to use advanced composition results (e.g. (dwork2010boosting, )) in the context of block composition. This approach requires that both the blocks used by each query and the DP parameters be fixed in advance, and corresponds to Algorithm (3(b)).
Theorem A.1 (Strong Block Composition – Fixed DP Parameters).
BlockCompose(, , , , ) is DP, with:
Proof.
After applying Theorem 4.2, what remains to be shown is that , with probability at least . Using the fact ’s privacy loss is bounded by with probability at least , we know that there exists events and with joint probability at least such that for all , . We can now condition the analysis on and , and use Theorem 3.20 of (dwork2014algorithmic, ) to get that with probability at least , , where . A union bound on the and for all completes the proof. ∎
The proof directly extends to the stream setting (yellow parts of Alg. (3(b))) in the same way as in the proof of Theorem 4.3.
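To get a feel for what strong composition buys at the block level, the sketch below applies the homogeneous advanced composition bound (Theorem 3.20 of (dwork2014algorithmic, )) to the queries that touch the most-used block, and compares it with basic composition and with query-level accounting. It assumes every query uses the same DP parameters, which is a simplification of the theorem above; the sliding-window workload and all names are hypothetical.

```python
import math


def basic_epsilon(epsilon, k):
    """Basic composition of k (epsilon, delta)-DP queries: epsilons add up."""
    return k * epsilon


def strong_epsilon(epsilon, k, delta_slack):
    """Advanced composition of k queries sharing the same epsilon.

    The composed guarantee is (strong_epsilon, k * delta + delta_slack)-DP.
    """
    return (epsilon * math.sqrt(2 * k * math.log(1 / delta_slack))
            + k * epsilon * (math.exp(epsilon) - 1))


def max_queries_per_block(query_blocks):
    """Largest number of queries that touched any single block."""
    counts = {}
    for blocks in query_blocks:
        for block in blocks:
            counts[block] = counts.get(block, 0) + 1
    return max(counts.values())


epsilon, delta, delta_slack = 0.05, 1e-8, 1e-6

# Hypothetical workload: 300 queries, each over a sliding window of 100 blocks.
query_blocks = [range(i, i + 100) for i in range(300)]
worst = max_queries_per_block(query_blocks)

block_basic = basic_epsilon(epsilon, worst)
block_strong = strong_epsilon(epsilon, worst, delta_slack)
block_delta = worst * delta + delta_slack

# Query-level accounting composes over all 300 queries regardless of blocks.
query_strong = strong_epsilon(epsilon, len(query_blocks), delta_slack)
```

With these values, no block is touched by more than 100 of the 300 queries, so the block-level bound composes over 100 queries rather than 300, and strong composition is tighter than basic composition for that block.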
Adaptive Blocks and DP Parameters: Recall that with either adaptive blocks or DP parameters (or both), DP parameters depend on history. Traditional composition theorems do not apply to this setting. For basic composition, the DP parameters still “sum” under composition. However, as shown in (rogers18privacy, ), strong composition yields a different formula: while the privacy loss still scales as the square root of the number of queries, the constant is worse than with parameters fixed in advance.
Theorem A.2 (Strong Adaptive Stream Block Composition).
AdaptiveStreamBlockCompose(, , , , , ) is DP, and:
Proof.
Similarly to the proof of Theorem 4.3, we apply Theorem 4.2, and bound the privacy loss of any block using Theorem 5.1 of (rogers18privacy, ). ∎
Appendix B Validation Tests
Sage has built-in validators for three classes of metrics. §3.3 describes the high-level approach and the specific functionality and properties of the loss-based validator. This section details all three validators and proves their statistical and DP guarantees.
B.1. SLAed Validator for Loss Metrics
Denote a loss function with range measuring the quality of a prediction with label (lower is better), and a target loss on the data distribution .
Accept Test: Given the DP-trained model , we want to release only if . The test works as follows. First, compute a DP estimate of the number of samples in the test set, corrected for the impact of DP noise to be a lower bound on the true value with probability (Lines 11-13 List. 2):
Then, compute a DP estimate of the loss corrected for DP impact (Lines 14-17 List. 2) to be an upper bound on the true value, , with probability :
In Lines 18-20, we see that Sage will Accept when:
This test gives the following guarantee:
Proposition B.1 (Loss Accept Test (same as Proposition 3.1)).
With probability at least , the Accept test returns true only if .
Proof.
The corrections for DP noise imply that , and (i.e. the lower bounds hold with probability at least (1)). Define