
Per-instance Differential Privacy and the Adaptivity of Posterior Sampling in Linear and Ridge Regression

Abstract

Differential privacy (DP), ever since its advent, has been a controversial object. On the one hand, it provides strong provable protection of individuals in a data set; on the other hand, it has been heavily criticized for being impractical, partially due to its complete independence from the actual data set it tries to protect. In this paper, we address this issue with a new and more fine-grained notion of differential privacy — per-instance differential privacy (pDP), which captures the privacy of a specific individual with respect to a fixed data set. We show that this is a strict generalization of the standard DP and inherits all its desirable properties, e.g., composition, invariance to side information and closure under post-processing, except that they all hold for every instance separately. When the data is drawn from a distribution, we show that per-instance DP implies generalization. Moreover, we provide explicit calculations of the per-instance DP for output perturbation on a class of smooth learning problems. The result reveals an interesting and intuitive fact: an individual has stronger privacy if he/she has a small "leverage score" with respect to the data set and if he/she can be predicted more accurately using the leave-one-out data set. Using the developed techniques, we provide a novel analysis of the One-Posterior-Sample (OPS) estimator and show that when the data set is well-conditioned it provides $(\epsilon,\delta)$-pDP for any target individual and matches the exact lower bound up to a multiplicative factor. We also propose AdaOPS, which uses adaptive regularization to achieve the same results with $(\epsilon,\delta)$-DP. Simulation shows a several-orders-of-magnitude more favorable privacy and utility trade-off when we consider the privacy of only the users in the data set.


1 Introduction

While modern statistics and machine learning have seen amazing success, their application to sensitive domains involving personal data remains challenging due to privacy issues. Differential privacy [11] is a mathematical notion that allows strong provable protection of individuals from being identified by an arbitrarily powerful adversary, and it has become increasingly popular within the machine learning community as a solution to the aforementioned problem [22, 6, 20, 1]. The strong privacy protection, however, comes with a steep price to pay. Differential privacy almost always leads to a substantial and often unacceptable drop in utility, e.g., in contingency tables [16] and in genome-wide association studies [34]. This has motivated a large body of research focused on making differential privacy more practical [25, 10, 26, 31, 13, 4, 17] by exploiting local structures and/or revising the privacy definition.

The majority of these approaches adopt the "privacy-centric" model, which involves theoretically proving that an algorithm is differentially private for any data (within a data domain), then carefully analyzing the utility of the algorithm under additional assumptions on the data. For instance, in statistical estimation it is often assumed that the data is drawn i.i.d. from a family of distributions. In nonparametric statistics and statistical learning, the data are often assumed to have specific deterministic/structural properties, e.g., smoothness, incoherence, eigenvalue conditions, low-rankness, sparsity and so on. While these assumptions are strong and sometimes unrealistic, they are often necessary for a model to work correctly, even without privacy constraints. Take high-dimensional statistics for example: "sparsity" is almost never true, but if the true model is dense and unstructured, it is simply impossible to recover the true model anyway in the "small $n$, large $d$" regime. That is why Friedman et al. [18] argued that one should "bet on sparsity" regardless and hope that it is a reasonable approximation of reality. This is known as adaptivity, in that an algorithm can perform provably better when some additional conditions are true.

The effect of these assumptions on privacy is unclear, mostly because there are no tools available to analyze such adaptivity in privacy. Since differential privacy is a worst-case quantity — a property of the randomized algorithm only (independent of the data) — it is unlikely that the obtained privacy loss accurately quantifies the privacy protection on a given data set at hand. It is always an upper bound, but the bound could be too conservative to be of any use in practice (e.g., when the guaranteed $\epsilon$ is large).

To make matters worse, the extent to which DP is conservative is highly problem-dependent. In cases like releasing counting queries, the $\epsilon$ clearly measures the correct amount of information leakage, since the sensitivity of such queries does not change with respect to the two adjacent data sets; however, in the context of machine learning and statistical estimation (as we will show later), the $\epsilon$ of DP can be orders of magnitude larger than the actual limit of information leakage that the randomized algorithm guarantees. That is why, in practice, it is challenging even for experts of differential privacy to provide a consistent recommendation on standard questions such as:

“What is the value of $\epsilon$ I should set in my application?”

In this paper, we take a new "algorithm-centric" approach to analyzing privacy. Instead of designing algorithms that take the privacy loss as an input, we consider a fixed randomized algorithm and then analyze its privacy protection for every pair of adjacent data sets separately.

Our contribution is three-fold.

  1. First, we develop per-instance differential privacy as a strict generalization of the standard pure and approximate DP. It provides a more fine-grained description of the privacy protection for each target individual and a fixed data set. We show that it inherits many desirable properties of differential privacy and can easily recover differential privacy for a given class of data and target users.

  2. Secondly, we quantify the per-instance sensitivity in a class of smooth learning problems including linear and kernel machines. The result allows us to explicitly calculate the per-instance DP of the multivariate Gaussian mechanism. For an appropriately chosen noise covariance, the per-instance DP is proportional to the norm of the "pseudo-residual" measured in the Mahalanobis norm specified by the Hessian matrix. In particular, in linear regression, the per-instance sensitivity of a data point is proportional to the square root of its statistical leverage score and to its leave-one-out prediction error.

  3. Lastly, we analyze the procedure of releasing one sample from the posterior distribution (the OPS estimator) for ridge regression as an output perturbation procedure with a data-dependent choice of covariance matrix. Using the pDP technique, we show that when conditioning on a data set drawn from the linear regression model or having a well-conditioned design matrix, OPS achieves $(\epsilon,\delta)$-pDP for every target individual while matching the Cramér–Rao lower bound up to a multiplicative factor. OPS unfortunately cannot achieve DP with a constant $\epsilon$ while remaining asymptotically efficient. We fix that with a new algorithm called AdaOPS, which provides $(\epsilon,\delta)$-DP and asymptotic statistical efficiency at the same time.

1.1 Symbols and notations

Throughout the paper, we will use standard notation from statistical learning. A data point is denoted $z \in \mathcal{Z}$; in the supervised learning setting, $z = (x, y)$ with feature vector $x$ and label $y$. We use $\theta$ to denote either the predictive function or the parameter vector that specifies such a function, and $\ell(\theta, z)$ to denote the loss function; in a statistical model, $\ell(\theta, z)$ represents the negative log-likelihood $-\log p(z\mid\theta)$. For example, in linear regression, $z = (x, y)$, $\theta$ is the coefficient vector, and $\ell(\theta, z) = \frac{1}{2}(y - x^\top\theta)^2$. We use $\mathcal{A}$ to denote a randomized algorithm that outputs a draw from a distribution defined on a model space. Capital $Z$ denotes a data set, and $\epsilon, \delta$ will be used to denote privacy loss parameters.

2 Per-instance differential privacy

In this section, we define the notion of per-instance differential privacy and derive its properties. We begin by parsing the standard definition of differential privacy.

Definition 1 (Differential privacy [11]).

We say a randomized algorithm $\mathcal{A}$ satisfies $(\epsilon,\delta)$-DP if, for every data set $Z$, every data set $Z'$ that can be constructed by adding or removing one row from $Z$, and every measurable set $S$,
$$\Pr\big(\mathcal{A}(Z)\in S\big) \;\le\; e^{\epsilon}\,\Pr\big(\mathcal{A}(Z')\in S\big) + \delta.$$

When $\delta = 0$, this is also known as pure differential privacy, and it is much stronger because, for each data set $Z$, the protection holds uniformly over all privacy targets $z$. When $\delta > 0$, the protection becomes much weaker, in that it is stated for each privacy target separately.

It is helpful to understand what differential privacy protects against — a powerful adversary that knows everything in the entire universe except one bit of information: whether a target $z$ is in the data set or not. The optimal strategy for such an adversary is to conduct a likelihood ratio test (or posterior inference) on this bit, and differential privacy uses randomization to limit the probability of success of such a test [33].

In the above, we described the original "In-or-Out" version of the DP definition (see, e.g., [12, Definitions 2.3, 2.4]). There is also a "Replace-One" version of the definition, which assumes $Z'$ is constructed by replacing one row of $Z$ arbitrarily. This preserves the cardinality of the data set and makes it more convenient in certain settings. "Replace-One" differential privacy protects against a slightly stronger adversary who knows the entire data set except one row and can limit the possibilities for the unknown row to either $z$ or $z'$. Again, this is only one bit of information that the adversary tries to infer, and the optimal strategy for the adversary is to conduct a likelihood ratio test. In this paper, we choose to work with the "In-or-Out" version of differential privacy, although everything we derive can also be stated for the alternative version.
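To make the adversary's likelihood-ratio test concrete, here is a minimal simulation sketch (ours, not from the paper) in which the mechanism releases a Laplace-noised sum; the data, parameter values and function names are purely illustrative. It also previews the per-instance phenomenon: the realized log-likelihood ratio is bounded by $|z|/b$ for the specific target, which can be smaller than the worst-case bound $1/b$.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_logpdf(x, loc, scale):
    return -np.log(2 * scale) - np.abs(x - loc) / scale

# "In-or-Out" toy setting: the adversary knows Z and the candidate record z,
# and must decide whether a released noisy sum came from Z or from Z + {z}.
Z = rng.uniform(0, 1, size=200)     # records bounded in [0, 1]
z, b = 0.9, 2.0                     # target value; Laplace scale (worst-case eps = 1/b)

out = Z.sum() + z + rng.laplace(scale=b)   # ground truth: z IS in the data set

# Optimal adversary: likelihood-ratio test between the two hypotheses.
llr = laplace_logpdf(out, Z.sum() + z, b) - laplace_logpdf(out, Z.sum(), b)
print(f"log-likelihood ratio = {llr:+.3f}  (|llr| <= z/b = {z/b}, worst case 1/b = {1/b})")
```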

Note that the adversary always knows $Z$ and has a clearly defined target $z$, so it is natural to evaluate the winnings and losses of the "player" (the data curator) by conditioning on the same data set and privacy target. This gives rise to the following generalization of DP.

Definition 2 (Per-instance Differential Privacy).

For a fixed data set $Z$ and a fixed data point $z$, we say a randomized algorithm $\mathcal{A}$ satisfies $(\epsilon,\delta)$-per-instance-DP for $(Z, z)$ if, for all measurable sets $S$, it holds that
$$\Pr\big(\mathcal{A}(Z)\in S\big) \le e^{\epsilon}\,\Pr\big(\mathcal{A}(Z\cup\{z\})\in S\big) + \delta
\quad\text{and}\quad
\Pr\big(\mathcal{A}(Z\cup\{z\})\in S\big) \le e^{\epsilon}\,\Pr\big(\mathcal{A}(Z)\in S\big) + \delta.$$

This definition differs from DP primarily in that DP is a property of the algorithm $\mathcal{A}$ only, while pDP is a property of both $\mathcal{A}$ and the pair $(Z, z)$. If we take the supremum over all $Z$ and $z$, then it recovers the standard differential privacy.

Similarly, we can define per-instance sensitivity for a fixed pair $(Z, z)$.

Definition 3 (per-instance sensitivity).

For a fixed data set $Z$ and a fixed data point $z$, the per-instance sensitivity of a function $f$ is defined as $\Delta(Z,z) := \|f(Z) - f(Z\cup\{z\})\|$, where $\|\cdot\|$ could be the $\ell_2$ norm or the Mahalanobis norm $\|v\|_A = \sqrt{v^\top A v}$ defined by a positive definite matrix $A$.

This definition also generalizes quantities in the classic DP literature. If we condition on $Z$ but maximize over all $z$, we get local sensitivity [25]. If we maximize over all $Z$ and $z$, we get global sensitivity [12, Definition 3.1]. These two are often infinite in real-life problems, but for a fixed data set and a fixed target to be protected, we can still get a meaningful per-instance sensitivity.
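As a concrete toy illustration (ours, not from the paper), the following sketch contrasts per-instance, local, and global sensitivity for the simple function $f(Z) = \mathrm{mean}(Z)$ on records bounded in $[0,1]$; all names and values are illustrative.

```python
import numpy as np

def f(data):
    return data.mean()

def per_instance_sensitivity(Z, z):
    """|f(Z u {z}) - f(Z)| for the specific pair (Z, z)."""
    return abs(f(np.append(Z, z)) - f(Z))

rng = np.random.default_rng(1)
Z = rng.uniform(0, 1, size=100)            # records bounded in [0, 1]
n = len(Z)

pdp_typical = per_instance_sensitivity(Z, float(Z.mean()))   # ~0: z blends into the crowd
pdp_extreme = per_instance_sensitivity(Z, 1.0)               # |1 - mean(Z)| / (n + 1)
# Local sensitivity: maximize over the target z in [0, 1] for this fixed Z.
local = max(per_instance_sensitivity(Z, 0.0), per_instance_sensitivity(Z, 1.0))
# Global sensitivity: worst case over all [0,1]-bounded data sets of size n and all z.
global_ = 1.0 / (n + 1)

print(pdp_typical, pdp_extreme, local, global_)
```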

Immediately, the per-instance sensitivity implies pDP for a noise adding procedure.

Lemma 4 (Multivariate Gaussian mechanism).

Let $f$ be a deterministic map from a data set to a point in $\mathbb{R}^d$, e.g., a deterministic learning algorithm, and let the $\|\cdot\|_{A^{-1}}$-norm per-instance sensitivity be $\Delta_{A^{-1}}(Z,z) := \|f(Z) - f(Z\cup\{z\})\|_{A^{-1}}$ for a positive definite matrix $A$. Then releasing $f(\cdot) + \mathcal{N}(0, \gamma^2 A)$ obeys $(\epsilon(Z,z),\delta)$-pDP for any $\delta > 0$ with
$$\epsilon(Z,z) \;=\; \frac{\Delta_{A^{-1}}(Z,z)^2}{2\gamma^2} + \frac{\Delta_{A^{-1}}(Z,z)\sqrt{2\log(2/\delta)}}{\gamma}.$$

The proof, which is standard and therefore omitted, simply verifies the definition of $(\epsilon,\delta)$-pDP by calculating a tail bound of the privacy loss random variable and invoking Lemma 21.
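The calculation in Lemma 4 is easy to turn into a helper. The sketch below is ours and assumes the tail-bound expression for $\epsilon(Z,z)$ stated above (the constant inside the logarithm is the usual Gaussian-mechanism one and may differ slightly from the paper's Lemma 21); function names and the example numbers are illustrative.

```python
import numpy as np

def pdp_epsilon_gaussian(delta_vec, A, gamma, delta=1e-6):
    """Per-instance epsilon of releasing f(.) + N(0, gamma^2 * A), where
    delta_vec = f(Z) - f(Z u {z}).  Uses
        eps = d^2 / (2 gamma^2) + d * sqrt(2 log(2/delta)) / gamma,
    with d = ||delta_vec||_{A^{-1}} (Mahalanobis norm)."""
    d = np.sqrt(delta_vec @ np.linalg.solve(A, delta_vec))
    return d**2 / (2 * gamma**2) + d * np.sqrt(2 * np.log(2 / delta)) / gamma

# A target whose addition/removal barely moves f enjoys a much smaller epsilon.
A = np.eye(3)
print(pdp_epsilon_gaussian(np.array([0.01, 0.0, 0.0]), A, gamma=0.1))  # small eps
print(pdp_epsilon_gaussian(np.array([1.0, 1.0, 1.0]), A, gamma=0.1))   # large eps
```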

2.1 Properties of pDP

We now describe properties of per-instance DP, which mostly mirror those of DP.

Fact 5 (Strong protection against identification).

Let $\mathcal{A}$ obey $(\epsilon,\delta)$-pDP for $(Z, z)$. Then for any measurable set $S$ such that $\Pr(\mathcal{A}(Z)\in S)\ge\delta_0$ and $\Pr(\mathcal{A}(Z\cup\{z\})\in S)\ge\delta_0$ for some $\delta_0>\delta$, and given any side information aux,
$$\Bigg|\log\frac{\Pr\big(\mathcal{A}(Z\cup\{z\})\in S\mid \mathrm{aux}\big)}{\Pr\big(\mathcal{A}(Z)\in S\mid \mathrm{aux}\big)}\Bigg| \;\le\; \epsilon + \log\frac{\delta_0}{\delta_0-\delta}.$$

Proof.

Note that after fixing aux, $\mathcal{A}(Z)$ (and likewise $\mathcal{A}(Z\cup\{z\})$) is a fresh sample whose distribution does not depend on aux; as a result, $\Pr(\mathcal{A}(Z)\in S\mid\mathrm{aux}) = \Pr(\mathcal{A}(Z)\in S)$. The claimed fact then directly follows from the definition. ∎

Note that the log-odds ratio measures how well one can tell one distribution from the other based on the side information and the event that the released result falls in $S$. When the log-odds ratio is close to $0$, the outcome is (almost) equally likely to be drawn from either distribution.

Fact 6 (Convenient properties directly inherited from DP).

For each pair $(Z, z)$ separately, we have:

  1. Simple composition: Let $\mathcal{A}_1$ and $\mathcal{A}_2$ be two randomized algorithms satisfying $(\epsilon_1,\delta_1)$-pDP and $(\epsilon_2,\delta_2)$-pDP respectively; then the joint release $(\mathcal{A}_1(Z), \mathcal{A}_2(Z))$ is $(\epsilon_1+\epsilon_2,\,\delta_1+\delta_2)$-pDP.

  2. Advanced composition: Let $\mathcal{A}_1,\dots,\mathcal{A}_k$ be a sequence of randomized algorithms, where $\mathcal{A}_i$ could depend on the realizations of $\mathcal{A}_1,\dots,\mathcal{A}_{i-1}$, each with $(\epsilon,\delta)$-pDP; then the composition jointly obeys $\big(\sqrt{2k\log(1/\delta')}\,\epsilon + k\epsilon(e^{\epsilon}-1),\; k\delta + \delta'\big)$-pDP for any $\delta' > 0$.

  3. Closedness to post-processing: If $\mathcal{A}$ satisfies $(\epsilon,\delta)$-pDP, then for any function $f$, $f\circ\mathcal{A}$ also obeys $(\epsilon,\delta)$-pDP.

  4. Group privacy: If $\mathcal{A}$ obeys $(\epsilon(\cdot,\cdot),\delta)$-pDP with $\epsilon$ parameterized by the pair $(Z, z)$, then for a group $z_{1:k} = \{z_1,\dots,z_k\}$, $\mathcal{A}$ obeys $(\epsilon',\delta')$-pDP with
$$\epsilon' = \sum_{i=1}^{k}\epsilon\big(Z\cup\{z_1,\dots,z_{i-1}\},\, z_i\big), \qquad \delta' \le k\,e^{\epsilon'}\delta,$$
    for $(Z, z_{1:k})$.

Proof.

These properties all directly follow from the proofs of the corresponding properties of differential privacy (see, e.g., [12]), as uniformity over data sets is never used in those proofs. The only property that is slightly different under the new definition is group privacy, since the size of the data set changes as the size of the privacy target (now a fixed group of people) gets larger. The claim follows from a simple calculation that repeatedly applies the definition of pDP along the chain of intermediate data sets $Z, Z\cup\{z_1\}, \dots, Z\cup\{z_1,\dots,z_{k-1}\}$. ∎
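Items 1 and 2 translate directly into a per-instance privacy "accountant" for a fixed pair $(Z, z)$. The sketch below is ours; the advanced-composition constants follow the standard theorem form stated above, and all names and numbers are illustrative.

```python
import math

def simple_composition(eps_deltas):
    """Item 1: epsilons and deltas add up for a fixed pair (Z, z)."""
    return sum(e for e, _ in eps_deltas), sum(d for _, d in eps_deltas)

def advanced_composition(eps, delta, k, delta_prime):
    """Item 2 (homogeneous case): k releases, each (eps, delta)-pDP for (Z, z)."""
    eps_total = math.sqrt(2 * k * math.log(1 / delta_prime)) * eps \
                + k * eps * (math.exp(eps) - 1)
    return eps_total, k * delta + delta_prime

print(simple_composition([(0.05, 1e-8)] * 200))       # ~ (10.0, 2e-06)
print(advanced_composition(0.05, 1e-8, 200, 1e-6))    # ~ (4.2, 3e-06): tighter for large k
```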

2.2 Moments of pDP, generalization and domain adaptation

One useful notion to consider in practice is exactly how much privacy is provided for those who participated in the data set. This is practically relevant because, if a cautious individual decides not to submit his/her data, he/she would necessarily do so by rejecting a data-usage agreement, and therefore the data collector is not legally obligated to protect this person and in fact does not have access to his/her data in the first place. After all, the only type of identification risk that could happen to this person is that the adversary becomes quite certain that he/she is not in the data set. For instance, in a study of graduate student income, a group of 200 students are polled and their average income is revealed with some small noise added to it. While an adversary may be almost certain, based on the outcome, that Bill Gates did not participate in the study, that is hardly any privacy risk to him. One advantage of pDP is that it offers a very natural way to analyze, and also to empirically estimate, any statistic of the pDP losses over a data set or over a distribution of data points, for a fixed randomized algorithm $\mathcal{A}$.

Definition 7 (Moment pDP for a distribution).

Let $(Z, z)$ be drawn from some distribution (not necessarily a product distribution) $\mathcal{P}$; this induces a distribution of the pDP loss $\epsilon(Z, z)$. Then we say that $\mathcal{A}$ obeys $k$th-moment per-instance DP over the distribution $\mathcal{P}$ with parameter vector $\big(\mathbb{E}_{\mathcal{P}}[\epsilon(Z,z)],\, \mathbb{E}_{\mathcal{P}}[\epsilon(Z,z)^2],\, \dots,\, \mathbb{E}_{\mathcal{P}}[\epsilon(Z,z)^k],\, \delta\big)$.

For example, one can treat the problem of estimating the privacy loss for a fixed data set $Z$ by choosing $\mathcal{P}$ to be the discrete uniform distribution supported on the pairs $\{(Z\setminus\{z_i\}, z_i)\}_{i=1}^{n}$, with probability $1/n$ each. Taking $k = 2$ allows us to calculate the mean and variance of the privacy loss over the data set.

Similarly, if the data set is drawn i.i.d. from some unknown distribution $\mathcal{D}$ — a central assumption in statistical learning theory — then we can take $\mathcal{P} = \mathcal{D}^n\times\mathcal{D}$. This allows us to use the moments of the pDP losses to capture, on average, how well data points drawn from $\mathcal{D}$ are protected. It turns out that this also controls the generalization error, and more generally cross-domain generalization.
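In practice, these moments can be estimated empirically. The following sketch (ours) computes the first two moments of the pDP loss over the empirical pair distribution described above, given any function that returns the per-instance $\epsilon$; the toy $\epsilon$ function shown is purely illustrative.

```python
import numpy as np

def pdp_moments(Z, eps_fn, k=2):
    """First k moments of the pDP loss over the empirical distribution that
    puts mass 1/n on each pair (Z without z_i, z_i)."""
    losses = np.array([eps_fn(np.delete(Z, i, axis=0), Z[i]) for i in range(len(Z))])
    return [np.mean(losses ** j) for j in range(1, k + 1)], losses

# Toy epsilon: proportional to the per-instance sensitivity of the mean.
rng = np.random.default_rng(2)
Z = rng.normal(size=(100, 3))
toy_eps = lambda Zm, z: np.linalg.norm(np.vstack([Zm, z]).mean(0) - Zm.mean(0))

moments, losses = pdp_moments(Z, toy_eps)
print("mean eps:", moments[0], " variance:", moments[1] - moments[0] ** 2)
```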

Definition 8 (On-average generalization).

Under the standard notation of statistical learning, the on-average generalization error of an algorithm $\mathcal{A}$ is defined as
$$\mathrm{Gen}(\mathcal{A},\mathcal{D}, n) \;:=\; \mathbb{E}_{Z\sim\mathcal{D}^n,\;\theta\sim\mathcal{A}(Z)}\Big[\mathbb{E}_{z\sim\mathcal{D}}\,\ell(\theta, z) - \frac{1}{n}\sum_{z_i\in Z}\ell(\theta, z_i)\Big].$$

Proposition 9 (Moment pDP implies generalization).

Assume the loss function $\ell$ is bounded. Then the on-average generalization error is smaller than

Note that this can also be used to capture the privacy and generalization of transfer learning (also known as domain adaptation) with a fixed data set or a fixed distribution. Let the training distribution be $\mathcal{D}$ and the target distribution be $\mathcal{D}'$, and take $\mathcal{P} = \mathcal{D}^n\times\mathcal{D}'$ (or its empirical counterpart for a fixed data set). In practice, this allows us to upper bound the generalization to, say, an Asian demographic group when the training data is drawn from a distribution that is dominated by white males (e.g., current DNA sequencing data sets). We formalize this idea as follows.

Definition 10 (Cross-domain generalization).

Assume the target distribution $\mathcal{D}'$ is absolutely continuous with respect to the base distribution $\mathcal{D}$. The on-average cross-domain generalization with base distribution $\mathcal{D}$ and target distribution $\mathcal{D}'$ is defined as
$$\mathrm{Gen}(\mathcal{A},\mathcal{D}\rightarrow\mathcal{D}', n) \;:=\; \mathbb{E}_{Z\sim\mathcal{D}^n,\;\theta\sim\mathcal{A}(Z)}\Big[\mathbb{E}_{z\sim\mathcal{D}'}\,\ell(\theta, z) - \frac{1}{n}\sum_{z_i\in Z} w(z_i)\,\ell(\theta, z_i)\Big],$$
where $w(z) := \frac{\mathrm{d}\mathcal{D}'}{\mathrm{d}\mathcal{D}}(z)$ is the inverse propensity (or importance weight) that accounts for the differences between the two domains.
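The importance-weighting idea in Definition 10 is easy to illustrate numerically. The sketch below is ours, with a toy pair of Gaussian domains and a toy loss; it shows that reweighting base-domain samples by $w(x)$ recovers the target-domain risk.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def cross_domain_risk(losses, xs, p_base, p_target):
    """Importance-weighted estimate of the target-domain risk from base-domain
    samples xs: reweight each loss by w(x) = p_target(x) / p_base(x)."""
    w = p_target(xs) / p_base(xs)
    return np.mean(w * losses)

# Toy example: base D = N(0,1), target D' = N(1,1); loss(x) = x^2 (squared error
# of always predicting 0).  The weighted estimate should approach E_{D'}[x^2] = 2.
rng = np.random.default_rng(3)
xs = rng.normal(0.0, 1.0, size=200_000)
est = cross_domain_risk(xs ** 2, xs,
                        p_base=lambda x: normal_pdf(x, 0.0, 1.0),
                        p_target=lambda x: normal_pdf(x, 1.0, 1.0))
print(est)   # ~ 2.0: importance weighting corrects for the domain shift
```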

Proposition 11.

The cross-domain on-average generalization can be bounded as follows:

The expressions in Propositions 9 and 11 are a little complex; we simplify them below to make them more readable.

Corollary 12.

Let , and and for simplicity, we write and . Then the cross domain on-average generalization is smaller than

2.3 Related notions

We now compare the proposed privacy definition with existing ones in the literature. Most attempts to weaken differential privacy aim at a more careful accounting of the privacy loss by treating it as a random variable. This produces a nice connection of $(\epsilon,\delta)$-DP to concentration inequalities and, in particular, yields advanced composition of the privacy loss via martingale concentration. More recently, the idea has been extended to weaker notions of privacy such as concentrated-DP [13, 4] and Rényi-DP [24] that allow a more fine-grained understanding of Gaussian mechanisms. Our work is complementary to this line of work, because we consider the adaptivity of $\epsilon$ to a fixed pair of data set and privacy target, and in some cases $\epsilon$ being a random variable jointly parameterized by $Z$ and $z$. We summarize the differences between these definitions in the following table.

Definition | Data set | Private target | Probability metric | Parametrized by
Pure-DP [11] | worst case | worst case | max divergence | $\mathcal{A}$ only
Approx-DP [9] | worst case | worst case | approximate max divergence | $\mathcal{A}$ only
(z/m)-CDP [13, 4] | worst case | worst case | concentrated (sub-Gaussian) divergence | $\mathcal{A}$ only
Rényi-DP [24] | worst case | worst case | Rényi divergence | $\mathcal{A}$ only
Personal-DP [15, 20] | worst case | fixed $z$ | max divergence | $\mathcal{A}$ and $z$
TV-privacy [2] | worst case | worst case | total variation | $\mathcal{A}$ only
KL-privacy [2] | worst case | worst case | KL divergence | $\mathcal{A}$ only
On-Avg KL-privacy [32] | distribution $\mathcal{D}$ | distribution $\mathcal{D}$ | KL divergence | $\mathcal{A}$ and $\mathcal{D}$
Per-instance DP | fixed $Z$ | fixed $z$ | approximate max divergence | $\mathcal{A}$, $Z$ and $z$
Table 1: Comparing variants of differential privacy.

It is clear from the table that, if we ignore the differences in the probability metric used, per-instance DP is arguably the most general and adaptive, since it depends on the specific pair $(Z, z)$.

The closest existing definition to ours is perhaps personalized-DP, first seen in Ebadi et al. [15] and Liu et al. [20]. It also tries to capture a personalized level of privacy. The difference is that personalized-DP requires the privacy guarantee for the private target to hold uniformly over all data sets.

On-Avg KL-privacy is also an adaptive quantity: it measures the average privacy loss when the individuals in the data set and the private target are drawn from the same distribution $\mathcal{D}$. pDP, on the other hand, measures the approximate worst-case privacy for a fixed pair $(Z, z)$ that is not necessarily random. Not surprisingly, On-Avg KL-privacy and expected per-instance DP are intricately related to each other, as the following remark suggests.

Remark 13.

Let $\mathcal{A}(Z) = \hat\theta(Z) + \mathcal{N}(0, \gamma^2 I)$, namely Gaussian noise adding. Then On-Avg KL-privacy equals
$$\mathbb{E}_{(Z,z)}\Big[\mathrm{KL}\big(\mathcal{A}(Z)\,\big\|\,\mathcal{A}(Z\cup\{z\})\big)\Big] \;=\; \frac{\mathbb{E}_{(Z,z)}\big[\|\hat\theta(Z)-\hat\theta(Z\cup\{z\})\|^2\big]}{2\gamma^2}.$$

The second moment of per-instance DP is, to leading order,
$$\mathbb{E}_{(Z,z)}\big[\epsilon(Z,z)^2\big] \;\approx\; \frac{2\log(2/\delta)\;\mathbb{E}_{(Z,z)}\big[\|\hat\theta(Z)-\hat\theta(Z\cup\{z\})\|^2\big]}{\gamma^2}.$$

The two notions of privacy are therefore almost equivalent: they differ only by a logarithmic factor and by a minor change in the way the perturbation is defined. In general, for the Gaussian mechanism, $\epsilon$-KL-privacy implies $\big(\epsilon + 2\sqrt{\epsilon\log(2/\delta)},\,\delta\big)$-DP for any $\delta > 0$.
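As a quick sanity check (ours) of the Gaussian-mechanism relationship behind Remark 13: the privacy-loss random variable of two Gaussians with a shared covariance has mean exactly equal to their KL divergence $\|\Delta\|^2/(2\gamma^2)$. The Monte Carlo sketch below verifies this with arbitrary illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, diff = 0.5, np.array([0.3, -0.2])   # noise scale and theta(Z) - theta(Z u {z})

# Privacy-loss random variable of N(0, gamma^2 I) vs N(diff, gamma^2 I),
# evaluated on samples from the first distribution.
x = rng.normal(0.0, gamma, size=(1_000_000, 2))
loss = (np.sum((x - diff) ** 2, axis=1) - np.sum(x ** 2, axis=1)) / (2 * gamma ** 2)

kl_closed_form = np.dot(diff, diff) / (2 * gamma ** 2)
print(loss.mean(), kl_closed_form)         # the two should agree (~0.26)
```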

3 Per-instance sensitivity in smooth learning problems

In this section, we present our main results and give concrete examples in which the per-instance sensitivity (hence per-instance privacy) can be analytically calculated. Specifically, we consider the following regularized empirical risk minimization form:
$$\hat\theta(Z) \;\in\; \operatorname*{argmin}_{\theta}\; F_Z(\theta) := \sum_{z_i\in Z}\ell(\theta, z_i) + r(\theta), \qquad (1)$$

or, in the non-convex case, finding a local minimum (a stationary point) of (1). Here $\ell$ is the loss function and $r$ is the regularization term. We make the following assumptions:

  1. $\ell(\cdot, z)$ and $r(\cdot)$ are differentiable in the argument $\theta$.

  2. The partial derivatives with respect to $\theta$ are absolutely continuous, i.e., $\ell(\cdot,z)$ and $r$ are twice differentiable almost everywhere and the second-order partial derivatives are Lebesgue integrable.

Our results under these assumptions cover learning problems such as linear and kernel machines, as well as some neural network formulations (e.g., multilayer perceptrons and convolutional nets with sigmoid/tanh activations), but not non-smooth problems like the lasso, SVMs with the hinge loss, or neural networks with ReLU activations. We also note that these conditions are implied by the standard assumption of strong smoothness (Lipschitz gradient) and do not require the function to be twice differentiable everywhere. For instance, the results cover the case when either $\ell$ or $r$ is a Huber function, which is not twice differentiable.

Technically, these assumptions allow us to take a Taylor expansion with an integral form of the remainder, which in turn allows us to prove the following stability bound.

Lemma 14.

Assume $\ell$ and $r$ satisfy Assumptions A.1 and A.2. Let $\hat\theta(Z)$ be a stationary point of $F_Z$, let $\hat\theta(Z\cup\{z\})$ be a stationary point of $F_{Z\cup\{z\}}$, and in addition let $\theta_t := t\,\hat\theta(Z\cup\{z\}) + (1-t)\,\hat\theta(Z)$ denote the interpolation of $\hat\theta(Z\cup\{z\})$ and $\hat\theta(Z)$. Then the following identity holds:
$$\hat\theta(Z\cup\{z\}) - \hat\theta(Z) \;=\; -\Big[\int_0^1 \nabla^2 F_{Z\cup\{z\}}(\theta_t)\,\mathrm{d}t\Big]^{-1}\nabla_\theta\,\ell\big(\hat\theta(Z), z\big) \;=\; -\Big[\int_0^1 \nabla^2 F_{Z}(\theta_t)\,\mathrm{d}t\Big]^{-1}\nabla_\theta\,\ell\big(\hat\theta(Z\cup\{z\}), z\big).$$

The proof uses the first-order stationarity condition of the optimal solutions and applies Taylor's theorem to the gradient. The lemma is very interpretable: it says that the perturbation of adding or removing a data point can be viewed as a one-step quasi-Newton update to the parameter. Also note that $\nabla_\theta\ell(\hat\theta, z)$ is the "score function" in parametric statistical models and is called a "pseudo-residual" in gradient boosting [see e.g., 18, Chapter 10].
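For ridge regression the Hessian does not depend on $\theta$, so the averaged Hessian in Lemma 14 is simply $X^\top X + xx^\top + \lambda I$ and the identity can be checked numerically. The sketch below is ours; the data and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 50, 3, 1.0
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
x_new, y_new = rng.normal(size=d), rng.normal()

def ridge(X, y):
    # minimizes 0.5*||y - X theta||^2 + 0.5*lam*||theta||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

theta = ridge(X, y)                                               # fit on Z
theta_plus = ridge(np.vstack([X, x_new]), np.append(y, y_new))    # fit on Z u {z}

# Lemma 14 with l(theta, z) = 0.5*(y - x^T theta)^2:
#   gradient at theta(Z) is -(y - x^T theta) x,
#   Hessian of F_{Z u {z}} is X^T X + x x^T + lam I (constant in theta).
H_plus = X.T @ X + np.outer(x_new, x_new) + lam * np.eye(d)
grad = -(y_new - x_new @ theta) * x_new
print(np.allclose(theta_plus - theta, -np.linalg.solve(H_plus, grad)))   # True
```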

The result implies that the per-instance sensitivity in $\|\cdot\|_{A^{-1}}$, for some positive definite matrix $A$, can be stated in terms of a norm of the "score function" specified by a quadratic form, and therefore, by Lemma 4, the output perturbation algorithm

$$\mathcal{A}(Z) = \hat\theta(Z) + \mathcal{N}\big(0, \gamma^2 A\big) \qquad (2)$$

obeys $(\epsilon(Z,z),\delta)$-pDP for any $(Z,z)$ and $\delta > 0$ with

$$\epsilon(Z,z) = \frac{\Delta_{A^{-1}}(Z,z)^2}{2\gamma^2} + \frac{\Delta_{A^{-1}}(Z,z)\sqrt{2\log(2/\delta)}}{\gamma}, \quad\text{where } \Delta_{A^{-1}}(Z,z) = \Big\|\Big[\int_0^1\nabla^2 F_{Z\cup\{z\}}(\theta_t)\,\mathrm{d}t\Big]^{-1}\nabla_\theta\,\ell\big(\hat\theta(Z),z\big)\Big\|_{A^{-1}}. \qquad (3)$$

This is interesting because for most loss functions the "score function" is proportional to the prediction error of the fitted model on the data point $z$, and the result suggests that the more accurately a model predicts a data point, the more private that data point is. The connection becomes explicit when we specialize to linear regression, where $z = (x, y)$, $\ell(\theta, z) = \frac{1}{2}(y - x^\top\theta)^2$, $X$ denotes the design matrix of $Z$, and the per-instance sensitivity

$$\big\|\hat\theta(Z) - \hat\theta(Z\cup\{z\})\big\|_{A^{-1}} \;=\; \big|y - x^\top\hat\theta(Z)\big|\cdot\big\|(X^\top X + xx^\top)^{-1}x\big\|_{A^{-1}} \;=\; \big|y - x^\top\hat\theta(Z\cup\{z\})\big|\cdot\big\|(X^\top X)^{-1}x\big\|_{A^{-1}} \qquad (4)$$

is clearly proportional to the prediction error. In addition, when we choose $A^{-1} = X^\top X + xx^\top$ (respectively $A^{-1} = X^\top X$), the second factor in the corresponding expression becomes $\sqrt{x^\top(X^\top X + xx^\top)^{-1}x}$ (respectively $\sqrt{x^\top(X^\top X)^{-1}x}$), which are the "in-sample" and "out-of-sample" statistical leverage scores of $x$. The leverage score measures the importance/uniqueness of a data point relative to the rest of the data set, and it is used extensively in regression analysis [5] (for outlier detection and experiment design), in compressed sensing (for adaptive sampling) [29], and in numerical linear algebra (for fast matrix computation) [8]. To the best of our knowledge, this is the first time leverage scores have been connected to differential privacy.
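To make this connection concrete, the sketch below (ours; synthetic data, arbitrary noise scale $\gamma$ and $\delta$) computes each point's leverage score and leave-one-out residual, verifies the per-instance sensitivity formula against brute-force refitting, and converts it to a per-instance $\epsilon$ using the Gaussian-mechanism bound of Lemma 4.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

H_inv = np.linalg.inv(X.T @ X)
theta = H_inv @ X.T @ y
leverage = np.einsum('ij,jk,ik->i', X, H_inv, X)   # h_i = x_i^T (X^T X)^{-1} x_i
loo_resid = (y - X @ theta) / (1 - leverage)        # leave-one-out prediction errors

# Per-instance sensitivity in the ||.||_{X^T X} norm (the in-sample choice of A):
# |LOO error_i| * sqrt(leverage_i).
sens = np.abs(loo_resid) * np.sqrt(leverage)

# Verify against brute-force refitting without point i, for a few points.
for i in range(3):
    Xi, yi = np.delete(X, i, 0), np.delete(y, i)
    diff = theta - np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
    assert np.isclose(np.sqrt(diff @ (X.T @ X) @ diff), sens[i])

gamma, delta = 0.5, 1e-6                            # noise scale for Lemma 4
eps = sens**2 / (2 * gamma**2) + sens * np.sqrt(2 * np.log(2 / delta)) / gamma
print("median / worst per-instance eps:", np.median(eps), eps.max())
```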

Figure 1 (plots omitted): Left: the worst-case $\epsilon$ of DP and the distribution of per-instance $\epsilon$ over data points in linear regression with isotropic Gaussian noise adding. Right: comparison of the pDP privacy loss to the $\epsilon$-DP obtained through the exponential mechanism [31] using the same posterior sampling algorithm.

We conclude the section with two simulated experiments (shown in the two panels of Figure 1). In the first experiment, we consider the algorithm that adds isotropic Gaussian noise to the linear regression coefficients, and we compare the worst-case DP with the distribution of per-instance DP over points in the data set (illustrated as box plots). In the second experiment, we compare different notions of privacy against the utility (measured as excess risk) of the fixed algorithm that samples from a scaled posterior distribution. In both cases, the average per-instance differential privacy over the data set is several orders of magnitude smaller than the worst-case differential privacy.

4 Conclusion

In this paper, we proposed per-instance differential privacy (pDP) for quantifying the fine-grained privacy loss of a fixed individual against randomized data analysis conducted on a fixed data set. We analyzed its properties and showed that pDP is proportional to well-studied quantities, e.g., leverage scores, residuals and pseudo-residuals, in statistics and statistical learning theory. This formalizes the intuitive idea that the more one can "blend into the crowd" like a chameleon, the more privacy one gets; and that the better a model fits the data, the easier it is to learn the model differentially privately. Moreover, the new notion allows us to conduct statistical learning and inference while taking advantage of desirable structures of the data set, gaining orders-of-magnitude more favorable privacy guarantees than in the worst case. This makes it highly practical in applications.

Specifically, we conducted a detailed case study on linear regression to illustrate how pDP can be used. The pDP analysis allows us to identify and account for key properties of the data set, such as the well-conditionedness of the feature matrix and the magnitude of the fitted coefficient vector, thereby providing strong uniform differential privacy coverage to everyone in the population whenever such structures exist. As a byproduct, the analysis also leads to an improved differential privacy guarantee for the OPS algorithm [7, 31] and to a new algorithm called AdaOPS that adaptively chooses the regularization parameters and improves the guarantee further. In particular, AdaOPS achieves asymptotic statistical efficiency and differential privacy at the same time, with stronger parameters than previously known.

The introduction of pDP also raises many open questions for future research. First of all, how do we tell individuals what their $\epsilon$'s and $\delta$'s of pDP are? This is tricky because the pDP loss itself is a function of the data and thus needs to be privatized against possibly malicious dummy users. Secondly, the problem gets substantially more interesting when we start to consider the economics of private data collection. For instance, what happens if what we tell the individuals affects their decision on whether to participate in the data set? In fact, it is unclear how to provide an estimate of pDP in the first place if we are not sure what the data will be at the end of the day. Thirdly, from the data collector's point of view, the data is going to be "easier" and the model will have a better "goodness-of-fit" on the collected data, but that will be falsely so to some extent, due to the selection bias incurred during data collection according to pDP. How do we correct for such bias and estimate the real performance of a model on the population of interest? Addressing these problems thoroughly will require the joint effort of the community, and we hope the exposition in this paper will encourage researchers to play with pDP in both theory and practical applications.

Acknowledgments

The author thanks Jing Lei, Adam Smith, Jennifer Chayes and Christian Borgs for useful and inspiring discussions that motivate the work.

Appendix A Proofs of technical results

Proof of Proposition 9.

We first show that the pDP condition implies on-average stability, and then that on-average stability implies on-average generalization.

Let , and fix . We first prove stability. Let

Note that the bound is independent to .

Now we show that stability implies generalization using a "ghost sample" trick, in which we resample a ghost data set $\tilde{Z}\sim\mathcal{D}^n$ and construct $Z^{(i)}$ by replacing the $i$th data point of $Z$ with the $i$th data point of $\tilde{Z}$.

The last step simply substitutes the stability bound. Taking expectations on both sides, we get a generalization upper bound of the form:

Proof of Proposition 11.

The stability argument remains the same, because it is applied to a fixed pair $(Z, z)$. We modify the ghost-sample argument with an additional change of measure.

Proof of Corollary 12.

The inequality uses Jensen's inequality and the monotonicity of the moment generating function for non-negative random variables. The statement is obtained by a Taylor series expansion. Lastly, we use the arithmetic mean to upper bound the geometric mean in the first term and then apply a Taylor expansion. ∎

Proof of Lemma 14.

By the stationarity of $\hat\theta(Z\cup\{z\})$, we have $\nabla F_{Z\cup\{z\}}\big(\hat\theta(Z\cup\{z\})\big) = 0$.

Adding and subtracting $\nabla F_{Z\cup\{z\}}\big(\hat\theta(Z)\big)$ and applying the first-order Taylor's theorem (with integral remainder) centered at $\hat\theta(Z)$ to the gradient, we get
$$0 = \nabla F_{Z\cup\{z\}}\big(\hat\theta(Z)\big) + R\,\big(\hat\theta(Z\cup\{z\}) - \hat\theta(Z)\big),$$

where, if we define $\theta_t := t\,\hat\theta(Z\cup\{z\}) + (1-t)\,\hat\theta(Z)$, the remainder term can be explicitly written as $R = \int_0^1 \nabla^2 F_{Z\cup\{z\}}(\theta_t)\,\mathrm{d}t$.

By the mean value theorem for Fréchet differentiable functions, there is a $t^*\in[0,1]$ such that we can take $\tilde\theta = \theta_{t^*}$ with the integrand $\nabla^2 F_{Z\cup\{z\}}(\tilde\theta)$ equal to the integral $R$.

Since $\hat\theta(Z)$ is a stationary point of $F_Z$, we have $\nabla F_{Z\cup\{z\}}\big(\hat\theta(Z)\big) = \nabla F_Z\big(\hat\theta(Z)\big) + \nabla_\theta\ell\big(\hat\theta(Z), z\big) = \nabla_\theta\ell\big(\hat\theta(Z), z\big)$,

and thus, under the assumption that $R$ is invertible, we have $\hat\theta(Z\cup\{z\}) - \hat\theta(Z) = -R^{-1}\nabla_\theta\ell\big(\hat\theta(Z), z\big)$.

The other equality follows by symmetry. ∎

Proof of the OPS theorem.

Let , . Denote , , and . Correspondingly, the posterior mean and .

The covariance matrices of the two distributions follow accordingly. Using the fact that the normalization constant of a Gaussian is known in closed form, the log-likelihood ratio at the output is

Note that . By Lemma 18,

so

The second term in the above equation can be expanded into

(5)

where we denote the "hat" matrices of the two data sets accordingly. By the Sherman–Morrison–Woodbury formula, we can write

Note that and , therefore

Substituting into (5), we get

And the -probability ratio is