Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain

Yu-Xiang Wang
Amazon Web Services
California, CA 94303
yuxiangw@amazon.com
Abstract

We revisit the problem of linear regression under a differential privacy constraint. By consolidating existing pieces in the literature, we clarify the correct dependence of the feature, label and coefficient domain in the optimization error and estimation error, hence revealing the delicate price of differential privacy in statistical estimation and statistical learning. Moreover, we propose simple modifications of two existing DP algorithms: (a) posterior sampling, (b) sufficient statistics perturbation, and show that they can be upgraded into adaptive algorithms that are able to exploit data-dependent quantities and behave nearly optimally for every instance. Extensive experiments are conducted on both simulated data and real data, which conclude that both AdaOPS and AdaSSP outperform the existing techniques on nearly all 36 data sets that we test on.

1 Introduction

Linear regression is one of the oldest tools for data analysis (Galton, 1886) and it remains one of the most commonly used today (Draper & Smith, 2014), especially in social sciences (Agresti & Finlay, 1997), econometrics (Greene, 2003) and medical research (Armitage et al., 2008). Moreover, many nonlinear models are either intrinsically linear in certain function spaces, e.g., kernel methods and dynamical systems, or can be reduced to solving a sequence of linear regressions, e.g., iteratively reweighted least squares for generalized linear models, gradient boosting for additive models and so on (see Friedman et al., 2001, for a detailed review).

In order to apply linear regression to sensitive data such as those in social sciences and medical studies, we often need to do so in a way that protects the privacy of individuals in the data set. Differential privacy (Dwork et al., 2006b) is a commonly accepted criterion that provides provable protection against identification and is resilient to arbitrary auxiliary information that might be available to attackers. In this paper, we focus on linear regression with $(\epsilon,\delta)$-differential privacy (Dwork et al., 2006a).

Isn’t it a solved problem?

It might be a bit surprising that this is still a problem, since several general frameworks of differential privacy have been proposed that cover linear regression. Specifically, in the agnostic setting (without a data model), linear regression is a special case of differentially private empirical risk minimization (ERM), and its theoretical properties are quite well understood in the sense that the minimax lower bounds are known (Bassily et al., 2014) and a number of algorithms (Chaudhuri et al., 2011; Kifer et al., 2012) have been shown to match the lower bounds under various assumptions. In the statistical estimation setting where we assume the data is generated from a linear Gaussian model, linear regression is covered by the sufficient statistics perturbation approach for exponential family models (Dwork & Smith, 2010; Foulds et al., 2016), the propose-test-release framework (Dwork & Lei, 2009) as well as the subsample-and-aggregate framework (Smith, 2008), with all three approaches achieving asymptotic efficiency in the fixed dimension ($d = O(1)$), large sample ($n \to \infty$) regime.

Despite these theoretical advances, very few empirical evaluations of these algorithms were conducted and we are not aware of a commonly-accepted best practice. Practitioners are often left puzzled about which algorithm to use for the specific data set they have. The nature of differential privacy often requires them to set parameters of the algorithm (e.g., how much noise to add) according to the diameter of the parameter domain, as well as properties of a hypothetical worst-case data set, which often leads to an inefficient use of their valuable data.

The main contribution of this paper is threefold:

  1. We consolidated many bits and pieces from the literature and clarified the price of differential privacy in statistical estimation and statistical learning.

  2. We carefully analyzed One Posterior Sample (OPS) and Sufficient Statistics Perturbation (SSP) and proposed simple modifications of them into adaptive versions: AdaOPS and AdaSSP. Both work nearly optimally for every problem instance without any hyperparameter tuning.

  3. We conducted extensive real data experiments to benchmark existing techniques and concluded that the proposed techniques give rise to a more favorable privacy-utility tradeoff relative to existing methods.

Outline of this paper.

In Section 2 we will describe the problem setup and explain differential privacy. In Section 3, we will survey the literature and discuss existing algorithms. Then we will propose and analyze our new methods AdaSSP and AdaOPS in Section 4 and conclude the paper with experiments in Section 5.

2 Notations and problem setup

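Throughout the paper we will use $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$ to denote the design matrix and response vector. These are collections of data points $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$. We use $\|\cdot\|$ to denote the Euclidean norm for vector inputs and the $\ell_2$-operator norm for matrix inputs. In addition, for set inputs, $\|\cdot\|$ denotes the radius of the smallest Euclidean ball that contains the set. For example, $\|\mathcal{Y}\| = \sup_{y \in \mathcal{Y}} |y|$ and $\|\mathcal{X}\| = \sup_{x \in \mathcal{X}} \|x\|$. Let $\Theta$ be the domain of coefficients. Our results do not require $\Theta$ to be compact, but existing approaches often depend on $\|\Theta\|$. $\gtrsim$ and $\lesssim$ denote greater than or smaller than, up to a universal multiplicative constant, which is the same as the big $\Omega(\cdot)$ and the big $O(\cdot)$. $\tilde{O}(\cdot)$ hides at most a logarithmic term. $\succeq$ and $\preceq$ denote the standard semidefinite ordering of positive semi-definite (psd) matrices.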

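We now define a few data-dependent quantities. We use $\lambda_{\min}(X^TX)$ (abbv. $\lambda_{\min}$) to denote the smallest eigenvalue of $X^TX$, and to make the implicit dependence on $n$ and $\|\mathcal{X}\|$ clear, we define $\alpha := \lambda_{\min} \frac{d}{n\|\mathcal{X}\|^2}$. One can think of $\alpha$ as a normalized smallest eigenvalue of $X^TX$ such that $0 \le \alpha \le 1$. Also, $1/\alpha$ is closely related to the condition number of $X^TX$.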

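The least squares solution $\theta^* = (X^TX)^{\dagger}X^Ty$ is the optimal solution to $\min_{\theta} F(\theta) := \frac{1}{2}\|y - X\theta\|^2$. Similarly, we use $\theta^*_\lambda$ to denote the optimal solution to the ridge regression objective $F_\lambda(\theta) = F(\theta) + \lambda\|\theta\|^2$.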

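In addition, we denote the global Lipschitz constant of $F$ as $L^* := \|\mathcal{X}\|^2\|\Theta\| + \|\mathcal{X}\|\|\mathcal{Y}\|$ and the data-dependent local Lipschitz constant at $\theta^*$ as $L := \|\mathcal{X}\|^2\|\theta^*\| + \|\mathcal{X}\|\|\mathcal{Y}\|$. Note that when $\Theta = \mathbb{R}^d$, $L^* = \infty$, but $L$ will remain finite for every given data set.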
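As a rough illustration of the quantities defined in this section, the NumPy sketch below computes the smallest eigenvalue, a normalized version of it, the least squares and ridge solutions, and a local Lipschitz constant. The function name, the argument names `x_bound` and `y_bound` (standing in for the radii of the feature and label domains), the normalization of the eigenvalue by $d/(n\|\mathcal{X}\|^2)$ and the ridge penalty convention are assumptions made for this example.

```python
import numpy as np


def data_dependent_quantities(X, y, x_bound, y_bound, lam=1.0):
    """Illustrative helper (hypothetical name) for the quantities of Section 2.

    x_bound and y_bound play the role of the radii of the feature and label
    domains; they are user-supplied inputs, not estimated from the data.
    """
    n, d = X.shape
    gram = X.T @ X
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    lam_min = float(np.linalg.eigvalsh(gram)[0])
    # Assumed normalization; alpha lies in [0, 1] when every row norm <= x_bound.
    alpha = lam_min * d / (n * x_bound ** 2)
    # Least squares solution via the pseudo-inverse (works for singular gram).
    theta_star = np.linalg.pinv(gram) @ (X.T @ y)
    # Ridge solution for 0.5*||y - X theta||^2 + 0.5*lam*||theta||^2;
    # other penalty conventions only rescale lam.
    theta_ridge = np.linalg.solve(gram + lam * np.eye(d), X.T @ y)
    # Local Lipschitz constant evaluated at theta_star.
    lipschitz_local = x_bound ** 2 * np.linalg.norm(theta_star) + x_bound * y_bound
    return lam_min, alpha, theta_star, theta_ridge, lipschitz_local
```

Unlike a global Lipschitz constant over an unbounded coefficient domain, the last quantity is finite for any finite data set because it only depends on the norm of the least squares solution.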

Metric of success.

We measure the performance of an estimator in two ways.

First, we consider the optimization error $F(\hat\theta) - F(\theta^*)$ in expectation or with probability $1 - \varrho$. This is related to the prediction accuracy in the distribution-free statistical learning setting.

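Second, we consider how well the coefficients can be estimated under the linear Gaussian model

$$y = X\theta_0 + \mathcal{N}(0, \sigma^2 I_n)$$

in terms of $\mathbb{E}[\|\hat\theta - \theta_0\|^2]$ or, in some cases, $\mathbb{E}[\|\hat\theta - \theta_0\|^2 \mid E]$ where $E$ is a high-probability event.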

The optimal error in either case will depend on the specific design matrix $X$, the optimal solution $\theta^*$, the data domain $\mathcal{X}$, $\mathcal{Y}$, the parameter domain $\Theta$ as well as $\theta_0$, $\sigma^2$ in the statistical estimation setting.

Differential privacy.

We will focus on estimators that are differentially private, as defined below.

Definition 1 (Differential privacy (Dwork et al., 2006b)).

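We say a randomized algorithm $\mathcal{A}$ satisfies $(\epsilon,\delta)$-DP if for every fixed data set $(X, y)$, every data set $(X', y')$ that can be constructed by adding or removing one row from $(X, y)$, and every measurable set $\mathcal{S}$ of outputs of the algorithm,

$$\mathbb{P}(\mathcal{A}(X, y) \in \mathcal{S}) \le e^{\epsilon}\, \mathbb{P}(\mathcal{A}(X', y') \in \mathcal{S}) + \delta.$$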

Parameter $\epsilon$ represents the amount of privacy loss from running the algorithm and $\delta$ denotes a small probability of failure. These are user-specified targets to achieve and the differential privacy guarantee is considered meaningful if and .
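To make the definition more concrete, the following sketch implements the classical Gaussian mechanism, which adds Gaussian noise scaled to the $\ell_2$-sensitivity of the released quantity. The calibration $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\epsilon$ is the textbook one, valid for $0 < \epsilon < 1$, and is not necessarily the calibration used elsewhere in this paper; the function and argument names are hypothetical.

```python
import math

import numpy as np


def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng):
    """Release value + N(0, sigma^2 I) using the classical calibration.

    l2_sensitivity must upper bound the change in `value` (in Euclidean norm)
    when one row is added to or removed from the data set.
    """
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("classical calibration assumes 0 < eps < 1 and 0 < delta < 1")
    sigma = l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    value = np.asarray(value, dtype=float)
    noisy = value + rng.normal(0.0, sigma, size=value.shape)
    return noisy, sigma
```

For instance, if every row satisfies $\|x\| \le B_x$ and $|y| \le B_y$ (hypothetical bounds), adding or removing one row changes $X^Ty$ by at most $B_x B_y$ in Euclidean norm, so `gaussian_mechanism(X.T @ y, B_x * B_y, eps, delta, rng)` would be one way to release it.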

The pursuit for adaptive estimators.

Another important design feature that we will mention repeatedly in this paper is adaptivity. We call an estimator adaptive if it behaves optimally simultaneously for a wide range of parameter choices. Being adaptive is of great practical relevance because we do not need to specify the class of problems or worry about whether our specification is wrong (see examples of adaptive estimators in, e.g., Donoho, 1995; Birgé & Massart, 2001). Adaptivity is particularly important for differentially private data analysis because we often need to decide the amount of noise to add based on the size of the domain. For example, an adaptive algorithm will not rely on conservative upper bounds of , or a worst case (which would be on any ), and it can take advantage of favorable properties when they exist in the data set. We want to design an estimator that does not take these parameters as inputs and behaves nearly optimally for every fixed data set under a variety of configurations of .

3 A survey of existing work

In this section, we summarize existing theoretical results in linear regression with and without differential privacy constraints. We will start with lower bounds.

3.1 Information-theoretic lower bounds

Lower bounds under linear Gaussian model.

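Under the statistical assumption of the linear Gaussian model $y = X\theta_0 + \mathcal{N}(0, \sigma^2 I_n)$, the minimax risks for both estimation and prediction are crisply characterized for each fixed design matrix $X$:

$$\min_{\hat\theta}\max_{\theta_0 \in \mathbb{R}^d} \mathbb{E}\left[\tfrac{1}{2}\|X(\hat\theta - \theta_0)\|^2 \,\middle|\, X\right] = \frac{d\sigma^2}{2}, \qquad (1)$$

and if we further assume that $n \ge d$ and $X^TX$ is invertible (for identifiability), then

$$\min_{\hat\theta}\max_{\theta_0 \in \mathbb{R}^d} \mathbb{E}\left[\|\hat\theta - \theta_0\|^2 \,\middle|\, X\right] = \sigma^2\, \mathrm{tr}[(X^TX)^{-1}]. \qquad (2)$$

In the above setup, $\hat\theta$ is any measurable function of $y$ (note that $X$ is fixed). These are classic results that can be found in standard statistical decision theory textbooks (see, e.g., Wasserman, 2013, Chapter 13).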

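Under the same assumptions, the Cramér-Rao lower bound mandates that the covariance matrix of any unbiased estimator $\hat\theta$ of $\theta_0$ obeys

$$\mathrm{Cov}(\hat\theta) \succeq \sigma^2 (X^TX)^{-1}. \qquad (3)$$

This bound applies to every problem instance separately and also implies a precise lower bound on the prediction variance, namely, $\mathrm{Var}(\hat\theta^T x) \ge \sigma^2 x^T (X^TX)^{-1} x$ for any $x$ and any unbiased $\hat\theta$.

Minimax risk (1), (2) and the Cramér-Rao lower bound (3) are simultaneously attained by $\theta^* = (X^TX)^{-1}X^Ty$.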
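For completeness, the short calculation below sketches why ordinary least squares attains these bounds. It assumes the linear Gaussian model $y = X\theta_0 + \xi$ with $\xi \sim \mathcal{N}(0, \sigma^2 I_n)$, an invertible $X^TX$, and the $\frac{1}{2}$-scaled prediction loss; the symbol $\hat\theta_{\mathrm{OLS}}$ is introduced only for this derivation.

```latex
\begin{align*}
\hat\theta_{\mathrm{OLS}} &= (X^TX)^{-1}X^Ty = \theta_0 + (X^TX)^{-1}X^T\xi, \\
\mathrm{Cov}(\hat\theta_{\mathrm{OLS}}) &= (X^TX)^{-1}X^T(\sigma^2 I_n)X(X^TX)^{-1} = \sigma^2 (X^TX)^{-1}, \\
\mathbb{E}\|\hat\theta_{\mathrm{OLS}} - \theta_0\|^2 &= \mathrm{tr}\,\mathrm{Cov}(\hat\theta_{\mathrm{OLS}}) = \sigma^2\,\mathrm{tr}[(X^TX)^{-1}], \\
\mathbb{E}\,\tfrac{1}{2}\|X(\hat\theta_{\mathrm{OLS}} - \theta_0)\|^2 &= \tfrac{\sigma^2}{2}\,\mathrm{tr}[X(X^TX)^{-1}X^T] = \tfrac{d\sigma^2}{2}.
\end{align*}
```

The last step uses the fact that $X(X^TX)^{-1}X^T$ is the orthogonal projection onto the $d$-dimensional column space of $X$, so its trace equals $d$.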

Statistical learning lower bounds.

Perhaps much less well-known, linear regression is also thoroughly studied in the distribution-free statistical learning setting, where the only assumption is that the data are drawn iid from some unknown distribution defined on some compact domain . Specifically, let the risk () be

Shamir (2015) showed that when , are Euclidean balls,

(4)

where is any measurable function of the data set to and the expectation is taken over the data generating distribution . Note that to be compatible with other bounds that appear in this paper, we multiplied the by a factor of . Informally, one can think of as in (1) so both terms depend on (or ), but the dependence on is new for the distribution-free setting.

Koren & Levy (2015) later showed that this lower bound is matched up to a constant by Ridge Regression with and both Koren & Levy (2015) and Shamir (2015) conjecture that ERM (a constrained version of OLS) without additional regularization should attain the lower bound (4). If the conjecture is true, then the unconstrained OLS is simultaneously optimal for all distributions supported on the smallest ball that contains all data points in for any being an ball with radius larger than .

Lower bounds with $(\epsilon,\delta)$-privacy constraints.

Suppose that we further require $\hat\theta$ to be $(\epsilon,\delta)$-differentially private. Then there is an additional price to pay in terms of how accurately we can approximate the ERM solution. Specifically, Bassily et al. (2014)’s lower bounds for the empirical excess risk for differentially private ERM imply that for and a sufficiently large :

  • There exists a triplet of , such that

    (5)
  • Assume that all data sets within obey that the inverse condition number . (Footnote 1: This requires for all data sets .) There exists a triplet of such that

    (6)

These bounds are attained by a number of algorithms, which we will go over in Section 3.2.

Compared to the non-private minimax rates on prediction accuracy, the bounds look different in several aspects. First, neither rate for prediction error in (1) or (4) depends on whether the design matrix is well-conditioned or not, while $\alpha$ appears explicitly in (6). Secondly, the dependence on are different, which makes it hard to tell whether the optimization error lower bound due to the privacy requirement is limiting.

To clarify the relationships, we plot Shamir’s lower bound (4) and the smaller of Bassily et al.’s differential privacy lower bounds (5) and (6) for all configurations of graphically in Figure 1. We also use multiple lines to illustrate the shifts in these lower bounds when parameters such as and change. In all figures is assumed to be and logarithmic terms are dropped. The price of differential privacy is highlighted as a shaded area in the figures. Interestingly, in the first case when is small (when ), a substantial price only occurs in the non-standard region where . Arguably this is acceptable because in that regime, people should use Ridge regression or Lasso anyway rather than OLS. In the case when is large (when ), the price is more substantial and it applies to all unless we can exploit the strong convexity in the data set. When we do, the cost only occurs for an interval in and eventually the cost of differential privacy becomes negligible relative to the minimax rate. To the best of our knowledge this is the first time the “price of differential privacy” for linear regression is discussed with a clear explanation of the dependency on all parameters of the problem.

[Figure panels: price_of_dp_case1a, price_of_dp_case1b, price_of_dp_case2a, price_of_dp_case2b]

Figure 1: Illustration of the lower bounds for non-private and private linear regression.
Figure 2: Illustration of the region of where DP can be obtained without losing the statistical learning minimax rate.

The above discussion also allows us to address the following question.

When is privacy for free in statistical learning?

Specifically, what is the smallest $\epsilon$ such that an $(\epsilon,\delta)$-DP algorithm matches the minimax rate in (4)? Not surprisingly, the answer depends on the relative scale of and and that of . When , (5) says that -DP algorithms can achieve the non-private minimax rate provided that for On the other hand, if (Footnote 2: This is arguably the more relevant setting. Note that if and is fixed, then .) and , then we need

The regions are illustrated graphically in Figure 2. In the first case, there is a large region upon , where meaningful differential privacy (with and ) can be achieved without incurring a significant toll relative to (4). In the second case, we need at least to achieve “privacy-for-free” in the most favorable case where . In the case when could be rank-deficient, it is infeasible to achieve “privacy for free” no matter how large is.

Based on the results in Figures 1 and 2, it might be tempting to conclude that one should always prefer Case 1 over Case 2. This is unfortunately not the case because the artificial restriction of the model class via a bounded also weakens our non-private baseline. In other words, the best solution within a small might be significantly worse than the best solution in .

In practice, it is hard to find a with a small radius that fits all purposes (Footnote 3: If then the constraint becomes limiting. If instead, then calibrating the noise according to will inject more noise than necessary.) and it is unreasonable to assume . This motivates us to go beyond the worst case and come up with adaptive algorithms that work without knowing and while achieving the minimax rate for the class with and (in hindsight).

3.2 Existing algorithms and our contribution

We now survey the following list of five popular algorithms in differentially private learning and highlight the novelty in our proposals. (Footnote 4: While we try to be as comprehensive as possible, the literature has grown massively and the choice of this list is limited by our knowledge and opinions.)

  1. Sufficient statistics perturbation (SSP) (Vu & Slavkovic, 2009; Foulds et al., 2016): Release $X^TX$ and $X^Ty$ differentially privately and then output $\hat\theta = (\widehat{X^TX})^{-1}\widehat{X^Ty}$.

  2. Objective perturbation (ObjPert) (Kifer et al., 2012): with an appropriate and sampled from an appropriately chosen iid Gaussian random vector.

  3. Subsample and Aggregate (Sub-Agg) (Smith, 2008; Dwork & Smith, 2010): Subsample many times, apply debiased MLE to each subset and then randomize the way we aggregate the results.

  4. Posterior sampling (OPS) (Dimitrakakis et al., 2014; Wang et al., 2015): Output with parameters .

  5. NoisySGD (Bassily et al., 2014): Run SGD for a fixed number of iterations with additional Gaussian noise added to the stochastic gradient evaluated on one randomly-chosen data point.
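Of the five, sufficient statistics perturbation is the simplest to write down. The sketch below is a minimal, non-adaptive illustration rather than the implementation evaluated in this paper: it assumes the privacy budget is split evenly between the two released statistics, uses the classical Gaussian-mechanism calibration (valid when $\epsilon/2 < 1$), and takes hypothetical bounds `x_bound` and `y_bound` on the row norms and labels as inputs.

```python
import numpy as np


def ssp(X, y, eps, delta, x_bound, y_bound, rng):
    """Sufficient statistics perturbation, illustrative version.

    Each of X^T X and X^T y is released with budget (eps/2, delta/2), so the
    output is (eps, delta)-DP by basic composition (assumed budget split).
    """
    d = X.shape[1]
    eps_half, delta_half = eps / 2.0, delta / 2.0
    # Noise multiplier per unit of L2 sensitivity (classical calibration).
    noise_mult = np.sqrt(2.0 * np.log(1.25 / delta_half)) / eps_half
    # Symmetric Gaussian matrix: draw the upper triangle and mirror it.
    Z = rng.normal(size=(d, d))
    Z = np.triu(Z) + np.triu(Z, 1).T
    # Adding or removing a row x changes X^T X by x x^T, whose Frobenius
    # norm is ||x||^2 <= x_bound^2.
    gram_priv = X.T @ X + noise_mult * x_bound ** 2 * Z
    # Adding or removing a row changes X^T y by x*y, with norm <= x_bound*y_bound.
    xty_priv = X.T @ y + noise_mult * x_bound * y_bound * rng.normal(size=d)
    # No regularization: the noisy Gram matrix may be ill-conditioned or even
    # indefinite when n is small, which is the failure mode of plain SSP.
    return np.linalg.solve(gram_priv, xty_priv)
```

With many well-spread data points the noise is negligible relative to $X^TX$ and the output is close to the least squares solution; with few data points nothing prevents the noisy Gram matrix from being nearly singular.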

Algorithm | Assumptions | Remarks
NoisySGD | A.1, A.2 | Theorem 2.4 (Part 1) of (Bassily et al., 2014).
NoisySGD | A.1, A.2, A.3 | Theorem 2.4 (Part 2) of (Bassily et al., 2014).
ObjPert | A.1, A.2 | Theorem 4 (Part 2) of (Kifer et al., 2012).
ObjPert | A.1, A.2, A.3 | Theorem 5 & Appendix E.2 of (Kifer et al., 2012).
OPS | A.1, A.2, A.3 | Results for -DP.
SSP | A.1 | Adaptive to , but requires .
AdaOPS & AdaSSP | A.1 | Adaptive in .

Table 1: Summary of optimization error bounds. This table compares the (expected or high probability ) additive suboptimality of different differentially private linear regression procedures relative to the (non-private) empirical risk minimizer . In particular, the results for NoisySGD hold in expectation and all other results hold with probability (hiding at most a logarithmic factor in ). Constant factors are dropped and is assumed to be .
Algorithm | Approx. MLE | Rel. efficiency | Remarks
Sub-Agg | | | -DP, suboptimal in , possibly also in (Dwork & Smith, 2010).
OPS | | | -DP, adaptive in , but not asymptotically efficient (Wang et al., 2015).
SSP | | | Adaptive in , no explicit dependence on , but requires large (Sheffet, 2017, Theorem 5.1).
AdaOPS & AdaSSP | | | Adaptive in .

Table 2: Summary of estimation error bounds under the linear Gaussian model. In the second column we compare the approximation of the MLE in mean square error up to a universal constant. In the third column, we compare the relative efficiency. The relative efficiency bounds are simplified with the assumption of , which implies that and . hides terms.

We omit detailed operational aspects of these algorithms and focus our discussion on their theoretical guarantees. Interested readers are encouraged to check out each paper separately. These algorithms are proven under different scalings and assumptions. To ensure fair comparison, we make sure that all results are converted to our setting under a subset of the following assumptions.

  A.1. $\|\mathcal{X}\|$ is bounded, $\|\mathcal{Y}\|$ is bounded.

  A.2. $\|\Theta\|$ is bounded.

  A.3. All possible data sets obey that the smallest eigenvalue is greater than .

Note that A.3 is a restriction on the domain of the data set, rather than the domain of individual data points in the data set of size . While it is a little unconventional, it is valid to define differential privacy within such a restricted space of data sets. It is the same assumption that we needed to assume for the lower bound in (6) to be meaningful.

Detailed comparisons of the algorithms are shown in Tables 1 and 2. As in Koren & Levy (2015), we simplify the expressions of the bound by assuming , and in addition, we assume that .

Specifically, Table 1 summarizes the upper bounds on the optimization error of the aforementioned algorithms relative to our two proposals: AdaOPS and AdaSSP. Comparing the rates to the lower bounds in the previous section, it is clear that NoisySGD and ObjPert both achieve the minimax rate in optimization error but their hyperparameter choice depends on the unknown and . SSP is adaptive to and but has a completely different type of issue: it can fail arbitrarily badly for the regime covered under (5), and even for well-conditioned problems, its theoretical guarantees only kick in as gets very large. Our proposed algorithms AdaOPS and AdaSSP are able to simultaneously switch between the two regimes and get the best of both worlds.

Table 2 summarizes the upper bounds for estimation. The second column compares the approximation of the MLE $\theta^*$ in MSE and the third column summarizes the statistical efficiency of the DP estimators relative to the MLE: under the linear Gaussian model. All algorithms except OPS are asymptotically efficient. For the interest of $(\epsilon,\delta)$-DP, SSP has the fastest convergence rate and does not explicitly depend on the smallest eigenvalue, but again it behaves differently when is small. AdaOPS and AdaSSP are able to work nearly optimally for all .

3.3 Other related work

The problem of adaptive estimation is closely related to model selection (see, e.g., Birgé & Massart, 2001) and an approach using Bayesian Information Criteria was carefully studied in the differentially private setting for the problem of constrained ridge regression by Lei et al. (2017). Their focus, however, is not on adaptive prediction and estimation, but rather on whether one can find the correct model that matches the selected model in its non-private counterpart. One important aspect that is missing is that the best model in the differentially private setting might not be the same as the best model in the non-private setting. This is especially true in cases when we take the distribution-free view of the problems.

Linear regression is also studied in many more specialized setups, e.g., high dimensional linear regression (Kifer et al., 2012; Talwar et al., 2014, 2015), statistical inference (Sheffet, 2017) and so on. For the interest of this paper, we focus on the standard regime of linear regression where and do not use sparsity or constraint set to achieve dependence. Lastly, the results in Sheffet (2017) are related: they cover the strong asymptotic utility guarantee of SSP under the linear Gaussian model, and their technique of adaptively adding regularization has inspired AdaSSP.

4 Adaptive private linear regression

In this section, we present and analyze AdaOPS and AdaSSP, which achieve the aforementioned adaptive rate. The pseudo-code of these two algorithms is given in Algorithm 1 and Algorithm 2.

The idea of both algorithms is to release data-dependent quantities differentially privately and then use a high-probability confidence interval of these quantities to calibrate the noise to the privacy budget as well as to choose the ridge regression’s hyperparameter $\lambda$ for achieving the smallest prediction error. Specifically, AdaOPS requires us to release both the smallest eigenvalue $\lambda_{\min}$ of $X^TX$ and the local Lipschitz constant $L$, while AdaSSP only needs the smallest eigenvalue $\lambda_{\min}$.

Input: Data $X$, $y$. Privacy budget: $\epsilon$, $\delta$. Bounds: $\|\mathcal{X}\|$, $\|\mathcal{Y}\|$.
  1. Calculate the minimum eigenvalue $\lambda_{\min}(X^TX)$.
  2. Sample and privately release
  3. Set as the positive solution of the quadratic equation
  4. Set , and solve
(7)
which has a unique solution.
  5. Calculate
  6. Sample and privately release . Set .
  7. Calibrate noise by choosing as the positive solution of the quadratic equation
(8)
and then set

Algorithm 1 AdaOPS: One-Posterior Sample estimator with adaptive regularization
Input: Data $X$, $y$. Privacy budget: $\epsilon$, $\delta$. Bounds: $\|\mathcal{X}\|$, $\|\mathcal{Y}\|$.
  1. Calculate the minimum eigenvalue $\lambda_{\min}(X^TX)$.
  2. Privately release , where .
  3. Set
  4. Privately release , where is a symmetric matrix and every element of its upper triangular part is sampled from .
  5. Privately release for .

Algorithm 2 AdaSSP: Sufficient statistics perturbation with adaptive damping
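The structure of Algorithm 2 can be sketched in a few lines of NumPy. The version below is only an approximation of the procedure under stated assumptions: the budget is split into three equal parts, the noise uses the classical Gaussian-mechanism calibration (valid when $\epsilon/3 < 1$), and the confidence shift for the eigenvalue and the choice of $\lambda$ use illustrative constants controlled by a hypothetical failure probability `rho`. The exact constants of the algorithm may differ.

```python
import numpy as np


def adassp(X, y, eps, delta, x_bound, y_bound, rng, rho=0.05):
    """Sketch of SSP with adaptive damping; constants are illustrative."""
    d = X.shape[1]
    gram, xty = X.T @ X, X.T @ y
    # Three releases, each with budget (eps/3, delta/3) (assumed split).
    eps3, delta3 = eps / 3.0, delta / 3.0
    noise_mult = np.sqrt(2.0 * np.log(1.25 / delta3)) / eps3

    # Steps 1-2: noisy smallest eigenvalue (sensitivity <= x_bound^2), shifted
    # down so that it under-estimates the truth with probability about 1 - rho.
    lam_min = np.linalg.eigvalsh(gram)[0]
    shift = noise_mult * x_bound ** 2 * np.sqrt(2.0 * np.log(1.0 / rho))
    lam_min_priv = max(lam_min + noise_mult * x_bound ** 2 * rng.normal() - shift, 0.0)

    # Step 3: add only as much ridge damping as is needed for the private
    # eigenvalue estimate to exceed the (assumed) noise level.
    noise_level = noise_mult * x_bound ** 2 * np.sqrt(d * np.log(2.0 * d ** 2 / rho))
    lam = max(0.0, noise_level - lam_min_priv)

    # Step 4: perturb X^T X with a symmetric Gaussian matrix.
    Z = rng.normal(size=(d, d))
    Z = np.triu(Z) + np.triu(Z, 1).T
    gram_priv = gram + noise_mult * x_bound ** 2 * Z
    # Step 5: perturb X^T y.
    xty_priv = xty + noise_mult * x_bound * y_bound * rng.normal(size=d)

    theta = np.linalg.solve(gram_priv + lam * np.eye(d), xty_priv)
    return theta, lam
```

When the data are plentiful and well-conditioned, the private eigenvalue estimate exceeds the noise level and no damping is added, so the sketch behaves like plain SSP; for small or ill-conditioned data sets it falls back to a ridge-regularized solution.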

In both AdaSSP and AdaOPS, we choose $\lambda$ by minimizing an upper bound of $F(\hat\theta) - F(\theta^*)$ in the form of “variance” and “bias”

Note that while cannot be privately released, it appears in both terms and does not enter the decision process of finding the optimal $\lambda$ that minimizes the bound. This convenient feature about is a consequence of our assumption that . Dealing with the general case involving an arbitrary is an intriguing open problem.
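The following short derivation illustrates why a quantity shared by both terms drops out of the choice of the regularization parameter. It assumes, purely for illustration, that the bound has the generic form of a "variance" term inversely proportional to $\lambda + \lambda_{\min}$ plus a "bias" term linear in $\lambda$; the constants $A, C > 0$ are placeholders and not the constants of the actual bound.

```latex
\begin{align*}
B(\lambda) &= \frac{A}{\lambda + \lambda_{\min}} + C\lambda, \qquad \lambda \ge 0, \\
B'(\lambda) &= -\frac{A}{(\lambda + \lambda_{\min})^2} + C = 0
  \;\Longleftrightarrow\; \lambda + \lambda_{\min} = \sqrt{A/C}, \\
\lambda^{\star} &= \max\left\{0,\; \sqrt{A/C} - \lambda_{\min}\right\}.
\end{align*}
```

Since $B$ is convex on $\lambda \ge 0$, the constrained minimizer is the stationary point clipped at zero. If $A$ and $C$ carry a common multiplicative factor, that factor cancels in $\sqrt{A/C}$, so $\lambda^{\star}$ can be computed without knowing it.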

A tricky situation for AdaOPS is that the choice of depends on through , which is the local Lipschitz constant at the ridge regression solution . But the choice of also depends since the “variance” term above is inversely proportional to . Our solution is to express (hence ) as a function of and solve the nonlinear univariate optimization problem (7).

We are now ready to state the main results.

Theorem 2.

Algorithm 1 outputs which obeys that

  1. It satisfies $(\epsilon,\delta)$-DP.

  2. Assume . With probability ,

  3. Under the linear Gaussian model with a fixed full-rank . Then conditioning on an event satisfying over only the randomness of the algorithm, we have

    and

    where constant

The proof, deferred to Appendix C, makes use of a fine-grained DP analysis through the recent per-instance DP techniques (Wang, 2017) and then converts the results to DP by releasing data-dependent bounds of and the magnitude of a ridge-regression output with an adaptively chosen . Note that does not have a bounded global sensitivity. The method to release it differentially privately (described in Lemma 12) is part of our technical contribution.

The AdaSSP algorithm is simpler and enjoys slightly stronger theoretical guarantees.

Theorem 3.

Algorithm 2 outputs which obeys that

  1. It satisfies $(\epsilon,\delta)$-DP.

  2. Assume . With probability ,

  3. Under the linear Gaussian model, assume sufficiently large . Then there is an event satisfying over only the randomness of the algorithm, we have

    and

    with the same constant in Theorem 2 (iii).

The proof of Statement (1) is straightforward. Note that we release the eigenvalue $\lambda_{\min}$, $X^Ty$ and $X^TX$ differentially privately, each with parameter $(\epsilon/3, \delta/3)$. For the first two, we use the Gaussian mechanism and for $X^TX$, we use the Analyze-Gauss algorithm (Dwork et al., 2014) with a symmetric Gaussian random matrix. The result then follows from the composition theorem of differential privacy. The proofs of the second and third statements are provided in Appendix B. The main technical challenge is to prove the concentration on the spectrum and the Johnson-Lindenstrauss-like distance-preserving properties for symmetric Gaussian random matrices (Lemma 6). We note that while SSP is an old algorithm, the analysis of its theoretical properties is new to this paper.

Remarks.

Both AdaOPS and AdaSSP match the smaller of the two lower bounds (5) and (6) for each problem instance. They are slightly different in that AdaOPS preserves the shape of the intrinsic geometry (which makes it easier to reuse the standard statistical inference tools of linear regression) of the covariance matrix while AdaSSP’s bounds are slightly stronger as they do not explicitly depend on the smallest eigenvalue.

5 Experiments

In this section, we conduct synthetic and real data experiments to benchmark the performance of AdaOPS and AdaSSP relative to the existing algorithms we discussed in Section 3. Kifer et al. (2012)’s version of objective perturbation worked best overall for small to medium and SSP worked best among baselines when is large or . NoisySGD and Sub-Agg are excluded because they are dominated by ObjPert.

Prediction accuracy in UCI data sets experiments.

We present the results for training linear regression on 36 UCI regression data sets. To process these data sets, standard $z$-scoring is performed and all data points are normalized to . Results for four of the data sets are presented in Figure 3. As we can see, SSP is unstable for small data. The algorithm ObjPert (Kifer et al., 2012) performs well in some cases but cannot take advantage of the strong convexity that is intrinsic to the data set. AdaOPS and AdaSSP, on the other hand, are able to nicely interpolate between the trivial solution and the non-private baseline and perform as well as or better than the baselines for all . More detailed quantitative results on all 36 UCI data sets are presented in Tables 3 and 4 of Appendix A. As we can see, AdaOPS and AdaSSP outperform the baselines in almost all data sets that we tested them on.

[Figure panels: bike_with_ssp, buzz_with_ssp, gas_with_ssp, energy_with_ssp]

Figure 3: Example of results of differentially private linear regression algorithms on 36 UCI data sets for a sequence of . Reported on the y-axis is the cross-validation prediction error in MSE and their confidence intervals.
[Figure panels: (a) Estimation MSE at $\epsilon = 0.1$; (b) Estimation MSE at $\epsilon = 1$; (c) Rel. efficiency at $\epsilon = 0.1$; (d) Rel. efficiency at $\epsilon = 1$]

Figure 4: Example of differentially private linear regression under the linear Gaussian model with an increasing data size . We simulate the data from , drawn from a uniform distribution defined on . We generate as a Gaussian random matrix and then generate . We used $\epsilon = 0.1$ and $\epsilon = 1$, both with . The results clearly illustrate the asymptotic efficiency of the proposed approaches.

Parameter estimation under linear Gaussian model.

To illustrate the performance of the algorithms under standard statistical assumptions, we also benchmarked the algorithms on synthetic data generated by a linear Gaussian model. The results, shown in Figure 4, illustrate that as $n$ gets large, AdaOPS and AdaSSP with $\epsilon = 0.1$ and $\epsilon = 1$ converge to the maximum likelihood estimator at a rate faster than the optimal statistical rate at which the MLE estimates $\theta_0$. Therefore, at least for large $n$, differential privacy comes for free. Note that there is a gap between SSP and AdaSSP for large $n$. This can be thought of as a cost of adaptivity, as AdaSSP needs to spend some portion of its privacy budget to release $\lambda_{\min}$, which SSP does not. This can be fixed by a more careful splitting of the privacy budget.

6 Conclusion

In this paper, we presented a detailed case study of the problem of differentially private linear regression. We clarified the relationships between various quantities of the problem as they appear in the private and non-private information-theoretic lower bounds. We also surveyed the existing algorithms and highlighted that the main drawback of these algorithms relative to their non-private counterparts is that they cannot adapt to data-dependent quantities. This is particularly true for linear regression, where the ordinary least squares algorithm is able to work optimally for a wide range of problem classes.

We proposed AdaOPS and AdaSSP to address the issue and showed that they both work in unbounded domains. Moreover, they smoothly interpolate between the two regimes studied in Bassily et al. (2014) and behave nearly optimally for every instance. We tested the two algorithms on 36 real-life regression data sets from the UCI machine learning repository and we see significant improvement over popular algorithms for almost all configurations of .

Future work includes extending the result beyond linear regression and releasing off-the-shelf packages for adaptive differentially private learning.

Acknowledgments

The author thanks Jing Lei for helpful discussions.

References

  • Agresti & Finlay (1997) Agresti, A., & Finlay, B. (1997). Statistical methods for the social sciences.
  • Armitage et al. (2008) Armitage, P., Berry, G., & Matthews, J. N. S. (2008). Statistical methods in medical research. John Wiley & Sons.
  • Bassily et al. (2014) Bassily, R., Smith, A., & Thakurta, A. (2014). Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS-14), (pp. 464–473). IEEE.
  • Birgé & Massart (2001) Birgé, L., & Massart, P. (2001). Gaussian model selection. Journal of the European Mathematical Society, 3(3), 203–268.
  • Chaudhuri et al. (2011) Chaudhuri, K., Monteleoni, C., & Sarwate, A. D. (2011). Differentially private empirical risk minimization. The Journal of Machine Learning Research, 12, 1069–1109.
  • Dimitrakakis et al. (2014) Dimitrakakis, C., Nelson, B., Mitrokotsa, A., & Rubinstein, B. I. (2014). Robust and private bayesian inference. In Algorithmic Learning Theory, (pp. 291–305). Springer.
  • Donoho (1995) Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE transactions on information theory, 41(3), 613–627.
  • Draper & Smith (2014) Draper, N. R., & Smith, H. (2014). Applied regression analysis, vol. 326. John Wiley & Sons.
  • Dwork et al. (2006a) Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., & Naor, M. (2006a). Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, (pp. 486–503). Springer.
  • Dwork & Lei (2009) Dwork, C., & Lei, J. (2009). Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, (pp. 371–380). ACM.
  • Dwork et al. (2006b) Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006b). Calibrating noise to sensitivity in private data analysis. In Theory of cryptography, (pp. 265–284). Springer.
  • Dwork & Smith (2010) Dwork, C., & Smith, A. (2010). Differential privacy for statistics: What we know and what we want to learn. Journal of Privacy and Confidentiality, 1(2), 2.
  • Dwork et al. (2014) Dwork, C., Talwar, K., Thakurta, A., & Zhang, L. (2014). Analyze Gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, (pp. 11–20). ACM.
  • Foulds et al. (2016) Foulds, J., Geumlek, J., Welling, M., & Chaudhuri, K. (2016). On the theory and practice of privacy-preserving Bayesian data analysis. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, (pp. 192–201). AUAI Press.
  • Friedman et al. (2001) Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning, vol. 1. Springer Series in Statistics. Springer, Berlin.
  • Galton (1886) Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
  • Greene (2003) Greene, W. H. (2003). Econometric analysis. Pearson Education India.
  • Kifer et al. (2012) Kifer, D., Smith, A., & Thakurta, A. (2012). Private convex empirical risk minimization and high-dimensional regression. Journal of Machine Learning Research, 1, 41.
  • Koren & Levy (2015) Koren, T., & Levy, K. (2015). Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems, (pp. 1477–1485).
  • Laurent & Massart (2000) Laurent, B., & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, (pp. 1302–1338).
  • Lei et al. (2017) Lei, J., Charest, A.-S., Slavkovic, A., Smith, A., & Fienberg, S. (2017). Differentially private model selection with penalized and constrained likelihood. Journal of the Royal Statistical Society.
  • Shamir (2015) Shamir, O. (2015). The sample complexity of learning linear predictors with the squared loss. Journal of Machine Learning Research, 16, 3475–3486.
  • Sheffet (2017) Sheffet, O. (2017). Differentially private ordinary least squares. In International Conference on Machine Learning, (pp. 3105–3114).
  • Smith (2008) Smith, A. (2008). Efficient, differentially private point estimators. arXiv preprint arXiv:0809.4794.
  • Stewart (1998) Stewart, G. W. (1998). Perturbation theory for the singular value decomposition. Tech. rep.
  • Talwar et al. (2014) Talwar, K., Thakurta, A., & Zhang, L. (2014). Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417.
  • Talwar et al. (2015) Talwar, K., Thakurta, A. G., & Zhang, L. (2015). Nearly optimal private lasso. In Advances in Neural Information Processing Systems, (pp. 3025–3033).
  • Vu & Slavkovic (2009) Vu, D., & Slavkovic, A. (2009). Differential privacy for clinical trial data: Preliminary evaluations. In Data Mining Workshops, 2009. ICDMW’09. IEEE International Conference on, (pp. 138–143). IEEE.
  • Wang (2017) Wang, Y.-X. (2017). Per-instance differential privacy and the adaptivity of posterior sampling in linear and ridge regression. arXiv preprint arXiv:1707.07708.
  • Wang et al. (2015) Wang, Y.-X., Fienberg, S., & Smola, A. (2015). Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), (pp. 2493–2502).
  • Wasserman (2013) Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.

Appendix A Results on the 36 real regression data sets in UCI repository

The detailed results on the 36 UCI data sets are presented in Table 3 and Table 4, one table per privacy setting. Boldface marks a DP algorithm whose standard deviation is smaller than its error (a positive quantity) and whose 95% confidence interval covers the best observed performance among the benchmarked DP algorithms.
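The boldface rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bold_algorithms`, the number of repetitions `n_trials`, and the normal-approximation interval (mean ± 1.96 · std / √n_trials) are hypothetical choices, since the text does not specify how the 95% confidence interval is computed.

```python
import math


def bold_algorithms(results, n_trials, z=1.96):
    """Return the names of the DP algorithms that would be set in boldface.

    results  : dict mapping algorithm name -> (mean_error, std)
    n_trials : number of repetitions behind each mean (assumed, not from the paper)
    z        : normal quantile for a 95% interval (assumed construction)
    """
    # Best observed performance among all benchmarked DP algorithms.
    best = min(mean for mean, _ in results.values())
    bold = set()
    for name, (mean, std) in results.items():
        half_width = z * std / math.sqrt(n_trials)
        # Criterion 1: the standard deviation is smaller than the error.
        stable = std < mean
        # Criterion 2: the confidence interval covers the best observed error.
        covers_best = mean - half_width <= best <= mean + half_width
        if stable and covers_best:
            bold.add(name)
    return bold
```

Under this reading, an algorithm with the lowest mean error is still not bolded when its standard deviation exceeds its error, which is why both conditions are checked separately.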

| Data set | Trivial | non-private | SSP | ObjPert | AdaOPS | AdaSSP |
|---|---|---|---|---|---|---|
| 3droad | 0.0275 ± 0.00014 | 0.0265 ± 0.00012 | 0.0265 ± 0.00019 | 0.0273 ± 0.00038 | 0.0265 ± 0.00019 | 0.0265 ± 0.00019 |
| airfoil | 0.103 ± 0.0069 | 0.0533 ± 0.0074 | 7.01 ± 12 | 0.286 ± 0.08 | 0.109 ± 0.013 | 0.11 ± 0.02 |
| autompg | 0.113 ± 0.011 | 0.0221 ± 0.0032 | 11.1 ± 11 | 0.288 ± 0.08 | 0.0879 ± 0.036 | 0.103 ± 0.053 |
| autos | 0.13 ± 0.042 | 0.0274 ± 0.011 | 2.17 ± 1.7 | 0.119 ± 0.07 | 0.108 ± 0.054 | 0.133 ± 0.073 |
| bike | 0.107 ± 0.0028 | 0.0279 ± 0.00078 | 3.38 ± 4.8 | 0.15 ± 0.045 | 0.0629 ± 0.0032 | 0.0593 ± 0.0055 |
| breastcancer | 0.194 ± 0.027 | 0.139 ± 0.025 | 21.3 ± 29 | 0.211 ± 0.063 | 0.194 ± 0.033 | 0.198 ± 0.039 |
| buzz | 0.0658 ± 0.00015 | 0.0127 ± 4.6e-05 | 0.404 ± 0.63 | 0.036 ± 0.0026 | 0.0147 ± 0.00058 | 0.0134 ± 0.00013 |
| challenger | 0.141 ± 0.084 | 0.138 ± 0.088 | 9.79 ± 8.9 | 0.436 ± 0.36 | 0.168 ± 0.13 | 0.194 ± 0.17 |
| concrete | 0.127 ± 0.0043 | 0.0445 ± 0.0033 | 45.8 ± 78 | 0.212 ± 0.078 | 0.129 ± 0.013 | 0.148 ± 0.05 |
| concreteslump | 0.149 ± 0.039 | 0.0245 ± 0.0071 | 1.7e+03 ± 3.3e+03 | 0.249 ± 0.048 | 0.162 ± 0.066 | 0.158 ± 0.076 |
| elevators | 0.0367 ± 0.0014 | 0.00861 ± 0.00031 | 1.3 ± 1.5 | 0.0735 ± 0.018 | 0.0295 ± 0.0053 | 0.028 ± 0.002 |
| energy | 0.235 ± 0.012 | 0.0232 ± 0.0023 | 3.36 ± 3.8 | 0.276 ± 0.048 | 0.211 ± 0.051 | 0.217 ± 0.063 |
| fertility | 0.0977 ± 0.024 | 0.0863 ± 0.024 | 72.3 ± 1e+02 | 0.249 ± 0.068 | 0.102 ± 0.041 | 0.106 ± 0.044 |
| forest | 0.0564 ± 0.0081 | 0.0571 ± 0.0086 | 9.51 ± 12 | 0.144 ± 0.022 | 0.0604 ± 0.016 | 0.0732 ± 0.021 |
| gas | 0.112 ± 0.0062 | 0.0214 ± 0.0028 | 3.28 ± 4.3 | 0.117 ± 0.027 | 0.0849 ± 0.012 | 0.0978 ± 0.0062 |
| houseelectric | 0.122 ± 0.00017 | 0.0136 ± 1.4e-05 | 0.0136 ± 2.1e-05 | 0.0408 ± 0.00094 | 0.0137 ± 1.7e-05 | 0.0136 ± 2.2e-05 |
| housing | 0.112 ± 0.019 | 0.0394 ± 0.01 | 3.38 ± 3.7 | 0.152 ± 0.036 | 0.103 ± 0.031 | 0.125 ± 0.036 |
| keggdirected | 0.117 ± 0.00095 | 0.0188 ± 0.0011 | 25 ± 50 | 0.0875 ± 0.016 | 0.0255 ± 0.0023 | 0.0246 ± 0.0011 |
| keggundirected | 0.0694 ± 0.00074 | 0.00475 ± 8.9e-05 | 3.25 ± 5.5 | 0.0537 ± 0.012 | 0.0127 ± 0.0026 | 0.0111 ± 0.00093 |
| kin40k | 0.0634 ± 0.0012 | 0.0632 ± 0.0013 | 0.0636 ± 0.0022 | 0.189 ± 0.0032 | 0.0674 ± 0.0018 | 0.0664 ± 0.0022 |
| machine | 0.121 ± 0.013 | 0.0395 ± 0.0051 | 96.5 ± 1.5e+02 | 0.298 ± 0.097 | 0.13 ± 0.031 | 0.151 ± 0.052 |
| parkinsons | 0.17 ± 0.0026 | 0.128 ± 0.0024 | 11.3 ± 18 | 0.205 ± 0.017 | 0.162 ± 0.0098 | 0.164 ± 0.0063 |
| pendulum | 0.0226 ± 0.0061 | 0.0181 ± 0.0049 | 1.35 ± 0.79 | 0.13 ± 0.022 | 0.0265 ± 0.0085 | 0.0411 ± 0.012 |
| pol | 0.345 ± 0.0028 | 0.135 ± 0.0023 | 6.3 ± 8.4 | 0.365 ± 0.033 | 0.25 ± 0.011 | 0.26 ± 0.0059 |
| protein | 0.167 ± 0.0011 | 0.119 ± 0.0014 | 0.163 ± 0.071 | 0.168 ± 0.012 | 0.133 ± 0.0022 | 0.128 ± 0.0031 |
| pumadyn32nm | 0.0935 ± 0.0039 | 0.0941 ± 0.0039 | 169 ± 2.7e+02 | 0.126 ± 0.0059 | 0.0974 ± 0.0059 | 0.0961 ± 0.0064 |
| servo | 0.184 ± 0.039 | 0.0752 ± 0.022 | 28.8 ± 29 | 0.328 ± 0.078 | 0.184 ± 0.046 | 0.205 ± 0.093 |
| skillcraft | 0.0439 ± 0.0021 | 0.0203 ± 0.0017 | 92.4 ± 1.8e+02 | 0.0906 ± 0.012 | 0.0392 ± 0.0031 | 0.0361 ± 0.0064 |
| slice | 0.196 ± 0.0021 | 0.0283 ± 0.00051 | 39.5 ± 59 | 0.19 ± 0.0072 | 0.125 ± 0.0035 | 0.156 ± 0.0018 |
| sml | 0.211 ± 0.0089 | 0.0143 ± 0.00066 | 16.2 ± 23 | 0.228 ± 0.044 | 0.17 ± 0.019 | 0.169 ± 0.011 |
| solar | 0.0118 ± 0.0042 | 0.0106 ± 0.0038 | 10.3 ± 15 | 0.109 ± 0.025 | 0.0181 ± 0.0082 | 0.0222 ± 0.01 |
| song | 0.0917 ± 0.0003 | 0.0636 ± 0.00033 | 0.0699 ± 0.0052 | 0.0909 ± 0.0018 | 0.0719 ± 0.00026 | 0.0737 ± 0.00027 |
| stock | 0.0583 ± 0.0095 | 0.013 ± 0.0023 | 9.24 ± 8.3 | 0.189 ± 0.05 | 0.0603 ± 0.019 | 0.0635 ± 0.026 |
| tamielectric | 0.334 ± 0.002 | 0.334 ± 0.0021 | 0.335 ± 0.003 | 0.35 ± 0.0089 | 0.337 ± 0.0037 | 0.335 ± 0.0033 |
| wine | 0.0566 ± 0.0028 | 0.0202 ± 0.00099 | 10.9 ± 9.7 | 0.139 ± 0.027 | 0.0567 ± 0.01 | 0.0649 ± 0.017 |
| yacht | 0.105 ± 0.017 | 0.0176 ± 0.0055 | 18.5 ± 31 | 0.277 ± 0.065 | 0.103 ± 0.042 | 0.126 ± 0.049 |

Table 3: Summary of UCI data experiments at .

| Data set | Trivial | non-private | SSP | ObjPert | AdaOPS | AdaSSP |
|---|---|---|---|---|---|---|
| 3droad | 0.0275 ± 0.00014 | 0.0265 ± 0.00012 | 0.0265 ± 0.00019 | 0.0267 ± 0.00013 | 0.0265 ± 0.00019 | 0.0265 ± 0.00019 |
| airfoil | 0.103 ± 0.0069 | 0.0533 ± 0.0074 | 0.0541 ± 0.011 | 0.0706 ± 0.009 | 0.0671 ± 0.0066 | 0.0572 ± 0.011 |
| autompg | 0.113 ± 0.011 | 0.0221 ± 0.0032 | 0.111 ± 0.11 | 0.0798 ± 0.013 | 0.0558 ± 0.0056 | 0.0472 ± 0.012 |
| autos | 0.13 ± 0.042 | 0.0274 ± 0.011 | 115 ± 2.3e+02 | 0.104 ± 0.051 | 0.0958 ± 0.055 | 0.102 ± 0.066 |
| bike | 0.107 ± 0.0028 | 0.0279 ± 0.00078 | 12.2 ± 24 | 0.048 ± 0.0018 | 0.0309 ± 0.001 | 0.0288 ± 0.0016 |
| breastcancer | 0.194 ± 0.027 | 0.139 ± 0.025 | 168 ± 3.3e+02 | 0.202 ± 0.071 | 0.187 ± 0.05 | 0.187 ± 0.035 |
| buzz | 0.0658 ± 0.00015 | 0.0127 ± 4.6e-05 | 0.027 ± 0.017 | 0.026 ± 7.8e-05 | 0.0136 ± 0.00044 | 0.0127 ± 7.2e-05 |
| challenger | 0.141 ± 0.084 | 0.138 ± 0.088 | 20.1 ± 31 | 0.263 ± 0.1 | 0.133 ± 0.088 | 0.124 ± 0.1 |
| concrete | 0.127 ± 0.0043 | 0.0445 ± 0.0033 | 0.0597 ± 0.015 | 0.0741 ± 0.0059 | 0.079 ± 0.0042 | 0.0651 ± 0.0039 |
| concreteslump | 0.149 ± 0.039 | 0.0245 ± 0.0071 | 4.77 ± 6.9 | 0.179 ± 0.081 | 0.141 ± 0.047 | 0.161 ± 0.09 |
| elevators | 0.0367 ± 0.0014 | 0.00861 ± 0.00031 | 0.028 ± 0.027 | 0.0169 ± 0.00072 | 0.0165 ± 0.0013 | 0.0134 ± 0.0014 |
| energy | 0.235 ± 0.012 | 0.0232 ± 0.0023 | 0.095 ± 0.094 | 0.083 ± 0.0051 | 0.069 ± 0.0088 | 0.0499 ± 0.013 |
| fertility | 0.0977 ± 0.024 | 0.0863 ± 0.024 | 2.45 ± 2.4 | 0.225 ± 0.087 | 0.0956 ± 0.041 | 0.115 ± 0.043 |
| forest | 0.0564 ± 0.0081 | 0.0571 ± 0.0086 | 0.703 ± 1 | 0.082 ± 0.0098 | 0.0591 ± 0.013 | 0.0621 ± 0.013 |
| gas | 0.112 ± 0.0062 | 0.0214 ± 0.0028 | 3.43 ± 6.4 | 0.0597 ± 0.0043 | 0.0451 ± 0.0065 | 0.0473 ± 0.0059 |
| houseelectric | 0.122 ± 0.00017 | 0.0136 ± 1.4e-05 | 0.0136 ± 2.2e-05 | 0.0406 ± 6.1e-05 | 0.0136 ± 2.2e-05 | 0.0136 ± 2.2e-05 |
| housing | 0.112 ± 0.019 | 0.0394 ± 0.01 | 55 ± 77 | 0.0873 ± 0.026 | 0.0773 ± 0.024 | 0.0712 ± 0.024 |
| keggdirected | 0.117 ± 0.00095 | 0.0188 ± 0.0011 | 0.611 ± 1.2 | 0.0434 ± 0.00056 | 0.0201 ± 0.00066 | 0.0191 ± 0.00069 |
| keggundirected | 0.0694 ± 0.00074 | 0.00475 ± 8.9e-05 | 0.0653 ± 0.12 | 0.0212 ± 0.00028 | 0.00687 ± 0.00049 | 0.00551 ± 0.00016 |
| kin40k | 0.0634 ± 0.0012 | 0.0632 ± 0.0013 | 0.0632 ± 0.002 | 0.0633 ± 0.002 | 0.0633 ± 0.002 | 0.0633 ± 0.002 |
| machine | 0.121 ± 0.013 | 0.0395 ± 0.0051 | 0.21 ± 0.18 | 0.11 ± 0.024 | 0.0818 ± 0.024 | 0.0686 ± 0.015 |
| parkinsons | 0.17 ± 0.0026 | 0.128 ± 0.0024 | 0.464 ± 0.56 | 0.14 ± 0.0026 | 0.136 ± 0.0036 | 0.132 ± 0.0038 |
| pendulum | 0.0226 ± 0.0061 | 0.0181 ± 0.0049 | 0.0249 ± 0.0094 | 0.0394 ± 0.0069 | 0.0238 ± 0.0093 | 0.0247 ± 0.0081 |
| pol | 0.345 ± 0.0028 | 0.135 ± 0.0023 | 0.242 ± 0.21 | 0.19 ± 0.002 | 0.144 ± 0.0019 | 0.14 ± 0.0034 |
| protein | 0.167 ± 0.0011 | 0.119 ± 0.0014 | 0.119 ± 0.0022 | 0.131 ± 0.0012 | 0.123 ± 0.0022 | 0.12 ± 0.0022 |
| pumadyn32nm | 0.0935 ± 0.0039 | 0.0941 ± 0.0039 | 0.0945 ± 0.0062 | 0.0949 ± 0.0057 | 0.0954 ± 0.0065 | 0.0958 ± 0.0063 |
| servo | 0.184 ± 0.039 | 0.0752 ± 0.022 | 0.0932 ± 0.034 | 0.139 ± 0.039 | 0.156 ± 0.037 | 0.165 ± 0.049 |
| skillcraft | 0.0439 ± 0.0021 | 0.0203 ± 0.0017 | 0.174 ± 0.27 | 0.0307 ± 0.0018 | 0.0273 ± 0.0034 | 0.0252 ± 0.0023 |
| slice | 0.196 ± 0.0021 | 0.0283 ± 0.00051 | 151 ± 2.9e+02 | 0.0875 ± 0.0008 | 0.0483 ± 0.0012 | 0.0555 ± 0.00054 |
| sml | 0.211 ± 0.0089 | 0.0143 ± 0.00066 | 1.26 ± 1.6 | 0.0743 ± 0.0032 | 0.049 ± 0.0026 | 0.0396 ± 0.0024 |
| solar | 0.0118 ± 0.0042 | 0.0106 ± 0.0038 | 0.0327 ± 0.021 | 0.0195 ± 0.0065 | 0.0127 ± 0.006 | 0.0129 ± 0.006 |
| song | 0.0917 ± 0.0003 | 0.0636 ± 0.00033 | 0.0636 ± 0.00052 | 0.0706 ± 0.00029 | 0.0642 ± 0.00031 | 0.0637 ± 0.00049 |
| stock | 0.0583 ± 0.0095 | 0.013 ± 0.0023 | 0.219 ± 0.25 | 0.0467 ± 0.022 | 0.0456 ± 0.016 | 0.038 ± 0.0097 |
| tamielectric | 0.334 ± 0.002 | 0.334 ± 0.0021 | 0.334 ± 0.0033 | 0.334 ± 0.0033 | 0.335 ± 0.0037 | 0.334 ± 0.0033 |
| wine | 0.0566 ± 0.0028 | 0.0202 ± 0.00099 | 0.0246 ± 0.003 | 0.0347 ± 0.0042 | 0.0392 ± 0.0032 | 0.0349 ± 0.0029 |
| yacht | 0.105 ± 0.017 | 0.0176 ± 0.0055 | 0.028 ± 0.011 | 0.0524 ± 0.014 | 0.0794 ± 0.019 | 0.0602 ± 0.014 |

Table 4: Summary of UCI data experiments at

Appendix B Proof of the results for SSP and AdaSSP

In this section, we first derive the rates for the optimization error and the parameter estimation error of the sufficient statistics perturbation (SSP) approach, as shown in Table 1 and Table 2. This builds intuition for AdaSSP, whose proof we present towards the end of the section.
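As a rough illustration of the sufficient statistics perturbation idea analyzed in this section, the sketch below adds Gaussian noise to the sufficient statistics $X^TX$ and $X^Ty$ and then solves the perturbed normal equations. The function name `ssp_estimate`, the noise scales `sigma_xtx` and `sigma_xty`, and the use of a pseudo-inverse are assumptions made for illustration. In particular, the sketch does not calibrate the noise to a privacy budget, so by itself it carries no differential privacy guarantee.

```python
import numpy as np


def ssp_estimate(X, y, sigma_xtx, sigma_xty, rng):
    """Illustrative sufficient statistics perturbation for least squares.

    X         : (n, d) feature matrix
    y         : (n,) label vector
    sigma_xtx : std of the noise added to X^T X (hypothetical input;
                calibrating it to a privacy budget is outside this sketch)
    sigma_xty : std of the noise added to X^T y (same caveat)
    rng       : numpy.random.Generator
    """
    d = X.shape[1]
    # Symmetric Gaussian noise: draw the upper triangle, then mirror it
    # so that the perturbed Gram matrix stays symmetric.
    upper = np.triu(rng.normal(scale=sigma_xtx, size=(d, d)))
    noise_xtx = upper + np.triu(upper, k=1).T
    noise_xty = rng.normal(scale=sigma_xty, size=d)

    noisy_xtx = X.T @ X + noise_xtx
    noisy_xty = X.T @ y + noise_xty

    # A pseudo-inverse is used here only to keep the sketch from failing
    # when the perturbed matrix is ill-conditioned (an illustrative choice).
    return np.linalg.pinv(noisy_xtx) @ noisy_xty


# Usage on synthetic data (values are arbitrary and for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta = np.array([1.0, -2.0, 0.5])
y = X @ theta
theta_hat = ssp_estimate(X, y, sigma_xtx=1.0, sigma_xty=1.0, rng=rng)
```

With both noise scales set to zero the function reduces to ordinary least squares, which makes it easy to see how the perturbation alone drives the additional error.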

b.1