Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain
Abstract
We revisit the problem of linear regression under a differential privacy constraint. By consolidating existing pieces in the literature, we clarify the correct dependence of the feature, label and coefficient domain in the optimization error and estimation error, hence revealing the delicate price of differential privacy in statistical estimation and statistical learning. Moreover, we propose simple modifications of two existing DP algorithms: (a) posterior sampling, (b) sufficient statistics perturbation, and show that they can be upgraded into adaptive algorithms that are able to exploit data-dependent quantities and behave nearly optimally for every instance. Extensive experiments on both simulated data and real data show that both AdaOPS and AdaSSP outperform the existing techniques on nearly all 36 data sets that we test on.
1 Introduction
Linear regression is one of the oldest tools for data analysis (Galton, 1886) and it remains one of the most commonly used today (Draper & Smith, 2014), especially in the social sciences (Agresti & Finlay, 1997), econometrics (Greene, 2003) and medical research (Armitage et al., 2008). Moreover, many nonlinear models are either intrinsically linear in certain function spaces, e.g., kernel methods and dynamical systems, or can be reduced to solving a sequence of linear regressions, e.g., iteratively reweighted least squares for generalized linear models, gradient boosting for additive models and so on (see Friedman et al., 2001, for a detailed review).
In order to apply linear regression to sensitive data such as those in social sciences and medical studies, it is often necessary to do so in a way that protects the privacy of individuals in the data set. Differential privacy (Dwork et al., 2006b) is a commonly accepted criterion that provides provable protection against identification and is resilient to arbitrary auxiliary information that might be available to attackers. In this paper, we focus on linear regression with differential privacy (Dwork et al., 2006a).
Isn’t it a solved problem?
It might be a bit surprising that this is still a problem, since several general frameworks of differential privacy have been proposed that cover linear regression. Specifically, in the agnostic setting (without a data model), linear regression is a special case of differentially private empirical risk minimization (ERM), and its theoretical properties are quite well understood in the sense that the minimax lower bounds are known (Bassily et al., 2014) and a number of algorithms (Chaudhuri et al., 2011; Kifer et al., 2012) have been shown to match the lower bounds under various assumptions. In the statistical estimation setting where we assume the data is generated from a linear Gaussian model, linear regression is covered by the sufficient statistics perturbation approach for exponential family models (Dwork & Smith, 2010; Foulds et al., 2016), the propose-test-release framework (Dwork & Lei, 2009) as well as the subsample-and-aggregate framework (Smith, 2008), with all three approaches achieving asymptotic efficiency in the fixed-dimension, large-sample regime.
Despite these theoretical advances, very few empirical evaluations of these algorithms have been conducted and we are not aware of a commonly accepted best practice. Practitioners are often left puzzled about which algorithm to use for the specific data set they have. The nature of differential privacy often requires them to set parameters of the algorithm (e.g., how much noise to add) according to the diameter of the parameter domain, as well as properties of a hypothetical worst-case data set, which often leads to an inefficient use of their valuable data.
The main contribution of this paper is threefold:

We consolidated many bits and pieces from the literature and clarified the price of differential privacy in statistical estimation and statistical learning.

We carefully analyzed One Posterior Sample (OPS) and Sufficient Statistics Perturbation (SSP) and proposed simple modifications of them into adaptive versions: AdaOPS and AdaSSP. Both work nearly optimally for every problem instance without any hyperparameter tuning.

We conducted extensive real data experiments to benchmark existing techniques and concluded that the proposed techniques give rise to a more favorable privacy-utility tradeoff relative to existing methods.
Outline of this paper.
2 Notations and problem setup
Throughout the paper we will use and to denote the design matrix and response vector. These are collections of data points . We use to denote the Euclidean norm for vector inputs and the operator norm for matrix inputs. In addition, for set inputs, denotes the radius of the smallest Euclidean ball that contains the set. For example, and . Let be the domain of coefficients. Our results do not require to be compact but existing approaches often depend on . and denote greater than or smaller than up to a universal multiplicative constant, which is the same as the big and the big . hides at most a logarithmic term. and denote the standard semidefinite ordering of positive semidefinite (psd) matrices.
We now define a few data dependent quantities. We use (abbv. ) to denote the smallest eigenvalue of , and to make the implicit dependence in and clear from this quantity, we define One can think of as a normalized smallest eigenvalue of such that . Also, is closely related to the condition number of .
The least squares solution is the optimal solution to Similarly, we use to denote the optimal solution to the ridge regression objective .
In addition, we denote the global Lipschitz constant of as and the data-dependent local Lipschitz constant at as . Note that when , , but will remain finite for every given data set.
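As a concrete reading of the two estimators just defined, the following NumPy sketch computes a least squares solution, a ridge solution, and the smallest eigenvalue of the Gram matrix. The function names are hypothetical, and the scaling of the objective (a factor 1/2 on both the squared loss and the penalty) is an assumption, since the displayed objectives are not reproduced in this text; a different scaling only rescales the regularization parameter.

```python
import numpy as np


def least_squares(X, y):
    """Minimizer of 0.5 * ||y - X theta||^2 (minimum-norm one if X is rank deficient)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta


def ridge(X, y, lam):
    """Minimizer of 0.5 * ||y - X theta||^2 + 0.5 * lam * ||theta||^2.

    Assumed scaling; requires lam > 0, or lam = 0 with X of full column rank.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def smallest_eigenvalue(X):
    """Smallest eigenvalue of X^T X (eigvalsh returns eigenvalues in ascending order)."""
    return np.linalg.eigvalsh(X.T @ X)[0]
```

With noiseless labels and a full-rank design, `least_squares` recovers the generating coefficients, while `ridge` shrinks them towards zero as `lam` grows.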
Metric of success.
We measure the performance of an estimator in two ways.
First, we consider the optimization error in expectation or with probability . This is related to the prediction accuracy in the distribution-free statistical learning setting.
Second, we consider how well the coefficients can be estimated under the linear Gaussian model:
in terms of or in some cases where is a high probability event.
The optimal error in either case will depend on the specific design matrix , optimal solution , the data domain , the parameter domain as well as in the statistical estimation setting.
Differential privacy.
We will focus on estimators that are differentially private, as defined below.
Definition 1 (Differential privacy (Dwork et al., 2006b)).
We say a randomized algorithm satisfies DP if for every fixed data set and every data set that can be constructed by adding or removing one row from , and for every measurable set over the probability of the algorithm
Parameter represents the amount of privacy loss from running the algorithm and denotes a small probability of failure. These are user-specified targets to achieve and the differential privacy guarantee is considered meaningful if and .
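For reference, the standard approximate differential privacy inequality from the literature can be written as below. The symbols used here (the algorithm, the two neighboring data sets, the measurable set, and the two privacy parameters) are notational assumptions, since the displayed formula of Definition 1 is not reproduced in this text.

```latex
\[
\Pr\bigl[\mathcal{A}(\mathrm{Data}) \in S\bigr]
\;\le\;
e^{\epsilon}\,\Pr\bigl[\mathcal{A}(\mathrm{Data}') \in S\bigr] + \delta .
\]
```

Here the probability is taken only over the internal randomness of the algorithm, and the inequality is required for every pair of neighboring data sets and every measurable set of outputs.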
The pursuit for adaptive estimators.
Another important design feature that we will mention repeatedly in this paper is adaptivity. We call an estimator adaptive if it behaves optimally simultaneously for a wide range of parameter choices. Being adaptive is of great practical relevance because we do not need to specify the class of problems or worry about whether our specification is wrong (see examples of adaptive estimators in, e.g., Donoho, 1995; Birgé & Massart, 2001). Adaptivity is particularly important for differentially private data analysis because often we need to decide the amount of noise to add by the size of the domain. For example, an adaptive algorithm will not rely on conservative upper bounds of , or a worst case (which would be on any ), and it can take advantage of favorable properties when they exist in the data set. We want to design an estimator that does not take these parameters as inputs and behaves nearly optimally for every fixed data set under a variety of configurations of .
3 A survey of existing work
In this section, we summarize existing theoretical results in linear regression with and without differential privacy constraints. We will start with lower bounds.
3.1 Information-theoretic lower bounds
Lower bounds under linear Gaussian model.
Under the statistical assumption of the linear Gaussian model , the minimax risks for both estimation and prediction are crisply characterized for each fixed design matrix :
(1) 
and if we further assume that and is invertible (for identifiability), then
(2) 
In the above setup, is any measurable function of (note that is fixed). These are classic results that can be found in standard statistical decision theory textbooks (See, e.g., Wasserman, 2013, Chapter 13).
Under the same assumptions, the Cramér-Rao lower bound mandates that the covariance matrix of any unbiased estimator of obeys
(3) 
This bound applies to every problem instance separately and also implies a precise lower bound on the prediction variance, namely, for any and any unbiased .
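As a worked reminder of why such a bound holds, the textbook form of the Cramér-Rao bound for a linear Gaussian model with noise variance \sigma^2 is sketched below. The symbols are assumptions, since the displayed equation (3) is not reproduced in this text, and the paper's own normalization may differ.

```latex
\[
\mathcal{I}(\theta) = \frac{1}{\sigma^{2}} X^{\top} X
\quad\Longrightarrow\quad
\mathrm{Cov}(\hat{\theta}) \succeq \mathcal{I}(\theta)^{-1} = \sigma^{2}\,(X^{\top} X)^{-1},
\qquad
\mathrm{Var}\bigl(x^{\top}\hat{\theta}\bigr) \ge \sigma^{2}\, x^{\top} (X^{\top} X)^{-1} x .
\]
```

The second inequality follows from the first by taking the quadratic form with a fixed test point x, which is how a covariance bound turns into a bound on the prediction variance.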
Statistical learning lower bounds.
Perhaps much less well known is that linear regression has also been thoroughly studied in the distribution-free statistical learning setting, where the only assumption is that the data are drawn iid from some unknown distribution defined on some compact domain . Specifically, let the risk () be
Shamir (2015) showed that when , are Euclidean balls,
(4) 
where is any measurable function of the data set to and the expectation is taken over the data generating distribution . Note that to be compatible with other bounds that appear in this paper, we multiplied the by a factor of . Informally, one can think of as in (1) so both terms depend on (or ), but the dependence on is new for the distribution-free setting.
Koren & Levy (2015) later showed that this lower bound is matched up to a constant by Ridge Regression with and both Koren & Levy (2015) and Shamir (2015) conjecture that ERM (a constrained version of OLS) without additional regularization should attain the lower bound (4). If the conjecture is true, then the unconstrained OLS is simultaneously optimal for all distributions supported on the smallest ball that contains all data points in for any being an ball with radius larger than .
Lower bounds with privacy constraints.
Suppose that we further require to be differentially private, then there is an additional price to pay in terms of how accurately we can approximate the ERM solution. Specifically, Bassily et al. (2014)’s lower bounds for the empirical excess risk for differentially private ERM imply that for and a sufficiently large :

There exists a triplet of , such that
(5) 
Assume all data sets within obey that the inverse condition number (footnote 1: this requires for all data sets ). There exists a triplet of such that
(6)
These bounds are attained by a number of algorithms, which we will go over in Section 3.2.
Compared to the nonprivate minimax rates on prediction accuracy, the bounds look different in several aspects. First, neither rate for prediction error in (1) or (4) depends on whether the design matrix is well-conditioned or not, while appears explicitly in (6). Second, the dependence on is different, which makes it hard to tell whether the optimization error lower bound due to the privacy requirement is limiting.
To clarify the relationships, we plot Shamir’s lower bound (4) and the smaller of Bassily et al.’s differential privacy lower bounds (5) and (6) for all configurations of graphically in Figure 2. We also use multiple lines to illustrate the shifts in these lower bounds when parameters such as and change. In all figures is assumed to be and logarithmic terms are dropped. The price of differential privacy is highlighted as a shaded area in the figures. Interestingly, in the first case when is small (when ), a substantial price only occurs in the nonstandard region where . Arguably this is acceptable because in that regime, one should use ridge regression or Lasso rather than OLS anyway. In the case when is large (when ), the price is more substantial and it applies to all unless we can exploit the strong convexity in the data set. When we can, the cost only occurs for an interval in and eventually the cost of differential privacy becomes negligible relative to the minimax rate. To the best of our knowledge, this is the first time the “price of differential privacy” for linear regression is discussed with a clear explanation of the dependency on all parameters of the problem.
The above discussion also allows us to address the following question.
When is privacy for free in statistical learning?
Specifically, what is the smallest such that an DP algorithm matches the minimax rate in (4)? Not surprisingly, the answer depends on the relative scale of and and that of . When , (5) says that DP algorithms can achieve the nonconvex minimax rate provided that for On the other hand, if (footnote 2: this is arguably the more relevant setting; note that if and is fixed, then ) and , then we need
The regions are illustrated graphically in Figure 2. In the first case, there is a large region upon , where meaningful differential privacy (with and ) can be achieved without incurring a significant toll relative to (4). In the second case, we need at least to achieve “privacy for free” in the most favorable case where . In the case when could be rank-deficient, it is infeasible to achieve “privacy for free” no matter how large is.
Based on the results in Figure 2 and 2, it might be tempting to conclude that one should always prefer Case 1 over Case 2. This is unfortunately not the case because the artificial restriction of the model class via a bounded also weakens our nonprivate baseline. In other words, the best solution within a small might be significantly worse than the best solution in .
In practice, it is hard to find a with a small radius that fits all purposes (footnote 3: if then the constraint becomes limiting; if instead, then calibrating the noise according to will inject more noise than necessary) and it is unreasonable to assume . This motivates us to go beyond the worst case and come up with adaptive algorithms that work without knowing and while achieving the minimax rate for the class with and (in hindsight).
3.2 Existing algorithms and our contribution
We now survey the following list of five popular algorithms in differentially private learning and highlight the novelty in our proposals (footnote 4: while we try to be as comprehensive as possible, the literature has grown massively and the choice of this list is limited by our knowledge and opinions).

Objective perturbation (ObjPert) (Kifer et al., 2012): with an appropriate and sampled from an appropriately chosen iid Gaussian random vector.

NoisySGD (Bassily et al., 2014): Run SGD for a fixed number of iterations with additional Gaussian noise added to the stochastic gradient evaluated on one randomly chosen data point.
We omit detailed operational aspects of these algorithms and focus our discussion on their theoretical guarantees. Interested readers are encouraged to check out each paper separately. These algorithms are proven under different scalings and assumptions. To ensure fair comparison, we make sure that all results are converted to our setting under a subset of the following assumptions.

A.1 is bounded, is bounded.

A.2 is bounded.

A.3 All possible data sets obey that the smallest eigenvalue is greater than .
Note that A.3 is a restriction on the domain of the data set, rather than the domain of individual data points in the data set of size . While it is a little unconventional, it is valid to define differential privacy within such a restricted space of data sets. It is the same assumption that we needed to assume for the lower bound in (6) to be meaningful.
Detailed comparisons of the algorithms are shown in Tables 1 and 2. As in Koren & Levy (2015), we simplify the expressions of the bound by assuming , and in addition, we assume that .
Specifically, Table 1 summarizes the upper bounds on the optimization error of the aforementioned algorithms relative to our two proposals: AdaOPS and AdaSSP. Comparing the rates to the lower bounds in the previous section, it is clear that NoisySGD and ObjPert both achieve the minimax rate in optimization error but their hyperparameter choice depends on the unknown and . SSP is adaptive to and but has a completely different type of issue: it can fail arbitrarily badly in the regime covered under (5), and even for well-conditioned problems, its theoretical guarantees only kick in as gets very large. Our proposed algorithms AdaOPS and AdaSSP are able to simultaneously switch between the two regimes and get the best of both worlds.
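To make the behavior of plain sufficient statistics perturbation concrete, here is a minimal NumPy sketch: it perturbs the Gram matrix and the feature-label product with Gaussian noise and solves the resulting noisy normal equations. All names are hypothetical. The noise scale uses the classic Gaussian mechanism calibration (valid for a per-release privacy parameter below 1), an even split of the privacy budget between the two statistics, and user-supplied bounds on the row norm and label magnitude; these are assumptions made for illustration and are not the constants analyzed in the paper.

```python
import numpy as np


def gaussian_sigma(sensitivity, eps, delta):
    """Classic Gaussian-mechanism noise scale (assumes 0 < eps < 1)."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps


def symmetric_gaussian(d, sigma, rng):
    """Symmetric d x d matrix: iid N(0, sigma^2) on the upper triangle, mirrored below."""
    upper = np.triu(rng.normal(scale=sigma, size=(d, d)))
    return upper + np.triu(upper, k=1).T


def ssp(X, y, x_bound, y_bound, eps, delta, rng):
    """Sketch of SSP; assumes every row norm <= x_bound and every |y_i| <= y_bound."""
    d = X.shape[1]
    # Adding or removing a row x changes X^T X by x x^T (Frobenius norm <= x_bound^2)
    # and X^T y by x * y (Euclidean norm <= x_bound * y_bound).
    sigma_gram = gaussian_sigma(x_bound ** 2, eps / 2.0, delta / 2.0)
    sigma_xty = gaussian_sigma(x_bound * y_bound, eps / 2.0, delta / 2.0)
    gram_noisy = X.T @ X + symmetric_gaussian(d, sigma_gram, rng)
    xty_noisy = X.T @ y + rng.normal(scale=sigma_xty, size=d)
    # No regularization: the noisy Gram matrix may be ill-conditioned or indefinite.
    return np.linalg.solve(gram_noisy, xty_noisy)
```

The absence of any regularization in the last step is what makes this sketch fragile when the smallest eigenvalue of the Gram matrix is comparable to the noise level, which is the failure mode described above.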
Table 2 summarizes the upper bounds for estimation. The second row compares the approximation of in MSE and the third column summarizes the statistical efficiency of the DP estimators relative to the MLE: under the linear Gaussian model. All algorithms except OPS are asymptotically efficient. For the interest of DP, SSP has the fastest convergence rate and does not explicitly depend on the smallest eigenvalue, but again it behaves differently when is small. AdaOPS and AdaSSP are able to work nearly optimally for all .
3.3 Other related work
The problem of adaptive estimation is closely related to model selection (see, e.g., Birgé & Massart, 2001) and an approach using Bayesian Information Criteria was carefully studied in the differentially private setting for the problem of constrained ridge regression by Lei et al. (2017). Their focus, however, is not on adaptive prediction and estimation, but rather on whether one can find the correct model that matches the selected model in its nonprivate counterpart. One important aspect that is missing is that the best model in the differentially private setting might not be the same as the best model in the nonprivate setting. This is especially true in cases when we take the distribution-free view of the problems.
Linear regression is also studied in many more specialized setups, e.g., high dimensional linear regression (Kifer et al., 2012; Talwar et al., 2014, 2015), statistical inference (Sheffet, 2017) and so on. For the interest of this paper, we focus on the standard regime of linear regression where and do not use sparsity or constraint set to achieve dependence. Lastly, the results in Sheffet (2017) are related: they cover the strong asymptotic utility guarantee of SSP under the linear Gaussian model, and their technique of adaptively adding regularization has inspired AdaSSP.
4 Adaptive private linear regression
In this section, we present and analyze AdaOPS and AdaSSP that achieve the aforementioned adaptive rate. The pseudocode of these two algorithms is given in Algorithm 1 and Algorithm 2.
The idea of both algorithms is to release data-dependent quantities differentially privately and then use a high probability confidence interval of these quantities to calibrate the noise to the privacy budget as well as to choose the ridge regression’s hyperparameter for achieving the smallest prediction error. Specifically, AdaOPS requires us to release both the smallest eigenvalue of and the local Lipschitz constant , while AdaSSP only needs the smallest eigenvalue .
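The first ingredient described above, privately releasing the smallest eigenvalue and turning it into a high-probability bound, can be sketched as follows. This is an illustrative version under stated assumptions, not the paper's Algorithm 1 or 2: the function name, the failure probability `rho`, and the classic Gaussian mechanism calibration are all hypothetical choices. It relies on the fact that adding or removing one row with norm at most `x_bound` changes the smallest eigenvalue of the Gram matrix by at most `x_bound ** 2` (Weyl's inequality).

```python
import numpy as np


def private_min_eig_lower_bound(X, x_bound, eps, delta, rho, rng):
    """Release a DP lower confidence bound on the smallest eigenvalue of X^T X.

    Assumes every row of X has norm <= x_bound and 0 < eps < 1.
    """
    lam_min = np.linalg.eigvalsh(X.T @ X)[0]  # ascending order, so [0] is the smallest
    sigma = x_bound ** 2 * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    noisy = lam_min + rng.normal(scale=sigma)
    # Gaussian tail bound: the noise exceeds this margin with probability at most rho.
    margin = sigma * np.sqrt(2.0 * np.log(1.0 / rho))
    # Shifting and clipping are post-processing, so they do not affect the DP guarantee.
    return max(noisy - margin, 0.0)
```

The returned value is, with probability at least `1 - rho`, no larger than the true smallest eigenvalue, so it can safely be used downstream to decide how much regularization to add.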
(7) 
(8) 
In both AdaSSP and AdaOPS, we choose by minimizing an upper bound of in the form of “variance” and “bias”
Note that while cannot be privately released, it appears in both terms and does not enter the decision process of finding the optimal that minimizes the bound. This convenient feature about is a consequence of our assumption that . Dealing with the general case involving an arbitrary is an intriguing open problem.
A tricky situation for AdaOPS is that the choice of depends on through , which is the local Lipschitz constant at the ridge regression solution . But the choice of also depends since the “variance” term above is inversely proportional to . Our solution is to express (hence ) as a function of and solve the nonlinear univariate optimization problem (7).
We are now ready to state the main results.
Theorem 2.
Algorithm 1 outputs which obeys that

It satisfies DP.

Assume . With probability ,

Under the linear Gaussian model with a fixed fullrank . Then conditioning on an event satisfying over only the randomness of the algorithm, we have
and where constant
The proof, deferred to Appendix C, makes use of a fine-grained DP analysis through the recent per-instance DP techniques (Wang, 2017) and then converts the results to DP by releasing data-dependent bounds of and the magnitude of a ridge-regression output with an adaptively chosen . Note that does not have a bounded global sensitivity. The method to release it differentially privately (described in Lemma 12) is part of our technical contribution.
The AdaSSP algorithm is simpler and enjoys slightly stronger theoretical guarantees.
Theorem 3.
The proof of Statement (1) is straightforward. Note that we release the eigenvalue , and differentially privately, each with parameter . For the first two, we use the Gaussian mechanism and for , we use the AnalyzeGauss algorithm (Dwork et al., 2014) with a symmetric Gaussian random matrix. The result then follows from the composition theorem of differential privacy. The proofs of the second and third statements are provided in Appendix B. The main technical challenge is to prove the concentration on the spectrum and the Johnson-Lindenstrauss-like distance preserving properties for symmetric Gaussian random matrices (Lemma 6). We note that while SSP is an old algorithm, the analysis of its theoretical properties is new to this paper.
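The composition step can be made explicit with the basic composition theorem. The equal three-way split of the privacy budget shown in the second line is an assumption made for illustration, since the per-release parameters are not reproduced in this text.

```latex
\[
\mathcal{M}_i \text{ is } (\epsilon_i,\delta_i)\text{-DP for } i = 1,\dots,k
\;\Longrightarrow\;
(\mathcal{M}_1,\dots,\mathcal{M}_k) \text{ is } \Bigl(\textstyle\sum_{i=1}^{k}\epsilon_i,\ \sum_{i=1}^{k}\delta_i\Bigr)\text{-DP},
\]
\[
k = 3,\quad (\epsilon_i,\delta_i) = \bigl(\tfrac{\epsilon}{3},\tfrac{\delta}{3}\bigr)
\;\Longrightarrow\;
\text{the joint release is } (\epsilon,\delta)\text{-DP}.
\]
```

Any final estimator computed only from the three released quantities inherits the same guarantee, because differential privacy is preserved under post-processing.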
Remarks.
Both AdaOPS and AdaSSP match the smaller of the two lower bounds (5) and (6) for each problem instance. They are slightly different in that AdaOPS preserves the shape of the intrinsic geometry (which makes it easier to reuse the standard statistical inference tools of linear regression) of the covariance matrix while AdaSSP’s bounds are slightly stronger as they do not explicitly depend on the smallest eigenvalue.
5 Experiments
In this section, we conduct synthetic and real data experiments to benchmark the performance of AdaOPS and AdaSSP relative to the existing algorithms we discussed in Section 3. Kifer et al. (2012)’s version of objective perturbation worked best overall for small to medium and SSP worked best among baselines when is large or . NoisySGD and SubAgg are excluded because they are dominated by ObjPert.
Prediction accuracy in UCI data sets experiments.
We present the results for training linear regression on 36 UCI regression data sets. To preprocess these data sets, standard scoring is performed and all data points are normalized to . Results on four of the data sets are presented in Figure 3. As we can see, SSP is unstable for small data. The algorithm ObjPert (Kifer et al., 2012) performs well in some cases but cannot take advantage of the strong convexity that is intrinsic to the data set. AdaOPS and AdaSSP, on the other hand, are able to nicely interpolate between the trivial solution and the nonprivate baseline and perform as well as or better than baselines for all . More detailed quantitative results on all the 36 UCI data sets are presented in Tables 3 and 4 of Appendix A. As we can see, AdaOPS and AdaSSP outperform baselines in almost all data sets that we tested them on.
Parameter estimation under linear Gaussian model.
To illustrate the performance of the algorithms under standard statistical assumptions, we also benchmarked the algorithms on synthetic data generated by a linear Gaussian model. The results, shown in Figure 4, illustrate that as gets large, AdaOPS and AdaSSP with and converge to the maximum likelihood estimator at a rate faster than the optimal statistical rate at which the MLE estimates . Therefore, at least for large , differential privacy comes for free. Note that there is a gap between SSP and AdaSSP for large . This can be thought of as a cost of adaptivity, as AdaSSP needs to spend some portion of its privacy budget to release , which SSP does not. This can be fixed by a more careful splitting of the privacy budget.
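The nonprivate baseline in this kind of experiment can be reproduced in a few lines. The sketch below draws data from a linear Gaussian model with a fixed design, computes the MLE (ordinary least squares), and compares the empirical mean squared error with the classical value, the noise variance times the trace of the inverse Gram matrix. The function name, the Gaussian design, and the choice of ground-truth coefficients are hypothetical; this is not the experimental setup used for Figure 4.

```python
import numpy as np


def mle_mse_check(n, d, sigma, trials, rng):
    """Return (empirical MSE of the MLE, sigma^2 * trace((X^T X)^{-1}))."""
    X = rng.normal(size=(n, d))            # fixed design, drawn once
    theta_star = np.ones(d) / np.sqrt(d)   # hypothetical ground-truth coefficients
    gram = X.T @ X
    errors = np.empty(trials)
    for t in range(trials):
        # Fresh Gaussian noise on the labels in every trial.
        y = X @ theta_star + sigma * rng.normal(size=n)
        theta_mle = np.linalg.solve(gram, X.T @ y)
        errors[t] = np.sum((theta_mle - theta_star) ** 2)
    theoretical = sigma ** 2 * np.trace(np.linalg.inv(gram))
    return errors.mean(), theoretical
```

A private estimator gets privacy "for free" in the sense used above when its distance to the MLE shrinks faster than this statistical error as the sample size grows.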
6 Conclusion
In this paper, we presented a detailed case study of the problem of differentially private linear regression. We clarified the relationships between various quantities of the problem as they appear in the private and nonprivate information-theoretic lower bounds. We also surveyed the existing algorithms and highlighted that the main drawback of these algorithms relative to their nonprivate counterparts is that they cannot adapt to data-dependent quantities. This is particularly true for linear regression, where the ordinary least squares algorithm is able to work optimally for a wide range of problem classes.
We proposed AdaOPS and AdaSSP to address the issue and showed that they both work in unbounded domains. Moreover, they smoothly interpolate the two regimes studied in Bassily et al. (2014) and behave nearly optimally for every instance. We tested the two algorithms on 36 real-life regression data sets from the UCI machine learning repository and we see significant improvement over popular algorithms for almost all configurations of .
Future work includes extending the result beyond linear regression and releasing off-the-shelf packages for adaptive differentially private learning.
Acknowledgments
The author thanks Jing Lei for helpful discussions.
References
 Agresti & Finlay (1997) Agresti, A., & Finlay, B. (1997). Statistical methods for the social sciences.
 Armitage et al. (2008) Armitage, P., Berry, G., & Matthews, J. N. S. (2008). Statistical methods in medical research. John Wiley & Sons.
 Bassily et al. (2014) Bassily, R., Smith, A., & Thakurta, A. (2014). Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS14), (pp. 464–473). IEEE.
 Birgé & Massart (2001) Birgé, L., & Massart, P. (2001). Gaussian model selection. Journal of the European Mathematical Society, 3(3), 203–268.
 Chaudhuri et al. (2011) Chaudhuri, K., Monteleoni, C., & Sarwate, A. D. (2011). Differentially private empirical risk minimization. The Journal of Machine Learning Research, 12, 1069–1109.
 Dimitrakakis et al. (2014) Dimitrakakis, C., Nelson, B., Mitrokotsa, A., & Rubinstein, B. I. (2014). Robust and private Bayesian inference. In Algorithmic Learning Theory, (pp. 291–305). Springer.
 Donoho (1995) Donoho, D. L. (1995). Denoising by soft-thresholding. IEEE Transactions on Information Theory, 41(3), 613–627.
 Draper & Smith (2014) Draper, N. R., & Smith, H. (2014). Applied regression analysis, vol. 326. John Wiley & Sons.
 Dwork et al. (2006a) Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., & Naor, M. (2006a). Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, (pp. 486–503). Springer.
 Dwork & Lei (2009) Dwork, C., & Lei, J. (2009). Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, (pp. 371–380). ACM.
 Dwork et al. (2006b) Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006b). Calibrating noise to sensitivity in private data analysis. In Theory of cryptography, (pp. 265–284). Springer.
 Dwork & Smith (2010) Dwork, C., & Smith, A. (2010). Differential privacy for statistics: What we know and what we want to learn. Journal of Privacy and Confidentiality, 1(2), 2.
 Dwork et al. (2014) Dwork, C., Talwar, K., Thakurta, A., & Zhang, L. (2014). Analyze Gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, (pp. 11–20). ACM.
 Foulds et al. (2016) Foulds, J., Geumlek, J., Welling, M., & Chaudhuri, K. (2016). On the theory and practice of privacy-preserving Bayesian data analysis. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, (pp. 192–201). AUAI Press.
 Friedman et al. (2001) Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning, vol. 1. Springer series in statistics Springer, Berlin.
 Galton (1886) Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
 Greene (2003) Greene, W. H. (2003). Econometric analysis. Pearson Education India.
 Kifer et al. (2012) Kifer, D., Smith, A., & Thakurta, A. (2012). Private convex empirical risk minimization and high-dimensional regression. Journal of Machine Learning Research, 1, 41.
 Koren & Levy (2015) Koren, T., & Levy, K. (2015). Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems, (pp. 1477–1485).
 Laurent & Massart (2000) Laurent, B., & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, (pp. 1302–1338).
 Lei et al. (2017) Lei, J., Charest, A.-S., Slavkovic, A., Smith, A., & Fienberg, S. (2017). Differentially private model selection with penalized and constrained likelihood. Journal of the Royal Statistical Society.
 Shamir (2015) Shamir, O. (2015). The sample complexity of learning linear predictors with the squared loss. Journal of Machine Learning Research, 16, 3475–3486.
 Sheffet (2017) Sheffet, O. (2017). Differentially private ordinary least squares. In International Conference on Machine Learning, (pp. 3105–3114).
 Smith (2008) Smith, A. (2008). Efficient, differentially private point estimators. arXiv preprint arXiv:0809.4794.
 Stewart (1998) Stewart, G. W. (1998). Perturbation theory for the singular value decomposition. Tech. rep.
 Talwar et al. (2014) Talwar, K., Thakurta, A., & Zhang, L. (2014). Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417.
 Talwar et al. (2015) Talwar, K., Thakurta, A. G., & Zhang, L. (2015). Nearly optimal private lasso. In Advances in Neural Information Processing Systems, (pp. 3025–3033).
 Vu & Slavkovic (2009) Vu, D., & Slavkovic, A. (2009). Differential privacy for clinical trial data: Preliminary evaluations. In Data Mining Workshops, 2009. ICDMW’09. IEEE International Conference on, (pp. 138–143). IEEE.
 Wang (2017) Wang, Y.-X. (2017). Per-instance differential privacy and the adaptivity of posterior sampling in linear and ridge regression. arXiv preprint arXiv:1707.07708.
 Wang et al. (2015) Wang, Y.-X., Fienberg, S., & Smola, A. (2015). Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), (pp. 2493–2502).
 Wasserman (2013) Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.
Appendix A Results on the 36 real regression data sets in UCI repository
The detailed results on the 36 UCI data sets are presented in Table 3 for and Table 4 for . The boldface denotes the DP algorithm where the standard deviation is smaller than the error (a positive quantity), and the 95% confidence interval covers the observed best performance among benchmarked DP algorithms.
Appendix B Proof of the results for SSP and AdaSSP
In this section, we first derive the rates for the optimization and parameter estimation errors of the sufficient statistics perturbation (SuffPert) approach shown in Table 1 and Table 2. This builds intuition towards AdaSSP, whose proof we present towards the end of the section.