Efficient, Differentially Private Point Estimators
Differential privacy is a recent notion of privacy for statistical databases that provides rigorous, meaningful confidentiality guarantees, even in the presence of an attacker with access to arbitrary side information.
We show that for a large class of parametric probability models, one can construct a differentially private estimator whose distribution converges to that of the maximum likelihood estimator. In particular, it is efficient and asymptotically unbiased. This result provides (further) compelling evidence that rigorous notions of privacy in statistical databases can be consistent with statistically valid inference.
Privacy is a fundamental problem in modern data analysis. Increasing volumes of personal and sensitive data are collected by government and other organizations. The potential social benefits of analyzing these databases are significant; at the same time, releasing information from repositories of sensitive data can cause devastating damage to privacy. The challenge is to discover and release global characteristics of these databases, without compromising the privacy of the individuals whose data they contain.
There is a vast body of work on this problem in statistics and computer science. However, until recently, most schemes proposed in the literature lacked rigorous analysis of privacy and utility. Few works even formulated a precise definition of their schemes’ conjectured properties.
In this paper, we explore the potential of differential privacy, a definition of privacy due to Dwork et al. [16] that emerged from a line of work in cryptography [11, 18, 4]. This notion of privacy makes assumptions neither about what kind of attack might be perpetrated based on the released statistics, nor about what additional information the attacker might possess. It resolves a number of problems present in previous attempts at a definition. In particular, it provides precise guarantees in the presence of arbitrary side information available to the adversary but unknown to the organization that is releasing information.
Specifically, we show that for well-behaved parametric probability models, one can construct a differentially private estimator whose distribution converges to that of the maximum likelihood estimator (MLE). In particular, it is efficient and asymptotically unbiased. This provides (further) strong evidence that rigorous notions of database privacy can be consistent with statistically valid inference.
The problem of identifying which information in the database is safe to release has generated a vast body of work, both in statistics and computer science. Until recently, there were two nearly disjoint fields studying the data privacy problem: “statistical disclosure limitation” (also known as “data confidentiality”), initiated by the statistics community in the 1960s, and “privacy-preserving data mining”, active in the database community during the 1980s and rekindled at the turn of the 21st century by researchers in data mining. The literature in both fields is far too vast to survey here. For some pointers to the broader literature in statistics, see [33, 8, 9, 10, 31, 21]. For early work in computer science, see the survey in [1]. Recent work in data mining was started by [2] and led to an explosion of literature. For (partial) references, see [6, 23, 32].
However, the schemes proposed in these fields lack rigorous analysis of privacy. Typically, the schemes have either no formal privacy guarantees or ensure security only against a specific suite of attacks. This leaves them potentially vulnerable to unforeseen attacks, and makes it difficult to compare different schemes because each of them is basically solving a different problem.
A recent line of work [11, 20, 18, 4, 16, 15, 12, 29, 17, 27, 3, 30, 5, 19, 25, 24], called private data analysis, seeks to place data privacy on firmer theoretical foundations and has been successful at formulating a strong, yet attainable privacy definition. The intuition behind the definition, which is due to Dwork et al. [16], is that whether an individual supplies her actual or fake information has almost no effect on the outcome of the analysis. Roughly, a randomized algorithm that takes sensitive data as input and outputs a product for publication is considered privacy-preserving if databases that differ in one entry induce nearby distributions on its outcomes (see below for a precise definition).
This paper provides a qualitatively different result from previous work, in that it relates the perturbation added for differential privacy to the provably optimal error of point estimators. For a broad class of problems, we show that differential privacy can be provided at an asymptotically vanishing cost to accuracy.
Specifically, we show a modification to the maximum likelihood estimator for parametric models which satisfies differential privacy and is asymptotically efficient, meaning that the average squared error of the estimator is $(1+o(1))/(n\,I_f(\theta))$, where $n$ is the number of samples in the input, $I_f(\theta)$ denotes the Fisher information of $f$ at $\theta$, and $o(1)$ denotes a function that tends to zero as $n$ tends to infinity. Differential privacy is quantified by a parameter $\epsilon$ which measures information leakage; our estimator satisfies differential privacy with $\epsilon = o(1)$.
Consider a parameter estimation problem defined by a model $f(x;\theta)$, where $\theta$ is a real-valued vector in a bounded space $\Theta \subseteq \mathbb{R}^p$ of diameter $\Lambda$, and $x$ takes values in a domain $D$ (typically, either a real vector space or a finite, discrete set).
We will generally use the following notational convention: capital Latin letters ($X$, $T$, etc.) refer to random variables or processes. Their lower case analogues refer to fixed, deterministic values of these random objects (i.e. scalars, vectors, or functions).
Given i.i.d. random variables $X = (X_1, \dots, X_n)$ drawn according to the distribution $f(\cdot;\theta)$, we would like to estimate $\theta$ using an estimator $T$ that takes as input the data $X$ as well as an additional, independent source of randomness $R$ (used, in our case, for perturbation):
$$T = T(X, R).$$
Even for a fixed input $x = (x_1, \dots, x_n)$, the estimator $T(x) = T(x, R)$ is a random variable distributed in the parameter space $\Theta$. For example, it might consist of a deterministic function value that is perturbed using additive random noise, or it might consist of a sample from a posterior distribution constructed based on $x$. We will use the capital letter $T$ to denote the random variable, and lower case $t$ to denote a specific value in $\Theta$. Thus, the random variable $T(X)$ is generated from two sources of randomness: the samples $X$ and the random bits $R$ used by $T$.
We say two fixed data sets $x$ and $x'$ in $D^n$ are neighbors if $x$ and $x'$ agree in all but one position, that is, for some $i$,
$$x = (x_1, \dots, x_{i-1}, x_i, x_{i+1}, \dots, x_n) \quad\text{and}\quad x' = (x_1, \dots, x_{i-1}, x_i', x_{i+1}, \dots, x_n).$$
Differential privacy compares the distributions of $T(x)$ and $T(x')$ corresponding to neighboring data sets $x, x'$. It requires that for all possible pairs of neighboring data sets, the corresponding distributions be close:
A randomized algorithm $T$ is $\epsilon$-differentially private if for all neighboring pairs of databases $x$ and $x'$, and for all measurable subsets of outputs (events) $S$:
$$\Pr\big(T(x) \in S\big) \le e^{\epsilon} \cdot \Pr\big(T(x') \in S\big).$$
This condition states that no single point in the input can significantly influence the distribution of the estimator. Note that the privacy condition makes no reference to a distribution on $x$. It is a “worst-case” notion of privacy that provides a guarantee even when our modeling of the distribution on $X$ is incorrect.
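Given two probability measures $P$ and $Q$ on a space $\Omega$, we can define the multiplicative distance between $P$ and $Q$ to be
$$\mathrm{dist}_\times(P, Q) = \sup_{S \subseteq \Omega} \left| \ln \frac{P(S)}{Q(S)} \right|,$$
where the supremum is over measurable sets $S$.
(We say $\mathrm{dist}_\times(P, Q) = \infty$ when the supremum above doesn’t exist.) Thus, $\epsilon$-differential privacy requires that, for all fixed neighboring data sets $x$ and $x'$, the multiplicative distance between (the distributions of) $T(x)$ and $T(x')$ be at most $\epsilon$. The exact choice of distance function significantly affects the practical meaning of differential privacy; see Section 4, Remark 2 in  and  for discussion.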
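To make the multiplicative distance concrete, the following minimal sketch computes it for two discrete distributions. It is an illustration only: the function names and the choice of a one-bit randomized response (in the spirit of Warner [33]) with truth probability 3/4 are assumptions, not part of the construction in this paper. For discrete distributions the supremum over events is attained on a single outcome, so scanning outcomes suffices.

```python
import math

def multiplicative_distance(p, q):
    """sup over events S of |ln(P(S)/Q(S))| for two discrete distributions.

    p and q map outcomes to probabilities. The ratio P(S)/Q(S) of a set is a
    weighted average of single-outcome ratios, so the supremum is reached on
    one outcome. Returns math.inf if one distribution puts positive mass on
    an outcome that the other cannot produce.
    """
    worst = 0.0
    for outcome in set(p) | set(q):
        a, b = p.get(outcome, 0.0), q.get(outcome, 0.0)
        if a == 0.0 and b == 0.0:
            continue
        if a == 0.0 or b == 0.0:
            return math.inf
        worst = max(worst, abs(math.log(a / b)))
    return worst

def randomized_response(bit, p_truth=0.75):
    """Output distribution when a bit is reported truthfully w.p. p_truth."""
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}

# Neighboring one-entry "databases" holding bit 0 and bit 1.
eps = multiplicative_distance(randomized_response(0), randomized_response(1))
print(eps)  # equals ln(0.75 / 0.25) = ln 3
```

Under these assumptions the mechanism is $\epsilon$-differentially private exactly for $\epsilon \ge \ln 3$.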
The MLE and Efficiency
Many methods exist to measure the quality of a point estimator. In this paper, we consider the expected squared deviation from the real parameter $\theta$. For a one-dimensional parameter ($p = 1$), this can be written:
$$J_T(\theta) = E_\theta\big[(T(X) - \theta)^2\big].$$
The notation $E_\theta$ refers to the fact that $X$ is drawn i.i.d. according to $f(\cdot;\theta)$. If $T$ is unbiased, then $J_T(\theta)$ is simply the variance $\mathrm{Var}_\theta(T(X))$. Note that all these notations are equally well-defined for a randomized estimator $T(x) = T(x, R)$. The expectation is then also taken over the choice of $R$, i.e.
$$J_T(\theta) = E_{\theta, R}\big[(T(X, R) - \theta)^2\big].$$
(Mean squared error can be defined analogously for higher-dimensional parameter vectors. For simplicity we focus here on the one-dimensional case. The development of a higher-dimensional analogue is identical, as long as $p$ is constant with respect to $n$.)
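The maximum likelihood estimator $\hat\theta_{\mathrm{MLE}}(x)$ returns a value that maximizes the likelihood function $L_x(\theta) = \prod_{i=1}^{n} f(x_i;\theta)$, if such a maximum exists. It is a classic result that, for well-behaved parametric families, the MLE exists with high probability and is asymptotically normal, centered around the true value $\theta$. Moreover, its expected squared error is asymptotically $1/n$ times the inverse of the Fisher information at $\theta$,
$$I_f(\theta) = E_\theta\left[\left(\frac{\partial}{\partial\theta} \ln f(X_1;\theta)\right)^{2}\right].$$
Under appropriate regularity conditions, the MLE converges in distribution to a Gaussian centered at $\theta$, that is, $\sqrt{n}\,(\hat\theta_{\mathrm{MLE}} - \theta) \to N\big(0, 1/I_f(\theta)\big)$ in distribution. Moreover, $J_{\hat\theta_{\mathrm{MLE}}}(\theta) = \frac{1+o(1)}{n\,I_f(\theta)}$, where $o(1)$ denotes a function of $n$ that tends to zero as $n$ tends to infinity.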
The MLE has optimal mean squared error $J$ among unbiased estimators; estimators that match this bound are called efficient.
An estimator $T$ is asymptotically efficient for a model $f(\cdot;\theta)$ if, for all $\theta \in \Theta$, the expected squared error converges to that of the MLE, that is, for all $\theta$,
$$J_T(\theta) = \frac{1+o(1)}{n\,I_f(\theta)}.$$
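As a sanity check on these definitions, consider a standard textbook case that is not taken from this paper: the Bernoulli model, where the MLE is the sample mean. The model and the symbol $\bar X$ are assumptions chosen for illustration. Here the Fisher information and the mean squared error can be computed in closed form, and the MLE meets the efficiency bound exactly for every $n$.

```latex
\begin{align*}
f(x;\theta) &= \theta^{x}(1-\theta)^{1-x}, \qquad x \in \{0,1\},\ \theta \in (0,1), \\
\frac{\partial}{\partial\theta}\ln f(x;\theta)
  &= \frac{x}{\theta} - \frac{1-x}{1-\theta}
   = \frac{x-\theta}{\theta(1-\theta)}, \\
I_f(\theta)
  &= E_\theta\!\left[\left(\frac{X-\theta}{\theta(1-\theta)}\right)^{2}\right]
   = \frac{\theta(1-\theta)}{\theta^{2}(1-\theta)^{2}}
   = \frac{1}{\theta(1-\theta)}, \\
\hat\theta_{\mathrm{MLE}} &= \bar X = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
E_\theta\big[(\bar X-\theta)^{2}\big]
   = \frac{\theta(1-\theta)}{n}
   = \frac{1}{n\,I_f(\theta)}.
\end{align*}
```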
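The asymptotic efficiency of the MLE implies that its bias, $b_{\mathrm{MLE}}(\theta) = E_\theta\big(\hat\theta_{\mathrm{MLE}} - \theta\big)$, goes to zero more quickly than $1/\sqrt{n}$. However, in our main result, we will need an estimator with much lower bias. This can be obtained via a (standard) process known as bias correction.
Under appropriate regularity assumptions, we can describe the bias of the MLE precisely, namely
$$b_{\mathrm{MLE}}(\theta) = \frac{b_1(\theta)}{n} + O\big(n^{-3/2}\big),$$
where $b_1(\theta)$ has a uniformly bounded derivative (see, for example, discussions in Cox and Hinkley [7], Firth [22], and Li [26]). Several methods exist for correcting this bias. The simplest is to subtract off an estimate of the leading term, using $\hat\theta_{\mathrm{MLE}}$ to estimate $b_1(\theta)$; the result is called the bias-corrected MLE,
$$\hat\theta_{\mathrm{bc}} = \hat\theta_{\mathrm{MLE}} - \frac{b_1(\hat\theta_{\mathrm{MLE}})}{n}.$$
The bias-corrected MLE $\hat\theta_{\mathrm{bc}}$ converges at the same rate as the MLE but with lower bias, namely, $\sqrt{n}\,(\hat\theta_{\mathrm{bc}} - \theta) \to N\big(0, 1/I_f(\theta)\big)$ in distribution and
$$b_{\mathrm{bc}}(\theta) = E_\theta\big(\hat\theta_{\mathrm{bc}} - \theta\big) = O\big(n^{-3/2}\big).$$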
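The effect of bias correction can be seen in a small simulation. The sketch below uses the exponential model with rate $\lambda$, an illustrative choice that does not come from this paper: the MLE $n/\sum_i X_i$ has bias $\lambda/(n-1) = \lambda/n + O(n^{-2})$, so the leading term is $b_1(\lambda) = \lambda$ and the corrected estimator is the MLE times $(1 - 1/n)$. All function names and parameter values are assumptions. In this particular family the correction happens to remove the bias exactly; in general it only lowers its order.

```python
import random

def mle_rate(sample):
    """MLE of the rate of an exponential distribution: n / sum(x)."""
    return len(sample) / sum(sample)

def bias_corrected_rate(sample):
    """Subtract the estimated leading bias term b1(est)/n, with b1(r) = r."""
    n = len(sample)
    est = mle_rate(sample)
    return est - est / n

def simulate_bias(true_rate=2.0, n=10, trials=50_000, seed=0):
    """Monte Carlo estimate of the bias of both estimators."""
    rng = random.Random(seed)
    total_mle = total_bc = 0.0
    for _ in range(trials):
        sample = [rng.expovariate(true_rate) for _ in range(n)]
        total_mle += mle_rate(sample)
        total_bc += bias_corrected_rate(sample)
    return total_mle / trials - true_rate, total_bc / trials - true_rate

mle_bias, bc_bias = simulate_bias()
print(f"MLE bias ~ {mle_bias:.3f} (theory: {2.0 / 9:.3f}); corrected ~ {bc_bias:.3f}")
```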
3 A Private, Efficient Estimator
We can now state our main result:
Under appropriate regularity conditions, there exists a (randomized) estimator $T^*$ which is asymptotically efficient and $\epsilon$-differentially private, where $\epsilon \to 0$ as $n \to \infty$.
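More precisely, the construction takes as input the parameter $\epsilon$ and produces an estimator $T^*$ with mean squared error $\frac{1}{n\,I_f(\theta)}\big(1 + O(n^{-1/5}\epsilon^{-6/5}) + o(1)\big)$. Thus, as long as $\epsilon$ goes to 0 more slowly than $n^{-1/6}$, the estimator will be asymptotically efficient.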
The idea is to apply the “sample-and-aggregate” method of [29], similar in spirit to the parametric bootstrap. The procedure is quite general and can be instantiated in several variants. We present a particular version which is sufficient to prove our main theorem.
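The estimator $T^*$ takes the data $x$ as well as a parameter $\epsilon$ (which measures information leakage) and a positive integer $k$ (to be determined later). The idea is to break the input into $k$ blocks of $n/k$ points each, compute the (bias-corrected) MLE on each block, and release the average of these estimates plus some small additive perturbation. The procedure is given in Algorithm 1 and illustrated in Figure 1.
The resulting estimator has the form
$$T^*(x) = \bar z + Y, \qquad \bar z = \frac{1}{k}\sum_{j=1}^{k} z_j, \qquad z_j = \hat\theta_{\mathrm{bc}}\big(x_{(j-1)n/k+1}, \dots, x_{jn/k}\big),$$
where $Y \sim \mathrm{Lap}\big(\Lambda/(k\epsilon)\big)$ is Laplace noise with density $h(y) = \frac{k\epsilon}{2\Lambda}\exp\big(-k\epsilon|y|/\Lambda\big)$.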
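The block-average-plus-noise procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the paper's Algorithm 1: the function names are hypothetical, the parameter is one-dimensional with $\Theta = [\text{lower}, \text{upper}]$, block estimates are clamped to that interval so that each lies in the parameter space, the noise is Laplace with scale $\Lambda/(k\epsilon)$, and any $n \bmod k$ leftover points are simply dropped.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) draw, as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_estimate(data, block_estimator, k, epsilon, lower, upper, rng):
    """Average k block estimates and add Laplace noise of scale diameter/(k*eps).

    block_estimator is any (ideally bias-corrected) point estimator on a list.
    Clamping to [lower, upper] is an assumption that keeps each block estimate
    inside the bounded parameter space, so one changed input point moves the
    average by at most (upper - lower) / k.
    """
    block_size = len(data) // k
    if block_size < 1:
        raise ValueError("need at least one data point per block")
    estimates = []
    for j in range(k):
        block = data[j * block_size:(j + 1) * block_size]
        z = block_estimator(block)
        estimates.append(min(max(z, lower), upper))
    z_bar = sum(estimates) / k
    diameter = upper - lower
    return z_bar + laplace_noise(diameter / (k * epsilon), rng)

# Usage: estimate a Bernoulli mean; the sample mean is its (unbiased) MLE.
rng = random.Random(0)
data = [1.0] * 600 + [0.0] * 400
rng.shuffle(data)
estimate = private_estimate(data, lambda b: sum(b) / len(b),
                            k=50, epsilon=1.0, lower=0.0, upper=1.0, rng=rng)
print(estimate)  # 0.6 plus Laplace noise of scale 1/50
```

Passing the random generator explicitly keeps the sketch reproducible; a real deployment would need a cryptographically sound noise source, which is outside the scope of this illustration.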
For any choice of the number of blocks $k$, the estimator $T^*$ is $\epsilon$-differentially private.
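Proof. Fix a particular value of $x$, and consider the effect of changing a single entry $x_i$ to obtain a database $x'$ (for any particular index $i$). At most one of the block estimates $z_1, \dots, z_k$ can change, depending on the block which contains $x_i$. The number that changes can go up or down by at most $\Lambda$, since the parameter takes values in $\Theta$, a set of diameter $\Lambda$. This means that the mean $\bar z$ can change by at most $\Lambda/k$.
The random variables $T^*(x)$ and $T^*(x')$ are thus Laplace random variables with identical standard deviations (scale parameter $\Lambda/(k\epsilon)$) and means $\bar z$, $\bar z'$ differing by at most $\Lambda/k$. Let $h$ and $h'$ be the corresponding density functions. As in [4, 16], observe that the ratio of their densities is at most $e^{\epsilon}$, since for any real number $y$:
$$\frac{h(y)}{h'(y)} = \frac{\exp\big(-\frac{k\epsilon}{\Lambda}|y - \bar z|\big)}{\exp\big(-\frac{k\epsilon}{\Lambda}|y - \bar z'|\big)} \le \exp\Big(\frac{k\epsilon}{\Lambda}\,|\bar z - \bar z'|\Big) \le e^{\epsilon}.$$
Similarly, the ratio is bounded below by $e^{-\epsilon}$. For any measurable set $S$ with non-zero measure, the ratio $\Pr(T^*(x) \in S)/\Pr(T^*(x') \in S)$ is thus between $e^{-\epsilon}$ and $e^{\epsilon}$. This is exactly the requirement of differential privacy. ∎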
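The density-ratio argument can be checked numerically. The sketch below assumes Laplace noise with scale $\Lambda/(k\epsilon)$ and two means that differ by the largest possible amount, $\Lambda/k$; the specific values of $\Lambda$, $k$, $\epsilon$ and the evaluation grid are arbitrary choices made for illustration.

```python
import math

def laplace_pdf(y, mean, scale):
    """Density of the Laplace distribution with the given mean and scale."""
    return math.exp(-abs(y - mean) / scale) / (2.0 * scale)

def max_abs_log_ratio(mean_a, mean_b, scale, grid):
    """Largest |ln(h(y)/h'(y))| over the grid points."""
    return max(
        abs(math.log(laplace_pdf(y, mean_a, scale) / laplace_pdf(y, mean_b, scale)))
        for y in grid
    )

diameter, k, epsilon = 1.0, 20, 0.5      # illustrative values only
scale = diameter / (k * epsilon)         # noise scale Lambda / (k * eps)
shift = diameter / k                     # worst-case change in the block mean
grid = [i / 1000.0 for i in range(-2000, 2001)]

worst = max_abs_log_ratio(0.3, 0.3 + shift, scale, grid)
print(worst, "<=", epsilon)
```

The bound is tight: for any $y$ outside the interval between the two means, the log-ratio equals $\epsilon$ in absolute value.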
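Under the regularity conditions of Lemma 2.2, if $\epsilon = \omega(n^{-1/6})$ and $k$ is set appropriately, the estimator $T^*$ is asymptotically unbiased, normal and efficient, that is,
$$\sqrt{n}\,\big(T^*(X) - \theta\big) \to N\big(0, 1/I_f(\theta)\big) \quad\text{in distribution.}$$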
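Proof. We will select $k$ as a function of $n$ and $\epsilon$. For now, assume that $n/k$ goes to infinity with $n$. By Lemma 2.1, each block estimate $z_j$ converges to normal, and moreover the bias and variance of $z_j$ can be bounded:
$$E(z_j) - \theta = O\big((k/n)^{3/2}\big), \qquad \mathrm{Var}(z_j) = \frac{k}{n\,I_f(\theta)}\big(1 + o(1)\big).$$
Consider the averaged estimator $\bar z = \frac{1}{k}\sum_{j} z_j$. Its expectation is equal to the expectations of the $z_j$, while its variance scales with $1/k$:
$$E(\bar z) - \theta = O\big((k/n)^{3/2}\big), \qquad \mathrm{Var}(\bar z) = \frac{1}{n\,I_f(\theta)}\big(1 + o(1)\big).$$
Recall that the mean squared error is the sum of the variance and squared bias of an estimator. Since the squared bias is $O(k^3/n^3)$, it vanishes asymptotically compared to the variance as long as $k^3/n^3 = o(1/n)$, that is, as long as $k = o(n^{2/3})$.
Thus, for sufficiently small $k$, the estimator $\bar z$ is efficient. We now consider for which values of $k$ the added noise is small enough so that it does not affect the efficiency of $T^*$. The noise added to $\bar z$ to get $T^*$ does not contribute to the bias of the estimator, but does add to the variance. Specifically: $E(T^*) = E(\bar z)$ and
$$\mathrm{Var}(T^*) = \mathrm{Var}(\bar z) + \frac{2\Lambda^2}{k^2\epsilon^2}.$$
If $\epsilon = \omega(n^{-1/6})$ then we can choose $k$ to ensure that $J_{T^*}(\theta) = \frac{1+o(1)}{n\,I_f(\theta)}$. We need $k = o(n^{2/3})$ to get sufficiently small bias and $k = \omega(\sqrt{n}/\epsilon)$ to get the variance of the noise sufficiently low. Taking $k = n^{3/5}\epsilon^{-2/5}$ yields a ratio of mean squared errors that tends to 1, namely:
$$\frac{J_{T^*}(\theta)}{J_{\hat\theta_{\mathrm{MLE}}}(\theta)} = 1 + O\Big(\big(1 + \Lambda^2\big)\, n^{-1/5}\epsilon^{-6/5}\Big) + o(1).$$
Since $\Lambda$ is constant with respect to $n$, $T^*$ is efficient as long as $\epsilon = \omega(n^{-1/6})$, as desired. ∎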
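The trade-off in the choice of the number of blocks can be made tangible with a little arithmetic. The sketch below is a back-of-the-envelope calculation under explicit assumptions, not a statement of the paper's constants: it takes block bias of order $(k/n)^{3/2}$ with an unknown constant set to 1, Laplace noise of scale $\Lambda/(k\epsilon)$ (variance $2\Lambda^2/(k\epsilon)^2$), Fisher information 1, and the balanced choice $k = n^{3/5}\epsilon^{-2/5}$. It reports the extra error from noise and from bias relative to the baseline $1/(n\,I_f(\theta))$.

```python
def choose_k(n, epsilon):
    """Balanced block count k = n^(3/5) * eps^(-2/5), rounded (an assumption)."""
    return max(1, round(n ** 0.6 * epsilon ** -0.4))

def relative_overheads(n, k, epsilon, diameter=1.0, fisher=1.0, bias_const=1.0):
    """Noise variance and squared bias, each divided by 1 / (n * fisher).

    bias_const stands in for the unknown constant in the O((k/n)^(3/2)) bias.
    """
    baseline = 1.0 / (n * fisher)
    noise = 2.0 * diameter ** 2 / (k * epsilon) ** 2
    bias_sq = (bias_const * (k / n) ** 1.5) ** 2
    return noise / baseline, bias_sq / baseline

n, epsilon = 10 ** 6, 1.0
k = choose_k(n, epsilon)
noise_rel, bias_rel = relative_overheads(n, k, epsilon)
print(k, round(noise_rel, 4), round(bias_rel, 4))

# Both overheads shrink like n^(-1/5) * eps^(-6/5) as n grows.
for m in (10 ** 4, 10 ** 6, 10 ** 8):
    print(m, relative_overheads(m, choose_k(m, epsilon), epsilon))
```

With these values, $k$ falls between $\sqrt{n}/\epsilon = 1000$ and $n^{2/3} = 10000$, which is the window required for both the noise and the bias to be negligible.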
I am grateful to many colleagues in both statistics and computer science for helpful discussions about this project. I would especially like to thank Bing Li, from Penn State’s Department of Statistics, for insightful conversations about bias correction and asymptotic expansions of statistical functionals.
- N. R. Adam and J. C. Wortmann. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys, 21(4), 1989.
- R. Agrawal and R. Srikant. Privacy-preserving data mining. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, SIGMOD Conference, pages 439–450. ACM, 2000.
- B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In L. Libkin, editor, PODS, pages 273–282. ACM, 2007.
- A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: The SuLQ framework. In PODS, 2005.
- A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In Symposium on the Theory of Computing (STOC), 2008.
- C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu. Tools for privacy preserving data mining. SIGKDD Explorations, 4(2):28–34, 2002.
- D. R. Cox and D. V. Hinkley. Theoretical Statistics. Chapman-Hall, 1974.
- T. Dalenius. Towards a methodology for statistical disclosure control. Statistik Tidskrift, (5):35–64, 1977.
- T. Dalenius and S. Reiss. Data-swapping: A technique for disclosure control. Journal of Statistical Planning and Inference, (6):73–85, 1982.
- P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions. The Annals of Statistics, 26(1):363–397, 1998.
- I. Dinur and K. Nissim. Revealing information while preserving privacy. In PODS, pages 202–210, 2003.
- C. Dwork. Differential privacy. In ICALP, LNCS, pages 1–12, 2006.
- C. Dwork. An ad omnia approach to defining and achieving private data analysis. In F. Bonchi, E. Ferrari, B. Malin, and Y. Saygin, editors, PinKDD, volume 4890 of Lecture Notes in Computer Science, pages 1–13. Springer, 2007.
- C. Dwork. Differential privacy: A survey of results. In M. Agrawal, D.-Z. Du, Z. Duan, and A. Li, editors, TAMC, volume 4978 of Lecture Notes in Computer Science, pages 1–19. Springer, 2008.
- C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503, 2006.
- C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In S. Halevi and T. Rabin, editors, TCC, volume 3876 of Lecture Notes in Computer Science, pages 265–284. Springer, 2006.
- C. Dwork, F. McSherry, and K. Talwar. The price of privacy and the limits of LP decoding. In D. S. Johnson and U. Feige, editors, STOC, pages 85–94. ACM, 2007.
- C. Dwork and K. Nissim. Privacy-preserving datamining on vertically partitioned databases. In CRYPTO, pages 528–544, 2004.
- C. Dwork and S. Yekhanin. On lower bounds for noise in private analysis of statistical databases. Presentation at BSF/DIMACS/DyDan Workshop on Data Privacy, February 2008.
- A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211–222, 2003.
- S. E. Fienberg and A. B. Slavkovic. Making the release of confidential data from multi-way tables count. Chance, 17(3), 2004.
- D. Firth. Bias reduction of maximum likelihood estimates. Biometrika, 80(1):27–38, 1993.
- J. Gehrke. Models and methods for privacy-preserving data publishing and analysis (tutorial slides). In Twelfth Annual SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2006), 2006.
- S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What can we learn privately? In FOCS, 2008, To Appear.
- S. P. Kasiviswanathan and A. Smith. A note on differential privacy: Defining resistance to arbitrary side information. CoRR, arXiv:0803.3946 [cs.CR], 2008.
- B. Li. An optimal estimating equation based on the first three cumulants. Biometrika, 85(1):103–114, 1998.
- F. McSherry and K. Talwar. Differential privacy in mechanism design. In A. Sinclair, editor, IEEE Symposium on the Foundations of Computer Science (FOCS), October 2007.
- K. Nissim. Private data analysis via output perturbation. In C. C. Aggarwal and P. S. Yu, editors, Privacy-Preserving Data Mining: Models and Algorithms, pages 383–413, 2008.
- K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In U. Feige, editor, Symposium on the Theory of Computing (STOC), 2007.
- V. Rastogi, S. Hong, and D. Suciu. The boundary between privacy and utility in data publishing. In VLDB, pages 531–542, 2007.
- A. Slavkovic. Statistical Disclosure Limitation Beyond the Margins: Characterization of Joint Distributions for Contingency Tables. Ph.D. Thesis, Department of Statistics, Carnegie Mellon University, 2004.
- L. Sweeney. Privacy-enhanced linking. SIGKDD Explorations, 7(2):72–75, 2005.
- S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.