Differentially private anonymized histograms

Ananda Theertha Suresh
Google Research New York
theertha@google.com
Abstract

For a dataset of label-count pairs, an anonymized histogram is the multiset of counts. Anonymized histograms appear in various potentially sensitive contexts such as password-frequency lists, degree distributions in social networks, and the estimation of symmetric properties of discrete distributions. Motivated by these applications, we propose the first differentially private mechanism to release anonymized histograms that achieves a near-optimal privacy-utility trade-off both in terms of the number of items and the privacy parameter. Further, if the underlying histogram is given in a compact format, the proposed algorithm runs in time sub-linear in the number of items. For anonymized histograms generated from unknown discrete distributions, we show that the released histogram can be directly used to estimate symmetric properties of the underlying distribution.

1 Introduction

Given a set of labels , a dataset is a collection of labels and counts, . An anonymized histogram of such a dataset is the unordered multiset of all non-zero counts without any label information,

For example, if , , then (note that a histogram is a multiset and not a set; duplicates are allowed). Anonymized histograms do not contain any information about the labels, including the cardinality of . Furthermore, we only consider histograms with positive counts; the results can be extended to histograms that include zero counts. A histogram can also be represented succinctly using prevalences. For a histogram , the prevalence is the number of elements in the histogram with count ,

In the above example, , , and for . Anonymized histograms are also referred to as histogram of histograms Batu et al. (2000), histogram order statistics Paninski (2003), profiles Orlitsky et al. (2004), unattributed histograms Hay et al. (2010), fingerprints Valiant and Valiant (2011a), and frequency lists Blocki et al. (2016).
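For concreteness, the following minimal sketch (in Python, with a made-up three-label dataset) shows how the count form and the prevalence form of an anonymized histogram relate; the labels themselves play no role.

```python
from collections import Counter

def prevalences(counts):
    """Prevalence form: phi[r] = number of labels whose count equals r."""
    return dict(Counter(counts))

# Hypothetical dataset: only the multiset of counts matters, not the labels.
dataset = {"a": 2, "b": 3, "c": 2}            # label -> count
histogram = sorted(dataset.values())          # anonymized histogram: [2, 2, 3]
print(prevalences(histogram))                 # {2: 2, 3: 1}
```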

Anonymized histograms appear in several potentially sensitive contexts ranging from password frequency lists to social networks. Before we proceed to the problem formulation and results, we first provide an overview of the various contexts where anonymized histograms have been studied under differential privacy and their motivation.

Password frequency lists: Anonymized password histograms are useful to security researchers who wish to understand the underlying password distribution in order to estimate the security risks or evaluate various password defenses Bonneau (2012); Blocki et al. (2016). For example, if is the most frequent password, then is the number of accounts that an adversary could compromise with guesses per user. Hence, if the server changes the -strikes policy from to , the frequency distribution can be used to evaluate the security implications of this change. We refer readers to Blocki et al. (2016, 2018) for more uses of password frequency lists. Despite their usefulness, organizations may be wary of publishing these lists due to privacy concerns. This is further justified as it is not unreasonable to expect that an adversary will have some side information based on attacks against other organizations. Motivated by this, Blocki et al. (2016) studied the problem of releasing anonymized password histograms.

Degree-distributions in social networks: The degree distribution is one of the most widely studied properties of a graph, as it influences the graph's structure. It can also be used to answer linear queries such as the number of -stars. However, some graphs have unique degree distributions, and releasing exact degree distributions is no safer than naive anonymization, which can leave social network participants vulnerable to a variety of attacks Backstrom et al. (2007); Hay et al. (2008); Narayanan and Shmatikov (2009); thus releasing them exactly can be revealing. Hence, Hay et al. (2009, 2010); Karwa and Slavković (2012); Kasiviswanathan et al. (2013); Raskhodnikova and Smith (2016); Blocki (2016); Day et al. (2016) considered the problem of releasing degree distributions of graphs with differential privacy. Degree distributions are anonymized histograms over the graph node degrees.

Estimating symmetric properties of discrete distributions: Let . A discrete distribution is a mapping from a domain to such that . Given a discrete distribution over symbols, a symmetric property is a property that depends only on the multiset of probabilities Valiant and Valiant (2011b); Acharya et al. (2017), e.g., entropy ( ). Other symmetric properties include support size, Rényi entropy, distance to uniformity, and support coverage. Given independent samples from an unknown , the goal of property estimation is to estimate the value of the symmetric property of interest for . Estimating symmetric properties of unknown distributions has received wide attention in the recent past, e.g., Valiant and Valiant (2011a, b); Jiao et al. (2015); Wu and Yang (2016); Orlitsky et al. (2016); Acharya et al. (2017), and has applications in various fields from neuroscience Paninski (2003) to genetics Zou et al. (2016). Recently, Acharya et al. (2018) proposed algorithms to estimate support size, support coverage, and entropy with differential privacy. Optimal estimators for symmetric properties depend only on the anonymized histograms of the samples Batu et al. (2000); Acharya et al. (2017). Hence, releasing anonymized histograms with differential privacy simultaneously yields differentially private plug-in estimators for all symmetric properties.

2 Differential privacy

2.1 Definitions

Before we outline our results, we first define the privacy and utility aspects of anonymized histograms. Privacy has been studied extensively in statistics and computer science; see Dalenius (1977); Adam and Worthmann (1989); Agrawal and Aggarwal (2001); Dwork (2008) and references therein. Perhaps the most studied form of privacy is differential privacy (DP) Dwork et al. (2006, 2014), where the objective is to ensure that an adversary cannot infer whether or not a user is present in the dataset.

We study the problem of releasing anonymized histograms through the lens of global DP. We begin by defining the notion of DP. Formally, given a set of datasets , a notion of neighboring datasets , and a query function for some domain , a mechanism is said to be -DP if, for any two neighboring datasets and all ,

(1)

Broadly speaking, -DP ensures that, given the output, an attacker would not be able to differentiate between any two neighboring datasets. -DP is also called pure DP and provides a stricter guarantee than approximate -DP, where equation (1) needs to hold only with probability .

Since its introduction, DP has been studied extensively in applications ranging from dataset release to training machine learning models Abadi et al. (2016), and it has been adopted by industry Erlingsson et al. (2014). There are two models of DP. The first is server (also called global or output) DP, where a centralized entity has access to the entire dataset and answers queries in a DP manner. The second is local DP, where -DP is guaranteed for each individual user's data Warner (1965); Kasiviswanathan et al. (2011); Duchi et al. (2013); Kairouz et al. (2014); Acharya et al. (2019). We study the problem of releasing anonymized histograms under global DP. Here, is the set of anonymized histograms, is the identity mapping, and .

2.2 Distance measure

For DP, a general notion of neighbors is as follows: two datasets are neighbors if and only if one can be obtained from the other by adding or removing a user Dwork (2008). Since anonymized histograms do not contain explicit user information, we need a few definitions to apply this notion. We first define a notion of distance between label-count datasets. A natural notion of distance between datasets and over is the distance, , where is the count of in dataset . Since anonymized histograms do not contain any label information, we define the distance between two histograms as

(2)

The following simple lemma characterizes the above distance in terms of counts.

Lemma 1 (Appendix B).

For an anonymized histogram , let be the highest count in the dataset (for larger than the number of counts in , ). For any two anonymized histograms ,

The above distance is also referred to as sorted distance or earth-mover’s distance. With the above definition of distance, we can define neighbors as follows.
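As a concrete illustration of Lemma 1, the sketch below computes the sorted (earth-mover's) distance by matching counts in decreasing order and padding the shorter histogram with zeros; the example histograms are hypothetical.

```python
def sorted_l1_distance(h1, h2):
    """Sorted-L1 distance: sort counts in decreasing order, pad with zeros,
    and sum coordinate-wise absolute differences (cf. Lemma 1)."""
    a, b = sorted(h1, reverse=True), sorted(h2, reverse=True)
    n = max(len(a), len(b))
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    return sum(abs(x - y) for x, y in zip(a, b))

# One user's count changes from 3 to 2: distance 1, so the histograms are neighbors.
print(sorted_l1_distance([2, 2, 3], [2, 2, 2]))  # 1
```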

Definition 1.

Two anonymized histograms and are neighbors if and only if .

The above definition of neighboring histograms is the same as the definition of neighbors used in previous works on anonymized histograms Hay et al. (2010); Blocki et al. (2016).

3 Previous and new results

3.1 Anonymized histogram estimation

Similar to previous works Blocki et al. (2016), we measure the utility of the algorithm in terms of the number of items in the anonymized histogram, .

Previous results: The problem of releasing anonymized histograms was first studied by Hay et al. (2009, 2010) in the context of degree distributions of graphs. They showed that adding Laplace noise to each count, followed by a post-processing isotonic regression step, results in a histogram with expected sorted- error of

Their algorithm runs in time . The problem was also considered in the context of password frequency lists by Blocki et al. (2016). They observed that an exponential mechanism over integer partitions yields an -DP algorithm. Based on this, for , they proposed a dynamic-programming-based relaxation of the exponential mechanism that runs in time and returns a histogram such that

with probability . Furthermore, the relaxed mechanism is -DP.

The best information-theoretic lower bound for the utility of any -DP mechanism is due to Aldà and Simon (2018), who showed that for , any -DP mechanism has expected error of for some dataset.

New results: Following Blocki et al. (2016), we study the problem in the metric. We propose a new DP mechanism, PrivHist, that satisfies the following:

Theorem 1.

Given a histogram in the prevalence form , PrivHist returns a histogram and a sum count that is -DP. Furthermore, if , then

for some constant and has an expected run time of . If then,

and has an expected run time of .

Together with the lower bound of Aldà and Simon (2018), this settles the optimal privacy-utility trade-off for up to a multiplicative factor of . We also show that PrivHist is near-optimal for , by proving the following lower bound.

Theorem 2 (Appendix E).

For a given , let . For any -DP mechanism , there exists a histogram , such that

Theorems 1 and 2 together with Aldà and Simon (2018) show that the proposed mechanism has near-optimal utility for all . We can infer the number of items in the dataset by . However, this estimate is very noisy. Hence, we also return the sum of counts , as it is useful for applications in symmetric property estimation for distributions. Apart from the near-optimal privacy-utility trade-off, we also show that PrivHist has several other useful properties.

Time complexity: By the Hardy-Ramanujan integer partition theorem Hardy and Ramanujan (1918), the number of anonymized histograms with items is . Hence, we can succinctly represent them using space. Recall that any anonymized histogram can be written as , where is the number of symbols with count . Let be the number of distinct counts and let be the distinct counts with non-zero prevalences. Then and

and hence there are at most non-zero prevalences, so can be represented using count-prevalence pairs. Histograms are often stored in this format for space efficiency, e.g., the password frequency lists in Bonneau (2015). PrivHist takes advantage of this succinct representation: given such a representation, it runs in time , as opposed to the running time of Hay et al. (2009) and the running time of Blocki et al. (2016). This is highly advantageous for large datasets such as password frequency lists with data points Blocki et al. (2016).

Pure vs. approximate differential privacy: The only previously known algorithm with utility is that of Blocki et al. (2016), and it runs in time . However, their algorithm is -approximate DP, which is a strictly weaker guarantee than the pure -DP satisfied by PrivHist. For applications in social networks, it is desirable to have group privacy for large groups Dwork et al. (2014). For groups of size , approximate DP scales as -DP, which can be prohibitive for large values of . Hence -DP is preferable.

Applications to symmetric property estimation: We show that the output of PrivHist can be directly applied to obtain near-optimal sample complexity algorithms for discrete distribution symmetric property estimation.

3.2 Symmetric property estimation of discrete distributions

For a symmetric property and an estimator that uses samples, let be an upper bound on the worst-case expected error over all distributions with support at most , . Let the sample complexity denote the minimum number of samples such that ,

Given samples , let denote the corresponding anonymized histogram. For a symmetric property , linear estimators of the form

have been shown to be sample-optimal for symmetric properties such as entropy Wu and Yang (2016), support size Valiant and Valiant (2011b); Jiao et al. (2015), support coverage Orlitsky et al. (2016), and Rényi entropy Acharya et al. (2014), where the s are distribution-independent coefficients that depend on the property . Recently, Acharya et al. (2018) showed that for any given property such as entropy or support size, one can construct DP estimators by adding Laplace noise to the non-private estimator. They further showed that this approach is information-theoretically near-optimal.
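As a hedged illustration of this estimator form, the sketch below evaluates a generic linear estimator on prevalences and instantiates it with the naive plug-in entropy coefficients; the sample-optimal estimators cited above use more involved, bias-corrected coefficients, and the prevalence values shown are hypothetical.

```python
import math

def linear_estimate(prevalences, coeff):
    """Generic linear estimator: sum over counts r of coeff(r) * phi_r."""
    return sum(coeff(r) * phi for r, phi in prevalences.items())

# Hypothetical prevalences (r -> phi_r) of a sample of n points.
prevalences = {1: 4, 2: 3, 5: 1}
n = sum(r * phi for r, phi in prevalences.items())   # total number of samples

# Naive plug-in entropy of the empirical distribution (not bias-corrected).
plugin_entropy = linear_estimate(prevalences, lambda r: -(r / n) * math.log(r / n))
print(plugin_entropy)
```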

Instead of just computing a DP estimate for a given property, the output of PrivHist can be directly used to estimate any symmetric property. By the post-processing lemma Dwork et al. (2014), since the output of PrivHist is DP, the resulting estimate is also DP. For an estimator , let be the Lipschitz constant given by . If, instead of , a DP histogram and the sum of counts are available, then can be modified as

which is differentially private. Using Theorem 1, we show that:

Corollary 1 (Appendix F).

Let satisfy , for a . Further, let there exist such that . Let . If is the sample complexity of estimator , then for

for some constant . For ,

Further, by the post-processing lemma, is also -DP.

For entropy (), normalized support size (), and normalized support coverage, there exist sample-optimal linear estimators with and with the property Acharya et al. (2017, 2018). Hence the sample complexity of the proposed algorithm increases by at most a polynomial factor in . Furthermore, the increase depends on the maximum value of the function for distributions of interest and does not explicitly depend on the support size. This result is slightly worse than the property-specific results of Acharya et al. (2018) in terms of the dependence on and . In particular, for entropy estimation, the main term in our privacy cost is , whereas the bound of Acharya et al. (2018) is . Thus for , our dependence on and is slightly worse. However, we note that our results are more general in that can be used with any linear estimator. For example, our algorithm yields DP algorithms for estimating distance to uniformity, which have not been studied before. Furthermore, PrivHist can also be combined with the maximum likelihood estimators of Orlitsky et al. (2004, 2016) and the linear programming estimators of Valiant and Valiant (2011a); however, we do not provide theoretical guarantees for these combined algorithms.

4 PrivHist

In the algorithm description and analysis, let denote the vector and let denote the cumulative prevalences. Since anonymized histograms are multisets, we can define the sum of two anonymized histograms as follows: for two histograms , the sum is given by . Furthermore, since there is a one-to-one mapping between histograms in count form and in prevalence form , we use the two interchangeably. For ease of analysis, we also use the notion of an improper histogram, in which the 's can be negative or non-integer. Finally, for a histogram indexed by superscript , we define for ease of notation.

4.1 Approach

Instead of describing the technicalities of the algorithm directly, we first motivate it with a few incorrect or high-error alternatives. Before we proceed, recall that histograms can be written either in terms of prevalences or in terms of sorted counts .

An incorrect algorithm: A naive approach would be to add noise only to the non-zero prevalences or counts. However, this is not differentially private. For example, consider two neighboring histograms in prevalence form, and . The resulting outputs for the two inputs would be very different, as the output for never produces a non-zero , whereas the output for produces a non-zero with high probability. Similar counterexamples exist for sorted counts.

A high-error algorithm: Instead of adding noise only to the non-zero counts or prevalences, one can add noise to all of them. It can be shown that adding noise to all counts (including those that appear zero times) yields an error of , whereas adding noise to prevalences can yield an error of if we naively use the utility bound in terms of prevalences (3). We note that Hay et al. (2009) showed that a post-processing step after adding noise to sorted counts improves the utility. A naive application of the Cauchy-Schwarz inequality yields an error of for that algorithm. While it might be possible to improve the dependence on with a tighter analysis, it is not clear whether the dependence on can be improved.

The algorithm is given as PrivHist below. After some initial computation, it calls one of two subroutines, PrivHist-LowPrivacy or PrivHist-HighPrivacy, depending on the value of . PrivHist has two main new ideas: (i) splitting around , using prevalences in one regime and counts in the other, and (ii) the smoothing technique used to zero out the prevalence vector. Of the two, (i) is crucial for the computational complexity of the algorithm and (ii) is crucial for improving the -dependence from to in the high-privacy regime (). There are further, more subtle differences, such as using cumulative prevalences instead of the prevalences themselves; we highlight them in the next section. We now overview our algorithm for the low- and high-privacy regimes separately.

4.2 Low privacy regime

We first consider the problem in the low-privacy regime, when . We make a few observations.

Geometric mechanism vs. Laplace mechanism: To obtain a DP output for integer data, one can add either Laplace noise or geometric noise Ghosh et al. (2012). For -DP, the expected noise added by the Laplace mechanism is , which is strictly larger than that of the geometric mechanism () (see Appendix A). For , we use the geometric mechanism to obtain an optimal trade-off in terms of .
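A minimal sketch of sampling two-sided geometric noise is given below, using the standard fact that such a variable can be written as the difference of two i.i.d. geometric variables; the parameterization with alpha = exp(-eps/sensitivity) follows Ghosh et al. (2012), and the query value in the usage line is hypothetical.

```python
import numpy as np

def two_sided_geometric(eps, sensitivity=1, rng=None):
    """Two-sided geometric noise with Pr[Z = z] proportional to alpha**|z|,
    where alpha = exp(-eps / sensitivity); realized as the difference of two
    i.i.d. geometric random variables."""
    rng = np.random.default_rng() if rng is None else rng
    p = 1.0 - np.exp(-eps / sensitivity)     # success probability, i.e. 1 - alpha
    return int(rng.geometric(p) - rng.geometric(p))

# Adding this noise to an integer-valued query of the given sensitivity
# yields an eps-DP release, e.g. a noisy count:
noisy_count = 42 + two_sided_geometric(eps=1.0)
print(noisy_count)
```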

Prevalences vs. counts: If we add noise to each coordinate of a -dimensional vector, the total amount of noise in norm scales linearly in ; hence it is better to add noise to a low-dimensional vector. In the worst case, both the prevalence and count vectors can be -dimensional. Hence, we propose to use prevalences for small values of and counts for . This ensures that the dimensionality of the vectors to which we add noise is at most .

Cumulative prevalences vs prevalences: The error can be bounded in terms of prevalences as follows. See Appendix B for a proof.

(3)

If we add noise directly to the prevalences, the error can be very high, as the noise is multiplied by the corresponding count in (3). The corresponding bound in terms of cumulative prevalences is much tighter. Hence, for small values of , we use cumulative prevalences instead of the prevalences themselves.

The above observations suggest an algorithm for the low-privacy regime. However, there are a few technical difficulties. For example, if we split the counts at a threshold naively, the result is not differentially private. We now describe each step of the algorithm and how we overcome these technical difficulties.

(1) Find : To divide the histogram into two smaller histograms, we need to know , which may not be available. Hence, we allot privacy cost to find a DP value of .

(2) Sensitivity-preserving histogram split: If we divide the histogram into two parts based on counts naively and analyze the privacy costs for the higher and lower parts separately, then the sensitivity would be a lot higher for certain neighboring histograms. For example, consider two neighboring histograms and . If we divide into two parts based on a threshold , say and , and and , then . Thus, the distance between the neighboring split histograms , would be much higher than , and we would need to add a lot of noise. Therefore, we perturb and using geometric noise. This ensures DP in instances where the neighboring histograms differ at and , and does not change the privacy analysis for other types of histograms. However, adding noise may make the histogram improper, as can become negative. To this end, we add fake counts at and to ensure that the histogram is proper with high probability; we remove them later in (L4). We refer readers to Appendix C.1.1 for details about this step.

(3,4) Add noise: Let (small counts) and (large counts) be the split histograms. We add noise to cumulative prevalences in and to counts in , as described in the overview of the proposed algorithm.

(L1, L2) Post-processing: The noisy versions of may not satisfy the properties of valid histograms, i.e., . To overcome this, we run isotonic regression over the noisy values subject to the monotonicity constraints, i.e., given noisy counts , find that minimizes subject to the constraint that for all . Isotonic regression in one dimension can be run in time linear in the number of inputs using the Pool Adjacent Violators Algorithm (PAVA) or its variants Barlow et al. (1972); Mair et al. (2009). Hence, the time complexity of this step is . We then round the prevalences to the nearest non-negative integers. We similarly post-process the large counts and remove the fake counts introduced in step (2).
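The following is a minimal sketch of PAVA for fitting a non-increasing sequence to noisy counts in least squares; PrivHist's actual post-processing additionally rounds and enforces the exact constraints described above, and the input values shown are hypothetical.

```python
def pava_non_increasing(y):
    """Pool Adjacent Violators: least-squares non-increasing fit to y,
    in time linear in len(y)."""
    blocks = []                      # each block is [sum, length]
    for v in y:
        blocks.append([float(v), 1])
        # Merge blocks while their means violate the non-increasing order.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] < blocks[-1][0] * blocks[-2][1]:
            s, l = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += l
    fit = []
    for s, l in blocks:
        fit.extend([s / l] * l)
    return fit

print(pava_non_increasing([5, 6, 3, 4, 1]))   # [5.5, 5.5, 3.5, 3.5, 1.0]
```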

Since we use the succinct representation of histograms, and use prevalences for counts smaller than and counts otherwise, the expected run time of the algorithm is for .

4.3 High privacy regime

For the high-privacy regime, when , all previously known algorithms achieve an error of . To reduce the error from to , we use smoothing techniques to reduce the sensitivity and hence the amount of added noise.

Smoothing method: Recall that the amount of noise added to a vector depends on its dimensionality. Since prevalences have length , the amount of noise would be . To improve on this, we first smooth the input prevalence vector so that it is non-zero only for a few values of , and show that the smoothing reduces the sensitivity of the cumulative prevalences and hence the amount of noise added.

While smoothing is the core idea, two questions remain: how to select the locations of the non-zero values, and how to smooth so as to reduce the sensitivity. We now describe these technical details.

(H1) Approximate high prevalences: Recall that was obtained by adding geometric noise to . In the rare case that this geometric noise is very negative, there can be prevalences much larger than . This can affect the smoothing step. To overcome this, we move all counts above to . Since this changes the histogram with low probability, it does not affect the error.

(H2) Compute boundaries: We find a set of boundaries and smooth counts to elements in . Ideally, we would like to ensure that there is a boundary close to every count. For small values of , we ensure this by including all the counts, in which case there is no smoothing. If , we use boundaries that are uniform in log-count space. However, using this technique for all values of results in an additional factor. To overcome this, for , we use the noisy large counts from step (4) to find the boundaries and ensure that there is a boundary close to every count.

(H3) Smooth prevalences: For a that lies between two boundaries and , we divide into and as follows. We assign fraction of to and the remaining to . If two neighboring histograms differ in and , then after smoothing, and differ by . Hence the sensitivity goes down to and we can add less noise. We refer readers to Appendix C.1.2 for details about this step.
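A hedged sketch of this smoothing step is below: each prevalence at a count strictly between two boundaries is split between them. Here the split is chosen to preserve the total mass r * phi_r, which is one natural choice; the paper's exact weights are specified in Appendix C.1.2, and the boundaries and prevalences in the example are hypothetical.

```python
def smooth_prevalences(prevalences, boundaries):
    """Move each prevalence phi_r onto the nearest boundaries b1 <= r <= b2,
    splitting it so that the total mass r * phi_r is preserved."""
    boundaries = sorted(boundaries)
    smoothed = {b: 0.0 for b in boundaries}
    for r, phi in prevalences.items():
        if r in smoothed:                      # r is itself a boundary
            smoothed[r] += phi
            continue
        b1 = max(b for b in boundaries if b < r)
        b2 = min(b for b in boundaries if b > r)
        w = (b2 - r) / (b2 - b1)               # fraction assigned to b1
        smoothed[b1] += w * phi
        smoothed[b2] += (1 - w) * phi          # remainder assigned to b2
    return smoothed

# Counts 3 and 8 smoothed onto log-spaced boundaries; mass 3*4 + 8*1 = 20 is preserved.
print(smooth_prevalences({3: 4, 8: 1}, boundaries=[1, 2, 4, 8, 16]))
```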

(H4) Add small noise: Since the prevalences are smoothed, we add a small amount of noise to the corresponding cumulative prevalences. For , we add to obtain DP.

(H5) Post-processing: Finally, we post-process the prevalences similarly to (L1) to impose monotonicity and ensure that the resulting prevalences are non-negative integers.

Since we use the succinct histogram representation, ensure that the size of is small, and use counts larger than to find the boundaries, the expected run time is for .

5 Acknowledgments

The authors thank Jayadev Acharya and Alex Kulesza for helpful comments and discussions.

Algorithm PrivHist
Input: anonymized histogram in terms of prevalences, i.e., , and privacy cost .
Parameters: .
Output: DP anonymized histogram and (an estimate of ).
1. DP value of the total sum: , where . If , output the empty histogram and . Otherwise continue.
2. Split : Let and .
   • and , .
   • and , where .
   • Divide into two histograms and . For all , for all .
3. DP value of : let i.i.d. and be .
4. DP value of : let i.i.d. and be .
5. If , output PrivHist-LowPrivacy and . If , output PrivHist-HighPrivacy and .

Algorithm PrivHist-LowPrivacy
Input: low-count histogram , high-count histogram , and .
Output: a histogram .
1. Post-processing of :
   • Find that minimizes with .
   • Find such that for all , .
2. Post-processing of : compute . Let . Compute by removing the elements closest to from and then removing the elements closest to , and output it.

Algorithm PrivHist-HighPrivacy
Input: non-private histogram , high-count histogram , and .
Output: a histogram .
1. Approximate higher prevalences: for , and .
2. Compute boundaries: let the set be defined as follows:
   • , .
3. Smooth prevalences: let denote the smallest element in .
   • and if , .
4. DP value of : for each , let , where .
5. Find that minimizes such that .
6. Return given by .

References

[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
[2] J. Acharya, H. Das, A. Orlitsky, and A. T. Suresh (2017) A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pp. 11–21.
[3] J. Acharya, G. Kamath, Z. Sun, and H. Zhang (2018) INSPECTRE: privately estimating the unseen. In International Conference on Machine Learning, pp. 30–39.
[4] J. Acharya, A. Orlitsky, A. T. Suresh, and H. Tyagi (2014) The complexity of estimating Rényi entropy. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1855–1869.
[5] J. Acharya, Z. Sun, and H. Zhang (2019) Hadamard response: estimating distributions privately, efficiently, and with little communication. In AISTATS.
[6] N. R. Adam and J. C. Worthmann (1989) Security-control methods for statistical databases: a comparative study. ACM Computing Surveys (CSUR) 21 (4), pp. 515–556.
[7] D. Agrawal and C. C. Aggarwal (2001) On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 247–255.
[8] F. Aldà and H. U. Simon (2018) A lower bound on the release of differentially private integer partitions. Information Processing Letters 129, pp. 1–4.
[9] L. Backstrom, C. Dwork, and J. Kleinberg (2007) Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International Conference on World Wide Web, pp. 181–190.
[10] R. E. Barlow, D. J. Bartholomew, J. Bremner, and H. D. Brunk (1972) Statistical inference under order restrictions: the theory and application of isotonic regression. Technical report, Wiley, New York.
[11] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White (2000) Testing that distributions are close. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pp. 259–269.
[12] J. Blocki, A. Datta, and J. Bonneau (2016) Differentially private password frequency lists. In NDSS, Vol. 16, pp. 153.
[13] J. Blocki, B. Harsha, and S. Zhou (2018) On the economics of offline password cracking. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 853–871.
[14] J. Blocki (2016) Differentially private integer partitions and their applications.
[15] J. Bonneau (2012) The science of guessing: analyzing an anonymized corpus of 70 million passwords. In Security and Privacy (SP), 2012 IEEE Symposium on, pp. 538–552.
[16] J. Bonneau (2015) Yahoo password frequency corpus. figshare.
[17] T. Dalenius (1977) Towards a methodology for statistical disclosure control. Statistik Tidskrift 15 (429-444), pp. 2–1.
[18] W. Day, N. Li, and M. Lyu (2016) Publishing graph degree distribution with node differential privacy. In Proceedings of the 2016 International Conference on Management of Data, pp. 123–138.
[19] J. C. Duchi, M. I. Jordan, and M. J. Wainwright (2013) Local privacy and statistical minimax rates. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pp. 429–438.
[20] C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006) Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284.
[21] C. Dwork, A. Roth, et al. (2014) The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4), pp. 211–407.
[22] C. Dwork (2008) Differential privacy: a survey of results. In International Conference on Theory and Applications of Models of Computation, pp. 1–19.
[23] Ú. Erlingsson, V. Pihur, and A. Korolova (2014) RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 1054–1067.
[24] A. Ghosh, T. Roughgarden, and M. Sundararajan (2012) Universally utility-maximizing privacy mechanisms. SIAM Journal on Computing 41 (6), pp. 1673–1693.
[25] G. H. Hardy and S. Ramanujan (1918) Asymptotic formulæ in combinatory analysis. Proceedings of the London Mathematical Society 2 (1), pp. 75–115.
[26] M. Hay, C. Li, G. Miklau, and D. Jensen (2009) Accurate estimation of the degree distribution of private networks. In Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on, pp. 169–178.
[27] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis (2008) Resisting structural re-identification in anonymized social networks. Proceedings of the VLDB Endowment 1 (1), pp. 102–114.
[28] M. Hay, V. Rastogi, G. Miklau, and D. Suciu (2010) Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment 3 (1-2), pp. 1021–1032.
[29] J. Jiao, K. Venkat, Y. Han, and T. Weissman (2015) Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory 61 (5), pp. 2835–2885.
[30] P. Kairouz, S. Oh, and P. Viswanath (2014) Extremal mechanisms for local differential privacy. In Advances in Neural Information Processing Systems, pp. 2879–2887.
[31] V. Karwa and A. B. Slavković (2012) Differentially private graphical degree sequences and synthetic graphs. In International Conference on Privacy in Statistical Databases, pp. 273–285.
[32] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith (2011) What can we learn privately? SIAM Journal on Computing 40 (3), pp. 793–826.
[33] S. P. Kasiviswanathan, K. Nissim, S. Raskhodnikova, and A. Smith (2013) Analyzing graphs with node differential privacy. In Theory of Cryptography, pp. 457–476.
[34] P. Mair, K. Hornik, and J. de Leeuw (2009) Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software 32 (5), pp. 1–24.
[35] A. Narayanan and V. Shmatikov (2009) De-anonymizing social networks. In 2009 30th IEEE Symposium on Security and Privacy, pp. 173–187.
[36] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang (2004) On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 426–435.
[37] A. Orlitsky, A. T. Suresh, and Y. Wu (2016) Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences 113 (47), pp. 13283–13288.
[38] L. Paninski (2003) Estimation of entropy and mutual information. Neural Computation 15 (6), pp. 1191–1253.
[39] S. Raskhodnikova and A. Smith (2016) Efficient Lipschitz extensions for high-dimensional graph statistics and node private degree distributions. In FOCS.
[40] G. Valiant and P. Valiant (2011) Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pp. 685–694.
[41] G. Valiant and P. Valiant (2011) The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pp. 403–412.
[42] S. L. Warner (1965) Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60 (309), pp. 63–69.
[43] Y. Wu and P. Yang (2016) Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory 62 (6), pp. 3702–3720.
[44] J. Zou, G. Valiant, P. Valiant, K. Karczewski, S. O. Chan, K. Samocha, M. Lek, S. Sunyaev, M. Daly, and D. G. MacArthur (2016) Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications 7, pp. 13293.

Appendix: Differentially private anonymized histograms

Appendix A Geometric mechanism

The most popular mechanism for -DP is the Laplace mechanism, defined as follows.

Definition 2 (Laplace mechanism [21]).

When the true query result is , the mechanism outputs , where is a random variable with a Laplace distribution: for every . If the output of has sensitivity , then to achieve -DP we add .

Since we have integer inputs, we use the geometric mechanism:

Definition 3 (Geometric mechanism [24]).

When the true query result is , the mechanism outputs , where is a random variable with a two-sided geometric distribution: for every integer . If the output of is integer-valued and has sensitivity (an integer), then to achieve -DP we add .

[24] showed that the geometric mechanism is universally optimal for a general class of functions under a Bayesian framework. The geometric mechanism is beneficial over the Laplace mechanism in two ways. First, the output space of the mechanism is discrete; since we have integer inputs, this removes the need to round the outputs. Second, for -DP, the expected noise added by the Laplace mechanism is , which is strictly larger than that of the geometric mechanism () (see below). For moderate values of , this difference is a constant. We now state a few properties of the geometric distribution that are used in the rest of the paper.

We find the following set of equations useful in the rest of the paper. In the following let be a geometric random variable and be a Laplace random variable.

The next lemma bounds moments of when is a zero mean random variable.

Lemma 2.

Let be a random variable and . If , then

and

Proof.

To prove the first inequality, observe that

Taking expectation yields the first equation. For the second term,

(4)

Furthermore,

Combining the above two equations and using the fact that yields the second equation in the lemma. ∎

Appendix B Properties of the distance metric

Proof of Lemma 1.

Recall that the distance between two histograms is given by

Let and be the datasets that achieve the minimum above. Consider any two labels such that . Let be the dataset obtained as follows: and and for all other , . Since is the optimum,

Expanding both sides and canceling common terms, we get,

and thus if , then . Hence, the label of the highest count in both the datasets should be the same and

The distance measure satisfies the triangle inequality, i.e., for any three histograms , and ,

The proof of the above equation is a simple consequence of Lemma 1 and is omitted. We now show that dividing histograms only increases the distance.

Lemma 3.

If and , then

Proof.

Since the elements in are the same as the elements in , and the elements in are the same as the elements in , there exists a permutation such that