Practical Differentially Private Top-k Selection with Pay-what-you-get Composition


David Durfee, Applied Research, LinkedIn; Ryan Rogers, Applied Research, LinkedIn
Abstract

We study the problem of top-$k$ selection over a large domain universe subject to user-level differential privacy. Typically, the exponential mechanism or report noisy max are the algorithms used to solve this problem. However, these algorithms require querying the database for the count of each domain element. We focus on the setting where the data domain is unknown, which is different from the setting of frequent itemsets where an apriori-type algorithm can help prune the space of domain elements to query. We design algorithms that ensure (approximate) $(\epsilon, \delta)$-differential privacy and only need access to the true top-$\bar{k}$ elements from the data for any chosen $\bar{k}$. This is a highly desirable feature for making differential privacy practical, since the algorithms require no knowledge of the domain. We consider both the setting where a user's data can modify an arbitrary number of counts by at most 1, i.e. unrestricted sensitivity, and the setting where a user's data can modify at most some small, fixed number of counts by at most 1, i.e. restricted sensitivity. Additionally, we provide a pay-what-you-get privacy composition bound for our algorithms. That is, our algorithms might return fewer than $k$ elements when the top-$k$ elements are queried, but the overall privacy budget only decreases by the size of the outcome set.

1 Introduction

Determining the top-$k$ most frequent items from a massive dataset in an efficient way is one of the most fundamental problems in data science; see Ilyas et al. [17] for a survey of top-$k$ processing techniques. For example, consider the task of returning the 10 most popular articles that users engaged with. However, it is important to consider users' privacy in the dataset, since results from data mining approaches can reveal sensitive information about a user's data [20]. Simple thresholding techniques, e.g. $k$-anonymity, do not provide formal privacy guarantees, since adversary background knowledge or linking other datasets may cause someone's data in a protected dataset to be revealed [24]. Our aim is to provide rigorous privacy techniques for determining the top-$k$ so that they can be built on top of highly distributed, real-time systems that might already be in place.

Differential privacy has become the gold standard for rigorous privacy guarantees in data analytics. One of the primary benefits of differential privacy is that the privacy loss of a computation on a dataset can be quantified. Many companies have adopted differential privacy, including Google [15], Apple [1], Uber [18], Microsoft [9], and LinkedIn [21], as well as government agencies, like the U.S. Census Bureau [8]. For this work, we hope to extend the use of differential privacy in practical systems to allow analysts to compute the most frequent elements in a given dataset. We are certainly not the first to explore this topic, yet the previous works require querying the count of every domain element, e.g. report noisy max [10] or the exponential mechanism [25], or require some structure on the large domain universe, e.g. frequent item sets (see Related Work). We aim to design practical, (approximate) differentially private algorithms that do not require any structure on the data domain, which is typically the case in exploratory data analysis. Further, our algorithms work in the setting where data is preprocessed prior to running our algorithms, so that the differentially private computation only accesses a subset of the data while still providing user privacy in the full underlying dataset.

We design $(\epsilon, \delta)$-differentially private algorithms that can return the top-$k$ results by querying the counts of only those elements that exist in the dataset. To ensure user-level privacy, where we want to protect the privacy of a user's entire dataset that might consist of many data records, we consider two different settings. In the restricted sensitivity setting, we assume that a user can modify the counts by at most 1 across at most a fixed number of elements in a data domain, which is assumed to be known. An example of such a setting would be computing the top-$k$ countries where users have a certain skill set. Assuming a user can only be in one country, the restricted sensitivity is 1. In the more general setting, we consider unrestricted sensitivity, where a user can modify the counts by at most 1 across an arbitrary number of elements. An example of the unrestricted setting would be computing the top-$k$ articles with distinct user engagement (liked, commented, shared, etc.). We design different algorithms for either setting so that the privacy parameter needs to scale with either in the restricted sensitivity setting or in the unrestricted setting. Thus, our differentially private algorithms will ensure user-level privacy despite a user being able to modify the counts of an arbitrary number of elements.

The reason that our algorithms require $\delta > 0$, and are thus approximately differentially private, is that we do not want them to need knowledge of the data domain, or any structure on it. For exploratory analyses, one would prefer not to have to provide the algorithm with the full data domain beforehand. The mere presence of a domain element in the exploratory analysis might be the result of a single user's data. Hence, if we remove a user's data in a neighboring dataset, there are some outcomes that cannot occur. We design algorithms such that these events occur with very small probability. Simultaneously, we ensure that the private algorithms do not compromise the efficiency of existing systems.

As a byproduct of our analysis, we also include some results of independent interest. In particular, we give a composition theorem that essentially allows for pay-what-you-get privacy loss. Since our algorithms can output fewer than $k$ elements when asked for the top-$k$, we allow the analyst to ask more queries if the algorithms return fewer than $k$ outcomes, up to some fixed bound. Further, we define a condition on differentially private algorithms that allows for better composition bounds than the general, optimal composition bounds [19, 26]. Lastly, we show how to achieve a one-shot differentially private algorithm that provides a ranked top-$k$ result, using a different noise distribution than the work of Qiao et al. [14].

We see this work as bringing together multiple theoretical results in differential privacy to arrive at a practical privacy system that can be used on top of existing, real-time data analytics platforms for massive datasets distributed across multiple servers. Essentially, the algorithms allow for solving the top-$\bar{k}$ problem first with the existing infrastructure for any chosen $\bar{k}$, and then incorporating noise and a threshold to output the top-$k$, or fewer, outcomes. In our approach, we can think of the existing system, such as online analytical processing (OLAP) systems, as a blackbox top-$\bar{k}$ solver, and without adjusting the input dataset or opening up the blackbox, we can still implement private algorithms.

1.1 Related Work

There are several works in differential privacy for discovering the most frequent elements in a dataset, e.g. top-$k$ selection and heavy hitters. There are different approaches to solving this problem depending on whether one is in the local privacy model, which assumes that each data record is privatized prior to aggregation on the server, or in the trusted curator privacy model, which assumes that the data is stored centrally and then private algorithms can be run on top of it. In the local setting, there has been academic work [3, 4] as well as industry solutions [1, 16] for identifying the heavy hitters. Note that these algorithms require some additional structure on the data domain, such as fixed-length words, where the data can be represented as a sequence of some known length and each element of the sequence belongs to some known set. One can then prune the space of potential heavy hitters by eliminating subsequences that are not heavy, since a sequence can be frequent only if its subsequences are also frequent.

We will be working in the trusted curator model. There have been several works in this model that estimate frequent itemsets subject to differential privacy, including [5, 23, 28, 22, 29]. Similar to our work, Bhaskar et al. [5] first solve the top-$k'$ problem nonprivately for some $k'$ (with restrictions on the choice of $k'$, which can be quite large for certain databases) and then use the exponential mechanism to return an estimate of the top-$k$. The primary difference between these works and ours is that the domain universe in our setting is unknown and not assumed to have any structure. For itemsets, one can iteratively build up the domain from smaller itemsets, as in the locally private algorithms.

We assume no structure on the domain, as one would assume without considering privacy restrictions. This is a highly desirable feature for making differential privacy practical, since the algorithms can work over arbitrary domains. Chaudhuri et al. [7] consider the problem of returning the maximizer subject to differential privacy, where their algorithm works in the range-independent setting. That is, their algorithms can return domain elements that are unknown to the analyst querying the dataset. However, their large margin mechanism can run over the entire domain universe in the worst case. The algorithms in [7] and [5] share a similar approach in that both use the exponential mechanism on elements above a threshold (completeness). In order to obtain pure differential privacy ($\delta = 0$), [5] samples uniformly from elements below the threshold, whereas [7] never samples anything from this remaining set and thus satisfies approximate differential privacy ($\delta > 0$). Our approach will also follow this high-level idea, but sets the threshold in a different manner to ensure computational efficiency. To our knowledge, there are no top-$k$ differentially private algorithms for the unknown domain setting that never require iterating over the entire domain.

When the data domain is known and we want to compute the top-$k$ most frequent elements, the usual approach is to first use either report noisy max [10], which adds Laplace noise to each count and reports the index of the largest noisy count, or the exponential mechanism [25]. Then we can use a peeling technique, which removes the top element's count and then uses report noisy max or the exponential mechanism again. There has also been work on a one-shot version that adds Laplace noise to the counts once and can return a set of indices, which is computationally more efficient [14].

There have been several works bounding the total privacy loss of an (adaptive) sequence of differentially private mechanisms, including basic composition [12, 10], advanced composition (with improvements) [13, 11, 6], and optimal composition [19, 26]. There has also been work on bounding the privacy loss when the privacy parameters themselves can be chosen adaptively (where the previous composition theorems cannot be applied) with pay-as-you-go composition [27]. In this work, we provide a pay-what-you-get composition theorem for our algorithms, which allows the analyst to only pay, in the overall privacy budget, for the number of elements that were returned by our algorithms. Because our algorithms can return fewer than $k$ elements when asked for the top-$k$, we want to ensure the analyst can ask many more queries if fewer than $k$ elements have been given.

2 Preliminaries

We will represent the domain as $[d]$ and user $i$'s data as $x_i \subseteq [d]$. We then write a dataset of $n$ users as $x = \{x_1, \dots, x_n\}$. We say that $x, x'$ are neighbors if they differ in the addition or deletion of one user's data, e.g. $x' = x \cup \{x_{n+1}\}$. We now define differential privacy [12].

Definition 2.1 (Differential Privacy).

An algorithm $M$ that takes a collection of records to some arbitrary outcome set $\mathcal{Y}$ is $(\epsilon, \delta)$-differentially private (DP) if for all neighbors $x, x'$ and for all outcome sets $S \subseteq \mathcal{Y}$, we have
$$\Pr[M(x) \in S] \leq e^{\epsilon} \Pr[M(x') \in S] + \delta.$$

If $\delta = 0$, then we simply write $\epsilon$-DP.

In this work, we want to select the top-$k$ most frequent elements in a dataset $x$. Let $h_i(x)$ denote the number of users that have element $i$, i.e. $h_i(x) = |\{ j : i \in x_j \}|$. We then sort the counts and denote the ordering as $h_{(1)}(x) \geq h_{(2)}(x) \geq \cdots \geq h_{(d)}(x)$ with corresponding elements $(1), (2), \dots, (d)$. Hence, from dataset $x$, we seek to output the elements $(1), \dots, (k)$, where we break ties in some arbitrary, data-independent way.
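As a small illustration of this notation (with variable names of our choosing), the histogram and its sorted order can be computed directly from the users' item sets:

```python
from collections import Counter

# Toy dataset: each user's data is the set of domain elements they engaged with.
users = [{"a", "b"}, {"a", "c"}, {"a"}, {"b", "c"}, {"c"}]

# h[i] = number of users whose data contains element i.
h = Counter(i for x_i in users for i in x_i)

# Sort counts in decreasing order, breaking ties in a fixed, data-independent way (by element id).
ordered = sorted(h.items(), key=lambda kv: (-kv[1], kv[0]))
k = 2
print(ordered[:k])  # non-private top-k: [('a', 3), ('c', 3)]
```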

Note that for neighboring datasets $x$ and $x'$, the corresponding neighboring histograms $h = h(x)$ and $h' = h(x')$ can differ in every position by at most 1, i.e. $\|h - h'\|_\infty \leq 1$. In some instances, one user can only impact the count of at most a fixed number of coordinates. We then say that $h, h'$ are $\Delta$-restricted sensitivity neighbors if $\|h - h'\|_\infty \leq 1$ and $\|h - h'\|_0 \leq \Delta$.

The algorithms we describe will only need access to a histogram $h(x)$, where we drop $x$ when it is clear from context. We will be analyzing the privacy loss of an individual user over many different top-$k$ queries on a larger, overall dataset. Consider the example where we first want to know the top-$k$ articles that distinct users engaged with, and then want to know the top-$k$ articles that distinct users engaged with in Germany, and so on. A user's data can be part of each input histogram, so we want to compose the privacy loss across many different queries.

In our algorithms, we will add noise to the histogram counts. The noise distributions we consider are from a Gumbel random variable or a Laplace random variable, where $\mathrm{Gumbel}(b)$ has PDF
$$p_{\mathrm{Gumbel}}(z; b) = \frac{1}{b} \exp\left( -\left( \frac{z}{b} + e^{-z/b} \right) \right), \qquad (1)$$
and $\mathrm{Lap}(b)$ has PDF $p_{\mathrm{Lap}}(z; b) = \frac{1}{2b} e^{-|z|/b}$.
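For reference, both noise distributions are available in NumPy; the inverse-CDF form of the Gumbel draw is included only as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
b = 2.0  # scale parameter

gumbel_draw = rng.gumbel(loc=0.0, scale=b)    # Gumbel(b) noise
laplace_draw = rng.laplace(loc=0.0, scale=b)  # Lap(b) noise

# Equivalent inverse-CDF sampling for Gumbel(b): -b * log(-log(U)) with U ~ Uniform(0, 1).
u = rng.uniform()
gumbel_via_inverse_cdf = -b * np.log(-np.log(u))
```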

3 Main Algorithm and Results

We now present our main algorithm for reporting the top-$k$ domain elements and state its privacy guarantee. The limited domain procedure is given in Algorithm 1 and takes as input a histogram $h$, a parameter $k$, a cutoff $\bar{k}$ for the number of domain elements to consider, and privacy parameters $\epsilon, \delta$. It then returns at most $k$ indices in relative rank order. At a high level, our algorithm can be thought of as solving the top-$\bar{k}$ problem with access to the true data; then, from this set of $\bar{k}$ histogram counts, it adds noise to each count to determine the noisy top-$k$ and includes an index in the output only if its respective noisy count is larger than a noisy threshold. The noise that we add will be from a Gumbel random variable, given in (1), which has a nice connection with the exponential mechanism [25] (see Section 4). In later sections we will present the formal analysis and some extensions.

Input: Histogram ; privacy parameters .
Output: Ordered set of indices.
Sort .
Set . (Footnote 1: Note that if becomes comparable to , then we can also have in the minimum statement, but we omit this for simplicity. If , then we write , in which case the algorithm becomes equivalent to the exponential mechanism with peeling. This emphasizes that $\bar{k}$ provides a tuning knob between efficiency and utility.)
Set .
for  do
     Set .
Sort .
Let be the sorted list up until .
Return if , otherwise return .
Algorithm 1 ; Top-$k$ from the limited domain
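To make the procedure concrete, the following is a minimal Python sketch of the limited-domain mechanism, illustrative rather than the authors' reference implementation: the data-dependent threshold count (h_bot below) is simply passed in as an argument, and the Gumbel scale 1/eps is an assumed calibration rather than the paper's exact setting.

```python
import numpy as np

def limited_domain_top_k(counts, k, k_bar, eps, h_bot, noise=None, rng=None):
    """Illustrative sketch of Algorithm 1.

    counts: dict mapping domain element -> true count (only elements present in the data).
    h_bot:  the data-dependent threshold count, treated here as an input.
    noise:  a function returning one noise draw; Gumbel with scale 1/eps by default
            (an illustrative choice, not the paper's calibration).
    """
    rng = rng or np.random.default_rng()
    if noise is None:
        noise = lambda: rng.gumbel(scale=1.0 / eps)
    # True top-k_bar counts, with arbitrary but data-independent tie-breaking (by element id).
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k_bar]
    noisy_bot = h_bot + noise()  # noisy threshold
    noisy = sorted(((i, c + noise()) for i, c in top), key=lambda iv: -iv[1])
    out = []
    for i, v in noisy[:k]:
        if v <= noisy_bot:
            out.append(None)  # stand-in for the paper's bottom outcome (⊥)
            break
        out.append(i)
    return out  # at most k entries, in noisy rank order
```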

We now state its privacy guarantee.

Theorem 1.

Algorithm 1 is -DP for any where

(2)

Note that our algorithm is not guaranteed to output $k$ indices, and this is key to obtaining our privacy guarantees. The primary difficulty here is that the indices within the true top-$\bar{k}$ can change by adding or removing one person's data. The purpose of the threshold is then to ensure that the probability of outputting any index that is in the top-$\bar{k}$ for histogram $h$ but not in the top-$\bar{k}$ for a neighboring histogram $h'$ is small. We give more high-level intuition on this in Section 3.2.

In order to maximize the probability of outputting $k$ indices, we want to minimize our threshold value. Accordingly, whenever we have restricted sensitivity, we can simply choose $\bar{k}$ to be as large as is computationally feasible because that will minimize our threshold. However, if the sensitivity is unrestricted or quite large, it becomes natural to consider how to set $\bar{k}$, as there is a tradeoff: one term of the threshold is decreasing in $\bar{k}$ whereas another is increasing in $\bar{k}$. Ideally, we would set $\bar{k}$ to be a point in the histogram at which we see a sudden drop, but setting it in such a data-dependent manner would violate privacy. Instead, we will simply consider the optimization problem of finding the index $\bar{k}$ that minimizes the threshold (and is computationally feasible), and we will solve this problem with standard DP techniques.

Lemma 3.1 (Informal).

We can find a noisy estimate of the optimal parameter for a given histogram , and this will only increase our privacy loss by substituting for in the guarantees in Theorem 1.

Pay-what-you-get Composition

While the privacy loss for Algorithm 1 will be a function of $k$ regardless of whether it outputs far fewer than $k$ indices, we can actually show that, in making multiple calls to this algorithm, we can instead bound the privacy loss in terms of the number of indices that are output. More specifically, we will take the length of the output of each call to Algorithm 1, which is not deterministic, and ensure that the sum of these lengths does not exceed some fixed bound. Additionally, we need to restrict how many individual top-$k$ queries can be asked of our system. Accordingly, the privacy loss will then be in terms of these two bounds. We detail the multiple calls procedure in Algorithm 2.

Input: An adaptive stream of histograms , fixed integers and , along with per iterate privacy parameters .
Output: Sequence of outputs for .
while  and  do
     Based on previous outcomes, select adaptive histogram and parameters
     if  then
         Let with privacy parameters and
          and      
Return
Algorithm 2 ; Multiple queries to random threshold
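As an illustrative companion to Algorithm 2, the wrapper below strings together calls to the limited_domain_top_k sketch given after Algorithm 1 and charges the budget only for what is returned; the names k_total and q_max are placeholders for the paper's overall output and query bounds.

```python
def pay_what_you_get(query_stream, k_total, q_max):
    """Sketch of the Algorithm 2 wrapper: pay only for indices actually returned.

    query_stream yields (counts, k, k_bar, eps, h_bot) tuples, possibly chosen adaptively
    based on earlier outputs; k_total and q_max are placeholder names for the overall
    output budget and the bound on the number of top-k queries.
    """
    outputs, used, queries = [], 0, 0
    for counts, k, k_bar, eps, h_bot in query_stream:
        if used >= k_total or queries >= q_max:
            break
        k_query = min(k, k_total - used)  # never exceed the remaining output budget
        result = limited_domain_top_k(counts, k_query, k_bar, eps, h_bot)
        outputs.append(result)
        used += len(result)  # the budget decreases only by the size of this output
        queries += 1
    return outputs
```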

From a practical perspective, this means that if we allow a client to make multiple top-$k$ queries with a total budget, then whenever a top-$k$ query is made their budget only decreases by the size of the output, as opposed to by $k$. We will further discuss in Section 3.1 how this property can, in some ways, actually provide higher utility than standard approaches that have access to the full histogram and must output $k$ indices. We then have the following privacy statement.

Theorem 2.

For any , in Algorithm 2 is -DP where

(3)

Extensions

We further consider the restricted sensitivity setting, where any individual can change at most $\Delta$ counts. Algorithm 1 allowed for a smaller additive factor on the threshold in this setting, but the privacy loss was still in terms of $k$. The primary reason for this is that, unlike Laplace noise, adding Gumbel noise to a value and releasing this estimate is not differentially private. Accordingly, if we instead run Algorithm 1 with Laplace noise, then we can achieve a dependency on the restricted sensitivity $\Delta$. We note that adding Laplace noise instead will not allow us to provably achieve the same guarantees as Theorem 1, and we discuss some of the intuition for this later.

Lemma 3.2 (Informal).

If we instead add noise to Algorithm 1, and we have -restricted sensitivity where , then we can obtain -DP where

In addition, we give a slight variant of Algorithm 1 in Section 6.2 that will achieve the same privacy guarantees at the cost of some generality, but will be even more practical for implementation.
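As a usage illustration of the restricted-sensitivity variant discussed above, the earlier limited_domain_top_k sketch can be reused with Laplace draws substituted for Gumbel draws; the counts, the threshold value, and the noise scale 1/eps below are toy, assumed values rather than the calibration from Lemma 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = {"article_a": 130, "article_b": 112, "article_c": 51, "article_d": 50, "article_e": 12}
eps, h_bot = 0.5, 55.0  # toy values; h_bot would come from the paper's threshold formula
result = limited_domain_top_k(counts, k=3, k_bar=4, eps=eps, h_bot=h_bot,
                              noise=lambda: rng.laplace(scale=1.0 / eps))
print(result)  # e.g. ['article_a', 'article_b', None]
```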

Improved Advanced Composition

We also provide a result that may be of independent interest. In Section 4, we consider a slightly tighter characterization of pure differential privacy, which we refer to as range-bounded, and show that it can improve upon the total privacy loss over a sequence of adaptively chosen private algorithms. In particular, we consider the exponential mechanism, which is known to be -DP, and show that it has even stronger properties, namely that it is -range-bounded under the same parameters. Accordingly, we can then give improved advanced composition bounds for the exponential mechanism compared to the optimal composition bounds for general -DP mechanisms given in [19, 26] (we show a comparison of these bounds in Appendix A).

3.1 Accuracy Comparisons

In contrast to previous work on top-$k$ selection subject to DP, our algorithms can return fewer than $k$ indices. Typically, accuracy in this setting means returning a set of exactly $k$ indices such that each returned index has a count that is at least the $k$-th ranked value minus some small amount. There are known lower bounds for this definition of accuracy [2] that are tight for the exponential mechanism. Relaxing the utility statement to allow returning fewer than $k$ indices, we can show that our algorithm achieves asymptotically better accuracy, where the domain size $d$ is replaced with $\bar{k}$, because our algorithm is essentially privately determining the top-$k$ from the true top-$\bar{k}$ rather than from the full domain. In fact, if we set $\bar{k} = k$, then we will only output indices in the top-$k$ and achieve perfect accuracy, but it is critically important to note that we are unlikely to output all $k$ indices in this parameter setting. We then provide additional conditions under which we output $k$ indices with high probability (these formal accuracy statements are encompassed in Lemma 8.1). This condition requires a certain separation between ranked counts, which is comparable to the requirement for privately outputting top-$k$ itemsets in [5], and we achieve similar accuracy guarantees under this condition. The key difference is that the corresponding parameter in [5] can be quite large for some histograms, making that algorithm less efficient, though it will always return $k$ indices. Conversely, for those same histograms we maintain computational efficiency because our $\bar{k}$ is a fixed parameter, but our routine will most likely output fewer than $k$ indices.

Even for those histograms on which we are unlikely to return $k$ indices, we see this as the primary advantage of our pay-what-you-get composition. If there are many counts that are similar to the threshold value, our algorithm will simply return a single $\bot$ rather than a random permutation of these indices, and the analyst need only pay for a single outcome rather than for up to $k$ indices in this random permutation. Essentially, the indices that are returned are normally the clear winners, i.e. indices with counts substantially above the threshold, and the $\bot$ value signals that the remaining values are approximately equal, where the analyst only has to pay for this single output as opposed to paying for the remaining outputs that are close to a random permutation. We see this as an added benefit of allowing the algorithm to return fewer than $k$ indices.

3.2 Our Techniques

The primary difficulty with ensuring differential privacy in our setting is that initially taking the true top-$\bar{k}$ indices will lead to different domains for neighboring histograms. More explicitly, the indices within the top-$\bar{k}$ can change by adding or removing one user's data, and this makes ensuring pure differential privacy impossible. However, the key insight will be that only indices whose count is within 1 of the count of the $\bar{k}$-th index can go in or out of the top-$\bar{k}$ by adding or removing one user's data. Accordingly, the noisy threshold that we add will be explicitly set such that, for indices with count within 1 of the $\bar{k}$-th count, the noisy estimate exceeding the noisy threshold is a low-probability event. By restricting our output set of indices to those whose noisy estimates are in the top-$k$ and exceed the noisy threshold, we ensure that indices in the top-$\bar{k}$ for one histogram but not for a neighboring histogram will be output with small probability. A union bound over the total possible indices that can change will then give our desired bound on these bad events.

We now present the high level reasoning behind the proof of privacy in Theorem 1.

  1. Adding Gumbel noise and taking the top-$k$ in one shot is equivalent to iteratively choosing the subsequent index using the exponential mechanism with peeling; see Lemma 4.2. (Footnote 2: Note that we could have alternatively written our algorithm in terms of iteratively applying the exponential mechanism, and all of our analysis will be in this context, but adding noise once is computationally more efficient.)

  2. To get around the fact that the domains can change in neighboring datasets, we define a variant of Algorithm 1 that takes a histogram and a domain as input. We then prove that this variant is DP for any input domain (see Corollary 5.1), and that, for a choice of domain that depends on the input histogram, it is the same as Algorithm 1 (see Lemma 5.4).

  3. Due to the choice of the count for the element $\bot$, we show that, for any given neighboring datasets, the probability that Algorithm 1 evaluated on one of them returns an element that is not part of the domain of the other is small; see Lemma 5.5.

We now present an overview of the analysis for the pay-what-you-get composition bound in Theorem 2.

  1. Because Algorithm 1 can be expressed as multiple iterations of the exponential mechanism, we can string together many calls to Algorithm 1 as an adaptive sequence of DP mechanisms.

  2. With multiple calls to Algorithm 1, if we ever get a $\bot$ outcome, we can simply start a new top-$k$ query and hence a new sequence of exponential mechanism calls. Hence, we do not need to get $k$ outcomes before we switch to a new query.

  3. To get the improved constants in (3), compared to the advanced composition given in Theorem 3 [13], we introduce a tighter range-bounded characterization, which the exponential mechanism satisfies and which enjoys better composition; see Lemma 4.4.

4 Existing DP Algorithms and Extensions

We now cover some existing differentially private algorithms and extensions to them. We start with the exponential mechanism [25] and show how it is equivalent to adding noise from a particular distribution and taking the argmax outcome. Next, we will present a stronger privacy condition than differential privacy, which leads to better composition bounds than the optimal composition theorems [19, 26] for general DP.

Throughout, we will make use of the following composition theorem in differential privacy.

Theorem 3 (Composition [10, 13] with improvements by [19, 26]).

Let $M_1, \dots, M_k$ each be $\epsilon$-DP, where the choice of $M_i$ may depend on the previous outcomes of $M_1, \dots, M_{i-1}$. Then the composed algorithm is $\epsilon'$-DP for any $\delta' > 0$, where
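For reference, the standard (non-optimal) advanced composition bound of [13] for $k$ adaptively chosen $\epsilon$-DP mechanisms can be evaluated as below; this is the textbook form, shown only as a concrete point of comparison, not the tighter bound of [19, 26] referenced in the theorem.

```python
import math

def advanced_composition(eps, k, delta_prime):
    """Standard advanced composition [13]: total privacy parameter after k adaptive
    eps-DP mechanisms, at the cost of an additional delta_prime."""
    return math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * eps + k * eps * (math.exp(eps) - 1.0)

print(advanced_composition(eps=0.1, k=50, delta_prime=1e-6))  # approximately 4.24 for these toy values
```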

4.1 Exponential Mechanism and Gumbel Noise

The exponential mechanism takes a quality score that can be thought of as evaluating how good an outcome is for a given dataset. For our setting, the quality score will be the histogram count of each element.

Definition 4.1 (Exponential Mechanism).

Let $EM_q$ be a randomized mapping where for all outputs $y$ we have
$$\Pr[EM_q(x) = y] \propto \exp\left( \frac{\epsilon \, q(x, y)}{2 \Delta(q)} \right),$$

where $\Delta(q)$ is the sensitivity of the quality score, i.e. for all neighboring inputs $x, x'$ we have
$$\Delta(q) \geq \max_{y} \, | q(x, y) - q(x', y) |.$$

We say that a quality score $q$ is monotonic in the dataset if the addition of a data record can only increase (or only decrease) its value, or leave it unchanged, for any outcome. Note that the counting quality score we use is monotonic in the dataset. We then have the following privacy guarantee.

Lemma 4.1.

The exponential mechanism is -DP. Further, if is monotonic in the dataset, then is -DP.

We point out that the exponential mechanism can be simulated by adding Gumbel noise to each quality score value and then reporting the outcome with the largest noisy score. (Footnote 3: Special thanks to Aaron Roth for pointing out this known connection with the Gumbel-max trick: http://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/.) This is similar to the report noisy max mechanism [10], except Gumbel noise is added rather than Laplace. We define the peeling exponential mechanism to be the iterative algorithm that first samples an outcome using the exponential mechanism, then repeats on the remaining outcomes, and continues $k$ times. We further define the one-shot algorithm to be the one that adds Gumbel noise to each quality score and takes the $k$ outcomes with the largest noisy values. We then make the following connection between the two, so that we can compute the top-$k$ outcomes in one shot. We defer the proof to Appendix B.1.
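This connection is easy to check empirically: sampling an index with probability proportional to exp(eps * q / 2) (a sensitivity-1 quality score and the eps/2 exponent are assumed here) matches taking the argmax of the quality scores after adding Gumbel noise with scale 2/eps.

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([10.0, 9.0, 9.0, 4.0])  # quality scores (sensitivity 1 assumed)
eps, trials = 1.0, 200_000

# Exponential mechanism: sample index i with probability proportional to exp(eps * q[i] / 2).
p = np.exp(eps * q / 2.0)
p /= p.sum()
em_freq = np.bincount(rng.choice(len(q), size=trials, p=p), minlength=len(q)) / trials

# Gumbel-max trick: argmax of q[i] + Gumbel(2 / eps) noise.
noisy = q + rng.gumbel(scale=2.0 / eps, size=(trials, len(q)))
gumbel_freq = np.bincount(noisy.argmax(axis=1), minlength=len(q)) / trials

print(np.round(em_freq, 3), np.round(gumbel_freq, 3))  # the two should agree up to sampling error
```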

Lemma 4.2.

For any input the peeling exponential mechanism is equal in distribution to . That is for any outcome vector we have

We next show that the one-shot Gumbel noise addition is DP using Theorem 3. Previous work [14] considered a one-shot approach for top-$k$ selection subject to DP with Laplace noise addition, and in order to get the improved factor on the privacy loss, their algorithm could not return the ranked list of indices. Using Gumbel noise allows us to return the ranked list of indices in one shot with the same privacy loss.

Corollary 4.1.

The one-shot is -DP for any where

Further, if the quality score is monotonic in the dataset, then is also -DP for any where

4.2 Bounded Range Composition

It turns out that we can actually improve on the total privacy loss for this algorithm and for a wider class of algorithms in general. We first define a slightly stronger condition than (pure) differential privacy that can give a tighter characterization of the privacy loss for certain DP mechanisms.

Definition 4.2 (Range-Bounded).

Given a mechanism $M$ that takes a collection of records to outcome set $\mathcal{Y}$, we say that $M$ is $\epsilon$-range-bounded if for any $y, y' \in \mathcal{Y}$ and any neighboring databases $x, x'$ we have
$$\frac{\Pr[M(x) = y]}{\Pr[M(x') = y]} \leq e^{\epsilon} \cdot \frac{\Pr[M(x) = y']}{\Pr[M(x') = y']},$$

where we use the probability density function instead for continuous outcome spaces. (Footnote 4: We could also equivalently define this in terms of output sets because we are only considering pure differential privacy.)

It is straightforward to see that this definition is within a factor of 2 of standard differential privacy.

Corollary 4.2.

If a mechanism $M$ is $\epsilon$-range-bounded, then it is also $\epsilon$-DP, and conversely, if $M$ is $\epsilon$-DP then it is also $2\epsilon$-range-bounded. Furthermore, if $M$ is $\epsilon$-range-bounded, then we have

The final consequence is exactly where our range-bounded terminology comes from, because it implies that for any neighboring databases $x, x'$ there is some fixed $t$ such that
$$t \leq \ln\left( \frac{\Pr[M(x) = y]}{\Pr[M(x') = y]} \right) \leq t + \epsilon \quad \text{for all outcomes } y.$$

In contrast, $\epsilon$-DP only guarantees that for any outcome $y$
$$-\epsilon \leq \ln\left( \frac{\Pr[M(x) = y]}{\Pr[M(x') = y]} \right) \leq \epsilon,$$

where we know that this range is tight for some mechanisms, such as randomized response, which was the mechanism used for proving the optimal advanced composition bounds [19, 26]. However, for other mechanisms this range is too loose. For the exponential mechanism, constructing worst-case neighboring databases such that some output's probability increases by a factor of about $e^{\epsilon}$ requires the quality score of that output to increase and all other quality scores to decrease, which implies that their output probabilities remain about the same. We then show that the exponential mechanism achieves the same privacy parameters as in Lemma 4.1 under our stronger characterization.

Lemma 4.3.

The exponential mechanism is -range-bounded, furthermore if is monotonic in its dataset then is -range bounded.

Proof.

Consider any outcomes , and take any neighboring inputs and .

Plugging in the specific forms of these probabilities, it is straightforward to see that the denominators will cancel and we are left with the following with the substitution

When is monotonic in the dataset, we have either the case where and or the case where and . Hence the factor of 2 savings in the privacy parameter.

We now show that we can achieve a better composition bound when we compose -range-bounded algorithms as opposed to using Theorem 3, which applies to the composition of general DP algorithms. Intuitively this composition will save a factor of 2 because the range that will maximize the variance is due to the fact that if the range was instead skewed towards (i.e. a range of ) then almost all of the probability mass has to be on events with log-ratio around . Rather than using Azuma’s inequality on the sum of the privacy losses, as is done in the original advanced composition paper [13], we use the more general Azuma-Hoeffding bound.

Theorem 4 (Azuma-Hoeffding; see http://www.math.wisc.edu/~roch/grad-prob/gradprob-notes20.pdf).

Let $(X_i)_{i \geq 0}$ be a martingale with respect to the filtration $(\mathcal{F}_i)_{i \geq 0}$. Assume that there exist $\mathcal{F}_{i-1}$-measurable variables $A_i, B_i$ and a constant $c$ such that
$$A_i \leq X_i - X_{i-1} \leq B_i \quad \text{and} \quad B_i - A_i \leq c \quad \text{for all } i.$$

Then for any $t > 0$ we have
$$\Pr[X_n - X_0 \geq t] \leq \exp\left( \frac{-2 t^2}{n c^2} \right).$$

In fact, our composition bound for range-bounded algorithms improves on the optimal composition theorem for general DP algorithms [19, 26]. See Appendix A for a comparison of the different bounds. We defer the proof, which largely follows a similar argument to [13], to Appendix B.2.

Lemma 4.4.

Let each be -bounded range where the choice of may depend on the previous outcomes of , then the composed algorithm of each of the algorithms is -DP for any where

(4)

Note that in order to see an improvement in the advanced composition bound, we do not necessarily require that an -DP mechanism is also -range-bounded, but could be relaxed to showing it is -range-bounded for some . In particular, this will still give improvements with respect to the simpler formulation of the advanced composition bound. More specifically, the significant term that is normally considered in advanced composition is , which can be replaced with for composing -range-bounded mechanisms with . Consequently, we believe that this formulation could be useful for mechanisms beyond the exponential mechanism.

5 Limited Domain Algorithm

In this section we present the analysis of our main procedure in Algorithm 1. We begin by giving basic properties of histograms when an individual's data is added or removed, and how this can change the domain of the true top-$\bar{k}$. This will be critical for achieving our bounds on the bad events when an index moves in or out of the true top-$\bar{k}$ for a neighboring database. Next, we will give an alternative formulation of our algorithm based upon a peeling exponential mechanism. The general idea will be to show that once we have bounded the probability of outputting indices unique to the true top-$\bar{k}$ of one of the neighboring histograms, we can then consider the remaining, shared outputs according to this peeling exponential mechanism and bound them in terms of pure differential privacy. Finally, we will provide a proof of Theorem 1.

5.1 Properties of Data Dependent Thresholds

In this section we will cover basic properties of how the domain of elements above a data-dependent threshold can change in neighboring histograms. In our algorithm, we will use a data-dependent threshold, such as the $\bar{k}$-th ordered count $h_{(\bar{k})}$. Our first property, which we use often within our analysis, is that the count of the $\bar{k}$-th largest histogram value will not change by more than one (even though the index itself may change).

Lemma 5.1.

For any neighboring histograms , where w.l.o.g. , and for any , we must have either or .

Proof.

Let be the index for . By assumption we have , which implies that for each index we must have . Therefore, for each , we have , which implies .

Similarly, we let be the index for , and we know that and are neighboring so for each index we must have . Therefore, for each , we have , which implies . ∎

Instead of considering the entire domain of size , our algorithms will be limited to a much smaller domain for each database and a given value , where

(5)

and assume that there is some arbitrary (data-independent) tie-breaking that occurs for ordering the histograms. We then have the following result, which bounds how much the change in counts between neighboring databases can be on elements that are in the set difference of the two domains.

Lemma 5.2.

For any -restricted sensitivity neighboring histograms , and some fixed , if then and if then

Proof.

If , then because . We first consider the case , which implies by Lemma 5.1 and because they are neighbors, we must have . Therefore, as desired. If instead , then again by Lemma 5.1 we have , and we must also have . Therefore, as desired.

The other claim follows symmetrically.

We now show that the set difference between the domains under the two neighboring histograms is bounded by a minimum involving their restricted sensitivity.

Lemma 5.3.

For any -restricted sensitivity neighboring histograms , and some fixed , then we must have

Proof.

By definition, we have , so we will assume and show for . We assume w.l.o.g. that , and because we know by construction that for any , then proving will imply . It now suffices to show that for any we must have . If then the position of index cannot have moved up the ordering from to because we assumed . Therefore, if and we must also have .

These properties will ultimately be critical in bounding the probability of indices outside of being output. Note that we typically think of , so we will eliminate from the minimum statement in Lemma 5.3 throughout the rest of the analysis.

5.2 Limited Domain Peeling Exponential Mechanism

Our main procedure is given in Algorithm 1, which involves adding Gumbel noise to each of the top-$\bar{k}$ counts in the given histogram. Note that from Section 4 we know that our analysis can be done by considering the exponential mechanism instead of Gumbel noise addition.

We now generalize the exponential mechanism we presented in Section 4.

Definition 5.1 (Limited Histogram Exponential Mechanism).

We define the Limited Histogram Exponential Mechanism for any to be such that

for all where and

(6)

From Lemma 4.3 we then have the following result, due to the fact that the exponential mechanism is -DP and we are simply adding a new coordinate, $\bot$, with its corresponding count.

Corollary 5.1.

For any fixed then is -range bounded and -DP.

In order to make our peeling algorithm in Algorithm 3 equivalent to the procedure in Algorithm 1, we will need to iterate over the same set of indices. Recall how we defined the limited domain in (5).

Lemma 5.4.

For any input histogram , and are equal in distribution.

Proof.

Note that both procedures will only consider indices in the limited domain to add to the output. We showed in Lemma 4.2 that adding Gumbel noise to all counts in a histogram, in this case the limited histogram, and taking the largest noisy counts is equivalent to iteratively peeling the exponential mechanism. Lastly, if we select $\bot$ as one of the indices, then we do not return any other indices with smaller noisy count than $\bot$. ∎

Input: Histogram , subset of indices; privacy parameters .
Output: Ordered set of indices.
Set
while  do
     Set
     if  then
         Let #concatenate to to retain the order
         Return
     else
         Let #concatenate to to retain the order      
Return
Algorithm 3 ; Peeling Exponential Mechanism version of Algorithm 1
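A minimal sketch of this peeling view, mirroring the Gumbel-based sketch from Section 3: at each step one index, or a stand-in for ⊥, is drawn with probability proportional to exp(eps * count / 2), and drawing ⊥ terminates the output. The eps/2 exponent and the treatment of the threshold count h_bot as a plain input are illustrative assumptions.

```python
import numpy as np

def peeling_exponential_mechanism(counts, k, k_bar, eps, h_bot, rng=None):
    """Sketch of the peeling view (Algorithm 3) over the limited domain."""
    rng = rng or np.random.default_rng()
    # Limited domain: the true top-k_bar elements, data-independent tie-breaking by id.
    domain = [i for i, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k_bar]]
    out = []
    while len(out) < k:
        candidates = [i for i in domain if i not in out] + [None]  # None plays the role of ⊥
        scores = np.array([counts[i] if i is not None else h_bot for i in candidates])
        weights = np.exp((eps / 2.0) * (scores - scores.max()))    # stabilized softmax weights
        choice = candidates[rng.choice(len(candidates), p=weights / weights.sum())]
        out.append(choice)
        if choice is None:  # drawing ⊥ ends the output
            break
    return out
```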
Corollary 5.2.

For any fixed and neighboring histograms , we have that is for any where is given in (2).

We will now fix two neighboring histograms , and separate out our outcome space into bad events for and . In particular, these will just be outputs that contain some index in the top- for one, but not in the top- for the other.

Definition 5.2.

For any neighboring histograms , then we define as the outcome set of and the outcome set of as .

We then define the bad outcomes as

The bulk of the heavy lifting will then be done by the following two lemmas, which bound the bad events and also give a simpler way to compare the good events in terms of pure differential privacy. For bounding the bad events, we need to upper bound the probability of outputting an index in the set difference of the two domains. If we consider one call to the exponential mechanism, then we could obtain an upper bound on the probability of outputting a given index in this set difference by restricting the possible outputs to just that index and $\bot$. However, naively applying this bound over the iterative calls would weaken it, so we will instead need a slightly more sophisticated argument that accounts for the fact that the iterative process terminates whenever $\bot$ is output.

Lemma 5.5.

For any neighboring histograms ,

We defer the proof to Appendix C. The next lemma will give us a clean way to compare the good events, which will mainly be due to the fact that conditional probabilities are simple to work with in the exponential mechanism. More specifically, if we consider the rejection sampling scheme of redrawing whenever we see a bad event, then the resulting probability distribution is actually equivalent to simply restricting our domain to the set of indices in the top-$\bar{k}$ for both histograms. This will then allow us to compare the probability distributions of both histograms outputting from the same domain.

Lemma 5.6.

For any neighboring histograms , such that , then we have that for any

We defer the proof to Appendix C. This lemma does not immediately give us pure differential privacy on outcomes in because while we will be able to compare and using Corollary 5.2, we still need to account for which we know is at least . This will give us a reasonably simple way to achieve a bound of on the total variation distance, but with some additional work we can eliminate the factor of two. In particular, we will use the following general result in the proof of our main result.

Claim 5.1.

For any and , and any non-negative , we have that

Proof.

Multiplying each term by and cancelling like terms gives the equivalent inequality of

If , then and we are done. If , then our inequality reduces to

Rearranging terms we get this is equivalent to

which holds because we assumed . ∎

We now combine these lemmas and the claim above to prove the main result of this section, and we will then show how Theorem 1 immediately follows from this lemma.

Lemma 5.7.

For any neighboring histograms and for any , we have that

for any , where .

Proof.

We will first separate such that and . For ease of notation, we will let

This then implies