Optimal Schemes for Discrete Distribution Estimation under Locally Differential Privacy
We consider the minimax estimation problem of a discrete distribution with support size $k$ under privacy constraints. A privatization scheme is applied to each raw sample independently, and we need to estimate the distribution of the raw samples from the privatized samples. A positive number $\epsilon$ measures the privacy level of a privatization scheme. For a given $\epsilon$, we consider the problem of constructing optimal privatization schemes with $\epsilon$-privacy level, i.e., schemes that minimize the expected estimation loss for the worst-case distribution. Two schemes in the literature provide order-optimal performance in the high privacy regime, where $\epsilon$ is very close to $0$, and in the low privacy regime, where $e^{\epsilon} \approx k$, respectively.
In this paper, we propose a new family of schemes which substantially improve the performance of the existing schemes in the medium privacy regime when $1 \ll e^{\epsilon} \ll k$. More concretely, we prove that when $3.8 < \epsilon < \ln(k/9)$, our schemes reduce the expected estimation loss by $50\%$ under the $\ell_2^2$ metric and by $30\%$ under the $\ell_1$ metric over the existing schemes. We also prove a lower bound for the region $e^{\epsilon} \ll k$, which implies that our schemes are order optimal in this regime.
A major challenge in the statistical analysis of user data is the conflict between learning accurate statistics and protecting sensitive information about the individuals. To study this tradeoff, we need a formal definition of privacy, and differential privacy has been put forth as one such candidate [1, 2]. Roughly speaking, differential privacy requires that the adversary not be able to reliably infer an individual’s data from public statistics even with access to all the other users’ data. The concept of differential privacy has been developed in two different contexts: the global privacy context (for instance, when institutions release statistics of groups of people), and the local privacy context, in which individuals disclose their personal data.
In this paper, we consider the minimax estimation problem of a discrete distribution with support size $k$ under locally differential privacy. This problem has been studied in the non-private setting [5, 6], where we can learn the distribution from the raw samples. In the private setting, we need to estimate the distribution of raw samples from the privatized samples, which are generated independently from each raw sample according to a conditional distribution (also called privatization scheme) Q. Given a privacy parameter $\epsilon > 0$, we say that Q is $\epsilon$-locally differentially private if the probabilities of the same output conditional on different inputs differ by a factor of at most $e^{\epsilon}$. Clearly, smaller $\epsilon$ means that it is more difficult to infer the original data from the privatized samples, and thus leads to higher privacy. For a given $\epsilon$, our objective is to find the optimal privatization scheme with $\epsilon$-privacy level to minimize the expected estimation loss for the worst case distribution. In this paper, we are mainly concerned with the scenario where we have a large number of samples, which captures the modern trend toward “big data” analytics.
I-A Existing results:
The following two privatization schemes are the most well-known in the literature: the $k$-ary Randomized Aggregatable Privacy-Preserving Ordinal Response ($k$-RAPPOR) scheme [7, 8], and the $k$-ary Randomized Response ($k$-RR) scheme [9, 10, 11]. The $k$-RAPPOR scheme is order optimal in the high privacy regime, where $\epsilon$ is very close to $0$, and the $k$-RR scheme is order optimal in the low privacy regime, where $e^{\epsilon} \approx k$. At the same time, to the best of our knowledge, no schemes work well in the medium privacy regime, where $e^{\epsilon}$ is far from both $1$ and $k$. Arguably, this regime is of practical importance: if $\epsilon$ is too close to $0$, then we may need too many samples to estimate the distribution accurately; on the other hand, taking $\epsilon$ too large can compromise the privacy requirement.
Duchi et al. gave a tight lower bound on the minimax private estimation loss for the high privacy regime, where $\epsilon$ is very close to $0$. At the same time, no meaningful lower bounds are known for the medium privacy regime.
I-B Our contributions:
In this paper we first propose a family of new privatization schemes which are order-optimal in the medium to high privacy regimes when $e^{\epsilon} \ll k$. We show that our schemes are better than the two existing schemes in the medium privacy regime where $1 \ll e^{\epsilon} \ll k$. For instance, we show that for the $\ell_2^2$ loss our scheme outperforms the $k$-RR scheme by a factor of , and prove similar results for $k$-RAPPOR and the $\ell_1$ loss. We also show that when $3.8 < \epsilon < \ln(k/9)$, our schemes reduce the expected estimation loss by $50\%$ under the $\ell_2^2$ metric and by $30\%$ under the $\ell_1$ metric over the existing schemes. This compares favorably with the existing literature (e.g., ), where an improvement of several percentage points constitutes a substantial advance. Second, we prove a tight lower bound for the whole region $e^{\epsilon} \ll k$, which implies that our schemes are order optimal in this regime. We also prove that in order to obtain the optimal performance, we only need to consider the privatization schemes formed by extremal configurations, namely, we can restrict ourselves to the privatization schemes with finite output alphabet and the property that the ratio between the probabilities of a given output conditional on different inputs is either $1$ or $e^{\pm\epsilon}$.
After this paper was completed, we learned that the privatization scheme and the empirical estimator that we derive were proposed earlier in the work of Wang et al. under the name of $k$-subset mechanisms. Wang et al. showed that the $k$-subset mechanisms outperform the $k$-RR and RAPPOR schemes, quantifying the improvement in experimental results. They also proposed the efficient implementation of their estimator that we discuss in Remark III.1 below. At the same time, their work does not include a detailed analysis of the existing schemes and the new proposal in the medium privacy regime. Finally, it does not address lower bounds on the risk and therefore does not include the statement that the proposed privatization mechanisms are order-optimal in terms of the expected estimation loss.
II Preliminaries and problem formulation
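Notation: Let $\mathcal{X} = \{1, 2, \dots, k\}$ be the source alphabet and let $p = (p_1, p_2, \dots, p_k)$ be a probability distribution on $\mathcal{X}$. Denote by $\Delta_k = \{p \in \mathbb{R}^k : p_i \ge 0 \text{ for all } i,\ \sum_{i=1}^k p_i = 1\}$ the $k$-dimensional probability simplex. Let $X$ be a random variable (RV) that takes values on $\mathcal{X}$ according to p, so that $p_i = P(X = i)$. Denote by $X^n = (X^{(1)}, X^{(2)}, \dots, X^{(n)})$ the vector formed of $n$ independent copies of the RV $X$.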
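In the classical (non-private) distribution estimation problem, we are given direct access to i.i.d. samples $\{X^{(i)}\}_{i=1}^n$ drawn according to some unknown distribution $p \in \Delta_k$. Our goal is to estimate p based on the samples $X^n$. We define an estimator $\hat{p}$ as a function $\hat{p}: \mathcal{X}^n \to \mathbb{R}^k$ and assess the quality of the estimator in terms of the risk (expected loss)
$$r^{\ell}_{k,n}(p, \hat{p}) = \mathbb{E}_{X^n \sim p^n}\, \ell(\hat{p}(X^n), p),$$
where $\ell$ is some loss function. The minimax risk is defined as the following saddlepoint problem:
$$r^{\ell}_{k,n} = \inf_{\hat{p}} \sup_{p \in \Delta_k} \mathbb{E}_{X^n \sim p^n}\, \ell(\hat{p}(X^n), p).$$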
In the private distribution estimation problem, we can no longer access the raw samples $\{X^{(i)}\}_{i=1}^n$. Instead, we estimate the distribution p from the privatized samples $\{Y^{(i)}\}_{i=1}^n$, obtained by applying a privatization mechanism Q independently to each raw sample $X^{(i)}$. A privatization mechanism (also called privatization scheme) is simply a conditional distribution $Q_{Y|X}$. The privatized samples $Y^{(i)}$ take values in a set $\mathcal{Y}$ (the “output alphabet”) that does not have to be the same as $\mathcal{X}$.
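The quantities $\{Y^{(i)}\}_{i=1}^n$ are i.i.d. samples drawn according to the marginal distribution m given by
$$m(S) = \sum_{i=1}^{k} Q(S \mid i)\, p_i \tag{1}$$
for any $S \in \sigma(\mathcal{Y})$, where $\sigma(\mathcal{Y})$ denotes an appropriate $\sigma$-algebra on $\mathcal{Y}$. In accordance with this setting, the estimator $\hat{p}$ is a measurable function $\hat{p}: \mathcal{Y}^n \to \mathbb{R}^k$. Define the minimax risk of the privatization mechanism Q as
$$r^{\ell}_{k,n}(Q) = \inf_{\hat{p}} \sup_{p \in \Delta_k} \mathbb{E}_{Y^n \sim m^n}\, \ell(\hat{p}(Y^n), p),$$
where $m^n$ is the $n$-fold product distribution and m is given by (1).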
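For a given $\epsilon > 0$, a privatization mechanism Q is said to be $\epsilon$-locally differentially private if
$$\sup_{S \in \sigma(\mathcal{Y}),\; x, x' \in \mathcal{X}} \frac{Q(S \mid x)}{Q(S \mid x')} \le e^{\epsilon}. \tag{2}$$
(Footnote 1: Following the existing literature, we use the quantity $\epsilon$ as the measure of privacy level even though $\epsilon$ is never used separately in our derivations and results.)
Denote by $\mathcal{D}_{\epsilon}$ the set of all $\epsilon$-locally differentially private mechanisms. Given a privacy level $\epsilon$, we want to find the optimal $Q \in \mathcal{D}_{\epsilon}$ with the smallest possible minimax risk among all the $\epsilon$-locally differentially private mechanisms. We further define the $\epsilon$-private minimax risk as
$$r^{\ell}_{\epsilon,k,n} = \inf_{Q \in \mathcal{D}_{\epsilon}} r^{\ell}_{k,n}(Q). \tag{3}$$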
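In Sect. IV, we show that it suffices to restrict oneself to finite output alphabets, i.e.,
$$r^{\ell}_{\epsilon,k,n} = \inf_{Q \in \mathcal{D}_{\epsilon,F}} r^{\ell}_{k,n}(Q),$$
where $\mathcal{D}_{\epsilon,F}$ is the set of $\epsilon$-locally differentially private mechanisms with finite output alphabet. For $Q \in \mathcal{D}_{\epsilon,F}$, Eq. (2) is equivalent to
$$\frac{Q(y \mid x)}{Q(y \mid x')} \le e^{\epsilon} \quad \text{for all } x, x' \in \mathcal{X} \text{ and all } y \in \mathcal{Y}.$$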
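For a finite output alphabet, the privacy condition reduces to a column-wise ratio test on the transition matrix. The following minimal sketch illustrates this check together with the marginal distribution from (1). The function names and the matrix layout (rows are inputs, columns are outputs) are assumptions made for illustration, and `k_rr` encodes the standard $k$-ary randomized response matrix rather than anything specific to this paper.

```python
import math
import numpy as np


def is_eps_ldp(Q, eps, tol=1e-12):
    """Check Q(y|x) <= e^eps * Q(y|x') for all x, x', y.

    Q is a row-stochastic matrix with Q[x, y] = Q(y | x).
    It suffices to compare the largest and smallest entry of each column.
    """
    Q = np.asarray(Q, dtype=float)
    col_max = Q.max(axis=0)
    col_min = Q.min(axis=0)
    return bool(np.all(col_max <= math.exp(eps) * col_min + tol))


def k_rr(k, eps):
    """Standard k-ary randomized response as a k x k matrix."""
    e = math.exp(eps)
    Q = np.full((k, k), 1.0 / (e + k - 1))
    np.fill_diagonal(Q, e / (e + k - 1))
    return Q


def marginal(p, Q):
    """Distribution of the privatized sample: m(y) = sum_i p_i Q(y|i)."""
    return np.asarray(p, dtype=float) @ np.asarray(Q, dtype=float)
```

With these helpers, `is_eps_ldp(k_rr(5, 1.0), 1.0)` holds with equality in every column, while the same matrix fails the test at any smaller privacy parameter.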
We shall also write the definition of the marginal distribution m in (1) as
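We will use standard distance functions on distributions defined on finite sets $\mathcal{Y}$. The KL divergence between two such distributions $m_1$ and $m_2$ is defined as
$$D_{\mathrm{KL}}(m_1 \| m_2) = \sum_{y \in \mathcal{Y}} m_1(y) \log \frac{m_1(y)}{m_2(y)}.$$
The total variation distance between $m_1$ and $m_2$ is defined as
$$\|m_1 - m_2\|_{\mathrm{TV}} = \max_{A \subseteq \mathcal{Y}} |m_1(A) - m_2(A)| = \frac{1}{2} \sum_{y \in \mathcal{Y}} |m_1(y) - m_2(y)|.$$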
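As a quick illustration of the two distances, the sketch below computes them for distributions given as arrays over a finite set. It uses the standard textbook definitions (natural logarithm, with the convention $0 \log 0 = 0$); the function names are hypothetical.

```python
import numpy as np


def kl_divergence(m1, m2):
    """KL divergence D(m1 || m2) in nats; infinite if m1 is not dominated by m2."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    support = m1 > 0  # terms with m1(y) = 0 contribute nothing
    if np.any(m2[support] == 0):
        return float("inf")
    return float(np.sum(m1[support] * np.log(m1[support] / m2[support])))


def tv_distance(m1, m2):
    """Total variation distance: half of the l1 distance."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    return 0.5 * float(np.sum(np.abs(m1 - m2)))
```

The two quantities are related by Pinsker's inequality, $\|m_1 - m_2\|_{\mathrm{TV}} \le \sqrt{D_{\mathrm{KL}}(m_1 \| m_2)/2}$, which is a convenient sanity check for any implementation.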
III New schemes
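In this section we introduce a family of new privatization schemes. Our schemes are parameterized by the integer $d \in \{1, 2, \dots, k-1\}$. Given $k$ and $d$, let the output alphabet be $\mathcal{Y}_{k,d} = \{y \in \{0,1\}^k : \sum_{i=1}^k y_i = d\}$. Clearly, $|\mathcal{Y}_{k,d}| = \binom{k}{d}$. Define
$$Q_{k,\epsilon,d}(y \mid i) = \frac{e^{\epsilon} y_i + (1 - y_i)}{\binom{k-1}{d-1} e^{\epsilon} + \binom{k-1}{d}} \tag{4}$$
for all $y \in \mathcal{Y}_{k,d}$ and all $i \in \mathcal{X}$. To define the estimator for $Q_{k,\epsilon,d}$ we need to calculate the marginal distribution of each coordinate of the output. We begin with a concrete example to illustrate the method of the calculation.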
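Example III.1. Let $k = 4$ and $d = 2$, so that the output alphabet is the set of all vectors in $\{0,1\}^4$ with two ones and two zeros. For any $i$, the $i$-th coordinate $Y_i$ of the output is a Bernoulli random variable. Consider the event $Y_1 = 1$. We have
$$P(Y_1 = 1 \mid X = 1) = \frac{3e^{\epsilon}}{3e^{\epsilon} + 3}, \qquad P(Y_1 = 1 \mid X = i) = \frac{e^{\epsilon} + 2}{3e^{\epsilon} + 3} \quad \text{for } i \in \{2, 3, 4\}.$$
Using (1), we obtain
$$P(Y_1 = 1) = \frac{3e^{\epsilon}}{3e^{\epsilon} + 3}\, p_1 + \frac{e^{\epsilon} + 2}{3e^{\epsilon} + 3}\, (1 - p_1).$$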
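For $d \ge 2$ we can derive the marginal distribution of each coordinate of the output using the method illustrated above:
$$P(Y_i = 1) = \frac{\binom{k-1}{d-1} e^{\epsilon} p_i + \Big(\binom{k-2}{d-2} e^{\epsilon} + \binom{k-2}{d-1}\Big)(1 - p_i)}{\binom{k-1}{d-1} e^{\epsilon} + \binom{k-1}{d}} = \frac{d\big((k-d)(e^{\epsilon} - 1)\, p_i + (d-1) e^{\epsilon} + k - d\big)}{(k-1)(d e^{\epsilon} + k - d)}, \tag{5}$$
where $Y_i$ is the $i$-th coordinate of the output. It is easy to check that the final expression in (5) also holds for $d = 1$.
Solving for $p_i$, we obtain the empirical estimator of p under $Q_{k,\epsilon,d}$ in the following form
$$\hat{p}_i = \frac{(k-1)(d e^{\epsilon} + k - d)}{d (k-d)(e^{\epsilon} - 1)} \cdot \frac{T_i}{n} - \frac{(d-1) e^{\epsilon} + k - d}{(k-d)(e^{\epsilon} - 1)}, \tag{6}$$
where $T_i = \sum_{j=1}^{n} Y_i^{(j)}$ is the number of privatized samples whose $i$-th coordinate equals $1$.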
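The marginal of a single output coordinate can be checked numerically by enumerating all outputs for small $k$ and $d$. The sketch below does this under an explicit assumption about the form of the scheme: each output with exactly $d$ ones has conditional probability proportional to $e^{\epsilon}$ if its coordinate at the input position is one, and proportional to $1$ otherwise. The closed form in `marginal_closed` is what that assumption yields after simplification; both function names are hypothetical.

```python
import math
from itertools import combinations


def subset_prob(y, i, k, d, eps):
    """Assumed conditional probability of output y given input i.

    y is the tuple of the d coordinates that equal one.
    """
    e = math.exp(eps)
    denom = math.comb(k - 1, d - 1) * e + math.comb(k - 1, d)
    return (e if i in y else 1.0) / denom


def marginal_brute(p, k, d, eps, i):
    """P(Y_i = 1) by summing over all C(k, d) outputs and all inputs."""
    total = 0.0
    for y in combinations(range(k), d):
        if i in y:
            total += sum(p[x] * subset_prob(y, x, k, d, eps) for x in range(k))
    return total


def marginal_closed(p_i, k, d, eps):
    """Closed form of P(Y_i = 1) implied by the assumed scheme."""
    e = math.exp(eps)
    num = (k - d) * (e - 1) * p_i + (d - 1) * e + (k - d)
    return d * num / ((k - 1) * (d * e + k - d))
```

For $d = 1$ the closed form reduces to $((e^{\epsilon} - 1) p_i + 1)/(e^{\epsilon} + k - 1)$, the familiar marginal of $k$-ary randomized response.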
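Remark III.1. When $k$ is large, the denominator in (4) is exponentially large in $k$. In practice, $k$ can be several hundred to several thousand, and the conditional probability of each output can thus be very small. To circumvent computational difficulties in (4), we suggest the following recursive scheme for implementing $Q_{k,\epsilon,d}$. Given a raw sample (input) $i$, we first produce the $i$-th coordinate $y_i$ of the privatized sample (output) according to the distribution:
$$P(y_i = 1) = \frac{d e^{\epsilon}}{d e^{\epsilon} + k - d}, \qquad P(y_i = 0) = \frac{k - d}{d e^{\epsilon} + k - d}.$$
If $y_i$ is $1$, then we choose $d - 1$ distinct elements $i_1, \dots, i_{d-1}$ uniformly from $\{1, \dots, k\} \setminus \{i\}$ and, for $j \ne i$, set $y_j = 1$ if $j \in \{i_1, \dots, i_{d-1}\}$ and $y_j = 0$ otherwise. If $y_i$ is $0$, then we choose $d$ distinct elements $i_1, \dots, i_d$ uniformly from $\{1, \dots, k\} \setminus \{i\}$ and set $y_j = 1$ if $j \in \{i_1, \dots, i_d\}$ and $y_j = 0$ otherwise. When we choose distinct elements uniformly from the set $\{1, \dots, k\} \setminus \{i\}$, we choose them one by one: we first choose $i_1$ uniformly from $\{1, \dots, k\} \setminus \{i\}$, then we choose $i_2$ uniformly from $\{1, \dots, k\} \setminus \{i, i_1\}$, and so on, until we have chosen the required number of elements. It is easy to verify that the procedure we described above produces exactly the same distribution as designed in (4). Moreover, the smallest probability we need to deal with is at least in this procedure. So the scheme can be efficiently implemented in practice.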
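The two-step procedure of the remark can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the first-step probability $d e^{\epsilon}/(d e^{\epsilon} + k - d)$ and the coefficients of the estimator are derived from the assumed form of the scheme (outputs with a one at the input position are $e^{\epsilon}$ times as likely as the others), indices are 0-based, and NumPy's sampling without replacement stands in for the one-by-one selection.

```python
import math
import numpy as np


def privatize(i, k, d, eps, rng):
    """Return a 0/1 vector of length k with exactly d ones for input i."""
    e = math.exp(eps)
    y = np.zeros(k, dtype=int)
    others = np.delete(np.arange(k), i)  # all coordinates except i
    if rng.random() < d * e / (d * e + k - d):
        # Coordinate i is one: fill the remaining d - 1 ones uniformly.
        y[i] = 1
        chosen = rng.choice(others, size=d - 1, replace=False)
    else:
        # Coordinate i is zero: place all d ones among the other coordinates.
        chosen = rng.choice(others, size=d, replace=False)
    y[chosen] = 1
    return y


def estimate(Y, k, d, eps):
    """Empirical estimator from an n x k array of privatized samples."""
    e = math.exp(eps)
    freq = Y.sum(axis=0) / Y.shape[0]  # fraction of samples with y_i = 1
    scale = (k - 1) * (d * e + k - d) / (d * (k - d) * (e - 1))
    shift = ((d - 1) * e + k - d) / ((k - d) * (e - 1))
    return scale * freq - shift
```

Because every output has exactly $d$ ones, the estimates always sum to one, although individual entries may fall slightly outside $[0, 1]$ for small sample sizes.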
Let us calculate the risk under the $\ell_2^2$ loss and the $\ell_1$ loss.
Suppose that the privatization scheme is and the empirical estimator is given by (6). Let For all and we have that
The expected loss in the limit of large is given by
The proof is elementary but somewhat tedious. It is given in Appendix A.
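Next we find the optimal value of $d$ to minimize the $\ell_2^2$ risk and the $\ell_1$ risk for the worst case distribution.
Let $\epsilon$ be a given privacy level. The optimal choice of $d$ for both the $\ell_2^2$ risk and the $\ell_1$ risk is given by either $d = \lfloor k/(e^{\epsilon} + 1) \rfloor$ or $d = \lceil k/(e^{\epsilon} + 1) \rceil$.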
Let us begin with the $\ell_2^2$ case. Starting from (7), we need to minimize the terms that contain $d$:
Denote the expression in the outer parentheses on the right-hand side by We have
It is easy to see that is an increasing function in the interval Thus the minimum of occurs when namely, when Notice that Since is an integer between and the minimum is attained at one of the nearest integers to
As for the $\ell_1$ loss, by using the Cauchy-Schwarz inequality twice, we can easily see that the right-hand side of (8) reaches its maximum for the uniform distribution:
We find It is easy to see that is an increasing function in the interval Thus the minimum of occurs when namely, when Since is integer, this concludes the proof. ∎
In order to avoid the case $d = 0$, below we take $d = \lceil k/(e^{\epsilon} + 1) \rceil$ as a convenient and nearly optimal choice. The next proposition gives upper bounds on the $\ell_2^2$ risk and the $\ell_1$ risk for this value of $d$.
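The choice of the subset size is easy to compute. The sketch below assumes the rule $d = \lceil k/(e^{\epsilon} + 1) \rceil$, which always returns at least $1$; the function name is hypothetical.

```python
import math


def choose_d(k, eps):
    """Subset size under the assumed rule d = ceil(k / (e^eps + 1))."""
    return math.ceil(k / (math.exp(eps) + 1))
```

For very small $\epsilon$ this gives roughly $k/2$, and once $e^{\epsilon} + 1 \ge k$ it gives $d = 1$, so the subset shrinks as the privacy requirement is relaxed.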
Let Suppose that the privatization scheme is and the corresponding empirical estimator is given by (6). Let For all and we have that
and for large we have that
In the regime we have
In the regime we have
We begin with proving the upper bound on risk. We know that In (7), there are only two terms containing The first one is an increasing function of for so replacing with gives an upper bound on this term:
where follows from the assumption that and the obvious inequality for all and follows from the assumption that
The second term in (7) that we need to analyze is It is a decreasing function of for so replacing with gives an upper bound on this term:
Next we prove the upper bound on risk. As we noted before, the right-hand side of (8) is maximum when where is the uniform distribution. For this reason, we will bound from above the right-hand side of (9). Again there are only two terms in (9) that contain The first one is and it is an increasing function of for Replacing with we obtain the following upper bound on this term:
The other term in (9) that involves is and it is a decreasing function of for Replacing with we obtain the following upper bound on this term:
where follows from the fact that for all This proves (11), and the rest of the proposition follows immediately. ∎
III-A Comparison of our scheme with $k$-RR and $k$-RAPPOR
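In this section we compare our scheme to the two existing privatization schemes in the literature. The $k$-RR scheme is the same as $Q_{k,\epsilon,1}$ in this paper. The empirical estimator for the $k$-RR scheme is given by (6) once we put $d = 1$. In the low-privacy regime, where $e^{\epsilon} \approx k$, our choice of $d$ is $d = 1$, so in this regime our scheme coincides with the $k$-RR scheme.
To define the $k$-RAPPOR scheme [7, 8], let $\mathcal{Y} = \{0,1\}^k$. Given an input $i$, the output vector $y$ is obtained by flipping each coordinate of $e_i$ independently with probability $1/(e^{\epsilon/2} + 1)$, where $e_i$ is the $i$-th vector in the standard basis of $\mathbb{R}^k$. Formally, the $k$-RAPPOR scheme is defined as follows:
$$Q_{k\text{-RAPPOR}}(y \mid i) = \frac{e^{(\epsilon/2)\,(k - \|y - e_i\|_1)}}{(e^{\epsilon/2} + 1)^{k}}$$
for all $y \in \{0,1\}^k$ and all $i \in \mathcal{X}$. The empirical estimator for the $k$-RAPPOR scheme is
$$\hat{p}_i = \frac{e^{\epsilon/2} + 1}{e^{\epsilon/2} - 1} \cdot \frac{T_i}{n} - \frac{1}{e^{\epsilon/2} - 1}, \tag{18}$$
where $T_i = \sum_{j=1}^{n} Y_i^{(j)}$ [12, Prop. 4]. In the high-privacy regime, where $\epsilon$ is close to $0$, the $k$-RAPPOR scheme and its empirical estimator give order-optimal performance. More specifically, when $\epsilon$ is small and $n$ is large, the $\ell_2^2$ risk is approximately $4k/(n\epsilon^2)$ and the $\ell_1$ risk is approximately $4k/(\epsilon\sqrt{2\pi n})$. At the same time, Duchi et al. show that for $\epsilon$ close to $0$ the minimax risk (3) behaves as
$$r^{\ell_2^2}_{\epsilon,k,n} = \Theta\Big(\frac{k}{n\epsilon^2}\Big), \qquad r^{\ell_1}_{\epsilon,k,n} = \Theta\Big(\frac{k}{\epsilon\sqrt{n}}\Big).$$
As a result, the $k$-RAPPOR scheme gives order-optimal performance in the high privacy regime.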
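For comparison, the $k$-RAPPOR mechanism and its empirical estimator can be sketched as below. The flip probability $1/(e^{\epsilon/2} + 1)$ is the standard choice for this mechanism and is assumed here; the function names and the convention of passing per-coordinate frequencies to the estimator are illustrative choices, and indices are 0-based.

```python
import math
import numpy as np


def rappor_privatize(i, k, eps, rng):
    """One-hot encode input i, then flip each bit independently."""
    flip = 1.0 / (math.exp(eps / 2) + 1)
    y = np.zeros(k, dtype=int)
    y[i] = 1
    flips = rng.random(k) < flip  # which coordinates get flipped
    return np.where(flips, 1 - y, y)


def rappor_estimate(freq, eps):
    """Invert the bit-flip channel.

    freq[i] is the fraction of privatized samples whose i-th bit is one.
    """
    a = math.exp(eps / 2)
    freq = np.asarray(freq, dtype=float)
    return (a + 1) / (a - 1) * freq - 1 / (a - 1)
```

Two inputs differ in two coordinates of their one-hot encodings, so the likelihood ratio of any output is at most $(e^{\epsilon/2})^2 = e^{\epsilon}$, which is why the per-bit parameter is $\epsilon/2$.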
To compare our scheme with $k$-RAPPOR in the high privacy regime, let $\epsilon$ be small and $n$ be large. According to (10)-(11), the $\ell_2^2$ risk of our scheme is approximately $4k/(n\epsilon^2)$ and the $\ell_1$ risk is approximately $4k/(\epsilon\sqrt{2\pi n})$, which are exactly the same as those of the $k$-RAPPOR scheme. Thus in the high privacy regime the proposed scheme does not improve over the known results.
At the same time, the comparison is in favor of our schemes in the medium-privacy regime when $1 \ll e^{\epsilon} \ll k$.
The risks of the -RR and -RAPPOR schemes in the medium privacy regime are given in the following table.
We can make the claims of this proposition more specific by computing numerical bounds on the improvement of our scheme over the two existing schemes in the medium privacy regime. We show that if $3.8 < \epsilon < \ln(k/9)$, then the expected loss of our scheme is at most $50\%$ of that of the existing schemes under the $\ell_2^2$ loss and at most $70\%$ of that of the existing schemes under the $\ell_1$ loss.
To show this, let be the expected estimation loss of -RAPPOR under its empirical estimator (18) and let be the same for -RR, both measured by loss function Let be the expected estimation loss under given in (4) and its empirical estimator given in (6) for distribution (We omit parameters from the notation as they are clear from the context.) We further define
If and then
and for large
The proof is given in Appendix B.
Remark III.2. As discussed in , along with the empirical estimator for the $k$-RR and $k$-RAPPOR schemes, there are other estimators, for instance, the normalized estimator and the projected estimator. These estimators differ from the empirical estimator only when the latter gives some output which is not in $\Delta_k$. Since the empirical estimator is unbiased, the probability of such events is exponentially small. As mentioned in the introduction, we are interested in the regime where $n$ is large, so the performance of the different estimators differs only by an exponentially small amount, which can be neglected. This justifies our choice of only comparing the performance under empirical estimators.
IV Lower bound
In this section, we give a tight lower bound on the minimax risk defined in (3). Our argument consists of two steps. In the first step we establish that in order to obtain the optimal performance, we can restrict ourselves to the privatization schemes with the so-called extremal configurations; cf. Theorem IV.5. In this part we are motivated by a result in  which shows that a similar property holds for schemes optimal in terms of information theoretic utilities, such as mutual information between the input and the output. In the second step we derive lower bounds on the risk that will establish order-optimality of the proposed privatization scheme. The main result of this section is given in the following theorem.
IV-A Reduction to extremal configurations
We begin with showing that we only need to consider privatization schemes with finite output alphabet. The argument relies on the following technical lemma whose proof is given in Appendix C.
Let be probability measures defined on a measurable space For any partition of into a finite number of disjoint sets which are measurable with respect to the -fold product -algebra and any there exists a partition of into a finite number of disjoint measurable sets and a partition of into disjoint sets such that:
The sets are measurable with respect to the -fold finite product algebra where is the finite algebra generated by the sets
For any and any multi-index
where is the -fold product measure on the product measurable space
The next lemma establishes the fact that we do not need to look beyond finite output alphabets in our search for optimal schemes.
Let $\mathcal{D}_{\epsilon,F}$ be the set of $\epsilon$-locally differentially private mechanisms with finite output alphabet. For $\ell = \ell_2^2$ or $\ell = \ell_1$,
Define a clipping function