Information-Theoretic Privacy with General Distortion Constraints


Abstract

The privacy-utility tradeoff problem is formulated as determining the privacy mechanism (random mapping) that minimizes the mutual information (a metric for privacy leakage) between the private features of the original dataset and a released version. The minimization is studied under two types of constraints on the distortion between the public features and the released version of the dataset: (i) a constraint on the expected value of a cost function applied to the distortion, and (ii) a bound on the complementary CDF of the distortion by a given non-increasing function. The first scenario captures various practical cost functions for distorted released data, while the second covers large deviation constraints on utility. The asymptotic optimal leakage is derived in both scenarios. For the distortion cost constraint, it is shown that for convex cost functions there is no asymptotic loss in using stationary memoryless mechanisms. For the complementary CDF bound on distortion, the asymptotic leakage is derived for general mechanisms and shown to be the integral of the single letter leakage function with respect to the Lebesgue measure defined by the refined bound on distortion. In both cases, however, memoryless mechanisms are shown to be suboptimal in general.

Index Terms

Privacy-utility tradeoff, mutual information leakage, distortion cost function, distortion distribution constraints.

1 Introduction

Let be a random data sequence, where the two components represent the public and private sections of the data, respectively, and are drawn from an i.i.d. distribution . Each entry represents a row of the dataset. We wish to find a privacy mechanism, i.e., a random mapping, that reveals a sequence such that (i) statistical information about the public data can be learned from the revealed sequence, and (ii) as little information as possible about the private data is revealed. These two goals are in conflict, since the public and private data are typically correlated. Thus, we wish to characterize the privacy-utility tradeoff (PUT) while being careful to choose meaningful utility and privacy metrics.

Our focus is on inferential adversaries that can learn the hidden features from the released dataset. To this end, we motivate the use of mutual information between the private features and the revealed version of the dataset as a metric for privacy leakage. We do so by first noting that mutual information is a regret function used in many learning applications and quantifies the Kullback-Leibler distance between the prior and posterior knowledge of the inferred data given the released data. Furthermore, mutual information is related to the Fisher information for asymptotically large datasets, and, combined with Fano's inequality, it serves as a measure of how well an adversary can estimate functions of the hidden features.

For the choice of utility metric, an average distortion constraint has been used in many works, where a given distortion function between the public data and the released data is required to stay below a threshold in expectation. However, this utility metric does not capture all aspects of the distortion distribution. One step toward capturing more of the distortion distribution is the tail probability constraint (also called the excess distortion constraint). This constraint has been of much interest in source coding (see for example [2, 3, 4, 5, 6]) and channel coding (see for example [7, 8, 9]), and has been studied in the context of privacy in [10]. For a more detailed survey on finite blocklength approaches, see [11].

However, even the tail probability constraint does not capture the full spectrum of possible bounds on the distortion distribution. In this paper, we generalize the tail probability constraint in two ways:

  • A bound on the average distortion cost, where the distortion cost is a non-decreasing function applied to a separable distortion measure between the public and released data. The resulting PUT is given by

    (1)
  • A non-increasing function used to bound the complementary CDF of the distortion measure between the public and released data. The resulting PUT is given by

    (2)

The cost constraint in (1) imposes increasing penalties on higher levels of distortion in general, and reduces to a tail probability constraint when the cost function is a single step. The distortion distribution bound in (2) allows arbitrarily fine-tuned bounds on the complementary CDF of the distortion, and reduces to a tail probability constraint when the bounding function is a single step. Note that these two types of constraints are not equivalent in general and can capture different requirements on the distortion distribution.
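As a quick numerical illustration of how the two constraint types meet, the following sketch (with toy uniformly distributed distortion values and an illustrative threshold of 0.6, none of which come from the paper) checks that an expected-cost constraint with a step cost function controls exactly the tail probability that the complementary CDF bound controls at the same threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.uniform(0.0, 1.0, size=100_000)  # toy per-sequence distortion values
delta = 0.6                              # illustrative distortion threshold

# (i) expected cost with the step cost g(x) = 1{x > delta} ...
g = (d > delta).astype(float)
step_cost = g.mean()

# (ii) ... equals the complementary CDF of the distortion at delta:
tail_prob = np.mean(d > delta)

print(step_cost, tail_prob)
```

For a uniform distortion on [0, 1] both quantities are P(d > 0.6) = 0.4, confirming that the step cost recovers the excess distortion constraint as a special case.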

1.1 Contributions

A privacy mechanism can be applied to the dataset as a whole, or to each individual entry of the dataset independently. We refer to the mechanisms for these two approaches as general and memoryless mechanisms, respectively. In this paper:

  • We derive precise expressions for the asymptotic leakage under the distortion cost constraint in (1). For memoryless mechanisms, the leakage is equal to the single letter leakage function evaluated at the inverse of the cost function applied to the cost threshold, and for general mechanisms, it is the lower convex envelope of the leakage tradeoff curve under memoryless mechanisms.

  • We also give the exact formulation of the asymptotic leakage in (2) for memoryless and general mechanisms. For memoryless mechanisms, it is equal to the single letter leakage function evaluated at the largest distortion value that is equal to . For general mechanisms, it is the integral of single letter leakage function with respect to the Lebesgue measure defined based on the constraint function .

  • In both cases, the optimal general mechanisms are mixtures of memoryless mechanisms.

The formulations in (1) and (2) include the dependence on both the public and private parts of the dataset. In cases where the private data is not directly available but its statistics are known, the private, public, and revealed data form a Markov chain. In this paper, we focus on the general case with both public and private data available to the mechanism, but the results generalize in a straightforward manner to the case when the private data is not available.

1.2 Related Work

An alternative approach to more general distortion constraints is considered in [6] and referred to as -separable distortion measures. In [6], a multi-letter distortion measure is defined as -separable if

(3)

for an increasing function . The distortion cost constraints that we consider are more general in the sense that our notion of cost function applied to the distortion measure covers a broader class of distortion constraints than an average bound on -separable distortion measures studied in [6]. Specifically, the average constraint on an -separable distortion measure has the form

(4)

which is clearly a special case of our formulation in (1), obtained by a suitable choice of the cost function. Moreover, we allow for non-decreasing cost functions, which means the cost function does not have to be strictly increasing. We also note that our focus is on privacy rather than source coding.

In the context of privacy, the privacy-utility tradeoff with distinct public and private data is studied in [12] and more extensively in [13], but the utility metric there is restricted to identity cost functions. Generalizing this to the excess distortion constraint was considered in [10]. In [10], we also differentiated between availability and unavailability of the private data to the privacy mechanism. Information theoretic approaches to privacy that are agnostic to the length of the dataset are considered in [14, 15, 16].

In [10], we also allow the mechanisms to be either memoryless (also referred to as local privacy) or general. This distinction has also been considered in the context of differential privacy (DP) (see for example [17, 18, 19, 20, 21]). In the information theoretic context, it is useful to understand how memoryless mechanisms behave under the more general distortion constraints considered here. Even less is known about how general mechanisms behave, and characterizing them is a central aim of this paper.

In this paper, we first set up the problem formulation in Section 2. Then, in Section 3 we present our main results on the asymptotic leakage for general and memoryless mechanisms, under the average distortion cost and complementary CDF bounds on distortion. We illustrate the results with examples in Section 4, and provide all the proofs in Section 5.

1.3 Notation

Throughout this paper we use as the distortion value, and to indicate the distortion function used for measuring utility. We also use for the KL-divergence between two distributions. The mutual information between two variables and is denoted by , and the base of all logarithms and exponentials is the same but otherwise arbitrary. We denote binary entropy by , and use for expectation with respect to distribution , where the subscript is dropped when it is clear from context. We denote random variables by capital letters, and their corresponding alphabets by calligraphic letters. The lower convex envelope of a function at any point in its domain is given by

(5)
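The lower convex envelope in (5) can be computed numerically on a grid by taking, at each grid point, the smallest value attained by any chord between two sample points straddling it. A minimal sketch (the quartic test function below is an arbitrary non-convex choice, not an object from the paper):

```python
import numpy as np

def lower_convex_envelope(x, f):
    """Pointwise lower convex envelope of the points (x[i], f[i]):
    at each grid point, take the minimum over all chords between two
    sample points straddling it. A direct, O(n^3) transcription of the
    definition, fine for small grids."""
    n = len(x)
    env = np.array(f, dtype=float)
    for k in range(n):
        for i in range(k + 1):
            for j in range(k, n):
                if i == j:
                    continue
                lam = (x[k] - x[i]) / (x[j] - x[i])
                env[k] = min(env[k], (1 - lam) * f[i] + lam * f[j])
    return env

# Non-convex example: f(x) = (x^2 - 1)^2 has two minima at x = +/-1,
# so its envelope is flat (zero) across [-1, 1].
x = np.linspace(-1.5, 1.5, 61)
f = (x**2 - 1) ** 2
env = lower_convex_envelope(x, f)
print(f[30], env[30])  # f(0) = 1, while the envelope at 0 is 0
```

The flat segment of the envelope is exactly where time sharing between two mechanisms beats any single memoryless mechanism, which is the mechanism-design content of Theorem 2.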

2 Problem Definition and Preliminaries

Let the source data be a dataset of independent and identically distributed (i.i.d.) random variables. The revealed data is an -length sequence drawn from the alphabet , and all alphabets are assumed to be finite. A random mechanism is used to generate the revealed data given the source data .

In order to quantify the utility of the revealed data, consider a single letter distortion measure given by a function . The distortion between -length sequences is then the per-letter average of the single letter distortions. The following definitions introduce our main quantities of interest: the minimum leakage for a dataset subject to a distortion cost constraint, and subject to a complementary CDF bound on distortion. We differentiate between the memoryless and general mechanisms by the superscripts and , respectively.
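A separable n-letter distortion of this kind is simply the per-letter average of the single letter distortions. A minimal sketch, using Hamming distortion as an illustrative choice of the single letter measure:

```python
def hamming(a, b):
    """Single-letter Hamming distortion: 1 if the symbols differ, else 0
    (an illustrative choice of the single letter measure d)."""
    return int(a != b)

def sequence_distortion(xs, ys):
    """Separable n-letter distortion: the average of the single letter
    distortions across the n coordinates."""
    assert len(xs) == len(ys) and len(xs) > 0
    return sum(hamming(a, b) for a, b in zip(xs, ys)) / len(xs)

print(sequence_distortion([0, 1, 1, 0], [0, 1, 0, 1]))  # 2 mismatches / 4 = 0.5
```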

Definition 1 (Information Leakage under a Cost Function)

Given a left-continuous and non-decreasing cost function and , the minimal leakage under an expected distortion cost constraint is defined as follows:

(6)

and

(7)

where the superscript takes values or . For , the -letter mechanism is restricted to be stationary and memoryless and given by , while for it can be any mechanism.

Definition 2 (Information Leakage with Distortion CDF Bound)

Given a right-continuous and non-increasing function , the minimal leakage with a cumulative distortion distribution bounded by is defined as follows:

(8)

and

(9)

where the superscript takes values or . For , the -letter mechanism is restricted to be stationary and memoryless and given by , while for it can be any mechanism.

We now define the optimal single letter information leakage under a constraint on the expected value of the distortion. This is analogous to the single-letter rate-distortion function, and has appeared in earlier works on privacy [13]. As we will show later, this quantity appears as a key element in first-order leakage.

Definition 3 (Single Letter Information Leakage)
(10)

Note that is convex, and thus, continuous in .

Remark 1

For , and any , the optimization in (6) reduces to (10) for both memoryless and general mechanisms.

We now define functions that will be critical in expressing asymptotic leakage with the expected distortion cost bound under stationary memoryless and general mechanisms.

Definition 4

For any cost function , and a distortion cost threshold , let

(11)
(12)

and define

(13)

Consequently, for any , we have , and thus, the inverse function for can be uniquely determined as

(14)

3 Main Results

3.1 Distortion Cost Constraint

Theorem 1

Let . If , then the asymptotic minimum leakage under stationary memoryless mechanisms is given by

(15)

and for any , we have

(16)

where for any and constant ,

(17)
(18)

Furthermore, the inequality constraint in (16) reduces to equality if .

Proof sketch: By the law of large numbers, a stationary memoryless mechanism concentrates the distortion around its expected value as the blocklength grows. Therefore, the distortion cost constraint roughly translates to choosing an expected distortion whose cost does not exceed the threshold. If this expected distortion is uniquely determined, then the asymptotic leakage is the single letter leakage function evaluated there. Otherwise, the desired expected distortion lies between two critical values determined by the cost function. For a more detailed proof, see Section 5.1.

Remark 2

If is strictly increasing, then , and is given by (15) for any .

Remark 3

For any , since the closures of the convex hulls of the epigraphs of and are equal, their lower convex envelopes are also equal. Therefore, , and we refer to this common value as .

Theorem 2

The asymptotic minimum leakage under general mechanisms is given by

(19)

Proof sketch: Since is convex in , a convex combination of any two feasible mechanisms is also feasible. Hence, we can always design convex combinations of memoryless mechanisms to achieve the lower convex envelope of , and therefore . Conversely, we show that it is not possible to achieve a smaller leakage. For proof details, we refer the reader to Section 5.3.

Remark 4

Note that for , we have , where the minimum is achieved by any mechanism with output independent from the input.

Remark 5

If is convex, then . Therefore, from Theorem 1 we have

(20)
Remark 6

Note that if is not equal to its lower convex envelope for some , then the optimal mechanism is formed by a convex combination of the optimal memoryless mechanisms for distortion costs and , where is the largest threshold smaller than and is the smallest threshold larger than , such that is equal to its lower convex envelope at and .
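The convex combination in Remark 6 is ordinary time sharing: the mixing weight is pinned down by requiring the average cost to hit the threshold, and the resulting leakage is the matching convex combination of the two endpoint leakages. A sketch with hypothetical cost and leakage values (the numbers are illustrative, not taken from the paper):

```python
def time_share(C, C1, L1, C2, L2):
    """Weight lam for mixing two mechanisms with costs C1 <= C2 and
    leakages L1, L2 so that the average cost equals C; returns
    (lam, resulting leakage). A sketch of the convex-combination
    construction in Remark 6."""
    assert C1 <= C <= C2 and C1 < C2
    lam = (C2 - C) / (C2 - C1)      # fraction of time using mechanism 1
    return lam, lam * L1 + (1 - lam) * L2

# Hypothetical endpoints where the memoryless tradeoff touches its
# lower convex envelope, and a target cost strictly between them.
lam, leak = time_share(C=0.5, C1=0.2, L1=0.9, C2=0.8, L2=0.3)
print(lam, leak)  # lam = 0.5, leakage = 0.6
```

Whenever the straight line between the two endpoints lies below the memoryless tradeoff curve at C, this mixture strictly outperforms every stationary memoryless mechanism, which is why general mechanisms achieve the lower convex envelope.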

3.2 Complementary CDF Bound

We now proceed to the result on information leakage with distortion CDF bound. In the following, we give closed form results for the asymptotic information leakage with the distortion CDF bounded by a function .

Theorem 3

If is a non-increasing right-continuous function, then the asymptotic information leakage for memoryless mechanisms under a distortion CDF bound is given by

(21)

where .

Proof:

Suppose . Then, for any fixed and , choose , where is the optimal single letter mechanism achieving . Since is bounded away from zero and goes to zero as goes to infinity, the distortion constraint is satisfied for all for sufficiently large . Then, as , continuity of implies is achievable.

Conversely, according to the law of large numbers, the distortion concentrates around its expected value as goes to infinity. In other words, we have , if . This, in turn, implies that for any such that , we must have . Therefore, a feasible memoryless mechanism has to satisfy .

Finally, for , we have to satisfy . Note that in this case, the constraint for , i.e. , is also equivalent to . Therefore, the set of feasible memoryless mechanisms for is equal to those for , and thus, .

Theorem 4

If is a non-increasing right-continuous function, and the single letter leakage function is bounded on , then

(22)

where the integral is a Lebesgue––Stieltjes integral of the single letter leakage function with respect to the Lebesgue––Stieltjes measure associated with the constraint function .

Proof sketch: We first prove this result for simple constraint functions , which are in the form of a finite sum of step functions. Then, we show that any non-increasing right-continuous constraint function can be upper and lower bounded by such simple functions, and therefore, the corresponding leakage can be upper and lower bounded by that of the simple functions. For a more detailed proof, see Section 5.4.

Remark 7

An alternative way of describing the result in Theorem 4 is that the asymptotically optimal mechanism behaves as if it first chooses a random distortion level drawn from a distribution with a complementary CDF exactly equal to the constraint function, and then applies the optimal single letter mechanism achieving the single letter optimal leakage in a stationary and memoryless fashion. Averaging over the random choice of the distortion level, the resulting leakage is given by the integral in (22).
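Remark 7 admits a simple numerical sanity check: for a two-step constraint function, the Lebesgue-Stieltjes integral in Theorem 4 collapses to a weighted sum of two single letter leakages, and it matches the average leakage when the distortion level is drawn with complementary CDF equal to the constraint. The leakage function below is a hypothetical stand-in for the true single letter leakage, and the step locations and heights are illustrative:

```python
import numpy as np

def L(D):
    """Hypothetical convex, non-increasing single letter leakage."""
    return max(0.0, 1.0 - 2.0 * D)

# Two-step bound: omega = 1 on [0, d1), p1 on [d1, d2), 0 afterwards.
d1, d2, p1 = 0.1, 0.3, 0.4

# Theorem 4 for this simple omega: the Lebesgue-Stieltjes measure puts
# mass (1 - p1) at the first jump d1 and mass p1 at the second jump d2.
closed_form = (1 - p1) * L(d1) + p1 * L(d2)

# Remark 7: draw the distortion level with complementary CDF omega,
# then average the single letter leakage over that random choice.
rng = np.random.default_rng(1)
samples = rng.choice([d1, d2], size=100_000, p=[1 - p1, p1])
monte_carlo = np.mean([L(D) for D in samples])

print(closed_form, monte_carlo)
```

With these illustrative numbers the closed form evaluates to 0.6 · L(0.1) + 0.4 · L(0.3) = 0.64, and the Monte Carlo average agrees up to sampling error.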

3.3 Auxiliary Result

We now present a result characterizing the asymptotic optimal privacy leakage subject to multiple excess probability constraints. This can be seen as a special case of the complementary CDF bound in which the constraint is a simple function, i.e., it takes finitely many values. The following result will also be used in the proof of Theorem 2.

For vectors and , where and , a simple function is illustrated in Fig. 1 and formally defined as

(23)
Figure 1: A simple .

One can verify that for a constraint function of this form, the minimization in (9) is equivalent to the information leakage with multiple excess distortion constraints, defined as follows.

Definition 5 (Information Leakage with Multiple Excess Probability Constraints)

Given a distortion vector and a tail probability vector , where and , the minimal leakage with multiple excess distortion constraints is defined as

(24)

where the -letter mechanisms in (6) are not constrained to be memoryless or stationary, and

(25)

In the following lemma, we provide the asymptotic optimal leakage under general mechanisms for the class of distortion CDF bound functions defined in Definition 5.

Lemma 1
(26)

where . In particular, we have

(27)

where

(28)

Proof sketch: The proof hinges on choosing a combination of memoryless mechanisms, each being the optimal single letter mechanism for a separate distortion level, applied in a stationary and memoryless fashion. The weights of the combination are chosen such that all the excess distortion probabilities are met. For a detailed proof, see Section 5.2.

4 Illustration of Results

In this section, we first examine generic single and double step cost and complementary CDF bound functions. Then, we consider a doubly symmetric binary source and derive its corresponding single letter leakage function. Finally, we use the single letter leakage function to find the asymptotically optimal leakage under specific examples of the average distortion cost constraint and the complementary CDF bound.

4.1 Distortion Cost Function

Example 1

as shown in Fig. 2. In this case, , and we have

Figure 2: The single step cost function .
(29)
(30)

Therefore, according to Theorem 1 for stationary memoryless mechanisms we have

(31)

and for general mechanisms, according to Theorem 2 we have

(32)

This exactly matches our earlier results in [10], and for a special case it simplifies to the result in [3]. The leakages and are depicted in Fig. 3. Note that for , we have due to Remark 4.

Figure 3: The leakage functions and for .
Example 2

, as shown in Fig. 4. In this case, , and we have

(33)
(34)

Hence, according to Theorem 1 for stationary memoryless mechanisms we have

(35)

Note that for , the exact value for is derived by (16), and for , we have due to Remark 4. From Theorem 2, we know that is the lower convex envelope of . If , then it is given by

(36)

and otherwise,

(37)

These two cases together with their corresponding are shown in Figs. 5 and 6, respectively.

Figure 4: The double step cost function , .
Figure 5: and for , if .
Figure 6: and for , if .

4.2 Distortion CDF Constraints

We now proceed to complementary CDF bounds on distortion. First, we consider a single step function (hard tail probability constraint), and then generalize to a sum of two step functions.

Example 3

as shown in Fig. 7, where . For stationary memoryless mechanisms we have

(38)

while for the general mechanisms, we have

(39)
Figure 7: The single step complementary CDF bound function .

Note that this is equivalent to Example 1. Therefore, (38) and (39) verify the results in [3] and [10], wherein the tail probability constraint is used as a utility metric.

Example 4

as shown in Fig. 8. For stationary memoryless mechanisms we have

(40)

while for the general mechanisms, we have

(41)
Figure 8: The double step complementary CDF bound function .

4.3 Doubly Symmetric Binary Source (DSBS)

We now consider a doubly symmetric source with parameter as depicted in Fig. 9 with Hamming distortion, i.e. , as the utility metric. In the following lemma, proved in Section 5.5, we derive the single letter leakage function for this source.

Figure 9: A doubly symmetric source with parameter .
Lemma 2

For a doubly symmetric source with , the single letter leakage function is given by

(42)
Remark 8

Due to the inherent symmetry of the problem, for all , Lemma 2 holds with replaced by .
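For the doubly symmetric binary source, a natural symmetric candidate mechanism passes the public bit through a binary symmetric channel with crossover D, which attains expected Hamming distortion D and yields leakage 1 − h(p ⋆ D), where ⋆ denotes binary convolution. The sketch below computes this quantity; it is an upper bound on the single letter leakage function obtained from a restricted (symmetric) mechanism class, not a derivation of (42):

```python
from math import log2

def h2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def conv(a, b):
    """Binary convolution: a(1 - b) + b(1 - a)."""
    return a * (1 - b) + b * (1 - a)

def leakage_bsc(p, D):
    """Leakage I(S; Y) for a doubly symmetric binary source with
    parameter p when the mechanism passes the public bit X through a
    BSC(D): the output stays uniform, so I(S; Y) = 1 - h2(conv(p, D)).
    This is an upper bound from a symmetric candidate, not a claim of
    optimality."""
    return 1.0 - h2(conv(p, D))

print(leakage_bsc(0.1, 0.0))  # no distortion: full leakage 1 - h2(0.1)
print(leakage_bsc(0.1, 0.5))  # output independent of input: zero leakage
```

As expected, the bound decreases from 1 − h2(p) at D = 0 down to zero leakage at D = 1/2, where the release is independent of the data.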

Given the single letter leakage function for a doubly symmetric source, we provide numerical examples for the asymptotically optimal leakages under both distortion cost constraints and complementary CDF bounds.

Example 5

For a doubly symmetric source with parameter and Hamming distortion, consider the cost function shown in Fig. 10. Then, the corresponding leakage functions and are shown in Fig. 11.

Figure 10: The cost function for Example 5.
Figure 11: Memoryless and general leakage functions and for Example 5.

We now proceed to an example that resembles a soft single step complementary CDF bound. We choose constraint functions parametrized such that they converge to a hard single step CDF bound in the limit.

Example 6

Consider a doubly symmetric source with parameter . Then, for any define

(43)

In Fig. 12, this function is plotted for , , and four different values of . Note that in Fig. 13, the value of converges to the asymptotic value of as , and is non-monotonic in .

Figure 12: as described in Example 6, for and , parametrized by .
Figure 13: for the function given in Example 6.

5 Proofs

Before proving our main results, we first review Hoeffding's inequality, a version of the Chernoff bound for bounded random variables.

Lemma 3 (Hoeffding’s inequality [22, Theorem 2])

Let be bounded independent random variables, i.e., for each . We define the empirical mean of these variables by . Then

(44)

where is positive, and is the expected value of .
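Hoeffding's inequality can be checked empirically. The sketch below compares the bound exp(−2nt²) for [0, 1]-valued variables against the simulated tail probability of the empirical mean (the sample size and deviation t are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials, t = 100, 20_000, 0.1
# X_i i.i.d. uniform on [0, 1], so each is bounded in [0, 1] with mean 0.5.
means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)

# Simulated probability that the empirical mean exceeds its mean by t.
empirical = np.mean(means - 0.5 >= t)

# Hoeffding bound for ranges b_i - a_i = 1: exp(-2 n t^2).
hoeffding = np.exp(-2 * n * t**2)

print(empirical, hoeffding)
```

For these values the bound is exp(−2) ≈ 0.135, while the simulated tail probability is far smaller, illustrating that the inequality is valid but not tight in this regime.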

5.1 Proof of Theorem 1

Assuming a stationary memoryless mechanism, we provide upper and lower bounds on in terms of . This in turn allows us to bound in terms of . Let . Then, for large enough we have

(45a)
(45b)
(45c)
(45d)

where (45c) is due to Lemma 3 and (45d) follows from the definition of . If , then , and we have

(46)

Since is left-continuous, and is continuous, taking the limit as gives

(47)

With a similar argument and using the negative of the distortion function in Lemma 3, we have

(48a)
(48b)
(48c)
(48d)

where (48d) is due to Lemma 3