Hypothesis Testing under Mutual Information Privacy Constraints in the High Privacy Regime

Jiachun Liao, Lalitha Sankar, Vincent Y. F. Tan and Flavio du Pin Calmon

This work is supported in part by the National Science Foundation under grants CCF-1350914 and CIF-1422358.

Abstract

Hypothesis testing is a statistical inference framework for determining the true distribution among a set of possible distributions for a given dataset. Privacy restrictions may require the curator of the data or the respondents themselves to share data with the test only after applying a randomizing privacy mechanism. This work considers mutual information (MI) as the privacy metric for measuring leakage. In addition, motivated by the Chernoff-Stein lemma, the relative entropy between pairs of distributions of the output (generated by the privacy mechanism) is chosen as the utility metric. For these metrics, the goal is to find the optimal privacy-utility trade-off (PUT) and the corresponding optimal privacy mechanism for both binary and $m$-ary hypothesis testing. Focusing on the high privacy regime, Euclidean information-theoretic approximations of the binary and $m$-ary PUT problems are developed. The solutions for the approximation problems clarify that an MI-based privacy metric preserves the privacy of the source symbols in inverse proportion to their likelihoods.

Hypothesis testing, privacy-guaranteed data publishing, privacy mechanism, Euclidean information theory, relative entropy, Rényi divergence, mutual information.

I Introduction

There is tremendous value to publishing datasets for a variety of statistical inference applications; however, it is crucial to ensure that the published dataset, while providing utility, does not reveal potentially privacy-threatening information. Specifically, the published dataset should allow the intended inference to be made while limiting other unwanted inferences. This requires using a randomizing mechanism (i.e., a noisy channel) that guarantees a certain measure of privacy. Any such privacy mechanism may, in turn, reduce the fidelity of the intended inference leading to a trade-off between utility of the published data and the privacy of the respondents in the dataset.

We consider the problem of privacy-guaranteed data publishing for hypothesis testing. The use of large datasets to test two or more hypotheses (e.g., the 99%-1% theory of income distribution in the United States [1]) relies on the classical statistical inference framework of binary or multiple hypothesis testing. The optimal test for hypothesis testing under various scenarios (non-Bayesian, Bayesian, minimax) involves the so-called Neyman-Pearson (or likelihood ratio) test [2] in which the likelihood ratio of the hypotheses is compared to a given threshold. We focus exclusively on the non-Bayesian setting. In particular, for the $m$-ary ($m \geq 2$) hypothesis testing problem, we consider the setting in which the probability of missed detection is minimized for one specific hypothesis (e.g., presence of cancer) while requiring the probabilities of false alarm for the same hypothesis to be bounded (relative to the remaining hypotheses). In this context, we can apply the Chernoff-Stein lemma [3, Chapter 11], which states that for a pair of hypotheses the largest error exponent of the missed detection probability, under the constraint that the false alarm probability is bounded above by a constant, is the relative entropy between the probability distributions for the two hypotheses.

Inspired by the Chernoff-Stein lemma, for the $m$-ary hypothesis setting described above, we use relative entropy as a measure of the utility of the published dataset (for hypothesis testing), and henceforth refer to this as the relative entropy setting. Furthermore, for binary hypothesis testing ($m = 2$), we also consider the setting in which the probabilities of both missed detection and false alarm decrease exponentially. For this setting, using known results of hypothesis testing [4], we take the Rényi divergence as the utility metric and refer to this as the Rényi divergence setting. For the privacy metric, we use mutual information between the original and published datasets as a measure of the additional knowledge (privacy leakage) gained on average from the published dataset. By bounding the MI leakages, our goal is to develop privacy mechanisms that restrict the relative entropy between the prior and posterior (after publishing) distributions of the dataset, averaged over the published dataset. By restricting the distance between prior and posterior beliefs, we capture a large class of computationally unbounded adversaries that can use different inference methods. Specifically, bounding MI leakage allows us to exploit information-theoretic relationships between MI and the probabilities of detection/estimation to bound the ability of an adversary to “learn” the original dataset [3, 5, 6].

I-A Our Contributions

We study the privacy-preserving data publishing problem by considering a local privacy model in which the same (memoryless) mechanism is applied independently to each entry of the dataset. This allows the respondents of a dataset to apply the privacy mechanism before sharing data. Our main contributions are as follows:

  1. We introduce the privacy-utility trade-off (PUT) problem for hypothesis testing (Section II). The resulting PUT involves maximizing the minimum of a set of relative entropies, subject to constraints on the MI-based leakages for all source classes.

  2. The PUT problem involves maximizing the minimum of a set of convex functions over a convex set which is, in general, NP-hard. In Section III, we approximate the trade-off in the high privacy regime (near zero leakage) using techniques from Euclidean information theory (E-IT); these techniques have found use in deriving capacity results in [7, 8].

  3. For binary hypothesis testing, we first consider the relative entropy setting (Section IV-A), in which we determine the optimal mechanism in closed form for the E-IT approximation by exploiting the problem structure. Our results suggest that the solution to the E-IT approximation is independent of the alphabet size and, more importantly, that an MI-based privacy metric protects the source symbols in inverse proportion to their likelihoods, thereby applying more distortion to the (informative) outliers in the dataset, which are, in general, more vulnerable to detection.

  4. We extend our analysis to the Rényi divergence setting (Section IV-B); the optimal mechanism for its E-IT approximation in the high privacy regime is similar in form to that of the relative entropy setting.

  5. We study the $m$-ary hypothesis testing problem (Section V) and show that optimal solutions of the E-IT approximation can be obtained via semidefinite programs (SDPs) [9]. In particular, for binary sources the optimal mechanism is derived in closed form. The dependence on the source distribution is highlighted here as well.

  6. In Section VI, via numerical simulations, we uncover regimes of distribution tuples and leakage values for which the E-IT approximation is accurate.

I-B Related Work

Privacy-guaranteed hypothesis testing in the high privacy regime using MI as the privacy metric was first studied by the authors in [10]. Specifically, the focus of [10] is on the relative entropy setting of binary hypothesis testing in the high privacy regime. We significantly extend that work with three key contributions: (a) we derive optimal mechanisms in the high privacy regime for binary hypothesis testing in the Rényi divergence setting; (b) we derive optimal mechanisms in the high privacy regime for the $m$-ary problem in the relative entropy setting; (c) we provide detailed illustrations of results for binary and $m$-ary hypothesis testing.

Recently, the problem of designing privacy mechanisms for hypothesis testing has gained interest. Kairouz et al. [11] show that the optimal locally differentially private (L-DP) mechanism has a staircase form and can be obtained as the solution of a linear program. Gaboardi et al. [12] address privacy-guaranteed hypothesis testing by using chi-square goodness of fit as the utility measure and adding Gaussian or Laplace noise to the dataset to guarantee DP-based privacy protection.

Our problem differs from these efforts in using MI as the privacy metric. In [11], the L-DP formulation, focused on the high privacy regime, requires the mechanism to limit distinction between any two letters of the source alphabet for a given output. This requirement also confines all privacy mechanisms satisfying a desired L-DP privacy level to a hypercube. Therefore, the authors simplify the trade-off problem to a linear program by exploiting the sub-linearity of the relative entropy function. In contrast, all privacy mechanisms giving a desired MI-based privacy form a convex set which is not a polytope. However, taking advantage of E-IT, we propose good approximations for the MI-based privacy-utility trade-offs in the high privacy regime. In fact, we present closed-form privacy mechanisms for both binary hypothesis testing with arbitrary alphabets as well as $m$-ary hypothesis testing with binary alphabets. Furthermore, for $m$-ary hypothesis testing with arbitrary sources, the privacy mechanism can be obtained efficiently by solving an SDP.

The connection between hypothesis testing and privacy has been studied in the context of location anonymization and smart meter privacy. In location privacy, the problem of determining if a sequence of anonymized data points (e.g., location positions without an accompanying user ID) belongs to a target user can be formulated as a hypothesis test. More specifically, if the distribution of the user’s data is known and unique among other users, any observed sequence can be tested against the hypothesis that it was drawn from this distribution, thus revealing if it belongs to the target user. Within this context, Montazeri et al. [13, 14] studied the problem of anonymizing sequences of location data, and characterized the probability of correctly guessing a target user’s data within a larger dataset. In related work on smart meter privacy, Li and Oechtering [15] considered the problem of private information leakage in a smart grid. Here, an adversary challenges a consumer’s privacy by performing an unauthorized binary hypothesis test on the consumer’s behavior based on smart meter readings. Li and Oechtering [15] propose a solution for mitigating the incurred privacy risk with the assistance of an alternative energy source.

The theoretical analyses by Montazeri et al. [13, 14] and Li and Oechtering [15] are related to the one presented here in that they also make use of large deviation (information-theoretic) results in hypothesis testing. However, we apply these powerful theoretical tools to a different setting, in which data is purposefully randomized before disclosure in order to provide privacy, while guaranteeing utility in terms of a successful hypothesis test. Whereas they consider a hypothesis testing adversary, here we consider a precise hypothesis test as part of the utility metric.

MI has been widely used as a measure for quantifying information leakage within the information-theoretic privacy literature (cf. [16, 17, 5, 18, 19, 20, 21] and the references therein). The connection between MI-based metrics and other privacy metrics has been studied, for example, by Makhdoumi and Fawaz [22]. In the present paper, we approximate MI by the chi-squared divergence which, in turn, also possesses interesting estimation-theoretic properties [23]. An exploration of the role of chi-squared related metrics in privacy has appeared in the work of Asoodeh et al. [24, 25].

I-C Notation

We use bold capital letters to represent matrices, e.g., is a matrix with the row (or column) being and the entry . We use bold lower case letters to represent vectors, e.g. is a vector with the entry . Sets are denoted by capital calligraphic letters.

For vectors and , and functions and , is a diagonal matrix with the diagonal entry being , e.g., the diagonal matrix has diagonal entries . We denote the -norm of a vector by , the logarithm of to the base as . Probability mass functions are denoted as row vectors, e.g., . In addition, denotes the relative entropy and denotes the MI. We can write the MI between two random variables or between a probability distribution and the corresponding conditional probability matrix. Indeed, for two random variables with and , the MI is denoted as or .

II Problem Formulation

II-A General Hypothesis Testing

We consider an $m$-ary hypothesis testing problem that distinguishes between $m$ explanations for an observed dataset. Let denote a sequence of random variables, where the entries are drawn independently according to a probability distribution . The observed random variables are assumed to be discrete with alphabet and size . The hypotheses are denoted as for . Our utility goal is to make a decision about the underlying distribution of the data . Let the disjoint decision regions be . This means that if belongs to , we decide in favor of .

II-B Binary Hypothesis Testing

In binary hypothesis testing, there are only two hypotheses: and . The optimum test is the Neyman-Pearson test in which the decision region for hypothesis is for some threshold . Let and be the probabilities of false alarm and missed detection for , respectively. Use to indicate the smallest probability of the missed detection subject to the condition that . The Chernoff-Stein lemma [3, Chap. 11] states that

(1)

Hence, we use as our utility function.
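As a minimal sketch of the quantities involved, the snippet below computes a relative entropy in nats and runs a log-likelihood-ratio test on a short sample sequence. The distributions `p_a` and `p_b`, the function names, and the zero threshold are illustrative assumptions and do not come from the paper.

```python
import math

def relative_entropy(p, q):
    """Relative entropy D(p || q) in nats for two pmfs on the same alphabet."""
    total = 0.0
    for pk, qk in zip(p, q):
        if pk > 0.0:
            if qk == 0.0:
                return math.inf  # p puts mass where q has none
            total += pk * math.log(pk / qk)
    return total

def log_likelihood_ratio(samples, p_a, p_b):
    """Sum of log(p_a[x] / p_b[x]) over the observed symbol indices x."""
    return sum(math.log(p_a[x] / p_b[x]) for x in samples)

def likelihood_ratio_test(samples, p_a, p_b, log_threshold=0.0):
    """Decide 'a' when the log-likelihood ratio reaches the threshold, else 'b'."""
    llr = log_likelihood_ratio(samples, p_a, p_b)
    return "a" if llr >= log_threshold else "b"

p_a = [0.5, 0.5]    # hypothetical distribution under one hypothesis
p_b = [0.25, 0.75]  # hypothetical distribution under the other hypothesis

exponent = relative_entropy(p_a, p_b)                      # candidate error exponent
decision = likelihood_ratio_test([0, 0, 1, 0], p_a, p_b)   # three 0s and one 1
```

The larger the relative entropy between the two distributions, the faster the missed detection probability can decay with the number of samples, which is why it serves as a utility measure.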

II-C $m$-ary Hypothesis Testing

In $m$-ary hypothesis testing, there are different errors resulting from mistaking hypothesis for , . To keep our analysis simple, we consider a scenario somewhat analogous to the “red alert” [26] problem in “unequal error protection” [27]. There is one distinguished hypothesis whose inference takes precedence. For example, in practice could be the underlying distribution of measurements of a malignant tumor; the other distributions could be the underlying distributions of measurements of various benign tumors. We would like to minimize the missed detection rate of . In this scenario, we would design the decision regions to maximize the minimum of over , where is the error exponent (exponential rate of decay of the error probability analogous to (1)) of mistaking when is true.

II-D Privacy Considerations

In most data collection and classification applications, there may be an additional requirement to ensure that the dataset, while providing utility, does not leak information about the respondents of the data. This in turn implies that the data provided to the hypothesis test is not the same as the original data, but instead a randomized version that guarantees precise measures of privacy (information leakage) and utility. Specifically, we use MI as a measure of the average information leakage between the input dataset and its randomized output dataset that is used by the test. The goal is to find the randomizing mapping, henceforth referred to as a privacy mechanism, such that a measure of utility of the data is maximized while ensuring that the MI-based leakages for all possible source classes are bounded.

We assume that the entries of the dataset are generated in an i.i.d. fashion. Focusing on the local privacy model, the randomizing privacy mechanism for the hypothesis testing problem is memoryless. Let , an conditional probability matrix, denote this memoryless privacy mechanism which maps the letters of the input alphabet to letters of the output alphabet , where is an arbitrary finite integer. Thus, the i.i.d. sequence , is mapped to an output sequence whose entries for all are i.i.d. with the distribution . Thus, the hypothesis test is now performed on a sequence that belongs to one of source classes with distributions (footnote 1: recall that the distribution is the output distribution induced by the input (row vector) and the privacy mechanism (transition matrix) ). For the $m$-ary setting, the error exponent, corresponding to the missed detection of as , is .
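The effect of a memoryless mechanism can be sketched numerically as follows. The matrix `W`, the distributions `p1` and `p2`, and all function names are hypothetical choices for illustration; the snippet computes the induced output distributions, the MI leakage for one source class, and the relative entropy between the two outputs.

```python
import math

def output_distribution(p, W):
    """Row vector p times the row-stochastic matrix W."""
    return [sum(p[i] * W[i][j] for i in range(len(p))) for j in range(len(W[0]))]

def relative_entropy(p, q):
    """D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

def mutual_information(p, W):
    """I(p; W) = sum_i p_i * D(W_i || pW), with W_i the i-th row of W."""
    q = output_distribution(p, W)
    return sum(p[i] * relative_entropy(W[i], q) for i in range(len(p)) if p[i] > 0.0)

# Hypothetical 3-letter input alphabet mapped to a binary output alphabet.
W = [[0.8, 0.2],
     [0.3, 0.7],
     [0.5, 0.5]]
p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.3, 0.5]

q1 = output_distribution(p1, W)
q2 = output_distribution(p2, W)
leakage_1 = mutual_information(p1, W)   # privacy leakage for source class 1
utility = relative_entropy(q1, q2)      # exponent available to the test after W
```

By the data processing inequality, `utility` can never exceed the relative entropy between `p1` and `p2` themselves, which is the price paid for randomizing the data.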

II-E The Privacy-Utility Trade-off

To design an appropriate privacy mechanism, we wish to maximize the minimum of the error exponents subject to the following leakage constraints: for . Formally, the privacy-utility trade-off (PUT) problem is that of finding the optimal privacy mechanism for the following optimization:

(2)

where is the set of row stochastic matrices, and , , are permissible upper bounds on . The optimization in (2) maximizes the minimum of convex functions over a convex set. Since the maximum of each of the convex functions is attained on the boundary of the feasible region, the optimal solution of the optimization is also on the boundary. Because of the MI constraints, the feasible region is, in general, not a polytope, and thus, has infinitely many extremal points. While there exist computationally tractable methods to obtain a solution by approximating the feasible region by an intersection of polytopes [28], our focus is on developing a principled approximation for (2) in a specific privacy regime to obtain a closed-form and easily interpretable privacy mechanism.
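To make the structure of the trade-off concrete, here is a brute-force sketch for the smallest possible case: two hypotheses, a binary input alphabet, and a binary output alphabet. The grid search, the restriction to a binary output, the distributions, and the leakage bounds are all assumptions made for illustration; this is not the method developed in the paper.

```python
import math

def output_distribution(p, W):
    return [sum(p[i] * W[i][j] for i in range(len(p))) for j in range(len(W[0]))]

def relative_entropy(p, q):
    """D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

def mutual_information(p, W):
    q = output_distribution(p, W)
    return sum(p[i] * relative_entropy(W[i], q) for i in range(len(p)))

def solve_put_by_grid(p1, p2, eps1, eps2, steps=50):
    """Search 2x2 mechanisms [[a, 1-a], [b, 1-b]] on a grid and keep the
    feasible one with the largest relative entropy between the outputs."""
    best_utility, best_W = -1.0, None
    for i in range(steps + 1):
        for k in range(steps + 1):
            a, b = i / steps, k / steps
            W = [[a, 1.0 - a], [b, 1.0 - b]]
            # Discard mechanisms that leak too much for either source class.
            if mutual_information(p1, W) > eps1 or mutual_information(p2, W) > eps2:
                continue
            u = relative_entropy(output_distribution(p1, W),
                                 output_distribution(p2, W))
            if u > best_utility:
                best_utility, best_W = u, W
    return best_utility, best_W

p1 = [0.6, 0.4]   # hypothetical source distributions with full support
p2 = [0.3, 0.7]
best_utility, best_W = solve_put_by_grid(p1, p2, eps1=0.01, eps2=0.01)
```

Even in this toy case the feasible set has a curved boundary, so the search visits every grid point; this is the cost that a closed-form approximation avoids.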

Specifically, we will work in the high privacy regime in which is small. In this regime, one can use Taylor series expansions to approximate both the objective function and the constraints. Such approximations were considered in [29, 7]. More recently, analyses based on such approximations, referred to as E-IT, have been found to be useful in a variety of settings from graphical model learning [30] to network information theory problems [7, 8].

III Approximations in the High Privacy Regime

In this section, we develop E-IT approximations of the relative entropy and the MI functions, based on which we propose an approximation of PUT in (2) in the high privacy regime.

To develop an approximation, we select an operating point which will be perturbed to provide an approximately-optimal privacy mechanism. We let where . Since our focus is on the high privacy regime, we present the approximation around a perfect privacy operation point, i.e., a privacy mechanism that achieves for all .

Lemma 1.

For perfect privacy, i.e., for all , the privacy mechanism is a rank-1 row stochastic matrix with every row being equal to a row vector where belongs to the probability simplex, such that the entries of the vector satisfy

(3)
(4)
Proof.

For any probability distribution with entries , and a privacy mechanism ,

(5)
(6)
(7)

where (6) results from the log-sum inequality. Equality in (6) holds if and only if [3, Theorem 2.7.1]

(8)

In other words, perfect privacy, i.e., zero leakage, is achieved when every row of the optimal mechanism is the same and is equal to the probability distribution . ∎

Thus, for the perfect privacy setting, the optimal mechanism satisfying (8) does not rely on the input distribution.

Remark 1.

Note that, for any satisfying (8) that achieves perfect privacy, the utility is for all . Furthermore, the rows of , i.e., , can take any value in an -dimensional probability simplex.
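A quick numerical check of the perfect privacy case: when every row of the mechanism equals the same probability vector, the leakage vanishes for any input distribution, and so does the utility. The vector `w0` and the input distributions below are hypothetical.

```python
import math

def output_distribution(p, W):
    return [sum(p[i] * W[i][j] for i in range(len(p))) for j in range(len(W[0]))]

def relative_entropy(p, q):
    """D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

def mutual_information(p, W):
    q = output_distribution(p, W)
    return sum(p[i] * relative_entropy(W[i], q) for i in range(len(p)))

w0 = [0.2, 0.5, 0.3]                # any point of the probability simplex
W0 = [list(w0) for _ in range(4)]   # rank-1 mechanism: four identical rows

p1 = [0.4, 0.3, 0.2, 0.1]
p2 = [0.1, 0.2, 0.3, 0.4]

leakage = mutual_information(p1, W0)
utility = relative_entropy(output_distribution(p1, W0),
                           output_distribution(p2, W0))
```

Both quantities are zero up to floating-point rounding, because the output distribution equals `w0` regardless of the input.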

The following proposition presents an E-IT approximation for the objective and constraint functions of the optimization in (2), i.e., the relative entropy and MI . This approximation is only applicable to the high privacy regime in which the privacy mechanism is modeled as a perturbation of a per Lemma 1.

Proposition 1.

In the high privacy regime with , the privacy mechanism is chosen as a perturbation of a perfect privacy achieving mechanism , i.e., . The mechanism is a rank-1 row stochastic matrix with every row being equal to a row vector whose entries satisfy and , for all . The perturbation matrix is an matrix with entries satisfying and , for all . For this perturbation model, the relative entropy for all in the objective function, and the MI for all in the constraints of (2) can be approximated as

(9)
(10)

where , for , is the entry of , is the row of , and is a diagonal matrix with diagonal entry, for all , being . For ease of analysis, setting , (9) and (10) can be rewritten as

(11)
(12)

In (9) and (11), the notation means that the difference between the left and right sides is . Similarly, in (10) and (12), means that the two sides differ by .

Note that in Proposition 1, is in the interior of the probability simplex, i.e., . The approximation results from the observation that all rows of a privacy mechanism in the high privacy (low leakage) regime are very close to each other and both the relative entropy and MI can be approximated by the $\chi^2$-divergence. The detailed proof is in Appendix A.
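The quality of a second-order approximation of this kind can be checked numerically. The sketch below uses the standard chi-squared style expansions of relative entropy and MI around a rank-1 mechanism; these expressions are written from first principles and may differ in form from (9) and (10). The symbols `w0`, `Theta`, `p1`, `p2`, and the perturbation scale are hypothetical.

```python
import math

def row_times_matrix(p, M):
    return [sum(p[i] * M[i][j] for i in range(len(p))) for j in range(len(M[0]))]

def relative_entropy(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

def mutual_information(p, W):
    q = row_times_matrix(p, W)
    return sum(p[i] * relative_entropy(W[i], q) for i in range(len(p)))

def approx_relative_entropy(p1, p2, Theta, w0):
    """0.5 * sum_j ((p1 - p2) Theta)_j^2 / w0_j."""
    d = [a - b for a, b in zip(p1, p2)]
    shift = row_times_matrix(d, Theta)
    return 0.5 * sum(x * x / w for x, w in zip(shift, w0))

def approx_mutual_information(p, Theta, w0):
    """0.5 * sum_i p_i * sum_j (Theta_ij - (p Theta)_j)^2 / w0_j."""
    mean_row = row_times_matrix(p, Theta)
    return 0.5 * sum(
        p[i] * sum((Theta[i][j] - mean_row[j]) ** 2 / w0[j] for j in range(len(w0)))
        for i in range(len(p))
    )

w0 = [0.5, 0.5]
scale = 0.01  # small perturbation, i.e. high privacy
Theta = [[scale * x for x in row] for row in [[1.0, -1.0], [-1.0, 1.0], [0.5, -0.5]]]
W = [[w0[j] + Theta[i][j] for j in range(2)] for i in range(3)]  # rows still sum to 1
p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.3, 0.5]

exact_d = relative_entropy(row_times_matrix(p1, W), row_times_matrix(p2, W))
approx_d = approx_relative_entropy(p1, p2, Theta, w0)
exact_i = mutual_information(p1, W)
approx_i = approx_mutual_information(p1, Theta, w0)
```

For this perturbation scale the exact and approximate values agree to within a few percent; increasing `scale` makes the gap visible.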

IV Binary Hypothesis Testing in the High Privacy Regime

For binary hypothesis testing, there are only two hypotheses and , and therefore, only two types of errors. In this section, we consider the simplest hypothesis testing scenario under two regimes. First, we regard one of the two hypotheses (e.g., ) as being more important than the other. In this case, the goal is to maximize the exponent of the missed detection for subject to an upper bound on its false alarm probability. Second, both hypotheses are important and the goal is to maximize a weighted sum of the two exponents of the false alarm and missed detection. For both cases, we derive the PUTs in the high privacy regime and provide methods to attain explicit privacy mechanisms.

IV-A Binary Hypothesis Testing (Relative Entropy Setting)

We consider the case in which the false alarm of is bounded by a fixed positive constant and we examine the fastest rate of decay of its missed detection. This is exactly the problem formulated in Section II, and the PUT in (2) becomes

(13)

where is the set of all row stochastic matrices, and , are the permissible upper bounds of the privacy leakages for the two distributions and , respectively. Using the approximations in Proposition 1, the PUT for the E-IT approximation problem in the high privacy regime with , for all , is

(14a)
(14b)
(14c)

where is the -th row of the matrix , is an interior point of the -dimensional probability simplex, and is a row vector with the entry being the square root of the entry of , i.e., .

Remark 2.

The functions in (14a) and (14b) are the E-IT approximations as presented in Proposition 1, and the constraint (14c) results from the requirement that is row stochastic. This constraint is the only one in (14) that explicitly involves the size of the output alphabet, i.e., the length of .

Theorem 1.

The optimization problem in (14) reduces to one with a vector variable as

(15)

where the absolute value of the entry of , for all , is the Euclidean norm of the row of . The matrix optimizing (14) is obtained from the optimal solution of (15) as a rank-1 matrix whose row, for all , is given by where is the entry of , and is a unit-norm -dimensional vector that is orthogonal to the non-zero-entry vector , such that

(16)
(17)
(18)

Finally, it suffices to restrict the output to a binary alphabet, i.e., .

The proof of Theorem 1 is in Appendix B. We briefly summarize the approach. The simplification of (14) to a vector optimization in (15) results from the observation that the privacy constraint (14b) only restricts the row-norms of the matrix variable , whereas affects the objective (14a) through all inner products of rows in . By exploiting this special structure, we simplify (14) to a quadratically constrained quadratic program (QCQP) with a vector variable which governs the Euclidean norms of the rows in . The optimal is then given by (16) such that the row vector is chosen to satisfy (14c). Since (17) can be satisfied by a 2-dimensional , we conclude that a binary output alphabet suffices.
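The rank-1 construction with a binary output alphabet can be sketched as follows. The names `beta`, `w0`, and `v`, and the specific choice of `v`, are assumptions made for illustration: each row of the mechanism is the nominal row `w0` plus a perturbation whose scaled version is `beta[i]` times a fixed unit vector orthogonal to the elementwise square root of `w0`.

```python
import math

def rank_one_binary_mechanism(beta, w0):
    """Build W with rows w0 + beta_i * v * sqrt(w0) (elementwise), binary output.

    v = (sqrt(w0[1]), -sqrt(w0[0])) has unit norm because w0 sums to one,
    and it is orthogonal to (sqrt(w0[0]), sqrt(w0[1])), so rows still sum to one.
    """
    root = [math.sqrt(w0[0]), math.sqrt(w0[1])]
    v = [root[1], -root[0]]
    return [[w0[j] + b * v[j] * root[j] for j in range(2)] for b in beta]

w0 = [0.5, 0.5]               # hypothetical nominal output distribution
beta = [0.02, -0.03, 0.01]    # hypothetical signed row norms, one per input letter
W = rank_one_binary_mechanism(beta, w0)

# Recover the scaled perturbation rows and their Euclidean norms.
scaled_rows = [[(W[i][j] - w0[j]) / math.sqrt(w0[j]) for j in range(2)]
               for i in range(len(beta))]
row_norms = [math.hypot(r[0], r[1]) for r in scaled_rows]
```

Here `row_norms[i]` equals the absolute value of `beta[i]`, so the sign of each entry only decides the direction in which the nominal row is pushed.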

Note that the objective function and constraints of the QCQP in (15) are “even” functions, i.e., if is feasible, so is its negation and both of them yield the same objective value. Using this observation, we derive a convex program by removing the square in the objective function. The following theorem provides a closed-form privacy mechanism for the PUT (14) in the high privacy regime by using the Karush-Kuhn-Tucker (KKT) conditions for convex programs.

Theorem 2.

An optimal privacy mechanism for the approximation problem in (14) is

(19)

where is given by Proposition 1, is chosen to satisfy (17) and (18), and for and being the eigenvalue and eigenvector of , the optimal solution of (15), namely , is given as:

  1. if only the first constraint in (15) is active,

    (20)

    and the optimal solution is

    (21)
  2. if only the second constraint in (15) is active,

    (22)

    and the optimal solution is

    (23)
  3. when both constraints in (15) are active, the optimal solution is

    (24)

    where and satisfy

    (25)
    (26)

The proof of Theorem 2 involves proving two lemmas and is developed in Appendix C.

Remark 3.
Fig. 1: Illustration of (19). The “nominal” row vector is perturbed in different directions depending on and .

The optimal mechanism captures the fact that a statistical privacy metric, such as MI, takes into consideration the source distribution in designing the perturbation mechanism . In fact, the solutions for in (21), (23) and (24) quantify this through the term . The vector indicates the direction along which the objective function, i.e., the relative entropy, grows the fastest. In Fig. 1 we illustrate the results of Theorem 2. Thus, for a uniformly distributed source, all entries of have the same scaling such that is in the direction of . However, for a non-uniform source, the samples with low probabilities affect the direction of the most. This is a consequence of the statistical leakage metric (the MI) which causes the optimal mechanism to minimize information leakage by perturbing the low probability (more informative) symbols proportionately more relative to the higher probability ones.
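The inverse dependence on the source distribution can be illustrated in the simplest case of a single active constraint. The sketch assumes a constraint of the weighted form sum_k p_k * beta_k^2 <= c and an objective equal to the squared inner product of beta with the difference vector d; the constant `c`, the distributions, and the function names are hypothetical, and the exact constants in (15) may differ. Under that assumption the Cauchy-Schwarz inequality gives a maximizer proportional to d_k / p_k.

```python
import math

def single_constraint_solution(d, p, c):
    """Maximizer of (sum_k d_k beta_k)^2 subject to sum_k p_k beta_k^2 <= c."""
    s = sum(dk * dk / pk for dk, pk in zip(d, p))
    scale = math.sqrt(c / s)
    return [scale * dk / pk for dk, pk in zip(d, p)]

def objective(d, beta):
    return sum(dk * bk for dk, bk in zip(d, beta)) ** 2

def weighted_norm(p, beta):
    return sum(pk * bk * bk for pk, bk in zip(p, beta))

def scale_to_constraint(beta, p, c):
    """Rescale an arbitrary nonzero candidate so the constraint holds with equality."""
    factor = math.sqrt(c / weighted_norm(p, beta))
    return [factor * bk for bk in beta]

p1 = [0.7, 0.2, 0.1]                    # hypothetical source with one rare symbol
p2 = [0.5, 0.3, 0.2]
d = [a - b for a, b in zip(p1, p2)]     # difference vector
c = 0.002                               # hypothetical leakage budget constant

beta_star = single_constraint_solution(d, p1, c)
best_value = objective(d, beta_star)
```

The entry of `beta_star` for the rarest symbol is the largest in magnitude, matching the observation that low-probability symbols are perturbed the most.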

IV-B Binary Hypothesis Testing (Rényi Divergence Setting)

We now consider the scenario in which both the false alarm and missed detection probabilities for are exponentially decreasing. For this case, the trade-off between the two error probabilities is captured by the Rényi divergence as shown in [4, 31]. We use this as our utility metric and briefly review the results in [4, 31] as a starting point.

Assume that the false alarm probability decays as , for some exponent . Then, the largest error exponent of the missed detection for a fixed is a function of given by [31]

(27)

Since (27) is a convex program, it can be equivalently characterized by the Lagrangian minimization

(28)

leading to the dual problem [31]

(29)

The optimizing (28) can be computed using the KKT conditions of (27) (cf. [4, (15)]) to further obtain

(30)

For , (30) simplifies as [4]

(31)

where is the order- Rényi divergence. From (28) and (29), we see that is the weighted sum of the two error exponents, i.e., , and as such a good candidate for a utility metric in this setting. For this metric, one can write the PUT problem as

(32)

Analogous to the PUT in (13) with relative entropy as the utility metric, the optimization in (32) is non-convex and NP-hard. Thus, we focus on the high privacy regime and approximate the order- Rényi divergence in that regime. To this end, we use the following lemma to explicitly present the relationship of the order- Rényi divergence and the relative entropy when and are “close”.

Lemma 2.

For , the following continuity statement holds: If (footnote 2: we say that a vector converges to another vector , denoted as , if ), then

(33)

The proof is detailed in Appendix D.

According to Proposition 1, any privacy mechanism in the high privacy regime is a perturbation of a perfect privacy mechanism . When in (32) is close to zero, is also close to and both output distributions and approach . We now use Lemma 2 in the following corollary to show that the ratio of and converges to the constant .

Corollary 1.

Let . In (32), if , converges to a perfect privacy mechanism (cf. Lemma 1). Consequently, and the following convergence statement also holds.

(34)

From (34), we observe that as , is monotonically increasing in . Thus, in this high privacy regime, the optimizer of is the same as . As a result, in the high privacy regime we revert to the relative entropy setting, for which we provide a closed-form solution in Theorem 2.
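The local relationship between the two divergences can be checked numerically. Standard second-order expansions suggest that, for an order `alpha` between 0 and 1, the Rényi divergence of two nearby distributions is approximately `alpha` times their relative entropy; identifying `alpha` with the constant in the convergence statement is an assumption here, since the symbol did not survive in the text. The distributions and the value of `alpha` are hypothetical.

```python
import math

def relative_entropy(p, q):
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)

def renyi_divergence(p, q, alpha):
    """Order-alpha Renyi divergence in nats, for 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch only covers 0 < alpha < 1")
    s = sum(pk ** alpha * qk ** (1.0 - alpha) for pk, qk in zip(p, q))
    return math.log(s) / (alpha - 1.0)

q = [0.5, 0.5]
alpha = 0.3
ratios = []
for delta in (0.1, 0.02, 0.004):       # p moves closer and closer to q
    p = [0.5 + delta, 0.5 - delta]
    ratios.append(renyi_divergence(p, q, alpha) / relative_entropy(p, q))
```

As `delta` shrinks, the entries of `ratios` approach `alpha`, so maximizing one divergence is locally the same as maximizing the other.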

V $m$-ary Hypothesis Testing in the High Privacy Regime

We now consider the $m$-ary hypothesis testing problem with distinct hypotheses , , each corresponding to a distribution . This in turn results in error probabilities of incorrectly inferring hypothesis as hypothesis . As stated in Section II, to simplify our analysis, we consider a scenario somewhat analogous to the “red alert” [26] problem in “unequal error protection” [27], i.e., there is one distinct hypothesis , the inference of which is more crucial than that of others (e.g., presence of cancer). We focus on maximizing the minimum of the error exponents corresponding to the ways of incorrectly deciding as .

For this problem of unequal $m$-ary hypothesis testing, we introduce the PUT in (2). We can further simplify the trade-off in the high privacy regime using Proposition 1 to obtain the following PUT:

(35a)
(35b)
(35c)

Recall that is a perturbation matrix such that the privacy mechanism is related to as , and is the row of .

For ease of analysis, we start from a simplified version of (35) without the constraint (35c), which can be transformed to a semi-definite program (SDP) as summarized in the following lemma. Based on an optimal solution of the SDP, a scheme is proposed for constructing an optimal solution of (35) satisfying (35c).

Lemma 3.

The optimization in (35a) with constraint (35b) is equivalent to an SDP with ( matrix) variable given as

(36)

where is a diagonal matrix with diagonal entry equal to , and is the trace of the matrix .

Lemma 3 stems from the observation that both the objective function (35a) and constraints (35b) are linear functions of the entries of the positive semidefinite matrix . The proof for Lemma 3 is provided in Appendix E. The following theorem shows that the solution of the SDP in (36) yields an optimal privacy mechanism for the approximated PUT in (35).
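The linearity behind this reformulation can be verified directly. For a matrix `A` and its Gram matrix `X = A A^T`, the squared norm of `d A` is the quadratic form `d X d^T`, and the weighted sum of squared row norms is the trace of `diag(p) X`; both are linear in the entries of `X`. The matrix `A` and the vectors `d` and `p` below are hypothetical, and the exact scaling used in (35) and (36) is not reproduced.

```python
def gram(A):
    """Gram matrix X = A A^T of the rows of A."""
    n = len(A)
    return [[sum(a * b for a, b in zip(A[i], A[k])) for k in range(n)]
            for i in range(n)]

def squared_norm_of_row_combination(d, A):
    """|| d A ||^2 computed directly from A."""
    combo = [sum(d[i] * A[i][j] for i in range(len(d))) for j in range(len(A[0]))]
    return sum(x * x for x in combo)

def quadratic_form(d, X):
    """d X d^T, linear in the entries of X."""
    return sum(d[i] * X[i][k] * d[k] for i in range(len(d)) for k in range(len(d)))

def weighted_row_norms(p, A):
    """sum_k p_k * ||A_k||^2 computed directly from A."""
    return sum(pk * sum(x * x for x in row) for pk, row in zip(p, A))

def trace_diag_times(p, X):
    """tr(diag(p) X), linear in the diagonal entries of X."""
    return sum(p[k] * X[k][k] for k in range(len(p)))

A = [[0.10, -0.20],
     [0.05, 0.30],
     [-0.15, 0.00]]
d = [0.3, -0.1, -0.2]
p = [0.5, 0.3, 0.2]
X = gram(A)

objective_direct = squared_norm_of_row_combination(d, A)
objective_linear = quadratic_form(d, X)
constraint_direct = weighted_row_norms(p, A)
constraint_linear = trace_diag_times(p, X)
```

Replacing the matrix variable by its Gram matrix therefore turns quadratic expressions into linear ones, at the price of a positive semidefiniteness constraint on `X`.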

Theorem 3.

An optimal privacy mechanism for the optimization problem in (35) is

(37)

where is the perfect privacy mechanism with rows , is an optimal solution of (35) obtained from an optimal solution with of the SDP in (36). It suffices to restrict the output alphabet size to , such that

(38)

where is an rectangular diagonal matrix whose diagonal entries are the square roots of the non-zero eigenvalues of , is a unitary matrix consisting of the eigenvectors of , and is an unitary matrix whose first columns are orthogonal to .

Proof.

Let be the optimal solution of the SDP in (36) and . We decompose via an eigenvalue decomposition as follows:

(39)

Here, is an diagonal matrix consisting of entries in the eigenvalue vector and the columns of the matrix are the corresponding eigenvectors. Construct an rectangular diagonal matrix by adding one all-zero column to . Let . By choosing a unitary matrix , whose last column is parallel to the -dimensional row vector , we design a matrix as such that . From Lemma 3, the SDP in (36) is equivalent to the simplified (35) without (35c). Therefore, optimizes the simplified (35). In addition,

(40)
(41)
(42)

where (41) follows from the fact that the last column of the unitary matrix is parallel to , such that the first columns of are orthogonal to , and the inner product of its last column and is the Euclidean norm of . Therefore, the constructed above is feasible and attains the optimal value of (35). ∎

Remark 4.

Note that the size of output alphabet is at most . For the special case of binary hypothesis testing, we have shown in Theorem 1 that the rank of is and therefore, .

Remark 5.

In the absence of any constraints in (35), analogous to the binary hypothesis test, one would choose columns of to span the space contained by the vectors for all . However, the constraints in (35) depend explicitly on the vectors , and in fact, in (3) at least one constraint will be tight at the optimal solution . Thus, analogous to the binary hypothesis result, we expect the optimal mechanism to depend inversely on one or more . We show that this is indeed the case for binary sources in the following subsection.

V-A -ary Hypothesis Testing with Binary Sources

If all the distributions are Bernoulli, the difference vectors in (35) are collinear. Thus, the minimizing element in the objective is the one in which has the minimal Euclidean norm. Without loss of generality, assume . Therefore, . In this case, the E-IT approximation in (35) reduces to

(43a)
(43b)
(43c)

We notice that (43) has the same form as (14) (the E-IT approximation for binary hypothesis testing in the relative entropy setting), where the number of constraints in (43b) is . Specifically, the objective and constraints have the same structure as in (14), and thus, the results in Theorem 1 hold here. Therefore, from Theorem 2, the corresponding optimal privacy mechanism can be expressed as (19) but with

(44)

where , , are the dual variables for the constraints in (43b).

Note that for those that are non-zero, the corresponding constraints in (43b) are tight, i.e., if , . Let . Thus, the in (44) depends inversely on a linear combination of the distributions indexed by . Consequently, the optimal mechanism for the approximated PUT depends inversely on these distributions.
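The collinearity claim at the start of this subsection can be checked with a small numerical sketch. The Bernoulli parameters, the use of plain (unweighted) difference vectors, and the helper names below are illustrative assumptions; the paper's difference vectors in (35) may carry an additional weighting, which would not affect collinearity as long as it is common to all pairs.

```python
import numpy as np
from itertools import combinations


def bernoulli(q):
    """Probability vector of a Bernoulli(q) source."""
    return np.array([q, 1.0 - q])


def pairwise_differences(dists):
    """Map each index pair (i, j), i < j, to the vector dists[i] - dists[j]."""
    return {
        (i, j): dists[i] - dists[j]
        for i, j in combinations(range(len(dists)), 2)
    }


# Hypothetical parameters for three hypotheses.
params = [0.2, 0.35, 0.6]
diffs = pairwise_differences([bernoulli(q) for q in params])

# Every difference is a multiple of (1, -1), so the stacked matrix has rank 1.
stacked = np.vstack(list(diffs.values()))
rank = np.linalg.matrix_rank(stacked, tol=1e-12)

# The pair with the smallest Euclidean norm determines the objective.
closest_pair = min(diffs, key=lambda k: np.linalg.norm(diffs[k]))
print(rank, closest_pair)
```

For Bernoulli sources the minimum-norm pair is simply the pair of hypotheses whose parameters are closest, which is why the problem collapses to a single binary-type objective.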

VI Numerical Results

In this section, we numerically evaluate the utilities achieved by optimal privacy mechanisms for E-IT approximations in two scenarios: (binary) and (ternary) hypothesis testing. Furthermore, for the binary hypothesis testing scenario, we consider both the relative entropy and Rényi divergence settings, while for the scenario, we only focus on the relative entropy setting. Our goal is to compare the maximal utility for the E-IT approximation with that achieved for the original PUT. To this end, we start by choosing a privacy leakage level , for all , for the E-IT approximation.

Recall that for the relative entropy setting, (37) in Theorem 3 provides an optimal privacy mechanism for the E-IT approximation problem in (35) with leakage bounds for all . Specifically, for , can also be expressed as (19) in Theorem 2, where and are the first columns of and in (37), respectively. From Corollary 1, in the high privacy regime in (19) is also the optimal mechanism (for the approximated PUT) for binary hypothesis testing in the Rényi divergence setting.

To evaluate the performance of , we compare its utility to that achieved by an optimal mechanism of the original PUT problem (e.g., (2) for the relative entropy setting or (32) for the Rényi divergence setting). For a fair comparison of the utilities resulting from the E-IT and original PUTs, we choose the MI leakages to be the same for both cases. Thus, for the relative entropy setting (resp. Rényi divergence setting), we compare the values of (resp. ) and (resp. ), where .

For the original PUT problems in (2) and (32), the number of independent variables in is for . Even for , finding the optimal privacy mechanism using exhaustive search techniques is computationally prohibitive. Therefore, we restrict our numerical analysis to binary sources, i.e., ; furthermore, for numerical tractability in computing , we assume that , i.e., the output alphabet is binary.
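For a binary source and a binary output, the mechanism has only two free parameters, so a brute-force grid search is feasible. The sketch below is a hedged illustration of such a search for the relative entropy setting: it assumes the PUT maximizes the relative entropy between the two output distributions subject to an MI leakage bound under each hypothesis, and uses a single common bound `eps`. The function names, the grid resolution `n`, and the example distributions are assumptions made here, not values from the paper.

```python
import numpy as np


def mutual_information(p, W):
    """I(X; Y) in nats for source pmf p and row-stochastic mechanism W."""
    joint = p[:, None] * W
    q = joint.sum(axis=0)
    prod = p[:, None] * q[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / prod[mask])))


def relative_entropy(a, b):
    """D(a || b) in nats; assumes b > 0 wherever a > 0."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))


def grid_search_put(p1, p2, eps, n=101):
    """Search 2x2 mechanisms on an n x n grid.

    Returns the largest D(p1 W || p2 W) found among mechanisms whose MI
    leakage is at most eps under both hypotheses, and the maximizing W.
    """
    best_val, best_W = -np.inf, None
    grid = np.linspace(0.0, 1.0, n)
    for a in grid:
        for b in grid:
            W = np.array([[a, 1.0 - a], [b, 1.0 - b]])
            if mutual_information(p1, W) > eps:
                continue
            if mutual_information(p2, W) > eps:
                continue
            val = relative_entropy(p1 @ W, p2 @ W)
            if val > best_val:
                best_val, best_W = val, W
    return best_val, best_W


# Hypothetical Bernoulli pair with full support.
p1 = np.array([0.3, 0.7])
p2 = np.array([0.6, 0.4])
utility, W_opt = grid_search_put(p1, p2, eps=0.01)
print(utility / relative_entropy(p1, p2))  # normalized utility
```

The cost grows as the grid resolution raised to the number of free parameters, which is why this approach does not scale beyond very small source and output alphabets.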

For the E-IT approximated PUTs, since the choice of does not affect the optimal , we choose , for which from (17) and (18), we have . To capture the high privacy regime, we restrict . For these parameters, the following two subsections illustrate and discuss the regimes in which the E-IT approximation is accurate.

        | close to each other | close to the uniform distribution
Pair 1: | No                  | Yes, No
Pair 2: | No                  | No, No
Pair 3: | Yes                 | Yes, Yes
Pair 4: | Yes                 | No, No
TABLE I: Distribution pairs for binary hypothesis testing

VI-A Binary Hypothesis Testing

We consider four pairs of Bernoulli distributions as shown in Table I for the two source classes (hypotheses) to evaluate the accuracy of optimal mechanisms for the E-IT approximation in the relative entropy and Rényi divergence settings. Figures 1(a)-1(d) illustrate the normalized utilities for Pairs 1-4 in Table I, respectively, as a function of the normalized MI leakages, i.e., . In the four figures, the left and right -axes are for normalized utilities in the relative entropy and Rényi divergence settings, respectively. Figures 1(a) and 1(d) show that and have the same utilities in the regions highlighted by the black-dotted ellipses, in which is smaller than and of , respectively. In contrast, for Figs. 1(b) and 1(c), the utilities of and are almost the same in the entire plotted range.

Fig. 2: The relative utilities of and for the four distribution pairs in Table I: (a) Pair 1, (b) Pair 2, (c) Pair 3, (d) Pair 4.

From Figs. 1(a)-1(d), we deduce that for any two given distributions, there is a high privacy regime in which the privacy mechanism for the E-IT approximation performs almost optimally; however, the range of this regime is specific to the distribution pair. In particular, when both distributions are close to the uniform distribution, or when both are far from the uniform distribution as well as from each other, the set of leakage values for which the privacy mechanism works well is larger. For the former, it can be seen that the E-IT approximations of the relative entropy and the MI are more accurate (cf. [8, Footnote 2]); for the latter, the individual approximation errors “cancel out” so the overall approximation is accurate.
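The dependence of the approximation accuracy on how close the distributions are can be illustrated with the standard second-order (Euclidean) approximation of the relative entropy, D(p||q) ≈ (1/2) Σ (p_i − q_i)² / q_i, which holds when p is close to q. The sketch below is only an illustration under that assumption: the exact weighting used in the paper's E-IT approximation is not reproduced, and the distribution pairs are hypothetical examples rather than the pairs of Table I.

```python
import numpy as np


def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats for strictly positive pmfs."""
    return float(np.sum(p * np.log(p / q)))


def quadratic_approx(p, q):
    """Second-order approximation 0.5 * sum((p - q)^2 / q)."""
    return float(0.5 * np.sum((p - q) ** 2 / q))


def relative_error(p, q):
    """Relative error of the quadratic approximation with respect to D(p || q)."""
    exact = kl_divergence(p, q)
    return abs(quadratic_approx(p, q) - exact) / exact


# Hypothetical pairs: one close together near uniform, one far apart.
close_pair = (np.array([0.50, 0.50]), np.array([0.52, 0.48]))
far_pair = (np.array([0.10, 0.90]), np.array([0.60, 0.40]))

err_close = relative_error(*close_pair)
err_far = relative_error(*far_pair)
print(f"close pair: {err_close:.4f}, far pair: {err_far:.4f}")
```

This only probes the approximation of the relative entropy itself; the error cancellation described above for distributions far from uniform involves the MI approximation as well and is not captured by this sketch.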

VI-B Ternary Hypothesis Testing

, ,
Triple 1
Triple 2