An Estimation-Theoretic View of Privacy

Hao Wang Flavio P. Calmon H. Wang and F. P. Calmon are with the John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA (emails: hao_wang@g.harvard.edu, flavio@seas.harvard.edu).
Abstract

We study a central problem in data privacy: how to share data with an analyst while providing both privacy and utility guarantees to the user who owns the data. We present an estimation-theoretic analysis of the privacy-utility trade-off (PUT) in this setting. Here, an analyst is allowed to reconstruct (in a mean-squared error sense) certain functions of the data (utility), while other private functions should not be reconstructed with distortion below a certain threshold (privacy). We demonstrate how a χ²-based information measure captures the fundamental PUT, and characterize several properties of this measure. In particular, we give a sharp bound for the PUT. We then propose a convex program to compute privacy-assuring mappings when the functions to be disclosed and hidden are known a priori. Finally, we evaluate the robustness of our approach to finite samples.

1 Introduction

Data sharing and publishing is increasingly common within scientific communities [1], businesses [2], government operations [3], medical fields [4] and beyond. Data is usually shared with an application in mind, from which the data provider receives some utility. For example, when a user shares her movie ratings with a streaming service, she receives utility in the form of suggestions of new, interesting movies that fit her taste. As a second example, when a medical research group shares patient data, their aim is to enable a wider community of researchers and statisticians to learn interesting patterns from that data. Utility is then gained through new scientific discoveries.

The disclosure of non-encrypted data incurs a privacy risk through unwanted inferences. In our previous examples, the streaming service may infer the user’s political preference (potentially deemed private by the user) from her movie ratings, or an insurance company may determine the identity of a patient in the medical dataset. If privacy is a concern but the data has no immediate utility, then cryptographic methods suffice.

The dichotomy between privacy and utility has been widely studied by computer scientists, statisticians and information theorists alike. While specific metrics and models may vary among these communities, their desideratum is the same: to design mechanisms that perturb the data (or functions thereof) while achieving an acceptable privacy-utility trade-off (PUT). The feasibility of this goal depends on the chosen privacy and utility metrics, as well as the topology and distribution of the data. The information-theoretic approach to privacy, and notably the results of Sankar et al. [5], [6], Issa et al. [7], [8], Asoodeh et al. [9], [10], Calmon et al. [11], [12], among others, seeks to quantify the best possible PUT for any privacy mechanism. Here, information-theoretic quantities, such as mutual information or maximal leakage, are used to characterize privacy, and, under assumptions on the distribution of the data, bounds on the fundamental PUT are derived. The present work falls within this information-theoretic approach.

Our aim is to characterize the fundamental performance limits of privacy-assuring mechanisms from an estimation-theoretic perspective, and to develop data-driven privacy-assuring mechanisms that provide estimation-theoretic guarantees. The specific privacy and utility metric used is the χ²-information between the private (or useful) and disclosed variables and, more generally, the principal inertia components (PICs) of their joint distribution. The PICs reveal interesting facets of data disclosure under privacy constraints, and quantify the minimum mean-squared error (MMSE) achievable for reconstructing both private and useful information from the disclosed data. We do not seek to claim that the estimation-based approach subsumes other privacy metrics (e.g. differential privacy [13]). Rather, our goal is to show that the MMSE view reveals an interesting facet of data disclosure which, in turn, can drive the design of privacy mechanisms used in practice.

The rest of the paper is organized as follows. In Section 2, we review the definition of the PICs and present some of their properties. In Section 3, we introduce the χ²-privacy-utility function and derive several of its properties. In Section 4, we discuss a finer-grained setting and propose a PIC-based convex program for finding privacy-assuring mappings. Finally, the robustness of our approach to limited samples is discussed in Section 5.

1.1 Related work

Several papers, such as Sankar et al. [5], Calmon et al. [12], Asoodeh et al. [14], and Makhdoumi et al. [15], have studied information disclosure with privacy guarantees through an information-theoretic lens. For example, Sankar et al. [5] characterized PUTs in large databases using tools from rate-distortion theory. Calmon et al. [12] presented lower bounds on the minimum mean-squared error of estimating private functions of a plaintext given knowledge of other, disclosed functions. Makhdoumi et al. [15] introduced the privacy funnel, where both privacy and utility are measured in terms of mutual information, and showed its connection with the information bottleneck [16]. The PUT was also explored in [17] and [18] using mutual information as a privacy metric. Currently, the most widely adopted definition of privacy is differential privacy ([13], [19]), which enables queries to be computed over a database while simultaneously ensuring the privacy of individual entries of the database. Fundamental bounds on the composition of differentially private mechanisms were given by Kairouz et al. [20]. Other quantities from the information-theoretic literature have been used to quantify privacy and utility. For example, Asoodeh et al. [9] and Calmon et al. [11] used estimation-theoretic tools to characterize fundamental limits of privacy. Also of note, Liao et al. ([21], [22]) explored the PUT within a hypothesis testing framework.

We introduce the χ²-privacy-utility function in this paper. Related privacy-utility functions and proof techniques that inspired our approach have appeared in [9], [10], and [23].

1.2 Notation

For a positive integer , we define . Matrices are denoted in bold capital letters (e.g. ) and vectors in bold lower-case letters (e.g. ). For a vector , is defined as the matrix with diagonal entries equal to and all other entries equal to . Capital letters (e.g. and ) are used to denote random variables, and calligraphic letters (e.g. and ) denote sets. The of a set of vectors is

(1)

We denote independence of and by and write to indicate that and have the same distribution. When , , form a Markov chain, we write . For a discrete random variable with probability mass function , we denote . The MMSE of estimating given is

(2)

The -information between two random variables, and , is defined as

(3)

Let and be two probability mass functions with the same discrete support set . We denote . For any real-valued random variable , we denote the -norm of as

The set of all functions that when composed with a random variable with distribution result in an -norm smaller than 1 is given by

(4)
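The χ²-information introduced above can be computed directly from a joint probability mass function. The following is a minimal Python sketch, assuming the standard definition of χ²-information as the χ²-divergence between the joint distribution and the product of its marginals; the toy distribution P is hypothetical and only meant for illustration.

```python
import numpy as np

def chi2_information(P_xy):
    """chi^2-information of a joint pmf (rows index X, columns index Y),
    computed as the chi^2-divergence between P_{X,Y} and P_X x P_Y."""
    p_x = P_xy.sum(axis=1)
    p_y = P_xy.sum(axis=0)
    indep = np.outer(p_x, p_y)
    return np.sum((P_xy - indep) ** 2 / indep)

# Hypothetical joint distribution used only for illustration.
P = np.array([[0.2, 0.1],
              [0.1, 0.6]])
print(chi2_information(P))  # equals 0 if and only if X and Y are independent
```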

1.3 Privacy Setup

Throughout this paper, denotes a variable to be hidden (e.g. political preference), and is an observed variable that depends on (e.g. movie ratings). Our goal is to disclose a realization of a random variable , produced from through a randomized mapping , called the privacy-assuring mapping. Here, , and satisfy the Markov condition . We assume that an analyst will provide some utility based on an observation of (e.g. movie recommendations), while potentially trying to estimate from . Privacy and utility will be quantified in terms of how well the analyst can reconstruct/estimate functions of and given , respectively. The support sets of , and are , and , respectively. Furthermore, we assume , and are bounded.

2 Principal Inertia Components

We present next the properties of the PICs that will be used in this paper. For a more detailed overview, we refer the reader to [11] and the references therein. We use the definition of PICs presented in [11], but note that the PICs predate [11] (e.g. [24], [25] and the references therein).

Definition 1 ([11], Definition 1).

Let and be r.v.s with support sets and , respectively, and distribution . In addition, let and be the constant functions and . For , we (recursively) define

(5)

where

(6)

The values are called the principal inertia components (PICs) of . The functions and are called the principal functions of and .

Observe that the PICs satisfy , since and

Thus, from Definition 1, .

Definition 2 ([11], Definition 14).

Let , and the smallest PIC of . We define

We also denote as throughout this paper.

When both random variables and have a finite support set, we have the following definition.

Definition 3.

For and , let be a matrix with entries , and and be diagonal matrices with diagonal entries and , respectively, where and . We define

(7)

We denote the singular value decomposition of by .

The next theorem illustrates the different characterizations of the PICs used in this paper.

Theorem 1 ([11], Theorem 1).

The following characterizations of the PICs are equivalent:

  1. The characterization given in Definition 1 where, for and given in (6), and .

  2. For any ,

    (8)

    where

    (9)

    If is unique, then given in (6).

Finally, if both and are defined over finite supports, the following characterization is also equivalent.

  3. is the -st largest singular value of . The principal functions and in (6) correspond to the columns of the matrices and , respectively, where .

The equivalent characterizations of the PICs in the above theorem have the following intuitive interpretation: the principal functions can be viewed as a basis that decomposes the mean-squared error of estimating functions of a hidden variable given an observation .

It has been shown (e.g. [11]) that where . If is small, say , then, equivalently, the sum of all PICs is upper-bounded by . Since the PICs are non-negative, each PIC is also upper-bounded by . Consequently, from characterization 2 in Theorem 1, it follows that the MMSE of reconstructing any zero-mean, unit-variance function of given is lower bounded by , i.e. no function of can be reconstructed with small MMSE given an observation of . In other words, functions of cannot be reliably estimated from the disclosed variable in a mean-squared error sense, and, depending on the value of and the needs of the user, privacy is assured in this estimation-theoretic sense. If , then some functions of may be (perfectly) revealed, but most PICs of are still small and, thus, most functions of are hard to estimate from . Analogously, if has a large lower bound, then certain functions of can be, on average, reconstructed (i.e. estimated) with small MMSE from an observation of .
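To make the finite-alphabet characterization concrete, the following numpy sketch computes the PICs as the squared non-trivial singular values of the matrix of Definition 3, checks that their sum equals the χ²-information, and verifies that the principal function associated with the largest PIC is estimated with MMSE equal to one minus that PIC. The joint distribution below is a hypothetical toy example, and rows are assumed to index the hidden variable.

```python
import numpy as np

# Hypothetical joint pmf; rows index the hidden variable, columns the observation.
P = np.array([[0.20, 0.05, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.30]])
p_x, p_y = P.sum(axis=1), P.sum(axis=0)

# Q = D_X^{-1/2} P D_Y^{-1/2}; its singular values are 1 >= sqrt(lambda_1) >= ...
Q = np.diag(p_x ** -0.5) @ P @ np.diag(p_y ** -0.5)
U, s, Vt = np.linalg.svd(Q)
pics = s[1:] ** 2                        # drop the trivial singular value 1

chi2 = np.sum((P - np.outer(p_x, p_y)) ** 2 / np.outer(p_x, p_y))
print(np.isclose(pics.sum(), chi2))      # sum of the PICs = chi^2-information

# Principal function with the largest PIC (zero mean, unit variance).
f = U[:, 1] / np.sqrt(p_x)
cond_mean = (P.T @ f) / p_y              # conditional mean of f given the observation
mmse = 1.0 - np.sum(p_y * cond_mean ** 2)
print(np.isclose(mmse, 1.0 - pics[0]))   # MMSE of the top principal function
```

In this small example the reported MMSE matches one minus the largest PIC, which is the estimation-theoretic reading of Theorem 1 discussed above.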

The previous discussion demonstrates how χ²-information can be used to measure both privacy and utility from an estimation-theoretic view, where estimation error is measured in terms of MMSE. We will adopt χ²-information as a measure of both privacy and utility in the next section, and explore the fundamental PUT in terms of this metric.

3 Privacy-Utility Trade-off

The results introduced in this section use the following definition.

Definition 4.

We define the χ²-privacy-utility function as:

where has fixed joint distribution , and

Remark 1.

Note that is a continuous mapping for fixed and finite support set , , . Therefore, is a compact set and the supremum in is indeed a maximum.
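The χ²-privacy-utility function can also be explored numerically. The sketch below is a naive Monte Carlo search, not the bounds derived in this section: writing S for the hidden variable, X for the observed variable and Y for the disclosed variable, it samples random mappings P_{Y|X}, records the resulting pair (χ²(S;Y), χ²(X;Y)), and reports, for a few privacy levels, the best utility found. Under the assumption that utility is measured by χ²(X;Y) and privacy by χ²(S;Y), each reported value is a lower estimate of the χ²-privacy-utility function at that level. The joint distribution of (S, X) and the alphabet sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def chi2_information(P):
    """chi^2-information of a joint pmf matrix."""
    p, q = P.sum(axis=1), P.sum(axis=0)
    return np.sum((P - np.outer(p, q)) ** 2 / np.outer(p, q))

# Hypothetical fixed joint pmf of (S, X): rows index S, columns index X.
P_SX = np.array([[0.25, 0.10],
                 [0.05, 0.60]])
p_x = P_SX.sum(axis=0)

samples = []
for _ in range(20000):
    W = rng.dirichlet(np.ones(2), size=2)          # random channel P_{Y|X}
    leakage = chi2_information(P_SX @ W)           # chi^2(S; Y) via S - X - Y
    utility = chi2_information(np.diag(p_x) @ W)   # chi^2(X; Y)
    samples.append((leakage, utility))

for eps in (0.01, 0.05, 0.10):
    feasible = [u for (l, u) in samples if l <= eps]
    print(eps, max(feasible, default=0.0))         # crude lower estimate of the PUT
```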

Lemma 1.

is a concave function. Furthermore, is a non-increasing mapping.

Proof.

The proof is given in the appendix. ∎

The χ²-privacy-utility function has an upper bound

(10)

that follows immediately from the data-processing inequality:

(11)

We next provide a sharp bound for the χ²-privacy-utility function that significantly improves on (10) by using properties of the PICs. We will use the following definition and two lemmas to derive this result.

Lemma 2.

Suppose . Then

(12)
(13)

where

Proof.

The proof is given in the appendix. ∎

Definition 5.

For , and , is defined as follows:

(14)

where

(15)

is the solution of a linear program which, in turn, can be written in the closed form given in the next lemma.

Lemma 3.

Suppose without loss of generality. If

where , then

(16)

And

can achieve the maximum in the definition of .

We next give an upper bound and a lower bound for the χ²-privacy-utility function. These bounds are illustrated in Fig. 1.

Theorem 2.

For the χ²-privacy-utility function defined in Definition 4,

where and are the PICs of .

Proof.

The lower bound for follows immediately from the concavity of .

Using Lemma 2, the χ²-privacy-utility function can be simplified as follows:

where

We denote the singular value decomposition of and by and , respectively. Then , where and .

Let where . Suppose the diagonal elements of are . Then

(17)

Suppose the -th row of is , the -th column of is and . By the definition of PICs, , for and for . Then for

Since the first column of and are both following the properties of PICs, then . Therefore, . If , then by Equation (17)

which implies

Thus,

Remark 2.

The upper bound can also be proved by combining the trace inequality [26, Eq. (4)] with properties of the PICs.

If the value of is known, a better lower bound can be obtained as follows:

(18)

When , then and . Following Definition 5 and noticing that all PICs of are 1, it can be shown that the upper bound and the lower bound for the χ²-privacy-utility function in Theorem 2 are both , which is equal to . Therefore, the upper and lower bounds given in Theorem 2 are sharp.

Figure 1: Upper and lower bounds for the χ²-privacy-utility function when .
Corollary 1.

is a strictly increasing function for .

Proof.

Firstly, is non-decreasing. To see this, denote by . Since is a concave function, it is sufficient to show that, for every , . For any , which implies . On the other hand, . Thus, .

Now suppose there exists , such that . Since is a concave and non-decreasing function, for any , . In particular, . This contradicts the upper bound on the χ²-privacy-utility function in Theorem 2, since the upper bound implies that when . ∎

Definition 6.

.

By Corollary 1, is strictly increasing. Therefore,

(19)

By Corollary 7 in [11], when , then (i.e. there exists a privacy-assuring mapping that allows the disclosure of non-trivial useful functions while guaranteeing perfect privacy). On the other hand, when , then . The following theorem shows that when , the upper bound on in Theorem 2 is achievable around zero, i.e. the upper bound is sharp in the high-privacy region. The theorem also provides an explicit construction of the random variable that achieves the upper bound in this region.

Theorem 3.

Suppose and . Then there exists a such that , and .

Proof.

From Theorem 1 in [11], there exists such that , and .
Fix and define the privacy-assuring mapping from to as follows:

(20)

Since

for any

Therefore, , which implies that is feasible. Furthermore, because of .

Since

then

Remark 3.

When and , then , where . Since is a point on the upper bound of the χ²-privacy-utility function given in Theorem 2, Theorem 3 shows that, in this case, the upper bound is achievable in the high-privacy region.

4 A Convex Program for Computing Privacy-Assuring Mappings

In the previous section we studied a χ²-based metric for both privacy and utility. This metric captures the overall error (in a mean-squared sense) of estimating functions of the private and the useful variables and , respectively. We used χ²-information to measure both privacy and utility, and derived bounds for the PUT curve. The upper bound was shown to be achievable in the high-privacy region (Theorem 3).

Next, we explore an alternative, finer-grained approach for measuring both privacy and utility based on the PICs (recall that χ²-information is the sum of all PICs). This approach has a practical motivation, since in practice there are often well-defined features (functions) of the data (realizations of a random variable) that should be hidden or disclosed. For example, a user might be comfortable revealing that his/her age is above a certain threshold, but not the age itself. Alternatively, a user may be willing to disclose that they prefer documentaries over action movies, but not exactly which documentary they like. More abstractly, we consider the case where certain known functions of a hidden variable should be revealed (utility), whereas others should be hidden (privacy). This is a finer-grained setting than the one used in the last section, since χ²-information captures the aggregate reconstruction error across all zero-mean, unit-variance functions.

We denote the set of functions to be hidden as

We denote the set of functions to be disclosed as

The goal is to find the privacy-assuring mapping, , such that and satisfies the following privacy-utility constraints.

  1. Utility constraints: and .

  2. Privacy constraints: .

In this section, we follow two steps, projection and optimization, to find this privacy-assuring mapping.

4.1 Projection

As a first step, we project all private functions onto the observed variable and obtain a new set of functions:

The advantage of projection is twofold. First, it can significantly reduce the computational cost of solving the optimization program, since after projection all functions are cast in terms of the observed random variable. Second, the hidden variable is no longer needed after the projection – the optimization solver only needs observed data. Therefore, the party that solves the optimization does not need direct access to the private data, further protecting the sensitive information. Finally, the following theorem proves that privacy guarantees cast in terms of the projected functions still hold for the original functions.

Theorem 4.

For , and , we have

(21)

and

(22)
Proof.

Suppose without loss of generality.
Observe that

Since , then . Therefore,

where the last inequality follows from Jensen’s inequality:

By Theorem 4, . Therefore, if the new set of functions satisfies the privacy constraints (i.e. ), the original set of functions also satisfies the privacy constraints (i.e. ).
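As a numerical illustration of this projection step, the following sketch projects a private function of the hidden variable onto the observed variable. It assumes, consistent with the use of Jensen's inequality in the proof above, that the projection is the conditional expectation of the private function given the observed variable; the joint distribution and the function g are hypothetical.

```python
import numpy as np

# Hypothetical joint pmf of (S, X): rows index the hidden S, columns the observed X.
P_SX = np.array([[0.15, 0.05, 0.10],
                 [0.05, 0.30, 0.35]])
p_x = P_SX.sum(axis=0)

g = np.array([1.0, -1.0])          # private function of the hidden variable

# Projection onto the observed variable: g_hat(x) = E[g(S) | X = x].
g_hat = (P_SX.T @ g) / p_x
print(g_hat)

# From here on, the optimization only needs g_hat and data about X;
# the hidden variable itself is no longer required.
```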

4.2 Optimization

After projection, both the privacy and the utility functions are expressed in terms of the observed random variable. Next, we introduce a PIC-based convex optimization program to find the privacy-assuring mapping.
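Before presenting the PIC-based construction, the following is a simplified sketch of the kind of convex program involved; it is not the program developed below, but a stripped-down linear program written with cvxpy. The decision variable is the joint distribution of the observed variable X and the disclosed variable Y. The privacy constraint forces a projected private function g_hat to have zero conditional mean given every output, so that its MMSE given Y equals its variance, while the objective maximizes the correlation between a disclosed function b and a fixed template function of Y. All distributions, functions and alphabet sizes below are hypothetical.

```python
import cvxpy as cp
import numpy as np

p_x = np.full(4, 0.25)                    # marginal of the observed variable X
b = np.array([1.0, 1.0, -1.0, -1.0])      # function of X to disclose
g_hat = np.array([1.0, -1.0, 1.0, -1.0])  # projected private function (zero mean)
t = np.array([1.0, -1.0])                 # fixed template function on Y

M = cp.Variable((4, 2), nonneg=True)      # joint pmf of (X, Y)

constraints = [
    cp.sum(M, axis=1) == p_x,             # consistency with the marginal of X
    g_hat @ M == 0,                       # E[g_hat(X) | Y = y] = 0 for every y
]
prob = cp.Problem(cp.Maximize(b @ M @ t), constraints)
prob.solve()

print(M.value / p_x[:, None])             # recovered privacy-assuring mapping P_{Y|X}
```

In this toy instance the optimal mapping reveals b(X) essentially perfectly, while the projected private function remains unpredictable from Y in the MMSE sense.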

First, we construct given by such that

(23)
(24)

where

and

From (23), is a basis of , and the functions can be decomposed as

(25)

If is a feasible joint distribution matrix, then the following equations follow directly from Theorem 1: