An Estimation-Theoretic View of Privacy
Abstract
We study the central problem in data privacy: how to share data with an analyst while providing both privacy and utility guarantees to the user who owns the data. We present an estimation-theoretic analysis of the privacy-utility tradeoff (PUT) in this setting. Here, an analyst is allowed to reconstruct (in a mean-squared error sense) certain functions of the data (utility), while other, private functions should not be reconstructed with distortion below a certain threshold (privacy). We demonstrate how a χ²-based information measure captures the fundamental PUT, and characterize several properties of this function. In particular, we give a sharp bound for the PUT. We then propose a convex program to compute privacy-assuring mappings when the functions to be disclosed and hidden are known a priori. Finally, we evaluate the robustness of our approach to finite samples.
1 Introduction
Data sharing and publishing is increasingly common within scientific communities [1], businesses [2], government operations [3], medical fields [4] and beyond. Data is usually shared with an application in mind, from which the data provider receives some utility. For example, when a user shares her movie ratings with a streaming service, she receives utility in the form of suggestions of new, interesting movies that fit her taste. As a second example, when a medical research group shares patient data, their aim is to enable a wider community of researchers and statisticians to learn interesting patterns from that data. Utility is then gained through new scientific discoveries.
The disclosure of non-encrypted data incurs a privacy risk through unwanted inferences. In our previous examples, the streaming service may infer the user’s political preference (potentially deemed private by the user) from her movie ratings, or an insurance company may determine the identity of a patient in the medical dataset. If privacy is a concern but the data has no immediate utility, then cryptographic methods suffice.
The dichotomy between privacy and utility has been widely studied by computer scientists, statisticians and information theorists alike. While specific metrics and models may vary among these communities, their desideratum is the same: to design mechanisms that perturb the data (or functions thereof) while achieving an acceptable privacy-utility tradeoff (PUT). The feasibility of this goal depends on the chosen privacy and utility metrics, as well as the topology and distribution of the data. The information-theoretic approach to privacy, and notably the results of Sankar et al. [5, 6], Issa et al. [7, 8], Asoodeh et al. [9, 10], Calmon et al. [11, 12], among others, seeks to quantify the best possible PUT for any privacy mechanism. Here, information-theoretic quantities, such as mutual information or maximal leakage, are used to characterize privacy, and, under assumptions on the distribution of the data, bounds on the fundamental PUT are derived. The present work falls within this information-theoretic approach.
Our aim is to characterize the fundamental performance limits of privacy-assuring mechanisms from an estimation-theoretic perspective, and to develop data-driven privacy-assuring mechanisms that provide estimation-theoretic guarantees. The specific privacy and utility metric used is χ²-information and, more generally, the principal inertia components (PICs) of the joint distribution of the private and disclosed data. The PICs reveal interesting facets of data disclosure under privacy constraints, and quantify the minimum mean-squared error (MMSE) achievable for reconstructing both private and useful information from the disclosed data. We do not seek to claim that the estimation-based approach subsumes other privacy metrics (e.g. differential privacy [13]). Rather, our goal is to show that the MMSE view reveals an interesting facet of data disclosure which, in turn, can drive the design of privacy mechanisms used in practice.
The rest of the paper is organized as follows. In Section 2, we review the definition of the PICs and present some of their properties. In Section 3, we introduce the privacy-utility function and establish several of its properties. In Section 4, we discuss a finer-grained setting and propose a PIC-based convex program for finding privacy-assuring mappings. Finally, the robustness of our approach to limited samples is discussed in Section 5.
1.1 Related work
Several papers, such as Sankar et al. [5], Calmon et al. [12], Asoodeh et al. [14], and Makhdoumi et al. [15], have studied information disclosure with privacy guarantees through an information-theoretic lens. For example, Sankar et al. [5] characterized PUTs in large databases using tools from rate-distortion theory. Calmon et al. [12] presented lower bounds on the minimum mean-squared error of estimating private functions of a plaintext given knowledge of other, disclosed functions. Makhdoumi et al. [15] introduced the privacy funnel, where both privacy and utility are measured in terms of mutual information, and showed its connection with the information bottleneck [16]. The PUT was also explored in [17] and [18] using mutual information as a privacy metric. Currently, the most widely adopted definition of privacy is differential privacy [13, 19], which enables queries to be computed over a database while simultaneously ensuring the privacy of individual entries of the database. Fundamental bounds on the composition of differentially private mechanisms were given by Kairouz et al. [20]. Other quantities from the information-theoretic literature have also been used to quantify privacy and utility. For example, Asoodeh et al. [9] and Calmon et al. [11] used estimation-theoretic tools to characterize fundamental limits of privacy. Also of note, Liao et al. [21, 22] explored the PUT within a hypothesis testing framework.
1.2 Notation
For a positive integer n, we define [n] ≜ {1, …, n}. Matrices are denoted in bold capital letters (e.g. A) and vectors in bold lowercase letters (e.g. a). For a vector x, diag(x) is defined as the matrix with diagonal entries equal to the entries of x and all other entries equal to 0. Capital letters (e.g. X and Y) are used to denote random variables, and calligraphic letters (e.g. 𝒳 and 𝒴) denote sets. The of a set of vectors is
(1) 
We denote independence of X and Y by X ⊥ Y, and write X ∼ Y to indicate that X and Y have the same distribution. When X, Y, Z form a Markov chain, we write X → Y → Z. For a discrete random variable X with probability mass function p_X, we write X ∼ p_X. The MMSE of estimating X given Y is
(2) mmse(X|Y) ≜ E[(X − E[X|Y])²].
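For discrete random variables, the MMSE in (2) can be computed directly from the joint pmf via mmse(X|Y) = E[X²] − Σ_y p_Y(y) E[X|Y = y]². A minimal sketch, assuming a hypothetical joint distribution and NumPy:

```python
import numpy as np

# Hypothetical joint pmf of (X, Y): rows index the values of X, columns of Y.
P = np.array([[0.2, 0.1],
              [0.1, 0.3],
              [0.1, 0.2]])
x_vals = np.array([-1.0, 0.0, 1.0])  # values taken by X

p_x = P.sum(axis=1)                  # marginal pmf of X
p_y = P.sum(axis=0)                  # marginal pmf of Y

cond_mean = (x_vals @ P) / p_y       # E[X | Y = y] for each y
# mmse(X|Y) = E[X^2] - sum_y p_Y(y) * E[X | Y = y]^2
mmse = x_vals**2 @ p_x - cond_mean**2 @ p_y
```

Note that mmse never exceeds the variance of X (here 0.6), since conditioning on Y can only help estimation.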
The χ²-information between two random variables, X and Y, is defined as
(3) χ²(X;Y) ≜ Σ_{x,y} (p_{X,Y}(x,y) − p_X(x) p_Y(y))² / (p_X(x) p_Y(y)).
Let p and q be two probability mass functions with the same discrete support set 𝒳. We denote the χ²-divergence between p and q by χ²(p‖q) ≜ Σ_{x∈𝒳} (p(x) − q(x))² / q(x). For any real-valued random variable X, we denote the L²-norm of X as ‖X‖₂ ≜ E[X²]^{1/2}.
The set of all functions that, when composed with a random variable X with distribution p_X, result in an L²-norm of at most 1 is given by
(4) ℱ(p_X) ≜ { f : 𝒳 → ℝ | ‖f(X)‖₂ ≤ 1, X ∼ p_X }.
1.3 Privacy Setup
Throughout this paper, S denotes a variable to be hidden (e.g. political preference), and X is an observed variable that depends on S (e.g. movie ratings). Our goal is to disclose a realization of a random variable Y, produced from X through a randomized mapping p_{Y|X}, called the privacy-assuring mapping. Here, S, X and Y satisfy the Markov condition S → X → Y. We assume that an analyst will provide some utility based on an observation of Y (e.g. movie recommendations), while potentially trying to estimate S from Y. Privacy and utility will be quantified in terms of how well the analyst can reconstruct/estimate functions of S and X given Y, respectively. The support sets of S, X and Y are 𝒮, 𝒳 and 𝒴, respectively. Furthermore, we assume |𝒮|, |𝒳| and |𝒴| are bounded.
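This setup can be sketched numerically for discrete variables. The sketch below is illustrative only: the joint distribution and channel are hypothetical, and the names S, X, Y stand for the hidden, observed, and disclosed variables, respectively. Because the disclosed variable is generated from the observation alone, the Markov condition holds by construction:

```python
import numpy as np

# Hypothetical joint pmf of the hidden variable S (rows) and the
# observed variable X (columns).
P_sx = np.array([[0.15, 0.05, 0.10],
                 [0.05, 0.25, 0.05],
                 [0.10, 0.05, 0.20]])

# A privacy-assuring mapping: a row-stochastic channel W with entries
# W[x, y] = p(Y = y | X = x). Since Y is generated from X alone,
# S -> X -> Y holds by construction.
W = np.array([[0.8, 0.2],
              [0.3, 0.7],
              [0.1, 0.9]])

P_sy = P_sx @ W  # induced joint pmf of (S, Y) seen by the analyst
```

The analyst only ever sees realizations of Y; the privacy question is how much the induced joint distribution P_sy still reveals about S.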
2 Principal Inertia Components
We present next the properties of the PICs that will be used in this paper. For a more detailed overview, we refer the reader to [11] and the references therein. We use the definition of PICs presented in [11], but note that the PICs predate [11] (e.g. [24], [25] and the references therein).
Definition 1 ([11], Definition 1).
Let X and Y be r.v.s with support sets 𝒳 and 𝒴, respectively, and distribution p_{X,Y}. In addition, let f₀ and g₀ be the constant functions f₀(x) = 1 and g₀(y) = 1. For k ≥ 1, we (recursively) define
(5) (f_k, g_k) ≜ argmax { E[f(X) g(Y)] | (f, g) ∈ 𝒮_k },
where
(6) 𝒮_k ≜ { (f, g) | E[f(X)²] = E[g(Y)²] = 1, E[f(X) f_j(X)] = E[g(Y) g_j(Y)] = 0, 0 ≤ j < k }.
The values λ_k(X;Y) ≜ E[f_k(X) g_k(Y)]² are called the principal inertia components (PICs) of (X;Y). The functions f_k and g_k are called the principal functions of X and Y.
Definition 2 ([11], Definition 14).
Let d ≜ min{|𝒳|, |𝒴|} − 1, and let λ_d(X;Y) be the smallest PIC of (X;Y). We define χ²(X;Y) ≜ Σ_{k=1}^{d} λ_k(X;Y).
We also denote λ_k(X;Y) simply as λ_k throughout this paper.
When both random variables and have a finite support set, we have the following definition.
Definition 3.
For 𝒳 = [m] and 𝒴 = [n], let P be a matrix with entries [P]_{i,j} = p_{X,Y}(i, j), and let D_X and D_Y be diagonal matrices with diagonal entries [D_X]_{i,i} = p_X(i) and [D_Y]_{j,j} = p_Y(j), respectively, where i ∈ [m] and j ∈ [n]. We define
(7) Q ≜ D_X^{−1/2} P D_Y^{−1/2}.
We denote the singular value decomposition of Q by Q = U Σ Vᵀ.
The next theorem illustrates the different characterizations of the PICs used in this paper.
Theorem 1 ([11], Theorem 1).
The following characterizations of the PICs are equivalent:
Finally, if both and are defined over finite supports, the following characterization is also equivalent.

√λ_k(X;Y) is the (k+1)-st largest singular value of Q. The principal functions f_k and g_k in (6) correspond to the columns of the matrices D_X^{−1/2} U and D_Y^{−1/2} V, respectively, where Q = U Σ Vᵀ.
The equivalent characterizations of the PICs in the above theorem have the following intuitive interpretation: the principal functions can be viewed as a basis that decomposes the mean-squared error of estimating functions of a hidden variable X given an observation Y.
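Under the matrix characterization in Theorem 1, the PICs can be computed with a few lines of linear algebra. A sketch with a hypothetical joint pmf (NumPy assumed); the largest singular value of Q always equals 1 and is discarded:

```python
import numpy as np

# Hypothetical joint pmf of (X, Y); rows index X, columns index Y.
P = np.array([[0.20, 0.10],
              [0.10, 0.30],
              [0.10, 0.20]])
p_x, p_y = P.sum(axis=1), P.sum(axis=0)

# Q = D_X^{-1/2} P D_Y^{-1/2} as in Definition 3.
Q = np.diag(p_x**-0.5) @ P @ np.diag(p_y**-0.5)

sigma = np.linalg.svd(Q, compute_uv=False)
# The top singular value of Q is always 1 (constant functions); the PICs
# are the squares of the remaining singular values.
pics = sigma[1:]**2
```

Each PIC lies in [0, 1], quantifying how well the corresponding principal function of X can be estimated from Y in a mean-squared sense.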
It has been shown (e.g. [11]) that χ²(X;Y) = Σ_{k=1}^{d} λ_k(X;Y), where d = min{|𝒳|, |𝒴|} − 1. If χ²(X;Y) is small, say χ²(X;Y) ≤ ε, then, equivalently, the sum of all PICs is upper-bounded by ε. Since the PICs are non-negative, each PIC is also upper-bounded by ε. Consequently, from characterization 2 in Theorem 1, it follows that the MMSE of reconstructing any zero-mean, unit-variance function of X given Y is lower-bounded by 1 − ε, i.e. no function of X can be reconstructed with small MMSE given an observation of Y. In other words, functions of X cannot be reliably estimated from the disclosed variable Y in a mean-squared error sense, and, depending on the value of ε and the needs of the user, privacy is assured in this estimation-theoretic sense. If ε ≥ 1, then some functions of X may be (perfectly) revealed, but most PICs of (X;Y) are still small and, thus, most functions of X are hard to estimate from Y. Analogously, if χ²(X;Y) has a large lower bound, then certain functions of X can be, on average, reconstructed (i.e. estimated) with small MMSE from an observation of Y.
The previous discussion demonstrates how χ²-information can be used to measure both privacy and utility from an estimation-theoretic view, where estimation is measured in terms of MMSE. We will adopt χ²-information as a measure of both privacy and utility in the next section, and explore the fundamental PUT in terms of this metric.
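The identity underlying this discussion, namely that χ²-information equals the sum of the PICs, can be checked numerically: the χ²-divergence between the joint pmf and the product of its marginals coincides with the sum of the squared non-trivial singular values of Q. A sketch with a hypothetical distribution:

```python
import numpy as np

# Hypothetical joint pmf of (X, Y).
P = np.array([[0.20, 0.10],
              [0.10, 0.30],
              [0.10, 0.20]])
p_x, p_y = P.sum(axis=1), P.sum(axis=0)

# chi^2-information as the chi^2-divergence between the joint pmf
# and the product of its marginals.
prod = np.outer(p_x, p_y)
chi2_div = np.sum((P - prod)**2 / prod)

# The same quantity as the sum of the PICs (squared non-trivial
# singular values of Q = D_X^{-1/2} P D_Y^{-1/2}).
Q = np.diag(p_x**-0.5) @ P @ np.diag(p_y**-0.5)
chi2_pics = np.sum(np.linalg.svd(Q, compute_uv=False)[1:]**2)
```

The agreement follows from ‖Q‖_F² = 1 + Σ_k λ_k, since the trivial singular value of Q equals 1.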
3 Privacy-Utility Tradeoff
The results introduced in this section use the following definition.
Definition 4.
We define the privacy-utility function as:
where (S, X) has a fixed joint distribution p_{S,X}, and the maximum is taken over all privacy-assuring mappings p_{Y|X} satisfying the privacy constraint χ²(S;Y) ≤ ε.
Remark 1.
Note that p_{Y|X} ↦ χ²(S;Y) is a continuous mapping for fixed p_{S,X} and finite support sets 𝒮, 𝒳, 𝒴. Therefore, the feasible set is compact and the supremum in Definition 4 is indeed a maximum.
Lemma 1.
is a concave function. Furthermore, is a non-increasing mapping.
Proof.
The proof is given in the appendix. ∎
The privacy-utility function has an upper bound
(10) 
that follows immediately from the data-processing inequality:
(11) 
We next provide a sharp bound for the privacy-utility function that significantly improves on (10) by using properties of the PICs. We will use the following definition and two lemmas to derive this result.
Lemma 2.
Suppose . Then
(12) 
(13) 
where
Proof.
The proof is given in the appendix. ∎
Definition 5.
For , and , is defined as follows:
(14) 
where
(15) 
The quantity in (14) is the solution of a linear program which, in turn, can be written in the closed form given in the next lemma.
Lemma 3.
Suppose without loss of generality. If
where , then
(16) 
And
can achieve the maximum in the definition of .
We next give an upper bound and a lower bound for the privacy-utility function. These bounds are illustrated in Fig. 1.
Theorem 2.
Proof.
The lower bound for follows immediately from the concavity of .
We denote the singular value decomposition of and by and , respectively. Then , where and .
Let where . Suppose the diagonal elements of are . Then
(17) 
Suppose the th row of is , the th column of is and . By the definition of PICs, , for and for . Then for
Since the first column of and are both following the properties of PICs, then . Therefore, . If , then by Equation (17)
which implies
Thus,
∎
Remark 2.
The upper bound can also be proved by combining the trace inequality [26, Eq. (4)] with properties of the PICs.
If the value of is known, a better lower bound can be obtained as follows:
(18) 
When , then and . Following Definition 5 and noticing that all PICs of are 1, it can be shown that the upper bound and the lower bound for the privacy-utility function in Theorem 2 are both , which is equal to . Therefore, the upper and lower bounds given in Theorem 2 are sharp.
Corollary 1.
is a strictly increasing function for .
Proof.
First, is non-decreasing. To see this, denote by . Since is a concave function, it suffices to show that, for every , . For any , which implies . On the other hand, . Thus, .
Now suppose there exists , such that . Since is a concave and non-decreasing function, then for any , . In particular, . This contradicts the upper bound on the privacy-utility function in Theorem 2, since the upper bound implies that when . ∎
Definition 6.
.
By Corollary 1, is strictly increasing. Therefore,
(19) 
By Corollary 7 in [11], when , then (i.e. there exists a privacy-assuring mapping that allows the disclosure of a non-trivial amount of useful functions while guaranteeing perfect privacy). On the other hand, when , then . The following theorem shows that when , the upper bound of in Theorem 2 is achievable around zero, implying that the upper bound is sharp around 0. This theorem also provides a specific way to construct the random variable that achieves the upper bound in the high-privacy region.
Theorem 3.
Suppose and . Then there exists a such that , and .
Proof.
From Theorem 1 in [11], there exists such that , and .
Fix , and define the privacy-assuring mapping from to as follows:
(20) 
Since
for any
Therefore, , which implies that is feasible. Furthermore, because of .
Since
then
∎
4 A Convex Program for Computing Privacy-Assuring Mappings
In the previous section we studied χ²-based metrics for both privacy and utility. This metric captures the overall error (in a mean-squared sense) of estimating functions of the private and useful variables S and X, respectively. We used χ²-information to measure both privacy and utility, and derived bounds for the PUT curve. The upper bound was shown to be achievable in the high-privacy region in Theorem 3.
Next, we explore an alternative, finer-grained approach for measuring both privacy and utility based on the PICs (recall that χ²-information is the sum of all PICs). This approach has a practical motivation, since in practice there are often well-defined features (functions) of the data (realizations of a random variable) that should be hidden or disclosed. For example, a user might be comfortable revealing that his/her age is above a certain threshold, but not the age itself. Alternatively, a user may be willing to disclose that they prefer documentaries over action movies, but not exactly which documentary they like. More abstractly, we consider the case where certain known functions of a hidden variable should be revealed (utility), whereas others should be hidden (privacy). This is a finer-grained setting than the one used in the last section, since χ²-information captures the aggregate reconstruction error across all zero-mean, unit-variance functions.
We denote the set of functions to be hidden as
We denote the set of functions to be disclosed as
The goal is to find the privacy-assuring mapping, , such that and satisfies the following privacy and utility constraints.

Utility constraints: and .

Privacy constraints: .
In this section, we follow two steps, projection and optimization, to find this privacyassuring mapping.
4.1 Projection
As a first step, we project all private functions onto the observed variable and obtain a new set of functions:
The advantage of projection is twofold. First, it can significantly reduce the computational cost of solving the optimization program, since after projection all functions are cast in terms of the observed random variable. Second, the hidden variable is no longer needed after the projection – the optimization solver only needs observed data. Therefore, the party that solves the optimization does not need direct access to the private data, further safeguarding the sensitive information. Finally, the following theorem proves that privacy guarantees cast in terms of the projected functions still hold for the original functions.
Theorem 4.
For , and , we have
(21) 
and
(22) 
Proof.
Suppose without loss of generality.
Observe that
Since , then . Therefore,
where the last inequality follows from Jensen’s inequality:
∎
By Theorem 4, . Therefore, if the new set of functions satisfies the privacy constraints (i.e. ), then the original set of functions also satisfies the privacy constraints (i.e. ).
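A minimal numerical sketch of the projection step, assuming discrete supports and taking the projection to be the conditional expectation of each private function onto the observed variable (consistent with the Jensen's-inequality step in the proof of Theorem 4); the distribution and function below are hypothetical:

```python
import numpy as np

# Hypothetical joint pmf of hidden S (rows) and observed X (columns).
P_sx = np.array([[0.15, 0.05, 0.10],
                 [0.05, 0.25, 0.05],
                 [0.10, 0.05, 0.20]])
f_s = np.array([1.0, -2.0, 0.5])   # a private function f(S) to be hidden

p_s = P_sx.sum(axis=1)
p_x = P_sx.sum(axis=0)

# Projection onto the observed variable: g(x) = E[f(S) | X = x].
g_x = (f_s @ P_sx) / p_x
```

By the law of total expectation, g(X) has the same mean as f(S), and by Jensen's inequality E[g(X)²] ≤ E[f(S)²]; this second-moment contraction is what lets privacy constraints imposed on the projected functions carry over to the original ones.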
4.2 Optimization
After projection, both privacy and utility functions are based on the observed random variable. Next, we introduce a PIC-based convex optimization program to find the privacy-assuring mapping.
First, we construct given by such that
(23) 
(24) 
where
and
From (23), is a basis of , and the functions can be decomposed as
(25) 
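The decomposition in (25) can be illustrated numerically: under finite supports, the principal functions of the observed variable form an orthonormal basis with respect to the pmf-weighted inner product, so any function can be expanded in this basis and exactly reconstructed from its coefficients. A sketch with a hypothetical distribution:

```python
import numpy as np

# Hypothetical joint pmf; rows index the observed variable, columns the
# disclosed variable.
P = np.array([[0.20, 0.10],
              [0.10, 0.30],
              [0.10, 0.20]])
p_x, p_y = P.sum(axis=1), P.sum(axis=0)

Q = np.diag(p_x**-0.5) @ P @ np.diag(p_y**-0.5)
U, s, Vt = np.linalg.svd(Q)
Psi = np.diag(p_x**-0.5) @ U   # principal functions as columns

# Decompose an arbitrary function f in this basis and reconstruct it.
f = np.array([2.0, -1.0, 0.5])
c = Psi.T @ (p_x * f)          # coefficients under the p_x-weighted inner product
f_rec = Psi @ c                # exact reconstruction from the basis
```

Orthonormality here means Psiᵀ diag(p_x) Psi = I, which is what guarantees the expansion coefficients recover the function exactly.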
If is a feasible joint distribution matrix, then the following equations follow directly from Theorem 1: