Information Theory of Data Privacy
Abstract
By combining Shannon’s cryptography model with an assumption on a lower bound of the adversary’s uncertainty about the queried dataset, we develop a secure Bayesian inference-based privacy model and thereby answer, to some extent, Dwork et al.’s question [1]: “why Bayesian risk factors are the right measure for privacy loss”.
This model ensures that an adversary can obtain only a small amount of information about each individual from the model’s output whenever the adversary’s uncertainty about the queried dataset exceeds the lower bound. Importantly, the assumption on the lower bound almost always holds, especially for big datasets. Furthermore, the model is flexible enough to balance privacy and utility: four parameters characterize the assumption, which yields many ways to balance privacy and utility and to analyze the group privacy and composition privacy properties of the model.
Keywords:
perfect secrecy, differential privacy, adversary’s knowledge, information privacy, dependent information
1 Introduction
Data privacy protection [2, 3, 4] studies how to query a dataset while preserving the privacy of the individuals whose sensitive information it contains. The crux of this field is to find a suitable privacy protection model that provides a better tradeoff between privacy protection and data utility. The differential privacy model [5, 6] is currently the most important and popular privacy protection model.
Dwork [2] described differential privacy as follows: “differential privacy will ensure that the ability of an adversary to inflict harm (or good, for that matter) – of any sort, to any set of people – should be essentially the same, independent of whether any individual opts in to, or opts out of, the dataset.”
This description can be read as saying that differential privacy minimizes the increased risk to an individual’s privacy incurred by that individual joining (or leaving) the dataset.
This implies that differential privacy pays little attention to the increased risk to the individual’s privacy incurred by other individuals joining (or leaving) the dataset, which is unreasonable since other individuals’ data may also be related to the individual’s privacy.
Our tool for analyzing this influence is derived from Shannon’s perfect secrecy [9], whose computational-complexity relaxation is the well-known semantic security [10], a fundamental concept in cryptography. Specifically, perfect secrecy ensures that the outputs (or ciphertexts) of a cryptosystem contain no information about the inputs (or plaintexts), i.e., no information about the inputs can be extracted by any adversary, while semantic security implies that whatever information is revealed cannot be extracted by probabilistic polynomial time (PPT) adversaries [10, 11][12, p. 476]. To discuss the privacy problems more precisely, let us first review Shannon’s theory of cryptography.
In Shannon’s theory [9, 11, 10], a cryptography model/system is defined as a set of (probabilistic) transformations of the plaintext universe into the ciphertext universe. Each particular transformation of the set corresponds to enciphering with a particular key. The transformations are assumed to be reversible so that unique deciphering is possible when the key is known [10, 13]. For a plaintext and a secret key , let be the corresponding ciphertext. Consider as random variables, where the probability distributions of are the adversary’s probabilities for the choices in question, and represent his knowledge of the situation. Then the mutual information [14] or the max-mutual information , defined in Definition 3, will be a measure of the information about which the adversary obtains from . The perfect secrecy is defined as and the semantic security is defined as [10, 11].
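As a small illustration of perfect secrecy, the sketch below computes the mutual information between plaintext and ciphertext for a toy XOR cipher. The 2-bit plaintext space, the skewed prior, and the two key distributions are hypothetical choices made only for this example; the names `X` (plaintext) and `Y` (ciphertext) in the printed output are our own labels. With a uniform key the mutual information should vanish, while a biased key leaks information.

```python
import itertools
import math
from collections import defaultdict


def mutual_information(joint):
    """Mutual information I(X;Y) in bits from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def cipher_joint(plaintext_prior, key_prior, encrypt):
    """Joint distribution of (plaintext, ciphertext) for an independently drawn key."""
    joint = defaultdict(float)
    for (x, p_x), (k, p_k) in itertools.product(
        plaintext_prior.items(), key_prior.items()
    ):
        joint[(x, encrypt(x, k))] += p_x * p_k
    return dict(joint)


def xor(x, k):
    return x ^ k


# Hypothetical toy setting: 2-bit plaintexts with a skewed adversary prior.
plaintext_prior = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
uniform_key = {k: 0.25 for k in range(4)}       # one-time pad style key
biased_key = {0: 0.7, 1: 0.1, 2: 0.1, 3: 0.1}   # poorly generated key

mi_uniform = mutual_information(cipher_joint(plaintext_prior, uniform_key, xor))
mi_biased = mutual_information(cipher_joint(plaintext_prior, biased_key, xor))
print(f"uniform key: I(X;Y) = {mi_uniform:.6f} bits")
print(f"biased key:  I(X;Y) = {mi_biased:.6f} bits")
```

The uniform-key case is the one-time pad on a single 2-bit block; the biased-key case shows how the same transformation family stops being perfectly secret once the key distribution changes.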
We now borrow Shannon’s cryptography model above to construct data privacy models.
In a (data) privacy model/system there are individuals . A dataset is a (multi)set of records, where each is an assignment of . For a query [15], a privacy model is defined as a set of (probabilistic) transformations of the set of possible datasets into the set of possible query outputs . Each particular transformation of the set is called a (privacy) mechanism. Note that, unlike in cryptography models, a mechanism does not need to be reversible since there is no deciphering step in privacy models.
Consider as random variables, whose probability distributions are the adversary’s probabilities for the choices in question, and represent his knowledge of the situation. Then the max-mutual information will be the amount of information about the individual which the adversary obtains from . Following semantic security and perfect secrecy, the setting with would be a reasonable choice as a privacy concept. It should be mentioned that “perfect privacy” [16, 17], i.e., the setting , is not practical, since by the results in [6] it leads to poor data utility even under the assumption of PPT adversaries. For technical reasons, the formal definition of the privacy concept is deferred until Section 3.
Interestingly, we seem to pick up the semantic security that Dwork et al. claimed to be impractical for privacy problems [6, 2][18, Section 2.2]. We stress that Dwork [6] mainly proves that “perfect privacy”, i.e., the setting , is impractical due to poor data utility (even under the assumption of a PPT adversary), but rarely claims that is impractical. In this paper, we continue Dwork’s work [6] to discuss whether is suitable as a privacy concept and, more precisely, to what extent; that is, we employ Shannon’s theory to answer Dwork et al.’s question [1]: “why Bayesian risk factors are the right measure for privacy loss”. We also continue Dwork’s work [6] to discuss the tight upper bound of for the differential privacy output , which is clearly important but is not addressed by Dwork [6] or the related works [19, 20]. In fact, we have the following result.
Corollary 1 (Corollary of Proposition 4)
The mechanism satisfies for all if and only if satisfies
(1) 
Note that (1) is implied when satisfies differential privacy by the group privacy property of differential privacy in Lemma 1. Therefore, Corollary 1 implies that a differential privacy mechanism allows its output to satisfy , which discloses too much information about the individual once the number of individuals is large enough; this is our main motivation. For the differential privacy mechanism , one interesting point in Corollary 1 is that the in (1), which by the theory of differential privacy is intended to be the maximal amount of disclosed information, or in other words the privacy budget [1], for the group of individuals , instead becomes the maximal amount of disclosed information about the individual . We will show in Proposition 4 that this is because the other individuals’ data also contain information about the individual .
It should be emphasized that it is reasonable to accept as a minimal requirement for any secure privacy mechanism. The reason is the same as the reason that is a minimal requirement for secure cryptography models: a large must result in information disclosure about the plaintext , which has been borne out for more than 60 years.
Definition 1 (The Knowledge of an Adversary)
Let the random vector denote the uncertainty of an adversary about the queried dataset. Then or its probability distribution is called the knowledge of the adversary.
Note that, before this paper, there have been many Bayesian inference-based privacy models, such as [17, 21, 19, 20, 22]. These models share a common feature: they all restrict adversaries’ knowledge. Many results, such as those in [19, 22] and Proposition 4 of this paper, show that this restriction is unavoidable if better utility is to be achieved. Traditionally, the direct approach is to restrict adversaries to be PPT, as in cryptography. However, current studies in data privacy do not suggest this restriction, since most current works in data privacy are not based on it [18, 23, 24, 25]. On the other hand, the current works that restrict adversaries’ knowledge contain almost no discussion of which assumptions are reasonable [17, 21, 19, 20, 22]. The main obstacle to adopting these privacy models is that they restrict adversaries’ knowledge but cannot justify those restrictions. In this paper, our restriction on adversaries’ knowledge is given in Assumption 1.
Assumption 1
Let be a positive constant. Then, for any adversary’s knowledge , there must be , where is the entropy of .
We have the following evidence to support the reasonableness of this restriction.

The maximal entropy is, in general, much larger in privacy models than in cryptography models. For example, for the AES-256 encryption model [13], the adversary only needs to recover the 256-bit secret key in order to recover the information contained in the output and therefore it is reasonable to assume that can be very small or even zero since is at most 256 bits. However, for the Netflix Prize dataset [26] in data privacy, the adversary, in principle, needs to recover the whole dataset in order to recover the information contained in the output^{3} and therefore it is reasonable to assume that is relatively large, since the Netflix Prize dataset is large and then is at least larger than bits, which is huge compared to bits.^{4}
The long tail phenomenon^{5} implies that there is a great deal of “outlier data” in big datasets, which increases the uncertainty .
Some may doubt the assumption since there is much more background knowledge in data privacy protection than in cryptography. For example, for the Netflix Prize dataset [26], it is inevitable that open data, such as the IMDb dataset, exist as the adversary’s background knowledge. Our comment is that, when the dataset is large enough, as the Netflix dataset is, the background knowledge, such as the IMDb dataset, in general cannot overlap with a large part, say over 50%, of the secret dataset. In fact, only a very small part of the Netflix Prize dataset overlaps with the IMDb dataset. Therefore, the entropy is still large for big datasets despite the diversity of background knowledge.

Theoretically, a dataset can be completely recovered by querying it too many times, as noted in [27, 28][18, Chapter 8]; that is, theoretically, the entropy can be very small or even zero [9, p. 659]. However, if we restrict the number of queries^{6} and assume the dataset is big enough, we can ensure that is not too small.
Given the above evidence, it is reasonable to adopt Assumption 1 as a restriction on adversaries’ knowledge. Notice that Assumption 1 can realize the idea of “crowd-blending privacy” (but in a way different from [30, 31]), where each individual’s privacy is related to other individuals’ data; that is, if some other individuals’ data are kept private, then Assumption 1 holds, which in turn ensures that holds.
1.1 Contribution and Outline
This paper aims to provide some “mathematical underpinnings of formal privacy notions” [1] and tries to answer “why Bayesian risk factors are the right measure for privacy loss” [1] by employing Shannon’s cryptography model and Assumption 1. Our contributions focus on studying how to control and related quantities based on Assumption 1.

Four parameters are developed to characterize Assumption 1, which makes it easy to control and to discuss utility. This part is our main contribution; many bounds of and of utility are obtained.

We formalize the group privacy, i.e., the privacy of a group of individuals, and the composition privacy, i.e., the privacy problem when multiple results are output, of the information privacy model. Several results are proved.
The rest of this paper is organized as follows. Section 2 presents some preliminaries. Section 3 introduces the information privacy model and compares it with other privacy models. In Section 4 we discuss the tradeoffs between privacy and utility based on Assumption 1. Section 5 discusses how to preserve the privacy of a group of individuals. Section 6 discusses the privacy problem when multiple results are output. Section 7 discusses other related work. Section 8 concludes the paper.
2 Preliminaries
The notational conventions of this paper are summarized in Table 1, of which some are borrowed from information theory [14].
Notation  Description 

the norm of the real vector  
the number of individuals, the set {1,…, n}, respectively  
,  the subset of , the vector , respectively 
the random variable denoting the th individual  
the output random variable  
the mutual information of and  
the maxmutual information of and  
the entropy of  
the random vector  
the vector  
the record universe of the th individual  
the Cartesian set  
the set , where denotes an empty record  
the sequence of the individuals  
the universe of record sequences  
the universe of datasets  
a set containing the query function ’s codomain  
the universe of probability distribution over (or over )  
a subset of  
the probability distribution of is in  
the maximum number of dependent individuals  
the dependent extent among the individuals  
the parameter to measure the uncertainty of the adversary to each individual  
the parameter to measure the number of unknown individuals  
the subset of with dependent parameters  
the subset of with dependent parameters  
the subset of with parameters  
the set  
PPT  the abbreviation of “probabilistic polynomial time” 
the subset of that the PPT adversaries can evaluate 
2.1 The Setting
This section provides the mathematical setting of our model. Most of the material is notation-heavy, but the symbols are necessary to keep the presentation clear and short. Readers may skip this setting at a first reading and consult it later where necessary.
Let the random variables denote individuals. Let denote the record universe of . The probability distribution of denotes an adversary’s knowledge about the individual ’s record. A dataset is a collection (a multiset) of records , where denotes the assignment of . We distinguish a record sequence from the dataset it corresponds to: the former has an order among the records but the latter does not. The universe of record sequences is defined as . The universe of datasets is defined as . We remark that is not a multiset; identical datasets are merged into one dataset. Multiple record sequences may correspond to the same dataset. We call the dataset the dataset of the record sequence . For a dataset , let denote the set of all record sequences corresponding to the same dataset .
Set . Set , and . Let denote the probability distribution of . For each , set
(2) 
In this manner, can also be considered as a valued random variable with the probability distribution . Let denote the universe of probability distributions over (or over ). Note that, by letting all adversaries’ knowledge be derived from a subset of , we obtain a restriction on adversaries’ knowledge. If the probability distribution of the random variable is within , we say that is in , denoted as .
For a query function , let denote a set including all possible query results. Let denote the set of all the probability distributions on . A mechanism takes a record sequence as input and outputs a random variable valued in . Let be the random variable denoting the adversary’s observation about the output. In this manner, for and , we set
(3) 
In this paper, we abuse the notation as either denoting a probability distribution in or denoting a random variable following the probability distribution. Furthermore, for any , set for any two . Therefore, for a dataset , we set for .
In this paper, we append an empty record, denoted as , to each . In this setting, if , it means that the individual does not contribute a record to the dataset . Let . For a dataset , we use the histogram representation to denote the dataset , where the th entry of represents the number of elements in of type [32, 18, 33]. Two datasets are said to be neighbors (or neighboring datasets) of distance if . If , are said to be neighbors (or neighboring datasets). Two record sequences are said to be neighbors (or neighboring record sequences) if their corresponding datasets are neighbors.
For notational simplicity, in the rest of this paper, we assume and are both discrete.
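The histogram representation and the distinction between record sequences and datasets can be sketched as follows. This is a minimal illustration under our own assumptions: the record universe `UNIVERSE`, the stand-in `EMPTY` for the empty record, and the use of the ℓ1 norm on histograms (as in [18]) are choices made for the example, and the exact distance threshold that defines neighbors is not fixed here.

```python
from collections import Counter

EMPTY = None  # hypothetical stand-in for the empty record
UNIVERSE = ["a", "b", "c", EMPTY]  # hypothetical record universe


def histogram(record_sequence, universe=UNIVERSE):
    """Histogram of a record sequence: one count per record type, order discarded."""
    counts = Counter(record_sequence)
    return [counts[record] for record in universe]


def l1_distance(seq_1, seq_2, universe=UNIVERSE):
    """l1 distance between the histograms of the two underlying datasets."""
    h1, h2 = histogram(seq_1, universe), histogram(seq_2, universe)
    return sum(abs(a - b) for a, b in zip(h1, h2))


seq_a = ("a", "b", "b")
seq_b = ("b", "a", "b")    # same dataset as seq_a, different order
seq_c = ("a", "b", "c")    # one record substituted
seq_d = ("a", "b", EMPTY)  # one individual contributes no record

print(histogram(seq_a), histogram(seq_b))
print(l1_distance(seq_a, seq_b))
print(l1_distance(seq_a, seq_c))
print(l1_distance(seq_a, seq_d))
```

Because the histogram discards order, `seq_a` and `seq_b` map to the same dataset, which is the reason a mechanism defined on datasets must treat all record sequences of one dataset identically.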
2.2 Differential Privacy
Differential privacy characterizes the changes of outputs when one’s record in a dataset is changed. The latter change is captured by the notion of neighboring datasets.
Note that Definition 2 is the same as those in [34, 35], and is also equivalent to the definition of differential privacy in [5, 6, 18].
Differential privacy has a group privacy property, which ensures that the strength of the privacy guarantee drops linearly with the size of the group of individuals.
Lemma 1 (Group Privacy [18])
Let be an differentially private mechanism. Then
(5) 
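The linear degradation described by the group privacy property can be checked numerically on a small example. The sketch below is not the construction used in this paper: it assumes a counting query over binary records and the two-sided geometric (discrete Laplace) mechanism, both chosen by us for illustration, and it uses the number of differing positions between two record sequences as a stand-in for the dataset distance. The names `eps`, `n`, and `outputs` are hypothetical.

```python
import itertools
import math


def geometric_mechanism_pmf(y, true_count, eps):
    """P[Y = y] for the two-sided geometric mechanism centred on the true count."""
    alpha = math.exp(-eps)
    return (1 - alpha) / (1 + alpha) * alpha ** abs(y - true_count)


def worst_case_log_ratio(seq_a, seq_b, eps, outputs):
    """Largest log-ratio of output probabilities over the given output range."""
    count_a, count_b = sum(seq_a), sum(seq_b)
    return max(
        math.log(
            geometric_mechanism_pmf(y, count_a, eps)
            / geometric_mechanism_pmf(y, count_b, eps)
        )
        for y in outputs
    )


def hamming(seq_a, seq_b):
    """Number of positions in which two record sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


eps, n = 0.5, 4
outputs = range(-10, 15)  # finite window of outputs, enough to attain the maximum
sequences = list(itertools.product([0, 1], repeat=n))

worst = {k: 0.0 for k in range(n + 1)}
for a, b in itertools.product(sequences, repeat=2):
    k = hamming(a, b)
    worst[k] = max(worst[k], worst_case_log_ratio(a, b, eps, outputs))

for k in range(1, n + 1):
    print(f"group size {k}: worst log-ratio = {worst[k]:.4f}, k * eps = {k * eps:.4f}")
```

In this toy setting the worst-case log-ratio for sequences differing in k records matches k times the single-record budget, which is the behaviour the lemma describes.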
The composition privacy of differential privacy implies that the strength of the privacy guarantee drops in a controllable way when the number of outputs about a dataset increases.
Lemma 2 (Composition Privacy [18])
Let the mechanism satisfy differential privacy on for . Then the composition mechanism , defined as , , satisfies differential privacy on the Cartesian set .
2.3 Other Materials
Lemma 3
Let and let for . If , then is increasing. Otherwise, if , then is decreasing.
Proof
Note that the derivative of is , by which the claims are immediate.
Definition 3 (Max-Mutual Information [36])
The max-mutual information of the random variables is defined as
Lemma 4
There is .
Proof
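To make the two information measures concrete, the sketch below computes both on a small joint distribution. It assumes the pointwise form of max-information from [36], namely the maximum over outcome pairs of log p(x, y) / (p(x) p(y)); this reading of Definition 3, the function names, and the binary symmetric channel example are our assumptions. Under this reading the mutual information, being an average of the same log-ratio, cannot exceed the max-mutual information.

```python
import math
from collections import defaultdict


def marginals(joint):
    """Marginal distributions of X and Y from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return px, py


def mutual_information(joint):
    """I(X;Y) in nats: the expected log-ratio p(x, y) / (p(x) p(y))."""
    px, py = marginals(joint)
    return sum(
        p * math.log(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0
    )


def max_mutual_information(joint):
    """Assumed pointwise max-information in nats: the largest log-ratio."""
    px, py = marginals(joint)
    return max(
        math.log(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0
    )


# Hypothetical example: a uniform bit sent through a channel that flips it w.p. 0.1.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

mi = mutual_information(joint)
max_mi = max_mutual_information(joint)
print(f"I(X;Y)     = {mi:.4f} nats")
print(f"I_inf(X;Y) = {max_mi:.4f} nats")
```

The gap between the two values reflects that the average-case measure can be small even when some particular output is quite revealing, which is why a worst-case quantity is the natural one to bound.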
3 The Model of Information Privacy
We now give the formal definition of the privacy concept. As discussed in Section 1, our privacy concept is to limit the amount of information about each individual that the adversary obtains from the output , i.e., to control the value of the max-mutual information or the mutual information . For mathematical convenience, we only consider how to control the quantity in this paper. We formalize the discussion in Section 1 as the following definition.
Definition 4 (Information Privacy)
Let . Let be a mechanism and let be the output random variable. The mechanism satisfies information privacy with respect to if for any and there is
(8) 
Note that the inequality (8) is equivalent to
(9) 
since
(10)  
The parameter in the above definition is used to model adversaries’ knowledge. In this paper, we mainly set to be
(11) 
which will be discussed in Section 4.
In information theory, the relative entropy is used to measure the distance between two probability distributions and the mutual information is used to measure the amount of information that one random variable contains about another random variable [14]. The relative entropy of and , denoted as , and the mutual information of and , i.e., , have the following results.
Proposition 1
Let the mechanism satisfy information privacy with respect to and let be its output random variable. We have
(12) 
for .
Proof
The proof is direct and is omitted here.
Note that, as in Definition 4, we can also define the relative entropy privacy, i.e., , and the mutual information privacy, i.e., . Furthermore, the paper [37] proposes a privacy concept called inferential privacy, i.e.,
(13) 
Note also that the inequalities (1) and (2) in [19] are essentially equivalent to the inequality (13). We now discuss the relations among the above three privacy concepts and the information privacy. There are the following results.
Proposition 2
We have the following relation among the privacy concepts: inferential privacy $\Rightarrow$ information privacy $\Rightarrow$ relative entropy privacy $\Rightarrow$ mutual information privacy.
Proof
The claim is due to the inequality
The claim is due to Proposition 1. The claim is due to the equation
The claims are proved.
Proposition 2 shows that the four privacy concepts, inferential privacy, information privacy, relative entropy privacy and mutual information privacy, are in decreasing order of strength of privacy protection. One can choose any of the four as the privacy concept, depending on the level of privacy required.
Proposition 3 (Data-Processing Inequality/Post-Processing)
Assume the mechanism satisfies information privacy with respect to and let be its output random variable. Let and let . Then the composed mechanism satisfies mutual information privacy with respect to , where for .
Proof
Personalized information privacy can be defined directly, by analogy with personalized differential privacy [38].
Definition 5 (Personalized Information Privacy)
The mechanism satisfies personalized information privacy with respect to if, for each and each , there is
(14) 
where .
3.1 Comment on Parameter Setting
In this section we consider how to set the parameter (or ) of the information privacy model. The setting of the parameter is deferred to Section 4.
It should be emphasized that the dataset universe (or the record sequence universe ) should be set carefully, since itself may leak individuals’ privacy and enable tracing attacks [28]. To see this clearly, we consider the query function as an example, which can be regarded as an abstraction of the data publishing function [39, 40, 41]. Note that the codomain of is . Both the differential privacy model and the information privacy model employ randomized techniques to protect privacy: when the real dataset is , in order to preserve privacy, a privacy mechanism first samples a dataset (according to a probability distribution) and then outputs as the final query result of . Equivalently, the privacy mechanism directly samples a value from the codomain of as the final query result. The major difference between the two models is that the probability distributions used to sample or are different. Assume that the individual ’s record universe shares no record with any other individual’s record universe. Then finding a record within an output dataset would strongly indicate the participation of the individual , which is clearly a successful tracing attack. Therefore, we should set appropriate and therefore appropriate for such that the set itself does not leak the participation of an individual. The privacy-oriented (but less utility-oriented) setting is to set for all as in [18, p. 227].
3.2 Utility Measure
For the query and the dataset universe , let the set . We equip the set with a metric [15]; that is, the pair is a metric space. Note that the output of the mechanism is a probabilistic approximation of . Therefore, for two datasets , if is large, then the distance between the outputs being small would result in poor data utility.
In most parts of this paper, the above method can be used to measure the utility of mechanisms. However, for completeness, we present the formal definition of the utility measure in (15). Note that the two utility measures are consistent, since the former results in being more similar to the uniform probability distribution on , which clearly raises the distortion of .
Let denote the occurring probability distribution of the individuals . Then the utility of the mechanism is measured by the expected value of the distortion , i.e.,
(15)  
We stress that is different from the probability distribution defined in (2): the former gives the actual occurrence probabilities of datasets, whereas the latter denotes the adversary’s knowledge about datasets.
A third quantity for measuring utility is or , which measures the information about the individuals contained in the output . Note that a large implies better utility of the mechanism, since the output contains more information about for mining or learning algorithms to extract.
3.3 Some Related Works
One motivation of this paper is to address the weakness of the differential privacy model [5, 6] shown in Corollary 1, which implies that the differential privacy model allows to be very large. Corollary 2, which also appears in [19, 22], shows that the differential privacy model is equivalent to the information privacy model with respect to . Note that the setting is obviously less reasonable than the setting (11). Therefore, the information privacy model with respect to (11) is more reasonable than the differential privacy model.
As noted in Section 1, the models in [17, 21, 19, 20, 22] and the information privacy model are all Bayesian inference-based models that restrict adversaries’ knowledge; that is, they all employ a subset of , like in this paper, to model adversaries’ knowledge. The advantage of these models and restrictions is clear: they are powerful both for modeling privacy problems and for balancing privacy and utility. However, the disadvantage is also significant: the restrictions seem unreasonable, since there are many examples where making such a restriction may quickly lead to a disastrous breach of privacy. We imagine that most readers’ first impressions of the privacy models in [17, 21, 19, 20, 22] are similar to ours: compared with the conciseness of the differential privacy model, these models set too many kinds of ’s, none of which seems reasonable, which makes the models hard to adopt. However, the rigorous analysis of privacy problems using Shannon’s cryptography theory in Section 1 led us to revisit these models, which resulted in the introduction of the parameter into the information privacy model. Of course, we also face the problem of how to find a reasonable . Assumption 1 is our solution, and the evidence in Section 1 shows that it is reasonable, especially for big datasets. In the following sections we present our results based on Assumption 1.
Furthermore, the papers [17, 19, 42, 43] discuss the impact of previously released data or query results, called constraints, on the privacy guarantee. The information privacy model treats these constraints through Assumption 1; that is, the constraints can be summarized as the adversary’s knowledge about the queried dataset, and if they do not push the adversary’s knowledge out of the set in (11), then we can ensure that the adversary obtains only a small amount of information about each individual. Note that this treatment of the constraints is similar to that of the semantic security model in cryptography.
The papers [44, 45, 34] employ either or to define privacy concepts. We stress that both of these inequalities result in poor data utility. The reason is that or is precisely the amount of information about contained in that the data consumer needs to mine, and the data consumer is also a special kind of adversary. In contrast, the inequalities , only restrict the information disclosure of each individual , which, in general, allows the quantities or to be large, so long as the number of individuals is large enough.
4 Privacy-Utility Tradeoff for Big Dataset
In this section, we consider how to set the parameter in Definition 4 in order to give appropriate privacy-utility tradeoffs, where denotes adversaries’ knowledge. As noted in Section 3, the setting in (11) is a reasonable restriction on adversaries’ knowledge. Before discussing the information privacy model based on this setting, we first discuss why adversaries’ knowledge must be restricted. The following results show that the setting will result in poor utility.
Proposition 4
The following three conditions are equivalent:

with respect to .

with respect to .

.
Proof
The equivalence between claim 1 and claim 3 is due to
(16)  
with equality when the record sequence satisfies and , where is just the record sequence satisfying the above maximality.
The equivalence between claim 2 and claim 3 is due to
(17) 
with equality when the record sequence satisfies , where is just the record sequence satisfying the above maximality.
The proof is complete.
Claim 2 of Proposition 4 shows that information privacy with respect to results in poor utility since but denotes the information about contained in , which is exactly the information that utility requires. Note also that claim 3 of Proposition 4 shows that, under information privacy with respect to , two datasets even at distance must have similar outputs, which obviously results in poor utility. Therefore, adversaries’ knowledge must be restricted for better utility.
Now we discuss how to control the quantity with respect to (11). We first formalize the reasons which make Assumption 1 hold. Note that
$H(X) \le \sum_{i=1}^{n} H(X_i) \le \sum_{i=1}^{n} \log |\mathcal{X}_i|$ (18)
with equality in the first inequality if and only if are independent, and with equality in the second inequality if and only if each has the uniform distribution over [14]. Therefore, there are mainly two reasons which make :

The random variables are not strongly dependent.

There exist some ’s with .
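The two reasons above correspond to the two gaps in the standard entropy chain: the joint entropy is at most the sum of the marginal entropies, which is at most the sum of the logarithms of the universe sizes. The sketch below checks both gaps on a hypothetical three-individual example of our own choosing, in which the second record is strongly dependent on the first and the third record is non-uniform.

```python
import itertools
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, index):
    """Marginal distribution of the coordinate at position `index`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[index]] += p
    return dict(out)


# Hypothetical adversary knowledge over three binary records:
#   X1 is uniform, X2 copies X1 with probability 0.9, X3 is independent but skewed.
joint = {}
for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    p1 = 0.5
    p2 = 0.9 if x2 == x1 else 0.1
    p3 = 0.8 if x3 == 0 else 0.2
    joint[(x1, x2, x3)] = p1 * p2 * p3

joint_entropy = entropy(joint)
sum_marginal_entropies = sum(entropy(marginal(joint, i)) for i in range(3))
log_universe_sizes = 3 * math.log2(2)

print(f"H(X)              = {joint_entropy:.4f} bits")
print(f"sum of H(X_i)     = {sum_marginal_entropies:.4f} bits")
print(f"sum of log|X_i|   = {log_universe_sizes:.4f} bits")
```

The first gap is caused entirely by the dependence between the first two records, and the second gap entirely by the skew of the third, which mirrors the two reasons listed above.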
Traditionally, we can use the mutual information and the entropy to characterize the above two reasons, respectively. However, for mathematical convenience, we develop four parameters to characterize them:

Use the parameter to denote the maximal number of dependent random variables in .

Use the parameter to denote the maximal extent of dependence among the random variables .

Use the parameter to denote the maximal number of random variables in with .

Use the parameter to characterize the minimal entropies of the above random variables.
Subsequently, also for mathematical convenience, we will approximate the set in (11) with a set , which is parameterized by the four parameters and will be defined later; that is,
(19) 
In the following parts of this section, we will explicitly define and then and discuss how to control based on them.
4.1 The Parameter
Recall that the parameter denotes the maximal number of dependent random variables in , which is mainly motivated by the group privacy method in [46] for dealing with the dependence problem and by the need to explain differential privacy using the information privacy model. Let be the largest subset of such that, for any , the maximal number of dependent random variables within is at most , where . Formally, let
(20) 
where with , each , each for . Note that, in this manner, equals and denotes the universe of probability distributions of the independent random variables . We have the following result.
Theorem 1
The mechanism satisfies information privacy with respect to if and only if satisfies
(21) 
Proof
Let , where denote the random variables in which are independent to and dependent to , respectively. Let and denote one assignment and the record universe of , respectively.
“$\Leftarrow$” Assume the inequality (21) holds. For one , set , . We have
(22)  