Impact of Prior Knowledge and Data Correlation on Privacy Leakage: A Unified Analysis

Yanan Li, Xuebin Ren, Shusen Yang, and Xinyu Yang This work is supported in part by National Natural Science Foundation of China under Grants 61572398, 61772410, 61802298 and U1811461; the Fundamental Research Funds for the Central Universities under Grant xjj2018237; China Postdoctoral Science Foundation under Grant 2017M623177; the China 1000 Young Talents Program; and the Young Talent Support Plan of Xi’an Jiaotong University. (corresponding author: Shusen Yang). Y. Li is with National Engineering Laboratory for Big Data Analytics (NEL-BDA), Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China, and also with the School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China (e-mail: gogll2@stu.xjtu.edu.cn). X. Ren, and X. Yang are with the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China, and also with National Engineering Laboratory for Big Data Analytics (NEL-BDA), Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China, (e-mails: {xuebinren, yxyphd}@mail.xjtu.edu.cn). S. Yang is with National Engineering Laboratory for Big Data Analytics (NEL-BDA), Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China, and also with the Ministry of Education Key Lab for Intelligent Networks and Network Security (MOE KLINNS Lab), Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China (e-mail: shusenyang@mail.xjtu.edu.cn).
Abstract

It has been widely understood that differential privacy (DP) can guarantee rigorous privacy against adversaries with arbitrary prior knowledge. However, recent studies demonstrate that this may not be true for correlated data, and indicate that three factors could influence privacy leakage: the data correlation pattern, the prior knowledge of adversaries, and the sensitivity of the query function. This poses a fundamental problem: what is the mathematical relationship between the three factors and privacy leakage? In this paper, we present a unified analysis of this problem. A new privacy definition, named prior differential privacy (PDP), is proposed to evaluate privacy leakage considering the exact prior knowledge possessed by the adversary. We use two models, the weighted hierarchical graph (WHG) and the multivariate Gaussian model, to analyze discrete and continuous data, respectively. We demonstrate that positive, negative, and hybrid correlations have distinct impacts on privacy leakage. Considering general correlations, a closed-form expression of privacy leakage is derived for continuous data, and a chain rule is presented for discrete data. Our results are valid for general linear queries, including count, sum, mean, and histogram. Numerical experiments are presented to verify our theoretical analysis.

Index Terms: privacy leakage, correlated data, prior knowledge.

I Introduction

Leakage of private information could lead to serious consequences (e.g., threats to financial security and personal safety), and privacy protection has been extensively studied for several decades [1, 2]. In today’s big data era, privacy issues have been attracting increasing attention from both society and academia [3, 4, 5, 6]. Differential privacy (DP) [7, 8, 9] has become the de facto standard privacy definition because it provides rigorous, mathematically provable privacy guarantees.

In practice, adversaries may be able to acquire prior knowledge (i.e., partial data records), due to database attacks [10], privacy incidents [11], and obligations to release [12]. It is commonly believed that differentially private algorithms are invulnerable to adversaries with arbitrary prior knowledge because any given privacy level can be guaranteed, even when the adversary has knowledge of all data records except certain ones (i.e., the adversary with the strongest prior knowledge). However, this is true only if all data records are independent. It has been shown that the adversary’s prior knowledge can have significant impacts on privacy leakage when data records are correlated [13, 14].

The following example demonstrates how privacy leakage can be affected by correlations and the adversaries’ prior knowledge.


Fig. 1: Illustration of Example 1: an adversary attempts to infer the information of based on the joint distribution of database , the published result , and his prior knowledge about .

Fig. 2: Illustration of Example 1: the inference results of two adversaries; the weak adversary knows nothing about , while the strong adversary knows . Considering five correlations of and , the adversaries infer different information from the output result when the correlation is perfectly positive, perfectly negative, or absent (independent). The problem is how to analyze the general impacts of the correlation and prior knowledge on privacy leakage.

Example 1. Fig. 1 shows a scenario in which an adversary attempts to infer some sensitive information about a database. As shown, the database, consisting of two attributes and , publishes noisy statistics (perturbed via the Laplace mechanism of differential privacy) for privacy-preserving data mining. The adversary may acquire some prior knowledge about the database, i.e., the exact value of , and learn the data correlations from public knowledge (e.g., the Internet). After observing the noisy statistics , the adversary tries to infer the private information of based on all available information. Assume a noisy statistic , the prior knowledge , and the adversary’s first impression about is before inference. The privacy information gain obtained by the adversary in the inference process is summarized in Fig. 2. We use the following three special cases to show the impacts of the correlations and prior knowledge on privacy leakage.

  1. Case 1 (Positive Correlation): and are perfectly positively correlated with coefficient , i.e., . Without prior knowledge, the adversary will infer from the observation with high confidence according to the characteristics of the Laplace mechanism. Combined with the correlation , he will infer with high confidence. With the prior knowledge, e.g., , the adversary can ascertain that from the correlation .

  2. Case 2 (Negative Correlation): and are perfectly negatively correlated with coefficient , i.e., . Without prior knowledge, the adversary can infer no additional information about through due to the negative correlation. However, with the prior knowledge , the adversary can claim that . In addition, provides no additional information.

  3. Case 3 (No Correlation): and are independent. Without prior knowledge, the adversary can infer that with relatively higher probability than from the observation . However, with the additional prior knowledge of , the adversary obtains no more confidence about because there is no correlation between and , i.e., a stronger adversary with extra prior knowledge achieves no privacy gain compared with a weaker adversary.

The above special correlation cases show that an adversary with certain prior knowledge can obtain different privacy gains under different types of correlations. For general correlation cases, i.e., when correlations are weakly positive or weakly negative (cases with red backgrounds in Fig. 2), the adversary can also infer additional information through the published results. Meanwhile, when correlations are perfectly positive or negative, adversaries with different prior knowledge can also gain different privacy information.

As demonstrated in the above examples, prior knowledge can be utilized by adversaries to infer sensitive information, leading to serious threats to various privacy preserving scenarios, such as data publishing [15, 16, 17, 18], continuous data release [19, 20, 21, 22], location based services [23, 24, 25], and social networks [26, 27]. To achieve efficient privacy protection for correlated data, it is essential to conduct rigorous theoretical studies to understand the analytical relationship between prior knowledge and privacy leakage, which is the main goal of this paper.

There have been several research efforts devoted to this fundamental problem. The sequential composition theorem [7] of DP states that privacy leakage increases linearly with the number of correlated records if the correlated data are simply treated as a whole. However, this does not exploit the correlation sufficiently and leads to low utility for weakly correlated data. Therefore, many works [16, 19, 28, 25, 29] have focused on exploiting correlations to achieve high utility without sacrificing the privacy guarantee. However, these works do not consider adversaries with different prior knowledge, which has significant impacts on privacy leakage. Specifically, it has been demonstrated that without assumptions on the adversaries’ prior knowledge, no privacy guarantee can be achieved [13, 30]. To measure the impacts of prior knowledge, Pufferfish privacy [12] and Blowfish privacy [31] formally model prior knowledge in their mathematical privacy definitions. However, neither work [12, 31] provides an analytical characterization of the impacts of correlation and prior knowledge on privacy leakage.

The state-of-the-art research, Bayesian differential privacy (BDP) [32], explicitly describes the relationship between privacy leakage and prior knowledge for a special case, i.e., when data are positively correlated. However, different types of correlations mean that the maximal influence on the query result caused by one tuple, i.e., the sensitivity, is different, which in turn leads to different privacy leakage. Therefore, it is necessary to discuss privacy leakage under all types of correlations, with correlation coefficients ranging from -1 to 1 (including negative, independent, and positive correlations). As BDP is based on a Laplacian matrix that can only model positive correlations for sum queries, the analytical method and conclusions in [32] cannot be generalized to negative correlations or hybrid correlations (i.e., where positive and negative correlations coexist).

In summary, the analytical relationship between prior knowledge and privacy leakage under general correlations remains unclear. To address this problem, this paper presents the first unified analysis that considers positive, negative, and hybrid data correlations. Our contributions are as follows:

  1. We propose the definition of prior differential privacy (PDP) to measure the privacy leakage caused by an adversary with arbitrary prior knowledge under general correlations. Based on PDP, we present a unified formulation (Theorem 2) to measure, and a discussion (Theorem 3) of, the impact of varied prior knowledge and data correlations on privacy leakage. Both the formulation and the discussion help us better understand the impact of prior knowledge and data correlation on privacy leakage.

  2. We analyze privacy leakage for both discrete and continuous data. For discrete data, we propose a graph model to represent the structure of the adversaries’ prior knowledge, and a chain rule (Theorem 5) to compute the privacy leakage. For continuous data, instead of a Markov random field, we adopt the multivariate Gaussian model to represent general data correlations and derive a closed-form expression for the privacy leakage (Theorem 6). Our analytic method is based on the theory of Bayesian inference. The analytical results can guide us in designing more efficient mechanisms with better utility-privacy tradeoffs.

  3. We demonstrate that the analytic results can be applied to general linear queries, including count, sum, mean, and histogram. Extensive numerical simulation results verify our theoretical analysis.

The remainder of this paper is organized as follows. Section II introduces the related work. Section III introduces the notations and presents some preliminary knowledge. In Section IV, a new definition, PDP, is proposed to analyze the impacts of prior knowledge, and we illustrate that three factors can impact privacy leakage. Section V and Section VI present the theoretical analysis of privacy leakage for discrete data and continuous data, respectively. Numerical experiments are presented in Section VII, and we conclude this paper in Section VIII.

II Related Work

II-A Data Correlation

Many studies [13, 28, 25, 29] have demonstrated that DP may not guarantee its expected privacy when data are correlated. There are two plausible solutions for protecting the privacy of correlated data records. One is to achieve DP on each data record independently. However, the composition theorem [7] of DP demonstrates that the privacy guarantee degrades with the number of correlated records. The other is to treat the data records as a whole [33, 27, 34]. However, when the number of records is large or the correlation is weak, the utility will still be low.

Therefore, it is crucial to accurately measure the data correlations to achieve more efficient privacy protection. Considerable work has been done from different perspectives. For general correlations, some works replace the global sensitivity with new correlation-based parameters, such as correlated sensitivity [35] and correlated degree [36]. For example, in [35], a correlation coefficient matrix was utilized to describe the correlation of a series, and the correlation coefficient was used as the weight to compute the global sensitivity. By utilizing inter- and intra-coupling, [36] proposed behavior functions to model the degree of correlation. For temporal correlations, most of the research has focused on saving the privacy budget consumption in time-series data [37, 19, 24, 22]. For example, Dwork [37] proposed a cascade buffer counter algorithm to adaptively update the output result on a data stream. Fan [19] adopted a PID controller-based sampling strategy to adaptively inject Laplace noise into time-series data to improve the utility. For spatial correlations, the main idea is to group and perturb the statistics over correlated regions to avoid noise overdose [23, 38]. As a typical example, Wang [23] proposed dynamically grouping the sparse regions with similar trends and adding the same noise to reduce errors. In addition, for attribute correlations in multiattribute datasets, the fundamental idea is to reduce the dimensionality by identifying the attribute correlations [39, 40]. For example, Zhang et al. [39] constructed a Bayesian network to model the attribute correlations in high-dimensional data and then synthesized a privacy-preserving dataset in an ad hoc way. However, all these works assume adversaries with fixed prior knowledge and thus may not achieve the optimal tradeoff against adversaries with varying prior knowledge. In this paper, we consider both data correlations and flexible prior knowledge.

II-B Prior Knowledge

Prior knowledge can influence privacy leakage when the data are correlated [28, 32, 41], which has been considered in different research directions in terms of privacy definitions and the design of privacy-preserving mechanisms. For example, the Pufferfish framework [12], aiming to help domain experts customize privacy definitions, theoretically has the potential to include all kinds of adversaries. The subsequent work on Blowfish privacy [31] developed mechanisms that permit more utility by specifying secrets about individuals and constraints about the data. In [42], a Wasserstein mechanism was proposed to fulfill the Pufferfish framework. In addition, [41] studied the privacy leakage caused by the weakest adversary and proposed the identity differential privacy (IDP) model. [43] exploited the structural characteristics of databases and the prior knowledge of domain experts to improve utility. However, none of these works formulates a theoretical analysis of the relationship between prior knowledge and privacy leakage. In some research [44, 45], privacy was guaranteed by bounding the difference between the adversary's prior and posterior knowledge. However, in these works, the adversaries’ prior knowledge was limited to the probability distribution of the database, and the possibility that partial data records may be compromised by specific adversaries was not considered. Instead, [32, 28] separated the adversary’s specific prior knowledge of partial tuples from the public knowledge of data correlations, which is derived from data distributions. Based on that, Yang et al. [32] adopted a Gaussian correlation model to study the impact of prior knowledge and demonstrated that the weakest adversary could cause the highest privacy leakage. Similar conclusions can be found in [28], which further identifies the maximally correlated group of data tuples to improve the utility. Nonetheless, the limitation is that their Laplacian-matrix-based Markov random field model can only be applied to analyzing positive correlations on sum queries for continuous data or binary discrete data.

In this paper, we formally derive a formulation to present a unified analysis of the impact of data correlation and prior knowledge on privacy leakage, considering general linear queries on both discrete and continuous data.

III Preliminaries

We describe the notations and concepts in Subsection III-A and introduce some knowledge of DP that will be used in our analysis in Subsection III-B.

III-A Notations

A database with tuples (attributes in a table or nodes in a graph), denoted by the set of indices , aims to release the result of a certain query function on an instance of the database, . It should be noted that, in accordance with [32, 29, 28], we use the term “tuple” to denote an attribute rather than a record in a database. To protect the privacy of all tuples of an instance, the database returns a noisy answer obtained by adding random noise drawn from a distribution. Hence, all possible outputs constitute a probability distribution Pr, or equivalently a conditional distribution Pr. We use a set to capture the adversary’s beliefs about the data correlation. We do not guarantee privacy against adversaries outside of , because no privacy guarantee is feasible under arbitrary distributions [11]. The main notations are listed in Table I.

notations descriptions
A database instance .
The indices set of unknown/known tuples.
The instances of unknown/known tuples.
The sum of instance , and .
Two different values of tuple .
The database with eliminated.
The database with replaced with .
An adversary with prior knowledge to attack .
The privacy leakage caused by the adversary .
The random request generated by .
A randomized mechanism over .
All possible distributions of .
The local sensitivity of a query function on tuple .
The global sensitivity of a query function on .
TABLE I: Notations and meanings

III-A1 Adversary and Prior Knowledge

We denote as an adversary who attempts to infer the information of tuple , under the assumption that he knows the values of . We call the attack object and the prior knowledge, , where . Let denote the indices set of unknown tuples; then and the dataset . An adversary is called the strongest adversary when and the weakest adversary when . is called an ancestor of if is a subset of and differs by only one tuple, i.e., . More tuples in mean that the adversary has stronger prior knowledge.

III-A2 Correlation

To measure data correlations, we adopt the Pearson correlation coefficient, which can identify linear correlations. More importantly, it can be used to distinguish positive correlations and negative correlations. In joint distribution , let denote the correlation coefficient of and under the condition . In this paper, plays an important role in the analysis of how prior knowledge affects privacy leakage.
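As a small illustration (not the paper's own code), the following Python sketch computes the Pearson coefficient from a discrete joint distribution table; applied to the positive-correlation distribution of Table II(a) in Section IV it yields 0.2, and the conditional coefficient is obtained by first conditioning the table on the known values.

import numpy as np

def pearson_from_joint(joint):
    # Pearson correlation of two tuples from a discrete joint table {(xi, xj): probability}.
    # For the conditional coefficient, pass a table already conditioned on the known values.
    xi = np.array([k[0] for k in joint], dtype=float)
    xj = np.array([k[1] for k in joint], dtype=float)
    p = np.array(list(joint.values()), dtype=float)
    p /= p.sum()
    mi, mj = p @ xi, p @ xj
    cov = p @ ((xi - mi) * (xj - mj))
    return cov / np.sqrt((p @ (xi - mi) ** 2) * (p @ (xj - mj) ** 2))

# Table II(a) (Section IV): positive correlation, coefficient 0.2
print(pearson_from_joint({(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}))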

III-A3 Linear Query

A linear query function can be represented as , where are correlated with Pearson correlation coefficient . The linear query function can be transformed into a sum query on a new database by letting as , where . Then, the correlation coefficient of and should be . Combined with our new privacy definition PDP (to be discussed in Subsection IV-A), our models can deal with general correlations; therefore, we focus our analysis on the sum query without loss of generality, and the conclusions can be straightforwardly extended to general linear queries.
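As a brief sketch (the weights and values below are illustrative), the transformation simply rewrites a weighted linear query as a sum query over rescaled tuples; rescaling a tuple by a weight leaves the magnitude of each pairwise Pearson coefficient unchanged and flips its sign exactly when the two weights have opposite signs.

def linear_to_sum(x, weights):
    # Rewrite the linear query f(x) = sum_i w_i * x_i as a plain sum query
    # over the rescaled tuples y_i = w_i * x_i.
    y = [w * xi for w, xi in zip(weights, x)]
    return y, sum(y)

# corr(w_i * X_i, w_j * X_j) = sign(w_i * w_j) * corr(X_i, X_j)
y, answer = linear_to_sum([3.0, 1.0, 4.0], weights=[2.0, -1.0, 0.5])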

III-B Differential Privacy

Definition 1.

(Differential Privacy [9]). A randomized mechanism $\mathcal{A}$ satisfies $\epsilon$-differential privacy ($\epsilon$-DP) if, for any pair of neighboring datasets $D$ and $D'$ and any output $O$, the differential value

$$\left|\ln\frac{\Pr[\mathcal{A}(D)=O]}{\Pr[\mathcal{A}(D')=O]}\right| \leq \epsilon. \qquad (1)$$

Here, $\epsilon$ is the distinguishable bound over all outputs on the neighboring datasets $D$ and $D'$, where $D'$ is the database $D$ with one tuple $x_i$ replaced with a different value $x_i'$. A larger $\epsilon$ corresponds to easier distinguishability of $D$ and $D'$, which means more privacy leakage.

For numerical data, a Laplace mechanism [9] can be used to achieve $\epsilon$-DP by adding carefully calibrated noise to the query results. In particular, we draw noise from the Laplace distribution with probability density function

$$p(z) = \frac{1}{2\lambda}\exp\left(-\frac{|z|}{\lambda}\right),$$

in which $\lambda = \Delta f / \epsilon$. Here, $\Delta f = \max_i \Delta f_i$ is the global sensitivity of the query $f$, and $\Delta f_i$ is the local sensitivity of tuple $x_i$. Since the mechanism outputs $\mathcal{A}(D) = f(D) + z$, the probability density function of the output can be represented as

$$\Pr[\mathcal{A}(D) = O] = \frac{1}{2\lambda}\exp\left(-\frac{|O - f(D)|}{\lambda}\right).$$
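As a minimal illustration (not the paper's own code), the following Python sketch draws Laplace noise with scale $\lambda = \Delta f / \epsilon$ for a sum query; the domain bound used for the sensitivity is an assumed toy value.

import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Laplace mechanism: add Laplace(0, lam) noise with lam = sensitivity / epsilon
    lam = sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=lam)

# Sum query over a toy instance whose tuples are assumed to lie in [0, 5],
# so changing one tuple moves the sum by at most 5 (global sensitivity).
x = [3.0, 1.0, 4.0]
noisy_sum = laplace_mechanism(sum(x), sensitivity=5.0, epsilon=1.0)
print(noisy_sum)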

IV Prior Differential Privacy

To compute the privacy leakage when considering adversaries with different prior knowledge and databases with different joint distributions, we propose a new definition in Subsection IV-A. Furthermore, we illustrate that three factors can affect privacy leakage through three numerical examples in Subsection IV-B.

IV-A Prior Differential Privacy

To evaluate privacy leakage when adversaries have different prior knowledge, the definition of BDP was proposed in [32] based on the Bayesian inference method [11, 46, 14]. However, BDP can only be applied to positive correlations. To overcome this drawback, we propose a definition named prior differential privacy (PDP), which can be applied to databases with general correlations.

Definition 2.

(Prior Differential Privacy) Let be a database instance with tuples, and let be an adversary with attack object and prior knowledge . The joint distribution of is denoted as , where is a set of distributions. is a randomized perturbation mechanism, and is the output space. The privacy leakage of w.r.t. is the maximum of the following logarithmic ratio over all pairs of different values , , and any output .

(2)

We say satisfies -PDP if Eq. (2) holds for any , , . That is,

In Definition 2, is the privacy leakage caused by the adversary under the distribution , which represents the data correlation. is the maximal privacy leakage caused by all adversaries with public distribution . Compared with BDP, which only considers a single distribution, PDP considers a set of distributions . Thus, PDP is more reasonable because the set can reflect the cognitive diversity of the aggregator and the adversaries.

We now show that PDP is in accordance with Bayesian inference: Eq. (2) can be written as

(3)

Eq. (3) denotes the information gain achieved by the adversary after the adversary observes the published result . In addition, PDP bounds the maximal information gain obtained by all possible adversaries to be no larger than . The next theorem shows that prior knowledge impacts privacy leakage only when the database is correlated.

Theorem 1.

Prior knowledge has no impact on privacy leakage when tuples in the database are mutually independent.

Proof.

For an adversary and its ancestor , we get

The last equality holds when the data tuples are independent, i.e., . Similarly, if , then according to PDP, we have . Multiplying this fraction by and summing with respect to , we obtain on the basis of the definition of PDP. Therefore, different prior knowledge and lead to the same privacy leakage. ∎

Theorem 1 is also consistent with Eq. (3). If tuples are independent, then can be omitted in Eq. (3). Therefore, the prior knowledge has no impact on the privacy leakage when tuples are independent.

Remark 1. It is worth noting that DP and PDP are consistent in nature. Both reflect the maximal distinguishability between the distributions of the perturbed output calculated on two neighboring datasets. In this paper, neighboring datasets are obtained by modifying one record in the dataset. The difference between DP and PDP lies in the form of the neighboring datasets. In DP, the neighboring datasets are and . However, in PDP, the neighboring datasets are and . For any given and , we have

(4)

The last equality in Eq. (4) applies DP to datasets differing in at most +1 tuples. The inequality in Eq. (4) holds due to the fact that , if all parameters are nonnegative and and lie in the probability simplex. Taking the logarithm and the supremum of Eq. (4) over , we obtain . Therefore, DP provides an upper bound on the privacy leakage of PDP, i.e., we achieve a better trade-off between privacy and utility by adopting PDP than by adopting DP.
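For completeness, the elementary fact invoked for the inequality in Eq. (4) is the weighted mediant inequality; in assumed generic notation (the $\alpha_k$ and $\beta_l$ are the nonnegative conditional weights of the unknown tuples, and $p_k$, $q_l$ the corresponding output densities), it reads

$$\frac{\sum_k \alpha_k\, p_k}{\sum_l \beta_l\, q_l} \;\le\; \max_{k,\,l} \frac{p_k}{q_l}, \qquad \alpha_k, \beta_l \ge 0,\;\; \sum_k \alpha_k = \sum_l \beta_l = 1,\;\; p_k, q_l > 0,$$

since the numerator is at most $\max_k p_k$ and the denominator is at least $\min_l q_l$.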

IV-B Influence Factors

In this subsection, we demonstrate, through the numerical Examples 2 to 4, that three factors, the prior knowledge , the joint distribution , and the local sensitivity , impact privacy leakage.

      =0    =1
=0    0.3   0.2
=1    0.2   0.3
(a) Positive correlation

      =0    =1
=0    0.2   0.3
=1    0.3   0.2
(b) Negative correlation

      =0    =1
=0    0.5   0
=1    0     0.5
(c) Perfect correlation

      =0    =1
=0    0.5   0
=5    0     0.5
(d) Perfect correlation
TABLE II: Four Joint Distributions

As shown in Table II, there are four joint distributions of database . The first three distributions have the same domain but different correlations; the third and fourth distributions have the same correlation, but has a different domain. Considering a sum query , we set the Laplace mechanism scale to for simplicity. Denote as the privacy leakage caused by the adversary when the distribution of is Table II(a).

Example 2 (Prior Knowledge). Two adversaries and , attempt to infer the information or . knows the information of (e.g., ), and knows nothing about . Based on the definition of PDP, we calculate and . For and , we get

When knows nothing about , according to Eq. (2), we have

The exponential entries are derived from the Laplace mechanism and the given . Similarly, we can compute . Therefore,

(5)
(6)

Example 2 shows that prior knowledge has significantly different influences under different correlations. More importantly, it answers the two problems raised by Example 1. In addition, we note that the privacy leakage under DP is 2 if we simply regard the correlated tuples and as a whole. Therefore, we achieve stricter privacy protection than DP under the same noise mechanism. In other words, we can introduce less noise to obtain the same privacy level.

Example 3 (Correlation). An adversary attempts to infer the information of with no prior knowledge about . To show the impacts of the correlations, we modify in Tables II(a) and II(b) to obtain two distributions (a’) and (b’), in which and have stronger correlations. The computations of and are similar to those in Example 2. According to Eq. (2), we obtain , . Therefore,

(7)
(8)

Example 3 demonstrates that different correlations have significant influences on privacy leakage. In particular, Eq. (7) shows that the adversary can infer more information about through a stronger positive correlation, and Eq. (8) shows the opposite result when the correlation is negative.

Example 4 (Local Sensitivity). An adversary , with no prior knowledge of , attempts to infer . The difference between the distributions in Tables II(c) and II(d) is the domain of . Based on PDP and computations similar to those in Example 2, we have , and . Therefore,

For the sum query on , the local sensitivity of is determined by its own domain. In distribution Table II(c), . In distribution Table II(d), . Example 4 shows that the local sensitivity impacts privacy leakage, and a larger sensitivity ratio leads to higher privacy leakage.

Examples 2-4 demonstrate that three factors impact privacy leakage, and show how to compute the privacy leakage for a database composed of two tuples. In the following sections, we will extend the numerical results to analytical results for both discrete and continuous data.
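As a minimal, hedged sketch (not the paper's code), the following Python snippet evaluates the weak and strong adversaries' leakage for the two-tuple sum query under Tables II(a) and II(b), assuming the Laplace scale is 1 (consistent with the group-privacy leakage of 2 mentioned in Example 2) and approximating the supremum over outputs by a dense grid.

import math

def lap(x, lam=1.0):
    # Laplace(0, lam) density
    return math.exp(-abs(x) / lam) / (2 * lam)

def weak_leakage(joint, lam=1.0):
    # Leakage when the adversary attacks the first tuple knowing nothing about the second,
    # for the sum query: marginalize the unknown tuple via the conditional distribution.
    def cond(x1):
        z = joint[(x1, 0)] + joint[(x1, 1)]
        return {x2: joint[(x1, x2)] / z for x2 in (0, 1)}
    worst = 0.0
    for o in [k * 0.1 for k in range(-200, 230)]:          # dense grid approximating sup over outputs
        p0 = sum(cond(0)[x2] * lap(o - (0 + x2), lam) for x2 in (0, 1))
        p1 = sum(cond(1)[x2] * lap(o - (1 + x2), lam) for x2 in (0, 1))
        worst = max(worst, abs(math.log(p0 / p1)))
    return worst

positive = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}   # Table II(a)
negative = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.2}   # Table II(b)

strong = 1.0    # strong adversary: local sensitivity / lam, independent of the correlation
print(weak_leakage(positive), weak_leakage(negative), strong)

Under these assumed settings, the weak adversary's leakage is roughly 1.19 under the positive correlation of Table II(a) and roughly 0.81 under the negative correlation of Table II(b), versus 1 for the strong adversary, illustrating that removing prior knowledge can either increase or decrease the leakage depending on the sign of the correlation.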

V General Relationship Analysis and Privacy Leakage Computation for Discrete Data

In this section, we analyze privacy leakage with respect to the three factors when data are discrete. Subsection V-A presents a weighted hierarchical graph (WHG) to model all adversaries with various prior knowledge. Subsection V-B discusses how to calculate the weight of edges in the WHG. Subsection V-C formulates a chain rule to represent the privacy leakage for an adversary with arbitrary prior knowledge. Subsection V-D presents a full-space-searching algorithm to compute the privacy leakage, and a fast-searching algorithm to improve the search efficiency in practice.

V-A Weighted Hierarchical Graph

A hierarchical graph is used to represent adversaries with various prior knowledge. Each node denotes an adversary, in which tuple is the attack object and the tuple set denotes the prior knowledge. For a database with tuples, there are layers in the graph. From the bottom to the top, the prior knowledge decreases by one tuple per layer, until . To compute the privacy leakage of adversaries, we further construct a weighted hierarchical graph (WHG) by assigning weights to the edges of the graph. We first define the value of a node as the privacy leakage caused by the corresponding adversary. In addition, the edge connecting two nodes denotes the privacy leakage difference between two adversaries with neighboring prior knowledge sets, i.e., . The process of analyzing the privacy leakage is then as follows. First, we construct the hierarchical graph of all possible adversaries for a given database. Second, we compute the values of all edges in the graph to obtain the WHG (discussed in Subsection V-B). Third, we compute the values of the nodes in the first layer by PDP. Fourth, we obtain all nodes’ values through a chain rule (Theorem 5). Finally, the privacy leakage is obtained by choosing the maximal node value.

For example, we can obtain a WHG consisting of three layers and twelve nodes for a simple database with three tuples, as shown in Fig. 3. Based on the node and the edges and , we obtain the privacy leakage for nodes and . Similarly, we can obtain the other four nodes in the second layer. For the node , we compute two values based on and , and choose the minimum as its privacy leakage. Similarly, we obtain the other two nodes in the third layer. The privacy leakage is then the maximal node value in the graph.
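To make the construction concrete, here is a small Python sketch (illustrative only, not the paper's implementation) that enumerates the nodes and directed edges of the hierarchical graph for a three-tuple database; it reproduces the three layers with 3, 6, and 3 nodes (twelve in total) described above.

from itertools import combinations

def build_whg_nodes(n):
    # Node (i, K): the adversary attacking tuple i with prior-knowledge index set K.
    # Layer t contains the nodes with |K| = n - t known tuples.
    layers = []
    for t in range(1, n + 1):
        layer = [(i, K) for i in range(n)
                 for K in combinations([j for j in range(n) if j != i], n - t)]
        layers.append(layer)
    return layers

def edges(layers):
    # Directed edges from each node to its ancestors (one known tuple removed).
    out = []
    for lower, upper in zip(layers, layers[1:]):
        upper_set = {(i, frozenset(K)) for i, K in upper}
        for i, K in lower:
            for j in K:
                anc = (i, frozenset(set(K) - {j}))
                if anc in upper_set:
                    out.append(((i, K), anc))
    return out

layers = build_whg_nodes(3)
print([len(layer) for layer in layers])   # [3, 6, 3] -> 12 nodes, as in Fig. 3
print(len(edges(layers)))                 # 12 directed edges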

In the above process, one key problem is to compute the edge value. Therefore, we propose a formula to address the problem.


Fig. 3: An example showing the WHG of three tuples. Each node denotes an adversary who attempts to infer the tuple with prior knowledge . There are three layers composed of nodes with the same prior knowledge size. A directed edge connects each node and its ancestor from the lower layer to the higher layer. Therefore, we obtain a directed graph representing all possible adversaries.

V-B Impacts of Correlations and Prior Knowledge

In this subsection, we deduce the formula for computing the edge value, which represents the impact of different prior knowledge on privacy leakage. Meanwhile, we show that the edge value is closely related to the data correlation.

Note that the edge value shows the gain of privacy leakage when one tuple is removed from the prior knowledge. If the edge value is positive, then the ancestor, a weaker adversary, can cause more privacy leakage. If the edge value is negative, then the ancestor, a stronger adversary, can cause more privacy leakage.

Given , denotes the conditional distribution derived from the joint distribution , and is the corresponding conditional correlation coefficient. The domain of tuple is , in which is the domain size of . Based on , the impact of on , under two different values of , can be denoted as

(9)

Then, impacts of on , under all possible pairs , can be denoted as a set

(10)

Next, a theorem shows how to compute the edge value.

Theorem 2.

Assume that the privacy leakage of an adversary is ; then the privacy leakage of its ancestor is

(11)

where

is the value of the edge connecting two nodes and in the WHG.

Proof.

See Appendix A. ∎

Theorem 2 shows the difference in privacy leakage between two adversaries whose prior knowledge differs by one tuple, under general correlations. According to Theorem 2, the value is the element in the set that maximizes the privacy leakage of node . That is, represents the maximal impact of on under the conditional distribution .

To show the relationship between and the three factors described in Subsection IV-B, we rewrite the as

(12)

where

(13)

is called the increment ratio and denotes the impact caused by correlations. represents the variation of privacy leakage when the prior knowledge decreases. Therefore, the two components and in Eq. (12) represent the impacts of local sensitivity and correlation, respectively.

Now, we give the relationship between and conditional correlation coefficient .

Theorem 3.

(1) For a database with all possible joint distributions, . (2) Under the assumption that the domain sizes of and are both two, has the following relationship with :

  1. if , then ;

  2. if , then ;

  3. if , then .

(3) Under the assumptions that the domain size of is two, the domain size of is greater than two, and , the results in Case (2) still apply.

Proof.

See Appendix B. ∎

Case (1) in Theorem 3 shows that has the same bound as the correlation coefficient. Case (2) shows the relationship between the edge value and the data correlations, and extends the results of Examples 2-4 to general cases. Case (3) shows that similar results hold for a general with a larger domain, as long as . Since the privacy budget in DP is commonly set as , the condition is usually true. Theorem 3 shows the impacts of the correlations and prior knowledge on the privacy leakage of the aggregation of two correlated tuples, which corresponds to the different cases in Fig. 2.

Combining Theorem 3 with Eq. (11) and Eq. (12), we note that the weaker adversary causes higher privacy leakage when the tuples are positively correlated, because more unknown tuples with positive correlations mean a greater sensitivity of the query result. However, when tuples are negatively correlated, the weaker adversary does not necessarily cause less privacy leakage, because more unknown tuples with negative correlations do not always mean smaller sensitivity or less privacy leakage.

What about when the domain size of is greater than two? Do the results in Theorem 3 still hold? Unfortunately, the answer is negative. Let the domain size of be ; then the number of is . We cannot guarantee that all these satisfy Theorem 3. Instead, we have the following analytical results.

  1. If , at least one ;

  2. if and are independent, all ;

  3. if , at least one .

Therefore, combining the above analytical results with Eqs. (12) and (11), we also conclude that the weaker adversary can cause higher privacy leakage when and are positively correlated. The prior knowledge has no impact on privacy leakage when and are independent, which also corresponds to Theorem 1. However, we cannot derive a deterministic relationship between the privacy leakage and prior knowledge if the tuples are negatively correlated. In this situation, we have to use Eq. (11) to determine their relationship.

V-C Privacy Leakage Formulation

In this subsection, we introduce how to compute the node value, which represents the privacy leakage caused by the adversary with prior knowledge in the WHG. As mentioned in Subsection V-A, the computation relies on two steps. One step computes the node values in the first layer; the other is a chain rule. We first present how to compute the node values in the first layer.

Theorem 4.

For a database which has tuples and follows the joint distribution , the values of the nodes in the first layer are , where is the local sensitivity, and is the parameter in the Laplace mechanism.

Proof.

Based on the definition of PDP, , we have
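Under the Laplace mechanism for a sum query, the strongest adversary knows every tuple except the attacked one, so the output distribution involves no marginalization over unknown tuples. Writing, as assumed generic notation, $\lambda$ for the Laplace scale, $\Delta_i$ for the local sensitivity of tuple $x_i$, $f$ for the query, and $x_K$ for the known values of all other tuples, a sketch of the derivation is

$$\sup_{x_i,\, x_i',\, O} \left| \log \frac{\exp\!\left(-|O - f(x_i, x_K)|/\lambda\right)}{\exp\!\left(-|O - f(x_i', x_K)|/\lambda\right)} \right| \;=\; \sup_{x_i,\, x_i'} \frac{|f(x_i, x_K) - f(x_i', x_K)|}{\lambda} \;=\; \frac{\Delta_i}{\lambda},$$

where the first supremum over $O$ is attained in the tails of the Laplace distribution.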

Theorem 4 demonstrates that the joint distribution, which represents the correlation, has no impact on privacy leakage when the adversary has the strongest prior knowledge. On the basis of Theorem 4, we deduce the values of the nodes in the second layer through Theorem 2. Similarly, we can obtain the values of the nodes in higher layers, layer by layer, according to Theorem 2. Finally, we obtain all nodes’ values. In particular, the following theorem presents a solution for computing the privacy leakage of a certain node in the WHG.

Theorem 5.

(Chain Rule) For a node in the layer , there exists a path from the bottom node to the node . From layer 1 to layer , , are all the nodes in the path. Then, the privacy leakage of the node corresponding to this path is

(14)

where denotes -fold absolute value operation, and is the length of the path.

Proof.

The result can be obtained by using Theorem 2. In a path from the bottom to the top, there are nodes and edges, each of which consists of two nodes in the adjacent layers. The chain rule can be obtained by applying Theorem 2 on all edges in a path. ∎

Theorem 5 shows the computational process for a path from the bottom node to the given node. If there exist multiple paths, we should compute the value of each path by using Eq. (14), and then choose the minimum as the node’s value. There are three factors that can impact privacy leakage. The first is the length of the path in Eq. (14), which represents the amount of prior knowledge. To highlight the other two factors, according to Eq. (12), we rewrite Eq. (14) as follows

(15)

According to Eq. (15), we can see that PDP is superior to group differential privacy in terms of calculating an accurate privacy leakage for the adversary with specific prior knowledge. Particularly, according to Theorem 3, we have . By setting all in Eq. (15), we have

(16)
(17)

Eqs. (16) and (17) show that the privacy leakage computed under PDP is more accurate than that of group differential privacy, which is simply derived from the sequential composition theorem. In addition, when the edge values in the WHG are all greater than or all less than zero, we can deduce some special results in the following corollary.

Corollary 1.
  1. When all , the PDP degrades to group differential privacy.

  2. When all , the maximal privacy leakage is obtained at the top layer, i.e., the weakest adversary causes the highest privacy leakage.

  3. When all , the maximal privacy leakage is obtained in the bottom layer, i.e., the strongest adversary causes the highest privacy leakage.

Case 1) can be derived from Eq. (16) directly. Additionally, it is easy to prove Cases 2) and 3) by using Eq. (15) and summing the nodes’ values in layer order.

Based on Corollary 1, we can easily compute the privacy leakage for these special cases. For example, in Case 2), the privacy leakage increases with the layer number. However, in general cases, when WHG has both positive and negative edges, we have to traverse the whole WHG to compute the privacy leakage.

V-D Algorithms for Computing Privacy Leakage

For a given database with tuples, the number of edges is no fewer than the number of nodes . Therefore, it is intractable to traverse the WHG when the number of tuples is large. We first use the full-space-searching algorithm to compute the least upper bound of privacy leakage and then propose a heuristic fast-searching algorithm to reduce the calculation time by limiting the searching space.

In the full-space-searching algorithm, we first initialize the values of the nodes in the first layer by Theorem 4 (line 1). Then, we generate the nodes in layer by using the chain rule (Theorem 5, Eq. (14)) based on the edge values (Eq. (11)) between layers and (lines 3-10). Note that for a given node in layer , there may exist multiple paths from the nodes in layer to the given node. As mentioned previously, we need to retain the minimal value computed from the multiple paths as the node value (line 8). Finally, we obtain the maximal privacy leakage over all nodes in the WHG (line 11).

Input: Database , joint distribution 
Output: Privacy leakage 
1: Generate nodes in the first layer, set and ;
2: Denote all nodes in the first layer as ;
3: for  = 2 to  do
4:     for each node  do
5:         Generate node  by subtracting  from ;
6:         Compute ;
7:     end for
8:     Detect the repeated nodes with the same attack tuple and prior knowledge in layer ; only retain the node with the minimal privacy leakage;
9:     return ;
10: end for
11: return ;
Algorithm 1 Full-Space-Searching
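As an independent sanity check (a minimal sketch, not the paper's implementation), the following Python code enumerates every adversary in the full space and evaluates its leakage directly from Definition 2 for a small discrete database with a sum query and Laplace noise, instead of via the WHG chain rule; the shared tuple domain, the output grid, and the supremum over the known tuples' values are simplifying assumptions.

import itertools
import math

def laplace_pdf(x, lam):
    return math.exp(-abs(x) / lam) / (2 * lam)

def output_density(o, i, xi, K, xK, joint, lam):
    # Pr[r = o | x_i = xi, x_K = xK]: marginalize the remaining unknown tuples of the
    # joint distribution (Bayesian inference), then apply the Laplace density of a sum query.
    num, den = 0.0, 0.0
    for x, p in joint.items():                        # x: full assignment, p: its probability
        if x[i] != xi or any(x[k] != v for k, v in zip(K, xK)):
            continue
        num += p * laplace_pdf(o - sum(x), lam)
        den += p
    return num / den if den > 0 else None             # None: impossible conditioning event

def node_leakage(i, K, joint, lam, domain, outputs):
    # Leakage of the adversary (attack tuple i, prior-knowledge set K): supremum over the
    # known values, the two values of tuple i, and the output grid (assumed interpretation).
    worst = 0.0
    for xK in itertools.product(domain, repeat=len(K)):
        for xi, xi2 in itertools.permutations(domain, 2):
            for o in outputs:
                a = output_density(o, i, xi, K, xK, joint, lam)
                b = output_density(o, i, xi2, K, xK, joint, lam)
                if a and b:
                    worst = max(worst, abs(math.log(a / b)))
    return worst

def full_space_search(joint, n, lam, domain, outputs):
    # Brute-force counterpart of Algorithm 1: visit every (attack tuple, prior set).
    best = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others), -1, -1):          # layers: strongest to weakest adversary
            for K in itertools.combinations(others, r):
                best = max(best, node_leakage(i, K, joint, lam, domain, outputs))
    return best

# Toy three-tuple database over {0, 1} with a hybrid (positive and negative) correlation.
joint = {x: 1.0 for x in itertools.product((0, 1), repeat=3)}
joint[(0, 0, 1)] = joint[(1, 1, 0)] = 2.0             # unnormalized weights
total = sum(joint.values())
joint = {x: p / total for x, p in joint.items()}
outputs = [k * 0.25 for k in range(-60, 80)]
print(full_space_search(joint, n=3, lam=1.0, domain=(0, 1), outputs=outputs))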
Proposition 1.

The time complexity of the full-space-searching algorithm is .

Proof.

There are two steps to obtain the values of the nodes in layer from the values of the nodes in layer . The first is to obtain new nodes in layer by removing one tuple from the prior knowledge of the nodes in layer . The second is to sort and remove the repeated nodes with the same attack tuple and prior knowledge in layer . There are nodes in layer , so the number of nodes after the first step would be , denoted as . The time complexity after the second step is . We note that . Summing from to , the time complexity of the algorithm is

(18)

As we can see, considerable time is required to generate new nodes and to remove repeated nodes in the full-space-searching algorithm. In addition, the time complexity grows exponentially with the number of tuples . To reduce the computational time complexity, a fast-searching algorithm is proposed that searches a subspace of the original full space with a small sacrifice in accuracy. Specifically, we only use the top largest nodes in layer to generate layer .

Proposition 2.

The time complexity of the fast-searching algorithm is .

Proof.

According to the fast-searching algorithm, there are at most nodes in layer . After the subtraction operation, there are at most nodes, denoted as . The rest of this proof is the same as that of Proposition 1. ∎

Input: Database , joint distribution 
Output: Privacy leakage 
1: Initialize nodes in layer ;
2: for  = 2 to  do
3:     Generate the nodes in layer ;
4:     Detect the repeated nodes and retain the minimum node;
5:     Retain the top largest nodes;
6:     return ;
7: end for
8: return ;
Algorithm 2 Fast-Searching
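As a structural illustration only (not the paper's implementation), the sketch below keeps the top-k nodes per layer and expands only those; it plugs in an arbitrary per-adversary leakage oracle leakage(i, K), for instance a partial application of the brute-force node_leakage sketched after Algorithm 1, in place of the chain rule of Theorem 5, whose edge terms rely on notation elided here.

import heapq

def fast_search(n, k, leakage):
    # Top-k (beam-style) variant of the full-space search: after each layer,
    # only the k nodes with the largest leakage are kept and expanded.
    scored = {(i, frozenset(j for j in range(n) if j != i)):
              leakage(i, frozenset(j for j in range(n) if j != i)) for i in range(n)}
    best = max(scored.values())
    for _ in range(2, n + 1):
        candidates = {}
        for (i, K) in scored:                  # expand only the surviving nodes
            for j in K:
                anc = (i, K - {j})             # ancestor: one known tuple removed
                if anc not in candidates:
                    candidates[anc] = leakage(anc[0], anc[1])
        scored = dict(heapq.nlargest(k, candidates.items(), key=lambda kv: kv[1]))
        best = max(best, max(scored.values()))
    return best

# Example (reusing the objects sketched after Algorithm 1):
# fast_search(3, k=2, leakage=lambda i, K: node_leakage(i, tuple(sorted(K)),
#                                                       joint, 1.0, (0, 1), outputs))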

VI Gaussian Model-based Analysis for Continuous Data

In this section, we further discuss the impacts of correlation and prior knowledge for continuous-valued data. In Subsection VI-A, we first explain why the WHG is not suitable for continuous-valued databases. Then, we introduce some properties of the multivariate Gaussian distribution. In Subsection VI-B, we derive an explicit formula to compute the privacy leakage under the multivariate Gaussian model.

VI-A Multivariate Gaussian Model

The reason for treating the continuous case separately from the discrete case is that the computation method used in Section V is no longer applicable. In Section V, we investigated how correlation and prior knowledge impact privacy leakage. Based on the proposed WHG, we deduced the chain rule to compute privacy leakage. One crucial step is to compute the edge values in the WHG. For discrete-valued data, this can be achieved by using Eqs. (9) and (11), which requires enumerating all the different pairs of values and in the domain. Obviously, this is impossible for continuous-valued tuples with an unbounded domain. To deal with this issue, we must specify the joint distribution explicitly. Therefore, although the analytical results in Section V still hold for continuous-valued data, the edge values cannot be directly computed as for discrete data.

For continuous data, a common solution is to accurately identify the global sensitivity via bounding the range (i.e., domain) of the tuples [12, 32]. Otherwise, the privacy leakage would be overestimated, and the unboundedness would destroy the utility of privacy-preserving results. Therefore, by bounding the range of as , Eq. (2) becomes

(19)

Different from the sum operation used to compute probabilities in Section V, we use integration to compute probabilities for continuous data. That is,

Here, we choose the multivariate Gaussian distribution (denoted as MGD) to describe the database, since most continuous data can be well modeled by an MGD. For a database with $n$ tuples, $\mu$ is the expectation vector and $\Sigma$ is the covariance matrix. If $\Sigma_{ij} > 0$, $x_i$ and $x_j$ are positively correlated; if $\Sigma_{ij} < 0$, $x_i$ and $x_j$ are negatively correlated; and if $\Sigma_{ij} = 0$, $x_i$ and $x_j$ are independent. $D$ follows the MGD if the density function of $D$ is

$$f(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^{\mathrm{T}} \Sigma^{-1} (x - \mu) \right).$$
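To illustrate how the adversary's prior knowledge enters the Gaussian model, here is a small Python sketch (illustrative only; the covariance values below are assumed) that conditions an MGD on the known tuples via the Schur complement and reads off the conditional correlation of two unknown tuples, the quantity that plays the role of the conditional correlation coefficient in the analysis above.

import numpy as np

def conditional_mvn(mu, Sigma, known_idx, known_vals):
    # Condition N(mu, Sigma) on X_K = known_vals; return the conditional mean and
    # covariance of the remaining (unknown) tuples via the Schur complement.
    n = len(mu)
    U = [i for i in range(n) if i not in known_idx]
    K = list(known_idx)
    S_UU = Sigma[np.ix_(U, U)]
    S_UK = Sigma[np.ix_(U, K)]
    S_KK = Sigma[np.ix_(K, K)]
    w = np.linalg.solve(S_KK, np.asarray(known_vals) - mu[K])
    mu_cond = mu[U] + S_UK @ w
    Sigma_cond = S_UU - S_UK @ np.linalg.solve(S_KK, S_UK.T)
    return mu_cond, Sigma_cond

# Toy three-tuple database: the first two tuples positively correlated,
# the third negatively correlated with the first (assumed values).
mu = np.array([0.0, 0.0, 0.0])
Sigma = np.array([[1.0, 0.6, -0.4],
                  [0.6, 1.0, 0.1],
                  [-0.4, 0.1, 1.0]])
m, S = conditional_mvn(mu, Sigma, known_idx=[2], known_vals=[1.5])
rho_12_given_3 = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])   # conditional correlation given tuple 3
print(m, S, rho_12_given_3)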