Preserving Individual Privacy in Serial Data Publishing
While previous works on privacy-preserving serial data publishing consider the scenario where sensitive values may persist over multiple data releases, we find that no previous work provides sufficient protection for sensitive values that can change over time, which should be the more common case. In this work, we propose to study the privacy guarantee for such transient sensitive values, which we call the global guarantee. We formally define the problem of achieving this guarantee and derive some theoretical properties for it. We show that the sizes of the anonymized groups used in data anonymization are a key factor in protecting individual privacy in serial publication. We propose two anonymization strategies that aim to minimize the average group size and the maximum group size, respectively. Finally, we conduct experiments on a medical dataset to show that our method is highly efficient and also produces published data of very high utility.
Recently, there has been much study of the issues in privacy-preserving data publishing [2, 13, 12, 4, 16, 27, 7, 14, 33, 9, 22, 15]. Most previous works deal with privacy protection when only one instance of the data is published. However, in many applications, data is published at regular time intervals. For example, the medical data from a hospital may be published twice a year. Some recent papers [19, 30, 8, 6, 23, 5] study the privacy protection issues that arise when multiple instances of the data are published. We refer to such data publishing as serial data publishing.
Following the settings of previous works, we assume that there is a sensitive attribute which contains sensitive values that should not be linked to the individuals in the database. A common example of such a sensitive attribute is disease. While some diseases such as flu or stomach virus may not be very sensitive, some diseases such as chlamydia (a sexually transmitted disease) can be considered highly sensitive. In serial publishing of such a set of data, the disease values attached to a certain individual can change over time.
A typical guarantee we want to achieve is that the probability that an adversary can derive for the linkage of a person to a sensitive value is no more than $1/\ell$. This is well known to be a simple form of $\ell$-diversity. This guarantee sounds innocent enough for a single-release data publication. However, when it comes to serial data publishing, the objective becomes quite elusive and requires a much closer look. In serial publishing, the individuals that are recorded in the data may change, and the sensitive values related to individuals may also change. We assume that the sensitive values can change freely.
Let us consider a sensitive disease, chlamydia, which is a sexually transmitted disease that is easily curable. Suppose that there exist 3 records of an individual $o$ in 3 different medical data releases. Typically, $o$ would not want anyone to deduce with high confidence from these released data that s/he has ever contracted chlamydia in the past. Here, the past practically corresponds to one or more of the three data releases. Therefore, if from these data releases an adversary can deduce with high confidence that $o$ has contracted chlamydia in one or more of the three releases, privacy would have been breached. To protect privacy, we would like the probability of any individual being linked to a sensitive value in one or more data releases to be bounded from above by $1/\ell$. Let us call this privacy guarantee the global guarantee and the value $1/\ell$ the privacy threshold.
Though the global guarantee requirement seems to be quite obvious, to the best of our knowledge, no existing work has considered such a guarantee. Instead, the closest guarantee of previous works is the following: for each of the data releases, $o$ can be linked to chlamydia with a probability of no more than $1/\ell$. Let us call this guarantee the localized guarantee. Would this guarantee be equivalent to the above global guarantee? In order to answer this question, let us look at an example.
Consider two raw medical tables (or micro data) $T_1$ and $T_2$ as shown in Figure 1 at time points 1 and 2, respectively. Suppose that they contain records for the individuals $o_1, o_2, o_3, o_4$. There are two kinds of attributes, namely quasi-identifier (QID) attributes and sensitive attributes. Quasi-identifier attributes are attributes that can be used to identify an individual with the help of an external source such as a voter registration list [21, 12, 13, 29]. In this example, sex and zipcode are the quasi-identifier attributes, while disease is the sensitive attribute. Attribute id is used for illustration purposes and does not appear in the published table. We assume that each individual owns at most one tuple in each table at each time point. Furthermore, we assume no additional background knowledge about the linkage of individuals to diseases, and the sensitive values linked to individuals can be freely updated from one release to the next release.
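Figure 1: Raw tables $T_1$ and $T_2$

(a) $T_1$

| Id | Sex | Zipcode | Disease |
|---|---|---|---|
| $o_1$ | M | 65001 | flu |
| $o_2$ | M | 65002 | chlamydia |
| $o_3$ | F | 65014 | flu |
| $o_4$ | F | 65015 | fever |

(b) $T_2$

| Id | Sex | Zipcode | Disease |
|---|---|---|---|
| $o_1$ | M | 65001 | chlamydia |
| $o_2$ | M | 65002 | flu |
| $o_3$ | F | 65014 | fever |
| $o_4$ | F | 65010 | flu |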
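Figure 2: Published tables $P_1$ and $P_2$

(a) $P_1$

| Sex | Zipcode | Disease |
|---|---|---|
| M | 6500* | flu |
| M | 6500* | chlamydia |
| F | 6501* | flu |
| F | 6501* | fever |

(b) $P_2$

| Sex | Zipcode | Disease |
|---|---|---|
| M | 6500* | chlamydia |
| M | 6500* | flu |
| F | 6501* | fever |
| F | 6501* | flu |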
Figure 3: Possible worlds for the first anonymized group (disease assigned to each individual at each time point)

| Possible world | Time 1: M 65001 | Time 1: M 65002 | Time 2: M 65001 | Time 2: M 65002 |
|---|---|---|---|---|
| (a) Possible world 1 ($w_1$) | flu | chlamydia | flu | chlamydia |
| (b) Possible world 2 ($w_2$) | flu | chlamydia | chlamydia | flu |
| (c) Possible world 3 ($w_3$) | chlamydia | flu | flu | chlamydia |
| (d) Possible world 4 ($w_4$) | chlamydia | flu | chlamydia | flu |
Assume that the privacy threshold is $1/\ell = 1/2$. In a typical data anonymization [21, 12, 13, 29], in order to protect individual privacy, the QID attributes of the raw table are generalized or bucketized to form anonymized groups ($AG$) that hide the linkage between an individual and a sensitive value. For example, table $P_1$ in Figure 2(a) is a generalized table of $T_1$ in Figure 1. We generalize the zip code of the first two tuples to 6500* so that they have the same QID values in $P_1$. We say that these two tuples form an anonymized group. It is easy to see that in each published table $P_1$ or $P_2$, the probability of linking any individual to chlamydia or flu is at most 1/2, which satisfies the localized guarantee. The question is whether this satisfies the global privacy guarantee with a threshold of 1/2.
For the sake of illustration, let us focus on the anonymized groups $AG_1$ and $AG_2$ containing the first two tuples in tables $P_1$ and $P_2$ in Figure 2, respectively. The probability in serial publishing can be derived by possible world analysis. There are four possible worlds for $AG_1$ and $AG_2$ in these two published tables, as shown in Figure 3. Here each possible world is one possible way to assign the diseases to the individuals that is consistent with the published tables. Therefore, each possible world is a possible assignment of the sensitive values to the individuals at all the publication time points for groups $AG_1$ and $AG_2$. Note that an individual can be assigned different values at different data releases, and the assignment in one data release is independent of the assignment in another release.
Consider individual $o_2$. Among the four possible worlds, three possible worlds link $o_2$ to “chlamydia”, namely $w_1$, $w_2$ and $w_3$. In $w_1$ and $w_2$, the linkage occurs at time point 1, and in $w_3$, the linkage occurs at time point 2. Thus, the probability that $o_2$ is linked to “chlamydia” in at least one of the tables is equal to $3/4$, which is greater than $1/2$, the intended privacy threshold. From this example, we can see that the localized guarantee does not imply the global guarantee.
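Figure 4: Published tables with anonymized groups of size 4

(a) First release

| Sex | Zipcode | Disease |
|---|---|---|
| M/F | 650** | flu |
| M/F | 650** | chlamydia |
| M/F | 650** | flu |
| M/F | 650** | fever |

(b) Second release

| Sex | Zipcode | Disease |
|---|---|---|
| M/F | 650** | chlamydia |
| M/F | 650** | flu |
| M/F | 650** | fever |
| M/F | 650** | flu |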
In this paper, we show that in order to ensure the global guarantee, the sizes of the anonymized groups need to be larger than those needed for the localized guarantee. In the above example, we can use anonymized groups of size 4, as shown in Figure 4. Each release then has $4!/(2!\,1!\,1!) = 12$ possible assignments, so there will be $12 \times 12 = 144$ possible worlds. It is easy to see that 108 of the possible worlds do not assign chlamydia to $o_2$ in the first release, 108 of them do not assign chlamydia to $o_2$ in the second release, and 81 of the possible worlds do not assign chlamydia to $o_2$ in either release. The remaining $144 - 81 = 63$ possible worlds assign chlamydia to $o_2$ in at least one of the two releases. Hence, the privacy breach probability is $63/144 = 7/16$, which is smaller than $1/2$.
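The two examples above can be checked by brute-force enumeration of possible worlds. The following is a minimal sketch, assuming that each release assigns the multiset of sensitive values in a group to its members independently of other releases; the function names and the list encoding of the groups are illustrative and not part of the paper.

```python
from fractions import Fraction
from itertools import permutations, product

def worlds_for_group(values):
    """Distinct ways to assign a group's sensitive values to its members."""
    return sorted(set(permutations(values)))

def enumerated_breach(groups, position, sensitive):
    """Fraction of possible worlds linking one member to `sensitive` at least once.

    groups: one list of sensitive values per release (the member's group).
    position: index of the member inside each group.
    """
    per_release = [worlds_for_group(g) for g in groups]
    total = linked = 0
    for world in product(*per_release):
        total += 1
        if any(table[position] == sensitive for table in world):
            linked += 1
    return Fraction(linked, total)

# Groups of size 2 (first two tuples of the published tables in Figure 2).
small = [["flu", "chlamydia"], ["chlamydia", "flu"]]
# Groups of size 4 (Figure 4).
large = [["flu", "chlamydia", "flu", "fever"],
         ["chlamydia", "flu", "fever", "flu"]]

print(enumerated_breach(small, 1, "chlamydia"))  # 3/4
print(enumerated_breach(large, 1, "chlamydia"))  # 7/16
```

Enumeration grows factorially with the group size, so it is only useful as a sanity check on tiny examples.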
The contributions of this paper include the following. We point out the privacy breach that arises with the localized guarantee and propose to study the global guarantee in privacy-preserving serial data publishing. We formally analyze the privacy breach with transient sensitive values. We derive useful properties of anonymization under the global guarantee. These properties are related to the anonymized group sizes: typically, group sizes greater than those required for the localized guarantee are needed to attain the global guarantee. These properties are then leveraged in new anonymization strategies that can minimize the information loss. We have also conducted extensive experiments with a real medical dataset to verify our techniques. The results show that our methodology is very promising in real-world applications.
The rest of this paper is organized as follows. Section 2 surveys the previous related works. Section 3 contains our problem definition. Section 4 describes a general formula for the breach probability. Section 5 discusses some key properties for this problem. Section 6 describes our methodology for privacy protection. Section 7 suggests a possible implementation. Section 8 is an empirical study. Section 9 concludes our work and points out some possible future directions.
2 Related Work
Here, we summarize the previous works on the problem of privacy-preserving serial data publishing. $k$-anonymity has been considered in earlier works for serial publication allowing only insertions, but they do not consider the linkage probabilities to sensitive values. Another work considers sequential releases of different attribute subsets of the same dataset, which is different from our definition of serial publishing.
There are some more related works that attempt to avoid the linkage of individuals to sensitive values. Delayed publishing has been proposed to avoid problems of insertions, but deletions and updates are not considered. While other work considers both insertions and deletions, both [6] and [30] make the assumption that when an individual appears in consecutive data releases, the sensitive value for that individual does not change. This assumption has been pointed out to be unrealistic. Also, the protection in such work is record-based and not individual-based. This is quite problematic: in our running example, there are two records for one individual $o_2$, namely $t_1$ in table $P_1$ and $t_2$ in table $P_2$ (note that $P_1$ and $P_2$ need not be consecutive releases, so that the sensitive value linked to $o_2$ can change even if we adopt the above unrealistic assumption in [6, 30]). If we consider just tuple $t_1$, then there are only 2 possible worlds where $t_1$ is linked to chlamydia in Figure 3, namely $w_1$ and $w_2$. If we just consider tuple $t_2$, there are also only 2 possible worlds linking it to chlamydia, namely $w_1$ and $w_3$. Hence, $P_1$ and $P_2$ satisfy the record-based requirement if the risk threshold is 0.5. In fact, these are possible tables generated by such a record-based mechanism. However, we have shown that this anonymization does not provide the expected protection for the individuals.
The $\ell$-scarcity model has been introduced to handle the situations where some data may be permanent, so that once an individual is linked to such a value, the linkage will remain in subsequent releases whenever the individual appears (not limited to consecutive releases only). However, for transient sensitive values, previous works adopt the following principle.
Principle 1 [Localized Guarantee] For each release of the data publication, the probability that an individual is linked to a sensitive value is bounded by a threshold.
However, we have seen in the example in the previous section that this cannot satisfy the expected privacy requirement. Hence, we consider the following principle.
Principle 2 [Global Guarantee] Over all the published releases, the probability that an individual has ever been linked to a sensitive value is bounded by a threshold.
Although the privacy guarantee is the most important data publication criterion, the published data must also provide a reasonable level of utility so that it can be useful for applications such as data mining or data analysis. Utility is a tradeoff for the privacy guarantee since anonymization of data introduces information loss. There are different definitions of utility in the existing literature. Here, we briefly describe some common definitions.
The anonymized group sizes have been considered in utility metrics. One metric is the average group size. In the discernibility model, a penalty is assigned to each tuple $t$ as determined by the square of the size of the anonymized group for $t$. The normalized average anonymized group size metric has also been proposed, which is given by the total number of tuples in the table divided by the product of the total number of anonymized groups and a value $k$ (for $k$-anonymity). Here, the best case occurs when each group has size $k$.
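The group-size metrics described above are simple to compute from the list of anonymized group sizes. The sketch below is illustrative only: the function names are hypothetical, and the discernibility penalty is taken as the sum of squared group sizes, which is one common reading of that model.

```python
def average_group_size(group_sizes):
    """Mean size of the anonymized groups."""
    return sum(group_sizes) / len(group_sizes)

def discernibility_penalty(group_sizes):
    """Total penalty, assuming a group of size n contributes n * n."""
    return sum(size * size for size in group_sizes)

def normalized_average_group_size(group_sizes, k):
    """Total tuples divided by (number of groups * k); 1.0 is the best case."""
    return sum(group_sizes) / (len(group_sizes) * k)

# Two groups of size 2 versus one group of size 4 over the same four tuples.
fine = [2, 2]
coarse = [4]
print(average_group_size(fine), average_group_size(coarse))
print(discernibility_penalty(fine), discernibility_penalty(coarse))
print(normalized_average_group_size(fine, 2),
      normalized_average_group_size(coarse, 2))
```

All three metrics penalize the coarser grouping, which is why larger groups trade utility for privacy.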
Other works [29, 31, 26] consider categorical data that comes with a taxonomy, so that the information loss is measured with respect to the structure of the taxonomy when data are generalized from the leaf nodes to higher levels in the taxonomy. Other works measure utility by comparing the data distributions before and after anonymization. Recent works also consider the accuracy in answering aggregate queries to be a measure of utility.
3 Problem Definition
Suppose tables $T_1, T_2, \ldots, T_k$ are generated at time points $1, 2, \ldots, k$, respectively. Each table has two kinds of attributes, quasi-identifier attributes and sensitive attributes. For the sake of illustration, we consider one single sensitive attribute $S$ containing $m$ values, namely $s_1, s_2, \ldots, s_m$. Assume that the sensitive values for individuals can freely change from one release to another release, so that the linkage of an individual $o$ to a sensitive value $s$ in one data release has no effect on the linkage of $o$ to any other sensitive value in any other data release. Assume that at each time point $i$, a data publisher generates an anonymized version $P_i$ of $T_i$ for data publishing so that each record in $T_i$ will belong to one anonymized group in $P_i$. Given an anonymized group $AG$, we define $AG.S$ to be a multi-set containing all sensitive values in $AG$, and $AG.I$ to be the set of individuals that appear in $AG$.
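[Possible World] A series of tables $TS = \{T^w_1, T^w_2, \ldots, T^w_k\}$ is a possible world for published tables $\{P_1, P_2, \ldots, P_k\}$ if the following requirements are satisfied. For each $i \in [1, k]$,
(1) there is a one-to-one correspondence between individuals in $T^w_i$ and individuals in $P_i$, and
(2) for each anonymized group $AG$ in $P_i$, the multi-set of the sensitive values of the corresponding individuals in $T^w_i$ is equal to the multi-set of sensitive values in $AG$.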
Let $p(o, s, k)$ be the probability that an individual $o$ is linked to $s$ in at least one published table among published tables $P_1, P_2, \ldots, P_k$.
Let $t.S$ stand for the sensitive value of tuple $t$. We say that $o$ is linked to $s$ in a table $T^w_i$ of a possible world if, for the tuple $t$ of $o$ in $T^w_i$, $t.S = s$. Following previous works, we define the probability based on the possible worlds as follows.
[Breach Probability] The breach probability is given by
$$p(o, s, k) = \frac{W_{link}(o, s, k)}{W_{total}(k)}$$
where $W_{link}(o, s, k)$ is the total number of possible worlds where $o$ is linked to $s$ in at least one published table among $P_1, P_2, \ldots, P_k$ and $W_{total}(k)$ is the total number of possible worlds for published tables $P_1, P_2, \ldots, P_k$.
We will describe how we derive a general formula to calculate $p(o, s, k)$ in Section 4.
While privacy breach is the most important concern, the utility of the published data also needs to be preserved. There are different definitions of utility in the existing literature. Some commonly adopted utility measurements are described in Section 2.
In this paper, we are studying the following problem.
Given a privacy parameter $\ell$ (a positive integer), a utility measurement, $k-1$ published tables, namely $P_1, P_2, \ldots, P_{k-1}$, and one raw table $T_k$, we want to generate a published table $P_k$ from $T_k$ such that the utility is maximized, and for each individual $o$ and each sensitive value $s$,
$$p(o, s, k) \le \frac{1}{\ell}$$
3.1 Global versus Localized Guarantee
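Here, we show that protecting individual privacy with Principle 2 (global guarantee) implies protecting individual privacy with Principle 1 (localized guarantee). Under Principle 1, let $q(o, s, i)$ be the probability that an individual $o$ is linked to a sensitive value $s$ in the $i$-th table. Following the definition of probability adopted in most previous works [30, 5], we have
$$q(o, s, i) = \frac{W^{(i)}_{link}(o, s)}{W_{total}(k)}$$
where $W^{(i)}_{link}(o, s)$ is the total number of possible worlds in which $o$ is linked to $s$ in the $i$-th table and $W_{total}(k)$ is the total number of possible worlds for the $k$ published tables.
In our running example, $k = 2$ and from Figure 3, there are four possible worlds, so $W_{total}(2) = 4$. Consider published table $P_1$. There are two possible worlds where $o_2$ is linked to chlamydia ($s$), namely $w_1$ and $w_2$. Thus, $W^{(1)}_{link}(o_2, s) = 2$ and $q(o_2, s, 1) = 2/4 = 1/2$. Similarly, when $i = 2$, $q(o_2, s, 2) = 1/2$.
In general, it is obvious that $W^{(i)}_{link}(o, s) \le W_{link}(o, s, k)$ for any $i \in [1, k]$, where $W_{link}(o, s, k)$ is the number of possible worlds in which $o$ is linked to $s$ in at least one of the $k$ published tables. With $p(o, s, k) = W_{link}(o, s, k) / W_{total}(k)$ denoting the breach probability under the global guarantee, we derive that
$$q(o, s, i) \le p(o, s, k)$$
Hence we have the following lemma.
If $p(o, s, k) \le 1/\ell$, then $q(o, s, i) \le 1/\ell$ for every $i \in [1, k]$.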
4 Breach Probability Analysis
In this section, we consider how the breach probability can be derived. For privacy breach, we focus on the possible assignment of sensitive values to one individual at a time. Therefore, we introduce the following possible world definition to deal with assignments to a particular individual.
At any data release, let $AG_i(o)$ be the anonymized group that contains the record for individual $o$ in published table $P_i$.
For the sake of clarity, if the context is clear, we omit the subscript and denote $AG_i(o)$ by $AG(o)$.
[Possible World for $AG(o)$] Consider a possible world $TS = \{T^w_1, \ldots, T^w_k\}$ for $\{P_1, \ldots, P_k\}$. Let us extract the tuples in each $T^w_i$ that correspond to the tuples in the anonymized group $AG_i(o)$ (containing individual $o$ in $P_i$) to form table $T^o_i$. Then, the series of smaller tables, denoted by $TS^o$, which is equal to $\{T^o_1, \ldots, T^o_k\}$, forms a possible world for $AG_1(o), \ldots, AG_k(o)$. We also say that $TS^o$ is a possible world for $AG(o)$ for $o$.
For example, Figure 3 shows all the possible worlds for $AG(o_1)$ and for $AG(o_2)$ in the published tables shown in Figure 2(a) and Figure 2(b). Note that in the above definition, if $o$ does not appear in a table $P_i$, then $T^o_i$ is an empty table.
4.1 Possible World Analysis
Since the sensitive values are transient and we do not assume any additional knowledge about the data linkage, the assignment of sensitive values to individuals in groups other than $AG(o)$ is independent of the assignment to the individuals in $AG(o)$. Hence, we arrive at the following lemma.
The value of $p(o, s, k)$ can be derived based on the analysis of the possible worlds for $AG(o)$.
The above lemma helps to greatly simplify the analysis of the privacy breach by considering only $AG(o)$ in each data release. In the following, we may refer to a possible world for $AG(o)$ simply as a possible world.
Consider an anonymized group $AG$ in $P_i$ for individual $o$. Let $n_i$ be the size of $AG$. Let $n_{i,j}$ be the total number of tuples in $AG$ with sensitive value $s_j$ for $j = 1, 2, \ldots, m$. The total number of possible worlds for $AG$ can be derived by combinatorial analysis.
[No. of Poss. Worlds for Single Table] The total number of possible worlds for the anonymized group $AG$ in a single published table $P_i$ is equal to
$$\frac{n_i!}{\prod_{j=1}^{m} n_{i,j}!}$$
For example, consider an anonymized group $AG$ of size 4 containing two $s_1$ values, one $s_2$ value and one $s_3$ value in $P_i$. Then, the number of possible worlds is equal to $\frac{4!}{2!\,1!\,1!} = 12$.
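The count of possible worlds for a single group is a multinomial coefficient, which can be computed directly from the multiset of sensitive values. A minimal sketch follows; the function name and the string labels for the sensitive values are illustrative.

```python
from collections import Counter
from math import factorial

def count_possible_worlds(sensitive_values):
    """Number of distinct assignments of a group's sensitive values to its members.

    Implements n! / (n_1! * n_2! * ... * n_m!), where n is the group size and
    n_j is the number of tuples carrying the j-th sensitive value.
    """
    denominator = 1
    for occurrences in Counter(sensitive_values).values():
        denominator *= factorial(occurrences)
    return factorial(len(sensitive_values)) // denominator

# Group of size 4 with two s1 values, one s2 value and one s3 value.
print(count_possible_worlds(["s1", "s1", "s2", "s3"]))  # 12
```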
4.2 Breach Probability
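Recall that our objective is to compute $p(o, s, k)$, which involves two major components, namely $W_{link}(o, s, k)$, the number of possible worlds where $o$ is linked to $s$ in at least one published table, and $W_{total}(k)$, the total number of possible worlds. In the following, we will describe how we obtain the values of these two components.
By Lemma 4.1, the total number of possible worlds for $AG(o)$ in the published tables $P_1, P_2, \ldots, P_k$, denoted by $W_{total}(k)$, is equal to
$$W_{total}(k) = \prod_{i=1}^{k} \frac{n_i!}{\prod_{t=1}^{m} n_{i,t}!}$$
Next, we will describe how to obtain the formula for $W_{link}(o, s, k)$. Without loss of generality, we consider the privacy protection for an arbitrary sensitive value $s = s_j$. The following analysis applies for each sensitive value.
Note that, for any arbitrary sensitive value $s$, we have the following.
$$W_{link}(o, s, k) = W_{total}(k) - \overline{W}_{link}(o, s, k)$$
where $\overline{W}_{link}(o, s, k)$ is the total number of possible worlds where $o$ is not linked to $s$ in any of the published tables $P_1, P_2, \ldots, P_k$. Thus,
$$p(o, s, k) = 1 - \frac{\overline{W}_{link}(o, s, k)}{W_{total}(k)} \qquad (1)$$
Next, we will show how we derive $\overline{W}_{link}(o, s, k)$. Let $x_i$ be the total number of possible worlds for table $P_i$ (treated as a singleton table series) in which $o$ is not linked to $s$. Since the assignments in different releases are independent, $\overline{W}_{link}(o, s, k) = \prod_{i=1}^{k} x_i$.
Consider a possible table $T^o_i$ where $o$ is not linked to $s_j$. Since $o$ is not linked to $s_j$ in $T^o_i$, $o$ is linked to a sensitive value $s_{j'}$ where $j' \neq j$ in $T^o_i$. The number of possible worlds for $P_i$ where $o$ is linked to $s_{j'}$ in $T^o_i$ (with $n_{i,j'} \ge 1$) is equal to
$$\frac{(n_i - 1)!}{(n_{i,j'} - 1)! \prod_{t \neq j'} n_{i,t}!} = \frac{(n_i - 1)!\; n_{i,j'}}{\prod_{t=1}^{m} n_{i,t}!}$$
By considering all sensitive values $s_{j'}$ where $j' \neq j$, the total number of possible worlds for $P_i$ where $o$ is not linked to $s_j$ (i.e., $x_i$) is equal to
$$x_i = \sum_{j' \neq j} \frac{(n_i - 1)!\; n_{i,j'}}{\prod_{t=1}^{m} n_{i,t}!} = \frac{(n_i - 1)!\,(n_i - n_{i,j})}{\prod_{t=1}^{m} n_{i,t}!}$$
From Equation (1),
$$p(o, s_j, k) = 1 - \prod_{i=1}^{k} \frac{x_i}{n_i! / \prod_{t=1}^{m} n_{i,t}!} = 1 - \prod_{i=1}^{k} \frac{n_i - n_{i,j}}{n_i}$$
[Closed Form of $p(o, s, k)$]
$$p(o, s_j, k) = 1 - \prod_{i=1}^{k} \frac{n_i - n_{i,j}}{n_i}$$
From Equation (1), $p(o, s, k)$ is defined in conceptual terms with the total number of possible worlds. Lemma 4.2 gives a closed form of $p(o, s, k)$. Given the information of $n_i$ (i.e., the size of the anonymized group in the $i$-th table) and $n_{i,j}$ (i.e., the number of tuples in the anonymized group with sensitive value $s_j$ in the $i$-th table), we can calculate $p(o, s_j, k)$ with its closed form directly.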
[Two-Table Illustration] Consider that we want to protect the linkage between an individual $o$ and a sensitive value $s_j$. Suppose $o$ appears in both published tables $P_1$ and $P_2$. Let $AG_1$ and $AG_2$ be the anonymized groups in $P_1$ and $P_2$ containing $o$. Suppose both $AG_1$ and $AG_2$ are linked to $s_j$.
By the notation adopted in this paper, $n_i$ is the size of $AG_i$ and $n_{i,j}$ is the total number of tuples in $AG_i$ with sensitive value $s_j$.
By Lemma 4.2, we have
$$p(o, s_j, 2) = 1 - \frac{(n_1 - n_{1,j})(n_2 - n_{2,j})}{n_1 n_2}$$
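[Running Example] In our running example as shown in Figure 2, consider the second individual $o_2$ and the sensitive value “chlamydia”. We know that $n_1 = n_2 = 2$. Suppose $s_1$ is “chlamydia”. Thus, $n_{1,1} = n_{2,1} = 1$. With respect to the published tables as shown in Figure 2, according to the formula derived in Example 4.2,
$$p(o_2, s_1, 2) = 1 - \frac{(2 - 1)(2 - 1)}{2 \times 2} = \frac{3}{4}$$
which is greater than $1/2$ (the desired threshold).
However, if we publish tables as shown in Figure 4, then $n_1 = n_2 = 4$ and $n_{1,1} = n_{2,1} = 1$.
$$p(o_2, s_1, 2) = 1 - \frac{(4 - 1)(4 - 1)}{4 \times 4} = \frac{7}{16}$$
which is smaller than $1/2$.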
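The closed form, one minus the product over releases of (group size minus occurrences of the sensitive value) divided by group size, is easy to evaluate with exact arithmetic. The sketch below is illustrative: the function name and the (size, occurrences) pair encoding are assumptions made for this example.

```python
from fractions import Fraction

def closed_form_breach(groups):
    """Breach probability for one individual and one sensitive value.

    groups: one (group_size, occurrences) pair per release, where occurrences
    is the number of tuples in the individual's group carrying the value.
    """
    not_linked = Fraction(1)
    for size, occurrences in groups:
        not_linked *= Fraction(size - occurrences, size)
    return 1 - not_linked

# Groups of size 2 in both releases (Figure 2).
print(closed_form_breach([(2, 1), (2, 1)]))  # 3/4
# Groups of size 4 in both releases (Figure 4).
print(closed_form_breach([(4, 1), (4, 1)]))  # 7/16
```

A release whose group does not contain the sensitive value contributes a factor of 1, so it leaves the result unchanged.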
In this paper, we aim to publish a table like those in Figure 4 at each time point $k$ such that $p(o, s, k) \le 1/\ell$ for each individual $o$ and each sensitive value $s$.
From Example 4.2, we observe that a larger anonymized group size reduces the breach probability that individual $o$ has been linked to sensitive value $s$ in the past. However, the anonymized group size alone cannot reduce the breach probability. Consider an anonymized group in published table $P_i$ in which every tuple has sensitive value $s$, instead of all tuples having distinct sensitive values. Even if this anonymized group is larger, as long as every tuple in it has sensitive value $s$, it is easy to verify that an individual in this anonymized group must be linked to $s$ in table $P_i$.
In fact, the breach probability is determined by the anonymized group size ratio. The anonymized group size ratio is equal to the anonymized group size divided by the total number of tuples in this anonymized group with sensitive value $s$, that is, $n_i / n_{i,j}$ for $s = s_j$. In Example 4.2, since all sensitive values are distinct in an anonymized group (i.e., the total number of tuples in this anonymized group with sensitive value $s$ is equal to 1), the anonymized group size ratio is equal to the anonymized group size. In the next section, we will show that a larger anonymized group size ratio reduces the breach probability.
5 Theoretical Properties
In the previous section, we described how a larger anonymized group size ratio can reduce the breach probability. In this section, we will first study some properties of our problem, including a minimum anonymized group size ratio for the global privacy guarantee, and then a monotonicity property that can be useful in data anonymization.
5.1 Minimum Size Ratio
Recall that $n_i$ is the size of the anonymized group ($AG$) in the $i$-th table and $n_{i,j}$ is the number of tuples in the anonymized group with sensitive value $s_j$. In the following, we will derive the minimum anonymized group size ratio for privacy protection under the global guarantee.
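Let $k$ be an integer greater than 1. Suppose the anonymized group in $P_k$ containing individual $o$ is linked to $s_j$. Then $p(o, s_j, k) \le 1/\ell$ if and only if
$$\frac{n_k}{n_{k,j}} \ge \frac{\ell\,(1 - p(o, s_j, k-1))}{1 - \ell\, p(o, s_j, k-1)}$$
Proof: By Lemma 4.2, $p(o, s_j, k)$ is equal to $1 - (1 - p(o, s_j, k-1)) \cdot \frac{n_k - n_{k,j}}{n_k}$. Hence $p(o, s_j, k) \le 1/\ell$ holds if and only if $(1 - p(o, s_j, k-1))\,(1 - n_{k,j}/n_k) \ge 1 - 1/\ell$, which, for $p(o, s_j, k-1) < 1/\ell$, rearranges to the inequality above.
From the above, for any $k$, we can see that the value of $n_k / n_{k,j}$ should be lower bounded by the value of
$$\frac{\ell\,(1 - p(o, s_j, k-1))}{1 - \ell\, p(o, s_j, k-1)}$$
We define this bound to be $\infty$ when $p(o, s_j, k-1) = 1/\ell$.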
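[Running Example] From Example 4.2, we know that the published tables shown in Figure 4 satisfy the privacy requirement (i.e., $p(o_2, s_1, k) = 7/16 \le 1/\ell$ where $\ell = 2$ and $k = 2$). At time point 3, we want to publish a new table $P_3$ from a raw table $T_3$ which contains $o_2$.
$$\frac{\ell\,(1 - p(o_2, s_1, 2))}{1 - \ell\, p(o_2, s_1, 2)} = \frac{2\,(1 - 7/16)}{1 - 2 \cdot 7/16} = 9$$
which is the minimum anonymized group size ratio in the published table $P_3$. Suppose the anonymized group containing $o_2$ contains only one occurrence of $s_1$. Then, the size of the anonymized group should be at least 9 so that $p(o_2, s_1, 3) \le 1/2$.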
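The minimum size ratio for the next release follows from the previous breach probability alone. The sketch below assumes the bound takes the form $\ell(1 - p)/(1 - \ell p)$, where $p$ is the breach probability over the earlier releases, as obtained by rearranging the closed form; the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

def minimum_ratio(ell, previous_breach):
    """Smallest size ratio that keeps the breach probability at most 1/ell."""
    p = Fraction(previous_breach)
    if p >= Fraction(1, ell):
        # At or above the threshold no finite ratio is sufficient.
        raise ValueError("previous breach probability must be below 1/ell")
    return ell * (1 - p) / (1 - ell * p)

def minimum_group_size(ell, previous_breach, occurrences):
    """Smallest integer group size for a group with `occurrences` copies of the value."""
    return ceil(minimum_ratio(ell, previous_breach) * occurrences)

# Third release after the two size-4 releases of the running example.
print(minimum_ratio(2, Fraction(7, 16)))          # 9
print(minimum_group_size(2, Fraction(7, 16), 1))  # 9
```

With a group of size 9 and one occurrence, the breach probability becomes $1 - (9/16)(8/9) = 1/2$, which meets the threshold exactly.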
We have the following corollary when the inequality in Theorem 5.1 becomes an equality.
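$p(o, s_j, k) = 1/\ell$ if and only if $\frac{n_k}{n_{k,j}} = \frac{\ell\,(1 - p(o, s_j, k-1))}{1 - \ell\, p(o, s_j, k-1)}$.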
When a record for individual $o$ appears in a data release and, in the published data $P_k$, the anonymized group containing $o$ has no relation to sensitive value $s_j$, then intuitively, this release should not have any impact on the privacy protection of linking $o$ to $s_j$. This is formally stated in the following lemma.
If the anonymized group in $P_k$ containing $o$ is not linked to $s_j$, then $p(o, s_j, k) = p(o, s_j, k-1)$. Proof: Since the anonymized group in $P_k$ containing $o$ is not linked to $s_j$, any possible world in which $o$ is linked to $s_j$ must link $o$ to $s_j$ in one of the first $k-1$ published tables. Thus, $n_{k,j} = 0$ and $\frac{n_k - n_{k,j}}{n_k} = 1$.
Thus, by Lemma 4.2, we have
$$p(o, s_j, k) = 1 - \prod_{i=1}^{k-1} \frac{n_i - n_{i,j}}{n_i} = p(o, s_j, k-1)$$
Thus, $n_k$ can take any value and does not affect the value of $p(o, s_j, k)$ in this case.
Suppose a published table $P_k$ contains $o$ and we need to generate an anonymized group $AG$ containing $o$. Note that the size of the anonymized group is $n_k$ and the number of tuples in $AG$ with sensitive value $s_j$ is equal to $n_{k,j}$ for $j = 1, 2, \ldots, m$. Without loss of generality, suppose we want to protect the linkage between an individual $o$ and a sensitive value $s_j$. From Theorem 5.1 and Lemma 5.1, we can determine the minimum value of $n_k$ for generating an anonymized group $AG$. From Theorem 5.1, if $AG$ contains $s_j$, in order to guarantee $p(o, s_j, k) \le 1/\ell$, we have to set the value of $n_k$ to satisfy
$$n_k \ge n_{k,j} \cdot \frac{\ell\,(1 - p(o, s_j, k-1))}{1 - \ell\, p(o, s_j, k-1)}$$
From Lemma 5.1, if $AG$ does not contain $s_j$, any value of $n_k$ will not affect the privacy related to $o$ and $s_j$.
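Theorem 5.1 suggests that if we set the value of $n_k / n_{k,j}$ to at least $\frac{\ell\,(1 - p(o, s_j, k-1))}{1 - \ell\, p(o, s_j, k-1)}$, then $p(o, s_j, k) \le 1/\ell$. However, suppose we set this value exactly equal to the bound. Although we can guarantee $p(o, s_j, k) \le 1/\ell$ for the $k$ published tables, there will be a privacy breach (i.e., $p(o, s_j, k') > 1/\ell$ for $k' > k$) for any additional future published table in which an anonymized group containing $o$ is linked to $s_j$. This is a result of the following lemma.
Consider that we have published $k$ tables where an anonymized group in $P_k$ containing $o$ is linked to $s_j$. Suppose we are to publish $P_{k+1}$ where an anonymized group in $P_{k+1}$ containing $o$ is also linked to $s_j$. If $p(o, s_j, k) = 1/\ell$, then $p(o, s_j, k+1) > 1/\ell$.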
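The effect of sitting exactly on the threshold can be illustrated numerically. The sketch below uses the one-step update implied by the closed form, namely that the new breach probability is $1 - (1 - p)(n - c)/n$ for a group of size $n$ with $c$ occurrences of the sensitive value; the function name and the chosen group sizes are illustrative.

```python
from fractions import Fraction

def next_breach(previous_breach, size, occurrences):
    """Breach probability after one more release linked to the sensitive value."""
    keep = Fraction(size - occurrences, size)
    return 1 - (1 - Fraction(previous_breach)) * keep

ell = 2
at_threshold = Fraction(1, ell)  # breach probability exactly at 1/ell

# However large the next group is, the threshold is exceeded.
for size in (10, 1000, 10**6):
    later = next_breach(at_threshold, size, 1)
    print(size, later, later > at_threshold)
```

This is why a group size strictly above the minimum is needed whenever further releases linking the individual to the same value are expected.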
Monotonicity is a useful property for anonymization processes in which the resulting anonymized groups are constructed in a bottom-up manner, merging smaller groups that violate the privacy requirement into bigger groups which may guarantee privacy. It is also useful when the anonymization is top-down, splitting bigger groups into smaller ones as long as the privacy guarantee holds.
Consider the privacy protection for the linkage of an individual $o$ to a sensitive value $s$. From Lemma 5.1, we know that $p(o, s, k)$ is independent of data releases in which the anonymized group containing $o$ (in the published table) is not linked to $s$. Hence, in the following, we consider the worst-case scenario where, in every release in which there exists an anonymized group containing $o$ (in a published table), that group is linked to $s$.
The monotonicity property is described as follows.
[Monotonicity] $p(o, s_j, k)$ is strictly decreasing when $n_k / n_{k,j}$ increases.
The proof is given in the appendix. Note that $n_k / n_{k,j}$ is essentially the inverse of the proportion of $s_j$ tuples in the anonymized group. Therefore, when a bigger group that satisfies the privacy requirement is split into smaller ones, if the proportion of $s_j$ tuples in the small group containing $o$ is not increased, then $p(o, s_j, k)$ is not increased. Conversely, if a small group violates the privacy guarantee, merging it with another group may decrease the proportion of $s_j$ tuples, and thus $p(o, s_j, k)$ may be decreased.
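Monotonicity can be observed on a small numeric example. The sketch below fixes a hypothetical breach probability of 1/4 over the earlier releases and evaluates the breach probability after one more release for increasing size ratios, using the update $1 - (1 - p)(1 - 1/r)$ implied by the closed form; the function name is illustrative.

```python
from fractions import Fraction

def breach_with_ratio(previous_breach, ratio):
    """Breach probability after a release whose group has the given size ratio."""
    return 1 - (1 - Fraction(previous_breach)) * (1 - 1 / Fraction(ratio))

previous = Fraction(1, 4)  # hypothetical value for the earlier releases
values = [breach_with_ratio(previous, r) for r in (2, 3, 4, 5)]
print([str(v) for v in values])  # ['5/8', '1/2', '7/16', '2/5']
```

Each increase in the ratio lowers the breach probability, and the values approach the previous breach probability as the ratio grows.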
An anonymized group is said to violate the global guarantee if there exists an individual $o$ and a sensitive value $s$ such that $p(o, s, k) > 1/\ell$.
Consider an anonymized group $AG$ in the published table $P_k$ which violates the global guarantee. If we partition $AG$ into a number of smaller groups, one of the smaller groups violates the global guarantee. Proof Sketch: Suppose $r$ is the size ratio for $AG$. It is easy to see that one of the smaller groups has a size ratio no larger than $r$: otherwise, every smaller group would contain proportionally fewer $s$ tuples than $AG$, which is impossible. By Theorem 5.2, $p(o, s, k)$ does not decrease. Since $AG$ violates the global guarantee (i.e., $p(o, s, k) > 1/\ell$), the smaller group also violates the global guarantee (i.e., $p(o, s, k) > 1/\ell$).
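The counting step in the proof sketch, that some part of any partition has a size ratio no larger than that of the original group, can be checked exhaustively for two-way splits of a small group. The sketch below is illustrative; the function name is hypothetical, and parts without any occurrence of the sensitive value are skipped because they are not linked to it.

```python
from fractions import Fraction

def every_split_has_weak_part(size, occurrences):
    """True if each two-way split has a part with ratio <= the original ratio."""
    original = Fraction(size, occurrences)
    for left_size in range(1, size):
        right_size = size - left_size
        for left_occ in range(occurrences + 1):
            right_occ = occurrences - left_occ
            if left_occ > left_size or right_occ > right_size:
                continue  # a part cannot hold more occurrences than tuples
            parts = ((left_size, left_occ), (right_size, right_occ))
            ratios = [Fraction(n, c) for n, c in parts if c > 0]
            if min(ratios) > original:
                return False
    return True

print(every_split_has_weak_part(6, 2))
print(every_split_has_weak_part(9, 3))
```

An even split such as sizes (3, 3) with one occurrence each gives both parts the original ratio of 3, which is why the bound is "no larger than" rather than strictly smaller.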
In previous sections, we have observed that, by choosing a proper size for an anonymized group, the global privacy guarantee can be achieved. In general, a size above a certain threshold size can be chosen. However, setting a size equal to the threshold size will make future anonymization infeasible (see Theorem 5.1). Therefore, it is necessary to choose a size that is greater than the threshold. The increase in size, however, would lead to a decrease in the utility of the data. Hence, the question is how to pick the smallest size that can maintain the global guarantee.
In this section, we show that if we are given a bound on the number of releases where an individual $o$ may be linked to a sensitive value $s$, then we can devise a strategy to minimize the maximum anonymized group size. We also propose another strategy which aims to reduce the anonymized group size on average.
6.1 Constant-Ratio Strategy
In database related problems, one can typically derive effective mechanisms based on the characteristics of the data itself. In our problem scenario, a data publisher has at his/her disposal the statistical information of the data collections. For example, consider the medical database. The statistics can point to the expected frequency of an individual contracting a certain disease over his or her lifespan. With such information, one can set an estimated bound on the number of data releases that a person may indeed be linked to the disease. With this knowledge, one can adopt a constant-ratio strategy which we shall show readily can minimize the maximum size of the corresponding anonymized groups.
The constant-ratio strategy makes sure that the size of the anonymized group for individual $o$ containing $s_j$, divided by the number of occurrences of $s_j$, remains unchanged over a number of data releases. Formally, given an integer $k'$ for the number of data releases, for $i = 1, 2, \ldots, k'$,
$$\frac{n_{t_i}}{n_{t_i, j}} = r$$
where is a positive real number constant, and is a timestamp for the -th release where both and appear. For the sake of simplicity, we set where and are positive integer constants where .
corresponds to the total number of possible releases in the future. In other words, during data publishing, the data publisher expects to publish table for this data. With this given parameter , we can calculate and such that remain unchanged when changes.
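In order to make sure that $p(o, s, i) \le 1/\ell$ for any $i \le k'$, we need to ensure that $p(o, s, k') \le 1/\ell$. In the following, we consider $p(o, s, k')$, which, by the closed form with a constant ratio $r$ in each of the $k'$ releases, is equal to
$$1 - \left(1 - \frac{1}{r}\right)^{k'}$$
Table 1 shows the values of $r$ with selected values of $\ell$ and $k'$. When $k'$ increases, $r$ increases. When $\ell$ increases, $r$ also increases.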
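The trend described above can be reproduced numerically. The sketch below assumes that the constant ratio $r$ is chosen so that the breach probability reaches exactly $1/\ell$ after the last of the expected releases, that is, $1 - (1 - 1/r)^{k'} = 1/\ell$, which gives $r = 1 / (1 - (1 - 1/\ell)^{1/k'})$. This choice and the function name are assumptions made for illustration and do not reproduce the values of Table 1.

```python
def constant_ratio(ell, releases):
    """Ratio r solving 1 - (1 - 1/r) ** releases == 1 / ell."""
    keep_per_release = (1 - 1 / ell) ** (1 / releases)
    return 1 / (1 - keep_per_release)

for ell in (2, 3, 4):
    row = [round(constant_ratio(ell, k), 2) for k in (1, 2, 5, 10)]
    print(ell, row)
```

For a single release the ratio equals $\ell$, matching the localized guarantee, and it grows with both the number of releases and $\ell$.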
It remains to show that the constant-ratio strategy indeed can lead to data publishing that minimizes the maximum anonymized group sizes. First, we define this property more formally.
[Min-Max Optimization] An anonymization for serial data publishing is min-max optimal if the maximum anonymized group size among the anonymized groups containing individual $o$ and sensitive value $s$, for any given $o$ and $s$ over all data releases, is minimized.
[Optimality] The constant-ratio strategy generates a min-max optimal solution for serial data publishing. Proof: Let be the set of anonymized group sizes in the published tables where these anonymized groupes contain and are linked to . That is, . Let