Composition Attacks and Auxiliary Information in Data Privacy
Privacy is an increasingly important aspect of data publishing. Reasoning about privacy, however, is fraught with pitfalls. One of the most significant is the auxiliary information (also called external knowledge, background knowledge, or side information) that an adversary gleans from other channels such as the web, public records, or domain knowledge. This paper explores how one can reason about privacy in the face of rich, realistic sources of auxiliary information. Specifically, we investigate the effectiveness of current anonymization schemes in preserving privacy when multiple organizations independently release anonymized data about overlapping populations.
We investigate composition attacks, in which an adversary uses independent anonymized releases to breach privacy. We explain why recently proposed models of limited auxiliary information fail to capture composition attacks. Our experiments demonstrate that even a simple instance of a composition attack can breach privacy in practice, for a large class of currently proposed techniques. The class includes $k$-anonymity and several recent variants.
On a more positive note, certain randomization-based notions of privacy (such as differential privacy) provably resist composition attacks and, in fact, the use of arbitrary side information. This resistance enables “stand-alone” design of anonymization schemes, without the need for explicitly keeping track of other releases. We provide a precise formulation of this property, and prove that an important class of relaxations of differential privacy also satisfy the property. This significantly enlarges the class of protocols known to enable modular design.
|Srivatsava Ranjit Ganta|
|Penn State University|
|University Park, PA 16802|
|Penn State University|
|University Park, PA 16802|
|Penn State University|
|University Park, PA 16802|
Privacy is an increasingly important aspect of data publishing. The potential social benefits of analyzing large collections of personal information (census data, medical records, social networks) are significant. At the same time, the release of information from such repositories can be devastating to the privacy of individuals or organizations [?]. The challenge is therefore to discover and release the global characteristics of these databases without compromising the privacy of the individuals whose data they contain.
Reasoning about privacy, however, is fraught with pitfalls. One of the most significant difficulties is the auxiliary information (also called external knowledge, background knowledge, or side information) that an adversary gleans from other channels such as the web or public records. For example, simply removing obviously identifying information such as names and addresses does not suffice to protect privacy, since the remaining information (such as zip code, gender and date of birth [?]) may still identify a person uniquely when combined with auxiliary information (such as voter registration records). Schemes that resist such linkage have been the focus of extensive investigation, starting with work on publishing contingency tables [?], and more recently, in a line of techniques based on “$k$-anonymity” [?].
This paper explores how one can reason about privacy in the face of rich, realistic sources of auxiliary information. This follows lines of work in both the data mining [?, ?, ?] and cryptography [?, ?] communities that have sought principled ways to incorporate unknown auxiliary information into anonymization schemes. Specifically, we investigate the effectiveness of current anonymization schemes in preserving privacy when multiple organizations independently release anonymized data about overlapping populations. We show new attacks on some schemes and also deepen the current understanding of schemes known to resist such attacks. Our results and their relation to previous work are discussed below.
Schemes that retain privacy guarantees in the presence of independent releases are said to compose securely. The terminology, borrowed from cryptography (which borrowed, in turn, from software engineering), stems from the fact that schemes which compose securely can be designed in a stand-alone fashion without explicitly taking other releases into account. Thus, understanding independent releases is essential for enabling modular design. In fact, one would like schemes that compose securely not only with independent instances of themselves, but with arbitrary external knowledge. We discuss both types of compositions in this paper.
The dual problem to designing schemes with good composition properties is the design of attacks that exploit such information. We call these composition attacks. A simple example of such an attack, in which two hospitals with overlapping patient populations publish anonymized medical data, is presented below. Composition attacks highlight a realistic and important class of vulnerabilities. As privacy-preserving data publishing becomes more commonly deployed, it is increasingly difficult to keep track of all the organizations that publish anonymized summaries involving a given individual or entity, and schemes that are vulnerable to composition attacks will become increasingly difficult to use safely.
Our contributions are summarized briefly in the abstract, above, and discussed in more detail in the following subsections.
We introduce composition attacks and study their effect on a popular class of partitioning-based anonymization schemes. Very roughly, computer scientists have worked on two broad classes of anonymization techniques. Randomization-based schemes introduce uncertainty either by randomly perturbing the raw data (a technique called input perturbation, randomized response, e.g., [?, ?, ?], or post-randomization, e.g., [?]), or by injecting randomness into the algorithm used to analyze the data (e.g., [?, ?]). Partition-based schemes cluster the individuals in the database into disjoint groups satisfying certain criteria (for example, in $k$-anonymity [?], each group must have size at least $k$). For each group, certain exact statistics are calculated and published. Partition-based schemes include $k$-anonymity [?] as well as several recent variants, e.g., [?, ?, ?, ?, ?].
Because they release exact information, partition-based schemes seem especially vulnerable to composition attacks. In the first part of this paper we study a simple instance of a composition attack called an intersection attack. We observe that the specific properties of current anonymization schemes make this attack possible, and we evaluate its success empirically.
Example. Suppose two hospitals $H_1$ and $H_2$ in the same city release anonymized patient-discharge information. Because they are in the same city, some patients may visit both hospitals with similar ailments. The two tables for this example, (a) and (b), represent (hypothetical) independent $k$-anonymizations of the discharge data from $H_1$ and $H_2$ using $k=4$ and $k=6$, respectively. The sensitive attribute here is the patient’s medical condition. It is left untouched. The other attributes, deemed non-sensitive, are generalized (that is, replaced with aggregate values), so that within each group of rows, the vectors of non-sensitive attributes are identical. If Alice’s employer knows that she is 28 years old, lives in zip code 13012 and recently visited both hospitals, then he can attempt to locate her in both anonymized tables. Alice matches four potential records in $H_1$’s data, and six potential records in $H_2$’s. However, the only disease that appears in both matching lists is AIDS, and so Alice’s employer learns the reason for her visit.
Intersection Attacks. The above example relies on two properties of the partition-based anonymization schemes: (i) Exact sensitive value disclosure: the “sensitive” value corresponding to each member of the group is published exactly; and (ii) Locatability: given any individual’s non-sensitive values (non-sensitive values are exactly those that are assumed to be obtainable from other, public information sources), one can locate the group in which the individual has been placed. Based on these properties, an adversary can narrow down the set of possible sensitive values for an individual by intersecting the sets of sensitive values present in his/her groups from multiple anonymized releases.
Properties (i) and (ii) turn out to be widespread. The exact disclosure of sensitive value lists is a design feature common to all the schemes based on $k$-anonymity: preserving the exact distribution of sensitive values is important, and so no recoding is usually applied. Locatability is less universal, since it depends on the exact choice of clustering algorithm (used to form groups) and the recoding applied to the non-sensitive attributes. However, some schemes always satisfy locatability by virtue of their structure (e.g., schemes that recursively partition the data set along the lines of a hierarchy that is subsequently used for generalization [?, ?]). For other schemes, locatability is not perfect, but our experiments suggest that using simple heuristics one can locate an individual’s group with high probability.
Even with these properties, it is difficult to come up with a theoretical model for intersection attacks because the partitioning techniques generally create dependencies that are hard to model analytically. However, if the sensitive values of the members of a group could be assumed to be statistically independent of their non-sensitive attribute values, then a simple birthday-paradox-style analysis would yield reasonable bounds.
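To give a feel for what such an analysis might look like, the sketch below computes the expected size of the intersected sensitive value set for one target under a deliberately strong assumption that is not made in the text: the other members of each group draw their sensitive values independently and uniformly from a domain of fixed size. The function name `expected_intersection_size` and its parameters are hypothetical.

```python
def expected_intersection_size(domain_size, k1, k2):
    """Expected size of the intersected sensitive value set for one target.

    Assumption (illustrative only): the other k1-1 and k2-1 group members
    draw sensitive values independently and uniformly from the domain.
    """
    d = domain_size
    # Probability that a fixed non-target value occurs among the other members.
    q1 = 1.0 - (1.0 - 1.0 / d) ** (k1 - 1)
    q2 = 1.0 - (1.0 - 1.0 / d) ** (k2 - 1)
    # The target's own value is always shared; every other value is shared
    # with probability q1 * q2 (linearity of expectation).
    return 1.0 + (d - 1) * q1 * q2


if __name__ == "__main__":
    for k in (5, 10, 25):
        print(k, round(expected_intersection_size(50, k, k), 3))
```

Under this simplified model the expected intersection grows with the group sizes, so larger groups leave the adversary with more candidate values.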
Experimental Results. Instead, we evaluated the success of intersection attacks empirically. We ran the intersection attack on two popular census databases anonymized using partition-based schemes. We evaluated the severity of such an attack by measuring the number of individuals who had their sensitive value revealed. Our experimental results confirm that partitioning-based anonymization schemes including $k$-anonymity and its recent variants, $\ell$-diversity and $t$-closeness, are indeed vulnerable to intersection attacks. The experimental section elaborates on our methodology and results.
Related Work on Modeling Background Knowledge. It is important to point out that the partition-based schemes in the literature were not designed to be used in contexts where independent releases are available. Thus, we do not view our results as pointing out a flaw in these schemes, but rather as directing the community’s attention to an important direction for future work.
It is equally important to highlight the progress that has already been made on modeling sophisticated background knowledge in partition-based schemes. One line has focused on taking into account other, known releases, such as previous publications by the same organization (“sequential” releases, [?, ?, ?]) and multiple views of the same data set [?]. Another line has considered incorporating knowledge of the clustering algorithm used to group individuals [?]. Most relevant to this paper are works that have sought to model unknown background knowledge. Martin et al. [?] and Chen et al. [?] provide complexity measures for an adversary’s side information (roughly, they measure the size of the smallest formula within a CNF-like class that can encode the side information). Both works design schemes that provably resist attacks based on side information whose complexity is below a given threshold.
Independent releases (and hence composition attacks) fall outside the models proposed by these works. The sequential release models do not fit because they assume the other releases are known to the anonymization algorithm. The complexity-based measures do not fit because independent releases appear to have complexity that is linear in the size of the data set.
Composition attacks appear to be difficult to reason about, and it is not initially clear whether it is possible at all to design schemes that resist such attacks. Even defining composition properties precisely is tricky in the presence of malicious behavior (for example, see [?] for a recent survey about composability of cryptographic protocols). Nevertheless, a significant family of anonymization definitions do provide guarantees against composition attacks, namely schemes that satisfy differential privacy [?]. Recent work has greatly expanded the applicability of differential privacy and its relaxations, both in the theoretical [?, ?, ?, ?, ?] and applied [?, ?, ?] literature. However, certain recently developed techniques such as sampling [?], instance-based noise addition [?] and data synthesis [?] appear to require relaxations of the definition.
It is simple to prove that both the strict and relaxed variants of differential privacy compose well (see [?, ?, ?]). Less trivially, however, one can prove that strictly differentially-private algorithms also provide meaningful privacy in the presence of arbitrary side information (Dwork and McSherry, [?]). In particular, these schemes compose well even with completely different anonymization schemes.
It is natural to ask if there are weaker definitions which provide similar guarantees. Certainly not all of them do: one natural relaxation of differential privacy, which replaces the multiplicative distance used in differential privacy with total variation distance, fails completely to protect privacy (see example 2 in [?]).
In this paper, we prove that two important relaxations of differential privacy do, indeed, resist arbitrary side information. First, we provide a Bayesian formulation of differential privacy which makes its resistance to arbitrary side information explicit. Second, we prove that the relaxed definitions of [?, ?] still imply the Bayesian formulation. The proof is non-trivial, and relies on the “continuity” of Bayes’ rule with respect to certain distance measures on probability distributions. Our result means that the recent techniques mentioned above [?, ?, ?, ?] can be used modularly with the same sort of assurances as in the case of strictly differentially-private algorithms.
Let $T = \{t_1, \ldots, t_n\}$ be a multiset of tuples where each tuple corresponds to an individual in the database. Let $R$ be an anonymized version of $T$. From this point on, we use the terms tuple and individual interchangeably, unless the context leads to ambiguity. Let $\mathcal{A} = \{A_1, \ldots, A_r\}$ be a collection of attributes and $t$ be a tuple in $R$; we use the notation $t[\mathcal{A}]$ to denote $(t[A_1], \ldots, t[A_r])$ where each $t[A_i]$ denotes the value of attribute $A_i$ in table $R$ for $t$.
In partitioning-based anonymization approaches, there exists a division of data attributes into two classes, sensitive attributes and non-sensitive attributes. A sensitive attribute is one whose value and an individual’s association with that value should not be disclosed. All attributes other than the sensitive attributes are non-sensitive attributes.
Definition 1 (Quasi-identifier)
A set of non-sensitive attributes is called a quasi-identifier if there is at least one individual in the original sensitive database who can be uniquely identified by linking these attributes with auxiliary data.
Previous work in this line typically assumed that all the attributes in the database other than the sensitive attribute form the quasi-identifier.
Definition 2 (Equivalence Class)
An equivalence class for a table $R$ with respect to attributes in $\mathcal{A}$ is the set of all tuples $t_1, t_2, \ldots, t_i \in R$ for which the projection of each tuple onto attributes in $\mathcal{A}$ is the same, i.e., $t_1[\mathcal{A}] = t_2[\mathcal{A}] = \cdots = t_i[\mathcal{A}]$.
Partition-based schemes cluster individuals into groups, and then recode (i.e., generalize or change) the non-sensitive values so that each group forms an equivalence class with respect to the quasi-identifiers. Sensitive values are not recoded. Different criteria are used to decide how, exactly, the groups should be structured. The most common rule is $k$-anonymity, which requires that each equivalence class contain at least $k$ individuals.
Definition 3 ($k$-anonymity)
A release $R$ is $k$-anonymous if for every tuple $t \in R$, there exist at least $k-1$ other tuples $t_1, t_2, \ldots, t_{k-1} \in R$ such that $t[\mathcal{A}] = t_1[\mathcal{A}] = \cdots = t_{k-1}[\mathcal{A}]$ for every collection $\mathcal{A}$ of attributes in the quasi-identifier.
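As a minimal sketch of how this definition can be checked on a published table, the following counts how often each combination of quasi-identifier values occurs. The row layout (a list of dictionaries), the attribute names, and the function name `is_k_anonymous` are assumptions made for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifier, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(row[attr] for attr in quasi_identifier) for row in rows)
    return all(count >= k for count in counts.values())


# Hypothetical generalized release with two equivalence classes of size 2.
release = [
    {"age": "20-29", "zip": "130**", "condition": "Flu"},
    {"age": "20-29", "zip": "130**", "condition": "Cancer"},
    {"age": "30-39", "zip": "130**", "condition": "Flu"},
    {"age": "30-39", "zip": "130**", "condition": "Ulcer"},
]

if __name__ == "__main__":
    print(is_k_anonymous(release, ["age", "zip"], 2))
    print(is_k_anonymous(release, ["age", "zip"], 3))
```

Note that the check looks only at the non-sensitive attributes; the sensitive column is never consulted, which is what leaves room for the attack studied here.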
In our experiments we also consider two extensions to $k$-anonymity.
Definition 4 (Entropy $\ell$-diversity)
For an equivalence class $E$, let $S$ denote the domain of the sensitive attribute, and let $p(E, s)$ be the fraction of records in $E$ that have sensitive value $s$. Then $E$ is $\ell$-diverse if:
$$-\sum_{s \in S} p(E, s) \log p(E, s) \;\ge\; \log \ell.$$
A table is $\ell$-diverse if all its equivalence classes are $\ell$-diverse.
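The entropy condition can be evaluated directly from the list of sensitive values in one equivalence class. The sketch below is illustrative only; the function names and the small numerical tolerance are assumptions, not part of the definition.

```python
import math
from collections import Counter


def class_entropy(sensitive_values):
    """Entropy (natural log) of the sensitive values within one equivalence class."""
    n = len(sensitive_values)
    return -sum((c / n) * math.log(c / n) for c in Counter(sensitive_values).values())


def is_entropy_l_diverse(sensitive_values, l, tol=1e-12):
    """Check entropy >= log(l); `tol` only absorbs floating-point rounding."""
    return class_entropy(sensitive_values) >= math.log(l) - tol


if __name__ == "__main__":
    print(is_entropy_l_diverse(["Flu", "Cancer", "Ulcer", "Asthma"], 2))
    print(is_entropy_l_diverse(["Flu", "Flu", "Flu", "Cancer"], 3))
```

A class dominated by a single value has low entropy even when several distinct values are present, so it fails the test for larger values of $\ell$.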
Definition 5 ($t$-closeness)
An equivalence class is $t$-close if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold $t$. A table is $t$-close if all its equivalence classes are $t$-close.
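The definition leaves the distance between distributions unspecified in this text. The sketch below assumes total variation distance purely for illustration; a different distance may well be intended, and the function names are hypothetical.

```python
from collections import Counter


def distribution(values):
    """Empirical distribution of a list of categorical values."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (assumed metric)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def is_t_close(class_values, table_values, t):
    """Check whether one equivalence class is within distance t of the whole table."""
    return total_variation(distribution(class_values), distribution(table_values)) <= t


if __name__ == "__main__":
    table = ["Flu", "Flu", "Cancer", "Cancer"]
    print(is_t_close(["Flu", "Cancer"], table, 0.1))
    print(is_t_close(["Flu", "Flu"], table, 0.4))
```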
Locatability. As mentioned in the introduction, many anonymization algorithms satisfy locatability, that is, they output tables in which one can locate an individual’s group based only on his or her non-sensitive values.
Definition 6 (locatability)
Let $Q$ be the set of quasi-identifier values of an individual in the original database $T$. Given the $k$-anonymized release $R$ of $T$, the locatability property allows an adversary to identify the set of tuples $\{t_1, \ldots, t_K\}$ in $R$ (where $K \ge k$) that correspond to $Q$.
Locatability does not necessarily hold for all partition-based schemes, since it depends on the exact choice of clustering algorithm (used to form groups) and the recoding applied to the non-sensitive attributes. However, it is widespread. Some schemes always satisfy locatability by virtue of their structure (e.g., schemes that recursively partition the data set along the lines of a hierarchy always provide locatability if the attributes are then generalized using the same hierarchy, or if (min,max) summaries are used [?, ?]). For other schemes, locatability is not perfect, but our experiments suggest that simple heuristics can locate a person’s group with good probability. For example, microaggregation [?] clusters individuals based on Euclidean distance. The vectors of non-sensitive values in each group are replaced by the centroid (i.e., average) of the vectors. The simplest heuristic for locating an individual’s group is to choose the group with the closest centroid vector. In experiments on census data, this correctly located approximately 70% of individuals. In our attacks, we always assume locatability. This assumption was also made in previous studies [?, ?].
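The closest-centroid heuristic for microaggregated releases can be sketched in a few lines. The numeric encoding of the non-sensitive attributes, the example centroids, and the name `locate_group` are assumptions for illustration; in practice attributes would likely need to be scaled consistently with the clustering step.

```python
import math


def locate_group(individual, centroids):
    """Return the index of the group whose published centroid is closest (Euclidean)."""
    return min(range(len(centroids)), key=lambda g: math.dist(individual, centroids[g]))


if __name__ == "__main__":
    # Hypothetical published centroids over (age, zip code).
    centroids = [(26.5, 13010.0), (41.0, 13950.0)]
    print(locate_group((28, 13012), centroids))
```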
Armed with these basic definitions, we now proceed to formalize the intersection attack (Algorithm 1).
Let $R_1, \ldots, R_n$ be $n$ independent anonymized releases with minimum partition-sizes of $k_1, \ldots, k_n$, respectively. Let $P$ be the overlapping population occurring in all the releases. The function Get_equivalence_class returns the equivalence class into which an individual falls in a given anonymized release. The function Sensitive_value_set returns the set of (distinct) sensitive values for the members in a given equivalence class.
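A minimal Python sketch of the intersection attack built from these two functions follows. The data layout (each release is a list of groups with generalized numeric ranges and an exact list of sensitive values) and all example values are hypothetical; this illustrates the idea and is not a transcription of Algorithm 1.

```python
def get_equivalence_class(release, individual):
    """Return the group whose generalized ranges cover the individual, or None."""
    for group in release:
        if all(lo <= individual[attr] <= hi for attr, (lo, hi) in group["ranges"].items()):
            return group
    return None


def sensitive_value_set(group):
    """Return the set of distinct sensitive values published for a group."""
    return set(group["sensitive"])


def intersection_attack(releases, individual):
    """Intersect the sensitive value sets of the groups that contain the individual."""
    candidates = None
    for release in releases:
        group = get_equivalence_class(release, individual)
        if group is None:
            continue  # not located in this release, so it adds no constraint
        values = sensitive_value_set(group)
        candidates = values if candidates is None else candidates & values
    return candidates if candidates is not None else set()


# Hypothetical releases from two organizations.
release_1 = [
    {"ranges": {"age": (20, 29), "zip": (13000, 13099)},
     "sensitive": ["AIDS", "Flu", "Cancer", "Flu"]},
    {"ranges": {"age": (30, 39), "zip": (13000, 13099)},
     "sensitive": ["Ulcer", "Flu", "Asthma", "Cold"]},
]
release_2 = [
    {"ranges": {"age": (25, 35), "zip": (13010, 13020)},
     "sensitive": ["AIDS", "Ulcer", "Diabetes", "Asthma", "Cold", "Gout"]},
]

if __name__ == "__main__":
    print(intersection_attack([release_1, release_2], {"age": 28, "zip": 13012}))
```

The sketch relies on both properties discussed earlier: locatability (the range test) and exact disclosure of sensitive values (the published lists).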
Definition 7 (Anonymity)
For each individual $i$ in $P$, the anonymity factor promised by each release $R_j$ is equal to the corresponding minimum partition-size $k_j$.
However, as pointed out in [?], the actual anonymity offered is less than this ideal value and is equal to the number of distinct sensitive values in each equivalence class. We call this the effective anonymity.
Definition 8 (Effective Anonymity)
For an individual $i$ in $P$, the effective anonymity offered by a release $R_j$ is equal to the number of distinct sensitive values of the partition into which the individual falls. Let $e_{ij}$ be the equivalence class or partition into which $i$ falls with respect to the release $R_j$, and let $s_{ij}$ denote the sensitive value set for $e_{ij}$. The effective anonymity for $i$ with respect to the release $R_j$ is:
$$EA_{ij} = |s_{ij}|.$$
For each target individual $i$, $EA_{ij}$ is the effective prior anonymity with respect to $R_j$ (anonymity before the intersection attack). In the intersection attack, the list of possible sensitive values associated with the target is equal to the intersection of all sensitive value sets $s_{ij}$, $j = 1, \ldots, n$. So the effective posterior anonymity ($\widehat{EA}_i$) for $i$ is:
$$\widehat{EA}_i = \Bigl|\,\bigcap_{j=1}^{n} s_{ij}\Bigr|.$$
The difference between the effective prior anonymity and effective posterior anonymity quantifies the drop in effective anonymity.
The vulnerable population ($VP$) is the number of individuals (among the overlapping population) for whom the intersection attack leads to a positive drop in the effective anonymity.
After performing the sensitive value set intersection, the adversary knows only a possible set of values that each individual’s sensitive attribute can take. So, the adversary deduces that with equal probability (under the assumption that the adversary does not have any further auxiliary information) the individual’s actual sensitive value is one of the values in the set $\bigcap_{j=1}^{n} s_{ij}$. So, the adversary’s confidence level for an individual can be defined as:
Definition 9 (Confidence level $C_i$)
For each individual $i$, the confidence level $C_i$ of the adversary in identifying the individual’s true sensitive value through the intersection attack is defined as
$$C_i = \frac{1}{\widehat{EA}_i}.$$
Now, given some confidence level $C$, we denote by $VP_C$ and $PVP_C$ the set and the percentage of overlapping individuals for whom the adversary can deduce the sensitive attribute value with a confidence level of at least $C$.
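These quantities are straightforward to compute once the sensitive value sets for each individual are known. The sketch below assumes a hypothetical dictionary that maps each overlapping individual to the list of sensitive value sets of the groups that individual falls into (one set per release); the function names are illustrative.

```python
def posterior_set(value_sets):
    """Intersection of the sensitive value sets across all releases."""
    result = set(value_sets[0])
    for values in value_sets[1:]:
        result &= set(values)
    return result


def anonymity_drop(value_sets):
    """Drop from effective prior anonymity to posterior anonymity, per release."""
    posterior = len(posterior_set(value_sets))
    return [len(set(values)) - posterior for values in value_sets]


def confidence(value_sets):
    """Adversary confidence: one over the effective posterior anonymity."""
    posterior = posterior_set(value_sets)
    if not posterior:
        raise ValueError("empty intersection: individual was not located consistently")
    return 1.0 / len(posterior)


def vulnerable_at(population, threshold):
    """Individuals (and their percentage) breached with confidence >= threshold."""
    hit = [name for name, sets in population.items() if confidence(sets) >= threshold]
    return hit, 100.0 * len(hit) / len(population)


# Hypothetical overlapping population with two releases each.
population = {
    "alice": [{"AIDS", "Flu", "Cancer"}, {"AIDS", "Ulcer"}],
    "bob": [{"Flu", "Cold", "Fever", "Ulcer"}, {"Flu", "Cold", "Fever", "Asthma"}],
    "carol": [{"A", "B", "C", "D"}, {"A", "B", "C", "D"}],
}

if __name__ == "__main__":
    print(vulnerable_at(population, 1.0))
    print(vulnerable_at(population, 0.3))
```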
In this section we describe our experimental study. (The code, parameter settings, and complete results are made available at: http://www.cse.psu.edu/~ranjit/kdd08.) The primary goal is to quantify the severity of such an attack on existing schemes. Although earlier works address problems with $k$-anonymization and adversarial background knowledge, to the best of our knowledge, none of these studies deal with attacks resulting from auxiliary independent releases. Furthermore, none of the studies so far have quantified the severity of such an attack.
We use three different partitioning-based anonymization techniques to demonstrate the intersection attack: $k$-anonymity, $\ell$-diversity, and $t$-closeness. For $k$-anonymity, we use the Mondrian multidimensional approach proposed in [?] and the microaggregation technique proposed in [?]. For $\ell$-diversity and $t$-closeness, we use the definitions of entropy $\ell$-diversity and $t$-closeness proposed in [?] and [?], respectively.
We use two census-based databases from the UCI Machine Learning repository [?]. The first one is the Adult database that has been used extensively in the $k$-anonymity based studies. The database was prepared in a similar manner to previous studies [?, ?] (also explained in the Adult attribute table below). The resulting database contained individual records corresponding to people. The second database is the IPUMS database that contains individual information from the 1997 census studies. We only use a subset of the attributes that are similar to the attributes present in the Adult database to maintain uniformity and to maintain quasi-identifiers. The IPUMS database contains individual records corresponding to a total of people. This data set was prepared as explained in the IPUMS attribute table below.
From both Adult and IPUMS databases, we generate two overlapping subsets (Subset 1 and Subset 2) by randomly sampling individuals without replacement from the total population. We fixed the overlap size to . For each of the databases, the two subsets are anonymized independently and the intersection attack is run on the anonymization results. All the experiments were run on a Pentium 4 system running Windows XP with 1GB RAM.
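One way to build two such subsets with a fixed overlap is sketched below. The construction, the subset and overlap sizes, and the seed are assumptions for illustration; the authors' actual sampling code is available at the URL given above.

```python
import random


def overlapping_subsets(population, subset_size, overlap_size, seed=0):
    """Draw two equally sized subsets that share exactly `overlap_size` individuals."""
    rng = random.Random(seed)
    needed = 2 * subset_size - overlap_size
    sample = rng.sample(population, needed)  # sampling without replacement
    shared = sample[:overlap_size]
    only_first = sample[overlap_size:subset_size]
    only_second = sample[subset_size:]
    return shared + only_first, shared + only_second, shared


if __name__ == "__main__":
    ids = list(range(10000))  # hypothetical record identifiers
    subset_1, subset_2, overlap = overlapping_subsets(ids, 1000, 200)
    print(len(subset_1), len(subset_2), len(overlap))
```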
Adult database:
|Attribute||Domain size||Class|
|Work Class||7||Quasi ID|
|Marital Status||7||Quasi ID|
|Native Country||41||Quasi ID|
IPUMS database:
|Attribute||Domain size||Class|
|Work Class||5||Quasi ID|
|Marital Status||6||Quasi ID|
|Birth Place||113||Quasi ID|
Our first goal is to quantify the extent of damage possible through the intersection attack. For this, we consider two possible situations: (i) Perfect breach and (ii) Partial breach.
A perfect breach occurs when the adversary can deduce the exact sensitive value of an individual. In other words, a perfect breach is when the adversary has a confidence level of 100% about the individual’s sensitive data. To estimate the probability of a perfect breach, we compute the percentage of overlapping population for whom the intersection attack leads to a final sensitive value set of size 1. The accompanying figure plots this result.
We consider three scenarios for anonymizing the two overlapping subsets: (i) Mondrian on both the data subsets, (ii) Microaggregation on both the data subsets, and (iii) Mondrian on the first subset and microaggregation on the second subset. $(k_1, k_2)$ represents the pair of $k$ values used to anonymize the first and the second subset, respectively. In the experiments, we use the same $k$ values for both the subsets ($k_1 = k_2$). Note that for simplicity, from now on we will be defining confidence level in terms of percentages.
For the Adult database, we found that around 12% of the population is vulnerable to a perfect breach for . For the IPUMS database, the figure is much more severe, around 60%. As the degree of anonymization increases, that is, as the value of $k$ increases, the percentage of vulnerable population goes down. The reason is that as the value of $k$ increases, the partition sizes in each subset increase. This leads to a larger intersection set and thus a lower probability of obtaining an intersection set of size 1.
Our next experiment aims to compute a more practical quantification of the severity of the intersection attack. In most cases, to inflict a privacy breach, all that the adversary needs to do is narrow the possible sensitive values down to a few values, which in itself could reveal a lot of information. For example, for a hospital discharge database, by narrowing the sensitive values of the disease/diagnosis down to a few values, say, “Flu”, “Fever”, or “Cold”, it could be concluded that the individual is suffering from a viral infection. In this case, the adversary’s confidence level is $1/3$ (about 33%). The accompanying figure plots the percentage of vulnerable population for whom the intersection attack leads to a partial breach for the Adult and IPUMS databases.
Here, we only use the first anonymization scenario described earlier, in which both the overlapping subsets of the database are anonymized using the Mondrian multidimensional technique. Observe that the severity of the attack increases alarmingly for a slight relaxation of the required confidence level. For example, in the case of the IPUMS database, around 95% of the population was vulnerable for a confidence level of 25% for . For the Adult database, although this value is not as alarming, more than 60% of the population was affected.
Our next goal is to measure the drop in anonymity occurring due to the intersection attack. To achieve this, we first take a closer look at the way these schemes work. As described in the earlier sections, the basic paradigm in partitioning-based anonymization schemes is to partition the data such that each partition size is at least $k$. The methodology behind partitioning and then summarizing varies from scheme to scheme. The minimum partition-size $k$ is thus used as a measure of the anonymity offered by these solutions. However, the effective (or true) anonymity supported by these solutions is far less than the presumed anonymity (refer to the discussion of effective anonymity above).
The accompanying figure plots the average partition sizes and the average effective anonymities for the overlapping population. Here again, we only consider the scenario where both the overlapping subsets are anonymized using the Mondrian multidimensional technique. Observe that the effective anonymity is much less than the partition size for both the data subsets. Also, note that these techniques result in partition sizes that are much larger than the minimum required of $k$. For example, the average partition size observed in the IPUMS database for is close to . To satisfy the $k$-anonymity definition, there is no need for any partition to be larger than $2k$. The reasoning is straightforward: splitting a partition of size greater than $2k$ into two yields partitions of size at least $k$. Additionally, splitting any partition of size $2k$ or more only results in preserving more information. The culprit behind the larger average partition sizes is generalization based on user-defined hierarchies. Since generalization-based partitioning cannot be controlled at finer levels, the resulting partition sizes tend to be much larger than the minimum required value.
For each individual in the overlapping population, the effective prior anonymity is equal to the effective anonymity. We define the average effective prior anonymity with respect to a release as the effective prior anonymities averaged over the individuals in the overlapping population. Similarly, the average effective posterior anonymity is the effective posterior anonymities averaged over the individuals in the overlapping population. The difference between the average effective prior anonymity and the average effective posterior anonymity gives the average drop in effective anonymity occurring due to the intersection attack. The accompanying figure plots the average effective prior anonymities and the average effective posterior anonymities for the overlapping population. Observe that the average effective posterior anonymity is much less than the average effective prior anonymity for both subsets. Also note that we measure the drop in anonymity by using effective anonymities instead of presumed anonymities. The situation only gets worse (drops get larger) when presumed anonymities are used.
We now consider the $\ell$-diversity and $t$-closeness extensions to the original $k$-anonymity definition. The goal again is to quantify the severity of the intersection attack by measuring the extent to which a partial breach occurs at varying adversary confidence levels. The accompanying figure plots the percentage of vulnerable population for whom the intersection attack leads to a partial breach for the Adult and IPUMS databases. Here, we anonymize both the subsets of the database with the same definition of privacy. We use Mondrian multidimensional $k$-anonymity with the additional constraints as defined by $\ell$-diversity and $t$-closeness. Panel (a) of the figure plots the result for $\ell$-diversity using the same $\ell$ value for both the subsets ($\ell_1 = \ell_2$) and with . Panel (b) plots the same for $t$-closeness. Even though these extended definitions seem to perform better than the original $k$-anonymity definition, they still lead to a considerable breach in the case of an intersection attack. This result is fairly intuitive in the case of $\ell$-diversity. Consider the definition of $\ell$-diversity: the sensitive value set corresponding to each partition should be “well” ($\ell$) diverse. However, there is no guarantee that the intersection of two well-diverse sets is itself well diverse. $t$-closeness fares similarly.
Also, both these definitions tend to force larger partition sizes, thus resulting in heavy information loss. A further figure plots the average partition sizes of the individuals corresponding to the overlapping population. It compares the partition sizes observed for $k$-anonymity, $\ell$-diversity, and $t$-closeness. For the IPUMS database, with a value of , $k$-anonymity produces partitions with an average partition size of . For the same value of $k$, with a value of , the average partition size obtained was close to . The partition sizes for $t$-closeness get even worse, where a combination of and yields partitions of average size close to . We can observe similar results for the Adult database.
In all of the above experiments we use “Occupation” (the occupation code of the individual) as the sensitive attribute for both the Adult and IPUMS databases, as shown in Tables [?] and [?]. The domain size of the Occupation attribute in the Adult database was , whereas the domain size in the IPUMS database was . One plausible reason for the attack being more severe on the IPUMS database is the size of the sensitive attribute domain. Because most partition sizes are much larger than the minimum required value, i.e. , in the Adult database it is possible that the sensitive value set of every partition contains all the possible values in the domain. This implies that an intersection of two sensitive value sets results in a set of size close to the size of the domain. Thus, the intersection attack may be less effective when the sensitive attribute domain size is smaller than the average partition size. Intuitively, it seems that when the sensitive attribute domain is large (on the order of several hundred values) the intersection attack would be more severe. Moreover, most real-life databases have sensitive attributes with large domains. For example, in a typical hospital discharge database, an ICD9 code is used to describe the diagnosis given to the patient. This code is a number from to [?] identifying the specific patient diagnosis. In other cases, the sensitive attribute domain sizes tend to be larger than this. The conjecture is that as the number of possible sensitive values increases, the intersection of two different sets results in a less diverse set.
In order to confirm this, we constructed two new versions of the IPUMS database by replacing the sensitive attribute “Occupation” of each individual with “Industry”, corresponding to the individual’s work, and “Income”, corresponding to the total income of the individual. The domain sizes of these attributes are summarized in Table [?]: the domain size for the “Industry” attribute is , for the original “Occupation” attribute is , and for “Income” is . We ran the intersection attack on these new versions of the IPUMS database and compared the results with the original. Figure [?] plots the average drop in effective anonymity for the overlapping population. Based on our conjecture, the drop in effective anonymity should increase with the sensitive attribute domain size. Surprisingly, we did not observe the trend we were expecting: the drop in effective anonymity for “Occupation” was smaller than for “Industry”. The reason is that the number of values that actually occur for a sensitive attribute need not equal the domain size, that is, the total number of possible values. So a large sensitive attribute domain does not guarantee that the number of values actually occurring is large. Instead, a simple measure such as Shannon entropy can be used to capture the number of values that effectively occur. The entropy of each of these attributes is listed in Table [?]. Although the domain size of the “Occupation” attribute is larger, its entropy is less than that of the “Industry” attribute.
The revised conjecture is that as the entropy (or information content) of the sensitive attribute increases, the severity of the intersection attack increases. Our result in Figure [?] confirms this. The average drop in effective anonymity increases with the entropy of the corresponding sensitive attribute, since the non-sensitive attributes are kept the same for all the datasets.
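The distinction between domain size and entropy can be made concrete with a short sketch. The two columns below are hypothetical and chosen only to show that an attribute with more distinct values can still have lower Shannon entropy; the quantity $2^H$ can be read as the effective number of equally likely values.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sensitive columns: `skewed` has four distinct values but one
# of them dominates; `uniform` has only three values, all equally frequent.
skewed = ["a"] * 97 + ["b", "c", "d"]
uniform = ["x", "y", "z"] * 30

for name, column in (("skewed", skewed), ("uniform", uniform)):
    h = shannon_entropy(column)
    print(name, "distinct values:", len(set(column)),
          "entropy (bits):", round(h, 3),
          "effective values:", round(2 ** h, 2))
```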
|Sensitive Attribute||Domain Size||Diversity|
In the above experiments we considered the scenario in which two anonymized releases contain information about an overlapping population. As data publishing becomes more prevalent among organizations that would like to share data for research and collaborative purposes, more than two anonymized releases may contain information about the same subset of people. The adversary could use as many anonymized releases as possible to gather information about a target population and use the intersection attack to deduce the sensitive attribute values. It is therefore interesting to see how the intersection attack performs in the presence of multiple (more than two) overlapping anonymized releases. We first consider the percentage of the vulnerable population at a confidence level of . Figure [?](a) plots this for a varying number of anonymized releases available to the adversary. Here again, we build overlapping subsets of the IPUMS database by fixing the overlapping population at . The severity of the intersection attack increases with the number of anonymized releases available to the adversary. For small values of $k$, the percentage of the vulnerable population increases significantly with the number of releases. However, there seems to be no such significant increase for larger values of $k$. The reason is that the partition sizes for larger values of $k$ tend to be large enough that additional anonymized releases no longer help the intersection attack. Besides the severity of the attack, we can study the effect of the number of anonymized releases on the drop in effective anonymity. Figure [?](b) plots the average drop in effective anonymity for a varying number of anonymized releases. Here again, the drop in effective anonymity increases with the number of anonymized releases. These results indicate that if the anonymized releases use fairly large values of $k$, the adversary gains only limited information by collecting additional releases.
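The effect of additional releases can be sketched with a small simulation. This is not the experimental setup used above: it assumes, purely for illustration, that in every release the target's partition contains the true sensitive value plus distinct values drawn uniformly at random from a hypothetical domain, and that releases are independent. The function name and parameters are ours.

```python
import random


def simulate_candidates(true_value, domain, partition_size, num_releases, seed=0):
    """Size of the adversary's candidate set after each additional release."""
    rng = random.Random(seed)
    others = [v for v in domain if v != true_value]
    candidates = set(domain)
    sizes = []
    for _ in range(num_releases):
        # Sensitive values in the target's partition for this release:
        # the true value plus (partition_size - 1) random other values.
        partition_values = {true_value} | set(rng.sample(others, partition_size - 1))
        # Intersection attack: keep only values consistent with every release.
        candidates &= partition_values
        sizes.append(len(candidates))
    return sizes


sizes = simulate_candidates(true_value=7, domain=list(range(50)),
                            partition_size=10, num_releases=5)
print(sizes)
```

The candidate set can only shrink as releases are added and always retains the true value; increasing `partition_size` relative to the domain size slows the shrinkage, which mirrors the qualitative behavior described for larger $k$.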
In this section we give a precise formulation of “resistance to arbitrary side information” and show that several relaxations of differential privacy imply it. The formulation follows ideas originally due to Dwork and McSherry, stated implicitly in [?]. This is, to our knowledge, the first place such a formulation appears explicitly. The proof that relaxed definitions (and hence the schemes of [?, ?, ?]) satisfy the Bayesian formulation is new. These results are explained in greater detail in a separate technical report [?]; in this paper we reproduce only the relevant parts.
We represent databases as vectors in $\mathcal{D}^n$ for some domain $\mathcal{D}$ (for example, in the case of the relational databases above, $\mathcal{D}$ is the product of the attribute domains). There is no distinction between “sensitive” and “insensitive” information. Given a randomized algorithm $\mathcal{A}$, we let $\mathcal{A}(x)$ be the random variable (or probability distribution on outputs) corresponding to input $x$.
Definition 10 (Differential Privacy)
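A randomized algorithm $\mathcal{A}$ is $\epsilon$-differentially private if for all databases $x, y \in \mathcal{D}^n$ that differ in one individual, and for all subsets $S$ of outputs,
$$\Pr[\mathcal{A}(x) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(y) \in S].$$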
This definition states that changing a single individual’s data in the database leads to a small change in the distribution on outputs. Unlike more standard measures of distance such as total variation (also called statistical difference) or Kullback-Leibler divergence, the metric here is multiplicative and so even very unlikely events must have approximately the same probability under the distributions $\mathcal{A}(x)$ and $\mathcal{A}(y)$. This condition was relaxed somewhat in other papers [?, ?, ?, ?, ?, ?, ?]. The schemes in all those papers, however, satisfy the following relaxation [?]:
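Definition 11 ($(\epsilon,\delta)$-Differential Privacy)
A randomized algorithm $\mathcal{A}$ is $(\epsilon,\delta)$-differentially private if for all databases $x, y \in \mathcal{D}^n$ that differ in one individual, and for all subsets $S$ of outputs,
$$\Pr[\mathcal{A}(x) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(y) \in S] + \delta.$$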
The relaxations used in [?, ?, ?] were in fact stronger (i.e., less relaxed) than the relaxation stated above. One consequence of the results below is that all the definitions are equivalent up to polynomial changes in the parameters, and so given the space constraints we work only with the simplest notion. (That said, some of the other relaxations, such as probabilistic differential privacy of [?], could lead to better parameters in Theorem 15.)
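To make the multiplicative condition concrete, the following sketch uses the standard Laplace mechanism for a counting query. It is an illustration under our own assumptions rather than one of the schemes evaluated in this paper: the query has sensitivity 1, the counts 100 and 101 stand for two hypothetical neighboring databases, and the function names are ours. The check compares output densities pointwise on a grid, which is a numerical spot check and not a proof.

```python
import math
import random


def laplace_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon (sensitivity 1)."""
    # The difference of two independent Exp(epsilon) variables is
    # Laplace-distributed with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


def laplace_density(t, center, epsilon):
    """Density at t of the released value when the true count is `center`."""
    return (epsilon / 2.0) * math.exp(-epsilon * abs(t - center))


def max_density_ratio(count_x, count_y, epsilon, grid):
    """Largest ratio of output densities between two neighboring databases."""
    return max(laplace_density(t, count_x, epsilon) /
               laplace_density(t, count_y, epsilon) for t in grid)


epsilon = 0.5
grid = [i / 10.0 for i in range(900, 1101)]  # outputs between 90 and 110
worst = max_density_ratio(100, 101, epsilon, grid)
sample = laplace_count(100, epsilon, random.Random(0))

print("one noisy release:", sample)
print("worst density ratio on grid:", worst, "bound e^eps:", math.exp(epsilon))
```

Even for outputs far from both counts the ratio stays at most $e^{\epsilon}$, which is what distinguishes the multiplicative condition from an additive distance.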
There is a crisp, semantically-flavored interpretation of differential privacy, due to Dwork and McSherry, and explained in [?]: Regardless of external knowledge, an adversary with access to the sanitized database draws the same conclusions whether or not my data is included in the original data. (The use of the term “semantic” for such definitions dates back to semantic security of encryption [?].)
We require a mathematical formulation of “arbitrary external knowledge”, and of “drawing conclusions”. The first is captured via a prior probability distribution $b$ on $\mathcal{D}^n$ ($b$ is a mnemonic for “beliefs”). Conclusions are modeled by the corresponding posterior distribution: given a transcript $t$, the adversary updates his belief about the database $x$ using Bayes’ rule to obtain a posterior $\bar{b}$:
$$\bar{b}[x \mid t] = \frac{\Pr[\mathcal{A}(x) = t]\, b[x]}{\sum_{z} \Pr[\mathcal{A}(z) = t]\, b[z]}. \qquad (1)$$
In an interactive scheme, the definition of $\mathcal{A}$ depends on the adversary’s choices; for simplicity we omit the dependence on the adversary in the notation. Also, for simplicity, we discuss only discrete probability distributions. Our results extend directly to the interactive, continuous case.
For a database $x$, define $x_{-i}$ to be the vector obtained by replacing position $i$ by some default value in $\mathcal{D}$ (any value in $\mathcal{D}$ will do). This corresponds to “removing” person $i$’s data. We consider $n+1$ related scenarios (“games”, in the language of cryptography), numbered 0 through $n$. In Game 0, the adversary interacts with $\mathcal{A}(x)$. This is the interaction that takes place in the real world. In Game $i$ (for $1 \le i \le n$), the adversary interacts with $\mathcal{A}(x_{-i})$. Game $i$ describes the hypothetical scenario where person $i$’s data is not included.
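For a particular belief distribution $b$ and transcript $t$, we consider the $n+1$ corresponding posterior distributions $\bar{b}_0[\cdot \mid t], \ldots, \bar{b}_n[\cdot \mid t]$. The posterior $\bar{b}_0[\cdot \mid t]$ is the same as $\bar{b}[\cdot \mid t]$ (defined in Eq. (1)). For larger $i$, the $i$-th posterior distribution $\bar{b}_i[\cdot \mid t]$ represents the conclusions drawn in Game $i$, that is
$$\bar{b}_i[x \mid t] = \frac{\Pr[\mathcal{A}(x_{-i}) = t]\, b[x]}{\sum_{z} \Pr[\mathcal{A}(z_{-i}) = t]\, b[z]}.$$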
Given a particular transcript $t$, privacy has been breached if there exists an index $i$ such that the adversary would draw different conclusions depending on whether or not $i$’s data was used. It turns out that the exact measure of “different” here does not matter much. We choose the weakest notion that applies, namely statistical difference. If $P$ and $Q$ are probability measures on the set $\Omega$, the statistical difference between $P$ and $Q$ is defined as:
$$\mathrm{SD}(P, Q) = \max_{S \subseteq \Omega} \left| P[S] - Q[S] \right|.$$
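Definition 12 ($\epsilon$-Semantic Privacy)
An algorithm is $\epsilon$-semantically private if for all prior distributions $b$ on $\mathcal{D}^n$, for all databases $x \in \mathcal{D}^n$, for all possible transcripts $t$, and for all $i = 1, \ldots, n$,
$$\mathrm{SD}\left(\bar{b}_0[\cdot \mid t],\, \bar{b}_i[\cdot \mid t]\right) \le \epsilon.$$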
This can be relaxed to allow a probability of failure.
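Definition 13 ($(\epsilon,\delta)$-Semantic Privacy)
An algorithm is $(\epsilon,\delta)$-semantically private if, for all prior distributions $b$, with probability at least $1-\delta$ over pairs $(x, t)$, where the database $x$ is drawn according to $b$ and the transcript $t$ is drawn according to $\mathcal{A}(x)$, for all $i = 1, \ldots, n$:
$$\mathrm{SD}\left(\bar{b}_0[\cdot \mid t],\, \bar{b}_i[\cdot \mid t]\right) \le \epsilon.$$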
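As a sanity check of the definitions above, the following sketch computes the Game 0 and Game $i$ posteriors exactly for a toy mechanism and reports the largest statistical difference between them. Everything here is an assumption made for illustration: databases are bit vectors of length 3, the mechanism is per-bit randomized response (each bit is flipped independently with probability $1/(1+e^{\epsilon})$, which is $\epsilon$-differentially private), the default value is 0, and the prior is an arbitrary non-uniform distribution. The comparison value $e^{2\epsilon}-1$ is a loose bound obtained by bounding the likelihood ratio in the numerator and in the denominator of Bayes' rule separately; it is used only as a check and is not claimed to be the tightest constant.

```python
import itertools
import math


def rr_likelihood(x, t, epsilon):
    """Pr[A(x) = t] when each bit is flipped independently w.p. 1/(1+e^eps)."""
    p_flip = 1.0 / (1.0 + math.exp(epsilon))
    prob = 1.0
    for xj, tj in zip(x, t):
        prob *= (1.0 - p_flip) if xj == tj else p_flip
    return prob


def remove(x, i, default=0):
    """x_{-i}: replace position i (0-indexed here) by a default value."""
    return x[:i] + (default,) + x[i + 1:]


def posterior(prior, t, epsilon, i=None):
    """Posterior over databases given t; i=None is Game 0, otherwise Game i."""
    weights = {}
    for x, bx in prior.items():
        source = x if i is None else remove(x, i)
        weights[x] = rr_likelihood(source, t, epsilon) * bx
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def statistical_difference(p, q):
    """max_S |p(S) - q(S)|, which equals half the L1 distance."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


n, epsilon = 3, 0.3
databases = list(itertools.product((0, 1), repeat=n))
raw = [1 + 2 * k for k in range(len(databases))]  # arbitrary non-uniform prior
prior = {x: r / sum(raw) for x, r in zip(databases, raw)}

worst_sd = max(
    statistical_difference(posterior(prior, t, epsilon),
                           posterior(prior, t, epsilon, i))
    for t in databases for i in range(n)
)
print("largest statistical difference:", worst_sd)
print("loose bound e^(2 eps) - 1:", math.exp(2 * epsilon) - 1)
```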
Dwork and McSherry proposed the notion of semantic privacy, informally, and observed that it is equivalent to differential privacy.
Proposition 14 (Dwork-McSherry)
-differential privacy implies -semantic privacy, where .
We show that this implication holds much more generally:
Theorem 15 (Main Result)
()-differential privacy implies -semantic privacy where and .
Theorem 15 states that the relaxed notions of differential privacy used in some previous work still imply privacy in the face of arbitrary side information. This is not the case for all possible relaxations, even very natural ones. For example, if one replaced the multiplicative notion of distance used in differential privacy with total variation distance, then the following “sanitizer” would be deemed private: choose an index $i \in \{1, \ldots, n\}$ uniformly at random and publish the entire record of individual $i$ together with his or her identity (Example 2 in [?]). Such a “sanitizer” would not be meaningful at all, regardless of side information.
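The counterexample can be checked exactly on a toy database. In the sketch below the records and the function names are hypothetical; the sanitizer publishes one uniformly chosen (index, record) pair, and the two databases differ in a single individual.

```python
from fractions import Fraction


def publish_random_record(x):
    """Exact output distribution: pick i uniformly, output the pair (i, x[i])."""
    n = len(x)
    dist = {}
    for i, record in enumerate(x):
        dist[(i, record)] = dist.get((i, record), Fraction(0)) + Fraction(1, n)
    return dist


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return sum(abs(p.get(o, Fraction(0)) - q.get(o, Fraction(0)))
               for o in support) / 2


# Two hypothetical neighboring databases that differ only in position 2.
x = ("alice:flu", "bob:asthma", "carol:diabetes", "dave:flu")
y = ("alice:flu", "bob:asthma", "carol:healthy", "dave:flu")

p, q = publish_random_record(x), publish_random_record(y)
tv = total_variation(p, q)
# Outputs that are possible under x but impossible under y.
exposed = [o for o in p if o not in q]

print("total variation distance:", tv)
print("outputs with unbounded probability ratio:", exposed)
```

The total variation distance is only $1/n$, so an additive notion would accept this sanitizer for large $n$. The output that reveals the changed record has positive probability under one database and zero under the other, so no finite $e^{\epsilon}$ satisfies the multiplicative condition.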
Finally, the techniques used to prove Theorem 15 can also be used to analyze schemes which do not provide privacy for all pairs of neighboring databases $x$ and $y$, but rather only for most such pairs (neighboring databases are those that differ in one individual). Specifically, it is sufficient that those databases where the “indistinguishability” condition fails occur with small probability.
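Definition 16 ($(\epsilon,\delta)$-indistinguishability)
Two random variables $X, Y$ taking values in a set $\Omega$ are $(\epsilon,\delta)$-indistinguishable if for all sets $S \subseteq \Omega$, $\Pr[X \in S] \le e^{\epsilon} \Pr[Y \in S] + \delta$ and $\Pr[Y \in S] \le e^{\epsilon} \Pr[X \in S] + \delta$.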
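For discrete distributions, the condition over all sets $S$ can be checked without enumerating subsets: the set that maximizes $\Pr[X \in S] - e^{\epsilon}\Pr[Y \in S]$ consists of exactly those outcomes whose individual contribution is positive. The sketch below uses this observation; the two distributions and the function names are hypothetical.

```python
import math


def one_sided_gap(p, q, epsilon):
    """max over S of p(S) - e^eps * q(S), attained by S = {o : p(o) > e^eps q(o)}."""
    outcomes = set(p) | set(q)
    return sum(max(0.0, p.get(o, 0.0) - math.exp(epsilon) * q.get(o, 0.0))
               for o in outcomes)


def are_indistinguishable(p, q, epsilon, delta):
    """Check both directions of the (epsilon, delta)-indistinguishability condition."""
    return (one_sided_gap(p, q, epsilon) <= delta and
            one_sided_gap(q, p, epsilon) <= delta)


# Hypothetical output distributions of a mechanism on two neighboring inputs.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.6, "b": 0.39, "c": 0.01}

print(are_indistinguishable(p, q, epsilon=0.3, delta=0.02))
print(are_indistinguishable(p, q, epsilon=0.3, delta=0.0))
```

In this example the outcome `"c"` has probability 0.01 under one distribution and 0 under the other, so the pair fails the condition with $\delta = 0$ for every $\epsilon$ but passes once $\delta$ absorbs that mass.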
Proposition 17
Let be a randomized algorithm. Let Then satisfies -semantic privacy for any prior distribution such that with and .
The complete proofs are described in [?]. Here we sketch the main ideas behind both proofs. Let $X|_{T=t}$ denote the conditional distribution of $X$ given that $T=t$ for jointly distributed random variables $X$ and $T$. The following lemma (proof omitted) plays an important role in our proofs.
Lemma 18 (Main Lemma)
Suppose two pairs of random variables and are -indistinguishable (for some randomized algorithms and ). Then with probability at least over (equivalently ), the random variables and are -indistinguishable with , , and .
Let be a randomized algorithm (in the setting of Theorem 15, is a -differentially private algorithm). Let be a belief distribution (in the setting of Proposition 17, is a belief with ). The main idea behind both the proofs is to use Lemma 18 to show that with probability at least over pairs where and , . Taking a union bound over all coordinates , implies that with probability at least over pairs where and , for all , we have . For Proposition 17, it shows that satisfies -semantic privacy for . In the Theorem 15 setting where is -differentially private and is arbitrary, it shows that -differential privacy implies -semantic privacy.
In this paper we explored how one can reason about privacy in the presence of independent anonymized releases of overlapping populations. Our experimental study indicates that several currently proposed partition-based anonymization schemes, including $k$-anonymity and its variants, are vulnerable to composition attacks. On the positive side, we gave a precise formulation of the property “resistance to arbitrary side information” and showed that several relaxations of differential privacy satisfy it.
The most striking question that arises from this work is whether randomness in the anonymization algorithm is necessary to resist complex side information such as independent releases. Another interesting direction would be to study other settings where composition attacks are realistic and effective. A natural candidate for future investigation is the release of overlapping contingency tables, a setting often considered in the statistical literature.
- 1 Special issue on disclosure limitation methods for protecting the confidentiality of statistical data, 1998.
- 2 R. Agrawal and R. Srikant. Privacy-preserving data mining. In SIGMOD, pages 439–450. ACM Press, 2000.
- 3 S. Agrawal and J. R. Haritsa. A framework for high-accuracy privacy-preserving mining. In ICDE, pages 193–204. IEEE Computer Society, 2005.
- 4 B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In PODS, pages 273–282. ACM Press, 2007.
- 5 M. Barbaro and T. Zeller. A face is exposed for AOL searcher no. 4417749. The New York Times, Aug. 2006.
- 6 A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: The SuLQ framework. In PODS, pages 128–138. ACM Press, 2005.
- 7 J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In Secure Data Management, pages 48–63. Springer, 2006.
- 8 K. Chaudhuri and N. Mishra. When random sampling preserves privacy. In CRYPTO, pages 198–213, 2006.
- 9 B.-C. Chen, R. Ramakrishnan, and K. LeFevre. Privacy skyline: Privacy with multidimensional adversarial knowledge. In VLDB, pages 770–781. VLDB Endowment, 2007.
- 10 I. Dinur and K. Nissim. Revealing information while preserving privacy. In PODS, pages 202–210. ACM Press, 2003.
- 11 J. Domingo-Ferrer and J. M. Mateo-Sanz. Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1):189–201, 2002.
- 12 C. Dwork. Differential privacy. In ICALP, pages 1–12. Springer, 2006.
- 13 C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pages 486–503. Springer, 2006.
- 14 C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284. Springer, 2006.
- 15 C. Dwork and K. Nissim. Privacy-preserving datamining on vertically partitioned databases. In CRYPTO, pages 528–544. Springer, 2004.
- 16 A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In KDD, pages 217–228. ACM Press, 2002.
- 17 A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211–222. ACM Press, 2003.
- 18 S. Goldwasser and S. Micali. Probabilistic encryption. Journal of Computer and System Sciences, 28(2):270–299, 1984.
- 19 International Classification of Diseases, http://www.cdc.gov/nchs/about/otheract/icd9/abticd9.htm.
- 20 S. P. Kasiviswanathan and A. Smith. A note on differential privacy: Defining resistance to arbitrary side information. CoRR, arXiv:0803.3946 [cs.CR], 2008.
- 21 K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In ICDE, page 25. IEEE Computer Society, 2006.
- 22 K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Workload-aware anonymization. In KDD, pages 277–286. ACM Press, 2006.
- 23 N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and ℓ-diversity. In ICDE, pages 106–115. IEEE Computer Society, 2007.
- 24 Y. Lindell. Composition of Secure Multi-Party Protocols: A Comprehensive Study. Springer-Verlag, 2003.
- 25 A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, and L. Vilhuber. Privacy: From theory to practice on the map. In ICDE, 2008.
- 26 A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. ℓ-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1), 2007.
- 27 D. J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, and J. Y. Halpern. Worst-case background knowledge for privacy-preserving data publishing. In ICDE, pages 126–135. IEEE Computer Society, 2007.
- 28 F. McSherry and K. Talwar. Differential privacy in mechanism design. In FOCS, pages 94–103. IEEE Computer Society, 2007.
- 29 K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75–84. ACM Press, 2007.
- 30 L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557–570, 2002.
- 31 UCI machine learning repository, http://www.ics.uci.edu/~mlearn/databases/.
- 32 A. van den Hout and P. van der Heijden. Randomized response, statistical disclosure control and misclassification: A review. International Statistical Review, 70:269–288, 2002.
- 33 K. Wang and B. C. M. Fung. Anonymizing sequential releases. In KDD, pages 414–423. ACM Press, 2006.
- 34 S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.
- 35 R. C.-W. Wong, A. W.-C. Fu, K. Wang, and J. Pei. Minimality attack in privacy preserving data publishing. In VLDB, pages 543–554. VLDB Endowment, 2007.
- 36 X. Xiao and Y. Tao. M-invariance: towards privacy preserving re-publication of dynamic datasets. In SIGMOD, pages 689–700. ACM Press, 2007.
- 37 C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In VLDB, pages 910–921. VLDB Endowment, 2005.