t-Closeness through Microaggregation: Strict Privacy with Enhanced Utility Preservation
Microaggregation is a technique for disclosure limitation aimed at protecting the privacy of data subjects in microdata releases. It has been used as an alternative to generalization and suppression to generate $k$-anonymous data sets, where the identity of each subject is hidden within a group of $k$ subjects. Unlike generalization, microaggregation perturbs the data, and this additional masking freedom allows improving data utility in several ways, such as increasing data granularity, reducing the impact of outliers and avoiding discretization of numerical data. $k$-Anonymity, on the other hand, does not protect against attribute disclosure, which occurs if the variability of the confidential values in a group of $k$ subjects is too small. To address this issue, several refinements of $k$-anonymity have been proposed, among which $t$-closeness stands out as providing one of the strictest privacy guarantees. Existing algorithms to generate $t$-close data sets are based on generalization and suppression (they are extensions of $k$-anonymization algorithms based on the same principles). This paper proposes and shows how to use microaggregation to generate $k$-anonymous $t$-close data sets. The advantages of microaggregation are analyzed, and then several microaggregation algorithms for $k$-anonymous $t$-closeness are presented and empirically evaluated.
Generating an anonymized data set that is suitable for public release is essentially a matter of finding a good equilibrium between disclosure risk and information loss. Releasing the original data set provides the highest utility to data users but incurs the greatest disclosure risk for the subjects in the data set. On the contrary, releasing random data incurs no risk of disclosure but provides no utility.
$k$-Anonymity [23, 29] is the oldest among the so-called syntactic privacy models. Models in this class address the trade-off between privacy and utility by requiring the anonymized data set to follow a specific pattern that is known to limit the risk of disclosure. Yet, the method to be used to generate such an anonymized data set is not specified by the privacy model and must be selected to maximize data utility (because satisfying the model already ensures privacy). $k$-Anonymity, in particular, seeks to make record re-identification unfeasible by hiding each subject within a group of $k$ subjects. To this end, $k$-anonymity requires each record in the anonymized data set to be indistinguishable from $k-1$ other records as far as the quasi-identifier attributes are concerned (see Section 2 for a classification of attributes into identifiers, quasi-identifiers, confidential attributes and other attributes).
Although $k$-anonymity protects against identity disclosure (the subject to whom a record corresponds cannot be successfully re-identified with probability greater than $1/k$), disclosure can still happen if the variability of the confidential attribute values in the group of records is small. This is known as attribute disclosure. Several refinements of the $k$-anonymity model have been proposed to protect against attribute disclosure; they all seek to guarantee at least a certain amount of variability of the confidential attribute values within each group of indistinguishable records. In this paper we focus on the notion of $t$-closeness, whose privacy guarantee is probably the strictest among $k$-anonymity-like models. In fact, $t$-closeness has been shown in [27, 8] to be related to the major alternative to $k$-anonymity-like models, namely $\varepsilon$-differential privacy. $t$-Closeness requires that the distribution of the confidential attribute values within each group of indistinguishable records be similar to the distribution of the confidential attribute values in the entire data set.
The dominant approach to obtain an anonymized data set satisfying $k$-anonymity or any of its refinements is based on generalization (recoding) and suppression. The goal of generalization-based approaches is to find the minimal generalization that satisfies the requirements of the underlying privacy model. These algorithms can be adapted to the above-mentioned $k$-anonymity refinements: it is simply a matter of introducing the additional constraints of the target privacy model when checking whether a specific generalization is viable.
Generalization-based approaches suffer from some drawbacks identified in  and reviewed in Section 4 below. Microaggregation was shown in  to be an alternative approach to generate $k$-anonymous data sets while avoiding some of these drawbacks.
Contribution and plan of this paper
A first contribution of this paper is to identify the strong points of microaggregation to achieve $k$-anonymous $t$-closeness. The second contribution consists of three new microaggregation-based algorithms for $t$-closeness, which are presented and evaluated.
In Section 2 we review some concepts used throughout the paper: $k$-anonymity, $t$-closeness, recoding/generalization and microaggregation. In Section 4, we identify the advantages of microaggregation over generalization/suppression for $k$-anonymity and hence for $t$-closeness as well; then we sketch three microaggregation-based algorithms for $t$-closeness that are detailed in the next sections. Section 5 presents an algorithm for $t$-closeness based on standard microaggregation followed by cluster merging. Section 6 presents an algorithm that embeds $t$-closeness into the microaggregation process: each cluster is generated to satisfy $k$-anonymity and then it is refined to achieve $t$-closeness. Section 7 also embeds $t$-closeness in the microaggregation process, but in this case each cluster is generated to satisfy simultaneous $k$-anonymity and $t$-closeness from the very beginning. In Section 8 we evaluate the previously proposed algorithms on real data sets. Conclusions are gathered in Section 9.
A microdata set can be modeled as a table where each row contains data on a different subject and each column contains information about a specific attribute. Let $T$ be a microdata set with $n$ records $r_1, \ldots, r_n$, each of them with information about $m$ attributes $A_1, \ldots, A_m$.
The attributes in a microdata set can be classified according to their disclosiveness into several (perhaps non-disjoint) classes (see  for more details on the following classification): identifiers, quasi-identifiers, confidential attributes, and non-confidential attributes.
Disclosure risk limitation (a.k.a. statistical disclosure control) seeks to restrict the capability of an intruder with access to the released data set to associate a piece of confidential information to a specific subject in the data set. To this end, a masked version $T'$ of the original data set $T$ is released. We use the term anonymized data set to refer to $T'$.
An intruder re-identifies a record in an anonymized data set when he can determine the identity of the subject to whom the record corresponds. In case of re-identification, the intruder can associate the values of the confidential attributes in the re-identified record to the identity of the subject, thereby violating the subject’s privacy.
Definition 1 ($k$-anonymity).
Let $T$ be a data set and $QI_T$ be the set of quasi-identifier attributes in it. $T$ is said to satisfy $k$-anonymity if, for each combination of values of the quasi-identifiers in $QI_T$, at least $k$ records in $T$ share that combination.
In a $k$-anonymous data set, no subject’s identity can be linked (based on the quasi-identifiers) to fewer than $k$ records. Hence, the probability of correct re-identification is, at most, $1/k$. In what follows, we use the terms $k$-anonymous group or equivalence class to refer to a set of records that share the quasi-identifier values.
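Definition 1 can be checked mechanically by counting how many records share each quasi-identifier combination. The following is a minimal sketch, assuming records are stored as dictionaries; the function names and the toy attribute names (`age`, `zip`, `disease`) are illustrative and not taken from the paper.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count the records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[a] for a in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination is shared by at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(size >= k for size in sizes.values())

# Toy data (hypothetical attributes): two equivalence classes of sizes 2 and 3.
records = [
    {"age": 30, "zip": "080", "disease": "flu"},
    {"age": 30, "zip": "080", "disease": "asthma"},
    {"age": 45, "zip": "431", "disease": "flu"},
    {"age": 45, "zip": "431", "disease": "flu"},
    {"age": 45, "zip": "431", "disease": "flu"},
]
qi = ["age", "zip"]
smallest = min(equivalence_class_sizes(records, qi).values())
```

In the toy data the second class illustrates the weakness discussed next: it is 3-anonymous on its own, yet all its records share the same confidential value.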
Even though $k$-anonymity protects against identity disclosure, it is a well-known fact that $k$-anonymous data sets are vulnerable to attribute disclosure. Attribute disclosure occurs when the variability of a confidential attribute within an equivalence class is too low. In that case, being able to determine the equivalence class of a subject may reveal too much information about the confidential attribute value of that subject.
Several refinements of $k$-anonymity have been proposed to deal with attribute disclosure. Examples include $p$-sensitive $k$-anonymity, $l$-diversity, $t$-closeness, and $(n,t)$-closeness. As explained in Section 1, in this paper we focus on $t$-closeness because of its strict privacy guarantee (although the methods we propose are easily adaptable to $(n,t)$-closeness).
$t$-Closeness seeks to limit the amount of information that an intruder can obtain about the confidential attribute of any specific subject. To this end, $t$-closeness requires the distribution of the confidential attributes within each of the equivalence classes to be similar to their distribution in the entire data set.
An equivalence class is said to satisfy $t$-closeness if the distance between the distribution of the confidential attribute in this class and the distribution of the attribute in the whole data set is no more than a threshold $t$. A data set (usually a $k$-anonymous data set) is said to satisfy $t$-closeness if all equivalence classes in it satisfy $t$-closeness.
The specific distance used between distributions is central to evaluating $t$-closeness, but the original definition does not advocate any specific distance. The Earth Mover’s distance (EMD) is the most common choice (and the one we adopt in this paper), although other distances have also been explored [21, 27, 8]. $EMD(P,Q)$ measures the cost of transforming one distribution $P$ into another distribution $Q$ by moving probability mass. EMD is computed as the minimum transportation cost from the bins of $P$ to the bins of $Q$, so it depends on how much mass is moved and how far it is moved. For numerical attributes the distance between two bins is based on the number of bins between them. If the numerical attribute takes values $\{v_1, v_2, \ldots, v_m\}$, where $v_i < v_j$ if $i < j$, then $ordered\_distance(v_i, v_j) = |i-j|/(m-1)$. Now, if $P$ and $Q$ are distributions over $\{v_1, v_2, \ldots, v_m\}$ that, respectively, assign probability $p_i$ and $q_i$ to $v_i$, then the EMD for the ordered distance can be computed as
$$EMD(P,Q) = \frac{1}{m-1}\sum_{i=1}^{m}\left|\sum_{j=1}^{i}(p_j - q_j)\right|$$
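For the ordered distance, the EMD reduces to a sum of absolute cumulative differences, which takes a single pass over the bins. The sketch below is one possible implementation under that closed form; the function name `emd_ordered` is illustrative, and both distributions are assumed to be given as probability lists over the same $m \geq 2$ ranked values.

```python
from itertools import accumulate

def emd_ordered(p, q):
    """EMD between distributions p and q over the same m ordered values,
    using the ordered distance |i - j| / (m - 1) between bins i and j."""
    if len(p) != len(q) or len(p) < 2:
        raise ValueError("p and q must have the same length m >= 2")
    m = len(p)
    # Running sums of (p_j - q_j): the mass that still has to cross each bin boundary.
    cumulative = accumulate(pi - qi for pi, qi in zip(p, q))
    return sum(abs(c) for c in cumulative) / (m - 1)

# All the mass sits on the lowest value in p and on the highest value in q,
# so the whole mass travels the maximum distance.
worst_case = emd_ordered([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
identical = emd_ordered([0.25, 0.75], [0.25, 0.75])
```

The linear cost of this computation is what the cost analysis of the merging algorithm in Section 5 relies on.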
Microaggregation is a family of perturbative methods for statistical disclosure control of microdata releases. One-dimensional microaggregation was introduced in  and multi-dimensional microaggregation was proposed and formalized in . The latter is the one that is useful for $k$-anonymity and $t$-closeness. It consists of the following two steps:
Partition: The records in the original data set are partitioned into several clusters, each of them containing at least $k$ records. To minimize the information loss, records in each cluster should be as similar as possible.
Aggregation: An aggregation operator is used to summarize the data in each cluster and the original records are replaced by the aggregated output. For numerical data, one can use the mean as aggregation operator; for categorical data, one can resort to the median or some other average operator defined in terms of an ontology (e.g. see ).
The partition and aggregation steps produce some information loss. The goal of microaggregation is to minimize the information loss according to some metric. A common information loss metric is the SSE (sum of squared errors). When using SSE on numerical attributes, the mean is a sensible choice as the aggregation operator, because for any given partition it minimizes SSE in the aggregation step; the challenge thus is to come up with a partition that minimizes the overall SSE. Finding an optimal partition in multi-dimensional microaggregation is an NP-hard problem ; therefore, heuristics are employed to obtain an approximation with reasonable cost.
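The aggregation step and the SSE metric can be illustrated with a few lines of code. This is a minimal sketch that assumes a partition is already given (finding a good one is the hard part) and that all attributes are numerical; the function names are illustrative.

```python
def centroid(cluster):
    """Attribute-wise mean of the records (tuples of numbers) in a cluster."""
    dims = len(cluster[0])
    return tuple(sum(r[d] for r in cluster) / len(cluster) for d in range(dims))

def aggregate(clusters):
    """Replace every record by its cluster centroid.
    Returns the masked records and the resulting SSE."""
    masked, sse = [], 0.0
    for cluster in clusters:
        c = centroid(cluster)
        for record in cluster:
            masked.append(c)
            sse += sum((x - y) ** 2 for x, y in zip(record, c))
    return masked, sse

# A hand-made partition with minimum cluster size 2.
clusters = [[(1.0, 2.0), (3.0, 4.0)], [(10.0, 10.0), (12.0, 14.0)]]
masked, sse = aggregate(clusters)
```

Using any value other than the mean as the representative of a cluster can only increase the SSE for that partition, which is why the mean is the natural aggregation operator for numerical data.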
The limitations to re-identification imposed by $k$-anonymity can be satisfied without aggregating the values of the quasi-identifier attributes within each equivalence class after the partition step. It is less utility-damaging to break the relation between quasi-identifiers and confidential attributes while preserving the original values of the quasi-identifiers. This is the approach to attain $k$-anonymity-like guarantees taken in [31, 26].
3 Related Work
As with $k$-anonymity, the most common way to attain $t$-closeness is to use generalization and suppression. In fact, the algorithms for $k$-anonymity based on those principles can be adapted to yield $t$-closeness by adding the $t$-closeness constraint in the search for a feasible minimal generalization: in  the Incognito algorithm and in  the Mondrian algorithm are respectively adapted to $t$-closeness. SABRE is another interesting approach specifically designed for $t$-closeness. In SABRE the data set is first partitioned into a set of buckets and then the equivalence classes are generated by taking an appropriate number of records from each of the buckets. Both the buckets and the number of records from each bucket that are included in each equivalence class are selected with $t$-closeness in mind. One of the algorithms proposed in our paper uses a similar principle. However, the buckets in SABRE are generated in an iterative greedy manner which may yield more buckets than our algorithm (which analytically determines the minimal number of required buckets). A greater number of buckets leads to equivalence classes with more records and, thus, to more information loss.
In  an approach to attain $t$-closeness-like privacy is proposed which, unlike the methods based on generalization/suppression, is perturbative. Also,  guarantees the $t$ threshold only on average and uses a distance other than EMD. Another computational approach to $t$-closeness is presented in [8, 27] which aims at connecting $t$-closeness and differential privacy; [8, 27] also use a distance different from EMD but their method is non-perturbative (the truthfulness of the data is preserved).
Most of the approaches to attain $t$-closeness have been designed to preserve the truthfulness of the data. In this paper we evaluate the use of microaggregation, a perturbative masking technique. In $k$-anonymity the relation between the quasi-identifiers and the confidential data is broken by making records in the anonymized data set indistinguishable in terms of quasi-identifiers within a group of $k$ records. Microaggregation, when performed on the projection on quasi-identifier attributes, produces a $k$-anonymous data set. Microaggregation was also used for $k$-anonymity without naming it in : clustering was used with the additional requirement that each cluster must have $k$ or more records.
While microaggregation has been proposed to satisfy another refinement of $k$-anonymity ($p$-sensitive $k$-anonymity), no attempt has been made to use it for $t$-closeness.
4 $k$-Anonymity/$t$-closeness and microaggregation
Microaggregation has several advantages over generalization/recoding for $k$-anonymity that are mostly related to data utility preservation:
Global recoding may recode some records that do not need it, hence causing extra information loss. On the other hand, local recoding makes data analysis more complex, as values corresponding to various different levels of generalization may co-exist in the anonymized data. Microaggregation is free from either drawback.
Data generalization usually results in a significant loss of granularity, because input values can only be replaced by a reduced set of generalizations, which are more constrained as one moves up in the hierarchy. Microaggregation, on the other hand, does not reduce the granularity of values, because they are replaced by numerical or categorical averages.
If outliers are present in the input data, the need to generalize them results in very coarse generalizations and, thus, in a high loss of information. For microaggregation, the influence of an outlier in the calculation of averages/centroids is restricted to the outlier’s equivalence class and hence is less noticeable.
For numerical attributes, generalization discretizes input numbers to numerical ranges and thereby changes the nature of data from continuous to discrete. In contrast, microaggregation maintains the continuous nature of numbers.
It is not known how to optimally combine generalization and local suppression.
There is no agreement in the literature on how suppression should be performed: one can suppress at the record level (entire record suppressed), or suppress particular attributes in some records; furthermore, suppression can be done by either blanking a value or replacing it by a neutral value (i.e. some kind of average).
Last but not least, and no matter how suppression is performed, it complicates data analysis (users need to resort to software dealing with censored data).
Some of the above downsides of generalization and suppression motivated proposing microaggregation for $k$-anonymity in . They also justify our investigation here of the use of microaggregation for $t$-closeness.
The adaptation of microaggregation for $k$-anonymity was fairly straightforward: by applying the microaggregation algorithm (with minimum cluster size $k$) to the quasi-identifiers one generates groups of $k$ records that share the quasi-identifier values (the aggregation step replaces the original quasi-identifiers by the cluster centroid). In microaggregation one seeks to maximize the homogeneity of records within a cluster, which is beneficial for the utility of the resultant $k$-anonymous data set.
In $t$-closeness one has the additional constraint that the distance between the distribution of the confidential attribute within each of the clusters (generated by microaggregation) and the distribution in the entire data set must be less than $t$. This makes attaining $t$-closeness more complex, because we have to reconcile the possibly conflicting goals of maximizing the within-cluster homogeneity of the quasi-identifiers and fulfilling the condition on the distance between the distributions of the confidential attributes.
In the next three sections, we propose three different algorithms to reconcile these conflicting goals. The first algorithm is based on performing microaggregation in the usual way, and then merging clusters as much as needed to satisfy the $t$-closeness condition. This first algorithm is simple and it can be combined with any microaggregation algorithm, yet it may perform poorly regarding utility because clusters may end up being quite large. The other algorithms modify the microaggregation algorithm so that it takes $t$-closeness into account, in an attempt to improve the utility of the anonymized data set. Two variants are proposed: $k$-anonymity-first (which generates each cluster based on the quasi-identifiers and then refines it to satisfy $t$-closeness) and $t$-closeness-first (which generates each cluster based on both quasi-identifier attributes and confidential attributes, so that it satisfies $t$-closeness by design from the very beginning).
5 Standard microaggregation and merging
Generating a $t$-close data set via generalization is essentially an optimization problem: one must find a minimal generalization that satisfies $t$-closeness. A common way to find a solution is to iteratively generalize one of the attributes (selected according to some quality criterion) until the resulting data set satisfies $t$-closeness. Our first proposal to attain $t$-closeness via microaggregation follows a similar approach. We microaggregate and then merge clusters of records in the microaggregated data set; we use the distance between the quasi-identifiers of the microaggregated clusters as the quality criterion to select which groups are to be merged.
Initially, the microaggregation algorithm is run on the quasi-identifier attributes of the original data set; this step produces a $k$-anonymous data set. Then, clusters of microaggregated records are merged until $t$-closeness is satisfied. We iteratively improve the level of $t$-closeness by: i) selecting the cluster whose confidential attribute distribution is most different from the confidential attribute distribution in the entire data set (that is, the cluster farthest from satisfying $t$-closeness); and ii) merging it with the cluster closest to it in terms of quasi-identifiers. See Algorithm 1 for a detailed description of the algorithm.
Note that Algorithm 1 always returns a $t$-close data set. In the worst case, all clusters are eventually merged into a single one and the EMD becomes zero.
The computational cost of Algorithm 1 is the sum of the cost of the initial microaggregation and the cost of merging clusters. Although optimal multivariate microaggregation is NP-hard, several heuristic approximations exist with quadratic cost in the number of records of $T$ (e.g. MDAV, V-MDAV). For the merging part, the fact that computing the EMD for numerical data has linear cost makes the merging quadratic. Hence, the overall cost of Algorithm 1 is dominated by the more expensive of the two parts. If MDAV is used for the microaggregation, both parts are quadratic and so is the overall cost.
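The merging loop described in this section can be sketched as follows. This is an illustration of the two selection rules (farthest from $t$-closeness, then nearest in quasi-identifiers) rather than a transcription of Algorithm 1: the record layout `(quasi_identifier_tuple, confidential_value)`, the helper names and the use of squared Euclidean distance between centroids are assumptions made for the example.

```python
from itertools import accumulate

def emd_to_dataset(values, all_values):
    """EMD (ordered distance) between the empirical distributions of
    `values` and `all_values`, with the ranked distinct values as bins."""
    domain = sorted(set(all_values))
    if len(domain) < 2:
        return 0.0
    p = [values.count(v) / len(values) for v in domain]
    q = [all_values.count(v) / len(all_values) for v in domain]
    cumulative = accumulate(a - b for a, b in zip(p, q))
    return sum(abs(c) for c in cumulative) / (len(domain) - 1)

def centroid(cluster):
    dims = len(cluster[0][0])
    return [sum(qi[d] for qi, _ in cluster) / len(cluster) for d in range(dims)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def merge_until_t_close(clusters, t):
    """Merge clusters (lists of (qi, conf) records) until all have EMD <= t."""
    clusters = [list(c) for c in clusters]
    all_values = [conf for c in clusters for _, conf in c]
    while len(clusters) > 1:
        emds = [emd_to_dataset([conf for _, conf in c], all_values) for c in clusters]
        worst = max(range(len(clusters)), key=emds.__getitem__)
        if emds[worst] <= t:
            break  # every cluster already satisfies t-closeness
        target = centroid(clusters[worst])
        nearest = min((i for i in range(len(clusters)) if i != worst),
                      key=lambda i: sq_dist(target, centroid(clusters[i])))
        clusters[worst].extend(clusters[nearest])
        del clusters[nearest]
    return clusters

# Output of a hypothetical 2-anonymous microaggregation on one quasi-identifier.
initial = [[((1.0,), 10), ((2.0,), 11)], [((3.0,), 20), ((4.0,), 21)]]
loose = merge_until_t_close(initial, t=0.4)
strict = merge_until_t_close(initial, t=0.3)
```

Both initial clusters have EMD 1/3, so nothing is merged for $t = 0.4$, whereas $t = 0.3$ forces a merge into a single cluster whose EMD is zero.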
6 $t$-Closeness aware microaggregation: $k$-anonymity-first
Algorithm 1 consists of two clearly defined steps: first microaggregate and then merge clusters until $t$-closeness is satisfied. In the microaggregation step any standard microaggregation algorithm can be used because the enforcement of $t$-closeness takes place only after microaggregation is complete. As a result, the algorithm is quite clear, but the utility of the anonymized data set may be far from optimal. If, instead of deferring the enforcement of $t$-closeness to the second step, we make the microaggregation algorithm aware of the $t$-closeness constraints at the time of cluster formation, the size of the resulting clusters and also information loss can be expected to be smaller.
Algorithm 2 microaggregates according to the above idea. It initially generates a cluster of size $k$ based on the quasi-identifier attributes. Then the cluster is iteratively refined until $t$-closeness is satisfied. In the refinement, the algorithm checks whether $t$-closeness is satisfied and, if it is not, it selects the closest record not in the cluster based on the quasi-identifiers and swaps it with a record in the cluster selected so that the EMD to the distribution of the entire data set is minimized.
Instead of replacing the records already added to a cluster, we could have opted for adding additional records until $t$-closeness is satisfied. This latter approach was discarded because it led to large clusters when the dependence between quasi-identifiers and confidential attributes is high. In this case, clusters homogeneous in terms of quasi-identifiers tend to be homogeneous in terms of confidential attributes, so the within-cluster distribution of the confidential attribute differs from its distribution in the entire data set unless the cluster is (nearly) as big as the entire data set.
It may happen that the records in the data set are exhausted before $t$-closeness is satisfied. This is most likely when the number of remaining unclustered records is small (for instance, when the last cluster is formed). Thus, Algorithm 2 alone cannot guarantee that $t$-closeness is satisfied. A way to circumvent this shortcoming is to use Algorithm 2 as the microaggregation function in Algorithm 1. By taking into account $t$-closeness at the time of cluster formation (as Algorithm 2 does), the number of cluster mergers in Algorithm 1 can be expected to be small and, therefore, the utility of the resulting anonymized data set can be expected to be reasonably good.
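The swap-based refinement of a single cluster can be sketched as below. This is not a transcription of Algorithm 2: the record layout `(quasi_identifier_tuple, confidential_value)`, the seed choice and the rule that a swap is kept only when it lowers the EMD are assumptions made for the example. The function returns the final EMD so that the caller can detect the case in which records are exhausted before $t$-closeness is reached.

```python
from itertools import accumulate

def emd_to_dataset(values, all_values):
    """EMD (ordered distance) between the empirical distributions of
    `values` and `all_values`, with the ranked distinct values as bins."""
    domain = sorted(set(all_values))
    if len(domain) < 2:
        return 0.0
    p = [values.count(v) / len(values) for v in domain]
    q = [all_values.count(v) / len(all_values) for v in domain]
    cumulative = accumulate(a - b for a, b in zip(p, q))
    return sum(abs(c) for c in cumulative) / (len(domain) - 1)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_cluster(seed, pool, all_values, k, t):
    """Take the k records of `pool` closest to `seed` in quasi-identifiers,
    then swap in farther records one at a time while the EMD exceeds t."""
    by_distance = sorted(pool, key=lambda r: sq_dist(r[0], seed[0]))
    cluster, candidates = by_distance[:k], by_distance[k:]
    current = emd_to_dataset([c for _, c in cluster], all_values)
    for candidate in candidates:          # closest unclustered record first
        if current <= t:
            break
        best_emd, best_index = current, None
        for i in range(len(cluster)):     # try replacing each cluster member
            trial = cluster[:i] + [candidate] + cluster[i + 1:]
            trial_emd = emd_to_dataset([c for _, c in trial], all_values)
            if trial_emd < best_emd:
                best_emd, best_index = trial_emd, i
        if best_index is not None:        # assumption: keep only improving swaps
            cluster[best_index] = candidate
            current = best_emd
    return cluster, current

# Toy pool where the confidential value is perfectly correlated with the quasi-identifier.
pool = [((float(i),), 10 * i) for i in range(1, 7)]
all_values = [conf for _, conf in pool]
cluster, emd = build_cluster(pool[0], pool, all_values, k=2, t=0.2)
```

In this toy run the initial cluster holds the two lowest confidential values; two swaps spread the values over the range of the attribute and bring the EMD under the threshold without enlarging the cluster.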
7 $t$-Closeness aware microaggregation: $t$-closeness-first
In Section 6 we modified the microaggregation algorithm so that it builds the clusters in a $t$-closeness aware manner. The clustering algorithm, however, kept the focus on the quasi-identifiers (records were selected based on the quasi-identifiers) and did not guarantee that every cluster satisfies $t$-closeness. The algorithm proposed in this section prioritizes the confidential attribute, thereby making it possible to guarantee that all clusters satisfy $t$-closeness.
We assume in this section that the values of the confidential attribute(s) can be ranked, that is, be ordered in some way. For numerical or categorical ordinal attributes, ranking is straightforward. Even for categorical nominal attributes, the ranking assumption is less restrictive than it appears, because the same distance metrics that are used to microaggregate this type of attributes can be used to rank them (e.g. the marginality distance in [7, 28]).
We start by evaluating some of the properties of the EMD with respect to microaggregation. To minimize EMD between the distributions of the confidential attribute within a cluster and in the entire data set, the values of the confidential attribute in the cluster must be as spread as possible over the entire data set. Consider the case of a cluster with $k$ records. The following proposition gives a lower bound of EMD for such a cluster.
Proposition 1. Let $T$ be a data set with $n$ records, $A$ be a confidential attribute of $T$ whose values can be ranked and $C$ be a cluster of size $k$. The earth mover’s distance between $C$ and $T$ with respect to attribute $A$ satisfies $EMD(C,T) \geq \frac{(n+k)(n-k)}{4n(n-1)k}$. If $k$ divides $n$, this lower bound is tight.
Proof. The EMD can intuitively be seen as the amount of work needed to transform the distribution of attribute $A$ within $C$ into the distribution of $A$ over $T$. The “amount of work” includes two factors: (i) the amount of probability mass that needs to be moved and (ii) the distance of the movement. When computing EMD for $t$-closeness, the distance of the movements of probability mass for numerical attributes is measured as the ordered distance, that is, the difference between the ranks of the values of $A$ in $T$ divided by $n-1$.
For the sake of simplicity, assume that $k$ divides $n$. If that is not the case, the distance will be slightly greater, so the lower bound we compute is still valid. The probability mass of each of the values of $A$ is constant and equal to $1/n$ in $T$, and it is constant and equal to $1/k$ in $C$. This means that the first factor that determines the EMD (the amount of probability mass to be moved) is fixed. Therefore, to minimize EMD we must minimize the second factor (the distance by which the probability mass must be moved). Clearly, to minimize the distance, the $i$-th value of $A$ in the cluster must lie in the middle of the $i$-th group of $n/k$ records of $T$. Figure 1 illustrates this fact.
In Figure 1 and using the ordered distance, the earth mover’s distance can be computed as $k$ times the cost of distributing the probability mass of the middle element among the $n/k$ elements in the first subset:
$$EMD(C,T) = k \times \frac{1}{n} \times \frac{1}{n-1}\sum_{j=1}^{n/k}\left|j - \frac{n/k+1}{2}\right| = \frac{(n+k)(n-k)}{4n(n-1)k} \qquad (1)$$
Formula (1) takes the element of rank $(n/k+1)/2$ as the middle element of a group with $n/k$ elements. Strictly speaking, this is only possible when $n/k$ is odd. When $n/k$ is even, we ought to take either the element of rank $n/(2k)$, just before the middle, or the element of rank $n/(2k)+1$, just after the middle. In either case, the EMD ends up being essentially the same as (and never smaller than) the one obtained in Formula (1). ∎
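The bound of Proposition 1 can be checked numerically on a small instance. The sketch below uses exact rational arithmetic and assumes that all $n$ confidential values are distinct, so that each record is identified by its rank; the function names and the choice $n = 15$, $k = 3$ (so that $n/k$ is odd) are illustrative.

```python
from fractions import Fraction
from itertools import accumulate, combinations

def emd_of_ranks(cluster_ranks, n):
    """Exact EMD between a cluster, given by the ranks (1..n) of its records,
    and a data set of n records with distinct confidential values."""
    k = len(cluster_ranks)
    chosen = set(cluster_ranks)
    # Cluster mass is 1/k on chosen ranks and 0 elsewhere; data set mass is 1/n everywhere.
    diffs = (Fraction(int(r in chosen), k) - Fraction(1, n) for r in range(1, n + 1))
    return sum(abs(c) for c in accumulate(diffs)) / (n - 1)

def lower_bound(n, k):
    """Right-hand side of the inequality in Proposition 1."""
    return Fraction((n + k) * (n - k), 4 * n * (n - 1) * k)

n, k = 15, 3
group = n // k
# Middle rank of each of the k consecutive groups of n/k records.
middle_ranks = [i * group + (group + 1) // 2 for i in range(k)]
best = emd_of_ranks(middle_ranks, n)
bound = lower_bound(n, k)
```

Here the cluster made of ranks 3, 8 and 13 attains the bound exactly, and an exhaustive search over all clusters of size 3 finds none below it.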
Note that, once $n$ and $t$ are fixed, Proposition 1 determines the minimum value of $k$ required for EMD to be smaller than $t$. An issue with the construction of the $k$ cluster values depicted in Figure 1 is that it is too restrictive. For instance, for given values of $n$ and $k$, if the minimal EMD value computed in Proposition 1 is exactly equal to $t$, then only clusters having exactly those $k$ confidential attribute values satisfy $t$-closeness (there may be only one such cluster). Any other cluster having different confidential attribute values does not satisfy $t$-closeness. Moreover, in the construction of Figure 1, the clusters are generated based only on the values of the confidential attribute, which may lead to a large information loss in terms of the quasi-identifiers.
Given the limitations pointed out above, our goal is to guarantee that the EMD of the clusters is below a specific value while allowing the clustering algorithm enough freedom to select appropriate records (in terms of quasi-identifiers) for each of the clusters. The approach that we propose is similar to the one of Figure 1: we group the records in $T$ into $k$ subsets based on the confidential attribute and we then generate clusters based on the quasi-identifiers with the constraint that each cluster should contain one record from each of the $k$ subsets (the specific record is selected based on the quasi-identifier attributes). Proposition 2 gives an upper bound on the level of $t$-closeness that we attain. To simplify the derivation and the proof, we assume in the proposition that $k$ divides $n$.
Proposition 2. Let $T$ be a data set with $n$ records and let $A$ be a confidential attribute of $T$ whose values can be ranked. Let $S_1, \ldots, S_k$ be a partition of the records in $T$ into $k$ subsets of $n/k$ records in ascending order of the attribute $A$. Let $C$ be a cluster that contains exactly one record from each of the subsets $S_i$, for $i = 1, \ldots, k$. Then $EMD(C,T) \leq \frac{n-k}{2(n-1)k}$.
Proof. The factors that determine EMD are: (i) the amount of probability mass that needs to be moved and (ii) the distance by which it is moved. The first factor is fixed and cannot be modified: each of the records in $C$ has probability mass $1/k$, and each of the records in $T$ has probability mass $1/n$. As to the second factor, to find an upper bound to EMD, we need to consider a cluster $C$ that maximizes EMD: the records selected for inclusion into $C$ must be at the lower (or upper) end of the sets $S_i$ for $i = 1, \ldots, k$. This is depicted in Figure 2. (Note the analogy with the proof of Proposition 1: there we took the median of each $S_i$ to minimize EMD.)
EMD for the case in Figure 2 can be computed as $k$ times the cost of distributing the probability mass of the lowest-ranked record of $S_1$ among the $n/k$ elements of $S_1$:
$$EMD(C,T) = k \times \frac{1}{n} \times \frac{1}{n-1}\sum_{j=1}^{n/k}(j-1) = \frac{n-k}{2(n-1)k} \qquad (2)$$
With the upper bound on EMD given by Proposition 2, we can determine the cluster size required in the microaggregation: just replace $EMD(C,T)$ by $t$ on the left-hand side of the bound and solve for $k$ to get a lower bound for $k$. For a data set containing $n$ records and for a required level of $t$-closeness and $k$-anonymity, the cluster size must be
$$\max\left\{k, \left\lceil \frac{n}{2(n-1)t+1} \right\rceil\right\} \qquad (3)$$
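As a worked example of the cluster size computation, the sketch below evaluates the upper bound of Proposition 2 and the resulting cluster size. The function names are illustrative; the figures $n = 1080$, $k = 2$ and $t = 0.1$ are chosen only for the example, and the bound assumes that the cluster size divides $n$.

```python
import math

def emd_upper_bound(n, k):
    """Upper bound of Proposition 2 on the EMD of a cluster that takes
    one record from each of k rank-ordered subsets (k divides n)."""
    return (n - k) / (2 * (n - 1) * k)

def required_cluster_size(n, k, t):
    """Smallest cluster size that is at least k and whose EMD upper bound
    does not exceed t, obtained by solving (n - k) / (2 (n - 1) k) <= t for k."""
    return max(k, math.ceil(n / (2 * (n - 1) * t + 1)))

n, k, t = 1080, 2, 0.1
size = required_cluster_size(n, k, t)
```

With these figures the $t$-closeness requirement, not $k$-anonymity, determines the cluster size: clusters of 5 records keep the bound just under 0.1, whereas clusters of 4 records would not.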
To keep things simple, so far we have assumed that $k$ divides $n$. However, the algorithm to generate $t$-close data sets must work even if that is not the case. If discarding some records from the original data set is a viable option, we could discard records until $k$ divides the new $n$, and proceed as described above. If records cannot be discarded, some of the clusters would need to contain more than $k$ records. In particular, we may allow some clusters to have either $k$ or $k+1$ records.
If we group the records into $k$ sets with $\lfloor n/k \rfloor$ records, then $n \bmod k$ records remain. We propose to assign the remaining records to one of the subsets. Then, when generating the clusters, two records from this subset are added to the first $n \bmod k$ clusters. This is only possible if $n \bmod k \leq \lfloor n/k \rfloor$ (the number of remaining records is not greater than the number of generated clusters); otherwise, there will be records not assigned to any cluster. Note, however, that using a cluster size $k$ with $n \bmod k > \lfloor n/k \rfloor$ makes no sense: since all clusters receive more than $k$ records, what is reasonable is to adapt to reality by increasing $k$. Specifically, to avoid having $n \bmod k > \lfloor n/k \rfloor$, $k$ is adjusted as
$$k \leftarrow k + \left\lfloor \frac{n \bmod k}{\lfloor n/k \rfloor} \right\rfloor$$
Adding two records from one of the subsets to a cluster increases the EMD of the cluster. To minimize the impact on the EMD, we need to reduce the work required to distribute the probability mass of the extra record across the whole range of values. Hence, the extra record must be close to the median record of the data set. Figure 3 illustrates the types of clusters that we allow when $k$ is odd (there is a single subset in the middle), and Figure 4 illustrates the types of clusters that we allow when $k$ is even (there are two subsets in the middle). Essentially, when $k$ is odd, the additional records are added to the subset of rank $(k+1)/2$ (the subset in the middle); then, we generate clusters with size $k$ and clusters with size $k+1$, which take two records from the middle subset. When $k$ is even, the additional records are split between the subsets of rank $k/2$ and $k/2+1$ (the subsets in the middle); then, we generate clusters with size $k$ and clusters with size $k+1$, some with an additional record from the first middle subset and some from the second.
Just as we did in Proposition 2, we can compute an upper bound for the EMD of the clusters depicted in Figures 3 and 4. The EMD of a cluster measures the cost of transforming the distribution of to the distribution of the data set. The cost of the probability mass redistribution can be computed in two steps as follows. First, we want the weight of each subset in cluster (the proportion of records in coming from each subset) to be equal to the weight of the subset in the data set; to this end, we redistribute the probability mass of the cluster between subsets. This redistribution cost, , equals the EMD between the cluster and the data set when the distributions have been discretized to the subsets. Then, for each subset , we compute , an upper bound of the cost of distributing the probability mass assigned to the subset among its elements (this is analogous to the mass distribution in the proof of Proposition 2). The EMD is the sum . The fact that there are subsets with different sizes and there are clusters with different sizes makes formulas quite tedious and unwieldy, even though the resulting bounds on EMD are very similar to the one obtained in Proposition 2. For these reasons, we will use the latter as an approximation even when does not divide ; in particular, we will determine the cluster size using Expression (3).
In terms of computational cost, Algorithm 3 has a great advantage over Algorithms 1 and 2: when running Algorithm 3, we know that by construction the generated clusters satisfy t-closeness, so there is no need to compute any EMD. Algorithm 3 has cost O(n^2/k), the same cost order as MDAV (on which it is based). Actually, Algorithm 3 is even slightly more efficient than MDAV: all operations being equal, some of the computations that MDAV performs on the entire data set are performed by Algorithm 3 just on one of the subsets of records.
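The by-construction guarantee can be illustrated with a simplified sketch of t-closeness-first cluster formation. This is not a reproduction of Algorithm 3: it assumes that the cluster size k divides the number of records, that the subsets are obtained by sorting the records on the confidential attribute, and that clusters are seeded MDAV-style with the record farthest from the average of the remaining records. The names t_closeness_first and sq_dist and the toy records are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two quasi-identifier tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def t_closeness_first(records, k):
    """records: list of (quasi_identifier_tuple, confidential_value).
    Returns clusters as lists of record indices.
    Simplifying assumption: k divides len(records)."""
    n = len(records)
    if n % k != 0:
        raise ValueError("this sketch assumes that k divides n")
    # Sort record indices by confidential value and cut them into k
    # equal-size subsets of consecutive ranks.
    by_confidential = sorted(range(n), key=lambda i: records[i][1])
    size = n // k
    subsets = [by_confidential[j * size:(j + 1) * size] for j in range(k)]
    dims = len(records[0][0])
    clusters = []
    while subsets[0]:
        remaining = [i for s in subsets for i in s]
        avg = [sum(records[i][0][d] for i in remaining) / len(remaining)
               for d in range(dims)]
        # Seed: the remaining record farthest from the average.
        seed = max(remaining, key=lambda i: sq_dist(records[i][0], avg))
        cluster = []
        for s in subsets:
            # From each subset, take the record closest to the seed
            # in terms of quasi-identifiers.
            pick = min(s, key=lambda i: sq_dist(records[i][0], records[seed][0]))
            s.remove(pick)
            cluster.append(pick)
        clusters.append(cluster)
    return clusters

records = [((float(i % 4), float(i // 4)), float(v))
           for i, v in enumerate([7, 1, 9, 4, 12, 3, 8, 2, 11, 6, 10, 5])]
clusters = t_closeness_first(records, k=3)
```

Every cluster produced this way holds exactly one record from each subset, so its confidential values are spread over the whole range without any EMD being computed. The loop runs n/k times and each pass scans the remaining records, which is in line with the quadratic cost order discussed above.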
8 Empirical evaluation
In this section we empirically evaluate and compare the proposed algorithms using several data sets and according to different metrics: actual cluster size, speed and scalability, and data utility preservation.
8.1 Actual cluster size
In a first battery of tests we used as evaluation data the Census data set [1], which is commonly used to test privacy protection methods [32, 12, 6] and contains 1,080 records with numerical attributes. Similar to , we took attributes TAXINC (Taxable income amount) and POTHVAL (Total other persons income) as quasi-identifiers, and FEDTAX (Federal income tax liability) and FICA (Social Security retirement payroll deduction) as confidential attributes.
Because k-anonymity and t-closeness pursue different goals (the former clusters records with similar quasi-identifiers while the latter requires clusters with a distribution of confidential attributes similar to the one of the entire data set), we defined two data sets according to the correlation between the values of quasi-identifier and confidential attributes:
Moderately correlated data set (MCD). It consists of 1,080 records with TAXINC and POTHVAL as quasi-identifier attributes, and FEDTAX as confidential attribute. The correlation between both types of attributes is 0.52. This represents the most usual scenario in which quasi-identifiers and confidential attributes show some correlation.
Highly correlated data set (HCD). It uses the same quasi-identifiers as MCD, but it takes FICA as confidential attribute. The correlation between both types of attributes is 0.92. This highly correlated data set represents a worst-case scenario for our algorithms because, to fulfill a certain t-closeness level (i.e., to ensure a certain distribution of confidential values), we are likely to be forced to microaggregate records with significantly diverse quasi-identifier values, thereby incurring more information loss than in the MCD data set.
By applying the three algorithms to these two data sets for different values of k and t, we will show how close to k the sizes of the clusters formed by each algorithm are for each value of t to be enforced. To minimize information loss, the closer all cluster sizes are to k, the better. The k values have been taken in the range 2-30, which covers the most usual k-anonymity values (e.g., k is taken between 3 and 10 in ), whereas the t values have been taken in the range 0.01-0.25 (where 0.25 is the upper bound of t-closeness for this data set for the lowest k, that is, k = 2, according to Proposition 2).
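The 0.25 figure can be checked numerically. Proposition 2 and Expression (3) are not reproduced in this section, so the sketch below rests on an assumption: that the bound for a cluster holding one record from each of k equal-size subsets, under the ordered-distance EMD, takes the form (n - k) / (2 (n - 1) k), and that the cluster size is the larger of k and the smallest size for which this bound does not exceed t. The function names are hypothetical.

```python
import math

def emd_upper_bound(n, k):
    """Assumed form of the Proposition 2 bound for clusters that take one
    record from each of k equal-size subsets of an n-record data set."""
    return (n - k) / (2 * (n - 1) * k)

def min_cluster_size(n, k, t):
    """Assumed form of Expression (3): the larger of the requested k and the
    smallest cluster size whose bound does not exceed t."""
    return max(k, math.ceil(n / (2 * (n - 1) * t + 1)))

n = 1080
bound_k2 = emd_upper_bound(n, 2)         # about 0.2498
size_t005 = min_cluster_size(n, 2, 0.05)  # 10
size_t001 = min_cluster_size(n, 2, 0.01)  # 48
```

Under these assumptions, k = 2 on 1,080 records gives a bound of about 0.2498, which agrees with the upper end of the t range, while t = 0.05 and t = 0.01 would require clusters of at least 10 and 48 records respectively.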
We start by analyzing the behavior of Algorithm 1, in which records are first microaggregated in clusters of size k that are thereafter merged until t-closeness is fulfilled. Table I shows the actual level of microaggregation that results from the merging process: minimum, that is, the size of the smallest cluster (which determines the actual k-anonymity level achieved), and average, that is, the average size of the merged clusters.
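The merging process can be sketched as follows. The merging policy used here is an assumption made for illustration: the cluster with the largest EMD is merged into the cluster whose quasi-identifier centroid is nearest, and this is repeated until every cluster satisfies t-closeness. The EMD is taken in its ordered-distance form, and all names and the toy data are hypothetical.

```python
from bisect import bisect_right

def emd_ordered(cluster_values, dataset_values):
    """Ordered-distance EMD between a cluster and the data set."""
    support = sorted(set(dataset_values))
    if len(support) < 2:
        return 0.0
    ds, cs = sorted(dataset_values), sorted(cluster_values)
    total = sum(abs(bisect_right(cs, v) / len(cs) - bisect_right(ds, v) / len(ds))
                for v in support[:-1])
    return total / (len(support) - 1)

def centroid(cluster):
    """Average quasi-identifier vector of a cluster of (qi, value) records."""
    qis = [qi for qi, _ in cluster]
    return [sum(col) / len(qis) for col in zip(*qis)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def merge_until_t_close(clusters, t):
    """clusters: list of lists of (qi_tuple, confidential_value) records,
    e.g. the output of any k-anonymous microaggregation."""
    clusters = [list(c) for c in clusters]
    data_values = [v for c in clusters for _, v in c]
    while len(clusters) > 1:
        emds = [emd_ordered([v for _, v in c], data_values) for c in clusters]
        worst = max(range(len(clusters)), key=emds.__getitem__)
        if emds[worst] <= t:
            break  # every cluster already satisfies t-closeness
        merged = clusters.pop(worst)
        target = centroid(merged)
        nearest = min(range(len(clusters)),
                      key=lambda i: sq_dist(centroid(clusters[i]), target))
        clusters[nearest].extend(merged)
    return clusters

# Toy data with fully correlated quasi-identifier and confidential values,
# pre-clustered into groups of k = 3 consecutive records.
records = [((float(i),), float(i)) for i in range(12)]
initial = [records[i:i + 3] for i in range(0, 12, 3)]
result = merge_until_t_close(initial, t=0.25)
```

Each merge removes one cluster, so the loop always terminates; in the limit a single cluster containing all records remains, whose distribution coincides with that of the data set.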
It can be seen that, in many cases, the actual level of microaggregation is significantly higher than the value of k. This is undesirable because the larger the clusters, the higher the information loss. We also see that the size of the clusters tends to increase for both data sets as:
i) the parameter t of t-closeness decreases: since clusters have been created without considering the desired t-closeness, it is unlikely that they satisfy it as t gets smaller. Thus, to decrease, if necessary, the distance between the distribution of confidential attributes within each cluster and over the entire data set, the algorithm merges the already created clusters (thereby increasing their cardinality); in the worst case (i.e., t around 0.01-0.05), this implies grouping all 1,080 records in a single cluster.
ii) the initial level of k-anonymity increases: the coarser the initial microaggregation, the more effort (i.e., merging) is needed to achieve a certain t-closeness level.
We also observe a noticeable difference between the minimum and average cardinality of the clusters, which suggests that the microaggregation of records that we obtain in practice is far from optimal.
Table II shows the results for Algorithm 2. With this algorithm, we observe that the actual microaggregation levels are significantly smaller than in the previous case for the same values of k and t, and so is the difference between the minimum and average cardinality of the clusters. Now t-closeness is enforced after creating each cluster rather than after creating all clusters. Thus, once a cluster is created, some of the records in that cluster may be replaced by unclustered records until t-closeness is satisfied; doing so does not increase the cardinality of the cluster, even though it may end up clustering records with less homogeneous quasi-identifiers and thereby yielding a higher loss of information. Only if the replacement fails to achieve the desired t-closeness are the clusters merged as in Algorithm 1, thereby increasing the microaggregation level (in fact, as suggested in Section 6, we use Algorithm 2 as the microaggregation function of Algorithm 1). The results shown in Table II suggest that this merging occurs for the smallest t values (i.e., 0.01-0.05), which are the ones that impose the strictest constraint.
The differences between the two data sets are more noticeable if we look at the average cardinality of the clusters: the HCD data set results in a larger average cardinality, because the initial clusters present more homogeneous confidential values (these are very correlated to the more homogeneous quasi-identifier values obtained for the first clusters) and tend to require more effort (i.e., replacements and mergers) to attain t-closeness.
Finally, Table III shows the results for Algorithm 3. The figures in this table show that Algorithm 3 is the one achieving an actual microaggregation level closest to the desired k. Moreover, since the cardinality of the data sets (1,080 records) is a multiple of the values of k, all clusters can be formed with the same cardinality (i.e., clusters are perfectly balanced). Indeed, as stated in Section 7, Algorithm 3 seeks the smallest clusters whose cardinality is at least k and which satisfy a pre-specified level of t-closeness. To do so, it prioritizes the fulfillment of t-closeness over the homogeneity of quasi-identifiers in cluster formation. Because of this strategy, there are no differences between the MCD and HCD data sets; in fact, we can see that for most parameter choices and for both data sets the minimum and average cluster sizes are .
8.2 Speed and scalability
The second part of the evaluation focuses on measuring the speed and scalability of the three algorithms with a larger data set.
To that end, we took a higher-dimensional data set from the Patient Discharge Data for year 2010 of Californian hospitals, which is provided by California’s Office of Statewide Health Planning and Development [20]. We took the data set with the largest number of entries (Cedars Sinai Medical Center, with 55,668 patient records). From these, we removed records with missing attribute values and obtained a final data set with 23,435 records. Each record consists of 7 quasi-identifier attributes (e.g., patient’s age, zip code, admission date, etc.) plus one confidential attribute that specifies the amount charged for the patient’s stay in the hospital. The correlation between the quasi-identifier attributes and the confidential one is just 0.129.
The run time of the three algorithms for the Patient Discharge data set is shown in Figure 5 as a function of the value of t to be attained. We set k = 2 in order to give maximum freedom to the algorithms in adapting the microaggregation to the desired value of t (again between 0.01 and the maximum upper bound of 0.25), and to force them to create the greatest number of clusters (which is the worst case from the run time perspective).
Run time figures are consistent with the theoretical analysis of computational costs for the three algorithms. Algorithms 1 and 3 are significantly more efficient than Algorithm 2 (note the logarithmic scale of the Y-axis), because the former have just the quadratic cost of the underlying microaggregation algorithm, whereas the latter has a cubic cost resulting from the rearrangement of records required to fulfill t-closeness after the creation of each cluster. Indeed, Algorithm 2 may not scale well for large data sets, whereas the other two algorithms scale as well as the underlying microaggregation. At a closer look, Algorithm 3 is significantly more efficient than Algorithm 1 for low values of t. The reason is that, although the cost of both algorithms is O(n^2/k), Algorithm 3 optimally updates the value of k in terms of the actual t: for small values of t, the value of k is large (see Expression (3)), which reduces the computational cost. In contrast, Algorithm 1 only takes t into account after the entire microaggregation has been performed. Finally, the run time of Algorithm 2 tends to decrease for large t because, in this case, clusters are more likely to (nearly) fulfill t-closeness, thus requiring less rearrangement of records after each iteration.
8.3 Data utility preservation
So far, the comparison between algorithms has been made only in terms of cluster sizes and run time. Let us now examine to what extent each algorithm preserves the data utility for a certain privacy level. Indeed, the different microaggregation strategies and the actual levels of microaggregation achieved by the three algorithms have a direct influence on the utility of the anonymized results. In the literature, the utility of an anonymized output is evaluated in terms of information loss, that is, the discrepancies between the original and the anonymized data set. The Sum of Squared Errors (SSE) is a well-known information loss measure, which is well-suited to capture the impact of creating equivalence classes by means of k-anonymous microaggregation. SSE is defined as the sum of squares of attribute distances between records in the original data set and their versions in the anonymized data set. However, since SSE provides absolute error values, we normalized it to obtain a measure that is independent of the data set size (number of records and attributes) and the ranges of attribute values:
where is the number of records, is the number of attributes, is the value of the -th attribute for the -th original record, represents its anonymized version and corresponds to the Normalized Euclidean Distance. Notice that with a high SSE, that is, a high information loss, a lot of data uses are severely damaged, like for example subdomain analyses (analyses restricted to parts of the data set).
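The exact normalization cannot be read off the description above, so the following Python sketch shows only one plausible instance of a normalized SSE. It assumes that each attribute difference is divided by the range of that attribute in the original data and that the squared differences are averaged over all records and attributes. The function name normalized_sse and the toy records are hypothetical.

```python
def normalized_sse(original, anonymized):
    """One possible normalized SSE: squared, range-normalized attribute
    differences averaged over all records and attributes.
    Both arguments are equal-length lists of equal-length numeric records."""
    n, m = len(original), len(original[0])
    # Attribute ranges are taken from the original data set.
    ranges = [max(col) - min(col) for col in zip(*original)]
    total = 0.0
    for record, masked in zip(original, anonymized):
        for j in range(m):
            if ranges[j] > 0:  # constant attributes contribute no error
                total += ((record[j] - masked[j]) / ranges[j]) ** 2
    return total / (n * m)

# Two records microaggregated into a single cluster: each record is
# replaced by the cluster centroid.
original = [[0.0, 0.0], [10.0, 100.0]]
centroid = [sum(col) / len(original) for col in zip(*original)]
anonymized = [centroid, centroid]
loss = normalized_sse(original, anonymized)
```

With this normalization the measure lies between 0 (no change) and 1, regardless of the number of records, the number of attributes, or their scales, which is the independence property stated above.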
To fairly and clearly compare the three algorithms, we first took k = 2 for k-anonymity with values between 0.01 and 0.25 for t-closeness. In this manner, any actual cluster size is feasible and the algorithms have the greatest freedom to microaggregate records to fulfill the desired t-closeness. SSE values for each value of t are shown in Figure 6 for the three data sets.
All graphs show that Algorithm 2 improves on Algorithm 1 and, in turn, Algorithm 3 improves on Algorithm 2. Thus, we can see that the earlier we consider the fulfillment of t-closeness in the microaggregation step, the more utility is preserved in the output. This may seem paradoxical, because a t-closeness-aware microaggregation that prioritizes the distribution of confidential values (such as the one in Algorithm 3) is likely to cluster records with heterogeneous quasi-identifier values, and thereby incur higher information loss. Some of this is apparent in Figure 6: Algorithm 3 improves much more on the other two algorithms for the MCD and Patient Discharge data sets than for the HCD data set, because cluster homogeneity for HCD is harder to reconcile with the t-closeness requirement due to the higher correlation between quasi-identifiers and the confidential attribute. On the other hand, the fact that the k-anonymous microaggregation is aware of the level of t-closeness that should be satisfied also produces smaller clusters (of size closer to the desired k), which helps to keep SSE low. In contrast, the other algorithms, and especially Algorithm 1, prioritize quasi-identifier values in the k-anonymous microaggregation and, hence, they require a lot of cluster merging and/or manipulation to attain t-closeness. This tends to produce larger clusters (as shown by the experiments on cluster sizes), whose aggregation incurs a greater loss of information, which is nonetheless fairly independent of the correlation between quasi-identifiers and confidential attributes; this is especially noticeable for the Patient Discharge data set, in which Algorithm 1 behaves significantly worse than the other two.
To sum up, the increase of information loss that the lower cluster homogeneity of t-closeness-aware microaggregation might cause is more than compensated by the information loss reduction resulting from smaller clusters.
Finally, we also evaluated the evolution of the normalized SSE as a function of both k and t. As a reference, Figure 7 shows this evolution for the three algorithms with the MCD data set.
First, we can see that some of the advantages of Algorithm 3 are diminished when a higher k is required. As shown in Proposition 2, the actual cluster size will be the maximum between the desired k and the minimum size required to fulfill t-closeness. Thus, because of the optimal updating of k by Algorithm 3, this algorithm is the one for which SSE increases the most as a result of the larger k. Algorithms 1 and 2, on the other hand, are more immune to large values of k. Indeed, since they prioritize the k-anonymous microaggregation, the larger clusters obtained for large values of k have a greater chance to already fulfill t-closeness without the subsequent merging step; since k-anonymous clusters are created in order to minimize the SSE, the smaller number of merging steps required to fulfill t-closeness helps to maintain cluster homogeneity and avoid increasing SSE. In any case, for any value of k, the SSE for these algorithms is still higher than for Algorithm 3.
For Algorithms 1 and 2, it is also interesting to observe the spikes that occur for certain values of k, which are more noticeable for Algorithm 1. Spikes occur when k is not a divisor of the data set size (i.e., 1,080); that is, when it is not possible to group all records in clusters of size k. In such cases, the microaggregation algorithm is forced to distribute the remaining records among already created clusters, which deteriorates cluster homogeneity and thus increases SSE. On the contrary, Algorithm 3 is more immune to this situation, because clusters are created to satisfy t-closeness, rather than to minimize the SSE.
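The values of k that leave unassigned records can be listed with a few lines of Python. The sketch assumes, for illustration only, that every integer k in the range 2-30 is a candidate; the function name leftover_records is hypothetical.

```python
def leftover_records(n, k_min, k_max):
    """Map each k in [k_min, k_max] that does not divide n to the number of
    records left over after forming floor(n / k) clusters of size k."""
    return {k: n % k for k in range(k_min, k_max + 1) if n % k != 0}

leftovers = leftover_records(1080, 2, 30)
non_divisors = sorted(leftovers)
```

For 1,080 records the non-divisors in that range are 7, 11, 13, 14, 16, 17, 19, 21, 22, 23, 25, 26, 28 and 29; for instance, k = 7 leaves 2 records to be distributed among clusters that have already been formed.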
9 Conclusions and research directions
We have proposed and evaluated the use of microaggregation as a method to attain k-anonymous t-closeness.
The a priori benefits of microaggregation vs generalization/recoding and local suppression have been discussed. Global recoding may recode more than needed, whereas local recoding complicates data analysis by mixing together values corresponding to different levels of generalization. Also, recoding produces a greater loss of granularity of the data, is more affected by outliers, and changes numerical values to ranges. Regarding local suppression, it complicates data analysis with missing values, and it is not obvious how to combine it with recoding in order to decrease the amount of generalization. Microaggregation is free from all the above downsides.
We have proposed and evaluated three different microaggregation-based algorithms to generate k-anonymous t-close data sets. The first one is a simple merging step that can be run after any microaggregation algorithm. The other two algorithms, k-anonymity-first and t-closeness-first, take the t-closeness requirement into account at the moment of cluster formation during microaggregation. The t-closeness-first algorithm considers t-closeness earliest and provides the best results: smallest average cluster size, smallest SSE for a given level of t-closeness, and shortest run time (because the actual microaggregation level is computed beforehand according to the values of k and t). Thus, considering the t-closeness requirement from the very beginning turns out to be the best option.
Since connections have been demonstrated between t-closeness and ε-differential privacy of data sets [27, 8], exploring how microaggregation could be leveraged to implement the latter model in the case of data releases is a natural continuation of this work. Moreover, we will also study the adaptation of the algorithms to support categorical data by: i) defining an EMD suitable to compare categorical values of different nature (e.g., ordinal values such as colors, which can be sorted within a range, or nominal values such as jobs, hobbies, diagnoses, etc., which require interpreting their underlying semantics), ii) defining aggregation operators to compute cluster centroids (i.e., the categorical value that minimizes the distance to other values in the same cluster), and iii) properly managing records with numerical and categorical attributes in an integrated manner.
Acknowledgments and disclaimer
This work was partly supported by the European Commission (through projects FP7 “DwB”, FP7 “Inter-Trust” and H2020 “CLARUS”), by the Spanish Government (through projects “ICWT” TIN2012-32757, “CO-PRIVACY” TIN2011-27076-C03-01 and “BallotNext” IPT-2012-0603-430000) and by the Government of Catalonia (under grant 2014 SGR 537). Josep Domingo-Ferrer is partially supported as an ICREA-Acadèmia researcher by the Government of Catalonia and by a Google Faculty Research Award. Partial support by the Templeton World Charity Foundation is also acknowledged (“CO-UTILITY” grant). The opinions expressed in this paper are the authors’ own and do not necessarily reflect the views of the Templeton World Charity Foundation or UNESCO.
References
- [1] R. Brand, J. Domingo-Ferrer, and J.M. Mateo-Sanz. Reference data sets to test and compare SDC methods for protection of numerical microdata. European Project IST-2000-25069 CASC. http://neon.vb.cbs.nl/casc/CASCtestsets.htm
- [2] J. Cao, P. Karras, P. Kalnis, and K.-L. Tan. SABRE: a Sensitive Attribute Bucketization and REdistribution framework for t-closeness. The VLDB Journal, 20(1):59–81, 2011.
- [3] D. Defays and P. Nanopoulos. Panels of enterprises and confidentiality: the small aggregates method. In Proceedings of the 1992 Symposium on Design and Analysis of Longitudinal Surveys, pp. 195–204, Ottawa, 1992. Statistics Canada.
- [4] J. Domingo-Ferrer and V. Torra. A quantitative comparison of disclosure control methods for microdata. In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies (eds. L. Zayatz, P. Doyle, J. Theeuwes and J. Lane), pp. 111–134, Amsterdam, 2001. North Holland.
- [5] J. Domingo-Ferrer and J. M. Mateo-Sanz. Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1):189–201, 2002.
- [6] J. Domingo-Ferrer and U. González-Nicolás. Hybrid microdata using microaggregation. Information Sciences, 180(15):2834–2844, 2010.
- [7] J. Domingo-Ferrer, D. Sánchez and G. Rufian-Torrell. Anonymization of nominal data based on semantic marginality. Information Sciences, 242:35–48, 2013.
- [8] J. Domingo-Ferrer and J. Soria-Comas. From t-closeness to differential privacy and vice versa in data anonymization. Knowledge-Based Systems, 74:151–158, 2015.
- [9] J. Domingo-Ferrer and V. Torra. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Min. Knowl. Discov., 11(2):195–212, 2005.
- [10] C. Dwork. Differential privacy. In Proc. of the 33rd Intl. Colloquium on Automata, Languages and Programming (ICALP 2006), LNCS 4052, pp. 1–12. Springer, 2006.
- [11] A. Hundepool, J. Domingo-Ferrer, L. Franconi, S. Giessing, E. S. Nordholt, K. Spicer, and P.-P. de Wolf. Statistical Disclosure Control. Wiley, 2012.
- [12] M. Laszlo and S. Mukherjee. Minimum spanning tree partitioning algorithm for microaggregation. IEEE Transactions on Knowledge and Data Engineering, 17(7):902–911, 2005.
- [13] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: efficient full-domain k-anonymity. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD 2005), pp. 49–60, New York, NY, USA, 2005. ACM.
- [14] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006), Washington, DC, USA, 2006. IEEE Computer Society.
- [15] J. Li, R.C.-W. Wong, A.W.-C. Fu, and J. Pei. Anonymization by local recoding in data with attribute hierarchical taxonomies. IEEE Transactions on Knowledge and Data Engineering, 20(9):1181–1194, 2008.
- [16] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: privacy beyond k-anonymity and l-diversity. In Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE 2007), pp. 106–115. IEEE, 2007.
- [17] N. Li, T. Li, and S. Venkatasubramanian. Closeness: a new privacy measure for data publishing. IEEE Transactions on Knowledge and Data Engineering, 22(7):943–956, 2010.
- [18] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. l-diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data, 1(1), 2007.
- [19] A. Oganian and J. Domingo-Ferrer. On the complexity of optimal microaggregation for statistical disclosure control. Statistical Journal of the United Nations Economic Commission for Europe, 18:345–354, 2001.
- [20] Patient Discharge Data. Office of Statewide Health Planning & Development-OSHPD, 2010. http://www.oshpd.ca.gov/HID/Products/PatDischargeData/PublicDataSet/index.html
- [21] D. Rebollo-Monedero, J. Forne, and J. Domingo-Ferrer. From t-closeness-like privacy to postrandomization via information theory. IEEE Transactions on Knowledge and Data Engineering, 22(11):1623–1636, 2010.
- [22] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
- [23] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In Proceedings of the IEEE Symposium on Research in Security and Privacy, 1998.
- [24] A. Solanas and A. Martínez-Ballesté. V-MDAV: Variable group size multivariate microaggregation. In Proceedings of the International Conference on Computational Statistics (COMPSTAT 2006), pp. 917–925, 2006.
- [25] A. Solanas, F. Sebé, and J. Domingo-Ferrer. Micro-aggregation-based heuristics for p-sensitive k-anonymity: one step beyond. In Proceedings of the 2008 International Workshop on Privacy and Anonymity in Information Society (PAIS 2008), pp. 61–69, New York, NY, USA, 2008. ACM.
- [26] J. Soria-Comas and J. Domingo-Ferrer. Probabilistic k-anonymity through microaggregation and data swapping. In Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2012), pp. 1–8. IEEE, 2012.
- [27] J. Soria-Comas and J. Domingo-Ferrer. Differential privacy via t-closeness in data publishing. In Proceedings of the 11th Annual International Conference on Privacy, Security and Trust (PST 2013), pp. 27–35, 2013.
- [28] J. Soria-Comas, J. Domingo-Ferrer, D. Sánchez and S. Martínez. Enhancing data utility in differential privacy via microaggregation-based k-anonymity. VLDB Journal, 23(5):771–794, 2014.
- [29] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
- [30] T. M. Truta and B. Vinay. Privacy protection: p-sensitive k-anonymity property. In Proceedings of the 2nd International Workshop on Privacy Data Management (PDM 2006), p. 94. IEEE Computer Society, 2006.
- [31] X. Xiao and Y. Tao. Anatomy: simple and effective privacy preservation. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB 2006), pp. 139–150. VLDB Endowment, 2006.
- [32] W.E. Winkler, W. E. Yancey and R.H. Creecy. Disclosure risk assessment in perturbative microdata protection. In Inference Control in Statistical Databases, LNCS 2316, pp. 135–152. Springer, 2002.
Jordi Soria-Comas is a postdoctoral researcher at Universitat Rovira i Virgili. He received his M.Sc. in Computer Security (2011) and his Ph.D. in Computer Science (2013) from the Universitat Rovira i Virgili. He also holds an M.Sc. in Finance from the Autonomous University of Barcelona (2004) and a B.Sc. in Mathematics from the University of Barcelona (2003). His research interests are in data privacy and security.
Josep Domingo-Ferrer (Fellow, IEEE) is a Distinguished Professor of Computer Science and an ICREA-Acadèmia Researcher at Universitat Rovira i Virgili, Tarragona, Catalonia, where he holds the UNESCO Chair in Data Privacy. He received the MSc and PhD degrees in Computer Science from the Autonomous University of Barcelona in 1988 and 1991, respectively. He also holds an MSc degree in Mathematics. His research interests are in data privacy, data security and cryptographic protocols. More information on him can be found at http://crises-deim.urv.cat/jdomingo
David Sánchez is an Associate Professor of Computer Science at Universitat Rovira i Virgili, Tarragona, Catalonia. His research interests are in data semantics and data privacy. He received a PhD in Computer Science from the Technical University of Catalonia. Contact him at firstname.lastname@example.org.
Sergio Martínez is a postdoctoral researcher at Universitat Rovira i Virgili (URV) in Tarragona. He received an M.Sc. in Intelligent Systems (2010) and a Ph.D. in Computer Science (2013), both awarded by the URV. His research interests are in artificial intelligence, semantic similarity and privacy preservation. He has participated in European and Spanish research projects.