Nearest Neighbor Classification Using Anatomized Data
Abstract
This paper analyzes nearest neighbor classification with training data anonymized using anatomy. Anatomy preserves all data values, but introduces uncertainty in the mapping between identifying and sensitive values. We first study the theoretical effect of the anatomized training data on the nearest neighbor error rate bounds, nearest neighbor convergence rate, and Bayesian error. We then validate the derived bounds empirically. We show that 1) learning from anatomized data approaches the limits of learning through the unprotected data (although requiring larger training data), and 2) nearest neighbor using anatomized data outperforms nearest neighbor on generalization-based anonymization.
I Introduction
Data publishing without revealing sensitive information is an important problem. Many privacy definitions have been proposed based on generalizing/suppressing data (ℓ-diversity [27], k-anonymity [31, 32], t-closeness [24], δ-presence [30], (α,k)-anonymity [36]). Other alternatives include value swapping [29], distortion [2], randomization [14], and noise addition (e.g., differential privacy [13]). Generalization consists of replacing identifying attribute values with a less specific version [6]. Suppression can be viewed as the ultimate generalization, replacing the identifying value with an “any” value [6]. These approaches have the advantage of preserving truth, but it is a less specific truth that reduces the utility of the published data.
Xiao and Tao proposed anatomization as a method to enforce ℓ-diversity while preserving specific data values [37]. Anatomization splits instances across two tables, one containing identifying information and the other containing private information. The more general approach of fragmentation [7] divides a given dataset’s attributes into two sets of attributes (2 partitions) such that an encryption mechanism avoids associations between two different small partitions. Vimercati et al. extend fragmentation to multiple partitions [11], and Tamas et al. propose an extension that deals with multiple sensitive attributes [19]. The main advantage of anatomization/fragmentation is that it preserves the original values of data; the uncertainty is only in the mapping between individuals and sensitive values.
We show that this additional information has real value. First, we demonstrate that in theory, learning from anatomized data can be as good as learning from the raw data. We then demonstrate empirically that learning from anatomized data beats learning from generalization-based anonymization.
This paper looks only at instance-based learning, specifically the nonparametric nearest neighbor (NN) classifier. This focus was chosen because we have solid theoretical results on the limits of learning, allowing us to compare theoretical bounds on learning from anatomized data with learning from the underlying unprotected data. We demonstrate this for a simple approach to using the anatomized data: we consider all possible mappings of individuals to sensitive values as equally likely.
There is concern that anatomization is vulnerable to several attacks [23, 20, 26]. While this can be an issue, any method that provides meaningful utility fails to provide perfect privacy against a sufficiently strong adversary [25, 13]. Introducing uncertainty into the anonymization process reduces the risk of many attacks, e.g., minimality [35, 8]. Our theoretical analysis holds for any assignment of items to anatomy groups, including a random assignment, which provides a high degree of robustness against minimality and correlation-based attacks. This paper has the following key contributions:

We define a classification task on anatomized data without violating the random worlds assumption. A violating classification task would be the prediction of the sensitive attribute, a task that was found to be #P-complete by Kifer [23].

To the best of our knowledge, this is the first paper in the privacy community that studies the theoretical effect of training the NN on anatomized data. We show the anatomization effect for the error rate bounds and the convergence rate when the test data is neither anonymized nor anatomized. Inan et al. already give a practical application of such a learning scenario [21].

We show the Bayesian error estimation for any nonparametric classifier using the anatomized training data.

We compare the NN classifier trained on the anatomized data with the NN classifier trained on the unprotected data. In the case of the nearest neighbor classifier (1NN), we also make an additional comparison to a generalization-based learning scheme [21].

Finally, we compare the theoretical estimate of the convergence rate with the practical measurements when the convergence rate is defined as a function of ℓ-diversity.
We next summarize the related work, and give a set of definitions and notations necessary for further discussion. Section IV shows error rate bounds of the nonparametric NN classifier; Section V analyzes the effect of anatomization on the Bayesian error. Section VI formulates the 1NN convergence rate under ℓ-diversity. The experimental analysis is presented in Section VII.
II Related Work
There have been studies on how to mine anonymized data. Nearest neighbor classification using generalized data was investigated by Martin. Nested generalization and non-nested hyperrectangles were used to generalize the data from which the nearest neighbor classifiers were trained [28]. Inan et al. proposed nearest neighbor and support vector machine classifiers using anonymized training data that satisfies k-anonymity. Taylor approximation was used to estimate the Euclidean distance from the anonymized training data [21]. Zhang et al. studied Naïve Bayes using partially specified training data [38], proposing a conditional likelihoods computation algorithm exploring the instance space of attribute-value generalization taxonomies. Agrawal et al. proposed an iterative distribution reconstruction algorithm for the distorted training data from which a C4.5 decision tree classifier was trained [1]. Iyengar suggested using a classification metric so as to find the optimum generalization. Then, a C4.5 decision tree classifier was trained from the optimally generalized training data [22]. Fung et al. gave a top-down specialization method (TDS) for anonymization so that the anonymized data allows accurate decision trees. A new scoring function was proposed for the calculation of decision tree splits from the compressed training data [18]. Dowd et al. studied C4.5 decision tree learning from training data perturbed by random substitutions. A matrix-based distribution reconstruction algorithm was applied on the perturbed training data from which an accurate C4.5 decision tree classifier was learned [12].
None of the earlier work has provided a method directly applicable to anatomized training data. A classifier using the anatomized training data requires specific theoretical and experimental analysis, because anatomized training data provides additional detail that has the potential to improve learning, but also additional uncertainty that must be dealt with. Furthermore, previous work didn’t justify theoretically why the proposed heuristics work empirically.
III Definitions and Notations
In this section, the first four definitions will recall the standard definitions of unprotected data and attribute types.
Definition 1
A dataset is called a person specific dataset for population if each instance belongs to a unique individual .
The person specific data will be called the training data in this paper. Next, we will give the first type of attributes.
Definition 2
A set of attributes are called direct identifying attributes if they let an adversary associate an instance to a unique individual without any background knowledge.
Definition 3
A set of attributes are called quasi-identifying attributes if there is background knowledge available to the adversary that associates the quasi-identifying attributes with a unique individual .
We include both direct and quasi-identifying attributes under the name identifying attribute. First name, last name and social security number (SSN) are common examples of direct identifying attributes. Some common examples of quasi-identifying attributes are age, postal code, and occupation. Next, we will give the second type of attribute.
Definition 4
An attribute of instance is called a sensitive attribute if it must be protected against adversaries from correctly inferring the value for an individual.
Patient disease and individual income are common examples of sensitive attributes. Unique individuals typically don’t want this sensitive information to be publicly known when a dataset is released to the public. Provided an instance , the class label is denoted by . We don’t consider the case where is sensitive, as this would make the purpose of classification to violate privacy. Typically is neither sensitive nor identifying, although the analysis holds for being an identifying attribute.
Given the former definitions, we will next define the anonymized training data following the definition of k-anonymity [32].
Definition 5
A training data that satisfies the following conditions is said to be anonymized training data [32]:

The training data does not contain any unique identifying attributes.

Every instance is indistinguishable from at least k-1 other instances in with respect to its quasi-identifying attributes.
In this paper, we assume that the anonymized training data is created according to a generalization-based data publishing method. We next define the comparison baseline classifiers.
Definition 6
A nonparametric nearest neighbor (NN) classifier that is trained on the anonymized training data is called the anonymized NN classifier.
Definition 7
A nonparametric NN classifier that is trained on the training data is called the original NN classifier.
The anonymized NN classifier will just be the comparison baseline in the evaluation and its theoretical discussion will not be included. We go further, requiring that there must be multiple possible sensitive values that could be linked to an individual. This requires the definition of groups [27].
Definition 8
A group is a subset of instances in training data such that , and for any pair where , .
Next, we define the concept of ℓ-diversity, or of being ℓ-diverse, given the former group definition.
Definition 9
A set of groups is said to be ℓ-diverse if and only if for all groups where is the sensitive attribute in , is the database projection operation on training data (or on data table in the database community), is the frequency of in and is the number of instances in .
We extend the data publishing method anatomization from Xiao et al. that is originally based on ℓ-diverse groups [37].
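The diversity condition of Definition 9 can be illustrated with a minimal sketch. It assumes the condition takes the frequency form used by Xiao and Tao [37], i.e., that no sensitive value accounts for more than a 1/ℓ fraction of the instances in any group. The function name `is_l_diverse` and the toy groups are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def is_l_diverse(groups, l):
    """Check the assumed frequency condition: in every group, no sensitive
    value may account for more than a 1/l fraction of the instances."""
    for sensitive_values in groups:
        if not sensitive_values:
            continue  # an empty group holds no instances to protect
        counts = Counter(sensitive_values)
        most_common_count = max(counts.values())
        # Exact rational comparison avoids floating-point edge cases.
        if Fraction(most_common_count, len(sensitive_values)) > Fraction(1, l):
            return False
    return True


# Hypothetical toy groups, each listed by its sensitive values only.
toy_groups = [["flu", "cold"], ["flu", "asthma", "cold", "flu"]]
print(is_l_diverse(toy_groups, 2))  # True: the most frequent value covers 1/2
print(is_l_diverse(toy_groups, 3))  # False: 1/2 exceeds 1/3
```

Only the most frequent sensitive value in a group needs to be checked, since every other value has a lower frequency.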
Definition 10
Given training data partitioned into ℓ-diverse groups according to Definition 9, anatomization produces an identifier table and a sensitive table as follows. has schema
including the class attribute, the quasi-identifying attributes for , and the group id of the group . For each group and each instance , has an instance of the form:
has schema
where is the sensitive attribute in and is the group id of the group . For each group and each instance , has an instance of the form:
Given the learning task of predicting class attribute , definition 10 lets us observe the following about training data published according to anatomization: every instance can be matched to instances using the common attribute in both data table schemas. This observation yields the anatomized training data and the anatomized NN classifier.
Definition 11
Given two data tables and resulting from the anatomization on training data , the anatomized training data is
where is the database inner join operation with respect to the condition and is the database projection operation on training data (*) processed according to definition 10.
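Definitions 10 and 11 can be sketched as two small functions: one that splits grouped records into an identifier table and a sensitive table, and one that rebuilds the anatomized training data by an inner join on the group id followed by dropping the group id. The tuple layout, the function names and the toy records are assumptions made for illustration only.

```python
def anatomize(groups):
    """Split grouped records into an identifier table and a sensitive table.

    Each record is (identifying_attrs, sensitive_value, class_label).
    """
    identifier_table, sensitive_table = [], []
    for gid, records in enumerate(groups, start=1):
        for ident, sensitive, label in records:
            identifier_table.append((ident, label, gid))
            sensitive_table.append((gid, sensitive))
    return identifier_table, sensitive_table


def anatomized_training_data(identifier_table, sensitive_table):
    """Inner-join both tables on the group id, then project the id away."""
    by_gid = {}
    for gid, sensitive in sensitive_table:
        by_gid.setdefault(gid, []).append(sensitive)
    joined = []
    for ident, label, gid in identifier_table:
        # Every identifier row pairs with every sensitive value of its group.
        for sensitive in by_gid.get(gid, []):
            joined.append((ident, sensitive, label))
    return joined


# Hypothetical group of two instances: ((age, hours), disease, class label).
toy_groups = [[((25, 40), "flu", 1), ((31, 35), "cold", 2)]]
it, st = anatomize(toy_groups)
joined = anatomized_training_data(it, st)
print(len(it), len(st), len(joined))  # 2 2 4
```

In this sketch a group of size ℓ turns each of its instances into ℓ joined rows, one per sensitive value in the group, so the correct pairing is present but hidden among the incorrect ones.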
Definition 12
A nonparametric NN classifier that is trained on the anatomized training data is called the anatomized kNN classifier.
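A nearest neighbor classifier of the kind named in Definitions 6, 7 and 12 can be sketched as follows. The sketch assumes numeric feature vectors and a squared Euclidean distance (a quadratic distance with the identity matrix); the function name `knn_predict` and the toy rows are hypothetical, and the same function applies whichever training data it is given.

```python
from collections import Counter


def knn_predict(training, query, k=1):
    """Classify `query` by majority vote among its k nearest training rows.

    `training` is a list of (feature_vector, label) pairs.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    neighbors = sorted(training, key=lambda row: sq_dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy rows with binary class labels 1 and 2.
toy_training = [((0.0, 0.0), 1), ((0.1, 0.0), 1),
                ((1.0, 1.0), 2), ((0.9, 1.0), 2), ((1.0, 0.9), 2)]
print(knn_predict(toy_training, (0.2, 0.1), k=1))  # 1
print(knn_predict(toy_training, (0.8, 0.9), k=3))  # 2
```

With two class labels and an odd k the vote cannot tie, which is the reason even values of k are set aside in the analysis.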
Using the former definitions, we now give assumptions and notations used in discussing the anatomized NN classifier. In the theoretical analysis, we assume that all the training data has a smooth probability distribution. Although anatomization requires a discrete probability distribution for the sensitive attribute , such a smoothness violation is negligible since the original NN classifier is known to fit well on discrete training data [33]. The sensitive attribute is assumed to be nonbinary. The anatomized NN cases where and is even will be ignored, because such cases include ties between nearest neighbors that make the bounds ambiguous and complicated [15]. The total number of attributes is assumed to be ( identifying attributes and 1 sensitive attribute) and all instances are assumed to be in a separable metric space as in [9, 10, 15]. has instances whereas has instances from Definition 11. All instances are i.i.d. whether they are in training or test data. For the sake of simplicity, will denote the identifying attributes . stands for test data that is not processed by any anatomization or generalization method. will be an instance of the test data . is the quadratic distance metric for a pair of instances and in metric space . denotes the set of number of nearest neighbors of in that the original NN classifier uses while denotes the set of number of nearest neighbors of in that the anatomized NN classifier uses. will interchangeably be an instance of or and will interchangeably be an instance of or . In the case of , we will use and for the nearest neighbors in and . is the random variable with probability distribution from which and are drawn. Training and test instances will be column vectors in the format of . is the class attribute in and with binary labels 1 and 2. Given the training data and the class label , , and stand for the posterior probability, the likelihood probability and the prior probability respectively. 
If the anatomized training data is used, , and are the symmetric definitions for the class label . is the error rate when is classified using . If is used to classify , will be the error rate. When hold for all , we denote the error rate by in Equation 1 [15].
(1) 
is the error rate when hold for all . can trivially be derived from Eqn. 1 by substituting with . The Bayesian errors given are denoted by and when holds for all and respectively. Eqn. 2 computes [15].
(2) 
can trivially be derived again from 2 by substituting with . and , which are and with respect to , will stand for the error rate of original NN and anatomized NN classifiers respectively. and , which are and with respect to , will stand for the Bayesian errors of original training data and anatomized training data respectively. We will denote and by and for convenience. Similarly, and will denote and . Further notations and definitions will be given in the paper if necessary.
IV Error Bounds of Anatomized NN
In this section, we will first show the error bounds for the anatomized 1NN classifier. We will then discuss the extension to the anatomized NN classifier for all odd k. We give only proof sketches due to space limitations.
We first give Corollary 1 which is critical for the error bounds of the anatomized 1NN classifier.
Corollary 1
Convergence of the nearest neighbor in the anatomized training data . Let and be i.i.d instances taking values separable in any metric space . Let be the nearest neighbor of in . Then, with probability one.
We can intuitively say that Corollary 1 should hold for the anatomized training data if it already holds for the training data . For the nearest neighbor of , there are instances in the anatomized training data including itself. Assuming very large training data size (), must still be the closest instance to in the anatomized training data . The incorrect instances are expected to remain far and should eventually hold.
We now give a sketch of the proof of Corollary 1. Let be the sphere with radius centered at . Let’s consider that has a sphere with nonzero probability. Therefore, for any radius and any fixed ;
(3) 
Since is monotonically decreasing in terms of for all , we can conclude that holds with probability 1. The rest of the proof follows the denseness of the set in the set according to Cover et al. [9].
Theorem 1
Error Rate Bounds of the anatomized 1NN classifier. Let be a metric space. Let and be the likelihood probabilities of such that with class priors and . Last, let’s assume that is either a point of nonzero probability measure or a continuity point of or . Then the nearest neighbor has the probability of error with the bounds
(4) 
where denotes the Bayesian error when the anatomized training data is used.
We now give a sketch of the proof of Theorem 1. Let denote the probability of error for a pair of instances and . Since Corollary 1 shows that always holds, Eqn. 5 is derived from Eqn. 1 by substituting with 1 and with .
(5) 
The rest of the derivation follows Cover et al. using Eqns. 1 and 2 [9].
Extending Eqn. 4 from the anatomized 1NN classifier to the anatomized NN classifier for all odd k follows the steps in Corollary 1 and Theorem 1. The key is to show that holds for all . The rest is to derive an expression of as in Eqn. 5 for all odd k and show that is always less than and . We exclude this derivation due to space limitations, but the derivation follows from the original NN classifier analysis in [15]. The anatomized NN classifier has the bound in Eqn. 6
(6) 
for all odd k.
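For intuition about the shape of such bounds, the sketch below evaluates the classical two-class bound of Cover and Hart [9], in which the asymptotic 1NN error rate lies between the Bayesian error R* and 2R*(1 - R*). It is an assumption here, not a statement from the text above, that the bound for the anatomized 1NN classifier has this form with the Bayesian error of the anatomized training data plugged in. The function name is hypothetical.

```python
def one_nn_error_bounds(bayes_error):
    """Return (lower, upper) bounds on the asymptotic two-class 1NN error,
    using the classical form R* <= R <= 2 R* (1 - R*)."""
    if not 0.0 <= bayes_error <= 0.5:
        raise ValueError("a two-class Bayesian error lies in [0, 0.5]")
    return bayes_error, 2.0 * bayes_error * (1.0 - bayes_error)


lower, upper = one_nn_error_bounds(0.1)
print(lower, round(upper, 4))  # 0.1 0.18
```

The upper bound is never more than twice the Bayesian error, and both bounds meet at 0 and at 0.5.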
V Bayesian Error on Anatomized Training Data
Since it is impossible to know the exact Bayesian error, many Bayesian error estimation techniques have been suggested [10, 15, 4]. In this section, the Bayesian error will be estimated for binary classification using Parzen density estimation. Although such estimation would be very interesting for multilabel classification, the theoretical analysis on unprotected data only covers binary classification [4]. The Parzen density estimation approach, which is easier to derive than the nearest neighbor density estimation approach, will follow Fukunaga [15] and Fukunaga et al. [16]. Both approaches show the same behavior in terms of the Bayesian estimation, which makes the discussion general enough for any nonparametric density-based binary classification method [15]. We first give three axioms and a lemma.
Axiom 1
Given the anatomized training data and the training data ; let and be the class priors for class labels . Then, is always true.
Axiom 2
Let and be and respectively. Given the anatomized training data and the training data ; let and be the smooth joint densities of identifying attributes . Then, is always true.
Axiom 3
Let and be and respectively. Given the anatomized training data and the training data ; let and be the smooth densities of sensitive attribute . Then, is always true.
Axioms 1, 2 and 3 follow from the following observation: provided a sample of size N drawn from a probability distribution , repeating every instance a fixed number of times and obtaining a sample of size does not change the probability distribution . The estimated parameters and of distribution remain the same.
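The replication argument behind the axioms can be checked numerically with a small sketch. It assumes the estimated parameters are the sample mean and the population (maximum-likelihood) variance; the replication factor `times`, the seed and the sample size are arbitrary choices for illustration.

```python
import random
import statistics


def replicate(sample, times):
    """Repeat every instance of `sample` a fixed number of times."""
    return [x for x in sample for _ in range(times)]


random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]
replicated = replicate(sample, 3)

# Mean and population variance are unchanged up to rounding error.
mean_gap = abs(statistics.fmean(sample) - statistics.fmean(replicated))
var_gap = abs(statistics.pvariance(sample) - statistics.pvariance(replicated))
print(len(replicated), mean_gap < 1e-9, var_gap < 1e-9)
```

The unbiased sample variance, which divides by the sample size minus one, would differ slightly between the two samples, but that difference vanishes as N grows.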
Lemma 1
Using Axioms 2 and 3, the proof of Lemma 1 is straightforward. Lemma 1 and Axioms 1-3 yield Theorem 2. Using Lemma 1, we will assume that holds asymptotically for Bayesian errors.
Theorem 2
Let be a metric space. Let and be the smooth probability density functions of . Let and be the class priors such that . Similarly, let and be the smooth probability density functions of such that with class priors and . Let and be the classifiers with biases and respectively. Let be the decision threshold with threshold bias . Let be the small changes on and resulting in and ; and , be the Bayesian error estimations with respective biases , . Let and be the Parzen density estimations; and be the kernel function for with shape matrix and size/volume parameter [15]. Last, let’s assume that 1) and are independent in the training data and the anatomized training data 2) hold 3) . Therefore,
(7) 
where always holds.
Due to lack of space, we provide a brief summary of the proof. In Eqn. 7, the terms other than stand for the expected estimation error in Eqn. 8 [15].
(8) 
Hence, the proof of this theorem requires the second-order approximations of and . From Fukunaga [15], we know that and are expressed as functions of the and . The key point of the proof is to formulate the anatomized training data effect in and and show its propagation to the and . Let be the small change in the likelihood probabilities which results in , be and be true due to Axiom 1. Therefore, we have Eqns. 9 and 10 as the likelihood densities in the anatomized training data using Lemma 1.
(9)  
(10) 
Using Eqns. 9 and 10 in the Taylor approximations of and results in the approximations of in Eqn. 11
(11) 
and in Eqn. 12
(12) 
where is true. The former equality is the result of using the Parzen density estimate [15]. Eqns. 11 and 12 are derived using the Taylor approximations up to second order. Plugging Eqns. 11 and 12 into Eqn. 8 and rewriting Eqn. 8 gives Eqn. 7, where each stands for an integration term.
[Table I: summary of the theoretical analysis; columns: Training Data, Anatomized Training Data, Notations]
NN Error Rate Bounds
: 1NN error rate ()
: NN error rate ()
: Bayesian error ()
: 1NN error rate ()
: NN error rate ()
: Bayesian error ()
1NN Convergence Rate
: Number of training instances
: diversity parameter
: Number of identifying attributes
Bayesian Error Estimation
: Bayesian error ()
: Bayesian error estimation for
: Kernel width parameter
: Number of training instances
: Small change on likelihood
: Number of identifying attributes
Eqn. 7 shows that the anatomized training data reduces the variance term of the decision functions that estimate the Bayesian error. However, it is hard to determine the effect of the anatomized training data on bias terms. All , , and are possible cases depending on which might yield bias terms of bigger or smaller than ’s ones.
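The kind of Parzen-based decision function analyzed in this section can be sketched in one dimension. The sketch assumes a Gaussian kernel with a single width parameter `h` in place of the shape matrix and size parameter of Theorem 2, and the standard plug-in rule that picks class 1 when -ln p1(x) + ln p2(x) falls below ln(P1/P2). All names and toy samples are hypothetical.

```python
import math


def parzen_density(x, sample, h):
    """Gaussian-kernel Parzen estimate of a 1-D density at point x."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)


def parzen_classify(x, sample1, sample2, h, prior1=0.5, prior2=0.5):
    """Return class 1 or 2 using the plug-in likelihood-ratio rule.

    Note: far from both samples the densities can underflow to zero,
    which this sketch does not guard against.
    """
    p1 = parzen_density(x, sample1, h)
    p2 = parzen_density(x, sample2, h)
    decision = -math.log(p1) + math.log(p2)  # estimated decision function
    threshold = math.log(prior1 / prior2)
    return 1 if decision < threshold else 2


class1 = [-1.0, -0.8, -1.2]
class2 = [1.0, 0.9, 1.1]
print(parzen_classify(-0.9, class1, class2, h=0.5))  # 1
print(parzen_classify(1.0, class1, class2, h=0.5))   # 2
```

The bias and variance discussed above refer to how such estimated decision functions deviate from the ones built on the true densities.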
VI Anatomized 1NN Convergence
We now discuss the error rate of the anatomized 1NN classifier when the anatomized training data has finite size . We will then derive the convergence rate from the former error rate. The discussion here won’t be generalized to the anatomized NN classifier since the finite size training data performance of NN classifiers is not generalized to in the pattern recognition literature [10, 15]. Also, only binary classification will be considered due to space limitations.
From Theorem 2, we intuitively expect a faster convergence rate than that of the original 1NN classifier. For number of instances in training data , using the anatomized training data reduces the variance of any classifier’s Bayesian error estimation. Therefore, there are fewer possible models to consider for a given sample size, which eventually means a faster convergence to the asymptotic result. Theorem 3 extends the analysis of Fukunaga et al. [15, 17].
Theorem 3
Let be a metric space. Let and be the smooth probability density functions of . Let and be the class priors such that . Let and be the smooth posterior probability densities such that and . Let and be the smooth posterior probability densities such that and . Let be the difference between and for class labels . Let be the quadratic distance with matrix and be the calculated value of . Let be the error rate of the anatomized 1NN classifier when . Last, let be the error rate of the anatomized 1NN classifier when . Then,
(13) 
where is
(14) 
and is
(15) 
We give here a summary of the proof. We first define as a function of such that holds. Then, is written as a function of and . The result is
(16) 
where is
(17) 
a 3-step expectation in Eqn. 17. The rest of the proof follows Fukunaga [15]. The key deviation of the anatomized training data from the training data results from step 2. In step 2, the nearest neighbor density estimation is done on training instances instead of training instances. Thus, the expectation with respect to gives Eqn. 18.
(18) 
Using Eqn. 18, the expectation with respect to in Eqn. 17 (step 3) according to Fukunaga [15] results in Eqn. 13. Table I gives a summary of the theoretical analysis, including a comparison between the anatomized training data and the training data .
VII Experiments and Results
VII-A Preprocessing, Setup and Implementation
We evaluate the anatomized NN classifier using cross validation on the Adult, Bank Marketing and IPUMS datasets from the UCI collection [5] and on the Fatality (fars) dataset from the KEEL repository [3].
In the Adult dataset, we predicted the income attribute. The instances with missing values were removed and features were selected using the Pearson correlation filter (CfsSubsetEval) of Weka [34]. After preprocessing, we had 45222 instances with 5 attributes (education, marital status, capital gain, capital loss and hours per week) and the class attribute income. The other datasets were used without feature selection. In IPUMS, we predicted whether a person is a veteran or not. After removing the N/A and missing values for veteran information, there were 148585 instances with 59 attributes. In the Fatality dataset, we predicted whether a person is injured or not in a car accident based on 29 attributes. Since the class attribute was nonbinary in the original data, the instances with class labels “Injured_Severity_Unknown”, “Died_Prior_to_Accident” and “Unknown” were removed and the binary class values “Injured” vs “Not_Injured” were created. The former removal resulted in 91085 instances. In the Bank Marketing dataset, we predicted whether a person replied positively or negatively to the bank’s phone marketing campaign. The dataset is used with 41188 instances and 20 attributes.
In the Adult, Bank Marketing and IPUMS datasets, education (educrec in IPUMS) was deemed sensitive whereas the remaining attributes were quasi-identifying attributes. Education had many discrete values, which lets all samples satisfy ℓ-diversity when . In the Fatality dataset, “POLICE_REPORTED_ALCOHOL_INVOLVEMENT” was the sensitive attribute whereas the rest of the attributes were quasi-identifying attributes. This was the only discrete attribute in the dataset other than the class attribute that is not a typical quasi-identifying attribute such as state, age, or zipcode.
Weka (the same version used by Inan et al. [21]) was used to implement the NN classifier [34]. We implemented the anatomization algorithm following Xiao et al. [37]. All the anatomized training data were created from identifier and sensitive tables using the merge function of R. The error rates were measured on each test fold according to the definition in the Weka implementation. When we compared the anatomized 1NN with the anonymized 1NN, we also used the same generalization hierarchies that Inan et al. used. The statistical tests following Kumar et al. are provided [33].
VII-B Anatomized 1NN vs Anonymized 1NN and Original 1NN
First, we compare the anatomized 1NN classifier with both anonymized and original 1NN classifiers. We consider anonymized and anatomized training data with the quasi-identifying groups having a similar number of instances (). Figure 1 shows the plot of error rates on 10-fold cross validation without outlier values. We give results for two scenarios: 1) vs original data 2) vs original data. Although we measured the error rates to , we omit these results due to space limitations. The results are similar when even though some instances are suppressed to maintain ℓ-diversity.
In Figure 1, the general trend is that anatomized 1NN has the smallest error rates and anonymized 1NN has the largest error rates. The average error rates for anonymized 1NN and anatomized 1NN classifiers are 0.3132 and 0.204 for and 0.3132 and 0.2324 for . Meanwhile, the original 1NN has an average error rate of 0.2456. When , the anatomized 1NN has significantly lower error rates than the original 1NN at the confidence intervals 0.99, 0.98, 0.95, 0.9 and 0.8. When , the anatomized 1NN has significantly lower error rates than the original 1NN at the confidence interval 0.99. This is a surprising and interesting result showing the practical interpretation of Theorem 2 in Section V. Theorem 2 shows that the Bayesian error of the anatomized training data has a smaller variance term than the Bayesian error of the training data . Hence, a model which is overfitted on the training data is likely to be left out of the search space if the model is trained from the anatomized training data .
The anatomized 1NN has a significantly lower error rate than the anonymized 1NN at the confidence intervals 0.99 and 0.98 when , and at the confidence interval 0.99 when . The results aren’t statistically significant for confidence intervals smaller than 0.95 or 0.99, as the anonymized 1NN consistently doesn’t fit one fold’s training data. Its high error rate results in a significant increase in sample variance, reducing the statistical confidence. When we analyzed this training data, we noticed that the instance values were generalized to the root values of the generalization hierarchies, which could eliminate the decision boundary in the original data. This observation emphasizes anatomy’s advantage of keeping the original attribute values despite diversifying the sensitive attribute values within a group.
VII-C Anatomized NN vs. Original NN
In this section, we compare the anatomized NN classifier with the original NN classifier. The comparison doesn’t include the anonymized NN classifier because Inan et al.’s work considers only the anonymized 1NN classifier [21]. Its extension to cases is beyond the scope of this work. Although we ran the experiments for anatomized 3NN, 5NN, 7NN and 9NN classifiers on the Adult, Bank Marketing, Fatality and IPUMS datasets, we give only the results on the larger Fatality and IPUMS datasets due to space limitations. We again include the cases of and . Figure 2 plots the error rate distributions of 3NN and 5NN classifiers on the Fatality dataset, and 7NN and 9NN classifiers on the IPUMS data.
In the Fatality data, the anatomized 3NN and 5NN classifiers outperform the original 3NN and 5NN classifiers at the confidence intervals 0.99 and 0.98 when . The anatomized 5NN classifier also outperforms the original 5NN classifier at the confidence interval 0.95 when . In contrast, the original 3NN and 5NN classifiers outperform the anatomized 3NN and 5NN classifiers when , although not to a statistically significant level. For 3NN classifiers, the average error rates are 0.0128, 0.0135 and 0.0132 for with , with and original data respectively. On the other hand, the average error rates of 5NN classifier on with , with and original data are 0.0119, 0.0122 and 0.0122 respectively.
In the IPUMS data, the original 7NN classifier outperforms the anatomized 7NN classifier at the confidence intervals 0.99, 0.98, 0.95, 0.9 when and . On the other hand, the original 9NN classifier outperforms the anatomized 9NN classifiers at the confidence interval 0.99 when and . For 7NN classifiers, the average error rates are 0.1567, 0.1586 and 0.1549 for with , with and original data respectively. The average error rates of 9NN classifier on with , with and original data are 0.1552, 0.1568 and 0.1542 respectively.
In conclusion, the anatomized and original NN classifiers have similar statistically significant error rates for multiple values of . These results confirm the theoretical analysis that we made in the earlier sections.
VII-D Convergence Behavior
We now compare the anatomized 1NN classifier versus the original 1NN classifier on convergence behavior. We create 5 partitions from the Adult (after preprocessing), Bank Marketing, Fatality and IPUMS datasets. Each partition is used as test data, and the remaining 4 partitions are used incrementally for training. Our objective is to show how the parameter in anatomized training data change the error rates when the training data size is increased incrementally. Figure 3 plots the average error rates for the original training data, the anatomized training data with , the anatomized training data with ; and the theoretical error rate in function of the training data size.
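The incremental protocol described above can be sketched as follows. Each partition serves once as test data while the remaining partitions are added to the training data one at a time, and the error rate is averaged over the test partitions for each training size. The sketch uses a plain 1NN with squared Euclidean distance and toy one-dimensional rows; the function names and data are hypothetical, and the anatomization step is left out.

```python
def one_nn(training, query):
    """Return the label of the training row closest to `query`."""
    nearest = min(
        training,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )
    return nearest[1]


def incremental_error_rates(partitions):
    """Average test error after training on 1, 2, ... remaining partitions."""
    n = len(partitions)
    totals = [0.0] * (n - 1)
    for test_idx, test in enumerate(partitions):
        rest = [p for i, p in enumerate(partitions) if i != test_idx]
        training = []
        for step, part in enumerate(rest):
            training = training + part  # grow the training data by one partition
            errors = sum(one_nn(training, x) != y for x, y in test)
            totals[step] += errors / len(test)
    return [total / n for total in totals]


# Hypothetical, well-separated toy data: 5 partitions of 2 rows each.
toy_partitions = [[((0.0 + 0.01 * i,), 1), ((1.0 + 0.01 * i,), 2)] for i in range(5)]
rates = incremental_error_rates(toy_partitions)
print(len(rates), rates)
```

Plotting the returned list against the cumulative training size gives a curve of the same kind as the measured ones, one per type of training data.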
We cannot know the asymptotical in practice when computing the theoretical error rates. We therefore estimate the theoretical result as follows. For each dataset, we set the to the minimum of the error rates in that dataset’s results. We then calculate the rate from the , and values that we set in the experiments. Using the and , we compute the respective bias and, from it, the theoretical error rate for the respective training data size and .
The measured error rates in Figure 3 converge in a way similar to the theoretical error rates. At the largest training data size , the maximum deviations of the measured error rates from the theoretical error rates are approximately 0.015, 0.004, 0.008 and 0.0085 for the Adult, Bank Marketing, Fatality and IPUMS datasets, respectively. The convergence of the error rate also differs little between the original data, the anatomized data with and the anatomized data with . For all types of training data, the 1NN classifier converges slowly.
VIII Conclusion
This work demonstrates the feasibility of NN classification using training data protected by anatomization under ℓ-diversity. We show that the asymptotic error bounds are the same for anatomized data as for the original data. Perhaps surprisingly, the proposed 1NN classifier converges to the asymptotic error rate faster than the 1NN classifier trained on data without anatomization. In addition, the analysis suggests that, for any nonparametric classifier, using anatomized training data reduces the variance term of the Bayesian error estimation, although the behavior of the bias term is hard to characterize.
Experiments on multiple datasets confirm the theoretical convergence rates. They also demonstrate that the proposed NN classifier on anatomized data approaches or even outperforms NN on the original data. In particular, the experiments on the well-known Adult data show that 1NN on anatomized data outperforms learning on data anonymized to the same anonymity levels using generalization.
Acknowledgment
This work is supported by “Anonymous”. We thank “Anonymous” for sharing his/her implementation used for evaluating 1NN on generalization-based anonymization. We also thank “Anonymous” for helpful comments throughout the theoretical analysis.
References
 [1] D. Agrawal and C. C. Aggarwal, “On the design and quantification of privacy preserving data mining algorithms,” in Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Santa Barbara, California: ACM, May 21–23, 2001, pp. 247–255. [Online]. Available: http://doi.acm.org/10.1145/375551.375602
 [2] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” in Proceedings of the 2000 ACM SIGMOD Conference on Management of Data. Dallas, TX: ACM, May 14–19, 2000, pp. 439–450. [Online]. Available: http://doi.acm.org/10.1145/342009.335438
 [3] J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.
 [4] A. Antos, L. Devroye, and L. Györfi, “Lower bounds for bayes error estimation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 21, no. 7, pp. 643–645, 1999.
 [5] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007. [Online]. Available: http://www.ics.uci.edu/~mlearn/
 [6] V. Ciriani, S. D. C. di Vimercati, S. Foresti, and P. Samarati, “k-anonymous data mining: A survey,” in Privacy-preserving data mining. Springer, 2008, pp. 105–136.
 [7] V. Ciriani, S. D. C. D. Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati, “Combining fragmentation and encryption to protect privacy in data storage,” ACM Trans. Inf. Syst. Secur., vol. 13, pp. 22:1–22:33, July 2010. [Online]. Available: http://doi.acm.org/10.1145/1805974.1805978
 [8] G. Cormode, N. Li, T. Li, and D. Srivastava, “Minimizing minimality and maximizing utility: Analyzing methodbased attacks on anonymized data,” in Proceedings of the VLDB Endowment, vol. 3, no. 1, 2010, pp. 1045–1056. [Online]. Available: http://dl.acm.org/citation.cfm?id=1920972
 [9] T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification,” Information Theory, IEEE Transactions on, vol. 13, no. 1, pp. 21–27, 1967.
 [10] L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern recognition. Springer Science & Business Media, 2013, vol. 31.
 [11] S. D. C. di Vimercati, S. Foresti, S. Jajodia, G. Livraga, S. Paraboschi, and P. Samarati, “Extending loose associations to multiple fragments,” in DBSec’13, 2013, pp. 1–16.
 [12] J. Dowd, S. Xu, and W. Zhang, “Privacy-preserving decision tree mining based on random substitutions,” in Emerging Trends in Information and Communication Security. Springer, 2006, pp. 145–159.
 [13] C. Dwork, “Differential privacy,” in 33rd International Colloquium on Automata, Languages and Programming (ICALP 2006), Venice, Italy, Jul. 9–16, 2006, pp. 1–12. [Online]. Available: http://dx.doi.org/10.1007/11787006_1
 [14] A. Evfimievski, J. Gehrke, and R. Srikant, “Limiting privacy breaches in privacy preserving data mining,” in Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003), San Diego, CA, Jun. 9–12, 2003, pp. 211–222.
 [15] K. Fukunaga, Introduction to statistical pattern recognition. Academic press, 2013.
 [16] K. Fukunaga and D. M. Hummels, “Bayes error estimation using parzen and knn procedures,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 5, pp. 634–643, 1987.
 [17] K. Fukunaga and D. M. Hummels, “Bias of nearest neighbor error estimates,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 1, pp. 103–112, 1987.
 [18] B. C. M. Fung, K. Wang, and P. S. Yu, “Top-down specialization for information and privacy preservation,” in Proceedings of the 21st International Conference on Data Engineering, ser. ICDE ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 205–216. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2005.143
 [19] T. Gal, Z. Chen, and A. Gangopadhyay, “A privacy protection model for patient data with multiple sensitive attributes,” International Journal of Information Security and Privacy, IGI Global, Hershey, PA, vol. 2, no. 3, pp. 28–44, 2008.
 [20] X. He, Y. Xiao, Y. Li, Q. Wang, W. Wang, and B. Shi, “Permutation anonymization: Improving anatomy for privacy preservation in data publication,” in PAKDD Workshops, ser. Lecture Notes in Computer Science, L. Cao, J. Z. Huang, J. Bailey, Y. S. Koh, and J. Luo, Eds., vol. 7104. Springer, 2011, pp. 111–123. [Online]. Available: http://dblp.uni-trier.de/db/conf/pakdd/pakdd2011w.html#HeXLWWS11
 [21] A. Inan, M. Kantarcioglu, and E. Bertino, “Using anonymized data for classification,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ser. ICDE ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 429–440. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2009.19
 [22] V. S. Iyengar, “Transforming data to satisfy privacy constraints,” in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’02. New York, NY, USA: ACM, 2002, pp. 279–288. [Online]. Available: http://doi.acm.org/10.1145/775047.775089
 [23] D. Kifer, “Attacks on privacy and definetti’s theorem,” in Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009, pp. 127–138.
 [24] N. Li and T. Li, “t-closeness: Privacy beyond k-anonymity and l-diversity,” in Proceedings of the 23rd International Conference on Data Engineering (ICDE ’07), Istanbul, Turkey, Apr. 16–20, 2007. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2007.367856
 [25] T. Li and N. Li, “On the tradeoff between privacy and utility in data publishing,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 – July 1, 2009, pp. 517–526. [Online]. Available: http://doi.acm.org/10.1145/1557019.1557079
 [26] T. Li, N. Li, J. Zhang, and I. Molloy, “Slicing: A new approach for privacy preserving data publishing,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 3, pp. 561–574, 2012. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.236
 [27] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “ℓ-diversity: Privacy beyond k-anonymity,” in Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE 2006), Atlanta, Georgia, Apr. 2006. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2006.1
 [28] B. Martin, “Instance-based learning: Nearest neighbor with generalization,” Tech. Rep., 1995.
 [29] R. A. Moore, Jr., “Controlled data-swapping techniques for masking public use microdata sets,” U.S. Bureau of the Census, Washington, DC., Statistical Research Division Report Series RR 96-04, 1996. [Online]. Available: http://www.census.gov/srd/papers/pdf/rr96-4.pdf
 [30] M. E. Nergiz and C. Clifton, “δ-presence without complete world knowledge,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 6, pp. 868–883, Jun. 2010. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.125
 [31] P. Samarati, “Protecting respondent’s privacy in microdata release,” IEEE Trans. Knowl. Data Eng., vol. 13, no. 6, pp. 1010–1027, Nov./Dec. 2001. [Online]. Available: http://dx.doi.org/10.1109/69.971193
 [32] L. Sweeney, “k-anonymity: a model for protecting privacy,” International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, no. 5, pp. 557–570, 2002. [Online]. Available: http://dx.doi.org/10.1142/S0218488502001648
 [33] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, (First Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2005.
 [34] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, Oct. 1999. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/
 [35] R. C.-W. Wong, A. W.-C. Fu, K. Wang, and J. Pei, “Minimality attack in privacy preserving data publishing,” in VLDB, 2007, pp. 543–554.
 [36] R. C.-W. Wong, J. Li, A. W.-C. Fu, and K. Wang, “(α, k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’06. New York, NY, USA: ACM, 2006, pp. 754–759. [Online]. Available: http://doi.acm.org/10.1145/1150402.1150499
 [37] X. Xiao and Y. Tao, “Anatomy: Simple and effective privacy preservation,” in Proceedings of 32nd International Conference on Very Large Data Bases (VLDB 2006), Seoul, Korea, Sep. 12–15, 2006. [Online]. Available: http://www.vldb.org/conf/2006/p139-xiao.pdf
 [38] J. Zhang, D.-K. Kang, A. Silvescu, and V. Honavar, “Learning accurate and concise naïve bayes classifiers from attribute value taxonomies and data,” Knowledge and Information Systems, vol. 9, no. 2, pp. 157–179, 2006.