K-Nearest Neighbor Classification Using Anatomized Data

Koray Mancuhan Department of Computer Science, CERIAS
Purdue University
Email: kmancuha@purdue.edu
   Chris Clifton Department of Computer Science, CERIAS
Purdue University
Email: clifton@cs.purdue.edu
Abstract

This paper analyzes nearest neighbor classification with training data anonymized using anatomy. Anatomy preserves all data values, but introduces uncertainty in the mapping between identifying and sensitive values. We first study the theoretical effect of the anatomized training data on the nearest neighbor error rate bounds, the nearest neighbor convergence rate, and the Bayesian error. We then validate the derived bounds empirically. We show that 1) learning from anatomized data approaches the limits of learning from the unprotected data (although requiring larger training data), and 2) nearest neighbor using anatomized data outperforms nearest neighbor on generalization-based anonymization.

I Introduction

Data publishing without revealing sensitive information is an important problem. Many privacy definitions have been proposed based on generalizing/suppressing data (ℓ-diversity [27], k-anonymity [31, 32], t-closeness [24], δ-presence [30], (α, k)-anonymity [36]). Other alternatives include value swapping [29], distortion [2], randomization [14], and noise addition (e.g., differential privacy [13]). Generalization consists of replacing identifying attribute values with a less specific version [6]. Suppression can be viewed as the ultimate generalization, replacing the identifying value with an “any” value [6]. These approaches have the advantage of preserving truth, but a less specific truth that reduces the utility of the published data.

Xiao and Tao proposed anatomization as a method to enforce ℓ-diversity while preserving specific data values [37]. Anatomization splits instances across two tables, one containing identifying information and the other containing private information. The more general approach of fragmentation [7] divides a given dataset’s attributes into two sets of attributes (2 partitions) such that an encryption mechanism avoids associations between two different small partitions. Vimercati et al. extend fragmentation to multiple partitions [11], and Gal et al. propose an extension that deals with multiple sensitive attributes [19]. The main advantage of anatomization/fragmentation is that it preserves the original values of data; the uncertainty is only in the mapping between individuals and sensitive values.
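
To make the two-table structure concrete, the following is a minimal sketch (ours, not the authors' implementation) that splits a toy person specific table into an identifier table and a sensitive table sharing only a group id, in the spirit of anatomy [37]; the column names, values, and group assignment are illustrative assumptions.

```python
import pandas as pd

# Toy person specific table: quasi-identifiers, a class attribute, and a sensitive attribute.
data = pd.DataFrame({
    "age":     [34, 36, 51, 49, 27, 29],
    "zipcode": [47906, 47906, 47907, 47907, 47901, 47901],
    "income":  [">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"],            # class attribute
    "disease": ["flu", "cancer", "flu", "hepatitis", "cancer", "hepatitis"],   # sensitive attribute
})

# Assign instances to groups of size 2 so that each group holds 2 distinct
# sensitive values (2-diverse). A real anatomizer would form the groups as in [37].
data["gid"] = [1, 1, 2, 2, 3, 3]

# Identifier table: quasi-identifiers + class attribute + group id (the sensitive value is dropped).
it = data[["age", "zipcode", "income", "gid"]]
# Sensitive table: group id + sensitive value only.
st = data[["gid", "disease"]]
print(it)
print(st)
```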

We show that this additional information has real value. First, we demonstrate that in theory, learning from anatomized data can be as good as learning from the raw data. We then demonstrate empirically that learning from anatomized data beats learning from generalization-based anonymization.

This paper looks only at instance-based learning, specifically the non-parametric nearest neighbor classifier (k-NN). This focus was chosen because we have solid theoretical results on the limits of learning, allowing us to compare theoretical bounds on learning from anatomized data with learning from the underlying unprotected data. We demonstrate this for a simple approach to using the anatomized data: we simply consider all possible mappings of individuals to sensitive values as equally likely.

There is concern that anatomization is vulnerable to several attacks [23, 20, 26]. While this can be an issue, any method that provides meaningful utility fails to provide perfect privacy against a sufficiently strong adversary [25, 13]. Introducing uncertainty into the anonymization process reduces the risk of many attacks, e.g., minimality [35, 8]. Our theoretical analysis holds for any assignment of items to anatomy groups, including a random assignment, which provides a high degree of robustness against minimality and correlation-based attacks. This paper has the following key contributions:

  1. We define a classification task on anatomized data without violating the random worlds assumption. A violating classification task would be the prediction of the sensitive attribute, a task that was shown to be #P-complete by Kifer [23].

  2. To the best of our knowledge, this is the first paper in the privacy community that studies the theoretical effect of training the k-NN on anatomized data. We show the effect of anatomization on the error rate bounds and the convergence rate when the test data is neither anonymized nor anatomized. Inan et al. already give a practical application of such a learning scenario [21].

  3. We show the Bayesian error estimation for any non-parametric classifier using the anatomized training data.

  4. We compare the k-NN classifier trained on the anatomized data with the k-NN classifier trained on the unprotected data. In the case of the nearest neighbor classifier (1-NN), we also make an additional comparison to a generalization-based learning scheme [21].

  5. Finally, we compare the theoretical estimation of the convergence rate with the practical measurements when the convergence rate is defined as a function of ℓ-diversity.

We next summarize the related work, and give a set of definitions and notations necessary for further discussion. Section IV shows error rate bounds of the non-parametric k-NN classifier; Section V analyzes the effect of anatomization on the Bayesian error. Section VI formulates the 1-NN convergence rate under ℓ-diversity. The experimental analysis is presented in Section VII.

II Related Work

There have been studies of how to mine anonymized data. Nearest neighbor classification using generalized data was investigated by Martin. Nested generalization and non-nested hyperrectangles were used to generalize the data from which the nearest neighbor classifiers were trained [28]. Inan et al. proposed nearest neighbor and support vector machine classifiers using anonymized training data satisfying k-anonymity. Taylor approximation was used to estimate the Euclidean distance from the anonymized training data [21]. Zhang et al. studied Naïve Bayes using partially specified training data [38], proposing an algorithm that computes conditional likelihoods by exploring the instance space of attribute-value generalization taxonomies. Agrawal et al. proposed an iterative distribution reconstruction algorithm for distorted training data from which a C4.5 decision tree classifier was trained [1]. Iyengar suggested using a classification metric to find the optimal generalization. A C4.5 decision tree classifier was then trained from the optimally generalized training data [22]. Fung et al. gave a top-down specialization method (TDS) for anonymization so that the anonymized data allows accurate decision trees. A new scoring function was proposed for the calculation of decision tree splits from the compressed training data [18]. Dowd et al. studied C4.5 decision tree learning from training data perturbed by random substitutions. A matrix-based distribution reconstruction algorithm was applied to the perturbed training data, from which an accurate C4.5 decision tree classifier was learned [12].

None of the earlier work has provided a method directly applicable to anatomized training data. A classifier using anatomized training data requires specific theoretical and experimental analysis, because anatomized training data provides additional detail that has the potential to improve learning, but also additional uncertainty that must be dealt with. Furthermore, previous work did not justify theoretically why the proposed heuristics work empirically.

III Definitions and Notations

In this section, the first four definitions will recall the standard definitions of unprotected data and attribute types.

Definition 1

A dataset is called a person specific dataset for a population if each of its instances belongs to a unique individual of that population.

The person specific data will be called the training data in this paper. Next, we give the first type of attribute.

Definition 2

A set of attributes is called direct identifying attributes if it lets an adversary associate an instance with a unique individual without any background knowledge.

Definition 3

A set of attributes is called quasi-identifying attributes if there is background knowledge available to the adversary that associates the quasi-identifying attributes with a unique individual.

We include both direct and quasi-identifying attributes under the name identifying attributes. First name, last name and social security number (SSN) are common examples of direct identifying attributes. Some common examples of quasi-identifying attributes are age, postal code, and occupation. Next, we give the second type of attribute.

Definition 4

An attribute of an instance is called a sensitive attribute if it must be protected against adversaries correctly inferring its value for an individual.

Patient disease and individual income are common examples of sensitive attributes. Individuals typically do not want this sensitive information to be publicly known when a dataset is released to the public. Every instance also carries a class label. We do not consider the case where the class label is sensitive, as this would make the purpose of classification to violate privacy. Typically the class label is neither sensitive nor identifying, although the analysis holds when the class label is an identifying attribute.

Given the former definitions, we next define the anonymized training data following the definition of k-anonymity [32].

Definition 5

Training data that satisfies the following conditions is said to be anonymized training data [32]:

  1. The training data does not contain any unique identifying attributes.

  2. Every instance is indistinguishable from at least k−1 other instances in the training data with respect to its quasi-identifying attributes.

In this paper, we assume that the anonymized training data is created by a generalization-based data publishing method. We next define the comparison baseline classifiers.

Definition 6

A non-parametric nearest neighbor (k-NN) classifier that is trained on the anonymized training data is called the anonymized k-NN classifier.

Definition 7

A non-parametric k-NN classifier that is trained on the original training data is called the original k-NN classifier.

The anonymized k-NN classifier serves only as a comparison baseline in the evaluation, and its theoretical discussion is not included. We go further, requiring that there must be multiple possible sensitive values that could be linked to an individual. This requires the definition of groups [27].

Definition 8

A group is a subset of instances in the training data such that the groups together cover the whole training data and any two distinct groups are disjoint; in other words, the groups partition the training data.

Next, we define the concept of ℓ-diversity, or being ℓ-diverse, given the former group definition.

Definition 9

A set of groups is said to be ℓ-diverse if and only if, in every group, each value of the sensitive attribute appears with frequency at most 1/ℓ of the number of instances in that group, where the frequency is computed over the projection of the group onto the sensitive attribute (projection being the database operation on the training data, or data table in the database community).
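
As a quick illustration of the frequency condition in Definition 9, the following sketch (ours, not the paper's code) checks whether a grouping, given as lists of sensitive values, is ℓ-diverse:

```python
from collections import Counter

def is_l_diverse(groups, l):
    """Return True if, in every group, the most frequent sensitive value
    appears in at most a 1/l fraction of that group's instances."""
    for sensitive_values in groups:
        most_frequent = max(Counter(sensitive_values).values())
        if most_frequent * l > len(sensitive_values):
            return False
    return True

# Each group below holds 3 distinct diseases, so the grouping is 3-diverse but not 4-diverse.
groups = [["flu", "cancer", "hepatitis"], ["flu", "asthma", "cancer"]]
print(is_l_diverse(groups, 3))  # True
print(is_l_diverse(groups, 4))  # False
```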

We extend the anatomization data publishing method of Xiao and Tao, which is originally based on ℓ-diverse groups [37].

Definition 10

Given training data partitioned into ℓ-diverse groups according to Definition 9, anatomization produces an identifier table IT and a sensitive table ST as follows. IT has the schema

(quasi-identifying attributes, class attribute, group id)

including the class attribute, the quasi-identifying attributes, and the group id of the group. For each group and each instance in it, IT has an instance of the form (quasi-identifying values, class label, group id).

ST has the schema

(group id, sensitive attribute)

where the sensitive attribute is the sensitive attribute of the training data and the group id identifies the group. For each group and each instance in it, ST has an instance of the form (group id, sensitive value).

Given the learning task of predicting the class attribute, Definition 10 lets us observe the following about training data published according to anatomization: every instance of the identifier table IT can be matched to every sensitive value published for its group in ST, using the group id attribute common to both data table schemas. This observation yields the anatomized training data and the anatomized k-NN classifier.

Definition 11

Given the two data tables IT and ST resulting from the anatomization of the training data, the anatomized training data is the projection, onto the quasi-identifying attributes, the class attribute and the sensitive attribute, of the inner join of IT and ST with respect to the condition that the group ids are equal, where IT and ST are produced according to Definition 10.
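
Continuing the toy tables it and st from the sketch in the introduction (an illustrative assumption, not the paper's code), Definition 11 corresponds to an inner join on the group id followed by dropping that column:

```python
import pandas as pd

# it and st as produced by the anatomization sketch in the introduction.
# Inner join on the common group id: every identifier row is paired with
# all sensitive values published for its group, so with groups of size l
# the result has N*l rows.
anatomized = it.merge(st, on="gid", how="inner")

# Drop the group id to obtain the anatomized training data used for learning:
# quasi-identifiers, class attribute, and the (now ambiguous) sensitive attribute.
anatomized_training = anatomized.drop(columns=["gid"])
print(anatomized_training)
```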

Definition 12

A non-parametric -NN classifier that is trained on the anatomized training data is called the anatomized k-NN classifier.

Using the former definitions, we now give the assumptions and notations used in discussing the anatomized k-NN classifier. In the theoretical analysis, we assume that all the training data has a smooth probability distribution. Although anatomization requires a discrete probability distribution for the sensitive attribute, such a smoothness violation is negligible since the original k-NN classifier is known to fit well on discrete training data [33]. The sensitive attribute is assumed to be non-binary. The anatomized k-NN cases where k > 1 and k is even are ignored, because such cases include ties between the k nearest neighbors that make the bounds ambiguous and complicated [15]. The attributes consist of the identifying attributes plus one sensitive attribute, and all instances are assumed to lie in a separable metric space as in [9, 10, 15]. If the training data has N instances, the anatomized training data has N·ℓ instances by Definition 11. All instances are i.i.d., whether they belong to the training or the test data. For the sake of simplicity, the identifying attributes are treated as a single vector. The test data is not processed by any anatomization or generalization method, and its instances are drawn from the same underlying distribution as the training instances. The distance between a pair of instances is measured by a quadratic distance metric in the metric space. The original k-NN classifier uses the set of k nearest neighbors of a test instance in the training data, while the anatomized k-NN classifier uses the set of k nearest neighbors of the test instance in the anatomized training data; for k = 1 we simply speak of the nearest neighbor in each. Training and test instances are written as column vectors. The class attribute is binary, with labels 1 and 2. Given the training data and a class label, the posterior probability, the likelihood probability and the prior probability are defined as usual; if the anatomized training data is used, the symmetric definitions apply. The error rate depends on which training data is used to classify the test data. When the nearest neighbors have converged to the test instance for all test instances, we denote the resulting asymptotic error rate as in Equation 1 [15].

(1)

The corresponding asymptotic error rate for the anatomized training data is defined analogously and can trivially be derived from Eqn. 1 by substituting the anatomized posteriors. The Bayesian errors given the test instance are defined for the training data and for the anatomized training data respectively, and Eqn. 2 computes the former [15].

(2)

The Bayesian error for the anatomized training data can again be trivially derived from Eqn. 2 by substituting the anatomized posteriors. Taking expectations over the test instance gives the (unconditional) error rates of the original and anatomized k-NN classifiers, and likewise the Bayesian errors of the original and anatomized training data; for convenience we use shorthand symbols for these expected quantities. Further notations and definitions will be given in the paper as necessary.

IV Error Bounds of the Anatomized k-NN

In this section, we first show the error bounds for the anatomized 1-NN classifier. We then discuss the extension to the anatomized k-NN classifier for all odd k. We give only proof sketches due to space limitations.

We first give Corollary 1 which is critical for the error bounds of the anatomized 1-NN classifier.

Corollary 1

Convergence of the nearest neighbor in the anatomized training data. Let the test instance and the training instances be i.i.d. instances taking values in a separable metric space, and let the nearest neighbor of the test instance be taken from the anatomized training data. Then the nearest neighbor converges to the test instance with probability one.

We can intuitively say that Corollary 1 should hold for the anatomized training data if it already holds for the training data. For the nearest neighbor of a test instance, there are ℓ instances in the anatomized training data with the same identifying attribute values, including the nearest neighbor itself. Assuming a very large training data size, the nearest neighbor must still be the closest instance to the test instance in the anatomized training data. The incorrect matchings are expected to remain far away, so convergence should eventually hold.
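
This intuition can be checked numerically. The sketch below (our illustration, not part of the paper's proofs) replicates each training point's identifying attributes ℓ times, as the join in Definition 11 does, and shows that the distance from a query point to its nearest neighbor still shrinks as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_distance_after_replication(n, l, dim=2):
    """Distance from a fixed query point to its nearest neighbor when every
    training point's identifying attributes appear l times; the replication
    cannot increase the distance, and it still vanishes as n grows."""
    x = rng.random((n, dim))                 # identifying attributes of n instances
    x_replicated = np.repeat(x, l, axis=0)   # each row repeated l times, as in the anatomized join
    query = np.full(dim, 0.5)
    return np.linalg.norm(x_replicated - query, axis=1).min()

for n in [100, 1_000, 10_000, 100_000]:
    print(n, nn_distance_after_replication(n, l=3))
```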

We now give a sketch of the proof of Corollary 1. Consider the sphere of a given radius centered at the test instance, and suppose the test instance lies in the support of the distribution, so that every such sphere has non-zero probability. Therefore, for any radius and any fixed test instance:

(3)

Since this probability is monotonically decreasing in the training data size for every radius, we can conclude that the nearest neighbor distance converges to zero with probability 1. The rest of the proof follows from the denseness argument of Cover and Hart [9].

Next, Theorem 1 shows the error bounds of the anatomized 1-NN classifier using Corollary 1.

Theorem 1

Error rate bounds of the anatomized 1-NN classifier. Let the instances take values in a separable metric space, and let the class-conditional likelihood densities and class priors of the anatomized training data be given. Last, assume that the test instance is either a point of non-zero probability measure or a continuity point of the class-conditional densities. Then the anatomized nearest neighbor classifier has a probability of error with the bounds

(4)

where the bounds are expressed in terms of the Bayesian error obtained when the anatomized training data is used.
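
For reference, the classical Cover–Hart form of this bound, which Theorem 1 parallels, is shown below with our own placeholder symbols (the paper's notation was lost in this rendering): writing $R_A$ for the asymptotic anatomized 1-NN error and $R^{*}_{A}$ for the Bayesian error on the anatomized training data,

$$R^{*}_{A} \;\le\; R_{A} \;\le\; 2\,R^{*}_{A}\bigl(1 - R^{*}_{A}\bigr) \;\le\; 2\,R^{*}_{A}.$$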

We now give a sketch of the proof of Theorem 1. Consider the conditional probability of error for a test instance and its nearest neighbor in the anatomized training data. Since Corollary 1 shows that the nearest neighbor converges to the test instance, Equation 5 is derived from Equation 1 by substituting k with 1 and the original posteriors with the anatomized posteriors.

(5)

The rest of the derivation follows Cover and Hart, using Equations 1 and 2 [9].

Extending Equation 4 from the anatomized 1-NN classifier to the anatomized k-NN classifier for all odd k follows the steps of Corollary 1 and Theorem 1. The key is to show that the convergence of Corollary 1 holds for all k nearest neighbors. The rest is to derive an expression analogous to Equation 5 for all odd k and show that it always stays within the stated bounds. We exclude this derivation due to space limitations, but it follows from the original k-NN classifier analysis in [15]. The anatomized k-NN classifier has the bound in Equation 6

(6)

for all odd k.

Note that the Bayesian errors of the original and the anatomized training data are not always the same, due to the ℓ-diverse groups of the anatomization. The ℓ-diverse groups induce new likelihoods and, eventually, new posterior probabilities, which thus differ from those used in Equation 2. The next section formulates this change.

V Bayesian Error on Anatomized Training Data

Since it is impossible to know the exact Bayesian error, many Bayesian error estimation techniques have been suggested [10, 15, 4]. In this section, the Bayesian error is estimated for binary classification using Parzen density estimation. Although such an estimation would be very interesting for multi-class classification, the theoretical analysis on unprotected data only covers binary classification [4]. The Parzen density estimation approach, which is easier to derive than the nearest neighbor density estimation approach, follows Fukunaga [15] and Fukunaga and Hummels [16]. Both approaches show the same behavior in terms of the Bayesian error estimation, which makes the discussion general enough for any non-parametric density based binary classification method [15]. We first give three axioms and a lemma.
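
Before stating the axioms, here is a concrete, simplified illustration of the kind of plug-in estimator under discussion: the sketch plugs one-dimensional Gaussian Parzen window density estimates into the Bayes decision rule for two classes and reports the resulting error estimate. It is our illustration of the general technique in [15, 16], not the paper's derivation; the kernel width h and the synthetic data are assumptions.

```python
import numpy as np

def parzen_density(points, sample, h):
    """Gaussian Parzen window estimate of a 1-D density at `points`,
    using the training `sample` and kernel width h."""
    diffs = (points[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

def bayes_error_estimate(x1, x2, h):
    """Plug-in estimate of the Bayes error for a binary problem: classify every
    point with argmax_i prior_i * estimated density_i and count the errors."""
    prior1 = len(x1) / (len(x1) + len(x2))
    prior2 = 1.0 - prior1
    pts = np.concatenate([x1, x2])
    labels = np.concatenate([np.ones(len(x1)), 2 * np.ones(len(x2))])
    p1 = prior1 * parzen_density(pts, x1, h)
    p2 = prior2 * parzen_density(pts, x2, h)
    predicted = np.where(p1 >= p2, 1, 2)
    return float(np.mean(predicted != labels))

rng = np.random.default_rng(1)
x1 = rng.normal(0.0, 1.0, 500)  # class 1 sample
x2 = rng.normal(2.0, 1.0, 500)  # class 2 sample
# For two unit-variance Gaussians with means 2 apart and equal priors,
# the true Bayes error is about 0.159; the estimate should be close.
print(bayes_error_estimate(x1, x2, h=0.3))
```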

Axiom 1

Given the anatomized training data and the training data, let the class priors of each class label be estimated from each of them. Then the two estimates of the class prior are always equal.

Axiom 2

Given the anatomized training data and the training data, let the smooth joint densities of the identifying attributes be estimated from each of them. Then the two estimates of the joint density are always equal.

Axiom 3

Given the anatomized training data and the training data, let the smooth densities of the sensitive attribute be estimated from each of them. Then the two estimates of the density are always equal.

Axioms 1, 2 and 3 are obvious due to the following: provided a sample of size N drawn from a probability distribution, repeating every instance a fixed ℓ times and obtaining a sample of size N·ℓ does not change the probability distribution. The estimated parameters of the distribution remain the same.

Lemma 1

Given the anatomized training data and the training data, let the identifying attributes and the sensitive attribute be independent. Then the joint density of the identifying and sensitive attributes estimated from the anatomized training data always equals the one estimated from the training data, under Axioms 2 and 3.

Using Axioms 2 and 3, the proof of Lemma 1 is straightforward. Lemma 1 and Axioms 1-3 yield Theorem 2. Using Lemma 1, we will assume that the equality of the original and anatomized Bayesian errors holds asymptotically.

Theorem 2

Let the instances take values in a separable metric space. Let the smooth class-conditional probability density functions and class priors of the training data be given, and similarly those of the anatomized training data. Let the two classifiers (trained on the training data and on the anatomized training data) have their respective biases, let the decision threshold have its threshold bias, and let the small changes on the densities induced by anatomization result in corresponding changes to the Bayesian error estimates and their biases. Let the densities be estimated with Parzen density estimation, using a kernel function with a shape matrix and a size/volume parameter [15]. Last, assume that 1) the identifying attributes and the sensitive attribute are independent in the training data and the anatomized training data, 2) the equalities of Axioms 1-3 hold, and 3) the kernel parameter conditions of [15] hold. Therefore,

(7)

where the inequality showing the reduced variance term for the anatomized training data always holds.

Due to lack of space, we provide a brief summary of the proof. In Equation 7, the terms other than the Bayesian error itself stand for the expected estimation error in Equation 8 [15].

(8)

Hence, the proof of this theorem requires the second order approximations of the two decision functions' biases. From Fukunaga [15], we know that these biases are expressed as functions of the class-conditional densities and the kernel parameters. The key point of the proof is to formulate the effect of the anatomized training data on the class-conditional densities and show its propagation to the bias and variance terms. Let the anatomization induce a small change in the likelihood probabilities while leaving the class priors unchanged, which holds due to Axiom 1. Therefore, we have Equations 9 and 10 as the likelihood densities in the anatomized training data, using Lemma 1.

(9)
(10)

Using Equations 9 and 10 in the Taylor approximations of the two decision functions results in the approximation in Equation 11

(11)

and in Equation 12

(12)

where the equality holds as a result of using the Parzen density estimate [15]. Equations 11 and 12 are derived using Taylor approximations up to second order. Plugging Equations 11 and 12 into Equation 8 and rewriting Equation 8 gives Equation 7, where each remaining term stands for an integration term.

Eqn. 7 shows that the anatomized training data reduces the variance term of the decision functions that estimate the Bayesian error. However, it is hard to determine the effect of the anatomized training data on the bias terms. Depending on the small change in the likelihoods, the bias terms for the anatomized training data may be either bigger or smaller than those for the original training data.

TABLE I: Summary of Theoretical Analysis (quantities involved, for both the training data and the anatomized training data)
  • k-NN error rate bounds: the 1-NN error rate, the k-NN error rate, and the Bayesian error.
  • 1-NN convergence rate: the number of training instances, the ℓ-diversity parameter, and the number of identifying attributes.
  • Bayesian error estimation: the Bayesian error, its estimate, the kernel width parameter, the number of training instances, the small change on the likelihood, and the number of identifying attributes.

VI Anatomized 1-NN Convergence

We now discuss the error rate of the anatomized 1-NN classifier when the anatomized training data has finite size. We then derive the convergence rate from this error rate. The discussion here is not generalized to the anatomized k-NN classifier, since the finite-size training data analysis of the 1-NN classifier is not generalized to arbitrary k in the pattern recognition literature [10, 15]. Also, only binary classification is considered due to space limitations.

From Theorem 2, we intuitively expect a faster convergence rate than that of the original 1-NN classifier. For a given number of instances in the training data, using the anatomized training data reduces the variance of any classifier's Bayesian error estimation. Therefore, there are fewer possible models to consider for a given sample size, which eventually means a faster convergence to the asymptotic result. Theorem 3 extends the analysis of Fukunaga and Hummels [15, 17].

Theorem 3

Let the instances take values in a separable metric space. Let the smooth class-conditional probability density functions and class priors of the training data be given. Let the smooth posterior probability densities of the training data and of the anatomized training data be given, and let the difference between the two posteriors be given for each class label. Let the distance metric be the quadratic distance with a given matrix. Let the asymptotic error rate of the anatomized 1-NN classifier be its error rate as the training data size goes to infinity. Last, let the finite-sample error rate of the anatomized 1-NN classifier be its error rate for a finite anatomized training data size. Then,

(13)

where the first term is

(14)

and the second term is

(15)

We give here a summary of the proof. We first define the finite-sample error rate in terms of the asymptotic error rate. Then, the finite-sample error rate is written as a function of the posterior differences and the nearest neighbor distance. The result is

(16)

where the remaining term is

(17)

a 3-step expectation in Equation 17. The rest of the proof follows Fukunaga [15]. The key deviation of the anatomized training data from the training data results from step 2. In step 2, the nearest neighbor density estimation is done on N·ℓ training instances instead of N training instances. Thus, the expectation with respect to the nearest neighbor distance gives Equation 18.

(18)

Using Equation 18, the expectation with respect to the test instance in Equation 17 (step 3), following Fukunaga [15], results in Equation 13. Table I gives a summary of the theoretical analysis, including a comparison between the anatomized training data and the training data.

VII Experiments and Results

VII-A Preprocessing, Setup and Implementation

We evaluate the anatomized k-NN classifier using cross validation on the Adult, Bank Marketing and IPUMS datasets from the UCI collection [5] and on the Fatality (fars) dataset from the Keel repository [3].

In the Adult dataset, we predicted the income attribute. The instances with missing values were removed and features were selected using the Pearson correlation filter (CfsSubsetEval) of Weka [34]. After preprocessing, we had 45222 instances with 5 attributes (education, marital status, capital gain, capital loss and hours per week) and the class attribute income. The other datasets were used without feature selection. In IPUMS, we predicted whether a person is a veteran or not. After removing the N/A and missing values for veteran information, there were 148585 instances with 59 attributes. In the Fatality dataset, we predicted whether a person is injured or not in a car accident based on 29 attributes. Since the class attribute was non-binary in the original data, the instances with class labels “Injured_Severity_Unknown”, “Died_Prior_to_Accident” and “Unknown” were removed and the binary class values “Injured” vs “Not_Injured” were created. This removal resulted in 91085 instances. In the Bank Marketing dataset, we predicted whether a person replied positively or negatively to the bank’s phone marketing campaign. The dataset is used with 41188 instances and 20 attributes.
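
A minimal sketch of the Adult preprocessing described above (ours, assuming the standard UCI adult.data column layout with '?' marking missing values; the feature selection itself was done with Weka's CfsSubsetEval, so here the selected attributes are simply kept by name):

```python
import pandas as pd

# Standard UCI Adult column names (an assumption about the raw file layout).
cols = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
        "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
        "hours-per-week", "native-country", "income"]

adult = pd.read_csv("adult.data", names=cols, sep=r",\s*", engine="python", na_values="?")

# Drop instances with missing values, then keep the attributes retained after
# feature selection plus the class attribute. Combining the UCI train and test
# files the same way yields the 45222 instances used in the paper.
selected = ["education", "marital-status", "capital-gain", "capital-loss",
            "hours-per-week", "income"]
adult = adult.dropna()[selected]
print(adult.shape)
```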

In the Adult, Bank Marketing and IPUMS datasets, education (educrec in IPUMS) was deemed sensitive, whereas the remaining attributes were quasi-identifying attributes. Education has many discrete values, which lets all samples satisfy ℓ-diversity for the values of ℓ used in the experiments. In the Fatality dataset, “POLICE_REPORTED_ALCOHOL_INVOLVEMENT” was the sensitive attribute, whereas the rest of the attributes were quasi-identifying attributes. This was the only discrete attribute in the dataset, other than the class attribute, that is not a typical quasi-identifying attribute such as state, age, or zipcode.

Weka (the same version used by Inan et al. [21]) was used to implement the k-NN classifier [34]. The anatomization algorithm was implemented by us following Xiao and Tao [37]. All the anatomized training data were created from the identifier and sensitive tables using the merge function of R. The error rates were measured on each test fold according to the definition in the Weka implementation. When we compared the anatomized 1-NN with the anonymized 1-NN, we also used the same generalization hierarchies that Inan et al. used. Statistical significance tests are provided following Tan et al. [33].
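
The actual evaluation used Weka and R as stated above. The following is a compact Python sketch of the same pipeline (ours): anatomize each training fold, join the identifier and sensitive tables, train a 1-NN classifier on the joined data, and test it on the unprotected test fold. The synthetic data frame and the random ℓ-sized grouping are simplifying assumptions; a real run would use the datasets above and the bucketization of Xiao and Tao [37].

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def anatomize(train, quasi_ids, sensitive, l, seed):
    """Toy anatomizer: shuffle the rows, cut them into groups of size l, then
    join the identifier and sensitive tables on the group id (a real implementation
    would follow Xiao and Tao [37] and enforce l-diversity within every group)."""
    shuffled = train.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    shuffled["gid"] = np.arange(len(shuffled)) // l
    it = shuffled[quasi_ids + ["label", "gid"]]
    st = shuffled[["gid", sensitive]]
    return it.merge(st, on="gid").drop(columns=["gid"])  # roughly N*l rows

# Hypothetical numeric data: three quasi-identifiers, a sensitive column, a class label.
rng = np.random.default_rng(7)
df = pd.DataFrame(rng.random((600, 4)), columns=["q1", "q2", "q3", "sens"])
df["label"] = (df["q1"] + df["sens"] > 1.0).astype(int)

features = ["q1", "q2", "q3", "sens"]
errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    anat = anatomize(train, ["q1", "q2", "q3"], "sens", l=3, seed=0)
    knn = KNeighborsClassifier(n_neighbors=1).fit(anat[features], anat["label"])
    errors.append(1.0 - knn.score(test[features], test["label"]))
print(np.mean(errors))  # average error rate of the anatomized 1-NN over 10 folds
```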

Fig. 1: Error Rate on 10 Fold Cross Validation

VII-B Anatomized 1-NN vs Anonymized 1-NN and Original 1-NN

First, we compare the anatomized 1-NN classifier with both the anonymized and the original 1-NN classifiers. We consider anonymized and anatomized training data with the quasi-identifying groups having a similar number of instances. Figure 1 shows the plot of error rates on 10-fold cross validation without outlier values. We give results for two settings of ℓ, each compared against the original data. Although we measured the error rates for larger values of ℓ as well, we omit those results due to space limitations; they are similar, even though some instances are suppressed to maintain ℓ-diversity.

In Figure 1, the general trend is that the anatomized 1-NN has the smallest error rates and the anonymized 1-NN has the largest error rates. The average error rates of the anonymized 1-NN and anatomized 1-NN classifiers are 0.3132 and 0.204 for one ℓ setting, and 0.3132 and 0.2324 for the other. Meanwhile, the original 1-NN has an average error rate of 0.2456. For one setting, the anatomized 1-NN has significantly lower error rates than the original 1-NN at the confidence levels 0.99, 0.98, 0.95, 0.9 and 0.8; for the other, at the confidence level 0.99. This is a surprising and interesting result showing the practical implication of Theorem 2 in Section V. Theorem 2 shows that the Bayesian error estimate from the anatomized training data has a smaller variance term than the Bayesian error estimate from the training data. Hence, a model that is overfitted on the training data is likely to be left out of the search space if the model is trained from the anatomized training data.

The anatomized 1-NN has a significantly lower error rate than the anonymized 1-NN at the confidence levels 0.99 and 0.98 for one ℓ setting, and at the confidence level 0.99 for the other. The results are not statistically significant at confidence levels below 0.95 and 0.99 respectively, as the anonymized 1-NN consistently fails to fit one fold's training data. Its high error rate on that fold results in a significant increase in sample variance, reducing the statistical confidence. When we analyzed this training data, we noticed that the instance values were generalized to the root values of the generalization hierarchies, which can eliminate the decision boundary present in the original data. This observation emphasizes anatomy's advantage of keeping the original attribute values while diversifying the sensitive attribute values within a group.
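
The fold-wise comparisons above can be reproduced with a standard paired test on the per-fold error rates; the sketch below uses a paired t-test as one common choice (the exact procedure of [33] may differ, and the error arrays are placeholders, not the paper's measurements).

```python
import numpy as np
from scipy import stats

# Per-fold error rates of two classifiers measured on the same 10 folds (placeholder numbers).
errors_anatomized = np.array([0.20, 0.21, 0.19, 0.22, 0.20, 0.21, 0.19, 0.20, 0.22, 0.21])
errors_anonymized = np.array([0.31, 0.30, 0.33, 0.29, 0.32, 0.31, 0.35, 0.30, 0.31, 0.32])

# Paired t-test on the fold-wise differences: a p-value below 1 - c indicates a
# significant difference at confidence level c.
t_stat, p_value = stats.ttest_rel(errors_anatomized, errors_anonymized)
print(t_stat, p_value)
print("significant at 0.99 confidence:", p_value < 0.01)
```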

VII-C Anatomized k-NN vs. Original k-NN

(a) 3-NN
(b) 5-NN
(c) 7-NN
(d) 9-NN
Fig. 2: Error Rates of k-NN Classifier vs Anatomized k-NN Classifier on 10 Fold Cross Validation

In this section, we compare the anatomized k-NN classifier with the original k-NN classifier. The comparison does not include the anonymized k-NN classifier, because Inan et al.'s work considers only the anonymized 1-NN classifier [21]; its extension to larger k is beyond the scope of this work. Although we ran the experiments for anatomized 3-NN, 5-NN, 7-NN and 9-NN classifiers on the Adult, Bank Marketing, Fatality and IPUMS datasets, we give the results on the larger Fatality and IPUMS datasets due to space limitations. We again include the two ℓ settings. Figure 2 plots the error rate distributions of the 3-NN and 5-NN classifiers on the Fatality dataset, and of the 7-NN and 9-NN classifiers on the IPUMS data.

In the Fatality data, the anatomized 3-NN and 5-NN classifiers outperform the original 3-NN and 5-NN classifiers at the confidence levels 0.99 and 0.98 for one ℓ setting. The anatomized 5-NN classifier also outperforms the original 5-NN classifier at the confidence level 0.95. In contrast, the original 3-NN and 5-NN classifiers outperform the anatomized 3-NN and 5-NN classifiers for the other ℓ setting, although not to a statistically significant level. For the 3-NN classifiers, the average error rates are 0.0128, 0.0135 and 0.0132 for the two anatomized settings and the original data respectively. The average error rates of the 5-NN classifiers for the two anatomized settings and the original data are 0.0119, 0.0122 and 0.0122 respectively.

In the IPUMS data, the original 7-NN classifier outperforms the anatomized 7-NN classifier at the confidence levels 0.99, 0.98, 0.95 and 0.9 for both ℓ settings. The original 9-NN classifier outperforms the anatomized 9-NN classifiers at the confidence level 0.99 for both ℓ settings. For the 7-NN classifiers, the average error rates are 0.1567, 0.1586 and 0.1549 for the two anatomized settings and the original data respectively. The average error rates of the 9-NN classifiers for the two anatomized settings and the original data are 0.1552, 0.1568 and 0.1542 respectively.

In conclusion, the anatomized and original k-NN classifiers have comparable error rates for multiple values of k; where the differences are statistically significant, they are small. These results are consistent with the theoretical analysis made in the earlier sections.

(a) Adult Data Error Rates
(b) Bank Marketing Data Error Rates
(c) Fatality Data Error Rates
(d) IPUMS Data Error Rates
Fig. 3: Convergence Behavior of Original 1-NN Classifier vs Anatomized 1-NN Classifier

VII-D Convergence Behavior

We now compare the anatomized 1-NN classifier with the original 1-NN classifier on convergence behavior. We create 5 partitions from the Adult (after preprocessing), Bank Marketing, Fatality and IPUMS datasets. Each partition is used as test data, and the remaining 4 partitions are used incrementally for training. Our objective is to show how the ℓ parameter of the anatomized training data changes the error rates as the training data size is increased incrementally. Figure 3 plots the average error rates for the original training data and for the anatomized training data under the two ℓ settings, together with the theoretical error rate, as a function of the training data size.

We cannot know the asymptotic error rate in practice, which the theoretical error rates require. We thus make the following estimation for the theoretical result. For each dataset, we set the asymptotic error rate to the minimum of the error rates observed in that dataset's results. We then calculate the convergence rate from the training data size, the ℓ value and the number of identifying attributes used in the experiments. Using these, we computed the respective bias and eventually the theoretical error rate for the respective training data size and ℓ.

The measured error rates in Figure 3 show a convergence similar to the one the theoretical error rates show. At the largest training data size, the maximum deviations of the measured error rates from the theoretical error rates are approximately 0.015, 0.004, 0.008 and 0.0085 for the Adult, Bank Marketing, Fatality and IPUMS datasets respectively. We can also see that the convergence of the error rate does not differ much between the original data and the anatomized data under either ℓ setting. For all types of training data, the convergence rate of the 1-NN classifier is slow.

VIII Conclusion

This work demonstrates the feasibility of k-NN classification using training data protected by anatomization under ℓ-diversity. We show that the asymptotic error bounds are the same for anatomized data as for the original data. Perhaps surprisingly, the proposed 1-NN classifier converges to the asymptotic error rate faster than the 1-NN classifier using the training data without anatomization. In addition, the analysis suggests that the Bayesian error estimation for any non-parametric classifier using the anatomized training data has a reduced variance term, although it is hard to characterize the bias term.

Experiments on multiple datasets confirm the theoretical convergence rates. These experiments also demonstrate that the proposed k-NN on anatomized data approaches, or even outperforms, k-NN on the original data. In particular, the experiments on the well-known Adult data show that 1-NN on anatomized data outperforms learning on data anonymized to the same anonymity levels using generalization.

Acknowledgment

This work is supported by the “Anonymous”. We thank “Anonymous” for sharing his/her implementation used for evaluating 1-NN on generalization-based anonymization. We also thank “Anonymous” for helpful comments throughout the theoretical analysis.

References

  • [1] D. Agrawal and C. C. Aggarwal, “On the design and quantification of privacy preserving data mining algorithms,” in Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems.   Santa Barbara, California: ACM, May 21-23 2001, pp. 247–255. [Online]. Available: http://doi.acm.org/10.1145/375551.375602
  • [2] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” in Proceedings of the 2000 ACM SIGMOD Conference on Management of Data.   Dallas, TX: ACM, May 14-19 2000, pp. 439–450. [Online]. Available: http://doi.acm.org/10.1145/342009.335438
  • [3] J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 255-287, p. 11, 2010.
  • [4] A. Antos, L. Devroye, and L. Györfi, “Lower bounds for bayes error estimation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 21, no. 7, pp. 643–645, 1999.
  • [5] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007. [Online]. Available: http://www.ics.uci.edu/~mlearn/
  • [6] V. Ciriani, S. D. C. di Vimercati, S. Foresti, and P. Samarati, “k-anonymous data mining: A survey,” in Privacy-preserving data mining.   Springer, 2008, pp. 105–136.
  • [7] V. Ciriani, S. D. C. D. Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati, “Combining fragmentation and encryption to protect privacy in data storage,” ACM Trans. Inf. Syst. Secur., vol. 13, pp. 22:1–22:33, July 2010. [Online]. Available: http://doi.acm.org/10.1145/1805974.1805978
  • [8] G. Cormode, N. Li, T. Li, and D. Srivastava, “Minimizing minimality and maximizing utility: Analyzing method-based attacks on anonymized data,” in Proceedings of the VLDB Endowment, vol. 3, no. 1, 2010, pp. 1045–1056. [Online]. Available: http://dl.acm.org/citation.cfm?id=1920972
  • [9] T. M. Cover and P. E. Hart, “Nearest neighbor pattern classification,” Information Theory, IEEE Transactions on, vol. 13, no. 1, pp. 21–27, 1967.
  • [10] L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern recognition.   Springer Science & Business Media, 2013, vol. 31.
  • [11] S. D. C. di Vimercati, S. Foresti, S. Jajodia, G. Livraga, S. Paraboschi, and P. Samarati, “Extending loose associations to multiple fragments,” in DBSec’13, 2013, pp. 1–16.
  • [12] J. Dowd, S. Xu, and W. Zhang, “Privacy-preserving decision tree mining based on random substitutions,” in Emerging Trends in Information and Communication Security.   Springer, 2006, pp. 145–159.
  • [13] C. Dwork, “Differential privacy,” in 33rd International Colloquium on Automata, Languages and Programming (ICALP 2006), Venice, Italy, Jul. 9-16 2006, pp. 1–12. [Online]. Available: http://dx.doi.org/10.1007/11787006_1
  • [14] A. Evfimievski, J. Gehrke, and R. Srikant, “Limiting privacy breaches in privacy preserving data mining,” in Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003), San Diego, CA, Jun. 9-12 2003, pp. 211–222.
  • [15] K. Fukunaga, Introduction to statistical pattern recognition.   Academic press, 2013.
  • [16] K. Fukunaga and D. M. Hummels, “Bayes error estimation using parzen and k-nn procedures,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 5, pp. 634–643, 1987.
  • [17] K. Fukunaga and D. M. Hummels, “Bias of nearest neighbor error estimates,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, no. 1, pp. 103–112, 1987.
  • [18] B. C. M. Fung, K. Wang, and P. S. Yu, “Top-down specialization for information and privacy preservation,” in Proceedings of the 21st International Conference on Data Engineering, ser. ICDE ’05.   Washington, DC, USA: IEEE Computer Society, 2005, pp. 205–216. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2005.143
  • [19] T. Gal, Z. Chen, and A. Gangopadhyay, “A privacy protection model for patient data with multiple sensitive attributes,” International Journal of Information Security and Privacy, IGI Global, Hershey, PA, vol. 2, no. 3, pp. 28–44, 2008.
  • [20] X. He, Y. Xiao, Y. Li, Q. Wang, W. Wang, and B. Shi, “Permutation anonymization: Improving anatomy for privacy preservation in data publication.” in PAKDD Workshops, ser. Lecture Notes in Computer Science, L. Cao, J. Z. Huang, J. Bailey, Y. S. Koh, and J. Luo, Eds., vol. 7104.   Springer, 2011, pp. 111–123. [Online]. Available: http://dblp.uni-trier.de/db/conf/pakdd/pakdd2011-w.html#HeXLWWS11
  • [21] A. Inan, M. Kantarcioglu, and E. Bertino, “Using anonymized data for classification,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ser. ICDE ’09.   Washington, DC, USA: IEEE Computer Society, 2009, pp. 429–440. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2009.19
  • [22] V. S. Iyengar, “Transforming data to satisfy privacy constraints,” in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’02.   New York, NY, USA: ACM, 2002, pp. 279–288. [Online]. Available: http://doi.acm.org/10.1145/775047.775089
  • [23] D. Kifer, “Attacks on privacy and definetti’s theorem,” in Proceedings of the 2009 ACM SIGMOD International Conference on Management of data.   ACM, 2009, pp. 127–138.
  • [24] N. Li and T. Li, “t-closeness: Privacy beyond k-anonymity and l-diversity,” in Proceedings of the 23nd International Conference on Data Engineering (ICDE ’07), Istanbul, Turkey, Apr. 16-20 2007. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2007.367856
  • [25] T. Li and N. Li, “On the tradeoff between privacy and utility in data publishing,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 - July 1, 2009, 2009, pp. 517–526. [Online]. Available: http://doi.acm.org/10.1145/1557019.1557079
  • [26] T. Li, N. Li, J. Zhang, and I. Molloy, “Slicing: A new approach for privacy preserving data publishing,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 3, pp. 561–574, 2012. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.236
  • [27] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, “ℓ-diversity: Privacy beyond k-anonymity,” in Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE 2006), Atlanta, Georgia, Apr. 2006. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2006.1
  • [28] B. Martin, “Instance-based learning : Nearest neighbor with generalization,” Tech. Rep., 1995.
  • [29] R. A. Moore, Jr., “Controlled data-swapping techniques for masking public use microdata sets,” U.S. Bureau of the Census, Washington, DC., Statistical Research Division Report Series RR 96-04, 1996. [Online]. Available: http://www.census.gov/srd/papers/pdf/rr96-4.pdf
  • [30] M. E. Nergiz and C. Clifton, “δ-presence without complete world knowledge,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 6, pp. 868–883, Jun. 2010. [Online]. Available: http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.125
  • [31] P. Samarati, “Protecting respondent’s privacy in microdata release,” IEEE Trans. Knowl. Data Eng., vol. 13, no. 6, pp. 1010–1027, Nov./Dec. 2001. [Online]. Available: http://dx.doi.org/10.1109/69.971193
  • [32] L. Sweeney, “k-anonymity: a model for protecting privacy,” International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, no. 5, pp. 557–570, 2002. [Online]. Available: http://dx.doi.org/10.1142/S0218488502001648
  • [33] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, (First Edition).   Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2005.
  • [34] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations.   San Francisco: Morgan Kaufmann, Oct. 1999. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/
  • [35] R. C.-W. Wong, A. W.-C. Fu, K. Wang, and J. Pei, “Minimality attack in privacy preserving data publishing,” in VLDB, 2007, pp. 543–554.
  • [36] R. C.-W. Wong, J. Li, A. W.-C. Fu, and K. Wang, “(α, k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’06.   New York, NY, USA: ACM, 2006, pp. 754–759. [Online]. Available: http://doi.acm.org/10.1145/1150402.1150499
  • [37] X. Xiao and Y. Tao, “Anatomy: Simple and effective privacy preservation,” in Proceedings of 32nd International Conference on Very Large Data Bases (VLDB 2006), Seoul, Korea, Sep. 12-15 2006. [Online]. Available: http://www.vldb.org/conf/2006/p139-xiao.pdf
  • [38] J. Zhang, D.-K. Kang, A. Silvescu, and V. Honavar, “Learning accurate and concise naïve bayes classifiers from attribute value taxonomies and data,” Knowledge and Information Systems, vol. 9, no. 2, pp. 157–179, 2006.