Feature Selection for Classification under Anonymity Constraint
Abstract
Over the last decade, the proliferation of various online platforms and their increasing adoption by billions of users have heightened the privacy risk of users enormously. In fact, security researchers have shown that sparse microdata containing information about the online activities of a user, although anonymized, can still be used to disclose the identity of the user by cross-referencing the data with other data sources. To preserve the privacy of a user, several methods (k-anonymity, l-diversity, differential privacy) have been proposed in existing works for ensuring that a dataset bears a small identity disclosure risk. However, the majority of these methods modify the data in isolation, without considering their utility in subsequent knowledge discovery tasks, which makes the resulting datasets less informative. In this work, we consider labeled data that are generally used for classification, and propose two methods for feature selection with two goals: first, on the reduced feature set the data has a small disclosure risk, and second, the utility of the data for performing a classification task is preserved. Experimental results on various real-world datasets show that the methods are effective and useful in practice.
Keywords: Privacy Preserving Feature Selection, k-Anonymity by Containment, Maximal Itemset Mining, Greedy Algorithm, Binary Classification
1 Introduction
Over the last decade, with the proliferation of various online platforms, such as web search, e-commerce, social networking, micro-messaging, streaming entertainment and cloud storage, the digital footprint of today's Internet user has grown at an unprecedented rate. At the same time, the availability of sophisticated computing paradigms and advanced machine learning algorithms has enabled platform owners to mine and analyze terabytes of digital footprint data for building various predictive analytics and personalization products. For example, search engines and social network platforms use search keywords for providing sponsored advertisements that are personalized for a user's information need; e-commerce platforms use visitors' search history for bolstering their merchandising efforts; streaming entertainment providers use people's rating data for building future product or service recommendations. However, the impressive personalization of services of various online platforms enlightens us as much as it makes us feel insecure; this insecurity stems from knowing that individuals' online behaviors are stored within these companies, and an individual, more often than not, is unaware of the specific information about themselves that is being stored.
The key reason for a web user's insecurity over the digital footprint data (also known as microdata) is that such data contain sensitive information. For instance, a person's online search about a disease medication may insinuate that she may be suffering from that disease; a fact that she would rather not disclose. Similarly, people's choice of movies, their recent purchases, etc. reveal enormous information regarding their background, preferences and lifestyle. Arguably, microdata exclude biographical information, but due to the sheer size of our digital footprint the identity of a person can still be recovered from these data by cross-correlation with other data sources or by using publicly available background information. In this sense, these apparently non-sensitive attributes can serve as a quasi-identifier. For example, Narayanan et al. [33] identified a Netflix subscriber from his anonymous movie ratings by using the Internet Movie Database (IMDb) as the source of background information. In the case of Netflix, anonymous microdata was released publicly for facilitating the Netflix Prize competition; however, even if the data is not released, there is always a concern that people's digital footprint data can be abused within the company by employees or by external hackers with malicious intents.
For the case of microdata, the identity disclosure risk is high due to some key properties of such datasets: high dimensionality and sparsity. Sparsity means that for a given record there is rarely any other record that is similar to it over the full multidimensional space. It has also been shown that in high-dimensional data, the ratio of the distance to the nearest neighbor and the distance to the farthest neighbor is almost one, i.e., all the points are far from each other [1]. Due to this fact, privacy is difficult to achieve on such datasets. A widely used privacy metric that quantifies the disclosure risk of a given data instance is k-anonymity [39], which requires that for any data instance in a dataset, there are at least k-1 other data instances sharing the same feature vector, thus ensuring that unwanted personal information cannot be disclosed merely through the feature vector. However, for high-dimensional data, k-anonymization is difficult to achieve even for a reasonable value of k (say 5); typically, value-based or attribute-based generalization is applied so that k-anonymity is achieved, but Aggarwal has shown both theoretically and experimentally that for high-dimensional data k-anonymity is not a viable solution even for a k value of 2 [1]. He has also shown that as data dimensionality increases, the entire discriminatory information in the data is lost during the process of anonymization, which severely limits the data utility. Evidently, finding a good balance between a user's privacy and the utility of high-dimensional microdata is an unsolved problem, and it is the primary focus of this paper.
A key observation about real-life high-dimensional datasets is that they exhibit a high clustering tendency in many subspaces, even though over the full dimension the dataset is very sparse. Thus an alternative technique for protecting against identity disclosure on such data is to find a subset of features such that, when the data is projected on this set of features, an acceptable level of anonymity can be achieved. One can view this as column suppression instead of the more commonly used row suppression for achieving k-anonymity [39]. Now, in the case of feature selection for a given k, there may exist many subspaces in which a given dataset satisfies k-anonymity, but our objective is to obtain a set of features such that projecting on this set offers the maximum utility of the dataset in terms of a supervised classification task.
Consider the toy dataset shown in Table 1. Each row represents a person, and each column (except the first and the last) represents a keyword. If a cell entry is '1' then the keyword at the corresponding column is associated with the person at the corresponding row. The reader may think of this table as a tabular representation of the search log of an e-commerce platform, where a '1' under a keyword column stands for the fact that the corresponding user has searched using that keyword within a given period of time, and '0' represents otherwise. The last column represents whether this user has made a purchase over the same time period. The platform owner wants to solve a classification problem to predict which users are likely to make a purchase.
Say the platform owner wants to protect the identity of its site visitors by making the dataset k-anonymous. For this toy dataset, if he chooses k = 2, the dataset is not 2-anonymous; for instance, some users' feature vectors are unique in this dataset. However, the dataset is 2-anonymous under the subspace spanned by Feature Set1, and it is also 2-anonymous under the subspace spanned by Feature Set2 (see Table 2). Among these two choices, the latter subspace is probably the better choice, as its features are better discriminators with respect to the class label. For Feature Set2, if we associate the value '101' with the +1 label and the value '111' with the -1 label, we make only 1 mistake out of 6. On the other hand, for Feature Set1, no good correlation exists between the feature values and the class labels.
The research problem in the above task is the selection of an optimal binary feature set for utility-preserving entity anonymization, where utility is considered with respect to classification performance and privacy is guaranteed by enforcing a k-anonymity-like constraint [41]. In existing works, k-anonymity is achieved by suppression or generalization of cell values, whereas in this work we achieve the same by selecting an optimal subset of features that maximizes the classification utility of the dataset. Note that maximizing the utility of the dataset is the main objective of this task; privacy is simply a constraint which a user enforces by setting the value of a privacy parameter based on the application domain and the user's judgment. For the privacy model, we choose k-anonymity by containment, in short k-AC (definition forthcoming), where k is the user-defined privacy parameter, which has a similar meaning as in traditional k-anonymity.
Our choice of a k-anonymity-like metric over more theoretical counterparts, such as differential privacy (DP), is due to the pragmatic reason that existing privacy laws and regulations, such as HIPAA (Health Insurance Portability and Accountability Act) and PHIPA (Personal Health Information Protection Act), use k-anonymity. Also, k-anonymity is flexible and simple, enabling people to understand and apply it to almost any real-life privacy-preserving need; on the contrary, DP-based methods use a privacy parameter (epsilon) which has no obvious interpretation, and even by the admission of the original author of DP, choosing an appropriate value for this parameter is difficult [12]. Moreover, differential privacy based methods add noise to the data entities, but decision makers in many application domains where privacy is an important issue (such as health care) are quite uncomfortable with the idea of noise imputation [11]. Finally, the authors in [17] state that differential privacy is not suitable for protecting the large sparse tables produced by statistics agencies and sampling organizations; this disqualifies differential privacy as a privacy model for protecting the sparse and very high dimensional user microdata of e-commerce platforms and Internet search engines.
1.1 Our Contributions
In this work, we consider the task of feature selection under a privacy constraint. This is a challenging task, as it is well known that privacy is always at odds with the utility of a knowledge-based system, and finding the right balance is difficult [22, 23]. Besides, feature selection itself, without considering the privacy constraint, is an NP-Hard problem [18, 25, 26].
Given a classification dataset with binary features and an integer k, our proposed solutions find a subset of features such that, after projecting each instance on this subset, each entity in the dataset satisfies a privacy constraint called k-anonymity by containment (k-AC). Our proposed privacy constraint k-AC is an adapted version of k-anonymity which strikes a good balance between disclosure risk and dataset utility, and it is particularly suitable for high-dimensional binary data. We also propose two algorithms: Maximal and Greedy. The first is a maximal itemset mining based method and the second is a greedy incremental approach; both respect the user-defined k-AC constraint.
The algorithms that we propose are particularly intended for high-dimensional sparse microdata where the features are binary. The nature of such data is different from the typical datasets considered in many existing works on privacy-preserving data disclosure mechanisms. The first difference is that existing works consider two kinds of attributes, sensitive and non-sensitive, whereas in our dataset all attributes are considered sensitive, and any subset of published attributes can be used by an attacker to de-anonymize one or more entities in the dataset using probabilistic inference methodologies. On the other hand, the unselected attributes are not published, so they cannot be used by an attacker to de-anonymize an entity. Second, we only consider binary attributes, which enables us to provide efficient algorithms and an interesting anonymization model. Considering only binary attributes may sound like an undue restriction, but in reality binary attributes are adequate (and often preferred) when modeling the online behavior of a person, such as a 'like' on Facebook, a 'bought' on Amazon, and a 'click' on a Google advertisement. Also, collecting explicit user feedback in terms of frequency data (say, the number of times a search keyword is used) may be costly in many online platforms. Nevertheless, as shown in [33], binary attributes are sufficient for an attacker to de-anonymize a person using high-dimensional microdata, so safeguarding user privacy before disclosing such a dataset is important.
The contributions of our work are outlined below:

- We propose a novel method for entity anonymization using feature selection over a set of binary attributes from a two-class classification dataset. For this, we design a new anonymization model, named k-anonymity by containment (k-AC), which is particularly suitable for high-dimensional binary microdata.
- We propose two methods for solving the above task and show experimental results to validate the effectiveness of these methods.
- We show the utility of the proposed methods with three real-life applications; specifically, we show how the privacy-aware feature selection affects their performance.
2 Privacy Basics
Consider a dataset D, where each row corresponds to a person, and each column contains non-public information about that person; examples include disease, medication, sexual orientation, etc. In the context of online behavior, the search keywords or purchase history of a person may be such information. Privacy-preserving data publishing methodologies make it difficult for an attacker to disclose the identity of a person who is in the dataset. For de-anonymization, an attacker generally uses a set of attributes that acts almost like a key and uniquely identifies some individual in the dataset. These attributes are called quasi-identifiers. k-anonymity is a well-known privacy metric, defined as below.
Definition 1 (k-anonymity).
A dataset satisfies k-anonymity if for any row entity there exist at least k - 1 other entities that have the same values as that entity for every possible quasi-identifier.
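As a concrete illustration, k-anonymity over binary records reduces to counting how many rows share each full feature vector. The following is a minimal sketch; the function name and data layout are our own, not from the paper:

```python
from collections import Counter

def is_k_anonymous(rows, k):
    """A dataset is k-anonymous if every distinct feature vector occurs
    at least k times, i.e., each row has at least k-1 identical peers."""
    counts = Counter(tuple(r) for r in rows)
    return all(c >= k for c in counts.values())

# A single unique row is enough to break 2-anonymity.
data = [(1, 0, 1), (1, 0, 1), (1, 1, 0)]
print(is_k_anonymous(data, 2))  # False: (1, 1, 0) has no identical peer
```

Dropping the unique row, or suppressing the columns on which it differs from the others, would restore 2-anonymity.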
The database in Table 1 is not 2-anonymous, as some row entities are unique when the entire attribute set is considered as the quasi-identifier. On the other hand, both projected datasets in Table 2 (one with Feature Set1 and the other with Feature Set2) are 2-anonymous. For numerical or categorical attributes, processes called generalization and suppression (of rows or cell values) are used for achieving k-anonymity. Generalization partitions the values of an attribute into disjoint buckets and identifies each bucket with a value. Suppression either hides the entire row or some of its cell values, so that the anonymity of that entity can be maintained. Generalization and suppression create k-anonymous groups, in which all entities have the same value for every possible quasi-identifier; for a dataset to be k-anonymous, the size of each such group must be at least k. In this work we consider binary attributes; each such attribute has only two values, 0 and 1. For binary attributes, value-based generalization reduces to the process of column suppression, which incurs a loss of data utility. In fact, any form of generalization-based anonymization incurs a loss in data utility due to the decrease of data variance or the loss of discernibility. Suppression of a row is also a loss, as the entire row entity then becomes indiscernible from the remaining entities in the dataset. Unfortunately, most existing methods for achieving k-anonymity using generalization and suppression operations do not consider a utility measure targeting a supervised classification task.
There are some security attacks against which k-anonymity is vulnerable. For example, k-anonymity is susceptible to both homogeneity and background-knowledge based attacks. More importantly, k-anonymity does not provide the statistical guarantees that can be obtained by using differential privacy [12], a method which provides strong privacy guarantees independent of an adversary's background knowledge. There are existing methods that adopt the differential privacy principle for data publishing. The authors in [3, 13] propose the Laplace mechanism to publish contingency tables by adding noise generated from a Laplace distribution. However, such methods suffer from utility loss due to the large amount of noise added during the sanitization process. To resolve this issue, [30] proposes to utilize the exponential mechanism for maximizing the trade-off between differential privacy and data utility. However, the selection of the utility function used in the exponential-mechanism based approach strongly affects the data utility in the subsequent data analysis task. In this work, we compare the performance of our proposed privacy model, namely k-AC, with both Laplace- and exponential-mechanism based differential privacy frameworks (see Section 5.2) to show that k-AC better preserves the data utility than the differential privacy based methods.
A few works [19, 36] exist which consider classification utility together with a k-anonymity based privacy model, but none of them considers feature selection, which is the main focus of this work. In one of the earliest works, Iyengar [19] solves anonymization through generalization and suppression while minimizing a proposed utility metric called CM (Classification Metric) using a genetic algorithm, which provides no optimality guarantee. The CM is defined as below:
Definition 2 (Classification Metric [19]).
The classification metric (CM) is a utility metric for a classification dataset which assigns a penalty of 1 to each suppressed entity, and a penalty of 1 to each non-suppressed entity that belongs to the minority class within its anonymous group. The CM value is equal to the sum of penalties over all the entities.
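A sketch of how CM could be computed for a binary dataset follows; grouping non-suppressed entities by their full feature vector stands in for the anonymous groups, and all names are our own:

```python
from collections import defaultdict

def classification_metric(rows, labels, suppressed):
    """CM: penalty 1 for each suppressed entity, and penalty 1 for each
    remaining entity whose label is in the minority within its anonymous
    group (here: the rows sharing its exact feature vector)."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if i not in suppressed:
            groups[tuple(row)].append(labels[i])
    penalty = len(suppressed)
    for members in groups.values():
        majority = max(set(members), key=members.count)
        penalty += sum(1 for y in members if y != majority)
    return penalty

rows = [(1, 0), (1, 0), (1, 0), (0, 1)]
print(classification_metric(rows, [1, 1, -1, -1], set()))  # 1
```

A lower CM value indicates better classification utility of the anonymized dataset.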
In this work, we compare the performance of our methods with a CM-based privacy-aware utility metric.
Table 1: The toy dataset. Each row is a user; the last column is the class label.

User | f1 f2 f3 f4 | Class
u1   | 1  0  1  0  | 1
u2   | 1  0  1  0  | 1
u3   | 1  0  0  1  | 1
u4   | 1  0  1  0  | 1
u5   | 1  1  1  0  | 1
u6   | 1  1  0  1  | 1
Table 2: Projections of the toy dataset on two feature subsets, Feature Set1 and Feature Set2.

User | Feature Set1 | Class | Feature Set2
u1   | 1 0          | 1     | 1 0 1
u2   | 1 0          | 1     | 1 0 1
u3   | 1 0          | 1     | 0 1 1
u4   | 1 0          | 1     | 1 0 1
u5   | 1 1          | 1     | 1 0 1
u6   | 1 1          | 1     | 0 1 1
3 Problem Statement
Given a classification dataset with binary attributes, our objective is to find a subset of attributes which increases the non-disclosure protection of the row entities and, at the same time, maintains the classification utility of the dataset without suppressing any of the row entities. In this section we provide a formal definition of the problem.
We define a database D as a binary relation between a set of entities U and a set of features F; thus D is a subset of U x F, where |U| = n and |F| = d are the number of entities and the number of features, respectively. The database D can also be represented as an n x d binary data matrix, where the rows correspond to the entities and the columns correspond to the features. For an entity u and a feature f, if (u, f) is in D, the corresponding data matrix entry is 1, otherwise it is 0. Thus each row of the matrix is a binary vector of size d whose 1-entries correspond to the set of features with which the corresponding row entity is associated. In a classification dataset, besides the attributes, each entity is also associated with a class label, which is a categorical value. In this task we assume a binary class label, c in {+1, -1}. A typical supervised learning task is to use the features to predict the class label of an entity.
We say that an entity u contains a set of features S, a subset of F, if (u, f) is in D for every f in S; the maximal such set S is also called the containment set of the entity u.
Definition 3 (Containment Set).
Given a binary dataset D, the containment set of a row entity u, represented as CS_D(u), is the set of attributes S such that (u, f) is in D for every f in S, and (u, f) is not in D for every f in F \ S.
When the dataset D is clear from the context, we simply write CS(u) instead of CS_D(u) to represent the containment set of u.
Definition 4 (k-anonymity by containment).
In a binary dataset D and for a given positive integer k, an entity u satisfies k-anonymity by containment if there exists a set of entities V, a subset of U \ {u} with |V| >= k - 1, such that CS(u) is a subset of CS(v) for every v in V. In other words, there exist at least k - 1 other entities in D whose containment sets are the same as, or supersets of, CS(u).
By definition, if an entity satisfies k-anonymity by containment, it satisfies the same for all integer values from 1 up to k. We use kAC(u) to denote the largest k for which the entity u satisfies k-anonymity by containment.
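The containment-set definitions above translate directly into code. The sketch below (our own naming, not from the paper) computes kAC(u) as one plus the number of other entities whose containment sets are supersets of u's:

```python
def containment_set(row):
    """Indices of the features with value 1 for this entity."""
    return frozenset(i for i, v in enumerate(row) if v)

def k_ac(rows, u):
    """Largest k for which entity u satisfies k-anonymity by containment:
    u itself plus every other entity whose containment set contains u's."""
    cs_u = containment_set(rows[u])
    return 1 + sum(1 for j, r in enumerate(rows)
                   if j != u and cs_u <= containment_set(r))

def k_ac_dataset(rows):
    """kAC of the whole dataset: the minimum kAC over all entities."""
    return min(k_ac(rows, u) for u in range(len(rows)))

rows = [(1, 0), (1, 1), (1, 1)]
print(k_ac(rows, 0), k_ac_dataset(rows))  # 3 2
```

Note that the subset test `cs_u <= containment_set(r)` includes equality, so entities with identical rows always count toward each other's kAC value.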
Definition 5 (k-anonymous by Containment Group).
For a binary dataset D, if u satisfies k-anonymity by containment, the k-anonymous-by-containment group with respect to u exists, and it is {u} together with V, where V is the largest possible set as defined in Definition 4.
Definition 6 (k-anonymous by Containment Dataset).
A binary dataset D is k-anonymous by containment if every entity u in U satisfies k-anonymity by containment.
We extend the notation kAC over a dataset as well; thus kAC(D) is the minimum of kAC(u) over all entities u, i.e., the largest k for which the dataset is k-anonymous by containment.
Example: For the dataset in Table 1, entity u1 satisfies 4-anonymity by containment, because for each of the three entities u2, u4 and u5, the containment set is the same as or a superset of CS(u1). But entity u6 only satisfies 1-anonymity by containment, as besides itself no other entity contains CS(u6). The 4-anonymous-by-containment group of u1 exists, and it is {u1, u2, u4, u5}, but a 4-anonymous group in the traditional k-anonymity sense does not exist for u1. The dataset in Table 1 is only 1-anonymous by containment, because there exists an entity (namely u6) for which the highest value of k satisfying k-anonymity by containment is 1; alternatively, kAC(D) = 1.
k-anonymity by containment (k-AC) is the privacy metric that we use in this work. The argument for this metric is that if a large number of other entities contain the same feature subset as, or a superset of, the one an entity contains, the disclosure protection of that entity is strong, and vice versa. Thus a higher value of kAC(u) stands for a higher level of privacy for u. k-anonymity by containment is similar to k-anonymity for binary feature sets, except that for k-AC only the '1' values of the feature set are considered a privacy risk. It is easy to see that k-anonymity by containment is a relaxation of k-anonymity. In fact, the following lemma holds.
Lemma 1.
If a dataset satisfies k-anonymity for a given k value, it also satisfies k-AC for the same k value, but the reverse does not necessarily hold.
Proof.
Say the dataset satisfies k-anonymity; then for any row entity u, there exist at least k - 1 other row entities with row vectors identical to that of u. The containment set of each of these entities is identical to CS(u), so u satisfies k-anonymity by containment. Since this holds for every u, the dataset satisfies k-AC.
To prove that the reverse does not hold, we give a counterexample. Assume D has three entities and two features, with the row vectors 10, 11 and 11. D satisfies 2-AC, because the smallest k-anonymity-by-containment value among the entities in the dataset is 2. But the dataset does not satisfy 2-anonymity, because the entity with row vector 10 is unique in the dataset. ∎
However, the relaxed privacy that k-AC provides is adequate for disclosure protection in high-dimensional sparse microdata with binary attributes, because k-AC conceals the list of attributes in the containment set of an entity, which could reveal sensitive information about the entity. For example, if the dataset is about the search keywords that a set of users have used over a given time, a '1' value under a keyword potentially reveals sensitive information about the behavior or preferences of that person. Having a value of '0' for a collection of features merely reveals that the entity is not associated with those attributes. In the online microdata domain, due to the high dimensionality of the data, non-association with a set of attributes is not a potential privacy risk. Also note that in traditional datasets only a few attributes, belonging to the non-sensitive group, are assumed to be quasi-identifiers, so a privacy metric like k-anonymity works well for such datasets. But for high-dimensional datasets, k-anonymity is severely restrictive, and the utility loss from column suppression is substantial, because only feature subsets containing a very small number of features pass the k-anonymity criterion. The k-AC based privacy metric, on the other hand, enables selection of a sufficient number of features for retaining the classification utility of the dataset. In short, k-AC substantially retains the classification utility, whereas k-anonymity fails to do so for most high-dimensional data.
Feature selection [18] for a classification task selects a subset of highly predictive variables so that classification accuracy possibly improves, which happens because contradictory or noisy attributes are generally discarded during the feature selection step. For a dataset D and a feature set S, following relational algebra notation, we use P_S(D) to denote the projection of database D over the feature set S. Now, given a user-defined integer k, our goal is to perform an optimal feature selection on the dataset D to obtain P_S(D) which satisfies two objectives: first, P_S(D) is k-anonymous by containment, i.e., kAC(P_S(D)) >= k; second, P_S(D) maintains the predictive performance of the classification task as much as possible. Selecting a subset of features is similar to column-suppression based privacy protection, but the challenge in our task is that we want to suppress columns that are a risk to privacy and, at the same time, retain columns that have good predictive performance for a downstream supervised classification task on the sanitized dataset. For denoting the predictive performance of a dataset (or a projected dataset) we define a classification utility function Phi; the higher the value of Phi, the better the dataset for classification. We consider Phi to be a filter-based feature selection criterion which is independent of the classification model that we use.
The formal research task of this work is as below. Given a binary dataset D and an integer k, find a feature subset S so that the classification utility of the projection is maximized under the k-AC constraint. Mathematically,

    S* = argmax_{S subset of F} Phi(P_S(D))        (1)
    subject to kAC(P_S(D)) >= k
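For intuition (and for validating faster methods on tiny inputs), problem (1) can be solved by exhaustive search over all feature subsets. This is exponential in d and meant only as a reference sketch; the function names are our own:

```python
from itertools import combinations

def k_ac_dataset(rows):
    """Smallest k-anonymity-by-containment value over all entities."""
    cs = [frozenset(i for i, v in enumerate(r) if v) for r in rows]
    return min(1 + sum(1 for j in range(len(rows)) if j != u and cs[u] <= cs[j])
               for u in range(len(rows)))

def project(rows, S):
    """Projection of the dataset on the feature subset S."""
    return [tuple(r[i] for i in S) for r in rows]

def brute_force_best_subset(rows, k, utility):
    """Return the feature subset maximizing `utility` among all subsets
    whose projection satisfies the k-AC constraint."""
    d = len(rows[0])
    best, best_val = None, float("-inf")
    for size in range(1, d + 1):
        for S in combinations(range(d), size):
            P = project(rows, S)
            if k_ac_dataset(P) >= k and utility(P) > best_val:
                best, best_val = S, utility(P)
    return best

rows = [(1, 0, 1), (1, 0, 1), (1, 1, 0), (1, 1, 0)]
# With subset size as the utility, the full feature set wins here,
# since this toy dataset is already 2-anonymous by containment.
print(brute_force_best_subset(rows, 2, lambda P: len(P[0])))  # (0, 1, 2)
```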
Because problem (1) is a combinatorial optimization problem (optimizing over the space of feature subsets) and is NP-Hard, we propose two effective locally optimal solutions for this problem.
4 Methods
In this section, we describe the two algorithms, namely Maximal and Greedy, that we propose for the task of feature selection under a privacy constraint. Maximal is a maximal itemset mining based feature selection method, and Greedy is a greedy method with privacy-constraint based filtering. In the following subsections, we discuss them in detail.
4.1 Maximal Itemset based Approach
A key observation regarding k-anonymity by containment of a dataset is that this criterion satisfies the downward-closure property under feature selection. The following lemma holds:
Lemma 2.
Say D is a binary dataset and S1 and S2 are two feature subsets with S1 a subset of S2. If kAC(P_{S2}(D)) >= k, then kAC(P_{S1}(D)) >= k.
Proof: Let us prove by contradiction. Suppose kAC(P_{S2}(D)) >= k and kAC(P_{S1}(D)) < k. Then, from the definition of kAC, there exists at least one entity u for which kAC(u) < k in P_{S1}(D). Now, let V1 and V2 be the sets of entities which make up the anonymous-by-containment groups of u in P_{S1}(D) and P_{S2}(D), respectively. Since |V1| < k <= |V2|, there exists an entity v in V2 for which CS(u) is a subset of CS(v) in P_{S2}(D) but not in P_{S1}(D). But this is impossible: because S1 is a subset of S2, if CS(u) is a subset of CS(v) in P_{S2}(D), then the same must hold in P_{S1}(D). Thus, the lemma is proved by contradiction. ∎
Let us call the collection of feature subsets which satisfy the kAC threshold for a given k the feasible set, and represent it by F_k; thus F_k = {S : kAC(P_S(D)) >= k}. A feasible subset of features is called maximal if it has no superset which is feasible. Let M_k be the set of all maximal feasible feature subsets; then M_k is a subset of F_k. As we can observe, given an integer k, if there exists a maximal feature set S in M_k that satisfies the constraint, then any feature set S' that is a subset of S also satisfies the same constraint, i.e., kAC(P_{S'}(D)) >= k, based on Lemma 2.
Example: For the dataset in Table 1 and k = 2, both feature sets shown in Table 2 belong to the feasible set F_2, since both corresponding projections are 2-anonymous and hence 2-anonymous by containment.
Lemma 3.
Say D is a binary dataset, and T(D) is its transaction representation, where each entity u is a transaction consisting of the containment set CS(u). The frequent itemsets of T(D) with minimum support threshold k are feasible feature sets for the optimization problem (1).
Proof: Say S is a frequent itemset in T(D) for support threshold k. Then the support set of S in T(D) is the set of transactions (entities) which contain S. Since S is frequent, the support set of S consists of at least k entities. In the projected dataset P_S(D), all these entities make up a k-anonymous-by-containment group, thus satisfying k-anonymity by containment. Each of the remaining entities u has a projected containment set S' that is a proper subset of S in P_S(D). Since S is frequent and S' is a subset of S, S' is also frequent, with a support set that has at least k entities. Then u also belongs to a k-anonymous-by-containment group. Thus, each of the entities in P_S(D) belongs to some k-anonymous-by-containment group, which yields kAC(P_S(D)) >= k. Hence proved. ∎
A consequence of Lemma 2 is that for a given dataset D, an integer k, and a feature set S, if kAC(P_S(D)) >= k, then any subset S' of S satisfies kAC(P_{S'}(D)) >= k. This is identical to the downward-closure property of frequent itemset mining. Also, Lemma 3 confirms that any itemset that is frequent in the transaction representation of D for minimum support threshold k is a feasible solution for problem (1). Hence, an Apriori-like itemset mining algorithm can be used for effectively enumerating feature subsets of F which satisfy the required k-anonymity-by-containment constraint.
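Following Lemma 3, feasible feature sets can be enumerated level-wise, Apriori style: an itemset that is frequent with minimum support k in the transaction representation is guaranteed feasible, and downward closure justifies the pruning. A minimal sketch with our own naming:

```python
def feasible_feature_sets(rows, k):
    """Enumerate feature sets that are frequent with minimum support k in
    the transaction representation; by Lemma 3 these are feasible for (1)."""
    transactions = [frozenset(i for i, v in enumerate(r) if v) for r in rows]
    d = len(rows[0])

    def support(S):
        return sum(1 for t in transactions if S <= t)

    level = [frozenset([i]) for i in range(d) if support(frozenset([i])) >= k]
    feasible = list(level)
    while level:
        # Extend each frequent set by one item; infrequent extensions are
        # pruned, which is sound because support is anti-monotone.
        level = list({S | {i} for S in level for i in range(d)
                      if i not in S and support(S | {i}) >= k})
        feasible.extend(level)
    return feasible

print(sorted(feasible_feature_sets([(1, 0, 1), (1, 0, 1), (1, 1, 0)], 2), key=len))
```

On this small input the frequent sets are {f0}, {f2} and {f0, f2}; the candidate generation here is deliberately naive, whereas a production miner would join sets within a level.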
Maximal Feasible Feature Set Generation
For large datasets, the feasible set F_k, which consists of the feasible solutions of the optimization problem (1), can be very large. One way to control its size is by choosing an appropriate k: if k increases, |F_k| decreases, and vice versa, but choosing a large k negatively impacts the classification utility of the dataset, thus reducing the optimal value of problem (1). A brute-force method for finding the optimal feature set is to enumerate all the feature subsets in F_k and find the one that is best given the utility criterion. However, this can be very slow. So Maximal generates all possible maximal feature sets M_k instead of generating F_k, and searches for the best feature subset within M_k. The idea of enumerating M_k instead of F_k comes from the assumption that with more features the classification performance will increase; thus the size of a feature set is its utility function value, i.e., Phi(S) = |S|, and in that case the largest set in M_k is a solution to problem (1).
An obvious advantage of working only with the maximal feature sets is that for many datasets |M_k| is much smaller than |F_k|; thus finding the solution within M_k instead of F_k leads to significant savings in computation time. Just as in the case of frequent itemset mining, a maximal frequent itemset mining algorithm can be used for finding M_k, and any off-the-shelf implementation can be used for this. In the Maximal algorithm we use the LCM maximal itemset miner package.
Classification Utility Function
The simple utility function Phi(S) = |S| has a few limitations. First, ties are very commonplace, as there are many maximal feature sets of the same size. Second, and more importantly, this function does not take into account the class labels of the instances, so it cannot find a feature set that maximizes the separation between the positive and negative instances. So we consider another utility function, named HamDist, which suffers much less from ties. It also considers the class labels when choosing features that provide good separation between the positive and negative classes.
Definition 7 (Hamming Distance).
For a given binary database D, a subset of features S, and two entities u and v, the Hamming distance between u and v over S is defined as below:

    H_S(u, v) = sum_{i in S} I(u_i != v_i)        (2)

where I is the indicator function, and u_i and v_i are the values of the i-th feature under S for the entities u and v, respectively.
We can partition the entities in U into two disjoint subsets, U+ and U-; entities in U+ have a class label value of +1, and entities in U- have a class label value of -1.
Definition 8 (HamDist).
Given a dataset D whose partitions D^+ and D^- are based on class labels, the classification utility function HamDist for a feature subset F is the average Hamming distance between all pairs of entities u and v such that u ∈ D^+ and v ∈ D^-:

HamDist(F) = \frac{1}{|D^+|\,|D^-|} \sum_{u \in D^+} \sum_{v \in D^-} H_F(u, v)    (3)
Example: For the dataset in Table 1 and its projection on the selected feature set (see Table 2), HamDist is obtained by computing the Hamming distance of each positive entity from every negative entity and averaging over all such pairs; the same projection also shows how many positive-negative pairs remain distinguishable.
As we can observe from Equations 2 and 3, the utility function HamDist reflects the discriminative power between the classes given the feature set F: the larger the value of HamDist(F), the better the selected feature set distinguishes between the classes. Another separation metric similar to HamDist is DistCnt (Distinguish Count), which is defined below.
Definition 9 (DistCnt).
For D = D^+ ∪ D^- and a feature subset F, DistCnt(F) is the number of pairs from D^+ and D^- which can be distinguished using at least one feature in F. Mathematically,

DistCnt(F) = |\{(u, v) : u \in D^+, v \in D^-, H_F(u, v) \geq 1\}|    (4)
DistCnt can also be used instead of HamDist in the Maximal algorithm. Note that we could also use the CM criterion (see Definition 2) instead of HamDist; however, experimental results show that CM performs much more poorly in terms of AUC. Besides, both the HamDist and DistCnt functions have some useful properties (discussed in Section 4.2) which CM does not have.
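Under the definitions above, the two utility functions can be sketched as follows (toy data and names are illustrative, not from the paper's experiments):

```python
# Sketch of the two classification-utility functions (Definitions 8 and 9),
# assuming a binary dataset given as 0/1 feature dicts.

def hamming(u, v, F):
    """Hamming distance between entities u and v over feature subset F (Eq. 2)."""
    return sum(1 for f in F if u[f] != v[f])

def ham_dist(D_pos, D_neg, F):
    """Average Hamming distance over all positive-negative pairs (Eq. 3)."""
    total = sum(hamming(u, v, F) for u in D_pos for v in D_neg)
    return total / (len(D_pos) * len(D_neg))

def dist_cnt(D_pos, D_neg, F):
    """Number of positive-negative pairs separated by at least one feature in F (Eq. 4)."""
    return sum(1 for u in D_pos for v in D_neg if hamming(u, v, F) >= 1)

# Toy dataset: two positive and two negative entities over features a, b, c.
D_pos = [{'a': 1, 'b': 0, 'c': 1}, {'a': 1, 'b': 1, 'c': 0}]
D_neg = [{'a': 0, 'b': 0, 'c': 0}, {'a': 0, 'b': 1, 'c': 1}]

print(ham_dist(D_pos, D_neg, ['a', 'b']))  # average over the 4 pairs -> 1.5
print(dist_cnt(D_pos, D_neg, ['a']))       # all 4 pairs differ on feature a -> 4
```

Note how DistCnt saturates once every pair is separated, while HamDist keeps growing with additional separating features.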
The Maximal algorithm utilizes a classification utility metric (HamDist or DistCnt) for selecting the best feature set from the maximal set ℳ. For some datasets, |ℳ| can be large, and selecting the best feature set by applying the utility metric to each element of ℳ can be time-consuming. One option is to find the best feature set among the largest-sized elements of ℳ. Another option is to consider the maximal feature sets in ℳ in decreasing order of their size, so that only a bounded number of the maximal feature sets are chosen as candidates on which the utility metric is computed. In this work we use this second option with a fixed cap for all our experiments.
Maximal Itemset based Method (Pseudocode)
The pseudocode for Maximal is given in Algorithm 1. Maximal takes the integer k and the number of maximal patterns to consider as input, and returns the final feature set, which satisfies anonymity by containment. Line 1 uses the LCMMiner to generate all the maximal feature sets that satisfy anonymity by containment for the given k value. Line 2 groups the maximal feasible feature sets by size, selects the top maximal feature sets with the largest size, and builds the candidate feature sets. The algorithm then computes the feature selection criterion for each feature set among the candidates and returns the feature set with the maximum value of this criterion.
The complexity of the above algorithm predominantly depends on the complexity of the maximal itemset mining step (Line 1), which depends on the input value k. For larger k, the privacy is stronger and |ℳ| shrinks, making the algorithm run faster, but the classification utility of the dataset may suffer. On the other hand, for smaller k, |ℳ| can be large, making the algorithm slower, but it better retains the classification utility of the dataset.
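The Maximal procedure can be sketched as follows. This is a brute-force illustration, not the paper's LCM-based implementation, and the `is_feasible` check below is a simplified stand-in for the anonymity-by-containment test, not its exact definition:

```python
# Minimal sketch of the Maximal method: enumerate maximal feasible feature
# sets by brute force (standing in for the LCM-based maximal itemset miner),
# keep the t largest ones as candidates, and return the best under a utility
# function. Names and the feasibility rule are illustrative assumptions.
from itertools import combinations

def is_feasible(D, F, k):
    """Stand-in constraint: each projected record must be contained in
    (i.e., dominated by) the projections of at least k records."""
    proj = [tuple(row[f] for f in F) for row in D]
    def contains(a, b):  # a contains b if every 1-bit of b is also set in a
        return all(x >= y for x, y in zip(a, b))
    return all(sum(contains(a, b) for a in proj) >= k for b in proj)

def maximal_select(D, features, k, t, utility):
    feasible = [set(c) for r in range(1, len(features) + 1)
                for c in combinations(features, r)
                if is_feasible(D, sorted(c), k)]
    # keep only maximal feasible sets (those with no feasible proper superset)
    maximal = [F for F in feasible if not any(F < G for G in feasible)]
    candidates = sorted(maximal, key=len, reverse=True)[:t]
    return max(candidates, key=lambda F: utility(sorted(F)))

D = [{'a': 1, 'b': 0}, {'a': 1, 'b': 0}, {'a': 1, 'b': 1}, {'a': 0, 'b': 1}]
best = maximal_select(D, ['a', 'b'], k=2, t=5, utility=len)
print(best)  # {'a', 'b'} is infeasible here, so a singleton set wins
```

The exponential enumeration is exactly what restricting the search to ℳ (and to the top-t candidates) avoids in practice.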
4.2 Greedy with Modular and SubModular Objective Functions
A potential limitation of Maximal is that it can be slow for dense datasets. So we propose a second method, called Greedy, which runs much faster as it greedily adds one new feature at a time to an existing feasible feature set. As its greedy criterion, Greedy can use different separation functions that discriminate between positive and negative instances; in this work we use HamDist (see Definition 8) and DistCnt (see Definition 9). Thus Greedy solves Problem (1) with the objective replaced by either of these two functions. Because of the monotone property of these functions, Greedy ensures that as we add more features, the objective function value of (1) monotonically increases. The process stops once no more features can be added to the existing feature set while the projected dataset still maintains the desired anonymity-by-containment value.
Submodularity and Modularity
Definition 10 (Submodular Set Function).
Given a finite ground set Ω, a monotone function f that maps subsets of Ω to a real number is called submodular if, for all A ⊆ B ⊆ Ω and x ∈ Ω \ B,

f(A \cup \{x\}) - f(A) \geq f(B \cup \{x\}) - f(B).
If the above condition is satisfied with equality, the function is called modular.
Theorem 4.2.1. HamDist is monotone, submodular, and in fact modular.
Proof.
For a dataset D, let F_1 and F_2 be two arbitrary feature sets such that F_1 ⊆ F_2, and let D = D^+ ∪ D^-, where the partition is based on class label. Consider a pair (u, v) such that u ∈ D^+ and v ∈ D^-. Let g be the function that sums the Hamming distance over all such pairs for a given feature subset; thus g(F_1) = \sum_{u \in D^+, v \in D^-} H_{F_1}(u, v), where H is the Hamming distance defined in Equation 2, and similarly we can define g(F_2) for the feature subset F_2. Using Equation 2, H_{F_1}(u, v) is a summation over the features in F_1. Since F_1 ⊆ F_2, H_{F_2}(u, v) includes the summands for the features in F_1 and possibly the summands of other features, which are nonnegative; hence H_{F_1}(u, v) ≤ H_{F_2}(u, v). Summing over all (u, v) pairs yields g(F_1) ≤ g(F_2), so HamDist is monotone. Now, for a feature x ∉ F_2,

g(F_1 \cup \{x\}) - g(F_1) = \sum_{u \in D^+, v \in D^-} \mathbf{1}(u_x \neq v_x).
Similarly, g(F_2 \cup \{x\}) - g(F_2) = \sum_{u \in D^+, v \in D^-} \mathbf{1}(u_x \neq v_x). Then we have g(F_1 \cup \{x\}) - g(F_1) = g(F_2 \cup \{x\}) - g(F_2). Dividing both sides by |D^+||D^-| yields HamDist(F_1 \cup \{x\}) - HamDist(F_1) = HamDist(F_2 \cup \{x\}) - HamDist(F_2). Hence the submodularity condition holds with equality, i.e., HamDist is modular. ∎
Theorem 4.2.2. DistCnt is monotone and submodular.
Proof.
Given a dataset D partitioned as D^+ ∪ D^- based on class label, consider a bipartite graph in which the vertices of one partition (say, P) correspond to the features, and the vertices of the other partition (say, Q) correspond to the distinct pairs of entities (u, v) such that u ∈ D^+ and v ∈ D^-; thus |Q| = |D^+| · |D^-|. If for a feature f we have u_f ≠ v_f, an edge exists between the corresponding vertices f ∈ P and (u, v) ∈ Q. For a set of vertices S, let Γ(S) represent its neighbor list. Since the size of the neighbor list of a vertex set is monotone and submodular, for F_1 ⊆ F_2 ⊆ P and x ∈ P \ F_2 we have |Γ(F_1)| ≤ |Γ(F_2)|, and |Γ(F_1 ∪ {x})| − |Γ(F_1)| ≥ |Γ(F_2 ∪ {x})| − |Γ(F_2)|. By construction, for a feature set F, Γ(F) contains exactly the entity pairs for which at least one feature value among F differs. Thus DistCnt(F) = |Γ(F)|, and DistCnt is monotone and submodular. ∎
For a monotone submodular function f, let S be a set of size t obtained by selecting elements one at a time, each time choosing the element that provides the largest marginal increase in the function value, and let S* be a set that maximizes the value of f over all t-element sets. Then f(S) ≥ (1 − 1/e) · f(S*); in other words, greedy selection provides a (1 − 1/e)-approximation. For a modular function, greedy selection is optimal [10].
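The modularity of HamDist and submodularity of DistCnt claimed above can be sanity-checked numerically on random binary data; the assertions below mirror the theorem statements (the unnormalized pair sums are used, which does not affect either property):

```python
# Numeric check of the theorems: on random 0/1 data, the marginal gain of
# adding a feature x is equal for the HamDist pair-sum (modular) and
# non-increasing for DistCnt (submodular) as the base set grows.
import random

random.seed(0)
feats = list(range(6))
D_pos = [[random.randint(0, 1) for _ in feats] for _ in range(8)]
D_neg = [[random.randint(0, 1) for _ in feats] for _ in range(8)]

def ham(F):
    """Unnormalized HamDist: total Hamming distance over all pos-neg pairs."""
    return sum(sum(u[f] != v[f] for f in F) for u in D_pos for v in D_neg)

def dcnt(F):
    """DistCnt: pairs distinguished by at least one feature in F."""
    return sum(any(u[f] != v[f] for f in F) for u in D_pos for v in D_neg)

A, B, x = [0, 1], [0, 1, 2, 3], 4          # A is a subset of B, x is outside B
assert ham(A + [x]) - ham(A) == ham(B + [x]) - ham(B)      # modular: equality
assert dcnt(A + [x]) - dcnt(A) >= dcnt(B + [x]) - dcnt(B)  # submodular
print("checks passed")
```

Both assertions hold for any dataset, not just this random one, per the proofs above.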
Greedy Method (Pseudocode)
Using the above theorems we can design two greedy algorithms: one for the modular function HamDist, and the other for the submodular function DistCnt. The pseudocodes of these algorithms are shown in Algorithm 2 and Algorithm 3. Both methods take the binary dataset and the integer k as input and generate the selected feature set F as output; initially F is empty. For the modular function, the marginal gain of an added feature can be precomputed, so Algorithm 2 first sorts the features in non-increasing order of their marginal gains and greedily adds features until it encounters a feature whose addition does not satisfy the constraint. For the submodular function DistCnt, the marginal gain cannot be precomputed, so Algorithm 3 selects each new feature by iterating over all remaining features and finding the best one (Lines 5 to 11). The terminating condition of this method is identical to that of Algorithm 2. Since the number of features is finite, both methods always terminate with a valid F which satisfies the anonymity-by-containment constraint.
Compared to Maximal, both greedy methods are faster: with respect to the number of features m, Algorithm 2 runs in O(m log m) time and Algorithm 3 runs in O(m^2) time. Also, by Theorem 4.2.1, Algorithm 2 returns the optimal feature set of its size, and by Theorem 4.2.2, Algorithm 3 returns a feature set whose objective function value is within a factor (1 − 1/e) of the optimum over all feature sets of the same size.
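The two greedy variants can be sketched as follows; the `feasible` predicate is a placeholder standing in for the anonymity-by-containment check, and all names are illustrative rather than the paper's:

```python
# Sketches of the two greedy selectors (in the spirit of Algorithms 2 and 3),
# parameterized by a feasibility predicate that stands in for the
# anonymity-by-containment constraint.

def greedy_modular(features, gain, feasible):
    """For a modular objective, per-feature gains can be precomputed:
    sort once, then add features until the constraint would be violated."""
    S = []
    for f in sorted(features, key=gain, reverse=True):
        if feasible(S + [f]):
            S.append(f)
        else:
            break
    return S

def greedy_submodular(features, objective, feasible):
    """For a submodular objective, re-evaluate marginal gains each round."""
    S, rest = [], set(features)
    while rest:
        best = max(rest, key=lambda f: objective(S + [f]) - objective(S))
        if not feasible(S + [best]):
            break
        S.append(best)
        rest.remove(best)
    return S

feats = [0, 1, 2, 3]
cap = lambda S: len(S) <= 2   # toy stand-in for the privacy constraint
print(greedy_modular(feats, gain=lambda f: f, feasible=cap))      # -> [3, 2]
print(greedy_submodular(feats, objective=sum, feasible=cap))      # -> [3, 2]
```

The single sort in `greedy_modular` versus the per-round rescan in `greedy_submodular` is exactly the O(m log m) versus O(m^2) difference noted above.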
Dataset  # Entities  # Features  # Pos  # Neg  Density

Adult Data  32561  19  24720  7841  27.9%
Entity Disambiguation  148  552  74  74  9.7%
Email  1099  24604  618  481  0.9%
5 Experiments and Results
In order to evaluate our proposed methods we perform various experiments. Our main objective in these experiments is to validate how the performance of the proposed privacy preserving classification varies as we change the value of k, the user-defined privacy threshold. We also compare the performance of our proposed utility preserving anonymization methods with other existing anonymization methods, such as k-anonymity and differential privacy. It is important to note that we do not claim that our methods provide better utility with identical privacy protection as other methods; rather, we claim that our methods provide adequate privacy protection suitable for high dimensional sparse microdata with a much superior AUC value (the classification utility metric we want to maximize in our problem setup). We use three real-world datasets for our experiments. All three datasets consist of entities that are labeled with 2 classes. The number of entities, the number of features, the distribution of the two classes (#positive and #negative), and the dataset density (fraction of nonzero cell values) are shown in Table 3.
5.1 Privacy Preserving Classification Tasks
Below, we discuss the datasets and the privacy preserving classification tasks
that we solve using our proposed methods.
Entity Disambiguation (ED) [44]. The objective of this
classification task is to identify whether the name reference at a row in the
data matrix maps to multiple reallife persons or not. Such an exercise is
quite common in homeland security applications for disambiguating multiple
suspects from their digital footprints [45, 38].
Privacy of the people in such a dataset
is important as many innocent persons can also be listed as a
suspect. Given a set of keywords that are associated with a name reference, we
build a binary data matrix for solving the ED task. We use data from Arnetminer.
To solve the entity disambiguation problem we first perform topic modeling over the keywords and then compute the distribution of each entity's keywords across the different topics. Our hypothesis is that for a pure entity the topic distribution will be concentrated on a few related topics, but for an impure entity (which is composed of multiple real-life persons) the topic distribution will be spread over many non-related topics. We use this idea to build a simple classifier which uses an entropy-based score for an entity, defined as below:
E = -\sum_{i=1}^{T} p_i \log p_i    (5)
where p_i is the probability of the entity belonging to topic i, and T represents the predefined number of topics for topic modeling. Clearly, for a pure entity the entropy-based score is relatively smaller than that of a non-pure entity. We use this score as our predicted value and compute AUC (area under the ROC curve) to report the performance of the classifier.
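A small sketch of this score, assuming an entity's topic distribution is already given (the distributions below are made up for illustration):

```python
# Sketch of the entropy-based disambiguation score (Eq. 5): a pure entity's
# topic distribution is concentrated, so its entropy is low; an impure
# entity's distribution is spread out, so its entropy is high.
import math

def entropy_score(topic_probs):
    """E = -sum_i p_i * log(p_i) over an entity's topic distribution."""
    return -sum(p * math.log(p) for p in topic_probs if p > 0)

pure = [0.9, 0.05, 0.05]    # concentrated on one topic
impure = [0.3, 0.4, 0.3]    # spread across unrelated topics
assert entropy_score(pure) < entropy_score(impure)
print(round(entropy_score(pure), 3), round(entropy_score(impure), 3))
```

Thresholding (or ranking by) this score is what yields the AUC values reported for the ED task.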
Adult. The Adult dataset is the well-known UCI census dataset; here the binary classification task is to predict whether a person's annual income exceeds $50K.
Email. The last dataset, namely the Email dataset, is used for the spam email filtering task of Section 5.5, where each email is labeled as spam or legitimate.
5.2 Experimental Setting
For our experiments, we vary the value k of the proposed anonymity-by-containment (AC) metric and run Maximal and the different variants of Greedy independently to build projected classification datasets whose AC value is at least k. We use the names Greedy-HamDist and Greedy-DistCnt for the two variants of Greedy (Algorithms 2 and 3), which optimize the Hamming distance and Distinguish count greedy criteria, respectively. As we mentioned earlier, a k-anonymity based method imposes a strong restriction which severely affects the utility of the dataset. To demonstrate this, we also run the Greedy variants with k-anonymity, instead of AC, as the privacy criterion; we call these competing methods k-anonymity HamDist and k-anonymity DistCnt. It is important to note that, in our experiments under the same setting, the k-anonymity based competing methods may not provide the same level of privacy: for the same k value, the privacy protection of our proposed method is not the same as that of k-anonymity, simply because AC is a relaxation of k-anonymity.
We also use four other methods for comparison with our proposed methods: RF [7], CM Greedy [19], LaplaceDP, and ExponentialDP. We discuss these methods below.
RF is a Randomization Flipping based anonymization technique presented in [7], which randomly flips feature values such that each instance in the dataset satisfies the k-anonymity privacy constraint. RF uses clustering such that, after the random flipping operation, each cluster has at least k entities with the same feature values with respect to the entire feature set.
CM Greedy represents another greedy based method, which uses the Classification Metric (CM) utility criterion proposed in [19] (see Definition 2). The original work assigns a generalization penalty over the rows of the dataset and uses a genetic algorithm for the classification task, but for a fair comparison we use the CM criterion in the Greedy algorithm and, with the selected features, use an identical setup for classification.
LaplaceDP [21] is a method that uses feature selection for differentially private data publishing. The authors of [21] utilize the Laplace mechanism [13] for the differential privacy guarantee. To compare with their method, we first compute the utility u(i) of each feature i as its true output using the HamDist function in Definition 8. Then we add independently generated noise drawn from a Laplace distribution to each of the outputs, so that the noisy output for each feature is ũ(i) = u(i) + Lap(Δu/ε), where Δu is the sensitivity of the utility function. After that we select the top features with the largest noisy outputs. On the reduced dataset, we apply a private data release method which provides the differential privacy guarantee. The general philosophy of this method is to first derive a frequency matrix of the reduced dataset over the feature domain and add Laplace noise to each count (known as a marginal) to satisfy differential privacy; the method then adds additional data instances to match the noisy counts. Such an approach is discussed in [12] as a private data release mechanism.
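The noisy top-m selection step just described can be sketched as follows, assuming a precomputed per-feature utility vector (the scores and names below are made up for illustration):

```python
# Sketch of LaplaceDP-style feature selection: add Laplace noise with scale
# sensitivity/epsilon to each feature's utility score and keep the top-m
# noisy scores. Illustrative only; not the exact pipeline of [21].
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_select(scores, m, eps, sensitivity=1.0):
    """Keep the m features with the largest noisy utility scores."""
    noisy = {f: s + laplace_noise(sensitivity / eps) for f, s in scores.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:m]

random.seed(1)
scores = {'f1': 0.9, 'f2': 0.8, 'f3': 0.1, 'f4': 0.05}
picked = dp_select(scores, m=2, eps=5.0)
print(picked)
```

With a small ε the noise scale grows and low-utility features are frequently picked, which is the utility loss discussed in the results sections.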
ExponentialDP is another differential-privacy-aware feature selection method. In contrast to the work presented in [21], we use the exponential mechanism [30] to select features. In particular, we choose each feature i with probability proportional to exp(ε · u(i) / (2Δu)); that is, a feature with a higher utility score in terms of the HamDist function is exponentially more likely to be chosen. The private data release stage of ExponentialDP is the same as in LaplaceDP. Note that, for both LaplaceDP and ExponentialDP, prior feature selection is essential to reduce the data dimensionality; otherwise the number of marginals is intractable (2^m for a binary dataset with m features) and adding instances to match the count for each such marginal is practically impossible.
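A minimal sketch of this exponential-mechanism selection, sampling m distinct features without replacement and splitting the budget equally per selection as described in the experimental setting (scores and names are illustrative):

```python
# Sketch of exponential-mechanism feature selection: each feature is sampled
# with probability proportional to exp(eps_each * u(f) / (2 * sensitivity)).
import math
import random

def exp_mechanism_select(scores, m, eps, sensitivity=1.0):
    """Sample m distinct features, splitting the privacy budget equally."""
    remaining = dict(scores)
    eps_each = eps / m
    chosen = []
    for _ in range(m):
        feats = list(remaining)
        weights = [math.exp(eps_each * remaining[f] / (2 * sensitivity))
                   for f in feats]
        f = random.choices(feats, weights=weights, k=1)[0]
        chosen.append(f)
        del remaining[f]
    return chosen

random.seed(7)
scores = {'f1': 0.9, 'f2': 0.8, 'f3': 0.1}
sel = exp_mechanism_select(scores, m=2, eps=4.0)
print(sel)
```

Unlike the Laplace variant, no noise is added to the scores themselves; randomness enters only through the sampling distribution.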
For all the algorithms and all the datasets (except ED) we use LibSVM to perform SVM classification using L2 loss with 5-fold cross validation. The only parameter for LibSVM is the regularization-loss tradeoff, which we tune using a small validation set. For each of the algorithms, we report the AUC and the selected feature count (SFC). The RF method selects all the features, so for this method we instead report the percentage of cell values for which the bit is flipped. We use different anonymity-by-containment (k) values in our experiments; for practical anonymization, a k value between 5 and 10 is suggested in earlier work [41], and we use three different values: 5, 8, and 11. For a fair comparison, for both LaplaceDP and ExponentialDP we use the same number of features as is obtained by Greedy under the corresponding k. Since k-anonymity and differential privacy use totally different parameter setting mechanisms (one based on k, the other on ε), it is not easy to determine what value of ε in DP makes a fair comparison to a k value of 5 in AC. So, for both LaplaceDP and ExponentialDP, we show the differential privacy results for several different ε values. Note that the original work [12] suggested a value of 1.0 for ε. When using the DP based methods, we allocate half of the privacy budget to the feature selection step and the remaining half to the noise added to the marginals in the private data release step. Moreover, within the feature selection procedure, we further divide the budget equally across the selection of each feature.
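AUC, the utility metric reported throughout, can be computed directly from classifier scores as the probability that a random positive instance is ranked above a random negative one; a minimal stdlib sketch:

```python
# Rank-based AUC over predicted scores (ties count as half a win). This is
# equivalent to the area under the ROC curve reported in the result tables.
def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

assert auc([0.9, 0.8], [0.1, 0.2]) == 1.0   # perfect separation
assert auc([0.5], [0.5]) == 0.5             # a tie counts as half
```

For the ED task the entropy score of Eq. 5 plays the role of the predicted score; for the other tasks the SVM decision values do.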
RF, LaplaceDP, and ExponentialDP are randomized methods, so for each dataset we run each of them multiple times and report the average AUC and standard deviation. In each result table in the following sections, we also highlight the best results in terms of AUC among all methods under the same setting. We ran all the experiments on a 2.1 GHz machine with 4 GB memory running the Linux operating system.
Method  AUC (Selected Feature Count)  

k=5  k=8  k=11  
Maximal  0.82 (61)  0.81 (43)  0.79 (32) 
Greedy-HamDist  0.88 (27)  0.88 (24)  0.81 (16)
Greedy-DistCnt  0.81 (11)  0.81 (11)  0.80 (10)
CM Greedy [19]  0.68 (2)  0.68 (2)  0.68 (2) 
RF [7]  0.75 ± 0.02 (11.99%)  0.73 ± 0.03 (14.03%)  0.72 ± 0.02 (16.49%)
k-anonymity HamDist  0.55 (3)  0.55 (2)  0.55 (2)
k-anonymity DistCnt  0.79 (3)  0.79 (3)  0.77 (2)
FullFeatureSet  0.87 (552) 
5.3 Name Entity Disambiguation
In Table 4 we report the AUC values of the anonymized name entity disambiguation task using various privacy methods (rows) for different k values (columns). For easier comparison, our proposed methods, the competing methods, and the non-private method are grouped by horizontal lines: our proposed methods are in the top group, the competing methods are in the middle group, and the non-private method is in the bottom group. For the differential privacy comparison, we show the AUC results in Figures 1(a) and 1(b). For each method we also report the count of selected features (SFC); since the RF method uses the full set of features, for this method the value in parentheses is the percentage of cell values that have been flipped. We also report the AUC performance using the full feature set (last row); as the non-private method in the bottom group has no privacy restriction, its result is independent of k.
For most of the methods, increasing k decreases the number of selected features, which translates to poorer classification performance; this validates the privacy-utility tradeoff. However, for a given k, our proposed methods perform better than the competing methods in terms of the AUC metric for all the different k values. For instance, for k = 5, the AUC results of RF and CM Greedy are only 0.75 and 0.68 respectively, whereas the different versions of the proposed Greedy obtain AUC values between 0.81 and 0.88. Among the competing methods, both LaplaceDP and ExponentialDP perform the worst, as shown in the first group of bars in Figures 1(a) and 1(b), and k-anonymity DistCnt performs the best (0.79 for k = 5); yet all competing methods perform much more poorly than our proposed methods. A reason for this may be that most of the competing methods are too restrictive: they are able to select only 2 to 3 features for the various k values, whereas our proposed methods are able to select between 11 and 61 features, which helps our methods retain classification utility. The bad performance of the differential privacy based methods is due to the fact that, in such a setting, the added noise is too large in both the feature selection and private data release steps. In general, the smaller the ε, the stronger the privacy guarantee that differential privacy provides; however, stronger privacy protection in terms of ε always leads to worse data utility in terms of AUC, as shown in Figures 1(a) and 1(b). Therefore, even though differential privacy provides a stronger privacy guarantee, the utility of the data for the supervised classification task is significantly destroyed. For this dataset, we observe that the performance of RF depends largely on the percentage of flips in the cell values: if this percentage is large, the performance is poor. As k increases, with more privacy required, the percentage of flips increases and the AUC drops.
For a sparse dataset like the one we use for entity disambiguation, feature selection helps classification performance. In this dataset, using the full set of features (no privacy) we obtain only a 0.87 AUC value, whereas using less than 10% of the features our proposed methods achieve comparable or better AUC (when k = 5). Even for k = 11, our methods retain a substantial part of the classification utility of the dataset and obtain an AUC value of 0.81 (see the second row). Also note that under k = 5 and k = 8, our Greedy-HamDist performs better than using the full feature set, which demonstrates that our proposed privacy-aware feature selection methods not only have competitive AUC performance but provide strong privacy guarantees as well.
Method  AUC (Selected Feature Count)  

k=5  k=8  k=11  
Maximal  0.74 (8)  0.74 (8)  0.75 (7) 
Greedy-HamDist  0.77 (9)  0.77 (9)  0.76 (8)
Greedy-DistCnt  0.78 (10)  0.78 (10)  0.76 (8)
CM Greedy [19]  0.71 (5)  0.71 (5)  0.71 (5) 
RF [7]  0.80 ± 0.02 (0.60%)  0.80 ± 0.03 (1.00%)  0.80 ± 0.02 (1.44%)
k-anonymity HamDist  0.72 (8)  0.72 (8)  0.72 (8)
k-anonymity DistCnt  0.73 (8)  0.70 (6)  0.70 (6)
FullFeatureSet  0.82 (19) 
Method  AUC (Selected Feature Count)  

k=5  k=8  k=11  
Maximal  0.94 (121)  0.92 (66)  0.90 (58) 
Greedy-HamDist  0.91 (11)  0.91 (11)  0.91 (11)
Greedy-DistCnt  0.95 (11)  0.93 (7)  0.93 (7)
CM Greedy [19]  0.86 (3)  0.86 (3)  0.86 (3) 
RF [7]  0.87 ± 0.02 (1.30%)  0.86 ± 0.01 (1.73%)  0.87 ± 0.03 (2.03%)
k-anonymity HamDist  0.84 (4)  0.84 (4)  0.84 (4)
k-anonymity DistCnt  0.81 (4)  0.81 (4)  0.81 (4)
FullFeatureSet  0.95 (24604) 
5.4 Adult Data
The performance of the various methods on the Adult dataset is shown in Table 5, where the rows and columns are organized identically to the previous table. The Adult dataset is low-dimensional and dense (27.9% of values are nonzero). Achieving privacy on such a dataset is comparatively easy, so the existing anonymization methods work fairly well on it. As we can observe, RF performs the best among all the methods. The good performance of RF is owing to the very small percentage of flips, which ranges from 0.60% to 1.44% for the various k values: basically, RF can achieve k-anonymity on this dataset with a very small number of flips, which helps it maintain the classification utility of the dataset. For the same reason, the k-anonymity HamDist and k-anonymity DistCnt methods are also able to retain many dimensions of this dataset (8 out of 19 for k = 5) and perform fairly well. On the other hand, the different versions of Greedy and Maximal retain between 7 and 10 dimensions and achieve an AUC between 0.74 and 0.78, which is close to the 0.80 of RF. Also note that the AUC using the full set of features (no privacy) is 0.82, so the utility loss due to privacy is not substantial for this dataset. As a remark, our method is particularly suitable for high dimensional sparse data, for which anonymization using traditional methods is difficult.
5.5 Spam Email Filtering
In Table 6 we compare the AUC values of the different methods on the spam email filtering task. This is a very high dimensional dataset with 24604 features. As we can observe, our proposed methods, especially Maximal and Greedy-DistCnt, perform better than the competing methods. For example, for k = 5 the classification AUC of RF is 0.87 with a flip rate of 1.30%, but Greedy-DistCnt, using less than 0.05% of the features, obtains an AUC value of 0.95, which is equal to the AUC using the full feature set. Again, the k-anonymity based methods show worse performance, as they select fewer features due to the stronger restriction of that privacy metric: for k = 5, Greedy-DistCnt selects 11 features, but k-anonymity DistCnt selects only 4. Due to this, the classification results under the k-anonymity constraint are worse than those using our proposed AC privacy metric. As shown in Figures 1(a) and 1(b), both LaplaceDP and ExponentialDP with various privacy budget setups perform much worse than all the competing methods in Table 6, which demonstrates that the significant amount of noise added during the sanitization process deteriorates the data utility and leads to bad classification performance. Among our methods, both Greedy-DistCnt and Maximal are the best, as they consistently hold their classification performance across all the different k settings.
6 Related Work
We discuss the related work under the following two categories.
6.1 PrivacyPreserving Data Mining
In terms of privacy models, several privacy metrics have been widely used to quantify the privacy risk of published data instances, such as k-anonymity [41], t-closeness [27], l-diversity [28], and differential privacy [14]. Existing works on privacy preserving data mining solve a specific data mining problem given a privacy constraint over the data instances, such as classification [43], regression [16], clustering [42], and frequent pattern mining [15]. However, the solutions proposed in these works are strongly influenced by the specific data mining task and also by the specific privacy model. In fact, the majority of the above works consider distributed privacy, where the dataset is partitioned among multiple participants owning different portions of the data, and the goal is to mine shared insights over the global data without compromising the privacy of the local portions. A few other works [6, 5] consider output privacy by ensuring that the output of a data mining task does not reveal sensitive information.
The k-anonymity privacy metric, due to its simplicity and flexibility, has been studied extensively over the years. The authors of [2] present k-anonymous patterns for the application of association rule mining. Samarati [39] proposes formal methods of k-anonymity using suppression and generalization techniques, and also introduces the concept of minimal generalization. Meyerson et al. [31] prove that two definitions of optimality are NP-hard to achieve: first, finding the minimum number of cell values that need to be suppressed; second, finding the minimum number of attributes that need to be suppressed. Henceforth, a large number of works have explored approximate k-anonymization [31, 4]. However, none of these works considers the utility of the dataset along with the privacy requirements. Kifer et al. [22] propose methods that inject utility, in the form of data distribution information, into k-anonymous and l-diverse tables; however, that work does not consider a classification dataset. Iyengar [19] proposes a utility metric called CM which is explicitly designed for classification datasets. It assigns a generalization penalty over the rows of the dataset, but its performance is poor, as we have shown in this work.
In recent years differential privacy [12, 34, 24] has attracted much attention in the privacy research literature. The authors of [9] propose a sampling based method for releasing high dimensional data with differential privacy guarantees, and [37] proposes a method for publishing differentially private low order marginals for high dimensional data. Even though the authors of [9, 37] claim that they deal with high dimensional data, the dimensionality of the data in their experiments is limited. [40] makes use of k-anonymity to enhance data utility in differential privacy; an interesting observation of this work is that a differential privacy based method, by itself, is not a good privacy mechanism with regard to maintaining data utility. [8] proposes a probabilistic top-down partitioning algorithm to publish set-valued data via differential privacy. The authors of [32] propose to utilize the exponential mechanism to release a decision-tree based classifier that satisfies differential privacy. However, in their work privacy is embedded in the data mining process; hence it is not suitable as a data release mechanism, and more importantly it can only be used along with the specific classification model within which the privacy mechanism is built in.
6.2 PrivacyAware Feature Selection
Empirical studies of the use of feature selection in privacy preserving data publishing are presented in [20, 21]. However, those works use feature selection as an add-on tool prior to data anonymization and do not consider privacy during the feature selection process. In our work we consider privacy-aware feature selection with the twin objectives of privacy preservation and utility maintenance. To the best of our knowledge, the works most similar to ours in the use of feature selection for privacy preserving data publishing are the recent [35, 29]. [35] considers privacy as a cost metric in a dynamic feature selection process and proposes a greedy iterative approach for solving the task, where the data releaser requests information about one feature at a time until a predefined privacy budget is exhausted; however, the entropy based privacy metric presented in that work is strongly influenced by the specific classifier. [29] presents a genetic approach for achieving k-anonymity by partitioning the original dataset into several projections such that each of them adheres to k-anonymity, but the proposed method does not provide an optimality guarantee.
7 Conclusion and Future Work
In this paper, we propose a novel approach to entity anonymization using feature selection. We define a new anonymity metric, called anonymity by containment, which is particularly suitable for high dimensional microdata, and we propose two feature selection methods along with two classification utility metrics. These metrics satisfy monotonicity and (sub)modularity properties, thus enabling effective greedy algorithms. In the experiment section we show that both proposed methods select good quality features on a variety of datasets, retaining the classification utility while satisfying the user-defined anonymity constraint.
In this work, we consider binary features. We also show experimental results using categorical features by binarizing them, so the work can easily be extended to datasets with categorical features. An immediate future work is to extend this work to datasets with real-valued features. Another future direction is to consider absent attributes in the privacy model: in the real world, for some binary datasets, absent attributes can cause privacy violations, for example through negative association rule mining.
Acknowledgements
We sincerely thank the reviewers for their insightful comments. This research is sponsored by both Mohammad Al Hasan’s NSF CAREER Award (IIS1149851) and Noman Mohammed’s NSERC Discovery Grants (RGPIN201504147). The contents are solely the responsibility of the authors and do not necessarily represent the official views of NSF and NSERC.
Footnotes
 To enhance readability, we write a feature set as a string; for example, the set {a, b, c} is written as abc.
 http://research.nii.ac.jp/~uno/code/lcm.html
 http://arnetminer.org
 https://archive.ics.uci.edu/ml/datasets/Adult
 http://www.csmining.org/index.php/pu1andpu123adatasets.html
References
 C. C. Aggarwal. On kanonymity and the curse of dimensionality. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB’05, pages 901–909, 2005.
 M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi. Kanonymous patterns. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD’05, pages 10–21, 2005.
 B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In Proceedings of the 26th ACM SIGMODSIGACTSIGART Symposium on Principles of Database Systems, PODS ’07, pages 273–282, 2007.
 R. J. Bayardo and R. Agrawal. Data privacy through optimal kanonymization. In Proceedings of the 21st International Conference on Data Engineering, ICDE ’05, pages 217–228, 2005.
 R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta. Discovering frequent patterns in sensitive data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, pages 503–512, 2010.
 L. Bonomi and L. Xiong. Mining frequent patterns with differential privacy. Proceedings of Very Large Data Bases Endowment, 6(12):1422–1427, Aug. 2013.
 J.W. Byun, A. Kamra, E. Bertino, and N. Li. Efficient kanonymization using clustering techniques. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications, DASFAA’07, pages 188–200, 2007.
 R. Chen, B. C. Desai, N. Mohammed, L. Xiong, and B. C. M. Fung. Publishing setvalued data via differential privacy. In Proceedings of the Very Large Data Bases Endowment, volume 4, pages 1087–1098, 2011.
 R. Chen, Q. Xiao, Y. Zhang, and J. Xu. Differentially private highdimensional data publication via samplingbased inference. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 129–138, 2015.
 M. Conforti and G. Cornuéjols. Submodular set functions, matroids and the greedy algorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251–274, 1984.
 F. K. Dankar and K. E. Emam. Practicing differential privacy in health care: A review. Transactions on Data Privacy, 6:35–67, Apr. 2013.
 C. Dwork. Differential privacy: A survey of results. In Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, TAMC'08, pages 1–19, 2008.
 C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC’06, pages 265–284, 2006.
 C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundation and Trends in Theoretical Computer Science, 9:211–407, Aug. 2014.
 A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 217–228, 2002.
 S. E. Fienberg and J. Jin. Privacy-preserving data sharing in high dimensional regression and classification settings. Journal of Privacy and Confidentiality, 4(1):10, 2012.
 S. E. Fienberg, A. Rinaldo, and X. Yang. Differential privacy and the risk-utility tradeoff for multidimensional contingency tables. In Privacy in Statistical Databases: UNESCO Chair in Data Privacy, International Conference, PSD, pages 187–199, 2010.
 I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, Mar. 2003.
 V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 279–288, 2002.
 Y. Jafer, S. Matwin, and M. Sokolova. Task oriented privacy preserving data publishing using feature selection. In Proceedings of the 27th Canadian Conference on Advances in Artificial Intelligence, pages 143–154, 2014.
 Y. Jafer, S. Matwin, and M. Sokolova. Using feature selection to improve the utility of differentially private data publishing. Procedia Computer Science, 37:511–516, 2014.
 D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD ’06, pages 217–228, 2006.
 D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, pages 193–204, 2011.
 J. Lee and C. Clifton. Differential identifiability. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 1041–1049, 2012.
 D. Li and M. Becchi. Deploying graph algorithms on GPUs: an adaptive solution. In 2013 IEEE 27th International Symposium on Parallel & Distributed Processing (IPDPS), pages 1013–1024, 2013.
 G. Li. Mining local and global patterns for complex data classification. PhD thesis, Rensselaer Polytechnic Institute, 2013.
 N. Li and T. Li. t-Closeness: Privacy beyond k-anonymity and l-diversity. In IEEE 23rd International Conference on Data Engineering (ICDE), pages 106–115, 2007.
 A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. l-Diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), Mar. 2007.
 N. Matatov, L. Rokach, and O. Maimon. Privacy-preserving data mining: A feature set partitioning approach. Information Sciences, 180(14):2696–2720, 2010.
 F. McSherry and K. Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’07, pages 94–103, Providence, RI, 2007.
 A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '04, pages 223–228, 2004.
 N. Mohammed, R. Chen, B. C. Fung, and P. S. Yu. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, pages 493–501, 2011.
 A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, SP '08, pages 111–125, 2008.
 H. Nguyen, A. Imine, and M. Rusinowitch. Network Structure Release under Differential Privacy. Transactions on Data Privacy, 9(3):215–241, 2016.
 E. Pattuk, M. Kantarcioglu, H. Ulusoy, and B. Malin. Privacy-aware dynamic feature selection. In 2015 IEEE 31st International Conference on Data Engineering (ICDE), pages 78–88, 2015.
 F. Prasser, R. Bild, J. Eicher, H. Spengler, F. Kohlmayer, and K. A. Kuhn. Lightning: Utility-driven anonymization of high-dimensional data. Transactions on Data Privacy, 9(2):161–185, 2016.
 W. Qardaji, W. Yang, and N. Li. Practical differentially private release of marginal contingency tables. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 1435–1446, 2014.
 T. Saha, B. Zhang, and M. Hasan. Name disambiguation from link data in a collaboration graph using temporal and topological features. Social Network Analysis and Mining, 5(1), 2015.
 P. Samarati. Protecting respondents’ identities in microdata release. IEEE Transactions on Knowledge and Data Engineering (TKDE), 13(6):1010–1027, Nov. 2001.
 J. Soria-Comas, J. Domingo-Ferrer, D. Sánchez, and S. Martínez. Enhancing data utility in differential privacy via microaggregation-based k-anonymity. Very Large Data Bases (VLDB) Journal, 23:771–794, Oct. 2014.
 L. Sweeney. k-Anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, Oct. 2002.
 J. Vaidya and C. Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, pages 206–215, 2003.
 J. Vaidya, M. Kantarcioglu, and C. Clifton. Privacy-preserving naive Bayes classification. Very Large Data Bases (VLDB) Journal, 17(4):879–898, July 2008.
 B. Zhang, M. Dundar, and M. A. Hasan. Bayesian non-exhaustive classification, a case study: Online name disambiguation using temporal record streams. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM '16, pages 1341–1350. ACM, 2016.
 B. Zhang, T. K. Saha, and M. Al Hasan. Name disambiguation from link data in a collaboration graph. In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 81–84, 2014.