Feature Selection for Classification under Anonymity Constraint


Over the last decade, the proliferation of online platforms and their increasing adoption by billions of users have greatly heightened users' privacy risk. In fact, security researchers have shown that sparse microdata containing information about the online activities of a user, although anonymous, can still be used to disclose the identity of the user by cross-referencing the data with other data sources. To preserve the privacy of a user, existing works propose several methods ($k$-anonymity, $\ell$-diversity, differential privacy) for ensuring that a dataset bears a small identity disclosure risk. However, the majority of these methods modify the data in isolation, without considering their utility in subsequent knowledge discovery tasks, which makes these datasets less informative. In this work, we consider labeled data that are generally used for classification, and propose two methods for feature selection considering two goals: first, on the reduced feature set the data has small disclosure risk, and second, the utility of the data is preserved for performing a classification task. Experimental results on various real-world datasets show that the methods are effective and useful in practice.


Privacy Preserving Feature Selection, $k$-anonymity by containment, Maximal Itemset Mining, Greedy Algorithm, Binary Classification

1 Introduction

Over the last decade, with the proliferation of various online platforms, such as web search, eCommerce, social networking, micro-messaging, streaming entertainment and cloud storage, the digital footprint of today’s Internet user has grown at an unprecedented rate. At the same time, the availability of sophisticated computing paradigms and advanced machine learning algorithms has enabled the platform owners to mine and analyze terabytes of digital footprint data for building various predictive analytics and personalization products. For example, search engines and social network platforms use search keywords for providing sponsored advertisements that are personalized for a user’s information need; e-commerce platforms use visitors’ search history for bolstering their merchandising effort; streaming entertainment providers use people’s rating data for building future product or service recommendations. However, the impressive personalization of services offered by these online platforms impresses us as much as it makes us feel insecure. The insecurity stems from knowing that an individual’s online behavior is stored within these companies, and that the individual, more often than not, is unaware of the specific information about themselves that is being stored.

The key reason for a web user’s insecurity over the digital footprint data (also known as microdata) is that such data contain sensitive information. For instance, a person’s online search about a disease medication may insinuate that she is suffering from that disease, a fact that she would rather not disclose. Similarly, people’s choice of movies, their recent purchases, etc. reveal enormous information regarding their background, preference and lifestyle. Arguably, microdata exclude biographical information, but due to the sheer size of our digital footprint the identity of a person can still be recovered from these data by cross-correlation with other data sources or by using publicly available background information. In this sense, these apparently non-sensitive attributes can serve as a quasi-identifier. For example, Narayanan et al. [33] have identified a Netflix subscriber from his anonymous movie ratings by using the Internet Movie Database (IMDB) as the source of background information. In the case of Netflix, anonymous microdata was released publicly for facilitating the Netflix Prize competition; however, even if the data is not released, there is always a concern that people’s digital footprint data can be abused within the company by employees, or by external hackers who have malicious intent.

For the case of microdata, the identity disclosure risk is high due to two key properties of such a dataset: high dimensionality and sparsity. Sparsity stands for the fact that for a given record there is rarely any record that is similar to it considering the full multi-dimensional space. It is also shown that in high-dimensional data, the ratio of the distance to the nearest neighbor and the distance to the farthest neighbor is almost one, i.e., all the points are far from each other [1]. Due to this fact, privacy is difficult to achieve on such datasets. A widely used privacy metric that quantifies the disclosure risk of a given data instance is $k$-anonymity [39], which requires that for any data instance in a dataset, there are at least $k$ distinct data instances sharing the same feature vector, thus ensuring that unwanted personal information cannot be disclosed merely through the feature vector. However, for high dimensional data, $k$-anonymization is difficult to achieve even for a reasonable value of $k$ (say 5); typically, value based generalization or attribute based generalization is applied so that $k$-anonymity is achieved, but Aggarwal has proved both theoretically and experimentally that for high dimensional data $k$-anonymity is not a viable solution even for a $k$ value of 2 [1]. He has also shown that as data dimensionality increases, the entire discriminatory information in the data is lost during the process of $k$-anonymization, which severely limits the data utility. Evidently, finding a good balance between a user’s privacy and the utility of high dimensional microdata is an unsolved problem, which is the primary focus of this paper.

A key observation about a real-life high dimensional dataset is that it exhibits high clustering tendency in many sub-spaces of the data, even though over the full dimension the dataset is very sparse. Thus an alternative technique for protecting against identity disclosure on such data is to find a subset of features such that, when projecting on this set of features, an acceptable level of anonymity can be achieved. One can view this as column suppression instead of the more commonly used row suppression for achieving $k$-anonymity [39]. Now for the case of feature selection for a given $k$, there may exist many sub-spaces for which a given dataset satisfies $k$-anonymity, but our objective is to obtain a set of features such that projecting on this set offers the maximum utility of the dataset in terms of a supervised classification task.

Consider the toy dataset that is shown in Table 1. Each row represents a person, and each column (except the first and the last) represents a keyword. If a cell entry is ‘1’ then the keyword at the corresponding column is associated with the person at the corresponding row. The reader may think of this table as a tabular representation of the search log of an eCommerce platform, where a ‘1’ under a keyword column $x_j$ stands for the fact that the corresponding user has searched using the keyword $x_j$ within a given period of time, and ‘0’ represents otherwise. The last column represents whether this user has made a purchase over the same time period. The platform owner wants to solve a classification problem to predict which of the users are more likely to make a purchase.

Say, the platform owner wants to protect the identity of its site visitors by making the dataset $k$-anonymous. Now, for this toy dataset, if he chooses $k = 2$, this dataset is not $k$-anonymous. For instance, the feature vector in the third row (1 0 0 1 1) is unique in this dataset. However, the dataset is $k$-anonymous for the same $k$ under the subspace spanned by the last three keywords, $\{x_3, x_4, x_5\}$. It is also $k$-anonymous (again for the same $k$) under the subspace spanned by $\{x_1, x_2, x_5\}$ (see Table 2). Among these two choices, the latter subspace is probably a better choice, as the features in this set are better discriminators than the features in the former set with respect to the class label. For feature set $\{x_1, x_2, x_5\}$, if we associate the value ‘101’ with the +1 label, and the value ‘111’ with the -1 label, we make only 1 mistake out of 6. On the other hand, for feature set $\{x_3, x_4, x_5\}$, no good correlation exists between the feature values and the class labels.

The research problem in the above task is the selection of an optimal binary feature set for utility preserving entity anonymization, where the utility is considered with respect to the classification performance and the privacy is guaranteed by enforcing a $k$-anonymity like constraint [41]. In existing works, $k$-anonymity is achieved by suppression or generalization of cell values, whereas in this work we achieve the same by selecting an optimal subset of features that maximizes the classification utility of the dataset. Note that maximizing the utility of the dataset is the main objective of this task; privacy is simply a constraint which a user enforces by setting the value of a privacy parameter based on the application domain and the user’s judgment. For the privacy model, we choose $k$-anonymity by containment, in short $k$-AC (definition forthcoming), where $k$ is the user-defined privacy parameter, which has a similar meaning as it has in traditional $k$-anonymity.

Our choice of a $k$-anonymity like metric over more theoretical counterparts, such as differential privacy (DP), is due to the pragmatic reason that all existing privacy laws and regulations, such as HIPAA (Health Insurance Portability and Accountability Act) and PHIPA (Personal Health Information Protection Act), use $k$-anonymity. Also, $k$-anonymity is flexible and simple, thus enabling people to understand and apply it for almost any real-life privacy preserving need; on the contrary, DP based methods use a privacy parameter ($\epsilon$), which has no obvious interpretation, and even by the admission of the original author of DP, choosing an appropriate value for this parameter is difficult [12]. Moreover, differential privacy based methods add noise to the data entities, but the decision makers in many application domains (such as health care), where privacy is an important issue, are quite uncomfortable with the idea of noise imputation [11]. Finally, the authors in [17] state that differential privacy is not suitable for protecting large sparse tables produced by statistics agencies and sampling organizations; this disqualifies differential privacy as a privacy model for protecting sparse and very high dimensional user microdata from e-commerce platforms and Internet search engines.

1.1 Our Contributions

In this work, we consider the task of feature selection under privacy constraint. This is a challenging task, as it is well-known that privacy is always at odds with the utility of a knowledge-based system, and finding the right balance is a difficult task [22, 23]. Besides, feature selection itself, without considering the privacy constraint, is an NP-Hard problem [18, 25, 26].

Given a classification dataset with binary features and an integer $k$, our proposed solutions find a subset of features such that after projecting each instance on this subset each entity in the dataset satisfies a privacy constraint, called $k$-anonymous by containment ($k$-AC). Our proposed privacy constraint $k$-AC is an adapted version of $k$-anonymity, which strikes the correct balance between disclosure risk and dataset utility, and it is particularly suitable for high dimensional binary data. We also propose two algorithms: Maximal and Greedy. The first is a maximal itemset mining based method and the second is a greedy incremental approach, both respecting the user-defined $k$-AC constraint.

The algorithms that we propose are particularly intended for high dimensional sparse microdata where the features are binary. The nature of such data is different from a typical dataset that is considered in many of the existing works on privacy preserving data disclosure mechanisms. The first difference is that existing works consider two kinds of attributes, sensitive and nonsensitive, whereas for our dataset all attributes are considered to be sensitive, and any subset of published attributes can be used by an attacker to de-anonymize one or more entities in the dataset using probabilistic inference methodologies. On the other hand, the unselected attributes are not published, so they cannot be used by an attacker to de-anonymize an entity. Second, we only consider binary attributes, which enable us to provide efficient algorithms and an interesting anonymization model. Considering only binary attributes may sound like an undue restriction, but in reality binary attributes are adequate (and often preferred) when modeling the online behavior of a person, such as ‘like’ in Facebook, ‘bought’ in Amazon, and ‘click’ in Google advertisement. Also, collecting explicit user feedback in terms of frequency data (say, the number of times a search keyword is used) may be costly in many online platforms. Nevertheless, as shown in [33], binary attributes are sufficient for an attacker to de-anonymize a person using high dimensional microdata, so safeguarding user privacy before disclosing such a dataset is important.

The contributions of our work are outlined below:

  1. We propose a novel method for entity anonymization using feature selection over a set of binary attributes from a two-class classification dataset. For this, we design a new anonymization model, named $k$-anonymity by containment ($k$-AC), which is particularly suitable for high-dimensional binary microdata.

  2. We propose two methods for solving the above task and show experimental results to validate the effectiveness of these methods.

  3. We show the utility of the proposed methods with three real-life applications; specifically, we show how the privacy-aware feature selection affects their performance.

2 Privacy Basics

Consider a dataset $D$, where each row corresponds to a person, and each column contains non-public information about that person; examples include disease, medication, sexual orientation, etc. In the context of online behavior, the search keywords or purchase history of a person may be such information. Privacy preserving data publishing methodologies make it difficult for an attacker to de-anonymize the identity of a person who is in the dataset. For de-anonymization, an attacker generally uses a set of attributes that acts almost like a key and uniquely identifies some individual in the dataset. These attributes are called quasi-identifiers. $k$-anonymity is a well-known privacy metric defined as below.

Definition 1 ($k$-anonymity).

A dataset $D$ satisfies $k$-anonymity if for any row entity $e \in D$ there exist at least $k-1$ other entities that have the same values as $e$ for every possible quasi-identifier.

The database in Table 1 is not 2-anonymous, as some row entities (for instance, the third row) are unique considering the entire attribute-set as quasi-identifier. On the other hand, it is 2-anonymous for both the datasets (one with Feature Set-1 and the other with Feature Set-2) in Table 2. For numerical or categorical attributes, generalization and/or suppression (of a row or a cell value) is used for achieving $k$-anonymity. Generalization partitions the values of an attribute into disjoint buckets and identifies each bucket with a value. Suppression either hides the entire row or some of its cell values, so that the anonymity of that entity can be maintained. Generalization and suppression make anonymous groups, where all the entities in a group have the same value for every possible quasi-identifier, and for a dataset to be $k$-anonymous, the size of each such group is at least $k$. In this work we consider binary attributes; each such attribute has only two values, 0 and 1. For binary attributes, value based generalization reduces to the process of column suppression, which incurs a loss of data utility. In fact, any form of generalization based $k$-anonymization incurs a loss in data utility due to the decrement of data variance or due to the loss of discernibility. Suppression of a row is also a loss, as in this case the entire row entity is not discernible from any of the remaining entities in the dataset. Unfortunately, most existing methods for achieving $k$-anonymity using both generalization and suppression operations do not consider a utility measure targeting the supervised classification task.
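The 2-anonymity statements above can be checked mechanically. The following is a minimal Python sketch rather than the authors' code: the helper name `is_k_anonymous` is hypothetical, and the 0-based column positions used for Feature Set-1 and Feature Set-2 are an assumption obtained by matching the values in Table 2 against the columns of Table 1.

```python
from collections import Counter

# Feature values of Table 1 (one tuple per user, class label omitted).
TABLE1 = [
    (1, 0, 1, 0, 1),
    (1, 0, 1, 0, 1),
    (1, 0, 0, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 1, 0, 1),
    (1, 1, 0, 1, 1),
]

def is_k_anonymous(rows, columns, k):
    """Return True if every row, projected on `columns`, occurs at least k times."""
    projected = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(projected)
    return all(counts[p] >= k for p in projected)

# Assumed column positions (0-based) for the two projections of Table 2.
FEATURE_SET_1 = [0, 1, 4]
FEATURE_SET_2 = [2, 3, 4]

full_ok = is_k_anonymous(TABLE1, range(5), 2)
set1_ok = is_k_anonymous(TABLE1, FEATURE_SET_1, 2)
set2_ok = is_k_anonymous(TABLE1, FEATURE_SET_2, 2)
print(full_ok, set1_ok, set2_ok)
```

Under these assumptions the full table fails the check because three of its rows are unique, while both projections pass it.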

There are some security attacks against which $k$-anonymity is vulnerable. For example, $k$-anonymity is susceptible to both homogeneity and background knowledge based attacks. More importantly, $k$-anonymity does not provide a statistical guarantee about anonymity, which can be obtained by using $\epsilon$-differential privacy [12], a method which provides strong privacy guarantees independent of an adversary’s background knowledge. There are existing methods that adopt the differential privacy principle for data publishing. The authors in [3, 13] propose the Laplace mechanism to publish contingency tables by adding noise generated from a Laplace distribution. However, such methods suffer from utility loss due to the large amount of noise added during the sanitization process. To resolve this issue, [30] proposes to utilize the exponential mechanism for improving the trade-off between differential privacy and data utility. However, the selection of the utility function used in the exponential mechanism based approach strongly affects the data utility in subsequent data analysis tasks. In this work, we compare the performance of our proposed privacy model, namely $k$-AC, with both Laplace and exponential based differential privacy frameworks (see Section 5.2) to show that $k$-AC better preserves the data utility than the differential privacy based methods.

A few works [19, 36] exist which consider classification utility together with a $k$-anonymity based privacy model, but none of them consider feature selection, which is the main focus of this work. In one of the earliest works, Iyengar [19] solves $k$-anonymization through generalization and suppression while minimizing a proposed utility metric called CM (Classification Metric) using a genetic algorithm which provides no optimality guarantee. The CM is defined as below:

Definition 2 (Classification Metric [19]).

Classification metric (CM) is a utility metric for a classification dataset, which assigns a penalty of 1 for each suppressed entity, and for each non-suppressed entity it assigns a penalty of 1 if that entity belongs to the minority class within its anonymous group. The CM value is equal to the sum of penalties over all the entities.
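To make the penalty rule concrete, here is a small Python sketch of the CM computation as it is worded in Definition 2 (sum of penalties, no normalization). The function name, the group identifiers, and the example data are hypothetical and serve only as an illustration; ties inside a group are resolved by treating one of the tied classes as the majority.

```python
from collections import Counter, defaultdict

def classification_metric(groups, labels, suppressed=()):
    """Sum of CM penalties.

    groups[i]  : identifier of the anonymous group of entity i
    labels[i]  : class label of entity i
    suppressed : indices of suppressed entities (penalty 1 each)
    """
    suppressed = set(suppressed)
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        if i not in suppressed:
            by_group[g].append(labels[i])
    penalty = len(suppressed)
    for members in by_group.values():
        majority_count = Counter(members).most_common(1)[0][1]
        # Every member outside the majority class of its group costs 1.
        penalty += len(members) - majority_count
    return penalty

# Hypothetical example: two anonymous groups of three entities each.
example_groups = ["a", "a", "a", "b", "b", "b"]
example_labels = [+1, +1, -1, -1, -1, +1]
cm_plain = classification_metric(example_groups, example_labels)
cm_suppressed = classification_metric(example_groups, example_labels, suppressed=[0])
print(cm_plain, cm_suppressed)
```

In the first call each group contains exactly one minority-class entity; in the second call the suppressed entity adds one penalty and the remaining two members of group "a" are tied, so one of them is still penalized.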

In this work, we compare the performance of our work with the CM based privacy-aware utility metric.

User    $x_1$  $x_2$  $x_3$  $x_4$  $x_5$  Class
$e_1$   1      0      1      0      1      +1
$e_2$   1      0      1      0      1      +1
$e_3$   1      0      0      1      1      +1
$e_4$   1      0      1      0      1      -1
$e_5$   1      1      1      0      1      -1
$e_6$   1      1      0      1      1      -1
Table 1: A toy 2-class dataset with binary feature-set
User    Feature Set-1              Feature Set-2              Class
        $x_1$  $x_2$  $x_5$        $x_3$  $x_4$  $x_5$
$e_1$   1      0      1            1      0      1            +1
$e_2$   1      0      1            1      0      1            +1
$e_3$   1      0      1            0      1      1            +1
$e_4$   1      0      1            1      0      1            -1
$e_5$   1      1      1            1      0      1            -1
$e_6$   1      1      1            0      1      1            -1
Table 2: Projections of the dataset in Table 1 on two feature-sets (Feature Set-1 and Feature Set-2)

3 Problem Statement

Given a classification dataset with binary attributes, our objective is to find a subset of attributes which increase the non-disclosure protection of the row entities, and at the same time maintain the classification utility of the dataset without suppressing any of the row entities. In this section we will provide a formal definition of the problem.

We define a database $D(E, I)$ as a binary relation between a set of entities ($E$) and a set of features ($I$); thus, $D \subseteq E \times I$, where $E = \{e_1, e_2, \dots, e_n\}$ and $I = \{x_1, x_2, \dots, x_d\}$; $n$ and $d$ are the number of entities and the number of features, respectively. The database $D$ can also be represented as a binary data matrix, where the rows correspond to the entities, and the columns correspond to the features. For an entity $e \in E$ and a feature $x \in I$, if $(e, x) \in D$, the corresponding data matrix entry $D(e, x) = 1$, otherwise $D(e, x) = 0$. Thus each row of $D$ is a binary vector of size $d$ in which the 1 entries correspond to the set of features with which the corresponding row entity is associated. In a classification dataset, besides the attributes, the entities are also associated with a class label, which is a category value. In this task we assume a binary class label with values $+1$ and $-1$. A typical supervised learning task is to use the features to predict the class label of an entity.

We say that an entity $e \in E$ contains a set of features $X \subseteq I$ if $D(e, x) = 1$ for all $x \in X$; the set $X$ is also called a containment set of the entity $e$.

Definition 3 (Containment Set).

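Given a binary dataset, $D(E, I)$, the containment set of a row entity $e \in E$, represented as $CS_D(e)$, is the set of attributes $X \subseteq I$ such that $\forall x \in X$, $D(e, x) = 1$, and $\forall y \in I \setminus X$, $D(e, y) = 0$.

When the dataset $D$ is clear from the context we will simply write $CS(e)$ instead of $CS_D(e)$ to represent the containment set of $e$.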

Definition 4 ($k$-anonymity by containment).

In a binary dataset $D(E, I)$ and for a given positive integer $k$, an entity $e \in E$ satisfies $k$-anonymity by containment if there exists a set of entities $F \subseteq E \setminus \{e\}$ with $|F| \ge k - 1$, such that $\forall f \in F$, $CS_D(f) \supseteq CS_D(e)$. In other words, there exist at least $k - 1$ other entities in $D$ such that their containment set is the same as or a superset of $CS_D(e)$.

By definition, if an entity satisfies $k$-anonymity by containment, it satisfies the same for all integer values from 1 up to $k$. We use the term $AC(e)$ to denote the largest $k$ for which the entity $e$ satisfies $k$-anonymity by containment.

Definition 5 ($k$-anonymous by Containment Group).

For a binary dataset $D(E, I)$, if $e \in E$ satisfies $k$-anonymity by containment, the $k$-anonymous by containment group with respect to $e$ exists, and it is $F \cup \{e\}$, where $F$ is the largest possible set as defined in Definition 4.

Definition 6 ($k$-anonymous by Containment Dataset).

A binary dataset $D(E, I)$ is $k$-anonymous by containment if every entity $e \in E$ satisfies $k$-anonymity by containment.

We extend the term $AC$ over a dataset as well; thus $AC(D)$ is the largest $k$ for which the dataset $D$ is $k$-anonymous by containment, i.e., the smallest $AC(e)$ over all entities $e \in E$.

Example: For the dataset in Table 1, $CS(e_1) = \{x_1, x_3, x_5\}$, where $e_1$ is the first row. Entity $e_1$ satisfies 4-anonymity by containment, because for each of the following three entities, $e_2$, $e_4$, and $e_5$, their containment sets are the same as or supersets of $CS(e_1)$. But the entity $e_6$ only satisfies 1-anonymity by containment, as besides itself no other entity contains $CS(e_6)$. The 4-anonymous by containment group of $e_1$ exists, and it is $\{e_1, e_2, e_4, e_5\}$, but the 5-anonymous by containment group for the same entity does not exist. The dataset in Table 1 is 1-anonymous by containment because there exists one entity, namely $e_6$, such that the highest $k$-value for which it satisfies $k$-anonymity by containment is 1; alternatively, $AC(D) = 1$.
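The quantities in this example follow directly from Definitions 3 to 6 and can be computed with a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, entities and features are identified by 0-based positions, and the column positions of the two projections are assumptions read off Table 2.

```python
# Feature values of Table 1 (one tuple per user, class label omitted).
TABLE1 = [
    (1, 0, 1, 0, 1),
    (1, 0, 1, 0, 1),
    (1, 0, 0, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 1, 0, 1),
    (1, 1, 0, 1, 1),
]

def containment_set(row):
    """Positions of the features whose value is 1 (Definition 3)."""
    return frozenset(j for j, v in enumerate(row) if v == 1)

def ac_entity(rows, i):
    """Largest k for which entity i satisfies k-anonymity by containment.

    Counts the entity itself plus every other entity whose containment
    set is the same as or a superset of the entity's containment set.
    """
    target = containment_set(rows[i])
    return sum(1 for row in rows if containment_set(row) >= target)

def ac_dataset(rows):
    """Smallest per-entity value, i.e. the anonymity level of the dataset."""
    return min(ac_entity(rows, i) for i in range(len(rows)))

def project(rows, columns):
    """Projection of the dataset on a list of column positions."""
    return [tuple(row[c] for c in columns) for row in rows]

per_entity = [ac_entity(TABLE1, i) for i in range(len(TABLE1))]
print(per_entity, ac_dataset(TABLE1))
print(ac_dataset(project(TABLE1, [0, 1, 4])), ac_dataset(project(TABLE1, [2, 3, 4])))
```

The per-entity values show that the first, second, and fourth rows each have three other rows containing them, while the last two rows are contained by no other row, which is why the full table is only 1-anonymous by containment.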

$k$-anonymity by containment ($k$-AC) is the privacy metric that we use in this work. The argument for this metric is that if a large number of other entities contain the same feature subset, or a superset of the feature subset, which an entity $e$ contains, the disclosure protection of the entity $e$ is strong, and vice versa. Thus a higher value of $k$ stands for a higher level of privacy for $e$. $k$-anonymity by containment ($k$-AC) is similar to $k$-anonymity for a binary feature set, except that for $k$-AC only the ‘1’ values of the feature set are considered as a privacy risk. It is easy to see that $k$-anonymity by containment ($k$-AC) is a relaxation of $k$-anonymity. In fact, the following lemma holds.

Lemma 1.

If a dataset $D(E, I)$ satisfies $k$-anonymity for a $k$ value, it also satisfies $k$-AC for the same $k$-value, but the reverse does not necessarily hold.


Proof: Say, the dataset $D(E, I)$ satisfies $k$-anonymity; then for any row entity $e \in E$, there exist at least $k - 1$ other row entities with a row vector identical to that of $e$. The containment set of all these entities is identical to $CS(e)$, so $e$ satisfies $k$-AC. Since this holds for all $e \in E$, the dataset $D$ satisfies $k$-AC.

To prove that the reverse does not hold, we will give a counter-example. Assume $D$ has three entities and two features, where two of the entities have the same feature vector and the containment set of the third entity is a proper subset of theirs (for instance, the rows 11, 11, and 10). $D$ satisfies 2-AC because the smallest anonymity by containment value of the entities in the dataset is 2. But the dataset does not satisfy 2-anonymity because the third entity is unique in the dataset. ∎

However, the relaxed privacy that $k$-AC provides is adequate for disclosure protection in high dimensional sparse microdata with binary attributes, because $k$-AC conceals the list of the attributes in the containment set of an entity, which could reveal sensitive information about the entity. For example, if the dataset is about the search keywords that a set of users have used over a given time, then for a person, having a 1 value under a keyword potentially reveals sensitive information about the behavior or preference of that person. Having a value of 0 for a collection of features merely reveals the knowledge that the entity is not associated with those attributes. In the online microdata domain, due to the high dimensionality of the data, non-association with a set of attributes is not a potential privacy risk. Also note that, in traditional datasets, only a few attributes which belong to the non-sensitive group are assumed to be quasi-identifiers, so a privacy metric like $k$-anonymity works well for such datasets. But for a high-dimensional dataset, $k$-anonymity is severely restrictive, and the utility loss of data by column suppression is substantial, because only feature subsets containing a very small number of features pass the $k$-anonymity criterion. On the other hand, the $k$-AC based privacy metric enables selection of a sufficient number of features for retaining the classification utility of the dataset. In short, $k$-AC retains the classification utility substantially, whereas $k$-anonymity fails to do so for most high dimensional data.

Feature selection [18] for a classification task is to select a subset of highly predictive variables so that classification accuracy possibly improves, which happens because contradictory or noisy attributes are generally ignored during the feature selection step. For a dataset $D(E, I)$ and a feature-set $S \subseteq I$, following relational algebra notations, we use $\pi_S(D)$ to denote the projection of database $D$ over the feature set $S$, and we write $D_S$ for this projected dataset. Now, given a user-defined integer number $k$, our goal is to perform an optimal feature selection on the dataset $D$ to obtain $D_S$, which satisfies two objectives: first, $D_S$ is $k$-anonymous by containment, i.e., $AC(D_S) \ge k$; second, $D_S$ maintains the predictive performance of the classification task as much as possible. Selecting a subset of features is similar to the task of column suppression based privacy protection, but the challenge in our task is that we want to suppress columns that are a risk to privacy, and at the same time we want to retain columns that have good predictive performance for a downstream supervised classification task using the sanitized dataset. For denoting the predictive performance of a dataset (or a projected dataset) we define a classification utility function $f$. The higher the value of $f$, the better the dataset for the classification. We consider $f$ to be a filter based feature selection criterion which is independent of the classification model that we use.

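The formal research task of this work is as below. Given a binary dataset $D(E, I)$ and an integer number $k$, find $S \subseteq I$ so that $f(D_S)$ is maximized under the constraint that $AC(D_S) \ge k$, where $D_S$ is the projection of $D$ over $S$. Mathematically,

$$\max_{S \subseteq I} \; f(D_S) \quad \text{subject to} \quad AC(D_S) \ge k \qquad (1)$$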

Because problem (1) is a combinatorial optimization problem (optimizing over the space of feature subsets) which is NP-Hard, here we propose two effective methods that find locally optimal solutions for this problem.
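For a dataset as small as Table 1, problem (1) can still be solved by exhaustive search, which is a useful reference point for the heuristics that follow. The sketch below is not one of the proposed algorithms; the function names are hypothetical, the search is exponential in the number of features, and the utility used in the example (the number of selected features) is only a placeholder.

```python
from itertools import combinations

# Feature values of Table 1 (one tuple per user, class label omitted).
TABLE1 = [
    (1, 0, 1, 0, 1),
    (1, 0, 1, 0, 1),
    (1, 0, 0, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 1, 0, 1),
    (1, 1, 0, 1, 1),
]

def ac_of_projection(rows, columns):
    """Anonymity-by-containment level of the dataset projected on `columns`."""
    sets = [frozenset(c for c in columns if row[c] == 1) for row in rows]
    return min(sum(1 for other in sets if other >= s) for s in sets)

def best_feature_subset(rows, k, utility):
    """Exhaustively search all non-empty feature subsets (tiny inputs only)."""
    d = len(rows[0])
    best, best_value = None, float("-inf")
    for size in range(1, d + 1):
        for columns in combinations(range(d), size):
            if ac_of_projection(rows, columns) < k:
                continue  # violates the privacy constraint
            value = utility(columns)
            if value > best_value:
                best, best_value = columns, value
    return best, best_value

# Placeholder utility: prefer larger feature sets.
best, best_value = best_feature_subset(TABLE1, k=2, utility=len)
print(best, best_value)
```

With $k = 2$ and this placeholder utility, the search keeps four of the five columns, since dropping only the second column already makes every containment set shared by at least two rows.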

4 Methods

In this section, we describe two algorithms, namely Maximal and Greedy, that we propose for the task of feature selection under privacy constraint. Maximal is a maximal itemset mining based feature selection method, and Greedy is a greedy method with privacy constraint based filtering. In the following subsections, we discuss them in detail.

4.1 Maximal Itemset based Approach

A key observation regarding $k$-anonymity by containment ($AC$) of a dataset is that this criterion satisfies the downward-closure property under feature selection. The following lemma holds:

Lemma 2.

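Say $D(E, I)$ is a binary dataset, and $X \subseteq I$ and $Y \subseteq I$ are two feature subsets with projected datasets $D_X$ and $D_Y$. If $X \subseteq Y$, then $AC(D_X) \ge AC(D_Y)$.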

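Proof: Let’s prove by contradiction. Suppose $X \subseteq Y$ and $AC(D_X) < AC(D_Y)$. Then, from the definition of $AC$, there exists at least one entity $e \in E$ for which $AC(e)$ in $D_X$ is smaller than $AC(e)$ in $D_Y$. Now, let’s assume $A_X$ and $A_Y$ are the sets of entities which make the anonymous by containment group for the entity $e$ in $D_X$ and $D_Y$, respectively. Since $|A_X| < |A_Y|$, there exists an entity $p \in A_Y \setminus A_X$ for which $CS_{D_Y}(p) \supseteq CS_{D_Y}(e)$ and $CS_{D_X}(p) \not\supseteq CS_{D_X}(e)$. But this is impossible: because $X \subseteq Y$, if $CS_{D_Y}(p) \supseteq CS_{D_Y}(e)$ holds, then $CS_{D_X}(p) \supseteq CS_{D_X}(e)$ must be true. Thus, the lemma is proved by contradiction. ∎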
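As a sanity check of the downward-closure property, the inequality of Lemma 2 can be tested on every pair of nested feature subsets of the toy data. The Python sketch below is illustrative only (hypothetical function names, 0-based column positions) and verifies the property on one small dataset; it is not a substitute for the proof.

```python
from itertools import combinations

# Feature values of Table 1 (one tuple per user, class label omitted).
TABLE1 = [
    (1, 0, 1, 0, 1),
    (1, 0, 1, 0, 1),
    (1, 0, 0, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 1, 0, 1),
    (1, 1, 0, 1, 1),
]

def ac_of_projection(rows, columns):
    """Anonymity-by-containment level of the dataset projected on `columns`."""
    sets = [frozenset(c for c in columns if row[c] == 1) for row in rows]
    return min(sum(1 for other in sets if other >= s) for s in sets)

def downward_closure_holds(rows):
    """Check AC(D_X) >= AC(D_Y) for every pair of subsets with X inside Y."""
    d = len(rows[0])
    subsets = [frozenset(c) for size in range(d + 1)
               for c in combinations(range(d), size)]
    level = {s: ac_of_projection(rows, s) for s in subsets}
    return all(level[x] >= level[y]
               for y in subsets for x in subsets if x <= y)

closure_ok = downward_closure_holds(TABLE1)
print(closure_ok)
```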

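Let’s call the collection of feature subsets which satisfy the $AC$ threshold for a given $k$ the feasible set, and represent it with $\mathcal{F}_k$. Thus, $\mathcal{F}_k = \{X \mid X \subseteq I \wedge AC(D_X) \ge k\}$. A subset of features $X \in \mathcal{F}_k$ is called maximal if it has no superset which is feasible. Let $\mathcal{M}_k$ be the set of all maximal subsets of features. Then $\mathcal{M}_k = \{X \mid X \in \mathcal{F}_k \wedge \nexists\, Y \supset X : Y \in \mathcal{F}_k\}$. As we can observe, given an integer $k$, if there exists a maximal feature set $Z$ that satisfies the $AC$ constraint, then any feature set $X \subseteq Z$ also satisfies the same constraint, i.e., $AC(D_X) \ge k$ if $X \subseteq Z$, based on Lemma 2.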

Example: For the dataset in Table 1, the -anonymous by containment feasible feature set 1 and . In this dataset, the feature-set because in , and the size of the -anonymous by containment group of is 1; thus . On the other hand for feature-set , the projected dataset has two -anonymous by containment groups, which are and ; since each group contains at least two entities,

Lemma 3.

Say, $D(E, I)$ is a binary dataset, and $T$ is its transaction representation, where each entity $e \in E$ is a transaction consisting of the containment set $CS_D(e)$. The frequent itemsets of the dataset $T$ with minimum support threshold $k$ are feasible feature sets for the optimization problem (1).

Proof: Say, $X$ is a frequent itemset in the transaction dataset $T$ for support threshold $k$. Then, the support-set of $X$ in $T$ consists of the transactions (or entities) which contain $X$. Since $X$ is frequent, the support-set of $X$ consists of at least $k$ entities. In the projected dataset $D_X$, all these entities make a $k$-anonymous by containment group, thus satisfying $k$-anonymity by containment. For each of the remaining entities (say, $e$), $e$’s containment set in $D_X$ is some subset of $X$ (say, $Y$). Since $X$ is a frequent itemset and $Y \subset X$, $Y$ is also frequent, with a support-set that has at least $k$ entities. Then $e$ also belongs to a $k$-anonymous by containment group. Thus, each of the entities in $D_X$ belongs to some $k$-anonymous by containment group(s), which yields $AC(D_X) \ge k$. Hence proved. ∎

A consequence of Lemma 2 is that for a given dataset $D$, an integer $k$, and a feature set $S$, if $AC(D_S) \ge k$, any subset of $S$ (say, $R$) satisfies $AC(D_R) \ge k$. This is identical to the downward closure property of frequent itemset mining. Also, Lemma 3 confirms that any itemset that is frequent in the transaction representation of $D$ for a minimum support threshold $k$ is a feasible solution for problem (1). Hence, an Apriori-like algorithm for itemset mining can be used for effectively enumerating feature subsets of $D$ which satisfy the required $k$-anonymity by containment constraint.
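The connection to itemset mining can be illustrated with a plain level-wise (Apriori-style) enumeration on the transaction representation of Table 1. This is a teaching sketch in Python with hypothetical function names; it is not the LCM-based implementation used by Maximal, and it makes no attempt at efficiency.

```python
from itertools import combinations

# Feature values of Table 1 (one tuple per user, class label omitted).
TABLE1 = [
    (1, 0, 1, 0, 1),
    (1, 0, 1, 0, 1),
    (1, 0, 0, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 1, 0, 1),
    (1, 1, 0, 1, 1),
]

def frequent_itemsets(transactions, min_support):
    """Level-wise enumeration of all itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]
    frequent = []
    while level:
        kept = [c for c in level
                if sum(1 for t in transactions if c <= t) >= min_support]
        frequent.extend(kept)
        kept_set = set(kept)
        size = len(kept[0]) + 1 if kept else 0
        # Join step: unions of two kept sets that are exactly one item larger.
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in kept_set
                        for s in combinations(c, size - 1))]
    return frequent

def maximal_itemsets(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < t for t in frequent)]

def ac_of_projection(rows, columns):
    """Anonymity-by-containment level of the dataset projected on `columns`."""
    sets = [frozenset(c for c in columns if row[c] == 1) for row in rows]
    return min(sum(1 for other in sets if other >= s) for s in sets)

# Transaction representation: each entity becomes its containment set.
transactions = [frozenset(j for j, v in enumerate(row) if v == 1) for row in TABLE1]
frequent = frequent_itemsets(transactions, min_support=2)
maximal = maximal_itemsets(frequent)
all_feasible = all(ac_of_projection(TABLE1, s) >= 2 for s in frequent)
print(sorted(sorted(s) for s in maximal), all_feasible)
```

On the toy data with a support threshold of 2, every frequent itemset passes the $AC \ge 2$ check, in line with Lemma 3, and three itemsets of size three are maximal.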

Maximal Feasible Feature Set Generation

For large datasets, the feasible feature set $\mathcal{F}_k$, which consists of the feasible solutions for the optimization problem (1), can be very large. One way to control its size is by choosing an appropriate $k$; if $k$ increases, $|\mathcal{F}_k|$ decreases, and vice-versa, but choosing a large $k$ negatively impacts the classification utility of the dataset, thus reducing the optimal value of problem (1). A brute force method for finding the optimal feature set is to enumerate all the feature subsets in $\mathcal{F}_k$ and find the one that is the best given the utility criterion $f$. However, this can be very slow. So, Maximal generates all possible maximal feature sets $\mathcal{M}_k$ instead of generating $\mathcal{F}_k$, and searches for the best feature subset within $\mathcal{M}_k$. The idea of enumerating $\mathcal{M}_k$ instead of $\mathcal{F}_k$ comes from the assumption that with more features the classification performance will increase; thus, the size of the feature set is its utility function value, i.e., $f(D_S) = |S|$, and in that case the largest set in $\mathcal{M}_k$ is the solution to problem (1).

An obvious advantage of working only with the maximal feature sets is that for many datasets, $|\mathcal{M}_k| \ll |\mathcal{F}_k|$; thus finding the solution within $\mathcal{M}_k$ instead of $\mathcal{F}_k$ leads to significant savings in computation time. Just like the case for frequent itemset mining, a maximal frequent itemset mining algorithm can be used for finding $\mathcal{M}_k$. Any off-the-shelf software can be used for this. In the Maximal algorithm we use the LCM-Miner package (see footnote 2) which, at present, is the fastest method for finding maximal frequent itemsets.

Classification Utility Function

The simple utility function $f(D_S) = |S|$ has a few limitations. First, ties are very commonplace, as there are many maximal feature sets that have the same size. Second, and more importantly, this function does not take into account the class labels of the instances, so it cannot find a feature set that maximizes the separation between the positive and negative instances. So, we consider another utility function, named HamDist, which is much less prone to ties. It also considers the class label for choosing features that provide good separation between the positive and negative classes.

Definition 7 (Hamming Distance).

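For a given binary database $D$ and a subset of features $S$, the Hamming distance between two entities $a, b \in D$ is defined as below:

$$d_H(a, b, S) = \sum_{j=1}^{|S|} \mathbb{1}\left[a_{s_j} \neq b_{s_j}\right] \tag{2}$$

where $\mathbb{1}[\cdot]$ is the indicator function, and $a_{s_j}$ and $b_{s_j}$ are the $j$-th feature value under $S$ for the entities $a$ and $b$, respectively.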
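Definition 7 can be illustrated with a minimal sketch, assuming each entity is stored as a list of 0/1 values and a feature subset is given as a list of column indices (the function name and the two toy rows are hypothetical).

```python
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int], features: Sequence[int]) -> int:
    """Count the features in `features` on which rows a and b differ."""
    return sum(1 for j in features if a[j] != b[j])


row_a = [1, 0, 1, 1]
row_b = [1, 1, 0, 1]
print(hamming_distance(row_a, row_b, [0, 1, 2, 3]))  # 2
print(hamming_distance(row_a, row_b, [0, 3]))        # 0
```

Restricting the feature subset to columns 0 and 3 makes the two rows indistinguishable, which is why the distance depends on the chosen features and not only on the rows.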

We can partition the entities in $D$ into two disjoint subsets, $P$ and $N$; entities in $P$ carry the positive class label, and entities in $N$ carry the negative class label.

Definition 8 (HamDist).

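Given a dataset $D = P \cup N$, where the partitions $P$ and $N$ are based on class labels, the classification utility function HamDist for a feature subset $S$ is the average Hamming distance between all pairs of entities $a$ and $b$ such that $a \in P$ and $b \in N$:

$$\mathrm{HamDist}(S) = \frac{1}{|P|\,|N|} \sum_{a \in P,\ b \in N} d_H(a, b, S) \tag{3}$$

where $d_H(a, b, S)$ is the Hamming distance of Definition 7.
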
Example: For the dataset in Table 1, for its projection on (see, Table 2), distance of from the negative entities are , and the same for the other positive entities, and also. So, . From the same table we can also see that .
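A small computation in the spirit of Definition 8 is sketched below. The data is a hypothetical toy example, not the dataset of Table 1, and the function name `ham_dist` is an assumption made for illustration.

```python
from itertools import product
from typing import List, Sequence


def ham_dist(pos: List[Sequence[int]], neg: List[Sequence[int]],
             features: Sequence[int]) -> float:
    """Average Hamming distance over all (positive, negative) row pairs."""
    total = sum(
        sum(1 for j in features if a[j] != b[j])
        for a, b in product(pos, neg)
    )
    return total / (len(pos) * len(neg))


pos = [[1, 1, 0], [1, 0, 0]]  # toy positive entities
neg = [[0, 0, 1], [0, 1, 1]]  # toy negative entities
print(ham_dist(pos, neg, [0, 1, 2]))  # 2.5
print(ham_dist(pos, neg, [0]))        # 1.0
```

The four cross-class pairs have distances 3, 2, 2, and 3 on the full feature set, giving the average 2.5; on feature 0 alone every pair differs, so the average is 1.0.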

As we can observe from Equations 2 and 3, the utility function HamDist reflects the discriminative power between classes given the feature set $S$. The larger the value of HamDist, the better the selected feature set distinguishes between the classes. Another separation metric similar to HamDist is DistCnt (Distinguish Count), which is defined below.

Definition 9 (DistCnt).

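For a dataset $D = P \cup N$ partitioned by class label and a feature subset $S$, DistCnt is the number of pairs $(a, b)$ with $a \in P$ and $b \in N$ that can be distinguished using at least one feature in $S$. Mathematically,

$$\mathrm{DistCnt}(S) = \left|\{(a, b) : a \in P,\ b \in N,\ \exists\, x \in S \text{ such that } a_x \neq b_x\}\right|$$

DistCnt can also be used instead of HamDist in the Maximal algorithm. Note that we could also use the CM criterion (see Definition 2) instead of HamDist; however, experimental results show that CM performs much poorer in terms of AUC. Besides, both the HamDist and DistCnt functions have some good properties (discussed in Section 4.2) which CM does not have.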
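The counting in Definition 9 can be sketched as follows, again on hypothetical toy data; the name `dist_cnt` is an assumption used only for this illustration.

```python
from itertools import product
from typing import List, Sequence


def dist_cnt(pos: List[Sequence[int]], neg: List[Sequence[int]],
             features: Sequence[int]) -> int:
    """Number of (positive, negative) pairs that differ on at least one feature."""
    return sum(
        1 for a, b in product(pos, neg)
        if any(a[j] != b[j] for j in features)
    )


pos = [[1, 1, 0], [1, 0, 0]]
neg = [[0, 0, 1], [1, 0, 1]]
print(dist_cnt(pos, neg, [0]))        # 2
print(dist_cnt(pos, neg, [0, 1]))     # 3
print(dist_cnt(pos, neg, [0, 1, 2]))  # 4
```

Unlike the Hamming-distance average, a pair that is already distinguished contributes nothing more when further distinguishing features are added, so the count saturates at the number of cross-class pairs (4 here).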

The Maximal algorithm uses a classification utility metric (HamDist or DistCnt) to select the best feature set among the maximal feature sets. For some datasets, the number of maximal feature sets can be large, and selecting the best one by applying the utility metric to each of them can be time-consuming. In that case, we can restrict the search to the largest-sized maximal feature sets. Another option is to consider the maximal feature sets in decreasing order of their size, so that at most a fixed number of them are chosen as candidates for which the utility metric is computed. In this work we use this second option, with the same candidate limit for all our experiments.

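0:  Input: dataset $D$, anonymity threshold $k$, number of candidate maximal feature sets
1:  Calculate the maximal feature sets, i.e., the maximal feature sets satisfying $k$-AC for the given $k$
2:  Select the best feature set based on the utility criterion by considering the largest-sized maximal feature sets
3:  return the selected feature set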
Algorithm 1 Maximal Itemset Mining Based Feature Selection Method

Maximal Itemset based Method (Pseudo-code)

The pseudo-code for Maximal is given in Algorithm 1. Maximal takes an integer $k$ and the number of maximal patterns to consider as input, and returns the final feature set, which satisfies $k$-anonymity by containment. Line 1 uses LCM-Miner to generate all the maximal feature sets that satisfy $k$-anonymity by containment for the given $k$ value. Line 2 groups the maximal feasible feature sets by size and selects the top maximal feature sets with the largest size to build the candidate feature sets. Then the algorithm computes the feature selection criterion for each candidate and returns the feature set with the maximum value of this criterion.
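The candidate selection step described above (Line 2) can be sketched as follows. The sketch assumes the maximal feature sets have already been mined; the names `select_best`, `r`, and `toy_utility` are hypothetical, and the toy utility merely stands in for HamDist or DistCnt.

```python
from typing import Callable, FrozenSet, Iterable

FeatureSet = FrozenSet[int]


def select_best(maximal: Iterable[FeatureSet],
                utility: Callable[[FeatureSet], float],
                r: int) -> FeatureSet:
    """Score only the r largest maximal feature sets and return the best one."""
    candidates = sorted(maximal, key=len, reverse=True)[:r]
    return max(candidates, key=utility)


# Toy per-feature weights standing in for a real utility function
weights = {0: 0.1, 1: 0.1, 2: 0.1, 3: 0.5, 4: 0.4, 5: 2.0}


def toy_utility(features: FeatureSet) -> float:
    return sum(weights[x] for x in features)


maximal = [frozenset({0, 1, 2}), frozenset({3, 4}), frozenset({5})]
print(sorted(select_best(maximal, toy_utility, r=2)))  # [3, 4]
print(sorted(select_best(maximal, toy_utility, r=3)))  # [5]
```

The two calls show the trade-off of limiting the candidates: with `r=2` the small but highly useful set `{5}` is never scored, which saves utility evaluations at the risk of missing the best set.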

The complexity of the above algorithm predominantly depends on the complexity of the maximal itemset mining step (Line 1), which depends on the input value $k$. For larger $k$, the privacy is stronger and the feasible feature set shrinks, making the algorithm run faster, but the classification utility of the dataset may suffer. On the other hand, for smaller $k$, the feasible feature set can be large, making the algorithm slower, but the classification utility of the dataset is better retained.

4.2 Greedy with Modular and Sub-Modular Objective Functions

A potential limitation of Maximal is that it can be slow for dense datasets. So, we propose a second method, called Greedy, which runs much faster as it greedily adds a new feature to an existing feasible feature set. As its greedy criterion, Greedy can use different separation functions that discriminate between positive and negative instances. In this work we use HamDist (see Definition 8) and DistCnt (see Definition 9). Thus Greedy solves problem (1) by replacing the utility function with either of these two functions. Because both functions are monotone, the objective function value of (1) never decreases as we add more features. The process stops once no more features can be added to the existing feature set while preserving the desired $k$-AC value of the projected dataset.

Submodularity, and Modularity

Definition 10 (Submodular Set Function).

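Given a finite ground set $U$, a monotone function $f$ that maps subsets of $U$ to a real number is called submodular if

$$f(S \cup \{u\}) - f(S) \geq f(T \cup \{u\}) - f(T), \qquad \forall\, S \subseteq T \subseteq U,\ u \in U \setminus T.$$

If the above condition is satisfied with equality, the function is called modular.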


HamDist is monotone, submodular, and modular.


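For a dataset $D$, let $S$ and $T$ be two arbitrary feature sets such that $S \subseteq T$, and let $D = P \cup N$, where the partition is based on class label. Consider a pair $(a, b)$ such that $a \in P$ and $b \in N$. Let $g(S)$ be the function that sums the Hamming distance over all such pairs for a given feature subset $S$. Thus, $g(S) = \sum_{a \in P,\, b \in N} d_H(a, b, S)$, where $d_H(a, b, S)$ is the Hamming distance between $a$ and $b$ as defined in Equation 2. Similarly we can define $g(T)$ for the feature subset $T$. By Equation 2, $d_H(a, b, S)$ is a summation over each of the features in $S$. Since $S \subseteq T$, $d_H(a, b, T)$ includes the sum values for the variables in $S$ and possibly the sum values of other variables, which are non-negative. Summing over all $|P|\,|N|$ pairs yields $g(S) \leq g(T)$. So, HamDist is monotone. Now, for a feature $u \notin T$,

$$g(S \cup \{u\}) - g(S) = \sum_{a \in P,\, b \in N} \mathbb{1}[a_u \neq b_u].$$

Similarly, $g(T \cup \{u\}) - g(T) = \sum_{a \in P,\, b \in N} \mathbb{1}[a_u \neq b_u]$. Then, we have $g(S \cup \{u\}) - g(S) = g(T \cup \{u\}) - g(T)$. Dividing both sides by $|P|\,|N|$ yields $\mathrm{HamDist}(S \cup \{u\}) - \mathrm{HamDist}(S) = \mathrm{HamDist}(T \cup \{u\}) - \mathrm{HamDist}(T)$. Hence proved with the equality. ∎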
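The modularity argument can be checked numerically on a toy example. The sketch below uses exact fractions to avoid rounding issues; the data and the function name `ham_dist` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def ham_dist(pos, neg, features):
    """Average cross-class Hamming distance, as an exact fraction."""
    total = sum(
        sum(1 for j in features if a[j] != b[j])
        for a, b in product(pos, neg)
    )
    return Fraction(total, len(pos) * len(neg))


pos = [[1, 1, 0], [1, 0, 0]]
neg = [[0, 0, 1], [0, 1, 1]]

# Modularity: the value of a set is the sum of the single-feature values
per_feature = [ham_dist(pos, neg, [j]) for j in range(3)]
whole = ham_dist(pos, neg, [0, 1, 2])
print(whole == sum(per_feature))  # True

# The marginal gain of feature 2 is the same for a set and its superset
gain_small = ham_dist(pos, neg, [0, 2]) - ham_dist(pos, neg, [0])
gain_large = ham_dist(pos, neg, [0, 1, 2]) - ham_dist(pos, neg, [0, 1])
print(gain_small == gain_large)  # True
```

Because each feature contributes a fixed amount regardless of which other features are present, the per-feature values can be computed once and reused, which is what makes sorting features up front possible.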


DistCnt is monotone and submodular.


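Given a dataset $D$, where $D$ is partitioned as $P \cup N$ based on class label, consider a bipartite graph in which the vertices of one partition (say, $V_1$) correspond to the features, and the vertices of the other partition (say, $V_2$) correspond to the distinct pairs of entities $(a, b)$ such that $a \in P$ and $b \in N$; thus, $|V_2| = |P|\,|N|$. If for a feature $x$ we have $a_x \neq b_x$, an edge exists between the vertex of $x$ and the vertex of the pair $(a, b)$. Say $S \subseteq T \subseteq V_1$ and $u \in V_1 \setminus T$. For a set of vertices, $\Gamma(\cdot)$ represents their neighbor-list. Since the size of the neighbor-list of a vertex-set is monotone and submodular, for $S \subseteq T$ we have $|\Gamma(S)| \leq |\Gamma(T)|$, and $|\Gamma(S \cup \{u\})| - |\Gamma(S)| \geq |\Gamma(T \cup \{u\})| - |\Gamma(T)|$. By construction, for a feature set $S$, $\Gamma(S)$ contains the entity pairs for which at least one feature value out of $S$ is different. Thus, the function $|\Gamma(\cdot)|$ is DistCnt, and it is submodular. ∎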
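The diminishing-returns property can be verified exhaustively on a small instance. The following sketch enumerates every nested pair of feature sets on hypothetical toy data and counts violations of the submodularity inequality; it also shows that the count is not modular.

```python
from itertools import combinations, product


def dist_cnt(pos, neg, features):
    """Number of cross-class pairs distinguished by at least one feature."""
    return sum(
        1 for a, b in product(pos, neg)
        if any(a[j] != b[j] for j in features)
    )


def subsets(items):
    """Yield every subset of `items` as a Python set."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)


pos = [[1, 1, 0], [1, 0, 0]]
neg = [[0, 0, 1], [1, 0, 1]]
all_features = {0, 1, 2}

violations = 0
for t in subsets(all_features):
    for s in subsets(t):
        for u in all_features - t:
            gain_s = dist_cnt(pos, neg, s | {u}) - dist_cnt(pos, neg, s)
            gain_t = dist_cnt(pos, neg, t | {u}) - dist_cnt(pos, neg, t)
            if gain_s < gain_t:
                violations += 1
print(violations)  # 0

# Not modular: single-feature counts overshoot the count of the union
print(dist_cnt(pos, neg, [0]) + dist_cnt(pos, neg, [1]), dist_cnt(pos, neg, [0, 1]))  # 4 3
```

A check on one instance is of course not a proof; it only illustrates the inequality that the bipartite-graph argument establishes in general.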

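0:  Input: dataset $D$, anonymity threshold $k$
1:  Sort the features in non-increasing order of their HamDist values, denoted as $F$
2:  $S = \emptyset$
3:  for each feature $x \in F$ do
4:     if $S \cup \{x\}$ satisfies $k$-AC then
5:        $S = S \cup \{x\}$
6:     else
7:        break
8:     end if
9:  end for
10:  return $S$
Algorithm 2 Greedy Algorithm for HamDist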

For a monotone submodular function $f$, let $S$ be a set of size $\ell$ obtained by selecting elements one at a time, each time choosing an element that provides the largest marginal increase in the function value. Let $S^*$ be a set that maximizes the value of $f$ over all $\ell$-element sets. Then $f(S) \geq (1 - \frac{1}{e}) f(S^*)$; in other words, $S$ provides a $(1 - \frac{1}{e})$-approximation. For a modular function, $f(S) = f(S^*)$ [10].

Greedy Method (Pseudo-code)

Using the above theorems we can design two greedy algorithms, one for the modular function HamDist, and the other for the submodular function DistCnt. The pseudo-codes of these algorithms are shown in Algorithm 2 and Algorithm 3. Both methods take a binary dataset $D$ and an integer value $k$ as input and generate the selected feature set $S$ as output. Initially $S = \emptyset$. For the modular function, the marginal gain of an added feature can be pre-computed, so Algorithm 2 first sorts the features in non-increasing order of their HamDist values, and greedily adds features until it encounters a feature whose addition violates the $k$-AC constraint. For the submodular function DistCnt, the marginal gain cannot be pre-computed, so Algorithm 3 selects each new feature by iterating over all the remaining features and finding the best one (Lines 5-11). The terminating condition of this method is identical to that of Algorithm 2. Since the number of features is finite, both methods always terminate with a valid $S$ which satisfies $k$-AC.
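The two greedy strategies described above can be sketched as follows. This is a sketch under stated assumptions, not the paper's exact pseudo-code: the anonymity check is abstracted as a caller-supplied predicate `feasible` (a toy size limit stands in for the $k$-AC test), the utility is a toy coverage function, and the submodular variant is assumed to stop at the first violation, mirroring the modular variant. All names are hypothetical.

```python
from typing import Callable, Dict, Set

Feasible = Callable[[Set[int]], bool]


def greedy_modular(gains: Dict[int, float], feasible: Feasible) -> Set[int]:
    """Add features by decreasing precomputed gain; stop at the first violation."""
    selected: Set[int] = set()
    # sorted() is stable, so features with equal gains keep dictionary order
    for x in sorted(gains, key=lambda f: gains[f], reverse=True):
        if not feasible(selected | {x}):
            break
        selected.add(x)
    return selected


def greedy_submodular(features: Set[int],
                      utility: Callable[[Set[int]], float],
                      feasible: Feasible) -> Set[int]:
    """Repeatedly add the feature with the largest marginal gain."""
    selected: Set[int] = set()
    while selected != features:
        base = utility(selected)
        # Marginal gains change after every addition, so recompute them
        best = max(sorted(features - selected),
                   key=lambda x: utility(selected | {x}) - base)
        if not feasible(selected | {best}):
            break
        selected.add(best)
    return selected


def at_most_two(features: Set[int]) -> bool:
    """Toy stand-in for the anonymity check (hypothetical)."""
    return len(features) <= 2


covers = {0: {1, 2, 3}, 1: {3, 4}, 2: {1, 2}}


def coverage(features: Set[int]) -> int:
    """Toy submodular utility: number of items covered by the chosen features."""
    return len(set().union(*(covers[x] for x in features)))


print(sorted(greedy_modular({0: 1.0, 1: 0.5, 2: 1.0}, at_most_two)))  # [0, 2]
print(sorted(greedy_submodular(set(covers), coverage, at_most_two)))  # [0, 1]
```

The modular variant evaluates gains once and sorts, while the submodular variant re-evaluates the remaining features in every round, which is the source of the difference in running time between the two.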

Compared to Maximal, both greedy methods are faster. With respect to number of features (), Algorithm 2 runs in time and Algorithm 3 runs in time. Also, using Theorem 4.2.1, Algorithm 2 returns the optimal size feature-set, and Algorithm 3 returns , for which the objective function value is optimal over all possible size- feature sets.

0:  ,
2:  repeat
5:     for  do
6:        Compute
7:        if  then
10:        end if
11:     end for
13:  until 
14:  return
Algorithm 3 Greedy Algorithm for DistCnt

Dataset      # Entities   # Features   # Pos    # Neg   Density
Adult Data   32561        19           24720    7841    27.9%
Entity       148          552          74       74      9.7%
Email        1099         24604        618      481     0.9%
Table 3: Statistics of Real-World Datasets

5 Experiments and Results

In order to evaluate our proposed methods we perform various experiments. Our main objective in these experiments is to validate how the performance of the proposed privacy preserving classification varies as we change the value of $k$, the user-defined privacy threshold. We also compare the performance of our proposed utility preserving anonymization methods with existing anonymization methods such as $k$-anonymity and differential privacy. It is important to note that we do not claim that our methods provide better utility under privacy protection identical to that of other methods; rather, we claim that our methods provide adequate privacy protection, suitable for high dimensional sparse microdata, with a much superior AUC value, which is the classification utility metric that we want to maximize in our problem setup. We use three real-world datasets for our experiments. All three datasets consist of entities that are labeled with 2 classes. The number of entities, the number of features, the distribution of the two classes (#positive and #negative), and the dataset density (fraction of non-zero cell values) are shown in Table 3.

5.1 Privacy Preserving Classification Tasks

Below, we discuss the datasets and the privacy preserving classification tasks that we solve using our proposed methods.

Entity Disambiguation (ED) [44]. The objective of this classification task is to identify whether the name reference at a row in the data matrix maps to multiple real-life persons or not. Such an exercise is quite common in homeland security for disambiguating multiple suspects from their digital footprints [45, 38]. Privacy of the people in such a dataset is important, as many innocent persons can also be listed as suspects. Given a set of keywords that are associated with a name reference, we build a binary data matrix for solving the ED task. We use Arnetminer (footnote 3) academic publication data. In this dataset, each row is a name reference of one or multiple researchers, and each column is a research keyword within the computer science research umbrella. A ‘1’ entry represents that the name reference in the corresponding row has used the keyword in her (or their) published works. In our dataset, there are 148 rows, labeled such that half of them are pure entities (the negative case) and the rest are multi-entities (the positive case). The dataset contains 552 attributes (keywords).

To solve the entity disambiguation problem we first perform topic modeling over the keywords and then compute the distribution of entity $e$'s keywords across different topics. Our hypothesis is that for a pure entity the topic distribution will be concentrated on a few related topics, but for an impure entity (which is composed of multiple real-life persons) the topic distribution will be spread over many unrelated topics. We use this idea to build a simple classifier which uses an entropy-based score for an entity $e$ as below:

$$\mathrm{score}(e) = -\sum_{t=1}^{T} P(e \mid t) \log P(e \mid t)$$

where $P(e \mid t)$ is the probability of $e$ belonging to topic $t$, and $T$ represents the pre-defined number of topics for topic modeling. Clearly, for a pure entity the entropy-based score is relatively smaller than that for a non-pure entity. We use this score as our predicted value and compute AUC (area under the ROC curve) to report the performance of the classifier.
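The intuition behind the entropy-based score can be sketched as follows. The sketch assumes the score is the Shannon entropy (natural logarithm) of an entity's topic distribution; the two distributions are hypothetical, and the topic modeling step that would produce them is not shown.

```python
import math
from typing import Sequence


def entropy_score(topic_probs: Sequence[float]) -> float:
    """Shannon entropy of a topic distribution; zero-probability topics add nothing."""
    return -sum(p * math.log(p) for p in topic_probs if p > 0)


pure_entity = [0.9, 0.05, 0.05, 0.0]      # concentrated on one topic
mixed_entity = [0.25, 0.25, 0.25, 0.25]   # spread evenly over four topics

print(entropy_score(pure_entity) < entropy_score(mixed_entity))  # True
```

A distribution concentrated on few topics yields a low score and a uniform one yields the maximum, $\log T$, so ranking entities by this score places likely multi-person name references first.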

Adult. The Adult dataset (footnote 4) is based on census data and has been widely used in earlier works on $k$-anonymization [19]. For our experiments, we use eight of the original attributes: age, work class, education, marital status, occupation, race, gender, and hours-per-week. The classification task is to determine whether a person earns over 50K a year or not. Among these attributes, only gender is originally binary; we convert the other attributes to binary for our purpose. For example, for the marital status attribute, we consider never-married (1) versus others (0). For the race attribute, we consider white (1) versus others (0). We cut the numerical attributes into categories and create a binary attribute for each category. For instance, we partition the age value into five non-overlapping intervals, and each of the five intervals becomes a binary attribute. Similarly, the education and hours-per-week attributes are divided into intervals. In this way, we have a total of 19 attributes for the Adult dataset. Privacy of the individuals in such a dataset is quite important, as many people consider their personal data, such as race, gender, and marital status, to be sensitive attributes that they are not willing to release to the public.
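The conversion of a numerical attribute into interval indicators can be sketched as below. The cut points are hypothetical and chosen only to produce five intervals; they are not the boundaries used in the experiments.

```python
from bisect import bisect_right
from typing import List, Sequence


def interval_indicator(value: float, cut_points: Sequence[float]) -> List[int]:
    """One binary attribute per interval; exactly one of them is set to 1."""
    index = bisect_right(cut_points, value)
    return [1 if i == index else 0 for i in range(len(cut_points) + 1)]


AGE_CUTS = [25, 35, 45, 55]  # hypothetical boundaries giving five intervals

print(interval_indicator(30, AGE_CUTS))  # [0, 1, 0, 0, 0]
print(interval_indicator(17, AGE_CUTS))  # [1, 0, 0, 0, 0]
print(interval_indicator(60, AGE_CUTS))  # [0, 0, 0, 0, 1]
```

With `bisect_right`, a value equal to a cut point falls into the interval that starts at that cut point, so the intervals are non-overlapping and cover the whole range.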

Email. The last dataset, namely the Email dataset (footnote 5), is a collection of 1099 personal email messages distributed in 10 different directories. Each directory contains both legitimate and spam messages. To protect privacy, each token, including words, numbers, and punctuation symbols, is encrypted as a unique number. The classification task is to distinguish spam email from non-spam email. We use this data to mimic the classification of microdata (such as Twitter or Facebook messages). Privacy is important in such a dataset, as keyword based features in a micro-message can potentially identify a person. In the dataset, each row is an email message, and each column denotes a token. A ‘1’ in a cell represents that the email message in that row contains the token.

5.2 Experimental Setting

For our experiments, we vary the $k$ value of the proposed $k$-anonymity by containment ($k$-AC) metric and run Maximal and the different variants of Greedy independently to build projected classification datasets whose $k$-AC value is at least $k$. We use the names HamDist and DistCnt for the two variants of Greedy (Algorithms 2 and 3), which optimize the Hamming distance and Distinguish count greedy criteria, respectively. As we mentioned earlier, the $k$-anonymity based method imposes a strong restriction which severely affects the utility of the dataset. To demonstrate that, instead of using $k$-AC, we utilize $k$-anonymity as the privacy criterion for the different variants of Greedy. We call these competing methods $k$-anonymity HamDist and $k$-anonymity DistCnt. It is important to note that, in our experiments under the same $k$ setting, the $k$-anonymity based competing methods may not provide the same level of privacy. For instance, for the same $k$ value, the privacy protection of our proposed HamDist method is not the same as that of $k$-anonymity HamDist, simply because $k$-AC is a relaxation of $k$-anonymity.

We also compare the performance of our proposed methods against four other methods. We call these competing methods RF [7], CM Greedy [19], Laplace-DP, and Exponential-DP. We discuss these methods below.

RF is a randomized flipping based $k$-anonymization technique presented in [7], which randomly flips feature values such that each instance in the dataset satisfies the $k$-anonymity privacy constraint. RF uses clustering such that, after the random flipping operation, each cluster has at least $k$ entities with the same feature values with respect to the entire feature set.

CM Greedy represents another greedy based method, which uses the Classification Metric (CM) utility criterion proposed in [19] as its utility metric (see Definition 2). The original method assigns a generalization penalty over the rows of the dataset and uses a genetic algorithm for the classification task, but for a fair comparison we use the CM criterion in the Greedy algorithm and, with the selected features, use an identical setup for classification.

Laplace-DP [21] is a method that uses feature selection for $\epsilon$-differentially private data publishing. The authors of [21] utilize the Laplace mechanism [13] for the $\epsilon$-differential privacy guarantee. To compare with their method, we first compute the utility of each feature as its true output using the HamDist function of Definition 8. Then we add independently generated noise, drawn from a Laplace distribution whose scale depends on the sensitivity of the HamDist function and on the privacy budget, to each of the outputs; the noisy output of a feature is its true output plus this noise. After that we select the top features by taking those with the largest noisy outputs. On the reduced dataset, we apply a private data release method which provides an $\epsilon$-differential privacy guarantee. The general philosophy of this method is to first derive a frequency matrix of the reduced dataset over the feature domain and add Laplace noise to each count (known as a marginal) to satisfy $\epsilon$-differential privacy. Then the method adds additional data instances to match the above counts. Such an approach is discussed in [12] as a private data release mechanism.
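The noisy feature selection step can be sketched as follows. This is a minimal illustration that assumes the standard Laplace mechanism with noise scale sensitivity/$\epsilon$; the function names, the toy utilities, and the sensitivity value are hypothetical, and the splitting of the budget across features is not modeled here.

```python
import random
from typing import Dict, List


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise using only the standard library."""
    # The difference of two independent Exp(1) draws follows Laplace(0, 1)
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_top_features(utilities: Dict[int, float], m: int, epsilon: float,
                       sensitivity: float, rng: random.Random) -> List[int]:
    """Perturb every utility score, then keep the m largest noisy scores."""
    scale = sensitivity / epsilon
    noisy = {x: u + laplace_noise(scale, rng) for x, u in utilities.items()}
    return sorted(noisy, key=lambda x: noisy[x], reverse=True)[:m]


utilities = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.7}  # toy per-feature utility scores
rng = random.Random(7)

# With a very large budget the noise is negligible, so the true top-2 is returned
print(noisy_top_features(utilities, m=2, epsilon=1e9, sensitivity=1.0, rng=rng))  # [0, 3]

# With a small budget the ranking is dominated by noise and varies with the seed
print(noisy_top_features(utilities, m=2, epsilon=0.1, sensitivity=1.0, rng=rng))
```

The second call illustrates why a small budget hurts utility: when the noise scale (10 here) is much larger than the gaps between the true scores, the selected features are close to random.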

Exponential-DP is another $\epsilon$-differential privacy aware feature selection method. In contrast to the work presented in [21], we use the exponential mechanism [30] to select features under $\epsilon$-differential privacy. In particular, we choose each feature with probability proportional to an exponential function of its HamDist utility score. That is, a feature with a higher utility score in terms of the HamDist function is exponentially more likely to be chosen. The private data release stage of Exponential-DP is the same as the one in Laplace-DP. Note that, for both Laplace-DP and Exponential-DP, prior feature selection is essential to reduce the data dimensionality; otherwise the number of marginals is intractable ($2^d$ for a binary dataset with $d$ features), and adding instances to match the count for each such marginal is practically impossible.
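A single draw of the exponential mechanism can be sketched as below. The sketch assumes the commonly used weighting $\exp(\epsilon \cdot u / (2\Delta))$ for utility $u$ and sensitivity $\Delta$; the function name and the toy utilities are hypothetical.

```python
import math
import random
from typing import Dict


def exponential_mechanism_pick(utilities: Dict[int, float], epsilon: float,
                               sensitivity: float, rng: random.Random) -> int:
    """Sample one feature with probability proportional to exp(eps * u / (2 * sens))."""
    features = list(utilities)
    best = max(utilities.values())
    # Subtracting the maximum avoids overflow and leaves the probabilities unchanged
    weights = [
        math.exp(epsilon * (utilities[x] - best) / (2 * sensitivity))
        for x in features
    ]
    return rng.choices(features, weights=weights, k=1)[0]


utilities = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.7}  # toy per-feature utility scores
rng = random.Random(11)

# A very large budget makes the best feature overwhelmingly likely
print(exponential_mechanism_pick(utilities, epsilon=1e4, sensitivity=1.0, rng=rng))  # 0

# A small budget makes the draw close to uniform over the features
print(exponential_mechanism_pick(utilities, epsilon=0.1, sensitivity=1.0, rng=rng))
```

To select several features, such a draw would be repeated on the remaining features, with the selection budget divided among the draws.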

For all the algorithms and all the datasets (except ED) we use LibSVM to perform SVM classification using L2 loss with 5-fold cross validation. The only parameter for LibSVM is the regularization-loss trade-off, which we tune using a small validation set. For each of the algorithms, we report the AUC and the selected feature count (SFC). The RF method selects all the features, so for this method we report the percentage of cell values for which the bit is flipped. We use different $k$-anonymity by containment ($k$-AC) values in our experiments. For practical $k$-anonymization, a $k$ value between 5 and 10 is suggested in earlier work [41]; we use three different $k$ values, which are 5, 8, and 11. For a fair comparison, for both Laplace-DP and Exponential-DP we use the same number of features as is obtained for Greedy under $k = 5$. Since $k$-anonymity and differential privacy use totally different parameter setting mechanisms (one based on $k$, and the other based on $\epsilon$), it is not easy to determine which value of $\epsilon$ in DP makes a fair comparison with a $k$ value of 5 in $k$-AC. So, for both Laplace-DP and Exponential-DP, we show the differential privacy results for several different $\epsilon$ values. Note that the original work [12] suggests a value of 1.0 for $\epsilon$. When using the DP based methods, we allot half of the privacy budget to the feature selection step and the remaining half to adding noise to the marginals in the private data release step. Moreover, in the feature selection procedure, we further divide the budget equally among the selected features.
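The budget allocation described above amounts to simple arithmetic, sketched below. The function name and the example of 10 selected features are hypothetical; only the half-and-half split and the equal division among features come from the text.

```python
from typing import Tuple


def split_budget(epsilon: float, num_selected: int) -> Tuple[float, float]:
    """Return (budget per selected feature, budget for the data release step)."""
    selection_budget = epsilon / 2   # half for feature selection
    release_budget = epsilon / 2     # half for noisy marginals
    per_feature_budget = selection_budget / num_selected
    return per_feature_budget, release_budget


per_feature, release = split_budget(epsilon=1.0, num_selected=10)
print(per_feature, release)  # 0.05 0.5
```

The per-feature share shrinks linearly with the number of selected features, which increases the noise added to each selection decision.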

RF, Laplace-DP, and Exponential-DP are randomized methods, so for each dataset we run each of them multiple times and report the average AUC and standard deviation. For each result table in the following sections, we also highlight the best results in terms of AUC among all methods under the same $k$ setting. We run all the experiments on a 2.1 GHz machine with 4GB memory running the Linux operating system.

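Method                 AUC (Selected Feature Count)
                       k=5                  k=8                  k=11
----------------------------------------------------------------------------------
Maximal                0.82 (61)            0.81 (43)            0.79 (32)
HamDist                0.88 (27)            0.88 (24)            0.81 (16)
DistCnt                0.81 (11)            0.81 (11)            0.80 (10)
----------------------------------------------------------------------------------
CM Greedy [19]         0.68 (2)             0.68 (2)             0.68 (2)
RF [7]                 0.75±0.02 (11.99%)   0.73±0.03 (14.03%)   0.72±0.02 (16.49%)
k-anonymity HamDist    0.55 (3)             0.55 (2)             0.55 (2)
k-anonymity DistCnt    0.79 (3)             0.79 (3)             0.77 (2)
----------------------------------------------------------------------------------
Full-Feature-Set       0.87 (552)
Table 4: AUC Comparison among different privacy methods for the name entity disambiguation task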

5.3 Name Entity Disambiguation

In Table 4 we report the AUC value of the anonymized name entity disambiguation task using various privacy methods (in rows) for different $k$ values (in columns). For better comparison, our proposed methods, the competing methods, and the non-private method are grouped by horizontal lines: our proposed methods are in the top group, the competing methods are in the middle group, and the non-private method is in the bottom group. For the differential privacy comparison, we show the AUC results in Figures 1(a) and 1(b). For each method, we also report the count of selected features (SFC). Since the RF method uses the full set of features, for this method the value in parentheses is the percentage of cell values that have been flipped. We also report the AUC performance using the full feature set (last row). As the non-private method in the bottom group has no privacy restriction, its result is independent of $k$.

For most of the methods, increasing $k$ decreases the number of selected features, which translates to poorer classification performance; this validates the privacy-utility trade-off. However, for a given $k$, our proposed methods perform better than the competing methods in terms of AUC for all the $k$ values. For instance, for $k$=5, the AUC results of RF and CM Greedy are only 0.75 and 0.68, respectively, whereas the different versions of the proposed Greedy obtain AUC values between 0.81 and 0.88. Among the competing methods, both Laplace-DP and Exponential-DP perform the worst, as shown in the first group of bars in Figures 1(a) and 1(b), and $k$-anonymity DistCnt performs the best (0.79 for $k$=5); yet all competing methods perform much poorer than our proposed methods. A reason for this may be that most of the competing methods are too restrictive: they are able to select only 2 to 3 features for the various $k$ values. In comparison, our proposed methods are able to select between 10 and 61 features, which helps them retain classification utility. The bad performance of the differential privacy based methods is due to the fact that, in such a setting, the added noise is too large in both the feature selection and the private data release steps. In general, the smaller the $\epsilon$, the stronger the privacy guarantee that differential privacy provides. However, stronger privacy protection in terms of $\epsilon$ always leads to worse data utility in terms of AUC, as shown in Figures 1(a) and 1(b). Therefore, even though differential privacy provides a stronger privacy guarantee, the utility of the data for a supervised classification task is significantly destroyed. For this dataset, we observe that the performance of RF is largely dependent on the percentage of flipped cell values; if this percentage is large, the performance is poor. As $k$ increases, with a stronger privacy requirement, the percentage of flips increases, and the AUC drops.

For a sparse dataset like the one that we use for entity disambiguation, feature selection helps classification performance. In this dataset, using the full set of features (no privacy), we obtain only 0.87 AUC, whereas using less than 10% of the features we can achieve comparable or better AUC with our proposed methods (when $k$=5). Even for $k$=11, our methods retain a substantial part of the classification utility of the dataset and obtain an AUC value of 0.81 (see second row). Also note that under $k$=5 and $k$=8, our HamDist performs better than using the full feature set, which demonstrates that our proposed privacy-aware feature selection methods not only have competitive AUC performance, but provide strong privacy guarantees as well.

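Method                 AUC (Selected Feature Count)
                       k=5                  k=8                  k=11
----------------------------------------------------------------------------------
Maximal                0.74 (8)             0.74 (8)             0.75 (7)
HamDist                0.77 (9)             0.77 (9)             0.76 (8)
DistCnt                0.78 (10)            0.78 (10)            0.76 (8)
----------------------------------------------------------------------------------
CM Greedy [19]         0.71 (5)             0.71 (5)             0.71 (5)
RF [7]                 0.80±0.02 (0.60%)    0.80±0.03 (1.00%)    0.80±0.02 (1.44%)
k-anonymity HamDist    0.72 (8)             0.72 (8)             0.72 (8)
k-anonymity DistCnt    0.73 (8)             0.70 (6)             0.70 (6)
----------------------------------------------------------------------------------
Full-Feature-Set       0.82 (19)
Table 5: Comparison among different privacy methods for Adult dataset using AUC
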
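Method                 AUC (Selected Feature Count)
                       k=5                  k=8                  k=11
----------------------------------------------------------------------------------
Maximal                0.94 (121)           0.92 (66)            0.90 (58)
HamDist                0.91 (11)            0.91 (11)            0.91 (11)
DistCnt                0.95 (11)            0.93 (7)             0.93 (7)
----------------------------------------------------------------------------------
CM Greedy [19]         0.86 (3)             0.86 (3)             0.86 (3)
RF [7]                 0.87±0.02 (1.30%)    0.86±0.01 (1.73%)    0.87±0.03 (2.03%)
k-anonymity HamDist    0.84 (4)             0.84 (4)             0.84 (4)
k-anonymity DistCnt    0.81 (4)             0.81 (4)             0.81 (4)
----------------------------------------------------------------------------------
Full-Feature-Set       0.95 (24604)
Table 6: Comparison among different privacy methods for Email dataset using AUC
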
(a) Laplace Mechanism based Differential Privacy
(b) Exponential Mechanism based Differential Privacy
Figure 1: Classification performance of differential privacy based methods for different $\epsilon$ values on three datasets. The Laplace mechanism is on the left and the exponential mechanism is on the right. Each group of bars belongs to one specific dataset, and within a group different bars represent different $\epsilon$ values.

5.4 Adult Data

The performance of various methods for the Adult dataset is shown in Table 5, where the rows and columns are organized identically to the previous table. The Adult dataset is low-dimensional and dense (27.9% of values are non-zero). Achieving privacy on such a dataset is comparatively easier, so existing anonymization methods work fairly well on it. As we can observe, RF performs the best among all the methods. The good performance of RF is owing to the very small percentage of flips, which ranges from 0.60% to 1.44% for the various $k$ values. Basically, RF can achieve $k$-anonymity on this dataset with a very small number of flips, which helps it maintain the classification utility of the dataset. For the same reason, the $k$-anonymity HamDist and $k$-anonymity DistCnt methods are also able to retain many dimensions (8 out of 19 for $k$=5) of this dataset, and perform fairly well. On the other hand, the different versions of Greedy and Maximal retain between 7 and 10 dimensions and achieve an AUC between 0.74 and 0.78, which is close to 0.80 (the AUC value for RF). Also note that the AUC using the full set of features (no privacy) is 0.82, so the utility loss due to privacy is not substantial for this dataset. As a remark, our method is particularly suitable for high dimensional sparse data for which anonymization using traditional methods is difficult.

5.5 Spam Email Filtering

In Table 6, we compare the AUC values of different methods for the spam email filtering task. This is very high dimensional data with 24604 features. As we can observe, our proposed methods, especially HamDist and DistCnt, perform better than the competing methods. For example, for $k$=5, the classification AUC of RF is 0.87 with a flip rate of 1.30%, but using only 11 of the 24604 features DistCnt obtains an AUC value of 0.95, which is equal to the AUC value using the full feature set. Again, the $k$-anonymity based methods show worse performance, as they select fewer features due to the stronger restriction of this privacy metric. For instance, for $k$=5, DistCnt selects 11 features, but $k$-anonymity DistCnt selects only 4 features. Due to this, classification results under the $k$-anonymity constraint are worse than those using our proposed $k$-AC as the privacy metric. As shown in Figures 1(a) and 1(b), both Laplace-DP and Exponential-DP with various privacy budget setups perform much worse than all the competing methods in Table 6, which demonstrates that the significant amount of noise added during the sanitization process deteriorates the data utility and leads to bad classification performance. Among our methods, Maximal is among the best, as it consistently holds the classification performance across the different $k$ settings.

6 Related Work

We discuss the related work under the following two categories.

6.1 Privacy-Preserving Data Mining

In terms of privacy model, several privacy metrics have been widely used to quantify the privacy risk of published data instances, such as $k$-anonymity [41], $t$-closeness [27], $\ell$-diversity [28], and differential privacy [14]. Existing works on privacy preserving data mining solve a specific data mining problem given a privacy constraint over the data instances, such as classification [43], regression [16], clustering [42], and frequent pattern mining [15]. However, the solutions proposed in these works are strongly influenced by the specific data mining task and also by the specific privacy model. In fact, the majority of the above works consider distributed privacy, where the dataset is partitioned among multiple participants owning different portions of the data, and the goal is to mine shared insights over the global data without compromising the privacy of the local portions. A few other works [6, 5] consider output privacy by ensuring that the output of a data mining task does not reveal sensitive information.

The $k$-anonymity privacy metric, due to its simplicity and flexibility, has been studied extensively over the years. The authors of [2] present $k$-anonymous patterns for the application of association rule mining. Samarati [39] proposes formal methods of $k$-anonymity using suppression and generalization techniques, and also introduces the concept of minimal generalization. Meyerson et al. [31] prove that two definitions of $k$-optimality are NP-hard: first, to find the minimum number of cell values that need to be suppressed; second, to find the minimum number of attributes that need to be suppressed. Since then, a large number of works have explored the approximation of anonymization [31, 4]. However, none of these works consider the utility of the dataset along with the privacy requirements. Kifer et al. [22] propose methods that inject utility in the form of data distribution information into $k$-anonymous and $\ell$-diverse tables. However, the above work does not consider a classification dataset. Iyengar [19] proposes a utility metric called CM which is explicitly designed for a classification dataset. It assigns a generalization penalty over the rows of the dataset, but its performance is poor, as we have shown in this work.

In recent years differential privacy [12, 34, 24] has attracted much attention in the privacy research literature. The authors of [9] propose a sampling based method for releasing high dimensional data with differential privacy guarantees. [37] proposes a method of publishing differentially private low order marginals for high dimensional data. Even though the authors of [9, 37] claim that they deal with high dimensional data, the dimensionality of the data in their experiments is limited. [40] makes use of $k$-anonymity to enhance data utility in differential privacy. An interesting observation of this work is that a differential privacy based method, by itself, is not a good privacy mechanism with regard to maintaining data utility. [8] proposes a probabilistic top-down partitioning algorithm to publish set-valued data via differential privacy. The authors of [32] propose to utilize the exponential mechanism to release a decision tree based classifier that satisfies $\epsilon$-differential privacy. However, in their work privacy is embedded in the data mining process; hence it is not suitable as a data release mechanism, and, more importantly, it can only be used along with the specific classification model within which the privacy mechanism is built.

6.2 Privacy-Aware Feature Selection

Empirical studies of the use of feature selection in Privacy Preserving Data Publishing have been presented in [20, 21]. However, these works use feature selection as an add-on tool prior to data anonymization and do not consider privacy during the feature selection process. In our work, we consider privacy-aware feature selection with the twin objectives of privacy preservation and utility maintenance. To the best of our knowledge, the works most similar to ours in using feature selection for Privacy Preserving Data Publishing are the recent works [35, 29]. [35] considers privacy as a cost metric in a dynamic feature selection process and proposes a greedy iterative approach for solving the task, where the data releaser requests information about one feature at a time until a predefined privacy budget is exhausted. However, the entropy based privacy metric presented in that work is strongly influenced by the specific classifier. [29] presents a genetic approach for achieving $k$-anonymity by partitioning the original dataset into several projections such that each one of them adheres to $k$-anonymity. But the proposed method does not provide an optimality guarantee.

7 Conclusion and Future Work

In this paper, we propose a novel method for entity anonymization using feature selection. We define a new anonymity metric, called k-anonymity by containment, which is particularly suitable for high-dimensional microdata. We also propose two feature selection methods along with two classification utility metrics. These metrics satisfy submodular properties and thus enable effective greedy algorithms. In the experiment section, we show that both proposed methods select good-quality features on a variety of datasets, retaining the classification utility while satisfying the user-defined anonymity constraint.
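The interplay between a utility function and the anonymity constraint in a greedy selection loop can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: each entity is a set of its 1-valued features, a feature set is treated as anonymous when every projected entity is contained in the projections of at least k entities (itself included), and the utility is a simple stand-in that counts opposite-class pairs separated by the selected features rather than either of the two proposed metrics. All function names are hypothetical.

```python
from itertools import combinations


def is_k_anonymous_by_containment(rows, features, k):
    """Assumed reading: every projected row is contained in >= k projected rows."""
    projected = [row & features for row in rows]
    for p in projected:
        # Count rows (including the row itself) whose projection contains p.
        support = sum(1 for q in projected if p <= q)
        if support < k:
            return False
    return True


def separated_pairs(rows, labels, features):
    """Stand-in utility: opposite-class pairs that differ on a selected feature."""
    count = 0
    for i, j in combinations(range(len(rows)), 2):
        if labels[i] != labels[j] and (rows[i] ^ rows[j]) & features:
            count += 1
    return count


def greedy_select(rows, labels, all_features, k):
    """Add the feature with the largest positive gain that keeps the constraint."""
    selected = set()
    remaining = set(all_features)
    while remaining:
        base = separated_pairs(rows, labels, selected)
        best, best_gain = None, 0
        for f in sorted(remaining):  # sorted only to make ties deterministic
            candidate = selected | {f}
            if not is_k_anonymous_by_containment(rows, candidate, k):
                continue
            gain = separated_pairs(rows, labels, candidate) - base
            if gain > best_gain:
                best, best_gain = f, gain
        if best is None:  # no feasible feature improves the utility
            break
        selected.add(best)
        remaining.discard(best)
    return selected


rows = [{"a", "b"}, {"a", "b"}, {"b"}, {"b", "c"}]
labels = [1, 1, 0, 0]
chosen = greedy_select(rows, labels, {"a", "b", "c"}, k=2)
```

In this toy dataset, feature c appears in a single entity, so any feature set containing it violates the constraint for k = 2, while feature b takes the same value in every entity and so separates no opposite-class pair.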

In this work, we consider binary features. We also show experimental results on categorical features by converting them to binary features, so the work can easily be extended to datasets with categorical features. An immediate direction for future work is to extend this work to datasets with real-valued features. Another future direction is to consider absent attributes in the privacy model. In the real world, for some binary datasets, absent attributes can cause privacy violations; for instance, they can be used for negative association rule mining.
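The conversion of categorical features to binary ones mentioned above can be sketched as follows. This is only an illustration under the assumption that each (column, value) pair becomes one binary feature; the helper name `binarize`, the column names, and the values are hypothetical and not taken from the experiments.

```python
def binarize(records, columns):
    """Turn categorical records into sets of 'column=value' binary features."""
    rows = []
    for record in records:
        # A feature is present (value 1) only for the category the record takes.
        row = {f"{col}={record[col]}" for col in columns
               if record.get(col) is not None}
        rows.append(row)
    return rows


records = [
    {"workclass": "Private", "sex": "Male"},
    {"workclass": "State-gov", "sex": "Female"},
]
rows = binarize(records, ["workclass", "sex"])
```

Each resulting row is the set of binary features that take value 1 for that entity, which is the representation a binary-feature method operates on.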


We sincerely thank the reviewers for their insightful comments. This research is sponsored by both Mohammad Al Hasan’s NSF CAREER Award (IIS-1149851) and Noman Mohammed’s NSERC Discovery Grants (RGPIN-2015-04147). The contents are solely the responsibility of the authors and do not necessarily represent the official views of NSF and NSERC.


  1. To enhance the readability, we write the feature set as string; for example, the set is written as .
  2. http://research.nii.ac.jp/~uno/code/lcm.html
  3. http://arnetminer.org
  4. https://archive.ics.uci.edu/ml/datasets/Adult
  5. http://www.csmining.org/index.php/pu1-and-pu123a-datasets.html


  1. C. C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB’05, pages 901–909, 2005.
  2. M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi. K-anonymous patterns. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD’05, pages 10–21, 2005.
  3. B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’07, pages 273–282, 2007.
  4. R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proceedings of the 21st International Conference on Data Engineering, ICDE ’05, pages 217–228, 2005.
  5. R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta. Discovering frequent patterns in sensitive data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, pages 503–512, 2010.
  6. L. Bonomi and L. Xiong. Mining frequent patterns with differential privacy. Proceedings of the Very Large Data Bases Endowment, 6(12):1422–1427, Aug. 2013.
  7. J.-W. Byun, A. Kamra, E. Bertino, and N. Li. Efficient k-anonymization using clustering techniques. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications, DASFAA’07, pages 188–200, 2007.
  8. R. Chen, B. C. Desai, N. Mohammed, L. Xiong, and B. C. M. Fung. Publishing set-valued data via differential privacy. In Proceedings of the Very Large Data Bases Endowment, volume 4, pages 1087–1098, 2011.
  9. R. Chen, Q. Xiao, Y. Zhang, and J. Xu. Differentially private high-dimensional data publication via sampling-based inference. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 129–138, 2015.
  10. M. Conforti and G. Cornuéjols. Submodular set functions, matroids and the greedy algorithm: Tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251–274, 1984.
  11. F. K. Dankar and K. E. Emam. Practicing differential privacy in health care: A review. Transactions on Data Privacy, 6:35–67, Apr. 2013.
  12. C. Dwork. Differential privacy: A survey of results. In Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, TAMC’08, pages 1–19, 2008.
  13. C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC’06, pages 265–284, 2006.
  14. C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9:211–407, Aug. 2014.
  15. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 217–228, 2002.
  16. S. E. Fienberg and J. Jin. Privacy-preserving data sharing in high dimensional regression and classification settings. Journal of Privacy and Confidentiality, 4(1):10, 2012.
  17. S. E. Fienberg, A. Rinaldo, and X. Yang. Differential privacy and the risk-utility tradeoff for multi-dimensional contingency tables. In Privacy in Statistical Databases: UNESCO Chair in Data Privacy, International Conference, PSD, pages 187–199, 2010.
  18. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, Mar. 2003.
  19. V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pages 279–288, 2002.
  20. Y. Jafer, S. Matwin, and M. Sokolova. Task oriented privacy preserving data publishing using feature selection. In Proceedings of the 27th Canadian Conference on Advances in Artificial Intelligence, pages 143–154, 2014.
  21. Y. Jafer, S. Matwin, and M. Sokolova. Using feature selection to improve the utility of differentially private data publishing. Procedia Computer Science, 37:511–516, 2014.
  22. D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD ’06, pages 217–228, 2006.
  23. D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, pages 193–204, 2011.
  24. J. Lee and C. Clifton. Differential identifiability. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 1041–1049, 2012.
  25. D. Li and M. Becchi. Deploying graph algorithms on GPUs: An adaptive solution. In 2013 IEEE 27th International Symposium on Parallel & Distributed Processing (IPDPS), pages 1013–1024, 2013.
  26. G. Li. Mining local and global patterns for complex data classification. PhD thesis, Rensselaer Polytechnic Institute, 2013.
  27. N. Li and T. Li. t-closeness: Privacy beyond k-anonymity and l-diversity. In IEEE 23rd International Conference on Data Engineering (ICDE), pages 106–115, 2007.
  28. A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. L-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), Mar. 2007.
  29. N. Matatov, L. Rokach, and O. Maimon. Privacy-preserving data mining: A feature set partitioning approach. Information Sciences, 180(14):2696–2720, 2010.
  30. F. McSherry and K. Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’07, pages 94–103, Providence, RI, 2007.
  31. A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’04, pages 223–228, 2004.
  32. N. Mohammed, R. Chen, B. C. Fung, and P. S. Yu. Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, pages 493–501, 2011.
  33. A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, SP ’08, pages 111–125, 2008.
  34. H. Nguyen, A. Imine, and M. Rusinowitch. Network Structure Release under Differential Privacy. Transactions on Data Privacy, 9(3):215–241, 2016.
  35. E. Pattuk, M. Kantarcioglu, H. Ulusoy, and B. Malin. Privacy-aware dynamic feature selection. In 2015 IEEE 31st International Conference on Data Engineering (ICDE), pages 78–88, 2015.
  36. F. Prasser, R. Bild, J. Eicher, H. Spengler, F. Kohlmayer, and K. A. Kuhn. Lightning: Utility-driven anonymization of high-dimensional data. Transactions on Data Privacy, 9(2):161–185, 2016.
  37. W. Qardaji, W. Yang, and N. Li. Practical differentially private release of marginal contingency tables. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 1435–1446, 2014.
  38. T. Saha, B. Zhang, and M. Hasan. Name disambiguation from link data in a collaboration graph using temporal and topological features. Social Network Analysis and Mining, 5(1), 2015.
  39. P. Samarati. Protecting respondents’ identities in microdata release. IEEE Transactions on Knowledge and Data Engineering (TKDE), 13(6):1010–1027, Nov. 2001.
  40. J. Soria-Comas, J. Domingo-Ferrer, D. Sanchez, and S. Martinez. Enhancing data utility in differential privacy via microaggregation-based k-anonymity. In Very Large Data Bases (VLDB) Journal, volume 23, pages 771–794, Oct. 2014.
  41. L. Sweeney. K-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, Oct. 2002.
  42. J. Vaidya and C. Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03, pages 206–215, 2003.
  43. J. Vaidya, M. Kantarcioglu, and C. Clifton. Privacy-preserving naive Bayes classification. Very Large Data Bases (VLDB) Journal, 17(4):879–898, July 2008.
  44. B. Zhang, M. Dundar, and M. A. Hasan. Bayesian non-exhaustive classification a case study: Online name disambiguation using temporal record streams. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM ’16, pages 1341–1350. ACM, 2016.
  45. B. Zhang, T. K. Saha, and M. Al Hasan. Name disambiguation from link data in a collaboration graph. In the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), volume 5, pages 81–84, 2014.