Inverse Classification with Limited Budget and Maximum Number of Perturbed Samples



Most recent machine learning research focuses on developing new classifiers to improve classification accuracy. With many well-performing state-of-the-art classifiers available, there is a growing need to understand the interpretability of a classifier, driven by practical purposes such as finding the best diet recommendation for a diabetes patient. Inverse classification is a post-modeling process that finds changes in the input features of samples that alter the initially predicted class. It is useful in many business applications for determining how to adjust a sample's input data so that the classifier predicts it to be in a desired class. In real-world applications, a budget on perturbations of samples corresponding to customers or patients is usually considered, and in this setting the number of successfully perturbed samples is key to increasing benefits. In this study, we propose a new framework for inverse classification that maximizes the number of perturbed samples subject to per-feature-budget limits and favorable classification classes of the perturbed samples. We design algorithms to solve this optimization problem based on gradient methods, stochastic processes, Lagrangian relaxations, and the Gumbel trick. In experiments, we find that our algorithms based on stochastic processes perform very well in different budget settings and scale well.

1 Introduction

Classification is a building block for solving various machine learning tasks such as customer segmentation, sentiment analysis, and image recognition. Numerous state-of-the-art classification models, including deep neural networks, have been developed to achieve high classification accuracy [1, 12]. A common post-modeling step is to consider changes in features that alter the predicted class, e.g., from the prediction of becoming sick to remaining healthy. Given a trained classifier, inverse classification models identify minimal changes of the input features of a sample so that the sample is predicted as a desired class that is different from its originally predicted class [13]. It was first introduced as a topic of sensitivity analysis [17] and then augmented as an interpretability approach [2]. Viewing inverse classification as a utility-based data mining problem, Lash et al. [12] argue that it is a subtopic of strategic learning [3]. Inverse classification is also related to counterfactual explanation in interpretable machine learning. A counterfactual explanation reveals how a sample should be perturbed to significantly change its original prediction. By crafting counterfactual samples, we can interpret how a classifier computes individual predictions [18, 23]. Laugel et al. [13] and Lowd and Meek [14] point out that inverse classification is related to adversarial learning [22], which aims to attack a classifier by applying small perturbations to samples to modify their initial predictions. Inverse classification and counterfactual explanation studies focus on the interpretability of classification models, whereas adversarial learning mainly focuses on the robustness of the associated models. For example, developing a defensive system against adversarial attacks is a study of interest in adversarial learning. Perturbing samples so that they are predicted as a desired label is a common goal in these areas.

In many business applications, samples correspond to the treatment of customers or patients, and the goal is to perturb their treatment so that they are predicted to be in a more favorable class. We often have limits on feature perturbation amounts, e.g., a limited number of interactions with all of the customers, or limited availability of a drug or procedure. Within per-feature-budget limits, we want to treat as many customers or patients as possible. This is equivalent to maximizing the number of perturbed samples, because each of them is turned into a more favorable class. Past works deal with objectives such as minimizing budget or other continuous loss functions, but capturing the number of selected samples yields a discrete function, which poses unique challenges addressed herein. In this work, we focus on the problem of maximizing the number of selected samples subject to per-feature-budget limits and desired reclassification of the perturbed samples. This setting is a direct result of use cases encountered by our industry partner.

A typical setting considered in inverse classification, counterfactual explanations, and adversarial learning is as follows. Let us assume that we have a classifier $f$ with input $x$ and output $y = f(x)$. The goal is to generate an adversarial sample $\hat{x}$ that is of the same form as a given sample $x$ and is predicted as a desired class that is different from the originally predicted label of $x$. In adversarial learning in particular, there are two types of adversarial examples. A non-targeted adversarial example is generated by adding a small perturbation to $x$ so that $\hat{x}$ is classified as any class other than the original ground truth. A targeted adversarial sample fools a classifier so that it produces a desired label $\hat{y}$, where $\hat{y}$ is the desired class determined by the adversary. The $\ell_p$ norm of the perturbation between adversarial samples and given samples is usually used as the loss function, for a chosen order $p$ [6]. In some cases, a set of budget constraints for the perturbation is introduced, and subsequently not all candidate samples can be successfully perturbed as desired. Herein we focus on spending the budget so that as many samples as possible can be successfully perturbed within the budget. Existing inverse classification and adversarial attack frameworks that minimize the cost of the perturbation do not capture this perspective.

In this study, we develop a new framework that can be applied to inverse classification and counterfactual explanations as well as adversarial learning. We assume a budget constraint on perturbations of continuous input features. In order to obtain the maximal number of successfully perturbed samples within a budget, we define an objective function that maximizes the number of samples to be perturbed, which is different from the existing formulations. For this, we introduce a binary variable indicating which sample is to be perturbed. In addition, we include a set of constraints requiring that the probability of the desired class produced by the classifier is higher than that of all other classes by a tunable margin. This is done to avoid a purely adversarial change in the prediction and instead induce perturbations yielding the actual desired change in the real-world process. We propose Lagrangian-based models relying on binary variables. An alternative view that treats the selection of samples as a stochastic process with unknown probabilities yields even better algorithms. The resulting model uses chance constraints and the Gumbel trick to make the algorithms more efficient. We rely on gradient-based optimization in conjunction with Lagrangian relaxations.

For evaluation, we use a real-world proprietary dataset from the insurance industry and a public dataset from a health clinic [7, 9]. We compare the performance of our algorithms with respect to the number of selected samples and the budget consumed per selected sample. Algorithms based on stochastic processes significantly outperform those relying on binary variables. We conduct budget and scalability experiments on the real-world data; the relative improvement of the proposed stochastic algorithms over an existing method with a traditional objective function is 7% and 24% on average over a variety of budget settings and sample size settings, respectively. In the budget experiments on the public data, the stochastic algorithms outperform the existing model by 19% on average.

The contributions of this work are as follows.

  1. We introduce a new framework to solve inverse classification to achieve the maximal number of successfully perturbed samples within a per-feature budget. As far as we know, this framework has not yet been applied to inverse classification in the existing literature.

  2. We design novel algorithms based on gradient methods, stochastic processes, Lagrangian relaxations and the Gumbel trick.

  3. In the computational study, stochastic approaches perform well on different budget scenarios, and they scale.

The rest of this paper is organized as follows. In Section 2, the related work is discussed. Section 3 describes the proposed models, and the algorithms to solve them are presented in Section 4. Section 5 provides the computational study, including experimental details and analyses of the experimental results. Conclusions are given in Section 6.

2 Related work

In inverse classification and counterfactual explanation studies, the focus is either on the formulation, unconstrained or constrained, or on algorithmic mechanisms [12, 11]. The formulation is related to the feasibility and implementability of perturbed samples, which yields either an unconstrained [1, 24] or a constrained problem [2, 5, 12, 11, 17, 23]. Since an unconstrained formulation does not consider practical constraints such as a budget, it tends to produce unrealistic perturbations of the input features of a sample, e.g., a drug cannot be offered if a patient is at home. A constrained formulation provides realistic perturbations; however, it is challenging to solve. There are three factors to be considered: a) identifying changeable features to be perturbed (e.g., a product purchase history is an unchangeable feature), b) how costly it is to change a feature, and c) limiting the amount of perturbation over all samples, i.e., a budget [12, 11]. In [2], only c) is considered. Mannino et al. [17] consider b), but do not consider a) and c). Lash et al. [12] propose a general framework that considers a), b), and c); however, a prediction confidence constraint is not included. With respect to algorithms, there are greedy [1, 5, 11, 17, 24] and non-greedy [2, 12] algorithms. Greedy methods are computationally efficient but typically suffer from low solution quality. Non-greedy methods tend to focus on more moderate objectives not capturing many aspects so that the obtained adversarial samples are more realistic. In [1, 5, 11, 17, 24], heuristic methods that do not use gradients, such as local search, hill climbing, and genetic algorithms, are used. In [12], a projected gradient method is adopted, and in [2], a nonlinear solver package is used to solve a constrained problem. Our work differs from the aforementioned research since none of the existing methods consider maximizing the number of perturbed samples in their formulation. Because of the discrete nature of the counting objective function, these algorithms are not appropriate.

In adversarial learning, recent research mostly focuses on generating adversarial samples to attack deep learning models, since its purpose is to study the robustness of state-of-the-art classifiers. Most adversarial attacks target deep networks on visual or audio perception tasks, as they allow a striking demonstration of the large effects of minute perturbations on the eventual prediction results in comparison with the robustness of human perception. Szegedy et al. [21] propose the following optimization problem and solve it by using constrained L-BFGS,

$$\min_{r}\; c\,\lVert r \rVert_2 + \mathcal{L}(x + r, \hat{y}) \quad \text{s.t.} \quad x + r \in [0, 1]^m,$$

where $\mathcal{L}$ is a classification loss toward a desired label $\hat{y}$, such as cross-entropy, and $c$ is a hyperparameter. Goodfellow et al. [8] propose the Fast Gradient Sign Method (FGSM), which uses the sign of gradients, with the $\ell_\infty$ norm as the distance metric for the perturbation. It is not guaranteed to produce optimal solutions, but it quickly obtains close adversarial examples. Given an input image $x$, FGSM performs

$$\hat{x} = x - \epsilon \cdot \operatorname{sign}\bigl(\nabla_x \mathcal{L}(x, \hat{y})\bigr),$$

where $\epsilon$ is selected to be small. Kurakin et al. [10] propose an improved FGSM, the iterative gradient sign method (I-FGSM), which takes multiple smaller steps of size $\alpha$ in the direction of the gradient sign rather than one step, with the output clipped by $\epsilon$. It produces superior results to FGSM by updating on each iteration as

$$\hat{x}_{t+1} = \operatorname{clip}_{x,\epsilon}\bigl(\hat{x}_t - \alpha \cdot \operatorname{sign}(\nabla_x \mathcal{L}(\hat{x}_t, \hat{y}))\bigr),$$

where $\hat{x}_0 = x$. Recently, Papernot et al. [19] propose a greedy algorithm, called the Jacobian-based Saliency Map Attack (JSMA), that generates adversarial examples using gradients to compute a saliency map. These algorithms do not consider a budget constraint, which is critical and practical in inverse classification, even though they have a bound at the pixel level. In addition, our framework is designed to achieve the maximal number of successfully perturbed samples within a budget. These algorithms cannot be modified in a meaningful way to tackle our setting.
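
The two gradient-sign attacks discussed above can be illustrated with a minimal sketch. It assumes a hypothetical binary logistic-regression classifier (weights `w`, bias `b`) in place of a deep network, uses the targeted form of the update (stepping against the gradient of the loss toward the desired label), and all function names are illustrative rather than taken from the cited works.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)

def predict(w, b, x):
    # Probability of class 1 under the assumed logistic model.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_grad(w, b, x, y_target):
    # Gradient of the binary cross-entropy toward y_target with respect to x.
    p = predict(w, b, x)
    return [(p - y_target) * wi for wi in w]

def fgsm(w, b, x, y_target, eps):
    # One step of size eps against the sign of the loss gradient.
    g = loss_grad(w, b, x, y_target)
    return [xi - eps * sign(gi) for xi, gi in zip(x, g)]

def i_fgsm(w, b, x, y_target, eps, alpha, steps):
    # Several steps of size alpha, clipped to the eps-box around x.
    x_hat = list(x)
    for _ in range(steps):
        g = loss_grad(w, b, x_hat, y_target)
        x_hat = [xh - alpha * sign(gi) for xh, gi in zip(x_hat, g)]
        x_hat = [min(max(xh, xi - eps), xi + eps) for xh, xi in zip(x_hat, x)]
    return x_hat

w, b = [2.0, -1.0], 0.0
x = [-0.5, 0.5]                      # initially predicted as class 0
x_fgsm = fgsm(w, b, x, y_target=1, eps=0.6)
x_ifgsm = i_fgsm(w, b, x, y_target=1, eps=0.6, alpha=0.1, steps=10)
```

Both attacks bound the change per feature by `eps`, but neither limits the total perturbation across samples, which is the budget aspect the proposed framework adds.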

3 Proposed models

In this section, we present our constrained optimization problem to solve inverse classification. The formulation is designed to generate the maximal number of adversarial examples that are classified as a desired class. In addition, we include a set of budget constraints on perturbations of input features. We first introduce notation and the baseline model. Next, we present variations of the baseline model: chance constraint models that assume the decision variables follow an unknown probability distribution.

We denote a given sample by and a perturbed sample by . We assume that all features are continuous. Let be a function associated with a classification model that computes a score of being in a class such as the probability of classes. We optimize over samples given a new desired label vector with and . We denote a perturbed input feature matrix by . Furthermore, we introduce binary variables, , to decide which sample is to be perturbed. We maximize the number of these binary variables that have value 1. We introduce general budget constraints on perturbations of input features. In addition, we have a per-sample constraint, called the prediction confidence constraint, to capture a margin for prediction reflecting the uncertainty in the score function. Formally, the max samples model (MS) is formulated as


where each and is a nonlinear function associated with the budget and prediction confidence constraint, respectively. A typical budget constraint for feature on perturbation of input features is


where is a given budget for feature . Prediction confidence constraints are explicitly expressed as


where is a desired class of and is a given margin. In MS, we rewrite (3) by multiplying it with binary variables so that we consider the constraints only on selected samples as follows:
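
As a sketch under stated assumptions, the max samples model described above can be written in one consistent form. All symbols here are illustrative and not necessarily the authors' notation: $N$ samples with $d$ features, original features $x_{ij}$, perturbed features $\hat{x}_{ij}$, binary selection variables $z_i$, per-feature budgets $B_j$, class scores $f_k$, desired classes $\hat{y}_i$, and margin $\gamma$. The absolute-deviation form of the budget constraint is an assumption.

```latex
\begin{align*}
\max_{\hat{X},\, z} \quad & \sum_{i=1}^{N} z_i \\
\text{s.t.} \quad
& \sum_{i=1}^{N} \lvert \hat{x}_{ij} - x_{ij} \rvert \le B_j,
  && j = 1, \dots, d, \\
& z_i \bigl( f_k(\hat{x}_i) + \gamma - f_{\hat{y}_i}(\hat{x}_i) \bigr) \le 0,
  && i = 1, \dots, N,\ k \ne \hat{y}_i, \\
& z_i \in \{0, 1\}, && i = 1, \dots, N.
\end{align*}
```

In this form, a sample with $z_i = 0$ imposes no prediction confidence requirement, while a sample with $z_i = 1$ must have its desired class score exceed every other class score by at least $\gamma$.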


MS is challenging to solve due to the presence of constraints and binary variables . To address the latter, we assume that the binary variables have a probability distribution which is an approximation. We impose that the variables follow either Bernoulli or Categorical distributions considering dependency among samples. First, we present the Bernoulli case where no relationship among perturbed samples is assumed.

Let , i.e., with probability . Let . Transforming MS, we propose a Bernoulli chance max samples model (BCMS) as


where is a parameter. We can explicitly rewrite BCMS as


The chance max samples model with the Categorical distribution (CCMS) considers dependency among samples to be perturbed. To this end, let with and an integer parameter. CCMS has the same formulation as BCMS, but we use the following binary variables to determine which sample to perturb:


We finish with a benchmark model based on the existing framework [18, 21] of generating adversarial samples that is designed to solve inverse classification with a minimal cost of perturbation. This model optimizes over samples in to minimize the loss function , where denotes the Kullback-Leibler divergence. The model (KL) reads


where each and is a nonlinear function associated with budget (2) and prediction confidence constraints (3) without the binary variable . This model has a smaller number of decision variables than our models since it does not include binary variables or any possible related variables; however, it does not explicitly count the number of perturbed samples and thus it has a different objective. KL does not necessarily achieve the maximal number of successfully perturbed samples.

4 Algorithms

In this section, we describe algorithms based on Lagrangian and subgradient methods. We first reformulate the problem as an unconstrained problem by using Lagrangian multipliers. We develop algorithms to solve Lagrangian functions based on the projected subgradient method. Projection is used to keep Lagrangian multipliers positive during updates or to maintain probability requirements. For chance max samples models, we apply the Gumbel trick [16] to use approximate gradients when updating the binary variables.

Algorithm for max samples model We first define the Lagrangian function for MS (1) as



and and are Lagrangian multipliers. We propose Algorithm 1 to solve . The algorithm consists of two main loops to solve the problem; the inner loop updates input features and binary variables to maximize , and the outer loop updates Lagrangian multipliers to minimize . In the algorithm, we initialize all with one as we aim to achieve as many successfully perturbed samples as possible. Meanwhile, we add a line to break the inner loop when all entries of are zero, which is the case of no updates on variables.

1:Initialize .
2:while until convergence do
4:     while until convergence do
5:         Break if
7:         for  do
8:              if  then
10:              else
12:              end if
13:         end for
14:     end while
17:end while
Algorithm 1 MS

Note that , and . In addition, lines 7-13 in Algorithm 1 are derived by solving

where is assumed to be constant in this part of the algorithm.
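
The two-loop structure described for Algorithm 1 can be sketched on a toy problem. This is not the authors' implementation: it assumes a single feature, a hypothetical linear margin `a * x_hat[i] - c` standing in for the classifier, an absolute-deviation budget, and fixed step sizes. The inner loop takes ascent steps on the features and sets each binary variable by the sign of its coefficient in the Lagrangian (which is linear in the binary variables when the features are held fixed); the outer loop takes projected descent steps on the multipliers.

```python
def sign(v):
    return (v > 0) - (v < 0)

def ms_lagrangian(x, a, c, gamma, budget, outer=20, inner=200,
                  lr_x=0.01, lr_dual=0.1):
    """Toy primal-dual loop; all modeling choices here are assumptions."""
    n = len(x)
    x_hat = list(x)
    z = [1] * n          # start with every sample selected
    lam = 1.0            # multiplier of the budget constraint
    mu = [1.0] * n       # multipliers of the confidence constraints
    for _ in range(outer):
        for _ in range(inner):
            # Ascent step on the perturbed features.
            for i in range(n):
                grad = -lam * sign(x_hat[i] - x[i]) + mu[i] * z[i] * a
                x_hat[i] += lr_x * grad
            # The Lagrangian is linear in z_i with coefficient
            # 1 - mu_i * violation_i, so z_i is set by its sign.
            for i in range(n):
                violation = gamma - (a * x_hat[i] - c)
                z[i] = 1 if 1.0 - mu[i] * violation > 0 else 0
        # Descent step on the multipliers, projected onto [0, inf).
        used = sum(abs(xh - xi) for xh, xi in zip(x_hat, x))
        lam = max(0.0, lam + lr_dual * (used - budget))
        for i in range(n):
            violation = gamma - (a * x_hat[i] - c)
            mu[i] = max(0.0, mu[i] + lr_dual * z[i] * violation)
    return x_hat, z, lam, mu

x_hat, z, lam, mu = ms_lagrangian([0.0, 0.2, -1.0], a=1.0, c=0.0,
                                  gamma=0.5, budget=1.0)
```

As with any Lagrangian scheme on a nonconvex problem, the returned point is not guaranteed to be feasible, which is why a separate feasibility step is needed afterward.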

Algorithm for Bernoulli and Categorical chance max samples model We define the Lagrangian function for BCMS (6) as


We need to solve where we have to compute gradients of with respect to . To relieve the burden of computing exact gradients, we apply the Gumbel trick [16] to use their approximation. Let the exact probability be where is the indicator function. We first approximate where and are hyperparameters. We have with . Let us consider first the Bernoulli case. Applying the Gumbel trick, the expectation term is approximately computed as



and . Here the Gumbel distribution is denoted by and and are hyperparameters. Values approximate . We can rewrite (10) as


We propose Algorithm 2 to solve . This algorithm has two main loops to maximize with respect to and , and to minimize with respect to the Lagrangian multipliers. A part of generating Gumbel’s samples is added to the inner loop so that approximated gradients are used to update variables. In the algorithm we use

1:Initialize .
2:while until convergence do
3:     while until convergence do
4:         for  do
5:              for  do
8:              end for
9:               with as in (11)
10:         end for
11:          with as in (13)
14:     end while
17:end while
Algorithm 2 BCMS
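
The Gumbel trick used for the Bernoulli case can be illustrated with a minimal sketch. It assumes the standard relaxation in which a Bernoulli draw with probability `p` is replaced by a sigmoid of the logit plus the difference of two Gumbel(0, 1) samples, scaled by a temperature `tau`; the function names and the temperature value are illustrative, not the authors' settings.

```python
import math
import random

def gumbel(rng):
    # Gumbel(0, 1) sample via inverse transform; guard against log(0).
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))

def stable_sigmoid(u):
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def relaxed_bernoulli(p, tau, rng):
    # Differentiable surrogate for z ~ Bernoulli(p); smaller tau pushes
    # the value toward 0 or 1.
    logit = math.log(p) - math.log(1.0 - p)
    noise = gumbel(rng) - gumbel(rng)
    return stable_sigmoid((logit + noise) / tau)

rng = random.Random(0)
samples = [relaxed_bernoulli(0.3, 0.5, rng) for _ in range(20000)]
# Thresholding the relaxed value at 0.5 recovers an exact Bernoulli(p) draw.
hard_rate = sum(s > 0.5 for s in samples) / len(samples)
```

Because the relaxed sample is a smooth function of `p`, gradients with respect to the selection probabilities can be approximated by differentiating through it, which is the role the approximation plays in the inner loop.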

For CCMS, we define the Lagrangian function based on (5), which is written as

where is element of as defined in (7). Based on the Gumbel’s approach we approximate by

The approximate Lagrangian function for CCMS reads


where is a Gumbel matrix, and is the same as in (11) except that

We propose Algorithm 3 to solve , which has the same structure as Algorithm 2 for BCMS. The part of simulating Gumbel’s samples is edited for the Categorical distribution (lines 4-13).

1:Initialize .
2:while until convergence do
3:     while until convergence do
4:         for  do
5:              for  do
6:                  for  do
8:                  end for
9:              end for
13:         end for
18:     end while
21:end while
Algorithm 3 CCMS
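
For the Categorical case, the analogous relaxation is the Gumbel-softmax. The sketch below assumes a probability vector `probs` over candidate samples and a temperature `tau`; how the draws are combined into the binary selection variables of CCMS is not shown, and the names are illustrative.

```python
import math
import random

def gumbel(rng):
    # Gumbel(0, 1) sample via inverse transform; guard against log(0).
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))

def gumbel_softmax(probs, tau, rng):
    # Relaxed one-hot sample from Categorical(probs).
    scores = [(math.log(p) + gumbel(rng)) / tau for p in probs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(1)
probs = [0.5, 0.3, 0.2]
draws = [gumbel_softmax(probs, 0.5, rng) for _ in range(20000)]

# The argmax of each relaxed draw is an exact Categorical(probs) sample.
counts = [0] * len(probs)
for y in draws:
    counts[y.index(max(y))] += 1
freqs = [cnt / len(draws) for cnt in counts]
```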

Algorithm for KL The Lagrangian function for KL (8) reads


where and are Lagrangian multipliers. Algorithm 4 is designed to solve . Similar to Algorithm 1, it consists of two loops; the inner loop updates input features to minimize , and the outer loop updates Lagrangian multipliers to maximize .

2:while until convergence do
6:end while
Algorithm 4 KL

Obtaining the final solution Since Lagrangian methods do not necessarily guarantee feasibility of solutions with respect to budget and prediction confidence, we conduct the following steps to obtain the final solution set.

  1. Run one of the Algorithms 1-4 to obtain a ‘good’ but possibly infeasible solution .

  2. Find a subset of samples satisfying prediction confidence constraints, i.e., all of the samples in are classified as desired.

  3. Solve the following problem over to find the final set of feasible samples:


    where .
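
Steps 2 and 3 above can be sketched as follows, under the assumption that the final problem selects the largest subset of correctly reclassified samples whose total per-feature perturbation stays within the budgets. The exhaustive search is only practical for small candidate sets and is used here for clarity; the data layout (`costs[i][j]` as the budget sample `i` uses on feature `j`) and all names are hypothetical.

```python
from itertools import combinations

def select_feasible(costs, margins, gamma, budgets):
    # Step 2: keep samples whose prediction margin meets the threshold.
    candidates = [i for i, m in enumerate(margins) if m >= gamma]
    # Step 3: largest subset of candidates within every per-feature budget.
    for size in range(len(candidates), 0, -1):
        for subset in combinations(candidates, size):
            totals = [sum(costs[i][j] for i in subset)
                      for j in range(len(budgets))]
            if all(t <= b for t, b in zip(totals, budgets)):
                return list(subset)
    return []

costs = [[1, 1], [2, 0], [0, 3], [3, 3]]
margins = [0.2, 0.3, 0.05, 0.4]
chosen = select_feasible(costs, margins, gamma=0.1, budgets=[3, 3])
# chosen == [0, 1]: sample 2 fails the margin, and adding sample 3
# would exceed the budgets.
```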

Sequences Let us consider the case when samples are sequences of varying length of feature vectors of the same length; e.g., corresponds to an LSTM or transformer. In this case, is not well defined, i.e. it is not a matrix and thus (1) is ill-posed. If we consider only sequences of the same length, then (1) is well defined. If we have different lengths, then we can form different disjoint subsets of samples and define (1) for each one subset. The link between all subsets becomes a joint per-feature budget. This budget needs to be allocated to the problems. Herein we use a simple strategy of allocating for each feature (note that = ).

5 Computational study

In this section, we conduct a computational study on two datasets: a proprietary dataset and a public dataset. We experiment with different budget scenarios and assess scalability with respect to the number of samples. All models are implemented in Python. Experiments use a Tesla V100 GPU and an Intel Xeon CPU E5-2697 v4 @ 2.30GHz for the real-world dataset, and a Titan XP 1080 GPU and an Intel Xeon Silver 4112 CPU @ 2.60GHz for the public dataset.

We use the following hyperparameter values: , , , , and . The learning rates and affecting and are selected as one of . We use a decaying learning rates and , affecting the Lagrangian multipliers and , initially set as one in . The stopping criterion is set to be the maximum number of iterations (variable updates). For the outer loop it is set to be 10 for MS, BCMS, and KL and 20 for CCMS, and for the inner loop it is set to be 10,000, 100, 100, and 5,000 for MS, BCMS, CCMS and KL, respectively. The initial Lagrangian multipliers are selected as one from but adding white Gaussian noise.

5.1 Real-world data

We conduct experiments on a real-world proprietary dataset that contains sequential input features for 5 classes, which is introduced in [20]. The data has 169 features and sequences are of size from 1 to 150 and thus the joint per-feature budget needs to be employed. The classification model Sparse Time LSTM from [20] is used as . The accuracy of the model on approximately 700,000 training samples is around 70%.

Budget experiments

We perturb 300 samples selected from the test set which are grouped into 5 different groups by input sequence length. Thus, we have , and . The 300 samples are correctly predicted by the trained classifier as belonging to a “negative” class, i.e., one of four of the five classes (e.g., has the disease), and thus the perturbed samples should fall into the positive class, i.e., the remaining class (e.g., does not have the disease). We perturb 19 features since the remaining features cannot be altered in practice. To decide the sizes of the budgets, we first run Algorithm 4 with unlimited budgets to measure how much budget is needed for successful perturbation. Then, also taking into account practical considerations from subject matter experts and data stakeholders, we set the small, middle, and large budgets to amounts proportional to the total budget consumption under unlimited budgets.

Figure 1: Real-world data: Budget experiment. Panels: (a) number of successfully perturbed samples; (b) average consumption per sample; (c) budget constraint residual; (d) average prediction gap.

Figure 1 shows the results of the budget experiment. We find that BCMS and CCMS, the algorithms that use the Gumbel trick, perform better than the other algorithms. They achieve a larger number of successfully perturbed samples, and they also achieve lower consumption per sample, defined as the budget used by all of the samples divided by the number of successfully perturbed samples. The relative improvement of BCMS and CCMS over KL is 10% and 7%, 9% and 6%, and 7% and 4% for the small, middle, and large budgets, respectively. This is because the objective of the max samples models is to maximize the number of successfully perturbed samples. In addition, in plot (a) we observe that a larger budget yields a larger number of successfully perturbed samples for all algorithms, which is expected. We also analyze the budget and prediction confidence constraints. For the budget constraint residuals, defined from (2) and normalized by the total available budget, we compute how much of the budget is spent for each budget constraint and take the mean; see plot (c). In addition, a prediction gap is computed as the gap between the top and the second-best prediction probabilities; see plot (d) for average prediction gaps. We find that the budget constraint residuals of BCMS and CCMS are lower than those of the other algorithms, and their prediction gaps are smaller. Our explanation is that BCMS and CCMS spend just enough budget to guarantee a certain level of prediction confidence: the gap exceeds the required margin in their predictions, but not by more than necessary. On the other hand, KL has a large prediction gap, which indicates more confidence in its predictions than necessary. This is one reason its budget residual per sample is relatively large.
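
The evaluation quantities used in this discussion can be sketched as small helper functions. The prediction gap and consumption per sample follow the verbal definitions in the text; the relative improvement formula (difference divided by the baseline value) is an assumption, since the text does not spell it out, and all names are illustrative.

```python
def prediction_gap(probs):
    # Gap between the top and the second-best class probabilities.
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def consumption_per_sample(used_per_feature, n_perturbed):
    # Total budget used across features, per successfully perturbed sample.
    return sum(used_per_feature) / n_perturbed

def relative_improvement(ours, baseline):
    # Assumed definition: improvement as a fraction of the baseline value.
    return (ours - baseline) / baseline

gap = prediction_gap([0.1, 0.6, 0.3])
per_sample = consumption_per_sample([6.0, 4.0], n_perturbed=5)
gain = relative_improvement(110, 100)
```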

Scalability experiments

We also conduct a scalability analysis of our algorithms. We use three different sizes of samples , and , with samples in each set being grouped into based on their input sequence length. In addition, they have inclusive relationships such that . In this context, we have two strategies of initializing samples to be perturbed. First, we initialize input features of samples in the larger set with previously obtained values from the subset, and the rest of samples that are not in the subset are initialized randomly. For example, we run an algorithm on , and then run the algorithm on . When we run it on , we initialize samples from in with obtained values from the run on , and samples from randomly. The other strategy is to initialize all samples randomly. Similar to the budget experiments, all samples are originally labeled as one of negative classes and correctly predicted by the trained classifier. We use the middle size budget and the other hyperparameters are the same as those used in the budget experiments.

Figure 2: Real-world data: Scalability experiment. Panels: (a) number of successfully perturbed samples; (b) average consumption per sample; (c) budget constraint residual; (d) average prediction gap.

Figure 2 shows the results of the scalability experiment. Note that a run on with subset initialization is denoted by 300-Sub, and one with random initialization is denoted by 300-Ran in the figure. Similar to the budget experiments, the algorithms with Gumbel’s method BCMS and CCMS perform better than other algorithms. The count of samples successfully perturbed by them is larger than the counts from other algorithms, and also the average consumption per sample is smaller. The relative improvement of the stochastic models BCMS and CCMS over KL is 20% and 13% for , 31% and 17.5% for , and 34.5% and 20% for on average over different initialization strategies. The observations from Section 5.1.1 apply to each individual sample size. We find that the relative improvement of BCMS and CCMS over KL increases linearly as sample size increases in Figure 3. In terms of the two initialization strategies, both cases show similar results and thus the benefits of warm-start are negligible. Regarding the budget and prediction confidence constraints, we find similar results to the budget experiments. Budget constraint residuals for BCMS and CCMS are lower than KL, and their prediction gaps are smaller than for the other algorithm. The aforementioned conclusions apply to all of different sizes of samples and thus we conclude that our algorithms scale efficiently.

Figure 3: Real-world data: Relative improvement as the sample size increases

5.2 Public data: MIMIC

MIMIC is a public dataset that describes clinical information of patients admitted to an Intensive Care Unit at the Beth Israel Deaconess Medical Center in Boston, Massachusetts from 2001 to 2012. It contains 58,576 samples for patient admissions. Descriptive statistics can be found in [7, 9]. In this study, we use 13 input features corresponding to the health state for 30-day mortality predictions used in [15]. Since MIMIC is a time series dataset and has missing values, we use a recurrent network based on Gated Recurrent Unit, called GRU-D, that is widely used for coping with multivariate time series with missing values [4] for imputation and predictions. Its AUC is around 0.78 which is comparable to state-of-the-art. We conduct only a budget experiment for this dataset due to its limited size.

We perturb 75 samples selected from the test set, i.e., . The 75 samples are originally labeled as “dead” and correctly predicted by the trained classifier. Our purpose is to perturb the samples so that they are predicted as “alive.” To decide the size of budgets which represent the per drug amount, we first run Algorithm 4 with unlimited budget constraints to measure how much perturbation is needed. Then, we determine small, middle, and large sizes of budgets by imposing 40%, 60%, and 80% of the total budget achieved by unlimited budgets.
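
The budget-sizing rule described above is simple arithmetic and can be sketched as follows. The 40%, 60%, and 80% fractions come from the text; the per-feature consumption values in the usage line are made-up placeholders, not measurements from the study.

```python
FRACTIONS = {"small": 0.4, "middle": 0.6, "large": 0.8}

def budget_levels(unconstrained_use):
    # unconstrained_use[j]: budget consumed on feature j when the
    # benchmark algorithm is run without budget limits.
    return {name: [frac * used for used in unconstrained_use]
            for name, frac in FRACTIONS.items()}

levels = budget_levels([10.0, 5.0])   # placeholder consumption values
```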

Figure 4: MIMIC: Budget experiment. Panels: (a) number of successfully perturbed samples; (b) average consumption per sample; (c) budget constraint residual; (d) average prediction gap.

Figure 4 shows the results of the budget experiment. Similar to the results on the real-world data, stochastic algorithms for max samples models, BCMS and CCMS perform better than KL and MS. They obtain a larger size of successfully perturbed samples than the other algorithms, and also achieve smaller consumption per sample. The relative improvement of our max samples models MS, BCMS and CCMS over KL is 5%, 19% and 19% on average for all different budget scenarios. It is interesting to observe that KL uses a much bigger portion of the budget than the other algorithms.

5.3 Optimization aspects

In this part, we discuss optimization aspects of the algorithms. Figure 5 shows the number of successfully perturbed samples for each algorithm at each outer iteration. We observe that this number increases with iterations and then stays at its highest point, which implies that the algorithms converge to a (local) optimal value. We also find that BCMS converges faster than CCMS in all cases. KL also converges faster than CCMS; however, the largest number of successfully perturbed samples obtained by KL is not larger than that obtained by CCMS and BCMS.

Figure 7 shows the values of Lagrangian multipliers at each outer iteration for the MIMIC dataset. The average values of with respect to the budget constraint decrease as each algorithm iterates, which implies that the amount of constraint violation decreases. Regarding that is associated with the prediction confidence constraint, KL values decrease as the algorithm proceeds while the max sample algorithms drive them higher. In Figure 6 we further observe that the norm of gradients of for the max samples algorithms shrink to zero, which implies that the algorithms reduce the amount of constraint violation. MS shows erratic behaviors in some cases and its performance is unstable. We reason this is due to the presence of constraints and binary variables in MS.

Table 1 shows the actual runtime of the algorithms on the larger, proprietary dataset. The algorithms for the stochastic models BCMS and CCMS take longer per inner-loop iteration than MS and KL because of the Gumbel simulation. However, the total runtime of BCMS and CCMS is not necessarily longer than that of the others. Based on runtime, convergence, and the number of successfully perturbed samples, we conclude that BCMS is the best performer among all the algorithms considered herein. CCMS is a close second.

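| Algorithm | Outer loop: runtime per iteration (max no. of iterations) | Inner loop: runtime per iteration (max no. of iterations) | Total runtime |
| --- | --- | --- | --- |
| MS | 20 sec (10) | 0.7 sec (10,000) | 19 hr |
| BCMS | 25 sec (10) | 18 sec (100) | 5 hr |
| CCMS | 25 sec (20) | 15 sec (100) | 8 hr |
| KL | 20 sec (10) | 0.9 sec (5,000) | 10 hr |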
Table 1: Runtime of algorithms
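As a rough consistency check on Table 1, the per-iteration times and iteration caps can be combined into an upper bound on the total runtime. The sketch below assumes a nested structure in which every outer iteration runs the inner loop up to its maximum number of iterations; that structure and the names `table_1` and `runtime_upper_bound_hours` are assumptions for illustration. Since the reported totals are actual runtimes, they can fall below the bound when loops terminate early.

```python
# Per-iteration times (seconds) and iteration caps as listed in Table 1.
table_1 = {
    "MS": {"outer_sec": 20, "outer_max": 10, "inner_sec": 0.7, "inner_max": 10_000},
    "BCMS": {"outer_sec": 25, "outer_max": 10, "inner_sec": 18, "inner_max": 100},
    "CCMS": {"outer_sec": 25, "outer_max": 20, "inner_sec": 15, "inner_max": 100},
    "KL": {"outer_sec": 20, "outer_max": 10, "inner_sec": 0.9, "inner_max": 5_000},
}


def runtime_upper_bound_hours(row):
    """Upper bound assuming each outer iteration runs the full inner loop."""
    per_outer_sec = row["outer_sec"] + row["inner_max"] * row["inner_sec"]
    return row["outer_max"] * per_outer_sec / 3600.0


bounds = {name: runtime_upper_bound_hours(row) for name, row in table_1.items()}
for name, hours in bounds.items():
    print(f"{name}: at most {hours:.1f} hr")
```

Under this assumption the bounds for MS, BCMS and CCMS sit close to the reported 19, 5 and 8 hours, while the bound for KL exceeds its reported 10 hours, which would be consistent with KL stopping before its iteration caps are reached.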

6 Conclusion

In this paper, a new framework for inverse classification is proposed. We formulate a constrained optimization problem that maximizes the number of successfully perturbed samples with budget and prediction confidence constraints. In addition, we formulate a stochastic problem with chance constraints. To solve the constrained problems, algorithms based on Lagrangian and subgradient methods are developed. Based on the computational study, we find that the algorithms perform well in various budget settings and are scalable.

Figure 5: Size of the set of successfully perturbed samples at each outer iteration. Panels: (a) MIMIC, budget experiment; (b) real-world data, budget experiment; (c) real-world data, scalability experiment.
Figure 6: Norm of Lagrangian multipliers of max samples algorithms at each outer iteration of MIMIC. Panels: (a) MS; (b) BCMS; (c) CCMS.
Figure 7: Lagrangian multipliers at each outer iteration of MIMIC


  1. C. C. Aggarwal, C. Chen and J. Han (2010) The inverse classification problem. Journal of Computer Science and Technology 18 (1), pp. 458–468. Cited by: §1, §2.
  2. D. Barbella, S. Benzaid, J. Christensen, B. Jackson, X. V. Qin and D. Musicant (2009) Understanding support vector machine classifications via a recommender system-like approach. In the International Conference on Data Mining, pp. 305–311. Cited by: §1, §2.
  3. F. Boylu, H. Aytug and G. J. Koehler (2010) Induction over strategic agents. Information Systems Research 21 (1), pp. 170–189. Cited by: §1.
  4. Z. Che, S. Purushotham, K. Cho, D. Sontag and Y. Liu (2018) Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8 (6085), pp. 645–653. Cited by: §5.2.
  5. C. Chi, W. N. Street and M. M. Ward (2012) Individualized patient-centered lifestyle recommendations: an expert system for communicating patient specific cardiovascular risk information and prioritizing lifestyle options. Journal of Biomedical Informatics 45, pp. 1164–1174. Cited by: §2.
  6. Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu and J. Li (2018) Boosting adversarial attacks with momentum. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–12. Cited by: §1.
  7. A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C. Peng and H. E. Stanley (2000) PhysioBank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. 2155–220. Cited by: §1, §5.2.
  8. I. Goodfellow, J. Shlens and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations. Cited by: §2.
  9. A. E. W. Johnson, T. J. Pollard, L. Shen, L. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Scientific Data. Cited by: §1, §5.2.
  10. A. Kurakin, I. Goodfellow and S. Bengio (2017) Adversarial examples in the physical world. Computing Research Repository (CoRR) abs/1607.02533v4. Cited by: §2.
  11. M. T. Lash, Q. Lin, W. N. Street, J. G. Robinson and J. Ohlmann (2017) Generalized inverse classification. In the 2017 SIAM International Conference on Data Mining, pp. 162–170. Cited by: §2.
  12. M. T. Lash, Q. Lin, W. N. Street and J. G. Robinson (2017) A budget-constrained inverse classification framework for smooth classifiers. In IEEE International Conference on Data Mining Workshops, pp. 1184–1193. Cited by: §1, §2.
  13. T. Laugel, M. Lesot, C. Marsala, X. Renard and M. Detyniecki (2018) Comparison-based inverse classification for interpretability in machine learning. In Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations, J. Medina, M. Ojeda-Aciego, J. L. Verdegay, D. A. Pelta, I. P. Cabrera, B. Bouchon-Meunier and R. R. Yager (Eds.), pp. 100–111. ISBN 978-3-319-91473-2. Cited by: §1.
  14. D. Lowd and C. Meek (2005) Adversarial learning. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 641–647. Cited by: §1.
  15. Y. Luo, P. Szolovits, A. S. Dighe and J. M. Baron (2018) 3D-MICE: Integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. Journal of the American Medical Informatics Association 25 (6), pp. 645–653. Cited by: §5.2.
  16. C. J. Maddison, D. Tarlow and T. Minka (2014) A* sampling. In Proceedings of the 27th Conference on Neural Information Processing Systems, pp. 3086–3094. Cited by: §4.
  17. M. V. Mannino and M. V. Koushik (2000) The cost-minimizing inverse classification problem: a genetic algorithm approach. Decision Support Systems 29, pp. 283–300. Cited by: §1, §2.
  18. C. Molnar (2019) Interpretable machine learning: a guide for making black box models explainable. Online. Cited by: §1, §3.
  19. N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik and A. Swami (2016) The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy, pp. 372–387. Cited by: §2.
  20. A. Stec, D. Klabjan and J. Utke (2013) Unified recurrent neural network for many feature types. ArXiv e-prints abs/1809.08717. Cited by: §5.1.
  21. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow and R. Fergus (2013) Intriguing properties of neural networks. Computing Research Repository (CoRR) abs/1312.6199. Cited by: §2, §3.
  22. J. D. Tygar (2011) Adversarial machine learning. IEEE Internet Computing 15 (5), pp. 4–6. Cited by: §1.
  23. S. Wachter, B. Mittelstadt and C. Russell (2018) Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law and Technology 31 (2), pp. 842–887. Cited by: §1, §2.
  24. C. Yang, W. N. Street and J. G. Robinson (2012) 10-year CVD risk prediction and minimization via inverse classification. In the 2nd ACM SIGHIT Symposium on International Health Informatics, pp. 603–610. Cited by: §2.