Adversarial Examples for Cost-Sensitive Classifiers
Abstract
Motivated by safety-critical classification problems, we investigate adversarial attacks against cost-sensitive classifiers. We use current state-of-the-art adversarially-resistant neural network classifiers [1] as the underlying models. Cost-sensitive predictions are then achieved via a final processing step in the feedforward evaluation of the network. We evaluate the effectiveness of cost-sensitive classifiers against a variety of attacks, and we introduce a new cost-sensitive attack which performs better than targeted attacks in some cases. We also explore the measures a defender can take in order to limit their vulnerability to these attacks. This attacker/defender scenario is naturally framed as a two-player zero-sum finite game which we analyze using game theory.
1 Introduction
Many safety-critical classification problems are not indifferent to the different types of possible errors. For example, when classifying tumors in medical images one may be relatively indifferent between misclassifications within the supercategories of malignant or benign tumors, but may be particularly interested in avoiding misclassifications across those categories, for example misidentifying a malignant Lobular Carcinoma tumor as being instead a benign Fibroadenoma tumor [2].
It is a relatively simple matter to adjust the predictions of a trained classifier to reflect the different costs associated with the various types of classification errors using a formalism known as cost-sensitivity [3, 4]. Without cost-sensitivity, the most likely class is taken to be the prediction made by a classification model. In contrast, cost-sensitive classifiers make predictions by first computing the expected cost associated with each prediction, and then taking the class with the smallest expected cost to be the model prediction.
Motivated by safety-critical classification problems, we investigate adversarial attacks on cost-sensitive classifiers. We use current state-of-the-art adversarially-resistant neural network classifiers [1] as the underlying models, and we consider multiple types of attacks, as well as various defensive actions that may be taken to mitigate the effect of the attacks. Our key findings are:

Classifiers face a tradeoff between maximizing accuracy and minimizing cost:
Predictions can be made with the goal of either maximizing the accuracy or minimizing the expected cost. While these are not diametrically opposed goals (for example, a perfect classifier will incur zero cost), in practice there will be a tradeoff in which the classifier can make conservative predictions that lower both the cost and the overall accuracy.
The attacker faces a tradeoff between minimizing accuracy and maximizing cost:
Similarly, the attacker can craft adversarial examples designed to either minimize the defender's accuracy or increase their average cost. As before, these goals are not necessarily in conflict with one another; for example, if the attacks succeed 100% of the time, then both goals may be simultaneously accomplished. In practice, however, attacks will only succeed some fraction of the time, and the attacker will be faced with a tradeoff.
Calibration leads to both better defenses and more effective attacks:
The expected cost depends on the predicted probabilities of the neural network and not just on the overall class prediction. Therefore, it is important that the classifier produce accurate probability estimates. We find that both cost-sensitive defenses and attacks may be improved by calibrating these estimates.
The attacker/defender scenario is naturally analyzed in terms of game theory:
We explored many different pairings of attacks and defensive measures. The identification of good strategies becomes more difficult as the number of possible scenarios increases. We observe that this problem is naturally framed as a two-player zero-sum finite game, and therefore the game theoretic concepts of Nash equilibria and dominant strategies may be used to analyze the attacker/defender competition.
2 Cost-sensitivity for classification problems
In this section we provide a brief review of cost-sensitivity [3, 4]. Throughout this work we shall consider $K$-class classification problems and denote the inputs as $x$, and we will use the indices $i, j$ to run over all possible classes, i.e. $i, j \in \{1, \dots, K\}$.
Of central importance in cost-sensitive classification problems is the cost matrix $C_{ij}$, which is defined to be the cost of predicting class $i$ when the correct class is $j$. The cost may be measured in any units, since the cost-sensitive predictions are unaffected by scaling the cost matrix by an overall constant. We shall require that the costs are non-negative, $C_{ij} \ge 0$, with equality if and only if $i = j$, which reflects the fact that a correct classification should incur no cost. Given the cost matrix and $p_j(x)$, the model estimate for the probability that an input $x$ belongs to class $j$, the expected cost of predicting class $i$ will be denoted as $L(x, i)$, and is simply [3]
$$L(x, i) = \sum_{j=1}^{K} p_j(x)\, C_{ij} \qquad (1)$$
The cost-sensitive (CS) prediction is then the class for which the expected cost is smallest, i.e.
$$\hat{y}_{\mathrm{CS}} = \operatorname*{argmin}_{i}\, L(x, i) \qquad (2)$$
In contrast, in most classification settings the prediction is taken to be the most likely class:
$$\hat{y}_{\mathrm{MP}} = \operatorname*{argmax}_{i}\, p_i(x) \qquad (3)$$
which we shall refer to as the maximum probability (MP) prediction.
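As a concrete illustration, the two prediction rules of Eqs. 1-3 can be sketched in a few lines of NumPy. This is a minimal sketch; the function names, the 3-class cost matrix, and the probability vector below are hypothetical and are not taken from the experiments in this paper.

```python
import numpy as np

def expected_costs(p, C):
    """Expected cost L(x, i) of each possible prediction i (Eq. 1), given
    the class probabilities p (shape [K]) and the cost matrix C, where
    C[i, j] is the cost of predicting class i when the true class is j."""
    return C @ p  # L(x, i) = sum_j C[i, j] * p_j(x)

def predict_min_cost(p, C):
    """Cost-sensitive (CS) prediction, Eq. 2: smallest expected cost."""
    return int(np.argmin(expected_costs(p, C)))

def predict_max_prob(p):
    """Standard maximum probability (MP) prediction, Eq. 3."""
    return int(np.argmax(p))

# Hypothetical 3-class example in which the third class is very costly
# to misidentify as anything else:
C = np.array([[0.0, 1.0, 10.0],
              [1.0, 0.0, 10.0],
              [1.0, 1.0,  0.0]])
p = np.array([0.5, 0.3, 0.2])
```

With these numbers the MP prediction is the first class, while the CS prediction is the risk-averse third class: even a 20% chance of the costly class dominates the expected cost.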
2.1 Geometry of cost-sensitive predictions
To gain an intuition for how cost-sensitive predictions compare to the more standard maximum probability predictions, it is useful to consider the problem from a geometrical perspective. Binary classification is especially simple: if the probability of class 1 is denoted $p$, then the probability of class 2 is $1 - p$. The maximum probability prediction is determined by whether $p < 1/2$ (prediction is class 2) or $p > 1/2$ (prediction is class 1). The effect of cost-sensitivity then is to shift the decision threshold from $1/2$ to a new value $p^*$ determined by the relative cost of the two types of errors. That is, class 1 is predicted if $p > p^*$, where now
$$p^* = \frac{C_{12}}{C_{12} + C_{21}} \qquad (4)$$
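The shifted threshold can be made concrete with a one-line helper (the function name and the example costs are illustrative only):

```python
def cs_threshold(c12, c21):
    """Cost-sensitive decision threshold p* of Eq. 4 for binary
    classification.  c12 is the cost of predicting class 1 when the truth
    is class 2, and c21 the cost of predicting class 2 when the truth is
    class 1.  Class 1 is predicted whenever p > p*."""
    return c12 / (c12 + c21)
```

Equal costs recover the usual threshold of 1/2, while making a missed class 1 nine times as costly as a missed class 2 lowers the threshold to 0.1, so that class 1 is predicted even when it is relatively unlikely.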
The higher dimensional case of $K > 2$ is more interesting. In this case it is more useful to work in terms of the probability simplex, which is a $(K-1)$-dimensional hypersurface embedded in $K$ dimensions. The embedding coordinates are the class probabilities, i.e. $p_i(x)$ for an input $x$, and the simplex is the surface satisfying the constraints $p_i(x) \ge 0$ and $\sum_i p_i(x) = 1$. The vertices of the simplex are the points for which all the probability mass is placed on a single class. The simplex may be divided into $K$ cells such that all points within a single cell will lead to the same maximum probability prediction. This is depicted for $K = 3$ in Fig. 1. The effect of cost-sensitivity is to shift the cell boundaries, for example as in Fig. 2. In general, cells representing classes which are costly to misidentify, for example the malignant Lobular Carcinoma tumor discussed above, will expand, corresponding to an increased risk aversion.
2.2 Multiclass classification problems with two supercategories
A general cost matrix for $K$-class classification is determined by $K(K-1)$ parameters (assuming that the diagonals are zero, representing zero cost for correct predictions). Both for simplicity and because we are motivated by scenarios such as the benign/malignant tumor classification discussed above, we consider a much smaller family of cost matrices. We split the classes into 2 supercategories, which we call the "sensitive" and "insensitive" categories. Let there be $N_I$ members of the insensitive group and $N_S$ members of the sensitive group, and split the label index so that $i \in \{1, \dots, N_I\}$ runs over the insensitive group members, and $i \in \{N_I + 1, \dots, N_I + N_S\}$ over the sensitive group. We shall consider scenarios where the main concern is inter-category misclassifications, especially misclassifying a sensitive class as an insensitive class (i.e. misidentifying a malignant tumor as a benign tumor). Intra-category misclassifications will also have associated costs, albeit they will be less significant than inter-category costs.
In this scenario, we can break the cost matrix into 4 blocks,
$$C = \begin{pmatrix} C^{II} & C^{IS} \\ C^{SI} & C^{SS} \end{pmatrix} \qquad (5)$$
and take each constituent block matrix to be
$$C^{II}_{ij} = c_{II}\,(1 - \delta_{ij}), \quad C^{SS}_{ij} = c_{SS}\,(1 - \delta_{ij}), \quad C^{IS}_{ij} = c_{IS}, \quad C^{SI}_{ij} = c_{SI} \qquad (6)$$
Here the lowercase $c$'s are constants, and the $\delta_{ij}$ are Kronecker deltas. The constant $c_{II}$ is the cost of misclassifications within the insensitive supercategory, and $c_{SS}$ is similarly the cost of misclassifications within the sensitive supercategory. The off-diagonal term $c_{IS}$ represents the cost of mislabeling a sensitive class as insensitive, and vice versa for $c_{SI}$. Motivated by safety-critical scenarios where the most costly type of mistake is misidentifying a sensitive class as insensitive, we will assume that the different costs obey the following inequalities:
$$c_{IS} > c_{SI} \ge c_{SS} \ge c_{II} > 0 \qquad (7)$$
so that the cost matrix is determined by just the 4 independent constants $c_{II}$, $c_{SS}$, $c_{IS}$, and $c_{SI}$.
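The block structure above is easy to assemble programmatically. The sketch below follows this section's convention that the prediction indexes the rows, with the insensitive classes listed first; the helper name and the example constants are hypothetical.

```python
import numpy as np

def block_cost_matrix(n_i, n_s, c_ii, c_ss, c_is, c_si):
    """Assemble the (n_i + n_s) x (n_i + n_s) cost matrix of Eqs. 5-6.
    The first n_i indices form the insensitive supercategory and the
    remaining n_s indices the sensitive one; C[i, j] is the cost of
    predicting i when the truth is j."""
    C = np.empty((n_i + n_s, n_i + n_s))
    # intra-category blocks: constant cost off the diagonal, zero on it
    C[:n_i, :n_i] = c_ii * (1 - np.eye(n_i))
    C[n_i:, n_i:] = c_ss * (1 - np.eye(n_s))
    # inter-category blocks: mislabeling a sensitive class as
    # insensitive (c_is) is the most costly error
    C[:n_i, n_i:] = c_is
    C[n_i:, :n_i] = c_si
    return C
```

For example, `block_cost_matrix(2, 2, 1.0, 2.0, 10.0, 3.0)` produces a 4x4 matrix with a zero diagonal and a constant upper-right block of 10.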
3 Adversarial examples for cost-sensitive classifiers
Cost-sensitivity is particularly relevant for safety-critical scenarios because it enables classifiers to take into account the fact that some mistakes are more deleterious than others. This general framework naturally complements the context of adversarial examples, which are artificially generated inputs of a classifier designed to cause mistakes [5], and which are an important threat for safety-critical applications of classifiers. (See [6] for an analysis of the concrete ways in which adversarial examples are relevant for AI safety.) The general idea that different misclassifications are associated with different costs should be reflected both in how the classifier makes predictions and in the types of adversarial attacks a malicious actor would choose to employ in order to cause maximum damage.
Concretely, we consider an attacker/defender scenario in which the defender is a neural network classifier, and the attacker is an agent attempting to fool the defender by presenting it adversarial examples. We will investigate multiple types of attacks against classifiers making predictions according to both the maximum probability criterion and the minimum cost criterion. We also consider both white-box and black-box scenarios, in which the attacker has or does not have access to the defender network, respectively.
As usual, we take the attack to be defined by a constrained optimization problem. To set notation, let $f(x')$ be the objective function ($f$ may also depend on other quantities such as the target label), and the optimization problem is then
$$\max_{\delta \in S}\, f(x + \delta) \qquad (8)$$
Here $S$ is the attack set, the set of allowable perturbations $\delta$ around a given clean input $x$. Throughout this work, we will take $S$ to be an $\epsilon$-ball in the $\ell_\infty$ norm, i.e. $S = \{\delta : \|\delta\|_\infty \le \epsilon\}$. (Other authors, most notably [6], have noted a number of shortcomings in using this attack set for research into safety-critical implications of adversarial examples. We do not disagree with these observations, but will work with the $\ell_\infty$ ball nonetheless, both for mathematical convenience and because we regard this issue as orthogonal to the main idea of the current work, which is the relevance of cost-sensitivity to adversarial example research.) Independent of the objective function, in all cases we shall use the same projected gradient descent (PGD) method of [7] to solve the optimization problem and to generate examples. The PGD attack update rule is
$$x^{(t+1)} = \Pi_S\!\left( x^{(t)} + \alpha\, \mathrm{sign}\!\left( \nabla_x f\!\left(x^{(t)}\right) \right) \right) \qquad (9)$$
where $x^{(t)}$ represents a sequence of perturbed inputs, $\alpha$ is the step-size parameter, and $\Pi_S$ is a projection operator that projects the perturbation down to the attack set $S$. The initial perturbation, $x^{(0)} - x$, will be randomly initialized within the attack set $S$.
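The PGD iteration above can be sketched as follows. This is a framework-agnostic toy: for a real network the gradient would come from backpropagation, whereas here the gradient function is simply passed in as an argument, and the signed ascent step with clipping is the standard projection for an l-infinity ball.

```python
import numpy as np

def pgd_attack(x, grad_f, eps, alpha, steps, seed=0):
    """Projected gradient ascent on an attack objective f (Eqs. 8-9).
    grad_f(x_adv) returns the gradient of the objective at the perturbed
    input; after each signed ascent step the perturbation is projected
    back onto the l-infinity ball {delta : ||delta||_inf <= eps}.  The
    perturbation is randomly initialized inside the ball."""
    rng = np.random.default_rng(seed)
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        g = grad_f(x + delta)
        delta = delta + alpha * np.sign(g)   # ascent step on the objective
        delta = np.clip(delta, -eps, eps)    # projection onto the attack set
    return x + delta
```

Any of the objectives considered below (targeted or maximin) can be plugged in through `grad_f`; only the objective changes, not the optimizer.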
3.1 Targeted attacks
We will consider two types of adversarial attacks. The first is a targeted attack, where the objective function is given by the negative cross-entropy of the target label. That is, if $t$ is the target label, then
$$f(x'; t) = \log p_t(x') \qquad (10)$$
In terms of the probability simplex coordinates, the optimal solution is when all the probability mass has been placed on the target class, i.e. $p_j(x') = \delta_{jt}$. The target class could be chosen randomly, or it could be chosen to induce a particularly costly error. As an example in the cost-sensitive setting, an effective attack would be one which tricked the classifier into thinking that an input belonged to an insensitive class when in fact it belonged to a sensitive one.
3.2 Maximum minimum expected cost attacks
If the goal of the attacker is to increase the costs of the defender’s mistakes, it is natural to consider an attack which is designed to explicitly increase the expected cost. Therefore, we introduce the Maximum Minimum Expected Cost Attack (or maximin attack for short):
$$f(x') = \min_i L(x', i) = \min_i \sum_j p_j(x')\, C_{ij} \qquad (11)$$
Unlike the targeted attack, the maximin attack does not depend on the true class. Thus, the maximin attack always aims to modify the input so that the point in the probability simplex moves to the point of maximal $\min_i L(x, i)$, which by symmetry can be seen to be the intersection point where the expected costs are identical for all class predictions $i$. In particular, for the example of Fig. 2, this is the point where all 3 cell boundaries intersect. Because this attack aims to bring the probability estimate to an interior point in the simplex, as opposed to a vertex, it will not be as effective as a targeted attack with cost-sensitive targets, assuming that the optimization problem associated with both attacks can be fully solved. For example, for the cost matrix considered in Fig. 2, the expected cost at the intersection of all three cell boundaries is smaller than the cost that would be incurred if the prediction was class 1 and the true class was class 3. However, the optimization problem defining adversarial attacks is rarely able to be solved exactly, and thus there could well be instances where the maximin attack is more effective; indeed, we shall find this to be the case in what follows.
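The maximin objective itself is a one-liner on top of the expected cost. The symmetric 0/1 cost matrix below is a hypothetical example used only to illustrate the geometry: the objective vanishes at a simplex vertex and is maximal at the simplex center, where all predictions are equally costly.

```python
import numpy as np

def maximin_objective(p, C):
    """Objective of the maximin attack (Eq. 11): the minimum expected
    cost over all possible class predictions.  Maximizing it drives the
    probability estimate toward the interior point of the simplex where
    every prediction is equally costly."""
    return float(np.min(C @ p))

# Hypothetical symmetric 0/1 cost matrix for 3 classes:
C = 1.0 - np.eye(3)
```

For this cost matrix the objective is 0 at the vertex (1, 0, 0) and 2/3 at the center (1/3, 1/3, 1/3).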
As far as we are aware, we are the first to consider adversarial attacks designed to directly maximize the cost. Recently, Zhang and Evans [8] considered a cost-sensitive extension of Wong and Kolter's approach towards developing provably robust classifiers [9]. In the Zhang and Evans extension, robustness is defined with respect to cost, as opposed to the overall misclassification error. Our work is complementary to theirs, as we consider attacks designed to explicitly increase the cost.
4 Attack comparison
In this section we detail the numerical experiments used to compare the efficacy of the 3 different types of attacks considered here.
4.1 Experimental setup
We considered the task of image classification on the ImageNet dataset [10]. Our motivating interest is near-term scenarios in which an imperfect but high-performance image classification system is employed in a safety-critical application. Given the amount of attention adversarial examples have received, it seems plausible that many organizations will be cognizant of the threat posed by adversarial examples, and will therefore choose to employ models with some level of resistance. For simple enough problems one can obtain provable guarantees regarding robustness (see for example [9, 11] and references therein), but these methods do not currently scale to modern image classifiers trained on high-resolution images. (As this work was nearing completion, progress on this problem was made in [12].) Thus, we shall focus on problems for which the vulnerability to adversarial examples can only be mitigated, not fully eliminated or bounded.
We consider attacking networks which have been adversarially trained [13, 14, 15, 7], so that they are somewhat resistant to adversarial attacks. In particular, we used pre-trained models released as part of the recent work [1]. Three such pre-trained models were released: ResNeXt-101, ResNet-152 Denoise, and ResNet-152 Baseline. These models obtain between 62-68% top-1 accuracy on clean images, and 52-57% accuracy on adversarially perturbed images with random targets (we specify the attack details below). All three models were trained on adversarial examples, and the first two also incorporate a novel form of feature denoising to enhance their resistance to adversarial examples.
A simple but crucial point is that a cost matrix is required in order to implement cost-sensitive predictions. The cost matrix encapsulates the costs associated with different types of mistakes, but these may be hard to quantify in certain applications. To return to the example of identifying malignant tumors, clearly false positives are less costly mistakes than false negatives, but are they 10x worse, 100x worse, or 1000x worse? These valuations must be made for each application, and could involve a rich set of considerations which we shall not get into here. Instead, we simply consider a cost matrix with values chosen to be plausible. In particular, we let there be $N_I$ insensitive classes and $N_S$ sensitive classes. (We note that we randomly permuted the ImageNet labels in order to avoid grouping together similar classes in the insensitive/sensitive supercategories.) The costs are taken to be
(12) 
Although these values were mostly chosen arbitrarily, they were picked so that the effect of being cost-sensitive would be non-trivial. For example, as $c_{IS} \to \infty$, with the other values held constant, a cost-sensitive classifier will always err on the side of caution and predict a sensitive class. Similarly, if the differences in cost are very slight, then a cost-sensitive classifier will mostly make predictions according to the most likely class. These values were chosen to avoid either extreme. An additional complication is that an adversary may not know (or may only partially know) the cost matrix used by the defender network. Thus, for cost-sensitive adversarial examples the cost matrix becomes part of the white-box/black-box characterization of the problem. In this work, we assume that the cost matrix is known to the attacker.
4.2 Experimental results
We generated adversarial attacks using the ResNeXt-101 pre-trained model of Ref. [1], and evaluated the attacks against each of the 3 pre-trained models. The attack is a white-box attack when the defending network is the same ResNeXt-101 model used to generate the attacks, and it is a black-box attack when the defending network is either of the ResNet-152 models. We considered 3 types of attacks: targeted with random targets, targeted with cost-sensitive targets, and the maximin attack introduced in Sec. 3. We use the same attack parameters as in [1], and used PGD to generate attacks for varying numbers of steps $N$. The attacks are constrained to lie in an $\epsilon$-ball in the $\ell_\infty$ norm, and the step-size $\alpha$ was held fixed across attacks (apart from an adjustment for one value of $N$). Furthermore, each attack was randomly initialized in the ball.
In Table 1 we present the results for white-box attacks generated using the ResNeXt-101 model. The attack details are as follows. The number of PGD iterations $N$ was held fixed, and the results in this table were computed by averaging over 50,000 distinct attacks, one for each of the images in the ImageNet validation set. Both the accuracy and average cost are evaluated for the two prediction methods discussed above, maximum probability and minimum cost. The column abbreviations are: MP Acc, maximum probability prediction accuracy; MP Cost, maximum probability average cost; MC Acc, minimum cost prediction accuracy; MC Cost, minimum cost prediction average cost. The $\pm$ values indicate the 95% confidence intervals, which were computed by assuming that the means are normally distributed.
Attack Type      | MP Acc. (%) | MP Cost | MC Acc. (%) | MC Cost
ResNeXt-101:
clean images
random targets
max cost targets
maximin cost
There are a number of interesting observations to make. First, it is unsurprising that the accuracy is similar for both types of targeted attacks when the defending network makes maximum probability predictions, since in this case the cost-sensitive targeted attacks represent a fairly large subset of random targeted attacks. However, it is surprising that the cost-sensitive targeted attacks do such a poor job of increasing the cost for both types of predictions. This illustrates that for adversarially-resistant networks such as those of [1], targeted attacks are a poor way to increase the cost. The maximin cost attack outperforms all others when it comes to increasing the cost, although it unsurprisingly leads to fewer overall errors. The increase in cost is quite dramatic for a defending network making maximum probability predictions, and although the effect is less significant for minimum cost predictions, it still far outperforms either targeted attack.
We present additional results for black-box attacks and variable attack strength in Appendix A. The black-box attacks performed similarly to the white-box attacks, although they were (predictably) slightly less effective overall. Increasing the number of PGD steps $N$ significantly improved the performance of the attacks.
5 Calibration
In many machine learning applications, the only output of a classifier that is used is the class prediction. However, there are many scenarios in which the probability estimates are also used. Cost-sensitive learning is one such example, as the minimum cost prediction, Eq. 2, depends upon the estimates $p_j(x)$. A perfect classifier would place all the probability mass on the correct label, i.e. $p_j(x) = \delta_{jy}$ for true label $y$, and the minimum cost prediction would be $\hat{y}_{\mathrm{CS}} = y$. (Recall that we are assuming that the cost matrix satisfies $C_{ij} \ge 0$, with equality if and only if $i = j$.) For imperfect classifiers, a desirable property of the probability estimates is that they be calibrated [16]. A classifier is said to be calibrated if the prediction accuracy agrees with the probability estimates. For example, whenever a calibrated classifier makes a prediction of class $i$ for an input $x$ with $p_i(x) = 0.9$, it will be correct on average 90% of the time. As a result, the probability estimates of calibrated classifiers may be interpreted as confidences.
Both the minimum cost prediction, Eq. 2, and the maximum minimum expected cost attack, Eq. 11, depend directly on the probability estimates $p_j(x)$, and so it is natural to wonder whether calibration might significantly affect the results, for example by making the minimum cost predictions more robust, or the maximum minimum expected cost attack more effective. Both the attacker and the defender may separately elect to calibrate, leading to a total of four possible scenarios. The scenario where neither party calibrates was treated in the previous section, and in Appendix C we present results for the remaining scenarios (defender calibrates, attacker calibrates, and both calibrate). We also provide details on the temperature-scaling calibration method used in Appendix B.
6 Game theoretic analysis
In the above sections and in the appendices we have considered a total of 6 different attacks (targeted with random targets, targeted with cost-sensitive targets, and the maximin attack, each of which can be generated using either a calibrated or an uncalibrated network), as well as 4 types of predictions (maximum probability or minimum cost, each of which may be made using a calibrated or an uncalibrated network). A convenient framework for analyzing the resulting 24 possible scenarios is game theory.
The attacker/defender setup considered here may be formulated as a finite zero-sum two-player game. The payoff of the attacker is the average cost, and the defender's payoff is the negative average cost. The payoff matrix for this game may be obtained using the uncalibrated results of Table 1, together with the calibrated results presented in Tables 4, 5, 6 in Appendix C. Here, MP stands for "maximum probability", MC for "minimum cost", TR for "targeted with random targets", TC for "targeted with cost-sensitive targets", and MM for "maximin". Notice that the first two rows are identical: the temperature scaling calibration method used does not affect the maximum probability prediction, and therefore it also does not affect the average misclassification costs.
Table 2: Average cost (the attacker's payoff) for each pairing of defender strategy (rows) and attacker strategy (columns); u/c denote uncalibrated/calibrated.

                          Attacker
                  (TR, u)  (TR, c)  (TC, u)  (TC, c)  (MM, u)  (MM, c)
Defender (MP, u)    8.44     8.43     8.45     8.45    13.94    13.97
         (MP, c)    8.44     8.43     8.45     8.45    13.94    13.97
         (MC, u)    2.94     2.95     2.98     3.00     3.50     4.16
         (MC, c)    3.21     3.22     3.25     3.25     3.38     3.39
For this simple game, there is a single pure strategy Nash equilibrium (the bottom-right entry of Table 2), which is that the defender makes calibrated minimum cost predictions (MC, calibrated), and the attacker makes calibrated maximin attacks (MM, calibrated). Note that the calibrated maximin attack is a dominant strategy for the attacker, but the calibrated minimum cost prediction is not dominant for the defender.
The result of this simple game theory analysis is that, in terms of the average cost, both parties should calibrate, minimum cost predictions are better than maximum probability ones, and the best attack is the maximin attack. These conclusions may well change with the many factors that went into this analysis: the cost matrix, the underlying classification problem, the strength of the attacks (measured in terms of the number of PGD steps $N$ and the size of the attack set $S$), etc. However, this overall framework for comparing strategies should be generally applicable. It is possible that in more complicated scenarios the Nash equilibrium will be a mixed strategy, as opposed to the pure strategy found here.
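The equilibrium analysis can be verified mechanically. The sketch below takes the average-cost payoffs reported in Table 2 and checks for pure-strategy Nash equilibria (the defender minimizes cost over rows, the attacker maximizes over columns) and for weakly dominant attacker strategies; the function names are illustrative.

```python
import numpy as np

# Average cost for each (defender row, attacker column) strategy pair,
# as reported in Table 2.
cost = np.array([
    [8.44, 8.43, 8.45, 8.45, 13.94, 13.97],
    [8.44, 8.43, 8.45, 8.45, 13.94, 13.97],
    [2.94, 2.95, 2.98, 3.00,  3.50,  4.16],
    [3.21, 3.22, 3.25, 3.25,  3.38,  3.39],
])

def pure_nash_equilibria(cost):
    """All cells where the defender (row player, minimizing cost) and the
    attacker (column player, maximizing cost) are mutual best responses."""
    return [(r, c)
            for r in range(cost.shape[0]) for c in range(cost.shape[1])
            if cost[r, c] == cost[:, c].min() and cost[r, c] == cost[r, :].max()]

def dominant_attacker_strategies(cost):
    """Columns that are weakly best against every defender strategy."""
    return [c for c in range(cost.shape[1])
            if np.all(cost[:, [c]] >= cost)]
```

Both checks reproduce the discussion in the text: the unique pure equilibrium is the last row and column (the calibrated minimum cost defense against the calibrated maximin attack), and the calibrated maximin attack is the attacker's only dominant strategy.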
7 Conclusions and future directions
Safety-critical systems are not likely to operate by simply selecting the most likely outcomes; they will need to consider the cost of those outcomes and determine the probability thresholds for their predictions accordingly. At the same time, attacks on these cost-sensitive models are particularly important to study because of the critical nature of these systems. We demonstrated several white-box and black-box attacks on cost-sensitive classifiers built from state-of-the-art adversarially-resistant ResNet image classifiers. These classifiers were made resistant by training them on targeted adversarial examples, and we find that they are still vulnerable to attacks designed to increase the expected cost.
While our experimental results were generated for image classification systems, our general framework should apply more broadly to any classification problem. Cost-sensitive classifiers and attacks thereon can easily be envisioned for text analysis (e.g. be sure not to miss terrorist sentiments) or industrial plant operation (e.g. be sure not to miss irregular signals and alerts that lead to accidents). In fact, most applications are not indifferent between different types of misclassifications, making cost-sensitivity broadly applicable. When those applications are safety-critical, an analysis of the efficacy of attacks and defenses should be carried out.
We conclude with some directions for future work. Much of this work implicitly assumes that both parties (the defender and the attacker) know the cost matrix. In practice, it may be hard to convert an implicit value system based on possibly vague and loosely-shared principles into an explicit numerical matrix. Even when such a task is achievable, there are many scenarios where the attacker would not be expected to have access to this information. Thus, one area of future work involves studying the effect of imperfect knowledge of the cost matrix on the attacker, and whether the attacker can learn to infer the cost matrix by observing the classifier predictions (and in turn using this information to construct better attacks). It would also be interesting to study the effect of a noisy cost matrix, perhaps reflecting the challenges faced by the defender in encoding a value system into a cost matrix.
A second line of work would be to go beyond the pre-trained models of [1], and to consider other forms of adversarially-resistant models, especially ones for which analytic bounds could be obtained. In particular, it would be very interesting to apply cost-sensitivity to certifiable adversarial robustness [12], for which rigorous analytic results are possible. Lastly, it would also be interesting to extend beyond norm-based attacks, and consider more comprehensive attack sets [6].
Acknowledgments
We would like to thank our colleagues at RAND with whom we had many fruitful discussions: Jair Aguirre, Caolionn O’Connnell, Edward Geist, Justin Grana, Christian Johnson, Osonde Osoba, Éder Sousa, Brian Vegetabile and Li Ang Zhang. This work was funded by RAND Project Air Force, contract number FA701416D1000.
References
 (1) C. Xie, Y. Wu, L. van der Maaten, A. Yuille, and K. He, Feature denoising for improving adversarial robustness, arXiv preprint arXiv:1812.03411 (2018).
 (2) J. Xie, R. Liu, J. Lutrell IV, and C. Zhang, Deep learning based analysis of histopathological images of breast cancer, Frontiers in Genetics (2019).
 (3) C. Elkan, The foundations of cost-sensitive learning, in International Joint Conference on Artificial Intelligence, vol. 17, pp. 973-978, Lawrence Erlbaum Associates Ltd, 2001.
 (4) P. Domingos, MetaCost: A general method for making classifiers cost-sensitive, in KDD, vol. 99, pp. 155-164, 1999.
 (5) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199 (2013).
 (6) J. Gilmer, R. P. Adams, I. Goodfellow, D. Andersen, and G. E. Dahl, Motivating the rules of the game for adversarial example research, arXiv preprint arXiv:1807.06732 (2018).
 (7) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).
 (8) X. Zhang and D. Evans, Cost-sensitive robustness against adversarial examples, arXiv preprint arXiv:1810.09225 (2018).
 (9) E. Wong and J. Z. Kolter, Provable defenses against adversarial examples via the convex outer adversarial polytope, arXiv preprint arXiv:1711.00851 (2017).
 (10) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115 (2015), no. 3, 211-252.
 (11) A. Raghunathan, J. Steinhardt, and P. Liang, Certified defenses against adversarial examples, arXiv preprint arXiv:1801.09344 (2018).
 (12) J. M. Cohen, E. Rosenfeld, and J. Z. Kolter, Certified adversarial robustness via randomized smoothing, arXiv preprint arXiv:1902.02918 (2019).
 (13) I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572 (2014).
 (14) A. Kurakin, I. Goodfellow, and S. Bengio, Adversarial machine learning at scale, arXiv preprint arXiv:1611.01236 (2016).
 (15) H. Kannan, A. Kurakin, and I. Goodfellow, Adversarial logit pairing, arXiv preprint arXiv:1803.06373 (2018).
 (16) A. Niculescu-Mizil and R. Caruana, Predicting good probabilities with supervised learning, in Proceedings of the 22nd International Conference on Machine Learning, pp. 625-632, ACM, 2005.
 (17) C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, On calibration of modern neural networks, in Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1321-1330, JMLR.org, 2017.
 (18) M. P. Naeini, G. Cooper, and M. Hauskrecht, Obtaining well calibrated probabilities using Bayesian binning, in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
Appendix A Additional results for uncalibrated attacks and predictions
In Sec. 4.2 we presented results for white-box attacks, with neither party calibrating. These attacks were both generated by and submitted to the ResNeXt-101 model of [1]. Results for black-box attacks may be obtained by submitting these same attacks to the other two adversarially-robust models released by [1], the ResNet-152 Denoise and the ResNet-152 Baseline models. These are shown in Table 3, using the same attack parameters as in Table 1. These results are qualitatively similar to the white-box results, which demonstrates the transferability of adversarial attacks aimed at increasing the cost as well as the overall classification error.
Table 3: Black-box attacks submitted to the ResNet-152 DeNoise and ResNet-152 Baseline models.
Attack Type  MP Acc. (%)  MP Cost  MC Acc. (%)  MC Cost

ResNet-152 DeNoise
clean images
random targets
max cost targets
maximin cost
ResNet-152 Baseline
clean images
random targets
max cost targets
maximin cost
In addition to studying the transferability of attacks, we also investigated how the efficacy of the white-box attack depends on the number of attack steps. To this end, we generated 10,000 attacks with the number of steps ranging from 10 to 1000. The results are plotted below in Fig. 3, which shows the cost and accuracy for both types of predictions (maximum probability (MP) and minimum cost (MC)). The plots indicate that in many cases increasing the number of steps to about 100 or 200 significantly improves the efficacy of the attack. In particular, larger step counts allow the targeted attacks with cost-sensitive targets to outperform the attacks with random targets in all cases. Additionally, with additional steps the efficacy of the maximin attack decreases relative to the other attacks against a minimum cost classifier, as shown in the bottom-right panel. Against a maximum probability classifier, the maximin attack is far more effective at increasing the cost, as shown in the bottom-left panel.
Appendix B Temperature scaling calibration
The calibration of neural networks was originally studied in niculescu2005predicting . The issue was recently revisited for more modern architectures in guo2017calibration , and we shall adopt their methodology.
The extent to which a classifier is well-calibrated may be measured by the Expected Calibration Error (ECE) naeini2015obtaining , which is defined as
(13) $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$
Here, $\{B_m\}$ with $m = 1, \dots, M$ represents a binning of predictions and $n$ is the total number of samples. Predictions are grouped into bin $B_m$ if their confidence (i.e. probability estimate) lies within the interval $\left( \frac{m-1}{M}, \frac{m}{M} \right]$. Within each bin, the overall accuracy $\mathrm{acc}(B_m)$ and average confidence $\mathrm{conf}(B_m)$ are computed. An ECE of 0 indicates that the classifier is perfectly calibrated.
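The binning computation in Eq. 13 can be sketched in a few lines of NumPy. This is an illustrative sketch: the function name and the choice of 15 bins are our own, not taken from the original experiments.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE of Eq. 13: bin predictions by confidence, then average the
    per-bin gap between accuracy and confidence, weighted by bin size."""
    confidences = probs.max(axis=1)        # probability estimate of the predicted class
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    n = len(labels)
    ece = 0.0
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # bin B_m collects samples whose confidence lies in ((m-1)/M, m/M]
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = accuracies[in_bin].mean()
            conf = confidences[in_bin].mean()
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated classifier (per-bin confidence equal to per-bin accuracy) yields an ECE of 0 under this computation.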
There are many techniques for calibrating a classifier. Perhaps the simplest is temperature scaling, in which the softmax operation relating the logits $z_i$ to probabilities $\hat{p}_i$ is modified via a temperature parameter $T$ as follows:
(14) $\hat{p}_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$
For $T = 1$, this reduces to the usual softmax operation. For $T > 1$, the probabilities are squeezed to become closer to one another, and for $T < 1$ the probabilities are pushed apart so that there is a wider disparity between them. The extreme limit $T \to \infty$ corresponds to a uniform distribution, and the limit $T \to 0$ places all probability mass on the most probable label. An important property of temperature scaling is that it preserves the ordering of the probabilities. For example, temperature scaling cannot change the sign of the relative log probabilities. Temperature scaling may be used to calibrate a classifier by using a separate validation set to find the optimal temperature which minimizes the ECE, and then using this temperature to calibrate the probability estimates on the test set data.
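Eq. 14 can be sketched directly in NumPy (the function name is ours; this is illustrative rather than the exact code used in the experiments):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature-scaled softmax of Eq. 14. T=1 recovers the standard
    softmax; T>1 squeezes the probabilities together, T<1 pushes them
    apart. The argmax (and hence the MP prediction) is unchanged."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Calibration then amounts to a one-dimensional search over $T$ on a held-out validation set for the value that minimizes the ECE.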
Both the minimum cost prediction, Eq. 2, and the maximum minimum expected cost attack, Eq. 11, depend directly on the probability estimates, and so it is natural to wonder whether calibration might significantly affect the results, for example by making the minimum cost predictions more robust, or the maximum minimum expected cost attack more effective. We investigated this issue for the white-box attacks in which both the attacking and defending networks were the pretrained ResNeXt-101 model of xie2018feature . First, we evaluated the calibration of the ResNeXt-101 model using 5000 images, representing 10% of the full validation set. The ECE was found to be 0.055, representing a fairly well-calibrated classifier. To gain a better sense of the calibration, in Fig. 4 below we plot the so-called reliability diagram guo2017calibration , showing the per-bin accuracy $\mathrm{acc}(B_m)$ vs. the per-bin confidence $\mathrm{conf}(B_m)$.
The above reliability diagram and ECE value of 0.055 used the standard softmax operation, i.e. $T = 1$. Allowing $T$ to vary, the ECE was minimized at an optimal calibration temperature.
Appendix C Calibration scenarios
The optimal calibration temperature found above could be used by the defender, by the attacker, or by both. The defender would be motivated to use calibrated probabilities so that their minimum cost predictions would be (hopefully) more accurate, and similarly the attacker would be motivated to use calibrated probabilities to generate more effective attacks. Thus, in the tables below we show results for the case where the defender calibrates but the attacker does not (Table 4), the case where the defender does not calibrate but the attacker does (Table 5), and the case in which both defender and attacker calibrate (Table 6). The case in which neither party calibrates is covered above in Table 1. In all cases, the same calibration temperature was used, and the results in the tables correspond to an average over the 45,000 validation images not used in the calibration step.
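For concreteness, the two prediction rules compared in these tables can be sketched as follows. This is a minimal sketch, under the convention (assumed here, not stated in this appendix) that `cost_matrix[i, j]` is the cost of predicting class `j` when the true class is `i`:

```python
import numpy as np

def max_prob_prediction(probs):
    """MP rule: predict the most probable class."""
    return probs.argmax(axis=1)

def min_cost_prediction(probs, cost_matrix):
    """MC rule (Eq. 2): predict the class j minimizing the expected
    cost sum_i probs[i] * cost_matrix[i, j]."""
    return (probs @ cost_matrix).argmin(axis=1)
```

Because the MC rule consumes the full probability vector rather than just its argmax, recalibrating the probabilities (e.g. by temperature scaling) can change the MC prediction while leaving the MP prediction untouched, which is why calibration matters here.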
Table 4: Results when the defender calibrates but the attacker does not.
Attack Type  MP Acc. (%)  MP Cost  MC Acc. (%)  MC Cost

ResNeXt-101
clean images
random targets
max cost targets
maximin cost
Table 5: Results when the attacker calibrates but the defender does not.
Attack Type  MP Acc. (%)  MP Cost  MC Acc. (%)  MC Cost

ResNeXt-101
clean images
random targets
max cost targets
maximin cost
Table 6: Results when both the attacker and the defender calibrate.
Attack Type  MP Acc. (%)  MP Cost  MC Acc. (%)  MC Cost

ResNeXt-101
clean images
random targets
max cost targets
maximin cost
In discussing the results, let us first draw attention to the impact of calibration on the clean images. The maximum probability statistics are unaffected, which is to be expected since the temperature scaling method of calibration used here cannot change the maximum probability prediction.^6 For the minimum cost predictions, the accuracy drops a nontrivial amount (from 61.2% to 57.5%) and the cost decreases slightly.

^6 The astute reader will have noticed that there are in fact slight differences between the MP results for clean uncalibrated and calibrated images. These are due to the fact that the averages computed in this section are over 45,000 images, as opposed to the 50,000 used in the previous section.
Turning next to the effect of calibration on the efficacy of the attacks, the results show that calibration (by either party) has a significant impact on the minimum cost predictions, but not on the maximum probability ones. In discussing the results, we take the perspective of the defender and assume that the attacker is held fixed. Consider first the case of an uncalibrated attacker. The results show that the two types of targeted attacks are much more effective against a calibrated minimum cost defender than against an uncalibrated one: the accuracy decreases (from about 41% to about 34%) and the cost increases (from about 3 to about 3.2). Interestingly, the trend is reversed for the maximin attack, which is more effective against an uncalibrated minimum cost classifier (a cost of 3.50, compared to 3.38 for a calibrated one). Thus, whether the defender should calibrate or not depends on the attack type.
Consider next the case in which the attacker calibrates. Once again, the maximum probability statistics are only very weakly affected by the defender’s decision to calibrate. For the minimum cost predictions, it is again the case that the targeted attacks are more effective against a calibrated defender, whereas the maximin attack is rendered less effective by calibration. Here the distinction is even more pronounced than before. The cost for an uncalibrated minimum cost classifier is 4.16, and drops to 3.39 after calibration.
To summarize, calibration is important for minimum cost classifiers. A defender can reduce their vulnerability to a maximin attack designed to increase the expected cost by calibrating, and similarly an attacker can increase the effectiveness of the maximin attack against a minimum cost defender by calibrating. Against targeted attacks, however, calibration can decrease the defender’s performance. In Sec. 6 we use game theory to conduct a more systematic analysis of the various strategies available to both the attacker and defender.