When Not to Classify: Anomaly Detection of Attacks (ADA) on DNN Classifiers at Test Time

David J. Miller, Yujia Wang, George Kesidis

The authors are with the School of EECS, Pennsylvania State University, University Park, PA, 16802. E-mail: {djm25,gik2}@psu.edu. This work was supported in part by an AFOSR DDDAS grant and a Cisco Systems URP gift.

A significant threat to the recent, wide deployment of machine learning-based systems, including deep neural networks (DNNs), for a host of application domains is adversarial learning (Adv-L) attacks. While attacks that corrupt training data are of concern, the main focus here is on exploits applied against (DNN-based) classifiers at test time. While much work has focused on devising attacks that make perturbations to a test pattern (e.g., an image) which are human-imperceptible and yet still induce a change in the classifier’s decision, there is a relative paucity of work in defending against such attacks. Moreover, our thesis is that most existing defense approaches “miss the mark”, seeking to robustify the classifier to make “correct” decisions on perturbed patterns. While, unlike some prior works, we make explicit the motivation of such approaches, we argue that it is generally much more actionable to detect the attack, rather than to “correctly classify” in the face of it. We hypothesize that, even if human-imperceptible, adversarial perturbations are machine-detectable. We propose a purely unsupervised anomaly detector (AD), based on suitable (null hypothesis) density models for the different DNN layers and a novel Kullback-Leibler “distance” AD test statistic. This paper addresses the fundamental questions: 1) When is it appropriate to aim to “correctly classify” a perturbed pattern? 2) What is a good AD detection statistic, one which exploits all likely sources of anomalousness associated with a test-time attack? 3) What is a good multivariate density model for different layers of a DNN (which may use different activation functions)? 4) Where in the DNN (in an early layer, a middle layer, or at the penultimate layer) will the most anomalous signature manifest? Tested on MNIST and CIFAR10 image databases under the prominent attack strategy proposed by Goodfellow et al. [5], our approach achieves compelling ROC AUCs for attack detection of 0.992 on MNIST, 0.957 on noisy MNIST images, and 0.924 on CIFAR10. We also show that a simple detector that counts the number of white regions in the image achieves 0.97 AUC in detecting the attack on MNIST proposed by Papernot et al. [12].


adversarial learning, test-time attack, anomaly detection, deep neural networks, Kullback-Leibler decision statistic, mixture distribution, rectified linear unit (RELU), penultimate layer

I Introduction

We are in the midst of a great era in machine learning (ML), which has found broad applications ranging from military, industrial, medical, multimedia/Web, and scientific (including genomics) to even the political, social science, and legal arenas. However, as ML systems are ever more broadly deployed, they become ever more enticing targets, both for individual hackers as well as for nation-state intelligence services, which may seek to “break” them. Thus, adversarial learning (Adv-L) has become a hot topic, with researchers from both the security and ML communities mainly (asymmetrically) focusing on devising various types of attacks, but also some defenses against same. Focusing on statistical classification, prominent attack types include: i) tampering with labeled training data to degrade a learned classifier’s accuracy, e.g. [6],[9],[17]; ii) reverse engineering attacks, which seek to learn a non-public (black box/undisclosed) classifier’s decision-making rule by making numerous (even random) queries to the classifier [15]; and, perhaps most importantly, iii) foiling attacks, wherein test (operational) examples are perturbed, imperceptibly with respect to human perception, but in such a way that the classifier’s decision will change (and now disagrees with the consensus human decision), e.g. [12],[5],[2]. Such attacks may e.g. cause an autonomous vehicle to fail to recognize a road sign, or an automated system to falsely target a civilian vehicle.

Recently, in [10], the authors published a somewhat critical review of research in this area. Amongst other concerns, the ones that are of particular relevance to the present paper are the following:

  1. Some foiling attacks, such as [12], essentially presume that the classifier is defenseless. In particular, while the perturbed patterns in [12] do induce changes to the classifier’s decision (thus demonstrating the attack’s “success”), they also manifest artifacts (e.g. salt and pepper noise) that are quite visible in their published figures. These artifacts should be relatively easy to (automatically) detect in practice, as we will show in the sequel. Thus, their attack should only be successful against a defenseless system.

  2. The foiling attacks in [5],[12] assume that one must decide on one of the known categories – there is no possibility to reject a sample. Classification with a rejection option [Footnote 1: This is routinely applied, for example, by Siri.] can provide optimal (minimum risk) decision rules when informed by the relative costs of various kinds of classification errors and the cost of sample rejection. Moreover, rejecting an attack sample is in fact, in certain scenarios, logically the correct decision to make and the one with least cost/consequences, e.g. leaking the least information to the attacker. This will be explained in the sequel. Moreover, in security-sensitive settings, where the stakes for test-time attacks are the highest, the problem is often not classification per se, but rather authentication [Footnote 2: As of 2015, there is a speaker recognition option as part of Siri. Also, authentication often involves simple multifactor criteria, e.g. CAPTCHAs. Moreover, limited privileges may be enforced, e.g. Siri cannot be used to enter data (particularly passwords) into Web pages.]. In a well-designed biometric authentication system, if there is any significant decision uncertainty (or atypicality) associated with the presented pattern, the system will reject it (e.g., deny access to an individual). In doing so, one is essentially deciding that the given test pattern, though “closest” to a particular (authenticated/known) class, amongst all such classes, is “too anomalous” relative to the typical patterns seen from that class – moreover, one is weighing the cost of false positives (invalid accesses) much higher than that of false rejections (invalid access denials).

  3. [12] and [5] (strongly) assumed that the classifier structure and its parameter values are known to the attacker. Recent work has proposed techniques to reverse-engineer a (black box) classifier without necessarily even knowing its structure. In [15], the authors consider black box machine learning services, offered by companies such as Google, where, for a given (presumably big data, big model) domain, a user pays for class decisions on individual samples (queries) submitted to the ML service. [15] demonstrates that, with a relatively modest number of queries (perhaps as many as ten thousand or more), one can learn a classifier on the given domain that closely mimics the black box ML service decisions. Once the black box has been reverse-engineered, the attacker need no longer subscribe to the ML service. Moreover, reverse-engineering also enables foiling attacks (which do require knowledge of the classifier’s decision rule). One weakness of [15] is that it considers neither very large (feature space) classification domains nor very large networks (deep neural networks (DNNs)) – orders of magnitude more queries may be needed to reverse-engineer a DNN on a large-scale domain. However, a much more critical weakness of [15] stems from one of its (purported) greatest advantages – the authors tout that their reverse-engineering does not require any labeled training samples from the domain [Footnote 3: For certain sensitive domains, or ones where obtaining real examples is expensive, the user may in fact have no realistic means of obtaining a significant number of real data examples from the domain. This is one main reason why the ML service is needed in the first place – the company or its client are the (exclusive) owners of this (labeled, precious) data resource, on the given domain.]. In fact, in [15], the attacker’s queries to the black box are randomly drawn, e.g. uniformly, over the given feature space. While such random querying is demonstrated to achieve reverse-engineering, what was not recognized in [15] is that this makes the attack easily detectable by the ML service – randomly selected query patterns will typically look nothing like legitimate examples from any of the classes – they are very likely to be extreme outliers, of all the classes. Each such query is thus individually highly suspicious by itself – thus, even a few, let alone thousands of such queries, should be trivially detected as jointly improbable under a null distribution (estimable from the training set defined over all the classes from the domain). Even if the attacker employed bots, each of which makes a small number of queries (even as few as five), each bot’s random queries should also easily be detected as anomalous, likely associated with a reverse-engineering attack.
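The rejection option raised in item 2 can be illustrated with a minimal sketch. It assumes unit cost for any misclassification and a fixed cost for rejecting, in which case the minimum-risk rule is to classify only when the largest posterior is at least one minus the rejection cost. The function name and the cost value are illustrative assumptions and are not taken from [5], [12], or the system proposed in this paper.

```python
def classify_with_reject(posteriors, reject_cost=0.2):
    """Return the index of the most probable class, or None to reject.

    Assumes unit cost for any misclassification and cost `reject_cost`
    for rejecting. The expected cost of deciding class k is then
    1 - posteriors[k], so rejecting is cheaper whenever
    max(posteriors) < 1 - reject_cost.
    """
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    if posteriors[best] < 1.0 - reject_cost:
        return None  # "don't know": rejecting has lower expected cost
    return best


confident = classify_with_reject([0.05, 0.90, 0.05])  # one class dominates
ambiguous = classify_with_reject([0.45, 0.40, 0.15])  # no class is convincing
```

Raising `reject_cost` makes the rule more willing to commit to a class; a security-sensitive authentication setting would typically choose a small value so that uncertain patterns are rejected.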

What is central to all three of these critiques is that the papers [12],[5],[15], and others ignore the potential of a (purely unsupervised) anomaly detection (AD) approach for defeating the attack. While the likely effectiveness of AD to defeat [15] and even [12] is obvious, it is less clear such an approach will be effective against a (less human-perceptible) attack such as [5]. However, this will be demonstrated in the sequel.

Irrespective of the approach taken (whether AD or some other approach), there is a relative paucity of work in general on defenses against foiling attacks. Several recent works include [3],[16],[13]. The basic premise and objective taken in these papers is to robustify the classifier so that a foiling pattern that is a perturbation of a pattern from class A is still assigned to class A by the classifier. [3] modifies the support vector machine (SVM) training objective to ensure the learned weight vector is not sparse. Thus, if the attacker corrupts some features, other (unperturbed) features still contribute to decision-making. However, [3] may fail if only a few features are strongly class-discriminating. [16] considers DNNs for digit recognition and malware detection. They randomly nullify (zero) input features, both during training and use/inference. This may “eliminate” perturbed features. There are, however, several shortcomings here. First, for the malware domain, the features as defined in [16] are binary, taking values in $\{0, 1\}$. Thus, nullifying (zeroing) does not alter a feature’s original value (if it is zero). We suggest recoding the binary features to $\{-1, 1\}$. Now, nullifying (zeroing) always changes the feature value. This may improve the performance of the method in [16]. Second, there is a significant tradeoff in [16] between accuracy in correctly classifying attacked examples and accuracy of the classifier in the absence of attacks. As the nullification rate is increased, the frequency of defeating the attack increases, but accuracy in the absence of attack decreases. For the CIFAR domain, in the best case, one fourth of attack examples still cause misclassifications – with significant loss in accuracy absent the attack. Clearly, the attack is still quite effective, even with the feature nullification from [16]. [13] on the other hand reports relatively small loss in accuracy in the absence of attacks, for their “distillation” defense strategy.
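The recoding suggestion for [16] can be checked with a small sketch. It applies the same random nullification mask to a binary feature vector coded as 0/1 and to the same vector recoded as -1/+1, then counts how many features actually change value. The function name, vector length, and nullification rate are illustrative assumptions; this is not the implementation of [16].

```python
import random


def nullify(features, rate, rng):
    """Zero each feature independently with probability `rate`."""
    return [0 if rng.random() < rate else v for v in features]


rng = random.Random(0)
binary01 = [rng.randint(0, 1) for _ in range(1000)]  # features coded as 0/1
recoded = [2 * v - 1 for v in binary01]              # same features as -1/+1

# Seed both masks identically so the same positions are nullified.
masked01 = nullify(binary01, 0.5, random.Random(1))
masked_pm = nullify(recoded, 0.5, random.Random(1))

# Under 0/1 coding only nullified ones change; under -1/+1 every
# nullified feature changes, because zero is not a legal feature value.
changed01 = sum(a != b for a, b in zip(binary01, masked01))
changed_pm = sum(a != b for a, b in zip(recoded, masked_pm))
```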

However, a more fundamental limitation of [16],[3], and [13] concerns semantics of inferences. Consider digit recognition. Foiling-resistance means a perturbed version of ‘3’ is still classified as a ‘3’. This may make sense if the perturbed digit is still objectively recognizable (e.g. by a human being) as an instance of a ‘3’. In such case, it may be desirable to “robustly” classify this pattern as a ‘3’ (it also may not be desirable, even in this case, to make such a decision – this will be explained in the sequel). However, the perturbed example may no longer be unambiguously recognizable as a ‘3’ – recall the perturbed digit examples from [12], with significant salt and pepper noise and other artifacts. For some of the published images in [12], “don’t know” may be the most reasonable answer. Moreover, irrespective of whether the perturbed pattern is class-ambiguous, on its face it appears to be far more important, operationally, to recognize that the classifier is being subjected to a foiling attack than to “correctly” classify in the face of the attack. Once an attack is detected, preventive measures to defeat the attack may be taken – e.g., blocking the access of the attacker to the classifier. Moreover, actions that are typically made based on the classifier’s decisions may either be preempted or (conservatively) modified. For example, for an autonomous vehicle, once an attack on its image recognition system is detected, the vehicle may take the following action sequence: 1) slow down, move to the side of the road, and stop; 2) await further instructions. Similarly, a machine that is actuated based on recognized voice commands might be put into a “sleep” mode, under which it can do no damage. Similar “conservative” actions might be taken after attack detection in other application domains (involving financial transactions, medical diagnosis, etc.) – surely, in a medical diagnostic setting, one should not try to make sense of (make diagnosis based upon) a fabricated X-ray or MRI image – and if one has the means to detect a fabricated image, one should do so prior to performing any diagnosis. If an attack is detected, a new image scan should be taken, with the diagnosis then made based on trusted image data. In [16],[3], and [13] it is presumed that “correctly” classifying attacked test patterns is the “right” objective, without considering the (AD) alternative for which we are advocating.

Beyond the above (merely rational) argument that detection is more important than classification here, one can also consider the problem more formally. The following analysis is quite simple – however, we are not aware that it has been given elsewhere. Specifically, let us simply recognize that there are only two mechanisms by which an attacked image is forwarded to a classifier (see Figure 1).

Fig. 1: The two possible foiling attack mechanisms, with the attacker either compromising the data source directly or intercepting data on its way to the classifier.

In one mechanism, there is an honest generator of an image, but it is then intercepted and perturbed by an adversary (essentially, a “man-in-the-middle” attack) [Footnote 4: Standard techniques to defend against man-in-the-middle attacks could involve encryption of the whole message (image) or appending a hash of the image using the honest generator’s private cryptographic key or hash function. The latter just detects tampering in transit. Such encryption comes at non-negligible computational cost. In this paper, we do not assume such cryptographic techniques are used against a man-in-the-middle attack. This is a reasonable assumption for cases where the data sources are, e.g., legacy Internet-of-Things (IoT) devices with little computational and storage power understanding only reduced protocol sets. Whether the attack is a man-in-the-middle may also not be known.]. However, there is a second mechanism, wherein the adversary is the generator of the original image (or has hijacked/compromised the image/sample generation mechanism) [Footnote 5: Considering this threat, one may deploy redundant sources, leading to kinds of Byzantine consensus problems which, again, are not considered in this paper.]. Only in the first case is it meaningful to try to “correctly” classify the image forwarded to the classifier. For the latter mechanism, correctly classifying the perturbed image is at best moot. It may in fact be detrimental to do so. In particular, note that, while there are two attacked image mechanisms, there are three possibilities regarding the recipient of the classifier’s decision. One is that the honest generator is the recipient of the classifier’s decision. However, if the attacker can intercept the image from an honest generator, why can it not also intercept the classifier’s decision (again, as a “man-in-the-middle”)? (If it does so, it can then even fabricate a decision and forward it to the honest generator.) This is the second possible recipient – the attacker. The third possibility is that both the honest generator and the attacker receive the classifier’s decision. Note that if the attacker is the image generator and also the recipient of the decision, then seeking to “correctly classify” will give the most information to the attacker (e.g., if the attacker has only partial knowledge of the classifier’s decision rule) – producing a “don’t know” decision gives less information to the attacker, and does not even necessarily reveal that the attack has been discovered (the classifier might be known to use classification with a rejection option, irrespective of possible attacks). It is only meaningful to “correctly classify” when there is an honest generator and when this honest generator will be the (a) recipient of the classifier’s decision. However, even under this particular scenario, for the reasons articulated previously, attack detection is much more important than robust classification – if no attack is detected, one can still make a best effort to correctly classify the pattern. This two-step process, with detection followed by classification when no attack is detected, is the structure of our proposed system.
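The two-step structure described above (detect first, classify only if no attack is detected) can be sketched as a thin wrapper. The detector and classifier are passed in as callables; the toy stand-ins below are hypothetical placeholders used only to exercise the control flow, not the detector or the DNN studied in this paper.

```python
def detect_then_classify(x, detector_score, classify, threshold):
    """Two-step decision: flag an attack, otherwise classify.

    `detector_score` maps an input to an anomaly score (larger means
    more anomalous) and `classify` maps an input to a class label.
    Both are caller-supplied and hypothetical here.
    """
    score = detector_score(x)
    if score > threshold:
        # No class decision is released when an attack is declared.
        return {"attack_detected": True, "label": None, "score": score}
    return {"attack_detected": False, "label": classify(x), "score": score}


def toy_score(x):
    """Toy anomaly score: magnitude of a scalar input."""
    return abs(x)


def toy_classify(x):
    """Toy classifier: sign of a scalar input."""
    return int(x >= 0)


clean = detect_then_classify(0.3, toy_score, toy_classify, threshold=2.0)
attacked = detect_then_classify(-5.0, toy_score, toy_classify, threshold=2.0)
```

Returning no label on detection matches the argument above that a “don’t know” decision leaks less information to an attacker who receives the classifier’s output.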

While [3], [16], and [13] focus on achieving “correct” classification in the face of attacks, more recently [4] did investigate an AD defense against foiling attacks. However, ultimately, their paper put forward a supervised learning method, learning to discriminate “attack” from “no attack”, using supervised attack examples generated by the methods from [5] and [12]. Several comments are in order here. First, [4]’s supervised approach may not be effective in classifying never-before-seen attacks – it may only be accurate in classifying attacks generated using the methods of [5] and [12]. This was not investigated in [4]. Second, the authors ultimately settled on a supervised approach because their unsupervised (pure AD) method did not achieve very good detection results. However, there are several limitations of the AD proposed in [4], which are remedied by the novel approach presented here.

  • [4] only evaluated atypicality of a test sample relative to a null density model for the class predicted by the DNN – the likelihood with respect to this density being lower than a threshold is one indication that the sample is an attack instance. However, consider the process of generating an attack sample. One starts with a “normal” sample from one category (call it $c_s$, the source category) and then perturbs it until the classifier’s decision is altered to a destination category (call it $c_d$). In addition to expecting that the perturbed sample may have “too low” likelihood under $c_d$, one might also expect it to have “too high” likelihood under $c_s$. The method proposed here exploits this additional source of atypicality and demonstrates that this yields significant gains in detection accuracy, compared with [4].

  • [4] used a kernel-based density estimator to evaluate the likelihood under the null. Here, we additionally investigate mixtures of Gaussians, with the number of mixture components chosen using the Bayesian Information Criterion [14].

  • [4] only considered atypicality of the feature vector at the output of the penultimate layer of the network. Fundamentally, there is the question of where in a CNN or DNN (in an early layer, a middle layer, or at the penultimate (e.g. RELU) layer) will the most anomalous signature manifest? In fact, the most powerful version of our approach exploits anomalies which may manifest at any one of several layers.
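For reference, the Bayesian Information Criterion used for choosing the number of mixture components has the standard form below, where $\hat{\mathcal{L}}_M$ is the maximized likelihood of an $M$-component mixture fit to $N$ samples and the model with the smallest BIC is selected. The parameter count $k_M$ shown assumes full-covariance Gaussian components in $d$ dimensions; that covariance structure is an assumption made here for illustration, since it is not specified at this point in the text.

```latex
\mathrm{BIC}(M) = -2 \log \hat{\mathcal{L}}_M + k_M \log N,
\qquad
k_M = (M - 1) + M d + M \, \frac{d(d+1)}{2}
```

The three terms of $k_M$ count the free mixing proportions, the component means, and the symmetric covariance matrices, respectively.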

II Proposed Method: Anomaly Detection of Adversarial Attacks (ADA)

II-A Notation and Setup

Consider a “raw” feature vector $x$, which could e.g. represent a (scanned) array of gray scale values comprising a digital image. Consider an $L$-layer DNN. Let $P_{\rm DNN}[C=c \mid x]$ be the DNN’s a posteriori probability that $x$ originates from class $c$, amongst the categories in a classification problem with $K$ known categories. Without loss of generality we represent these categories by the integers $\{1,\ldots,K\}$. There is a labeled training set $\mathcal{X} = \{\mathcal{X}_1,\ldots,\mathcal{X}_K\}$, where $\mathcal{X}_c$ are the labeled training samples from class $c$. We have two purposes for this training set. First, it is used to learn the DNN posterior model, $P_{\rm DNN}[C=c \mid x]$ (via a suitable DNN training method). Second, suppose that $z$ is the output vector for layer $l$ of the DNN, $l \in \{1,\ldots,L-1\}$, when $x$ is the input to the DNN – layer $l$ could be sigmoidal, an RELU, or even a max-pooling layer of the DNN. Then, by feeding each of the training examples from class $c$, $x \in \mathcal{X}_c$, into the (already trained) DNN and extracting the layer $l$ output vector for each such example, we can create a layer $l$ derived feature vector training set conditioned on class $c$ (with explicit notational dependence on $l$ omitted for simplicity), i.e. $\mathcal{Z}_c$. For each such derived training set $\mathcal{Z}_c$, representative of class $c$, one can learn the class-conditional density, assuming a particular parametric density form and performing suitable model learning (e.g., maximum likelihood estimation, coupled with model-order selection techniques such as Bayesian Information Criterion [14], to estimate the model structure and “order” (e.g., the number of components, in the case of a mixture density)). Denote the resulting learned class-conditional densities by $f_{Z|c}(z)$, $c = 1,\ldots,K$. These densities together constitute a “null hypothesis model” – where the null hypothesis is that a test vector $z$ is the result of feeding in an unperturbed image from one of the $K$ categories into the DNN and extracting the $l$-th layer output of the DNN. The alternative hypothesis, accordingly, is that $z$ is the result of feeding an attacked (perturbed) image, call it $\tilde{x}$, into the DNN.
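The construction of the class-conditional null model can be sketched as follows. To stay short and dependency-free, the sketch fits a single diagonal-covariance Gaussian per class by maximum likelihood; this is a deliberate simplification standing in for the mixture densities with BIC-selected order described above. The function names, the variance floor, and the toy "layer outputs" are all illustrative assumptions.

```python
import math
from collections import defaultdict


def fit_null_models(layer_features, labels, var_floor=1e-6):
    """Fit one diagonal Gaussian per class to layer-output vectors."""
    by_class = defaultdict(list)
    for z, c in zip(layer_features, labels):
        by_class[c].append(z)
    models = {}
    for c, rows in by_class.items():
        n, d = len(rows), len(rows[0])
        mean = [sum(r[j] for r in rows) / n for j in range(d)]
        # Floor the variance so a constant feature cannot give zero variance.
        var = [max(sum((r[j] - mean[j]) ** 2 for r in rows) / n, var_floor)
               for j in range(d)]
        models[c] = (mean, var)
    return models


def log_density(z, model):
    """Log of the diagonal-Gaussian null density evaluated at z."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (zj - m) ** 2 / v)
               for zj, m, v in zip(z, mean, var))


# Made-up layer outputs for two classes, for illustration only.
feats = [[0.0, 0.1], [0.2, -0.1], [3.0, 3.1], [2.8, 2.9]]
labs = [0, 0, 1, 1]
null = fit_null_models(feats, labs)
test_z = [0.1, 0.0]
scores = {c: log_density(test_z, null[c]) for c in null}
```

In practice the feature vectors would come from running the trained DNN on the training images and reading off the chosen layer; working with log densities avoids underflow in high-dimensional layers.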

II-B The Method from [4]

The AD proposed in [4] consists of the following operations, given a test pattern (image), $x$:

  1. Determine the maximum a posteriori class under the DNN model: $\hat{c}_d = \arg\max_{c} P_{\rm DNN}[C=c \mid x]$.

  2. Compute $z = g_l(x)$, where $g_l(\cdot)$ is the function whose input is $x$ and whose output is the layer $l$ output of the DNN.

  3. Evaluate the null density of the predicted class, $f_{Z|\hat{c}_d}(z)$, and declare an attack instance if this value falls below a preset threshold.
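The three steps above can be sketched with a Gaussian kernel density estimate as the null model. The sketch assumes the layer output for the test image and the per-class training feature vectors have already been extracted; the bandwidth, the threshold, and the function names are illustrative assumptions rather than the settings used in [4].

```python
import math


def kde_log_density(z, samples, bandwidth):
    """Log of a Gaussian-kernel density estimate at z (log-sum-exp form)."""
    d = len(z)
    log_norm = -0.5 * d * math.log(2 * math.pi * bandwidth ** 2)
    exps = [-sum((a - b) ** 2 for a, b in zip(z, s)) / (2 * bandwidth ** 2)
            for s in samples]
    m = max(exps)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exps))
    return log_norm + log_sum - math.log(len(samples))


def baseline_detect(dnn_posterior, z, class_samples, bandwidth, log_threshold):
    """MAP class, then a low-likelihood test under that class's KDE."""
    c_hat = max(range(len(dnn_posterior)), key=lambda c: dnn_posterior[c])
    log_f = kde_log_density(z, class_samples[c_hat], bandwidth)
    return c_hat, log_f, log_f < log_threshold


# Made-up penultimate-layer features for two classes.
samples = [[[0.0, 0.0], [0.1, 0.1]], [[3.0, 3.0], [3.1, 2.9]]]
typical = baseline_detect([0.1, 0.9], [3.0, 3.0], samples, 0.5, -5.0)
atypical = baseline_detect([0.1, 0.9], [1.5, 1.5], samples, 0.5, -5.0)
```

Note that only the density of the predicted class is consulted; the second example is flagged solely because its features lie far from the training features of that class.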

Beyond the above-described procedure, full specification of the method in [4] requires the choice of the layer, $l$, and the density function “family” $f_{Z|c}(\cdot)$. [4] used the penultimate layer of the DNN (the layer immediately preceding the decision layer, $l = L-1$) and chose a simple, Gaussian kernel-based density estimator. We will discuss these aspects further below. However, without even considering these choices, we propose an improved detection procedure, which fundamentally exploits more information about a possible anomaly, to better overall effect (as demonstrated by our results), compared to the procedure from [4].

II-C The Proposed ADA Method

Consider a successful attack example – one which was obtained by starting from a “clean” example $x$ from an (unknown) source class $c_s$ and then perturbing it until the DNN’s decision on this perturbed example (now $\tilde{x}$) is no longer $c_s$, but is now $\hat{c}_d \neq c_s$ (the “destination” class). The premise behind the approach in [4] is that a test pattern $z$ which results from feeding an attacked version of $x$, not $x$ itself, into the DNN, will have atypically low likelihood under the density model for the DNN-predicted class $\hat{c}_d$. While we expect that this may be true, if the perturbation of $x$ is not very large (consistent with its human-imperceptibility), we might also expect that $z$ will exhibit too much typicality (too high a likelihood) under some class other than $\hat{c}_d$, i.e. under the source category, $c_s$. It does not matter that the source category is unknown. We can simply determine our best estimate of this category as: $\hat{c}_s = \arg\max_{c \neq \hat{c}_d} f_{Z|c}(z)$, with the associated “typicality” $f_{Z|\hat{c}_s}(z)$.

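Accordingly, we hypothesize that attack patterns should be both “too atypical” under $\hat{c}_d$ and “too typical” under $\hat{c}_s$. While this may seem to entail an unwieldy detection strategy that may require use of two detection thresholds, we instead propose a single, theoretically-grounded decision statistic that captures both requirements (jointly assessing the “atypicality” and the “typicality”). Specifically, define a two-class posterior evaluated with respect to the (density-based) null model, i.e.: $P = [\, p_0 f_{Z|\hat{c}_s}(z),\; p_0 f_{Z|\hat{c}_d}(z) \,]$, where $p_0$ gives the proper normalization: $p_0 = \left( f_{Z|\hat{c}_s}(z) + f_{Z|\hat{c}_d}(z) \right)^{-1}$.

Likewise, define the corresponding two-class posterior evaluated via the DNN: $Q = [\, q_0 P_{\rm DNN}[C=\hat{c}_s \mid x],\; q_0 P_{\rm DNN}[C=\hat{c}_d \mid x] \,]$, where $q_0 = \left( P_{\rm DNN}[C=\hat{c}_s \mid x] + P_{\rm DNN}[C=\hat{c}_d \mid x] \right)^{-1}$.

Both deviations (“too atypical” and “too typical”) are captured in the Kullback-Leibler “distance” decision statistic: $D_{\rm KL}(P \,\|\, Q) = \sum_{c \in \{\hat{c}_s, \hat{c}_d\}} P(c) \log \frac{P(c)}{Q(c)}$, i.e. we declare a detection when this statistic exceeds a preset threshold value.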
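The decision statistic described above can be sketched in a few lines. The sketch assumes that the per-class log null densities at the chosen layer and the DNN posterior for the same test image are already available as lists indexed by class; the function name, the `eps` guard against a zero DNN posterior, and the toy numbers are illustrative assumptions.

```python
import math


def ada_kl_statistic(class_log_density, dnn_posterior, eps=1e-12):
    """KL(P || Q) over the pair (estimated source class, predicted class).

    class_log_density[c]: log null density of the layer features under c.
    dnn_posterior[c]:     DNN posterior of class c for the same input.
    """
    k = len(dnn_posterior)
    c_d = max(range(k), key=lambda c: dnn_posterior[c])  # DNN decision
    c_s = max((c for c in range(k) if c != c_d),
              key=lambda c: class_log_density[c])        # best other class
    # Two-class posterior P from the null densities, normalized stably.
    m = max(class_log_density[c_s], class_log_density[c_d])
    f_s = math.exp(class_log_density[c_s] - m)
    f_d = math.exp(class_log_density[c_d] - m)
    p = [f_s / (f_s + f_d), f_d / (f_s + f_d)]
    # Two-class posterior Q from the DNN outputs.
    q_s, q_d = dnn_posterior[c_s], dnn_posterior[c_d]
    q = [q_s / (q_s + q_d), q_d / (q_s + q_d)]
    kl = sum(pi * math.log(pi / max(qi, eps))
             for pi, qi in zip(p, q) if pi > 0)
    return kl, c_s, c_d


# Null model and DNN agree on class 1: small statistic.
agree = ada_kl_statistic([-50.0, -2.0, -40.0], [0.01, 0.98, 0.01])
# DNN says class 1 but the features look like class 2: large statistic.
attack = ada_kl_statistic([-50.0, -30.0, -2.0], [0.01, 0.98, 0.01])
```

In the second call the null model puts almost all of its two-class mass on class 2 while the DNN puts almost all of its mass on class 1, so both the “too typical” and “too atypical” effects push the statistic up.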

Note that one of (several, possible) interpretations for the (asymmetric) roles of $P$ and $Q$ in $D_{\rm KL}(P \,\|\, Q)$ is that $P$ represents the “true” posterior, with $Q$ an alternative model posterior whose “agreement” with the true $P$ is what we would like to assess. We place the null model posterior estimate in the “true” position because we believe it better reflects actual class uncertainties in a test pattern than does the DNN posterior. Moreover, in our experiments we have found that this choice leads to better AD detection power than what one gets by exchanging the roles of $P$ and $Q$ (and also modestly better than what one obtains using a symmetric version of KL divergence).

We emphasize that our expectation that an attacked pattern will exhibit both “too high atypicality” (with respect to $\hat{c}_d$) and “too high typicality” (with respect to $\hat{c}_s$) does not require any strong assumptions about the attacker – it merely assumes that the attack is seeking to be imperceptible to a human being. To achieve this, the attacked example should be definitively recognizable by a human being as a legitimate (visually artifact-free) instance of one of the categories ($c_s$). This constraint, in conjunction with the attack’s success (the DNN classifying it to $\hat{c}_d$) may necessitate that the test pattern will exhibit unusually high likelihood for a category other than that predicted by the DNN. At the same time, because the (successful) attack example should not appear to a human to belong to $\hat{c}_d$, this may necessitate that the test pattern will exhibit unusually low likelihood under $\hat{c}_d$.

II-D Other Design Choices, Yielding ADA Extensions

There are several strategies for extending the basic ADA method described above:

  1. Layer and Null Model Choices: [4] chose $l = L-1$, the penultimate layer of the DNN, and used a Gaussian kernel-based density estimator. We investigate Gaussian mixture densities in place of the kernel-based estimate. We also investigate several different layers, $l$, for extraction of the feature vector, $z$.

  2. Maximizing KL over different layers: Rather than restricting to a single layer, for a given test image one can measure the KL distance at multiple layers and choose, as the decision statistic, the maximum KL distance across the different layers. This may enhance detection performance, as anomalous signatures may not always prominently manifest in the same (e.g., penultimate) layer. We refer to this approach as ADA-maxKL.

  3. Considering All Classes: Instead of just considering $\hat{c}_s$ and $\hat{c}_d$, it is also possible to form probability vectors for $P$ and $Q$ over all $K$ classes. While our “model” for the attack suggests that most of the anomaly signature may manifest with respect to $\hat{c}_s$ and $\hat{c}_d$, more information may be exploitable by considering all classes.

  4. Exploiting uncertainty about $c_s$ and knowledge of class confusion: ADA as defined so far makes a hard decision estimate of the source class, $\hat{c}_s$. Alternatively, we can estimate the probability that $c_s = c$ via

    This can be used to evaluate an average KL distance (considering all possibilities for $c_s$). However, going further, suppose we have knowledge of the classifier’s confusion matrix, reflecting normal class confusion in the absence of an attack (obtained e.g. from a validation set). Then, a class pair with very small confusion probability is much more likely associated with an attack than a pair with high class confusion. Accordingly, we suggest weighting the KL distance by a confusion-dependent factor [Footnote 6: We replace zero probabilities in the confusion matrix by small values and renormalize so that it remains a pmf, conditioned on each class.]. This weighting increases the decision statistic for pairs that are unlikely to occur due to normal (non-attack) classifier confusion. Combining both these ideas, we construct the Average, Weighted ADA (AW-ADA) statistic for a given layer as:

    where . Moreover, this can be evaluated for different layers, with the decision statistic its maximum over all considered layers. The resulting approach is dubbed AW-ADA-maxKL. As will be seen, this approach achieves significant improvement in detection accuracy on the CIFAR10 data set.
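Two of the extensions listed above, maximizing over layers (item 2) and forming the posteriors over all classes (item 3), can be sketched together. Combining them in one function is an illustrative choice made here for brevity, and the all-class null posterior assumes equal class priors; neither is stated in the text. The confusion-based weighting of item 4 is not sketched because its exact form is not given in this section. All names and numbers are hypothetical.

```python
import math


def normalize(weights):
    """Scale nonnegative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two pmfs over the same set of classes."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def all_class_kl(class_log_density, dnn_posterior):
    """Compare null-model and DNN posteriors over all classes."""
    m = max(class_log_density)
    p = normalize([math.exp(v - m) for v in class_log_density])
    return kl_divergence(p, normalize(dnn_posterior))


def max_kl_over_layers(per_layer_log_density, dnn_posterior):
    """Use the largest KL value found at any monitored layer."""
    return max(all_class_kl(ld, dnn_posterior)
               for ld in per_layer_log_density)


# Layer 0 agrees with the DNN (class 0); layer 1 favors class 1 instead.
layers = [[-2.0, -30.0, -40.0], [-25.0, -3.0, -40.0]]
post = [0.97, 0.02, 0.01]
stat = max_kl_over_layers(layers, post)
```

Here the anomaly is invisible at the first layer and only surfaces at the second, which is the situation the maximization over layers is meant to handle.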

We next investigate these AD variants, along with comprehensive experimental comparisons between our proposed methods and that of [4].

III Experimental Results

We have evaluated our proposed ADA detection method and its variants in comparison with [4], considering several image databases, several attack strategies, and a few different experimental scenarios.

III-A Data Sets

We experimented on the MNIST [8] and CIFAR10 [1] data sets. MNIST is a ten-class dataset with 60,000 grayscale images, representing the digits ‘0’ through ‘9’. CIFAR10 is a ten-class dataset with 60,000 color images, consisting of various animal and vehicle categories. Both data sets consist of 50,000 training images and 10,000 test images, with all classes equally represented in both the training and test sets. For anomaly detection purposes, the data batch under consideration in our experiments consists of the test images plus the crafted attack images (whose generation is discussed below).

III-B Classifiers

For training deep neural networks, we used mini-batch gradient descent with a cross entropy loss function and a mini-batch size of 256 samples. For MNIST, we trained the LeNet-5 convolutional neural-net [18]. This neural net reaches an accuracy of 98.1% on the MNIST test set. For CIFAR10, we used a 12-layer deep neural network architecture suggested in [4]. This neural net, once trained, reaches an accuracy of 83.1% on the CIFAR10 test set.

III-C Attacks

During the phase of crafting adversarial samples, we only perturbed test set samples that were correctly classified. This is plausible since the attack is not by definition truly successful unless it causes a misclassification [Footnote 7: In practice, the attacker would start from a data sample (image) with a known ground-truth label, one that the attacker knows the classifier correctly classifies. Thus, with knowledge of the classifier, the attacker will know if his crafted sample is successful or not.]. We implemented the fast gradient sign method (FGSM) attack [5] and the Jacobian-based Saliency Map Attack (JSMA) [12] on the image data sets. FGSM is a “global” method, making small magnitude perturbations, but to all pixels in the image. By contrast, JSMA is a more “localized” attack, making changes to far fewer pixels, but with large changes needed on these pixels, in order to induce successful attacks (misclassifications). For JSMA, we implemented the version that alters a (minimal set of) dark pixels, changing them from dark to white.

For each test sample, from a particular class (e.g., $c_s$), we randomly selected (in an equally likely fashion) one of the other classes (e.g., $c_d$) and generated an attack instance starting from the test image, using the given attack algorithm, such that the classifier will assign the perturbed image to class $c_d$. In this way, for MNIST, we successfully crafted 9845 adversarial images using the JSMA attack and 9762 adversarial samples using the FGSM attack. For CIFAR10, we only implemented the FGSM attack [Footnote 8: The JSMA attack in [12] was only applied to MNIST, not CIFAR10.]. For this data set, we successfully crafted 9243 adversarial images. Note that, in all cases, the attack success rate was high (albeit higher on MNIST).
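To make the “global, small-magnitude” character of FGSM concrete, here is a minimal sketch of a single fast-gradient-sign step against a toy logistic-regression model, for which the input gradient of the cross-entropy loss is available in closed form. The model, weights, step size, and clipping to [0, 1] are illustrative assumptions; the experiments in this paper attack DNNs and steer the decision toward a chosen destination class, a distinction that collapses in this two-class toy.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fgsm(x, y, w, b, eps):
    """One fast-gradient-sign step against a logistic-regression model.

    For cross-entropy loss, d(loss)/dx = (sigmoid(w.x + b) - y) * w.
    Every input coordinate moves by at most eps, then is clipped to [0, 1].
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [min(1.0, max(0.0, xi + eps * s)) for xi, s in zip(x, sign)]


w, b = [2.0, -1.0, 0.5], -0.2      # made-up model parameters
x, y = [0.6, 0.3, 0.5], 1          # clean input, correctly classified as 1
x_adv = fgsm(x, y, w, b, eps=0.3)

p_clean = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
p_adv = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
```

Every coordinate is perturbed, each by the same small amount, which is the contrast with JSMA drawn above.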

Some attack examples are shown for FGSM and JSMA attacks on MNIST in Figure 2 and Figure 3, respectively. Note that while FGSM is generally thought of as an “imperceptible” attack, there are ‘ghost’ artifacts in Figure 2 that are visually perceptible. For CIFAR10, though, we did indeed find FGSM attacks to be visually imperceptible. JSMA attacks, on the other hand, are quite visually perceptible in Figure 3 – there are extra white pixels and also visible salt and pepper noise. Thus, while JSMA does induce misclassifications, it is arguable whether the attack is fully “successful” in the sense defined in [12], in that the resulting images do have significant artifacts that might cause a human being to misclassify the attacked image in some cases (or to profess “I don’t know”). The latter may be especially true for some of the attack images starting from the ‘5’ and ‘3’ categories.

III-D Noisy Data Scenario

The MNIST data is fairly “clean” – the images are gray scale, but many pixels are nearly white or nearly black. In order to model the scenario where the data is messier (and thus where attack detection may be more challenging), we considered experiments where Gaussian noise was added to image intensity values of non-attack images. For the case of FGSM attacks, noisy images were obtained by adding Gaussian noise to every pixel, with the mean and variance chosen to match the mean and variance of the perturbations produced by the FGSM attack. For the case of JSMA attacks, the same approach was applied (using the mean and variance estimated for the JSMA attack), but on a randomly chosen subset of pixels, whose size was chosen to equal the number of pixels modified by the JSMA attack. In this way, for both attacks, (noisy) non-attack images are generated that are more difficult to distinguish from attack images than the original (clean) non-attack images.
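The noise-matching step can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function names are hypothetical, the perturbation statistics are estimated over modified pixels only, and the noisy intensities are clipped to [0, 1], all of which are assumptions.

```python
import random
import statistics

def perturbation_stats(clean_images, attacked_images):
    """Mean and variance of the nonzero pixel changes, plus the average
    number of modified pixels per image (images are flat lists of floats)."""
    deltas = []
    for clean, attacked in zip(clean_images, attacked_images):
        deltas.extend(a - c for c, a in zip(clean, attacked) if a != c)
    mean = statistics.fmean(deltas)
    var = statistics.pvariance(deltas, mu=mean)
    n_modified = round(len(deltas) / len(clean_images))
    return mean, var, n_modified

def add_matched_noise(image, mean, var, n_pixels=None, rng=random):
    """Add Gaussian noise to every pixel (n_pixels=None, the FGSM-style case)
    or to a random subset of n_pixels pixels (the JSMA-style case)."""
    if n_pixels is None:
        indices = range(len(image))
    else:
        indices = rng.sample(range(len(image)), n_pixels)
    noisy = list(image)
    for i in indices:
        # Clipping to the valid intensity range is an assumption of this sketch.
        noisy[i] = min(1.0, max(0.0, noisy[i] + rng.gauss(mean, var ** 0.5)))
    return noisy

# Toy usage on a hypothetical 100-pixel gray image.
rng = random.Random(0)
image = [0.5] * 100
global_noisy = add_matched_noise(image, mean=0.0, var=0.01, rng=rng)
local_noisy = add_matched_noise(image, mean=0.0, var=0.01, n_pixels=10, rng=rng)
```

In the local case only `n_pixels` entries change, mirroring the small number of pixels that a JSMA-style attack modifies.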

For experiments involving noisy images, the experimental protocol was as follows. First, we designed classifiers and crafted attack images for the clean data set. Next, after estimating the mean and variance of the attack perturbations, we added noise to the training set. We then retrained the classifier and estimated the class-conditional null densities based on the noisy training set. We then crafted new attack images working from the original test images – this is required because the attack images should be successful in causing misclassifications on the new classifier (trained on the noisy training images). Finally, we added noise to the original test samples, creating noisy (non-attacked) test samples. [Footnote 9: For MNIST, in the noisy case, the noisy classifier’s test set accuracy was evaluated both for the attack from [5] and for the attack from [12]; in both cases, adding noise did not substantially compromise the accuracy of classification.] The noisy test samples and the new attack samples form the batch on which anomaly detection is performed. In the case of CIFAR10, since the color images are intrinsically “noisy”, we do not add any noise – all CIFAR10 experiments were thus based on the original training and test images.

Fig. 2: FGSM adversarial image matrix, with starting true class on the row and classified class on the column. The main diagonal shows non-attacked, correctly classified images, while other entries are successful attack images.
Fig. 3: JSMA adversarial image matrix.

III-E Null Density Modelling

For modelling null densities, we considered both the Gaussian kernel density estimator used in [4] and Gaussian mixture models (GMMs), with the number of mixture components chosen to minimize the Bayesian Information Criterion. For the Gaussian kernel density, the variance parameter was chosen to maximize likelihood on the training set. For the GMM, we considered both full covariance matrices and diagonal covariance matrices, depending on the dimensionality of the DNN layer being modelled. Table I below shows the layer-dependent GMM modelling choices we made. Note that LeNet-5 has one max-pooling layer, while the 12-layer neural net has two max-pooling layers.

Network           | penultimate layer | max-pooling layer
LeNet-5 for MNIST | full covariance   | diagonal covariance
12-layer DNN      | full covariance   | diagonal covariance
TABLE I: GMM modeling choices
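A plausible reason for tying the covariance structure to layer dimensionality is the number of free parameters, which grows quadratically with dimension for full covariances and only linearly for diagonal ones. The sketch below counts GMM parameters and applies the standard BIC formula to choose the number of components. The layer dimensions and log-likelihood values are hypothetical, and the helper names are illustrative rather than taken from the paper.

```python
import math

def gmm_num_params(n_components, dim, covariance="full"):
    """Free parameters of a GMM: mixing weights, means, and covariances."""
    weights = n_components - 1
    means = n_components * dim
    if covariance == "full":
        covs = n_components * dim * (dim + 1) // 2
    elif covariance == "diag":
        covs = n_components * dim
    else:
        raise ValueError("covariance must be 'full' or 'diag'")
    return weights + means + covs

def bic(log_likelihood, num_params, num_samples):
    """Bayesian Information Criterion (lower is better)."""
    return -2.0 * log_likelihood + num_params * math.log(num_samples)

def select_num_components(loglik_by_m, num_samples, dim, covariance):
    """loglik_by_m maps a component count to its maximized training log-likelihood."""
    return min(
        loglik_by_m,
        key=lambda m: bic(loglik_by_m[m], gmm_num_params(m, dim, covariance), num_samples),
    )

# Hypothetical layer sizes: a low-dimensional layer versus a high-dimensional one.
for dim in (84, 1024):
    print(dim, gmm_num_params(3, dim, "full"), gmm_num_params(3, dim, "diag"))

# Hypothetical log-likelihoods for 1, 2, and 3 components on 500 training samples.
best_m = select_num_components({1: -1000.0, 2: -900.0, 3: -895.0},
                               num_samples=500, dim=2, covariance="full")
print(best_m)
```

In the toy selection, the third component improves the likelihood too little to pay for its extra parameters, so the two-component model is preferred.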

III-F Anomaly Detection Scenarios

In our experiments, there are essentially 3 different experimental scenarios that were investigated.

  • clean case: We do not craft any noisy samples. The training phase is based on the original training set and the test batch for AD consists of the original test samples and the crafted adversarial samples. All the experiments for CIFAR10 were done under this case.

  • noisy case: The experimental protocol for this case was previously discussed. The test batch for AD in this case consists of noisy versions of the original test samples and crafted adversarial samples that induce misclassifications by the classifier that was trained on the noisy training set. Note that the detection problem is expected to be more difficult here, compared with the clean case, as will be borne out by our results.

  • mismatch case: noise is added only to the original test samples, while training is still based on the original (clean) training set. In this case, we included in the AD test batch the original (clean) test images, the noisy test images, and the adversarial images. Because the classifier and null modeling are based on clean data whereas noisy data is included in the AD test batch, we refer to this as the “mismatched” case. This case is useful for assessing system robustness (when training and test conditions are statistically mismatched).

III-G Results

Fig. 4: ADA-GMM ROC for MNIST under the FGSM attack
Fig. 5: ADA-maxKL ROC for noisy MNIST under the FGSM attack
Case     | ADA-kernel | kernel in [4] | ADA-GMM | ADA-maxKL
clean    | 0.9703     | 0.9746        | 0.9918  | NA
noisy    | 0.9049     | 0.8752        | 0.8695  | 0.9569
mismatch | 0.9201     | 0.7825        | NA      | NA
TABLE II: AUC scores on MNIST dataset with FGSM attack

Table II shows results on MNIST under the FGSM attack. In the first 3 columns, we only null-modeled the penultimate layer of the DNN. For the case of ADA-maxKL (applied in the noisy case), maxKL is based on two layers – the single maxpooling layer and the penultimate layer. The penultimate layer is modeled using a GMM with a full covariance matrix, while the maxpooling layer is modeled by a Gaussian kernel. Note that all the methods work very well for the clean MNIST data set, but with ADA-GMM giving the best results and a highly compelling 0.992 AUC. For the noisy case, the maxKL paradigm, considering anomalies in two different layers, is needed to get the best results for ADA. This method significantly improves over the other ADA methods and over [4], achieving 0.957 AUC. For these two (best) results, we show the ROC curves in Figures 4 and 5. For the mismatch case, our basic ADA-kernel method, which exploits null information from both the source and destination classes, is seen to substantially outperform the method from [4], which exploits information from only one class.
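For readers who want to reproduce the kind of AUC figures reported in the tables, the following is a minimal sketch of AUC computed directly from decision statistics, using its interpretation as the probability that a randomly chosen attack sample scores higher than a randomly chosen non-attack sample. The convention that larger statistics mean "more anomalous" and the toy scores are assumptions for illustration.

```python
def roc_auc(null_scores, attack_scores):
    """AUC as P(attack score > null score) + 0.5 * P(tie)."""
    wins = 0.0
    for a in attack_scores:
        for n in null_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(attack_scores) * len(null_scores))

# Hypothetical detector outputs on non-attack and attack samples.
null_scores = [0.1, 0.2, 0.3]
attack_scores = [0.25, 0.4, 0.5]
print(roc_auc(null_scores, attack_scores))
```

This pairwise form is quadratic in the batch size; for large test batches a rank-based computation gives the same value more efficiently.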

We also note that since we propose to detect and then classify if there are no detections, our system changes the distribution of the samples being classified (only those not falsely detected as attacks), which could in principle affect accuracy of the classifier. However, we have found that at relatively modest false detection rates (e.g. 5% or less), there are extremely modest changes in the classifier’s (conditional) test set accuracy (based on the test set that excludes false detections). This is true for both MNIST and CIFAR10.
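The effect described above can be checked with a small computation: pick a detection threshold, discard the non-attack samples that are falsely flagged, and measure accuracy on the remainder. The sketch below assumes larger statistics are more anomalous; the function names and toy values are hypothetical.

```python
def false_detection_rate(null_stats, threshold):
    """Fraction of non-attack samples whose statistic exceeds the threshold."""
    return sum(1 for s in null_stats if s > threshold) / len(null_stats)

def conditional_accuracy(null_stats, labels, predictions, threshold):
    """Classifier accuracy on the non-attack samples that are not flagged."""
    kept = [(y, y_hat)
            for s, y, y_hat in zip(null_stats, labels, predictions)
            if s <= threshold]
    if not kept:
        raise ValueError("every sample was flagged; accuracy is undefined")
    return sum(1 for y, y_hat in kept if y == y_hat) / len(kept)

# Hypothetical detection statistics, true labels, and classifier decisions.
stats = [0.1, 0.9, 0.2, 0.3]
labels = [0, 1, 1, 0]
predictions = [0, 0, 1, 0]
print(false_detection_rate(stats, 0.5))
print(conditional_accuracy(stats, labels, predictions, 0.5))
```

In this toy batch the single falsely detected sample is also the only misclassified one, so conditional accuracy rises; in general the change can go in either direction.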

Note that, beyond exploiting the source and destination classes, it is possible to define probability vectors on the full complement of classes, with KL distance measured between these probability vectors. Table III shows the AUC difference between just using 2 classes (source and destination) and using all classes. In this case, we applied the ADA-kernel detector in penultimate layer modelling on the CIFAR-10 dataset. In this experiment and, in general, anecdotally, we have found that the gains in going from use of just the source and destination classes to use of all classes are typically modest. This validates the main idea of the ADA detection paradigm – that an attack example is mainly expected to be “too atypical” with respect to one of these two classes and “too typical” with respect to the other. Regardless, in the subsequent experiments, we did (except where otherwise mentioned) evaluate KL distance on all classes, since this does exploit even more information (even if not much more) than that gleaned from the source and destination classes alone.

Case  | two classes | all classes
clean | 0.8159      | 0.8273
TABLE III: AUC scores on CIFAR-10 dataset with FGSM attack and ADA-kernel method.
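The two-class versus all-class comparison can be illustrated with a small KL computation. This sketch only shows how a KL "distance" changes when two probability vectors are restricted to a pair of classes and renormalized. The exact construction of the vectors and the direction of the divergence used by the detector are defined elsewhere in the paper, so the vectors, the class indices, and the direction chosen here are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete probability vectors (natural log)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def restrict_to(prob, classes):
    """Keep only the given classes and renormalize to sum to one."""
    sub = [prob[c] for c in classes]
    total = sum(sub)
    return [v / total for v in sub]

# Hypothetical class-probability vectors over three classes.
dnn_posterior = [0.05, 0.90, 0.05]    # e.g., the classifier's output
null_posterior = [0.60, 0.30, 0.10]   # e.g., derived from null density models

source, destination = 0, 1            # hypothetical class indices
kl_all = kl_divergence(dnn_posterior, null_posterior)
kl_two = kl_divergence(restrict_to(dnn_posterior, [source, destination]),
                       restrict_to(null_posterior, [source, destination]))
print(kl_all, kl_two)
```

When most of the probability mass sits on the source and destination classes, as in this toy example, the two statistics are close, which is consistent with the modest AUC difference in Table III.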
Case  | ADA-maxKL | ADA-kernel | kernel in [4] | AW-ADA-maxKL
clean | 0.8756    | 0.8289     | 0.8273        | 0.9235
ideal | 0.9155    | NA         | NA            | NA
TABLE IV: AUC scores on CIFAR-10 dataset with FGSM attack

Table IV shows results on the CIFAR-10 dataset under the FGSM attack. AUCs are much lower than for MNIST. We believe this is due to the fact that the classes are much more confusable for CIFAR10 (with only 0.82 test set accuracy) than for MNIST. However, the maxKL paradigm still gives substantial AUC gains over both [4] and the basic, single layer ADA method (with the AUC improving from 0.83 to 0.88). We will discuss the AW-ADA-maxKL results in Table IV shortly.

The “ideal” row shows results for an experiment where the misclassified test samples are excluded from the AD test batch. While these samples cannot of course be excluded in practice, the goal here is to understand whether or not most of the AD suboptimality on CIFAR10 is attributable to misclassified test samples. While the AUC does increase under the “ideal” case, it is still well below 1.0.

For ADA-maxKL in the clean case in Table IV, we modeled the penultimate layer using a GMM and modeled the remaining two maxpooling layers with a Gaussian kernel.

clean 0.8756 0.8567
TABLE V: AUC scores on CIFAR-10 dataset with FGSM attack and ADA-maxKL.

We also want to illustrate how different modelling choices for these layers affect performance. For concision, we use G as short for GMM and K as short for Gaussian kernel, i.e., K-K-G means the first 2 maxpooling layers are modeled using a Gaussian kernel while the penultimate layer is modeled using a GMM. Table V indicates that how the penultimate layer is modeled has only a modest effect on the detection accuracy on CIFAR10.

To further improve detection accuracy on CIFAR10, we implemented the AW-ADA-maxKL method, defined in section II. Note that, as seen from Table IV, this method (which exploits uncertainty in and class confusion matrix information) gives a big boost in ADA-maxKL performance, with the AUC going from 0.8756 to 0.9235, even larger AUC than that achieved by removing the misclassified test samples (the “ideal” result).

Case  | ADA-maxKL | ADA-kernel | kernel in [4] | white-counting | region-counting
clean | 0.9314    | NA         | NA            | 0.9466         | 0.97
noisy | 0.8908    | 0.8807     | 0.8818        | 0.9062         | NA
TABLE VI: AUC scores of various anomaly detectors on MNIST under the JSMA attack.

Table VI evaluates several methods on the JSMA attack applied to MNIST in the clean and noisy cases. Note that ADA models the joint feature vector for a layer, which is a function of the entire image. Thus, we would expect that ADA is most suitable for detecting global attacks, wherein many/most pixels in the image are modified – this is borne out by our very strong detection results on MNIST for FGSM attacks in Table II. JSMA, however, strongly restricts the number of modified pixels (but necessitates gross changes be made to these pixels, in order to succeed in inducing misclassifications). JSMA, accordingly, is a more “local” attack. Thus, we might expect ADA not to perform as well in detecting JSMA attacks as FGSM attacks. This is borne out in Table VI, where ADA-maxKL manages a respectable 0.93 AUC in the clean case (but not nearly the 0.992 AUC achieved in detecting FGSM attacks).

However, this is not to say that JSMA attacks are intrinsically “harder” to detect. In fact, looking at the images in Figure 3, the JSMA attacks are quite discernible, with salt and pepper noise and extra white content, which should make them readily detectable (by a simple, but suitably defined detector). To demonstrate this, we constructed two very simple detectors for JSMA attacks. One forms the class-conditional (null) histogram on the number of white pixels in the image. To make a detection decision on a given image, one counts the number of white pixels in the image, identifies the class whose mean white count is closest to this count, and then computes a one-sided p-value at this count, based on the class’s (null) histogram. This p-value is the decision statistic that is thresholded. This very simple detector achieves an AUC of 0.9466 in the clean MNIST case (Table VI), a bit better than ADA-maxKL. To achieve even better results (specifically targeting the clean image case), one can simply count the number of disjoint contiguous white regions in the image. Clean MNIST digits generally consist of a single white region (where a region is defined as a collection of pixels that are “connected”, with two white pixels connected if they are in the same first-order (8-pixel) neighborhood, and with the entire region defined by applying transitive closure on pixel connectedness over the whole image). By contrast, nearly all JSMA attack images have extra isolated white regions (associated with salt and pepper noise). Simply using the number of white regions in the image as a decision statistic yields 0.97 AUC in the clean MNIST case – this strong result for this very simple detector indicates the susceptibility of the JSMA attack to a (simple) anomaly detection strategy, even though the ADA approach is not the most suitable one for this attack. Detecting JSMA attacks in the noisy MNIST case is more challenging, as seen in Table VI.
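Both simple detectors can be sketched in a few lines. The code below is an illustrative reading of the description above rather than the authors' implementation: the whiteness threshold of 0.5, the use of an empirical upper-tail fraction in place of a binned histogram, and the toy images are all assumptions.

```python
from collections import deque

def count_white(image, thresh=0.5):
    """Number of pixels at or above the (assumed) whiteness threshold."""
    return sum(1 for row in image for v in row if v >= thresh)

def count_white_regions(image, thresh=0.5):
    """Number of 8-connected white regions, found by breadth-first flood fill."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < thresh or seen[r][c]:
                continue
            regions += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and not seen[ny][nx] and image[ny][nx] >= thresh):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return regions

def white_count_p_value(image, null_counts_by_class, thresh=0.5):
    """One-sided (upper-tail) empirical p-value of the white count under the
    null counts of the class whose mean white count is closest."""
    w = count_white(image, thresh)
    nearest = min(
        null_counts_by_class,
        key=lambda k: abs(sum(null_counts_by_class[k]) / len(null_counts_by_class[k]) - w),
    )
    counts = null_counts_by_class[nearest]
    return sum(1 for n in counts if n >= w) / len(counts)

# Toy 4x4 "digit" and an attacked copy with one isolated extra white pixel.
clean = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 0, 0]]
attacked = [row[:] for row in clean]
attacked[3][3] = 1

null_counts = {"a": [3, 4, 4, 5], "b": [9, 10, 11]}  # hypothetical training counts
print(count_white_regions(clean), count_white_regions(attacked))
print(white_count_p_value(clean, null_counts), white_count_p_value(attacked, null_counts))
```

A small p-value or a region count above one would flag the image; the isolated pixel in the attacked copy raises the region count from one to two, which is the signature the region-counting detector relies on.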

IV Discussion and Future Work Directions

We have not fully explored the most suitable (null) density function models for the different layers of the DNN – a mixture of Gaussians gives probability support to the entire real-valued feature space, whereas for a sigmoidal or ReLU network layer the outputs are restricted to being non-negative. We obtained strong results using GMM modelling for such layers (MNIST under FGSM attack), but in future we may explore other density functions that do restrict probability support to a non-negative feature space domain. Beyond this, from our results involving GMMs with full covariance matrices, it does seem to be quite beneficial to capture feature correlations (so simple models for non-negative feature vectors, such as a vanilla Dirichlet distribution (or mixtures of same), which do not capture such feature dependencies well, are conjectured not to be the most suitable).

We have only investigated the attacks from [5] and [12]. There is recent, quite interesting work on universal attacks, i.e. a single perturbation vector/image that induces classification errors when added to most test patterns for a given domain [11]. We believe this work is important because it is suggestive of the fragility of DNN-based classification. However, a single perturbation vector that is required to induce classification errors on nearly all test patterns from a domain must be “larger” than that required to successfully misclassify a single test pattern (and the approaches from [5] and [12] customize the perturbation vector for each individual test pattern). Thus, we conjecture that images perturbed by a universal attack should be more easily detected than those perturbed by a pattern-customized attack. Thus, our approach, which we have demonstrated to be quite successful in detecting the attacks from [5] and [12], is expected to be highly successful in detecting universal attacks.

While we have focused on DNN classifiers, our approach is suitable more generally to complement other classifiers in detecting test-time attacks. However, for classifiers that are not neural networks (e.g., a decision tree, an SVM, a model-based classifier, etc.), it may be most suitable to apply our approach only at one “layer” – i.e., to the feature vector that is input to the classifier.

Likewise, we have only considered image classification domains. Our approach should be suitable for other domains such as speech recognition [2] or music genre classification [7]. Here, the feature vector might not be the raw data (speech waveform) - it might be cepstral or wavelet coefficients. Accordingly, the null density models may need to be well-customized for this domain, in order to give the strongest attack detection results (e.g., hidden Markov models may be needed here). Moreover, our approach can also be readily applied when features are categorical, ordinal, or mixed continuous-discrete. For discrete feature spaces (e.g., a text domain, where the attacker is seeking to (imperceptibly) modify a document to fool a document/email classifier), (null) density functions would need to be replaced by (sufficiently rich) joint probability mass function null models. All of these are potential directions to consider in our future work.

V Conclusions

We have strongly argued here for an (in general, unsupervised) AD approach to detect foiling attacks at test time. The argument needed to be made because there is an (apparently accepted) alternative approach [16],[3],[13] – to “correctly classify” in the face of the attack – that seems to make sense when the attack is human-imperceptible. The reason this approach seems to make sense is because even if the test pattern has been altered by an attack, if such alteration is human-imperceptible, one can reasonably expect that a robust automated classifier may produce the same decision as a human being on the altered pattern. We have argued that, even if this is the case, operationally, attack detection is more important. Further, “correctly classifying”, even if well-defined, may play into the hands of the attacker if the attacker is a (the) recipient of the classifier’s decision. Even for classification domains where human perception is not a factor – e.g. huge-dimensional feature space domains (gene microarray or other bioinformatics domains, large-scale sensor array (e.g. Internet-of-things) domains, documents and video (unless a human will actually read the whole document or watch the whole video)) or domains whose feature vectors are essentially “voodoo” (uninterpretable) to a human being (e.g., computer software, for malware detection) – one could argue that an attacker will try to minimize the attack “footprint” to evade detection, and thus a robust classifier may still be able to correctly classify the given content, even in the presence of the attack. However, again, we would argue that detecting the attack, to the extent that this can be achieved, is far more important than classifying in the face of it. As one stark example, who cares whether one can correctly classify the software type of a computer program if the program in fact contains malware? 
Likewise, if the goal of an attack was not to fool a classifier, but rather simply to embed false content (into a document or video), then detecting the attack (to the extent this is possible) would seem to be a priority objective. While attack detection is the clear goal, the approach developed here does not necessarily offer any new insights for these more general (and quite challenging) AD problems, except to the extent that anomalous signatures for such problems might still manifest in a class-specific fashion (though not with respect to “source” and “destination” class density models, as exploited here).


  • [1] The CIFAR-10 dataset. https://www.cs.toronto.edu/~kriz/cifar.html.
  • [2] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou. Hidden voice commands. In Proc. 25th USENIX Security Symposium, Austin, TX, August 2016.
  • [3] A. Demontis, M. Melis, B. Biggio, D. Maiorca, D. Arp, K. Rieck, I. Corona, G. Giacinto, and F. Roli. Yes, Machine Learning Can Be More Secure! A Case Study on Android Malware Detection. https://arxiv.org/abs/1704.08996.
  • [4] R. Feinman, R. Curtin, S. Shintre, and A. Gardner. Detecting adversarial samples from artifacts. https://arxiv.org/abs/1703.00410v2.
  • [5] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
  • [6] L. Huang, A.D. Joseph, B. Nelson, B.I.P. Rubinstein, and J.D. Tygar. Adversarial machine learning. In Proc. 4th ACM Workshop on Artificial Intelligence and Security (AISec), 2011.
  • [7] C. Kereliuk, B. Sturm, and J. Larsen. Deep learning, audio adversaries, and music content analysis. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, 2015.
  • [8] Y. LeCun, C. Cortes, and C.J.C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
  • [9] B. Miller, A. Kantchelian, S. Afroz, R. Bachwani, E. Dauber, L. Huang, M.C. Tschantz, A.D. Joseph, and J.D. Tygar. Adversarial active learning. In Proc. Workshop on Artificial Intelligence and Security (AISec), 2014.
  • [10] D.J. Miller, X. Hu, Z. Qiu, and G. Kesidis. Adversarial learning: a critical review and active learning study. In Proc. IEEE MLSP, Sept. 2017.
  • [11] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
  • [12] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z.B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In Proc. 1st IEEE European Symp. on Security and Privacy, 2016.
  • [13] N. Papernot, P. McDaniel, Xi Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, 2016.
  • [14] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
  • [15] F. Tramèr, F. Zhang, A. Juels, M. Reiter, and T. Ristenpart. Stealing machine learning models via prediction APIs. In USENIX Security Symposium, 2016.
  • [16] Q. Wang, W. Guo, K. Zhang, A. Ororbia, X. Xing, L. Giles, and X. Liu. Adversary resistant deep neural networks with an application to malware detection. In KDD, 2017.
  • [17] H. Xiao, B. Biggio, B. Nelson, H. Xiao, C. Eckert, and F. Roli. Support vector machines under adversarial label contamination. Neurocomputing, 160(C):53–62, July 2015.
  • [18] LeNet-5, convolutional neural networks. http://yann.lecun.com/exdb/lenet/.