Detecting Adversarial Examples in Deep Networks with Adaptive Noise Reduction
Deep neural networks (DNNs) play a key role in many applications. Unsurprisingly, they also became a potential attack target of adversaries. Some studies have demonstrated DNN classifiers can be fooled by the adversarial example, which is crafted via introducing some perturbations into an original sample. Accordingly, some powerful defense techniques were proposed against adversarial examples. However, existing defense techniques require modifying the target model or depend on the prior knowledge of attack techniques to different degrees. In this paper, we propose a straightforward method for detecting adversarial image examples. It doesn’t require any prior knowledge of attack techniques and can be directly deployed into unmodified off-the-shelf DNN models. Specifically, we consider the perturbation to images as a kind of noise and introduce two classical image processing techniques, scalar quantization and smoothing spatial filter, to reduce its effect. The image two-dimensional entropy is employed as a metric to implement an adaptive noise reduction for different kinds of images. As a result, the adversarial example can be effectively detected by comparing the classification results of a given sample and its denoised version. Thousands of adversarial examples against some state-of-the-art DNN models are used to evaluate the proposed method, which are crafted with different attack techniques. The experiment shows that our detection method can achieve an overall recall of 93.73% and an overall precision of 95.47% without referring to any prior knowledge of attack techniques.
Deep neural networks (DNNs) have been widely adopted in many applications such as computer vision [36, 41], speech recognition [18, 30], and natural language processing [16, 67]. DNNs have exhibited very impressive performance in these tasks, especially in image classification. Some DNN-based classifiers have even outperformed humans [59, 58]. Meanwhile, their robustness has also raised concerns.
Some recent studies [26, 49, 61] demonstrate that DNN-based image classifiers can be fooled by adversarial examples, which are carefully crafted to cause a trained model to misclassify the instances it is given. As shown in Figure 1, an adversarial image can be generated by adding some imperceptible perturbations to a given image. Consequently, the well-known DNN classifier GoogLeNet will misclassify the resultant image, while a human observer can still classify it correctly without noticing the introduced perturbations. These studies demonstrate that adversaries could potentially use crafted images to inflict serious damage. For example, a stop sign, after being crafted, can be incorrectly classified as a yield sign. As a result, a self-driving car equipped with the DNN classifier may behave dangerously.
Some techniques have been proposed to defend against adversarial examples in DNNs [26, 34, 53, 54]. Most of them require modifying the target classifier model. For example, adversarial training is a straightforward defense technique which uses as many adversarial samples as possible during the training process as a kind of regularization [26, 34, 53]. This can make it harder for attackers to generate new adversarial examples. Papernot et al. introduced a defense technique named defensive distillation against adversarial samples. Two networks are trained in a distillation setup, where the first network produces probability vectors to label the original dataset, while the other is trained using the newly labeled dataset. As a result, the effectiveness of adversarial examples can be substantially reduced. Several very recent studies [20, 29, 27, 65] focus on detecting adversarial examples directly. Similarly, these techniques also require modifying the model or acquiring sufficient adversarial examples, such as training new sub-models, retraining a revised model as a detector using known adversarial examples, performing a statistical test on a large group of adversarial and benign examples, or training the key detection parameter using a number of adversarial examples and their corresponding benign ones.
Unfortunately, retraining an existing model or changing its architecture will introduce expensive training cost. Generating appropriate adversarial examples for training or statistical testing is also of high cost and depends on a very comprehensive prior knowledge of various potential adversarial techniques. Even worse, the attacker can craft adversarial examples with the technique unknown to the defender. In this case, the adversarial example has a good chance to evade the classification. Moreover, training a classifier with an emerging attack technique would take some time. There always is a window for attackers to craft effectual adversarial examples. Furthermore, most of existing defense techniques are model-specific. To apply a defense technique to different models, they need to be rebuilt or retrained individually. The security enhancement to a model cannot be directly applied to other ones.
To address the aforementioned challenges, we present in this paper a new technique capable of effectively capturing adversarial examples, even in the absence of prior knowledge about potential attacks.
Our approach is based upon the observation that, to make the adversarial change imperceptible, the perturbation incurred by an adversarial example typically needs to be confined within a small range. This is important, since otherwise the example would be easily identified by a human. Consequently, the information introduced by the perturbation should also be less than that of the original image. In the proposed method, the perturbation is regarded as a kind of noise and noise reduction techniques are leveraged to reduce its adversarial effect. If the effect is reduced properly, the denoised adversarial example will be classified as a class that is different from the adversarial target. On the other hand, for a legitimate sample, the same denoising operation will most likely change the image's semantics only slightly, keeping it within its original category. Intuitively, the adversarial perturbation is added to the image afterwards and therefore tends to be less tolerant of the noise reduction process than the original image information. The information remaining in a denoised benign sample can still be enough for the classifier to correctly identify its class. In fact, some studies [25, 2, 40] have shown that state-of-the-art classifiers are invariant to different input transformations, such as translation, rotation, and scale. Therefore, an adversarial example can be effectively detected by inspecting whether the classification of a sample changes after it is denoised.
Two classical image processing techniques, scalar quantization and smoothing spatial filter, are leveraged to reduce the effect of perturbations. However, it is obviously inappropriate to denoise all samples in the same way. The quantization or smoothing suitable for a high-resolution image sample may be too excessive for a low-resolution one. To improve the generality of our method, an adaptive noise reduction is enforced by utilizing the two-dimensional (2-D) entropy of the sample. Specifically, as illustrated in Figure 2, the key component of our detection method is a filter. When feeding a sample f(x, y) to the target classifier, it will be denoised by the filter to generate a filtered sample f’(x, y). The sample is first quantized with an appropriate interval size, which is determined by computing the 2-D entropy of the sample. We also use the entropy to decide whether the quantized sample needs to be smoothed. Only when the entropy is larger than a threshold, will it be smoothed by a spatial smoothing filter. As demonstrated in Section 4.1, introducing the 2-D entropy can essentially improve the generality and performance of the proposed method. Finally, if the denoised version of a sample is classified as a different class to the original sample, it is identified as an adversarial example.
We employ some state-of-the-art DNN models and popular datasets, such as GoogLeNet, CaffeNet, ImageNet and MNIST, to evaluate the effectiveness of the proposed method. Three up-to-date attack techniques, i.e., FGSM, DeepFool, and CW attacks, are used to craft adversarial examples. In total, 9,162 effectual adversarial examples are generated against the models. The experiment shows that the proposed method can achieve an overall recall of 93.73% and an overall precision of 95.47% for detecting the adversarial examples.
In summary, our three main contributions are the following.
We model the perturbation of the DNN adversarial samples as image noise and introduce classical image processing techniques to reduce its effect. This allows us to effectively detect adversarial samples without prior knowledge of attack techniques.
We employ the 2-D entropy to automatically adjust the detection strategy for a specific sample. This makes the proposed method capable of detecting different kinds of adversarial examples without requiring tuning its parameters, and can be directly integrated into unmodified target models.
Using state-of-the-art DNN models, we demonstrate that the proposed method can effectively detect the adversarial examples generated by different attack techniques with a high recall and precision. The source code of our detection method, along with the experiment data, is available at https://github.com/OwenSec/DeepDetector.
The rest of the paper is organized as follows. In Section 2, we present some essential background knowledge, including a brief introduction to deep neural networks and three up-to-date attack techniques which are used in our evaluation. In Section 3, the proposed method is described at length. In Section 4, we evaluate the effectiveness of our method via detecting the adversarial examples crafted by three attack techniques. Some potential problems and limitations are discussed in Section 5. We review the related work in Section 6 and conclude in Section 7.
In this section, we provide some preliminaries on DNNs and the attack techniques used to craft adversarial examples.
2.1. Deep Neural Networks
As illustrated in Figure 3, a DNN consists of a succession of neural layers. Each neural layer serves as a parametric function to model the new representation obtained from the previous layer. Gradually from the low layers to the high layers, the network can efficiently realize feature extractions. A weight vector, indicating the activation of each neuron, is assigned for each neural layer. It is updated during training phase with the backpropagation algorithm. Generally, the features imported into the low layers are the raw data describing the basic original properties of the problem instances. After multiple layers abstraction, the features extracted from high layers possess more semantic information of the input.
According to the type of output expected from the network, DNNs fall into two main categories: supervised learning and unsupervised learning. The former is mainly used for classification. The network is trained with a labeled dataset to learn connections between inputs and outputs [15, 17, 23, 36]. The latter, which is trained with an unlabeled dataset, is often used for feature extraction and network pre-training. In this paper, we focus on the DNNs used as classifiers. As shown in Figure 3, the DNN classifier outputs a vector $p$ indicating the prediction confidence of each predefined class $j$ ($j \in 1 \ldots n$). The target of the attackers is to make the network output an incorrect prediction for the input provided by them.
2.2. Crafting Adversarial Example
Szegedy et al. first made the intriguing discovery that various machine learning models, including DNNs [36, 39], are vulnerable to adversarial samples. In general, for a given sample $x$ and a trained model $C$, the attacker aims to craft an adversarial example $x^* = x + \delta x$ by adding a perturbation $\delta x$ to $x$, such that $C(x^*) \neq C(x)$.
In most cases, the attacker wants the target model to misclassify the resultant image, while a human observer can still classify it correctly without noticing the introduced perturbation. In practice, adversarial examples can be generated straightforwardly or with an optimization procedure [13, 49, 61]. In this paper, we choose the following three up-to-date attack techniques to perform detection experiments. They can produce imperceptible perturbations.
Fast Gradient Sign Method. Goodfellow et al. proposed a straightforward strategy named fast gradient sign method (FGSM) to craft adversarial samples against GoogLeNet. The method is easy to implement and can compute adversarial perturbations very efficiently. Let $c$ be the true class of $x$ and $J(C, x, c)$ be the cost function used to train the DNN $C$. The perturbation is computed from the sign of the gradient of the model's cost function, i.e.,
$$\delta x = \epsilon \cdot \mathrm{sign}(\nabla_x J(C, x, c)) \tag{1}$$
where $\epsilon$ (ranging from 0.0 to 1.0) is set to be small enough to make $\delta x$ undetectable. Choosing a small $\epsilon$ produces a well-disguised adversarial example: the change to the original image is difficult for a human to spot. As shown in Figure 1, even a very small $\epsilon$ (1/255) can yield a valid adversarial example whose difference from the original image is imperceptible to a human observer. In contrast, a large $\epsilon$ is likely to introduce noticeable perturbations but yields more adversarial examples when the original images are simple (e.g., handwritten digits).
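The FGSM step described above can be sketched in a few lines of NumPy. This is only an illustration: the gradient is a made-up array (a real attack would obtain it by backpropagating the cost through the model), the function names are hypothetical, and clipping the result to the [0, 1] pixel range is an assumption that the text does not state.

```python
import numpy as np

def fgsm_perturbation(grad, eps):
    """Return eps * sign(grad), the FGSM perturbation for a given cost gradient."""
    return eps * np.sign(grad)

def fgsm_example(x, grad, eps):
    """Add the perturbation to x and clip to [0, 1] (clipping is an assumption)."""
    return np.clip(x + fgsm_perturbation(grad, eps), 0.0, 1.0)

# Toy usage with a made-up gradient; pixel values are assumed scaled to [0, 1].
x = np.array([0.2, 0.5, 1.0])
grad = np.array([-0.3, 0.0, 0.7])
x_adv = fgsm_example(x, grad, eps=1 / 255)
```

With `eps = 1 / 255`, every pixel with a nonzero gradient moves by exactly one 8-bit intensity level, which matches the smallest setting mentioned in the text.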
In the classical FGSM algorithm, every input pixel receives either a positive or a negative change of the same magnitude according to the direction (sign) of the corresponding cost gradient. However, as illustrated in Figure 4, we found that manipulating only the 30,000 (19.92%) input pixels with the highest positive or negative gradient magnitude can also generate an effectual adversarial sample using the same $\epsilon$. The result implies that we cannot assume the perturbation follows some kind of distribution.
DeepFool. Moosavi-Dezfooli et al. devised the DeepFool algorithm to find very small perturbations that are sufficient to change the classification result. The original image $x$ is manipulated iteratively. At each iteration of the algorithm, the perturbation vector for $x$ that reaches the decision boundary is computed, and the current estimate is updated. The algorithm stops when the predicted class of $x$ changes. DeepFool is implemented as an optimization procedure which can yield a good approximation of the minimal perturbation. Moosavi-Dezfooli et al. performed attack experiments against several DNN image classifiers, such as CaffeNet and GoogLeNet. The experiments demonstrated that DeepFool leads to a smaller perturbation that is still effective in tricking the target models.
CW Attacks. Carlini and Wagner also employed an optimization algorithm to seek perturbations that are as small as possible. Three powerful attacks (CW attacks for short) are designed for the $L_0$, $L_2$, and $L_\infty$ distance metrics. Using some public datasets, such as MNIST and ImageNet, Carlini and Wagner trained some deep network models to evaluate their attack methods. As they demonstrated, CW attacks can find closer adversarial examples than the other attack techniques and never fail to find an adversarial example. For example, the CW $L_0$ and $L_2$ attacks can find adversarial examples with at least 2 times lower distortion than FGSM. Besides, Carlini and Wagner also showed that their attacks can effectively break defensive distillation.
The basic idea behind our method is to regard the perturbation as a kind of noise and introduce image processing techniques to reduce its adversarial effect as far as possible.
Generally, as described in Section 2, an adversarial sample is crafted by superimposing some perturbations on the original image. In this sense, the perturbation introduced in the adversarial sample is an additive noise term $\eta(x, y)$, and the adversarial sample can be considered a degraded image $g(x, y)$ of the original image $f(x, y)$ as follows.
$$g(x, y) = f(x, y) + \eta(x, y) \tag{2}$$
where $x$ and $y$ are spatial coordinates, and $f$, $g$ and $\eta$ are the functions mapping a pixel at coordinates $(x, y)$ to its intensity.
For example, the perturbation of an FGSM adversarial sample is actually a random additive noise whose amplitude is $\epsilon$. In fact, it is the noise that makes the sample misclassified. Ideally, if we could reconstruct the original image $f(x, y)$ from an adversarial sample $g(x, y)$, adversarial samples could be detected immediately. However, it is very difficult, if not impossible, to achieve this due to the lack of necessary knowledge about the noise term $\eta(x, y)$. Instead, we seek to reconstruct the original image in the sense of classification. Namely, we want to convert $g(x, y)$ to a new image $f'(x, y)$ such that its predicted class $C(f'(x, y))$ is the same as $C(f(x, y))$.
Naturally, we hope that the classifier can correctly identify a benign sample after the conversion. If so, the adversarial example can be effectively detected by checking whether the classification of a sample is changed. If the classification is changed, the sample will be identified as a potential adversarial sample. Otherwise, it is considered benign. Fortunately, the state-of-the-art DNN image classifiers can tolerate a certain degree of distortion, although they are weak when facing adversarial samples. Goodfellow et al. found that the features learned by deep networks are invariant to different input transformations, such as translation, rotation, and scale. LeCun et al. [2, 40] also demonstrated that the LeNet-5 classifier is robust to translation, scale, rotation, squeezing, and stroke width. Take the image shown in Figure 5(a) as an example: it is classified as Zebra by GoogLeNet with 99.97% confidence. We obtain several processed samples with some classical image processing methods, including graying, resizing, compressing and blurring. The obtained samples are still correctly classified with high confidence, as shown in Figure 5(b)-(e) respectively.
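The detection rule itself is small enough to sketch directly: a sample is flagged when its predicted class changes after denoising. In the sketch below, `toy_classify` and `toy_denoise` are hypothetical stand-ins for a real DNN classifier and the denoising filter described in the following subsections; only `is_adversarial` reflects the rule stated in the text.

```python
import numpy as np

def is_adversarial(classify, denoise, sample):
    """Flag the sample if its predicted class changes after denoising."""
    return classify(sample) != classify(denoise(sample))

# Hypothetical stand-ins: a threshold "classifier" and a coarse rounding "denoiser".
def toy_classify(img):
    return int(img.mean() > 0.5)

def toy_denoise(img):
    return np.round(img * 4) / 4  # snap values to multiples of 0.25

benign = np.full((4, 4), 0.8)       # far from the decision boundary
borderline = np.full((4, 4), 0.52)  # sits just past the decision boundary
```

Here `benign` keeps its class after denoising, while `borderline` is pulled back across the toy decision boundary and is therefore flagged.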
As mentioned in Section 1, noise reduction techniques are leveraged to reduce the effect of the perturbation. Based on the above discussion, we have reason to believe that although some details of interest in the sample may be removed too, the classifiers can output a correct classification for a denoised image. In image processing, there are a number of noise reduction techniques. Some of them are based on prior knowledge of the noise. For example, Lee filtering is a very effective algorithm for filtering noise. Nonetheless, this algorithm requires prior knowledge about the noise, such as its underlying distribution, which is unavailable in our context. Besides, as demonstrated in Figure 4, the perturbations can be a kind of completely random noise with no predictable distribution. For this reason, two straightforward techniques that require no prior knowledge, namely scalar quantization and smoothing spatial filter, are adopted to detect adversarial examples.
For scalar quantization, the size of intervals is a key parameter. In principle, using large intervals can more effectively reduce the effect of the perturbation but introduces more distortion at the same time, and the "busyness" of an image is damaged more heavily. This may result in a misclassification of a quantized benign sample and produce a false positive. On the contrary, a small interval size may bring a number of false negatives due to inadequate noise reduction.
We utilize the entropy of image to determine the parameter. The image entropy is a quantity which is used to measure the amount of information possessed by an image. Commonly, the higher the entropy of an image is, the richer its semantics often is. Consequently, for an image with higher entropy, more information is required for the classifier to correctly identify its class. Based on the intuition, to avoid excessively eliminating the information of a sample, a small interval size will be applied to the high-entropy samples when quantizing them. Accordingly, the low-entropy samples will be assigned with a large interval size.
Smoothing a sample will blur its details and often decrease its information. However, for a very simple image (with a low entropy), e.g., a handwritten digit, the smoothing may excessively eliminate its details, which are important to the classification task. Namely, the low-entropy image can’t tolerate the blurring well from the perspective of the classification. To this end, we use the entropy to decide whether the sample needs to be smoothed.
3.2. Computing Entropy
The conventional image entropy (1-D entropy) only concerns the concentration of the pixel values distribution. In order to catch the spatial correlation among the pixels, we employ two-dimensional entropy (2-D entropy) to measure the information of an image.
Without loss of generality, for an $M \times N$ image with 256 pixel levels (0 to 255), the average pixel value of the neighborhood is first calculated for each pixel. In this study, we adopt the averaging filter mask shown in Figure 10 to calculate the average pixel value of the neighborhood. This forms a pair $(i, j)$, where $i$ is the pixel value and $j$ is the average of its neighborhood. The frequency of the pair is denoted as $f_{ij}$, and a joint probability mass function $p_{ij}$ is calculated as equation (3). On this basis, the 2-D entropy $H$ of the image can be computed as equation (4).
$$p_{ij} = \frac{f_{ij}}{M N} \tag{3}$$
$$H = -\sum_{i=0}^{255} \sum_{j=0}^{255} p_{ij} \log_2 p_{ij} \tag{4}$$
For an RGB color image, its 2-D entropy is the average of the 2-D entropies of its three color planes, which are computed individually.
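As a rough illustration of the 2-D entropy computation, the NumPy sketch below counts (pixel value, neighborhood average) pairs and evaluates the entropy of their joint distribution. Several details are assumptions rather than the authors' stated choices: a 5 x 5 averaging neighborhood, replicated edges at the image border, rounding the neighborhood average to the nearest integer level, and a base-2 logarithm. The function names are hypothetical.

```python
import numpy as np

def neighborhood_mean(plane, size=5):
    """Average each pixel's size x size neighborhood (edges are replicated)."""
    pad = size // 2
    padded = np.pad(plane.astype(np.float64), pad, mode="edge")
    h, w = plane.shape
    total = np.zeros((h, w), dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            total += padded[dy:dy + h, dx:dx + w]
    return total / (size * size)

def entropy_2d_plane(plane, size=5):
    """2-D entropy (bits) of one 8-bit plane from (pixel, neighborhood mean) pairs."""
    i = plane.astype(np.int64).ravel()
    mean = neighborhood_mean(plane, size)
    j = np.clip(np.rint(mean), 0, 255).astype(np.int64).ravel()
    counts = np.bincount(i * 256 + j, minlength=256 * 256)  # frequency of each pair
    p = counts[counts > 0] / i.size                          # joint probability
    return float(-np.sum(p * np.log2(p)))

def entropy_2d(image):
    """Average the per-plane entropy over color planes; accept a single plane too."""
    if image.ndim == 2:
        return entropy_2d_plane(image)
    planes = [entropy_2d_plane(image[..., c]) for c in range(image.shape[-1])]
    return float(np.mean(planes))
```

A constant image yields an entropy of zero, since only one (pixel, average) pair ever occurs, whereas a noisy image produces many distinct pairs and a correspondingly higher value.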
3.3. Scalar Quantization
Quantization is the process of representing a large (possibly infinite) set of values with a smaller (finite) one, e.g., mapping the real numbers to the integers. In image processing, quantization is often employed as a lossy compression technique that maps a range of pixel intensities to a single representative value. In other words, it reduces the number of colors in an image to cut its file size.
Scalar quantization is the most practical and straightforward approach to quantize an image. In scalar quantization, all inputs within a specified interval are mapped to a common value (called codeword), and the inputs in a different interval will be mapped to a different codeword. There are two types of scalar quantization techniques, uniform quantization and non-uniform quantization . In uniform quantization, the input will be separated into the same size intervals, and in non-uniform quantization they are usually of different sizes chosen with an optimization algorithm to minimize the distortion . In practice, we can set the interval size according to the probability density function (PDF) of pixel values. The intervals for frequent pixel values can be set smaller, and larger for infrequent pixel values. Figure 6 illustrates the examples of the two kinds of quantization.
Images are meant to be viewed by the human and the human eyes can tolerate some distortions, such as the color reduction introduced by the lossy compression. In practice, the state-of-the-art DNN-based image classifiers are trained and classify samples from the view of human observers. Accordingly, these trained classifiers can also tolerate the color reduction to some extent. Namely, for a benign sample, its classification is likely to be preserved for its quantized version. As shown in Figure 7, GoogLeNet can still correctly classify the scalar quantized samples with high confidences.
More importantly, the quantization technique can be leveraged not only to compress an image but also to reduce the noise in it. For an adversarial sample $g(x, y)$ generated from $f(x, y)$, the change to pixel values brought by the perturbation can be blurred with an appropriate quantization. As a result, the classification result of its quantized version, $C(g'(x, y))$, is likely to revert to the original classification $C(f(x, y))$ and differ from $C(g(x, y))$. We believe that the quantization technique can be leveraged to find potential adversarial samples by inspecting whether the classification result of a sample changes after being quantized.
In practice, the perturbation may distribute in all pixel values. For example, almost all pixels are added a perturbation in an adversarial image generated by the FGSM algorithm. If we adopt non-uniform quantization and choose a small interval for frequent pixel values, the effect of the perturbation in corresponding pixels may not be effectively reduced. Besides, finding appropriate non-uniform interval sizes will require more complex computation. To effectively downgrade the effect of the perturbation and achieve better performance, in this study, we adopt the uniform quantization technique to handle the sample.
To develop a scalar quantization, we first need to determine an appropriate interval size. For a given sample, an adaptive interval size is applied according to its 2-D entropy, computed as described in Section 3.2. As shown in Table 1, we determine the corresponding denoising strategies for different 2-D entropies with a small-scale empirical study of different types of images in some popular datasets. They are determined by analyzing the relationship between the entropy of samples and the classification of their denoised versions under different denoising settings. Note that the empirical study only concerns benign images, which are easily available from many sources. As demonstrated in Section 4, the current strategy setting works well. In the future, we can perform a large-scale analysis to seek a possibly better setting.
|2-D entropy||Quantization Intervals||Smoothing?|
|< 8.50||2||No|
|8.50 to 9.50||4||No|
|> 9.50||6||Yes|
The image with a high 2-D entropy (larger than 9.50) often contains rich details, such as a photograph of an animal. Following the suggestion of Safe RGB Colors, we separate each color plane (R, G and B) into six intervals and set the step to 50. The colors representing a quantized sample are limited to 216 ($6^3$) safe RGB colors. All pixel values in an interval are quantized to its left value. Our scalar quantization and codebook are illustrated in Figure 8 and Table 2. The same quantization is applied to each of the three color planes of a given sample. As illustrated in Figure 9, after quantizing, the adversarial sample shown in Figure 1 is correctly classified as Panda by GoogLeNet with 98.92% confidence, and the benign sample is still classified as Panda with 99.71% confidence.
For the image with a low 2-D entropy (less than 8.50), such as a handwritten digit, it will be handled by an aggressive quantization with only two intervals of the same size. The intensities within an interval are also mapped to its left value, i.e., 0 or 128. Other images are quantized with four intervals in the same way.
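The entropy-dependent uniform quantization can be sketched as follows. The step of 50 for high-entropy images and the left values 0 and 128 for low-entropy images come from the text; the step of 64 for the four-interval case and the handling of entropies exactly equal to 8.50 or 9.50 are assumptions, and the function names are hypothetical.

```python
import numpy as np

def choose_step(entropy):
    """Map a 2-D entropy to a quantization step using the thresholds in the text."""
    if entropy > 9.5:
        return 50    # six intervals with left values 0, 50, 100, 150, 200, 250
    if entropy < 8.5:
        return 128   # two intervals with left values 0, 128
    return 64        # four equal intervals (assumed left values 0, 64, 128, 192)

def quantize(image, step):
    """Uniform scalar quantization: map each value to the left edge of its interval."""
    return ((image.astype(np.int64) // step) * step).astype(np.uint8)

pixels = np.array([0, 49, 50, 127, 128, 255], dtype=np.uint8)
coarse = quantize(pixels, choose_step(7.0))   # low-entropy strategy
fine = quantize(pixels, choose_step(10.2))    # high-entropy strategy
```

Because integer floor division is applied element-wise, the same `quantize` call works unchanged on a full color image, which quantizes the three planes in the same way.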
A preliminary experiment on detecting FGSM adversarial samples shows that directly using the proposed quantization as the detection filter can achieve an average recall of 84.01% and an average precision of 84.99% for detecting adversarial samples. In essence, scalar quantization is a kind of point operation. Subsequently, we further reduce perturbation by introducing the neighborhood operation technique to achieve better detection performance.
3.4. Spatial Smoothing Filter
The spatial smoothing filter is one of the most classical techniques for noise reduction. The idea behind it is to modify the value of the pixels in an image based on a local neighborhood of the pixels. As a result, the sharp transitions in pixel intensities, often brought by noise, are reduced in the target image. In linear smoothing filtering, the filtered image $f'(x, y)$ is the convolution of the original image $f(x, y)$ with a filter mask $w(x, y)$ as follows.
$$f'(x, y) = w(x, y) * f(x, y) \tag{5}$$
where $*$ denotes two-dimensional convolution.
The filter mask determines the smoothing effect. Figure 10 presents a simple $5 \times 5$ averaging filter mask. With the mask, the intensity of a pixel is replaced with the standard average of the intensities of the pixels in its $5 \times 5$ neighborhood. After filtering, the target image is blurred and small details are removed from it. However, from the viewpoint of image classification, the objects of interest may be highlighted and easy to detect. In fact, to some extent, the state-of-the-art classifier does "like" the modification introduced by smoothing filtering. As shown in Figure 11, although the smoothed image is blurred by the filter, it is still correctly classified as Cab by GoogLeNet and, surprisingly, with a higher confidence (95.36%) than the original image (69.16%). However, as mentioned above, for a low-entropy image, the smoothing may be too aggressive to preserve enough semantic information. We also use the 2-D entropy to determine whether the smoothing should be performed. As listed in Table 1, only the image whose 2-D entropy is larger than 9.50 is smoothed after being quantized.
In theory, adopting a filter mask with a larger size will reduce noise more effectively but also blur the edges more heavily. However, the edges are often the desirable features for identifying an object of interest. As a tradeoff, we adopt a $5 \times 5$ filter mask to further reduce noise (perturbation) in quantized samples. Besides, in practice, some features of interest can be emphasized by giving more importance (weight) to some pixels in the mask at the expense of others. For example, we can give bigger weights to the pixels at the center of the mask to reduce blurring in the smoothing process. In this study, we believe the vertical and horizontal edges are most fundamental for identifying an object and adopt an aggressive way to preserve them. As shown in Figure 12, in the proposed filter, the pixel at the center and its vertical and horizontal neighbors are weighted by 1 and all others are weighted by 0.
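A minimal NumPy sketch of the cross-shaped mask described above is given below. It assumes that the filtered value is the weighted sum divided by the sum of the mask weights (nine for the $5 \times 5$ cross) and that image borders are handled by replicating edge pixels; neither detail is specified in the text, and the function names are hypothetical. Because the mask is symmetric, the sliding weighted sum used here is equivalent to a convolution.

```python
import numpy as np

def cross_mask(size=5):
    """size x size mask: 1 on the center row and center column, 0 elsewhere."""
    mask = np.zeros((size, size), dtype=np.float64)
    mask[size // 2, :] = 1.0
    mask[:, size // 2] = 1.0
    return mask

def smooth(plane, mask):
    """Weighted neighborhood average of a 2-D plane (edges are replicated)."""
    size = mask.shape[0]
    pad = size // 2
    padded = np.pad(plane.astype(np.float64), pad, mode="edge")
    h, w = plane.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            if mask[dy, dx] != 0:
                out += mask[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out / mask.sum()  # normalizing by the weight sum is an assumption

# A single bright pixel spreads only along its own row and column.
impulse = np.zeros((5, 5))
impulse[2, 2] = 90.0
blurred = smooth(impulse, cross_mask())
```

The impulse example shows why this mask is gentler on horizontal and vertical structure than a full averaging mask: diagonal neighbors never contribute to a pixel's new value.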
By applying the smoothing filter to the quantized samples, some false positives and false negatives can be pruned. As illustrated in Figure 13, an adversarial sample crafted with FGSM is still misclassified as Flatworm with 79.62% confidence by GoogLeNet even after being quantized. However, its smoothed version is correctly classified as Zebra with 85.38% confidence. As a result, the adversarial sample can be detected successfully and a false negative is avoided. On the other hand, as shown in Figure 14, a quantized benign sample is misclassified as Golden Fish, resulting in a false positive. Similarly, we can restore its correct classification, Pineapple, by using the smoothing filter.
3.5. Detection Filter
Unfortunately, the smoothing technique may bring an excessive blurring to some samples and produce new false positives. To this end, we design a combination filter based on the two above techniques rather than simply concatenating them together.
As discussed in Section 2, the attacker often wants the perturbation introduced in the adversarial sample as small as possible to make it imperceptible. In other words, the perturbation to the pixel intensity is often limited in a small range. If the intensity of a pixel is blurred too much by the smoothing filter, the smoothing might be unnecessary. Based on the above intuition, our detection filter is defined by the following equation
$$f'(x, y) = \begin{cases} f_Q(x, y), & \text{if } |f_Q(x, y) - f(x, y)| < |f_{SQ}(x, y) - f(x, y)| \\ f_{SQ}(x, y), & \text{otherwise} \end{cases} \tag{6}$$
where $f_Q(x, y)$ is the quantized original image and $f_{SQ}(x, y)$ is the smoothed quantized image. For a given pixel $(a, b)$ of the input sample $f(x, y)$, the output pixel value $f'(a, b)$ is set to its quantization $f_Q(a, b)$ when the distance between $f_Q(a, b)$ and the original pixel value $f(a, b)$ is smaller than the distance between $f_{SQ}(a, b)$ and $f(a, b)$; otherwise, it is set to $f_{SQ}(a, b)$. As illustrated in Figure 15, a benign sample $h(x, y)$ is correctly identified as Pineapple (98.48%) by GoogLeNet. The non-optimized denoised version $h_{SQ}(x, y)$ is misclassified as Bee with a low confidence (8.81%) that is nevertheless the highest in the prediction vector. If it were the final output of our filter, a false positive would be produced. With equation (6), the optimized denoised version $h'(x, y)$ is classified as Pineapple (88.70%) and the false positive is avoided.
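The per-pixel selection between the quantized image and the smoothed quantized image can be sketched with a single `np.where` call. The function and variable names below are hypothetical, and the toy arrays are made up for illustration; ties go to the smoothed value because the condition in the text is a strict "smaller than".

```python
import numpy as np

def combine(original, quantized, smoothed_quantized):
    """Per pixel, keep the quantized value only if it is strictly closer to the original."""
    f = original.astype(np.float64)
    f_q = quantized.astype(np.float64)
    f_sq = smoothed_quantized.astype(np.float64)
    keep_quantized = np.abs(f_q - f) < np.abs(f_sq - f)
    return np.where(keep_quantized, f_q, f_sq)

# Made-up pixel values: the first pixel is over-blurred, the second is not.
f = np.array([[60.0, 120.0]])
f_q = np.array([[50.0, 100.0]])
f_sq = np.array([[80.0, 110.0]])
out = combine(f, f_q, f_sq)
```

For the first pixel the smoothing moved the value further from the original than quantization did, so the quantized value is kept; for the second pixel the smoothed value is closer and is used instead.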
The proposed method is transparent to the target model. In practice, the detection filter can be directly integrated with any off-the-shelf model as a sample preprocessor. The target model can be kept unchanged.
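As a rough illustration of how such a preprocessor could wrap an unmodified model, the sketch below flags a sample when the predicted label changes after denoising. The names `detect_adversarial`, `toy_classify`, and `toy_denoise` are hypothetical stand-ins; a real deployment would pass the target model's prediction function and the detection filter instead.

```python
import numpy as np

def detect_adversarial(classify, denoise, sample):
    """Return True when denoising changes the predicted label.

    classify : callable mapping a sample to a class label (the unmodified model)
    denoise  : callable mapping a sample to its denoised version
    """
    label_original = classify(sample)
    label_denoised = classify(denoise(sample))
    return label_original != label_denoised

# Toy stand-ins, used only to make the sketch executable.
def toy_classify(x):
    return int(np.mean(x) > 0.5)

def toy_denoise(x):
    return np.round(x)

suspicious = np.array([0.9, 0.4, 0.4])  # label flips after denoising
stable = np.array([0.9, 0.9, 0.1])      # label is unchanged
flag_suspicious = detect_adversarial(toy_classify, toy_denoise, suspicious)
flag_stable = detect_adversarial(toy_classify, toy_denoise, stable)
```

The model itself is only called, never modified, which is the property the paragraph above relies on.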
4. Evaluation
We evaluate the effectiveness of our method by applying it to detect adversarial examples crafted by the attack techniques described in Section 2.2. The recall rate and the precision rate are used to quantify the detection performance, which are defined as follows

\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}
\]

where TP is the number of correctly detected adversarial examples (true positives), FN is the number of adversarial samples that escape our detection (false negatives), and FP is the number of benign images that are detected as adversarial examples (false positives). Higher recall and precision rates indicate better detection performance.
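The two metrics follow directly from the counts TP, FN, and FP. A minimal sketch is given below; the function names and the example counts are made up for illustration and are not taken from the result tables.

```python
def recall(tp, fn):
    """Fraction of adversarial examples that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Fraction of flagged samples that are truly adversarial."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical counts: 90 detected, 10 missed, 5 benign samples flagged.
r = recall(tp=90, fn=10)
p = precision(tp=90, fp=5)
print(f"recall={r:.4f}, precision={p:.4f}")
```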
4.1. Detecting FGSM Examples
Two off-the-shelf DNN models are employed to explore the effectiveness of the proposed detection method with respect to the FGSM attack. One is a GoogLeNet model trained with the ImageNet dataset , which has been taken as the attack target of FGSM in . The other is a DNN model trained with the MNIST dataset, which is from an adversarial machine learning library  and trained for testing the FGSM attack. ImageNet  is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories, while MNIST  is a small-scale dataset of simple grayscale handwritten digits.
We randomly choose four classes (Zebra, Panda, Cab, and Pineapple) of images from ImageNet to craft FGSM adversarial examples. For high-resolution images in ImageNet, a small ε is sufficient to craft effectual adversarial examples. In the experiment, ε is set to 1/255, which means each pixel value is increased or decreased by 1. In total, 1,301 effectual adversarial examples are crafted. To obtain enough experiment samples for MNIST images, we use a comparatively large ε (0.10) and craft 3,435 effectual adversarial examples from the 10,000 images in the MNIST test set.
The detection test set consists of all generated adversarial examples and their original images. As summarized in Table 3, the proposed method achieves an average recall of 90.32% with an average precision of 90.66% in detecting the 2,602 ImageNet samples; for the 6,870 MNIST samples, the recall and precision rates are 90.95% and 97.81% respectively. Note that the detection performance is obtained without using any prior knowledge about adversarial examples.
There is an obvious difference between the 2-D entropies of ImageNet and MNIST images. An ImageNet sample often has a high entropy, while that of an MNIST sample is generally lower. As mentioned above, we use the 2-D entropy to provide adaptive noise filtering for a given sample. According to the 2-D entropies of the samples, in the detection experiment, the number of quantization intervals is automatically set to six or four for most ImageNet samples, and two or four for MNIST samples.
To demonstrate the effectiveness of introducing the 2-D entropy, we conduct an experiment to detect the samples with fixed parameters. The number of quantization intervals is deliberately set to two for all ImageNet samples, and six for MNIST samples. This results in an unacceptable detection performance. The average precision rate for ImageNet samples drops to 56.10%, i.e., almost all benign samples are incorrectly identified as adversarial. For MNIST samples, a great number of false negatives are produced, so that the recall rate drops to 64.13%. Figure 16 provides two examples. As shown in the first row, a benign ImageNet sample is incorrectly identified as an adversarial example when a 2-interval quantization is applied to it. However, with a 6-interval quantization that fits its 2-D entropy (13.13), the false positive can be eliminated. Similarly, the MNIST adversarial example shown in the second row is missed when an inappropriate number of intervals is chosen, but can be detected with an adaptive setting.
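To make the effect of the FGSM step size concrete, the sketch below applies the standard FGSM update, x' = clip(x + ε · sign(gradient)), to a few pixels scaled to [0, 1]. The gradient values are made up; in a real attack they would come from the target model's loss, which is outside this sketch. The function name `fgsm_perturb` is likewise an assumption for illustration.

```python
import numpy as np

def fgsm_perturb(x, grad, eps):
    """One FGSM step on an image scaled to [0, 1].

    grad is the gradient of the loss with respect to x, supplied by the caller.
    """
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

x8 = np.array([[10, 128, 250]], dtype=np.uint8)  # toy 8-bit pixels
x = x8 / 255.0
grad = np.array([[0.3, -1.2, 0.7]])              # made-up gradient values
x_adv = fgsm_perturb(x, grad, eps=1 / 255)

# Back on the 8-bit scale, every pixel moved by exactly one intensity level.
x_adv8 = np.rint(x_adv * 255).astype(np.int64)
print(x_adv8)
```

With a step of 1/255 the perturbed pixels differ from the originals by a single 8-bit level, which is why such examples are hard to notice visually.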
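The adaptive choice can be illustrated with a small sketch. It assumes one common definition of image 2-D entropy, the Shannon entropy of the joint distribution of (pixel gray level, mean gray level of its 3x3 neighborhood); the exact definition used earlier in the paper may differ. The function names and the two thresholds passed to `choose_intervals` are placeholders chosen for the example and are not values from the text.

```python
import numpy as np

def entropy_2d(gray):
    """2-D entropy of an 8-bit grayscale image (assumed definition:
    joint entropy of pixel value and 3x3 neighborhood mean)."""
    g = np.asarray(gray, dtype=np.int64)
    h, w = g.shape
    padded = np.pad(g, 1, mode="edge")
    window_sum = np.zeros((h, w), dtype=np.int64)
    for dy in range(3):
        for dx in range(3):
            window_sum += padded[dy:dy + h, dx:dx + w]
    neighborhood_mean = window_sum // 9  # stays within 0..255
    pair_index = g * 256 + neighborhood_mean
    counts = np.bincount(pair_index.ravel(), minlength=256 * 256)
    p = counts[counts > 0] / g.size
    return float(-np.sum(p * np.log2(p))) + 0.0

def choose_intervals(entropy, low, high):
    """Map an entropy value to 2, 4, or 6 quantization intervals.
    low and high are hypothetical thresholds supplied by the caller."""
    if entropy < low:
        return 2
    if entropy < high:
        return 4
    return 6

flat = np.full((8, 8), 7)
rng = np.random.default_rng(0)
noisy = rng.integers(0, 256, size=(32, 32))
print(entropy_2d(flat), entropy_2d(noisy))
print(choose_intervals(13.13, low=8.0, high=12.0))  # placeholder thresholds
```

A constant image has zero 2-D entropy, while a highly textured image scores much higher, so images like MNIST digits would tend toward fewer intervals than natural photographs.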
4.2. Detecting DeepFool Examples
Moosavi-Dezfooli et al. used two state-of-the-art CaffeNet and GoogLeNet models trained with ImageNet dataset to test their DeepFool attack, and the two models are available in .
We again choose the same four classes of images (i.e., Zebra, Panda, Cab, and Pineapple) to generate adversarial examples. By using the DeepFool algorithm provided in , 1,234 effectual adversarial examples are generated for the CaffeNet model and 1,032 for the GoogLeNet model. The generated examples and the corresponding original images make up our detection test set. The detection results of these samples are listed in Table 4. An average recall of 95.62% and an average precision of 91.12% are achieved for the samples targeting CaffeNet, and 93.22% and 92.15% respectively for the ones targeting GoogLeNet.
Note that although DeepFool can produce a smaller perturbation than FGSM, the proposed method is still effective and even achieves a higher detection accuracy on almost the same ImageNet samples. As illustrated in Figure 17, an effectual adversarial example is crafted from the original image shown in Figure 1 with DeepFool, which can fool the GoogLeNet model into misclassifying it as Llama. Although the introduced perturbation is obviously smaller, our method can still successfully detect it.
4.3. Detecting CW Examples
Carlini and Wagner also use MNIST and ImageNet to train two DNN models as their attack targets. We directly download the two trained models from  for our evaluation. Considering that the CW L2 and L∞ attacks do not result in observable perturbations, we choose them to generate adversarial examples.
In practice, the two attacks are more expensive than other attack techniques. On our computer, generating an ImageNet adversarial example with L2 and L∞ takes about 30 minutes and 4.5 hours respectively. For this reason, we only pick the first 1,000 images in MNIST and 30 random images for each of the four ImageNet classes (listed in Table 3) as the experiment dataset. Eventually, the L2 attack generated 991 effectual adversarial examples from the MNIST images and 110 from the 120 ImageNet images, and the L∞ attack produced 991 and 67 effectual examples from the two groups of images respectively. The generated examples and the corresponding original images make up our detection test set.
As listed in Table 5, the proposed method achieves very high recall and precision rates. There are only 10 false positives and 13 false negatives among the 3,964 MNIST samples, and 9 false positives and 0 false negatives among the 354 ImageNet samples. In addition, detecting a sample with the proposed method takes only about 8 seconds. The introduced overhead is negligible compared with the time needed to generate an adversarial example.
In total, 18,322 samples are used to evaluate our method in the above experiments; half of them are adversarial and half are benign. We achieve an overall recall of 93.73% and an overall precision of 95.47% in detecting the adversarial examples generated by the three attack techniques.
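The sample total can be cross-checked from the per-experiment counts reported in Sections 4.1 to 4.3. The short script below only re-adds numbers that appear in the text; the dictionary keys are descriptive labels introduced here.

```python
# Effectual adversarial examples reported for each experiment.
adversarial_counts = {
    "FGSM on ImageNet": 1301,
    "FGSM on MNIST": 3435,
    "DeepFool on CaffeNet": 1234,
    "DeepFool on GoogLeNet": 1032,
    "CW on MNIST (two attacks)": 991 + 991,
    "CW on ImageNet (two attacks)": 110 + 67,
}

total_adversarial = sum(adversarial_counts.values())
# Each adversarial example is paired with its original benign image.
total_samples = 2 * total_adversarial
print(total_adversarial, total_samples)
```

The sum gives 9,161 adversarial examples and therefore 18,322 samples once the benign originals are included, which agrees with the stated total.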
5. Discussion and Limitations
Robustness to Purposeful Attacks. If the adversaries are aware of the proposed method, they may try to develop a new attack technique to evade detection. However, for the perturbation to a pixel to survive our scalar quantization, it must cause the pixel value to be mapped to a different interval. In other words, the amplitude of the perturbation must be large enough. As a result, a perceptible modification will be introduced into the original image and compromise the utility of the adversarial example. In many attack scenarios, a visibly distorted adversarial example is unacceptable, especially when the adversarial example is expected to fool the classifier and a human observer at the same time. Besides, under the constraint of our filtering, it is not completely impossible to use an optimization procedure to compute an effectual adversarial example, but it would be difficult and expensive. We have reason to believe that the proposed method makes it far more challenging to develop a new effective and practicable attack technique.
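The argument that a small perturbation rarely survives scalar quantization can be checked numerically. The sketch below assumes uniform quantization of 8-bit values into k equal intervals (the paper's quantizer may be defined differently) and counts how many of the 256 possible pixel values change interval under a +1 perturbation. The function names are illustrative.

```python
import numpy as np

def quantize_index(values, k):
    """Index of the uniform interval (0..k-1) that each 8-bit value falls into."""
    v = np.asarray(values, dtype=np.int64)
    return (v * k) // 256

def crossing_count(k, delta=1):
    """Number of pixel values in 0..255 whose interval changes when
    perturbed by delta (clipped to the valid range)."""
    v = np.arange(256)
    perturbed = np.clip(v + delta, 0, 255)
    changed = quantize_index(v, k) != quantize_index(perturbed, k)
    return int(np.count_nonzero(changed))

for k in (2, 4, 6):
    print(k, crossing_count(k))
```

Under this assumed quantizer, only k - 1 of the 256 values change interval for a one-level perturbation, so most small perturbations are erased unless the attacker increases their amplitude.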
False Positives and False Negatives. In principle, the performance of our detection method is closely related to the classification capacity of target classifiers. Some false positives and false negatives are caused by the ambiguous images, which are essentially hard to classify for the target classifier.
As shown in Figure 18, an image consisting of various fruits is labeled as Pineapple, but GoogLeNet predicts that label with only 19.55% confidence. This is not a strong prediction. The sample is also considered a Lemon with 10.85% confidence and a Jackfruit with 9.43%. After being denoised by our filter, the image is misclassified as Lemon, which results in a false positive. However, we think that neither the model nor the proposed detection filter is to blame for the false positive; the ambiguity within the image is. Confusing images can result not only in false positives but also in false negatives. Take an adversarial example generated with FGSM as an example: the image shown in Figure 19 is perturbed from Pineapple to Sea Anemone, but with only 19.76% confidence. GoogLeNet gives a weak prediction for it, and our detection method fails to detect this adversarial example, producing a false negative.
There are quite a few ambiguous images like these two examples in our test set, which bring down the precision rate as well as the recall rate. For ambiguous samples, it might be necessary to inspect more predicted classes rather than just the top one with the highest confidence. We can compare the prediction vectors to find a difference for detection as done in , if a trained threshold is available.
The above phenomenon also implies that stealthy adversarial examples may be generated by purposefully triggering an incorrect but weak prediction. We will further analyze the phenomenon and seek a new attack technique.
Perceptible Perturbations. Some attack techniques, such as CW L0, may introduce large-amplitude perturbations. According to the L0 attack algorithm, the number of altered pixels is limited, but each altered pixel can be changed without any limitation. Consequently, as illustrated in Figure 20, the obtained adversarial example may present easy-to-notice distortions and can be easily spotted by a human. However, it can still be exploited to launch an effective attack when no human is involved. In principle, it is very difficult to properly reduce the effect of such a heavy perturbation with filtering alone without compromising the semantics of the original image. Developing an effective technique to detect this kind of example is beyond the scope of this paper but will be part of our future research.
Other Image Processing Techniques. There are a number of other image processing techniques in addition to the ones adopted in our method. Some of them may be leveraged to further improve our detection method, such as Rényi entropy , image segmentation , etc. For example, we can segment an adversarial example into regions according to the connectivity among pixels and look for a region that carries as much information as possible. From it, we have a good chance of restoring the correct classification when the perturbation is isolated in other regions. In this way, the adversarial example shown in Figure 20 can be detected. In the future, we plan to investigate other image processing techniques to develop a more sophisticated detection method, especially for detecting adversarial examples with large-amplitude perturbations.
6. Related Work
Many existing studies have paid much attention to the security of classifiers, and the arms race between adversaries and defenders will never end.
Attacks on Traditional Classifiers. Many studies have investigated the security of traditional machine learning methods  and proposed some attack methods. Lowd and Meek conduct an attack that minimizes a cost function . They further propose attacks against statistical spam filters that add the words indicative of non-spam emails to spam emails . The same strategy is employed in . In , a methodology, called reverse mimicry, is designed to evade structural PDF malware detection systems. The main idea is to inject malicious content into a legitimate PDF while introducing minimum differences within its structure. In , an online learning-based system for detection of PDF malware, PDFrate, was used as a case to investigate the effectiveness of evasion attacks. The study reconstructs a similar classifier by training on one of the publicly available datasets with a few deduced features, and then evades PDFrate by inserting dummy content into PDF files. In , an algorithm is proposed for evasion of classifiers with differentiable discriminant functions. The study empirically demonstrated that popular models such as SVMs and neural networks can be evaded with high probability even if the adversary has only limited knowledge. Liang et al.  demonstrated that client-side classifiers are also vulnerable to evasion attacks.
Xu et al.  presented a general approach to find evasive variants by stochastically manipulating a malicious sample seed. The experiment showed that effectual variants can be automatically generated against two PDF malware classifiers, i.e., PDFrate and Hidost.
Fredrikson et al. [21, 64] developed a new form of model inversion attack which can infer sensitive features used in decision tree models and recover images from some facial recognition models by exploiting confidence values revealed by the target models. The proposed attack may cause serious privacy disclosure problems . More model inversion attacks can be found in .
Defenses for Traditional Classifiers. Many countermeasures against evasion attacks have been proposed, such as using game theory [12, 11] or probabilistic models [10, 55] to predict attack strategy to construct more robust classifiers, employing multiple classifier systems (MCSs) [7, 8, 9] to increase the difficulty of evasion, and optimizing feature selection [22, 35] to make the features evenly distributed.
Game-theoretical approaches [12, 11] model the interactions between the adversary and the classifier as a game. The adversary’s goal is to evade detection by minimally manipulating the attack instances, while the classifier is retrained to correctly classify them.
MCSs [7, 8, 9], as the name suggests, use multiple classifiers rather than only one to improve the classifier's robustness. An adversary who wants to evade classification effectively has to contend with more than one classifier.
Kantchelian et al.  present family-based ensembles of classifiers. In particular, they trained an ensemble of classifiers, one for each family of malware. By combining classifications, it will be determined whether an unknown binary is malware, and if it is, which family it belongs to. What’s more, they also demonstrate the importance of human operators in adversarial environments.
In , the method weight evenness via feature selection optimization is proposed. By appropriate feature selection, the weight of every feature is evenly distributed, thus the adversary has to manipulate a larger number of features to evade detection. In , the features are reweighted inversely proportional to their corresponding importance, making it difficult for the adversary to exploit the features.
Unfortunately, these attack and defense techniques for traditional classifiers cannot be directly applied to DNNs. Along with the prevalence of DNNs, researchers have begun to pay close attention to the security of DNNs.
Attacks on DNNs. Recently, researchers have begun to attack DNN-based classifiers by crafting adversarial samples. There are various methods to generate adversarial samples against DNNs in various fields, including not only computer vision [26, 49, 61] but also speech recognition , text classification  and malware detection .
Kereliuk et al.  proposed a method to craft adversarial audio examples using the gradient information of the model's loss function. By applying minor perturbations to the input magnitude spectra, they can effectively craft an adversarial example. Text, as discrete data, is sensitive to perturbation. Liang et al. proposed a method to craft adversarial text examples. Three perturbation strategies, namely insertion, modification, and removal, are designed to generate an adversarial sample for a given text. By computing the cost gradients, what should be inserted, modified or removed, where to insert, and how to modify are determined effectively. By elaborately dressing up a text sample, the adversary can change the classification to any other class while keeping the meaning unchanged. Grosse et al.  presented a method to craft adversarial examples on neural networks for malware classification, by adapting the method originally proposed in .
In this paper, we focus on the detection of adversarial images. We believe that our method can be applied to detect adversarial examples for audio, which is also a kind of continuous data. However, the proposed technique cannot be applied to discrete data, such as adversarial text and malware. New methods need to be developed to detect them effectively.
Note that there are two recent studies that focus on crafting adversarial examples in the physical world. Kurakin et al.  demonstrated that the adversarial images obtained from a cell-phone camera can still fool an ImageNet classifier. Sharif et al.  presented an attack method to fool facial biometric systems. They showed that with some well-crafted eyeglass frames, a subject can dodge recognition or impersonate others.
Besides, Shokri et al.  developed a novel black-box membership inference attack against machine learning models, including DNN and non-DNN models. Given a data record, the attacker can determine whether it is in the target model’s training dataset. For health-care datasets, such information leakage is unacceptable.
Improve the Robustness of Deep Networks. Adversarial training [26, 34, 49, 53] is a straightforward defense technique to improve the robustness of target models. Retraining models with as many adversarial samples as possible makes it more challenging for attackers to find new adversarial samples.
Wang et al.  integrated a data transformation module right in front of a standard DNN to improve the model’s resistance to adversarial examples. This data transformation module leverages non-parametric dimensionality reduction methods, and projects all the input samples into a new representation before passing the inputs to the target DNN in training and testing. Wang et al.  also proposed another method, named random feature nullification, for constructing adversary resistant DNNs. In particular, it randomly nullifies or masks features within input samples in both the training and testing phase. Such nullification makes a DNN model non-deterministic and then improves model’s resistance to adversarial samples.
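The idea of random feature nullification described above can be illustrated with a simple random mask. This is only a sketch of the general idea of zeroing randomly chosen input features; the function name `nullify_features`, the `rate` parameter, and the uniform masking scheme are assumptions, and the cited method may choose the mask differently.

```python
import numpy as np

def nullify_features(x, rate, rng):
    """Zero out each input feature independently with probability `rate`."""
    x = np.asarray(x, dtype=np.float64)
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask

rng = np.random.default_rng(42)
sample = np.ones((4, 4))
masked = nullify_features(sample, rate=0.5, rng=rng)
print(masked)
```

Because a fresh mask is drawn for every call, the same input can produce different model inputs, which is the source of the non-determinism mentioned in the paragraph above.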
The proposed method is compatible with the above defense techniques. Defenders can still use our method in the enhanced model to get a better performance.
Detection Techniques. Some studies also focus on detecting adversarial examples directly.
Xu et al. in a very recent study  proposed a method, called Feature Squeezing, to detect adversarial examples in a way similar to ours. They explore two approaches to squeeze the features of an image: reducing the color bit depth of each pixel and smoothing it using a spatial filter. Their system identifies adversarial examples by measuring the disagreement among the prediction vectors of the original and squeezed examples. Their experiment showed that high performance was achieved when detecting FGSM adversarial examples in an MNIST model. However, a predefined threshold is required for determining how much disagreement indicates that the current sample is adversarial. In their experiment, half of the examples are used to train the threshold that produces the best detection accuracy on the training examples. This means the defender must have a sufficient number of adversarial examples generated with potential attack techniques. As a result, the method works well only when the attack technique is known and is less effective when facing unknown attacks. Moreover, in principle, the thresholds need to be retrained for different datasets, models or attacks to achieve acceptable performance. By introducing the 2-D entropy, we implement an adaptive detection method that can be directly applied to different models, datasets and attack techniques with the same setting and without requiring any prior knowledge of attacks.
Grosse et al.  put forward a defense to detect adversarial examples using statistical tests. The method requires a sufficiently large group of adversarial examples and benign examples to estimate their data distribution. However, the statistical test method cannot be directly applied to detect individual examples, making it less useful in practice. For this reason, Grosse et al. further propose a new method that adds an additional class (e.g., an adversarial class) to the model's output and retrains the model to classify adversarial examples as the new class.
Metzen et al.  used a large number of adversarial examples to train a detector to identify unknown adversarial examples. A small "detector" subnetwork is trained on the binary classification task of distinguishing benign samples from adversarial perturbations.
Feinman et al.  devised two novel features to detect adversarial examples based on the idea that adversarial examples deviate from the true data manifold. They introduced density estimates to measure the distance between an unknown input sample and a set of benign samples. The method is computationally expensive and may be less effective in detecting adversarial examples which are very close to benign samples.
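For contrast with the label-comparison approach of this paper, the threshold-based scheme described above can be sketched as follows. The sketch assumes that disagreement is measured as the L1 distance between prediction vectors and that bit-depth reduction rounds pixels in [0, 1] to the nearest representable level; the function names, the prediction vectors, and the threshold value are made up for illustration.

```python
import numpy as np

def reduce_bit_depth(x, bits):
    """Round pixels in [0, 1] to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.rint(np.asarray(x, dtype=np.float64) * levels) / levels

def l1_disagreement(p, q):
    """L1 distance between two prediction vectors."""
    return float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))

def is_flagged(p_original, p_squeezed, threshold):
    """Flag the sample when the disagreement exceeds a trained threshold."""
    return l1_disagreement(p_original, p_squeezed) > threshold

squeezed_pixels = reduce_bit_depth([0.0, 0.4, 0.6, 1.0], bits=1)
# Made-up prediction vectors before and after squeezing.
flag_changed = is_flagged([0.9, 0.1], [0.2, 0.8], threshold=1.0)
flag_stable = is_flagged([0.9, 0.1], [0.85, 0.15], threshold=1.0)
```

The `threshold` argument is exactly the quantity that has to be trained on known adversarial examples, which is the dependency the adaptive method in this paper is designed to avoid.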
7. Conclusion
Much effort has been devoted to using various techniques to defend against or detect adversarial image examples in DNNs. However, prior knowledge of attack techniques or modification of the target model is often required. This paper presents a straightforward and effective method for detecting adversarial image examples. The adversarial perturbations are regarded as a kind of noise and the proposed method is implemented as a filter to reduce their effect. The image 2-D entropy is used to automatically adjust the detection strategy for specific samples. Our method provides two important features: (1) it does not require prior knowledge about attacks and (2) it can be directly integrated into unmodified models. The experiment shows that our method can achieve high recall and precision in detecting adversarial examples generated by different attack techniques and targeting different models. Our method is also compatible with other defense techniques, and a better performance can be achieved by combining them.
Our research demonstrated that the adversarial images can be effectively analyzed with classical image processing techniques. In the future, we will investigate more image processing techniques to find more effective and practicable detection techniques.
The authors would like to thank the anonymous reviewers for their insightful comments. The work is supported by XXX and YYY.
-  A simple and accurate method to fool deep neural networks. https://github.com/lts4/deepfool.
-  LeNet-5, convolutional neural networks. http://yann.lecun.com/exdb/lenet/.
-  Robust evasion attacks against neural network to find adversarial examples. https://github.com/carlini/nn_robust_attacks.
-  The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist.
-  Marco Barreno, Blaine Nelson, Russell Sears, Anthony D Joseph, and J Doug Tygar. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, computer and communications security, pages 16–25. ACM, 2006.
-  Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer, 2013.
-  Battista Biggio, Giorgio Fumera, and Fabio Roli. Multiple classifier systems for adversarial classification tasks. In Proceedings of the International Workshop on Multiple Classifier Systems, pages 132–141. Springer, 2009.
-  Battista Biggio, Giorgio Fumera, and Fabio Roli. Multiple classifier systems for robust classifier design in adversarial environments. International Journal of Machine Learning and Cybernetics, 1(1-4):27–41, 2010.
-  Battista Biggio, Giorgio Fumera, and Fabio Roli. Multiple classifier systems under attack. In Proceedings of the International Workshop on Multiple Classifier Systems, pages 74–83. Springer, 2010.
-  Battista Biggio, Giorgio Fumera, and Fabio Roli. Design of robust classifiers for adversarial environments. In Proceedings of the 2011 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 977–982. IEEE, 2011.
-  Michael Brückner, Christian Kanzow, and Tobias Scheffer. Static prediction games for adversarial learning problems. Journal of Machine Learning Research, 13(Sep):2617–2654, 2012.
-  Michael Brückner and Tobias Scheffer. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 547–555. ACM, 2011.
-  Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644, 2016.
-  Bin Chen and Jia-ju Zhang. On short interval expansion of rényi entropy. Journal of High Energy Physics, 11:164, 2013.
-  Dan CireşAn, Ueli Meier, Jonathan Masci, and Jürgen Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, 2012.
-  Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
-  George E Dahl, Jack W Stokes, Li Deng, and Dong Yu. Large-scale malware classification using random projections and neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3422–3426. IEEE, 2013.
-  George E Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.
-  Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
-  Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and Andrew B Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.
-  Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333. ACM, 2015.
-  Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd international conference on Machine learning, pages 353–360. ACM, 2006.
-  Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th international conference on Machine learning, pages 513–520. ACM, 2011.
-  Rafael C Gonzalez and Richard E Woods. Digital image processing. Prentice Hall, 2002.
-  Ian Goodfellow, Honglak Lee, Quoc V Le, Andrew Saxe, and Andrew Y Ng. Measuring invariances in deep networks. In Proceedings of Advances in Neural Information Processing Systems, pages 646–654, 2009.
-  Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations, 2015.
-  Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
-  Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. Adversarial perturbations against deep neural networks for malware classification. arXiv preprint arXiv:1606.04435, 2016.
-  Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267, 2017.
-  Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
-  Deng Jia, Dong Wei, Socher Richard, Li-Jia Li, Li Kai, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
-  Alex Kantchelian, Sadia Afroz, Ling Huang, Aylin Caliskan Islam, Brad Miller, Michael Carl Tschantz, Rachel Greenstadt, Anthony D Joseph, and JD Tygar. Approaches to adversarial drift. In Proceedings of the 2013 ACM workshop on Artificial intelligence and security, pages 99–110. ACM, 2013.
-  Corey Kereliuk, Bob L Sturm, and Jan Larsen. Deep learning and music adversaries. IEEE Transactions on Multimedia, 17(11):2059–2071, 2015.
-  Aleksander Kołcz and Choon Hui Teo. Feature weighting for improved classifier robustness. In Proceedings of the 6th Conference on Email and Anti-Spam, 2009.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
-  Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
-  Pavel Laskov et al. Practical evasion of a learning-based classifier: A case study. In Proceedings of the 2014 IEEE Symposium on Security and Privacy (S&P), pages 197–211. IEEE, 2014.
-  Quoc V Le. Building high-level features using large scale unsupervised learning. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8595–8598. IEEE, 2013.
-  Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
-  Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pages 253–256. IEEE, 2010.
-  Jong-Sen Lee. Digital image enhancement and noise filtering by use of local statistics. IEEE transactions on pattern analysis and machine intelligence, (2):165–168, 1980.
-  Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006, 2017.
-  Bin Liang, Miaoqiang Su, Wei You, Wenchang Shi, and Gang Yang. Cracking classifiers for evasion: A case study on the google’s phishing pages filter. In Proceedings of the 25th International Conference on World Wide Web, pages 345–356. International World Wide Web Conferences Steering Committee, 2016.
-  Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 641–647. ACM, 2005.
-  Daniel Lowd and Christopher Meek. Good word attacks on statistical spam filters. In Proceedings of the 2nd Conference on Email and Anti-Spam, 2005.
-  Davide Maiorca, Igino Corona, and Giorgio Giacinto. Looking at the bag is not enough to find the bomb: an evasion of structural methods for malicious PDF files detection. In Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, pages 119–130. ACM, 2013.
-  Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. Artificial Neural Networks and Machine Learning, pages 52–59, 2011.
-  Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
-  Blaine Nelson, Marco Barreno, Fuching Jack Chi, Anthony D Joseph, Benjamin IP Rubinstein, Udam Saini, Charles A Sutton, J Doug Tygar, and Kai Xia. Exploiting machine learning to subvert your spam filter. LEET, 8:1–9, 2008.
-  Nicolas Papernot, Ian Goodfellow, Ryan Sheatsley, Reuben Feinman, and Patrick McDaniel. cleverhans v1.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768, 2016.
-  Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519, 2017.
-  Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Proceedings of the 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387. IEEE, 2016.
-  Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (S&P), pages 582–597. IEEE, 2016.
-  Ricardo N Rodrigues, Lee Luan Ling, and Venu Govindaraju. Robustness of multimodal biometric fusion methods against spoof attacks. Journal of Visual Languages & Computing, 20(3):169–179, 2009.
-  Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1528–1540. ACM, 2016.
-  Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (S&P). IEEE, 2017.
-  Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In Proceedings of Advances in Neural Information Processing Systems, pages 1988–1996, 2014.
-  Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1891–1898, 2014.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Proceedings of the 2014 International Conference on Learning Representations, 2014.
-  Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G Ororbia II, Xinyu Xing, C Lee Giles, and Xue Liu. Adversary resistant deep neural networks with an application to malware detection. arXiv preprint arXiv:1610.01239, 2016.
-  Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G Ororbia II, Xinyu Xing, C Lee Giles, and Xue Liu. Learning adversary-resistant deep neural networks. arXiv preprint arXiv:1612.01401, 2016.
-  Xi Wu, Matthew Fredrikson, Somesh Jha, and Jeffrey F Naughton. A methodology for formalizing model-inversion attacks. In Proceedings of the 2016 IEEE Computer Security Foundations Symposium (CSF), pages 355–370. IEEE, 2016.
-  Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
-  Weilin Xu, Yanjun Qi, and David Evans. Automatically evading classifiers. In Proceedings of the 2016 Network and Distributed System Security Symposium, 2016.
-  Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Proceedings of Advances in Neural Information Processing Systems, pages 649–657, 2015.