Adversarial Perturbations on the Perceptual Ball

Abstract

We present a simple regularisation of Adversarial Perturbations based upon the perceptual loss. While the resulting perturbations remain imperceptible to the human eye, they differ from existing adversarial perturbations in two important regards: (i) our perturbations are semi-sparse, typically altering objects and regions of interest while leaving the background static; (ii) our perturbations do not alter the distribution of data in the image and are undetectable by state-of-the-art methods. As such, this work reinforces the connection between explainable AI and adversarial perturbations.

We show the merits of our approach by evaluating on standard explainability benchmarks and by defeating recent tests for detecting adversarial perturbations, substantially decreasing their detection rate.

1 Introduction

Adversarial Perturbations [50] are small alterations of a data point or image that lead to a substantial change in classification response. Despite drastically changing the response of a machine learning classifier, such perturbations are often imperceptible to humans. The mismatch between how computers and how people respond to adversarially perturbed images poses severe challenges when deploying life-altering decision-making systems in the real world, with notable examples in driverless cars [12]; weapon detection [2]; and person re-identification [10].

Two compelling arguments for the existence of adversarial perturbations in images have been offered. The first, due to Goodfellow [17], remarks that adversarial perturbations are simply an artefact of high-dimensional spaces: it is entirely expected that small perturbations to every pixel in the image can add up to a large change in the classifier response, and in fact the same behaviour is found in linear classifiers.

A second argument attempts to understand why sparse (and potentially even single-pixel) attacks exist, and attributes their effectiveness to exploding gradients. Exploding gradients refers to how changes in functional response can grow exponentially with the depth of the network, an issue known to plague the training of Recurrent Neural Networks [31] and the very deep networks common in computer vision.

This phenomenon occurs because, by construction, neural networks form a product of (convolutional) matrix operations interlaced with non-linearities; for directions and locations in which these non-linearities act approximately linearly, the eigenvalues of the Jacobian can grow exponentially with depth (see [31] for a formal derivation). While this phenomenon is well-studied in the context of training networks, with remedies offered in the form of normalisation [21] or gradient clipping [31], the same phenomenon can occur when generating adversarial perturbations. This means that a carefully chosen and small perturbation can have an extremely large effect on the response of a deep classifier.
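As a toy numerical illustration of this exponential growth (ours, not taken from the paper), the following sketch multiplies together the Jacobians of a stack of random, mildly expansive linear layers and reports how much a unit input perturbation can be amplified; the width, depth and scale are arbitrary illustrative choices.

```python
# Illustrative sketch: the worst-case amplification of a unit input perturbation
# (the largest singular value of the end-to-end Jacobian) tends to grow
# exponentially with depth for a product of mildly expansive layers.
import numpy as np

rng = np.random.default_rng(0)
width, depth = 64, 20

jacobian = np.eye(width)
for d in range(depth):
    layer = rng.normal(scale=1.1 / np.sqrt(width), size=(width, width))
    jacobian = layer @ jacobian
    if (d + 1) % 5 == 0:
        amplification = np.linalg.svd(jacobian, compute_uv=False)[0]
        print(f"depth {d + 1:2d}: max amplification of a unit perturbation ≈ {amplification:.3g}")
```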

To understand how these arguments fit together, and which of these explanations accounts for the familiar behaviour of adversarial perturbations, we propose a simple novel regularisation that bounds the exponential growth of the classifier response by regularising the perceptual distance [22] between the original image and its adversarial perturbation.

One common criticism of adversarial perturbations is that the generated images lie outside the manifold of natural images, and that if we could directly sample from the manifold of images, our adversarial perturbations would be both larger and more representative of the real world. Restricting adversarial perturbations to the manifold of natural images should limit the impact of exploding gradients – if samples are drawn from this space then a well-trained classifier should implicitly reflect the smoothness of the true labels of the underlying data distribution.

Figure 1: Explaining Images. From left to right: (i) The original image; (ii) The original image plus the adversarial perturbation (Indistinguishable from (i)); (iii) The interest map (per pixel magnitude of the adversarial perturbations) renormalised to be visible, highlighting salient objects in the image. (iv) The dominant connected component and the resulting bounding box from automatic object detection.

However, characterising the manifold of natural images outside of handwritten digits has proven extremely challenging1. Our approach provides a complementary lightweight alternative: rather than attempting to characterise the manifold of images, we limit the benefit of the exploding gradients that drive algorithms searching for minimal adversarial perturbations to step off the manifold.

We propose a novel regularisation for adversarial perturbations based around the perceptual loss. Our new perturbations have unexpected properties, tending to highlight objects and regions of interest within the image (see figure 1). Moreover, they have different statistics to some existing adversarial perturbations, allowing us to bypass an existing approach for adversarial perturbation detection. We evaluate on a standard explainability benchmark for image classifiers.

2 Prior work

Numerous approaches to adversarial perturbations have been proposed in the past. These can loosely be divided into white-box approaches [29, 8, 5, 28], which assume access to the underlying nature of the model, and black-box methods [30, 25], which do not. The search for an adversarial perturbation is often formulated as finding the closest point to a particular image, under some $\ell_p$ norm, that takes a different class label to that image.

Also of interest are works that add additional constraints to the perturbation to make the generated images more plausible. Such works may restrict the space of perturbations considered by trying to find an adversarial perturbation that confounds many classifiers at once [11], or that is robust to image warps [2]. Other approaches consider only a single image and single classifier, but restrict adversarial perturbations to lie on the manifold of plausible images [48, 16, 40, 44]. The principal limitation of this approach is that, as a minimal first step, it requires a plausible generator of natural images, something that is achievable on small simple datasets such as MNIST but currently out of reach for even the 224 by 224 thumbnails used by typical ImageNet [37] classifiers.

Adversarial Perturbations and Counterfactuals

There has been substantial work relating the generation of adversarial perturbations and counterfactual explanations. This relationship follows from the definitions in philosophy and folk psychology of a counterfactual explanation as answering the question “What would need to be different in order for outcome A to have occurred instead of B?”. With full causal models of images being outside our grasp, such questions are commonly answered using Lewis’s Closest Possible World semantics[24], rather than Pearl’s Structured Causal Models [32]. Under Lewis’s framework an explanation for why an image is classified as ‘dog’ rather than ‘cat’ can be found by searching for the most similar possible world (i.e. image) to which the classifier assigns the label ‘cat’.

Conceptually, this is no different from searching for an adversarial perturbation sampled from the space of possible images. Several approaches have been proposed that either bypass the requirement that the counterfactual is an image and return text descriptions [20]; naïvely ignore the requirement that the world is plausible [53]; use prototypes [51] or auto-encoders [9] to characterise the manifold of plausible images; or require large edits that replace regions of the image, either with the output of GANs [6] or with patches sampled from other images [18].

Figure 2: More focused adversarial perturbations. From left to right: (i) The original image. (ii) A perturbed image found using DeepFool and minimising equation (3). (iii) A perceptually perturbed image minimising (4). As described in the text, the use of the perceptual loss results in more focused alterations in cluttered scenes.

Adversarial Perturbations and Gradient Methods

The majority of methods for explainability in computer vision are gradient- or importance-based methods that assign an importance weight to every pixel in the image, to every super-pixel, or to mid-level neurons. These gradient methods and adversarial perturbations are strongly related. In fact, with most modern networks being piecewise linear, if the found adversarial perturbation and the original image lie on the same linear piece, the difference between the original image and the closest adversarial example under the $\ell_2$ norm is equivalent to the direction of steepest descent, up to scaling (a short derivation is sketched below). As such, adversarial perturbations can be thought of as a slightly robustified method of estimating the gradient that takes into account some local non-linearities.
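To make this relationship concrete, here is a minimal derivation for a single linear piece, using notation that anticipates the margin $m$ defined in section 3. On a linear piece the margin is affine, $m(x) = w^\top x + b$. The closest point under the $\ell_2$ norm at which the margin vanishes is the orthogonal projection onto the hyperplane $m(x') = 0$,

$x' = x - \dfrac{m(x)}{\|w\|_2^2}\, w,$

so the minimal perturbation $x' - x$ is (anti-)parallel to $\nabla_x m(x) = w$; that is, it points along the direction of steepest descent of the margin, scaled by $m(x)/\|w\|_2^2$.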

Perturbation methods look over a larger range to estimate long-range, gradient-like responses. These include [56], who repeatedly applied constant-value occlusion masks to different input patches to find the regions that change the output the most, and LIME [35], which constructs a linear model using the responses obtained from perturbing super-pixels. More recently, Extremal Perturbation [15] identifies an optimal fixed mask of the image to occlude that gives the maximal effect on the network's output.

Among pure gradient-based approaches, [41] calculates the gradient of the output with respect to the input to create a saliency map, giving fine-grained but potentially less interpretable results. Other gradient-based approaches include SmoothGrad [43], which stabilises the saliency maps by averaging over multiple noisy copies, and Integrated Gradients [49], where an attribution score is calculated by accumulating gradients along a path from an empty image to the input image (sketched below).
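As a rough sketch of the idea behind Integrated Gradients (not the reference implementation of [49]), assuming a PyTorch classifier `model`, a single-image batch `x`, and an all-zero baseline:

```python
import torch

def integrated_gradients(model, x, target, steps=32):
    """Sketch: accumulate gradients along the straight path from an empty
    (all-zero) baseline image to the input image."""
    x = x.detach()                       # assume a (1, C, H, W) input batch
    baseline = torch.zeros_like(x)
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        score = model(point)[0, target]  # logit of the class being explained
        grad, = torch.autograd.grad(score, point)
        total += grad
    return (x - baseline) * total / steps   # per-pixel attribution
```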

The CAM-based approaches [59, 38] sum the activation maps in the final convolutional layer of the network. These small activation maps are then up-sampled to obtain a heatmap that highlights particularly salient regions. Grad-CAM [38] is a generalised variant which finds similar regions of interest to the perturbation-based approaches; a rough sketch follows.
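A minimal sketch of the Grad-CAM computation (our reading of [38], not its reference code), assuming `features` holds the final convolutional activations captured by a forward hook and `grads` the gradients of the target score with respect to them, both of shape (B, C, H, W); the upsampling factor is illustrative:

```python
import torch
import torch.nn.functional as F

def grad_cam(features, grads, upsample=32):
    """Sketch: weight each activation map by its spatially averaged gradient,
    sum over channels, rectify, and upsample towards image resolution."""
    weights = grads.mean(dim=(2, 3), keepdim=True)          # one weight per channel
    cam = F.relu((weights * features).sum(dim=1, keepdim=True))
    return F.interpolate(cam, scale_factor=upsample, mode="bilinear", align_corners=False)
```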

A number of experiments have been developed to test saliency methods, including the pointing game [33, 57, 38], the weakly supervised object localisation task [14, 7], and the insertion and deletion game [33, 54]. In particular, [1] developed experiments to test the suitability of saliency methods. These experiments have been applied to a number of existing saliency techniques, including Gradient [41], SmoothGrad [43], GuidedBackprop [47], Integrated Gradients [49], GradientInput [39] and GradCAM [38].

Detecting Adversarial Perturbations

Two main schools of thought exist for defending against adversarial perturbations: either classifiers can be strengthened to resist adversarial perturbations [30, 55, 52], or adversarial perturbations can be detected directly [4] and excluded.

Multiple detection approaches exist; a good review (and subsequent rebuttal) of several of them can be found in [4], which covers approaches such as adding additional classes to a classifier [19]; utilising additional classifiers, e.g. [27]; density estimates, e.g. [13]; eigen-decompositions [58]; and a variety of others. Other interesting approaches utilise generative models of images to detect adversarial examples, e.g. [44].

Of particular interest to this paper are approaches that use layers or mid-level responses of the classifier to identify adversarial perturbations, as we regularise these layers through our perceptual loss. One such approach is [27], who added small neural networks to detect perturbations from the values of various layers. Another is [46], who constructed statistics based on finding subsets of neurons in a given layer with unusual values. Finally, many detection methods can be fooled by constructing losses that explicitly account for the method used to detect the perturbations [4].

Figure 3: Perceptual Perturbations as Explanations. See discussion in section 4.

3 Methodology

We consider a standard multi-class classifier $f$ that takes an image $I$ as input and returns a $C$-dimensional vector of confidences $f_c(I)$, one for each of the $C$ classes, and we say the classifier assigns the class $\arg\max_c f_c(I)$ to the image $I$.

Given an image $I$ classified as label $\ell$, we consider the multi-class margin:

$m(I') = f_\ell(I') - \max_{c \neq \ell} f_c(I')$   (1)

and note that $m(I') \le 0$ if and only if the classifier does not assign label $\ell$ to image $I'$. As such, an adversarial perturbation can be found by minimising

$L_a(I') = \max\left(m(I'),\, -\epsilon\right)$   (2)

where $\epsilon$ is a small target value greater than zero. It is well-known [45] that minimising a loss of the form:

$L_a(I') + \lambda \, \|I' - I\|_2^2$   (3)

is equivalent to finding a minimiser of $L_a(I')$ subject to the requirement that $I'$ lies in the ball defined by $\|I' - I\|_2^2 \le r$ for some $r > 0$. As such, minimising this objective for appropriate values of $\lambda$ and $\epsilon$ is a good strategy for finding adversarial perturbations of image $I$ with small $\ell_2$ norm.

Writing $\Phi_l(I)$ for the response of the $l$-th layer of the neural net to $I$, we consider the related loss

$L_a(I') + \lambda \left( \|I' - I\|_2^2 + \sum_{l \in \mathcal{L}} \|\Phi_l(I') - \Phi_l(I)\|_2^2 \right)$   (4)

defined over a set of layers $\mathcal{L}$ of the neural net.

The second half of this objective is the perceptual loss of [22], and minimising this objective is equivalent to finding a minimiser of (3) subject to the requirement that $I'$ lies in the ball defined by $\sum_{l \in \mathcal{L}} \|\Phi_l(I') - \Phi_l(I)\|_2^2 \le r$.
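A minimal PyTorch-style sketch of objective (4), under our reading of the text; `features(img, layers)` is an assumed helper returning the activations of the chosen layers, and the weights `lam` and `eps` are illustrative placeholders rather than the paper's values:

```python
import torch
import torch.nn.functional as F

def margin(logits, label):
    """Equation (1): f_label(I') minus the largest other-class response."""
    other = logits.clone()
    other[:, label] = float("-inf")
    return logits[:, label] - other.max(dim=1).values

def perceptual_ball_loss(model, features, x_adv, x_orig, label,
                         lam=1e-2, eps=1e-2, layers=(5, 10, 15)):
    """Sketch of objective (4): hinged margin + pixel distance + perceptual distance."""
    # Equation (2): stop pushing once the margin has dropped below -eps.
    l_adv = torch.clamp(margin(model(x_adv), label), min=-eps).mean()

    pixel_term = F.mse_loss(x_adv, x_orig, reduction="sum")
    feats_adv = features(x_adv, layers)
    feats_orig = [f.detach() for f in features(x_orig, layers)]   # treat originals as constants
    perceptual_term = sum(F.mse_loss(a, o, reduction="sum")
                          for a, o in zip(feats_adv, feats_orig))
    return l_adv + lam * (pixel_term + perceptual_term)
```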

Some care needs to be taken in selecting the layers of the network we regularise over. If we regularise every layer of the network, the perturbation is overly constrained and the classifier response cannot change.

That said, the method seems relatively robust to the choice of layers, and for VGGNet [42] we use layers of the batch-normalised VGG192. For evaluating adversarial perturbation detection we use a standard CNN on CIFAR, detailed in section 5.

A visual comparison of the results of minimising equations (3) and (4) can be seen in figure 2.

The objective is optimised using L-BFGS [60], a standard algorithm well-suited to minimising these smooth, non-stochastic objectives. To guarantee that the perturbed image has valid RGB values, we consider two sets of constraints: (i) that the values do not exceed the range observed in the original image; or (ii) that the perturbed image lies within an $\ell_\infty$ ball of the original image (used in the adversarial perturbation experiments of section 5). In both cases we simply clip the solution found at each iteration to lie inside these bounds; a sketch of this loop is given below.
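A sketch of the resulting projected optimisation loop, reusing the `perceptual_ball_loss` sketch above; the default bounds implement constraint (i), and the number of iterations is a placeholder:

```python
import torch

def perceptual_perturb(model, features, x_orig, label, steps=20, lo=None, hi=None):
    """Sketch: optimise the perturbed image with L-BFGS, clipping back into the
    valid box after each iteration."""
    lo = float(x_orig.min()) if lo is None else lo   # constraint (i): range of the original image
    hi = float(x_orig.max()) if hi is None else hi

    x_adv = x_orig.clone().detach().requires_grad_(True)
    opt = torch.optim.LBFGS([x_adv], max_iter=1)     # one L-BFGS iteration per outer step

    def closure():
        opt.zero_grad()
        loss = perceptual_ball_loss(model, features, x_adv, x_orig, label)
        loss.backward()
        return loss

    for _ in range(steps):
        opt.step(closure)
        with torch.no_grad():
            x_adv.clamp_(min=lo, max=hi)             # clip the iterate inside the bounds
    return x_adv.detach()
```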

4 Perceptual Perturbations for Visual Explanations

Figure 4: A qualitative comparison of existing methods. See discussion in section 4.

Before describing our experimental overview, we give a qualitative analysis of the perceptual perturbations, as shown in figure 3. The found perturbations do a good job of localising on a single object class, even in the presence of highly textured or cluttered images (e.g. dragon fly on fern; polar bear; coral reef), containing multiple classes (dog and man; baseball and people; lawnmower). The perturbations tend to focus upon the heads of the labelled class reflecting the result of [14] that heads are more salient (llama; elephant), but importantly it only finds heads of the explained class salient, and not those belonging to other objects (e.g. dog with leopard; dog with man; baseball; guitar). Much of the confusion in localisation seems to occur in supporting classes close behind the object - for example the human legs behind the lawnmower are found to be salient as is the torso of the man playing the guitar. This could be because these classes frequently occur in close proximity to one another; and provide supporting evidence for the detected class.

Following recent work [1] that points out that edge detectors with no knowledge of the classifier do surprisingly well on the insertion and deletion metrics used in visual explainability, we instead focus on object localisation.

To transform our perceptual perturbations into a saliency map, we simply treat the magnitude of the alteration of each pixel as its salience. We evaluate the quality of our perceptual perturbations as explanations using the localisation protocol of [14, 57, 3]. We predict a bounding box for the most dominant object in the first 1000 ImageNet [37] validation images and employ simple threshold methods for fitting bounding boxes. For the first approach, we follow [14] in using a value-threshold: we normalise individual heatmaps to the range $[0, 1]$, square-root transform the saliency maps, and grid search over a set of evenly spaced thresholds. For the second experiment, we also follow [14] in using a threshold scaled by the per-image mean. Finally, we evaluate a third measure based on a percentage-threshold, where we consider a fixed percentage of the most salient pixels.

As is standard, for each threshold we extract the largest connected component and draw a bounding box around it; the object is considered successfully localised when the Intersection over Union (IoU) between this box and the ground truth exceeds a fixed threshold. A sketch of this saliency-to-bounding-box pipeline is given below. Following GradCAM's guided version [38], which makes use of image gradients from [47], we consider a guided variant of our own, consisting of an element-wise multiplication between our perturbations and the normalised guided gradient of the image with respect to the margin $m$.
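A sketch of the value-threshold variant of this pipeline, using NumPy/SciPy; the perturbation is assumed to be a (channels, height, width) array and the threshold would be chosen by the grid search described above:

```python
import numpy as np
from scipy import ndimage

def saliency_to_box(perturbation, threshold):
    """Sketch: perturbation magnitude -> normalised saliency -> largest component -> box."""
    sal = np.abs(perturbation).sum(axis=0)                        # magnitude over colour channels
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)      # normalise to [0, 1]
    sal = np.sqrt(sal)                                            # square-root transform

    mask = sal > threshold
    labels, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    ys, xs = np.where(labels == (np.argmax(sizes) + 1))           # largest connected component
    return xs.min(), ys.min(), xs.max(), ys.max()                 # (x0, y0, x1, y1)

def iou(a, b):
    """Intersection over Union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)
```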

We compare our perceptual method and its guided variant with GradCAM [38], Guided Backprop [47], Guided GradCAM [38], SmoothGrad [43], Integrated Gradients [49], Excitation Backprop [57], RISE [33] and Extremal Perturbations [15]. To demonstrate that the perceptual loss is important to the success of our approach, we also compare against the unregularised adversarial perturbations of DeepFool [29].

A qualitative evaluation of the methods can be seen in figure 3. These images were selected to be challenging: we visualise a subset of the images where DeepFool's adversarial perturbation did not align with the object. For reasons of space and fairness, we do not show DeepFool's adversarial perturbations in this figure. Compared with other visual explanation techniques, our method highlights the interior textures of the target object in the image. This differs from gradient-based methods such as SmoothGrad [43], which capture finer edge details, and from activation-based methods such as GradCAM [38], which highlight the entire object coarsely. This is perhaps most clear with the panda image, where our method captures the interior texture of the panda rather than just its hard contours.

We compare the generated bounding box with the ground-truth bounding box. For the approaches [38, 43, 49, 57, 15], we made use of the implementation developed for [15]3. For RISE [33] we made use of the authors' publicly available code4. For DeepFool we used the Foolbox library provided by [34]5. We used default parameters for all of the visual explanation methods on the validation images.

Method Value Percent Mean
GradCAM [38] 0.48 0.48 0.47
g-GradCAM[38] 0.50 0.48 0.46
Smooth Grad[43] 0.46 0.47 0.46
Integrated Grad[49] 0.44 0.49 0.44
Excitation[57] 0.49 0.46 0.45
Extremal[15] 0.54 0.52 0.53
Guided Backprop [47] 0.46 0.48 0.47
RISE[33] 0.57 0.58 0.58
DeepFool[29] 0.57 0.66 0.57
Us 0.48 0.46 0.46
Us Guided 0.43 0.46 0.46
Table 1: Error for value threshold, percentage threshold and mean threshold. Lower is better.

We perform better than the tested approaches on object localisation, obtaining the lowest error on the value-based threshold, the lowest error (along with Excitation Backprop [57]) on the percent-based threshold, and the lowest average error across all choices of thresholding. We do noticeably less well on mean-based thresholding, being beaten by two methods, Excitation Backprop [57] and Integrated Gradients [49], and obtaining similar scores to several other approaches.

5 Detecting Perceptual Perturbations

(Figure 5 panels, from left to right: PGD, Mean, CW, Ours.)

Figure 5: Cone plots for each method. The greater the curvature, the easier it is to detect an adversarial perturbation using [36]. Following [36] we take unperturbed images together with their respective adversarial perturbations generated by PGD, Mean, CW, and our method. We plot the softmax assignment to the correct class in the space spanned by the direction of the adversarial perturbation (y direction) and a randomly selected orthogonal direction of equal norm (x direction); that is, we sample images of the form $I + \alpha\,\delta_{\text{adv}} + \beta\,\delta_\perp$ over a grid of values of $\alpha$ and $\beta$. The upper (green) dot represents the location of the original image, while the lower (red) dot represents the location of the adversarial perturbation. We average over 100 randomly selected CIFAR images from the test set, and for each image and position in the space we average over 5 random orthogonal directions. The logarithmic colour scales were selected for within-plot clarity and vary substantially between plots.

To demonstrate that our adversarial perturbations have fundamentally different properties to existing approaches, we show how a recently published approach for detecting adversarial perturbations fails to detect ours. In particular, we compare against the recently published detection method of [36], which leverages the changes in logit differences induced by random noise to uncover adversarial perturbations.

This is a particularly relevant test for our approach, as the defence of [36] is motivated by the notion that adversarial perturbations found by minimising some p-norm distance from the original image typically have different properties to naturally occurring images. In contrast, by explicitly minimising a distance measure previously unused in the adversarial perturbation literature, our approach produces images that are closer to the distribution of real images according to the perceptual distance, and may therefore bypass tests that rely on such differences.

The approach of [36] is as follows. Given a training set, for each pair of classes $y$ and $z$ they calculate statistics describing how the difference between the classifier response for the true class $y$ and for another class $z$ changes under injected noise. This must take into account the class-dependent nature of the distribution (see [36] for full details). Using these quantities, they construct a z-score between the observed class of an image and every other class, and build a classifier that flags an image as adversarial if the z-score deviates by more than a class-defined threshold; a rough sketch is given below.
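The following is only a rough sketch of our reading of this statistic; the noise level, sample count, and the final standardisation against clean-data statistics are illustrative, not the exact procedure of [36]:

```python
import torch

def noise_logit_deltas(model, x, y_pred, sigma=0.05, n_samples=32):
    """Sketch: expected noise-induced change of each logit gap f_z(x) - f_y(x),
    for a single-image batch x classified as y_pred."""
    with torch.no_grad():
        clean = model(x)
        gaps_clean = clean - clean[:, y_pred:y_pred + 1]
        deltas = []
        for _ in range(n_samples):
            noisy = model(x + sigma * torch.randn_like(x))
            deltas.append((noisy - noisy[:, y_pred:y_pred + 1]) - gaps_clean)
        return torch.stack(deltas).mean(dim=0)

# Detection sketch: standardise these deltas with per class-pair means and standard
# deviations estimated on clean training data, and flag the image as adversarial if
# any resulting z-score exceeds its class-defined threshold.
```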

They motivate this approach in several ways, one of which is the 'cone' plot (see figure 5), in which they plot the softmax of the target class over the space spanned by the direction of the adversarial perturbation and randomly selected orthogonal directions (see later in this section for a full explanation). The plot demonstrates that for standard adversarial perturbations, unlike natural images, moving in an orthogonal direction increases the probability of the target class. As most random perturbations in high-dimensional spaces are approximately orthogonal to the direction of the adversarial perturbation, [36] suggested evaluating the stability of the classifier response under random noise. A sketch of how such a cone plot can be produced follows.
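This sketch generates the cone-plot grid for a single image, assuming `x` and `delta_adv` are tensors of the same shape and `target` is the correct class; the grid resolution and ranges are illustrative:

```python
import torch

def cone_plot_grid(model, x, delta_adv, target, grid=21, scale=1.5):
    """Sketch: softmax of the target class over the plane spanned by the adversarial
    direction (rows) and a random orthogonal direction of equal norm (columns)."""
    d_adv = delta_adv / delta_adv.norm()
    r = torch.randn_like(d_adv)
    d_orth = r - (r * d_adv).sum() * d_adv               # remove the adversarial component
    d_orth = d_orth / d_orth.norm()

    step = delta_adv.norm()
    alphas = torch.linspace(-scale, scale, grid) * step  # orthogonal direction (x axis)
    betas = torch.linspace(0.0, 2 * scale, grid) * step  # adversarial direction (y axis)
    heat = torch.zeros(grid, grid)
    with torch.no_grad():
        for i, b in enumerate(betas):
            for j, a in enumerate(alphas):
                probs = torch.softmax(model(x + b * d_adv + a * d_orth), dim=1)
                heat[i, j] = probs[0, target]            # assumes a single-image batch
    return heat
```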

Adversarial perturbation defences can only be expected to work if the generated perturbations satisfy the expected constraints, the so-called threat model. As such, we form our adversarial perturbations by adding additional constraints to those expected by [36]; for conformity, we restrict our adversarial perturbations to lie within the $\ell_\infty$ distance of the original image required by the threat model.

To match the experimental setup of [36], in which the adversaries are restricted to 20 iterations of PGD [26], we also restrict the number of iterations of L-BFGS to match the number of PGD iterations, and we return the perturbation with the best loss over the function evaluations.

We compare our results against several standard methods implemented in the package provided by [36]6, namely PGD [26] (projected gradient descent), PGD Mean (PGD averaged over noisy inputs) and CW [5] (Carlini-Wagner). The Carlini-Wagner attack provided by the package is the L2 attack and does not restrict the $\ell_\infty$ norm of the perturbations, although empirically it typically, but not always, stays within the $\ell_\infty$ limit of the threat model.

Using the code provided by the authors of [36], we make a direct comparison with their detection approach on the CIFAR dataset with their default network, a pre-trained7 batch-normalised network consisting of 7 convolutional blocks. This network is used for all subsequent experiments in this section. We arbitrarily fix the 5th, 10th and 15th layers of this network, where, following our convention in section 3, we define a layer as any single operation. We emphasise that this choice of layers was not tuned, and we did not evaluate any other choices. The results can be seen in Table 2. We outperform each of the other approaches, achieving a detection rate of only around 14% with a similar level of classifier success to the other approaches. The performance of PGD differs slightly from the values reported in [36]; a brief investigation indicates that the method is sensitive to the choice of noise, and the default parameters provided by the package may not be those used in the original paper.

A reproduction of the cone plot from [36], showcasing the different structure of the different perturbation types, can be seen in figure 5. This figure plots the change in the softmax response of the target class against the direction of the adversarial perturbation and that of a random orthogonal perturbation (see the figure caption for full implementation details). The cone-like structure described by [36] can be seen for both PGD and the Mean method. However, neither our approach nor the CW perturbations share this structure. This may be related to the use of the $\ell_2$ norm when generating the CW perturbations (the version provided as part of [36] is the L2 version), which we also use alongside our perceptual loss components while matching the $\ell_\infty$ bound of the threat model. As shown in Table 2, our results are noticeably harder to detect, with [36] only able to detect perceptual perturbations around 14% of the time, versus a detection rate of around 57% for CW.

PGD Mean Ours CW
Clean Detection 0.0228 0.0224 0.024 0.0242
Attack Detection 0.7446 0.3275 0.1447 0.5658
Classifier Suc. % 0.0388 0.0399 0.0435 0.0431
Table 2: Comparison of the performance of our approach and 3 approaches provided by the authors of [36] using the default noise models. ‘Clean Detection’ gives the false positive rate of the classifier, and ‘Attack Detection’ the true positive rate.

6 Conclusion

We have presented a novel regularisation based on the perceptual loss for the generation of adversarial perturbations. This regularisation is designed to block the exploitation of exploding gradients when generating adversarial perturbations forcing larger and more meaningful perturbations to be generated. The fact that such perturbations still exist under these constraints and remain imperceptible to humans is another piece of the puzzle in understanding the interrelationship between adversarial perturbations, neural networks, and human vision.

We have shown how these perturbations can be interpreted as explanations and obtained state-of-the-art results on a standard explainability benchmark. Moreover, the properties of these novel perturbations mean they are not detected by recent detection methods [36].

Acknowledgements

This work was supported in part by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, and The Alan Turing Institute. Andrew Elliott was funded by EPSRC grant EP/N510129/1 at The Alan Turing Institute and Accenture Plc; Stephen Law was funded by EPSRC grant EP/N510129/1 at The Alan Turing Institute and Chris Russell was partially supported by the Alan Turing Institute and programmatic research funding provided by the Luminate Group. We would further like to thank Tomas Lazauskas and the team at the PEARL cluster for providing access to the GPU cluster and giving us invaluable help in setting up the computational environment.

Appendix A Visual Explanation

We include a selection of successful examples and failure cases from the visual explanation experiment on ImageNet [37]. We display the original image, the difference between the perceptually perturbed image and the original, the saliency map, the dominant connected component with the resulting bounding box, and the dominant connected component masked with the original image.

Figure 6 shows selected examples from our method which failed in the explanation experiment. We note that in some failed cases our method was able to localise the target object correctly; these cases occur when the predicted bounding box is significantly larger or smaller than the ground-truth bounding box. This is apparent in row 3, where the predicted bounding box is notably larger than the ground-truth localisation of the bighorn. Another common failure is the occurrence of two objects in the scene where the ground-truth annotation is placed on only one of them. This is clearly visible in row 4, where there are two porcupines in the same scene but only one of them is highlighted. There are also cases where the target object is partially occluded, such as the American alligator in row 11, where the ground-truth bounding box includes part of the animal hidden behind the vegetation. Finally, there are examples where the predicted bounding box has clearly localised on objects not within the target class; one such example is row 5, where our method highlights both the tricycle and the two children in the scene.

Figure 7 shows selected examples from our method which succeeded in the explanation experiment. Most of the successful localised examples contain a single dominant object in the scene. For example, the soup bowl in row 1, the cougar in row 2 and the guenon in row 3. In some cases, the method succeeds in localising the target object with a busy background, e.g. the cabbage butterfly in row 5, the bee eater in row 7 and the white wolf in row 11. The method also achieves success in localising partial objects such as the dugong in row 10.

Appendix B Adversarial Perturbations

For completeness, we present several examples using each of the methods presented in the main paper. We do not filter on successful cases, and we select the examples to be recognisable and to highlight several classes. We display the results in figure 8, where we present the original image, the adversarially perturbed image, and the difference between the two images rescaled to highlight the adversarial perturbation.

Much like the results on ImageNet in the visual explanations section, we note that our perceptual perturbations are localised on the object in question in all examples presented, with the exception of row 6. The other PGD-based approaches (PGD and Mean) appear to localise less well, although they do show localisation in rows 3 and 4. Echoing our discussion of the cone plot in the main body, the L2 variant of CW also appears to localise well, at least on CIFAR.

Row labels for Figure 6 (top to bottom): Sea snake, Alp, Bighorn, Porcupine, Rain barrel, Tricycle, Tripod, Cardigan, Vulture, Afghan hound, American alligator, White wolf.

Figure 6: Failure examples on Imagenet [37]. From left to right: (i) The original image; (ii) The difference between the perceptually perturbed and the original; (iii) The saliency map; (iv) The dominant connected component and the resulting bounding box in green and the ground truth in red; (v) The dominant connected component masked with the original image.

Row labels for Figure 7 (top to bottom): Soup bowl, Cougar, Guenon, Recreational vehicle, Cabbage butterfly, Pickup, Bee eater, Kerry blue terrier, Standard Schnauzer, Dugong, White wolf, Wood rabbit.

Figure 7: Successful examples on Imagenet [37]. From left to right: (i) The original image; (ii) The difference between the perceptually perturbed and the original; (iii) The saliency map; (iv) The dominant connected component and the resulting bounding box in green and the ground truth in red; (v) The dominant connected component masked with the original image.
Figure 8: Examples from CIFAR [23] from the adversarial perturbation experiments. From left to right: (i) The original image; (ii) PGD perturbed image; (iii) PGD difference map; (iv) Mean perturbed image; (v) Mean difference map; (vi) CW perturbed image; (vii) CW difference map; (viii) Our perturbed image; and (ix) Our difference map. Note each difference map has been re-scaled to highlight the differences between regions and thus the colours are not directly comparable between images.

Footnotes

  1. See discussion in the experimental section of [48].
  2. To avoid ambiguity, our indexing treats each operation such as convolution, ReLU or BatchNorm as a separate layer.
  3. https://github.com/facebookresearch/TorchRay
  4. https://github.com/eclique/RISE
  5. https://github.com/bethgelab/foolbox
  6. https://github.com/yk/icml19_public
  7. https://github.com/aaron-xichen/pytorch-playground

References

  1. J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt and B. Kim (2018) Sanity checks for saliency maps. External Links: 1810.03292 Cited by: §2, §4.
  2. A. Athalye, L. Engstrom, A. Ilyas and K. Kwok (2017) Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397. Cited by: §1, §2.
  3. C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang and W. Xu (2015) Look and think twice: capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2956–2964. Cited by: §4.
  4. N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. Cited by: §2, §2, §2.
  5. N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §2, §5.
  6. C. Chang, E. Creager, A. Goldenberg and D. Duvenaud (2018) Explaining image classifiers by counterfactual generation. External Links: 1807.08024 Cited by: §2.
  7. A. Chattopadhay, A. Sarkar, P. Howlader and V. N. Balasubramanian (2018-03) Grad-cam++: generalized gradient-based visual explanations for deep convolutional networks. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). External Links: ISBN 9781538648865, Link, Document Cited by: §2.
  8. P. Chen, Y. Sharma, H. Zhang, J. Yi and C. Hsieh (2018) Ead: elastic-net attacks to deep neural networks via adversarial examples. In Thirty-second AAAI conference on artificial intelligence, Cited by: §2.
  9. A. Dhurandhar, P. Chen, R. Luss, C. Tu, P. Ting, K. Shanmugam and P. Das (2018) Explanations based on the missing: towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, pp. 592–603. Cited by: §2.
  10. W. Ding, X. Wei, X. Hong, R. Ji and Y. Gong (2019) Universal adversarial perturbations against person re-identification. arXiv preprint arXiv:1910.14184. Cited by: §1.
  11. G. Elsayed, S. Shankar, B. Cheung, N. Papernot, A. Kurakin, I. Goodfellow and J. Sohl-Dickstein (2018) Adversarial examples that fool both computer vision and time-limited humans. In Advances in Neural Information Processing Systems, pp. 3910–3920. Cited by: §2.
  12. K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno and D. Song (2017) Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945. Cited by: §1.
  13. R. Feinman, R. R. Curtin, S. Shintre and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §2.
  14. R. C. Fong and A. Vedaldi (2017-10) Interpretable explanations of black boxes by meaningful perturbation. 2017 IEEE International Conference on Computer Vision (ICCV). External Links: ISBN 9781538610329, Link, Document Cited by: §2, §4, §4.
  15. R. Fong, M. Patrick and A. Vedaldi (2019) Understanding deep networks via extremal perturbations and smooth masks. External Links: 1910.08485 Cited by: §2, Figure 4, Table 1, §4, §4.
  16. J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg and I. Goodfellow (2018) Adversarial spheres. arXiv preprint arXiv:1801.02774. Cited by: §2.
  17. I. J. Goodfellow, J. Shlens and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1.
  18. Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh and S. Lee (2019) Counterfactual visual explanations. External Links: 1904.07451 Cited by: §2.
  19. K. Grosse, P. Manoharan, N. Papernot, M. Backes and P. McDaniel (2017) On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280. Cited by: §2.
  20. L. A. Hendricks, R. Hu, T. Darrell and Z. Akata (2018) Generating counterfactual explanations with natural language. arXiv preprint arXiv:1806.09809. Cited by: §2.
  21. S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1.
  22. J. Johnson, A. Alahi and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §1, §3.
  23. A. Krizhevsky, V. Nair and G. Hinton () CIFAR-10 (canadian institute for advanced research). . External Links: Link Cited by: Figure 8.
  24. D. Lewis (2013) Counterfactuals. John Wiley & Sons. Cited by: §2.
  25. Y. Liu, X. Chen, C. Liu and D. Song (2016) Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770. Cited by: §2.
  26. A. Madry, A. Makelov, L. Schmidt, D. Tsipras and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §5, §5.
  27. J. H. Metzen, T. Genewein, V. Fischer and B. Bischoff (2017) On detecting adversarial perturbations. In Proceedings of 5th International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2, §2.
  28. A. Modas, S. Moosavi-Dezfooli and P. Frossard (2019) Sparsefool: a few pixels make a big difference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9087–9096. Cited by: §2.
  29. S. Moosavi-Dezfooli, A. Fawzi and P. Frossard (2016-06) DeepFool: a simple and accurate method to fool deep neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781467388511, Link, Document Cited by: §2, Table 1.
  30. N. Papernot, P. McDaniel, X. Wu, S. Jha and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. Cited by: §2, §2.
  31. R. Pascanu, T. Mikolov and Y. Bengio (2012) Understanding the exploding gradient problem. CoRR, abs/1211.5063 2. Cited by: §1, §1.
  32. J. Pearl (2000) Causality: models, reasoning and inference. Vol. 29, Springer. Cited by: §2.
  33. V. Petsiuk, A. Das and K. Saenko (2018) RISE: randomized input sampling for explanation of black-box models. External Links: 1806.07421 Cited by: §2, Figure 4, Table 1, §4, §4.
  34. J. Rauber, W. Brendel and M. Bethge (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131. Cited by: §4.
  35. M. Ribeiro, S. Singh and C. Guestrin (2016) “Why should I trust you?” Explaining the predictions of any classifier. SIGKDD. Cited by: §2.
  36. K. Roth, Y. Kilcher and T. Hofmann (2019) The odds are odd: a statistical test for detecting adversarial examples. In International Conference on Machine Learning, pp. 5498–5507. Cited by: Figure 5, Table 2, §5, §5, §5, §5, §5, §5, §5, §5, §5.
  37. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: Appendix A, Figure 6, Figure 7, §2, §4.
  38. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh and D. Batra (2016-10) Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. arXiv e-prints, pp. arXiv:1610.02391. External Links: 1610.02391 Cited by: §2, §2, Figure 4, Table 1, §4, §4, §4, §4.
  39. A. Shrikumar, P. Greenside, A. Shcherbina and A. Kundaje (2016) Not just a black box: learning important features through propagating activation differences. External Links: 1605.01713 Cited by: §2.
  40. C. Simon-Gabriel, Y. Ollivier, L. Bottou, B. Schölkopf and D. Lopez-Paz (2018) Adversarial vulnerability of neural networks increases with input dimension. arXiv preprint arXiv:1802.01421. Cited by: §2.
  41. K. Simonyan, A. Vedaldi and A. Zisserman (2013) Deep inside convolutional networks. ICLR. Cited by: §2, §2.
  42. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.
  43. D. Smilkov, N. Thorat, B. Kim, F. Viégas and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. External Links: 1706.03825 Cited by: §2, §2, Figure 4, Table 1, §4, §4, §4.
  44. Y. Song, T. Kim, S. Nowozin, S. Ermon and N. Kushman (2017) Pixeldefend: leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766. Cited by: §2, §2.
  45. D. C. Sorensen (1982) Newton’s method with a model trust region modification. SIAM Journal on Numerical Analysis 19 (2), pp. 409–426. Cited by: §3.
  46. S. Speakman, S. Sridharan, S. Remy, K. Weldemariam and E. McFowland (2018) Subset scanning over neural network activations. arXiv preprint arXiv:1810.08676. Cited by: §2.
  47. J. T. Springenberg, A. Dosovitskiy, T. Brox and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. External Links: 1412.6806 Cited by: §2, Table 1, §4, §4.
  48. D. Stutz, M. Hein and B. Schiele (2019) Disentangling adversarial robustness and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6976–6987. Cited by: §2, footnote 1.
  49. M. Sundararajan, A. Taly and Q. Yan (2017) Axiomatic attribution for deep networks. External Links: 1703.01365 Cited by: §2, §2, Figure 4, Table 1, §4, §4, §4.
  50. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
  51. A. Van Looveren and J. Klaise (2019) Interpretable counterfactual explanations guided by prototypes. arXiv preprint arXiv:1907.02584. Cited by: §2.
  52. D. Vijaykeerthy, A. Suri, S. Mehta and P. Kumaraguru (2019) Hardening deep neural networks via adversarial model cascades. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.
  53. S. Wachter, B. Mittelstadt and C. Russell (2017) Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harv. JL & Tech. 31, pp. 841. Cited by: §2.
  54. J. Wagner, J. M. Köhler, T. Gindele, L. Hetzel, J. T. Wiedemer and S. Behnke (2019) Interpretable and fine-grained visual explanations for convolutional neural networks. External Links: 1908.02686 Cited by: §2.
  55. S. Wang, X. Wang, P. Zhao, W. Wen, D. Kaeli, P. Chin and X. Lin (2018) Defensive dropout for hardening deep neural networks under adversarial attacks. In Proceedings of the International Conference on Computer-Aided Design, pp. 71. Cited by: §2.
  56. M. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. ECCV. Cited by: §2.
  57. J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen and S. Sclaroff (2017-12) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126 (10), pp. 1084–1102. External Links: ISSN 1573-1405, Link, Document Cited by: §2, Figure 4, Table 1, §4, §4, §4, §4.
  58. C. Zhao, P. T. Fletcher, M. Yu, Y. Peng, G. Zhang and C. Shen (2019) The adversarial attack and detection under the fisher information metric. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5869–5876. Cited by: §2.
  59. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva and A. Torralba (2016-06) Learning deep features for discriminative localization. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781467388511, Link, Document Cited by: §2.
  60. C. Zhu, R. H. Byrd, P. Lu and J. Nocedal (1997) Algorithm 778: l-bfgs-b: fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS) 23 (4), pp. 550–560. Cited by: §3.