Adversarial Perturbations on the Perceptual Ball
We present a simple regularisation of adversarial perturbations based upon the perceptual loss. While the resulting perturbations remain imperceptible to the human eye, they differ from existing adversarial perturbations in two important regards: (i) our perturbations are semi-sparse, typically altering objects and regions of interest while leaving the background static; (ii) our perturbations do not alter the distribution of data in the image and are undetectable by state-of-the-art methods. As such, this work reinforces the connection between explainable AI and adversarial perturbations.
We show the merits of our approach by evaluating on standard explainability benchmarks and by defeating recent tests for detecting adversarial perturbations, substantially decreasing their effectiveness.
Adversarial perturbations are small alterations of a data point or image that lead to a substantial change in classification response. Despite drastically changing the response of a machine learning classifier, such perturbations are often imperceptible to humans. This mismatch between how computers and how people respond to adversarially perturbed images poses severe challenges when deploying life-altering decision-making systems in the real world, with notable examples in driverless cars ; weapon detection ; and person re-identification .
Two compelling arguments for the existence of adversarial perturbations in images have been offered. The first, due to Goodfellow , remarks that adversarial perturbations are simply an artefact of high-dimensional spaces: it is entirely expected that small perturbations to every pixel in the image can add up to a large change in the classifier response, and in fact the same behaviour is found in linear classifiers.
A second argument attempts to understand why sparse (and potentially even single-pixel) attacks exist, and attributes their effectiveness to exploding gradients. Exploding gradients refer to how changes in functional response can grow exponentially with the depth of the network, an issue known to plague the training of Recurrent Neural Networks , and the very deep networks common in computer vision.
This phenomenon occurs because, by construction, neural networks form a product of (convolutional) matrix operations interlaced with non-linearities; for directions/locations in which these non-linearities act approximately linearly, the eigenvalues of the Jacobian can grow exponentially with depth (see  for a formal derivation). While this phenomenon is well-studied in the context of training networks, with remedies offered in the form of normalisation  or gradient clipping , the same phenomenon can occur when generating adversarial perturbations. This means that a carefully chosen small perturbation can have an extremely large effect on the response of a deep classifier.
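A toy numerical sketch (ours, not the paper's derivation) of why this growth occurs: a deep network's Jacobian is a product of per-layer matrices, and when each factor slightly expands its input, the gain of the product compounds exponentially with depth. The `scale` parameter and matrix sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_singular_value_of_product(depth, width=64, scale=1.5):
    """Largest singular value of a product of `depth` random layer Jacobians."""
    J = np.eye(width)
    for _ in range(depth):
        # Each factor stands in for one layer's (locally linear) Jacobian.
        W = scale * rng.standard_normal((width, width)) / np.sqrt(width)
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)[0]

shallow = top_singular_value_of_product(depth=2)
deep = top_singular_value_of_product(depth=16)
# With scale > 1, the deep product's top singular value is typically orders of
# magnitude larger: a tiny input step aligned with it yields a huge output change.
```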
To understand how these arguments fit together, and which of these explanations accounts for the familiar behaviour of adversarial perturbations, we propose a simple novel regularisation that bounds the exponential growth of the classifier response by regularising the perceptual distance  between the original image and its adversarial perturbation.
One common criticism of adversarial perturbations is that the generated images lie outside the manifold of natural images, and that if we could directly sample from the manifold of images, our adversarial perturbations would be both larger and more representative of the real world. Restricting adversarial perturbations to the manifold of natural images should limit the impact of exploding gradients – if samples are drawn from this space then a well-trained classifier should implicitly reflect the smoothness of the true labels of the underlying data distribution.
However, characterising the manifold of natural images outside of handwritten digits has proven extremely challenging.
We propose a novel regularisation for adversarial perturbations based around the perceptual loss. Our new perturbations have unexpected properties, tending to highlight objects and regions of interest within the image (see figure 1). Moreover, they have different statistics to some existing adversarial perturbations, allowing us to bypass an existing approach for adversarial perturbation detection. We evaluate on a standard explainability benchmark for image classifiers.
2 Prior work
Numerous approaches to adversarial perturbations have been proposed in the past. These can loosely be divided into white-box approaches [29, 8, 5, 28], which assume access to the underlying nature of the model, and black-box methods, which do not [30, 25]. The search for an adversarial perturbation is often formulated as finding the closest point to a particular image, under the $\ell_0$, $\ell_2$, or $\ell_\infty$ norm, that takes a different class label.
Also of interest are works that add additional constraints to the perturbation to try to make the generated images more plausible. Such works may restrict the space of perturbations considered by trying to find an adversarial perturbation that confounds many classifiers at once , or that is robust to image warps. Other approaches consider only a single image and a single classifier, but restrict adversarial perturbations to lie on the manifold of plausible images [48, 16, 40, 44]. The principal limitation of this approach is that, as a minimal first step, it requires a plausible generator of natural images, something achievable for small simple datasets such as MNIST but currently out of reach for even the 224 by 224 thumbnails used by typical ImageNet  classifiers.
Adversarial Perturbations and Counterfactuals
There has been substantial work relating the generation of adversarial perturbations and counterfactual explanations. This relationship follows from the definitions in philosophy and folk psychology of a counterfactual explanation as answering the question “What would need to be different in order for outcome A to have occurred instead of B?”. With full causal models of images being outside our grasp, such questions are commonly answered using Lewis’s Closest Possible World semantics, rather than Pearl’s Structured Causal Models . Under Lewis’s framework an explanation for why an image is classified as ‘dog’ rather than ‘cat’ can be found by searching for the most similar possible world (i.e. image) to which the classifier assigns the label ‘cat’.
Conceptually, this is no different from searching for an adversarial perturbation sampled from the space of possible images. Several approaches have been proposed that either bypass the requirement that the counterfactual is an image and return text descriptions ; naïvely ignore the requirement that the world is plausible; use prototypes or auto-encoders  to characterise the manifold of plausible images; or require large edits that replace regions of the image, either with the output of GANs  or with patches sampled from other images .
Adversarial Perturbations and Gradient Methods
The majority of methods in the explainability of computer vision tend to be gradient- or importance-based methods that assign an importance weight to every pixel in the image, to every super-pixel, or to mid-level neurons. These gradient methods and adversarial perturbations are strongly related. In fact, with most modern networks being piecewise linear, if the found adversarial perturbation and the original image lie on the same linear piece, the difference between the original image and the closest adversarial perturbation under the $\ell_2$ norm is equivalent to the direction of steepest descent, up to scaling. As such, adversarial perturbations can be thought of as a slightly robustified method of estimating the gradient that takes into account some local non-linearities.
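For intuition, the piecewise-linear claim above can be checked directly on a purely linear margin $m(x) = w \cdot x + b$: the closest point with $m = 0$ under the $\ell_2$ norm is reached by moving exactly along the gradient $w$. A small numpy sketch (ours, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(10)   # gradient of the linear margin
b = 0.3
x = rng.standard_normal(10)

def margin(v):
    return w @ v + b

# Closest-point perturbation onto the decision boundary m(x) = 0:
delta = -margin(x) * w / (w @ w)
# `delta` is parallel to the gradient w, and x + delta sits on the boundary.
cosine = abs(delta @ w) / (np.linalg.norm(delta) * np.linalg.norm(w))
```

So on a single linear piece, the minimal adversarial direction and the gradient direction coincide up to sign and scale.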
Perturbation methods look over a larger range to estimate long-range gradient-like responses, including , who repeatedly applied constant-value occlusion masks to different input patches to find the regions whose occlusion changed the output the most. LIME  constructs a linear model using the responses obtained from perturbing super-pixels. More recently, Extremal Perturbation  identifies an optimal fixed mask of the image whose occlusion has the maximal effect on the network's output.
Of the pure gradient-based approaches,  calculates the gradient of the output with respect to the input to create a saliency map, giving fine-grained but potentially less interpretable results. Other gradient-based approaches include SmoothGrad , which stabilises the saliency maps by averaging over multiple noisy copies of them, and Integrated Gradients , where an attribution score is calculated by accumulating gradients while perturbing an empty image into the input image.
The CAM-based approaches [59, 38] sum the activation maps in the final convolutional layer of the network. These small activation maps are then up-sampled to obtain a heatmap that highlights particularly salient regions. Grad-CAM is a generalised variant which finds regions of interest similar to those of the perturbation-based approaches .
A number of experiments have been developed to test saliency methods, including the pointing game [33, 57, 38], the weakly supervised object localisation task [14, 7], and the insertion and deletion game [33, 54]. In particular,  developed experiments to test the suitability of saliency methods. These experiments have been applied to a number of existing saliency techniques including: Gradient, SmoothGrad, GuidedBackprop, Integrated Gradients , GradientInput and GradCAM .
Detecting Adversarial Perturbations
Two main schools of thought for defending against adversarial perturbations exist. Either classifiers can be strengthened to resist adversarial perturbations [30, 55, 52], or adversarial perturbations can be detected directly  and excluded.
Multiple different detection approaches exist; a good review (and subsequent rebuttal) of several of them can be found in , covering approaches such as: adding additional classes to a classifier ; utilising additional classifiers e.g. ; density estimates e.g. ; eigen-decompositions; and a variety of others. Other interesting approaches utilise generative models of images to detect adversarial examples e.g. .
Of particular interest to this paper are approaches that use layers or mid-level responses of the classifier to identify adversarial perturbations, as we regularise these layers through our perceptual loss. One such approach is , who added small neural networks to detect the perturbation from the values of various layers. Another is , who constructed statistics based on finding subsets of neurons in a given layer with unusual values. Finally, many detection methods can be fooled by constructing losses that explicitly account for the method used to detect the perturbations .
3 Method
We consider a standard multi-class classifier $f$ that takes an image $I$ as input and returns a $C$-dimensional vector consisting of the confidence the classifier assigns to each of the $C$ classes. We say the classifier assigns the class $\arg\max_c f_c(I)$ to the image $I$.
Given an image $I$ classified as label $\ell$, we consider the multi-class margin:
$$m(I') = f_\ell(I') - \max_{c \neq \ell} f_c(I'), \quad (1)$$
and note that $m(I') < 0$ if and only if the classifier does not assign label $\ell$ to image $I'$. As such, an adversarial perturbation can be found by minimising
$$L(I') = \max(m(I'), -t), \quad (2)$$
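The multi-class margin can be sketched directly from a logit vector (a minimal numpy illustration of our own):

```python
import numpy as np

def multiclass_margin(logits, ell):
    """m(I) = f_ell(I) - max over c != ell of f_c(I), from the logit vector."""
    rest = np.delete(logits, ell)
    return logits[ell] - rest.max()

logits = np.array([0.1, 2.0, 0.5])
assert multiclass_margin(logits, 1) > 0   # class 1 is the predicted class
assert multiclass_margin(logits, 0) < 0   # class 0 is not assigned
```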
where $t$ is a small target value greater than zero. It is well-known  that minimising a loss of the form:
$$L(I') + \lambda \|I - I'\|^2 \quad (3)$$
is equivalent to finding a minimiser of $L$ subject to the requirement that $I'$ lies in the ball defined by $\|I - I'\| \leq r$, for some $r$ depending on $\lambda$. As such, minimising this objective for an appropriate value of $\lambda$ and $t$ is a good strategy for finding adversarial perturbations of image $I$ with small norm.
Writing $f^{(i)}(I)$ for the classifier response to $I$ in the $i$-th layer of the neural net, we consider the related loss
$$L(I') + \lambda \|I - I'\|^2 + \mu \sum_{i \in S} \|f^{(i)}(I) - f^{(i)}(I')\|^2, \quad (4)$$
defined over a set $S$ of layers of the neural net. The second half of this objective is the perceptual loss of , and minimising this objective is equivalent to finding a minimiser of (3) subject to the requirement that $I'$ lies in the ball defined by $\sum_{i \in S} \|f^{(i)}(I) - f^{(i)}(I')\|^2 \leq r$.
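A hedged sketch of the combined objective, with a toy two-layer network standing in for the real classifier; the weights, the target value, the regularisation strengths, and the layer set used here are all illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((3, 8))

def forward(x):
    h = np.maximum(W1 @ x, 0.0)   # intermediate layer response
    return h, W2 @ h              # (layer response, logits)

def objective(x_adv, x_orig, ell, lam=1.0, mu=1.0):
    h_adv, logits = forward(x_adv)
    h_orig, _ = forward(x_orig)
    margin = logits[ell] - np.delete(logits, ell).max()
    loss = max(margin, -0.1)                     # margin loss with target t = 0.1
    pixel = np.sum((x_adv - x_orig) ** 2)        # pixel-space regulariser
    perceptual = np.sum((h_adv - h_orig) ** 2)   # perceptual term over one layer
    return loss + lam * pixel + mu * perceptual

x = rng.standard_normal(16)
# With no perturbation, both regularisers vanish and only the margin loss remains.
```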
Some care needs to be taken in selecting the layers of the network we regularise over. If we regularise every level of the network, the network will be overly constrained and not change its classifier response.
That said, the method seems to be relatively robust to the choice of layers, and for VGGnet  we use the layers in the batch-normalised variant of the network.
The objective is optimised using LBFGS, a standard algorithm well-suited to minimising these smooth, non-stochastic objectives. To guarantee that the perturbed image takes valid RGB values, we consider two sets of constraints: (i) that the values do not exceed the range observed in the original image; or (ii) that the perturbed image lies within an $\ell_\infty$ ball of the original image (used in the adversarial perturbation experiments of section 5). We simply clip the solution found at each iteration to lie inside these bounds.
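The per-iteration clipping can be sketched as follows; plain gradient descent stands in for LBFGS here, and the toy objective and box bounds are illustrative assumptions.

```python
import numpy as np

def optimise_with_clipping(x0, grad_fn, lo, hi, lr=0.1, steps=100):
    """Minimise with first-order steps, clipping each iterate into [lo, hi]."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        x = x - lr * grad_fn(x)
        x = np.clip(x, lo, hi)   # project back into the valid pixel range
    return x

# Toy objective 0.5 * ||x - target||^2 whose unconstrained minimiser leaves the box.
target = np.array([2.0, -3.0, 0.5])
x = optimise_with_clipping(np.zeros(3), lambda v: v - target, lo=-1.0, hi=1.0)
# x ends at the box projection of `target`, i.e. approximately [1, -1, 0.5].
```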
4 Perceptual Perturbations for Visual Explanations
Before describing our experimental evaluation, we give a qualitative analysis of the perceptual perturbations, as shown in figure 3. The found perturbations do a good job of localising on a single object class, even in highly textured or cluttered images (e.g. dragonfly on fern; polar bear; coral reef) or images containing multiple classes (dog and man; baseball and people; lawnmower). The perturbations tend to focus upon the heads of the labelled class, reflecting the result of  that heads are more salient (llama; elephant); importantly, the method only finds heads of the explained class salient, and not those belonging to other objects (e.g. dog with leopard; dog with man; baseball; guitar). Much of the confusion in localisation seems to occur with supporting classes close behind the object - for example, the human legs behind the lawnmower are found to be salient, as is the torso of the man playing the guitar. This could be because these classes frequently occur in close proximity to one another and provide supporting evidence for the detected class.
Following recent work  that points out that edge detectors with no knowledge of the classifier do surprisingly well on the insertion and deletion metrics used in visual explainability, we instead focus on object localisation.
To transform our perceptual perturbations into a saliency map, we simply treat the magnitude of the alteration of each pixel as its salience. We evaluate the quality of our perceptual perturbations as explanations using the localisation protocol of [14, 57, 3]. We predict a bounding box for the most dominant object in the first 1000 ImageNet  validation images and employ simple thresholding methods for fitting bounding boxes. For the first approach, we follow  in using a value-threshold: we normalise individual heatmaps to the range $[0, 1]$, square-root transform the saliency maps, and grid search over a set of evenly spaced thresholds. For the second experiment, we also follow  in using a threshold scaled by the per-image mean. Finally, we evaluate a third measure based on a percentage-threshold, where we consider only the top percentage of most salient pixels.
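The saliency-to-box pipeline above can be sketched in numpy; the 4-neighbour flood fill is a simple stand-in for a library connected-components routine, and the fixed threshold is an illustrative assumption rather than the paper's grid-searched value.

```python
import numpy as np
from collections import deque

def saliency_to_bbox(perturbation, threshold=0.5):
    """Magnitude saliency -> threshold -> largest component -> bounding box."""
    sal = np.abs(perturbation)
    sal = sal / sal.max()                 # normalise to [0, 1]
    mask = sal >= threshold
    seen = np.zeros_like(mask)
    best = []
    for i, j in zip(*np.nonzero(mask)):   # find the largest connected component
        if seen[i, j]:
            continue
        comp, q = [], deque([(i, j)])
        seen[i, j] = True
        while q:
            a, b = q.popleft()
            comp.append((a, b))
            for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                na, nb = a + da, b + db
                if (0 <= na < mask.shape[0] and 0 <= nb < mask.shape[1]
                        and mask[na, nb] and not seen[na, nb]):
                    seen[na, nb] = True
                    q.append((na, nb))
        if len(comp) > len(best):
            best = comp
    rows = [p[0] for p in best]
    cols = [p[1] for p in best]
    return min(rows), min(cols), max(rows), max(cols)  # (top, left, bottom, right)

p = np.zeros((8, 8))
p[2:5, 2:6] = 1.0   # dominant salient block
p[7, 0] = 0.9       # small spurious component, ignored as the smaller one
# The largest component is the 3x4 block, giving the box (2, 2, 4, 5).
```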
As is standard, for each threshold we extract the largest connected component and draw a bounding box around it. The object is deemed successfully localised when the Intersection over Union (IOU) between this box and the ground truth exceeds the standard threshold. Following GradCAM's guided version, which makes use of image gradients from , we consider a guided variant of our own, consisting of an element-wise multiplication between our perturbations and the normalised guided gradient of the image with respect to the margin $m$.
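The IOU check can be written in a few lines; the (top, left, bottom, right) inclusive-pixel convention below is a hypothetical choice for this sketch, not necessarily the paper's.

```python
def iou(a, b):
    """Intersection over Union of boxes (top, left, bottom, right), inclusive."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    if bottom < top or right < left:
        return 0.0  # no overlap
    inter = (bottom - top + 1) * (right - left + 1)
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / (area(a) + area(b) - inter)

assert iou((0, 0, 9, 9), (0, 0, 9, 9)) == 1.0   # identical boxes
assert iou((0, 0, 3, 3), (8, 8, 9, 9)) == 0.0   # disjoint boxes
```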
We compare our perceptual method and its guided variant with GradCAM , Guided Backprop , Guided GradCAM , SmoothGrad , Integrated Gradients , Excitation Backprop , RISE  and Extremal Perturbations . To demonstrate that the perceptual loss is important to the success of our approach, we also compare against the unregularised adversarial perturbations of DeepFool .
A qualitative evaluation of the methods can be seen in figure 3. These images were selected to be challenging – we visualise a subset of those images where DeepFool's adversarial perturbation did not align with the object. For reasons of space and fairness, we do not show DeepFool's adversarial perturbations in this figure. Compared with other visual explanation techniques, our method highlights the interior textures of the target object in the image. This differs from gradient-based methods, such as SmoothGrad , which capture finer edge details, and from activation-based methods, such as GradCAM , which highlight the entire object coarsely. This is perhaps most clear in the panda image, where our method captures the interior texture of the panda rather than just its hard contours.
We compare the generated bounding box with the ground-truth bounding box. For the approaches [38, 43, 49, 57, 15], we made use of the implementation developed for the paper .
[Table 1 (fragment): Guided Backprop  | 0.46 | 0.48 | 0.47]
We perform better than the tested approaches on object localisation, obtaining the lowest error on the value threshold and, along with Excitation Backprop , the lowest error on the percent-based threshold, as well as the lowest error across all choices of thresholding. We do noticeably less well on mean-based thresholding, being beaten by two methods, Excitation Backprop  and Integrated Gradients , and obtaining similar scores to several other approaches.
5 Detecting Perceptual Perturbations
To demonstrate that our adversarial perturbations have fundamentally different properties to existing approaches, we show how a recently published approach for detecting adversarial perturbations fails to detect ours. In particular, we compare against the recently published detection method  that leverages noise-induced changes in the differences of the logits to uncover adversarial perturbations.
This is a particularly relevant test for our approach, as the defence of  is motivated by the notion that adversarial perturbations found by minimising some $\ell_p$ distance from the original image typically have different properties to naturally occurring images. In contrast, by explicitly minimising a distance measure previously unused in the literature on adversarial perturbations, our approach produces images that are closer to the distribution of real images under the perceptual distance, and may bypass tests that rely upon such differences.
The approach of  is as follows: given a training set, for each pair of classes $y$ and $z$, they calculate statistics describing how the difference between the classifier response for the true class and for another class changes under injected noise. This must take into account the class-dependent nature of the distribution (see  for full details). Using these quantities, they construct a z-score between the observed class of an image and every other class, and build a classifier that flags an image as adversarial when the z-score deviates by more than a class-defined threshold.
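The core noise-perturbed statistic can be sketched as follows (a hedged toy version of ours, with a linear "classifier"; the noise level, sample count, and standardisation are illustrative assumptions, not the detector's actual settings):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((5, 20))   # toy linear "classifier" with logits f(x) = W x

def logit_gap_shift(x, y, z, sigma=0.1, n=2000):
    """Mean noise-induced shift of the logit difference f_y - f_z at input x."""
    d = W[y] - W[z]
    base = d @ x
    noise = sigma * rng.standard_normal((n, x.size))
    return np.mean((x + noise) @ d - base)

def z_score(shift, mu, sd):
    """Standardise an observed shift against clean-data statistics (mu, sd)."""
    return (shift - mu) / sd

x = rng.standard_normal(20)
shift = logit_gap_shift(x, 0, 1)
# For a linear model with zero-mean noise, the expected shift is zero; large
# z-scores on a real network are what the detector flags as adversarial.
```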
They motivate this approach in several ways, one of which is the 'cone' plot (see figure 5), in which they plot the softmax of the target class over the space spanned by the direction of the adversarial perturbation and that of randomly selected orthogonal directions (see later in this section for a full explanation). The plot demonstrates that for standard adversarial perturbations, unlike natural images, moving in an orthogonal direction increases the probability of the target class. As most random perturbations in high-dimensional spaces are approximately orthogonal to the direction of the adversarial perturbation,  suggested evaluating the stability of the classifier response under random noise.
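Constructing the orthogonal directions for such a plot is a one-step Gram-Schmidt projection (a small sketch of ours, not the detector's code):

```python
import numpy as np

rng = np.random.default_rng(4)

def orthogonal_direction(d_adv):
    """A random unit direction orthogonal to the adversarial direction."""
    u = d_adv / np.linalg.norm(d_adv)
    r = rng.standard_normal(u.shape)
    r = r - (r @ u) * u   # Gram-Schmidt: remove the adversarial component
    return r / np.linalg.norm(r)

d = rng.standard_normal(100)
o = orthogonal_direction(d)
# Grid points for the cone plot would be x + a * d/||d|| + b * o over (a, b).
```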
Adversarial perturbation defences can only be expected to work if the generated perturbations satisfy the expected constraints, the so-called threat model. As such, we form our adversarial perturbations by adding additional constraints to those expected by , and for conformity we restrict our adversarial perturbations to lie within a fixed $\ell_\infty$ distance of the original image.
To match the experimental setup of , in which the adversaries are restricted to 20 iterations of PGD , we also restrict the number of iterations of LBFGS to match the number of PGD iterations, and we return the perturbation with the best loss over the function evaluations.
We compare our results against several standard methods implemented in the package provided by . Using the code provided by the authors of , we make a direct comparison with their detection approach on the CIFAR dataset with their default network.
A reproduction of the cone plot from , showcasing the different structure of the different perturbation types, can be seen in figure 5. This figure plots the change in the softmax response of the target class along the direction of the adversarial perturbation and that of a random orthogonal perturbation (see figure caption for full implementation details). The cone-like structure described by  can be seen for both PGD and the Mean method. However, neither our approach nor the CW perturbations have the same structure. This may be related to the use of the $\ell_2$ norm when generating the CW perturbations (the version provided as part of  is the $\ell_2$ version), which we also use in addition to our perceptual loss components, while matching the $\ell_\infty$ bound of the threat model. As shown in table 2, our results are noticeably harder to detect, with  able to detect perceptual perturbations substantially less often than the CW perturbations.
[Table 2 (fragment): Classifier Suc. % | 0.0388 | 0.0399 | 0.0435 | 0.0431]
We have presented a novel regularisation based on the perceptual loss for the generation of adversarial perturbations. This regularisation is designed to block the exploitation of exploding gradients when generating adversarial perturbations forcing larger and more meaningful perturbations to be generated. The fact that such perturbations still exist under these constraints and remain imperceptible to humans is another piece of the puzzle in understanding the interrelationship between adversarial perturbations, neural networks, and human vision.
We have shown how these perturbations can be interpreted as explanations, obtaining state-of-the-art results on a standard explainability benchmark. Moreover, the properties of these novel perturbations mean that they are not detected by recent work on adversarial perturbation detection.
This work was supported in part by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, and The Alan Turing Institute. Andrew Elliott was funded by EPSRC grant EP/N510129/1 at The Alan Turing Institute and Accenture Plc; Stephen Law was funded by EPSRC grant EP/N510129/1 at The Alan Turing Institute and Chris Russell was partially supported by the Alan Turing Institute and programmatic research funding provided by the Luminate Group. We would further like to thank Tomas Lazauskas and the team at the PEARL cluster for providing access to the GPU cluster and giving us invaluable help in setting up the computational environment.
Appendix A Visual Explanation
We include a selection of the successful examples and the failure cases from the visual explanation experiment on ImageNet . We display the original image, the difference between the perceptually perturbed image and the original, the saliency map, the dominant connected component with the resulting bounding box, and the dominant connected component masked with the original image.
Figure 6 shows selected examples from our method which failed in the explanation experiment. We note that in some failed cases, our method was able to localise the target object correctly. These cases can occur when the predicted bounding box is significantly larger or smaller than the ground truth bounding box. This is apparent in row 3, where the predicted bounding box is notably larger than the ground truth localisation of the bighorn. Another common failure is the occurrence of two objects in the scene where the ground truth annotation is placed on only one of them. This is clearly visible in row 4, where there are two porcupines in the same scene but only one of them is highlighted. There is also the case where the target object is partially occluded, such as the American alligator in row 11, where the ground truth bounding box includes part of the animal hidden behind the vegetation. Finally, there are examples where the predicted bounding box has clearly localised on objects not within the target class. One such example is in row 5, for which our method highlights both the tricycle and the two children in the scene.
Figure 7 shows selected examples from our method which succeeded in the explanation experiment. Most of the successful localised examples contain a single dominant object in the scene. For example, the soup bowl in row 1, the cougar in row 2 and the guenon in row 3. In some cases, the method succeeds in localising the target object with a busy background, e.g. the cabbage butterfly in row 5, the bee eater in row 7 and the white wolf in row 11. The method also achieves success in localising partial objects such as the dugong in row 10.
Appendix B Adversarial Perturbations
For completeness, we present several examples using each of the methods presented in the main paper. We do not filter on successful cases, and we select the examples to be recognisable and to cover several classes. We display the results in figure 8, where we present the original image, the adversarially perturbed image, and the difference between the two images, rescaled to highlight the adversarial perturbation.
Much like the results on ImageNet in the visual explanations section, we note that our perceptual perturbations are localised on the object in question in all presented examples, with the exception of row 6. The other PGD-based approaches (PGD and Mean) appear to localise less well, although they do show localisation in rows 3 and 4. Echoing our discussion of the cone plot in the main body, the $\ell_2$ variant of CW also appears to localise well, at least on CIFAR.
[Figure 8 (fragment): example class 'Kerry blue terrier'; columns: Orig. | PGD | PGD Diff | Mean | Mean Diff | CW | CW Diff | Ours | Our Diff]
- See discussion in the experimental section of .
- To avoid ambiguity, our indexing treats each operation, such as convolution, ReLU or BatchNorm, as a separate layer.
- (2018) Sanity checks for saliency maps. External Links: Cited by: §2, §4.
- (2017) Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397. Cited by: §1, §2.
- (2015) Look and think twice: capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2956–2964. Cited by: §4.
- (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. Cited by: §2, §2, §2.
- (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §2, §5.
- (2018) Explaining image classifiers by counterfactual generation. External Links: Cited by: §2.
- (2018-03) Grad-cam++: generalized gradient-based visual explanations for deep convolutional networks. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). External Links: Cited by: §2.
- (2018) Ead: elastic-net attacks to deep neural networks via adversarial examples. In Thirty-second AAAI conference on artificial intelligence, Cited by: §2.
- (2018) Explanations based on the missing: towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, pp. 592–603. Cited by: §2.
- (2019) Universal adversarial perturbations against person re-identification. arXiv preprint arXiv:1910.14184. Cited by: §1.
- (2018) Adversarial examples that fool both computer vision and time-limited humans. In Advances in Neural Information Processing Systems, pp. 3910–3920. Cited by: §2.
- (2017) Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945. Cited by: §1.
- (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §2.
- (2017-10) Interpretable explanations of black boxes by meaningful perturbation. 2017 IEEE International Conference on Computer Vision (ICCV). External Links: Cited by: §2, §4, §4.
- (2019) Understanding deep networks via extremal perturbations and smooth masks. External Links: Cited by: §2, Figure 4, Table 1, §4, §4.
- (2018) Adversarial spheres. arXiv preprint arXiv:1801.02774. Cited by: §2.
- (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1.
- (2019) Counterfactual visual explanations. External Links: Cited by: §2.
- (2017) On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280. Cited by: §2.
- (2018) Generating counterfactual explanations with natural language. arXiv preprint arXiv:1806.09809. Cited by: §2.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1.
- (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §1, §3.
- () CIFAR-10 (Canadian Institute for Advanced Research). External Links: Cited by: Figure 8.
- (2013) Counterfactuals. John Wiley & Sons. Cited by: §2.
- (2016) Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770. Cited by: §2.
- (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §5, §5.
- (2017) On detecting adversarial perturbations. In Proceedings of 5th International Conference on Learning Representations (ICLR), External Links: Cited by: §2, §2.
- (2019) Sparsefool: a few pixels make a big difference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9087–9096. Cited by: §2.
- (2016-06) DeepFool: a simple and accurate method to fool deep neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Cited by: §2, Table 1.
- (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. Cited by: §2, §2.
- (2012) Understanding the exploding gradient problem. CoRR abs/1211.5063. Cited by: §1, §1.
- (2000) Causality: models, reasoning and inference. Vol. 29, Springer. Cited by: §2.
- (2018) RISE: randomized input sampling for explanation of black-box models. External Links: Cited by: §2, Figure 4, Table 1, §4, §4.
- (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131. Cited by: §4.
- (2016) “Why should I trust you?”: explaining the predictions of any classifier. SIGKDD. Cited by: §2.
- (2019) The odds are odd: a statistical test for detecting adversarial examples. In International Conference on Machine Learning, pp. 5498–5507. Cited by: Figure 5, Table 2, §5, §5, §5, §5, §5, §5, §5, §5, §5.
- (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Cited by: Appendix A, Figure 6, Figure 7, §2, §4.
- (2016-10) Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. arXiv e-prints, pp. arXiv:1610.02391. External Links: Cited by: §2, §2, Figure 4, Table 1, §4, §4, §4, §4.
- (2016) Not just a black box: learning important features through propagating activation differences. External Links: Cited by: §2.
- (2018) Adversarial vulnerability of neural networks increases with input dimension. arXiv preprint arXiv:1802.01421. Cited by: §2.
- (2013) Deep inside convolutional networks. ICLR. Cited by: §2, §2.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.
- (2017) SmoothGrad: removing noise by adding noise. External Links: Cited by: §2, §2, Figure 4, Table 1, §4, §4, §4.
- (2017) Pixeldefend: leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766. Cited by: §2, §2.
- (1982) Newton's method with a model trust region modification. SIAM Journal on Numerical Analysis 19 (2), pp. 409–426. Cited by: §3.
- (2018) Subset scanning over neural network activations. arXiv preprint arXiv:1810.08676. Cited by: §2.
- (2014) Striving for simplicity: the all convolutional net. External Links: Cited by: §2, Table 1, §4, §4.
- (2019) Disentangling adversarial robustness and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6976–6987. Cited by: §2, footnote 1.
- (2017) Axiomatic attribution for deep networks. External Links: Cited by: §2, §2, Figure 4, Table 1, §4, §4, §4.
- (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
- (2019) Interpretable counterfactual explanations guided by prototypes. arXiv preprint arXiv:1907.02584. Cited by: §2.
- (2019) Hardening deep neural networks via adversarial model cascades. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.
- (2017) Counterfactual explanations without opening the black box: automated decisions and the gpdr. Harv. JL & Tech. 31, pp. 841. Cited by: §2.
- (2019) Interpretable and fine-grained visual explanations for convolutional neural networks. External Links: Cited by: §2.
- (2018) Defensive dropout for hardening deep neural networks under adversarial attacks. In Proceedings of the International Conference on Computer-Aided Design, pp. 71. Cited by: §2.
- (2014) Visualizing and understanding convolutional networks. ECCV. Cited by: §2.
- (2017-12) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126 (10), pp. 1084–1102. External Links: Cited by: §2, Figure 4, Table 1, §4, §4, §4, §4.
- (2019) The adversarial attack and detection under the fisher information metric. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5869–5876. Cited by: §2.
- (2016-06) Learning deep features for discriminative localization. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Cited by: §2.
- (1997) Algorithm 778: l-bfgs-b: fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS) 23 (4), pp. 550–560. Cited by: §3.