Adversarial Examples in Modern Machine Learning: A Review

Rey Reza Wiyatno   Anqi Xu   Ousmane Dia   Archy de Berker
Element AI
{rey.reza, ax, ousmane, archy}

Recent research has found that many families of machine learning models are vulnerable to adversarial examples: inputs that are specifically designed to cause the target model to produce erroneous outputs. In this survey, we focus on machine learning models in the visual domain, where methods for generating and detecting such examples have been most extensively studied. We explore a variety of adversarial attack methods that apply to image-space content, real world adversarial attacks, adversarial defenses, and the transferability property of adversarial examples. We also discuss strengths and weaknesses of various methods of adversarial attack and defense. Our aim is to provide extensive coverage of the field, furnishing the reader with an intuitive understanding of the mechanics of adversarial attack and defense mechanisms and enlarging the community of researchers studying this fundamental set of problems.

1 Introduction

Machine learning algorithms have critical roles in an increasing number of domains. Technologies such as autonomous vehicles and language translation systems use machine learning at their core. Since the early success of Convolutional Neural Networks (CNNs) on the ImageNet Large Scale Visual Recognition Competition (ILSVRC) [49, 182, 120], deep learning [132, 77] has been successfully applied to numerous tasks including image classification [120, 92, 100], segmentation [192, 91], object tracking [93, 16, 218], object detection [72, 73, 174], speech recognition [81, 82, 237], language translation [206, 69, 221], and many more.

Despite the success of modern machine learning techniques in performing various complex tasks, security in machine learning has received far less academic attention. Robustness, both to accident and to malevolent agents, is clearly a crucial determinant of the success of machine learning systems in the real world. For example, Amodei et al. [5] catalogue a variety of possible negative impacts from poorly secured systems, particularly emphasizing issues around sensitive applications such as medical and transportation systems. Although these authors focused on the safety of reinforcement learning algorithms in particular, many of the concerns they raise apply directly to a wide range of machine learning models.

In this review, we focus on the application of adversarial examples to supervised learning problems. We leave it to future authors to cover unsupervised learning, reinforcement learning, or other classes of problems. We also restrict our discussion to classification tasks, as opposed to regression, although many of the methods discussed below may generalize to other problem contexts.

1.1 What can we learn from adversarial examples?

The concept of an adversarial example long predates work in machine learning. Humans are famously vulnerable to perceptual illusions, which have been fruitfully used to explore the mechanisms and representations underlying human cognition. Such illusions are frequently used to elucidate the implicit priors present in human perception. For instance, the Müller-Lyer illusion [147], in which a line ending in outward-splayed fins (like arrow tails) appears longer than an equally long line ending in arrowheads, is thought to reveal a “cubeness” prior learned by people who live in highly-rectangular environments [1]. Color and brightness constancy - our ability to perceive colors and brightness as unchanging despite variance in illumination - are richly explored in illusions such as Adelson’s checkerboard illusion and Lotto’s coloured cube. Such illusions can even probe the inter-individual variance in visual priors: the recently popular “dress colour” meme is thought to rely upon people’s differing expectations of illuminating light, with people used to blue-tinted illumination seeing the dress as gold/white, and those assuming warm illumination seeing it as blue/black [23].

Such illusions elucidate not only cognitive phenomena, but details of the underlying neural circuitry (for a thorough review, see Eagleman [54]). For example, the phenomenon of interocular transfer has provided a classic paradigm for identifying the locus of illusory effects. Based upon whether an illusory effect transfers from one eye to the other, the substrate can be localized either to pre-cortical circuitry (no transfer between eyes) or to cortical circuitry (pathways after the mingling of information between the eyes) [18, 17]. Similarly, adaptation-based illusions - where the appearance of an image changes over a period of prolonged viewing - have been used to predict the tuning curves of orientation-tuned neurons in visual cortex [18], the organization of colour representation in the lateral geniculate nucleus [142], and the sensitivity of the three colour-coding cell types in the retina [203]. So effective were these techniques that they earned the nickname “the psychologists’ microelectrode” - a precise, non-invasive way to characterize the internal workings of a blackbox system.

Szegedy et al. [211] found that deep neural networks are also vulnerable to “illusions”. These “adversarial examples” are created by the addition of “hidden messages” to an image, causing a machine learning model to grossly misclassify the perturbed image. Unlike perceptual illusions in humans, which are handcrafted, such examples are algorithmically generated to fool machine learning models. Figure 1 shows the now famous adversarial example generated using a method called the Fast Gradient Sign Method (FGSM) [80] (see Section 4.2). Judging from this evidence, there is indeed a huge generalisation gap between humans and deep neural networks [71].

The parallel between perceptual illusions in humans and adversarial examples in machines extends beyond the superficial in several important ways. One explanation for the effectiveness of adversarial examples is that they push the input data off the manifold of natural inputs. This is precisely the same mechanism by which perceptual illusions in humans are generated, often exploiting mechanisms that generally improve the fidelity of perception but result in erroneous percepts when input data differs from natural inputs. As such, both approaches help to illustrate which features of the input domain are important for the performance of the system. Secondly, just as perceptual illusions have been used to interrogate the neural organization of perception, adversarial examples can help us understand the computations performed by neural networks. Early explanations focused upon the linearity of deep models [80], but more recent work focuses upon the entropy of the logit outputs [47]. Thirdly, recent evidence suggests a much stronger correspondence, namely that the same examples that fool neural networks also fool humans in time-limited settings [56]. As observed in that paper, this opens up the interesting possibility that techniques for adversarial defense might draw inspiration from the biological mechanisms that render humans invulnerable to such examples under normal viewing conditions.

Although the recent interest in adversarial attacks is chiefly concerned with their application to deep learning models, the field itself precedes the era of deep learning. For example, Huang et al. [102] established a taxonomy of adversarial attacks that has inspired many other works in adversarial machine learning, and also described a case study on adversarial attacks upon SpamBayes [176], a machine learning model for email spam filtering. Other works have also shown empirically that various machine learning models such as logistic regression, decision trees [24], k-Nearest Neighbor (kNN) [46], and Support Vector Machines (SVM) [45] are vulnerable to adversarial examples [80, 163].

Figure 1: Adversarial example generated using the Fast Gradient Sign Method (FGSM) (taken from Goodfellow et al. [80]). GoogleNet [209] classified the original image shown on the left correctly as panda, but misclassified the image on the right as gibbon with high confidence. Note that the perturbation in the middle has been amplified for visualization purposes only.

1.2 Why this work?

Since the findings of Szegedy et al. [211], the arms race between adversarial attacks and defenses has accelerated. For example, defensive distillation [166] (see Section 6.1.3), then the state of the art defense against adversarial examples, was defeated within a year by an attack method proposed by Carlini and Wagner [34] (see Section 4.11). Similarly, defense using adversarial training [80] (see Section 6.1.1), once thought to be robust to whitebox attacks [126], was swiftly shown to rely upon a phenomenon called gradient masking [163, 162, Tramèr et al. 2018], which can be circumvented by certain types of attacks. Most recently, a variety of defenses proposed by different groups [29, 143, 50, 228, 170, 198, 87] and accepted to the Sixth International Conference on Learning Representations (ICLR) 2018 were shown, shortly after the acceptance decision, to rely on gradient masking and thus to be circumventable [8]. This further emphasizes how difficult it is to solve the adversarial examples problem in machine learning.

In this paper, we discuss various adversarial attack and defense strategies. Note that although adversarial examples exist in various domains such as computer security [84], speech recognition [37, 4, 119], and text [239], this paper focuses on adversarial examples in the computer vision domain, where they have been most extensively studied. We particularly focus upon imperceptible and inconspicuously visible adversarial examples, which have the clearest potential for malicious use in the real world. Although other works have attempted to provide literature reviews on adversarial examples [233, 3], our work provides in-depth explanations of the motivation and mechanism of a wide array of attack and defense algorithms, along with a complete taxonomy and ontology of adversarial attacks and defenses, and a discussion of the strengths and weaknesses of different methods.

This paper is organized as follows. We begin by defining a list of common terms in adversarial machine learning and describing the notation used in this paper, in Sections 2.1 and 2.2, respectively. Section 3 provides a general introduction to adversarial examples. The taxonomy, ontology, and discussion of adversarial attack and defense methods are presented in Sections 4 and 6, respectively. We discuss several adversarial attacks in the real world in Section 5, and the transferability property of adversarial examples in Section 7. Finally, we conclude and suggest interesting lines of future research in Section 8.

2 Common Terms and Notations

2.1 Common Terms

We provide definitions of several terms that are commonly used in the field of adversarial machine learning in Table 1.

Common Terms | Definition
Adversarial example | Input to a machine learning model that is intentionally designed to cause a model to make a mistake in its predictions despite resembling a valid input to a human.
Adversarial perturbation | Difference between a non-adversarial example and its adversarial counterpart.
Adversarial attacks | Methods to generate adversarial examples.
Adversarial defenses | Methods to defend against adversarial examples.
Adversarial robustness | The property of resisting misclassification of adversarial examples.
Adversarial detection | Methods to detect adversarial examples.
Whitebox attack | Attack scenario where an attacker has complete access to the target model, including the model’s architecture and parameters.
Blackbox attack | Attack scenario where an attacker can only observe the outputs of the targeted model.
Transferability | A property of adversarial examples: examples specifically generated to fool a model can also be used to fool other models.
Universal attack | Attack scenario where an attacker devises a single transform, such as an image perturbation, that adversarially confuses the model for all or most input values (input-agnostic).
Targeted attack | Attack scenario where an attacker wants the adversaries to be mispredicted in a specific way.
Non-targeted attack | Attack scenario where an attacker does not care about the outcome as long as the example is mispredicted.
Adversarial training [80] | Adversarial defense technique to train a model by including adversarial examples in the training set. Note that this differs from the notion of adversarial training used in Generative Adversarial Networks (GANs) [79].
Gradient masking [162, Tramèr et al. 2018] | Defense mechanisms which prevent a model from revealing meaningful gradients, masking or hiding the gradients of the outputs with respect to the inputs.
Shattered gradients [8] | When the gradients of a model are hard to compute exactly due to non-differentiable operations.
Stochastic gradients [8] | When the gradients of a model are obstructed due to some stochastic or random operations.
Obfuscated gradients [8] | A form of gradient masking which encompasses shattered, stochastic, vanishing, and exploding gradients.
Vanishing gradients | When the gradients of a model are converging to zero.
Exploding gradients | When the gradients of a model are diverging to infinity.
Table 1: Common terms in adversarial machine learning.

2.2 Notations

In order to make this paper easier to follow, we use the notations defined in Table 2 throughout this paper unless otherwise stated.

Notation | Description
$X$ | A set of input data (e.g., training set).
$x$ | An instance of input data (usually a vector, a matrix, or a tensor).
$x'$ | A modified instance of $x$, typically adversarial.
$r$ | The additive perturbation from $x$ to $x'$, i.e., $r = x' - x$.
$y$ | The true class label for the input instance $x$.
$t$ | An adversarial target class label for $x'$.
$f(x; \theta)$ | Output of a machine learning model $f$ parameterized by $\theta$ (e.g., a neural network). For a classifier, $f(x; \theta)$ specifically refers to the softmax predictions vector. Note that $\theta$ is often omitted for notation simplicity. Also, we use the parenthesized subscript $f(x)_{(c)}$ to denote the $c$-th element of the softmax vector, i.e., the predicted likelihood for class $c$.
$\hat{y}(x)$ | Predicted label of a classifier model, i.e., $\hat{y}(x) = \arg\max_c f(x)_{(c)}$.
$Z(x)$ | Predicted logits vector of $x$ from a softmax classifier model.
$\mathcal{L}(x, y)$ | Loss function used to train a machine learning model. Note that we use $\mathcal{L}(x, y)$ instead of $\mathcal{L}(f(x; \theta), y)$ to simplify the notation.
$\nabla_x \mathcal{L}(x, y)$ | Derivative of the loss function $\mathcal{L}(x, y)$ with respect to $x$.
Table 2: Notations used in this survey.

3 Adversarial Examples

Adversarial examples are typically defined as inputs $x'$, where the differences between $x'$ and non-adversarial inputs $x$ are minimal under a distance metric $d(\cdot)$ (e.g., $d(\cdot)$ can be the $L_p$ distance), whilst fooling the target model $f$. Generally, adversarial examples seek to satisfy

$$ d(x', x) < \epsilon \quad \text{s.t.} \quad \hat{y}(x') \neq y, \tag{1} $$

where $\epsilon$ is a small constant that bounds the magnitude of the perturbations, and $\hat{y}(\cdot)$ denotes the predicted label of a classifier model (i.e., $\hat{y}(x) = \arg\max_c f(x)_{(c)}$). Note that we use $f(x)$ to denote $f(x; \theta)$ throughout this paper. However, some works have proposed perturbations that are visible but inconspicuous [190, 61, 26], so the similarity constraint between $x'$ and $x$ can be relaxed. Throughout this paper, we call these “imperceptible” and “inconspicuous” (i.e., may be visible but not suspicious) adversarial examples, respectively.
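The criterion above can be checked mechanically. The following is a minimal sketch, assuming NumPy arrays as inputs and an $L_p$ norm as the distance metric; the helper `is_adversarial` and the toy classifier `toy_predict` are hypothetical names introduced only for illustration.

```python
import numpy as np

def is_adversarial(x, x_adv, y_true, predict, eps, p=np.inf):
    """Check d(x', x) < eps under the L_p norm and that x' is mispredicted."""
    distance = np.linalg.norm((x_adv - x).ravel(), ord=p)
    return bool(distance < eps and predict(x_adv) != y_true)

# Hypothetical toy classifier: class 1 if the mean intensity exceeds 0.5.
def toy_predict(x):
    return int(x.mean() > 0.5)

x = np.full(4, 0.49)   # clean input, correctly assigned to class 0
x_adv = x + 0.02       # small uniform perturbation

# The same perturbation counts as adversarial only if it fits the budget eps.
within_budget = is_adversarial(x, x_adv, 0, toy_predict, eps=0.05)
over_budget = is_adversarial(x, x_adv, 0, toy_predict, eps=0.01)
```

The choice of `p` corresponds to the choice of distance metric; relaxing `eps` corresponds to moving from imperceptible towards merely inconspicuous perturbations.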

Adversarial attacks are often categorized as either whitebox or blackbox. In whitebox attacks, the attacker is assumed to have information about the target model such as the architecture, parameters, training procedure, or the training data. On the other hand, blackbox attacks involve only access to the output of a target model, and not its internals. Blackbox attacks are a more realistic assumption in the real world, since an attacker rarely enjoys knowledge of the internals of the victim. However, evaluating models against whitebox attacks is important to measure the performance of the model in worst-case scenarios. Adversarial attacks can be further categorized as targeted or non-targeted attacks. In a targeted attack, the adversarial example is designed to elicit a specific classification - like classifying all faces as belonging to George Clooney - whilst non-targeted attacks only seek to generate an incorrect classification, regardless of class.

Adversarial examples exhibit an interesting phenomenon called the transferability property [211, 80, 164, 138]. This property states that adversarial examples generated to fool a specific model can often be used to fool other models. This phenomenon will be discussed in Section 7.

Why are machine learning models vulnerable to these examples? Several works have argued that adversarial examples are effective because they lie in the low probability region of the data manifold [211, 164]. Goodfellow et al. [80] pointed out that deep neural networks are vulnerable to adversarial examples due to the local linearity property of these models, especially when using activation functions like the Rectified Linear Units (ReLU) [74] or Maxout [78]. Goodfellow et al. [80] observed that although deep neural networks use non-linear activation functions, one often trains such networks to only operate in the linear regions of the activation functions to avoid things like the vanishing gradient problem [98, 97, 168]. Furthermore, Goodfellow et al. [80] took the fact that FGSM [80] (see Section 4.2), which was designed under this linearity assumption, effectively fools deep neural networks as support for their argument that neural networks behave like linear classifiers. Arpit et al. [7] analyzed the capacity of neural networks to memorize training data and found that models with a high degree of memorization are more vulnerable to adversarial examples. Jo and Bengio [110] argued that convolutional neural networks tend to learn the statistical regularities in the dataset rather than the high level abstract concepts. This may be related to the transferability property; since adversarial examples are transferable between models that are trained on the same dataset, these different models may have learned the same statistics and hence fall into the same traps. Similarly, Ilyas et al. [106] suggested that adversarial examples exist as a byproduct of exploiting non-robust features that exist in a dataset. To date, why machine learning models are vulnerable to adversarial examples remains an open research question.

4 Adversarial Attacks

In this section, we survey the adversarial attacks that exist today. Figure 2 and Table 3 illustrate the ontology and taxonomy of adversarial attack techniques discussed in this paper, respectively.

Adversarial Attack(s) | Transparency | Specificity | Remarks
L-BFGS [211] | W | T, NT | Early attack on neural networks using constrained optimization method
FGSM [80] | W | T, NT | A fast single-step gradient ascent attack
BIM [125, 126] | W | T, NT | Iterative variants of FGSM
ILLCM [125, 126] | W | T | Extension of BIM to attack models with many output classes
R+FGSM [Tramèr et al. 2018] | W | T, NT | FGSM [80] with random initialization, can circumvent gradient masking
AMDR [184] | W | T, NT | Similar to L-BFGS but targeting feature space
DeepFool [156] | W | NT | Efficient method to find minimal perturbation that causes misclassification
JSMA [165] | W | T, NT | Some variants of JSMA can fool defensive distillation
SBA [163] | B | T, NT | Can fool defensive distillation [166], MagNet [149], gradient masking defenses
Hot/Cold [179] | W | T | Simultaneously moving towards “hot” class and away from “cold” class
C&W [34] | W | T, NT | Can fool defensive distillation [166], MagNet [149] and various detector networks
UAP [155] | W | NT | Generate input-agnostic perturbations
DFUAP [157] | W | NT | Generate input-agnostic perturbations without knowing any inputs
VAE Attacks [118] | W | T, NT | Can fool VAE [115] and potentially defenses relying on generative models
ATN [10] | W | T, NT | Generate adversarial examples using neural networks
DAG [229] | W | T, NT | Can fool semantic segmentation & object detection models
ZOO [40] | B | T, NT | Can fool defensive distillation [166] and non-differentiable models
OPA [205] | B | T, NT | Uses genetic algorithm, can generate adversary by just modifying one pixel
Houdini [43] | W, B | T, NT | Method for attacking models directly through its non-differentiable metric
MI-FGSM [51] | W | T, NT | BIM + momentum, faster to converge and better transferability
AdvGAN [225] | W | T, NT | Generate adversarial examples using GAN [79]
Boundary Attack [25] | B | T, NT | Can fool defensive distillation [166] and non-differentiable models
NAA [239] | B | NT | Can generate adversaries for non-sensory inputs such as text
stAdv [226] | W | T, NT | Unique perceptual similarity objective
EOT [9] | W | T, NT | Good for creating physical adversaries and fooling randomization defenses
BPDA [8] | W | T, NT | Can fool various gradient masking defenses
SPSA [217] | B | T, NT | Can fool various gradient masking defenses
DDN [178] | W | T, NT | Better convergence compared to other constrained optimization methods
CAMOU [236] | B | NT | Attack in simulation using SBA [163], can be used to attack detection model
W: Whitebox; B: Blackbox; T: Targeted; NT: Non-targeted
Table 3: Taxonomy of adversarial attacks covered in this paper.
Figure 2: Ontology of adversarial attacks covered in this paper.

4.1 L-BFGS Attack

The L-BFGS attack [211] is an early method designed to fool models such as deep neural networks for image recognition tasks. Its end goal is to find a perceptually-minimal input perturbation $r$, i.e., $\arg\min_r \|r\|_2$, within bounds of the input space, that is adversarial, i.e., $\hat{y}(x + r) \neq y$. Szegedy et al. [211] used the Limited Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [136] to transform this difficult optimization problem into a box-constrained formulation where the goal is to find $x' = x + r$ that minimizes

$$ c\,\|r\|_2 + \mathcal{L}(x + r, t) \quad \text{s.t.} \quad 0 \leq x + r \leq 1 \ \text{(element-wise)}, \tag{2} $$

where elements of $x$ are normalized to $[0, 1]$, $\mathcal{L}(x, y)$ is the true loss function of the targeted model (e.g., categorical cross-entropy), and $t$ is the target misclassification label. Since this objective does not guarantee that $x'$ will be adversarial for any specific value of $c > 0$, the above optimization process is iterated for increasingly large values of $c$ via line search until an adversary is found. Optionally, the resulting $c$ value can be further optimized using the bisection method (a.k.a. binary search) between the range of the final line search segment.

This attack was successfully applied to misclassify many image instances on both AlexNet [120] and QuocNet [129], which were state-of-the-art classification models at the time. Thanks to its $L_2$ Euclidean distance constraint, L-BFGS produces adversaries that are perceptually similar to the original input $x$. Moreover, a key advantage of modeling the adversarial example generation process as a general optimization problem is that it allows for flexibility in folding additional criteria into the objective function. For instance, one may choose to use perceptual similarity metrics other than the $L_2$ distance, depending on the requirements of a given application domain. We will see concrete examples of such criteria in subsequent sections.

4.2 Fast Gradient Sign Method

The Fast Gradient Sign Method (FGSM) [80] is designed to quickly find a perturbation direction for a given input $x$ such that the training loss function of the target model will increase, reducing classification confidence and increasing the likelihood of inter-class confusion. While there is no guarantee that increasing the training loss by a given amount will result in misclassification, this is nevertheless a sensible direction to take since the loss value for a misclassified instance is by definition larger than otherwise.

FGSM works by calculating the gradient of the loss function with respect to the input, and creating a small perturbation by multiplying a small chosen constant $\epsilon$ by the sign vector of the gradient:

$$ x' = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L}(x, y)), \tag{3} $$

where $\nabla_x \mathcal{L}(x, y)$ is the first derivative of the loss function with respect to the input $x$. In the case of deep neural networks, this can be calculated through the backpropagation algorithm [181]. In practice, the generated adversarial examples must be within the bounds of the input space (e.g., [0, 255] pixel intensities for an 8-bit image), which is enforced by value-clipping.

The authors proposed to bound the input perturbation under the supremum metric (i.e., $\|x' - x\|_\infty \leq \epsilon$) to encourage perceptual similarity between $x$ and $x'$. Under this $L_\infty$-norm constraint, the sign of the gradient vector maximizes the magnitude of the input perturbation, which consequently also amplifies the adversarial change in the model’s output. A variant of FGSM that uses the actual gradient vector rather than its sign vector was later introduced as the Fast Gradient Value (FGV) method [179].

A sample adversarial example and perturbation generated by FGSM can be seen in Fig. 1. Note that this attack can be applied to any machine learning model where $\nabla_x \mathcal{L}(x, y)$ can be calculated. Compared to the numerically-optimized L-BFGS attack, FGSM computes gradients analytically and thus finds solutions much faster. On the other hand, FGSM does not explicitly optimize for the adversary $x'$ to have a minimal perceptual difference, instead using a small $\epsilon$ to weakly bound the perturbation $r$. Optionally, once an adversarial example is found at a given $\epsilon$ value, one can use an iterative strategy similar to the L-BFGS attack’s line search of $c$ to further enhance perceptual similarity, although the resulting $x'$ may still not have minimal perceptual difference since perturbations are only searched along the sign vector of the gradient.
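As an illustration of the single-step update, the sketch below applies FGSM to a small linear softmax classifier, for which the input gradient of the cross-entropy loss has the closed form $W^\top (f(x) - \text{onehot}(y))$. The classifier, its weights, and the helper names (`softmax`, `loss_grad`, `fgsm`) are assumptions made for this example, with inputs scaled to [0, 1] rather than [0, 255].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_grad(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. x for logits W @ x + b."""
    p = softmax(W @ x + b)
    p[y] -= 1.0                      # softmax output minus one-hot label
    return W.T @ p

def fgsm(W, b, x, y, eps, lo=0.0, hi=1.0):
    """Single signed-gradient step of size eps, then value-clipping."""
    x_adv = x + eps * np.sign(loss_grad(W, b, x, y))
    return np.clip(x_adv, lo, hi)

# Toy two-class model (illustrative weights, not taken from any paper).
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
b = np.zeros(2)
x = np.array([0.55, 0.45])           # classified as class 0
x_adv = fgsm(W, b, x, y=0, eps=0.1)

pred_clean = int(np.argmax(W @ x + b))
pred_adv = int(np.argmax(W @ x_adv + b))
```

Every coordinate moves by exactly `eps` (before clipping), which is why the perturbation saturates the $L_\infty$ budget.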

Huang et al. [104] have also shown that FGSM can be used to attack reinforcement learning models [207] where the policies are parameterized using neural networks such as the Trust Region Policy Optimization (TRPO) [187], Asynchronous Advantage Actor-Critic (A3C) [153], and Deep Q-Network (DQN) [154]. By applying the FGSM to modify images from various Atari games (e.g., Pong, Chopper Command, Seaquest, Space Invaders) [12], the agent can be fooled into taking sub-optimal actions.

4.3 Basic Iterative Method

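The Basic Iterative Method (BIM) [125, 126] is one of many extensions of FGSM [80] (see Section 4.2), and is sometimes referred to as Iterative FGSM or I-FGSM. BIM applies FGSM multiple times within a supremum-norm bound on the total input perturbation, $\|x' - x\|_\infty \leq \epsilon$. The adversarial examples generated by BIM are defined as

$$ x'_0 = x, \qquad x'_{i+1} = \text{Clip}_\epsilon\{ x'_i + \alpha \cdot \text{sign}(\nabla_x \mathcal{L}(x'_i, y)) \} \quad \text{for } i = 0, \ldots, n - 1, \tag{4} $$

where $n$ is the total number of iterations and $\alpha$ is the per-iteration step size. The clipping operator $\text{Clip}_\epsilon\{\cdot\}$ constrains each input feature (e.g., pixel) $x'_{(j)}$ at coordinate $j$ to be within an $\epsilon$-neighborhood of the original instance $x$, as well as within the feasible input space (e.g., $[0, 255]$ for 8-bit intensity values):

$$ \text{Clip}_\epsilon\{x'\}_{(j)} = \min\big(255,\; x_{(j)} + \epsilon,\; \max(0,\; x_{(j)} - \epsilon,\; x'_{(j)})\big). $$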
BIM was the first method shown to be effective with printed paper examples (see Section 5). Beyond classification models, BIM has also been used to attack semantic segmentation models [64], such as the FCN [192].
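The iterative update and the clipping operator can be sketched as follows, again on a toy linear softmax classifier with inputs in [0, 1]. The model, the hyperparameter values, and the helper names (`clip_eps`, `bim`) are illustrative assumptions, not the configuration used in [125, 126].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_grad(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. x for logits W @ x + b."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def clip_eps(x_adv, x, eps, lo=0.0, hi=1.0):
    """min(hi, x + eps, max(lo, x - eps, x_adv)), applied element-wise."""
    upper = np.minimum(hi, x + eps)
    lower = np.maximum(lo, x - eps)
    return np.minimum(upper, np.maximum(lower, x_adv))

def bim(W, b, x, y, eps, alpha, n_iter):
    """Repeat small FGSM steps, clipping back into the eps-ball each time."""
    x_adv = x.copy()
    for _ in range(n_iter):
        step = alpha * np.sign(loss_grad(W, b, x_adv, y))
        x_adv = clip_eps(x_adv + step, x, eps)
    return x_adv

# Toy two-class model (illustrative weights).
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
b = np.zeros(2)
x = np.array([0.7, 0.3])             # classified as class 0
x_adv = bim(W, b, x, y=0, eps=0.25, alpha=0.1, n_iter=4)
pred_adv = int(np.argmax(W @ x_adv + b))
```

In this run the third step already overshoots the budget and is clipped back to the edge of the $\epsilon$-neighborhood, so the final perturbation has $L_\infty$ norm 0.25 even though four steps of size 0.1 were taken.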

4.4 Iterative Least-Likely Class Method

The authors of BIM also proposed a targeted variant of the attack called the Iterative Least-Likely Class Method (ILLCM), where the goal is to generate an adversarial example which is misclassified as a specific target class $t$ [125, 126]. In fact, ILLCM targets the class with the least likelihood of being chosen by the original classifier, i.e., $t = \arg\min_c f(x)_{(c)}$. The corresponding iterative update is given as:

$$ x'_{i+1} = \text{Clip}_\epsilon\{ x'_i - \alpha \cdot \text{sign}(\nabla_x \mathcal{L}(x'_i, t)) \}. $$

This update is nearly identical to Equation 4, except that the predicted class in the cross-entropy loss is changed from the true label $y$ to an adversarial target $t \neq y$, and the sign of the gradient update is reversed. Thus, whilst the non-targeted BIM and FGSM attacks increase the original classifier’s training loss, effectively “undoing training” and encouraging inter-class confusion, the targeted ILLCM reduces the classification loss of an adversarial training pair $(x', t)$ to misguide the model into having excessive confidence towards the target class $t$.

Why target the least likely class, $t$? Doing so maximizes misclassification robustness, preventing the attack from settling on trivial adversarial examples for which the true and predicted classes are very similar. This is particularly relevant when working with large datasets such as ImageNet [49, 182], which contain many similar-looking classes. For instance, ImageNet includes examples of both Siberian Huskies and Alaskan Malamutes (both wolflike dogs). Non-targeted attacks risk finding trivial adversarial examples which cause confusion between the two, which is both easy to achieve and relatively benign. In contrast, ILLCM aims to maximize the negative impact of an attack through dramatic misclassification (e.g., dog $\rightarrow$ airplane), while minimizing input perturbations.
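A minimal sketch of the targeted update is given below for a three-class linear softmax classifier with inputs in [0, 1]. The weights, the budget values, and the helper names are assumptions chosen so that the least-likely class becomes reachable within the $\epsilon$-ball; they are not taken from [125, 126].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_grad(W, b, x, label):
    """Gradient of the cross-entropy loss w.r.t. x for logits W @ x + b."""
    p = softmax(W @ x + b)
    p[label] -= 1.0
    return W.T @ p

def clip_eps(x_adv, x, eps, lo=0.0, hi=1.0):
    upper = np.minimum(hi, x + eps)
    lower = np.maximum(lo, x - eps)
    return np.minimum(upper, np.maximum(lower, x_adv))

def illcm(W, b, x, eps, alpha, n_iter):
    """Descend the loss of the least-likely class t, chosen once from the clean input."""
    t = int(np.argmin(softmax(W @ x + b)))
    x_adv = x.copy()
    for _ in range(n_iter):
        # Minus sign: reduce the loss with respect to the target label t.
        step = alpha * np.sign(loss_grad(W, b, x_adv, t))
        x_adv = clip_eps(x_adv - step, x, eps)
    return x_adv, t

# Toy three-class model (illustrative weights and bias).
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])
x = np.array([0.4, 0.3])             # predicted class 0; class 2 is least likely
x_adv, target = illcm(W, b, x, eps=0.25, alpha=0.05, n_iter=6)
pred_adv = int(np.argmax(W @ x_adv + b))
```

Compared with the non-targeted sketch of BIM, only two things change: the label passed to the loss gradient and the sign of the step.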

4.5 R+FGSM

The randomized single-step attack, or R+FGSM [Tramèr et al. 2018], adds a small random perturbation to the input before applying the adversarial perturbation generated by FGSM. This helps avoid the defensive strategy of gradient masking [163, 162, Tramèr et al. 2018], which is illustrated in Fig. 3.

Tramèr et al. (2018) discussed how adversarially training [80] (see Section 6.1.1) a model on FGSM [80] adversaries can effectively lead to gradient masking. They showed that there are many orthogonal adversarial directions, and that local loss gradients do not necessarily point in the direction where the global loss of a model will be maximum. Gradient masking of this kind led to some complacency in the machine learning community, in the mistaken belief that adversarially trained models are robust to unseen adversarial examples.

(a) Loss surface of an adversarially trained model
(b) Same loss surface zoomed in for small $\epsilon$
Figure 3: Illustration of a specific instance of gradient masking for an adversarially trained model (taken from Tramèr et al. (2018)). Starting with values for the parameters $\epsilon_1$ and $\epsilon_2$ near the center of Fig. 3(b), the gradient locally points towards smaller values for both parameters, while we see that on a wider scale the loss can be maximized (within an adversarial attack process) for larger values of $\epsilon_1$ and $\epsilon_2$.

In order to escape from this particular case of gradient masking, R+FGSM first modifies the input $x$ into $x^\dagger$ by pushing it towards a random direction (which is sampled as the sign of a unit multivariate Gaussian), and then calculates the derivative of the loss with respect to $x^\dagger$:

$$ x^\dagger = x + \alpha \cdot \text{sign}(\mathcal{N}(0, I)), \qquad x' = x^\dagger + (\epsilon - \alpha) \cdot \text{sign}(\nabla_{x^\dagger} \mathcal{L}(x^\dagger, y)). $$

Here, $\alpha$ is a positive constant hyperparameter such that $\alpha < \epsilon$. While the adversarial gradient now contributes less perturbation magnitude than in FGSM (i.e., $\epsilon - \alpha$ instead of $\epsilon$), the random pre-perturbation increases the chance of finding an attack direction that escapes gradient masking.

Tramèr et al. (2018) found that R+FGSM has a higher attack success rate than FGSM on adversarially trained Inception ResNet v2 [208] and Inception v3 models [210]. Furthermore, R+FGSM was also found to be stronger than two-step BIM [125, 126] (see Section 4.3) on an adversarially trained Inception v3 model, which suggests that random sampling helps provide better directions than using the local gradients alone. This strengthens the suspicion of gradient masking, as random guesses should not produce better adversarial directions than using the actual gradients.
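The two-stage update can be sketched as below: a random signed step of size $\alpha$, followed by a signed-gradient step of size $\epsilon - \alpha$ evaluated at the randomly shifted point. The linear softmax classifier, the seed, and the helper names are assumptions for illustration; the only property the sketch relies on is that the total perturbation stays within the $L_\infty$ budget $\epsilon$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_grad(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. x for logits W @ x + b."""
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def r_fgsm(W, b, x, y, eps, alpha, rng, lo=0.0, hi=1.0):
    """Random signed step of size alpha, then an FGSM step of size eps - alpha."""
    assert 0.0 < alpha < eps
    x_rand = x + alpha * np.sign(rng.standard_normal(x.shape))
    x_adv = x_rand + (eps - alpha) * np.sign(loss_grad(W, b, x_rand, y))
    return np.clip(x_adv, lo, hi)

# Toy two-class model (illustrative weights) and a fixed seed for repeatability.
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
b = np.zeros(2)
x = np.array([0.55, 0.45])
rng = np.random.default_rng(0)
x_adv = r_fgsm(W, b, x, y=0, eps=0.1, alpha=0.05, rng=rng)
linf = float(np.max(np.abs(x_adv - x)))
```

On a model this simple there is no gradient masking to escape, so the random step only redistributes the budget; the benefit reported above concerns adversarially trained networks whose local gradients are misleading.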

4.6 Adversarial Manipulation of Deep Representations

The approaches discussed so far optimize for properties of the output softmax or logits. Sabour et al. [184] proposed an altered formulation of the L-BFGS attack [211] that instead generates examples which resemble the target class in terms of their hidden layer activations. Adversarial Manipulation of Deep Representations (AMDR) thus optimizes for similarity between a perturbed source image and a target image with different class label in the intermediary layers of a network, rather than the output layers.

Formally, AMDR requires a target neural network classifier $f$, a source image $x$, and a target image of a different class, $x_t$, where $\hat{y}(x) \neq \hat{y}(x_t)$. Given the above, the goal is to find an adversarial example $x'$ that looks like $x$, yet whose internal representations resemble those of $x_t$. The resulting optimization objective is thus:

$$ x' = \arg\min_{x'} \| f_l(x') - f_l(x_t) \|_2^2 \quad \text{s.t.} \quad \|x' - x\|_\infty < \epsilon, $$

where $f_l(\cdot)$ denotes the output of $f$ at the $l$-th layer, i.e., the internal representations of the input, and $\epsilon$ denotes the bound for the $L_\infty$ norm. Contrary to the L-BFGS attack, this method does not require a target label, but rather needs a target image $x_t$ and a chosen feature layer $l$.

As shown by Sabour et al. [184], the AMDR attack successfully finds adversaries when evaluated against CaffeNet [109], AlexNet [120], GoogleNet [209], and a VGG variant [38] on ImageNet [49, 182] and Places205 [240] datasets. The authors also qualitatively evaluated the internal representations by inverting them back into an image using the technique proposed by [145] and found that the inverted images resembled the target images.

4.7 DeepFool

The DeepFool algorithm [156] estimates the distance of an input instance to the closest decision boundary of a multi-class classifier. This result can be used both as a measure of the robustness of the model to attacks, and as a minimal adversarial perturbation direction. As motivation, the authors note that for a binary linear classifier, the distance to the decision boundary (which is simply a line) can be analytically computed using the point-to-line distance formula. This readily generalizes to a multi-class linear classifier, where the desired measure can be computed as the distance to the nearest of the decision boundary lines separating the input's class from each of the other classes.

Generalizing further to non-linear multi-class neural networks, DeepFool iteratively perturbs the input $x$ by linearizing the model's per-class decision boundaries around the current set-point $x_i$ (starting from $x_0 = x$), identifies the class with the closest linearized decision boundary, and moves $x_i$ to this estimated boundary point. As shown in Algorithm 1, this process is repeated until $x_i$ becomes misclassified. [Footnote 1: In the original formulation [156], the iterative algorithm terminates when the predicted label of $x_i$ differs from the predicted label of $x$, while our variant terminates when it differs from the ground truth label $y$. These are identical when the classifier's prediction of a given input $x$ correctly reflects the ground truth label $y$. Otherwise, $x$ does not need to be perturbed if the original model already misclassifies it, and perturbing it to the nearest decision boundary might actually correct the misclassification.] Recall that $f(x)_k$ denotes the $k$-th element of the softmax prediction vector $f(x)$, while $\odot$ denotes element-wise multiplication.

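  Input: input $x$, ground truth label $y$, number of classes $m$, classifier $f$, desired norm $p$, let $q = \frac{p}{p-1}$
  Output: adversarial perturbation $r$
  Initialize: $x_0 \leftarrow x$, $r \leftarrow 0$, $i \leftarrow 0$
  while $\arg\max_k f(x_i)_k = y$ do
     for $k = 1$ to $m$ do
        if $k \neq y$ then
           $w_k \leftarrow \nabla f(x_i)_k - \nabla f(x_i)_y$
           $f_k \leftarrow f(x_i)_k - f(x_i)_y$
           $l_k \leftarrow |f_k| \, / \, \|w_k\|_q$
        end if
     end for
     $\hat{k} \leftarrow \arg\min_{k \neq y} l_k$
     $r_i \leftarrow \frac{|f_{\hat{k}}|}{\|w_{\hat{k}}\|_q^q} \, |w_{\hat{k}}|^{q-1} \odot \text{sign}(w_{\hat{k}})$
     $x_{i+1} \leftarrow x_i + r_i$, $r \leftarrow r + r_i$, $i \leftarrow i + 1$
  end while
Algorithm 1 DeepFool Algorithm for Multi-Class Classifier [156]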

The size of the resulting perturbation can be interpreted as a measure of the model’s robustness to adversarial attacks. DeepFool can compute this measure using a variety of different distance metrics, including the Euclidean ($L_2$) norm and the supremum ($L_\infty$) norm. In practice, once an adversarial perturbation $r$ is found, the adversarial example is nudged further beyond the decision boundary to guarantee misclassification, e.g., $x' = x + 1.02\,r$.
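
For the multi-class linear case that motivates DeepFool, the minimal $L_2$ perturbation has a closed form, which the following sketch illustrates. The affine classifier (`W`, `b`), the helper names, and the `overshoot` default of 0.02 are assumptions made for this example only.

```python
import math

def affine_scores(W, b, x):
    """Per-class scores of a toy affine classifier W x + b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def deepfool_linear(W, b, x, overshoot=0.02):
    """Closest-boundary L2 perturbation for an affine multi-class classifier."""
    scores = affine_scores(W, b, x)
    y = max(range(len(scores)), key=scores.__getitem__)  # currently predicted class
    best = None
    for k in range(len(scores)):
        if k == y:
            continue
        w_k = [wk - wy for wk, wy in zip(W[k], W[y])]  # normal of boundary y vs. k
        f_k = scores[k] - scores[y]
        norm_sq = sum(v * v for v in w_k)
        dist = abs(f_k) / math.sqrt(norm_sq)           # point-to-hyperplane distance
        if best is None or dist < best[0]:
            best = (dist, [abs(f_k) / norm_sq * v for v in w_k])
    dist, r = best
    # Nudge slightly past the boundary so that the predicted label actually flips.
    x_adv = [xi + (1.0 + overshoot) * ri for xi, ri in zip(x, r)]
    return dist, r, x_adv

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
dist, r, x_adv = deepfool_linear(W, b, [2.0, 1.0])
```

For a non-linear network, the same computation would be repeated on successive linearizations of the model, with gradients of the class scores taking the place of the rows of `W`.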

DeepFool has been successful in attacking various models such as LeNet [131], Network in Network [134], CaffeNet [109], and GoogLeNet [209]. Furthermore, [156] found that DeepFool generates adversaries that have times smaller perturbations compared to those resulting from FGSM [80] (see Section 4.2) on MNIST and CIFAR10 models, and times smaller perturbations for ImageNet models. DeepFool was also found to produce adversaries with slightly smaller perturbations compared to the L-BFGS attack [211] (see Section 4.1) while being much faster (e.g., more than x speedup) to compute.

4.8 Jacobian-based Saliency Map Attacks

The notion of the saliency map was originally conceived for visualizing how deep neural networks make predictions [196]. The saliency map rates each input feature (e.g., each pixel in an image) by its influence upon the network’s class prediction. Jacobian-based Saliency Map Attacks (JSMA) [165] exploit this information by perturbing a small set of input features to cause misclassification. This is in contrast to attacks like the FGSM [80] (see Section 4.2) that modify most, if not all, input features. As such, JSMA attacks tend to find sparse perturbations.

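Given the predicted softmax probabilities vector $f(x)$ from a neural network classifier, one formulation of the saliency map is

$$S^+(x_i, t) = \begin{cases} 0 & \text{if } \frac{\partial f(x)_t}{\partial x_i} < 0 \text{ or } \sum_{j \neq t} \frac{\partial f(x)_j}{\partial x_i} > 0 \\ \frac{\partial f(x)_t}{\partial x_i} \cdot \left| \sum_{j \neq t} \frac{\partial f(x)_j}{\partial x_i} \right| & \text{otherwise} \end{cases}$$

where $x_i$ denotes the $i$-th element of $x$, and $t$ is a specified label of interest, e.g., the target for visualization or attack. Intuitively, the saliency map uses components of the gradient to quantify the degree to which each input feature positively correlates with a target class of interest $t$, while on average negatively correlating with all other classes $j \neq t$. If either condition is violated for a given feature $x_i$, then $S^+(x_i, t)$ is set to zero, effectively ignoring features which are not preferentially associated with the target class. The features with the largest saliency measure can then be increased to amplify the model’s predicted confidence for a target class $t$, whilst attenuating confidences for all other classes.

This process can easily be inverted to provide a negative saliency map, $S^-$, describing which features should be reduced to increase a target class probability. This formulation requires inverting the inequalities for the two low-saliency conditions:

$$S^-(x_i, t) = \begin{cases} 0 & \text{if } \frac{\partial f(x)_t}{\partial x_i} > 0 \text{ or } \sum_{j \neq t} \frac{\partial f(x)_j}{\partial x_i} < 0 \\ \left| \frac{\partial f(x)_t}{\partial x_i} \right| \cdot \sum_{j \neq t} \frac{\partial f(x)_j}{\partial x_i} & \text{otherwise} \end{cases}$$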

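  Input: $n$-dimensional input $x$ normalized to $[0, 1]$, target class $t$, classifier $f$, maximum number of iterations $I_{max}$, perturbation step $\theta$
  Initialize: $x' \leftarrow x$, $i \leftarrow 0$, search domain $\Gamma = \{1, \dots, n\}$
  while $\arg\max_j f(x')_j \neq t$ and $i < I_{max}$ and $|\Gamma| \geq 2$ do
     $\gamma \leftarrow 0$
     for every pixel pair ($p, q$) $\in \Gamma$ do
        $\alpha \leftarrow \sum_{k \in \{p, q\}} \frac{\partial f(x')_t}{\partial x'_k}$
        $\beta \leftarrow \sum_{k \in \{p, q\}} \sum_{j \neq t} \frac{\partial f(x')_j}{\partial x'_k}$
        if $\alpha > 0$ and $\beta < 0$ and $-\alpha \cdot \beta > \gamma$ then
           $(p^*, q^*) \leftarrow (p, q)$, $\gamma \leftarrow -\alpha \cdot \beta$
        end if
     end for
     $x'_{p^*} \leftarrow \min(1, x'_{p^*} + \theta)$, $x'_{q^*} \leftarrow \min(1, x'_{q^*} + \theta)$
     if $x'_{p^*} = 1$ then
        $\Gamma \leftarrow \Gamma \setminus \{p^*\}$
     end if
     if $x'_{q^*} = 1$ then
        $\Gamma \leftarrow \Gamma \setminus \{q^*\}$
     end if
     $i \leftarrow i + 1$
  end while
Algorithm 2 Jacobian-based Saliency Map Attack by Increasing Pixel Intensities [165]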

Papernot et al. [165] note that both saliency measures $S^+$ and $S^-$ are overly strict when applied to individual input features (e.g., single image pixels), since it is likely that the sum of gradient contributions across non-targeted classes will trigger the minimal-saliency criterion. Consequently, as shown in Algorithm 2, the Jacobian-based Saliency Map Attack alters the saliency measures to search over pairs of pixels instead. Concretely, given a search domain $\Gamma$ initialized to contain the indices of all input features, the algorithm finds the most salient pixel pair, perturbs both values by $\theta$, and then removes saturated feature indices from the search domain. This process is repeated until either an adversary is found or, in practice, a maximum number of iterations $I_{max}$ is reached.


The formulation in Algorithm 2 finds adversarial examples by increasing feature values ($\theta > 0$) based on the $S^+$ saliency measure. An alternative attack variant that decreases feature values can be constructed by substituting $S^+$ with $S^-$ and setting $\theta < 0$. Both variants are targeted attacks that increase a classifier’s softmax prediction confidence for a chosen adversarial target class $t$.
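
The per-feature saliency measure and the pairwise search can be sketched as follows. This is an illustrative example only: the Jacobian is passed in as a plain nested list (`jacobian[j][i]` standing for the derivative of class $j$'s probability with respect to feature $i$), and the function names and toy values are hypothetical.

```python
from itertools import combinations

def saliency_increase(jacobian, t):
    """Positive saliency per feature: nonzero only when raising the feature helps
    class t and, summed over all other classes, hurts them."""
    scores = []
    for i in range(len(jacobian[0])):
        alpha = jacobian[t][i]
        beta = sum(row[i] for j, row in enumerate(jacobian) if j != t)
        scores.append(alpha * abs(beta) if alpha > 0 and beta < 0 else 0.0)
    return scores

def most_salient_pair(jacobian, t, domain):
    """Exhaustive search for the feature pair with the largest joint saliency."""
    best_pair, best_score = None, 0.0
    for p, q in combinations(sorted(domain), 2):
        alpha = jacobian[t][p] + jacobian[t][q]
        beta = sum(row[p] + row[q] for j, row in enumerate(jacobian) if j != t)
        if alpha > 0 and beta < 0 and -alpha * beta > best_score:
            best_pair, best_score = (p, q), -alpha * beta
    return best_pair, best_score

# Toy Jacobian for 3 classes and 4 input features (made-up numbers).
J = [[0.5, -0.2, 0.1, 0.3],
     [-0.3, 0.1, 0.2, -0.1],
     [-0.2, 0.1, -0.1, -0.4]]
single = saliency_increase(J, t=0)
pair, score = most_salient_pair(J, t=0, domain={0, 1, 2, 3})
```

In this toy Jacobian, feature 2 has zero saliency on its own, yet the pair (0, 2) passes both conditions, which illustrates why searching over pairs is less strict than scoring single features. The exhaustive pair loop is also what makes each iteration expensive for high-dimensional inputs.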

The original authors prescribed $\theta = 1$ in order to find adversaries in as few iterations as possible. In general, though, we can use a smaller feature perturbation step, i.e., $|\theta| < 1$, to produce adversarial examples with fewer saturated features. Additionally, the feature saturation criterion can also be altered to be $\epsilon$-bounded around the initial input values (e.g., using the operator from Equation 5), to further constrain per-pixel perceptual difference.

Carlini and Wagner [34] note that the above saliency measures and attack can alternatively be applied to the gradient of the logits $Z(x)$ rather than of the softmax probabilities $f(x)$. While using different saliency measures results in favoring slightly different pixel pairs, both variants successfully find adversarial examples. We designate the original algorithm variants as JSMA+F and JSMA-F, and those using logit-based saliency maps as JSMA+Z and JSMA-Z, where + and - indicate whether input features are increased or decreased.

Papernot et al. [165] showed that JSMA can successfully fool a targeted MNIST model by modifying only a small fraction of the input features. However, there is still room for improving the misclassification rate and efficiency by picking which features should be updated in a more optimal way. For example, note that Algorithm 2 needs to test every possible pixel pair in the search domain before deciding which pair should be updated at each iteration, which is computationally expensive.

All JSMA variants above must be given a specific target class $t$. This choice affects the speed and quality of the attack, since misclassification into certain classes is easier to attain than into others (e.g., a given hand-written digit is more easily perturbed to resemble some digits than others). Instead of increasing the prediction probability (or logit) of an adversarial target $t \neq y$, we propose to remove this dependency altogether by altering JSMA to decrease the model’s prediction confidence of the true class label ($t = y$). These non-targeted JSMA variants are realized by swapping the saliency measure employed, i.e., follow $S^-$ when increasing feature values (NT-JSMA+F / NT-JSMA+Z), or $S^+$ when decreasing feature values (NT-JSMA-F / NT-JSMA-Z).

  Input: -dimensional input normalized to , true class label , classifier , maximum number of iterations , pixel perturbation step , maximum perturbation bound
  Initialize: , , ,
  while  and and  do
     for every pixel pair () and every class  do
        if  then
           if  then
           end if
        end if
     end for
     if   then
     end if
     if   then
     end if
  end while
Algorithm 3 Maximal Jacobian-based Saliency Map Attack

Extending further, we propose a combined attack, termed Maximal Jacobian-based Saliency Map Attack (M-JSMA), that merges both targeted variants and both non-targeted variants together. As shown in Algorithm 3, at each iteration the maximally salient pixel pair is chosen over every possible class $t$, whether adversarial or not. In this way, we find the most influential features across all classes, in the knowledge that changing these is likely to change the eventual classification. Furthermore, instead of enforcing low-saliency conditions via $S^+$ or $S^-$, we identify which measure applies to the most salient pair and decide on the perturbation direction accordingly. A history vector is added to prevent oscillatory perturbations. Similar to NT-JSMA, M-JSMA terminates when the predicted class no longer matches the true class $y$.

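| Attack | MNIST % | MNIST $L_0$ | MNIST $L_2$ | MNIST $H$ | F-MNIST % | F-MNIST $L_0$ | F-MNIST $L_2$ | F-MNIST $H$ | CIFAR10 % | CIFAR10 $L_0$ | CIFAR10 $L_2$ | CIFAR10 $H$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JSMA+F | 100 | 34.8 | 4.32 | 0.90 | 99.9 | 93.1 | 6.12 | 1.22 | 100 | 34.7 | 3.01 | 1.27 |
| JSMA-F | 100 | 32.1 | 3.88 | 0.88 | 99.9 | 82.2 | 4.37 | 1.21 | 100 | 36.9 | 2.13 | 1.23 |
| NT-JSMA+F | 100 | 17.6 | 3.35 | 0.64 | 100 | 18.8 | 3.27 | 1.03 | 99.9 | 17.5 | 2.36 | 1.16 |
| NT-JSMA-F | 100 | 19.7 | 3.44 | 0.70 | 99.9 | 33.2 | 2.99 | 0.98 | 99.9 | 19.6 | 1.68 | 1.12 |
| M-JSMA_F | 100 | 14.9 | 3.04 | 0.62 | 99.9 | 18.7 | 3.42 | 1.02 | 99.9 | 17.4 | 2.16 | 1.12 |

Table 4: Performance comparison of the original JSMA, non-targeted JSMA, and maximal JSMA variants on MNIST, Fashion MNIST (F-MNIST), and CIFAR10: % of successful attacks, average $L_0$ and $L_2$ perturbation distances, and average entropy $H$ of misclassified softmax prediction probabilities.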

Table 4 summarizes attacks carried out on correctly-classified test-set instances in the MNIST [131], Fashion MNIST [227], and CIFAR10 [121] datasets, using targeted, non-targeted, and maximal JSMA variants. For targeted attacks, we consider only adversaries that were misclassified in the fewest iterations over target classes. The JSMA+F results showed that on average only a small percentage of pixels needed to be perturbed in order to create adversaries, thus corroborating findings from [165]. More importantly, as evidenced by lower $L_0$ values, NT-JSMA found adversaries much faster than the fastest targeted attacks across all 3 datasets, while M-JSMA was consistently even faster, perturbing the fewest input pixels on average. Additionally, the quality of adversaries found by NT-JSMA and M-JSMA was also superior, as indicated by smaller $L_2$ perceptual differences between the adversaries $x'$ and the original inputs $x$, and by lower misclassification uncertainty as reflected by prediction entropy $H$. Since M-JSMA considers all possible class targets, both $S^+$ and $S^-$ metrics, and both perturbation directions, these results show that it inherits the combined benefits from both the original JSMA and NT-JSMA.
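
For reference, the three kinds of quantities reported in Table 4 could be computed per example roughly as sketched below. The function names are hypothetical, and the use of the natural logarithm for the entropy, as well as the raw (unnormalized) count for the $L_0$ distance, are assumptions rather than details confirmed by the text.

```python
import math

def l0_distance(x, x_adv, tol=1e-12):
    """Number of features whose value was changed by the attack."""
    return sum(1 for a, b in zip(x, x_adv) if abs(a - b) > tol)

def l2_distance(x, x_adv):
    """Euclidean distance between the original and adversarial inputs."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_adv)))

def prediction_entropy(probs):
    """Shannon entropy (in nats) of a softmax prediction vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

x, x_adv = [0.0, 0.5, 1.0], [0.0, 1.0, 0.0]
changed = l0_distance(x, x_adv)
shift = l2_distance(x, x_adv)
uncertain = prediction_entropy([0.5, 0.5])
confident = prediction_entropy([1.0, 0.0])
```

A low entropy means that the model assigns most of its probability mass to a single (wrong) class, i.e., the misclassification is confident rather than ambiguous.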

4.9 Substitute Blackbox Attack

All of the techniques covered so far are whitebox attacks, relying upon access to a model’s innards. Papernot et al. [163] proposed one of the early practical blackbox methods, called the Substitute Blackbox Attack (SBA). The key idea is to train a substitute model to mimic the blackbox model, and use whitebox attack methods on this substitute. This approach leverages the transferability property of adversarial examples. Concretely, the attacker first gathers a synthetic dataset, obtains predictions on the synthetic dataset from the targeted model, and then trains a substitute model to imitate the targeted model’s predictions.

After the substitute model is trained, adversaries can be generated using any whitebox attack, since the details of the substitute model are known (e.g., [163] used the FGSM [80] (see Section 4.2) and JSMA [165] (see Section 4.8)). We name SBA variants after the type of adversarial attack used against the substitute model. For example, if the attacker uses FGSM to attack the substitute model, we refer to this as FGSM-SBA.

The success of this approach depends on choosing adequately similar synthetic data samples and a substitute model architecture using high-level knowledge of the target classifier setup. As such, an intimate knowledge of the domain and the targeted model is likely to aid the attacker. Even in the absence of specific expertise, the transferability property suggests that adversaries generated from a well-trained substitute model are likely to fool the targeted model as well.

Papernot et al. [163] note that in practice the attacker cannot make an unlimited number of queries to the targeted model. Consequently, the authors introduced the Jacobian-based Dataset Augmentation technique, which generates a limited number of additional samples around a small initial synthetic dataset to efficiently replicate the target model’s decision boundaries. Concretely, given an initial sample $x$, one calculates the Jacobian of the predicted class’ likelihood assigned by the targeted model $f$ with respect to the inputs. Since the attacker cannot apply analytical backpropagation to the targeted model, this gradient is instead calculated using the substitute model $f'$, which we denote as $\nabla_x f'(x)_y$, where $y$ is the label that the targeted model assigns to $x$. A new sample $x'$ is then synthesized by perturbing $x$ along the sign of the gradient, by a small step $\lambda$, i.e., $x' = x + \lambda \cdot \text{sign}(\nabla_x f'(x)_y)$. While this process resembles FGSM, its purpose is instead to create samples that are likely to be classified with high confidence. Papernot et al. [163] noted that the resulting augmented dataset better represents the decision boundary of the targeted model, in comparison to randomly sampling more data points that would most likely fall outside the target model’s training-set manifold.

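  Input: initial training set $X$, targeted model $f$, initial substitute model $f'$, maximum number of iterations $I_{max}$, small constant $\lambda$
  Output: refined substitute model $f'$
  for $i = 1$ to $I_{max}$ do
     $Y \leftarrow \{\arg\max_k f(x)_k : x \in X\}$
     Train $f'$ on input-label pairs $(X, Y)$
     $X' \leftarrow \{x + \lambda \cdot \text{sign}(\nabla_x f'(x)_y) : (x, y) \in (X, Y)\}$
     $X \leftarrow X \cup X'$
  end for
Algorithm 4 Substitute Model Training with Jacobian-based Dataset Augmentation [163]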

The entire training procedure for the substitute model is summarized in Algorithm 4. The attacker first creates a small initial training set $X$. For example, $X$ can be initialized by picking one sample from each possible class of a set that represents the input domain of the targeted model. The substitute model is then trained on the synthetic dataset using labels provided by the targeted model (e.g., by querying the targeted model). New datapoints are then generated by perturbing each sample in the existing dataset along the general direction of variation. Finally, the new inputs are added to the existing dataset, i.e., the size of the synthetic dataset grows with each iteration. This process is then repeated several times.
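
One augmentation round of this procedure can be sketched as below. To keep the example self-contained, the substitute is assumed to be a linear-softmax model with hypothetical parameters `W` and `b`, the blackbox is represented by a hypothetical `oracle` callable that returns a label, and the training step of the substitute is omitted.

```python
import math

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def grad_class_prob(W, b, x, y):
    """Gradient of the substitute's probability for class y w.r.t. the input x,
    derived analytically for a linear-softmax substitute."""
    logits = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    p = softmax(logits)
    mean_row = [sum(p[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]
    return [p[y] * (W[y][i] - mean_row[i]) for i in range(len(x))]

def augment(dataset, oracle, W, b, lam):
    """Add one new point per sample, stepping along the sign of the substitute's
    gradient for the label that the blackbox oracle assigns to that sample."""
    new_points = []
    for x in dataset:
        y = oracle(x)  # one query to the targeted model
        grad = grad_class_prob(W, b, x, y)
        new_points.append([xi + lam * sign(gi) for xi, gi in zip(x, grad)])
    return dataset + new_points

W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
oracle = lambda x: 0 if x[0] > x[1] else 1  # stand-in for the blackbox model
data = [[0.6, 0.4], [0.2, 0.9]]
augmented = augment(data, oracle, W, b, lam=0.1)
```

In this sketch each round doubles the dataset, so the number of oracle queries grows quickly with the number of rounds; this is the query budget that the attacker has to trade off against substitute fidelity.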

It is interesting to note that the targeted model does not have to be differentiable for the attack to succeed. The differentiability constraint applies only to the substitute model. As long as the substitute model has the capacity to approximate the targeted model, this attack is feasible. Papernot et al. [163] showed that substitute blackbox attack can be used to attack other machine learning models like logistic regression, Support Vector Machines (SVM) [45], k-Nearest Neighbor (kNN) [46], and non-differentiable models such as decision trees [24].

The authors evaluated SBA by targeting real world image recognition systems from Amazon, Google, and MetaMind on the MNIST dataset [131], and successfully fooled all targets with high accuracies (). This method also successfully attacked a blackbox deep neural network model that was trained on German Traffic Sign Recognition Benchmarks (GTSRB) dataset [202]. Furthermore, SBA was shown to also circumvent defense methods that rely on gradient masking such as adversarial training on FGSM adversaries [80] (see Section 6.1.1) and defensive distillation [166] (see Section 6.1.3).

4.10 Hot/Cold Attack

Building on the idea of altering the input to increase classifier loss, as in FGSM [80] (see Section 4.2), Rozsa et al. [179] proposed an attack algorithm based upon setting the values of the classification logits $Z(x)$. We can then use the gradients with respect to the input to push inputs towards producing the desired logits. The logits are modified such that the per-class gradients will point in directions where the output of the network will increase the probability of the target (“hot”) class and decrease the probability of the ground truth (“cold”) class. The Hot/Cold attack alters a target classifier’s logits into:

$$h(x)_j = \begin{cases} |Z(x)_j| & \text{if } j = y' \\ -Z(x)_j & \text{if } j = y \\ 0 & \text{otherwise} \end{cases}$$

where $Z(x)_j$ denotes the $j$-th element of the logits vector $Z(x)$, $y'$ denotes the target (“hot”) class, and $y$ denotes the ground truth (“cold”) class.

Intuitively, by maximizing these modified logits using gradient ascent on $x$, $Z(x)_{y'}$ will be increased (assuming that it starts with a positive setpoint value) while $Z(x)_y$ will be decreased. Correspondingly, the target model will predict the adversarial class with increased likelihood $f(x)_{y'}$ and predict the true class with decreased likelihood $f(x)_y$, since the softmax function is monotonically increasing with respect to its logit inputs. Finally, letting the other elements have zero values isolates the adversarial perturbation search to focus only on target and ground truth classes.
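
A minimal sketch of the logit modification is given below, assuming the rule that the hot (target) entry keeps the magnitude of its logit, the cold (ground truth) entry is negated, and all remaining entries are zeroed. The function name and the toy logits are hypothetical.

```python
def hot_cold_logits(logits, target, true_label):
    """Build the modified logit vector used to derive attack directions."""
    modified = [0.0] * len(logits)
    modified[target] = abs(logits[target])     # "hot": push the target class up
    modified[true_label] = -logits[true_label]  # "cold": push the true class down
    return modified

modified = hot_cold_logits([2.0, -1.0, 0.5], target=1, true_label=0)
```

Note that the target logit in this toy example starts out negative, which is exactly the case flagged by the positive-setpoint caveat above: ascending on its absolute value would then push the logit further down rather than up.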

Once is obtained, and gradient directions are extracted with respect to the input , we then search for the closest adversarial perturbations along these directions using line search and bisection search, as in the L-BFGS attack. While the original manuscript [179] had some ambiguities on how to exactly compute gradient(s) from , one sensible approach is to consider two separate directions based on and , perform line search on each, and select the closest adversary. Also, while not explicitly specified by the original authors, since the Hot/Cold Attack finds the closest adversary via gradient line search, it correspondingly enforces perceptual similarity between and following the metric.

The authors carried out preliminary assessments of this adversarial attack for the LeNet classification model on the MNIST dataset, and attained decent adversarial rates. Nevertheless, this work’s main goal was to use generated adversaries to robustify a classifier model against further attackers via Adversarial Training [80] (see Section 6.1.1). The authors also anecdotally noted that many perturbations found using Hot/Cold Attack had structural visual appearances, and argued that such perturbations resemble patterns in natural images more so than the perceptually random-noise perturbations generated by FGSM [80].

4.11 Carlini & Wagner Attacks

Carlini and Wagner [34] introduced a family of attacks for finding adversarial perturbations that minimize diverse similarity metrics: $L_0$, $L_2$, and $L_\infty$. The core insight transforms a general constrained optimization strategy similar to the L-BFGS attack [211] (see Section 4.1) into an empirically-chosen loss function within an unconstrained optimization formulation:

$$\ell(x') = \max\left( \max_{i \neq t} Z(x')_i - Z(x')_t, \; -\kappa \right)$$

where $Z(x')_i$ denotes the $i$-th component of the classifier’s logits, $t$ denotes the target label, and $\kappa$ represents a parameter that reflects the minimum desired confidence margin for the adversarial example.

Conceptually, this loss function minimizes the distance in logit values between class $t$ and the second most-likely class. If $t$ currently has the highest logit value, then the difference of the logits will be negative, and so the optimization will stop when this logit difference between $t$ and the runner-up class exceeds $\kappa$. On the other hand, if $t$ does not have the highest logit value, then minimizing $\ell(x')$ brings the logits of the winning class and the target class closer together, i.e., reducing the highest class’ prediction confidence and/or increasing the target class’ confidence.

Furthermore, the parameter $\kappa$ establishes a best-case stopping criterion, in that the logit of the adversarial class is larger than the runner-up class’ logit by at least $\kappa$. Thus, $\kappa$ explicitly encodes a minimum desirable degree of robustness for the target adversary. Note that when $\kappa = 0$, the resulting adversarial examples would fool the network with only weak robustness, as any further slight perturbation may revert the prediction to a non-adversarial class.
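
The behaviour of this margin-based loss can be checked on a few hand-picked logit vectors. The sketch below assumes the loss takes the form $\max(\max_{i \neq t} Z_i - Z_t, -\kappa)$; the function name and the example logits are hypothetical.

```python
def cw_loss(logits, target, kappa=0.0):
    """Margin loss: gap between the best non-target logit and the target logit,
    floored at -kappa so that optimization stops once the margin is reached."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)

not_yet_adversarial = cw_loss([1.0, 3.0, 2.0], target=0)             # target is losing
weak_margin = cw_loss([2.5, 1.0, 2.0], target=0, kappa=1.0)          # margin 0.5 < kappa
saturated = cw_loss([5.0, 1.0, 2.0], target=0, kappa=1.0)            # margin 3.0 >= kappa
```

Once the target logit leads by at least `kappa`, the loss is clamped and contributes no further gradient, leaving only the distance term of the overall objective to be minimized.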

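The C&W $L_2$ attack formulation is given as:

$$\min_{w} \; \left\| \tfrac{1}{2}(\tanh(w) + 1) - x \right\|_2^2 + c \cdot \ell\left( \tfrac{1}{2}(\tanh(w) + 1) \right)$$

where $w$ is a change of variable such that $x' = \frac{1}{2}(\tanh(w) + 1)$, which is introduced to bound $x'$ to be within $[0, 1]$. The minimum value of the parameter $c$ is chosen through an outer optimization loop procedure, which is detailed further below.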
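
The effect of the $\tanh$ change of variable can be illustrated with a short sketch: any real-valued $w$ maps to a valid input in $[0, 1]$, so the optimizer never needs explicit box constraints. The helper names, the `margin` used to keep the inverse map finite, and the `attack_loss` callable are assumptions made for this example.

```python
import math

def to_input(w):
    """Map unconstrained variables w to inputs in [0, 1] via (tanh(w) + 1) / 2."""
    return [0.5 * (math.tanh(v) + 1.0) for v in w]

def to_unconstrained(x, margin=1e-6):
    """Inverse map; inputs are pulled slightly inside (0, 1) so atanh stays finite."""
    return [math.atanh(2.0 * min(max(v, margin), 1.0 - margin) - 1.0) for v in x]

def cw_l2_objective(w, x, attack_loss, c):
    """Squared L2 distortion plus c times a user-supplied attack loss."""
    x_adv = to_input(w)
    dist_sq = sum((a - b) ** 2 for a, b in zip(x_adv, x))
    return dist_sq + c * attack_loss(x_adv)

extremes = to_input([-50.0, 0.0, 50.0])
x = [0.25, 0.9]
w0 = to_unconstrained(x)  # natural starting point: zero distortion
start_value = cw_l2_objective(w0, x, attack_loss=lambda x_adv: 2.0, c=3.0)
```

Starting the optimization from `to_unconstrained(x)` means the initial distortion term is (numerically) zero, so the first gradient steps are driven entirely by the attack loss.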

The C&W $L_0$ attack is much more complex than the $L_2$ variant, since its associated distance metric is non-differentiable. Instead, an iterative strategy is proposed to successively eliminate non-significant input features such that misclassification can be achieved by perturbing as few input values as possible. During initialization, an allowed set $S$ is defined to include all input features in $x$. Next, at each iteration, an $L_2$ attack attempt is carried out, under the constraint of perturbing only features within $S$. If the attempt is successful, then the next non-significant feature $i^*$ is identified and removed from $S$, where $i^* = \arg\min_i g_i \cdot \delta_i$, $g = \nabla \ell(x + \delta)$, and $\delta = x' - x$. This iterative procedure is repeated until the $S$-constrained $L_2$ attack fails to find an adversarial example, at which time the latest successful adversarial example is returned. To speed up this iterative algorithm, the authors suggested to “warm-start” each attack attempt by using the latest adversarial example found in the previous iteration, pre-modified to satisfy the reduced set $S$. Intuitively, the selection criterion $g_i \cdot \delta_i$ quantifies how much the loss value is affected by perturbing the $i$-th feature. Thus, eliminating the feature $i^*$ with the minimum criterion score has the least impact on potential misclassification.

Similar to the C&W $L_0$ attack, the $L_\infty$ attack variant also requires an iterative algorithm, since the $L_\infty$ metric is not fully differentiable. Its optimization objective is given as:

$$\min_{\delta} \; c \cdot \ell(x + \delta) + \sum_i \max(0, \delta_i - \tau)$$

The parameter $\tau$ is initialized to $1$, and is reduced after every iteration by a factor of $0.9$ if $\delta_i < \tau$ for all $i$, until no adversarial example is found. In short, this strategy constrains the magnitude of adversarial perturbations to be bounded by successively smaller $\tau$. Again, similar to the $L_0$ attack, warm-start can be used at every iteration to speed up the entire process.
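
The penalty term and the threshold schedule can be sketched as follows. The hinge form of the penalty and the default shrink factor of 0.9 follow our reading of Carlini and Wagner's description and should be treated as assumptions; the function names are hypothetical.

```python
def linf_penalty(delta, tau):
    """Sum of the amounts by which individual perturbation entries exceed tau."""
    return sum(max(0.0, d - tau) for d in delta)

def update_tau(delta, tau, factor=0.9):
    """Shrink tau only when every perturbation entry already lies below it."""
    return tau * factor if all(d < tau for d in delta) else tau

penalty = linf_penalty([0.2, 0.05, -0.3], tau=0.1)  # only the first entry exceeds tau
shrunk = update_tau([0.2, 0.05], tau=1.0)           # all entries below tau
unchanged = update_tau([0.2, 1.5], tau=1.0)         # one entry still too large
```

Penalizing only the entries above $\tau$ avoids the poor gradient signal of the raw $L_\infty$ norm, which at any point depends on just the single largest entry.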

As another practical implementation enhancement for finding robust adversaries, the authors recommend optimizing the scale parameter $c$ empirically rather than fixing it to a constant value. Concretely, in an outer optimization loop, $c$ is first set to a very stringent value, and then iteratively relaxed via doubling until a first adversary is found.

Carlini and Wagner [34] empirically showed their methods to be superior to the state-of-the-art attacks at the time when evaluated on MNIST, CIFAR10, and ImageNet. In their evaluations, the proposed $L_0$ attack was compared against JSMA [165] (see Section 4.8), the $L_2$ variant was compared against DeepFool [156] (see Section 4.7), and the $L_\infty$ method was compared against FGSM [80] (see Section 4.2) and BIM [125, 126] (see Section 4.3). The three C&W attacks consistently outperformed the incumbents in terms of average distortion and attack success rate. Furthermore, they found that JSMA is too expensive to run on ImageNet models, since the dimensionality of ImageNet data is much higher than that of MNIST or CIFAR10, while their $L_0$ method has no difficulty finding adversarial examples on ImageNet. As mentioned previously, the proposed attacks successfully circumvented defensive distillation with a 100% success rate, while keeping the adversarial examples similar to the original inputs under the $L_0$, $L_2$, and $L_\infty$ metrics. Finally, the authors also showed that their attacks transfer between models, even to those trained using defensive distillation. They also report that increases in $\kappa$ lead to increases in transferability, supporting the notion that $\kappa$ modulates the robustness of examples.

4.12 Universal Adversarial Perturbation

All of the attack methods covered thus far search for adversarial perturbations of a specific input instance, such as distorting a particular image of a dog to be misclassified as “cat”. There are no guarantees that such perturbations will remain adversarial when added to a different input instance. In contrast, Moosavi-Dezfooli et al. [155] demonstrated the existence of Universal Adversarial Perturbations (UAP): perturbations that are input-agnostic. They show that a single perturbation can cause misclassification when added to most images from a given dataset (see Fig. 4 for an illustration of UAP). The authors argued that the high adversarial efficacy of UAP indicates the presence of non-random, exploitable geometric correlations in the target model’s decision boundaries.

Figure 4: Illustration of UAP (taken from Moosavi-Dezfooli et al. [155]).

UAP works by accumulating perturbations calculated over individual inputs. As shown in Algorithm 5, this meta-procedure runs another sample-specific attack (e.g., FGSM [80]) within a loop to incrementally build a sample-agnostic UAP, $v$. In each iteration, a universally-perturbed input sample $x_i + v$ is further altered by $\Delta v$ via a per-sample attack method. Then, the universal perturbation $v$ is adjusted to account for $\Delta v$, all the while satisfying a maximum bound $\epsilon$ on the $L_p$-norm of the UAP. In this manner, the resulting $v$ can be added to each data sample in order to push it towards a nearby decision boundary of the target model. This meta-attack procedure is repeated until the portion of samples that are misclassified, i.e., the “fooling rate”, exceeds a desired threshold $\delta$.

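  Input: input-label pairs $\{(x_i, y_i)\}_{i=1}^{N}$ from a dataset of $N$ entries, target classifier $f$, maximum $L_p$ norm constraint $\epsilon$, minimum desired fooling rate $\delta$
  Output: universal adversarial perturbation $v$
  Initialize: $v \leftarrow 0$
  while fooling rate $< \delta$ do
     for $i = 1$ to $N$ do
        if $f(x_i + v) = y_i$ then
           $\Delta v \leftarrow \arg\min_{r} \|r\|_2$ subject to $f(x_i + v + r) \neq y_i$
           $v \leftarrow \arg\min_{v'} \|v' - (v + \Delta v)\|_2$ subject to $\|v'\|_p \leq \epsilon$
        end if
     end for
  end while
Algorithm 5 Universal Adversarial Perturbation [155]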

Algorithm 5 is slightly modified from Moosavi-Dezfooli et al. [155], in that we only update $v$ when a universally-perturbed sample is correctly classified (i.e., $f(x_i + v) = y_i$). The original formulation considered all samples that did not change prediction labels, i.e., $f(x_i + v) = f(x_i)$. These two variants differ only for samples $x_i$ that are misclassified by the target model, i.e., $f(x_i) \neq y_i$. Our practical stance is that such data samples do not need to be perturbed further to be adversarial, and that further perturbations might actually make them non-adversarial, i.e., $f(x_i + v + \Delta v) = y_i$.
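
Two building blocks of this meta-procedure, namely projecting the accumulated perturbation back onto the norm ball and measuring the fooling rate, can be sketched as follows. Only the $L_2$ and $L_\infty$ cases are handled, and the function names as well as the one-dimensional threshold classifier used in the usage lines are hypothetical.

```python
import math

def project_lp(v, eps, p):
    """Project v onto the L_p ball of radius eps (p = 2 or p = math.inf only)."""
    if p == 2:
        norm = math.sqrt(sum(a * a for a in v))
        scale = min(1.0, eps / norm) if norm > 0 else 1.0
        return [a * scale for a in v]
    if p == math.inf:
        return [max(-eps, min(eps, a)) for a in v]
    raise ValueError("only p = 2 and p = inf are handled in this sketch")

def fooling_rate(samples, labels, v, predict):
    """Fraction of samples whose prediction under x + v differs from the label."""
    fooled = sum(1 for x, y in zip(samples, labels)
                 if predict([xi + vi for xi, vi in zip(x, v)]) != y)
    return fooled / len(samples)

rescaled = project_lp([3.0, 4.0], eps=1.0, p=2)
clipped = project_lp([3.0, -4.0, 0.5], eps=1.0, p=math.inf)
predict = lambda x: int(x[0] > 0.5)  # toy threshold classifier
rate = fooling_rate([[0.4], [0.6], [0.2]], [0, 1, 0], [0.15], predict)
```

The rate here is measured against ground truth labels, matching the modified update condition discussed above rather than the original criterion based on unchanged predictions.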

The authors demonstrated existence of UAPs for various models including CaffeNet [109], GoogLeNet [209], ResNet-152 [92], and several variants of VGG [195, 38] with fooling rates as high as for the ImageNet validation dataset. UAP was also shown to be highly transferable between these models. This method was subsequently extended to find UAP that fools semantic segmentation models [152].

One potential deficiency of Algorithm 5 is that it does not guarantee that each updated UAP will still be adversarial to data points that appear before the update. Indeed, the proposed method may need to iterate over the same dataset multiple times before it attains a high desired fooling rate. It would be interesting for future research to identify a UAP variant that can update without weakening its adversarial nature on previously-seen data samples.

Additionally, we observe there is no mandatory need for Algorithm 5 to enforce $L_2$ norms, both when computing the per-instance perturbation $\Delta v$, and when updating $v$ towards $v + \Delta v$. While this choice of norm may be indirectly supported by related bounds for random perturbations in the authors’ comparative analyses, it is nevertheless not fully justified. We argue that the UAP meta-attack procedure has two key objectives: finding a robustly-adversarial universal perturbation (i.e., one that causes the target model to misclassify confidently), while remaining perceptually indistinguishable as quantified by an imposed $L_p$-norm constraint, $\|v\|_p \leq \epsilon$. When $p \neq 2$, the per-instance attack might find an adversarial perturbation $\Delta v$ that could arbitrarily violate the latter constraint, and much of its adversarial robustness might be subsequently lost when updating $v$ towards $v + \Delta v$.

To address the above concern, we suggest that Algorithm 5 could be updated to use a per-instance adversarial attack that optimized a matching norm constaint, and also update as . This might then mitigate the need to re-enforce the constraint explicitly. This meta-procedure could be further generalized to consider, on each update step, a multitude of adversaries generated by diverse per-instance attacks. One might then be able to choose an update direction for such that its adversarial robustness is increased while always preserving the stipulated constraint. We encourage the readers to substantiate this conjecture.

4.13 Data-Free UAP

As an extension of UAP [155] (see Section 4.12), Mopuri et al. [158, 157] proposed a new algorithm to generate UAP without requiring access to the training data (i.e., data-free). [Footnote 2: We mainly focus on the variant proposed by [157], as it improves upon [158].] The Data-Free UAP (DFUAP) method aims to find a universal perturbation $v$ that saturates all activations of a neural network by itself, by minimizing the following loss:

$$L(v) = -\log\left( \prod_{l=1}^{K} \left\| f_l(v) \right\|_2 \right)$$

Here, $f_l(v)$ denotes the activations of the model at the $l$-th layer (among $K$ total layers), given $v$ as the entire input. As is common for adversarial attacks, a perceptual similarity constraint is imposed on the magnitude of $v$ under the $L_\infty$-norm, i.e., $\|v\|_\infty \leq \epsilon$.

To implement the above optimization, $v$ is initialized randomly, and then updated via gradient descent while constrained by the $L_\infty$-norm bound. Also, the authors empirically found that optimizing over a small subset of activations (e.g., only convolutional layers, or only the last layers of convolutional blocks) resulted in similar fooling rates compared to optimizing over all activations; thus, the former objective should be used in practice for efficiency's sake.
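
As a rough illustration of this objective, the sketch below evaluates the loss from a list of per-layer activation vectors, assuming the loss is the negative log of the product of per-layer activation $L_2$ norms. How the activations are obtained from a real network is outside the scope of the example, and the function name and toy activations are hypothetical.

```python
import math

def dfuap_loss(layer_activations):
    """Negative sum of log L2 norms of per-layer activations; lower values mean
    stronger activations when the perturbation alone is fed to the network."""
    total = 0.0
    for activations in layer_activations:
        norm = math.sqrt(sum(a * a for a in activations))
        total += math.log(norm)  # log of a product becomes a sum of logs
    return -total

weak = dfuap_loss([[3.0, 4.0], [1.0, 0.0]])
strong = dfuap_loss([[6.0, 8.0], [1.0, 0.0]])
```

Restricting `layer_activations` to a subset of layers, as suggested above, simply shortens this list and reduces the cost of every loss evaluation.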

Although DFUAP was designed to be completely agnostic of training data, it can be slightly altered to take advantage of prior knowledge on the training data when available. For instance, Mopuri et al. [157] suggested how one can leverage information such as the mean of a dataset, the dynamic range of the dataset (e.g., [0, 255] for 8-bit RGB images), or samples of the training data. Concretely, in the case where the mean of the training data and the dynamic range of the input are available, the loss function can be changed to:

$$L(v) = -\log\left( \prod_{l=1}^{K} \left\| f_l(d + v) \right\|_2 \right)$$

where $d$ denotes a random perturbation sampled from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ such that $d$ lies within the dynamic range of the input. Alternatively, if some training samples $x \in X$ are available, then the loss could be modified to:

$$L(v) = -\sum_{x \in X} \log\left( \prod_{l=1}^{K} \left\| f_l(x + v) \right\|_2 \right)$$

Although in general DFUAP is not as effective at fooling the target model as the UAP generated by [155], this method can be significantly faster to compute, since one does not have to iterate through the entire dataset. Mopuri et al. [157] also showed that this method can be used to attack semantic image segmentation (e.g., FCN [192]) and depth estimation (e.g., Monodepth [75]) models. Furthermore, [155] showed that the fooling rate of the data-free UAP method is comparable to the original UAP method when data prior is available. Perhaps more interestingly, they showed that the original UAP method only outperformed the data-free method when the dataset size was sufficiently large (e.g., at least 10,000 images for ImageNet-level models such as the GoogLeNet [209] or CaffeNet [109]).

4.14 VAE Attacks

Although the majority of work in the field has focused upon fooling classifiers, recent research has extended adversarial attacks to generative models. Kos et al. [118] focus upon the scenario where a generative model is being used as a sophisticated compression mechanism, allowing a sender and receiver to efficiently exchange information using a shared latent space. Concretely, a sender takes an input image, encodes it with the model, and sends the receiver the latent code. The receiver uses the same model to decode from this latent representation to an output image. In this case, an adversarial example would be one that is poorly reconstructed after encoding and decoding through the model.

Kos et al. [118] describe ways to find adversarial examples for models like the Variational Autoencoder (VAE) [115] and VAE-GAN [128]. The authors proposed three separate strategies: the classifier attack, the $\mathcal{L}_{VAE}$ attack, and the latent attack. These methods apply to models that specifically have an encoder and a decoder component, which respectively compress the input $x$ into a latent representation $z$ and reconstruct the input from $z$, i.e., $z = f_{enc}(x)$ and $\hat{x} = f_{dec}(z)$.

The simplest method, the classifier attack, augments the frozen generative model (e.g., VAE or VAE-GAN) by training a new classifier network, and then applies an existing algorithm like FGSM [80] (see Section 4.2) to find adversaries. During the process, this approach also alters the latent encoding into an adversarial counterpart such that the decoder produces poor reconstructions. Nevertheless, there is no general guarantee that success in fooling the classifier will result in fooling the decoder.

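The setup of the second attack method, the $\mathcal{L}_{VAE}$ attack, involves altering the loss function for training a VAE:

$$\mathcal{L}_{VAE} = -\mathrm{KL}\left[ q(z|x) \,\|\, p(z) \right] + \mathbb{E}_{q}\left[ \log p(x|z) \right],$$

where $\mathrm{KL}$ denotes the Kullback-Leibler divergence [122], $q(z|x)$ denotes an analytical approximation for the underlying conditional latent distribution $p(z|x)$, $p(z)$ is a prior distribution of $z$ that is assumed to be Gaussian, and $\mathbb{E}_{q}[\log p(x|z)]$ is the reconstruction term of the evidence lower bound (ELBO), which in this setup entails the cross-entropy loss between the input $x$ and its reconstruction $\hat{x}$.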
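
To make the two terms concrete, the following sketch evaluates a single-sample estimate of the VAE training loss (the negative of the bound). It assumes a diagonal Gaussian posterior parameterized by `mu` and `log_var`, a standard normal prior, and inputs in [0, 1] scored with a Bernoulli cross-entropy; the function names are illustrative only.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def bernoulli_cross_entropy(x, x_hat, eps=1e-7):
    """Cross-entropy between an input in [0, 1] and its reconstruction."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO estimate: reconstruction cross-entropy plus the KL term."""
    return bernoulli_cross_entropy(x, x_hat) + gaussian_kl(mu, log_var)
```

Passing a reconstruction target other than the model's own input to `bernoulli_cross_entropy` is the kind of alteration that the attack relies on.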

When training a VAE normally, the reconstruction target of this loss is naturally set to the output of the encoder-decoder networks. However, this loss metric can be altered to quantify the gap between a given input and the reconstruction of another adversarial target input. Subsequently, an optimization formulation resembling methods such as the L-BFGS attack [211] (see Section 4.1) can be formulated as:


Similar to the L-BFGS attack, the scaling parameter is initially set to a small value, and then incrementally increased via line search until an adversary is found. This outer optimization loop ensures that the resulting adversary strictly enforces the loss.

The third strategy, the latent attack, aims to match the latent encoding of the adversarial input to an adversarial target latent vector, in contrast to the $\mathcal{L}_{VAE}$ attack's aim of matching the reconstructions of the adversarial input and of an adversarial target input. To this end, a different loss function is used:


while the adversarial optimization process remains the same:


Kos et al. [118] showed that the above attack strategies successfully fool VAE and VAE-GAN on various datasets such as MNIST [131], SVHN [159], and CelebA [139]. However, the classifier attack produces lower quality reconstructions than the $\mathcal{L}_{VAE}$ and latent attacks, which may be due to the fact that the classifier itself is easier to fool than the generative model. The authors also report that the $\mathcal{L}_{VAE}$ attack was the slowest, since this method builds a reconstruction at every iteration, whilst the latent attack was found to be the most effective.
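
The idea behind the latent attack can be sketched on a toy problem. The example below is an assumption-laden illustration rather than the formulation of Kos et al. [118]: the encoder is replaced by a fixed linear map `W_enc`, both terms use squared $L_2$ distances, the trade-off weight `c` is held fixed (no outer line search), and the step size is derived from the curvature of this particular quadratic objective.

```python
import numpy as np

def latent_attack(x, z_target, W_enc, c=0.1, steps=200):
    """Gradient descent on c * ||x_adv - x||^2 + ||W_enc @ x_adv - z_target||^2."""
    # Step size 1/L, where L bounds the curvature of the quadratic objective.
    lipschitz = 2.0 * (c + np.linalg.norm(W_enc, 2) ** 2)
    lr = 1.0 / lipschitz
    x_adv = x.copy()
    for _ in range(steps):
        grad_dist = 2.0 * c * (x_adv - x)                         # stay close to the input
        grad_latent = 2.0 * W_enc.T @ (W_enc @ x_adv - z_target)  # move the latent code
        x_adv = x_adv - lr * (grad_dist + grad_latent)
    return x_adv

# Hypothetical usage: push the code of x toward the code of another input x_t.
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(4, 16))
x, x_t = rng.normal(size=16), rng.normal(size=16)
x_adv = latent_attack(x, W_enc @ x_t, W_enc)
```

With a real encoder the gradient of the latent term would be obtained by backpropagation, and `c` would be adjusted by the outer search described above.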

Although these attacks are meant to fool generative reconstruction models rather than discriminative classification models, some adversarial defense techniques use similar generative models to remove adversarial perturbations from an input, such as PixelDefend [198] (see Section 6.1.11) and DefenseGAN [170] (see Section 6.1.12). Consequently, knowing that generative models can also be attacked should caution one against naively favoring generative models as a defense strategy.

4.15 Adversarial Transformation Networks

Instead of directly using optimizers such as L-BFGS (see Section 4.1), Adversarial Transformation Networks (ATN) [10] are neural networks that either encode and reconstruct non-adversarial inputs into adversarial counterparts, or generate an additive adversarial perturbation given a specific input. Once trained, the network can swiftly generate new adversarial examples. The former variant, the Adversarial Auto-Encoding (AAE) network, is depicted in Fig. 5(a), which contrasts with the latter Perturbation ATN (P-ATN) model, as seen in Fig. 5(b).

Figure 5: Illustrations of (a) AAE and (b) P-ATN.

Given a pre-trained target classifier network $f$, in both variants the generator $G$ is trained to minimize:

$$\mathcal{L} = \beta \, \mathcal{L}_X(x, x') + \mathcal{L}_Y\left(f(x'), r(f(x), t)\right),$$

where $x'$ is the adversarial example produced by $G$ from the input $x$, $\mathcal{L}_X$ enforces perceptual similarity between $x$ and $x'$, while $\mathcal{L}_Y$ forces the softmax probabilities of the perturbed input, $f(x')$, to match an adversarial class distribution $r(f(x), t)$ for the adversarial target class $t$. (While the original authors used $L_2$ metrics for $\mathcal{L}_X$ and $\mathcal{L}_Y$, this is not a mandatory requirement, as others have found adversaries using other similarity metrics, such as loss in feature space [68, 111].) The scaling hyper-parameter $\beta$ is either set heuristically or empirically optimized to adequately balance between the perceptual similarity and adversarial misclassification objectives.
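
As a minimal sketch of this objective, the function below combines the two terms for a single example. It assumes squared $L_2$ distances for both terms and takes the classifier's softmax output on the perturbed input and the adversarial class distribution as precomputed arrays; the name `atn_loss` and its arguments are illustrative.

```python
import numpy as np

def atn_loss(x, x_adv, probs_adv, target_dist, beta=0.1):
    """Weighted sum of an input-space similarity term and an output-space term."""
    loss_x = np.sum((x_adv - x) ** 2)                # perceptual similarity (assumed squared L2)
    loss_y = np.sum((probs_adv - target_dist) ** 2)  # match the adversarial class distribution
    return beta * loss_x + loss_y
```

In training, this quantity would be averaged over a dataset and minimized with respect to the generator's parameters while the target classifier stays frozen.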

One naive instantiation of ATN is to set the adversarial class distribution to a one-hot encoding of the adversarial class $t$, as when training normal classification models. Nevertheless, the authors proposed a more effective reranking function, whose $k$-th components for $k \neq t$ are minimally different from the softmax probabilities of the pre-trained classifier $f$, denoted $y = f(x)$:

$$r_\alpha(y, t)_k = \begin{cases} \alpha \cdot \max(y) & \text{if } k = t, \\ y_k & \text{otherwise.} \end{cases}$$

Since the hyper-parameter $\alpha > 1$ enhances the desired confidence for the adversarial class, in general $r_\alpha(y, t)$ needs to be vector-re-normalized. It is important to note that each trained ATN generator can only produce adversarial examples that will be misclassified as a particular class by the targeted model. Multiple ATNs must be trained in order to fool a given classifier into diverse adversarial classes.
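
The reranking step is simple enough to write out directly. In the sketch below, re-normalization is assumed to mean dividing by the sum so that the result is again a probability vector; the function name `rerank` is illustrative.

```python
import numpy as np

def rerank(probs, target, alpha=1.5):
    """Boost the target class to alpha * max(probs), keep the rest, then re-normalize."""
    r = np.array(probs, dtype=float)
    r[target] = alpha * np.max(probs)  # target class becomes the most confident one
    return r / np.sum(r)               # assumed re-normalization: divide by the sum

# Example: the least likely class (index 2) is chosen as the adversarial target.
target_dist = rerank([0.6, 0.3, 0.1], target=2, alpha=1.5)
# target_dist is [1/3, 1/6, 1/2], so the relative order of the other classes is preserved.
```

Because only the target entry is changed before normalization, the ranking among the remaining classes is left untouched, which is the property the reranking function is designed to have.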

Baluja and Fischer [10] found that the transferability of adversarial examples generated by ATN is fairly poor, but can be enhanced by training the ATN to fool multiple networks at the same time (e.g., the gradients coming from attacking multiple networks are averaged during backpropagation). ATN exhibits some benefits that other attack methods lack: for instance, since ATN takes advantage of the expressiveness of neural networks, the adversarial examples generated by ATN tend to be more diverse. This diversity of adversarial examples can also be leveraged for adversarial training [80] (see Section 6.1.1). Similar to preliminary investigations by Hamm [89], an interesting future research direction is to formulate a min-max game between a classifier and an ATN that can provide some guarantee of the classifier’s robustness, by learning from diverse sets of adversarial examples generated by the ATN without suffering from catastrophic forgetting [116] between training iterations.

4.16 Dense Adversary Generation

Dense Adversary Generation (DAG) [229] is a targeted attack method for semantic image segmentation and object detection models. DAG generalizes from attacks on classification models in that it aims to misclassify multiple target outputs associated with each given input, namely multiple pixel labels for semantic segmentation tasks, or multiple region proposals for object detection tasks. DAG requires both the ground truth class labels as well as specific adversarial labels for every target output of a given model. Using this information, DAG incrementally builds an adversarial input perturbation that decreases the predicted logits for the true classes while increasing the logits for the adversarial labels among target outputs, thus resulting in as many misclassified outputs as possible. (While the original formulation for DAG computes input-gradients targeting the logit layer rather than the normalized softmax probabilities, there is no fundamental limitation that forbids the latter variant. Nevertheless, as shown by Carlini and Wagner [34] and others, in certain setups such as with defensively-distilled models [166], attack methods that operate on the logit layer tend to be significantly more successful than those targeting the softmax layer.)
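
A single update of this kind can be sketched on a toy model in which every target output has its own linear classifier, so that the input-gradient of a logit is simply the corresponding weight row. The linear model, the restriction to targets that are still correctly classified, the scaling of the step by `gamma` over the maximum absolute entry, and the name `dag_step` are assumptions for illustration, not a reproduction of the original algorithm.

```python
import numpy as np

def dag_step(x, weights, true_labels, adv_labels, gamma=0.5):
    """One DAG-style update; weights has shape (targets, classes, input_dim)."""
    logits = weights @ x                                  # (targets, classes)
    preds = logits.argmax(axis=1)
    active = np.flatnonzero(preds == true_labels)         # targets still classified correctly
    if active.size == 0:
        return x, active
    # Raise adversarial-class logits and lower true-class logits over active targets.
    r = (weights[active, adv_labels[active]]
         - weights[active, true_labels[active]]).sum(axis=0)
    scale = np.max(np.abs(r))
    if scale > 0:
        r = gamma / scale * r                             # assumed step normalization
    return x + r, active
```

Repeating this step until `active` is empty, or until an iteration budget is exhausted, gives the incremental procedure described above.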

Formally, given an image, we expand the definitions of the logit vectors, softmax probabilities, and predicted class labels to be specific to each of the target outputs of the target model. As shown in Algorithm 6, the input is incrementally perturbed on each iteration. To compute each such perturbation, DAG first identifies the set of target outputs that are still correctly classified by the model given the current perturbed input. Over the target outputs in this set, the perturbation vector is then built by accumulating the positive input-gradients of the logits for the output-specific adversarial classes, as well as the negative input-gradients of the true-class logits. This resulting perturbation, when added to