Enhancing Adversarial Example Transferability with an Intermediate Level Attack

Qian Huang
Cornell University
qh53@cornell.edu
Equal contribution.
   Isay Katsman*
Cornell University
isk22@cornell.edu
   Horace He*
Cornell University
hh498@cornell.edu
   Zeqi Gu*
Cornell University
zg45@cornell.edu
   Serge Belongie
Cornell University
sjb344@cornell.edu
   Ser-Nam Lim
Facebook AI
sernam@gmail.com
Abstract

Neural networks are vulnerable to adversarial examples, malicious inputs crafted to fool trained models. Adversarial examples often exhibit black-box transfer, meaning that adversarial examples for one model can fool another model. However, adversarial examples are typically overfit to exploit the particular architecture and feature representation of a source model, resulting in sub-optimal black-box transfer attacks to other target models. We introduce the Intermediate Level Attack (ILA), which attempts to fine-tune an existing adversarial example for greater black-box transferability by increasing its perturbation on a pre-specified layer of the source model, improving upon state-of-the-art methods. We show that we can select a layer of the source model to perturb without any knowledge of the target models while achieving high transferability. Additionally, we provide some explanatory insights regarding our method and the effect of optimizing for adversarial examples using intermediate feature maps.

1 Introduction

Figure 1: An example of an ILA modification of a pre-existing adversarial example for ResNet18. ILA modifies the adversarial example to increase its transferability. Note that although the original ResNet18 adversarial example managed to fool ResNet18, it does not manage to fool the other networks. The ILA modification of the adversarial example is, however, more transferable and is able to fool more of the other networks.

Adversarial examples are small, imperceptible perturbations of images carefully crafted to fool trained models [30, 8]. Studies such as [13] have shown that Convolutional Neural Networks (CNNs) are particularly vulnerable to such adversarial attacks. The existence of these adversarial attacks suggests that our architectures and training procedures produce fundamental blind spots in our models, and that our models are not learning the same features that humans do.

These adversarial attacks are of interest for more than just the theoretical issues they pose – concerns have also been raised over the vulnerability of CNNs to these perturbations in the real world, where they are used for mission-critical applications such as online content filtration systems and self-driving cars [7, 15]. As a result, a great deal of effort has been dedicated to studying adversarial perturbations. Much of the literature has been dedicated to the development of new attacks that use different perceptibility metrics [2, 28, 26], security settings (black box/white box) [23, 1], as well as increasing efficiency [8]. Defending against adversarial attacks is also well studied. In particular, adversarial training, where models are trained on adversarial examples, has been shown to be effective under certain assumptions [18, 27].

Adversarial attacks can be classified into two categories: white-box attacks and black-box attacks. In white-box attacks, information of the model (i.e., its architecture, gradient information, etc.) is accessible, whereas in black-box attacks, the attackers have access only to the prediction. Black-box attacks are a bigger concern for real-world applications for the obvious reason that such applications typically will not reveal their models publicly, especially when security is a concern (e.g., CNN-based objectionable content filters in social media). Consequently, black-box attacks are mostly focused on the transferability of adversarial examples [17].

Moreover, adversarial examples generated using white-box attacks will sometimes successfully attack an unrelated model. This phenomenon is known as “transferability.” However, black-box success rates for an attack are nearly always lower than those of white-box attacks, suggesting that the white-box attacks overfit on the source model. Different adversarial attacks transfer at different rates, but most of them are not optimizing specifically for transferability. This paper aims to achieve the goal of increasing the transferability of a given adversarial example. To this end, we propose a novel method that fine-tunes a given adversarial example through examining its representations in intermediate feature maps that we call Intermediate Level Attack (ILA).

Our method draws upon two primary intuitions. First, while we do not expect the direction found by the original adversarial attack to be optimal for transferability, we do expect it to be a reasonable proxy, as it still transfers far better than random noise would. As such, if we are searching for a more transferable attack, we should be willing to stray from the original attack direction in exchange for increasing the norm (perturbations with a higher norm are generally more effective, regardless of layer; this holds true for black-box attacks as well). However, from the ineffectiveness of random noise on neural networks, we see that straying too far from the original direction will cause a decrease in effectiveness, even if we are able to increase the norm by a modest amount. Thus, we must balance staying close to the original direction against increasing the norm. A natural way to do so is to maximize the projection onto the original adversarial perturbation.

Second, we note that although for transferability we would like to sacrifice some direction in exchange for increasing the norm, we are unable to do so in the image space without changing perceptibility, as norm and perceptibility are intrinsically tied (under the standard $\epsilon$-ball constraints). However, if we examine the intermediate feature maps, perceptibility (in image space) is no longer intrinsically tied to the norm in an intermediate feature map, and we may be able to increase the norm of the perturbation in that feature space significantly with no change in perceptibility in the image space. We will investigate the effects of perturbing different intermediate feature maps on transferability and provide insights drawn from empirical observations.

Our contributions are as follows:

  • We propose a novel method, ILA, that enhances black-box adversarial transferability by increasing the perturbation on a pre-specified layer of a model. We conduct a thorough evaluation that shows our method improves upon state-of-the-art methods on multiple models across multiple datasets. See Sec. 4.

  • We introduce a procedure, guided by empirical observations, for selecting a layer that maximizes the transferability using the source model alone, thus obviating the need for evaluation on transfer models during hyperparameter optimization. See Sec. 4.2.

  • Additionally, we provide explanatory insights into the effects of optimizing for adversarial examples using intermediate feature maps. See Sec. 5.

2 Background and Related Work

2.1 General Adversarial Attacks

An adversarial example for a given model is generated by augmenting an image so that in the model’s decision space its representation moves into the wrong region. Most prior work in generating adversarial examples for attack focuses on disturbing the softmax output space via the input space [8, 18, 21, 6]. Some representative white-box attacks are the following:

Gradient Based Approaches The Fast Gradient Sign Method (FGSM) [8] generates an adversarial example with the update rule:

$$x' = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(F(x), y)\right)$$

It is the linearization of the maximization problem

$$\max_{\|x' - x\|_\infty \le \epsilon} J(F(x'), y)$$

where $x$ represents the original image; $x'$ is the adversarial example; $y$ is the ground-truth label; $J$ is the loss function; and $F$ is the model up to the final softmax layer. Its iterative version (I-FGSM) applies FGSM iteratively [15]. Intuitively, this fools the model by increasing its loss, which eventually causes misclassification. In other words, it finds perturbations in the direction of the loss gradient of the last layer (i.e., the softmax layer).
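
To make the update concrete, here is a minimal I-FGSM sketch in PyTorch (our illustrative code, not the authors' implementation; the $\epsilon$, step size, and iteration count are placeholder values):

```python
import torch
import torch.nn.functional as F

def i_fgsm(model, x, y, eps=8/255, lr=2/255, n_iters=20):
    """Iteratively increase the classification loss within an L_inf eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(n_iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + lr * grad.sign()                       # step along the loss gradient sign
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)                              # stay in the valid image range
    return x_adv.detach()
```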

Decision Boundary Based Approaches Deepfool [21] produces approximately the closest adversarial example iteratively by stepping towards the nearest decision boundary. Universal Adversarial Perturbation [20] uses this idea to craft a single image-agnostic perturbation that pushes most of a dataset’s images across a model’s classification boundary.

Model Ensemble Attack The methods mentioned above are designed to yield the best performance only on the model they are tuned to attack; often, the generated adversarial examples do not transfer to other models. In contrast, [17] proposed the Model-based Ensembling Attack, which transfers better by avoiding dependence on any specific model. It uses $k$ models with softmax outputs, notated as $J_1, \dots, J_k$, and solves

$$\arg\min_{x'} \; -\log\!\left(\left(\sum_{i=1}^{k} \alpha_i J_i(x')\right) \cdot \mathbf{1}_{y'}\right) + \lambda\, d(x, x')$$

where $\alpha_i$ are the ensemble weights, $\mathbf{1}_{y'}$ is the one-hot encoding of the target label $y'$, and $d(x, x')$ penalizes the distance between the original image and the adversarial example.

Using such an approach, the authors showed that the decision boundaries of different CNNs align with each other. Consequently, an adversarial example that fools multiple models is likely to fool other models as well.
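
As a rough illustration, an untargeted variant of such an ensemble objective might be implemented as follows (a sketch under our own assumptions; the model list, weights, and the use of NLL on the averaged softmax outputs are illustrative choices, not the exact formulation of [17]):

```python
import torch
import torch.nn.functional as F

def ensemble_loss(models, weights, x_adv, y):
    # Weighted average of the softmax outputs of the k white-box source models.
    avg_probs = sum(w * F.softmax(m(x_adv), dim=1) for m, w in zip(models, weights))
    # Increasing this negative log-likelihood of the true class pushes x_adv across the
    # (roughly aligned) decision boundaries of all source models simultaneously.
    return F.nll_loss(torch.log(avg_probs + 1e-12), y)
```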

2.2 Intermediate-layer Adversarial Attacks

A small number of studies have focused on perturbing mid-layer outputs. These include [22], which perturbs mid-layer activations by crafting a single universal perturbation that produces as many spurious mid-layer activations as possible. Another is Feature Adversary Attack [32, 25], which performs a targeted attack by minimizing the distance of the representations of two images in internal neural network layers (instead of in the output layer). However, instead of emphasizing adversarial transferability, it focuses more on internal representations. Results in the paper show that even when given a guide image and a dissimilar target image, it is possible to perturb the target image to produce an embedding similar to that of the guide image.

Two other related works [12, 24] focus on perturbing intermediate activation maps for the purpose of increasing adversarial transferability in a method similar to that of [32, 25] except they focus on black-box transferability. Their method does not focus on fine-tuning existing adversarial examples and differs significantly in attack methodology from ours.

Another recent work that examines intermediate layers for the purpose of increasing transferability is TAP [33]. The TAP attack attempts to maximize the norm of the feature-map difference between the original image and the adversarial example at all layers. In contrast to our approach, it does not attempt to take advantage of a specific layer's feature representation, instead choosing to maximize the norm of the difference across all layers. In addition, unlike their method, which generates an entirely new adversarial example, our method fine-tunes existing adversarial examples, allowing us to leverage existing adversarial attacks.

3 Approach

Based on the motivation presented in the introduction, we propose the Intermediate Level Attack (ILA) framework, shown in Figure 2. We propose the following two variants, differing in their definition of the loss function $L$. Note that we define $F_l(x)$ as the output of layer $l$ of a network $F$ given an input $x$.

1: Input: original image $x$; adversarial example $x'$ generated for $x$ by a baseline attack; function $F_l$ that calculates the intermediate layer output; $L_\infty$ bound $\epsilon$; learning rate $lr$; iterations $n$; loss function $L$.
2: procedure ILA($x$, $x'$, $F_l$, $\epsilon$, $lr$, $n$, $L$)
3:     $x'' \leftarrow x'$
4:     $i \leftarrow 0$
5:     while $i < n$ do
6:         $\Delta y'_l \leftarrow F_l(x') - F_l(x)$
7:         $\Delta y''_l \leftarrow F_l(x'') - F_l(x)$
8:         $x'' \leftarrow x'' - lr \cdot \mathrm{sign}\left(\nabla_{x''} L(\Delta y'_l, \Delta y''_l)\right)$
9:         $x'' \leftarrow \mathrm{clip}(x'', x - \epsilon, x + \epsilon)$
10:        $x'' \leftarrow \mathrm{clip}(x'', 0, 1)$
11:        $i \leftarrow i + 1$
12:     end while
13:     return $x''$
14: end procedure
Figure 2: Intermediate Level Attack algorithm
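
A minimal PyTorch sketch of this loop is given below, assuming a small forward-hook helper for reading the layer-$l$ feature map; the helper, the default step sizes, and all names are our own illustrative choices rather than the authors' code:

```python
import torch

def layer_output(model, layer, x):
    """Return the feature map produced by `layer` during a forward pass of `model` on `x`."""
    feats = []
    handle = layer.register_forward_hook(lambda module, inp, out: feats.append(out))
    model(x)
    handle.remove()
    return feats[0]

def ila(model, layer, x, x_adv, loss_fn, eps=8/255, lr=1/255, n_iters=10):
    """Fine-tune the baseline adversarial example x_adv at the given intermediate layer."""
    with torch.no_grad():
        feat_x = layer_output(model, layer, x)
        delta_ref = layer_output(model, layer, x_adv) - feat_x   # fixed reference direction
    x_fine = x_adv.clone().detach()
    for _ in range(n_iters):
        x_fine.requires_grad_(True)
        delta_new = layer_output(model, layer, x_fine) - feat_x
        loss = loss_fn(delta_ref, delta_new)                      # ILAP or ILAF loss (Sec. 3.1, 3.2)
        grad, = torch.autograd.grad(loss, x_fine)
        with torch.no_grad():
            x_fine = x_fine - lr * grad.sign()                    # descend the ILA loss
            x_fine = torch.min(torch.max(x_fine, x - eps), x + eps)
            x_fine = x_fine.clamp(0, 1)
    return x_fine.detach()
```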

3.1 Intermediate Level Attack Projection (ILAP) Loss

Given an adversarial example $x'$ generated by an attack for natural image $x$, we wish to enhance its transferability by focusing on a layer $l$ of a given network $F$. Although $x'$ is not the optimal direction for transferability, we view it as a hint for this direction. We treat $\Delta y'_l = F_l(x') - F_l(x)$ as a directional guide towards becoming more adversarial, with emphasis on the disturbance at layer $l$. Our attack will attempt to find an $x''$ such that $\Delta y''_l = F_l(x'') - F_l(x)$ matches the direction of $\Delta y'_l$ while maximizing the norm of the disturbance in that direction. The high-level idea is that we want to maximize the projection $\mathrm{proj}_{\Delta y'_l}(\Delta y''_l)$ for the reasons expressed in Section 1. Since this is a maximization, we can disregard the constant factor $\|\Delta y'_l\|$, and the objective simply becomes the dot product. The objective we solve is given below, and we term it the ILA projection loss:

$$L(\Delta y'_l, \Delta y''_l) = -\,\Delta y'_l \cdot \Delta y''_l \qquad (1)$$
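
In code, a sketch of this loss (our own formulation; per-example feature maps are flattened before taking the dot product) could look like:

```python
import torch

def ilap_loss(delta_ref, delta_new):
    """Negative dot product between the reference and current mid-layer disturbances."""
    delta_ref = delta_ref.flatten(start_dim=1)
    delta_new = delta_new.flatten(start_dim=1)
    return -(delta_ref * delta_new).sum(dim=1).mean()
```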

3.2 Intermediate Level Attack Flexible (ILAF) Loss

Since the original adversarial example may not point in the optimal direction for us to optimize towards, we may want to give the above loss greater flexibility. We do this by explicitly balancing norm maximization against fidelity to the adversarial direction $\Delta y'_l$ at layer $l$. We note that, in a rough sense, ILAF is optimizing for the same thing as ILAP. We augment the above loss by separating the maintenance of the adversarial direction from the magnitude, and control the trade-off with an additional parameter $\alpha$ to obtain the following loss, termed the ILA flexible loss:

$$L(\Delta y'_l, \Delta y''_l) = -\,\alpha \cdot \frac{\|\Delta y''_l\|_2}{\|\Delta y'_l\|_2} \;-\; \frac{\Delta y''_l}{\|\Delta y''_l\|_2} \cdot \frac{\Delta y'_l}{\|\Delta y'_l\|_2} \qquad (2)$$
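
A corresponding sketch of the flexible loss (again our own formulation of Eq. (2): a norm-growth term weighted by $\alpha$ plus a cosine term that preserves the adversarial direction) is:

```python
import torch
import torch.nn.functional as F

def ilaf_loss(delta_ref, delta_new, alpha=1.0, eps=1e-12):
    delta_ref = delta_ref.flatten(start_dim=1)
    delta_new = delta_new.flatten(start_dim=1)
    norm_term = delta_new.norm(dim=1) / (delta_ref.norm(dim=1) + eps)   # magnitude growth
    dir_term = F.cosine_similarity(delta_new, delta_ref, dim=1)         # direction fidelity
    return -(alpha * norm_term + dir_term).mean()
```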

3.3 Attack

In practice, we choose either the ILAP or ILAF loss and iterate $n$ times to attain an approximate solution to the respective maximization objective. Note that the projection loss only has the layer $l$ as a hyperparameter, whereas the flexible loss also has the loss weight $\alpha$ as an additional hyperparameter. The above attack assumes that $x'$ is a pre-generated adversarial example. As such, the attack can be viewed as a fine-tuning of the adversarial example $x'$. We fine-tune for a greater norm of the output difference at layer $l$ (which we hope will be conducive to greater transferability) while attempting to preserve the output difference's direction to avoid destroying the original adversarial structure.
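
Putting the pieces together, a hypothetical usage of the sketches above (placeholder model and layer names) would be:

```python
# 10 iterations of the baseline attack produce the reference example,
# then 10 iterations of ILA fine-tune it at the chosen intermediate layer.
x_ref = i_fgsm(source_model, x, y, n_iters=10)
x_ila = ila(source_model, source_model.layer2, x, x_ref, loss_fn=ilap_loss, n_iters=10)
```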

4 Results

We start by showing that ILAP increases transferability against I-FGSM, MI-FGSM [6], and Carlini-Wagner [4] in the context of CIFAR-10 (Sections 4.1 and 4.2). Results for FGSM and DeepFool are shown in Appendix A. (We re-implemented all attacks except DeepFool, for which we used the original publicly provided implementation. For C&W, we used a randomized targeted version, since it has better performance.) We test on a variety of models, namely ResNet18 [9], SENet18 [10], DenseNet121 [11], and GoogLeNet [29]. Architecture details are specified in Appendix A; note that in the results sections below, instead of referring to architecture-specific layer names, we refer to layer indices (e.g., the index of the last layer of the first block). Our models are trained on CIFAR-10 [14] with the code and hyperparameters in [16].

For a fair comparison, we use the output of each attack run for 20 iterations as a baseline. ILAP runs for 10 iterations, using the output of the attack after 10 iterations as its reference adversarial example. The same learning rate is used for both I-FGSM and MI-FGSM (tuning the learning rate does not substantially affect transferability, as shown in Appendix G).

In Section 4.2 we also show that we can select a nearly-optimal layer for transferability using only the source model. Moreover, ILAF allows further tuning to improve the performance across layers (Section 4.3).

Finally, we demonstrate that ILAP also improves transferability in the more complex setting of ImageNet [5] and that it outperforms state-of-the-art attacks focused on increasing transferability, namely the Zhou et al. attack (TAP) [33] and the Xie et al. attack [31] (Section 4.4).

4.1 ILAP Targeted at Different Values of $l$

To confirm the effectiveness of our attack, we fix a single source model and baseline attack method, and then check how ILAP transfers to the other models compared to the baseline attack. Results for ResNet18 as the source model and I-FGSM as the baseline method are shown in Figure 3. Comparing the results of both methods on the other models, we see that ILAP outperforms I-FGSM when targeting any given intermediate layer, and does especially well for the optimal hyperparameter value of $l$. Note that the choice of layer $l$ is important for both performance on the source model and performance on the target models. Full results are shown in Appendix A.

Figure 3: Transfer results of ILAP against I-FGSM on ResNet18 as measured by DenseNet121, SENet18, and GoogLeNet on CIFAR-10 (lower accuracies indicate better attack).
Figure 4: Disturbance values at each layer for ILAP targeted at layer $l$ of ResNet18. The $l$ in the legend refers to the hyperparameter set in the ILAP attack; the disturbance values were then computed at the layers indicated on the x-axis. Note that the last peak is produced by the ILAP attack.
              TAP [33]               DI-FGSM [31]
Transfer      20 Itr    Opt ILAP     20 Itr    Opt ILAP
Inc-v4        36.3%     15.2%        50.2%     26.7%
IncRes-v2     40.7%     20.1%        54.6%     29.3%

Table 1. Same setup as the experiment in Table 2, but with TAP and DI-FGSM from Xie et al. [31] as the baseline attacks. Evaluation is performed on 5000 randomly selected ImageNet validation set images. The source model used is Inc-v3 and the target layer specified for ILAP is Conv2d_4a_3x3.
Table 1: ILAP vs. State-of-the-art Transfer Attacks

4.2 ILAP with a Pre-Determined Value of $l$

Above we demonstrated that adversarial examples produced by ILAP exhibit the strongest transferability when targeting a specific layer (i.e., a particular choice of the hyperparameter $l$). We wish to pre-determine this optimal value of $l$ based on the source model alone, so as to avoid tuning the hyperparameter by evaluating on other models. To do this, we examine the relationship between transferability and the ILAP layer disturbance values for a given ILAP attack. We define the disturbance values of an ILAP attack perturbation as the values of $\|F_l(x'') - F_l(x)\|_2$ for all layers $l$ in the source model. For each value of $l$ in ResNet18 (the set of layers is defined for each architecture in Appendix A) we plot the disturbance values of the corresponding ILAP attack in Figure 4. The same figure is given for other models in Appendix B.
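
A sketch of how these per-layer disturbance values could be computed with the `layer_output` helper from Section 3 (our code; the layer list is assumed to be given):

```python
import torch

@torch.no_grad()
def disturbance_values(model, layers, x, x_ila):
    """Mean L2 norm of the feature-map difference at each candidate layer."""
    values = []
    for layer in layers:
        delta = layer_output(model, layer, x_ila) - layer_output(model, layer, x)
        values.append(delta.flatten(start_dim=1).norm(dim=1).mean().item())
    return values   # plotted against the layer index, as in Figure 4
```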

We notice that the adversarial examples that produce the latest peak in the graph are typically the ones that have the highest transferability for all transferred models (Table 2). Given this observation, we propose that the latest $l$ that still exhibits a peak is a nearly optimal value of $l$ (in terms of maximizing transferability). For example, according to Figure 4, we would choose $l = 4$ as the expected optimal hyperparameter for ILAP with ResNet18 as the source model. Table 2 supports our claim and shows that selecting this layer gives an optimal or near-optimal attack. We discuss our discovered explanatory insights for this method in Section 5.3.

                             MI-FGSM                                  C&W
Source        Transfer       20 Itr    10 Itr ILAP    Opt ILAP        1000 Itr    500 Itr ILAP    Opt ILAP
ResNet18      ResNet18*      5.7%      11.3%          2.3% (6)        7.3%        5.2%            2.1% (5)
              SENet18        33.8%     30.6%          30.6% (4)       85.4%       41.7%           41.7% (4)
              DenseNet121    35.1%     30.4%          30.4% (4)       84.4%       41.7%           41.7% (4)
              GoogLeNet      45.1%     37.7%          37.7% (4)       90.6%       57.3%           57.3% (4)
SENet18       ResNet18       31.0%     27.5%          27.5% (4)       87.5%       42.7%           42.7% (4)
              SENet18*       3.3%      10.0%          2.6% (6)        6.2%        7.3%            3.1% (5)
              DenseNet121    31.6%     27.3%          27.3% (4)       88.5%       38.5%           38.5% (4)
              GoogLeNet      41.1%     34.8%          34.8% (4)       91.7%       52.1%           52.1% (4)
DenseNet121   ResNet18       34.4%     28.1%          28.1% (6)       87.5%       37.5%           37.5% (6)
              SENet18        33.5%     27.7%          27.7% (6)       86.5%       34.4%           34.4% (6)
              DenseNet121*   6.4%      4.0%           0.8% (9)        2.1%        0.0%            0.0% (9)
              GoogLeNet      36.3%     30.3%          30.3% (6)       90.6%       45.8%           45.8% (6)
GoogLeNet     ResNet18       44.6%     34.5%          33.2% (3)       89.6%       63.5%           60.4% (7)
              SENet18        43.0%     33.5%          32.6% (3)       90.6%       53.1%           53.1% (9)
              DenseNet121    38.9%     29.2%          28.8% (3)       89.6%       58.3%           51.0% (8)
              GoogLeNet*     1.5%      1.4%           0.5% (11)       4.2%        0.0%            0.0% (12)
* Same model as the source model.

Table 2. Accuracies after attack are shown for the models (lower accuracies indicate better attack). The hyperparameter $l$ in the ILAP attack is fixed for each source model as decided by the layer disturbance graphs (e.g., setting $l = 4$ for ResNet18, since that is the last peak in Figure 4). “Opt ILAP” refers to a 10-iteration ILAP that chooses the optimal layer (determined by evaluating on the transfer models); the chosen layer is given in parentheses. Perhaps surprisingly, ILAP beats the baseline attack on the original model as well.
Table 2: ILAP Results

4.3 ILAF vs. ILAP

We show that ILAF can further improve transferability with the additional tunable hyperparameter $\alpha$. The best ILAF result for each model improves over ILAP, as shown in Table 3. However, note that the optimal $\alpha$ differs for each model and requires substantial hyperparameter tuning to outperform ILAP. Thus, ILAF can be seen as a more model-specific version that requires more tuning, whereas ILAP works well out of the box. Full results are in Appendix C.

Model          ILAP (best)    ILAF (best)
DenseNet121    27.7%          26.6%
GoogLeNet      35.8%          34.7%
SENet18        27.5%          26.3%
Table 3. Transfer performance of the best ILAP and the best ILAF attacks generated using ResNet18 as the source model (with optimal hyperparameters for both attacks).
Table 3: ILAP vs. ILAF

4.4 ILAP on ImageNet

We also tested ILAP on ImageNet, with ResNet18, DenseNet121, SqueezeNet, and AlexNet pretrained on ImageNet (as provided in [19]). The learning rates for I-FGSM, ILAP with I-FGSM, MI-FGSM, and ILAP with MI-FGSM are each tuned for best performance. To evaluate transferability, we tested the accuracies of the different models on adversarial examples generated from all ImageNet test images. We observe that ILAP improves over I-FGSM and MI-FGSM on ImageNet. Results for ResNet18 as the source model and I-FGSM as the baseline attack are shown in Figure 5. Full results are in Appendix D.

In order to show that our approach outperforms pre-existing methods, we tested ILAP against both TAP [33] (code was not made available for this paper, so we reproduced their method to the best of our ability) and Xie et al. [31] (pretrained ImageNet models for Inc-v3, Inc-v4, and IncRes-v2 were obtained from Cadene's GitHub repo [3]) in an ImageNet setting. The results are shown in Table 1. (Results indicating that ILAP is competitive with TAP on CIFAR-10 are in Appendix H.)

Figure 5: Transfer results of ILAP against I-FGSM on ResNet18 as measured by DenseNet121, SqueezeNet, and AlexNet on ImageNet (lower accuracies indicate better attack).

5 Explaining the Effectiveness of Intermediate Layer Emphasis

At a high level, we motivated projection in an intermediate feature map as a way to increase transferability. We saw empirically that it is desirable to target the layer corresponding to the latest peak (see Figure 4) on the source model in order to maximize transferability. In this section, we attempt to explain the factors causing ILAP performance to vary across layers, as well as what they suggest about the optimal layer for ILAP. As we iterate through layer indices, there are two factors affecting our performance: the angle between the original perturbation direction and the best transfer direction (defined below in Section 5.1), and the linearity of the model's decision boundary.

Below, we discuss how the factors change across layers and affect transferability of our attack.

5.1 Angle between the Best Transfer Direction and the Original Perturbation

Motivated by [17] (where it is shown that the decision boundaries of models with different architectures often align), we define the Best Transfer Direction (BTD):

Best Transfer Direction: Let $x$ be an image and $\mathcal{H}$ be a large (but finite) set of distinct CNNs. Find

$$x^* = \arg\max_{x' :\, \|x' - x\|_\infty \le \epsilon} \big|\{h \in \mathcal{H} : h(x') \ne h(x)\}\big|$$

Then the Best Transfer Direction of $x$ is $\mathrm{BTD}(x) = x^* - x$.

Since our method uses the original perturbation as an approximation for the BTD, it is intuitive that the better this approximation is in the current feature representation, the better our attack will perform.

We want to investigate how well a chosen source-model attack, such as I-FGSM, aligns with the BTD across layers. Here we measure alignment between an I-FGSM perturbation and an empirical estimate of the BTD (a multi-fool perturbation of the four models we evaluate on in the CIFAR-10 setting) using the angle between them. We investigate the alignment between the feature map outputs of the I-FGSM perturbation and the BTD at each layer. As shown in Figure 6, the angle between the perturbation of I-FGSM and that of the BTD decreases as the layer index increases. Therefore, the later the target layer is in the source model, the better it is to use I-FGSM's attack direction as a guide. This is a factor increasing transfer attack success rate as the layer index increases.
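
A sketch of this per-layer angle measurement (our code, reusing the `layer_output` helper from Section 3; the multi-fool example standing in for the BTD is assumed to be precomputed):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def angle_at_layer(model, layer, x, x_ifgsm, x_multifool):
    """Mean angle (degrees) between the I-FGSM and multi-fool disturbances at one layer."""
    feat_x = layer_output(model, layer, x).flatten(start_dim=1)
    d_attack = layer_output(model, layer, x_ifgsm).flatten(start_dim=1) - feat_x
    d_btd = layer_output(model, layer, x_multifool).flatten(start_dim=1) - feat_x
    cos = F.cosine_similarity(d_attack, d_btd, dim=1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean().item()
```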

To test our hypothesis, we propose to eliminate this source of variation in performance by using a multi-fool perturbation as the starting perturbation for ILAP, which is a better approximation for the BTD. As shown in Figure 7, ILAP performs substantially better when using a multi-fool perturbation as a guide rather than an I-FGSM perturbation, thus confirming that using a better approximation of the BTD gives better performance for ILAP. In addition, we see that these results correspond with what we would expect from Figure 6. In the earlier layers, I-FGSM is a worse approximation of the BTD, so passing in a multi-fool perturbation improves performance significantly. In the later layers, I-FGSM is a much better approximation of the BTD, and we see that passing in a multi-fool perturbation does not increase performance much.

Figure 6: As shown in the above figure, in terms of angle, I-FGSM produces a better approximation for the estimated best transfer direction as we increase the layer index.
Figure 7: Here we show that ILAP with a better approximation for BTD (multi-fool) performs better. In addition, using a better approximation for BTD disproportionately improves the earlier layers’ performance.

5.2 Linearity of Decision Boundary

If we view I-FGSM as optimizing to cross the decision boundary, we can interpret ILAP as optimizing to cross the decision boundary approximated by a hyperplane perpendicular to the I-FGSM perturbation. As the layer index increases, the function from the feature space to the final output of the source model tends to become increasingly linear (there are more nonlinearities between earlier layers and the final layer than there are between a later layer and the final layer). In fact, we note that at the final layer, the decision boundary is completely linear. Thus, our linear approximation of the decision boundary becoming more accurate is one factor improving ILAP performance as we select later layers.

We define the “true decision boundary” as a majority-vote ensemble of a large number of CNNs. Note that for transfer, we care less about how well we are approximating the source model decision boundary than we do about how well we are approximating the true decision boundary. In most feature representations we expect that the true decision boundary is more linear, as ensembling reduces variance. However, note that at least in the final layer, by virtue of the source model decision boundary being exactly linear, the true decision boundary cannot be more linear, and is likely to be less linear.

We hypothesize that this flip is what causes us to perform worse in the final layers. In these layers, the source model decision boundary is more linear than the true decision boundary, so our approximation performs poorly. We test this hypothesis by attacking two variants of ResNet18 augmented with 3 linear layers before the last layer: one variant without activations following the added layers (var1) and one with (var2). As shown in Figure 8, ILAP performance decreases less in the second variant. Also note that these nonlinearities also cause worse ILAP performance earlier in the network.
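
A sketch of the two augmented variants (our reading of the setup; the width, depth, and attachment point of the extra layers are assumptions, not the authors' exact configuration):

```python
import torch.nn as nn

def extra_head(width=512, num_classes=10, with_relu=False):
    """Three added fully connected layers, optionally followed by ReLU nonlinearities."""
    act = nn.ReLU if with_relu else nn.Identity
    return nn.Sequential(
        nn.Linear(width, width), act(),
        nn.Linear(width, width), act(),
        nn.Linear(width, width), act(),
        nn.Linear(width, num_classes),
    )

# var1: extra linear layers with no activations (keeps the tail of the network linear).
# var2: the same layers followed by ReLUs (adds nonlinearity to the tail).
# resnet18_var1.linear = extra_head(with_relu=False)   # placeholder attribute name
# resnet18_var2.linear = extra_head(with_relu=True)
```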

Thus, we conclude that the extreme linearity of the last several layers is associated with ILAP performing poorly.

Figure 8: When there is more nonlinearity present in the later portion of the network, the performance of ILAP does not deteriorate as rapidly. Variant 1 (var1) is the version of ResNet18 with additional linear layers not followed by activations, while Variant 2 (var2) does have activations.

5.3 Explanation of the Main Result

In this section, we tie together all of the above factors to explain the optimal intermediate layer for transferability. Denote:

  • the decreasing angle difference between I-FGSM’s and BTD’s perturbation direction as Factor 1

  • the increasing linearity with respect to the decision boundary as we increase layer index as Factor 2, and

  • the excessive linearity of the source model decision boundary as Factor 3

On the transfer models, as the index of the attacked source model layer increases, Factors 1 and 2 increase attack rate, while Factor 3 decreases the attack rate. Thus, before some layer, Factors 1 and 2 cause transferability to increase as layer index increases; however, afterward, Factor 3 wins out and causes transferability to decrease as the layer index increases. Thus the layer right before the point where this switch happens is the layer that is optimal for transferability.

We note that this explanation would also justify the method presented in Section 4.2. Intuitively, having a peak corresponds with having the linearized decision boundary (from using projection as the objective) be very different from the source model’s decision boundary. If this were not the case, then I-FGSM would presumably have found this improved perturbation already. As such, choosing the last layer that we can get a peak at corresponds with both having as linear of a decision boundary as possible (as late of a layer as possible) while still having enough room to move (the peak).

On the source model, since there is no notion of a “transfer” attack, Factor 3 and Factor 1 do not have any effect. Therefore, Factor 2 causes the performance of the later layers to improve, so much so that at the final layer ILAP’s performance on the source model is actually equal or better on all the attacks we used as baselines (see Figure 3). We hypothesize the improved performance on the source model is the result of a simpler loss and thus an easier to optimize loss landscape.

6 Conclusion

We introduce a novel attack, coined ILA, that aims to enhance the transferability of any given adversarial example. It is a framework with the goal of enhancing transferability by increasing projection onto the Best Transfer Direction. Within this framework, we propose two variants, ILAP and ILAF, and analyze their performance. We demonstrate that there exist specific intermediate layers that we can target with ILA to substantially increase transferability with respect to the attack baselines. In addition, we show that a near-optimal target layer can be selected without any knowledge of transfer performance. Finally, we provide some intuition regarding ILA’s performance and why it performs differently in different feature spaces.

Potential future works include making use of the interactions between ILA and existing adversarial attacks to explain differences among existing attacks, as well as extending ILA to perturbations produced for different settings (universal or targeted perturbations). In addition, other methods of attacking intermediate feature spaces could be explored, taking advantage of the properties we explored in this paper.

Acknowledgements

We want to thank Pian Pawakapan, Prof. Kavita Bala, and Prof. Bharath Hariharan for helpful discussions. This work is supported in part by a Facebook equipment donation.

References

  • [1] A. Athalye, N. Carlini, and D. A. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In ICML, Cited by: §1.
  • [2] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer (2017) Adversarial patch. CoRR abs/1712.09665. Cited by: §1.
  • [3] R. Cadene (2019) Pretrained-models.pytorch. GitHub. Note: https://github.com/Cadene/pretrained-models.pytorch Cited by: footnote 6.
  • [4] N. Carlini and D. A. Wagner (2017) Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §4.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.
  • [6] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2017) Boosting adversarial attacks with momentum. Cited by: §2.1, §4.
  • [7] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2017) Robust physical-world attacks on deep learning models. Cited by: §1.
  • [8] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. CoRR abs/1412.6572. Cited by: §1, §1, §2.1, §2.1.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4, §7.1.
  • [10] J. Hu, L. Shen, and G. Sun (2017) Squeeze-and-excitation networks. CoRR abs/1709.01507. Cited by: §4, §7.1.
  • [11] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. Cited by: §4, §7.1.
  • [12] N. Inkawhich, W. Wen, H. H. Li, and Y. Chen (2019) Feature space perturbations yield more transferable adversarial examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7066–7074. Cited by: §2.2.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
  • [14] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §4, §7.8.
  • [15] A. Kurakin, I. J. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. CoRR abs/1607.02533. Cited by: §1, §2.1.
  • [16] K. Liu (2018) PyTorch cifar10. GitHub. Note: https://github.com/kuangliu/pytorch-cifar Cited by: §4, §7.1.
  • [17] Y. Liu, X. Chen, C. Liu, and D. X. Song (2016) Delving into transferable adversarial examples and black-box attacks. CoRR abs/1611.02770. Cited by: §1, §2.1, §5.1.
  • [18] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. CoRR abs/1706.06083. Cited by: §1, §2.1.
  • [19] S. Marcel and Y. Rodriguez (2010) Torchvision the machine-vision package of torch. In ACM Multimedia, Cited by: §4.4, §7.4.
  • [20] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard (2017) Universal adversarial perturbations. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 86–94. Cited by: §2.1.
  • [21] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582. Cited by: §2.1, §2.1.
  • [22] K. R. Mopuri, A. Ganeshan, and R. V. Babu (2018) Generalizable data-free objective for crafting universal adversarial perturbations. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.2.
  • [23] N. Papernot, P. D. McDaniel, I. J. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In AsiaCCS, Cited by: §1.
  • [24] A. Rozsa, M. Günther, and T. E. Boult (2017) LOTS about attacking deep features. In International Joint Conference on Biometrics (IJCB), Cited by: §2.2.
  • [25] S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet (2015) Adversarial manipulation of deep representations. CoRR abs/1511.05122. Cited by: §2.2, §2.2.
  • [26] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter (2017) Adversarial generative nets: neural network attacks on state-of-the-art face recognition. CoRR abs/1801.00349. Cited by: §1.
  • [27] A. Sinha, H. Namkoong, and J. C. Duchi (2017) Certifying some distributional robustness with principled adversarial training. Cited by: §1.
  • [28] J. Su, D. V. Vargas, and K. Sakurai (2017) One pixel attack for fooling deep neural networks. CoRR abs/1710.08864. Cited by: §1.
  • [29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9. Cited by: §4, §7.1.
  • [30] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. CoRR abs/1312.6199. Cited by: §1.
  • [31] C. Xie, Z. Zhang, Y. Zhou, S. Bai, J. Wang, Z. Ren, and A. L. Yuille (2019-06) Improving transferability of adversarial examples with input diversity. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.4, Table 1, §4.
  • [32] X. Yuan, P. He, Q. Zhu, R. R. Bhat, and X. Li (2017) Adversarial examples: attacks and defenses for deep learning. CoRR abs/1712.07107. Cited by: §2.2, §2.2.
  • [33] W. Zhou, X. Hou, Y. Chen, M. Tang, X. Huang, X. Gan, and Y. Yang (2018) Transferable adversarial perturbations. In ECCV, Cited by: §2.2, §4.4, Table 1, §4, 8th item, §7.8, §7.8, Table 8.

7 Appendix

The following are provided in the supplementary material for this paper:

  • A more thorough description of the networks used, the layers selected for attack, and full results on other attacks tested

  • Complete disturbance graphs across layers for ILAP with I-FGSM as the reference attack

  • A more complete result showing the comparison between ILAP and ILAF

  • Full results for ILAP’s performance on ImageNet

  • Visualization of decision boundary

  • Results for different $\epsilon$ (norm bound) values

  • Results for ablating the learning rate used in ILAP

  • Results comparing ILAP to TAP [33] on CIFAR-10

7.1 ILAP Network Overview and Results for Other Base Attacks

As shown in the main paper, we tested ILAP against MI-FGSM, C&W, and TAP. We also tested I-FGSM, DeepFool, and FGSM. We test on a variety of models, namely ResNet18 [9], SENet18 [10], DenseNet121 [11], and GoogLeNet [29] trained on CIFAR-10. For each source model, each large block output in the source model, and each attack $A$, we generate adversarial examples for all images in the test set using $A$ with 20 iterations as a baseline. We then generate adversarial examples using $A$ with 10 iterations as input to ILA, which then runs for 10 iterations. The learning rates are set separately for I-FGSM, I-FGSM with momentum, and ILAP. We are in the $\ell_\infty$ norm setting, with the same bound $\epsilon$ for all attacks. We then evaluate the transferability of the baseline and ILA adversarial examples over the other models by testing their accuracies, as shown in Figure 11.
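
The transfer evaluation itself amounts to measuring a target model's accuracy on the pre-generated adversarial examples (a sketch; the data loader of adversarial example/label pairs is assumed):

```python
import torch

@torch.no_grad()
def transfer_accuracy(target_model, adv_loader):
    """Accuracy of a target model on adversarial examples (lower means a stronger attack)."""
    correct, total = 0, 0
    for x_adv, y in adv_loader:
        pred = target_model(x_adv).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```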

Below is the list of layers (models from [16]) we picked for each source model; layers are indexed starting from 0 in the experiment results:

  • ResNet18: conv, bn, layer1, layer2, layer3, layer4, linear (layer1-4 are basic blocks)

  • GoogLeNet: pre_layers, a3, b3, maxpool, a4, b4, c4, d4, e4, a5, b5, avgpool, linear

  • DenseNet121: conv1, dense1, trans1, dense2, trans2, dense3, trans3, dense4, bn, linear

  • SENet18: conv1, bn1, layer1, layer2, layer3, layer4, linear (layer1-4 are pre-activation blocks)

Additional results for the I-FGSM, FGSM, and DeepFool attacks are given in Tables 4 and 5. Note that the output of DeepFool is clipped to satisfy our $\epsilon$-ball constraint.

                             I-FGSM                                   DeepFool
Source        Transfer       20 Itr    10 Itr ILAP    Opt ILAP        50 Itr    25 Itr ILAP    Opt ILAP
ResNet18      ResNet18*      3.3%      7.6%           1.8% (5)        48.7%     12.9%          5.4% (5)
              SENet18        44.4%     27.5%          27.5% (4)       87.4%     43.7%          43.7% (4)
              DenseNet121    45.8%     27.7%          27.7% (4)       89.1%     43.8%          43.8% (4)
              GoogLeNet      58.6%     35.8%          35.8% (4)       89.3%     50.7%          50.7% (4)
SENet18       ResNet18       36.8%     25.8%          25.8% (4)       91.9%     40.3%          39.9% (5)
              SENet18*       2.4%      7.9%           2.3% (6)        56.8%     11.4%          5.1% (6)
              DenseNet121    38.0%     25.9%          25.9% (4)       92.9%     41.3%          41.1% (5)
              GoogLeNet      48.4%     33.7%          33.7% (4)       92.3%     48.7%          48.7% (4)
DenseNet121   ResNet18       45.1%     26.7%          26.7% (6)       81.6%     30.1%          30.1% (6)
              SENet18        43.4%     26.1%          26.1% (6)       81.5%     29.0%          28.9% (7)
              DenseNet121*   2.6%      1.7%           0.8% (9)        34.9%     4.1%           3.3% (9)
              GoogLeNet      47.3%     28.6%          28.6% (6)       82.3%     32.4%          32.4% (6)
GoogLeNet     ResNet18       55.9%     34.0%          32.7% (3)       92.3%     44.0%          44.0% (9)
              SENet18        55.6%     33.1%          31.8% (3)       92.1%     42.9%          42.9% (9)
              DenseNet121    48.9%     28.7%          28.1% (3)       93.1%     38.1%          38.1% (9)
              GoogLeNet*     0.9%      0.8%           0.4% (11)       51.5%     4.2%           3.9% (11)
* Same model as the source model.
Table 4. Accuracies after attack using ILAP based on I-FGSM and DeepFool. Note that although significant improvement for transfer is exhibited for DeepFool, the original attack transfer rates are quite poor (the accuracies are still quite high after a DeepFool transfer attack).
Table 4: ILAP vs. I-FGSM and DeepFool Results
                             FGSM
Source        Transfer       20 Itr    10 Itr ILAP    Opt ILAP
ResNet18      ResNet18*      47.7%     2.0%           2.0% (6)
              SENet18        63.6%     42.6%          42.6% (6)
              DenseNet121    64.9%     44.6%          44.5% (5)
              GoogLeNet      66.5%     55.5%          54.2% (4)
SENet18       ResNet18       60.7%     37.4%          36.1% (5)
              SENet18*       40.7%     3.0%           3.0% (6)
              DenseNet121    61.8%     37.0%          36.3% (5)
              GoogLeNet      63.8%     46.3%          45.3% (5)
DenseNet121   ResNet18       65.0%     36.4%          36.2% (6)
              SENet18        65.0%     35.5%          35.5% (7)
              DenseNet121*   47.3%     5.8%           0.9% (9)
              GoogLeNet      64.6%     37.6%          37.4% (6)
GoogLeNet     ResNet18       64.9%     43.5%          43.5% (9)
              SENet18        65.1%     43.8%          43.8% (9)
              DenseNet121    63.7%     39.7%          39.7% (9)
              GoogLeNet*     36.6%     5.9%           0.6% (12)
* Same model as the source model.

Table 5. Accuracies after attack based on FGSM. Note that significant improvement occurs in the ILAP settings.
Table 5: ILAP vs. FGSM Results
Figure 9: Visualizations for ILAP against I-FGSM and MI-FGSM baselines on CIFAR-10
Figure 10: Visualizations for ILAP against DeepFool and FGSM with momentum baselines on CIFAR-10
Figure 11: Visualizations for ILAP against DeepFool and FGSM with momentum baselines on CIFAR-10

7.2 Disturbance graphs

In this experiment, we used the same setting as our main experiment in Appendix 7.1 to generate adversarial examples, with only I-FGSM used as the reference attack. The average disturbance of each set of adversarial examples is calculated at each layer. We repeated the experiment for all four models described in Appendix 7.1, as shown in Figure 12. The $l$ in the legend refers to the hyperparameter set in the ILA attack; the disturbance values were then computed at the layers indicated on the x-axis.

Figure 12: Disturbance graphs of ILAP with I-FGSM as reference

7.3 ILAP vs ILAF Full Result

As described in the main paper, we compared the performance of ILAP and ILAF over a range of $\alpha$ values. We used the same setting as our main experiment in Appendix 7.1 for ILAP and ILAF to generate adversarial examples, with only I-FGSM used as the reference attack. The results are shown in Figure 13.

Figure 13: ILAP vs. ILAF comparisons

7.4 ILAP on ImageNet Full Result

We tested ILAP against I-FGSM and I-FGSM with momentum on ImageNet, similarly to the experiment on CIFAR-10. The models we used are ResNet18, DenseNet121, SqueezeNet1.0, and AlexNet. The learning rates are set separately for I-FGSM, ILAP plus I-FGSM, I-FGSM with momentum, and ILAP plus I-FGSM with momentum. To evaluate transferability, we test the accuracies of the different models on adversarial examples generated from all ImageNet test images, as shown in Figure 14.

Below is the list of layers (models from [19]) we picked for each source model:

  • ResNet18: conv1, bn1, layer1, layer2, layer3, layer4, fc

  • DenseNet121: conv0, denseblock1, transition1, denseblock2, transition2, denseblock3, transition3, denseblock4, norm5, classifier

  • SqueezeNet1.0: Features: 0 3 4 5 7 8 9 10 12, classifier

  • AlexNet: Features: 0 3 4 6 8 10, classifiers: 1 4

Figure 14: Visualizations for ILAP against I-FGSM and I-FGSM with momentum baselines on ImageNet

7.5 Visualization of the Decision Boundary

To gain some understanding of how ILA interacts with the decision boundaries, we visualize the two-dimensional plane spanned by the initial I-FGSM perturbation and the ILA perturbation for some examples. Visualization is done on ResNet18 with layer 4 as the ILA target and I-FGSM as the starting perturbation. See Figure 15.

Figure 15: Visualization of the decision boundary relative to the two adversarial examples generated. Yellow is the correct label's decision space, red is the incorrect label's decision space. The purple dot is the original image's location, the green circle is the I-FGSM perturbation, and the green diamond is the ILA perturbation. Note that for the above, the vector between the purple dot and the green diamond appears more orthogonal to the decision boundary than the vector between the purple dot and the green circle (roughly indicating that ILA is working as intended in producing a more orthogonal transfer vector).

7.6 Fooling with Different $\epsilon$ Values

In this experiment, we use ILAP to generate adversarial examples with an I-FGSM baseline attack on ResNet18 for several values of $\epsilon$, while other settings are kept the same as in Section 7.1. We then evaluated transferability against the I-FGSM baseline on the adversarial examples of the whole test set, as shown in Figure 16.

Figure 16: Transferability graphs for different epsilons

7.7 Learning Rate Ablation

We set the number of iterations to 20 for both I-FGSM and I-FGSM with momentum and experimented with different learning rates on ResNet18. We then evaluate different models' accuracies on the generated adversarial examples, as shown in Tables 6 and 7.

Learning rate    ResNet18*    SENet18    DenseNet121    GoogLeNet
0.002            3.3%         44.9%      47.1%          59.3%
0.008            0.8%         45.6%      46.8%          60.0%
0.014            0.6%         47.2%      49.4%          59.5%
0.02             1.3%         46.8%      51.4%          59.8%
* Same model as the source model.
Table 6: Learning rate ablation for I-FGSM

Learning rate    ResNet18*    SENet18    DenseNet121    GoogLeNet
0.002            5.9%         35.0%      36.6%          46.1%
0.008            0.6%         43.0%      43.8%          56.1%
0.014            0.4%         43.6%      45.2%          55.9%
0.02             0.4%         44.1%      46.4%          57.2%
Table 7: Learning rate ablation for I-FGSM with Momentum

7.8 Comparison to TAP [33]

CIFAR-10 [14] results comparing a 20 iteration TAP [33] baseline to 10 iterations of ILAP using the output of a 10 iteration TAP attack are shown in Table 8.

                             TAP [33]
Source        Transfer       20 Itr    Opt ILAP
ResNet18      ResNet18*      6.2%      1.9% (6)
              SENet18        31.6%     28.4% (4)
              DenseNet121    32.7%     28.5% (4)
              GoogLeNet      41.6%     36.8% (4)
SENet18       ResNet18       31.4%     23.5% (4)
              SENet18*       2.0%      1.7% (5)
              DenseNet121    31.3%     24.1% (4)
              GoogLeNet      41.5%     33.1% (4)
DenseNet121   ResNet18       35.2%     27.4% (6)
              SENet18        34.2%     26.8% (7)
              DenseNet121*   4.8%      1.0% (9)
              GoogLeNet      37.8%     29.8% (6)
GoogLeNet     ResNet18       37.1%     33.6% (9)
              SENet18        36.5%     32.9% (9)
              DenseNet121    32.6%     28.1% (9)
              GoogLeNet*     1.3%      0.4% (12)
* Same model as the source model.

Table 8. Same as the experiment in Table 2 of the main paper, but with TAP as the baseline attack.
Table 8: ILAP vs. TAP Results