###### Abstract

Recent breakthroughs in defenses against adversarial examples, like adversarial training, make neural networks robust against various classes of attackers (e.g., first-order gradient-based attacks). However, it is an open question whether adversarially trained networks are truly robust under unknown attacks. In this paper, we present interval attacks, a new technique to find adversarial examples to evaluate the robustness of neural networks. Interval attacks leverage symbolic interval propagation, a bound propagation technique that can exploit a broader view around the current input to locate promising areas containing adversarial instances, which in turn can be searched with existing gradient-guided attacks. We obtain such a broader view using sound bound propagation methods to track and over-approximate the errors of the network within given input ranges. Our results show that, on state-of-the-art adversarially trained networks, the interval attack finds on average 47% more violations, in relative terms, than the state-of-the-art gradient-guided PGD attack.

Enhancing Gradient-based Attacks with Symbolic Intervals

Shiqi Wang
Yizheng Chen
Ahmed Abdou
Suman Jana

Correspondence to: Shiqi Wang <tcwangshiqi@cs.columbia.edu>.

Presented at the ICML 2019 Workshop on Security and Privacy of Machine Learning. Copyright 2019 by the author(s).

Deep learning systems have achieved strong performance at large scale. However, attackers can easily locate adversarial examples that are perceptually indistinguishable from the original images yet misclassified by state-of-the-art deep learning models. Making Neural Networks (NNs) robust against adversarial inputs has resulted in an arms race between new defenses and the attacks that break them. Recent breakthroughs in defenses can make NNs robust against various classes of attackers, including the state-of-the-art Projected Gradient Descent (PGD) attack. Madry et al. argued, based on preliminary tests, that PGD attacks are the “ultimate” first-order attacks (Madry et al., 2018a). They adversarially retrain the networks with violations found by PGD to obtain robustness, a procedure called adversarial training. Their conclusion is that training models with PGD attacks yields robustness against all first-order gradient attacks.

Nonetheless, adversarially trained networks might not be truly robust against unknown attacks even if they are robust against state-of-the-art attacks like PGD. Since neural networks are highly non-convex, there is a high chance that existing attacks will miss many adversarial examples.

The main challenge behind locating adversarial examples using gradients as guidance is that the search process can get stuck at local optima. Therefore, we want a first-order gradient that offers a broader view of the surrounding area, potentially guiding us towards the worst-case behavior of neural networks.

Our critical insight is that we can obtain such a broader view from existing sound over-approximation methods. Soundness means that the estimated output ranges are guaranteed to always over-approximate the ground-truth ranges (i.e., they never miss any adversarial examples but might introduce false positives). For convenience, we call these methods sound bound propagation. Recently, they have been successfully used for verifiable robust training (Wong et al., 2018; Mirman et al., 2018; Wang et al., 2018a; Gowal et al., 2018). At each training step, the weights of the network are updated with the gradients of the verifiable robust loss that sound propagation methods provide over given input ranges. We borrow a similar idea for our attacks by proposing a generic way to extract and apply the gradient from the output provided by these methods.
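To illustrate what soundness means in practice, the following minimal sketch propagates plain (non-symbolic) intervals through a tiny ReLU network and checks that sampled outputs never leave the computed range. The network weights, the input point, and the function names are hypothetical, and naive interval arithmetic is used here only because it is the simplest sound method; it is not the symbolic interval analysis used in the rest of the paper.

```python
import random

# Hypothetical 2-2-1 ReLU network; the weights are arbitrary illustrative values.
W1 = [[1.0, -1.0], [0.5, 1.0]]
B1 = [0.0, -0.7]
W2 = [2.0, -1.0]
B2 = 0.3

def forward(x):
    """Concrete forward pass for a single input point."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, B1)]
    return sum(w * hi for w, hi in zip(W2, h)) + B2

def affine_bounds(weights, bias, lo, hi):
    """Sound bounds of w.x + b when each x[i] lies in [lo[i], hi[i]]."""
    low = bias + sum(w * (l if w >= 0 else u) for w, l, u in zip(weights, lo, hi))
    high = bias + sum(w * (u if w >= 0 else l) for w, l, u in zip(weights, lo, hi))
    return low, high

def output_bounds(lo, hi):
    """Propagate the input box [lo, hi] layer by layer."""
    hidden = [affine_bounds(row, b, lo, hi) for row, b in zip(W1, B1)]
    h_lo = [max(0.0, l) for l, _ in hidden]   # ReLU is monotone, so it can be
    h_hi = [max(0.0, u) for _, u in hidden]   # applied to both end points.
    return affine_bounds(W2, B2, h_lo, h_hi)

x0, eps = [0.5, 0.5], 0.1
lo = [v - eps for v in x0]
hi = [v + eps for v in x0]
y_lo, y_hi = output_bounds(lo, hi)

# Every concrete output inside the input box must fall inside [y_lo, y_hi].
rng = random.Random(0)
samples = [forward([rng.uniform(l, u) for l, u in zip(lo, hi)])
           for _ in range(1000)]
all_inside = all(y_lo - 1e-9 <= y <= y_hi + 1e-9 for y in samples)
```

If a property such as "the output stays positive" holds for the whole range `[y_lo, y_hi]`, it holds for every input in the box; if it fails for the range, the failure may still be a false positive caused by over-approximation.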

Essentially, sound propagation methods relax the input range into various abstract domains (e.g., zonotope (Gehr et al., 2018), convex polytope (Wong & Kolter, 2018), and symbolic interval (Wang et al., 2018b; c)) and propagate them layer by layer until we have an over-approximated output abstract domain. A gradient that encodes a broader view of the surrounding area can be obtained from such abstract domains of the outputs, and it can guide us toward a promising sub-area. In Figure 1, we show the symbolic interval and its interval gradient as an example to illustrate their effect on attack performance.

In this paper, we describe such a novel gradient-based attack, the interval attack, which is the first attack framework to combine sound bound propagation methods with existing gradient-based attacks. Here, we focus specifically on symbolic interval analysis as our sound bound propagation method. However, our attack is generic: it can be adapted to leverage any other sound bound propagation method and its corresponding gradient values. The interval attack contains two main steps: it (1) uses the interval gradient to locate promising starting points where the surrounding area has the worst-case behavior indicated by the sound propagation method, and (2) uses strong gradient-based attacks to accurately locate the optima within the surrounding area. We adopt this design because sound bound propagation introduces overestimation error, which prevents convergence to an exact optimum when relying on the interval gradient alone. Specifically, this overestimation error is proportional to the width of the input range (e.g., in Figure 2). To achieve the best performance by encoding the broader view of the surrounding area, we design a strategy (see the interval attack details) to dynamically adjust the range used for each step of the interval attack.

We implement our attack with symbolic interval analysis and evaluate it on three MNIST models that are trained with the state-of-the-art defense, adversarial training (Madry et al., 2018a). The interval attack is able to locate on average 47% more adversarial examples, in relative terms, than PGD attacks. We also evaluate our attack on the model from the MadryLab MNIST Challenge (Madry et al., 2018b), which is considered to be the most robust model on the MNIST dataset so far. On this model, the interval attack achieves the best result among all existing attacks. Code of the interval attack is available at: https://github.com/tcwangshiqi-columbia/Interval-Attack.

In this section, we first motivate the need for stronger attacks by analyzing the loss distribution found by PGD attacks with different random starting points. Then we dive into the definition of interval gradients and the details of the interval attack procedure.

Why do we need stronger attacks? The PGD attack is so far a well-known and standard method to measure the robustness of trained NNs. Madry et al. even argued that PGD attacks are the “ultimate” first-order attacks, i.e., no other first-order attack will be able to significantly improve over the solutions found by PGD (Madry et al., 2018a). To support this statement, Madry et al. ran the PGD attack repeatedly from random starting points within bounded $\ell_\infty$-balls of test inputs and showed that all the solutions found by these runs are distinct local optima with similar loss values. Therefore, they concluded that these distinct local optima found by PGD attacks are very close to the best solution that can be found by any first-order attacker.

In Figure 2, we show this assumption to be flawed using our first-order interval attacks. Essentially, for each random starting point, instead of directly applying the PGD attack, the interval attack first uses sound bound propagation to locate a promising sub-area and then applies PGD. We randomly picked two different images on which PGD attacks cannot find violations but the interval attacks can, and then we repeated the same tests as in (Madry et al., 2018a) for PGD, CW, and interval attacks. On both images, regular PGD attacks cannot find any violation even with random starts. The losses found by PGD attacks are concentrated between 0 and 1, which is consistent with the observations in (Madry et al., 2018a). Similarly, the losses of CW attacks on these images show a concentrated distribution. However, interval attacks can locate promising sub-areas and then use the same PGD attacks to reach a much wider range of loss values. In particular, they find violations from the same set of starting points. Therefore, the method of Madry et al. (Madry et al., 2018a) cannot provide robustness against all first-order adversaries even if the model is robust against PGD attacks with random starting points.

Interval gradient. As described earlier, any existing sound bound propagation method can be used in our interval attack. In this paper, we use symbolic interval analysis to provide tight output bounds (Wang et al., 2018b; c). Essentially, such analysis produces two parallel linear equations (the upper and lower bounds shown in Figure 1) to tightly bound the output of each neuron. Compared to other types of sound bound propagation, symbolic interval analysis provides a tight over-approximation while its interval gradient can be easily accessed.

Based on that, we define the interval gradient to be equal to the slope of these parallel lines. For example, assume symbolic interval analysis propagates an input range through the network and finally bounds the network’s output between two parallel lines. Then, the interval gradient is the common slope of those two lines.
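As a concrete instance of such parallel bounds, consider a single ReLU whose input $z$ is known to lie in $[l, u]$ with $l < 0 < u$. One linear relaxation of this form can be written as follows; the symbols $z$, $l$, $u$, and $\lambda$ are illustrative notation introduced here, not taken from Figure 1.

```latex
\lambda = \frac{u}{u - l}, \qquad
\lambda\, z \;\le\; \mathrm{ReLU}(z) \;\le\; \lambda\,(z - l)
\quad \text{for all } z \in [l, u].
```

Both bounding lines have slope $\lambda$, so under this relaxation the interval gradient with respect to $z$ is $\lambda$. For instance, $l = -1$ and $u = 3$ give $\lambda = 3/4$, with lower line $0.75\,z$ and upper line $0.75\,(z + 1)$.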

For other sound bound propagation methods that do not have parallel output bounds, we can estimate the interval gradient with the average gradient (slope) of all the corresponding output bounds. Here we give one generic definition of interval gradients: if the output range provided by sound bound propagation is represented as an abstract domain, the interval gradient can be estimated as the average of the gradients of that domain’s output bounds with respect to the input.

Note that how to choose the best sound bound propagation method and how to estimate interval gradients are worth further discussion. Here we just provide one possible solution that has good empirical results with symbolic interval analysis.
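One way to write this averaging is sketched below. The notation is an assumption made for illustration: $g_I$ denotes the interval gradient and $b_1, \dots, b_k$ denote the $k$ bounding functions that describe the output abstract domain as functions of the input $x$.

```latex
g_I \;\approx\; \frac{1}{k} \sum_{i=1}^{k} \nabla_x\, b_i(x)
```

When the domain consists of two parallel linear bounds, $k = 2$ and both terms are equal, so the estimate reduces to the shared slope of the two lines.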

Algorithm 1: Interval attack

Inputs: target input, attack budget

Parameters: iterations, step size, starting point, input region step size p

Output: perturbed input

Interval attack details. Algorithm 1 shows how we implement the interval attack. The key challenge is to find a suitable size for the input region over which the interval gradient is computed (Line 3 to Line 11) at each iteration of the interval-gradient-based search. If the region is too large, the symbolic interval analysis introduces a large overestimation error that leads the search in the wrong direction. On the other hand, if the region is too small, the information from the surrounding area might not be enough to make an informed update.

Therefore, we dynamically adjust the size of the input region during the attack. Specifically, we introduce a tunable hyperparameter p that controls the step size used when searching for the region size (Line 7). Since the region is usually small, the sound bound propagation tends to be accurate. We can then pick the smallest region that causes potential violations in the relaxed output abstract domain, and compute the interval gradient for that abstract domain within this region.

After a bounded number of iterations with interval gradients, we use PGD attack with a starting point from the promising sub-area identified by the interval-gradient-based search to locate a concrete violation as shown in Line 21.
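The two phases described above can be sketched on a toy problem as follows. This is a simplified illustration under stated assumptions, not the released implementation: the scalar "margin" function `f`, its weights, the violation criterion `f(x) < 0`, and all function names are hypothetical, and the ReLU relaxation used in `symbolic_bounds` is one sound choice that yields parallel bounds.

```python
# Toy margin f(x) = C + sum_j V[j] * relu(W[j] * x + B[j]) on a scalar input.
# f(x) < 0 plays the role of a violation. All values are hypothetical.
W = [-1.0, 1.0, 1.0]
B = [0.0, 0.0, -0.1]
V = [1.0, 1.0, -4.0]
C = 0.2

def f(x):
    return C + sum(v * max(0.0, w * x + b) for w, b, v in zip(W, B, V))

def grad(x):
    """Ordinary (sub)gradient of f at the single point x."""
    return sum(v * w for w, b, v in zip(W, B, V) if w * x + b > 0.0)

def sign(v):
    return (v > 0) - (v < 0)

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def symbolic_bounds(lo, hi):
    """Sound parallel linear bounds of f on [lo, hi].

    Returns (slope, lower): slope is shared by both bounding lines (the
    interval gradient) and lower is a sound lower bound of f on [lo, hi].
    """
    slope, icpt = 0.0, C
    for w, b, v in zip(W, B, V):
        l, u = sorted((w * lo + b, w * hi + b))
        if u <= 0.0:        # always inactive: relu(z) = 0
            lam, up_off = 0.0, 0.0
        elif l >= 0.0:      # always active: relu(z) = z
            lam, up_off = 1.0, 0.0
        else:               # unstable: lam*z <= relu(z) <= lam*(z - l)
            lam = u / (u - l)
            up_off = -lam * l
        slope += v * lam * w
        # The lower bound of f uses the upper ReLU line when v is negative.
        icpt += v * lam * b + (v * up_off if v < 0.0 else 0.0)
    return slope, icpt + min(slope * lo, slope * hi)

def pgd(x, x0, eps, alpha, steps):
    """Signed-gradient descent on f, projected onto [x0 - eps, x0 + eps]."""
    for _ in range(steps):
        x = clip(x - alpha * sign(grad(x)), x0 - eps, x0 + eps)
    return x

def interval_attack(x0, eps, alpha, p, t, pgd_steps):
    x = x0
    for _ in range(t):
        # Grow the region in steps of p until the relaxed output admits a
        # potential violation (or the region reaches the attack budget).
        eps0 = p
        while True:
            lo, hi = max(x - eps0, x0 - eps), min(x + eps0, x0 + eps)
            g, lower = symbolic_bounds(lo, hi)
            if lower < 0.0 or eps0 >= eps:
                break
            eps0 += p
        # Move along the interval gradient of that region.
        x = clip(x - alpha * sign(g), x0 - eps, x0 + eps)
    # Finish with plain PGD from the promising starting point.
    return pgd(x, x0, eps, alpha, pgd_steps)

x0, eps, alpha = 0.02, 0.3, 0.05
x_pgd = pgd(x0, x0, eps, alpha, steps=10)
x_int = interval_attack(x0, eps, alpha, p=0.05, t=3, pgd_steps=10)
```

In this toy setting, plain PGD oscillates around the local minimum near `x = 0` and never reaches a negative margin, while the interval phase sees the steep descent beyond `x = 0.1` through the enlarged region and hands PGD a starting point from which it reaches a violation inside the same budget.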

Table 1: Attack success rates of PGD, CW, and the interval attack on the three MNIST networks.

| Network | # Hidden units | # Parameters | ACC (%) | PGD success rate (%) | CW success rate (%) | Interval attack success rate (%) | Interval attack gain |
|---|---|---|---|---|---|---|---|
| MNIST_FC1 | 1,024 | 668,672 | 98.1 | 39.2 | 42.2 | 56.2 | +17.0 (43%) |
| MNIST_FC2 | 10,240 | 18,403,328 | 98.8 | 34.4 | 32.2 | 44.4 | +10.0 (38%) |
| MNIST_Conv* | 38,656 | 3,274,634 | 98.4 | 7.2 | 7.3 | 11.6 | +4.4 (61%) |

\* Interval attack achieves the best attack success rate in the MadryLab MNIST Challenge (Madry et al., 2018b).

Setup. We implement symbolic interval analysis (Wang et al., 2018c) and the interval attack on top of TensorFlow 1.9.0 (https://www.tensorflow.org/). All of our experiments are run on a GeForce GTX 1080 Ti.

We evaluate the interval attack on three neural networks: MNIST_FC1, MNIST_FC2, and MNIST_Conv. The network details are shown in Table 1. All of these networks are trained to be adversarially robust using the technique of Madry et al. (Madry et al., 2018a). For PGD attacks, we use 40 iterations and a step size of 0.01. We define the robustness region to be bounded by the $\ell_\infty$ norm with $\epsilon = 0.3$ over normalized inputs (76 out of 255 in pixel value). We use 1,024 randomly selected images from the MNIST test set to measure accuracy and robustness. MNIST_FC1 contains two hidden layers, each with 512 hidden nodes, and achieves 98.1% test accuracy. Similarly, MNIST_FC2 contains five hidden layers, each with 2,048 hidden nodes, and achieves 98.8% test accuracy. MNIST_Conv was adversarially trained by Madry et al. (Madry et al., 2018a) and was released publicly as part of the MadryLab MNIST Challenge (Madry et al., 2018b). The model uses two convolutional layers and two max-pooling layers to achieve 98.4% test accuracy, and the PGD attack success rate on it is only 7.2%.
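The sizes quoted for the two fully connected networks can be cross-checked with a few lines of arithmetic. The sketch below assumes 784 inputs (28×28 MNIST images) and 10 output classes, and it counts weights only; excluding biases is an inference made here because it reproduces the parameter counts in Table 1. The helper name `weight_count` is hypothetical.

```python
def weight_count(layer_sizes):
    """Number of weights (biases excluded) in a fully connected network."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

fc1 = [784, 512, 512, 10]            # two hidden layers of 512 nodes
fc2 = [784] + [2048] * 5 + [10]      # five hidden layers of 2,048 nodes

fc1_hidden = sum(fc1[1:-1])          # 1024
fc2_hidden = sum(fc2[1:-1])          # 10240
fc1_params = weight_count(fc1)       # 668672
fc2_params = weight_count(fc2)       # 18403328

# 76 out of 255 pixel levels on a [0, 1] scale is roughly 0.3.
eps_from_pixels = 76 / 255
```

These values agree with the hidden-unit and parameter columns for MNIST_FC1 and MNIST_FC2, and the pixel ratio rounds to the 0.3 perturbation bound commonly used for MNIST.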

Attack effectiveness. The interval attack finds more violations than the baselines on all of the MNIST models. We compared the strength of the interval attack against the state-of-the-art PGD and CW attacks, as shown in Table 1. For the first two models, we give all attacks the same amount of attack time. The iterations for PGD and CW are around 7,200 for MNIST_FC1 and 42,000 for MNIST_FC2. We ran the interval attack for 20 iterations against MNIST_FC1 and MNIST_FC2, which took 53.9 seconds and 466.5 seconds, respectively. The interval attack achieves a 38% to 61% relative increase over the attack success rate of PGD. On the MNIST_Conv network from the MadryLab MNIST Challenge, the interval attack achieves the highest success rate among all existing attacks.

Our results show that using only first-order gradient information, the interval attack can significantly improve the attack success rate over that of the PGD and CW attacks in adversarially robust networks.

Inefficiency of adversarially robust training with interval attack. One obvious way of increasing the robustness of trained networks against interval-based attacks is to use such attacks for adversarial training. However, due to the high overhead introduced by sound bound propagation methods, robust training with the interval attack often struggles to converge. To demonstrate that, we evaluated adversarially robust training using the interval attack on the MNIST_Small network for a fixed attack budget. As shown in Figure 3, even after 12 hours of training time for such a small network, the interval-based adversarially robust training does not converge as well as its PGD-based counterpart. To achieve the same ERA, such interval-based adversarially robust training takes around 15.5 hours, which is 47 times slower than its PGD-based counterpart. The gap will widen further on larger networks.

Therefore, instead of improving adversarially robust training schemes, recent verifiably robust training is a promising direction that can fundamentally improve verifiable robustness and defend against stronger attackers (Wong et al., 2018; Mirman et al., 2018; Wang et al., 2018a; Gowal et al., 2018).

Many defenses against adversarial examples have been proposed (Gu & Rigazio, 2014; Papernot et al., 2015; Cisse et al., 2017; Papernot & McDaniel, 2017; 2018; Athalye et al., 2018; Buckman et al., 2018; Song et al., 2017; Xie et al., 2017; Zantedeschi et al., 2017), followed by a sequence of stronger attacks breaking them in quick succession (Papernot et al., 2016; Carlini & Wagner, 2017b; Elsayed et al., 2018; Carlini & Wagner, 2017a; Moosavi-Dezfooli et al., 2016; Biggio et al., 2013; Ma et al., 2018; Papernot et al., 2017; Guo et al., 2017; He et al., 2017; Athalye & Sutskever, 2018; Carlini & Wagner, 2018; Pei et al., 2017; Tian et al., 2018). We refer interested readers to the survey by Athalye et al. (Athalye et al., 2018) for more details. In spite of the large number of attacks, the adversarially trained model (Madry et al., 2018a) remains robust against known attacks. One attack paper (He et al., 2018) that breaks the region-based defense (Cao & Gong, 2017) also uses region information. However, their estimation of the surrounding area, unlike ours, is based on sampling and thus may miss violations that the interval attack can find.

We propose a novel type of gradient-based attack, which is the first generic attack framework to combine sound bound propagation methods with existing gradient-based attacks. By plotting the loss distributions found by the interval attack with random starting points, we show that the PGD attack is not the ultimate first-order adversary. On three adversarially trained MNIST networks, the interval attack provides on average 47% relative improvement over PGD attacks. In the MadryLab MNIST Challenge (Madry et al., 2018b), it achieves the best performance so far. The empirical results suggest that there is valuable room for research into stronger attacks that apply tighter sound bound propagation methods and stronger first-order attacks.

This work is sponsored in part by NSF grants CNS-16-17670, CNS-18-42456, CNS-18-01426; ONR grant N00014-17-1-2010; an ARL Young Investigator (YIP) award; and a Google Faculty Fellowship. Any opinions, findings, conclusions, or recommendations expressed herein are those of the authors, and do not necessarily reflect those of the US Government, ONR, ARL, NSF, or Google.

## References

- Athalye & Sutskever (2018) Athalye, A. and Sutskever, I. Synthesizing robust adversarial examples. International Conference on Machine Learning (ICML), 2018.
- Athalye et al. (2018) Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
- Biggio et al. (2013) Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Springer, 2013.
- Buckman et al. (2018) Buckman, J., Roy, A., Raffel, C., and Goodfellow, I. Thermometer encoding: One hot way to resist adversarial examples. 2018.
- Cao & Gong (2017) Cao, X. and Gong, N. Z. Mitigating evasion attacks to deep neural networks via region-based classification. In Proceedings of the 33rd Annual Computer Security Applications Conference, pp. 278–287. ACM, 2017.
- Carlini & Wagner (2017a) Carlini, N. and Wagner, D. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. ACM, 2017a.
- Carlini & Wagner (2017b) Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017b.
- Carlini & Wagner (2018) Carlini, N. and Wagner, D. Magnet and “efficient defenses against adversarial attacks” are not robust to adversarial examples. arXiv preprint arXiv:1711.08478, 2018.
- Cisse et al. (2017) Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning (ICML), pp. 854–863, 2017.
- Elsayed et al. (2018) Elsayed, G. F., Shankar, S., Cheung, B., Papernot, N., Kurakin, A., Goodfellow, I., and Sohl-Dickstein, J. Adversarial examples that fool both human and computer vision. arXiv preprint arXiv:1802.08195, 2018.
- Gehr et al. (2018) Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., and Vechev, M. AI2: Safety and robustness certification of neural networks with abstract interpretation. In IEEE Symposium on Security and Privacy (SP), 2018.
- Gowal et al. (2018) Gowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin, C., Uesato, J., Mann, T., and Kohli, P. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.
- Gu & Rigazio (2014) Gu, S. and Rigazio, L. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.
- Guo et al. (2017) Guo, C., Rana, M., Cisse, M., and van der Maaten, L. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117, 2017.
- He et al. (2017) He, W., Wei, J., Chen, X., Carlini, N., and Song, D. Adversarial example defenses: Ensembles of weak defenses are not strong. arXiv preprint arXiv:1706.04701, 2017.
- He et al. (2018) He, W., Li, B., and Song, D. Decision boundary analysis of adversarial examples. 2018.
- Ma et al. (2018) Ma, X., Li, B., Wang, Y., Erfani, S. M., Wijewickrema, S., Houle, M. E., Schoenebeck, G., Song, D., and Bailey, J. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613, 2018.
- Madry et al. (2018a) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations (ICLR), 2018a.
- Madry et al. (2018b) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Madry mnist challenge. 2018b.
- Mirman et al. (2018) Mirman, M., Gehr, T., and Vechev, M. Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning (ICML), pp. 3575–3583, 2018.
- Moosavi-Dezfooli et al. (2016) Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582, 2016.
- Papernot & McDaniel (2017) Papernot, N. and McDaniel, P. Extending defensive distillation. arXiv preprint arXiv:1705.05264, 2017.
- Papernot & McDaniel (2018) Papernot, N. and McDaniel, P. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.
- Papernot et al. (2015) Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. arXiv preprint arXiv:1511.04508, 2015.
- Papernot et al. (2016) Papernot, N., Carlini, N., Goodfellow, I., Feinman, R., Faghri, F., Matyasko, A., Hambardzumyan, K., Juang, Y.-L., Kurakin, A., Sheatsley, R., et al. cleverhans v2.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768, 2016.
- Papernot et al. (2017) Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. ACM, 2017.
- Pei et al. (2017) Pei, K., Cao, Y., Yang, J., and Jana, S. Deepxplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP), pp. 1–18. ACM, 2017.
- Song et al. (2017) Song, Y., Kim, T., Nowozin, S., Ermon, S., and Kushman, N. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.
- Tian et al. (2018) Tian, Y., Pei, K., Jana, S., and Ray, B. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, pp. 303–314. ACM, 2018.
- Wang et al. (2018a) Wang, S., Chen, Y., Abdou, A., and Jana, S. Mixtrain: Scalable training of formally robust neural networks. arXiv preprint arXiv:1811.02625, 2018a.
- Wang et al. (2018b) Wang, S., Pei, K., Whitehouse, J., Yang, J., and Jana, S. Formal security analysis of neural networks using symbolic intervals. 27th USENIX Security Symposium, 2018b.
- Wang et al. (2018c) Wang, S., Pei, K., Whitehouse, J., Yang, J., and Jana, S. Efficient formal safety analysis of neural networks. Advances in Neural Information Processing Systems (NIPS), 2018c.
- Wong & Kolter (2018) Wong, E. and Kolter, J. Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. International Conference on Machine Learning (ICML), 2018.
- Wong et al. (2018) Wong, E., Schmidt, F., Metzen, J. H., and Kolter, J. Z. Scaling provable adversarial defenses. Advances in Neural Information Processing Systems (NIPS), 2018.
- Xie et al. (2017) Xie, C., Wang, J., Zhang, Z., Ren, Z., and Yuille, A. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991, 2017.
- Zantedeschi et al. (2017) Zantedeschi, V., Nicolae, M.-I., and Rawat, A. Efficient defenses against adversarial attacks. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 39–49. ACM, 2017.