Bypassing Feature Squeezing by Increasing Adversary Strength
Feature Squeezing is a recently proposed defense method which reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. It has been shown that feature squeezing defenses can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks. However, we demonstrate on the MNIST and CIFAR-10 datasets that by increasing the adversary strength of said state-of-the-art attacks, one can bypass the detection framework with adversarial examples of minimal visual distortion. These results suggest that proposed defenses should be validated against stronger attack configurations.
Yash Sharma¹ and Pin-Yu Chen²
¹The Cooper Union, New York, NY 10003, USA
²IBM Research, Yorktown Heights, NY 10598, USA
1 Introduction
Deep neural networks (DNNs) achieve state-of-the-art performance in various tasks in machine learning and artificial intelligence, such as image classification, speech recognition, machine translation and game-playing. Despite their effectiveness, recent studies have illustrated the vulnerability of DNNs to adversarial examples Szegedy et al. (2013); Goodfellow et al. (2015). For instance, a carefully designed perturbation to an image can lead a well-trained DNN to misclassify. Even worse, effective adversarial examples can also be made virtually indistinguishable to human perception. Adversarial examples crafted to evade a specific model can even be used to mislead other models trained for the same task, exhibiting a property known as transferability Liu et al. (2016); Papernot et al. (2016); Sharma & Chen (2017).
To address this problem, numerous defense mechanisms have been proposed; one which has achieved strong results is feature squeezing. Feature squeezing relies on applying input transformations to reduce the degrees of freedom available to an adversary by “squeezing” out unnecessary input features. The authors in Xu et al. (2017) propose a detection method using such input transformations by relying on the intuition that if the original and squeezed inputs produce substantially different outputs from the model, the input is likely to be adversarial. By comparing the difference between predictions with a selected threshold value, the system is designed to output the correct prediction for legitimate examples and reject adversarial inputs. By combining multiple squeezers in a joint detection framework, the authors claim that the system can successfully detect adversarial examples from eleven state-of-the-art methods Xu et al. (2017).
In this paper, we show that by increasing the adversary strength of the state-of-the-art methods, the feature squeezing joint detection method can be readily bypassed. We demonstrate this on both the MNIST and CIFAR-10 datasets. We experiment with EAD Chen et al. (2017), a generalization of the state-of-the-art C&W attack Carlini & Wagner (2017), and I-FGSM Kurakin et al. (2016), an iterative version of the widely used fast gradient attack Goodfellow et al. (2015). For EAD and C&W, we increase the adversary strength by increasing κ, which controls the necessary margin between the predicted probability of the target class and that of the rest. For I-FGSM, we increase the adversary strength by increasing ε, which controls the allowable distortion. We find that adversarial examples with minimal visual distortion can be generated which bypass feature squeezing under these stronger attack configurations. Our results suggest that proposed defenses should be validated against adversarial examples of maximal distortion, as long as the examples remain visually adversarial.
2 Experiment Setup
Two types of feature squeezing were focused on by the authors in Xu et al. (2017): (i) reducing the color bit depth of images; and (ii) using smoothing (both local and non-local) to reduce the variation among pixels. For the detection method, the model’s original prediction is compared with the prediction on the squeezed sample using the L1 norm. As a defender typically does not know the exact attack method, multiple feature squeezers are combined by outputting the maximum distance. The threshold is chosen so that it is exceeded by no more than 5% of legitimate samples, targeting a false positive rate below 5%.
For MNIST, the joint detector consists of a 1-bit depth squeezer with 2×2 median smoothing. For CIFAR-10, the joint detector consists of a 5-bit depth squeezer with 2×2 median smoothing and a non-local means filter with a 13×13 search window, 3×3 patch size, and a Gaussian kernel bandwidth size of 2. We use the same thresholds as used in Xu et al. (2017). We generate adversarial examples using the EAD Chen et al. (2017) and I-FGSM attacks Kurakin et al. (2016).
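To make the detection pipeline concrete, the following sketch implements the two MNIST squeezers and the max-L1 detection score. This is an illustrative stand-in, not the reference implementation of Xu et al. (2017): the `predict` callable represents the model's softmax output, and the median-window alignment details may differ.

```python
import numpy as np

def reduce_bit_depth(x, bits):
    """Squeeze pixel values in [0, 1] down to the given color bit depth."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def median_smooth(x, k=2):
    """k x k median smoothing over a single-channel image, with edge padding."""
    h, w = x.shape
    xp = np.pad(x, k // 2, mode="edge")
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(xp[i:i + k, j:j + k])
    return out

def detection_score(predict, x, squeezers):
    """Max L1 distance between predictions on the original and squeezed inputs."""
    p0 = predict(x)
    return max(np.sum(np.abs(p0 - predict(s(x)))) for s in squeezers)
```

An input is flagged as adversarial when `detection_score` exceeds the dataset-specific threshold selected on legitimate samples.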
EAD generalizes the state-of-the-art C&W attack Carlini & Wagner (2017) by performing elastic-net regularization, linearly combining the L1 and L2 penalty functions Chen et al. (2017). The hyperparameter β controls the trade-off between L1 and L2 minimization. We test EAD in both the general case and the special case where β is set to 0, which is equivalent to the C&W attack. For MNIST and CIFAR-10, β was set to 0.01 and 0.001, respectively. We tune κ, a confidence parameter that controls the necessary margin between the predicted probability of the target class and that of the rest, in our experiments. κ is increased starting from 10 on both datasets, which was the value used in the feature squeezing experiments Xu et al. (2017). Full detail on the implementation is provided in the supplementary material.
For L∞ attacks, which we consider here, fast gradient methods (FGM) use the sign of the gradient of the training loss with respect to the input for crafting adversarial examples Goodfellow et al. (2015). I-FGSM iteratively applies FGM with a finer distortion, followed by an ε-ball clipping Kurakin et al. (2016). We tune ε, which controls the allowable distortion, in our experiments. ε is increased starting from 0.3 on MNIST and 0.008 on CIFAR-10, which were the values used for the feature squeezing experiments Xu et al. (2017). Full detail on the implementation is provided in the supplementary material.
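As a rough sketch of the ε-parameterized attack, a minimal non-targeted I-FGSM in NumPy follows. The `grad_fn` argument is a hypothetical stand-in for the gradient of the training loss with respect to the input, which in practice would come from the framework's automatic differentiation.

```python
import numpy as np

def i_fgsm(x0, grad_fn, eps, steps=10):
    """Iterative FGSM: take FGM steps of size eps/steps to increase the
    training loss, projecting back onto the eps-ball around x0 (L-inf)
    and onto the valid pixel range [0, 1] after every step."""
    alpha = eps / steps
    x = x0.copy()
    for _ in range(steps):
        x = x + alpha * np.sign(grad_fn(x))   # ascend the loss
        x = np.clip(x, x0 - eps, x0 + eps)    # eps-ball clipping
        x = np.clip(x, 0.0, 1.0)              # keep a valid image
    return x
```

Increasing `eps` directly enlarges the allowable distortion, which is the adversary-strength knob tuned in our experiments.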
We randomly sample 100 images from the MNIST and CIFAR-10 test sets. For each dataset, we use the same pre-trained state-of-the-art models as used in Xu et al. (2017). We generate adversarial examples in the non-targeted case, forcing the network to misclassify, and in the targeted case, forcing the network to misclassify to a target class t. As done in Xu et al. (2017), we try two different targets, the next class (t = label + 1 mod # of classes) and the least-likely class (LL).
3 Experiment Results
The generated adversarial examples are tested against the proposed MNIST and CIFAR-10 joint detection configurations. In Tables 1 and 2, the results of tuning κ for C&W and EAD are provided, presented alongside the results for I-FGSM at the lowest ε at which the highest attack success rate (ASR) was yielded, against the MNIST and CIFAR-10 joint detectors, respectively. In all cases, EAD outperforms the C&W attack, particularly at lower confidence levels, indicating the importance of minimizing the L1 distortion for generating robust adversarial examples with minimal visual distortion. Specifically, we find that with enough strength, each attack is able to achieve near 100% ASR against the joint detectors.
In Figure 1, non-targeted MNIST adversarial examples generated by EAD are shown at increasing values of κ. In Figure 2, non-targeted CIFAR-10 adversarial examples generated by EAD are shown at increasing values of κ. Adversarial examples generated in the least-likely targeted case are provided in the supplementary material. These figures indicate that adversarial examples generated by EAD at high κ, which bypass the joint feature squeezing detector, have minimal visual distortion. This holds true for adversarial examples generated by I-FGSM with high ε on CIFAR-10, but not on MNIST.
4 Conclusion
Feature squeezing is a recently proposed class of input transformations which, when combined in a joint detection framework, has been shown to achieve high detection rates against state-of-the-art attacks. We show on the MNIST and CIFAR-10 datasets that by increasing the adversary strength, by tuning the confidence κ and the L∞ constraint ε for EAD and I-FGSM, respectively, the proposed joint detection configuration can be bypassed with adversarial examples of minimal visual distortion. These results suggest that proposed defenses should be validated against stronger attack configurations, using the maximal adversary strength at which examples remain visually similar to the inputs. For future work, we aim to validate whether other recently proposed defenses are robust to strong adversaries.
References
- Beck & Teboulle (2009) A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- Carlini & Wagner (2017) N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), pp. 39–57, 2017.
- Chen et al. (2017) P.Y. Chen, Y. Sharma, H. Zhang, J. Yi, and C.J. Hsieh. EAD: Elastic-net attacks to deep neural networks via adversarial examples. arXiv preprint arXiv:1709.04114, 2017.
- Goodfellow et al. (2015) I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. ICLR’15; arXiv preprint arXiv:1412.6572, 2015.
- Kingma & Ba (2014) D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kurakin et al. (2016) A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. ICLR’17; arXiv preprint arXiv:1611.01236, 2016.
- Liu et al. (2016) Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
- Papernot et al. (2016) N. Papernot, P. McDaniel, and I. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
- Sharma & Chen (2017) Y. Sharma and P. Y. Chen. Attacking the madry defense model with l1-based adversarial examples. arXiv preprint arXiv:1710.10733, 2017.
- Szegedy et al. (2013) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Tramèr et al. (2017) F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
- Xu et al. (2017) W. Xu, D. Evans, and Y. Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
Appendix A Supplementary Material
A.1 Attack Details
The targeted attack formulations are discussed below; non-targeted attacks can be implemented in a similar fashion. We denote by x₀ and x the original and adversarial examples, respectively, and denote by t the target class to attack.
EAD generalizes the state-of-the-art C&W attack Carlini & Wagner (2017) by performing elastic-net regularization, linearly combining the L1 and L2 penalty functions Chen et al. (2017). The formulation is as follows:

  minimize_x  c · f(x, t) + ‖x − x₀‖₂² + β‖x − x₀‖₁   subject to  x ∈ [0, 1]^p,

where f(x, t) is defined as:

  f(x, t) = max{ max_{j≠t} [Logit(x)]_j − [Logit(x)]_t, −κ }.
By increasing β, one trades off L2 minimization for L1 minimization. When β is set to 0, EAD is equivalent to the C&W attack. By increasing κ, one increases the necessary margin between the predicted probability of the target class and that of the rest. Therefore, increasing κ improves adversary strength but compromises visual quality.
We implement 9 binary search steps on the regularization parameter c (starting from 0.001) and run a fixed number of iterations for each step. For finding successful adversarial examples, we use the ADAM optimizer for the C&W attack and implement the projected FISTA algorithm with a square-root decaying learning rate for EAD Kingma & Ba (2014); Beck & Teboulle (2009).
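For intuition, the projected shrinkage-thresholding step at the heart of (F)ISTA can be sketched as below. This helper is illustrative, following the operator's standard form for the elastic-net attack objective (cf. Chen et al. (2017)); it is not the authors' released code.

```python
import numpy as np

def projected_shrinkage(z, x0, beta):
    """Elementwise soft-thresholding of z toward the original image x0,
    projected onto the valid pixel box [0, 1]. Components of z within
    beta of x0 snap back to x0 exactly, which is what encourages an
    L1-sparse perturbation in EAD."""
    diff = z - x0
    shrunk_up = np.minimum(z - beta, 1.0)    # pulled down by beta, capped at 1
    shrunk_down = np.maximum(z + beta, 0.0)  # pushed up by beta, floored at 0
    return np.where(diff > beta, shrunk_up,
                    np.where(diff < -beta, shrunk_down, x0))
```

Applying this operator after each gradient step on the attack loss keeps the perturbation sparse while the c-weighted term drives misclassification.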
Fast gradient methods (FGM) use the gradient of the training loss J with respect to the input for crafting adversarial examples Goodfellow et al. (2015). For L∞ attacks, which we consider here, x is crafted by

  x = x₀ − ε · sign(∇ₓ J(x₀, t)),

where ε specifies the L∞ distortion between x and x₀, and sign(∇ₓ J) takes the sign of the gradient.
A.2 EAD Adversarial Examples in the Targeted Case (Figures 3 and 4)
In Figure 3, least-likely targeted MNIST adversarial examples generated by EAD are shown at increasing values of κ. In Figure 4, least-likely targeted CIFAR-10 adversarial examples generated by EAD are shown at increasing values of κ. Distortion is more apparent in the targeted case, particularly in the least-likely targeted case, but the examples are still visually adversarial.
A.3 I-FGSM Adversarial Examples in the Non-Targeted Case (Figures 5 and 6)
In Figure 5, non-targeted MNIST adversarial examples generated by I-FGSM are shown at increasing values of ε. In Figure 6, non-targeted CIFAR-10 adversarial examples generated by I-FGSM are shown at increasing values of ε. CIFAR-10 adversarial examples at high ε have minimal visual distortion; however, MNIST examples at high ε, which yield the optimal ASR, have clear distortion.