Improving Adversarial Robustness by
A recent line of research proposed (either implicitly or explicitly) gradient-masking preprocessing techniques to improve adversarial robustness. However, as shown by Athalye, Carlini, and Wagner, essentially all of these defenses can be circumvented if an attacker leverages approximate gradient information with respect to the preprocessing. This raises a natural question of whether there is a preprocessing technique that remains useful in the context of white-box attacks, even just for mildly complex datasets such as MNIST. In this paper we provide an affirmative answer to this question. Our key observation is that for several popular datasets, one can approximately encode the entire dataset using a small set of separable codewords derived from the training set, while retaining high accuracy on natural images. The separability of the codewords in turn prevents small adversarial perturbations from changing the feature encoding, leading to adversarial robustness. For example, for MNIST our code consists of only two codewords, 0 and 1, and the encoding of a pixel is simply its nearest codeword (i.e., whether the pixel value is at least 0.5). Applying this code to a naturally trained model already gives high adversarial robustness, even under strong white-box attacks based on the Backward Pass Differentiable Approximation (BPDA) method of Athalye, Carlini, and Wagner that takes the codes into account. We give density-estimation based algorithms to construct such codes, and provide theoretical analysis and certificates of when our method can be effective. Systematic evaluation demonstrates that our method is effective in improving adversarial robustness on MNIST, CIFAR-10, and ImageNet, for both naturally and adversarially trained models.
Jiefeng Chen* Xi Wu* Yingyu Liang Somesh Jha. University of Wisconsin-Madison; Google. (*Equal contribution.)
Preprint. Work in progress.
1 Introduction

Adversarial robustness of deep neural networks (DNNs) has received significant attention in recent years. A long line of recent research Dziugaite et al. (2016); Guo et al. (2017); Meng & Chen (2017); Xu et al. (2017); Buckman et al. (2018); Xie et al. (2017); Song et al. (2017); Samangouei et al. (2018) proposed preprocessing techniques to improve adversarial robustness. These techniques start with a pre-trained model f, but before feeding a feature vector x to it, they first preprocess x using some highly nonlinear function T, with the hope that T can “remove” the noise introduced by adversarial perturbations. These techniques are attractive because they are typically simple and efficient, and can be applied directly to a model without modifying the training. Unfortunately, as recently demonstrated by Athalye et al. (2018), essentially all of these preprocessing defenses can be circumvented because they rely, either implicitly or explicitly, on gradient masking, whose effect can be easily bypassed by leveraging appropriate approximate gradient information.
The current state of affairs raises a natural and intriguing question: is there a gradient-masking preprocessing technique that can be useful in the context of white-box attacks? In this paper we give a positive answer to this question. Our key observation is that for several popular datasets, including MNIST, CIFAR-10, and ImageNet, the data is surprisingly well structured in the following sense: one can encode the entire dataset using a small set of separable codewords (for an appropriate notion of separability that we describe shortly), while retaining high accuracy on natural images with a naturally trained model working on the encoded features. For example, on MNIST one can use only two codewords, 0 and 1, and preprocess an image by discretizing each pixel to its nearest codeword, and the test accuracy of a naturally trained model on the discretized images remains very high. Intriguingly, this phenomenon holds not only for MNIST, but also for more complex datasets such as CIFAR-10 and ImageNet. The following table provides evidence of this phenomenon.
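To make the MNIST rule concrete, here is a minimal sketch (ours, not the paper's code) of the two-codeword discretization; with codewords 0 and 1, mapping a pixel to its nearest codeword reduces to a 0.5 threshold:

```python
import numpy as np

def binarize(images):
    """Map each pixel in [0, 1] to the nearest of the two codewords
    {0, 1}; for this codebook the nearest-codeword rule is simply a
    0.5 threshold."""
    return (np.asarray(images) >= 0.5).astype(np.float32)

# A toy 2x2 "image" with intensities in [0, 1].
x = np.array([[0.1, 0.7], [0.49, 0.51]])
print(binarize(x))  # [[0. 1.] [0. 1.]]
```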
Table 1: original accuracy and accuracy after preprocessing, for each dataset.
The separability we require of the codewords is that they have large pairwise distances under an appropriate metric. Intuitively, one might hope to find such codewords using standard clustering algorithms such as k-means or k-medoids, but empirically we found that these clustering algorithms do not work well. We therefore devise new algorithms to construct separable codes based on density estimation. To gain insight into our construction, we provide an idealized theoretical model, under which we show that our preprocessing with separable codes can boost the robustness of a base model against adversarial attacks with budget ε, as long as the base model is robust against certain structured perturbations that are much weaker than ε-bounded adversarial attacks. Finally, we also provide certifying procedures to verify our defense on simpler datasets such as MNIST.
To evaluate our method, we adapt the Backward Pass Differentiable Approximation (BPDA) method of Athalye et al. (2018) to our setting and construct white-box attacks that take into account the codes used by our preprocessing. We focus on ℓ∞ attacks, and our main empirical question is how much these codes can help improve adversarial robustness under these strong white-box attacks. Our findings are encouraging: for MNIST under ℓ∞ attacks, applying codes to a naturally trained model already gives substantial adversarial robustness, and applying them to an adversarially trained model yields remarkable adversarial robustness. We observe similar findings on CIFAR-10 and ImageNet, where applying our preprocessing with an appropriate separable code, to either a naturally or an adversarially trained model, results in nontrivial improvements to adversarial robustness.
In summary this paper makes the following contributions:
We demonstrate a somewhat surprising clustering property that holds for several popular datasets, including MNIST, CIFAR-10, and ImageNet.
We show, both theoretically and empirically, how to leverage this property to devise a useful preprocessing method to improve adversarial robustness.
We conduct systematic experiments to demonstrate the effectiveness of our method against the strongest known white-box (BPDA) attacks adapted to our setting. We hope that our results open an avenue for further study of leveraging distribution-specific properties to defend against adversarial perturbations.
The rest of the paper is organized as follows. We discuss related work and preliminaries in Section 2 and Section 3, respectively. We give constructions and analysis of our separable codes in Section 4, and Section 5 presents a detailed empirical study. Finally, Section 6 concludes.
2 Related Work
The susceptibility of neural networks to adversarial perturbations was first discovered by Szegedy et al. (2013), and a large body of research has been devoted to it since then. Currently, the most successful way to defend against adversarial attacks is adversarial training, which trains the model on adversarial examples and achieves state-of-the-art results on MNIST and CIFAR-10 Madry et al. (2017). Another family of defenses does not require retraining: these defenses preprocess inputs with a transformation and then call a pre-trained classifier to classify the image. Proposals along this line include thermometer encoding Buckman et al. (2018), JPEG compression Dziugaite et al. (2016), image re-scaling Xie et al. (2017), feature squeezing Xu et al. (2017), quilting Guo et al. (2017), and neural-based transformations Song et al. (2017); Samangouei et al. (2018); Meng & Chen (2017). These defenses are computationally efficient and can be integrated with an existing classifier easily. However, as shown by Athalye et al. (2018), so far no preprocessing technique has been effective against white-box attacks. This provides direct motivation for this work.
In parallel with this work, Schmidt et al. (2018) made a similar observation that in some idealized theoretical models, non-linear transformations can help improve adversarial robustness. However, their main results focus on the sample complexity of achieving robustness. In contrast, our construction and analysis give evidence that appropriate transformations can directly improve adversarial robustness, even for a naturally trained model.
3 Preliminaries

In this paper we focus on supervised classification problems, where we have a model f(·; θ) (where θ are the model parameters; we simply write f when θ is clear from the context) that maps a feature vector x to a set of class labels. There are several different “adversarial learning” settings that consider different types of attacks, such as training-time attacks, where an adversary poisons a data set so that a “bad” hypothesis is learned by the ML algorithm, and model extraction attacks Tramèr et al. (2016). This paper instead focuses on test-time attacks, where we assume that the classifier has been trained without any interference from the attacker (i.e., no training-time attacks). Given an attack distance metric d and an adversarial budget ε, the goal of an attacker is to craft a perturbed input x′ such that f(x′) ≠ f(x) while d(x, x′) ≤ ε. Several test-time attacks are known; in this paper we mainly use a recent method by Athalye et al. (2018).
There have been several directions for defending against test-time attacks, but we focus on preprocessing. In such defenses, we design a preprocessing function T, and with a base model f, the end-to-end predictions are produced as f(T(x)). In this context there are three types of attacks: (1) black-box attacks, where the attacker can only get zero-order information of the end-to-end model (i.e., its outputs on given inputs); (2) gray-box attacks, where the attacker knows f but is unaware of T; and (3) white-box attacks, where the attacker knows both f and T. This paper considers white-box attacks since it is the strongest attack model, and to the best of our knowledge, no current preprocessing technique is effective in the white-box model.
Given a preprocessing technique T, the goal of an adversary on a labeled example (x, y) is thus to find another image x′ such that d(x, x′) ≤ ε and f(T(x′)) ≠ y. We need the following definitions of robustness:
Definition 1 (Local robustness predicate).
Given a labeled image (x, y), the condition for an adversary with budget ε to not succeed is defined as the following predicate:

Robust(x, y) ≜ ∀x′ such that d(x, x′) ≤ ε : f(T(x′)) = y.
We call this predicate the robustness predicate. Further, a predicate Cert(x, y) is called a local certificate iff Cert(x, y) implies Robust(x, y). In other words, if Cert(x, y) is true then it provides a proof of robustness at x.
Definition 2 (Robustness accuracy).
The following quantity is called robustness accuracy (it measures robustness across the entire data distribution D):

Pr_{(x, y) ∼ D}[Robust(x, y)].
4 Preprocessing with Separable Codes: Constructions and Analysis
This section is structured as follows:
In Section 4.1 we present a general discretization framework which leverages a “global” codebook. We then describe separable codes and key properties these codes should satisfy. Finally, we present algorithms for constructing separable codes based on density estimation.
To provide further insight, in Section 4.2 we give an idealized generative model of images and formally analyze the effectiveness of our method in that model. Our analysis demonstrates that if the data indeed satisfies the well-clustering property dictated by separable global codes, then our method can provably improve adversarial robustness.
Finally in Section 4.3 we present certificates for both local and global robustness in practical scenarios.
4.1 Preprocessing Framework and Codebook Construction
At a high level, our framework is very simple and has the following two steps:

At training time, we construct a codebook C of k codewords for some small k, where each codeword lies in the pixel space. This codebook is then fixed at test time.

Given an image at test time, T replaces each pixel with a codeword in C that has minimal distance to the pixel.

In this work we focus on T that replaces each pixel with its nearest codeword under a certain distance metric. Many other choices are possible and we leave them for future investigation. We now give details about separable codes.
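As an illustration, the nearest-codeword preprocessing T can be sketched as follows (a simplified sketch of ours using Euclidean distance; the paper's actual implementation may differ):

```python
import numpy as np

def discretize(image, codebook):
    """Replace each pixel (a d-dimensional vector, e.g. an RGB
    triple) with its nearest codeword under Euclidean distance."""
    image = np.asarray(image, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    pixels = image.reshape(-1, codebook.shape[1])            # (n, d)
    # Distance from every pixel to every codeword.
    dists = np.linalg.norm(pixels[:, None, :] - codebook[None, :, :], axis=2)
    return codebook[np.argmin(dists, axis=1)].reshape(image.shape)

codebook = [[0.0], [1.0]]                    # two grayscale codewords
print(discretize([[0.2], [0.9]], codebook))  # [[0.] [1.]]
```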
Separable Codes and Construction via Density Estimation. Separable codes are codes that satisfy the following two properties:
Separability. Pairwise distances of codewords are large under the chosen distance metric.
Representativity. There exists a classifier with good accuracy on data discretized with the codebook, as described by the framework above.
Intuitively, one may want to apply common clustering algorithms, such as k-means and k-medoids, to find separable codes. Intriguingly, we found empirically that such techniques do not perform well. We therefore devise a new algorithm that constructs separable codes via density estimation and greedy selection over all pixels in the training data; it is described in Algorithm 1. The algorithm takes as input a set of images X, a kernel function for density estimation, the number of codewords k, and a distance parameter r. It iterates k times; in each iteration it first estimates the densities of all pixel values, then adds the value with the highest density to the codebook, and finally removes all pixels within distance r of the picked value.
Instantiation. There are many possible kernel functions with which to instantiate Algorithm 1. In this work we use the simplest choice, the identity kernel: K(u, v) = 1 if u = v and 0 otherwise. In that case, the density estimation in line 3 of the algorithm reduces to counting the frequencies of pixel values in X.
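With the identity kernel, Algorithm 1 can be sketched as frequency counting plus greedy selection (a sketch of ours for scalar pixels; the variable names are ours):

```python
from collections import Counter

def build_codebook(pixel_values, k, r):
    """Greedy separable-code construction with the identity kernel:
    density estimation reduces to counting how often each pixel
    value occurs. Repeat k times: pick the most frequent remaining
    value as a codeword, then remove every value within distance r
    of it, which enforces pairwise separability greater than r."""
    counts = Counter(pixel_values)
    codebook = []
    for _ in range(k):
        if not counts:
            break                     # fewer than k separated modes
        code = max(counts, key=counts.get)   # highest-density value
        codebook.append(code)
        for v in list(counts):
            if abs(v - code) <= r:
                del counts[v]
    return codebook

# Pixel values clustered around 0 and 1 yield the codebook [0.0, 1.0].
print(build_codebook([0.0] * 5 + [0.1] * 3 + [1.0] * 4, k=2, r=0.2))
```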
4.2 An Idealized Model and Its Analysis
To garner additional insight, we propose and analyze an idealized model under which we can precisely characterize when our method improves adversarial robustness. Roughly speaking, our results suggest the following: suppose the data is good in the sense that it can be “generated” from some “ground-truth” codewords that are sufficiently separated; then, as long as we can find a good approximation of the ground-truth codewords and we have a base model that is robust with respect to that approximation, the end-to-end model f(T(·)) is immune to any adversarial attack with budget ε, thus providing a boost in adversarial robustness. We now present the details.
Idealized generative model of images. Suppose each pixel is a vector of discrete values. Suppose there is a set of ground-truth codewords that are well separated, i.e., their pairwise distances are at least D for some large D. Each image is generated by first generating a skeleton image in which each pixel is a codeword, and then adding noise to the skeleton. Formally, a labeled skeleton is drawn from some distribution over skeleton images and labels whose marginal distribution satisfies a mild non-degeneracy condition; each observed pixel is then its skeleton codeword plus a discrete noise term whose probability decays with its magnitude (up to a normalization factor). We make no assumptions about how the label is generated.
Quantifying robustness. We now prove our main theoretical result in the idealized model. We say that a set of codewords is a γ-approximation of the ground truth if every ground-truth codeword has a codeword in the set within distance γ. For the above generative model, one can show that the codewords found by Algorithm 1 are a γ-approximation of the ground truth (see Lemma 1 below).
We then call a transformation a γ-code perturbation if, given any skeleton, it replaces each codeword in the skeleton with a code within distance γ of it. With this definition we show that, on an image attacked with adversarial budget ε, our discretization outputs a γ-code perturbation of the underlying skeleton. This leads to the following proposition, whose proof is in the supplementary material.
Proposition 1. Assume the idealized generative model, that the training set contains sufficiently many pixels, and that the distance parameter r in Algorithm 1 is chosen appropriately relative to the noise level. Then, with high probability over the training data, for any base model f, the robustness accuracy of f(T(·)) under any attack with budget ε is at least the minimum accuracy of f over all γ-code perturbations of the skeletons.
Proposition 1 follows from the following lemma, which shows that our code construction finds a set of codes close to the ground-truth codes.
Proof sketch. Consider the distribution of a pixel in the generated images. Each ground-truth codeword has noticeably higher density than any value far from all codewords, and by Hoeffding’s inequality, with high probability the empirical density estimates preserve this gap. Since the codewords are well separated relative to the removal radius r, Algorithm 1 picks a code from the neighborhood of each ground-truth codeword exactly once. This completes the proof. ∎
Essentially, this proposition “reduces” defending against ε-bounded adversarial attacks to defending against γ-code perturbations. Therefore, as long as we have a base model that is robust to these small structured perturbations, we can defend against arbitrary ε-bounded adversarial attacks, a significant boost of robustness. Encouragingly, this intuition is consistent with our experiments: our method gives better performance with an adversarially trained base model than with a naturally trained one on structured data like MNIST.
4.3 Certificate for the Discretization Defense
Now we derive the certificates for our defense method. For a pixel x, let c(x) denote its nearest code in the codebook C, and define the candidate set

C_ε(x) = { c ∈ C : d(x, c) ≤ d(x, c(x)) + 2ε },

where ε is the adversarial budget. After a perturbation bounded by ε, the perturbed pixel has distance at most d(x, c(x)) + ε from c(x), while its distance from any other code c is at least d(x, c) − ε; so it can only be discretized to a code in C_ε(x). Therefore, all possible outcomes of discretizing an image after perturbation lie in the product of the per-pixel candidate sets, which we denote T_ε(x).
This then leads to the following local and global certificates.
Local certificate. For a data point (x, y), if for every possible outcome z ∈ T_ε(x) we have f(z) = y, then x is guaranteed to be correctly classified as y even under the adversarial attack. Formally, let Cert(x, y) be the indicator that f(z) = y for all z ∈ T_ε(x); then Cert(x, y) implies Robust(x, y), so Cert serves as a local certificate.
Global certificate. Define R = Pr_{(x, y) ∼ D}[Cert(x, y)]. Then clearly R is at most the robustness accuracy, so R serves as a lower bound for it. This certificate can be estimated on a validation set of data points; applying Hoeffding’s inequality leads to the following.

Proposition 2. Let R̂ be the frequency of Cert(x, y) on a set of n i.i.d. samples from the data distribution. Then with probability at least 1 − δ over the samples, R ≥ R̂ − sqrt(ln(1/δ) / (2n)).
Note that computing the certificate requires enumerating T_ε(x), which can be of exponential size. However, when the pixels are well clustered, most of them are much closer to their nearest code than to the others, and thus will not be discretized to a new code after perturbation, i.e., their candidate set is a singleton. Then T_ε(x) is small and the certificate is easy to compute. This is indeed the case on the MNIST data, which allows us to compute the estimated certificate in our experiments.
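For intuition, the certificate computation can be sketched as follows (a simplified scalar-pixel sketch of ours; `classify` stands in for the base model f applied to discretized images):

```python
import itertools

def candidate_codes(pixel, codebook, eps):
    """Codes that an eps-perturbed version of `pixel` could be
    discretized to: those within 2*eps of the nearest-code
    distance."""
    d_nearest = min(abs(pixel - c) for c in codebook)
    return [c for c in codebook if abs(pixel - c) <= d_nearest + 2 * eps]

def local_certificate(pixels, label, codebook, eps, classify):
    """True iff every possible discretization outcome under an
    eps-bounded attack is classified as `label`; in that case the
    prediction on this input is certifiably robust."""
    per_pixel = [candidate_codes(p, codebook, eps) for p in pixels]
    return all(classify(list(z)) == label
               for z in itertools.product(*per_pixel))

# Toy model: predict 1 iff at least one discretized pixel equals 1.
classify = lambda z: int(1 in z)
print(local_certificate([0.1, 0.9], 1, [0, 1], 0.1, classify))  # True
```

The enumeration is cheap exactly when most candidate sets are singletons, as described above.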
5 Experiments

We perform a systematic evaluation around three main empirical questions. Q1: Is our discretization defense effective under state-of-the-art white-box attacks, especially those that take the codes into account? Q2: Can we certify robustness, as described in Section 4.3? Q3: How do the hyperparameters of our method affect performance? In summary, we find the following:
We show that our method can improve, often significantly, the robustness of a base classifier. In particular, for naturally trained models, we observe significant improvement on MNIST and CIFAR-10, but no effect on ImageNet, possibly because the naturally trained base classifier on ImageNet is not robust at all. For adversarially trained models based on Madry et al. (2017), we observe improvement on MNIST and ImageNet, but not on CIFAR-10, possibly due to the adversarial generalization gap observed by Schmidt et al. (2018).
For a smaller adversarial budget on MNIST, we show that we can provide certificates. We compute an estimated certificate that is better than the state-of-the-art certificate obtained with much more sophisticated approaches.
Finally, the only key hyperparameter in our method is the number of codewords k. We show that a larger k usually leads to slightly improved natural accuracy, but significantly decreased adversarial robustness.
Next we present the details, starting with the experimental setup.
Datasets. We study three datasets: MNIST, CIFAR-10, and ImageNet. MNIST and CIFAR-10 each have 10000 test images, with 60000 and 50000 training images respectively. For MNIST, the pixel values are normalized to [0, 1]. We only use the subset of ImageNet used by the NIPS Adversarial Attacks & Defenses challenge Kurakin et al. (2017), which contains 1000 development images and 5000 test images from 1001 categories. We use the entire training sets of MNIST and CIFAR-10 to train models and derive codebooks, but only the development set of ImageNet to derive codebooks. We use the entire test set of MNIST to evaluate defenses; to reduce evaluation time, we use only 1000 test images each of CIFAR-10 and ImageNet.
Pre-trained Models. For MNIST and CIFAR-10, we use naturally and adversarially pre-trained models from Madry et al. (2017). For ImageNet, we use a naturally pre-trained InceptionResNet-V2 model from Szegedy et al. (2017) and an adversarially pre-trained InceptionResNet-V2 model from Tramèr et al. (2017).
Training Hyper-parameters. To retrain models on MNIST and CIFAR-10, we use the same hyper-parameters as Madry et al. (2017), except that in order to reduce training time, when we naturally (or adversarially) retrain CIFAR-10 models, we use naturally (or adversarially) pre-trained model to initialize model parameters and train for 10000 epochs.
Evaluation methods. We evaluate our discretization defense under white-box attacks based on the state-of-the-art Backward Pass Differentiable Approximation (BPDA) method Athalye et al. (2018). In the forward pass, we compute f(T(x)); in the backward pass, we replace T by a smooth pixel-wise approximation of the discretization, which converges to T as its temperature parameter grows, and backpropagate through this approximation instead (we use different settings of the approximation for MNIST than for CIFAR-10 and ImageNet). To evaluate a classifier's robustness without discretization, we use the PGD attack. We define natural accuracy, or simply accuracy, as the accuracy on clean data, and robustness accuracy, or simply robustness, as the accuracy under attack.
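As a sketch of the BPDA idea for a two-codeword (threshold) code: the forward pass uses the hard discretization, while the backward pass uses the gradient of a steep sigmoid surrogate (the threshold and temperature below are illustrative choices of ours, not the exact settings used in the experiments):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpda_binarize(x, tau=0.5, temp=50.0):
    """Forward: hard 0/1 discretization at threshold tau.
    Backward: gradient of sigmoid(temp * (x - tau)), a smooth
    surrogate that approaches the hard threshold as temp grows;
    an attacker backpropagates through this surrogate instead."""
    forward = (x >= tau).astype(float)
    s = sigmoid(temp * (x - tau))
    surrogate_grad = temp * s * (1.0 - s)   # d(surrogate)/dx
    return forward, surrogate_grad

out, grad = bpda_binarize(np.array([0.2, 0.5, 0.8]))
# The surrogate gradient peaks at the threshold, so attack steps
# concentrate on pixels near the decision boundary of the code.
```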
Effectiveness. We consider six settings: (1) nat_pre: no defense, model naturally trained on original data; (2) adv_pre: no defense, model adversarially trained on original data; (3) disc_nat_pre: our defense + model naturally trained on original data; (4) disc_adv_pre: our defense + model adversarially trained on original data; (5) disc_nat_re: our defense + model naturally retrained on preprocessed data; (6) disc_adv_re: our defense + model adversarially retrained on preprocessed data. Results are shown in Figures 1 and 2. Our method in general improves robustness, but it does not improve adversarially trained models on CIFAR-10, possibly due to the large gap between training and test robustness accuracy. This is referred to as the adversarial generalization gap Schmidt et al. (2018) and suggests that there is not sufficient data for improving robustness.
Certificate. We compute the estimated certificate (R̂ in Section 4.3) for our defense on 1000 test images of MNIST. We also compute the global certificate (Proposition 2) with failure probability 0.01. For computational reasons we put a threshold on the number of discretization outcomes to enumerate; for images exceeding it we report Unable and treat them as failures when computing the certificates. Our results appear in Table 2 and are compared with experimental results in Figure 3. There exist methods to compute certified robustness, such as Katz et al. (2017); Kolter & Wong (2017); Raghunathan et al. (2018). The state-of-the-art certified robustness on MNIST under small ℓ∞ perturbations is due to Kolter & Wong (2017), obtained with a fairly sophisticated method. Our discretization defense, which is much simpler and more efficient, gives a better estimated certified robustness.
Effect of the Number of Codewords. Finally, we study the relationship between the number of codewords and accuracy/robustness. Tables 3, 4, and 5 give the results for MNIST, CIFAR-10, and ImageNet, respectively. We achieve high accuracy with only a few codewords, especially when models are retrained. On naturally trained models, fewer codewords yield more robustness. On adversarially trained models, our method also improves robustness with a proper number of codewords. Also, during adversarial retraining on CIFAR-10, we observe that the training robustness can be quite high across different numbers of codewords, yet at test time no defense achieves comparably high robustness. This is possibly due to the adversarial generalization gap pointed out by Schmidt et al. (2018), suggesting that more data is needed for adversarial generalization.
Table 3 (MNIST): accuracy and robustness of naturally and adversarially trained models, with pre-trained and retrained base models; for example, an adversarially trained configuration with 2 codewords reports 98.17% accuracy / 97.24% robustness (pre-trained) and 99.29% / 93.01% (retrained).
Table 4 (CIFAR-10): accuracy and robustness of naturally and adversarially trained models, with pre-trained and retrained base models; for example, one adversarially trained configuration reports 63.20% accuracy / 24.40% robustness and 68.30% / 33.20%.
Table 5 (ImageNet): accuracy and robustness with 10 codewords; the naturally trained model reports 53.90% accuracy / 23.00% robustness, and the adversarially trained model 62.60% / 7.50%.
6 Conclusion

In this paper we take a first step toward leveraging data-specific distributional properties to improve adversarial robustness. Our key insight is to exploit a well-clustering property of data features that is shared by several common datasets, including MNIST, CIFAR-10, and ImageNet. Based on this observation we propose a discretization framework that leverages separable codes to improve adversarial robustness. A systematic evaluation demonstrates the efficacy of our method. Our work raises a number of intriguing questions, such as the connection of our separable codes to other representation learning techniques such as sparse dictionary learning, which we leave to future research.
- Athalye et al. (2018) Athalye, Anish, Carlini, Nicholas, & Wagner, David. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420.
- Buckman et al. (2018) Buckman, Jacob, Roy, Aurko, Raffel, Colin, & Goodfellow, Ian. 2018. Thermometer encoding: One hot way to resist adversarial examples. In: Submissions to International Conference on Learning Representations.
- Dziugaite et al. (2016) Dziugaite, Gintare Karolina, Ghahramani, Zoubin, & Roy, Daniel M. 2016. A study of the effect of jpg compression on adversarial images. arXiv preprint arXiv:1608.00853.
- Guo et al. (2017) Guo, Chuan, Rana, Mayank, Cissé, Moustapha, & van der Maaten, Laurens. 2017. Countering Adversarial Images using Input Transformations. arXiv preprint arXiv:1711.00117.
- Katz et al. (2017) Katz, Guy, Barrett, Clark, Dill, David L, Julian, Kyle, & Kochenderfer, Mykel J. 2017. Reluplex: An efficient SMT solver for verifying deep neural networks. Pages 97–117 of: International Conference on Computer Aided Verification. Springer.
- Kolter & Wong (2017) Kolter, J Zico, & Wong, Eric. 2017. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851.
- Kurakin et al. (2017) Kurakin, Alexey, Goodfellow, Ian, & Bengio, Samy. 2017. Nips 2017: Defense against adversarial attack. https://www.kaggle.com/c/nips-2017-defense-against-adversarial-attack.
- Madry et al. (2017) Madry, Aleksander, Makelov, Aleksandar, Schmidt, Ludwig, Tsipras, Dimitris, & Vladu, Adrian. 2017. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
- Meng & Chen (2017) Meng, Dongyu, & Chen, Hao. 2017. Magnet: a two-pronged defense against adversarial examples. Pages 135–147 of: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM.
- Raghunathan et al. (2018) Raghunathan, Aditi, Steinhardt, Jacob, & Liang, Percy. 2018. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344.
- Samangouei et al. (2018) Samangouei, Pouya, Kabkab, Maya, & Chellappa, Rama. 2018. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In: International Conference on Learning Representations, vol. 9.
- Schmidt et al. (2018) Schmidt, Ludwig, Santurkar, Shibani, Tsipras, Dimitris, Talwar, Kunal, & Mądry, Aleksander. 2018. Adversarially Robust Generalization Requires More Data. arXiv preprint arXiv:1804.11285.
- Song et al. (2017) Song, Yang, Kim, Taesup, Nowozin, Sebastian, Ermon, Stefano, & Kushman, Nate. 2017. PixelDefend: Leveraging Generative Models to Understand and Defend against Adversarial Examples. arXiv preprint arXiv:1710.10766.
- Szegedy et al. (2013) Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, & Fergus, Rob. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
- Szegedy et al. (2017) Szegedy, Christian, Ioffe, Sergey, Vanhoucke, Vincent, & Alemi, Alexander A. 2017. Inception-v4, inception-resnet and the impact of residual connections on learning. Page 12 of: AAAI, vol. 4.
- Tramèr et al. (2016) Tramèr, Florian, Zhang, Fan, Juels, Ari, Reiter, Michael K, & Ristenpart, Thomas. 2016. Stealing Machine Learning Models via Prediction APIs. Pages 601–618 of: USENIX Security Symposium.
- Tramèr et al. (2017) Tramèr, Florian, Kurakin, Alexey, Papernot, Nicolas, Boneh, Dan, & McDaniel, Patrick. 2017. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204.
- Xie et al. (2017) Xie, Cihang, Wang, Jianyu, Zhang, Zhishuai, Ren, Zhou, & Yuille, Alan. 2017. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991.
- Xu et al. (2017) Xu, Weilin, Evans, David, & Qi, Yanjun. 2017. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155.