Adversarial Defense via Data Dependent Activation Function and Total Variation Minimization
We improve the robustness of deep neural nets to adversarial attacks by using an interpolating function as the output activation. This data-dependent activation function remarkably improves both classification accuracy and stability to adversarial perturbations. Together with the total variation minimization of adversarial images and augmented training, under the strongest attack, we achieve up to 20.6%, 50.7%, and 68.7% accuracy improvement w.r.t. the fast gradient sign method, iterative fast gradient sign method, and Carlini-Wagner attacks, respectively. Our defense strategy is additive to many of the existing methods. We give an intuitive explanation of our defense strategy by analyzing the geometry of the feature space. For reproducibility, the code is made available at: https://github.com/BaoWangMath/DNN-DataDependentActivation.
1 Introduction
The adversarial vulnerability of deep neural nets (DNNs) threatens their applicability in security-critical tasks, e.g., autonomous cars [1], robotics [9], and DNN-based malware detection systems [21, 8]. Since the pioneering work by Szegedy et al. [27], many advanced adversarial attack schemes have been devised to generate imperceptible perturbations that fool DNNs [7, 20, 6, 30, 12, 3]. Adversarial attacks succeed not only in the white-box setting, i.e., when the adversary has access to the DNN parameters, but also in the black-box setting, i.e., when the adversary has no access to the parameters. Black-box attacks succeed because an image perturbed so that one DNN misclassifies it also has a significant chance of being misclassified by another DNN; this is known as the transferability of adversarial examples [23]. Due to this transferability, it is very easy to attack neural nets in a black-box fashion [15, 5]. In fact, there exist universal perturbations that can imperceptibly perturb any image and cause misclassification for any given network [17]. There is much recent research on designing advanced adversarial attacks and defending against adversarial perturbation.
In this work, we propose to defend against adversarial attacks by changing the DNNs’ output activation function to a manifold-interpolating function, in order to seamlessly utilize the training data’s information when performing inference. Together with the total variation minimization (TVM) and augmented training, we show state-of-the-art defense results on the CIFAR-10 benchmark. Moreover, we show that adversarial images generated from attacking the DNNs with an interpolating function are more transferable to other DNNs than those resulting from attacking standard DNNs.
2 Related Work
Defensive distillation [22] was recently proposed to increase the stability of DNNs, which dramatically reduces the success rate of adversarial attacks, and a related approach [28] cleverly modifies the training data to increase robustness against black-box attacks, and adversarial attacks in general. To counter adversarial perturbations, Guo et al. [10] proposed to use image transformations, e.g., bit-depth reduction, JPEG compression, TVM, and image quilting. A similar idea of denoising the input was later explored in [18], where the authors divide the input into patches, denoise each patch, and then reconstruct the image. These input transformations are intended to be non-differentiable, thus making adversarial attacks more difficult, especially gradient-based attacks. Song et al. [26] noticed that small adversarial perturbations shift the distribution of adversarial images far from the distribution of clean images. Therefore, they proposed to purify the adversarial images with PixelDefend. Adversarial training is another family of defense methods to improve the stability of DNNs [7, 16, 19]. GANs are also employed for adversarial defense [25]. In [2], the authors proposed a straight-through estimation of the gradient to attack defense methods that are based on obfuscated gradients. Meanwhile, many advanced attack methods have been proposed to attack DNNs [30, 12].
Instead of using softmax functions as the DNNs’ output activation, Wang et al. [29] utilized a class of non-parametric interpolating functions. This combination of deep and manifold learning causes the DNNs to make fuller use of the geometric information of the training data. The authors show a significant improvement in generalization accuracy, and the results are more stable when only a limited amount of training data is available.
3 Deep Neural Nets with Data-Dependent Activation Function
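In this section, we summarize the architecture, training, and testing procedures of the DNNs with the data-dependent activation [29]. An overview of training and testing of the standard DNNs with softmax output activation is shown in Fig. 1 (a) and (b), respectively. In the $k$th iteration of training, given a mini-batch of training data $(\mathbf{X}, \mathbf{Y})$, the procedure is:
Forward propagation: Transform $\mathbf{X}$ into features $\tilde{\mathbf{X}} = \mathrm{DNN}(\mathbf{X}, \Theta^{k-1})$ by a DNN block (ensemble of convolutional layers, nonlinearities and others), and then pass them through the softmax activation to get the predictions $\tilde{\mathbf{Y}}$:
$$\tilde{\mathbf{Y}} = \mathrm{Softmax}(\mathrm{DNN}(\mathbf{X}, \Theta^{k-1}), \mathbf{W}^{k-1}).$$
Then the loss is computed (e.g., cross entropy) between $\mathbf{Y}$ and $\tilde{\mathbf{Y}}$: $\mathcal{L} = \mathrm{Loss}(\mathbf{Y}, \tilde{\mathbf{Y}})$.
Backpropagation: Update weights ($\Theta^{k-1}$, $\mathbf{W}^{k-1}$) by gradient descent (learning rate $\gamma$):
$$\mathbf{W}^{k} = \mathbf{W}^{k-1} - \gamma \frac{\partial \mathcal{L}}{\partial \tilde{\mathbf{Y}}} \cdot \frac{\partial \tilde{\mathbf{Y}}}{\partial \mathbf{W}}, \qquad \Theta^{k} = \Theta^{k-1} - \gamma \frac{\partial \mathcal{L}}{\partial \tilde{\mathbf{Y}}} \cdot \frac{\partial \tilde{\mathbf{Y}}}{\partial \tilde{\mathbf{X}}} \cdot \frac{\partial \tilde{\mathbf{X}}}{\partial \Theta}.$$
Once the model is optimized, the predicted labels for testing data $\mathbf{X}$ are:
$$\tilde{\mathbf{Y}} = \mathrm{Softmax}(\mathrm{DNN}(\mathbf{X}, \Theta), \mathbf{W}).$$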
Wang et al. [29] proposed to replace the data-agnostic softmax activation by a data-dependent interpolating function, defined in the next section.
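As a minimal sketch of the softmax training iteration described above, the following pure-Python snippet performs the forward pass, the cross-entropy loss, and the gradient-descent update for the output layer only. It assumes the DNN block is frozen and that its output is already available as a feature vector; the names `softmax`, `train_step`, `x_feat`, and `lr` are illustrative and are not taken from the authors' code.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_step(W, x_feat, y, lr):
    """One gradient-descent step for a linear softmax output layer.

    W      : list of rows, one weight vector per class (updated in place)
    x_feat : feature vector (assumed stand-in for the DNN block's output)
    y      : integer class label
    lr     : learning rate
    Returns the cross-entropy loss measured before the update.
    """
    logits = [sum(w_j * x_j for w_j, x_j in zip(row, x_feat)) for row in W]
    p = softmax(logits)
    loss = -math.log(p[y])
    for c, row in enumerate(W):
        g = p[c] - (1.0 if c == y else 0.0)   # dLoss/dlogit_c
        for j in range(len(row)):
            row[j] -= lr * g * x_feat[j]      # chain rule: dlogit_c/dW[c][j] = x_feat[j]
    return loss

# Toy usage: three classes, two features, one training example.
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
x_feat, y = [1.0, -0.5], 2
losses = [train_step(W, x_feat, y, lr=0.5) for _ in range(20)]
pred = max(range(3), key=lambda c: sum(w * v for w, v in zip(W[c], x_feat)))
```

In a full network the same chain rule continues through the features into the DNN block's weights; that part is omitted here to keep the example short.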
3.1 Manifold Interpolation - A Harmonic Extension Approach
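Let $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$ be a set of points in a high dimensional manifold $\mathcal{M} \subset \mathbb{R}^d$ and $\mathbf{X}^{\mathrm{te}} = \{\mathbf{x}^{\mathrm{te}}_1, \mathbf{x}^{\mathrm{te}}_2, \dots, \mathbf{x}^{\mathrm{te}}_m\}$ be a subset of $\mathbf{X}$ which are labeled with label function $g(\mathbf{x})$. We want to interpolate a function $u$ that is defined on the entire manifold and can be used to label the entire dataset $\mathbf{X}$. The harmonic extension is a natural and elegant approach to find such an interpolating function, which is defined by minimizing the Dirichlet energy functional:
$$\mathcal{E}(u) = \frac{1}{2} \sum_{\mathbf{x}, \mathbf{y} \in \mathbf{X}} w(\mathbf{x}, \mathbf{y}) \big(u(\mathbf{x}) - u(\mathbf{y})\big)^2, \tag{1}$$
with the boundary condition:
$$u(\mathbf{x}) = g(\mathbf{x}), \quad \mathbf{x} \in \mathbf{X}^{\mathrm{te}},$$
where $w(\mathbf{x}, \mathbf{y})$ is a weight function, typically chosen to be Gaussian: $w(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / \sigma^2)$ with $\sigma$ being a scaling parameter. The Euler-Lagrange equation for Eq. (1) is:
$$\begin{cases} \sum_{\mathbf{y} \in \mathbf{X}} \big(w(\mathbf{x}, \mathbf{y}) + w(\mathbf{y}, \mathbf{x})\big)\big(u(\mathbf{x}) - u(\mathbf{y})\big) = 0, & \mathbf{x} \in \mathbf{X} \setminus \mathbf{X}^{\mathrm{te}}, \\ u(\mathbf{x}) = g(\mathbf{x}), & \mathbf{x} \in \mathbf{X}^{\mathrm{te}}. \end{cases} \tag{2}$$
By solving the linear system (Eq. (2)), we obtain labels $u(\mathbf{x})$ for unlabeled data $\mathbf{x} \in \mathbf{X} \setminus \mathbf{X}^{\mathrm{te}}$. This interpolation becomes invalid when the labeled data is tiny, i.e., $|\mathbf{X}^{\mathrm{te}}| \ll |\mathbf{X} \setminus \mathbf{X}^{\mathrm{te}}|$. To resolve this issue, the weights of the labeled data are increased in the Euler-Lagrange equation, which gives:
$$\begin{cases} \sum_{\mathbf{y} \in \mathbf{X}} \big(w(\mathbf{x}, \mathbf{y}) + w(\mathbf{y}, \mathbf{x})\big)\big(u(\mathbf{x}) - u(\mathbf{y})\big) + \Big(\frac{|\mathbf{X}|}{|\mathbf{X}^{\mathrm{te}}|} - 1\Big) \sum_{\mathbf{y} \in \mathbf{X}^{\mathrm{te}}} w(\mathbf{y}, \mathbf{x})\big(u(\mathbf{x}) - u(\mathbf{y})\big) = 0, & \mathbf{x} \in \mathbf{X} \setminus \mathbf{X}^{\mathrm{te}}, \\ u(\mathbf{x}) = g(\mathbf{x}), & \mathbf{x} \in \mathbf{X}^{\mathrm{te}}. \end{cases} \tag{3}$$
The solution $u(\mathbf{x})$ to Eq. (3) is named weighted nonlocal Laplacian (WNLL), denoted as $\mathrm{WNLL}(\mathbf{X}, \mathbf{Y}^{\mathrm{te}})$, where $\mathbf{Y}^{\mathrm{te}}$ collects the labels $g(\mathbf{x})$ given on $\mathbf{X}^{\mathrm{te}}$. For classification tasks, $g(\mathbf{x})$ is the one-hot label for the example $\mathbf{x}$.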
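To make the interpolation concrete, here is a minimal sketch that solves the weighted Euler-Lagrange equation above on a tiny point set. It assumes a symmetric Gaussian weight, a scalar label channel (for one-hot labels the same solve would be repeated per class), and a plain Gauss-Seidel iteration instead of a sparse linear solver; `wnll_interpolate`, `sigma`, and `sweeps` are hypothetical names chosen for this illustration.

```python
import math

def wnll_interpolate(points, labeled, sigma=1.0, sweeps=500):
    """Gauss-Seidel solve of the WNLL equation on a small point set.

    points  : list of feature vectors (all points, labeled and unlabeled)
    labeled : dict {index: label value} for the labeled subset
    Returns the interpolated value u for every point.
    """
    n = len(points)

    def w(i, j):
        d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        return math.exp(-d2 / sigma ** 2)

    mu = n / len(labeled) - 1.0               # extra weight on labeled points
    u = [labeled.get(i, 0.0) for i in range(n)]
    free = [i for i in range(n) if i not in labeled]
    for _ in range(sweeps):
        for i in free:
            num = den = 0.0
            for j in range(n):
                if j == i:
                    continue                  # the j == i term is identically zero
                a = 2.0 * w(i, j)             # w(x, y) + w(y, x) for symmetric w
                if j in labeled:
                    a += mu * w(j, i)         # up-weighted labeled neighbours
                num += a * u[j]
                den += a
            u[i] = num / den                  # makes the equation hold at point i
    return u

# Six points on a line; only the two end points carry labels (0 and 1).
points = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]]
labeled = {0: 0.0, 5: 1.0}
u = wnll_interpolate(points, labeled, sigma=0.3)
```

Each update replaces the value at an unlabeled point by a weighted average of the other values, so the labeled points act as fixed boundary data that the interpolant smoothly extends.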
3.2 Training and Testing the DNNs with Data-Dependent Activation Function
In both training and testing of the WNLL activated DNNs, we need to reserve a small portion of data/label pairs, denoted as $(\mathbf{X}^{\mathrm{te}}, \mathbf{Y}^{\mathrm{te}})$, to interpolate the label $\mathbf{Y}$ for new data $\mathbf{X}$. We name the reserved data as the template. Directly replacing softmax by WNLL has difficulties in backpropagation, namely, the true gradient is difficult to compute since WNLL defines a very complex implicit function. Instead, to train WNLL activated DNNs, a proxy via an auxiliary neural net (Fig. 1 (c)) is employed. On top of the original DNNs, we add a buffer block (a fully connected layer followed by a ReLU), followed by two parallel branches, the WNLL and the linear (fully connected) layers. The auxiliary DNNs can be trained by alternating between training DNNs with linear and WNLL activations, respectively. The training loss of the WNLL activation function is backpropagated via a straight-through estimation approach [2, 4]. At test time, we remove the linear classifier from the neural nets and use the DNN and buffer blocks together with WNLL to predict new data (Fig. 1 (d)); here for simplicity, we merge the buffer block into the DNN block. For a given set of testing data $\mathbf{X}$ and the labeled template $(\mathbf{X}^{\mathrm{te}}, \mathbf{Y}^{\mathrm{te}})$, the predicted labels for $\mathbf{X}$ are given by
$$\tilde{\mathbf{Y}} = \mathrm{WNLL}\big(\mathrm{DNN}(\{\mathbf{X}, \mathbf{X}^{\mathrm{te}}\}, \Theta), \mathbf{Y}^{\mathrm{te}}\big).$$
4 Adversarial Attacks
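We consider three benchmark attack methods in this work, namely, the fast gradient sign method (FGSM) [7], iterative FGSM (IFGSM) [14], and Carlini-Wagner’s (CW-L2) [6] attacks. We denote the classifier defined by the DNNs with softmax activation as $\tilde{y} = f(\theta, \mathbf{x})$ for a given instance ($\mathbf{x}$, $y$). FGSM finds the adversarial image $\mathbf{x}'$ by maximizing the loss $\mathcal{L}(\mathbf{x}', y) \doteq \mathcal{L}(f(\theta, \mathbf{x}'), y)$, subject to the $\ell_\infty$ perturbation constraint $\|\mathbf{x}' - \mathbf{x}\|_\infty \le \epsilon$ with $\epsilon$ as the attack strength. Under the first order approximation, i.e., $\mathcal{L}(\mathbf{x}', y) \approx \mathcal{L}(\mathbf{x}, y) + \nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y)^T (\mathbf{x}' - \mathbf{x})$, the optimal perturbation is given by
$$\mathbf{x}' = \mathbf{x} + \epsilon \cdot \mathrm{sign}\big(\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y)\big). \tag{4}$$
IFGSM iterates FGSM to generate enhanced adversarial images, i.e.,
$$\mathbf{x}^{(m)} = \mathbf{x}^{(m-1)} + \epsilon \cdot \mathrm{sign}\big(\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}^{(m-1)}, y)\big), \tag{5}$$
where $m = 1, \dots, M$, $\mathbf{x}^{(0)} = \mathbf{x}$, and $\mathbf{x}' = \mathbf{x}^{(M)}$, with $M$ being the number of iterations.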
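The two gradient-sign attacks above can be sketched on a toy model whose input gradient is available in closed form. The snippet below assumes a binary logistic classifier in place of a DNN and clips the result to [0, 1]; `loss_and_grad`, `fgsm`, `ifgsm`, and the weights are hypothetical and serve only to illustrate the update rules.

```python
import math

def loss_and_grad(w, b, x, y):
    """Cross-entropy loss of a logistic classifier and its gradient w.r.t. x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grad = [(p - y) * wi for wi in w]         # dLoss/dx for this model
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def clip01(x):
    return [min(1.0, max(0.0, xi)) for xi in x]

def fgsm(w, b, x, y, eps):
    """Single step of size eps along the sign of the input gradient."""
    _, g = loss_and_grad(w, b, x, y)
    return clip01([xi + eps * sign(gi) for xi, gi in zip(x, g)])

def ifgsm(w, b, x, y, eps, num_iter):
    """Repeat the FGSM step num_iter times, recomputing the gradient each time."""
    x_adv = list(x)
    for _ in range(num_iter):
        x_adv = fgsm(w, b, x_adv, y, eps)
    return x_adv

# Toy usage: a three-pixel "image" with true label 1.
w, b = [2.0, -1.0, 0.5], -0.2
x, y, eps = [0.6, 0.3, 0.5], 1, 0.1
loss_clean, _ = loss_and_grad(w, b, x, y)
x_fgsm = fgsm(w, b, x, y, eps)
x_ifgsm = ifgsm(w, b, x, y, eps, num_iter=5)
loss_fgsm, _ = loss_and_grad(w, b, x_fgsm, y)
loss_ifgsm, _ = loss_and_grad(w, b, x_ifgsm, y)
```

For a real network the only change is that the input gradient comes from backpropagation rather than from a closed-form expression.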
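The CW-L2 attack is proposed to circumvent defensive distillation. For a given image-label pair $(\mathbf{x}, y)$ and any target class $t \neq y$, CW-L2 searches for the adversarial image that will be classified to class $t$ by solving the optimization problem:
$$\min_{\delta} \|\delta\|_2^2, \quad \text{subject to} \quad f(\mathbf{x} + \delta) = t, \ \ \mathbf{x} + \delta \in [0, 1]^n, \tag{6}$$
where $\delta$ is the adversarial perturbation (for simplicity, we ignore the dependence of $\theta$ in $f$).
The equality constraint in Eq. (6) is hard to satisfy, so instead Carlini et al. consider the surrogate
$$h(\mathbf{x}) = \max\Big(\max_{i \neq t} Z(\mathbf{x})_i - Z(\mathbf{x})_t, \ 0\Big), \tag{7}$$
where $Z(\mathbf{x})$ is the logit vector for an input $\mathbf{x}$, i.e., the output of the neural net before the softmax layer, and $Z(\mathbf{x})_i$ is the logit value corresponding to class $i$. It is easy to see that $f(\mathbf{x} + \delta) = t$ is equivalent to $h(\mathbf{x} + \delta) \le 0$. Therefore, the problem in Eq. (6) can be reformulated as
$$\min_{\delta} \|\delta\|_2^2 + c \cdot h(\mathbf{x} + \delta), \quad \text{subject to} \quad \mathbf{x} + \delta \in [0, 1]^n, \tag{8}$$
where $c \ge 0$ is the Lagrangian multiplier.
By letting $\delta = \frac{1}{2}(\tanh(\mathbf{w}) + 1) - \mathbf{x}$, Eq. (8) can be converted to an unconstrained optimization problem. Moreover, Carlini et al. introduce the confidence parameter $\kappa$ into the above formulation. Above all, CW-L2 attacks seek adversarial images by solving the following problem
$$\min_{\mathbf{w}} \Big\|\tfrac{1}{2}(\tanh(\mathbf{w}) + 1) - \mathbf{x}\Big\|_2^2 + c \cdot \max\Big(-\kappa, \ \max_{i \neq t} Z\big(\tfrac{1}{2}(\tanh(\mathbf{w}) + 1)\big)_i - Z\big(\tfrac{1}{2}(\tanh(\mathbf{w}) + 1)\big)_t\Big). \tag{9}$$
This unconstrained optimization problem can be solved efficiently by the Adam optimizer [13]. All three of the attacks clip the values of the adversarial image $\mathbf{x}'$ to between 0 and 1.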
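The following sketch evaluates the unconstrained CW-L2 objective described above for a given optimization variable. It only computes the objective value (the optimizer itself is omitted) and uses a hypothetical three-class linear map `toy_logits` in place of a network; `cw_objective` and its argument names are likewise assumptions made for illustration.

```python
import math

def cw_objective(w, x, logits_fn, t, c, kappa):
    """Value of the unconstrained CW-L2 objective for the variable w.

    The candidate adversarial image is x_adv = (tanh(w) + 1) / 2, which lies
    in [0, 1] for every real w, so no box constraint is needed.
    Returns (objective value, x_adv).
    """
    x_adv = [0.5 * (math.tanh(wi) + 1.0) for wi in w]
    dist = sum((a - b) ** 2 for a, b in zip(x_adv, x))        # squared L2 distance
    z = logits_fn(x_adv)
    best_other = max(zi for i, zi in enumerate(z) if i != t)
    margin = max(best_other - z[t], -kappa)                   # floored at -kappa
    return dist + c * margin, x_adv

def toy_logits(x):
    """Hypothetical 3-class linear 'network' used only for illustration."""
    return [x[0] - x[1], x[1] - x[0], 0.0]

x = [0.8, 0.2]                                                # classified as class 0
w_clean = [math.atanh(2.0 * xi - 1.0) for xi in x]            # maps back to x itself
w_flip = [math.atanh(2.0 * xi - 1.0) for xi in (0.2, 0.8)]    # a candidate for class 1
obj_clean, _ = cw_objective(w_clean, x, toy_logits, t=1, c=1.0, kappa=0.0)
obj_flip, x_adv = cw_objective(w_flip, x, toy_logits, t=1, c=1.0, kappa=0.0)
```

At `w_clean` the distance term is zero and the whole objective is the classification margin; at `w_flip` the margin term vanishes because the target class already wins, and only the distortion remains.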
4.1 Adversarial Attack for DNNs with WNLL Activation Function
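In this work, we focus on untargeted attacks and defend against them. For a given small batch of testing images $(\mathbf{X}, \mathbf{Y})$ and template $(\mathbf{X}^{\mathrm{te}}, \mathbf{Y}^{\mathrm{te}})$, we denote the DNNs modified with WNLL as output activation as $\tilde{\mathbf{Y}} = \mathrm{WNLL}(Z(\{\mathbf{X}, \mathbf{X}^{\mathrm{te}}\}), \mathbf{Y}^{\mathrm{te}})$, where $Z(\{\mathbf{X}, \mathbf{X}^{\mathrm{te}}\})$ is the composition of the DNN and buffer blocks defined in Fig. 1 (c). By ignoring dependence of the loss function on the parameters, the loss function for DNNs with WNLL activation can be written as $\tilde{\mathcal{L}}(\mathbf{X}, \mathbf{Y}, \mathbf{X}^{\mathrm{te}}, \mathbf{Y}^{\mathrm{te}})$. The above attacks for DNNs with WNLL activation on the batch of images, $\mathbf{X}$, are formulated below.
FGSM:
$$\mathbf{X}' = \mathbf{X} + \epsilon \cdot \mathrm{sign}\big(\nabla_{\mathbf{X}} \tilde{\mathcal{L}}(\mathbf{X}, \mathbf{Y}, \mathbf{X}^{\mathrm{te}}, \mathbf{Y}^{\mathrm{te}})\big). \tag{10}$$
IFGSM:
$$\mathbf{X}^{(m)} = \mathbf{X}^{(m-1)} + \epsilon \cdot \mathrm{sign}\big(\nabla_{\mathbf{X}} \tilde{\mathcal{L}}(\mathbf{X}^{(m-1)}, \mathbf{Y}, \mathbf{X}^{\mathrm{te}}, \mathbf{Y}^{\mathrm{te}})\big), \tag{11}$$
where $m = 1, 2, \dots, M$; $\mathbf{X}^{(0)} = \mathbf{X}$ and $\mathbf{X}' = \mathbf{X}^{(M)}$.
CW-L2:
$$\min_{\mathbf{W}} \Big\|\tfrac{1}{2}(\tanh(\mathbf{W}) + 1) - \mathbf{X}\Big\|_2^2 + c \cdot \max\Big(-\kappa, \ \max_{i \neq t} Z\big(\tfrac{1}{2}(\tanh(\mathbf{W}) + 1)\big)_i - Z\big(\tfrac{1}{2}(\tanh(\mathbf{W}) + 1)\big)_t\Big), \tag{12}$$
where $Z(\mathbf{X})$ is the logit values of the input images $\mathbf{X}$.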
Based on our numerical experiments, the batch size of $\mathbf{X}$ has minimal influence on the adversarial attack and defense. In all of our experiments we choose the batch size of $\mathbf{X}$ to be . Similar to [29], we choose the size of the template to be .
We apply the above attack methods to ResNet-56 [11] with either softmax or WNLL as the output activation function. For IFGSM, we run 10 iterations of Eqs. (5) and (11) to attack DNNs with two different output activations, respectively. For CW-L2 attacks (Eqs. (9, 12)) in both scenarios, we set the parameters and . Figure 2 depicts three randomly selected images (horse, automobile, airplane) from the CIFAR-10 dataset, their adversarial versions by different attack methods on ResNet-56 with two kinds of activation functions, and the TV minimized images. All attacks successfully fool the classifiers, which fail to classify any of them correctly. Figure 2 (a) shows that FGSM and IFGSM with perturbation change the contrast of the images, while it is still easy for humans to correctly classify them. The adversarial images of the CW-L2 attacks are imperceptible; however, they are extremely strong in fooling DNNs. Figure 2 (b) shows the images of (a) with a stronger attack, . With a larger $\epsilon$, the adversarial images become noisier. The TV minimized images of Fig. 2 (a) and (b) are shown in Fig. 2 (c) and (d), respectively. The TVM removes a significant amount of detailed information from the original and adversarial images; meanwhile, it also makes it harder for humans to classify the TV-minimized versions of both the original and adversarial images. Visually, it is hard to discern the adversarial images resulting from attacking the DNNs with two types of output layers.
5 Analysis of the Geometry of Features
We consider the geometry of features of the original and adversarial images. We randomly select 1000 training and 100 testing images from the airplane and automobile classes, respectively. We consider two visualization strategies for ResNet-56 with softmax activation. First, we extract the original 64D features output from the layer before the softmax and apply principal component analysis (PCA) to reduce them to 2D. However, the principal components (PCs) do not encode the entire geometric information of the features. Alternatively, we add a 2 by 2 fully connected (FC) layer before the softmax, then utilize the 2D features output from this newly added layer. We verify that the newly added layer does not change the performance of ResNet-56: as shown in Fig. 3, the training and testing performance remains essentially the same for these two cases.
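The PCA-based visualization strategy can be sketched as follows. This is an illustrative pure-Python implementation using power iteration with deflation on the covariance matrix, applied to a 3D toy feature set rather than the 64D ResNet-56 features; `pca_project` and its parameters are hypothetical names, and a numerical library would normally be used instead.

```python
def pca_project(features, n_components=2, iters=200):
    """Project feature vectors onto their leading principal components."""
    n, d = len(features), len(features[0])
    mean = [sum(f[k] for f in features) / n for k in range(d)]
    centered = [[f[k] - mean[k] for k in range(d)] for f in features]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    comps = []
    for _ in range(n_components):
        v = [1.0 / (k + 1) for k in range(d)]         # fixed, generic start vector
        for _ in range(iters):
            v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            for c in comps:                           # stay orthogonal to earlier PCs
                dot = sum(vi * ci for vi, ci in zip(v, c))
                v = [vi - dot * ci for vi, ci in zip(v, c)]
            norm = sum(vi * vi for vi in v) ** 0.5
            v = [vi / norm for vi in v]
        comps.append(v)
    return [[sum(r[k] * c[k] for k in range(d)) for c in comps]
            for r in centered]

# Toy features: large spread along axis 0, small along axis 1, none along axis 2.
features = [[float(i), 0.5 * (-1) ** i, 0.0] for i in range(10)]
proj = pca_project(features)
var1 = sum(p[0] ** 2 for p in proj) / len(proj)
var2 = sum(p[1] ** 2 for p in proj) / len(proj)
```

The two projected coordinates capture all of the variance of this rank-2 toy set; for real 64D features they capture only part of it, which is the limitation the text points out.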
Figure 4 (a) and (b) show the 2D features generated by ResNet-56 with additional FC layer for the original and adversarial testing images, respectively, where we generate the adversarial images by using FGSM (). Before adversarial perturbation (Fig. 4 (a)), there is a straight line that can easily separate the two classes. The small perturbation causes the features to overlap and there is no linear classifier that can easily separate these two classes (Fig. 4 (b)). The first two PCs of the 64D features of the clean and adversarial images are shown in Fig. 4 (c) and (d), respectively. Again, the PCs are well separated for clean images, while adversarial perturbation causes overlap and concentration.
The bottom charts of Fig. 4 depict the first two PCs of the 64D features output from the layer before the WNLL. The distributions of the unperturbed training and testing data are the same, as illustrated in panels (e) and (f). The new features are better separated, which indicates that DNNs with WNLL are more robust to small random perturbations. Panels (g) and (h) plot the features of the adversarial and TV minimized adversarial images in the test set. The adversarial attacks move the automobiles’ features to the airplanes’ region. The TVM helps to eliminate the outliers. Based on our computation, most of the adversarial images of the airplane class can be correctly classified with the interpolating function. The training data guides the interpolating function to classify adversarial images correctly. The fact that adversarial perturbation changes the features’ distribution was also noticed in [26].
6 Adversarial Defense by Interpolating Function and TVM
To defend against adversarial attacks, we combine the ideas of data-dependent activation, input transformation, and training data augmentation. We train ResNet-56, respectively, on the original training data, the TV minimized training data, and a combination of the previous two. On top of the data-dependent activation output and augmented training, we further apply the TVM [24] used by [10] to transform the adversarial images to boost defensive performance. The basic idea is to reconstruct the simplest image $\mathbf{z}$ from the sub-sampled image, $\mathbf{M} \odot \mathbf{x}$, with $\mathbf{M}$ the mask filled by a Bernoulli binary random variable, by solving the following TVM problem
$$\min_{\mathbf{z}} \|\mathbf{M} \odot (\mathbf{z} - \mathbf{x})\|_2 + \lambda_{TV} \cdot \mathrm{TV}(\mathbf{z}), \tag{13}$$
where $\mathrm{TV}(\mathbf{z})$ is the total variation of $\mathbf{z}$ and $\lambda_{TV} > 0$ is the regularization constant.
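A minimal sketch of masked TV minimization is given below. To keep it short and differentiable it departs from the formulation above in ways that should be read as assumptions: the data term is squared, the total variation is anisotropic and smoothed by a small constant `beta`, the mask uses 1 for a kept pixel, and plain gradient descent replaces a dedicated solver. The names `tv_smooth` and `tvm_reconstruct` are hypothetical.

```python
import random

def tv_smooth(z, beta):
    """Smoothed anisotropic total variation of a 2D image (list of rows)."""
    h, w = len(z), len(z[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                total += ((z[i + 1][j] - z[i][j]) ** 2 + beta) ** 0.5
            if j + 1 < w:
                total += ((z[i][j + 1] - z[i][j]) ** 2 + beta) ** 0.5
    return total

def tvm_reconstruct(x, mask, lam=0.2, beta=1e-2, step=0.05, iters=300):
    """Gradient descent on 0.5*||mask*(z - x)||^2 + lam * tv_smooth(z, beta)."""
    h, w = len(x), len(x[0])
    z = [row[:] for row in x]
    for _ in range(iters):
        # Gradient of the data term: only kept pixels (mask == 1) are tied to x.
        g = [[mask[i][j] * (z[i][j] - x[i][j]) for j in range(w)] for i in range(h)]
        # Gradient of the smoothed TV term, one neighbouring pair at a time.
        for i in range(h):
            for j in range(w):
                for di, dj in ((1, 0), (0, 1)):
                    ii, jj = i + di, j + dj
                    if ii < h and jj < w:
                        d = z[ii][jj] - z[i][j]
                        t = lam * d / (d * d + beta) ** 0.5
                        g[ii][jj] += t
                        g[i][j] -= t
        z = [[z[i][j] - step * g[i][j] for j in range(w)] for i in range(h)]
    return z

# Toy usage: a noisy 6x6 step edge with a random Bernoulli(0.5) mask.
rng = random.Random(0)
clean = [[0.0] * 3 + [1.0] * 3 for _ in range(6)]
noisy = [[v + rng.uniform(-0.2, 0.2) for v in row] for row in clean]
mask = [[1 if rng.random() < 0.5 else 0 for _ in row] for row in noisy]
z = tvm_reconstruct(noisy, mask)
```

Pixels with mask value 0 are constrained only by the TV term, so they are filled in from their neighbours, which is what removes small high-frequency perturbations.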
7 Numerical Results
7.1 Transferability of the Adversarial Images
To verify the efficacy of attack methods for DNNs with WNLL output activation, we consider the transferability of adversarial images. We train ResNet-56 on the aforementioned three types of training data with either softmax or WNLL activation. After the DNNs are trained, we attack them by FGSM, IFGSM, and CW-L2 with different $\epsilon$. Finally, we classify the adversarial images by using ResNet-56 with the other activation. We list the mutual classification accuracy on adversarial images in Table 1. The adversarial images resulting from attacking DNNs with the two types of activation functions are both transferable, as the mutual classification accuracy is significantly lower than when testing on the clean images. Overall, when applying ResNet-56 with WNLL activation to classify the adversarial images resulting from attacking ResNet-56 with softmax activation, the network has a remarkably higher accuracy. For instance, for DNNs that are trained on the original images and attacked by FGSM, DNNs with the WNLL classifier have at least 5.4% higher accuracy (56.3% vs. 61.7% ()). The accuracy improvement is more significant in many other scenarios.
|Attack Method||Training data|
|Classification accuracy of ResNet-56 with softmax on adversarial images produced by attacking ResNet-56 with WNLL|
|FGSM||Original + TVM data||62.9||61.7||60.6||59.4||58.9|
|IFGSM||Original + TVM data||53.9||49.2||44.7||41.9||39.9|
|CW-L2||Original + TVM data||81.5||81.5||81.8||81.2||81.5|
|Classification accuracy of ResNet-56 with WNLL on adversarial images produced by attacking ResNet-56 with softmax|
|FGSM||Original + TVM data||69.7||67.6||65.5||64.8||63.4|
|IFGSM||Original + TVM data||60.0||53.0||47.5||41.6||38.4|
|CW-L2||Original + TVM data||90.6||90.6||90.5||90.1||90.4|
7.2 Adversarial Defense
Figure 5 plots the results of adversarial defense by combining the WNLL activation, TVM, and training data augmentation. Panels (a), (b) and (c) show the testing accuracy of ResNet-56 with and without defense on CIFAR-10 data for FGSM, IFGSM, and CW-L2, respectively. It can be observed that with increasing attack strength, $\epsilon$, the testing accuracy decreases rapidly. FGSM is a relatively weak attack method, as the accuracy remains above 53.5% () even with the strongest attack. Meanwhile, the defense maintains accuracy above 71.8% (). Figure 5 (b) and (c) show that both IFGSM and CW-L2 can fool ResNet-56 almost completely even with small $\epsilon$. The defense maintains the accuracy above 68.0% and 57.2%, respectively, under the CW-L2 and IFGSM attacks. Compared to PixelDefend, the state-of-the-art defensive method on CIFAR-10, our method is much simpler and faster. Without adversarial training, we have shown that our defense is more stable to IFGSM, and more stable to all three attacks under the strongest attack, than PixelDefend [26]. Moreover, our defense strategy is additive to adversarial training and many other defenses including PixelDefend.
To analyze the defensive contribution from each component of the defensive strategy, we separate the three parts and list the testing accuracy in Table 2. Simple TVM cannot defend against FGSM attacks except when the DNNs are trained on the augmented data, as shown in the first and fourth horizontal blocks of the table. WNLL activation improves the testing accuracy under adversarial attacks significantly and persistently. Augmented training can improve the stability consistently as well.
|Attack Method||Training data|
|FGSM||Original + TVM data||93.1||63.2/66.6||62.7/67.8||62.4/68.7||62.0/68.1||61.3/68.7|
|IFGSM||Original + TVM data||93.1||32.1/61.5||24.5/57.4||20.1/54.1||17.1/51.3||15.9/48.9|
|CW-L2||Original + TVM data||93.1||13.6/62.2||13.6/62.2||13.0/62.1||12.0/62.1||12.0/61.9|
|Data-Dependent Activated ResNet-56|
|FGSM||Original + TVM data||94.7||70.6/71.8||68.8/73.1||67.2/74.9||66.9/73.6||63.7/74.1|
|IFGSM||Original + TVM data||94.7||35.0/67.4||25.1/64.9||20.5/61.9||17.5/58.7||16.3/57.2|
|CW-L2||Original + TVM data||94.7||61.6/68.6||61.1/68.0||61.9/68.1||61.2/69.2||61.5/68.7|
8 Concluding Remarks
In this paper, by analyzing the influence of adversarial perturbations on the geometric structure of the DNNs’ features, we propose to defend against adversarial attack by applying a data-dependent activation function, total variation minimization on the adversarial images, and training data augmentation. Results on ResNet-56 with CIFAR-10 benchmark reveal that the defense improves robustness to adversarial perturbation significantly. Total variation minimization simplifies the adversarial images, which is very useful in removing adversarial perturbation. Another interesting direction to explore is to apply other denoising methods to remove adversarial perturbation. Moreover, we noticed that an adversarial perturbation changes the features’ distribution severely, and one possible way to correct this is to design algorithms that purify the adversarial images.
This material is based on research sponsored by the Air Force Research Laboratory and DARPA under agreement number FA8750-18-2-0066; by the U.S. Department of Energy, Office of Science, and by the National Science Foundation, under Grant Numbers DOE-SC0013838 and DMS-1554564 (STROBE); and by NSF DMS-1737770 and the Simons Foundation. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
- [1] N. Akhtar and A. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv preprint arXiv:1801.00553, 2018.
- [2] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. International Conference on Machine Learning, 2018.
- [3] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. International Conference on Machine Learning, 2018.
- [4] Y. Bengio, N. Leonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- [5] W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
- [6] N. Carlini and D. A. Wagner. Towards evaluating the robustness of neural networks. IEEE European Symposium on Security and Privacy, pages 39–57, 2016.
- [7] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- [8] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel. Adversarial perturbations against deep neural networks for malware classification. arXiv preprint arXiv:1606.04435, 2016.
- [9] A. Giusti, J. Guzzi, D. C. Ciresan, F. L. He, J. P. Rodriguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, pages 661–667, 2016.
- [10] C. Guo, M. Cisse, and L. van der Maaten. Countering adversarial images using input transformations. International Conference on Learning Representations, 2018.
- [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [12] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. International Conference on Machine Learning, 2018.
- [13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [14] A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
- [15] Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
- [16] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations, 2018.
- [17] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [18] S.-M. Moosavi-Dezfooli, A. Shrivastava, and O. Tuzel. Divide, denoise, and defend against adversarial attacks. arXiv preprint arXiv:1802.06806, 2018.
- [19] T. Na, J. H. Ko, and S. Mukhopadhyay. Cascade adversarial machine learning regularized with a unified embedding. International Conference on Learning Representations, 2018.
- [20] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. IEEE European Symposium on Security and Privacy, pages 372–387, 2016.
- [21] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman. SoK: Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814, 2016.
- [22] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. IEEE European Symposium on Security and Privacy, 2016.
- [23] N. Papernot, P. McDaniel, and I. J. Goodfellow. Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.
- [24] L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, pages 259–268, 1992.
- [25] P. Samangouei, M. Kabkab, and R. Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. International Conference on Learning Representations, 2018.
- [26] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. International Conference on Learning Representations, 2018.
- [27] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, and I. Goodfellow. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- [28] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018.
- [29] B. Wang, X. Luo, Z. Li, W. Zhu, Z. Shi, and S. Osher. Deep neural nets with interpolating function as output activation. Advances in Neural Information Processing Systems, 2018.
- [30] X. Wu, U. Jang, J. Chen, L. Chen, and S. Jha. Reinforcing adversarial robustness using model confidence induced by adversarial training. International Conference on Machine Learning, 2018.