Learning to Disentangle Robust and Vulnerable Features for Adversarial Detection

Byunggill Joe (cp4419@kaist.ac.kr), Sung Ju Hwang (sjhwang82@kaist.ac.kr), and Insik Shin

Although deep neural networks have shown promising performance on various tasks, even achieving human-level performance on some, they are susceptible to incorrect predictions under imperceptibly small perturbations of the input. A large number of previous works propose to defend against such adversarial attacks either by robust inference or by detection of adversarial inputs. Yet most of them cannot effectively defend against whitebox attacks, where the adversary has knowledge of the model and the defense. More importantly, they do not provide a convincing explanation of why the generated adversarial inputs successfully fool the target models. To address these shortcomings of the existing approaches, we hypothesize that adversarial inputs are tied to latent features that are susceptible to adversarial perturbation, which we call vulnerable features. Based on this intuition, we propose a minimax game formulation to disentangle the latent features of each instance into robust and vulnerable ones, using variational autoencoders with two latent spaces. We thoroughly validate our model for both blackbox and whitebox attacks on the MNIST, Fashion MNIST5, and Cat & Dog datasets; the results show that adversarial inputs cannot bypass our detector without changing their semantics, in which case the attack has failed.

1 Introduction

Although deep neural networks have achieved impressive performance on many tasks, sometimes even surpassing human performance, researchers have found that they can be easily fooled by even slight perturbations of inputs. Adversarial examples, which are deliberately generated to change the output without inducing semantic changes from the perspective of human perception Szegedy et al. (2014); Goodfellow et al. (2015), can sometimes bring the accuracy of a model down to zero percent. Many previous works try to solve this problem in the form of robust inference Gu and Rigazio (2015); Ma̧dry et al. (2018); Zheng et al. (2016); Samangouei et al. (2018); Schott et al. (2019); Dhillon et al. (2018), which aims to obtain correct results even for adversarial inputs, or by detecting adversarial inputs Ma et al. (2019); Feinman et al.; Gong et al. (2017); Grosse et al.; Pang et al. (2018); Xu et al. (2018); Wang et al. (2018).

However, most previous works do not hold up in whitebox attack scenarios, where the adversary has the same knowledge as the defender. Most past defenses that at one time successfully defended against adversarial attacks were later broken. For example, adversarial defenses leveraging randomness, obfuscated gradients, input denoising, or neuron activations, once deemed robust, were later broken by sophisticated whitebox attacks such as expectation over transformation or backward pass differentiable approximation Athalye et al. (2018); Carlini and Wagner (2017a).

Figure 1: Latent spaces of features

We hypothesize that the failures of previous defenses are caused by the existence of features that are more susceptible to adversarial perturbations, which misguide the model into incorrect predictions. Such a concept of vulnerable features has been addressed in a few existing works Tsipras et al. (2018); Ilyas et al. (2019). These approaches define the vulnerability of input features based on human perception, on the amount of perturbation at the network output, or on correlation with the label. Our hypothesis takes a different perspective on vulnerability and robustness: we assume that there exists a latent feature space in which the adversarial inputs to a label form a common latent distribution. That is, vulnerable and robust features by our definition are not input features but latent features residing in hypothetical spaces. While the vulnerability or robustness of features is assumed given and fixed in existing work, this redefinition allows us to explicitly learn to disentangle features into robust and vulnerable ones by learning their feature spaces.

Toward this goal, we propose a variational autoencoder with two latent feature spaces, for robust and vulnerable features respectively. We train this model using a two-player mini-max game, where the adversary tries to maximize the probability that adversarially perturbed instances fall in the robust feature space, and the defender tries to minimize this probability. This procedure of learning to disentangle robust and vulnerable features further allows us to detect adversarial inputs based on the likelihood of their features belonging to either of the two feature categories (see Figure 1).

We validate our model on multiple datasets, namely MNIST LeCun et al. (1998), Fashion MNIST5 Xiao et al. (2017), and Cat & Dog [15], and show that attacks cannot bypass our detector without incurring semantic changes to the input images, in which case the attack has failed.

Our contributions can be summarized as follows:

  1. We empirically show that adversarial attacks are a negative side effect of vulnerable features, which are a byproduct of implicit representation learning algorithms.

  2. From the above empirical observation, we hypothesize that different adversarial inputs to a label form a common latent distribution.

  3. Based on this hypothesis, we propose a new defense mechanism based on variational autoencoders with two latent spaces, trained with a two-player mini-max game to learn the two latent spaces, one each for robust and vulnerable features, and use it as an adversarial input detector.

  4. We conduct blackbox and whitebox attacks on our detector and show that adversarial examples cannot bypass it without inducing semantic changes, which means the attack fails.

2 Related work

After the severe defects of neural networks against adversarial inputs were revealed Szegedy et al. (2014); Goodfellow et al. (2015), researchers found more sophisticated attacks to fool neural networks Goodfellow et al. (2015); Ma̧dry et al. (2018); Dong et al. (2018); Carlini and Wagner (2017b); Moosavi-Dezfooli et al. (2016); Chen et al. (2018). On the other side, many researchers have proposed defense mechanisms against such attacks, in the form of robust prediction Gu and Rigazio (2015); Ma̧dry et al. (2018); Zheng et al. (2016); Samangouei et al. (2018); Schott et al. (2019); Dhillon et al. (2018) or detection of adversarial inputs Ma et al. (2019); Feinman et al.; Gong et al. (2017); Grosse et al.; Pang et al. (2018); Xu et al. (2018); Wang et al. (2018). Many of them rely on randomness, obfuscated gradients, or the distribution of neuron activations, which are known to malfunction under adaptive whitebox attacks Athalye et al. (2018); Carlini and Wagner (2017a). Some previous works studied the root causes of these intriguing defects. Goodfellow et al. (2015) suggest that there should be vulnerability in benign input distributions, interpreting deep neural networks as linear classifiers. Tsipras et al. (2018) further propose a dichotomy of robust and non-robust features, and analyze the intrinsic trade-off between robustness and accuracy. Concurrently to this work, Ilyas et al. (2019) find that non-robust features suffice for achieving good accuracy on benign inputs, and they provide a theoretical framework to analyze the non-robust features. In contrast, we provide a hypothesis about the latent distributions of vulnerable features, and disentangle the vulnerable latent space from the entire latent feature space under adaptive whitebox attacks.

3 Background and Motivational Experiment


The key premise of our proposed framework is that standard neural-network classifiers implicitly learn two types of features: robust and vulnerable features. The robust features, in an intuitive sense, correspond to signals that make semantic sense to humans, such as textures, colors, local shapes, or patches in the image domain. The vulnerable features are imperceptible to human senses but are leveraged by models for prediction, since they help lower the training loss. We posit that adversarial attacks exploit vulnerable features to imperceptibly perturb inputs in a way that induces erroneous predictions.


We now present the notation used throughout this paper. Firstly, x is an input in d dimensions. We distinguish between a benign input x and an adversarial input x′, where ‖x′ − x‖_p ≤ ε; typical p is one of 2 and ∞. We use ordered sets with matching subscripts b and a to indicate sets of benign and adversarial inputs, X_b and X_a, and a superscript y to indicate label y. Given an input x, the true label of x is y, and y′ is the inaccurate (misled) label of the corresponding x′. Y_b and Y_a are ordered sets of the labels of X_b and X_a that share the same indices with them. The number of unique labels in classification is N. We denote a dataset as a pair (X, Y). p(z_r^y) and p(z_v^y) are the probability density functions of the robust features z_r and the vulnerable features z_v, respectively, for the inputs of label y. θ^y is the set of model parameters of the variational autoencoder for label y (i.e., VAE^y).

Attack methods.

We briefly explain representative attack methods used in this paper.

Fast Gradient Sign Method (FGSM) is a one-step method proposed by Goodfellow et al. (2015). With L(x, y) the training loss, the attack takes the sign of the input gradient and perturbs x with a size parameter ε to increase L, resulting in an unexpected prediction: x′ = x + ε · sign(∇_x L(x, y)).
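As a concrete illustration, the update only needs the sign of the loss gradient with respect to the input. The following is a minimal numpy sketch (the toy logistic model and the function names are ours, not the paper's implementation):

```python
import numpy as np

def fgsm_perturb(x, grad_loss_x, eps):
    """One-step FGSM: move x by eps in the sign direction of dL/dx,
    then clip back to the valid pixel range [0, 1]."""
    return np.clip(x + eps * np.sign(grad_loss_x), 0.0, 1.0)

def logistic_loss_grad(w, x):
    """Gradient of L = -log sigmoid(w . x) with respect to x (true label fixed),
    used here as a stand-in for a network's input gradient."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return -(1.0 - p) * w
```

Perturbing along this direction increases the loss, so even the toy classifier's confidence in the true label drops after a single step.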

Projected Gradient Descent (PGD) is an iterative version of the FGSM attack with a random start Ma̧dry et al. (2018). To generate x′, the attack repeatedly perturbs the input with a step size α based on the sign of ∇_x L. It limits its search to the ε-ball around x in the input space, implemented as a clipping function with bound ε. It initializes the attack by adding uniform noise to the original image: x′_0 = x + U(−ε, ε).
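The iterate-then-project loop can be sketched as follows (a numpy outline under our own naming; `grad_fn` is a hypothetical callable returning the loss gradient at the current iterate):

```python
import numpy as np

def pgd_attack(x0, grad_fn, eps, alpha, steps, seed=0):
    """Iterative FGSM with a random start; each step is projected back into
    the L-infinity ball of radius eps around x0 and the valid range [0, 1]."""
    rng = np.random.default_rng(seed)
    x = np.clip(x0 + rng.uniform(-eps, eps, size=x0.shape), 0.0, 1.0)
    for _ in range(steps):
        x = x + alpha * np.sign(grad_fn(x))
        x = np.clip(x, x0 - eps, x0 + eps)   # projection: stay in the eps-ball
        x = np.clip(x, 0.0, 1.0)             # stay in the input domain
    return x
```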

Momentum Iterative Method (MIM) introduces the concept of momentum into the PGD attack Dong et al. (2018). Instead of updating directly from the current gradient, it accumulates a velocity g_{t+1} = μ · g_t + ∇_x L / ‖∇_x L‖_1 with a decay factor μ. It then updates x′ with the sign of g_{t+1} and the step size α, while limiting the search space to the ε-ball around x.
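A minimal sketch of the momentum accumulation (our own naming; as above, `grad_fn` is a hypothetical gradient callable):

```python
import numpy as np

def mim_attack(x0, grad_fn, eps, alpha, steps, decay=1.0):
    """Momentum variant of PGD: the update direction is the sign of an
    accumulated velocity (decayed momentum plus the L1-normalized gradient)."""
    x = x0.copy()
    g = np.zeros_like(x0)
    for _ in range(steps):
        grad = grad_fn(x)
        g = decay * g + grad / (np.sum(np.abs(grad)) + 1e-12)
        x = np.clip(x + alpha * np.sign(g), x0 - eps, x0 + eps)
    return x
```

Normalizing by the L1 norm keeps the momentum scale comparable across steps, so the decay factor alone controls how much history influences the direction.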

Carlini & Wagner Method (CW-L2) Carlini and Wagner (2017b) introduces a change of variables x′ = (tanh(w) + 1) / 2 and searches for adversarial inputs in the w space, because this relaxes the discontinuity at the minimum and maximum input values. To minimize the distortion of adversarial inputs, it incorporates a loss term ‖x′ − x‖₂². It defines a function f(x′) = max(Z(x′)_y − max_{i≠y} Z(x′)_i, −κ) to induce misclassification, where Z(x′)_i is the logit value of label i right before the softmax layer, and y is the label of the original input x. The confidence parameter κ demands larger differences in logit values, resulting in high-confidence adversarial inputs. The method balances the distortion loss and the misclassification loss with a binary search over a coefficient.
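The two ingredients can be sketched directly (our own helper names; this is the standard CW formulation, not the paper's code):

```python
import numpy as np

def cw_f(logits, true_label, kappa=0.0):
    """CW's misclassification term f = max(Z_y - max_{i != y} Z_i, -kappa).
    Driving f down to -kappa forces some other logit above the true one by a
    margin of kappa, so larger kappa yields higher-confidence adversarials."""
    z = np.asarray(logits, dtype=float)
    z_other = np.max(np.delete(z, true_label))
    return max(z[true_label] - z_other, -kappa)

def tanh_to_input(w):
    """Change of variables x' = (tanh(w) + 1) / 2: x' stays strictly inside
    (0, 1), removing the box constraint from the optimization."""
    return (np.tanh(np.asarray(w, dtype=float)) + 1.0) / 2.0
```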

¹The sign of the D loss term depends on whether an attack uses gradient descent (+) or ascent (−).

3.1 Vulnerability of Implicit Feature Learning

By showing that vulnerable features are prevalent in benign datasets (X_b, Y_b), we validate our argument that implicit learning algorithms, without explicit regularization to prevent it, may lead models to learn vulnerable features that an adversary can exploit. We now provide evidence in support of this: a benign dataset (X_b, Y_b) can be classified with good accuracy by a classifier C_v that is trained with a dataset (X_v, Y_v) containing only vulnerable features.

The basic idea for constructing (X_v, Y_v) is to leverage a dataset of adversarial inputs (X_a, Y_a) that an arbitrary attack generates to fool a classifier pre-trained with (X_b, Y_b), where Y_a is a set of incorrect labels. The adversarial inputs cause the classifier to make erroneous predictions toward a target label y′. That is, each adversarial input contains a set of vulnerable features that the classifier learned to identify as y′, because the input does not contain any semantically meaningful features of y′ (see Figure 2). Based on this reasoning, we construct (X_v, Y_v) by attacking the classifier and accumulating the adversarial inputs X_a with the inaccurate labels Y_a. For clarity, (X_v, Y_v) should not contain robust features; it should not include benign inputs and should not allow large distortions that could change the semantics of the inputs.

Figure 2: Examples of the vulnerable feature dataset (X_v, Y_v) generated by the CW-L2_d attack
                                     MNIST                              Fashion MNIST5
                         FGSM_d  PGD_d  MIM_d  CW-L2_d     FGSM_d  PGD_d  MIM_d  CW-L2_d
Test accuracy (X_b, Y_b)  0.37    0.85   0.92    0.98       0.53    0.71   0.78    0.78
Test accuracy (X_t, Y_t)  0.54    0.81   0.73    0.96       1.0     0.95   0.93    1.0
# of iterations            2       7      30      30         3       30     30      30
Loss term of D¹
Table 1: Performance of a classifier C_v trained only with the vulnerable feature dataset (X_v, Y_v)

To make the accumulation process efficient, we introduce a discriminator D which learns the vulnerable features already in (X_v, Y_v) and prevents the attack from re-exploiting vulnerable features that D has already learned. The attack must then bypass D, which is possible only by exploiting a new set of vulnerable features. After a pre-determined number of accumulation iterations, we train C_v with (X_v, Y_v) and measure the performance of C_v on (X_b, Y_b) and on the test dataset (X_t, Y_t).
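The accumulation step can be outlined as follows (a hypothetical numpy sketch with stand-in `attack_fn`/`predict_fn` callables; the paper's actual attacks and the discriminator D are omitted here):

```python
import numpy as np

def accumulate_vulnerable_dataset(benign_x, benign_y, attack_fn, predict_fn, n_rounds):
    """Collect adversarial inputs x' with their misled labels y' into (X_v, Y_v).
    Only successful attacks (prediction changed) are kept, since only those
    inputs demonstrably carry vulnerable features of the misled label."""
    xs, ys = [], []
    for _ in range(n_rounds):
        for x, y in zip(benign_x, benign_y):
            x_adv = attack_fn(x)
            y_adv = predict_fn(x_adv)
            if y_adv != y:
                xs.append(x_adv)
                ys.append(y_adv)
    return np.array(xs), np.array(ys)
```

In the paper's procedure, each round would additionally retrain D on the accumulated set so that later rounds must find new vulnerable features.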

We conduct this experiment² with two datasets, MNIST and Fashion MNIST5. For Fashion MNIST5, we use a subset of Fashion MNIST Xiao et al. (2017) consisting of "Coat (0)", "Trouser (1)", "Sandal (2)", "Sneaker (3)", and "Bag (4)". Figure 2 illustrates vulnerable feature datasets (X_v, Y_v) generated by the CW-L2_d attack. We can see that each input describes a completely different visual object class from its assigned label, as it is comprised of only the vulnerable features. For more details of the experiment, see subsection A.1 in the supplementary file.

²It is worth noting that a similar experiment Ilyas et al. (2019) was conducted independently at the same time. A key difference is that the experiment in Ilyas et al. (2019) is designed to construct non-robust (vulnerable) input datasets, while our experiment aims to construct a common latent distribution of a maximal set of vulnerable features with D.

We use four representative attacks, FGSM_d, PGD_d, MIM_d, and CW-L2_d, where the suffix "_d" indicates that the base attack is adapted to bypass D with an additional attack objective. Table 1 summarizes the performance of C_v. Interestingly, the results show that it is possible to achieve accuracy of up to 0.98 (MNIST) and 0.78 (Fashion MNIST5) on (X_b, Y_b), even though C_v is trained only with (X_v, Y_v). We note that C_v also achieves high accuracy on the test datasets (X_t, Y_t), which are unseen in the training phase.

From the above experiments, we can draw the following two conclusions:

  • High accuracy on (X_b, Y_b): vulnerable features are prevalent in (X_b, Y_b), and we need new training algorithms that can distinguish vulnerable from robust features for robustness.

  • High accuracy on (X_t, Y_t): the vulnerable feature datasets (X_v, Y_v) must share some high-level features in common, although they may be imperceptible to humans. Based on this observation, we hypothesize that the adversarial inputs of the same label exist in a common latent distribution.

4 Approach

Figure 3: Implicit vs. disentangling learning in latent feature spaces (label 0 on MNIST)

Based on the hypothesis that the vulnerable features form a common latent distribution for each label, we propose a new learning algorithm that recognizes and disentangles the latent space of the vulnerable features from the distribution of all features. Figure 3 shows the difference from an implicit algorithm (left) that learns latent features without such a distinction.

Specifically, we propose a variational autoencoder with two latent feature spaces, z_r and z_v, for robust and vulnerable features respectively. In addition to the original objective of the variational autoencoder, we regularize z_r and z_v not to estimate the distributions of adversarial and benign inputs, respectively, as a mini-max game. As the training converges, z_r approaches the distribution of robust features, and z_v represents the distribution of vulnerable features. Then z_v can be used to detect adversarial inputs, since they form a distinct distribution in z_v that separates them from benign inputs.

4.1 Mini-max game

The key idea of our proposed training can be represented as a two-player mini-max game between a defender and an adversary over the two probability distributions p(z_r^y) and p(z_v^y) of the robust and vulnerable features of each label y. The defender seeks to detect an adversarial attack on an input x by checking whether the probability of x on z_v^y is higher than a threshold, and the adversary aims to maliciously perturb the input while evading detection.

In the beginning, z_r is initialized to reflect the distribution of all the robust and vulnerable features, while z_v does not yet represent any feature. The players alternate turns. In the adversary's turn, the adversary perturbs the input so as to maximize its probability on z_r but minimize it on z_v, thereby compromising the defender. In the next turn, the defender adjusts the model parameters θ to do the opposite, aiming to collect all the vulnerable features that the adversary has exploited and segregate them into the distribution of z_v. If the defender eventually detects all the vulnerable features, the defender wins the game; otherwise, the adversary wins. However, since the set of vulnerable features is finite for a finite input space, a defender with a proper detection strategy will eventually win after a sufficient number of turns.

4.2 Network architecture

To embed the proposed mini-max game in the training process, we suggest the network architecture described in Figure 4. For each label y, we have a variational autoencoder VAE^y Kingma and Welling (2013) that consists of an encoder and a decoder. Instead of one type of latent variable, the encoder samples two types of latent variables, z_r^y and z_v^y, for the benign and adversarial inputs of label y, respectively; we write z for either one for simplicity. The encoder generates z for any given input x, where z is z_r^y or z_v^y depending on whether x is benign or adversarial. This information is given in the training process as a flag, and the decoder can selectively take z_r^y or z_v^y based on the flag.
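A minimal sketch of the routing described above (the function names and the numpy stand-ins are ours, not the paper's implementation):

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    which keeps sampling differentiable with respect to (mu, log_sigma)."""
    eps = rng.standard_normal(np.shape(mu))
    return np.asarray(mu) + np.exp(log_sigma) * eps

def select_latent(z_r, z_v, is_adversarial):
    """The benign/adversarial flag routes the sample: benign inputs train the
    robust space z_r, adversarial inputs train the vulnerable space z_v."""
    return z_v if is_adversarial else z_r
```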

Figure 4: Proposed network architecture for the disentangling learning

For classification, we integrate the VAEs into a classifier C. Given an input x, C estimates the probability of x on z_r^y for each VAE, and returns the index of the highest probability as the predicted label ŷ. For the detection of an adversarial input, we use the probability of the input on z_v^ŷ relative to its probability on z_r^ŷ, given the predicted label ŷ; we detect the input as adversarial when the former is sufficiently high compared to the latter.
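A hypothetical sketch of such a relative test (the threshold form here is our assumption; the paper defines the exact per-label rule):

```python
def detect_adversarial(log_p_zr, log_p_zv, tau=0.0):
    """Flag an input as adversarial when its log-likelihood under the
    vulnerable latent space exceeds that under the robust space by more
    than a threshold tau."""
    return (log_p_zv - log_p_zr) > tau
```

Working with log-likelihood differences rather than raw densities keeps the test numerically stable and makes tau directly interpretable as a log-ratio margin.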

4.3 Training the network

We train each VAE^y with a loss L^y. We design L^y based on an evidence lower bound, with two regularization terms that penalize errors in variational inference. Specifically, when erroneous estimates happen, that is, when z_r^y is assigned high probability for an adversarial input or z_v^y is assigned high probability for a benign input, VAE^y is penalized and encouraged to distinguish between the robust features of benign inputs and the vulnerable features of adversarial inputs. We provide the detailed derivation of L^y in Section A.2 of the supplementary file; in its final form, μ_i(·) selects the i-th mean element and σ_i(·) selects the i-th standard deviation element obtained from reparameterization.

We choose the pixel-wise mean squared error (MSE) for the first two terms as reconstruction errors, and the standard normal distribution as the prior for both z_r^y and z_v^y. For practical purposes, we also introduce weighting constants for the KL divergence terms and for the variational-inference penalty terms, respectively.
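With a standard normal prior and a diagonal Gaussian posterior, each KL term has the usual closed form (a generic VAE identity from Kingma and Welling (2013), not a formula specific to this paper):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) )
       = 0.5 * sum_i (sigma_i^2 + mu_i^2 - 1 - 2 * log sigma_i)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
```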

Before we train VAE^y with L^y, we must incorporate knowledge of our defense mechanism into the existing attacks, considering the whitebox attack model. Given the attack loss L_A of an arbitrary attack A, we linearly combine a term L_D for bypassing our detector with a coefficient c, obtaining L_A ± c · L_D. We also introduce a binary search to find a proper c, in a similar way to the CW-L2 attack. We modify all the attacks in this paper accordingly, and denote an attack with the suffix "_W" if it is modified to work in the whitebox model. The sign of the L_D term depends on whether the attack is based on gradient descent (+) or ascent (−).
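The coefficient search can be sketched as a standard bisection (assuming, as in the CW-L2 attack, that success is monotone in the coefficient; `succeeds` is a hypothetical callable that runs the adapted attack and reports whether it both fools the classifier and bypasses the detector):

```python
def binary_search_coeff(succeeds, lo=0.0, hi=1.0, steps=10):
    """Find (approximately) the smallest coefficient c in [lo, hi] for which
    the adapted whitebox attack succeeds; returns the surviving upper bound."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if succeeds(mid):
            hi = mid    # success: try a smaller coefficient
        else:
            lo = mid    # failure: the coefficient must be larger
    return hi
```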

Algorithm 1: Training process of VAE^y
  Input: an adapted whitebox attack A_W
  Output: X_a^y, the set of adversarial inputs to label y
  while A_W finds adversarial inputs X_a^y that bypass the detector do
      train VAE^y with L^y to distinguish z_r^y and z_v^y, maximizing the detection of X_a^y
  end while

Algorithm 1 describes our training algorithm. It is an iterative process: at each iteration, A_W attacks VAE^y to exploit vulnerable features, and the VAE corrects the distributions p(z_r^y) and p(z_v^y) with L^y to identify the vulnerable features found in that iteration. We only attack benign inputs, in order to prevent the inclusion of robust features in the vulnerable distribution. The training ends when A_W can no longer find adversarial inputs that bypass the detector.

5 Experiment

             A=MIM_W   A=CW-L2_W   No defense
MNIST          0.97       0.98        0.99
Fashion        0.98       0.98        0.99
Cat & Dog      0.96       0.96        0.99
Table 2: Benign test accuracy of our models and of models without a defense

We evaluate our defense mechanism, trained with MIM_W and CW-L2_W, under both blackbox and whitebox attacks. In the blackbox setting, we evaluate how precisely our detector filters out adversarial inputs by measuring AUC scores. In the whitebox setting, we quantitatively evaluate the attack success ratio and qualitatively analyze whether successful adversarial inputs induce semantic changes.
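The AUC here can be computed rank-wise; the following is our own small helper implementing the Mann-Whitney form of the ROC-AUC over detection scores:

```python
def auc_score(benign_scores, adversarial_scores):
    """Rank-based (Mann-Whitney) AUC: the probability that a randomly chosen
    adversarial input receives a higher detection score than a randomly
    chosen benign input, with ties counted as 0.5."""
    wins = 0.0
    for a in adversarial_scores:
        for b in benign_scores:
            wins += 1.0 if a > b else (0.5 if a == b else 0.0)
    return wins / (len(adversarial_scores) * len(benign_scores))
```

An AUC of 1.0 means every adversarial input outscores every benign one; 0.5 means the detector is no better than chance.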

Baseline. We choose Gong et al. (2017) as the baseline for comparison, which also leverages adversarial inputs to train an auxiliary classifier for detecting adversarial inputs. Note that Gong et al. do not incorporate adaptive attacks in their approach, although we denote their models with the same notation (e.g., MNIST, A=PGD_W).

Attacks. Our attacks are based on publicly available implementations, namely the MNIST adversarial examples challenge [21] and the CleverHans library Papernot et al. (2018), and the whitebox attacks are adapted to bypass our detection mechanism. All adversarial inputs are generated from a separate test dataset, in an untargeted way.

Datasets. We evaluate our detector on the MNIST LeCun et al. (1998), Fashion MNIST5 Xiao et al. (2017), and Cat & Dog [15] datasets. For the Cat & Dog dataset, we collect a total of 2028 frontal faces of cats and dogs and resize them to 64 × 64 with a single channel.

We first show that our defense methods achieve a level of accuracy similar to those without defense mechanisms (see Table 2). Additional information including training parameters and details of the attacks are described in Section A.3 of the supplementary file.

5.1 Blackbox substitute model attack

In the blackbox setting, the adversary has no information about our defense mechanism, but we assume that the adversary has the same dataset as the defender. The adversary builds its own standard substitute classifier, and generates a group of adversarial inputs with an attack to fool that substitute. The adversary then attacks our classifier with those inputs, and the defender detects them based on their probability values on z_v^ŷ, where ŷ is the predicted label. Blackbox substitute model attacks are possible by exploiting the transferability Goodfellow et al. (2015); Liu et al. (2017); Ilyas et al. (2018); Narodytska and Prasad Kasiviswanathan (2017); Chen et al. (2017) of adversarial inputs across classifiers trained with similar datasets.

Table 3 shows the AUC scores of the detection results, where each cell compares ours (left) with the baseline (right) for each individual attack. Our defense achieves 0.98 on average and outperforms the baseline by up to 0.33. We attribute this success to the model's ability to disentangle the distribution of vulnerable features into z_v, away from the distribution of all features. To see whether the features are actually disentangled, we visualize the latent spaces with t-SNE Laurens van der Maaten (2008) in Figure 5, which shows a clear separation between z_r and z_v, as we expected. We conclude that disentangling the vulnerable features found in benign datasets, which the adversary might exploit, reduces the transferability between the models.

FGSM 0.99 / 0.99 0.98 / 0.99 0.98 / 0.99 0.99 / 0.97
PGD 0.99 / 0.96 0.99 / 0.99 0.97 / 0.79 0.99 / 0.96
MIM 0.99 / 0.90 0.99 / 0.98 0.98 / 0.92 0.99 / 0.97
CW 0.97 / 0.64 0.97 / 0.96 0.96 / 0.64 0.97 / 0.95
Figure 5: Disentangled latent distributions of robust and vulnerable features (blue vs. others), on MNIST, A=MIM_W
Table 3: AUC scores of blackbox attack detection. Our approach (left) generalizes better across various attacks than the baseline (right)

5.2 Whitebox attack

Whitebox attacks are difficult to defend against because the adversary has exactly the same knowledge as the defender, which can be exploited to fool the defender. For clear analysis, we define the following success conditions for the adversary when the inferred label of the adversarial input is y′:

  C1. Low probability on the vulnerable features z_v^{y′}, to bypass the detector.

  C2. High probability on the robust features z_r^{y′}, to convince the defender.

  C3. The semantic meaning of the original input should be retained.

                          MNIST                 Fashion MNIST5
                    A=MIM_W  A=CW-L2_W       A=MIM_W  A=CW-L2_W
Success ratio         0.32      0.28           0.18      0.19
Mean L2 distortion   45.08     48.25          43.34     37.66
Table 4: Results of the CW-L2_W attack

For C1 and C2, Figure 6 plots the attack success ratio against the distortion ε on MNIST and Fashion MNIST5. As ε increases, the success ratio also increases, except for FGSM_W. Our defense shows a more gradual slope than the baseline. Table 4 shows the whitebox attack results of CW-L2_W with distortions minimized by binary search. CW-L2_W achieves an average success ratio of 0.30 and 0.19 with an average L2 distortion³ of 46.66 and 40.5 on MNIST and Fashion MNIST5, respectively.

³The L2 distortion is calculated in the [0, 255] input range.

Regarding C3, Figure 7 compares the visual differences between each pair of a benign image (left) and an adversarial image (right). The predicted label for each image is shown in yellow in the bottom-right corner. We choose ε as a reference distortion value for the MIM_W attacks. We can clearly observe semantic changes in the adversarial images. We additionally evaluate our defense mechanism on the Cat & Dog dataset, which also shows semantic changes between the labels. In MNIST, lines appear or disappear rather than noisy dots, and sneakers turn into sandals with similar styles, such as overall shape or pattern. In the Cat & Dog dataset, features of dogs, such as big noses and long snouts, appear in the adversarial inputs generated from cats, while the brightness of the fur and the angles of the faces seem to be preserved. Figure 8 shows that, with our approach, adversarial perturbations result in clear semantic changes compared to the other baselines. We provide more results with various ε, including the PGD_W attack, in Section A.4 of the supplementary file. From these results we conclude that our approach distinguishes vulnerable features from the whole feature set more successfully than the baseline. Furthermore, considering the semantic changes of the adversarial inputs, we conclude that the robust features estimated in z_r are well-aligned with human perception.

Figure 6: Success ratio of whitebox attacks along the distortion (according to C1 & C2)
Figure 7: Visual results of whitebox attacks on our defense
Figure 8: Semantic comparison to the baselines Gong et al. (2017); Pang et al. (2018) under the CW-L2_W attack; the attack yields larger semantic changes on our approach than on the baselines⁴

⁴We found and corrected implementation errors in the robustness evaluation of the reverse cross entropy Pang et al. (2018), and we could bypass the detector in the adaptive whitebox attack. We confirmed this with an author of the paper.

6 Conclusion

In this work, we hypothesized a latent feature space for the adversarial inputs of each label, and conjectured that the entanglement of vulnerable and robust features in feature space is the main cause of the adversarial vulnerability of neural networks. We then proposed to learn a space that disentangles the latent distributions of the vulnerable and robust features. Specifically, we trained a variational autoencoder for each label with two latent spaces, using a two-player mini-max game, which results in disentangled representations for robust and vulnerable features. We show that our approach successfully identifies the vulnerable features, and also identifies sufficiently robust features, in the whitebox attack scenario. However, we cannot guarantee that our approach is a panacea, and further research is required as new attacks are discovered. We hope our work stimulates research toward more reliable and explainable machine learning.


  • A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated Gradients Give a False Sense of Security : Circumventing Defenses to Adversarial Examples. External Links: arXiv:1802.00420v4 Cited by: §1, §2.
  • N. Carlini and D. Wagner (2017a) Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, External Links: arXiv:1705.07263v2, ISBN 9781450352024 Cited by: §1, §2.
  • N. Carlini and D. Wagner (2017b) Towards Evaluating the Robustness of Neural Networks. In IEEE Symposium on Security and Privacy, Cited by: §2, §3.
  • P. Chen, Y. Sharma, H. Zhang, J. Yi, and C. Hsieh (2018) EAD: Elastic-Net Attacks to Deep Neural Networks via Adversarial Examples. In AAAI Conference on Artificial Intelligence, External Links: 1709.04114, Link Cited by: §2.
  • P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec), Cited by: §5.1.
  • G. S. Dhillon, K. Azizzadenesheli, and Z. C. Lipton (2018) Stochastic activation pruning for robust adversarial defense. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018) Boosting Adversarial Attacks with Momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.
  • [8] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner Detecting Adversarial Samples from Artifacts. External Links: arXiv:1703.00410v3 Cited by: §1, §2.
  • Z. Gong, W. Wang, and W. Ku (2017) Adversarial and Clean Data Are Not Twins. External Links: arXiv:1704.04960v1 Cited by: §1, §2, Figure 8, §5.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), Cited by: §1, §2, §3, §5.1.
  • [11] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. Mcdaniel On the ( Statistical ) Detection of Adversarial Examples. External Links: arXiv:1702.06280v2 Cited by: §1, §2.
  • S. Gu and L. Rigazio (2015) Towards deep neural network architectures robust to adversarial examples. In International Conference on Learning Representations (ICLR) Workshops, Cited by: §1, §2.
  • A. Ilyas, L. Engstrom, A. Athalye, and J. Lin (2018) Black-box adversarial attacks with limited queries and information. In International Conference on Learning Representations (ICLR), Cited by: §5.1.
  • A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. External Links: arXiv:1905.02175 Cited by: §1, §2, footnote 2.
  • [15] Kaggle cats and dogs dataset. Note: https://www.microsoft.com/en-us/download/details.aspx?id=54765 Cited by: §1, §5.
  • G. H. Laurens van der Maaten (2008) Visualizing Data using t-SNE. Journal of Machine Learning Research. Cited by: §5.1.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 86(11):2278–2324, Cited by: §1, §5.
  • Y. Liu, X. Chen, C. Liu, and D. Song (2017) Delving into transferable adversarial examples and black-box attacks. In International Conference on Learning Representations (ICLR), Cited by: §5.1.
  • S. Ma, Y. Liu, G. Tao, W. Lee, and X. Zhang (2019) NIC : Detecting Adversarial Samples with Neural Network Invariant Checking. In Proceedings of the 2019 Annual Network and Distributed System Security Symposium (NDSS), External Links: ISBN 189156255X Cited by: §1, §2.
  • A. Ma̧dry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations (ICLR), Cited by: §1, §2, §3.
  • MNIST adversarial examples challenge. Note: https://github.com/MadryLab/mnist_challenge Accessed: 2010-09-30 Cited by: §5.
  • S. M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582. External Links: Document, arXiv:1511.04599v3 Cited by: §2.
  • N. Narodytska and S. P. Kasiviswanathan (2017) Simple black-box adversarial perturbations for deep networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.1.
  • D. P. Kingma and M. Welling (2013) Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), Cited by: §A.2, §4.2.
  • T. Pang, C. Du, Y. Dong, and J. Zhu (2018) Towards Robust Detection of Adversarial Examples. In Conference on Neural Information Processing Systems (NIPS), Cited by: §1, §2, Figure 8, footnote 4.
  • N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, A. Matyasko, V. Behzadan, K. Hambardzumyan, Z. Zhang, Y. Juang, Z. Li, R. Sheatsley, A. Garg, J. Uesato, W. Gierke, Y. Dong, D. Berthelot, P. Hendricks, J. Rauber, and R. Long (2018) Technical report on the cleverhans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768. Cited by: §5.
  • P. Samangouei, M. Kabkab, and R. Chellappa (2018) Defense-gan: protecting classifiers against adversarial attacks using generative models. Cited by: §1, §2.
  • L. Schott, J. Rauber, M. Bethge, and W. Brendel (2019) Towards the first adversarially robust neural network model on mnist. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • C. Szegedy, J. Bruna, D. Erhan, and I. Goodfellow (2014) Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Ma̧dry (2018) Robustness May Be at Odds with Accuracy. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • J. Wang, G. Dong, J. Sun, X. Wang, and P. Zhang (2018) Adversarial Sample Detection for Deep Neural Network through Model Mutation Testing. In CoRR, External Links: arXiv:1812.05793v2 Cited by: §1, §2.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. External Links: cs.LG/1708.07747 Cited by: §1, §3.1, §5.
  • W. Xu, D. Evans, and Y. Qi (2018) Feature Squeezing : Detecting Adversarial Examples in Deep Neural Networks. In Proceedings of the 2018 Annual Network and Distributed System Security Symposium (NDSS), Cited by: §1, §2.
  • S. Zheng, Y. Song, T. Leung, and I. Goodfellow (2016) Improving the robustness of deep neural networks via stability training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.

Appendix A Appendix

A.1 Additional information about the motivational experiment

We conducted the motivational experiments described in Section 3.1 to demonstrate that a classifier trained on a dataset (,) containing only vulnerable features can achieve high accuracy on the benign dataset (, ). Tables 5 and 6 describe the model parameters and the attack parameters used in the experiments. Algorithm 2 details the process of creating (,) with the discriminator .

We use abbreviated notations in Tables 5 and 6. "c(x,y)" denotes a convolutional layer with ReLU activation, where x is the kernel size and y is the number of kernels. "mp(x)" is a max-pooling layer with pooling size x by x. "d(x)" is a dense layer with x neurons. "sm(x)" is a softmax layer with output dimension x. For the attack parameters, "e" is the epsilon, "i" is the number of attack iterations, and "ss" is the step size of the perturbations. "df" is the decay factor for the MIM attack. "lr", "cf", "ic", and "bs" are the learning rate, confidence, initial coefficient of the misclassification loss, and the number of binary search steps, respectively.
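To make the notation concrete, the following is a small hypothetical helper (not part of the paper's code) that decodes the abbreviated architecture strings of Table 5 into readable layer descriptions:

```python
def decode_arch(spec):
    """Decode tokens like c(5,20), mp(2), d(500), sm(10) from Table 5."""
    layers = []
    for tok in spec.split():
        kind = tok[:tok.index("(")]
        vals = [int(a) for a in tok[tok.index("(") + 1:-1].split(",")]
        if kind == "c":       # convolutional layer with ReLU activation
            layers.append(f"conv {vals[0]}x{vals[0]}, {vals[1]} kernels, ReLU")
        elif kind == "mp":    # max-pooling layer
            layers.append(f"max-pool {vals[0]}x{vals[0]}")
        elif kind == "d":     # dense layer
            layers.append(f"dense, {vals[0]} neurons")
        elif kind == "sm":    # softmax output layer
            layers.append(f"softmax, {vals[0]} classes")
    return layers

print(decode_arch("c(5,20) mp(2) c(5,50) mp(2) d(500) sm(10)"))
```

For instance, the MNIST architecture in Table 5 decodes to two conv/pool stages followed by a dense layer and a 10-way softmax.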

For Fashion MNIST5, we intentionally chose a subset of Fashion MNIST, namely coat (0), trouser (1), sandal (2), sneaker (3), and bag (4), to decrease the effect of inter-label robust features; for example, sneaker and ankle boot, or coat and pullover, are quite similar. By doing so, we could extract vulnerable features specific to each label and obtain accurate results.

We implement the discriminator as a separate model with a one-dimensional sigmoid output. It learns to label benign inputs as 0 and adversarial inputs as 1; consequently, its output can be interpreted as the probability that an input is adversarial. To allow attacks to attempt bypassing the detector, we linearly incorporate its output into the objective loss functions of the attacks. As the attacks minimize this output probability, they generate new adversarial inputs with new vulnerable features.
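The linear incorporation described above can be sketched as follows; both functions here are hypothetical stand-ins for illustration, not the paper's implementation:

```python
import math

# Toy stand-ins: d_prob plays the role of the discriminator's sigmoid
# output (probability the input is adversarial), and the quadratic term
# is a proxy for the attack's misclassification loss.
def d_prob(x, w=2.0, b=-1.0):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def attack_objective(x, lam=1.0):
    cls_loss = (x - 0.2) ** 2            # proxy misclassification loss
    return cls_loss + lam * d_prob(x)    # linearly add D's output probability
```

An adaptive attacker minimizes this combined objective, so a low value means the perturbed input both fools the classifier and looks benign to the discriminator.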

Models Parameters
MNIST, c(2,20) mp(2) c(2,50) mp(2) d(500) sm(10)
MNIST, c(5,20) mp(2) c(5,50) mp(2) d(256) sm(10)
Fashion MNIST5, c(5,20) mp(2) c(5,50) mp(2) d(500) sm(10)
Fashion MNIST5, c(5,20) mp(2) c(5,50) mp(2) d(256) sm(10)
Table 5: Model parameters in the motivational experiment
Blackbox substitute model attacks (FGSM | PGD | MIM | CW-L2)
MNIST: e:0.3 | e:0.3, i:90, ss:0.01 | e:0.3, i:640, ss:0.01, df:0.3 | i:160, lr:0.1, cf:3, ic:10, bs:1
Fashion MNIST5: e:0.3 | e:0.4, i:90, ss:0.01 | e:0.3, i:320, ss:0.001, df:0.3 | i:160, lr:0.1, cf:3, ic:10, bs:1
Table 6: Attack parameters in the motivational experiment
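To illustrate how these parameters interact, here is a minimal NumPy sketch of the MIM update (Dong et al., 2018) using the parameter names of Table 6; `grad_fn` is a hypothetical stand-in for the model's loss gradient:

```python
import numpy as np

# Momentum iterative method (MIM) sketch.
# e: epsilon ball, i: iterations, ss: step size, df: momentum decay factor.
def mim_attack(x, grad_fn, e=0.3, i=90, ss=0.01, df=0.3):
    x_adv, g = x.copy(), np.zeros_like(x)
    for _ in range(i):
        grad = grad_fn(x_adv)
        g = df * g + grad / (np.sum(np.abs(grad)) + 1e-12)  # momentum term
        x_adv = x_adv + ss * np.sign(g)
        x_adv = np.clip(x_adv, x - e, x + e)  # project back into the e-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)      # keep pixels in a valid range
    return x_adv
```

With df:0 this reduces to plain iterative FGSM (PGD-style), which is why the PGD and MIM columns share e, i, and ss.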
  (, ): Given benign dataset
  (, ) : Empty dataset for vulnerable features
 Pre-trained and freshly initialized models, respectively, with the same input and output dimensions
   Arbitrary attack
   Discriminator between benign (0) and adversarial (1) inputs
  while  do
     TRAIN to distinguish and
  end while
  TRAIN with and
  PRINT accuracy of on and
Algorithm 2 Training process only with the vulnerable features
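A hypothetical Python rendering of Algorithm 2 is sketched below; the attack and discriminator here are stand-in stubs with assumed interfaces, not the paper's implementation, and the fixed round count stands in for the elided while-loop condition:

```python
def build_vulnerable_dataset(benign, f, attack, d, rounds=3):
    """Collect adversarial (vulnerable-feature) examples with an attack
    that must also evade the discriminator d, then retrain d on them."""
    xs_v, ys_v = [], []
    for _ in range(rounds):  # stands in for the paper's while-loop condition
        for x, y in benign:
            x_adv = attack(f, d, x, y)   # adversarial input keeping label y
            xs_v.append(x_adv)
            ys_v.append(y)
        # retrain the discriminator: benign -> 0, adversarial -> 1
        d.train(benign, list(zip(xs_v, ys_v)))
    return xs_v, ys_v
```

The resulting (xs_v, ys_v) pairs are then used to train the freshly initialized model, whose accuracy is finally measured on both the vulnerable and the benign dataset.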

A.2 Training loss derivation

To approximate and distinguish the latent variable distributions of and , we maximize for each VAE, where


consists of two terms. The first is an evidence term indicating the probability of occurrence of and , where


The second is a variational inference loss term, which penalizes wrong variational inference with respect to each distribution and . It can be expanded to incorporate the latent variables and as follows.


Plugging the typical ELBO expansion Kingma and Welling (2013) of the term, and the term, into Equation 1, we obtain the following .


We choose the pixel-wise mean squared error (MSE) as the reconstruction error for the first two terms, and the standard normal distribution as the prior for and . For practical purposes, we also introduce constants and for the KL divergence terms and the variational inference loss terms, respectively.
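For reference, the "typical ELBO expansion" plugged in above is the standard single-VAE lower bound of Kingma and Welling (2013), applied here once per latent space:

```latex
% Standard single-VAE evidence lower bound (Kingma & Welling, 2013);
% the derivation above applies one such bound per latent space.
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```

Choosing pixel-wise MSE corresponds to a Gaussian likelihood for the reconstruction term, and the standard normal prior makes the KL term available in closed form.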

A.3 Training and attack parameters

We use abbreviated notations in Table 7 as in Table 5. In addition, "z(x,y)" is a sampling layer for the latent variables, where x and y are the dimensions of and , respectively. We use and in all trainings. For label inference we choose the nearest-mean classifier on , because its linear property avoids the vanishing gradient problem, which can make attacks fail spuriously yet is known to be penetrable.
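A nearest-mean classifier of the kind described above can be sketched as follows; the class means would in practice come from encoding the training set, and are hypothetical inputs here:

```python
import numpy as np

# Nearest-mean classification on a latent space: predict the label whose
# class mean is closest to the encoded point z. The decision function is
# linear in z, so gradients do not vanish under gradient-based attacks.
def nearest_mean_predict(z, class_means):
    dists = np.linalg.norm(class_means - z, axis=1)
    return int(np.argmin(dists))
```

Its decision boundaries are hyperplanes between pairs of class means, which is exactly the linearity the text relies on.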

Model Parameters
MNIST, =MIM_W :3(c(4,16))z(8,8) e:0.5, i:3e3, ss:1e-3, df:0.9, bs:0
MNIST, =CW-L2_W :d(24)d(49)3(ct(4,16))d(784) i:1e3, lr:1e-3, cf:0, ic:1, bs:0
Fashion MNIST5, =MIM_W :3(c(4,32))z(8,8) e:0.5, i:3e3, ss:1e-3, df:0.9, bs:0
Fashion MNIST5, =CW-L2_W :d(24)d(49)3(ct(4,32))d(784) i:2e3, lr:1e-3, cf:0, ic:1, bs:0
Cat & Dog, =MIM_W :2(c(12,32)-bn-relu-mp(2))c(12,32)-bn-relu-z(64,64) e:0.2, i:12e2, ss:1e-3, df:0.9, bs:0
Cat & Dog, =CW-L2_W :d(24)d(49)3(ct(4,64))d(4096) i:3e3, lr:3e-3, cf:0, ic:1, bs:0
Table 7: Training parameters and accuracy of with/without our defense. (e: distortion, i: iterations, ss: step size, df: decay factor, bs: binary search steps, lr: learning rate, cf: confidence, ic: initial constant)
Blackbox substitute model attacks (FGSM | PGD | MIM | CW-L2) | Whitebox attacks (FGSM_W | PGD_W | MIM_W | CW-L2_W)
MNIST: e:0.3 | e:0.4, i:90, ss:0.01 | e:0.3, i:160, ss:0.01, df:0.3 | i:160, lr:0.1, cf:3, ic:10, bs:1 | bs:3 | i:240, ss:0.01, bs:3 | i:1200, ss:1e-3, df:0.9, bs:3 | i:1.2e4, lr:0.01, cf:200, ic:10, bs:3
Fashion MNIST5: e:0.3 | e:0.4, i:90, ss:0.01 | e:0.3, i:160, ss:0.01, df:0.3 | i:160, lr:0.1, cf:3, ic:10, bs:1 | bs:3 | i:240, ss:0.01, bs:3 | i:1200, ss:1e-3, df:0.9, bs:3 | i:1.2e4, lr:0.01, cf:200, ic:10, bs:3
Table 8: Attack parameters used in the experiments

A.4 Additional figures about the whitebox attacks

In this section, we qualitatively evaluate the performance of our proposed defense mechanism against whitebox attacks as a function of epsilon. Figures 9, 10, and 11 show the results of the PGD_W and MIM_W attacks over a wide range of epsilon, from 0.2 to 0.8. Even when epsilon is small, there are many cases where even a human may be mistaken, and as epsilon becomes larger, the semantic changes become more apparent. In the case of MNIST, attacks frequently targeted the digits 4, 7, 9, 3, and 5, which have similar shapes. In Fashion MNIST5, attacks likewise frequently targeted similar classes such as sandals and sneakers; attacks between sandals and sneakers show that the original style is maintained to some extent while a new image is created. In Cat & Dog, when a cat image was attacked towards a dog, a dog's nose and long snout typically appeared; conversely, when dog images were attacked towards a cat, a flat nose and Y-shaped mouth appeared. Notably, even when epsilon is as large as 0.8, the semantically meaningful changes dominate and no noise-like perturbation is added to the background, indicating that the robust and the vulnerable features are well learned.

Figure 9: Whitebox attack changes semantics of inputs (MNIST)


Figure 10: Whitebox attack changes semantics of inputs (Fashion MNIST)
Figure 11: Whitebox attack changes semantics of inputs (Cat & Dog)

Figure 12 depicts the attack success ratio (according to C1 & C2) on the Cat & Dog dataset as a function of epsilon. Unlike on MNIST and Fashion MNIST5, the success ratio is not as high. This can be interpreted to mean that the Cat & Dog dataset has more delicate semantic features (i.e., robust features) than the other datasets, making it more difficult for attackers to identify them during perturbation.

Figure 12: Whitebox attack success ratio along the distortion (Cat & Dog).

In comparison with the state of the art, our defense mechanism induces clear semantic changes under attack. For example, Gong et al.'s auxiliary classifier used as a detector does not provide any robustness under adaptive whitebox attacks, and reverse cross entropy, which encourages non-maximal entropy during training, also fails under whitebox attacks, contrary to what is reported in its paper. In fact, it shows no significant increase in robustness over Gong et al.'s approach in terms of the amount of distortion and semantic change (see Figure 13).

Figure 13: Baseline comparison of the MIM_W whitebox attack success ratio (C1, C2) along the distortion. Gong et al. and Ours are trained with the MIM_W attack.

A.5 Experiment environments

We conduct our experiments on an Ubuntu 16.04 machine with 4 GTX 1080 Ti graphics cards and 64GB of RAM. We build our experiments with TensorFlow 1.12.0 on Python 3.6.8.
