Transferable Adversarial Robustness using Adversarially Trained Autoencoders

Pratik Vaishnavi1, Kevin Eykholt2, Atul Prakash2, Amir Rahmati1
1Stony Brook University       2University of Michigan, {keykholt,aprakash},

Machine learning has proven to be an extremely useful tool for solving complex problems in many application domains. This prevalence makes it an attractive target for malicious actors. Adversarial machine learning is a well-studied field of research in which an adversary seeks to cause predictable errors in a machine learning algorithm through careful manipulation of the input. In response, numerous techniques have been proposed to harden machine learning algorithms and mitigate the effect of adversarial attacks. Of these techniques, adversarial training, which augments the training data with adversarial inputs, has proven to be an effective defensive technique. However, adversarial training is computationally expensive and the improvements in adversarial performance are limited to a single model. In this paper, we propose Adversarially-Trained Autoencoder Augmentation, the first transferable adversarial defense that is robust to certain adaptive adversaries. We disentangle adversarial robustness from the classification pipeline by adversarially training an autoencoder with respect to the classification loss. We show that our approach achieves comparable results to state-of-the-art adversarially trained models on the MNIST, Fashion-MNIST, and CIFAR-10 datasets. Furthermore, we can transfer our approach to other vulnerable models and improve their adversarial performance without additional training. Finally, we combine our defense with ensemble methods and parallelize adversarial training across multiple vulnerable pre-trained models. In a single adversarial training session, the autoencoder can achieve adversarial performance on the vulnerable models that is comparable to or better than standard adversarial training.


Machine learning algorithms are becoming the preferred tool to empower systems across multiple application domains, ranging from automatically monitoring employee hygiene and safety to influencing control decisions in self-driving cars and trading. With such pervasive use, it is critical to understand and address the vulnerabilities associated with machine learning algorithms so as to mitigate the risks in real systems. Adversarial machine learning attacks are one class of such vulnerabilities in which an attacker can reliably induce predictable errors in machine learning systems. At a high level, given a model and a correctly labelled input, an adversarial attack computes the necessary modifications on the input such that the model incorrectly labels the input, while ensuring that the modifications are inconspicuous (e.g., imperceptible to a human observer).

Figure 1: Overview of Adversarially-Trained Autoencoder Augmentation (AAA). Adversarial images that would otherwise be misclassified by naturally trained classifiers can be adjusted and correctly classified by using AAA.

In an effort to mitigate or prevent the effect of adversarial attacks, multiple defensive techniques have been proposed. Of them, the most promising technique, adversarial training, uses a basic data augmentation strategy to improve a model’s performance in adversarial scenarios [madry2018towards]. During training, adversarial examples are computed on-the-fly and added to the training data. This approach has proven to greatly improve the prediction accuracy on adversarial inputs for certain types of adversarial attacks. However, adversarial training requires retraining the classifier on adversarial data and introduces significant performance overhead during training, as adversarial examples must be generated during each training step. Modifications to the original approach have been made to reduce the performance overhead on larger datasets such as ImageNet by weakening the adversarial attack algorithm used at training time and diversifying the models used to generate adversarial inputs [kurakin2017atscale, tramer2018ensembletraining]. Despite these modifications, one problem remains: the benefits of adversarial training extend only to a single model. In order to create additional adversarially robust models, the adversarial training procedure must be repeated for each model.

In this work, we propose Adversarially-Trained Autoencoder Augmentation (AAA), a model-agnostic technique for improving adversarial robustness. We re-examine data pre-processing adversarial defenses and focus on the stacked classifier architecture, in which a denoising autoencoder is used to mitigate the effect of adversarial inputs [DBLP:journals/corr/GuR14, Liao2018DefenseAA, meng2017magnet]. Traditionally, autoencoder-based defenses are trained on a static set of adversarial samples generated against a naturally trained classifier. However, these defenses are not robust to an adaptive adversary that is aware of the defense in place. Our approach differs from previous work in that we train the autoencoder against an adaptive adversary using both a supervised and an unsupervised objective, which enhance classifier performance and encourage transferability. Our approach provides comparable performance on the Fashion-MNIST and CIFAR-10 datasets, and outperforms classic adversarial training on MNIST (Table 1). Furthermore, as we have designed AAA to be model agnostic, we transfer the adversarially trained autoencoder to different naturally trained classifiers and improve their adversarial accuracy on the MNIST and Fashion-MNIST datasets (Tables 2 and 5). Finally, we find that on CIFAR-10, our initial approach does not completely enable the transferability of our pipeline. However, by utilizing ensemble adversarial training, AAA can parallelize adversarial training across multiple classifiers and be partially transferable. In fewer training iterations per model, AAA achieves adversarial performance comparable to an adversarially trained model in the worst case. In the best case, using our pipeline with ensemble training improved a model’s natural accuracy by 10.12% and its adversarial accuracy by 9.17% compared to normal adversarial training.

Our Contributions.

  • We propose Adversarially-Trained Autoencoder Augmentation, the first transferable adversarial defense robust to an adaptive L∞ adversary. Through adversarial training of an autoencoder, we disentangle adversarial robustness and classification, enabling us to transfer adversarial robustness improvements across multiple classifiers.

  • On MNIST and Fashion-MNIST, Adversarially-Trained Autoencoder Augmentation results in comparable or better performance than traditional adversarial training, while being completely transferable: it improves the adversarial accuracy of a different classifier on both datasets (Tables 2 and 5).

  • On CIFAR-10, despite comparable performance to adversarial training on the training classifier, our approach showed weaker transferability to a different classifier. Thus, we extended our approach to use ensemble techniques for partial transferability. In doing so, we found that the resulting autoencoder achieved adversarial accuracy comparable to or better than individually adversarially training each classifier (9.17% improvement in adversarial accuracy in the best case), with less overall training.


Adversarial Attacks. Adversarial examples were introduced by Szegedy et al. They observed that by maximizing the prediction error of a classifier for a given input, it was possible to learn imperceptible perturbations to add to the input, characterized by an L_p distance, which cause it to be misclassified [szegedy2014intriguing]. Since then, numerous adversarial attacks have been developed, which can be classified based on the adversary’s knowledge of the model. White-box attacks assume the adversary has perfect knowledge of the model parameters. Using this knowledge, an adversary can use back-propagation to precisely compute the necessary adversarial modifications. For example, the Jacobian-Based Saliency Map (JSMA) attack uses back-propagation to create adversarial saliency maps, which measure the impact of each input feature on the output decision [papernot2016limitations]. The Fast-Gradient Sign method (FGSM) modifies the entire input based on the model gradients with respect to the input [goodfellow2014explaining, kurakin2016adversarial].

Black-box attacks assume the adversary only has the ability to query a model for its soft (probability distribution) or hard (predicted label) output. In these attacks, the adversary queries the model and uses the output to estimate gradient information, which can then be used to re-enact a white-box attack [Narodytska2017SimpleBA, zoo, Bhagoji2018PracticalBA, Cheng2018QueryEfficientHB]. Alternatively, it is known that adversarial inputs are transferable: an adversarial input created to cause misclassification errors for one model can be reused to cause misclassification errors in other models despite differences in the model architectures [szegedy2014intriguing]. Using this property, an adversary creates adversarial inputs using a white-box attack on a model they have white-box access to, and then exposes the created adversarial inputs to the black-box model [papernot2016transferability, liu2016delving].
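The gradient-estimation strategy described above can be illustrated with a minimal sketch. It assumes a hypothetical black-box function `query` that returns a scalar loss for an input, and it uses symmetric finite differences, which is one common estimator and not necessarily the one used in the cited attacks. The names `estimate_gradient`, `toy_query`, and the step `h` are illustrative assumptions.

```python
def estimate_gradient(query, x, h=1e-4):
    """Estimate d query / d x with symmetric finite differences.

    Only output queries are used (two per input coordinate); the model's
    parameters and internal gradients are never accessed.
    """
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((query(x_plus) - query(x_minus)) / (2.0 * h))
    return grad


# Hypothetical stand-in for a black-box model's loss: f(x) = sum(x_i^2),
# whose true gradient is 2 * x.
def toy_query(x):
    return sum(v * v for v in x)


x = [0.5, -1.0, 2.0]
estimated = estimate_gradient(toy_query, x)
```

The estimated gradient can then be plugged into a gradient-based attack in place of the true gradient, at the cost of two queries per input dimension.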

Adversarial Defenses. Given the widespread use of machine learning, successful adversarial attacks against deployed systems could result in dire real-world consequences. As such, it is critical to develop techniques to mitigate the effect of adversarial attacks. Early adversarial defenses such as defensive distillation [distillation], input transformations [inputtransform], and defensive DNNs [song2018pixeldefend, defensegan] rely on masking or breaking the gradient used in white-box attacks to generate adversarial examples. However, many of these early defenses have been broken, as an adaptive adversary can perform end-to-end attacks by approximating the gradient through these defenses [DBLP:journals/corr/abs-1802-00420].

Currently, adversarial training is recognized as the state-of-the-art technique to improve a model’s adversarial robustness to white-box projected gradient descent (PGD) attacks. Adversarial training improves a model’s robustness to adversarial examples by generating them on-the-fly to be used during training [szegedy2014intriguing]. More specifically, in each training iteration adversarial examples that maximize the model’s loss are generated iteratively [madry2018towards]. Through adversarial training, Madry et al. created MNIST and CIFAR classifiers with significantly improved adversarial robustness. Later, due to the poor scalability of the original approach, the single-step FGSM attack was used to reduce the performance overhead of adversarial training for large datasets [kurakin2017atscale]. The performance overhead was further reduced when ensemble techniques were applied to adversarial training [tramer2018ensembletraining]. In addition, Tramèr et al. revealed that adversarial training results in overfitting the model to the generated adversarial examples. After training, the adversarially trained model remained susceptible to transferable adversarial examples. Thus, ensemble adversarial training served a second purpose to improve the black-box adversarial robustness of the trained model.
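As an illustration of the iterative attack used inside adversarial training, the following is a minimal sketch of an L∞-bounded PGD attack on a toy logistic-regression model. The model, the input, the pixel range [0, 1], and the values of `eps` and `alpha` are assumptions made for this example only; they are not the settings or the models used in the experiments.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_input_grad(w, b, x, y):
    """Cross-entropy loss of a logistic model and its gradient w.r.t. x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grad = [(p - y) * wi for wi in w]
    return loss, grad


def sign(v):
    return (v > 0) - (v < 0)


def pgd_linf(w, b, x, y, eps, alpha, steps):
    """Iteratively ascend the loss, projecting back into the L-infinity ball."""
    x_adv = list(x)
    for _ in range(steps):
        _, grad = loss_and_input_grad(w, b, x_adv, y)
        stepped = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Project onto the eps-ball around x, then onto the assumed pixel range.
        x_adv = [min(max(s, xi - eps), xi + eps) for s, xi in zip(stepped, x)]
        x_adv = [min(max(v, 0.0), 1.0) for v in x_adv]
    return x_adv


w, b = [2.0, -3.0, 1.0], 0.1   # toy model parameters (assumed)
x, y = [0.5, 0.2, 0.7], 1      # toy input and label (assumed)
x_adv = pgd_linf(w, b, x, y, eps=0.1, alpha=0.01, steps=40)
clean_loss, _ = loss_and_input_grad(w, b, x, y)
adv_loss, _ = loss_and_input_grad(w, b, x_adv, y)
```

In adversarial training, an attack of this kind is run on every batch and the model is then updated on the resulting `x_adv`, which is the source of the training overhead discussed above.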

Adversarial Autoencoder Defenses. As adversarial examples are typically generated by adding noise to a correctly classified input, it is natural to attempt to use denoising algorithms to remove the adversarial noise. Gu and Rigazio explored using denoising autoencoders (AE) when pre-processing the input to defend against adversarial inputs [DBLP:journals/corr/GuR14]. Given a deep neural network classifier, they generated adversarial examples and used them to train an AE that mapped the adversarial examples back to the original inputs based on a reconstruction loss. By stacking the AE and the classifier, the success rate of adversarial examples generated against the original classifier dropped significantly. However, as adversarial inputs are generated with respect to the classifier rather than the full pipeline, adaptive adversarial attacks remain possible. A recent work proposed two modifications to the stacked architecture defense [Liao2018DefenseAA]. First, the AE was changed to output inverse adversarial noise to correct the modified input rather than fully reconstruct the original input. This change was based on the hypothesis that learning the adversarial noise added to an input is an easier task than learning to reconstruct the original input. Second, they replaced the pixel-based reconstruction loss with a loss based on the hidden layers of the target classifier they are defending. These two changes resulted in training an AE that removes noise from the input such that the error amplification effect from adversarial modifications is mitigated. This work, though, still generates adversarial inputs with respect to the classifier rather than the full pipeline.

Xie et al. take a different approach, choosing to incorporate denoising as part of the network architecture rather than as a pre-processing step [Xie_2019_CVPR]. They characterize the error amplification effect as significant noise in the feature maps of the network. When examining the feature maps of adversarial examples, they observe that semantically uninformative portions of the input had higher than normal feature map activation. Thus, they add denoising blocks in between intermediate layers in the network to suppress the noise in the feature maps and refocus the network on semantically informative information. On ImageNet, they adversarially trained a ResNet model containing four denoising blocks and demonstrated improved black-box and white-box adversarial robustness compared to existing works. Although their approach is interesting and helps explain the mechanisms of adversarial attacks, it remains model dependent. Extra denoising layers must be added to a model’s architecture in order to achieve adversarial robustness, whereas our approach is designed to be model agnostic.

Adversarially-Trained Autoencoder Augmentation

Adversarial training in its current setting has a major drawback: it introduces significant computational overhead on the training process, hence making it orders of magnitude slower than natural training. Furthermore, an adversarially trained model cannot improve the performance of other vulnerable models for the task. A string of works in recent literature [shafahi2019adversarial, zhang2019you] have studied methods to reduce the overhead of adversarial training. We propose Adversarially-Trained Autoencoder Augmentation (AAA), which creates a separate adversarially robust component in the classification pipeline. We design our approach to be transferable so that a single adversarial training session is sufficient to improve the adversarial robustness of multiple vulnerable models trained on the same dataset.

In their work on MNIST adversarial training, Madry et al. observed that only 3 of the 32 filters in the first convolutional layer of the adversarially trained model contribute to the input to the second layer [madry2018towards]. They characterized the behavior of these three filters as thresholding filters, which remove most of the adversarial modifications from the input. Based on their observations, we re-define an adversarially trained model as a two-step process: 1) denoising; 2) classification. Thus, by disentangling the first step from the model, we can create a model-agnostic defense that is transferable across all vulnerable classifiers trained on a similar data distribution.


To implement our model-agnostic defense, we use an autoencoder (AE), which consists of an encoder that projects the input to a latent space and a decoder that projects the latent feature vector back into the original input space. The reconstructed input is given to a classifier, which outputs a label. Given a trained classifier, we train the AE adversarially: adversarial examples are generated by maximizing the cross-entropy loss over the set of allowable adversarial perturbations on the input, and the AE is updated to minimize the AE loss on these examples.
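One plausible way to write this training objective is sketched below. The notation is an assumption introduced here for illustration and may differ from the authors' symbols: A_θ is the AE with parameters θ, C is the fixed pre-trained classifier, D is the data distribution, S is the set of allowable adversarial perturbations, L_CE is the cross-entropy loss, and L is the loss of the AE.

```latex
\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{(x, y) \sim D}
  \left[ \mathcal{L}\big(A_{\theta}(x + \delta^{*}),\, x,\, y\big) \right],
\qquad
\delta^{*} = \arg\max_{\delta \in S} \;
  \mathcal{L}_{CE}\big(C(A_{\theta}(x + \delta)),\, y\big)
```

Under this reading, the inner maximization attacks the full AE-plus-classifier pipeline, while the outer minimization updates only the AE parameters.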

The mean squared error (MSE) is the traditional loss function used when training AEs, as it encourages the AE to map noisy inputs back to the clean data manifold. However, such an objective does not encourage the AE to minimize classification risk. As such, it remains possible that a well-trained AE will map a noisy input to an adversarial input. Thus, we adversarially train our AE with respect to three different loss functions. The cross-entropy loss encourages the AE towards reconstructions that minimize the classification risk, which provides robustness to adversarial examples. The MSE loss encourages the AE towards reconstructions that lie on the natural data manifold, which improves the transferability of our adversarial defense. Finally, the combined loss takes into account the benefits of both loss functions, resulting in reconstructions on the natural data manifold that minimize classification risk.
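The three objectives can be sketched as plain functions. This is a minimal illustration under stated assumptions: the combined loss is written as a weighted sum with a hypothetical `mse_weight` parameter, since the exact form of the combination is not given here, and the toy probabilities and pixel values are made up for the example.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability the classifier assigns to the true label."""
    return -math.log(max(probs[label], 1e-12))


def mean_squared_error(recon, target):
    """Average squared difference between the reconstruction and the clean input."""
    return sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)


def combined_loss(probs, label, recon, target, mse_weight=1.0):
    """Assumed form of the combined objective: CE + mse_weight * MSE."""
    return cross_entropy(probs, label) + mse_weight * mean_squared_error(recon, target)


# Toy values (assumed): classifier probabilities on the reconstruction,
# the reconstruction itself, and the clean input it should resemble.
probs = [0.1, 0.7, 0.2]
label = 1
recon = [0.2, 0.8, 0.5]
clean = [0.0, 1.0, 0.5]

ce = cross_entropy(probs, label)
mse = mean_squared_error(recon, clean)
total = combined_loss(probs, label, recon, clean)
```

The cross-entropy term depends on the classifier's output, whereas the MSE term depends only on the reconstruction and the clean input, which is why the latter carries no information about any particular classifier.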


Experimental Setup

For our experiments, we evaluated Adversarially-Trained Autoencoder Augmentation (AAA) on three different datasets: MNIST, Fashion-MNIST, and CIFAR-10 [mnist, fashion-mnist, cifar]. As described previously, our defense pipeline consists of the adversarially trained AE and a pre-trained classifier. In all of the experiments using AAA, the classifier used was naturally trained.

For our MNIST experiments, we created a simple convolutional AE consisting of two convolution layers, two fully connected layers, and two deconvolution layers. For the classifier, we used the architecture provided by Madry et al. for their MNIST challenge. We evaluate transferability using a classifier composed of two fully connected layers.

For Fashion-MNIST, we use a convolutional AE consisting of 15 convolution layers, two max-pooling layers, and two upsampling layers. For the classifier, we add two convolution layers to the MNIST classifier architecture. We evaluate transferability using a classifier comprised of four fully connected layers.

For CIFAR-10, we used a U-Net AE architecture [unet]. The main difference between a U-Net AE and a standard convolutional AE is the use of skip connections, which are feed-forward connections between the encoding and decoding layers in the network that enable higher fidelity reconstructions. Our U-Net AE uses five encoding convolution layers and four decoding convolution layers. We also use three different classifier architectures. For the first model, we use the ResNet architecture provided by Madry et al. for their CIFAR-10 challenge. For the second model, we used the VGG-19 classifier [vgg]. For the last model, we created a simple DNN consisting of four convolution layers and a fully connected layer. Additional details regarding the models can be found in the supplementary material.

Training Details

We used an Adam optimizer and adversarially trained the AE using the three loss functions mentioned in the previous section. For MNIST and Fashion-MNIST, adversarial examples were generated in each training iteration using a 40-step L∞-bounded PGD attack with a common step size and a separate perturbation bound for each dataset. For CIFAR-10, adversarial examples were generated in each training iteration using a 10-step L∞-bounded PGD attack, with the pixel values scaled to a fixed range. For all experiments, we started from a fixed initial learning rate and halved it whenever the validation loss did not improve over five epochs.
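The learning-rate rule above can be sketched as a small helper. This is a minimal illustration, assuming that "did not improve over five epochs" means five consecutive epochs without a new best validation loss; the class name `HalveOnPlateau`, the initial rate of 0.1, and the loss values are hypothetical.

```python
class HalveOnPlateau:
    """Halve the learning rate after `patience` epochs without improvement."""

    def __init__(self, lr, patience=5, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the updated rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Hypothetical validation losses: two improving epochs, then a plateau.
scheduler = HalveOnPlateau(lr=0.1)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95]]
```

Deep learning frameworks typically ship an equivalent scheduler, so in practice this logic would usually come from the framework in use.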

MNIST Results

We first present our experimental results on the MNIST dataset, which is a gray-scale dataset composed of 28x28 sized images of handwritten digits. We evaluate all models against white-box 40-step and 200-step L∞-bounded PGD attacks. We compare the performance of an adversarially trained model to our approach in Table 1.

Model                 Natural Acc.   Adv. Acc. (40-step)   Adv. Acc. (200-step)
Nat. Training         99.24%         0%                    0%
Adv. Training         99.10%         93.85%                91.10%
AAA (cross-entropy)   99.08%         95.19%                93.54%
AAA (MSE)             99.17%         94.29%                91.81%
AAA (combined)        99.15%         95.27%                94.04%
Table 1: Performance comparison of AAA to the adversarial training approach proposed by Madry et al. [madry2018towards] on MNIST.

We observe that for all three loss functions, AAA minimally impacts the natural accuracy of the naturally trained classifier, while also improving the adversarial accuracy significantly. Interestingly, the cross-entropy loss only slightly outperforms the MSE loss, even though the MSE loss has no relation to the classifier objective. This behavior is likely due to the simple nature of the MNIST dataset, which allows class-distinctive features to be learned using only an unsupervised objective such as MSE. Finally, compared to adversarially training the classifier only, AAA has consistently higher adversarial accuracy despite the fact that AAA relies on naturally trained classifiers.

Adversarial training is a non-transferable technique to improve the adversarial performance of a model. That is to say, a pre-trained adversarially trained model cannot be reused to improve the adversarial performance of a different model. In Table 2, we measure the transferability of AAA on a pre-trained fully connected classifier. For these results, we reuse the AE trained on the naturally trained Madry et al. classifier architecture.

Model                 Natural Acc.   Adv. Acc. (40-step)   Adv. Acc. (200-step)
FC-Classifier         98.00%         0%                    0%
AAA (cross-entropy)   28.75%         17.38%                15.26%
AAA (MSE)             98.21%         88.90%                85.91%
AAA (combined)        98.12%         91.62%                89.97%
Table 2: AAA transferability results on MNIST using a fully connected classifier.

Depending on the loss function used during training, the transferability of our technique varies. As we expected, the MSE loss has higher transferability than the cross-entropy loss, as it seeks to project the input onto the natural data manifold. As such, its reconstructions are more likely to generalize across multiple models, as Table 3 illustrates. The cross-entropy loss encourages the reconstructions to reduce the classification risk with respect to the training model, and the visualizations in Table 3 suggest that the AE overfits to that model. The combined loss, which includes a classification loss, has better transferability than the MSE loss alone. This improvement can be attributed to the idea that classifiers learn similar decision boundaries with respect to the natural data [goodfellow2014explaining].

Table 3: We visualize the output of the AE for MNIST, Fashion-MNIST, and CIFAR-10 (ResNet). The first column contains natural images. The other columns show the AE output for each of the three loss functions.

Fashion-MNIST Results

Fashion-MNIST is a grayscale dataset that contains 28x28 sized images of ten different types of clothes. We evaluate all models against 40-step and 200-step L∞-bounded PGD attacks. As before, we first compare the performance of an adversarially trained model to AAA in Table 4.

Model                 Natural Acc.   Adv. Acc. (40-step)   Adv. Acc. (200-step)
Nat. Training         92.55%         5.64%                 5.40%
Adv. Training         87.73%         81.99%                78.07%
AAA (cross-entropy)   87.59%         79.81%                76.89%
AAA (MSE)             89.26%         43.10%                25.06%
AAA (combined)        87.4%          76.83%                73.97%
Table 4: Performance comparison of AAA to the adversarial training approach proposed by Madry et al. [madry2018towards] on Fashion-MNIST.

AAA improves the adversarial accuracy of the natural classifier with minimal impact on the natural accuracy. However, unlike in MNIST, the MSE loss has significantly worse performance compared to the cross-entropy loss, suggesting that some form of classifier loss must be used to train an adversarially robust AE. Finally, we note that the cross-entropy loss and the combined loss have comparable performance to adversarial training.

In Table 5, we augment a fully connected classifier and measure the transferability of AAA. As before in MNIST, the cross-entropy loss generates reconstructions specific to the training model, which do not transfer well to other pre-trained models. We also see that the MSE loss encourages reconstructions that appear similar to the natural images, resulting in higher transferability compared to the cross-entropy loss, but the adversarial accuracy remains low. The combined loss demonstrates that combining both loss functions results in a highly accurate defense that generates reconstructions which generalize across different classifiers. Figure 3 visualizes the reconstructions for each loss function on Fashion-MNIST.

Model                 Natural Acc.   Adv. Acc. (40-step)   Adv. Acc. (200-step)
FC-Classifier         89.74%         5.25%                 4.33%
AAA (cross-entropy)   33.60%         26.36%                25.25%
AAA (MSE)             88.38%         38.34%                32.95%
AAA (combined)        74.44%         54.20%                51.27%
Table 5: AAA transferability results on Fashion-MNIST using a fully connected classifier.

CIFAR-10 Results

CIFAR-10 contains 32x32 sized color images belonging to one of ten possible class labels. We evaluate all models against an L∞-bounded PGD attack. In Table 6, we compare the performance of AAA to adversarial training using the ResNet classifier.

Model                 Natural Acc.   Adv. Acc.
Nat. Training         89.67%         2.13%
Adv. Training         81.49%         47.79%
AAA (cross-entropy)   77.21%         47.13%
AAA (MSE)             86.07%         0.08%
AAA (combined)        80.06%         47.17%
Table 6: Performance comparison of AAA to the adversarial training approach proposed by Madry et al. [madry2018towards] on CIFAR-10.

The cross-entropy loss and the combined loss both result in adversarial accuracy improvements comparable to adversarial training, with a similar reduction in natural accuracy. Also, we note again that although the MSE loss preserves the natural accuracy of the classifier, it does not have any impact on adversarial accuracy. Based on our results on MNIST, Fashion-MNIST, and CIFAR-10, we see a trend that as dataset complexity increases, the MSE loss performs increasingly worse with respect to mitigating the effect of adversarial attacks.

Next, we augment the naturally trained VGG-19 and DNN classifiers with the AE trained on ResNet. In Table 7, we present the transferability of AAA with respect to these two classifiers. As before, we see that the cross-entropy loss is not transferable and yields only a minimal improvement in adversarial accuracy. However, unlike in our previous experiments, the combined loss also performs poorly, suggesting that a different loss function must be used to obtain transferability.

Model                 VGG-19 Nat. Acc.   VGG-19 Adv. Acc.   DNN Nat. Acc.   DNN Adv. Acc.
Nat. Training         91.47%             1.49%              78.86%          1.93%
AAA (cross-entropy)   25.02%             12.52%             14.25%          2.01%
AAA (MSE)             87.78%             2.70%              76.77%          4.65%
AAA (combined)        30.53%             15.10%             20.08%          6.95%
Table 7: AAA transferability evaluation on CIFAR-10. The AE is trained on ResNet and is transferred to two different classifiers.

Alternatively, we can use ensemble adversarial training to create an AE that improves the adversarial performance of multiple models in a single training session. Ensemble adversarial training is a modification of adversarial training that randomly selects a model from an ensemble each epoch and generates adversarial examples with respect to the chosen model, rather than the target training model [tramer2018ensembletraining]. Traditional adversarial training created models that remained vulnerable to transferability attacks. Tramèr et al. showed that their ensemble modification solves this problem, improving a model’s robustness to such attacks. With respect to AAA, we use ensemble adversarial training to create an AE that can improve the adversarial accuracy of all classifiers in the ensemble, while also only modestly reducing natural accuracy of each classifier.

In each training iteration, we randomly select one of the three classifiers and generate adversarial examples with respect to the chosen classifier. Then, we adversarially train the AE with respect to the combined loss and the chosen classifier. We choose the combined loss because, based on the previous experiments, it was the loss function most likely to result in an accurate, transferable AE. All of the training parameters remain the same as before, including the number of training epochs. In Table 8, we compare the performance of AAA using ensemble adversarial training to the adversarially trained classifiers.
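The per-iteration selection described above can be sketched as a small training-loop skeleton. This is a structural illustration only: `attack_fn` and `update_fn` are hypothetical callables standing in for the PGD attack and the AE parameter update, and the toy stand-ins below merely count updates instead of training a real autoencoder.

```python
import random


def ensemble_adversarial_training(ae_state, classifiers, batches,
                                  attack_fn, update_fn, seed=0):
    """Train one AE against a classifier drawn at random in every iteration."""
    rng = random.Random(seed)
    picks = []
    for x, y in batches:
        name = rng.choice(sorted(classifiers))   # pick one ensemble member
        clf = classifiers[name]
        x_adv = attack_fn(ae_state, clf, x, y)   # attack the AE + chosen classifier
        ae_state = update_fn(ae_state, clf, x_adv, x, y)  # update the AE only
        picks.append(name)
    return ae_state, picks


# Toy stand-ins (assumed): identity "classifiers", a fixed-offset "attack",
# and an "update" that only counts how many times the AE was updated.
def toy_attack(ae_state, clf, x, y):
    return [v + 0.1 for v in x]


def toy_update(ae_state, clf, x_adv, x, y):
    return ae_state + 1


classifiers = {"resnet": lambda v: v, "vgg": lambda v: v, "dnn": lambda v: v}
batches = [([0.5, 0.5], 1) for _ in range(30)]
updates, picks = ensemble_adversarial_training(0, classifiers, batches,
                                               toy_attack, toy_update)
```

Because every iteration updates the same AE, the total number of updates equals the number of batches regardless of how many classifiers are in the ensemble.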

Classifier   Training         Natural Acc.   Adv. Acc.
ResNet       Natural          89.67%         2.13%
ResNet       Adversarial      81.49%         47.79%
ResNet       AAA (ensemble)   80.06%         47.17%
VGG          Natural          91.47%         1.49%
VGG          Adversarial      71.95%         41.85%
VGG          AAA (ensemble)   78.22%         44.74%
DNN          Natural          78.86%         1.93%
DNN          Adversarial      69.48%         37.80%
DNN          AAA (ensemble)   79.60%         46.97%
Table 8: Performance comparison of AAA using ensemble adversarial training to the adversarial training approach proposed by Madry et al. [madry2018towards].

We observe that AAA using ensemble adversarial training has two advantages over standard adversarial training. First, it achieves similar or better performance than standard adversarial training. For VGG and DNN, AAA significantly improved both the natural and adversarial accuracy (e.g., an additional 10.12% and 9.17% for DNN’s natural and adversarial accuracy, respectively). For ResNet, it remained competitive with respect to the adversarially trained model. Second, and more importantly, AAA is transferable across all three models while requiring one third of the total training iterations necessary for standard adversarial training. In order to obtain adversarially robust classifiers using standard adversarial training, three adversarial training sessions must be run individually.
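The quoted improvements follow directly from the Table 8 entries. The short check below recomputes them, assuming that for each classifier the second row is the adversarially trained model and the third row is AAA with ensemble training; the dictionary names are illustrative.

```python
# (natural accuracy, adversarial accuracy) in percent, copied from Table 8.
adv_trained = {"ResNet": (81.49, 47.79), "VGG": (71.95, 41.85), "DNN": (69.48, 37.80)}
aaa_ensemble = {"ResNet": (80.06, 47.17), "VGG": (78.22, 44.74), "DNN": (79.60, 46.97)}

# Percentage-point difference of AAA over adversarial training, per classifier.
gains = {
    name: (
        round(aaa_ensemble[name][0] - adv_trained[name][0], 2),
        round(aaa_ensemble[name][1] - adv_trained[name][1], 2),
    )
    for name in adv_trained
}

for name, (nat_gain, adv_gain) in gains.items():
    print(f"{name}: natural {nat_gain:+.2f}, adversarial {adv_gain:+.2f}")
```

The DNN entry reproduces the 10.12 and 9.17 percentage-point figures, while the ResNet entry is slightly negative, consistent with the description of AAA as competitive rather than better on that model.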

Extensions to AAA

In this section, we discuss improvements to and future work on AAA.

kNN Reconstruction. Traditionally, given an input, the classifier outputs the label associated with the highest predicted probability. However, recent work has found that using a k-nearest neighbors (kNN) algorithm in the hidden layers of the network can improve the explainability of neural networks and establish confidence metrics on classifier predictions [papernotkNN]. For a given input and a given hidden layer, the closest neighbors are selected and the confidence of a prediction is based on the fraction of neighbors that agree with the output prediction. On normal inputs, it was shown that a majority of the closest neighbors would often agree on the predicted label in each hidden layer of a naturally trained model. However, for adversarial inputs, which are not part of the training data manifold, there was much more diversity in the labels of the nearest neighbors, resulting in a low confidence prediction. Based on this observation, we measure the performance benefits when the kNN algorithm is used during the reconstruction step. As Adversarially-Trained Autoencoder Augmentation creates an AE that is robust to adversarial inputs and transferable, we expect that inputs with the same label will be close together in the latent space.

First, we store the latent space vectors of the training data. Then, at runtime, we compute the ten nearest neighbors for a given test input and average the latent space representations of those neighbors. The average latent space representation is used to compute the reconstruction output. Table 9 shows the results of kNN reconstruction on the MNIST and Fashion-MNIST CNN models, and Table 10 shows the transferability evaluation on the fully connected classifiers. In most cases, kNN reconstruction further improves the adversarial accuracy of AAA. Furthermore, we see that kNN reconstruction significantly improves the transferability of AAA. This behavior is likely because the average latent space representation obtained from kNN projects the adversarial input to a point on the natural data manifold, which is an input recognizable to all the classifiers.
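The kNN reconstruction step can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance in the latent space and using identity functions as hypothetical stand-ins for the trained encoder and decoder; the toy latent bank is two-dimensional and made up for the example.

```python
def knn_average_latent(query, bank, k):
    """Average the k stored latent vectors closest to `query` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(bank, key=lambda z: sq_dist(query, z))[:k]
    dim = len(query)
    return [sum(z[i] for z in nearest) / len(nearest) for i in range(dim)]


def knn_reconstruct(x, encode, decode, bank, k=10):
    """Encode x, replace its latent code by the kNN average, then decode."""
    return decode(knn_average_latent(encode(x), bank, k))


# Toy 2-D latent bank (assumed values) and identity encoder/decoder stand-ins.
bank = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
identity = lambda v: list(v)
recon = knn_reconstruct([0.2, 0.1], identity, identity, bank, k=3)
```

The exhaustive sort over the stored vectors makes each query linear in the size of the training set and in the latent dimension, which is the evaluation cost that grows with larger latent spaces.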

One drawback of kNN reconstruction is the large performance overhead during evaluation. In MNIST, we use a latent space vector of size 128 and the evaluation on the test dataset took approximately five minutes. In contrast, for Fashion-MNIST the latent space vector is of size 7x7x256 and evaluation on the test dataset required approximately 13 hours. In future work, we will explore optimization techniques such as locality sensitive hashing to improve the speed and reduce the complexity of the nearest neighbor computation so that kNN can be efficiently used with AAA.

MNIST    Natural Accuracy    Adversarial Accuracy (40-step)    Adversarial Accuracy (200-step)
         98.55%              96.03% (+0.84%)                   95.43% (+1.89%)
         97.16%              94.91% (+0.62%)                   94.36% (+2.55%)
         98.21%              95.92% (+0.65%)                   95.48% (+1.44%)
         58.49%              55.92% (-23.89%)                  55.06% (-21.83%)
         83.30%              79.52% (+36.42%)                  76.33% (+51.27%)
         76.40%              73.57% (-3.26%)                   73.00% (-0.97%)
Table 9: Evaluation results using the kNN reconstruction modification for a white-box PGD attack. Values in parentheses show the accuracy differences compared to the results in Tables 1 and 4, where kNN reconstruction was not used.
MNIST    Natural Accuracy    Adversarial Accuracy (40-step)    Adversarial Accuracy (200-step)
         29.67%              27.60% (+10.22%)                  27.27% (+12.01%)
         97.12%              94.11% (+5.21%)                   93.57% (+7.66%)
         98.12%              95.63% (+5.01%)                   95.08% (+5.11%)
         38.11%              37.24% (+10.88%)                  37.12% (+11.87%)
         81.77%              76.41% (+38.07%)                  42.06% (+9.11%)
         77.91%              75.40% (+21.21%)                  73.78% (+22.51%)
Table 10: AAA transferability results using the kNN reconstruction modification with a fully connected classifier. Values in parentheses show the accuracy differences compared to the results in Tables 2 and 5, where kNN reconstruction was not used.

Black-box Adversarial Training. Model privacy research is concerned with ensuring that trained machine learning models do not reveal information about the underlying, possibly sensitive data used for training [pate, prada]. In such settings, creating adversarially robust classifiers through adversarial training may not be possible, because white-box access is necessary to update the model parameters. It is therefore desirable to develop a black-box approach to improving adversarial robustness that lets model owners maintain model privacy.

Although white-box access to the classifier may be impossible, white-box access to the AE used in our defense can be granted. As with black-box adversarial attacks, AAA can employ two strategies to enable black-box adversarial training: transferability and gradient estimation. In our MNIST and Fashion-MNIST experiments, AAA was transferable: we could select a classifier, optimize the AE with respect to that classifier's classification loss, and improve the adversarial performance of a different classifier without any training on the new classifier. For simple datasets, this loss was sufficient to make AAA transferable. For more complex datasets, further work is needed to identify a loss function that encourages AAA transferability.

An alternative and more general way to use AAA for black-box adversarial training is gradient estimation. Black-box classifier models expose a query interface that returns the predicted label or probability distribution for a given input. Prior black-box attacks have used estimation techniques such as the finite difference method (FDM) to estimate the loss gradient and perform PGD adversarial attacks [zoo, Bhagoji2018PracticalBA]. In a similar fashion, we can use the FDM to improve the adversarial performance of a black-box classifier. First, we use the FDM to mount a black-box attack and generate adversarial inputs during training. Next, we use the FDM to approximate the loss gradient with respect to the reconstruction output of the AE. Finally, we chain the estimated gradient through the rest of the pipeline, to which we have white-box access, and update the AE parameters accordingly. We leave this second approach to future work.
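
The gradient-estimation step can be illustrated with a small sketch. It assumes a two-sided (central) finite difference applied one coordinate at a time, which is one common form of the FDM and not necessarily the variant used in the cited attacks. The names `fdm_gradient`, `loss_fn`, and `delta` are hypothetical; `loss_fn` stands in for a loss computed purely from the black-box classifier's query responses.

```python
def fdm_gradient(loss_fn, x, delta=1e-4):
    """Estimate the gradient of loss_fn at x using only function queries.

    Assumption: two-sided finite differences, one coordinate at a time,
    so each estimate costs 2 * len(x) queries.
    """
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += delta
        minus[i] -= delta
        # Slope of the loss along coordinate i.
        grad.append((loss_fn(plus) - loss_fn(minus)) / (2 * delta))
    return grad


def toy_loss(v):
    """Stand-in for a loss obtained from a black-box query interface."""
    return sum(c * c for c in v)


estimate = fdm_gradient(toy_loss, [1.0, -2.0, 0.5])
```

In the setting described above, `x` would be the reconstruction output of the AE, and the estimated gradient would then be backpropagated through the AE itself, which remains fully accessible. The per-coordinate query cost is the main practical limitation for image-sized inputs.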


In this paper, we propose Adversarially-Trained Autoencoder Augmentation (AAA) as a model-agnostic adversarial defense. AAA provides comparable performance to traditional adversarial training while allowing complete transferability for simpler datasets such as MNIST and Fashion-MNIST. On more complex datasets such as CIFAR-10, AAA can parallelize adversarial training across multiple classifiers, achieving, at minimum, adversarial performance comparable to that of an adversarially trained model. To our knowledge, Adversarially-Trained Autoencoder Augmentation represents the first transferable adversarial defense that is robust to an adaptive L adversary.


This material is based in part upon work supported by the National Science Foundation under Grant No. 1646392 and the NVIDIA GPU Grant program.


Supplemental Material

MNIST Model Architectures

Auto-encoder                  DNN             FC
conv 32-3-2                   conv 32-3-1     dense 256
conv 64-3-2                   maxpool 2×2     dense 256
dense (no activation) 1024    conv 64-3-1     dense 10
dense 7*7*64                  maxpool 2×2
deconv 32-3-2                 dense 1024
deconv (sigmoid) 1-3-2        dense 10
Table 11: Model architecture for each of the models used in the MNIST experiments. Convolutional layers are listed as number of output filters, kernel size, and stride. ReLU activation is used unless specified.
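
As a consistency check on the auto-encoder column, the following sketch computes the spatial size reached after the two stride-2 convolutions. It assumes "same" padding, which the table does not state; MNIST images are 28×28. The helper name `spatial_size_after` is hypothetical.

```python
import math


def spatial_size_after(input_size, strides):
    """Spatial size after a sequence of strided layers.

    Assumption: every layer uses "same" padding, so each layer maps
    a side length n to ceil(n / stride).
    """
    size = input_size
    for stride in strides:
        size = math.ceil(size / stride)
    return size


side = spatial_size_after(28, strides=(2, 2))  # two stride-2 convolutions
flattened = side * side * 64                   # 64 filters in the second conv layer
```

Under this assumption the encoder reaches a 7×7×64 feature map (3,136 values), which matches the size of the `dense 7*7*64` layer that the decoder uses before its first deconvolution.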

Fashion-MNIST Model Architectures

Auto-encoder            DNN               FC
conv-bn ×2 32-3-1       conv ×2 32-3-1    dense 1024
maxpool 2×2             maxpool 2×2       dense 512
conv-bn ×2 64-3-1       conv ×2 64-3-1    dense 256
maxpool 2×2             maxpool 2×2       dense 10
conv-bn ×2 128-3-1      dense 1024
conv-bn ×2 256-3-1      dense 10
conv-bn ×2 128-3-1
conv-bn ×2 64-3-1
upscale 2×2
conv-bn ×2 32-3-1
upscale 2×2
conv (sigmoid) 1-3-1
Table 12: Model architecture for each of the models used in the Fashion-MNIST experiments. Convolutional layers are listed as number of output filters, kernel size, and stride. ReLU activation is used unless specified.

CIFAR-10 Model Architectures

U-Net                     DNN
conv-bn ×2 64-3-1         conv 32-3-1
conv-bn ×2 128-3-1        conv 64-3-1
conv-bn ×2 256-3-1        maxpool 2×2
conv-bn ×2 256-3-1        dropout 0.5
upscale + concat 2×2      conv 128-3-1
conv-bn ×2 256-3-1        maxpool 2×2
upscale + concat 2×2      conv 128-2-1
conv-bn ×2 128-3-1        maxpool 2×2
upscale + concat 2×2      dropout 0.5
conv-bn ×2 64-3-1         dense 1500
conv (sigmoid) 3-1-1      dropout 0.5
                          dense 10
Table 13: Model architecture for the U-Net and DNN models used in the CIFAR-10 experiments. Convolutional layers are listed as number of output filters, kernel size, and stride. For the architecture details of ResNet and VGG, the reader can refer to [he2016deep, vgg]. ReLU activation is used unless specified.