Defense Against Adversarial Attacks Using
Feature Scattering-based Adversarial Training
Abstract
We introduce a feature scattering-based adversarial training approach for improving model robustness against adversarial attacks. Conventional adversarial training approaches leverage a supervised scheme (either targeted or non-targeted) in generating attacks for training, which typically suffer from issues such as label leaking as noted in recent works. In contrast, the proposed approach generates adversarial images for training through feature scattering in the latent space, which is unsupervised in nature and avoids label leaking. More importantly, this new approach generates perturbed images in a collaborative fashion, taking the inter-sample relationships into consideration. We conduct analysis on model robustness and demonstrate the effectiveness of the proposed approach through extensive experiments on different datasets compared with state-of-the-art approaches.
1 Introduction
While breakthroughs have been made in many fields such as image classification leveraging deep neural networks, these models could be easily fooled by the so-called adversarial examples szegedy2013intriguing (); biggio13ecml (). In image classification, an adversarial example for a natural image is a modified version which is visually indistinguishable from the original but causes the classifier to produce a different label prediction biggio13ecml (); szegedy2013intriguing (); FGSM (). Adversarial examples have been shown to be ubiquitous beyond classification, ranging from object detection DAG (); phy_attack () to speech recognition cisse2017houdini (); asr_attack ().
Much encouraging progress has been made towards improving model robustness against adversarial examples tramer2017ensemble (); madry2017towards (); guided_dn (); deep_defense (). Among them, adversarial training FGSM (); madry2017towards () is one of the most popular techniques athalye2018obfuscated (), which conducts model training using the adversarially perturbed images in place of the original ones. However, several challenges remain to be addressed. Firstly, adverse effects such as label leaking are still an issue hindering adversarial training kurakin2016scale (). Currently available remedies either increase the number of iterations for generating the attacks madry2017towards () or use classes other than the ground-truth for attack generation kurakin2016scale (); xie2018feature (); bilateral (). Increasing the attack iterations increases the training time proportionally, while the non-ground-truth targeted approach cannot fully eliminate label leaking. Secondly, previous approaches for both standard and adversarial training treat each training sample individually and in isolation w.r.t. other samples. Manipulating each sample individually this way neglects the inter-sample relationships and does not fully leverage the potential for attacking and defending, thus limiting the performance.
Manifold and neighborhood structure have been proven to be effective in capturing the inter-sample relationships Saul03thinkglobally (); NCA (). Natural images live on a low-dimensional manifold, with the training and testing images as samples from it HintonSalakhutdinov2006b (); Saul03thinkglobally (); ot_manifold (); boundary_tilt (). Modern classifiers are over-complete in terms of parameterizations and different local minima have been shown to be equally effective under the clean image setting no_barrier (). However, different solution points might leverage different sets of features for prediction. For learning a well-performing classifier on natural images, it suffices to simply adjust the classification boundary to intersect with this manifold at locations with good separation between classes on training data, as the test data will largely reside on the same manifold not_bug (). However, the classification boundary that extends beyond the manifold is less constrained, contributing to the existence of adversarial examples boundary_tilt (); trans_attack (). For example, it has been pointed out that some clean-trained models focus on some discriminative but less robust features, and are thus vulnerable to adversarial attacks not_bug (); jacobsen2018excessive (). Therefore, the conventional supervised attack that tries to move feature points towards this decision boundary is likely to disregard the original data manifold structure. When the decision boundary lies close to the manifold for its out-of-manifold part, adversarial perturbations lead to a tilting effect on the data manifold boundary_tilt (); at places where the classification boundary is far from the manifold for its out-of-manifold part, the adversarial perturbations will move the points towards the decision boundary, effectively shrinking the data manifold.
As the adversarial examples reside in a large, contiguous region and a significant portion of the adversarial subspaces is shared FGSM (); WardeFarley20161AP (); topology (); trans_attack (); universal_pert (), pure label-guided adversarial examples will clutter at least in the shared adversarial subspace. In summary, while these effects encourage the model to focus more around the current decision boundary, they also make the effective data manifold for training deviate from the original one, potentially hindering the performance.
Motivated by these observations, we propose to shift the previous focus on the decision boundary to the inter-sample structure. The proposed approach can be intuitively understood as generating adversarial examples by perturbing the local neighborhood structure in an unsupervised fashion and then performing model training with the generated adversarial images. The overall framework is shown in Figure 1. The contributions of this work are summarized as follows:

we propose a novel feature-scattering approach for generating adversarial images for adversarial training in a collaborative and unsupervised fashion;

we present an adversarial training formulation which deviates from the conventional minimax formulation and falls into a broader category of bilevel optimization;

we analyze the proposed approach and compare it with several state-of-the-art techniques, with extensive experiments on a number of standard benchmarks, verifying its effectiveness.
2 Background
2.1 Adversarial Attack, Defense and Adversarial Training
Adversarial examples, initially demonstrated in biggio13ecml (); szegedy2013intriguing (), have attracted great attention recently biggio13ecml (); FGSM (); tramer2017ensemble (); madry2017towards (); athalye2018obfuscated (); ten_years (). Szegedy et al. pointed out that CNNs are vulnerable to adversarial examples and proposed an L-BFGS-based algorithm for generating them szegedy2013intriguing (). A fast gradient sign method (FGSM) for adversarial attack generation was developed and used in adversarial training in FGSM (). Many variants of attacks have been developed since moosavi2015deepfool (); carlini2016towards (); one_pixel (); gan_attack (); patch (); brendel2018decisionbased (). In the meantime, many efforts have been devoted to defending against adversarial examples metzen2017detecting (); meng2017magnet (); xie2017mitigating (); guo2017countering (); guided_dn (); samangouei2018defensegan (); song2017pixeldefend (); prakash2018deflecting (); liu2017towards (). Recently, athalye2018obfuscated () showed that many existing defense methods suffer from a false sense of robustness against adversarial attacks due to gradient masking, and that adversarial training FGSM (); kurakin2016scale (); tramer2017ensemble (); madry2017towards () is one of the effective defense methods against adversarial attacks. It improves model robustness by solving a minimax problem FGSM (); madry2017towards ():
$$\min_{\theta} \Big[ \max_{x' \in \mathcal{S}_x} \mathcal{L}(x', y; \theta) \Big] \tag{1}$$
where the inner maximization essentially generates attacks while the outer minimization corresponds to minimizing the “adversarial loss” induced by the inner attacks madry2017towards (). The inner maximization can be solved approximately, using for example a one-step approach such as FGSM FGSM (), or a multi-step projected gradient descent (PGD) method madry2017towards ():
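$$x^{t+1} = \mathcal{P}_{\mathcal{S}_x}\Big( x^{t} + \alpha \cdot \mathrm{sign}\big( \nabla_{x} \mathcal{L}(x^{t}, y; \theta) \big) \Big) \tag{2}$$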
where $\mathcal{P}_{\mathcal{S}_x}(\cdot)$ is a projection operator projecting the input into the feasible region $\mathcal{S}_x$. In the PGD approach, the original image $x$ is randomly perturbed to some point $x^{0}$ within $\mathcal{B}(x, \epsilon)$, the $\epsilon$-cube around $x$, and then goes through several PGD steps with a step size of $\alpha$ as shown in Eqn.(2).
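The random start, signed-gradient step and projection described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the paper: it assumes a toy linear logistic classifier (so the input gradient is available in closed form), pixel values scaled to [0, 1], and hypothetical helper names `logistic_loss_and_grad` and `pgd_attack`.

```python
import math
import random


def logistic_loss_and_grad(x, y, w, b):
    """Binary logistic loss for label y in {0, 1} and its gradient w.r.t. the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grad_x = [(p - y) * wi for wi in w]  # closed form for a linear model
    return loss, grad_x


def pgd_attack(x, y, w, b, eps=8 / 255, alpha=2 / 255, steps=10, seed=0):
    """Multi-step PGD in the l_inf ball of radius eps around x (toy setting)."""
    rng = random.Random(seed)
    # Random start inside the eps-cube, clipped to the valid pixel range [0, 1].
    x_adv = [min(1.0, max(0.0, xi + rng.uniform(-eps, eps))) for xi in x]
    for _ in range(steps):
        _, grad = logistic_loss_and_grad(x_adv, y, w, b)
        # Ascent step along the sign of the input gradient.
        stepped = [xa + alpha * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Projection: clip back into the eps-cube around x, then into [0, 1].
        x_adv = [
            min(1.0, max(0.0, min(xi + eps, max(xi - eps, s))))
            for xi, s in zip(x, stepped)
        ]
    return x_adv
```

Setting `steps=1` and `alpha=eps` without the random start gives an FGSM-style one-step attack in the same toy setting.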
Label leaking kurakin2016scale () and gradient masking PapernotMJFCS15 (); tramer2017ensemble (); athalye2018obfuscated () are two well-known issues that hinder adversarial training kurakin2016scale (). Label leaking occurs when the additive perturbation is highly correlated with the ground-truth label. Therefore, when it is added to the image, the network can directly tell the class label by decoding the additive perturbation without relying on the real content of the image, leading to higher accuracy on adversarial images than on clean images during training. Gradient masking PapernotMJFCS15 (); tramer2017ensemble (); athalye2018obfuscated () refers to the effect that the adversarially trained model learns to “improve” robustness by generating less useful gradients for adversarial attacks, which could be bypassed with a substitute model for generating attacks, thus giving a false sense of robustness athalye2018obfuscated ().
2.2 Different Distances for Feature and Distribution Matching
Euclidean distance is arguably one of the most commonly used metrics for measuring the distance between a pair of points. When it comes to two sets of points, it is natural to accumulate the individual pairwise distances as a measure of distance between the two sets, given the proper correspondence. Alternatively, we can view each set as an empirical distribution and measure the distance between them using the Kullback-Leibler (KL) or Jensen-Shannon (JS) divergence. The challenge for learning with KL or JS divergence is that no useful gradient is provided when the two empirical distributions have disjoint supports or have a non-empty intersection contained in a set of measure zero W_GAN (); OT_GAN (). The optimal transport (OT) distance is an alternative measure of the distance between distributions with advantages over KL and JS in the scenarios mentioned earlier. The OT distance between two probability measures $\mu$ and $\nu$ is defined as:
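$$\mathcal{D}(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}\, c(x, y) \tag{3}$$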
where $\Pi(\mu, \nu)$ denotes the set of all joint distributions $\gamma(x, y)$ with marginals $\mu(x)$ and $\nu(y)$, and $c(x, y)$ is the cost function (Euclidean or cosine distance). Intuitively, $\mathcal{D}(\mu, \nu)$ is the minimum cost that $\gamma$ has to transport from $\mu$ to $\nu$. It provides a weaker topology than many other measures, which is important for applications where the data typically resides on a low-dimensional manifold of the input embedding space W_GAN (); OT_GAN (), which is the case for natural images. It has been widely applied to many tasks, such as generative modeling gen_model (); W_GAN (); OT_GAN (); primal_gan_vae (); FMGAN (), auto-encoding WAE () and dictionary learning 2016roletaistats (). For a comprehensive historical and computational perspective of OT, we refer to ot_book (); 2018Peyrecomputationalot ().
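The contrast between KL divergence and the OT distance on disjoint supports can be illustrated with a small one-dimensional sketch. It relies on two assumptions made only for this illustration: both point sets have the same size with uniform weights, and the cost is $|x - y|$, in which case the optimal plan simply matches sorted points. The function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as dicts {point: probability}."""
    total = 0.0
    for point, p_mass in p.items():
        if p_mass == 0.0:
            continue
        q_mass = q.get(point, 0.0)
        if q_mass == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_mass * math.log(p_mass / q_mass)
    return total


def ot_distance_1d(xs, ys):
    """OT distance with cost |x - y| between equal-size, uniformly weighted 1-D point sets."""
    assert len(xs) == len(ys)
    # In one dimension the optimal plan pairs the sorted points in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

For supports {0, 1} and {0 + s, 1 + s} with s > 1, `kl_divergence` stays at infinity for every shift s, while `ot_distance_1d` returns s and therefore still tells how far apart the two sets are.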
3 Feature Scattering-based Adversarial Training
3.1 Feature Matching and Feature Scattering
Feature Matching. Conventional training treats training data as i.i.d. samples from a data distribution, overlooking the connections between samples. The same assumption is used when generating adversarial examples for training, with the direction for perturbing a sample purely based on the direction from the current data point to the decision boundary, regardless of other samples. While effective, it disregards the inter-relationship between different feature points, as the adversarial perturbation is computed individually for each sample, neglecting any collective distributional property. Furthermore, the supervised generation of the attacks makes the generated perturbations highly biased towards the decision boundary, as shown in Figure 2. This is less desirable as it might neglect other directions that are crucial for learning robust models not_bug (); saliency_map () and leads to label leaking due to high correlation between the perturbation and the decision boundary.
The idea of leveraging inter-sample relationships for learning dates back to the seminal work of Saul03thinkglobally (); NCA (); NCA_nonlinear (). This type of local structure is also exploited in this work, but for adversarial perturbation. The quest for local structure utilization and seamless integration with the end-to-end training framework naturally motivates an OT-based soft matching scheme, using the OT distance as in Eqn.(3). We consider OT between discrete distributions hereafter as we mainly focus on applying the OT distance on image features. Specifically, consider two discrete distributions $\mu$ and $\nu$, which can be written as $\mu = \sum_{i=1}^{n} u_i \delta_{x_i}$ and $\nu = \sum_{i=1}^{n} v_i \delta_{x'_i}$, with $\delta_{x}$ the Dirac function centered on $x$. (Footnote 1: The two discrete distributions could be of different dimensions; here we present the exposition assuming the same dimensionality to avoid notational clutter.) The weight vectors $u = \{u_i\}_{i=1}^{n}$ and $v = \{v_i\}_{i=1}^{n}$ belong to the $n$-dimensional simplex, i.e., $\sum_{i=1}^{n} u_i = \sum_{i=1}^{n} v_i = 1$, as both $\mu$ and $\nu$ are probability distributions. Under such a setting, computing the OT distance as defined in Eqn.(3) is equivalent to solving the following network-flow problem
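$$\mathcal{D}(\mu, \nu) = \min_{T \in \Pi(u, v)} \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, c(x_i, x'_j) = \min_{T \in \Pi(u, v)} \langle T, C \rangle \tag{4}$$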
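where $\Pi(u, v) = \{ T \in \mathbb{R}_{+}^{n \times n} \mid T \mathbf{1}_n = u,\ T^{\top} \mathbf{1}_n = v \}$. $\mathbf{1}_n$ is an $n$-dimensional all-one vector. $\langle \cdot, \cdot \rangle$ represents the Frobenius dot-product. $C$ is the transport cost matrix such that $C_{ij} = c(x_i, x'_j)$. In this work, the transport cost is defined as the cosine distance between image features: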
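$$c(x_i, x'_j) = 1 - \frac{f_\theta(x_i)^{\top} f_\theta(x'_j)}{\| f_\theta(x_i) \|_2 \, \| f_\theta(x'_j) \|_2} \tag{5}$$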
where $f_\theta(\cdot)$ denotes the feature extractor with parameter $\theta$. We implement $f_\theta(\cdot)$ as the deep neural network up to the softmax layer. We can now formally define the feature matching distance as follows.
Definition 1.
(Feature Matching Distance) The feature matching distance between two sets of images is defined as $\mathcal{D}(\mu, \nu)$, the OT distance between the empirical distributions $\mu$ and $\nu$ for the two sets.
Note that the feature-matching distance is also a function of $\theta$ (i.e. $\mathcal{D}_\theta$) when $f_\theta(\cdot)$ is used for extracting the features in the computation of the ground distance as in Eqn.(5). We will simply use the notation $\mathcal{D}$ in the following when there is no danger of confusion, to minimize notational clutter.
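To make the feature matching distance concrete, the sketch below builds a cosine cost matrix between two small sets of feature vectors and approximates the optimal transport plan with Sinkhorn iterations. It is an illustration under stated assumptions, not the paper's implementation: the weights are uniform, the features are plain Python lists rather than network outputs, the entropic regularization `reg=0.1` is chosen for numerical convenience, and all function names are hypothetical.

```python
import math


def cosine_cost_matrix(feats_a, feats_b):
    """C[i][j] = 1 - cosine similarity between feats_a[i] and feats_b[j]."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    cost = []
    for fa in feats_a:
        row = []
        for fb in feats_b:
            dot = sum(p * q for p, q in zip(fa, fb))
            # Small constant guards against division by zero for all-zero features.
            row.append(1.0 - dot / (norm(fa) * norm(fb) + 1e-12))
        cost.append(row)
    return cost


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale columns and rows to match the uniform marginals.
        b = [(1.0 / m) / sum(kernel[i][j] * a[i] for i in range(n)) for j in range(m)]
        a = [(1.0 / n) / sum(kernel[i][j] * b[j] for j in range(m)) for i in range(n)]
    return [[a[i] * kernel[i][j] * b[j] for j in range(m)] for i in range(n)]


def feature_matching_distance(feats_a, feats_b, reg=0.1, n_iters=200):
    """Approximate <T, C> with T from Sinkhorn and C the cosine cost matrix."""
    cost = cosine_cost_matrix(feats_a, feats_b)
    plan = sinkhorn_plan(cost, reg, n_iters)
    return sum(
        plan[i][j] * cost[i][j] for i in range(len(cost)) for j in range(len(cost[0]))
    )
```

Because of the entropic term, the value returned for two identical sets is small but not exactly zero; it approaches the unregularized distance as `reg` decreases.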
Feature Scattering.
Based on the feature matching distance defined above, we can formulate the proposed feature scattering method as follows:
$$\hat{\nu} = \arg\max_{\nu \in \mathcal{S}_\mu} \mathcal{D}(\mu, \nu) \tag{6}$$
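This can be intuitively interpreted as maximizing the feature matching distance between the original and perturbed empirical distributions with respect to the inputs, subject to the domain constraints $\mathcal{S}_\mu$:
$$\mathcal{S}_\mu = \Big\{ \sum_{i} v_i \delta_{z_i} \;\Big|\; z_i \in \mathcal{B}(x_i, \epsilon) \cap [0, 255]^{d} \Big\},$$
where $\mathcal{B}(x, \epsilon) = \{ z \mid \| z - x \|_\infty \le \epsilon \}$ denotes the $\ell_\infty$-cube with center $x$ and radius $\epsilon$. Formally, we present the notion of feature scattering as follows.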
Definition 2.
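(Feature Scattering) Given a set of clean data $\{x_i\}_{i=1}^{n}$, which can be represented as an empirical distribution $\mu = \sum_{i} u_i \delta_{x_i}$ with $\sum_{i} u_i = 1$, the feature scattering procedure is defined as producing a perturbed empirical distribution $\nu = \sum_{i} v_i \delta_{x'_i}$ with $\sum_{i} v_i = 1$ by maximizing $\mathcal{D}(\mu, \nu)$, the feature matching distance between $\mu$ and $\nu$, subject to domain and budget constraints.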
Remark 1.
As feature scattering is performed on a batch of samples leveraging inter-sample structure, it is more effective as an adversarial attack than structure-agnostic random perturbation, while being less constrained than perturbations generated in a supervised manner, which are decision-boundary oriented and suffer from label leaking. Empirical comparisons will be provided in Section 5.
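A single feature scattering step on a small batch can be sketched as follows. Everything concrete here is an assumption made for illustration: a toy linear map stands in for the feature extractor, pixels are scaled to [0, 1], the transport plan from Sinkhorn is held fixed while a finite-difference gradient of the matching distance is taken with respect to the perturbed inputs, and the step size equals the budget. The paper's implementation uses a deep network with automatic differentiation; the names below are hypothetical.

```python
import math
import random


def features(x, weights):
    """Toy linear feature extractor standing in for the network (an assumption)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def cosine_cost(fa, fb):
    dot = sum(p * q for p, q in zip(fa, fb))
    norm_a = math.sqrt(sum(p * p for p in fa))
    norm_b = math.sqrt(sum(q * q for q in fb))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def cost_matrix(clean, perturbed, weights):
    fc = [features(x, weights) for x in clean]
    fp = [features(z, weights) for z in perturbed]
    return [[cosine_cost(a, b) for b in fp] for a in fc]


def sinkhorn_plan(cost, reg=0.1, n_iters=50):
    """Entropy-regularized plan between uniform marginals over the batch."""
    n = len(cost)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = [1.0] * n, [1.0] * n
    for _ in range(n_iters):
        b = [(1.0 / n) / sum(kernel[i][j] * a[i] for i in range(n)) for j in range(n)]
        a = [(1.0 / n) / sum(kernel[i][j] * b[j] for j in range(n)) for i in range(n)]
    return [[a[i] * kernel[i][j] * b[j] for j in range(n)] for i in range(n)]


def matching_distance(clean, perturbed, weights, plan):
    cost = cost_matrix(clean, perturbed, weights)
    n = len(cost)
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(n))


def clip01(value):
    return min(1.0, max(0.0, value))


def feature_scatter(clean, weights, eps=8 / 255, step=8 / 255, seed=0, h=1e-6):
    """One scattering step: random start, signed ascent on the distance, projection."""
    rng = random.Random(seed)
    pert = [[clip01(xi + rng.uniform(-eps, eps)) for xi in x] for x in clean]
    plan = sinkhorn_plan(cost_matrix(clean, pert, weights))  # held fixed below
    base = matching_distance(clean, pert, weights, plan)
    scattered = []
    for i, z in enumerate(pert):
        row = []
        for k, zk in enumerate(z):
            bumped = [list(p) for p in pert]
            bumped[i][k] = zk + h
            grad = (matching_distance(clean, bumped, weights, plan) - base) / h
            moved = zk + step * ((grad > 0) - (grad < 0))
            # Project back into the eps-cube around the clean sample and into [0, 1].
            moved = min(clean[i][k] + eps, max(clean[i][k] - eps, moved))
            row.append(clip01(moved))
        scattered.append(row)
    return scattered
```

No label appears anywhere in this procedure, and the plan couples every clean sample with every perturbed sample in the batch, which is the collaborative, unsupervised aspect emphasized in Remark 1.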
3.2 Adversarial Training with Feature Scattering
We leverage feature scattering for adversarial training, with the mathematical formulation as follows
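$$\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(x'_i, y_i; \theta) \quad \text{s.t.} \quad \nu^{*} \triangleq \sum_{i=1}^{n} v_i \delta_{x'_i} = \arg\max_{\nu \in \mathcal{S}_\mu} \mathcal{D}(\mu, \nu) \tag{7}$$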
The proposed formulation deviates from the conventional minimax formulation for adversarial training FGSM (); madry2017towards (). More specifically, it can be regarded as an instance of the more general bilevel optimization problem bilevel_problem (); bilevel_book (). Feature scattering is effective in the adversarial training scenario, where more data is required schmidt2018adversarially (). Firstly, feature scattering promotes data diversity without drastically altering the structure of the data manifold as in the conventional supervised approach, with label leaking as one manifesting phenomenon. Secondly, the feature matching distance couples the samples within the batch together, therefore the generated adversarial attacks are produced collaboratively by taking the inter-sample relationship into consideration. Thirdly, feature scattering implicitly induces a coupled regularization (detailed below) on model training, leveraging the inter-sample structure for joint regularization. The following theorem characterizes the properties of the proposed approach and its connections and differences with previous methods.
Theorem 1.
The proposed approach is equivalent to the minimization of a loss, , consisting of the conventional loss on the original data, and a regularization term coupled over the inputs.
The proof has been deferred to the supplementary file. This result is useful for understanding the properties of both the existing and the proposed approaches. First, it highlights the unique property of the proposed feature scattering approach that it induces an effective regularization term that is coupled over all inputs, i.e., . This implies that the model leverages information from all inputs in a joint fashion for learning, offering the opportunity of collaborative regularization leveraging inter-sample relationships. Second, the usage of a function () different from for inducing offers more flexibility in the effective regularization; moreover, no label information is incorporated in , thus avoiding potential label leaking as in the conventional case when is highly correlated with . Finally, in the case when is separable over inputs and takes the form of a supervised loss, e.g., , the proposed approach reduces to the conventional adversarial training setup FGSM (); madry2017towards (). The overall procedure for the proposed approach is presented in Algorithm 1.
4 Discussions
Manifold-based Defense manifold_projection (); meng2017magnet (); manifold_nn (); manifold_gan (). manifold_projection (); meng2017magnet (); manifold_gan () proposed to defend by projecting the perturbed image onto a proper manifold. manifold_nn () used a similar idea of manifold projection but approximated this step with a nearest neighbor search against a web-scale database. In contrast, we leverage the manifold in the form of inter-sample relationships for the generation of the perturbations, which induces an implicit regularization of the model when used in the adversarial training framework. While defense in manifold_projection (); meng2017magnet (); manifold_nn (); manifold_gan () is achieved by shrinking the perturbed inputs towards the manifold, we expand the manifold using feature scattering to generate perturbed inputs for adversarial training.
Inter-sample Regularization zhang2017mixup (); logit_pairing (); VAT (). Mixup zhang2017mixup () generates training examples by linear interpolation between pairs of natural examples, thus introducing a linear inductive bias in the vicinity of training samples. Therefore, the model is expected to reduce the amount of undesirable oscillations for off-manifold samples. Logit pairing logit_pairing () augments the original training loss with a “pairing” loss, which measures the difference between the logits of clean and adversarial images. The idea is to suppress spurious logit responses using the natural logits as a reference. Similarly, virtual adversarial training VAT () proposed a regularization term based on the KL divergence between the prediction probabilities of original and adversarially perturbed images. In our model, the inter-sample relationship is leveraged for generating the adversarial perturbations, which induces an implicit regularization term in the objective function that is coupled over all input samples.
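The linear interpolation used by Mixup can be written in a few lines. This is a sketch of the commonly described recipe rather than code from the cited work: the mixing coefficient is drawn from a Beta(alpha, alpha) distribution, labels are assumed to be one-hot vectors, and `mixup_pair` is a hypothetical name.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=1.0, rng=random):
    """Convex combination of two examples and of their one-hot labels."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam
```

Here each mixed example depends on exactly one pair of samples, whereas the feature matching distance couples all samples in a batch through the transport plan.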
Wasserstein GAN and OT-GAN W_GAN (); OT_GAN (); FMGAN (). Generative Adversarial Networks (GANs) are a family of techniques that learn to capture the data distribution implicitly by generating samples directly GAN (). They originally suffered from training instability and mode collapse GAN (); W_GAN (). OT-related distances W_GAN (); sinkhorn_dis () have been used for overcoming the difficulties encountered in the original GAN training W_GAN (); OT_GAN (). This technique has been further extended to generating discrete data such as text FMGAN (). Different from GANs, which maximize a discrimination criterion w.r.t. the parameters of the discriminator for better capturing the data distribution, we maximize a feature matching distance w.r.t. the perturbed inputs for generating proper training data to improve model robustness.
5 Experiments
Baselines and Implementation Details. Our implementation is based on PyTorch and the code to reproduce our results will be available on the project page (footnote 2: https://github.com/author/feature_scatter). We conduct extensive experiments across several benchmark datasets including CIFAR10 krizhevsky2009learning (), CIFAR100 krizhevsky2009learning () and SVHN netzer2011reading (). We use Wide ResNet (WRN-28-10) zagoruyko2016wide () as the network structure following madry2017towards (). We compare the performance of the proposed method with a number of baseline methods, including: i) the model trained with the standard approach using clean images (Standard) krizhevsky2009learning (); ii) the PGD-based approach from Madry et al. (Madry) madry2017towards (), which is one of the most effective defense methods athalye2018obfuscated () (footnote 3: https://github.com/anishathalye/obfuscated-gradients); iii) another recent method that performs adversarial training with both image and label adversarial perturbations (Bilateral) bilateral (). For training, the initial learning rate is for CIFAR and for SVHN. We set the number of epochs the Standard and Madry methods as with transition epochs as as we empirically observed the performance of the trained model stabilized before epochs. The training scheduling of epochs similar to bilateral () with the same transition epochs used as we empirically observed it helps with the model performance, possibly due to the increased variations of data via feature scattering. We performed standard data augmentation including random crops with pixels of padding and random horizontal flips krizhevsky2009learning () during training. The perturbation budget of $\epsilon = 8$ is used in training following the literature madry2017towards (). Label smoothing of 0.5, an attack iteration of 1 and the Sinkhorn algorithm sinkhorn_dis () with regularization of 0.01 are used.
For testing, model robustness is evaluated by approximately computing an upper bound of robustness on the test set, by measuring the accuracy of the model under different adversarial attacks, including white-box FGSM FGSM (), PGD madry2017towards (), CW carlini2016towards () (CW-loss carlini2016towards () within the PGD framework) attacks and variants of black-box attacks.
5.1 Visual Classification Performance Under White-box Attacks
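As an illustration of the label smoothing setting mentioned above, the sketch below builds a smoothed target vector. Several smoothing variants exist and the text does not say which one is used; this version assumes the true class keeps probability `1 - smoothing` and the remaining mass is spread evenly over the other classes. The function name is hypothetical.

```python
def smooth_labels(label, num_classes, smoothing=0.5):
    """Soft target: 1 - smoothing on the true class, the rest shared by other classes."""
    off_value = smoothing / (num_classes - 1)
    return [1.0 - smoothing if k == label else off_value for k in range(num_classes)]
```

With ten classes and `smoothing=0.5`, the true class receives 0.5 and each other class receives 0.5 / 9 under this assumption.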
CIFAR10. We conduct experiments on CIFAR10 krizhevsky2009learning (), which is a popular dataset that is widely used in the adversarial training literature madry2017towards (); bilateral (), with 10 classes, 5K training images per class and 10K test images. We report the accuracy on the original test images (Clean) and under PGD and CW attack with $T$ iterations (PGD$T$ and CW$T$) madry2017towards (); carlini2016towards (). The evaluation results are summarized in Table 1. It is observed that the Standard model fails drastically under different white-box attacks. The Madry method improves the model robustness significantly over the Standard model. Under the standard PGD20 attack, it achieves 44.9% accuracy. The Bilateral approach further boosts the performance to 57.5%. The proposed approach outperforms both methods by a large margin, improving over Madry by 25.6%, and is 13.0% better than Bilateral, achieving 70.5% accuracy under the standard 20-step PGD attack. A similar pattern has been observed for the CW metric.
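Table 1: Clean accuracy and accuracy (%) under white-box attack ($\epsilon = 8$) on CIFAR10.

| Models | Clean | FGSM | PGD10 | PGD20 | PGD40 | PGD100 | CW10 | CW20 | CW40 | CW100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard | 95.6 | 36.9 | 0.0 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Madry | 85.7 | 54.9 | 45.1 | 44.9 | 44.8 | 44.8 | 45.9 | 45.7 | 45.6 | 45.4 |
| Bilateral | 91.2 | 70.7 | – | 57.5 | – | 55.2 | – | 56.2 | – | 53.8 |
| Proposed | 90.0 | 78.4 | 70.9 | 70.5 | 70.3 | 68.6 | 62.6 | 62.4 | 62.1 | 60.6 |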
We further evaluate model robustness against the PGD attacker under different attack budgets with a fixed attack step of 20, with the results shown in Figure 3 (a). It is observed that the performance of the Standard model drops quickly as the attack budget increases. The Madry model madry2017towards () improves the model robustness significantly across a wide range of attack budgets. The Proposed approach further boosts the performance over the Madry model madry2017towards () by a large margin under different attack budgets. We also conduct experiments using the PGD attacker with different attack iterations with a fixed attack budget of 8, with the results shown in Figure 3 (b-c) and also Table 1. It is observed that both Madry madry2017towards () and Proposed can maintain a fairly stable performance when the number of attack iterations is increased. Notably, the proposed approach consistently outperforms the Madry madry2017towards () model across a wide range of attack iterations. From Table 1, it is also observed that the Proposed approach outperforms Bilateral bilateral () under all variants of PGD and CW attacks. We will use PGD/CW attackers with $\epsilon = 8$ and attack steps of 20 and 100 in the sequel as part of the threat models.
SVHN. We further report results on the SVHN dataset netzer2011reading (). SVHN is a 10-way house number classification dataset, with 73257 training images and 26032 test images. The additional training images are not used in the experiments. The results are summarized in Table 2(a). Experimental results show that the proposed method achieves the best clean accuracy among all three robust models and outperforms the other methods by a clear margin under both PGD and CW attacks with different numbers of attack iterations, demonstrating the effectiveness of the proposed approach.
CIFAR100. We also conduct experiments on the CIFAR100 dataset, with 100 classes, 50K training and 10K test images krizhevsky2009learning (). Note that this dataset is more challenging than CIFAR10 as the number of training images per class is ten times smaller than that of CIFAR10. As shown by the results in Table 2(b), the proposed approach outperforms all baseline methods significantly, achieving about 20% higher accuracy than Madry madry2017towards () and Bilateral bilateral () under PGD attack and about 10% higher under CW attack. The superior performance of the proposed approach on this dataset further demonstrates the importance of leveraging inter-sample structure for learning blindspot_attack ().
5.2 Ablation Studies
We investigate the impacts of algorithmic components and more results are in the supplementary file.
The Importance of Feature Scattering. We empirically verify the effectiveness of feature scattering by comparing the performance of models trained using different perturbation schemes: i) Random: a natural baseline approach that randomly perturbs each sample within the epsilon neighborhood; ii) Supervised: perturbation generated using the ground-truth label in a supervised fashion; iii) FeaScatter: perturbation generated using the proposed feature scattering method. All other hyper-parameters are kept exactly the same other than the perturbation scheme used. The results are summarized in Table 3(a). It is evident that the proposed feature scattering (FeaScatter) approach outperforms both Random and Supervised methods, demonstrating its effectiveness. Furthermore, as it is the major component that is different from the conventional adversarial training pipeline, this result suggests that feature scattering is the main contributor to the improved adversarial robustness.
The Role of Matching. We further investigate the role of matching schemes within the feature scattering component by comparing several different schemes: i) Uniform matching, which matches each clean sample uniformly with all perturbed samples in the batch; ii) Identity matching, which matches each clean sample to its perturbed sample only; iii) OT-matching: the proposed approach that assigns soft matches between the clean samples and perturbed samples according to the optimization criteria. The results are summarized in Table 3(b). It is observed that all variants of matching schemes lead to performance that is on par with or better than state-of-the-art methods, implying that the proposed framework is effective in general. Notably, OT-matching leads to the best results, suggesting the importance of proper matching for feature scattering.
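The uniform and identity schemes correspond to fixed transport plans that satisfy the same uniform marginals as the OT plan, so the three variants differ only in which matrix is paired with the cost matrix. The sketch below makes this explicit for a batch of size n; the function names are hypothetical, and the OT plan itself would come from a solver such as Sinkhorn.

```python
def uniform_plan(n):
    """Every clean sample is matched equally with every perturbed sample."""
    return [[1.0 / (n * n)] * n for _ in range(n)]


def identity_plan(n):
    """Each clean sample is matched only with its own perturbed version."""
    return [[1.0 / n if i == j else 0.0 for j in range(n)] for i in range(n)]


def transport_cost(plan, cost):
    """Frobenius dot-product <T, C> between a plan and a cost matrix."""
    return sum(t * c for plan_row, cost_row in zip(plan, cost) for t, c in zip(plan_row, cost_row))
```

Since both fixed plans are feasible for the problem in Eqn.(4), the exact OT value is never larger than the cost of either of them.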
The Impact of OT-Solvers. Exact minimization of Eqn.(4) over $T$ is intractable in general W_GAN (); OT_GAN (); gen_model (); sinkhorn_dis (). Here we compare two practical solvers, the Sinkhorn algorithm sinkhorn_dis () and the Inexact Proximal point method for Optimal Transport (IPOT) algorithm xie2018fast (). More details on them can be found in the supplementary file and sinkhorn_dis (); xie2018fast (); 2018Peyrecomputationalot (). The results are summarized in Table 4. It is shown that different instantiations of the proposed approach with different OT-solvers lead to comparable performance, implying that the proposed approach is effective in general regardless of the choice of OT-solver.
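Table 4: Accuracy (%) of the proposed approach with different OT-solvers.

| Dataset | OT-solver | Clean | FGSM | PGD20 | PGD100 | CW20 | CW100 |
|---|---|---|---|---|---|---|---|
| CIFAR10 | Sinkhorn | 90.0 | 78.4 | 70.5 | 68.6 | 62.4 | 60.6 |
| CIFAR10 | IPOT | 89.9 | 77.9 | 69.9 | 67.3 | 59.6 | 56.9 |
| SVHN | Sinkhorn | 96.2 | 83.5 | 62.9 | 52.0 | 61.3 | 50.8 |
| SVHN | IPOT | 96.0 | 82.6 | 60.0 | 49.3 | 57.8 | 48.4 |
| CIFAR100 | Sinkhorn | 73.9 | 61.0 | 47.2 | 46.2 | 34.6 | 30.6 |
| CIFAR100 | IPOT | 74.2 | 67.3 | 47.5 | 46.3 | 32.0 | 29.3 |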
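For readers unfamiliar with IPOT, the sketch below shows the proximal-point iteration as it is commonly described: the current plan is folded into the kernel at every outer step, followed by a small number of Sinkhorn-style scaling updates. It is written from the general description of the algorithm under the assumption of uniform marginals, and the parameter names (`beta`, `n_outer`, `n_inner`) are hypothetical rather than taken from the paper's code.

```python
import math


def ipot_plan(cost, beta=1.0, n_outer=50, n_inner=1):
    """Inexact proximal point iterations for a transport plan with uniform marginals."""
    n, m = len(cost), len(cost[0])
    mu = [1.0 / n] * n
    nu = [1.0 / m] * m
    kernel = [[math.exp(-c / beta) for c in row] for row in cost]
    plan = [[1.0] * m for _ in range(n)]
    b = [1.0 / m] * m
    for _ in range(n_outer):
        # Proximal step: reweight the kernel by the current plan.
        q = [[kernel[i][j] * plan[i][j] for j in range(m)] for i in range(n)]
        for _ in range(n_inner):
            a = [mu[i] / sum(q[i][j] * b[j] for j in range(m)) for i in range(n)]
            b = [nu[j] / sum(q[i][j] * a[i] for i in range(n)) for j in range(m)]
        plan = [[a[i] * q[i][j] * b[j] for j in range(m)] for i in range(n)]
    return plan
```

Unlike a Sinkhorn solution with a fixed regularization strength, these iterates move towards a plan of the unregularized problem as the outer loop proceeds, which is one way the two solvers can differ in practice.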
5.3 Performance under Black-box Attack
To further verify whether a degenerate minimum is obtained, we evaluate the robustness of the model trained with the proposed approach w.r.t. black-box attacks (B-Attack) following tramer2017ensemble (). Two different models are used for generating test-time attacks: i) Undefended: an undefended model trained using the Standard approach, ii) Siamese: a robust model from another training session using the proposed approach. As demonstrated by the results in the table on the right, the model trained with the proposed approach is robust against different types of black-box attacks, verifying that a non-degenerate solution is learned tramer2017ensemble ().
Finally, we visualize in Figure 4 the loss surfaces of different models as another level of comparison.
6 Conclusion
We present a feature scattering-based adversarial training method in this paper. The proposed approach distinguishes itself from others by using an unsupervised feature-scattering approach for generating adversarial training images, which leverages the inter-sample relationship for collaborative perturbation generation. We show that a coupled regularization term is induced from feature scattering for adversarial training and empirically demonstrate the effectiveness of the proposed approach through extensive experiments on benchmark datasets.
References
 (1) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 (2) A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, 2018.
 (3) J. F. Bard. Practical Bilevel Optimization: Algorithms and Applications. Springer Publishing Company, Incorporated, 1st edition, 2010.
 (4) B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Srndic, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2013.
 (5) B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. CoRR, abs/1712.03141, 2017.
 (6) W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations, 2018.
 (7) T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer. Adversarial patch. CoRR, abs/1712.09665, 2017.
 (8) N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.
 (9) N. Carlini and D. A. Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In IEEE Symposium on Security and Privacy Workshops, 2018.
 (10) L. Chen, S. Dai, C. Tao, H. Zhang, Z. Gan, D. Shen, Y. Zhang, G. Wang, R. Zhang, and L. Carin. Adversarial text generation via feature-mover’s distance. In Advances in Neural Information Processing Systems, 2018.
 (11) M. Cisse, Y. Adi, N. Neverova, and J. Keshet. Houdini: Fooling deep structured prediction models. In Advances in Neural Information Processing Systems, 2017.
 (12) M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2013.
 (13) S. Dempe, V. Kalashnikov, G. A. Pérez-Valdés, and N. Kalashnykova. Bilevel Programming Problems: Theory, Algorithms and Applications to Energy Networks. Springer Publishing Company, Incorporated, 2015.
 (14) F. Draxler, K. Veschgini, M. Salmhofer, and F. Hamprecht. Essentially no barriers in neural network energy landscape. In International Conference on Machine Learning, 2018.
 (15) A. Dubey, L. van der Maaten, Z. Yalniz, Y. Li, and D. Mahajan. Defense against adversarial images using web-scale nearest-neighbor search. CoRR, abs/1903.01612, 2019.
 (16) C. Etmann, S. Lunz, P. Maass, and C.-B. Schönlieb. On the connection between adversarial robustness and saliency map interpretability. In International Conference on Machine Learning, 2019.
 (17) K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, F. Tramèr, A. Prakash, T. Kohno, and D. Song. Physical adversarial examples for object detectors. CoRR, abs/1807.07769, 2018.
 (18) A. Fawzi, S. Moosavi-Dezfooli, P. Frossard, and S. Soatto. Empirical study of the topology and geometry of deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 (19) A. Genevay, G. Peyré, and M. Cuturi. GAN and VAE from an optimal transport point of view. arXiv preprint arXiv:1706.01807, 2017.
 (20) A. Genevay, G. Peyré, and M. Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, 2018.
 (21) J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems, 2005.
 (22) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
 (23) I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 (24) C. Guo, M. Rana, M. Cissé, and L. van der Maaten. Countering adversarial images using input transformations. In International Conference on Learning Representations, 2018.
 (25) G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 (26) A. Ilyas, A. Jalal, E. Asteri, C. Daskalakis, and A. G. Dimakis. The robust manifold defense: Adversarial training using generative models. CoRR, abs/1712.09196, 2017.
 (27) A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry. Adversarial examples are not bugs, they are features. In International Conference on Learning Representations, 2019.
 (28) J.-H. Jacobsen, J. Behrmann, R. Zemel, and M. Bethge. Excessive invariance causes adversarial vulnerability. In International Conference on Learning Representations, 2019.
 (29) H. Kannan, A. Kurakin, and I. J. Goodfellow. Adversarial logit pairing. CoRR, abs/1803.06373, 2018.
 (30) A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 (31) A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations, 2017.
 (32) F. Liao, M. Liang, Y. Dong, and T. Pang. Defense against adversarial attacks using high-level representation guided denoiser. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 (33) B. Lindqvist, S. Sugrim, and R. Izmailov. AutoGAN: Robust classifier against adversarial attacks. CoRR, abs/1812.03405, 2018.
 (34) X. Liu, M. Cheng, H. Zhang, and C.-J. Hsieh. Towards robust neural networks via random self-ensemble. In European Conference on Computer Vision, 2018.
 (35) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
 (36) D. Meng and H. Chen. MagNet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017.
 (37) J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. In International Conference on Learning Representations, 2017.
 (38) T. Miyato, S. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. CoRR, abs/1704.03976, 2017.
 (39) S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 (40) S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 (41) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 (42) N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. CoRR, abs/1511.07528, 2015.
 (43) S. Park and M. Thorpe. Representing and learning high dimensional data with the optimal transport map from a probabilistic viewpoint. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 (44) G. Peyré and M. Cuturi. Computational optimal transport. to appear in Foundations and Trends in Machine Learning, 2018.
 (45) A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer. Deflecting adversarial attacks with pixel deflection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 (46) A. Rolet, M. Cuturi, and G. Peyré. Fast dictionary learning with a smoothed Wasserstein loss. In International Conference on Artificial Intelligence and Statistics, 2016.
 (47) R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In International Conference on Artificial Intelligence and Statistics, 2007.
 (48) T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018.
 (49) P. Samangouei, M. Kabkab, and R. Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, 2018.
 (50) L. K. Saul, S. T. Roweis, and Y. Singer. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.
 (51) L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.
 (52) Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.
 (53) J. Su, D. V. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. CoRR, abs/1710.08864, 2017.
 (54) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
 (55) T. Tanay and L. D. Griffin. A boundary tilting persepective on the phenomenon of adversarial examples. CoRR, abs/1608.07690, 2016.
 (56) I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schölkopf. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.
 (57) F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.
 (58) F. Tramèr, N. Papernot, I. J. Goodfellow, D. Boneh, and P. D. McDaniel. The space of transferable adversarial examples. CoRR, abs/1704.03453, 2017.
 (59) C. Villani. Optimal transport, old and new. Springer, 2008.
 (60) J. Wang. Bilateral adversarial training: Towards fast training of more robust models against adversarial attacks. CoRR, abs/1811.10716, 2018.
 (61) D. Warde-Farley. Adversarial perturbations of deep neural networks. 2016.
 (62) C. Xiao, B. Li, J.-Y. Zhu, W. He, M. Liu, and D. Song. Generating adversarial examples with adversarial networks. In International Joint Conference on Artificial Intelligence, 2018.
 (63) C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille. Mitigating adversarial effects through randomization. In International Conference on Learning Representations, 2018.
 (64) C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille. Adversarial examples for semantic segmentation and object detection. In International Conference on Computer Vision, 2017.
 (65) C. Xie, Y. Wu, L. van der Maaten, A. Yuille, and K. He. Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
 (66) Y. Xie, X. Wang, R. Wang, and H. Zha. A fast proximal point method for Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.
 (67) Z. Yan, Y. Guo, and C. Zhang. Deep defense: Training DNNs with improved adversarial robustness. In Advances in Neural Information Processing Systems, 2018.
 (68) S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
 (69) H. Zhang, H. Chen, Z. Song, D. Boning, I. Dhillon, and C.-J. Hsieh. The limitations of adversarial training and the blind-spot attack. In International Conference on Learning Representations, 2019.
 (70) H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.