NAG: Network for Adversary Generation
Abstract
Adversarial perturbations can pose a serious threat to the deployment of machine learning systems. Recent works have shown the existence of image-agnostic perturbations that can fool classifiers over most natural images. Existing methods present optimization approaches that solve for a fooling objective with an imperceptibility constraint to craft the perturbations. However, for a given classifier, they generate one perturbation at a time, which is a single instance from the manifold of adversarial perturbations. Also, in order to build robust models, it is essential to explore the manifold of adversarial perturbations. In this paper, we propose, for the first time, a generative approach to model the distribution of adversarial perturbations. The architecture of the proposed model is inspired by that of GANs and is trained using fooling and diversity objectives. Our trained generator network attempts to capture the distribution of adversarial perturbations for a given classifier and readily generates a wide variety of such perturbations. Our experimental evaluation demonstrates that perturbations crafted by our model (i) achieve state-of-the-art fooling rates, (ii) exhibit wide variety and (iii) deliver excellent cross-model generalizability. Our work can be deemed an important step towards understanding the complex manifolds of adversarial perturbations.
1 Introduction
Machine learning systems have been shown [prsystemsunderattackpari2014, evasionmlkd2013, adversarialmlacmmm2011] to be vulnerable to adversarial noise: small but structured perturbations added to the input that drastically affect the model’s prediction. Recently, the most successful Deep Neural Network based object classifiers have also been observed [intriguingarxiv2013, explainingharnessingarxiv2014, deepfoolcvpr2016, atscalearxiv2016, practicalbbarxiv2016] to be susceptible to adversarial attacks with almost imperceptible perturbations. Researchers have attempted to explain this intriguing aspect by hypothesizing linear behaviour of the models (e.g. [explainingharnessingarxiv2014, deepfoolcvpr2016]), finite training data (e.g. [deeparchFTML2009]), etc. More importantly, the adversarial perturbations exhibit cross-model generalizability. That is, the perturbations learned on one model can fool another model even if the second model has a different architecture or has been trained on a disjoint subset of training images [intriguingarxiv2013, explainingharnessingarxiv2014].
† * denotes authors contributed equally.
Recent startling findings by Moosavi-Dezfooli et al. [universalcvpr2017] and Mopuri et al. [mopuribmvc2017] have shown that it is possible to mislead multiple state-of-the-art deep neural networks over most of the images by adding a single perturbation. That is, these perturbations are image-agnostic and can fool multiple diverse networks trained on a target dataset. Such perturbations are named “Universal Adversarial Perturbations” (UAP) because a single adversarial noise can perturb images from multiple classes. On one side, the adversarial noise poses a severe threat to deploying machine learning based systems in the real world. Particularly for applications that involve safety and privacy (e.g., autonomous driving and access granting), it is essential to develop models that are robust against adversarial attacks. On the other side, it also poses a challenge to our understanding of these models along with the current learning practices. Thus, the adversarial behaviour of deep learning models under small and structured noise demands a rigorous study now more than ever.
All the existing methods, whether image-specific [intriguingarxiv2013, explainingharnessingarxiv2014, deepfoolcvpr2016, carlinissp2017, liuiclr2017] or image-agnostic [universalcvpr2017, mopuribmvc2017], can craft only a single perturbation that makes the target classifier susceptible. Specifically, these methods typically learn a single perturbation from a possibly bigger set of perturbations that can fool the target classifier. It is observed that for a given technique (e.g., UAP [universalcvpr2017], FFF [mopuribmvc2017]), the perturbations learned across multiple runs are not very different. In spite of optimizing with a different data ordering or initialization, their objectives end up learning very close perturbations in the space (see sec. LABEL:subsec:diversity). In essence, these approaches can only prove that UAPs exist for a given classifier by crafting one such perturbation. This is very limited information about the underlying distribution of such perturbations and in turn about the target classifier itself. Therefore, a more relevant task at hand is to model the distribution of adversarial perturbations. Doing so can help us better analyze the susceptibility of the models to adversarial perturbations. Furthermore, modelling the distributions would provide insights regarding the transferability of adversarial examples and help to prevent black-box attacks [defensivedistillationarxiv2015, liuiclr2017]. It also helps to efficiently generate a large number of adversarial examples for learning robust models via adversarial training [ensembleATarxiv2017].
Empirical evidence [explainingharnessingarxiv2014, warde2016] has shown that the perturbations exist in large contiguous regions rather than being scattered in multiple small discontinuous pockets. In this paper, we attempt to model such regions for a given classifier via generative modelling. We introduce a GAN-like [gannips2014] generative model to capture the unknown distribution of adversarial perturbations. Two properties make the GAN framework a suitable choice for our task: it is free from parametric assumptions on the distribution, and it does not require the target distribution to be known (no samples from the target distribution of adversarial perturbations are available).
The major contributions of this work are

A novel objective (eq. 2) to craft universal adversarial perturbations for a given classifier that achieves state-of-the-art fooling performance on multiple CNN architectures trained for object recognition.

For the first time, we show that it is possible to model the distribution of such perturbations for a given classifier via a generative model. For this, we present an easily trainable framework for modelling the unknown distribution of perturbations.

We demonstrate empirically that the learned model can capture the distribution of perturbations and generates perturbations that exhibit diversity, high fooling capacity and excellent cross-model generalizability.
The rest of the paper is organized as follows: section 2 details the proposed method, section LABEL:sec:expts presents comprehensive experimentation to validate the utility of the proposed method, section LABEL:sec:relworks briefly discusses the existing related works, and finally section LABEL:sec:conclu concludes the paper.
2 Proposed Approach
This section presents a detailed account of the proposed method. For ease of reference, we first briefly introduce the GAN [gannips2014] framework.
2.1 GANs
Generative models for images have seen a renaissance lately, especially because of the availability of large datasets [imagenetijcv2015, places2arxiv2016] and the emergence of deep neural networks. In particular, Generative Adversarial Networks (GAN) [gannips2014] and Variational AutoEncoders (VAE) [vaearxiv2013] have shown significant promise in this direction. In this work, we utilize a GAN-like framework to model the distribution of the adversarial perturbations. A typical GAN framework consists of two parts: a Generator $G$ and a Discriminator $D$. The generator $G$ transforms a random vector $z$ into a meaningful image $I$; i.e., $G(z) = I$, where $z$ is usually sampled from a simple distribution (e.g., $\mathcal{N}(0, 1)$, $\mathcal{U}(-1, 1)$). $G$ is trained to produce images that are indistinguishable from real images from the true data distribution $p_{data}$. The discriminator $D$ accepts an image and outputs the probability for it to be a real image, a sample from $p_{data}$. Typically, $D$ is trained to output low probability when a fake (generated) image is presented. Both $G$ and $D$ are trained adversarially to compete with and improve each other. A properly trained generator at the end of training is expected to produce images that are indistinguishable from real images.
2.2 Modelling the adversaries
A broad overview of our method is illustrated in Fig. 1. We first formalize the notation used in the subsequent sections of the paper. Note that in this paper, we consider CNNs that are trained for object recognition [resnet2015, googlenetarxiv21014, vggarxiv2014, caffearxiv2014]. The data distribution over which the classifiers are trained is denoted as $\mathcal{X}$ and a particular sample from $\mathcal{X}$ is represented as $x$. The target CNN is denoted as $f$; therefore the output of a given layer $i$ is denoted as $f^i(x)$. The predicted label for a given data sample $x$ is denoted as $\hat{k}(x)$. The output of the softmax layer is denoted as $q$, which is a vector of predicted probabilities $q_j$ for each of the target categories $j$. The image-agnostic additive perturbation that fools the target CNN is denoted as $\delta$. $\xi$ denotes the limit on the perturbation $\delta$ in terms of its $\ell_\infty$ norm. Our objective is to model the distribution of such perturbations for a given classifier. Formally, we seek to model
$$\Delta = \{\delta : \hat{k}(x + \delta) \neq \hat{k}(x) \ \text{for} \ x \sim \mathcal{X} \ \text{and} \ \|\delta\|_\infty < \xi\} \tag{1}$$
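To make the two conditions in this definition concrete (a bound on the size of the perturbation and a change in the predicted label), the following is a minimal sketch in plain Python. It assumes the limit is measured with the $\ell_\infty$ norm; `toy_predict`, the sample values and the limit `xi = 0.5` are hypothetical stand-ins for illustration, not the classifiers or settings used in the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def linf_norm(delta: Sequence[float]) -> float:
    """Largest absolute entry of the perturbation (its l-infinity norm)."""
    return max(abs(v) for v in delta)


def fooling_rate(predict: Callable[[Vector], int],
                 samples: Sequence[Vector],
                 delta: Vector) -> float:
    """Fraction of samples whose predicted label changes when delta is added."""
    flipped = 0
    for x in samples:
        perturbed = [xi + di for xi, di in zip(x, delta)]
        if predict(perturbed) != predict(x):
            flipped += 1
    return flipped / len(samples)


# Hypothetical toy classifier: label 1 if the entries sum to a positive value.
def toy_predict(x: Vector) -> int:
    return 1 if sum(x) > 0 else 0


samples = [[0.5, 0.2], [0.1, 0.1], [-0.3, -0.4], [2.0, 1.0]]
delta = [-0.4, -0.4]  # one image-agnostic perturbation shared by all samples
xi = 0.5              # assumed norm limit (illustrative value)

within_budget = linf_norm(delta) < xi
rate = fooling_rate(toy_predict, samples, delta)
```

Here `rate` is the fraction of samples whose label flips under one shared perturbation, which is the sense in which a perturbation is image-agnostic.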
Since our objective is to model the unknown distribution of image-agnostic perturbations for a given trained classifier (target CNN), we make suitable changes to the GAN framework. The modifications we make are: (i) the Discriminator $D$ is replaced by the target CNN $f$, which is already trained and whose weights are frozen, and (ii) a novel loss (fooling and diversity objectives) is used instead of the adversarial loss to train the Generator $G$. Thus, the objective of our work is to train a model $G$ that can fool the target CNN. The architecture of $G$ is also similar to that of a typical GAN generator, which transforms a random sample $z$ into an image through a dense layer and a series of deconv layers. More details about the exact architecture are discussed in section LABEL:sec:expts and also in the appendix. We now proceed to discuss the fooling objective that enables us to model the adversarial perturbations for a given classifier.
2.3 Fooling Objective
In order to fool the target CNN, the generator should be driven by a suitable objective. Typical GANs use an adversarial loss to train their $G$. However, in this work we attempt to model a distribution whose samples are unavailable. We know only a single attribute of those samples: they are able to fool the target classifier. We incorporate this attribute via a fooling objective to train our $G$, which models the unknown distribution of perturbations.
We denote the label predicted by the target CNN on a clean sample $x$ as the benign prediction $c$ and that predicted on the corresponding perturbed sample $x + \delta$ as the adversarial prediction. Similarly, we denote the output vector of the softmax layer without and after adding $\delta$ as $q$ and $q'$ respectively. Ideally a perturbation $\delta$ should confuse the classifier so as to flip the benign prediction into a different adversarial prediction. For this to happen, after adding $\delta$, the confidence of the benign prediction should be reduced and that of another category should be made higher. Thus, we formulate a fooling loss $L_f$ to minimize the confidence $q'_c$ of the benign prediction on the perturbed sample:
$$L_f = -\log(1 - q'_c) \tag{2}$$
Fig. 1 gives a graphical explanation of the objective, where the fooling objective is shown by the blue colored block. Note that the fooling loss essentially encourages $G$ to generate perturbations that decrease the confidence of benign predictions and thus eventually flip the label.
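A minimal sketch of a loss of this kind is given below, assuming it takes the form $-\log(1 - q'_c)$, where $q'_c$ is the softmax confidence of the benign label on the perturbed sample. The function name, the clamping constant `eps` and the example probability vectors are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Sequence


def fooling_loss(q_perturbed: Sequence[float], benign_label: int,
                 eps: float = 1e-12) -> float:
    """-log(1 - q'_c): small when the benign class has low confidence.

    q_perturbed  : softmax output of the classifier on the perturbed sample
    benign_label : label predicted on the clean sample
    eps          : assumed clamp that keeps the logarithm finite when q'_c = 1
    """
    q_c = q_perturbed[benign_label]
    return -math.log(max(1.0 - q_c, eps))


# Benign label is class 1 in both cases (hypothetical softmax outputs).
still_confident = fooling_loss([0.05, 0.90, 0.05], benign_label=1)
fooled = fooling_loss([0.60, 0.10, 0.30], benign_label=1)
```

The loss is large while the classifier still assigns high confidence to the benign label and approaches zero as that confidence drops, so minimizing it pushes probability mass towards other categories.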
2.4 Diversity Objective
The fooling loss only encourages $G$ to learn a $\Delta$ that can guarantee high fooling capability for the generated perturbations $\delta$. This objective might lead to local minima where $G$ learns only a limited set of effective perturbations, as in [universalcvpr2017, mopuribmvc2017]. However, our objective is to model the distribution $\Delta$ such that it covers all varieties of those perturbations. Therefore, we introduce an additional component to the loss that encourages $G$ to explore the space of perturbations and generate a diverse set of perturbations. We term this objective the Diversity objective. Within a mini-batch of generated perturbations, this objective indirectly encourages them to be different by separating their feature embeddings projected by the target classifier. In other words, for a given pair of generated perturbations $\delta_n$ and $\delta_{n'}$, our objective increases the distance between $f^i(x_n + \delta_n)$ and $f^i(x_n + \delta_{n'})$ at a given layer $i$ in the classifier:
$$L_d = -\sum_{n=1}^{K} d\big(f^i(x_n + \delta_n),\ f^i(x_n + \delta_{n'})\big) \tag{3}$$
where $\delta_{n'}$ is a different perturbation from the same mini-batch ($n' \neq n$), $K$ is the batch size, and $x_n$, $\delta_n$ are the $n$th data sample and perturbation in the mini-batch respectively. Note that a batch contains $K$ perturbations generated by $G$ (via transforming random vectors $z$) and $K$ data samples $x$. $f^i$ is the output of the CNN at layer $i$ and $d(\cdot, \cdot)$ is a distance metric (e.g., Euclidean) between a pair of features. The orange colored block in Fig. 1 illustrates the diversity objective.
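The diversity term can be sketched as follows, assuming it is the negative sum, over the mini-batch, of Euclidean distances between the layer features of the same data sample under two different generated perturbations. The feature vectors are passed in directly, so no particular network or layer is assumed; the function names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def diversity_loss(feats: List[Sequence[float]],
                   feats_other: List[Sequence[float]]) -> float:
    """Negative summed distance between paired feature embeddings.

    feats[n]       : layer features of sample n plus its own perturbation
    feats_other[n] : layer features of sample n plus a different perturbation
                     from the same mini-batch (the pairing is an assumption)
    """
    return -sum(euclidean(a, b) for a, b in zip(feats, feats_other))


# Hypothetical features for a mini-batch of two samples.
collapsed = diversity_loss([[1.0, 2.0], [0.0, 1.0]], [[1.0, 2.0], [0.0, 1.0]])
diverse = diversity_loss([[1.0, 2.0], [0.0, 1.0]], [[4.0, 6.0], [0.0, 4.0]])
```

When the two perturbations produce identical embeddings the term is zero, and it becomes more negative as the embeddings move apart, so minimizing it rewards perturbations that the classifier represents differently.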
Thus, our final loss becomes the summation of both fooling and diversity objectives and is given by
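$$Loss = L_f + \lambda L_d \tag{4}$$
Since it is important to learn diverse perturbations that exhibit maximum fooling, we give equal importance to both the fooling loss $L_f$ and the diversity loss $L_d$ in the final loss to learn the $\Delta$ (i.e., $\lambda = 1$).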
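As a small illustration of the weighting, the sketch below combines two precomputed terms with a scalar weight, assuming the final loss is the fooling term plus the weighted diversity term. The function name `total_loss` and the input values are hypothetical.

```python
def total_loss(fooling: float, diversity: float, lam: float = 1.0) -> float:
    """Weighted sum of the fooling and diversity terms.

    lam = 1.0 gives both terms equal importance, as stated in the text.
    """
    return fooling + lam * diversity


# Hypothetical values for one mini-batch.
equal_weight = total_loss(fooling=0.25, diversity=-8.0)
fooling_only = total_loss(fooling=0.25, diversity=-8.0, lam=0.0)
```

Setting `lam` to zero recovers the fooling term alone, which is the setting in which the generator has no explicit incentive to produce varied perturbations.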
2.5 Architecture and Implementation details
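| Model | Method | VGGF | CaffeNet | GoogLeNet | VGG16 | VGG19 | ResNet50 | ResNet152 | Mean FR |
|---|---|---|---|---|---|---|---|---|---|
| VGGF | Our | 94.10 ± 1.84 | 81.28 ± 3.50 | 64.15 ± 3.41 | 52.93 ± 8.50 | 55.39 ± 2.68 | 50.56 ± 4.50 | 47.67 ± 4.12 | 63.73 |
| VGGF | UAP | 93.7 | 71.8 | 48.4 | 42.1 | 42.1 | - | 47.4 | 57.58 |
| CaffeNet | Our | 79.25 ± 1.44 | 96.44 ± 1.56 | 66.66 ± 1.84 | 50.40 ± 5.61 | 55.13 ± 4.15 | 52.38 ± 3.96 | 48.58 ± 4.25 | 64.12 |
| CaffeNet | UAP | 74.0 | 93.3 | 47.7 | 39.9 | 39.9 | - | 48.0 | 56.71 |
| GoogLeNet | Our | 64.83 ± 0.86 | 70.46 ± 2.12 | 90.37 ± 1.55 | 56.40 ± 4.13 | 59.14 ± 3.17 | 63.21 ± 4.40 | 59.22 ± 1.64 | 66.23 |
| GoogLeNet | UAP | 46.2 | 43.8 | 78.9 | 39.2 | 39.8 | - | 45.5 | 48.9 |
| VGG16 | Our | 60.56 ± 2.24 | 65.55 ± 6.95 | 67.38 ± 4.84 | 77.57 ± 2.77 | 73.25 ± 1.63 | 61.28 ± 3.47 | 54.38 ± 2.63 | 65.71 |
| VGG16 | UAP | 63.4 | 55.8 | 56.5 | 78.3 | 73.1 | - | 63.4 | 65.08 |
| VGG19 | Our | 67.80 ± 2.49 | 67.58 ± 5.59 | 74.48 ± 0.94 | 80.56 ± 3.26 | 83.78 ± 2.45 | 68.75 ± 3.38 | 65.43 ± 1.90 | 72.62 |
| VGG19 | UAP | 64.0 | 57.2 | 53.6 | 73.5 | 77.8 | - | 58.0 | 64.01 |
| ResNet50 | Our | 47.06 ± 2.60 | 63.35 ± 1.70 | 65.30 ± 1.14 | 55.16 ± 2.61 | 52.67 ± 2.58 | 86.64 ± 2.73 | 66.40 ± 1.89 | 62.37 |
| ResNet50 | UAP | - | - | - | - | - | - | - | - |
| ResNet152 | Our | 57.66 ± 4.37 | 64.86 ± 2.95 | 62.33 ± 1.39 | 52.17 ± 3.41 | 53.18 ± 4.16 | 73.32 ± 2.75 | 87.24 ± 2.72 | 64.39 |
| ResNet152 | UAP | 46.3 | 46.3 | 50.5 | 47.0 | 45.5 | - | 84.0 | 53.27 |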