NAG: Network for Adversary Generation


Konda Reddy Mopuri*, Utkarsh Ojha*, Utsav Garg and R. Venkatesh Babu (* equal contribution)
Video Analytics Lab, Computational and Data Sciences
Indian Institute of Science, Bangalore, India
kondamopuri@iisc.ac.in, utkarsh2254@gmail.com, utsav002@e.ntu.edu.sg and venky@iisc.ac.in
Abstract

Adversarial perturbations can pose a serious threat to deploying machine learning systems. Recent works have shown the existence of image-agnostic perturbations that can fool classifiers on most natural images. Existing methods present optimization approaches that solve for a fooling objective under an imperceptibility constraint to craft the perturbations. However, for a given classifier, they generate one perturbation at a time, which is a single instance from the manifold of adversarial perturbations. Moreover, in order to build robust models, it is essential to explore this manifold of adversarial perturbations. In this paper, we propose, for the first time, a generative approach to model the distribution of adversarial perturbations. The architecture of the proposed model is inspired by that of GANs and is trained using fooling and diversity objectives. Our trained generator network attempts to capture the distribution of adversarial perturbations for a given classifier and readily generates a wide variety of such perturbations. Our experimental evaluation demonstrates that perturbations crafted by our model (i) achieve state-of-the-art fooling rates, (ii) exhibit wide variety, and (iii) deliver excellent cross-model generalizability. Our work can be deemed an important step towards understanding the complex manifolds of adversarial perturbations.

1 Introduction

Machine learning systems have been shown [prsystemsunderattack-pari-2014, evasion-mlkd-2013, adversarialml-acmmm-2011] to be vulnerable to adversarial noise: small but structured perturbations added to the input that drastically affect the model's predictions. Recently, the most successful Deep Neural Network based object classifiers have also been observed [intriguing-arxiv-2013, explainingharnessing-arxiv-2014, deepfool-cvpr-2016, atscale-arxiv-2016, practicalbb-arxiv-2016] to be susceptible to adversarial attacks with almost imperceptible perturbations. Researchers have attempted to explain this intriguing behaviour by hypothesizing linearity of the models (e.g., [explainingharnessing-arxiv-2014, deepfool-cvpr-2016]), finite training data (e.g., [deeparch-FTML-2009]), etc. More importantly, adversarial perturbations exhibit cross-model generalizability: perturbations learned on one model can fool another model even if the second model has a different architecture or has been trained on a disjoint subset of the training images [intriguing-arxiv-2013, explainingharnessing-arxiv-2014].

Recent startling findings by Moosavi-Dezfooli et al. [universal-cvpr-2017] and Mopuri et al. [mopuri-bmvc-2017] have shown that it is possible to mislead multiple state-of-the-art deep neural networks over most images by adding a single perturbation. That is, these perturbations are image-agnostic and can fool multiple diverse networks trained on a target dataset. Such perturbations are called "Universal Adversarial Perturbations" (UAPs), since a single adversarial noise can perturb images from multiple classes. On the one hand, adversarial noise poses a severe threat to deploying machine learning based systems in the real world; particularly for applications involving safety and privacy (e.g., autonomous driving and access granting), it is essential to develop models that are robust to adversarial attacks. On the other hand, it challenges our understanding of these models and of current learning practices. Thus, the adversarial behaviour of deep learning models under small, structured noise demands rigorous study now more than ever.

Figure 1: Overview of the proposed approach that models the distribution of universal adversarial perturbations for a given classifier. The illustration shows a batch of random vectors $z$ being transformed by the generator $G$ into perturbations $\delta = G(z)$, which get added to the batch of data samples $x$. The top portion shows the adversarial batch $x + \delta$, the bottom portion shows the shuffled adversarial batch, and the middle portion shows the benign batch $x$. The fooling objective (eq. 2) and the diversity objective (eq. 3) constitute the loss. Note that the target CNN is a trained classifier whose parameters are not updated during the proposed training; the parameters of the generator $G$ are randomly initialized and learned by backpropagating the loss. (Best viewed in color.)

All the existing methods, whether image-specific [intriguing-arxiv-2013, explainingharnessing-arxiv-2014, deepfool-cvpr-2016, carlini-ssp-2017, liu-iclr-2017] or image-agnostic [universal-cvpr-2017, mopuri-bmvc-2017], craft only a single perturbation that makes the target classifier susceptible. Specifically, these methods typically learn a single perturbation from a possibly much larger set of perturbations that can fool the target classifier. It is observed that for a given technique (e.g., UAP [universal-cvpr-2017], FFF [mopuri-bmvc-2017]), the perturbations learned across multiple runs are not very different: in spite of optimizing with different data orderings or initializations, their objectives end up learning very close perturbations in the space (refer to the diversity analysis in the experiments section). In essence, these approaches only prove that UAPs exist for a given classifier by crafting one such perturbation $\delta$. This conveys very limited information about the underlying distribution of such perturbations and, in turn, about the target classifier itself. Therefore, a more relevant task at hand is to model the distribution of adversarial perturbations. Doing so can help us better analyze the susceptibility of models to adversarial perturbations. Furthermore, modelling the distribution would provide insights regarding the transferability of adversarial examples and help prevent black-box attacks [defensivedistillation-arxiv-2015, liu-iclr-2017]. It also helps to efficiently generate a large number of adversarial examples for learning robust models via adversarial training [ensembleAT-arxiv-2017].

Empirical evidence [explainingharnessing-arxiv-2014, warde2016] has shown that the perturbations exist in large contiguous regions rather than being scattered in multiple small discontinuous pockets. In this paper, we attempt to model such regions for a given classifier via generative modelling. We introduce a GAN-like [gan-nips-2014] generative model to capture the distribution of the unknown adversarial perturbations. The freedom from parametric assumptions on the distribution, together with the fact that the target distribution is unknown (no samples from the distribution of adversarial perturbations are available), makes the GAN framework a suitable choice for our task.

The major contributions of this work are

  • A novel objective (eq. 2) to craft universal adversarial perturbations for a given classifier that achieves state-of-the-art fooling performance on multiple CNN architectures trained for object recognition.

  • For the first time, we show that it is possible to model the distribution of such perturbations for a given classifier via a generative model. For this, we present an easily trainable framework for modelling the unknown distribution of perturbations.

  • We demonstrate empirically that the learned model captures the distribution of perturbations and generates perturbations that exhibit diversity, high fooling capacity, and excellent cross-model generalizability.

The rest of the paper is organized as follows: section 2 details the proposed method, the experiments section presents comprehensive experimentation to validate the utility of the proposed method, the related works section briefly discusses the existing related works, and the final section concludes the paper.

2 Proposed Approach

This section presents a detailed account of the proposed method. For ease of reference, we first briefly introduce the GAN [gan-nips-2014] framework.

2.1 GANs

Generative models for images have seen a renaissance lately, driven by the availability of large datasets [imagenet-ijcv-2015, places2-arxiv-2016] and the emergence of deep neural networks. In particular, Generative Adversarial Networks (GANs) [gan-nips-2014] and Variational Auto-Encoders (VAEs) [vae-arxiv-2013] have shown significant promise in this direction. In this work, we utilize a GAN-like framework to model the distribution of adversarial perturbations. A typical GAN framework consists of two parts: a generator $G$ and a discriminator $D$. The generator transforms a random vector $z$ into a meaningful image $x = G(z)$, where $z$ is usually sampled from a simple distribution (e.g., uniform or Gaussian). $G$ is trained to produce images that are indistinguishable from real images drawn from the true data distribution $p_{data}$. The discriminator accepts an image and outputs the probability of it being a real image, i.e., a sample from $p_{data}$. Typically, $D$ is trained to output a low probability when a fake (generated) image is presented. Both $G$ and $D$ are trained adversarially to compete with and improve each other. At the end of training, a properly trained generator is expected to produce images that are indistinguishable from real images.
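To make the adversarial game concrete, the following is a minimal, illustrative sketch of the standard GAN objective described above (background only, not the proposed method). It assumes PyTorch, flattened 64x64 RGB images, and toy linear $G$ and $D$ modules; all names and sizes are our assumptions, not from the paper.

```python
# Illustrative sketch of the standard GAN game (background only).
# Assumes PyTorch; real_images are flattened 64x64x3 float tensors in [-1, 1].
import torch
import torch.nn as nn

latent_dim = 100
img_dim = 64 * 64 * 3

G = nn.Sequential(nn.Linear(latent_dim, img_dim), nn.Tanh())   # z -> fake image
D = nn.Sequential(nn.Linear(img_dim, 1), nn.Sigmoid())         # image -> P(real)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_step(real_images):
    """One adversarial round: D separates real from fake, then G tries to
    make D assign a high 'real' probability to its samples."""
    b = real_images.size(0)
    z = torch.rand(b, latent_dim) * 2 - 1          # z ~ U[-1, 1]
    fake = G(z)

    # Discriminator update: real -> 1, fake -> 0
    opt_d.zero_grad()
    d_loss = bce(D(real_images), torch.ones(b, 1)) + \
             bce(D(fake.detach()), torch.zeros(b, 1))
    d_loss.backward()
    opt_d.step()

    # Generator update: fool D into outputting 1 on fakes
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(b, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```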

2.2 Modelling the adversaries

A broad overview of our method is illustrated in Fig. 1. We first formalize the notation used in the subsequent sections of the paper. Note that in this paper, we consider CNNs trained for object recognition [resnet-2015, googlenet-arxiv-21014, vgg-arxiv-2014, caffe-arxiv-2014]. The data distribution over which the classifiers are trained is denoted as $\mathcal{X}$ and a particular sample from $\mathcal{X}$ is represented as $x$. The target CNN is denoted as $f$, and the output of a given layer $i$ is denoted as $f^{i}(x)$. The label predicted for a given data sample $x$ is denoted as $\hat{k}(x)$. The output of the softmax layer is denoted as $q$, which is a vector of predicted probabilities $q_c$ for each of the target categories $c$. The image-agnostic additive perturbation that fools the target CNN is denoted as $\delta$, and $\xi$ denotes the limit on the perturbation in terms of its $\ell_{\infty}$ norm. Our objective is to model the distribution of such perturbations for a given classifier. Formally, we seek to model

$p(\delta): \quad \hat{k}(x+\delta) \neq \hat{k}(x) \ \text{ for most } x \sim \mathcal{X}, \quad \|\delta\|_{\infty} < \xi \qquad (1)$

Since our objective is to model the unknown distribution of image-agnostic perturbations for a given trained classifier (the target CNN), we make suitable changes to the GAN framework: (i) the discriminator $D$ is replaced by the target CNN, which is already trained and whose weights are frozen, and (ii) a novel loss (fooling and diversity objectives) is used instead of the adversarial loss to train the generator $G$. Thus, the objective of our work is to train a generator $G$ that can fool the target CNN. The architecture of $G$ is similar to that of a typical GAN generator: it transforms a random sample $z$ into an image-sized perturbation through a dense layer followed by a series of deconvolution layers. More details about the exact architecture are discussed in the experiments section and in the appendix. We now proceed to discuss the fooling objective that enables us to model the adversarial perturbations for a given classifier.
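As an illustration of this setup, below is a hedged sketch (not the exact architecture from the appendix): a simple dense-plus-deconvolution generator whose output is squashed to the $\ell_{\infty}$ ball of radius $\xi$ via a scaled tanh, and a frozen, pretrained torchvision VGG-16 standing in for the target CNN. The value of $\xi$ and the layer widths are assumptions made for the sketch.

```python
# Sketch of the modified GAN setup: a perturbation generator G and a frozen
# target CNN in place of the discriminator. Architectural details here are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
from torchvision import models

XI = 10.0 / 255.0   # assumed l_inf limit (xi) on the perturbation

class PerturbationGenerator(nn.Module):
    """Maps a latent vector z to a 224x224 perturbation with ||delta||_inf <= XI."""
    def __init__(self, latent_dim=100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 7 * 7)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 14x14
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),   # 28x28
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 56x56
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # 112x112
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),               # 224x224
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 128, 7, 7)
        return XI * torch.tanh(self.deconv(h))   # scaled tanh enforces the l_inf limit

G = PerturbationGenerator()

# Target CNN: an already-trained classifier; its weights stay frozen.
target = models.vgg16(pretrained=True).eval()
for p in target.parameters():
    p.requires_grad_(False)

# Only the generator is optimized with the fooling + diversity loss.
optimizer = torch.optim.Adam(G.parameters(), lr=2e-4)
```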

2.3 Fooling Objective

In order to fool the target CNN, the generator should be driven by a suitable objective. Typical GANs use an adversarial loss to train their generator. However, in this work we attempt to model a distribution whose samples are unavailable; we know only a single attribute of those samples, namely their ability to fool the target classifier. We incorporate this attribute via a fooling objective to train a $G$ that models the unknown distribution of perturbations.

We denote the label predicted by the target CNN on a clean sample as the benign prediction and that predicted on the corresponding perturbed sample as the adversarial prediction. Similarly, we denote the output vectors of the softmax layer before and after adding $\delta$ as $q$ and $q'$ respectively. Ideally, a perturbation should confuse the classifier so as to flip the benign prediction into a different adversarial prediction. For this to happen, after adding $\delta$, the confidence of the benign prediction should decrease and that of another category should increase. Thus, we formulate a fooling loss that minimizes the confidence of the benign prediction on the perturbed sample:

$L_f = -\log\left(1 - q'_{c}\right), \quad \text{where } c = \operatorname*{arg\,max}_{j} q_j \qquad (2)$

Fig. 1 gives a graphical explanation of the objective, where the fooling objective is shown by the blue colored block. Note that the fooling loss essentially encourages $G$ to generate perturbations that decrease the confidence of benign predictions and thus eventually flip the predicted label.
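A minimal sketch of this fooling loss (eq. 2) is given below, reusing the frozen `target` classifier and batch conventions from the previous sketch; the small epsilon for numerical stability is our addition.

```python
# Sketch of the fooling loss L_f = -log(1 - q'_c), where c is the label the
# target CNN predicts on the clean sample (benign prediction).
import torch
import torch.nn.functional as F

def fooling_loss(target, x, delta):
    with torch.no_grad():                                 # benign prediction c
        benign = target(x).argmax(dim=1)
    q_adv = F.softmax(target(x + delta), dim=1)           # q' on the perturbed input
    q_adv_c = q_adv.gather(1, benign.unsqueeze(1)).squeeze(1)
    return -torch.log(1.0 - q_adv_c + 1e-12).mean()       # averaged over the batch
```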

2.4 Diversity Objective

The fooling loss only encourages learning a $G$ that guarantees high fooling capability for the generated perturbations $\delta$. This objective might lead to a local minimum where $G$ learns only a limited set of effective perturbations, as in [universal-cvpr-2017, mopuri-bmvc-2017]. However, our objective is to model the distribution of perturbations such that it covers all varieties of them. Therefore, we introduce an additional loss component that encourages $G$ to explore the space of perturbations and generate a diverse set of them. We term this the diversity objective. Within a mini-batch of generated perturbations, this objective indirectly encourages them to be different by separating their feature embeddings as projected by the target classifier. In other words, for a given pair of generated perturbations $\delta_n$ and $\delta'_n$, our objective increases the distance between $f^{i}(x_n + \delta_n)$ and $f^{i}(x_n + \delta'_n)$ at a given layer $i$ in the classifier:

$L_d = -\sum_{n=1}^{K} d\left(f^{i}(x_n + \delta_n),\; f^{i}(x_n + \delta'_n)\right) \qquad (3)$

where $K$ is the mini-batch size, and $x_n$ and $\delta_n$ are the $n$-th data sample and perturbation in the mini-batch respectively. A batch contains $K$ perturbations generated by $G$ (via transforming random vectors $z$) and $K$ data samples; $\delta'_n$ denotes the $n$-th perturbation of the shuffled batch (bottom portion of Fig. 1). $f^{i}(\cdot)$ is the output of the target CNN at layer $i$, and $d(\cdot,\cdot)$ is a distance metric (e.g., Euclidean) between a pair of features. The orange colored block in Fig. 1 illustrates the diversity objective.
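The sketch below illustrates one way to implement this diversity term (eq. 3): the same data batch is paired once with the generated perturbations and once with a shuffled copy of them, and the resulting feature embeddings are pushed apart. Here `feature_extractor` is assumed to return $f^{i}(\cdot)$ for some chosen layer $i$, and Euclidean distance plays the role of $d(\cdot,\cdot)$.

```python
# Sketch of the diversity objective: maximize the feature-space distance
# between the adversarial batch and a shuffled adversarial batch.
import torch

def diversity_loss(feature_extractor, x, delta):
    perm = torch.randperm(delta.size(0))                  # delta' = shuffled perturbations
    feats_a = feature_extractor(x + delta)                # f^i(x_n + delta_n)
    feats_b = feature_extractor(x + delta[perm])          # f^i(x_n + delta'_n)
    dist = (feats_a - feats_b).flatten(1).norm(p=2, dim=1)
    return -dist.mean()                                   # minimizing this pushes features apart
```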

Thus, our final loss is the sum of the fooling and diversity objectives:

$L = L_f + L_d \qquad (4)$

Since it is important to learn diverse perturbations that exhibit maximum fooling, we give equal importance to both $L_f$ and $L_d$ in the final loss used to learn $G$ (i.e., both terms have unit weight).
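Putting the two objectives together, a single training step could look like the hedged sketch below, reusing the hypothetical `G`, `target`, `fooling_loss`, and `diversity_loss` pieces from the earlier snippets; only the generator's parameters receive gradient updates.

```python
# Sketch of one training step with L = L_f + L_d (equal weights).
import torch

def train_step(G, target, feature_extractor, optimizer, x, latent_dim=100):
    z = torch.rand(x.size(0), latent_dim) * 2 - 1         # one latent vector per image
    delta = G(z)                                           # batch of perturbations
    loss = fooling_loss(target, x, delta) + diversity_loss(feature_extractor, x, delta)
    optimizer.zero_grad()
    loss.backward()                                        # target CNN is frozen; only G learns
    optimizer.step()
    return loss.item()
```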

2.5 Architecture and Implementation details

Target CNN   Method  VGG-F       CaffeNet    GoogLeNet   VGG-16      VGG-19      ResNet-50   ResNet-152  Mean FR
VGG-F        Our     94.10±1.84  81.28±3.50  64.15±3.41  52.93±8.50  55.39±2.68  50.56±4.50  47.67±4.12  63.73
             UAP     93.7        71.8        48.4        42.1        42.1        -           47.4        57.58
CaffeNet     Our     79.25±1.44  96.44±1.56  66.66±1.84  50.40±5.61  55.13±4.15  52.38±3.96  48.58±4.25  64.12
             UAP     74.0        93.3        47.7        39.9        39.9        -           48.0        56.71
GoogLeNet    Our     64.83±0.86  70.46±2.12  90.37±1.55  56.40±4.13  59.14±3.17  63.21±4.40  59.22±1.64  66.23
             UAP     46.2        43.8        78.9        39.2        39.8        -           45.5        48.9
VGG-16       Our     60.56±2.24  65.55±6.95  67.38±4.84  77.57±2.77  73.25±1.63  61.28±3.47  54.38±2.63  65.71
             UAP     63.4        55.8        56.5        78.3        73.1        -           63.4        65.08
VGG-19       Our     67.80±2.49  67.58±5.59  74.48±0.94  80.56±3.26  83.78±2.45  68.75±3.38  65.43±1.90  72.62
             UAP     64.0        57.2        53.6        73.5        77.8        -           58.0        64.01
ResNet-50    Our     47.06±2.60  63.35±1.70  65.30±1.14  55.16±2.61  52.67±2.58  86.64±2.73  66.40±1.89  62.37
             UAP     -           -           -           -           -           -           -           -
ResNet-152   Our     57.66±4.37  64.86±2.95  62.33±1.39  52.17±3.41  53.18±4.16  73.32±2.75  87.24±2.72  64.39
             UAP     46.3        46.3        50.5        47.0        45.5        -           84.0        53.27
Table 1: Average fooling rates (%) of the perturbations modelled by our generative network vs. UAP [universal-cvpr-2017]. Rows indicate the target net for which the perturbations are modelled and columns indicate the net under attack. In each row, the entry where the target CNN matches the network under attack corresponds to the white-box attack; the remaining entries correspond to black-box attacks. For our method, standard deviations are reported along with the average fooling rates. The mean average fooling rate achieved by the generator for each target CNN is shown in the rightmost column (Mean FR).