Semantic Adversarial Attacks: Parametric Transformations That Fool Deep Classifiers
Abstract
Deep neural networks have been shown to exhibit an intriguing vulnerability to adversarial input images corrupted with imperceptible perturbations. However, the majority of adversarial attacks assume global, fine-grained control over the image pixel space. In this paper, we consider a different setting: what happens if the adversary could only alter specific attributes of the input image? These would generate inputs that might be perceptibly different, but still natural-looking and enough to fool a classifier. We propose a novel approach to generate such “semantic” adversarial examples by optimizing a particular adversarial loss over the range-space of a parametric conditional generative model. We demonstrate implementations of our attacks on binary classifiers trained on face images, and show that such natural-looking semantic adversarial examples exist. We evaluate the effectiveness of our attack on synthetic and real data, and present detailed comparisons with existing attack methods. We supplement our empirical results with theoretical bounds that demonstrate the existence of such parametric adversarial examples.
Ameya Joshi, Amitangshu Mukherjee, Soumik Sarkar, Chinmay Hegde
Iowa State University
{ameya, amimukh, soumiks, chinmay}@iastate.edu
This work was supported in part by NSF grants CCF1750920, CNS1845969, DARPA AIRA grant PA180202, AFOSR YIP Grant FA95501710220, an ERP grant from Iowa State University, a GPU gift grant from NVIDIA corporation, and faculty fellowships from the Black and Veatch Foundation.
1 Introduction
The existence of adversarial inputs for deep neural network-based classifiers has been well established by several recent works [5, 10, 16, 17, 58, 41]. The adversary typically confounds the classifier by adding an imperceptible perturbation to a given input image, where the range of the perturbation is defined in terms of bounded pixel-space norm balls. Such adversarial “attacks” appear to catastrophically affect the performance of state-of-the-art classifiers [1, 22, 23, 54].
Pixel-space norm-constrained attacks reveal interesting insights about generalization properties of deep neural networks. However, imperceptible attacks are certainly not the only means available to an adversary. Consider an input example that comprises salient, invariant features along with modifiable attributes. An example would be an image of a face, which consists of invariant features relevant to the identity of the person, and variable attributes such as hair color and presence/absence of eyeglasses. Adversarial examples obtained by altering only such modifiable attributes, though perceptually distinct from the original input, appear natural and acceptable to an oracle or a human observer but would still be able to subvert the classifier. Unfortunately, the large majority of adversarial attack methods do not port over to such natural settings.
A systematic study of such attacks is paramount in safety-critical applications that deploy neural classifiers, such as face-recognition systems or vision modules of autonomous vehicles. These systems are required to be immune to a limited amount of variability in input data, particularly when these variations are achieved through natural means. Therefore, a method to generate adversarial examples using natural perturbations, such as facial attributes in the case of face images, or different weather conditions for autonomous navigation systems, would shed further insights into the real-world robustness of such systems. We refer to such perceptible attacks as “semantic” attacks.
This setting fundamentally differs from existing attack approaches and has been (largely) unexplored thus far. Semantic attacks utilize nonlinear generative transformations of an input image instead of linear, additive techniques (such as image blending). Such complicated generative transformations would display higher degrees of nonlinearity in the corresponding attacks, the effects of which warrant further investigation. In addition, the role of the number of modifiable attributes (parameters in the generative models) in the given input is also an important point of consideration.
Contributions: We propose and rigorously analyze a framework for generating adversarial examples for a deep neural classifier by modifying semantic attributes.
We leverage generative models such as Fader Networks [30] that have semantically meaningful, tunable attributes corresponding to parameters in a continuous bounded space that implicitly defines the space of “natural” input data. Our approach exploits this property by treating the range space of these attribute models as a manifold of semantic transformations of an image.
We pose the search for adversarial examples on this semantic manifold as an optimization problem over the parameters conditioning the generative model. Using face image classification as a running test case, we train a variety of parametric models (including Fader Networks and Attribute GANs), and demonstrate the ability to generate semantically meaningful adversarial examples using each of these models. In addition to our empirical evaluations, we also provide a theoretical analysis of a simplified semantic attack model to understand the capacity of parametric attacks that typically exploit a significantly lower-dimensional attack space compared to the classical pixel-space attacks.
Our specific contributions are as follows:

We propose a novel optimization-based framework to generate semantically valid adversarial examples using parametric generative transformations.

We explore realizations of our approach using variants of multi-attribute transformation models, Fader Networks [30] and Attribute GANs [20], to generate adversarial face images for a binary classifier trained on the CelebA dataset [37]. Some of our modified multi-attribute models are nontrivial and may be of independent interest.

We present an empirical analysis of our approach and show that increasing the dimensionality of the attack space results in more effective attacks. In addition, we investigate a sequence of increasingly nonlinear attacks, and demonstrate that a higher degree of nonlinearity (surprisingly) leads to weaker attacks.

Finally, we provide a preliminary theoretical analysis by deriving upper bounds on the classification error for a simplified surrogate model under adversarial conditions [52]. This analysis supports our empirical observations regarding the dimensionality of the attack space.
We demonstrate the effectiveness of our attacks on simple deep classifiers trained over complex image datasets; hence, our empirical comparisons are significantly more realistic than popular attack methods such as FGSM [16] and PGD [29, 39] that primarily have focused on simpler datasets such as MNIST [32] and CIFAR. Our approach also presents an interesting use case for multi-attribute generative models, which have been used solely as visualization tools thus far.
Outline: We begin with a review of relevant literature in Section 2. We describe our proposed framework, Semantic Adversarial Generation, in Section 3. In Section 4, we describe two variants of our framework to show different methods of ensuring the semantic constraint. We provide empirical analysis of our work in Section 5. We further present empirical analysis and theoretical qualification on the dimensionality of the parametric attack space in Section 6, and conclude with possible extensions in Section 7.
2 Related Work
Due to space constraints coupled with the large amount of recent progress in the area of adversarial machine learning, our discussion of related work is necessarily incomplete. We defer a more detailed discussion to the supplementary material.
Our focus is on white-box, test-time attacks on deep classification systems; other families of attacks (such as backdoor attacks, data poisoning schemes, and black-box attacks) are not directly relevant to our setting, and we do not discuss those methods here.
Adversarial Attacks: Evidence that deep classifiers are susceptible to imperceptible adversarial examples can be attributed to Szegedy et al. [58]. Goodfellow et al. [16] and Kurakin et al. [29] extend this line of work using the Fast Gradient Sign Method (FGSM) and its iterative variants. Carlini and Wagner [5] devise state-of-the-art attacks under various pixel-space norm-ball constraints by proposing multiple adversarial loss functions. Athalye et al. [1] further analyze several defense approaches against pixel-space adversarial attacks, and demonstrate that most existing defenses can be surpassed by approximating gradients over defensively trained models.
Such attacks perturb the pixel space under an imperceptibility constraint. On the contrary, we approach the problem of generating adversarial examples that have perceptible yet semantically valid modifications. Our method considers a smaller ‘parametric’ space of modifiable attributes that have physical significance.
Parametric Adversarial Attacks: Parametric attacks are a recently introduced class of attacks in which the attack space is defined by a set of parameters rather than the pixel space. Such approaches result in more “natural” adversarial examples as they target the image formation process instead of the pixel space. Recent works by Athalye et al. [2] and Liu et al. [35] use optimization over geometric surfaces in 3D space to create adversarial examples. Zhang et al. [70] demonstrate the existence of adversarially designed textures that can camouflage vehicles. Zhao et al. [71] generate adversarial examples by using the parametric input latent space of GANs [18]. Dabouei et al. [9] employ a generative model to geometrically perturb facial landmarks to generate adversarial faces. Sharif et al. [55] propose a generative model to alter images of faces with eyeglasses in order to confound a face recognition classifier. Contrary to these methods, we consider the inverse approach of using a pretrained multi-attribute generative model to transform inputs over multiple attributes for generating adversarial examples.
Song et al. [57] optimize over the latent space of a conditional GAN to generate unrestricted adversarial examples for a gender classifier. While our approach is thematically similar, we fundamentally differ in being able to generate adversarial counterparts for given test samples while providing a finer degree of control using multi-attribute generative models. We discuss relevant literature regarding such multi-attribute generative models below.
Attribute-Based Conditional Generative Models: Generative Adversarial Networks (GAN) [18] are a popular approach for the generation of samples from a real-world data distribution. Recent advancements [49, 36, 64, 6] in GANs allow for creation of high-quality realistic images. Chen et al. [6] introduce the concept of an attribute-learning generative model where visual features are parametrized by an input vector.
Perarnau et al. [48] use a Conditional Generative Adversarial Network [40] and an encoder to learn the attribute-invariant latent representation for attribute editing. Fader Networks [30] improve upon this using an autoencoder with a latent discriminator. He et al. [20] argue that such an attribute-invariant constraint is too constrictive and replace it with an attribute classification constraint and a reconstruction loss, so as to alter only the desired attributes while preserving attribute-excluding features. These models are primarily used for generation of a large variety of facial images. We provide a secondary (and perhaps practical) use case for such attribute models in the context of understanding generalization properties of neural networks.
3 Semantic Attacks
Conceptually, producing an adversarial semantic (“natural”) perturbation of a given input depends on two algorithmic components: (i) the ability to navigate the manifold of parametric transformations of an input image, and (ii) the ability to perform optimization over this manifold that maximizes the classification loss with respect to a given target model. We describe each component in detail below.
Notation: We assume a white-box threat model, where the adversary has access to a target model $f$ and the gradients associated with it. The model classifies an input image, $x$, into one of $k$ classes, represented by a one-hot output label, $y$. In this paper, we focus on binary classification models ($k = 2$) while noting that our framework transparently extends to multiclass models. Let $G(x, a)$ denote parametric transformations, conditioned on a parameter vector, $a$. Here, each element of $a$ (say, $a_i$) is a real number that corresponds to a specific semantic attribute. For example, $a_i$ may correspond to facial hair, with a value of zero (or negative) denoting absence and a positive value denoting presence of hair on a given face example. We define a semantic adversarial attack as the deliberate process of transforming an input image, $x$, via a parametric model to produce a new example $\tilde{x} = G(x, a)$ such that $f(\tilde{x}) \neq y$.
3.1 Parametric Transformation Models
First, let us consider the problem of generating semantic transformations of a given input example. In order to create semantically transformed examples, the parametric generative model should satisfy two properties: it should reconstruct the invariant data in an image, and it should be able to independently perturb the semantic attributes while minimally changing the invariant data.
The parametric transformation model is therefore trained to reconstruct the original example while disentangling the semantic attributes. This involves conditioning the generative model on a set of parameters corresponding to the modifiable attributes. The semantic parameter vector consists of these parameters and is input to the parametric model to control the expression of semantic attributes.
We argue that the range-space of such a model approximates the manifold of the semantic transformations of input images. Therefore, the semantic transformation model can be used as a projection operator to ensure that a solution to an optimization problem will lie in the set of semantic transformations of an input image. We also observe that the semantic parameter vectors will be much lower in dimension than the original image.
3.2 Adversarial Parameter Optimization
The problem of generating a semantic adversarial example can essentially be thought of as finding the right set of attributes that a classifier is adversarially susceptible to. In our approach, we model this as an optimization problem over the semantic parameters.
The generation of adversarial examples is generally modelled as an optimization problem that can be broken down into two subproblems: (1) optimization of an adversarial loss over the target network to find the direction of an adversarial perturbation, and (2) projection of the adversarial vector onto the viable solution space.
In the first step, we optimize over an adversarial loss. We model the second step as a projection of the adversarial vector onto the range space of a parametric transformation model. This is achieved by cascading the output of the transformation function to the input of our target network. The optimization problem can then be solved by backpropagating over both the network and the transform. We also modify the Carlini-Wagner untargeted adversarial loss [5], as shown in equation 1, to include our semantic constraint:
(1)  
where is the original label index and are the class label indices for any of the other classes.
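As a rough illustration of the kind of loss referred to in equation 1, the sketch below implements the standard untargeted Carlini-Wagner hinge on the logits that the target model produces for a transformed input. The function name, the `kappa` margin argument, and the sign convention (the attacker minimizes the value) are assumptions made for this example and are not taken from the paper.

```python
def cw_untargeted_loss(logits, true_idx, kappa=0.0):
    """Untargeted Carlini-Wagner style hinge on classifier logits.

    logits   : list of floats, the target model's outputs on the transformed input
    true_idx : index of the original (correct) label
    kappa    : confidence margin (hypothetical parameter, 0 by default)

    The value is positive while the original class still wins and reaches
    -kappa once some other class leads by at least kappa.
    """
    # Largest logit among all classes other than the original one.
    best_other = max(z for i, z in enumerate(logits) if i != true_idx)
    return max(logits[true_idx] - best_other, -kappa)
```

In a semantic attack the logits would come from the classifier applied to the output of the parametric transformation, so that the only free variables are the attribute parameters.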
In comparison to the grid search method presented in Zhao et al. [71] and Engstrom et al. [12], our optimization algorithm scales better. In addition, we create semantic adversarial transformations with multiple attributes for a specific input, allowing for a fine-grained analysis of the generalization capacities of the target model.
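The optimization over semantic parameters described in this section can be sketched, under simplifying assumptions, as gradient descent on the attribute vector with the transformation cascaded into the classifier. The function below is a dependency-free toy, not the paper's algorithm: `semantic_attack`, its arguments, the box constraint `bound`, and the use of finite differences in place of backpropagation are all assumptions for illustration.

```python
def semantic_attack(x, y, transform, classifier, n_attrs,
                    steps=200, lr=0.1, bound=1.0, fd_eps=1e-4):
    """Search the attribute vector a so that classifier(transform(x, a)) != y.

    transform(x, a) -> transformed input, classifier(x) -> list of logits.
    Gradients are estimated by central finite differences to keep the sketch
    dependency-free; an autograd framework would backpropagate instead.
    """
    def loss(a):
        logits = classifier(transform(x, a))
        other = max(z for i, z in enumerate(logits) if i != y)
        return logits[y] - other          # > 0 while the original class wins

    a = [0.0] * n_attrs                   # start from the unmodified attributes
    for _ in range(steps):
        if loss(a) < 0:                   # label flipped: adversarial example found
            break
        grad = []
        for j in range(n_attrs):
            hi = a[:j] + [a[j] + fd_eps] + a[j + 1:]
            lo = a[:j] + [a[j] - fd_eps] + a[j + 1:]
            grad.append((loss(hi) - loss(lo)) / (2 * fd_eps))
        # Descent step, then clip each attribute to the box [-bound, bound].
        a = [min(bound, max(-bound, aj - lr * gj)) for aj, gj in zip(a, grad)]
    return a, loss(a) < 0


# Toy usage: a 2-pixel "image" and one attribute that shifts the first pixel.
toy_transform = lambda x, a: [x[0] + a[0], x[1]]
toy_classifier = lambda x: [x[0], x[1]]   # the logits are the pixels themselves
a_adv, fooled = semantic_attack([0.6, 0.2], 0, toy_transform, toy_classifier, n_attrs=1)
```

Because the search variable is the attribute vector rather than the image, every candidate produced by the loop lies in the range of the transformation by construction.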
4 Semantic Transformations
While our semantic attack framework is applicable to any parametric transformation model that enables gradient computations, we instantiate it by constructing adversarial variants of two recently proposed generative models: Fader Networks [30] and Attribute GANs (AttGAN) [20].
4.1 Adversarial Fader Network
A Fader Network [30] is an encoder-decoder architecture trained to modify images with continuously parameterized attributes. It achieves this by learning an invariance over the encoded latent representation while disentangling the semantic information of the images and attributes. The invariance of the attributes is learnt by an adversarial training step in the latent space with the help of a latent discriminator which is trained to identify correct attributes corresponding to each training sample.
Using our framework, we can adapt any pretrained Fader Network to model the manifold of semantic perturbations of a given input. We note that minor adjustments are needed in our setting, since the approach of [30] requires each scalar attribute to be represented by a tuple. Since there is a one-to-one mapping between the two representations, we can project any real-valued parameter vector into this tuple form via an additional, fixed affine transformation layer. Given this extra “attribute encoding” step, all gradient computations proceed as before. We quantitatively study the effect of allowing the attacker access to single or multiple semantic attributes. In particular, we construct three approaches for generating semantic adversarial examples: (i) a single-attribute Fader Network; (ii) a multi-attribute Fader Network; and (iii) a cascaded sequence of single-attribute Fader Networks.
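The “attribute encoding” step can be pictured as a fixed affine map applied to each scalar attribute. The sketch below assumes the pair layout (1 - a_i, a_i), which mirrors the one-hot attribute format used by Fader Networks; the exact tuple is not recoverable from the text here, so both the layout and the function names are assumptions.

```python
def encode_attributes(a):
    """Map each scalar attribute a_i to the pair (1 - a_i, a_i).

    This is the affine map a_i -> a_i * (-1, 1) + (1, 0); the pair layout
    is an assumption based on the one-hot attribute format of Fader Networks.
    """
    return [(1.0 - ai, ai) for ai in a]


def decode_attributes(pairs):
    """Inverse map: the second entry of each pair is the scalar attribute."""
    return [p[1] for p in pairs]
```

Since the map is affine with constant coefficients, the gradient of any loss with respect to a_i is simply the difference of the gradients with respect to the two entries of the pair, which is why the optimization can proceed unchanged.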
Single Attribute Attack: For the single-attribute attack, we use the range-space of a pretrained, single-attribute Fader Network to constrain our adversarial attack. The single-attribute attack constrains an attacker to only modify a specified attribute for all the images. In the case of face images, such attributes might include presence/absence of eyeglasses, hair color, and nose shape.
In our experiments, we present examples of attacks on a gender classifier using three separate single attributes: eyeglasses, age, and skin complexion. Figure 2 describes the mechanism of a single-attribute adversarial Fader Network used to generate an adversarial example by adding eyeglasses.
Multi-Attribute Attack: Similar to the single-attribute case, we may also use pretrained multi-attribute Fader Networks to model cases where the adversary has access to multiple modifiable traits.
A limitation of multi-attribute Fader Networks lies in the difficulty of their training. This is because a Fader Network is required to learn disentangled representations of the attributes while, in practice, semantic attributes cannot be perfectly decoupled. We resolve this using a novel conditional generative model described as follows.
Cascaded Attribute Attack: We propose a novel method to simulate multi-attribute attacks by stagewise concatenation of pretrained single-attribute Fader Networks. The benefit is that the computational burden of learning disentangled representations is now removed.
Each single-attribute model exposes an attribute latent vector. During execution of Algorithm 1, we jointly optimize over all the attribute vectors. The optimal adversarial vector is then segmented into corresponding attributes for each Fader Network to generate an adversarial example.
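The cascading and segmentation just described can be sketched as a simple composition of per-attribute transforms. Everything named below (`cascade`, `sizes`, and the two toy transforms) is hypothetical and stands in for pretrained single-attribute models.

```python
def cascade(transforms, x, a, sizes):
    """Apply single-attribute transforms in sequence.

    transforms : list of callables G_i(x, a_i) -> x
    a          : joint attribute vector (flat list) optimized by the attack
    sizes      : number of entries of `a` consumed by each transform
    """
    assert len(transforms) == len(sizes) and sum(sizes) == len(a)
    start = 0
    for g, n in zip(transforms, sizes):
        x = g(x, a[start:start + n])      # segment of `a` owned by this stage
        start += n
    return x


# Toy stand-ins for two single-attribute models.
shift = lambda x, a: [xi + a[0] for xi in x]
scale = lambda x, a: [xi * a[0] for xi in x]
x_out = cascade([shift, scale], [1.0, 2.0], [0.5, 2.0], sizes=[1, 1])
```

Note that the stages are applied in order, so the output generally depends on the sequence in which the single-attribute models are chained.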
4.2 Adversarial AttGAN
A second encoder-decoder architecture [20], known as AttGAN, achieves a similar goal as Fader Networks of editing attributes by manipulating the encoded latent representation; however, AttGAN disentangles the semantic attributes from the underlying invariances of the data by considering both the original and the flipped labels while training. This is achieved by training a latent discriminator and classifier pair to classify both the original and the transformed image to ensure invariance.
In order to generate semantic adversarial examples using AttGAN, we use a pretrained generator conditioned on attributes. The attribute vector, in this case, is encoded as a perturbation of the original sequence of attributes for the image. We consider the two sets of attributes listed in Table 2 to generate adversarial examples. In our experience, the AttGAN architecture provides a more stable reconstruction, thus allowing for more modifiable parameters.
5 Experimental Results
We showcase our semantic adversarial attack framework using a binary (gender) classifier trained on the CelebA dataset [37] as the target model. All experiments were performed on a single workstation equipped with an NVIDIA Titan Xp GPU in PyTorch [47] v1.0.0. We train the classifier using the ADAM optimizer [26] over the categorical cross-entropy loss. The training data is augmented with random horizontal flipping to ensure that the classifier does not overfit. The target model achieves a (standard) accuracy of 99.7% on the test set (10% of the dataset).
Our goal is to break this classifier model using semantic attacks. To do so, we use a subset of 500 randomly selected images from the test set. Each image is transformed by our algorithm using the various parametric transformation families described in Section 4. Our metric of comparison for all adversarial attacks is the target model accuracy on the generated adversarial test set.
Adversarial Fader Networks: We consider the three approaches documented in Section 4.1. For every image in our original test set, we generate adversarial examples by optimizing the adversarial loss in equation 1 with respect to the corresponding attribute parameters.
In the cases of single-attribute and cascaded sequential attacks, we use the pretrained single-attribute models provided by Lample et al. [30] to represent the manifold of semantic transformations. For the multi-attribute attack, we train 3 multi-attribute Fader Networks with the attributes presented in Table 2. We create an adversarial test set for each of our approaches as described in Section 4.1 using our algorithm as defined in Algorithm 1.
Our experiments show that Adversarial Fader Networks successfully generate examples that confound the binary classifier in all cases; see Table 2. Visual adversarial examples are displayed in Figure 1 and Figure 3. We also observe that multi-attribute attacks outperform single-attribute attacks, which conforms with intuition; a more systematic analysis of the effect of the number of semantic attributes on attack performance is provided below in Section 6.
Adversarial AttGAN: We perform a similar set of experiments using the multi-attribute AttGAN implementation of He et al. [20]. We record the performance over two experiments: one using 5 attributes, and the second using 6 attributes, as seen in Table 2. We observe a significant improvement in performance as the number of semantic attributes increases (in particular, adding the eyebrows attribute results in nearly a 30% drop in model accuracy).
Comparison with parameter-space sampling: We compare our method with a previously proposed approach that investigates parametric attacks, due to Engstrom et al. [12]. They propose picking random samples from the parameter space and choosing the adversarial example generated by the sample giving the worst cross-entropy loss (we use ).
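A minimal sketch of this random-sampling baseline is given below, assuming attributes are drawn uniformly from a box and that "worst" means the highest cross-entropy against the true label. The default `n_samples=10`, the uniform distribution, and all function names are placeholders chosen for illustration, since the sample count used in the experiments is not stated in the text above.

```python
import math
import random


def cross_entropy(logits, y):
    """Softmax cross-entropy of the logits against label index y."""
    m = max(logits)                       # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[y]


def random_sampling_attack(x, y, transform, classifier, n_attrs,
                           n_samples=10, bound=1.0, seed=0):
    """Return the sampled attribute vector with the highest cross-entropy."""
    rng = random.Random(seed)
    best_a, best_loss = None, -math.inf
    for _ in range(n_samples):
        # Hypothetical sampling scheme: uniform over the box [-bound, bound]^n.
        a = [rng.uniform(-bound, bound) for _ in range(n_attrs)]
        loss = cross_entropy(classifier(transform(x, a)), y)
        if loss > best_loss:
            best_a, best_loss = a, loss
    return best_a, best_loss
```

Unlike the gradient-based search, this baseline needs only forward passes, but the number of samples required to cover the parameter space grows quickly with the number of attributes.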
We showcase the results in Table 2, and observe that in all cases (but one), our semantic adversarial attack algorithm outperforms random sampling. In addition, the table also reveals that random examples in the range of Fader Networks or AttGANs are mostly classified correctly. This suggests that the target model is generally invariant to the low reconstruction error incurred by the parametric transformation models.
Footnote 1: We do not compare our work with other approaches such as the Differentiable Renderer [35] and 3D adversarial attacks [69], since these papers expect oracle access to a 3D rendering environment. We also do not compare with Song et al. [57] since they generate adversarial examples from scratch, whereas our attack targets specific inputs.
Comparison with pixel-space attacks: In addition to our analyses described above, we also compare our attacks with the state-of-the-art Carlini-Wagner attack [5] as well as several other attack techniques [16, 29, 12] in Table 2. To ensure fair comparison, we consider the maximum distance over our multi-attribute attacks as the bound parameter for all pixel-norm-based attacks. From the table, we observe that the Carlini-Wagner attack is extremely effective; on the other hand, our semantic attacks are able to outperform other methods such as FGSM [17] and PGD [39].
We also compare our approach to Spatial Attacks of [12], which uses a grid search over affine transformations of an input to generate adversarial examples; constraints do not apply here, and instead we use default parameters provided in [12]. Our proposed attack methods are considerably more successful.
We additionally provide detailed experiments on binary classifiers for other attributes in the supplementary section.
6 Analysis: Impact of Dimensionality
From our experiments, we observe that limiting the adversary to a low-dimensional, semantic parametric transformation of the input leads to less effective attacks than pixel-space attacks (at least when the same loss is optimized). Moreover, multi-attribute semantic attacks are more powerful than single-attribute attacks. This observation makes intuitive sense: the dimension of the manifold of perturbed inputs effectively represents the capacity of the adversary, and hence a greater number of degrees of freedom in the perturbation should result in more effective attacks. In pixel-space attacks, the adversary is free to search over a high-dimensional ball centered around the input example, which is perhaps why norm attacks are so hard to defend against [1].
In this section, we provide experimental and theoretical analysis that precisely exposes the impact of the dimensionality of the attribute parameters. While our analysis is stylized and not directly applicable to deep neural classifiers, it constitutes a systematic first attempt towards upper bounds on what a semantically constrained adversary can possibly hope to achieve.
6.1 Synthetic Experiments
We propose and analyze the following synthetic setup which enables explicit control over the dimension of the semantic perturbations. Data: We construct a dataset of samples from a mixture of Gaussians (MoG) with 10 components (denoted by ) defined over . Each data sample is obtained by uniformly sampling one of the mixture component means, and then adding random Gaussian noise with standard deviation . The component means are chosen as 10 randomly selected images (1 for each digit) from the MNIST dataset [32] rescaled to (i.e., the ambient dimension is ).
Target Model: We artificially define two classes: the first class containing images generated from digits 0-4 and the second class containing images generated from digits 5-9. We train a simple two-layer fully connected network as the target model. The classifier is trained by optimizing cross-entropy using ADAM [26] for 50 epochs, resulting in training accuracy of 100%, validation accuracy of 99.8%, and test accuracy of 99.6%.
Parametric Transformations: We consider a stylized transformation function, . We study the effect of varying for two specific parametric transformation models.
Subspace attacks: We first consider an additive (linear) attack model. Here, the manifold of semantic perturbations is constrained to lie a dimensional subspace spanned by an arbitrary matrix , whose columns are assumed to be orthonormal, and
(2) 
Neural attacks: We next consider a multiplicative attack model. Here the manifold of perturbations corresponds to a rank transformation of the input.
(3) 
Here, and follow the definition presented earlier. This transformation can be interpreted as the action of a shallow (two-layer) autoencoder network with hidden neurons with scalar activations parameterized by .
Nonlinear ReLU variants: We also consider each of the above two attacks in the rectified setting where the transformation is passed through a rectified linear unit:
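To make the four transformation families concrete, the following sketch writes them out for plain Python lists. The explicit forms are inferred from the verbal descriptions (an additive perturbation in the span of an orthonormal matrix, and a rank-constrained map that rescales the hidden activations of a linear autoencoder), because the displayed equations are not legible here; the names `U`, `delta`, and the helper functions are assumptions.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def relu(v):
    return [max(0.0, vi) for vi in v]


def subspace_attack(x, U, delta, rectify=False):
    """Additive model: x + U @ delta, with U of shape d x k."""
    out = [xi + ui for xi, ui in zip(x, matvec(U, delta))]
    return relu(out) if rectify else out


def neural_attack(x, U, delta, rectify=False):
    """Multiplicative rank-k model: U @ diag(delta) @ U.T @ x."""
    hidden = matvec(transpose(U), x)                  # k hidden activations
    scaled = [di * hi for di, hi in zip(delta, hidden)]
    out = matvec(U, scaled)
    return relu(out) if rectify else out
```

In both families the attacker controls only the k entries of `delta`, so the number of columns of `U` plays the role of the attack-space dimension studied in the results.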
Results: We analyse the effect of the dimensionality of the attack space() by considering the performance of the subspace and neural attacks on the target binary classifier. Figure 4 shows the comparison of the constrained attacks for the linear and nonlinear cases.
We infer the following: (i) as expected, increasing dimensionality of the semantic attack space leads to less accurate target models; (ii) adding a nonlinearity to the transformation function reduces the viability of both subspace and rank-constrained attacks; (iii) subspace-constrained attacks are more powerful than rank-constrained attacks. In general, the degree of “nonlinearity” in the transformation model appears to be inversely proportional to the power of the corresponding semantic attack. We believe this phenomenon is somewhat surprising, and defer a more thorough analysis to future work.
6.2 Theory
In the case of subspace attacks, we can explicitly derive upper bounds on the generalization behavior of target models. Our derivation follows the recent approach of Schmidt et al. [52], who consider a simplified version of the data model defined in Section 6.1 and bound the performance of a linear classifier in terms of its robust classification error.
Def. 6.1 (Robust Classification Error).
Let be a distribution and let be any set containing . Then the robust classification error of any classifier is defined as .
Using this definition, we analyze the efficacy of subspace attacks on a simplified linear classifier trained using a mixture of two spherical Gaussians. Consider a dataset with samples sampled from a mixture of two Gaussians with component means and standard deviation . We assume a linear classifier , defined by the unit vector , as .
Let . Assuming that the target classifier is well-trained (i.e., is sufficiently well-correlated with the true component mean ), we can upper bound the probability of error incurred by the classifier when subjected to any subspace attack.
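For intuition, the robust error of a linear classifier under a subspace attack can be computed empirically without any search. The sketch below assumes a classifier of the form sign(<w, x>), labels in {-1, +1}, and perturbations x + U @ delta with every entry of delta bounded by eps in absolute value; these modelling choices and all names are assumptions for illustration, not the exact setting of the theorem.

```python
def robust_error_subspace(samples, w, U, eps):
    """Fraction of (x, y) pairs that some delta with max|delta_j| <= eps can flip.

    For f(x) = sign(<w, x>) and x' = x + U @ delta, the worst case lowers the
    margin y * <w, x> by eps * ||U.T @ w||_1, so no search over delta is needed.
    """
    k = len(U[0])
    # Entries of U.T @ w: correlation of each attack direction with w.
    utw = [sum(U[i][j] * w[i] for i in range(len(w))) for j in range(k)]
    budget = eps * sum(abs(v) for v in utw)
    errors = 0
    for x, y in samples:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin - budget <= 0:          # ties are counted as errors
            errors += 1
    return errors / len(samples)
```

Adding a column to `U` adds a nonnegative term to the budget, so under these assumptions the empirical robust error cannot decrease as the attack dimension grows, which is in line with the dimensionality trend reported in the synthetic experiments.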
Theorem 1 (Robust classification error for subspace attacks).
Let be such that . Then, the linear classifier has a robust classification error upper bounded as:
(4) 
The proof is deferred to the supplementary material, but we provide some intuition. Lemma 20 of [52] recovers a similar result, albeit with the term in the exponent being replaced by . This is because they only consider bounded perturbations in pixel space, and hence their bound on the robust classification error scales exponentially according to the ambient dimension , while our bound is expressed in terms of the number of semantic attributes . A natural next step would be to derive sample complexity bounds analogous to [52], but we do not pursue that direction here.
7 Discussion and Conclusions
We conclude with possible obstacles facing our approach and directions for future work.
We have provided evidence that there exist adversarial examples for a deep neural classifier that may be perceptible, yet are semantically meaningful and hence difficult to detect. A key obstacle is that parameters associated with semantic attributes are often difficult to decouple. This poses a practical challenge, since it is difficult to train a conditional generative model where each dimension of the latent parameter vector controls a specific semantic attribute independently. However, the success of recent efforts in this direction, including Fader Networks [30], AttGANs [20], and StarGANs [8], demonstrates the promise of our approach: any newly developed conditional generative model can be used to mount a semantic attack using our framework.
Despite the existence of semantic adversarial examples, we have found that enforcing semantic validity confounds the adversary’s task, and that target models are generally able to classify a significant subset of the examples generated under our semantic constraint. Figure 5 shows examples of images that were generated with severe artifacts, yet are successfully classified. This brings us to the question: is “naturalness” a strong defense?
This intuition is the premise of a recent defense strategy called Defense-GAN [51]. Indeed, our approach can be viewed as the converse of this strategy: Defense-GAN uses the range-space of a generative model (specifically, a GAN) to defend against pixel-space attacks, while conversely, we use the same principle to attack trained target models. A closer look into the interplay between the two approaches is worthy of future study.
Acknowledgements
We thank Gauri Jagatap, Mohammedreza Soltani, and Anuj Sharma for helpful discussions.
References
 [1] A. Athalye, N. Carlini, and D. A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
 [2] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. In ICML, 2018.
 [3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 [4] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.
 [5] N. Carlini and D. A. Wagner. Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP), 2017.
 [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, 2016.
 [7] X. Chen, C. Liu, B. Li, K. Lu, and D. Song. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arxiv preprint, abs/1712.05526, 2017.
 [8] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
 [9] A. Dabouei, S. Soleymani, J. M. Dawson, and N. M. Nasrabadi. Fast geometrically-perturbed adversarial faces. WACV, 2019.
 [10] S. Dathathri, S. Zheng, S. Gao, and R. Murray. Measuring the Robustness of Neural Networks via Minimal Adversarial Examples. In NeurIPSW, volume 35, 2017.
 [11] R. R. de Castro and H. A. Rabitz. Targeted nonlinear adversarial perturbations in images and videos. arxiv preprint, abs/1809.00958, 2018.
 [12] L. Engstrom, D. Tsipras, L. Schmidt, and A. Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations. arxiv preprint, abs/1712.02779, 2017.
 [13] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. X. Song. Robust physical-world attacks on deep learning visual classification. CVPR, 2018.
 [14] A. Fawzi, H. Fawzi, and O. Fawzi. Adversarial vulnerability for any classifier. In NeurIPS, 2018.
 [15] A. Fawzi, O. Fawzi, and P. Frossard. Analysis of classifiers’ robustness to adversarial perturbations. Machine Learning, 107, 2018.
 [16] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
 [17] I. J. Goodfellow. Defense against the dark arts: An overview of adversarial example security research and future research directions. arxiv preprint, abs/1806.04169, 2018.
 [18] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
 [19] T. Gu, B. Dolan-Gavitt, and S. Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arxiv preprint, abs/1708.06733, 2017.
 [20] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen. Attgan: Facial attribute editing by only changing what you want. arxiv preprint, 2017.
 [21] G. Hinton, N. Srivastava, and K. Swersky. Lecture 6a, overview of minibatch gradient descent.
 [22] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Black-box adversarial attacks with limited queries and information. In PMLR, volume 80, 2018.
 [23] A. Ilyas, L. Engstrom, and A. Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. arxiv preprint, abs/1807.07978, 2018.
 [24] T. Kaneko, K. Hiramatsu, and K. Kashino. Generative attribute controller with conditional filtered generative adversarial networks. CVPR, 2017.
 [25] T. Kim, B. Kim, M. Cha, and J. Kim. Unsupervised visual attribute transfer with reconfigurable generative adversarial networks. arxiv preprint, abs/1707.09798, 2017.
 [26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [27] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arxiv preprint, abs/1312.6114, 2014.
 [28] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In JMLR, volume 70, 2017.
 [29] A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arxiv preprint, abs/1607.02533, 2017.
 [30] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, et al. Fader networks: Manipulating images by sliding attributes. In NeurIPS, 2017.
 [31] A. B. L. Larsen, S. K. Sønderby, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
 [32] Y. LeCun and C. Cortes. MNIST handwritten digit database, 2010.
 [33] M. Li, W. Zuo, and D. Zhang. Convolutional network for attributedriven and identitypreserving human face generation. arxiv preprint, abs/1608.06434, 2016.
 [34] M. Li, W. Zuo, and D. Zhang. Deep identityaware transfer of facial attributes. arxiv preprint, abs/1610.05586, 2016.
 [35] H.-T. D. Liu, M. Tao, C.-L. Li, D. Nowrouzezahrai, and A. Jacobson. Beyond pixel norm-balls: Parametric adversaries using an analytically differentiable renderer. In ICLR, 2019.
 [36] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NeurIPS, 2017.
 [37] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
 [38] Y. Lu, Y.-W. Tai, and C.-K. Tang. Attribute-guided face generation using conditional CycleGAN. In ECCV, 2018.
 [39] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
 [40] M. Mirza and S. Osindero. Conditional generative adversarial nets. arxiv preprint, abs/1411.1784, 2014.
 [41] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. CVPR, 2017.
 [42] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. CVPR, 2016.
 [43] S.-M. Moosavi-Dezfooli, A. Fawzi, J. Uesato, and P. Frossard. Robustness via curvature regularization, and vice versa. In CVPR, 2019.
 [44] K. R. Mopuri, U. Ojha, U. Garg, and R. V. Babu. Nag: Network for adversary generation. CVPR, 2018.
 [45] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. In ICML, 2017.
 [46] N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. EuroS&P, 2016.
 [47] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NeurIPSW, 2017.
 [48] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional gans for image editing. arxiv preprint, abs/1611.06355, 2016.
 [49] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arxiv preprint, abs/1511.06434, 2016.
 [50] J. Rauber, W. Brendel, and M. Bethge. Foolbox: A python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017.
 [51] P. Samangouei, M. Kabkab, and R. Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In ICLR, 2018.
 [52] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization requires more data. In NeurIPS, 2018.
 [53] A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein. Poison frogs! Targeted clean-label poisoning attacks on neural networks. In NeurIPS, 2018.
 [54] A. Shafahi, W. R. Huang, C. Studer, S. Feizi, and T. Goldstein. Are adversarial examples inevitable? In ICLR, 2019.
 [55] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter. Adversarial generative nets: Neural network attacks on stateoftheart face recognition. arxiv preprint, abs/1801.00349, 2018.
 [56] W. Shen and R. Liu. Learning residual images for face attribute manipulation. CVPR, 2017.
 [57] Y. Song, R. Shu, N. Kushman, and S. Ermon. Constructing unrestricted adversarial examples with generative models. In NeurIPS, 2018.
 [58] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 [59] O. Tange. GNU Parallel: The command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, Feb. 2011.
 [60] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. D. McDaniel. Ensemble adversarial training: Attacks and defenses. arxiv preprint, abs/1705.07204, 2017.
 [61] B. Tran, J. Li, and A. Madry. Spectral signatures in backdoor attacks. In NeurIPS, 2018.
 [62] A. Turner, D. Tsipras, and A. Madry. Clean-label backdoor attacks, 2019.
 [63] P. Upchurch, J. R. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Q. Weinberger. Deep feature interpolation for image content changes. CVPR, 2017.
 [64] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
 [65] C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. X. Song. Spatially transformed adversarial examples. arxiv preprint, abs/1801.02612, 2018.
 [66] H. Xiao, B. Biggio, B. Nelson, H. Xiao, C. M. Eckert, and F. Roli. Support vector machines under adversarial label contamination. Neurocomputing, 160, 2015.
 [67] H. Xiao, H. Xiao, and C. M. Eckert. Adversarial label flips attack on support vector machines. In ECAI, 2012.
 [68] T. Xiao, J. Hong, and J. Ma. DNA-GAN: Learning disentangled representations from multi-attribute images. arxiv preprint, abs/1711.05415, 2018.
 [69] X. Zeng, C. Liu, Y.-S. Wang, W. Qiu, L. Xie, Y.-W. Tai, C.-K. Tang, and A. L. Yuille. Adversarial attacks beyond the image space. arxiv preprint, abs/1711.07183, 2017.
 [70] Y. Zhang, H. Foroosh, P. David, and B. Gong. Camou: Learning physical vehicle camouflages to adversarially attack detectors in the wild. In ICLR, 2019.
 [71] Z. Zhao, D. Dua, and S. Singh. Generating natural adversarial examples. In ICLR, 2018.
 [72] S. Zhou, T. Xiao, Y. Yang, D. Feng, Q. He, and W. He. Genegan: Learning object transfiguration and attribute subspace from unpaired data. arxiv preprint, abs/1705.04932, 2017.
 [73] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV, 2017.
Appendix A Related Work
Adversarial Examples and Attacks.
In 2014, Szegedy et al. [58] showed that deep neural networks have two counterintuitive properties: the space described by the higher layers of a neural network captures semantic information, and there exist adversarial examples that call into question the generalization ability of the network. They generate such adversarial examples under the $\ell_2$ distance constraint using a box-constrained L-BFGS attack; the resulting images look similar to the originals but are assigned a different label by the classifier.
Goodfellow et al. [16] and Kurakin et al. [29] generate adversarial examples using the Fast Gradient Sign Method (FGSM) and its iterative variant under the $\ell_\infty$ constraint, which require less computation time. Other methods similar to FGSM are discussed in [60].
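As a minimal illustration of the fast gradient sign idea, the sketch below applies one step to a toy logistic-regression model, for which the input gradient of the cross-entropy loss has a closed form. The model, weights, input, and step size are hypothetical and chosen only for illustration; this is not the setup used in [16] or [29].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_linear(x, y, w, b, eps):
    """One FGSM step against the model p(y=1|x) = sigmoid(w.x + b).

    For the cross-entropy loss, the gradient with respect to the input
    is (p - y) * w, so each coordinate moves by eps in the direction of
    the sign of that gradient.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Toy example with hypothetical weights and input.
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.3, 0.1, 0.2], 1
x_adv = fgsm_linear(x, y, w, b, eps=0.25)
```

Every coordinate changes by at most `eps`, and in this toy case the perturbed input falls on the other side of the decision boundary.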
Papernot et al. [46] implement an attack under the $\ell_0$ constraint in which they modify the pixels that contribute most significantly to changing the model's classification to the target class. Moosavi-Dezfooli et al. [42] describe an untargeted attack algorithm under the $\ell_2$ constraint that assumes the neural network is linear, and then extend it to nonlinear networks. Another family of attacks relates to a single universal adversarial direction for a dataset. Moosavi-Dezfooli et al. [41] prove the existence of an image-agnostic adversarial perturbation. Fawzi et al. [15] extend this to theoretically show that every classifier is vulnerable to adversarial attacks. Moosavi-Dezfooli et al. [43] further consider the effect of the curvature of the decision boundaries on the existence of adversarial examples.
Carlini and Wagner [5] propose three attacks for adversarial image generation and show that defensive distillation is not an effective defense mechanism. They devise attacks under the three norms commonly used in the literature, $\ell_0$, $\ell_2$, and $\ell_\infty$, to measure the deviation of the adversarial perturbation from the original sample. They evaluate seven different surrogate loss functions and select one of them, which we also use in our attack algorithm. Their attack is widely regarded as one of the most effective in the literature and serves as a benchmark for comparison.
The primary difference between the aforementioned attacks and ours is that they make imperceptible changes in pixel space and therefore do not modify the image in a semantic way. In contrast, our attack focuses on making perceptible yet natural and realistic changes to the image that are semantic in nature.
Parametric adversarial attacks.
The use of parametric transformations to generate adversarial examples has been explored in several previous works. Most of these parametric attacks target the image formation process to create adversarial examples. A recent work by Liu et al. [35] perturbs geometric surfaces or lighting by optimizing over the relevant parameters of a 3D environment, and shows convincing results with realistic-looking adversarial examples. Zeng et al. [69] use FGSM to perturb 3D models of objects to create adversarial examples. The primary caveat of such approaches is that they require precise 3D models of the objects for which adversarial examples are created.
Athalye et al. [2] demonstrate the creation of a real-world adversarial 3D model by optimizing over affine transformations corresponding to real-world realizations. Eykholt et al. [13] also provide mechanisms for real-world realizable adversarial examples for stop signs using specially designed adversarial stickers.
Mopuri et al. [44] train a generative adversarial network to generate adversarial attacks for classifiers. Zhao et al. [71] show an interesting use of a GAN and an inverter network, searching over the input space of the GAN to generate semantically valid adversarial examples. These approaches are similar in spirit to ours, though we focus on specific physically perturbed attributes of images rather than imperceptible perturbations. CAMOU [70] is a more recent work that learns a neural approximator for physical camouflage and then optimizes over it to generate an adversarial version that fools object detectors.
Adversarial example generation with GANs for face recognition systems has also been explored by Dabouei et al. [9] and Sharif et al. [55], who train generative networks for the specific purpose of creating adversarial examples. Sharif et al., in particular, show a realizable attack that uses a generative network to add glasses that fool a face recognition classifier. In comparison, we provide a more diverse attack space that allows for various semantic attributes. In addition, since our attack involves physically realizable, perceptible attributes, it can also be used to characterize a classifier’s performance against physical adversarial attacks.
Song et al. [57] use an Auxiliary Classifier Generative Adversarial Network (AC-GAN) [45] to generate unrestricted adversarial examples from noise, optimizing over the latent space of the conditional GAN to find adversarial examples that are misclassified by a gender classifier. They use Mechanical Turk to check the naturalness of the generated images and to validate that they belong to the desired class. We address the more complex problem of finding an adversarial transformation for a given input image, rather than generating a random semantic adversarial example.
Attribute based generative models.
Our approach relies on attribute-based generative models to enforce the semantic constraint and to represent attributes as real-valued semantic variables. We discuss a few relevant, recently published approaches.
As noted in [20], the literature on facial attribute editing can be broadly divided into optimization-based and learning-based approaches. Optimization-based approaches include Li et al. [33] and Upchurch et al. [63]. The former optimizes, with respect to the input face, the CNN feature difference between the input face image and face images with the desired attributes, while the latter optimizes the input face to match its deep features along the direction vector between faces with and without the attributes.
Li et al. [34] describe a method to optimize over an adversarial attribute loss and a deep identity feature loss in order to train a deep identity aware transfer model to add or remove facial attributes to/from a face. Shen et al. [56] learn the difference between images before and after manipulation to simultaneously train two networks for respectively adding and removing a specific attribute.
Generative Adversarial Networks (GANs) [18] are a popular approach for generating samples from a real-world data distribution. Recent advances in GANs [49, 36, 64, 6] allow for the creation of high-dimensional, high-quality realistic images. These have been incorporated into several attribute-swapping generative models. Zhou et al. [72] recombine the latent information of two images to swap a specific attribute between them. Liu et al. [36] generate high-quality images by coupling GANs to learn a shared latent representation, which they use to tackle several unsupervised image translation tasks, including domain adaptation and face image translation.
For multiple attribute swapping, models based on Kingma et al. [27], Goodfellow et al. [18], Larsen et al. [31], Mirza et al. [40], and Radford et al. [49] have recently become popular. Perarnau et al. [48] use a Conditional Generative Adversarial Network [40] and an encoder to learn an attribute-invariant latent representation for attribute editing. Fader Networks [30] take a similar approach: the model learns an attribute-invariant latent space so that a face is identified as the same with or without a specific attribute. In contrast, AttGAN [20] argues that such an attribute-invariance constraint is excessive, and instead imposes an attribute classification constraint and a reconstruction loss to alter only the desired attributes while preserving attribute-excluding features. StarGAN [8] uses a cycle consistency loss to preserve information; instead of learning a latent representation, it trains a conditional attribute transfer network to modify attributes. Chen et al. [6] and Odena et al. [45] map the generated images back to the conditional signals with the help of an auxiliary classifier to learn conditional image generation. Kaneko et al. [24] use a conditional filtered generative adversarial network to build a generative attribute controller that edits the attributes of an image while preserving the variations of each attribute.
Xiao et al. [68] swap blocks of the latent representation containing the relevant attributes between a given pair of images. Kim et al. [25] take a similar approach, dividing the latent representation into blocks that correspond to different attributes and swapping these blocks to achieve multiple attribute swapping.
Data poisoning.
Most of the prior work discussed above concerns adversarial attacks at inference time. Data poisoning is a technique in which the adversary injects false data to hinder the generalization capability of a deep neural network. Koh et al. [28] present the seminal work on data poisoning for deep neural networks, in which they construct approximate upper bounds to provide certificates against a large class of attacks. Xiao et al. [67] and Xiao et al. [66] present a similar approach for shallow learning models. Another class of data poisoning attacks is the backdoor attack, in which an adversary corrupts the model so that it misclassifies either a specific input or a group of inputs as a target label, thus engineering a backdoor that can be used to corrupt the learned model. Gu et al. [19] demonstrate a method to maliciously train a network that performs well on training and validation datasets but persistently performs poorly on inputs associated with backdoor triggers.
These attacks can be realistic in nature. For example, a stop sign can be identified by the classifier as a speed limit sign in the presence of backdoor triggers, which are typically special markers added to the inputs by the adversary. Turner et al. [62] show that an adversary can gain full control over the target model during inference by training with samples generated with a GAN. More recently, Tran et al. [61] identify a property common to all backdoor attacks, known as spectral signatures, with which poisoned examples from real image datasets can be detected and removed effectively. Chen et al. [7] demonstrate an application of such backdoor attacks on a visual recognition system, where they were able to break a weak threat model with a limited number of poisoned data examples with semantic attribute changes. This is perhaps the first attempt at considering the effect of semantic changes.
Appendix B Theoretical Results
Robust classification error for subspace attacks
We present a proof for the upper bound of the robust classification error in the case of subspace attacks. Recall the data model we use; a Mixture of Gaussians data model, with two components and . Each of the components are regarded as classes. We additionally assume a linear classifier, defined by the unit vector, .
Let and .
Under the assumption that the linear classifier is well trained, i.e., is sufficiently correlated with the true component mean, , we upper bound the robust classification error. This involves considering the sample generalization error of a linear classifier on Gaussian data. We adapt arguments from Schmidt et al. [52] for the case of subspace attacks. The theorem statement is repeated here for convenience.
Theorem 1.
Let be such that . Then, the linear classifier has a robust classification error upper bounded as:
(5) 
Proof.
To prove the above statement, we consider the probability of adversarial misclassification under a rank-constrained attack.
Given where ; , we consider a linear additive attack under a rank constraint,
(6) 
Here, is a random matrix with the columns forming an orthonormal basis of dimensionality . In addition, we consider that the adversarial example thus created is constrained to be in the norm ball, , which implies that
(7) 
We attempt to bound the probability that a rank constrained adversarial example, , created using equation 6, exists under the constraint defined by equation 7.
Let .
Now,
(8) 
Consider the domain of the minimization,
Now using the definition of the operator norm for rectangular matrices (See [3], Sec A.1.5) and the fact that is orthonormal,
Let set and set . We can clearly see that . Now considering the , as
Thus we show that,
(9) 
From the above inequality, but not vice versa.
By the monotonicity of probability measures under set inclusion, we can therefore show that,
(10) 
We now upper bound the term using the same argument as that of Lemma 20 in [52].
Let  
We now drop as the constraint is symmetric and use definition of dual norm,
We now invoke Lemma 17 from [52] with and to bound the ,
(11) 
∎
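For reference, the dual-norm step invoked in the proof rests on a standard identity, stated here in generic notation. The symbols $w$, $e$, $\epsilon$, and $m$ are ours and need not match those of the proof.

```latex
\max_{\|e\|_\infty \le \epsilon} \langle w, e \rangle
  \;=\; \epsilon \, \|w\|_1
  \;\le\; \epsilon \sqrt{m} \, \|w\|_2 ,
\qquad w \in \mathbb{R}^m .
```

The equality holds because the $\ell_1$ norm is the dual of the $\ell_\infty$ norm, and the inequality follows from Cauchy-Schwarz. The factor $\sqrt{m}$ therefore grows with the number of coordinates the adversary is allowed to perturb.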
Appendix C Details of Experiments
Dataset: For our experiments, we use the CelebA dataset [37]. The dataset has approximately 200k images of faces. Each image is annotated with binary attributes. Examples of these attributes are gender, age and skin complexion. We preprocess the images by cropping the central subimage and resizing each crop to . The resized images are then normalized to be between and .
Target Binary Classifier: We attack a pretrained binary gender classifier using our approach. The architecture of the classifier is shown in Table 3. We train the classifier with 70% of the CelebA dataset [37] as training data and 20% as validation data, using the categorical cross-entropy loss and the ADAM [26] optimizer. Our model is 95.6% accurate on the test set (the remaining 10% of the dataset). We additionally train a binary age classifier with the same architecture.
Table 3: Architecture of the target classifier.
Layer  Size
Convolutional Layer with ReLU  32x3x3
Max-pooling Layer  2x2
Convolutional Layer with ReLU  64x3x3
Max-pooling Layer  2x2
Convolutional Layer with ReLU  128x3x3
Max-pooling Layer  2x2
Fully Connected Layer  1024
Fully Connected Layer  2
Adversarial Fader Networks
Architecture of Fader Networks. Fader Networks are an encoder-decoder architecture that disentangles semantic attributes during the reconstruction process. This is achieved by training a discriminator on the encoded latent vector while simultaneously reconstructing the original image from the concatenation of the latent vector and the semantic attribute vector. Figure 7 shows the architecture of the Fader Networks. An intriguing effect of the training process is that the attribute vector space can be treated as a continuous and bounded space. We can therefore optimize over this space to generate adversarial examples.
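Attack Type  Attributes  Accuracy of target model (%)  Random Sampling (%)
Single Attribute Attack  A1  70.0  87.0
Single Attribute Attack  A2  61.0  93.0
Single Attribute Attack  A3  48.0  88.0
Multi Attribute Attack  A1,A5,A6  12.0  86.0
Multi Attribute Attack  A2,A5,A6  7.0  85.0
Multi Attribute Attack  A1,A2,A7  28.0  84.0
Cascaded Multi Attribute Attack  A1A2A3  30.0  68.0
Cascaded Multi Attribute Attack  A1A3A4  31.0  80.0
Cascaded Multi Attribute Attack  A2A3A4  42.0  68.0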
Single and Multi Attribute Attacks. We train three multi-attribute Fader Networks with the attributes presented in Table 4. The pretrained Fader Networks are used as semantic constraints, with the attribute vectors as the optimization variables. We then process examples from the CelebA test set with the semantic attack algorithm to generate adversarial examples. To make our optimization algorithm compatible with Fader Networks, we create a non-parametric forward model that converts the attribute vector to a compatible form. We call this forward model “Attribute Encoding”.
We generate semantic adversarial images by optimizing a modified Carlini-Wagner loss [5] with respect to the attribute vectors using ADAM [26] with a learning rate of . We also experimented with other optimizers, including stochastic gradient descent and RMSProp [21], but found that ADAM generates sharper images and is the most successful.
Our experiments show that successful multi-attribute models tend to be deeper and wider. In addition, these networks are extremely susceptible to mode collapse unless the hyperparameters are carefully tuned. We hypothesize that this is an effect of the strong coupling of facial attributes, which makes the generator-discriminator optimization difficult. An unconditioned generative neural network generally learns to associate these entangled representations with a latent vector space in which each dimension represents some combination of attributes. To get past this, we model the multi-attribute perturbation problem as a sequential perturbation of single attributes.
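For readers unfamiliar with the loss from [5], the sketch below shows the standard untargeted margin form on classifier logits. The specific modification used in our experiments is not reproduced here; the function name and the confidence parameter `kappa` are illustrative assumptions.

```python
def cw_margin_loss(logits, true_label, kappa=0.0):
    """Untargeted Carlini-Wagner style margin loss on logits.

    Returns max(Z_true - max_{i != true} Z_i, -kappa). The value is
    positive while the classifier is still correct and is clipped at
    -kappa once another class leads by at least kappa.
    """
    other = max(z for i, z in enumerate(logits) if i != true_label)
    return max(logits[true_label] - other, -kappa)

# Hypothetical logits for a two-class (e.g. gender) classifier.
clean = cw_margin_loss([2.0, -1.0], true_label=0)   # still correctly classified
fooled = cw_margin_loss([-0.5, 1.5], true_label=0)  # misclassified, loss is clipped
```

In a semantic attack the logits would be those of the target model evaluated on the generated image, so minimizing this quantity over the attribute vector pushes the image across the decision boundary.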
Cascaded Attribute Attack. For the cascaded attribute attack, we cascade several smaller single-attribute models one after the other to sequentially transform the input image. In this case, the problem of decoupling facial features from the underlying invariant data is divided among multiple models. The transformed image is then input to the target model. As in the previous two cases, we generate adversarial examples for the CelebA test set by optimizing the Carlini-Wagner loss. In this case, we also modify the attribute encoding module to treat each attribute tuple separately.
We find that the semantically transformed images tend to be less sharp than the ones generated by single- or multi-attribute attacks. This can be attributed to the concatenation of several reconstruction steps: sequential reconstruction leads to a loss of information and compounds the reconstruction error.
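The cascading step can be sketched as a plain function composition. The `shift` models below are toy stand-ins for trained single-attribute generators, and all names are hypothetical.

```python
from functools import reduce

def cascade(models):
    """Compose single-attribute models into one transformation.

    Each model is a callable (image, alpha) -> image. The returned
    function applies them in order, feeding each output to the next.
    """
    def apply(image, alphas):
        return reduce(lambda img, pair: pair[0](img, pair[1]),
                      zip(models, alphas), image)
    return apply

def shift(index):
    """Toy stand-in for a generator: shifts one entry of the image."""
    def model(image, alpha):
        out = list(image)
        out[index] += alpha
        return out
    return model

attack = cascade([shift(0), shift(2)])
result = attack([0.0, 0.0, 0.0], alphas=[0.5, -0.25])
```

Because each stage consumes the previous stage's output, any reconstruction error introduced by one model is passed on to the next.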
Attribute Encoding. Each attribute is represented by a tuple of real numbers that sum up to one. These tuples are concatenated into an attribute vector. To ensure that this structure is preserved over the optimization framework, we use a nonparametric forward model to algebraically manipulate our optimization variables to this specific representation. The encoding module also implements the box constraint for the optimized attribute values to lie between and in order to ensure that the generated images are valid.
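A minimal sketch of such an encoding follows, assuming that each attribute is a two-element tuple of the form (1 - a, a) and that valid values lie in [0, 1]. Both the tuple layout and the bounds are assumptions made for illustration, and the function name is hypothetical.

```python
def encode_attributes(raw, low=0.0, high=1.0):
    """Map one free variable per attribute to a (1 - a, a) tuple.

    Each raw value is clipped to [low, high] (assumed bounds), and the
    resulting pairs are concatenated into a flat attribute vector, so
    every pair sums to one by construction.
    """
    vector = []
    for value in raw:
        a = min(max(value, low), high)
        vector.extend([1.0 - a, a])
    return vector

# Three attributes; the last two raw values fall outside the box.
encoded = encode_attributes([0.2, 1.7, -0.4])
```

The optimizer can then update the raw values freely, while the generator only ever sees vectors that satisfy the sum and box constraints.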
Adversarial Attribute GANs
Architecture of Attribute GANs. Attribute GANs [20] improve upon Fader Networks by using a discriminator-classifier pair to analyze the reconstructed images (see Figure 7 for the architecture). They optimize a combination of a reconstruction loss, an adversarial loss, and an attribute constraint loss to ensure that exactly the desired attribute is edited while the attribute-excluding details are preserved. The encoded latent vector is conditioned on the attribute vector during the decoding process. This results in the decoupling of semantic attributes from the underlying identity data. AttGAN takes as input an image and an attribute vector where each element represents an attribute. We select attributes to perturb for our semantic attack.
We use a pretrained AttGAN model with semantic attributes. For our experiments we consider and attributes respectively for transforming input images.
Attacks. We adapt our adversarial Fader Network approach to the AttGANs by modifying the “Attribute Encoding” module to mask attributes that we do not perturb. The encoding module also constrains the elements to lie between and as required by our algorithm to generate valid images.
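The masking step can be sketched as follows. The bounds `low` and `high` are assumed values, and the function and variable names are hypothetical.

```python
def masked_encoding(original, proposed, mask, low=0.0, high=1.0):
    """Keep unperturbed attributes fixed and clip the perturbed ones.

    mask[i] is 1 if attribute i may be changed and 0 otherwise; low and
    high are assumed bounds for a valid attribute value.
    """
    return [
        min(max(p, low), high) if m else o
        for o, p, m in zip(original, proposed, mask)
    ]

original = [1.0, 0.0, 0.0, 1.0]
proposed = [0.3, 0.9, 1.4, -0.2]
edited = masked_encoding(original, proposed, mask=[1, 0, 1, 0])
```

Only the first and third attributes are allowed to change here; the others keep their original values regardless of what the optimizer proposes.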
Appendix D Results
Our additional experiments on the binary age classifier show that our approach is able to generate adversarial examples for other classifiers trained on the CelebA dataset (see Table 4). Note that our observation that the attack becomes more effective as the number of perturbed attributes increases holds for this new classifier as well. We also compare the performance of our attack with worst-of-10 random sampling (similar to the approach in [12]). In every configuration, the accuracy of the target model is lower under our attack than under random sampling, which indicates that our approach is successful at generating semantic adversarial examples.
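The worst-of-10 random sampling baseline can be sketched as below, assuming attribute values are drawn uniformly from a box and that a lower loss means the sample is closer to fooling the classifier. The bounds, the toy loss, and all names are hypothetical.

```python
import random

def worst_of_n(loss_fn, num_attributes, n=10, low=0.0, high=1.0, seed=0):
    """Draw n random attribute vectors and keep the most damaging one.

    loss_fn maps an attribute vector to an adversarial loss where lower
    means closer to fooling the classifier (assumed convention).
    """
    rng = random.Random(seed)
    candidates = [
        [rng.uniform(low, high) for _ in range(num_attributes)]
        for _ in range(n)
    ]
    return min(candidates, key=loss_fn)

def toy_loss(attrs):
    """Toy loss: pretend the classifier is fooled as attributes approach 1."""
    return sum(1.0 - a for a in attrs)

best = worst_of_n(toy_loss, num_attributes=3)
```

Unlike the optimization-based attack, this baseline uses no gradient information, which is why it serves as a useful point of comparison.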
Qualitative Results for Attacks on Binary Gender Classifier