Adversarial Attacks Beyond the Image Space


Abstract

Generating adversarial examples is an intriguing problem and an important way of understanding the working mechanism of deep neural networks. Recently, it has attracted a lot of attention in the computer vision community. Most existing approaches generated perturbations in image space, i.e., each pixel can be modified independently. However, it remains unclear whether these adversarial examples are authentic, in the sense that they correspond to actual changes in physical properties.

This paper aims at exploring this topic in the contexts of object classification and visual question answering. The baselines are several state-of-the-art deep neural networks which receive 2D input images. We augment these networks with a differentiable 3D rendering layer in front, so that a 3D scene (in physical space) is rendered into a 2D image (in image space), and then mapped to a prediction (in output space). There are two (direct or indirect) ways of attacking the physical parameters. The former back-propagates the gradients of error signals from output space to physical space directly, while the latter first constructs an adversary in image space, and then attempts to find the best solution in physical space that is rendered into this image. An important finding is that attacking physical space is much more difficult, as the direct method, compared with that used in image space, produces a much lower success rate and requires heavier perturbations to be added. On the other hand, the indirect method fails to find such solutions, suggesting that adversaries generated in image space are inauthentic. By interpreting them in physical space, most of these adversaries can be filtered out, showing promise for defending against adversaries.

Figure 1: Adversarial examples for object classification and visual question answering. The first row is the original image. The middle group shows the perturbations (magnified by a factor of \mathbf{5} and shifted by \mathbf{128}) and perturbed images by attacking image space, and the bottom group by attacking physical space. p and \mathrm{conf} are the perceptibility (see Section 3.3) and the confidence score on the predicted class, respectively. Attacking physical space is more difficult, as we always observe a larger perceptibility and a lower confidence score.

1 Introduction

Recent years have witnessed a rapid development in the area of deep learning, in which deep neural networks have been applied to a wide range of computer vision tasks, such as image classification [16][12], object detection [9][34], semantic segmentation [37][6], visual question answering [2][14], etc. The basic idea is to design a hierarchical structure to learn visual patterns from labeled data. With the availability of powerful computational resources and large-scale image datasets [7], researchers have designed more and more complicated models and achieved a boost in visual recognition performance.

Despite the great success of deep learning, we still lack an effective method to understand the working mechanism of deep neural networks. An interesting effort is to generate so-called adversarial perturbations. These are visually imperceptible noise [11] which, once added to an input image, changes the prediction results completely, sometimes ridiculously. Such examples can be constructed in a wide range of vision problems, including image classification [28], object detection and semantic segmentation [45]. Researchers believe that the existence of adversaries implies unknown properties in feature space [43].

This work is motivated by the fact that conventional 2D adversaries are often generated by modifying each image pixel individually. Thus, while being strong in attacks, it remains unclear whether they can be produced by perturbing physical properties in the 3D world. We notice that previous work found adversarial examples “in the physical world” by taking photos of printed perturbed images [17]. Our work is different and more fundamental, as we only allow modifications to basic physical parameters such as surface normals. To this end, we follow [19] and implement 3D rendering as a differentiable layer, which we plug into state-of-the-art neural networks for object classification and visual question answering. In this way, we build a mapping function from physical space (a set of physical parameters, including surface normals, illumination and material), via image space (a rendered 2D image), to output space (the object class or the answer to a question).

We aim at answering two questions. (i) Is it possible to directly generate perturbations in physical space (i.e., by modifying basic physical parameters)? (ii) Given an adversary in image space, is it possible to find an approximate solution in physical space such that the re-rendered 2D image preserves the attacking ability? Based on our framework, these questions correspond to two different ways of generating perturbations in physical space. The first one, named the direct method, computes the difference between the current output and the desired output, back-propagates the gradients to the physics layer directly and makes modifications there. The second one, the indirect method, first constructs an adversary in image space, and then attempts to find the best solution in physical space that is rendered into it. Both methods are implemented with the iterative version of the Fast Gradient Sign Method (FGSM) [11]. We constrain the change in image intensities to guarantee that the perturbations are visually imperceptible. Experiments are performed on two datasets, i.e., the 3D ShapeNet dataset [5] for object classification and CLEVR [14] for visual question answering.

Our major finding is that attacking physical space is much more difficult than attacking image space. Although it is possible to find adversaries in the direct manner (YES to Question (i)), the success rate is lower and the perceptibility of the perturbations becomes much larger than what is required in image space. This is expected, as the rendering process couples changes in pixel values, i.e., modifying one physical parameter (e.g., illumination) may cause many pixels to change at the same time. This also explains why we found it almost impossible to generate adversaries in the indirect manner (a conditional NO to Question (ii); it is possible that currently available optimization algorithms such as FGSM are simply not strong enough). An implication of this research is an effective approach to defend against adversaries generated in image space: finding an approximate solution in physical space and re-rendering it makes them fail.

The remainder of this paper is organized as follows. Section 2 briefly introduces related work. The approach of generating adversarial perturbations in physical space is described in Section 3. After experiments are shown in Section 4, we conclude our work in Section 5.

2 Related Work

Deep learning is the state-of-the-art machine learning technique for learning visual representations from labeled data. The basic methodology is to stack differentiable units in a hierarchical manner [16]. It is believed that a network with a sufficient number of layers and neurons can capture complicated distributions in feature space. With large-scale image datasets such as ImageNet [7], powerful computational resources such as GPUs, and the assistance of efficient training strategies [27][39][13], it is possible to train very deep networks in a reasonable period of time. In recent years, deep neural networks have been widely applied to computer vision problems, including image classification [38][42][12], object detection [9][34], semantic segmentation [37][6], visual question answering [2][8][15], etc.

Despite the success of deep learning, it remains a challenging task to explain what is learned by these complicated models. One of the most interesting efforts towards this goal is to generate adversaries. Adversaries [11] are small perturbations that are (i) imperceptible to humans and (ii) able to cause deep neural networks to make wrong predictions after being added to the input image. It was shown that such perturbations can be easily generated in a white-box environment, i.e., when both the network structure and the pre-trained weights are known to the attacker. Early studies mainly focused on image classification [28][26], but soon researchers were able to attack deep networks for detection and segmentation [45], and also for visual question answering [46]. Efforts were also made in finding universal perturbations which transfer across images [25], as well as adversarial examples in the physical world, produced by taking photos of printed perturbed images [17].

Attacking a known network (a.k.a. a white box) starts with setting a prediction goal. There are generally two types of goals. The first one (a non-targeted attack) aims at reducing the probability of the true class [28], and the second one (a targeted attack) defines a specific class that the network should predict [20]. After that, the error between the current and the target predictions is computed, and the gradients are back-propagated to the image layer. This idea has been developed into a set of algorithms, including the Steepest Gradient Descent Method (SGDM) [26] and the Fast Gradient Sign Method (FGSM) [11]. The difference is that SGDM uses the accurate gradients, while FGSM merely keeps the sign in every dimension of these gradients. The latter, while being less powerful in direct attacks, often enjoys stronger transferability. Iterative versions of these two algorithms have also been studied [17]. In comparison, attacking an unknown network (a.k.a. a black box) is much more challenging [20], and an effective strategy is to sum up perturbations from a set of white-box attacks [45].
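To make this distinction concrete, the toy snippet below (our own illustration, not code from the cited papers) contrasts a step along the exact (normalized) gradient with an FGSM-style sign step; the loss is a placeholder standing in for the attack objective.

```python
import torch

# Toy comparison of the two update rules described above.
x = torch.randn(1, 3, 8, 8, requires_grad=True)   # placeholder input image
loss = (x ** 2).sum()                             # placeholder attack objective
loss.backward()

eps = 1e-2
steepest_step = -eps * x.grad / x.grad.norm()     # step along the exact gradient direction
fgsm_step = -eps * x.grad.sign()                  # FGSM: only the sign in each dimension survives
```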

Conversely, there are efforts to protect deep networks from adversarial attacks. Defensive distillation [31] proposed to defend the network by distilling robust visual knowledge, but a stronger attacking method was later designed to beat this defense [4]. It was shown that training deep networks on a large-scale dataset increases their robustness against adversarial attacks [18], and a more fundamental solution is to add adversarial examples to the training data [44]. Researchers have also developed algorithms to detect whether an image has been attacked by adversarial perturbations [23]. The battle between attackers and defenders continues, and these ideas are closely related to the rise of Generative Adversarial Networks [10][33].

3 Approach

3.1 Motivation: Inauthenticity of Adversaries

Despite the fast development of generating adversaries to attack deep neural networks, we note that almost all existing algorithms assume that each pixel of the image can be modified independently. Although this strategy is successful in finding strong adversaries, the perturbed images may be inauthentic, i.e., they may not correspond to any 3D scene in the real world. Motivated by this, we study the possibility of constructing adversaries by directly modifying the physical parameters, i.e., the surface normals and material of an object and the illumination of a 3D scene. Note that our goal is more fundamental than the previous work [17], which generated adversaries “in the physical world” by taking photos of printed perturbed images.

To be concrete, this work aims at answering two questions. First, is it possible to perturb the 3D physical parameters so as to attack 2D deep networks? Second, are conventional 2D adversaries interpretable as perturbations of the physical parameters of a 3D scene? We answer these two questions based on a single framework (Section 3.2), which plugs a rendering module into various deep neural networks and thus builds a mapping function from the physics layer, via the image layer, to the output layer.

Our goal is to generate adversaries on the physics layer. There are two different ways, i.e., either directly back-propagating errors from the output layer to the physics layer, or first constructing an adversary in the image layer, and then finding a set of physical parameters that are rendered into it. We name them as direct and indirect ways of attacking the physics layer. In experiments, the direct method works reasonably well, but the indirect method fails completely. Quantitative results are shown in Section 4.

3.2 From Physical Parameters to Prediction

As the basis of this work, we build an end-to-end framework which receives the physical parameters of a 3D scene, renders them into a 2D image, and outputs a prediction, e.g., the class of an object or the answer to a visual question. Note that our research involves 3D-to-2D rendering as part of the pipeline, which distinguishes it from previous work that either worked on rendered 2D images [40][15] or directly processed 3D data without rendering it into 2D images [32][36].

We denote the physical space, image space and output space by $\mathcal{P}$, $\mathcal{I}$ and $\mathcal{O}$, respectively. Given a 3D scene $S \in \mathcal{P}$, the first step is to render it into a 2D image $\mathbf{I} \in \mathcal{I}$. For this purpose, we consider three sets of physical parameters, i.e., surface normals $\mathbf{N}$, illumination $\mathbf{L}$, and material $\mathbf{M}$. Given these parameters, we assume that the camera geometries, e.g., position, rotation, field-of-view, etc., are known beforehand and remain unchanged in each case. The rendering module (described below) is denoted by $\mathbf{I} = r(\mathbf{N}, \mathbf{L}, \mathbf{M})$. Then, we make use of $\mathbf{I}$ in two vision tasks, i.e., object classification and visual question answering. An individual deep neural network receives $\mathbf{I}$ as its input, and outputs the prediction in $\mathcal{O}$. The networks for classification and visual question answering are denoted by $\mathbf{y}_{\mathrm{C}} = f_{\mathrm{C}}(\mathbf{I}; \boldsymbol{\theta}_{\mathrm{C}})$ and $\mathbf{y}_{\mathrm{Q}} = f_{\mathrm{Q}}(\mathbf{I}, \mathbf{q}; \boldsymbol{\theta}_{\mathrm{Q}})$, respectively. Here, $\mathbf{q}$ is the question, $\mathbf{y}_{\mathrm{C}}$ and $\mathbf{y}_{\mathrm{Q}}$ are the output vectors, and $\boldsymbol{\theta}_{\mathrm{C}}$ and $\boldsymbol{\theta}_{\mathrm{Q}}$ are the corresponding network parameters.
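To make this composition concrete, the sketch below chains a toy differentiable renderer and a toy classifier in PyTorch; all module definitions, tensor shapes and values are placeholders of our own and do not reflect the actual renderer [19] or the networks used in this paper.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins that only illustrate y = f(r(N, L, M); theta).
class ToyRenderer(nn.Module):
    def forward(self, normals, illumination, material):
        # Crude Lambertian-style shading: brightness from the normals'
        # z-component, scaled by global light intensity and albedo.
        shading = normals[:, 2:3].clamp(min=0.0)              # (B, 1, H, W)
        return shading * illumination.mean() * material.mean()

class ToyClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(1, num_classes))
    def forward(self, image):
        return self.net(image).softmax(dim=-1)

renderer, classifier = ToyRenderer(), ToyClassifier()
normals = torch.randn(1, 3, 64, 64, requires_grad=True)        # physical space
illumination = torch.rand(16, 32, requires_grad=True)
material = torch.rand(8, requires_grad=True)
probs = classifier(renderer(normals, illumination, material))  # output space
```

Because the renderer is a network layer, gradients of the output can be back-propagated all the way to the physical parameters, which is what both attack modes below rely on.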

3D Object Rendering

We make use of [19], a differentiable algorithm for 3D object rendering. Note that some other algorithms [3][22] provide better rendering qualities but cannot be used in this paper because they are non-differentiable. Differentiability is indispensable, as it enables us to back-propagate errors from the image layer to the physics layer.

Three sets of parameters are considered. (i) Surface normals are represented by an angle map $\mathbf{N}$ of the same spatial size as the rendered image $\mathbf{I}$, where each pixel stores the azimuth and polar angles of the normal vector at this position. (ii) Illumination $\mathbf{L}$ is defined by an HDR environment map, with each pixel storing the intensity of the light coming from the corresponding direction (a spherical coordinate system is used). (iii) Material $\mathbf{M}$ impacts image rendering through a set of bidirectional reflectance distribution functions (BRDFs), which describe the point-wise light reflection for both diffuse and specular surfaces [29]. The material parameters used in this paper come from the directional statistics BRDF model [30], which represents a BRDF as a combination of parametric distributions. All three sets of parameters are real-valued arrays and can therefore be perturbed continuously.

The rendering algorithm [19] is based on some reasonable assumptions, e.g., translation-invariant natural illumination (incoming light depends only on its direction), no emission, and omission of complex light interactions such as inter-reflections and subsurface scattering. The intensity of each pixel of $\mathbf{I}$ is then computed by integrating the reflected light intensity over all incoming light directions [21]. For numerical computation, the integral is replaced by a discrete sum over a set of sampled incoming light directions. In practice, the rendering process is implemented as a network layer which is differentiable with respect to the input parameters $\mathbf{N}$, $\mathbf{L}$ and $\mathbf{M}$. Please refer to [19] for the mathematical equations and technical details.
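For intuition, a simplified form of this discretized computation (our own sketch, not the exact equations of [19]) is

$I(x) \;\approx\; \sum_{\omega_i \in \Omega} L(\omega_i)\, \rho\!\left(\omega_i, \omega_o; \mathbf{M}\right)\, \max\!\bigl(0,\; \mathbf{n}(x)\cdot\omega_i\bigr)\, \Delta\omega_i,$

where $\mathbf{n}(x)$ is the surface normal at pixel $x$, $\Omega$ is the set of sampled incoming light directions, $\omega_o$ is the viewing direction, and $\rho$ is the BRDF determined by the material parameters. Every term is a smooth function of $\mathbf{N}$, $\mathbf{L}$ or $\mathbf{M}$, which is why the layer is differentiable.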

Object Classification

Based on the rendered 2D images, object classification is straightforward. Two popular networks (AlexNet [16] and ResNet [12]) are investigated. We start with two models pre-trained on the ILSVRC2012 dataset [35], and fine-tune them using the rendered images in the ShapeNet [5] dataset. This is necessary, as the rendered images contain quite different visual patterns from natural images.

In the testing stage, the network parameters $\boldsymbol{\theta}_{\mathrm{C}}$ are fixed, and we predict the class by $\hat{c} = \arg\max_{c} y_c$, where $\mathbf{y} = f_{\mathrm{C}}(\mathbf{I}; \boldsymbol{\theta}_{\mathrm{C}})$ ($C$ being the number of classes) is the probability distribution over all classes.

Visual Question Answering

The visual question answering problem [14] considered in this paper is essentially a classification task. Given an input image $\mathbf{I}$ and a question $\mathbf{q}$, the goal is to choose the correct answer from a pre-defined set of candidate answers.

We make use of the algorithm described in [15]. This algorithm consists of two components: a program generator and an execution engine. The goal of the program generator is to convert the question written in natural language to a tree structure (i.e., a program) that describes the order and types of a series of actions to carry out. Specifically, a sequence-to-sequence model [41] is used to translate words in the question to the prefix traversal of the abstract syntax tree. The execution engine assembles a neural module network [1] according to the predicted program. Each module is a small convolutional neural network that corresponds to one predicted action in the tree. The convolutional image features may be queried at various places in this assembled network.
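The toy sketch below illustrates only this assembly step: a hard-coded program stands in for the seq2seq output, and a chain of small convolutional modules stands in for the assembled tree. Names, shapes and the answer-set size are illustrative placeholders, not the actual IEP implementation.

```python
import torch
import torch.nn as nn

# Toy IEP-style execution: a predicted program is mapped to a chain of
# small convolutional modules operating on image features.
def make_module():
    return nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())

program = ["filter_color[red]", "filter_shape[cube]", "count"]   # stand-in for the parsed program
modules = {name: make_module() for name in program}

image_features = torch.randn(1, 16, 14, 14)   # stand-in for CNN features of the rendered image
h = image_features
for name in program:                          # execute the (simplified, chain-shaped) program
    h = modules[name](h)
answer_logits = nn.Linear(16 * 14 * 14, 28)(h.flatten(1))   # toy answer scores
```

In the real model the program is a tree and each module type has its own weights; the point here is only that, once assembled, the whole pipeline is a fixed differentiable function of the input image.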

In the testing stage, given a question $\mathbf{q}$, the network structure and its parameters $\boldsymbol{\theta}_{\mathrm{Q}}$ are fixed, and the prediction process is very similar to that in object classification. Thus, we unify object classification and visual question answering into the same formulation:

$\mathbf{y} = f\!\left(r(\mathbf{N}, \mathbf{L}, \mathbf{M}), \mathbf{q}; \boldsymbol{\theta}\right),$

where $\boldsymbol{\theta}$ denotes the network parameters, i.e., $\boldsymbol{\theta}_{\mathrm{C}}$ or $\boldsymbol{\theta}_{\mathrm{Q}}$ (for object classification, the question $\mathbf{q}$ is simply absent).

3.3 Attacking Physical Space

Perceptibility

The goal of an adversarial attack is to produce a visually imperceptible perturbation, so that the network makes incorrect predictions after it is added to the original image. Let the physical parameters be $\mathbf{N}$, $\mathbf{L}$, $\mathbf{M}$, and the rendered image be $\mathbf{I} = r(\mathbf{N}, \mathbf{L}, \mathbf{M})$. Denote the perturbations added to these parameters by $\Delta\mathbf{N}$, $\Delta\mathbf{L}$ and $\Delta\mathbf{M}$, respectively. The perturbation added to the rendered image is then:

$\Delta\mathbf{I} = r(\mathbf{N}+\Delta\mathbf{N}, \mathbf{L}+\Delta\mathbf{L}, \mathbf{M}+\Delta\mathbf{M}) - r(\mathbf{N}, \mathbf{L}, \mathbf{M}).$

Perceptibility is computed on the perturbations of both the rendered image and the physical parameters. Following [43][26], the image perceptibility is defined as:

$p_{\mathbf{I}} = \left(\frac{1}{K}\sum_{k=1}^{K}\left\|\Delta\mathbf{i}_k\right\|_2^2\right)^{1/2},$

where $\Delta\mathbf{i}_k$ is a 3-dimensional vector representing the change in RGB intensities (normalized in $[0,1]$) of the $k$-th pixel, and $K$ is the number of pixels. Similarly, we can define perceptibility values for the physical parameters, i.e., $p_{\mathbf{N}}$, $p_{\mathbf{L}}$ and $p_{\mathbf{M}}$. For example,

$p_{\mathbf{N}} = \left(\frac{1}{K}\sum_{k=1}^{K}\left\|\Delta\mathbf{n}_k\right\|_2^2\right)^{1/2},$

where $\Delta\mathbf{n}_k$ is the perturbation of the surface normal at the $k$-th pixel.

In experiments, $p_{\mathbf{I}}$ is the major criterion of visual imperceptibility, but we also guarantee that $p_{\mathbf{N}}$, $p_{\mathbf{L}}$ and $p_{\mathbf{M}}$ are very small.
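A minimal sketch of the image-space perceptibility, under the root-mean-square definition reconstructed above (function name and shapes are our own choices), is:

```python
import torch

def perceptibility(image, perturbed):
    # Root-mean-square of the per-pixel RGB change, intensities in [0, 1].
    delta = (perturbed - image).reshape(-1, 3)      # K x 3 per-pixel RGB change
    return (delta.norm(dim=1) ** 2).mean().sqrt()

orig = torch.rand(224, 224, 3)
adv = (orig + torch.empty_like(orig).uniform_(-0.01, 0.01)).clamp(0.0, 1.0)
print(perceptibility(orig, adv))                    # roughly 1e-2 for this noise level
```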

Setting a Goal of Attacking

Attacking the physical parameters starts with setting a goal, i.e., what we hope the network will predict. In practice, this is done by defining a loss function $\mathcal{L}(\mathbf{y})$, which measures how far the current output $\mathbf{y}$ is from the desired status. In this work, the goal is set in two ways, i.e., either targeted or non-targeted. A targeted attack specifies a class $c'$ as which the image should be classified, and thus defines a target output vector $\mathbf{y}'$ using the one-hot encoding scheme. The Manhattan distance between $\mathbf{y}$ and $\mathbf{y}'$ forms the loss function:

$\mathcal{L}(\mathbf{y}) = \left\|\mathbf{y} - \mathbf{y}'\right\|_1.$

On the other hand, a non-targeted attack specifies a class $c$ (typically the true class) as which the image should not be classified, and the goal is to minimize the $c$-th dimension of the output $\mathbf{y}$:

$\mathcal{L}(\mathbf{y}) = y_c.$

Throughout this paper, we make use of the non-targeted attack, and refer to the corresponding loss term simply as $\mathcal{L}$.
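A minimal sketch of these two loss terms (our own placeholder code, with `probs` standing in for the network output $\mathbf{y}$) is:

```python
import torch

def targeted_loss(probs, target_class):
    # Manhattan (L1) distance between the prediction and a one-hot target.
    one_hot = torch.zeros_like(probs)
    one_hot[..., target_class] = 1.0
    return (probs - one_hot).abs().sum(dim=-1).mean()

def non_targeted_loss(probs, true_class):
    # Minimizing the probability of class c pushes the prediction away from it.
    return probs[..., true_class].mean()

probs = torch.softmax(torch.randn(1, 40), dim=-1)   # placeholder network output
loss = non_targeted_loss(probs, true_class=3)       # the loss term minimized by the attack
```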

Direct Attacks

There are two ways of attacking physical space. The direct attack works by expanding the loss function in physical space, i.e., $\mathcal{L} = \mathcal{L}\!\left(f\!\left(r(\mathbf{N}+\Delta\mathbf{N}, \mathbf{L}+\Delta\mathbf{L}, \mathbf{M}+\Delta\mathbf{M}), \mathbf{q}; \boldsymbol{\theta}\right)\right)$, and minimizing this function with respect to the physical perturbations $\Delta\mathbf{N}$, $\Delta\mathbf{L}$ and $\Delta\mathbf{M}$. These parameters can be optimized either jointly or individually. Without loss of generality, we describe the optimization of a single set of parameters, taking the surface normals $\mathbf{N}$ as the example.

The optimization starts from the initial (unperturbed) state $\mathbf{N}_0 = \mathbf{N}$. A total of $T$ iterations are performed. In the $t$-th round, we compute the gradient of the loss with respect to the current state $\mathbf{N}_{t-1}$, i.e., $\nabla_{\mathbf{N}}\mathcal{L}(\mathbf{N}_{t-1})$, and update $\mathbf{N}_{t-1}$ along this direction. We follow the Fast Gradient Sign Method (FGSM) [11] and preserve only the sign in each dimension of the gradient vector, as this avoids a normalization step in which large gradient values may swallow small ones. We denote the perturbation in the $t$-th iteration by

$\boldsymbol{\delta}_t = -\mathrm{sign}\!\left(\nabla_{\mathbf{N}}\mathcal{L}(\mathbf{N}_{t-1})\right),$

and update by $\mathbf{N}_t = \mathbf{N}_{t-1} + \eta\,\boldsymbol{\delta}_t$, where $\eta$ is the learning rate. This iterative process is terminated once the goal of attacking is reached. The accumulated perturbation over all iterations, $\Delta\mathbf{N} = \mathbf{N}_T - \mathbf{N}_0$, is kept for later reference.

Throughout the attacking process, in order to guarantee imperceptibility, we constrain the RGB intensity changes on the image layer. In each iteration, after a new set of physical perturbations is generated, we check all pixels of the re-rendered image, and any perturbation exceeding a fixed threshold $U$ is truncated. $U$ is set to a small value that keeps the change imperceptible to human eyes. Truncation causes inconsistency between the physical parameters and the rendered image, and risks failure in attacking. To avoid frequent truncations, we set the learning rate to be small, which consequently increases the number of iterations needed to attack the network.
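The loop below is a minimal sketch of this procedure, using toy stand-ins for the renderer and the network and an arbitrary truncation threshold; it illustrates the sign-based update and the image-space clipping, not the paper's actual implementation.

```python
import torch

def render(N):                      # toy "renderer": shading from the z-component of the normals
    return N[:, 2:3].clamp(min=0.0)

def network(I):                     # toy "classifier": two classes driven by mean image intensity
    s = I.mean()
    return torch.stack([s, 1.0 - s]).softmax(dim=0)

N0 = 0.3 + 0.5 * torch.rand(1, 3, 32, 32)    # unperturbed normals (toy values)
I0 = render(N0)                              # reference image for the truncation check
N = N0.clone()
true_class, lr, U, num_iters = 0, 2e-3, 18.0 / 255.0, 100   # U is an arbitrary toy threshold

for _ in range(num_iters):
    N = N.detach().requires_grad_(True)
    loss = network(render(N))[true_class]    # non-targeted goal: reduce p(true class)
    loss.backward()
    with torch.no_grad():
        N = N - lr * N.grad.sign()           # FGSM step: keep only the sign of the gradient
        I = render(N)
        I = I0 + (I - I0).clamp(-U, U)       # truncate the image-space change for imperceptibility
        if network(I).argmax().item() != true_class:
            break                            # the (toy) attack succeeded on the truncated image
```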

Indirect Attacks

In contrast, the indirect attack first finds a perturbation $\Delta\mathbf{I}^{\mathrm{adv}}$ in image space, and then computes perturbations in physical space, i.e., $\Delta\mathbf{N}$, $\Delta\mathbf{L}$ and $\Delta\mathbf{M}$. $\Delta\mathbf{I}^{\mathrm{adv}}$ is generated with the same iterative FGSM procedure as above, applied directly to the image pixels instead of the physical parameters.

The next step is to find physical perturbations $\Delta\mathbf{N}$, $\Delta\mathbf{L}$ and $\Delta\mathbf{M}$, so that the newly rendered image

$\mathbf{I}' = r(\mathbf{N}+\Delta\mathbf{N}, \mathbf{L}+\Delta\mathbf{L}, \mathbf{M}+\Delta\mathbf{M})$

is close to $\mathbf{I} + \Delta\mathbf{I}^{\mathrm{adv}}$. Mathematically, we minimize the following loss function:

$\mathcal{L}' = \left\|r(\mathbf{N}+\Delta\mathbf{N}, \mathbf{L}+\Delta\mathbf{L}, \mathbf{M}+\Delta\mathbf{M}) - \left(\mathbf{I} + \Delta\mathbf{I}^{\mathrm{adv}}\right)\right\|_2^2.$

As in the direct attack, this objective is expanded in physical space and can be optimized over the physical parameters either jointly or individually.

Note that the indirect attack is indeed an attempt to interpret $\Delta\mathbf{I}^{\mathrm{adv}}$ in physical space. However, as we will show in experiments, this approach fails in all cases. This suggests that adversaries generated in image space, despite being strong in attacks, are often inauthentic, and their approximations in physical space either do not exist or cannot be found by this simple optimization.
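A minimal sketch of this reconstruction step, again with a toy renderer and an arbitrary stand-in for the image-space adversary, is given below; in the paper, the optimization runs over $\Delta\mathbf{N}$, $\Delta\mathbf{L}$ and $\Delta\mathbf{M}$ jointly or individually.

```python
import torch

def render(N):                      # the same kind of toy stand-in renderer as above
    return N[:, 2:3].clamp(min=0.0)

N0 = 0.3 + 0.5 * torch.rand(1, 3, 32, 32)
# Stand-in for the fixed image-space adversary I + Delta I^adv:
I_adv = (render(N0) + 0.01 * torch.randn(1, 1, 32, 32)).clamp(0.0, 1.0).detach()

delta_N = torch.zeros_like(N0, requires_grad=True)    # physical perturbation to be optimized
optimizer = torch.optim.SGD([delta_N], lr=1e-2, momentum=0.9)
for _ in range(200):
    optimizer.zero_grad()
    loss = ((render(N0 + delta_N) - I_adv) ** 2).mean()   # match the re-rendered image to I_adv
    loss.backward()
    optimizer.step()
# In the paper's experiments, this reconstruction never recovers the attacking
# ability of the image-space adversary, regardless of learning rate or optimizer.
```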

4 Experiments

4.1 3D Object Classification

Table 1: Effect of white-box adversarial attacks on image space or on individual elements of physical space. By combined, we allow the three sets of physical parameters to be perturbed jointly. Succ. denotes the success rate of attacks (higher is better), and $p$ is the perceptibility value (lower is better) defined in Section 3.3.
(Table layout: rows are FGSM on AlexNet and FGSM on ResNet-34; columns cover attacking image space, surface normals, illumination, material, and the combined setting, each reporting Succ. and $p$. The numerical entries are not reproduced here.)
Figure 2: Examples of adversaries generated in the 3D object classification task. In each group, the top row shows the original testing image, which is correctly predicted by both AlexNet (A) and ResNet (R). The following two rows display the perturbations and the attacked image, respectively. All perturbations are magnified by a factor of \mathbf{5} and shifted by \mathbf{128}. p is the perceptibility value defined in Section 3.3, and \mathrm{conf} is the confidence score of the prediction.

Settings

We investigate 3D object recognition on the ShapeNetCore-v2 dataset [5], which contains rigid object categories, each with various 3D models. We randomly sample 3D models from each class and render a set of fixed viewpoints for each object to form the training images. Similarly, another set of randomly chosen objects from each class is used for testing.

We start with two popular deep neural networks, i.e., AlexNet [16] and a 34-layer deep residual network (ResNet-34) [12]; it is easy to generalize our approach to other network structures. Both networks are pre-trained on the ILSVRC2012 dataset [35] and fine-tuned on our training set, with the mini-batch size and learning rate set separately for AlexNet and ResNet-34. Both networks work well on the original testing set (no perturbations are added), with classification accuracies comparable to the single-view baseline reported in [40].

Figure 3:  Curves of the average loss function value throughout the iterations of FGSM. The starting point of each curve is the average prediction confidence on the original images.

For each class, from the correctly classified testing samples, we choose the images with the highest classification probabilities as the targets to generate adversaries. This target set seems small, but we emphasize that attacking in physical space is very time-consuming, as we need to repeatedly re-render the perturbed physical parameters throughout the iterations. Using a Titan-X Pascal GPU, attacking one image takes on the order of minutes on average1.

Quantitative Results

We apply the iterative version of the Fast Gradient Sign Method (FGSM), and attack each set of physical parameters (surface normals, illumination and material) individually or all of them jointly. For comparison, we also provide results of attacking image space directly. For each setting, we use a non-targeted goal (see Section 3.3), the SGD optimizer (with momentum and weight decay), and a fixed maximal number of iterations. Choosing the learning rate is a little tricky. If the learning rate is too large, truncation in image space happens frequently (see Section 3.3), and we cannot guarantee an accurate correspondence between physical and image spaces. On the other hand, if the learning rate is too small, the accumulated perturbations are not enough to change the prediction. We therefore choose the best learning rate for each set of physical parameters from a small pool of candidates. This is not cheating, as adversarial attacks assume that we know the labels of all the target images [11].

Results of direct attacks are summarized in Table 1 (the indirect method does not work, see below). First, we demonstrate that adversaries widely exist in both image space and physical space. In image space, as researchers have explored before [43][26], it is easy to confuse the network with small perturbations: in our case, the attack succeeds consistently and the perceptibility remains very low. In physical space, however, generating adversaries becomes much more difficult: the success rate becomes lower, and large perceptibility values are often observed on the successful cases. Typical adversarial examples generated in physical space are shown in Figure 2.

Diagnosis

Among the three sets of physical parameters, attacking surface normals is more effective than attacking the other two. This is as expected, since local perturbations are often easier to exploit when attacking deep neural networks [11]. The surface normal map shares the same spatial layout as the image lattice, and changing one of its elements only impacts a single pixel in the rendered image. In comparison, illumination and material are global properties of the 3D scene or the object, so tuning each parameter causes a number of pixels to be modified at once, which is less effective for adversarial attacks. This property also holds in the visual question answering scenario. As a side note, although perturbing surface normals allows each pixel to be changed somewhat independently, the rendered RGB intensity is also heavily influenced by the other two parameter sets. This is why allowing all physical parameters to be jointly optimized produces the highest success rate. We take this option in the remaining diagnostic experiments.

We plot the curves of the average loss over the target images throughout the iterative process of generating adversaries in Figure 3. The loss values in physical attacks, especially for illumination and material, drop much more slowly than those in image attacks. Even with a larger number of iterations, some objects are still not attacked successfully, especially when only the illumination or material parameters are perturbed.

Finally, as the average perceptibility in physical space is much larger than that in image space, we conjecture that adversaries generated in image space are inauthentic, i.e., using the current optimization approach (FGSM), it is almost impossible to find physical parameters that are rendered into approximately the same images. This is verified using the indirect method described in Section 3.3. For both AlexNet and ResNet-34, it fails on all target images, regardless of the learning rate and optimizer (SGD or Adam).

4.2 Visual Question Answering

Table 2: Generating white-box attacks for visual question answering with IEP [15]. By combined, we allow the three sets of physical parameters to be perturbed jointly. Succ. denotes the success rate of attacks (higher is better) at changing an originally correct answer, and $p$ is the perceptibility value (lower is better) defined in Section 3.3.
(Table layout: the row is FGSM on IEP; columns cover attacking image space, surface normals, illumination, material, and the combined setting, each reporting Succ. and $p$. The numerical entries are not reproduced here.)
Figure 4: Examples of adversaries generated in the 3D visual question answering task. In each group, the top row shows the original testing image and two questions, both of which are correctly answered. The following two rows display the perturbations and the attacked image, respectively. All perturbations are magnified by a factor of \mathbf{5} and shifted by \mathbf{128}. p is the perceptibility value defined in Section 3.3, and \mathrm{conf} is the confidence score of choosing this answer. Note that the rightmost column shows a ridiculous answer.

Settings

We extend our experiments to a more challenging vision task, visual question answering. Experiments are performed on the recently released CLEVR dataset [14], which is built on an engine that can generate an arbitrary number of 3D scenes with meta-information (object configurations). Each scene is also equipped with multiple generated questions, e.g., asking for the number of specified objects in the scene, or whether an object has a specified property.

The baseline algorithm is Inferring and Executing Programs (IEP) [15]. It applies an LSTM to parse each question into a tree structure, which is then converted into a neural module network. We use the released model without re-training it ourselves. We sample a subset of images from the original testing set and equip each of them with several visual questions; the original model answers these questions with high (classification) accuracy.

We randomly pick testing images on which all questions are correctly answered as the target images. The settings for generating adversarial perturbations are the same as in the classification experiments, i.e., the iterative FGSM is used, and the three sets of physical parameters are attacked either individually or jointly.

Quantitative Results

Results are shown in Table 2. We observe similar phenomena as in the classification experiments. This is as expected, since once the question has been parsed and a neural module network has been assembled, attacking either image space or physical space is essentially equivalent to attacking a classifier.

A side note concerns perturbing the material parameters. Although some visual questions ask about the material (e.g., metal or rubber) of an object, the success rate on this type of question does not differ significantly from that on other questions. This is because we constrain perceptibility, which does not allow the material parameters to be modified by a large amount.

A significant difference in visual question answering comes from the so-called language prior. With a language parser, the network is able to narrow the answers down to a small subset without looking at the image; e.g., when asked about the color of an object, it is very unlikely for the network to answer yes or three. We find that the attacked network sometimes makes exactly such ridiculous errors: in the rightmost column of Figure 4, when asked about the shape of an object, the network answers no after a non-targeted attack.

5 Conclusions

This paper delivers an important message: it is difficult to generate adversaries in physical space. To study this topic, we plug a differentiable rendering layer into state-of-the-art deep networks for object classification and visual question answering, and use two methods to attack the physical parameters. First, directly constructing adversaries in physical space is effective, but the success rate is lower than in image space, and much heavier perturbations are required for a successful attack. Second, with current optimization algorithms, e.g., iterative FGSM, it is almost impossible to reproduce image-space adversaries by perturbing the physical parameters, which suggests that 2D adversaries are often not well explained by physical perturbations.

This work has two potential implications. First, the existence of real physical adversaries may trigger research in more complicated 3D applications, e.g., stereo matching [47] or reinforcement learning [24] in 3D virtual scenes. Second, in 3D vision scenarios, we can defend deep neural networks against 2D adversaries by enforcing an image to be interpreted in physical space, so that the attacking ability is weakened or removed after re-rendering.

Acknowledgements

We thank Guilin Liu, Cihang Xie, Zhishuai Zhang and Yi Zhang for instructive discussions.

Footnotes

  1. This time cost depends on which parameters are allowed to be perturbed: perturbing surface normals, illumination or material individually, or attacking all of them jointly, each takes a different amount of time per image.

References

  1. Neural Module Networks.
    J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Computer Vision and Pattern Recognition, 2016.
  2. VQA: Visual Question Answering.
    S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. International Conference on Computer Vision, 2015.
  3. Blender – a 3D modelling and rendering package.
    Blender Online Community. https://www.blender.org/, 2017.
  4. Towards Evaluating the Robustness of Neural Networks.
    N. Carlini and D. Wagner. IEEE Symposium on Security and Privacy, 2017.
  5. ShapeNet: An Information-Rich 3D Model Repository.
    A. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. arXiv preprint arXiv:1512.03012, 2015.
  6. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs.
    L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  7. ImageNet: A Large-Scale Hierarchical Image Database.
    J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Computer Vision and Pattern Recognition, 2009.
  8. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering.
    H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Advances in Neural Information Processing Systems, 2015.
  9. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
    R. Girshick, J. Donahue, T. Darrell, and J. Malik. Computer Vision and Pattern Recognition, 2014.
  10. Generative Adversarial Nets.
    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Advances in Neural Information Processing Systems, 2014.
  11. Explaining and Harnessing Adversarial Examples.
    I. Goodfellow, J. Shlens, and C. Szegedy. International Conference on Learning Representations, 2015.
  12. Deep Residual Learning for Image Recognition.
    K. He, X. Zhang, S. Ren, and J. Sun. Computer Vision and Pattern Recognition, 2016.
  13. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
    S. Ioffe and C. Szegedy. International Conference on Machine Learning, 2015.
  14. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning.
    J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Zitnick, and R. Girshick. Computer Vision and Pattern Recognition, 2017.
  15. Inferring and Executing Programs for Visual Reasoning.
    J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. Zitnick, and R. Girshick. International Conference on Computer Vision, 2017.
  16. ImageNet Classification with Deep Convolutional Neural Networks.
    A. Krizhevsky, I. Sutskever, and G. Hinton. Advances in Neural Information Processing Systems, 2012.
  17. Adversarial Examples in the Physical World.
    A. Kurakin, I. Goodfellow, and S. Bengio. Workshop Track, International Conference on Learning Representations, 2017.
  18. Adversarial Machine Learning at Scale.
    A. Kurakin, I. Goodfellow, and S. Bengio. International Conference on Learning Representations, 2017.
  19. Material Editing Using a Physically Based Rendering Network.
    G. Liu, D. Ceylan, E. Yumer, J. Yang, and J. Lien. International Conference on Computer Vision, 2017.
  20. Delving into Transferable Adversarial Examples and Black-Box Attacks.
    Y. Liu, X. Chen, C. Liu, and D. Song. International Conference on Learning Representations, 2017.
  21. Reflectance and Natural Illumination from a Single Image.
    S. Lombardi and K. Nishino. European Conference on Computer Vision, 2012.
  22. SceneNet RGB-D: 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth.
    J. McCormac, A. Handa, S. Leutenegger, and A. Davison. International Conference on Computer Vision, 2017.
  23. On Detecting Adversarial Perturbations.
    J. Metzen, T. Genewein, V. Fischer, and B. Bischoff. International Conference on Learning Representations, 2017.
  24. Learning to Navigate in Complex Environments.
    P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. arXiv preprint arXiv:1611.03673, 2016.
  25. Universal Adversarial Perturbations.
    S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Computer Vision and Pattern Recognition, 2017.
  26. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks.
    S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Computer Vision and Pattern Recognition, 2016.
  27. Rectified Linear Units Improve Restricted Boltzmann Machines.
    V. Nair and G. Hinton. International Conference on Machine Learning, 2010.
  28. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images.
    A. Nguyen, J. Yosinski, and J. Clune. Computer Vision and Pattern Recognition, 2015.
  29. Geometrical Considerations and Nomenclature for Reflectance.
    F. Nicodemus, J. Richmond, J. Hsia, I. Ginsberg, and T. Limperis. Radiometry, pages 94–145, 1992.
  30. Directional Statistics BRDF Model.
    K. Nishino. International Conference on Computer Vision, 2009.
  31. Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks.
    N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. IEEE Symposium on Security and Privacy, 2016.
  32. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation.
    C. Qi, H. Su, K. Mo, and L. Guibas. Computer Vision and Pattern Recognition, 2017.
  33. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.
    A. Radford, L. Metz, and S. Chintala. International Conference on Learning Representations, 2016.
  34. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
    S. Ren, K. He, R. Girshick, and J. Sun. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
  35. ImageNet Large Scale Visual Recognition Challenge.
    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. International Journal of Computer Vision, pages 1–42, 2015.
  36. Exploiting the PANORAMA Representation for Convolutional Neural Network Classification and Retrieval.
    K. Sfikas, T. Theoharis, and I. Pratikakis. Eurographics Workshop on 3D Object Retrieval, 2017.
  37. Fully Convolutional Networks for Semantic Segmentation.
    E. Shelhamer, J. Long, and T. Darrell. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2017.
  38. Very Deep Convolutional Networks for Large-Scale Image Recognition.
    K. Simonyan and A. Zisserman. International Conference on Learning Representations, 2015.
  39. Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  40. Multi-view Convolutional Neural Networks for 3D Shape Recognition.
    H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. International Conference on Computer Vision, 2015.
  41. Sequence to Sequence Learning with Neural Networks.
    I. Sutskever, O. Vinyals, and Q. Le. Advances in Neural Information Processing Systems, 2014.
  42. Going Deeper with Convolutions.
    C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Computer Vision and Pattern Recognition, 2015.
  43. Intriguing Properties of Neural Networks.
    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. International Conference on Learning Representations, 2014.
  44. Ensemble Adversarial Training: Attacks and Defenses.
    F. Tramer, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. arXiv preprint arXiv:1705.07204, 2017.
  45. Adversarial Examples for Semantic Segmentation and Object Detection.
    C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille. International Conference on Computer Vision, 2017.
  46. Can You Fool AI with Adversarial Examples on a Visual Turing Test?
    X. Xu, X. Chen, C. Liu, A. Rohrbach, T. Darell, and D. Song. arXiv preprint arXiv:1709.08693, 2017.
  47. UnrealStereo: A Synthetic Dataset for Analyzing Stereo Vision.
    Y. Zhang, W. Qiu, Q. Chen, X. Hu, and A. Yuille. arXiv preprint arXiv:1612.04647, 2016.