# Adversarial Geometry and Lighting using a Differentiable Renderer

###### Abstract

Many machine learning classifiers are vulnerable to adversarial attacks: inputs with perturbations designed to intentionally trigger misclassification. Modern adversarial methods either directly alter pixel colors or “paint” colors onto 3D shapes. We propose novel adversarial attacks that directly alter the geometry of 3D objects and/or manipulate the lighting in a virtual scene. We leverage a novel differentiable renderer that is efficient to evaluate and analytically differentiate. Our renderer generates images realistic enough for correct classification by common pretrained models, and we use it to design physical adversarial examples that consistently fool these models. We conduct qualitative and quantitative experiments to validate our adversarial geometry and adversarial lighting attack capabilities.

## 1 Introduction

Neural networks continue to drive the progress of machine learning and computer vision at a rapid pace. A startling threat to this bright future is the existence of adversarial examples: inputs intentionally perturbed to cause incorrect classification. For safety-critical applications such as self-driving, the implications are severe.

Optimizing for adversarial attacks on image classifiers is a trending topic, but, so far, previous works either directly alter pixel colors or alter the colors printed on photographed objects. Photographs are indeed influenced by the color of an object in view, but also by the geometry and the lighting conditions. From an optimization perspective, manipulating color while holding the geometry and lighting fixed is in a sense the simplest attack. Attacking via changes to geometry and lighting requires back-propagating through rendering: the process of simulating image creation given lights, geometries, and a camera description.

We present the first adversarial geometry and adversarial lighting attacks for image classification. We consistently fool pre-trained networks by perturbing physically meaningful parameters controlling an object’s surface geometry and the environment lighting conditions. To achieve this, we propose a novel differentiable, physically-based renderer. Unlike the previous approach of Zeng et al. (2017), ours is orders of magnitude faster and realistic enough for correct classification by common pre-trained nets. The parameters of our renderer include explicit control of the surface geometry of the shape and coefficients of a reduced subspace spanning realistic lighting conditions. Leveraging existing attack strategies, we search for geometry and lighting parameters to fool classifiers. Our attacks generalize across multiple views. We demonstrate our success via qualitative and quantitative experiments on publicly available datasets of realistic 3D models. Finally, we show preliminary success using our adversarial attack to increase classifier robustness during training.

## 2 Related Work

Our work is built upon the premise that simulated or rendered images can participate in computer vision and machine learning on real-world tasks. Many previous works use rendered images during training, as rendering provides a theoretically infinite supply of input data Movshovitz-Attias et al. (2016); Chen et al. (2016); Varol et al. (2017); Su et al. (2015); Qiu and Yuille (2016); Johnson-Roberson et al. (2017). Our work complements these previous works. We demonstrate the potential danger lurking in misclassification of renderings due to subtle changes to geometry and lighting.

#### Differentiable Renderer

For our work, it is essential to have access to not just rendered images, but also partial derivatives of the rendering process with respect to the geometric and lighting parameters. Neither high-accuracy path-tracing renderers (e.g., Mitsuba Jakob (2018)) nor real-time renderers with hardware drivers (e.g., via OpenGL/DirectX) directly accommodate computing derivatives or automatic differentiation of pixel colors with respect to geometry and lighting variables. Loper and Black (2014) propose a fully differentiable CPU renderer using forward-mode automatic differentiation. Computing derivatives with their approach is orders of magnitude slower than our analytical differentiation. Liu et al. (2017) build a differentiable approximation of a renderer by treating rendering as a deterministic function that can be learned given sufficient examples of inputs and outputs. This not only introduces a host of hyper-parameter choices, but also unnecessary error due to imperfect learning. Kato et al. (2018) focus on whether a pixel is covered by a certain triangle during the rendering process and approximate the gradient by blurring this coverage function. Our approach uses analytical derivatives, which are more accurate than their approximation. Athalye and Sutskever (2017) treat rendering as a sparse matrix product, allowing efficient and accurate differentiation, but only with respect to surface colors, not geometry or environment lighting. Instead, we apply approximations to physically based rendering from first principles common in computer graphics. Away from (self-)silhouettes, partial derivatives of pixel colors with respect to our triangle-mesh geometry and spherical harmonic lighting parameters are easily and efficiently computed analytically. Compared to previous renderers, our proposal is differentiable, efficient, and realistic enough to elicit correct classifications from photograph-trained nets.

#### Adversarial Attacks

Szegedy et al. (2013) introduce the vulnerability of state-of-the-art deep neural nets by purposefully creating images with human-imperceptible yet misclassification-inducing noise. Since then, adversarial attacking has been a rich research area Akhtar and Mian (2018); Szegedy et al. (2013); Goodfellow et al. (2014); Rozsa et al. (2016); Kurakin et al. (2016a); Moosavi Dezfooli et al. (2016); Dong et al. (2017); Papernot et al. (2017); Moosavi-Dezfooli et al. (2017); Chen et al. (2017). Earlier works assume adversaries have direct access to the raw input pixel colors of the targeted network. Kurakin et al. (2016b) study the transferability of attacks to the physical world by printing then photographing the adversarial image. Athalye and Sutskever (2017) and Evtimov et al. (2017) propose extensions to non-planar (yet still fixed) geometry and multiple viewing angles respectively. These works only change the colors “painted” on a physical object, not the object’s geometry or its environment’s lighting. Zeng et al. (2017) generate adversarial examples by altering physical parameters using a rendering network Liu et al. (2017) trained to approximate the rendering function. Their approach, however, requires task-specific pretrained networks, and its renderings lack any semblance of realism. Ideally, an adversary would like to construct a physical, differentiable camera where backpropagation could – like an invisible hand – reach into the real world and perturb a shape’s physical geometry or rearrange the sun and clouds to affect the lighting. We take a step closer to this idealized physical adversarial camera by altering the remaining physical parameters in the rendering process – geometry and lighting – within a differentiable, physically-based renderer.

Although our proposed renderer is also differentiable with respect to the colors of an object, we ignore the well-studied, color-based adversarial examples. Original adversarial attacks considered human-imperceptible changes such as altering a single pixel Su et al. (2017) or adding small amounts of high-frequency noise Szegedy et al. (2013). Our adversarial geometries follow this philosophy (see the small-magnitude, high-frequency changes in Figure 1). Our adversarial lighting, on the other hand, affects the color of an object everywhere, so instead of a small pixel-color difference, we construct a subspace of realistic lighting conditions and exploit human perception’s lighting-invariance during object recognition to create plausible adversaries.

## 3 Background

Rendering is the process of generating a 2D image from a 3D scene by simulating the physics of light. Light sources in the scene emit photons that then interact with objects in the scene (Figure 2). At each interaction, photons are either reflected, transmitted or absorbed, changing trajectory and repeating until arriving at a sensor such as a camera. The material and shape of a surface influence such interactions. In particular, photons are reflected from the surface non-uniformly, with a distribution determined by the material properties and the surface geometry. The interaction is also influenced by the surface color, which determines whether photons of a certain wavelength are reflected or absorbed.

After photons interact with the 3D scene, some photons may arrive at the camera and contribute to image formation (Figure 3). Concretely, considering a camera and its associated image plane, each pixel’s final color is the result of an integral over all the emitted and reflected photon trajectories that eventually pass through that pixel. A faithful simulation of this process is costly Pharr et al. (2016), and techniques developed in computer graphics rely on simplified models to generate sufficiently accurate approximations.

### 3.1 Computer Rendering

Our renderer is based on a standard pipeline in real-time computer graphics. Specifically we utilize OpenGL as the rasterization engine, converting a 3D scene into pixels (Figure 4).

We represent the surface of each 3D object as a triangle mesh (Figure 5): a collection of connected triangles stored as a list of point positions and a list of triplets of triangle-corner indices. These are the so-called vertex list and face list. Each triangle is projected onto the image plane and rasterized into pixels. Only visible pixels are kept (the closest point on the object per pixel). This is efficiently conducted in parallel using OpenGL.
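As a minimal sketch of the vertex list and face list representation described above, the following NumPy snippet stores a unit square as two triangles and computes one unit normal per face. The array names `vertices` and `faces` and the helper `face_normals` are illustrative choices, not names from our implementation, and a counter-clockwise winding order is assumed.

```python
import numpy as np

# Vertex list: one 3D position per row (a unit square in the z = 0 plane).
vertices = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
])

# Face list: each row holds the vertex indices of one triangle's corners.
faces = np.array([
    [0, 1, 2],
    [0, 2, 3],
])

def face_normals(vertices, faces):
    """Return one unit normal per triangle (counter-clockwise winding assumed)."""
    a = vertices[faces[:, 0]]
    b = vertices[faces[:, 1]]
    c = vertices[faces[:, 2]]
    n = np.cross(b - a, c - a)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

normals = face_normals(vertices, faces)
```

Both triangles lie in the same plane, so they share the same normal along the positive z axis.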

The color of the 3D object is encoded as a standard image (Figure 5). Each vertex of the triangle mesh is mapped to a 2D position on the color image. During rasterization, the color of each pixel is determined via barycentric interpolation of the 2D image positions of the containing triangle’s corners, and then a lookup in the image. In this way, even a low-resolution geometry can have a high-resolution color signal.
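The lookup described above can be sketched as follows, assuming a single screen-space triangle, nearest-neighbour texture sampling, and texture coordinates in the unit square with the origin at the top-left texel. All names (`barycentric_coords`, `sample_texture`, `corners_xy`, `corners_uv`) are hypothetical, and an actual rasterizer would perform this step on the GPU.

```python
import numpy as np

def barycentric_coords(p, a, b, c):
    """Barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    w1 = (d11 * d20 - d01 * d21) / denom
    w2 = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - w1 - w2, w1, w2])

def sample_texture(texture, uv):
    """Nearest-neighbour lookup; uv lies in [0, 1]^2 with (0, 0) at the top-left."""
    h, w = texture.shape[:2]
    col = min(int(uv[0] * w), w - 1)
    row = min(int(uv[1] * h), h - 1)
    return texture[row, col]

# Hypothetical screen-space triangle and the texture coordinates of its corners.
corners_xy = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
corners_uv = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
texture = np.array([[[255, 0, 0], [0, 255, 0]],
                    [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# Interpolate the corner texture coordinates at one pixel, then look up its color.
weights = barycentric_coords(np.array([1.0, 1.0]), *corners_xy)
uv = weights @ corners_uv
color = sample_texture(texture, uv)
```

Because the color comes from the image rather than from the vertices, the texture resolution can be chosen independently of the mesh resolution.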

In general, the surface material dictates the distribution of outgoing light in each direction as a function of the incoming light direction. In our case, we consider opaque (non-transparent) diffuse material where photons are reflected uniformly in all directions (Figure 6), thus the color of an object is independent of the viewing direction.

To compute the color at a point on a surface, we need to model the photons (lighting) that are incident to the surface. A common approach is to consider photons that land at a point as a spherical function where the position represents the direction of the incoming light and the value represents the color intensity of the light.

So far we have discussed the fundamental components of computer rendering: geometry, color, material, and lighting. In Section 4, we discuss the design choices of our differentiable renderer. In Section 5, we derive the analytical derivatives with respect to geometry and lighting parameters and present our physical adversarial attack frameworks. Section 6 and Section 7 evaluate our approach quantitatively in many scenarios such as adversarial training.

## 4 Physically-Based Differentiable Renderer

Adversarial examples are typically generated by defining a cost function over the space of images that enforces some intuition of what failure should look like, typically using variants of gradient descent where the gradient is accessible by differentiating through networks Szegedy et al. (2013); Goodfellow et al. (2014); Rozsa et al. (2016); Kurakin et al. (2016a); Moosavi Dezfooli et al. (2016); Dong et al. (2017).

The choices of cost function include increasing the cross-entropy loss of the correct class Goodfellow et al. (2014), decreasing the cross-entropy loss of the least-likely class Kurakin et al. (2016a), a combination of cross-entropies Moosavi Dezfooli et al. (2016), and many more Szegedy et al. (2013); Rozsa et al. (2016); Dong et al. (2017); Tramèr et al. (2017). We use a combination of cross-entropies, which gives users the flexibility to choose between untargeted and targeted attacks by specifying different sets of labels $(L_d, L_i)$. In particular,

$$\text{Loss}(f(I), L_d, L_i) = -\text{CrossEntropy}(f(I), L_d) + \text{CrossEntropy}(f(I), L_i), \tag{1}$$

where $I$ is the image, $f(I)$ is the output of the classifier, $L_d$ is the set of labels whose predicted probabilities the user wants to decrease, and $L_i$ is the set of labels whose predicted probabilities the user wants to increase. In our experiments, $L_d$ is the correct class and $L_i$ is either ignored or chosen manually.
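One plausible reading of this combination of cross-entropies is sketched below for a single label to decrease and an optional single label to increase. The function names and the restriction to one label per set are simplifying assumptions; the classifier is abstracted away as a vector of logits.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1D vector of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs, label):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -np.log(probs[label])

def attack_loss(logits, decrease_label, increase_label=None):
    """Loss to minimise: lowers the probability of decrease_label and,
    if given, raises the probability of increase_label."""
    probs = softmax(logits)
    loss = -cross_entropy(probs, decrease_label)
    if increase_label is not None:
        loss += cross_entropy(probs, increase_label)
    return loss

# Untargeted: the loss is higher while the classifier is confident in class 0.
confident = attack_loss(np.array([2.0, 0.0, 0.0]), decrease_label=0)
uncertain = attack_loss(np.array([0.0, 0.0, 0.0]), decrease_label=0)

# Targeted: the loss is lower once the prediction has moved to target class 1.
before = attack_loss(np.array([2.0, 0.0, 0.0]), decrease_label=0, increase_label=1)
after = attack_loss(np.array([0.0, 2.0, 0.0]), decrease_label=0, increase_label=1)
```

Leaving `increase_label` unset gives an untargeted attack, while setting it gives a targeted one.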

Physical adversarial attacks compute adversarial examples by perturbing lighting and geometry parameters, and thus require derivatives with respect to physical parameters. As described in Section 3, rendering is hard to differentiate and costly to evaluate, but we will show that with only a few assumptions, we can analytically differentiate the rendering process.

### 4.1 Assumptions for Differentiability

To compute physical adversaries, we need a differentiable renderer to propagate image gradients to physical parameters in order to compute adversarial lighting and geometry. At the same time, the rendering quality must be faithful enough that a model trained on real photographs can recognize the rendered images. The rendering quality is crucial for training models that are robust to real photos (see Section 7).

We develop our differentiable renderer with three assumptions common in real-time rendering. The first assumption is a diffuse material, which reflects light uniformly in all directions. The second assumption, local illumination, only considers light that bounces directly from the light source to the camera. Lastly, we assume light sources are far away from the scene, allowing us to represent lighting in the entire scene with one spherical function (detailed rationale is provided in Appendix A). The three assumptions simplify the integral required for rendering and allow us to represent lighting in terms of spherical harmonics, a set of orthonormal basis functions on the sphere that provide an analogue to the Fourier transform. Spherical harmonics are the key ingredient for accelerating evaluation and developing a differentiable renderer. Please refer to Appendix B for the derivation.

Our differentiable renderer models the physical processes of underlying light transport in 3D scenes with a spherical harmonics formulation that captures realistic lighting effects and admits analytic derivatives. We are orders of magnitude faster than the state-of-the-art fully differentiable renderer, OpenDR Loper and Black (2014), which is based on automatic differentiation (see Figure 7). In addition, our approach scales to problems with more than 100,000 variables, whereas OpenDR runs out of memory for problems with more than 3,500 variables. Compared to the previous physical adversarial attacks of Zeng et al. (2017), our method is 70× faster and generates images of much higher quality.

## 5 Physically-Based Adversarial Attack Framework

Our differentiable renderer provides derivatives of the image with respect to physical parameters, enabling physically-based attacks by back-propagating image gradients. Rendering can be viewed as a function $R$ that takes scene parameters and outputs an image $I = R(U, V, W)$, where $U$ is the lighting, $V$ is the geometry, and $W$ is the color of the shape. In this section, we show how to differentiate through $R$ and propose two frameworks: adversarial lighting and adversarial geometry.

### 5.1 Adversarial Lighting

Adversarial lighting generates adversarial examples by changing the lighting parameters $U$, parameterized using spherical harmonics Green (2003). With our differentiable renderer, we can compute $\partial I / \partial U$, the derivative of the rendered image $I$ with respect to $U$, analytically (the derivation is provided in Appendix B.4) and apply the chain rule to update $U$.

Our approach enjoys the efficiency of spherical harmonics and is 70× faster than previous work Zeng et al. (2017) (see Section 6). In addition, spherical harmonics act as a constraint that prevents unrealistic lighting, because natural lighting in everyday life is a low-frequency signal. For instance, the rendering of diffuse materials can be approximated with only 1% pixel intensity error by the first 2 orders of spherical harmonics Ramamoorthi and Hanrahan (2001). Because a computer can only store a finite number of basis functions, spherical harmonics lighting implicitly filters out high-frequency, unrealistic lighting.
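To illustrate why the derivative with respect to lighting is available in closed form, the sketch below uses the nine-coefficient (order 2) irradiance formula of Ramamoorthi and Hanrahan (2001) as a stand-in for our renderer, which uses more bands and is derived in Appendix B. It handles a single color channel and a single surface normal; the function names and the coefficient ordering are assumptions made for this example.

```python
import numpy as np

# Constants of the order-2 irradiance formula (Ramamoorthi and Hanrahan, 2001).
C1, C2, C3, C4, C5 = 0.429043, 0.511664, 0.743125, 0.886227, 0.247708

def irradiance_basis(normal):
    """Weights b(n) such that the irradiance is E(n) = b(n) . L.

    L is ordered [L00, L1-1, L10, L11, L2-2, L2-1, L20, L21, L22].
    """
    x, y, z = normal / np.linalg.norm(normal)
    return np.array([
        C4,
        2.0 * C2 * y,
        2.0 * C2 * z,
        2.0 * C2 * x,
        2.0 * C1 * x * y,
        2.0 * C1 * y * z,
        C3 * z * z - C5,
        2.0 * C1 * x * z,
        C1 * (x * x - y * y),
    ])

def irradiance(normal, lighting):
    """Diffuse irradiance at a surface point with the given normal."""
    return irradiance_basis(normal) @ lighting

def irradiance_grad_lighting(normal):
    """dE/dL: E is linear in L, so the gradient is the basis itself."""
    return irradiance_basis(normal)

# A constant (ambient-only) environment lights every normal equally.
ambient = np.zeros(9)
ambient[0] = 1.0
up = np.array([0.0, 0.0, 1.0])
ambient_irradiance = irradiance(up, ambient)
```

Since the shading is linear in the lighting coefficients, no numerical differentiation is needed for this part of the chain rule.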

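We use spherical harmonics lighting up to band 7, with the realistic Eucalyptus Grove lighting provided in Ramamoorthi and Hanrahan (2001) as our initial lighting condition. In Figure 8, we perform adversarial lighting attacks in both single-view and multi-view cases, and both attacks successfully fool the classifier. The multi-view version optimizes the sum of the cost functions for each view, so the gradient can be computed as a sum over all camera views:

$$\frac{\partial\,\text{Loss}}{\partial U} = \sum_{k \in \text{views}} \frac{\partial\,\text{Loss}_k}{\partial I_k}\,\frac{\partial I_k}{\partial U},$$

where $U$ denotes the lighting parameters, $I_k$ is the image rendered from camera view $k$, and $\text{Loss}_k$ is the cost function evaluated on that view.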
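The multi-view sum can be sketched as follows, assuming each view's rendering is linear in the lighting coefficients (as it is for diffuse spherical harmonics shading) and using a quadratic image loss as a stand-in for the classifier loss. The matrices in `view_bases`, the `targets`, and the helper names are hypothetical.

```python
import numpy as np

def multiview_gradient(lighting, view_bases, loss_grads):
    """Sum over views of (dI_k/dU)^T (dLoss_k/dI_k), with I_k = B_k U."""
    total = np.zeros_like(lighting)
    for basis, loss_grad in zip(view_bases, loss_grads):
        image = basis @ lighting            # rendered pixels of view k
        total += basis.T @ loss_grad(image)
    return total

def make_quadratic_loss(target):
    """Stand-in per-view loss and its gradient with respect to the image."""
    value = lambda image: 0.5 * np.sum((image - target) ** 2)
    grad = lambda image: image - target
    return value, grad

rng = np.random.default_rng(0)
lighting = rng.normal(size=9)
view_bases = [rng.normal(size=(5, 9)) for _ in range(3)]
targets = [rng.normal(size=5) for _ in range(3)]
losses = [make_quadratic_loss(t) for t in targets]

def total_loss(u):
    """Sum of the per-view losses for lighting coefficients u."""
    return sum(value(basis @ u) for basis, (value, _) in zip(view_bases, losses))

gradient = multiview_gradient(lighting, view_bases, [g for _, g in losses])
```

In an actual attack, the per-view image gradient would come from back-propagating through the classifier rather than from a quadratic loss.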

#### Outdoor Lighting

We show that adversarial lighting is flexible enough to be constrained to a further subspace of realistic lighting: outdoor lighting conditions governed by sunlight and weather. In the inset, we compute adversarial lights over the space of skylights by applying the chain rule once more to the Preetham skylight parameters Preetham et al. (1999); Habel et al. (2008). Details on taking the derivatives are provided in Appendix C. Note that the adversarial skylight has only three parameters; the low number of degrees of freedom makes it more difficult to find adversaries.

### 5.2 Adversarial Geometry

Adversarial geometry computes adversarial examples by perturbing the surface points of an object. The 3D object is represented as a triangle mesh whose surface points are the vertex positions $V$; these determine the surface normals $N$, which in turn determine the shading. We can compute adversarial geometries by applying the chain rule:

$$\frac{\partial\,\text{Loss}}{\partial V} = \frac{\partial\,\text{Loss}}{\partial I}\,\frac{\partial I}{\partial N}\,\frac{\partial N}{\partial V}. \tag{2}$$

Here $\partial\,\text{Loss}/\partial I$ can be obtained by differentiating the network, and $\partial I/\partial N$ can be computed analytically via the derivation detailed in Appendix D. We represent each triangle with one face normal (flat shading), so $\partial N/\partial V$ can also be computed analytically. In particular, the Jacobian of a face normal with respect to one of its corner vertices has a closed form in terms of the face normal and the height vector: the shortest vector to the corner from the line of the opposite edge.

With these derivatives, Figure 1 and Figure 9 show that small adversarial vertex perturbations can fool deep neural networks in both single-view and multi-view cases. Note that we upsample meshes to 10K vertices as a preprocessing step to increase the degrees of freedom for perturbations.
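The Jacobian of a face normal can be checked numerically with the sketch below. Under the convention that the height vector $h$ points from the line of the opposite edge to the corner, a first-order expansion of the unit normal $n$ gives $-h\,n^\top / \lVert h \rVert^2$; reversing the direction of $h$ flips the sign. The helper names and the example triangle are assumptions made for this illustration.

```python
import numpy as np

def unit_normal(a, b, c):
    """Unit normal of triangle (a, b, c)."""
    n = np.cross(b - a, c - a)
    return n / np.linalg.norm(n)

def height_vector(a, b, c):
    """Shortest vector from the line through b and c to the corner a."""
    e = (c - b) / np.linalg.norm(c - b)
    return (a - b) - ((a - b) @ e) * e

def normal_jacobian(a, b, c):
    """Analytic 3x3 Jacobian of the unit face normal with respect to corner a."""
    n = unit_normal(a, b, c)
    h = height_vector(a, b, c)
    return -np.outer(h, n) / (h @ h)

def finite_difference_jacobian(a, b, c, eps=1e-6):
    """Central-difference approximation of the same Jacobian."""
    jac = np.zeros((3, 3))
    for k in range(3):
        step = np.zeros(3)
        step[k] = eps
        jac[:, k] = (unit_normal(a + step, b, c) - unit_normal(a - step, b, c)) / (2 * eps)
    return jac

a = np.array([0.2, 1.0, 0.3])
b = np.array([0.0, 0.0, 0.0])
c = np.array([1.0, 0.1, -0.2])
analytic = normal_jacobian(a, b, c)
numeric = finite_difference_jacobian(a, b, c)
```

Moving the corner within the plane of the triangle leaves the normal unchanged, which is why the Jacobian only responds to the component of the displacement along $n$.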

#### Deep Optical Illusion

We apply adversarial geometry to generate optical illusions in which the same object is classified differently by the same deep network from different views. This can be achieved by specifying different target classes in the loss function (Equation 1) for different viewpoints. In Figure 10, we set the target label in Equation 1 (the label whose predicted probability is increased) to a dog for one view and a cat for the other in order to generate such an adversarial geometry.

## 6 Evaluation

We evaluate our rendering quality by whether our rendered images are recognizable by models trained on real photographs. We collect 75 high-quality textured 3D shapes from cgtrader.com and turbosquid.com for this evaluation. We augment the shapes by changing the field of view, backgrounds, and viewing directions, then keep the parameter settings whose renderings are correctly classified by a ResNet-101 pretrained on ImageNet (see Appendix E for details about data augmentation). Figure 11 shows the histogram of model confidence on the correct labels over 10,000 correctly classified rendered images from our differentiable renderer. The confidence is computed using the softmax function, and the results show that our rendering quality is faithful enough to model realistic environments.

#### Quantitative Evaluation

We evaluate our proposed adversarial attacks quantitatively on 10,000 examples generated from our dataset. In the unconstrained cases (Section 5), we can always fool the networks. Note that we do not constrain the norm of the perturbation because, in reality, light and shape changes are quite large and perceivable. However, fixing the maximum perturbation is useful for studying decision boundaries and robustness against attacks Fawzi et al. (2016). In Figure 12, we evaluate how many examples we can fool within a small amount of perturbation in the lighting/geometry parameters. In particular, we compare the performance of random and adversarial perturbations under two constraints: a maximum perturbation of 0.1 for each spherical harmonics coefficient and a maximum vertex displacement of 0.002 along each axis. We show that state-of-the-art classifiers are not robust to changes of lighting and geometry even within small perturbations.
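The text above does not spell out how the maximum-perturbation bound is enforced. One common option, shown here purely as an assumption, is to take a gradient step and then clip each parameter back into a box around its original value. The function name, step size, and example gradient are hypothetical.

```python
import numpy as np

def constrained_step(params, original, gradient, step_size, max_perturbation):
    """One gradient-descent step, then clip to a box around the original values."""
    updated = params - step_size * gradient
    delta = np.clip(updated - original, -max_perturbation, max_perturbation)
    return original + delta

# Nine spherical harmonics coefficients with the 0.1 bound used in Figure 12.
original = np.zeros(9)
gradient = np.array([5.0, -5.0, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
stepped = constrained_step(original, original, gradient,
                           step_size=1.0, max_perturbation=0.1)

# A random baseline drawn from the same box, for comparison with the attack.
rng = np.random.default_rng(0)
random_perturbed = original + rng.uniform(-0.1, 0.1, size=original.shape)
```

The same helper applies to vertex displacements by passing the vertex array and a bound of 0.002.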

#### Black-box Attacks

Szegedy et al. (2013) show that image adversaries are transferable across models. Similarly, we test 5,000 ResNet physical adversaries on unseen networks and show that physical adversaries also transfer across models (see Table 1), including AlexNet Krizhevsky et al. (2012), DenseNet Huang et al. (2017), SqueezeNet Iandola et al. (2016), and VGG Simonyan and Zisserman (2014).

#### Multiview Evaluation

In Section 5, we show that adversarial lighting and geometry can be generalized to multiple views in the white-box scenario. We are also interested in how well they generalize to unseen views. In Table 2, we randomly sample 500 correctly classified views for a given shape and perform our adversarial attacks on only a subset of the views. We then evaluate how the adversarial lights/shapes perform on all the views. The results show that adversarial lights generalize well to unseen views, whereas adversarial shapes generalize less well.

| | Alex | VGG | Squeeze | Dense |
|---|---|---|---|---|
| Lighting | 81.2% | 65.0% | 78.6% | 43.5% |
| Geometry | 70.3% | 58.9% | 71.1% | 40.1% |

| #Views | 0 | 1 | 5 |
|---|---|---|---|
| Lighting | 0.0% | 29.4% | 64.2% |
| Geometry | 0.0% | 0.6% | 3.6% |

#### Runtime

Our differentiable renderer approach is 70× faster than the previous approach of Zeng et al. (2017), due to the efficiency of OpenGL and the spherical harmonics representation. We take less than 10 seconds to find one adversary, whereas Zeng et al. (2017) require 12 minutes. Figure 13 presents our runtime per iteration for computing the derivatives; an adversary normally requires fewer than 10 iterations. We evaluate our serial Python implementation (except for the rendering, which is implemented using OpenGL Shreiner and Group (2009)) on an Intel Xeon 3.5GHz CPU with 64GB of RAM and an NVidia GeForce GTX 1080. Due to the CPU implementation, our runtime strongly depends on the image resolution, specifically the number of pixels for which derivatives must be computed.

## 7 Adversarial Training Against Real Photographs

Adversarial training injects adversarial examples into training with the aim of increasing the robustness of machine learning models. Typically, adversarial training is evaluated against computer-generated adversarial images Kurakin et al. (2016a); Madry et al. (2017); Tramèr et al. (2017). In contrast to the majority of the literature, we evaluate adversarial training against real photos (i.e., captured using cameras) rather than computer-generated adversarial images. This evaluation method is motivated by a long-standing goal: using adversarial examples to make models robust to real photos, and not just synthetic images. We take first steps towards this objective, introducing a high-performance, differentiable renderer and evaluating it with adversarial training against real photographs.

#### Training

We perform adversarial training on CIFAR-100 Krizhevsky and Hinton (2009) with WideResNet (16 layers, widening factor 4) Zagoruyko and Komodakis (2016) using our adversarial lighting framework. We use a standard adversarial training method which adds a fixed number of adversarial examples to the training data at each epoch Kurakin et al. (2016a). In our experiment, we have three training cases: (1) CIFAR-100, (2) CIFAR-100 + 100 rendered images of an orange under random lighting, (3) CIFAR-100 + 100 rendered images of an orange under adversarial lighting. Compared to the accuracy reported in Zagoruyko and Komodakis (2016), the WideResNets trained in the three cases all have similar performance on the CIFAR-100 test set.

#### Testing

We created a test set of real photos, captured using a controlled lighting and camera setup: we photographed oranges under different lighting conditions. Lighting conditions were generated using an LG PH550 projector, and we captured our photos using a calibrated Prosilica GT 1920 camera. Ideally, we would like them to be real photos under adversarial lighting, but our hardware lighting setup only generates lighting from a fixed solid angle of directions about our objects, as opposed to a fully spherical adversarial lighting environment. Figure 14 shows samples from the 500 real photographs. We evaluate the robustness of models via the test accuracy. In particular, the average prediction accuracies over five trained WideResNets on our test data under the three training cases are (1) 4.6%, (2) 40.4%, and (3) 65.8%. This shows that training on high-quality rendered adversarial images improves robustness on real photos. It also suggests a promising direction: models trained adversarially using high-quality rendered images should become increasingly robust to real-world photographs as the underlying rendering model approaches an increasingly accurate simulation of the physics of real-world light transport.

## 8 Limitations & Future Work

We generalize adversarial attacks to physical parameters, including lighting and geometry, by attaching a differentiable renderer. These adversaries have physical interpretations and expose vulnerabilities of state-of-the-art classifiers in real-world situations. We also make a first attempt at showing that synthetic adversarial examples have the potential to increase robustness to real photos. Our adversarial geometry and lighting are a subset of all possible physically-based attacks. Mapping gradients to other parameters yields a new class of physical adversarial examples. For instance, adding different material models to the renderer would create adversarial materials. Beyond computing physical adversarial examples, our differentiable renderer can be viewed as machinery for generalizing image-based machine learning techniques to variations of rendering parameters.

## References

- Akhtar and Mian [2018] Naveed Akhtar and Ajmal S. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018.
- Athalye and Sutskever [2017] Anish Athalye and Ilya Sutskever. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.
- Basri and Jacobs [2003] Ronen Basri and David W Jacobs. Lambertian reflectance and linear subspaces. IEEE transactions on pattern analysis and machine intelligence, 25(2):218–233, 2003.
- Chen et al. [2017] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
- Chen et al. [2016] Wenzheng Chen, Huan Wang, Yangyan Li, Hao Su, Zhenhua Wang, Changhe Tu, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. Synthesizing training images for boosting human 3d pose estimation. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 479–488. IEEE, 2016.
- Dong et al. [2017] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Xiaolin Hu, Jianguo Li, and Jun Zhu. Boosting adversarial attacks with momentum. arXiv preprint arXiv:1710.06081, 2017.
- Dunster [2010] TM Dunster. Legendre and related functions. NIST handbook of mathematical functions, pages 351–381, 2010.
- Evtimov et al. [2017] Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. Robust physical-world attacks on machine learning models. arXiv preprint arXiv:1707.08945, 2017.
- Fawzi et al. [2016] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632–1640, 2016.
- Goodfellow et al. [2014] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Green [2003] Robin Green. Spherical harmonic lighting: The gritty details. In Archives of the Game Developers Conference, volume 56, page 4, 2003.
- Habel et al. [2008] Ralf Habel, Bogdan Mustata, and Michael Wimmer. Efficient spherical harmonics lighting with the preetham skylight model. In Eurographics (Short Papers), pages 119–122, 2008.
- Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261–2269, 2017. doi: 10.1109/CVPR.2017.243. URL https://doi.org/10.1109/CVPR.2017.243.
- Iandola et al. [2016] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
- Jakob [2018] Wenzel Jakob. Mitsuba physically based renderer, 2018. URL http://www.mitsuba-renderer.org.
- Johnson-Roberson et al. [2017] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 746–753. IEEE, 2017.
- Kajiya [1986] James T Kajiya. The rendering equation. In ACM Siggraph Computer Graphics, volume 20, pages 143–150. ACM, 1986.
- Kato et al. [2018] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2018.
- Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
- Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Kurakin et al. [2016a] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016a.
- Kurakin et al. [2016b] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In Proc. ICLR, 2016b.
- Liu et al. [2017] Guilin Liu, Duygu Ceylan, Ersin Yumer, Jimei Yang, and Jyh-Ming Lien. Material editing using a physically based rendering network. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2280–2288. IEEE, 2017.
- Loper and Black [2014] Matthew M Loper and Michael J Black. OpenDR: An approximate differentiable renderer. In European Conference on Computer Vision, pages 154–169. Springer, 2014.
- Madry et al. [2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Miller [1994] Gavin Miller. Efficient algorithms for local and global accessibility shading. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’94, pages 319–326, New York, NY, USA, 1994. ACM. ISBN 0-89791-667-0. doi: 10.1145/192161.192244. URL http://doi.acm.org/10.1145/192161.192244.
- Moosavi Dezfooli et al. [2016] Seyed Mohsen Moosavi Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number EPFL-CONF-218057, 2016.
- Moosavi-Dezfooli et al. [2017] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 86–94, 2017.
- Movshovitz-Attias et al. [2016] Yair Movshovitz-Attias, Takeo Kanade, and Yaser Sheikh. How useful is photo-realistic rendering for visual learning? In European Conference on Computer Vision, pages 202–217. Springer, 2016.
- Papernot et al. [2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.
- Pharr et al. [2016] Matt Pharr, Wenzel Jakob, and Greg Humphreys. Physically based rendering: From theory to implementation. Morgan Kaufmann, 2016.
- Preetham et al. [1999] Arcot J Preetham, Peter Shirley, and Brian Smits. A practical analytic model for daylight. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 91–100. ACM Press/Addison-Wesley Publishing Co., 1999.
- Qiu and Yuille [2016] Weichao Qiu and Alan Yuille. UnrealCV: Connecting computer vision to unreal engine. In European Conference on Computer Vision, pages 909–916. Springer, 2016.
- Ramamoorthi and Hanrahan [2001] Ravi Ramamoorthi and Pat Hanrahan. An efficient representation for irradiance environment maps. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 497–500. ACM, 2001.
- Rozsa et al. [2016] Andras Rozsa, Ethan M Rudd, and Terrance E Boult. Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 25–32, 2016.
- Shreiner and Group [2009] Dave Shreiner and The Khronos OpenGL ARB Working Group. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Versions 3.0 and 3.1. Addison-Wesley Professional, 7th edition, 2009. ISBN 0321552628, 9780321552624.
- Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Sloan et al. [2005] Peter-Pike Sloan, Ben Luna, and John Snyder. Local, deformable precomputed radiance transfer. In ACM Transactions on Graphics (TOG), volume 24, pages 1216–1224. ACM, 2005.
- Su et al. [2015] Hao Su, Charles R Qi, Yangyan Li, and Leonidas J Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3d model views. In Proc. ICCV, pages 2686–2694, 2015.
- Su et al. [2017] Jiawei Su, Danilo Vasconcellos Vargas, and Sakurai Kouichi. One pixel attack for fooling deep neural networks. arXiv preprint arXiv:1710.08864, 2017.
- Szegedy et al. [2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Tramèr et al. [2017] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
- Varol et al. [2017] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017.
- Williams [1978] Lance Williams. Casting curved shadows on curved surfaces. In ACM Siggraph Computer Graphics, volume 12, pages 270–274. ACM, 1978.
- Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.
- Zeng et al. [2017] Xiaohui Zeng, Chenxi Liu, Weichao Qiu, Lingxi Xie, Yu-Wing Tai, Chi Keung Tang, and Alan L Yuille. Adversarial attacks beyond the image space. arXiv preprint arXiv:1711.07183, 2017.

Adversarial Geometry and Lighting using a Differentiable Renderer Supplementary Material

## Appendix A Physically Based Rendering

Physically based rendering (PBR) seeks to model the flow of light, typically under the assumption that there exists a collection of light sources that generate light, a camera that receives this light, and a scene that modulates the flow of light between the light sources and the camera Pharr et al. [2016].

The computer graphics community has dedicated decades of effort to developing methods and technologies that enable PBR to synthesize photorealistic images under a large gamut of performance requirements. Much of this work is focused on approximating the cherished rendering equation Kajiya [1986], which describes the propagation of light through a point in space. If we let $L_o$ be the output radiance, $\mathbf{p}$ be the point in space, $\mathbf{n}$ be the surface normal at that point, $\omega_o$ be the output direction, $L_e$ be the emitted radiance, $L_i$ be the incoming radiance, $\omega_i$ be the incoming angle, and $f_r$ be the way light is reflected off the material at that given point in space, we have:

$$L_o(\mathbf{p}, \omega_o) = L_e(\mathbf{p}, \omega_o) + \int_{S^2} f_r(\mathbf{p}, \omega_i, \omega_o)\, L_i(\mathbf{p}, \omega_i)\, (\omega_i \cdot \mathbf{n})\, d\omega_i$$

From now on we will ignore the emission term as it is not pertinent to our discussion.
Because of the speed of light, what we perceive is not the propagation of light at an instant, but the steady state solution to the rendering equation evaluated at every point in space.
This is clearly intractable and mainly serves as a reference against which a plethora of assumptions and simplifications are made in order to make it numerically tractable.
Many of these methods focus on ignoring light with nominal effects on the final rendered image by way of assumptions on how light travels.
For instance, it is usually assumed that light does not interact with air in a substantial way, which is usually stated as assuming that the space between objects is a vacuum; this limits the interactions of light to the objects in a scene.
Another common assumption is that light does not penetrate objects, which makes it difficult to render objects like milk and human skin (this is why simple renderers make these sorts of objects look like plastic). Under this assumption, the complexity of light propagation can be described in terms of light bouncing off objects’ surfaces.

### a.1 Local Illumination

It is common to see assumptions that limit the number of bounces light is allowed. In our case we chose to assume that the steady state is sufficiently approximated by an extremely low number of iterations: one. This means that we model the lighting of a point in space using only the light sent to it directly by light sources. Working with such a strong simplification does, of course, lead to a few artifacts. For instance, light occluded by other objects is ignored, so shadows disappear, and auxiliary techniques are usually employed to evaluate shadows Williams [1978], Miller [1994].

When this assumption is coupled with a camera we approach what is used in standard rasterization systems such as OpenGL Shreiner and Group [2009], which is what we use. These systems compute the illumination of a single pixel by determining the fragment of an object visible through that pixel and only compute the light that traverses directly from the light sources, through that fragment, to that pixel. The lighting of a fragment is therefore determined by a point and the surface normal at that point, so we write the fragment’s radiance as $R(\mathbf{p}, \mathbf{n}, \omega_o) = L_o(\mathbf{p}, \omega_o)$:

$$R(\mathbf{p}, \mathbf{n}, \omega_o) = \int_{S^2} f_r(\mathbf{p}, \omega_i, \omega_o)\, L_i(\mathbf{p}, \omega_i)\, (\omega_i \cdot \mathbf{n})\, d\omega_i \tag{3}$$

### a.2 Lambertian Material

Each point on an object has a model $f_r$ approximating the transfer of incoming light to a given output direction, which is usually called the material. On a single object the material parameters may vary quite a bit, and the correspondence between points and material parameters is usually called the texture map, which forms the texture of an object. There exists a wide gamut of material models, from mirror materials that transport light from a single input direction to a single output direction, to materials that reflect light evenly in all directions, to materials like brushed metal that reflect differently along different angles. For the sake of this document we only consider diffuse materials, also called Lambertian materials, where we assume that incoming light is reflected uniformly, i.e., $f_r$ is a constant function with respect to angle, which we denote $f_r(\mathbf{p}, \omega_i, \omega_o) = \rho(\mathbf{p})$:

$$R(\mathbf{p}, \mathbf{n}) = \rho(\mathbf{p}) \int_{\Omega(\mathbf{n})} L_i(\mathbf{p}, \omega_i)\, (\omega_i \cdot \mathbf{n})\, d\omega_i \tag{4}$$

The function $\rho$ is usually called the albedo, which can be perceived as the color on the surface of a diffuse material, and we reduce our integration domain to the upper hemisphere $\Omega(\mathbf{n})$ in order to model light not bouncing through objects. Furthermore, since the only $\omega$ and $L$ that remain are the incoming ones, we can now suppress the “incoming” in our notation and just use $\omega$ and $L$ respectively.

### a.3 Environment Mapping

The illumination of static, distant objects such as the ground, the sky, or mountains does not change in any noticeable fashion when objects in a scene are moved around, so $L$ can be written entirely in terms of $\mathbf{p}$ and $\omega$. If their illumination is constant it seems prudent to pre-compute or cache their contributions to the illumination of a scene. This is what is usually called environment mapping, and environment maps fit in the rendering equation as a representation of the total lighting of a scene, i.e., the total incoming radiance $L$. Because the environment is distant, it is common to also assume that the position of the object receiving light from an environment map does not matter, so $L$ simplifies to be independent of position:

$$R(\mathbf{p}, \mathbf{n}) = \rho(\mathbf{p}) \int_{\Omega(\mathbf{n})} L(\omega)\, (\omega \cdot \mathbf{n})\, d\omega \tag{5}$$
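As a brute-force reference for the Lambertian environment-map integral, the following minimal sketch evaluates it with a midpoint rule on a latitude-longitude grid. The function name `shade_bruteforce`, the callable `env` (mapping unit directions to scalar radiance), and the grid resolution are illustrative assumptions, not part of our renderer; the upper hemisphere is implemented by clamping the cosine term to zero.

```python
import numpy as np

def shade_bruteforce(rho, env, n, n_theta=256, n_phi=512):
    """Midpoint-rule estimate of rho * integral of env(w) * max(w . n, 0) dw."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    # Cell-centered samples in zenith angle theta and azimuth phi.
    theta = (np.arange(n_theta) + 0.5) * np.pi / n_theta
    phi = (np.arange(n_phi) + 0.5) * 2.0 * np.pi / n_phi
    T, P = np.meshgrid(theta, phi, indexing="ij")
    # Unit directions on the sphere, shape (n_theta, n_phi, 3).
    w = np.stack([np.sin(T) * np.cos(P), np.sin(T) * np.sin(P), np.cos(T)], axis=-1)
    # Clamping restricts the integral to the hemisphere around n.
    cos_term = np.clip(w @ n, 0.0, None)
    # Solid angle of each grid cell: sin(theta) dtheta dphi.
    d_omega = np.sin(T) * (np.pi / n_theta) * (2.0 * np.pi / n_phi)
    return rho * np.sum(env(w) * cos_term * d_omega)

def white_env(w):
    """Hypothetical environment: unit radiance from every direction."""
    return np.ones(w.shape[:-1])

radiance = shade_bruteforce(0.5, white_env, [0.0, 0.0, 1.0])
```

Under uniform unit lighting the integral of the cosine over a hemisphere is $\pi$, so `radiance` should be close to $0.5\pi$ regardless of the normal. This kind of quadrature is far too slow for rendering, which motivates the precomputation discussed next.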

### a.4 Spherical Harmonics

Despite all of our simplifications the inner integral is still a fairly generic function over $S^2$. Many techniques for numerically integrating the rendering equation have emerged in the computer graphics community, and we choose one which enables us to perform pre-computation and select our desired accuracy in terms of frequencies: spherical harmonics. Spherical harmonics are a basis on $S^2$ so, given a spherical harmonics expansion of the integrand, the evaluation of the above integral can be reduced to a weighted product of coefficients. This particular basis is chosen because it acts as a sort of Fourier basis for functions on the sphere, so each basis function is associated with a frequency, which leads to a convenient multi-resolution structure. In fact, the rendering of diffuse objects under distant lighting can be approximated to about 99% accuracy by just the first few spherical harmonics bases Ramamoorthi and Hanrahan [2001].

We will only need to note that the spherical harmonics bases $Y_l^m$ are denoted with the subscript $l$ as the frequency and that there are $2l+1$ of them per frequency, denoted by superscripts $m$ between $-l$ and $l$ inclusive. For further details on them, see Appendix B.

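If we approximate a function $f$ in terms of spherical harmonics coefficients, $f \approx \sum_{l,m} f_{l,m} Y_l^m$, the integral can be precomputed:

$$\int_{S^2} f \approx \sum_{l,m} f_{l,m} \int_{S^2} Y_l^m \tag{6}$$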

### a.5 Our Rendering Model

We developed our renderer using OpenGL as our rasterization engine, where the surface of each shape is represented by a triangle mesh. We assume that each object is diffuse (Lambertian), but the color (albedo) is allowed to vary freely along the surface. In our experiments we allow the lighting and the vertices to vary, making our renderer a function of the triangle vertices $V$ and the lighting conditions $U$: $I = R(U, V)$.

## Appendix B Differentiable Renderer

In this section we will discuss how to explicitly compute the derivatives used in the main article. The key idea is to utilize the orthonormal property of spherical harmonics. Here we give a more detailed discussion about spherical harmonics.

### b.1 Spherical Harmonics

Spherical harmonics are closely related to the Legendre polynomials which are a class of orthogonal polynomials defined by a recurrence relation.

$$P_0(x) = 1 \tag{7}$$

$$P_1(x) = x \tag{8}$$

$$(l+1)\, P_{l+1}(x) = (2l+1)\, x\, P_l(x) - l\, P_{l-1}(x) \tag{9}$$

The associated Legendre polynomials are based off of the Legendre polynomials and can be fully defined by the relations

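$$P_l^0(x) = P_l(x) \tag{10}$$

$$(l-m+1)\, P_{l+1}^m(x) = (2l+1)\, x\, P_l^m(x) - (l+m)\, P_{l-1}^m(x) \tag{11}$$

$$2 m x\, P_l^m(x) = -\sqrt{1-x^2}\, \left[ P_l^{m+1}(x) + (l+m)(l-m+1)\, P_l^{m-1}(x) \right] \tag{12}$$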
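The recurrences above translate directly into code. The sketch below is illustrative only: the function names are ours, and the associated Legendre routine is seeded with the standard closed forms $P_m^m(x) = (-1)^m (2m-1)!!\,(1-x^2)^{m/2}$ and $P_{m+1}^m(x) = (2m+1)\,x\,P_m^m(x)$ (with the Condon-Shortley phase), which are an assumption added here, before stepping upward in $l$ at fixed $m$.

```python
import math

def legendre(l, x):
    """Legendre polynomial P_l(x) via the three-term recurrence in l."""
    p_prev, p = 1.0, x  # P_0 and P_1
    if l == 0:
        return p_prev
    for k in range(1, l):
        # (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

def assoc_legendre(l, m, x):
    """Associated Legendre polynomial P_l^m(x) for 0 <= m <= l and |x| <= 1."""
    s = math.sqrt(max(0.0, 1.0 - x * x))
    pmm = 1.0
    for k in range(1, m + 1):
        pmm *= -(2 * k - 1) * s  # builds (-1)^m (2m-1)!! (1-x^2)^{m/2}
    if l == m:
        return pmm
    pm1 = (2 * m + 1) * x * pmm  # P_{m+1}^m
    for k in range(m + 1, l):
        # (k-m+1) P_{k+1}^m = (2k+1) x P_k^m - (k+m) P_{k-1}^m
        pmm, pm1 = pm1, ((2 * k + 1) * x * pm1 - (k + m) * pmm) / (k - m + 1)
    return pm1
```

For example, `assoc_legendre(2, 1, x)` should agree with the closed form $-3x\sqrt{1-x^2}$, and `assoc_legendre(l, 0, x)` should agree with `legendre(l, x)`.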

Using the associated Legendre polynomials we can define the spherical harmonics basis as

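$$Y_l^m(\theta, \phi) = K_l^m\, e^{i m \phi}\, P_l^{|m|}(\cos\theta), \quad l \in \mathbb{N},\ -l \le m \le l \tag{13}$$

$$K_l^m = \sqrt{\frac{(2l+1)}{4\pi} \frac{(l-|m|)!}{(l+|m|)!}}$$

We can observe that the associated Legendre polynomials $P_l^0$ correspond to the spherical harmonics bases $Y_l^0$ that are rotationally symmetric along the $z$ axis ($m = 0$).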

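In order to incorporate spherical harmonics into Equation 5, we change the integral domain from the upper hemisphere $\Omega(\mathbf{n})$ back to $S^2$ via a $\max$ operation

$$R(\mathbf{p}, \mathbf{n}) = \rho(\mathbf{p}) \int_{\Omega(\mathbf{n})} L(\omega)\, (\omega \cdot \mathbf{n})\, d\omega = \rho(\mathbf{p}) \int_{S^2} L(\omega)\, \max(\omega \cdot \mathbf{n}, 0)\, d\omega \tag{14}$$

We see that the integral is comprised of two components: a lighting component $L(\omega)$ and a component $\max(\omega \cdot \mathbf{n}, 0)$ that depends on the normal $\mathbf{n}$. The strategy is to pre-compute the two components by projecting onto spherical harmonics, and evaluate the integral via a dot product at runtime, as we will now derive.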

### b.2 Lighting in Spherical Harmonics

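Approximating the lighting component $L(\omega)$ in Equation 14 using spherical harmonics $Y_l^m$ up to band $n$ can be written as

$$L(\omega) \approx \sum_{l=0}^{n} \sum_{m=-l}^{l} U_{l,m}\, Y_l^m(\omega),$$

where $U_{l,m}$ is the spherical harmonics coefficient and is computed by projecting $L(\omega)$ onto $Y_l^m(\omega)$

$$U_{l,m} = \int_{S^2} L(\omega)\, Y_l^m(\omega)\, d\omega.$$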

### b.3 Clamped Cosine in Spherical Harmonics

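So far, we have projected the lighting term $L(\omega)$ onto spherical harmonics bases. Now we also need to parameterize the second component $\max(\omega \cdot \mathbf{n}, 0)$ in Equation 14, the so-called clamped cosine function, using spherical harmonics:

$$\max(\omega \cdot \mathbf{n}, 0) \approx \sum_{l=0}^{n} \sum_{m=-l}^{l} C_{l,m}(\mathbf{n})\, Y_l^m(\omega),$$

where $C_{l,m}(\mathbf{n})$ can be computed by projecting $\max(\omega \cdot \mathbf{n}, 0)$ onto $Y_l^m(\omega)$

$$C_{l,m}(\mathbf{n}) = \int_{S^2} \max(\omega \cdot \mathbf{n}, 0)\, Y_l^m(\omega)\, d\omega.$$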

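Unfortunately, this formulation turns out to be tricky to compute. Instead, the common practice is to analytically compute the coefficients $\tilde{C}_{l,m}$ for the unit $z$ direction, $\tilde{C}_{l,m} = C_{l,m}(\mathbf{z})$ with $\mathbf{z} = [0, 0, 1]^\top$, and evaluate the coefficients $C_{l,m}(\mathbf{n})$ for different normals by rotating $\tilde{C}_{l,m}$. Here we show how we can compute $\tilde{C}_{l,m}$ analytically

$$\tilde{C}_{l,m} = \int_{S^2} \max(\omega \cdot \mathbf{z}, 0)\, Y_l^m(\omega)\, d\omega = \int_0^{2\pi}\!\! \int_0^{\pi} \max(\cos\theta, 0)\, Y_l^m(\theta, \phi)\, \sin\theta\, d\theta\, d\phi = \int_0^{2\pi}\!\! \int_0^{\pi/2} \cos\theta\, Y_l^m(\theta, \phi)\, \sin\theta\, d\theta\, d\phi \tag{15}$$

In fact, because $\max(\omega \cdot \mathbf{z}, 0)$ is rotationally symmetric around the $z$-axis, its projection onto $Y_l^m(\omega)$ will have many zeros except the rotationally symmetric spherical harmonics $Y_l^0$. In other words, $\tilde{C}_{l,m}$ is non-zero only when $m = 0$. So we can simplify Equation 15 to

$$\tilde{C}_l = \tilde{C}_{l,0} = 2\pi \int_0^{\pi/2} \cos\theta\, Y_l^0(\theta)\, \sin\theta\, d\theta.$$

This integral can be evaluated analytically (see Appendix A in Basri and Jacobs [2003]) to obtain

$$\tilde{C}_l = \begin{cases} \dfrac{\sqrt{\pi}}{2} & l = 0 \\[2ex] \sqrt{\dfrac{\pi}{3}} & l = 1 \\[2ex] (-1)^{\frac{l}{2}+1}\, \dfrac{\sqrt{(2l+1)\pi}}{2^l\, (l-1)(l+2)} \dbinom{l}{l/2} & l \ge 2,\ \text{even} \\[2ex] 0 & l \ge 2,\ \text{odd} \end{cases}$$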

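Then the spherical harmonics coefficients $C_{l,m}(\mathbf{n})$ of the clamped cosine function can be computed by rotating $\tilde{C}_l$ Sloan et al. [2005] using this formula

$$C_{l,m}(\mathbf{n}) = \sqrt{\frac{4\pi}{2l+1}}\; \tilde{C}_l\; Y_l^m(\mathbf{n}) \tag{16}$$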
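The closed-form zonal coefficients of the clamped cosine and the per-band scale factor that multiplies $Y_l^m(\mathbf{n})$ in the rotation formula can be sketched as follows. The function names are illustrative, and the closed form is the one given in Appendix A of Basri and Jacobs [2003].

```python
import math

def clamped_cosine_zh(l):
    """Zonal coefficient of max(cos(theta), 0) projected onto Y_l^0."""
    if l == 0:
        return math.sqrt(math.pi) / 2.0
    if l == 1:
        return math.sqrt(math.pi / 3.0)
    if l % 2 == 1:
        return 0.0  # odd bands with l >= 3 vanish
    sign = (-1) ** (l // 2 + 1)
    numerator = math.sqrt((2 * l + 1) * math.pi) * math.comb(l, l // 2)
    return sign * numerator / (2 ** l * (l - 1) * (l + 2))

def rotation_scale(l):
    """Per-band factor sqrt(4*pi/(2l+1)) * zonal coefficient."""
    return math.sqrt(4.0 * math.pi / (2 * l + 1)) * clamped_cosine_zh(l)

scales = [rotation_scale(l) for l in range(5)]
```

The first three scale factors evaluate to $\pi$, $2\pi/3$ and $\pi/4$, and they decay quickly for higher even bands, which is why a few bands suffice for diffuse shading.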

So far, we have projected the two terms in Equation 14 to spherical harmonics bases. We can then apply the orthonormal property of spherical harmonics to compute our integral:

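\begin{align}
R(\mathbf{p}, \mathbf{n}) &= \rho(\mathbf{p}) \int_{S^2} L(\omega)\, \max(\omega \cdot \mathbf{n}, 0)\, d\omega \notag \\
&\approx \rho(\mathbf{p}) \int_{S^2} \Big[ \sum_{l,m} U_{l,m}\, Y_l^m(\omega) \Big] \Big[ \sum_{j,k} C_{j,k}(\mathbf{n})\, Y_j^k(\omega) \Big]\, d\omega \notag \\
&= \rho(\mathbf{p}) \sum_{j,k,l,m} U_{l,m}\, C_{j,k}(\mathbf{n})\, \delta_{jl}\, \delta_{km} \tag{17} \\
&= \rho(\mathbf{p}) \sum_{l,m} U_{l,m}\, C_{l,m}(\mathbf{n}) \tag{18}
\end{align}

Note that the second to last step uses the Kronecker delta $\delta$, which follows from spherical harmonics being a set of orthonormal bases. Combining with Equation 16, we can derive the rendering equation using spherical harmonics lighting for diffuse objects

$$R(\mathbf{p}, \mathbf{n}) \approx \rho(\mathbf{p}) \sum_{l=0}^{n} \sum_{m=-l}^{l} U_{l,m}\, \sqrt{\frac{4\pi}{2l+1}}\; \tilde{C}_l\; Y_l^m(\mathbf{n}) \tag{19}$$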
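To make the dot-product structure of this shading model concrete, here is a minimal sketch restricted to bands $l \le 2$. It assumes the commonly used real-valued spherical harmonics basis with hard-coded constants and scalar (grayscale) lighting coefficients; the names `real_sh_basis`, `shade`, `BAND` and `TRANSFER` are illustrative and not part of our implementation.

```python
import numpy as np

# Band index l of each of the 9 basis functions with l <= 2.
BAND = np.array([0, 1, 1, 1, 2, 2, 2, 2, 2])
# Per-band clamped-cosine scale factors for l = 0, 1, 2: pi, 2*pi/3, pi/4.
TRANSFER = np.array([np.pi, 2.0 * np.pi / 3.0, np.pi / 4.0])

def real_sh_basis(n):
    """Real spherical harmonics with l <= 2 evaluated at a unit vector n."""
    x, y, z = n
    return np.array([
        0.282095,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ])

def shade(rho, U, n):
    """Diffuse radiance: albedo times sum of lighting * transfer * basis."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    return rho * np.sum(U * TRANSFER[BAND] * real_sh_basis(n))

# Uniform unit lighting has a single non-zero coefficient, 2*sqrt(pi).
U_uniform = np.zeros(9)
U_uniform[0] = 2.0 * np.sqrt(np.pi)
radiance = shade(0.5, U_uniform, [0.0, 0.0, 1.0])
```

Under uniform unit lighting the result should be close to $\rho\pi$ for any normal, matching the hemispherical integral of the cosine.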

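### b.4 Lighting and Texture Derivatives

Equation 19 is our final simplified rendering equation. With this equation, the derivative with respect to the lighting coefficients $U_{l,m}$ is trivial

$$\frac{\partial R}{\partial U_{l,m}} = \rho(\mathbf{p})\, \sqrt{\frac{4\pi}{2l+1}}\; \tilde{C}_l\; Y_l^m(\mathbf{n}) \tag{20}$$

If applications require the derivative with respect to the color (texture) $\rho$ of shapes, it is straightforward

$$\frac{\partial R}{\partial \rho} = \sum_{l=0}^{n} \sum_{m=-l}^{l} U_{l,m}\, \sqrt{\frac{4\pi}{2l+1}}\; \tilde{C}_l\; Y_l^m(\mathbf{n}) \tag{21}$$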
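Because the shading model is linear in both the lighting coefficients and the albedo, these two derivatives can be read off directly. The sketch below flattens the $(l, m)$ indices into one vector and assumes a precomputed transfer vector `B` whose entries stand for the per-coefficient factor multiplying each lighting coefficient; the numerical values of `B` and `U` are hypothetical.

```python
import numpy as np

def shade(rho, U, B):
    """Radiance as albedo times the dot product of lighting and transfer."""
    return rho * float(np.dot(U, B))

def shade_grads(rho, U, B):
    """Analytic derivatives with respect to lighting U and albedo rho."""
    dR_dU = rho * B                  # one entry per lighting coefficient
    dR_drho = float(np.dot(U, B))    # scalar
    return dR_dU, dR_drho

# Hypothetical values, for illustration only.
rho = 0.7
U = np.array([1.2, -0.3, 0.5, 0.1])
B = np.array([0.9, 0.2, 0.6, -0.4])
dR_dU, dR_drho = shade_grads(rho, U, B)

# Central finite-difference check of the first lighting coefficient.
eps = 1e-6
step = np.array([eps, 0.0, 0.0, 0.0])
fd_dU0 = (shade(rho, U + step, B) - shade(rho, U - step, B)) / (2.0 * eps)
```

Here `fd_dU0` should match `dR_dU[0]` up to floating-point error, since the model is exactly linear in `U`.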

## Appendix C Differentiating Skylight Parameters

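To model possible outdoor daylight conditions, we use the analytical Preetham skylight model Preetham et al. [1999]. This model is calibrated by atmospheric data and parameterized by intuitive parameters: the turbidity $T$, which describes the cloudiness of the atmosphere, and two polar angles $\theta_s, \phi_s$ for the direction of the sun. Note that $\theta_s, \phi_s$ are not the polar angles $\theta, \phi$ representing the incoming light direction $\omega$ in $L(\omega)$. The spherical harmonics representation of the Preetham skylight is presented in Habel et al. [2008] as

$$L(\omega) \approx \sum_{l=0}^{n} \sum_{m=-l}^{l} U_{l,m}(\theta_s, \phi_s, T)\, Y_l^m(\omega).$$

In order to incorporate spherical harmonics, Habel et al. [2008] perform a non-linear least squares fit to write $\tilde{U}_{l,m}$ as a polynomial of $\theta_s$ and $T$ to solve for

$$\tilde{U}_{l,m}(\theta_s, T) = \sum_{i} \sum_{j} (p_{l,m})_{i,j}\; \theta_s^{\,i}\, T^{\,j},$$

where $(p_{l,m})_{i,j}$ are scalar coefficients. Then $U_{l,m}(\theta_s, \phi_s, T)$ can be computed by applying a spherical harmonics rotation with $\phi_s$ using this formula

$$U_{l,m}(\theta_s, \phi_s, T) = \tilde{U}_{l,m}(\theta_s, T)\, \cos(m \phi_s) + \tilde{U}_{l,-m}(\theta_s, T)\, \sin(m \phi_s).$$

We refer the reader to Preetham et al. [1999] for more detail. For the purposes of this article we just need the above form to compute the derivatives.

### c.1 Derivatives

The derivatives of the lighting with respect to the skylight parameters $(\theta_s, \phi_s, T)$ can be computed as

$$\frac{\partial U_{l,m}}{\partial \phi_s} = -m\, \tilde{U}_{l,m}\, \sin(m \phi_s) + m\, \tilde{U}_{l,-m}\, \cos(m \phi_s),$$

$$\frac{\partial U_{l,m}}{\partial \theta_s} = \frac{\partial \tilde{U}_{l,m}}{\partial \theta_s}\, \cos(m \phi_s) + \frac{\partial \tilde{U}_{l,-m}}{\partial \theta_s}\, \sin(m \phi_s), \qquad \frac{\partial \tilde{U}_{l,m}}{\partial \theta_s} = \sum_{i} \sum_{j} i\, (p_{l,m})_{i,j}\; \theta_s^{\,i-1}\, T^{\,j},$$

$$\frac{\partial U_{l,m}}{\partial T} = \frac{\partial \tilde{U}_{l,m}}{\partial T}\, \cos(m \phi_s) + \frac{\partial \tilde{U}_{l,-m}}{\partial T}\, \sin(m \phi_s), \qquad \frac{\partial \tilde{U}_{l,m}}{\partial T} = \sum_{i} \sum_{j} j\, (p_{l,m})_{i,j}\; \theta_s^{\,i}\, T^{\,j-1}.$$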
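As an illustration of how such skylight derivatives might be evaluated, the sketch below assumes each unrotated coefficient is a bivariate polynomial in the sun zenith angle and the turbidity, stored as a coefficient matrix `p[i, j]`, and that the azimuthal rotation mixes the $+m$ and $-m$ coefficients through $\cos(m\phi_s)$ and $\sin(m\phi_s)$. The function names and the coefficient matrices are hypothetical; they are not the fitted values of Habel et al. [2008].

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def u_tilde(p, theta_s, T):
    """Evaluate sum_ij p[i, j] * theta_s**i * T**j."""
    return npoly.polyval2d(theta_s, T, p)

def u_tilde_grads(p, theta_s, T):
    """Partial derivatives of the polynomial w.r.t. theta_s and T."""
    d_theta = npoly.polyval2d(theta_s, T, npoly.polyder(p, axis=0))
    d_T = npoly.polyval2d(theta_s, T, npoly.polyder(p, axis=1))
    return d_theta, d_T

def rotate(a_m, a_neg_m, m, phi_s):
    """Azimuthal rotation mixing the +m and -m coefficients."""
    return a_m * np.cos(m * phi_s) + a_neg_m * np.sin(m * phi_s)

def skylight_coeff(p_m, p_neg_m, m, theta_s, phi_s, T):
    return rotate(u_tilde(p_m, theta_s, T), u_tilde(p_neg_m, theta_s, T), m, phi_s)

def skylight_coeff_grads(p_m, p_neg_m, m, theta_s, phi_s, T):
    """Derivatives of the rotated coefficient w.r.t. (theta_s, phi_s, T)."""
    u_m, u_neg = u_tilde(p_m, theta_s, T), u_tilde(p_neg_m, theta_s, T)
    (dm_th, dm_T) = u_tilde_grads(p_m, theta_s, T)
    (dn_th, dn_T) = u_tilde_grads(p_neg_m, theta_s, T)
    d_theta = rotate(dm_th, dn_th, m, phi_s)
    d_phi = m * (-u_m * np.sin(m * phi_s) + u_neg * np.cos(m * phi_s))
    d_T = rotate(dm_T, dn_T, m, phi_s)
    return d_theta, d_phi, d_T

# Hypothetical polynomial coefficients, for illustration only.
p_m = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
p_neg_m = np.array([[0.5, -1.0], [0.25, 0.0], [-2.0, 1.5]])
grads = skylight_coeff_grads(p_m, p_neg_m, 2, 0.5, 0.3, 2.0)
```

In practice these derivatives would be chained with the derivative of the image with respect to each lighting coefficient; a finite-difference comparison against `skylight_coeff` is a cheap way to validate them.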

## Appendix D Derivatives of Surface Normals

In this section, we discuss how to take the derivative of spherical harmonics with respect to surface normals, which is an essential task for computing the derivative with respect to the geometry. Specifically, the derivative of the rendering equation (Equation 19) with respect to the geometry $V$ (the vertex positions) is computed by applying the chain rule

$$\frac{\partial R}{\partial V} = \frac{\partial R}{\partial \mathbf{n}}\, \frac{\partial \mathbf{n}}{\partial V}.$$

Computing $\partial \mathbf{n} / \partial V$ is provided in Section 5.2, and computing $\partial R / \partial \mathbf{n}$ requires derivatives of spherical harmonics with respect to surface normals.

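To begin with, let’s recall the relationship of the surface normal $\mathbf{n}$ with the polar angles $\theta, \phi$

$$\mathbf{n} = [n_x, n_y, n_z]^\top = [\sin\theta \cos\phi,\ \sin\theta \sin\phi,\ \cos\theta]^\top.$$

We can then compute the derivative of spherical harmonics with respect to surface normals through

\begin{align}
\frac{\partial Y_l^m(\theta, \phi)}{\partial \mathbf{n}} &= \frac{\partial Y_l^m}{\partial \theta} \frac{\partial \theta}{\partial \mathbf{n}} + \frac{\partial Y_l^m}{\partial \phi} \frac{\partial \phi}{\partial \mathbf{n}} \tag{22} \\
&= K_l^m\, e^{i m \phi}\, \frac{\partial P_l^{|m|}(\cos\theta)}{\partial \theta} \frac{\partial \theta}{\partial \mathbf{n}} + K_l^m\, P_l^{|m|}(\cos\theta)\, \frac{\partial e^{i m \phi}}{\partial \phi} \frac{\partial \phi}{\partial \mathbf{n}} \tag{23} \\
&= -K_l^m\, e^{i m \phi}\, \sin\theta\, \frac{d P_l^{|m|}(x)}{d x}\bigg|_{x = \cos\theta} \frac{\partial \theta}{\partial \mathbf{n}} + i\, m\, Y_l^m(\theta, \phi)\, \frac{\partial \phi}{\partial \mathbf{n}} \tag{24}
\end{align}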

Note that the derivative of the associated Legendre polynomials can be computed by applying the recurrence formula Dunster [2010]

$$\frac{d P_l^m(x)}{d x} = \frac{-(l+1)\, x\, P_l^m(x) + (l-m+1)\, P_{l+1}^m(x)}{x^2 - 1}.$$
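As a quick sanity check of a derivative recurrence of this kind, consider one standard form, $\frac{d}{dx} P_l^m(x) = \frac{-(l+1)\,x\,P_l^m(x) + (l-m+1)\,P_{l+1}^m(x)}{x^2-1}$, applied to $l = m = 1$. The closed forms $P_1^1(x) = -\sqrt{1-x^2}$ and $P_2^1(x) = -3x\sqrt{1-x^2}$ (with the Condon-Shortley phase) are assumed here for illustration:

```latex
\begin{align*}
\frac{d P_1^1(x)}{dx}
  &= \frac{-2x\,P_1^1(x) + (1-1+1)\,P_2^1(x)}{x^2-1}
   = \frac{2x\sqrt{1-x^2} - 3x\sqrt{1-x^2}}{x^2-1} \\
  &= \frac{-x\sqrt{1-x^2}}{-(1-x^2)}
   = \frac{x}{\sqrt{1-x^2}},
\end{align*}
```

which agrees with differentiating $-\sqrt{1-x^2}$ directly.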

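The derivatives of the polar angles with respect to the surface normal, treating $\theta = \arccos(n_z / \lVert \mathbf{n} \rVert)$ and $\phi = \operatorname{atan2}(n_y, n_x)$ and evaluating at $\lVert \mathbf{n} \rVert = 1$, are

$$\frac{\partial \theta}{\partial \mathbf{n}} = \left[ \frac{\partial \theta}{\partial n_x},\ \frac{\partial \theta}{\partial n_y},\ \frac{\partial \theta}{\partial n_z} \right] = \left[ \cos\theta \cos\phi,\ \cos\theta \sin\phi,\ -\sin\theta \right] \tag{25}$$

$$\frac{\partial \phi}{\partial \mathbf{n}} = \left[ \frac{\partial \phi}{\partial n_x},\ \frac{\partial \phi}{\partial n_y},\ \frac{\partial \phi}{\partial n_z} \right] = \left[ -\frac{\sin\phi}{\sin\theta},\ \frac{\cos\phi}{\sin\theta},\ 0 \right] \tag{26}$$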
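Derivatives of polar angles with respect to a normal are easy to get wrong by a sign or a normalization factor, so a finite-difference check is worthwhile. The sketch below assumes $\theta = \arccos(n_z / \lVert \mathbf{n} \rVert)$ and $\phi = \operatorname{atan2}(n_y, n_x)$, evaluated at a unit normal away from the poles; the function names are illustrative. The component expressions in the code equal $[\cos\theta\cos\phi, \cos\theta\sin\phi, -\sin\theta]$ and $[-\sin\phi/\sin\theta, \cos\phi/\sin\theta, 0]$ for unit normals.

```python
import numpy as np

def polar_angles(n):
    """Zenith and azimuth angles of a (not necessarily unit) vector n."""
    n = np.asarray(n, dtype=float)
    return np.arccos(n[2] / np.linalg.norm(n)), np.arctan2(n[1], n[0])

def d_theta_dn(n):
    """Gradient of theta at a unit normal n (undefined at the poles)."""
    nx, ny, nz = n
    s = np.hypot(nx, ny)  # equals sin(theta) for unit n
    return np.array([nx * nz / s, ny * nz / s, -s])

def d_phi_dn(n):
    """Gradient of phi at a unit normal n (undefined at the poles)."""
    nx, ny, _ = n
    s2 = nx * nx + ny * ny  # equals sin(theta)**2 for unit n
    return np.array([-ny / s2, nx / s2, 0.0])

def finite_difference(f, n, eps=1e-6):
    """Central-difference gradient of a scalar function f at n."""
    n = np.asarray(n, dtype=float)
    grad = np.zeros(3)
    for k in range(3):
        step = np.zeros(3)
        step[k] = eps
        grad[k] = (f(n + step) - f(n - step)) / (2.0 * eps)
    return grad

n = np.array([0.36, 0.48, 0.8])  # unit length
fd_theta = finite_difference(lambda v: polar_angles(v)[0], n)
fd_phi = finite_difference(lambda v: polar_angles(v)[1], n)
```

Both `fd_theta` and `fd_phi` should agree with the analytic gradients to roughly the accuracy of the finite-difference step.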

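In summary, the derivative of the rendering equation (Equation 19) with respect to the surface normal $\mathbf{n}$ can be computed by

$$\frac{\partial R}{\partial \mathbf{n}} = \rho(\mathbf{p}) \sum_{l=0}^{n} \sum_{m=-l}^{l} U_{l,m}\, \sqrt{\frac{4\pi}{2l+1}}\; \tilde{C}_l\; \frac{\partial Y_l^m}{\partial \mathbf{n}},$$

where $\partial Y_l^m / \partial \mathbf{n}$ is provided in Equation 24.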

## Appendix E Data Augmentation

We place the centroid, calculated as the weighted average of the mesh vertices where the weights are the vertex areas, at the origin and normalize shapes to the range -1 to 1. The field of view is chosen to be 2 and 3 in the same unit as the normalized shape. Background images include plain colors and real photos, which have small influence on model predictions. Viewing directions are chosen at a 60 degree zenith angle, with 16 views uniformly sampled from 0 to $2\pi$ in azimuthal angle.
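The shape normalization and view sampling described above could be sketched as follows. Several details are assumptions made for illustration: vertex areas are taken as one third of the area of the incident triangles, normalization is a uniform scale so that the largest absolute coordinate equals 1, the $z$ axis points up, and the 16 azimuths exclude the endpoint $2\pi$. The function names are hypothetical.

```python
import numpy as np

def vertex_areas(V, F):
    """Per-vertex area: one third of the area of each incident triangle."""
    e1 = V[F[:, 1]] - V[F[:, 0]]
    e2 = V[F[:, 2]] - V[F[:, 0]]
    face_area = 0.5 * np.linalg.norm(np.cross(e1, e2), axis=1)
    areas = np.zeros(len(V))
    # Each face contributes a third of its area to each of its 3 corners.
    np.add.at(areas, F.ravel(), np.repeat(face_area / 3.0, 3))
    return areas

def center_and_normalize(V, F):
    """Move the area-weighted centroid to the origin and scale into [-1, 1]."""
    w = vertex_areas(V, F)
    centroid = (w[:, None] * V).sum(axis=0) / w.sum()
    Vc = V - centroid
    return Vc / np.abs(Vc).max()

def view_directions(zenith_deg=60.0, n_views=16):
    """Unit viewing directions at a fixed zenith and uniform azimuths."""
    theta = np.deg2rad(zenith_deg)
    phi = np.arange(n_views) * 2.0 * np.pi / n_views
    return np.stack([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.full(n_views, np.cos(theta)),
    ], axis=1)

# A small tetrahedron as a stand-in mesh.
V = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
F = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])
V_norm = center_and_normalize(V, F)
views = view_directions()
```

Multiplying each row of `views` by a camera distance would then give camera positions looking at the origin.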