We present BlockGAN, an image generative model that learns object-aware 3D scene representations directly from unlabelled 2D images. Current work on scene representation learning either ignores scene background or treats the whole scene as one object. Meanwhile, work that considers scene compositionality treats scene objects only as image patches or 2D layers with alpha maps. Inspired by the computer graphics pipeline, we design BlockGAN to learn to first generate 3D features of background and foreground objects, then combine them into 3D features for the whole scene, and finally render them into realistic images. This allows BlockGAN to reason over occlusion and interaction between objects’ appearance, such as shadow and lighting, and provides control over each object’s 3D pose and identity, while maintaining image realism. BlockGAN is trained end-to-end, using only unlabelled single images, without the need for 3D geometry, pose labels, object masks, or multiple views of the same scene. Our experiments show that using explicit 3D features to represent objects allows BlockGAN to learn disentangled representations both in terms of objects (foreground and background) and their properties (pose and identity).




1 Introduction

The computer graphics pipeline has achieved impressive results in generating high-quality images, while offering users a great level of freedom and controllability over the generated images. This has many applications in creating and editing content for the creative industries, such as films, games, scientific visualisation, and more recently, in generating training data for computer vision tasks. However, the current pipeline, ranging from generating 3D geometry and textures, rendering, compositing and image post-processing, can be very expensive in terms of labour, time, and costs.

Recent image generative models, in particular generative adversarial networks (GANs; Goodfellow et al., 2014), have greatly improved the visual fidelity and resolution of generated images Karras et al. (2018, 2019); Brock et al. (2019). Conditional GANs Mirza and Osindero (2014) allow users to manipulate images, but require labels during training. Recent work on unsupervised disentangled representations using GANs Chen et al. (2016); Karras et al. (2019); Nguyen-Phuoc et al. (2019) relaxes this need for labels. The ability to produce high-quality images while providing a certain level of controllability has made GANs an increasingly attractive alternative to the traditional graphics pipeline for content generation. However, most work focuses on property disentanglement, such as shape, pose and appearance, without considering the compositionality of images, i.e., scenes being made up of multiple objects. Therefore, these methods do not offer control over individual objects in a way that respects the interaction of objects, such as consistent lighting and shadows in the scene. This is a major limiting factor of current image generative models compared to the graphics pipeline, where each 3D object is modelled individually in terms of geometry and appearance, and objects are combined into 3D scenes with consistent lighting.

Even when compositionality of objects is considered, most approaches treat objects as 2D layers that are combined using alpha compositing Yang et al. (2017); van Steenkiste et al. (2018); Engelcke et al. (2020). Moreover, they also assume that each object’s appearance is independent Bielski and Favaro (2019); Chen et al. (2019a); Engelcke et al. (2020). While this layering approach has led to good results in terms of object separation and visual fidelity, it is fundamentally limited by the choice of 2D representation. Firstly, it is hard to manipulate properties that require 3D understanding, such as pose or perspective. Secondly, object layers tend to bake in appearance and cannot adequately represent view-specific appearance, such as shadows or material highlights changing as objects move around in the scene. Finally, it is non-trivial to model the appearance interactions between objects, such as scene lighting that affects objects’ shadows on a background.

We introduce BlockGAN, a generative adversarial network that learns 3D object-oriented scene representations directly from unlabelled 2D images. Instead of learning 2D layers of objects and combining them with alpha compositing, BlockGAN learns to generate 3D object features and to combine them into deep 3D scene features that are projected and rendered as 2D images. This process closely resembles the traditional computer graphics pipeline where scenes are modelled in 3D, enabling reasoning over occlusion and interaction between object appearance, such as shadows or highlights. During test time, each object’s pose can be manipulated using 3D transformations directly applied to the object’s deep 3D features. We can also add new objects to the generated image by introducing more 3D object features to the 3D scene features, even when BlockGAN was trained with scenes containing fewer objects. This shows that BlockGAN has learnt a non-trivial representation of objects and their interaction, instead of merely memorizing images.

BlockGAN is trained end-to-end in an unsupervised manner directly from unlabelled 2D images, without any multi-view images, paired images, pose labels, or 3D shapes. We experiment with BlockGAN on a variety of synthetic and natural image datasets. In summary, our main contributions are:

  • BlockGAN, an unsupervised image generative model that learns an object-aware 3D scene representation directly from unlabelled 2D images, disentangling both between objects and individual object properties (pose and identity);

  • showing that BlockGAN can learn to separate objects from the background even in cluttered scenes; and

  • demonstrating that BlockGAN’s object features can be added and manipulated to create novel scenes that are not observed during training.

2 Related work

Figure 1: BlockGAN’s generator network. Each noise vector z_i is mapped to deep 3D object features, which are transformed to the desired 3D pose θ_i using a 3D similarity transformation (rotation, translation, uniform scaling). These object features are then combined into 3D scene features, to which the camera pose is applied, before being projected to 2D features that are rendered into the final image.

Generative adversarial networks.

Unsupervised GANs learn to map samples from a latent distribution to data that can be categorised as real data by a discriminator network. Conditional GANs enable control over the generated image content, but require labels during training. Recent work on unsupervised disentangled representation learning using GANs provides controllability over the final images without the need for labels. One way is to redesign loss functions to maximize mutual information between generated images and latent variables Chen et al. (2016); Jeon et al. (2018). However, these models do not guarantee which factors can be learnt, and have limited success when applied to natural images. Recent advances in GANs show that network architecture design can play a vital role in both improving training stability Chen et al. (2019b) and controllability of generated images Karras et al. (2019); Nguyen-Phuoc et al. (2019). In this work, we also focus on designing an appropriate architecture to learn object-level disentangled representations. We show that injecting inductive biases about how the 3D world is composed of 3D objects enables BlockGAN to learn 3D object-aware scene representations directly from 2D images, thus providing control over both 3D pose and appearance of individual objects.

3D-aware neural image synthesis.

Several methods have shown that introducing 3D structures into convolutional neural networks can greatly improve the quality Park et al. (2017); Nguyen-Phuoc et al. (2018); Rhodin et al. (2018); Sitzmann et al. (2019b) and controllability of the image generation process Nguyen-Phuoc et al. (2019); Olszewski et al. (2019); Zhu et al. (2018). This can be achieved with explicit 3D representations, like occupancy voxel grids Zhu et al. (2018); Rematas and Ferrari (2019), meshes, or shape templates Kossaifi et al. (2018); Shu et al. (2018), in conjunction with hand-crafted differentiable renderers Loper and Black (2014); Henzler et al. (2019); Chen et al. (2019c); Liu et al. (2019). Renderable deep 3D representations can also be learnt directly from images Nguyen-Phuoc et al. (2019); Sitzmann et al. (2019a, b). HoloGAN (2019) further shows that adding inductive biases about the 3D structure of the world enables unsupervised disentangled feature learning between shape, appearance and pose. However, these learnt representations are either object-centric (i.e., no background), or treat the whole scene as one object. Thus, they do not consider scene compositionality, i.e., components that can move independently. BlockGAN, in contrast, is designed to learn object-aware 3D representations that are combined into a unified 3D scene representation.

Object-aware image synthesis.

Recent image synthesis methods decompose the image generation process into generating image components like layers or image patches, and combining them into the final image Yang et al. (2017); Kwak and Zhang (2016); van Steenkiste et al. (2018). This includes conditional GANs that use segmentation masks Türkoğlu et al. (2019); Papadopoulos et al. (2019), scene graphs Johnson et al. (2018), object labels, key points or bounding boxes Hinz et al. (2019); Reed et al. (2016), which have shown impressive results for natural image datasets. Recently, unsupervised methods Eslami et al. (2016); Kosiorek et al. (2018); van Steenkiste et al. (2018); Engelcke et al. (2020) learned object disentanglement for multi-object scenes on simpler synthetic datasets (single-colour objects, simple lighting and materials). Other approaches successfully separate foreground from background objects in natural images, but Yang et al. (2017) make assumptions about the size of objects, while Bielski and Favaro (2019) and Chen et al. (2019a) make strong assumptions about independent object appearance. Overall, in these methods, object components are treated as image patches or 2D layers with corresponding masks, and are combined via alpha compositing at the pixel level to generate the final image. The work closest to ours learns to generate multiple 3D primitives (cuboids, spheres and point clouds), renders them into separate 2D layers with a hand-crafted differentiable renderer, and alpha-composes them based on their depth ordering to create the final image (Liao et al., 2019). Despite the explicit 3D geometry, this method does not handle cluttered backgrounds and requires extra supervision in the form of labelled images with and without foreground objects.

BlockGAN takes a different approach. We treat objects as learnt 3D features with corresponding 3D poses, and learn to combine them into 3D scene features. Not only does this provide control over each object’s 3D pose, but it also enables BlockGAN to learn realistic lighting and shadows. Finally, our approach allows adding an arbitrary number of foreground objects to the 3D scene features to generate images with multiple objects, which were not observed at training time.

3 Method

Inspired by the image rendering process in the computer graphics pipeline, we assume that each image x is a rendered 2D image of a 3D scene composed of K 3D foreground objects o_1, …, o_K in addition to the background o_0:

x = p(c(o_0, o_1, …, o_K)),

where the function c combines multiple objects into unified scene features that are projected to the image x by p. We assume each object o_i is defined in a canonical orientation and generated from a noise vector z_i by a function g_i before being individually posed using parameters θ_i: o_i = g_i(z_i, θ_i).

We inject the inductive bias of compositionality of the 3D world into BlockGAN in two ways. (1) The generator is designed to first generate 3D features for each object independently, before transforming and combining them into unified scene features, in which objects interact. (2) Unlike other methods that use 2D image patches or layers to represent objects, BlockGAN directly learns from unlabelled images how to generate objects as 3D features. This allows our model to disentangle the scene into separate 3D objects and allows the generator to reason over 3D space, enabling object pose manipulation and appearance interaction between objects. BlockGAN, therefore, learns to both generate and render the scene features into images that can fool the discriminator.

Figure 1 illustrates the BlockGAN generator architecture. Each noise vector z_i is mapped to 3D object features, which are then transformed according to the object’s pose θ_i using a 3D similarity transformation, before being combined into 3D scene features using the scene composer c. The scene features are transformed into the camera coordinate system before being projected to 2D features that render the final image using the camera projection function p. During training, we randomly sample both the noise vectors z_i and poses θ_i. At test time, objects can be generated with a given identity z_i in the desired pose θ_i.

BlockGAN is trained end-to-end using only unlabelled 2D images, without the need for any labels, such as poses, 3D shapes, multi-view inputs, masks, or geometry priors like shape templates, symmetry or smoothness terms. We next explain each component of the generator in more detail.

3.1 Learning 3D object representations

Each object o_i is a deep 3D feature grid generated by o_i = g_i(z_i, θ_i), where g_i is an object generator that takes as input a noise vector z_i controlling the object appearance, and the object’s 3D pose θ_i, which comprises its uniform scale s_i, rotation R_i and translation t_i. The object generator is specific to each category of objects, and is shared between objects of the same category. We assume that 3D scenes consist of at least two objects: the background o_0 and one or more foreground objects o_1, …, o_K. This is different from object-centric methods that assume only a single object in front of a simple white background Sitzmann et al. (2019a), or that only deal with static scenes whose components cannot move independently (Nguyen-Phuoc et al., 2019). We show that, even when BlockGAN is trained with only one foreground object and the background, we can add an arbitrary number of foreground objects to the scene at test time.
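For intuition, the pose θ_i = (s_i, R_i, t_i) can be assembled into a single 4×4 similarity matrix that is later applied to the object’s feature grid. A minimal numpy sketch; the Euler-angle convention and function name are our own assumptions, not from the paper:

```python
import numpy as np

def similarity_matrix(scale, angles, translation):
    """Build a 4x4 similarity transform M = T * (s * R) from pose parameters.

    angles: (rx, ry, rz) Euler angles in radians (convention is an assumption).
    """
    rx, ry, rz = angles
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx),  np.cos(rx)]])
    Ry = np.array([[ np.cos(ry), 0, np.sin(ry)],
                   [0, 1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz),  np.cos(rz), 0],
                   [0, 0, 1]])
    R = Rz @ Ry @ Rx
    M = np.eye(4)
    M[:3, :3] = scale * R          # uniform scale folded into the rotation block
    M[:3, 3] = translation         # translation in the last column
    return M
```

Representing the pose as one matrix also makes it easy to chain with the camera transformation later (Section 3.3).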

Figure 2: Left: BlockGAN’s object generator sub-network. Each object starts with a constant tensor that is learnt with the rest of the network. Right: Illustration of an AdaIN layer. The noise vector z_i is mapped to two affine parameters, a scale and a bias, for the AdaIN layer, which modulates the 3D features. The result is transformed to the desired pose θ_i and passed to the scene composer function.

To generate 3D object features, BlockGAN implements the style-based strategy of HoloGAN Nguyen-Phuoc et al. (2019), which was shown to help disentangle pose from identity while improving training stability. As illustrated in Figure 2, the noise vector z_i is mapped to affine parameters – the “style controller” – for adaptive instance normalization (AdaIN; Huang and Belongie, 2017) after each 3D convolution layer. However, unlike HoloGAN, which learns 3D features directly for the whole scene, BlockGAN learns 3D features for each object, which are transformed to their target poses using similarity transformations and combined into 3D scene features. We implement these 3D similarity transformations by trilinearly resampling the 3D features according to the translation, rotation and scale parameters in θ_i; samples falling outside the feature tensor are clamped to zero. This allows BlockGAN not only to separate object pose from identity, but also to disentangle multiple objects in the same scene.
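As a concrete sketch of the AdaIN modulation described above, using numpy in place of the learnt mapping network that would predict the per-channel scale (gamma) and bias (beta) from z_i; function names and shapes are our assumptions:

```python
import numpy as np

def adain3d(features, gamma, beta, eps=1e-5):
    """Adaptive instance normalisation for a (C, D, H, W) feature grid.

    Normalises each channel over its spatial dimensions, then modulates it
    with the per-channel scale (gamma) and bias (beta) that, in the model,
    would be predicted from the noise vector z_i.
    """
    c = features.shape[0]
    x = features.reshape(c, -1)
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    x = (x - mu) / (sigma + eps)                 # instance normalisation
    x = gamma[:, None] * x + beta[:, None]       # style modulation
    return x.reshape(features.shape)
```

After modulation, each channel carries the statistics dictated by z_i, which is what ties the object’s appearance to its noise vector independently of its pose.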

3.2 Scene composer function

We combine the 3D object features into scene features

S = c(o_0, o_1, …, o_K)

using a scene composer function c. We consider three candidate functions: (i) element-wise summation, (ii) element-wise maximum and (iii) a multi-layer perceptron (MLP). Element-wise operations generalise easily to multiple objects in any order. The MLP takes the objects concatenated along the channel dimension as input. While all functions enable learning of object disentanglement (see Section 4.5), the element-wise maximum achieves the best image quality, and we thus use it for all experiments. Its invariance to permutation and to the number of input objects also allows adding new objects to the scene features at test time, even when trained with only two objects (see Section 4.3).
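The two element-wise composers can be sketched in a few lines of numpy (the MLP variant is omitted as it needs learnt weights; function names are ours). Note how both accept any number of object feature grids, and the maximum is additionally invariant to their order:

```python
import numpy as np

def compose_sum(objects):
    """Element-wise summation over a list of equally shaped feature grids."""
    return np.sum(objects, axis=0)

def compose_max(objects):
    """Element-wise maximum: permutation-invariant and agnostic to the
    number of objects, which is what permits adding objects at test time."""
    return np.maximum.reduce(objects)
```

Because `compose_max` treats any extra grid like the existing ones, appending a new object’s features at test time requires no retraining of the composer.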

3.3 Learning to render

Instead of using a hand-crafted differentiable renderer, we aim to learn the rendering process directly from unlabelled images. HoloGAN showed that this approach is more expressive, as it is capable of handling unlabelled natural image data. However, its projection model is limited to weak perspective, which does not support foreshortening – an effect that is observed when objects are close to real (perspective) cameras. We therefore introduce a graphics-based perspective projection function that first transforms the 3D scene features into camera space using a projective transformation, and then learns the projection of the 3D features to a 2D feature map.

Figure 3: Left: 2D illustration of the camera’s viewing volume (frustum) overlaid on the scene-space features. We trilinearly resample the scene features based on the viewing volume at the orange dots. Right: The resulting camera-space features before projection to 2D.

From scene to camera space.

The computer graphics pipeline implements perspective projection using a projective transformation that converts objects from world coordinates (our scene space) to camera coordinates Marschner et al. (2015). We implement this camera transformation similarly to the similarity transformations used to manipulate object poses in Section 3.1, by resampling the 3D scene features according to the viewing volume (frustum) of the virtual perspective camera (see Figure 3). For correct perspective projection, this transformation must be a projective transformation, the superset of similarity transforms Yan et al. (2016). Specifically, the viewing frustum is defined in scene space relative to the camera’s pose using the angle of view and the distances of the near and far planes. The camera-space features are a new 3D tensor of features whose corners are mapped to the corners of the camera’s viewing frustum using the unique projective 3D transformation computed from the coordinates of corresponding corners with the Direct Linear Transform (Chapter 3, Hartley and Zisserman, 2004).
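The projective transform between the feature cube and the frustum can be estimated from the eight corner correspondences with the Direct Linear Transform. A simplified numpy sketch under our own naming, without the coordinate normalisation a production implementation would add:

```python
import numpy as np

def dlt_3d(src, dst):
    """Estimate the 4x4 projective transform H with dst ~ H @ src (in
    homogeneous coordinates) from 3D point correspondences via the DLT.

    src, dst: (N, 3) arrays of corresponding 3D points (N >= 5 in general
    position; here we use the 8 cube/frustum corners).
    """
    rows = []
    for X, Xp in zip(src, dst):
        Xh = np.append(X, 1.0)                 # homogeneous source point
        for k in range(3):                     # x'_k * (h4 . X) - (hk . X) = 0
            row = np.zeros(16)
            row[4 * k:4 * k + 4] = -Xh
            row[12:16] = Xp[k] * Xh
            rows.append(row)
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)                # null vector = flattened H
    H = Vt[-1].reshape(4, 4)
    return H / H[3, 3]                         # fix the arbitrary scale
```

Mapping the cube corners through the estimated H and dehomogenising recovers the frustum corners, which is exactly the correspondence used to resample the scene features into camera space.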

In practice, we combine the object and camera transformations into a single transformation by multiplying the two transformation matrices and resampling the object features in a single step, directly from object to camera space. This is computationally more efficient than resampling twice, and advantageous from a sampling theory point of view: the features are only interpolated once, so less information is lost to resampling. The combined transformation is a fixed, differentiable function of the pose parameters θ_i. The individual objects are then combined in camera space before the final projection.

Learning the camera projection function

After the camera transformation, the 3D features are projected into view-specific 2D feature maps using the learnt camera projection. This function ensures that occlusion is handled correctly, showing near objects in front of distant objects. Following the RenderNet projection unit (2018), we reshape the 3D camera-space features (with depth D and C channels) into a 2D feature map with D·C channels, followed by a per-pixel MLP (i.e., a 1×1 convolution) that outputs the final 2D feature channels.
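A hedged numpy sketch of this projection unit, replacing the learnt 1×1 convolution with an explicit weight matrix (shapes and names are assumptions):

```python
import numpy as np

def project_features(volume, weights, bias):
    """RenderNet-style projection: collapse the depth axis of a (C, D, H, W)
    volume into channels, then apply a per-pixel linear layer (a 1x1 conv).

    weights: (C_out, C * D) mixing matrix; bias: (C_out,).
    """
    c, d, h, w = volume.shape
    flat = volume.reshape(c * d, h * w)        # (C*D, H*W): depth folded into channels
    out = weights @ flat + bias[:, None]       # same linear map at every pixel
    return out.reshape(-1, h, w)               # (C_out, H, W) 2D feature map
```

Because the depth axis becomes input channels of the per-pixel MLP, the network can learn depth-dependent weighting, i.e., to let features of near objects dominate those of occluded, distant ones.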

3.4 Loss functions

We train BlockGAN adversarially using the non-saturating GAN loss Goodfellow et al. (2014). For natural images with more cluttered backgrounds, we also add a style discriminator loss similar to Nguyen-Phuoc et al. (2019). In addition to classifying images as real or fake, this discriminator also looks at images at the feature level: given the image features at a given layer, the style discriminator classifies their mean and standard deviation over the spatial dimensions, which describe the image “style” Huang and Belongie (2017). This more powerful discriminator discourages the foreground generator from including parts of the background in the foreground object(s). We provide detailed network and loss definitions in the supplemental material.
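For illustration, the non-saturating generator loss and the spatial “style” statistics can be sketched as follows; these are numpy stand-ins operating on given logits and feature maps, not the full training loop:

```python
import numpy as np

def g_nonsat_loss(fake_logits):
    """Non-saturating generator loss: -log sigmoid(D(G(z))),
    written as softplus(-logits) for numerical stability."""
    return np.mean(np.logaddexp(0.0, -fake_logits))

def style_stats(features):
    """Per-channel mean and std over the spatial dims of a (C, H, W) feature
    map; these are the statistics the style discriminator classifies."""
    c = features.shape[0]
    x = features.reshape(c, -1)
    return x.mean(axis=1), x.std(axis=1)
```

The loss decreases as the discriminator assigns higher logits to generated images, and the style statistics summarise a feature map independently of spatial layout, which is what lets the style discriminator penalise “background leaking into foreground” artefacts.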

4 Experiments

Figure 4: BlockGAN enables manipulation of individual objects (rotation, translation, changing identity of background or foreground) across different datasets: (i) Synth-Car-One, (ii) Synth-Car-Two, and (iii) Synth-Chairs. Notice how the shadow and highlight change as objects move around in the scene, and how changing the background lighting affects the appearance of foreground objects. In contrast, HoloGAN does not provide similar object-aware controls. Figure 5 shows similar results on natural images. Please refer to the supplemental video.


We train BlockGAN on both synthetic and natural image datasets of increasing complexity, at a resolution of 64×64 pixels. For the synthetic Synth-Chairs dataset, we render images of 3D chair models with high-quality textures from PhotoShape Park et al. (2018) using Blender. We create two synthetic car datasets: one with a single car (Synth-Car-One) and one with two cars (Synth-Car-Two). We collect 3D car models from online shape repositories such as TurboSquid and 3D Warehouse, and manually augment their materials. For all synthetic images, we use the background and lighting setup provided by the CLEVR dataset Johnson et al. (2017). We use the real Car dataset by Yang et al. (2015) and randomly crop images during training.

Implementation details.

We assume a fixed and known number of objects. The architectures of the foreground and background object generators are similar, with the same number of output channels, but foreground generators have twice as many channels in the learnt constant tensor. Since foreground objects are smaller than the background, we keep the background object fixed at scale 1 and randomly sample scales below 1 for the foreground objects. Please see the supplemental material for more details. We will make our code publicly available.

4.1 Qualitative results

We show qualitative results on datasets of increasing complexity. Figure 4 shows that BlockGAN learns to disentangle the different objects within a scene: foreground from background, and between multiple foreground objects, despite only being trained with unlabelled images. This enables smooth manipulation of each object’s pose θ_i and identity z_i. More importantly, since BlockGAN combines deep object features into scene features, changes to an object’s properties also influence the background, e.g., an object’s shadows and highlights adapt to its movement. These effects are best observed in the animations in our supplemental material.

4.2 Quantitative results

We evaluate the visual fidelity of BlockGAN’s results using Kernel Inception Distance (KID; Bińkowski et al., 2018). Unlike FID Heusel et al. (2017), KID has an unbiased estimator and works even for a small number of images. Note that KID does not measure the quality of disentanglement, which is the main contribution of BlockGAN.
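KID is the squared maximum mean discrepancy (MMD) between Inception features of real and generated images under a polynomial kernel. A minimal numpy sketch of the unbiased estimator, assuming the Inception features have already been extracted (function names are ours):

```python
import numpy as np

def polynomial_kernel(x, y, degree=3):
    """k(x, y) = (x . y / dim + 1)^degree, the kernel used for KID."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** degree

def kid(real_feats, fake_feats):
    """Unbiased MMD^2 estimate between two (N, dim) feature matrices."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    # drop diagonal terms so the within-set sums are unbiased
    sum_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    sum_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return sum_rr + sum_ff - 2.0 * k_rf.mean()
```

Unlike FID, this estimator has no bias term that grows with small sample sizes, which is why KID remains meaningful for modest numbers of images.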

We first compare BlockGAN with a vanilla GAN (WGAN-GP by Gulrajani et al., 2017). Secondly, we compare with LR-GAN Yang et al. (2017), a 2D-based method that learns to generate the image background and foregrounds separately and recursively. Finally, we compare with HoloGAN, which learns 3D scene representations that separate camera pose and identity, but does not consider object disentanglement. For WGAN-GP, we use a publicly available implementation. For LR-GAN and HoloGAN, we use the code provided by the authors. We tune hyperparameters and then compute the KID for 10,000 images generated by each model. Table 1 shows that BlockGAN generates images with competitive or better visual fidelity than other methods. Figure 5 shows results on natural images, where BlockGAN also enables object-centric modifications.

WGAN-GP (2017)    0.141 ± 0.002   0.111 ± 0.002   0.035 ± 0.001
LR-GAN (2017)     0.038 ± 0.001   0.036 ± 0.002   0.014 ± 0.001
HoloGAN (2019)    0.070 ± 0.001   0.058 ± 0.002   0.028 ± 0.002
BlockGAN (ours)   0.039 ± 0.001   0.031 ± 0.001   0.016 ± 0.001
Table 1: KID estimates (mean ± std; lower is better) between real images and images generated by BlockGAN and other GANs. BlockGAN achieves competitive KID scores while providing control over objects in the generated images (which is not measured by KID).

4.3 3D object manipulation beyond the training set

Geometric object-centric manipulation.

Since objects are disentangled in BlockGAN’s scene representation, we can manipulate them separately. Here, we apply spatial manipulations that were not part of the similarity transformation used during training, such as horizontal stretching, or slicing and combining different foreground objects. Figure 6 shows that the learnt deep 3D object representations can be modified intuitively, despite the model never having seen explicit 3D geometry or multiple views of the same object during training. Additionally, changes to the foreground objects also lead to corresponding changes in shadows and highlights. This shows the advantage of learning both 3D object features and their combination into 3D scene features in the generator of BlockGAN.
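For intuition, a manipulation like horizontal stretching amounts to resampling the learnt feature grid along one axis. A toy numpy sketch using nearest-neighbour resampling for brevity (the model itself resamples trilinearly; function name and details are ours):

```python
import numpy as np

def stretch_x(features, factor):
    """Stretch a (C, D, H, W) feature grid along its last (x) axis by
    resampling with nearest-neighbour indices about the grid centre;
    indices falling outside the grid are clamped to zero, as for the
    pose transforms."""
    c, d, h, w = features.shape
    centre = (w - 1) / 2.0
    src = np.round(centre + (np.arange(w) - centre) / factor).astype(int)
    valid = (src >= 0) & (src < w)
    out = np.zeros_like(features)
    out[..., valid] = features[..., src[valid]]
    return out
```

Because the manipulation happens on the object features before scene composition and rendering, the renderer still produces consistent shadows and highlights for the stretched object.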

Figure 5: Car. Even for natural images with cluttered backgrounds, BlockGAN disentangles the objects in a scene well. Note that interpolating the background noise vector affects the appearance of the car in a meaningful way, showing the benefit of 3D scene features.
Figure 6: Geometric modification of the learnt 3D object features (unless stated, background is fixed): splitting and combining (top), stretching (middle), and adding and manipulating new objects after training (bottom). Bottom row shows (a) Original scene, (b) new object added, (c) manipulated, (d–e) different background appearance, and (f) more objects added. Note the realistic lighting and shadows.

Scene modification by adding objects.

The 3D object features learnt by BlockGAN can also be reused to add more objects to the scene at test time. Here we use BlockGAN trained on datasets with only one background and one foreground object, and show that more foreground objects of the same category can be added to the same scene to create novel scenes with multiple foreground objects. Figure 6 (bottom) shows that new objects can be added and manipulated just like the original objects while maintaining realistic shadows and highlights. This shows that BlockGAN has learnt 3D object representations that can be reused and manipulated intuitively, instead of merely memorizing training images.

4.4 Benefits of 3D object-aware scene representations

Comparison to 2D-based LR-GAN

Yang et al. (2017) first generate 2D background layers, and then generate foreground layers that are combined with the generated background using alpha compositing. Both BlockGAN and LR-GAN show the importance of combining objects in a contextually relevant manner to generate visually realistic images (see Table 1). However, LR-GAN does not offer explicit control over an object’s location. More importantly, LR-GAN learns an entangled representation of foreground and background: when fixing the foreground noise vector and sampling a different background, the foreground object also changes (Figure 7). Finally, unlike BlockGAN, LR-GAN does not allow adding more foreground objects at test time. This demonstrates the benefits of learning disentangled 3D object features compared to a 2D-based approach.

Comparison to entangled 3D scene representation

We compare BlockGAN with HoloGAN, which also learns deep 3D scene features but does not consider object disentanglement. In particular, HoloGAN considers only one noise vector for identity and one pose for the entire scene, and does not include translation as part of the pose. While HoloGAN works well for object-centred scenes, it struggles with moving foreground objects. Figure 4 shows that HoloGAN tends to associate each pose with a fixed identity (i.e., moving objects erroneously changes the identity of both foreground and background), while changing the noise vector only changes a small part of the background. BlockGAN, on the other hand, separates identity and pose for each object, while still learning scene-level effects such as lighting and shadows.

Figure 7: Comparison between LR-GAN (top row) and BlockGAN (bottom row) for Synth-Car-One (left) and Car (right).

4.5 Ablation study

Scene composer function.

We consider three functions: (i) element-wise summation, (ii) element-wise maximum, and (iii) an MLP. We train BlockGAN with each function and compare their performance in terms of visual quality (KID score) in Table 2. While all three functions can successfully combine objects into a scene, the element-wise maximum performs best and easily generalises to multiple objects. Therefore, we use the element-wise maximum for BlockGAN.

Method   Synth-Car-One (64×64)   Synth-Chairs (64×64)
Sum      0.040 ± 0.002           0.038 ± 0.001
MLP      0.044 ± 0.001           0.033 ± 0.001
Max      0.039 ± 0.002           0.031 ± 0.001
Table 2: KID estimates (mean ± std) for different scene composer functions.

Non-uniform pose distribution.

For the natural Car dataset, we observe that BlockGAN has difficulty learning the full 360° rotation of the car, even though foreground and background are disentangled well. We hypothesise that this is caused by the mismatch between the true (unknown) pose distribution of the cars and the uniform pose distribution we assume during training. To test this, we create a synthetic dataset similar to Synth-Car-One but with a limited range of rotations, and train BlockGAN with uniform pose sampling. On this imbalanced dataset, BlockGAN still correctly disentangles foreground and background (Figure 8, bottom). However, rotating the car only produces images with frontal or near-frontal views (top), while moving the car along the depth dimension produces cars that are also randomly rotated sideways (middle). We observe similar behaviour for the natural Car dataset. This suggests that learning object disentanglement and learning the full range of 3D pose rotations are two largely independent problems: while assuming a uniform pose distribution already enables good object disentanglement, learning the pose distribution directly from the training data would likely improve the quality of 3D transformations.

The supplemental material includes additional studies for explicitly modelling the perspective camera and adopting the style discriminator for scenes with cluttered backgrounds.

Figure 8: Different manipulations applied to BlockGAN trained on (a) a dataset with imbalanced rotations, and (b) a balanced dataset.

5 Discussion and Future Work

We introduced BlockGAN, an image generative model that learns 3D object-aware scene representations from unlabelled images. We show that BlockGAN can learn a disentangled scene representation both in terms of objects and their properties, which allows geometric manipulations not observed during training. Most excitingly, even when BlockGAN is trained with fewer objects, additional 3D object features can be added to the scene features at test time to create novel scenes with multiple objects. In addition to computer graphics applications, this opens up exciting possibilities, such as combining BlockGAN with models like BiGAN Donahue et al. (2017) or ALI Dumoulin et al. (2017) to learn powerful object representations for scene understanding and reasoning.

Future work can adopt more powerful relational learning models Santoro et al. (2017); Vaswani et al. (2017); Kipf et al. (2020) to learn more complex object interactions such as inter-object shadows or reflections. Currently, we assume prior knowledge of the number of objects for training. We also assume object poses are uniformly distributed and independent from each other. Therefore, we believe that BlockGAN can be further extended if more accurate information can be acquired from the training images. Finally, we can enforce stronger view consistency as objects move around the scene by using videos or multiple views of the same scene.


Acknowledgements

We received support from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 665992, the EPSRC Centre for Doctoral Training in Digital Entertainment (EP/L016540/1), RCUK grant CAMERA (EP/M023281/1), an EPSRC-UKRI Innovation Fellowship (EP/S001050/1), and an NVIDIA Corporation GPU Grant. We received a gift from Adobe.

Appendix A Additional ablation studies

Learning without the perspective camera

Here we show the advantage of implementing the perspective camera explicitly, compared to using a weak-perspective projection like HoloGAN (Nguyen-Phuoc et al., 2019). Since a perspective camera directly affects foreshortening, it provides strong cues for BlockGAN to resolve the scale/depth ambiguity. This is especially important for BlockGAN to learn to project and reason over occlusion by concatenating the depth and channel dimensions, followed by an MLP. Since the MLP is very flexible, BlockGAN trained without a perspective camera tends to learn to associate an object’s identity with scale and depth, while changing depth only changes the object’s appearance (see Figure 9).
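To illustrate why a perspective camera resolves the scale/depth ambiguity, consider a minimal numpy sketch (a hypothetical toy projection, not BlockGAN's actual projection code): under weak perspective, translating an object in depth only rescales its image uniformly, which is indistinguishable from changing the object's scale, whereas under full perspective the foreshortening within the object changes.

```python
import numpy as np

def perspective(points, f=35.0):
    # Full perspective: divide each point by its own depth (foreshortening).
    return f * points[:, :2] / points[:, 2:3]

def weak_perspective(points, f=35.0):
    # Weak perspective: divide all points by the object's mean depth.
    return f * points[:, :2] / points[:, 2].mean()

# An object with some depth extent, and the same object translated in depth.
obj = np.array([[1.0, 0.0, 10.0],
                [1.0, 0.0, 12.0]])
translated = obj + np.array([0.0, 0.0, 10.0])

# Ratio of the two projected x-coordinates: a proxy for internal proportions.
ratio = lambda p: p[0, 0] / p[1, 0]
```

Under weak perspective, `obj` and `translated` project with identical internal proportions (only the overall scale changes), so depth translation and object scale cannot be told apart; under full perspective the proportions differ, giving the generator a foreshortening cue.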

Figure 9: The effect of modelling the perspective camera explicitly (b) compared to using a weak-perspective camera (a). Note that with the weak-perspective camera (a), translation along the depth dimension (top) leads to identity changes without any translation in depth, while changing the latent vector (bottom) changes both depth translation and, to a lesser extent, the object identity. Using a perspective camera correctly disentangles position and identity (b).

Learning without the style discriminator

When BlockGAN is trained with a standard discriminator on datasets with a cluttered background, such as the real Car dataset, the foreground object features tend to include part of the background. This creates visual artefacts when objects move in the scene (indicated by red arrows in Figure 10a). We hypothesise that these artefacts should be picked up by the discriminator, since they make generated images look unrealistic. Therefore, we add more powerful style discriminators (Nguyen-Phuoc et al., 2019) to the original discriminator at different layers. Figure 10b shows that the generator is indeed discouraged from adding background information to the foreground object features, leading to cleaner results.

Figure 10: With a standard discriminator (a), a part of the background appearance is baked into the foreground object (see red arrows). Adding the style discriminator (b) cleanly separates the car from the background.

Additional details for the imbalanced rotation ablation study

To generate the imbalanced rotation dataset, we use the same general setup as Synth-Car-One (see Appendix D). However, instead of sampling the car’s rotation about the up-axis uniformly over the full 360°, we sample it uniformly within ±15° of the front/left/back/right viewing directions. In other words, the car is only ever seen within a 30° arc centred on each of the four directions, and there are four evenly spaced gaps of 60° that are never observed, for example views from the front-right.
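The sampling scheme above can be sketched as follows (a minimal illustration; the angle encoding of the four directions as multiples of 90° is our assumption):

```python
import numpy as np

def sample_imbalanced_rotation(rng):
    """Sample an azimuth uniformly within +/-15 degrees of one of the four
    canonical viewing directions (front/left/back/right), leaving four
    evenly spaced 60-degree gaps that are never observed."""
    centre = rng.choice([0.0, 90.0, 180.0, 270.0])  # hypothetical encoding
    offset = rng.uniform(-15.0, 15.0)
    return (centre + offset) % 360.0
```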

Appendix B Comparison to other methods

In Figure 11, we show samples generated by WGAN-GP (Gulrajani et al., 2017), LR-GAN (Yang et al., 2017), HoloGAN (Nguyen-Phuoc et al., 2019) and our BlockGAN. Compared to the other models, BlockGAN produces samples of very competitive quality, and offers explicit control over the poses of objects in the generated images.

Figure 11: Samples from WGAN-GP, LR-GAN, HoloGAN and our BlockGAN (from top to bottom) trained on the datasets Synth-Car-One, Synth-Chairs and Cars (from left to right).

Implementation details

For WGAN-GP, we use a publicly available implementation. For LR-GAN and HoloGAN, we use the code provided by the authors. Note that for HoloGAN, we modify the 3D transformation to add translation during training, since this method assumes that foreground objects are at the image centre.

Name # Images Azimuth Elevation Scaling Horiz. transl. Depth transl.
Synth-Car-One 80,000 0° – 359° 45° 0.5 – 0.6 –5 – 5 –5 – 5
Synth-Car-Two 80,000 0° – 359° 45° 0.5 – 0.6 –5 – 5 –5 – 5
Synth-Chairs 100,000 0° – 359° 45° 0.5 – 0.6 –5 – 5 –5 – 5
Cars (Yang et al., 2015) 139,714 0° – 359° 0° – 35° 0.5 – 0.8 –3 – 4 –5 – 6
Table 3: Datasets used in our paper. ‘Azimuth’ describes the object rotation about the up axis. ‘Elevation’ refers to the camera elevation. ‘Scaling’ is the scale factor applied to foreground objects. ‘Horiz. transl.’ and ‘Depth transl.’ are horizontal and depth translations of the foreground object relative to the global origin. Ranges represent uniform random distributions.

Appendix C Loss function

For datasets with cluttered backgrounds like the natural Car dataset, we adopt style discriminators in addition to the normal image discriminator (see the benefit in Figure 10). Style discriminators perform the same real/fake classification task as the standard image discriminator, but at the feature level across different layers. In particular, style discriminators classify the mean and standard deviation of the features at different layers (which are believed to describe the image “style”). For the features \Phi_l(x) at layer l, the mean and standard deviation are computed per channel, across the batch and spatial dimensions:

\mu_l = \frac{1}{NHW} \sum_{n,h,w} \Phi_l(x)_{n,h,w}, \qquad \sigma_l = \sqrt{\frac{1}{NHW} \sum_{n,h,w} \bigl(\Phi_l(x)_{n,h,w} - \mu_l\bigr)^2}

The style discriminators are implemented as MLPs with sigmoid activation functions for binary classification. The style discriminator at layer l is written as

D_l(x) = \mathrm{MLP}_l\bigl([\mu_l, \sigma_l]\bigr)

The total loss can therefore be written as

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{GAN}} + \sum_l \lambda_l \, \mathcal{L}_{\text{style}_l}

We use one set of loss weights \lambda_l for all natural datasets and another for the synthetic datasets.
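As a concrete sketch of these statistics, the per-channel mean and standard deviation over batch and spatial dimensions can be computed as follows (a minimal numpy sketch, assuming NCHW layout; the exact implementation may differ):

```python
import numpy as np

def style_stats(features, eps=1e-8):
    """Per-channel mean and standard deviation of a feature map,
    computed over the batch and spatial dimensions (NCHW layout assumed).
    These are the statistics the style discriminators classify."""
    mu = features.mean(axis=(0, 2, 3))
    sigma = np.sqrt(features.var(axis=(0, 2, 3)) + eps)
    return mu, sigma
```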

Appendix D Datasets

For the synthetic datasets (Synth-Car-One, Synth-Car-Two, Synth-Chairs), we use the scene set-ups provided by the CLEVR dataset (Johnson et al., 2017). These include a fixed, grey background, a virtual camera with fixed parameters but random location jittering, and random lighting. We also use the render script from CLEVR to randomly place foreground objects into the scene and render them. We render all images at a resolution of 128×128 pixels, and bilinearly downsample them to 64×64 for training. For the natural Car dataset, each image is first scaled such that its smaller side is 64 pixels, and then cropped to 64×64 pixels. During training, we randomly move the 64×64 cropping window before cropping the image. Figure 12 includes samples from our generated dataset.
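The random-crop step for the natural Car dataset can be sketched as follows (a minimal illustration of the cropping-window placement; function name and HWC layout are our assumptions):

```python
import numpy as np

def random_crop_64(image, rng):
    """Randomly place a 64x64 cropping window inside an image whose
    smaller side has already been scaled to 64 pixels (HWC layout)."""
    h, w = image.shape[:2]
    assert min(h, w) == 64, "smaller side must already be scaled to 64"
    top = rng.integers(0, h - 64 + 1)
    left = rng.integers(0, w - 64 + 1)
    return image[top:top + 64, left:left + 64]
```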

Link for 3D textured chair models:

Link for natural Car dataset:

Figure 12: Samples from the synthetic datasets.

Appendix E Implementation

e.1 Training details

Virtual camera model

We assume a virtual camera with a focal length of 35 mm and a sensor size of 32 mm (Blender’s default values), which corresponds to an angle of view of approximately 49.1 degrees (we use the same setup for natural images).
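The angle of view follows directly from the stated focal length and sensor size:

```python
import math

# Blender's default camera: 35 mm focal length, 32 mm sensor width.
focal_mm, sensor_mm = 35.0, 32.0
fov_deg = 2 * math.degrees(math.atan(sensor_mm / (2 * focal_mm)))
# fov_deg is approximately 49.1 degrees
```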


We initialise all weights using and biases as . For all synthetic datasets, we use a , and (for ), to account for their relative visual complexity. For the natural Cars dataset, we use and . Table 3 describes the ranges of pose we use for sampling during training.


We train BlockGAN using the Adam optimiser (Kingma and Ba, 2015), with and .

We use the same learning rate for both the discriminator and the generator. Empirically, we find that updating the generator twice for every update of the discriminator produces images with the best visual fidelity. We use a learning rate of 0.0001 for all synthetic datasets (Synth-Car-One, Synth-Car-Two and Synth-Chairs). For the natural Cars dataset, we use a learning rate of 0.00005.

We train on all datasets with a batch size of 64 for 50 epochs. Training takes 1.5 days for the synthetic datasets and 3 days for the natural Cars dataset.
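The alternating update schedule can be sketched as a minimal skeleton (`discriminator_step` and `generator_step` stand in for the actual Adam updates; names are ours):

```python
def train(discriminator_step, generator_step, num_steps, g_updates_per_d=2):
    """Alternating GAN training: update the generator twice for every
    discriminator update, which empirically gives the best visual fidelity."""
    for _ in range(num_steps):
        discriminator_step()
        for _ in range(g_updates_per_d):
            generator_step()
```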


All models were trained using a single GeForce RTX 2080 GPU.

e.2 Network architecture

We describe the network architecture for the BlockGAN foreground object generator in Table 4, the BlockGAN background generator in Table 5, and the overall BlockGAN generator in Tables 6 and 7 for synthetic and real datasets, respectively. Note that we use ReLU for the synthetic datasets and LReLU for the natural Car dataset after the AdaIN layer. The discriminator is described in Table 8.

In terms of the notation in Section 3 of the main paper, object features have dimensions , scene features have the same dimensions , and camera features have dimensions (before up-convolutions to ) with channels for synthetic datasets and channels for natural image datasets.
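The “Element-wise maximum” rows in the generator tables combine the per-object feature grids into a single scene feature grid; a minimal sketch (the array shapes in the test are illustrative only):

```python
import numpy as np

def compose_scene(object_features):
    """Combine the 3D feature grids of all objects (foreground and
    background) into one scene feature grid with an element-wise maximum.
    Additional objects can be added at test time by extending the list."""
    return np.maximum.reduce(object_features)
```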

Table 4: Network architecture of the BlockGAN foreground (FG) object generator.
Layer type Kernel size Stride Normalisation Output dimension
Learnt constant tensor AdaIN
UpConv 2 AdaIN
UpConv 2 AdaIN
3D transformation
Table 5: Network architecture of the BlockGAN background (BG) object generator.
Layer type Kernel size Stride Normalisation Output dimension
Learnt constant tensor AdaIN
UpConv 2 AdaIN
UpConv 2 AdaIN
3D transformation
Table 6: Network architecture of the BlockGAN generator for Synth-Car-One, Synth-Car-Two and Synth-Chairs.
Layer type Kernel size Stride Activation Norm. Output dimension
FG generator (Table 4) ReLU
BG generator (Table 5) ReLU
Element-wise maximum
Conv 1 ReLU
UpConv 2 ReLU AdaIN
UpConv 2 ReLU AdaIN
UpConv 1 ReLU AdaIN
Table 7: Network architecture of the BlockGAN generator for the real Cars dataset.
Layer type Kernel size Stride Activation Norm. Output dimension
FG generator (Table 4) LReLU
BG generator (Table 5) LReLU
Element-wise maximum
Conv 1 LReLU
UpConv 2 LReLU AdaIN
UpConv 2 LReLU AdaIN
UpConv 1 LReLU AdaIN
Table 8: Network architecture of the BlockGAN discriminator for both synthetic and real datasets.
Layer type Kernel size Stride Activation Normalisation Output dimension
Conv 2 LReLU IN/Spectral
Conv 2 LReLU IN/Spectral
Conv 2 LReLU IN/Spectral
Conv 2 LReLU IN/Spectral
Fully connected Sigmoid None/Spectral




References

  1. Emergence of object segmentation in perturbed generative models. In NeurIPS.
  2. Demystifying MMD GANs. In ICLR.
  3. Blender – a 3D modelling and rendering package. Blender Foundation.
  4. Large scale GAN training for high fidelity natural image synthesis. In ICLR.
  5. Unsupervised object segmentation by redrawing. In NeurIPS.
  6. On self modulation for generative adversarial networks. In ICLR.
  7. Learning to predict 3D objects with an interpolation-based differentiable renderer. In NeurIPS.
  8. InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In NIPS.
  9. Adversarial feature learning. In ICLR.
  10. Adversarially learned inference. In ICLR.
  11. GENESIS: generative scene inference and sampling with object-centric latent representations. In ICLR.
  12. Attend, infer, repeat: fast scene understanding with generative models. In NIPS.
  13. Generative adversarial nets. In NIPS.
  14. Improved training of Wasserstein GANs. In NIPS.
  15. Multiple view geometry in computer vision. Cambridge University Press. ISBN 0521540518.
  16. Escaping Plato’s cave: 3D shape from adversarial rendering. In ICCV.
  17. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS.
  18. Generating multiple objects at spatially distinct locations. In ICLR.
  19. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV.
  20. IB-GAN: disentangled representation learning with information bottleneck GAN.
  21. Image generation from scene graphs. In CVPR.
  22. CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In CVPR.
  23. Progressive growing of GANs for improved quality, stability, and variation. In ICLR.
  24. A style-based generator architecture for generative adversarial networks. In CVPR.
  25. Adam: a method for stochastic optimization. In ICLR.
  26. Contrastive learning of structured world models. In ICLR.
  27. Sequential attend, infer, repeat: generative modelling of moving objects. In NeurIPS.
  28. GAGAN: geometry-aware generative adversarial networks. In CVPR.
  29. Generating images part by part with composite generative adversarial networks. arXiv:1607.05387.
  30. Towards unsupervised learning of generative models for 3D controllable image synthesis. arXiv:1912.05237.
  31. Soft rasterizer: a differentiable renderer for image-based 3D reasoning. In ICCV.
  32. OpenDR: an approximate differentiable renderer. In ECCV.
  33. Fundamentals of computer graphics. 4th edition, A K Peters/CRC Press.
  34. Conditional generative adversarial nets. arXiv:1411.1784.
  35. RenderNet: a deep convolutional network for differentiable rendering from 3D shapes. In NeurIPS.
  36. HoloGAN: unsupervised learning of 3D representations from natural images. In ICCV.
  37. Transformable bottleneck networks. In ICCV.
  38. How to make a pizza: learning a compositional layer-based GAN model. In CVPR.
  39. Transformation-grounded image generation network for novel 3D view synthesis. In CVPR.
  40. PhotoShape: photorealistic materials for large-scale shape collections. ACM Transactions on Graphics 37 (6).
  41. Learning what and where to draw. In NIPS.
  42. Neural voxel renderer: learning an accurate and controllable rendering tool. arXiv:1912.04591.
  43. Unsupervised geometry-aware representation for 3D human pose estimation. In ECCV.
  44. A simple neural network module for relational reasoning. In NIPS.
  45. Deforming autoencoders: unsupervised disentangling of shape and appearance. In ECCV.
  46. DeepVoxels: learning persistent 3D feature embeddings. In CVPR.
  47. Scene representation networks: continuous 3D-structure-aware neural scene representations. In NeurIPS.
  48. A layer-based sequential framework for scene generation with GANs. In AAAI.
  49. A case for object compositionality in deep generative models of images. In NeurIPS Workshops.
  50. Attention is all you need. In NIPS.
  51. Perspective transformer nets: learning single-view 3D object reconstruction without 3D supervision. In NIPS.
  52. LR-GAN: layered recursive generative adversarial networks for image generation. In ICLR.
  53. A large-scale car dataset for fine-grained categorization and verification. In CVPR.
  54. Visual object networks: image generation with disentangled 3D representations. In NeurIPS.