# LAVAE: Disentangling Location and Appearance

###### Abstract

We propose a probabilistic generative model for unsupervised learning of structured, interpretable, object-based representations of visual scenes. We use amortized variational inference to train the generative model end-to-end. The learned representations of object location and appearance are fully disentangled, and objects are represented independently of each other in the latent space. Unlike previous approaches that disentangle location and appearance, ours generalizes seamlessly to scenes with many more objects than encountered in the training regime. We evaluate the proposed model on multi-MNIST and multi-dSprites data sets.

## 1 Introduction

Many hallmarks of human intelligence rely on the capability to perceive the world as a layout of distinct physical objects that endure through time—a skill that infants acquire in early childhood [spelke1990principles, spelke2013perceiving, spelke2007core]. Learning compositional, object-based representations of visual scenes, however, is still regarded as an open challenge for artificial systems [bengio2013representation, garnelo2019reconciling].

Recently, there has been growing interest in unsupervised learning of disentangled representations [locatello2018challenging], which should separate the distinct, informative factors of variation in the data and capture all the information about the data in a compact, interpretable structure [bengio2013representation]. This notion is highly relevant to visual scene representation learning, where distinct objects should arguably be represented in a disentangled fashion. However, despite recent breakthroughs [chen2016infogan, higgins2017beta, kim2018disentangling], multi-object scenarios are rarely considered [eslami2016attend, van2018relational, burgess2019monet].

We propose the Location-Appearance Variational AutoEncoder (LAVAE), a probabilistic generative model that, without supervision, learns structured, compositional, object-based representations of visual scenes. We explicitly model an object’s location and appearance with distinct latent variables, unlike in most previous works, thus providing a highly beneficial inductive bias. Following the framework of variational autoencoders (VAEs) [kingma2013auto, rezende2014stochastic], we parameterize the approximate variational posterior of the latent variables with inference networks that are trained end-to-end with the generative model. Our model learns to correctly count objects and compute a compositional, object-wise, interpretable representation of the scene. Objects are represented independently of each other, and each object’s location and appearance are disentangled. Unlike previous approaches that disentangle location and appearance, LAVAE generalizes seamlessly to scenes with many more objects than in the training regime. We demonstrate these capabilities on multi-MNIST and multi-dSprites data sets similar to those by eslami2016attend and greff2019multi.

## 2 Method

#### Generative model.

We propose a latent variable model for images in which the latent space is factored into the location and appearance of a variable number of objects. For each image $\mathbf{x}$ with $D$ pixels, the number of objects is modeled by a latent variable $n$, their locations by latent variables $z^{\mathrm{loc}}_1, \dots, z^{\mathrm{loc}}_n$, and their appearance by $z^{\mathrm{app}}_1, \dots, z^{\mathrm{app}}_n$. We assume the number of objects in every image to be bounded by $n_{\max}$. The joint distribution of the observed and latent variables for each data point is:

$$p_\theta(\mathbf{x}, \mathbf{z}^{\mathrm{loc}}, \mathbf{z}^{\mathrm{app}}, n) = p_\theta(\mathbf{x} \mid \mathbf{z}^{\mathrm{loc}}, \mathbf{z}^{\mathrm{app}}, n)\, p(\mathbf{z}^{\mathrm{loc}} \mid n)\, p(\mathbf{z}^{\mathrm{app}} \mid n)\, p(n) \tag{1}$$

where we use the shorthand $\mathbf{z}^{\mathrm{loc}} = (z^{\mathrm{loc}}_1, \dots, z^{\mathrm{loc}}_n)$ and similarly for $\mathbf{z}^{\mathrm{app}}$.

The generative process can be described as follows. First, the number of objects $n$ is sampled from a categorical distribution

$$p(n) = \mathrm{Cat}(n \mid \boldsymbol{\pi}) \tag{2}$$

where $\boldsymbol{\pi}$ is a learned probability vector of size $n_{\max} + 1$. The location variables are then sampled sequentially without replacement from a categorical distribution with $D$ classes:

$$p(\mathbf{z}^{\mathrm{loc}} \mid n) = \prod_{i=1}^{n} \mathrm{Cat}\left(z^{\mathrm{loc}}_i \mid \boldsymbol{\pi}^{\mathrm{loc}}_i\right) \tag{3}$$

where $\boldsymbol{\pi}^{\mathrm{loc}}_1 = \frac{1}{D}\mathbf{1}_D$, each subsequent $\boldsymbol{\pi}^{\mathrm{loc}}_i$ is obtained by zeroing the entries at previously sampled locations and renormalizing, and $\mathbf{1}_D$ is a vector of ones of length $D$. To each $z^{\mathrm{loc}}_i$, which is a one-hot representation of an object's location, corresponds a continuous appearance vector $z^{\mathrm{app}}_i$ that describes the object.

The likelihood function $p_\theta(\mathbf{x} \mid \mathbf{z}^{\mathrm{loc}}, \mathbf{z}^{\mathrm{app}}, n)$ is parameterized in a compositional manner. For each image, the visual representation, or sprite, of the $i$th object is generated from $z^{\mathrm{app}}_i$ by a shared function $f_\theta$. Each sprite is then convolved with a 2-dimensional Kronecker delta, namely the one-hot representation of the object's location. Finally, the resulting tensors are added together to give the pixel-wise parameters $\tilde{\mathbf{x}}$ of the output distribution:

$$\tilde{\mathbf{x}} = \sum_{i=1}^{n} z^{\mathrm{loc}}_i * f_\theta(z^{\mathrm{app}}_i) \tag{4}$$

where $*$ denotes 2-dimensional discrete convolution.
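Since convolution with a Kronecker delta simply translates the sprite, this composition can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the paper's implementation: it assumes a top-left placement convention and that every sprite fits inside the canvas.

```python
import numpy as np

def compose_scene(sprites, locations, canvas_shape):
    """Place each sprite on a black canvas at its location and sum.

    sprites:      list of 2-D arrays (the decoded sprites)
    locations:    list of (row, col) indices, i.e. the index of the
                  "hot" entry of each one-hot location variable
    canvas_shape: (H, W) of the output image
    """
    H, W = canvas_shape
    canvas = np.zeros((H, W))
    for sprite, (r, c) in zip(sprites, locations):
        h, w = sprite.shape
        # Convolving with a Kronecker delta at (r, c) amounts to
        # pasting the sprite with its origin at (r, c).
        canvas[r:r + h, c:c + w] += sprite
    return canvas
```

The returned canvas holds the pixel-wise parameters of the output distribution, to which a nonlinearity or clipping would be applied in practice.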

#### Inference model.

The approximate posterior for a data point $\mathbf{x}$ has the following form:

$$q_\phi(\mathbf{z}^{\mathrm{loc}}, \mathbf{z}^{\mathrm{app}}, n \mid \mathbf{x}) = q_\phi(n \mid \mathbf{x})\, q_\phi(\mathbf{z}^{\mathrm{loc}} \mid \mathbf{x}, n)\, q_\phi(\mathbf{z}^{\mathrm{app}} \mid \mathbf{z}^{\mathrm{loc}}, \mathbf{x}) \tag{5}$$

Since each data point is assumed independent, the evidence lower bound (ELBO) of the marginal log likelihood is the sum of the per-data-point bounds, each of which is:

$$\mathcal{L}(\mathbf{x}) = \mathbb{E}_{q_\phi}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z}^{\mathrm{loc}}, \mathbf{z}^{\mathrm{app}}, n)\right] - D_{\mathrm{KL}}\left(q_\phi(\mathbf{z}^{\mathrm{loc}}, \mathbf{z}^{\mathrm{app}}, n \mid \mathbf{x}) \,\big\|\, p(\mathbf{z}^{\mathrm{loc}}, \mathbf{z}^{\mathrm{app}}, n)\right) \tag{6}$$

where the first term is the expected log likelihood and the second is the negative Kullback-Leibler (KL) divergence between $q_\phi$ and the prior. The individual terms of the KL divergence can be derived as in Appendix A and estimated by Monte Carlo sampling.

Two inference networks compute appearance and location feature maps, $\mathbf{f}^{\mathrm{app}}$ and $\mathbf{f}^{\mathrm{loc}}$, both with the same spatial size as the input. The inference model for the number of objects is a categorical distribution parameterized by a function of the location features $\mathbf{f}^{\mathrm{loc}}$. Object locations follow categorical distributions without replacement, parameterized by logits given by $\mathbf{f}^{\mathrm{loc}}$. The vector at location $z^{\mathrm{loc}}_i$ in the appearance feature map represents the appearance parameters for object $i$, i.e. the mean and log variance of $q_\phi(z^{\mathrm{app}}_i \mid z^{\mathrm{loc}}_i, \mathbf{x})$. The overall inference process can be summarized as follows:

$$\mathbf{f}^{\mathrm{loc}} = g^{\mathrm{loc}}_\phi(\mathbf{x}), \qquad \mathbf{f}^{\mathrm{app}} = g^{\mathrm{app}}_\phi(\mathbf{x}) \tag{7}$$

$$n \sim q_\phi(n \mid \mathbf{x}) = \mathrm{Cat}\left(n \mid h_\phi(\mathbf{f}^{\mathrm{loc}})\right) \tag{8}$$

$$z^{\mathrm{loc}}_i \sim \mathrm{Cat}\left(z^{\mathrm{loc}}_i \mid \boldsymbol{\lambda}_i\right), \qquad i = 1, \dots, n \tag{9}$$

$$z^{\mathrm{app}}_i \sim \mathcal{N}\left(\boldsymbol{\mu}_i, \operatorname{diag}(\boldsymbol{\sigma}^2_i)\right), \qquad (\boldsymbol{\mu}_i, \log \boldsymbol{\sigma}^2_i) = \mathbf{f}^{\mathrm{app}}_{[z^{\mathrm{loc}}_i]} \tag{10}$$

where $g^{\mathrm{loc}}_\phi$ and $g^{\mathrm{app}}_\phi$ denote the two inference networks, $h_\phi$ maps the location features to a probability vector over $\{0, \dots, n_{\max}\}$, by $\mathbf{f}^{\mathrm{app}}_{[z]}$ we denote the element of $\mathbf{f}^{\mathrm{app}}$ at the location indicated by the one-hot vector $z$, and the probability vector for location sampling at each step is computed iteratively:

$$\boldsymbol{\lambda}_1 = \operatorname{softmax}(\mathbf{f}^{\mathrm{loc}}), \qquad \boldsymbol{\lambda}_{i+1} \propto \boldsymbol{\lambda}_i \odot (\mathbf{1}_D - z^{\mathrm{loc}}_i) \tag{11}$$

The expectations in the variational bound are handled as follows. For $n$ we use discrete categorical sampling: this gives a biased gradient estimator, but in practice it has not affected inference on $n$. We use the Gumbel-softmax relaxation [jang2016categorical, maddison2016concrete] for $\mathbf{z}^{\mathrm{loc}}$ and the Gaussian reparameterization trick for $\mathbf{z}^{\mathrm{app}}$.
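For illustration, the sampling-without-replacement scheme of Eq. (11) can be sketched as follows. `sample_locations_without_replacement` is a hypothetical helper, shown with plain categorical draws rather than the Gumbel-softmax relaxation used for gradient estimation.

```python
import numpy as np

def sample_locations_without_replacement(logits, n, rng=None):
    """Sequentially sample n distinct location indices from a
    categorical distribution, masking out each drawn location and
    renormalizing before the next draw.

    logits: 1-D array of unnormalized log probabilities over locations
    n:      number of objects to place
    """
    rng = rng if rng is not None else np.random.default_rng()
    probs = np.exp(logits - logits.max())  # stable softmax
    probs = probs / probs.sum()
    sampled = []
    for _ in range(n):
        idx = rng.choice(len(probs), p=probs)
        sampled.append(int(idx))
        probs[idx] = 0.0             # mask out the drawn location
        probs = probs / probs.sum()  # renormalize the remaining mass
    return sampled
```

Each draw sets the chosen location's probability to zero, so no location can be sampled twice.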

## 3 Results

In all experiments we use the same architecture for LAVAE; see Appendix B for all implementation details. Briefly, the sprite decoder in the generative model consists of a fully connected layer and a convolutional layer, followed by 3 residual convolutional blocks. The appearance and location inference networks are fully convolutional, so the feature maps $\mathbf{f}^{\mathrm{loc}}$ and $\mathbf{f}^{\mathrm{app}}$ have the same spatial size as the input image.

As a baseline we used a VAE implemented as a fully convolutional network: the absence of fully connected layers makes it easier to preserve spatial information, allowing the model to achieve a higher likelihood and generalize more naturally. This choice of baseline is also closer in spirit to our model than a VAE with fully connected layers.

We evaluate LAVAE on multi-MNIST and multi-dSprites data sets consisting of 200k images with 0 to 3 objects in each image. 190k images are used for training, whereas the remaining 10k are left for evaluation. We generated an additional test set of 10k images with 7 objects each. Examples with more than 3 objects are never used for training.

### 3.1 Multi-MNIST

Each image in the multi-MNIST data set consists of a number of statically binarized MNIST digits scattered at random (avoiding overlaps) onto a black canvas. The digits are first downscaled from their original size by bilinear interpolation and then binarized by rounding. When generating images, digits are picked from a pool of either 1k or 10k MNIST digits; we call the resulting data sets multi-MNIST-1k and multi-MNIST-10k, respectively. We independently model each pixel as a Bernoulli random variable parameterized by the decoder:

$$p_\theta(\mathbf{x} \mid \mathbf{z}^{\mathrm{loc}}, \mathbf{z}^{\mathrm{app}}, n) = \prod_{d=1}^{D} \mathrm{Bernoulli}\left(x_d \mid \tilde{x}_d\right) \tag{12}$$

where the parameter vector $\tilde{\mathbf{x}}$ defined in Section 2 gives the pixel-wise means of the Bernoulli distributions.

Figure 1 qualitatively shows the performance of LAVAE on multi-MNIST-10k test images. The inferred locations of all objects in an image are summarized by a black canvas on which white pixels indicate object presence. For each of those locations, the model infers a corresponding appearance latent variable. Each digit on the right is generated by $f_\theta$ from one of these appearance variables.

Samples from the prior are shown in Figure 2: note that the number of generated objects is consistent with the training set, since the prior is learned. Quantitative evaluation results are shown in Table 1 and compared with a VAE baseline. The inferred object count was correct on almost all images of *all test sets*—even with a held-out number of objects—and across all tested random seeds.

A fundamental characteristic of disentangled representations is that a change in a single representational factor should correspond to a change in a single underlying factor of variation [bengio2013representation, locatello2018challenging]. By demonstrating that the appearance and location of single objects can be independently manipulated in the latent space, the qualitative disentanglement experiments in Figure 3 prove that objects are disentangled from one another, and so are the appearance and location of each object.

|  | multi-MNIST-1k |  | multi-MNIST-10k |  | multi-dSprites |  |
| --- | --- | --- | --- | --- | --- | --- |
|  | log-lik. bound | count acc. | log-lik. bound | count acc. | log-lik. bound | count acc. |
| LAVAE |  | % |  | % |  | % |
| baseline |  | — |  | — |  | — |

Table 1: Results on all *test sets*. The log likelihood lower bound is estimated with 100 importance samples. Note that multi-MNIST-10k is a more complex task than multi-MNIST-1k, because the model has to capture a larger variation in appearance. The object count accuracy is measured as the percentage of images for which the inferred number of objects matches the ground truth (which is not available during training).

### 3.2 Multi-dSprites

The multi-dSprites data set is generated similarly to the multi-MNIST ones, by scattering a number of sprites onto a black canvas. Here, the sprites are simple shapes in different colors, sizes, and orientations, as in the dSprites data set [higgins2017beta, dsprites17]. The shape of each sprite is randomly chosen among square, ellipse, and triangle; each sprite's scale is randomly chosen among 6 linearly spaced values, and its orientation angle among 40 linearly spaced values. The color is uniformly chosen among the 7 colors whose RGB components are saturated (either 0 or 255) with at least one component nonzero. Each color channel of each pixel is therefore a binary random variable and can be modelled independently, as in the multi-MNIST case, leading to three times as many Bernoulli terms in the likelihood.
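The saturated palette described above can be enumerated directly; a minimal sketch:

```python
from itertools import product

# Saturated RGB palette: each component is 0 or 255, excluding pure
# black (all components zero) -- exactly the 7 colors described above.
palette = [rgb for rgb in product((0, 255), repeat=3) if any(rgb)]
```

This yields 7 colors: the three primaries, the three pairwise mixtures, and white.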

Figure 4 shows images generated by sampling from the prior in LAVAE and in the VAE baseline, where we can appreciate how our model accurately captures the number and diversity of objects in the data set. Figure 5 shows an example of inference on test images, as explained in Section 3.1. Finally, as for multi-MNIST, we performed disentanglement experiments in which, starting from the latent representation of a test image, we manipulate the latent variables: in Figure 6 we permute the location variables to swap objects, whereas Figure 7 shows the effect of separately altering the location and appearance latent variables of a single object.

### 3.3 Generalizing to more objects

As mentioned above, LAVAE correctly infers the number of objects in images from the 7-object versions of our data sets, despite being trained only on images with up to 3 objects. Furthermore, by representing each object independently and disentangling its location and appearance, it accurately decomposes 7-object scenes and allows intervention as easily as in images with fewer objects. Figure 8 demonstrates this on the 7-object version of the multi-MNIST-10k data set. Finally, in Figure 9 we show images generated by LAVAE after modifying the prior $p(n)$ to be uniform.

## 4 Related work

Our work builds on recent advances in probabilistic generative modelling, in particular variational autoencoders (VAEs) [kingma2013auto, rezende2014stochastic]. One of the methods closest to ours in spirit is Attend Infer Repeat (AIR) [eslami2016attend], which performs explicit object-wise inference through a recurrent network that iteratively attends to one object at a time. A limitation of this approach, however, is that it has not been shown to generalize well to a larger number of objects. Also closely related is the multi-entity VAE [nash17], in which multiple objects are independently modelled by different latent variables. Its inference process does not include an explicit attention mechanism, and instead uses a spatial KL map as a proxy for object presence. Each object's latent is decoded into a full image, and these are aggregated by an element-wise operation, so the representations of each object's location and appearance are entangled. In the same spirit, the recently proposed generative models MONet [burgess2019monet] and IODINE [greff2019multi] learn without supervision to segment the scene into independent and interpretable object-based parts. Although these are more flexible than AIR and can model more complex scenes, the representations they learn of object location and appearance are not disentangled. All methods cited here are likelihood-based, so they can and should be compared in terms of test log likelihood; we leave this for future work.

Other unsupervised approaches to visual scene decomposition include Neural Expectation Maximization [greff2017neural, van2018relational], which amortizes the classic EM for a spatial mixture model, and Generative Query Networks [eslami2018neural], that learn representations of rich 3D scenes but do not factor them into objects and need point-of-view information during training. Methods following the vision-as-inverse-graphics paradigm [poggio1985computational, yuille2006vision] learn structured, object-centered representations by making strong assumptions on the latent codes or by exploiting the true generative model [kulkarni2015deep, wu2017neural, tian2019learning]. Non-probabilistic approaches to scene understanding include adversarially trained generative models [pathak2016context] and self-supervised methods [doersch2015unsupervised, vondrick2018tracking]. These, however, do not explicitly tackle representation learning, and often have to rely on heuristics such as region masks. Finally, examples of supervised approaches are semantic and instance segmentation [ronneberger2015u, he2017mask, jegou2017one, liu2018path], where acquiring labels for training is typically expensive, and the focus is not on learning structured representations.

## 5 Conclusion

We presented LAVAE, a probabilistic generative model for unsupervised learning of structured, compositional, object-based representations of visual scenes. We follow the amortized stochastic variational inference framework, and approximate the latent posteriors by inference networks that are trained end-to-end with the generative model. On multi-MNIST and multi-dSprites data sets, LAVAE learns without supervision to correctly count and locate all objects in a scene. Thanks to the structure of the generative model, objects are represented independently of each other, and the location and appearance of each object are completely disentangled. We demonstrate this in qualitative experiments, where we manipulate location or appearance of single objects independently in the latent space. Our model naturally generalizes to visual scenes with many more objects than encountered during training.

These properties make LAVAE robust to scene complexity, opening up possibilities for leveraging the learned representations in downstream tasks and reinforcement learning agents. However, in order to transfer smoothly to scenes with semantically different components, the appearance latent space should itself be disentangled. Since in this work we focused on robust model-based disentanglement of location and appearance, more work is needed to fully assess and improve disentanglement in the appearance model. Another natural extension is to investigate how our method can be applied to 3D scenes and complex natural images.

## References

## Appendix A KL divergence

The KL divergence in Eq. (6) can be expanded, using the factorizations of the prior in Eq. (1) and of the approximate posterior in Eq. (5), as:

$$D_{\mathrm{KL}} = \mathbb{E}_{q_\phi}\left[\log \frac{q_\phi(n \mid \mathbf{x})}{p(n)} + \log \frac{q_\phi(\mathbf{z}^{\mathrm{loc}} \mid \mathbf{x}, n)}{p(\mathbf{z}^{\mathrm{loc}} \mid n)} + \log \frac{q_\phi(\mathbf{z}^{\mathrm{app}} \mid \mathbf{z}^{\mathrm{loc}}, \mathbf{x})}{p(\mathbf{z}^{\mathrm{app}} \mid n)}\right]$$

where all expectations can be estimated by Monte Carlo sampling.

## Appendix B Implementation details

The input image is fed into a residual network with 4 blocks having 2 convolutional layers each. Every convolution is followed by a Leaky ReLU and batch normalization. A final convolutional layer outputs the appearance feature map $\mathbf{f}^{\mathrm{app}}$. The output of an identical but independent residual network is fed into a 3-layer convolutional network with one output channel that represents the location logits $\mathbf{f}^{\mathrm{loc}}$. The logits are multiplied by a constant factor; see below for more details about the rescaling of logits.

The number of objects is then inferred by a deterministic function of the location logits. This function filters out points of $\mathbf{f}^{\mathrm{loc}}$ that are not local maxima, then counts the number of remaining points above a fixed threshold $t$. If $\hat{n}$ is the inferred number of objects in the scene, the function outputs the corresponding one-hot vector, and the distribution $q_\phi(n \mid \mathbf{x})$ deterministically takes the value $\hat{n}$. The maximum number of objects per image is fixed to $n_{\max}$.
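A minimal sketch of this counting step, assuming a 3x3 neighborhood for the local-maximum test (the neighborhood size is not specified in the text):

```python
import numpy as np

def count_objects(loc_logits, threshold):
    """Infer the number of objects from a 2-D float map of location
    logits: keep only points that are local maxima in their 3x3
    neighborhood, then count those above the threshold.
    """
    H, W = loc_logits.shape
    # Pad with -inf so border pixels compare only against real values.
    padded = np.pad(loc_logits, 1, constant_values=-np.inf)
    count = 0
    for r in range(H):
        for c in range(W):
            window = padded[r:r + 3, c:c + 3]
            is_local_max = loc_logits[r, c] >= window.max()
            if is_local_max and loc_logits[r, c] > threshold:
                count += 1
    return count
```

The threshold doubles as a hyperparameter tying object detection to the logit scale, which is why it must follow the same rescaling rule as the logits (see below).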

The locations are iteratively sampled from a categorical distribution with logits given by $\mathbf{f}^{\mathrm{loc}}$, where the logits of previously sampled locations are set to a low value to prevent them from being sampled again. At the $i$th step, when the $i$th object's location is sampled, the corresponding feature vector in $\mathbf{f}^{\mathrm{app}}$ is interpreted as the means and log variances of the components of the $i$th appearance variable. When sampling from the categorical distribution, we use Gumbel-Softmax for single-sample gradient estimation. The temperature parameter is exponentially annealed from 0.5 to 0.02 over 100k steps. The latent space for each appearance variable has a fixed number of dimensions.
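The temperature schedule can be written, for instance, as follows (`gumbel_temperature` is a hypothetical name for this sketch):

```python
def gumbel_temperature(step, t0=0.5, t1=0.02, n_steps=100_000):
    """Exponentially anneal the Gumbel-Softmax temperature from t0 to
    t1 over n_steps training steps, then hold it at t1."""
    frac = min(step / n_steps, 1.0)
    return t0 * (t1 / t0) ** frac
```

Exponential annealing keeps the relaxation smooth early on and gradually sharpens the samples toward discrete one-hot vectors.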

The sprite-generating function $f_\theta$ is a convolutional network that takes as input a sampled appearance vector $z^{\mathrm{app}}_i$. The first part of the network consists of a fully connected layer, a convolutional layer, and a bilinear interpolation. The input vector is then expanded and concatenated to the resulting tensor along the channel dimension. The second part of the network consists of 3 residual blocks with 2 convolutional layers each, followed by a final convolutional layer with a sigmoid activation. Leaky ReLU and batch normalization are used after each layer, except in the last residual block. The generated sprites have a fixed size, which differs between the multi-MNIST and multi-dSprites data sets. The mean of the Bernoulli output is then computed from these sprites as explained above.

We optimized the model with stochastic gradient descent, using Adamax with batch size 64. The initial learning rate was 1e-4 or 5e-4 for the location inference network (for the multi-MNIST and multi-dSprites data sets, respectively) and 1e-3 for the rest of the model. In the location inference net, the learning rate was exponentially decayed by a factor of 100 over 100k steps. The model has about 500k parameters, split almost evenly among location inference, appearance inference, and sprite generation.

#### Training warm-up.

In practice we found it beneficial to include a warm-up period at the beginning of training, in which an auxiliary loss is added to the negative ELBO. We train a fully convolutional VAE in parallel with our model, and take its location-wise KL in the latent space as a rough proxy for object location, as suggested by nash17. The additional loss is the squared distance between the location feature map $\mathbf{f}^{\mathrm{loc}}$ and the KL map. The network inferring object location is therefore initially encouraged to mimic a function that is a (rather rough) approximation of object location, and is then fine-tuned. Intuitively, we are biasing the sampling of $\mathbf{z}^{\mathrm{loc}}$ in favor of information-rich locations, which significantly speeds up and stabilizes training. To stabilize training further, we also found it beneficial to randomly force this behavior during training, with a probability that is 1 during warm-up (30k steps) and then linearly decays to 0.1 over the following 30k steps.
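A sketch of the auxiliary objective and its weighting schedule (the function names are hypothetical, and the exact form of the distance is an assumption):

```python
import numpy as np

def warmup_aux_loss(loc_map, kl_map):
    """Squared L2 distance between the location feature map and the
    spatial KL map of the auxiliary VAE (both 2-D, same shape)."""
    return float(((loc_map - kl_map) ** 2).sum())

def aux_weight(step, warmup=30_000, anneal=30_000):
    """Auxiliary-loss weight: 1 during warm-up, then linearly
    annealed to 0 over the following `anneal` steps."""
    if step < warmup:
        return 1.0
    return max(0.0, 1.0 - (step - warmup) / anneal)
```

The total training loss at a given step would then be the negative ELBO plus `aux_weight(step) * warmup_aux_loss(...)`.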

The auxiliary VAE is implemented as a fully convolutional network, in which the encoder consists of 3 residual blocks with downsampling between blocks, and a final convolutional layer. The decoder loosely mirrors the encoder. The latent variables are arranged as a 3D tensor. The auxiliary loss is added to the original training loss for a warm-up phase of 30k steps; in the subsequent 30k steps, its contribution to the overall loss is linearly annealed to 0.

#### Rescaling location logits.

Assume the image has $D$ locations and contains $n$ objects, and that the logits of the inferred location distribution are either 0 or $c$. The probability of one of the $n$ "hot" locations after the softmax is $p = e^{c} / (n e^{c} + D - n)$, and we want the total probability mass on hot locations to be close to 1, i.e. $p = (1 - \varepsilon)/n$ for a small constant $\varepsilon$ that should not depend on $D$. Solving for $c$ gives $c = \log\big((1-\varepsilon)(D - n)\big) - \log(\varepsilon n)$. If $D$ is large enough and $n$ is small enough, then $D - n \approx D$ and the terms not involving $D$ are relatively small in magnitude, so $c \approx \log D + k$ for a small constant $k$. The high logits should therefore be approximately proportional to $\log D$. Thus, when using the same fully convolutional network architecture for multiple image sizes, the logits should be scaled by a factor proportional to $\log D$. The threshold $t$ for counting objects should follow the same rule of thumb.

#### Baseline implementation.

As baseline we use a fully convolutional $\beta$-VAE in which the latent variables are organized as a 3D tensor. Being closer in spirit to our method, this leads to a fairer comparison; furthermore, this inductive bias makes it easier to preserve spatial information. Indeed, it empirically leads to better likelihood and generated samples than a VAE with fully connected layers (and a 1D structure for the latent variables), and models a varying number of objects more naturally. The encoder consists of a convolutional layer with stride 2, followed by 4 residual blocks with 2 convolutional layers each. Between the first two pairs of blocks, a convolutional layer with stride 2 reduces dimensionality, and a final convolutional layer outputs the required number of latent channels. The decoder takes as input a latent sample and outputs the pixel-wise Bernoulli means. Its architecture loosely mirrors the encoder's, with convolutions replaced by transposed convolutions, except for the last upsampling operation, which consists of bilinear interpolation followed by an ordinary convolution. All convolutions and transposed convolutions, except for the ones between two residual blocks, are followed by Leaky ReLU and batch normalization. The last convolutional layer is followed only by a sigmoid nonlinearity. In our experiments we linearly annealed $\beta$ from 0 to 1 over 100k steps. The model has about 1M parameters.

## Appendix C Results on multi-MNIST-1k

Here we present additional visual results on the multi-MNIST-1k data set, similar to those discussed in the main text.