LAVAE: Disentangling Location and Appearance

Andrea Dittadi & Ole Winther
Department of Applied Mathematics and Computer Science, Technical University of Denmark
Centre for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital
Bioinformatics Centre, Department of Biology, University of Copenhagen
{adit, olwi}@dtu.dk
Abstract

We propose a probabilistic generative model for unsupervised learning of structured, interpretable, object-based representations of visual scenes. We use amortized variational inference to train the generative model end-to-end. The learned representations of object location and appearance are fully disentangled, and objects are represented independently of each other in the latent space. Unlike previous approaches that disentangle location and appearance, ours generalizes seamlessly to scenes with many more objects than encountered in the training regime. We evaluate the proposed model on multi-MNIST and multi-dSprites data sets.

1 Introduction

Many hallmarks of human intelligence rely on the capability to perceive the world as a layout of distinct physical objects that endure through time—a skill that infants acquire in early childhood [spelke1990principles, spelke2013perceiving, spelke2007core]. Learning compositional, object-based representations of visual scenes, however, is still regarded as an open challenge for artificial systems [bengio2013representation, garnelo2019reconciling].

Recently, there has been a growing interest in unsupervised learning of disentangled representations [locatello2018challenging], which should separate the distinct, informative factors of variation in the data, and contain all the information about the data in a compact, interpretable structure [bengio2013representation]. This notion is highly relevant in the context of visual scene representation learning, where distinct objects should arguably be represented in a disentangled fashion. However, despite recent breakthroughs [chen2016infogan, higgins2017beta, kim2018disentangling], multi-object scenarios are rarely considered [eslami2016attend, van2018relational, burgess2019monet].

We propose the Location-Appearance Variational AutoEncoder (LAVAE), a probabilistic generative model that, without supervision, learns structured, compositional, object-based representations of visual scenes. We explicitly model an object’s location and appearance with distinct latent variables, unlike in most previous works, thus providing a highly beneficial inductive bias. Following the framework of variational autoencoders (VAEs) [kingma2013auto, rezende2014stochastic], we parameterize the approximate variational posterior of the latent variables with inference networks that are trained end-to-end with the generative model. Our model learns to correctly count objects and compute a compositional, object-wise, interpretable representation of the scene. Objects are represented independently of each other, and each object’s location and appearance are disentangled. Unlike previous approaches that disentangle location and appearance, LAVAE generalizes seamlessly to scenes with many more objects than in the training regime. We demonstrate these capabilities on multi-MNIST and multi-dSprites data sets similar to those by eslami2016attend and greff2019multi.

2 Method

Generative model.

We propose a latent variable model for images in which the latent space is factored into the location and appearance of a variable number of objects. For each image x with H × W pixels, the number of objects is modeled by a latent variable n, their locations by latent variables z^loc = (z_1^loc, ..., z_n^loc), and their appearance by z^app = (z_1^app, ..., z_n^app). We assume the number of objects in every image to be bounded by n_max. The joint distribution of the observed and latent variables for each data point is:

p(x, z^loc, z^app, n) = p(x | z^loc, z^app, n) p(z^loc | n) p(z^app | n) p(n)        (1)

where we use the shorthand z^loc = (z_1^loc, ..., z_n^loc) and similarly for z^app.

The generative process can be described as follows. First, the number of objects n is sampled from a categorical distribution

p(n) = Cat(n | π)        (2)

where π is a learned probability vector of size n_max + 1. The location variables are sequentially sampled without replacement from a categorical distribution with HW classes:

p(z_i^loc | z_1^loc, ..., z_{i-1}^loc) = Cat( z_i^loc | (1_{HW} − Σ_{j<i} z_j^loc) / (HW − i + 1) ),    i = 1, ..., n        (3)

where z_i^loc ∈ {0, 1}^{HW}, and 1_{HW} is a vector of ones of length HW. To each z_i^loc, which is a one-hot representation of an object’s location, corresponds a continuous appearance vector z_i^app that describes the object.

The likelihood function p(x | z^loc, z^app, n) is parameterized in a compositional manner. For each image, the visual representation, or sprite, of the i-th object is generated from z_i^app by a shared function g(·). Each sprite is then convolved with a 2-dimensional Kronecker delta that is the one-hot representation of the object’s location. Finally, the resulting tensors are added together to give the pixel-wise parameters x̃ of the distribution p(x | z^loc, z^app, n):

x̃ = Σ_{i=1}^{n} δ(z_i^loc) ∗ g(z_i^app)        (4)

where ∗ denotes 2-dimensional discrete convolution and δ(z_i^loc) is the one-hot vector z_i^loc reshaped to an H × W map.
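As an illustration of the generative process in Eqs. (1)–(4), the following sketch samples a scene under simplifying assumptions: the sprite decoder g is stubbed out, the appearance prior is taken to be a standard Gaussian, and the canvas and sprite sizes are placeholders rather than the settings used in the experiments.

```python
import torch

# Illustrative sketch of the generative process in Eqs. (1)-(4); sizes and the
# sprite decoder g are placeholder assumptions, not the paper's exact settings.
H, W, S, D_APP, N_MAX = 64, 64, 20, 8, 3

def g(z_app):
    # Stub sprite decoder: in the model this is a learned network mapping an
    # appearance vector to an S x S sprite with values in [0, 1].
    return torch.sigmoid(z_app[0]) * torch.ones(S, S)

def sample_scene(pi=None):
    # (2) number of objects n ~ Cat(pi); pi is learned in the model (uniform here)
    pi = torch.full((N_MAX + 1,), 1.0 / (N_MAX + 1)) if pi is None else pi
    n = int(torch.multinomial(pi, 1))

    # (3) locations sampled uniformly without replacement over the H*W pixel grid
    free = torch.ones(H * W)
    x_mean = torch.zeros(H, W)
    for _ in range(n):
        idx = int(torch.multinomial(free / free.sum(), 1))
        free[idx] = 0.0
        r, c = divmod(idx, W)

        z_app = torch.randn(D_APP)          # appearance prior assumed standard Gaussian

        # (4) convolving a sprite with the 2D Kronecker delta at (r, c) amounts to
        # placing the sprite at that location; contributions are summed over objects
        padded = torch.zeros(H + S, W + S)
        padded[r:r + S, c:c + S] = g(z_app)
        x_mean = x_mean + padded[:H, :W]

    x_mean = x_mean.clamp(max=1.0)          # keep valid Bernoulli means if sprites overlap
    return torch.bernoulli(x_mean), x_mean, n

x, x_mean, n = sample_scene()
print(n, x.shape)
```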

Inference model.

The approximate posterior for a data point has the following form:

q(z^loc, z^app, n | x) = q(n | x) ∏_{i=1}^{n} q(z_i^loc | z_1^loc, ..., z_{i-1}^loc, x) q(z_i^app | z_i^loc, x)        (5)

Since data points are assumed to be independent, the evidence lower bound (ELBO) on the marginal log likelihood is the sum over data points of the per-example bound:

L(x) = E_q [ log p(x | z^loc, z^app, n) ] − KL( q(z^loc, z^app, n | x) || p(z^loc, z^app, n) )        (6)

where the first term is the expected log likelihood and the second is the negative Kullback-Leibler (KL) divergence between q(z^loc, z^app, n | x) and p(z^loc, z^app, n). The individual terms of the KL divergence can be derived as in Appendix A and estimated by Monte Carlo sampling.

Two inference networks compute appearance and location feature maps, λ^app and λ^loc, both having the same spatial size as the input. The inference model q(n | x) for the number of objects is a categorical distribution parameterized by a function of the location features λ^loc. Object locations follow categorical distributions without replacement parameterized by the logits λ^loc. The vector at location z_i^loc in the feature map λ^app represents the appearance parameters for object i, i.e. the mean and log variance of q(z_i^app | z_i^loc, x). The overall inference process can be summarized as follows:

n ~ q(n | x) = Cat(n | f(λ^loc))        (7)
z_i^loc ~ q(z_i^loc | z_1^loc, ..., z_{i-1}^loc, x) = Cat(z_i^loc | π_i)        (8)
(μ_i, log σ_i²) = [λ^app]_{z_i^loc}        (9)
z_i^app ~ q(z_i^app | z_i^loc, x) = N(z_i^app | μ_i, diag(σ_i²))        (10)

where [λ^app]_{z_i^loc} denotes the element of the feature map λ^app selected by the one-hot vector z_i^loc (the j-th element when z_i^loc equals the standard basis vector e_j), i runs from 1 to n, and the probability vector π_i for location sampling at each step is computed iteratively:

π_i ∝ softmax(λ^loc) ⊙ (1_{HW} − Σ_{j<i} z_j^loc)        (11)

The expectations in the variational bound are handled as follows. For n we use discrete categorical sampling, which gives a biased gradient estimator; in practice, however, this did not affect inference on n. We use the Gumbel-softmax relaxation [jang2016categorical, maddison2016concrete] for z^loc and the Gaussian reparameterization trick for z^app.
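To make Eqs. (7)–(11) concrete, the sketch below samples object locations without replacement from the logits λ^loc with a straight-through Gumbel-softmax relaxation and reads the corresponding appearance parameters from λ^app. It is a simplified, hypothetical fragment (tensor shapes and the masking constant are assumptions), not the exact implementation.

```python
import torch
import torch.nn.functional as F

def sample_objects(loc_logits, app_features, n, tau=0.5):
    """Sketch of Eqs. (8)-(11).

    loc_logits:   (H*W,) location logits lambda^loc from the inference network
    app_features: (H*W, 2*D) appearance features lambda^app (means and log variances)
    n:            inferred number of objects
    """
    HW, two_d = app_features.shape
    d = two_d // 2
    mask = torch.zeros(HW)                     # accumulates already-sampled locations
    z_locs, z_apps = [], []
    for _ in range(n):
        # (11) mask out previously sampled locations before drawing the next one
        masked_logits = loc_logits + mask * (-1e9)
        # (8) relaxed one-hot sample; hard=True gives a discrete sample with
        # straight-through gradients (Gumbel-softmax relaxation)
        z_loc = F.gumbel_softmax(masked_logits, tau=tau, hard=True)
        mask = mask + z_loc.detach()

        # (9)-(10) appearance parameters at the sampled location, followed by the
        # Gaussian reparameterization trick
        params = z_loc @ app_features          # selects the row indexed by z_loc
        mu, logvar = params[:d], params[d:]
        z_app = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

        z_locs.append(z_loc)
        z_apps.append(z_app)
    return z_locs, z_apps

# toy usage with random features (placeholder sizes)
H, W, D = 8, 8, 4
locs, apps = sample_objects(torch.randn(H * W), torch.randn(H * W, 2 * D), n=3)
print(len(locs), apps[0].shape)
```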

3 Results

In all experiments we use the same architecture for LAVAE; see Appendix B for all implementation details. Briefly, the sprite decoder in the generative model consists of a fully connected layer and a convolutional layer, followed by 3 residual convolutional blocks. The appearance and location inference networks are fully convolutional, so the feature maps λ^app and λ^loc have the same spatial size as the input x.

As a baseline we use a VAE implemented as a fully convolutional network: the absence of fully connected layers makes it easier to preserve spatial information, allowing the model to achieve a higher likelihood and to model a varying number of objects more naturally. Moreover, this choice of baseline is closer in spirit to our model than a VAE with fully connected layers.

We evaluate LAVAE on multi-MNIST and multi-dSprites data sets consisting of 200k images with 0 to 3 objects each. We use 190k images for training and hold out the remaining 10k for evaluation. We also generated an additional test set of 10k images with 7 objects each; examples with more than 3 objects are never used for training.

3.1 Multi-MNIST

Each image in the multi-MNIST data sets consists of a number of statically binarized MNIST digits scattered at random (avoiding overlaps) onto a black canvas. The digits are first downscaled from their original 28×28 size by bilinear interpolation and then binarized by rounding. When generating images, digits are drawn from a pool of either 1k or 10k MNIST digits; we call the resulting data sets multi-MNIST-1k and multi-MNIST-10k, respectively. We independently model each pixel as a Bernoulli random variable parameterized by the decoder:

p(x | z^loc, z^app, n) = ∏_{k=1}^{HW} Bernoulli(x_k | x̃_k)        (12)

where the parameter vector x̃ defined in Section 2 contains the pixel-wise means of the Bernoulli distributions.
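For concreteness, the data-generation procedure described at the start of this section can be sketched as follows. This is not the exact script used here: the canvas size, digit size, and number of placement attempts are illustrative assumptions, and `mnist_pool` stands for a pre-loaded array of 28×28 MNIST digits (1k or 10k of them).

```python
import numpy as np
from PIL import Image

def make_multi_mnist_image(mnist_pool, canvas_size=64, digit_size=18, max_objects=3, rng=None):
    """Sketch: scatter binarized, downscaled MNIST digits on a black canvas
    without overlaps. Sizes here are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    canvas = np.zeros((canvas_size, canvas_size), dtype=np.uint8)
    occupied = np.zeros_like(canvas, dtype=bool)
    n = rng.integers(0, max_objects + 1)            # 0 to max_objects digits per image
    for _ in range(n):
        digit = mnist_pool[rng.integers(len(mnist_pool))]            # 28x28 uint8 array
        small = Image.fromarray(digit).resize((digit_size, digit_size), Image.BILINEAR)
        small = (np.asarray(small) / 255.0 >= 0.5).astype(np.uint8)  # binarize by rounding
        for _ in range(100):                                         # rejection-sample a position
            r = rng.integers(0, canvas_size - digit_size)
            c = rng.integers(0, canvas_size - digit_size)
            if not occupied[r:r + digit_size, c:c + digit_size].any():  # avoid overlaps
                canvas[r:r + digit_size, c:c + digit_size] |= small
                occupied[r:r + digit_size, c:c + digit_size] = True
                break
    return canvas
```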

Figure 1 qualitatively shows the performance of LAVAE on multi-MNIST-10k test images. The inferred locations of all objects in an image are summarized by a black canvas in which white pixels indicate object presence. For each of these locations, the model infers a corresponding appearance latent variable. Each digit on the right is generated by the sprite decoder g from one of these appearance variables.

Samples from the prior are shown in Figure 2: note that the number of generated objects is consistent with the training set, since the prior p(n) is learned. Quantitative results are reported in Table 1 and compared with the VAE baseline. The inferred object count was correct on almost all images of all test sets, even with a held-out number of objects, and across all tested random seeds.

A fundamental characteristic of disentangled representations is that a change in a single representational factor should correspond to a change in a single underlying factor of variation [bengio2013representation, locatello2018challenging]. By demonstrating that the appearance and location of individual objects can be manipulated independently in the latent space, the qualitative disentanglement experiments in Figure 3 show that objects are disentangled from one another, as are the appearance and location of each object.

Figure 1: Inference and reconstruction on test images. For each image, from left to right: input, reconstruction, summary of inferred locations, sprites generated from each appearance latent variable.
Figure 2: Generated samples. Left: images generated by LAVAE from its prior, where p(n) is learned from data. Right: images generated by the baseline from its prior.
              multi-MNIST-1k          multi-MNIST-10k         multi-dSprites
              log p(x)   count acc.   log p(x)   count acc.   log p(x)   count acc.
LAVAE                         %                       %                       %
baseline
Table 1: Quantitative results on the multi-MNIST and multi-dSprites test sets. The log likelihood lower bound is estimated with 100 importance samples. Note that multi-MNIST-10k is a more complex task than multi-MNIST-1k because the model has to capture a larger variation in appearance. The object count accuracy is the percentage of images for which the inferred number of objects matches the ground truth (which is not available during training).
Figure 3: Disentanglement experiments on test images. Objects are represented independently of each other, and their location and appearance are disentangled by design. Left: Latent traversal on one of the 7 location variables. Top right: Reordering the sequence of location variables (or, equivalently, of appearance variables) leads to objects being swapped (top row: original reconstruction; bottom row: swapped objects). Bottom right: In each row, latent traversal on one of the appearance variables along one dimension.

3.2 Multi-dSprites

The multi-dSprites data sets are generated similarly to the multi-MNIST ones, by scattering a number of sprites onto a black canvas. Here, the sprites are simple shapes in different colors, sizes, and orientations, as in the dSprites data set [higgins2017beta, dsprites17]. The shape of each sprite is randomly chosen among square, ellipse, and triangle. Sprites have a fixed maximum size; each sprite’s scale is randomly chosen among 6 linearly spaced values, and its orientation angle is uniformly chosen among 40 linearly spaced values. The color is uniformly chosen among the 7 colors whose RGB values are saturated (each channel either 0 or 255) with at least one channel not 0. Each color channel of each pixel is therefore a binary random variable and can be modelled independently, as in the multi-MNIST case (leading to 3HW Bernoulli terms in the likelihood instead of HW).
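A small sketch of the per-sprite attribute sampling described above is shown here. The color set follows directly from the "saturated RGB, at least one non-zero channel" rule, while the numeric scale bounds and the use of degrees for orientation are assumptions for illustration.

```python
import itertools
import random

# The 7 saturated colors: each RGB channel is 0 or 255, excluding all-black.
COLORS = [c for c in itertools.product((0, 255), repeat=3) if any(c)]
SHAPES = ("square", "ellipse", "triangle")

def sample_sprite_attributes(n_scales=6, n_orientations=40):
    """Sketch of the per-sprite attribute sampling. Scale bounds and the
    orientation range are illustrative assumptions; the number of discrete
    values matches the text."""
    scale_values = [0.5 + i * (1.0 - 0.5) / (n_scales - 1) for i in range(n_scales)]
    orientation_values = [i * 360.0 / n_orientations for i in range(n_orientations)]
    return {
        "shape": random.choice(SHAPES),
        "scale": random.choice(scale_values),               # 6 linearly spaced values
        "orientation": random.choice(orientation_values),   # 40 linearly spaced values
        "color": random.choice(COLORS),
    }

print(len(COLORS))  # 7
print(sample_sprite_attributes())
```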

Figure 4 shows images generated by sampling from the prior of LAVAE and of the VAE baseline, where we can appreciate how our model accurately captures the number and diversity of objects in the data set. Figure 5 shows an example of inference on test images, as explained in Section 3.1. Finally, as for multi-MNIST, we performed disentanglement experiments in which, starting from the latent representation of a test image, we manipulate the latent variables: in Figure 6 we change the order of the location variables to swap objects, whereas Figure 7 shows the effect of separately altering the location and appearance latent variables of a single object.

Figure 4: Generated samples. Left: images generated by LAVAE from its prior, where p(n) is learned from data. Right: images generated by the baseline from its prior.
Figure 5: Example of inference and reconstruction on multi-dSprites test images. From left to right: input, reconstruction, summary of inferred locations, sprites generated from the inferred appearance latent variables.
Figure 6: Object swap. Changing the order of location (or equivalently appearance) latent variables leads to objects being swapped. Top row: original reconstruction of test image; bottom row: objects are swapped by manipulating the latent variables.
Figure 7: Left: Location traversal. Latent traversal on location variables in a test image with 3 objects. Right: Appearance traversal. Latent traversal on appearance variables: changing z_i^app for some i along one latent dimension corresponds to changing appearance attributes of one specific object. The appearance latent space is only partially disentangled. Here we show examples where a change in one latent dimension leads to a change of a single factor of variation (rows 2, 4, 5, 6); however, in row 1 there is a change in both color and shape, and in row 3 in both color and scale.

3.3 Generalizing to more objects

As mentioned above, LAVAE correctly infers the number of objects in images from the 7-object versions of our data sets, despite being trained only on images with up to 3 objects. Furthermore, by representing each object independently and disentangling the location and appearance of each object, it accurately decomposes 7-object scenes and allows intervention as easily as in images with fewer objects. Figure 8 demonstrates this on the 7-object version of the multi-MNIST-10k data set. Finally, in Figure 9 we show images generated by LAVAE after modifying the prior p(n) to be uniform over {4, 5}.

Figure 8: Disentanglement with more objects. Latent traversal on location (top) and appearance (bottom) variables, on multi-MNIST-10k test images containing 7 objects. LAVAE can still correctly infer the scene’s structure and reconstruct it, allowing intervention on location or appearance of single objects.
Figure 9: Generation with a fixed number of objects. Images generated by LAVAE from a modified prior in which n takes value 4 or 5, each with probability 1/2.

4 Related work

Our work builds on recent advances in probabilistic generative modelling, in particular variational autoencoders (VAEs) [kingma2013auto, rezende2014stochastic]. One of the methods closest to ours in spirit is Attend Infer Repeat (AIR) [eslami2016attend], which performs explicit object-wise inference through a recurrent network that iteratively attends to one object at a time. A limitation of this approach, however, is that it has not been shown to generalize well to a larger number of objects. Also closely related to our work is the multi-entity VAE [nash17], in which multiple objects are independently modelled by different latent variables. Its inference process does not include an explicit attention mechanism and instead uses a spatial KL map as a proxy for object presence. Each object’s latent is decoded into a full image, and these are aggregated by an element-wise operation, so the representations of each object’s location and appearance are entangled. In the same spirit, the recently proposed generative models MONet [burgess2019monet] and IODINE [greff2019multi] learn without supervision to segment the scene into independent and interpretable object-based parts. Although these are more flexible than AIR and can model more complex scenes, the representations they learn of object location and appearance are not disentangled. All methods cited here are likelihood-based, so they can and should be compared in terms of test log likelihood; we leave this for future work.

Other unsupervised approaches to visual scene decomposition include Neural Expectation Maximization [greff2017neural, van2018relational], which amortizes the classic EM for a spatial mixture model, and Generative Query Networks [eslami2018neural], which learn representations of rich 3D scenes but do not factor them into objects and require viewpoint information during training. Methods following the vision-as-inverse-graphics paradigm [poggio1985computational, yuille2006vision] learn structured, object-centered representations by making strong assumptions on the latent codes or by exploiting the true generative model [kulkarni2015deep, wu2017neural, tian2019learning]. Non-probabilistic approaches to scene understanding include adversarially trained generative models [pathak2016context] and self-supervised methods [doersch2015unsupervised, vondrick2018tracking]. These, however, do not explicitly tackle representation learning and often rely on heuristics such as region masks. Finally, examples of supervised approaches are semantic and instance segmentation [ronneberger2015u, he2017mask, jegou2017one, liu2018path], where acquiring labels for training is typically expensive and the focus is not on learning structured representations.

5 Conclusion

We presented LAVAE, a probabilistic generative model for unsupervised learning of structured, compositional, object-based representations of visual scenes. We follow the amortized stochastic variational inference framework, and approximate the latent posteriors by inference networks that are trained end-to-end with the generative model. On multi-MNIST and multi-dSprites data sets, LAVAE learns without supervision to correctly count and locate all objects in a scene. Thanks to the structure of the generative model, objects are represented independently of each other, and the location and appearance of each object are completely disentangled. We demonstrate this in qualitative experiments, where we manipulate location or appearance of single objects independently in the latent space. Our model naturally generalizes to visual scenes with many more objects than encountered during training.

These properties make LAVAE robust to scene complexity, opening up possibilities for leveraging the learned representations for downstream tasks and reinforcement learning agents. However, in order to smoothly transfer to scenes with semantically different components, the appearance latent space should be disentangled as well. Since in this work we focused on robust model-based disentanglement of location and appearance, more work should be done to fully assess and improve disentanglement in the appearance model. Another natural extension is to investigate how our method can be applied to 3D scenes and complex natural images.

References

Appendix A KL divergence

The KL divergence in Eq. (6) can be expanded, using the factorizations in Eqs. (1) and (5), as:

KL( q(z^loc, z^app, n | x) || p(z^loc, z^app, n) )
  = E_q [ log q(n | x) − log p(n) ]
  + E_q [ Σ_{i=1}^{n} ( log q(z_i^loc | z_1^loc, ..., z_{i-1}^loc, x) − log p(z_i^loc | z_1^loc, ..., z_{i-1}^loc) ) ]
  + E_q [ Σ_{i=1}^{n} ( log q(z_i^app | z_i^loc, x) − log p(z_i^app) ) ]

where all expectations can be estimated by Monte Carlo sampling.

Appendix B Implementation details

The input image x is fed into a residual network with 4 blocks, each having 2 convolutional layers. Every convolution is followed by a Leaky ReLU and batch normalization. A final convolutional layer outputs the appearance feature map λ^app. The output of an identical but independent residual network is fed into a 3-layer convolutional network with one output channel, which represents the location logits λ^loc. The logits are multiplied by a constant factor; see below for more details about the rescaling of logits.

The number of objects is then inferred by a deterministic function of the location logits. This function filters out points in λ^loc that are not local maxima, and then counts the number of remaining points above a fixed threshold. If k is the inferred number of objects in the scene, the function outputs the corresponding one-hot vector, and the distribution q(n | x) deterministically takes the value k. We fix the maximum number of objects per image to n_max.
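A simplified sketch of this deterministic counting function is given below; the threshold value and the 3×3 neighborhood used for the local-maximum test are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def infer_num_objects(loc_logits, threshold=0.0, n_max=3):
    """Sketch: count local maxima of the location logits above a threshold.

    loc_logits: (H, W) tensor of location logits lambda^loc.
    Returns a one-hot vector of length n_max + 1 encoding the inferred count.
    """
    x = loc_logits[None, None]                              # (1, 1, H, W)
    # keep only points that are maxima of their 3x3 neighborhood
    is_max = (x == F.max_pool2d(x, kernel_size=3, stride=1, padding=1))
    count = int(((x > threshold) & is_max).sum())
    count = min(count, n_max)                               # cap at the maximum object count
    one_hot = torch.zeros(n_max + 1)
    one_hot[count] = 1.0
    return one_hot

print(infer_num_objects(torch.randn(16, 16)))
```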

The locations are iteratively sampled from a categorical distribution with logits λ^loc, where the logits of all previously sampled locations are set to a low value to prevent them from being sampled again. At the i-th step, when the i-th object’s location is sampled, the corresponding feature vector in λ^app is interpreted as the means and log variances of the components of the i-th appearance variable. When sampling from the categorical distribution, we use Gumbel-Softmax for single-sample gradient estimation. The temperature parameter is exponentially annealed from 0.5 to 0.02 over 100k steps. The latent space of each appearance variable has a fixed dimensionality.
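The exponential temperature schedule mentioned above can be written compactly; the sketch below reproduces the stated endpoints (0.5 to 0.02 over 100k steps) and holds the final value afterwards, which is an assumption about behavior beyond the annealing window.

```python
import math

def gumbel_temperature(step, tau_start=0.5, tau_end=0.02, anneal_steps=100_000):
    # Exponential interpolation from tau_start to tau_end over anneal_steps,
    # then held constant at tau_end (assumed).
    t = min(step / anneal_steps, 1.0)
    return tau_start * math.exp(t * math.log(tau_end / tau_start))

print(gumbel_temperature(0), gumbel_temperature(50_000), gumbel_temperature(100_000))
# 0.5, 0.1, 0.02
```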

The sprite-generating function g is a convolutional network that takes as input a sampled appearance vector z_i^app. The first part of the network consists of a fully connected layer, a convolutional layer, and a bilinear interpolation. The vector z_i^app is then expanded and concatenated to the resulting tensor along the channel dimension. The second part of the network consists of 3 residual blocks with 2 convolutional layers each, followed by a final convolutional layer with a sigmoid activation. Leaky ReLU and batch normalization are used after each layer, except in the last residual block. The generated sprites have a fixed size, which differs between the multi-MNIST and multi-dSprites data sets. The mean of the Bernoulli output is then computed from these sprites as explained above.
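A hypothetical PyTorch sketch with this overall structure is shown below. Channel counts, the sprite size, the initial spatial resolution, and the single output channel are assumptions, and the residual blocks are simplified; it illustrates the layer ordering described above rather than the exact architecture.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Simplified residual block: 2 convolutions with Leaky ReLU and (optional) batch norm.
    def __init__(self, ch, use_norm=True):
        super().__init__()
        norm = nn.BatchNorm2d if use_norm else nn.Identity
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(), norm(ch),
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(), norm(ch),
        )

    def forward(self, x):
        return x + self.net(x)

class SpriteDecoder(nn.Module):
    """Sketch of the sprite decoder g; all sizes are illustrative assumptions."""
    def __init__(self, z_dim=8, ch=32, init_size=5, sprite_size=20):
        super().__init__()
        self.init_size, self.sprite_size = init_size, sprite_size
        self.fc = nn.Linear(z_dim, ch * init_size * init_size)       # fully connected layer
        self.conv_in = nn.Conv2d(ch, ch, 3, padding=1)                # convolutional layer
        self.blocks = nn.Sequential(                                   # 3 residual blocks
            ResBlock(ch + z_dim), ResBlock(ch + z_dim),
            ResBlock(ch + z_dim, use_norm=False),                      # no norm in last block
        )
        self.conv_out = nn.Conv2d(ch + z_dim, 1, 3, padding=1)

    def forward(self, z):                                              # z: (B, z_dim)
        b = z.shape[0]
        h = self.fc(z).view(b, -1, self.init_size, self.init_size)
        h = self.conv_in(h)
        h = nn.functional.interpolate(h, size=self.sprite_size,
                                      mode="bilinear", align_corners=False)
        # expand z spatially and concatenate along the channel dimension
        z_map = z[:, :, None, None].expand(-1, -1, self.sprite_size, self.sprite_size)
        h = torch.cat([h, z_map], dim=1)
        h = self.blocks(h)
        return torch.sigmoid(self.conv_out(h))                         # sprite in [0, 1]

sprites = SpriteDecoder()(torch.randn(4, 8))
print(sprites.shape)  # torch.Size([4, 1, 20, 20])
```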

We optimized the model with stochastic gradient descent, using Adamax with batch size 64. The initial learning rate was 1e-4 or 5e-4 for the location inference network (for the multi-MNIST and multi-dSprites data sets, respectively) and 1e-3 for the rest of the model. In the location inference network, the learning rate was exponentially decayed by a factor of 100 over 100k steps. The model has about 500k parameters, split almost evenly among location inference, appearance inference, and sprite generation.
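A sketch of the corresponding optimizer setup could look as follows. The module names (location_net, appearance_net, sprite_decoder) are hypothetical stand-ins for the three parts of the model, and the schedule reproduces the stated exponential decay by a factor of 100 over 100k steps.

```python
import torch

def build_optimizer(model, dataset="multi-mnist"):
    # location-network learning rate: 1e-4 for multi-MNIST, 5e-4 for multi-dSprites
    loc_lr = 1e-4 if dataset == "multi-mnist" else 5e-4
    opt = torch.optim.Adamax([
        {"params": model.location_net.parameters(), "lr": loc_lr},
        {"params": model.appearance_net.parameters(), "lr": 1e-3},
        {"params": model.sprite_decoder.parameters(), "lr": 1e-3},
    ])
    # exponentially decay only the location-network learning rate by 100x over 100k steps
    decay = lambda step: 0.01 ** min(step / 100_000, 1.0)
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=[decay, lambda s: 1.0, lambda s: 1.0])
    return opt, sched
```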

Training warm-up.

In practice we found it beneficial to include a warm-up period at the beginning of training, during which an auxiliary loss is added to the negative ELBO. We train a fully convolutional VAE in parallel with our model and take the location-wise KL divergence in its latent space as a rough proxy for object location, as suggested by nash17. The additional loss is the squared distance between the inferred location map and this KL map. The network inferring object location is therefore initially encouraged to mimic a function that is a (rather rough) approximation of object location, and is then fine-tuned. Intuitively, we are biasing the sampling of z^loc in favor of information-rich locations, which significantly speeds up and stabilizes training. To further stabilize training, we also found it beneficial to apply this forcing at random during training, with a probability that is 1 during warm-up (30k steps) and then decays linearly to 0.1 over the following 30k steps.

The auxiliary VAE is implemented as a fully convolutional network, in which the encoder consists of 3 residual blocks with downsampling between blocks, followed by a final convolutional layer. The decoder loosely mirrors the encoder. The latent variables are arranged as a 3D tensor. The auxiliary loss is added to the original training loss for a warm-up phase of 30k steps; in the subsequent 30k steps, the contribution of this term to the overall loss is linearly annealed to 0.
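The warm-up objective could be sketched as follows. The per-location KL map of the auxiliary VAE and the inferred location map are passed in as tensors, the 30k-step schedule follows the description above, and the exact form of the distance (summed squared error) is an assumption.

```python
import torch

def warmup_weight(step, warmup_steps=30_000, anneal_steps=30_000):
    # 1 during warm-up, then linearly annealed to 0 over the following anneal_steps.
    if step < warmup_steps:
        return 1.0
    return max(0.0, 1.0 - (step - warmup_steps) / anneal_steps)

def training_loss(neg_elbo, loc_map, kl_map, step):
    """Sketch: add the auxiliary location loss to the negative ELBO.

    loc_map: (H, W) inferred location map (e.g. softmax of the location logits)
    kl_map:  (H, W) per-location KL of the auxiliary VAE, used as a location proxy
    """
    aux = ((loc_map - kl_map) ** 2).sum()       # squared distance to the KL map (assumed form)
    return neg_elbo + warmup_weight(step) * aux
```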

Rescaling location logits.

Assume there are n objects and that the logits of the inferred location distribution are either 0 or ℓ. The probability of one of the “hot” locations after softmax is then e^ℓ / (n e^ℓ + HW − n), and it should be close to 1/n, i.e. equal to (1 − ε)/n where ε is a small constant that does not depend on n. Solving for ℓ we get ℓ = log((1 − ε)(HW − n) / (n ε)). If HW is large enough and n is small enough, we have HW − n ≈ HW, and the remaining constant log((1 − ε)/(n ε)) is relatively small in magnitude compared to log(HW), so the high logits should be approximately proportional to log(HW). Thus, when using the same fully convolutional network architecture for multiple image sizes, the logits should be scaled by a factor proportional to log(HW). The threshold for counting objects should likewise follow this rule of thumb.
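Under the stated assumptions (n equal "hot" logits, HW much larger than n), a small numerical check of this rule of thumb is straightforward; the value ε = 0.05 below is an arbitrary choice for illustration.

```python
import math

def hot_logit(hw, n, eps=0.05):
    # Exact logit value l such that each of the n "hot" locations gets probability
    # (1 - eps)/n after the softmax, with the remaining hw - n logits equal to 0.
    p = (1.0 - eps) / n
    return math.log(p * (hw - n) / (1.0 - n * p))

for hw in (32 * 32, 64 * 64, 128 * 128):
    print(hw, round(hot_logit(hw, n=3), 2), round(math.log(hw), 2))
# the exact value stays close to log(H*W) plus a roughly constant offset,
# which motivates scaling the logits (and the counting threshold) with log(H*W)
```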

Baseline implementation.

As baseline we use a fully convolutional β-VAE in which the latent variables are organized as a 3D tensor. Being closer in spirit to our method, this leads to a fairer comparison; furthermore, this inductive bias makes it easier to preserve spatial information. Empirically, it leads to better likelihood and generated samples than a VAE with fully connected layers (and a 1D structure for the latent variables), and it allows a varying number of objects in a scene to be modelled more naturally. The encoder consists of a convolutional layer with stride 2, followed by 4 residual blocks with 2 convolutional layers each. Between the first two pairs of blocks, a convolutional layer with stride 2 reduces dimensionality. A final convolutional layer brings the number of channels to twice the number of latent channels (means and log variances). The decoder takes a latent sample as input and outputs the pixel-wise Bernoulli means. Its architecture loosely mirrors the encoder’s, with convolutions replaced by transposed convolutions, except for the last upsampling operation, which consists of bilinear interpolation followed by an ordinary convolution. All convolutions and transposed convolutions, except for the ones between two residual blocks, are followed by Leaky ReLU and batch normalization. The last convolutional layer is followed only by a sigmoid nonlinearity. In our experiments, β was linearly annealed from 0 to 1 over 100k steps. The model has about 1M parameters.

Appendix C Results on multi-MNIST-1k

Here we present additional visual results on the multi-MNIST-1k data set, similar to those discussed in the main text.

Figure 10: Prior samples. Left: images generated by sampling from LAVAE’s prior, where p(n) is learned from data. Right: images generated by sampling from the baseline’s prior.
Figure 11: Disentanglement experiments on test images with more objects than in the training regime. Objects are represented independently of each other, and their location and appearance are disentangled by design. Top left: Reordering the sequence of location variables (or, equivalently, of appearance variables) leads to objects being swapped (top row: original reconstruction; bottom row: swapped objects). Top right: Latent traversal on one of the 7 location variables. Bottom: Latent traversal on one of the 7 appearance variables (along 3 different dimensions).