Explicit Disentanglement of Appearance and Perspective in Generative Models
Abstract
Disentangled representation learning finds compact, independent and easy-to-interpret factors of the data. Learning such representations has been shown to require an inductive bias, which we explicitly encode in a generative model of images. Specifically, we propose a model with two latent spaces: one that represents spatial transformations of the input data, and another that represents the transformed data. We find that the latter naturally captures the intrinsic appearance of the data. To realize the generative model, we propose a Variationally Inferred Transformational Autoencoder (VITAE) that incorporates a spatial transformer into a variational autoencoder. We show how to perform inference in the model efficiently by carefully designing the encoders and restricting the transformation class to be diffeomorphic. Empirically, our model separates visual style from digit type on MNIST, shape from pose in images of human bodies, and facial features from facial shape on CelebA.
1 Introduction
Disentangled Representation Learning (DRL) is a fundamental challenge in machine learning that is currently seeing a renaissance within deep generative models. DRL approaches assume that an AI agent can benefit from separating out (disentangling) the underlying structure of data into disjoint parts of its representation. This can furthermore help interpretability of the decisions of the AI agent and thereby make them more accountable.
Even though there have been attempts to find a single formalized notion of disentanglement (Higgins et al., 2018), no widely accepted theory exists yet. However, the intuition is that a disentangled representation should separate different informative factors of variation in the data (Bengio et al., 2012). This means that changing a single latent dimension should only change a single interpretable feature in the data space.
Within the DRL literature, there are two main approaches. The first is to hardwire disentanglement into the model, thereby creating an inductive bias. This is well known e.g. in convolutional neural networks, where the convolution operator creates an inductive bias towards translation in data. The second approach is to instead learn a representation that is faithful to the underlying data structure, hoping that this is sufficient to disentangle the representation. However, there is currently little to no agreement in the literature on how to learn such representations (Locatello et al., 2019).
We consider disentanglement of two explicit groups of factors, the appearance and the perspective. We here define the appearance as the factors of the data that are left after transforming by its perspective. Thus, the appearance is the form or archetype of an object and the perspective represents the specific realization of that archetype. Practically speaking, the perspective could correspond to an image rotation that is deemed irrelevant, while the appearance is a representation of the rotated image, which is then invariant to the perspective. This interpretation of the world goes back to Plato’s allegory of the cave, from which we also borrow our terminology. This notion of removing perspective before looking at the appearance is well-studied within supervised learning, e.g. using spatial transformer nets (STNs) (Jaderberg et al., 2015).
This paper contributes an explicit model for disentanglement of appearance and perspective in images, called the variationally inferred transformational autoencoder (VITAE). As the name suggests, we focus on variational autoencoders as generative models, but the idea is general (Fig. 1). First we encode/decode the perspective features in order to extract an appearance that is perspective-invariant. This is then encoded into a second latent space, where inputs with similar appearance are encoded similarly. This process generates an inductive bias that disentangles perspective and appearance. In practice, we develop an architecture that leverages the inference part of the model to guide the generator towards better disentanglement. We also show that this specific choice of architecture improves training stability with the right choice of parametrization of perspective factors. Experimentally, we demonstrate our model on four datasets: the standard disentanglement benchmark dSprites, disentanglement of style and content on MNIST, pose and shape on images of human bodies (Fig. 2), and facial features and facial shape on CelebA.
2 Related work
Disentangled representation learning (DRL) has long been a goal in data analysis. Early work on non-negative matrix factorization (Lee and Seung, 1999) and bilinear models (Tenenbaum and Freeman, 2000) showed how images can be decomposed into semantic “parts” that can be glued together to form the final image. Similarly, EigenFaces (Turk and Pentland, 1991) have often been used to factor out lighting conditions from the representation (Shakunaga and Shigenari, 2001), thereby discovering some of the physics that govern the world of which the data is a glimpse. This is central in the long-standing argument that for an AI agent to understand and reason about the world, it must disentangle the explanatory factors of variation in data (Lake et al., 2016). As such, DRL can be seen as a poor man’s approximation to discovering the underlying causal factors of the data.
Independent components are, perhaps, the most stringent formalization of “disentanglement”. The seminal independent component analysis (ICA) (Comon, 1994) factors the signal into statistically independent components. It has been shown that the independent components of natural images are edge filters (Bell and Sejnowski, 1997) that can be linked to the receptive fields in the human brain (Olshausen and Field, 1996). Similar findings have been made for both video and audio (van Hateren and Ruderman, 1998; Lewicki, 2002). DRL thus allows us to understand both the data and ourselves. Since independent factors are the optimal compression, ICA finds the most compact representation, implying that the predictive model can achieve maximal capacity from its parameters. This gives DRL a predictive perspective, and can be taken as a hint that a well-trained model might be disentangled. In the linear case, independent components have many successful realizations (Hyvärinen and Oja, 2000), but in the general non-linear case, the problem is not identifiable (Hyvärinen et al., 2018).
Deep DRL was initiated by Bengio et al. (2012), who sparked the current interest in the topic. One of the current state-of-the-art methods for disentangled representation learning is the β-VAE (Higgins et al., 2017), which modifies the variational autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) to learn a more disentangled representation. β-VAE puts more weight on the KL-divergence in the VAE loss, thereby optimizing towards latent factors that should be axis-aligned, i.e. disentangled. Newer models like β-TCVAE (Chen et al., 2018) and DIP-VAE (Kumar et al., 2017) extend β-VAE by decomposing the KL-divergence into multiple terms and only increasing the weight on the terms that analytically disentangle the model. InfoGAN (Chen et al., 2016) extends the latent code $z$ of the standard GAN model (Goodfellow et al., 2014) with an extra latent code $c$ and then penalizes low mutual information between generated samples $G(z, c)$ and $c$. DC-IGN (Kulkarni et al., 2015) forces the latent codes to be disentangled by only feeding in batches of data that vary in one way (e.g. pose, light) while only having small disjoint parts of the latent code active.
Shape statistics is the key inspiration for our work. The shape of an object was first formalized by Kendall (1989) as being what is left of an object when translation, rotation and scale are factored out. That is, the intrinsic shape of an object should not depend on viewpoint. This idea dates back at least to D’Arcy Thompson (1917), who pioneered the understanding of the development of biological forms. In Kendall’s formalism, the rigid transformations (translation, rotation and scale) are viewed as group actions to be factored out of the representation, such that the remainder is shape. Higgins et al. (2018) follow the same idea by defining disentanglement as a factoring of the representation into group actions. Our work can be seen as a realization of this principle within a deep generative model. When an object is represented by a set of landmarks, e.g. in the form of discrete points along its contour, then Kendall’s shape space is a Riemannian manifold that exactly captures all variability among the landmarks except translation, rotation, and scale of the object. When the object is not represented by landmarks, similar mathematical results are not available. Our work shows how the same idea can be realized for general image data, and for a much wider range of transformations than the rigid ones. Learned-Miller (2006) proposed a related linear model that generates new data by transforming a prototype, which is estimated by joint alignment.
Transformations are at the core of our method, and these leverage the architecture of spatial transformer nets (STNs) (Jaderberg et al., 2015). While these work well within supervised learning (Lin and Lucey, 2016; Annunziata et al., 2018; Detlefsen et al., 2018), there has been limited uptake within generative models. Lin et al. (2018) combine a GAN with an STN to compose a foreground (e.g. a piece of furniture) into a background such that it looks natural. The AIR model (Eslami et al., 2016) combines STNs with a VAE for object rendering, but does not seek disentangled representations. In supervised learning, data augmentation is often used to make a classifier partially invariant to select transformations (Baird, 1992; Hauberg et al., 2016).
3 Method
Our goal is to extend a variational autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014) such that it can disentangle appearance and perspective in data. A standard VAE assumes that data $x$ is generated by a set of latent variables $z$ following a standard Gaussian prior,
$$p(x) = \int p(x \mid z)\, p(z)\, dz, \qquad p(z) = \mathcal{N}(0, I_d), \qquad p(x \mid z) = \mathcal{N}\big(x \mid \mu_p(z), \sigma_p^2(z)\big). \tag{1}$$
Data $x$ is then generated by first sampling a latent variable $z$ and then sampling $x$ from the conditional $p(x \mid z)$ (often called the decoder). To make the model flexible enough to capture complex data distributions, $\mu_p$ and $\sigma_p^2$ are modeled as deep neural nets. The marginal likelihood is then intractable and a variational approximation $q$ to $p(z \mid x)$ is needed,
$$p(z \mid x) \approx q(z \mid x) = \mathcal{N}\big(z \mid \mu_q(x), \sigma_q^2(x)\big), \tag{2}$$
where $\mu_q$ and $\sigma_q^2$ are deep neural networks, see Fig. 3(a).
When training VAEs, we therefore simultaneously train a generative model $p(x \mid z)\, p(z)$ and an inference model $q(z \mid x)$ (often called the encoder). This is done by maximizing a variational lower bound to the likelihood $p(x)$ called the evidence lower bound (ELBO),
$$\log p(x) \ge \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big). \tag{3}$$
The first term measures the reconstruction error between $x$ and $p(x \mid z)$ and the second measures the KL-divergence between the encoder $q(z \mid x)$ and the prior $p(z)$. Eq. 3 can be optimized using the reparametrization trick (Kingma and Welling, 2013). Several improvements to VAEs have been proposed (Burda et al., 2015; Kingma et al., 2016), but our focus is on the standard model.
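The two terms of the ELBO can be made concrete with a small numerical sketch. The code below is illustrative only and is not the paper's implementation: it assumes a diagonal Gaussian encoder, a Gaussian likelihood with a fixed standard deviation `obs_std`, and a user-supplied `decode` function, all of which are hypothetical names introduced here.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def reparametrize(mu, log_var, rng):
    """Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def elbo_estimate(x, mu, log_var, decode, rng, obs_std=1.0):
    """Single-sample ELBO estimate: log p(x|z) - KL(q(z|x) || p(z)).

    `decode` maps a latent code to the mean of a Gaussian likelihood
    (an assumption made for this sketch).
    """
    z = reparametrize(mu, log_var, rng)
    x_mean = decode(z)
    log_lik = sum(
        -0.5 * math.log(2.0 * math.pi * obs_std ** 2)
        - 0.5 * ((xi - mi) / obs_std) ** 2
        for xi, mi in zip(x, x_mean)
    )
    return log_lik - kl_to_standard_normal(mu, log_var)
```

In a trained model, `mu` and `log_var` would come from the encoder network and `decode` would be the decoder network; here they are passed in directly to keep the sketch self-contained.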
3.1 Incorporating an inductive bias
To incorporate an inductive bias that is able to disentangle appearance from perspective, we change the underlying generative model to rely on two latent factors $z_A$ and $z_P$,
$$p(x) = \iint p(x \mid z_A, z_P)\, p(z_A)\, p(z_P)\, dz_A\, dz_P, \tag{4}$$
where we assume that $z_A$ and $z_P$ both follow standard Gaussian priors. Similar to a VAE, we also model the generators as deep neural networks. To generate new data $x$, we combine the appearance and perspective factors using the following 3-step procedure that uses a spatial transformer (ST) layer (Jaderberg et al., 2015):

1. Sample $z_A$ and $z_P$ from $p(z) = \mathcal{N}(0, I_d)$.

2. Decode both samples: $\tilde{x} \sim p(x \mid z_A)$, $\gamma \sim p(\gamma \mid z_P)$.

3. Transform $\tilde{x}$ with parameters $\gamma$ using a spatial transformer layer: $x = \mathcal{T}_\gamma(\tilde{x})$.
This process is illustrated by the dotted box in Fig. 3(b).
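The three-step generative procedure can be sketched with a toy example. Everything below is a hypothetical stand-in chosen for illustration: the "image" is a short list of 2D points, the two decoders are fixed hand-written functions rather than neural networks, and the spatial transformer acts on point coordinates instead of resampling pixels.

```python
import math
import random

def decode_appearance(z_a):
    """Toy appearance decoder: a horizontal bar whose half-length depends on z_a."""
    half = 1.0 + 0.25 * math.tanh(z_a[0])
    return [(-half, 0.0), (0.0, 0.0), (half, 0.0)]

def decode_perspective(z_p):
    """Toy perspective decoder: rotation angle and translation from z_p."""
    return {"angle": 0.5 * z_p[0], "tx": z_p[1], "ty": z_p[2]}

def spatial_transform(points, gamma):
    """Apply a rotation followed by a translation to every point."""
    c, s = math.cos(gamma["angle"]), math.sin(gamma["angle"])
    return [(c * x - s * y + gamma["tx"], s * x + c * y + gamma["ty"]) for x, y in points]

def generate(rng):
    # Step 1: sample both latent codes from a standard Gaussian prior.
    z_a = [rng.gauss(0.0, 1.0)]
    z_p = [rng.gauss(0.0, 1.0) for _ in range(3)]
    # Step 2: decode appearance and transformation parameters.
    x_tilde = decode_appearance(z_a)
    gamma = decode_perspective(z_p)
    # Step 3: transform the decoded appearance.
    return spatial_transform(x_tilde, gamma)
```

In this toy setting the length of the bar is determined by the appearance code alone, while its orientation and position are determined by the perspective code alone, which is the separation the generative model is designed to encourage.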
Unconditional VITAE inference.
As the marginal likelihood (4) is intractable, we use variational inference. A natural choice is to approximate each latent group of factors independently of the other, i.e.
$$q(z_A, z_P \mid x) = q(z_A \mid x)\, q(z_P \mid x). \tag{5}$$
The combined inference and generative model is illustrated in Fig. 3(b). For comparison, a VAE model is shown in Fig. 3(a). It can easily be shown that the ELBO for this model is merely that of a VAE with a KL term for each latent space (see supplements).
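Written out, the statement about the unconditional ELBO takes the following form. This is a sketch under the assumptions stated in the text (independent standard Gaussian priors and a fully factorized approximate posterior); the symbols $z_A$ and $z_P$ for the appearance and perspective latents are notation adopted here.

```latex
\log p(x) \;\ge\;
\mathbb{E}_{q(z_A \mid x)\, q(z_P \mid x)}\!\left[\log p(x \mid z_A, z_P)\right]
- \mathrm{KL}\!\left(q(z_A \mid x)\,\|\,p(z_A)\right)
- \mathrm{KL}\!\left(q(z_P \mid x)\,\|\,p(z_P)\right)
```

The bound follows from Jensen's inequality applied to the marginal likelihood, after which the log-ratio of prior to approximate posterior splits into one KL term per latent space because both distributions factorize.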
Conditional VITAE inference.
This inference model does not mimic the generative process of the model, which may be suboptimal. Intuitively, we expect the encoder to approximately perform the inverse operation of the decoder, i.e. $z \approx \mathrm{encoder}(\mathrm{decoder}(z))$. Since the proposed encoder (5) does not include an ST-layer, it may be difficult to train an encoder to approximately invert the decoder. To accommodate this, we first include an ST-layer in the encoder for the appearance factors. Secondly, we explicitly enforce that the predicted transformation in the encoder, $\mathcal{T}_{\gamma_e}$, is the inverse of that of the decoder, $\mathcal{T}_{\gamma_d}$, i.e. $\mathcal{T}_{\gamma_e} = \mathcal{T}_{\gamma_d}^{-1}$ (more on invertibility in Sec. 3.2). The inference of appearance is now dependent on the perspective factor $z_P$, i.e.
$$q(z_A, z_P \mid x) = q(z_A \mid x, z_P)\, q(z_P \mid x). \tag{6}$$
These changes to the inference architecture are illustrated in Fig. 3(c). It can easily be shown that the ELBO for this model is given by
$$\log p(x) \ge \mathbb{E}_{q(z_A, z_P \mid x)}\big[\log p(x \mid z_A, z_P)\big] - \mathrm{KL}\big(q(z_P \mid x)\,\|\,p(z_P)\big) - \mathbb{E}_{q(z_P \mid x)}\big[\mathrm{KL}\big(q(z_A \mid x, z_P)\,\|\,p(z_A)\big)\big], \tag{7}$$
which resembles the standard ELBO with an additional KL term corresponding to the second latent space (derivation in supplementary material). We will call both models variationally inferred transformational autoencoders (VITAE), and we denote the first model (5) as unconditional/U-VITAE and the second model (6) as conditional/C-VITAE. The naming comes from Eq. 5 and 6, where $z_A$ is respectively unconditioned and conditioned on $z_P$. Experiments will show that the conditional architecture is essential for inference (Sec. 4.2).
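The idea behind the conditional encoder, namely first estimating the perspective, undoing it with the inverse transformation, and only then encoding appearance, can be illustrated with a toy example. The functions below are hypothetical, hand-written stand-ins for the learned encoders, and the data is a rotated bar of 2D points rather than an image.

```python
import math

def rotate(points, angle):
    """Rotate 2D points about the origin."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def encode_perspective(points):
    """Toy perspective encoder: orientation of the segment from first to last point."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return math.atan2(y1 - y0, x1 - x0)

def encode_appearance(points, angle):
    """Toy appearance encoder, conditioned on the perspective estimate."""
    aligned = rotate(points, -angle)  # inverse of the decoder-side rotation
    return 0.5 * (aligned[-1][0] - aligned[0][0])  # half-length of the bar

# The same bar observed under two different rotations.
bar = [(-1.2, 0.0), (0.0, 0.0), (1.2, 0.0)]
codes = []
for true_angle in (0.3, 0.7):
    observed = rotate(bar, true_angle)
    angle = encode_perspective(observed)
    codes.append((angle, encode_appearance(observed, angle)))
```

Both observations receive the same appearance code because the perspective is removed before the appearance is read off, which is the behaviour the conditional architecture is meant to make easy to learn.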
3.2 Transformation classes
Until now, we have assumed that there exists a class of transformations $\mathcal{T}$ that captures the perspective factors in data. Clearly, the choice of $\mathcal{T}$ depends on the true factors underlying the data, but in many cases an affine transformation should suffice,
$$\mathcal{T}_\gamma(x) = A x + b, \qquad \gamma = (A, b). \tag{8}$$
However, the C-VITAE model requires access to the inverse transformation $\mathcal{T}_\gamma^{-1}$. The inverse of Eq. 8 is given by $\mathcal{T}_\gamma^{-1}(x) = A^{-1} x - A^{-1} b$, which only exists if $A$ has a nonzero determinant.
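One easily verified approach to secure invertibility is to parametrize the transformation by two scale factors $s_x, s_y$, one rotation angle $\alpha$, one shear parameter $m$ and two translation parameters $t_x, t_y$:
$$\mathcal{T}_\gamma(x) = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix} \begin{bmatrix} 1 & m \\ 0 & 1 \end{bmatrix} \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix} x + \begin{bmatrix} t_x \\ t_y \end{bmatrix}. \tag{9}$$
In this case the inverse is trivially
$$\mathcal{T}_\gamma^{-1}(x) = \begin{bmatrix} 1/s_x & 0 \\ 0 & 1/s_y \end{bmatrix} \begin{bmatrix} 1 & -m \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix} \left( x - \begin{bmatrix} t_x \\ t_y \end{bmatrix} \right), \tag{10}$$
where the scale factors must be strictly positive.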
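The invertibility of this six-parameter affine map can be checked numerically. The sketch below assumes one particular order of composition (rotation, then shear, then scale, applied right to left, followed by translation); that order and the function names are assumptions made for illustration.

```python
import math

def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def affine_matrix(sx, sy, alpha, m):
    """Linear part A = rotation * shear * scale (assumed composition order)."""
    rot = [[math.cos(alpha), -math.sin(alpha)], [math.sin(alpha), math.cos(alpha)]]
    shear = [[1.0, m], [0.0, 1.0]]
    scale = [[sx, 0.0], [0.0, sy]]
    return matmul2(rot, matmul2(shear, scale))

def affine_inverse_matrix(sx, sy, alpha, m):
    """Inverse of A: invert each factor and reverse the order."""
    if sx <= 0.0 or sy <= 0.0:
        raise ValueError("scale factors must be strictly positive")
    inv_scale = [[1.0 / sx, 0.0], [0.0, 1.0 / sy]]
    inv_shear = [[1.0, -m], [0.0, 1.0]]
    inv_rot = [[math.cos(alpha), math.sin(alpha)], [-math.sin(alpha), math.cos(alpha)]]
    return matmul2(inv_scale, matmul2(inv_shear, inv_rot))

def transform(point, params):
    """Apply x -> A x + t for params = (sx, sy, alpha, m, tx, ty)."""
    sx, sy, alpha, m, tx, ty = params
    a = affine_matrix(sx, sy, alpha, m)
    x, y = point
    return (a[0][0] * x + a[0][1] * y + tx, a[1][0] * x + a[1][1] * y + ty)

def inverse_transform(point, params):
    """Apply x -> A^{-1} (x - t)."""
    sx, sy, alpha, m, tx, ty = params
    a_inv = affine_inverse_matrix(sx, sy, alpha, m)
    x, y = point[0] - tx, point[1] - ty
    return (a_inv[0][0] * x + a_inv[0][1] * y, a_inv[1][0] * x + a_inv[1][1] * y)
```

Because the determinant of the linear part equals the product of the two scale factors, requiring both to be strictly positive is enough to guarantee that the inverse exists.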
An easier and more elegant approach is to leverage the matrix exponential. That is, instead of parametrizing the transformation in Eq. 8 directly, we parametrize the velocity of the transformation,
$$\mathcal{T}_\gamma(x) = \operatorname{expm}(V) \begin{bmatrix} x \\ 1 \end{bmatrix}, \qquad V = \begin{bmatrix} A & b \\ 0 & 0 \end{bmatrix}. \tag{11}$$
The inverse is then given by $\operatorname{expm}(-V)$, which follows from $V$ and $-V$ being commuting matrices. Then $\mathcal{T}_\gamma$ in Eq. 11 is a diffeomorphism (i.e. a differentiable invertible map with a differentiable inverse) (Duistermaat and Kolk, 2000). Experiments show that diffeomorphic transformations stabilize training and yield tighter ELBOs (see supplements).
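The inverse property of the matrix-exponential parametrization can be verified numerically. The sketch below uses a plain scaling-and-squaring scheme with a truncated Taylor series; it is a didactic stand-in and not a production-quality `expm`, and the helper names and the example velocity are made up for illustration.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def expm(v, squarings=10, terms=12):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    n = len(v)
    scaled = [[entry / (2 ** squarings) for entry in row] for row in v]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        # term holds scaled^k / k!
        term = [[entry / k for entry in row] for row in matmul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def velocity(a, b):
    """Embed a 2x2 matrix a and a 2-vector b into a 3x3 velocity matrix."""
    return [[a[0][0], a[0][1], b[0]], [a[1][0], a[1][1], b[1]], [0.0, 0.0, 0.0]]

# Any real velocity gives an invertible affine map, with inverse expm(-V).
v = velocity([[0.1, -0.4], [0.3, 0.2]], [0.5, -0.2])
neg_v = [[-entry for entry in row] for row in v]
product = matmul(expm(v), expm(neg_v))  # numerically close to the 3x3 identity
```

The practical appeal is that the velocity entries are unconstrained: a network can output any real numbers and the resulting transformation is still invertible, with no positivity constraints to enforce.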
Often we will not have prior knowledge regarding which transformation classes are suitable for disentangling the data. A natural way forward is then to apply a highly flexible class of transformations that are treated as a “black box”. Inspired by Detlefsen et al. (2018), we also consider the highly expressive diffeomorphic CPAB transformations from Freifeld et al. (2015). These can be viewed as an extension of Eq. 11: instead of having a single affine transformation parametrized by its velocity, the image domain is divided into smaller cells, each having its own affine velocity. The collection of local affine velocities can be efficiently parametrized and integrated, giving a fast and flexible diffeomorphic transformation; see Fig. 4 for a comparison between an affine transformation and a CPAB transformation. For details, see Freifeld et al. (2015).
We note that our transformer architecture is similar to the work of Lorenz et al. (2019) and Xing et al. (2019) in that they also try to achieve disentanglement through spatial transformations. However, our work differs in the choice of transformation. This is key, as the theory of Higgins et al. (2018) strongly relies on disentanglement through group actions. This places hard constraints on which spatial transformations are allowed: they have to form a smooth group. Neither the thin-plate-spline transformations considered in Lorenz et al. (2019) nor the displacement fields considered in Xing et al. (2019) are invertible, and hence they do not correspond to proper group actions. Since diffeomorphic transformations form a smooth group, this choice is paramount to realize the theory of Higgins et al. (2018).
4 Experimental results and discussion
For all experiments, we train a standard VAE, a β-VAE (Higgins et al., 2017), a β-TCVAE (Chen et al., 2018), a DIP-VAE-II (Kumar et al., 2017) and our developed VITAE model. We model the encoders and decoders as multilayer perceptron networks (MLPs). For a fair comparison, the number of trainable parameters is approximately the same in all models. The models were implemented in PyTorch (Paszke et al., 2017) and the code is available at https://github.com/SkafteNicki/unsuper/.
Evaluation metric. Measuring disentanglement still seems to be an unsolved problem, but Locatello et al. (2019) found that most proposed disentanglement metrics are highly correlated. We have chosen to focus on the DCI metric from Eastwood and Williams (2019), since this metric has seen some uptake in the research community. This metric measures how well the generative factors can be predicted from the latent factors. For the MNIST and SMPL datasets, the generative factors are discrete instead of continuous, so we change the standard linear regression network to a kNN classification algorithm. We denote this metric $D_{score}$ in the results.
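The kNN variant can be illustrated with a minimal sketch of the prediction step only: a discrete generative factor is predicted from latent codes by majority vote among the nearest training codes, and the score is the test accuracy. This is not the full metric of Eastwood and Williams, and the function names and the choice of `k` are assumptions made for illustration.

```python
from collections import Counter

def knn_predict(train_codes, train_labels, query, k=3):
    """Predict a discrete factor by majority vote among the k nearest latent codes."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(code, query)), label)
        for code, label in zip(train_codes, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def knn_accuracy(train_codes, train_labels, test_codes, test_labels, k=3):
    """Fraction of test factors predicted correctly from their latent codes."""
    hits = sum(
        knn_predict(train_codes, train_labels, query, k) == label
        for query, label in zip(test_codes, test_labels)
    )
    return hits / len(test_labels)

# Toy example: a one-dimensional latent code that separates two classes.
train_codes = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
train_labels = [0, 0, 0, 1, 1, 1]
score = knn_accuracy(train_codes, train_labels, [[0.05], [1.05]], [0, 1])
```

A high score means the factor is easy to read off from the latent codes, which is the intuition behind using predictability as a proxy for disentanglement.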



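Table 1: ELBO, test set log-likelihood and disentanglement score ($D_{score}$) for all models.

| Model | dSprites ELBO | dSprites log p(x) | dSprites $D_{score}$ | MNIST ELBO | MNIST log p(x) | MNIST $D_{score}$ | SMPL ELBO | SMPL log p(x) | SMPL $D_{score}$ |
|---|---|---|---|---|---|---|---|---|---|
| VAE | 47.05 | 49.32 | 0.05 | 169 | 172 | 0.579 | | | 0.485 |
| β-VAE | 79.45 | 81.38 | 0.18 | 150 | 152 | 0.653 | | | 0.525 |
| β-TCVAE | 66.48 | 68.12 | 0.30 | 141 | 144 | 0.679 | | | 0.651 |
| DIP-VAE-II | 46.32 | 48.92 | 0.12 | 140 | 155 | 0.733 | | | 0.743 |
| U-VITAE | 55.25 | 57.29 | 0.22 | 142 | 143 | 0.782 | | | 0.673 |
| C-VITAE | 68.26 | 70.49 | 0.38 | 139 | 141 | 0.884 | | | 0.943 |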
4.1 Disentanglement on shapes
We initially test our models on the dSprites dataset (Matthey et al., 2017), which is a well-established benchmark for evaluating disentanglement algorithms. The results can be seen in Table 1. In terms of disentanglement, we find that our proposed C-VITAE model performs best, followed by the β-TCVAE model. The experiments clearly show the effect on performance of the improved inference structure of C-VITAE compared to U-VITAE. It can be shown that the conditional architecture of C-VITAE minimizes the mutual information between $z_A$ and $z_P$, leading to better disentanglement of the two latent spaces. Getting the U-VITAE architecture to work similarly would require an auxiliary loss term added to the ELBO.
4.2 Disentanglement of MNIST images
Secondly, we test our model on the MNIST dataset (LeCun et al., 1998). To make the task more difficult, we artificially augment the dataset by first randomly rotating each image by an angle chosen uniformly from a fixed interval and secondly translating the images by $t = (t_x, t_y)$, where $t_x$ and $t_y$ are uniformly chosen from the interval $[-3, 3]$. For VITAE, we model the perspective with an affine diffeomorphic transformation (Eq. 11).
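An augmentation of this kind can be sketched as follows. The code is a minimal illustration and not the pipeline used in the experiments: it assumes nearest-neighbour resampling, rotation about the image centre, and zero padding, and it takes the rotation range as a required argument `max_angle_deg`, a hypothetical parameter introduced here.

```python
import math
import random

def augment(image, rng, max_angle_deg, max_shift=3):
    """Randomly rotate and translate a 2D image given as a list of rows.

    Returns the augmented image and the sampled (angle, tx, ty).
    """
    h, w = len(image), len(image[0])
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c, s = math.cos(angle), math.sin(angle)
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Inverse map: undo the translation, then rotate back about the centre.
            x, y = j - cx - tx, i - cy - ty
            src_x = c * x + s * y + cx
            src_y = -s * x + c * y + cy
            si, sj = int(round(src_y)), int(round(src_x))
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = image[si][sj]
    return out, (angle, tx, ty)
```

Sampling the output grid and looking up source pixels through the inverse map avoids holes in the result, which is the same reason spatial transformer layers resample through an inverse grid.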
The quantitative results can be seen in Table 1. We clearly see that C-VITAE outperforms the alternatives on all measures. Overall, we observe that better disentanglement seems to give better distribution fitting.

Qualitatively, Fig. 5 shows the effect of manipulating the latent codes alongside test reconstructions for VAE, β-VAE and C-VITAE. Due to space constraints, the results from β-TCVAE and DIP-VAE-II can be found in the supplementary material. The plots were generated by following the protocol from Higgins et al. (2017): one latent factor is linearly increased from -3 to 3, while the rest are kept fixed. In the VAE (Fig. 5(a)), this changes both the appearance (going from a 7 to a 1) and the perspective (going from rotated slightly left to rotated right). We see no meaningful disentanglement of latent factors. In the β-VAE model (Fig. 5(b)), we observe some disentanglement, since only the appearance changes with the latent factor. However, this disentanglement comes at the cost of poor reconstructions. This trade-off is directly linked to the emphasized regularization in the β-VAE. We note that the $\beta$ value proposed in the original paper (Higgins et al., 2017) is too low for us to observe any disentanglement in our experiments, and we instead select $\beta$ based on qualitative evaluation of the results. For β-TCVAE and DIP-VAE-II we observe nearly the same amount of qualitative disentanglement as for β-VAE; however, these models achieve less blurred samples and reconstructions. This is probably due to the two models' decomposition of the KL term, which only increases the weight on the parts that actually contribute to disentanglement. Finally, for our developed VITAE model (Fig. 5(c)), we clearly see that when we change the latent code in the appearance space (top row), we only change the content of the generated images, while manipulating the latent code in the perspective space (bottom row) only changes the perspective, i.e. image orientation.
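The traversal protocol is simple to state in code. The sketch below only builds the list of latent codes; decoding each code into an image is left to the trained model. The function name and the default number of steps are assumptions made for illustration.

```python
def latent_traversal(base_code, dim, steps=7, low=-3.0, high=3.0):
    """Return copies of base_code where entry `dim` sweeps linearly from low to high."""
    codes = []
    for i in range(steps):
        value = low + (high - low) * i / (steps - 1)
        code = list(base_code)
        code[dim] = value  # vary one factor, keep the rest fixed
        codes.append(code)
    return codes

# Sweep the first latent dimension while the second stays at 0.5.
traversal = latent_traversal([0.5, 0.5], dim=0, steps=3)
```

For a model with two latent spaces, the same sweep is applied separately to a dimension of the appearance code and to a dimension of the perspective code, which gives one row of images per latent space.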
Interestingly, we observe that there exists more than one prototype of a 1 in the appearance space of VITAE, going from slightly bent to straightened out. By our definition of disentanglement, that everything left after transforming the image is appearance, there is nothing wrong with this. This is simply a consequence of using an affine transformation that cannot model this kind of local deformation. Choosing a more flexible transformation class could factor out this kind of perspective. The supplements contain generated samples from the different models.
4.3 Disentanglement of body shape and pose
We now consider synthetic image data of human bodies generated by the Skinned Multi-Person Linear Model (SMPL) (Loper et al., 2015), which are explicitly factored into shape and pose. We generate 10,000 bodies (8,000 for training, 2,000 for testing) by first continuously sampling body shape (going from thin to thick) and then uniformly sampling a body pose from four categories ((arms up, tight), (arms up, wide), (arms down, tight), (arms down, wide)). Fig. 2 shows examples of generated images. Since a change in body shape approximately amounts to a local shape deformation, we model the perspective factors using the aforementioned "black-box" diffeomorphic CPAB transformations (Sec. 3.2). The remaining appearance factor should then reflect body pose.
Quantitative evaluation. We again refer to Table 1, which shows ELBO, test set log-likelihood and disentanglement score for all models. As before, C-VITAE is both better at modelling the data distribution and achieves a higher disentanglement score. The explanation is that for a standard VAE model (or β-VAE and its variants, for that matter) to learn a complex body shape deformation model, it requires a high-capacity network. However, the VITAE architecture gives the autoencoder a shortcut to learning these transformations that only requires optimizing a few parameters. We are not guaranteed that the model will learn anything meaningful or that it actually uses this shortcut, but experimental evidence points in that direction. A similar argument holds in the case of MNIST, where a standard MLP may struggle to learn rotation of digits, but the ST-layer in the VITAE architecture provides a shortcut. Furthermore, we found the training of VITAE to be more stable than that of the other models.
Qualitative evaluation. Again, we manipulate the latent codes to visualize their effect (Fig. 14). This time, we show the results for β-TCVAE, DIP-VAE-II and VITAE. The results from the standard VAE and β-VAE can be found in the supplementary material. For both β-TCVAE and DIP-VAE-II we do not observe disentanglement of body pose and shape, since the decoded images change both arm position (from up to down) and body shape. We note that for β-VAE, β-TCVAE and DIP-VAE-II we did a grid search over their respective hyperparameters. For these three models, we observe that the choice of hyperparameters (scaling of the KL term) can have a detrimental impact on reconstructions and generated samples. Due to lack of space, test set reconstructions and generated samples can be found in the supplementary material. For VITAE we observe some disentanglement of body pose and shape, as variation in the appearance space mostly changes the positions of the arms, while variation in the perspective space mostly changes body shape. The fact that we cannot achieve full disentanglement on this SMPL dataset indicates the difficulty of the task.
4.4 Disentanglement on CelebA
Finally, we qualitatively evaluated our proposed model on the CelebA dataset (Liu et al., 2015). Since this is a "real-life" dataset, we do not have access to generative factors and can therefore only evaluate the model qualitatively. We again model the perspective factors using the aforementioned CPAB transformations, which we assume can model the facial shape. The results can be seen in Fig. 7, which shows latent traversals of both the perspective and appearance factors, and how they influence the generated images. We do observe some interpolation artifacts that are common for architectures using spatial transformers.



5 Summary
In this paper, we have shown how to explicitly disentangle appearance from perspective in a variational autoencoder (Kingma and Welling, 2013; Rezende et al., 2014). This is achieved by incorporating a spatial transformer layer (Jaderberg et al., 2015) into both encoder and decoder in a coupled manner. The concepts of appearance and perspective are broad, as is evident from our experimental results on human body images, where they correspond to pose and shape, respectively. By choosing the class of transformations in accordance with prior knowledge, it becomes an effective tool for controlling the inductive bias needed for disentangled representation learning. On both MNIST and body images our method quantitatively and qualitatively outperforms general-purpose disentanglement models (Higgins et al., 2017; Chen et al., 2018; Kumar et al., 2017). We find it unsurprising that in situations where some prior knowledge about the generative factors is available, encoding it into the model gives better results than ignoring such information.
Our results support the hypothesis (Higgins et al., 2018) that inductive biases are necessary for learning disentangled representations, and our model is a step in the direction of fully disentangled generative models. We envision that the VITAE model should be combined with other models, by first using the VITAE model to separate appearance and perspective, and then training a second model only on the appearance. This will factor out one latent factor at a time, leaving a hierarchy of disentangled factors.
Acknowledgements.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 757360). NSD and SH were supported in part by a research grant (15334) from VILLUM FONDEN. We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPU hardware used for this research.
References
 DeSTNet: densely fused spatial transformer networks. CoRR abs/1807.04050. External Links: 1807.04050 Cited by: §2.
 Latent Space Oddity: on the Curvature of Deep Generative Models. arXiv eprints, pp. arXiv:1710.11379. External Links: 1710.11379 Cited by: Appendix B.
 Document image defect models. In Structured Document Image Analysis, pp. 546–556. Cited by: §2.
 The “independent components” of natural scenes are edge filters. Vision research 37 (23), pp. 3327–3338. Cited by: §2.
 Representation Learning: A Review and New Perspectives. arXiv eprints, pp. arXiv:1206.5538. External Links: 1206.5538 Cited by: §1, §2.
 Importance weighted autoencoders. CoRR abs/1509.00519. External Links: 1509.00519 Cited by: Appendix A, §3.
 Isolating Sources of Disentanglement in Variational Autoencoders. External Links: 1802.04942, Link Cited by: Appendix B, §2, §4, §5.
 InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. CoRR abs/1606.03657. External Links: 1606.03657 Cited by: §2.
 Independent component analysis, a new concept?. Signal Processing 36 (3), pp. 287 – 314. Note: Higher Order Statistics External Links: ISSN 01651684 Cited by: §2.
 On growth and form.. On growth and form.. Cited by: §2.
 Deep diffeomorphic transformer networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix C, §2, §3.2.
 Lie groups and lie algebras. In Lie Groups, pp. 1–92. Cited by: §3.2.
 A Framework for the Quantitative Evaluation of Disentangled Representations. Vol. 9. External Links: Link Cited by: §4.
 Attend, Infer, Repeat: Fast Scene Understanding with Generative Models. External Links: Document, 1603.08575 Cited by: §2.
 Highlyexpressive spaces of wellbehaved transformations: keeping it simple. In ICCV, Cited by: Appendix B, Appendix B, §3.2.
 Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. Cited by: §2.
 Dreaming more data: classdependent distributions over diffeomorphisms for learned data augmentation. In Proceedings of the 19th international Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 41. Cited by: §2.
 Towards a Definition of Disentangled Representations. arXiv eprints, pp. arXiv:1812.02230. External Links: 1812.02230 Cited by: §1, §2, §3.2, §5.
 VAE: learning basic visual concepts with a constrained variational framework. ICLR. Cited by: Appendix B, §2, §4.2, §4, §5.
 Independent component analysis: algorithms and applications. Neural networks 13 (45), pp. 411–430. Cited by: §2.
 Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. arXiv eprints, pp. arXiv:1805.08651. External Links: 1805.08651 Cited by: §2.
 Spatial transformer networks. CoRR abs/1506.02025. External Links: 1506.02025 Cited by: §1, §2, §3.1, §5.
 Ladder Variational Autoencoders. ArXiv eprints. External Links: 1602.02282 Cited by: Appendix B.
 A survey of the statistical theory of shape. Statistical Science, pp. 87–99. Cited by: §2.
 AutoEncoding Variational Bayes. ArXiv eprints. External Links: 1312.6114 Cited by: §2, §3, §3, §5.
 Adam: A method for stochastic optimization. CoRR abs/1412.6980. External Links: 1412.6980 Cited by: Appendix B.
 Improving Variational Inference with Inverse Autoregressive Flow. arXiv eprints, pp. arXiv:1606.04934. External Links: 1606.04934 Cited by: §3.
 Deep convolutional inverse graphics network. CoRR abs/1503.03167. External Links: 1503.03167 Cited by: §2.
 Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. External Links: 1711.00848, Link Cited by: Appendix B, §2, §4, §5.
 Building machines that learn and think like people. CoRR abs/1604.00289. External Links: 1604.00289 Cited by: §2.
 Data driven image models through continuous joint alignment. IEEE Trans. Pattern Anal. Mach. Intell. 28 (2), pp. 236–250. External Links: ISSN 01628828 Cited by: §2.
 Gradientbased learning applied to document recognition. Proceedings of the IEEE, pp. 86(11):2278–2324. Cited by: §4.2.
 Learning the parts of objects by nonnegative matrix factorization. Nature 401 (6755), pp. 788. Cited by: §2.
 Efficient coding of natural sounds. Nature neuroscience 5 (4), pp. 356. Cited by: §2.
 STGAN: Spatial Transformer Generative Adversarial Networks for Image Compositing. ArXiv eprints. External Links: 1803.01837 Cited by: §2.
 Inverse compositional spatial transformer networks. CoRR abs/1612.03897. External Links: 1612.03897 Cited by: §2.
 Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §4.4.
 Challenging common assumptions in the unsupervised learning of disentangled representations. Proceedings of the 36th International Conference on Machine Learning. Cited by: §1, §4.
 SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34 (6), pp. 248:1–248:16. Cited by: §4.3.
 Unsupervised Part-Based Disentangling of Object Shape and Appearance. Proceedings of Computer Vision and Pattern Recognition (CVPR). Cited by: §3.2.
 dSprites: disentanglement testing sprites dataset. Note: https://github.com/deepmind/dsprites-dataset/ Cited by: §4.1.
 Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 (6583), pp. 607. Cited by: §2.
 Automatic differentiation in pytorch. In NIPSW, Cited by: §4.
 Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ArXiv e-prints. External Links: 1401.4082 Cited by: §2, §3, §5.
 Decomposed eigenface for face recognition under various lighting conditions. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, Vol. 1, pp. I–I. Cited by: §2.
 Separating style and content with bilinear models. Neural Comput. 12 (6), pp. 1247–1283. External Links: ISSN 0899-7667 Cited by: §2.
 Eigenfaces for recognition. Journal of cognitive neuroscience 3 (1), pp. 71–86. Cited by: §2.
 Independent component analysis of natural image sequences yields spatiotemporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society of London B: Biological Sciences 265 (1412), pp. 2315–2320. Cited by: §2.
 Deformable Generator Network: Unsupervised Disentanglement of Appearance and Geometry. Proceedings of Computer Vision and Pattern Recognition (CVPR). Cited by: §3.2.
Appendix A Derivation of the ELBO for CVITAE and UVITAE
We focus here on deriving the ELBO for the CVITAE because, as we will see, the ELBO for the UVITAE follows easily from it. For both models, the generative model is given by
We now assume that the inference of appearance depends on the perspective factors, i.e.
as in the CVITAE model. The log marginal likelihood is then given by:
Applying Jensen’s inequality once, to exchange the outer expectation with the logarithm, gives
Then, applying Jensen’s inequality once more, to exchange the logarithm and the inner expectation, we get
Here term 1 is the reconstruction term, term 2 is the KL divergence between the approximate posterior over the appearance space and its prior, and term 3 is the KL divergence between the approximate posterior over the perspective space and its prior. Similar to how gradients are calculated in VAEs, the outer expectation in term 2 is calculated with respect to a single sample, but it can also be computed with respect to multiple samples, similar to the work of Burda et al. (2015).
To get the ELBO of the UVITAE model, we make the inference of the two latent spaces independent of each other, so that the approximate posterior factorizes into one factor per latent space. This gets rid of the outer expectation in term 2 and we are left with
which is the ELBO for the UVITAE model. The intuition behind this equation is that the UVITAE model is just a standard VAE whose latent space has been split into two smaller latent spaces. This is reflected in the ELBO, where the KL term is similarly split into two terms.
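For reference, the derivation can be written out as below. This is a sketch reconstructed from the surrounding text, not the original typeset equations: the symbols $z_A$ (appearance latent) and $z_P$ (perspective latent), the factorized prior $p(z_A)p(z_P)$ and the posterior factorization $q(z_P|x)\,q(z_A|x,z_P)$ are assumed notation.

```latex
\begin{align*}
p(x) &= \iint p(x \mid z_A, z_P)\, p(z_A)\, p(z_P)\, \mathrm{d}z_A\, \mathrm{d}z_P, \\
q(z_A, z_P \mid x) &= q(z_P \mid x)\, q(z_A \mid x, z_P)
  && \text{(CVITAE)} \\[4pt]
\log p(x) &= \log \mathbb{E}_{q(z_P \mid x)}\!\left[ \mathbb{E}_{q(z_A \mid x, z_P)}\!\left[
  \frac{p(x \mid z_A, z_P)\, p(z_A)\, p(z_P)}{q(z_A \mid x, z_P)\, q(z_P \mid x)} \right] \right] \\
&\geq \mathbb{E}_{q(z_P \mid x)}\!\left[ \log \mathbb{E}_{q(z_A \mid x, z_P)}\!\left[
  \frac{p(x \mid z_A, z_P)\, p(z_A)\, p(z_P)}{q(z_A \mid x, z_P)\, q(z_P \mid x)} \right] \right]
  && \text{(Jensen, outer)} \\
&\geq \mathbb{E}_{q(z_P \mid x)}\!\left[ \mathbb{E}_{q(z_A \mid x, z_P)}\!\left[
  \log \frac{p(x \mid z_A, z_P)\, p(z_A)\, p(z_P)}{q(z_A \mid x, z_P)\, q(z_P \mid x)} \right] \right]
  && \text{(Jensen, inner)} \\
&= \underbrace{\mathbb{E}_{q(z_P \mid x)}\mathbb{E}_{q(z_A \mid x, z_P)}\!\left[ \log p(x \mid z_A, z_P) \right]}_{\text{term 1}}
 - \underbrace{\mathbb{E}_{q(z_P \mid x)}\!\left[ \mathrm{KL}\!\left( q(z_A \mid x, z_P) \,\|\, p(z_A) \right) \right]}_{\text{term 2}}
 - \underbrace{\mathrm{KL}\!\left( q(z_P \mid x) \,\|\, p(z_P) \right)}_{\text{term 3}} \\[4pt]
\text{UVITAE:}\quad q(z_A \mid x, z_P) &= q(z_A \mid x)
 \;\Rightarrow\;
 \log p(x) \geq \mathbb{E}_{q(z_A \mid x)\, q(z_P \mid x)}\!\left[ \log p(x \mid z_A, z_P) \right]
 - \mathrm{KL}\!\left( q(z_A \mid x) \,\|\, p(z_A) \right)
 - \mathrm{KL}\!\left( q(z_P \mid x) \,\|\, p(z_P) \right)
\end{align*}
```

In the last step of the CVITAE bound, the ratio inside the logarithm splits into three factors. The factor involving $z_P$ does not depend on $z_A$, so its inner expectation is trivial, which is why only term 2 retains an outer expectation.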
Appendix B Implementation details for the experiments
Below we describe the network architectures in detail. All models were trained using the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate. For the MNIST experiments we used a batch size of 512 and trained for 2000 epochs, and for the SMPL and CelebA experiments we used a batch size of 256 and trained for 5000 epochs. No early stopping was used. Similar to Kaae Sønderby et al. (2016), we use annealing/warmup for the KL divergence by scaling the term(s) with a weight that depends on a warmup parameter, which was set to half the number of epochs.
Details for MNIST experiments. Pixel values of the images are scaled to the interval [0,1]. Each pixel is assumed to be Bernoulli distributed. For the encoders and decoders we use multilayer perceptron networks. For the VAE, β-VAE (Higgins et al., 2017), β-TCVAE (Chen et al., 2018) and DIP-VAE (Kumar et al., 2017), we use the settings listed below. For both VITAE models, we model both encoders and both decoders with approximately half the neurons, for a fair comparison. In practice we found that the encoders/decoders of the appearance factors benefit from having slightly higher capacity than the encoders/decoders of the perspective factors.
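The warmup scaling can be illustrated with a short sketch. The exact scaling expression is not reproduced in this text, so the linear ramp below is an assumption (a common choice for KL annealing), and the function names `kl_weight` and `annealed_loss` are hypothetical.

```python
from typing import Sequence


def kl_weight(epoch: int, warmup: int) -> float:
    """Assumed linear schedule: ramps from 0 to 1 over `warmup` epochs, then stays at 1."""
    if warmup <= 0:
        return 1.0
    return min(1.0, epoch / warmup)


def annealed_loss(recon_nll: float, kl_terms: Sequence[float], epoch: int, warmup: int) -> float:
    """Negative ELBO where every KL term is scaled by the same warmup weight."""
    return recon_nll + kl_weight(epoch, warmup) * sum(kl_terms)


# MNIST setting from the text: 2000 epochs, warmup set to half of them.
n_epochs = 2000
warmup = n_epochs // 2
weights = [kl_weight(e, warmup) for e in (0, 500, 1000, 2000)]
print(weights)  # [0.0, 0.5, 1.0, 1.0]
```

With two latent spaces, `kl_terms` would hold the appearance and perspective KL values, so both are annealed together.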
Network | Layer 1 | Layer 2 | Layer 3
Encoder mean | 128, (LeakyReLU) | 64, (LeakyReLU) | d, (Linear)
Encoder variance | 128, (LeakyReLU) | 64, (LeakyReLU) | d, (softplus)
Decoder mean | 64, (LeakyReLU) | 64, (LeakyReLU) | D, (Sigmoid)
Here and for VAE based models and for VITAE based models. The numbers correspond to the layer sizes, and the activation function is given in parentheses. For the LeakyReLU activation function we use hyper parameter . We only parametrize a mean function in the decoder because we assume the output pixels are Bernoulli distributed.
Details for SMPL experiments. Images were generated using the SMPL library (http://smpl.is.tue.mpg.de/). The parameters for generating the body shape were drawn from a distribution. The parameters that control the body pose were uniformly sampled from one of 4 prespecified pose configurations, see Table 3.
Pose 1  Pose 2  Pose 3  Pose 4  

Left shoulder  
Right shoulder  
Left arm  
Right arm 
The resolution of each image was scaled down to . Each pixel is assumed to be normally distributed. For the VAE based models, we use the settings listed below. For the VITAE models we used approximately half the neurons for the encoders/decoders.
Network | Layer 1 | Layer 2 | Layer 3
Encoder mean | 256, (LeakyReLU) | 128, (LeakyReLU) | d, (Linear)
Encoder variance | 256, (LeakyReLU) | 128, (LeakyReLU) | d, (softplus)
Decoder mean | 128, (LeakyReLU) | 256, (LeakyReLU) | D, (Linear)
Here and for VAE based models and for VITAE based models. The numbers correspond to the layer sizes, and the activation function is given in parentheses. For the LeakyReLU activation function we use hyper parameter . We only parametrize a mean in the decoder because the variance function is in general very hard to train and completely arbitrary outside the latent manifold (Arvanitidis et al., 2017). It was therefore fixed for all pixels in all images to . For the CPAB transformations (Freifeld et al., 2015) we ran the experiments with tessellation parameters [2, 4], with zero boundary constraints and no volume preservation constraints. With these settings, we are generating perspective transformations of size 30.
Details for CelebA experiments. We use the aligned and cropped version of the dataset, downloaded from the homepage (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html). Each image was then downsampled to size , to decrease computational time. Each pixel is assumed to be normally distributed. For this task we use a convolutional VAE. The configuration of the network is listed below:
Network | Layer 1 | Layer 2 | Layer 3 | Layer 4
Encoder mean | Conv(10, 5, 2, LeakyReLU) | Conv(20, 5, 2, LeakyReLU) | Conv(40, 3, 2, LeakyReLU) | Dense(2, Linear)
Encoder variance | Conv(10, 5, 2, LeakyReLU) | Conv(20, 5, 2, LeakyReLU) | Conv(40, 3, 2, LeakyReLU) | Dense(2, Softplus)
Decoder mean | DeConv(40, 3, 2, LeakyReLU) | DeConv(20, 3, 2, LeakyReLU) | DeConv(10, 5, 2, LeakyReLU) | DeConv(3, 5, 2, Sigmoid)
For the CPAB transformation (Freifeld et al., 2015) we ran the experiments with tessellation parameters [4, 4], with zero boundary constraints and no volume preservation constraints. With these settings, we are generating perspective transformations of size 62.
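The two transformation sizes quoted for the SMPL and CelebA settings (30 and 62) can be reproduced with a simple counting argument. The sketch below is not the CPAB library's code; it assumes that each rectangular cell of the tessellation is split into four triangles by a centre vertex, and that a continuous piecewise-affine velocity field is determined by a 2-D velocity at every vertex. The function name `cpab_dim_2d` is hypothetical.

```python
def cpab_dim_2d(nx: int, ny: int, zero_boundary: bool = True) -> int:
    """Degrees of freedom of a continuous piecewise-affine velocity field on an nx-by-ny grid.

    Assumption: each rectangular cell is split into four triangles by a
    centre vertex, and the field is fixed by a 2-D velocity at each vertex.
    """
    grid_vertices = (nx + 1) * (ny + 1)
    centre_vertices = nx * ny
    dim = 2 * (grid_vertices + centre_vertices)
    if zero_boundary:
        # The normal velocity component must vanish on the image boundary:
        # every boundary vertex loses one degree of freedom, and the four
        # corners (which lie on two edges) lose one more each.
        boundary_vertices = 2 * (nx + ny)
        dim -= boundary_vertices + 4
    return dim


print(cpab_dim_2d(2, 4))  # 30, the SMPL setting
print(cpab_dim_2d(4, 4))  # 62, the CelebA setting
```

Under these assumptions, the zero boundary constraint is what brings the [4, 4] tessellation down from 82 to 62 parameters.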
Computational requirements. Even though VITAE has a more complicated architecture than VAE (comparing Fig. (3a) vs. (3c) in main paper) both forward and backward passes in the models have roughly the same complexity when we use affine transformations (see Table 6). Using the more complex CPAB transformations adds some penalty to the computational time.
Model | Forward | Backward
VAE & β-VAE | 0.0016s | 0.014s
β-TCVAE | 0.0020s | 0.016s
DIP-VAE-II | 0.0025s | 0.018s
CVITAE (Affine) | 0.0092s | 0.037s
CVITAE (CPAB) | 0.1s | 0.86s
Appendix C Stability results
In the main paper we discuss multiple ways to parameterize an affine transformation. We have found that choosing a diffeomorphic parameterization also has positive optimization properties. Fig. 8 shows the ELBO as a function of the learning rate for the three different choices of affine parametrization discussed in the main paper, using our CVITAE architecture. We clearly see that the diffeomorphic affine parametrization achieves a tighter bound and tolerates much higher learning rates (faster convergence) before the network begins to diverge. These findings are similar to those of Detlefsen et al. (2018) in the supervised context.
These experiments were conducted on the MNIST dataset. For all three experiments we use the CVITAE architecture with the neural network structure of Table 2. A batch size of 512 was used. The results were generated by changing the parametrization of the affine spatial transformer between
Affine  (12)  
AffineDecomp  (13)  
AffineDiffio  (14) 
and by varying the learning rate. The lower subplot of Figure 4 was generated using a learning rate chosen to make sure that all transformer types would converge.
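To make the idea of a diffeomorphic affine parametrization concrete, the sketch below uses the matrix exponential of a velocity matrix, which is one standard way to guarantee an invertible affine map. The parametrization equations themselves are not reproduced in this text, so this construction is an assumption, and the helper names (`expm`, `affine_from_velocity`) are hypothetical. It uses only the Python standard library; in practice a library routine for the matrix exponential would be the usual choice.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def identity(n: int) -> Matrix:
    """n-by-n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def expm(a: Matrix, squarings: int = 10, terms: int = 12) -> Matrix:
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    n = len(a)
    scale = 2.0 ** squarings
    scaled = [[x / scale for x in row] for row in a]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in matmul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def affine_from_velocity(theta: List[float]) -> Matrix:
    """Map six unconstrained parameters to a 3x3 homogeneous affine matrix."""
    a, b, c, d, e, f = theta
    velocity = [[a, b, c], [d, e, f], [0.0, 0.0, 0.0]]
    return expm(velocity)


theta = [0.3, -0.8, 0.5, 1.2, -0.4, -0.1]
forward = affine_from_velocity(theta)
# Negating the parameters yields the inverse transformation.
inverse = affine_from_velocity([-t for t in theta])
product = matmul(forward, inverse)
```

Because any choice of the six parameters gives an invertible matrix with a closed-form inverse, the network output never has to be constrained, which is consistent with the more stable optimization reported above.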
Appendix D Additional results
D.1 MNIST experiments
Fig. 9 shows reconstructions from the different models, Fig. 10 shows generated samples from the different models, and Fig. 11 shows latent manipulations.
Appendix E SMPL experiment
Fig. 12 shows reconstructions from the different models, Fig. 13 shows generated samples from the different models, and Fig. 11 shows latent manipulations.