On the Transfer of Disentangled Representations in Realistic Settings
Abstract
Learning meaningful representations that disentangle the underlying structure of the data generating process is considered to be of key importance in machine learning. While disentangled representations were found to be useful for diverse tasks such as abstract reasoning and fair classification, their scalability and real-world impact remain questionable. We introduce a new high-resolution dataset with 1M simulated images and over 1,800 annotated real-world images of the same robotic setup. In contrast to previous work, this new dataset exhibits correlations and a complex underlying structure, and allows evaluating transfer to unseen simulated and real-world settings where the encoder i) remains in distribution or ii) is out of distribution. We propose new architectures in order to scale disentangled representation learning to realistic high-resolution settings and conduct a large-scale empirical study of disentangled representations on this dataset. We observe that disentanglement is a good predictor for out-of-distribution (OOD) task performance.
1 Introduction
Disentangled representations hold the promise of generalization to unseen scenarios (Higgins et al., 2017b), increased interpretability (Adel et al., 2018; Higgins et al., 2018) and faster learning on downstream tasks (van Steenkiste et al., 2019; Locatello et al., 2019a). However, most of the focus in learning disentangled representations has been on small synthetic datasets whose ground truth factors exhibit perfect independence by design. More realistic settings remain largely unexplored. We hypothesize that this is because real-world scenarios present several challenges that have not been extensively studied to date. Important challenges are scaling (much higher resolution in observations and factors), occlusions, and correlation between factors. Consider, for instance, a robotic arm moving a cube: here, the robot arm can occlude parts of the cube, and its end-effector position exhibits correlations with the cube’s position and orientation. Another difficulty is that we typically have only limited access to ground truth labels in the real world, which requires robust frameworks for model selection when no or only weak labels are available.
The goal of this work is to provide a path towards disentanglement in realistic settings. First, we argue that this requires a new dataset that captures the challenges mentioned above. We propose a dataset consisting of simulated observations from a scene where a robotic arm interacts with a cube in a stage (see Fig. 1). This setting exhibits correlations and occlusions that are typical in real-world robotics. Second, we show how to scale the architecture of disentanglement methods to perform well on this dataset. Third, we extensively analyze the usefulness of disentangled representations in terms of out-of-distribution downstream generalization, both in terms of held-out factors of variation and sim2real transfer. In fact, our dataset is based on the TriFinger robot from Wüthrich et al. (2020), which can be built to test the deployment of models in the real world. While the analysis in this paper focuses on the transfer and generalization of predictive models, we hope that our dataset may serve as a benchmark to explore the usefulness of disentangled representations in real-world control tasks.
The contributions of this paper can be summarized as follows:

We propose a new dataset for disentangled representation learning, containing 1M simulated high-resolution images from a robotic setup, with seven partly correlated factors of variation. Additionally, we provide a set of over 1,800 annotated images from the corresponding real-world setup that can be used for challenging sim2real transfer tasks.

We propose a new neural architecture to successfully scale VAE-based disentanglement learning approaches to complex datasets.

We conduct a large-scale empirical study on generalization to various transfer scenarios on this challenging dataset. We train 1,080 models from state-of-the-art disentanglement methods and discover that disentanglement is a good predictor for out-of-distribution (OOD) task performance.
2 Related Work
Disentanglement methods.
Most state-of-the-art disentangled representation learning approaches are based on the framework of variational autoencoders (VAEs) (Kingma and Welling, 2014; Rezende et al., 2014). A (high-dimensional) observation $\mathbf{x}$ is assumed to be generated according to the latent variable model $p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})$, where the latent variables $\mathbf{z}$ have a fixed prior $p(\mathbf{z})$. The generative model $p_\theta(\mathbf{x} \mid \mathbf{z})$ and the approximate posterior distribution $q_\phi(\mathbf{z} \mid \mathbf{x})$ are typically parameterized by neural networks, which are optimized by maximizing the evidence lower bound (ELBO):
$$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right) \le \log p(\mathbf{x}) \tag{1}$$
As the above objective does not enforce any structure on the latent space except for some similarity to $p(\mathbf{z})$, different regularization strategies have been proposed, along with evaluation metrics to gauge the disentanglement of the learned representations (Higgins et al., 2017a; Kim and Mnih, 2018; Burgess et al., 2018; Kumar et al., 2018; Chen et al., 2018; Eastwood and Williams, 2018). Recently, Locatello et al. (2019b, Theorem 1) showed that the purely unsupervised learning of disentangled representations is impossible. This limitation can be overcome without the need for explicitly labeled data by introducing weak labels (Locatello et al., 2020; Shu et al., 2019). Ideas related to disentangling the factors of variation date back to the nonlinear ICA literature (Comon, 1994; Hyvärinen and Pajunen, 1999; Bach and Jordan, 2002; Jutten and Karhunen, 2003; Hyvarinen and Morioka, 2016; Hyvarinen et al., 2019; Gresele et al., 2019). Recent work combines nonlinear ICA with disentanglement (Khemakhem et al., 2020; Sorrenson et al., 2020; Klindt et al., 2020).
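As a rough illustration of the ELBO in Eq. (1), the sketch below evaluates a single-sample estimate for one observation. It assumes a diagonal Gaussian posterior, a standard normal prior, and a Bernoulli decoder; the function names and the clipping constant `eps` are illustrative choices, not part of the original work.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, probs, eps=1e-7):
    """log p(x | z) for a Bernoulli decoder, summed over pixels."""
    total = 0.0
    for xi, pi in zip(x, probs):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumption
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def elbo_single_sample(x, probs, mu, log_var):
    """Reconstruction term minus KL term; probs are decoded from one posterior sample."""
    return bernoulli_log_likelihood(x, probs) - gaussian_kl_to_standard_normal(mu, log_var)
```

When the posterior equals the prior (zero mean, zero log-variance), the KL term vanishes and the estimate reduces to the reconstruction log-likelihood.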
Evaluating disentangled representations.
The BetaVAE (Higgins et al., 2017a) and FactorVAE (Kim and Mnih, 2018) scores measure disentanglement by performing an intervention on the factors of variation and predicting which factor was intervened on. The Mutual Information Gap (MIG) (Chen et al., 2018), Modularity (Ridgeway and Mozer, 2018), DCI Disentanglement (Eastwood and Williams, 2018) and SAP score (Kumar et al., 2018) are based on matrices relating factors of variation and codes (e.g. pairwise mutual information, feature importance and predictability).
Datasets for disentanglement learning.
dSprites (Higgins et al., 2017a), which consists of binary low-resolution 2D images of basic shapes, is one of the most commonly used synthetic datasets for disentanglement learning. Color-dSprites, Noisy-dSprites, and Scream-dSprites are slightly more challenging variants of dSprites. The SmallNORB dataset contains toy images rendered under different lighting conditions, elevations and azimuths (LeCun et al., 2004). Cars3D (Reed et al., 2015) exhibits different car models from Fidler et al. (2012) under different camera viewpoints. 3dshapes is a popular dataset of simple shapes in a 3D scene (Kim and Mnih, 2018). Finally, Gondal et al. (2019) proposed MPI3D, containing images of physical 3D objects with seven factors of variation, such as object color, shape, size and position, available in a simulated variant, a highly realistic rendered variant, and a real-world variant. Except for MPI3D, which has over 1M images, the other datasets are limited in size. All of the above datasets exhibit perfect independence of all factors, the number of possible states is on the order of 1M or less, and due to their static setting they do not allow for dynamic downstream tasks such as reinforcement learning. In addition, except for SmallNORB, the image resolution is limited to 64x64 and there are no occlusions.
Other related work.
Transfer of learned disentangled representations from simulation to the real world has been recently investigated by Gondal et al. (2019) on the MPI3D dataset, and previously by Higgins et al. (2017b) in the context of reinforcement learning. Locatello et al. (2020) probed the out-of-distribution generalization of downstream tasks trained on disentangled representations. However, these representations are trained on the entire dataset. Generalization and transfer performance, especially for representation learning, has likewise been studied in Dayan (1993); Muandet et al. (2013); Heinze-Deml and Meinshausen (2017); Rojas-Carulla et al. (2018); Suter et al. (2019); Li et al. (2018); Arjovsky et al. (2019); Krueger et al. (2020); Gowal et al. (2020). Moreover, sim2real transfer is of major interest in the robotic learning community, because of limited data and supervision in the real world (Tobin et al., 2017; Rusu et al., 2017; Peng et al., 2018; James et al., 2019; Yan et al., 2020; Andrychowicz et al., 2020).
3 Scaling Disentangled Representations to Complex Scenarios
Table 1: Factors of variation in the dataset.

Factor of Variation  Number of Values
Upper joint          30
Middle joint         30
Lower joint          30
Cube position x      30
Cube position y      30
Cube rotation        10
Cube color hue       12
A new challenging dataset.
Simulated images in our dataset are derived from the TriFinger robot platform introduced by Wüthrich et al. (2020). The motivation for choosing this setting is that (1) it is challenging due to occlusions, correlations, and other difficulties encountered in robotic settings, (2) it requires modeling of fine details such as tip links at high resolutions, and (3) it corresponds to a robotic setup, so that learned representations can be used for control and reinforcement learning in simulation and in the real world. The scene comprises a robot finger with three joints that can be controlled to manipulate a cube in a bowl-shaped stage. Fig. 1 shows examples of scenes from our dataset. The data is generated from 7 different factors of variation (FoV) listed in Table 1. Unlike in previous datasets, not all FoVs are independent: the end-effector (the tip of the finger) can collide with the floor or the cube, resulting in infeasible combinations of the factors (see Section B.1). We argue that such correlations are a key feature in real-world data that is not present in existing datasets. The high FoV resolution results in approximately 1.52 billion feasible states, but the dataset itself only contains one million of them (approximately 0.065% of all possible FoV combinations), realistically rendered into images. Additionally, we recorded an annotated dataset under the same conditions in the real-world setup: we acquired 1,809 camera images from the same viewpoint and recorded the labels of the 7 underlying factors of variation. This dataset can be used for out-of-distribution evaluations, few-shot learning, and testing other sim2real aspects.
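The counts quoted above can be checked with a few lines of arithmetic. The sketch below uses the number of values per factor from Table 1 and the rounded figures from the text (1.52 billion feasible states, one million images); the dictionary keys are illustrative names.

```python
import math

# Number of discrete values per factor of variation, as listed in Table 1.
FACTOR_SIZES = {
    "upper_joint": 30,
    "middle_joint": 30,
    "lower_joint": 30,
    "cube_position_x": 30,
    "cube_position_y": 30,
    "cube_rotation": 10,
    "cube_color_hue": 12,
}

# Size of the full Cartesian grid, ignoring feasibility constraints.
total_combinations = math.prod(FACTOR_SIZES.values())

FEASIBLE_STATES = 1.52e9   # approximate figure quoted in the text
DATASET_SIZE = 1_000_000   # simulated images in the dataset

# Share of the full grid that is physically feasible (no collisions).
feasible_fraction = FEASIBLE_STATES / total_combinations

# Share of the feasible states that the dataset actually covers, in percent.
coverage_percent = 100.0 * DATASET_SIZE / FEASIBLE_STATES
```

Under these figures, the unconstrained grid has about 2.9 billion combinations, roughly half of which are feasible, and the quoted coverage corresponds to the dataset size relative to the feasible states.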
Model architecture.
When scaling disentangled representation learning to more complex datasets, such as the one proposed here, one of the main bottlenecks in current VAE-based approaches is the flexibility of the encoder and decoder networks. In particular, using the architecture from Locatello et al. (2019b), none of the models we trained correctly captured all factors of variation or yielded high-quality reconstructions. While the increased image resolution already presents a challenge, the main practical issue in our new dataset is the level of detail that needs to be modeled. In particular, we identified the cube rotation and the lower joint position to be the factors of variation that were the hardest to capture. This is likely because these factors only produce relatively small changes in the image and hence the reconstruction error.
To overcome these issues, we propose a deeper and wider neural architecture than those commonly used in the disentangled representation learning literature, where the encoder and decoder typically have 4 convolutional and 2 fully connected layers. Our encoder consists of a convolutional layer, 10 residual blocks, and 2 fully connected layers. Some residual blocks are followed by 1x1 convolutions that change the number of channels, or by average pooling that downsamples the tensors by a factor of 2 along the spatial dimensions. Each residual block consists of two 3x3 convolutions with a leaky ReLU nonlinearity, and a learnable scalar gating mechanism (Bachlechner et al., 2020). Overall, the encoder has 23 convolutional layers and 2 fully connected layers. The decoder mirrors this architecture, with average pooling replaced by bilinear interpolation for upsampling. The total number of parameters is approximately 16.3M. See Appendix A for further implementation details.
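The data flow of a gated residual block can be sketched as follows. The two `transform` callables are placeholders standing in for the 3x3 convolutions, and the inputs are flat lists rather than image tensors, so this only illustrates the order of operations and the role of the scalar gate; it is not the authors' implementation. The leaky ReLU slope of 0.02 is taken from the residual block table in Appendix A.

```python
def leaky_relu(values, slope=0.02):
    """Elementwise leaky ReLU with the given negative slope."""
    return [v if v > 0 else slope * v for v in values]


def gated_residual_block(x, transform_1, transform_2, gate):
    """Compute x + gate * f(x), with f = transform_2(lrelu(transform_1(lrelu(x)))).

    transform_1 and transform_2 are hypothetical stand-ins for the two 3x3
    convolutions; gate is the learnable scalar.
    """
    h = transform_1(leaky_relu(x))
    h = transform_2(leaky_relu(h))
    return [xi + gate * hi for xi, hi in zip(x, h)]
```

With `gate = 0` the block reduces to the identity. Bachlechner et al. (2020) initialize the gate at zero for this reason; whether the same initialization is used here is not stated in this section.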
Experimental setup.
We perform a large-scale empirical study on the simulated dataset introduced above by training 1,080 VAE models.

We train the models using either unsupervised learning or weakly supervised learning (Locatello et al., 2020). In the weakly supervised case, a model is trained with pairs of images that differ in $k$ factors of variation. Here we fix $k = 1$, as it was shown to lead to higher disentanglement by Locatello et al. (2020). The dataset therefore consists of 500k pairs of images that differ in only one FoV.

The latent space dimensionality is in .

Each of the 108 resulting configurations is trained with 10 random seeds.
Can we scale up disentanglement learning?
Most of the trained VAEs in our empirical study fully capture all the elements of a scene, correctly model heavy occlusions, and generate detailed, high-quality samples and reconstructions (see Section B.2). From visual inspections such as the latent traversals in Fig. 2, we observe that many trained models fully disentangle the ground-truth factors of variation. This, however, appears to only be possible in the weakly supervised scenario. The fact that models trained without supervision learn entangled representations is in line with the impossibility result for the unsupervised learning of disentangled representations from Locatello et al. (2019b). Latent traversals from a selection of models with different degrees of disentanglement are presented in Section B.3. Interestingly, the high-disentanglement models seem to correct for correlations and interpolate infeasible states, i.e. the fingertip traverses through the cube or the floor.
Summary: The proposed architecture can scale disentanglement learning to more realistic settings, but a form of weak supervision is necessary to achieve high disentanglement.
How useful are common disentanglement metrics in realistic scenarios?
The violin plot in Fig. 3 (left) shows that DCI and MIG measure high disentanglement under weak supervision and lower disentanglement in the unsupervised setting. This is consistent with our qualitative conclusion from visual inspection of the models and with the aforementioned impossibility result. Many of the models trained with weak supervision exhibit a very high DCI score (29% of them have 99% DCI, some of them up to 99.89%). SAP and Modularity appear to be ineffective at capturing disentanglement in this setting, as also observed by Locatello et al. (2019b). Finally, note that the BetaVAE and FactorVAE metrics are not straightforward to evaluate on datasets that do not contain all possible combinations of factor values. According to Fig. 3 (right), DCI and MIG strongly correlate with the test accuracy of GBT classifiers predicting the FoVs. In the weakly supervised setting, these metrics are strongly correlated with the ELBO (positively) and with the reconstruction loss (negatively). We illustrate these relationships in more detail in Section B.4. Such correlations were also observed by Locatello et al. (2020) on significantly less complex datasets, and can be exploited for unsupervised model selection: these unsupervised metrics can be used as proxies for disentanglement metrics, which would require fully labeled data.
Summary: DCI and MIG appear to be useful disentanglement metrics in realistic scenarios, whereas other metrics seem to fall short of capturing disentanglement or can be difficult to compute. When using weak supervision, we can select disentangled models with unsupervised metrics.
4 Framework for the Evaluation of OOD Generalization
Previous work has focused on evaluating the usefulness of disentangled representations for various downstream tasks, such as predicting ground truth factors of variation, fair classification, and abstract reasoning. Here we propose a new framework for evaluating the out-of-distribution (OOD) generalization properties of representations. More specifically, we consider a downstream task – in our case, regression of ground truth factors – trained on a learned representation of the data, and evaluate the performance on a held-out test set. While the test set typically follows the same distribution as the training set (in-distribution generalization), we also consider test sets that follow a different distribution (out-of-distribution generalization). Our goal is to investigate to what extent, if at all, downstream tasks trained on disentangled representations exhibit a higher degree of OOD generalization than those trained on entangled representations.
Let $\mathcal{D}$ denote the training set for disentangled representation learning. To investigate OOD generalization, we train downstream regression models on a subset $\mathcal{D}_1 \subset \mathcal{D}$ to predict ground truth factor values from the learned representation computed by the encoder. We independently train one predictor per factor. We then test the regression models on a set $\mathcal{D}_2$ that differs distributionally from the training set $\mathcal{D}_1$, as it either contains images corresponding to held-out values of a chosen FoV (e.g. unseen object colors), or it consists of real-world images. We now differentiate between two scenarios: (1) $\mathcal{D}_2 \subset \mathcal{D}$, i.e. the OOD test set is a subset of the dataset for representation learning; (2) $\mathcal{D}$ and $\mathcal{D}_2$ are disjoint and distributionally different. These two scenarios will be denoted by OOD1 and OOD2, respectively. For example, consider the case in which distributional shifts are based on one FoV: the color of the object. Then, we could define these datasets such that images in $\mathcal{D}$ always contain a red or blue object, and those in $\mathcal{D}_1$ always contain a red object. In the OOD1 scenario, images in $\mathcal{D}_2$ would always contain a blue object, whereas in the OOD2 case they would always contain an object that is neither red nor blue.
The regression models considered here are Gradient Boosted Trees (GBT), random forests, and MLPs with different numbers of hidden layers. Since random forests exhibit a similar behavior to GBTs, and all MLPs yield similar results to each other, we choose GBTs and the 2-layer MLP as representative models and only report results for those. To quantify prediction quality, we normalize the ground truth factor values to the range $[0, 1]$, and compute the mean absolute error (MAE). Since the values are normalized, we can define our transfer metric as the average of the MAE over all factors (except for the FoV that is OOD).
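A minimal sketch of the transfer metric described above: per-factor MAE on normalized targets, averaged over all factors except the one that is out of distribution. It assumes the targets are rescaled to the unit interval; the function names and the dictionary layout are illustrative.

```python
def normalize(values, lo, hi):
    """Rescale raw factor values from [lo, hi] to [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def transfer_score(targets, predictions, ood_factor):
    """Average per-factor MAE, skipping the factor that is out of distribution.

    targets and predictions map factor names to lists of normalized values.
    """
    maes = [
        mean_absolute_error(targets[name], predictions[name])
        for name in targets
        if name != ood_factor
    ]
    return sum(maes) / len(maes)
```

Because every factor is on the same scale after normalization, the per-factor errors can be averaged directly, and a lower score indicates better transfer.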
5 Benefits and Transfer of Structured Representations
Experimental setup.
We evaluate the transfer metric introduced in Section 4 across all 1,080 trained models. To compute this metric, we train regression models to predict the ground truth factors of variation, and test them under distributional shift. We report scores for two different regression models: a Gradient Boosted Tree (GBT) and an MLP with 2 hidden layers of size [256, 256]. In the OOD1 setting, the test set is part of the representation learning dataset ($\mathcal{D}_2 \subset \mathcal{D}$), hence the encoder is in-distribution: we are testing the predictor on representations that were in the training set of the representation learning algorithm. Therefore, we expect the representations to be meaningful. We consider three scenarios:

OOD1-A: The regression models are trained on 1 color (red) and evaluated on the remaining 7 colors.

OOD1-B: The regression models are trained on 4 colors with high hue in the HSV space, and evaluated on 4 colors with low hue (extrapolation).

OOD1-C: The regression models are again trained and evaluated on 4 colors, but the training and evaluation colors are alternating along the hue dimension (interpolation).
In the more challenging setting where even the encoder goes out-of-distribution (OOD2, with $\mathcal{D}$ and $\mathcal{D}_2$ disjoint), we train the regression models on a subset of the training set that includes all 8 colors, and we consider the two following scenarios:

OOD2-A: The regression models are evaluated on simulated data, on 4 colors that are out of the encoder’s training distribution.

OOD2-B: The regression models are evaluated on real-world images of the robotic setup, without any adaptation or fine-tuning.
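The three OOD1 color splits can be sketched as index sets over the 8 training colors, sorted by hue. The indexing is hypothetical: index 0 stands in for red, and which actual hue values belong to the encoder's training set is not specified in this section.

```python
def ood1_splits(colors):
    """Return {scenario: (train_colors, test_colors)} for colors sorted by hue."""
    colors = sorted(colors)
    half = len(colors) // 2
    return {
        # Train on a single color, evaluate on all remaining ones.
        "OOD1-A": (colors[:1], colors[1:]),
        # Train on the high-hue half, evaluate on the low-hue half (extrapolation).
        "OOD1-B": (colors[half:], colors[:half]),
        # Alternate train and test colors along the hue axis (interpolation).
        "OOD1-C": (colors[0::2], colors[1::2]),
    }


def filter_by_color(samples, allowed):
    """Keep samples whose 'hue' index is in the allowed set (hypothetical schema)."""
    allowed = set(allowed)
    return [s for s in samples if s["hue"] in allowed]
```

In each scenario the training and evaluation colors are disjoint and together cover all 8 colors seen by the encoder. For OOD2-A, the evaluation colors would instead come from the 4 hues outside that set.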
Is disentanglement correlated with OOD1 generalization?
In Fig. 4 we consistently observe a negative correlation between disentanglement and transfer error across all OOD1 settings. The correlation is mild when using MLPs, strong when using GBTs. This difference is expected, as GBTs have an axis-alignment bias whereas MLPs can – given enough data and capacity – disentangle an entangled representation more easily. Our results therefore suggest that highly disentangled representations are useful for generalizing out-of-distribution as long as the encoder remains in-distribution. This is in line with the correlation found by Locatello et al. (2019b) between disentanglement and the GBT10000 metric. There, however, GBTs are tested on the same distribution as the training distribution, while here we test them under distributional shift. Given that the computation of disentanglement scores requires labels, this is of little benefit in the unsupervised setting. However, it can be exploited in the weakly supervised setting, where disentanglement was shown to correlate with ELBO and reconstruction loss (Section 3). Therefore, model selection for representations that transfer well in these scenarios is feasible based on the ELBO or reconstruction loss, when weak supervision is available. Note that, in absolute terms, the OOD generalization error with encoder in-distribution (OOD1) is very low in the high-disentanglement case (the only exception being the MLP in the OOD1-C case, with the 1-7 color split, which seems to overfit). This suggests that disentangled representations can be useful in downstream tasks even when transferring out of the training distribution.
Summary: Disentanglement seems to be positively correlated with OOD generalization of downstream tasks, provided that the encoder remains in-distribution (OOD1). Since in the weakly supervised case disentanglement correlates with the ELBO and the reconstruction loss, model selection can be performed using these metrics as proxies for disentanglement. These metrics have the advantage that they can be computed without labels, unlike disentanglement metrics.
Is disentanglement correlated with OOD2 generalization?
As seen in Fig. 5, the negative correlation between disentanglement and GBT transfer error is weaker when the encoder is out of distribution (OOD2). Nonetheless, we observe a non-negligible correlation for GBTs in the OOD2-A case, where we investigate out-of-distribution generalization along one FoV, with observations in the OOD test set $\mathcal{D}_2$ still generated from the same simulator. In the OOD2-B setting, where the observations are taken from cameras in the corresponding real-world setting, the correlation between disentanglement and transfer performance appears to be minor at best. This scenario can be considered a variant of zero-shot sim2real generalization.
Summary: Disentanglement has a minor correlation with out-of-distribution generalization outside of the training distribution of the encoder (OOD2).
What else matters for OOD2 generalization?
Results in Fig. 6 suggest that adding Gaussian noise to the input during training as described in Section 3 leads to significantly better OOD2 generalization, and has no effect on OOD1 generalization. Adding noise to the input of neural networks is known to lead to better generalization (Sietsma and Dow, 1991; Bishop, 1995). This is in agreement with our results, since OOD1 generalization does not require generalization of the encoder, while OOD2 does. Interestingly, closer inspection reveals that the contribution of different factors of variation to the generalization error can vary widely. See Section B.5 for further details. In particular, with noisy input, the position of the cube is predicted accurately even in real-world images (5% mean absolute error on each axis). This is promising for robotics applications, where the true state of the joints is observable but inference of the cube position relies on object tracking methods. Fig. 7 shows an example of real-world inputs and reconstructions of their simulated equivalents.
Summary: Adding input noise during training appears to be significantly beneficial for OOD2 generalization, while having no effect when the encoder is kept in its training distribution (OOD1).
6 Conclusion
Despite the growing importance of the field and the potential societal impact in the medical domain (Chartsias et al., 2018) or fair decision making (Locatello et al., 2019a), state-of-the-art approaches for learning disentangled representations have so far only been systematically evaluated on synthetic toy datasets. Here we introduced a new high-resolution dataset with 1M simulated images and over 1,800 annotated real-world images of the same setup. This dataset exhibits a number of challenges and features which are not present in previous datasets: it contains correlations between factors, occlusions, a complex underlying structure, and it allows for evaluation of transfer to unseen simulated and real-world settings. We proposed a new VAE architecture to scale disentangled representation learning to this realistic setting and conducted a large-scale empirical study of disentangled representations on this dataset. We discovered that disentanglement is a good predictor of OOD task performance and showed that, in the context of weak supervision, model selection for good OOD performance can be based on the ELBO or the reconstruction loss, which are accessible without explicit labels. Our setting allows for studying a wide variety of interesting downstream tasks in the future, such as reinforcement learning or learning a dynamics model of the environment. Finally, we believe that in the future it will be important to take further steps in the direction of this paper by considering settings with even more complex structures and stronger correlations between factors.
Acknowledgements
The authors thank Shruti Joshi and Felix Widmaier for their useful comments on the simulated setup, Anirudh Goyal for helpful discussions and comments and CIFAR for the support.
Appendix A Implementation Details
Here we provide more details on the implementation and training of the models. We train the VAEs by maximizing their training objective using the Adam optimizer (Kingma and Ba, 2014) with default parameters. We use a batch size of 64 and train for 400k steps. The learning rate is initialized to 1e-4 and halved at 150k and 300k training steps. We clip the global gradient norm before each weight update. Following Locatello et al. (2019b), we use a Gaussian encoder with an isotropic Gaussian prior for the latent variable, and a Bernoulli decoder.
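The step-wise learning rate schedule can be written as a small function. This sketch assumes an initial rate of 1e-4 and that each halving takes effect from the milestone step onward; the function name and argument names are illustrative.

```python
def learning_rate(step, base_lr=1e-4, milestones=(150_000, 300_000)):
    """Piecewise-constant schedule: halve the rate at each milestone reached."""
    halvings = sum(step >= m for m in milestones)
    return base_lr * 0.5 ** halvings
```

Over the 400k training steps, the rate therefore takes three values: the initial one, half of it from step 150k, and a quarter of it from step 300k.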
An overview of the encoder and decoder architecture is shown in Fig. 8, and further details are provided in Tables 2, 3 and 4. In preliminary experiments, we observed that batch normalization (Ioffe and Szegedy, 2015), layer normalization (Ba et al., 2016), and dropout (Srivastava et al., 2014) did not significantly affect performance in terms of ELBO, model samples, and disentanglement scores, both in the unsupervised and weakly supervised settings. On the other hand, layer normalization before the posterior parameterization (last layer of the encoder) appeared to be beneficial for stability in early training.
While using encoder and decoder architectures based on residual blocks leads to fast and effective convergence, in practice, we observed that it may be challenging to keep the gradients in check at the beginning of training.
Our implementation of weakly supervised learning is based on Ada-GVAE (Locatello et al., 2020), but uses a symmetrized KL divergence to infer which latent dimensions should be aggregated.
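A hedged sketch of how a symmetrized KL divergence between the two posteriors of an image pair could be used to decide which latent dimensions are shared. It assumes diagonal Gaussian posteriors, an equal-weight average of the two KL directions, and a midpoint threshold between the smallest and largest per-dimension divergence. The weighting and the threshold rule are assumptions made for illustration and may differ from the authors' implementation.

```python
import math


def kl_gaussian_1d(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)) for scalar Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def symmetrized_kl(mu_1, var_1, mu_2, var_2):
    """Per-dimension symmetrized KL; the 1/2 weighting is an assumption."""
    return [
        0.5 * kl_gaussian_1d(m1, v1, m2, v2) + 0.5 * kl_gaussian_1d(m2, v2, m1, v1)
        for m1, v1, m2, v2 in zip(mu_1, var_1, mu_2, var_2)
    ]


def shared_dimensions(divergences):
    """Mark dimensions below the midpoint threshold as shared (to be aggregated)."""
    threshold = 0.5 * (max(divergences) + min(divergences))
    return [d < threshold for d in divergences]
```

Unlike the one-directional KL, the symmetrized version gives the same value regardless of which image in the pair is treated as the reference.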
The noise added to the encoder’s input consists of two independent components, both i.i.d. Gaussian with zero mean: one is independent for each subpixel (RGB), and the other is a pixelwise (greyscale) noise, bilinearly upsampled by a factor of 16. Each component has its own standard deviation. The latter component has been designed (by visual inspection) to roughly mimic observation noise in the real images due to complex lighting conditions.
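The two noise components can be sketched in plain Python as follows. The sketch assumes that the greyscale component is sampled at 1/16 of the image resolution and then bilinearly upsampled, that the same greyscale value is added to all three channels, and that no clipping is applied. The standard deviations `sigma_rgb` and `sigma_grey` are left as free parameters, and the half-pixel interpolation convention is an illustrative choice.

```python
import random


def _interp_1d(values, factor):
    """Linearly upsample a 1D list by an integer factor (half-pixel centers)."""
    n = len(values)
    out = []
    for i in range(n * factor):
        src = (i + 0.5) / factor - 0.5          # fractional source coordinate
        src = min(max(src, 0.0), n - 1.0)       # clamp at the borders
        lo = int(src)
        hi = min(lo + 1, n - 1)
        w = src - lo
        out.append((1.0 - w) * values[lo] + w * values[hi])
    return out


def bilinear_upsample(grid, factor):
    """Bilinear upsampling of a 2D list of lists, done separably per axis."""
    rows = [_interp_1d(row, factor) for row in grid]
    cols = [_interp_1d(list(col), factor) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def add_input_noise(image, sigma_rgb, sigma_grey, factor=16, rng=None):
    """Add per-subpixel noise plus upsampled greyscale noise to an H x W x 3 image."""
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    low_res = [[rng.gauss(0.0, sigma_grey) for _ in range(w // factor)]
               for _ in range(h // factor)]
    grey = bilinear_upsample(low_res, factor)
    return [[[image[y][x][c] + rng.gauss(0.0, sigma_rgb) + grey[y][x]
              for c in range(3)]
             for x in range(w)]
            for y in range(h)]
```

The upsampled component varies smoothly across the image, which is what lets it imitate slowly varying lighting effects, while the per-subpixel component models independent sensor noise.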
Residual Block 

Input: 
LeakyReLU(0.02) 
Conv 3x3, channels 
LeakyReLU(0.02) 
Conv 3x3, channels 
Scalar gate 
Sum with input 
Encoder

Operation  Output Shape

Input  
Conv 5x5, stride 2, 64 channels  
LeakyReLU(0.02)  — 
2x ResidualBlock(64)  — 
Conv 1x1, 128 channels  
AveragePool(2)  
2x ResidualBlock(128)  — 
AveragePool(2)  
2x ResidualBlock(128)  — 
Conv 1x1, 256 channels  
AveragePool(2)  
2x ResidualBlock(256)  — 
AveragePool(2)  
2x ResidualBlock(256)  — 
Flatten  
LeakyReLU(0.02)  — 
FC(512)  
LeakyReLU(0.02)  — 
LayerNorm  — 
2x FC() 
Decoder

Operation  Output Shape

Input  
FC(512)  
LeakyReLU(0.02)  — 
FC(4096)  
Reshape  
2x ResidualBlock(256)  — 
BilinearInterpolation(2)  
2x ResidualBlock(256)  — 
Conv 1x1, 128 channels  
BilinearInterpolation(2)  
2x ResidualBlock(128)  — 
BilinearInterpolation(2)  
2x ResidualBlock(128)  — 
Conv 1x1, 64 channels  
BilinearInterpolation(2)  
2x ResidualBlock(64)  — 
BilinearInterpolation(2)  
LeakyReLU(0.02)  — 
Conv 5x5, channels 
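The spatial sizes and layer counts implied by the encoder table can be traced with a short script. It assumes a 128x128 RGB input and padding such that only strides and pooling change the spatial size; neither assumption is stated in this section. The operation list mirrors the encoder table up to the flatten step.

```python
# Encoder operations up to (and excluding) the flatten step.
# ("conv", out_channels, stride), ("res", channels), ("pool",)
ENCODER_OPS = [
    ("conv", 64, 2),
    ("res", 64), ("res", 64),
    ("conv", 128, 1),
    ("pool",),
    ("res", 128), ("res", 128),
    ("pool",),
    ("res", 128), ("res", 128),
    ("conv", 256, 1),
    ("pool",),
    ("res", 256), ("res", 256),
    ("pool",),
    ("res", 256), ("res", 256),
]


def trace_encoder(size, channels, ops):
    """Return (spatial size, channels, number of conv layers) after all ops."""
    n_conv = 0
    for op in ops:
        kind = op[0]
        if kind == "conv":
            channels, stride = op[1], op[2]
            size //= stride
            n_conv += 1
        elif kind == "res":
            n_conv += 2          # each residual block holds two 3x3 convolutions
        elif kind == "pool":
            size //= 2           # average pooling halves each spatial dimension
    return size, channels, n_conv


final_size, final_channels, conv_layers = trace_encoder(128, 3, ENCODER_OPS)
flattened_features = final_size * final_size * final_channels
```

Under these assumptions the trace gives 23 convolutional layers, matching the count in Section 3, and a flattened feature size of 4096, matching the `FC(4096)` layer that the decoder reshapes before its first residual blocks.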
Appendix B Additional Results
B.1 Dataset Correlations
B.2 Samples and Reconstructions
B.3 Latent Traversals
B.4 Unsupervised Metrics and Disentanglement
B.5 Out-of-Distribution Transfer
B.6 Out-of-Distribution Reconstructions
Footnotes
 Reproducing these experiments requires approximately 2.8 GPU years (NVIDIA Tesla V100 PCIe).
 This instability may also be exacerbated in probabilistic models by the sampling step in latent space, where a large log variance causes the decoder input to take very large values. Intuitively, this might be a reason why layer normalization before latent space appears to be beneficial for training stability.
References
 Tameem Adel, Zoubin Ghahramani, and Adrian Weller. Discovering interpretable representations for both deep generative and discriminative models. In International Conference on Machine Learning, pages 50–59, 2018.
 OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous inhand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
 Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David LopezPaz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
 Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(7):1–48, 2002.
 Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W Cottrell, and Julian McAuley. Rezero is all you need: Fast convergence at large depth. arXiv preprint arXiv:2003.04887, 2020.
 Chris M Bishop. Training with noise is equivalent to tikhonov regularization. Neural computation, 7(1):108–116, 1995.
 Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in betaVAE. arXiv preprint arXiv:1804.03599, 2018.
 Agisilaos Chartsias, Thomas Joyce, Giorgos Papanastasiou, Scott Semple, Michelle Williams, David Newby, Rohan Dharmakumar, and Sotirios A Tsaftaris. Factorised spatial representation learning: Application in semisupervised myocardial segmentation. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pages 490–498. Springer, 2018.
 Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.
 Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
 Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
 Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.
 Sanja Fidler, Sven Dickinson, and Raquel Urtasun. 3d object detection and viewpoint estimation with a deformable 3d cuboid model. In Advances in neural information processing systems, pages 611–619, 2012.
 Muhammad Waleed Gondal, Manuel Wüthrich, Djordje Miladinović, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, and Stefan Bauer. On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. In Advances in Neural Information Processing Systems, 2019.
 Sven Gowal, Chongli Qin, Po-Sen Huang, Taylan Cemgil, Krishnamurthy Dvijotham, Timothy Mann, and Pushmeet Kohli. Achieving robustness in the wild via adversarial mixing with disentangled representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1211–1220, 2020.
 Luigi Gresele, Paul K. Rubenstein, Arash Mehrjou, Francesco Locatello, and Bernhard Schölkopf. The incomplete rosetta stone problem: Identifiability results for multi-view nonlinear ica. In Conference on Uncertainty in Artificial Intelligence (UAI), 2019.
 Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. arXiv preprint arXiv:1710.11469, 2017.
 Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017a.
 Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning, 2017b.
 Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matko Bošnjak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. Scan: Learning hierarchical compositional visual concepts. In International Conference on Learning Representations, 2018.
 Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ica. In Advances in Neural Information Processing Systems, 2016.
 Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 1999.
 Aapo Hyvärinen, Hiroaki Sasaki, and Richard E Turner. Nonlinear ica using auxiliary variables and generalized contrastive learning. In International Conference on Artificial Intelligence and Statistics, 2019.
 Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian Ibarz, Sergey Levine, Raia Hadsell, and Konstantinos Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12627–12637, 2019.
 Christian Jutten and Juha Karhunen. Advances in nonlinear blind source separation. In International Symposium on Independent Component Analysis and Blind Signal Separation, pages 245–256, 2003.
 Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvärinen. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207–2217, 2020.
 Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, 2018.
 Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
 David Klindt, Lukas Schott, Yash Sharma, Ivan Ustyuzhaninov, Wieland Brendel, Matthias Bethge, and Dylan Paiton. Towards nonlinear disentanglement in natural data with temporal sparse coding. arXiv preprint arXiv:2007.10930, 2020.
 David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Remi Le Priol, and Aaron Courville. Out-of-distribution generalization via risk extrapolation (rex). arXiv preprint arXiv:2003.00688, 2020.
 Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations, 2018.
 Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.
 Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 624–639, 2018.
 Francesco Locatello, Gabriele Abbati, Thomas Rainforth, Stefan Bauer, Bernhard Schölkopf, and Olivier Bachem. On the fairness of disentangled representations. In Advances in Neural Information Processing Systems, pages 14611–14624, 2019a.
 Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, 2019b.
 Francesco Locatello, Ben Poole, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem, and Michael Tschannen. Weakly-supervised disentanglement without compromises. arXiv preprint arXiv:2002.02886, 2020.
 Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18, 2013.
 Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE international conference on robotics and automation (ICRA), pages 1–8. IEEE, 2018.
 Scott Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In Advances in Neural Information Processing Systems, 2015.
 Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the f-statistic loss. In Advances in Neural Information Processing Systems, 2018.
 Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. The Journal of Machine Learning Research, 19(1):1309–1342, 2018.
 Andrei A Rusu, Matej Večerík, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. Sim-to-real robot learning from pixels with progressive nets. In Conference on Robot Learning, pages 262–270, 2017.
 Rui Shu, Yining Chen, Abhishek Kumar, Stefano Ermon, and Ben Poole. Weakly supervised disentanglement with guarantees. arXiv preprint arXiv:1910.09772, 2019.
 Jocelyn Sietsma and Robert JF Dow. Creating artificial neural networks that generalize. Neural networks, 4(1):67–79, 1991.
 Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746, 2016.
 Peter Sorrenson, Carsten Rother, and Ullrich Köthe. Disentanglement by nonlinear ica with general incompressible-flow networks (gin). arXiv preprint arXiv:2001.04872, 2020.
 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
 Raphael Suter, Djordje Miladinović, Bernhard Schölkopf, and Stefan Bauer. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In International Conference on Machine Learning, pages 6056–6065. PMLR, 2019.
 Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.
 Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, and Olivier Bachem. Are disentangled representations helpful for abstract visual reasoning? arXiv preprint arXiv:1905.12506, 2019.
 Manuel Wüthrich, Felix Widmaier, Felix Grimminger, Joel Akpo, Shruti Joshi, Vaibhav Agrawal, Bilal Hammoud, Majid Khadiv, Miroslav Bogdanovic, Vincent Berenz, et al. Trifinger: An open-source robot for learning dexterity. arXiv preprint arXiv:2008.03596, 2020.
 Mengyuan Yan, Qingyun Sun, Iuri Frosio, Stephen Tyree, and Jan Kautz. How to close sim-real gap? transfer with segmentation! arXiv preprint arXiv:2005.07695, 2020.