Weakly-Supervised Disentanglement Without Compromises
Abstract
Intelligent agents should be able to learn useful representations by observing changes in their environment. We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation. First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. Third, we perform a large-scale empirical study and show that such pairs of observations are sufficient to reliably learn disentangled representations on several benchmark data sets. Finally, we evaluate our learned representations and find that they are simultaneously useful on a diverse suite of tasks, including generalization under covariate shifts, fairness, and abstract reasoning. Overall, our results demonstrate that weak supervision enables learning of useful disentangled representations in realistic scenarios.
1 Introduction
A recent line of work argued that representations which are disentangled offer useful properties such as interpretability [1, 3, 27], predictive performance [47, 48], reduced sample complexity on abstract reasoning tasks [70], and fairness [46, 14]. The key underlying assumption is that high-dimensional observations (such as images or videos) are in fact a manifestation of a low-dimensional set of independent ground-truth factors of variation [47, 3, 69]. The goal of disentangled representation learning is to learn a function mapping the observations to a low-dimensional vector that contains all the information about each factor of variation, with each coordinate (or a subset of coordinates) containing information about only one factor. Unfortunately, Locatello et al. [47] showed that the unsupervised learning of disentangled representations is theoretically impossible from i.i.d. observations without inductive biases. In practice, they observed that unsupervised models exhibit significant variance depending on hyperparameters and random seed, making their training somewhat unreliable.
On the other hand, many data modalities are not observed as i.i.d. samples from a distribution [15, 66, 29, 3, 52, 68, 61]. Changes in natural environments, which typically correspond to changes of only a few underlying factors of variation, provide a weak supervision signal for representation learning algorithms [20, 60, 5, 4]. State-of-the-art weakly-supervised disentanglement methods [6, 30, 64] assume that observations belong to annotated groups where two things are known at training time: (i) the relation between images in the same group, and (ii) the group each image belongs to. Bouchacourt et al. [6] and Hosoya [30] consider groups of observations differing in precisely one of the underlying factors. An example of such a group is a set of images of a given object with a fixed orientation, in a fixed scene, but of varying color. Shu et al. [64] generalized this notion to other relations (e.g., single shared factor, ranking information). In general, precise knowledge of the groups and their structure may require either explicit human labeling or at least strongly controlled acquisition of the observations. As a motivating example, consider the video feedback of a robotic arm. In two temporally close frames, both the manipulated objects and the arm may have changed their position, the objects themselves may be different, or the lighting conditions may have changed due to failures.
In this paper, we consider learning disentangled representations from pairs of observations which differ by a few factors of variation [5, 60, 4], as in Figure 1. Unlike previous work on weakly-supervised disentanglement, we consider the realistic and broadly applicable setting where we observe pairs of images and have no additional annotations: It is unknown which and how many factors of variation have changed. In other words, we do not know which group each pair belongs to, and what is the precise relation between the two images. The only condition we require is that the two observations are different and that the change in the factors is not dense. The key contributions of this paper are:

- We theoretically show that identifiability is possible from non-i.i.d. pairs of observations under weak assumptions. Our proof motivates the setup we consider, which is identifiable as opposed to the standard one, which was proven to be non-identifiable [47]. Further, we use theoretical arguments to inform the design of our algorithms, recover existing group-based VAE methods [6, 30] as special cases, and relax their impractical assumptions.

- We perform a large-scale reproducible experimental study, training a large number of disentanglement models and over one million downstream classifiers on five different data sets, one of which consists of real images of a robotic platform [23].

- We demonstrate that one can reliably learn disentangled representations with weak supervision only, without relying on supervised disentanglement metrics for model selection, as done in previous works. Further, we show that these representations are useful on a diverse suite of downstream tasks, including a novel experiment targeting strong generalization under covariate shifts, fairness [46], and abstract visual reasoning [70].
2 Related work
Recovering the independent components of the data generating process is a well-studied problem in machine learning. It has roots in the independent component analysis (ICA) literature, where the goal is to unmix independent non-Gaussian sources of a multi-dimensional signal [13]. Crucially, identifiability is not possible in the nonlinear case from i.i.d. observations [36]. Recently, the ICA community has considered weak forms of supervision such as temporal consistency [34, 35], auxiliary supervised information [37, 41], and multiple views [25]. A parallel thread of work has studied distribution shifts by identifying changes in causal generative factors [74, 75, 33], which is linked to a causal view of disentanglement [67, 61].
On the other hand, more applied machine learning approaches have experienced the opposite shift. Initially, the community focused on more or less explicit and task dependent supervision [55, 72, 43, 12, 50, 51]. For example, a number of works rely on known relations between the factors of variation [39, 71, 22, 17, 32, 73, 49] and disentangling motion and pose from content [31, 21, 16, 24].
Recently, there has been a renewed interest in the unsupervised learning of disentangled representations [27, 7, 42, 11, 44] along with quantitative evaluation [42, 19, 44, 11, 57, 18]. After Locatello et al. [47] proved that unsupervised learning of disentangled representations is theoretically impossible without inductive biases, the focus shifted back to semi-supervised [48, 65, 41] and weakly-supervised approaches [6, 30, 64].
3 Generative models
We first describe the generative model commonly used in the disentanglement literature, and then turn to the weakly-supervised model used in this paper.
Unsupervised generative model First, a latent variable z is drawn from the distribution of independent ground-truth factors of variation, p(z) = ∏_{i=1}^d p(z_i). Second, the observations are obtained as draws from p(x | z). The factors of variation do not need to be one-dimensional, but we assume so to simplify the notation.
Disentangled representations The goal of disentanglement learning is to learn a mapping r(x) where the effect of the different factors of variation is axis-aligned with different coordinates. More precisely, each factor of variation z_i is associated with exactly one coordinate (or group of coordinates) of r(x) and vice versa (and the groups are non-overlapping). As a result, varying one factor of variation and keeping the others fixed results in a variation of exactly one coordinate (group of coordinates) of r(x). Locatello et al. [47] showed that learning such a mapping is theoretically impossible without inductive biases or some other, possibly weak, form of supervision.
Weakly-supervised generative model We study learning of disentangled image representations from paired observations, for which some (but not all) factors of variation have the same value. This can be modeled as sampling two images from the causal generative model with an intervention [52] on a random subset of the factors of variation. Our goal is to use the additional information given by the pair (as opposed to a single image) to learn a disentangled image representation. We generally do not assume knowledge of which or how many factors are shared, i.e., we do not require controlled acquisition of the observations. This observation model applies to many practical scenarios. For example, we may want to learn a disentangled representation of a robot arm observed through a camera: In two temporally close frames some joint angles will likely have changed, but others will have remained constant. Other factors of variation may also change independently of the actions of the robot. An example can be seen in Figure 1 (right), where the first degree of freedom of the arm and the color of the background changed. More generally, this observation model applies to many natural scenes with moving objects [20]. More formally, we consider the following generative model. For simplicity of exposition, we assume that the number of factors k in which the two observations differ is constant (we present a strategy to deal with varying k in Section 4.1). The generative model is given by
z ∼ p(z) = ∏_{i=1}^d p(z_i),   x_1 = g(z),   (1)

z̃ ∼ p(z̃) = ∏_{i=1}^d p(z̃_i),   x_2 = g(f(z, z̃, S)),   (2)

where S is the subset of shared indices of size d − k sampled from a distribution p(S) over the set S of admissible index sets, and the p(z_i) and p(z̃_i) are all identical. The generative mechanism is modeled using a function g: Z → X, with Z ⊆ R^d and X ⊆ R^m, which maps the latent variable to observations of dimension m, typically m ≫ d. To make the relation between x_1 and x_2 explicit, we use a function f obeying

f(z, z̃, S)_i = z_i for i ∈ S,   f(z, z̃, S)_i = z̃_i for i ∉ S,

with f: R^d × R^d × S → R^d. Intuitively, to generate x_2, f selects the entries of z with index in S and substitutes the remaining factors with z̃, thus ensuring that the factors indexed by S are shared in the two observations. The generative model (1)–(2) does not model additive noise; we assume that noise is explicitly modeled as a latent variable and its effect is manifested through g, as done by [3, 47, 26, 27, 67, 56, 45, 42, 23]. For simplicity, we consider the case where groups consist of two observations (pairs), but extensions to more than two observations are possible [25].
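The sampling process in (1)–(2) can be sketched in a few lines of NumPy. The decoder g below is a toy stand-in (the true g renders images), and the factor dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2  # d ground-truth factors, k of which change within a pair

def f(z, z_tilde, S):
    """Substitution function from eq. (2): keep the shared entries z_S and
    replace the remaining factors with the fresh draws z_tilde."""
    z2 = z_tilde.copy()
    idx = list(S)
    z2[idx] = z[idx]
    return z2

def sample_pair(g):
    z = rng.normal(size=d)                             # z ~ p(z)
    z_tilde = rng.normal(size=d)                       # independent draw, same prior
    S = set(rng.choice(d, size=d - k, replace=False))  # shared indices, |S| = d - k
    return g(z), g(f(z, z_tilde, S)), S

# Toy stand-in for the true decoder g (the real g renders images).
g = lambda z: np.tanh(np.outer(np.arange(1, 7), z)).ravel()
x1, x2, S = sample_pair(g)
```

By construction, the two observations agree on the d − k factors indexed by S and may differ on the remaining k.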
4 Identifiability and algorithms
First, we show that, as opposed to the unsupervised case [47], the generative model (1)–(2) is identifiable under weak additional assumptions. Note that the joint distribution of all random variables factorizes as

p(x_1, x_2) = Σ_S p(S) ∫∫ p(x_1 | z) p(x_2 | f(z, z̃, S)) p(z) p(z̃) dz dz̃,   (3)

where the likelihood terms have the same distribution, i.e., p(x_1 | ·) = p(x_2 | ·) = p(x | ·). We show that to learn a disentangled generative model of the data it is therefore sufficient to recover a factorized latent distribution with factors q(ẑ_i), a corresponding likelihood q(x | ẑ), as well as a distribution q(S) over S, which together satisfy the constraints of the true generative model (1)–(2) and match the true p(x_1, x_2) after marginalization over S when substituted into (3).
Theorem 1.
Consider the generative model (1)–(2). Further assume that the p(z_i) are continuous distributions and that p(S) is a distribution over S s.t. for every S ∈ S we have p(S) > 0. Let g in (2) be smooth and invertible on Z with smooth inverse (i.e., a diffeomorphism). Given unlimited data from p(x_1, x_2) and the true (fixed) k, consider all tuples (q(ẑ), q(x | ẑ), q(S)) obeying these assumptions and matching p(x_1, x_2) after marginalization over S when substituted in (3). Then, the posteriors q(ẑ | x) are disentangled in the sense that the aggregate posteriors q(ẑ) = ∫ q(ẑ | x) p(x) dx are coordinate-wise reparameterizations of the ground-truth prior p(z), up to a permutation of the indices of ẑ.
Discussion Under the assumptions of this theorem, we established that all generative models that match the true marginal over the observations must be disentangled. Therefore, constrained distribution matching is sufficient to learn disentangled representations. Formally, the aggregate posterior q(ẑ) is a coordinate-wise reparameterization of the true distribution of the factors of variation (up to index permutations). In other words, there exists a one-to-one mapping between every entry of ẑ and a unique matching entry of z, and thus a change in a single coordinate of z implies a change in a single matching coordinate of ẑ [3]. Changing the observation model from single i.i.d. observations to non-i.i.d. pairs of observations generated according to the generative model (1)–(2) allows us to bypass the non-identifiability result of [47]. Our result requires strictly weaker assumptions than the result of Shu et al. [64], as we do not require group annotations, but only knowledge of k. As we shall see in Section 4.1, k can be cheaply and reliably estimated from data at runtime. Although the weak assumptions of Theorem 1 may not be satisfied in practice, we will show that the proof can inform practical algorithm design.
4.1 Practical adaptive algorithms
We conceive two VAE [27] variants tailored to the weakly-supervised generative model (1)–(2), together with a selection heuristic to deal with an unknown and random k. We will see that these simple models can very reliably learn disentangled representations.
The key differences between theory and practice are that: (i) we use the ELBO and amortized variational inference for distribution matching (the true and learned distributions will not exactly match after training), (ii) we have access to a finite amount of data only, and (iii) the theory assumes a known, fixed k, but k might be unknown and random.
Enforcing the structural constraints Here we present a simple structure for the variational family that allows us to tractably perform approximate inference on the weakly-supervised generative model. First note that the alignment constraints imposed by the generative model (see (7) and (8) in Appendix A) imply for the true posterior

p(ẑ_i | x_1) = p(ẑ_i | x_2)   ∀ i ∈ S,   (4)

p(ẑ_i | x_1) ≠ p(ẑ_i | x_2)   ∀ i ∉ S   (5)

(with probability one), and we want to enforce these constraints on the approximate posterior of our learned model. However, the set S is unknown. To obtain an estimate Ŝ of S we therefore choose for every pair (x_1, x_2) the d − k coordinates with the smallest δ_i = KL(q(ẑ_i | x_1) ‖ q(ẑ_i | x_2)). To impose the constraint (4), we then replace each shared coordinate with some average a of the two posteriors,

q̃(ẑ_i | x_1) = a(q(ẑ_i | x_1), q(ẑ_i | x_2)) if i ∈ Ŝ, else q(ẑ_i | x_1),

and obtain q̃(ẑ_i | x_2) in an analogous manner.
and obtain in analogous manner. As we later simply use the averaging strategies of the GroupVAE (GVAE) [30] and the Multi LevelVAE (MLVAE) [6], we term variants of our approach which infers the groups and their properties adaptively AdaptiveGroupVAE (AdaGVAE) and AdaptiveMLVAE (AdaMLVAE), depending on the choice of the averaging function . We then optimize the following variant of the VAE objective
E_{q̃(ẑ | x_1)}[log p(x_1 | ẑ)] + E_{q̃(ẑ | x_2)}[log p(x_2 | ẑ)] − β KL(q̃(ẑ | x_1) ‖ p(ẑ)) − β KL(q̃(ẑ | x_2) ‖ p(ẑ)),   (6)

where β ≥ 1 [27]. The advantage of this averaging-based implementation of (4), over implementing it, for instance, via a term that encourages the distributions of the shared coordinates to be similar, is that averaging imposes a hard constraint in the sense that q̃(ẑ | x_1) and q̃(ẑ | x_2) can jointly encode only one value per shared coordinate. This in turn implicitly enforces the constraint (5), as the non-shared dimensions need to be efficiently used to encode the non-shared factors of x_1 and x_2.
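For diagonal Gaussian posteriors, the averaging step and the KL regularizer above can be sketched as follows. This is a minimal NumPy illustration using the arithmetic mean as the averaging function a (the GVAE choice); it is not the authors' implementation:

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    # Per-coordinate KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal Gaussians.
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def average_shared(mu1, var1, mu2, var2, S_hat):
    """Hard constraint (4): shared coordinates of the two approximate
    posteriors are replaced by a common average (arithmetic mean here);
    the remaining coordinates are left untouched."""
    idx = list(S_hat)
    mu_avg = 0.5 * (mu1[idx] + mu2[idx])
    var_avg = 0.5 * (var1[idx] + var2[idx])
    out = []
    for mu, var in ((mu1, var1), (mu2, var2)):
        mu, var = mu.copy(), var.copy()
        mu[idx], var[idx] = mu_avg, var_avg
        out.append((mu, var))
    return out[0], out[1]

def kl_to_prior(mu, var):
    # The beta-weighted regularizer in (6): KL(q_tilde || N(0, I)), summed.
    return kl_diag_gauss(mu, var, np.zeros_like(mu), np.ones_like(var)).sum()
```

After averaging, both encodings carry a single value per shared coordinate, which is what makes the constraint hard rather than a soft penalty.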
We emphasize that the objective (6) is a simple modification of the VAE objective and is very easy to implement. Finally, we remark that, invoking Theorem 4 of [41], we achieve consistency under maximum likelihood estimation up to the equivalence class in our Theorem 1, for β = 1 and in the limit of infinite data and capacity.
Inferring k In the (practical) scenario where k is unknown, we use the threshold

τ = ½ (max_i δ_i + min_i δ_i),

where δ_i = KL(q(ẑ_i | x_1) ‖ q(ẑ_i | x_2)), and average the coordinates with δ_i < τ. This heuristic is inspired by the "elbow method" [40] for model selection in k-means clustering and singular value decomposition, and we found it to work surprisingly well in practice (see the experiments in Section 5). This estimate relies on the assumption that not all factors have changed. All our adaptive methods use this heuristic. Although a formal recovery argument cannot be made for arbitrary data sets, inductive biases may limit the impact of an approximate Ŝ in practice. We further remark that this heuristic always yields the correct Ŝ if the encoder is disentangled.
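The heuristic can be sketched directly for diagonal Gaussian posteriors. In this illustration we symmetrize the per-coordinate KL divergence, which is one reasonable choice rather than a detail confirmed by the text:

```python
import numpy as np

def infer_shared(mu1, var1, mu2, var2):
    """Elbow-style heuristic for unknown k: compute per-coordinate
    divergences delta_i between the two Gaussian posteriors (symmetrized
    KL here, for illustration), threshold at the midpoint
    tau = (max_i delta_i + min_i delta_i) / 2, and treat every coordinate
    with delta_i < tau as shared."""
    kl12 = 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    kl21 = 0.5 * (np.log(var1 / var2) + (var2 + (mu2 - mu1) ** 2) / var1 - 1.0)
    delta = 0.5 * (kl12 + kl21)
    tau = 0.5 * (delta.max() + delta.min())
    return np.where(delta < tau)[0]
```

Coordinates whose posteriors barely move between the two images fall below the midpoint threshold and are treated as shared, without ever specifying k.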
Relation to prior work Closely related to the proposed objective (6), the GVAE of Hosoya [30] and the MLVAE of Bouchacourt et al. [6] assume S is known and implement a using different averaging choices. Both assume Gaussian approximate posteriors, where μ_1, Σ_1 are the mean and variance of q(ẑ | x_1) and μ_2, Σ_2 are the mean and variance of q(ẑ | x_2). For the coordinates in S, the GVAE uses a simple arithmetic mean (μ = (μ_1 + μ_2)/2 and Σ = (Σ_1 + Σ_2)/2), while the MLVAE takes the product of the encoder distributions, with a taking the form

Σ = (Σ_1^{-1} + Σ_2^{-1})^{-1},   μ = Σ (Σ_1^{-1} μ_1 + Σ_2^{-1} μ_2).

Our approach critically differs in the sense that S is not known and needs to be estimated for every pair of images.
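The two averaging choices can be contrasted side by side for diagonal Gaussians (a sketch for illustration, operating on per-coordinate means and variances):

```python
import numpy as np

def gvae_average(mu1, var1, mu2, var2):
    # GVAE: simple arithmetic mean of means and variances.
    return 0.5 * (mu1 + mu2), 0.5 * (var1 + var2)

def mlvae_average(mu1, var1, mu2, var2):
    # MLVAE: product of the two Gaussian encoder distributions,
    # i.e., a precision-weighted average (product of experts).
    prec = 1.0 / var1 + 1.0 / var2
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var
```

The product form down-weights the less certain encoder, whereas the arithmetic mean treats both views equally regardless of their variances.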
Recent work combines nonlinear ICA with disentanglement [41, 65]. Critically, these approaches are based on the setup of Hyvarinen et al. [37], which requires access to label information u such that p(z | u) factorizes as ∏_i p(z_i | u). In contrast, we base our work on the setup of Gresele et al. [25], which only assumes access to two sufficiently distinct views of the latent variable. Shu et al. [64] train the same type of generative models over paired data but use a GAN objective where inference is not required. However, they require a known and fixed k as well as annotations of which factors change in each pair.
5 Experimental results
Experimental setup We consider the setup of Locatello et al. [47]. We use the five data sets where the observations are generated as deterministic functions of the factors of variation: dSprites [27], Cars3D [56], SmallNORB [45], Shapes3D [42], and the real-world robotics data set MPI3D [23]. Our unsupervised baselines correspond to a cohort of unsupervised models (VAE [27], AnnealedVAE [7], FactorVAE [42], TCVAE [11], DIP-VAE-I and DIP-VAE-II [44]), each with the same six hyperparameters from Locatello et al. [47] and 50 random seeds.
To create data sets with weak supervision from the existing disentanglement data sets, we first sample z from the discrete p(z) according to the ground-truth generative model (1)–(2). Then, we sample k factors of variation that should not be shared by the two images and resample those coordinates to obtain z̃. This ensures that each image pair differs in at most k factors of variation. For k we consider the range from 1 to d − 1. This last setting corresponds to the case where all but one factor of variation are resampled. We study both the case where k is constant across all pairs in the data set and the case where k is sampled uniformly in [1, d − 1] for every training pair (random k in the following). Unless specified otherwise, we aggregate the results for all values of k.
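The pair-construction procedure above can be sketched for a discrete factor space. The factor value counts below are illustrative assumptions, not those of any particular benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical discrete factor space: factor i takes one of n_values[i]
# values (illustrative counts, dSprites-style, not a real data set).
n_values = np.array([3, 6, 40, 32, 32])
d = len(n_values)

def sample_training_pair(k=None):
    """Sample z from the discrete p(z), then resample k of its coordinates
    to obtain the second set of factors; the pair then differs in at most
    k factors (a resampled value can coincide with the old one)."""
    z1 = rng.integers(0, n_values)
    if k is None:                       # random k: uniform in [1, d - 1]
        k = int(rng.integers(1, d))
    changed = rng.choice(d, size=k, replace=False)
    z2 = z1.copy()
    z2[changed] = rng.integers(0, n_values[changed])
    return z1, z2                       # feed both through the renderer for (x1, x2)
```

Because the changed coordinates are resampled rather than forced to new values, the pair differs in at most k factors, matching the construction described above.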
For each data set, we train four weakly-supervised methods: our adaptive and vanilla (group-supervision) variants of the GVAE [30] and MLVAE [6]. For each approach we consider six values for the regularization strength β and 10 random seeds per configuration. We perform model selection using the weakly-supervised reconstruction loss (i.e., the sum of the first two terms in (6)).
To evaluate the representations, we consider the disentanglement metrics in Locatello et al. [47]: BetaVAE score [27], FactorVAE score [42], Mutual Information Gap (MIG) [11], Modularity [57], DCI Disentanglement [19] and SAP score [44]. To directly compare the disentanglement produced by different methods, we report the DCI Disentanglement [19] in the main text and defer the plots with the other scores to the appendix as the same conclusions can be drawn based on these metrics. Appendix B contains full implementation details.
5.1 Is weak supervision enough for disentanglement?
In Figure 2, we compare the performance of the weakly-supervised methods with random k against the unsupervised methods. Unlike in unsupervised disentanglement with VAEs, where β > 1 is common, we find that β = 1 (the ELBO) performs best in most cases. We clearly observe that weakly-supervised models outperform the unsupervised ones. In Figure 6 in the appendix, we further observe that they are competitive even if we allow fully supervised model selection for the unsupervised models. The AdaGVAE performs similarly to the AdaMLVAE. For this reason, we focus the following analysis on the AdaGVAE and include AdaMLVAE results in Appendix C.
Summary With weak supervision, we reliably learn disentangled representations that outperform unsupervised ones. Our representations are competitive even if we perform fully supervised model selection on the unsupervised models.
5.2 Are our methods adaptive to different values of k?
In Figure 3 (left), we report the performance of the AdaGVAE without model selection for different values of k on MPI3D (see Figure 10 in the appendix for the other data sets). We observe that the AdaGVAE is indeed adaptive to different values of k, and it achieves better performance when the change in the factors of variation is sparser. Note that our method is agnostic to the sharing pattern between the image pairs. In applications where the number of shared factors is known to be constant, the performance may thus be further improved by injecting this knowledge into the inference procedure.
Summary Our approach makes no assumptions about which and how many factors are shared and successfully adapts to different values of k. The sparser the difference in the factors of variation, the more effective our method is in using weak supervision and learning disentangled representations.
5.3 Supervision-performance trade-offs
The case where we actually know which factor of variation is not shared was previously considered in [6, 30, 64]. Clearly, this additional knowledge should lead to improvements over our method. On the other hand, this information may be correct but incomplete in practice: For every pair of images, we know about one factor of variation that has changed, but it may not be the only one. We therefore also consider the setup where multiple factors may have changed but the algorithm is only informed about one of them. Note that the original GVAE assumes group knowledge, so we directly compare its performance with our AdaGVAE. We defer the comparison with the MLVAE [6] and with the GAN-based approaches of [64] to Appendix C.3.
In Figure 3 (center and right), we observe that when k = 1, the knowledge of which factor has changed generally improves the performance of weakly-supervised methods on MPI3D. On the other hand, the GVAE is not robust to incomplete knowledge, as its performance degrades when the factor that is labeled as non-shared is not the only one that changed. The performance degradation is stronger on the data sets with more factors of variation (dSprites/Shapes3D/MPI3D), as can be seen in Figure 12 in the appendix. This may not come as a surprise, as group-based disentanglement methods all assume that the group knowledge is precise.
Summary Whenever the groups are fully and precisely known, this information can be used to improve disentanglement. Even though our adaptive method does not use group annotations, its performance is often comparable to the methods of [6, 30, 64]. On the other hand, in practical applications there may not be precise control of which factors have changed. In this scenario, relying on incomplete group knowledge significantly harms the performance of GVAE and MLVAE as they assume exact group knowledge. A blend between our adaptive variant and the vanilla GVAE may further improve performance when only partial group knowledge is available.
5.4 Are weakly-supervised representations useful?
In this section, we investigate whether the representations learned by our AdaGVAE are useful on a variety of tasks. We show that representations with a small weakly-supervised reconstruction loss (the sum of the first two terms in (6)) achieve improved downstream performance [47, 48], improved downstream generalization [52] under covariate shifts [63, 53, 2], fairer downstream predictions [46], and improved sample complexity on an abstract reasoning task [70]. To the best of our knowledge, strong generalization under covariate shift has not been tested on disentangled representations before.
Key insight We remark that the usefulness insights of Locatello et al. [47, 48, 46] and van Steenkiste et al. [70] are based on the assumption that disentangled representations can be learned without observing the factors of variation. They consider models trained without supervision and argue that some of the supervised disentanglement scores (which require explicit labeling of the factors of variation) correlate well with desirable properties. In stark contrast, we here show that all these properties can be achieved simultaneously using only weakly-supervised data.
Downstream performance
In this section, we consider the prediction task of Locatello et al. [47], which predicts the values of the factors of variation from the representation. We also evaluate whether our weakly-supervised reconstruction loss is a good proxy for downstream performance. We use a setup identical to Locatello et al. [47] and train the same logistic regression and gradient-boosted decision trees (GBT) on the learned representations using training sets of different sizes; all test sets have a fixed size.
In Figure 4 (left), we observe that the weakly-supervised reconstruction loss of the AdaGVAE is generally anti-correlated with downstream performance. The best weakly-supervised disentanglement methods thus learn representations that are useful for training accurate classifiers downstream.
Summary The weakly-supervised reconstruction loss of our AdaGVAE is a useful proxy for downstream accuracy.
Generalization under covariate shift
Assume we have access to a large pool of unlabeled paired data, and our goal is to solve a prediction task for which we have a smaller labeled training set. Both the labeled training set and the test set are biased, but with different biases. For example, we want to predict object shape but our training set contains only red objects, whereas the test set does not contain any red objects. We create a biased training set by performing an intervention on a random factor of variation (other than the target variable), so that its value is constant in the whole training set. We perform another intervention on the test set, so that the same factor can take all other values. We train a GBT classifier on 10000 examples from the representations learned by the AdaGVAE. For each target factor of variation, we repeat the training of the classifier 10 times for different random interventions. For this experiment, we consider only dSprites, Shapes3D and MPI3D, since Cars3D and SmallNORB are too small (after an intervention on their most fine-grained factor of variation, they contain only 96 and 270 images, respectively).
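The biased train/test construction can be sketched as follows. This is an illustrative NumPy sketch of the protocol described above, not the authors' code; in the actual experiment the intervened factor is chosen among factors other than the target variable:

```python
import numpy as np

def covariate_shift_split(factors, intervened, rng):
    """The training split holds the intervened factor fixed at one random
    value, while the test split contains all of its other values."""
    values = np.unique(factors[:, intervened])
    v = rng.choice(values)
    train_idx = np.where(factors[:, intervened] == v)[0]
    test_idx = np.where(factors[:, intervened] != v)[0]
    return train_idx, test_idx

# Example: a toy grid of two factors with 3 and 4 values, respectively.
factors = np.array([(i, j) for i in range(3) for j in range(4)])
train_idx, test_idx = covariate_shift_split(factors, intervened=1,
                                            rng=np.random.default_rng(0))
```

A classifier trained on `train_idx` never sees the held-out values of the intervened factor, so good test accuracy requires generalizing across that factor.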
In Figure 4 (center), we plot the rank correlation between the disentanglement scores and the weakly-supervised reconstruction loss, and the results for generalization under covariate shifts for the AdaGVAE. We note that both the disentanglement scores and our weakly-supervised reconstruction loss are correlated with strong generalization. In Figure 4 (right), we highlight the gap between the performance of a classifier trained on a normal train/test split (which we refer to as weak generalization) and this covariate-shift setting. We do not perform model selection, so we can show the performance of the whole range of representations. We observe that there is a gap between weak and strong generalization, but the distributions of accuracies overlap significantly and are significantly better than a naive classifier based on the prior distribution of the classes.
Summary Our results provide compelling evidence that disentanglement is useful for strong generalization under covariate shifts. The best AdaGVAE models in terms of weakly-supervised reconstruction loss are useful for training classifiers that generalize under covariate shifts.
Fairness
Recently, Locatello et al. [46] showed that disentangled representations may be useful to train robust classifiers that are fairer to unobserved sensitive variables independent of the target variable. While they observed a strong correlation between demographic parity [8, 76] and disentanglement, the applicability of their approach is limited by the fact that disentangled representations are difficult to identify without access to explicit observations of the factors of variation [47].
Our experimental setup is identical to the one of Locatello et al. [46], and we measure the unfairness of a classifier as in Locatello et al. [46, Section 4]. In Figure 5 (left), we show that the weakly-supervised reconstruction loss of our AdaGVAE correlates with unfairness as strongly as the disentanglement scores do, even though the former can be computed without observing the factors of variation. In particular, we can perform model selection without observing the sensitive variable. In Figure 5 (center), we show that our AdaGVAE, together with our weakly-supervised model selection, allows us to train and identify fairer models compared to the unsupervised models of Locatello et al. [46]. Furthermore, their model selection heuristic is based on downstream performance, which requires knowledge of the sensitive variable. From both plots we conclude that our weakly-supervised reconstruction loss is a good proxy for unfairness and allows us to train fairer classifiers in the setup of Locatello et al. [46], even if the sensitive variable is not observed.
Summary We showed that, using weak supervision, we can train and identify fairer classifiers in the sense of demographic parity [8, 76]. As opposed to Locatello et al. [46], we do not need to observe the target variable, and yet our principled weakly-supervised approach outperforms their semi-supervised heuristic.
Abstract visual reasoning
Finally, we consider the abstract visual reasoning task of van Steenkiste et al. [70]. This task is based on Raven’s progressive matrices [54] and requires completing the bottom right missing panel of a sequence of context panels arranged in a grid (see Figure 17 (left) in the appendix). The algorithm is presented with six potential answers and needs to choose the correct one. To solve this task, the model has to infer the abstract relationships between the panels. We replicate the experiment of van Steenkiste et al. [70] on Shapes3D under the same exact experimental conditions (see Appendix B for more details).
In Figure 5 (right), one can see that at low sample sizes, the weakly-supervised reconstruction loss is strongly anti-correlated with performance on the abstract visual reasoning task. As previously observed by van Steenkiste et al. [70], this benefit only occurs at low sample sizes.
Summary We demonstrated that training a relational network on the representations learned by our AdaGVAE improves its sample efficiency. This result is in line with the findings of van Steenkiste et al. [70] where disentanglement was found to correlate positively with improved sample complexity.
6 Conclusion
In this paper, we considered the problem of learning disentangled representations from pairs of non-i.i.d. observations sharing an unknown, random subset of factors of variation. We demonstrated that, under certain technical assumptions, the associated disentangled generative model is identifiable. We extensively discussed the impact of the different supervision modalities, such as the degree of group-level supervision, and studied the impact of the (unknown) number of shared factors. These insights will be particularly useful to practitioners having access to specific domain knowledge. Importantly, we show how to select models with strong performance on a diverse suite of downstream tasks without using supervised disentanglement metrics, relying exclusively on weak supervision. This result is of great importance as the community is becoming increasingly interested in the practical benefits of disentangled representations [70, 46, 14, 9, 38, 10, 28]. Future work should apply the proposed framework to challenging real-world data sets where the factors of variation are not observed and extend it to an interactive setup involving reinforcement learning.
Acknowledgments: The authors thank Stefan Bauer, Ilya Tolstikhin, Sarah Strauss and Josip Djolonga for helpful discussions and comments. Francesco Locatello is supported by the Max Planck ETH Center for Learning Systems, by an ETH core grant (to Gunnar Rätsch), and by a Google Ph.D. Fellowship. This work was partially done while Francesco Locatello was at Google Research, Brain Team, Zurich.
Appendix A Proof of Theorem 1
Recall that the true marginal likelihoods , are completely specified through the smooth, invertible function . The corresponding posteriors are completely determined by . The model family for candidate marginal likelihoods and corresponding posteriors are hence conditional distributions specified by the set of smooth invertible functions and their inverses , respectively.
In order to prove identifiability, we show that every candidate posterior distribution (more precisely, the corresponding ) on the generative model (1)–(2) satisfying the assumptions stated in Theorem 1 inverts in the sense that the aggregate posterior is a coordinatewise reparameterization of up to permutation of the indices. Crucially, while neither the latent variables nor the shared indices are directly observed, observing pairs of images allows us to verify whether a candidate distribution has the right factorization (3) and sharing structure imposed by or not.
The proof is composed of the following steps:

1. We characterize the constraints that need to hold for the posterior (the associated ) to invert for fixed .

2. We parameterize all candidate posteriors (the associated ) as a function for a fixed .

3. We show that, for fixed , (the associated ) has two disentangled coordinate subspaces, one corresponding to and one corresponding to , in the sense that varying and keeping fixed results in changes of the coordinate subspace of corresponding to only, and vice versa.

4. We show that randomly sampling implies that every candidate posterior has an aggregated posterior which is a coordinatewise reparameterization of the distribution of the true factors of variation.
Step 1 We start by noting that since any continuous distribution can be obtained from the standard uniform distribution (via the inverse cumulative distribution function), it is sufficient to simply set to the dimensional standard uniform distribution and try to recover an axisaligned, smooth, invertible function (which completely characterizes and via its inverse) as well as the distribution .
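The premise of this step, that any continuous distribution can be obtained from the standard uniform via the inverse cumulative distribution function, is easy to verify numerically (a minimal sketch; the Exponential(1) target and NumPy are our choices for illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from the standard uniform distribution.
u = rng.uniform(size=100_000)

# Apply the inverse CDF of an Exponential(1) distribution,
# F^{-1}(u) = -log(1 - u), to turn uniform samples into exponential ones.
z = -np.log(1.0 - u)

# Exponential(1) has mean 1 and variance 1; the transformed samples agree.
print(f"mean={z.mean():.3f}, var={z.var():.3f}")
```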
Next, assume that is fixed but unknown, i.e., the following reasoning is conditional on . By the generative process (1)–(2) we know that all smooth, invertible candidate functions need to obey with probability (and irrespective of whether or is used)
(7)  
(8) 
for all , where is arbitrary but fixed. indexes the coordinate subspace in the image of corresponding to the unknown coordinate subspace of shared factors of . Note that choosing requires knowledge of ( can be inferred from ). Also note that satisfies (7)–(8) for .
Step 2 All smooth, invertible candidate functions can be written as , where is a smooth invertible function with smooth inverse (using that the composition of smooth invertible functions is smooth and invertible) that maps the dimensional uniform distribution to .
We have i.e., and similarly . Expressing now (7)–(8) through we have with probability
(9)  
(10) 
Thanks to invertibility and smoothness of we know that maps the coordinate subspace of to a dimensional submanifold of and the coordinate subspace to a dimensional submanifold of that is disjoint from .
Step 3 Next, we shall see that for a fixed the only admissible functions are those identifying two groups of factors (corresponding to two orthogonal coordinate subspaces): those in and those in .
To see this, we prove that can only satisfy (9)–(10) if it aligns the coordinate subspace of with the coordinate subspace of and with . In other words, and lie in the coordinate subspaces and , respectively, and the Jacobian of is block diagonal with blocks of coordinates indexed by and .
By contradiction, if does not lie in the coordinate subspace then (9) is violated as is smooth and invertible but its arguments obey for every with probability .
Likewise, if does not lie in the coordinate subspace then (10) is violated as is smooth and invertible but its arguments satisfy with probability .
As a result, (9) and (10) can only be satisfied if maps each coordinate in to a unique matching coordinate in . In other words, there exists a permutation on such that can be simplified as , where
(11)  
(12) 
Note that the permutation is required because the choice of is arbitrary. This implies that the Jacobian of is block diagonal with blocks corresponding to coordinates indexed by and (or equivalently and ).
For fixed , i.e., considering , we can recover the groups of factors in and up to permutation of the factor indices. Note that this does not yet imply that we can recover all axisaligned as the factors in and may still be entangled with each other, i.e., is not axis aligned within and .
Step 4 If now is drawn at random, we observe a mixture of distributions (but not itself) and needs to associate every with one and only one to satisfy (7)–(8), for every .
Indeed, suppose that are distributed according to a mixture of and with . Then (7) can only be satisfied with probability for a subset of coordinates of size due to invertibility and smoothness of , but . The same reasoning applies for mixtures of more than two subsets of . Therefore, (7) cannot be satisfied for drawn from a mixture of distributions but associated with a single .
Conversely, for a given , all need to be associated with the same due to invertibility and smoothness of . In more detail, all will share the same dimensional coordinate subspace due to (9)–(10) and therefore cannot be associated with two different as .
Further, note that due to the smoothness and invertibility of , for every pair of associated and we have and . The assumption
(13) 
hence implies that we “observe” every factor through as the intersection of two sets , and this intersection will be reflected as the intersection of the corresponding two coordinate subspaces . This, together with (11)–(12) finally implies
(14) 
for some permutation on . This in turn implies that the Jacobian of is diagonal.
Therefore, by the change-of-variables formula we have
(15) 
where the second equality is a consequence of the Jacobian being diagonal, and thanks to being invertible on . From (15), we can see that is a coordinatewise reparameterization of up to permutation of the indices. As a consequence, a change in a coordinate of implies a change in the unique corresponding coordinate of , so (or, equivalently, ) disentangles the factors of variation.
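For concreteness, the change-of-variables identity used in this step can be written in generic notation (the symbols $u$, $h$, and $d$ are our own: $u$ the uniform latent, $h$ the smooth invertible candidate map, $d$ the latent dimensionality):

```latex
p_{\hat{z}}(\hat{z})
  = p_u\!\left(h^{-1}(\hat{z})\right)\,
    \left|\det J_{h^{-1}}(\hat{z})\right|
  = \prod_{i=1}^{d}
    \left|\frac{\partial (h^{-1})_i}{\partial \hat{z}_i}\right|,
```

since $p_u$ is constant on the unit cube and a diagonal Jacobian makes the determinant factorize over coordinates, so that the density splits into a product of univariate terms, one per coordinate.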
Final remarks The considered generative model is identifiable up to coordinatewise reparameterization of the factors. can then be recovered via . Note that (13) effectively ensures that a weak supervision signal is available for each factor of variation.
Appendix B Implementation Details
We base our study on the disentanglement_lib of [47]. Here, we report for completeness all the hyperparameters used in our study. Our code will be released as part of the disentanglement_lib.
In our study, we fix the architecture (Table 1) along with all other hyperparameters (Table 3), except for one hyperparameter for each model (Table 2). All hyperparameters for the unsupervised models are identical to [47]. As our methods penalize the rate term in the ELBO similarly to VAE, we use the same hyperparameter range. We however note that in most cases, our model selection technique selects . Exploring a different range for smaller than one is beyond the scope of this work. For the unsupervised methods we use the same random seeds as [47]. For the weakly-supervised methods, we use .
Downstream Task The vanilla downstream task is based on [47]. For each representation, we sample training sets of sizes , , and . The test set always contains points. The downstream task consists of predicting the value of each factor of variation from the representation. We use the same two models as [47]: a cross-validated logistic regression from Scikit-learn with 10 different values for the regularization strength () and folds, and a gradient boosting classifier (GBT) from Scikit-learn with default parameters.
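A minimal sketch of this evaluation protocol on toy data (the synthetic "representation", its dimensionality, and the sample sizes are illustrative stand-ins, not the paper's settings; the two Scikit-learn models mirror the description above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy stand-in for learned representations: 10-dimensional codes where
# a binary ground-truth factor is encoded in the first coordinate.
n_train, n_test, dim = 1000, 500, 10
codes = rng.normal(size=(n_train + n_test, dim))
factor = (codes[:, 0] > 0).astype(int)

X_tr, X_te = codes[:n_train], codes[n_train:]
y_tr, y_te = factor[:n_train], factor[n_train:]

# Cross-validated logistic regression over 10 regularization strengths.
logreg = LogisticRegressionCV(Cs=10, cv=5).fit(X_tr, y_tr)

# Gradient boosting classifier with default parameters.
gbt = GradientBoostingClassifier().fit(X_tr, y_tr)

print(f"logreg={logreg.score(X_te, y_te):.2f}, gbt={gbt.score(X_te, y_te):.2f}")
```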
Downstream Task with Covariate Shift We consider the same setup as the normal downstream task, but we only train a gradient boosted classifier with examples (). For every target factor of variation, we repeat the following process 10 times: sample another factor of variation uniformly and fix its value over the whole training set to a uniformly sampled value. The test set contains only examples where the intervened factor takes values different from the one in the training set. We report the average test performance.
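The construction of the covariate-shift split can be sketched as follows (a toy illustration; the number of factors and their value ranges are arbitrary here, and no classifier is trained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ground-truth factors: each row is one observation.
n_obs, n_factors, n_values = 10_000, 5, 8
factors = rng.integers(0, n_values, size=(n_obs, n_factors))

target = 0  # the factor we want to predict
# Intervention: pick another factor uniformly and fix its value
# over the whole training set to a uniformly sampled value.
other = rng.choice([i for i in range(n_factors) if i != target])
fixed_value = rng.integers(0, n_values)

# Training set: the intervened factor equals the fixed value.
# Test set: the intervened factor takes any other value.
train_mask = factors[:, other] == fixed_value
test_mask = ~train_mask
print(train_mask.sum(), test_mask.sum())
```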
Fairness Downstream Task The fairness downstream task is based on [46]. We train the same model on each representation to predict each factor of variation, and measure the unfairness using the formula in their Section 4.
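As an illustration of the kind of score involved, the following is a simplified demographic-parity-style unfairness measure (our own formulation for intuition, not the exact formula of [46]): the mean total-variation distance between the prediction distribution conditioned on each sensitive-factor value and the marginal prediction distribution.

```python
import numpy as np

def unfairness(preds, sensitive):
    """Mean total-variation distance between per-group prediction
    distributions and the marginal one (simplified, illustrative)."""
    classes = np.unique(preds)
    marginal = np.array([(preds == c).mean() for c in classes])
    tvs = []
    for s in np.unique(sensitive):
        group = preds[sensitive == s]
        group_dist = np.array([(group == c).mean() for c in classes])
        tvs.append(0.5 * np.abs(group_dist - marginal).sum())
    return float(np.mean(tvs))

rng = np.random.default_rng(0)
sensitive = rng.integers(0, 2, size=10_000)
fair = rng.integers(0, 2, size=10_000)  # predictions independent of sensitive
unfair = sensitive.copy()               # predictions fully determined by it

print(f"fair={unfairness(fair, sensitive):.3f}, "
      f"unfair={unfairness(unfair, sensitive):.3f}")
```

Predictions independent of the sensitive factor score near 0, while predictions fully determined by it score near the maximum attainable gap.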
Abstract reasoning task We use the same simplified Shapes3D data set when training the relational network (scale and azimuth can take only four values instead of 8 and 16, respectively, to make the task feasible for humans). We consider the case where the rows in the grid have either 1, 2, or 3 constant ground-truth factors. We train the same relational model [59] as in [70] (with identical hyperparameters) on the frozen representations of our adaptive methods.
We use hyperparameters identical to [70], which are reported here for completeness. The downstream classifier is the Wild Relation Networks (WReN) model of [59]. For the experiments, we use the following random search space over the hyperparameters. The optimizer’s parameters are depicted in Table 4. The edge MLP has either 256 or 512 hidden units and 2, 3, or 4 hidden layers. The graph MLP has either 128 or 256 hidden units and 1 or 2 hidden layers before the final linear layer that computes the score. We also uniformly sample whether we apply no dropout, dropout of 0.25, dropout of 0.5, or dropout of 0.75 to units before this last layer, and we use 10 random seeds.
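The random search space described above can be sketched as a sampler (the field names and the Python representation are ours; the value sets come from the text):

```python
import random

def sample_wren_config(rng):
    """Draw one configuration from the random search space described above."""
    return {
        "edge_mlp_hidden_units": rng.choice([256, 512]),
        "edge_mlp_hidden_layers": rng.choice([2, 3, 4]),
        "graph_mlp_hidden_units": rng.choice([128, 256]),
        "graph_mlp_hidden_layers": rng.choice([1, 2]),
        "dropout_before_last_layer": rng.choice([0.0, 0.25, 0.5, 0.75]),
        "seed": rng.randrange(10),  # one of 10 random seeds
    }

rng = random.Random(0)
configs = [sample_wren_config(rng) for _ in range(4)]
print(configs[0])
```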
Encoder                    | Decoder
Input: number of channels  | Input:
conv, 32 ReLU, stride 2    | FC, 256 ReLU
conv, 32 ReLU, stride 2    | FC, ReLU
conv, 64 ReLU, stride 2    | upconv, 64 ReLU, stride 2
conv, 64 ReLU, stride 2    | upconv, 32 ReLU, stride 2
FC 256, F2                 | upconv, 32 ReLU, stride 2
                           | upconv, number of channels, stride 2
Model       | Parameter           | Values
VAE         |                     |
AnnealedVAE |                     |
AnnealedVAE | iteration threshold |
FactorVAE   |                     |
DIP-VAE-I   |                     |
DIP-VAE-II  |                     |
TCVAE       |                     |
GVAE        |                     |
AdaGVAE     |                     |
MLVAE       |                     |
AdaMLVAE    |                     |
Parameter           | Values
Batch size          |
Optimizer           | Adam
Adam: beta1         | 0.9
Adam: beta2         | 0.999
Adam: epsilon       | 1e-8
Adam: learning rate |
Appendix C Additional Results
C.1 Section 5.1
In Figure 6, we show that our methods are competitive even when fully supervised model selection is used for the unsupervised methods.
While our main analysis focuses on DCI Disentanglement [19], we report in Figure 8 the performance of our methods when evaluated using each disentanglement score, as well as Completeness [19] in Figure 7. Overall, the trends we observed in Section 5.1 for DCI Disentanglement also hold for the other disentanglement scores (with the partial exception of Modularity [58]). In Figure 9 we show that the disentanglement metrics are consistently correlated with the training metrics. We chose the weakly-supervised reconstruction loss for model selection, but the ELBO and the overall loss are also suitable.
C.2 Section 5.2
C.3 Section 5.3
In Figures 12 and 13, we observe that, regardless of the averaging, when and the different factor is known to the algorithm, this knowledge improves disentanglement. However, when this knowledge is incomplete, it harms disentanglement. In Figure 14 we show how our method compares with the Change and Share GAN-based approaches of [64]. The goal of this plot is to show that the two approaches achieve roughly similar results. We stress that strong conclusions should not be drawn from this plot, as [64] used different experimental conditions from ours. Finally, we remark that [64] assume access to which factor was either shared or changed in the pair. Our method was designed to benefit from very similar images and without any additional annotation, so it is not completely surprising that our performance is worse when . It is however interesting to notice that the GAN-based methods perform especially well on the SmallNORB and MPI3D data sets, where VAE-based approaches struggle with reconstruction as the objects are either too detailed or too small.
C.4 Section 5.4
In Figure 15 we show the figure analogous to Figure 4 for the AdaMLVAE. We observe that the trends are comparable to the ones we observed for the AdaGVAE. In Figures 16 and 17, we show the results on the fairness and abstract reasoning downstream tasks for the AdaMLVAE. Overall, we observe that the conclusions we drew for the AdaGVAE are valid for the AdaMLVAE too: good models in terms of weakly-supervised reconstruction loss are useful on all the considered downstream tasks.
Footnotes
 We invested approximately 5.85 GPU years (NVIDIA P100) for our experimental evaluation.
 In Figure 9 in the appendix, we show that the training loss and the ELBO correlate similarly with disentanglement.
References
 (2018) Discovering interpretable representations for both deep generative and discriminative models. In International Conference on Machine Learning, pp. 50–59.
 (2010) Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics, pp. 129–136.
 (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
 (2019) A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912.
 (2017) The consciousness prior. arXiv preprint arXiv:1709.08568.
 (2018) Multi-level variational autoencoder: learning disentangled representations from grouped observations. In AAAI Conference on Artificial Intelligence.
 (2018) Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599.
 (2009) Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pp. 13–18.
 (2019) Hybrid deep fault detection and isolation: combining deep neural networks and system performance models. arXiv preprint arXiv:1908.01529.
 (2019) Disentangled representation learning in cardiac image analysis. Medical Image Analysis 58, pp. 101535.
 (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems.
 (2014) Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583.
 (1994) Independent component analysis, a new concept?. Signal Processing 36 (3), pp. 287–314.
 (2019) Flexibly fair representation learning by disentanglement. In International Conference on Machine Learning, pp. 1436–1445.
 (1993) Improving generalization for temporal difference learning: the successor representation. Neural Computation 5 (4), pp. 613–624.
 (2017) Factorized variational autoencoders for modeling audience reactions to movies. In IEEE Conference on Computer Vision and Pattern Recognition.
 (2017) Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems.
 (2019) A heuristic for unsupervised model selection for variational disentangled representation learning. arXiv preprint arXiv:1905.12614.
 (2018) A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations.
 (1991) Learning invariance from transformation sequences. Neural Computation 3 (2), pp. 194–200.
 (2019) Deep self-organization: interpretable discrete representation learning on time series. In International Conference on Learning Representations.
 (2017) A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems.
 (2019) On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. In Advances in Neural Information Processing Systems.
 (2015) Learning to linearize under uncertainty. In Advances in Neural Information Processing Systems.
 (2019) The incomplete Rosetta Stone problem: identifiability results for multi-view nonlinear ICA. In Conference on Uncertainty in Artificial Intelligence (UAI).
 (2018) Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230.
 (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations.
 (2017) DARLA: improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning.
 (1999) Feature extraction through lococode. Neural Computation 11 (3), pp. 679–714.
 (2019) Group-based learning of disentangled representations with generalizability for novel contents. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 2506–2513.
 (2018) Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems.
 (2017) Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems.
 (2017) Behind distribution shift: mining driving forces of changes and causal arrows. In IEEE 17th International Conference on Data Mining (ICDM 2017), pp. 913–918.
 (2016) Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems.
 (2017) Nonlinear ICA of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pp. 460–469.
 (1999) Nonlinear independent component analysis: existence and uniqueness results. Neural Networks.
 (2019) Nonlinear ICA using auxiliary variables and generalized contrastive learning. In International Conference on Artificial Intelligence and Statistics.
 (2020) Discovering physical concepts with neural networks. Physical Review Letters 124 (1), pp. 010508.
 (2015) Bayesian representation learning with oracle constraints. arXiv preprint arXiv:1506.05011.
 (1996) The application of cluster analysis in strategic management research: an analysis and critique. Strategic Management Journal 17 (6), pp. 441–458.
 (2019) Variational autoencoders and nonlinear ICA: a unifying framework. arXiv preprint arXiv:1907.04809.
 (2018) Disentangling by factorising. In International Conference on Machine Learning.
 (2015) Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems.
 (2018) Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations.
 (2004) Learning methods for generic object recognition with invariance to pose and lighting. In IEEE Conference on Computer Vision and Pattern Recognition.
 (2019) On the fairness of disentangled representations. In Advances in Neural Information Processing Systems.
 (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning.
 (2020) Disentangling factors of variation using few labels. International Conference on Learning Representations.
 (2018) Competitive training of mixtures of independent deep generative models. In Workshop at the 6th International Conference on Learning Representations (ICLR).
 (2016) Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems.
 (2017) Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems.
 (2017) Elements of causal inference: foundations and learning algorithms. Adaptive Computation and Machine Learning Series, MIT Press.
 (2009) Dataset shift in machine learning. The MIT Press.
 (1941) Standardization of progressive matrices, 1938. British Journal of Medical Psychology 19 (1), pp. 137–150.
 (2014) Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning.
 (2015) Deep visual analogy-making. In Advances in Neural Information Processing Systems.
 (2018) Learning deep disentangled embeddings with the f-statistic loss. In Advances in Neural Information Processing Systems.
 (2016) A survey of inductive biases for factorial representation-learning. arXiv preprint arXiv:1612.05299.
 (2018) Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pp. 4477–4486.
 (2007) Learning graphical model structure using l1-regularization paths. In AAAI, Vol. 7, pp. 1278–1283.
 (2019) Causality for machine learning. arXiv preprint arXiv:1911.10500.
 (2012) On causal and anticausal learning. In International Conference on Machine Learning.
 (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90 (2), pp. 227–244.
 (2020) Weakly supervised disentanglement with guarantees. International Conference on Learning Representations.
 (2020) Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). arXiv preprint arXiv:2001.04872.
 (1995) Reinforcement driven information acquisition in non-deterministic environments. In Proceedings of the International Conference on Artificial Neural Networks, Paris, Vol. 2, pp. 159–164.
 (2019) Interventional robustness of deep latent variable models. In International Conference on Machine Learning.
 (2017) Disentangling the independently controllable factors of variation by interacting with the world. Learning Disentangled Representations Workshop at NeurIPS.
 (2018) Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069.
 (2019) Are disentangled representations helpful for abstract visual reasoning? In Advances in Neural Information Processing Systems.
 (2016) Understanding visual concepts with continuation learning. arXiv preprint arXiv:1602.06822.
 (2015) Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In Advances in Neural Information Processing Systems.
 (2018) Disentangled sequential autoencoder. In International Conference on Machine Learning, pp. 5656–5665.
 (2015) Multi-source domain adaptation: a causal view. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 3150–3157.
 (2017) Causal discovery from nonstationary/heterogeneous data: skeleton estimation and orientation determination. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), pp. 1347–1353.
 (2015) On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723.