Weakly-Supervised Disentanglement Without Compromises

Abstract

Intelligent agents should be able to learn useful representations by observing changes in their environment. We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation. First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. Third, we perform a large-scale empirical study and show that such pairs of observations are sufficient to reliably learn disentangled representations on several benchmark data sets. Finally, we evaluate our learned representations and find that they are simultaneously useful on a diverse suite of tasks, including generalization under covariate shifts, fairness, and abstract reasoning. Overall, our results demonstrate that weak supervision enables learning of useful disentangled representations in realistic scenarios.


1 Introduction

A recent line of work argued that representations which are disentangled offer useful properties such as interpretability [1, 3, 27], predictive performance [47, 48], reduced sample complexity on abstract reasoning tasks [70], and fairness [46, 14]. The key underlying assumption is that high-dimensional observations (such as images or videos) are in fact a manifestation of a low-dimensional set of independent ground-truth factors of variation  [47, 3, 69]. The goal of disentangled representation learning is to learn a function mapping the observations to a low-dimensional vector that contains all the information about each factor of variation, with each coordinate (or a subset of coordinates) containing information about only one factor. Unfortunately, Locatello et al. [47] showed that the unsupervised learning of disentangled representations is theoretically impossible from i.i.d. observations without inductive biases. In practice, they observed that unsupervised models exhibit significant variance depending on hyperparameters and random seed, making their training somewhat unreliable.

Figure 1: (left) The proposed generative model. We observe pairs of observations (x_1, x_2) sharing a random subset S of latent factors: x_1 is generated from the factors z; x_2 is generated by keeping the shared subset z_S and resampling the remaining entries (modeled by z̃). (right) Real-world example of the model: A pair of images from MPI3D [23] where all factors are shared except the first degree of freedom and the background color (red values). This corresponds to a setting where few factors in a causal generative model change, which, by the independent causal mechanisms principle, leaves the others invariant [62].

On the other hand, many data modalities are not observed as i.i.d. samples from a distribution [15, 66, 29, 3, 52, 68, 61]. Changes in natural environments, which typically correspond to changes of only a few underlying factors of variation, provide a weak supervision signal for representation learning algorithms [20, 60, 5, 4]. State-of-the-art weakly-supervised disentanglement methods [6, 30, 64] assume that observations belong to annotated groups where two things are known at training time: (i) the relation between images in the same group, and (ii) the group each image belongs to. Bouchacourt et al. [6], Hosoya [30] consider groups of observations differing in precisely one of the underlying factors. An example of such a group are images of a given object with a fixed orientation, in a fixed scene, but of varying color. Shu et al. [64] generalized this notion to other relations (e.g., single shared factor, ranking information). In general, precise knowledge of the groups and their structure may require either explicit human labeling or at least strongly controlled acquisition of the observations. As a motivating example, consider the video feedback of a robotic arm. In two temporally close frames, both the manipulated objects and the arm may have changed their position, the objects themselves may be different, or the lighting conditions may have changed due to failures.

In this paper, we consider learning disentangled representations from pairs of observations which differ by a few factors of variation [5, 60, 4], as in Figure 1. Unlike previous work on weakly-supervised disentanglement, we consider the realistic and broadly applicable setting where we observe pairs of images and have no additional annotations: it is unknown which and how many factors of variation have changed. In other words, we do not know which group each pair belongs to, nor what the precise relation between the two images is. The only condition we require is that the two observations are different and that the change in the factors is not dense. The key contributions of this paper are:

  • We present simple adaptive group-based disentanglement methods which do not require annotations of the groups, as opposed to [6, 30, 64]. Our approach is readily applicable to a variety of settings where groups of non-i.i.d. observations are available with no additional annotations.

  • We theoretically show that identifiability is possible from non-i.i.d. pairs of observations under weak assumptions. Our proof motivates the setup we consider, which is identifiable as opposed to the standard one, which was proven to be non-identifiable [47]. Further, we use theoretical arguments to inform the design of our algorithms, recover existing group-based VAE methods [6, 30] as special cases, and relax their impractical assumptions.

  • We perform a large-scale reproducible experimental study, training thousands of disentanglement models and over one million downstream classifiers¹ on five different data sets, one of which consists of real images of a robotic platform [23].

  • We demonstrate that one can reliably learn disentangled representations with weak supervision only, without relying on supervised disentanglement metrics for model selection, as done in previous works. Further, we show that these representations are useful on a diverse suite of downstream tasks, including a novel experiment targeting strong generalization under covariate shifts, fairness [46] and abstract visual reasoning [70].

2 Related work

Recovering independent components of the data generating process is a well-studied problem in machine learning. It has roots in the independent component analysis (ICA) literature, where the goal is to unmix independent non-Gaussian sources of a d-dimensional signal [13]. Crucially, identifiability is not possible in the nonlinear case from i.i.d. observations [36]. Recently, the ICA community has considered weak forms of supervision such as temporal consistency [34, 35], auxiliary supervised information [37, 41], and multiple views [25]. A parallel thread of work has studied distribution shifts by identifying changes in causal generative factors [74, 75, 33], which is linked to a causal view of disentanglement [67, 61].

On the other hand, more applied machine learning approaches have experienced the opposite shift. Initially, the community focused on more or less explicit, task-dependent supervision [55, 72, 43, 12, 50, 51]. For example, a number of works rely on known relations between the factors of variation [39, 71, 22, 17, 32, 73, 49] and on disentangling motion and pose from content [31, 21, 16, 24].

Recently, there has been a renewed interest in the unsupervised learning of disentangled representations [27, 7, 42, 11, 44] along with quantitative evaluation [42, 19, 44, 11, 57, 18]. After Locatello et al. [47] proved that unsupervised learning of disentangled representations is theoretically impossible without inductive biases, the focus shifted back to semi-supervised [48, 65, 41] and weakly-supervised approaches [6, 30, 64].

3 Generative models

We first describe the generative model commonly used in the disentanglement literature, and then turn to the weakly-supervised model used in this paper.

Unsupervised generative model First, a latent variable z is drawn from the distribution p(z) = ∏_i p(z_i) of independent ground-truth factors of variation. Second, the observations are obtained as draws from p(x | z). The factors of variation do not need to be one-dimensional, but we assume so to simplify the notation.

Disentangled representations The goal of disentanglement learning is to learn a mapping r(x) where the effect of the different factors of variation is axis-aligned with different coordinates. More precisely, each factor of variation is associated with exactly one coordinate (or group of coordinates) of r(x) and vice versa (and the groups are non-overlapping). As a result, varying one factor of variation and keeping the others fixed results in a variation of exactly one coordinate (group of coordinates) of r(x). Locatello et al. [47] showed that learning such a mapping is theoretically impossible without inductive biases or some other, possibly weak, form of supervision.

Weakly-supervised generative model We study learning of disentangled image representations from paired observations, for which some (but not all) factors of variation have the same value. This can be modeled as sampling two images from the causal generative model with an intervention [52] on a random subset of the factors of variation. Our goal is to use the additional information given by the pair (as opposed to a single image) to learn a disentangled image representation. We generally do not assume knowledge of which or how many factors are shared, i.e., we do not require controlled acquisition of the observations. This observation model applies to many practical scenarios. For example, we may want to learn a disentangled representation of a robot arm observed through a camera: in two temporally close frames some joint angles will likely have changed, but others will have remained constant. Other factors of variation may also change independently of the actions of the robot. An example can be seen in Figure 1 (right), where the first degree of freedom of the arm and the color of the background changed. More generally, this observation model applies to many natural scenes with moving objects [20]. More formally, we consider the following generative model. For simplicity of exposition, we assume that the number k of factors in which the two observations differ is constant (we present a strategy to deal with an unknown and varying k in Section 4.1). The generative model is given by

z ∼ p(z) = ∏_{i=1}^{d} p(z_i),   z̃ ∼ p(z̃) = ∏_{i=1}^{d} p(z̃_i)   (1)
x_1 = g(z),   x_2 = g(f(z, z̃))   (2)

where S ⊆ {1, …, d} is the subset of shared indices of size d − k, sampled from a distribution p(S) over the subsets of that size, and the p(z_i) and p(z̃_i) are all identical. The generative mechanism is modeled using a function g : Z → X, with Z = supp(z) and X = g(Z), which maps the latent variable to observations of dimension m, typically m ≫ d. To make the relation between x_1 and x_2 explicit, we use a function f(z, z̃) obeying

f(z, z̃)_S = z_S and f(z, z̃)_S̄ = z̃_S̄, with S̄ = {1, …, d} \ S. Intuitively, to generate x_2, f selects the entries of z with index in S and substitutes the remaining factors with the corresponding entries of z̃, thus ensuring that the factors indexed by S are shared by the two observations. The generative model (1)–(2) does not model additive noise; we assume that noise is explicitly modeled as a latent variable and that its effect is manifested through g, as done by [3, 47, 26, 27, 67, 56, 45, 42, 23]. For simplicity, we consider the case where groups consist of two observations (pairs), but extensions to more than two observations are possible [25].
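For illustration, the observation process (1)–(2) can be simulated in a few lines of NumPy. The decoder g below is a placeholder (a fixed random linear map followed by a nonlinearity), not the renderer of any of our data sets, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2                       # number of factors, number of changed factors
W = rng.standard_normal((64, d))  # fixed random "renderer" (placeholder for g)

def g(z):
    """Placeholder generative mechanism mapping d factors to an m-dim observation."""
    return np.tanh(W @ z)

def sample_pair():
    z = rng.uniform(size=d)                       # z ~ p(z), independent factors
    z_tilde = rng.uniform(size=d)                 # z~ ~ p(z~), same prior
    S = rng.choice(d, size=d - k, replace=False)  # shared index set, |S| = d - k
    z2 = z_tilde.copy()
    z2[S] = z[S]                                  # f(z, z~): keep shared entries
    return g(z), g(z2), S

x1, x2, S = sample_pair()
```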

4 Identifiability and algorithms

First, we show that, as opposed to the unsupervised case [47], the generative model (1)–(2) is identifiable under weak additional assumptions. Note that the joint distribution of all random variables factorizes as

p(x_1, x_2, z, z̃, S) = p(S) p(z) p(z̃) p(x_1 | z) p(x_2 | f(z, z̃))   (3)

where the two likelihood terms have the same distribution, i.e., both observations are generated from their respective latent vectors through the same mechanism. We show that, to learn a disentangled generative model of the data, it is therefore sufficient to recover a factorized latent distribution p̂(ẑ) with factors p̂(ẑ_i), a corresponding likelihood p̂(x | ẑ), as well as a distribution p̂(S) over S, which together satisfy the constraints of the true generative model (1)–(2) and match the true p(x_1, x_2) after marginalization over (ẑ, S) when substituted into (3).

Theorem 1.

Consider the generative model (1)–(2). Further assume that the p(z_i) are continuous distributions and that p(S) is a distribution over subsets of {1, …, d} such that for S ∼ p(S) we have 1 ≤ |S| ≤ d − 1, and that every index i ∈ {1, …, d} arises as the intersection S ∩ S′ = {i} of two sets in the support of p(S). Let g in (2) be smooth and invertible on its image with smooth inverse (i.e., a diffeomorphism). Given unlimited data from p(x_1, x_2) and the true (fixed) k, consider all tuples (p̂(ẑ), p̂(x | ẑ), p̂(S)) obeying these assumptions and matching p(x_1, x_2) after marginalization over (ẑ, S) when substituted into (3). Then, the posteriors p̂(ẑ | x) are disentangled in the sense that the aggregate posteriors p̂(ẑ) are coordinate-wise reparameterizations of the ground-truth prior p(z), up to a permutation of the indices of ẑ.

Discussion Under the assumptions of this theorem, we established that all generative models that match the true marginal over the observation pairs must be disentangled. Therefore, constrained distribution matching is sufficient to learn disentangled representations. Formally, the aggregate posterior p̂(ẑ) is a coordinate-wise reparameterization of the true distribution of the factors of variation (up to index permutations). In other words, there exists a one-to-one mapping between every entry of ẑ and a unique matching entry of z, and thus a change in a single coordinate of z implies a change in a single matching coordinate of ẑ [3]. Changing the observation model from single i.i.d. observations to non-i.i.d. pairs of observations generated according to the generative model (1)–(2) allows us to bypass the non-identifiability result of [47]. Our result requires strictly weaker assumptions than the result of Shu et al. [64], as we do not require group annotations but only knowledge of k. As we shall see in Section 4.1, k can be cheaply and reliably estimated from data at run-time. Although the weak assumptions of Theorem 1 may not be satisfied in practice, we will show that the proof can inform practical algorithm design.

4.1 Practical adaptive algorithms

We conceive two β-VAE [27] variants tailored to the weakly-supervised generative model (1)–(2) and a selection heuristic to deal with an unknown and random k. We will see that these simple models can very reliably learn disentangled representations.

The key differences between theory and practice are that: (i) we use the ELBO and amortized variational inference for distribution matching (the true and learned distributions will not match exactly after training), (ii) we only have access to a finite amount of data, and (iii) the theory assumes a known, fixed k, but k might be unknown and random in practice.

Enforcing the structural constraints Here we present a simple structure for the variational family that allows us to tractably perform approximate inference in the weakly-supervised generative model. First note that the alignment constraints imposed by the generative model (see (7) and (8) in Appendix A, evaluated for the true generator) imply for the true posterior

p(z_i | x_1) = p(z_i | x_2)  for all i ∈ S   (4)
p(z_i | x_1) ≠ p(z_i | x_2)  for all i ∉ S   (5)

(with probability 1), and we want to enforce these constraints on the approximate posterior of our learned model. However, the set S is unknown. To obtain an estimate Ŝ of S, we therefore choose for every pair (x_1, x_2) the d − k coordinates with the smallest δ_i := KL(q_φ(ẑ_i | x_1) ‖ q_φ(ẑ_i | x_2)). To impose the constraint (4), we then replace each shared coordinate with some average a of the two posteriors

q̃_φ(ẑ_i | x_1) = a(q_φ(ẑ_i | x_1), q_φ(ẑ_i | x_2)) if i ∈ Ŝ,   q̃_φ(ẑ_i | x_1) = q_φ(ẑ_i | x_1) else,

and obtain q̃_φ(ẑ | x_2) in an analogous manner. As we later simply use the averaging strategies of the Group-VAE (GVAE) [30] and the Multi-Level VAE (ML-VAE) [6], we term the variants of our approach which infer the groups and their properties adaptively Adaptive-Group-VAE (Ada-GVAE) and Adaptive-ML-VAE (Ada-ML-VAE), depending on the choice of the averaging function a. We then optimize the following variant of the β-VAE objective

E_{(x_1, x_2)} [ E_{q̃_φ(ẑ|x_1)} log p_θ(x_1 | ẑ) + E_{q̃_φ(ẑ|x_2)} log p_θ(x_2 | ẑ) − β KL(q̃_φ(ẑ | x_1) ‖ p(ẑ)) − β KL(q̃_φ(ẑ | x_2) ‖ p(ẑ)) ]   (6)

where β ≥ 1 [27]. The advantage of this averaging-based implementation of (4), over implementing it, for instance, via a KL-term that encourages the distributions of the shared coordinates to be similar, is that averaging imposes a hard constraint: the encodings of x_1 and x_2 can jointly carry only one value per shared coordinate. This in turn implicitly enforces the constraint (5), as the non-shared dimensions need to be used efficiently to encode the non-shared factors of x_1 and x_2.
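A minimal NumPy sketch of this averaging step, assuming diagonal Gaussian posteriors and a known k (the function names are ours, not part of disentanglement_lib):

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """Coordinate-wise KL( N(mu1, var1) || N(mu2, var2) ) for diagonal Gaussians."""
    return 0.5 * (var1 / var2 + (mu2 - mu1) ** 2 / var2 - 1.0 + np.log(var2 / var1))

def ada_gvae_average(mu1, var1, mu2, var2, k):
    """Estimate the shared set as the d - k coordinates with the smallest KL and
    replace them by the GVAE-style arithmetic average, so the two encodings
    jointly carry a single value per shared coordinate (the hard constraint)."""
    delta = kl_gauss(mu1, var1, mu2, var2)        # per-coordinate divergences
    shared = np.argsort(delta)[: len(delta) - k]  # estimated shared indices
    mu_avg = 0.5 * (mu1[shared] + mu2[shared])
    var_avg = 0.5 * (var1[shared] + var2[shared])
    out = []
    for mu, var in ((mu1, var1), (mu2, var2)):
        mu, var = mu.copy(), var.copy()
        mu[shared], var[shared] = mu_avg, var_avg
        out.append((mu, var))
    return out[0], out[1], shared
```

The two modified posteriors then simply enter the reconstruction and KL terms of the objective (6) in place of the original encoder outputs.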

We emphasize that the objective (6) is a simple modification of the β-VAE objective and is very easy to implement. Finally, we remark that, invoking Theorem 4 of [41], we achieve consistency under maximum likelihood estimation up to the equivalence class in our Theorem 1, for β = 1 and in the limit of infinite data and capacity.

Inferring k In the (practical) scenario where k is unknown, we use the threshold

τ := ½ (max_i δ_i + min_i δ_i),

and average the coordinates with δ_i < τ. This heuristic is inspired by the “elbow method” [40] for model selection in k-means clustering and k-singular value decomposition, and we found it to work surprisingly well in practice (see the experiments in Section 5). This estimate relies on the assumption that not all factors have changed. All our adaptive methods use this heuristic. Although a formal recovery argument cannot be made for arbitrary data sets, inductive biases may limit the impact of an approximate k in practice. We further remark that this heuristic always yields the correct estimate of the shared coordinates if the encoder is disentangled.
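A sketch of this thresholding heuristic, reusing the per-coordinate divergences δ_i from the previous sketch (our own naming):

```python
import numpy as np

def infer_shared(delta):
    """Elbow-style heuristic for unknown k: treat as shared every coordinate whose
    divergence lies below the midpoint of the smallest and largest divergence.
    Assumes at least one factor changed and at least one factor is shared."""
    tau = 0.5 * (delta.max() + delta.min())
    return np.flatnonzero(delta < tau)

# usage: shared = infer_shared(kl_gauss(mu1, var1, mu2, var2))
```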

Relation to prior work Closely related to the proposed objective (6), the GVAE of Hosoya [30] and the ML-VAE of Bouchacourt et al. [6] assume that S is known and implement a using different averaging choices. Both assume Gaussian approximate posteriors, where μ_1, Σ_1 are the mean and variance of q_φ(ẑ | x_1) and μ_2, Σ_2 are the mean and variance of q_φ(ẑ | x_2). For the coordinates in S, the GVAE uses a simple arithmetic mean (μ = (μ_1 + μ_2)/2 and Σ = (Σ_1 + Σ_2)/2) and the ML-VAE takes the product of the encoder distributions, with a taking the form

Σ⁻¹ = Σ_1⁻¹ + Σ_2⁻¹,   μᵀ = (μ_1ᵀ Σ_1⁻¹ + μ_2ᵀ Σ_2⁻¹) Σ.

Our approach critically differs in the sense that S is not known and needs to be estimated for every pair of images.
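In the diagonal-Gaussian case, the two averaging choices reduce to the following per-coordinate operations (a simplified sketch with our own naming):

```python
def average_gvae(mu1, var1, mu2, var2):
    """GVAE [30]: arithmetic mean of means and variances (per coordinate)."""
    return 0.5 * (mu1 + mu2), 0.5 * (var1 + var2)

def average_mlvae(mu1, var1, mu2, var2):
    """ML-VAE [6]: product of the two Gaussian encoder distributions,
    i.e., a precision-weighted average (per coordinate)."""
    var = 1.0 / (1.0 / var1 + 1.0 / var2)
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var
```

Either function can be plugged in as the averaging function a in the adaptive variants above; this is the only difference between the Ada-GVAE and the Ada-ML-VAE.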

Recent work combines non-linear ICA with disentanglement [41, 65]. Critically, these approaches are based on the setup of Hyvarinen et al. [37], which requires access to auxiliary information u such that the prior factorizes as p(z | u) = ∏_i p(z_i | u). In contrast, we base our work on the setup of Gresele et al. [25], which only assumes access to two sufficiently distinct views of the latent variable. Shu et al. [64] train the same type of generative models over paired data but use a GAN objective where inference is not required. However, they require a known and fixed sharing pattern as well as annotations of which factors change in each pair.

5 Experimental results

Experimental setup We consider the setup of Locatello et al. [47]. We use the five data sets where the observations are generated as deterministic functions of the factors of variation: dSprites [27], Cars3D [56], SmallNORB [45], Shapes3D [42], and the real-world robotics data set MPI3D [23]. Our unsupervised baselines correspond to a cohort of unsupervised models (β-VAE [27], AnnealedVAE [7], FactorVAE [42], β-TCVAE [11], DIP-VAE-I and II [44]), each with the same six hyperparameter values from Locatello et al. [47] and 50 random seeds.

To create data sets with weak supervision from the existing disentanglement data sets, we first sample z from the discrete p(z) according to the ground-truth generative model (1)–(2). Then, we sample k factors of variation that should not be shared by the two images and re-sample those coordinates to obtain z̃. This ensures that each image pair differs in at most k factors of variation (re-sampling a discrete factor can return the same value). For k we consider the range from 1 to d − 1. The last setting corresponds to the case where all but one factor of variation are re-sampled. We study both the case where k is constant across all pairs in the data set and the case where k is sampled uniformly in {1, …, d − 1} for every training pair (k = Rnd in the following). Unless specified otherwise, we aggregate the results for all values of k.
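A sketch of this pair-construction procedure; `sample_factors` and `render` are hypothetical stand-ins for a data set's ground-truth model, not disentanglement_lib functions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pair(sample_factors, render, k):
    """Create one training pair differing in at most k factors of variation.
    `sample_factors()` draws a full factor vector (np.ndarray) and `render(z)`
    produces the corresponding image."""
    z1 = sample_factors()
    changed = rng.choice(len(z1), size=k, replace=False)  # non-shared factors
    z2 = z1.copy()
    z2[changed] = sample_factors()[changed]  # re-sample only those coordinates
    return render(z1), render(z2)
```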

For each data set, we train four weakly-supervised methods: our adaptive and vanilla (group-supervised) variants of the GVAE [30] and the ML-VAE [6]. For each approach, we consider six values for the regularization strength β and 10 random seeds, training thousands of weakly-supervised models in total. We perform model selection using the weakly-supervised reconstruction loss (i.e., the sum of the first two terms in (6))². We stress that we do not require labels for model selection.

To evaluate the representations, we consider the disentanglement metrics in Locatello et al. [47]: BetaVAE score [27], FactorVAE score [42], Mutual Information Gap (MIG) [11], Modularity [57], DCI Disentanglement [19] and SAP score [44]. To directly compare the disentanglement produced by different methods, we report the DCI Disentanglement [19] in the main text and defer the plots with the other scores to the appendix as the same conclusions can be drawn based on these metrics. Appendix B contains full implementation details.

5.1 Is weak supervision enough for disentanglement?

In Figure 2, we compare the performance of the weakly-supervised methods with k = Rnd against the unsupervised methods. Unlike in unsupervised disentanglement with β-VAEs, where β > 1 is the common choice, we find that β = 1 (i.e., the plain ELBO) performs best in most cases. We clearly observe that weakly-supervised models outperform the unsupervised ones. In Figure 6 in the appendix, we further observe that they are competitive even if we allow fully supervised model selection on the unsupervised models. The Ada-GVAE performs similarly to the Ada-ML-VAE. For this reason, we focus the following analysis on the Ada-GVAE and include the Ada-ML-VAE results in Appendix C.

Figure 2: Our adaptive variants of the group-based disentanglement methods (models 6 and 7) significantly and consistently outperform unsupervised methods. In particular, the Ada-GVAE consistently yields the same or better performance than the Ada-ML-VAE. In this experiment, we consider the case where the number of shared factors of variation is random and different for every pair with high probability (k = Rnd). Legend: 0=β-VAE, 1=FactorVAE, 2=β-TCVAE, 3=DIP-VAE-I, 4=DIP-VAE-II, 5=AnnealedVAE, 6=Ada-ML-VAE, 7=Ada-GVAE

Summary With weak supervision, we reliably learn disentangled representations that outperform unsupervised ones. Our representations are competitive even if we perform fully supervised model selection on the unsupervised models.

5.2 Are our methods adaptive to different values of k?

In Figure 3 (left), we report the performance of the Ada-GVAE without model selection for different values of k on MPI3D (see Figure 10 in the appendix for the other data sets). We observe that the Ada-GVAE is indeed adaptive to different values of k and achieves better performance when the change between the factors of variation is sparser. Note that our method is agnostic to the sharing pattern between the image pairs. In applications where the number of shared factors is known to be constant, the performance may thus be further improved by injecting this knowledge into the inference procedure.

Summary Our approach makes no assumptions about which and how many factors are shared and successfully adapts to different values of k. The sparser the difference in the factors of variation, the more effective our method is at using weak supervision and learning disentangled representations.

Figure 3: (left) Performance of the Ada-GVAE with different k on MPI3D. The algorithm adapts well to the unknown k and benefits from sparser changes. (center and right) Comparison of the Ada-GVAE with the vanilla GVAE, which assumes group knowledge. We note that group knowledge may improve performance (center) but can also hurt when it is incomplete (right).
Figure 4: (left) Rank correlation between our weakly-supervised reconstruction loss and performance of downstream prediction tasks with logistic regression (LR) and gradient boosted decision-trees (GBT) at different sample sizes for Ada-GVAE. We observe a general negative correlation that indicates that models with a low weakly-supervised reconstruction loss may also be more accurate. (center) Rank correlation between disentanglement scores and weakly-supervised reconstruction loss, along with strong generalization accuracy under covariate shifts for Ada-GVAE. (right) Distribution of vanilla (weak) generalization and under covariate shifts (strong generalization) for Ada-GVAE. The horizontal line corresponds to the accuracy of a naive classifier based on the prior only.

5.3 Supervision-performance trade-offs

The case where we actually know which factor of variation is not shared was previously considered in [6, 30, 64]. Clearly, this additional knowledge should lead to improvements over our method. On the other hand, this information may be correct but incomplete in practice: for every pair of images, we may know about one factor of variation that has changed, but it may not be the only one. We therefore also consider the setup where k = 2 but the algorithm is informed about only one factor. Note that the original GVAE assumes group knowledge, so we directly compare its performance with our Ada-GVAE. We defer the comparison with the ML-VAE [6] and with the GAN-based approaches of [64] to Appendix C.3.

In Figure 3 (center and right), we observe that when k = 1, knowledge of which factor has changed generally improves the performance of weakly-supervised methods on MPI3D. On the other hand, the GVAE is not robust to incomplete knowledge, as its performance degrades when the factor that is labeled as non-shared is not the only one that changed. The performance degradation is stronger on the data sets with more factors of variation (dSprites/Shapes3D/MPI3D), as can be seen in Figure 12 in the appendix. This may not come as a surprise, as group-based disentanglement methods all assume that the group knowledge is precise.

Summary Whenever the groups are fully and precisely known, this information can be used to improve disentanglement. Even though our adaptive method does not use group annotations, its performance is often comparable to the methods of [6, 30, 64]. On the other hand, in practical applications there may not be precise control of which factors have changed. In this scenario, relying on incomplete group knowledge significantly harms the performance of GVAE and ML-VAE as they assume exact group knowledge. A blend between our adaptive variant and the vanilla GVAE may further improve performance when only partial group knowledge is available.

5.4 Are weakly-supervised representations useful?

In this section, we investigate whether the representations learned by our Ada-GVAE are useful on a variety of tasks. We show that representations with small weakly-supervised reconstruction loss (the sum of the first two terms in (6)) achieve improved downstream performance [47, 48], improved downstream generalization [52] under covariate shifts [63, 53, 2], fairer downstream predictions [46], and improved sample complexity on an abstract reasoning task [70]. To the best of our knowledge, strong generalization under covariate shift has not been tested on disentangled representations before.

Key insight We remark that the usefulness insights of Locatello et al. [47, 48, 46], van Steenkiste et al. [70] are based on the assumption that disentangled representations can be learned without observing the factors of variation. They consider models trained without supervision and argue that some of the supervised disentanglement scores (which require explicit labeling of the factors of variation) correlate well with desirable properties. In stark contrast, we here show that all these properties can be achieved simultaneously using only weakly-supervised data.

Downstream performance

In this section, we consider the prediction task of Locatello et al. [47]: predicting the values of the factors of variation from the representation. We also evaluate whether our weakly-supervised reconstruction loss is a good proxy for downstream performance. We use a setup identical to Locatello et al. [47] and train the same logistic regression and gradient boosted decision trees (GBT) on the learned representations using different sample sizes (10/100/1000/10000). All test sets contain 5000 examples.
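A sketch of this evaluation protocol with scikit-learn; the function name is ours and the hyperparameters below are illustrative defaults rather than the exact settings of [47] (see Appendix B):

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import GradientBoostingClassifier

def downstream_scores(repr_train, y_train, repr_test, y_test):
    """Train the two downstream predictors on a frozen representation and
    report test accuracy for a single target factor of variation."""
    scores = {}
    for name, clf in [("LR", LogisticRegressionCV(Cs=10, cv=5)),
                      ("GBT", GradientBoostingClassifier())]:
        clf.fit(repr_train, y_train)
        scores[name] = clf.score(repr_test, y_test)
    return scores
```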

In Figure 4 (left), we observe that the weakly-supervised reconstruction loss of Ada-GVAE is generally anti-correlated with downstream performance. The best weakly-supervised disentanglement methods thus learn representations that are useful for training accurate classifiers downstream.

Summary The weakly-supervised reconstruction loss of our Ada-GVAE is a useful proxy for downstream accuracy.

Figure 5: (left) Rank correlation between both disentanglement scores and our weakly-supervised reconstruction loss with the unfairness of GBT10000 on all the data sets for the Ada-GVAE. (center) Unfairness of the unsupervised methods with the semi-supervised model selection heuristic of [46] and our weakly-supervised Ada-GVAE with k = Rnd. (right) Rank correlation with downstream accuracy of the abstract visual reasoning models of [70] throughout training (i.e., for different sample sizes).

Generalization under covariate shift

Assume we have access to a large pool of unlabeled paired data and our goal is to solve a prediction task for which we have a smaller labeled training set. Both the labeled training set and test set are biased, but with different biases. For example, we want to predict object shape but our training set contains only red objects, whereas the test set does not contain any red objects. We create a biased training set by performing an intervention on a random factor of variation (other than the target variable), so that its value is constant in the whole training set. We perform another intervention on the test set, so that the same factor can take all other values. We train a GBT classifier on 10000 examples from the representations learned by Ada-GVAE. For each target factor of variation, we repeat the training of the classifier 10 times for different random interventions. For this experiment, we consider only dSprites, Shapes3D and MPI3D since Cars3D and SmallNORB are too small (after an intervention on their most fine grained factor of variation, they only contain 96 and 270 images respectively).
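A sketch of how such a biased split can be constructed from the ground-truth factor labels (our own simplified implementation of the protocol described above):

```python
import numpy as np

rng = np.random.default_rng(0)

def covariate_shift_split(factors, target_idx):
    """Build biased train/test index sets: pick a random non-target factor,
    fix it to one random value in the training set, and exclude that value
    from the test set. `factors` is an (n_images, n_factors) integer array."""
    other = [i for i in range(factors.shape[1]) if i != target_idx]
    f = rng.choice(other)                     # intervened factor
    v = rng.choice(np.unique(factors[:, f]))  # its (fixed) training value
    train_idx = np.flatnonzero(factors[:, f] == v)
    test_idx = np.flatnonzero(factors[:, f] != v)
    return train_idx, test_idx
```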

In Figure 4 (center) we plot the rank correlation between disentanglement scores and weakly-supervised reconstruction, and the results for generalization under covariate shifts for Ada-GVAE. We note that both the disentanglement scores and our weakly-supervised reconstruction loss are correlated with strong generalization. In Figure 4 (right), we highlight the gap between the performance of a classifier trained on a normal train/test split (which we refer to as weak generalization) as opposed to this covariate shift setting. We do not perform model selection, so we can show the performance of the whole range of representations. We observe that there is a gap between weak and strong generalization but the distributions of accuracies significantly overlap and are significantly better than a naive classifier based on the prior distribution of the classes.

Summary Our results provide compelling evidence that disentanglement is useful for strong generalization under covariate shifts. The best Ada-GVAE models in terms of weakly-supervised reconstruction loss are useful for training classifiers that generalize under covariate shifts.

Fairness

Recently, Locatello et al. [46] showed that disentangled representations may be useful to train robust classifiers that are fairer to unobserved sensitive variables independent of the target variable. While they observed a strong correlation between demographic parity [8, 76] and disentanglement, the applicability of their approach is limited by the fact that disentangled representations are difficult to identify without access to explicit observations of the factors of variation [47].

Our experimental setup is identical to the one of Locatello et al. [46], and we measure the unfairness of a classifier as in Locatello et al. [46, Section 4]. In Figure 5 (left), we show that the weakly-supervised reconstruction loss of our Ada-GVAE correlates with unfairness as strongly as the disentanglement scores, even though the former can be computed without observing the factors of variation. In particular, we can perform model selection without observing the sensitive variable. In Figure 5 (center), we show that our Ada-GVAE with k = Rnd and model selection allows us to train and identify fairer models compared to the unsupervised models of Locatello et al. [46]. Furthermore, their model selection heuristic is based on downstream performance, which requires knowledge of the sensitive variable. From both plots we conclude that our weakly-supervised reconstruction loss is a good proxy for unfairness and allows us to train fairer classifiers in the setup of Locatello et al. [46] even if the sensitive variable is not observed.
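For intuition, a demographic-parity-style unfairness score in the spirit of [46, Section 4] can be computed as the average total-variation distance between conditional and marginal prediction distributions. The sketch below is our simplified reimplementation under that assumption, not their exact code:

```python
import numpy as np

def demographic_parity_unfairness(y_pred, sensitive):
    """Average TV distance between the prediction distribution conditioned on
    each sensitive value and the marginal prediction distribution (0 = fair)."""
    classes = np.unique(y_pred)
    marginal = np.array([(y_pred == c).mean() for c in classes])
    tvs = []
    for s in np.unique(sensitive):
        mask = sensitive == s
        cond = np.array([(y_pred[mask] == c).mean() for c in classes])
        tvs.append(0.5 * np.abs(cond - marginal).sum())
    return float(np.mean(tvs))
```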

Summary We showed that using weak supervision, we can train and identify fairer classifiers in the sense of demographic parity [8, 76]. As opposed to Locatello et al. [46], we do not need to observe the target variable and yet, our principled weakly-supervised approach outperforms their semi-supervised heuristic.

Abstract visual reasoning

Finally, we consider the abstract visual reasoning task of van Steenkiste et al. [70]. This task is based on Raven's progressive matrices [54] and requires completing the bottom-right missing panel of a 3 × 3 grid of context panels (see Figure 17 (left) in the appendix). The algorithm is presented with six potential answers and needs to choose the correct one. To solve this task, the model has to infer the abstract relationships between the panels. We replicate the experiment of van Steenkiste et al. [70] on Shapes3D under the exact same experimental conditions (see Appendix B for more details).

In Figure 5 (right), one can see that at low sample sizes, the weakly-supervised reconstruction loss is strongly anti-correlated with performance on the abstract visual reasoning task. As previously observed by van Steenkiste et al. [70], this benefit only occurs at low sample sizes.

Summary We demonstrated that training a relational network on the representations learned by our Ada-GVAE improves its sample efficiency. This result is in line with the findings of van Steenkiste et al. [70] where disentanglement was found to correlate positively with improved sample complexity.

6 Conclusion

In this paper, we considered the problem of learning disentangled representations from pairs of non-i.i.d. observations sharing an unknown, random subset of factors of variation. We demonstrated that, under certain technical assumptions, the associated disentangled generative model is identifiable. We extensively discussed the impact of the different supervision modalities, such as the degree of group-level supervision, and studied the impact of the (unknown) number of shared factors. These insights will be particularly useful to practitioners having access to specific domain knowledge. Importantly, we show how to select models with strong performance on a diverse suite of downstream tasks without using supervised disentanglement metrics, relying exclusively on weak supervision. This result is of great importance as the community is becoming increasingly interested in the practical benefits of disentangled representations [70, 46, 14, 9, 38, 10, 28]. Future work should apply the proposed framework to challenging real-world data sets where the factors of variation are not observed and extend it to an interactive setup involving reinforcement learning.

Acknowledgments: The authors thank Stefan Bauer, Ilya Tolstikhin, Sarah Strauss and Josip Djolonga for helpful discussions and comments. Francesco Locatello is supported by the Max Planck ETH Center for Learning Systems, by an ETH core grant (to Gunnar Rätsch), and by a Google Ph.D. Fellowship. This work was partially done while Francesco Locatello was at Google Research, Brain Team, Zurich.

Appendix A Proof of Theorem 1

Recall that the true marginal likelihoods p(x_1 | z) and p(x_2 | z̃) are completely specified through the smooth, invertible function g. The corresponding posteriors are completely determined by g⁻¹. The model family of candidate marginal likelihoods and corresponding posteriors hence consists of the conditional distributions specified by the set of smooth invertible functions ĝ and their inverses ĝ⁻¹, respectively.

In order to prove identifiability, we show that every candidate posterior distribution (more precisely, the corresponding ĝ) on the generative model (1)–(2) satisfying the assumptions stated in Theorem 1 inverts the generative process in the sense that the aggregate posterior is a coordinate-wise reparameterization of p(z) up to a permutation of the indices. Crucially, while neither the latent variables z nor the shared indices S are directly observed, observing pairs of images allows us to verify whether a candidate distribution has the right factorization (3) and the sharing structure imposed by p(S) or not.

The proof is composed of the following steps:

  1. We characterize the constraints that need to hold for a posterior (the associated ĝ) inverting the generative process, for fixed S.

  2. We parameterize all candidate posteriors (the associated ĝ) through a function h, for a fixed S.

  3. We show that, for fixed S, ĝ (the associated h) has two disentangled coordinate subspaces, one corresponding to S and one corresponding to S̄, in the sense that varying the factors in S while keeping those in S̄ fixed results in changes of the coordinate subspace corresponding to S only, and vice versa.

  4. We show that randomly sampling S implies that every candidate posterior has an aggregate posterior which is a coordinate-wise reparameterization of the distribution of the true factors of variation.

Step 1 We start by noting that, since any continuous distribution can be obtained from the standard uniform distribution (via the inverse cumulative distribution function), it is sufficient to simply set p̂(ẑ) to the d-dimensional standard uniform distribution and try to recover an axis-aligned, smooth, invertible function ĝ (which completely characterizes the candidate likelihood and posterior via its inverse) as well as the distribution p̂(S).
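For concreteness, the inverse-CDF construction used in this step reads

```latex
% F_i denotes the CDF of the (continuous) ground-truth factor z_i
z_i = F_i^{-1}(u_i), \qquad u_i \sim \mathrm{Uniform}[0,1], \qquad i = 1, \dots, d,
```

so fixing the candidate prior to the uniform distribution loses no generality: the coordinate-wise maps F_i^{-1} can be absorbed into the candidate generator ĝ.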

Next, assume that S is fixed but unknown, i.e., the following reasoning is conditional on S. By the generative process (1)–(2) we know that all smooth, invertible candidate functions ĝ need to obey, with probability 1 (and irrespective of which observation in the pair is considered first),

ĝ⁻¹(x_1)_Ŝ = ĝ⁻¹(x_2)_Ŝ   (7)
ĝ⁻¹(x_1)_i ≠ ĝ⁻¹(x_2)_i for all i ∉ Ŝ   (8)

for all pairs (x_1, x_2), where Ŝ is arbitrary but fixed. Ŝ indexes the coordinate subspace in the image of ĝ⁻¹ corresponding to the unknown coordinate subspace of shared factors of z. Note that choosing Ŝ requires knowledge of k (which can be inferred from p(x_1, x_2)). Also note that g satisfies (7)–(8) for Ŝ = S.

Step 2 All smooth, invertible candidate functions ĝ can be written as ĝ = g ∘ h, where h is a smooth invertible function with smooth inverse (using that the composition of smooth invertible functions is smooth and invertible) that maps the d-dimensional uniform distribution to p(z).

We have ĝ⁻¹ = h⁻¹ ∘ g⁻¹, i.e., ĝ⁻¹(x_1) = h⁻¹(z), and similarly ĝ⁻¹(x_2) = h⁻¹(f(z, z̃)). Expressing now (7)–(8) through h, we have with probability 1

h⁻¹(z)_Ŝ = h⁻¹(f(z, z̃))_Ŝ   (9)
h⁻¹(z)_i ≠ h⁻¹(f(z, z̃))_i for all i ∉ Ŝ   (10)

Thanks to invertibility and smoothness of h, we know that h⁻¹ maps the coordinate subspace corresponding to z_S to a (d − k)-dimensional submanifold of [0, 1]^d and the coordinate subspace corresponding to z_S̄ to a k-dimensional submanifold of [0, 1]^d that is disjoint from it.

Step 3 Next, we shall see that for a fixed S the only admissible functions h are those identifying two groups of factors (corresponding to two orthogonal coordinate subspaces): those in S and those in S̄.

To see this, we prove that h can only satisfy (9)–(10) if it aligns the coordinate subspace corresponding to z_S with the coordinate subspace indexed by Ŝ, and the one corresponding to z_S̄ with the one indexed by its complement. In other words, the Jacobian of h is block diagonal with blocks of coordinates indexed by S and S̄.

By contradiction, if h⁻¹(·)_Ŝ depends on any coordinate outside S, then (9) is violated with positive probability: h⁻¹ is smooth and invertible, but its two arguments only obey z_i = f(z, z̃)_i for every i ∈ S with probability 1, while the remaining coordinates are independently re-sampled.

Likewise, if h⁻¹(·) outside Ŝ depends only on coordinates in S, then (10) is violated, as h⁻¹ is smooth and invertible but its arguments satisfy z_S = f(z, z̃)_S with probability 1.

As a result, (9) and (10) can only be satisfied if h maps each coordinate in S to a unique matching coordinate in Ŝ. In other words, there exists a permutation π on {1, …, d} with π(Ŝ) = S such that h can be simplified as h = (h_S, h_S̄), where

h(ẑ)_S = h_S(ẑ_Ŝ)   (11)
h(ẑ)_S̄ = h_S̄(ẑ_{Ŝᶜ})   (12)

Note that the permutation π is required because the choice of Ŝ is arbitrary. This implies that the Jacobian of h is block diagonal with blocks corresponding to the coordinates indexed by S and S̄ (or equivalently Ŝ and Ŝᶜ).

For fixed S, i.e., considering p(x_1, x_2 | S), we can thus recover the groups of factors in S and S̄ up to a permutation of the factor indices. Note that this does not yet imply that we can recover all factors axis-aligned, as the factors within S and within S̄ may still be entangled with each other, i.e., h is not axis-aligned within S and S̄.

Step 4 If now S is drawn at random, we observe a mixture of the distributions p(x_1, x_2 | S) (but not S itself), and ĝ needs to associate every S with one and only one Ŝ to satisfy (7)–(8) for every S in the support of p(S).

Indeed, suppose that (x_1, x_2) are distributed according to a mixture of p(x_1, x_2 | S_1) and p(x_1, x_2 | S_2) with S_1 ≠ S_2. Then (7) can only be satisfied with probability 1 for a subset of coordinates of size |S_1 ∩ S_2| due to invertibility and smoothness of ĝ, but |S_1 ∩ S_2| < |Ŝ| = d − k. The same reasoning applies for mixtures of more than two subsets of {1, …, d}. Therefore, (7) cannot be satisfied for (x_1, x_2) drawn from a mixture of distributions but associated with a single Ŝ.

Conversely, for a given S, all (x_1, x_2) need to be associated with the same Ŝ due to invertibility and smoothness of ĝ. In more detail, all ĝ⁻¹(x_1), ĝ⁻¹(x_2) will share the same (d − k)-dimensional coordinate subspace due to (9)–(10) and can therefore not be associated with two different sets Ŝ_1 ≠ Ŝ_2, as |Ŝ_1 ∩ Ŝ_2| < d − k.

Further, note that due to the smoothness and invertibility of h, for every pair of associated S and Ŝ we have |S| = |Ŝ| and |S̄| = |Ŝᶜ|. The assumption

for every i ∈ {1, …, d} there exist S, S′ ∈ supp(p(S)) such that S ∩ S′ = {i}   (13)

hence implies that we “observe” every factor z_i through p(x_1, x_2) as the intersection of two sets S, S′, and this intersection will be reflected as the intersection of the two corresponding coordinate subspaces Ŝ, Ŝ′. This, together with (11)–(12), finally implies

h(ẑ)_{π(i)} = h_i(ẑ_i) for all i ∈ {1, …, d}   (14)

for some permutation π on {1, …, d}. This in turn implies that the Jacobian of h is diagonal (up to the permutation π).

Therefore, by the change of variables formula we have

p̂(ẑ) = p(h(ẑ)) |det J_h(ẑ)| = ∏_{i=1}^{d} p_{π(i)}(h_i(ẑ_i)) |h_i′(ẑ_i)|   (15)

where the second equality is a consequence of the Jacobian being diagonal, and holds thanks to h being invertible on [0, 1]^d. From (15), we can see that p̂(ẑ) is a coordinate-wise reparameterization of p(z) up to a permutation of the indices. As a consequence, a change in a single coordinate of z implies a change in the unique corresponding coordinate of ẑ, so ĝ (or, equivalently, ĝ⁻¹) disentangles the factors of variation.

Final remarks The considered generative model is identifiable up to a coordinate-wise reparameterization of the factors and a permutation of the indices. The true generator g can then be recovered via g = ĝ ∘ h⁻¹. Note that (13) effectively ensures that a weak supervision signal is available for each factor of variation.

Appendix B Implementation Details

We base our study on the disentanglement_lib of [47]. Here, we report for completeness all the hyperparameters used in our study. Our code will be released as part of the disentanglement_lib.

In our study, we fix the architecture (Table 1) along with all other hyperparameters (Table 3), except for one hyperparameter for each model (Table 2). All hyperparameters for the unsupervised models are identical to [47]. As our methods penalize the rate term in the ELBO similarly to the β-VAE, we use the same hyperparameter range for β. We however note that in most cases our model selection technique selects β = 1. Exploring a range of β smaller than one is beyond the scope of this work. For the unsupervised methods we use the same random seeds as [47]. For the weakly-supervised methods, we use 10 random seeds.

Downstream Task The vanilla downstream task is based on [47]. For each representation, we sample training sets of sizes 10, 100, 1000, and 10000. The test set always contains 5000 points. The downstream task consists in predicting the value of each factor of variation from the representation. We use the same two models as [47]: a cross-validated logistic regression from Scikit-learn with 10 different values for the regularization strength and 5 folds, and a gradient boosting classifier (GBT) from Scikit-learn with default parameters.

Downstream Task with Covariate Shift We consider the same setup as the normal downstream task, but we only train a gradient boosted classifier with 10000 examples (GBT10000). For every target factor of variation we repeat the following process 10 times: sample another factor of variation uniformly at random and fix it to a uniformly sampled value over the whole training set. The test set contains only examples where the intervened factor takes values different from the one in the training set. We report the average test performance.

Fairness Downstream Task The fairness downstream task is based on [46]. We train the same GBT10000 on each representation, predicting each factor of variation, and measure unfairness using the formula in their Section 4.

Abstract reasoning task We use the same simplified Shapes3D data set when training the relational network (scale and azimuth can take only four values instead of 8 and 16, to make the task feasible for humans). We consider the case where the rows in the grid have either 1, 2, or 3 constant ground-truth factors. We train the same relational model [59] as in [70] (with identical hyperparameters) on the frozen representations of our adaptive methods.

We use hyperparameters identical to [70], which are reported here for completeness. The downstream classifier is the Wild Relation Networks (WReN) model of [59]. For the experiments, we use the following random search space over the hyperparameters. The optimizer's parameters are depicted in Table 4. The edge MLP has either 256 or 512 hidden units and 2, 3, or 4 hidden layers. The graph MLP has either 128 or 256 hidden units and 1 or 2 hidden layers before the final linear layer that computes the score. We also uniformly sample whether we apply no dropout, dropout of 0.25, dropout of 0.5, or dropout of 0.75 to units before this last layer, and we use 10 random seeds.

Encoder                               Decoder
Input: 64 × 64 × number of channels   Input: z ∈ R^10
4 × 4 conv, 32 ReLU, stride 2         FC, 256 ReLU
4 × 4 conv, 32 ReLU, stride 2         FC, 4 × 4 × 64 ReLU
4 × 4 conv, 64 ReLU, stride 2         4 × 4 upconv, 64 ReLU, stride 2
4 × 4 conv, 64 ReLU, stride 2         4 × 4 upconv, 32 ReLU, stride 2
FC 256, FC 2 × 10                     4 × 4 upconv, 32 ReLU, stride 2
                                      4 × 4 upconv, number of channels, stride 2
Table 1: Encoder and Decoder architecture for the main experiment.
Model        Parameter  Values
β-VAE        β
AnnealedVAE  c_max
             iteration threshold
FactorVAE    γ
DIP-VAE-I    λ_od
DIP-VAE-II   λ_od
β-TCVAE      β
GVAE         β
Ada-GVAE     β
ML-VAE       β
Ada-ML-VAE   β
Table 2: Model hyperparameters. We allow a sweep over a single hyperparameter for each model.
Parameter Values
Batch size 64
Latent space dimension 10
Optimizer Adam
Adam: beta1 0.9
Adam: beta2 0.999
Adam: epsilon 1e-8
Adam: learning rate 0.0001
Decoder type Bernoulli
Training steps 300000
(a) Hyperparameters common to each of the considered methods.
Discriminator
FC, 1000 leaky ReLU
FC, 1000 leaky ReLU
FC, 1000 leaky ReLU
FC, 1000 leaky ReLU
FC, 1000 leaky ReLU
FC, 1000 leaky ReLU
FC, 2
(b) Architecture for the discriminator in FactorVAE.
Parameter Values
Batch size
Optimizer Adam
Adam: beta1 0.5
Adam: beta2 0.9
Adam: epsilon 1e-8
Adam: learning rate 0.0001
(c) Parameters for the discriminator in FactorVAE.
Table 3: Other fixed hyperparameters.
Parameter Values
Batch size
Optimizer Adam
Adam: beta1 0.9
Adam: beta2 0.999
Adam: epsilon 1e-8
Adam: learning rate
Table 4: Parameters for the optimizer in the WReN.

Appendix C Additional Results

C.1 Section 5.1

In Figure 6, we show that our methods are competitive even with fully supervised model selection on the unsupervised methods.

Figure 6: Our adaptive variants of the group-based disentanglement methods with unsupervised model selection based on the reconstruction loss are competitive with fully supervised model selection on the unsupervised models. In this experiment, we consider the case where the number of shared factors of variation is random and different for every pair. Legend: 0=β-VAE, 1=FactorVAE, 2=β-TCVAE, 3=DIP-VAE-I, 4=DIP-VAE-II, 5=AnnealedVAE, 6=Ada-ML-VAE, 7=Ada-GVAE

While our main analysis focuses on DCI Disentanglement [19], we report the performance of our methods under each disentanglement score in Figure 8, and under Completeness [19] in Figure 7. Overall, the trends we observed in Section 5.1 for DCI Disentanglement also hold for the other disentanglement scores (with the partial exception of Modularity [58]). In Figure 9 we show that the disentanglement metrics are consistently correlated with the training metrics. We chose the weakly-supervised reconstruction loss for model selection, but the ELBO and the overall loss are also suitable.

Figure 7: Our adaptive variants of the group-based disentanglement methods are competitive with unsupervised methods also in terms of Completeness. In this experiment, we consider the case where the number of shared factors of variation is random and different for every pair. Legend: 0=β-VAE, 1=FactorVAE, 2=β-TCVAE, 3=DIP-VAE-I, 4=DIP-VAE-II, 5=AnnealedVAE, 6=Ada-ML-VAE, 7=Ada-GVAE
Figure 8: Our adaptive variants of the group-based disentanglement methods are competitive with unsupervised methods on all disentanglement scores. In this experiment, we consider the case where the number of shared factors of variation is random and different for every pair. Legend: 0=β-VAE, 1=FactorVAE, 2=β-TCVAE, 3=DIP-VAE-I, 4=DIP-VAE-II, 5=AnnealedVAE, 6=Ada-ML-VAE, 7=Ada-GVAE
Figure 9: Rank correlation between training metrics and disentanglement scores for Ada-GVAE (top) and Ada-ML-VAE (bottom).

C.2 Section 5.2

Figure 10: Performance of the Ada-GVAE with different degrees of supervision in the data. The best performance is achieved when k = 1—only one factor is changed in each pair—and it consistently degrades as fewer factors are shared, until only a single factor of variation is shared. In the most general case (k = Rnd), each pair has a different number of shared factors and the performance is consistent with the trend observed before.
Figure 11: Performance of the Ada-ML-VAE with different amounts of supervision in the data. The best performance is achieved when k = 1—only one factor is changed—and it consistently degrades as fewer factors are shared, until only a single factor of variation is shared. In the most general case (k = Rnd), each pair has a different number of shared factors and the performance is consistent with the trend observed before.

Figures 10 and 11 show the performance of the Ada-GVAE and the Ada-ML-VAE for different values of k. Generally, we observe that performance is best when the change between the pictures is sparser, i.e., when k = 1. We again note that the higher k is, the more similar the performance is to that of the vanilla β-VAE.

C.3 Section 5.3

In Figures 12 and 13, we observe that, regardless of the averaging function, when k = 1 and the changed factor is known to the algorithm, this knowledge improves disentanglement. However, when this knowledge is incomplete, it harms disentanglement. In Figure 14 we show how our method compares with the Change and Share GAN-based approaches of [64]. The goal of this plot is to show that, in the ball park, the two approaches achieve similar results. We stress that strong conclusions should not be drawn from this plot, as [64] used experimental conditions different from ours. Finally, we remark that [64] assume access to which factor was either shared or changed in each pair. Our method was designed to benefit from very similar images and without any additional annotation, so it is not completely surprising that our performance is worse in some settings. It is however interesting to note how the GAN-based methods perform especially well on SmallNORB and MPI3D, where VAE-based approaches struggle with reconstruction as the objects are either too detailed or too small.

Figure 12: Comparison of the Ada-GVAE with the vanilla GVAE, which requires group knowledge. We note that group knowledge can improve disentanglement but can also significantly hurt when it is incomplete. Top row: k = 1, bottom row: k = 2.
Figure 13: Comparison of the Ada-ML-VAE with the vanilla ML-VAE, which assumes group knowledge. We note that group knowledge improves performance but can also significantly hurt when it is incomplete.
Figure 14: Comparison with the Change and Share GAN-based approaches of [64] without model selection. Legend: 0=Change, 1=Share, 2=Ada-GVAE (k=1), 3=Ada-GVAE (k=Rnd), 4=Ada-ML-VAE (k=1), 5=Ada-ML-VAE (k=Rnd). We remark that these methods are not directly comparable as (1) the experimental conditions are different and (2) Shu et al. [64] have access to additional supervision (which factor is shared or changed).

C.4 Section 5.4

In Figure 15 we show the figure analogous to Figure 4 for the Ada-ML-VAE. We observe that the trends are comparable to the ones we observed for the Ada-GVAE. In Figures 16 and 17, we show the results on the fairness and abstract reasoning downstream tasks for the Ada-ML-VAE. Overall, we observe that the conclusions we drew for the Ada-GVAE are valid for the Ada-ML-VAE too: models that are good in terms of weakly-supervised reconstruction loss are useful on all the considered downstream tasks.

Figure 15: (left) Rank correlation between our weakly-supervised reconstruction loss and performance of downstream prediction tasks with logistic regression (LR) and gradient boosted decision trees (GBT) at different sample sizes for the Ada-ML-VAE. We observe a general negative correlation, indicating that models with a low weakly-supervised reconstruction loss may also be more accurate. (center) Rank correlation between disentanglement scores and weakly-supervised reconstruction loss with strong generalization under covariate shifts for the Ada-ML-VAE. (right) Generalization gap between weak and strong generalization for the Ada-ML-VAE over all models. The horizontal line is the accuracy of random chance.
Figure 16: (left) Rank correlation between both disentanglement scores and the weakly-supervised reconstruction loss of our Ada-ML-VAE with the unfairness of GBT10000 on all the data sets. (right) Unfairness of the unsupervised methods with the semi-supervised model selection heuristic of [46] and our Ada-ML-VAE with k = Rnd. From both plots, we conclude that our weakly-supervised reconstruction loss is a good proxy for unfairness and allows us to train fairer classifiers in the setup of [46] even if the sensitive variable is not observed.
Figure 17: (left) Example of the abstract visual reasoning task of [70]. The solution is the panel in the central row on the right. (right) Rank correlation between disentanglement metrics, prediction accuracy, weakly-supervised reconstruction and down-stream accuracy of the abstract visual reasoning models throughout training (i.e., for different sample sizes) for the Ada-ML-VAE.

Footnotes

  1. We invested approximately 5.85 GPU years (NVIDIA P100) for our experimental evaluation.
  2. In Figure 9 in the appendix, we show that the training loss and the ELBO correlate similarly with disentanglement.

References

  1. T. Adel, Z. Ghahramani and A. Weller (2018) Discovering interpretable representations for both deep generative and discriminative models. In International Conference on Machine Learning, pp. 50–59. Cited by: §1.
  2. S. Ben-David, T. Lu, T. Luu and D. Pál (2010) Impossibility theorems for domain adaptation. In International Conference on Artificial Intelligence and Statistics, pp. 129–136. Cited by: §5.4.
  3. Y. Bengio, A. Courville and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828. Cited by: §1, §1, §3, §4.
  4. Y. Bengio, T. Deleu, N. Rahaman, R. Ke, S. Lachapelle, O. Bilaniuk, A. Goyal and C. Pal (2019) A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912. Cited by: §1, §1.
  5. Y. Bengio (2017) The consciousness prior. arXiv preprint arXiv:1709.08568. Cited by: §1, §1.
  6. D. Bouchacourt, R. Tomioka and S. Nowozin (2018) Multi-level variational autoencoder: learning disentangled representations from grouped observations. In AAAI Conference on Artificial Intelligence, Cited by: 1st item, 2nd item, §1, §2, §4.1, §4.1, §5.3, §5.3, §5.
  7. C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins and A. Lerchner (2018) Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599. Cited by: §2, §5.
  8. T. Calders, F. Kamiran and M. Pechenizkiy (2009) Building classifiers with independency constraints. In 2009 IEEE International Conference on Data Mining Workshops, pp. 13–18. Cited by: §5.4.3, §5.4.3.
  9. M. A. Chao, C. Kulkarni, K. Goebel and O. Fink (2019) Hybrid deep fault detection and isolation: combining deep neural networks and system performance models. arXiv preprint arXiv:1908.01529. Cited by: §6.
  10. A. Chartsias, T. Joyce, G. Papanastasiou, S. Semple, M. Williams, D. E. Newby, R. Dharmakumar and S. A. Tsaftaris (2019) Disentangled representation learning in cardiac image analysis. Medical image analysis 58, pp. 101535. Cited by: §6.
  11. T. Q. Chen, X. Li, R. Grosse and D. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, Cited by: §2, §5, §5.
  12. B. Cheung, J. A. Livezey, A. K. Bansal and B. A. Olshausen (2014) Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583. Cited by: §2.
  13. P. Comon (1994) Independent component analysis, a new concept? Signal Processing 36 (3), pp. 287–314. Cited by: §2.
  14. E. Creager, D. Madras, J. Jacobsen, M. Weis, K. Swersky, T. Pitassi and R. Zemel (2019) Flexibly fair representation learning by disentanglement. In International Conference on Machine Learning, pp. 1436–1445. Cited by: §1, §6.
  15. P. Dayan (1993) Improving generalization for temporal difference learning: the successor representation. Neural Computation 5 (4), pp. 613–624. Cited by: §1.
  16. Z. Deng, R. Navarathna, P. Carr, S. Mandt, Y. Yue, I. Matthews and G. Mori (2017) Factorized variational autoencoders for modeling audience reactions to movies. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  17. E. L. Denton and V. Birodkar (2017) Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, Cited by: §2.
  18. S. Duan, N. Watters, L. Matthey, C. P. Burgess, A. Lerchner and I. Higgins (2019) A heuristic for unsupervised model selection for variational disentangled representation learning. arXiv preprint arXiv:1905.12614. Cited by: §2.
  19. C. Eastwood and C. K. Williams (2018) A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, Cited by: §C.1, §2, §5.
  20. P. Földiák (1991) Learning invariance from transformation sequences. Neural Computation 3 (2), pp. 194–200. Cited by: §1, §3.
  21. V. Fortuin, M. Hüser, F. Locatello, H. Strathmann and G. Rätsch (2019) Deep self-organization: interpretable discrete representation learning on time series. In International Conference on Learning Representations, Cited by: §2.
  22. M. Fraccaro, S. Kamronn, U. Paquet and O. Winther (2017) A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, Cited by: §2.
  23. M. W. Gondal, M. Wüthrich, D. Miladinović, F. Locatello, M. Breidt, V. Volchkov, J. Akpo, O. Bachem, B. Schölkopf and S. Bauer (2019) On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. In Advances in Neural Information Processing Systems, Cited by: Figure 1, 3rd item, §3, §5.
  24. R. Goroshin, M. F. Mathieu and Y. LeCun (2015) Learning to linearize under uncertainty. In Advances in Neural Information Processing Systems, Cited by: §2.
  25. L. Gresele, P. K. Rubenstein, A. Mehrjou, F. Locatello and B. Schölkopf (2019) The incomplete Rosetta Stone problem: identifiability results for multi-view nonlinear ICA. In Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: §2, §3, §4.1.
  26. I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende and A. Lerchner (2018) Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230. Cited by: §3.
  27. I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed and A. Lerchner (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, Cited by: §1, §2, §3, §4.1, §4.1, §5, §5.
  28. I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell and A. Lerchner (2017) DARLA: improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning, Cited by: §6.
  29. S. Hochreiter and J. Schmidhuber (1999) Feature extraction through lococode. Neural Computation 11 (3), pp. 679–714. Cited by: §1.
  30. H. Hosoya (2019) Group-based learning of disentangled representations with generalizability for novel contents. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 2506–2513. Cited by: 1st item, 2nd item, §1, §2, §4.1, §4.1, §5.3, §5.3, §5.
  31. J. Hsieh, B. Liu, D. Huang, L. F. Fei-Fei and J. C. Niebles (2018) Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems, Cited by: §2.
  32. W. Hsu, Y. Zhang and J. Glass (2017) Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, Cited by: §2.
  33. B. Huang, K. Zhang, J. Zhang, R. Sanchez-Romero, C. Glymour and B. Schölkopf (2017) Behind distribution shift: mining driving forces of changes and causal arrows. In IEEE 17th International Conference on Data Mining (ICDM 2017), pp. 913–918. Cited by: §2.
  34. A. Hyvarinen and H. Morioka (2016) Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, Cited by: §2.
  35. A. Hyvarinen and H. Morioka (2017) Nonlinear ICA of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pp. 460–469. Cited by: §2.
  36. A. Hyvärinen and P. Pajunen (1999) Nonlinear independent component analysis: existence and uniqueness results. Neural Networks. Cited by: §2.
  37. A. Hyvarinen, H. Sasaki and R. E. Turner (2019) Nonlinear ICA using auxiliary variables and generalized contrastive learning. In International Conference on Artificial Intelligence and Statistics, Cited by: §2, §4.1.
  38. R. Iten, T. Metger, H. Wilming, L. Del Rio and R. Renner (2020) Discovering physical concepts with neural networks. Physical Review Letters 124 (1), pp. 010508. Cited by: §6.
  39. T. Karaletsos, S. Belongie and G. Rätsch (2015) Bayesian representation learning with oracle constraints. arXiv preprint arXiv:1506.05011. Cited by: §2.
  40. D. J. Ketchen and C. L. Shook (1996) The application of cluster analysis in strategic management research: an analysis and critique. Strategic Management Journal 17 (6), pp. 441–458. Cited by: §4.1.
  41. I. Khemakhem, D. P. Kingma and A. Hyvärinen (2019) Variational autoencoders and nonlinear ICA: a unifying framework. arXiv preprint arXiv:1907.04809. Cited by: §2, §2, §4.1, §4.1.
  42. H. Kim and A. Mnih (2018) Disentangling by factorising. In International Conference on Machine Learning, Cited by: §2, §3, §5, §5.
  43. T. D. Kulkarni, W. F. Whitney, P. Kohli and J. Tenenbaum (2015) Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, Cited by: §2.
  44. A. Kumar, P. Sattigeri and A. Balakrishnan (2018) Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations, Cited by: §2, §5, §5.
  45. Y. LeCun, F. J. Huang and L. Bottou (2004) Learning methods for generic object recognition with invariance to pose and lighting. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3, §5.
  46. F. Locatello, G. Abbati, T. Rainforth, S. Bauer, B. Schölkopf and O. Bachem (2019) On the fairness of disentangled representations. In Advances in Neural Information Processing Systems, Cited by: Appendix B, Figure 16, 4th item, §1, Figure 5, §5.4.3, §5.4.3, §5.4.3, §5.4, §5.4, §6.
  47. F. Locatello, S. Bauer, M. Lucic, S. Gelly, B. Schölkopf and O. Bachem (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, Cited by: Appendix B, Appendix B, Appendix B, 2nd item, §1, §2, §3, §3, §4, §4, §5.4.1, §5.4.3, §5.4, §5.4, §5, §5.
  48. F. Locatello, M. Tschannen, S. Bauer, G. Rätsch, B. Schölkopf and O. Bachem (2020) Disentangling factors of variation using few labels. International Conference on Learning Representations. Cited by: §1, §2, §5.4, §5.4.
  49. F. Locatello, D. Vincent, I. Tolstikhin, G. Rätsch, S. Gelly and B. Schölkopf (2018) Competitive training of mixtures of independent deep generative models. In Workshop at the 6th International Conference on Learning Representations (ICLR), Cited by: §2.
  50. M. F. Mathieu, J. J. Zhao, A. Ramesh, P. Sprechmann and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, Cited by: §2.
  51. S. Narayanaswamy, T. B. Paige, J. Van de Meent, A. Desmaison, N. Goodman, P. Kohli, F. Wood and P. Torr (2017) Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, Cited by: §2.
  52. J. Peters, D. Janzing and B. Schölkopf (2017) Elements of causal inference - foundations and learning algorithms. Adaptive Computation and Machine Learning Series, MIT Press. Cited by: §1, §3, §5.4.
  53. J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer and N. D. Lawrence (2009) Dataset shift in machine learning. The MIT Press. Cited by: §5.4.
  54. J. C. Raven (1941) Standardization of progressive matrices, 1938. British Journal of Medical Psychology 19 (1), pp. 137–150. Cited by: §5.4.4.
  55. S. Reed, K. Sohn, Y. Zhang and H. Lee (2014) Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, Cited by: §2.
  56. S. Reed, Y. Zhang, Y. Zhang and H. Lee (2015) Deep visual analogy-making. In Advances in Neural Information Processing Systems, Cited by: §3, §5.
  57. K. Ridgeway and M. C. Mozer (2018) Learning deep disentangled embeddings with the f-statistic loss. In Advances in Neural Information Processing Systems, Cited by: §2, §5.
  58. K. Ridgeway (2016) A survey of inductive biases for factorial representation-learning. arXiv preprint arXiv:1612.05299. Cited by: §C.1.
  59. A. Santoro, F. Hill, D. Barrett, A. Morcos and T. Lillicrap (2018) Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pp. 4477–4486. Cited by: Appendix B, Appendix B.
  60. M. Schmidt, A. Niculescu-Mizil and K. Murphy (2007) Learning graphical model structure using l1-regularization paths. In AAAI, Vol. 7, pp. 1278–1283. Cited by: §1, §1.
  61. B. Schölkopf (2019) Causality for machine learning. arXiv preprint arXiv:1911.10500. Cited by: §1, §2.
  62. B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang and J. Mooij (2012) On causal and anticausal learning. In International Conference on Machine Learning, Cited by: Figure 1.
  63. H. Shimodaira (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90 (2), pp. 227–244. Cited by: §5.4.
  64. R. Shu, Y. Chen, A. Kumar, S. Ermon and B. Poole (2020) Weakly supervised disentanglement with guarantees. International Conference on Learning Representations. Cited by: Figure 14, §C.3, 1st item, §1, §2, §4.1, §4, §5.3, §5.3.
  65. P. Sorrenson, C. Rother and U. Köthe (2020) Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). arXiv preprint arXiv:2001.04872. Cited by: §2, §4.1.
  66. J. Storck, S. Hochreiter and J. Schmidhuber (1995) Reinforcement driven information acquisition in non-deterministic environments. In Proceedings of the International Conference on Artificial Neural Networks, Paris, Vol. 2, pp. 159–164. Cited by: §1.
  67. R. Suter, D. Miladinović, S. Bauer and B. Schölkopf (2019) Interventional robustness of deep latent variable models. In International Conference on Machine Learning, Cited by: §2, §3.
  68. V. Thomas, E. Bengio, W. Fedus, J. Pondard, P. Beaudoin, H. Larochelle, J. Pineau, D. Precup and Y. Bengio (2017) Disentangling the independently controllable factors of variation by interacting with the world. Learning Disentangled Representations Workshop at NeurIPS. Cited by: §1.
  69. M. Tschannen, O. Bachem and M. Lucic (2018) Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069. Cited by: §1.
  70. S. van Steenkiste, F. Locatello, J. Schmidhuber and O. Bachem (2019) Are disentangled representations helpful for abstract visual reasoning? In Advances in Neural Information Processing Systems, Cited by: Appendix B, Appendix B, Figure 17, 4th item, §1, Figure 5, §5.4.4, §5.4.4, §5.4.4, §5.4, §5.4, §6.
  71. W. F. Whitney, M. Chang, T. Kulkarni and J. B. Tenenbaum (2016) Understanding visual concepts with continuation learning. arXiv preprint arXiv:1602.06822. Cited by: §2.
  72. J. Yang, S. E. Reed, M. Yang and H. Lee (2015) Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In Advances in Neural Information Processing Systems, Cited by: §2.
  73. L. Yingzhen and S. Mandt (2018) Disentangled sequential autoencoder. In International Conference on Machine Learning, pp. 5656–5665. Cited by: §2.
  74. K. Zhang, M. Gong and B. Schölkopf (2015) Multi-source domain adaptation: a causal view. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 3150–3157. Cited by: §2.
  75. K. Zhang, B. Huang, J. Zhang, C. Glymour and B. Schölkopf (2017) Causal discovery from nonstationary/heterogeneous data: skeleton estimation and orientation determination. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), pp. 1347–1353. Cited by: §2.
  76. I. Zliobaite (2015) On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723. Cited by: §5.4.3, §5.4.3.