Evaluating the Disentanglement of Deep Generative Models through Manifold Topology
Abstract
Learning disentangled representations is regarded as a fundamental task for improving the generalization, robustness, and interpretability of generative models. However, measuring disentanglement has been challenging and inconsistent, often dependent on an ad hoc external model or specific to a certain dataset. To address this, we present a method for quantifying disentanglement that uses only the generative model, by measuring the topological similarity of conditional submanifolds in the learned representation. Our method has both unsupervised and supervised variants. To illustrate its effectiveness and applicability, we empirically evaluate several state-of-the-art models across multiple datasets. We find that our method ranks models similarly to existing methods. We make our code publicly available at https://redacted.
1 Introduction
Learning disentangled representations is important for a variety of tasks, including adversarial robustness, generalization to novel tasks, and interpretability [Stutz et al., 2019, Alemi et al., 2016, Ridgeway, 2016, Bengio et al., 2013]. Recently, deep generative models have shown marked improvement in disentanglement across an increasing number of datasets and a variety of training objectives [Chen et al., 2016, Lin et al., 2019, Higgins et al., 2017, Kim and Mnih, 2018, Chen et al., 2018, Burgess et al., 2018, Karras et al., 2018]. Nevertheless, quantifying the extent of this disentanglement has remained challenging and inconsistent. As a result, evaluation has often resorted to qualitative inspection for comparisons between models.
Existing evaluation metrics are rigid: while some rely on training additional ad hoc models that depend on the generative model, such as a classifier, regressor, or encoder [Eastwood and Williams, 2018, Kim and Mnih, 2018, Higgins et al., 2017, Chen et al., 2018, Glorot et al., 2011, Grathwohl and Wilson, 2016, Karaletsos et al., 2015, Duan et al., 2019], others are tuned for a particular dataset [Karras et al., 2018]. Both pose problems for a metric's reliability, its relevance to different models and tasks, and consequently its applicable scope. Specifically, reliance on training and tuning external models tends to be sensitive to additional hyperparameters and introduces partiality toward models with particular training objectives, e.g. variational methods [Chen et al., 2018, Kim and Mnih, 2018, Higgins et al., 2017, Burgess et al., 2018] or adversarial methods with an encoder head on the discriminator [Chen et al., 2016, Lin et al., 2019]. In fact, this reliance may explain the frequent fluctuation in model rankings when new metrics are introduced [Kim and Mnih, 2018, Lin et al., 2019, Chen et al., 2016]. Meanwhile, dataset-specific preprocessing, such as automatically removing background portions from generated portrait images [Karras et al., 2018], limits the scope of a metric's applicability because the metric depends on the preprocessing procedure and may be unreliable without it.
To address this, we introduce an unsupervised disentanglement metric that can be applied across different model architectures and datasets without training an ad hoc model for evaluation or introducing a dataset-specific preprocessing step. We achieve this by examining topology, an intrinsic property of a manifold, which typically exhibits modes corresponding to high-density regions surrounded by low-density regions [Cayton, 2005, Narayanan and Mitter, 2010, Goodfellow et al., 2016]. Our method investigates the topology of these low-density regions (holes) by estimating homology, a topological invariant that characterizes the distribution of holes on a manifold. We first condition the manifold on each latent dimension and subsequently measure the homology of these conditional submanifolds. By comparing homology, we examine the degree to which conditional submanifolds continuously deform into each other. This provides a notion of topological similarity that is higher across submanifolds conditioned on disentangled dimensions than across those conditioned on entangled ones. From this, we construct our metric using the aggregate topological similarity across data submanifolds conditioned on every latent dimension in the generative model.
In this paper, we make several key contributions:

We present an unsupervised metric for evaluating disentanglement that requires only the generative model (decoder) and is dataset-agnostic. In order to achieve this, we propose measuring the topology of the learned data manifold with respect to its latent dimensions. Our metric accounts for topological similarity within a dimension and across homeomorphic dimensions.

We also introduce a supervised variant that compares the generated topology to a real reference.

For both variants, we develop a topological similarity criterion based on Wasserstein distance, which defines a metric on barcode space in persistent homology.

Empirically, we perform an extensive set of experiments to demonstrate the applicability of our method across 10 models and three datasets using both the supervised and unsupervised variants. We find that our results are consistent with several existing methods.
2 Background
Our method draws inspiration from the Manifold Hypothesis [Cayton, 2005, Narayanan and Mitter, 2010, Goodfellow et al., 2016], which posits that real data lie on and are supported by a low-dimensional manifold M, and that generative models learn an approximation of that manifold. As a result, the true data manifold contains high-density regions, separated by large expanses of low-density regions, assuming a topology. The learned manifold approximates this topology through the learning process.
A manifold is a space X, for example a subset of R^n for some n, such that every point x in X has a neighborhood that can be reparametrized to an open disc in R^k. A coordinate chart for the manifold is an open subset of R^k together with a continuous parametrization of a subset of X. An atlas for X is a collection of coordinate charts that cover X. For example, any open hemisphere in a sphere is a coordinate chart, and the collection of all open hemispheres forms an atlas. We say two manifolds X and Y are homeomorphic if there is a continuous map from X to Y that has a continuous inverse. Intuitively, two manifolds are homeomorphic if one can be viewed as a continuous reparametrization of the other. If we have a continuous map f from a manifold X to R, and are given two nearby points r and s in R, it is often useful to compare the subsets f^{-1}(r) and f^{-1}(s), which are often themselves manifolds. They are frequently homeomorphic, and we will be using topological invariants that can distinguish between two non-homeomorphic manifolds.
Among the easiest topological invariants to numerically estimate is homology [Hatcher, 2005], which characterizes the number of n-dimensional holes in a topological space such as a manifold. Intuitively, these holes correspond to low-density regions on the manifold. The field of persistent homology offers several methods for estimating the homology of a topological space from data samples [Carlsson, 2019]. Under the manifold hypothesis, the latent space of a deep generative model has an extremely dense underlying manifold with few, if any, holes, making homology difficult to measure and distinguish across submanifolds. As a result, recent important work on Relative Living Times (RLTs) [Khrulkov and Oseledets, 2018] has used persistent homology to estimate the topology of a deep generative model on the data manifold. This also enables the direct comparison of generated data manifolds to real ones.
To obtain RLTs, we first construct a family of simplicial complexes—graph-like structures—from data samples, each starting with a set of vertices representing the data points and no edges (see Figure 2). Each complex is a Vietoris–Rips complex, which characterizes the topology of a set of (data) points and is a common method for statistically estimating topology in persistent homology [Carlsson, 2019, Lim et al., 2020]. These simplicial complexes approximate the topology of the data manifold by identifying the n-dimensional holes present in the simplices at varying levels of proximity. Proximity is defined as the radius of a ball around each vertex. If the balls of two vertices intersect, an edge is drawn between those vertices. As proximity increases, simplicial complexes with varying numbers of n-dimensional holes will form—in fact, holes will both appear and disappear as a function of increasing proximity.
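As an illustration of how persistence emerges from growing proximity, the 0-dimensional case (connected components) can be computed with nothing more than sorted pairwise distances and union-find. This is a minimal sketch of that special case, not the RLT pipeline itself; `h0_barcode` is a hypothetical helper name:

```python
import itertools
import math

def h0_barcode(points):
    """0-dimensional persistence barcode (connected components) of a
    Vietoris-Rips filtration over a point cloud, via Kruskal-style
    union-find. Every component is born at radius 0 and dies when it
    merges into another; one component never dies (infinite bar)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Edges enter the complex in order of increasing pairwise distance.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # one component dies at this edge length
    # n points -> n bars: n-1 finite deaths plus one essential bar.
    return [(0.0, d) for d in deaths] + [(0.0, math.inf)]

# Two well-separated pairs: short bars within each pair, one long bar
# that survives until the pairs merge at distance 4.9.
cloud = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
bars = h0_barcode(cloud)
```

Higher-dimensional holes require tracking triangles and higher simplices, which dedicated libraries handle; the filtration-by-proximity idea is the same.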
RLTs use the notion of increasing proximity over time to construct a discrete distribution—known as a persistence barcode [Carlsson, 2019, Zomorodian and Carlsson, 2005, Ghrist, 2008]—over the duration of each n-dimensional hole as it appears and disappears, i.e. its lifetime relative to other holes. This is merely one efficient method to vectorize persistence barcodes, and we leave it to future work to explore alternate methods [Adcock et al., 2013, Bubenik, 2015]. To measure the topological similarity between data samples representing two generative model manifolds, Khrulkov and Oseledets [2018] take the Euclidean mean of several RLTs to produce a discrete probability distribution, called a Mean Relative Living Time; they propose employing the Euclidean distance between two Mean Relative Living Times as the measure of topological similarity between two sets of data samples, known as the Geometry Score. However, Euclidean distance does not define a metric on barcode space, so these values may not be precise. Wasserstein distance, on the other hand, does define a metric on barcode space [Carlsson, 2019].
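The difference between Euclidean and Wasserstein distance on discrete distributions can be seen in a few lines: for 1-D histograms over ordered bins, W1 reduces to the L1 distance between CDFs and therefore respects how far mass moves, while Euclidean distance ignores bin geometry entirely. A minimal sketch under these assumptions (not the paper's unbalanced-transport implementation):

```python
def wasserstein1(p, q):
    """W1 distance between two discrete distributions supported on the
    same ordered bins with unit spacing: L1 distance between CDFs."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def euclidean(p, q):
    """Plain L2 distance between the histograms, ignoring bin order."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

# A unit of mass shifted by one bin vs. shifted by three bins:
a = [1.0, 0.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0, 0.0]  # one bin away
c = [0.0, 0.0, 0.0, 1.0]  # three bins away
```

Here `euclidean(a, b) == euclidean(a, c)`, but `wasserstein1` distinguishes the two shifts (1.0 vs. 3.0), which is the property that makes transport distances suitable for comparing lifetime distributions.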
3 Manifold Interpretation of Disentanglement
From the manifold perspective, disentanglement is an extrinsic property that depends on the generative model's atlas. Consider a disentangled generative model with manifold M that assumes topology T. We can define another generative model with the same underlying manifold M and topology T that is entangled and has a different atlas. In fact, we can define several alternate disentangled and entangled atlases, provided there are multiple valid factorizations of the space. As a result, we need a method that can detect whether an atlas is disentangled.
In this paper, we slice M into submanifolds that are conditioned on a factor at a fixed value. These conditional submanifolds have topologies separate from that of M and depend on the coordinate chart associating them with the model's atlas. If we observe samples from one factor at varying values, we find that all samples appear identical, except with respect to that single factor of variation set to a different value. For a generative model, the correspondence between latent dimensions and factors is not known upfront. As a result, we perform this procedure by conditioning on each latent dimension.
Conditional submanifold topology. For two submanifolds to have the same topology, there needs to be a continuous deformation from one to the other, i.e. there exists a continuous and invertible mapping between them. First, assume that there exists an invertible mapping, or encoder e, and a generative model g, where both functions are continuous. Then, for a given z and x = g(z), we can recover z by the composition e(g(z)). We can also construct a simple linear mapping h, which adapts a factor's value, such that the composition of g, h, and e remains continuously deformable. This holds across factors where the manifold is topologically symmetric with respect to different factors, i.e. its conditional submanifolds are homeomorphic. As an example, consider a disentangled generative model that traces a triaxial ellipsoid. If we condition the model on varying values of each factor, the resulting submanifolds are 2-dimensional ellipses and have the same topology.
Most complex manifolds have factors of variation whose conditional submanifolds are non-homeomorphic. For example, consider a generative model that traces a cylindrical shell with angle θ, height h, and, for simplicity, no thickness. The submanifolds conditioned on the angle form lines (no holes), while the submanifolds conditioned on the height form circles (a 1D hole). However, the topology remains the same for a given factor, e.g. lines or circles of varying size. A visualization of this principle on a cone is shown in Figure 3. Taken together, this means that submanifolds within a factor (intra-factor) are homeomorphic, while submanifolds between factors (extra-factor) can be either homeomorphic or non-homeomorphic.
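The cylindrical-shell example can be made concrete with a toy generator; the following sketch uses the stated parametrization (radius fixed at 1; function names are illustrative):

```python
import math

def g(theta, h):
    """Toy generative model tracing a unit-radius cylindrical shell:
    the latent factors are angle theta and height h."""
    return (math.cos(theta), math.sin(theta), h)

def condition_on_angle(theta, n=100):
    """Fix the angle, vary height: the submanifold is a vertical line."""
    return [g(theta, k / n) for k in range(n)]

def condition_on_height(h, n=100):
    """Fix the height, vary angle: the submanifold is a circle (1D hole)."""
    return [g(2 * math.pi * k / n, h) for k in range(n)]

line = condition_on_angle(0.0)     # points share x = 1, y = 0: no hole
circle = condition_on_height(0.5)  # points lie on the unit circle at z = 0.5
```

A line and a circle are not homeomorphic, which is exactly the intra- vs. extra-factor distinction above: every angle-slice is a line, every height-slice is a circle, but no continuous deformation turns one family into the other.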
Topological asymmetry. Because topologically asymmetric submanifolds are non-homeomorphic, using a single encoder that continuously deforms across submanifolds no longer holds under disentanglement. To address this, assume that for each factor, there exists a continuous invertible encoder that exclusively encodes information on that factor from a generated sample. In the cylindrical shell example, this means continuously deforming across submanifolds conditioned on varying values of the angle using one such encoder (deforming between lines), and likewise for the height using another (deforming between circles). Note that this formulation prevents continuous deformations between lines and circles. More generally, we cannot continuously deform across submanifolds conditioned on arbitrary factors and expect the topology to be preserved. This procedure now amounts to performing latent traversals along an axis and observing the topology of the resulting submanifolds. In a disentangled model, the conditional submanifolds exhibit the same topology by continuous composition with the encoder and generator, using a linear mapping that only adapts the traversed factor.
In an entangled model, by contrast, more than one factor—such as both the angle and height in the cylindrical shell example—exhibits variation along a single latent dimension. Put another way, the topology of submanifolds conditioned on that dimension changes when multiple factors contribute to variation along it. Concretely, following the cylindrical shell example, a dimension that encodes height and, after a certain height threshold, also begins to adapt the angle will result in a topology that changes at that threshold to include a 2D hole. Consequently, submanifolds conditioned on the same factor have the same topology in a disentangled model, yet different topologies in an entangled one.
Because we cannot assume that the data manifold of a generative model is completely symmetric, we only consider submanifolds conditioned on the same factor to be homeomorphic in a disentangled model. By contrast, since these submanifolds are not homeomorphic in an entangled model, we can measure the similarity across them to evaluate a model's disentanglement. Using this notion of intra-factor topological similarity, we can sufficiently measure disentanglement in most cases, but it does not shield us from the scenario where a generative model learns a single trivial factor along all dimensions, i.e. a factorization of one. If we assume that there exist asymmetries in the data manifold, then ensuring that the manifold exhibits topological dissimilarity between certain factors disarms that case. We operationalize this by identifying homeomorphic groups of factors, whereby each group has its own distinct topology, ensuring there is not a factorization of one. Within groups, we still measure topological similarity; between different groups, we also calculate topological dissimilarity. Consequently, topological similarity and dissimilarity form the basis of our metric.
Ties to prior work. As noted in a foundational paper on disentanglement [Bengio et al., 2013], disentanglement constitutes a bijective mapping between factors of variation in the data and dimensions in the latent space. Using homology, we can determine whether this bijective mapping holds along different factors, by observing the topological similarity of their conditional submanifolds and measuring the extent to which they continuously deform into each other. Aligned with newer definitions of disentanglement [Higgins et al., 2018, Duan et al., 2019], our framing permits multiple valid factorizations, where different groups of homeomorphic factors can compose alternate factorizations. Our supervised variant considers a target factorization corresponding to factors on the real manifold. In this variant, we follow an existing definition of supervised disentanglement [Shu et al., 2019] that allows different subsets of dimensions to contribute to a target factor and allows target factors to exhibit statistical dependence.
3.1 Topological similarity using Wasserstein Relative Living Times
In order to estimate the topological similarity between conditional submanifolds, we build on Relative Living Times [Khrulkov and Oseledets, 2018] and introduce Wasserstein (W.) Relative Living Times. Wasserstein distance, unlike Euclidean distance, defines a metric on barcode space; recall that barcodes are the discrete distributions representing the presence and absence of n-dimensional holes (more formally, the ranks of the n-th homology groups, i.e. the Betti numbers), which are aggregated to form RLTs [Carlsson, 2019]. For the sake of a valid metric, we replace Euclidean distance and Euclidean averages with Wasserstein distance and Wasserstein barycenters.
Thus, in lieu of the Euclidean mean across RLTs, we analogously employ the W. barycenter [Agueh and Carlier, 2011]. For distances between W. barycenters, we employ the standard W. distance.
The W. barycenter of the distributions μ_1, ..., μ_N intuitively corresponds to finding the distribution with minimum total transport cost to each μ_i, where cost is defined in W2 distance:

    μ̄ = argmin_ν Σ_i λ_i W2²(ν, μ_i),

where λ_i ≥ 0 and Σ_i λ_i = 1. This is a weighted Fréchet mean with W2 as the underlying distance. In contrast, the Euclidean mean is the Fréchet mean under the L2 norm.
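For 1-D empirical distributions with equally many samples and uniform weights, the W2 barycenter has a closed form: average the quantile functions, i.e. the sorted samples pointwise. A minimal sketch of this special case (the paper operates on discrete, unnormalized RLT distributions with unbalanced barycenters, which this sketch does not cover):

```python
def w2_barycenter_samples(samples_list):
    """W2 barycenter of 1-D empirical distributions that each have the
    same number of samples: average the sorted samples (the discrete
    quantile functions) position by position."""
    sorted_lists = [sorted(s) for s in samples_list]
    n = len(sorted_lists[0])
    k = len(sorted_lists)
    return [sum(s[i] for s in sorted_lists) / k for i in range(n)]

# Barycenter of {0, 1, 2} and {0, 2, 4}: quantiles average to {0, 1.5, 3}.
bary = w2_barycenter_samples([[0.0, 1.0, 2.0], [2.0, 4.0, 0.0]])
```

Note that averaging the histograms directly (the Euclidean mean) would produce a bimodal mixture rather than a distribution "between" the inputs; quantile averaging is what makes the barycenter an interpolation in transport geometry.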
Because our distributions represent discrete unnormalized counts of n-dimensional holes, we leverage recent work in unbalanced optimal transport [Chizat et al., 2018, Frogner et al., 2015], which allows the input distributions to be unnormalized measures with varying cumulative mass. The unbalanced W. barycenter modifies the W. distance to penalize the marginal distributions using an extended KL divergence [Chizat et al., 2018, Dognin et al., 2019]. Unlike Euclidean, Hellinger, or total variation distance, W. distance defines a valid metric on barcode space in persistent homology.
We show in Appendix B that the use of both W. RLTs and W. distance result in a distance metric on sets of RLTs that best separates similar and dissimilar topologies.
3.2 Metric
Equipped with a procedure for measuring topological similarity, we develop a metric for scoring disentanglement from intra-group topological similarity and extra-group topological dissimilarity.
Beginning with intra-factor topological similarity, we are concerned with the degree to which the topology of the manifold varies with respect to a factor at different values. Specifically, we condition the manifold on a particular factor at a fixed value, while allowing other factors to vary. We then measure the topology of this conditional submanifold. For each factor, we find the topology of conditional submanifolds at varying values. A disentangled model would exhibit topological similarity within the set of submanifolds conditioned on the same factor. We visualize similar and dissimilar W. RLTs on factors of the CelebA dataset in Figure 4.
For a generative model, the correspondence between latent dimensions and factors is not known upfront. As a result, we perform this procedure by conditioning on each latent dimension. We then assess pairwise topological similarity across latent dimensions, using the W. distance between W. RLTs. This operation constructs a pairwise similarity matrix S over latent dimensions. We use spectral co-clustering [Dhillon, 2001] on S to co-cluster it into b biclusters, which represent different groups of homeomorphic factors. Spectral co-clustering uses SVD to identify, in our case, the b most likely biclusters, i.e. the subsets of rows that are most similar to subsets of columns in S. The resulting biclusters create a correspondence from latent dimensions to groups of homeomorphic factors. Aggregating biclusters in S, we obtain a b × b matrix (see examples in Figure 5). We then minimize the total variation of intra-group variance and extra-group variance on this matrix to find the value for b. Using it, we compute a score that rewards high intra-group similarity and low extra-group similarity. This score is based on the normalized cut objective in spectral co-clustering, which measures the strength of associations within a bicluster [Dhillon, 2001], and constitutes our unsupervised metric.
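The grouping-and-scoring step can be sketched with a simplified stand-in for spectral co-clustering: given pairwise topological distances between latent dimensions and a candidate grouping, reward low intra-group and high extra-group distance. The scoring form below (a difference of means) is illustrative only, not the paper's normalized-cut objective:

```python
def group_score(dist, groups):
    """Score a candidate grouping of latent dimensions: high when dims in
    the same group are topologically close (homeomorphic) and dims in
    different groups are far apart. Illustrative scoring form."""
    label = {}
    for gi, group in enumerate(groups):
        for d in group:
            label[d] = gi
    intra, extra = [], []
    n = len(dist)
    for i in range(n):
        for j in range(i + 1, n):
            (intra if label[i] == label[j] else extra).append(dist[i][j])
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(extra) - mean(intra)

# Dims 0 and 1 are homeomorphic to each other; dim 2 is distinct.
D = [[0.0, 0.1, 0.9],
     [0.1, 0.0, 0.8],
     [0.9, 0.8, 0.0]]
good = group_score(D, [[0, 1], [2]])  # matches the true structure
bad = group_score(D, [[0, 2], [1]])   # groups non-homeomorphic dims
```

In the paper's pipeline, the grouping itself is found by SVD-based spectral co-clustering rather than enumerated, but the reward structure is the same: intra-group similarity and extra-group dissimilarity.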
Supervised variant. In order to capture the correspondence between the learned and real data topology, we present a supervised variant that uses labels of relevant factors on the real dataset to represent the real data topology. While this variant requires labeled data, there are no external ad hoc classifiers or encoders that might favor one training criterion over another. The real topology is approximated in the same way as the generated one, but we have the desired groups of factors upfront. See Figure 1 for a comparison between real and generated Wasserstein RLTs of two dSprites factors. Note that the real factors do not necessarily need to belong to different homeomorphic groups; we account for this during spectral co-clustering with the generated data manifold. The major difference is that the generated data topology is no longer compared to itself, but to the real data topology, so topological similarity is now computed between the two manifolds. Note that the relevant factors in the real topology form a specific factorization, so a model that finds an alternate factorization and scores well on the unsupervised metric may not fare well on the supervised variant.
We use the same spectral co-clustering procedure, though this time on the matrix comparing latent dimensions of the generated manifold against labeled factors of the real manifold. Because this matrix is not square, we bound the number of biclusters by the number of real factors. Finally, we normalize the final score by the number of factors, to penalize methods that do not find any correspondence to some factors. Ultimately, the supervised score favors groups of latent dimensions whose conditional submanifolds are topologically similar to those of the reals.
Limitations. In Figure 6, we highlight cases where our metric may face limitations, delineated from scenarios where it would behave as expected. The first limitation is that it is theoretically possible for two factors to be disentangled and, under cases of complete symmetry, still have the same topology. This is more likely in datasets with trivial topologies that are significantly simpler than dSprites. While partial symmetry is handled in the metric with spectral coclustering of homeomorphic factors, complete symmetry is not.
Because we assume that the manifold is not perfectly symmetric, we do not account for the case where all factors are symmetric. In order to safeguard against this case, we would need to consider the covariance of topological similarities across pairwise conditional submanifolds. This requires selecting fixed points in the latent space that hold two dimensions constant, and subsequently verifying that the topologies do not covary. However, this approach comes with a high computational cost, for a benefit that applies mostly to simple toy datasets. If we assign a Dirichlet process prior over all possible topologies [Ranganathan, 2008] and treat the number of factors as the number of samples, we find that the probability of having only a single set of all-homeomorphic factors decreases factorially with the number of dimensions.
An additional limitation of our method is that RLTs do not compute a full topology of the data manifold, but instead efficiently approximate one topological invariant, homology, so that we can comparatively rank generative models on disentanglement. Our overall approach of measuring disentanglement is general enough to incorporate measurements of other topological invariants.
Model        Dataset     Unsupervised score   Supervised score
VAE          dSprites    23.53 ± 8.14         3.55 ± 4.25
β-TCVAE      dSprites    14.92 ± 3.46         0.79 ± 1.35
InfoGAN-CR   dSprites     9.73 ± 4.03         1.85 ± 2.63
FactorVAE    dSprites     8.66 ± 1.83         0.35 ± 0.90
InfoGAN      dSprites     7.42 ± 1.19         0.16 ± 0.92
VAE          dSprites     7.05 ± 1.25         1.54 ± 1.27
VAE          dSprites     6.53 ± 2.89         1.81 ± 2.90
StyleGAN     CelebA-HQ    1.03 ± 0.24         0.77 ± 0.07
ProGAN       CelebA-HQ    0.68 ± 0.08         0.37 ± 0.45
Model        Dataset   Unsupervised score   Supervised score
VAE          CelebA     4.73 ± 2.27          0.29 ± 0.25
β-TCVAE      CelebA    10.66 ± 2.48          0.04 ± 0.36
InfoGAN-CR   CelebA     0.72 ± 0.27          0.07 ± 0.15
FactorVAE    CelebA     8.53 ± 4.53          0.14 ± 0.28
InfoGAN      CelebA     1.11 ± 0.81          0.00 ± 0.01
VAE          CelebA     6.98 ± 2.78          0.00 ± 0.15
VAE          CelebA    15.10 ± 8.94          0.13 ± 0.38
BEGAN        CelebA     0.85 ± 0.25          0.22 ± 0.10
WGAN-GP      CelebA     0.83 ± 0.29          0.07 ± 0.13
4 Experiments
Across an extensive set of experiments, our goal is to show the extent to which our metric is able to compare across generative model architectures, training criteria (e.g. variational/adversarial, with or without an encoder), and datasets. We additionally show that our metric performs similarly to existing disentanglement metrics.
Datasets. We present empirical results on three datasets: (1) dSprites [Matthey et al., 2017] is a canonical disentanglement dataset whose five generating factors {shape, scale, orientation, x-position, y-position} are complete and independent, i.e. they fully describe all combinations in the dataset; (2) CelebA [Liu et al., 2015] is a popular dataset for disentanglement and image generation, comprised of over 202k human faces, which we align and crop; there are also 40 attribute labels for each image; and (3) CelebA-HQ [Karras et al., 2017], a higher-resolution subset of CelebA consisting of 30,000 images, which has recently gained popularity in image generation [Karras et al., 2017, 2018].
Generative models. We compare ten canonical generative models, including a standard VAE, the β-VAE variants of Higgins et al. [2017] and Burgess et al. [2018], FactorVAE [Kim and Mnih, 2018], β-TCVAE [Chen et al., 2018], InfoGAN [Chen et al., 2016], InfoGAN-CR [Lin et al., 2019], BEGAN [Berthelot et al., 2017], WGAN-GP [Gulrajani et al., 2017], ProGAN [Karras et al., 2017], and StyleGAN [Karras et al., 2018]. We evaluate VAE and InfoGAN variants on dSprites and CelebA, WGAN-GP and BEGAN on CelebA, and ProGAN and StyleGAN on CelebA-HQ. We match models to the datasets on which they have previously demonstrated strong performance and stable training.
Metric parity. We find that our unsupervised and supervised scores rank models similarly to several other frequently cited metrics, including: (1) MIG, an information-theoretic metric that uses an encoder [Chen et al., 2018], (2) a supervised metric from Kim and Mnih [2018] that uses a classifier, and (3) PPL, a dataset-specific metric [Karras et al., 2018] that caters to high-resolution face datasets such as CelebA-HQ. We use scores from their respective papers and prior work [Chen et al., 2018, Kim and Mnih, 2018, Lin et al., 2019, Karras et al., 2018], and show that our method ranks most or all models the same as each metric. The source of deviation from MIG is the ranking of the VAE; nevertheless, both of our scores exhibit exceptionally high variance across runs for this model, suggesting that it has inconsistent disentanglement performance (see Figure 7). The classifier method ranks β-TCVAE and FactorVAE quite far apart, while ours ranks them similarly. Given their nearly identical training objectives, we would expect them to rank closely, and we find this disparity unexpected. Finally, our method agrees with PPL rankings on CelebA-HQ.
As shown in Table 1, these experiments highlight several key observations:

Performance is not only architecturedependent, but also datasetdependent. This highlights the importance of having a metric that can cater to comparisons across these facets. Nevertheless, we note that VAE shows especially strong results on both metrics and two dataset settings.

As expected, the VAE and InfoGAN variants designed for disentanglement show greater performance on the unsupervised metric than their GAN counterparts. However, on the supervised variant, we find that BEGAN performs inseparably close to the VAE, suggesting that the model learns dependent factors consistent with the attributes in CelebA.

With similar training objectives, β-TCVAE and FactorVAE demonstrate comparably strong performance across both dSprites and CelebA. β-TCVAE displays slight, yet consistent, improvements over FactorVAE, which may point to FactorVAE's underestimation of total correlation [Chen et al., 2018]. Nevertheless, FactorVAE scores higher on dSprites.

StyleGAN demonstrates consistently higher disentanglement, compared to ProGAN, which supports architectural decisions made for StyleGAN [Karras et al., 2018].
5 Conclusion
In this paper, we have introduced a disentanglement metric that measures intrinsic properties of a generative model with respect to its factors of variation. Our metric circumvents the typical requirements of existing metrics, such as requiring an adhoc model, a particular dataset, or a canonical factorization. This opens up the stage for broader comparisons across models and datasets. Our contributions also consider several cases of disentanglement, where labeled data is not available (unsupervised) or where direct comparisons to userspecified, semantically interpretable factors are desired (supervised). Ultimately, this work advances our ability to leverage the intrinsic properties of generative models to observe additional desirable facets and to apply these properties to important outstanding problems in the field.
Broader Impact
This research can aid in alleviating bias in deep generative models and, more generally, unsupervised learning. Disentanglement has been shown to help reduce bias, or identify sources of bias in the underlying data, by exposing the factors of variation. Those who will benefit from this research are users of generative models who wish to disentangle, or evaluate the disentanglement of, particular models for downstream use. This may include artists or photo editors who use generative models for image editing. As for negative consequences, this research broadly advances deep generative models, which have been shown to have societal consequences when applied maliciously, e.g. deepfakes mimicking a political figure.
Appendix A: Our Approach under the Definition of Disentanglement
Our method follows from a prior definition of disentanglement [Shu et al., 2019], which decomposes disentanglement into two components: restrictiveness and consistency. Restrictiveness over a set of latent dimensions is met when changes to those dimensions correspond only to changes in the associated factors of variation. Consistency is met when changes to that set of factors are controlled only by changes to the associated latent dimensions. This decomposition allows statistically dependent factors of variation to exist, which are often present in naturally occurring data. For a model to be fully disentangled requires every index of the model to be disentangled.
We approach consistency and restrictiveness by measuring manifold topology. We first assume the manifold of a generative model is equipped with an atlas of charts ψ: U → M, where U is an open subset of R^d and d is the dimension of the manifold. Additionally, we have factors of variation φ_1, ..., φ_k. We would like to drive only one of these maps while leaving the others fixed. We now have the composite maps φ_i ∘ ψ, which can be thought of as maps from a small ball in R^d into R. Our goal is to find charts so that the composite map takes a diagonal form, where the i-th output depends only on the i-th input. Because each φ_i is topologically distinct, we can use this to evaluate disentanglement.
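The diagonal form described here, where the i-th output depends only on the i-th input, can be checked numerically for toy maps via finite-difference Jacobians. A minimal sketch; `disentangled` and `entangled` below are hypothetical example maps, not learned models:

```python
def jacobian(f, z, eps=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^n at point z.
    Returns rows indexed by output, columns by input."""
    n = len(z)
    base = f(z)
    cols = []
    for j in range(n):
        zp = list(z)
        zp[j] += eps
        fp = f(zp)
        cols.append([(fp[i] - base[i]) / eps for i in range(n)])
    # cols[j][i] = d f_i / d z_j; transpose so row i holds d f_i / d z_*.
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def is_diagonal(J, tol=1e-3):
    """True if every off-diagonal entry vanishes, i.e. output i depends
    only on input i (the disentangled diagonal form)."""
    n = len(J)
    return all(abs(J[i][j]) < tol for i in range(n) for j in range(n) if i != j)

disentangled = lambda z: [2 * z[0], z[1] ** 2 + 1]  # output i uses only z_i
entangled = lambda z: [z[0] + z[1], z[1]]           # output 0 mixes z_0, z_1
```

This pointwise check is only a local, differential notion of the property; the paper's topological criterion is global and does not require differentiating the model.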
We derive measures of failure of this diagonalization. Our idea is to study the conditional submanifolds, in particular their persistent homology, to generate an evaluation metric that guides us toward the disentangled situation. We believe that under a perfectly disentangled model, perturbing one latent dimension should not change the topology of the submanifolds conditioned on the other factors (restrictiveness), nor should the other latent dimensions change the topology of the submanifold conditioned on its own factor (consistency). From this, we measure disentanglement through its decomposition into consistency and restrictiveness, by comparing the persistence barcodes of these submanifolds.
We also cluster latent dimensions into groups, so that our metric rewards disentangled groups and, moreover, rewards maximizing the number of disentangled groups, which would be interpreted as products of tangled manifolds.
Assumptions. We make the following assumptions:

Assumption A. Each factor map is topologically distinct.

Assumption B. We can measure the persistent homology of the generated space.

Assumption C. If a set of mappings is not topologically distinct, then we can treat their shared factor or latent dimension as the same dimension.

Assumption D. In the supervised case, each factor of variation can be observed for each generated sample.
With our method, we can evaluate the degree to which a set of latent dimensions corresponds to a single factor of variation. This is a stronger form of restrictiveness than disentanglement necessitates. In order to identify such sets, we cluster topologically similar latent dimensions. We penalize intra-cluster variance, which discourages having a set of latent dimensions correspond to distinct factors of variation and which denotes higher restrictiveness.
Furthermore, we can evaluate the degree to which a single factor is affected by different clusters of latent dimensions that also control other factors of variation. Removing shared factors increases consistency by increasing the distance between distinct clusters; as a result, distinct clusters cannot be topologically similar if they are consistent. This corresponds to a stronger form of consistency than disentanglement necessitates. Thus, we penalize extra-cluster similarity, which encourages topological variation between clusters and which denotes higher consistency.
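As a schematic of these two penalties, the sketch below (hypothetical code; `cluster_penalties`, its inputs, and the distance-to-similarity transform are our own illustrative choices, not the exact metric) scores a candidate clustering given a precomputed matrix of pairwise topological distances between latent dimensions:

```python
import numpy as np

def cluster_penalties(dist, labels):
    """Score a clustering of latent dimensions from pairwise topological distances.

    dist: (z, z) symmetric matrix of barcode distances between latent dims.
    labels: cluster assignment per latent dimension.
    Returns (intra_variance, extra_similarity); lower is better for both.
    Intra-cluster variance penalizes a cluster mixing distinct factors
    (restrictiveness); extra-cluster similarity penalizes different clusters
    looking alike topologically (consistency).
    """
    labels = np.asarray(labels)
    intra, extra = [], []
    for a in np.unique(labels):
        in_a = labels == a
        block = dist[np.ix_(in_a, in_a)]
        if in_a.sum() > 1:
            # variance of the pairwise distances inside the cluster
            intra.append(block[np.triu_indices_from(block, k=1)].var())
        out = dist[np.ix_(in_a, ~in_a)]
        if out.size:
            # similarity decays with mean distance to the other clusters
            extra.append(1.0 / (1.0 + out.mean()))
    return (float(np.mean(intra)) if intra else 0.0,
            float(np.mean(extra)) if extra else 0.0)
```

On a toy distance matrix with two tight groups, the correct grouping attains zero intra-cluster variance and a lower extra-cluster similarity than a mixed grouping.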
Appendix B: Related work
In addition to the background section of our paper, we point out related work here, some of it reiterated so that the related literature can be examined cohesively.
Disentanglement metrics. Existing disentanglement metrics depend either on an external model, such as an encoder or classifier, to be applicable across datasets, or on dataset-specific preprocessing. Several metrics train classifiers to detect separability of the data generated by conditioning on different latent dimensions [Eastwood and Williams, 2018, Kim and Mnih, 2018, Karras et al., 2018]; these are reliant on the hyperparameters and architectures of the classifiers. Recently, the mutual information gap (MIG) was proposed as an information-theoretic metric, yet it relies on a readily available encoder in order to estimate latent entropy [Chen et al., 2018]. Many state-of-the-art GANs do not have an encoder readily available, and this has even been cited as a barrier to use [Karras et al., 2018]. Finally, the perceptual path length was proposed to measure disentanglement without relying on an external model, but the method is specific to face datasets such as CelebA, as it crops out the background prior to evaluation [Karras et al., 2018]. To address these limitations on a metric's applicability and scope, we propose a method that uses only the generative model's decoder and can be applied across datasets. Additionally, because the utility of disentanglement is often with respect to specific subsets of factors that are human-interpretable, we include a supervised variant of our metric that compares the real data manifold with the generated one. Finally, there is a difference between evaluating disentanglement and learning a disentangled representation; the latter requires constructing a valid loss function for learning and guaranteeing disentanglement, a process that requires at least weak supervision [F. Locatello et al., 2019].
Geometry of deep generative models. Prior work has explored applying Riemannian geometry to deep generative models [Shao et al., 2018, Chen et al., 2017, Rieck et al., 2018]. One work approximates the geodesics of the latent manifold to visually inspect deep generative models as an alternative to linear interpolation [Chen et al., 2017]. Another explores computing geodesics efficiently and shows that style can be transferred between interpolations with the approach [Shao et al., 2018]. The closest work to ours explores the geometry, specifically the normalized margin and tangent space alignment, of latent spaces in disentangling VAE models [Shukla et al., 2018]. That work is interesting in that it leverages the lower dimensionality of latent spaces to make calculations such as singular value decomposition computationally feasible. However, it does not propose a disentanglement evaluation method and does not explore the learned data manifold or its homology.
Persistent homology: barcodes and Wasserstein distance. Carlsson [2019] presents a survey of persistent homology and its applied uses. In particular, there are multiple methods for vectorizing persistence barcodes, including persistence landscapes, persistence images, and symmetric polynomials [Carlsson, 2019, Bubenik, 2015, Adcock et al., 2013]. Additionally, the Wasserstein distance defines a metric on barcode space, as detailed by Carlsson [2019]:
d_p(B_1, B_2) = inf_{θ ∈ Bij(B_1, B_2)} ( Σ_{I ∈ B_1} δ(I, θ(I))^p )^{1/p}

where p = 1 or 2, B_1 and B_2 are two barcodes, Bij(B_1, B_2) denotes the set of all bijections θ for which δ(I, θ(I)) ≠ 0 for only finitely many intervals I, and δ refers to the penalty function between barcodes. Thus, we use the Wasserstein distance with p = 2, which underlies Wasserstein barycenters, on our barcodes, over prior work using Euclidean distance and Euclidean means [Khrulkov and Oseledets, 2018].
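Since RLTs are one-dimensional histograms, both the distance and the barycenter admit simple quantile-function approximations in the one-dimensional case. The following sketch (our own illustrative code, using bin indices as the ground metric and a fixed quantile grid, not the exact implementation) shows the p = 2 distance and the corresponding barycenter:

```python
import numpy as np

def w2_1d(p, q):
    """2-Wasserstein distance between two 1-D histograms on the same bins.

    In one dimension W_p reduces to the L_p distance between quantile
    functions, approximated here on a fine uniform grid of mass levels.
    """
    grid = np.linspace(0.0, 1.0, 1000, endpoint=False) + 0.0005
    qp = np.searchsorted(np.cumsum(p), grid)  # quantile function of p
    qq = np.searchsorted(np.cumsum(q), grid)  # quantile function of q
    return float(np.sqrt(np.mean((qp - qq) ** 2)))

def w2_barycenter(hists):
    """W2 barycenter of 1-D histograms: average of the quantile functions."""
    grid = np.linspace(0.0, 1.0, 1000, endpoint=False) + 0.0005
    quantiles = [np.searchsorted(np.cumsum(h), grid) for h in hists]
    mean_q = np.mean(quantiles, axis=0)  # bin index as a function of mass
    bary, _ = np.histogram(mean_q, bins=np.arange(len(hists[0]) + 1))
    return bary / bary.sum()
```

For two point masses at bins 0 and 2, this returns a distance of two bins and a barycenter concentrated at bin 1, matching the closed-form W2 behavior in one dimension.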
Appendix C: Wasserstein Mean RLTs vs. Euclidean Mean RLTs
We show visual comparisons of our method (Wasserstein Relative Living Times) against the prior method, which uses the Euclidean mean to obtain the average distribution across Relative Living Times [Khrulkov and Oseledets, 2018].
We empirically evaluate the use of Euclidean distance compared to Wasserstein distance, and of the Euclidean mean compared to the Wasserstein mean, on the real dSprites dataset in Table 2. The table indicates that the Wasserstein distance between Wasserstein barycenters is the most capable of differentiating similar and dissimilar persistent homologies.
RLT Distance Metric | Wasserstein RLTs | Wasserstein Distance | Differentiation Ratio
Geometry Score     | –               | –                    | 1.60x
(W. RLT)           | ✓               | –                    | 1.75x
(W. Distance)      | –               | ✓                    | 2.14x
Ours               | ✓               | ✓                    | 2.93x
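The differentiation ratio above can be read as the mean distance between signatures known to be dissimilar divided by the mean distance among similar ones; higher means the metric separates distinct homologies more sharply. A hypothetical computation (the function and its inputs are illustrative, not the exact evaluation code) with a pluggable distance:

```python
import itertools

def differentiation_ratio(similar, dissimilar, dist):
    """Mean across-group distance over mean within-group distance.

    similar, dissimilar: two groups of signatures (e.g. RLT histograms).
    dist: callable returning a distance between two signatures.
    """
    within = [dist(a, b) for grp in (similar, dissimilar)
              for a, b in itertools.combinations(grp, 2)]
    across = [dist(a, b) for a in similar for b in dissimilar]
    return (sum(across) / len(across)) / (sum(within) / len(within))
```

With scalar signatures and absolute difference as the distance, two tight groups far apart produce a large ratio, mirroring how the table rewards metrics that push dissimilar homologies farther apart than similar ones.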
Appendix D: Topological signatures of dSprites
We show topological signatures for each factor of the real dSprites dataset. In the supervised variant, we find that topological signatures in the generated manifold match those in the real data for the corresponding latent interpretations that semantically control these factors.
Appendix E: Topological similarity matrix of dSprites
Here, we display the topological similarity matrix for the dSprites dataset, showing every value of each factor. A visible diagonal can be seen for each factor. Observe that the first three squares in the top left corner correspond to shape, the next six to scale, the next forty to orientation, the next thirty-six to x-position, and the last thirty-six to y-position. Note that this grid is not spectrally co-clustered.
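The block boundaries above follow directly from the per-factor counts; a small hypothetical helper (names and layout are ours) maps the factor sizes quoted in the text to half-open row ranges of the matrix:

```python
# Per-factor value counts as quoted in the text above (illustrative constant).
FACTOR_SIZES = {"shape": 3, "scale": 6, "orientation": 40,
                "x-position": 36, "y-position": 36}

def factor_blocks(sizes):
    """Half-open [start, end) row/column range for each factor's block."""
    blocks, start = {}, 0
    for name, count in sizes.items():  # dicts preserve insertion order
        blocks[name] = (start, start + count)
        start += count
    return blocks
```

For example, the shape block occupies rows 0–2 and the y-position block occupies the final thirty-six rows of the matrix.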
Appendix F: Hyperparameters
For all models, we used open-sourced PyTorch implementations and model checkpoints that implement prior work with the default hyperparameters from the papers. We use pretrained model checkpoints and do not tune them further. One exception, using TensorFlow, is the InfoGAN variants, for which we could not reproduce results from any open-sourced PyTorch implementation, a known issue for InfoGAN [Higgins et al., 2017, Kim and Mnih, 2018]; because pretrained checkpoints were not available for these tasks, we instead trained to the default number of epochs with the hyperparameters from the papers. Additionally, we use the default hyperparameters and functions of the spectral co-clustering (sklearn) and Geometry Score implementations. The Geometry Score implementation used its default gamma and an n of 1000. All of these hyperparameters were held constant across all datasets, models, and experiments.
Appendix G: Computational Complexity
Let r be the number of RLTs per latent dimension, l the number of RLT landmarks, n the number of images sampled per latent dimension, z the number of latent dimensions, f the number of factors of variation, and b the number of bins in the probability distribution histogram:

Calculating the RLTs for one latent dimension requires r witness-complex persistence computations over the n samples and l landmarks [Khrulkov and Oseledets, 2018].

Calculating a Wasserstein barycenter of r histograms with b bins costs O(rb^2) per iteration, run for a fixed maximum number of iterations [Dognin et al., 2019].

Calculating the Wasserstein distances between all pairs of the z barycenters is O(z^2 b^2 / ε^2) with the Sinkhorn algorithm run to tolerance ε [T. Lin et al., 2019].

Spectral co-clustering of the z × z similarity matrix is dominated by a singular value decomposition, O(z^3) [M. Vlachos et al., 2014]; we optimize over the number of biclusters, at most z, so this step is O(z^4). Calculating the bicluster scores has the same runtime.
Treating l and b as constants, the overall cost is polynomial in z and n. Note that many of these subprocedures can be substantially parallelized.
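In particular, the per-dimension outer loop is embarrassingly parallel. A minimal stdlib sketch of the pattern, where `rlt_for_dimension` is a placeholder for the expensive per-dimension work (in practice a process pool would typically replace the thread pool for CPU-bound homology computation):

```python
from concurrent.futures import ThreadPoolExecutor

def rlt_for_dimension(z_index):
    # Placeholder for the real per-dimension work: sample images while
    # sweeping latent dimension z_index, then compute its RLT histograms.
    return z_index  # stand-in result

def compute_all_rlts(num_latent_dims, max_workers=None):
    # Each latent dimension is independent, so the O(z) outer loop
    # parallelizes with no coordination beyond collecting the results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(rlt_for_dimension, range(num_latent_dims)))
```

`pool.map` preserves input order, so the results line up with the latent dimension indices regardless of which worker finishes first.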
References
 The ring of algebraic functions on persistence bar codes. arXiv preprint arXiv:1304.0530.
 Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis 43 (2), pp. 904–924.
 Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.
 Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
 BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
 Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research 16 (1), pp. 77–102.
 Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599.
 Persistent homology and applied homotopy theory. Handbook of Homotopy Theory.
 Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep. 12 (117), pp. 1.
 Metrics for deep generative models. arXiv preprint arXiv:1711.01204.
 Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
 InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
 Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation 87 (314), pp. 2563–2609.
 Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 269–274.
 Wasserstein barycenter model ensembling. arXiv preprint arXiv:1902.04999.
 Unsupervised model selection for variational disentangled representation learning. arXiv preprint arXiv:1905.12614.
 A framework for the quantitative evaluation of disentangled representations.
 Challenging common assumptions in the unsupervised learning of disentangled representations. ICML.
 Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pp. 2053–2061.
 Barcodes: the persistent topology of data. Bulletin of the American Mathematical Society 45 (1), pp. 61–75.
 Domain adaptation for large-scale sentiment classification: a deep learning approach.
 Deep learning. MIT Press.
 Disentangling space and time in video with hierarchical variational autoencoders. arXiv preprint arXiv:1612.04440.
 Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
 A construction for computer visualization of certain complex curves. Notices of the Amer. Math. Soc. 41 (9), pp. 1156–1163.
 Algebraic topology. Cambridge University Press.
 Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230.
 beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6.
 Bayesian representation learning with oracle constraints. arXiv preprint arXiv:1506.05011.
 Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
 A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948.
 Geometry score: a method for comparing generative adversarial networks. arXiv preprint arXiv:1802.02664.
 Disentangling by factorising. arXiv preprint arXiv:1802.05983.
 Vietoris–Rips persistent homology, injective metric spaces, and the filling radius. arXiv preprint arXiv:2001.07588.
 InfoGAN-CR: disentangling generative adversarial networks with contrastive regularizers. arXiv preprint arXiv:1906.06034.
 Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
 Improving co-cluster quality with application to product recommendations. CIKM.
 dSprites: disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/
 Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, pp. 1786–1794.
 Probabilistic topological maps. Ph.D. Thesis, Georgia Institute of Technology.
 A survey of inductive biases for factorial representation learning. arXiv preprint arXiv:1612.05299.
 Neural persistence: a complexity measure for deep neural networks using algebraic topology. arXiv preprint arXiv:1812.09764.
 The Riemannian geometry of deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 315–323.
 Weakly supervised disentanglement with guarantees. arXiv preprint arXiv:1910.09772.
 Geometry of deep generative models for disentangled representations. In Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing, pp. 1–8.
 Disentangling adversarial robustness and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6976–6987.
 On efficient optimal transport: an analysis of greedy and accelerated mirror descent algorithms. ICML.
 Computing persistent homology. Discrete & Computational Geometry 33 (2), pp. 249–274.