Evaluating Disentanglement of Structured Latent Representations
We design the first multi-layer disentanglement metric operating at all hierarchy levels of a structured latent representation, and derive its theoretical properties. Applied to object-centric representations, our metric unifies the evaluation of both object separation between latent slots and internal slot disentanglement into a common mathematical framework. It also addresses the problematic dependence on segmentation mask sharpness of previous pixel-level segmentation metrics such as ARI. Perhaps surprisingly, our experimental results show that good ARI values do not guarantee a disentangled representation, and that the exclusive focus on this metric has led to counterproductive choices in some previous evaluations. As an additional technical contribution, we present a new algorithm for obtaining feature importances that handles slot permutation invariance in the representation.
A salient challenge in the field of feature learning is the ability to decompose the representation of images and scenes into distinct objects that are generated separately and then combined. Indeed, the capacity to reason about objects and their relations is a central aspect of human intelligence spelke1992origins that can be used in conjunction with graph neural networks to enable efficient relational reasoning and improve existing AI systems lake2017building; wang2018nervenet; watters2019cobra; battaglia2018relational; brunton2020machine. In recent years, several models have been proposed that learn a compositional representation for images: TAGGER greff2016tagger, NEM greff2017neural, R-NEM van2018relational, MONet burgess2019monet, IODINE greff2019multi, GENESIS engelcke2019genesis, Slot Attention locatello2020object, MulMON li2020learning, SPACE lin2020space. These models have to jointly learn how to represent individual objects and how to segment the image into different components, the latter sometimes being referred to as perceptual grouping. They all share a number of common principles: (i) split the latent representation into several groups of dimensions, also known as slots; (ii) inside each slot, encode information about both pixel group assignments and individual object representation; (iii) maintain a symmetry between slots in order to respect the permutation invariance of object composition.
In order to compare algorithms and select models, it is indispensable to have robust disentanglement metrics. At the level of individual factors of variation, a representation is said to be disentangled when information about the different factors is separated between different latent dimensions bengio2013representation; locatello2019challenging. At the object level, disentanglement measures the degree of object separation between slots. However, all existing metrics higgins2016beta; kim2018disentangling; chen2018isolating; ridgeway2018learning; eastwood2018framework; kumar2017variational are limited to the level of individual factors and therefore disregard the structure of the representation. To quote kim2018disentangling on the FactorVAE metric:
The definition of disentanglement we use […] is clearly a simplistic one. It does not allow correlations among the factors or hierarchies over them. Thus this definition seems more suited to synthetic data with independent factors of variation than to most realistic datasets.
As a result, prior work has been restricted to measuring the degree of object separation via pixel-level segmentation metrics. The most widely used is the Adjusted Rand Index (ARI) rand1971objective; greff2019multi, a measure of clustering similarity that can be applied to object separation by treating pixel segmentation as a cluster assignment. Other metrics such as Segmentation Covering arbelaez2010contour have been introduced to penalize over-segmentation of objects. A fundamental limitation of these metrics is that they do not directly evaluate the quality of the representation, but instead consider a visual proxy of object separation. This results in a problematic dependence on the quality of the inferred segmentation masks. For instance, the authors of IODINE greff2019multi noticed that the low ARI of their model on the Multi-dSprites dataset is mostly due to unsharp segmentation masks, despite good object separation. Moreover, our experimental results show that it is possible to obtain near-perfect ARI without actually learning a satisfying representation.
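To make the pixel-as-cluster view of ARI concrete, the following is a minimal sketch in plain Python. It assumes that both the ground-truth and the inferred segmentation have already been flattened into one hard label per pixel (for instance by assigning each pixel to the slot with the largest mask value); the function name `adjusted_rand_index` and this preprocessing are illustrative assumptions, not taken from any of the cited implementations.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index between two flat per-pixel label sequences.

    Hypothetical helper: each pixel is a sample, each object (or slot)
    is a cluster. Labels only need to be hashable; their values are
    irrelevant, so the score is invariant to slot permutations.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("both segmentations must cover the same pixels")
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pixel pairs to compare

    # Count pixel pairs that fall together in each partition.
    joint = Counter(zip(true_labels, pred_labels))
    pairs_joint = sum(comb(c, 2) for c in joint.values())
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())

    # Chance-level agreement and maximum attainable agreement.
    expected = pairs_true * pairs_pred / comb(n, 2)
    maximum = 0.5 * (pairs_true + pairs_pred)
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (pairs_joint - expected) / (maximum - expected)


# Three objects of two pixels each; the prediction permutes slot indices.
truth = [0, 0, 1, 1, 2, 2]
permuted = [1, 1, 0, 0, 2, 2]
collapsed = [0, 0, 0, 0, 0, 0]  # every pixel assigned to a single slot
print(adjusted_rand_index(truth, permuted))   # 1.0
print(adjusted_rand_index(truth, collapsed))  # 0.0
```

The score depends only on the hard pixel assignment, which illustrates why it is sensitive to the sharpness of the inferred masks rather than to the content of the latent slots themselves.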
To address these limitations, we present the first structured disentanglement metric operating at different levels of hierarchy in the data generating process. Applied to object-centric representations, our method mathematically unifies the evaluation of object separation between slots and of internal slot disentanglement. It naturally extends to data generating processes and representations with arbitrary hierarchies. Moreover, it is mathematically elegant and amenable to theoretical analysis: we leverage information-theoretic inequalities to understand the behavior of our metric when combining hierarchy levels. Our key theoretical result is that our multi-level framework provides a sound substitute for prior disentanglement metrics. Experimentally, we compare the representations learned by three architectures: MONet, GENESIS and IODINE. The results confirm issues with pixel-level segmentation metrics and show that our metric offers a potential solution. They also offer additional insight into the respective strengths of the different models.
2 Background and Definitions
2.1 Disentanglement criteria
There exists an extensive literature discussing notions of disentanglement; accounting for all the different definitions is outside the scope of this paper. We chose to focus on the three criteria formalized by eastwood2018framework, which stand out because of their clarity and simplicity. Disentanglement is the degree to which a representation separates the underlying factors of variation, with each latent variable capturing at most one generative factor. Completeness is the degree to which each underlying factor is captured by a single code variable. Informativeness is the amount of information that a representation captures about the underlying factors of variation. As in prior work, the word disentanglement is also used as a generic term that simultaneously refers to these three criteria. In the following, the context should make clear whether the general or the specific meaning is intended.
2.2 The DCI metric
For brevity, we only describe the DCI metric eastwood2018framework, which is most closely related to our work. The supplementary material provides a comprehensive overview of existing disentanglement metrics.
Consider a dataset $X$ composed of $n$ observations $x^1, \dots, x^n$, which we assume are generated by combining $F$ underlying factors of variation. The values of the different factors for observation $x^l$ are denoted $v^l_1, \dots, v^l_F$. Suppose we have learned a representation $z = (z_1, \dots, z_L)$ from this dataset. The DCI metric is based on the affinity matrix $R = (R_{i,j})$, where $R_{i,j}$ measures the relative importance of latent $z_i$ in predicting the value of factor $v_j$. Supposing an appropriate normalization for the matrix $R$, disentanglement is measured as the weighted average of the matrix's row entropy. Conversely, completeness is measured as the weighted average of column entropy. Finally, informativeness is measured as the normalized error of the predictor used to obtain the matrix $R$.
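The computation described above can be sketched as follows. This is a minimal illustration in plain Python, assuming the affinity matrix is given as a nested list `importance[i][j]` of non-negative importances (rows indexed by latents, columns by factors). The function names, the use of one minus the normalized entropy as the per-row and per-column score, and the weighting by row and column mass follow our reading of eastwood2018framework and should be treated as assumptions. Informativeness is omitted because it depends on the predictor used to obtain the matrix.

```python
import math


def normalized_entropy(probs, base):
    """Entropy of a probability vector using logarithms in the given base.

    With base equal to the vector length, the result lies in [0, 1].
    """
    if base < 2:
        return 0.0  # a single outcome carries no uncertainty
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def dci_scores(importance):
    """Return (disentanglement, completeness) for an affinity matrix.

    Hypothetical helper: importance[i][j] >= 0 is the relative
    importance of latent i for predicting factor j.
    """
    n_latents = len(importance)
    n_factors = len(importance[0])
    total = sum(sum(row) for row in importance)
    if total == 0:
        raise ValueError("importance matrix must have a non-zero entry")

    # Disentanglement: each latent (row) should matter for one factor only.
    disentanglement = 0.0
    for row in importance:
        mass = sum(row)
        if mass == 0:
            continue  # unused latent, contributes zero weight
        score = 1.0 - normalized_entropy([r / mass for r in row], n_factors)
        disentanglement += (mass / total) * score

    # Completeness: each factor (column) should be captured by one latent.
    completeness = 0.0
    for j in range(n_factors):
        column = [importance[i][j] for i in range(n_latents)]
        mass = sum(column)
        if mass == 0:
            continue  # factor not predicted by any latent
        score = 1.0 - normalized_entropy([c / mass for c in column], n_latents)
        completeness += (mass / total) * score

    return disentanglement, completeness


# One latent per factor: both scores reach their maximum.
print(dci_scores([[1.0, 0.0], [0.0, 1.0]]))  # (1.0, 1.0)
```

A matrix with uniform entries gives the opposite extreme: every row and column has maximal entropy, so both scores drop to zero.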