# Quantifying the Effects of Enforcing Disentanglement on Variational Autoencoders

## Abstract

The notion of disentangled autoencoders was proposed as an extension of the variational autoencoder, introducing a disentanglement parameter β that controls the learning pressure put on the possible underlying latent representations. For certain values of β this kind of autoencoder is capable of encoding independent input generative factors in separate elements of the code, leading to more interpretable and predictable model behaviour. In this paper we quantify the effects of the parameter β on model performance and disentanglement. After training multiple models with the same value of β, we establish the existence of consistent variance in one of the disentanglement measures proposed in the literature. The negative consequences of disentanglement for the autoencoder's discriminative ability are also assessed while varying the number of examples available during training.

## 1 Introduction

The exponential growth in data availability and the rapid increase of computational power in the past decade have allowed neural network based algorithms to achieve impressive practical results in the fields of computer vision [14, 18], natural language processing and generation [7, 21], and game playing [19] to mention a few, surpassing human performance on several complex tasks [8, 19]. Despite the undeniable potential of the deep learning approach, however, more research is required to better understand its limits [20].

This work primarily concerns the model of a disentangled autoencoder, which represents a recent development towards building more transparent and interpretable generative models. It is capable of learning independent generating factors separately in the network, and is thus more predictable in its behaviour: given certain input data we know what values to expect for the code and, conversely, small disturbances of the code result in expected changes of the output. We study the properties of this model with respect to changing values of the disentanglement parameter β, measuring both its disentanglement level and its discriminative ability.

Autoencoders have been part of the neural network field since the late 80s [2, 12]. Because of their capability to perform dimensionality reduction, they are sometimes considered a more powerful, non-linear counterpart of Principal Component Analysis (PCA) [1, 6]. The disentangled autoencoder was initially introduced by Higgins *et al.* [9, 10] and has since also been applied in semi-supervised learning settings [16]. It can be considered a generalisation of the variational autoencoder devised by Kingma and Welling [13]. Earlier attempts at disentangled factor learning were reported to either require a priori knowledge about the data generating factors [3, 11, 17], or not to scale well [4, 5].

## 2 Background

Kingma and Welling [13] derived the variational autoencoder framework by rearranging the evidence lower bound (ELBO) so that the intractable posterior $p_\theta(z|x)$ is eliminated:

$$\log p_\theta(x) = D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) + \mathcal{L}(\theta, \phi; x) \;\geq\; \mathcal{L}(\theta, \phi; x), \tag{1}$$

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big). \tag{2}$$

The first term corresponds to the autoencoder's reconstruction error and the second is the Kullback-Leibler (KL) divergence between the posterior approximation $q_\phi(z|x)$ and the prior $p(z)$, acting as a regulariser. In practice this cost is typically dominated by the reconstruction error, so Higgins *et al.* [9, 10] took the approach further by specifying the constrained optimisation problem

$$\max_{\theta, \phi}\; \mathbb{E}_{x \sim D}\Big[\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]\Big] \quad \text{subject to} \quad D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) < \epsilon \tag{3}$$

for $\epsilon > 0$. Applying the Karush-Kuhn-Tucker conditions [1], Equation (3) can be written as a Lagrangian

$$\mathcal{F}(\theta, \phi, \beta; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \beta\,\Big(D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) - \epsilon\Big) \tag{4}$$

with $\beta \geq 0$, deriving the final disentangled autoencoder cost function. A practical choice is to set $q_\phi(z|x)$ to a diagonal-covariance Gaussian and $p(z) = \mathcal{N}(0, I)$. In this way not only can the KL term be evaluated analytically [13], but choosing $p(z)$ to be the isotropic normal distribution with perfectly uncorrelated components forces the model to learn representations which encode statistically independent features of the data separately, in different positions of the code. Varying the value of β regulates the amount of applied learning pressure, and in the next section we closely examine the effect of varying this disentanglement parameter.
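With these choices the KL term has the well-known closed form from [13], so the practical cost is straightforward to compute. The following is a minimal NumPy sketch of the resulting loss (a reconstruction term plus β times the analytic KL); the squared-error reconstruction term and all names are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Sketch of the disentangled autoencoder cost, assuming a diagonal
    Gaussian posterior q(z|x) = N(mu, diag(exp(logvar))) and an isotropic
    normal prior p(z) = N(0, I)."""
    # Reconstruction term: squared error stands in for -log p(x|z).
    recon = np.sum((x - x_recon) ** 2)
    # KL(q(z|x) || N(0, I)) has the closed form [13]:
    # -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon + beta * kl

# With mu = 0 and logvar = 0 the posterior equals the prior, so the KL
# vanishes and the loss reduces to the reconstruction error alone.
x = np.ones(4)
loss_at_prior = beta_vae_loss(x, x, np.zeros(2), np.zeros(2), beta=4.0)
```

Setting β = 1 recovers the original variational autoencoder objective, while β > 1 increases the pressure towards the factorised prior.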

## 3 Experiments

### 3.1 Disentanglement level with respect to β

#### Data

Higgins *et al.* [9] made the key assumption that the observed data should possess transform continuities in order for regularities to be discoverable in an unsupervised manner. We assume the input data is generated by a number of factors of variation, densely sampled from their respective continuous distributions. In accordance with these considerations, we constructed a synthetic dataset of 64x64 binarised images, each containing a single shape. The generative factors defining each image are: shape – square, ellipse or triangle; position X (16 values); position Y (16 values); scale (6 levels); rotation (60 values over the full range). The images were randomly separated into training, validation and test sets in a 70:15:15 ratio in a stratified way. Special care was taken to reduce leakage between the subsets by removing duplicate images incidentally caused by some idempotent transformations (e.g. rotating a square by 90°, 180° or 270° produces the same figure). The final dataset consists of 267,021 images.^{1}
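The factor grid above can be enumerated directly; the sketch below is a hypothetical reconstruction of that enumeration (the paper's actual rendering and deduplication code may differ). Note that the raw grid is larger than the final dataset because rotational symmetries create duplicate images that are removed after rendering:

```python
from itertools import product

# Generative factor grid as described in the text.
shapes = ["square", "ellipse", "triangle"]
pos_x = range(16)          # position X
pos_y = range(16)          # position Y
scales = range(6)          # scale levels
rotations = range(60)      # rotation values over the full range

factor_grid = list(product(shapes, pos_x, pos_y, scales, rotations))

# 3 * 16 * 16 * 6 * 60 = 276,480 raw combinations; the final count of
# 267,021 images is lower because duplicates caused by idempotent
# transformations (e.g. a square rotated by 90 degrees) are removed.
num_raw = len(factor_grid)
```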

#### Measuring disentanglement

The disentanglement of an autoencoder cannot be usefully measured by its reconstruction accuracy or by the KL-divergence term of the loss function, as they fail to convey the notion of independence we want for the elements of the code. Precisely, a disentanglement effect means distinguishing the generating factors of the data and encoding them in separate code elements.

Higgins *et al.* [9] proposed a disentanglement measuring method which tries to evaluate exactly this property of a trained autoencoder. A random set of generating factors is sampled, the corresponding image is constructed, and the code means $z^1$ are extracted. The same procedure is repeated, but this time one of the factors is randomly modified while all the others are kept the same; denote the newly extracted code means by $z^2$. A low-capacity linear classifier is trained to map the difference of $z^1$ and $z^2$ (divided for normalisation) to the single factor that was changed during the process of obtaining $z^1$ and $z^2$. The classifier accuracy is then reported as the disentanglement measure of the autoencoder of interest. The assumption is that if a simple classifier is capable of inferring which single input generating factor was responsible for the code perturbation, then the model provides some form of transparency and interpretability.
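The construction of the classifier's training set can be sketched as follows. The linear stand-in encoder, the code size, and the max-absolute-value normalisation are all illustrative assumptions made so the snippet is self-contained; in the real procedure the codes come from a trained autoencoder, and [9] does not fully specify the normalisation:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_FACTORS = 5    # shape, position X, position Y, scale, rotation
CODE_SIZE = 10     # illustrative latent dimensionality
W = rng.normal(size=(CODE_SIZE, NUM_FACTORS))  # stand-in linear "encoder"

def encode_means(factors):
    # Placeholder for the trained encoder's code means.
    return W @ factors

def make_metric_example():
    """One training example for the low-capacity classifier: the normalised
    difference between two codes whose generating factors differ in exactly
    one position, labelled with the index of the changed factor."""
    f1 = rng.uniform(size=NUM_FACTORS)
    changed = int(rng.integers(NUM_FACTORS))
    f2 = f1.copy()
    f2[changed] = rng.uniform()          # resample exactly one factor
    z1, z2 = encode_means(f1), encode_means(f2)
    diff = (z1 - z2) / (np.abs(z1 - z2).max() + 1e-8)  # normalisation
    return diff, changed

X, y = zip(*(make_metric_example() for _ in range(1000)))
# A linear classifier trained on (X, y) would then report the
# disentanglement score as its accuracy on held-out pairs.
```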

An alternative method, in which one of the factors is fixed while all the others are randomly resampled between the two code extractions, is presented in a subsequent work by the same authors [10], but we only evaluate the first approach here.

#### Results

The disentanglement levels of the four types of autoencoders we trained, varying β from 0 to 5 in regular steps, are presented in Figure 1. Simple and denoising variants of autoencoders were considered, and both fully connected and convolutional architectures were tested. The noise applied in the denoising case was salt-and-pepper, randomly flipping up to 20% of the pixels. For each value of β, 5 autoencoder models were trained. In all graphs here and below we plot the means of the results obtained over all models trained with the same β, while the error bars denote standard deviations.

The first thing to become clear is the high variance between separate runs with the same value of β. A potential reason is that the method may not completely close the gap between our notion of disentanglement and factor independence on one hand, and the underlying properties of the representations actually learnt by the autoencoders on the other. For example, we observed that for the position latents, at a value of β which is supposed to yield disentanglement according to Higgins *et al.* [9], the autoencoder may sometimes learn "curved" or rotated, but still orthogonal, coordinate systems, which differs from what we would expect. Moreover, when reporting their results in [9], the bottom 50% of the obtained measurements were discarded for unspecified reasons (we did not do this when organising our results in Figure 1). Higgins *et al.* [9, 10] report results for a few fixed values of β only, so to the best of our knowledge the findings about the intermediate values presented in Figure 1 constitute previously unpublished work.

Another trend is the increase in disentanglement with larger values of β. The growth seems most steady for convolutional denoising autoencoders. This is consistent with the claims that convolutional networks may be better at capturing image structure than fully connected ones, and that adding noise and reconstructing the original data can act as a good regulariser.

The expectation for larger values of β is that at some point the autoencoder disentanglement will plateau (as is starting to happen in the fully connected denoising case), and that from then onwards further increases of β will be damaging, as they will come at the cost of reducing the autoencoder's reconstruction ability. This in turn can lead to losing some useful learnt properties of the data. An application in which even small (but nonzero) values of β can be harmful is described in the next section.

### 3.2 MNIST classification with disentangled autoencoders

After evaluating the disentangled autoencoders' behaviour on our synthetic dataset, a natural continuation was to test them against an established machine learning benchmark; the MNIST [15] dataset was a suitable candidate. The evaluation procedure began with unsupervised autoencoder training. Subsequently, a Support Vector Machine classifier was trained (on the same training dataset) to map the image codes produced by the encoder network to the respective image classes. The results are presented in Figure 2. The outcomes of the experiments using 10% and 30% of the MNIST training dataset were included because the 20% ones were outlying.
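This encode-then-classify pipeline can be sketched on toy data so the snippet is self-contained; the linear "encoder" and the nearest-centroid classifier below are stand-ins for the trained encoder network and the SVM used in the actual experiments, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(images, W):
    # Placeholder for the trained encoder network: a fixed linear
    # projection to the code means (the real model is non-linear).
    return images @ W

# Toy stand-in data: 200 "images" of 64 pixels in 10 classes, each class
# with a distinct mean so classification is possible.
n, dim, classes, code = 200, 64, 10, 16
labels = np.repeat(np.arange(classes), n // classes)
centers = rng.normal(size=(classes, dim))
images = centers[labels] + 0.1 * rng.normal(size=(n, dim))
W = rng.normal(size=(dim, code))

codes = encode(images, W)

# The paper trains an SVM on the codes; a nearest-centroid classifier
# stands in here so the sketch needs only NumPy.
centroids = np.stack([codes[labels == c].mean(axis=0) for c in range(classes)])
pred = np.argmin(((codes[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
accuracy = (pred == labels).mean()
```

In the actual experiments the classifier is trained and evaluated on codes from the MNIST training and test splits, and the accuracy is reported as a function of β.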

Increasing the number of training examples consistently increases the classification accuracy, as expected – providing more labels helps the model generalise. Although with higher variance, the convolutional architecture seems to be more robust when the autoencoder is trained with fewer datapoints. The major spikes in classification accuracy at small values of β can be attributed to overfitting: when β = 0, the network tends to learn a one-to-one mapping and the latents may end up unrelated to their nearby values. The KL term acts as a regulariser, adding smoothness to the learnt latent manifold.

Taking into account the increasing error in the autoencoders' reconstruction precision, it can be concluded that autoencoder disentanglement is detrimental for classification problems when applied to the MNIST dataset. This is an expected result, especially given the lack of explicit continuity and generating factors in the MNIST images. However, it establishes that there is a trade-off between the two terms of the disentangled autoencoder loss function, and that they force the model to learn different properties of the data. When training a disentangled autoencoder, this trade-off should be considered and a balanced solution sought.

## 4 Conclusion

This work contributes, to the best of our knowledge, new and previously unpublished findings about the properties of disentangled autoencoders. In particular, their level of disentanglement was measured over a whole range of β values, and it was found that, as expected, disentanglement typically worsens the models' performance in classification tasks.

### Footnotes

- The source code to reproduce all of our experiments described in this work can be found at www.github.com/mpeychev/disentangled-autoencoders.
- Best viewed in colour.

### References

- Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
- H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4):291–294, 1988.
- Brian Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen. Discovering hidden factors of variation in deep networks. CoRR, abs/1412.6583, 2014.
- Taco S. Cohen and Max Welling. Transformation properties of learned visual representations. CoRR, abs/1412.7659, 2014.
- G. Desjardins, A. Courville, and Y. Bengio. Disentangling Factors of Variation via Generative Entangling. arXiv, October 2012.
- Ian Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
- Alex Graves, A.-R. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778, 2013.
- Kaiming He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pages 1026–1034, Washington, DC, USA, 2015. IEEE Computer Society.
- Irina Higgins, L. Matthey, X. Glorot, A. Pal, B. Uria, C. Blundell, S. Mohamed, and A. Lerchner. Early visual concept learning with unsupervised deep learning. CoRR, abs/1606.05579, 2016.
- Irina Higgins, L. Matthey, X. Glorot, A. Pal, B. Uria, C. Blundell, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.
- Geoffrey E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In Proceedings of the 21th International Conference on Artificial Neural Networks - Volume Part I, ICANN’11, pages 44–51, Berlin, Heidelberg, 2011. Springer-Verlag.
- Geoffrey E. Hinton and R. S. Zemel. Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 3–10. Morgan-Kaufmann, 1994.
- Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
- Alex Krizhevsky, I. Sutskever, and G. E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
- Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
- Yang Li, Q. Pan, S. Wang, H. Peng, T. Yang, and E. Cambria. Disentangled variational auto-encoder for semi-supervised learning. CoRR, abs/1709.05047, 2017.
- Scott Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II–1431–II–1439. JMLR.org, 2014.
- Florian Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. CoRR, abs/1503.03832, 2015.
- David Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 01 2016.
- Christian Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
- Aäron van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.