Disentangling Factors of Variation by Mixing Them
We propose an unsupervised approach to learn image representations that consist of disentangled factors of variation. A factor of variation corresponds to an image attribute that can be discerned consistently across a set of images, such as the pose or color of objects. Our disentangled representation consists of a concatenation of feature chunks, each chunk representing one factor of variation. It supports applications such as transferring attributes from one image to another, by simply swapping feature chunks, and classification or retrieval based on one or several attributes, by considering a user-specified subset of feature chunks. We learn our representation in an unsupervised manner, without any labeling or knowledge of the data domain, using an autoencoder architecture with two novel training objectives: first, an invariance objective encourages that the encoding of each attribute and the decoding of each chunk are invariant to changes in the other attributes and chunks, respectively; second, a classification objective ensures that each chunk corresponds to a consistently discernible attribute in the represented image, hence avoiding the shortcut where chunks are ignored completely. We demonstrate the effectiveness of our approach on the MNIST, Sprites, and CelebA datasets.
Deep learning techniques have led to highly successful natural image representations, some focusing on synthesis of detailed, high resolution images in photographic quality [15, 6], and others on disentangling image features into semantically meaningful properties [25, 22, 7], for example.
In this paper, we learn a disentangled image representation that separates the feature vector into multiple chunks, each chunk representing intuitively interpretable properties, or factors of variation, of the image. We propose a completely unsupervised approach that does not require any labeled data, such as pairs of images where only one factor of variation changes (different viewpoints, for example) [22, 1].
The basic assumption of our technique is that images can be represented by a set of factors of variation, each one corresponding to a semantically meaningful image attribute. In addition, each factor of variation can be encoded using its own feature vector, which we call a feature chunk. That is, images are simply represented as concatenations of feature chunks, in a given order. We obtain disentanglement of feature chunks by leveraging autoencoders, and as a key contribution of this paper, by developing a novel invariance objective. The goal of the invariance objective is that each attribute is encoded into a chunk invariant to changes in other attributes, and that each chunk is decoded into an attribute invariant to changes in other chunks. We implement this objective using a sequence of two feature swapping autoencoders that mix feature chunks.
The invariance objective using feature swapping on its own, however, does not guarantee that each feature chunk represents a meaningful factor of variation. Instead, the autoencoder could represent the image with a single chunk and ignore all the others. This is called the shortcut problem. We address the shortcut problem with a classification constraint, which forces each chunk to have a consistent, discernible effect on the generated image.
We demonstrate successful results of our approach on several datasets, where we obtain representations consisting of feature chunks that determine semantically meaningful image properties. In summary, we make the following contributions:
A novel architecture to learn image representations of disentangled factors of variation in a completely unsupervised manner, where the representation consists of a concatenation of a fixed number of feature chunks. Our approach can learn several factors of variation simultaneously.
A novel invariance objective to obtain disentanglement by encouraging invariant encoding and decoding of image attributes and feature chunks, respectively.
A novel classification constraint to ensure that each feature chunk represents a consistent, discernible factor of variation of the represented image.
An evaluation on the MNIST, Sprites, and CelebA datasets to demonstrate the effectiveness of our approach.
2 Related work
2.1 Autoencoders
Our architecture builds on an autoencoder [5, 13, 2], which is a neural network with two main components, an encoder and a decoder. The encoder extracts a feature representation of the input (image), and the decoder translates the features back to the input. Different flavors of autoencoders have been trained to perform image restoration [29, 21, 4] or image transformation [12]. While basic autoencoders do not impose any constraints on the representation itself, variational autoencoders [17] add a generative probabilistic formulation, which forces the representation to follow a Gaussian distribution and allows sampling images by applying the decoder to a Gaussian noise vector. Thanks to their flexibility, autoencoders have become ubiquitous tools in larger systems for domain adaptation [32, 16] or unsupervised feature learning [23]. Autoencoders are also used to learn feature disentangling [25, 22, 1]. In our work we also use them as feature extractors. Our contribution is a novel, unsupervised training method that ensures the separation of factors of variation into several feature chunks.
2.2 Generative Adversarial Networks
Generative Adversarial Networks (GANs) [10] were designed to learn to sample from data distributions that are given simply as a set of training samples from the desired distributions. They solve the task using two competing neural networks: a generator translates input noise vectors into fake data samples, while a discriminator tries to classify the fake and real samples correctly. In an ideal training scenario, the system converges to a state where the generator produces fake data samples following the same distribution as the training data, and the discriminator can only detect fake samples by chance. By enforcing that generated samples follow a certain distribution, GANs have been successful at synthesizing realistic images, learning representations, sampling images from a specific domain, and ensuring that image-feature pairs have the same distribution when one is computed from the other. As the adversarial loss constrains the distribution of the generated data but not the individual data samples, it reduces the need for data labeling. In particular, Shrivastava et al. [26] use GANs to transfer known attributes of synthetic, rendered examples to the domain of real images, thus creating virtually unlimited datasets for supervised training. In our work we use GANs to enforce that images look realistic when their attributes are transferred.
2.3 Disentangling and Independence
There are many fully supervised methods [28, 30] that disentangle factors of variation. Arguably the simplest way to disentangle is to swap parts of the feature representation in an autoencoder, then force the decoder to produce the result with the swapped attributes.
GANs and adversarial training have been leveraged to reduce the need for complete labeling of all factors of variation. Instead, these techniques allow the separation of a labeled factor of variation without labeling any other attributes. For example, Mathieu et al. [22] apply adversarial training on the image domain, and Denton et al. [8] apply adversarial training on the feature domain to achieve this.
Recent work [1] describes the fundamental challenges of disentangling under such weak labeling. The authors prove the existence of the reference ambiguity, which means that only the labeled factor can be provably recovered without additional knowledge. In many cases in practice, however, the reference ambiguity does not arise. Another challenge is the shortcut problem, where all information is stored in one feature chunk while the others are ignored; this can be provably solved, as shown in [1]. Unsupervised training without prior knowledge makes the reference ambiguity (the interpretability of factors) and the shortcut problem more severe. Nonetheless, our method can recover multiple semantically meaningful factors on several datasets.
In some approaches, the physics of the image formation model is integrated into the network training, with factors like depth and camera pose [31] or albedo, surface normals, and shading [27]. In the case of [31], the prior knowledge allows training without any labels.
By maximizing the mutual information between synthesized images and latent features, InfoGAN [7] makes the latent features interpretable as semantically meaningful attributes. InfoGAN is also completely unsupervised, as is our approach, but it does not include an encoding stage. In contrast to InfoGAN, we build on an autoencoder, which allows us to recover the disentangled representation from input images and swap attributes between images. In addition, we use a novel classification constraint instead of the feature consistency approach in InfoGAN.
Two recent techniques, β-VAE [11] and DIP-VAE [18], build on variational autoencoders (VAEs) to disentangle interpretable factors in an unsupervised way, similar to our approach. They encourage the latent features to be independent by generalizing the KL-divergence term in the VAE objective, which measures the similarity between the prior and posterior distributions of the latent factors. Instead, we build on swapping autoencoders and adversarial training [10]. We encourage disentanglement using an invariance objective, rather than trying to match an isotropic Gaussian prior.
3 Unsupervised Disentanglement of Factors of Variation
A representation of images in which the factors of variation are disentangled can be exploited for various computer vision tasks. At the image level, it allows transferring attributes from one image to another. At the feature level, this representation can be used for image retrieval and classification. To achieve this representation and to enable applications at both the image and feature level, we leverage autoencoders. Here, an encoder E transforms the input image x into its feature representation f = E(x), where f consists of multiple chunks f = [f_1, ..., f_N]. The dimension of the full feature is therefore the sum of the chunk dimensions. In addition, a decoder D transforms the feature representation back to the image via x̂ = D(f).
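As a minimal illustration (our own sketch, not the paper's implementation), the chunked feature layout can be expressed as follows; the chunk count matches the experiments (8 chunks), while the chunk dimension is a hypothetical value:

```python
import numpy as np

N_CHUNKS = 8      # number of feature chunks, as in the experiments
CHUNK_DIM = 64    # per-chunk dimensionality (illustrative; 64 is used for Sprites/CelebA)

def split_chunks(f):
    """Split a full feature vector into its N_CHUNKS chunks."""
    return f.reshape(N_CHUNKS, CHUNK_DIM)

def merge_chunks(chunks):
    """Concatenate chunks back into the full feature vector."""
    return chunks.reshape(N_CHUNKS * CHUNK_DIM)

f = np.arange(N_CHUNKS * CHUNK_DIM, dtype=np.float32)
chunks = split_chunks(f)
assert chunks.shape == (8, 64)
assert np.array_equal(merge_chunks(chunks), f)
```

Splitting and merging are lossless, so swapping whole chunks between two feature vectors is a simple slice assignment.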
Our main objective is to learn a disentangled representation, where each feature chunk corresponds to an image attribute. For example, when the data are face images, chunk f_1 could represent the hair color, f_2 the gender, and so on. With a disentangled representation, we can transfer attributes from one image to another simply by swapping the feature chunks: an image x_3 could take the hair color from image x_1 and all the other attributes from x_2.
In our approach, we interpret disentanglement as invariance. In a disentangled representation, the encoding of each image attribute into its feature chunk should be invariant to transformations of any other image property. Vice versa, the decoding of each chunk into its corresponding attribute should be invariant to changes in the other chunks. In our example, if x_1 and x_2 have the same gender, their gender chunks must be identical, irrespective of any other attribute. Hence, a disentangled representation is also useful for image retrieval, where we can search for nearest neighbors with respect to a specified attribute. Invariance is also beneficial for classification, where a simple linear classifier is sufficient to classify each attribute based on its corresponding feature chunk. This observation inspired previous work to quantify disentanglement performance using linear classifiers on the full features.
In the following, we describe how we learn a disentangled representation from data without any additional knowledge (e.g., labels or the data domain) by using swapping autoencoders. One of the main challenges in the design of the autoencoder and its training is that the encoder and the decoder could just make use of a single feature chunk (provided it is sufficient to represent the whole input image) and ignore the other chunks. We call this failure mode a shortcut taken by the autoencoder during training. We propose a novel invariance objective to obtain disentanglement, and a classification objective to avoid the shortcut problem.
3.1 Network Architecture
Our network architecture is shown in Figure 1. There are three main components: We enforce invariance using a sequence of two swapping autoencoders, and a discriminator; we avoid the shortcut problem using a classifier. They are all implemented as neural networks.
Swapping Autoencoders. We leverage a sequence of two swapping autoencoders to enforce invariance, ensuring that we encode each attribute into a feature chunk invariant to changes in other attributes, and that we decode each chunk similarly in an invariant manner into its attribute. More precisely, the sequence of two swapping autoencoders performs the following operations (Figure 1):
Sample two images x_1 and x_2 independently, and encode them into f_1 = E(x_1) and f_2 = E(x_2).
Mixing: swap a randomly selected set of feature chunks, specified by a binary mask m = (m_1, ..., m_N) with m_i ∈ {0, 1}, from f_2 into f_1. Swapping leads to a new feature f_3 = m ⊙ f_1 + (1 − m) ⊙ f_2, where ⊙ is the element-wise multiplication with the mask broadcast over chunks, so chunk i is taken from f_2 whenever m_i = 0.
Decode a new image x_3 = D(f_3).
Encode x_3 again, f_4 = E(x_3).
Swap the remaining original feature chunks, given by the mask 1 − m, from f_1 into f_4, that is, f_5 = m ⊙ f_4 + (1 − m) ⊙ f_1.
Decode the final image x_5 = D(f_5) from the swapped features of x_1 and x_3.
Finally, we minimize the distance between x_5 and x_1, thus the loss function can be written as
L_AE(θ_E, θ_D) = Σ_m ‖x_5 − x_1‖²,
where we sum over all possible mask settings, and θ_E and θ_D are the encoder and decoder parameters, respectively.
Intuitively, the key idea is that the cycle of decoding and re-encoding the mixed feature vector f_3 should preserve the chunks of f_1 that were copied into f_3. In other words, these chunks should be decoded into the corresponding attributes of x_3. In addition, re-encoding the intermediate image x_3, which consists of a mix of attributes from x_1 and x_2, should return the same feature chunks that originally came from x_1.
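The steps of the invariance cycle can be sketched in NumPy; we use an invertible linear map as a toy stand-in for the convolutional encoder and decoder, so the cycle reproduces the input exactly (all names, shapes, and the linear setup are our illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
N_CHUNKS, CHUNK_DIM = 8, 4

# Toy stand-ins for the convolutional encoder/decoder: the "image" and
# the feature have the same size, and E, D are inverse linear maps.
W = rng.standard_normal((N_CHUNKS * CHUNK_DIM, N_CHUNKS * CHUNK_DIM))
W_inv = np.linalg.inv(W)
E = lambda x: W @ x          # encoder
D = lambda f: W_inv @ f      # decoder

def mix(fa, fb, m):
    """f = m*fa + (1-m)*fb, with one mask bit per chunk."""
    m = np.repeat(m, CHUNK_DIM)
    return m * fa + (1.0 - m) * fb

x1, x2 = rng.standard_normal(2 * N_CHUNKS * CHUNK_DIM).reshape(2, -1)
m = rng.integers(0, 2, N_CHUNKS).astype(np.float64)  # random chunk mask

f1, f2 = E(x1), E(x2)
f3 = mix(f1, f2, m)          # mix chunks of the two images
x3 = D(f3)                   # decode the composite image
f4 = E(x3)                   # re-encode the composite
f5 = mix(f4, f1, m)          # restore x1's remaining chunks
x5 = D(f5)                   # decode the final image

loss = np.sum((x5 - x1) ** 2)   # invariance objective, one mask sample
# For a perfectly invertible autoencoder the cycle reproduces x1 exactly.
assert np.allclose(x5, x1)
```

With a real (lossy) encoder/decoder the cycle is not exact, and minimizing the distance between x5 and x1 over random masks is what drives the chunks toward invariant encodings.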
Discriminator. To ensure that the generated composite images are valid images according to the input data distribution, we impose an additional adversarial term, which is defined as
L_GAN(θ_E, θ_D, θ_Dsc) = E_x[log Dsc(x)] + E_{x_1, x_2, m}[log(1 − Dsc(x_3))],
where θ_Dsc are the discriminator parameters. In the ideal case, when the GAN objective reaches its global optimum, the distribution of fake images matches the real image distribution. Even with the invariance and adversarial losses, however, it is still possible to encode all image attributes into one feature chunk and keep the rest constant; this solution optimizes both the invariance loss and the adversarial loss perfectly. As mentioned before, this is called the shortcut problem, and we address it using an additional loss based on a classification task.
Classifier. The last component of our network takes three images as input: the two input images x_1 and x_2, and the generated composite x_3. For every chunk, it tries to decide whether the composite image was generated using the feature chunk from the first or the second input image. The formal loss function is
L_C(θ_E, θ_D, θ_C) = −E[ Σ_i m_i log C_i(x_1, x_2, x_3) + (1 − m_i) log(1 − C_i(x_1, x_2, x_3)) ],
where θ_C are the classifier parameters and C_i ∈ (0, 1) is its output for chunk i. The classifier thus consists of N binary classifiers, one for each chunk, that decide whether the composite image was generated using the corresponding chunk from the first image or the second. We use the cross entropy loss for classification, so the last layer of the classifier is a sigmoid. The classifier loss can only be minimized if a meaningful attribute is encoded in every chunk. Hence, the shortcut problem cannot occur, as it would then be impossible to decide which chunks were used to create the composite image.
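The per-chunk cross-entropy can be sketched as follows; the classifier outputs are replaced by placeholder logits, and all names are our own illustrative choices:

```python
import numpy as np

def chunk_classification_loss(logits, mask):
    """Per-chunk binary cross-entropy.

    logits: one classifier score per chunk for the composite image
            (positive = "this chunk came from the first input image").
    mask:   the ground-truth chunk mask used to build the composite.
    """
    p = 1.0 / (1.0 + np.exp(-logits))   # sigmoid, one probability per chunk
    eps = 1e-12                          # numerical safety for log
    return -np.mean(mask * np.log(p + eps) + (1 - mask) * np.log(1 - p + eps))

mask = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
confident = np.where(mask == 1, 10.0, -10.0)   # classifier agrees with mask
assert chunk_classification_loss(confident, mask) < 1e-3   # near-zero loss
assert chunk_classification_loss(-confident, mask) > 1.0   # wrong predictions
```

The loss is small only when every chunk's origin can be identified, which is exactly what rules out the shortcut of leaving chunks unused.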
We use a network architecture similar to DCGAN [24] for the encoder, decoder, and discriminator. For the classifier, we use AlexNet with batch normalization after each convolutional layer, but without dropout. The image inputs of the classifier are concatenated along the RGB channels. We use equal weights for the swapping autoencoder, GAN, and classifier losses in our experiments on the MNIST and Sprites datasets; for CelebA, we increase the weight of the swapping autoencoder loss. In all experiments, we separate the feature after the last layer of the encoder into 8 chunks of equal size, where each chunk is expected to represent one attribute. We experimented with different chunk sizes and found that the chunk size does not affect the results as long as it is reasonably large. Our results are obtained with chunk size 8 for MNIST and 64 for Sprites and CelebA; we observed that reducing the chunk size on CelebA leads to lower rendering quality. For CelebA, we also show experiments with a network architecture similar to BEGAN [3] for the encoder, decoder, and discriminator, where the encoder has the same architecture as the encoder part of the BEGAN discriminator.
Table 1: Mean average precision of chunk-based nearest-neighbor retrieval on Sprites. Columns A1–A7 correspond to the seven labeled attributes; the last column is the mean over all attributes.

| Method       | A1   | A2   | A3   | A4   | A5   | A6   | A7   | mean |
|--------------|------|------|------|------|------|------|------|------|
| SWAP + C     | 0.57 | 0.65 | 0.43 | 0.63 | 0.55 | 0.58 | 0.51 | 0.56 |
| SWAP + G     | 0.59 | 0.31 | 0.44 | 0.24 | 0.54 | 0.96 | 0.47 | 0.51 |
| SWAP + C + G | 0.58 | 0.80 | 0.94 | 0.49 | 0.58 | 0.96 | 0.52 | 0.70 |
4 Experiments
We experimented on three public datasets: the MNIST [19] handwritten digits, the Sprites [25] animated figures, and the CelebA [20] faces. We show qualitative results on all datasets, and quantitative evaluations and ablation studies on Sprites and CelebA.
The MNIST dataset consists of 60K handwritten digits for the training set and 10K for the test set, given as grayscale images with a size of 28 × 28 pixels. There are 10 different classes corresponding to the digits. Other attributes, like rotation angle or stroke width, are not labeled. Our method can disentangle the labeled attribute as well as some non-labeled ones. Figure 3 shows visual attribute transfer for three factors: digit class, rotation angle, and stroke width. The three chunks were chosen by visually inspecting which chunk corresponds to which attribute. We can see that our method disentangles these factors effectively at the image level.
The Sprites dataset contains synthetically rendered animated characters (sprites). The dataset is split into disjoint training, validation, and test sets of sprites, and each sprite is rendered at many positions, yielding a large number of images in total. The dataset has many labeled attributes: body shape, skin color, vest color, hairstyle, arm and leg color, and finally weapon type. Pose labels can be extracted from the frame numbers of the animations. This rich attribute labeling is ideal for testing the disentanglement of our algorithm.
We perform ablation studies on the components of our method. We compare the SWAP, SWAP+G, SWAP+C, and SWAP+G+C methods, where SWAP denotes the swapping loss, G the adversarial loss, and C the classifier loss in the objective. The qualitative results are shown in Figure 2. We transfer attributes encoded in single chunks from the top row to the leftmost column images; each column of generated images corresponds to a different chunk taken from the image in the top row. We can see that SWAP already learns to disentangle chunks. SWAP+G does not improve the disentangling, as the role of G is to make the images look more realistic; however, the rendering quality with SWAP alone is already good. SWAP+C does not improve disentangling either; rather, it creates artifacts in the rendering. The intuitive explanation is that adding C solves the shortcut problem in the sense that all chunks carry information, so the classifier can decide their origin, but the information is put into the artifacts while the interpretable attributes are ignored. SWAP+G+C, on the other hand, improves the performance: the artifacts are eliminated by G, so the shortcut problem can only be avoided by disentangling the factors. The method recovers independent factors.
For quantitative analysis, we perform nearest-neighbor search using one chunk of the features and compute the mean average precision using an attribute as ground truth. We repeat the search for all chunk-attribute pairs, and for each attribute we choose the best-performing chunk to represent it. We ignore the weapon type attribute in our evaluation, as it is only visible in a small subset of poses. We also compare our method to a vanilla autoencoder (AE). The autoencoder only has one chunk, but its dimensionality matches the full feature of the other methods. Table 1 compares the results of our methods. We can see that SWAP already achieves good disentanglement performance compared to AE. As expected, adding only the G or C term does not help, but when we combine all three losses, we achieve a substantial improvement.
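The chunk-wise retrieval score can be sketched as follows; this is a simplified NumPy illustration with our own function names and toy data, not the paper's evaluation code:

```python
import numpy as np

def average_precision(ranked_labels):
    """Average precision of a ranked list of binary relevance labels."""
    ranked_labels = np.asarray(ranked_labels, dtype=float)
    if ranked_labels.sum() == 0:
        return 0.0
    cum_hits = np.cumsum(ranked_labels)
    precision_at_k = cum_hits / np.arange(1, len(ranked_labels) + 1)
    return float(np.sum(precision_at_k * ranked_labels) / ranked_labels.sum())

def chunk_retrieval_ap(query_chunk, db_chunks, relevant):
    """Rank database items by L2 distance on one feature chunk, then score."""
    d = np.linalg.norm(db_chunks - query_chunk, axis=1)
    order = np.argsort(d)
    return average_precision(relevant[order])

# Toy example: chunk values cluster by attribute, so retrieval is perfect.
db = np.array([[0.1], [0.0], [1.0], [0.9]])
labels = np.array([1, 1, 0, 0])        # attribute value of each db item
assert chunk_retrieval_ap(np.array([0.05]), db, labels) == 1.0
```

Averaging this score over queries gives the mean average precision reported per chunk-attribute pair; the best chunk per attribute is then selected.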
Figure 4 visualizes attribute transfer for all chunks, similar to Figure 3, using our top-performing method SWAP+C+G. Our method recovers the leg and vest colors in single chunks, while the pose and arm color attribute pair is represented by one chunk. The skin color and hairstyle attributes are also entangled and represented by another chunk. There are positions where the sprites stand on a white bar; even though this attribute is fully determined by the position, our method separates it into its own chunk.
CelebA contains 200K color images of celebrity faces, split into training, validation, and test sets of roughly 160K, 20K, and 20K images, respectively. There are 40 labeled binary attributes indicating gender, hair color, facial hair, and so on. We applied our method with both the BEGAN and DCGAN architectures.
Figure 5 shows the attribute transfer for each chunk. We can see that DCGAN exhibits more pronounced attribute transfer, while BEGAN tends to blur out the changes. Figure 6 shows the nearest neighbors of some query images in the dataset using DCGAN. We used the L2 distance on the specified feature chunks to search for the top matches. For each chunk, the top matches preserve a semantic attribute of the query image. Our method recovers five semantically meaningful chunks: brightness, glasses, hair color, hair style, and pose together with smile. Notice that the attributes discovered with attribute transfer match the attributes found in image retrieval. For brevity, we only show those five chunks.
We performed quantitative tests on our learned features. We followed the evaluation technique based on the linear disentanglement criterion described in [11]: a feature representation is considered disentangled when the attributes can be classified using a simple linear classifier. In our special case, when an attribute depends only on one chunk (a subspace), a linear classifier can perform well by setting the classifier weights for the other chunks to zero. We train binary classifiers on the whole feature vector, each with a different labeled attribute as ground truth. The classifier prediction is ŷ = sign(wᵀf + b), where the classifier weights are computed by least squares,
w = argmin_w Σ_j (wᵀf_j − y_j)²,
where y_j ∈ {−1, 1} are the attribute labels. The bias term b is set by minimizing the hinge loss. For a fair comparison, we normalize the features by setting the variance of each coordinate to one; in [11, 18], the features are already normalized by the variational autoencoder. The results are shown in Table 2. We can see that our network is competitive with the state-of-the-art methods β-VAE [11] and DIP-VAE [18]. The BEGAN architecture performs slightly worse than DCGAN, despite the superior rendering quality of the former.
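This evaluation can be sketched as follows; the least-squares fit and the synthetic toy features (where one attribute lives entirely in a single chunk) are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_chunks, d = 200, 8, 4

# Synthetic features: the attribute label is carried by chunk 2 only.
y = rng.choice([-1.0, 1.0], size=n)
f = rng.standard_normal((n, n_chunks * d))
f[:, 2 * d:3 * d] += y[:, None]            # inject the attribute into chunk 2

f = f / f.std(axis=0)                      # normalize per-coordinate variance

w, *_ = np.linalg.lstsq(f, y, rcond=None)  # least-squares classifier weights
pred = np.sign(f @ w)                      # linear classifier prediction
accuracy = np.mean(pred == y)
assert accuracy > 0.9

# In the disentangled case, the weight mass falls on the attribute's chunk.
chunk_norms = np.linalg.norm(w.reshape(n_chunks, d), axis=1)
assert np.argmax(chunk_norms) == 2
```

A representation is deemed disentangled when such a linear classifier succeeds, and here the weights also localize on the chunk that carries the attribute.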
The qualitative and quantitative experiments showed that there is still room for improvement for both image- and feature-level disentanglement. We believe that a better network architecture or training procedure (especially for the GAN) could boost the performance significantly.
We have introduced a novel method to disentangle factors of variation of a single set of images for which no annotation is available. Our representation is computed through an autoencoder, which is trained by imposing constraints between the encoded features and the rendered images. We train the decoder to render realistic images by feeding it features obtained by randomly mixing features from two images, and by using adversarial training. Moreover, we force the autoencoder to make full use of the features by training it jointly with a classifier that determines how the features of the two input images have been mixed. We show that this technique successfully disentangles factors of variation on the MNIST, Sprites, and CelebA datasets.
-  Anonymous. Challenges in disentangling independent factors of variation. International Conference on Learning Representations, 2018.
-  Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
-  D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv:1703.10717, 2017.
-  S. A. Bigdeli, M. Jin, P. Favaro, and M. Zwicker. Deep mean-shift priors for image restoration. arXiv:1709.03749, 2017.
-  H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics, 59(4):291–294, 1988.
-  Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. arXiv:1707.09405, 2017.
-  X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
-  E. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. arXiv:1705.10915, 2017.
-  J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. International Conference on Learning Representations, 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
-  G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
-  G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
-  P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2016.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv:1710.10196, 2017.
-  T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
-  A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv:1711.00848, 2017.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
-  X.-J. Mao, C. Shen, and Y.-B. Yang. Image restoration using convolutional auto-encoders with symmetric skip connections. arXiv:1606.08921, 2016.
-  M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5041–5049, 2016.
-  D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
-  S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In Advances in Neural Information Processing Systems, pages 1252–1260, 2015.
-  A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
-  Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In CVPR, 2017.
-  L. Tran, X. Yin, and X. Liu. Disentangled representation learning gan for pose-invariant face recognition. In CVPR, 2017.
-  P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning. ACM, 2008.
-  X. Peng, X. Yu, K. Sohn, D. Metaxas, and M. Chandraker. Reconstruction for feature disentanglement in pose-invariant face recognition. arXiv:1702.03041, 2017.
-  T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv:1703.10593, 2017.