Plug-in Factorization for
Latent Representation Disentanglement
Abstract
In this work, we propose a Factorized Disentangler-Entangler Network (FDEN) that learns to decompose a latent representation into two mutually independent factors, namely, identity and style. Given a latent representation, the proposed framework draws a set of interpretable factors aligned to the identity of the observed data and learns to maximize the independence between these factors. Our work introduces a plug-in method to disentangle the latent representations of already-trained deep models without affecting the model itself. In doing so, it opens the possibility of extending state-of-the-art models to solve different tasks while maintaining the performance of their original tasks. Thus, FDEN is naturally applicable to jointly performing multiple tasks, such as few-shot learning and image-to-image translation, in a single framework. We show the effectiveness of our work in disentangling a latent representation in two parts. First, to evaluate the alignment of a factor to an identity, we perform few-shot learning using only the aligned factor. Then, to evaluate the effectiveness of the decomposition of the latent representation and to show that the plug-in method does not affect the deep model's performance, we perform image-to-image style transfer by mixing factors of different images. These evaluations show, qualitatively and quantitatively, that our proposed framework can indeed disentangle a latent representation.
Jee Seok Yoon†, Wonjun Ko, Heung-Il Suk‡
Department of Brain and Cognitive Engineering, Korea University, Seoul, South Korea
{wltjr1007, wjko, hisuk}@korea.ac.kr
†This work was done during an internship at Kakao Corp. ‡Corresponding Author.
Preprint. Under review.
1 Introduction
Disentangled representation learning has been of great interest among researchers in the field of machine learning. A representation is generally considered disentangled when it captures interpretable semantic information and factors of variation underlying the problem structure [1]. Thus, the concept of a disentangled representation is closely related to that of a factorial representation [2, 3, 4], which holds that each unit of a disentangled representation should correspond to an independent factor of the observed data. For example, given a facial image, each unit of a disentangled representation should correspond to a distinctive factor like the color of the pupils, or a non-distinctive factor like the color of sunglasses. Due to these properties, researchers have found disentangled representations useful in various tasks such as few-shot learning [5, 6, 7, 8], domain adaptation [9, 10], and image translation [11, 12, 2].
While having a disentangled representation is desirable, this does not imply that an (entangled) latent representation is less powerful or lacks interpretability. In fact, some existing models [13, 14] that do not consider disentanglement have maintained their state-of-the-art performance over the years, mostly because these models are highly tuned [15, 16] on large amounts of data [17]. Numerous works have utilized such highly tuned pretrained models as initializers for their proposed networks [18, 19]. However, these works often modify the architecture or weights of the pretrained model such that it can no longer solve the problems it was originally designed for. Hence, the motivation of our work is to develop a framework that disentangles the representation of a pretrained model without losing the ability to perform its original task (Figure 1).
Thus, we propose a Factorized Disentangler-Entangler Network (FDEN) that learns to decompose a latent representation into independent factors. Specifically, given a latent representation, the proposed network draws a set of interpretable factors and learns to information-theoretically maximize the independence between these factors. In addition, it can entangle the independent factors back into the original representation, making it an autoencoder-like architecture. The motivation behind the autoencoder-like architecture is to utilize the latent representation of a pretrained model rather than to develop and train a disentangled representation from scratch. In doing so, we can focus our efforts solely on disentanglement while benefiting from the outstanding performance already provided by the pretrained model. FDEN is divided into three parts (Figure 2): the Disentangler, the Factorizer, and the Entangler. First, the Disentangler takes a latent representation from a pretrained model and decomposes it into multiple streams. Then, the Factorizer uses an information-theoretic measure to factorize each stream into independent factors. Finally, the Entangler takes the factors and reconstructs the original representation so that the pretrained model can reuse it at will.
To evaluate our proposed framework, we perform qualitative and quantitative examinations of the disentangled representation. First, we measure the effectiveness of the nonlinear decomposition (i.e., the Disentangler) by performing few-shot learning with only the single factor relevant to the identity of the observed data. Then, we examine the degree of independence between factors (i.e., the Factorizer) and the effectiveness of the nonlinear recombination (i.e., the Entangler) by conducting image-to-image translation with factors of different images. The main contributions of our work are threefold (source code available at https://github.com/wltjr1007/FDEN):

We propose a novel framework, the Factorized Disentangler-Entangler Network (FDEN), that decomposes a latent representation into independent factors.

To the best of our knowledge, we are the first to propose a plug-in method for disentangling the latent representations of already-trained deep models. Our work opens the possibility of extending state-of-the-art models to solve different tasks without modifying their weights, so that they maintain the performance of their original tasks.

Our method is naturally applicable to few-shot learning and/or image-to-image translation in a single framework.
2 Background
2.1 Mutual Information
The proposed framework utilizes mutual information to information-theoretically maximize the independence between variables. Mutual information is a measure of the dependency between random variables. The mutual information between random variables X and Z can be formulated as the Kullback–Leibler (KL) divergence between the joint probability distribution, \mathbb{P}_{XZ}, and the product of the marginal probability distributions, \mathbb{P}_X \otimes \mathbb{P}_Z:

I(X; Z) = D_{\mathrm{KL}}\left(\mathbb{P}_{XZ} \,\Vert\, \mathbb{P}_X \otimes \mathbb{P}_Z\right) \qquad (1)
Mutual information is known to measure true dependence [20] since it captures both the linear and nonlinear statistical dependencies between variables. Thus, we chose mutual information as the objective function for nonlinearly decomposing a latent representation. Subsection 3.2 discusses how FDEN utilizes mutual information in detail.
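As a concrete illustration of Equation (1), the mutual information of a small discrete joint distribution can be computed directly from the KL definition and cross-checked against the entropy identity I(X;Z) = H(X) + H(Z) - H(X,Z). This is a minimal sketch; the joint distribution below is illustrative, not taken from our experiments.

```python
import numpy as np

# Joint distribution of two binary variables X and Z (rows: x, cols: z).
p_xz = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

p_x = p_xz.sum(axis=1, keepdims=True)   # marginal P_X
p_z = p_xz.sum(axis=0, keepdims=True)   # marginal P_Z

# Equation (1): I(X; Z) = KL(P_XZ || P_X (x) P_Z).
mi = np.sum(p_xz * np.log(p_xz / (p_x * p_z)))

# Sanity check via the entropy identity I(X;Z) = H(X) + H(Z) - H(X,Z).
h = lambda p: -np.sum(p * np.log(p))
mi_entropy = h(p_x) + h(p_z) - h(p_xz)
assert np.isclose(mi, mi_entropy)
```

The two computations agree exactly because both are exact expectations over the same finite distribution, with no sampling involved.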
The Donsker–Varadhan representation of the KL divergence [21] expresses it as a supremum over functions T: \Omega \to \mathbb{R}:

D_{\mathrm{KL}}(\mathbb{P} \,\Vert\, \mathbb{Q}) = \sup_{T: \Omega \to \mathbb{R}} \mathbb{E}_{\mathbb{P}}[T] - \log\left(\mathbb{E}_{\mathbb{Q}}\left[e^{T}\right]\right) \qquad (2)

By parameterizing T with a neural network, this representation allows mutual information to be estimated from samples, as in [22].
2.2 Invertible Networks
The motivation of our work is to disentangle the latent space of already-trained deep networks without modifying their architecture or weights. Thus, our work focuses on deep networks that can utilize a latent representation extensively; otherwise it would defeat the purpose of learning to disentangle a representation. Here, we define the deep networks that can benefit from FDEN as invertible networks [25, 26] (note that several works use different terms, such as reversible networks [23] or bidirectional networks [24], but they refer to networks with essentially similar architectures).
We define invertible networks as neural networks that are capable of inverting the process of inference. Specifically, an invertible network can create a latent representation from the input space and reconstruct an input from a latent representation. In contrast to most neural networks, which only learn to model the latent space, an invertible network jointly learns to model the input space and the latent space. Thus, invertible networks are capable of utilizing the disentangled representations created by FDEN. A typical example of an invertible network is the variational autoencoder [27], which infers the data distribution given observed data and inverts the process to reconstruct the observed data given the data distribution. In this work, we focus on deep invertible networks such as bidirectional GANs [28, 29], deep autoencoders [30, 31], and invertible deep neural networks [25, 26].
3 Factorized Disentangler-Entangler Network
The proposed Factorized Disentangler-Entangler Network (FDEN) is a novel framework that can be plugged into an invertible network to disentangle its latent representation without modifying its weights. The goal of FDEN is to disentangle a representation into interpretable, independent factors and to entangle them back into the original representation. To achieve this, FDEN is divided into three modules (Figure 2): the Disentangler, the Factorizer, and the Entangler. Since FDEN does not modify the weights of the invertible network, it not only disentangles the representation but also lets the invertible network perform its tasks as originally designed.
3.1 Disentangler-Entangler
The Disentangler-Entangler network is an autoencoder-like architecture that takes a representation z as input and reconstructs its original representation \hat{z}. The Disentangler takes the representation z as input and decodes it. The decoded representation is then decomposed into an identity factor f_i and a style factor f_s. The Entangler takes the factors f_i and f_s into their corresponding streams. These streams are then concatenated on the channel axis and fed into the encoder to reconstruct the original representation \hat{z}. Since the goal of the Disentangler-Entangler network is to reconstruct the original representation, we introduce the reconstruction objective function:

\mathcal{L}_{R} = \left\lVert \mathbf{z} - \hat{\mathbf{z}} \right\rVert_2^2 \qquad (3)
At this point, the representation z is merely decomposed and reassembled into \hat{z}. It is not disentangled, and the factors do not carry any distinguishable information. Thus, in the next subsection, we introduce a module called the Factorizer that instills information into these factors.
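The Disentangler-Entangler data flow above can be sketched as follows, assuming single fully connected layers with untrained random weights as stand-ins for the 3- or 4-layer MLPs described in Section 5.2. Dimensions are illustrative; with untrained weights the reconstruction only matches z in shape, not in value.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_f = 256, 64          # latent and factor dimensions (illustrative)

leaky_relu = lambda x: np.where(x > 0, x, 0.2 * x)
dense = lambda d_in, d_out: rng.normal(0, 0.02, (d_in, d_out))

# Disentangler: z -> (identity factor f_i, style factor f_s).
W_d = dense(d_z, 2 * d_f)
def disentangle(z):
    h = leaky_relu(z @ W_d)
    return h[:, :d_f], h[:, d_f:]       # split into the two factor streams

# Entangler: concatenate the factor streams and map back to z_hat.
W_e = dense(2 * d_f, d_z)
def entangle(f_i, f_s):
    return np.concatenate([f_i, f_s], axis=1) @ W_e

z = rng.normal(size=(8, d_z))           # a batch of latent representations
f_i, f_s = disentangle(z)
z_hat = entangle(f_i, f_s)
assert z_hat.shape == z.shape           # reconstruction target of Equation (3)
```

Training would minimize the reconstruction error between z and z_hat while the Factorizer shapes the two streams.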
3.2 Factorizer
The Factorizer uses an information-theoretic measure to statistically decompose the representation into the factors f_i and f_s. The general idea is to minimize the mutual information between all factors while instilling relevant information into them.
Statisticians Network
The first component of the Factorizer, the statisticians network S_{\Phi}, estimates the mutual information between factors. Our goal is to minimize the mutual information between f_i and f_s so that they are maximally independent of each other. We follow [22] (i.e., Equation (2)) to estimate the mutual information between the factors f_i and f_s:

\hat{I}_{\Phi}(f_i; f_s) = \sup_{\Phi} \mathbb{E}_{\mathbb{P}_{f_i f_s}}\left[S_{\Phi}\right] - \log\left(\mathbb{E}_{\mathbb{P}_{f_i} \otimes \mathbb{P}_{f_s}}\left[e^{S_{\Phi}}\right]\right) \qquad (4)

where S_{\Phi} is the statisticians network, \mathbb{P}_{f_i f_s} is the joint distribution of the identity and style factors, and \mathbb{P}_{f_i} \otimes \mathbb{P}_{f_s} is the product of their marginal distributions. We approximate samples from the marginal distribution by taking f_i from the joint distribution and f_s from the joint distribution shuffled along the batch axis.
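The batch-shuffle trick for drawing marginal samples, together with the Donsker-Varadhan estimate of Equation (4), can be sketched as follows. A fixed random function stands in for the trained statisticians network; any measurable function yields a valid, if loose, lower bound on the mutual information.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 512, 8

# Samples from the joint distribution of (f_i, f_s).
f_i = rng.normal(size=(n, d))
f_s = f_i + 0.1 * rng.normal(size=(n, d))   # correlated factors

# Marginal samples: keep f_i, shuffle f_s along the batch axis.
f_s_shuffled = f_s[rng.permutation(n)]

# A fixed stand-in for the statisticians network S(f_i, f_s); a real one
# is trained (via the sup in Equation (4)) to tighten the bound.
w = rng.normal(size=(2 * d,))
S = lambda a, b: np.tanh(np.concatenate([a, b], axis=1) @ w)

# Donsker-Varadhan estimate: E_joint[S] - log(E_marginal[exp(S)]).
mi_lower_bound = S(f_i, f_s).mean() - np.log(np.exp(S(f_i, f_s_shuffled)).mean())
assert np.isfinite(mi_lower_bound)
```

Shuffling f_s along the batch axis leaves each marginal unchanged while destroying the pairing with f_i, which is exactly what sampling from the product of marginals requires.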
The latent representation is now factorized into independent factors. However, it is not yet disentangled, since its factors do not correspond to any features of the observed data. The next component of the Factorizer is introduced to align a factor with meaningful information.
Alignment Network
The alignment network aligns each factor to the desired features of the observed data. Specifically, it is a classifier that implicitly guides a factor to carry the desired information. To train it, we exploit an ‘episodic learning’ scheme commonly used in meta-learning and few-shot learning [32, 33]. Episodic learning refers to a human’s ability to learn from the temporal contexts of experiences, or episodes [34]. Episodic learning in machine learning tries to replicate this phenomenon by learning from episodes of data-label pairs.
The general idea behind episodic learning is similar to k-NN in that the objective is to predict which support sample within an episode is most similar to the query sample. This contrasts with traditional classification, since the objective of episodic learning is not to classify but to measure the distance between two samples. It is therefore naturally able to handle samples and classes unseen during training. Thus, one advantage of the episodic learning scheme is that it generalizes the training problem to match the test environment. In a similar manner, the main motivation for using an episodic learning scheme in our work is to align identity factors with similar identity factors. In doing so, the identity factors should contain information about their identity as well as the relationships among other identities.
Here, we formally define the settings of episodic learning similarly to [32]. First, we define an episode E as the distribution over all possible label sets L, where a label set L contains N randomly chosen unique classes. Then, we define S = \{(x_k, y_k)\} as the support set of data-label pairs, and (\hat{x}, \hat{y}) as a single query data-label pair. The objective of episodic learning is to match the query data-label pair with the support data-label pair having the same label. Thus, we formulate the objective function of episodic learning as follows:

\mathcal{L}_{A} = \mathbb{E}_{L \sim E}\left[\mathbb{E}_{S \sim L,\, (\hat{x}, \hat{y}) \sim L}\left[\mathrm{CE}\left(A(\hat{x}, S),\, \hat{y}\right)\right]\right] \qquad (5)

where \mathrm{CE} is the cross-entropy objective function between the predictions A(\hat{x}, S) of the alignment network and the ground truths \hat{y}.
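A single-query sketch of this objective, using a softmax over negative Euclidean distances between the query's identity factor and the support identity factors; the specific distance metric is a design choice not fixed by the text above.

```python
import numpy as np

rng = np.random.default_rng(2)
n_way, d = 5, 64

# Identity factors of the support samples (one per class) and of the query.
support = rng.normal(size=(n_way, d))
true_class = 3
query = support[true_class] + 0.01 * rng.normal(size=d)  # near its class

# Predict by softmax over negative distances to the support factors,
# as in metric-based episodic learning.
neg_dist = -np.linalg.norm(support - query, axis=1)
probs = np.exp(neg_dist) / np.exp(neg_dist).sum()

# Cross-entropy term of Equation (5) for this single query.
loss = -np.log(probs[true_class])
assert probs.argmax() == true_class
```

Because the prediction is a distance comparison against the support set rather than a fixed classifier head, the same procedure applies unchanged to classes never seen during training.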
3.3 Learning
Here, we define the overall objective function for FDEN:

\mathcal{L}_{\mathrm{FDEN}} = \mathcal{L}_{R} + \mathcal{L}_{A} - \lambda \hat{I}_{\Phi}(f_i; f_s) \qquad (6)

where \lambda is a weight constant, and \hat{I}_{\Phi} enters with a negative sign so that Equation (4) is maximized, owing to its supremum term.
Gradient Reversal Layer
Note that \hat{I}_{\Phi} needs to be maximized to successfully estimate the mutual information, while our goal is to minimize the dependency between factors. Thus, we add a Gradient Reversal Layer (GRL) [35] before the first layer of the statisticians network. In essence, the GRL multiplies the gradients by a negative constant during backpropagation. With the GRL in place, the statisticians network maximizes \hat{I}_{\Phi} to estimate the mutual information, while the rest of the network is guided toward minimizing it.
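The GRL's behavior can be shown in isolation as follows (a minimal sketch outside any autodiff framework; real implementations register this as a custom backward operation):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; flips (and scales) gradients backward."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                      # activations pass through unchanged

    def backward(self, grad):
        return -self.lam * grad       # reverse the gradient direction

grl = GradientReversal(lam=1.0)
x = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
assert np.allclose(grl.forward(x), x)     # forward: identity
assert np.allclose(grl.backward(g), -g)   # backward: sign-flipped
```

Placed before the statisticians network, the layers upstream of the GRL thus receive the negated gradient and descend toward lower mutual information while the statisticians network itself ascends.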
Adaptive Gradient Clipping
Since \hat{I}_{\Phi} is unbounded, its gradients can overwhelm the gradients of the other objective functions if left unchecked. To mitigate this, we apply adaptive gradient clipping [22]:

\hat{g}_{mi} = \min\left(1, \frac{\lVert g_{A} \rVert}{\lVert g_{mi} \rVert}\right) g_{mi} \qquad (7)

where \hat{g}_{mi} is the adapted gradient, g_{mi} is the gradient of \hat{I}_{\Phi}, and g_{A} is the gradient of \mathcal{L}_{A} (positive due to the GRL). g_{mi} is the gradient over the Disentangler, since \hat{I}_{\Phi} only backpropagates through the Disentangler and the statisticians network, and we apply adaptive gradient clipping only when there is a possibility of \hat{I}_{\Phi} overwhelming the other objective functions.
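A sketch of the clipping rule, assuming the mutual-information gradient is rescaled so that its norm never exceeds that of the other objectives' gradient (the exact form in [22] may differ in detail):

```python
import numpy as np

def adapt_mi_gradient(g_mi, g_other):
    """Rescale the MI gradient so its norm never exceeds the norm of the
    other objectives' gradient; a sketch of the adaptive clipping rule."""
    n_mi, n_other = np.linalg.norm(g_mi), np.linalg.norm(g_other)
    if n_mi <= n_other:               # no risk of overwhelming: leave as-is
        return g_mi
    return g_mi * (n_other / n_mi)    # rescale down to the smaller norm

g_mi = np.array([3.0, 4.0])           # norm 5: too large
g_other = np.array([1.0, 0.0])        # norm 1
g_hat = adapt_mi_gradient(g_mi, g_other)
assert np.isclose(np.linalg.norm(g_hat), 1.0)  # clipped to the other norm
```

Note that the direction of the MI gradient is preserved; only its magnitude is bounded, so the disentanglement signal is tempered rather than discarded.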
4 Related Works
4.1 Learning the Representation
Learning a representation can be generalized into three architectures: top-down, bottom-up, and invertible. A top-down architecture refers to models that learn a representation given observed data [14]; a good example is a deep neural network, where the latent representation is inferred from the observed data. A bottom-up architecture learns to generate data given a representation. The GAN [36] is a typical example of a bottom-up architecture: GANs take a representation (e.g., a random vector) and synthesize data from it. As a result, assuming a GAN is well trained, representations outside the spectrum of the observed data can be learned up to a certain precision [37]. An invertible architecture jointly learns to model the input space and the latent space. For example, invertible networks can synthesize data from a random vector and generate a representation given observed data [28], i.e., by inversion of the generator. These approaches can be interpreted as efforts to define a representation by first learning it (the bottom-up approach) and then mapping the observed and synthetic data onto it (the top-down approach). Our work extends these efforts: given a representation, we aim to project it onto the manifold of each factor so as to define the representation in terms of factors.
4.2 Implicit Disentanglement
Implicit disentanglement refers to representation learning approaches that infer disentanglement implicitly. These approaches often guide a representation to be dependent on a feature of the observed data in order to produce a disentangled representation. For example, [38] disentangles a representation by maximizing the mutual information between the representation and a label or attribute. [39] linearly decomposes a latent representation into multiple vectors, uses a GAN to synthesize data given a decomposed vector, and then uses the discriminator to combine the vectors into a single disentangled representation. FDEN’s alignment network uses a similar approach of implicitly guiding a factor to relate to a feature (i.e., the class) of the observed data. However, our approach also explicitly disentangles a representation by factorizing it into independent factors.
4.3 Explicit Disentanglement
Explicit disentanglement refers to approaches that directly model a representation to be disentangled. One of the simplest ways is to insert information directly into the representation. For example, [40] inserts one-hot label vectors directly into the representation to simulate disentanglement. More sophisticated approaches build factorial representations: [12, 41] disentangle a representation by decomposing it into multiple factors with the help of an adversarial classifier, and [42] combines modality-specific representations to create a unified, disentangled representation. The factorial representation approach is similar to ours in that it decomposes a representation into factors. However, our work uses mutual information to factorize a latent representation, which provides a statistically grounded notion of independence between factors. Moreover, these works must be trained from scratch, while ours exploits a pretrained model so that the model can still perform its original task.
5 Experiment
In this section, we perform various tasks to evaluate the proposed method. Our goal is to show that each module of FDEN is effective in disentangling a latent representation into independent factors. Thus, we divide the evaluation into two parts. First, to evaluate the alignment network’s ability to align the identity factor with its identity, and the effectiveness of the decomposition of the latent representation (i.e., the Disentangler), we perform few-shot learning with only the inferred identity factor. Then, to evaluate the independence between factors (i.e., the Factorizer) and the effectiveness of the recombination (i.e., the Entangler), we perform image-to-image translation by mixing the identity and style factors of different images.
5.1 Data sets
We evaluate FDEN on data sets from various domains: the Omniglot (characters), MS-Celeb-1M (faces), Mini-ImageNet (natural images), and Oxford Flower (flowers) data sets.
Omniglot
The Omniglot [43] data set consists of 1,623 characters from 50 alphabets, where each character is drawn by 20 different people via Amazon’s Mechanical Turk. We partitioned the data set into 1,200 characters for training and the remaining 423 for testing. Following [44], we augmented the data set with rotations of 90, 180, and 270 degrees, where each rotation is treated as a new character (i.e., 4,800 characters for the training data set and 1,692 characters for the testing data set).
MS-Celeb-1M Low-shot
The MS-Celeb-1M [45] low-shot data set consists of facial images of 21,000 celebrities. This data set is partitioned into 20,000 celebrities for training and 1,000 celebrities for testing. There is an average of 58 images per celebrity in the training data set (a total of 1,155,175 images) and 5 images per celebrity in the test data set (a total of 5,000 images).
Mini-ImageNet
The Mini-ImageNet [46] data set is a subset of ImageNet consisting of 100 classes with 600 images per class.
Oxford Flower
The Oxford Flower [47] data set consists of images of 102 flower species, with 40 to 258 images per species. We split the data set by randomly selecting 82 flower species for training and 20 flower species for testing.
5.2 Implementation Details
Invertible Network
For the invertible network, we utilize a pretrained Adversarially Learned Inference (ALI) model [28]. ALI is a GAN that jointly learns a generation network and an inference network. We chose ALI for its simplicity of implementation and its ability to create powerful latent representations. For the MS-Celeb-1M, Mini-ImageNet, and Oxford Flower data sets, we replicated the model designed for the CelebA data set; for the Omniglot data set, we replicated the model designed for the SVHN data set.
Factorized DisentanglerEntangler Network
FDEN consists of the Disentangler, the statisticians network, the alignment network, and the Entangler, each a multi-layer perceptron. For the sake of simplicity, we kept each module to 3 or 4 fully connected layers with dropout, batch normalization, and leaky ReLU activations.
For details of hyperparameters, readers are referred to Appendix B of supplementary material.
5.3 Experimental Setup
Fewshot Learning
The alignment network exploits an episodic learning scheme that is well suited to the few-shot learning environment. Each episode consists of N randomly sampled unique classes, K support samples per class, and a query sample from one of the classes. Given the support samples, the goal of few-shot learning is to predict which of the N unique classes the query sample belongs to. In the few-shot learning literature, this setup is generally called N-way, K-shot learning. We evaluate our results over 1,000 episodes with unseen samples for all experiments.
ImagetoImage Translation
Given the representations of two samples, z^1 and z^2, we perform image-to-image translation by mixing their identity factors, f_i^1 and f_i^2, with the style factors of the other image, f_s^2 and f_s^1. Since the Entangler is nonlinear, we can also partially mix the factors linearly, e.g., f_i = \alpha f_i^1 + (1 - \alpha) f_i^2. Without modifying the weights of the invertible network, we then reconstruct a translated image from the entangled representation.
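The partial mixing of identity factors can be sketched as a simple linear interpolation (dimensions are hypothetical; in FDEN the mixed identity factor would then be passed through the Entangler together with a chosen style factor):

```python
import numpy as np

rng = np.random.default_rng(3)
d_f = 64

# Identity factors of two images, and a style factor to pair them with.
f_i_1, f_i_2 = rng.normal(size=(2, d_f))
f_s = rng.normal(size=d_f)

def mix_identity(alpha):
    """Linear interpolation between the two identity factors."""
    return alpha * f_i_1 + (1.0 - alpha) * f_i_2

# Endpoints recover the original identity factors exactly.
assert np.allclose(mix_identity(1.0), f_i_1)
assert np.allclose(mix_identity(0.0), f_i_2)

# Intermediate alphas give the partially mixed factors used for translation;
# pairing (mixed identity, f_s) is what the Entangler would recombine.
f_mid = mix_identity(0.5)
assert f_mid.shape == (d_f,)
```

Sweeping alpha from 0 to 1 produces the interpolation sequences shown qualitatively in Figure 4.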
5.4 Results
Table 1: Few-shot classification accuracies on Omniglot and Mini-ImageNet.

                          Omniglot                  Mini-ImageNet
                          5-way           20-way    5-way
                          1-shot  5-shot  1-shot    1-shot  5-shot
Matching Net. [32]        98.1%   98.9%   93.8%     43.5%   55.3%
Prototypical Net. [44]    98.8%   99.7%   96.0%     49.4%   68.2%
FDEN (Ours)               88.3%   95.4%   82.6%     43.9%   48.6%
Fewshot Learning
We evaluate FDEN on few-shot learning to show that the decomposed identity factors are successfully aligned to the identity information of the observed data. We validate our results on two domains of data with varying complexity, Omniglot and Mini-ImageNet, and compare with the state-of-the-art methods Matching Networks [32] and Prototypical Networks [44] (Table 1).
One property of FDEN is that it learns to exploit only the latent space. In other words, FDEN has no information about the input data except the model’s representation of it. Thus, few-shot learning lets us evaluate the alignment of identity factors with identity information given only a representation. Although our results are lower than those of the state-of-the-art methods, considering these properties of FDEN, we find the results reasonable.
To further analyze our results, we have drawn t-SNE scatter plots of the factors from the 5-way 1-shot Omniglot model (Figure 3; a larger version is available in Figure 9 of the supplementary material). The t-SNE plot of the identity factors shows apparent clusters of samples of the same class, while the style factors show no visible clusters. This observation suggests that the identity factors are indeed aligned to identity information. A style factor, on the other hand, consists of all information independent of the identity factor and is not aligned to any single type of information, hence the entanglement in the t-SNE plot. Thus, to evaluate the style factors, we examine the results of image-to-image translation in the next paragraph.
ImagetoImage Translation
For image-to-image translation, we evaluate our results on the Omniglot, MS-Celeb-1M, and Oxford Flower data sets (Figure 4). The goal of this experiment is to show the effectiveness of FDEN’s ability to decompose and reconstruct a latent representation.
Our results show that identity-relevant features are clearly aligned with the identity factors. For example, the first MS-Celeb-1M images in Figure 4 show a clear interpolation between a woman and a man. Since we factorize a latent representation into only two factors, the style factors carry multiple features independent of the identity factor. Thus, during interpolation between factors, we see multiple features changing together, such as changes in the rotation and brightness of the face and background. Although it is hard to pinpoint what changes while interpolating factors of the Omniglot and Oxford Flower data sets, each step of the interpolation results in somewhat interpretable changes. These observations show that FDEN can indeed decompose a latent representation into independent factors.
Also, comparing ALI’s reconstructed images (1st row, 2nd column; 6th row, 3rd column) with FDEN’s reconstructed images (1st row, 3rd column; 3rd row, 5th column), we observe that they are very similar. This shows that FDEN can indeed be plugged into an invertible network without reducing its performance (additional results are available in Appendix A of the supplementary material).
6 Conclusion
In this work, we proposed the Factorized Disentangler-Entangler Network (FDEN), which learns to decompose a latent representation into independent factors. Our work opens the possibility of extending state-of-the-art models to solve different tasks while maintaining the performance of their original tasks. One property of our work is that it exploits only the latent space, not the input space. A possible direction for future work is to jointly incorporate the latent and input spaces to disentangle a representation.
Acknowledgements
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017001779, A machine learning and statistical inference framework for explainable artificial intelligence) and Kakao Corp. (Development of Algorithms for Deep LearningBased One/Fewshot Learning).
References
 [1] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6541–6549.
 [2] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud, “Isolating sources of disentanglement in variational autoencoders,” in Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 2610–2620.
 [3] H. Kim and A. Mnih, “Disentangling by factorising,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 4153–4171.
 [4] J. Schmidhuber, “Learning factorial codes by predictability minimization,” Neural Computation, vol. 4, no. 6, pp. 863–879, 1992.
 [5] K. Ridgeway and M. C. Mozer, “Learning deep disentangled embeddings with the f-statistic loss,” in Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 185–194.
 [6] T. Scott, K. Ridgeway, and M. C. Mozer, “Adapted deep embeddings: A synthesis of methods for k-shot inductive transfer learning,” in Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 76–85.
 [7] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “DARLA: Improving zero-shot transfer in reinforcement learning,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 1480–1490.
 [8] L. Chen, H. Zhang, J. Xiao, W. Liu, and S.-F. Chang, “Zero-shot visual recognition using semantics-preserving adversarial embedding networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1043–1052.
 [9] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, “Taskonomy: Disentangling task transfer learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3712–3722.
 [10] Y.-C. Liu, Y.-Y. Yeh, T.-C. Fu, S.-D. Wang, W.-C. Chiu, and Y.-C. Frank Wang, “Detach and adapt: Learning cross-domain disentangled deep representation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8867–8876.
 [11] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio, “Image-to-image translation for cross-domain disentanglement,” in Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 1287–1298.
 [12] A. H. Liu, Y.-C. Liu, Y.-Y. Yeh, and Y.-C. F. Wang, “A unified feature disentangler for multi-domain image translation and manipulation,” in Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 2590–2599.
 [13] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
 [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the Advances in neural information processing systems, 2012, pp. 1097–1105.
 [15] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Proceedings of the Advances in neural information processing systems, 2016, pp. 2234–2242.
 [16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Proceedings of the Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
 [17] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inceptionv4, inceptionresnet and the impact of residual connections on learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 4278–4284.
 [18] B. Zhao, X. Sun, Y. Fu, Y. Yao, and Y. Wang, “MSplit LBI: Realizing feature selection and dense estimation simultaneously in few-shot and zero-shot learning,” Proceedings of the International Conference on Machine Learning, 2018.
 [19] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” in Advances in Neural Information Processing Systems, 2018, pp. 10 019–10 029.
 [20] J. B. Kinney and G. S. Atwal, “Equitability, mutual information, and the maximal information coefficient,” Proceedings of the National Academy of Sciences, vol. 111, no. 9, pp. 3354–3359, 2014.
 [21] M. D. Donsker and S. S. Varadhan, “Asymptotic evaluation of certain markov process expectations for large time. iv,” Communications on Pure and Applied Mathematics, vol. 36, no. 2, pp. 183–212, 1983.
 [22] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 531–540.
 [23] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse, “The reversible residual network: Backpropagation without storing activations,” in Advances in neural information processing systems, 2017, pp. 2214–2224.
 [24] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.
 [25] J.-H. Jacobsen, A. Smeulders, and E. Oyallon, “i-RevNet: Deep invertible networks,” Proceedings of the International Conference on Learning Representations, 2018.
 [26] J. Behrmann, D. Duvenaud, and J.H. Jacobsen, “Invertible residual networks,” Proceedings of the International Conference on Machine Learning, 2018.
 [27] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for deep learning of images, labels and captions,” in Advances in neural information processing systems, 2016, pp. 2352–2360.
 [28] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, “Adversarially learned inference,” in Proceedings of the International Conference on Learning Representations, 2017.
 [29] D. Berthelot, T. Schumm, and L. Metz, “Began: Boundary equilibrium generative adversarial networks,” arXiv preprint arXiv:1703.10717, 2017.
 [30] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of machine learning research, vol. 11, no. Dec, pp. 3371–3408, 2010.
 [31] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder.” in Interspeech, 2013, pp. 436–440.
 [32] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
 [33] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel, “Meta-learning for semi-supervised few-shot classification,” Proceedings of the International Conference on Learning Representations, 2018.
 [34] A. M. Nuxoll, “Episodic learning,” Encyclopedia of the sciences of learning, pp. 1157–1159, 2012.
 [35] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
 [36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
 [37] A. Antoniou, A. Storkey, and H. Edwards, “Data augmentation generative adversarial networks,” arXiv preprint arXiv:1711.04340, 2017.
 [38] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
 [39] C. Donahue, Z. C. Lipton, A. Balsubramani, and J. McAuley, “Semantically decomposing the latent spaces of generative adversarial networks,” Proceedings of the International Conference on Learning Representations, 2018.
 [40] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier GANs,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2642–2651.
 [41] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang, “Exploring disentangled feature representation beyond face identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2080–2089.
 [42] Y.-H. H. Tsai, P. P. Liang, A. Zadeh, L.-P. Morency, and R. Salakhutdinov, “Learning factorized multimodal representations,” Proceedings of the International Conference on Learning Representations, 2019.
 [43] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
 [44] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
 [45] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “MS-Celeb-1M: A dataset and benchmark for large-scale face recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 87–102.
 [46] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in Proceedings of the International Conference on Learning Representations, 2016.
 [47] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008, pp. 722–729.
Supplementary Material
Appendix A: Additional Results
Images in the first and last columns are the input images we are interested in translating. Images in the second and sixth columns are ALI’s original reconstructions. Images in between are reconstructions obtained by interpolating the identity and style factors of the two input images.
Appendix B: Hyperparameters
B.1 FDEN
Operation | Feature Maps | Batch Norm | Dropout | Activation

input
Fully Connected | 512 | | 0.2 | Leaky ReLU
Fully Connected | 512 | | 0.2 | Leaky ReLU
Fully Connected | 512 | | 0.2 | Leaky ReLU
Fully Connected | | | 0.2 | Linear
input
Fully Connected | 512 | | 0.2 | Leaky ReLU
Fully Connected | 512 | | 0.2 | Leaky ReLU
Fully Connected | | | 0.2 | Linear
input
Fully Connected | 512 | | 0.2 | Leaky ReLU
Fully Connected | 512 | | 0.2 | Leaky ReLU
Fully Connected | | | 0.2 | Linear
input
Concatenate and along the channel axis
Fully Connected | 1024 | | 0.2 | Leaky ReLU
Fully Connected | 256 | | 0.2 | Leaky ReLU
Fully Connected | 64 | | 0.2 | Leaky ReLU
Fully Connected | 1 | | 0.2 | Linear
input
Fully Connected | 256 | | 0.2 | Leaky ReLU
Fully Connected | 256 | | 0.2 | Leaky ReLU
Fully Connected | | | 0.2 | Linear
input
Fully Connected | 256 | | 0.2 | Leaky ReLU
Fully Connected | 256 | | 0.2 | Leaky ReLU
Fully Connected | | | 0.2 | Linear
input
Concatenate and along the channel axis
Fully Connected | 512 | | 0.2 | Leaky ReLU
Fully Connected | 512 | | 0.2 | Leaky ReLU
Fully Connected | 512 | | 0.2 | Leaky ReLU
Fully Connected | | | 0.2 | Linear

Optimizer | Adam
Batch size | 16
Episodes per epoch | 10,000
Epochs | 1,000
Leaky ReLU slope | 0.01
Weight initialization | Truncated Normal ()
Loss weights |
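The layer recipe above (stacks of fully connected layers, each followed by Leaky ReLU with slope 0.01 and dropout 0.2, with a final linear layer) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `mlp_forward`, the truncated-normal approximation via clipping, and the example output width of 64 are all assumptions, since the original table's output dimensions did not survive extraction.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Leaky ReLU with the slope of 0.01 listed in the hyperparameter table
    return np.where(x >= 0, x, slope * x)

def mlp_forward(x, widths, rng, dropout=0.2, train=False):
    """Forward pass through a stack of fully connected layers.

    Each hidden layer is Linear -> Leaky ReLU -> Dropout(0.2); the final
    layer is linear, matching the per-layer recipe in the table above.
    Weights are drawn from a clipped normal as a stand-in for the
    truncated-normal initializer (assumed scale 0.02).
    """
    h = x
    dims = [x.shape[-1]] + list(widths[:-1])
    for i, (d_in, d_out) in enumerate(zip(dims, widths)):
        W = np.clip(rng.normal(0.0, 0.02, size=(d_in, d_out)), -0.04, 0.04)
        b = np.zeros(d_out)
        h = h @ W + b
        if i < len(widths) - 1:  # hidden layers only
            h = leaky_relu(h)
            if train:
                # inverted dropout: scale kept units at train time
                mask = rng.random(h.shape) >= dropout
                h = h * mask / (1.0 - dropout)
    return h

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 512))                   # batch of 16 latent codes
out = mlp_forward(z, [512, 512, 512, 64], rng)   # one illustrative branch
print(out.shape)  # (16, 64)
```

Dropout is active only when `train=True`, mirroring the usual train/eval distinction; at evaluation time the forward pass is deterministic given the weights.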
B.2 Adversarially Learned Inference
We chose ALI [28] as the invertible network of our framework and used exactly the same hyperparameters presented in Appendix A of [28]. For training on the Omniglot data set, we used the model designed for unsupervised learning of SVHN. For training on the Mini-ImageNet, MS-Celeb-1M, and Oxford Flowers data sets, we used the model designed for unsupervised learning of CelebA. Although [28] designed a model for a variant of ImageNet (Tiny ImageNet), our preliminary results showed that the CelebA model synthesized better images on the Mini-ImageNet data set.
For training on the Mini-ImageNet, MS-Celeb-1M, and Oxford Flowers data sets, we included a reconstruction loss between the input image and its reconstructed image. This results in steadier convergence and better reconstructions.
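A reconstruction term of this kind is a pixel-wise penalty between the input and its reconstruction through the encoder–decoder pair. The sketch below is illustrative only: the text does not specify the norm, so the L1 default here is an assumption, and `reconstruction_loss` is a hypothetical helper name.

```python
import numpy as np

def reconstruction_loss(x, x_rec, kind="l1"):
    # Pixel-wise penalty between the input image x and its reconstruction
    # x_rec (e.g. decode(encode(x)) in an ALI-style model). The paper does
    # not specify the norm; "l1" is an assumed default.
    diff = x - x_rec
    if kind == "l1":
        return float(np.mean(np.abs(diff)))
    return float(np.mean(diff ** 2))  # "l2"

# Toy example: images in [0, 1], reconstruction uniformly off by 0.5
x = np.zeros((4, 32, 32, 3))
x_rec = np.full_like(x, 0.5)
print(reconstruction_loss(x, x_rec))  # 0.5
```

In practice this term would be added to the adversarial objective with a weighting coefficient, which trades off reconstruction fidelity against sample quality.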