Plug-in Factorization for Latent Representation Disentanglement

Jee Seok Yoon*    Wonjun Ko    Heung-Il Suk†
Department of Brain and Cognitive Engineering
Korea University
Seoul, South Korea
{wltjr1007, wjko, hisuk}@korea.ac.kr

*This work was done during an internship at Kakao Corp.  †Corresponding Author
Abstract

In this work, we propose a Factorized Disentangler-Entangler Network (FDEN) that learns to decompose a latent representation into two mutually independent factors, namely, identity and style. Given a latent representation, the proposed framework draws a set of interpretable factors aligned to the identity of the observed data and learns to maximize the independence between these factors. Our work introduces a plug-in method to disentangle the latent representations of already-trained deep models without affecting the models themselves. In doing so, it opens the possibility of extending state-of-the-art models to solve different tasks while maintaining the performance of their original tasks. Thus, FDEN is naturally applicable to jointly performing multiple tasks, such as few-shot learning and image-to-image translation, in a single framework. We show the effectiveness of our work in disentangling a latent representation in two parts. First, to evaluate the alignment of a factor to an identity, we perform few-shot learning using only the aligned factor. Then, to evaluate the effectiveness of the decomposition of the latent representation and to show that the plug-in method does not degrade the deep model's performance, we perform image-to-image style transfer by mixing factors of different images. These evaluations show, qualitatively and quantitatively, that our proposed framework can indeed disentangle a latent representation.

 

Preprint. Under review.

1 Introduction

Disentangled representation learning has been of great interest among researchers in the field of machine learning. A representation is generally considered disentangled when it captures interpretable semantic information and factors of variation underlying the problem structure [1]. Thus, the concept of disentangled representation is closely related to that of factorial representation [2, 3, 4], which holds that each unit of a disentangled representation should correspond to an independent factor of the observed data. For example, given a facial image, each unit of a disentangled representation should correspond to distinctive factors, like the color of the pupils, or non-distinctive factors, like the color of sunglasses. Due to these properties, researchers have found disentangled representations useful in various tasks such as few-shot learning [5, 6, 7, 8], domain adaptation [9, 10], and image translation [11, 12, 2].

While having a disentangled representation is desirable, this does not imply that an (entangled) latent representation is less powerful or lacks interpretability. In fact, some existing models [13, 14] that do not consider disentanglement have maintained their state-of-the-art performance over the years. This is mostly due to the models being highly tuned [15, 16] with large amounts of data [17]. Numerous works have utilized these highly tuned pre-trained models as initializers for their proposed networks [18, 19]. However, these works often modify the architecture or weights of the pre-trained model such that it can no longer solve the problems it was originally designed for. Hence, the motivation of our work is to develop a framework that disentangles a representation of a pre-trained model without losing the ability to perform its original task (Figure 1).

Thus, we propose a Factorized Disentangler-Entangler Network (FDEN) that learns to decompose a latent representation into independent factors. Specifically, given a latent representation, the proposed network draws a set of interpretable factors and learns to information-theoretically maximize the independence between these factors. In addition, it can entangle the independent factors back into the original representation, making it an autoencoder-like architecture. The motivation behind the autoencoder-like architecture is to utilize the latent representation of a pre-trained model rather than to develop and train a disentangled representation from scratch. In doing so, we can focus our efforts solely on disentanglement while benefiting from the outstanding performance already delivered by the pre-trained model. FDEN is divided into three parts (Figure 2): Disentangler, Factorizer, and Entangler. First, the Disentangler takes a latent representation from a pre-trained model and decomposes it into multiple streams. Then, the Factorizer uses an information-theoretic measure to factorize each stream into independent factors. Finally, the Entangler takes the factors and reconstructs the original representation so that the pre-trained model can reuse it at will.

To evaluate the proposed framework, we perform qualitative and quantitative examinations of the disentangled representation. First, we measure the effectiveness of the non-linear decomposition (i.e., the Disentangler) by performing few-shot learning with only the single factor relevant to the identity of the observed data. Then, we examine the degree of independence between factors (i.e., the Factorizer) and the effectiveness of the non-linear combination (i.e., the Entangler) by conducting image-to-image translation with factors of different images. The main contributions of our work are three-fold (source code available at https://github.com/wltjr1007/FDEN):

  • We propose a novel framework of Factorized Disentangler-Entangler Network (FDEN) that decomposes a latent representation into independent factors.

  • To the best of our knowledge, we are the first to propose a plug-in method to disentangle the latent representations of already-trained deep models. Our work opens the possibility of extending state-of-the-art models to solve different tasks without modifying their weights, so that they maintain the performance of their original tasks.

  • Our method is naturally applicable to few-shot learning and/or image-to-image translation in a single framework.

Figure 1: FDEN in an image-to-image translation scenario. First, FDEN takes as input a latent representation z and decomposes it into an identity factor f_I and a style factor f_S. Then, a latent representation is reconstructed by mixing the factors of different representations. Note that the latent representation z is created by a pre-trained invertible network whose weights are fixed while training FDEN. Far left: Input images for the invertible networks. Far right: Reconstructed images of latent representations with interpolated factors.

2 Background

2.1 Mutual Information

The proposed framework utilizes mutual information to information-theoretically maximize the independence between variables. Mutual information is a measure of the dependency between random variables. The mutual information between random variables X and Z can be formulated as the Kullback-Leibler (KL) divergence between the joint probability distribution, \mathbb{P}_{XZ}, and the product of the marginal probability distributions, \mathbb{P}_X \otimes \mathbb{P}_Z:

    I(X; Z) = D_{KL}(\mathbb{P}_{XZ} \,\|\, \mathbb{P}_X \otimes \mathbb{P}_Z)    (1)

Mutual information is known to measure the true dependence [20] since it captures both linear and non-linear statistical dependencies between variables. Thus, we chose mutual information as the objective function for non-linearly decomposing a latent representation. Subsection 3.2 discusses how FDEN utilizes mutual information in detail.
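To make Equation (1) concrete, the following toy computation (ours, not from the paper) evaluates I(X; Z) for a pair of discrete variables directly from a joint probability table; the table values are illustrative.

    import numpy as np

    def mutual_information(p_xz):
        """I(X;Z) = KL(P_XZ || P_X x P_Z) for a discrete joint table p_xz."""
        p_x = p_xz.sum(axis=1, keepdims=True)  # marginal P_X
        p_z = p_xz.sum(axis=0, keepdims=True)  # marginal P_Z
        mask = p_xz > 0                        # skip log(0) terms
        return float(np.sum(p_xz[mask] * np.log(p_xz[mask] / (p_x @ p_z)[mask])))

    # Perfectly dependent variables give maximal MI; independent ones give zero.
    print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # ~0.693 (= log 2)
    print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0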

The Donsker-Varadhan representation of KL-divergence

Since mutual information is intractable for continuous variables, we use a dual representation [21] for the KL-divergence computation:

    D_{KL}(\mathbb{P} \,\|\, \mathbb{Q}) = \sup_{T \in \mathcal{T}} \mathbb{E}_{\mathbb{P}}[T] - \log(\mathbb{E}_{\mathbb{Q}}[e^{T}])    (2)

where \mathcal{T} is a family of functions parameterized by a neural network. For the full derivation of Equation (2), readers are referred to [22].
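The DV bound in Equation (2) is typically estimated with a small statistics network in the spirit of MINE [22]. Below is a minimal PyTorch sketch; the architecture, the batch-shuffle approximation of the product of marginals, and all names are illustrative assumptions rather than the paper's released code.

    import torch
    import torch.nn as nn

    class StatisticsNetwork(nn.Module):
        def __init__(self, dim_x, dim_z, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim_x + dim_z, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x, z):
            return self.net(torch.cat([x, z], dim=-1))

    def dv_lower_bound(T, x, z):
        """E_P[T] - log E_Q[e^T], with Q approximated by shuffling z in the batch."""
        z_shuffled = z[torch.randperm(z.size(0))]      # samples from P_X x P_Z
        joint = T(x, z).mean()                         # expectation over the joint
        n = torch.tensor(float(x.size(0)))
        log_mean_exp = torch.logsumexp(T(x, z_shuffled), dim=0) - torch.log(n)
        return joint - log_mean_exp.squeeze()

Maximizing this bound over the parameters of T tightens the estimate of the KL divergence, and hence of the mutual information.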

2.2 Invertible Networks

The motivation of our work is to disentangle the latent space of already-trained deep networks without modifying their architecture or weights. Thus, our work focuses on deep networks that can make extensive use of a latent representation; otherwise it would defeat the purpose of learning to disentangle a representation. Here, we define the deep networks that can benefit from FDEN as invertible networks [25, 26] (note that several works use different terms for "invertible network", such as reversible networks [23] or bi-directional networks [24], but they essentially refer to networks with similar architectures).

We define invertible networks as neural networks that are capable of inverting the process of inference. Specifically, an invertible network can create a latent representation from the input space and reconstruct an input from a latent representation. In contrast to most neural networks, which only learn to model the latent space, an invertible network jointly learns to model the input space and the latent space. Thus, invertible networks are capable of utilizing the disentangled representations created by FDEN. A typical example of an invertible network is the variational autoencoder [27], which infers the data distribution given observed data and inverts the process to reconstruct the observed data given the data distribution. For this work, we focus on deep invertible networks such as bi-directional GANs [28, 29], deep autoencoders [30, 31], and invertible deep neural networks [25, 26].

Figure 2: Overview of the Factorized Disentangler-Entangler Network (FDEN). FDEN is divided into three modules: Disentangler, Factorizer, and Entangler. Our model is an autoencoder-like architecture that takes as input a representation z and reconstructs its original representation ẑ. (a) First, the Disentangler takes a latent representation z from an invertible network and decomposes it into two factors, f_I and f_S. (b) Then, the Factorizer uses an information-theoretic measure to maximize the independence between the factors. (c) Finally, the Entangler takes the factors and reconstructs the original representation ẑ. Note that the latent representation z is created by a pre-trained invertible network whose weights are fixed while training FDEN.

3 Factorized Disentangler-Entangler Network

The proposed Factorized Disentangler-Entangler Network (FDEN) is a novel framework that can be plugged into an invertible network to disentangle its latent representation without modifying its weights. The goal of FDEN is to disentangle a representation into interpretable and independent factors and entangle them back into the original representation. To achieve this, FDEN is divided into three modules (Figure 2): Disentangler, Factorizer, and Entangler. Since FDEN does not modify the weights of the invertible network, it not only allows disentanglement of the representation but also lets the invertible network perform its tasks as originally designed.

3.1 Disentangler-Entangler

The Disentangler-Entangler network is an autoencoder-like architecture that takes as input a representation z and reconstructs its original representation ẑ. The Disentangler takes the representation z and decodes it. The decoded representation is then decomposed into an identity factor f_I and a style factor f_S. The Entangler feeds the factors f_I and f_S into their corresponding streams. These streams are then concatenated along the channel axis and fed into the encoder to reconstruct the original representation ẑ. Since the goal of the Disentangler-Entangler network is to reconstruct the original representation, we introduce the reconstruction objective function:

    \mathcal{L}_{recon} = \| z - \hat{z} \|_2^2    (3)

At this point, the representation z is merely decomposed and reassembled into ẑ. It is not disentangled, and the factors do not carry any distinguishable information. Thus, in the next subsection we introduce a module called the Factorizer that imbues these factors with information.
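A minimal sketch of the Disentangler-Entangler pair, assuming a flat latent vector z; the layer widths, factor sizes, and the l2 form of the reconstruction loss are illustrative choices here, not the exact configuration of Appendix B.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Disentangler(nn.Module):
        def __init__(self, dim_z, dim_f):
            super().__init__()
            self.decode = nn.Sequential(nn.Linear(dim_z, 512), nn.LeakyReLU(0.01))
            self.to_identity = nn.Linear(512, dim_f)  # identity stream -> f_I
            self.to_style = nn.Linear(512, dim_f)     # style stream -> f_S

        def forward(self, z):
            h = self.decode(z)
            return self.to_identity(h), self.to_style(h)

    class Entangler(nn.Module):
        def __init__(self, dim_z, dim_f):
            super().__init__()
            self.encode = nn.Sequential(
                nn.Linear(2 * dim_f, 512), nn.LeakyReLU(0.01),
                nn.Linear(512, dim_z),
            )

        def forward(self, f_i, f_s):
            # Concatenate the factor streams, then reconstruct z_hat.
            return self.encode(torch.cat([f_i, f_s], dim=-1))

    z = torch.randn(16, 256)
    D, E = Disentangler(256, 128), Entangler(256, 128)
    f_i, f_s = D(z)
    loss_recon = F.mse_loss(E(f_i, f_s), z)  # reconstruction objective of Eq. (3)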

3.2 Factorizer

The Factorizer uses an information-theoretic measure to produce statistically decomposed factors f_I and f_S. The general idea is to minimize the mutual information between the factors while giving each of them relevant information.

Statisticians Network

The first component of the Factorizer, the statisticians network T_θ, estimates the mutual information between factors. Our goal is to minimize the mutual information between f_I and f_S so that they are maximally independent of each other. We follow [22] (i.e., Equation (2)) to estimate the mutual information between the factors f_I and f_S:

    \widehat{I}(f_I; f_S) = \sup_{\theta} \mathbb{E}_{\mathbb{P}_{f_I f_S}}[T_\theta] - \log(\mathbb{E}_{\mathbb{P}_{f_I} \otimes \mathbb{P}_{f_S}}[e^{T_\theta}])    (4)

where T_θ is the statisticians network, \mathbb{P}_{f_I f_S} is the joint distribution of the identity and style factors, and \mathbb{P}_{f_I} \otimes \mathbb{P}_{f_S} is the product of their marginal distributions. We approximate the marginal distribution by taking f_I from the joint distribution and f_S from the joint distribution shuffled along the batch axis, as illustrated below.
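The batch-shuffle trick takes only a few lines; the factor dimensions below are arbitrary. Keeping f_I fixed and permuting f_S along the batch axis breaks the pairing between the two factors, yielding approximate samples from the product of the marginals.

    import torch

    f_i = torch.randn(16, 256)   # identity factors (batch of 16)
    f_s = torch.randn(16, 256)   # style factors, paired with f_i

    # (f_i, f_s) ~ joint distribution; (f_i, f_s_shuffled) ~ product of marginals.
    f_s_shuffled = f_s[torch.randperm(f_s.size(0))]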

The latent representation is now factorized into independent factors. However, it is not yet disentangled, since its factors do not correspond to any features of the observed data. The next component of the Factorizer is introduced to give a factor meaningful information.

Alignment Network

The alignment network aligns each factor to desired features of the observed data. Specifically, it is a classifier that implicitly guides a factor to carry the desired information. To train it, we exploit an 'episodic learning' scheme commonly used in meta-learning and few-shot learning [32, 33]. Episodic learning refers to a human's ability to learn from the temporal contexts of experiences, or episodes [34]. Episodic learning in machine learning tries to replicate this phenomenon by learning from episodes of data-label pairs.

The general idea behind episodic learning is similar to kNN in that the objective is to predict which support samples within an episode are most similar to the query sample. This contrasts with traditional classification, since the objective of episodic learning is not to classify but to measure the distance between two samples. It is therefore naturally able to handle samples and classes unseen during training, and one of its advantages is that it can generalize the training problem to match the test environment. In a similar manner, the main motivation behind using an episodic learning scheme in our work is to align identity factors with similar identity factors. In doing so, each identity factor should contain information about its identity as well as its relationship to other identities.

Here, we formally define the settings of episodic learning similarly to [32]. First, we define an episode through the distribution \mathcal{T} over all possible label sets L, where a label set contains batches of randomly chosen unique classes. Then, we define S as the support set of data-label pairs (x_i, y_i), and B as a batch consisting of a single data-label pair (the query). The objective of episodic learning is to match the query data-label pair with the support data-label pair of the same label. Thus, we formulate the objective function of episodic learning as follows:

    \mathcal{L}_{align} = \mathbb{E}_{L \sim \mathcal{T}}[\, \mathbb{E}_{S \sim L,\, B \sim L}[\, \ell_{CE}(\hat{y}, y) \,] \,]    (5)

where \ell_{CE} is the cross-entropy objective function between predictions ŷ and ground truths y.
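A hedged sketch of one episodic step for Equation (5): sample an N-way support set, score the query against each support embedding, and apply cross entropy over the class scores. The cosine-similarity scoring is a simplifying assumption here; the paper's alignment network is a learned classifier over identity factors.

    import torch
    import torch.nn.functional as F

    def episode_loss(support, query, target):
        # support: (N, dim) one identity factor per class; query: (dim,)
        # target: () index of the query's true class within the episode.
        logits = F.cosine_similarity(query.unsqueeze(0), support, dim=-1)
        return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

    support = torch.randn(5, 128)                 # a 5-way, 1-shot support set
    query = support[2] + 0.1 * torch.randn(128)   # query near class 2
    loss = episode_loss(support, query, torch.tensor(2))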

3.3 Learning

Here, we define the overall objective function for FDEN:

    \mathcal{L}_{FDEN} = \mathcal{L}_{recon} + \mathcal{L}_{align} - \lambda \widehat{I}(f_I; f_S)    (6)

where λ is the weight constant, and the term \widehat{I} is negated because Equation (4) must be maximized due to its supremum term.

Gradient Reversal Layer

Note that \widehat{I} needs to be maximized to successfully estimate the mutual information, while our goal is to minimize the dependency between factors. Thus, we add a Gradient Reversal Layer (GRL) [35] before the first layer of the statisticians network. In essence, the GRL multiplies the gradients by a negative constant during backpropagation. With the GRL in place, the statisticians network will maximize \widehat{I} to estimate the mutual information, but the rest of the network will be guided towards minimizing it.
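A GRL is a few lines in a modern autodiff framework. The sketch below follows the standard construction of [35]: identity in the forward pass, gradients multiplied by a negative constant in the backward pass.

    import torch

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lambd=1.0):
            ctx.lambd = lambd
            return x.view_as(x)                   # identity in the forward pass

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse (and scale) gradients flowing to earlier layers.
            return -ctx.lambd * grad_output, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)

    # Placed before the statisticians network: T still maximizes the DV bound,
    # while the Disentangler behind the GRL is pushed to minimize it, e.g.,
    # mi_estimate = dv_lower_bound(T, grad_reverse(f_i), grad_reverse(f_s))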

Adaptive Gradient Clipping

Since \widehat{I} is unbounded, its gradients can overwhelm the gradients of the other objective functions if left unchecked. To mitigate this, we apply adaptive gradient clipping [22]:

    \tilde{g}_I = \min(\|g_{\mathcal{L}}\|, \|g_I\|) \, g_I / \|g_I\|    (7)

where \tilde{g}_I is the adapted gradient, g_I is the gradient of the mutual-information term (positive due to the GRL), and g_{\mathcal{L}} is the gradient over the remaining objectives at the shared layers, since \widehat{I} only backpropagates through the Disentangler and the statisticians network. We only apply adaptive gradient clipping when g_I could overwhelm the other objective functions.
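The following sketch conveys only the overwhelm-prevention idea behind Equation (7): rescale the mutual-information gradient so that its norm cannot exceed that of the competing objectives. The exact clipping rule used in [22] and in this paper may differ.

    import torch

    def clip_mi_gradient(g_mi, g_other):
        # Rescale g_mi so its norm never exceeds the norm of g_other.
        n_mi, n_other = g_mi.norm(), g_other.norm()
        if n_mi > n_other:                 # only clip when MI could dominate
            return g_mi * (n_other / (n_mi + 1e-12))
        return g_mi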

4 Related Works

4.1 Learning the Representation

Learning a representation can be generalized into three architectures: top-down, bottom-up, and invertible. A top-down architecture refers to models that learn a representation from observed data [14]. A good example is the deep neural network, where the latent representation is inferred from the observed data. A bottom-up architecture learns to generate data from a representation. The GAN [36] is a typical example: GANs take a representation (e.g., a random vector) to synthesize data. As a result, assuming the GAN is well trained, representations outside the spectrum of the observed data can be learned up to a certain precision [37]. An invertible architecture jointly learns to model the input space and the latent space. For example, invertible networks can synthesize data from a random vector and generate a representation from observed data [28], i.e., they invert the generator. These approaches can be interpreted as efforts to define a representation by first learning it (the bottom-up approach) and then mapping the observed and synthetic data onto it (the top-down approach). Our work is an extension of these efforts: given a representation, we aim to project it onto the manifold of each factor to define the representation in terms of factors.

4.2 Implicit Disentanglement

Implicit disentanglement refers to representation learning approaches that infer disentanglement implicitly. These approaches often guide a representation to be dependent on a feature of the observed data in order to obtain a disentangled representation. For example, [38] disentangles a representation by maximizing the mutual information between the representation and a label or attribute. [39] linearly decomposes a latent representation into multiple vectors and uses a GAN to synthesize data given a decomposed vector; the discriminator is then used to combine the vectors into a single disentangled representation. FDEN's alignment network uses a similar approach of implicitly guiding a factor to relate to a feature, i.e., the class, of the observed data. However, our approach also disentangles a representation explicitly by factorizing it into independent factors.

4.3 Explicit Disentanglement

Explicit disentanglement refers to approaches that directly model a representation to be disentangled. One of the simplest ways is to insert information directly into the representation. For example, [40] inserts one-hot label vectors directly into the representation to simulate disentanglement. More sophisticated approaches include factorial representations: [12, 41] disentangle a representation by decomposing it into multiple factors with the help of an adversarial classifier, and [42] combines modality-specific representations to create a unified and disentangled representation. The factorial representation approach is similar to our work in that it decomposes a representation into factors. However, our work uses mutual information to factorize a latent representation, which makes it statistically well grounded. Also, these works must be trained from scratch, while ours exploits a pre-trained model so that the model can still perform its original task.

5 Experiment

In this section, we perform various tasks to evaluate the proposed method. Our goal is to show that each module of FDEN is effective in disentangling a latent representation into independent factors. Thus, we divide the evaluation into two parts. First, to evaluate the alignment network in aligning the identity factor with its identity, we perform few-shot learning with only the inferred identity factor. Then, to examine the effectiveness of the decomposition of a latent representation (i.e., the Disentangler) and the independence between factors (i.e., the Factorizer and the Entangler), we perform image-to-image translation by mixing identity and style factors of different images.

5.1 Data sets

We evaluate FDEN on data sets from various domains: Omniglot (characters), MS-Celeb-1M (faces), Mini-ImageNet (natural images), and Oxford Flower (flowers).

Omniglot

The Omniglot [43] data set consists of 1,623 characters from 50 alphabets, where each character is drawn by 20 different people via Amazon's Mechanical Turk. We partitioned the data set into 1,200 characters for training and the remaining 423 for testing. Following [44], we augmented the data set by rotating the characters by 90, 180, and 270 degrees, where each rotation is treated as a new character (i.e., 4,800 characters for the training data set and 1,692 characters for the testing data set).

MS-Celeb-1M Low-shot

The MS-Celeb-1M [45] low-shot data set consists of facial images of 21,000 celebrities. The data set is partitioned into 20,000 celebrities for training and 1,000 for testing. There is an average of 58 images per celebrity in the training data set (1,155,175 images in total) and 5 images per celebrity in the test data set (5,000 images in total).

Mini-ImageNet

Mini-ImageNet is a partition of the ImageNet data set created by [46] for few-shot learning. It consists of 100 classes from ImageNet with 600 images per class, which [46] splits into 64, 16, and 20 classes for training, validation, and testing, respectively.

Oxford Flower

The Oxford Flower [47] data set consists of images of 102 flower species with 40 to 258 images per species. We split the data set by randomly selecting 82 species for training and 20 species for testing.

5.2 Implementation Details

Invertible Network

For the invertible network, we utilize a pre-trained Adversarially Learned Inference (ALI) model [28]. ALI is a GAN that jointly learns a generation network and an inference network. We chose ALI for its simplicity of implementation and its ability to create powerful latent representations. For the MS-Celeb-1M, Mini-ImageNet, and Oxford Flower data sets, we replicated the model designed for the CelebA data set. For the Omniglot data set, we replicated the model designed for the SVHN data set.

Factorized Disentangler-Entangler Network

FDEN consists of the Disentangler, statisticians network, alignment network, and Entangler, which are multilayer perceptrons parameterized by separate sets of weights. For the sake of simplicity, each module consists of 3 or 4 fully connected layers with dropout, batch normalization, and a Leaky ReLU activation, as sketched below.
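An illustrative template for one such module (widths follow Table 2 where recoverable; treat this as a sketch rather than the released code):

    import torch.nn as nn

    def fden_block(dim_in, hidden_dims, dim_out, p_drop=0.2):
        layers, prev = [], dim_in
        for d in hidden_dims:
            layers += [nn.Linear(prev, d), nn.BatchNorm1d(d),
                       nn.Dropout(p_drop), nn.LeakyReLU(0.01)]
            prev = d
        layers.append(nn.Linear(prev, dim_out))   # final linear output layer
        return nn.Sequential(*layers)

    disentangler_trunk = fden_block(512, [512, 512, 512], 512)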

For details of the hyperparameters, readers are referred to Appendix B of the supplementary material.

5.3 Experimental Setup

Few-shot Learning

The alignment network exploits an episodic learning scheme that is well suited to the few-shot learning environment. Each episode consists of N randomly sampled unique classes, K support samples per class, and a query sample from one of the classes. Given the support samples, the goal of few-shot learning is to predict which of the N unique classes the query sample belongs to. In the few-shot learning literature, this setup is generally called N-way, K-shot learning. We evaluate our results on 1,000 episodes with unseen samples for all experiments.

Image-to-Image Translation

Given the representations of two samples, z_1 and z_2, we perform image-to-image translation by entangling their identity factors, f_I^{(1)} and f_I^{(2)}, with the style factors of the other image, f_S^{(2)} and f_S^{(1)}. Since the Entangler is non-linear, we can also partially mix the factors linearly, e.g., f_S = α f_S^{(1)} + (1 − α) f_S^{(2)}. Without modifying the weights of the invertible network, we reconstruct a translated image from the entangled representation ẑ, as sketched below.
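A hedged sketch of this translation step, reusing the Disentangler/Entangler interfaces sketched in Section 3.1; invertible_net and the alpha-mixing rule are assumptions consistent with the description above.

    import torch

    def translate(z1, z2, D, E, alpha=0.5):
        f_i1, f_s1 = D(z1)                            # identity and style of image 1
        _, f_s2 = D(z2)                               # style of image 2
        f_s_mix = alpha * f_s1 + (1 - alpha) * f_s2   # partial linear mixing
        return E(f_i1, f_s_mix)                       # z_hat for the translated image

    # x_translated = invertible_net.generate(translate(z1, z2, D, E, alpha=0.0))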

5.4 Results

Figure 3: t-SNE scatter plot of factors from the 5-way 1-shot Omniglot model. Each plot consists of 5 unique classes with 20 samples per class. (a) Plot for identity factors and (b) plot for style factors. A larger version of this figure is available as Figure 9 in the supplementary material.
                          Omniglot                    Mini-ImageNet
                          5-way            20-way     5-way
Method                    1-shot   5-shot  1-shot     1-shot   5-shot
Matching Net. [32]        98.1%    98.9%   93.8%      43.5%    55.3%
Prototypical Net. [44]    98.8%    99.7%   96.0%      49.4%    68.2%
FDEN (Ours)               88.3%    95.4%   82.6%      43.9%    48.6%
Table 1: N-way K-shot learning accuracy on the Omniglot and Mini-ImageNet data sets.
Few-shot Learning

We evaluate FDEN on few-shot learning to show that the decomposed identity factors are successfully aligned with the identity information of the observed data. We validate our results on two data domains of varying complexity, Omniglot and Mini-ImageNet, and compare them with the state-of-the-art methods Matching Networks [32] and Prototypical Networks [44] (Table 1).

One property of FDEN is that it only learns to exploit the latent space. In other words, FDEN has no information about the input data except for a model's representation of it. Thus, few-shot learning lets us evaluate the effectiveness of aligning identity factors with identity information given only a representation. Although our results are lower than those of the state-of-the-art methods, considering these properties of FDEN, we find our results plausible.

To further analyze our results, we draw t-SNE scatter plots of the factors from the 5-way 1-shot Omniglot model (Figure 3; a larger version is available as Figure 9 in the supplementary material). The t-SNE plot for identity factors shows apparent clusters of samples of the same class, while the style factors show no visible clusters. This observation suggests that the identity factors are indeed aligned with identity information. A style factor, on the other hand, consists of all the information independent of the identity factor and is not aligned to any single attribute, hence the entanglement in the t-SNE plot. Thus, to evaluate the style factors, we examine the results of image-to-image translation in the next paragraph.

Figure 4: Results of image-to-image translation for the MS-Celeb-1M, Omniglot, and Oxford Flower data sets. For each data set, the images in the first and last columns are the input images to be translated. The images in the second and sixth columns are ALI's original reconstructions. The images in between are reconstructions with interpolated identity and style factors of the input images.
Image-to-Image Translation

For image-to-image translation, we evaluate our results on the Omniglot, MS-Celeb-1M, and Oxford Flower data sets (Figure 4). The goal of this experiment is to show the effectiveness of FDEN's ability to decompose and reconstruct a latent representation.

Our results show that identity-relevant features are clearly aligned with the identity factors. For example, the first MS-Celeb-1M images in Figure 4 show a clear interpolation between a woman and a man. Since we factorize a latent representation into only two factors, the style factor carries multiple features independent of the identity factor. Thus, during interpolation between factors, we see multiple attributes changing together, such as the rotation and brightness of the face and the background. Although it is hard to pinpoint what changes while interpolating factors on the Omniglot and Oxford Flower data sets, we can see that each interpolation step results in somewhat interpretable changes. These observations show that FDEN can indeed decompose a latent representation into independent factors.

Also, comparing ALI's reconstructed images (1st row 2nd column, 6th row 3rd column) and FDEN's reconstructed images (1st row 3rd column, 3rd row 5th column), we observe that they are very similar. This shows that FDEN can indeed be plugged into an invertible network without reducing its performance (additional results are available in Appendix A of the supplementary material).

6 Conclusion

In this work, we propose the Factorized Disentangler-Entangler Network (FDEN), which learns to decompose a latent representation into independent factors. Our work opens the possibility of extending state-of-the-art models to solve different tasks while maintaining the performance of their original tasks. One property of our work is that it exploits only the latent space, not the input space. Possible future work is to jointly incorporate the latent and input spaces to disentangle a representation.

Acknowledgements

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence) and Kakao Corp. (Development of Algorithms for Deep Learning-Based One-/Few-shot Learning).

References

  • [1] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6541–6549.
  • [2] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud, “Isolating sources of disentanglement in variational autoencoders,” in Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 2610–2620.
  • [3] H. Kim and A. Mnih, “Disentangling by factorising,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 4153–4171.
  • [4] J. Schmidhuber, “Learning factorial codes by predictability minimization,” Neural Computation, vol. 4, no. 6, pp. 863–879, 1992.
  • [5] K. Ridgeway and M. C. Mozer, “Learning deep disentangled embeddings with the f-statistic loss,” in Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 185–194.
  • [6] T. Scott, K. Ridgeway, and M. C. Mozer, “Adapted deep embeddings: A synthesis of methods for k-shot inductive transfer learning,” in Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 76–85.
  • [7] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “Darla: Improving zero-shot transfer in reinforcement learning,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 1480–1490.
  • [8] L. Chen, H. Zhang, J. Xiao, W. Liu, and S.-F. Chang, “Zero-shot visual recognition using semantics-preserving adversarial embedding networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1043–1052.
  • [9] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, “Taskonomy: Disentangling task transfer learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3712–3722.
  • [10] Y.-C. Liu, Y.-Y. Yeh, T.-C. Fu, S.-D. Wang, W.-C. Chiu, and Y.-C. Frank Wang, “Detach and adapt: Learning cross-domain disentangled deep representation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8867–8876.
  • [11] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio, “Image-to-image translation for cross-domain disentanglement,” in Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 1287–1298.
  • [12] A. H. Liu, Y.-C. Liu, Y.-Y. Yeh, and Y.-C. F. Wang, “A unified feature disentangler for multi-domain image translation and manipulation,” in Proceedings of the Advances in Neural Information Processing Systems, 2018, pp. 2590–2599.
  • [13] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [15] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Proceedings of the Advances in neural information processing systems, 2016, pp. 2234–2242.
  • [16] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Proceedings of the Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
  • [17] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 4278–4284.
  • [18] B. Zhao, X. Sun, Y. Fu, Y. Yao, and Y. Wang, “Msplit lbi: Realizing feature selection and dense estimation simultaneously in few-shot and zero-shot learning,” Proceedings of the International Conference on Machine Learning, 2018.
  • [19] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” in Advances in Neural Information Processing Systems, 2018, pp. 10019–10029.
  • [20] J. B. Kinney and G. S. Atwal, “Equitability, mutual information, and the maximal information coefficient,” Proceedings of the National Academy of Sciences, vol. 111, no. 9, pp. 3354–3359, 2014.
  • [21] M. D. Donsker and S. S. Varadhan, “Asymptotic evaluation of certain markov process expectations for large time. iv,” Communications on Pure and Applied Mathematics, vol. 36, no. 2, pp. 183–212, 1983.
  • [22] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in Proceedings of the International Conference on Machine Learning, 2018, pp. 531–540.
  • [23] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse, “The reversible residual network: Backpropagation without storing activations,” in Advances in neural information processing systems, 2017, pp. 2214–2224.
  • [24] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.
  • [25] J.-H. Jacobsen, A. Smeulders, and E. Oyallon, “i-revnet: Deep invertible networks,” Proceedings of the International Conference on Learning Representations, 2018.
  • [26] J. Behrmann, D. Duvenaud, and J.-H. Jacobsen, “Invertible residual networks,” Proceedings of the International Conference on Machine Learning, 2018.
  • [27] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for deep learning of images, labels and captions,” in Advances in neural information processing systems, 2016, pp. 2352–2360.
  • [28] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, “Adversarially learned inference,” in Proceedings of the International Conference on Learning Representations, 2017.
  • [29] D. Berthelot, T. Schumm, and L. Metz, “Began: Boundary equilibrium generative adversarial networks,” arXiv preprint arXiv:1703.10717, 2017.
  • [30] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of machine learning research, vol. 11, no. Dec, pp. 3371–3408, 2010.
  • [31] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder.” in Interspeech, 2013, pp. 436–440.
  • [32] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in Advances in neural information processing systems, 2016, pp. 3630–3638.
  • [33] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel, “Meta-learning for semi-supervised few-shot classification,” Proceedings of the International Conference on Learning Representations, 2018.
  • [34] A. M. Nuxoll, “Episodic learning,” Encyclopedia of the sciences of learning, pp. 1157–1159, 2012.
  • [35] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [37] A. Antoniou, A. Storkey, and H. Edwards, “Data augmentation generative adversarial networks,” arXiv preprint arXiv:1711.04340, 2017.
  • [38] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in neural information processing systems, 2016, pp. 2172–2180.
  • [39] C. Donahue, Z. C. Lipton, A. Balsubramani, and J. McAuley, “Semantically decomposing the latent spaces of generative adversarial networks,” Proceedings of the International Conference on Learning Representations, 2018.
  • [40] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier gans,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 2642–2651.
  • [41] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang, “Exploring disentangled feature representation beyond face identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2080–2089.
  • [42] Y.-H. H. Tsai, P. P. Liang, A. Zadeh, L.-P. Morency, and R. Salakhutdinov, “Learning factorized multimodal representations,” Proceedings of the International Conference on Learning Representations, 2019.
  • [43] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
  • [44] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
  • [45] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 87–102.
  • [46] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in Proceedings of the International Conference on Learning Representations, 2016.
  • [47] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008, pp. 722–729.

Supplementary Material

Appendix A Additional Results

The images in the first and last columns are the input images to be translated. The images in the second and sixth columns are ALI's original reconstructions. The images in between are reconstructions with interpolated identity and style factors of the input images.

Figure 5: Additional results on MS-Celeb-1M data set.
Figure 6: Additional results on Omniglot data set.
Figure 7: Additional results on Oxford Flower data set.
Figure 8: Additional results on Mini-ImageNet data set.

Appendix B Hyperparameters

B.1 FDEN

Operation                                          Feature Maps   Batch Norm   Dropout   Activation
input
  Fully Connected                                  512            ✓            0.2       Leaky ReLU
  Fully Connected                                  512            ✓            0.2       Leaky ReLU
  Fully Connected                                  512            ✓            0.2       Leaky ReLU
  Fully Connected                                  –              ✓            0.2       Linear
input
  Fully Connected                                  512            ✓            0.2       Leaky ReLU
  Fully Connected                                  512            ✓            0.2       Leaky ReLU
  Fully Connected                                  –              ✓            0.2       Linear
input
  Fully Connected                                  512            ✓            0.2       Leaky ReLU
  Fully Connected                                  512            ✓            0.2       Leaky ReLU
  Fully Connected                                  –              ✓            0.2       Linear
input
  Concatenate f_I and f_S along the channel axis
  Fully Connected                                  1024           ✓            0.2       Leaky ReLU
  Fully Connected                                  256            ✓            0.2       Leaky ReLU
  Fully Connected                                  64             ✓            0.2       Leaky ReLU
  Fully Connected                                  1              ✓            0.2       Linear
input
  Fully Connected                                  256            ✓            0.2       Leaky ReLU
  Fully Connected                                  256            ✓            0.2       Leaky ReLU
  Fully Connected                                  –              ✓            0.2       Linear
input
  Fully Connected                                  256            ✓            0.2       Leaky ReLU
  Fully Connected                                  256            ✓            0.2       Leaky ReLU
  Fully Connected                                  –              ✓            0.2       Linear
input
  Concatenate the factor streams along the channel axis
  Fully Connected                                  512            ✓            0.2       Leaky ReLU
  Fully Connected                                  512            ✓            0.2       Leaky ReLU
  Fully Connected                                  512            ✓            0.2       Leaky ReLU
  Fully Connected                                  –              ✓            0.2       Linear

Optimizer                 Adam
Batch size                16
Episodes per epoch        10,000
Epochs                    1,000
Leaky ReLU slope          0.01
Weight initialization     Truncated Normal
Loss weights
  Omniglot                                256
  MS-Celeb-1M, Mini-ImageNet, Oxford      512
Table 2: Model hyperparameters.

B.2 Adversarially Learned Inference

We chose ALI [28] as the invertible network of our framework and used exactly the same hyperparameters presented in Appendix A of [28]. For training on the Omniglot data set, we used the model designed for unsupervised learning of SVHN. For training on the Mini-ImageNet, MS-Celeb-1M, and Oxford Flower data sets, we used the model designed for unsupervised learning of CelebA. Although [28] designed a model for a variant of ImageNet (Tiny ImageNet), our preliminary results showed that the CelebA model could synthesize better images on the Mini-ImageNet data set.

For training on the Mini-ImageNet, MS-Celeb-1M, and Oxford Flower data sets, we included a reconstruction loss between the input image and its reconstruction, which results in steadier convergence and better reconstructions.

Appendix C Larger Version of the t-SNE Scatter Plot

Figure 9: t-SNE scatter plot of identity factors from the 5-way 1-shot Omniglot model. The plot consists of 5 unique classes with 20 samples per class.