A Plug-in Method for Representation Factorization

Abstract

In this work, we focus on decomposing the latent representations in GANs or the learned feature representations in deep auto-encoders into semantically controllable factors in a semi-supervised manner, without modifying the original trained models. Specifically, we propose a Factors Decomposer-Entangler Network (FDEN) that learns to decompose a latent representation into mutually independent factors. Given a latent representation, the proposed framework draws a set of interpretable factors, each aligned to an independent factor of variation by minimizing their total correlation in an information-theoretic manner. As a plug-in method, we apply FDEN to the existing networks of Adversarially Learned Inference and Pioneer Network and conduct computer vision tasks of image-to-image translation in semantic ways, \eg, changing styles while keeping the identity of a subject, and object classification in a few-shot learning scheme. We also validate the effectiveness of our method with various ablation studies in qualitative, quantitative, and statistical examinations.


1 Introduction

With the advances in deep learning and its successes in various applications, it has been of great interest to interpret or understand learned feature representations. In particular, thanks to the generic framework of deep generative adversarial learning, we have a great tool, \eg, the Generative Adversarial Network (GAN) [15] and its variants [6, 9, 16], to implicitly estimate the underlying data distribution in connection with a latent space. However, as the latent representation is highly entangled, it is still challenging to gain insights into or interpret such latent representations in an observation space (\eg, an image).

A representation is generally considered disentangled when it can capture interpretable semantic information or factors of variation underlying the problem structure [1]. Thus, the concept of disentangled representation is closely related to that of factorial representation [8, 24, 42], which holds that a unit of a disentangled representation should correspond to an independent factor of the observed data. For example, there are different factors that describe a facial image, such as gender, baldness, smiling, pose, identity, \etc. In this perspective, previous studies have also validated the effectiveness of disentangled representations in various tasks such as few-shot learning [41, 43, 21, 7], domain adaptation [47, 30], and image translation [14, 29, 8].

While learning a disentangled representation is desirable, this does not imply that an (entangled) latent representation is less powerful or lacks interpretability. In fact, various methods that do not consider disentanglement [39, 27] have achieved state-of-the-art performance in their respective domains.

In this work, given a pre-trained deep model empowered with data generation, such as GANs [12, 6] or Deep Auto-Encoders (DAEs) [18, 8], we focus on decomposing the latent representations in GANs or the learned feature representations in DAEs into semantically controllable factors in a semi-supervised manner, without modifying the original trained models.

Figure 1: Overview of the proposed framework. The Factors Decomposer-Entangler Network (FDEN) takes as input a representation z from a fixed pre-trained model and outputs a reconstructed representation $\hat{z}$. In doing so, FDEN factorizes the representation into independent factors using information-theoretic approaches.

Specifically, we devise a Factors Decomposer-Entangler Network (FDEN) that learns to decompose a representation into semantically independent factors in a semi-supervised manner. Given a latent or feature representation vector, the proposed network draws a set of interpretable factors (some of which are derived in a supervised way when such information for an input sample is available), whose mutual independence is maximized in an information-theoretic manner. In addition, it can restore the independent factors back into the original representation, making FDEN an autoencoder-like architecture. The motivation behind the autoencoder-like architecture is to utilize the latent representation from a fixed pre-trained model rather than to develop and train a disentangled representation from scratch. In doing so, we can focus our efforts solely on disentanglement while retaining the performance achieved by the pre-trained model itself. Note that our method follows a general consensus on robust representation learning: (a) disentangling as many factors as possible, and (b) maintaining as much information in the original data as possible [5].

To evaluate our proposed framework, we perform qualitative, quantitative, and statistical examinations of the factorized representation. First, we measure the effectiveness of the factorized representation in downstream tasks by performing image-to-image translation in conjunction with few-shot learning. Then, we examine how each component of FDEN works towards creating a factorized representation with exhaustive ablation studies and statistical analysis. The main contributions of our work are three-fold (see Footnote 1):

  • We propose a novel network, called Factors Decomposer-Entangler Network (FDEN), that can be easily plugged into an existing network that is empowered with data generation.

  • Thanks to the factorization property, our network can be used for image-to-image translation in semantic ways, \eg, changing styles while keeping the identity of a subject, and for classification tasks in a few-shot learning scheme.

  • Our work opens the possibility of extending state-of-the-art models to solve different tasks without modifying their weights, so that the performance on the original task is maintained.

2 Related Works

Exploiting the Representation Vector There is a consensus [5, 19, 31] among researchers that a robust approach to representation learning is through disentanglement. To the best of our knowledge, previous work on disentangled representation has focused on unsupervised approaches to make each unit of a representation vector interpretable and independent of other units [20, 24]. For example, Kim et al. [24] evaluate their representation on the classification performance of predicting which index of a representation corresponds to a factor of variation. However, recent observations have pointed out flaws in unsupervised approaches to disentanglement and suggested future work on (semi-)supervised approaches [33]. To this extent, Bau et al. [2, 1] take a more direct and semi-supervised approach to exploiting the units of a representation. Specifically, they propose ways to exploit the units of pre-trained neural networks to independently turn factors of variation on or off. This is achieved by altering the value of a unit and analyzing the changes in classification performance. In a similar manner, our work approaches disentanglement through a semi-supervised factorial learning approach. However, we take into account the representation as a whole rather than a single unit of the representation.

Deep Learning Based Independent Component Analysis Embedding or restoring independent components in a representation has been an ongoing research topic in representation learning for decades [10, 23, 20]. There have been approaches to directly minimize the dependency between two random variables by means of adversarial learning [29, 31] and feature normalization [48]. With the advances of GANs, models exploiting mutual information [4, 9] and its variants [37, 20] have been proposed. These works are indirect approaches to independent component analysis that utilize the dual representation of mutual information to maximize the mutual dependency between a data sample and its representation vector. There have been several approaches to directly minimizing the mutual information, but they either cannot be applied to neural networks [38] or ignore the dual upper bound term (i.e., the supremum term in Equation (3)). In contrast to these works, our work introduces a direct approach to minimizing the dependency between random variables that is applicable to most deep neural networks.

Figure 2: Overview of the Factors Decomposer-Entangler Network (FDEN). FDEN is divided into three modules: Decomposer, Factorizer, and Entangler. Our model is an autoencoder-like architecture that takes as input a representation z and reconstructs its original representation $\hat{z}$. (a) First, the Decomposer takes a latent representation z from a fixed pre-trained network and decomposes it into a set of factors. (b) Then, the Factorizer uses an information-theoretic approach to maximize the independence among the factors. (c) Finally, the Entangler takes the factors and reconstructs the original representation $\hat{z}$.

3 Preliminary

Mutual Information In information-theoretic terms, mutual information is a measure of the dependency between two random variables and can be formulated as the Kullback-Leibler (KL) divergence as follows:

$I(X; Z) = D_{\mathrm{KL}}(\mathbb{P}_{XZ} \,\|\, \mathbb{P}_X \otimes \mathbb{P}_Z)$   (1)

where $\mathbb{P}_{XZ}$ denotes the joint probability distribution and $\mathbb{P}_X \otimes \mathbb{P}_Z$ is the product of the marginal probability distributions $\mathbb{P}_X$ and $\mathbb{P}_Z$. As it captures both linear and non-linear statistical dependency between variables, mutual information is believed to measure the true dependence [25]. Thus, we utilize mutual information in formulating our objective function as a means of non-linearly decomposing a latent representation.

Total Correlation Total correlation, or multi-information, is a generalization of mutual information that captures the dependency among multiple random variables. For example, the total correlation among a set of random variables $Z_1, \ldots, Z_n$ can be formulated as the KL-divergence between the joint probability $\mathbb{P}_{Z_1 \ldots Z_n}$ and the product of marginal probabilities $\mathbb{P}_{Z_1} \otimes \cdots \otimes \mathbb{P}_{Z_n}$:

$TC(Z_1, \ldots, Z_n) = D_{\mathrm{KL}}(\mathbb{P}_{Z_1 \ldots Z_n} \,\|\, \mathbb{P}_{Z_1} \otimes \cdots \otimes \mathbb{P}_{Z_n})$   (2)

In Subsection 4.3, we discuss how FDEN utilizes mutual information and total correlation in detail.

Donsker-Varadhan Representation of KL-divergence Since mutual information and total correlation are intractable for continuous variables, we exploit a dual representation [11] for the KL-divergence computation:

$D_{\mathrm{KL}}(\mathbb{P} \,\|\, \mathbb{Q}) = \sup_{T \in \mathcal{T}} \mathbb{E}_{\mathbb{P}}[T] - \log \mathbb{E}_{\mathbb{Q}}[e^{T}]$   (3)

where $\mathcal{T}$ is a family of functions $T: \Omega \to \mathbb{R}$ parameterized by a neural network. For the full derivation of Equation (3), readers are referred to [4].
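
In practice, the supremum in Equation (3) is approximated by gradient ascent on a small statistics network. The following PyTorch sketch shows one way to implement such an estimator; the network architecture and function names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """A small MLP playing the role of T in Equation (3)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 64), nn.LeakyReLU(0.01),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

def dv_bound(T, joint, marginal):
    # Donsker-Varadhan lower bound: E_P[T] - log E_Q[exp(T)].
    # (logsumexp would be more numerically stable in practice.)
    return T(joint).mean() - T(marginal).exp().mean().log()
```

Maximizing `dv_bound` over the parameters of `T` tightens the estimate of $D_{\mathrm{KL}}(\mathbb{P} \,\|\, \mathbb{Q})$; with $\mathbb{P}$ the joint and $\mathbb{Q}$ the product of marginals, this estimates mutual information, as in MINE [4].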

4 Factors Decomposer-Entangler Network

The proposed Factors Decomposer-Entangler Network (FDEN) is a novel framework that can be plugged into a pre-trained network, empowered with data generation (\eg, GANs) or reconstruction (\eg, DAEs), and factorize its latent or feature representation z. Specifically, the goal of FDEN is to decompose an input representation z into independent and semantically interpretable factors without losing the original information in z. To achieve this, we compose FDEN of three modules (Figure 2): the Decomposer, the Factorizer, and the Entangler. It is noteworthy that since FDEN uses a fixed pre-trained network and deals only with the latent or feature representation from that network, it allows us to focus solely on factorizing the input representation for other new tasks, while keeping the network's capacity or power for its original tasks.

4.1 Latent or Feature Representation

Our FDEN has an autoencoder-like structure that takes a latent or feature representation from a pre-trained network as input. For the pre-trained network, we consider networks that are capable of generating or encoding-decoding observable samples (\eg, an image). In other words, we focus on deep networks that find a latent representation of the input space and can also reconstruct or generate a sample given its latent representation. Typical examples of such neural networks include bi-directional GANs [12, 6], autoencoders [45, 34], and invertible networks [3, 22].

4.2 Decomposer-Entangler

The Decomposer-Entangler network (Figure 2 (a) and (c), respectively) is an autoencoder-like architecture that takes a representation z as input and reconstructs its original representation $\hat{z}$. Specifically, the Decomposer takes the representation z and decodes it with a global decoder network. The decoded representation is then decomposed into a set of factors, each produced by its own local decoder network. The Entangler feeds the factors into their corresponding streams, which are then concatenated along the channel axis and passed through the global encoder to reconstruct the original representation $\hat{z}$. Since the goal of the Decomposer-Entangler network is to reconstruct the original representation, we introduce the reconstruction objective function $\mathcal{L}_{\mathrm{recon}}$. Also, since the sample x and its representation z may or may not be bijective, we include a regularizer in the reconstruction objective as follows:

$\mathcal{L}_{\mathrm{recon}} = \| z - \hat{z} \|_2^2 + \lambda \| x - \hat{x} \|_2^2$   (4)

where $\lambda$ is a constant weight for the regularizer. Note that the fixed pre-trained network takes $\hat{z}$ as input to reconstruct its data $\hat{x}$ (Figure 1).

At this point, a representation z is merely decomposed and reassembled into $\hat{z}$ (for an ablation study on FDEN trained with only the reconstruction objective, refer to subsection 5.5). Although the factors contain information about z, they are not aligned to specific factors of variation. In other words, the factors are neither independent nor do they carry any distinguishable information. Thus, in the next subsection, we introduce a module called the Factorizer to imbue these factors with such information.
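
To make the data flow concrete, here is a minimal PyTorch sketch of the Decomposer and Entangler. The dimensions, layer counts, and the squared-error form of Equation (4) are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Decomposer(nn.Module):
    """Global decoder followed by one local decoder per factor (Sec. 4.2)."""
    def __init__(self, z_dim=512, f_dim=64, n_factors=2):
        super().__init__()
        self.global_dec = nn.Sequential(nn.Linear(z_dim, 512), nn.LeakyReLU(0.01))
        self.local_decs = nn.ModuleList(
            [nn.Linear(512, f_dim) for _ in range(n_factors)])

    def forward(self, z):
        h = self.global_dec(z)
        return [dec(h) for dec in self.local_decs]  # one output per factor

class Entangler(nn.Module):
    """Per-factor streams, concatenated and encoded back to z-hat."""
    def __init__(self, z_dim=512, f_dim=64, n_factors=2):
        super().__init__()
        self.streams = nn.ModuleList(
            [nn.Sequential(nn.Linear(f_dim, 128), nn.LeakyReLU(0.01))
             for _ in range(n_factors)])
        self.global_enc = nn.Linear(128 * n_factors, z_dim)

    def forward(self, factors):
        h = torch.cat([s(f) for s, f in zip(self.streams, factors)], dim=1)
        return self.global_enc(h)

def recon_loss(z, z_hat, x, x_hat, lam=1.0):
    # Sketch of Equation (4); x_hat is the fixed pre-trained network's
    # reconstruction of the data from z_hat.
    return ((z - z_hat) ** 2).mean() + lam * ((x - x_hat) ** 2).mean()
```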

4.3 Factorizer

The Factorizer uses an information-theoretic measure to make the factors independent and carry distinguishable information. The general idea is to minimize the total correlation among all factors (via the Statisticians Network) while giving them relevant information using a set of classifiers (the Alignment Network).

Statisticians Network The first component of the Factorizer, the Statisticians Network $T$, estimates the total correlation among factors in a one-versus-all scheme. Our goal is to minimize the total correlation among factors so that they are maximally and mutually independent of each other. We follow [4] (i.e., Equation (3)) to estimate the total correlation among factors:

$\mathcal{L}_{\mathrm{TC}} = \sup_{T} \mathbb{E}_{\mathbb{P}_{f_1 \ldots f_n}}[T] - \log \mathbb{E}_{\mathbb{P}_{f_1} \otimes \cdots \otimes \mathbb{P}_{f_n}}[e^{T}]$   (5)

where $T$ is the Statisticians Network, $\mathbb{P}_{f_1 \ldots f_n}$ is the joint distribution of all factors, and $\mathbb{P}_{f_1} \otimes \cdots \otimes \mathbb{P}_{f_n}$ is the product of the marginal distributions of all factors. We approximate samples from the product of marginals by taking samples from the joint distribution and shuffling them i.i.d. along the batch axis independently for each factor. Although the latent representation is thereby factorized into independent factors, the decomposed factors are not necessarily interpretable from a semantic point of view. In this regard, we further consider minimal networks that help map them to human-understandable factors in a supervised manner.
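
The batch-shuffling approximation of the product of marginals is simple to implement. Below is a sketch in PyTorch: permuting each factor independently along the batch axis breaks the dependency between factors while preserving each marginal, yielding the approximate samples used in Equation (5). The function names are our own.

```python
import torch

def shuffle_marginals(factors):
    """Approximate samples from the product of marginals by permuting
    each factor independently along the batch axis (Sec. 4.3)."""
    return [f[torch.randperm(f.shape[0])] for f in factors]

def total_correlation_estimate(T, factors):
    # Joint samples: the factors as produced by the Decomposer for one batch.
    joint = torch.cat(factors, dim=1)
    # Marginal samples: the same factors, each shuffled with its own permutation.
    marginal = torch.cat(shuffle_marginals(factors), dim=1)
    # Donsker-Varadhan bound, as in Equation (5).
    return T(joint).mean() - T(marginal).exp().mean().log()
```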

Alignment Networks The Alignment Network is designed to link each factor to one of the human-labelled factors (or attributes) in a supervised manner. Concretely, there is a set of classifiers, each identifying whether the input sample behind the latent representation has the target factor information or not. This supervised learning implicitly guides each factor to be aligned with one of the factor labels. The Statisticians Network makes the factors independent of each other, so when one factor carries information on a factor of variation, \eg, gender, the other factors will carry other, independent information. However, as there exists a huge number of factors that can cause diverse variations in samples, it is not sufficient to consider the human-labelled attributes only. In this regard, we further consider another independent factor dedicated to other potential factors not specified in human labels. This unspecified factor is trained in an unsupervised way, being involved only in the total correlation objective. To jointly train the Alignment Networks (except for the unspecified factor), we define the supervised loss as the sum of cross-entropy losses over the classifiers:

$\mathcal{L}_{\mathrm{align}} = \sum_{i} \mathrm{CE}(c_i(f_i), y_i)$   (6)

where $c_i$ is the classifier for factor $f_i$ and $y_i$ is its attribute label.

It should be noted that this Alignment Network is capable of aligning factors with human-labelled attributes thanks to our Statisticians Network, which drives the factors to be independent via total correlation minimization. Further, the reconstruction loss ensures that any information loss is minimal, so that the decomposed factors retain enough information.

(a) Schematic overview of the image-to-image translation scenario
(b) FDEN in an image-to-image translation scenario
Figure 3: FDEN in an image-to-image translation scenario. First, FDEN takes as input a latent representation z and decomposes it into an identity factor and a style factor. Then, a latent representation is reconstructed by linearly interpolating the factors of different representations (e.g., combining the identity factor of one image with the style factor of another).

4.4 Learning

Here, we define the overall objective function for FDEN:

$\mathcal{L}_{\mathrm{FDEN}} = \alpha \mathcal{L}_{\mathrm{recon}} + \beta \mathcal{L}_{\mathrm{align}} - \gamma \mathcal{L}_{\mathrm{TC}}$   (7)

where $\alpha$, $\beta$, and $\gamma$ are the coefficients weighting the different losses, and the negative sign on $\mathcal{L}_{\mathrm{TC}}$ is due to the maximization of Equation (5) for its supremum term. Since we need to minimize our overall objective as well as the dependency among factors, we introduce a workaround using a Gradient Reversal Layer [13] in the following paragraph.

Gradient Reversal Layer (GRL) Note that $\mathcal{L}_{\mathrm{TC}}$ needs to be maximized to successfully estimate the dual representation of the KL-divergence, but our goal is to minimize the dependency between factors. Thus, we add a Gradient Reversal Layer (GRL) [13] before the first layer of the Statisticians Network. In essence, the GRL multiplies the gradients by a negative constant during backpropagation. With the GRL in place, the Statisticians Network maximizes $\mathcal{L}_{\mathrm{TC}}$ to estimate the total correlation, while the rest of the network is guided towards minimizing the mutual information (for an analysis of the effectiveness of the GRL against an alternative approach, refer to subsection 5.5.1).
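
A GRL is identity in the forward pass and negates (and optionally scales) gradients in the backward pass [13]. A standard PyTorch implementation looks like the following; the commented usage line showing where it slots into FDEN is our assumption.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lam
    in the backward pass [13]."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch: the Statisticians Network sees gradient-reversed factors,
# so it still maximizes the DV bound while the Decomposer is pushed to
# minimize the total correlation.
# tc = total_correlation_estimate(T, [grad_reverse(f) for f in factors])
```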

Adaptive Gradient Clipping Since $\mathcal{L}_{\mathrm{TC}}$ is unbounded, its gradients can overwhelm the gradients of the other objective functions if left uncontrolled. To mitigate this, we apply adaptive gradient clipping [4]:

$\hat{g}_{\mathrm{TC}} = \min\left( \| g_{\mathrm{rest}} \|, \| g_{\mathrm{TC}} \| \right) \frac{g_{\mathrm{TC}}}{\| g_{\mathrm{TC}} \|}$   (8)

where $\hat{g}_{\mathrm{TC}}$ is the adapted gradient, $g_{\mathrm{TC}} = \nabla \mathcal{L}_{\mathrm{TC}}$ (positive due to the GRL), and $g_{\mathrm{rest}}$ is the gradient over $\mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{align}}$, since $\mathcal{L}_{\mathrm{TC}}$ only backpropagates through the Statisticians Network and the Decomposer.
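
As a rough illustration of this clipping, the sketch below rescales the total-correlation gradients so that their norm never exceeds that of the remaining objectives. It is one plausible reading of the rule in [4], not a definitive transcription of Equation (8).

```python
import torch

def adapt_tc_gradients(g_tc, g_rest):
    """Rescale the unbounded total-correlation gradients so their norm
    does not exceed that of the other objectives (sketch of Eq. (8))."""
    norm_tc = torch.sqrt(sum((g ** 2).sum() for g in g_tc))
    norm_rest = torch.sqrt(sum((g ** 2).sum() for g in g_rest))
    scale = torch.clamp(norm_rest / (norm_tc + 1e-12), max=1.0)
    return [g * scale for g in g_tc]

# Usage sketch:
# g_tc   = torch.autograd.grad(loss_tc, params, retain_graph=True)
# g_rest = torch.autograd.grad(loss_recon + loss_align, params,
#                              retain_graph=True)
# g_hat  = adapt_tc_gradients(g_tc, g_rest)
```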

5 Experiment

In this section, we perform various experiments to evaluate the Factors Decomposer-Entangler Network (FDEN). Our goal is to show that each module of FDEN is effective in decomposing a latent representation into independent factors. Thus, we evaluate FDEN in a top-down manner, i.e., from the whole model down to individual units of a factor. First, we perform a suite of ablation studies to see the effectiveness of each module of FDEN in factorizing a representation. Second, we evaluate the effectiveness of the factors by performing various downstream tasks. Finally, we analyze individual units of the factors to see if a representation is indeed reasonably factorized.

5.1 Data sets

We evaluate our proposed FDEN on data sets from various domains: Omniglot (characters), MS-Celeb-1M (faces with identities), CelebA (faces with attributes), Mini-ImageNet (natural images), and Oxford Flower (flowers).

Omniglot The Omniglot [28] data set consists of 1,623 characters from 50 alphabets, where each character is drawn by 20 different people via Amazon Mechanical Turk. Following [46, 44], we partitioned the data set into 1,200 characters for training and the remaining 423 for testing. Also following [46, 44], we augmented the data set by rotating each character by 90, 180, and 270 degrees, where each rotation is treated as a new character (i.e., 4,800 characters for the training set and 1,692 characters for the test set).
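
This rotation convention can be sketched as follows; the sketch assumes PIL or tensor images and integer class ids in [0, n_classes), with each rotation mapped to a shifted class id.

```python
import torchvision.transforms.functional as TF

def augment_with_rotations(images, labels, n_classes):
    """Treat each 90/180/270-degree rotation of a character as a new
    class, following [46, 44]."""
    out_imgs, out_lbls = list(images), list(labels)
    for k, angle in enumerate((90, 180, 270), start=1):
        for img, lbl in zip(images, labels):
            out_imgs.append(TF.rotate(img, angle))
            out_lbls.append(lbl + k * n_classes)  # shifted class id
    return out_imgs, out_lbls
```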

MS-Celeb-1M Low-shot The MS-Celeb-1M [17] low-shot data set consists of facial images of 21,000 celebrities. The data set is partitioned (by [17]) into 20,000 celebrities for training and 1,000 celebrities for testing. There is an average of 58 images per celebrity in the training set (a total of 1,155,175 images) and 5 images per celebrity in the test set (a total of 5,000 images).

CelebA The CelebA [32] data set consists of 202,599 celebrity facial images annotated with 40 binary attributes, such as eyeglasses, bangs, and smiling. The data set is partitioned (by [32]) into 162,770 images for training, 19,867 for validation, and 19,962 for testing.

Mini-ImageNet Mini-ImageNet is a partition of the ImageNet data set created by [40] for few-shot learning. It consists of 100 classes from ImageNet with 600 images per class; [40] split it into 64, 16, and 20 classes for training, validation, and testing, respectively.

Oxford Flower The Oxford Flower [36] data set consists of images of 102 flower species, with 40 to 258 images per species. We split the data set by randomly selecting 82 flower species for training and 20 flower species for testing.

Figure 4: Results of image-to-image translation for the MS-Celeb-1M, Omniglot, and Oxford Flower data sets. For each data set, the images in the first and last columns are the input images we are interested in translating. The images in the second and sixth columns are ALI's original reconstructions. The images in the middle are reconstructions with interpolated identity and style factors of the input images.

5.2 Implementation Details

Pre-trained Networks For the pre-trained network, we utilize Adversarially Learned Inference (ALI) [12] and the Pioneer Network [18].

ALI is a bi-directional GAN that jointly learns a generation network and an inference network. We chose ALI for its simplicity of implementation and its ability to create powerful latent representations. For the MS-Celeb-1M, Mini-ImageNet, and Oxford Flower data sets, we replicated the model designed by the authors for the CelebA data set. For the Omniglot data set, we replicated the model designed by the authors for the SVHN data set.

The Pioneer Network [18] is a progressively growing autoencoder that can achieve high-quality reconstructions. We chose the Pioneer Network for its state-of-the-art reconstruction performance; compared with the various GANs we examined, it produced some of the highest-quality reconstructions. We use the pre-trained model for CelebA-128 publicly available on the authors' website.

Factors Decomposer-Entangler Network FDEN consists of the Decomposer, Statisticians Network, Alignment Network, and Entangler, all of which are fully connected networks. For the sake of simplicity and model complexity, we kept each module to 3 or 4 fully connected layers with dropout, batch normalization, and a Leaky ReLU activation.

For details of hyperparameters, readers are referred to the Supplementary.

5.3 Downstream Task

Image-to-Image Translation

The goal of this experiment is to show the effectiveness of FDEN's ability to decompose and reconstruct a latent representation. Given the representations of two samples, we perform image-to-image translation by linearly interpolating their identity factors while pairing them with the style factors of different images (Figure 3). Without modifying the weights of the invertible network, we then reconstruct a translated image from the re-entangled representation. For image-to-image translation, we evaluate our results on the Omniglot, MS-Celeb-1M, and Oxford Flower data sets using pre-trained ALI (Figure 4).
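
Concretely, the translation procedure can be sketched as below. The module handles `fden.decomposer` and `fden.entangler` and the (identity, style) factor ordering are illustrative assumptions.

```python
import torch

def translate(fden, z_a, z_b, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Interpolate the identity factor of two representations while
    keeping the style factor of the first (Figure 3)."""
    id_a, style_a = fden.decomposer(z_a)
    id_b, _ = fden.decomposer(z_b)
    z_bars = []
    for a in alphas:
        id_mix = (1 - a) * id_a + a * id_b  # linear interpolation
        z_bars.append(fden.entangler([id_mix, style_a]))
    return z_bars  # feed each z_bar to the fixed pre-trained generator
```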

Our results show that identity-relevant features are clearly aligned with the identity factors. For example, the first MS-Celeb-1M rows in Figure 4 show a clear interpolation between a woman and a man row-wise. Since we factorize a representation into only two factors, the style factor carries all information not relevant to identity. Thus, during interpolation between factors, we see multiple attributes changing together, such as rotation and brightness of the face and background. Although it is hard to distinguish which factor of variation changes while interpolating factors on the Omniglot and Oxford Flower data sets, we can notice that each step of interpolation results in a partially interpretable change. These observations show that FDEN can indeed decompose a latent representation into independent factors.

Also, comparing ALI's reconstructed images (1st row, 2nd column; 6th row, 3rd column) and FDEN's reconstructed images (1st row, 3rd column; 3rd row, 5th column), we observe that they are very similar. This shows that FDEN can indeed be plugged into a pre-trained network without reducing its performance on its original downstream task (additional high-resolution results are available in the Supplementary).

Few-shot Learning

                  Omniglot                  Mini-ImageNet
            5-way            20-way         5-way
         1-shot   5-shot     1-shot      1-shot   5-shot
[46]      98.1%    98.9%      93.8%       43.5%    55.3%
[44]      98.8%    99.7%      96.0%       49.4%    68.2%
FDENe     91.1%    99.0%      90.7%       49.4%    61.4%
MLP       80.3%    89.8%      65.2%       26.3%    37.2%
FDENf     88.3%    95.4%      82.6%       43.9%    48.6%
Table 1: N-way K-shot learning accuracy. FDENf is FDEN trained with the pre-trained network fixed, and FDENe is FDEN trained end-to-end with the pre-trained network. MLP is the baseline experiment with an MLP classifier using only the representation z.

There have been several approaches to evaluating a representation, most notably the disentanglement scores [33]. However, these metrics measure either the attribute-classification performance of each unit of a representation vector or its ability to maximize the mutual information, neither of which is applicable to our work. Thus, we chose few-shot learning performance as an alternative metric for evaluating how independent the factors are from each other.

For this experiment, the Alignment Network exploits an episodic learning scheme that is suitable for the few-shot learning environment. Each episode consists of N randomly sampled unique classes, K support samples per class, and a query sample from one of the classes. Given the support samples, the goal of few-shot learning is to predict which of the N unique classes the query sample belongs to. In the few-shot learning literature, this setup is generally called N-way, K-shot learning.

Here, we formally define the settings of episodic learning similar to those of [46]. First, we define an episode as a draw from the distribution $\mathcal{E}$ over all possible label sets, where a label set L contains N randomly chosen unique classes. Then, we define the support set S as a set of data-label pairs whose labels lie in L, and the query as a single data-label pair $(\hat{x}, \hat{y})$ with $\hat{y} \in L$. The objective of episodic learning is to match the query data-label pair with the support data-label pairs of the same label. Thus, we formulate the objective function of episodic learning as follows:

$\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{L \sim \mathcal{E}} \left[ \mathbb{E}_{S, (\hat{x}, \hat{y}) \sim L} \left[ \mathrm{CE}\left( P_{\theta}(y \mid \hat{x}, S), \hat{y} \right) \right] \right]$   (9)

where $\mathrm{CE}$ is the cross-entropy objective function between the predictions $P_{\theta}(y \mid \hat{x}, S)$ and the ground truths $\hat{y}$.
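
For reference, sampling one N-way K-shot episode can be sketched as follows; the `labels` dictionary mapping class ids to lists of sample indices is an assumed input format.

```python
import random

def sample_episode(labels, n_way=5, k_shot=1):
    """Sample one N-way K-shot episode: a support set of k_shot samples
    for each of n_way classes, plus one query from a random class."""
    classes = random.sample(list(labels), n_way)
    support = {c: random.sample(labels[c], k_shot) for c in classes}
    query_class = random.choice(classes)
    remaining = [i for i in labels[query_class]
                 if i not in support[query_class]]
    query = random.choice(remaining)
    return support, (query, query_class)
```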

We evaluate FDEN on few-shot learning to show that the decomposed identity factor successfully contains the identity information of the observed data. We validate our results on two data domains of varying complexity, Omniglot and Mini-ImageNet, and compare against works that exploit the episodic learning scheme ([46, 44], Table 1). One property of FDEN is that it only learns to exploit the latent space; in other words, FDEN has no information about the input data except the pre-trained model's representation of it. Thus, our baseline (denoted MLP) for this experiment is few-shot learning using only the representation z with an MLP classifier of the same structure as FDEN's Alignment Network. We report results with the pre-trained network fixed (denoted FDENf) and with end-to-end learning that fine-tunes both FDEN and the pre-trained network (denoted FDENe). Note that we used the same weights for both the image-to-image translation experiments in subsection 5.3.1 and the few-shot learning experiments. We evaluate all experiments on 1,000 episodes with unseen samples.

The results of FDENf and image-to-image translation show that the identity factors and style factors indeed contain information relevant to their factor of variation. End-to-end learning (i.e., FDENe), however, slightly degrades the quality of image-to-image translation while significantly improving few-shot learning performance. Although our end-to-end results are lower than those of the comparison methods, considering that FDEN is also performing image-to-image translation, we find these results reasonable.

5.4 Statistical Analysis

(a) Identity factors
(b) Style factors
Figure 5: t-SNE scatter plots of factors from the 5-way 1-shot Omniglot model. As shown by the dotted lines in (a), the identity factors are clearly clustered compared to the style factors in (b). Each plot consists of 5 unique classes with 20 samples per class (best viewed in color).
Figure 6: Representational Similarity Analysis (RSA) on units of the representation z and units of four factors from the Pioneer Network trained on the CelebA-128 data set. Values close to 0 indicate dissimilarity, while values far from 0 indicate similarity. There is high correlation between units within a factor and very low correlation between units of different factors, suggesting that the factors are indeed independent of each other (best viewed in color).

t-SNE To further analyze our results, we have drawn t-SNE scatter plots of factors from the 5-way 1-shot Omniglot model (Figure 5). The t-SNE plot of identity factors shows apparent clusters of samples of the same class, while the style factors show no visible clusters. This observation suggests that the identity factors are indeed aligned with identity information (in this case, a letter). On the other hand, a style factor carries all information independent of the identity factor and is not aligned with any single piece of information, hence the entanglement in the t-SNE plot.

Representational Similarity Analysis Representational Similarity Analysis (RSA) [26] is a data analysis framework for comparing the dissimilarity between two random variables. We computed a dissimilarity matrix of Pearson's correlation coefficients between each unit of every factor and each unit of the representation z against all other units (Figure 6). For the units of the representation z, we see high similarity among all units. For the factors, however, there is high correlation between units within a factor and very low correlation between units of different factors, suggesting that the factors are indeed independent of each other.
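
The correlation matrix itself is straightforward to compute with NumPy; below is a sketch under the assumption that unit activations are stacked as a (samples x units) array.

```python
import numpy as np

def rsa_matrix(units):
    """units: array of shape (n_samples, n_units). Returns the
    (n_units x n_units) matrix of Pearson correlations between units;
    entries near 0 indicate dissimilar (independent) units."""
    return np.corrcoef(units.T)

# Concatenate all factor activations along the unit axis and inspect
# the block structure: high |r| within a factor, near-zero across factors.
# mat = rsa_matrix(np.concatenate(factors, axis=1))
```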

5.5 Ablation Study

Without Gradient Reversal Layer

Figure 7: Mutual information training curves with and without the GRL. The Statisticians Network without the GRL minimizes the mutual information by first pretraining with negative $\mathcal{L}_{\mathrm{TC}}$ and then fine-tuning with positive $\mathcal{L}_{\mathrm{TC}}$.
Figure 8: Results of image-to-image translation without the Factorizer. The pre-trained network is a Pioneer Network pre-trained on CelebA-64. FDEN decomposes the representation into four factors, and the interpolation is performed for only one of the factors (the remaining factors are taken from the left image).

First, we replace the GRL, the component responsible for minimizing the mutual information (Figure 7). To minimize the mutual information without the GRL, we pretrain FDEN with negative $\mathcal{L}_{\mathrm{TC}}$ for 20,000 iterations and fine-tune with positive $\mathcal{L}_{\mathrm{TC}}$. The mutual information estimate for FDEN without the GRL stays around 0 for most of the training iterations, suggesting that the mutual information is not estimated properly throughout training. In contrast, the mutual information for FDEN with the GRL is very high at the beginning of training and then reduces to 0 after 20,000 iterations. This suggests that FDEN first learns to estimate the mutual information during the first 20,000 iterations and then begins to minimize it.

Without Factorizer

The Factorizer is responsible for factorizing a representation into independent and interpretable factors. Removing the Factorizer from FDEN essentially makes it an autoencoder with multiple streams in the middle. This autoencoder can reconstruct images well, but its factors are neither independent nor interpretable. When interpolating only one factor and fixing the others (Figure 8), we see multiple factors of variation, for example hair, lips, and rotation, changing together. In contrast, FDEN with the Factorizer (Figure 4) can interpolate factors separately.

6 Conclusion

In this work, we propose the Factors Decomposer-Entangler Network (FDEN), which learns to decompose a latent representation into independent factors. Our work opens the possibility of extending state-of-the-art models to solve different tasks while maintaining the performance of their original task.

Limitation Since the weights of the pre-trained network are fixed while training FDEN, the performance on downstream tasks is upper-bounded by the representational power of the pre-trained network. This upper bound is more apparent in image-to-image translation, since the translated images are combinations of the pre-trained network's reconstructed images (i.e., the second and sixth images in Figure 4), not of the data samples themselves (i.e., the first and last images in Figure 4). Recent literature has shown that GANs and autoencoders tend to leave out non-discriminative features during reconstruction [35]. A possible future direction for mitigating these limitations is to exploit the representation more closely at the level of individual units [2], rather than factors, for better reconstruction performance.

Acknowledgements

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence) and Kakao Corp. (Development of Algorithms for Deep Learning-Based One-/Few-shot Learning).

Supplementary Material

Appendix A Additional Results

A.1 Image-to-image Translation - Pioneer Network

We train FDEN using the CelebA-128 data set and a pre-trained Pioneer Network [18]. For this experiment, we train each of the classifiers with binary attributes of CelebA. To perform image-to-image translation, we first extract the mean of each factor over all training samples with the same ground truth (e.g., the mean of a factor over all training samples with ground truth 0). Then, the results below are reconstructions of the input images with one factor replaced by the mean factor of the opposite ground truth.
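
A hedged sketch of this procedure: compute the per-attribute mean of one factor over the training set, then swap it in before re-entangling. The `fden` module handles and the loader format are assumptions.

```python
import torch

def mean_factors_by_attribute(fden, loader, factor_idx):
    """Average a given factor over all training samples with attribute 0
    and with attribute 1, respectively."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for z, attr in loader:  # attr is the binary CelebA attribute
        factors = fden.decomposer(z)
        for v in (0, 1):
            mask = attr == v
            if mask.any():
                sums[v] = sums[v] + factors[factor_idx][mask].sum(0)
                counts[v] += int(mask.sum())
    return {v: sums[v] / counts[v] for v in (0, 1)}

# Translation sketch: replace one factor with the mean factor of the
# opposite attribute and re-entangle.
# factors[k] = means[1 - attr]; z_bar = fden.entangler(factors)
```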

Figure 9: Results of FDEN on the CelebA-128 data set.
Figure 10: Results of FDEN on the CelebA-128 data set.

A.2 Image-to-image Translation - ALI

The images in the first and last columns are the input images we are interested in translating. The images in the second and sixth columns are ALI's original reconstructions. The images in the middle are reconstructions with interpolated identity and style factors of the input images.

Figure 11: Additional results on MS-Celeb-1M data set.
Figure 12: Additional results on Omniglot data set.
Figure 13: Additional results on Oxford Flower data set.
Figure 14: Additional results on Mini-ImageNet data set.
Figure 15: Additional result on MS-Celeb-1M data set (interpolation between different celebrities).
Figure 16: Additional result on MS-Celeb-1M data set (interpolation between same celebrity).
Figure 17: Additional result on Oxford Flower data set (interpolation between different flowers).
Figure 18: Additional result on Oxford Flower data set (interpolation between same flower).

Appendix B Hyperparameters

B.1 FDEN

Operation                             Feature Maps   Batch Norm   Dropout   Activation
Input
Fully Connected                       512                         0.2       Leaky ReLU
Fully Connected                       512                         0.2       Leaky ReLU
Fully Connected                       512                         0.2       Leaky ReLU
Fully Connected                                                   0.2       Linear
Input
Fully Connected                       512                         0.2       Leaky ReLU
Fully Connected                       512                         0.2       Leaky ReLU
Fully Connected                                                   0.2       Linear
Input
Fully Connected                       512                         0.2       Leaky ReLU
Fully Connected                       256                         0.2       Leaky ReLU
Fully Connected                       64                          0.2       Leaky ReLU
Fully Connected                       1                           0.2       Linear
Input
Concatenate along the channel axis
Fully Connected                       1024                        0.2       Leaky ReLU
Fully Connected                       256                         0.2       Leaky ReLU
Fully Connected                       64                          0.2       Leaky ReLU
Fully Connected                       1                           0.2       Linear
Input
Fully Connected                       256                         0.2       Leaky ReLU
Fully Connected                       256                         0.2       Leaky ReLU
Fully Connected                                                   0.2       Linear
Input
Concatenate along the channel axis
Fully Connected                       512                         0.2       Leaky ReLU
Fully Connected                       512                         0.2       Leaky ReLU
Fully Connected                       512                         0.2       Leaky ReLU
Fully Connected                                                   0.2       Linear

Optimizer                 Adam
Batch size                16
Episodes per epoch        10,000
Epochs                    1,000
Leaky ReLU slope          0.01
Weight initialization     Truncated Normal
Loss weights
Omniglot - 256
MS-Celeb-1M, Mini-ImageNet, Oxford, CelebA - 512
Table 2: Model hyperparameters.

B.2 Adversarially Learned Inference

We chose ALI [12] as the invertible network of our framework. We used exactly the same hyperparameters presented in Appendix A of [12]. For training on the Omniglot data set, we used the model designed for unsupervised learning of SVHN. For training on the Mini-ImageNet, MS-Celeb-1M, and Oxford Flower data sets, we used the model designed for unsupervised learning of CelebA. Although [12] designed a model for a variant of ImageNet (Tiny ImageNet), our preliminary results showed that the CelebA model could synthesize better images with the Mini-ImageNet data set.

For training on the Mini-ImageNet, MS-Celeb-1M, and Oxford Flower data sets, we included a reconstruction loss between the input image and its reconstructed image. This resulted in steady convergence and better reconstructions.

B.3 Pioneer Network

We chose the Pioneer Network [18] for its state-of-the-art reconstruction performance. We use the pre-trained model for CelebA-128 publicly available on the authors' website.

Footnotes

  1. Code available at https://github.com/wltjr1007/FDEN

References

  1. D. Bau, B. Zhou, A. Khosla, A. Oliva and A. Torralba (2017) Network dissection: quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6541–6549.
  2. D. Bau, J. Zhu, J. Wulff, W. Peebles, H. Strobelt, B. Zhou and A. Torralba (2019) Seeing what a GAN cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4502–4511.
  3. J. Behrmann, D. Duvenaud and J. Jacobsen (2018) Invertible residual networks. In Proceedings of the International Conference on Machine Learning.
  4. M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville and D. Hjelm (2018) Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning, pp. 531–540.
  5. Y. Bengio, A. Courville and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
  6. D. Berthelot, T. Schumm and L. Metz (2017) BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717.
  7. L. Chen, H. Zhang, J. Xiao, W. Liu and S. Chang (2018) Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1043–1052.
  8. T. Q. Chen, X. Li, R. B. Grosse and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Proceedings of the Advances in Neural Information Processing Systems, pp. 2610–2620.
  9. X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
  10. P. Comon (1994) Independent component analysis, a new concept?. Signal Processing 36 (3), pp. 287–314.
  11. M. D. Donsker and S. S. Varadhan (1983) Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics 36 (2), pp. 183–212.
  12. V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky and A. Courville (2017) Adversarially learned inference. In Proceedings of the International Conference on Learning Representations.
  13. Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030.
  14. A. Gonzalez-Garcia, J. van de Weijer and Y. Bengio (2018) Image-to-image translation for cross-domain disentanglement. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1287–1298.
  15. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  16. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin and A. C. Courville (2017) Improved training of Wasserstein GANs. In Proceedings of the Advances in Neural Information Processing Systems, pp. 5767–5777.
  17. Y. Guo, L. Zhang, Y. Hu, X. He and J. Gao (2016) MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102.
  18. A. Heljakka, A. Solin and J. Kannala (2018) Pioneer networks: progressively growing generative autoencoder. In Asian Conference on Computer Vision, pp. 22–38.
  19. I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende and A. Lerchner (2018) Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230.
  20. I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed and A. Lerchner (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6.
  21. I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell and A. Lerchner (2017) DARLA: improving zero-shot transfer in reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 1480–1490.
  22. J. Jacobsen, A. Smeulders and E. Oyallon (2018) i-RevNet: deep invertible networks. In Proceedings of the International Conference on Learning Representations.
  23. C. Jutten and J. Karhunen (2003) Advances in nonlinear blind source separation. In Proc. of the 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA2003), pp. 245–256.
  24. H. Kim and A. Mnih (2018) Disentangling by factorising. In Proceedings of the International Conference on Machine Learning, pp. 4153–4171.
  25. J. B. Kinney and G. S. Atwal (2014) Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences 111 (9), pp. 3354–3359.
  26. N. Kriegeskorte, M. Mur and P. A. Bandettini (2008) Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience 2, pp. 4.
  27. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1097–1105.
  28. B. M. Lake, R. Salakhutdinov and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338.
  29. A. H. Liu, Y. Liu, Y. Yeh and Y. F. Wang (2018) A unified feature disentangler for multi-domain image translation and manipulation. In Proceedings of the Advances in Neural Information Processing Systems, pp. 2590–2599.
  30. Y. Liu, Y. Yeh, T. Fu, S. Wang, W. Chiu and Y. Frank Wang (2018) Detach and adapt: learning cross-domain disentangled deep representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8867–8876.
  31. Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan and X. Wang (2018) Exploring disentangled feature representation beyond face identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2080–2089.
  32. Z. Liu, P. Luo, X. Wang and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
  33. F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf and O. Bachem (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114–4124.
  34. X. Lu, Y. Tsao, S. Matsuda and C. Hori (2013) Speech enhancement based on deep denoising autoencoder. In Interspeech, pp. 436–440.
  35. P. Manisha and S. Gujar (2018) Generative adversarial networks (GANs): what it can generate and what it cannot?. arXiv preprint arXiv:1804.00140.
  36. M. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729.
  37. S. Ozair, C. Lynch, Y. Bengio, A. v. d. Oord, S. Levine and P. Sermanet (2019) Wasserstein dependency measure for representation learning. In Proceedings of the Advances in Neural Information Processing Systems Reproducibility Challenge.
  38. D. Pham (2004) Fast algorithms for mutual information based independent component analysis. IEEE Transactions on Signal Processing 52 (10), pp. 2690–2700.
  39. A. Radford, L. Metz and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  40. S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations.
  41. K. Ridgeway and M. C. Mozer (2018) Learning deep disentangled embeddings with the F-statistic loss. In Proceedings of the Advances in Neural Information Processing Systems, pp. 185–194.
  42. J. Schmidhuber (1992) Learning factorial codes by predictability minimization. Neural Computation 4 (6), pp. 863–879.
  43. T. Scott, K. Ridgeway and M. C. Mozer (2018) Adapted deep embeddings: a synthesis of methods for k-shot inductive transfer learning. In Proceedings of the Advances in Neural Information Processing Systems, pp. 76–85.
  44. J. Snell, K. Swersky and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
  45. P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (Dec), pp. 3371–3408.
  46. O. Vinyals, C. Blundell, T. Lillicrap and D. Wierstra (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
  47. A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik and S. Savarese (2018) Taskonomy: disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3712–3722.
  48. J. Zhu, T. Park, P. Isola and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232.