A Plug-in Method for Representation Factorization
Abstract
In this work, we focus on decomposing the latent representations in GANs or the learned feature representations in deep autoencoders into semantically controllable factors in a semi-supervised manner, without modifying the original trained models. Specifically, we propose a Factors Decomposer-Entangler Network (FDEN) that learns to decompose a latent representation into mutually independent factors. Given a latent representation, the proposed framework draws a set of interpretable factors, each aligned to an independent factor of variation by minimizing their total correlation in an information-theoretic manner. As a plug-in method, we have applied the proposed FDEN to the existing networks of Adversarially Learned Inference and Pioneer Network and conducted computer vision tasks of image-to-image translation in semantic ways, \eg, changing styles while keeping the identity of a subject, and object classification in a few-shot learning scheme. We have also validated the effectiveness of our method with various ablation studies involving qualitative, quantitative, and statistical examination.
1 Introduction
With the advances in deep learning and its successes in various applications, there has been great interest in interpreting or understanding learned feature representations. In particular, thanks to the generic framework of deep generative adversarial learning, we have powerful tools, \eg, the Generative Adversarial Network (GAN) [15] and its variants [6, 9, 16], to implicitly estimate the underlying data distribution in connection with a latent space. However, as the latent representation is highly entangled, it remains challenging to gain insights into or interpret such latent representations in an observation space (\eg, an image).
A representation is generally considered disentangled when it can capture interpretable semantic information or factors of variation underlying the problem structure [1]. Thus, the concept of disentangled representation is closely related to that of factorial representation [8, 24, 42], which claims that a unit of a disentangled representation should correspond to an independent factor of the observed data. For example, there are different factors that describe a facial image, such as gender, baldness, smiling, pose, identity, \etc. In this perspective, previous studies have also validated the effectiveness of disentangled representations in various tasks such as few-shot learning [41, 43, 21, 7], domain adaptation [47, 30], and image translation [14, 29, 8].
While learning a disentangled representation is desirable, this does not imply that an (entangled) latent representation is less powerful or lacks interpretability. In fact, various methods that do not consider disentanglement [39, 27] have achieved state-of-the-art performance in their respective domains.
In this work, given a pretrained deep model empowered with data generation, such as GANs [12, 6] or Deep AutoEncoders (DAEs) [18, 8], we focus on decomposing the latent representations in GANs or the learned feature representations in DAEs into semantically controllable factors in a semi-supervised manner, without modifying the original trained models.
Specifically, we devise a Factors Decomposer-Entangler Network (FDEN) that learns to decompose a representation into semantically independent factors in a semi-supervised manner. Given a latent or feature representation vector, the proposed network draws a set of interpretable factors (some of which are derived in a supervised way when such information for an input sample is available) that are made mutually independent in an information-theoretic manner. In addition, it can restore the independent factors back into the original representation, making FDEN an autoencoder-like architecture. The motivation behind this autoencoder-like architecture is to utilize the latent representation from a fixed pretrained model rather than to develop and train a disentangled representation from scratch. In doing so, we can focus our efforts solely on disentanglement while benefiting from the performance achieved by the pretrained model itself. Note that our method follows the general consensus on robust representation learning: (a) disentangle as many factors as possible, (b) maintain as much information of the original data as possible [5].
To evaluate our proposed framework, we perform qualitative, quantitative, and statistical examination of the factorized representation.
First, we measure the effectiveness of the factorized representation in downstream tasks by performing image-to-image translation in conjunction with few-shot learning.
Then, we examine how each component of FDEN works towards creating a factorized representation with exhaustive ablation studies and statistical analysis.
The main contributions of our work are threefold:

We propose a novel network, called the Factors Decomposer-Entangler Network (FDEN), that can be easily plugged into an existing network that is empowered with data generation.

Thanks to the factorization property, our network can be used for image-to-image translation in semantic ways, \eg, changing styles while keeping the identity of a subject, and for classification tasks in a few-shot learning scheme.

Our work opens the possibility of extending state-of-the-art models to solve different tasks without modifying their weights, so that the performance on the original task is maintained.
2 Related Works
Exploiting the Representation Vector There is a consensus [5, 19, 31] among researchers that a robust approach to representation learning is through disentanglement. To the best of our knowledge, previous work on disentangled representation has focused on unsupervised approaches that make each unit of a representation vector interpretable and independent of the other units [20, 24]. For example, Kim et al. [24] evaluate their representation by the classification performance of predicting which index of the representation corresponds to a factor of variation. However, recent observations have pointed out flaws in unsupervised approaches to disentanglement and suggested future work on (semi-)supervised approaches [33]. To this extent, Bau et al. [2, 1] take a more direct and semi-supervised approach to exploiting the units of a representation. Specifically, they propose ways to exploit the units of pretrained neural networks to independently turn factors of variation on or off. This is achieved by altering the value of a unit and analyzing the changes in classification performance. In a similar manner, our work approaches disentanglement through a semi-supervised factorial learning approach. However, we take the representation as a whole into account rather than a single unit of the representation.
Deep Learning Based Independent Component Analysis Embedding or restoring independent components in a representation has been an ongoing research topic in representation learning for decades [10, 23, 20]. There have been approaches to directly minimize the dependency between two random variables by means of adversarial learning [29, 31] and feature normalization [48]. With the advances of GANs, models exploiting mutual information [4, 9] and its variants [37, 20] have been proposed. These works are indirect approaches to independent component analysis that utilize the dual representation of mutual information to maximize the mutual dependency between a data sample and its representation vector. There have been several approaches to directly minimizing the mutual information, but they either cannot be applied to neural networks [38] or ignore the dual upper-bound term (i.e., the supremum term in Equation (3)). In contrast to these works, our work introduces a direct approach to minimizing the dependency between random variables that is applicable to most deep neural networks.
3 Preliminary
Mutual Information In information theory, mutual information is a measure of the dependency between two random variables and can be formulated as the Kullback-Leibler (KL) divergence as follows:
I(X; Z) = D_{\mathrm{KL}}\left( P_{XZ} \,\|\, P_X \otimes P_Z \right) (1)
where P_{XZ} denotes the joint probability distribution and P_X \otimes P_Z is the product of the marginal probability distributions P_X and P_Z. As it captures both linear and non-linear statistical dependency between variables, mutual information is believed to measure the true dependence [25]. Thus, we utilize mutual information in formulating our objective function as a means of non-linearly decomposing a latent representation.
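For intuition, the KL formulation above can be computed in closed form for discrete variables. The following sketch (pure Python, not part of FDEN itself) evaluates Equation (1) directly from a joint probability table:

```python
from math import log

def mutual_information(joint):
    """I(X;Z) = KL(P_XZ || P_X x P_Z) for a discrete joint distribution
    given as a nested list joint[x][z] of probabilities."""
    px = [sum(row) for row in joint]            # marginal P(X)
    pz = [sum(col) for col in zip(*joint)]      # marginal P(Z)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * log(p / (px[i] * pz[j]))  # KL summand
    return mi

# Independent variables: the joint factorizes, so I(X;Z) = 0.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Perfectly dependent variables: I(X;Z) = log 2 (one bit, in nats).
dependent = [[0.5, 0.0], [0.0, 0.5]]
```

The dependent case illustrates why mutual information captures non-linear dependence: it only requires the joint to differ from the product of marginals.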
Total Correlation Total correlation, or multi-information, is a variant of mutual information that can capture the dependency among multiple random variables. For example, the total correlation among a set of random variables f_1, \dots, f_K can be formulated as the KL divergence between the joint probability P(f_1, \dots, f_K) and the product of the marginal probabilities \prod_{i=1}^{K} P(f_i):
TC(f_1, \dots, f_K) = D_{\mathrm{KL}}\left( P(f_1, \dots, f_K) \,\middle\|\, \prod_{i=1}^{K} P(f_i) \right) (2)
In Subsection 4.3, we discuss how FDEN utilizes mutual information and total correlation in detail.
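Like mutual information, Equation (2) can be checked exactly on small discrete distributions. A minimal sketch (pure Python; the toy distributions are illustrative, not from the paper):

```python
from math import log
from itertools import product

def total_correlation(joint):
    """TC = KL(P(f1..fK) || prod_i P(fi)) for a discrete joint distribution
    given as {(v1, ..., vK): probability}."""
    K = len(next(iter(joint)))
    # Marginal P(f_i = v) for every variable i and value v.
    marg = [{} for _ in range(K)]
    for values, p in joint.items():
        for i, v in enumerate(values):
            marg[i][v] = marg[i].get(v, 0.0) + p
    tc = 0.0
    for values, p in joint.items():
        if p > 0:
            prod_marg = 1.0
            for i, v in enumerate(values):
                prod_marg *= marg[i][v]
            tc += p * log(p / prod_marg)
    return tc

# Three copies of one fair coin: fully dependent, TC = 2 log 2 nats.
copies = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# Three independent fair coins: TC = 0.
indep = {bits: 0.125 for bits in product([0, 1], repeat=3)}
```

Zero total correlation is exactly the mutual-independence property FDEN seeks for its factors.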
Donsker-Varadhan Representation of the KL Divergence Since mutual information and total correlation are intractable for continuous variables, we exploit a dual representation [11] for the KL divergence computation:
D_{\mathrm{KL}}(P \,\|\, Q) = \sup_{T \in \mathcal{F}} \; \mathbb{E}_{P}[T] - \log \mathbb{E}_{Q}\left[ e^{T} \right] (3)
where \mathcal{F} is a family of functions T parameterized by a neural network. For the full derivation of Equation (3), readers are referred to [4].
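The two defining properties of Equation (3) — any statistic T gives a lower bound, and the optimal T* = log(p/q) attains the supremum — can be verified numerically. A small sketch with hand-picked discrete distributions (illustrative values, not from the paper):

```python
from math import log, exp

def kl(p, q):
    """Exact KL(p || q) for discrete distributions given as lists."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dv_bound(p, q, T):
    """Donsker-Varadhan value E_P[T] - log E_Q[e^T] for a statistic T
    given as a list of per-outcome values."""
    e_p = sum(pi * ti for pi, ti in zip(p, T))
    log_e_q = log(sum(qi * exp(ti) for qi, ti in zip(q, T)))
    return e_p - log_e_q

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

# Any T gives a lower bound on KL(p || q) ...
loose = dv_bound(p, q, [1.0, 0.0, -1.0])
# ... and the optimal statistic T* = log(p/q) attains the supremum.
tight = dv_bound(p, q, [log(pi / qi) for pi, qi in zip(p, q)])
```

In MINE-style estimators [4], the list T is replaced by a neural network trained by gradient ascent on this bound; FDEN's Statisticians Network plays that role.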
4 Factors Decomposer-Entangler Network
The proposed Factors Decomposer-Entangler Network (FDEN) is a novel framework that can be plugged into a pretrained network empowered with data generation (\eg, GANs) or reconstruction (\eg, DAEs) and factorize its latent or feature representation z. Specifically, the goal of FDEN is to decompose an input representation z into independent and semantically interpretable factors without losing the original information in the latent or feature representation. To achieve this, we compose FDEN of three modules (Figure 2): a Decomposer, a Factorizer, and an Entangler. It is noteworthy that since FDEN uses a fixed pretrained network and deals with the latent or feature representation from that network, it allows us to focus solely on factorizing the input representation for other new tasks, while keeping the network’s capacity for its original tasks.
4.1 Latent or Feature Representation
Our FDEN has an autoencoder-like structure that takes a latent or feature representation from a pretrained network as input. For the pretrained network, we consider networks that are capable of generating or encoding-decoding observable samples (\eg, an image). In other words, we focus on deep networks that find a latent representation of the input space and can also reconstruct or generate a sample given its latent representation. Typical examples of such neural networks include bidirectional GANs [12, 6], autoencoders [45, 34], and invertible networks [3, 22].
4.2 Decomposer-Entangler
The Decomposer-Entangler network (Figure 2 (a) and (c), respectively) is an autoencoder-like architecture that takes a representation z as input and reconstructs the original representation. Specifically, the Decomposer takes a representation z as input and decodes it with a global decoder network. Then, the decoded representation is decomposed into a set of factors, each produced by its own local decoder network. The Entangler feeds the factors into their corresponding streams. These streams are then concatenated along the channel axis and fed into a global encoder to reconstruct the original representation \hat{z}. Since the goal of the Decomposer-Entangler network is to reconstruct the original representation, we introduce the reconstruction objective function \mathcal{L}_{recon}. Also, since the sample x and its representation z may or may not be bijective, we include a regularizer in the reconstruction objective function as follows:
\mathcal{L}_{recon} = \| z - \hat{z} \|_2^2 + \lambda \| x - \hat{x} \|_2^2 (4)
where \lambda is a constant weight term for the regularizer. Note that the fixed pretrained network takes \hat{z} as input to reconstruct its data \hat{x} (Figure 1).
At this point, a representation z is merely decomposed and reassembled into \hat{z} (for an ablation study on FDEN trained with only the reconstruction objective, refer to Subsection 5.2). Although these factors contain information about z, they are not aligned to specific factors of variation. In other words, the factors are neither independent nor do they carry any distinguishable information. Thus, in the next subsection we introduce a module called the Factorizer to imbue these factors with such information.
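The decompose-then-entangle data flow above can be sketched with plain linear maps. All sizes, the two-factor split, and the tanh nonlinearity are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 64-d representation split into 2 factors of 16-d each.
Z_DIM, F_DIM, N_FACTORS = 64, 16, 2

# Decomposer: a shared (global) decoder followed by one local head per factor.
W_global = rng.normal(scale=0.1, size=(Z_DIM, Z_DIM))
W_local = [rng.normal(scale=0.1, size=(Z_DIM, F_DIM)) for _ in range(N_FACTORS)]
# Entangler: one stream per factor, then a shared encoder back to z-space.
W_stream = [rng.normal(scale=0.1, size=(F_DIM, F_DIM)) for _ in range(N_FACTORS)]
W_enc = rng.normal(scale=0.1, size=(N_FACTORS * F_DIM, Z_DIM))

def decompose(z):
    h = np.tanh(z @ W_global)
    return [h @ W for W in W_local]          # list of factors [f_1, ..., f_K]

def entangle(factors):
    streams = [f @ W for f, W in zip(factors, W_stream)]
    # Concatenate the streams and map back to a reconstructed z_hat.
    return np.concatenate(streams, axis=-1) @ W_enc

def recon_loss(z):
    z_hat = entangle(decompose(z))
    return np.mean((z - z_hat) ** 2)          # the z-space term of Eq. (4)

z = rng.normal(size=(8, Z_DIM))               # a batch of representations
loss = recon_loss(z)
```

As the text notes, training this skeleton on the reconstruction loss alone splits z but does not make the factors independent; that is the Factorizer's job.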
4.3 Factorizer
The Factorizer uses an information-theoretic measure to make the factors independent and to give them distinguishable information. The general idea is to minimize the total correlation among all factors (via the Statisticians Network) while giving them relevant information using a set of classifiers (the Alignment Network).
Statisticians Network The first component of the Factorizer, the Statisticians Network, estimates the total correlation among factors in a one-versus-all scheme. Our goal is to minimize the total correlation among factors so that they are maximally and mutually independent of each other. We follow [4] (i.e., Equation (3)) to estimate the total correlation among factors:
\widehat{TC}(f_1, \dots, f_K) = \sup_{T \in \mathcal{F}} \; \mathbb{E}_{P(f_1, \dots, f_K)}[T] - \log \mathbb{E}_{\prod_i P(f_i)}\left[ e^{T} \right] (5)
where T is the Statisticians Network, P(f_1, \dots, f_K) is the joint distribution of all factors, and \prod_i P(f_i) is the product of the marginal distributions of the factors. We approximate the product of marginals by taking samples from the joint distribution and shuffling them i.i.d. along the batch axis independently for each factor. Although the latent representation is thereby factorized into independent factors, from a semantic point of view the decomposed factors are not necessarily interpretable. In this regard, we further consider minimal networks that help map them to human-understandable factors in a supervised manner.
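The batch-shuffling trick for the product of marginals is a one-liner in practice: permuting each factor's batch independently breaks the pairing between factors while preserving each factor's marginal distribution. A minimal sketch (toy shapes, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)

def shuffle_marginals(factors):
    """Approximate samples from the product of marginals by independently
    permuting each factor along the batch axis."""
    return [f[rng.permutation(len(f))] for f in factors]

batch = 6
# Three toy factors of shape (batch, 4) standing in for f_1, f_2, f_3.
factors = [np.arange(batch * 4).reshape(batch, 4).astype(float) for _ in range(3)]
shuffled = shuffle_marginals(factors)

# Joint samples: factors concatenated row-wise; marginal samples: the same,
# but rows of different factors are no longer paired.
joint = np.concatenate(factors, axis=1)
marginal = np.concatenate(shuffled, axis=1)
```

The statistics network is then fed `joint` for the first expectation in Equation (5) and `marginal` for the second.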
Alignment Networks The Alignment Network is designed to link each factor to one of the human-labelled factors (or attributes) in a supervised manner. Concretely, there is a set of classifiers that identify whether the input sample behind the latent representation has the target factor information or not. This supervised learning implicitly guides each factor to be aligned with one of the factor labels. The Statisticians Network makes the factors independent of each other, so when one factor holds information on a factor of variation, \eg, gender, the other factors will hold other, independent information. However, as there exists a huge number of factors that can possibly cause diverse variations in samples, it is not sufficient to consider the human-labelled attributes only. In this regard, we further consider another independent factor that is dedicated to other potential factors not specified in the human labels. This unspecified factor is trained in an unsupervised way, being involved only in the total correlation objective. To jointly train the Alignment Networks, except for the unspecified factor, we define the supervised loss function as follows:
\mathcal{L}_{align} = \sum_{i} \mathcal{L}_{CE}\left( C_i(f_i), y_i \right) (6)
It should be noted that the Alignment Network is capable of aligning factors with human-labelled attributes thanks to our Statisticians Network, which drives the factors to be independent through the total correlation objective. Further, the reconstruction loss ensures that any information loss is minimal, so that the decomposed factors retain enough information.
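Concretely, the alignment objective is a sum of per-factor classification losses, with the unspecified factor simply excluded from the sum. A minimal sketch (the softmax classifier heads and names are our assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the correct class."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def alignment_loss(factor_logits, factor_labels):
    """Sum of per-factor classification losses, as in Eq. (6); the
    unspecified factor has no classifier and is left out of the lists."""
    return sum(cross_entropy(l, y) for l, y in zip(factor_logits, factor_labels))

# One supervised factor with confident, correct logits vs. wrong logits.
good = alignment_loss([np.array([[10.0, 0.0], [0.0, 10.0]])], [np.array([0, 1])])
bad = alignment_loss([np.array([[0.0, 10.0], [10.0, 0.0]])], [np.array([0, 1])])
```

Each list entry corresponds to one factor-classifier pair C_i(f_i) with its attribute labels y_i.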
4.4 Learning
Here, we define the overall objective function for FDEN:
\mathcal{L} = \lambda_{r} \mathcal{L}_{recon} + \lambda_{a} \mathcal{L}_{align} - \lambda_{t} \widehat{TC} (7)
where the \lambda's are coefficients to weight the different losses, and the negative sign is due to the maximization of Equation (5) for its supremum term. Since we need to minimize both our objective and the dependency among factors, we introduce a workaround using a Gradient Reversal Layer [13] in the following paragraph.
Gradient Reversal Layer (GRL) Note that \widehat{TC} needs to be maximized to successfully estimate the dual representation of the KL divergence, but our goal is to minimize the dependency between factors. Thus, we add a Gradient Reversal Layer (GRL) [13] before the first layer of the Statisticians Network. In essence, the GRL multiplies the gradients by a negative constant during backpropagation. With the GRL in place, the Statisticians Network will maximize \widehat{TC} to estimate the total correlation, while the rest of the network will be guided towards minimizing the mutual dependency (for an analysis of the effectiveness of the GRL against another approach, refer to Subsection 5.5.1).
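A GRL is the identity on the forward pass and negates (and optionally scales) gradients on the backward pass. The following framework-free sketch makes that contract explicit; in an autodiff framework this would be a custom gradient function rather than a hand-written class:

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass; multiplies incoming gradients by a
    negative constant on the backward pass (a minimal GRL sketch)."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                      # activations pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # gradients are reversed

grl = GradientReversal(lam=1.0)
x = np.array([1.0, 2.0, 3.0])
y = grl.forward(x)                  # unchanged activations
g = grl.backward(np.ones_like(x))   # reversed gradients
```

Placed before the Statisticians Network, layers after the GRL ascend the DV bound while layers before it descend, which is exactly the min-max split described above.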
Adaptive Gradient Clipping Since \widehat{TC} is unbounded, its gradients can overwhelm the gradients of the other objective functions if left uncontrolled. To mitigate this, we apply adaptive gradient clipping [4]:
\hat{g} = \min\left( \frac{\| g_{\mathcal{L}} \|}{\| g_{TC} \|}, 1 \right) g_{TC} (8)
where \hat{g} is the adapted gradient, g_{TC} is the gradient of \widehat{TC} (positive due to the GRL), and g_{\mathcal{L}} is the gradient of the remaining objectives, since \widehat{TC} only backpropagates through the Decomposer and the Statisticians Network.
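The clipping rule amounts to rescaling the unbounded total-correlation gradient so its norm never exceeds that of the bounded objectives. A sketch of that rule (our reading of Eq. (8); in practice this runs per-parameter-group inside the training loop):

```python
import numpy as np

def adapt_gradient(g_tc, g_rest):
    """Rescale the unbounded total-correlation gradient so that its norm
    never exceeds the norm of the remaining objectives' gradient."""
    n_tc = np.linalg.norm(g_tc)
    n_rest = np.linalg.norm(g_rest)
    if n_tc <= n_rest or n_tc == 0.0:
        return g_tc                      # already small enough
    return (n_rest / n_tc) * g_tc        # shrink, keep direction
```

The direction of the TC gradient is preserved; only its magnitude is capped, so the mutual-dependency signal can never drown out the reconstruction and alignment losses.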
5 Experiment
In this section, we perform various experiments to evaluate the Factors Decomposer-Entangler Network (FDEN). Our goal is to show that each module of FDEN is effective in decomposing a latent representation into independent factors. Thus, we evaluate FDEN in a top-down manner, i.e., from the model down to individual units of a factor. First, we start off by performing a suite of ablation studies to see the effectiveness of each module of FDEN in factorizing a representation. Second, we evaluate the effectiveness of the factors on various downstream tasks. Finally, we analyze individual units of the factors to see whether a representation is indeed reasonably factorized.
5.1 Data sets
We evaluate the proposed FDEN on data sets from various domains: Omniglot (characters), MSCeleb1M (faces with identities), CelebA (faces with attributes), MiniImageNet (natural images), and Oxford Flower (flowers).
Omniglot The Omniglot [28] data set consists of 1,623 characters from 50 alphabets, where each character is drawn by 20 different people via Amazon’s Mechanical Turk. Following [46, 44], we partitioned the data set into 1,200 characters for training and the remaining 423 for testing. Also following [46, 44], we augmented the data set by rotating each character 90, 180, and 270 degrees, where each rotation is treated as a new character (i.e., 4,800 characters for the training data set and 1,692 characters for the testing data set).
MSCeleb1M Low-shot The MSCeleb1M [17] low-shot data set consists of facial images of 21,000 celebrities. This data set is partitioned (by [17]) into 20,000 celebrities for training and 1,000 celebrities for testing. There are an average of 58 images per celebrity in the training data set (a total of 1,155,175 images) and 5 images per celebrity in the test data set (a total of 5,000 images).
Oxford Flower The Oxford Flower [36] data set consists of images of 102 flower species, with 40 to 258 images per species. We split the data set by randomly selecting 82 flower species for training and 20 flower species for testing.
5.2 Implementation Details
Pretrained Networks For the pretrained networks, we utilize Adversarially Learned Inference (ALI) [12] and the Pioneer Network [18].
ALI is a bidirectional GAN that jointly learns a generation network and an inference network. We chose ALI for its simplicity of implementation and its ability to create powerful latent representations. For the MSCeleb1M, MiniImageNet, and Oxford Flower data sets, we replicated the model the authors designed for the CelebA data set. For the Omniglot data set, we replicated the model the authors designed for the SVHN data set.
The Pioneer Network [18] is a progressively growing autoencoder that can achieve high-quality reconstructions. We chose the Pioneer Network for its state-of-the-art reconstruction performance; apart from various GANs, it produced some of the highest-quality reconstructions we have found. We use the pretrained model for CelebA-128 publicly available on the authors’ website.
Factors Decomposer-Entangler Network FDEN consists of the Decomposer, Statisticians Network, Alignment Network, and Entangler, which are parameterized as fully connected networks. For the sake of simplicity and low model complexity, we kept each module to 3 or 4 fully connected layers with dropout, batch normalization, and a Leaky ReLU activation.
For details of hyperparameters, readers are referred to the Supplementary.
5.3 Downstream Task
Image-to-Image Translation
The goal of this experiment is to show the effectiveness of FDEN’s ability to decompose and reconstruct a latent representation. Given the representations of two samples, we perform image-to-image translation by linearly interpolating their identity factors while combining them with the style factors of different images (Figure 3). Without modifying the weights of the pretrained networks, we reconstruct a translated image from the re-entangled representation. For image-to-image translation, we evaluate our results on the Omniglot, MSCeleb1M, and Oxford Flower data sets using pretrained ALI (Figure 4).
Our results show that identity-relevant features are clearly aligned with the identity factors. For example, the first MSCeleb1M images in Figure 4 show a clear interpolation between a woman and a man row-wise. Since we only factorize a representation into two factors, the style factor carries all information not relevant to identity. Thus, during interpolation between factors, we see multiple attributes changing together, such as changes in rotation and in the brightness of the face and background. Although it is hard to distinguish which factor of variation changes while interpolating factors of the Omniglot and Oxford Flower data sets, we can notice that each step of interpolation results in a partially interpretable change. These observations show that FDEN can indeed decompose a latent representation into independent factors.
Also, comparing ALI’s reconstructed images (1st row 2nd column, 6th row 3rd column) and FDEN’s reconstructed images (1st row 3rd column, 3rd row 5th column), we can observe that they are very similar. This shows that FDEN can indeed be plugged into a pretrained network without reducing its performance on its original downstream task (additional high-resolution results are available in the Supplementary).
Few-shot Learning
Table 1. Few-shot classification accuracy.
Method  Omniglot 5-way 1-shot  Omniglot 5-way 5-shot  Omniglot 20-way 1-shot  MiniImageNet 5-way 1-shot  MiniImageNet 5-way 5-shot
[46]  98.1%  98.9%  93.8%  43.5%  55.3%
[44]  98.8%  99.7%  96.0%  49.4%  68.2%
FDEN-e  91.1%  99.0%  90.7%  49.4%  61.4%
MLP  80.3%  89.8%  65.2%  26.3%  37.2%
FDEN-f  88.3%  95.4%  82.6%  43.9%  48.6%
There have been several approaches to evaluating a representation, most notably the disentanglement scores [33]. However, these metrics either measure the performance of each unit of a representation vector in terms of its attribute classification performance or its ability to maximize mutual information, which is essentially not applicable to our work. Thus, we chose few-shot learning performance as an alternative metric for evaluating how independent the factors are from each other.
For this experiment, the Alignment Network exploits an episodic learning scheme that is suitable for the few-shot learning environment. Each episode consists of N randomly sampled unique classes, K support samples per class, and a query sample from one of the N classes. Given the support samples, the goal of few-shot learning is to predict which of the N unique classes the query sample belongs to. In the few-shot learning literature, this setup is generally called N-way, K-shot learning.
Here, we formally define the settings of episodic learning similarly to [46]. First, we define an episode as a draw from the distribution T over all possible label sets, where a label set L contains batches of randomly chosen unique classes. Then, we define S as the support set with data-label pairs, and B as the batches of a single data-label pair. The objective of episodic learning is to match a query data-label pair with the support data-label pair of the same label. Thus, we formulate the objective function of episodic learning as follows:
\mathcal{L}_{episodic} = \mathbb{E}_{L \sim T}\left[ \mathbb{E}_{S \sim L,\, B \sim L}\left[ \mathcal{L}_{CE}\left( \hat{y}, y \right) \right] \right] (9)
where \mathcal{L}_{CE} is the cross-entropy objective function between the predictions \hat{y} and the ground truths y.
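The episodic sampling loop described above can be sketched as follows. The function and data names are our own for illustration; the Omniglot-like class structure is only an example:

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, seed=None):
    """Sample one N-way, K-shot episode: N classes, K support items per
    class, and a query item drawn from one of the N classes."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(data_by_class), n_way)
    support = {c: rng.sample(data_by_class[c], k_shot) for c in classes}
    query_class = rng.choice(classes)
    # The query must not be one of its class's support items.
    pool = [x for x in data_by_class[query_class]
            if x not in support[query_class]]
    return support, rng.choice(pool), query_class

# Hypothetical toy data: 20 classes with 20 samples each.
data = {c: [f"class{c}_img{i}" for i in range(20)] for c in range(20)}
support, query, label = sample_episode(data, n_way=5, k_shot=1, seed=0)
```

Each sampled episode yields one term of the outer expectation in Equation (9): the model classifies `query` against the N support classes and incurs a cross-entropy loss.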
We evaluate FDEN on few-shot learning to show that the decomposed identity factor successfully contains the identity information of the observed data. We validate our results on two data domains of varying complexity, Omniglot and MiniImageNet, and compare our results with works that exploit the episodic learning scheme ([46, 44], Table 1). One property of FDEN is that it only learns to exploit the latent space. In other words, FDEN has no information about the input data except for a pretrained model’s representation of it. Thus, our baseline for this experiment (denoted as MLP) is the few-shot learning performance using only the representation z with an MLP classifier of the same structure as FDEN’s Alignment Network. We report our results with the pretrained network fixed (denoted as FDEN-f) and with end-to-end learning by fine-tuning both FDEN and the pretrained network (denoted as FDEN-e). Note that we used the same weights for both the image-to-image translation experiments in Subsection 5.3.1 and the few-shot learning experiments. We evaluate our results on 1,000 episodes with unseen samples for all experiments.
The results of FDEN-f and image-to-image translation show that the identity factors and style factors indeed contain information relevant to their factors of variation. End-to-end learning (i.e., FDEN-e) slightly degrades the quality of image-to-image translation, with the benefit of significantly improving few-shot learning performance. Although our results in the end-to-end experiments are lower than those of the compared methods, considering that FDEN is also performing image-to-image translation, we find our results reasonable.
5.4 Statistical Analysis
t-SNE To further analyze our results, we have drawn t-SNE scatter plots with factors from the 5-way 1-shot Omniglot model (Figure 5). The t-SNE plot for identity factors shows apparent clusters of samples of the same class, while the style factors show no visible clusters. This observation suggests that the identity factors are indeed aligned to identity information (in this case, a letter). On the other hand, a style factor consists of all information independent of the identity factor and is not aligned to any single piece of information, hence the entanglement in the t-SNE plot.
Representation Similarity Analysis Representation Similarity Analysis (RSA) [26] is a data analysis framework for comparing the dissimilarity between two random variables. We have drawn a dissimilarity matrix by computing the Pearson correlation coefficient for each unit of all factors and each unit of the representation z against all other units (Figure 6). For the RSA on the units of the representation z, we see high similarity among all units. For the factors, in contrast, there is high correlation between units within a factor and very low correlation with the units of other factors, suggesting that the factors do indeed show mutual independence.
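The dissimilarity matrix above is one minus the pairwise Pearson correlation over units. A minimal sketch with synthetic units (the toy signals are illustrative, not the paper's data):

```python
import numpy as np

def rsa_dissimilarity(units):
    """Dissimilarity matrix 1 - Pearson r between every pair of units,
    where units is an (n_units, n_samples) array."""
    return 1.0 - np.corrcoef(units)

rng = np.random.default_rng(1)
base = rng.normal(size=100)
# Two units driven by the same signal (as units within one factor would be)
# and one independent unit (as a unit of a different factor would be).
units = np.stack([base + 0.01 * rng.normal(size=100),
                  base + 0.01 * rng.normal(size=100),
                  rng.normal(size=100)])
D = rsa_dissimilarity(units)
```

Within-factor unit pairs land near 0 in D, while pairs across independent factors land near 1, which is the pattern Figure 6 reports.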
5.5 Ablation Study
Without Gradient Reversal Layer
First, we start by replacing the GRL, which is the component responsible for minimizing the mutual dependency (Figure 7). To minimize the dependency without the GRL, we pretrain FDEN with a negated \widehat{TC} term for 20,000 iterations and fine-tune with the positive term. The mutual information estimate for FDEN without the GRL stays around 0 for most of the training iterations, suggesting that the mutual information is not estimated properly throughout the training procedure. In contrast, the mutual information estimate for FDEN with the GRL is very high at the beginning of training and then reduces to 0 after 20,000 iterations. This suggests that FDEN indeed learns to estimate the mutual information in the first 20,000 iterations and begins to minimize it afterwards.
Without Factorizer
The Factorizer is responsible for factorizing a representation into independent and interpretable factors. Removing the Factorizer from FDEN essentially turns it into an autoencoder with multiple streams in the middle. Although this autoencoder can reconstruct images well, its factors are neither independent nor interpretable. By interpolating only one factor and fixing the other factors (Figure 8), we can see multiple factors of variation changing together, for example hair, lips, and rotation. In contrast, FDEN with the Factorizer (Figure 4) can interpolate factors separately.
6 Conclusion
In this work, we proposed the Factors Decomposer-Entangler Network (FDEN), which learns to decompose a latent representation into independent factors. Our work opens the possibility of extending state-of-the-art models to solve different tasks while maintaining the performance on their original tasks.
Limitation Since the weights of the pretrained network are fixed while training FDEN, the performance on downstream tasks is upper-bounded by the representative power of the pretrained network. This upper bound is more apparent in image-to-image translation, since the translated images are combinations of the reconstructed images of the pretrained network (i.e., the second and sixth images in Figure 4), not of the data samples (i.e., the first and last images in Figure 4). Recent literature has shown that GANs and autoencoders have a tendency to leave out non-discriminative features during reconstruction [35]. A possible future direction for mitigating this limitation is to exploit the representation more closely at the level of units [2] rather than factors for better reconstruction performance.
Acknowledgements
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017001779, A machine learning and statistical inference framework for explainable artificial intelligence) and Kakao Corp. (Development of Algorithms for Deep LearningBased One/Fewshot Learning).
Supplementary Material
Appendix A Additional Results
A.1 Image-to-Image Translation: Pioneer Network
We train FDEN using the CelebA-128 data set and a pretrained Pioneer Network [18]. For this experiment, we train each classifier with binary attributes of CelebA. To perform image-to-image translation, we first extract the mean of each factor over all training samples with the same ground truth (e.g., the mean of a factor over all training samples with a ground truth of 0). Then, the results below are reconstructions of input images with one factor replaced by the mean factor of the opposite ground truth.
A.2 Image-to-Image Translation: ALI
Images in the first and last columns are the input images that we are interested in translating. Images in the second and sixth columns are ALI’s original reconstructions. Images in the middle are the results of reconstruction with interpolated identity and style factors of the input images.
Appendix B Hyperparameters
B.1 FDEN
Operation  Feature Maps  Batch Norm  Dropout  Activation  

input  
Fully Connected  512  0.2  Leaky ReLU  
Fully Connected  512  0.2  Leaky ReLU  
Fully Connected  512  0.2  Leaky ReLU  
Fully Connected  0.2  Linear  
input  
Fully Connected  512  0.2  Leaky ReLU  
Fully Connected  512  0.2  Leaky ReLU  
Fully Connected  0.2  Linear  
input  
Fully Connected  512  0.2  Leaky ReLU  
Fully Connected  256  0.2  Leaky ReLU  
Fully Connected  64  0.2  Leaky ReLU  
Fully Connected  1  0.2  Linear  
input  
Concatenate along the channel axis  
Fully Connected  1024  0.2  Leaky ReLU  
Fully Connected  256  0.2  Leaky ReLU  
Fully Connected  64  0.2  Leaky ReLU  
Fully Connected  1  0.2  Linear  
input  
Fully Connected  256  0.2  Leaky ReLU  
Fully Connected  256  0.2  Leaky ReLU  
Fully Connected  0.2  Linear  
input  
Concatenate along the channel axis  
Fully Connected  512  0.2  Leaky ReLU  
Fully Connected  512  0.2  Leaky ReLU  
Fully Connected  512  0.2  Leaky ReLU  
Fully Connected  0.2  Linear  
Optimizer  Adam  
Batch size  16  
Episodes per epoch  10,000  
Epochs  1,000  
Leaky ReLU slope  0.01  
Weight initialization  Truncated Normal ()  
Loss weights  

B.2 Adversarially Learned Inference
We chose ALI [12] as the invertible network of our framework. We used exactly the same hyperparameters presented in Appendix A of [12]. For the Omniglot data set, we used the model designed for unsupervised learning of SVHN. For the MiniImageNet, MSCeleb1M, and Oxford Flower data sets, we used the model designed for unsupervised learning of CelebA. Although [12] designed a model for a variant of ImageNet (Tiny ImageNet), our preliminary results showed that the CelebA model could synthesize better images with the MiniImageNet data set.
For training on the Mini-ImageNet, MS-Celeb-1M, and Oxford Flowers data sets, we also included a reconstruction loss between the input image and its reconstruction. This resulted in steadier convergence and better reconstructions.
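A minimal sketch of such a reconstruction term; the exact form (L1 vs. L2) is not specified above, so the mean squared error used here is an assumption:

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    # Pixel-wise mean squared error between the input image and its
    # reconstruction (L2 is an assumption; L1 would work analogously).
    return np.mean((x - x_hat) ** 2)

# This scalar would be added to ALI's adversarial objective when training
# on Mini-ImageNet, MS-Celeb-1M, and Oxford Flowers.
```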
B.3 Pioneer Network
We chose Pioneer Network [18] for its state-of-the-art reconstruction performance. We used the pretrained model for CelebA-128 made publicly available on the authors' website.
Footnotes
 Code available at https://github.com/wltjr1007/FDEN
References
 (2017) Network dissection: quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6541–6549. Cited by: §1, §2.
 (2019) Seeing what a GAN cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4502–4511. Cited by: §2, §6.
 (2018) Invertible residual networks. In Proceedings of the International Conference on Machine Learning. Cited by: §4.1.
 (2018) Mutual information neural estimation. In Proceedings of the International Conference on Machine Learning, pp. 531–540. Cited by: §2, §3, §4.3, §4.4.
 (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828. Cited by: §1, §2.
 (2017) BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717. Cited by: §1, §1, §4.1.
 (2018) Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1043–1052. Cited by: §1.
 (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §1, §1.
 (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180. Cited by: §1, §2.
 (1994) Independent component analysis, a new concept? Signal Processing 36 (3), pp. 287–314. Cited by: §2.
 (1983) Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics 36 (2), pp. 183–212. Cited by: §3.
 (2017) Adversarially learned inference. In Proceedings of the International Conference on Learning Representations. Cited by: §B.2, §1, §4.1, §5.2.
 (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §4.4, §4.4.
 (2018) Image-to-image translation for cross-domain disentanglement. In Advances in Neural Information Processing Systems, pp. 1287–1298. Cited by: §1.
 (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1.
 (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §1.
 (2016) MS-Celeb-1M: a dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102. Cited by: §5.1.
 (2018) Pioneer networks: progressively growing generative autoencoder. In Asian Conference on Computer Vision, pp. 22–38. Cited by: §A.1, §B.3, §1, §5.2, §5.2.
 (2018) Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230. Cited by: §2.
 (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations. Cited by: §2, §2.
 (2017) DARLA: improving zero-shot transfer in reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 1480–1490. Cited by: §1.
 (2018) i-RevNet: deep invertible networks. In Proceedings of the International Conference on Learning Representations. Cited by: §4.1.
 (2003) Advances in nonlinear blind source separation. In Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), pp. 245–256. Cited by: §2.
 (2018) Disentangling by factorising. In Proceedings of the International Conference on Machine Learning, pp. 4153–4171. Cited by: §1, §2.
 (2014) Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences 111 (9), pp. 3354–3359. Cited by: §3.
 (2008) Representational similarity analysis: connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience 2, pp. 4. Cited by: §5.4.
 (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1.
 (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §5.1.
 (2018) A unified feature disentangler for multi-domain image translation and manipulation. In Advances in Neural Information Processing Systems, pp. 2590–2599. Cited by: §1, §2.
 (2018) Detach and adapt: learning cross-domain disentangled deep representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8867–8876. Cited by: §1.
 (2018) Exploring disentangled feature representation beyond face identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2080–2089. Cited by: §2, §2.
 (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV). Cited by: §5.1.
 (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the International Conference on Machine Learning, pp. 4114–4124. Cited by: §2, §5.3.2.
 (2013) Speech enhancement based on deep denoising autoencoder. In Interspeech, pp. 436–440. Cited by: §4.1.
 (2018) Generative adversarial networks (GANs): what it can generate and what it cannot? arXiv preprint arXiv:1804.00140. Cited by: §6.
 (2008) Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. Cited by: §5.1.
 (2019) Wasserstein dependency measure for representation learning. In Proceedings of the Advances in Neural Information Processing Systems Reproducibility Challenge. Cited by: §2.
 (2004) Fast algorithms for mutual information based independent component analysis. IEEE Transactions on Signal Processing 52 (10), pp. 2690–2700. Cited by: §2.
 (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §1.
 (2016) Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations. Cited by: §5.1.
 (2018) Learning deep disentangled embeddings with the f-statistic loss. In Advances in Neural Information Processing Systems, pp. 185–194. Cited by: §1.
 (1992) Learning factorial codes by predictability minimization. Neural Computation 4 (6), pp. 863–879. Cited by: §1.
 (2018) Adapted deep embeddings: a synthesis of methods for k-shot inductive transfer learning. In Advances in Neural Information Processing Systems, pp. 76–85. Cited by: §1.
 (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: §5.1, §5.3.2, Table 1.
 (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (Dec), pp. 3371–3408. Cited by: §4.1.
 (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638. Cited by: §5.1, §5.3.2, §5.3.2, Table 1.
 (2018) Taskonomy: disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3712–3722. Cited by: §1.
 (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232. Cited by: §2.