High-Fidelity Synthesis with Disentangled Representation
Abstract
Learning disentangled representations of data without supervision is an important step towards improving the interpretability of generative models. Despite recent advances in disentangled representation learning, existing approaches often suffer from the trade-off between representation learning and generation performance (\ie improving generation quality sacrifices disentanglement performance). We propose an Information-Distillation Generative Adversarial Network (ID-GAN), a simple yet generic framework that easily incorporates the existing state-of-the-art models for both disentanglement learning and high-fidelity synthesis. Our method learns disentangled representation using VAE-based models, and distills the learned representation with an additional nuisance variable to the separate GAN-based generator for high-fidelity synthesis. To ensure that both generative models are aligned to render the same generative factors, we further constrain the GAN generator to maximize the mutual information between the learned latent code and the output. Despite the simplicity, we show that the proposed method is highly effective, achieving comparable image generation quality to the state-of-the-art methods using the disentangled representation. We also show that the proposed decomposition leads to an efficient and stable model design, and we demonstrate photo-realistic high-resolution image synthesis results (1024x1024 pixels) for the first time using the disentangled representations.
1 Introduction
Learning a compact and interpretable representation of data without supervision is important to improve our understanding of data and machine learning systems. Recently, it has been suggested that a disentangled representation, which represents data using independent factors of variation, can improve the interpretability and transferability of the representation [5, 1, 50]. Among various use cases of disentangled representation, we are particularly interested in its application to generative models, since it allows users to specify the desired properties in the output by controlling the generative factors encoded in each latent dimension. There are increasing demands for such generative models in various domains, such as image manipulation [20, 30, 27], drug discovery [15], ML fairness [11, 35], \etc.
Most prior works on unsupervised disentangled representation learning formulate the problem as a constrained generative modeling task. Based on well-established frameworks, such as the Variational Autoencoder (VAE) or the Generative Adversarial Network (GAN), they introduce additional regularization to encourage the axes of the latent manifold to align with independent generative factors in the data. Approaches based on VAE [17, 9, 25, 7] augment its objective function to favor a factorized latent representation by adding implicit [17, 7] or explicit penalties [25, 9]. On the other hand, approaches based on GAN [10] propose to regularize the generator such that it increases the mutual information between the input latent code and its output.
One major challenge in the existing approaches is the trade-off between learning disentangled representations and generating realistic data. VAE-based approaches are effective in learning useful disentangled representations in various tasks, but their generation quality is generally worse than the state of the art, which limits their applicability to the task of realistic synthesis. On the other hand, GAN-based approaches can achieve high-quality synthesis with a more expressive decoder and without explicit likelihood estimation [10]. However, they tend to learn comparatively more entangled representations than their VAE counterparts [17, 25, 9, 7] and are notoriously difficult to train, even with recent techniques to stabilize the training [25, 53].
To circumvent this trade-off, we propose a simple and generic framework to combine the benefits of disentangled representation learning and high-fidelity synthesis. Unlike the previous approaches that address both problems jointly with a single objective, we formulate two separate but successive problems: we first learn a disentangled representation using a VAE, and then distill the learned representation to a GAN for high-fidelity synthesis. The distillation is performed from the VAE to the GAN by transferring the inference model, which provides a meaningful latent distribution, rather than a simple Gaussian prior, and ensures that both models are aligned to render the same generative factors. Such decomposition also naturally allows a layered approach to learning the latent representation: first learning major disentangled factors by the VAE, then learning the missing (entangled) nuisance factors by the GAN. We refer to the proposed method as the Information Distillation Generative Adversarial Network (ID-GAN).
Despite its simplicity, the proposed ID-GAN is extremely effective in addressing the previous challenges, achieving high-fidelity synthesis using the learned disentangled representation (\eg 1024x1024 images). We also show that such decomposition leads to a practically efficient model design, allowing the models to learn the disentangled representation from low-resolution images and transfer it to synthesize high-resolution images.
The contributions of this paper are as follows:

We propose ID-GAN, a simple yet effective framework that combines the benefits of disentangled representation learning and high-fidelity synthesis.

The decomposition of the two objectives enables plug-and-play-style adoption of state-of-the-art models for both tasks, and efficient training by learning models for disentanglement and synthesis using low- and high-resolution images, respectively.

Extensive experimental results show that the proposed method achieves state-of-the-art results in both disentangled representation learning and synthesis over a wide range of tasks from synthetic to complex datasets.
2 Related Work
Disentanglement learning.
Unsupervised disentangled representation learning aims to discover a set of generative factors, each element of which encodes a unique and independent factor of variation in data. To this end, most prior works based on VAE [17, 25, 9] and GAN [10, 21, 33, 32] focused on designing the loss function to encourage the factorization of the latent code. Despite some encouraging results, however, these approaches have been mostly evaluated on simple and low-resolution images [40, 36]. We believe that improving the generation quality of disentanglement learning is important, since it not only increases the practical impact in real-world applications, but also helps us to better assess the disentanglement quality on complex and natural images where quantitative evaluation is difficult. Although there are increasing recent efforts to improve the generation quality with disentanglement learning [21, 44, 33, 32], they often come with degraded disentanglement performance [10], rely on a specific inductive bias (\eg 3D transformation [44]), or are limited to low-resolution images [33, 32, 21]. On the contrary, our work aims to investigate a general framework to improve the generation quality without a representation learning trade-off, while being general enough to incorporate various methods and inductive biases. We emphasize that this contribution is complementary to the recent efforts in designing better inductive biases or supervision for disentanglement learning [52, 43, 47, 37, 8]. In fact, our framework is applicable to a wide variety of disentanglement learning methods and can incorporate them in a plug-and-play style as long as they have an inference model (\eg nonlinear ICA [24]).
Combined VAE/GAN models.
There have been extensive attempts in the literature toward building hybrid models of VAE and GAN [28, 4, 19, 54, 6]. These approaches typically learn to represent and synthesize data by combining the VAE and GAN objectives and optimizing them jointly in an end-to-end manner. Our method is an instantiation of this model family, but is differentiated from the prior work in that (1) the training of the VAE and the GAN is decomposed into two separate tasks, and (2) the VAE is used to learn a specific conditioning variable (\ie disentangled representation) for the generator, while the previous methods assume the availability of an additional conditioning variable [4] or use the VAE to learn the entire (entangled) latent distribution [28, 19, 54, 6]. In addition, extending the previous VAE-GAN methods to incorporate disentanglement constraints is not straightforward, as the VAE and GAN objectives are tightly entangled in them. In the experiments, we demonstrate that applying existing hybrid models to our task typically suffers from a suboptimal trade-off between generation quality and disentanglement performance, and they perform much worse than our method.
3 Background: Disentanglement Learning
The objective of unsupervised disentanglement learning is to describe each observation $x$ using a set of statistically independent generative factors $c$. In this section, we briefly review prior works and discuss their advantages and limitations.
The state-of-the-art approaches in unsupervised disentanglement learning are largely based on the Variational Autoencoder (VAE). They rewrite its original objective and derive regularizations that encourage the disentanglement of the latent variables. For instance, $\beta$-VAE [17] proposes to optimize the following modified Evidence Lower-Bound (ELBO) of the marginal log-likelihood:
(1) $\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big)$
where setting $\beta = 1$ reduces to the original VAE. By forcing the variational posterior $q_\phi(z|x)$ to be closer to the factorized prior $p(z)$ ($\beta > 1$), the model learns a more disentangled representation, but at a sacrifice of generation quality, since it also decreases the mutual information between $x$ and $z$ [9, 25]. To address this trade-off and improve the generation quality, recent approaches propose to gradually anneal the penalty on the KL-divergence [7], or to decompose it to isolate the penalty on total correlation [51] that encourages the statistical independence of latent variables [1, 9, 25].
Approaches based on VAE have been shown to be effective in learning disentangled representations over a range of tasks from synthetic [40] to complex datasets [34, 3]. However, their generation performance is generally insufficient to achieve high-fidelity synthesis, even with recent techniques isolating the factorization of the latent variable [25, 9]. We argue that this problem is fundamentally attributed to two reasons: First, most VAE-based approaches assume fully-independent generative factors [17, 25, 9, 36, 50, 39]. This strict assumption oversimplifies the latent manifold and may cause the loss of useful information (\eg correlated factors) for generating realistic data. Second, they typically utilize a simple generator, such as the factorized Gaussian decoder, and learn a unimodal mapping from the latent to the input space. Although this might be useful for learning meaningful representations [7] (\eg capturing a structure in local modes), such a decoder makes it difficult to render complex patterns in outputs (\eg textures).
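To make the $\beta$-weighted objective of Eq. (1) concrete, it can be sketched in a few lines using the closed-form KL between a diagonal Gaussian posterior and a standard normal prior. This is an illustrative sketch with hypothetical function names, not the paper's implementation:

```python
import math

def kl_diag_gaussian_to_std_normal(mu, logvar):
    """D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def beta_vae_loss(recon_log_likelihood, mu, logvar, beta=4.0):
    """Negative modified ELBO: reconstruction term minus beta-weighted KL.

    beta = 1.0 recovers the original VAE objective; beta > 1 strengthens the
    pressure toward the factorized prior and encourages disentanglement.
    """
    return -recon_log_likelihood + beta * kl_diag_gaussian_to_std_normal(mu, logvar)
```

With `beta=1` this is the standard negative ELBO; raising `beta` strengthens the pull toward the factorized prior, which improves disentanglement at the cost of reconstruction (and hence generation) quality, as discussed above.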
4 High-Fidelity Synthesis via Distillation
Our objective is to build a generative model that produces high-fidelity output with an interpretable latent code (\ie disentangled representation). To achieve this goal, we build our framework upon VAE-based models due to their effectiveness in learning disentangled representations. However, the discussion in the previous section suggests that disentanglement learning in VAE leads to a sacrifice of generation quality due to the strict constraints on fully-factorized latent variables and the utilization of simple decoders. We aim to improve the VAE-based models by enhancing generation quality while maintaining their disentanglement learning performance.
Our main idea is to decompose the objectives of learning disentangled representation and generating realistic outputs into separate but successive learning problems. Given a disentangled representation learned by VAEs, we train another network with a much higher modeling capacity (\eg GAN generator) to decode the learned representation to a realistic sample in the observation space.
Figure 1 describes the overall framework of the proposed algorithm. Formally, let $z = (s, c)$ denote the latent variable composed of the disentangled variable $c$ and the nuisance variable $s$, capturing independent and correlated factors of variation, respectively. In the proposed framework, we first train a VAE (\eg Eq. (1)) to learn disentangled latent representations of data, where each observation $x$ can be projected to $c$ by the learned encoder $q_\phi(c|x)$ after the training. Then in the second stage, we fix the encoder and train a generator $G$ for high-fidelity synthesis while distilling the learned disentanglement by optimizing the following objective:
(2) $\min_G \max_D \mathcal{L}_{\text{GAN}}(D, G) - \lambda\, \mathcal{R}_{\text{ID}}(G)$
(3) $\mathcal{L}_{\text{GAN}}(D, G) = \mathbb{E}_{x\sim p(x)}[\log D(x)] + \mathbb{E}_{s\sim p(s),\, c\sim q_\phi(c)}[\log(1 - D(G(s, c)))]$
(4) $\mathcal{R}_{\text{ID}}(G) = \mathbb{E}_{s\sim p(s),\, c\sim q_\phi(c)}[\log q_\phi(c \mid G(s, c))] + H_c$
where $q_\phi(c) = \mathbb{E}_{p(x)}[q_\phi(c|x)]$ is the aggregated posterior [38, 18, 49] of the encoder network, and $\lambda$ controls the strength of the information distillation.
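To illustrate what the distillation term rewards, the following toy sketch estimates $\mathbb{E}[\log q_\phi(c \mid G(s,c))]$ of Eq. (4) by Monte Carlo in a 1-D world. The encoder, the generators, and all constants here are hypothetical stand-ins for illustration, not the paper's models:

```python
import math, random

def log_q_c_given_x(c, x):
    """Hypothetical learned encoder: q(c|x) = N(c; x, 0.1^2) in a toy 1-D world
    where the 'image' x directly carries the disentangled factor."""
    sigma = 0.1
    return -0.5 * ((c - x) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def distillation_reward(generator, n=2000, seed=0):
    """Monte-Carlo estimate of E_{c,s}[log q(c | G(s,c))] (the distillation term,
    up to the constant H_c): high when the generator's output stays consistent
    with the latent code it was given."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        c = rng.gauss(0.0, 1.0)  # disentangled code from the (aggregated) posterior
        s = rng.gauss(0.0, 1.0)  # nuisance code from the prior
        total += log_q_c_given_x(c, generator(s, c))
    return total / n

aligned = lambda s, c: c + 0.05 * s   # output tracks the code c
misaligned = lambda s, c: s           # output ignores c entirely
```

A generator that renders the factors encoded in $c$ (`aligned`) receives a much higher reward than one that ignores them (`misaligned`), which is exactly the alignment pressure the regularizer places on $G$.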
4.1 Analysis
In this section, we provide an in-depth analysis of the proposed method and its connections to prior works.
Comparisons to VAEs [17, 9, 25].
Despite its simplicity, the proposed ID-GAN effectively addresses the problems of VAEs in generating high-fidelity outputs; it augments the latent representation by introducing a nuisance variable $s$, which complements the disentangled variable $c$ by modeling richer generative factors. For instance, the VAE objective tends to favor representational factors that characterize as much data as possible [7] (\eg azimuth, scale, lighting, \etc), which are beneficial for representation learning, but insufficient to model the complexity of observations. Given the disentangled factors discovered by VAEs, ID-GAN learns to encode the remaining generative factors (such as high-frequency textures, face identity, \etc) into the nuisance variable $s$ (Figure 7). This process shares a similar motivation with the progressive augmentation of latent factors [31], but is used for modeling disentangled and nuisance generative factors. In addition, ID-GAN employs a much more expressive generator than the simple factorized Gaussian decoder in VAE, which is trained with an adversarial loss to render realistic and convincing outputs. Combining both, our method allows the generator to synthesize various data in a local neighborhood defined by $c$, where the specific characteristics of each example are fully characterized by the additional nuisance variable $s$.
Comparisons to InfoGAN [10].
The proposed method is closely related to InfoGAN, which optimizes the variational lower bound of the mutual information $I(c; G(s,c))$ for disentanglement learning. To clarify the difference between the proposed method and InfoGAN, we rewrite the regularization for both methods using the KL divergence as follows:
(5) $\mathcal{R}_{\text{InfoGAN}}(G, q_\phi) = -\mathbb{E}_{x\sim p_G(x)}\big[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,q_\phi(c|x)\big)\big] + \mathcal{I}_{p_G}(c; x), \quad c \sim p(c)$
(6) $\mathcal{R}_{\beta\text{-VAE}}(q_\phi) = -\beta\, \mathbb{E}_{x\sim p(x)}\big[D_{\mathrm{KL}}\big(q_\phi(c|x)\,\|\,p(c)\big)\big]$
(7) $\mathcal{R}_{\text{ID}}(G) = -\mathbb{E}_{x\sim p_G(x)}\big[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,q_\phi(c|x)\big)\big] + \mathcal{I}_{p_G}(c; x), \quad c \sim q_\phi(c)$
where Eqs. (6) and (7) summarize all regularization terms in our method, applied to the encoder and the generator, respectively (see Appendix A for the derivations).
Eq. (5) shows that InfoGAN optimizes the forward KL divergence between the generator posterior $p_G(c|x)$, induced by the prior $p(c)$, and the approximated posterior $q_\phi(c|x)$. Due to the zero-avoiding characteristics of the forward KL [42], it forces every latent code $c$ with non-zero prior density to be covered by the posterior $q_\phi(c|x)$. Intuitively, this implies that InfoGAN tries to exploit every dimension of $c$ to encode some (unique) factor of variation. This becomes problematic when there is a mismatch between the number of true generative factors and the size of the latent variable $c$, which is common in unsupervised disentanglement learning. On the contrary, the VAE optimizes the reverse KL divergence (Eq. (6)), which can effectively avoid this problem by encoding only meaningful factors of variation into certain dimensions of $c$ while collapsing the remaining dimensions to the prior. Since the encoder training in our method is only affected by Eq. (6), it allows us to discover the ambient dimension of latent generative factors robustly to the choice of the latent dimension of $c$.
In addition, Eq. (5) shows that InfoGAN optimizes the encoder using the generated distribution, which can be problematic when there exists a significant discrepancy between the true and generated distributions (\eg mode-collapse may lead to learning only partial generative factors). On the other hand, the encoder training in our method is guided by the true data (Eq. (6)) together with the maximum likelihood objective, while the mutual information (Eq. (7)) is enforced only on the generator. This helps our model discover comprehensive generative factors from data while guiding the generator to align its outputs to the learned representation.
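The asymmetry between the two divergences is easy to verify numerically. The sketch below (ours, not from the paper) shows the forward KL blowing up when the approximating distribution drops a mode the target covers, while the reverse KL stays finite, which is the zero-avoiding versus mode-collapsing behavior discussed above:

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions; infinite if q misses mass that p has."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

# A two-mode target distribution and a single-mode candidate.
p = [0.5, 0.5, 0.0]
q = [1.0, 0.0, 0.0]

forward = kl(p, q)  # inf: q must cover every state p touches (zero-avoiding)
reverse = kl(q, p)  # finite: q may collapse onto a subset of p's support
```

Under the forward KL, the candidate pays an unbounded penalty for any uncovered mode, so every dimension is pressed into service; under the reverse KL, collapsing unused dimensions to the prior costs nothing.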
Practical benefits.
The objective decomposition in the proposed method also offers a number of practical advantages. First, it enables plug-and-play-style adoption of the state-of-the-art models for disentangled representation learning and high-quality generation. As shown in Figure 2, it allows our model to achieve state-of-the-art performance on both tasks. Second, such decomposition also leads to an efficient model design, where we learn disentanglement from low-resolution images and distill the learned representation to the task of high-resolution synthesis with a much higher-capacity generator. We argue that this is practically reasonable in many cases, since VAEs tend to learn global structures in the disentangled representation, which can be captured from low-resolution images. We demonstrate this in the high-resolution image synthesis task, where we use the disentangled representation learned on low-resolution images for the synthesis of 128x128 or 1024x1024 images.
5 Experiments
In this section, we present various results to show the effectiveness of ID-GAN. Refer to the Appendix for more comprehensive results and figures.
5.1 Implementation Details
Compared methods.
We compare our method with state-of-the-art methods in disentanglement learning and generation. We choose $\beta$-VAE [17], FactorVAE [25], and InfoGAN [10] as baselines for disentanglement learning. For a fair comparison, we choose the best hyperparameter for each model via extensive hyperparameter search. We also report the performance by training each method over five different random seeds and averaging the results.
Network architecture.
For experiments on synthetic datasets, we adopt the architecture from [36] for all VAE-based methods (VAE, $\beta$-VAE, and FactorVAE). For GAN-based methods (GAN, InfoGAN, and ID-GAN), we employ the same decoder and encoder architectures of the VAE as the generator and discriminator, respectively. We set the size of the disentangled latent variable $c$ to 10 for all methods, and exclude the nuisance variable $s$ in GAN-based methods for a fair comparison with the VAE-based methods. For experiments on complex datasets, we employ the generator and discriminator of the state-of-the-art GAN [41, 46]. For the VAE, we utilize the same architecture as in the synthetic datasets. We set the sizes of the disentangled and nuisance variables ($c$ and $s$) to 20 and 256, respectively.
Evaluation metrics.
We employ three popular evaluation metrics in the literature: FactorVAE Metric (FVM) [25], Mutual Information Gap (MIG) [9], and Fréchet Inception Distance (FID) [16]. FVM and MIG evaluate the disentanglement performance by measuring the degree of axisalignment between each dimension of learned representations and groundtruth factors. FID evaluates the generation quality by measuring the distance between the true and the generated distributions.
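As an illustration of one of these metrics, MIG can be estimated from paired samples of latent codes and ground-truth factors. The sketch below is a simplified discrete version with function names of our own choosing, not the exact estimator of [9]:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) in nats, estimated from paired discrete samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) * n * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log(c / n) for c in Counter(xs).values())

def mig(codes, factors):
    """Mutual Information Gap: for each ground-truth factor, the gap between the
    largest and second-largest I(code_j; factor) over latent dimensions,
    normalized by H(factor) and averaged over factors."""
    gaps = []
    for f in factors:
        mis = sorted((mutual_information(c, f) for c in codes), reverse=True)
        gaps.append((mis[0] - mis[1]) / entropy(f))
    return sum(gaps) / len(gaps)
```

A representation where exactly one latent dimension captures each factor scores close to 1, while a representation that spreads a factor over several dimensions scores close to 0.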
5.2 Results on Synthetic Datasets
For quantitative evaluation of disentanglement, we employ the dSprites dataset [40], which contains synthetic images generated by randomly sampling known generative factors, such as shape, orientation, size, and x-y position. Since the complexity of dSprites is limited for analyzing the disentanglement and generation performance, we adopt three variants of dSprites, which are generated by adding color [25] (Color-dSprites) or background noise [36] (Noisy- and Scream-dSprites).
Table 4.1 and Figure 3 summarize the quantitative and qualitative comparison results with existing disentanglement learning approaches, respectively. First, we observe that VAE-based approaches (\ie $\beta$-VAE and FactorVAE) achieve the state-of-the-art disentanglement performance across all datasets, outperforming the VAE baseline and InfoGAN by a non-trivial margin. The qualitative results in Figure 3 show that the learned generative factors are well-correlated with meaningful disentanglement in the observation space. On the other hand, InfoGAN fails to discover meaningful disentanglement in most datasets. We observe that information maximization in InfoGAN often leads to undesirable factorization of generative factors, such as encoding both shape and position into one latent code, while factorizing latent dimensions by different combinations of them (\eg Color-dSprites in Figure 3). ID-GAN achieves state-of-the-art disentanglement through the distillation of the learned latent code from the VAE-based models. Appendix B.3 also shows that ID-GAN is much more stable to train and less sensitive to hyperparameters than InfoGAN.
In terms of generation quality, VAE-based approaches generally perform much worse than the GAN baseline. This performance gap is attributed to the strong constraints on the factorized latent variable and the weak decoder in VAE, which limit the generation capacity. This is clearly observed in the results on the Noisy-dSprites dataset (Figure 3), where the outputs from the VAE fail to render the high-dimensional patterns in the data (\ie uniform noise). On the other hand, our method achieves generation performance competitive with the state-of-the-art GAN by using a much more flexible generator for synthesis, which enables the modeling of complex patterns in data. As observed in Figure 3, ID-GAN performs generation using the same latent code as the VAE, but produces much more realistic outputs by capturing accurate object shapes (in Color-dSprites) and background patterns (in Scream-dSprites and Noisy-dSprites) missed by the VAE decoder. These results suggest that our method can achieve the best trade-off between disentanglement learning and high-fidelity synthesis.
5.3 Ablation Study
This section provides an in-depth analysis of our method.
dSprites
                      FVM (↑)     MIG (↑)     FID (↓)
VAE (reference)       0.65±0.08   0.28±0.09   37.75±24.58
VAE-GAN               0.46±0.18   0.13±0.11   33.54±24.93
ID-GAN (end-to-end)   0.50±0.14   0.13±0.09    3.18±2.38
ID-GAN (two-step)     0.65±0.08   0.28±0.09    2.00±1.74
Is two-step training necessary?
First, we study the impact of two-stage training for representation learning and synthesis. We consider two baselines: (1) VAE-GAN [28], an extension of VAE with an adversarial loss, and (2) end-to-end training of ID-GAN. Contrary to ID-GAN, which learns to represent (Eq. (1)) and synthesize (Eq. (2)) data via separate objectives, these baselines optimize a single, entangled objective for both tasks. Table 2 summarizes the results on the dSprites dataset.
The results show that VAE-GAN improves the generation quality of VAE with adversarial learning. The generation quality is further improved in the end-to-end version of ID-GAN by employing a separate generator for synthesis. However, the improved generation quality in both baselines comes at the cost of degraded disentanglement performance. We observe that updating the encoder using the adversarial loss hinders the discovery of disentangled factors, as the discriminator tends to exploit high-frequency details to distinguish real images from fake ones, which motivates the encoder to learn nuisance factors. This suggests that decomposing the representation learning and generation objectives is important in the proposed framework (ID-GAN two-step), which achieves the best performance in both tasks.
Is distillation necessary?
The above ablation study justifies the importance of two-step training.
Next, we compare different approaches for two-step training that perform conditional generation using the representation learned by the VAE.
Specifically, we consider two baselines:
(1) cGAN and (2) ID-GAN trained without distillation (ID-GAN w/o distill).
We opt to consider cGAN as the baseline since we find that it implicitly optimizes the mutual information $I(c; G(s,c))$ (see Appendix A.2 for the proof).
In the experiments, we train all models on the CelebA 128x128 dataset using the same VAE trained on a lower resolution, and compare the generation quality (FID) and the degree of alignment between the disentangled code $c$ and the generator output $G(s, c)$.
For comparison of the alignment, we measure $\mathcal{R}_{\text{ID}}$ (Eq. (7)) and GILBO.
As shown in Table 3, all three models achieve comparable generation performance in terms of FID. However, we observe that their alignment to the input latent code varies across the methods. For instance, ID-GAN (w/o distill) achieves very low $\mathcal{R}_{\text{ID}}$ and GILBO, indicating that the generator output does not accurately reflect the generative factors in $c$. The qualitative results (Figure 4) also show a considerable mismatch between $c$ and the generated images. Compared to this, cGAN achieves a much higher degree of alignment due to the implicit optimization of the mutual information, but its association is much looser than our method's (\eg changes in gender and hairstyle). By explicitly constraining the generator to optimize $\mathcal{R}_{\text{ID}}$, ID-GAN achieves the best alignment.
5.4 Results on Complex Datasets
To evaluate our method with more diverse and complex factors of variation, we conduct experiments on natural image datasets, such as CelebA [34], 3D Chairs [3], and Cars [26]. We first evaluate our method on low-resolution images, and then extend it to higher resolutions using the CelebA (128x128) and CelebA-HQ [23] (1024x1024) datasets.
Comparisons to other methods.
Table 4 summarizes the quantitative comparison results (see Appendix A.4 for qualitative comparisons). Since there are no ground-truth factors available in these datasets, we report the performance based on generation quality (FID). As expected, the generation quality of VAE-based methods is much worse on natural images. GAN-based methods, on the contrary, can generate more convincing samples by exploiting the expressive generator. However, we observe that the baseline GAN taking only nuisance variables ends up learning highly-entangled generative factors. ID-GAN achieves disentanglement via the disentangled factors learned by the VAE, with generation performance on par with the GAN baseline.
To better understand the disentanglement learned by GAN-based methods, we present latent traversal results in Figure 5. We generate samples by modifying the value of each dimension in the disentangled latent code $c$ while fixing the rest. We observe that InfoGAN fails to encode meaningful factors into $c$, and the nuisance variable $s$ dominates the generation process, making all generated images almost identical. On the other hand, ID-GAN learns meaningful disentanglement with $c$ and generates reasonable variations.
                 3D Chairs   Cars     CelebA
VAE              116.46      201.29   160.06
β-VAE            107.97      235.32   166.01
FactorVAE        123.64      208.60   154.48
GAN               24.17       14.62     3.34
InfoGAN           60.45       13.67     4.93
ID-GAN + β-VAE    25.44       14.96     4.08
Extension to high-resolution synthesis.
One practical benefit of the proposed two-step approach is that we can incorporate any VAE and GAN into our framework.
To demonstrate this, we train ID-GAN for high-resolution images (\eg 128x128 and 1024x1024) while distilling the VAE encoder learned with much smaller images.
We first adapt ID-GAN to the 128x128 image synthesis task. To understand the impact of distillation, we visualize the outputs from the VAE decoder and the GAN generator using the same latent code as input. Figure 6 summarizes the results. We observe that the outputs from both networks are well aligned, rendering the same generative factors into similar outputs. Contrary to the blurry and low-resolution VAE outputs, however, ID-GAN produces much more realistic and convincing outputs by introducing the nuisance variable $s$ and employing a more expressive decoder trained on higher-resolution images. Interestingly, the images synthesized by ID-GAN further clarify the disentangled factors learned by the VAE encoder. For instance, the first row in Figure 6 shows that an ambiguous disentangled factor in the VAE decoder output is clarified by ID-GAN, which turns out to capture the style of a cap. This suggests that ID-GAN can be useful for assessing the quality of the learned representation, which will broadly benefit future studies.
To gain further insight into the generative factors learned by our method, we conduct a qualitative analysis of the latent variables ($c$ and $s$) by generating samples while fixing one variable and varying the other (Figure 7). We observe that varying the disentangled variable $c$ leads to variations in the holistic structures of the outputs, such as azimuth, skin color, hair style, \etc, while varying the nuisance variable $s$ leads to changes in more fine-grained facial attributes, such as expression, skin texture, identity, \etc. This shows that ID-GAN successfully distills meaningful and representative disentangled generative factors learned by the inference network in VAE, while producing diverse and high-fidelity outputs using generative factors encoded in the nuisance variable.
Finally, we conduct experiments on the more challenging task of megapixel image synthesis. In the experiments, we base our ID-GAN on the VGAN architecture [46] and adapt it to synthesize 1024x1024 CelebA-HQ images given the factors learned by the VAE. Figure 8 presents the results, where we generate images by varying the value of one latent dimension in $c$. We observe that ID-GAN produces high-quality images with a compelling disentanglement property, where it changes one factor of variation in the data (\eg azimuth and hairstyle) while preserving the others (\eg identity).
6 Conclusion
We propose the Information Distillation Generative Adversarial Network (ID-GAN), a simple framework that combines the benefits of disentangled representation learning and high-fidelity synthesis. We show that we can incorporate the state-of-the-art models for both tasks by decomposing their objectives while constraining the generator by distilling the encoder. Extensive experiments on synthetic and complex datasets validate that the proposed method can achieve the best trade-off between realism and disentanglement, outperforming the existing approaches by a substantial margin. We also show that such decomposition leads to an efficient and effective model design, allowing high-fidelity synthesis with disentanglement on high-resolution images.
Appendix
Appendix A Derivations
A.1 InfoGAN optimizes Forward-KL Divergence
This section provides the derivation of Eq. (5) in the main paper. Consider $G$ as a mapping function $G: (s, c) \mapsto x$, where $s$ and $c$ denote the nuisance and disentangled variables, respectively. Also, assuming $x$ is a deterministic function of $(s, c)$, the conditional distribution $p_G(x|s,c)$ can be approximated by a Dirac distribution $\delta(x - G(s,c))$. Then, the marginal distribution $p_G(x|c)$ can be described as below:
(A.1) $p_G(x|c) = \mathbb{E}_{p(s)}\left[\delta(x - G(s, c))\right]$
Then, the variational lower bound of mutual information optimized in InfoGAN [10] can be rewritten as follows:
(A.2) $\mathcal{L}_I(G, q_\phi) = \mathbb{E}_{c\sim p(c),\, x\sim p_G(x|c)}[\log q_\phi(c|x)] + H_c$
(A.3) $= \mathbb{E}_{p_G(x,c)}[\log q_\phi(c|x)] + H_c$
(A.4) $= \mathbb{E}_{p_G(x)}\mathbb{E}_{p_G(c|x)}[\log q_\phi(c|x)] + H_c$
(A.5) $= \mathbb{E}_{p_G(x)}\mathbb{E}_{p_G(c|x)}\left[\log \frac{q_\phi(c|x)}{p_G(c|x)} + \log \frac{p_G(c|x)}{p(c)} + \log p(c)\right] + H_c$
(A.6) $= -\mathbb{E}_{p_G(x)}\left[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,q_\phi(c|x)\big)\right] + \mathbb{E}_{p_G(x)}\left[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,p(c)\big)\right] + \mathbb{E}_{c\sim p(c)}[\log p(c)] + H_c$
(A.7) $= -\mathbb{E}_{p_G(x)}\left[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,q_\phi(c|x)\big)\right] + \mathbb{E}_{p_G(x)}\left[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,p(c)\big)\right]$
(A.8) $= -\mathbb{E}_{x\sim p_G(x)}\left[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,q_\phi(c|x)\big)\right] + \mathcal{I}_{p_G}(c; x)$
where Eq. (A.8) corresponds to Eq. (5) of the main paper. Similarly, we can rewrite the distillation regularization (Eq. (4) in the main paper) as follows:
(A.9) $\mathcal{R}_{\text{ID}}(G) = \mathbb{E}_{c\sim q_\phi(c),\, x\sim p_G(x|c)}[\log q_\phi(c|x)] + H_c$
(A.10) $= -\mathbb{E}_{p_G(x)}\left[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,q_\phi(c|x)\big)\right] + \mathbb{E}_{p_G(x)}\left[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,q_\phi(c)\big)\right]$
(A.11) $= -\mathbb{E}_{x\sim p_G(x)}\left[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,q_\phi(c|x)\big)\right] + \mathcal{I}_{p_G}(c; x)$
where Eq. (A.11) corresponds to Eq. (7) in the main paper. As discussed in the paper, both Eq. (A.8) and Eq. (A.11) correspond to the forward KLD; the regularization in InfoGAN (Eq. (A.8)) is optimized with respect to both the encoder and the generator, which is problematic due to the zero-avoiding characteristics of the forward KLD and the potential mismatch between the true and generated data distributions. On the other hand, our method can effectively avoid this problem by optimizing Eq. (A.11) with respect to the generator only, while the encoder training is guided by the reverse KLD using the true data distribution (Eq. (6) in the main paper).
A.2 cGAN implicitly maximizes the mutual information
In Section 5.3 of the main paper, we define cGAN as the baseline that also optimizes the mutual information $I(c; G(s,c))$ implicitly in its objective function. This section provides the detailed derivation. Formally, we consider a cGAN that minimizes the Jensen-Shannon divergence (JSD) between two joint distributions, where $p(x, c) = p(x)\, q_\phi(c|x)$ and $p_G(x, c)$ denote the real and fake joint distributions, respectively. For clarity, we derive the mutual-information term from the reverse KL divergence between the two joints, to which the JSD behaves similarly near the optimum:
(A.12) $D_{\mathrm{KL}}\big(p_G(x, c)\,\|\,p(x, c)\big)$
(A.13) $= \mathbb{E}_{p_G(x,c)}\left[\log \frac{p_G(x)\, p_G(c|x)}{p(x)\, q_\phi(c|x)}\right]$
(A.14) $= D_{\mathrm{KL}}\big(p_G(x)\,\|\,p(x)\big) + \mathbb{E}_{p_G(x)}\left[D_{\mathrm{KL}}\big(p_G(c|x)\,\|\,q_\phi(c|x)\big)\right]$
(A.15) $= D_{\mathrm{KL}}\big(p_G(x)\,\|\,p(x)\big) + \mathbb{E}_{p_G(x,c)}\left[\log \frac{p_G(c|x)}{q_\phi(c)} - \log \frac{q_\phi(c|x)}{q_\phi(c)}\right]$
(A.16) $= D_{\mathrm{KL}}\big(p_G(x)\,\|\,p(x)\big) + \mathcal{I}_{p_G}(c; x) - \mathbb{E}_{p_G(x,c)}[\log q_\phi(c|x)] - H_c$
(A.17) $= D_{\mathrm{KL}}\big(p_G(x)\,\|\,p(x)\big) + \mathcal{I}_{p_G}(c; x) - \mathcal{R}_{\text{ID}}(G)$
Eq. (A.17) implies that the cGAN objective also implicitly maximizes $\mathcal{R}_{\text{ID}}(G)$, the variational lower bound of $I(c; G(s,c))$. However, Eq. (A.17) is guaranteed only when the discriminator converges to the (near-)optimum with respect to the real and fake joint distributions, which makes the optimization of the mutual information highly dependent on the quality of the discriminator. On the other hand, ID-GAN maximizes it explicitly by directly computing $\mathcal{R}_{\text{ID}}$ from the learned encoder $q_\phi(c|x)$, which leads to a higher degree of alignment between the input latent code and the generated output (Table 3 and Figure 4 in the main paper).
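The chain rule for the KL divergence used in the derivation above, which splits a joint divergence into a marginal term and an expected conditional term, can be checked numerically on a toy discrete joint. All probabilities below are made up for illustration:

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy joints over (x, c) with x, c in {0, 1}: a "fake" joint pg and a "real" joint pr.
pg = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
pr = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

def marginal_x(j):
    return [j[0, 0] + j[0, 1], j[1, 0] + j[1, 1]]

def cond_c(j, x):
    z = j[x, 0] + j[x, 1]
    return [j[x, 0] / z, j[x, 1] / z]

# KL between the joints, computed directly ...
joint_kl = kl([pg[k] for k in sorted(pg)], [pr[k] for k in sorted(pr)])
# ... and via the chain rule: marginal KL plus expected conditional KL.
chain = kl(marginal_x(pg), marginal_x(pr)) + sum(
    marginal_x(pg)[x] * kl(cond_c(pg, x), cond_c(pr, x)) for x in (0, 1))
```

The two quantities agree, which is the decomposition that lets the joint divergence be split into a marginal-matching term and the conditional term carrying the mutual-information bound.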
Appendix B Additional Experimental Results
B.1 Additional Results on Synthetic Datasets
We present additional qualitative results on the synthetic datasets, corresponding to Section 5.2 in the main paper. Figure A.1 presents images randomly generated by the proposed ID-GAN. We observe that the generated images are sharp and realistic, capturing complex patterns in the background (Scream-dSprites and Noisy-dSprites datasets). We also observe that it generates convincing foreground patterns, such as the color and shape of the objects, while covering the diverse and comprehensive patterns of real objects.
Figures A.2 and A.3 present additional qualitative comparisons with the VAE and InfoGAN obtained by manipulating the disentangled factors, corresponding to Figure 3 in the main paper. We observe that the VAE captures meaningful disentangled factors, such as the location and color of the object, but overlooks several complex yet important patterns in the background (Scream-dSprites and Noisy-dSprites datasets) as well as in the foreground (\eg detailed shape and orientation). On the other hand, InfoGAN generates more convincing images by employing a more expressive decoder, but learns much more entangled representations (\eg changing the location and color of the objects in the Color-dSprites dataset). By combining the benefits of both approaches, ID-GAN successfully learns meaningful disentangled factors and generates realistic patterns.
B.2 Additional Results on a Complex Dataset
Qualitative results of Table 4.
Here we compare qualitative samples generated by each model in Table 4, \ie VAE, β-VAE, FactorVAE, GAN, InfoGAN, and ID-GAN. The qualitative results are shown in Figure A.4. Although the VAE-based methods learn to represent the global structure or salient factors of data in all datasets, the generated samples are often blurry and lack textural or local details. On the other hand, the GAN-based approaches (\ie GAN, InfoGAN, and ID-GAN) generate sharp and realistic samples thanks to implicit density estimation and expressive generators. However, as shown in Figure A.5, InfoGAN generally fails to capture meaningful disentangled factors in $c$, since it exploits the nuisance variable $z$ to encode the most salient factors of variation. On the other hand, ID-GAN successfully captures the major disentangled factors in $c$ while encoding only local details in the nuisance variable $z$.
Additional results on high-resolution synthesis.
Here we provide more qualitative results of ID-GAN on high-resolution image synthesis (CelebA 256×256 and CelebA-HQ datasets). We first present the results on the CelebA-HQ dataset composed of megapixel images (1024×1024 pixels). Figure A.6 presents samples randomly generated by ID-GAN. We observe that ID-GAN produces sharp and plausible samples at high resolution, showing generation performance on par with the state-of-the-art GAN baseline [46] employed as the backbone network of ID-GAN. We argue that this is due to the separate decoder and generator scheme adopted in ID-GAN, which is hardly achievable in VAE-based approaches using a factorized Gaussian decoder for explicit maximization of the data log-likelihood.
Next, we analyze the factors of variation learned by ID-GAN by investigating the disentangled and nuisance variables $c$ and $z$, respectively. Similarly to Figure 8 in the main paper, we compare the samples generated by fixing one latent variable while varying the other. The results are summarized in Figure A.7. Similar to Figure 7, we observe that the disentangled variable $c$ contributes the most salient factors of variation (\eg azimuth, shape, or color of face and hair, \etc), while the nuisance variable $z$ contributes the remaining fine details (\eg identity, hair style, expression, background, \etc). For instance, we observe that fixing the disentangled variable $c$ leads to a consistent global face structure (\eg a black male facing slightly right (first column), a blonde female facing slightly left (fourth column)), while fixing the nuisance variable $z$ leads to consistent details (\eg horizontally floating hair (third row), a smiling expression (fourth and fifth rows)). These results suggest that the generator in ID-GAN is well aligned with the VAE decoder, rendering the same disentangled variable $c$ into similar observations but with more expressive and realistic details by exploiting the nuisance variable $z$.
Finally, to further visualize the disentangled factors learned in $c$, we present latent traversal results in Figure A.8 as an extension to Figures LABEL:fig:hq_synthesis and 8 in the main paper. We also visualize the results on CelebA images in Figure A.9, where we observe similar behavior.
B.3 Sensitivity of the Generation Performance (FID) to the Hyperparameter
To better understand the sensitivity of our model to its hyperparameter (the weight in Eq. (2)), we conduct an ablation study measuring the generation performance (FID) of models trained with various values of this hyperparameter. Figure A.10 summarizes the results on the dSprites dataset. First, we observe that the proposed ID-GAN performs well over a wide range of hyperparameter values, while the performance of InfoGAN is much more sensitive to this choice. Interestingly, increasing the weight in our method also improves the generation quality over a certain range. We suspect that this is because the information maximization in Eq. (4) using the pre-trained encoder also behaves as a perceptual loss [22, 13, 29], regularizing the generator to match the true distribution in a more meaningful feature space (\ie the disentangled representation).
Appendix C Implementation Details
C.1 Evaluation Metrics
FactorVAE Metric (FVM).
FVM [25] measures the accuracy of a majority-vote classifier, where the encoder network to be evaluated is used to construct the training data of this classifier. A single training datum, or vote, is generated as follows: (1) extract encoder outputs from all samples of a synthetic dataset; (2) estimate the empirical variance of each latent dimension from the extracted outputs; (3) discard collapsed latent dimensions whose variance is smaller than 0.05; (4) synthesize 100 samples with a single factor fixed and the other factors varying randomly; (5) extract encoder outputs from the synthesized samples; (6) compute the variance of each latent dimension divided by the empirical variance computed beforehand; and (7) form a single vote as the pair of the index of the fixed factor and the index of the latent dimension with the smallest normalized variance. We generate 800 votes to train the majority-vote classifier and report its training accuracy as the metric score.
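The voting procedure above can be sketched in a few lines of NumPy. The `sample_codes(k, n)` callable and the toy setup below are hypothetical stand-ins for dataset sampling plus encoding, not the official implementation:

```python
import numpy as np

def factor_vae_metric(sample_codes, num_factors, n_votes=800, n_per_vote=100,
                      tau=0.05, seed=0):
    """Majority-vote FactorVAE metric (FVM).

    sample_codes(k, n) -> (n, d) array of encoder outputs for n samples whose
    ground-truth factor k is held fixed (k=None: all factors vary freely).
    """
    rng = np.random.default_rng(seed)
    base = sample_codes(None, 10000)
    emp_var = base.var(axis=0)                # empirical per-dimension variance
    active = emp_var > tau                    # discard collapsed dimensions
    counts = np.zeros((base.shape[1], num_factors))
    for _ in range(n_votes):
        k = int(rng.integers(num_factors))
        z = sample_codes(k, n_per_vote)
        var = z.var(axis=0) / emp_var         # normalized per-dimension variance
        var[~active] = np.inf
        counts[int(np.argmin(var)), k] += 1   # one vote: (latent dim, factor)
    # Majority-vote classifier: each latent dim predicts its most frequent factor.
    return counts.max(axis=1).sum() / n_votes

# Toy check: codes are a noisy copy of 3 independent factors -> near-perfect score.
toy_rng = np.random.default_rng(1)
def sample_codes(k, n):
    v = toy_rng.random((n, 3))
    if k is not None:
        v[:, k] = v[0, k]                     # hold factor k fixed
    return v + 0.01 * toy_rng.standard_normal((n, 3))

score = factor_vae_metric(sample_codes, num_factors=3, n_votes=200)
assert score > 0.9
```

With a well-disentangled code, fixing factor $k$ collapses the normalized variance of exactly one latent dimension, so the votes become consistent and the training accuracy approaches 1.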
Mutual Information Gap (MIG).
MIG [9] is an information-theoretic approach to measuring the disentanglement of representations.
Specifically, assuming $K$ generative factors $v_k$ and $J$-dimensional latents $z$, it computes the empirical mutual information $I(z_j; v_k)$, normalized by the factor entropy $H(v_k)$, to measure the information-theoretic preference of each latent dimension $z_j$ towards each factor $v_k$, or vice versa.
Then, it aggregates the differences, or gaps, between the top-two preferences for each factor and averages them to compute MIG, \ie $\mathrm{MIG} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{H(v_k)} \big( I(z_{j^{(k)}}; v_k) - \max_{j \neq j^{(k)}} I(z_j; v_k) \big)$, where $j^{(k)} = \arg\max_{j} I(z_j; v_k)$.
For other implementation details, we directly follow the settings in disentanglement_lib.
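Given an estimated mutual-information matrix, the gap computation itself is only a few lines; the sketch below assumes the $I(z_j; v_k)$ estimates and factor entropies are already computed (the toy matrix is illustrative, not from the paper):

```python
import numpy as np

def mig(mi, factor_entropies):
    """Mutual Information Gap from a (num_latents, num_factors) MI matrix.

    mi[j, k] is an empirical estimate of I(z_j; v_k); factor_entropies[k] = H(v_k).
    """
    mi = np.asarray(mi, dtype=float)
    top2 = np.sort(mi, axis=0)[-2:]           # two most informative latents per factor
    gaps = (top2[1] - top2[0]) / np.asarray(factor_entropies, dtype=float)
    return gaps.mean()

# Toy example: each factor is captured mostly by one latent dimension.
mi = np.array([[1.0, 0.0],
               [0.0, 0.5],
               [0.1, 0.1]])
score = mig(mi, factor_entropies=[1.0, 1.0])  # gaps: 0.9 and 0.4 -> mean 0.65
```

A perfectly disentangled (and axis-aligned) representation drives the second-largest mutual information towards zero, pushing MIG towards 1 for normalized factors.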
Fréchet Inception Distance (FID).
We employ Fréchet Inception Distance (FID) [16] to evaluate the generation quality of each model considered in our experiments.
FID measures the Fréchet distance [14] between two Gaussians, constructed from generated and real images, respectively, in the feature space of a pretrained deep neural network.
For each model, we compare 50,000 generated images and 50,000 real images to compute FID.
For dSprites and its variants, we use a ConvNet manually trained to predict the true generative factors of dSprites and its variants.
For the CelebA, 3D Chairs, and Cars datasets, we use Inception V3 [48] pretrained on the ImageNet [12] dataset.
We use the publicly available code to compute FID.
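The Gaussian Fréchet distance at the core of FID has the closed form $\|\mu_1-\mu_2\|^2 + \mathrm{Tr}\big(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2}\big)$; a small sketch using SciPy's matrix square root (this is an illustration of the formula, not the referenced codebase):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    covmean = sqrtm(sigma1 @ sigma2).real     # discard numerical imaginary parts
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Sanity checks on toy statistics: identical Gaussians give distance 0, and a
# pure mean shift d with equal covariances contributes exactly ||d||^2.
mu, sigma = np.zeros(3), np.eye(3)
assert np.isclose(frechet_distance(mu, sigma, mu, sigma), 0.0, atol=1e-6)
assert np.isclose(frechet_distance(mu, sigma, mu + 2.0, sigma), 12.0, atol=1e-6)
```

In practice, $\mu$ and $\Sigma$ are the mean and covariance of Inception features over the 50,000 real and 50,000 generated images, respectively.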
C.2 Datasets
Name | Description

dSprites [40] | 737,280 binary 64×64 images of 2D sprites with 5 ground-truth factors: shape (3), scale (6), orientation (40), x-position (32), and y-position (32).
Color-dSprites [7, 36] | The sprite is filled with a random color. We randomly sample the intensity of each color channel from 8 discrete values, linearly spaced in [0, 1].
Noisy-dSprites [36] | The background of each dSprites sample is filled with uniform random noise.
Scream-dSprites [36] | The background of each dSprites sample is replaced with a randomly cropped patch of The Scream painting, and the sprite is colored with the inverted color of the patch over the pixel region of the sprite.
CelebA, CelebA-HQ [34, 23] | The CelebA dataset contains 202,599 RGB images of celebrity faces, covering 10,177 identities with 5 landmark locations and 40 annotated attributes per face. We use the aligned-and-cropped version of the dataset at sizes 64×64 and 256×256. CelebA-HQ is a subset of the in-the-wild version of CelebA, composed of 30,000 RGB 1024×1024 high-resolution images.
3D Chairs [3] | 86,366 RGB 64×64 images of chair CAD models with 1,393 types, 31 azimuths, and 2 elevations.
Cars [26] | 16,185 RGB images of 196 classes of cars. We crop and resize each image to 64×64 using the provided bounding-box annotations.
C.3 Architecture
Encoder | Decoder | Discriminator

Input: 64×64×(# channels) | Input: latent code | FC 1000, leaky ReLU
4×4 conv 32, ReLU, stride 2 | FC 256, ReLU | FC 1000, leaky ReLU
4×4 conv 32, ReLU, stride 2 | FC 4×4×64, ReLU | FC 1000, leaky ReLU
4×4 conv 64, ReLU, stride 2 | 4×4 upconv 64, ReLU, stride 2 | FC 1000, leaky ReLU
4×4 conv 64, ReLU, stride 2 | 4×4 upconv 32, ReLU, stride 2 | FC 1000, leaky ReLU
FC 256, FC 2×10 | 4×4 upconv 32, ReLU, stride 2 | FC 1000, leaky ReLU
 | 4×4 upconv (# channels), stride 2 | FC 2
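As a quick sanity check on the spatial dimensions in the table above: with the standard output-size formula and an assumed padding of 1 (the table does not state it), four 4×4 stride-2 convolutions map a 64×64 input down to 4×4, matching the FC 4×4×64 layer on the decoder side:

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Encoder path: four 4x4 convolutions with stride 2 (padding 1 assumed).
size = 64
for _ in range(4):
    size = conv_out(size)
assert size == 4  # 64 -> 32 -> 16 -> 8 -> 4
```

The same arithmetic, run in reverse with transposed convolutions, recovers the 4→8→16→32→64 upsampling path of the decoder and the GAN generators below.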
Generator | Discriminator

Input: latent codes $(z, c)$ | Input: 64×64×(# channels)
FC 256, ReLU | 4×4 conv 32, ReLU, stride 2
FC 4×4×64, ReLU | 4×4 conv 32, ReLU, stride 2
4×4 upconv 64, ReLU, stride 2 | 4×4 conv 64, ReLU, stride 2
4×4 upconv 32, ReLU, stride 2 | 4×4 conv 64, ReLU, stride 2
4×4 upconv 32, ReLU, stride 2 | FC 256, FC 1
4×4 upconv (# channels), stride 2 |
Generator | Discriminator

Input: latent codes $(z, c)$ | Input: 64×64×3
FC 4×4×512 | 3×3 conv 64, stride 1
ResBlock 512, NN Upsampling | ResBlock 64, AVG Pooling
ResBlock 256, NN Upsampling | ResBlock 128, AVG Pooling
ResBlock 128, NN Upsampling | ResBlock 256, AVG Pooling
ResBlock 64, NN Upsampling | ResBlock 512, AVG Pooling
ResBlock 64, 4×4 conv 3, stride 1 | FC 1
Generator | Discriminator

Input: latent codes $(z, c)$ | Input: 128×128×3
FC 4×4×512 | 3×3 conv 64, stride 1
ResBlock 512, NN Upsampling | ResBlock 64, AVG Pooling
ResBlock 512, NN Upsampling | ResBlock 128, AVG Pooling
ResBlock 512, NN Upsampling | ResBlock 256, AVG Pooling
ResBlock 256, NN Upsampling | ResBlock 512, AVG Pooling
ResBlock 128, NN Upsampling | ResBlock 512, AVG Pooling
ResBlock 128, 4×4 conv 3, stride 1 | FC 1
Generator | Discriminator

Input: latent codes $(z, c)$ | Input: 256×256×3
FC 4×4×512 | 3×3 conv 64, stride 1
ResBlock 512, NN Upsampling | ResBlock 64, AVG Pooling
ResBlock 512, NN Upsampling | ResBlock 128, AVG Pooling
ResBlock 512, NN Upsampling | ResBlock 256, AVG Pooling
ResBlock 256, NN Upsampling | ResBlock 512, AVG Pooling
ResBlock 128, NN Upsampling | ResBlock 512, AVG Pooling
ResBlock 64, NN Upsampling | ResBlock 512, AVG Pooling
ResBlock 64, 4×4 conv 3, stride 1 | FC 1
Generator | Discriminator

Input: latent codes $(z, c)$ | Input: 1024×1024×3
FC 4×4×512 | ResBlock 16, AVG Pooling
ResBlock 512, NN Upsampling | ResBlock 32, AVG Pooling
ResBlock 512, NN Upsampling | ResBlock 64, AVG Pooling
ResBlock 512, NN Upsampling | ResBlock 128, AVG Pooling
ResBlock 512, NN Upsampling | ResBlock 256, AVG Pooling
ResBlock 256, NN Upsampling | ResBlock 512, AVG Pooling
ResBlock 128, NN Upsampling | ResBlock 512, AVG Pooling
ResBlock 64, NN Upsampling | ResBlock 512, AVG Pooling
ResBlock 32, NN Upsampling | 1×1 conv 2×512, Sampling 512
ResBlock 16, 4×4 conv 3, stride 1 | FC 1
Footnotes
 In practice, we can easily sample from $p_G(x|c)$ by $x = G(z, c)$ with $z \sim p(z)$.
 In practice, we learn the encoder and the generator independently by Eq. (6) and Eq. (7), respectively, through two-step training.
 We report both $\mathcal{R}_{\mathrm{distill}}$ and GILBO without $H(c)$ to avoid potential errors in measuring it (\eg fitting a Gaussian [2]). Note that this does not affect the relative comparison since all models share the same $H(c)$.
 GILBO is formulated similarly to $\mathcal{R}_{\mathrm{distill}}$ (Eq. (4)), but optimized over another auxiliary encoder network different from the one used in ID-GAN.
 We simply downsample the generator output by bilinear sampling to match the dimensions between the generator and the encoder.
 https://github.com/google-research/disentanglement_lib
 https://github.com/mseitzer/pytorch-fid
References
 (2018) Information dropout: learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 (2018) GILBO: one metric to measure them all. In NeurIPS, pp. 7037–7046.
 (2014) Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR.
 (2017) CVAE-GAN: fine-grained image generation through asymmetric training. In ICCV.
 (2013) Representation learning: a review and new perspectives. PAMI.
 (2017) Neural photo editing with introspective adversarial networks. In ICLR.
 (2018) Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599.
 (2019) Weakly supervised disentanglement by pairwise similarities. arXiv preprint arXiv:1906.01044.
 (2018) Isolating sources of disentanglement in variational autoencoders. In NeurIPS.
 (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS.
 (2019) Flexibly fair representation learning by disentanglement. In ICML, pp. 1436–1445.
 (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
 (2016) Generating images with perceptual similarity metrics based on deep networks. In NeurIPS.
 (1957) Sur la distance de deux lois de probabilité. Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences.
 (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science.
 (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS.
 (2017) β-VAE: learning basic visual concepts with a constrained variational framework. In ICLR.
 (2016) ELBO surgery: yet another way to carve up the variational evidence lower bound. In NeurIPS.
 (2018) IntroVAE: introspective variational autoencoders for photographic image synthesis. In NeurIPS.
 (2018) Multimodal unsupervised image-to-image translation. In ECCV.
 (2019) IB-GAN: disentangled representation learning with information bottleneck GAN.
 (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV.
 (2018) Progressive growing of GANs for improved quality, stability, and variation. In ICLR.
 (2019) Variational autoencoders and nonlinear ICA: a unifying framework. arXiv preprint arXiv:1907.04809.
 (2018) Disentangling by factorising. In ICML.
 (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13).
 (2017) Fader networks: manipulating images by sliding attributes. In NeurIPS.
 (2016) Autoencoding beyond pixels using a learned similarity metric. In ICML.
 (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR.
 (2018) Diverse image-to-image translation via disentangled representations. In ECCV.
 (2019) Overcoming the disentanglement vs reconstruction trade-off via Jacobian supervision. In ICLR.
 (2019) InfoGAN-CR: disentangling generative adversarial networks with contrastive regularizers. arXiv preprint arXiv:1906.06034.
 (2019) OOGAN: disentangling GAN with one-hot sampling and orthogonal regularization. arXiv preprint arXiv:1905.10836.
 (2015) Deep learning face attributes in the wild. In ICCV.
 (2019) On the fairness of disentangled representations. arXiv preprint arXiv:1905.13662.
 (2018) Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359.
 (2019) Disentangling factors of variation using few labels. arXiv preprint arXiv:1905.01258.
 (2016) Adversarial autoencoders. In ICLR.
 (2018) Disentangling disentanglement in variational autoencoders. In Bayesian Deep Learning Workshop, NeurIPS.
 (2017) dSprites: disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/
 (2018) Which training methods for GANs do actually converge?. In ICML.
 (2005) Divergence measures and message passing. Technical report, Microsoft Research.
 (2017) Learning disentangled representations with semi-supervised deep generative models. In NeurIPS.
 (2019) HoloGAN: unsupervised learning of 3D representations from natural images. In ICCV.
 (2014) Auto-encoding variational Bayes. In ICLR.
 (2019) Variational discriminator bottleneck: improving imitation learning, inverse RL, and GANs by constraining information flow. In ICLR.
 (2019) Learning disentangled representations with reference-based variational autoencoders. arXiv preprint arXiv:1901.08534.
 (2016) Rethinking the inception architecture for computer vision. In CVPR.
 (2018) Wasserstein auto-encoders. In ICLR.
 (2018) Recent advances in autoencoder-based representation learning. In Bayesian Deep Learning Workshop, NeurIPS.
 (1960) Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development.
 (2019) Spatial broadcast decoder: a simple architecture for learning disentangled representations in VAEs. arXiv preprint arXiv:1901.07017.
 (2018) Improving the improved training of Wasserstein GANs. In ICLR.
 (2017) Toward multimodal image-to-image translation. In NeurIPS.