Generative Adversarial Image Synthesis with Decision Tree Latent Controller


Takuhiro Kaneko        Kaoru Hiramatsu        Kunio Kashino
NTT Communication Science Laboratories, NTT Corporation
{kaneko.takuhiro, hiramatsu.kaoru, kashino.kunio}@lab.ntt.co.jp
Abstract

This paper proposes the decision tree latent controller generative adversarial network (DTLC-GAN), an extension of a GAN that can learn hierarchically interpretable representations without relying on detailed supervision. To impose a hierarchical inclusion structure on latent variables, we incorporate a new architecture called the DTLC into the generator input. The DTLC has a multiple-layer tree structure in which the ON or OFF of the child node codes is controlled by the parent node codes. By using this architecture hierarchically, we can obtain the latent space in which the lower layer codes are selectively used depending on the higher layer ones. To make the latent codes capture salient semantic features of images in a hierarchically disentangled manner in the DTLC, we also propose a hierarchical conditional mutual information regularization and optimize it with a newly defined curriculum learning method. This makes it possible to discover hierarchically interpretable representations in a layer-by-layer manner on the basis of information gain by only using a single DTLC-GAN model. We evaluated the DTLC-GAN on various datasets, i.e., MNIST, CIFAR-10, Tiny ImageNet, 3D Faces, and CelebA, and confirmed that the DTLC-GAN can learn hierarchically interpretable representations in either unsupervised or weakly supervised settings. Furthermore, we applied the DTLC-GAN to image-retrieval tasks and showed its effectiveness in representation learning.

1 Introduction

There have been recent advances in computer vision and graphics, enabling photo-realistic images to be created. However, it still requires considerable skill or effort to create a pixel-level detailed image from scratch. Deep generative models, such as generative adversarial networks (GANs) [12] and variational autoencoders (VAEs) [21, 43], have recently emerged as powerful models to alleviate this difficulty. Although these models make it possible to generate various images with high fidelity by changing (e.g., randomly sampling) the latent variables in the generator or decoder input, creating the desired image still remains a painstaking process because the naive formulation does not impose any structure on the latent variables; as a result, they may be used by the generator or decoder in a highly entangled manner. This causes difficulty in interpreting the “meaning” of the individual variables and in controlling image generation by operating each one.

When we create an image from scratch, we typically select and narrow a target to paint in a coarse-to-fine manner. For example, when we create an image of a face with glasses, we first roughly consider the type of glasses, e.g., transparent glasses/sunglasses, and then define the details, e.g., thin/thick-rimmed glasses or small/big sunglasses. To use a deep generative model as a supporter for creating an image, we believe that such hierarchically interpretable representation is the key to obtaining the image one has in mind.

Figure 1: Examples of image generation under control using DTLC-GAN: DTLC-GAN enables image generation to be controlled in coarse-to-fine manner, i.e., “selected & narrowed.” Our goal is to discover such hierarchically interpretable representations without relying on detailed supervision.
Figure 2: Generated image samples on CIFAR-10: All images were generated from the same noise but different latent codes. In each row, we varied the second-, third-, and fourth-layer codes per nine images, per three images, and per image, respectively. Note that we learn these hierarchically disentangled representations only with the supervision of class labels. This model also achieves a high inception score of 8.80. We give details in Section 6.3.

These facts motivated us to address the problem of how to derive hierarchically interpretable representations in a deep generative model. To solve this problem, we propose the decision tree latent controller GAN (DTLC-GAN), an extension of the GAN that can learn hierarchically interpretable representations without relying on detailed supervision. Figure 1 shows examples of image generation under control using the DTLC-GAN. If semantic features are represented in a hierarchically disentangled manner, we can approach a target image gradually and interactively.

To impose a hierarchical inclusion structure on latent variables, we incorporate a new architecture called the DTLC into the generator input. The DTLC has a multiple-layer tree structure in which the ON or OFF of the child node codes is controlled by the parent node codes. By using this architecture hierarchically, we can obtain the latent space in which the lower layer codes are selectively used depending on the higher layer codes.

In making the latent codes capture salient semantic features of images in a hierarchically disentangled manner in the DTLC, the main difficulty is that we need to discover representations disentangled in the following three respects: (1) disentanglement between the control target (e.g., glasses) and unrelated factors (e.g., identity); (2) coarse-to-fine disentanglement between layers, i.e., the higher layer codes capture rough categories, while the lower layer ones capture detailed categories; and (3) intra-layer disentanglement to control semantic features independently, i.e., when one code captures a semantic feature (e.g., thin glasses), another one captures a different semantic feature (e.g., thick glasses).

A possible solution would be to collect detailed annotations in an amount large enough to solve the problems in a fully supervised manner. However, this approach incurs high annotation costs. Even if we had enough human resources, defining the detailed categories would remain a nontrivial task. The latter problem is also addressed in the field of research concerned with attribute representations [36, 58] and is still an open issue. This motivated us to tackle a challenging condition in which hierarchically interpretable representations need to be learned without relying on detailed annotations. Under this condition, it is not trivial to solve all three of the above disentanglement problems at the same time because they are not independent of each other but are interrelated. To mitigate these problems, we propose a hierarchical conditional mutual information regularization (HCMI), which is an extension of MI [4] and conditional MI (CMI) [18] to hierarchical conditional settings, and optimize it with a newly defined curriculum learning [3] method that we also propose. This makes it possible to discover hierarchically interpretable representations in a layer-by-layer manner on the basis of information gain by only using a single DTLC-GAN model. This is noteworthy because we can learn expressive representations without a large increase in calculation cost. Figure 2 shows typical examples on CIFAR-10, where expressive hierarchical representations, i.e., categories and subcategories, are learned in a weakly supervised manner (i.e., only the 10 class labels are supervised). We evaluated our DTLC-GAN on various datasets, i.e., MNIST, CIFAR-10, Tiny ImageNet, 3D Faces, and CelebA, and confirmed that it can learn hierarchically interpretable representations in either unsupervised or weakly supervised settings. Furthermore, we applied our DTLC-GAN to image-retrieval tasks and showed its effectiveness in representation learning.

Contributions:

Our contributions are summarized as follows. (1) We derive a novel functionality in a deep generative model, which enables semantic features of an image to be controlled in a coarse-to-fine manner. (2) To obtain this functionality, we incorporate a new architecture called the DTLC into a GAN, which imposes a hierarchical inclusion structure on latent variables. (3) We propose a regularization called the HCMI and optimize it with a newly defined curriculum learning method that we also propose. This makes it possible to learn hierarchically disentangled representations only using a single DTLC-GAN model without relying on detailed supervision. (4) We evaluated our DTLC-GAN on various datasets and confirmed its effectiveness in image generation and image-retrieval tasks. We provide supplementary materials including demo videos at http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/dtlc-gan/.

2 Related Work

Deep Generative Models:

In computer vision and machine learning, generative image modeling is a fundamental problem. Recently, there was a significant breakthrough due to the emergence of deep generative models. These models roughly fall into two approaches: deterministic and stochastic. On the basis of deterministic approaches, Dosovitskiy et al. [8] proposed a deconvolution network that generates 3D objects, and Reed et al. [42] and Yang et al. [57] proposed networks that approximate functions for image synthesis. There are three major models based on stochastic approaches. One is a VAE [21, 43], which is formulated as a probabilistic graphical model and optimized by maximizing a variational lower bound on the data likelihood. Another is an autoregressive model [49], which breaks the data distribution into a series of conditional distributions and uses neural networks to model them. The other is a GAN [12], which is composed of generator and discriminator networks. The generator is optimized to fool the discriminator, while the discriminator is optimized to distinguish between real and generated data. This min-max optimization makes the training procedure unstable, but several techniques [1, 2, 14, 31, 38, 45, 61] have recently been proposed to stabilize it. All these models have pros and cons. In this paper, we take a stochastic approach, particularly focusing on a GAN, and propose an extension to it because it has flexibility in latent variable design. Extension to other models remains a promising area for future work.

Disentangled Representation Learning:

In the study of stochastic deep generative models, there have been attempts to learn disentangled representations similar to ours. Most of the studies addressed the problem in supervised settings and incorporated supervision into the networks. For example, attribute/class labels [33, 35, 48, 50, 55, 60], text descriptions [30, 40, 59], and object location descriptions [39, 41] are used as supervision. To reduce the annotation cost, extensions to semi-supervised settings have also recently been proposed [20, 45, 46]. The advantage of these settings is that disentangled representations can be explicitly learned following the supervision; however, the limitation is that learnable representations are restricted to the given supervision. To overcome this limitation, weakly supervised [18, 23, 29, 32] and unsupervised [4] models have recently been proposed, which discover meaningful hidden representations without relying on detailed annotations; however, these models are limited to discovering one-layer hidden representations, whereas the DTLC-GAN enables multi-layer hidden representations to be learned. We further discuss the relationship to previous GANs in Section 4.4.

Hierarchical Representation:

The other related topic is hierarchical representation. Previous studies have decomposed an image in various ways. The LAPGAN [6] and StackGAN [59] decompose an image by repeatedly downsampling it, the S-GAN [52] decomposes the generative process into structure and style, the VGAN [51] decomposes a video into foreground and background, and the SGAN [15] learns multi-level representations in the feature spaces of intermediate layers. Other studies [11, 13, 16, 24, 56] used recursive structures to draw images in a step-by-step manner. The main difference from these studies is that they derive hierarchical representations in a pixel space or feature space to improve the fidelity of an image, while we derive them in a latent space to improve the interpretability and controllability of latent codes. More recently, Zhao et al. [62] proposed an extension of a VAE called the VLAE to learn multi-layer hierarchical representations in a latent space similar to ours; however, the type of hierarchy is different from ours. They learn representations that are semantically independent among layers, whereas we learn those in which lower layer codes are correlated with higher layer codes in a decision tree manner. We argue that such representation is necessary to learn category-specific features and control image generation in a select-and-narrow manner.

3 Background: GAN

A GAN [12] is a framework for training a generative model using a min-max game. The goal is to learn a generative distribution $P_G(x)$ that matches the real data distribution $P_{\mathrm{data}}(x)$. It consists of two networks: a generator $G$ that transforms noise $z \sim P_z(z)$ into the data space as $x = G(z)$, and a discriminator $D$ that assigns probability $p = D(x)$ when $x$ is a sample from $P_{\mathrm{data}}$ and assigns probability $1 - p$ when it is a sample from $P_G$. $P_z(z)$ is a prior on $z$. The $D$ and $G$ play a two-player min-max game with the following binary cross entropy:

$\min_G \max_D \mathcal{L}_{\mathrm{GAN}}(D, G) = \mathbb{E}_{x \sim P_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log (1 - D(G(z)))] \qquad (1)$

The $D$ attempts to find the binary classifier for discriminating between true and generated data by maximizing this loss, whereas the $G$ attempts to generate data indistinguishable from the true data by minimizing this loss.
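To make the objective concrete, the following is a minimal PyTorch sketch of the binary cross-entropy game in Equation (1); the generator and discriminator modules, and the small constant added for numerical stability, are placeholders rather than the architectures used in this paper.

```python
import torch

def gan_losses(D, G, x_real, z, eps=1e-8):
    """Sketch of Eq. (1): D maximizes log D(x) + log(1 - D(G(z))),
    while G minimizes log(1 - D(G(z))).

    D is assumed to output a probability in (0, 1); G maps noise z to images.
    """
    x_fake = G(z)
    d_loss = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1.0 - D(x_fake.detach()) + eps).mean())
    g_loss = torch.log(1.0 - D(x_fake) + eps).mean()
    return d_loss, g_loss
```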

Figure 3: Sampling example using a three-layer DTLC: (a) Architecture of the three-layer DTLC. (b) Sampling example in Step 1. Each code is sampled from a categorical distribution. Filled and open circles indicate 1 and 0, respectively. (c) Sampling example in Steps 2 and 3. The ON or OFF of child node codes is selected by the parent node codes. This execution is conducted recursively from the highest layer to the lowest layer. This imposes hierarchical inclusion constraints on sampling. (d) Sample images generated using this controller. Each image corresponds to a latent code. We tested on a subset of the MNIST dataset, which includes “4” and “5” digit images. This is a relatively easy dataset; however, it is noteworthy that hierarchically disentangled representations, such as “4” or “5” in the first layer and “narrow-width 4” or “wide-width 4” in the second layer, are learned in a fully unsupervised manner.

4 DTLC-GAN

4.1 DTLC

In the naive GAN, the latent variables $z$ are sampled from an unconditional prior $P_z(z)$ and do not have any constraints on a structure. As a result, they may be used by the $G$ in a highly entangled manner, causing difficulty in interpreting the “meaning” of the individual variables and in controlling image generation by operating each one. Motivated by this fact, we incorporate the DTLC into the generator input to impose a hierarchical inclusion structure on latent variables.

Notation:

In the DTLC-GAN, the latent variables are decomposed into multiple levels. We first decompose the latent variables into two parts: $\hat{c}_L$, which is a latent code derived from an $L$-layer DTLC and targets hierarchically interpretable semantic features, and $z$, which is a source of incompressible noise that covers the factors not represented by $\hat{c}_L$. To derive $\hat{c}_L$, the DTLC has a multiple-layer tree structure and is composed of $L$ layer codes $c_1, \dots, c_L$. In each layer, $c_l$ is decomposed into $N_l$ node codes $c_l^1, \dots, c_l^{N_l}$. To impose a hierarchical inclusion relationship between the $l$-th and $(l+1)$-th layers, an $l$-th-layer parent node code $c_l^n$ is associated with $k_l$ child node codes $c_{l+1}^{(n-1)k_l+1}, \dots, c_{l+1}^{n k_l}$, where $n = 1, \dots, N_l$ and $k_l$ is the dimensionality of $c_l^n$. By this definition, $N_{l+1}$ is calculated as $N_l \cdot k_l$.

We can use both discrete and continuous variables as node codes, but for simplicity, we treat the case in which the parent node codes are discrete and the lowest-layer codes are either discrete or continuous. In this case, a parent node code $c_l^n$ is represented as a $k_l$-dimensional one-hot vector, and each of its dimensions is associated with one child node code.

Sampling Scheme:

In a training phase, we sample latent codes as follows. We illustrate a sampling example in Figure 3.

  1. We sample each parent node code $c_l^n$ ($l < L$) from a uniform categorical distribution. We sample the lowest-layer code $c_L^n$ in a similar manner in the discrete case, while we sample it from a uniform distribution in the continuous case. Note that, if we have supervision for $c_1$, we can directly use it instead of sampling.

  2. To impose a hierarchical inclusion structure, we sample the masked code $\hat{c}_{l+1}$ from the conditional prior $P(\hat{c}_{l+1} \mid \hat{c}_l)$, where $\hat{c}_1 = c_1$. We do this with the following process:

    $\hat{c}_{l+1}^{(n-1)k_l+j} = \hat{c}_l^n[j] \cdot c_{l+1}^{(n-1)k_l+j}, \quad n = 1, \dots, N_l, \; j = 1, \dots, k_l \qquad (2)$

    where $\hat{c}_l^n[j]$ is the $j$-th dimension of $\hat{c}_l^n$. This equation means that a parent node code acts as a child node selector controlling the ON or OFF of a child node code.

  3. By executing Step 2 recursively from the highest layer to the lowest layer, we can sample $\hat{c}_L$ with $L$-layer hierarchical inclusion constraints. We add it to the generator input and use it with $z$ to generate an image: $x = G(\hat{c}_L, z)$. A code sketch of this sampling procedure is given below.
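The following NumPy sketch illustrates one way to realize Steps 1–3 for purely discrete codes; the branching factors and the flat indexing of node codes are assumptions that mirror the notation above, not the exact implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dtlc(ks):
    """Hierarchical sampling sketch for an L-layer DTLC.

    ks[l-1] is the assumed branching factor k_l of layer l (e.g., ks = [10, 2]
    gives 10 first-layer categories with 2 child categories each). Returns the
    masked layer codes hat_c_1, ..., hat_c_L as flat vectors.
    """
    # Step 1: sample every node code independently as a one-hot vector.
    n_nodes, raw = 1, []
    for k in ks:
        codes = np.zeros((n_nodes, k))
        codes[np.arange(n_nodes), rng.integers(0, k, size=n_nodes)] = 1.0
        raw.append(codes)
        n_nodes *= k                      # N_{l+1} = N_l * k_l

    # Steps 2-3: switch child node codes ON or OFF with the (already masked)
    # parent node codes, recursively from the highest to the lowest layer.
    masked, parent = [], np.ones((1, 1))  # virtual root: always ON
    for codes in raw:
        gate = parent.reshape(-1, 1)      # one gate per node in this layer
        layer = codes * gate              # OFF nodes become all-zero
        masked.append(layer.reshape(-1))
        parent = layer                    # each active dim gates one child node
    return masked

hat_c = sample_dtlc([10, 2])              # e.g., an MNIST-like two-layer setting
```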

Figure 4: Example of curriculum learning: (a) We first learn disentangled representations in the first layer. To do this, we only use the regularization for this layer and fix the lower-layer codes to their average values. (b)(c) We then learn disentangled representations in the second and third layers in a layer-by-layer manner. We add regularization and sampling in turn depending on the training phase. (d) Image samples generated in each phase. In phase (a), first-layer codes are learned, while second- and third-layer codes are fixed; therefore, only first-layer disentangled representations are obtained. In phase (b), first- and second-layer codes are learned, while third-layer codes are fixed; therefore, disentangled representations up to the second layer are obtained. In phase (c), all codes are learned; therefore, disentangled representations in all three layers are obtained.

4.2 HCMI

The DTLC imposes a hierarchical inclusion structure on latent variables; however, its constraints are not sufficient to correlate latent variables with semantic features of images. To solve this problem without relying on detailed supervision, we propose a hierarchical conditional mutual information regularization (HCMI), which is an extension of MI [4] and conditional MI (CMI) [18] to hierarchical conditional settings. In particular, we use different types of regularization for the second to $L$-th layers, which have parent node codes, and for the first layer, which does not.

Regularization for the Second to $L$-th Layers:

In this case, we need to discover semantic features in a hierarchically restricted manner; therefore, we maximize the mutual information between an $(l+1)$-th-layer child node code $c_{l+1}^m$ and the image $x$ conditioned on its $l$-th-layer parent node code $\hat{c}_l^n$: $I(c_{l+1}^m; x \mid \hat{c}_l^n)$. For simplicity, we denote $c_{l+1}^m$ and $\hat{c}_l^n$ as $c$ and $p$, respectively. In practice, exact calculation of this mutual information is difficult because it requires calculation of the intractable posterior $P(c \mid x, p)$. Therefore, following previous studies [4, 18], we instead calculate its lower bound using an auxiliary distribution $Q(c \mid x, p)$ approximating $P(c \mid x, p)$:

$I(c; x \mid p) \geq \mathbb{E}_{c \sim P(c \mid p),\, x \sim G(\hat{c}_L, z)}[\log Q(c \mid x, p)] + H(c \mid p) \qquad (3)$

For simplicity, we fix the distribution of $c$ and treat the entropy $H(c \mid p)$ as a constant. In practice, $Q$ is parameterized as a neural network, and we denote the network for $c_{l+1}^m$ as $Q_{l+1}^m$, where $m = 1, \dots, N_{l+1}$. Thus, the final objective function is written as

$\mathcal{L}_{\mathrm{HCMI}}(G, Q_{l+1}^m) = \mathbb{E}_{c \sim P(c \mid p),\, x \sim G(\hat{c}_L, z)}[\log Q_{l+1}^m(c \mid x)] \qquad (4)$

The $Q_{l+1}^m$ attempts to discover the specific semantic features that correlate with $c_{l+1}^m$ in terms of conditional information gain by maximizing this objective. We calculate $\mathcal{L}_{\mathrm{HCMI}}(G, Q_{l+1}^m)$ for every child node code and denote the summation over the $(l+1)$-th layer as $\mathcal{L}_{\mathrm{HCMI}}^{l+1}(G, Q_{l+1})$. We use this objective with a trade-off parameter $\lambda_{l+1}$.
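In practice, the expectation in Equation (4) reduces to a cross-entropy term that is only enforced where the corresponding parent dimension is ON. The following PyTorch sketch shows this per-node term under that assumption; the auxiliary network producing `q_logits` and the exact masking scheme are placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def hcmi_term(q_logits, sampled_code, parent_gate):
    """Per-node sketch of L_HCMI (constant entropy term omitted).

    q_logits     : (batch, k) logits of the auxiliary network Q_{l+1}^m,
                   computed from the generated images.
    sampled_code : (batch, k) one-hot child node codes used for generation.
    parent_gate  : (batch,) 1 where the parent dimension is ON, else 0, so the
                   term is conditioned on the parent node code.
    Returns the negative of the bound, i.e., a loss to minimize for G and Q.
    """
    log_q = F.log_softmax(q_logits, dim=1)
    nll = -(sampled_code * log_q).sum(dim=1)   # cross-entropy per sample
    denom = parent_gate.sum().clamp(min=1.0)
    return (nll * parent_gate).sum() / denom
```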

Regularization for First Layer:

The above regularization is useful for the codes that have parent node codes; however, the first-layer codes do not have those; thus, we instead use a different regularization for them. Fortunately, this single-layer case has been addressed in previous studies [4, 35], and we use one of them depending on the supervision setting. In an unsupervised setting, we use the MI regularization [4], written as

$\mathcal{L}_{\mathrm{MI}}(G, Q_1) = \mathbb{E}_{c_1 \sim P(c_1),\, x \sim G(\hat{c}_L, z)}[\log Q_1(c_1 \mid x)] \qquad (5)$

In a weakly supervised setting, we use an auxiliary classifier regularization (AC) [35] written as

$\mathcal{L}_{\mathrm{AC}}(G, Q_1) = \mathbb{E}_{c_1 \sim P(c_1),\, x \sim G(\hat{c}_L, z)}[\log Q_1(c_1 \mid x)] + \mathbb{E}_{x, c_1 \sim P_{\mathrm{data}}(x, c_1)}[\log Q_1(c_1 \mid x)] \qquad (6)$

Note that the first term is the same as $\mathcal{L}_{\mathrm{MI}}$, and the added second term acts as a supervision regularization. We use these objectives with a trade-off parameter $\lambda_1$.

Full Objective:

Our full objective is written as

$\mathcal{L}(D, G, Q) = \mathcal{L}_{\mathrm{GAN}}(D, G) - \lambda_1 \mathcal{L}_{\mathrm{MI/AC}}(G, Q_1) - \sum_{l=2}^{L} \lambda_l \mathcal{L}_{\mathrm{HCMI}}^{l}(G, Q_l) \qquad (7)$

where $\mathcal{L}_{\mathrm{MI/AC}}$ denotes $\mathcal{L}_{\mathrm{MI}}$ in the unsupervised setting and $\mathcal{L}_{\mathrm{AC}}$ in the weakly supervised setting.

This objective is minimized for the $G$ and $Q$ and maximized for the $D$.

4.3 Curriculum Learning

The HCMI works well when the higher-layer codes are already known; however, we assume the condition in which detailed annotations are not provided in advance. As a result, the networks may confuse inter-layer and intra-layer disentanglement at the beginning of training. To mitigate this problem, we developed our curriculum learning method. In particular, we define a curriculum for regularization and a curriculum for sampling. We illustrate an example of the proposed curriculum learning method in Figure 4.

Curriculum for Regularization:

As a curriculum for regularization, we do not use all the regularization terms in Equation (7) at the same time; instead, we add the regularization from the highest layer to the lowest layer in turn according to the training phase. In an unsupervised setting, we first learn with $\mathcal{L}_{\mathrm{GAN}} - \lambda_1 \mathcal{L}_{\mathrm{MI}}$ and then add the $-\lambda_l \mathcal{L}_{\mathrm{HCMI}}^{l}$ terms ($l = 2, \dots, L$) in turn. In a weakly supervised setting, we first learn with $\mathcal{L}_{\mathrm{GAN}} - \lambda_1 \mathcal{L}_{\mathrm{AC}} - \lambda_2 \mathcal{L}_{\mathrm{HCMI}}^{2}$ and then add the $-\lambda_l \mathcal{L}_{\mathrm{HCMI}}^{l}$ terms ($l = 3, \dots, L$) in turn. We use different curricula between these two settings because, in a weakly supervised setting, we already know the first-layer codes; thus, we can start from learning the second-layer codes.

Curriculum for Sampling:

In learning the higher-layer codes, instability caused by random sampling of the lower-layer codes can degrade the learning performance. Motivated by this fact, we define a curriculum for sampling. In particular, in learning the codes up to the $l$-th layer, we fix the lower-layer codes $c_{l+1}, \dots, c_L$ and set them to their average values, e.g., we set $1/k_m$ to each dimension of a discrete code $c_m^n$ and set $0$ to a continuous code $c_m^n$ ($m > l$).
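As an illustration, the sketch below combines the curricula for regularization and sampling for purely discrete codes: only the layers currently being learned are sampled with the DTLC (reusing `sample_dtlc` from the earlier sketch), while every deeper layer is frozen at the assumed average value of 1/k per dimension. The interface, the example branching factors, and the treatment of the frozen layers are assumptions, not the exact implementation.

```python
import numpy as np

def curriculum_codes(ks, active_layers):
    """Curriculum-for-sampling sketch (discrete codes only).

    Layers 1..active_layers are sampled hierarchically via sample_dtlc
    (defined in the earlier sketch); each deeper layer is replaced by a
    constant vector filled with its average value 1/k so that its
    randomness does not disturb learning of the higher-layer codes.
    """
    masked = sample_dtlc(ks[:active_layers])
    n_nodes = int(np.prod(ks[:active_layers]))
    for k in ks[active_layers:]:
        masked.append(np.full(n_nodes * k, 1.0 / k))
        n_nodes *= k
    return masked

# Phase (a) of Figure 4: only the first layer varies; later phases unfreeze
# the second and third layers in turn (example branching factors).
phase_a = curriculum_codes([10, 3, 3], active_layers=1)
phase_c = curriculum_codes([10, 3, 3], active_layers=3)
```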

4.4 Relationship to Previous GANs

The DTLC-GAN is a general framework, and we can see it as a natural extension of previous GANs. We summarize this relationship in Table 1. In particular, the InfoGAN [4] and CFGAN [18] are highly related to the DTLC-GAN in terms of discovering hidden representations on the basis of information gain; however, they are limited to learning one-layer hidden representations. We developed our DTLC-GAN, HCMI, and curriculum learning method to overcome this limitation. (Strictly speaking, the CFGAN is formulated as an extension of the CGAN, while the weakly supervised DTLC-GAN is formulated as an extension of the AC-GAN; therefore, these two models do not have completely the same architecture, but they share a similar motivation.)

# of Hidden Layers   Unsupervised    (Weakly) Supervised
0                    GAN [12]        CGAN [33], AC-GAN [35]
1                    InfoGAN [4]     CFGAN [18]
2, 3, 4, ...         DTLC$^L$-GAN    DTLC$^L$-GAN$^{\mathrm{WS}}$

Table 1: Relationship to previous GANs

5 Implementation

We designed the network architectures and training scheme on the basis of techniques introduced for the InfoGAN [4]. The $D$ and $Q$ share all convolutional layers, and one fully connected layer is added to the final layer for $Q$. This means that the difference in the calculation cost between the GAN and DTLC-GAN is negligibly small. For a discrete code, we represent $Q$ with a softmax nonlinearity. For a continuous code, we treat $Q$ as a factored Gaussian.
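The following PyTorch sketch illustrates this weight-sharing design: the discriminator and the auxiliary functions share a convolutional trunk, the discriminator adds a single output layer, and each discrete node code gets one extra fully connected head. The layer sizes, the 32x32 input resolution, and the head structure are placeholders, not the exact architectures listed in the appendix.

```python
import torch
import torch.nn as nn

class SharedDQ(nn.Module):
    """D and the auxiliary functions Q share all convolutional layers;
    only small fully connected heads differ (placeholder sizes, assuming
    3x32x32 inputs)."""
    def __init__(self, node_dims):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),    # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.1),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 1024), nn.LeakyReLU(0.1),
        )
        self.d_head = nn.Linear(1024, 1)                        # real/fake logit
        self.q_heads = nn.ModuleList(
            [nn.Linear(1024, k) for k in node_dims]             # one head per node code
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.d_head(h), [head(h) for head in self.q_heads]
```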

In most of the experiments we conducted, we used typical DCGAN models [38] and did not use the state-of-the-art GAN training techniques to evaluate whether the DTLC-GAN works well without relying on such techniques. However, our contributions are orthogonal to these techniques; therefore, we can improve image quality easily by incorporating them into our DTLC-GAN. To demonstrate this, we also tested the DTLC-WGAN-GP (our DTLC-GAN with the WGAN-GP ResNet [14]), as discussed in Section 6.3. The details of the experimental setup are given in Section B of the appendix.

6 Experiments

We conducted experiments on various datasets, i.e., MNIST [26], CIFAR-10 [22], Tiny ImageNet [44], 3D Faces [37], and CelebA [27], to evaluate the effectiveness and generality of our DTLC-GAN. (Due to the limited space, we provide only the most important results in this main text; please refer to the appendix for more results.) We first used the MNIST and CIFAR-10 datasets, which are widely used in this field, to analyze the DTLC-GAN qualitatively and quantitatively. In particular, we evaluated the DTLC-GAN in an unsupervised setting on the MNIST dataset and in a weakly supervised setting on the CIFAR-10 dataset (Sections 6.1 and 6.2, respectively). We tested the DTLC-WGAN-GP on the CIFAR-10 and Tiny ImageNet datasets to demonstrate that our contributions are orthogonal to the state-of-the-art GAN training techniques (Section 6.3). We used the 3D Faces dataset to evaluate the effectiveness of the DTLC-GAN with continuous codes (Section 6.4) and evaluated it on image-retrieval tasks using the CelebA dataset (Section 6.5). Hereafter, we denote the DTLC-GAN with an $L$-layer DTLC as the DTLC$^L$-GAN and the DTLC$^L$-GAN in a weakly supervised setting as the DTLC$^L$-GAN$^{\mathrm{WS}}$.

Figure 5: Representation comparison on MNIST: We compared models in which the dimensions of the latent codes given to $G$ are the same (20). In each figure, each column contains three samples from the same category. In each row, one latent code is varied, while the other latent codes and noise are fixed.

6.1 Unsupervised Representation Learning

We first analyzed the DTLC-GAN in unsupervised settings on the MNIST dataset, which consists of images of handwritten digits and contains 60,000 training and 10,000 test samples.

Representation Comparison:

To confirm the effectiveness of hierarchical representation learning, we compared the DTLC-GAN with models in which the dimensions of the latent codes given to the $G$ are the same but not hierarchical. To represent our DTLC-GAN, we used the DTLC$^2$-GAN, where $k_1 = 10$ and $k_2 = 2$. In this model, $\hat{c}_2$, the dimension of which is 20, is given to the $G$. For comparison, we used two models in which the latent code dimensions are also 20 but not hierarchical. One is an InfoGAN that has one code, and the other is an InfoGAN that has two codes. We show the results in Figure 5. In (c), the DTLC$^2$-GAN succeeded in learning hierarchically interpretable representations (digits in the first layer and details of each digit in the second layer). In (a), the InfoGAN succeeded in learning disentangled representations; however, they were learned as a flat relationship; thus, it was not trivial to estimate the higher concept (e.g., digits) from them. In (b-1) and (b-2), the InfoGAN failed to learn interpretable representations. We argue that this is because the two codes struggle to represent the digit types separately. To clarify this limitation, we also conducted experiments on simulated data. See Section A.1 of the appendix for details.

Figure 6: Ablation study in unsupervised settings on MNIST: Left figures show changes in mean SSIM scores through learning. We measured those between pairs of images within same category per layer. Right figures show sample images generated with varying latent codes per layer. Gray line indicates parent-child relationship. From top to bottom, (a) DTLC-GAN without curriculum, (b) DTLC-GAN with curriculum for regularization, and (c) DTLC-GAN with full curriculum (curriculum for regularization and sampling: proposed curriculum learning method).
Figure 7: Ablation study in weakly supervised settings on CIFAR-10: View of figure is similar to that in Figure 6

Ablation Study on Curriculum Learning:

To analyze the effectiveness of the proposed curriculum learning method, we conducted an ablation study. To evaluate it quantitatively, we measured the inter-category diversity of generated images on the basis of the structural similarity (SSIM) [53], which is a well-characterized perceptual similarity metric. This is an ad-hoc measure; however, recent studies [18, 35] showed that an SSIM-based measure is useful for evaluating the diversity of images generated with a GAN. Note that evaluating the quality of deep generative models is not trivial and is still an open issue due to the variety of probabilistic criteria [47]. To evaluate the $l$-th-layer inter-category diversity, we measured the SSIM scores between pairs of images that are sampled from the same noise and higher-layer codes but random $l$-th- and lower-layer codes. We calculated the scores for 50,000 randomly sampled pairs of images and took the average; a smaller value indicates larger diversity. We show changes in the mean SSIM scores through learning and sample images generated with varying latent codes per layer in Figure 6. We used the same DTLC-GAN configuration for all three conditions. From these results, the DTLC-GAN with the full curriculum succeeded in making higher-layer codes obtain higher diversity and lower-layer codes obtain lower diversity, while the others failed. We argue that this is because the latter cannot avoid confusion between inter-layer and intra-layer disentanglement. The qualitative results also support this fact. We also show sample images for all categories in Figures 14–16 of the appendix.
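The diversity measure can be sketched as follows, assuming the image pairs are already rendered as grayscale uint8 arrays that share the same noise and higher-layer codes; the scikit-image SSIM implementation is used here purely for illustration.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_intercategory_ssim(pairs):
    """Mean SSIM over image pairs; a smaller value indicates larger diversity.

    `pairs` is an iterable of (img_a, img_b) 2-D uint8 arrays generated from
    the same noise and higher-layer codes but random l-th- and lower-layer codes.
    """
    scores = [structural_similarity(a, b, data_range=255) for a, b in pairs]
    return float(np.mean(scores))
```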

6.2 Weakly Supervised Representation Learning

We next analyzed the DTLC-GAN in weakly supervised settings (i.e., only class labels are supervised) on the CIFAR-10 dataset, which consists of 10 classes of images and contains 5,000 training and 1,000 test samples per class.

Ablation Study on Curriculum Learning:

We conducted an ablation study to evaluate the effectiveness of the proposed curriculum learning method in weakly supervised settings. We show changes in mean SSIM scores through learning and sample images generated with varying latent codes per layer in Figure 7. In this experiment, we used a four-layer DTLC-GAN in the weakly supervised setting. We can see the same tendency as in Figure 6. These results indicate that the proposed curriculum learning method is indispensable even in weakly supervised settings. We show sample images for all categories in Figures 17–19 of the appendix. We also conducted preference tests to analyze visual interpretability. See Section A.2 of the appendix for details.

Quantitative Evaluation:

An important concern is whether our extension degrades image quality. To address this concern, we evaluated the DTLC-GAN on three metrics: inception score [45], adversarial accuracy [56], and adversarial divergence [56]. (The latter two metrics require pairs of generated images and class labels to train a classifier. In our settings, a conditional generator is learned; thus, we directly used it to generate an image with a class label. We used a classifier whose architecture was similar to that of the $D$ except for the output layer.) We list the results in Table 2. We compared a GAN, the AC-GAN, and DTLC-GANs with DTLCs of increasing depth. For a fair comparison, we used the same network architecture and training scheme except for the extended parts. The inception scores are not state-of-the-art, but in this comparison, the DTLC-GAN improved upon the GAN and was comparable to the AC-GAN. The adversarial accuracy and adversarial divergence scores are state-of-the-art, and the DTLC-GAN improved upon the AC-GAN. These results are noteworthy because they indicate that we can obtain expressive representations using the DTLC-GAN without concern for image-quality degradation.

Model              Inception Score   Adversarial Accuracy   Adversarial Divergence
GAN                7.09 ± 0.09       -                       -
AC-GAN             7.41 ± 0.06       50.99 ± 0.55            2.07 ± 0.02
DTLC-GAN           7.39 ± 0.03       55.10 ± 0.48            1.82 ± 0.03
DTLC-GAN           7.35 ± 0.09       55.20 ± 0.47            1.95 ± 0.05
DTLC-GAN           7.46 ± 0.06       56.19 ± 0.36            1.93 ± 0.05
DTLC-GAN           7.51 ± 0.06       58.87 ± 0.52            1.83 ± 0.04
Real Images        11.24 ± 0.12      85.77 ± 0.22            0
State-of-the-Art   8.59 ± 0.12†      44.22 ± 0.08‡           5.57 ± 0.06‡

Table 2: Quantitative comparison among GAN, AC-GAN, and DTLC-GAN (†Huang et al. [15], ‡Yang et al. [56])

6.3 Combination with WGAN-GP

Another concern is whether our contributions are orthogonal to the state-of-the-art GAN training techniques. To demonstrate this, we tested the DTLC-WGAN-GP on three cases: CIFAR-10 (unsupervised), CIFAR-10 (weakly supervised), and Tiny ImageNet (unsupervised). (Tiny ImageNet is a tiny version of the ImageNet dataset containing 200 classes with 500 images each; to shorten the training time, we resized the images.) The number of categories was the same as that of the models used in Table 2. We list the results in Table 3. Interestingly, in all cases, the scores improved as the layers became deeper, and the DTLC-WGAN-GPs achieved state-of-the-art performance. We show generated image samples in Figures 20–22 of the appendix.

Model              CIFAR-10 (Unsupervised)   CIFAR-10 (Supervised)   Tiny ImageNet (Unsupervised)
WGAN-GP            7.86 ± .07†               -                       8.33 ± .11
AC/Info-WGAN-GP    7.97 ± .09                8.42 ± .10†             8.33 ± .10
DTLC-WGAN-GP       8.03 ± .12                8.44 ± .10              8.34 ± .08
DTLC-WGAN-GP       8.15 ± .08                8.56 ± .07              8.41 ± .10
DTLC-WGAN-GP       8.22 ± .11                8.80 ± .08              8.51 ± .08
State-of-the-Art   7.86 ± .07†               8.59 ± .12‡             -

Table 3: Inception scores for WGAN-GP-based models (†Gulrajani et al. [14], ‡Huang et al. [15])

6.4 Extension to Continuous Codes

To analyze the DTLC-GAN with continuous codes, we evaluated it on the 3D Faces dataset, which consists of faces generated from a 3D face model and contains 240,000 samples. We compared three models: an InfoGAN with five continuous codes (the setting used in the InfoGAN study [4]), an InfoGAN with one categorical code and one continuous code, and the DTLC$^2$-GAN, which has one categorical code in the first layer and five continuous codes in the second layer. We show example results in Figure 8. In the InfoGANs (a, b), the individual codes tend to represent independent and exclusive semantic features because they have a flat relationship, while in the DTLC$^2$-GAN (c), we can learn category-specific (in this case, pose-specific) semantic features conditioned on the higher-layer codes.

Figure 8: Representation comparison of models that have continuous codes: Each sample is generated from same noise but different continuous codes (varied from left to right). In InfoGANs (a, b), each code is independent and exclusive, while in DTLC-GAN (c), lower layer codes learn category-specific (in this case, pose-specific) semantic features conditioned on higher layer codes.

6.5 Application to Image Retrieval

One possible application of the DTLC-GAN is to use hierarchically interpretable representations for image retrieval. To confirm this, we used the CelebA dataset, which consists of photographs of faces and contains 180,000 training and 20,000 test samples. To search for an image hierarchically, we measure the L2 distance between query and database images on the basis of $\hat{c}_1, \dots, \hat{c}_3$, which are predicted using the auxiliary functions $Q$. Figure 9 shows the results of bangs-based, glasses-based, and smiling-based image retrieval. For evaluation, we used the test set of the CelebA dataset. We trained a three-layer DTLC-GAN in the weakly supervised setting, particularly where hierarchical representations are learned only for the attribute presence state. (We provide generated image samples in Figures 24–26 of the appendix.) These results indicate that as the layer becomes deeper, images whose attribute details match more closely can be retrieved. To evaluate quantitatively, we measured the SSIM score between query and database images for the attribute-specific areas [18] defined in Figure 10. We summarize the scores in Table 4. These results indicate that as the layer becomes deeper, the concordance rate of the attribute-specific areas increases.
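A retrieval step of this kind can be sketched as follows; the code vectors are assumed to be the concatenated predictions of the auxiliary functions for the layers used in the query, and the top-k interface is a placeholder.

```python
import numpy as np

def retrieve(query_code, database_codes, top_k=5):
    """Return indices of the nearest database images by L2 distance.

    query_code     : (d,) concatenated predicted codes [hat_c_1, ..., hat_c_l]
                     for the query image; using more layers matches finer details.
    database_codes : (n, d) the same predictions for every database image.
    """
    dists = np.linalg.norm(database_codes - query_code[None, :], axis=1)
    return np.argsort(dists)[:top_k]
```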

Figure 9: Example results of hierarchically interpretable image retrieval: To search for an image hierarchically, we measure the L2 distance between query and database images on the basis of $\hat{c}_1$, $\hat{c}_2$, and $\hat{c}_3$.
Figure 10: Attribute-specific areas used for evaluation in Table 4
Code          Bangs   Glasses   Smiling
$\hat{c}_1$   0.150   0.189     0.274
$\hat{c}_2$   0.194   0.256     0.294
$\hat{c}_3$   0.211   0.265     0.326

Table 4: Attribute-specific SSIM scores for different codes

7 Discussion and Conclusions

We proposed an extension of the GAN called the DTLC-GAN to learn hierarchically interpretable representations. To develop it, we introduced the DTLC to impose a hierarchical inclusion structure on latent variables and proposed the HCMI and curriculum learning method to discover the salient semantic features in a layer-by-layer manner by only using a single DTLC-GAN model without relying on detailed supervision. Experiments showed promising results, indicating that the DTLC-GAN is well suited for learning hierarchically interpretable representations. The DTLC-GAN is a general model, and possible future work includes applying it to other models, such as encoder-decoder models [7, 9, 21, 25, 43], and using it as a latent hierarchical structure discovery tool for high-dimensional data.

References

  • [1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
  • [4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
  • [5] H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville. Modulating early visual processing by language. In NIPS, 2017.
  • [6] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
  • [7] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
  • [8] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
  • [9] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. In ICLR, 2017.
  • [10] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.
  • [11] S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, and G. E. Hinton. Attend, infer, repeat: Fast scene understanding with generative models. In NIPS, 2016.
  • [12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [13] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.
  • [14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
  • [15] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In CVPR, 2017.
  • [16] D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic. Generating images with recurrent adversarial networks. In ICLR Workshop, 2016.
  • [17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [18] T. Kaneko, K. Hiramatsu, and K. Kashino. Generative attribute controller with conditional filtered generative adversarial networks. In CVPR, 2017.
  • [19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [20] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
  • [21] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
  • [22] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009.
  • [23] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
  • [24] H. Kwak and B.-T. Zhang. Generating images part by part with composite generative adversarial networks. arXiv preprint arXiv:1607.05387, 2016.
  • [25] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
  • [26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278–2324, 1998.
  • [27] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • [28] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop, 2013.
  • [29] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In NIPS, 2016.
  • [30] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR, 2016.
  • [31] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
  • [32] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In NIPS, 2016.
  • [33] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [34] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
  • [35] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
  • [36] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
  • [37] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3D face model for pose and illumination invariant face recognition. In AVSS, 2009.
  • [38] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [39] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
  • [40] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
  • [41] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas. Generating interpretable images with controllable structure. In ICLR Workshop, 2017.
  • [42] S. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogy-making. In NIPS, 2015.
  • [43] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • [44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
  • [45] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
  • [46] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In ICLR, 2016.
  • [47] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. In ICLR, 2016.
  • [48] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, 2017.
  • [49] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
  • [50] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with pixelCNN decoders. In NIPS, 2016.
  • [51] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
  • [52] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
  • [53] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. on IP, 13(4):600–612, 2004.
  • [54] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. In ICML Workshop, 2015.
  • [55] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016.
  • [56] J. Yang, A. Kannan, D. Batra, and D. Parikh. LR-GAN: Layered recursive generative adversarial networks for image generation. In ICLR, 2017.
  • [57] J. Yang, S. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS, 2015.
  • [58] A. Yu and K. Grauman. Just noticeable differences in visual attributes. In ICCV, 2015.
  • [59] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [60] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR, 2017.
  • [61] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.
  • [62] S. Zhao, J. Song, and S. Ermon. Learning hierarchical features from generative models. In ICML, 2017.

In this appendix, we provide additional analysis in Section A, give details on the experimental setup in Section B, and provide extended results in Section C. We provide other supplementary materials including demo videos at http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/dtlc-gan/.

Appendix A Additional Analysis

A.1 Representation Comparison on Simulated Data

To clarify the limitation of the InfoGANs compared in Figure 5, we conducted experiments on simulated data. In particular, we used simulated data that are hierarchically sampled in the 2D space and have globally ten categories and locally two categories. When sampling data, we first randomly selected a global position from ten candidates that are equally spaced around a circle. We then randomly selected a local position from two candidates that are rotated clockwise and anticlockwise by a fixed angle from the global position. Based on this local position, we sampled data from a Gaussian distribution.
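The sampling procedure can be sketched as follows; the radius, rotation offset, and noise standard deviation shown here are placeholder values, since the exact constants are not reproduced in this text.

```python
import numpy as np

def sample_simulated(n, radius=2.0, offset=np.pi / 20, sigma=0.05, seed=0):
    """Hierarchical 2-D toy data: ten global positions equally spaced on a
    circle, each split into two local positions rotated clockwise or
    anticlockwise, plus Gaussian noise (all constants are placeholders)."""
    rng = np.random.default_rng(seed)
    g = rng.integers(0, 10, size=n)                  # global category
    s = rng.choice([-1.0, 1.0], size=n)              # local category
    theta = 2.0 * np.pi * g / 10.0 + s * offset
    centers = radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return centers + rng.normal(scale=sigma, size=(n, 2))
```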

We compared models that are similar to those compared in Figure 5. As the proposed model, we used the DTLC$^2$-GAN, where $k_1 = 10$ and $k_2 = 2$. In this model, $\hat{c}_2$, the dimension of which is 20, is given to the $G$. For comparison, we used two models in which the latent code dimensions are also 20 but not hierarchical. One is an InfoGAN that has one code, and the other is an InfoGAN that has two codes. For the DTLC$^2$-GAN, we also compared the variants with and without curriculum learning.

We show the results in Figure 11. The results indicate that the InfoGANs (b, c) and the DTLC$^2$-GAN without curriculum learning (d) tend to cause unbalanced or non-hierarchical clustering. In contrast, the DTLC$^2$-GAN with curriculum learning (e) succeeds in capturing hierarchical structures, i.e., the first-layer code captured the ten global positions, whereas the second-layer codes captured the two local positions for each global position.

A.2 Visual Interpretability Analysis

To clarify the benefit of the learned representations, we conducted two XAB tests. For each test, we compared the fourth-layer models (DTLC-GANs or DTLC-WGAN-GPs) with and without curriculum learning.

  • Test I: Difference Interpretability Analysis
    To confirm whether $\hat{c}_L$ is more interpretable than $z$, we compared the generated images (X) with images generated from latent variables in which one dimension of $\hat{c}_L$ is changed (A) and images in which one dimension of $z$ is changed (B). The changed dimension of $\hat{c}_L$ or $z$ was randomly chosen. We asked participants which difference is more interpretable, or whether they are even.

  • Test II: Semantic Similarity Analysis
    To confirm whether $\hat{c}_L$ is hierarchically interpretable, we compared the generated images (X) with images generated from latent variables in which one dimension of a code in one layer is varied (A) and images in which one dimension of a code in another layer is varied (B). For each case, we fixed the higher-layer codes. The changed dimensions were randomly chosen, and the lower-layer codes were also randomly chosen. We asked participants which set is semantically more similar to X, or whether they are even.

To eliminate bias in individual samples, we showed 25 samples at the same time. To eliminate bias in the order of stimuli, the order (AB or BA) was randomly selected. We show the user interfaces in Figure 12.

We summarize the results in Table 5. In (a) and (b), we list the results of tests I and II, respectively, using the DTLC-GAN models that were used for the experiments discussed in Figure 7. The results of test I indicate that $\hat{c}_L$ is more interpretable than $z$ regardless of curriculum learning. We argue that this is because $z$ does not have any constraints on a structure and may be used by the $G$ in a highly entangled manner. The results of test II indicate that representations learned with curriculum learning are hierarchically categorized in a better way in terms of semantics than those learned without it. The results support the effectiveness of the proposed curriculum learning method.

We also conducted test II (semantic similarity analysis) for all the DTLC-WGAN-GPs discussed in Section 6.3. We summarize the results in Table 5(c)–(e). We observed a similar tendency to that of the DTLC-GAN.

A.3 Unsupervised Learning on Complex Dataset

Although, in Section 6.1, we mainly analyzed unsupervised settings on the relatively simple MNIST dataset, we can learn hierarchical representations in an unsupervised manner even on more complex datasets. However, in this case, the learning targets depend on the initialization because such datasets can be categorized in various ways. We illustrate this in Figure 13. We also evaluated the DTLC-WGAN-GP in unsupervised settings on the CIFAR-10 and Tiny ImageNet datasets; see Section 6.3 for details.

(a) Test I for DTLC-GAN on CIFAR-10 (number of collected answers: 400)
W/o curriculum   0.0   1.0   1.0   99.0   1.0
W/ curriculum    0.0   1.0   1.0   99.0   1.0

(b) Test II for DTLC-GAN on CIFAR-10 (number of collected answers: 450)
W/o curriculum   22.4 ± 3.9   41.3 ± 4.6   36.2 ± 4.5
W/ curriculum     3.6 ± 1.7   17.8 ± 3.5   78.7 ± 3.8

(c) Test II for DTLC-WGAN-GP on CIFAR-10 (number of collected answers: 300)
W/o curriculum   18.0 ± 4.4   31.3 ± 5.3   50.7 ± 5.7
W/ curriculum     4.7 ± 2.4   12.0 ± 3.7   83.3 ± 4.2

(d) Test II for DTLC-WGAN-GP on CIFAR-10 (number of collected answers: 300)
W/o curriculum   21.7 ± 4.7   38.3 ± 5.5   40.0 ± 5.6
W/ curriculum    17.0 ± 4.3   24.0 ± 4.9   59.0 ± 5.6

(e) Test II for DTLC-WGAN-GP on Tiny ImageNet (number of collected answers: 250)
W/o curriculum   13.2 ± 4.2   53.6 ± 6.2   33.2 ± 5.9
W/ curriculum     2.4 ± 1.9   17.2 ± 4.7   80.4 ± 5.0

Table 5: Average preference score (%) with confidence intervals for each answer option (including “even”). We compared fourth-layer models (DTLC-GANs or DTLC-WGAN-GPs) with and without curriculum learning.
Figure 11: Evaluation on simulated data: (a) We used simulated data, which have globally ten categories and locally two categories. In (b), all categories are learned at the same time. In (c) and (d), the categories are also learned at the same time, causing unbalanced and non-hierarchical clustering. In (e), ten global categories are first discovered, and then two local categories are learned. Upper left: kernel density estimation (KDE) plots. Others: samples from real data or models. Same color indicates same category.
Figure 12: User interfaces for XAB tests: (a) Samples in “Image: A” are generated from latent variables in which one dimension of is changed. Samples in “Image: B” are generated from latent variables in which one dimension of is changed. (b) Samples in “Image: A” are generated from latent variables in which one dimension of is changed. Samples in “Image: B” are generated from latent variables in which one dimension of is changed. (c) Samples in “Image: A” are generated from latent variables in which is varied. Samples in “Image: B” are generated from latent variables in which is varied. (d) Samples in “Image: A” are generated from latent variables in which is varied. Samples in “Image: B” are generated from latent variables in which is varied.
Figure 13: Representation comparison between two models that are learned in a fully unsupervised manner with different initialization: In (a), samples are generated from one model, while in (b), samples are generated from another model. In each row, $\hat{c}_1$ and $\hat{c}_2$ are varied per three images and per image, respectively. In this setting, the learning targets (hair style in (a) and pose in (b)) depend on the initialization because this dataset can be categorized in various ways.

Appendix B Details on Experimental Setup

In this section, we describe the network architectures and training scheme for each dataset. We designed the network architectures and training scheme on the basis of techniques introduced for the InfoGAN [4]. The $D$ and $Q$ share all convolutional layers (Conv.), and one fully connected layer (FC.) is added to the final layer for $Q$. This means that the difference in the calculation cost between the GAN and DTLC-GAN is negligibly small. For a discrete code, we represented $Q$ with a softmax nonlinearity. For a continuous code, we parameterized $Q$ through a factored Gaussian.

In most of the experiments we conducted, we designed the network architectures and training scheme on the basis of the techniques introduced for the DCGAN [38] and did not use the state-of-the-art GAN training techniques to evaluate whether the DTLC-GAN works well without relying on such techniques. To downscale and upscale, we respectively used convolutions (Conv.) and backward convolutions (Conv.), i.e., fractionally strided convolutions, with stride 2. As activation functions, we used rectified linear units (ReLUs) [34] for the $G$, while we used leaky rectified linear units (LReLUs) [28, 54] for the $D$. We applied batch normalization (BNorm) [17] to all the layers except the generator output layer and the discriminator input layer. We trained the networks using the Adam optimizer [19] with a minibatch of size . The learning rate was set to for the and to for the . The momentum term was set to .

To demonstrate that our contributions are orthogonal to the state-of-the-art GAN training techniques, we also tested the DTLC-WGAN-GP (our DTLC-GAN with the WGAN-GP ResNet [14]) discussed in Section 6.3. We used similar network architectures and training scheme as the WGAN-GP ResNet, except for the extended parts.

The details for each dataset are given below.

B.1 MNIST

The DTLC-GAN network architectures for the MNIST dataset, which were used for the experiments discussed in Section 6.1, are shown in Table 6. As a pre-process, we normalized the pixel value to the range . In the generator output layers, we used the Sigmoid function. We used the DTLC-GAN, where and , i.e., which has one discrete code in the first layer and discrete codes in the th layer where , , and . We added to the generator input. The trade-off parameters were set to 0.1. We trained the networks for iterations in unsupervised settings. As a curriculum for , we added regularization and sampling after iterations.

B.2 CIFAR-10

The DTLC-GAN network architectures for the CIFAR-10 dataset, which were used for the experiments discussed in Section 6.2, are shown in Table 7. As a pre-process, we normalized the pixel value to the range . In the generator output layers, we used the Tanh function. We used the DTLC-GAN, where and , i.e., which has one ten-dimensional discrete code in the first layer and discrete codes in the th layer where , , and . We added to the generator input. We used the supervision (i.e., class labels) for . The trade-off parameters were set to 1. We trained the networks for iterations in weakly supervised settings. As a curriculum for , we added regularization and sampling after iterations.

B.3 DTLC-WGAN-GP

The DTLC-WGAN-GP network architectures for the CIFAR-10 and Tiny ImageNet datasets, which were used for the experiments discussed in Section 6.3, are similar to the WGAN-GP ResNet used in a previous paper [14], except for the extended parts. We used the DTLC-WGAN-GP, where and , i.e., which has one ten-dimensional discrete code in the first layer and discrete codes in the th layer where , , and . Following the AC-WGAN-GP ResNet implementation [14], we used conditional batch normalization (CBN) [5, 10] to make the conditioned on the codes . CBN has two parameters, i.e., gain parameter and bias parameter , for each category, where . As curriculum for sampling, in learning the higher layer codes, we used and averaged over those for the related lower layer node codes.

In unsupervised settings, we sampled from categorical distribution . The trade-off parameters were set to 1. We trained the networks for iterations. As a curriculum for , we added regularization and sampling after iterations.

In weakly supervised settings, we used the supervision (i.e., class labels) for . The were set to 1. We trained the networks for iterations. As a curriculum for , we added regularization and sampling after iterations.

B.4 3D Faces

The DTLC-GAN network architectures for the 3D Faces dataset, which were used for the experiments discussed in Section 6.4, are shown in Table 8. As a pre-process, we normalized the pixel value to the range . In the generator output layers, we used the Sigmoid function. We used the DTLC-GAN, where and , i.e., which has one discrete code in the first layer and five continuous codes in the second layer. We added to the generator input. The trade-off parameters and were set to 1. We trained the networks for iterations in unsupervised settings. As a curriculum for , we added regularization and sampling after iterations.

B.5 CelebA

The DTLC-GAN network architectures for the CelebA dataset, which were used for the experiments discussed in Section 6.5, are shown in Table 9. As a pre-process, we normalized the pixel value to the range . In the generator output layers, we used the Tanh function. We used the DTLC-GAN, where and , particularly where hierarchical representations are learned only for the attribute presence state. Therefore, and () is calculated as . This model has one two-dimensional discrete code in the first layer and discrete codes in the th layer where and . We added to the generator input. We used the supervision (i.e., an attribute label) for . The trade-off parameters were set to 1, 0.1, and 0.04 for bangs, glasses, and smiling, respectively. We trained the networks for iterations in weakly supervised settings. As a curriculum for , we added regularization and sampling after iterations.

B.6 Simulated Data

The DTLC-GAN network architectures for the simulated data used for the experiments discussed in Section A.1, are shown in Table 10. As a pre-process, we scaled the discriminator input by factor 4 (roughly scaled to range ). We used the DTLC-GAN, where and , i.e., which has one discrete code in the first layer and ten discrete codes in the second layer. We added to the generator input. The trade-off parameters and were set to 1. We trained the networks using the Adam optimizer with a minibatch of size 512. The learning rate was set to 0.0001 for and . The momentum term was set to 0.5. We trained the networks for iterations in unsupervised settings. As a curriculum for , we added regularization and sampling after iterations.

 

Generator

 

Input
1024 FC., BNorm, ReLU
FC., BNorm, ReLU
64 Conv. , BNorm, ReLU
1 Conv. , Sigmoid

 

 


Discriminator / Auxiliary Function

 

Input 1 gray image
64 Conv. , LReLU
128 Conv. , BNorm, LReLU
1024 FC., BNorm, LReLU
FC. output for
128 FC., BNorm, LReLU-FC. output for

 

Table 6: DTLC-GAN network architectures used for MNIST

 

Generator

 

Input
FC., BNorm, ReLU
256 Conv. , BNorm, ReLU
128 Conv. , BNorm, ReLU
64 Conv. , BNorm, ReLU
3 Conv., Tanh

 

 


Discriminator / Auxiliary Function

 

Input 3 color image
64 Conv., LReLU, Dropout
128 Conv. , BNorm, LReLU, Dropout
128 Conv., BNorm, LReLU, Dropout
256 Conv. , BNorm, LReLU, Dropout
256 Conv., BNorm, LReLU, Dropout
512 Conv. , BNorm, LReLU, Dropout
512 Conv., BNorm, LReLU, Dropout
FC. output for
128 FC., BNorm, LReLU, Dropout-
FC. output for

 

Table 7: DTLC-GAN network architectures used for CIFAR-10

 

Generator

 

Input
1024 FC., BNorm, ReLU
FC., BNorm, ReLU
64 Conv. , BNorm, ReLU
1 Conv. , Sigmoid

 

 


Discriminator / Auxiliary Function

 

Input 1 gray image
64 Conv. , LReLU
128 Conv. , BNorm, LReLU
1024 FC., BNorm, LReLU
FC. output for
128 FC., BNorm, LReLU-FC. output for

 

Table 8: DTLC-GAN network architectures used for 3D Faces

 

Generator

 

Input
FC., BNorm, ReLU
256 Conv. , BNorm, ReLU
128 Conv. , BNorm, ReLU
64 Conv. , BNorm, ReLU
3 Conv. , Tanh

 

 


Discriminator / Auxiliary Function

 

Input 3 color image
64 Conv. , LReLU
128 Conv. , BNorm, LReLU
256 Conv. , BNorm, LReLU
512 Conv. , BNorm, LReLU
FC. output for
128 FC., BNorm, LReLU-FC. output for

 

Table 9: DTLC-GAN network architectures used for CelebA

 

Generator

 

Input
128 FC., ReLU
128 FC., ReLU
2 FC.

 

 


Discriminator / Auxiliary Function

 

Input 2D simulated data
(scaled by factor 4 (roughly scaled to range ))
128 FC. ReLU
128 FC. ReLU
FC. output for
128 FC., ReLU-FC. output for

 

Table 10: DTLC-GAN network architectures used for simulated data

Appendix C Extended Results

C.1 MNIST

We give extended results of Figure 6 in Figures 14–16. We used the DTLC-GAN used for the ablation study in Section 6.1. Figure 14 shows the generated image examples using the DTLC-GAN learned without a curriculum. Figure 15 shows the generated image examples using the DTLC-GAN learned only with the curriculum for regularization. Figure 16 shows the generated image examples using the DTLC-GAN learned with the full curriculum (curriculum for regularization and sampling: the proposed curriculum learning method). The former two DTLC-GANs (without the full curriculum) exhibited confusion between inter-layer and intra-layer disentanglement, while the DTLC-GAN with the full curriculum succeeded in avoiding confusion. The inter-category diversity evaluation on the basis of the SSIM in Figure 6 also supports these observations.

C.2 CIFAR-10

We give extended results of Figure 7 in Figures 17–19. We used the DTLC-GAN used for the ablation study in Section 6.2, with class labels as supervision. Figure 17 shows the generated image samples using the DTLC-GAN learned without a curriculum. Figure 18 shows the generated image samples using the DTLC-GAN learned only with the curriculum for regularization. Figure 19 shows the generated image samples using the DTLC-GAN learned with the full curriculum (curriculum for regularization and sampling: the proposed curriculum learning method). All models succeeded in learning disentangled representations for the class labels since they are given as supervision; however, the former two DTLC-GANs (without the full curriculum) exhibited confusion between inter-layer and intra-layer disentanglement from the second- to fourth-layer codes. In contrast, the DTLC-GAN with the full curriculum succeeded in avoiding confusion. The inter-category diversity evaluation on the basis of the SSIM in Figure 7 also supports these observations.

C.3 DTLC-WGAN-GP

We show the generated image samples using the models discussed in Section 6.3 in Figures 20–22. We used the DTLC-WGAN-GP configurations described in Section 6.3; in the weakly supervised setting, we used class labels as supervision. Figure 20 shows the generated image samples using the DTLC-WGAN-GP on CIFAR-10 (unsupervised). Figure 21 shows the generated image samples using the DTLC-WGAN-GP on CIFAR-10 (weakly supervised). Figure 22 shows the generated image samples using the DTLC-WGAN-GP on Tiny ImageNet (unsupervised).

C.4 3D Faces

We give extended results of Figure 8 in Figure 23. Similarly to Figure 8, we compared three models, the InfoGAN, which is the InfoGAN with five continuous codes (used in the InfoGAN study [4]), InfoGAN, which is the InfoGAN with one categorical code and one continuous code , and DTLC