Guiding InfoGAN with Semi-Supervision
In this paper we propose a new semi-supervised GAN architecture (ss-InfoGAN) for image synthesis that leverages information from a few labels (as little as , max. of the dataset) to learn semantically meaningful and controllable data representations where latent variables correspond to label categories. The architecture builds on Information Maximizing Generative Adversarial Networks (InfoGAN) and is shown to learn both continuous and categorical codes and to achieve higher-quality synthetic samples than fully unsupervised settings. Furthermore, we show that using small amounts of labeled data speeds up training convergence. The architecture maintains the ability to disentangle latent variables for which no labels are available. Finally, we contribute an information-theoretic reasoning on how introducing semi-supervision increases mutual information between synthetic and real data.
In many machine learning tasks it is assumed that the data originates from a generative process involving complex interaction of multiple independent factors, each accounting for a source of variability in the data. Generative models are then motivated by the intuition that in order to create realistic data a model must have “understood” these underlying factors. For example, images of handwritten characters are defined by many properties such as character type, orientation, width, curvature and so forth.
Recent models that attempt to extract these factors are either completely supervised [23, 20, 18] or entirely unsupervised [5, 3]. Supervised approaches allow for extraction of the desired parameters but require fully labeled datasets and a priori knowledge about which factors underlie the data. Moreover, factors not corresponding to labels will not be discovered. In contrast, unsupervised approaches require neither labels nor a priori knowledge about the underlying factors, but this flexibility comes at a cost: such models provide no means of exerting control over what kind of features are found. For example, Information Maximizing Generative Adversarial Networks (InfoGAN) have recently been shown to learn disentangled data representations. Yet the extracted representations are not always directly interpretable by humans and lack direct measures of control due to the unsupervised training scheme. Many application scenarios, however, require control over specific features.
Embracing this challenge, we present a new semi-supervised generative architecture that requires only a few labels to provide control over which factors are identified. Our approach can exploit already existing labels or use datasets that are augmented with easily collectible labels (but are not fully labeled). The model, based on the related InfoGAN [3], is dubbed semi-supervised InfoGAN (ss-InfoGAN). In our approach we maximize two mutual information terms: i) the mutual information between a code vector and real labeled samples, guiding the corresponding codes to represent the information contained in the labeling, and ii) the mutual information between the code vector and the synthetic samples. By doing so ss-InfoGAN can find representations that unsupervised methods such as InfoGAN fail to find, for example the category of digits of the SVHN dataset. Notably our approach requires only of labeled data for the hardest dataset we tested and for simpler datasets only labeled samples () were necessary.
We discuss our method in full, provide an information-theoretic rationale for the chosen architecture and demonstrate its utility in a number of experiments on the MNIST [13], SVHN [19], CelebA [14] and CIFAR-10 [10] datasets. We show that our method improves results over the state-of-the-art, combining advantages of supervised and unsupervised approaches.
2 Related Work
Many approaches exist for modeling the data generating process and identifying the underlying factors by learning to synthesize samples from disentangled representations. An example of an early approach is supervised bi-linear models [27], which separate style from content. Zhu et al. [29] use a multi-view deep perceptron model to untangle the identity and viewpoint of face images. Weakly supervised methods based on supervised clustering have also been proposed, such as high-order Boltzmann machines [22] applied to face images.
Variational Autoencoders (VAEs) [9] and Generative Adversarial Networks (GANs) [6] have recently seen a lot of interest in generative modeling problems. In both approaches a deep neural network is trained as a generative model by using standard backpropagation, enabling synthesis of novel samples without explicitly learning the underlying data distribution. VAEs maximize a lower bound on the marginal likelihood which is expected to be tight for accurate modeling [8, 2, 26, 24]. In contrast, GANs optimize a minimax game objective via a discriminative adversary. However, they have been shown to be unstable and fragile [23, 17].
Employing semi-supervised learning, Kingma et al. [7] use VAEs to isolate content from other variations, and achieve competitive recognition performance in addition to high-quality synthetic samples. The Deep Convolutional Inverse Graphics Network (DC-IGN) [11], which uses a VAE architecture and a specially tailored training scheme, is capable of learning a disentangled latent space in a fully supervised manner. Since the model is evaluated by using images of 3D models, labels for the underlying factors are cheap to attain. However, this type of dense supervision is infeasible for most non-synthetic datasets.
Adversarial Autoencoders [15] combine the VAE and GAN frameworks by using an adversarial loss on the latent space. Similarly, Mathieu et al. [16] introduce an adversarial loss on the reconstructions of a VAE, that is, on the pixel space. Both models are shown to learn both discrete and continuous latent representations and to disentangle style and content in images. However, these hybrid architectures are conceptually different from GANs. While the former learn the data distribution via autoencoder training and employ the adversarial loss as a regularizer, the latter directly rely on an adversarial objective. Despite their robust and stable training, VAEs tend to generate blurry images [12].
Conditional GANs [23, 20, 18] augment the GAN framework by using class labels. Mirza and Osindero [18] train a class-conditional discriminator while [23] and [20] use auxiliary loss terms for the labels. Salimans et al. [23] use conditional GANs for pre-training, aiming to improve the semi-supervised classification accuracy of the discriminator. Similarly, the AC-GAN model [20] introduces an additional classification task in the discriminator to provide class-conditional training and inference of the generator in order to be able to synthesize higher resolution images than previous architectures. Our work is similar to the above in that it provides class-conditional generation of images. However, due to the MI loss terms our architecture can i) be employed in both supervised and semi-supervised settings, ii) learn interpretable representations in addition to smooth manifolds and iii) exploit continuous supervision signals if such labels are available.
Comparatively fewer works treat the subject of fully unsupervised generative models to retrieve interpretable latent representations. Desjardins et al. [5] introduced a higher-order RBM for recognition of facial expressions. However, it can only disentangle discrete latent factors and the computational complexity rises exponentially in the number of features. More recently, Chen et al. [3] developed an extension to GANs, called Information Maximizing Generative Adversarial Networks (InfoGAN). It encourages the generator to learn disentangled representations by increasing the mutual information between the synthetic samples and a newly introduced latent code. Our work extends InfoGAN such that additional information can be used. Supervision can be a necessity if the model struggles to learn desirable representations or if specific features need to be controlled by the user. Our model provides a framework for semi-supervision in InfoGANs. We find that leveraging a few labeled samples improves the convergence rate, the quality of the representations and the quality of the synthetic samples. Moreover, semi-supervision helps the model capture representations that are otherwise difficult to capture.
3.1 Preliminaries: GAN and InfoGAN
In the GAN framework, a generator $G$ producing synthetic samples is pitted against a discriminator $D$ that attempts to discriminate between real data and samples created by $G$. The goal of the generator is to match the distribution of generated samples $P_G(x)$ with the real distribution $P_{data}(x)$. Instead of explicitly estimating $P_G(x)$, $G$ learns to transform noise variables $z \sim P_{noise}(z)$ into synthetic samples $\tilde{x} \sim P_G(x)$. The discriminator $D$ outputs a single scalar representing the probability of a sample $x$ coming from the true data distribution. Both $G(z; \theta_g)$ and $D(x; \theta_d)$ are differentiable functions parametrized by neural networks. We typically omit the parameters $\theta_g$ and $\theta_d$ for brevity. $G$ and $D$ are simultaneously trained by using the minimax game objective $V_{GAN}(D, G)$ [6]:

$$\min_G \max_D V_{GAN}(D, G) = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim P_{noise}}[\log(1 - D(G(z)))]$$

GANs map from the noise space to data space without imposing any restrictions. This allows $G$ to produce arbitrary mappings and to learn highly dependent factors that are hard to interpret. Therefore, variations of $z$ in any dimension often yield entangled effects on the synthetic samples $\tilde{x}$. InfoGANs [3] are capable of learning disentangled representations. InfoGAN extends the unstructured noise $z$ by introducing a latent code $c$. While $z$ represents the incompressible noise, $c$ describes semantic features of the data. In order to prevent $G$ from ignoring the latent codes $c$, InfoGAN regularizes learning via an additional cost term penalizing low mutual information between $c$ and $\tilde{x} = G(z, c)$:

$$\min_{G, Q} \max_D V_{InfoGAN}(D, G, Q, \lambda_1) = V_{GAN}(D, G) - \lambda_1 L_I(G, Q)$$

where $Q$ is an auxiliary parametric distribution approximating the posterior $P(c \mid x)$, $L_I$ corresponds to the lower bound of the mutual information $I(c; \tilde{x})$ and $\lambda_1$ is the weighting coefficient.
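To make the two preliminaries concrete, the following is a minimal numerical sketch, not the authors' implementation. It estimates the GAN value function from discriminator outputs and the InfoGAN lower bound for one categorical code. The function names, the uniform categorical prior (so that the code entropy is log K) and the default weight of 1.0 are assumptions made for illustration.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the GAN value function.

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on synthetic samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def mi_lower_bound(q_probs, codes, num_categories):
    """Estimate E[log Q(c | x~)] + H(c) for one categorical code.

    q_probs[i] is the distribution Q predicts for synthetic sample i,
    codes[i] is the code that was fed to the generator for that sample.
    Assumption: the code prior is uniform, so H(c) = log(num_categories).
    """
    avg_log_q = sum(math.log(q[c]) for q, c in zip(q_probs, codes)) / len(codes)
    return avg_log_q + math.log(num_categories)

def infogan_value(d_real, d_fake, q_probs, codes, num_categories, lam=1.0):
    """GAN value minus the weighted mutual information lower bound."""
    bound = mi_lower_bound(q_probs, codes, num_categories)
    return gan_value(d_real, d_fake) - lam * bound
```

If Q recovers every code with probability one, the bound equals the code entropy; if Q predicts a uniform distribution, the bound is zero, which is the case the regularizer penalizes.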
3.2 Semi-Supervised InfoGAN
Although InfoGAN can learn to disentangle representations in an unsupervised manner for simple datasets, it struggles to do so on more complicated datasets such as CelebA or CIFAR-10. In particular, capturing categorical codes is challenging and hence InfoGAN performs worse in the class-conditional generation task than competing methods. Moreover, depending on the initialization, the learned latent codes may differ between training sessions, further reducing interpretability.
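In Semi-Supervised InfoGAN (ss-InfoGAN) we introduce available or easily acquired labels to address these issues. Figure 1 schematically illustrates our architecture. To make use of label information we decompose the latent code $c$ into a set of semi-supervised codes, $c_{ss}$, and unsupervised codes, $c_{us}$, where $c_{ss} \cup c_{us} = c$. The semi-supervised codes encode the same information as the labels $y$, whereas $c_{us}$ are free to encode potential remaining semantic factors.
We seek to increase the mutual information $I(C_{ss}; X)$ between the latent codes $c_{ss}$ and the labeled real samples $x$, by interpreting labels $y$ as the latent codes $c_{ss}$ (i.e. $y = c_{ss}$). Note that not all samples need to be labeled for the generator to learn the inherent semantic meaning of $y$. We additionally want to increase the mutual information $I(C_{ss}; \tilde{X})$ between the semi-supervised latent codes and the synthetic samples $\tilde{x}$ so that information can flow back to the generator. This is accomplished via Variational Information Maximization [1] in deriving lower bounds for both MI terms. For the lower bounds of both terms we utilize the same derivation as InfoGAN:

$$L_{IS}^1(Q_{ss}) = \mathbb{E}_{c \sim P(C_{ss}),\, x \sim P(X \mid c)}[\log Q_{ss}(c \mid x)] + H(C_{ss}) \le I(C_{ss}; X)$$

$$L_{IS}^2(Q_{ss}, G) = \mathbb{E}_{c \sim P(C_{ss}),\, z \sim P_{noise},\, \tilde{x} = G(z, c)}[\log Q_{ss}(c \mid \tilde{x})] + H(C_{ss}) \le I(C_{ss}; \tilde{X})$$

where $Q_{us}$ and $Q_{ss}$ are again auxiliary distributions to approximate posteriors and are parametrized by neural networks. Combining the two bounds, we attain the MI cost term:

$$L_{IS}(Q_{ss}, G) = L_{IS}^1(Q_{ss}) + L_{IS}^2(Q_{ss}, G)$$

Since we would like to encode the labels $y$ via latent codes $c_{ss}$, we optimize $L_{IS}^1$ with respect to $Q_{ss}$ and $L_{IS}^2$ only with respect to $G$. The final objective function is then:

$$\min_{G, Q_{us}, Q_{ss}} \max_D V_{ss\text{-}InfoGAN}(D, G, Q_{us}, Q_{ss}, \lambda_1, \lambda_2) = V_{GAN}(D, G) - \lambda_1 L_I(G, Q_{us}) - \lambda_2 L_{IS}(G, Q_{ss})$$

Here $V_{GAN}$ is the GAN objective and $L_I$ the InfoGAN mutual information lower bound of Section 3.1. Training $Q_{ss}$ on labeled real data $(x, y)$ enables $Q_{ss}$ to encode the semantic meaning of $y$ via $c_{ss}$ by means of increasing the mutual information $I(C_{ss}; X)$. Simultaneously, the generator $G$ acquires the information of $y$ indirectly by increasing $I(C_{ss}; \tilde{X})$ and learns to utilize the semi-supervised representations in synthetic samples. In our experiments we find that a small subset of labeled samples is enough to observe significant effects.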
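The way the two semi-supervised terms enter the objective can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the released implementation: unlabeled real samples are marked with `None`, the label prior is taken to be uniform so that its entropy is log K, and all function names and default weights are hypothetical.

```python
import math

def labeled_bound(q_probs, targets, num_classes):
    """Estimate E[log Q_ss(c | x)] + H(C_ss) over samples that have a target.

    targets[i] is a class index, or None when sample i is unlabeled.
    Assumption: uniform label prior, so the entropy equals log(num_classes).
    Returns 0.0 when the batch contains no labeled sample.
    """
    pairs = [(q, t) for q, t in zip(q_probs, targets) if t is not None]
    if not pairs:
        return 0.0
    avg_log_q = sum(math.log(q[t]) for q, t in pairs) / len(pairs)
    return avg_log_q + math.log(num_classes)

def ss_mi_cost(q_real, labels, q_fake, codes, num_classes):
    """Sum of the bound on labeled real data and the bound on synthetic data."""
    bound_real = labeled_bound(q_real, labels, num_classes)  # labels act as codes
    bound_fake = labeled_bound(q_fake, codes, num_classes)   # codes fed to G
    return bound_real + bound_fake

def ss_infogan_value(v_gan, l_info, l_semi, lam1=1.0, lam2=1.0):
    """GAN value minus the weighted unsupervised and semi-supervised MI terms."""
    return v_gan - lam1 * l_info - lam2 * l_semi
```

In a gradient-based implementation the first bound would update only the auxiliary network and the second only the generator, as stated in the text; the sketch above only evaluates the scalar values.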
We show that our approach gives control over discovered properties and factors and that our method achieves better image quality. Here we provide an information theoretic underpinning shedding light on the reason for these gains. By increasing both and , the mutual information term is increased as well. We make the following assumptions:
where is a constant and are dependency relations. Assumption (8) follows the intuition that the data is hypothesized to arise from the interaction of independent factors. While latent factors consist of , and , we abstract for the sake of simplicity. Assumption (9) formulates the initial state of our model where the synthetic data distribution and the data distribution are independent. Finally we can assume that labels follow a fixed distribution and hence have a fixed entropy , giving rise to (10).
We decompose and reformulate in the following way:
where is the multivariate mutual information term. While pairwise MI is by definition non-negative, in the multivariate case negative values are possible if two variables are coupled via the third. By using the conditional independence assumption (8), we have
Thus the entropy term in Eq. (11) takes the form
Let symbolize the change in value of a term. According to assumption (10), the following must hold:
Note that and increase during training since we directly optimize these terms, leading to the following cases:
The first case results in the desired behavior. However the latter case cannot occur, as it would result in negative mutual information . Hence, based on our assumptions, increasing both and leads to an increase in .
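The conclusion can be cross-checked with a standard co-information identity. The derivation below is a sketch and not necessarily the exact sequence of steps used in this section. It assumes the symbols $X$ for real data, $\tilde{X}$ for synthetic data and $C$ for the latent factors, the sign convention $I(X; \tilde{X}; C) = I(X; \tilde{X}) - I(X; \tilde{X} \mid C)$, and conditional independence of $X$ and $\tilde{X}$ given $C$.

```latex
\begin{align*}
I(X;\tilde{X}) &= I(X;\tilde{X}\mid C) + I(X;\tilde{X};C) \\
               &= I(X;\tilde{X};C)
                  && \text{since } I(X;\tilde{X}\mid C) = 0 \\
               &= I(C;X) + I(C;\tilde{X}) - I(C;X,\tilde{X}) \\
               &\ge I(C;X) + I(C;\tilde{X}) - H(C)
                  && \text{since } I(C;X,\tilde{X}) \le H(C).
\end{align*}
```

With the entropy of the labels held fixed, raising the two optimized mutual information terms raises this lower bound on the mutual information between real and synthetic data.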
For both $D$ and $G$ of ss-InfoGAN we use an architecture similar to DCGAN [21], which is reported to stabilize training. The networks for the parametric distributions $Q_{us}$ and $Q_{ss}$ share all layers with $D$ except the last layers. This is similar to InfoGAN [3], which models $Q$ as an extension to $D$. This approach has the disadvantage of a negligibly higher computational cost for $Q_{ss}$ in comparison to InfoGAN. However, this is offset by a faster convergence rate.
In our experiments with small amounts of labeled data, we initially favor drawing labeled samples, which significantly improves the convergence rate of the supervised latent codes. During training the probability of drawing a labeled sample is annealed until the actual labeled sample ratio in the data is reached. The loss function used to calculate the mutual information cost terms is the cross-entropy for categorical latent codes and the mean squared error for continuous latent codes. The unsupervised categorical codes are sampled from a uniform categorical distribution whereas the continuous codes are sampled from a uniform distribution. All the experimental details are listed in the supplementary document. In the interest of reproducible research, we provide the source code on GitHub (https://github.com/spurra/ss-infogan).
For comparison we re-implement the original InfoGAN architecture in the Torch framework [4] with minor modifications. Note that there may be differences in results due to the unstable nature of GANs, possibly amplified by using a different framework and different initial conditions. In our implementation the continuous latent codes are not treated as a factored Gaussian; instead, their loss is approximated with the mean squared error, which leads to a slight adjustment in the architecture of $Q$.
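The annealed sampling of labeled data can be sketched as below. The text does not specify the shape of the schedule, so the linear decay, the function names and the parameters `start_prob` and `anneal_steps` are assumptions chosen for illustration only.

```python
import random

def labeled_draw_probability(step, anneal_steps, start_prob, labeled_ratio):
    """Probability of drawing a labeled sample at a given training step.

    Assumption: linear annealing from start_prob down to the true
    labeled ratio of the dataset, reached after anneal_steps steps.
    """
    if step >= anneal_steps:
        return labeled_ratio
    frac = step / anneal_steps
    return start_prob + frac * (labeled_ratio - start_prob)

def draw_batch(labeled, unlabeled, batch_size, prob_labeled, rng):
    """Draw a batch, picking each element from the labeled pool
    with probability prob_labeled and from the unlabeled pool otherwise."""
    batch = []
    for _ in range(batch_size):
        pool = labeled if rng.random() < prob_labeled else unlabeled
        batch.append(rng.choice(pool))
    return batch

# Usage: early in training half of each batch is labeled on average,
# later the probability matches a hypothetical 10% labeled ratio.
rng = random.Random(0)
early = labeled_draw_probability(0, 1000, start_prob=0.5, labeled_ratio=0.1)
late = labeled_draw_probability(1000, 1000, start_prob=0.5, labeled_ratio=0.1)
example_batch = draw_batch(["x_labeled"], ["x_unlabeled"], 4, early, rng)
```

Oversampling the labeled pool early gives the supervised codes more gradient signal while they are still uninformative; annealing back to the true ratio avoids overfitting the few labeled samples later on.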
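The relationship between the two treatments of continuous codes can be seen from the Gaussian negative log-likelihood. The following is a standard identity given as a sketch; the symbols $\mu(x)$ for the predicted mean and $\sigma$ for the standard deviation are notation assumed here, and fixing $\sigma = 1$ is an assumption made to illustrate the connection, not a statement about the implementation.

```latex
\begin{align*}
-\log \mathcal{N}\!\left(c;\, \mu(x),\, \sigma^2\right)
  &= \frac{\left(c - \mu(x)\right)^2}{2\sigma^2} + \log \sigma + \tfrac{1}{2}\log 2\pi, \\
\sigma = 1:\qquad
-\log \mathcal{N}\!\left(c;\, \mu(x),\, 1\right)
  &= \tfrac{1}{2}\left(c - \mu(x)\right)^2 + \text{const.}
\end{align*}
```

Under a fixed variance, minimizing the negative log-likelihood is therefore equivalent to minimizing the squared error, and the auxiliary network only needs to output a mean rather than a mean and a variance.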
In our study we focus on the interpretability of the representations and the quality of synthetic images under different amounts of labeling. The work most closely related to ours is InfoGAN, and our aim was to directly improve upon that architecture. Existing semi-supervised generative modeling studies, on the other hand, aim to learn discriminative representations for classification. Therefore we make a direct comparison with InfoGAN.
We evaluate our model on the MNIST [13], SVHN [19], CelebA [14] and CIFAR-10 [10] datasets. First, we inspect how well ss-InfoGAN learns the representations as defined by existing labels. Second, we qualitatively evaluate the representations learned by ss-InfoGAN. Finally, we analyze how much labeled data is required for the model to encode the semantic meaning of the labels via the semi-supervised codes.
We hypothesize that the quality of the generator in class-conditional sample synthesis can be quantitatively assessed by a separate classifier trained to recognize class labels. The class labels of the synthetic samples (i.e. the class-conditional inputs of the generator) are regarded as true targets and compared with the classifier's predictions. In order to prevent biased results due to the generator overfitting, we train the classifier on the test set and validate on the training set for each dataset. Despite the test set consisting of fewer samples, the classifier generally performs well on the unseen training set. In our experiments, we use a standard CNN (architecture described in the supplementary file) for the MNIST, CelebA and SVHN datasets and Wide Residual Networks [28] for the CIFAR-10 dataset.
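The evaluation protocol described above reduces to comparing the generator's conditioning labels with the predictions of the independent classifier. A minimal sketch follows; the function names are hypothetical and the classifier is represented only by its per-class scores.

```python
def argmax(scores):
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])

def zero_one_loss(conditioning_labels, classifier_scores):
    """Fraction of synthetic samples whose predicted class differs
    from the class label that was fed to the generator.

    conditioning_labels[i]: class index used to generate sample i.
    classifier_scores[i]:   per-class scores of the independent
                            classifier on synthetic sample i.
    """
    predictions = [argmax(s) for s in classifier_scores]
    errors = sum(1 for y, p in zip(conditioning_labels, predictions) if y != p)
    return errors / len(conditioning_labels)

# Usage: the third sample was generated as class 1 but is classified as 0.
loss = zero_one_loss([0, 1, 1], [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
```

A low value indicates that the generator respects its class-conditional input, under the assumption that the classifier itself is accurate on real data.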
In order to evaluate how well the model separates types of semantic variation, we generate synthetic images by varying only one latent factor by means of linear interpolation while keeping the remaining latent codes fixed.
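The traversal of a single latent factor can be sketched as follows. The function name and the representation of a latent code as a plain list of floats are assumptions; each returned code would be fed to the generator together with a fixed noise vector.

```python
def interpolate_code(base_code, index, start, stop, num_steps):
    """Return num_steps copies of base_code in which only the entry
    at position index varies linearly from start to stop.

    num_steps must be at least 2 so that both endpoints are included.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    codes = []
    for i in range(num_steps):
        t = i / (num_steps - 1)          # interpolation weight in [0, 1]
        code = list(base_code)           # copy so other factors stay fixed
        code[index] = start + t * (stop - start)
        codes.append(code)
    return codes

# Usage: vary the first continuous code from -1 to 1 in three steps.
traversal = interpolate_code([0.0, 0.5], index=0, start=-1.0, stop=1.0, num_steps=3)
```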
To evaluate the necessary amount of supervision we perform a quantitative analysis of the classifier accuracy and a qualitative analysis by examining synthetic samples. To do so, we discard increasingly larger sets of labels from the data. Note that $Q_{ss}$ is trained only on labeled samples and hence sees less data, whereas the rest of the architecture, namely the generator and the shared layers of the discriminator, uses the entire training set in an unsupervised manner. The minimum amount of labeled samples required to learn the representation of labels varies depending on the dataset. However, for all our experiments it never exceeded .
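Discarding labels for this analysis can be sketched as below. The text does not describe the selection procedure, so uniform random selection, the function name and the use of `None` to mark a discarded label are assumptions.

```python
import random

def discard_labels(labels, keep_fraction, rng):
    """Keep a random subset of the labels and replace the rest with None.

    The samples themselves are untouched, so the unsupervised parts of
    the model can still train on the full dataset.
    """
    n_keep = round(keep_fraction * len(labels))
    keep = set(rng.sample(range(len(labels)), n_keep))
    return [y if i in keep else None for i, y in enumerate(labels)]

# Usage: keep 10% of 50 hypothetical digit labels.
rng = random.Random(0)
all_labels = [i % 10 for i in range(50)]
partial = discard_labels(all_labels, keep_fraction=0.1, rng=rng)
num_kept = sum(1 for y in partial if y is not None)
```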
MNIST is a standard dataset used to evaluate many generative models. It consists of handwritten digits, and is labeled with the digit category. Figure 2 presents the synthetic samples generated with our model and InfoGAN by varying the latent code. Due to lower complexity of the dataset, InfoGAN is capable of learning the digit representation unsupervised. However, using just of the available data has a two-fold benefit. First, semi-supervision provides additional fine-grained control (e.g., digits are already sorted in ascending manner in Figure 1(a), 1(b)). Second, we experimentally verified that the additional information increases convergence speed of the generator, illustrated in Figure 2(a). The 0-1 loss of the classifier decreases faster as more labeled samples are introduced while the fully unsupervised setting (i.e. InfoGAN) is the slowest. The smallest amount of labeled samples for which the effect of supervision is observable is of the dataset, which corresponds to labeled samples out of .
Next, we run ss-InfoGAN on the SVHN dataset, which consists of color images and hence includes more noise and natural effects such as illumination. Similar to MNIST, this dataset is labeled with respect to the digit category. In Figure 4, latent codes with various interpretations are presented. In this experiment, different amounts of supervision result in different unsupervised representations being retrieved.
The SVHN dataset is perturbed by various noise factors such as blur and ambiguity in the digit categories. Figure 5 compares real samples with randomly generated synthetic samples by varying digit categories. The InfoGAN configuration ( supervision) fails to encode a categorical latent code for the digit category. Leveraging some labeled information, our model becomes more robust to perturbations in the images. Through the introduction of labeled samples we are capable of exerting control over the latent space, encoding the digit labels in the categorical latent code . The smallest fraction of labeled data needed to achieve a notable effect is (i.e. labels out of samples).
In Figure 2(b) we assess the performance of ss-InfoGAN with respect to . The unsupervised configuration is left out since it is not able to control digit categories. As ss-InfoGAN exploits more label information, the generator converges faster and synthesizes more accurate images in terms of digit recognizability.
The CelebA dataset contains a rich variety of binary labels. We pre-process the data by extracting the faces via a face detector and then resize the extracted faces to . From the set of binary labels provided in the data we select the following attributes: "presence of smile", "mouth open", "attractive" and "gender".
Figure 6 shows synthetic images generated with ss-InfoGAN by varying certain latent codes. Although we experiment by using various hyper-parameters, InfoGAN is not able to learn an equivalent representation to these attributes. We see that for as low as , acquires the semantic meaning of . This corresponds to labeled samples out of . Figure 7 presents a batch of real samples from the dataset alongside with randomly synthesized samples from generators trained on various labeled percentages, with corresponding again to InfoGAN.
The performance of ss-InfoGAN on the independent classifier is shown in Figure 7(a). For the lowest amount of labeling some instability can be observed. We believe this is due to the differences between the positives and the negatives of each binary label being more subtle than in other datasets. In addition, synthetic data generation exhibits certain variability which can obfuscate important parts of the image. However, using of labeled samples ensures a stable training performance.
Finally we evaluate our model on the CIFAR-10 dataset, which consists of natural images. The data is labeled with the object category, which we use for the first semi-supervised categorical code. In order to stabilize training we apply instance noise [25].
On this dataset the unsupervised latent codes are not interpretable. An example is presented in Figure 9 where the synthetic samples are generated by varying one of the unsupervised latent codes. Despite the fact that the ss-InfoGAN model is trained using all label information, the semantic meaning of this unsupervised representation is not clear. The randomness of natural images prevents models from learning interpretable representations in the absence of guidance.
Figure 10 shows synthetic samples generated by models with different supervision configurations. InfoGAN has difficulties in learning the object category (see Figure 9(b)) and hence in generating class-conditional synthetic images. For this dataset we find that labeling of the training data (corresponding to images out of ) is sufficient for ss-InfoGAN to encode class category (see Figure 9(c)).
In Figure 7(b) classification accuracy of on the synthetic samples is plotted, again displaying the similar behavior of having better performance as more labels are available. It is evident that the additional information provided by the labels is fundamental to control what the image depicts. We argue that attaining such low amounts of labels is feasible even for large and complicated datasets.
5.5 Convergence speed of sample quality
During the course of the experiments, we observed that the synthetic sample quality of ss-InfoGAN converges faster than that of InfoGAN. Figure 11 shows synthetic SVHN samples from a fully supervised ss-InfoGAN and InfoGAN at training epoch and . The training epochs are chosen by inspection as the point at which each model starts producing recognizable images. Since the epochs are chosen by inspection, this comparison is qualitative; it indicates that ss-InfoGAN converges faster than InfoGAN.
We have introduced ss-InfoGAN, a novel semi-supervised generative model. We have shown that including a few labels increases the convergence speed of the semi-supervised latent codes and that these codes represent the same meaning as the labels. This speed-up increases as more data samples are labeled. Although in theory this only improves convergence speed of , we have shown empirically that the sample quality convergence speed has improved as well.
In addition, it was shown that using labeling information is useful in cases where InfoGAN fails to find a specific representation, such as in the case of SVHN, CelebA and CIFAR-10. To successfully guide a latent code to the desired representation, it is sufficient that the dataset contains only a minimal subset of labeled data. The amount of required labels ranges from for the simplest datasets (MNIST) to a maximum of for the most complex datasets (CIFAR-10). We argue that acquiring such low percentages of labels is cost effective and makes the proposed architecture an attractive choice if control over specific latent codes is required and full supervision is not an option.
This work was supported in part by the ERC grant OPTINT (StG-2016-717054).
- [1] Barber, D., Agakov, F.V.: The IM algorithm: A variational approach to information maximization. In: NIPS. pp. 201–208 (2003)
- [2] Burda, Y., Grosse, R., Salakhutdinov, R.: Importance weighted autoencoders. arXiv preprint arXiv:1509.00519 (2015)
- [3] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. ArXiv e-prints (Jun 2016)
- [4] Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: A Matlab-like environment for machine learning. In: BigLearn, NIPS Workshop (2011)
- [5] Desjardins, G., Courville, A., Bengio, Y.: Disentangling Factors of Variation via Generative Entangling. ArXiv e-prints (Oct 2012)
- [6] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative Adversarial Networks. ArXiv e-prints (Jun 2014)
- [7] Kingma, D.P., Rezende, D.J., Mohamed, S., Welling, M.: Semi-Supervised Learning with Deep Generative Models. ArXiv e-prints (Jun 2014)
- [8] Kingma, D.P., Salimans, T., Welling, M.: Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934 (2016)
- [9] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
- [10] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)
- [11] Kulkarni, T.D., Whitney, W., Kohli, P., Tenenbaum, J.B.: Deep Convolutional Inverse Graphics Network. ArXiv e-prints (Mar 2015)
- [12] Lamb, A., Dumoulin, V., Courville, A.: Discriminative regularization for generative models. arXiv preprint arXiv:1602.03220 (2016)
- [13] Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (Nov 1998)
- [14] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision (ICCV) (Dec 2015)
- [15] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial Autoencoders. ArXiv e-prints (Nov 2015)
- [16] Mathieu, M.F., Zhao, J.J., Zhao, J., Ramesh, A., Sprechmann, P., LeCun, Y.: Disentangling factors of variation in deep representation using adversarial training. In: Advances in Neural Information Processing Systems. pp. 5041–5049 (2016)
- [17] Metz, L., Poole, B., Pfau, D., Sohl-Dickstein, J.: Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163 (2016)
- [18] Mirza, M., Osindero, S.: Conditional Generative Adversarial Nets. ArXiv e-prints (Nov 2014)
- [19] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011 (2011)
- [20] Odena, A., Olah, C., Shlens, J.: Conditional Image Synthesis With Auxiliary Classifier GANs. ArXiv e-prints (Oct 2016)
- [21] Radford, A., Metz, L., Chintala, S.: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ArXiv e-prints (Nov 2015)
- [22] Reed, S., Sohn, K., Zhang, Y., Lee, H.: Learning to disentangle factors of variation with manifold interaction. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14). pp. 1431–1439 (2014)
- [23] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems. pp. 2226–2234 (2016)
- [24] Siddharth, N., Paige, B., Van de Meent, J.W., Desmaison, A., Wood, F., Goodman, N.D., Kohli, P., Torr, P.H.S.: Learning Disentangled Representations with Semi-Supervised Deep Generative Models. ArXiv e-prints (Jun 2017)
- [25] Sønderby, C.K., Caballero, J., Theis, L., Shi, W., Huszár, F.: Amortised MAP inference for image super-resolution. CoRR abs/1610.04490 (2016), http://arxiv.org/abs/1610.04490
- [26] Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational autoencoders. In: Advances in Neural Information Processing Systems. pp. 3738–3746 (2016)
- [27] Tenenbaum, J.B., Freeman, W.T.: Separating style and content with bilinear models. Neural Comput. 12(6), 1247–1283 (Jun 2000), http://dx.doi.org/10.1162/089976600300015349
- [28] Zagoruyko, S., Komodakis, N.: Wide residual networks. CoRR abs/1605.07146 (2016), http://arxiv.org/abs/1605.07146
- [29] Zhu, Z., Luo, P., Wang, X., Tang, X.: Deep Learning Multi-View Representation for Face Recognition. ArXiv e-prints (Jun 2014)