Learning Compositional Visual Concepts with Mutual Consistency
Abstract
Compositionality of semantic concepts in image synthesis and analysis is appealing as it can help in decomposing known and generatively recomposing unknown data. For instance, we may learn concepts of changing illumination, geometry or albedo of a scene, and try to recombine them to generate physically meaningful, but unseen data for training and testing. In practice, however, we often do not have samples from the joint concept space available: we may have data on illumination change in one data set and on geometric change in another one without complete overlap. We pose the following question: How can we learn two or more concepts jointly from different data sets with mutual consistency where we do not have samples from the full joint space? We present a novel answer in this paper based on cyclic consistency over multiple concepts, represented individually by generative adversarial networks (GANs). Our method, ConceptGAN, can be understood as a drop-in for data augmentation to improve resilience for real-world applications. Qualitative and quantitative evaluations demonstrate its efficacy in generating semantically meaningful images, as well as one-shot face verification as an example application.
1 Introduction
In applications such as object detection and face recognition, a large set of training data with accurate annotation is critical for the success of modern deep learning-based methods. However, collecting and annotating such data can be laborious or even essentially impossible. Conventional data augmentation techniques typically involve either manual effort or simple transformations such as translation and rotation of the available data, and may not result in semantically meaningful data samples.
Recently, generative models such as image-to-image translation and CycleGAN [40, 10] have been shown to successfully synthesize unseen data samples. Given sufficient training data, these allow us, for instance, to translate from an image of a textured handbag to a corresponding visually convincing image of a shoe with the same texture, or from a color image of a handbag to a consistent line drawing of a handbag. These models learn one concept at a time; naturally, one would like to continue learning more concepts to generate a wider variety of data. However, samples from the joint distribution, in our simple case line drawings of shoes, may not be available for training. Going beyond two concepts, the joint concept space grows exponentially, which makes gathering data infeasible. As shown in Figure 1, it is difficult to directly compose separately trained CycleGAN mappings in a semantically meaningful way to synthesize plausible images in the subdomains with no training data. For example, shape-varying mappings trained with color images may fail to translate images in the line drawing domain.
As an answer to this challenge, we make compositionality a principled and explicit part of the model while learning individual concepts. We achieve this by regularizing the learning of the individual concepts by enforcing consistency of concept composition. In our earlier example, this implies enforcing cyclic consistency of applying bag to shoe, color to line drawing, and their corresponding inverses, resulting in a cycle of four concept shifts (Figure 2). In general, we enforce consistency over multiple closed paths in the underlying graph. The benefits of this are twofold: (a) we ensure that the concepts are mutually consistent in the sense of not impacting their mutual forward and inverse generation capability, and (b) we can optimize the resulting cost function irrespective of whether data samples from the joint concept space are available.
In fact, we focus on the case where no data is available from one joint concept space (e.g., line drawings of shoes) and demonstrate that we can nevertheless generate meaningful samples from it. This paper focuses primarily on the simplest case of our framework with two-concept cycles.
While not strictly necessary, we assume that the application of concepts is commutative, yielding a set of symmetric cycle-consistency constraints. As it is notoriously difficult to gauge the performance of novel image synthesis, we use a surrogate task, face verification, for performance evaluation and demonstrate how a black-box baseline system can be improved by data augmentation. In summary:

We propose a principled framework for learning pairwise visual concepts from partial data with mutual consistency.

We demonstrate that via joint learning, transfer and composition of concepts, semantically meaningful image synthesis can be achieved over a joint latent space with incomplete data, for instance from a subdomain where no data is available at training time.

We demonstrate a scalable framework for efficient data augmentation where multiple concepts learned in a pairwise fashion can be directly composed in image synthesis.

Using face verification as a surrogate problem, we show how the proposed method can be used as a framework to perform conditional image synthesis, helping improve face verification accuracy.

We provide a scheme for building iterative solutions for an arbitrary number of concepts as a generalization.
2 Related work
Data augmentation techniques have been utilized to improve training performance, especially for deep learning-based methods [32, 28, 17, 35]. Conventional approaches mostly rely on simple transformations such as rotation [31], random cropping [17], random flipping [32, 28, 17] and altering RGB channel intensities [15]. The amount of new information introduced by such operations is limited, as no latent manipulation (e.g., varying the illumination) is involved.
Generative adversarial networks (GANs) [4] provide an efficient tool to augment data with virtual samples [33]. In a GAN, plausible yet unseen images are generated by matching the synthetic sample distribution to the real data distribution. The adversarial idea has been successfully applied to transformation across image domains. Isola et al. [7] propose the pix2pix framework, which adapts a conditional GAN [25] to map images from the input to the output domain given paired training data. Various strategies have been utilized to tackle the problem with unsupervised data, such as using weight-sharing between adversarial networks to learn the joint distribution across domains [20, 19] and using an additional regularization loss term which minimizes a similarity distance between the inputs and the outputs [34, 1, 30].
In particular, Zhu et al. [40] propose CycleGAN, which extends the pix2pix [7] framework by introducing additional cycle-consistency constraints to simultaneously learn a pair of forward and backward mappings between two domains given unpaired training data. Similar unsupervised learning ideas are also proposed in DiscoGAN [10] and DualGAN [39]. Following the cycle-consistency formulation, Liang et al. [18] focus on editing high-level semantic content of objects while preserving background characteristics. In these prior works, however, translation mappings learned in each experiment depend on specific training distributions, and therefore cannot be easily transferred or semantically composed without extra training experiments.
Another group of generative model-based approaches seeks to learn disentangled latent representations [14, 13, 23, 16, 37], where a semantic perturbation can then be expressed via vector arithmetic [24, 29]. Odena et al. [26] attach an auxiliary classifier to the adversarial network to build a class-conditional image synthesis model. Chen et al. [2] adopt an unsupervised approach by maximizing the mutual information between code space input and output observations. Fu et al. [3] perform conditional image synthesis given training data only supervised in one (source) domain via joint feature disentanglement and adaptation. Lu et al. [22] and Kim et al. [11] both propose models of conditional image synthesis on top of a cycle-consistency formulation [40]. While these works can provide plausible image synthesis conditional on attribute manipulations, they assume that training data are available over the joint latent space and do not accommodate the challenge of data scarcity. Unlike prior works, the proposed ConceptGAN captures image space mappings that correspond to commutative shifts in the underlying latent space. In each experiment, we jointly learn, transfer and compose such concepts to synthesize images over all possible latent manipulations, including an attribute combination completely unseen at the training stage, which corresponds to a missing subdomain.
3 Model formulation
We propose ConceptGAN, a concept learning framework aimed at recovering the joint space information given missing training data in one subdomain. As illustrated in Figure 2, the basic unit of the framework is modeled as a four-vertex cyclic graph, where a pair of latent concepts is jointly learned. Each vertex refers to a subdomain with binary latent labels and and corresponding training samples , where denotes the number of training samples in the subdomain . The variation over each latent concept is learned as a pair of forward and inverse mappings between subdomains. For example, and define the variation over concept . In particular, no pairwise correspondence is required for data samples between any two subdomains, and our goal is to generate realistic synthetic samples over all four subdomains under the assumption that no training samples are available in one of the subdomains. In the following discussion, we assume that the subdomain has no training data (i.e. ). An adversarial discriminator is introduced at each of the three subdomains , and to tell synthetic data and real data apart. We further extend the cycle-consistency constraints used in CycleGAN [40] and introduce a commutative loss to encourage learning transferable and composable concept mappings.
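The four-vertex unit described above can be sketched as a small graph structure. This is only an illustration: the label tuples, the mapping names `G1`, `F1`, `G2`, `F2`, and the choice of `(1, 1)` as the subdomain without training data are assumptions made for the sketch, since the symbols of the original formulation are not reproduced in this text.

```python
from itertools import product

# Assumption: the subdomain where both concepts are active has no training data.
MISSING = (1, 1)

def build_concept_graph(missing=MISSING):
    """Return vertices, directed edges and discriminator sites of the 4-vertex unit."""
    # Each vertex is a pair of binary latent labels (concept 1, concept 2).
    vertices = list(product((0, 1), repeat=2))
    edges = {}
    for (c1, c2) in vertices:
        # Concept 1 flips the first label: G1 switches it on, F1 switches it off.
        edges[((c1, c2), (1 - c1, c2))] = "G1" if c1 == 0 else "F1"
        # Concept 2 flips the second label: G2 switches it on, F2 switches it off.
        edges[((c1, c2), (c1, 1 - c2))] = "G2" if c2 == 0 else "F2"
    # One discriminator per subdomain that has real training data.
    discriminators = [v for v in vertices if v != missing]
    return vertices, edges, discriminators

vertices, edges, discriminators = build_concept_graph()
for (src, dst), name in sorted(edges.items()):
    print(src, "->", dst, "via", name)
```

Each concept contributes one forward and one inverse mapping, and the same mapping is reused on both parallel edges of the square, which is what lets a mapping trained on three subdomains be applied in the fourth.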
3.1 Adversarial loss
The adversarial loss [4] is applied to each of the three subdomains where real data is available during training, which encourages learning mappings between adjacent subdomains to generate realistic samples. Let denote the underlying distribution of the real data in subdomain . For generator and discriminator , for example, the adversarial loss is expressed as:
(1) 
where the generator and discriminator are learned to optimize a minimax objective such that
(2) 
Similarly we define , , and for , and respectively. The overall adversarial loss is the sum of these four terms.
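As a numerical illustration of a single adversarial term, the sketch below evaluates the standard GAN value function of [4] from discriminator outputs. The function name and the toy probabilities are assumptions for illustration; the subdomain-specific symbols of Eqs. (1) and (2) are not reproduced here.

```python
import math

def adversarial_loss(d_real, d_fake):
    """Standard GAN value: mean log D(y) on real samples plus mean log(1 - D(G(x))) on synthetic ones."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Discriminator outputs are probabilities in (0, 1); these numbers are illustrative only.
confident = adversarial_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])  # discriminator separates well
fooled = adversarial_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])     # discriminator cannot tell
print(confident, fooled)
```

The discriminator tries to increase this value while the generator tries to decrease it, so a generator that fools the discriminator drives the value toward the `fooled` case.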
3.2 Extended cycle-consistency loss
In Zhu et al. [40], a pairwise cycle-consistency loss is proposed to encourage generators to learn bijective mappings between two distributions. Let denote the sum of all such pairwise (i.e., distance-2) cycle-consistency losses adopted in the cyclic model, where six terms are included: (1) both forward cycle-consistency and backward cycle-consistency [40] between pairs and and (2) only forward cycle-consistency between pairs and . Such consistency constraints can naturally be extended to potentially any closed walks in the cyclic graph and thus further reduce the space of possible mappings. In particular, the difference between training data samples and image samples reconstructed via walking through all four vertices from either direction is minimized. For example, for any data sample in subdomain , a distance-4 cycle-consistency constraint is defined in the clockwise direction and in the counterclockwise direction . Such constraints are implemented by the penalty function:
(3) 
Similarly, we define and considering the case where the original image is in subdomain and respectively. Let denote the sum of these three terms. The overall cycle-consistency loss is defined as the sum of these terms: .
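A minimal sketch of a distance-4 cycle-consistency penalty for one starting sample is given below. The mapping names `G1`, `F1`, `G2`, `F2`, the use of a mean absolute (L1) distance as in CycleGAN [40], and the toy two-dimensional "images" are all assumptions made for illustration; which walk counts as clockwise depends on the figure layout.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def walk(x, mappings):
    """Apply a sequence of mappings in order."""
    for m in mappings:
        x = m(x)
    return x

def distance4_cycle_loss(x, G1, F1, G2, F2):
    """Reconstruction error after walking through all four vertices in both directions."""
    one_way = walk(x, [G1, G2, F1, F2])
    other_way = walk(x, [G2, G1, F2, F1])
    return l1(one_way, x) + l1(other_way, x)

# Toy stand-ins: each concept shifts one coordinate and its inverse undoes the shift.
G1 = lambda v: [v[0] + 1.0, v[1]]
F1 = lambda v: [v[0] - 1.0, v[1]]
G2 = lambda v: [v[0], v[1] + 1.0]
F2 = lambda v: [v[0], v[1] - 1.0]
F1_bad = lambda v: [v[0] - 0.5, v[1]]  # an imperfect inverse of G1

x = [0.25, 0.75]
perfect = distance4_cycle_loss(x, G1, F1, G2, F2)
imperfect = distance4_cycle_loss(x, G1, F1_bad, G2, F2)
print(perfect, imperfect)
```

With exact inverses the walk returns to the starting sample and the penalty vanishes; an imperfect inverse is penalized in both directions.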
3.3 Commutative loss
Adversarial training in Zhu et al. [40] learns mappings that capture sample distributions of training data and therefore are not easily transferable to input data that follows a different distribution without a second training, which may lead to weak compositionality. In order to encourage the model to capture semantic shifts, which correspond to commutative operators such as addition and subtraction in latent space, we enforce a commutative property for concept composition such that starting from one data sample, similar outputs are expected after applying concepts in different orders. For example, for any data sample in subdomain , we introduce a constraint implemented by the penalty function:
(4) 
and are defined in a similar way by considering original image in subdomains and . The overall commutative loss is the sum of the three terms.
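The commutative constraint can be illustrated with a short sketch: apply two concept mappings in both orders and penalize the difference. The names `G1`, `G2`, the mean absolute distance, and the toy mappings are assumptions for illustration, not the authors' exact penalty.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def commutative_loss(x, G1, G2):
    """Penalty on the gap between applying concept 1 then 2 and concept 2 then 1."""
    return l1(G2(G1(x)), G1(G2(x)))

# Additive shifts on separate coordinates commute.
shift1 = lambda v: [v[0] + 1.0, v[1]]
shift2 = lambda v: [v[0], v[1] + 1.0]
# A mapping whose effect depends on the other coordinate does not commute with shift1.
mix2 = lambda v: [v[0], v[1] + v[0]]

x = [0.25, 0.75]
commuting = commutative_loss(x, shift1, shift2)
non_commuting = commutative_loss(x, shift1, mix2)
print(commuting, non_commuting)
```

The second case mirrors the failure mode described above: a mapping that entangles one concept with the state of the other produces different outputs depending on the order of application.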
3.4 Overall loss function
The overall loss function is expressed as:
(5) 
with weight parameters and . The generators are learned as the solutions of a minimax problem:
(6) 
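For orientation, one plausible form of the combined objective is sketched below. The symbols are assumptions: $\mathcal{L}_{adv}$, $\mathcal{L}_{cyc}$ and $\mathcal{L}_{comm}$ stand for the overall adversarial, cycle-consistency and commutative losses of Sections 3.1 to 3.3, and $\lambda$, $\mu$ for the two weight parameters; how the weights are assigned to the terms is inferred from the text rather than confirmed by it.

```latex
\mathcal{L} \;=\; \mathcal{L}_{adv} \;+\; \lambda\,\mathcal{L}_{cyc} \;+\; \mu\,\mathcal{L}_{comm},
\qquad
\{G^{*}\} \;=\; \arg\min_{\{G\}}\;\max_{\{D\}}\;\mathcal{L}
```

Here $\{G\}$ and $\{D\}$ denote the sets of all generators and discriminators in the cyclic model.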
3.5 Composition of multiple concepts
In each experiment, two concepts are jointly trained via the proposed cyclic model shown in Figure 2, where synthetic images are generated in all four subdomains. In particular, by composing the pair of concepts, plausible images are synthesized in subdomain , where we assume no training data is available. This image synthesis mechanism can be generalized by considering the composition of multiple concepts. For example, we demonstrate in Figure 4 that by directly combining two pairs of concepts learned in separate experiments, plausible images can be generated over a three-dimensional latent space, including a subdomain where no training data is available in either of the experiments. This suggests that the proposed system can be scaled up with linearly increasing complexity via direct composition of concepts learned in a pairwise fashion.
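Direct composition of concepts learned in separate experiments amounts to chaining the learned mappings. The sketch below shows the idea with toy stand-ins: an "image" is a dictionary of attribute flags and each mapping simply switches one flag on. The helper names and the pairing of concepts into two experiments are assumptions for illustration; real mappings would be trained generators acting on pixels.

```python
from functools import reduce

def compose(*mappings):
    """Return a function that applies the given mappings from left to right."""
    return lambda x: reduce(lambda acc, m: m(acc), mappings, x)

def add_attribute(name):
    """Toy stand-in for a learned forward mapping that switches one concept on."""
    return lambda img: {**img, name: True}

# Assumed setup: two separately trained concept pairs that share "eyeglasses".
experiment_1 = {"smile": add_attribute("smile"), "eyeglasses": add_attribute("eyeglasses")}
experiment_2 = {"eyeglasses": add_attribute("eyeglasses"), "bangs": add_attribute("bangs")}

to_all_three = compose(experiment_1["smile"], experiment_1["eyeglasses"], experiment_2["bangs"])
result = to_all_three({"smile": False, "eyeglasses": False, "bangs": False})
print(result)
```

Each additional concept adds one forward and one inverse mapping to the pool, which is the sense in which the number of mappings grows linearly while the number of reachable attribute combinations grows exponentially.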
3.6 Implementation details
For all discriminators, we use an architecture similar to Kim et al. [10], which contains 5 convolution layers with filters. Compared to the PatchGAN used in Zhu et al. [40], the discriminator network takes 64x64 input images and outputs a scalar from the sigmoid function for each image. For all the generators, we use the architecture adapted from Zhu et al. [40], which contains 2 convolution layers with stride 2, 6 residual blocks and 2 fractionally-strided convolution layers with stride . We use the Adam optimizer [12] with an initial learning rate of 0.0002 for the first 150 epochs, followed by a linearly decaying learning rate for the next 150 epochs as the rate goes to zero. For experiments in Section 4.1, we set and we also include an identity loss component [40] with weight 10.
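The learning-rate schedule described above can be written as a small function. The zero-based epoch index and the exact discretization of the linear decay are assumptions; only the base rate and the two 150-epoch phases come from the text.

```python
BASE_LR = 0.0002       # initial learning rate stated in the text
CONSTANT_EPOCHS = 150  # epochs trained at the base rate
DECAY_EPOCHS = 150     # epochs over which the rate decays linearly to zero

def learning_rate(epoch):
    """Learning rate for a zero-based epoch index (assumed discretization)."""
    if epoch < CONSTANT_EPOCHS:
        return BASE_LR
    remaining = CONSTANT_EPOCHS + DECAY_EPOCHS - epoch
    return BASE_LR * max(remaining, 0) / DECAY_EPOCHS

for epoch in (0, 149, 150, 225, 299, 300):
    print(epoch, learning_rate(epoch))
```

In a deep learning framework this function would typically be passed to a scheduler as a multiplicative factor of the base rate instead of being called directly.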
4 Experiments
4.1 Conditional image synthesis
Image synthesis experiments are performed, each corresponding to the manipulation of two concepts. In Figure 3, columns (I), (II) and (III) demonstrate the clockwise cycle-consistency, the counterclockwise cycle-consistency and the commutative property of the concept composition respectively. Given the real test images shown leftmost in each panel, plausible synthetic data are generated with correct semantic variation in each subdomain, including the subdomain where no training data is available.
Concept learning with face images. Figure 3 (A) and Figure 4 show the results of applying the proposed method to face images. The concept learning models are trained and tested on the CelebA dataset [21]. In the experiment concerning the concepts “smile” and “eyeglasses” (Figure 3 (A)), 4851, 3945 and 4618 images with attribute labels (no smile, no eyeglasses), (no smile, with eyeglasses) and (with smile, no eyeglasses) are used at the training stage for subdomains , and respectively. Figure 4 presents the results of directly composing three concepts learned in two separate experiments described above. Synthetic images are generated in the subdomain with labels (with smile, with eyeglasses, with bangs), where no training data is available in either experiment. This shows that the proposed method can be generalized to manipulation over higher-dimensional latent spaces.
Transfer of learned concepts. Here, we qualitatively demonstrate the transferability of the concepts learned by ConceptGAN to different datasets not used at all during training. Figure 5 presents the results of directly transferring the learned concept pair to independent test sets. The concepts “eyeglasses” and “bangs” are trained on the CelebA [21] dataset and tested on the LFW [6] and MS-Celeb-1M [5] datasets respectively.
Concept learning of shape and texture
Figure 3 (B) shows the results of applying the proposed method to images concerning the concepts “handbag vs. shoe” (shape variation) and “photo vs. edge” (texture variation). Without taking advantage of the paired labels, we use the “edges2shoes” and “edges2handbags” datasets from pix2pix [7]. 5124, 5124 and 4982 images with attribute labels (color, handbag), (edge, handbag) and (color, shoe) are used at the training stage for subdomains , and respectively. Given no training data, synthetic line drawing (“edge”) images are generated for shoes.
The importance of simultaneously learning and transferring concept mappings is demonstrated in comparison to results of direct composition of separately trained CycleGAN units [40] in Figure 1. In particular, the mappings trained via baseline CycleGAN with images in subdomains and are restricted to training distributions and therefore fail to preserve the correct texture information when directly transferred to input images in subdomain .
Additional results of the experiments, including with other concepts, can be found in the supplementary material.
5 Quantitative evaluations
We provide quantitative performance evaluations of our proposed concept learning framework for two different tasks: attribute classification and face verification.
5.1 Attribute classification
In this section, our goal is to quantitatively demonstrate the importance of simultaneously learning and transferring concept mappings as opposed to learning and composing concepts separately via a single CycleGAN unit. To this end, we perform several classification experiments and report their results. Specifically, we employ the following evaluation protocol: (a) We use data in subdomains , and to learn concept mappings and automatically synthesize data in the subdomain using the proposed concept learning model. We then use the generated images in the subdomain and perform a two-class classification experiment on each of the concepts. (b) We repeat the experiment described above, but now data in is generated as the composition of two independently learned CycleGAN units, i.e., we learn one CycleGAN for the mapping and another CycleGAN for the mapping. Given data in , we then compose the two learned mappings to synthesize data in . We use the same network architecture to train the separate CycleGAN units as described in Section 3.6.
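Table 1: Attribute classification results (%) for the “handbag vs. shoe” and “color vs. edge” concepts.
Classifier                          | Val | CycleGAN | Ours
C1: “color/shoe” vs. “edge/shoe”    | 99  | 0        | 99
C2: “edge/handbag” vs. “edge/shoe”  | 99  | 99       | 98
Both C1 and C2                      | N/A | 0        | 98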
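Table 2: Attribute classification results (%) for the “eyeglasses” and “bangs” concepts.
Classifier                          | Val | CycleGAN | Ours
C1: “with” vs. “no” eyeglasses      | 98  | 93       | 98
C2: “with” vs. “no” bangs           | 93  | 61       | 67
Both C1 and C2                      | N/A | 56       | 66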
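Table 3: Rank-1 face verification results without augmentation (one-shot, ranked by Euclidean distance) and with ConceptGAN augmentation (multi-shot, ranked by RNP or SRID).
Attributes     | Smiling & Eyeglasses     | Bangs & Eyeglasses       | Smiling, Bangs, & Eyeglasses
Augmentation   | No     | Yes    | Yes    | No     | Yes    | Yes    | No     | Yes    | Yes
Ranking method | -      | RNP    | SRID   | -      | RNP    | SRID   | -      | RNP    | SRID
CaffeFace      | 8.3    | 10.7   | 12.8   | 6.0    | 9.6    | 15.4   | 11.5   | 13.3   | 16.6
VGGFace        | 38.6   | 43.9   | 49.4   | 49.8   | 59.4   | 61.5   | 44.4   | 54.8   | 58.6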
Key results of this experiment for multiple concept examples include the following. (a) Classifying shoe and edge images: In this experiment, we demonstrate results on the “handbag vs. shoe” and “color vs. edge” concepts. We use images of “color/handbag” (), “color/shoe” (), and “edge/handbag” () for learning the mappings of our proposed concept learning approach as well as individual mappings for CycleGAN. The results of this experiment are shown in Table 1. They demonstrate that the proposed method successfully composes two concepts in the subdomain , as 98% of the synthesized images pass both classification tests; this greatly outperforms direct composition of two separately trained CycleGAN units, where no synthesized image survives both tests. (b) Classifying face images with “eyeglasses” and “bangs”: In this experiment, we demonstrate results on the “eyeglasses” and “bangs” concepts. We use images of “no eyeglasses, no bangs” (), “with eyeglasses, no bangs” (), and “no eyeglasses, with bangs” () to learn the mappings of ConceptGAN and baseline CycleGAN. The results of this experiment are shown in Table 2. The proposed method outperforms direct composition of CycleGAN units in terms of the synthesis quality in the subdomain , improving joint classification accuracy from 56 to 66.
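The "Both C1 and C2" rows count a synthesized image only if it passes both classifiers. A minimal sketch of that bookkeeping follows; the function name and the per-image outcomes are made up for illustration and are not the paper's data.

```python
def pass_rates(c1_correct, c2_correct):
    """Percentage of images passing classifier C1, classifier C2, and both at once."""
    n = len(c1_correct)
    both = [a and b for a, b in zip(c1_correct, c2_correct)]
    return {
        "C1": 100.0 * sum(c1_correct) / n,
        "C2": 100.0 * sum(c2_correct) / n,
        "both": 100.0 * sum(both) / n,
    }

# Hypothetical per-image outcomes for five synthesized images.
c1 = [True, True, False, True, True]
c2 = [True, False, False, True, True]
rates = pass_rates(c1, c2)
print(rates)
```

The joint rate can never exceed the smaller of the two individual rates, which is why a mapping that fails one concept entirely (as in the CycleGAN column of the shoe experiment) yields a joint rate of zero.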
5.2 Face verification with augmented data
Given a pair of face images, face verification is the problem of determining whether the pair represents the same person. Here, we demonstrate the applicability of ConceptGAN to this problem. Specifically, we begin with the one-shot version, where every person in the probe and the gallery has exactly one image each. We then use the learned concept mappings to synthesize new, unseen face images, transforming the one-shot version to a multi-shot one. We demonstrate that by performing this conversion with our synthesized images, we improve the face verification performance. We note that the focus of these evaluations is not to obtain state-of-the-art results but to demonstrate the applicability of ConceptGAN as a plug-in module that can be used in conjunction with any existing face verification algorithm to obtain improved performance. We use the CelebA [21] dataset for all experiments, where we generate 10 random splits of 100 people each not used in training ConceptGAN and report the average performance.
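Table 4: Rank-1 face verification results with CaffeFace and SRID on independent test sets, using concepts learned on CelebA.
Ranking method | SRID
Augmentation   | No   | Yes
LFW            | 9.5  | 13.1
MS-Celeb-1M    | 11.7 | 14.8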
We first perform one-shot experiments where we use two popular pretrained face representation models, VGGFace [27] and CaffeFace [36], to compute feature representations of the images and rank gallery candidates using the Euclidean distance. We next perform multi-shot experiments by augmenting both probe and gallery sets for each person using ConceptGAN, and rank gallery candidates with two multi-shot ranking algorithms, SRID [9, 8] and RNP [38]. Results of all the experiments discussed above are summarized in Table 3, where the augmented probe and gallery sets have 4 images each in the case of two concepts and 8 images each in the case of three concepts. As these results show, converting the one-shot face verification problem to a multi-shot one by means of ConceptGAN is beneficial: the multi-shot rank-1 face verification results consistently outperform the corresponding one-shot results. We further qualitatively show the rank improvement in Figure 6, where we see improved retrieval in the cases where face verification was performed with augmented data. Here we also provide quantitative evaluations for the transferability of concepts learned by ConceptGAN. Specifically, in Table 4, we show rank-1 face verification results with CaffeFace and SRID on two independent test sets (LFW and MS-Celeb-1M) using concepts learned by ConceptGAN on the CelebA dataset, where we see improved performance with data synthesized using the transferred concepts. These results, complemented by the qualitative evaluations of the previous section, provide evidence for the transferability of the learned concepts to new datasets, demonstrating promise in learning the underlying latent space information.
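The one-shot baseline (rank gallery candidates by Euclidean distance and check whether the top match is the correct identity) can be sketched as follows. The feature vectors are toy values standing in for VGGFace or CaffeFace descriptors, and the function names are assumptions; the multi-shot rankers SRID and RNP are not implemented here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank1_accuracy(probe, gallery):
    """Percentage of probes whose nearest gallery entry has the same identity.

    probe and gallery map identity -> feature vector (one image per person).
    """
    hits = 0
    for identity, feature in probe.items():
        best = min(gallery, key=lambda gid: euclidean(feature, gallery[gid]))
        hits += int(best == identity)
    return 100.0 * hits / len(probe)

# Toy two-dimensional features for three identities (illustrative only).
gallery = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
probe = {"a": [0.1, 0.1], "b": [0.9, 0.2], "c": [0.6, 0.5]}
accuracy = rank1_accuracy(probe, gallery)
print(accuracy)
```

In the multi-shot setting each identity would map to a set of feature vectors (the original image plus its synthesized variants), and the distance between single vectors would be replaced by a set-to-set ranking method.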
6 Discussion: Generalizing to multiple concepts
In the previous sections, we discussed a possible way to scale up to three concepts, and showed qualitative and quantitative results. Here, we provide a scheme to generalize our method to concepts under two assumptions: (a) concepts have distinct states, i.e. they are not continuous, and (b) activating one concept does not inhibit any other. We show that pairwise constraints over two concepts are sufficient for generating samplers from all concept combinations. Figure 7(a) illustrates with concepts as a graph where the edges apply a concept and the nodes are the concept combinations. Each node of the graph may be observed or not, as illustrated in Figure 7(b) (green indicates an observed node). “Observed” means that we have samples from the underlying distribution of a node. Applying our method then allows us to infer node , indicated in brown, with two concepts and involved. Indeed, the subgraph of nodes is exactly our proposed two-concept solution. Let us add data drawn from node , observing the additional concept . The resulting graph shows that we can also infer nodes and by adding constraints corresponding to the cycles and . We now take the next step in generalization by considering node . Assuming that we can indeed infer nodes , we consider constraints that treat them as “observed”, such as over the cycles , , and . This allows us to estimate samples for node . To illustrate the generic nature, Figure 7(c) shows a situation with data at nodes . We can first infer nodes and then . Generalizing to , we propose to discover new layers of nodes in order of their distance from any observed node. Naturally, one cannot escape the combinatorial complexity of generating samplers over all concepts. However, our proposed generalization paves the way for iterative algorithms that yield approximate solutions efficiently based on a graphical representation of concepts and data.
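The layer-by-layer discovery order proposed above can be sketched as a multi-source breadth-first search over the hypercube of binary concept combinations. The node encoding as bit tuples and the example set of observed nodes are assumptions for illustration (the node labels of Figure 7 are not reproduced in this text), and the sketch only orders the unobserved nodes; it does not check that each one closes a four-node cycle with already known nodes.

```python
from collections import deque
from itertools import product

def inference_layers(num_concepts, observed):
    """Group unobserved concept combinations by distance to the nearest observed node."""
    distance = {node: 0 for node in observed}
    queue = deque(observed)
    while queue:
        node = queue.popleft()
        for i in range(num_concepts):
            # Applying or removing concept i moves along one edge of the graph.
            neighbor = node[:i] + (1 - node[i],) + node[i + 1:]
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    layers = {}
    for node in product((0, 1), repeat=num_concepts):
        if distance[node] > 0:
            layers.setdefault(distance[node], []).append(node)
    return layers

# Assumed example: the empty combination and each single concept are observed.
observed = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
layers = inference_layers(3, observed)
print(layers)
```

For this example the first layer contains the three two-concept combinations and the second layer the single three-concept combination, matching the order of inference described in the text.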
7 Conclusions
We proposed ConceptGAN, a novel concept learning framework where we seek to capture underlying semantic shifts between data domains instead of mappings restricted to training distributions. The key idea is that via joint concept learning, transfer and composition, information over a joint latent space is recovered given incomplete training data. We showed that the proposed method can be applied as a smart data augmentation technique to generate realistic samples over different variations of concept attributes, including samples in a subdomain where the variation is completely unseen at the training stage. We demonstrated the compositionality of the captured concepts as well as the transferability of the data augmentation in an application to face verification.
References
 [1] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [2] X. Chen, X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2172–2180, 2016.
 [3] T.-C. Fu, Y.-C. Liu, W.-C. Chiu, S.-D. Wang, and Y.-C. F. Wang. Learning cross-domain disentangled deep representation with supervision from a single domain. arXiv preprint arXiv:1705.01314, 2017.
 [4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems (NIPS), 2014.
 [5] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision. Springer, 2016.
 [6] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
 [7] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [8] S. Karanam, M. Gou, Z. Wu, A. Rates-Borras, O. Camps, and R. J. Radke. A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets. arXiv preprint arXiv:1605.09653, 2016.
 [9] S. Karanam, Y. Li, and R. J. Radke. Person re-identification with block sparse recovery. Image and Vision Computing, 60:75–90, 2017.
 [10] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1857–1865, 2017.
 [11] T. Kim, B. Kim, M. Cha, and J. Kim. Unsupervised visual attribute transfer with reconfigurable generative adversarial networks. arXiv:1707.09798, 2017.
 [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
 [13] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling. Semi-supervised learning with deep generative models. Advances in Neural Information Processing Systems (NIPS), pages 3581–3589, 2014.
 [14] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2013.
 [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
 [16] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. Advances in Neural Information Processing Systems (NIPS), 2015.
 [17] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2015.
 [18] X. Liang, H. Zhang, and E. P. Xing. Generative semantic manipulation with contrasting gan. The 31st Conference on Neural Information Processing Systems (NIPS), 2017.
 [19] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. The 31st Conference on Neural Information Processing Systems (NIPS), 2017.
 [20] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. Advances in Neural Information Processing Systems 29, pages 469–477, 2016.
 [21] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 [22] Y. Lu, Y.-W. Tai, and C.-K. Tang. Conditional CycleGAN for attribute guided face image generation. arXiv:1705.09966, 2017.
 [23] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. International Conference on Learning Representations (ICLR), 2016.
 [24] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems (NIPS), 2013.
 [25] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [26] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2642–2651, 2017.
 [27] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2015.
 [28] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. British Machine Vision Conference, 2015.
 [29] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations (ICLR), 2016.
 [30] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [31] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. Proceedings of the 7th International Conference on Document Analysis and Recognition (ICDAR), 2003.
 [32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR), 2015.
 [33] L. Sixt, B. Wild, and T. Landgraf. Rendergan: Generating realistic labeled data. International Conference on Learning Representations (ICLR) Workshops, 2017.
 [34] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. International Conference on Learning Representations (ICLR), 2017.
 [35] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
 [36] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
 [37] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. Proceedings of European Conference on Computer Vision (ECCV), 2016.
 [38] M. Yang, P. Zhu, L. Van Gool, and L. Zhang. Face recognition based on regularized nearest points between image sets. In Proceedings of the IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2013.
 [39] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of International Conference on Computer Vision (ICCV), 2017.
 [40] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.