ZstGAN: An Adversarial Approach for
Unsupervised Zero-Shot Image-to-Image Translation
Image-to-image translation models have shown a remarkable ability to transfer images among different domains. Most existing work assumes that the source and target domains remain the same at training and inference, so it cannot be generalized to translating an image from one unseen domain to another unseen domain. In this work, we propose the Unsupervised Zero-Shot Image-to-image Translation (UZSIT) problem, which aims to learn a model that can transfer translation knowledge from seen domains to unseen domains. Accordingly, we propose a framework called ZstGAN. By introducing an adversarial training scheme, ZstGAN learns to model each domain with a domain-specific feature distribution that is semantically consistent across the vision and attribute modalities. The domain-invariant features are then disentangled with a shared encoder for image generation. We carry out extensive experiments on the CUB and FLO datasets, and the results demonstrate the effectiveness of the proposed method on the UZSIT task. Moreover, ZstGAN shows significant accuracy improvements over state-of-the-art zero-shot learning methods on CUB and FLO. Our code is publicly available at https://github.com/linjx-ustc1106/ZstGAN-PyTorch.
Image-to-image translation tasks [14, 36], which aim at learning mappings that convert an image among different domains while preserving the main representations of the input image, have been widely investigated in recent years. Existing image-to-image translation usually works in the following setting: Given domains of interest, denoted as where , the objective is to learn mappings , where . After obtaining these ’s, we can achieve translations among these domains. Specifically, many models have been proposed for the setting like CycleGAN [36], DiscoGAN [15], etc and for the setting like StarGAN [7].
One limitation of existing models is that the ’s can only achieve mappings among the given domains and do not generalize to other unseen domains. That is, existing image-to-image translation models cannot translate an image from an unseen domain or to another unseen domain. Take the bird translation shown in Figure 1 as an example. Assume a model is trained on domain , and . It is natural that can achieve translation among these three domains (see the upper half of Figure 1), but it cannot be applied to unseen domains , and . In practice, new image domains keep emerging, and it is impractical to train new translation models from scratch to cover them. We therefore aim to generalize to unseen domains, as shown in the bottom half of Figure 1.
Zero-Shot Learning (ZSL) [18, 26, 22] aims to recognize objects whose instances might not have been seen during training. In order to generalize to unseen classes, a common assumption in zero-shot learning is that some side information about the classes is available, such as class attributes or textual descriptions, which provide semantic information about the classes.
To the best of our knowledge, no existing work addresses zero-shot learning for unsupervised image translation. To fill this gap in image-to-image translation, we propose a new problem, unsupervised zero-shot image-to-image translation (briefly, UZSIT). Compared to standard ZSL, UZSIT is more challenging: (1) The target of image translation is more complex than classification, since it requires us not only to generate representative features across seen and unseen domains but also to generate reasonable translated images. (2) Unlike ZSL methods, which are trained in a supervised way on seen domains, we do not have any paired data between any two domains. This requires us to learn the mappings in a fully unsupervised manner for both seen and unseen domains. We therefore devise a framework, called ZstGAN, for the UZSIT problem. There are two key steps in ZstGAN.
(1) We model each seen/unseen domain using a domain-specific feature distribution constrained by semantic consistency. Specifically, a visual-to-semantic encoder and an attribute-to-semantic encoder are introduced. They are jointly trained to extract domain-specific features from images and attributes respectively while preserving the same semantic information between these two modalities. The adversarial and classification losses are introduced to the two encoders to regularize training.
(2) We disentangle domain-invariant features from the domain-specific features and combine them to generate translation results, which is achieved by one adversarial learning loss and two reconstruction losses.
We work on two datasets commonly used in ZSL, Caltech-UCSD-Birds 200-2011 (CUB) [31] and Oxford Flowers (FLO) [25], to verify the effectiveness of our method on the UZSIT task. We also generalize our model to traditional ZSL tasks and find that it achieves significant improvements over state-of-the-art ZSL methods on the CUB and FLO datasets.
The remainder of the paper is organized as follows: We present a brief overview of related works in Section 2. We detail the problem formulation of UZSIT and describe our approach in Section 3. The datasets and experimental results are reported in Section 4. Finally, we summarize our work and present several future directions in Section 5.
2 Related Works
Generative Adversarial Networks. Image generation has been widely investigated in recent years, and most works focus on modeling the natural image distribution. The Generative Adversarial Network (GAN) [9] was first proposed to generate images from random variables through a two-player minimax game: a generator G tries to create fake but plausible images, while a discriminator D is trained to distinguish between real and fake images. To address the stability issues of GANs, Wasserstein GAN (WGAN) [2] was proposed to optimize an approximation of the Wasserstein distance. To further mitigate the vanishing and exploding gradient problems of WGAN, Gulrajani et al. [10] proposed WGAN-GP, which uses a gradient penalty instead of weight clipping to enforce the Lipschitz constraint in WGAN. Mao et al. [24] proposed LSGAN and showed that minimizing the least-squares objective amounts to minimizing a Pearson χ² divergence. In this paper, we adopt WGAN-GP [10] to generate domain-specific features and translated images.
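For reference, the WGAN-GP critic objective of Gulrajani et al. can be written as below. This is the standard textbook form, not necessarily the exact loss used in ZstGAN. Here $P_r$ is the real data distribution, $P_g$ the generator distribution, $\hat{x}$ is sampled on straight lines between real and generated samples, and $\lambda$ is the penalty weight.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]
```

The last term replaces weight clipping: it pushes the norm of the critic's gradient toward 1 instead of bounding the weights directly.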
Image-to-Image Translation. Isola et al. [14] proposed a general conditional GAN (Pix2Pix) for a wide range of supervised image-to-image translation tasks, including label-to-street scene, aerial-to-map, day-to-night and so on. Building on the observation that image translation between two domains should obey cycle consistency, DualGAN [35], DiscoGAN [15] and CycleGAN [36] were proposed to tackle the unpaired image translation problem by training two cross-domain translation models at the same time. However, these models cannot control the translated results in the target domain, and their results usually lack diversity. In order to control the translated results in the target domain and obtain more diverse outputs for a fixed input, the works [21, 13, 19] divided the latent space within translation into domain-invariant and domain-specific portions. The different domains share the same domain-invariant latent space, while each domain has its own domain-specific latent space. Choi et al. [7] further proposed to perform image-to-image translation for multiple domains. For low-resource unpaired image-to-image translation, Benaim et al. [3] first proposed a one-shot cross-domain translation method, which transfers one and only one image in a source domain to a target domain. Lin et al. [20] also proposed DosGAN, which is able to translate images from unseen face identities without any fine-tuning once the model is trained on seen face identities, and which is most related to our work. In this work, we focus on a different setting from these two works, namely zero-shot image translation, which learns to transfer images from unseen domains to other unseen domains with the availability of both visual and semantic modalities.
Zero-Shot Learning. Zero-Shot Learning (ZSL) was first introduced by , where train and test classes are disjoint for object recognition. Traditional ZSL methods are based on learning an embedding from the visual space to the semantic space. At test time, the semantic vector of an unseen sample is extracted and the most likely class is predicted by a nearest-neighbor method [32, 26, 29]. Recent works on ZSL have widely explored the idea of generative models. Wang et al. [33] presented a deep generative model for ZSL based on the VAE [17]. With the rapid development of GANs, other approaches used GANs to synthesize visual representations for the seen and unseen classes [22, 4]. However, the generated images usually lack sufficient quality to train a classifier for both the seen and unseen classes. Hence the authors of [34, 8] used GANs to synthesize CNN features, rather than image pixels, conditioned on class-level semantic information. On the other hand, considering that ZSL is a domain shift problem, [30, 5] presented Generalized ZSL (GZSL), which leverages both seen and unseen classes at test time.
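To make the embedding-based pipeline concrete, the following minimal sketch assumes that an image has already been mapped to a semantic vector and assigns it to the unseen class whose attribute vector is closest in cosine similarity. It is not taken from any of the cited methods; the class names, the toy vectors and the helper `predict_unseen_class` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equally sized, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_unseen_class(semantic_vec, class_attributes):
    """Return the class whose attribute vector is nearest to semantic_vec."""
    return max(
        class_attributes,
        key=lambda name: cosine_similarity(semantic_vec, class_attributes[name]),
    )


# Hypothetical attribute vectors for two unseen classes.
unseen_classes = {
    "class_a": [1.0, 0.0, 0.2],
    "class_b": [0.0, 1.0, 0.8],
}

# Hypothetical semantic vector predicted from a test image.
prediction = predict_unseen_class([0.1, 0.9, 0.7], unseen_classes)
```

Only the attribute vectors of the unseen classes are needed at test time, which is what allows recognition without any training image from those classes.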
3.1 Problem Formulation
We provide a mathematical formulation of UZSIT in this subsection.
Let be the collection of images. Let and be two disjoint image categories, where and , , and . For ease of reference, define . Let denote the set of attributes or textual descriptions of images. Each sample can be represented by a where is a picture, is the corresponding label (e.g. a bird or a cat, etc) and is the attribute (e.g., the color, position, etc). We have two different sets, a training set and a test set .
The objective of UZSIT is to train an image-to-image translation model on without touching , and then to evaluate the obtained model on without any further tuning. This requires the assumption that and share a common semantic space. Specifically, while and have different category sets ( and ), they are required to share the same image and attribute spaces ( and ) from which semantic information is extracted.
An implicit assumption in image-to-image translation is that an image contains two kinds of features [21, 13, 19]: domain-invariant features and domain-specific features for any , . With an oracle image merge operator , .
In existing image-to-image translation models, the domain-specific features of different domains are usually extracted without placing them in a common semantic space. As a result, such features ignore the implicit relationships among different domains. In this paper, we argue that domain-specific features should be not only discriminative for different domains, but also representative enough to align different domains in a common semantic space. We discuss how to learn the domain-specific features in the following subsection.
Depending on where the domain-specific features are extracted from, we define two kinds of image translation problems at the zero-shot testing phase.
(1) Vision-driven image translation: ;
(2) Attribute-driven image translation: .
In and , the first input is used to provide domain-invariant features and the second input is used to specify domain-specific features: one uses an image and the other uses an attribute.
The architecture of our proposed ZstGAN is shown in Figure 2. We use to denote a Gaussian distribution. There are three encoders in our framework, , and , which extract domain-invariant features, vision-based domain-specific features and attribute-based domain-specific features respectively. A decoder is also needed to convert the hidden representations into natural images, where the first input is the domain-invariant features and the second input is the domain-specific features. That is, to generate an image, and work as follows:
where and . We denote these two mappings with V-ZstGAN and A-ZstGAN respectively. In our configuration, we do not explicitly train in the training stage and it is naturally obtained by training with the following constraints.
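Before turning to those constraints, the two inference paths can be sketched as plain function composition. This is only an illustration of the data flow described above: the function names are hypothetical, and the toy encoders and generator below are stand-ins that return labeled tuples rather than real networks.

```python
def translate_vision_driven(content_image, style_image,
                            enc_content, enc_visual, generator):
    """V-ZstGAN-style path: domain-specific features come from a second image."""
    invariant = enc_content(content_image)   # domain-invariant features
    specific = enc_visual(style_image)       # vision-based domain-specific features
    return generator(invariant, specific)    # first input invariant, second specific


def translate_attribute_driven(content_image, attribute, noise,
                               enc_content, enc_attribute, generator):
    """A-ZstGAN-style path: domain-specific features come from an attribute and noise."""
    invariant = enc_content(content_image)
    specific = enc_attribute(attribute, noise)
    return generator(invariant, specific)


# Toy stand-ins (assumptions for illustration only, not real networks).
def toy_enc_content(image):
    return ("invariant", image)


def toy_enc_visual(image):
    return ("specific", image)


def toy_enc_attribute(attribute, noise):
    return ("specific", attribute, noise)


def toy_generator(invariant, specific):
    return {"invariant": invariant, "specific": specific}


v_out = translate_vision_driven(
    "img_x", "img_y", toy_enc_content, toy_enc_visual, toy_generator)
a_out = translate_attribute_driven(
    "img_x", "attr_y", 0.0, toy_enc_content, toy_enc_attribute, toy_generator)
```

Both paths share the same content encoder and generator; they differ only in which encoder supplies the second argument of the generator.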
The objective function is designed according to the following criteria:
(1) Domain-specific features with semantic consistency
Given a tuple , the image and the attribute should share the same semantic representation. For such purpose, we utilize an adversarial training scheme which requires outputs of and to follow the same distribution conditioned on domain attributes. We need a domain-specific features discriminator which is used to distinguish outputs of and . In detail, the adversarial training objective for and is:
where is a noise vector sampled from .
Adversarial training alone can only ensure that the distributions of the vision-based and attribute-based domain-specific features match each other. However, such features cannot identify which domain the input images/attributes come from, which renders the term “domain-specific” meaningless. Thus, we require the and to be correctly classified by a classifier . The classification loss is given as:
where denotes the parameters of the classifier .
(2) Domain-invariant features disentanglement
Given a domain-invariant encoder and a generator illustrated in Figure 2 and another tuple , we have domain-invariant features , domain-specific features . To translate image from domain to domain , we can combine and to obtain . To ensure that the translated result lies in the target domain and in the real image domain, we introduce a domain discriminator . of takes a real or fake image as input, and maximizes the mutual information between the target domain-specific features and the input, as in InfoGAN [6]. Also, of outputs a probability of the input belonging to the real image domain. The objective functions are given below:
To ensure the disentanglement of domain-invariant features with domain-specific features, we introduce a self-reconstruction loss and a cross-reconstruction loss . We can obtain the self-reconstructed image and the cross-reconstructed image . The is to minimize the L1 norm between and :
The is to minimize the L1 norm between and :
If optimally minimized and optimally minimized , we can find that the difference between and , which are from two domains, only lies in the difference between and . Thus it implies that and are domain-specific features that determine which domain image belongs to. On the other hand, if optimally minimized and optimally minimized , we can find that the difference between and , which are both the reconstruction images of the same , only lies in the difference between and . Thus it further implies that and are domain-invariant features that maintain across different domains.
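As an illustration of the reconstruction terms, the sketch below computes an L1 distance between two images stored as flat lists of pixel values. Averaging over pixels (rather than summing) is an assumption made here for scale invariance, and the function name `l1_distance` is hypothetical; the same helper would apply to both the self-reconstruction and the cross-reconstruction terms, with different reconstructed images passed in.

```python
def l1_distance(image_a, image_b):
    """Mean absolute difference between two equally sized flattened images."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(image_a, image_b)) / len(image_a)


# Toy example: an input image and a hypothetical reconstruction of it.
original = [0.0, 1.0, 0.5]
reconstruction = [0.0, 0.5, 1.0]
loss = l1_distance(original, reconstruction)
```

A perfect reconstruction gives a loss of zero, which is the optimum both reconstruction losses are driven toward.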
(3) The overall training objective
The overall objective for above mentioned encoders, discriminators, classifier and generator , , , , , and is given by:
where , and are weights that balance the different loss terms. Note that, unlike  that utilizes a pre-trained CNN feature extractor as a fixed visual-to-semantic encoder, our is updated with the adversarial and classification losses. A fixed pre-trained CNN feature extractor cannot adapt to specific domains and attributes, while our approach enables the to extract domain-specific features that are both discriminative for different domains and representative enough to align the visual images with attributes in a common semantic space. The experiment in Section 4.2 also demonstrates that our approach significantly improves the performance of .
3.3 Implementation details
For , we use the 50-layer ResNet  to encode image to domain attribute-specific features of dimensions. One fully-connected layer () is connected to for classification output. For , it consists of two fully-connected layers and takes both attributes and noise as inputs. For discriminator , it consists of two fully-connected layers as . For domain discriminator , we use PatchGANs  that consists of six stride convolution layers, and two separated convolution layers for discrimination output and domain-specific feature prediction. For , it has one stride convolution layer, two stride convolution layers and residual blocks . For generator , it first adds domain attribute-specific features to domain-invariant features from encoder with Adaptive Instance Normalization (AdaIN) . Then the combined feature is input to residual blocks, two stride deconvolution layers and one stride convolution layer.
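AdaIN normalizes each channel of the content feature map to zero mean and unit variance and then applies a scale and a shift derived from the style input. The sketch below shows this for a single channel in plain Python. How the domain-specific features are mapped to the `scale` and `shift` values is not specified in the text above, so those arguments, the function name `adain_channel` and the `eps` constant are assumptions for illustration.

```python
import math


def adain_channel(values, scale, shift, eps=1e-5):
    """Apply adaptive instance normalization to one feature channel.

    values: activations of a single channel of the domain-invariant features.
    scale, shift: per-channel parameters, assumed to be predicted from the
    domain-specific features by some learned layer.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var + eps)  # eps avoids division by zero on flat channels
    return [scale * (v - mean) / std + shift for v in values]


out = adain_channel([1.0, 2.0, 3.0, 4.0], scale=2.0, shift=0.5)
```

After the operation the channel has mean `shift` and standard deviation close to `scale`, so the domain-specific input controls the channel statistics while the spatial pattern of the content features is kept.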
For all experiments, we resize the images to resolution as inputs. The dimension of domain-specific features is set to . The dimension of is set to . We set the weight parameters , and for CUB experiments and for FLO experiments. We train our networks using Adam [16] with a learning rate of . For all experiments, we train models with a learning rate of in the first iterations and linearly decay the learning rate every iteration.
Datasets. We conduct extensive quantitative and qualitative evaluations on Caltech-UCSD-Birds 200-2011 (CUB) [31] and Oxford Flowers (FLO) [25], which are commonly used in ZSL tasks. CUB contains bird species with images. We crop all images in CUB with bounding boxes given in . FLO contains images of flowers from different categories. For every image in the CUB and FLO datasets, we extract 1024-dim character-based CNN-RNN [27] embeddings ( captions are provided for each image) as the attribute set . We split each dataset into domain-disjoint train and test sets. CUB is split into train domains and unseen domains; within the unseen domains, data is used as test data. FLO is split into train domains and unseen domains; within the unseen domains, data is used as test data.
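A constant-then-linear-decay schedule of this kind can be sketched as follows. The concrete learning rate and iteration counts are not reproduced here, so `base_lr`, `hold_steps` and `total_steps` are hypothetical parameters, and decaying all the way to zero at `total_steps` is an assumption.

```python
def learning_rate(step, base_lr, hold_steps, total_steps):
    """Hold base_lr for hold_steps iterations, then decay linearly to zero."""
    if total_steps <= hold_steps:
        raise ValueError("total_steps must exceed hold_steps")
    if step < hold_steps:
        return base_lr
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - hold_steps)


# Hypothetical settings, chosen only to keep the example small.
schedule = [learning_rate(s, base_lr=1e-4, hold_steps=100, total_steps=200)
            for s in (0, 99, 150, 200)]
```

In this toy setting the rate stays at `base_lr` for the first 100 steps, is halved at step 150 and reaches zero at step 200.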
4.1 Zero-Shot Image Translation Comparison
Since there is no previous work on the UZSIT problem, we compare our model with StarGAN [7], which can be viewed as an unsupervised many-shot image-to-image translation model that is trained with data of unseen domains. We train StarGAN with data of total domains on the CUB dataset, and with data of total domains on the FLO dataset.
The translation results of StarGAN and our model on CUB and FLO are shown in Figure 3 and Figure 4. Although our ZstGAN is trained without data of unseen domains and StarGAN is trained with data of unseen domains, ZstGAN shows even better translation quality than StarGAN on both CUB and FLO. For example, in the fourth column of Figure 3(b), our V-ZstGAN and A-ZstGAN accurately transfer the attributes of gray wings, black rectrices and bright yellow beak to the translation results, while StarGAN only shows a little yellow and gray color without accurate position in the translation result. In the third column of Figure 4(a), both A-ZstGAN and V-ZstGAN successfully transfer the “long and very thin bright yellow petals” description to the translation results, while StarGAN fails to change the shape of the original flower. Such results are mainly due to the design of StarGAN, which simply uses domain codes as domain-specific features and therefore has difficulty aligning different domains in a common semantic space. We can also see that the translation results of A-ZstGAN highly correlate with V-ZstGAN’s, which verifies the effectiveness of adversarial learning for aligning vision-based and attribute-based domain-specific features.
For quantitative evaluation, we translate source images from a random unseen domain to a random unseen domain in each test minibatch, and report the top-1 and top-5 classification accuracy of translated images of StarGAN and our model in Table 1, and Fréchet Inception Distance (FID) [12] scores in Table 2. We observe that the quantitative results are consistent with the results in Figure 3 and Figure 4: our ZstGAN achieves better classification accuracy and FID scores than StarGAN.
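For readers unfamiliar with the metric, top-k accuracy counts a translated image as correct when its target domain is among the k highest-scoring classes of a classifier. The sketch below is a generic illustration with made-up scores; the function name `top_k_accuracy` and the data are hypothetical and do not come from the experiments.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical classifier scores for three translated images over three domains.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
target_domains = [1, 1, 0]

top1 = top_k_accuracy(scores, target_domains, k=1)
top2 = top_k_accuracy(scores, target_domains, k=2)
```

Top-k accuracy can only stay the same or grow as k increases, which is why top-5 figures are always at least as high as top-1 figures.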
4.2 Generalizing to ZSL and GZSL
The domain-specific features output by the visual-to-semantic encoder and the attribute-to-semantic encoder in our ZstGAN can also be used for ZSL and Generalized ZSL (GZSL) problems [30, 5], where in the GZSL setting the seen domains can also be leveraged for testing. Specifically, we train two additional softmax classifiers that use generated domain-specific features from and corresponding labels as  for ZSL and GZSL testing respectively. In the ZSL setting, only the average per-class top-1 accuracy on unseen domains is computed. In the GZSL setting, we compute the average per-class top-1 accuracy on unseen domains (denoted as U), the average per-class top-1 accuracy on seen domains (denoted as S) and their harmonic mean, i.e., $H = \frac{2 \cdot U \cdot S}{U + S}$.
We compare our ZstGAN with three state-of-the-art ZSL and GZSL methods: SJE [1], ESZSL [28] and f-CLSWGAN [34]. The ZSL results on CUB and FLO are shown in Table 3, and the GZSL results in Table 4. The experiments clearly demonstrate the advantage of our ZstGAN for GZSL and ZSL, since it achieves the best top-1 accuracy in all the reported results, with improvements from to more than . While our modification of f-CLSWGAN is not difficult to implement, our intuition is sound from the perspective of image-to-image translation and the improvement is significant. We also find that the classification accuracy of our model for ZSL is higher than the results for UZSIT in Table 1. This is because UZSIT is more challenging than ZSL: it needs to generate images that properly fuse the domain-specific features with domain-invariant features so that they look like real target images.
We also show the t-SNE [23] visualization of domain-specific features extracted by and on unseen domains of FLO in Figure 5. We observe that: (1) Both Figure 5(a) and Figure 5(b) show clear clusters for different domains, which indicates that the and indeed learn to generalize to unseen domains; (2) The patterns of domain-specific features extracted by and are highly consistent with each other. For example, the samples of the 5th domain (green color) in Figure 5(a) are mixed with some samples of the 17th domain, and the same phenomenon is observed in Figure 5(b). This indicates that and indeed learn to map the visual images and attributes to the same semantic space.
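The two GZSL quantities can be computed as in the following sketch: the per-class accuracy first averages within each class and then across classes, so rare classes weigh as much as frequent ones, and the harmonic mean combines the unseen and seen accuracies. The function names and the toy predictions are hypothetical.

```python
from collections import defaultdict


def per_class_top1(predictions, labels):
    """Average of the per-class top-1 accuracies."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label in zip(predictions, labels):
        total[label] += 1
        correct[label] += int(pred == label)
    return sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(u, s):
    """Harmonic mean of unseen accuracy u and seen accuracy s."""
    if u + s == 0:
        return 0.0
    return 2 * u * s / (u + s)


# Toy example: class 0 has three samples (two correct), class 1 has one (correct).
acc = per_class_top1([0, 0, 1, 1], [0, 0, 0, 1])
h = harmonic_mean(0.5, 1.0)
```

The harmonic mean is dominated by the smaller of its two inputs, so a model that does well on seen domains but poorly on unseen ones still receives a low score.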
4.3 Analyzing Different Influence Factors of ZstGAN
4.3.1 Influence of domain-specific features losses
The main difference between existing image translation models and our ZstGAN is that we can extract domain-specific features that can be transferred from seen domains to unseen domains. This advantage is mainly built on introducing adversarial and classification losses for joint optimization. We therefore first investigate the influence of the following two aspects on ZstGAN.
ZstGAN-CLS. To verify the effectiveness of classification losses for domain-specific features, we remove the loss for optimization and loss for optimization.
ZstGAN-GAN. To verify the effectiveness of adversarial learning for domain-specific features, we remove the loss for and optimization.
The classification accuracy on FLO is reported in Figure 6. We observe a large accuracy drop for both ZstGAN-CLS and ZstGAN-GAN. In particular, A-ZstGAN-GAN’s classification accuracy is much lower than V-ZstGAN-GAN’s, because the domain-specific features from images and attributes are no longer aligned without adversarial learning.
4.3.2 Influence of Seen Domains
To investigate how the number of seen domains influences the performance of zero-shot image translation on unseen domains, we train ZstGAN with different numbers of seen domains on FLO and show the classification accuracy results on unseen domains in Table 5. As we can see, with the decrease of , the classification accuracy of translation results also decreases. Such results are not surprising, since image translation on unseen domains is based on the semantic representation of seen domains. If the semantic representation learned from seen domains is not adequate to represent the semantic information of unseen domains, the translation model may fail to translate an image to the target domain.
To verify that the generality of our model is not limited to the unseen domains given by specific datasets, we interpolate domain-specific features generated by images or texts from unseen domains for image translation. Specifically, given two conditional images, we linearly interpolate between their domain-specific features and combine them with the domain-invariant features of the input. A similar operation is used for interpolating conditional texts. The results of domain-specific feature interpolation are shown in Figure 7. We observe that our model can produce continuous translations through variation of domain-specific features from both images and texts. This indicates that (1) our model indeed learns to generalize to unseen domains, which are not only the discrete ones given by specific datasets but can also form a continuous space covering the whole semantic representation; (2) our model learns to disentangle the domain-specific features and domain-invariant features, since the domain-invariant features, such as leaves in the background, remain almost unchanged for different domain-specific features.
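The interpolation step itself is simple and can be sketched as below, with feature vectors stored as plain lists. The function name `interpolate_features` and the two-dimensional toy features are hypothetical; in practice each interpolated vector would be passed to the generator together with the domain-invariant features of the input image.

```python
def interpolate_features(feat_a, feat_b, num_steps):
    """Return num_steps feature vectors moving linearly from feat_a to feat_b."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    path = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # t runs from 0.0 to 1.0 inclusive
        path.append([(1 - t) * a + t * b for a, b in zip(feat_a, feat_b)])
    return path


# Toy domain-specific features of two conditional inputs.
path = interpolate_features([0.0, 2.0], [1.0, 0.0], num_steps=3)
```

The first and last entries reproduce the two original feature vectors, and the intermediate entries correspond to the blended translations.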
In this paper, we propose the Unsupervised Zero-Shot Image-to-image Translation (UZSIT) problem, which aims to generalize image translation models from seen domains to unseen domains, and we propose ZstGAN to address it. ZstGAN models each seen/unseen domain using a domain-specific feature distribution conditioned on domain attributes, disentangles domain-invariant features from domain-specific features and combines them for image generation. Experiments show that our ZstGAN can effectively tackle zero-shot image translation on the CUB and FLO datasets. In addition, we show that ZstGAN achieves much better performance than state-of-the-art ZSL and GZSL methods on the CUB and FLO datasets.
For future work, there are many interesting directions. First, it is interesting to design better models with better understanding of UZSIT. Second, achieving zero-shot translation without attributes is also valuable. Third, we may generalize the UZSIT to other relevant fields, such as domain adaptation and neural machine translation.
- [1] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
- [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
- [3] S. Benaim and L. Wolf. One-shot unsupervised cross domain translation. arXiv preprint arXiv:1806.06029, 2018.
- [4] M. Bucher, S. Herbin, and F. Jurie. Generating visual representations for zero-shot classification. In International Conference on Computer Vision (ICCV) Workshops: TASK-CV: Transferring and Adapting Source Knowledge in Computer Vision, 2017.
- [5] W. Chao, S. Changpinyo, B. Gong, and F. Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, pages 52–68, 2016.
- [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
- [7] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- [8] R. Felix, V. B. Kumar, I. Reid, and G. Carneiro. Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37, 2018.
- [9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- [10] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
- [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [12] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc., 2017.
- [13] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-to-image translation. In ECCV, 2018.
- [14] P. Isola, J. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, July 2017.
- [15] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1857–1865, 2017.
- [16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [17] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [18] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 951–958. IEEE, 2009.
- [19] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, 2018.
- [20] J. Lin, Y. Xia, S. Liu, T. Qin, Z. Chen, and J. Luo. Exploring explicit domain supervision for latent space disentanglement in unpaired image-to-image translation. arXiv preprint arXiv:1902.03782, 2019.
- [21] J. Lin, Y. Xia, T. Qin, Z. Chen, and T.-Y. Liu. Conditional image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5524–5532, July 2018.
- [22] Y. Long, L. Liu, F. Shen, L. Shao, and X. Li. Zero-shot learning using synthesised unseen visual data with diffusion regularisation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2498–2512, 2018.
- [23] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
- [24] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2017.
- [25] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on, pages 722–729. IEEE, 2008.
- [26] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
- [27] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58, 2016.
- [28] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161, 2015.
- [29] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943, 2013.
- [30] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 935–943. Curran Associates, Inc., 2013.
- [31] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
- [32] D. Wang, Y. Li, Y. Lin, and Y. Zhuang. Relational knowledge transfer for zero-shot learning. In AAAI, volume 2, page 7, 2016.
- [33] W. Wang, Y. Pu, V. K. Verma, K. Fan, Y. Zhang, C. Chen, P. Rai, and L. Carin. Zero-shot learning via class-conditioned deep generative models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [34] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- [35] Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- [36] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.