Unsupervised Image-to-Image Translation Using Domain-Specific Variational Information Bound
Unsupervised image-to-image translation is a class of computer vision problems that aims to model the conditional distribution of images in the target domain, given a set of unpaired images in the source and target domains. An image in the source domain might have multiple representations in the target domain. Therefore, ambiguity arises in modeling the conditional distribution, especially when the images in the source and target domains come from different modalities. Current approaches mostly rely on simplifying assumptions to map both domains into a shared-latent space. Consequently, they are only able to model the domain-invariant information between the two modalities. These approaches usually fail to model domain-specific information which has no representation in the target domain. In this work, we propose an unsupervised image-to-image translation framework which maximizes a domain-specific variational information bound and learns the target domain-invariant representation of the two domains. The proposed framework makes it possible to map a single source image into multiple images in the target domain, utilizing several target domain-specific codes sampled randomly from the prior distribution, or extracted from reference images.
Hadi Kazemi email@example.com Sobhan Soleymani firstname.lastname@example.org Fariborz Taherkhani email@example.com Seyed Mehdi Iranmanesh firstname.lastname@example.org Nasser M. Nasrabadi email@example.com West Virginia University Morgantown, WV 26505
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
1 Introduction
Image-to-image translation is the major goal for many computer vision problems, such as sketch to photo-realistic image translation, style transfer, inpainting missing image regions, colorization of grayscale images [11, 32], and super-resolution. If corresponding image pairs are available in both source and target domains, these problems can be studied in a supervised setting. For years, researchers have made great efforts to solve this problem employing classical methods, such as superpixel-based segmentation. More recently, frameworks such as conditional Generative Adversarial Networks (cGAN), Style and Structure Generative Adversarial Network (S-GAN), and VAE-GAN have been proposed to address the problem of supervised image-to-image translation. However, in many real-world applications, collecting paired training data is laborious and expensive. Therefore, in many applications, there are only a few paired images available or no paired images at all. In this case, only independent sets of images in each domain, with no correspondence in the other domain, can be used to learn the cross-domain image translation task. Unsupervised image-to-image translation is difficult, since there are no paired samples guiding how an image should be translated into a corresponding image in the other domain. It is nevertheless more desirable than the supervised setting, because paired images are scarce and two independent image sets are convenient to collect. As a result, in this paper, we focus on the design of a framework for unsupervised image-to-image translation.
The key challenge in cross-domain image translation is learning the conditional distribution of images in the target domain. In the unsupervised setting, this conditional distribution should be learned using two independent image sets. Previous works in the literature mostly consider a shared-latent space, in which they assume that images from two domains can be mapped into a low-dimensional shared-latent space [37, 20]. However, this assumption does not hold when the two domains represent different modalities, since some information in one modality might have no representation in the other modality. For example, in the case of sketch to photo-realistic image translation, color and texture information have no interpretable meaning in the sketch domain. In other words, each sketch can be mapped into several photo-realistic images. Accordingly, learning a single domain-invariant latent space under the aforementioned assumption [37, 20, 24] prevents the model from capturing domain-specific information. Therefore, a sketch can only be mapped into one of its corresponding photo-realistic images. In addition, since the current unsupervised techniques are mainly based on "cycle consistency" [20, 37], the translated image in the target domain may encode domain-specific information of the source domain (Figure 1). The encoded information can then be utilized to recover the source image again. This encoding can effectively degrade the performance and stability of the training process.
To address this problem, we remove the shared-latent space assumption, and learn a domain-specific space jointly with a domain-invariant space. Our proposed framework is based on Generative Adversarial Networks and Variational Autoencoders (VAEs), and models the conditional distribution of the target domain using VAE-GAN. Broadly speaking, two encoders map a source image into a pair of domain-invariant and source domain-specific codes. The domain-invariant code, in combination with a target domain-specific code sampled from a desired distribution, is fed to a generator which translates them into the corresponding target domain image. To reconstruct the source image at the end of the cycle, the extracted source domain-specific code is passed through a domain-specific path to the backward path, which starts from the translated target domain image.
To keep the shared and domain-specific information apart, we train the network to extract separate domain-specific and domain-invariant codes. The former is learned by maximizing its mutual information with the source domain while simultaneously minimizing the mutual information between this code and the translated image in the target domain. The mutual information maximization may also lead the domain-specific code to become an interpretable representation of the domain-specific information. These loss terms are crucial in the unsupervised framework, since domain-invariant information may also go through the domain-specific path to satisfy the cycle consistency in the backward path.
In this paper we extend CycleGAN  to learn a domain-specific code for each modality, through domain-specific variational information bound maximization, in addition to a domain-invariant code. Then, based on the proposed domain-specific learning scheme, we introduce a framework for one-to-many cross-domain image-to-image translation in an unsupervised setting.
2 Related Works
In the computer vision literature, the image generation problem has been tackled using autoregressive models [21, 29], restricted Boltzmann machines, and autoencoders. Recently, generative techniques have been proposed for image translation tasks. Models such as GANs [7, 34] and VAEs [23, 15] achieve impressive results in image generation. They are also utilized in the conditional setting [12, 38] to address the image-to-image translation problem. However, prior research has paid relatively little attention to the unsupervised setting [20, 37, 4].
Many state-of-the-art unsupervised image-to-image translation frameworks are developed based on the cycle-consistency constraint. Liu et al. showed that learning a shared-latent space between the images in source and target domains implies the cycle-consistency. The cycle-consistency constraint assumes that the source image can be reconstructed from the generated image, in the target domain, without any extra domain-specific information [20, 37]. In our experience, this assumption severely constrains the network and degrades the performance and stability of the training process when learning the translation between different modalities. In addition, this assumption limits the diversity of the images generated by the framework, i.e., the network associates a single target image with each source image. To tackle this problem, some prior works attempt to map a single image into multiple images in the target domain in a supervised setting [5, 3]. This problem is also addressed in  in an unsupervised setting. However, they have not considered any mechanisms to force their auxiliary latent variables to represent only the domain-specific information.
In this work, in contrast, we aim to learn distinct domain-specific and domain-invariant latent spaces in an unsupervised setting. The learned domain-specific code is supposed to represent the properties of the source image which have no representation in the target domain. To this end, we train our network by maximizing a domain-specific variational information bound to learn a domain-specific space.
3 Framework and Formulation
Our framework, as illustrated in Figure 2, is developed based on GAN  and VAE-GAN , and includes two generative adversarial networks; and . The encoder-generators, and , also constitute two VAEs. Inspired by the CycleGAN model , we train our network in two cycles; and , where and represent the source and target domains, respectively. (For simplicity, in the remainder of the paper, for each cycle, we use the terms input domain and output domain.) Each cycle consists of forward and backward paths. In each forward path, we translate an image from the input domain into its corresponding image in the output domain. In the backward path, we remap the generated image into the input domain and reconstruct the input image. In our formulation, rather than learning a single shared-latent space between the two domains, we propose to decompose the latent code, , into two parts: , which is the domain-invariant code between the two domains, and , which is the domain-specific code.
During the forward path in cycle, we simultaneously train two encoders, and , to map data samples from the input domain, , into a latent representation, . The input domain-invariant encoder, , maps the input image, , into the input domain-invariant code, . The input domain-specific encoder, , maps into the input domain-specific code, . Then, the domain-invariant code, , and a randomly sampled output domain-specific code, , are fed to the output generator (decoder), , to generate the corresponding representation of the input image, , in the output domain . Since in cycle the output domain-specific information is not available during the training phase, a prior, , is imposed over the domain-specific distribution which is selected as a unit normal distribution . Here, index in the codes' subscripts refers to the first cycle . We use the same notation for all the latent codes in the remainder of the paper.
The output discriminator, , is employed to enforce that the translated images, , resemble images in the output domain . The translated images should not be distinguishable from the real samples in . Therefore, we apply the adversarial loss  which is given by:
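As a hedged sketch only: with assumed symbols (none of them recovered from the text above) $x$ and $y$ for input- and output-domain images, $E^{c}$ for the input domain-invariant encoder, $G$ for the output generator, $D$ for the output discriminator, and $v \sim \mathcal{N}(0, I)$ for the sampled output domain-specific code, the standard GAN objective of Goodfellow et al. applied to this forward path would take the form:

```latex
\mathcal{L}_{\mathrm{GAN}}
  = \mathbb{E}_{y}\left[\log D(y)\right]
  + \mathbb{E}_{x,\; v \sim \mathcal{N}(0, I)}
    \left[\log\left(1 - D\left(G\left(E^{c}(x),\, v\right)\right)\right)\right]
```

In this form the discriminator is trained to maximize the objective, while the encoder and generator are trained to minimize it.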
Note that the domain-specific encoder outputs mean and variance vectors , which represent the distribution of the domain-specific code given by . Similar to the previous works on VAE , we assume that the domain-specific components of are conditionally independent and Gaussian with unit variance. We utilize the reparametrization trick  to train the VAE-GAN using back-propagation. We define the variational loss for the domain-specific VAE as follows:
where the Kullback–Leibler (KL) divergence term is a measure of how the distribution of domain-specific code, , diverges from the prior distribution. The conditional distribution is modeled as a Laplacian distribution, and therefore, minimizing the negative log-likelihood term is equivalent to minimizing the absolute distance between the input and its reconstruction.
In the backward path, the output domain-invariant encoder, , and the output domain-specific encoder, , map the generated image into the reconstructed domain-invariant code, , and the reconstructed domain-specific code, , respectively. The domain-specific encoder, , outputs mean and variance vectors which represent the distribution of the domain-specific code, , given by . Finally, the reconstructed input, , is generated by the output generator, , with and as its inputs. Here, is sampled from its distribution, , where is the output of in the forward path. We enforce a reconstruction criterion to force , and to be the reconstruction of , , and , respectively. To this end, the reconstruction loss is defined as follows:
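The ingredients of this variational loss can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the use of a log-variance parameterization, and the flat list-of-floats format are all assumptions made for the example.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Reparametrization trick: v = mu + sigma * eps with eps ~ N(0, I).

    The randomness is isolated in eps, so v stays a differentiable
    function of mu and log_var (assumed log-variance parameterization).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_unit_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def l1_reconstruction(x, x_rec):
    """Mean absolute error between an input and its reconstruction.

    Up to a scale factor and an additive constant, this is the negative
    log-likelihood of a Laplacian observation model.
    """
    return sum(abs(a - b) for a, b in zip(x, x_rec)) / len(x)
```

For example, `kl_to_unit_normal([0.0, 0.0], [0.0, 0.0])` is zero because the encoded distribution already equals the unit normal prior, and the penalty grows as the mean moves away from zero or the variance away from one.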
where , , and are the hyper-parameters to control the weight of each term in the loss function.
4 Domain-Specific Variational Information Bound
In the proposed model, we decompose the latent space, , into the domain-invariant and domain-specific codes. As mentioned in the previous section, the domain-invariant code should only capture the information shared between the two modalities, while the domain-specific code represents the information which has no interpretation in the output domain. Otherwise, all the information can go through the domain-specific path and satisfy the cycle-consistency property of the network ( and ). In this trivial solution, the generator, , can translate an input domain image into an output domain image that does not correspond to the input image, while satisfying the discriminator in terms of resembling the images in . Figure 7 (second row) presents images generated by this trivial solution.
Here, we propose an unsupervised method to learn the domain-specific information of the source data distribution which has minimum information about the target domain. To learn the source domain-specific code, , we propose to minimize the mutual information between and the target domain distribution, while simultaneously, we maximize the mutual information between and the source domain distribution. Similarly, the target domain-specific code is learned for target domain . In other words, to learn the source and target domain specific codes and , we should minimize the following loss function:
where represents the model parameters. To translate to an implementable loss function, we define the following two loss functions:
where and are implemented in cycles and , respectively.
Instead of minimizing , or similarly , we minimize their variational upper bounds, which we refer to as domain-specific variational information bounds. Zhao et al.  illustrated that using the KL divergence in VAEs results in the information preference problem, in which the mutual information between the latent code and the input becomes vanishingly small, while training the network using only the reconstruction loss, with no KL divergence term, maximizes the mutual information. However, some other types of divergences, such as MMD and Stein Variational Gradient, do not suffer from this problem. Consequently, in this paper, for , to maximize we can replace the first term in (2) with Maximum-Mean Discrepancy (MMD) , which always prefers to maximize mutual information between and . The MMD is a framework which utilizes all of the moments to quantify the distance between two distributions. It can be implemented using the kernel trick as follows:
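For reference, the kernel form of the squared MMD between two distributions, as it is commonly written, is given below. The symbols $p$ and $q$ (the two distributions over codes), $z$ and $\tilde{z}$ (samples from them), and $k$ (the kernel) are assumptions of this sketch rather than notation recovered from the text:

```latex
\mathrm{MMD}^{2}(p, q)
  = \mathbb{E}_{z, z' \sim p}\left[k(z, z')\right]
  - 2\,\mathbb{E}_{z \sim p,\; \tilde{z} \sim q}\left[k(z, \tilde{z})\right]
  + \mathbb{E}_{\tilde{z}, \tilde{z}' \sim q}\left[k(\tilde{z}, \tilde{z}')\right]
```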
where is any universal positive definite kernel, such as Gaussian . Consequently, we rewrite the VAE objective in Equation (2) as follows:
Since is tractable but difficult to compute, we define variational approximations to it as . Similar to , is defined as a fixed -dimensional spherical Gaussian, , where is the dimension of . This upper-bound in combination with the MMD forms a domain-specific variational information bound. Note that MMD does not optimize an upper-bound to the negative log likelihood directly, but it guarantees the mutual information to be maximized and we can expect a high log likelihood performance . To translate this upper-bound, , to an implementable loss function in the model, we use the following empirical data distribution approximation:
Therefore, the upper bound can be approximated as:
Since and , the implementable upper-bound, , can be approximated as follows:
As illustrated in Figure 1(b), we train the cycle starting from an image . All the components in this cycle share weights with the corresponding components in cycle. Similar losses, , , , and , can be defined for this cycle. The overall loss for the network is defined as:
5 Implementation
We adopt the architecture for our common latent encoder, generator, and discriminator networks from Zhu, Park et al. . The domain-invariant encoders include two stride-2 convolutions and three residual blocks . The generators consist of three residual blocks and two transposed convolutions with stride 2. The domain-specific encoders share the first two convolution layers with their corresponding domain-invariant encoders, followed by five stride-2 convolutions. Since the spatial size of the domain-specific codes does not match that of their corresponding domain-invariant codes, we tile them to the same size as the domain-invariant codes and then concatenate them to create the generators' inputs. For the discriminator networks we use PatchGAN networks [19, 12], which classify whether overlapping image patches are real or fake. We use the Adam optimizer  for online optimization with a learning rate of 0.0002. For reconstruction loss in (3), we set and . The values of and in (12) are set to 1, and the . Finally, regarding the kernel parameter in (6), as discussed in , MMD is fairly robust to this parameter selection, and using is a practical value in most scenarios, where is the dimension of .
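As an illustration of the MMD term whose kernel parameter is discussed above, the following minimal Python sketch estimates the squared MMD between two sets of code samples with a Gaussian kernel. The function names, the list-of-vectors sample format, and the default bandwidth `sigma=1.0` are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math


def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased sample estimate of the squared MMD between xs and ys.

    xs and ys are lists of equal-length vectors, e.g. encoded
    domain-specific codes and samples drawn from the prior.
    """
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return (mean_kernel(xs, xs)
            - 2.0 * mean_kernel(xs, ys)
            + mean_kernel(ys, ys))
```

In a training loop, `xs` would typically hold a mini-batch of encoded codes and `ys` an equally sized batch drawn from the unit normal prior. The estimate is zero when both sets coincide and grows as the two sample sets move apart.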
6 Experiments
Our experiments aim to show that an interpretable representation can be learned by the domain-specific variational information bound maximization. Visual results on translation tasks show how the domain-specific code can alter the style of generated images in a new domain. We compare our method against baselines both qualitatively and quantitatively.
6.1 Qualitative Evaluation
We use two datasets for qualitative comparison, edges handbags  and edges shoes . Figures 2(a) and 2(b) present the comparison between the proposed framework and baseline image-to-image translation algorithms: CycleGAN , UNIT , and BicycleGAN . Our framework, similar to BicycleGAN, can be utilized to generate multiple realistic images for a single input, while it does not require any supervision. In contrast, CycleGAN and UNIT learn one-to-one mappings as they learn only one domain-invariant latent code between the two modalities. In our experience, training CycleGAN and UNIT on edges photos datasets is very unstable and sensitive to the parameters. Figure 1 illustrates how CycleGAN encodes information about textures and colors in the generated image in the edge domain. This information encoding enables the discriminator to easily distinguish the fake generated samples from the real ones, which results in instability in the training of the generators.
Three other datasets, namely architectural labels photos from the CMP Facade database , and CUHK Face Sketch Dataset (CUFS)  are employed for more qualitative evaluation. The image-to-image translation results for the proposed framework are presented in Figures 3(d) and 3(c) for these datasets, respectively. Our method successfully captures domain-specific properties of the target domain. Therefore, we are able to generate diverse images from a single input sample. More results for edges shoes and edges handbags datasets are presented in Figures 3(a) and 3(b), respectively. These figures present one-to-many image translation when different domain-specific codes are deployed. The results for the backward path for edges handbags and edges shoes are also presented in Figure 3(e). Since there is no extra information in the edge domain, the generated edges are quite similar to each other regardless of the value of the edge domain-specific code.
Using the learned domain-specific code, we can transfer domain-specific properties from a reference image in the output domain to the generated image. Instead of sampling from the distribution of the output domain-specific code, we feed the reference image to the output domain-specific encoder and extract its domain-specific code. The extracted code can then be used for image translation guided by the reference image. Figure 6 shows the results using domain-specific codes extracted from multiple reference images to translate edges into realistic photos. Finally, Figure 5 illustrates some failure cases, where some domain-specific codes do not result in well-defined styles.
6.2 Quantitative Evaluation
Table 1 presents the quantitative comparison between the proposed framework and three state-of-the-art models. Similar to BicycleGAN , we perform a quantitative analysis of the diversity using the Learned Perceptual Image Patch Similarity (LPIPS) metric . The LPIPS distance is calculated as the average distance between 2000 pairs of randomly generated output images, in the deep feature space of a pre-trained AlexNet . Diversity scores for different techniques using the LPIPS metric are summarized in Table 1. Note that the diversity score is not defined for one-to-one frameworks, e.g., CycleGAN and UNIT. Previous findings showed that these models are not able to generate large output variation, even by noise injection [12, 38]. The diversity scores of our proposed framework are close to those of BicycleGAN, while we do not have any supervision during the training phase.
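The averaging procedure behind such a diversity score can be sketched as follows. This is an illustration under stated assumptions: `feature_distance` is a plain Euclidean stand-in for the learned LPIPS distance (which requires a pre-trained network), and the function names, the `seed` argument, and the feature-vector input format are hypothetical.

```python
import math
import random


def feature_distance(a, b):
    """Stand-in for LPIPS: Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diversity_score(samples, num_pairs=2000, distance=feature_distance, seed=0):
    """Average distance over randomly drawn pairs of distinct generated outputs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        # Draw two different indices so an output is never paired with itself.
        i, j = rng.sample(range(len(samples)), 2)
        total += distance(samples[i], samples[j])
    return total / num_pairs
```

Passing a real perceptual metric as `distance` would turn this into the kind of score reported in Table 1; with the Euclidean stand-in it only demonstrates the pairing and averaging.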
Generating unnatural images usually results in a high diversity score. Therefore, to investigate whether the variation of generated images is meaningful, we need to evaluate the visual realism of the generated samples as well. As proposed in [32, 37], the "fooling" rate of human subjects is considered as the visual realism score of each framework. We sequentially presented a real and a generated image to a human for 1 second each, in a random order, asked them to identify the fake, and measured the fooling rate. We also used the Fréchet Inception Distance (FID) to evaluate the quality of generated images . It directly measures the distance between the synthetic data distribution and the real data distribution. To calculate FID, images are encoded with visual features from a pre-trained Inception model. Note that a lower FID value indicates a smaller distance between the synthetic and real data distributions. Table 1 shows that the FID results confirm the results from the fooling rate. We calculate the FID over 10k randomly generated samples.
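For reference, FID is the Fréchet distance between two Gaussians fitted to the Inception features of real and generated images. With assumed symbols $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ for the feature mean and covariance of the real and generated sets, respectively, it is usually written as:

```latex
\mathrm{FID}
  = \left\lVert \mu_r - \mu_g \right\rVert_2^{2}
  + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)
```

The first term compares the feature means and the second compares the covariances, so the distance is zero only when both statistics match.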
6.3 Discussion and Ablation Study
Our framework learns a disentangled representation of content and style, which gives users more control over the image translation outputs. This framework is not only suitable for image-to-image translation, but can also be used to transfer style between the images of a single domain. Compared with other unsupervised one-to-one image-to-image translation frameworks, i.e., CycleGAN and UNIT, our method handles translation between significantly different domains. In contrast, CycleGAN encodes the domain-specific codes to satisfy the cycle-consistency (see Figure 1). UNIT also completely fails as it cannot find a shared representation in these cases.
Neglecting the minimization of the mutual information between target domain-specific information and the source domain may result in capturing attributes with high variation in the target despite their common nature in both domains. For example, as illustrated in Figure 7, the domain-specific code can result in altering attributes such as gender or face structure, while these attributes are domain-invariant properties of the two modalities. In addition, removing the domain-specific code cycle-consistency criterion (e.g. ) results in a partial mode collapse in the model, with many outputs being almost identical, which reduces the LPIPS (see Table 2). Without the domain-invariant code cycle-consistency criterion (e.g. ), the image quality is unsatisfactory. A possible reason for quality degradation is that can include the domain-specific information as there is no constraint on it to represent shared information exclusively. That results in the same issue as explained in Figure 1. Very small values for cause the second term in in (5) to be neglected. Therefore, the domain-specific code, , will be irrelevant in the loss minimization and the learned domain-specific code could be meaningless. In contrast, with very large values of , carries the domain-specific information of the as well.
7 Conclusion
In this paper, we introduced a framework for one-to-many cross-domain image-to-image translation in an unsupervised setting. In contrast to the previous works, our approach learns a distinct domain-specific code for each of the two modalities, maximizing a domain-specific variational information bound. In addition, it learns a domain-invariant code. During the training phase, a unit normal distribution is imposed over the domain-specific latent distribution, which lets us control the domain-specific properties of the generated image in the output domain. To generate diverse target domain images, we extract domain-specific codes from reference images, or sample them from a prior distribution. These domain-specific codes, combined with the learned domain-invariant code, result in target domain images with different target domain-specific properties.
References
- [1] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
- [2] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented CycleGAN: Learning Many-to-Many mappings from unpaired data. arXiv preprint arXiv:1802.10151, 2018.
- [3] A. Bansal, Y. Sheikh, and D. Ramanan. Pixelnn: Example-based image synthesis. arXiv preprint arXiv:1708.05349, 2017.
- [4] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1(2):7, 2017.
- [5] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. Computer Vision (ICCV), 2017 IEEE International Conference on, pages 1520–1529, 2017.
- [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
- [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
- [10] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- [11] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4):110, 2016.
- [12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [13] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. European Conference on Computer Vision, pages 694–711, 2016.
- [14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [15] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- [16] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
- [17] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
- [18] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint, 2016.
- [19] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. European Conference on Computer Vision, pages 702–716, 2016.
- [20] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. Advances in Neural Information Processing Systems, pages 700–708, 2017.
- [21] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
- [22] C. Peng, X. Gao, N. Wang, and J. Li. Superpixel-based face sketch–photo synthesis. IEEE Transactions on Circuits and Systems for Video Technology, 27(2):288–299, 2017.
- [23] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent gaussian models. International Conference on Machine Learning, 2, 2014.
- [24] A. Royer, K. Bousmalis, S. Gouws, F. Bertsch, I. Moressi, F. Cole, and K. Murphy. Xgan: Unsupervised image-to-image translation for many-to-many mappings. arXiv preprint arXiv:1711.05139, 2017.
- [25] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2, 2017.
- [26] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Colorado University at Boulder Department of Computer Science, 1986.
- [27] X. Tang and X. Wang. Face sketch synthesis and recognition. Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 687–694, 2003.
- [28] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. German Conference on Pattern Recognition, pages 364–374, 2013.
- [29] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
- [30] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. European Conference on Computer Vision, pages 318–335, 2016.
- [31] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 192–199, 2014.
- [32] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. European Conference on Computer Vision, pages 649–666, 2016.
- [33] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924, 2018.
- [34] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
- [35] S. Zhao, J. Song, and S. Ermon. Infovae: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.
- [36] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. European Conference on Computer Vision, pages 597–613, 2016.
- [37] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
- [38] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. Advances in Neural Information Processing Systems, pages 465–476, 2017.
- [39] F. Zohrizadeh, M. Kheirandishfard, and F. Kamangar. Image segmentation using sparse subset selection. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1470–1479, March 2018.