Unsupervised Image-to-Image Translation Using Domain-Specific Variational Information Bound
Abstract
Unsupervised image-to-image translation is a class of computer vision problems which aims at modeling the conditional distribution of images in the target domain, given a set of unpaired images in the source and target domains. An image in the source domain may have multiple representations in the target domain. Therefore, ambiguity arises in modeling the conditional distribution, especially when the images in the source and target domains come from different modalities. Current approaches mostly rely on simplifying assumptions that map both domains into a shared-latent space. Consequently, they are only able to model the domain-invariant information between the two modalities. These approaches usually fail to model domain-specific information, which has no representation in the target domain. In this work, we propose an unsupervised image-to-image translation framework which maximizes a domain-specific variational information bound and learns the target domain-invariant representation of the two domains. The proposed framework makes it possible to map a single source image into multiple images in the target domain, utilizing several target domain-specific codes sampled randomly from the prior distribution, or extracted from reference images.
Hadi Kazemi hakazemi@mix.wvu.edu Sobhan Soleymani ssoleyma@mix.wvu.edu Fariborz Taherkhani fariborztaherkhani@gmail.com Seyed Mehdi Iranmanesh seiranmanesh@mix.wvu.edu Nasser M. Nasrabadi nasser.nasrabadi@mail.wvu.edu West Virginia University Morgantown, WV 26505
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
1 Introduction
Image-to-image translation is the major goal for many computer vision problems, such as sketch to photo-realistic image translation [25], style transfer [13], inpainting missing image regions [12], colorization of grayscale images [11, 32], and super-resolution [18]. If corresponding image pairs are available in both source and target domains, these problems can be studied in a supervised setting. For years, researchers [22] have made great efforts to solve this problem employing classical methods, such as superpixel-based segmentation [39]. More recently, frameworks such as conditional Generative Adversarial Networks (cGAN) [12], the Style and Structure Generative Adversarial Network (S²-GAN) [30], and VAE-GAN [17] have been proposed to address the problem of supervised image-to-image translation. However, in many real-world applications, collecting paired training data is laborious and expensive [37], so often only a few paired images, or none at all, are available. In this case, only independent sets of images in each domain, with no correspondence in the other domain, can be deployed to learn the cross-domain image translation task. Unsupervised image-to-image translation is more difficult, since there are no paired samples guiding how an image should be translated into a corresponding image in the other domain; nevertheless, it is often the only practical option, given how much easier it is to collect two independent image sets than paired ones. As a result, in this paper, we focus on the design of a framework for unsupervised image-to-image translation.
The key challenge in cross-domain image translation is learning the conditional distribution of images in the target domain. In the unsupervised setting, this conditional distribution must be learned using two independent image sets. Previous works in the literature mostly assume a shared-latent space, in which images from the two domains can be mapped into a low-dimensional common representation [37, 20]. However, this assumption does not hold when the two domains represent different modalities, since some information in one modality might have no representation in the other. For example, in the case of sketch to photo-realistic image translation, color and texture information have no interpretable meaning in the sketch domain. In other words, each sketch can be mapped into several photo-realistic images. Accordingly, learning a single domain-invariant latent space under the aforementioned assumption [37, 20, 24] prevents the model from capturing domain-specific information, so a sketch can only be mapped into one of its corresponding photo-realistic images. In addition, since current unsupervised techniques are built mainly on "cycle consistency" [20, 37], the translated image in the target domain may encode domain-specific information of the source domain (Figure 1). The encoded information can then be utilized to recover the source image again. This encoding can effectively degrade the performance and stability of the training process.
To address this problem, we remove the shared-latent space assumption and learn a domain-specific space jointly with a domain-invariant space. Our proposed framework is based on Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), and models the conditional distribution of the target domain using a VAE-GAN. Broadly speaking, two encoders map a source image into a pair of domain-invariant and source domain-specific codes. The domain-invariant code, in combination with a target domain-specific code sampled from a desired distribution, is fed to a generator which translates them into the corresponding target domain image. To reconstruct the source image at the end of the cycle, the extracted source domain-specific code is passed through a dedicated domain-specific path into the backward path, alongside the translated target domain image.
To separate the shared and domain-specific information, we train the network to extract two distinct codes: a domain-specific code and a domain-invariant code. The former is learned by maximizing its mutual information with the source domain while simultaneously minimizing the mutual information between this code and the translated image in the target domain. The mutual information maximization may also encourage the domain-specific code to form an interpretable representation of the domain-specific information [6]. These loss terms are crucial in the unsupervised framework, since domain-invariant information could otherwise pass through the domain-specific path to satisfy the cycle consistency in the backward path.
In this paper, we extend CycleGAN [37] to learn a domain-specific code for each modality, through domain-specific variational information bound maximization, in addition to a domain-invariant code. Then, based on the proposed domain-specific learning scheme, we introduce a framework for one-to-many cross-domain image-to-image translation in an unsupervised setting.
2 Related Works
In the computer vision literature, the image generation problem has been tackled using autoregressive models [21, 29], restricted Boltzmann machines [26], and autoencoders [10]. Recently, generative techniques have been proposed for image translation tasks. Models such as GANs [7, 34] and VAEs [23, 15] achieve impressive results in image generation. They have also been utilized in a conditional setting [12, 38] to address the image-to-image translation problem. However, in prior research, relatively little attention has been given to the unsupervised setting [20, 37, 4].
Many state-of-the-art unsupervised image-to-image translation frameworks are developed based on the cycle-consistency constraint [37]. Liu et al. [20] showed that learning a shared-latent space between the images in the source and target domains implies cycle-consistency. The cycle-consistency constraint assumes that the source image can be reconstructed from the generated image in the target domain without any extra domain-specific information [20, 37]. In our experience, this assumption severely constrains the network and degrades the performance and stability of the training process when learning a translation between different modalities. In addition, this assumption limits the diversity of the generated images, i.e., the network associates a single target image with each source image. To tackle this problem, some prior research attempts to map a single image into multiple images in the target domain in a supervised setting [5, 3]. This problem is also addressed in [2] in an unsupervised setting. However, these works do not include any mechanism to force their auxiliary latent variables to represent only the domain-specific information.
In this work, in contrast, we aim to learn distinct domain-specific and domain-invariant latent spaces in an unsupervised setting. The learned domain-specific code is supposed to represent the properties of the source image which have no representation in the target domain. To this end, we train our network by maximizing a domain-specific variational information bound to learn a domain-specific space.
3 Framework and Formulation
Our framework, as illustrated in Figure 2, is developed based on GAN [30] and VAE-GAN [17], and includes two generative adversarial networks, one per translation direction. The corresponding encoder-generator pairs also constitute two VAEs. Inspired by the CycleGAN model [37], we train our network in two cycles, one starting from the source domain and one starting from the target domain (for simplicity, in the remainder of the paper, we use the terms input domain and output domain for each cycle). Each cycle consists of forward and backward paths. In the forward path, we translate an image from the input domain into its corresponding image in the output domain. In the backward path, we remap the generated image into the input domain and reconstruct the input image. In our formulation, rather than learning a single shared-latent space between the two domains, we propose to decompose the latent code into two parts: a domain-invariant code, shared between the two domains, and a domain-specific code.
During the forward path of the first cycle, we simultaneously train two encoders to map data samples from the input domain into a latent representation. The input domain-invariant encoder maps the input image into the input domain-invariant code, while the input domain-specific encoder maps it into the input domain-specific code. Then, the domain-invariant code, together with a randomly sampled output domain-specific code, is fed to the output generator (decoder), which generates the corresponding representation of the input image in the output domain. Since the output domain-specific information is not available during the training phase of this cycle, a prior is imposed over the domain-specific distribution, selected as a unit normal distribution. The indices in the codes' subscripts refer to the cycle; we use the same notation for all the latent codes in the remainder of the paper.
The output discriminator is employed to enforce that the translated images resemble images in the output domain; the translated images should not be distinguishable from the real samples. Therefore, we apply the adversarial loss [30], which is given by:
(1)  $\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}_{x_2 \sim p(x_2)}\left[\log D_2(x_2)\right] + \mathbb{E}_{x_1 \sim p(x_1),\, z_{s_2} \sim \mathcal{N}(0,I)}\left[\log\left(1 - D_2\!\left(G_2(z_c(x_1), z_{s_2})\right)\right)\right]$
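Numerically, this adversarial objective reduces to binary cross-entropy on the discriminator's scores. The following is a minimal NumPy sketch, not the paper's implementation; the function name and the non-saturating generator variant are illustrative assumptions:

```python
import numpy as np

def adversarial_losses(d_real, d_fake):
    # Discriminator maximizes log D(x) + log(1 - D(G(z))); equivalently
    # it minimizes the negated sum. d_real/d_fake are scores in (0, 1).
    d_loss = -(np.log(d_real).mean() + np.log(1.0 - d_fake).mean())
    # Non-saturating generator loss: maximize log D(G(z)).
    g_loss = -np.log(d_fake).mean()
    return d_loss, g_loss

# A discriminator that separates real from translated images well
# yields a small d_loss and a large g_loss, and vice versa.
print(adversarial_losses(np.array([0.9, 0.8]), np.array([0.1, 0.2])))
```

A discriminator confidently fooled by the generator drives both losses in the generator's favor, which is the equilibrium the adversarial game pushes toward.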
Note that the domain-specific encoder outputs mean and variance vectors, which parameterize the distribution of the domain-specific code given the input image. Similar to previous works on VAEs [15], we assume that the components of the domain-specific code are conditionally independent and Gaussian with unit variance. We utilize the reparametrization trick [15] to train the VAE-GAN using backpropagation. We define the variational loss for the domain-specific VAE as follows:
(2)  $\mathcal{L}_{\mathrm{VAE}} = \lambda_1\, \mathrm{KL}\!\left(q(z_{s_1}\,|\,x_1)\,\|\,p(z)\right) - \lambda_2\, \mathbb{E}_{q(z_{s_1}|x_1)}\!\left[\log p(x_1\,|\,z_{s_1}, z_c)\right]$
where the Kullback–Leibler (KL) divergence term is a measure of how the distribution of the domain-specific code diverges from the prior distribution. The conditional distribution is modeled as a Laplacian distribution; therefore, minimizing the negative log-likelihood term is equivalent to minimizing the absolute distance between the input and its reconstruction.
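For a diagonal Gaussian code and a unit normal prior, the KL term has a closed form, and the reparametrization trick makes sampling differentiable. A minimal NumPy sketch; the function names, batch size, and 8-dimensional code are illustrative assumptions:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    # summed over code dimensions, averaged over the batch.
    kl_per_dim = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return kl_per_dim.sum(axis=1).mean()

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I), so the sample stays
    # differentiable with respect to mu and log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.zeros((4, 8))       # batch of 4 domain-specific codes, 8-dim each
log_var = np.zeros((4, 8))  # unit variance everywhere
print(kl_to_standard_normal(mu, log_var))      # 0.0: already matches the prior
print(reparameterize(mu, log_var, rng).shape)  # (4, 8)
```

The KL term vanishes exactly when the encoder's output distribution equals the prior, and grows as the code drifts away from it.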
In the backward path, the output domain-invariant encoder and the output domain-specific encoder map the generated image into the reconstructed domain-invariant code and the reconstructed domain-specific code, respectively. The output domain-specific encoder again outputs mean and variance vectors which parameterize the distribution of the reconstructed domain-specific code. Finally, the reconstructed input is generated by the input-domain generator from the reconstructed domain-invariant code together with an input domain-specific code sampled from the distribution produced by the input domain-specific encoder in the forward path. We enforce a reconstruction criterion on the input image, the domain-invariant code, and the output domain-specific code. To this end, the reconstruction loss is defined as follows:
(3)  $\mathcal{L}_{\mathrm{rec}} = \alpha_1\, \|x_1 - \hat{x}_1\|_1 + \alpha_2\, \|z_c - \hat{z}_c\|_1 + \alpha_3\, \|z_{s_2} - \hat{z}_{s_2}\|_1$
where the hyperparameters control the weight of each term in the loss function.
4 Domain-Specific Variational Information Bound
In the proposed model, we decompose the latent space into domain-invariant and domain-specific codes. As mentioned in the previous section, the domain-invariant code should only capture the information shared between the two modalities, while the domain-specific code represents the information which has no interpretation in the output domain. Otherwise, all the information can go through the domain-specific path and satisfy the cycle-consistency property of the network. In this trivial solution, the generator can translate an input domain image into an output domain image that does not correspond to the input image, while still satisfying the discriminator in terms of resembling the images in the output domain. Figure 7 (second row) presents images generated by this trivial solution.
Here, we propose an unsupervised method to learn the domain-specific information of the source data distribution which carries minimal information about the target domain. To learn the source domain-specific code, we propose to minimize the mutual information between this code and the target domain distribution, while simultaneously maximizing the mutual information between this code and the source domain distribution. The target domain-specific code is learned similarly. In other words, to learn the source and target domain-specific codes, we should minimize the following loss function:
(4)  $\mathcal{L}_{\mathrm{info}} = I(z_{s_1}; X_2) - I(z_{s_1}; X_1) + I(z_{s_2}; X_1) - I(z_{s_2}; X_2)$
where the mutual information terms depend on the model parameters. To translate (4) into an implementable loss function, we define the following two loss functions:
(5)  $\mathcal{L}^{1}_{\mathrm{info}} = I(z_{s_1}; X_2) - \lambda\, I(z_{s_1}; X_1), \qquad \mathcal{L}^{2}_{\mathrm{info}} = I(z_{s_2}; X_1) - \lambda\, I(z_{s_2}; X_2)$
where the first and second losses are implemented in the first and second cycles, respectively.
Instead of minimizing these losses directly, we minimize their variational upper bounds, which we refer to as domain-specific variational information bounds. Zhao et al. [35] illustrated that using the KL-divergence in VAEs results in the information preference problem, in which the mutual information between the latent code and the input becomes vanishingly small, while training the network using only the reconstruction loss, with no KL-divergence term, maximizes the mutual information. However, some other types of divergences, such as MMD and Stein Variational Gradient, do not suffer from this problem. Consequently, in this paper, to maximize the mutual information between the domain-specific code and its own domain, we replace the first term in (2) with the Maximum Mean Discrepancy (MMD) [35], which always prefers to maximize the mutual information between the latent code and the input. The MMD is a framework which utilizes all of the moments to quantify the distance between two distributions. It can be implemented using the kernel trick as follows:
(6)  $\mathrm{MMD}(q\,\|\,p) = \mathbb{E}_{p(z),\,p(z')}\!\left[k(z, z')\right] - 2\,\mathbb{E}_{q(z),\,p(z')}\!\left[k(z, z')\right] + \mathbb{E}_{q(z),\,q(z')}\!\left[k(z, z')\right]$
where $k(\cdot,\cdot)$ is any universal positive definite kernel, such as the Gaussian kernel $k(z, z') = e^{-\|z - z'\|^2 / (2\sigma^2)}$. Consequently, we rewrite the VAE objective in Equation (2) as follows:
(7)  $\mathcal{L}_{\mathrm{VAE\text{-}MMD}} = \lambda_1\, \mathrm{MMD}\!\left(q(z_{s_1})\,\|\,p(z)\right) - \lambda_2\, \mathbb{E}_{q(z_{s_1}|x_1)}\!\left[\log p(x_1\,|\,z_{s_1}, z_c)\right]$
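The MMD term can be estimated from minibatches of encoder codes and prior samples. The following NumPy sketch uses a Gaussian kernel with a fixed bandwidth; the sample sizes, 8-dimensional code, and bandwidth are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased sample estimate of MMD^2 between the distributions of x and y:
    # E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')].
    return (gaussian_kernel(x, x, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

rng = np.random.default_rng(0)
codes = rng.standard_normal((256, 8))  # stand-in for encoder outputs
prior = rng.standard_normal((256, 8))  # samples from the N(0, I) prior
# Matching distributions give a much smaller MMD^2 than a shifted one:
print(mmd2(codes, prior) < mmd2(codes, prior + 3.0))  # True
```

Because the estimate only needs samples from the two distributions, it can penalize a mismatch with the prior without the information preference problem the KL term suffers from [35].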
Following the method described in [1], to minimize the first term of the information loss in (5), we define an upper bound for it as:
(8)  $I(z_{s_1}; X_2) \le \mathbb{E}_{p(x_2,\, z_{s_1})}\!\left[\log \frac{p(z_{s_1}\,|\,x_2)}{r(z_{s_1})}\right]$
Since the conditional distribution of the code is tractable but its marginal is difficult to compute, we define a variational approximation $r(z)$ to the marginal. Similar to [1], $r(z)$ is defined as a fixed $K$-dimensional spherical Gaussian, $r(z) = \mathcal{N}(z\,|\,0, I_K)$, where $K$ is the dimension of the domain-specific code. This upper bound, in combination with the MMD, forms the domain-specific variational information bound. Note that the MMD does not optimize an upper bound to the negative log-likelihood directly, but it guarantees that the mutual information is maximized, and we can expect a high log-likelihood performance [35]. To translate this upper bound into an implementable loss function in the model, we use the following empirical approximation of the data distribution:
(9)  $p(x_2) \approx \frac{1}{N}\sum_{n=1}^{N} \delta\!\left(x_2 - x_2^{(n)}\right)$
Therefore, the upper bound can be approximated as:
(10)  $I(z_{s_1}; X_2) \lesssim \frac{1}{N}\sum_{n=1}^{N} \mathrm{KL}\!\left(p\!\left(z_{s_1}\,\middle|\,x_2^{(n)}\right) \middle\|\; r(z_{s_1})\right)$
Since the conditional distribution of the code is Gaussian, with mean and variance produced by the encoder, and $r(z)$ is a standard Gaussian, the implementable upper bound can be computed in closed form as follows:
(11)  $\frac{1}{2N}\sum_{n=1}^{N}\sum_{j=1}^{K}\left(\mu_j^2\!\left(x_2^{(n)}\right) + \sigma_j^2\!\left(x_2^{(n)}\right) - \log \sigma_j^2\!\left(x_2^{(n)}\right) - 1\right)$
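The inequality underlying this derivation — that the expected KL between the conditional code distribution and any fixed marginal $r(z)$ upper-bounds the mutual information, with equality when $r(z)$ equals the true marginal — can be verified numerically on a small discrete joint distribution (the random 3×4 joint below is an arbitrary illustrative example):

```python
import numpy as np

def mutual_information(p_joint):
    # I(X; Z) = sum_{x,z} p(x,z) * log( p(x,z) / (p(x) p(z)) )
    px = p_joint.sum(axis=1, keepdims=True)
    pz = p_joint.sum(axis=0, keepdims=True)
    return (p_joint * np.log(p_joint / (px * pz))).sum()

def variational_bound(p_joint, r_z):
    # E_x[ KL( p(z|x) || r(z) ) ] -- an upper bound on I(X; Z) for ANY r(z).
    px = p_joint.sum(axis=1)
    p_z_given_x = p_joint / px[:, None]
    kl = (p_z_given_x * np.log(p_z_given_x / r_z[None, :])).sum(axis=1)
    return (px * kl).sum()

rng = np.random.default_rng(0)
p = rng.random((3, 4))
p /= p.sum()                                    # a random joint p(x, z)
mi = mutual_information(p)
loose = variational_bound(p, np.full(4, 0.25))  # r(z) uniform: valid bound
tight = variational_bound(p, p.sum(axis=0))     # r(z) = true marginal
print(loose >= mi)                # True: any r(z) gives an upper bound
print(np.isclose(tight, mi))      # True: equality at the true marginal
```

The gap between the bound and the true mutual information is exactly the KL divergence between the true code marginal and $r(z)$, which is why a spherical Gaussian $r(z)$ is a safe, if loose, choice.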
As illustrated in Figure 1(b), we train the second cycle starting from an image in the target domain. All the components in this cycle share weights with the corresponding components in the first cycle, and analogous adversarial, VAE, reconstruction, and upper-bound losses can be defined for it. The overall loss for the network is defined as:
(12)  $\mathcal{L} = \sum_{i=1}^{2}\left(\lambda_{g}\,\mathcal{L}^{i}_{\mathrm{GAN}} + \lambda_{v}\,\mathcal{L}^{i}_{\mathrm{VAE}} + \mathcal{L}^{i}_{\mathrm{rec}} + \lambda_{u}\,\mathcal{L}^{i}_{\mathrm{ub}}\right)$
5 Implementation
We adopt the architecture for our common latent encoder, generator, and discriminator networks from Zhu et al. [37]. The domain-invariant encoders include two stride-2 convolutions and three residual blocks [8]. The generators consist of three residual blocks and two stride-2 transposed convolutions. The domain-specific encoders share the first two convolution layers with their corresponding domain-invariant encoders, followed by five stride-2 convolutions. Since the spatial size of the domain-specific codes does not match that of their corresponding domain-invariant codes, we tile them to the same size as the domain-invariant codes and then concatenate them to create the generators' inputs. For the discriminator networks we use PatchGAN networks [19, 12], which classify whether overlapping image patches are real or fake. We use the Adam optimizer [14] with a learning rate of 0.0002. The term weights of the reconstruction loss in (3) are fixed, and the weights in (12) are set to 1. Finally, regarding the kernel parameter in (6), as discussed in [35], MMD is fairly robust to this parameter selection, and a bandwidth determined by the dimension of the domain-specific code is a practical value in most scenarios.
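The tiling-and-concatenation step described above can be sketched as follows; the channel counts and code dimension in this NumPy example are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

def combine_codes(z_inv, z_spec):
    # z_inv:  (B, C, H, W) domain-invariant feature map
    # z_spec: (B, D)       flat domain-specific code
    # Tile z_spec over the H x W grid, then concatenate along channels,
    # giving a (B, C + D, H, W) input for the generator.
    b, _, h, w = z_inv.shape
    tiled = np.broadcast_to(z_spec[:, :, None, None],
                            (b, z_spec.shape[1], h, w))
    return np.concatenate([z_inv, tiled], axis=1)

z_inv = np.zeros((2, 256, 64, 64))  # e.g. domain-invariant encoder output
z_spec = np.ones((2, 8))            # e.g. an 8-dim domain-specific code
print(combine_codes(z_inv, z_spec).shape)  # (2, 264, 64, 64)
```

Tiling broadcasts the same style vector to every spatial location, so each position in the generator's input carries both local content features and the global domain-specific code.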
6 Experiments
Our experiments aim to show that an interpretable representation can be learned by maximizing the domain-specific variational information bound. Visual results on translation tasks show how the domain-specific code can alter the style of generated images in a new domain. We compare our method against baselines both qualitatively and quantitatively.
6.1 Qualitative Evaluation
We use two datasets for qualitative comparison: edges→handbags [36] and edges→shoes [31]. Figures 2(a) and 2(b) present the comparison between the proposed framework and baseline image-to-image translation algorithms: CycleGAN [37], UNIT [20], and BicycleGAN [38]. Our framework, similar to BicycleGAN, can be utilized to generate multiple realistic images for a single input, while requiring no supervision. In contrast, CycleGAN and UNIT learn one-to-one mappings, as they learn only one domain-invariant latent code between the two modalities. In our experience, training CycleGAN and UNIT on edges→photos datasets is very unstable and sensitive to the parameters. Figure 1 illustrates how CycleGAN encodes information about textures and colors in the generated image in the edge domain. This information encoding enables the discriminator to easily distinguish the fake generated samples from the real ones, which results in instability in the training of the generators.
Additional datasets, namely architectural labels→photos from the CMP Facade database [28] and the CUHK Face Sketch Dataset (CUFS) [27], are employed for further qualitative evaluation. The image-to-image translation results of the proposed framework on these datasets are presented in Figures 3(d) and 3(c), respectively. Our method successfully captures domain-specific properties of the target domain; therefore, we are able to generate diverse images from a single input sample. More results for the edges→shoes and edges→handbags datasets are presented in Figures 3(a) and 3(b), respectively. These figures present one-to-many image translation when different domain-specific codes are deployed. The results for the backward path on edges→handbags and edges→shoes are also presented in Figure 3(e). Since there is no extra information in the edge domain, the generated edges are quite similar to each other regardless of the value of the edge domain-specific code.
Using the learned domain-specific code, we can transfer domain-specific properties from a reference image in the output domain to the generated image. To this end, instead of sampling from the distribution of the output domain-specific code, we feed a reference image from the output domain to the output domain-specific encoder and extract its domain-specific code. The extracted code can then be used for image translation guided by the reference image. Figure 6 shows the results of using domain-specific codes extracted from multiple reference images to translate edges into realistic photos. Finally, Figure 5 illustrates some failure cases, where some domain-specific codes do not result in well-defined styles.
6.2 Quantitative Evaluation
Table 1: Quantitative comparison on edges→shoes and edges→handbags: LPIPS diversity (higher is more diverse), human fooling rate in % (higher is more realistic), and FID (lower is better).

                 Edges→Shoes                  Edges→Handbags
Method           LPIPS    Fool (%)   FID      LPIPS    Fool (%)   FID
Real Images      0.290    –          –        0.369    –          –
UNIT             –        22.0       90.32    –        19.2       84.36
CycleGAN         –        24.3       86.54    –        25.9       81.22
BicycleGAN       0.113    38.0       43.18    0.134    34.9       37.79
Ours             0.121    36.0       48.36    0.129    33.2       40.84
Table 1 presents the quantitative comparison between the proposed framework and three state-of-the-art models. Similar to BicycleGAN [38], we perform a quantitative analysis of diversity using the Learned Perceptual Image Patch Similarity (LPIPS) metric [33]. The LPIPS distance is calculated as the average distance between 2000 pairs of randomly generated output images in the deep feature space of a pretrained AlexNet [16]. Diversity scores for the different techniques are summarized in Table 1. Note that the diversity score is not meaningful for one-to-one frameworks, e.g., CycleGAN and UNIT; previous findings showed that these models are not able to generate large output variation, even with noise injection [12, 38]. The diversity scores of our proposed framework are close to those of BicycleGAN, even though we use no supervision during the training phase.
Generating unnatural images usually results in a high diversity score. Therefore, to investigate whether the variation of the generated images is meaningful, we need to evaluate the visual realism of the generated samples as well. As proposed in [32, 37], the "fooling" rate of human subjects is considered as the visual realism score of each framework: we sequentially presented a real and a generated image to human subjects for 1 second each, in a random order, asked them to identify the fake, and measured the fooling rate. We also used the Fréchet Inception Distance (FID) [9] to evaluate the quality of the generated images. It directly measures the distance between the synthetic data distribution and the real data distribution. To calculate the FID, images are encoded with visual features from a pretrained Inception model; a lower FID value indicates a smaller distance between the synthetic and real data distributions. Table 1 shows that the FID results confirm the fooling-rate results. We calculate the FID over 10k randomly generated samples.
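The FID compares two Gaussians fitted to deep features of real and generated images. The full metric requires a matrix square root of the covariances; the sketch below simplifies to diagonal covariances, and the random vectors stand in for Inception activations — both simplifications are assumptions of this illustration, not the evaluation protocol used above:

```python
import numpy as np

def fid_diagonal(feats_a, feats_b):
    # Frechet distance between Gaussians fitted to two feature sets,
    # restricted to diagonal covariances:
    # ||mu_a - mu_b||^2 + sum_i (sqrt(v_a_i) - sqrt(v_b_i))^2
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    v_a, v_b = feats_a.var(axis=0), feats_b.var(axis=0)
    return (((mu_a - mu_b) ** 2).sum()
            + ((np.sqrt(v_a) - np.sqrt(v_b)) ** 2).sum())

rng = np.random.default_rng(0)
real = rng.standard_normal((5000, 64))
close = rng.standard_normal((5000, 64))          # same distribution as real
shifted = rng.standard_normal((5000, 64)) + 0.5  # mismatched distribution
print(fid_diagonal(real, real))                  # 0.0
print(fid_diagonal(real, close) < fid_diagonal(real, shifted))  # True
```

Identical feature sets give a distance of exactly zero, and a distribution shift in the generated features shows up directly in the mean term, which is why the FID tracks the fooling-rate ranking.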
6.3 Discussion and Ablation Study
Our framework learns a disentangled representation of content and style, which provides users more control over the image translation outputs. This framework is not only suitable for image-to-image translation; it can also be used to transfer style between images of a single domain. Compared with other unsupervised one-to-one image-to-image translation frameworks, i.e., CycleGAN and UNIT, our method handles translation between significantly different domains. In contrast, CycleGAN encodes the domain-specific information to satisfy cycle-consistency (see Figure 1), and UNIT fails completely, as it cannot find a shared representation in these cases.
Neglecting the minimization of the mutual information between the target domain-specific code and the source domain may result in capturing attributes with high variation in the target domain despite their common nature in both domains. For example, as illustrated in Figure 7, the domain-specific code can then alter attributes such as gender or face structure, although these attributes are domain-invariant properties of the two modalities. In addition, removing the domain-specific code cycle-consistency criterion results in a partial mode collapse, with many outputs being almost identical, which reduces the LPIPS score (see Table 2). Without the domain-invariant code cycle-consistency criterion, the image quality is unsatisfactory. A possible reason for this quality degradation is that the domain-invariant code can include domain-specific information, as there is no constraint forcing it to represent shared information exclusively; this leads to the same issue as explained in Figure 1. Very small values of the weight on the second term of the information loss in (5) cause that term to be neglected, so the domain-specific code becomes irrelevant in the loss minimization and the learned code can be meaningless. In contrast, with very large values of this weight, the domain-specific code carries the shared information of the input as well.
Table 2: LPIPS diversity scores with (w/) and without (w/o) the domain-specific code cycle-consistency criterion.

           shoes              handbags
           w/      w/o        w/      w/o
LPIPS      0.121   0.095      0.129   0.113
7 Conclusion
In this paper, we introduced a framework for one-to-many cross-domain image-to-image translation in an unsupervised setting. In contrast to previous works, our approach learns a distinct domain-specific code for each of the two modalities by maximizing a domain-specific variational information bound, in addition to learning a domain-invariant code. During the training phase, a unit normal distribution is imposed over the domain-specific latent distribution, which lets us control the domain-specific properties of the generated image in the output domain. To generate diverse target domain images, we extract domain-specific codes from reference images, or sample them from the prior distribution. These domain-specific codes, combined with the learned domain-invariant code, result in target domain images with different target domain-specific properties.
References
 [1] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
 [2] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented CycleGAN: Learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151, 2018.
 [3] A. Bansal, Y. Sheikh, and D. Ramanan. PixelNN: Example-based image synthesis. arXiv preprint arXiv:1708.05349, 2017.
 [4] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixellevel domain adaptation with generative adversarial networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1(2):7, 2017.
 [5] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. Computer Vision (ICCV), 2017 IEEE International Conference on, pages 1520–1529, 2017.
 [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
 [7] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in neural information processing systems, pages 2672–2680, 2014.
 [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
 [10] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
 [11] S. Iizuka, E. SimoSerra, and H. Ishikawa. Let there be color!: joint endtoend learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4):110, 2016.
 [12] P. Isola, J.Y. Zhu, T. Zhou, and A. A. Efros. Imagetoimage translation with conditional adversarial networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [13] J. Johnson, A. Alahi, and L. FeiFei. Perceptual losses for realtime style transfer and superresolution. European Conference on Computer Vision, pages 694–711, 2016.
 [14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [15] D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [16] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
 [17] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
 [18] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photorealistic single image superresolution using a generative adversarial network. arXiv preprint, 2016.
 [19] C. Li and M. Wand. Precomputed realtime texture synthesis with Markovian generative adversarial networks. European Conference on Computer Vision, pages 702–716, 2016.
 [20] M.Y. Liu, T. Breuel, and J. Kautz. Unsupervised imagetoimage translation networks. Advances in Neural Information Processing Systems, pages 700–708, 2017.
 [21] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 [22] C. Peng, X. Gao, N. Wang, and J. Li. Superpixelbased face sketch–photo synthesis. IEEE Transactions on Circuits and Systems for Video Technology, 27(2):288–299, 2017.
 [23] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent gaussian models. International Conference on Machine Learning, 2, 2014.
 [24] A. Royer, K. Bousmalis, S. Gouws, F. Bertsch, I. Mosseri, F. Cole, and K. Murphy. XGAN: Unsupervised image-to-image translation for many-to-many mappings. arXiv preprint arXiv:1711.05139, 2017.
 [25] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays. Scribbler: Controlling deep image synthesis with sketch and color. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2, 2017.
 [26] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Colorado University at Boulder Department of Computer Science, 1986.
 [27] X. Tang and X. Wang. Face sketch synthesis and recognition. Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 687–694, 2003.
 [28] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. German Conference on Pattern Recognition, pages 364–374, 2013.
 [29] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
 [30] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. European Conference on Computer Vision, pages 318–335, 2016.
 [31] A. Yu and K. Grauman. Finegrained visual comparisons with local learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 192–199, 2014.
 [32] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. European Conference on Computer Vision, pages 649–666, 2016.
 [33] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924, 2018.
 [34] J. Zhao, M. Mathieu, and Y. LeCun. Energybased generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
 [35] S. Zhao, J. Song, and S. Ermon. Infovae: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.
 [36] J.Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. European Conference on Computer Vision, pages 597–613, 2016.
 [37] J.Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired imagetoimage translation using cycleconsistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
 [38] J.Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal imagetoimage translation. Advances in Neural Information Processing Systems, pages 465–476, 2017.
 [39] F. Zohrizadeh, M. Kheirandishfard, and F. Kamangar. Image segmentation using sparse subset selection. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1470–1479, March 2018.