GANHopper: Multi-Hop GAN for Unsupervised Image-to-Image Translation
We introduce GANHopper, an unsupervised image-to-image translation network that transforms images gradually between two domains, through multiple hops. Instead of executing a translation directly, we steer the translation by requiring the network to produce in-between images that resemble weighted hybrids between images from the two input domains. Our network is trained on unpaired images from the two domains only, without any in-between images. All hops are produced using a single generator along each direction. In addition to the standard cycle-consistency and adversarial losses, we introduce a new hybrid discriminator, which is trained to classify the intermediate images produced by the generator as weighted hybrids, with weights based on a predetermined hop count. We also introduce a smoothness term to constrain the magnitude of each hop, further regularizing the translation. Compared to previous methods, GANHopper excels at image translations involving domain-specific image features and geometric variations while also preserving non-domain-specific features such as backgrounds and general color schemes.
1 Introduction
Unsupervised image-to-image translation has been one of the most intensively studied problems in computer vision since the introduction of the domain transfer network (DTN), CycleGAN, DualGAN, and UNIT in 2017. While these networks and many follow-ups were designed to perform general-purpose translations, it is challenging for the translator to learn transformations beyond local and stylistic adjustments, such as geometry and shape variations. For example, typical dog-cat translations learned by CycleGAN do not transform the animals in terms of geometric facial features; only pixel-scale color or texture alterations take place.
When the source and target domains exhibit sufficiently large discrepancies, any proper translation function is expected to be complex and difficult to learn. Without any paired images to supervise the learning process, the search space for the translation functions can be immense. With large image changes, there are even more degrees of freedom to account for. In such cases, a more constrained and steerable search would be desirable.
In this paper, we introduce an unsupervised image-to-image translator that is constrained to transform images gradually between two domains, e.g., cats and dogs. Instead of performing the transformation directly, our translator executes the task in steps, called hops. Our multi-hop network is built on CycleGAN. However, we steer the translation paths by forcing the network to produce in-between images which resemble weighted hybrids between images from the two input domains. For example, a four-hop network for dog-to-cat translation produces three in-between images: the first is 25% cat-like and 75% dog-like, the second is 50/50, and the third is 75% cat-like and 25% dog-like. The fourth and final hop is a 100% translated cat.
Our network, GANHopper, is unsupervised and trained on unpaired images from two input domains, without any in-between hybrid images in its training set. Equally important, all hops are produced using a single generator along each direction, so the network has no more capacity than CycleGAN. To make training possible, we introduce a new hybrid discriminator, which is trained exclusively on real images (e.g., dogs or cats) to evaluate the in-between images by classifying them as weighted hybrids, depending on the prescribed hop count. In addition to the original cycle-consistency and adversarial losses from CycleGAN, we introduce two new losses: a hybrid loss to assess the degree to which an image belongs to one of the input domains, and a smoothness loss which further regulates the image transitions to ensure that a generated image in the hop sequence does not deviate much from the preceding image.
GANHopper does not merely transform an input cat into a dog — many dogs can fool the discriminator. Rather, it aims to generate the dog which looks most similar to the given cat; see Figure 1. Compared to previous unsupervised image-to-image translation networks, our network excels at image translations involving domain-specific image features and geometric variations (i.e., “what makes a dog a dog?”) while preserving non-domain-specific image features such as background and general color schemes, e.g., the fur color of the input cat in Figure 1.
The ability to produce large changes (in particular, geometry transformations) via unsupervised domain translation has been a hotly pursued problem. There appears to be a common belief that the original CycleGAN/DualGAN architecture cannot learn geometry variations and must be modified at the feature-representation or training-approach level. As a result, many approaches resort to latent-space translations, e.g., with style-content or scale separation and feature disentanglement. Our work challenges this assumption, as GANHopper follows fundamentally the same architecture as CycleGAN, working directly in image space; it merely enforces a gradual, multi-hop translation to steer and regulate the image transitions.
2 Related Work
The foundation of modern image-to-image translation is the U-Net architecture, first developed for semantic image segmentation. This architecture was later extended with conditional adversarial training to a variety of image-to-image translation tasks. Further improvements led to the generation of higher-resolution outputs and of multiple possible outputs for the same image in “one-to-many” translation tasks, e.g. grayscale image colorization.
The above methods require paired input and output images as training data. A more recent class of image-to-image translation networks is capable of learning from only unpaired data in the form of two sets X and Y of input and output images, respectively [25, 22, 10]. These methods jointly train a network G to map from X to Y and a network F to map from Y to X, enforcing at training time that F(G(x)) ≈ x and G(F(y)) ≈ y. Such cycle consistency is thought to regularize the learned mappings to be semantically meaningful, rather than arbitrary translations.
While the above approaches succeed at domain translations involving low-level appearance shifts (e.g. summer to winter, day to night), they often fail when the translation requires a significant shape deformation (e.g. cat to dog). Cycle-consistent translators have been shown to perform larger shape changes when trained with a discriminator and perceptual loss function that consider more global image context. An alternative approach is to interpose a shared latent code z from which images in both domains are generated (i.e. x = G_X(z) and y = G_Y(z)). This method can also be extended to enable translation into multiple output images. Another tactic is to explicitly and separately model geometry vs. appearance in the translation process. A domain-specific method for translating human faces to caricature sketches accomplishes this by detecting facial landmarks, deforming them, and then using them to warp the input face. More recent work has proposed a related technique that is not specific to faces. Finally, it is also possible to perform domain translation via the feature hierarchy of a pre-trained image classification network. This method can also produce large shape changes.
In contrast to the above, we show that direct image-to-image translation can produce large shape changes, while also preserving appearance details, if translation is performed in a sequence of smooth hops. This process can be viewed as producing an interpolation sequence between two domains. Many GANs can produce interpolations between images via linear interpolation in their latent space. These interpolations can even be along interpretable directions which are either specified in the dataset or automatically inferred. However, GAN latent space interpolation does not perform cross-domain interpolation. We are aware of one other work which performs cross-domain interpolation by identifying corresponding points on images from two domains and using these points as input to standard image morphing approaches. However, that approach requires images in both the source and target domains to interpolate between, whereas our method takes just a source image and produces the interpolation to the best-matching image in the target domain.
3 Method
Let X and Y denote our source and target image domains, respectively. Our goal is to learn a transformation T that, given an image x ∈ X, outputs another image y = T(x) such that y is perceived to be the counterpart of the image x in the dataset Y. The same must be achieved with the analogous transformation from Y to X. This task is identical to that performed by CycleGAN. However, we do not translate the input image in one pass through the network. Rather, we facilitate the translation process via a sequence of intermediate images. We introduce the concept of a hop, which we define as the process of warping one image toward the target domain by a limited amount using a generator network. Repeated hops produce hybrid images as byproducts of the translation process.
Since we do not translate images in a single pass through a network, our training process must be modified from the traditional cycle-consistent learning framework. In particular, the generation of hybrid images during the translation is a challenge, because the training data does not include such images. Therefore, the hybridness of these generated images must be estimated on the fly during training. To this end, we introduce a new discriminator, which we call the hybrid discriminator, whose objective is to evaluate how similar an image is to both input domains, generating a membership score. We also add a new smoothness term to the loss, whose purpose is to encourage a gradual warping of the images through the hops so that the generator does not overshoot the translation. The following subsections present our multi-hop framework and expand on these two key new elements.
3.1 Multi-hop framework
Our model consists of the original two generators from CycleGAN, denoted by G_XY and G_YX, and three discriminators, two of which are CycleGAN’s original adversarial discriminators D_X and D_Y. The third discriminator is the new hybrid discriminator D_H. Figure 2 depicts how these different generators and discriminators work together during training time to translate images via multiple hops.
A hop is defined as using either G_XY or G_YX to warp an image towards the domain Y or X, respectively. A full translation is achieved by performing n hops using the same generator, where n is a user-defined value. For instance, if n = 4, a full translation of an image x ∈ X is y = G_XY(G_XY(G_XY(G_XY(x)))), and symmetrically for an image y ∈ Y using G_YX. Given an image x_0 = x ∈ X, the translation hops are defined via the recurrence relation x_k = G_XY(x_{k−1}) for k = 1, …, n (and analogously y_k = G_YX(y_{k−1}) for the reverse direction).
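The hop recurrence can be sketched in a few lines of Python. The generator here is a stand-in function on scalar image stand-ins, not the actual convolutional network:

```python
def multi_hop_translate(x, generator, n_hops):
    """Apply the same generator n_hops times, keeping every in-between image.

    `generator` is a stand-in for the trained G_XY network; here it may be
    any function mapping an image to an image.
    """
    hops = [x]
    for _ in range(n_hops):
        hops.append(generator(hops[-1]))
    return hops  # [x_0, x_1, ..., x_n]

# Toy "generator": nudges a scalar image stand-in 25% of the way toward 1.0.
nudge = lambda v: v + 0.25 * (1.0 - v)

sequence = multi_hop_translate(0.0, nudge, 4)
```

Each element of `sequence` corresponds to one in-between image; only the last is the fully translated output.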
We adopt the architecture and layer nomenclature originally proposed by Johnson et al. and used in CycleGAN. Let c7s1-k denote a 7×7 Convolution-InstanceNorm-ReLU layer with k filters and stride 1. dk denotes a 3×3 Convolution-InstanceNorm-ReLU layer with k filters and stride 2. Reflection padding was used to reduce artifacts. Rk denotes a residual block with two 3×3 convolutional layers, each with k filters. uk denotes a 3×3 TransposeConvolution-InstanceNorm-ReLU layer with k filters and stride 1/2. The network takes 128×128 images as input and consists of the following layers: c7s1-64, d128, d256, R256 (×12), u128, u64, c7s1-3.
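As a sanity check on the layer specification, the spatial resolution can be traced through the generator with the standard convolution and transposed-convolution output-size formulas. The padding values here are assumptions consistent with the nomenclature (reflection padding 3 for the 7×7 layers; padding 1 and output padding 1 for the stride-2 layers), not values stated in the text:

```python
def conv_out(size, k, s, p):
    # Standard convolution output-size formula.
    return (size + 2 * p - k) // s + 1

def tconv_out(size, k, s, p, out_pad):
    # Transposed-convolution output-size formula.
    return (size - 1) * s - 2 * p + k + out_pad

size = 128
size = conv_out(size, 7, 1, 3)       # c7s1-64 (reflection pad 3): 128 -> 128
size = conv_out(size, 3, 2, 1)       # d128: 128 -> 64
size = conv_out(size, 3, 2, 1)       # d256: 64 -> 32
for _ in range(12):                  # R256 blocks: 3x3/stride-1/pad-1 convs
    size = conv_out(size, 3, 1, 1)   # preserve the 32x32 size
size = tconv_out(size, 3, 2, 1, 1)   # u128: 32 -> 64
size = tconv_out(size, 3, 2, 1, 1)   # u64: 64 -> 128
size = conv_out(size, 7, 1, 3)       # c7s1-3: 128 -> 128
```

Under these assumptions the output resolution matches the 128×128 input, as required for image-to-image translation.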
For the discriminator networks D_X, D_Y, and D_H, we use the same 70×70 PatchGAN used in CycleGAN. Let Ck denote a 4×4 Convolution-InstanceNorm-LeakyReLU layer with k filters and stride 2. Unlike CycleGAN, we do not apply another convolution to produce a 1-dimensional output. Instead, given the 128×128 input images, we produce a 16×16 feature matrix. Each of its elements is associated with one of the 70×70 patches from the input image. The discriminator consists of the following layers: C64, C128, C256, C512.
3.2 Loss functions
The full loss function combines the reconstruction loss, adversarial loss, hybrid (domain) loss, and smoothness loss, denoted respectively as L_rec, L_adv, L_hyb, and L_smooth:
L = λ_rec L_rec + λ_adv L_adv + λ_hyb L_hyb + λ_smooth L_smooth.
We define the values of the weights in the objective function empirically.
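A minimal sketch of the weighted objective. The λ values below are hypothetical placeholders for illustration only, not the values used in our experiments:

```python
def total_loss(l_rec, l_adv, l_hyb, l_smooth, weights):
    """Weighted sum of the four loss terms; `weights` maps term name -> lambda."""
    return (weights["rec"] * l_rec + weights["adv"] * l_adv
            + weights["hyb"] * l_hyb + weights["smooth"] * l_smooth)

# Hypothetical weights for illustration; the real values are set empirically.
lam = {"rec": 10.0, "adv": 1.0, "hyb": 1.0, "smooth": 0.1}
loss = total_loss(0.5, 0.2, 0.3, 0.1, lam)
```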
Rather than enforcing cycle consistency between the input and output images, as in CycleGAN, we enforce it locally along every hop of our multi-hop translation. That is, G_YX should undo a single hop of G_XY and vice versa. We enforce this property via a loss proportional to the difference between x_{k−1} and G_YX(x_k) for all hops k (and symmetrically between y_{k−1} and G_XY(y_k)):
L_rec = Σ_k ( ‖G_YX(x_k) − x_{k−1}‖_1 + ‖G_XY(y_k) − y_{k−1}‖_1 ).
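The per-hop cycle-consistency term can be sketched as follows, with images represented as scalars and the generators as arbitrary functions (a toy stand-in, not the network code):

```python
def hop_cycle_loss(hops, inverse_generator):
    """L1 penalty that the inverse generator undoes each single hop.

    `hops` is the list [x_0, ..., x_n]; images are scalar stand-ins here.
    """
    loss = 0.0
    for k in range(1, len(hops)):
        reconstructed = inverse_generator(hops[k])
        loss += abs(reconstructed - hops[k - 1])  # L1 difference per hop
    return loss

# With a perfect inverse of the toy forward map v -> v/2, the loss vanishes.
forward = lambda v: v / 2.0
inverse = lambda v: v * 2.0
hops = [8.0, forward(8.0), forward(forward(8.0))]  # [8.0, 4.0, 2.0]
```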
The generator G_XY tries to generate images that look similar to images from domain Y, while D_Y aims to distinguish between these generated images and real images y ∈ Y. Note that “generated images” includes both final output images and in-between images. The discriminators use a least-squares formulation: D_Y minimizes (D_Y(y) − 1)^2 on real images and D_Y(x_k)^2 on generated hops, while the generator minimizes (D_Y(x_k) − 1)^2 (and symmetrically for D_X).
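The least-squares formulation reduces to simple squared errors on the discriminator scores; a minimal sketch:

```python
def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push scores on real images to 1 and on fakes to 0.
    return (d_real - 1.0) ** 2 + d_fake ** 2

def lsgan_g_loss(d_fake):
    # Generator: push the discriminator's score on a generated hop toward 1.
    return (d_fake - 1.0) ** 2
```

A perfectly fooled discriminator (score 1.0 on a fake) yields zero generator loss, and a perfect discriminator (1.0 on real, 0.0 on fake) yields zero discriminator loss.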
The hybrid term assesses the degree to which an image belongs to one of the two domains. For instance, if GANHopper is trained with n = 4 hops, we desire that the first hop be judged as belonging 25% to domain Y and 75% to domain X. Thus, we define the target hybridness score of hop k to be k/n; conversely, it is defined as 1 − k/n for the reverse hops. To encourage each hop to achieve its target hybridness, we penalize the distance between the target hybridness and the output of the hybrid discriminator on that hop. Since D_H is also trained to output 0 for ground-truth images in X and 1 for ground-truth images in Y (i.e. it is a binary domain classifier), an image for which D_H produces an output of 0.25 can be interpreted as an image which the classifier is 25% confident belongs to domain Y, precisely the behavior we desire:
L_hyb = Σ_k (D_H(x_k) − k/n)^2 + (D_H(y_k) − (1 − k/n))^2.
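The target schedule and the resulting penalty can be sketched directly (assuming a squared-distance penalty; `d_h_scores` stands in for the hybrid discriminator's outputs on hops 1..n):

```python
def hybrid_targets(n_hops, reverse=False):
    """Target hybridness for hops 1..n: k/n forward (X->Y), 1 - k/n reverse."""
    ts = [k / n_hops for k in range(1, n_hops + 1)]
    return [1.0 - t for t in ts] if reverse else ts

def hybrid_loss(d_h_scores, n_hops, reverse=False):
    # Squared distance between D_H's score and the target for each hop.
    return sum((s - t) ** 2
               for s, t in zip(d_h_scores, hybrid_targets(n_hops, reverse)))
```

For n = 4 the forward targets are 0.25, 0.5, 0.75, 1.0, matching the 25%/50%/75% hybrid schedule described above.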
The smoothness term penalizes the image-space difference between hop k and hop k − 1. This term encourages the hops to be individually as small as possible while still leading to a full translation when combined, which has a regularizing effect on the training:
L_smooth = Σ_k ( ‖x_k − x_{k−1}‖_1 + ‖y_k − y_{k−1}‖_1 ).
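A sketch of the smoothness term, again with scalar stand-ins for images:

```python
def smoothness_loss(hops):
    # Penalize the image-space (L1) difference between consecutive hops.
    return sum(abs(hops[k] - hops[k - 1]) for k in range(1, len(hops)))
```

Evenly spaced hops and a single large jump can accumulate the same total penalty, but combined with the adversarial and hybrid terms the loss favors many small, gradual hops.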
We train each network in GANHopper one hop at a time, i.e. for each image to be translated, we perform a single hop, update the weights of the generator and discriminator networks, perform the next hop, and so on. Training the network this way, rather than performing all hops and then doing a single weight update, has the advantage of requiring significantly less memory. The generator_update and discriminator_update procedures use a single term of the sums which define the loss (i.e. the term for hop k) to compute parameter gradients.
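The hop-at-a-time schedule can be sketched as follows; `generator`, `generator_update`, and `discriminator_update` are stand-in callables here, not the actual network and optimizer code:

```python
def train_step(x, generator, n_hops, generator_update, discriminator_update):
    """One training step, updating weights after every hop rather than once.

    The two update procedures consume (hop_index, previous_image, current_image)
    and would compute gradients from the hop-k term of each loss sum.
    """
    current = x
    for k in range(1, n_hops + 1):
        nxt = generator(current)
        generator_update(k, current, nxt)      # gradient step on hop-k terms
        discriminator_update(k, current, nxt)  # then update the discriminators
        current = nxt
    return current

# Toy check: a log records the interleaved per-hop update order.
updates = []
final = train_step(0, lambda v: v + 1, 3,
                   lambda k, a, b: updates.append(("g", k)),
                   lambda k, a, b: updates.append(("d", k)))
```

Because only one hop's activations are alive at a time, peak memory stays close to that of a single-pass translator.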
4 Results and Evaluation
Our network takes 128×128 images as input and outputs images at the same resolution. Experiments were performed on an NVIDIA GTX 1080 Ti (using batch size 6) and an NVIDIA Titan X (batch size 24). We trained GANHopper using the Adam optimizer. With the exception of the cat/human faces experiment, we trained all experiments for 100 epochs (cat/human mode-collapsed after 25 epochs, so we report the results from epoch 22).
In our experiments, we used combinations of seven datasets, translating between pairs of domains. Some translation pairs demand both geometric and texture changes.
We compare GANHopper with three prior approaches: CycleGAN, DiscoGAN, and GANimorph. All three are “unsupervised direct image-to-image translation” methods, in that they transform an input image from one domain into an output image from another domain without mediation by any shared latent variables and without any prior pairing of samples between the two domains. We trained these baselines on the aforementioned datasets with their public implementations and with default settings.
Quantitative evaluation of translation accuracy
We quantitatively evaluate dog/cat translation using two metrics (Figure 3). First, we compute the percentage of output pixels that are classified as belonging to the target domain by a pre-trained semantic segmentation network (DeepLabV3, trained on PASCAL VOC 2012). Second, we measure how well the output preserves salient features from the input using a perceptual similarity metric. CycleGAN produces outputs that best resemble the input but fails to perform domain translation. Our approach outperforms both GANimorph and DiscoGAN on both metrics: it is slightly better at domain translation and considerably better at preserving input features. This result indicates that one need not sacrifice domain-translation ability to preserve salient features of the input. Figure 4 shows how the percentage of pixels translated varies as a function of the number of hops performed. While not strictly linearly increasing, it is a smooth monotonic function, suggesting that our hybrid loss term successfully encourages in-between images which can be interpreted as domain hybrids. As shown in the supplementary material, our method also quantitatively outperforms the other methods on the human-to-cat dataset when both of the described metrics are considered.
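The first metric can be sketched as a simple pixel count over a segmentation map; the integer class-label convention below is an assumption for illustration, not DeepLabV3's actual output format:

```python
def target_domain_fraction(segmentation_labels, target_label):
    """Fraction of output pixels that a segmenter assigns to the target class.

    `segmentation_labels` is a 2D grid of per-pixel class labels (stand-in for
    the argmax of a semantic segmentation network's output).
    """
    flat = [label for row in segmentation_labels for label in row]
    return sum(1 for label in flat if label == target_label) / len(flat)

# Toy 2x2 map where class 1 is the target domain (e.g. "dog").
fraction = target_domain_fraction([[1, 1], [0, 1]], target_label=1)
```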
Figure 5 compares our method to the baselines on cat to dog and dog to cat translation. Our multi-hop procedure translates the input via a sequence of hybrid images (Figure 5(a)), allowing it to preserve key visual characteristics of the input if changing them is not necessary to achieve domain translation. For instance, fur colors and background textures are preserved in most cases (e.g. white cats map to white dogs) as is head orientation, while domain-specific features such as eyes, noses, and ears are appropriately deformed. The multi-hop procedure also allows control over how much translation to perform. The user can control the degree of “dogness” or “catness” introduced by the translation, including performing more hops than the network was trained on in order to exaggerate the characteristics of the target domain. Figure 5(b) shows the result of performing 8 hops using a network trained to perform only four. In the fifth row, the additional hops help to clarify the shape of the output dog’s tongue.
By contrast, the baselines produce less desirable results. CycleGAN preserves the input features too much, leading to incomplete translations (Figure 5(c)). Note that CycleGAN’s outputs often look similar to the first hop of our network; this makes sense, since each hop uses a CycleGAN-like generator network. Our network uses multiple hops of that same architecture to overcome CycleGAN’s original limitations. DiscoGAN (Figure 5(d)) can properly translate high-level properties such as head pose and eye placement but fails to preserve lower-level appearance details such as fur patterns and color. Its results are also often geometrically malformed (rows 2, 4, 5, 7, and 8). GANimorph (Figure 5(e)) produces images that are convincingly part of the target domain but preserve little of the input image’s features (typically only head pose). Note that all baselines produce outputs with noticeably decreased saturation and contrast, whereas our method preserves these properties.
Figure 6 shows a similar comparison on human to cat translation. Again, our method preserves input features well: facial structures stay roughly the same, and cats with light fur tend to generate blonde-haired people. Our method also preserves background details better than the baselines.
Impact of training hop count
We also examine the impact of the number of hops used during training. A network using too few hops must change the domain of the image more quickly; this causes the generator to “force” the translation and produce undesirable outputs. In the summer-to-winter translation of Figure 7 (top), the hiker’s jacket quickly loses its blue color in the first row (two hops) compared with the second row (four hops). In the winter-to-summer translation of Figure 7 (bottom), the lake incorrectly becomes green when using a two-hop network but is preserved with four hops (while vegetation is still converted to green). The results suggest that increasing the number of hops has the added benefit of increasing image diversity while also allowing for a smoother transition from one domain to another.
Impact of the smoothness term
Figure 8 demonstrates the impact of the smoothness weight when training dog-to-cat translation with 4 hops. A larger smoothness weight preserves the original fur patterns in the cat-to-dog translation and the sharpness of the image in the dog-to-cat translation. With a smaller weight, the network collapses to producing cats with gray and white fur and noticeably blurry dogs. Higher values also help preserve the input background textures.
As our method uses CycleGAN as a sub-component, it inherits some of the problems faced by that method, as well as other direct unpaired image translators. Figure 9 shows one prominent failure mode, in which the network “cheats” by erasing part of the object to be translated and replacing it with background (e.g. zebra legs). The smoothness term in our loss function penalizes differences between hops, so increasing its weight can help with this problem, but this issue remains unsolved in general.
5 Conclusion and Future Work
Unsupervised image-to-image translation is an ill-posed problem, and different methods have chosen different regularizing assumptions to define their solutions to it [21, 14, 6]. In this paper, we follow the cycle-consistency assumption of CycleGAN and DualGAN, while introducing the multi-hop paradigm to exert fine-grained control over the translation using a new hybrid discriminator. Compared to other approaches, our GANHopper network better preserves features of the input image while still applying the necessary transformations to create an output that clearly belongs to the target domain.
The meta idea of “transforming images in small steps” raises new questions worth exploring. For example, how many steps are ideal? The results in this paper used 2-4 hops, as more hops did not noticeably improve performance but did increase training time. However, some images in a domain are clearly harder than others to translate into a different domain (e.g. translating dogs with long vs. short snouts into cats). Can we automatically learn the ideal number of hops for each input image? Taken to an extreme, can we use a very large number of tiny hops to produce a smooth interpolation sequence from the source to the target domain? We also want to identify domains where GANHopper systematically fails and explore the design space of multi-hop translation architectures in response. For instance, while GANHopper uses the same network for all hops, it may be better to use different networks per hop (i.e. the optimal function for translating a 25% dog to a 50% dog may not be the same as the function for translating a 75% dog to a 100% dog). Another interesting direction is to combine GANHopper with ideas from MUNIT or BicycleGAN, so that the user can control the output of the translation via a “style” code while still preserving important input features (e.g. translating a white cat into different white-furred dog breeds). Finally, we would like to further investigate the idea that initially spurred the development of GANHopper: generating meaningful extrapolation sequences beyond the boundaries of a given image domain, to produce creative and novel outputs.
References
- (2018-07) Neural best-buddies: sparse cross-domain correspondence. ACM Trans. Graph. 37 (4). Cited by: §2.
- (2018) CariGANs: unpaired photo-to-caricature translation. Cited by: §2.
- (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587. Cited by: Figure 3, Figure 4, §4.
- (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Cited by: §2.
- (2018) Improving shape deformation in unsupervised image-to-image translation. CoRR abs/1808.04325. Cited by: §2, §4.
- (2018) Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §1, §2, §5, §5.
- (2016) Image-to-image translation with conditional adversarial networks. CoRR abs/1611.07004. Cited by: §2, §3.1.
- (2016) Perceptual losses for real-time style transfer and super-resolution. CoRR abs/1603.08155. Cited by: §3.1.
- (2019) Cross-domain cascaded deep feature translation. CoRR abs/1906.01526. Cited by: §2.
- (2017) Learning to discover cross-domain relations with generative adversarial networks. In ICML, Cited by: Figure 1, §2, §4.
- (2017) Fader networks: manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, Cited by: §2.
- (2014-09) Automating image morphing using structural similarity on a halfway domain. ACM Trans. Graph. 33 (5). Cited by: §2.
- (2012) Dog breed classification using part localization. In Proceedings of the 12th European Conference on Computer Vision - Volume Part I, ECCV’12, Berlin, Heidelberg, pp. 172–185. Cited by: 1st item.
- (2017) Unsupervised image-to-image translation networks. CoRR abs/1703.00848. Cited by: §1, §2, §5.
- (2014) Deep learning face attributes in the wild. CoRR abs/1411.7766. Cited by: 3rd item.
- (2017) Least squares generative adversarial networks. In ICCV, Cited by: §3.2.
- (2015) Large-scale deep learning on the YFCC100M dataset. CoRR abs/1502.03409. Cited by: 2nd item.
- (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: §2.
- (2017) Unsupervised cross-domain image generation. In Proc. of ICLR, Cited by: §1.
- (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
- (2019) TransGaGa: geometry-aware unsupervised image-to-image translation. In Proc. of CVPR, Cited by: §1, §2, §5.
- (2017) DualGAN: unsupervised dual learning for image-to-image translation. In Proc. of ICCV, Cited by: §1, §2, §5.
- (2019) LOGAN: unpaired shape transform in latent overcomplete space. ACM Trans. on Graphics 38 (6). Cited by: §1.
- (2018-06) The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 586–595. Cited by: Figure 3, §4.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision (ICCV). Cited by: Figure 1, §1, §1, §2, §3, 4th item, §4, §5.
- (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, Cited by: §2, §5.