Learning a Self-inverse Network for Unpaired Bidirectional Image-to-image Translation
Recently, image-to-image translation has attracted significant interest in the literature, from the successful use of the generative adversarial network (GAN), to the introduction of the cyclic constraint, to extensions to multiple domains. However, existing approaches offer no guarantee that the mapping between two image domains is unique or one-to-one. Here we propose a self-inverse network learning approach for unpaired image-to-image translation. Building on top of CycleGAN, we learn a self-inverse function by simply augmenting the training samples: inputs and outputs are switched during training. The outcome of such learning is a provably one-to-one mapping function. Our extensive experiments on a variety of datasets, including cross-modal medical image synthesis, object transfiguration, and semantic labeling, consistently demonstrate clear improvement over CycleGAN, both qualitatively and quantitatively. In particular, our method reaches the state-of-the-art result in the label-to-photo direction on the Cityscapes benchmark dataset.
Image-to-image translation (or cross-domain image synthesis) is, at its core, a mapping function from an input image to an output image, or vice versa. It has recently attracted significant interest from researchers, and the extensive body of proposed work falls into two categories: supervised vs. unsupervised (or unpaired).
Isola et al. present the seminal work on image-to-image translation that offers a general-purpose solution, being the first in the literature to apply the generative adversarial network (GAN) of Goodfellow et al. to this task. While paired data are assumed in that work, Zhu et al. later propose the CycleGAN approach for addressing the unpaired setting using so-called cyclic constraints. Many recent advances use guidance information [30, 28], impose different constraints [9, 22, 33], or deal with multiple domains [36, 6, 11, 18], etc. In this paper, we study unpaired image-to-image translation.
Specifically, we study the image-to-image translation problem from the perspective of learning a one-to-one mapping between two image domains, which is desirable in many applications. For example, in medical image synthesis, a patient has a unique image for each imaging modality, or for each sequence/configuration within a single modality; therefore, having a one-to-one mapping is crucial. Furthermore, we study how to ensure a one-to-one mapping under an unpaired setting.
What contrasts with a one-to-one mapping function are one-to-many, many-to-one, and many-to-many mapping functions. (It is worth noting that quite a few recent works focus on image-to-image translation among many domains, the so-called one-to-many setting.) In the supervised setting, the well-studied scenarios of labels-to-scenes and edge-to-photo are more likely one-to-many mappings, as multiple photos (scenes) can share the same edge (label) information; colorization is likewise one-to-many. From an information-theoretic perspective, the entropy of an edge map (label) is low while that of a photo is high. When a translation goes in an information-gaining direction, that is, from low to high entropy, its mapping leans toward one-to-many. Similarly, if it goes in an information-losing direction, its mapping leans toward many-to-one. If the information levels of both domains are close (information-similar), then the mapping is close to one-to-one. In CycleGAN, the examples of Monet-to-photo and summer-to-winter are closer to one-to-one mappings, as the underlying content before and after translation is regarded as the same and only the style differs, which does not change the image entropy significantly.
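The information argument above can be made concrete by comparing the Shannon entropy of empirical pixel-value distributions; a hypothetical sketch (the toy "images" here are stand-ins, not from any dataset):

```python
import math
from collections import Counter

def pixel_entropy(pixels):
    """Shannon entropy (bits) of the empirical pixel-value distribution."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A binary label map concentrates on few values -> low entropy (~0.97 bits).
label_map = [0] * 60 + [1] * 40
# A natural photo spreads mass over many values -> high entropy (~6.64 bits).
photo = list(range(100))
```

A translation from `label_map`-like inputs to `photo`-like outputs gains information and thus tends toward one-to-many.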
Our main contribution lies in proposing a self-inverse GAN network. When a function f is self-inverse, meaning f = f^{-1} (equivalently, f(f(x)) = x), it guarantees a one-to-one mapping. We use CycleGAN as the baseline framework for image-to-image translation. To impose the self-inverse property, we implement a simple idea: augmenting the training samples by switching inputs and outputs during training. However, as we will demonstrate empirically, this seemingly simple idea makes a genuinely big difference!
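The augmentation itself is tiny; a minimal sketch (function names hypothetical): for every batch drawn as (input, target), we also emit the swapped (target, input), so a single network sees both directions:

```python
def augment_by_swapping(batches):
    """Yield each (input, target) batch twice: once as-is, once swapped.

    This is the whole trick: one network trained on both orderings is
    pushed toward being its own inverse.
    """
    for x, y in batches:
        yield x, y  # forward direction, e.g. A -> B
        yield y, x  # backward direction, B -> A

batches = [("a1", "b1"), ("a2", "b2")]
augmented = list(augment_by_swapping(batches))
# Sample size doubles: 2 batches -> 4 training examples.
```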
The distinct feature of our self-inverse network is that it learns one network to perform both the forward (f: A → B) and backward (f: B → A) translation tasks. This contrasts with state-of-the-art approaches, which typically learn two separate networks, one for the forward translation and the other for the backward translation. As a result, it enjoys several benefits. First, it halves the necessary parameters, assuming that the self-inverse network and the two separate networks share the same network architecture. Second, it automatically doubles the sample size, a great feature for any data-driven model, making it less likely to over-fit.
One key question arises: Is it feasible to learn such a self-inverse network for image-to-image translation? We cannot theoretically prove this existence; however, we demonstrate it experimentally. Intuitively, such existence is related to the redundancy in the expressive power of deep neural networks. Even given a fixed network architecture, the function space of networks that translate an image from domain A to domain B is large enough; that is, there are many neural networks with different parameters capable of doing the same translation job. The same holds for the inverse network. Therefore, the overlap between these two spaces, in which the self-inverse network resides, does exist.
2 Literature review
As mentioned earlier, the approaches for image-to-image translation can be divided into two categories: supervised and unsupervised [34, 21]. The former uses paired images in training; the latter handles unpaired images. The generative adversarial network (GAN) is widely used in both types of approaches.
In addition to using the GAN, which essentially enforces similarity in image distribution, other guidance information is used, such as landmark points, contours, sketches, anatomical information, etc. In addition to the cyclic constraint, other constraints such as a ternary discriminative function, an optimal transport function, and smoothness over the sample graph are used too.
Also, extensions are proposed to deal with video inputs [31, 3], to synthesize images in high resolution, to seek diversity, and to handle more than two image domains [36, 6, 11, 18]. Furthermore, there are methods that leverage attention mechanisms [24, 5, 26] and mask guidance. Finally, disentanglement is an emerging direction [11, 18].
Among works on inverse problems with neural networks, i-RevNet makes the CNN architecture invertible by providing an explicit inverse. Ardizzone et al. prove invertibility theoretically and experimentally for inverse problems using invertible neural networks. More specifically, Kingma et al. show the benefit of an invertible 1x1 convolution for generative flow. Different from these works, our self-inverse network realizes the invertibility of a neural network by switching inputs and outputs.
For image-to-image translation, much work has been done to diversify the output [1, 21, 18, 11, 36, 19], while little has been done to make the output unique. Our work pursues the latter direction.
Although there are many research works on image-to-image translation, the perspective of learning a one-to-one mapping network has not been fully investigated, with one exception: Lu et al. show that CycleGAN cannot theoretically guarantee the one-to-one mapping property and propose an optimal transport mechanism to mitigate this issue. However, like the GAN, the optimal transport method also measures similarity in image distribution; hence the one-to-one issue is not fully resolved. By contrast, our self-inverse learning comes with a guarantee that the learned network realizes a one-to-one mapping.
3 Self-inverse learning for unpaired image-to-image translation
In this section, we first show that a self-inverse function guarantees a one-to-one (one2one) mapping. Then we discuss how to train a self-inverse CycleGAN network for image-to-image translation.
3.1 One-to-one property
In image-to-image translation, we define a forward function f_{AB}: A → B that maps an image x in domain A to an image y in domain B and, similarly, an inverse function f_{BA}: B → A. When there is no confusion, we skip the subscript (e.g., f).
Property: If a function f is self-inverse, that is, f = f^{-1}, then f defines a one-to-one mapping; that is, f(x_1) = f(x_2) if and only if x_1 = x_2.
Proof. If x_1 = x_2, then f(x_1) = f(x_2).
If f(x_1) = f(x_2), then x_1 = x_2 as long as the inverse function exists, which is the case for a self-inverse function since f^{-1} = f.
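As a concrete sanity check of the property, any involution (say the toy intensity map f(x) = 255 - x) is its own inverse and hence injective; a quick verification over a finite domain:

```python
def f(x, n=255):
    """A toy self-inverse map on 8-bit pixel intensities: f(f(x)) == x."""
    return n - x

domain = range(256)
# Self-inverse: applying f twice is the identity.
self_inverse = all(f(f(x)) == x for x in domain)
# One-to-one: no two distinct inputs collide.
one_to_one = len({f(x) for x in domain}) == len(domain)
```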
3.2 One-to-one Benefits
There are several advantages to learning a self-inverse network with the one-to-one mapping property.
(1) From the perspective of the application, one self-inverse function alone can model both tasks f: A → B and f: B → A, a novel form of multi-task learning. As shown in Figure 1, the self-inverse network generates the output given the input, and vice versa, with only one CNN and without knowing the mapping direction. It is capable of performing both tasks within the same network, simultaneously. Compared with separately assigning two CNNs to the tasks A → B and B → A, the self-inverse network halves the necessary parameters, assuming that the self-inverse network and the two CNNs share the same network architecture, as shown in Figure 1.
(2) It automatically doubles the sample size, a great feature for any data-driven model, making over-fitting less likely. The self-inverse function has the co-domain A ∪ B. If the sample size of either domain A or B is N, then the sample size for the union domain A ∪ B is 2N. As a result, the sample size for both tasks is doubled, giving a novel form of data augmentation that mitigates the over-fitting problem.
(3) In the unpaired image-to-image translation setting, the goal is to minimize the distribution gap between the two domains. State-of-the-art methods can realize this but cannot guarantee an ordered mapping, or bijection, between the two domains, which results in variations in the generated images.
(4) The one-to-one mapping is a strict constraint. Therefore, forcing a CNN model to be a self-inverse function shrinks the target function space.
3.3 One-to-one CycleGAN
We are inspired by the basic formulation of CycleGAN. In CycleGAN, there are two generators G: A → B and F: B → A, two discriminators D_A and D_B, and one joint objective function. In our one2one CycleGAN, we have one shared generator f and still two discriminators D_A and D_B. Instead of having a joint objective for the dual mappings, our proposed method has two separate objective functions, one for each of the two mapping directions.
3.3.1 Separate loss functions
Compared to CycleGAN, which uses a joint loss for both image transfer directions, our method has two separate losses, one for each image transfer direction. For the mapping function f: A → B and its discriminator D_B, the adversarial loss is

L_GAN(f, D_B, A, B) = E_{b ~ p(b)}[log D_B(b)] + E_{a ~ p(a)}[log(1 - D_B(f(a)))].

The cycle consistency loss, with the shared generator serving as its own inverse, is

L_cyc^A(f) = E_{a ~ p(a)}[ ||f(f(a)) - a||_1 ].

For the mapping function f: B → A and its discriminator D_A, the adversarial loss is

L_GAN(f, D_A, B, A) = E_{a ~ p(a)}[log D_A(a)] + E_{b ~ p(b)}[log(1 - D_A(f(b)))].

The cycle consistency loss is

L_cyc^B(f) = E_{b ~ p(b)}[ ||f(f(b)) - b||_1 ].

So, the final objective for the mapping direction A → B is

L_{A→B}(f, D_B) = L_GAN(f, D_B, A, B) + λ L_cyc^A(f),

and the minimax optimization solves

f* = arg min_f max_{D_B} L_{A→B}(f, D_B).

Similarly, the final objective for the mapping direction B → A is

L_{B→A}(f, D_A) = L_GAN(f, D_A, B, A) + λ L_cyc^B(f),

and the minimax optimization solves

f* = arg min_f max_{D_A} L_{B→A}(f, D_A).
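A scalar sketch of one per-direction objective (here A → B, in the log form written above; all names hypothetical, with toy one-dimensional "images" and a constant dummy discriminator):

```python
import math

def l1(x, y):
    """L1 distance between two flat images."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def direction_loss(f, d_b, a_batch, b_batch, lam=10.0):
    """Per-direction objective: adversarial term for A -> B plus the
    cycle term ||f(f(a)) - a||_1, averaged over the batch.
    lam is a hypothetical cycle weight."""
    adv = cyc = 0.0
    for a, b in zip(a_batch, b_batch):
        fake_b = [f(v) for v in a]                    # f(a)
        adv += math.log(d_b(b)) + math.log(1 - d_b(fake_b))
        cyc += l1([f(v) for v in fake_b], a)          # ||f(f(a)) - a||_1
    n = len(a_batch)
    return adv / n + lam * cyc / n

f = lambda v: 1.0 - v          # toy self-inverse generator: f(f(v)) == v
d_b = lambda img: 0.5          # dummy "probability real"
loss = direction_loss(f, d_b, [[0.2, 0.4]], [[0.8, 0.6]])
# The cycle term vanishes here because f is exactly self-inverse.
```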
3.4 Self-inverse implementation
We apply the proposed method within the framework of CycleGAN. For a fair comparison with CycleGAN, we adopt the architecture of (Johnson et al., 2016) as the generator and PatchGAN as the discriminator. The log-likelihood objective in the original GAN is replaced with a least-squares loss for more stable training. We resize the input images to 256 × 256. The loss weights are set as in CycleGAN. Following CycleGAN, we adopt the Adam optimizer with a learning rate of 0.0002. Similarly, we use a pool size of 50. The learning rate is fixed for the first 100 epochs and linearly decayed to zero over the next 100 epochs on the Yosemite and apple2orange datasets; fixed for the first 4 epochs and decayed over the next 3 epochs on the BRATS dataset; and fixed for the first 90 epochs and decayed over the next 30 epochs on the Cityscapes dataset.
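The fixed-then-linear-decay schedule can be written as a single rate function; a sketch using the Cityscapes setting from the text (90 fixed epochs, then 30 decay epochs):

```python
def learning_rate(epoch, base_lr=2e-4, n_fixed=90, n_decay=30):
    """Base rate for the first n_fixed epochs, then linear decay to
    zero over the next n_decay epochs (epoch is 0-indexed)."""
    if epoch < n_fixed:
        return base_lr
    # Fraction of the decay phase already elapsed.
    frac = (epoch - n_fixed + 1) / n_decay
    return base_lr * max(0.0, 1.0 - frac)

# Epochs 0..89 use 2e-4; the final epoch (119) reaches zero.
```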
3.5 Training details and optimization
In our experiments, we use a batch size of 1. At each iteration t, we randomly sample a pair (a, b), where a ∈ A and b ∈ B, and perform the following three steps:
Firstly, we feed a as the input and b as the target, then forward and back-propagate the generator loss for the direction A → B;
Secondly, we feed b as the input and a as the target, then forward and back-propagate the generator loss for the direction B → A;
Finally, we back-propagate the discriminators D_A and D_B individually.
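The three steps above can be sketched as one training iteration, with stubs standing in for the real networks and optimizers (all names hypothetical):

```python
def train_iteration(generator_step, discriminator_step, a, b):
    """One iteration of self-inverse training.

    generator_step(x, y) runs the shared generator with input x and
    target y and back-propagates its loss; discriminator_step(name)
    updates one discriminator individually.
    """
    generator_step(a, b)        # step 1: a as input, b as target
    generator_step(b, a)        # step 2: b as input, a as target
    discriminator_step("D_A")   # step 3: update each discriminator
    discriminator_step("D_B")

# Record the call order with stub closures instead of real networks.
calls = []
train_iteration(lambda x, y: calls.append(("G", x, y)),
                lambda name: calls.append(("D", name)),
                "a_t", "b_t")
```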
4 Experiments
To test the effect of the proposed method, we evaluate it on an array of applications: cross-modal medical image synthesis, object transfiguration, and style transfer. We also compare against several unpaired image-to-image translation methods: CycleGAN, DiscoGAN, DistanceGAN, and UNIT. We conduct a user study when ground-truth images are unavailable and perform quantitative evaluation when they are present.
4.1 Datasets and results
Object transfiguration. We test our method on the horse↔zebra task used in the CycleGAN paper, with 2401 training images (939 horses and 1177 zebras) and 260 test images (120 horses and 140 zebras). This task has no ground truth for the generated images, so no quantitative evaluation is feasible; instead we provide qualitative results from a user study. In the user study, we ask each user to pick his/her preferred image out of three randomly positioned images: one obtained from CycleGAN, one from DistanceGAN, and one from one2one CycleGAN. Figure 2 shows examples of input and synthesized images, and Table 1 summarizes the user study results.
Figure 2 shows that one2one CycleGAN tends to generate better quality images in an unsupervised fashion, especially for zebra synthesis from horses (refer to the first four rows); our method generates more realistic and complete zebra content. From Table 1, it is clear that our one2one CycleGAN is the most favored, with a 75% (77%) preference for the horse2zebra (zebra2horse) mapping direction, while DistanceGAN is the least favored.
We also test our method on the apple↔orange task, with 2014 training images (995 apples and 1019 oranges) and 514 test images (248 apples and 266 oranges). This task likewise has no ground truth for the generated images, so no quantitative evaluation is feasible. Figure 4 shows examples of input and synthesized images: there are failure cases in rows 1, 2, and 4 from CycleGAN, while our model generates normal images.
Cross-modal medical image synthesis. This task evaluates cross-modal medical image synthesis. The models are trained on the BRATS dataset, which contains paired MRI data, allowing quantitative evaluation. It contains ample multi-institutional, routine clinically-acquired pre-operative multimodal MRI scans of glioblastoma (GBM/HGG) and lower grade glioma (LGG). There are 285 3D volumes for training and 66 3D volumes for testing. The T1 and T2 images are selected for our bidirectional image synthesis. All 3D volumes are preprocessed into one-channel images of size 256 × 256 × 1. We use the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) to evaluate the quality of the generated images.
|Direction|Method|PSNR|SSIM|
|T1 → T2|One2one CycleGAN|22.03|0.86|
|T2 → T1|One2one CycleGAN|18.31|0.82|
As shown in Table 2, in the T1 → T2 synthesis direction, our one2one model outperforms the CycleGAN model on PSNR by 6.0%; the qualitative result is shown in columns 3 and 4 of Figure 6. In the T2 → T1 synthesis direction, our one2one model outperforms the CycleGAN model on PSNR by 5.0%; the qualitative result is shown in columns 7 and 8 of Figure 6.
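PSNR as used here is, presumably, the standard definition over the mean squared error; a stdlib sketch for 8-bit images given as flat pixel lists:

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-size images."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

clean = list(range(100))
noisy = [p + 16 for p in clean]
# A uniform error of 16 intensity levels gives roughly 24 dB.
```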
| |Label → Photo| | |Photo → Label| | |
|Method|Pixel Acc.|Class Acc.|Class IoU|Pixel Acc.|Class Acc.|Class IoU|
|One2one CycleGAN (ours)|58.2|18.9|14.3|52.7|18.1|13.0|
Semantic labeling. We also test our method on the labels↔photos task using the Cityscapes dataset under the unpaired setting, as in the original CycleGAN paper. For quantitative evaluation, in line with previous work, for labels → photos we adopt the "FCN score", which evaluates how interpretable the generated photos are according to a semantic segmentation algorithm. For photos → labels, we use the standard segmentation metrics: per-pixel accuracy, per-class accuracy, and mean class Intersection-over-Union (Class IoU). The quantitative results are shown in Table 3. Our model reaches the state of the art in the label → photo synthesis direction under this unpaired setting: the pixel accuracy outperforms the second-best result by 10.4%, the class accuracy by 24.3%, and the class IoU by 30.0%. In the photo → label direction, our model reaches comparable results.
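For reference, the three segmentation scores can all be computed from a confusion matrix; a minimal sketch on a toy 2-class example:

```python
def segmentation_scores(confusion):
    """Pixel acc., mean class acc., and mean IoU from a square confusion
    matrix where confusion[i][j] counts pixels of true class i predicted
    as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    diag = [confusion[i][i] for i in range(n)]
    pixel_acc = sum(diag) / total
    class_acc = sum(d / sum(row) for d, row in zip(diag, confusion)) / n
    # IoU per class: TP / (TP + FN + FP).
    iou = sum(
        diag[i] / (sum(confusion[i]) + sum(r[i] for r in confusion) - diag[i])
        for i in range(n)
    ) / n
    return pixel_acc, class_acc, iou

conf = [[50, 10],   # true class 0: 50 correct, 10 confused
        [5, 35]]    # true class 1: 35 correct, 5 confused
pa, ca, miou = segmentation_scores(conf)
```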
The qualitative results are shown in Figure 5. Compared with CycleGAN, which is the second-best in the label → photo direction, our model has clearly better visual results. In the photo → label direction, our model also has comparable or better results.
Style transfer. We also test our method on the summer↔winter style transfer task using the Yosemite dataset under the unpaired setting, as in the original CycleGAN paper. As shown in Figure 3, our method achieves better visual results in both transfer directions. We also conduct a similar user study, presenting users with images generated from the test set by our model and by CycleGAN; the results, given in Table 4, show that our model is preferred over CycleGAN.
We have presented an approach for enforcing the learning of a one-to-one mapping function for unpaired image-to-image translation. The idea is to take advantage of the representative redundancy in deep networks to realize self-inverse learning. The implementation is as simple as augmenting the training samples by switching inputs and outputs. However, this seemingly simple idea brings a genuinely big difference, as confirmed by our extensive experiments on multiple applications, including cross-modal medical image synthesis, object transfiguration, and style transfer. The proposed one-to-one CycleGAN consistently outperforms the baseline CycleGAN model and other state-of-the-art unsupervised approaches in terms of various qualitative and quantitative metrics. In the future, we plan to investigate the effect of applying self-inverse learning to natural language translation and to study the theoretical properties of the self-inverse network.
-  (2018) Augmented cyclegan: learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151. Cited by: §1, §2.
-  (2018) Analyzing inverse problems with invertible neural networks. arXiv preprint arXiv:1808.04730. Cited by: §2.
-  (2018) Recycle-gan: unsupervised video retargeting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 119–135. Cited by: §2.
-  (2017) One-sided unsupervised domain mapping. In Advances in neural information processing systems, pp. 752–762. Cited by: §4.
-  (2018) Attention-gan for object transfiguration in wild images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 164–180. Cited by: §2.
-  (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797. Cited by: §1, §2.
-  (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §4.1.
-  (2017) Smart, sparse contours to represent and edit images. arXiv preprint arXiv:1712.08232. Cited by: §2.
-  (2017) Triangle generative adversarial networks. In Advances in Neural Information Processing Systems, pp. 5247–5256. Cited by: §1, §2.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
-  (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §1, §2, §2.
-  (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: Figure 1, §1, §1, §1, §2, §3.4, §4.1.
-  (2018) I-revnet: deep invertible networks. arXiv preprint arXiv:1802.07088. Cited by: §2.
-  (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §3.4.
-  (2017) Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1857–1865. Cited by: §4.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.4.
-  (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §2.
-  (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51. Cited by: §1, §2, §2.
-  (2019) DRIT++: diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270. Cited by: §2.
-  (2018) Generative semantic manipulation with mask-contrasting gan. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 558–573. Cited by: §2.
-  (2017) Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pp. 700–708. Cited by: §2, §2, §4.
-  (2018) Guiding the one-to-one mapping in cyclegan via optimal transport. arXiv preprint arXiv:1811.06284. Cited by: §1, §2, §2.
-  (2018) Image generation from sketch constraint using contextual gan. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 205–220. Cited by: §2.
-  (2018) DA-gan: instance-level image translation by deep attention generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5657–5666. Cited by: §2.
-  (2019) Mode seeking generative adversarial networks for diverse image synthesis. arXiv preprint arXiv:1903.05628. Cited by: §2.
-  (2018) Unsupervised attention-guided image-to-image translation. In Advances in Neural Information Processing Systems, pp. 3693–3703. Cited by: §2.
-  (2015) The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34 (10), pp. 1993–2024. Cited by: §4.1.
-  (2018) Ganimation: anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 818–833. Cited by: §1, §2.
-  (2019) Towards instance-level image-to-image translation. arXiv preprint arXiv:1905.01744. Cited by: §2.
-  (2018) Geometry guided adversarial facial expression synthesis. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 627–635. Cited by: §1, §2.
-  (2018) Video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
-  (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807. Cited by: §2.
-  (2019) Harmonic unpaired image-to-image translation. CoRR abs/1902.09727. External Links: Cited by: §1, §2.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1, §1, §1, §2, §2, §3.3, §3.4, §4.1, §4.1, §4.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, Cited by: Figure 1.
-  (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pp. 465–476. Cited by: §1, §2, §2.