Towards Learning a Self-inverse Network for Bidirectional Image-to-image Translation
A one-to-one mapping is necessary for many bidirectional image-to-image translation applications, such as MRI image synthesis, because MRI images are unique to the patient. State-of-the-art approaches for image synthesis from domain X to domain Y learn a convolutional neural network that maps between the domains. A different network is typically implemented to map along the opposite direction, from Y to X. In this paper, we explore the possibility of using only one network for bidirectional image synthesis. In other words, such a network implements a self-inverse function. A self-inverse network has several distinct advantages: only one network instead of two, better generalization, and a more restricted parameter space. Most importantly, a self-inverse function guarantees a one-to-one mapping, a property that cannot be guaranteed by earlier approaches that are not self-inverse. Experiments on three datasets show that, compared with baseline approaches that use two separate models for image synthesis along the two directions, our self-inverse network achieves better synthesis results in terms of standard metrics. Finally, our sensitivity analysis confirms the feasibility of learning a self-inverse function for bidirectional image translation.
Recently, there has been a growing need for bidirectional image-to-image translation, including image style transfer, translation between image and semantic labels, gray-scale to color, edge-map to photograph, super resolution, and many other types of image manipulations. Here we highlight one application related to Magnetic Resonance Imaging (MRI). MRI is a widely used medical imaging modality due to its non-invasiveness and its ability to clearly capture soft tissue structures using multiple acquisition sequences. However, its disadvantages are its long acquisition time and high cost. Therefore, there is a lack of the large-scale MRI image databases needed for learning-based image analysis. MRI image synthesis, or image-to-image translation [41, 18], is able to fill this gap by generating more images for training. A generated MRI image can also be helpful for cross-sequence image registration, in which an image is first synthesized for the target sequence and then used for registration.
In language translation, if we treat the translation from one language A to another language B as a forward process $f$, then the translation from language B to A is its inverse problem $f^{-1}$. Similarly, in computer vision, there is a concept of image-to-image translation [19, 43, 44, 6] that converts an image to another one. In medical imaging, there are image reconstruction problems. Traditionally, each of these problems uses two different functions, one for the forward task and the other for its inverse. In this paper, our goal is to demonstrate that, for MRI image synthesis and other tasks, we are able to learn the above two tasks simultaneously using only one function (see Figure 1), that is, $f = f^{-1}$.
The community has explored the power of CNNs in various computer vision tasks, as well as in several other fields. But so far, to the best of our knowledge, no one has explored the capability of a CNN to learn a self-inverse function, or its potential use in applications. Our aim in this paper is to bridge this gap. We refer to the mapping from a domain $X$ to a domain $Y$ as task $A$ and the mapping from the domain $Y$ to $X$ as task $B$. Additionally, the proposed CNN that learns a self-inverse function is referred to as the self-inverse or one-to-one network. The one-to-one mapping property is necessary for applications such as MRI image synthesis, because MRI images are unique to the patient.
2 Benefits of learning a self-inverse network
There are several advantages in learning a self-inverse network equipped with the one-to-one mapping property.
(1) From the perspective of the application, a single self-inverse function can model both tasks $A$ and $B$, which is a novel approach to multi-task learning. As shown in Figure 1, the self-inverse network generates an output given an input, and vice versa, with only one CNN and without knowing the mapping direction. It is capable of doing both tasks within the same network, simultaneously. Compared with assigning two separate CNNs to tasks $A$ and $B$, the self-inverse network halves the number of parameters, assuming that the self-inverse network and the two CNNs share the same network architecture, as shown in Figure 2.
(2) It automatically doubles the sample size, a valuable feature for any data-driven model, which makes the model less likely to over-fit. The self-inverse function $f$ has the co-domain $X \cup Y$. If the sample size of either domain $X$ or $Y$ is $N$, then the sample size for the combined domain $X \cup Y$ is $2N$. As a result, the sample size for both tasks $A$ and $B$ is doubled, which serves as a novel method of data augmentation to mitigate the over-fitting problem.
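The doubling argument can be illustrated with a small sketch. The helper name `make_bidirectional_pairs` and the string-valued toy samples are hypothetical, chosen only for readability; the point is that every paired sample is presented to the single network once per direction.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_bidirectional_pairs(pairs: Sequence[Tuple[T, T]]) -> List[Tuple[T, T]]:
    """Return the (input, target) pairs seen by a single self-inverse network.

    Each paired sample (x, y) is used twice: once as x -> y (task A)
    and once as y -> x (task B).
    """
    forward = [(x, y) for x, y in pairs]    # task A: X -> Y
    backward = [(y, x) for x, y in pairs]   # task B: Y -> X
    return forward + backward


# Hypothetical toy data: three paired samples, named by strings.
paired = [("x1", "y1"), ("x2", "y2"), ("x3", "y3")]
training_set = make_bidirectional_pairs(paired)
print(len(paired), "->", len(training_set))  # 3 -> 6
```

A network trained for only one direction would see just the first half of this list.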
(3) It implicitly shrinks the target function space. As shown in Figure 3, the blue area is the whole function space, which is unlimited. Given a CNN with its architecture fixed, its function space (Figure 3, white area) is enormous, with millions of parameters. When the CNN is trained for task $A$, the target function space is the purple area. When the CNN is trained for task $B$, the target function space is the green area. When it is trained to learn a self-inverse function for both tasks $A$ and $B$, the target function space is the overlapping area, which is a subset of the function spaces of $A$ and $B$. For a fixed neural network architecture, its function space is large enough to contain the overlapping area in Figure 3. For a fixed data set, a model trained separately for each direction is a function within the purple area or the green area, and the overlapping area is always a subset of both. If the network is trained as a self-inverse network, the trained model is a function within the overlapping area, which is smaller than the area available to a network trained separately in each direction. A smaller function space means a smaller bias between the true function and the trained model, so the self-inverse network likely generalizes better. Another interpretation of this shrinking behavior is to regard the inverse $f^{-1}$ as a regularization condition when learning the function $f$, and vice versa.
Table columns: Direction | Method | p. acc. | c. acc. | IOU
3 Related Works
Inverse problem with neural networks. The loss of information is a major problem that affects the performance of CNNs in various tasks. Several works such as [11, 29] show that essential information about the input image is lost as it passes through the deeper layers of well-known ImageNet-based CNN classifiers. To recover and understand the lost information, these works use learned or hand-crafted priors when inverting the representation. An example of ‘compensating’ for the lost information to improve performance is a segmentation approach that uses prior anatomical information from the latent space of a pre-trained decoder.
Building an invertible architecture is difficult because local inversion is ill-conditioned, and limited progress has been made on the problem. Several works allow invertible representation learning only under certain conditions. The Parseval network increases the robustness of learned representations to adversarial attacks; in that work, the linear operator is bijective under the condition that the spectrum of the convolutional operator is constrained to norm 1 during learning. Work on signal recovery from pooling representations designs invertible neural network layers. i-RevNet makes the CNN architecture invertible by providing an explicit inverse, and it reconstructs linear interpolations between natural image representations. This gives empirical evidence that it is possible to learn invertible representations that do not discard any information about their input on large-scale supervised problems. However, it cannot provide a bidirectional mapping and is not self-inverse. Ardizzone et al. analyze inverse problems with invertible neural networks, both theoretically and experimentally, on artificial and real data. More specifically, Kingma uses invertible 1x1 convolutions for generative flow. Different from these previous works, our self-inverse network realizes the invertibility between two domains by switching the inputs and outputs and then learning a self-inverse function.
Image-to-image translation. The concept of image-to-image translation is broad, including image style transfer, translation between image and semantic labels, gray-scale to color, edge-map to photograph, super resolution, and many other types of image manipulations. It dates back to image analogies, which employ a non-parametric texture model learned from a single input-output training image pair. More recent approaches use a data set of input-output examples to learn a parametric translation function using CNNs. Our approach builds on the “pix2pix” framework, which uses a conditional generative adversarial network to learn a mapping from input to output images. CycleGAN contributes to unpaired image-to-image translation with a cycle consistency loss. In this framework, CycleGAN addresses exactly the same issue of learning a bijective mapping, albeit without the self-inverse property. CycleGAN can be seen as a BiGAN where the latent variable is like an image in the co-domain and the loss is augmented with an L1 loss. Similar ideas have been applied to various tasks such as generating photographs from sketches or from attribute and semantic layouts. Recently, a multi-scale loss and a conditional GAN have been used to realize high-resolution image synthesis and semantic manipulation. One direction, towards diversifying image translation, is to allow many-to-many mappings, as in augmented CycleGAN [1, 27, 24, 17, 44, 25]. The other direction, towards accurate image translation, is to restrict the output image variance, as in instance-level image translation. Our method falls into the latter case and learns both tasks $A$ and $B$ with one generator network in a bidirectional way instead of using two generator networks (see Figure 2). Unlike approaches with two generator networks, we encourage the invertibility of our model as a self-inverse function to realize a bijection.
Neural style transfer. Neural style transfer can be treated as a special category of image-to-image translation as well. Image representations derived from a CNN optimized for object recognition have been used to make high-level image information explicit. Cascaded refinement networks have been introduced for photographic image synthesis. Texture networks highlight the power and flexibility of generative feed-forward models trained with complex and expressive loss functions for style transfer. Perceptual losses have also been proposed and work very well.
Our goal is to learn a self-inverse mapping function, or bidirectional mapping function, $f: X \leftrightarrow Y$ for pairs $(x, y)$. This means $f = f^{-1}$. It can also be illustrated in this way: the function $f: X \rightarrow Y$ and its inverse function $f^{-1}: Y \rightarrow X$ satisfy $f \equiv f^{-1}$, where samples $x \in X$, $y \in Y$; the symbol ‘$\leftrightarrow$’ means bijection, the symbol ‘$\rightarrow$’ means one-directional mapping, and the symbol ‘$\equiv$’ means the two functions on both sides are exactly the same function.
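As a toy illustration of the self-inverse property (not of the network itself), the sketch below checks numerically whether applying a function twice returns the input. The functions `reflect` and `double` and the sample values are hypothetical examples.

```python
from typing import Callable, Iterable


def is_self_inverse(f: Callable[[float], float],
                    samples: Iterable[float],
                    tol: float = 1e-9) -> bool:
    """Check numerically that f(f(v)) equals v on the given samples."""
    return all(abs(f(f(v)) - v) <= tol for v in samples)


def reflect(v: float) -> float:
    """Toy involution: reflection about 0.5, so the function is its own inverse."""
    return 1.0 - v


def double(v: float) -> float:
    """Invertible, but its inverse is v / 2, so it is not self-inverse."""
    return 2.0 * v


samples = [0.0, 0.25, 0.5, 1.0]
print(is_self_inverse(reflect, samples))  # True
print(is_self_inverse(double, samples))   # False
```

The second case shows that invertibility alone is weaker than the self-inverse property: `double` is a bijection on the reals, yet it needs a different function to undo it.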
Mathematically, it boils down to solving the following minimization problem:
$$\min_{\theta} \; \mathcal{L}_A(f_\theta) + \mathcal{L}_B(f_\theta) + R(f_\theta),$$
where $\theta$ denotes the neural network parameters, $\mathcal{L}_A$ and $\mathcal{L}_B$ are the loss functions for tasks $A$ and $B$, respectively, and $R$ is the regularizer. In this paper, we use the $L_1$ norm as the loss and a GAN discriminator as the regularizer. The model pipeline is illustrated in Figure 2(c). It consists of two networks: the generator network $G$ and the discriminator network $D_A$ or $D_B$. Here the generators for the two tasks, $G_A$ and $G_B$, are the same network $G$, whereas $G_A$ and $G_B$ are two different networks in the baseline pix2pix model (see Figure 2(a)). The generator $G$ is trained to produce images that are as realistic as possible in order to fool the discriminator network $D_A$ or $D_B$, which is trained to detect the ‘fake’ examples generated by $G$ as well as possible.
Detailed network architecture. We adopt the pix2pix architecture for our self-inverse network implementation. Let $Ck$ denote a Convolution-BatchNorm-LeakyReLU layer with $k$ filters in the encoder and a Convolution-BatchNorm-ReLU layer with $k$ filters in the decoder. All convolutions are spatial filters applied with a stride of 2. Convolutions in the encoder down-sample by a factor of 2, and convolutions in the decoder up-sample by a factor of 2.
The encoder-decoder architecture consists of an encoder and a decoder. After the last layer in the decoder, a convolution is applied to map to the number of output channels, which is 1, followed by a Tanh function. Following convention, batch normalization is not applied to the first convolution layer of the encoder. All LeakyReLUs in the encoder have a slope of 0.2. For the U-Net, each skip connection concatenates the feature maps from layer $i$ with those of layer $n - i$, where $i$ is the layer index and $n$ is the total number of layers. Compared to the decoder above without skip connections, the number of feature maps doubles due to the use of a U-Net decoder. In the discriminator, the last layer is followed by a convolution layer that maps the number of feature-map channels to 1, and a sigmoid function then generates the output. Similar to the generator, the first convolution layer is without batch normalization. All LeakyReLUs have a slope of 0.2.
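The skip-connection indexing can be sketched as follows. This assumes the pix2pix convention that layer i is paired with layer n - i, with layers numbered 1 to n across encoder and decoder; the function names and the value n = 8 are hypothetical and do not describe the actual network depth.

```python
from typing import List, Tuple


def skip_connection_pairs(n: int) -> List[Tuple[int, int]]:
    """Pair each encoder layer i with the decoder layer n - i.

    Layers are numbered 1..n across encoder and decoder; the bottleneck
    layer n // 2 has no partner. This indexing convention is an assumption.
    """
    return [(i, n - i) for i in range(1, n // 2)]


def channels_after_concat(decoder_channels: int, encoder_channels: int) -> int:
    """Concatenating along the channel axis adds the two channel counts."""
    return decoder_channels + encoder_channels


# Hypothetical network with n = 8 layers in total.
print(skip_connection_pairs(8))          # [(1, 7), (2, 6), (3, 5)]
print(channels_after_concat(512, 512))   # 1024
```

When the encoder and decoder feature maps have the same number of channels, concatenation doubles the channel count seen by the next decoder layer, which is why the U-Net decoder is wider than the plain decoder.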
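Loss function. The objective of a conditional GAN can be expressed as
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))].$$
We use L1 distance rather than L2 as L1 encourages less blurring:
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_1].$$
Our final objective is
$$G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G).$$
With the noise input $z$, the net could learn a mapping from $x$ to $y$ in terms of any distribution instead of just a delta function.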
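A minimal numeric sketch of the generator side of such an objective is given below, assuming the common pix2pix-style combination of an adversarial term and a weighted L1 term. The weight `lam = 100.0`, the non-saturating form of the adversarial term, and all function names are assumptions, not values taken from this paper.

```python
import math
from typing import Sequence


def l1_distance(target: Sequence[float], output: Sequence[float]) -> float:
    """Mean absolute difference between target and generated pixels."""
    return sum(abs(t - o) for t, o in zip(target, output)) / len(target)


def generator_loss(d_fake: Sequence[float],
                   target: Sequence[float],
                   output: Sequence[float],
                   lam: float = 100.0,
                   eps: float = 1e-12) -> float:
    """Adversarial term plus lam times the L1 term, from the generator's view.

    d_fake holds the discriminator's probabilities that generated samples are
    real. The generator is rewarded when these are close to 1 (assumed
    non-saturating form: minimize -log D(G(x))).
    """
    adversarial = -sum(math.log(p + eps) for p in d_fake) / len(d_fake)
    return adversarial + lam * l1_distance(target, output)


# A perfect output that fully fools the discriminator has (almost) zero loss.
print(generator_loss([1.0, 1.0], target=[0.0, 1.0], output=[0.0, 1.0]))
# A blurry output that the discriminator rejects is penalized by both terms.
print(generator_loss([0.1, 0.1], target=[0.0, 1.0], output=[0.5, 0.5]))
```

In a self-inverse setting the same generator would be scored this way in both directions, with the roles of input and target swapped.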
Bi-directional training. To train a CNN as a self-inverse network, we randomly sample fixed-size batches of pairs $(x, y)$ and $(y, x)$ alternately and iteratively (see Figure 4). The baseline is trained without alternation, which means training two separate generator networks for tasks $A$ and $B$, respectively (see Figure 4). For a fair comparison with the baseline on the same data set, we use the same batch size and the same number of epochs. In other words, except for the alternating part, everything is the same as the baseline. We resize the input images to a larger size, add random jitter, and then randomly crop them back to the original size. All networks are trained from scratch. The weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.02.
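The alternating schedule can be sketched with a small batch generator. This is only one reading of the procedure: it assumes a fresh random batch is drawn at every step and that the direction flips on each step. The names `alternating_batches` and `init_weights` and the toy data are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]


def alternating_batches(pairs: Sequence[Pair], batch_size: int,
                        steps: int, seed: int = 0
                        ) -> Iterator[Tuple[str, List[Pair]]]:
    """Yield (task, batch) with the mapping direction flipped at every step.

    Each batch is a list of (input, target) tuples for the single shared network.
    """
    rng = random.Random(seed)
    for step in range(steps):
        batch = rng.sample(list(pairs), batch_size)
        if step % 2 == 0:
            yield "A", [(x, y) for x, y in batch]   # task A: x is the input
        else:
            yield "B", [(y, x) for x, y in batch]   # task B: y is the input


def init_weights(count: int, seed: int = 0) -> List[float]:
    """Draw weights from a Gaussian with mean 0 and standard deviation 0.02."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(count)]


# Hypothetical toy data set of four paired samples.
data = [("x1", "y1"), ("x2", "y2"), ("x3", "y3"), ("x4", "y4")]
schedule = list(alternating_batches(data, batch_size=2, steps=4))
print([task for task, _ in schedule])  # ['A', 'B', 'A', 'B']
```

The baseline corresponds to running two such loops with the direction fixed, one per generator, instead of flipping it inside a single loop.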
5 Experimental results
Below, ‘pix2pix’ refers to the results obtained by the model we retrained from scratch, following exactly the same training details as in the pix2pix paper. ‘one2one’ refers to our results obtained by training the same networks as a self-inverse function. In all tables, results are averaged across the whole validation partition, which follows the same dataset split as in prior work.
We conduct the experiments using three paired image data sets:
Semantic label ↔ photo, trained on the Cityscapes dataset;
Map ↔ aerial photo, trained on data scraped from Google Maps;
MRI image synthesis on BraTS.
We use the following evaluation metrics:
Cityscapes data set. For a fair comparison with the baseline, which is pix2pix, we follow the same evaluation metrics as in the pix2pix paper. We use the public evaluation code released in the pix2pix GitHub repository. For the photo → labels direction, we use IOU as the evaluation metric. For the labels → photo direction, we use the “FCN score” [34, 28, 39, 42, 32].
Map data scraped from Google Maps and BraTS. To objectively quantify the image quality distance between the generated image and the ground truth, and to have a metric for the model sensitivity analysis, we use SSIM, PSNR, and L1 distance as the evaluation metrics for both directions.
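For reference, two of these image-quality metrics can be written in a few lines. The sketch assumes images are flattened to lists of pixel values in the range [0, 1], which is an assumption about preprocessing rather than something stated here; SSIM is omitted because it needs windowed statistics.

```python
import math
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute pixel difference (lower is better)."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def psnr(a: Sequence[float], b: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in decibels (higher is better)."""
    mse = sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


ground_truth = [0.0, 0.0, 1.0, 1.0]
generated = [0.1, 0.1, 0.9, 0.9]
print(l1_distance(ground_truth, generated))
print(psnr(ground_truth, generated))
```

With a uniform error of 0.1 per pixel, the L1 distance is about 0.1 and the PSNR is about 20 dB.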
5.1 Semantic label ↔ photo
Our model is one2one and the baseline is pix2pix. Table 1 and Figure 5 compare the performance of the one2one and pix2pix models on bidirectional label and photo image translation. The evaluation metrics are pixel accuracy (p. acc.), class accuracy (c. acc.), and class IOU (IOU). In the photo → labels direction, our one2one model outperforms the pix2pix model by 3.75% in pixel accuracy. In the labels → photo direction, the evaluation metric is the “FCN score”. Our one2one model increases the class IOU by 5.3% compared with the pix2pix model. Note that the FCN score for the ground truth is 0.21. The FCN score of the one2one model is 0.20, which is very close to the score of the ground truth.
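The three segmentation metrics can be computed from a confusion matrix, as in the sketch below. It uses the standard definitions of pixel accuracy, mean per-class accuracy, and mean class IOU; the released pix2pix evaluation code may differ in details such as ignored classes, and the 2x2 matrix is a hypothetical example.

```python
from typing import List, Sequence, Tuple


def segmentation_scores(confusion: Sequence[Sequence[int]]) -> Tuple[float, float, float]:
    """Return (pixel accuracy, mean class accuracy, mean class IOU).

    confusion[g][p] counts pixels with ground-truth class g predicted as class p.
    Classes that never occur are skipped when averaging.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    diag = [confusion[c][c] for c in range(n)]
    row_sums = [sum(confusion[c]) for c in range(n)]                          # ground truth
    col_sums = [sum(confusion[r][c] for r in range(n)) for c in range(n)]     # predictions

    pixel_acc = sum(diag) / total
    class_accs: List[float] = [diag[c] / row_sums[c] for c in range(n) if row_sums[c] > 0]
    ious: List[float] = []
    for c in range(n):
        union = row_sums[c] + col_sums[c] - diag[c]
        if union > 0:
            ious.append(diag[c] / union)
    return pixel_acc, sum(class_accs) / len(class_accs), sum(ious) / len(ious)


# Hypothetical two-class confusion matrix over 10 pixels.
p_acc, c_acc, iou = segmentation_scores([[3, 1], [1, 5]])
print(p_acc, c_acc, iou)
```

For this matrix the pixel accuracy is 0.8, while the class IOU is lower because it also penalizes false positives for each class.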
5.2 Map ↔ aerial photo
Table 2 and Figure 6 compare the performance of the one2one and pix2pix models on bidirectional aerial and map image translation. In the aerial photo → map direction, image translation is many-to-one. As shown in Table 2 and the upper part of Figure 6, pix2pix produces better results than one2one by 3%, 10.5%, and 9.6% in PSNR, SSIM, and L1, respectively. In the map → aerial photo direction, as shown in Table 2 and the bottom part of Figure 6, the one2one model outperforms the pix2pix model by 3% in SSIM and 2% in PSNR.
5.3 MRI image synthesis on BRATS
We conduct the experiments on the BraTS 2018 dataset, which contains multi-institutional, routine clinically acquired, pre-operative multimodal MRI scans of glioblastoma (GBM/HGG) and lower grade glioma (LGG). There are 285 3D volumes for training and 66 3D volumes for testing. Two MRI sequences are selected for our bidirectional image synthesis. All the 3D volumes are preprocessed into one-channel images of size 256 x 256 x 1. In all tables, results are averaged across all splits. As shown in Table 5(a), in one image synthesis direction, our one2one model outperforms the pix2pix model on PSNR by 13.6%. The qualitative results are shown in columns 3 and 4 in Figure 9. In the other image synthesis direction, our one2one model outperforms the pix2pix model on PSNR by 11.6%. The qualitative results are shown in columns 5 and 6 in Figure 9.
6 Model sensitivity analysis
To measure the model sensitivity, we add a perturbation to the input image and then measure the resulting change in the output. In our experiment on the BraTS dataset, shown in Figure 9, the perturbed input for one direction is the image generated by the pix2pix model trained for the opposite direction (see column 5 in Figure 9); for the other direction, the perturbed input is likewise the image generated by the pix2pix model trained for the opposite direction (see column 3 in Figure 9).
In order to compare the performance of pix2pix and one2one on both tasks $A$ and $B$, we need to train three models in total: pix2pix for task $A$ (pix2pixA), pix2pix for task $B$ (pix2pixB), and a one2one model for both tasks $A$ and $B$ (one2one). To compare the model sensitivity between pix2pixA and one2one for task $A$, we follow four steps.
(1) For an image pair $(x, y)$, we pass $y$ to pix2pixB as input to generate $x'$, which adds a perturbation to $x$.
(2) We input $x$ to the pix2pixA and one2one models, obtaining the corresponding outputs $\hat{y}_{\mathrm{pix2pixA}}$ and $\hat{y}_{\mathrm{one2one}}$, respectively.
(3) We input $x'$ to the pix2pixA and one2one models, obtaining the corresponding outputs $\hat{y}'_{\mathrm{pix2pixA}}$ and $\hat{y}'_{\mathrm{one2one}}$, respectively.
(4) For each model, we use a predefined evaluation metric (for example, PSNR or SSIM) to evaluate its outputs $\hat{y}$ and $\hat{y}'$ and get the scores $s$ and $s'$, respectively. The change of the output is then measured by $|s - s'|$.
The model with a larger change of the output due to the perturbation is more sensitive, and vice versa. Similarly, we can compare the model sensitivity between pix2pixB and one2one for task $B$ by swapping $x$ and $y$ in the above steps.
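The four steps above can be sketched as a single function. The sketch assumes that each output is scored against the ground-truth target and that the change is the absolute difference of the two scores; `flip`, `noisy_flip`, and `constant` are hypothetical stand-ins for trained models, not the actual networks.

```python
from typing import Callable, List

Image = List[float]
Model = Callable[[Image], Image]
Metric = Callable[[Image, Image], float]


def output_change(model: Model, perturb: Model, x: Image, y: Image,
                  metric: Metric) -> float:
    """Absolute change in a model's score when its input is perturbed.

    perturb plays the role of pix2pixB: it maps y to a perturbed version of x.
    """
    x_perturbed = perturb(y)                          # step 1
    score_clean = metric(model(x), y)                 # steps 2 and 4
    score_perturbed = metric(model(x_perturbed), y)   # steps 3 and 4
    return abs(score_clean - score_perturbed)


def l1(a: Image, b: Image) -> float:
    """Mean absolute difference, used here as the evaluation metric."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def flip(img: Image) -> Image:
    """Toy task-A model that reacts to every change in its input."""
    return [1.0 - p for p in img]


def noisy_flip(img: Image) -> Image:
    """Toy stand-in for pix2pixB: an imperfect inverse of flip."""
    return [0.9 * (1.0 - p) + 0.05 for p in img]


def constant(img: Image) -> Image:
    """Toy task-A model that ignores its input entirely."""
    return [1.0, 0.0]


x, y = [0.0, 1.0], [1.0, 0.0]
change_flip = output_change(flip, noisy_flip, x, y, l1)
change_constant = output_change(constant, noisy_flip, x, y, l1)
print(change_flip > change_constant)  # True
```

The input-dependent model shows a larger output change than the constant one, which is the sense in which a model is called more sensitive here.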
As shown in Table 5(b), in one image synthesis direction, our one2one model is more sensitive than the pix2pix model, improving PSNR by 38.7%. The qualitative results are shown in columns 7 and 8 in Figure 9. In the other image synthesis direction, our one2one model is more sensitive than the pix2pix model, improving PSNR by 9.3%. The qualitative results are shown in columns 9 and 10 in Figure 9.
For the Cityscapes dataset, we use the mean class IOU to measure the change of output for the photo → labels direction and the “FCN score” to measure the change of output for the labels → photo direction. In Table 3 and Figure 7, D(CLASS IOU) is the absolute difference between one2one and pix2pix in IOU score for the photo → labels direction and in FCN score for the labels → photo direction.
For the Google Maps data set, we use the structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), and L1 distance to measure the change of output in both directions. In Table 4 and Figure 8, dL1, dPSNR, and dSSIM are the absolute values of the differences between one2one and pix2pix.
For the Cityscapes dataset, according to Table 3, the one2one model is more sensitive than pix2pix by 6% in the labels → photo direction and by 5% in the photo → labels direction. Figure 7 illustrates the qualitative sensitivity analysis.
For the maps dataset, according to Table 4, the one2one model is more sensitive than pix2pix by 2% in PSNR and 14% in L1 for the aerial → map direction. The one2one model is more sensitive than pix2pix by 3% in L1, 2% in PSNR, and 4.3% in SSIM in the map → aerial direction. Figure 8 illustrates the qualitative sensitivity analysis.
In summary, the one2one model is more sensitive than the pix2pix models on all three datasets.
We have presented an approach for learning one U-Net for both forward and inverse image-to-image translation. The experimental results and the model sensitivity analysis consistently support the one-to-one mapping property of the self-inverse network. In the future, we will further explore the theoretical aspects of self-inverse network learning.
- [1] (2018) Augmented CycleGAN: learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151.
- [2] (2018) Analyzing inverse problems with invertible neural networks. arXiv preprint arXiv:1808.04730.
- [3] (2013) Signal recovery from pooling representations. arXiv preprint arXiv:1311.4025.
- [4] (2015) Using image synthesis for multi-channel registration of different image modalities. In Medical Imaging 2015: Image Processing, Vol. 9413, pp. 94131Q.
- [5] (2017) Photographic image synthesis with cascaded refinement networks. In IEEE International Conference on Computer Vision (ICCV), Vol. 1, pp. 3.
- [6] (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797.
- [7] (2017) Parseval networks: improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847.
- [8] (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
- [9] (2018) Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9290–9299.
- [10] (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782.
- [11] (2016) Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4829–4837.
- [12] (1999) Texture synthesis by non-parametric sampling. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2, pp. 1033–1038.
- [13] (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423.
- [14] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- [15] (2001) Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 327–340.
- [16] (2010) Image quality metrics: PSNR vs. SSIM. In 20th International Conference on Pattern Recognition (ICPR), pp. 2366–2369.
- [17] (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189.
- [18] (2017) Simultaneous super-resolution and cross-modality synthesis of 3D medical images using weakly-supervised joint convolutional sparse coding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6070–6079.
- [19] (2017) Image-to-image translation with conditional adversarial networks. CVPR.
- [20] (2018) i-RevNet: deep invertible networks. arXiv preprint arXiv:1802.07088.
- [21] (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
- [22] (2016) Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215.
- [23] (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
- [24] (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51.
- [25] (2019) DRIT++: diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270.
- [26] (2016) Robust single image super-resolution via deep networks with sparse prior. IEEE Transactions on Image Processing 25 (7), pp. 3194–3207.
- [27] (2017) Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pp. 700–708.
- [28] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
- [29] (2016) Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision 120 (3), pp. 233–255.
- [30] (2014) The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging 34 (10), pp. 1993–2024.
- [31] (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- [32] (2016) Visually indicated sounds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2405–2413.
- [33] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- [34] (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
- [35] (2016) Scribbler: controlling deep image synthesis with sketch and color. arXiv preprint arXiv:1612.00835.
- [36] (2019) Towards instance-level image-to-image translation. arXiv preprint arXiv:1905.01744.
- [37] (2016) Texture networks: feed-forward synthesis of textures and stylized images. In ICML, pp. 1349–1357.
- [38] (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [39] (2016) Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pp. 318–335.
- [40] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
- [41] (2018) Ultra-fast T2-weighted MR reconstruction using complementary T1-weighted information. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 215–223.
- [42] (2016) Colorful image colorization. In European Conference on Computer Vision, pp. 649–666.
- [43] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593.
- [44] (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pp. 465–476.