Optimal Transport, CycleGAN, and Penalized LS for Unsupervised Learning in Inverse Problems

Optimal Transport, CycleGAN, and Penalized LS for Unsupervised Learning in Inverse Problems

Byeongsu Sim
&Gyutaek Oh
&Sungjun Lim
&Jong Chul Ye
Department of Mathematical Sciences, KAIST, Daejeon, Republic of Korea
Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea

Jong Chul Ye is the corresponding author of this paper. Email: jong.ye@kaist.ac.kr

The penalized least squares (PLS) is a classic approach to inverse problems, where a regularization term is added to stabilize the solution. Optimal transport (OT) is another mathematical framework for computer vision tasks by providing means to transport one measure to another at minimal cost. Cycle-consistent generative adversarial network (cycleGAN) is a recent extension of GAN to learn target distributions with less mode collapsing behavior. Although similar in that no supervised training is required, the algorithms look different, so the mathematical relationship between these approaches is not clear. In this article, we provide an important advance to unveil the missing link. Specifically, we reveal that a cycleGAN architecture can be derived as a dual formulation of the optimal transport problem, if the PLS with a deep learning penalty is used as a transport cost between the two probability measures from measurements and unknown images. This suggests that cycleGAN can be considered as stochastic generalization of classical PLS approaches. Our derivation is so general that various types of cycleGAN architecture can be easily derived by merely changing the transport cost. As proofs of concept, this paper provides novel cycleGAN architecture for unsupervised learning in accelerated MRI and deconvolution microscopy problems, which confirm the efficacy and the flexibility of the theory.


.tiffpng.pngconvert #1 \OutputFile \AppendGraphicsExtensions.tiff

1 Introduction

Inverse problems are ubiquitous in computer vision, biomedical imaging, scientific discovery, etc. In inverse problems, a noisy measurement from an unobserved image is modeled by


where is the measurement noise, and is the measurement operator. In inverse problems originated from physics, the measurement operator is usually represented by an integral equation:


for , where is an integral kernel. Then, the inverse problem is formulated as an estimation problem of the unknown from the measurement . It is well-known that an inverse problem is ill-posed. A classical strategy to mitigate the ill-posedness is the penalized least squares (PLS) approach:


for , where is a regularization (or penalty) function such as , total variation (TV), etc (Chaudhuri et al., 2014; Sarder & Nehorai, 2006; McNally et al., 1999). In some inverse problems, the measurement operator is not well defined so that both the unknown operator and the image should be estimated.

Recently, deep learning approaches with supervised training have become the main approaches to inverse problems because of their excellent and ultra-fast reconstruction performance. For example, in the low-dose x-ray computed tomography (CT) denoising problems, a convolutional neural network (CNN) is trained to learn the relationship between noisy image and the matched noiseless (or high dose) label images (Kang et al., 2017). In the context of (3), the supervised neural network can be understood as directly learning the operation . Unfortunately, in many applications, matched label data are not available for supervised training. Therefore, unsupervised training without matched reference data has become an important research topic.

Recently, generative adversarial network (GAN) has attracted significant attention in machine learning community by providing a way to generate target data distribution from random distribution (Goodfellow et al., 2014). In particular, the authors in (Arjovsky et al., 2017) proposed so-called Wasserstein GAN (W-GAN), which is closely related to the mathematical theory of optimal transport (OT) (Villani, 2008; Peyré et al., 2019). In optimal transport, for given two probability measures supported on the and spaces, one pays a cost for transporting one measure to another. Then, the minimization of the average transportation cost provides an unsupervised way of learning the transport map between the two measures. Unfortunately, these GAN approaches often generate artificial features due to the mode collapsing, so cycleGAN (Zhu et al., 2017) that imposes the one-to-one correspondency has been extensively investigated (Kang et al., 2019; Lu et al., 2017).

Although classical PLS, optimal transport, and cycleGAN share the commonality of unsupervised learning which does not require matched training data, there is no mathematical theory to systematically link these seemingly different approaches. Therefore, one of the main contributions of this paper is to unveil the missing link between these methods. In particular, we reveal that cycleGAN architecture can be derived as a dual formulation of the optimal transport problem when the cost function of PLS with a deep learning penalty is used as a transport cost between the two measures. Thus, cycleGAN can be considered as stochastic generalization of PLS. Moreover, this framework is so general that various cycleGAN architecture can be easily obtained by changing the PLS cost.

As a proof of concept, we provide novel cycleGAN architecture for unsupervised learning in two physical inverse problems: accelerated MRI, and deconvolution microscopy. In these applications, we show that only one single CNN generator is required, which significantly reduces the training complexity. Experimental results confirmed that the proposed unsupervised learning approaches can successfully provide accurate inversion results without any matched reference.

2 Related Works

2.1 Optimal Transport

Optimal transport compares two measures in a Lagrangian framework (Villani, 2008; Peyré et al., 2019). Formally, we say that transports to , if


which condition is often simply represented by where is often called the push-forward operator. Monge’s original optimal transport problem (Villani, 2008; Peyré et al., 2019) is then to find a transport map that transport to at the minimum total transportation cost:

subject to

However, this is usually computational expensive due to the nature of combinatorial assignment. Kantorovich relaxed the assumption to consider a probabilistic transport, which allows mass splitting from a source toward several targets (Villani, 2008; Peyré et al., 2019). Specifically, Kantorovich introduced a joint measure and the associated cost such that becomes the amount of mass transferred from to and is the associated cost. Then, the Kantorovich’s relaxation is formulated as

subject to

for all measurable set and . Here, the last two constraints come from the observation that the total amount of mass removed from any measurable set has to equal to the marginals (Villani, 2008; Peyré et al., 2019). Another important advantage of Kantorovich formulation is the dual formulation as stated in the following theorem:

Theorem 1 (Kantorovich duality theorem).

(Villani, 2008, Theome 5.10, p.57-p.59) Let and be two Polish probability spaces (separable complete metric space) and let be a continuous cost function, such that for some and , where denotes the set of 1-Lipschitz functions with the measure . Then, there is duality:

and the above maximum is taken over so-called Kantorovich potential whose c-transform is properly defined.

2.2 Penalized Least Square with Deep Learning Prior

Recently, model-based image reconstruction frameworks using a convolution neural network (CNN) based prior have been extensively studied (Zhang et al., 2017; Aggarwal et al., 2018) thanks to their similarities to the classical regularization theory. The main idea is to utilize a pre-trained neural network to stabilize the inverse solution. Specifically, the problem can be formulated as


for , where is a pre-trained CNN with the network parameter and the input . The minimization problem of (7) is to find a balance between data fidelity term and the CNN output.

3 Main Contributions

One of the basic assumptions in the classical PLS formulation is that the measurement is fixed and one is interested in finding unknown assuming some prior distribution. In this section, we relax the assumption and consider unsupervised learning situations where both of them are random samples from joint probability measure . Interesting, this leads to cycleGAN architecture. Specifically, by considering and in (7) as random variables, we have the following reformulation of the cost:


where the first two terms before the semi-colon are the random vectors. Unlike (7), the network in (8) is not pretrained and needs to be learned. More specifically, are the network and measurement system parameters that should be estimated. These parameters can be found by minimizing the average transport cost for all combinations of and with respect to the joint measure :


where the minimum is taken over all joint distribution whose marginal distribution with respect to and is and , respectively. Below we show that the average transportation cost in (9) has an interesting decomposition.

Lemma 1.

If the mapping is single-valued, then the average transportation cost in (9) can be decomposed as






From the definition of and , it is straightforward to show that

To show reversed inequality, we need the classical results on optimal transport for the restricted measure (Villani, 2008). Specifically, Theorem 2 in Appendix informs that for some optimal transportation plan , the restrictions and acquire optimality. Therefore,

This concludes the proof. ∎

The following proposition tells why we name the first term as . Indeed, the term in (11) is the weighted version of cycle consistency term in cycleGAN (Zhu et al., 2017), which imposes the constraint and . However, our formulation is more general, since there exists weighting factors that can balance the importance of the cycle-consistency terms.

Proposition 1.

where and are measurable functions such that -a.e., and -a.e.


By virtue of the definitions of the cost function and the sets and , we have


Furthermore, we use disintegration theorem (Simmons, 2012) to split the joint measure as follows:

where -a.e. and -a.e. Then, we have

where denotes the indicator function for the set and . In a similar fashion, with , we have

This concludes the proof. ∎

Here, care should be taken regarding the measurable functions and . For the case of supervised learning, there exists perfectly matched reference as shown in Fig. 1(a) so that -a.e. and -a.e. In this case, the . On the other hand, if the joint distribution is absolutely continuous as shown in Fig. 1(b), then -a.e. and -a.e. In this case, . Most interesting case arises when the joint measure has singularities on the set of (see Fig. 1(c)). In fact, the unsupervised learning cares about such situation where once the mappings are found, significant portion of the data in and can be paired with the mappings, even though there are still remaining unpaired data. This is when the cycle consistency terms plays the important role in unsupervised learning.

Figure 1: (a) Supervised learning with matched references, (b) optimal transport with absolute continuous measures, and (c) optimal transport having measures with singularities. CycleGAN corresponds to the scenario (c).

The remaining terms in (10), which corresponds to the GAN term, can be obtained as follows:

Proposition 2.

For , the cost in (12) can be equivalently represented as


where and are 1-Lipschitz Kantorovich potential functions.


This is a just simple corollary of the original Kantorovich’s duality formulation proof and the classical results of optimal transport on the restricted measure. We start with the first term of in (12). Specifically, we have

where and are marginals of the restriction of optimal transportation plan on the restricted set . Now, using Proposition 3 in Appendix, we have

for the non-restricted marginals and , where for the last equality we use for when is 1-Lipschitz (Villani, 2008). Using the same technique, the second term in (12) can be equivalently represented as

By collecting terms, we conclude the proof. ∎

Now, the final step is implementing the Kantorovich potential using CNNs with parameters and , i.e. and . By collecting all terms in the transportation cost together, the dual formulation results in a cycleGAN cost function:




where denotes the cycle consistent loss in (13) and is the GAN loss given by:


Here, the Kantorovich 1-Lipschitz potential and corresponds to the Wasserstein GAN discriminators. Specifically, tries to find the difference between the true image and the generated image , whereas attempts to find the fake measurement data that are generated by the synthetic measurement procedure . Furthermore, if and are constant almost everywhere, then they become simple scaling factors for the cycle-consistency terms:


Note that our cost formulation using (17) with (18) and (19) is more general compared to the standard cycleGAN, since a general form of measurement data generator can be used. In the following section, we will show how to choose depending on specific applications.

4 Applications to Inverse Problems

4.1 Accelerated MRI

In accelerated MRI, the goal is to recover high quality MR images from sparsely sampled -space data to reduce the acquisition time. This problem has been extensively studied using compressed sensing (Lustig et al., 2007), but recently deep learning approaches have been the main research interest due to its excellent performance at significantly reduced run time complexity (Hammernik et al., 2018; Han & Ye, 2019). A standard method for neural network training for accelerated MRI is based on the supervised learning, where the MR images from fully sampled -space data is used as reference and subsampled -space data is used as the input for the neural network. Unfortunately, in accelerated magnetic resonance imaging (MRI), high-resolution fully sampled -space data is very difficult to acquire due to the long scan time. Therefore, the need for unsupervised learning without matched reference data is increasing.

In accelerated MRI, the forward measurement model can be described as


where is the 2-D Fourier transform and is the projection to that denotes -space sampling indices. To implement every step of the algorithm as the image domain processing, (20) can be converted to the image domain forward model by applying the inverse Fourier transform:


This results in the following cost function for the PLS formulation:


Usually, the sampling mask is known so that the forward mapping for the inverse problem is deterministic.

Figure 2: Proposed cycleGAN architecture for accelerated MRI with 1-D downsampling patterns.
Figure 3: Unsupervised learning results for accelerated MRI using proposed cycleGAN. The values in the corner are PSNR/SSIM values.

The schematic diagram of the associated cycleGAN architecture is illustrated in Fig. 2, whose generator and discriminator architecture are shown in Fig. 6 in Appendix. Note that we just need as single generator, as the mapping from the clean to aliased images is deterministic for a given sampling pattern. As for the loss function, we use (17) with the for the cycle consistency term in (19). For GAN implementation, we use the Wasserstein GAN in (18) with Lipschitz penalty loss (Gulrajani et al., 2017) to enforce that the Kantorovich potential become 1-Lipschitz. The detailed description of the training procedure is provided in Appendix. The reconstruction results in Fig. 3 clearly showed that the proposed unsupervised learning method using dedicated cycleGAN successfully recovered fine details without matched references.

4.2 Deconvolution Microscopy

Deconvolution microscopy is extensively used to improve the resolution of the widefield fluorescent microscopy. Recently, CNN approaches have been extensively studied as fast and high performance alternatives. Unfortunately, the CNN approaches usually require matched high resolution images for supervised training. Mathematically, a blurred measurement can be described as


where is the point spread function (PSF). If the PSF is not known, which is often referred to as the blind deconvolution problem, both the unknown PSF and the image should be estimated. This results in the following cost function for the PLS formulation:


The corresponding cycleGAN architecture is illustrated in Fig. 4. In contrast to the conventional cycleGAN approaches that require two generators, the proposed cycleGAN approach needs only a single generator and the blur image generation can be done using a linear convolution layer corresponding to PSF, which significantly improves the robustness of network training. The detailed description of the network architecture and training are given in Appendix.

Figure 4: Proposed cycleGAN architecture with a blur kernel for deconvolution microscopy. Here, denotes the CNN-based low-resolution to high-resolution generator. The blur generator is composed of a linear blur kernel .

Fig. 5 show lateral views of deconvolution results of microtubule samples by various methods. Here, input images are degraded by blur and noises. The supervised learning and the standard cycleGAN with two generators showed better contrast and removed blur; however, the structural continuity was not preserved. On the other hand, in our cycleGAN approach, blurs and noise were successfully removed and preserved the continuity of the structure.

Figure 5: Comparison of reconstruction results by various methods: (a) Blurred image measurements, (b) supervised learning, (c) the conventional cycleGAN, and (d) the proposed cycleGAN. The ROIs (marked yellow) show the area for the enlarged parts.

5 Conclusions

In this paper, we presented a general design principle for cycleGAN architecture for various inverse problems. Specifically, the proposed architecture was derived as a dual formulation of an optimal transport problem that uses the penalized least squares cost as the transport cost between two measures supported on the measurement data and the images. As proofs of concept, we designed novel cycleGAN architecture for accelerated MRI and deconvolution microscopy examples, providing accurate reconstruction results without any matched reference data. Given the generality of our design principle, we believe that our method can be an important platform for unsupervised learning for inverse problems.


Appendix A Appendix

a.1 Additional Theorems and Lemmas

Theorem 2 (Optimality is inherited by restriction).

(Villani, 2008, Theorem 4.6, p.46) Under the same assumption of Kantorovich duality theorem, let be an optimal transport plan, and be a nonnegative measure on , such that and . Let , be the maginals of . Then optimality is inherited to its restriction, i.e., is an optimal transference plan between and .

Theorem 3 (Restriction for the Kantorovich duality).

(Villani, 2008, Theorem 5.19, p.75-p.76) Under the same assumption of Kantorovich duality theorem, let be an optimal transport plan and be an optimal potential that attain the maximum, and be a measure on , such that and . Let , and be the marginals of . Then there exists such that on projsupp, and solves the dual Kantorovich problem between and .

We are now ready to provide our duality result for the restricted measure.

Proposition 3.


Then, we can replace with in the sense that both of them share the same optimal Kantorovich potential. More specifically, if is the optimal Kantorovich potential, which satisfies


where and are marginals of the restriction of optimal transference plan.


Let . Then, is nonnegative measure on . When , , and , so the statement trivially holds. Suppose . Then by lemma, there exists such that on projsupp, and solves the dual Kantorovich problem between and . ∎

a.2 Experimental Details

a.2.1 Accelerated MRI

We use dataset for fastMRI challenge (Zbontar et al., 2018) for our experiments. This dataset is composed of MR images of knees. We extracted 3500 MR images from fastMRI single coil validation set. Then, 3000 slices are used for training/evaultion, and 500 slices are used for test. These MR images are fully sampled images, so we make undersampled images by a randomly subsampling k-space lines. The acceleration factor is four, and autocalibration signal (ACS) region contains 4% of k-space lines. Each slice is normalized by standard deviation of the magnitude of each slice. To handle complex values of data, we concatenate real and imaginary values along the channel dimension. Each slice has different size, so we use only single batch. The images are center-cropped to the size of , and then the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values are calculated.

We use U-Net generator to reconstruct fully sampled MR images from undersampled MR images as shown in Fig. 6(a). Our generator consists of convolution, Instance normalization, and leaky ReLU operation. Also, there are skip-connection and pooling layers. At the last convolution layer, we do not use any operation. Our discriminator is same as the discriminator of original CycleGAN. We use PatchGAN (Isola et al., 2017), so the discriminator classifies inputs at patch scales. The discriminator also consists of convolution layer, instance normalization, and leaky ReLU operation as shown in Fig. 6(b). We use Adam optimizer to train our network, with momentum parameters , and learning rate of 0.0001. The discriminator is updated 5 times for every generator updates. We use batch size of 1, and trained our network during 100 epochs. Our code was implemented by TensorFlow.



Figure 6: Proposed network architecture for (a) generator and (b) discriminator.
Figure 7: A modified 3D U-net architecture for our high-resolution image generator.
Figure 8: Multi-PatchGANs discriminator architecture.

a.2.2 Deconvolution microscopy

The network architecture of the high resolution image generator from the low-resolution image is a modified 3D-Unet (Çiçek et al., 2016) as shown in Fig 7. Our U-net structure consists of contracting and expanding paths. The contracting path consists of the repetition of the following blocks: 3D conv- Instance Normalization (Ulyanov et al., 2016)- ReLU. Here, the generator has symmetric configuration so that both encoder and decoder have the same number of layers, i.e. . Throughout the network, the convolutional kernel dimension is . There exists a pooling layer and skipped connection for every other convolution operations. To enhance the image contrast, we add additional sigmoid layer at the end of U-Net. On the other hand, the low-resolution image generator from high resolution input is based on a single 3D convolution layer that models a 3D blurring kernel. The size of the 3D PSF modeling layer is chosen depending on the problem set by considering their approximate PSF sizes. In this paper, the size of the 3D PSF layer is set to .

As for the discriminators, we follow the original CycleGAN that uses multi-PatchGANs (mPGANs) (Isola et al., 2017), where each discriminator has input patches with different sizes used. As shown in Fig 8, it consist of three independent discriminators. Each discriminator takes patches at different sizes: original, and half, and quarter size patches.

A total 18 epi-fluorescent (EPF) microscopy images of tubulin with a size of were used for training, and one for validation. As for unmatched sharp image volume, we used deblurred image generated by utilizing a commercial software AutoQuant X3 (Media Cybernetics, Rockville). The EPF volume depth was increased to 64 using the reflection boundary condition. Due to GPU memory limitations, the EPF volume was split into size patches. For data augmentation, rotation, flip, translation, and scale were imposed on the input patches. We normalized the patches and set them to [0,1]. Adam optimizer (Kingma & Ba, 2014) was also used for training. The learning rate was initially set to 0.0001, which is decreased linearly after 40 epoch, and the total number of epoch was 200 epoch. For the optimizer, we used only a single batch.

Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description