Optimal Transport, CycleGAN, and Penalized LS for Unsupervised Learning in Inverse Problems
Penalized least squares (PLS) is a classic approach to inverse problems, in which a regularization term is added to stabilize the solution. Optimal transport (OT) is another mathematical framework for computer vision tasks, which provides a means to transport one measure to another at minimal cost. The cycle-consistent generative adversarial network (cycleGAN) is a recent extension of GAN that learns target distributions with less mode-collapsing behavior. Although these approaches are similar in that no supervised training is required, the algorithms look different, so the mathematical relationship between them is not clear. In this article, we provide an important advance to unveil the missing link. Specifically, we reveal that a cycleGAN architecture can be derived as a dual formulation of the optimal transport problem if the PLS cost with a deep learning penalty is used as the transport cost between the two probability measures from measurements and unknown images. This suggests that cycleGAN can be considered a stochastic generalization of classical PLS approaches. Our derivation is so general that various types of cycleGAN architecture can be easily derived by merely changing the transport cost. As proofs of concept, this paper provides novel cycleGAN architectures for unsupervised learning in accelerated MRI and deconvolution microscopy problems, which confirm the efficacy and flexibility of the theory.
Inverse problems are ubiquitous in computer vision, biomedical imaging, scientific discovery, etc. In inverse problems, a noisy measurement $y$ from an unobserved image $x$ is modeled by
$$y = \mathcal{H}x + w,$$
where $w$ is the measurement noise and $\mathcal{H}$ is the measurement operator. In inverse problems originating from physics, the measurement operator is usually represented by an integral equation:
$$(\mathcal{H}x)(r) = \int h(r, r')\, x(r')\, dr'$$
for $r$ in the image domain, where $h(r, r')$ is an integral kernel. Then, the inverse problem is formulated as the estimation of the unknown $x$ from the measurement $y$. It is well known that such inverse problems are ill-posed. A classical strategy to mitigate the ill-posedness is the penalized least squares (PLS) approach:
$$\min_{x} \|y - \mathcal{H}x\|^2 + \lambda R(x),$$
where $\lambda > 0$ and $R$ is a regularization (or penalty) function such as the $\ell_1$ norm, total variation (TV), etc. (Chaudhuri et al., 2014; Sarder & Nehorai, 2006; McNally et al., 1999). In some inverse problems, the measurement operator is not well defined, so both the unknown operator and the image should be estimated.
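To make the PLS idea concrete, here is a minimal sketch (our illustration, not code from the paper) of penalized least squares for 1-D denoising, where the measurement operator is the identity and the penalty is a Tikhonov smoothness term $R(x) = \|Dx\|^2$ built from a finite-difference operator $D$; this quadratic choice admits a closed-form solution:

```python
import numpy as np

# Penalized least squares for 1-D denoising with a Tikhonov penalty
# R(x) = ||D x||^2, D a finite-difference operator.  The minimizer of
# ||y - x||^2 + lam * ||D x||^2 solves (I + lam * D^T D) x = y.
def pls_denoise(y, lam=5.0):
    n = len(y)
    D = np.diff(np.eye(n), axis=0)      # (n-1) x n first-difference operator
    A = np.eye(n) + lam * D.T @ D
    return np.linalg.solve(A, y)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
clean = np.sin(2 * np.pi * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)
denoised = pls_denoise(noisy, lam=5.0)

# The penalty trades data fidelity for smoothness: the reconstruction
# should be closer to the clean signal than the raw measurement is.
print(np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))
```

A larger $\lambda$ favors the penalty over data fidelity. The TV penalty mentioned above would replace the quadratic term with $\|Dx\|_1$ and require an iterative solver instead of a single linear solve.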
Recently, deep learning approaches with supervised training have become the main approaches to inverse problems because of their excellent and ultra-fast reconstruction performance. For example, in low-dose x-ray computed tomography (CT) denoising, a convolutional neural network (CNN) is trained to learn the relationship between noisy images and matched noiseless (or high-dose) label images (Kang et al., 2017). In the context of (3), the supervised neural network can be understood as directly learning the inverse mapping from the measurement to the unknown image. Unfortunately, in many applications, matched label data are not available for supervised training. Therefore, unsupervised training without matched reference data has become an important research topic.
Recently, the generative adversarial network (GAN) has attracted significant attention in the machine learning community by providing a way to generate a target data distribution from a random distribution (Goodfellow et al., 2014). In particular, the authors of (Arjovsky et al., 2017) proposed the so-called Wasserstein GAN (W-GAN), which is closely related to the mathematical theory of optimal transport (OT) (Villani, 2008; Peyré et al., 2019). In optimal transport, given two probability measures supported on two spaces, one pays a cost for transporting one measure to the other. The minimization of the average transportation cost then provides an unsupervised way of learning the transport map between the two measures. Unfortunately, these GAN approaches often generate artificial features due to mode collapsing, so cycleGAN (Zhu et al., 2017), which imposes a one-to-one correspondence, has been extensively investigated (Kang et al., 2019; Lu et al., 2017).
Although classical PLS, optimal transport, and cycleGAN share the commonality of unsupervised learning, which does not require matched training data, there is no mathematical theory that systematically links these seemingly different approaches. Therefore, one of the main contributions of this paper is to unveil the missing link between these methods. In particular, we reveal that a cycleGAN architecture can be derived as a dual formulation of the optimal transport problem when the cost function of PLS with a deep learning penalty is used as the transport cost between the two measures. Thus, cycleGAN can be considered a stochastic generalization of PLS. Moreover, this framework is so general that various cycleGAN architectures can be easily obtained by changing the PLS cost.
As a proof of concept, we provide novel cycleGAN architectures for unsupervised learning in two physical inverse problems: accelerated MRI and deconvolution microscopy. In these applications, we show that only a single CNN generator is required, which significantly reduces the training complexity. Experimental results confirm that the proposed unsupervised learning approaches can successfully provide accurate inversion results without any matched reference.
2 Related Works
2.1 Optimal Transport
A transport map $T:\mathcal{X}\to\mathcal{Y}$ transports a measure $\mu$ to $\nu$ if $\nu(B)=\mu(T^{-1}(B))$ for every measurable set $B$, which condition is often simply represented by $\nu = T_{\#}\mu$, where $T_{\#}$ is called the push-forward operator. Monge's original optimal transport problem (Villani, 2008; Peyré et al., 2019) is then to find a transport map $T$ that transports $\mu$ to $\nu$ at the minimum total transportation cost:
$$\min_{T:\, \nu = T_{\#}\mu}\; \int_{\mathcal{X}} c(x, T(x))\, d\mu(x).$$
However, this is usually computationally expensive due to the combinatorial nature of the assignment. Kantorovich relaxed the assumption to consider probabilistic transport, which allows mass splitting from a source toward several targets (Villani, 2008; Peyré et al., 2019). Specifically, Kantorovich introduced a joint measure $\pi(x, y)$ such that $d\pi(x, y)$ is the amount of mass transferred from $x$ to $y$ and $c(x, y)$ is the associated cost. Then, Kantorovich's relaxation is formulated as
$$\min_{\pi \in \Pi(\mu, \nu)}\; \int_{\mathcal{X}\times\mathcal{Y}} c(x, y)\, d\pi(x, y),$$
where $\Pi(\mu, \nu)$ denotes the set of joint measures whose marginals satisfy $\pi(A \times \mathcal{Y}) = \mu(A)$ and $\pi(\mathcal{X} \times B) = \nu(B)$
for all measurable sets $A \subseteq \mathcal{X}$ and $B \subseteq \mathcal{Y}$. Here, the last two constraints come from the observation that the total amount of mass removed from any measurable set has to equal the marginals (Villani, 2008; Peyré et al., 2019). Another important advantage of the Kantorovich formulation is its dual formulation, as stated in the following theorem:
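For finitely supported measures, Kantorovich's relaxation is a linear program over the coupling $\pi$. The sketch below (our illustration, assuming SciPy's `linprog` is available) solves the discrete problem directly from the marginal constraints:

```python
import numpy as np
from scipy.optimize import linprog

# Discrete Kantorovich OT: minimize <C, pi> over couplings pi >= 0 with
# row sums mu and column sums nu, solved as a linear program.
def kantorovich_ot(mu, nu, C):
    m, n = C.shape
    # Row-marginal constraints: sum_j pi[i, j] = mu[i]
    A_rows = np.zeros((m, m * n))
    for i in range(m):
        A_rows[i, i * n:(i + 1) * n] = 1.0
    # Column-marginal constraints: sum_i pi[i, j] = nu[j]
    A_cols = np.zeros((n, m * n))
    for j in range(n):
        A_cols[j, j::n] = 1.0
    res = linprog(C.ravel(), A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([mu, nu]), bounds=(0, None))
    return res.fun, res.x.reshape(m, n)

mu = np.array([0.5, 0.5])
nu = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])         # moving mass across locations costs 1
cost, plan = kantorovich_ot(mu, nu, C)
print(cost)  # the optimal plan keeps mass in place, so the cost is 0
```

Unlike Monge's formulation, the feasible set here is always nonempty (the product measure is a valid coupling), which is what makes the relaxation and its dual tractable.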
Theorem 1 (Kantorovich duality theorem).
(Villani, 2008, Theorem 5.10, pp. 57-59) Let $(\mathcal{X}, \mu)$ and $(\mathcal{Y}, \nu)$ be two Polish probability spaces (separable complete metric spaces) and let $c:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ be a continuous cost function such that $|c(x,y)| \leq c_{\mathcal{X}}(x) + c_{\mathcal{Y}}(y)$ for some $c_{\mathcal{X}} \in L^1(\mu)$ and $c_{\mathcal{Y}} \in L^1(\nu)$, where $L^1(\mu)$ denotes the set of functions integrable with respect to the measure $\mu$. Then, there is duality:
and the above maximum is taken over the so-called Kantorovich potentials $\varphi$ whose $c$-transform $\varphi^{c}$ is properly defined.
2.2 Penalized Least Square with Deep Learning Prior
Recently, model-based image reconstruction frameworks using a convolutional neural network (CNN) based prior have been extensively studied (Zhang et al., 2017; Aggarwal et al., 2018) thanks to their similarity to classical regularization theory. The main idea is to utilize a pre-trained neural network to stabilize the inverse solution. Specifically, the problem can be formulated as
where $Q$ is a pre-trained CNN with network parameter $\theta$ and input $y$. The minimization problem in (7) seeks a balance between the data fidelity term and the CNN output.
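As a toy illustration of this balance (assumptions ours: a quadratic penalty $\lambda\|x - q\|^2$ toward a fixed network output $q$, and a small linear measurement operator), the minimizer can be computed in closed form from the normal equations:

```python
import numpy as np

# PLS with a network prior: minimize ||y - H x||^2 + lam * ||x - q||^2,
# where q stands in for the output of a pre-trained CNN.  For a linear
# operator H the minimizer solves (H^T H + lam I) x = H^T y + lam q.
def pls_cnn_prior(H, y, q, lam=1.0):
    n = H.shape[1]
    A = H.T @ H + lam * np.eye(n)
    b = H.T @ y + lam * q
    return np.linalg.solve(A, b)

H = np.eye(3)                    # identity operator: pure denoising
y = np.array([1.0, 2.0, 3.0])
q = np.array([0.0, 0.0, 0.0])    # hypothetical "network output"
x = pls_cnn_prior(H, y, q, lam=1.0)
print(x)  # with lam = 1 and H = I, x lands halfway between y and q
```

The symbol `q` is a stand-in: in the frameworks cited above it would be produced by the pre-trained CNN, and $\lambda$ controls how strongly the solution is pulled toward it.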
3 Main Contributions
One of the basic assumptions in the classical PLS formulation is that the measurement $y$ is fixed and one is interested in finding the unknown $x$ under some prior distribution. In this section, we relax this assumption and consider unsupervised learning situations where both $x$ and $y$ are random samples from a joint probability measure. Interestingly, this leads to a cycleGAN architecture. Specifically, by considering $x$ and $y$ in (7) as random variables, we have the following reformulation of the cost:
where the first two arguments before the semicolon are the random vectors. Unlike (7), the network in (8) is not pretrained and needs to be learned. More specifically, the network and measurement-system parameters should be estimated. These parameters can be found by minimizing the average transport cost over all combinations of $x$ and $y$ with respect to the joint measure:
where the minimum is taken over all joint distributions whose marginal distributions with respect to $x$ and $y$ are $\mu$ and $\nu$, respectively. Below we show that the average transportation cost in (9) admits an interesting decomposition.
If the mapping is single-valued, then the average transportation cost in (9) can be decomposed as
From the definition of and , it is straightforward to show that
To show the reversed inequality, we need the classical results on optimal transport for restricted measures (Villani, 2008). Specifically, Theorem 2 in the Appendix shows that, for some optimal transportation plan, the restrictions retain optimality. Therefore,
This concludes the proof. ∎
The following proposition explains why we refer to the first term in (10) as the cycle-consistency term. Indeed, the term in (11) is a weighted version of the cycle-consistency term in cycleGAN (Zhu et al., 2017), which imposes that each domain is recovered after a round trip through both mappings. However, our formulation is more general, since there exist weighting factors that can balance the importance of the cycle-consistency terms.
where and are measurable functions such that -a.e., and -a.e.
By virtue of the definitions of the cost function and the sets and , we have
Furthermore, we use the disintegration theorem (Simmons, 2012) to split the joint measure as follows:
where -a.e. and -a.e. Then, we have
where denotes the indicator function for the set and . In a similar fashion, with , we have
This concludes the proof. ∎
Here, care should be taken regarding the measurable functions $f$ and $g$. For the case of supervised learning, there exist perfectly matched references as shown in Fig. 1(a), so the joint measure is fully supported on the graphs of the two mappings and the cycle-consistency terms are enforced with full weight. On the other hand, if the joint distribution is absolutely continuous as shown in Fig. 1(b), the weighting factors vanish and the cycle-consistency terms disappear. The most interesting case arises when the joint measure has singularities (see Fig. 1(c)). In fact, unsupervised learning addresses exactly this situation: once the mappings are found, a significant portion of the data in the two domains can be paired by the mappings, even though some unpaired data remain. This is when the cycle-consistency terms play an important role in unsupervised learning.
The remaining terms in (10), which correspond to the GAN terms, can be obtained as follows:
For , the cost in (12) can be equivalently represented as
where and are 1-Lipschitz Kantorovich potential functions.
This is just a simple corollary of the original proof of Kantorovich duality and the classical results of optimal transport on restricted measures. We start with the first term of (12). Specifically, we have
where the marginals are those of the restriction of the optimal transportation plan to the restricted set. Now, using Theorem 3 in the Appendix, we have
By collecting terms, we conclude the proof. ∎
Now, the final step is to implement the Kantorovich potentials using CNNs, i.e. to parameterize them as trainable neural-network discriminators. By collecting all terms of the transportation cost together, the dual formulation results in a cycleGAN cost function:
where the first term denotes the cycle-consistency loss in (13) and the second is the GAN loss given by:
Here, the Kantorovich 1-Lipschitz potentials correspond to the Wasserstein GAN discriminators. Specifically, one discriminator tries to distinguish the true image from the generated image, whereas the other attempts to detect fake measurement data generated by the synthetic measurement procedure. Furthermore, if the potentials are constant almost everywhere, they reduce to simple scaling factors for the cycle-consistency terms:
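The structure of the resulting cost can be sketched numerically. In the toy example below (our illustration, not the paper's implementation), the generator `G`, the known forward operator `H`, and the two discriminators are plain functions; the cycle-consistency terms use an $\ell_1$ distance and the GAN terms are Wasserstein critic differences:

```python
import numpy as np

# Single-generator cycleGAN cost: cycle-consistency enforces x ~ G(H(x))
# and y ~ H(G(y)); the WGAN terms score real samples against fakes using
# the discriminators as (approximate) Kantorovich potentials.
def cyclegan_cost(G, H, D_img, D_meas, x, y, lam=10.0):
    cycle = np.mean(np.abs(x - G(H(x)))) + np.mean(np.abs(y - H(G(y))))
    gan = (np.mean(D_img(x)) - np.mean(D_img(G(y)))
           + np.mean(D_meas(y)) - np.mean(D_meas(H(x))))
    return lam * cycle + gan

# With a perfect generator G = H^{-1} (H is a simple scaling here), the
# cycle loss vanishes; dummy zero discriminators kill the GAN terms too.
H = lambda x: 0.5 * x
G = lambda y: 2.0 * y
D = lambda z: np.zeros_like(z)
x = np.ones(4)
y = H(x)
cost = cyclegan_cost(G, H, D, D, x, y)
print(cost)  # -> 0.0
```

In practice `G` and the discriminators are CNNs trained adversarially, and when the forward operator is known and deterministic (as in the applications below) no second generator is needed.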
4 Applications to Inverse Problems
4.1 Accelerated MRI
In accelerated MRI, the goal is to recover high-quality MR images from sparsely sampled $k$-space data to reduce the acquisition time. This problem has been extensively studied using compressed sensing (Lustig et al., 2007), but recently deep learning approaches have become the main research interest due to their excellent performance at significantly reduced run-time complexity (Hammernik et al., 2018; Han & Ye, 2019). A standard method of neural network training for accelerated MRI is based on supervised learning, where MR images from fully sampled $k$-space data are used as references and subsampled $k$-space data are used as input to the neural network. Unfortunately, in accelerated magnetic resonance imaging (MRI), fully sampled high-resolution $k$-space data are very difficult to acquire due to the long scan time. Therefore, the need for unsupervised learning without matched reference data is increasing.
In accelerated MRI, the forward measurement model can be described as
$$\hat{y} = \mathcal{P}_{\Lambda}\mathcal{F}x,$$
where $\mathcal{F}$ is the 2-D Fourier transform and $\mathcal{P}_{\Lambda}$ is the projection onto $\Lambda$, the set of $k$-space sampling indices. To implement every step of the algorithm as image-domain processing, (20) can be converted to an image-domain forward model by applying the inverse Fourier transform:
$$y = \mathcal{F}^{-1}\mathcal{P}_{\Lambda}\mathcal{F}x.$$
This results in the following cost function for the PLS formulation:
Usually, the sampling mask is known, so the forward mapping for this inverse problem is deterministic.
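This deterministic image-domain forward model is straightforward to sketch (our illustration, not the paper's code): a binary mask keeps only the sampled $k$-space lines, and the inverse FFT maps the undersampled data back to an aliased image. Applying the model twice changes nothing, which is one way to see that it is a fixed projection rather than a learned generator:

```python
import numpy as np

# Image-domain MRI forward model y = F^{-1} P_Lambda F x: mask the
# k-space of the image and transform back.
def mri_forward(x, mask):
    kspace = np.fft.fft2(x)              # 2-D Fourier transform F
    return np.fft.ifft2(mask * kspace)   # F^{-1} P_Lambda F x

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64))

# Roughly 4x acceleration: keep every 4th phase-encoding line, plus a
# few low-frequency lines as a stand-in for the autocalibration (ACS)
# region (after fft2 without fftshift, low frequencies sit at the edges).
mask = np.zeros((64, 64))
mask[::4, :] = 1.0
mask[:2, :] = 1.0
mask[-2:, :] = 1.0
aliased = mri_forward(x, mask)

# P_Lambda is a projection, so the forward map is idempotent.
print(np.allclose(mri_forward(aliased, mask), aliased))
```

The mask pattern above is a hypothetical example; the experiments in the Appendix use random subsampling with a 4% ACS region.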
The schematic diagram of the associated cycleGAN architecture is illustrated in Fig. 2, and its generator and discriminator architectures are shown in Fig. 6 in the Appendix. Note that we only need a single generator, as the mapping from clean to aliased images is deterministic for a given sampling pattern. As for the loss function, we use (17) with the cycle-consistency term in (19). For the GAN implementation, we use the Wasserstein GAN in (18) with the Lipschitz penalty loss (Gulrajani et al., 2017) to enforce that the Kantorovich potentials are 1-Lipschitz. A detailed description of the training procedure is provided in the Appendix. The reconstruction results in Fig. 3 clearly show that the proposed unsupervised learning method using the dedicated cycleGAN successfully recovers fine details without matched references.
4.2 Deconvolution Microscopy
Deconvolution microscopy is extensively used to improve the resolution of widefield fluorescence microscopy. Recently, CNN approaches have been extensively studied as fast, high-performance alternatives. Unfortunately, CNN approaches usually require matched high-resolution images for supervised training. Mathematically, a blurred measurement can be described as
$$y = h \ast x + w,$$
where $h$ is the point spread function (PSF) and $\ast$ denotes convolution. If the PSF is not known, which is often referred to as the blind deconvolution problem, both the unknown PSF and the image should be estimated. This results in the following cost function for the PLS formulation:
The corresponding cycleGAN architecture is illustrated in Fig. 4. In contrast to conventional cycleGAN approaches that require two generators, the proposed cycleGAN approach needs only a single generator, and the blurred image generation can be done using a linear convolution layer corresponding to the PSF, which significantly improves the robustness of network training. A detailed description of the network architecture and training is given in the Appendix.
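The blur measurement model above is easy to sketch (our illustration, not the paper's code): convolving with a normalized PSF spreads a point source's energy while preserving total intensity, which is exactly the operation the single linear convolution layer in Fig. 4 has to learn:

```python
import numpy as np
from scipy.signal import fftconvolve

# Blur measurement model y = h * x (+ noise), with h the PSF.
def blur(x, psf):
    return fftconvolve(x, psf, mode="same")

# A Gaussian PSF as a hypothetical stand-in for the unknown microscope
# blur kernel; normalization makes it intensity-preserving.
g = np.exp(-0.5 * np.arange(-3, 4) ** 2 / 1.5)
psf = np.outer(g, g)
psf /= psf.sum()

x = np.zeros((32, 32))
x[16, 16] = 1.0          # point source
y = blur(x, psf)

# The blur spreads the point's energy, so the peak drops while the
# total intensity is (numerically) preserved.
print(y.max() < x.max(), abs(y.sum() - 1.0) < 1e-6)
```

In the blind setting, the weights of this convolution layer (the PSF) are learned jointly with the deblurring generator, since the true kernel is unknown.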
Fig. 5 shows lateral views of deconvolution results on microtubule samples by various methods. Here, the input images are degraded by blur and noise. Supervised learning and the standard cycleGAN with two generators showed better contrast and removed blur; however, the structural continuity was not preserved. In contrast, our cycleGAN approach successfully removed blur and noise while preserving the continuity of the structure.
5 Conclusion
In this paper, we presented a general design principle for cycleGAN architectures for various inverse problems. Specifically, the proposed architecture was derived as a dual formulation of an optimal transport problem that uses the penalized least squares cost as the transport cost between two measures supported on the measurement data and the images. As proofs of concept, we designed novel cycleGAN architectures for accelerated MRI and deconvolution microscopy, providing accurate reconstruction results without any matched reference data. Given the generality of our design principle, we believe that our method can be an important platform for unsupervised learning in inverse problems.
- Aggarwal et al. (2018) Hemant K Aggarwal, Merry P Mani, and Mathews Jacob. MoDL: Model-based deep learning architecture for inverse problems. IEEE transactions on medical imaging, 38(2):394–405, 2018.
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
- Chaudhuri et al. (2014) Subhasis Chaudhuri, Rajbabu Velmurugan, and Renu Rameshan. Blind deconvolution methods: A review. In Blind Image Deconvolution, pp. 37–60. Springer, 2014.
- Çiçek et al. (2016) Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, pp. 424–432. Springer, 2016.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in neural information processing systems, pp. 5767–5777, 2017.
- Hammernik et al. (2018) Kerstin Hammernik, Teresa Klatzer, Erich Kobler, Michael P Recht, Daniel K Sodickson, Thomas Pock, and Florian Knoll. Learning a variational network for reconstruction of accelerated MRI data. Magnetic resonance in medicine, 79(6):3055–3071, 2018.
- Han & Ye (2019) Yoseob Han and Jong Chul Ye. k-Space Deep Learning for Accelerated MRI. IEEE Transactions on Medical Imaging (in press), also available as arXiv preprint arXiv:1805.03779, 2019.
- Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134, 2017.
- Kang et al. (2017) Eunhee Kang, Junhong Min, and Jong Chul Ye. A deep convolutional neural network using directional wavelets for low-dose X-ray CT reconstruction. Medical Physics, 44(10), 2017.
- Kang et al. (2019) Eunhee Kang, Hyun Jung Koo, Dong Hyun Yang, Joon Bum Seo, and Jong Chul Ye. Cycle-consistent adversarial denoising network for multiphase coronary CT angiography. Medical Physics, 46(2):550–562, 2019.
- Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Lu et al. (2017) Yongyi Lu, Yu-Wing Tai, and Chi-Keung Tang. Conditional cyclegan for attribute guided face image generation. arXiv preprint arXiv:1705.09966, 2017.
- Lustig et al. (2007) Michael Lustig, David Donoho, and John M Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn. Reson. Med., 58(6):1182–1195, 2007.
- McNally et al. (1999) James G McNally, Tatiana Karpova, John Cooper, and José Angel Conchello. Three-dimensional imaging by deconvolution microscopy. Methods, 19(3):373–385, 1999.
- Peyré et al. (2019) Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
- Sarder & Nehorai (2006) Pinaki Sarder and Arye Nehorai. Deconvolution methods for 3-D fluorescence microscopy images. IEEE Signal Processing Magazine, 23(3):32–45, 2006.
- Simmons (2012) David Simmons. Conditional measures and conditional expectation; rohlin’s disintegration theorem. Discrete & Continuous Dynamical Systems-A, 32(7):2565–2582, 2012.
- Ulyanov et al. (2016) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
- Villani (2008) Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
- Zbontar et al. (2018) Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J Muckley, Mary Bruno, Aaron Defazio, Marc Parente, Krzysztof J Geras, Joe Katsnelson, Hersh Chandarana, et al. fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839, 2018.
- Zhang et al. (2017) Jiawei Zhang, Jinshan Pan, Wei-Sheng Lai, Rynson WH Lau, and Ming-Hsuan Yang. Learning fully convolutional networks for iterative non-blind deconvolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3817–3825, 2017.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232, 2017.
Appendix A Appendix
A.1 Additional Theorems and Lemmas
Theorem 2 (Optimality is inherited by restriction).
(Villani, 2008, Theorem 4.6, p. 46) Under the same assumptions as the Kantorovich duality theorem, let $\pi$ be an optimal transport plan and $\tilde{\pi}$ be a nonnegative measure on $\mathcal{X}\times\mathcal{Y}$ such that $\tilde{\pi} \leq \pi$ and $\tilde{\pi}[\mathcal{X}\times\mathcal{Y}] > 0$. Let $\tilde{\mu}$ and $\tilde{\nu}$ be the marginals of $\tilde{\pi}$. Then optimality is inherited by restriction, i.e., the normalized restriction is an optimal transference plan between $\tilde{\mu}$ and $\tilde{\nu}$.
Theorem 3 (Restriction for the Kantorovich duality).
(Villani, 2008, Theorem 5.19, pp. 75-76) Under the same assumptions as the Kantorovich duality theorem, let $\pi$ be an optimal transport plan, $\psi$ be an optimal potential attaining the maximum, and $\tilde{\pi}$ be a measure on $\mathcal{X}\times\mathcal{Y}$ such that $\tilde{\pi} \leq \pi$ and $\tilde{\pi}[\mathcal{X}\times\mathcal{Y}] > 0$. Let $\tilde{\mu}$ and $\tilde{\nu}$ be the marginals of $\tilde{\pi}$. Then there exists $\tilde{\psi}$ such that $\tilde{\psi} = \psi$ on $\mathrm{proj}(\mathrm{supp}\,\tilde{\pi})$, and $\tilde{\psi}$ solves the dual Kantorovich problem between $\tilde{\mu}$ and $\tilde{\nu}$.
We are now ready to provide our duality result for the restricted measure.
Then, we can replace with in the sense that both of them share the same optimal Kantorovich potential. More specifically, if is the optimal Kantorovich potential, which satisfies
where and are marginals of the restriction of optimal transference plan.
Let $\tilde{\pi}$ denote the restricted measure. Then $\tilde{\pi}$ is a nonnegative measure. When $\tilde{\pi}[\mathcal{X}\times\mathcal{Y}] = 0$, the statement trivially holds. Suppose $\tilde{\pi}[\mathcal{X}\times\mathcal{Y}] > 0$. Then, by the lemma, there exists $\tilde{\psi}$ that coincides with $\psi$ on $\mathrm{proj}(\mathrm{supp}\,\tilde{\pi})$ and solves the dual Kantorovich problem between the restricted marginals. ∎
A.2 Experimental Details
A.2.1 Accelerated MRI
We use the fastMRI challenge dataset (Zbontar et al., 2018) for our experiments. This dataset is composed of MR images of knees. We extracted 3500 MR images from the fastMRI single-coil validation set. Then, 3000 slices are used for training/evaluation, and 500 slices are used for testing. These MR images are fully sampled, so we generate undersampled images by randomly subsampling $k$-space lines. The acceleration factor is four, and the autocalibration signal (ACS) region contains 4% of the $k$-space lines. Each slice is normalized by the standard deviation of its magnitude. To handle the complex values of the data, we concatenate the real and imaginary parts along the channel dimension. Since each slice has a different size, we use a batch size of one. The images are center-cropped, and then the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values are calculated.
We use a U-Net generator to reconstruct fully sampled MR images from undersampled MR images, as shown in Fig. 6(a). Our generator consists of convolution, instance normalization, and leaky ReLU operations, together with skip connections and pooling layers. At the last convolution layer, we do not apply any of these operations. Our discriminator is the same as that of the original CycleGAN. We use PatchGAN (Isola et al., 2017), so the discriminator classifies inputs at patch scale. The discriminator also consists of convolution layers, instance normalization, and leaky ReLU operations, as shown in Fig. 6(b). We use the Adam optimizer to train our network, with momentum parameters and a learning rate of 0.0001. The discriminator is updated 5 times for every generator update. We use a batch size of 1 and trained our network for 100 epochs. Our code was implemented in TensorFlow.
A.2.2 Deconvolution Microscopy
The network architecture of the high-resolution image generator from low-resolution inputs is a modified 3D U-Net (Çiçek et al., 2016), as shown in Fig. 7. Our U-Net structure consists of contracting and expanding paths. The contracting path consists of repetitions of the following block: 3D convolution, instance normalization (Ulyanov et al., 2016), and ReLU. The generator has a symmetric configuration, so the encoder and decoder have the same number of layers. Throughout the network, the convolutional kernels have a fixed dimension. There is a pooling layer and a skip connection for every other convolution operation. To enhance the image contrast, we add an additional sigmoid layer at the end of the U-Net. On the other hand, the low-resolution image generator from high-resolution inputs is based on a single 3D convolution layer that models a 3D blurring kernel. The size of the 3D PSF modeling layer is chosen depending on the problem by considering the approximate PSF size.
As for the discriminators, we follow the original CycleGAN, which uses multi-PatchGANs (mPGANs) (Isola et al., 2017), where each discriminator takes input patches of a different size. As shown in Fig. 8, it consists of three independent discriminators, which take original-, half-, and quarter-size patches, respectively.
A total of 18 epi-fluorescence (EPF) microscopy images of tubulin were used for training, and one for validation. As for the unmatched sharp image volumes, we used deblurred images generated by the commercial software AutoQuant X3 (Media Cybernetics, Rockville). The EPF volume depth was increased to 64 using the reflection boundary condition. Due to GPU memory limitations, the EPF volumes were split into patches. For data augmentation, rotation, flipping, translation, and scaling were applied to the input patches. We normalized the patches to the range [0, 1]. The Adam optimizer (Kingma & Ba, 2014) was used for training. The learning rate was initially set to 0.0001 and decreased linearly after 40 epochs, and the total number of epochs was 200. We used a batch size of one.