Learning-Based Image Reconstruction via
Parallel Proximal Algorithm
Abstract
In the past decade, sparsity-driven regularization has led to significant advances in image reconstruction algorithms. Traditionally, such regularizers rely on analytical models of sparsity (e.g., total variation (TV)). However, more recent methods are increasingly centered around data-driven arguments inspired by deep learning. In this letter, we propose to generalize TV regularization by replacing the penalty with an alternative prior that is trainable. Specifically, our method learns the prior by extending the recently proposed fast parallel proximal algorithm (FPPA) to incorporate data-adaptive proximal operators. The proposed framework does not require additional inner iterations for evaluating the proximal mappings of the corresponding learned prior. Moreover, our formalism ensures that the training and reconstruction processes share the same algorithmic structure, making the end-to-end implementation intuitive. As an example, we demonstrate our algorithm on the problem of deconvolution in a fluorescence microscope.
1 Introduction
The problem of reconstructing an image from its noisy linear observations is fundamental in signal processing. Formulating the reconstruction as a linear inverse problem
(1) y = Hx + e,

the unknown image x ∈ R^N is computed from the measurements y ∈ R^M. Here, the matrix H ∈ R^{M×N} models the response of the acquisition device, while e ∈ R^M represents the measurement noise. In practice, the reconstruction often relies on the regularized least-squares approach:
(2) x̂ = argmin_x { (1/2)||y − Hx||² + τR(x) },

where R is a regularization functional that promotes some desired properties in the solution and τ > 0 controls the strength of the regularization.
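As a toy illustration of the regularized least-squares formulation in (2) (not the method of this letter), the following sketch uses the quadratic regularizer R(x) = (1/2)||x||², for which the minimizer has a closed form that can be checked against plain gradient descent; the problem sizes and matrices are arbitrary choices.

```python
import numpy as np

# Sketch: for R(x) = ||x||^2 / 2, problem (2) has the closed-form solution
# x_hat = (H^T H + tau*I)^{-1} H^T y, which lets us sanity-check a plain
# gradient-descent solver of the regularized least-squares cost.
rng = np.random.default_rng(0)
H = rng.standard_normal((20, 10))            # toy forward model
x_true = rng.standard_normal(10)
y = H @ x_true + 0.01 * rng.standard_normal(20)
tau = 1.0

x_closed = np.linalg.solve(H.T @ H + tau * np.eye(10), H.T @ y)

x = np.zeros(10)
step = 1.0 / (np.linalg.norm(H, 2) ** 2 + tau)   # 1/L for this smooth cost
for _ in range(5000):
    grad = H.T @ (H @ x - y) + tau * x           # gradient of the cost
    x = x - step * grad

assert np.allclose(x, x_closed, atol=1e-8)
```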
In most reconstruction schemes, an analytical prior model is used. One of the most popular regularizers for images is total variation (TV) [1], defined as R(x) = ||Dx||₁, where D is the discrete gradient operator. The TV functional is a sparsity-promoting prior (via the ℓ₁-norm) on the image gradient. Used in compressed sensing [2, 3], TV regularization has been central to inverse problems and successfully applied to a wide range of imaging applications [4, 5, 6, 7].
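The anisotropic TV seminorm can be sketched with forward finite differences (an assumed discretization convention):

```python
import numpy as np

# Minimal sketch of anisotropic TV: the ell-1 norm of the stacked
# horizontal and vertical finite differences of the image.
def tv_anisotropic(img):
    dh = np.diff(img, axis=1)   # horizontal differences
    dv = np.diff(img, axis=0)   # vertical differences
    return np.abs(dh).sum() + np.abs(dv).sum()

flat = np.ones((4, 4))
step_img = np.zeros((4, 4))
step_img[:, 2:] = 1.0           # a vertical edge of unit height

assert tv_anisotropic(flat) == 0.0      # constant images have zero TV
assert tv_anisotropic(step_img) == 4.0  # four unit jumps along the edge
```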
Two commonly used methods for performing TV-regularized image reconstruction are the (fast) iterative shrinkage/thresholding algorithm ((F)ISTA) [8] and the alternating direction method of multipliers (ADMM) [9]. These algorithms reduce the complex optimization problem to a sequence of simpler operations applied to the iterates. Both methods require evaluating the proximal mapping of the TV regularizer at each iteration [10]. This amounts to solving a denoising problem that does not depend on H and imposes piecewise-smoothness on the reconstruction [11].
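A generic iteration of this kind can be sketched as follows; here the proximal step is the simple ℓ₁ prox (soft-thresholding) rather than the TV prox, and all problem sizes are illustrative:

```python
import numpy as np

# Generic ISTA sketch: a gradient step on the data term followed by a
# proximal step on the regularizer. With the ell-1 prox, the scheme
# solves  min_x 0.5*||y - H x||^2 + tau*||x||_1.
def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(1)
H = rng.standard_normal((30, 15))
x_true = np.zeros(15)
x_true[[2, 7]] = [1.5, -2.0]                   # sparse ground truth
y = H @ x_true                                 # noiseless measurements
tau = 0.1
gamma = 1.0 / np.linalg.norm(H, 2) ** 2        # stepsize 1/L

x = np.zeros(15)
for _ in range(3000):
    x = soft(x - gamma * H.T @ (H @ x - y), gamma * tau)

assert np.linalg.norm(x - x_true) < 0.1        # close recovery of the support
```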
From a fundamental standpoint, the modular structure of the FISTA and ADMM algorithms separates the prior model (specified by the proximal operator) from the underlying physical model H. To develop more effective regularizers than TV, researchers have thus modified the proximal operators based on practical grounds (notably, the resulting mean-squared-error (MSE) performance) rather than analyticity. One class of algorithms called “plug-and-play” (PnP) [12, 13, 14, 15, 16] replaces the proximal step with powerful denoising techniques such as BM3D [17]. More recently, motivated by the success of neural networks [18] in image-analysis applications [19], learning-based methods have also been proposed for designing regularization strategies. One popular approach is to unfold a specific iterative reconstruction algorithm that is derived for a TV-like regularization and to consider a parametrized proximal step instead of a fixed one. By learning the parametrization coefficients in a data-driven fashion, such algorithms adapt the regularizer to the underlying properties (deterministic and/or stochastic) of the data [20, 21, 22, 23, 24].
The efficiency of designing trainable regularizers is primarily determined by the algorithm that is chosen to be unfolded. The major challenge is that many proximal operators, such as that of TV, do not admit closed-form solutions and require additional iterative solvers [8, 25]. This complication might limit the learning process to differentiable models [22]. Alternatively, ISTA-based schemes can be used without such confinements for learning simpler proximal operators [23]. Using variable-splitting [9], ADMM-based learning has addressed these proximal-related problems. However, the final reconstruction algorithm obtained by this formulation is efficient only for a restricted class of forward models due to the inherent properties of ADMM [21, 26]. Moreover, since variable-splitting introduces auxiliary variables, such methods also require more memory, which becomes a bottleneck for large-scale imaging problems [27].
In this letter, we propose a new learning-based image reconstruction method called the trainable parallel proximal algorithm (TPPA). Our algorithm extends the recently proposed fast parallel proximal algorithm (FPPA) [28] to its data-adaptive variant. At its core, FPPA uses a simple wavelet-domain soft-thresholding to compute the proximal of TV, eliminating the need for an additional iterative solver. Building upon this aspect, our framework: 1) efficiently learns a TV-type regularization by replacing the soft-thresholding function with a parametric representation that is then learned for a given data class; 2) is general and does not put any restrictions on the forward model H. We also show that the training and reconstruction processes share the same algorithmic structure, making TPPA’s end-to-end implementation very convenient. We apply the proposed method to the problem of deconvolution in fluorescence microscopy. Our results show that the learned regularization improves the deconvolution accuracy compared to TV and PnP models.
2 Mathematical Background
We start by discussing the fundamentals of the FPPA method. We then derive our method, which is a data-driven variant of FPPA.
2.1 FPPA for TV regularization
First, we provide some background on TV regularization via FPPA. The method uses wavelets to define (and generalize) the TV regularizer. To see this, we first define a transform W that consists of the gradient operator D, as well as an averaging operator A. The averaging operator computes pairwise averages of pixels along each dimension. We rescale both operators for notational convenience. Note that combining these operators makes W an invertible transform and it holds that WᵀW = I, which is not the case for D alone. However, note that WWᵀ ≠ I due to W being redundant [29].
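The average/difference construction and its redundancy can be illustrated in 1-D (a minimal sketch with assumed 1/√2 scaling; the letter works with the multidimensional analogue):

```python
import numpy as np

# Sketch: a single-level transform stacking pairwise averages and pairwise
# differences, each rescaled by 1/sqrt(2) so the combined operator is
# orthonormal. Stacking two shifted copies yields a redundant transform
# that still satisfies W^T W = I but not W W^T = I.
def haar_matrix(n):                      # n must be even
    W = np.zeros((n, n))
    for k in range(n // 2):
        W[k, 2 * k] = W[k, 2 * k + 1] = 1 / np.sqrt(2)   # scaled averages
        W[n // 2 + k, 2 * k] = 1 / np.sqrt(2)            # scaled differences
        W[n // 2 + k, 2 * k + 1] = -1 / np.sqrt(2)
    return W

W1 = haar_matrix(8)
assert np.allclose(W1.T @ W1, np.eye(8))        # invertible, orthonormal

W2 = W1 @ np.roll(np.eye(8), 1, axis=1)         # same transform, shifted pairing
Wu = np.vstack([W1, W2]) / np.sqrt(2)           # union of shifted transforms
assert np.allclose(Wu.T @ Wu, np.eye(8))        # W^T W = I still holds
assert not np.allclose(Wu @ Wu.T, np.eye(16))   # redundant: W W^T != I
```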
W can be rewritten as a union of four orthogonal transforms {W_i}, allowing W to be interpreted as the union of scaled and shifted Haar wavelets and scaling functions [30]. This viewpoint provides us with the central idea of FPPA, which recasts the TV regularizer by using the four orthogonal Haar transforms:
(3) 
where Λ is the set of all the detail (i.e., difference) coefficients of the transform. This relationship is then used to design the following updates at iteration t:
(4a) z_t = s_{t−1} − γH^T(Hs_{t−1} − y)
(4b) x_t = (1/4) Σ_{i=1}^{4} W_i^T η(W_i z_t; γτ)
(4c) s_t = x_t + ((q_{t−1} − 1)/q_t)(x_t − x_{t−1}),
where the scalar soft-thresholding function

(5) η(z; λ) = sgn(z) max(|z| − λ, 0)
is applied elementwise on the detail coefficients. As in the FISTA implementation of TV (TV-FISTA) [8], the parameters {q_t} are set as [31]

(6) q_t = (1 + √(1 + 4q_{t−1}²))/2
and q₀ = 1. Note that FPPA exploits the well-known connection between the Haar wavelet transform and TV, and it is closely related to a technique called cycle spinning [32, 33, 34, 35].
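The soft-thresholding function (5) and the momentum sequence (6) can be written compactly as follows (standard FISTA conventions assumed):

```python
import numpy as np

# The scalar soft-thresholding function and the Nesterov/FISTA momentum
# sequence q_t = (1 + sqrt(1 + 4*q_{t-1}^2)) / 2 with q_0 = 1.
def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def momentum_sequence(n):
    q = [1.0]
    for _ in range(n):
        q.append((1.0 + np.sqrt(1.0 + 4.0 * q[-1] ** 2)) / 2.0)
    return q

z = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
assert np.allclose(soft_threshold(z, 1.0), [-1.0, 0.0, 0.0, 0.0, 2.0])
assert abs(momentum_sequence(1)[1] - (1 + np.sqrt(5)) / 2) < 1e-12
```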
The convergence rate of FPPA is given by [28]
(7) 
where {x_t} are the iterates from (4), C is the true TV cost functional, and x̂ is a minimizer of C. This means that, for a constant stepsize γ, convergence can be established in a neighborhood of the optimum, which can be made arbitrarily small by letting γ → 0. Additionally, the global convergence rate of FPPA matches that of TV-FISTA [8].
As presented, FPPA works with a fixed regularizer such as TV. The idea and convergence of FPPA can be generalized to regularizers beyond TV by using other wavelet transforms and considering multiple resolutions.
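A 1-D sketch of an FPPA-style iteration is given below, assuming the standard proximal-gradient structure with two shifted Haar transforms (an illustrative simplification; the letter uses four transforms in the multidimensional setting):

```python
import numpy as np

# Gradient step on the data term, then an approximate TV prox obtained by
# soft-thresholding the detail coefficients of two shifted single-level
# Haar transforms and averaging, followed by a Nesterov momentum step.
def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def haar_shrink(z, lam, shift):
    zs = np.roll(z, shift)
    a = (zs[0::2] + zs[1::2]) / np.sqrt(2)      # scaling coefficients
    d = (zs[0::2] - zs[1::2]) / np.sqrt(2)      # detail coefficients
    d = soft(d, lam)                            # shrink details only
    out = np.empty_like(zs)
    out[0::2] = (a + d) / np.sqrt(2)            # orthonormal inverse
    out[1::2] = (a - d) / np.sqrt(2)
    return np.roll(out, -shift)

def fppa(y, H, lam, n_iter=200):
    gamma = 1.0 / np.linalg.norm(H, 2) ** 2
    x = np.zeros(H.shape[1]); s = x.copy(); q = 1.0
    for _ in range(n_iter):
        z = s - gamma * H.T @ (H @ s - y)                 # gradient step
        x_new = 0.5 * (haar_shrink(z, gamma * lam, 0)
                       + haar_shrink(z, gamma * lam, 1))  # parallel prox
        q_new = (1 + np.sqrt(1 + 4 * q * q)) / 2
        s = x_new + ((q - 1) / q_new) * (x_new - x)       # momentum
        x, q = x_new, q_new
    return x

# Denoising (H = I) of a piecewise-constant signal
n = 32
x_true = np.concatenate([np.zeros(16), np.ones(16)])
rng = np.random.default_rng(2)
y = x_true + 0.05 * rng.standard_normal(n)
x_hat = fppa(y, np.eye(n), lam=0.1)
assert np.mean((x_hat - x_true) ** 2) < np.mean((y - x_true) ** 2)
```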
3 Proposed Approach: Trainable Parallel Proximal Algorithm (TPPA)
We now present our method, which adapts the regularization to the data rather than being designed for a fixed one. Given H and y, we see that the shrinkage function solely determines the reconstruction. We have noted that the TV reconstruction is strictly linked to the soft-thresholding within the scheme outlined in (4). However, the efficiency of a shrinkage function varies with the type of object being imaged [36]. This necessitates revisiting FPPA to obtain a data-specific reconstruction algorithm.
Our model keeps W as the pairwise averages and differences, and considers an iteration-dependent sequence of shrinkage functions {φ_t^i} for each wavelet channel i. We adopt the following parametrization:
(8) φ_t^i(z) = Σ_k c_{t,k}^i ψ(z/Δ − k),
where {c_{t,k}^i} are the expansion coefficients and ψ is a basis function positioned on the grid ΔZ. We additionally reparametrize each stepsize γ_t with a scalar θ_t and a one-to-one function g:

(9) γ_t = g(θ_t).
This representation facilitates automatic tuning of the stepsizes while ensuring their nonnegativity. We note that the overall parametrization can be restricted to its iteration-independent counterpart (i.e., the same set of parameters for each iteration). Moreover, by appropriately constraining the parameters to lie in a well-characterized subspace, the convergence rate given in (7) can be preserved. However, such constraints are potentially restrictive on the reconstruction performance [37].
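The parametrization above can be sketched with a linear B-spline (hat) basis and an exponential stepsize map; both are illustrative choices, and the identity initialization mirrors the experimental setup described later.

```python
import numpy as np

# Shrinkage phi(z) = sum_k c_k * psi(z/Delta - k) with a hat-function basis,
# plus a positive stepsize map gamma = g(theta) = exp(theta) (an assumed
# choice of one-to-one positive function).
def psi(t):                                   # linear B-spline (hat function)
    return np.maximum(1.0 - np.abs(t), 0.0)

def shrinkage(z, c, delta, kmin):
    # c[j] is the coefficient attached to grid node (kmin + j) * delta
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    for j, cj in enumerate(c):
        out += cj * psi(z / delta - (kmin + j))
    return out

delta, kmin = 0.5, -4                         # nodes at -2.0, -1.5, ..., 2.0
nodes = (kmin + np.arange(9)) * delta
c_identity = nodes.copy()                     # phi(node) = node => identity
z = np.array([-1.7, -0.25, 0.0, 0.6, 1.9])
assert np.allclose(shrinkage(z, c_identity, delta, kmin), z)

gamma = np.exp(-1.0)                          # g(theta) = exp(theta) > 0
assert gamma > 0
```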
At iteration t, the TPPA updates are
(10a) z_t = s_{t−1} − γ_t H^T(Hs_{t−1} − y)
(10b) x_t = (1/4) Σ_{i=1}^{4} W_i^T φ_t^i(W_i z_t)
(10c) s_t = x_t + ((q_{t−1} − 1)/q_t)(x_t − x_{t−1}),
where the scaling factors are absorbed into the coefficients. In contrast to (4), TPPA uses a sequence of adjustable shrinkage functions for each wavelet channel, in addition to self-tuning the stepsize. More importantly, compared to similar approaches based on ADMM [26, 37], TPPA does not rely on H being a structured matrix (such as block-circulant) for computational efficiency.
3.1 Training of Model Parameters
We now consider determining our model parameters (i.e., the shrinkage functions and stepsizes) via offline training. Given a collection of training pairs {(x_ℓ, y_ℓ)}_{ℓ=1}^{L}, our goal is to learn the parameter vector θ, which collects the stepsize scalars {θ_t} and the vector c of expansion coefficients {c_{t,k}^i} over all iterations and wavelet channels. The total number of trainable parameters thus grows linearly with the number of iterations. We define the cost for parameter learning to be the mean-squared error (MSE) over the training data
(11) J(θ) = (1/L) Σ_{ℓ=1}^{L} ||x_T(y_ℓ; θ) − x_ℓ||²,
where x_T(y_ℓ; θ) is the output of (10) for a given measurement vector y_ℓ and set of parameters θ after a fixed number of iterations T. The learned parameters are thus obtained via
(12) θ* ∈ argmin_θ J(θ).
The implication of (12) is immediate: given a fixed computational cost (expectedly cheaper than that of TV regularization), the shrinkages are optimized to maximize the reconstruction accuracy over the training dataset.
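The training objective can be illustrated on a toy unfolded scheme with a single trainable threshold, a deliberately simplified stand-in for the full parameter set θ:

```python
import numpy as np

# Toy illustration of the training objective: the MSE between the
# T-iteration output of an unfolded reconstruction and the ground truth,
# averaged over training pairs. The "reconstructor" here is a trivial
# one-parameter scheme (H = I, one scalar threshold), not the full TPPA.
def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def reconstruct(y, theta, T=10):
    x = np.zeros_like(y)
    for _ in range(T):                 # gradient step (H = I) then shrink
        x = soft(x - (x - y), theta)
    return x

def training_cost(theta, pairs):
    return np.mean([np.sum((reconstruct(y, theta) - x) ** 2)
                    for x, y in pairs])

rng = np.random.default_rng(3)
pairs = []
for _ in range(20):                    # sparse signals in AWGN
    x = np.where(rng.random(64) < 0.1, rng.standard_normal(64), 0.0)
    pairs.append((x, x + 0.1 * rng.standard_normal(64)))

thetas = np.linspace(0.0, 0.5, 26)
costs = [training_cost(t, pairs) for t in thetas]
best = thetas[int(np.argmin(costs))]
assert 0.0 < best < 0.5    # a data-adapted threshold beats both extremes
```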
Note that the optimization problem in (12) is smooth, and hence first-order optimization methods are convenient. We use the gradient-descent algorithm with Nesterov’s acceleration scheme [31] (see Algorithm 1; the iterates of the parameters are denoted differently to distinguish them from the reconstruction iterates).
We now explain how the gradient of the cost function in (12) is derived. We denote the gradient by ∇J(θ) and rely on backpropagation [38] for obtaining its analytical expression. Here, we point out only the main aspects of our derivation, since the full calculations are lengthy. First, we define the residual term r_ℓ = x_T(y_ℓ; θ) − x_ℓ and use the chain rule to get:

(13) ∇J(θ) = (2/L) Σ_{ℓ=1}^{L} (∂x_T(y_ℓ; θ)/∂θ)^T r_ℓ.
Using matrix calculus, the partial derivatives of the updates in (10) with respect to the iterates are obtained directly; the derivatives associated with the remaining steps are computed similarly.
As for the derivatives with respect to the training parameters, we use the chain rule once again, with the basis functions entering through the matrix that maps the expansion coefficients to samples of the shrinkage. Considering these partial derivatives along with (13), we obtain the scheme described in Algorithm 2 to compute the backpropagation. Finally, we note that the reconstruction in (10) and Algorithms 1 and 2 essentially share the same structure (i.e., gradient descent with acceleration). This makes the proposed model convenient, since the computational implementation can be reused.
4 Numerical Results
We now present in silico experiments corroborating TPPA on the deconvolution of fluorescence microscopy images, where the point-spread function (PSF) of the microscope is approximated by a Gaussian kernel of variance 2 pixels. The imaging process is assumed not to be photon-limited; the noise is modeled as additive white Gaussian noise (AWGN) at 30 dB SNR.
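The simulated acquisition can be sketched as follows; the FFT-based circular convolution and the PSF support size are assumed conventions:

```python
import numpy as np

# Blur with a normalized Gaussian PSF of variance 2 pixels via FFT
# (circular convolution), then add AWGN scaled to a prescribed
# measurement SNR of 30 dB, i.e. 20*log10(||Hx|| / ||e||) = 30.
def gaussian_psf(size, var):
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    k = np.exp(-(X ** 2 + Y ** 2) / (2.0 * var))
    return k / k.sum()                          # unit-sum kernel

def blur(img, psf):
    return np.real(np.fft.ifft2(np.fft.fft2(img) *
                                np.fft.fft2(psf, s=img.shape)))

def awgn(sig, snr_db, rng):
    e = rng.standard_normal(sig.shape)
    e *= np.linalg.norm(sig) / (np.linalg.norm(e) * 10 ** (snr_db / 20.0))
    return sig + e

rng = np.random.default_rng(4)
img = rng.random((64, 64))
Hx = blur(img, gaussian_psf(9, 2.0))
y = awgn(Hx, 30.0, rng)

achieved = 20.0 * np.log10(np.linalg.norm(Hx) / np.linalg.norm(y - Hx))
assert abs(achieved - 30.0) < 1e-6              # SNR realized by construction
assert abs(gaussian_psf(9, 2.0).sum() - 1.0) < 1e-12
```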
Fluorescence microscopy images of human bone osteosarcoma epithelial cells (U2OS line) from [39] are used as our ground-truth data. The dataset consists of multicolor images in which each color channel depicts different fluorophores targeted to different organelles or cellular components; in our simulations, we use the blue channel, which targets the cell nucleus. All image intensities are scaled between 0 and 1. To generate the training pairs, we use 100 images and apply the forward model to a single patch (per image) of size 64×64 extracted around the center of the field of view. Once the images chosen for training are excluded, we select 20 different images of size 256×256 as a validation set.
Table 1: Average SNR (dB) of the deconvolution results for different PSF kernel sizes.

PSF Kernel Size | TV | PnP (using BM3D) | Proposed Method
5×5 | 21.99 | 24.69 | 24.96
9×9 | 20.77 | 22.33 | 22.57
Learning is carried out by using Algorithm 1 with 200 iterations. The number of layers for TPPA is set to T = 10. The shrinkage functions are parametrized by equally-spaced cubic B-splines over the dynamic range of the coefficients. All shrinkages are initialized with the identity operator.
As a baseline comparison, we consider TV regularization implemented using the FPPA described in (4). The algorithm is run until either 100 iterations are reached or a convergence tolerance on the iterates is satisfied. We also compare against the PnP model where the proximal of TV is replaced by BM3D [17]. The latter is implemented using 10 FISTA iterations (the same as the number of layers in TPPA), and all methods use a zero initialization. For each validation image, we optimize the regularization parameters of both baselines (by using an oracle) for the best-possible SNR performance. The average SNRs of the reconstructions are reported in Table 1 for different sizes of the blur kernel.
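The SNR figure of merit can be computed as follows (the standard definition is assumed here):

```python
import numpy as np

# Reconstruction SNR in dB: SNR(x_hat, x) = 20*log10(||x|| / ||x_hat - x||).
def snr_db(x_hat, x):
    return 20.0 * np.log10(np.linalg.norm(x) / np.linalg.norm(x_hat - x))

# A constant offset of 0.1 on a unit signal of length 100:
# ||x|| = 10, ||error|| = 1, giving 20 dB.
x = np.ones(100)
assert abs(snr_db(x + 0.1, x) - 20.0) < 1e-9
```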
The results show that the accuracy of our model exceeds that of the other algorithms considered. In particular, the SNR performance provided by TPPA is significantly better than that of TV. Furthermore, visual inspection of the reconstructions reveals that the TV deconvolution creates the characteristic blocky artifacts in textured regions (see Figure 1). Since the sequence of shrinkage functions is adapted to the underlying features of the training data, these artifacts are reduced for TPPA. Our method also renders the boundary of the nucleus more faithfully and provides a homogeneous background. These observations confirm the efficiency of our data-specific approach (in terms of deconvolution quality) and highlight its potential importance in practical scenarios.
5 Conclusion
We developed a learning-based algorithm for linear inverse problems that is in the spirit of TV regularization. Our approach, TPPA, has enabled us to move away from the soft-thresholding operator (with a fixed threshold value at all iterations) to a collection of parametrized shrinkage functions that are optimized (in the MSE sense) over a training set. Compared to TV regularization and the PnP technique, our deconvolution simulations demonstrate the advantages of TPPA in terms of accuracy.
References
 [1] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, no. 1–4, pp. 259–268, November 1992.
 [2] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
 [3] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, December 2005.
 [4] M. Persson, D. Bone, and H. Elmqvist, “Total variation norm for three-dimensional iterative reconstruction in limited view angle tomography,” Phys. Med. Biol., vol. 46, no. 3, pp. 853–866, 2001.
 [5] M. M. Bronstein, A. M. Bronstein, M. Zibulevsky, and H. Azhari, “Reconstruction in diffraction ultrasound tomography using nonuniform FFT,” IEEE Trans. Med. Imag., vol. 21, no. 11, pp. 1395–1401, November 2002.
 [6] M. Lustig, D. L. Donoho, and J. M. Pauly, “Sparse MRI: The application of compressed sensing for rapid MR imaging,” Magn. Reson. Med., vol. 58, no. 6, pp. 1182–1195, December 2007.
 [7] U. S. Kamilov, I. N. Papadopoulos, M. H. Shoreh, A. Goy, C. Vonesch, M. Unser, and D. Psaltis, “Optical tomographic image reconstruction based on beam propagation and sparse regularization,” IEEE Trans. Comp. Imag., vol. 2, no. 1, pp. 59–70, March 2016.
 [8] A. Beck and M. Teboulle, “Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems,” IEEE Trans. Image Process., vol. 18, no. 11, pp. 2419–2434, November 2009.
 [9] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “Fast image recovery using variable splitting and constrained optimization,” IEEE Trans. Image Process., vol. 19, no. 9, pp. 2345–2356, September 2010.
 [10] J. J. Moreau, “Proximité et dualité dans un espace hilbertien,” Bull. Soc. Math. France, vol. 93, pp. 273–299, 1965.
 [11] J. Cheng and B. Hofmann, Handbook of Mathematical Methods in Imaging. Springer, 2011, ch. Chapter 3: Regularization Methods for IllPosed Problems, pp. 87–109.
 [12] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-play priors for model based reconstruction,” in Proc. IEEE Global Conf. Signal Process. and Inf. Process. (GlobalSIP), Austin, TX, USA, December 3–5, 2013, pp. 945–948.
 [13] S. Sreehari, S. V. Venkatakrishnan, B. Wohlberg, G. T. Buzzard, L. F. Drummy, J. P. Simmons, and C. A. Bouman, “Plug-and-play priors for bright field electron tomography and sparse interpolation,” IEEE Trans. Comp. Imag., vol. 2, no. 4, pp. 408–423, December 2016.
 [14] S. H. Chan, X. Wang, and O. A. Elgendy, “Plug-and-play ADMM for image restoration: Fixed-point convergence and applications,” IEEE Trans. Comp. Imag., vol. 3, no. 1, pp. 84–98, March 2017.
 [15] U. S. Kamilov, H. Mansour, and B. Wohlberg, “A plug-and-play priors approach for solving nonlinear imaging inverse problems,” IEEE Signal Process. Lett., vol. 24, no. 12, pp. 1872–1876, December 2017.
 [16] Y. Romano, M. Elad, and P. Milanfar, “The little engine that could: Regularization by denoising (RED),” SIAM Journal on Imaging Sciences, vol. 10, no. 4, pp. 1804–1844, 2017.
 [17] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3D transform-domain collaborative filtering,” IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080–2095, August 2007.
 [18] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 28, 2015.
 [19] M. T. McCann, K. H. Jin, and M. Unser, “Convolutional neural networks for inverse problems in imaging: A review,” IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 85–95, 2017.
 [20] A. Barbu, “Training an active random field for real-time image denoising,” IEEE Trans. Image Process., vol. 18, no. 11, pp. 2451–2462, November 2009.
 [21] U. Schmidt and S. Roth, “Shrinkage fields for effective image restoration,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, June 23–28, 2014, pp. 2774–2781.
 [22] Y. Chen, W. Yu, and T. Pock, “On learning optimized reaction diffusion processes for effective image restoration,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 8–10, 2015, pp. 5261–5269.
 [23] U. S. Kamilov and H. Mansour, “Learning optimal nonlinearities for iterative thresholding algorithms,” IEEE Signal Process. Lett., vol. 23, no. 5, pp. 747–751, May 2016.
 [24] D. Mahapatra, S. Mukherjee, and C. S. Seelamantula, “Deep sparse coding using optimized linear expansion of thresholds,” 2017, arXiv:1705.07290 [cs.LG].
 [25] Z. Qin, D. Goldfarb, and S. Ma, “An alternating direction method for total variation denoising,” Optim. Method Softw., vol. 30, no. 3, pp. 594–615, 2015.
 [26] Y. Yang, J. Sun, H. Li, and Z. Xu, “Deep ADMM-Net for compressive sensing MRI,” in Adv. Neural Inf. Process. Syst. 29 (NIPS 2016), 2016, pp. 1–9.
 [27] N. Antipa, G. Kuo, R. Heckel, B. Mildenhall, E. Bostan, R. Ng, and L. Waller, “DiffuserCam: Lensless single-exposure 3D imaging,” Optica, vol. 5, no. 1, pp. 1–9, 2018.
 [28] U. S. Kamilov, “A parallel proximal algorithm for anisotropic total variation minimization,” IEEE Trans. Image Process., vol. 26, no. 2, pp. 539–548, February 2017.
 [29] M. Elad, P. Milanfar, and R. Rubinstein, “Analysis versus synthesis in signal priors,” Inverse Problems, vol. 23, no. 3, pp. 947–968, 2007.
 [30] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, 3rd ed. San Diego: Academic Press, 2009.
 [31] Y. E. Nesterov, “A method for solving the convex programming problem with convergence rate O(1/k^2),” Dokl. Akad. Nauk SSSR, vol. 269, pp. 543–547, 1983, (in Russian).
 [32] R. R. Coifman and D. L. Donoho, Springer Lecture Notes in Statistics. Springer-Verlag, 1995, ch. Translation-invariant denoising, pp. 125–150.
 [33] A. K. Fletcher, K. Ramchandran, and V. K. Goyal, “Wavelet denoising by recursive cycle spinning,” in Proc. IEEE Int. Conf. Image Proc. (ICIP’02), Rochester, NY, USA, September 22–25, 2002, pp. II.873–II.876.
 [34] G. Steidl, J. Weickert, T. Brox, P. Mrazek, and M. Welk, “On the equivalence of soft wavelet shrinkage, total variation diffusion, total variation regularization, and SIDEs,” SIAM J. Numer. Anal., vol. 42, no. 2, pp. 686–713, 2004.
 [35] U. S. Kamilov, E. Bostan, and M. Unser, “Variational justification of cycle spinning for waveletbased solutions of inverse problems,” IEEE Signal Process. Lett., vol. 21, no. 11, pp. 1326–1330, November 2014.
 [36] E. Bostan, U. S. Kamilov, M. Nilchian, and M. Unser, “Sparse stochastic processes and discretization of linear inverse problems,” IEEE Trans. Image Process., vol. 22, no. 7, pp. 2699–2710, July 2013.
 [37] H. Nguyen, E. Bostan, and M. Unser, “Learning convex regularizers for optimal Bayesian denoising,” IEEE Transactions on Signal Processing, to appear.
 [38] M. T. Hagan, H. B. Demuth, M. H. Beale, and O. De Jesús, Neural Network Design. USA: Martin Hagan, 2014.
 [39] M.-A. Bray et al., “A dataset of images and morphological profiles of 30,000 small-molecule treatments using the Cell Painting assay,” GigaScience, p. giw014, 2017.