Stochastic First-Order Minimization Techniques Using Jensen Surrogates for X-Ray Transmission Tomography
Abstract
Image reconstruction in X-ray transmission tomography has been an important research field for decades. With data volumes increasing faster than processor speeds, accelerated iterative algorithms are needed to solve the optimization problem in the X-ray CT application. Incremental methods, in which a subset of the data is used at each iteration to accelerate the computations, have become popular lately in the machine learning and mathematical optimization fields. The most popular member of this family of algorithms in the X-ray CT field is ordered subsets. Even though it performs well in early iterations, its lack of convergence in later iterations is a known phenomenon. In this paper, we propose two incremental methods that use Jensen surrogates for the X-ray CT application, one stochastic and one of ordered-subsets type. Using measured data, we show that the stochastic variant we propose outperforms other algorithms, including its gradient descent counterparts.
Soysal Degirmenci^a, Joseph A. O’Sullivan^a, David G. Politte^b
^a Department of Electrical and Systems Engineering, Washington University,
1 Brookings Drive, Saint Louis, MO, USA, 63130
^b Mallinckrodt Institute of Radiology, Washington University School of Medicine,
510 South Kingshighway Blvd, Saint Louis, MO, USA, 63110
Soysal Degirmenci: Email: s.degirmenci@wustl.edu
Joseph A. O’Sullivan: Email: jao@wustl.edu
David G. Politte: Email: politted@wustl.edu
Keywords: X-ray imaging, stochastic gradient descent, ordered subsets, Jensen surrogates, incremental descent, stochastic average
1 Introduction
With the scale of computational inverse problems growing in many applications, including X-Ray Computed Tomography, acceleration techniques that provide fast rates of convergence have become crucial. One widely used acceleration technique is the family of incremental methods, a range decomposition technique [1]. In the general case, one forms a surrogate function around the current estimate using a subset of the data and computes the next estimate by minimizing it. The idea behind the acceleration is that the subset of data is a good approximation of the full data space, so computational time is saved by using only a portion of it. The most popular variants of incremental methods, ordered subsets and stochastic surrogate descent, perform well in early iterations but are known to lack convergence in later iterations. Lately, variants of incremental methods have proliferated in the optimization field, and the proposed algorithms have been shown to perform well with desirable convergence properties. The idea behind these methods is to keep track of the surrogate parameters for each subset and to use a combination of them when performing updates, rather than using only one of them.
Jensen surrogates are a possible choice of surrogate functions, with properties that result in algorithms with attractive convergence behavior and good performance [2, 3]. For X-ray CT, the family of algorithms that uses Jensen surrogates has parameters that are easy to compute. These algorithms can also be extended with acceleration methods, including range-based techniques.
In this paper, we:

explore stochastic incremental methods and propose two novel accelerated algorithms using Jensen surrogates for X-ray CT, Stochastic Average Jensen Surrogates Optimization (SAJS) and Ordered Subsets Average Jensen Surrogates Optimization (OSAJS), and

show that the SAJS algorithm outperforms the other competing algorithms for a large number of subsets, using two sets of real data collected from a baggage scanner.
2 Convex Optimization Using Jensen Surrogates for X-Ray CT
The objective function we minimize for monoenergetic X-ray transmission tomography is as follows.
(1) $\hat{\mu} = \operatorname*{argmin}_{\mu \succeq 0} \Phi(\mu)$
(2) $\Phi(\mu) = I(\mu) + \lambda R(\mu)$
where $I(\mu)$ is the data-fitting term and $R(\mu)$ is the regularization term. Here, we specifically look at the Poisson log-likelihood data-fitting term, where (dropping terms that do not depend on $\mu$)
(3) $I(\mu) = \sum_{i} \left( I_i e^{-[A\mu]_i} + d_i [A\mu]_i \right)$
$d$ is the attenuated data vector, $I$ is the incident photon count vector, and $A$ is the system matrix that defines the relationship between ray paths and voxels. In a simple ray-tracing model, $a_{ij}$ represents the length of intersection between the voxel indexed by $j$ and the ray path indexed by $i$. The nonnegativity constraint on the image is due to the physical nature of linear attenuation coefficients.
(4) $R(\mu) = \sum_{k} w_k\, \psi([C\mu]_k)$
where $w_k \ge 0$, $\psi$ is parameterized by a shape parameter $\delta > 0$, and $C$ is a matrix that has $1$s along the diagonal and only one off-diagonal $-1$ for each row. We further assume that for two arbitrary voxels, if there exists a row in $C$ where the first voxel has a value equal to $1$ and the second $-1$, there should exist another row where the reverse is true as well. The function $\psi$ is an edge-preserving function that is of Huber type: convex, even, and differentiable. The sparsity pattern of $C$ defines a neighborhood around each center voxel that we would like to use to regularize the image, and $w$ determines the weights of the corresponding neighborhood. This function behaves quadratically for small and linearly for large argument values. Since both $I$ and $R$ are convex and differentiable, we can use the same Jensen surrogate formulation for each of them to create an iterative minimization algorithm.
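To make the objective concrete, the following is a minimal numpy sketch of Eqs. (1)–(4) on a toy problem. The function names, the $3\times 2$ system matrix, and the specific Huber form are illustrative assumptions of ours, not the configuration used in the paper's experiments.

```python
import numpy as np

def poisson_nll(mu, A, d, I0):
    """Data-fitting term of Eq. (3): sum_i I0_i*exp(-[A mu]_i) + d_i*[A mu]_i
    (constants independent of mu are dropped)."""
    line_ints = A @ mu
    return float(np.sum(I0 * np.exp(-line_ints) + d * line_ints))

def huber(t, delta):
    """An illustrative Huber penalty: quadratic for |t| <= delta, linear beyond.
    The paper only requires psi to be convex, even, and differentiable."""
    a = np.abs(t)
    return np.where(a <= delta, 0.5 * t ** 2, delta * (a - 0.5 * delta))

def objective(mu, A, d, I0, C, w, lam, delta):
    """Eq. (2): Phi(mu) = I(mu) + lambda * R(mu), with R from Eq. (4)."""
    return poisson_nll(mu, A, d, I0) + lam * float(np.sum(w * huber(C @ mu, delta)))

# Toy setup (ours): 3 ray paths, 2 voxels, one +1/-1 difference row in C.
A = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]])
I0 = np.array([100.0, 80.0, 120.0])
mu_true = np.array([0.2, 0.4])
d = I0 * np.exp(-(A @ mu_true))          # noiseless counts via Beer's law
C = np.array([[1.0, -1.0]])
w = np.array([1.0])
phi = objective(mu_true, A, d, I0, C, w, lam=0.1, delta=0.05)
```

With noiseless data, the true image minimizes the Poisson term ray by ray, which provides a simple sanity check on the implementation.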
For the data-fitting term, $I$, we denote its Jensen surrogate around $\mu'$ parameterized by $\alpha$ as $\bar{I}(\mu; \mu', \alpha)$. We follow the same procedure as in [3], which gives the resulting surrogate (we ignore the constant terms):
(5) $\bar{I}(\mu; \mu', \alpha) = \sum_{i} \sum_{j} \alpha_{ij} \left[ I_i e^{-[A\mu']_i}\, e^{-\frac{a_{ij}}{\alpha_{ij}}(\mu_j - \mu'_j)} + d_i \left( [A\mu']_i + \frac{a_{ij}}{\alpha_{ij}} (\mu_j - \mu'_j) \right) \right]$
We choose $\alpha_{ij} = a_{ij}/Z$, where $Z \ge \max_i \sum_j a_{ij}$, which results in a surrogate function as follows.
(6) $\bar{I}(\mu; \mu') = \sum_{j} \left[ \frac{1}{Z}\, \hat{b}_j\, e^{-Z(\mu_j - \mu'_j)} + b_j\, \mu_j \right]$, where $b_j = \sum_i a_{ij} d_i$ and $\hat{b}_j = \sum_i a_{ij} I_i e^{-[A\mu']_i}$.
Here, it is important to note that when $\alpha_{ij} = a_{ij}\mu'_j/[A\mu']_i$, the surrogate does not have a closed-form minimizer and thus must be minimized using some convex minimization method. Lange, Fessler, et al. [4] use this auxiliary variable choice and solve the resulting one-dimensional problems via Newton's method.
For the regularization term we denote its Jensen surrogate around $\mu'$ as $\bar{R}(\mu; \mu')$. We choose $\alpha_{kj} = |c_{kj}|/2$ (since $|c_{kj}| \in \{0,1\}$ and there is exactly one element equal to $1$ and one element equal to $-1$ in each row of $C$, the denominator $\sum_j |c_{kj}|$ is equal to 2). Then the surrogate becomes (again ignoring the constant terms)
(7) $\bar{R}(\mu; \mu') = \sum_{j} \sum_{k \in \mathcal{N}_j} \frac{w_{jk}}{2}\, \psi\!\left( 2(\mu_j - \mu'_j) + (\mu'_j - \mu'_k) \right)$
where $\mathcal{N}_j$ is the set of indices that defines the neighborhood of index $j$ (those sets of indices are the only ones that are relevant to $\mu_j$), $w_{jk}$ is the corresponding neighborhood weight, and we used the fact that $\psi$ is an even function.
When the two surrogate functions are combined, we obtain our algorithm, Jensen Surrogates Optimization for X-ray CT, presented in Algorithm 1.
As in most iterative algorithms, this algorithm consists of three steps in each iteration. First, the current estimate in the image domain is forward projected to the data domain, with Beer's law of attenuation applied. Then, this estimate is back projected onto the image domain, where this information, as well as the back projection of the attenuated data and the previous image estimate, is used to compute the next estimate. In the case of no regularization ($\lambda = 0$), the surrogate can be minimized in one step; there is a closed-form update. For the regularized case, however, there is no closed-form update that minimizes the combined surrogate functions. Instead, we have independent one-dimensional convex problems that we can minimize in parallel, using any convex optimization method. In order to achieve fast convergence, we first attempted to use Newton's method; since each minimization is a one-dimensional problem, inversion of the Hessian is not an issue, and we take advantage of that fact. However, due to the characteristics of the functions being minimized, Newton's method diverged in some cases. We then attempted a trust region Newton's method [5], a modification of Newton's method in which each iteration defines a trust region that bounds the next iterate, making the algorithm more stable. A metric that measures how well the quadratic approximation matches the original function is computed at each iteration; depending on its value, we either "trust" the quadratic approximation more and expand the trust region, or trust it less and shrink the region. Unfortunately, the trust region method requires two function evaluations, one first derivative, one second derivative, and many comparisons per iteration.
In order to get a faster method that requires fewer computations per iteration and produces comparable performance, we developed a modified trust region method. This method takes advantage of the structure of the problem to construct a fixed trust region and requires only one first derivative, one second derivative, and two comparisons per iteration. More information about these methods can be found in Degirmenci [6].
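The flavor of such a fixed-trust-region scheme can be illustrated with a generic one-dimensional sketch. This is not the exact method of [6], but a simplified stand-in of ours: each Newton step is clipped to a fixed radius, which guards against the divergence observed with the plain Newton iteration.

```python
import math

def clipped_newton_1d(grad, hess, x0, radius, iters=50, tol=1e-10):
    """Minimize a smooth 1D convex function by Newton's method with a fixed
    trust radius: every step is clipped to [-radius, +radius]. A simplified
    stand-in (ours) for the modified trust region method described in the
    text; it needs one first derivative and one second derivative per
    iteration, plus the clipping comparisons."""
    x = x0
    for _ in range(iters):
        g, h = grad(x), hess(x)
        step = -g / h if h > 0.0 else (-radius if g > 0.0 else radius)
        x += max(-radius, min(radius, step))   # fixed trust region
        if abs(g) < tol:
            break
    return x

# Example (ours): minimize f(x) = exp(-x) + x/2, whose minimizer is ln(2);
# the exponential-plus-linear shape mimics the 1D transmission subproblems.
x_star = clipped_newton_1d(lambda t: 0.5 - math.exp(-t),
                           lambda t: math.exp(-t), x0=0.0, radius=1.0)
```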
As discussed above, when $\lambda = 0$, the image update at iteration $n$ becomes
(8) $\mu_j^{(n+1)} = \left[ \mu_j^{(n)} + \frac{1}{Z} \ln\!\left( \hat{b}_j^{(n)} / b_j \right) \right]_+$, with $b_j = \sum_i a_{ij} d_i$ and $\hat{b}_j^{(n)} = \sum_i a_{ij} I_i e^{-[A\mu^{(n)}]_i}$,
where $[\cdot]_+$ is the operator that projects onto the nonnegative orthant. Unfortunately, these types of "full" methods have a sublinear rate of convergence [6], which is very slow. Applications of X-ray imaging, such as baggage scanning (considered below) and medical imaging, require fast reconstructions with "good" resultant image volumes. In the next section, we look at several acceleration methods based on range decompositions, which are variants of Algorithm 1.
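A minimal numpy sketch of this closed-form update, assuming the multiplicative-log form of Eq. (8) with a scalar $Z$ bounding the row sums of $A$ (the toy data and names below are ours):

```python
import numpy as np

def jensen_update(mu, A, d, I0, Z, n_iters=1):
    """Closed-form Jensen surrogate update for lambda = 0, per Eq. (8):
    mu_j <- [mu_j + (1/Z) ln(bhat_j / b_j)]_+, where b is the back projection
    of the measured counts, bhat the back projection of the predicted counts,
    and Z >= max_i sum_j a_ij. eps guards rays/voxels with no data."""
    eps = 1e-12
    b = A.T @ d                              # back projection of measured data
    for _ in range(n_iters):
        q = I0 * np.exp(-(A @ mu))           # predicted counts (Beer's law)
        bhat = A.T @ q                       # back projection of predicted data
        mu = np.maximum(mu + np.log((bhat + eps) / (b + eps)) / Z, 0.0)
    return mu

# Toy check (ours): with noiseless data, the true image is a fixed point.
A = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]])
I0 = np.array([100.0, 80.0, 120.0])
mu_true = np.array([0.2, 0.4])
d = I0 * np.exp(-(A @ mu_true))
Z = A.sum(axis=1).max()                      # Z >= max_i sum_j a_ij
mu_fp = jensen_update(mu_true.copy(), A, d, I0, Z)
```

Because the surrogate majorizes the objective and touches it at the current estimate, each update is guaranteed not to increase the Poisson term.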
3 Acceleration Methods
When the amount of data is very large, a reasonable approach for reaching the optimal solution faster is to use a portion, or subset, of the data at each iteration to perform updates. These methods are generically called "incremental methods." In this section, we first present the popular incremental approach in X-ray CT, ordered subsets, using Jensen surrogates, and then propose our new algorithms, Stochastic Average and Ordered Subsets Average Jensen Surrogates Optimization for X-ray CT. For convenience, we assume that the data are split into $M$ subsets, where the set of source-detector indices of subset $m$ is denoted $S_m$.
3.1 Ordered Subsets Jensen Surrogates Optimization for X-Ray CT
Ordered subsets is a popular technique in X-ray CT in which the estimates are updated at each iteration by using a subset of the data in an ordered way. In other words, the choice of indices to be used is deterministic and follows a cyclic order. Since only a subset of the data is used, in order to keep the balance between the data-fitting and the regularization terms, the regularization parameter is scaled down by $M$, the number of subsets. The corresponding ordered subsets algorithm using Jensen surrogates is presented in Algorithm 2.
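The cyclic structure can be sketched as follows, reusing the $\lambda = 0$ closed-form step on one subset at a time. The interleaved partition and toy data are illustrative choices of ours; the paper does not prescribe a particular subset layout.

```python
import numpy as np

def ordered_subset_indices(n_rays, M):
    """Deterministic, cyclic partition of ray indices into M subsets via
    interleaving (a common choice, assumed here for illustration)."""
    return [np.arange(n_rays)[m::M] for m in range(M)]

def os_pass(mu, A, d, I0, Z, subsets):
    """One effective data pass of ordered-subsets updates with the lambda = 0
    closed-form Jensen step, applied one subset at a time in cyclic order.
    A global Z still majorizes each subset's surrogate (it just yields
    smaller steps than a per-subset bound would)."""
    eps = 1e-12
    for S in subsets:
        As = A[S]
        q = I0[S] * np.exp(-(As @ mu))       # predicted counts for this subset
        b, bhat = As.T @ d[S], As.T @ q      # subset back projections
        mu = np.maximum(mu + np.log((bhat + eps) / (b + eps)) / Z, 0.0)
    return mu

# Toy check (ours): noiseless data, so the true image is a fixed point
# of every subset update.
A = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]])
I0 = np.array([100.0, 80.0, 120.0])
mu_true = np.array([0.2, 0.4])
d = I0 * np.exp(-(A @ mu_true))
Z = A.sum(axis=1).max()
mu_os = os_pass(mu_true.copy(), A, d, I0, Z, ordered_subset_indices(3, M=2))
```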
This algorithm is known to work well in early iterations, when the estimate is "far away" from the optimum. In later iterations, however, it lacks convergence: the objective stops decreasing after some number of iterations. For the gradient descent case, Bertsekas [1] showed that, regardless of the number of iterations, this type of method never gets closer to the optimal function value than a certain positive constant.
3.2 Stochastic Average Jensen Surrogates Optimization for X-Ray CT
As seen in the previous section, when the ordered subsets technique is used, only the most recent back projection information is used to minimize a subset of the problem. In this section, we propose to store the back projection of the estimates for each subset, whether they are the most recent or not, and to use a combination of this back projection information to update the estimates at each iteration. The subset is chosen stochastically, so that each subset has probability $1/M$ of being chosen. This algorithm can be seen as an analogue of the Stochastic Average Gradient algorithm [7] that uses Jensen surrogates instead of Lipschitz quadratic surrogates for the X-ray CT application. It is also important to note that we use the previous-estimate information only for the data-fitting term; a similar approach could be used for the regularization term as well, but it would require the storage of $M$ additional images. Not only would this make the one-dimensional optimization that updates the estimates slower, but it would also increase the storage requirement. Experimentally, we found no advantage, so we do not include it here.
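The stochastic-average idea, for the unregularized case, can be sketched as below. This is a simplified illustration under our reading of the scheme, with names and toy data of our own: a stored back projection per subset, one subset refreshed per iteration, and the image updated from the sum over all stored terms.

```python
import numpy as np

def sajs_pass(mu, A, d, I0, Z, subsets, g, rng):
    """Sketch (ours) of the stochastic-average scheme for lambda = 0:
    g[m] stores the back projection of the predicted counts for subset m.
    Each iteration refreshes one subset chosen with probability 1/M and
    updates the image using the sum over ALL stored subsets."""
    eps = 1e-12
    b = A.T @ d                                  # back projection of measured data
    for _ in range(len(subsets)):                # one effective data pass
        m = int(rng.integers(len(subsets)))      # stochastic subset choice
        S = subsets[m]
        g[m] = A[S].T @ (I0[S] * np.exp(-(A[S] @ mu)))   # refresh subset m
        mu = np.maximum(mu + np.log((g.sum(axis=0) + eps) / (b + eps)) / Z, 0.0)
    return mu

# Toy check (ours): with the stored terms initialized consistently and
# noiseless data, the true image is a fixed point.
A = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]])
I0 = np.array([100.0, 80.0, 120.0])
mu_true = np.array([0.2, 0.4])
d = I0 * np.exp(-(A @ mu_true))
Z = A.sum(axis=1).max()
subsets = [np.array([0, 2]), np.array([1])]
g = np.stack([A[S].T @ d[S] for S in subsets])
mu_fp = sajs_pass(mu_true.copy(), A, d, I0, Z, subsets, g,
                  np.random.default_rng(0))
```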
Algorithm 3 presents the Stochastic Average Jensen Surrogates Optimization for X-Ray CT (SAJS). For a very large number of subsets, the requirement of storing $M$ back projected estimates in the image domain can be problematic if the memory capacity of the computing architecture is low. As seen in the image update, at each iteration we use the most recent sum of the back projected estimates. In a naive implementation, recomputing this sum over all $M$ subsets at every iteration would add substantially to the computational complexity. A better approach is to store the most recent sum as another variable and, when a subset is chosen, subtract that subset's old back projected estimate from the sum and add the newly computed one. This version requires only two extra image-sized operations per iteration and one additional image volume of storage, which is more reasonable than the naive implementation.
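The running-sum bookkeeping just described can be sketched in a few lines (class and variable names are ours):

```python
import numpy as np

class RunningBackprojSum:
    """Keep the sum of the per-subset back projections up to date in
    O(image size) per iteration by subtracting the stale term and adding
    the fresh one, instead of re-summing all M stored images."""
    def __init__(self, M, image_size):
        self.g = np.zeros((M, image_size))    # stored per-subset back projections
        self.total = np.zeros(image_size)     # running sum over subsets

    def update(self, m, new_g):
        self.total += new_g - self.g[m]       # subtract old, add new
        self.g[m] = new_g

# Usage (ours): visit subsets in an arbitrary order; the running sum always
# matches a full re-summation of the stored terms.
acc = RunningBackprojSum(M=4, image_size=6)
rng = np.random.default_rng(1)
for m in (2, 0, 2, 3, 1, 2):
    acc.update(m, rng.standard_normal(6))
```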
3.3 Ordered Subsets Average Jensen Surrogates Optimization for X-Ray CT
In addition to the stochastic choice of subsets presented in the previous section, one can also choose them in a cyclic manner, as in the ordered subsets method. The resulting algorithm, called Ordered Subsets Average Jensen Surrogates Optimization for X-Ray CT (OSAJS), is shown in Algorithm 4.
4 Results
In this section, we compare the proposed algorithms and the gradient descent variants using real data. Before presenting the results, we briefly explain the competing algorithms implemented.
The gradient descent method is a popular technique for iterative convex minimization. At each iteration, one forms a surrogate function around the current estimate that is equal to
(9) $\bar{\Phi}(\mu; \mu') = \Phi(\mu') + \langle \nabla\Phi(\mu'), \mu - \mu' \rangle + \frac{L}{2} \|\mu - \mu'\|_2^2$
When the function of interest is twice differentiable and has a Lipschitz-continuous gradient, this method has a guarantee of convergence when the quadratic scaling factor is chosen to be equal to $L$, the Lipschitz gradient constant of $\Phi$. (Using an arbitrary initial $L$ and decreasing it when the objective function starts increasing is a popular technique when this constant is not computable. This method, backtracking, can also be used with Jensen surrogates; for simplicity, this investigation is left as future work.) For our case, the gradient is equal to
(10) $\nabla I(\mu) = A^{T} \left( d - q(\mu) \right)$
(11) $\nabla R(\mu) = C^{T} \operatorname{diag}(w)\, \dot{\psi}(C\mu)$
where $q_i(\mu) = I_i e^{-[A\mu]_i}$ and $\dot{\psi}$ is applied elementwise. The Lipschitz gradient constant is computed using the following inequality on the Hessian,
(12) $\nabla^2 I(\mu) = A^{T} \operatorname{diag}(q(\mu))\, A \preceq A^{T} \operatorname{diag}(I)\, A$
(13) $\nabla^2 R(\mu) = C^{T} \operatorname{diag}\!\left(w \odot \ddot{\psi}(C\mu)\right) C \preceq \ddot{\psi}(0)\, C^{T} \operatorname{diag}(w)\, C$
(14) $\nabla^2 \Phi(\mu) \preceq A^{T} \operatorname{diag}(I)\, A + \lambda\, \ddot{\psi}(0)\, C^{T} \operatorname{diag}(w)\, C$
(since $0 \le q_i(\mu) \le I_i$ for $\mu \succeq 0$ and $\ddot{\psi}(t) \le \ddot{\psi}(0)$ for a Huber-type $\psi$), which in turn can be used to find $L$ as
(15) $L = \lambda_{\max}\!\left( A^{T} \operatorname{diag}(I)\, A + \lambda\, \ddot{\psi}(0)\, C^{T} \operatorname{diag}(w)\, C \right)$
where $\lambda_{\max}(\cdot)$ is the operator that returns the maximum eigenvalue of the given square matrix. When the system matrix is scaled down in the image domain, we experimentally found that $\lambda_{\max}$ follows a linear relationship with that of the fully sampled version. Thus, we computed it by finding $\lambda_{\max}$ for a scaled-down system matrix and then multiplying by the appropriate scale factor. For the regularization matrix part, we computed $\lambda_{\max}$ for a subset of the image volume (z-slices) to make it computationally tractable.
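One standard, matrix-free way to estimate such a maximum eigenvalue is power iteration; the sketch below (ours, with a toy system matrix) applies it to the data-fitting part of the Hessian bound without ever forming the matrix.

```python
import numpy as np

def max_eig_power(matvec, n, iters=200, seed=0):
    """Power iteration for the largest eigenvalue of a symmetric PSD operator
    available only as a matvec, so the (potentially huge) matrix is never
    formed. A standard way to estimate lambda_max for a system matrix."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = matvec(v)
        lam = float(v @ w)        # Rayleigh quotient estimate
        v = w / np.linalg.norm(w)
    return lam

# Toy data-fitting bound (ours): lambda_max(A^T diag(I0) A).
A = np.array([[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]])
I0 = np.array([100.0, 80.0, 120.0])
L = max_eig_power(lambda v: A.T @ (I0 * (A @ v)), n=2)
```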
Algorithm 5 presents the Gradient Descent Optimization for X-Ray CT (FullGD). For comparison, we also implement the ordered subsets (OSGD) and stochastic average (SAGD) variants of the gradient descent method, which are straightforward extensions of the base algorithm.
The acceleration methods have been investigated using two real-data scans acquired on a SureScan™ x1000 Explosive Detection System. For all the methods presented, the same regularization parameters $\lambda$ and $\delta$ were used. All reconstructed image volumes have the same number of rows and columns and a varying number of slices along the z-axis. In order to compare the rates of convergence, we look at the normalized function errors $(\Phi(\mu^{(n)}) - \Phi^{\ast})/\Phi^{\ast}$, where the objective function value at the optimum, $\Phi^{\ast}$, was computationally found by running the algorithms for many more iterations. We examine the performance of our algorithms with a relatively low number of subsets and with a large number of subsets.
We compare the proposed algorithms SAJS and OSAJS with the accelerated Jensen surrogates algorithm of [3] (FastJS) and with the gradient descent variants FullGD, OSGD, and SAGD.
Figures 1–3 show three slices each of the axial, sagittal, and coronal views of the reconstructed volume for Bag #1. Figure 4 presents the normalized function errors for the different algorithms versus the number of effective data passes for Bag #1 for the two subset settings, while Figure 5 presents the errors for Bag #2. As seen in the figures, for the smaller number of subsets, the FastJS algorithm slightly outperforms the proposed SAJS and OSAJS algorithms. For the larger number of subsets, the proposed SAJS algorithm outperforms the other competing algorithms.
It is also important to note that the proposed Jensen surrogate algorithms outperform their gradient descent equivalents in all cases. Another observation is that the cyclic subset choice in our averaging scheme (OSAJS) is competitive for the smaller number of subsets but performs worse than its stochastic counterpart (SAJS) for the larger number of subsets.
5 Conclusion
In this paper, we investigated incremental methods, proposed one stochastic and one deterministic algorithm that use Jensen surrogates, and compared these methods with other competing algorithms using data collected from two bags by a baggage scanner. The Stochastic Average variant (SAJS) performs competitively for a small number of subsets and outperforms the other competing algorithms for a large number of subsets in terms of the number of effective data passes. This is an indicator that it should perform better as the number of subsets increases. However, using a larger number of subsets requires additional storage and more image updates per effective data pass, both of which add computational cost. Thus, we plan to conduct a thorough time analysis of the proposed algorithms and to explore "hybrid" variants.
References
 [1] D. P. Bertsekas, “Incremental gradient, subgradient, and proximal methods for convex optimization: A survey,” Optimization for Machine Learning 2010, pp. 1–38, 2011.
 [2] J. A. O’Sullivan and J. Benac, “Alternating minimization algorithms for transmission tomography,” IEEE Transactions on Medical Imaging 26(3), pp. 283–297, 2007.
 [3] S. Degirmenci, D. G. Politte, C. Bosch, N. Tricha, and J. A. O’Sullivan, “Acceleration of iterative image reconstruction for x-ray imaging for security applications,” in IS&T/SPIE Electronic Imaging, pp. 94010C–94010C, International Society for Optics and Photonics, 2015.
 [4] K. Lange, J. Fessler, et al., “Globally convergent algorithms for maximum a posteriori transmission tomography,” IEEE Transactions on Image Processing 4(10), pp. 1430–1438, 1995.
 [5] J. Nocedal and S. Wright, Numerical optimization, Springer Science & Business Media, 2006.
 [6] S. Degirmenci, A General Framework of Large-Scale Convex Optimization Using Jensen Surrogates and Acceleration Techniques. PhD thesis, Washington University in St. Louis, 2016.
 [7] N. L. Roux, M. Schmidt, and F. R. Bach, “A stochastic gradient method with an exponential convergence rate for finite training sets,” in Advances in Neural Information Processing Systems, pp. 2663–2671, 2012.