Conjugate gradient based acceleration for inverse problems

# Conjugate gradient based acceleration for inverse problems

Sergey Voronin Christophe Zaroli Institut de Physique du Globe de Strasbourg, UMR 7516, Université de Strasbourg, EOST/CNRS, France Naresh P. Cuntoor Intelligent Automation Inc, Rockville, MD, USA
September 12, 2019
September 12, 2019
###### Abstract

The conjugate gradient method is a widely used algorithm for the numerical solution of a system of linear equations. It is particularly attractive because it allows one to take advantage of sparse matrices and produces (in case of infinite precision arithmetic) the exact solution after a finite number of iterations. It is thus well suited for many types of inverse problems. On the other hand, the method requires the computation of the gradient. Here difficulty can arise, since the functional of interest to the given inverse problem may not be differentiable. In this paper, we review two approaches to deal with this situation: iteratively reweighted least squares and convolution smoothing. We apply the methods to a more generalized, two parameter penalty functional. We show advantages of the proposed algorithms using examples from a geotomographical application and for synthetically constructed multi-scale reconstruction and regularization parameter estimation.

## 1 Introduction

Consider the linear system , where and . Often, in linear systems arising from physical inverse problems, we have more unknowns than data: [19] and the right hand side of the system corresponding to the observations or measurements is noisy. In such a setting, it is common to use regularization by introducing a constraint on the solution, both to account for the possible ill-conditioning of and noise in and for the lack of data with respect to the number of unknown variables in the linear system. A commonly used constraint is imposed via a penalty on the norm of the solution, either forcing the sum of certain powers of the absolute values of coefficients to be bounded or for many of the coefficients to be zero, i.e. sparsity: to require the solution to have few nonzero elements compared to the dimension of . To account for different kinds of scenarios, we consider here the generalized functional:

 Fl,p(x)=∥Ax−b∥ll+λ∥x∥pp=m∑i=1∣∣ ∣∣n∑j=1Aijxj−bi∣∣ ∣∣l+λn∑k=1|xk|p, (1.1)

for . For example, when , the familiar Tikhonov regularization is recovered. When and , we obtain the least squares problem with the convex regularizer, governed by the regularization parameter [5], commonly used for sparse signal recovery. In addition, (1.1) allows us to impose a non-standard penalty on the residual vector, . This is useful e.g. in cases, where we want to impose a higher penalty on any outliers which may be present in . (Since the case is commonly utilized, we denote ). For any , the map (for any ) is called the -norm on . For , the norm is called an norm and is convex. As , the right term of this functional approximates the count of nonzeros or the so-called “norm”:

 ∥x∥0=limp→0∥x∥p=limp→0(n∑k=1|xk|p)1/p.

For different values this measure is plotted in Figure 1.

The non-smoothness of the family of functionals complicates their minimization from an algorithmic point of view. The non-smooth part of (1.1) is due to the absolute value function or of , or both, depending on the values of and . Because the gradient of cannot be obtained when or are less than , different minimization techniques such as sub-gradient methods are frequently used [18]. For the convex case in (1.1), various thresholding based methods have become popular. A particularly successful example is the soft thresholding based method FISTA [2]. This algorithm is an accelerated version of a soft thresholded Landweber iteration [11]:

 xn+1=Sλ2(xn+ATb−ATAxn). (1.2)

The soft thresholding function [5] is defined by

 (Sλ(x))k=sgn(xk)max{0,|xk|−λ}, ∀k=1,…,n, ∀x∈Rn.

The scheme (1.2) is known to converge from some initial guess, but slowly, to the minimizer [5]. The thresholding in (1.2) is performed on , which is a very simple gradient based scheme with a constant line search [8]. The thresholding based schemes typically require many iterations to converge and this is costly (due to the many matrix-vector mults required) when is large. Moreover, the regularization parameter is often not known in advance. Instead, it is frequently estimated using a variant of the L-curve technique [10] together with a continuation scheme, where is iteratively decreased, reusing the previous solution as the initial guess at the next lower . This requires many iterations.

In this article, we discuss two approaches to obtaining approximate solutions to (1.1) using an accelerated conjugate gradient approach, with a specific focus on the case of large matrix , where the use of thresholding based techniques is expensive, due to the many iterations required. In contrast, our methods accomplish similar work in fewer iterations, because each iteration is more powerful than that of a thresholding scheme. We consider specifically the case since for , (1.1) is not convex. However, the minimization of non-smooth non-convex functions has been shown to produce good results in some compressive sensing applications [4] and our methods can be applied also to the non-convex case, as long as care is taken to avoid local minima.

The first approach is based on the conjugate gradient acceleration of the reweighted least squares idea developed in [20] with further developments and analysis given in [9]. The approach is based on a two norm approximation of the absolute value function:

 |xk|=x2k|xk|=x2k√x2k≈x2k√x2k+ϵ2

where in the rightmost term, a small is used, to insure the denominator is finite, regardless of the value of . Thus, at the -th iteration, a reweighted -approximation to the -norm of is of the form:

 ∥x∥1≈N∑k=1x2k√(xnk)2+ϵ2n=N∑k=1~wnkx2k

where the right hand side is a reweighted two-norm with weights:

 ~wnk=1√(xnk)2+ϵ2n.

It follows that is a close approximation to , which proceeds to as . Given the smooth approximation resulting from the reweighted two norm, the gradient acceleration idea is then built on top of the least squares approximations. A slight generalization of the weights makes the approach applicable to (1.1) with . With the aid of results from [16] and the assumption that the residuals are nonzero, we are able to extend the algorithm to the general case in (1.1).

The second approach is based on smooth approximations to the non-smooth absolute value function , computed via convolution with a Gaussian function, described in detail in [22]. Starting from the case , we replace the non-smooth objective function by a smooth functional , which is close to in value (as the parameter ). Since the approximating functional is smooth, we can compute its gradient vector and Hessian matrix . We are then able to use gradient based algorithms such as conjugate gradients to approximately minimize by working with the approximate functional and gradient pair. We also apply the approach to the general case (where we may have ) in (1.1), with the assumption that the residuals are nonzero.

In this paper, we describe the use of both acceleration approaches and the generalized functional, and give some practical examples from Geophysics and of wavelet based model reconstructions.

## 2 Iteratively Reweighted Least Squares

The iteratively reweighted least squares (IRLS) method, was originally presented in [6]. The algorithm presented here is from the work in [20]. Several new developments have recently emerged. In particular, [9] provides the derivations for the practical implementation of the IRLS CG scheme (without running the CG algorithm to convergence at each iteration) while [3] provides some stability and convergence arguments in the presence of noise. In this article, we survey the method and present the extension of the algorithm to (1.1), without going into the mathematical details for the convergence arguments, which are provided in the above references and can be extended to (1.1) with our assumptions. The basic idea of IRLS consists of a series of smooth approximations to the absolute value function:

 ∥x∥1≈N∑k=1x2k√(xnk)2+ϵ2n=N∑k=1~wnkx2k

where the right hand side is a reweighted two-norm with weights:

 ~wnk=1√(xnk)2+ϵ2n. (2.1)

In [21], the non-CG version of the IRLS algorithm is considered with the generalized weights:

 wnk=1[(xnk)2+ϵ2n]2−p2. (2.2)

This results in the iterative scheme:

 xn+1k=11+12λpwnk(xnk+(ATb)k−(ATAxn)k)fork=1,…,N, (2.3)

which converges to the minimizer of (1.1) for and . The iteratively reweighted least squares (IRLS) algorithm given by scheme (2.3) with weights (2.2) follows from the construction of a surrogate functional (2.1), as per Lemma 2.1 below.

###### Lemma 2.1

Define the surrogate functional:

 G(x,a,w,ϵ) = ∥Ax−b∥22−∥A(x−a)∥22+∥x−a∥22 (2.4) + ∑12λ(pwk((xk)2+ϵ2)+(2−p)(wk)pp−2)

where with . Then the minimization procedure defines the iteration dependent weights:

 wnk=1[(xnk)2+ϵ2n]2−p2. (2.6)

In addition, the minimization procedure , produces the iterative scheme:

 xn+1k=11+12λpwnk((xn)k−(ATAxn)k+(ATb)k). (2.7)

which converges to the minimizer of (1.1) for and .

The proof of the lemma is given in [21]. The construction of the non-increasing sequence is important for convergence analysis. On the other hand, different choices can be used for implementations. The auxiliary function is also used for the convergence proof and for the derivation of the algorithm. In particular, it is chosen to yield and to conclude the boundedness of the sequence of iterates . In the same paper, the FISTA style acceleration of the scheme is shown. In practice, however, a large number of iterations may be required for convergence and so the CG acceleration of the above scheme is of particular interest.

Acceleration via CG is accomplished by modifying the auxiliary functional (2.4). If we instead set,

 G(x,w,ϵ)=∥Ax−b∥22+λN∑k=1[pwk(x2k+ϵ2)+(2−p)wpp−2k],

then the two minimization problems:

 wn+1=argminwG(xn+1,w,ϵn+1);xn+1=argminxG(x,wn,ϵn)

give the same iteration dependent weights in (2.6) and the iterative scheme:

 xn+1=argminx{∥Ax−b∥22+12λpN∑k=1wnkx2k}. (2.8)

The details of convergence are given in [20]. The choice of the non-increasing sequence is again crucial for convergence analysis, although simpler choices (e.g. can be programmed in practice). The choice

 ϵn=min(ϵn−1,|G(xn−2,wn−2,ϵn−2)−G(xn−1,wn−1,ϵn−1)|γ2+αn), (2.9)

with and is taken for showing convergence to the minimizer in [20].

We now look more closely at the optimization problem in (2.8) above. In particular, notice that we can write:

 N∑k=112λpwnkx2k=N∑k=1(Dnkk)2x2k=∥Dnx∥22

where is an iteration dependent diagonal matrix with elements . This observation allows us to write the solution to the optimization problem at each iteration in terms of a linear system:

 xn+1=argminx{∥Ax−b∥22+∥Dnx∥22}⟹(ATA+(Dn)T(Dn))xn+1=ATb. (2.10)

In turn, the system in (2.10) can be solved using CG iterations. In practice, we need only an approximate solution to the above linear system at each iteration so the amount of CG iterations can be less than at each iteration . In [9], it is shown that this procedure (involving inexact CG solutions) gives convergence to the minimizer. Notice also that it is not necessary to build up the matrix. Instead, the matrix is applied to vectors at each iteration, which can be done using an array computation. In [20], the application of the Woodbury matrix identity is explored, which can aid in cases where IRLS is applied to short and wide or tall and thin matrices.

### 2.2 Application to generalized residual penalty

Next, we consider the application of the IRLS scheme to (1.1). In [16] an IRLS algorithm for the minimization of was developed. The generalized IRLS equations resulting from this problem are , where is a diagonal matrix with elements , where . This system was derived by setting the gradient to zero where it was assumed that for all . In this case [16]:

 ∂∂xkm∑i=1|ri(x)|l=m∑i=1lsgn(ri)|ri|l−1Aik=m∑i=1ril|ri|l−2Aik=[ATR(Ax−b)]k (2.11)

where we make use of . When , the familiar normal equations are recovered. Notice that for all is a strong but plausible assumption to make (for example, to enforce this, we can add a very small amount of additive noise to the right hand side; the effect of noise on stability and convergence of IRLS has been analyzed in [3]). However, the same cannot be said of e.g. for sparse solutions, where many components can equal zero. If, with the assumption for all , we use the surrogate functional,

 ~G(x,w,ϵ)=m∑i=1∣∣ ∣∣n∑j=1Aijxj−bi∣∣ ∣∣l+λN∑k=1[pwk(x2k+ϵ2)+(2−p)wpp−2k]

it then follows that the resulting algorithm for the minimization of (1.1) can be written as:

 (ATRnA+(Dn)T(Dn))xn+1=ATRnb (2.12)

with the same diagonal matrix as before and with the diagonal matrix with diagonal elements (for where , we can set the entry to with the choice of user controllable, tuned for a given application). As before, at each iteration , the system in (2.12) can be solved (approximately) using a few iterations of the CG algorithm (or some variant of the method, such as LSQR [13]). Below, we present the basic version of the IRLS CG algorithm. In practice, the parameter sequence for can be set to e.g. or the update in (2.9) can be used. Fixing for is common. Varying it may be useful in the case that or less than are chosen, resulting in a non-convex problem. For large problems it is important not to form and matrices explicitly. Instead vectors and can be used to hold their diagonal elements (i.e. ).

## 3 Approximate mollifier approach via convolution

In mathematical analysis, a smooth function is said to be a (non-negative) mollifier if it has finite support, is non-negative , and has area [7]. For any mollifier and any , define the parametric function by: , for all . Then is a family of mollifiers, whose support decreases as , but the volume under the graph always remains equal to one. We then have the following important lemma for the approximation of functions, whose proof is given in [7].

###### Lemma 3.1

For any continuous function with compact support and , and any mollifier , the convolution , which is the function defined by:

 (ψσ∗g)(t):=∫Rψσ(t−s)g(s)ds=∫Rψσ(s)g(t−s)ds, ∀t∈R,

converges uniformly to on , as .

### 3.1 Smooth approximation to the absolute value function

Motivated by the above results, we will use convolution with approximate mollifiers to approximate the absolute value function (which is not in ) with a smooth function. We start with the Gaussian function (for all ), and introduce the -dependent family:

 Kσ(t):=1σK(tσ)=1√2πσ2exp(−t22σ2),∀t∈R. (3.1)

This function is an approximate mollifier since strictly speaking, it does not have finite support. However, this function is coercive, that is, for any , as . In addition, we have that for all :

 ∫∞−∞Kσ(t)dt = 1√2πσ2∫Rexp(−t22σ2)dt=2√2πσ2∫∞0exp(−t22σ2)dt = 2√2πσ2∫∞0exp(−u2)√2σdu=2√2πσ2√2σ√π2=1.

Figure 2 below presents a plot of the function in relation to the particular choice . We see that and is very close to zero for . In this sense, the function is an approximate mollifier.

Let us now compute the limit . For , it is immediate that . For , we use l’Hôpital’s rule:

 limσ→0Kσ(t)=limσ→01√2πσ2exp(−t22σ2)=limγ→∞γ√2πexp(γ2t22)=1√2πlimγ→∞1γt2exp(γ2t22)=0,

with . We see that behaves like a Dirac delta function with unit integral over and the same pointwise limit. Thus, for small , we expect that the absolute value function can be approximated by its convolution with , i.e.,

 |t|≈ϕσ(t),∀t∈R, (3.2)

where the function is defined as the convolution of with the absolute value function:

 ϕσ(t):=(Kσ∗|⋅|)(t)=1√2πσ2∫∞−∞|t−s|exp(−s22σ2)ds,∀t∈R. (3.3)

We show in Proposition 3.3 below, that the approximation in (3.2) converges in the norm (as ). The advantage of using this approximation is that , unlike the absolute value function, is a smooth function.

Before we state the convergence result in Proposition 3.3, we express the convolution integral and its derivative in terms of the well-known error function [1].

###### Lemma 3.2

For any , define as in (3.3) Then we have that for all :

 ϕσ(t)= terf(t√2σ)+√2πσexp(−t22σ2), (3.4) ddtϕσ(t)= erf(t√2σ), (3.5)

where the error function is defined as:

 erf(t)=2√π∫t0exp(−u2)du∀t∈R.

The above lemma is proved in [22]. Next, using the fact that the error function satisfies the bounds:

 (3.6)

the following convergence result can be established:

###### Proposition 3.3

Let for all , and let the function be defined as in (3.3), for all . Then:

 limσ→0∥ϕσ−g∥L1=0.

The proof is again given in [22]. While (since as ), the approximation in the norm still holds. It is likely that the convolution approximation converges to in the norm for a variety of non-smooth coercive functions , not just for .

### 3.2 Approximation at zero

Note from (3.2) that while the approximation is indeed smooth, it is positive on and in particular , although does go to zero as . To address this, we can use different approximations based on which are zero at zero. Below, we describe several different alternatives which are possible. The first is formed by subtracting the value at :

 ~ϕσ(t)=ϕσ(t)−ϕσ(0)=terf(t√2σ)+√2πσexp(−t22σ2)−√2πσ. (3.7)

An alternative is to use where the subtracted term decreases in magnitude as becomes larger and only has much effect for close to zero. We could also simply drop the second term of to get:

 ^ϕσ(t)=ϕσ(t)−√2πσexp(−t22σ2)=terf(t√2σ) (3.8)

which is zero when . The behavior is plotted in Figure 2.

### 3.3 Gradient Computations and Algorithms

We now discuss algorithms for the approximate minimization of (1.1) using the ideas we have developed. In particular, we will focus on the case resulting in the one parameter functional:

 ~Fp(x)=∥Ax−b∥22+λ(n∑k=1|xk|p). (3.9)

We obtain the smooth approximation functional to :

 (3.10)

which is of similar form considered in detail in [22]. To compute the gradient of the smooth functional it is enough to consider the function . Taking the derivative with respect to yields:

 ∂∂xjϕσ(xj)p=pϕσ(xj)p−1ϕ′σ(xj)=pϕσ(xj)p−1erf(xj√2σ).

Taking the second derivative yields:

 ∂2∂xi∂xjGp(x)=0,

when and when we have:

 ∂2∂x2jGp(x) = ∂∂xj{[pϕσ(xj)p−1][erf(xj√2σ)]} = p(p−1)ϕσ(xj)p−2erf2(xj√2σ)+[pϕσ(xj)p−1]2√2√πσexp(−x2j2σ2).

The following results follow.

###### Lemma 3.4

Let be as defined in (3.10) where and . Then the gradient is given by:

 ∇Hp,σ(x)=2AT(Ax−b)+λp(→v(x)), (3.11)

and the Hessian is given by:

 ∇2Hp,σ(x)=2ATA+λpDiag(→w(x)), (3.12)

where the functions and are defined for all :

 →v(x):={v(xj)}nj=1= {ϕσ(xj)p−1erf(xj√2σ)}nj=1 →w(x):={w(xj)}nj=1= {(p−1)ϕσ(xj)p−2erf2(xj√2σ)+[ϕσ(xj)p−1]√2πσ2exp(−x2j2σ2)}nj=1.

Given and , we can apply a number of gradient based methods for the minimization of (and hence for the approximate minimization of ), which take the following general form:

Note that in the case of , the functional is not convex, so such an algorithm may not converge to the global minimum in that case. The generic algorithm above depends on the choice of search direction , which is based on the gradient, and the line search, which can be performed in several different ways.

### 3.4 Line Search Techniques

Gradient based algorithms differ based on the choice of search direction vector and line search techniques for parameter . In this section we describe some suitable line search techniques. Given the current iterate and search direction , we would like to choose so that:

 Hp,σ(xn+1)=Hp,σ(xn+μsn)≤Hp,σ(xn),

where is a scalar which measures how long along the search direction we advance from the previous iterate. Ideally, we would like a strict inequality and the functional value to decrease. Exact line search would solve the single variable minimization problem:

 ¯μ=argminμHp,σ(xn+μsn).

The first order necessary optimality condition (i.e., ) can be used to find a candidate value for , but it is not easy to solve the gradient equation. Instead, using the second order Taylor approximation of at any given , we have that

 n′(t)=n′(0)+tn′′(0)+o(t)≈n′(0)+tn′′(0) (3.13)

using basic matrix calculus:

 n′(t) = (∇Hp,σ(x+ts))Ts⟹n′(0)=∇Hp,σ(x)Ts n′′(t) = [(∇2Hp,σ(x+ts))Ts]Ts=sT∇2Hp,σ(x+ts)s⟹n′′(0)=sT∇2Hp,σ(x)s,

we get that if and only if

 μ=−∇Hp,σ(x)TssT∇2Hp,σ(x)s. (3.14)

An alternative approach is to use a backtracking line search to get a step size that satisfies one or two of the Wolfe conditions [12]. This update scheme can be slow since several evaluations of may be necessary, which are relatively expensive when the dimension is large. The Wolfe condition scheme also necessitates the choice of further parameters.

### 3.5 Nonlinear Conjugate Gradient Algorithm

We now present the conjugate gradient scheme in Algorithm 3, which can be used for sparsity constrained regularization. In the basic (steepest) descent algorithm, we simply take the negative of the gradient as the search direction. For nonlinear conjugate gradient methods, several different search direction updates are possible. We find that the Polak-Ribière scheme often offers good performance [14, 15, 17]. In this scheme, we set the initial search direction to the negative gradient, as in steepest descent, but then do a more complicated update involving the gradient at the current and previous steps:

 βn+1 = max{∇Hp,σn(xn+1)T(∇Hp,σn(xn+1)−∇Hp,σn(xn))∇Hp,σn(xn)T∇Hp,σn(xn),0}, sn+1 = −∇Hp,σn(xn+1)+βn+1sn.

One extra step we introduce in Algorithm 3 is a thresholding which sets small components to zero. That is, at the end of each iteration, we retain only a portion of the largest coefficients. This is necessary, as otherwise the solution we recover will contain many small noisy components and will not be sparse. In our numerical experiments, we found that soft thresholding works well when and that hard thresholding works better when . The component-wise soft and hard thresholding functions with parameter are given by:

 (3.15)

For , an alternative to thresholding at each iteration at is to use the optimality condition of the functional [5]. After each iteration (or after a block of iterations), we can evaluate the vector

 vn=AT(b−Axn). (3.16)

We then set the components (indexed by ) of the current solution vector to zero for indices for which .

Note that after each iteration, we also vary the parameter in the approximating function to the absolute value , starting with relatively far from zero at the first iteration and decreasing towards as we approach the iteration limit. The decrease can be controlled by a parameter so that . The choice worked well in our experiments. Comments on the computational cost relative to the FISTA algorithm are discussed in [22], where it is shown that the most expensive matrix-vector multiplication operations are present in both algorithms. On the other hand, the overhead with using CG iterations is significant and each iteration does take more time than that of a thresholding method. The algorithm is designed to be run for a small number of iterations.

In Algorithm 3, we present a nonlinear Polak-Ribière conjugate gradient scheme to approximately minimize [14, 15, 17]. The function in the algorithms which enforces sparsity refers to either one of the two thresholding functions defined in (3.15) or to the strategy using the vector in (3.16). The update rule for can be varied. In particular, it can again be tied to the distance between two successive iterates (e.g. ), although care must be taken not to make too small, which has the effect of introducing a nearly sharp corner. Another possibility, given access to both gradient and Hessian, is to use a higher order root finding method, such as Newton’s method [12] as discussed in [22].

### 3.6 Application to generalized residual penalty and wavelet representations

First, we comment on the application of the CONV CG method to (1.1). In this case, the change of variables gives the constrained minimization problem:

 miny,x{∥y∥ll+λ∥x∥pp}s.t.y=Ax−b.

This can be accomplished via e.g. an alternative variable minimization scheme combined with the Lagrange multiplier method, in which case we get the minimization problem:

 miny,x,s{∥y∥ll+λ∥x∥pp+sT(Ax−b−y)}

where is a vector of Lagrange multipliers. We can use the alternate minimization method to minimize with respect to each variable in a loop. Let us now assume that and that all . In this case, minimizing with respect to by setting the gradient to zero:

 l{yl−1