# Compressive Conjugate Directions: Linear Theory

†This work was supported by the Stanford Exploration Project.

###### Abstract

We present a powerful and easy-to-implement iterative algorithm for solving large-scale optimization problems that involve $\ell_1$/total-variation (TV) regularization. The method is based on combining the Alternating Directions Method of Multipliers (ADMM) with a Conjugate Directions technique in a way that allows reusing conjugate search directions constructed by the algorithm across multiple iterations of the ADMM. The new method achieves fast convergence by trading off multiple applications of the modeling operator for the increased memory requirement of storing previous conjugate directions. We illustrate the new method with a series of imaging and inversion applications.

Key words. $\ell_1$-regularization, total-variation regularization, regularized inversion, ADMM, method of multipliers, Conjugate Gradients, compressive conjugate directions

AMS subject classifications. 65K05, 90C06

## 1 Introduction

We address a class of regularized least-squares fitting problems of the form

$$\min_{\mathbf{u}}\ \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}-\mathbf{b}\right\|_{2}^{2} + \left\|\mathbf{B}\mathbf{u}\right\|_{1}, \qquad (1.1)$$

where $\mathbf{b}$ is a known vector (data), $\mathbf{u}$ a vector of unknowns (sometimes referred to as the “model”), and $\mathbf{A}$, $\mathbf{B}$ are linear operators. If $\mathbf{B}$ is the identity map, then problem (1.1) is a least-squares fitting with $\ell_1$ regularization,

$$\min_{\mathbf{u}}\ \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}-\mathbf{b}\right\|_{2}^{2} + \left\|\mathbf{u}\right\|_{1}. \qquad (1.2)$$

If the unknown vector $\mathbf{u}$ is the discretization of a function, and $\mathbf{B}$ is the first-order finite-difference operator

$$\mathbf{B} = \begin{pmatrix} 1 & -1 & & & \\ & 1 & -1 & & \\ & & \ddots & \ddots & \\ & & & 1 & -1 \end{pmatrix},$$

then problem (1.1) turns into a least-squares fitting with a total-variation (TV) regularization

$$\min_{\mathbf{u}}\ \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}-\mathbf{b}\right\|_{2}^{2} + \left\|\mathbf{B}\mathbf{u}\right\|_{1}. \qquad (1.3)$$

On the one hand, in (1.2) we seek a model vector $\mathbf{u}$ such that the forward-modeled data $\mathbf{A}\mathbf{u}$ match the observed data $\mathbf{b}$ in the least-squares sense, while imposing a sparsity-promoting $\ell_1$ regularization. In (1.3), on the other hand, we impose a blockiness-promoting total-variation (TV) regularization. Note that rather than using a regularization parameter as a coefficient of the regularization term, we use a data-fitting weight $\lambda$. TV regularization (also known as the Rudin-Osher-Fatemi, or ROF, model [36]) acts as a form of “model styling” that helps to preserve sharp contrasts and boundaries in the model even when the spectral content of the input data has limited resolution.
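To make the notation concrete, the following sketch evaluates the TV objective (1.3) for a small one-dimensional model in the denoising case $\mathbf{A} = \mathbf{I}$; the helper names are ours and are purely illustrative, not part of the paper's algorithms.

```python
# Sketch (our notation): evaluate the TV objective (1.3) for a 1-D model with A = I.

def first_difference(u):
    """Apply the first-order finite-difference operator B: (Bu)_i = u_i - u_{i+1}."""
    return [u[i] - u[i + 1] for i in range(len(u) - 1)]

def tv_objective(u, b, lam):
    """lam/2 * ||u - b||_2^2 + ||Bu||_1 (the A = I, i.e. denoising, case of (1.3))."""
    data_fit = 0.5 * lam * sum((ui - bi) ** 2 for ui, bi in zip(u, b))
    return data_fit + sum(abs(d) for d in first_difference(u))

# A blocky model is penalized only for its jumps: the TV term here is the single step of 1.
blocky = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(first_difference(blocky))           # [0.0, 0.0, 1.0, 0.0, 0.0]
print(tv_objective(blocky, blocky, 2.0))  # 1.0 (perfect data fit, TV term only)
```

Only the jumps of the model contribute to the regularization term, which is why TV regularization favors piecewise-constant (“blocky”) solutions.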

$\ell_1$- and TV-regularized least-squares fitting, a key tool in imaging and de-noising applications (see, e.g., [36, 10, 42, 26]), is beginning to play an increasingly important role in applications where the modeling operator $\mathbf{A}$ in (1.1) is computationally challenging to apply. In particular, in seismic imaging problems of exploration geophysics such as full-waveform inversion [39, 16], modeling of seismic wave propagation in a three-dimensional medium from multiple seismic sources is by far the greatest contributor to the computational cost of inversion, and reduction of the number of applications of the operator $\mathbf{A}$ is key to success in practical applications.

$\ell_1$-regularized least-squares problems can be reduced to inequality-constrained quadratic programs and solved using interior-point methods based on, e.g., Newton [7] or nonlinear Conjugate Gradients [26] methods. Alternatively, the resulting bound-constrained quadratic programs can be solved using gradient projection [17] or projected Conjugate Gradients [33]. A conceptually different class of techniques for solving $\ell_1$-regularized least-squares problems is based on homotopy methods [23, 15, 31].

Another class of methods for solving (1.1) that merits a special mention applies splitting schemes for the sum of two operators. For example, the iterative shrinkage-thresholding algorithm (ISTA) is based on applying forward-backward splitting [8, 32] to solving the $\ell_1$-regularized problem (1.2) by gradient descent [4, 11, 12]:

$$\mathbf{u}^{k+1} = \mathcal{S}_{t}\left[\mathbf{u}^{k} - t\lambda\,\mathbf{A}^{T}\left(\mathbf{A}\mathbf{u}^{k}-\mathbf{b}\right)\right], \qquad (1.4)$$

where $t$ is a sufficiently small step parameter, and the soft thresholding or shrinkage operator $\mathcal{S}_{\rho}$ is the Moreau resolvent (see, e.g., [1]) of $\rho\,\partial\left\|\cdot\right\|_{1}$,

$$\mathcal{S}_{\rho}(\mathbf{x}) = \left(\mathbf{I} + \rho\,\partial\left\|\cdot\right\|_{1}\right)^{-1}(\mathbf{x}) = \operatorname{sign}(\mathbf{x})\,\max\left(\left|\mathbf{x}\right| - \rho,\ 0\right), \qquad (1.5)$$

where $\partial$ denotes the subgradient [34, 1], and the absolute value of a vector is computed component-wise. The typically slow convergence of the first-order method (1.4) can be accelerated by an over-relaxation step [29], resulting in the Fast ISTA algorithm (FISTA) [3]:

$$\begin{aligned}
\mathbf{u}^{k+1} &= \mathcal{S}_{t}\left[\mathbf{w}^{k} - t\lambda\,\mathbf{A}^{T}\left(\mathbf{A}\mathbf{w}^{k}-\mathbf{b}\right)\right],\\
s^{k+1} &= \frac{1 + \sqrt{1 + 4\left(s^{k}\right)^{2}}}{2},\\
\mathbf{w}^{k+1} &= \mathbf{u}^{k+1} + \frac{s^{k}-1}{s^{k+1}}\left(\mathbf{u}^{k+1}-\mathbf{u}^{k}\right),
\end{aligned} \qquad (1.6)$$

where $\mathbf{w}^{1} = \mathbf{u}^{0}$, $s^{1} = 1$, and the step size $t$ is sufficiently small.
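As an illustration, here is a minimal FISTA sketch for the denoising case $\mathbf{A} = \mathbf{I}$ of (1.2), where the exact minimizer is the closed-form shrinkage $\mathcal{S}_{1/\lambda}(\mathbf{b})$ and can serve as a reference; the function names and the test problem are ours, not the paper's.

```python
# FISTA sketch of (1.6) for the l1 denoising case A = I of (1.2); names are illustrative.
# With A = I the exact minimizer is S_{1/lam}(b), used as a reference below.
import math

def shrink(x, rho):
    """Component-wise soft thresholding (1.5): sign(x) * max(|x| - rho, 0)."""
    return [math.copysign(max(abs(xi) - rho, 0.0), xi) for xi in x]

def fista_identity(b, lam, t, iters):
    u = [0.0] * len(b)       # u^0
    w, s = list(u), 1.0      # w^1 = u^0, s^1 = 1
    for _ in range(iters):
        grad = [lam * (wi - bi) for wi, bi in zip(w, b)]   # gradient of lam/2 ||w - b||^2
        u_next = shrink([wi - t * gi for wi, gi in zip(w, grad)], t)
        s_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * s * s))
        w = [un + (s - 1.0) / s_next * (un - uo) for un, uo in zip(u_next, u)]
        u, s = u_next, s_next
    return u

b, lam = [1.0, -0.2, 3.0], 2.0
u = fista_identity(b, lam, t=0.5, iters=50)
exact = shrink(b, 1.0 / lam)
print(u, exact)  # u matches the closed-form solution [0.5, 0.0, 2.5] (up to the sign of zero)
```

With $\mathbf{A} = \mathbf{I}$ and $t = 1/\lambda$ the interior gradient step maps every $\mathbf{w}^{k}$ straight to $\mathbf{b}$, so the first shrinkage already lands on the minimizer; for a general $\mathbf{A}$ the step must satisfy a Lipschitz bound and convergence is slower.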

It is important to note that algorithm (1.6) is applied to the $\ell_1$-regularized problem (1.2), not the TV-regularized problem (1.3). An accelerated algorithm for solving a TV-regularized denoising problem (that is, (1.3) with $\mathbf{A} = \mathbf{I}$) was proposed in [2]; it applied the Nesterov relaxation [29] to solving the dual of the TV-regularized denoising problem [9]. However, using a similar approach to solving (1.3) with a non-trivial operator $\mathbf{A}$ results in accelerated schemes that still require inversion of $\mathbf{A}^{T}\mathbf{A}$ [2, 21] and thus lack the primary appeal of the accelerated gradient-descent methods, namely a single application of $\mathbf{A}$ and its transpose per iteration. (In [2] the inversion of $\mathbf{A}^{T}\mathbf{A}$ is replaced by a single gradient-descent step; over-relaxation, however, is applied to the dual variable.)

The advantage of (1.6) compared with simple gradient descent is that Nesterov’s over-relaxation step requires storing only two previous solution vectors and provides an improved search direction for minimization. Note, however, that the step length $t$ is inversely proportional to the Lipschitz constant of the gradient $\lambda\,\mathbf{A}^{T}\left(\mathbf{A}\mathbf{u}-\mathbf{b}\right)$ [3] and may be small in practice.

A very general approach to solving problems (1.1) involving either $\ell_1$ or TV regularization is provided by primal-dual methods. For example, in the TV-regularized least-squares problem (1.3), by substituting

$$\mathbf{z} = \mathbf{B}\mathbf{u} \qquad (1.7)$$

and adding (1.7) as a constraint, we obtain an equivalent equality-constrained optimization problem

$$\begin{aligned}
\min_{\mathbf{u},\,\mathbf{z}}\ & \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}-\mathbf{b}\right\|_{2}^{2} + \left\|\mathbf{z}\right\|_{1}\\
\text{subject to}\ & \mathbf{B}\mathbf{u} - \mathbf{z} = \mathbf{0}.
\end{aligned} \qquad (1.8)$$

The optimal solution of (1.8) corresponds to the saddle point of its Lagrangian

$$L_{0}\left(\mathbf{u},\mathbf{z},\boldsymbol{\mu}\right) = \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}-\mathbf{b}\right\|_{2}^{2} + \left\|\mathbf{z}\right\|_{1} + \boldsymbol{\mu}^{T}\left(\mathbf{B}\mathbf{u}-\mathbf{z}\right) \qquad (1.9)$$

that can be found by the Uzawa method [41]. The Uzawa method finds the saddle point by alternating minimization with respect to the primal variables $\mathbf{u}$, $\mathbf{z}$ with ascent over the dual variable $\boldsymbol{\mu}$, applied to the objective function $L$ equal to the standard Lagrangian (1.9), $L = L_{0}$:

$$\begin{aligned}
\left(\mathbf{u}^{k+1},\mathbf{z}^{k+1}\right) &= \operatorname*{arg\,min}_{\mathbf{u},\,\mathbf{z}}\ L\left(\mathbf{u},\mathbf{z},\boldsymbol{\mu}^{k}\right),\\
\boldsymbol{\mu}^{k+1} &= \boldsymbol{\mu}^{k} + \rho\left(\mathbf{B}\mathbf{u}^{k+1}-\mathbf{z}^{k+1}\right),
\end{aligned} \qquad (1.10)$$

for some positive step size $\rho$. Approach (1.10), when applied to the augmented Lagrangian [35], $L = L_{\rho}$,

$$L_{\rho}\left(\mathbf{u},\mathbf{z},\boldsymbol{\mu}\right) = \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}-\mathbf{b}\right\|_{2}^{2} + \left\|\mathbf{z}\right\|_{1} + \boldsymbol{\mu}^{T}\left(\mathbf{B}\mathbf{u}-\mathbf{z}\right) + \frac{\rho}{2}\left\|\mathbf{B}\mathbf{u}-\mathbf{z}\right\|_{2}^{2}, \qquad (1.11)$$

results in the method of multipliers [25]. For problems (1.1) all these methods still require joint minimization with respect to $\mathbf{u}$ and $\mathbf{z}$ of some objective function that includes both the non-smooth term $\left\|\mathbf{z}\right\|_{1}$ and a smooth function of $\mathbf{u}$. Splitting the joint minimization into separate steps of minimization with respect to $\mathbf{u}$, followed by minimization with respect to $\mathbf{z}$, results in the Alternating-Directions Method of Multipliers (ADMM) [20, 18, 19, 14, 6]. To establish a connection to the splitting techniques applied to the sum of two operators, we note that the ADMM is equivalent to applying the Douglas-Rachford splitting [13] to the problem

$$\mathbf{0} \in \lambda\,\mathbf{A}^{T}\left(\mathbf{A}\mathbf{u}-\mathbf{b}\right) + \partial\left\|\mathbf{B}\mathbf{u}\right\|_{1}, \qquad (1.12)$$

where $\partial$ is the subgradient, and problem (1.12) is equivalent to (1.1). The ADMM is a particular case of a primal-dual iterative solution framework with splitting [43], where the minimization in (1.10) is split into two steps,

$$\begin{aligned}
\mathbf{u}^{k+1} &= \operatorname*{arg\,min}_{\mathbf{u}}\ L\left(\mathbf{u},\mathbf{z}^{k},\boldsymbol{\mu}^{k}\right),\\
\mathbf{z}^{k+1} &= \operatorname*{arg\,min}_{\mathbf{z}}\ L\left(\mathbf{u}^{k+1},\mathbf{z},\boldsymbol{\mu}^{k}\right),\\
\boldsymbol{\mu}^{k+1} &= \boldsymbol{\mu}^{k} + \rho\left(\mathbf{B}\mathbf{u}^{k+1}-\mathbf{z}^{k+1}\right).
\end{aligned} \qquad (1.13)$$

For the ADMM, we substitute $L = L_{\rho}$ in (1.13), but other choices of a modified Lagrange function are possible that may produce convergent primal-dual algorithms [43]. Making the substitution $L = L_{\rho}$ from (1.11) into (1.13), and introducing a scaled vector of multipliers,

$$\mathbf{x}^{k} = \boldsymbol{\mu}^{k}/\rho, \qquad (1.14)$$

we obtain

$$\begin{aligned}
\mathbf{u}^{k+1} &= \operatorname*{arg\,min}_{\mathbf{u}}\ \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}-\mathbf{b}\right\|_{2}^{2} + \frac{\rho}{2}\left\|\mathbf{B}\mathbf{u}-\mathbf{z}^{k}+\mathbf{x}^{k}\right\|_{2}^{2},\\
\mathbf{z}^{k+1} &= \operatorname*{arg\,min}_{\mathbf{z}}\ \left\|\mathbf{z}\right\|_{1} + \frac{\rho}{2}\left\|\mathbf{B}\mathbf{u}^{k+1}-\mathbf{z}+\mathbf{x}^{k}\right\|_{2}^{2},\\
\mathbf{x}^{k+1} &= \mathbf{x}^{k} + \mathbf{B}\mathbf{u}^{k+1} - \mathbf{z}^{k+1},
\end{aligned} \qquad (1.15)$$

where we used the fact that adding a constant term to the objective function does not alter the minimizer. In the iterative process (1.15), we apply splitting, minimizing

$$\frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}-\mathbf{b}\right\|_{2}^{2} + \left\|\mathbf{z}\right\|_{1} + \frac{\rho}{2}\left\|\mathbf{B}\mathbf{u}-\mathbf{z}+\mathbf{x}^{k}\right\|_{2}^{2} \qquad (1.16)$$

alternately with respect to $\mathbf{u}$ and $\mathbf{z}$. Further, we note that the minimization of (1.16) with respect to $\mathbf{z}$ (in a splitting step with $\mathbf{u} = \mathbf{u}^{k+1}$ fixed) is given trivially by the shrinkage operator (1.5),

$$\mathbf{z}^{k+1} = \mathcal{S}_{1/\rho}\left(\mathbf{B}\mathbf{u}^{k+1}+\mathbf{x}^{k}\right). \qquad (1.17)$$

Combining (1.15) and (1.17) we obtain Algorithm 1.
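The combined iteration can be sketched end-to-end for a tiny one-dimensional TV denoising problem ($\mathbf{A} = \mathbf{I}$ in (1.3)). Here the step-4 least-squares subproblem is solved exactly by dense Gaussian elimination, which is feasible only because the example is small; all helper names are ours.

```python
# Sketch of the scaled-form ADMM (1.15) with shrinkage (1.17) for 1-D TV denoising.
# The step-4 subproblem is solved exactly via the small dense normal matrix
# lam*I + rho*B^T B, which stays fixed across iterations. Helper names are ours.
import math

def shrink(x, rho):
    return [math.copysign(max(abs(xi) - rho, 0.0), xi) for xi in x]

def Bmat(n):
    """First-order finite-difference operator as a dense (n-1) x n matrix."""
    return [[1.0 if j == i else (-1.0 if j == i + 1 else 0.0) for j in range(n)]
            for i in range(n - 1)]

def solve(M, rhs):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(M, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def admm_tv(b, lam, rho, iters):
    n = len(b)
    B = Bmat(n)
    z = [0.0] * (n - 1)
    x = [0.0] * (n - 1)
    # Normal matrix of the step-4 subproblem; constant across ADMM iterations.
    N = [[lam * (i == j) + rho * sum(B[k][i] * B[k][j] for k in range(n - 1))
          for j in range(n)] for i in range(n)]
    u = list(b)
    for _ in range(iters):
        rhs = [lam * b[i] + rho * sum(B[k][i] * (z[k] - x[k]) for k in range(n - 1))
               for i in range(n)]
        u = solve(N, rhs)                                          # step 4: u-update
        Bu = [u[i] - u[i + 1] for i in range(n - 1)]
        z = shrink([v + w for v, w in zip(Bu, x)], 1.0 / rho)      # step 5: (1.17)
        x = [xi + bu - zi for xi, bu, zi in zip(x, Bu, z)]         # step 6: multipliers
    return u

b = [1.1, 0.9, 1.0, 0.05, -0.1, 0.0]  # noisy samples of a blocky model
u = admm_tv(b, lam=5.0, rho=1.0, iters=100)
print(u)  # small oscillations are damped while the main step is preserved
```

Note that the normal matrix assembled once before the loop is reused by every iteration; this invariance is exactly what the compressive conjugate directions method exploits when the operators are too large to factor.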

Minimization on the first line of (1.15) at each step of the ADMM requires inversion of the operator $\lambda\mathbf{A}^{T}\mathbf{A} + \rho\mathbf{B}^{T}\mathbf{B}$. In first-order gradient-descent methods like (1.6) a similar requirement is obviated by replacing the minimization with respect to $\mathbf{u}$ by a gradient-descent step. However, for ill-conditioned problems the gradient may be a poor approximation to the optimal search direction. One interpretation of Nesterov’s over-relaxation step in (1.6) is that it provides a better search direction by perturbing the current solution update with a fraction of the previous update on the last line of (1.6). The intermediate least-squares problem in (1.15) can be solved approximately using, for example, a few iterations of Conjugate Gradients. However, repeating multiple iterations of Conjugate Gradients at each step of the ADMM may be unnecessary. Indeed, as we demonstrate in the following sections, conjugate directions constructed at earlier steps of the ADMM can be reused because the matrix of the system of normal equations associated with the minimization on the first line of (1.15) does not change between ADMM steps; only the right-hand sides of the system are updated as a result of thresholding. Therefore, we can trade the computational cost of applying the operator $\mathbf{A}$ and its transpose against the cost of storing a few solution-size and data-size vectors. As this approach is applied to the most general problem (1.1) with a non-trivial operator $\mathbf{B}$, in addition to the potential speed-up, this method has the advantage of working equally well for $\ell_1$- and TV-regularized problems.

We stress that our new approach does not improve the theoretical convergence properties of the classic ADMM under the assumption of exact minimization in step 4 of Algorithm 1. The asymptotic convergence rate is still $O(1/k)$, as with exact minimization [24]. The new approach provides a numerically feasible way of implementing the ADMM for problems where a computationally expensive operator $\mathbf{A}$ precludes accurate minimization in step 4. However, the rate of convergence in the general method of multipliers (1.10) is sensitive to the choice of the parameter $\rho$, and an improved convergence rate for some values of $\rho$ can be accompanied by more ill-conditioned minimization problems at each step of (1.15) [19]. By employing an increasingly accurate conjugate-directions solution of the minimization problem at each iteration of (1.15), the new method offsets the deteriorating conditioning of the intermediate least-squares problems and achieves faster practical convergence at early iterations.

The practical utility of the ADMM in applications that involve sparsity-promoting (1.2) or edge-preserving (1.3) inversion is often determined by how quickly we can resolve sparse or blocky model components. These features can often be qualitatively resolved within relatively few initial iterations of the ADMM (see the discussion in the Appendix of [22]). In Section 4, fast recovery of such local features will be one of the key indicators for judging the efficiency of the proposed method.

In the next section we describe two new algorithms, Steered and Compressive Conjugate Directions, based on the principle of reusing conjugate directions for multiple right-hand sides. In Section 3 we prove convergence and demonstrate that the new algorithm coincides with the exact ADMM in a finite number of iterations. Section 4 contains a practical implementation of the Compressive Conjugate Directions method. We test the method on a series of problems from imaging and mechanics, and compare its performance against FISTA and ADMM with gradient descent and restarted Conjugate Gradients.

## 2 Steered and Compressive Conjugate Directions

Step 4 of Algorithm 1 is itself a least-squares optimization problem of the form

$$\min_{\mathbf{u}}\ \left\|\mathbf{F}\mathbf{u}-\mathbf{v}^{k}\right\|_{2}^{2}, \qquad (2.1)$$

where

$$\mathbf{F} = \begin{pmatrix}\sqrt{\lambda}\,\mathbf{A}\\ \sqrt{\rho}\,\mathbf{B}\end{pmatrix} \qquad (2.2)$$

and

$$\mathbf{v}^{k} = \begin{pmatrix}\sqrt{\lambda}\,\mathbf{b}\\ \sqrt{\rho}\left(\mathbf{z}^{k}-\mathbf{x}^{k}\right)\end{pmatrix}. \qquad (2.3)$$

Solving optimization problem (2.1) is mathematically equivalent to solving the following system of normal equations [40],

$$\mathbf{F}^{T}\mathbf{F}\,\mathbf{u} = \mathbf{F}^{T}\mathbf{v}^{k}, \qquad (2.4)$$

as operator (2.2) has maximum rank. Solving (2.4) has the disadvantage of squaring the condition number of operator (2.2) [40]. When the operator $\mathbf{F}$ is available in matrix form, and a factorization of the operator is numerically feasible, solving the normal equations (2.4) should be avoided and a technique based on a matrix factorization should be applied directly to solving (2.1) [5, 37]. However, when the matrix $\mathbf{F}$ is not known explicitly or its size exceeds the practical limitations of direct methods, as is the case in the applications of greatest interest to us, an iterative algorithm, such as Conjugate Gradients for Normal Equations (CGNE) [5, 37], can be used to solve (2.4). Solving (2.1) exactly may be unnecessary, and we can expect that for large-scale problems only a few steps of an iterative method need be carried out. However, every iteration typically requires the application of the operator $\mathbf{F}$ and its adjoint, and in large-scale optimization problems we are interested in minimizing the number of applications of these operations. For large-scale optimization problems we need an alternative to restarting an iterative solver for each intermediate problem (2.1). We propose to minimize the restarting of iterations (avoiding restarting altogether in the theoretical limit of infinite computer storage) by devising a conjugate-directions technique for solving (2.1) with a non-stationary right-hand side. At each iteration of the proposed algorithm we find a search direction that is conjugate to the previous directions with respect to the operator $\mathbf{F}^{T}\mathbf{F}$. In existing conjugate-direction techniques, iteratively constructed conjugate directions span the Krylov subspaces [40],

$$\mathcal{K}_{i} = \operatorname{span}\left\{\mathbf{F}^{T}\mathbf{v},\ \left(\mathbf{F}^{T}\mathbf{F}\right)\mathbf{F}^{T}\mathbf{v},\ \ldots,\ \left(\mathbf{F}^{T}\mathbf{F}\right)^{i-1}\mathbf{F}^{T}\mathbf{v}\right\}. \qquad (2.5)$$

However, in our approach we construct a sequence of vectors (search directions) that are conjugate with respect to the operator $\mathbf{F}^{T}\mathbf{F}$ at the $k$th step but may not span the Krylov subspace $\mathcal{K}_{k}$. This complicates the convergence analysis of our technique, but allows “steering” of the search directions by iteration-dependent right-hand sides. Since the right-hand side $\mathbf{v}^{k}$ in (2.1) is the result of the shrinkage (1.17) at previous iterations that steers or compresses the solution, we call our approach “steered” or “compressive” conjugate directions.
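For contrast, a textbook CGNE-style iteration for the normal equations (2.4) looks as follows (a sketch for a small dense $\mathbf{F}$; all names are ours). Restarting such a loop from scratch for every new right-hand side $\mathbf{v}^{k}$ is precisely the cost that the steered approach is designed to avoid.

```python
# A textbook CGNE-style iteration for the normal equations (2.4), F^T F u = F^T v,
# sketched with F as a small dense matrix. Each pass applies F once and F^T once.
# Names are ours; a production version would use abstract operator callbacks.

def matvec(M, x):
    return [sum(mi * xi for mi, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cgne(F, v, iters):
    Ft = transpose(F)
    u = [0.0] * len(F[0])
    r = matvec(Ft, v)        # F^T (v - F u) with the initial guess u = 0
    q = list(r)
    for _ in range(iters):
        p = matvec(F, q)
        alpha = dot(r, r) / dot(p, p)
        u = [ui + alpha * qi for ui, qi in zip(u, q)]
        s = matvec(Ft, p)
        r_new = [ri - alpha * si for ri, si in zip(r, s)]
        beta = dot(r_new, r_new) / dot(r, r)
        q = [rn + beta * qi for rn, qi in zip(r_new, q)]
        r = r_new
    return u

# Overdetermined 3 x 2 system; exact arithmetic converges in at most 2 iterations.
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [1.0, 2.0, 3.0]
u = cgne(F, v, iters=2)
print(u)  # close to the least-squares solution [1.0, 2.0]
```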

For the least-squares problem (2.1), we construct two sets of vectors $\mathbf{q}_{i}$ and $\mathbf{p}_{i}$ for $i = 0, 1, \ldots, k$,

$$\mathbf{p}_{i} = \mathbf{F}\,\mathbf{q}_{i}, \qquad (2.6)$$

such that

$$\mathbf{p}_{i}^{T}\mathbf{p}_{j} = 0,\quad i \ne j. \qquad (2.7)$$

Equations (2.6) and (2.7) mean that the vectors $\mathbf{q}_{i}$ form conjugate directions [40, 37]. At each iteration we find an approximation $\mathbf{u}^{k}$ to the solution of (2.1) as a linear combination of the vectors $\mathbf{q}_{i}$, for which the residual

$$\mathbf{r}^{k} = \mathbf{F}\,\mathbf{u}^{k} - \mathbf{v}^{k} \qquad (2.8)$$

is orthogonal to the vectors $\mathbf{p}_{i}$,

$$\mathbf{p}_{i}^{T}\,\mathbf{r}^{k} = 0,\quad i = 0, 1, \ldots, k. \qquad (2.9)$$

The vector $\mathbf{q}_{k+1}$ is constructed as a linear combination of all previous vectors $\mathbf{q}_{i}$ and $\mathbf{F}^{T}\mathbf{r}^{k}$ so that the conjugacy condition in (2.7) is satisfied. The resulting algorithm for an arbitrary right-hand side $\mathbf{v}^{k}$ depending on $k$ is given by Algorithm 2.

Note that the above algorithm is not specific to a particular sequence of right-hand-side vectors $\mathbf{v}^{k}$, and its applicability goes beyond solving the constrained optimization problems (1.8). The algorithm requires storing the vectors (2.6), as well as one vector each for the current solution iterate $\mathbf{u}^{k}$, the variable right-hand side $\mathbf{v}^{k}$, and two intermediate work vectors. The requirement of storing a growing number of vectors makes the algorithm resemble the GMRES method [37] for solving linear systems with non-self-adjoint operators. However, in our case, this is a consequence of having a variable right-hand side, requiring re-computation of solution iterates as linear combinations of all of the previous search directions (2.6). This requirement can be relaxed in applications where the vector $\mathbf{v}^{k}$ is updated, for example, by the modified Lagrangian technique for solving a constrained optimization problem, and converges to a limit. In Section 4 we describe practical applications of the algorithm achieving fast convergence while storing only a subset of the vectors (2.6). The algorithm requires one application of $\mathbf{F}$ and its transpose at each iteration, plus a number of dot products of large vectors.
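The direction-reuse idea can be sketched as follows; this is an illustrative reconstruction of the bookkeeping, not the paper's exact Algorithm 2. Stored pairs $(\mathbf{q}_{i}, \mathbf{p}_{i})$ with $\mathbf{p}_{i} = \mathbf{F}\mathbf{q}_{i}$ are reused to project each new right-hand side, and at most one new conjugate direction is appended per call.

```python
# Illustrative sketch of direction reuse across changing right-hand sides.
# Stored pairs (q_i, p_i = F q_i) with mutually orthogonal p_i are reused for every
# new v; one new direction is appended per call. Names are ours.

def matvec(M, x):
    return [sum(mi * xi for mi, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

def steered_step(F, dirs, v):
    """One iteration for right-hand side v; dirs holds pairs (q_i, p_i)."""
    u = [0.0] * len(F[0])
    # Expansion over stored directions makes F u - v orthogonal to all p_i (cf. (2.9)).
    for q, p in dirs:
        u = axpy(dot(p, v) / dot(p, p), q, u)
    r = [fi - vi for fi, vi in zip(matvec(F, u), v)]
    # Candidate direction from F^T r, orthogonalized to keep (2.7) satisfied.
    q_new = matvec(transpose(F), r)
    p_new = matvec(F, q_new)
    for q, p in dirs:
        c = dot(p_new, p) / dot(p, p)
        q_new = axpy(-c, q, q_new)
        p_new = axpy(-c, p, p_new)
    if dot(p_new, p_new) > 1e-28:   # skip degenerate (already converged) directions
        dirs.append((q_new, p_new))
    return u

F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dirs = []
for v in ([1.0, 2.0, 3.0], [2.0, 1.0, 0.0], [1.0, 2.0, 3.0]):
    u = steered_step(F, dirs, v)
print(u)  # close to [1.0, 2.0], the least-squares solution for the final v
```

In this toy run the directions built for the first two right-hand sides already span the column space of $\mathbf{F}$, so the third call recovers its least-squares solution directly from the stored directions.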

Combining Algorithms 1 and 2 we obtain the Compressive Conjugate Directions Algorithm 3.

## 3 Convergence Analysis

Convergence properties of the ADMM have been studied in many publications and are well known. However, here we provide a self-contained proof of convergence for Algorithm 1 that mostly follows the presentation of [6]. Later, we use this result to study the convergence of Algorithm 3.

###### Theorem 3.1

Assume that $\lambda, \rho > 0$, operators $\mathbf{A}$, $\mathbf{B}$ are maximum rank, and

$$\left(\mathbf{u}^{*}, \mathbf{z}^{*}\right) = \operatorname*{arg\,min}_{\mathbf{B}\mathbf{u}-\mathbf{z}=\mathbf{0}}\ \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}-\mathbf{b}\right\|_{2}^{2} + \left\|\mathbf{z}\right\|_{1} \qquad (3.1)$$

is the unique solution of problem (1.8). Assume that a vector $\mathbf{x}^{*}$ is defined as

$$\mathbf{x}^{*} = \boldsymbol{\mu}^{*}/\rho, \qquad (3.2)$$

where $\boldsymbol{\mu}^{*}$ is the vector of Lagrange multipliers for the equality constraint in (1.8). Algorithm 1 then converges to this solution as $k \to \infty$, that is,

$$\lim_{k\to\infty}\mathbf{u}^{k} = \mathbf{u}^{*},\qquad \lim_{k\to\infty}\mathbf{z}^{k} = \mathbf{z}^{*},\qquad \lim_{k\to\infty}\mathbf{x}^{k} = \mathbf{x}^{*}. \qquad (3.3)$$

Proof. Problem (1.8) has a convex objective function and linear equality constraints, hence $(\mathbf{u}^{*}, \mathbf{z}^{*}, \boldsymbol{\mu}^{*})$ is a saddle point of its Lagrangian (1.9) [7]. Substituting $\mathbf{u} = \mathbf{u}^{k+1}$ and $\mathbf{z} = \mathbf{z}^{k+1}$ from Algorithm 1 into the saddle-point inequality, we have

$$\begin{aligned}
P^{k+1} - P^{*} &\ge -\rho\,{\mathbf{x}^{*}}^{T}\left(\mathbf{B}\mathbf{u}^{k+1}-\mathbf{z}^{k+1}\right),\\
P^{*} &= \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}^{*}-\mathbf{b}\right\|_{2}^{2} + \left\|\mathbf{z}^{*}\right\|_{1},\qquad
P^{k+1} = \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}^{k+1}-\mathbf{b}\right\|_{2}^{2} + \left\|\mathbf{z}^{k+1}\right\|_{1},
\end{aligned} \qquad (3.4)$$

where $P^{*}$ is the optimal value of the objective function and $P^{k+1}$ is its approximation at iteration $k+1$ of the algorithm. Inequality (3.4) provides a lower bound for the objective function estimate $P^{k+1}$. Step 4 of the algorithm is equivalent to

$$\lambda\,\mathbf{A}^{T}\left(\mathbf{A}\mathbf{u}^{k+1}-\mathbf{b}\right) + \rho\,\mathbf{B}^{T}\left(\mathbf{B}\mathbf{u}^{k+1}-\mathbf{z}^{k}+\mathbf{x}^{k}\right) = \mathbf{0}. \qquad (3.5)$$

Substituting the expression for $\mathbf{x}^{k}$ that follows from step 6 into (3.5), we obtain

$$\lambda\,\mathbf{A}^{T}\left(\mathbf{A}\mathbf{u}^{k+1}-\mathbf{b}\right) + \rho\,\mathbf{B}^{T}\left(\mathbf{x}^{k+1}+\mathbf{z}^{k+1}-\mathbf{z}^{k}\right) = \mathbf{0}. \qquad (3.6)$$

Equality (3.6) is equivalent to

$$\mathbf{u}^{k+1} = \operatorname*{arg\,min}_{\mathbf{u}}\ \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}-\mathbf{b}\right\|_{2}^{2} + \rho\left(\mathbf{x}^{k+1}+\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\mathbf{B}\mathbf{u}. \qquad (3.7)$$

Substituting $\mathbf{u} = \mathbf{u}^{k+1}$ and $\mathbf{u} = \mathbf{u}^{*}$ into the right-hand side of (3.7), we obtain

$$\begin{aligned}
\frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}^{k+1}-\mathbf{b}\right\|_{2}^{2} &+ \rho\left(\mathbf{x}^{k+1}+\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\mathbf{B}\mathbf{u}^{k+1}\\
&\le \frac{\lambda}{2}\left\|\mathbf{A}\mathbf{u}^{*}-\mathbf{b}\right\|_{2}^{2} + \rho\left(\mathbf{x}^{k+1}+\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\mathbf{B}\mathbf{u}^{*}.
\end{aligned} \qquad (3.8)$$

Step 5 is equivalent to

$$\begin{aligned}
&\rho\,\mathbf{x}^{k+1} \in \partial\left\|\mathbf{z}^{k+1}\right\|_{1},\\
&\mathbf{z}^{k+1} = \operatorname*{arg\,min}_{\mathbf{z}}\ \left\|\mathbf{z}\right\|_{1} - \rho\,{\mathbf{x}^{k+1}}^{T}\mathbf{z},
\end{aligned} \qquad (3.9)$$

where we used the expression for $\mathbf{x}^{k+1}$ from step 6. Substituting $\mathbf{z} = \mathbf{z}^{k+1}$ and $\mathbf{z} = \mathbf{z}^{*}$ into the right-hand side of the second line of (3.9), we obtain

$$\left\|\mathbf{z}^{k+1}\right\|_{1} - \rho\,{\mathbf{x}^{k+1}}^{T}\mathbf{z}^{k+1} \le \left\|\mathbf{z}^{*}\right\|_{1} - \rho\,{\mathbf{x}^{k+1}}^{T}\mathbf{z}^{*}. \qquad (3.10)$$

Adding (3.8) and (3.10), using $\mathbf{B}\mathbf{u}^{*} = \mathbf{z}^{*}$, and denoting the constraint residual $\mathbf{r}^{k+1} = \mathbf{B}\mathbf{u}^{k+1}-\mathbf{z}^{k+1}$, we get

$$P^{k+1} - P^{*} \le -\rho\,{\mathbf{x}^{k+1}}^{T}\mathbf{r}^{k+1} - \rho\left(\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\left(\mathbf{r}^{k+1}+\mathbf{z}^{k+1}-\mathbf{z}^{*}\right), \qquad (3.11)$$

an upper bound for $P^{k+1} - P^{*}$. Adding (3.4) and (3.11), we get

$$\rho\left(\mathbf{x}^{k+1}-\mathbf{x}^{*}\right)^{T}\mathbf{r}^{k+1} + \rho\left(\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\left(\mathbf{r}^{k+1}+\mathbf{z}^{k+1}-\mathbf{z}^{*}\right) \le 0, \qquad (3.12)$$

or after rearranging,

$$2\left(\mathbf{x}^{k+1}-\mathbf{x}^{*}\right)^{T}\mathbf{r}^{k+1} + 2\left(\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\mathbf{r}^{k+1} + 2\left(\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\left(\mathbf{z}^{k+1}-\mathbf{z}^{*}\right) \le 0. \qquad (3.13)$$

We will now use (3.13) to derive an upper estimate for the quantity

$$V^{k} = \left\|\mathbf{x}^{k}-\mathbf{x}^{*}\right\|_{2}^{2} + \left\|\mathbf{z}^{k}-\mathbf{z}^{*}\right\|_{2}^{2}.$$

Using step 6 of Algorithm 1, by which $\mathbf{r}^{k+1} = \mathbf{x}^{k+1}-\mathbf{x}^{k}$, for the first term in (3.13) we get

$$\begin{aligned}
2\left(\mathbf{x}^{k+1}-\mathbf{x}^{*}\right)^{T}\mathbf{r}^{k+1} &= 2\left(\mathbf{x}^{k}-\mathbf{x}^{*}\right)^{T}\mathbf{r}^{k+1} + 2\left\|\mathbf{r}^{k+1}\right\|_{2}^{2}\\
&= \left\|\mathbf{x}^{k+1}-\mathbf{x}^{*}\right\|_{2}^{2} - \left\|\mathbf{x}^{k}-\mathbf{x}^{*}\right\|_{2}^{2} + \left\|\mathbf{r}^{k+1}\right\|_{2}^{2}.
\end{aligned} \qquad (3.14)$$

Substituting (3.14) into (3.13), we obtain

$$\begin{aligned}
\left\|\mathbf{x}^{k+1}-\mathbf{x}^{*}\right\|_{2}^{2} &- \left\|\mathbf{x}^{k}-\mathbf{x}^{*}\right\|_{2}^{2} + \left\|\mathbf{r}^{k+1}\right\|_{2}^{2} + 2\left(\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\mathbf{r}^{k+1}\\
&+ 2\left(\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\left(\mathbf{z}^{k+1}-\mathbf{z}^{*}\right) \le 0,
\end{aligned} \qquad (3.15)$$

yielding

$$\left\|\mathbf{r}^{k+1}+\mathbf{z}^{k+1}-\mathbf{z}^{k}\right\|_{2}^{2} \le V^{k} - V^{k+1}. \qquad (3.16)$$

Expanding the left-hand side of (3.16), we obtain

$$\left\|\mathbf{r}^{k+1}\right\|_{2}^{2} + 2\left(\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\mathbf{r}^{k+1} + \left\|\mathbf{z}^{k+1}-\mathbf{z}^{k}\right\|_{2}^{2} \le V^{k} - V^{k+1}. \qquad (3.17)$$

Let us prove that the middle term in the left-hand side of (3.17) is non-negative,

$$\left(\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\mathbf{r}^{k+1} = \left(\mathbf{z}^{k+1}-\mathbf{z}^{k}\right)^{T}\left(\mathbf{x}^{k+1}-\mathbf{x}^{k}\right) \ge 0,$$

where we used step 6 of Algorithm 1. Indeed, since $\mathbf{z}^{k+1}$ minimizes (1.16) with $\mathbf{u} = \mathbf{u}^{k+1}$ and $\mathbf{x} = \mathbf{x}^{k}$ fixed, using the convexity of the $\ell_1$ norm, we have for $\mathbf{z} = \mathbf{z}^{k}$,

$$\left\|\mathbf{z}^{k}\right\|_{1} \ge \left\|\mathbf{z}^{k+1}\right\|_{1} + \rho\,{\mathbf{x}^{k+1}}^{T}\left(\mathbf{z}^{k}-\mathbf{z}^{k+1}\right). \qquad (3.18)$$

Similarly, since $\mathbf{z}^{k}$ minimizes (1.16) for $\mathbf{u} = \mathbf{u}^{k}$ and $\mathbf{x} = \mathbf{x}^{k-1}$, for $\mathbf{z} = \mathbf{z}^{k+1}$ we have

$$\left\|\mathbf{z}^{k+1}\right\|_{1} \ge \left\|\mathbf{z}^{k}\right\|_{1} + \rho\,{\mathbf{x}^{k}}^{T}\left(\mathbf{z}^{k+1}-\mathbf{z}^{k}\right). \qquad (3.19)$$

In both (3.18) and (3.19) we used step 6 of Algorithm 1 and the fact that for any convex function