The Regularization Theory of the Krylov Iterative Solvers LSQR, CGLS, LSMR and CGME for Linear Discrete Ill-Posed Problems

This work was supported in part by the National Science Foundation of China (No. 11371219).

Abstract

For the large-scale linear discrete ill-posed problem $\min\|Ax-b\|$ or $Ax=b$ with $b$ contaminated by a white noise, the Lanczos bidiagonalization based Krylov solver LSQR and its mathematically equivalent variant CGLS are most commonly used. They have intrinsic regularizing effects, where the number of iterations plays the role of the regularization parameter. However, there has been no answer to the long-standing fundamental concern: for which kinds of problems can LSQR and CGLS find best possible regularized solutions? The concern was actually expressed foresightedly by Björck and Eldén in 1979. Here a best possible regularized solution means one that is at least as accurate as the best regularized solution obtained by the truncated singular value decomposition (TSVD) method; that solution and the best possible solution by standard-form Tikhonov regularization both have errors of the same order as the worst-case error and cannot be improved under the assumption that the solution to an underlying linear compact operator equation is continuous or has a square integrable derivative. In this paper we make a detailed analysis of the regularization of LSQR for severely, moderately and mildly ill-posed problems. We first consider the case that the singular values of $A$ are simple. We establish accurate theorems for the 2-norm distance between the underlying $k$-dimensional Krylov subspace and the $k$-dimensional dominant right singular subspace of $A$. Based on them and some follow-up results, for the first two kinds of problems, we prove that LSQR finds a best possible regularized solution at semi-convergence and that the following results hold up to the iteration where semi-convergence occurs: (i) the $k$-step Lanczos bidiagonalization always generates a near best rank $k$ approximation to $A$; (ii) the $k$ Ritz values always approximate the first $k$ large singular values of $A$ in natural order; (iii) the $k$-step LSQR always captures the $k$ dominant SVD components of $A$. However, for the third kind of problem, we prove that LSQR generally cannot find a best possible regularized solution. We derive accurate estimates for the diagonals and subdiagonals of the bidiagonal matrices generated by Lanczos bidiagonalization, which can be used to decide whether LSQR finds a best possible regularized solution at semi-convergence. We also analyze the regularization of the other two Krylov solvers LSMR and CGME, which are MINRES applied to the normal equations $A^{T}Ax=A^{T}b$ and the CG method applied to $AA^{T}y=b$ with $x=A^{T}y$, respectively, proving that the regularizing effects of LSMR are similar to those of LSQR for each kind of problem and that both are superior to CGME. We extend all the results to the case that $A$ has multiple singular values. Numerical experiments confirm our theory on LSQR.

Keywords: discrete ill-posed, full or partial regularization, best or near best rank approximation, TSVD solution, semi-convergence, Lanczos bidiagonalization, LSQR, CGLS, LSMR, CGME

AMS subject classifications: 65F22, 65F10, 65F20, 65J20, 65R30, 65R32, 15A18

1 Introduction and Preliminaries

Consider the linear discrete ill-posed problem

\min_{x\in\mathbb{R}^{n}}\|Ax-b\| \quad \mbox{or} \quad Ax=b, \qquad A\in\mathbb{R}^{m\times n}, \quad b\in\mathbb{R}^{m}, \qquad (1)

where the norm $\|\cdot\|$ is the 2-norm of a vector or matrix, and $A$ is extremely ill conditioned with its singular values decaying to zero without a noticeable gap. Problem (1) mainly arises from the discretization of the first-kind Fredholm integral equation

\int_{\Omega} k(s,t)\, f(t)\, dt = g(s), \qquad s\in\Omega, \qquad (2)

where the kernel $k(s,t)$ and the right-hand side $g(s)$ are known functions, while $f(t)$ is the unknown function to be sought. If $k(s,t)$ is non-degenerate and $g(s)$ satisfies the Picard condition, there exists a unique square integrable solution $f(t)$; see [27, 53, 56, 81, 89]. Here for brevity we assume that $s$ and $t$ belong to the same set $\Omega$. Applications include image deblurring, signal processing, geophysics, computerized tomography, heat propagation, biomedical and optical imaging, groundwater modeling, and many others; see, e.g., [1, 26, 27, 56, 66, 75, 76, 81, 89, 90, 119]. The theory and numerical treatments of integral equations can be found in [81, 82]. The right-hand side $b=\hat b+e$ is noisy and assumed to be contaminated by a white noise $e$, caused by measurement, modeling or discretization errors, where $\hat b$ is noise-free and $\|e\|$ is small relative to $\|\hat b\|$. Because of the presence of the noise $e$ and the extreme ill-conditioning of $A$, the naive solution $x_{naive}=A^{\dagger}b$ of (1) bears no relation to the true solution $x_{true}=A^{\dagger}\hat b$, where the superscript $\dagger$ denotes the Moore-Penrose inverse of a matrix. Therefore, one has to use regularization to extract a best possible approximation to $x_{true}$.
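As an illustration only (not part of the analysis), the following NumPy sketch builds a tiny synthetic ill-conditioned problem and shows how badly the naive solution is corrupted by even a small white noise. The matrix construction, the noise level and all names such as `A`, `x_true` and `noise_level` are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
# An illustrative ill-conditioned matrix with rapidly decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = 10.0 ** (-0.2 * np.arange(n))            # sigma_1 = 1, ..., sigma_n ~ 1e-13
A = U @ np.diag(sigma) @ V.T

x_true = V @ (sigma ** 0.5)                       # gives |u_i^T b_hat| = sigma_i^{1.5}
b_exact = A @ x_true
noise_level = 1e-4
e = noise_level * np.linalg.norm(b_exact) * rng.standard_normal(n) / np.sqrt(n)
b = b_exact + e                                   # noisy right-hand side

x_naive = np.linalg.pinv(A) @ b                   # naive (pseudoinverse) solution
print("relative error of naive solution:",
      np.linalg.norm(x_naive - x_true) / np.linalg.norm(x_true))
```

The printed relative error is huge because the noise components $u_i^{T}e$ are divided by the tiny singular values.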

The most common regularization, in its simplest form, is the direct standard-form Tikhonov regularization

\min_{x\in\mathbb{R}^{n}} \left\{ \|Ax-b\|^{2} + \lambda^{2}\|x\|^{2} \right\} \qquad (3)

with $\lambda>0$ the regularization parameter [101, 111, 112]. The solutions to (1) and (3) can be fully analyzed by the singular value decomposition (SVD) of $A$. Let

A = U\Sigma V^{T} \qquad (4)

be the SVD of $A$, where $U=(u_1,u_2,\ldots,u_m)$ and $V=(v_1,v_2,\ldots,v_n)$ are orthogonal, $\Sigma$ contains the singular values $\sigma_1>\sigma_2>\cdots>\sigma_n>0$ on its diagonal, assumed to be simple throughout the paper except Section 7, and the superscript $T$ denotes the transpose of a matrix or vector. Then

x_{naive} = A^{\dagger}b = \sum_{i=1}^{n}\frac{u_i^{T}b}{\sigma_i}v_i = \sum_{i=1}^{n}\frac{u_i^{T}\hat b}{\sigma_i}v_i + \sum_{i=1}^{n}\frac{u_i^{T}e}{\sigma_i}v_i \qquad (5)

with $x_{true}=A^{\dagger}\hat b=\sum_{i=1}^{n}\frac{u_i^{T}\hat b}{\sigma_i}v_i$.

Throughout the paper, we always assume that $\hat b$ satisfies the discrete Picard condition with some fixed constant for arbitrarily large $n$ [1, 33, 50, 51, 53, 56, 76]. It is an analog of the Picard condition in the finite dimensional case; see, e.g., [50], [53, p.9], [56, p.12] and [76, p.63]. This condition means that, on average, the Fourier coefficients $|u_i^{T}\hat b|$ decay faster than the $\sigma_i$, and it enables regularization to compute useful approximations to $x_{true}$, which results in the following popular model that is used throughout Hansen's books [53, 56] and the current paper:

|u_i^{T}\hat b| = \sigma_i^{1+\beta}, \qquad \beta>0, \qquad i=1,2,\ldots,n, \qquad (6)

where $\beta>0$ is a model parameter that controls the decay rate of $|u_i^{T}\hat b|$. Hansen [56, p.68] points out, "while this is a crude model, it reflects the overall behavior often found in real problems." One precise definition of the discrete Picard condition is $c_1\sigma_i^{1+\beta}\le |u_i^{T}\hat b|\le c_2\sigma_i^{1+\beta}$ with certain constants $c_1,c_2>0$. We remark that once the constants $c_1$ and $c_2$ do not differ greatly, such a discrete Picard condition does not affect our claims; rather, it only complicates the derivations and the forms of the results.
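In practice the condition is usually inspected graphically. The following minimal sketch, in the spirit of a discrete Picard plot, compares $\sigma_i$, $|u_i^{T}b|$ and $|u_i^{T}b|/\sigma_i$; the function name `picard_plot` and the inputs (for instance `A` and `b` from the previous snippet) are illustrative choices, not the paper's notation.

```python
import numpy as np
import matplotlib.pyplot as plt

def picard_plot(A, b):
    """Discrete Picard plot: sigma_i, |u_i^T b| and |u_i^T b| / sigma_i on a log scale."""
    U, sigma, _ = np.linalg.svd(A)
    coeffs = np.abs(U.T @ b)[:sigma.size]
    i = np.arange(1, sigma.size + 1)
    plt.semilogy(i, sigma, label=r"$\sigma_i$")
    plt.semilogy(i, coeffs, label=r"$|u_i^T b|$")
    plt.semilogy(i, coeffs / sigma, label=r"$|u_i^T b|/\sigma_i$")
    plt.xlabel("index i")
    plt.legend()
    plt.show()
```

Before the transition point the coefficients $|u_i^{T}b|$ decay with the $\sigma_i$; after it they level off at the noise level.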

The white noise $e$ has a number of attractive properties which play a critical role in the regularization analysis: its covariance matrix is a scalar multiple of the identity matrix, so that the expected values of the individual coefficients $|u_i^{T}e|$, $i=1,2,\ldots,n$, are identical; see, e.g., [53, p.70-1] and [56, p.41-2]. The noise thus affects all the SVD coefficients $u_i^{T}b$ more or less equally. With (6), relation (5) shows that for the large singular values $u_i^{T}\hat b$ is dominant relative to $u_i^{T}e$. Once $|u_i^{T}\hat b|<|u_i^{T}e|$ from some $i$ onwards, the small singular values magnify $u_i^{T}e$, and the noise dominates and must be suppressed. The transition point $k_0$ is such that

(7)

see [56, p.42, 98] and a similar description in [53, p.70-1]. The singular values are then divided into the $k_0$ large ones and the $n-k_0$ small ones. The truncated SVD (TSVD) method [53, 56] computes the TSVD regularized solutions

x_k^{tsvd} = \sum_{i=1}^{k}\frac{u_i^{T}b}{\sigma_i}v_i, \qquad k=1,2,\ldots,n. \qquad (8)

It is known from [53, p.70-1] and [56, p.86-8,96] that $x_{k_0}^{tsvd}$ is the best TSVD regularized solution to (1) and balances the regularization and perturbation errors optimally. The truncation index $k$ is a regularization parameter that determines how many large SVD components of $A$ are used to compute a regularized solution to (1).
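To make the truncation explicit, here is a minimal NumPy sketch of the TSVD solutions (8); the function name `tsvd_solution` and the way the truncation index is passed are illustrative, not the paper's notation.

```python
import numpy as np

def tsvd_solution(A, b, k):
    """Truncated SVD regularized solution: keep only the k largest SVD components."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    coeffs = (U[:, :k].T @ b) / sigma[:k]      # u_i^T b / sigma_i for i = 1, ..., k
    return Vt[:k].T @ coeffs                   # sum_i (u_i^T b / sigma_i) v_i
```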

Let $A_k=\sum_{i=1}^{k}\sigma_iu_iv_i^{T}$. Then $A_k$ is the best rank $k$ approximation to $A$ with $\|A-A_k\|=\sigma_{k+1}$ (cf. [10, p.12]), and $x_k^{tsvd}$ is the minimum-norm least squares solution to the perturbed problem $\min\|A_kx-b\|$ that replaces $A$ in (1) by $A_k$. This interpretation will often be exploited later.

The solution $x_{\lambda}$ of the Tikhonov regularization (3) has the filtered SVD expansion

x_{\lambda} = \sum_{i=1}^{n} f_i\,\frac{u_i^{T}b}{\sigma_i}\,v_i, \qquad (9)

where the $f_i=\frac{\sigma_i^{2}}{\sigma_i^{2}+\lambda^{2}}$ are called filters. The TSVD method is a special filtered method, where, in (9), we take $f_i=1$ for $i=1,2,\ldots,k$ and $f_i=0$ for $i=k+1,\ldots,n$. The error of a regularized solution can be written as the sum of the regularization and perturbation errors, and an optimal regularization parameter aims to balance these two errors and minimize the sum of their norms [53, 56, 81, 119]. The best possible regularized solution retains the dominant SVD components and dampens the other small SVD components as much as possible [53, 56]. Apparently, the ability to acquire only the largest SVD components of $A$ is fundamental in solving (1).
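The following short sketch evaluates the two kinds of filters just described for a given set of singular values; the parameter values `lam` and `k` below are arbitrary illustrative choices.

```python
import numpy as np

def tikhonov_filters(sigma, lam):
    """Tikhonov filter factors f_i = sigma_i^2 / (sigma_i^2 + lam^2)."""
    return sigma**2 / (sigma**2 + lam**2)

def tsvd_filters(sigma, k):
    """TSVD filter factors: 1 for the k largest singular values, 0 otherwise."""
    f = np.zeros_like(sigma)
    f[:k] = 1.0
    return f

sigma = 10.0 ** (-0.2 * np.arange(20))         # illustrative decaying singular values
print(tikhonov_filters(sigma, lam=1e-2)[:6])
print(tsvd_filters(sigma, k=6)[:8])
```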

A number of parameter-choice methods have been developed for finding the regularization parameter $\lambda$ or the truncation index $k_0$, such as the discrepancy principle [88], the L-curve criterion, whose use goes back to Miller [87] and Lawson and Hanson [84] and which was named much later and studied in detail in [52, 58], and the generalized cross validation (GCV) [39, 120]; see, e.g., [5, 53, 56, 76, 78, 80, 92, 102, 119] for numerous comparisons. All parameter-choice methods aim to keep the filters close to one for the large singular values and close to zero for the small ones. Each of these methods has its own merits and disadvantages, and none is absolutely reliable for all ill-posed problems. For example, some of the mentioned parameter-choice methods may fail to find accurate approximations to $x_{true}$; see [46, 118] for an analysis of the L-curve method and [53] for some other parameter-choice methods. A further investigation of parameter-choice methods is not our concern in this paper.

The TSVD method is important in its own right. It and the standard-form Tikhonov regularization produce very similar solutions with essentially the minimum 2-norm error, i.e., the worst-case error [81, p.13]; see [117], [51], [53, p.109-11] and [56, Sections 4.2 and 4.4]. Indeed, for a linear compact operator equation, including (2), with noisy data, under the source condition that the solution lies in the range of the adjoint of the underlying operator or in the range of the adjoint composed with the operator itself, which amounts to assuming that the solution or its derivative is square integrable, the errors of the best regularized solutions by the TSVD method and the Tikhonov regularization are order optimal, i.e., of the same order as the worst-case error [81, p.13,18,20,32-40], [90, p.90] and [119, p.7-12]. These conclusions carry over to (1) [119, p.8]. Therefore, either of the best TSVD solution $x_{k_0}^{tsvd}$ and the best Tikhonov solution is a best possible solution to (1) under the above assumptions and can be taken as the standard reference when assessing the regularizing effects of an iterative solver. For the sake of clarity, we will take $x_{k_0}^{tsvd}$ as the reference.

When (1) is large, the TSVD method and the Tikhonov regularization method are generally too demanding, and only iterative regularization methods are computationally viable. A major class of methods has been Krylov iterative solvers that project (1) onto a sequence of low dimensional Krylov subspaces and compute iterates to approximate $x_{true}$; see, e.g., [1, 27, 38, 45, 53, 56, 81]. Of Krylov iterative solvers, the CGLS (or CGNR) method, which implicitly applies the Conjugate Gradient (CG) method [40, 60] to the normal equations of (1), and its mathematically equivalent LSQR algorithm [98] have been most commonly used. The Krylov solvers CGME (or CGNE) [10, 11, 22, 45, 47, 61] and LSMR [11, 30] are also choices, which amount to the CG method applied to $AA^{T}y=b$ with $x=A^{T}y$ and MINRES [97] applied to the normal equations $A^{T}Ax=A^{T}b$, respectively. These Krylov solvers have been intensively studied, are known to have regularizing effects [1, 24, 38, 45, 47, 53, 56, 61], and exhibit semi-convergence [90, p.89]; see also [10, p.314], [11, p.733], [53, p.135] and [56, p.110]: In an initial stage the iterates converge to $x_{true}$, their norms increase steadily, and the residual norms decrease; afterwards the noise starts to deteriorate the iterates, so that they diverge from $x_{true}$ and instead converge to $x_{naive}$, while their norms increase considerably and the residual norms stabilize. If we stop at the right time, then, in principle, we have a regularization method, where the iteration number plays the role of the regularization parameter. Semi-convergence is due to the fact that the projected problem starts to inherit the ill-conditioning of (1) from some iteration onwards, and the appearance of a small singular value of the projected problem amplifies the noise considerably.
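Semi-convergence is easy to observe numerically. The sketch below uses SciPy's LSQR implementation, rerun with an increasing iteration limit, and records the relative error against a known true solution; the function name and the choice of `kmax` are illustrative, and `A`, `b`, `x_true` can be taken from the first sketch above or from any test problem.

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def semiconvergence_curve(A, b, x_true, kmax=30):
    """Relative errors of the LSQR iterates x^(k), k = 1, ..., kmax, against x_true."""
    errors = []
    for k in range(1, kmax + 1):
        x_k = lsqr(A, b, iter_lim=k)[0]        # LSQR iterate after k iterations
        errors.append(np.linalg.norm(x_k - x_true) / np.linalg.norm(x_true))
    return np.array(errors)

# Typical usage:
#   err = semiconvergence_curve(A, b, x_true)
#   print("error is smallest at iteration", err.argmin() + 1)
```

The error first decreases and then grows again once the noise takes over, which is exactly the semi-convergence phenomenon described above.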

The regularizing effects of CG type methods were noticed by Lanczos [83] and were rediscovered in [74, 106, 110]. Based on these works, motivated by a heuristic explanation of the good numerical results obtained with very few CGLS iterations in [74], and realizing that such an excellent performance can only be expected if convergence to the regular part of the solution takes place before the effects of ill-posedness show up, Björck and Eldén, on page 13 of [12], in 1979 foresightedly expressed a fundamental concern on CGLS (and LSQR): More research is needed to tell for which problems this approach will work, and what stopping criterion to choose. See also [53, p.145]. As remarked by Hanke and Hansen [48], the paper [12] was the only extensive survey on algorithmic details until that time, and a strict proof of the regularizing properties of conjugate gradients is extremely difficult. An enormous effort has long been devoted to the study of the regularizing effects of LSQR and CGLS (cf. [24, 27, 28, 37, 45, 47, 53, 56, 61, 81, 91, 94, 99, 105, 114]) in the Hilbert space or finite dimensional space setting, but a rigorous regularization theory of LSQR and CGLS for (1) is still lacking, and there has been no definitive answer to the above long-standing fundamental question; the same is true for LSMR and CGME.

For $A$ symmetric, MINRES and MR-II applied to (1) directly are alternatives and have been shown to have regularizing effects [17, 45, 49, 56, 67, 79], but MR-II seems preferable since the noisy $b$ is excluded from the underlying subspace [65, 67]. For $A$ nonsymmetric or when multiplication with $A^{T}$ is difficult to compute, GMRES and RRGMRES are candidate methods [3, 18, 19, 93], and the latter may be better [67]. The hybrid approaches based on the Arnoldi process were first proposed in [20] and studied in [17, 21, 85, 95]. Gazzola and her coauthors [31]-[35] have described a general framework of the hybrid methods and presented various Krylov-Tikhonov methods with different parameter-choice strategies. Unfortunately, unlike LSQR and CGLS, these methods are highly problem dependent and may not have regularizing effects for general nonsymmetric ill-posed problems; see, e.g., [67] and [56, p.126]. The fundamental cause is that the underlying Krylov subspaces may not favor the dominant left and right singular subspaces of $A$, which are desired in solving (1).

The behavior of ill-posed problems critically depends on the decay rate of the $\sigma_i$. The following characterization of the degree of ill-posedness of (1) was introduced in [62] and has been widely used [1, 27, 53, 56, 89]: if $\sigma_i=\mathcal{O}(i^{-\alpha})$, then (1) is mildly or moderately ill-posed for $\frac{1}{2}<\alpha\le 1$ or $\alpha>1$, respectively; if $\sigma_i=\mathcal{O}(\rho^{-i})$ with $\rho>1$, then (1) is severely ill-posed. Here for mildly ill-posed problems we add the requirement $\alpha>\frac{1}{2}$, which does not appear in [62] but must be met for (1) [48, 53]. In the one-dimensional case, (1) is severely ill-posed when the kernel $k(s,t)$ is sufficiently smooth, and it is moderately ill-posed with $\alpha$ determined by the highest order of continuous derivatives of $k(s,t)$; see, e.g., [53, p.8] and [56, p.10-11]. Clearly, the singular values of a severely ill-posed problem decay at the same rate $\rho^{-1}$ from one to the next, while those of a moderately or mildly ill-posed problem decay at the decreasing rate $\left(\frac{i}{i+1}\right)^{\alpha}$, which approaches one more quickly for the mildly ill-posed problem than for the moderately ill-posed problem.
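A quick way to see the three decay regimes side by side is the following sketch; the parameter values $\rho=2$, $\alpha=2$ and $\alpha=0.8$ are arbitrary illustrative choices.

```python
import numpy as np

i = np.arange(1, 101)
severe   = 2.0 ** (-i)      # sigma_i = rho^(-i), rho = 2: severely ill-posed
moderate = i ** (-2.0)      # sigma_i = i^(-alpha), alpha = 2 > 1: moderately ill-posed
mild     = i ** (-0.8)      # sigma_i = i^(-alpha), 1/2 < alpha <= 1: mildly ill-posed

# Ratios of consecutive singular values: constant for the severe case,
# approaching one for the moderate case and (faster) for the mild case.
for name, s in [("severe", severe), ("moderate", moderate), ("mild", mild)]:
    print(name, (s[1:] / s[:-1])[[0, 9, 49]])
```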

If a regularized solution to (1) is at least as accurate as $x_{k_0}^{tsvd}$, then it is called a best possible regularized solution. Given (1), if the regularized solution of an iterative regularization solver at semi-convergence is a best possible one, then, in the words of Björck and Eldén, the solver works for the problem and is said to have the full regularization. Otherwise, the solver is said to have only the partial regularization.

Because it has been unknown whether or not LSQR, CGLS, LSMR and CGME have the full regularization for a given (1), one commonly combines them with some explicit regularization, so that the resulting hybrid variants (hopefully) find best possible regularized solutions [1, 53, 56]. A hybrid CGLS runs CGLS for several trial regularization parameters and picks the best one among the candidates [1]. Its disadvantages are that regularized solutions cannot be updated when the regularization parameter changes and that there is no guarantee that the selected regularized solution is a best possible one. The hybrid LSQR variants have been advocated by Björck and Eldén [12] and O'Leary and Simmons [96], and improved and developed by Björck [9] and Björck, Grimme and Van Dooren [13]. A hybrid LSQR first projects (1) onto Krylov subspaces and then regularizes the projected problems explicitly. It aims to remove the effects of small Ritz values and expands the Krylov subspace until it captures the dominant SVD components of $A$ [9, 13, 48, 96]. The hybrid LSQR and CGME have been intensively studied in, e.g., [6, 7, 8, 23, 47, 48, 85, 93, 95, 103] and [1, 56, 59]. Within the framework of such hybrid solvers, it is hard to find a near-optimal regularization parameter [13, 103]. More seriously, as we will elaborate mathematically and numerically in the concluding section of this paper, it may make no sense to speak of the regularization of the projected problems and their optimal regularization parameters since they may actually fail to satisfy the discrete Picard conditions. In contrast, if an iterative solver has the full regularization, we simply stop it at semi-convergence. Obviously, we cannot emphasize too much the importance of completely understanding the regularization of LSQR, CGLS, LSMR and CGME. By the definition of the full or partial regularization, we now modify the concern of Björck and Eldén as follows: Do LSQR, CGLS, LSMR and CGME have the full or partial regularization for severely, moderately and mildly ill-posed problems? How can their full or partial regularization be identified in practice?

In this paper, assuming exact arithmetic, we first focus on LSQR and make a rigorous analysis of its regularization for severely, moderately and mildly ill-posed problems. Due to the mathematical equivalence of CGLS and LSQR, the assertions on the full or partial regularization of LSQR apply to CGLS as well. We then analyze the regularizing effects of LSMR and CGME and draw definitive conclusions. We prove that LSQR has the full regularization for severely and moderately ill-posed problems once $\rho$ and $\alpha$ are suitably large, and that it generally has only the partial regularization for mildly ill-posed problems. In Section 2, we describe the Lanczos bidiagonalization process and LSQR, and make an introductory analysis. In Section 3, we establish accurate theorems for the 2-norm distance between the underlying $k$-dimensional Krylov subspace and the $k$-dimensional dominant right singular subspace of $A$. We then derive some follow-up results that play a central role in analyzing the regularization of LSQR. In Section 4, we prove that the $k$-step Lanczos bidiagonalization always generates a near best rank $k$ approximation to $A$, that the $k$ Ritz values always approximate the first $k$ large singular values in natural order, and that no small Ritz value appears for these $k$. This will show that LSQR has the full regularization. For mildly ill-posed problems, we prove that, for some $k$, the Ritz values generally do not approximate the first $k$ large singular values in natural order and LSQR generally has only the partial regularization. In Section 5, we derive bounds for the entries of the bidiagonal matrices generated by Lanczos bidiagonalization, proving how fast they decay and showing how to use them to reliably identify whether LSQR has the full regularization when the degree of ill-posedness of (1) is unknown in advance. Exploiting some of the results on LSQR, we analyze the regularization of LSMR and CGME and prove that LSMR has similar regularizing effects to LSQR for each kind of problem and that both of them are superior to CGME. In Section 6, we present some perturbation results and prove that LSQR resembles the TSVD method for severely and moderately ill-posed problems. In Section 7, with a number of nontrivial changes and reformulations, we extend all the results to the case that $A$ has multiple singular values. In Section 8, we report numerical experiments to confirm our theory on LSQR. Finally, we summarize the paper with further remarks in Section 9.

Throughout the paper, denote by $\mathcal{K}_k(C,w)$ the $k$-dimensional Krylov subspace generated by the matrix $C$ and the vector $w$, and by $I$ and the bold letter $\mathbf{0}$ the identity matrix and the zero matrix with orders clear from the context, respectively. For a matrix $C=(c_{ij})$, we define the nonnegative matrix $|C|=(|c_{ij}|)$, and for matrices $B$ and $C$ of the same size, $|B|\le|C|$ means $|b_{ij}|\le|c_{ij}|$ componentwise.

2 The LSQR algorithm

LSQR is based on the Lanczos bidiagonalization process, which computes two sets of orthonormal vectors that form bases of the Krylov subspaces $\mathcal{K}_{k+1}(AA^{T},b)$ and $\mathcal{K}_{k}(A^{T}A,A^{T}b)$ for $k=1,2,\ldots$, respectively. We describe the process as Algorithm 1.

Algorithm 1: $k$-step Lanczos bidiagonalization process.
1. Take $\beta_1 p_1=b$ with $\beta_1=\|b\|$, and define $\alpha_1 q_1=A^{T}p_1$ with $\alpha_1=\|A^{T}p_1\|$.
2. For $j=1,2,\ldots,k$:
(i) $\beta_{j+1}p_{j+1}=Aq_j-\alpha_jp_j$ with $\beta_{j+1}=\|Aq_j-\alpha_jp_j\|$;
(ii) $\alpha_{j+1}q_{j+1}=A^{T}p_{j+1}-\beta_{j+1}q_j$ with $\alpha_{j+1}=\|A^{T}p_{j+1}-\beta_{j+1}q_j\|$.

Algorithm 1 can be written in the matrix form

AQ_k = P_{k+1}B_k, \qquad (10)
A^{T}P_{k+1} = Q_kB_k^{T} + \alpha_{k+1}q_{k+1}e_{k+1}^{T}, \qquad (11)

where $e_{k+1}$ denotes the $(k+1)$-th canonical basis vector of $\mathbb{R}^{k+1}$, $P_{k+1}=(p_1,p_2,\ldots,p_{k+1})$, $Q_k=(q_1,q_2,\ldots,q_k)$, and

B_k = \begin{pmatrix} \alpha_1 & & & \\ \beta_2 & \alpha_2 & & \\ & \ddots & \ddots & \\ & & \beta_k & \alpha_k \\ & & & \beta_{k+1} \end{pmatrix}\in\mathbb{R}^{(k+1)\times k}. \qquad (12)

It is known from (10) that

B_k = P_{k+1}^{T}AQ_k. \qquad (13)

We remind the reader that the singular values of $B_k$, called the Ritz values of $A$ with respect to the left and right subspaces $\mathcal{K}_{k+1}(AA^{T},b)$ and $\mathcal{K}_{k}(A^{T}A,A^{T}b)$, are all simple. This basic fact will often be used later.

At iteration $k$, LSQR solves the problem $\min_{x\in\mathcal{K}_k(A^{T}A,A^{T}b)}\|Ax-b\|$ and computes the iterates

x^{(k)} = Q_ky^{(k)} \quad \mbox{with} \quad y^{(k)} = \arg\min_{y\in\mathbb{R}^{k}}\|B_ky-\beta_1e_1\|, \qquad (14)

where $e_1$ is the first canonical basis vector of $\mathbb{R}^{k+1}$, and the residual norm $\|Ax^{(k)}-b\|$ decreases monotonically with respect to $k$. We have $\|Ax^{(k)}-b\|=\|B_ky^{(k)}-\beta_1e_1\|$ and $\|x^{(k)}\|=\|y^{(k)}\|$, both of which can be cheaply computed. Note that $\beta_1e_1=P_{k+1}^{T}b$. We have

x^{(k)} = Q_kB_k^{\dagger}P_{k+1}^{T}b = (P_{k+1}B_kQ_k^{T})^{\dagger}b; \qquad (15)

that is, the iterate $x^{(k)}$ by LSQR is the minimum-norm least squares solution to the perturbed problem that replaces $A$ in (1) by its rank $k$ approximation $P_{k+1}B_kQ_k^{T}$. Recall that the best rank $k$ approximation $A_k$ to $A$ satisfies $\|A-A_k\|=\sigma_{k+1}$. We can relate LSQR and the TSVD method from two perspectives. One of them is to interpret LSQR as solving a nearby problem that perturbs $A_k$ to $P_{k+1}B_kQ_k^{T}$, provided that $P_{k+1}B_kQ_k^{T}$ is a near best rank $k$ approximation to $A$ with the approximation accuracy $\sigma_{k+1}$. The other is to interpret $x^{(k)}$ and $x_k^{tsvd}$ as the solutions to two perturbed problems of (1) that replace $A$ by rank $k$ approximations of the same quality, respectively. Both perspectives lead to the same consequence: the LSQR iterate $x^{(k)}$ is as accurate as $x_k^{tsvd}$ and is thus a best possible regularized solution to (1), provided that $P_{k+1}B_kQ_k^{T}$ is a near best rank $k$ approximation to $A$ with the approximation accuracy $\sigma_{k+1}$ and the singular values of $B_k$ approximate the first $k$ large ones of $A$ in natural order for $k=1,2,\ldots,k_0$. Otherwise, as will be clear later, $x^{(k)}$ cannot be as accurate as $x_k^{tsvd}$ if either $P_{k+1}B_kQ_k^{T}$ is not a near best rank $k$ approximation to $A$ or $B_k$ has at least one singular value smaller than $\sigma_{k+1}$. We will give a precise definition of a near best rank $k$ approximation later. As stated in the introduction, the semi-convergence of LSQR must occur at some iteration $k^{*}$. Under the discrete Picard condition (6), if semi-convergence occurs at iteration $k^{*}=k_0$, we are sure that LSQR has the full regularization because $x^{(k_0)}$ has captured the $k_0$ dominant SVD components of $A$ and effectively suppressed the other SVD components; if semi-convergence occurs at some iteration $k^{*}<k_0$, then LSQR has only the partial regularization since it has not yet captured all the needed dominant SVD components of $A$.
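For readers who want to experiment, the following NumPy sketch implements the standard Golub-Kahan (Lanczos) bidiagonalization recursions of Algorithm 1 in exact-arithmetic style (no reorthogonalization); the function name and return convention are our own illustrative choices.

```python
import numpy as np

def lanczos_bidiag(A, b, k):
    """k steps of Lanczos (Golub-Kahan) bidiagonalization of A started with b.

    Returns P (m x (k+1)) and Q (n x k) with orthonormal columns and the
    (k+1) x k lower bidiagonal matrix B satisfying A @ Q = P @ B.
    """
    m, n = A.shape
    P = np.zeros((m, k + 1))
    Q = np.zeros((n, k))
    B = np.zeros((k + 1, k))

    beta = np.linalg.norm(b)
    P[:, 0] = b / beta
    r = A.T @ P[:, 0]
    alpha = np.linalg.norm(r)
    Q[:, 0] = r / alpha
    B[0, 0] = alpha

    for j in range(k):
        # beta_{j+2} p_{j+2} = A q_{j+1} - alpha_{j+1} p_{j+1}
        s = A @ Q[:, j] - alpha * P[:, j]
        beta = np.linalg.norm(s)
        P[:, j + 1] = s / beta
        B[j + 1, j] = beta
        if j + 1 < k:
            # alpha_{j+2} q_{j+2} = A^T p_{j+2} - beta_{j+2} q_{j+1}
            r = A.T @ P[:, j + 1] - beta * Q[:, j]
            alpha = np.linalg.norm(r)
            Q[:, j + 1] = r / alpha
            B[j + 1, j + 1] = alpha
    return P, Q, B
```

A quick sanity check is that `np.linalg.norm(A @ Q - P @ B)` is at the level of rounding errors, and `np.linalg.svd(B, compute_uv=False)` gives the Ritz values discussed below.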

3 Theorems for the distances between $\mathcal{K}_k(A^{T}A,A^{T}b)$ and the $k$-dimensional dominant right singular subspace of $A$, and other related results

Van der Sluis and Van der Vorst [113] prove the following result, which has been used in Hansen [53] and the references therein to illustrate and analyze the regularizing effects of LSQR and CGLS. We will also investigate it further in our paper.

Proposition 3.1

LSQR with the starting vector $x^{(0)}=0$ and CGLS applied to the normal equations $A^{T}Ax=A^{T}b$ with the starting vector $x^{(0)}=0$ generate the same iterates

x^{(k)} = \sum_{i=1}^{n} f_i^{(k)}\,\frac{u_i^{T}b}{\sigma_i}\,v_i, \qquad (16)

where

f_i^{(k)} = 1 - \prod_{j=1}^{k}\frac{\big(\theta_j^{(k)}\big)^{2}-\sigma_i^{2}}{\big(\theta_j^{(k)}\big)^{2}}, \qquad i=1,2,\ldots,n, \qquad (17)

and the $\theta_j^{(k)}$ are the singular values of $B_k$ labeled as $\theta_1^{(k)}>\theta_2^{(k)}>\cdots>\theta_k^{(k)}>0$.

(16) shows that $x^{(k)}$ has a filtered SVD expansion of the form (9). If all the $k$ Ritz values approximate the first $k$ singular values of $A$ in natural order, the filters satisfy $f_i^{(k)}\approx 1$ for $i=1,2,\ldots,k$, and the other $f_i^{(k)}$ monotonically decay to zero for $i>k$. If this is the case until $k=k_0$, the $k_0$-step LSQR has the full regularization and computes a best possible regularized solution $x^{(k_0)}$. However, if a small Ritz value appears at some iteration $k\le k_0$, i.e., $\theta_k^{(k)}<\sigma_{k+1}$ with $k$ the smallest such integer, then for a suitable index $i$ one factor of the product in (17) becomes non-positive while the remaining factors stay positive, so that the corresponding filter satisfies $f_i^{(k)}>1$; the regularized solution $x^{(k)}$ is then deteriorated and LSQR has only the partial regularization. Hansen [53, p.146-157] summarizes the known results on the filters $f_i^{(k)}$, where a bound for them is given in [53, p.155], but there is no accurate estimate for that bound. As we will see in Section 6, the results to be established in this paper can be used for this purpose, and, more importantly, we will show that the bound in [53, p.155] can be sharpened substantially.

The standard $k$-step Lanczos bidiagonalization method computes the $k$ Ritz values $\theta_j^{(k)}$, which are used to approximate some of the singular values of $A$, and it is mathematically equivalent to the symmetric Lanczos method for the eigenvalue problem of $A^{T}A$ starting with $A^{T}b$; see [10, 11] or [2, 71, 72] for several variations that are based on the standard, harmonic, and refined projections [4, 108, 115] or a combination of them. A general convergence theory of harmonic and refined harmonic projection methods was lacking in the books [4, 108, 115] and was established later in [70]. As is known from [10, 86, 100], for a general singular value distribution and a general vector $b$, some of the Ritz values become good approximations to the largest and smallest singular values of $A$ as $k$ increases. If the large singular values are well separated but the small singular values are clustered, the large Ritz values converge fast but the small Ritz values converge very slowly.

For (1), we see from (4) and (6) that $A^{T}b$ contains more information on the dominant right singular vectors than on the ones corresponding to small singular values. Therefore, $\mathcal{K}_k(A^{T}A,A^{T}b)$ hopefully contains richer information on the first $k$ right singular vectors than on the remaining ones, at least for $k$ small. Furthermore, note that $A$ has many small singular values clustered at zero. Due to these two basic facts, all the $k$ Ritz values are expected to approximate the large singular values of $A$ in natural order until some iteration, at which a small Ritz value shows up and the regularized solutions then start to be contaminated by the noise dramatically after that iteration. These qualitative arguments are frequently used to analyze and elaborate the regularizing effects of LSQR and CGLS; see, e.g., [1, 53, 55, 56, 59] and the references therein. Clearly, these arguments are not precise and cannot help us draw any definitive conclusion on the full or partial regularization of LSQR. For a severely ill-posed example from seismic tomography, it is reported in [114] that the desired convergence of the Ritz values actually holds as long as the discrete Picard condition is satisfied and there is a good separation among the large singular values of $A$. Unfortunately, there has been no mathematical justification of these observations.
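These qualitative expectations are easy to probe numerically. The sketch below, which reuses the illustrative `lanczos_bidiag` function defined after Algorithm 1, prints the Ritz values, i.e., the singular values of $B_k$, next to the $k$ largest singular values of $A$.

```python
import numpy as np

def compare_ritz_values(A, b, k):
    """Print the Ritz values theta_j^{(k)} next to the k largest singular values of A.

    Relies on lanczos_bidiag(A, b, k) from the sketch following Algorithm 1.
    """
    _, _, B = lanczos_bidiag(A, b, k)
    ritz = np.linalg.svd(B, compute_uv=False)        # singular values of B_k, decreasing
    sigma = np.linalg.svd(A, compute_uv=False)[:k]   # sigma_1 >= ... >= sigma_k
    for j, (t, s) in enumerate(zip(ritz, sigma), start=1):
        print(f"j={j}:  ritz={t:.3e}  sigma={s:.3e}")
```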

A complete understanding of the regularization of LSQR includes accurate solutions of the following basic problems: How well or accurately does $\mathcal{K}_k(A^{T}A,A^{T}b)$ approximate or capture the $k$-dimensional dominant right singular subspace of $A$? How accurate is the rank $k$ approximation $P_{k+1}B_kQ_k^{T}$ to $A$? Can it be a near best rank $k$ approximation to $A$? How does the noise level affect the approximation accuracies of $\mathcal{K}_k(A^{T}A,A^{T}b)$ and $P_{k+1}B_kQ_k^{T}$, respectively? What sufficient conditions on $A$ and $b$ are needed to guarantee that $P_{k+1}B_kQ_k^{T}$ is a near best rank $k$ approximation to $A$? When do the $k$ Ritz values approximate the first $k$ large singular values of $A$ in natural order? When does at least one small Ritz value appear, i.e., one smaller than $\sigma_{k+1}$, before some iteration $k\le k_0$? We will make a rigorous and detailed analysis of these problems and some others closely related, present our results, and draw definitive assertions on the regularization of LSQR for the three kinds of ill-posed problems.

In terms of the canonical angles between two subspaces of the same dimension [109, p.43], we first present the following theorem, showing how the $k$-dimensional Krylov subspace $\mathcal{K}_k(A^{T}A,A^{T}b)$ captures or approximates the $k$-dimensional dominant right singular subspace of $A$ for severely ill-posed problems.

Theorem

Let the SVD of $A$ be as in (4). Assume that (1) is severely ill-posed with $\sigma_i=\mathcal{O}(\rho^{-i})$ and $\rho>1$, $i=1,2,\ldots,n$, and that the discrete Picard condition (6) is satisfied. Let the $k$-dimensional dominant right singular subspace of $A$ be spanned by the columns of $(v_1,v_2,\ldots,v_k)$. Then for $k=1,2,\ldots,n-1$ we have

(18)

with to be defined by (26) and

(19)
(20)

where

(21)

In particular, we have

(22)
(23)
(24)

Proof. Let whose columns are the first left singular vectors of defined by (4). Then the Krylov subspace with

Partition the diagonal matrix and the matrix as follows:

where . Since is a Vandermonde matrix with supposed to be distinct for , it is nonsingular. Therefore, from we have

(25)

where

(26)

Write , and define

(27)

Then , and the columns of form an orthonormal basis of . So we get an orthogonal direct sum decomposition of :

(28)

By definition and (28), for the matrix 2-norm we obtain

(29)

which is (18).

Next we estimate . For , it is easily justified that the -th column of consists of the coefficients of the -th Lagrange polynomial

that interpolates the elements of the -th canonical basis vector at the abscissas . Consequently, the -th column of is

(30)

from which we obtain

(31)

Since is monotonically decreasing for , it is bounded by . With this property and the definition of by (21), we get