
# NESTA: A Fast and Accurate First-order Method for Sparse Recovery

Stephen Becker, Jérôme Bobin, and Emmanuel J. Candès. Applied and Computational Mathematics, Caltech, Pasadena, CA 91125 (srbecker, bobin, emmanuel@acm.caltech.edu). This work has been partially supported by ONR grants N00014-09-1-0469 and N00014-08-1-0749, by a DARPA grant FA8650-08-C-7853, and by the 2006 Waterman Award from NSF. Submitted April 16, 2009.
###### Abstract

Accurate signal recovery or image reconstruction from indirect and possibly undersampled data is a topic of considerable interest; for example, the literature in the recent field of compressed sensing is already quite immense. Inspired by recent breakthroughs in the development of novel first-order methods in convex optimization, most notably Nesterov’s smoothing technique, this paper introduces a fast and accurate algorithm for solving common recovery problems in signal processing. In the spirit of Nesterov’s work, one of the key ideas of this algorithm is a subtle averaging of sequences of iterates, which has been shown to improve the convergence properties of standard gradient-descent algorithms. This paper demonstrates that this approach is ideally suited for solving large-scale compressed sensing reconstruction problems as 1) it is computationally efficient, 2) it is accurate and returns solutions with several correct digits, 3) it is flexible and amenable to many kinds of reconstruction problems, and 4) it is robust in the sense that its excellent performance across a wide range of problems does not depend on the fine tuning of several parameters. Comprehensive numerical experiments on realistic signals exhibiting a large dynamic range show that this algorithm compares favorably with recently proposed state-of-the-art methods. We also apply the algorithm to solve other problems for which there are fewer alternatives, such as total-variation minimization, and convex programs seeking to minimize the ℓ1 norm of Wx under constraints, in which W is not diagonal.

Key words. Nesterov’s method, smooth approximations of nonsmooth functions, ℓ1 minimization, duality in convex optimization, continuation methods, compressed sensing, total-variation minimization.

## 1 Introduction

Compressed sensing (CS) [13, 14, 25] is a novel sampling theory, which is based on the revelation that one can exploit sparsity or compressibility when acquiring signals of general interest. In a nutshell, compressed sensing designs nonadaptive sampling techniques that condense the information in a compressible signal into a small amount of data. There are some indications that because of the significant reduction in the number of measurements needed to recover a signal accurately, engineers are changing the way they think about signal acquisition in areas ranging from analog-to-digital conversion [23] and digital optics to magnetic resonance imaging [38], seismics [37], and astronomy [8].

In this field, a signal is acquired by collecting data of the form

 b=Ax0+z,

where x0 ∈ ℝⁿ is the signal of interest (or its coefficient sequence in a representation where it is assumed to be fairly sparse), A is a known m × n “sampling” matrix, and z is a noise term. In compressed sensing and elsewhere, a standard approach attempts to reconstruct x0 by solving

 minimize f(x) subject to ∥b − Ax∥ℓ2 ≤ ϵ,   (1.1)

where ϵ is an estimated upper bound on the noise power. The choice of the regularizing function f depends on prior assumptions about the signal x0 of interest: if x0 is (approximately) sparse, an appropriate convex function is the ℓ1 norm (as advocated by the CS theory); if x0 is a piecewise constant object, the total-variation norm provides accurate recovery results, and so on.

Solving large-scale problems such as (1.1) (think of x0 as having millions of entries as in mega-pixel images) is challenging. Although one cannot review the vast literature on this subject, the majority of the algorithms that have been proposed are unable to solve these problems accurately with low computational complexity. On the one hand, standard second-order methods such as interior-point methods [10, 36, 48] are accurate but problematic, for they need to solve large systems of linear equations to compute the Newton steps. On the other hand, inspired by iterative thresholding ideas [24, 30, 20], a great number of first-order methods are now available, see [31, 9, 34, 35] and the many earlier references therein, which may be faster but not necessarily accurate. Indeed, these methods are known to converge slowly, and typically need a very large number of iterations when high accuracy is required.

We would like to pause on the demand for high accuracy since this is the main motivation of the present paper. While in some applications, one may be content with one or two digits of accuracy, there are situations in which this is simply unacceptable. Imagine that the matrix A models a device giving information about the signal x0, such as an analog-to-digital converter, for example. Here, the ability to detect and recover low-power signals that are barely above the noise floor, and possibly further obscured by large interferers, is critical to many applications. In mathematical terms, one could have a superposition of high power signals corresponding to components of x0 with magnitude of order 1, and low power signals with amplitudes as far as 100 dB down, corresponding to components with magnitude about 10⁻⁵. In this regime of high dynamic range, very high accuracy is required. In the example above, one would need at least five digits of precision as otherwise, the low power signals would go undetected.

Another motivation is solving (1.1) accurately when the signal x0 is not exactly sparse, but rather approximately sparse, as in the case of real-world compressible signals. Since exactly sparse signals are rarely found in applications—while compressible signals are ubiquitous—it is important to have an accurate first-order method to handle realistic signals.

### 1.1 Contributions

A few years ago, Nesterov [43] published a seminal paper which couples smoothing techniques (see [4] and the references therein) with an improved gradient method to derive first-order methods which achieve a convergence rate he had proved to be optimal [41] two decades earlier. As a consequence of this breakthrough, a few recent works have followed up with improved techniques for some very special problems in signal or image processing, see [3, 21, 52, 1] for example, or for minimizing composite functions such as ℓ1-regularized least-squares problems [44]. In truth, these novel algorithms demonstrate great promise; they are fast, accurate and robust in the sense that their performance does not depend on the fine tuning of various controlling parameters.

This paper also builds upon Nesterov’s work by extending some of his works discussed just above, and proposes an algorithm—or, better said, a class of algorithms—for solving recovery problems from incomplete measurements. We refer to this algorithm as NESTA—a shorthand for Nesterov’s algorithm—to acknowledge the fact that it is based on his method. The main purpose and the contribution of this paper consist in showing that NESTA obeys the following desirable properties.

• Speed: NESTA is an iterative algorithm where each iteration is decomposed into three steps, each involving only a few matrix-vector operations when A∗A is an orthogonal projector and, more generally, when the eigenvalues of A∗A are well clustered. This, together with the accelerated convergence rate of Nesterov’s algorithm [43, 3], makes NESTA a method of choice for solving large-scale problems. Furthermore, NESTA’s convergence is mainly driven by a single smoothing parameter μ introduced in Section 2. One can use continuation techniques [34, 35] to dynamically update this parameter and substantially accelerate the algorithm.

• Accuracy: NESTA depends on a few parameters that can be set in a very natural fashion. In fact, there is a trivial relationship between the value of these parameters and the desired accuracy. Furthermore, our numerical experiments demonstrate that NESTA can find the first 4 or 5 significant digits of the optimal solution to (1.1), where f(x) is the ℓ1 norm or the total-variation norm of x, in a few hundred iterations. This makes NESTA amenable to solving recovery problems involving signals of very large sizes that also exhibit a great dynamic range.

• Flexibility: NESTA can be adapted to solve many problems beyond ℓ1 minimization with the same efficiency, such as total-variation (TV) minimization problems. In this paper, we will also discuss applications in which f(x) in (1.1) is of the form f(x) = ∥Wx∥ℓ1, where one may think of W as a short-time Fourier transform (also known as the Gabor transform), a curvelet transform, an undecimated wavelet transform and so on, or a combination of these, or a general arbitrary dictionary of waveforms (note that this class of recovery problems also includes weighted ℓ1 methods [16]). This is particularly interesting because recent work [29] suggests the potential advantage of this analysis-based approach over the classical basis-pursuit approach in solving important inverse problems.

A consequence of these properties is that NESTA, and more generally Nesterov’s method, may be of interest to researchers working in the broad area of signal recovery from indirect and/or undersampled data.

Another contribution of this paper is that it also features a fairly wide range of numerical experiments comparing various methods against problems involving realistic and challenging data. By challenging, we mean problems of very large scale where the unknown solution exhibits a large dynamic range; that is, problems for which classical second-order methods are too slow, and for which standard first-order methods do not provide sufficient accuracy. More specifically, Section 5 presents a comprehensive series of numerical experiments which illustrate the behavior of several state-of-the-art methods including interior-point methods [36], projected gradient techniques [34, 51, 31], fixed point continuation and iterative thresholding algorithms [34, 56, 3]. It is important to keep in mind that most of these methods have been perfected after several years of research [36, 31], and did not exist two years ago. For example, the Fixed Point Continuation method with Active Set [35], which represents a notable improvement over existing ideas, was released while we were working on this paper.

### 1.2 Organization of the paper and notations

As emphasized earlier, NESTA is based on Nesterov’s ideas, and Section 2 gives a brief but essential description of Nesterov’s algorithmic framework. The proposed algorithm is introduced in Section 3. Inspired by continuation-like schemes, an accelerated version of NESTA is described in Section 3.6. We report on extensive and comparative numerical experiments in Section 5. Section 6 covers extensions of NESTA to minimize the ℓ1 norm of Wx under data constraints, and includes realistic simulations in the field of radar pulse detection and estimation. Section 7 extends NESTA to solve total-variation problems and presents numerical experiments which demonstrate its remarkable efficiency there as well. Finally, we conclude with Section 8, discussing further extensions which would address an even wider range of linear inverse problems.

Notations. Before we begin, it is best to provide a brief summary of the notations used throughout the paper. As usual, vectors are written in small letters and matrices in capital letters. The ith entry of a vector x is denoted x[i] and the (i, j)th entry of the matrix A is A[i, j].

It is convenient to introduce some common optimization problems that will be discussed throughout. Solving sparse reconstruction problems can be approached via several different equivalent formulations. In this paper, we particularly emphasize the quadratically constrained ℓ1-minimization problem

 (BPϵ)  minimize ∥x∥ℓ1 subject to ∥b − Ax∥ℓ2 ≤ ϵ,   (1.2)

where ϵ quantifies the uncertainty about the measurements, as in the situation where the measurements are noisy. This formulation is often preferred because a reasonable estimate of ϵ may be known. A second frequently discussed approach considers solving this problem in Lagrangian form, i.e.

 (QPλ)  minimize λ∥x∥ℓ1 + ½∥b − Ax∥²ℓ2,   (1.3)

and is also known as the basis pursuit denoising problem (BPDN) [18]. This problem is popular in signal and image processing because of its loose interpretation as a maximum a posteriori estimate in a Bayesian setting. In statistics, the same problem is better known as the lasso [49]

 (LSτ)  minimize ∥b − Ax∥ℓ2 subject to ∥x∥ℓ1 ≤ τ.   (1.4)

Standard optimization theory [47] asserts that these three problems are of course equivalent provided that ϵ, λ and τ obey some special relationships. With the exception of the case where the matrix A is orthogonal, this functional dependence is hard to compute [51]. Because it is usually more natural to determine an appropriate ϵ rather than an appropriate λ or τ, the fact that NESTA solves (BPϵ) is a significant advantage. Further, note that theoretical equivalence of course does not mean that all three problems are just as easy (or just as hard) to solve. For instance, the constrained problem (BPϵ) is harder to solve than the unconstrained problem (QPλ), as discussed later in the paper. Therefore, the fact that NESTA turns out to be competitive with algorithms that only solve (QPλ) is quite remarkable.

## 2 Nesterov’s method

### 2.1 Minimizing smooth convex functions

In [42, 41], Nesterov introduces a subtle algorithm to minimize any smooth convex function f on a convex set Qp,

 min_{x∈Qp} f(x).   (2.1)

We will refer to Qp as the primal feasible set. The function f is assumed to be differentiable, and its gradient ∇f(x) is Lipschitz and obeys

 ∥∇f(x) − ∇f(y)∥ℓ2 ≤ L∥x − y∥ℓ2;   (2.2)

in short, L is an upper bound on the Lipschitz constant. With these assumptions, Nesterov’s algorithm minimizes f over Qp by iteratively estimating three sequences {xk}, {yk} and {zk} while smoothing the feasible set Qp. The algorithm depends on two scalar sequences {αk} and {τk} discussed below, and takes the following form:

 Initialize x0. For k ≥ 0,

 1. Compute ∇f(xk).
 2. Compute yk: yk = argmin_{x∈Qp} (L/2)∥x − xk∥²ℓ2 + ⟨∇f(xk), x − xk⟩.
 3. Compute zk: zk = argmin_{x∈Qp} (L/σp) pp(x) + ∑_{i=0}^{k} αi⟨∇f(xi), x − xi⟩.
 4. Update xk: xk = τk zk + (1 − τk) yk.

 Stop when a given criterion is valid.

At step k, yk is the current guess of the optimal solution. If we only performed the second step of the algorithm with yk−1 instead of xk, we would obtain a standard first-order technique with convergence rate O(1/k).

The novelty is that the sequence {zk} “keeps in mind” the previous iterations, since Step 3 involves a weighted sum of already computed gradients. Another aspect of this step is that (borrowing ideas from smoothing techniques in optimization [4]) it makes use of a prox-function pp(x) for the primal feasible set Qp. This function is strongly convex with parameter σp; assuming that pp(x) vanishes at the prox-center x_c^p = argmin_x pp(x), this gives

 pp(x) ≥ (σp/2)∥x − x_c^p∥²ℓ2.

The prox-function is usually chosen so that x_c^p ∈ Qp, thus discouraging zk from moving too far away from the center x_c^p.

The point xk, at which the gradient of f is evaluated, is a weighted average between zk and yk. In truth, this is motivated by a theoretical analysis [43, 50], which shows that if αk = (k + 1)/2 and τk = 2/(k + 3), then the algorithm converges to

 x⋆ = argmin_{x∈Qp} f(x)

with the convergence rate

 f(yk) − f(x⋆) ≤ 4L pp(x⋆) / ((k + 1)² σp).   (2.3)

This decay is far better than what is achievable via standard gradient-based optimization techniques, since we have an approximation error scaling like O(1/k²) instead of O(1/k).
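For concreteness, the three-sequence recursion above can be sketched in a few lines. The quadratic test objective, the problem sizes, and the unconstrained choice Qp = ℝⁿ with prox-function pp(x) = ½∥x − xc∥² (so that both argmins are explicit) are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

# Sketch of Nesterov's three-sequence scheme on the smooth test objective
# f(x) = 1/2 ||Mx - c||^2, with Qp = R^n and pp(x) = 1/2 ||x - xc||^2.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 10))
c = rng.standard_normal(20)

f = lambda x: 0.5 * np.linalg.norm(M @ x - c) ** 2
grad = lambda x: M.T @ (M @ x - c)
L = np.linalg.norm(M, 2) ** 2          # Lipschitz constant of the gradient

xc = np.zeros(10)                      # prox-center, also the initial guess
xk = xc.copy()
cum_grad = np.zeros(10)                # running sum  sum_i alpha_i grad f(x_i)
for k in range(2000):
    g = grad(xk)
    yk = xk - g / L                                     # Step 2: gradient step
    cum_grad += 0.5 * (k + 1) * g                       # alpha_k = (k+1)/2
    zk = xc - cum_grad / L                              # Step 3: gradient memory
    xk = 2.0 / (k + 3) * zk + (k + 1) / (k + 3.0) * yk  # tau_k = 2/(k+3)

x_star = np.linalg.lstsq(M, c, rcond=None)[0]
print(f(yk) - f(x_star))   # small; the gap decays like O(L/k^2)
```

Dropping the `zk` update and iterating `xk = yk` recovers plain gradient descent, which is precisely the O(1/k) scheme the text contrasts against.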

### 2.2 Minimizing nonsmooth convex functions

In an innovative paper [43], Nesterov recently extended this framework to deal with nonsmooth convex functions. Assume that f can be written as

 f(x) = max_{u∈Qd} ⟨u, Wx⟩,   (2.4)

where x ∈ ℝⁿ, W is a p × n matrix, and Qd ⊂ ℝᵖ. We will refer to Qd as the dual feasible set, and suppose it is convex. This assumption holds for all of the problems of interest in this paper (we will see in Section 3 that it holds for f(x) = ∥x∥ℓ1, for f(x) = ∥Wx∥ℓ1, for the total-variation norm and, in general, for any induced norm), yet it provides enough information beyond the black-box model to allow cleverly designed methods with a convergence rate scaling like O(1/k) rather than O(1/√k) in the number of steps k.

With this formulation, the minimization (LABEL:eq:nestmin) can be recast as the following saddle point problem:

 min_{x∈Qp} max_{u∈Qd} ⟨u, Wx⟩.   (2.5)

The point is that the objective in (2.5) is convex but generally nonsmooth. In [43], Nesterov proposed substituting f by the smooth approximation

 fμ(x) = max_{u∈Qd} ⟨u, Wx⟩ − μ pd(u),   (2.6)

where pd(u) is a prox-function for Qd; that is, pd is continuous and strongly convex on Qd, with convexity parameter σd (we shall assume that pd vanishes at some point in Qd). Nesterov proved that fμ is continuously differentiable, and that its gradient obeys

 ∇fμ(x) = W∗ uμ(x),   (2.7)

where uμ(x) is the optimal solution of (2.6). Furthermore, ∇fμ is shown to be Lipschitz with constant

 Lμ = ∥W∥² / (μ σd)   (2.8)

(∥W∥ is the operator norm of W). Nesterov’s algorithm can then be applied to fμ as proposed in [43]. For a fixed μ, the algorithm converges to accuracy ϵ in O(√(Lμ/ϵ)) iterations. If we describe convergence in terms of the number of iterations needed to reach an ϵ solution (that is, the number of steps k needed to produce a yk obeying f(yk) − f(x⋆) ≤ ϵ), then because μ is approximately proportional to the accuracy of the approximation, and because Lμ is proportional to 1/μ, the rate of convergence is O(1/ϵ), a significant improvement over the subgradient method, which has rate O(1/ϵ²).

## 3 Extension to Compressed Sensing

We now extend Nesterov’s algorithm to solve compressed sensing recovery problems, and refer to this extension as NESTA. For now, we shall be concerned with solving the quadratically constrained ℓ1-minimization problem (1.2).

### 3.1 Nesta

We wish to solve (1.2), i.e. minimize ∥x∥ℓ1 subject to ∥b − Ax∥ℓ2 ≤ ϵ, where A ∈ ℝ^{m×n} with m < n, so that A is singular.

In this section, we assume that A∗A is an orthogonal projector, i.e. the rows of A are orthonormal. This is often the case in compressed sensing applications where it is common to take A as a submatrix of a unitary transformation which admits a fast algorithm for matrix-vector products; special instances include the discrete Fourier transform, the discrete cosine transform, the Hadamard transform, the noiselet transform, and so on. Basically, collecting incomplete structured orthogonal measurements is the prime method for efficient data acquisition in compressed sensing.

Recall that the ℓ1 norm is of the form

 ∥x∥ℓ1=maxu∈Qd⟨u,x⟩,

where the dual feasible set Qd is the ℓ∞ ball

 Qd={u:∥u∥∞≤1}.

Therefore, a natural smooth approximation to the ℓ1 norm is

 fμ(x)=maxu∈Qd⟨u,x⟩−μpd(u),

where pd(u) is our dual prox-function. For pd(u), we would like a strongly convex function which is known analytically and takes its minimum value (equal to zero) at some u ∈ Qd. It is also usual to have pd separable. Taking these criteria into account, a convenient choice is pd(u) = ½∥u∥²ℓ2, whose strong convexity parameter σd is equal to 1. With this prox-function, fμ is the well-known Huber function and ∇fμ is Lipschitz with constant Lμ = 1/μ. (In the case of total-variation minimization, in which f(x) = ∥x∥TV, fμ is not a known function.) In particular, ∇fμ(x) is given by

 ∇fμ(x)[i] = { μ⁻¹x[i], if |x[i]| < μ; sgn(x[i]), otherwise.   (3.1)
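In code, the smoothed ℓ1 norm and its gradient (3.1) are one-liners; the following is a direct transcription of the formulas above (the function names are ours, not the authors'):

```python
import numpy as np

def f_mu(x, mu):
    # Huber function: quadratic below the threshold mu, linear above
    ax = np.abs(x)
    return np.sum(np.where(ax < mu, ax**2 / (2 * mu), ax - mu / 2))

def grad_f_mu(x, mu):
    # Formula (3.1); Lipschitz with constant L_mu = 1/mu
    return np.where(np.abs(x) < mu, x / mu, np.sign(x))
```

As μ → 0, f_mu(x, μ) → ∥x∥ℓ1, and the gradient tends to sgn(x[i]) on every coordinate away from zero, which is consistent with the role of μ as a threshold discussed later in this section.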

Following Nesterov, we need to solve the smooth constrained problem

 minx∈Qpfμ(x), \hb@xt@.01(3.2)

where Qp = {x : ∥b − Ax∥ℓ2 ≤ ϵ}. Once the gradient of fμ at xk is computed, Step 2 and Step 3 of NESTA consist in updating two auxiliary iterates, namely, yk and zk.

### 3.2 Updating yk

To compute yk, we need to solve

 yk = argmin_{x∈Qp} (Lμ/2)∥xk − x∥²ℓ2 + ⟨∇fμ(xk), x − xk⟩,   (3.3)

where xk is given.

The Lagrangian for this problem is of course

 L(x, λ) = (Lμ/2)∥xk − x∥²ℓ2 + (λ/2)(∥b − Ax∥²ℓ2 − ϵ²) + ⟨∇fμ(xk), x − xk⟩,   (3.4)

and at the primal-dual solution (yk, λϵ), the Karush-Kuhn-Tucker (KKT) conditions [47] read

 ∥b − Ayk∥ℓ2 ≤ ϵ,
 λϵ ≥ 0,
 λϵ (∥b − Ayk∥²ℓ2 − ϵ²) = 0,
 Lμ(yk − xk) + λϵ A∗(Ayk − b) + ∇fμ(xk) = 0.

From the stationarity condition, yk is the solution to the linear system

 (I + (λϵ/Lμ) A∗A) yk = (λϵ/Lμ) A∗b + xk − (1/Lμ)∇fμ(xk).   (3.5)

As discussed earlier, our assumption is that A∗A is an orthogonal projector, so that

 yk = (I − (λϵ/(λϵ + Lμ)) A∗A)((λϵ/Lμ) A∗b + xk − (1/Lμ)∇fμ(xk)).   (3.6)

In this case, computing yk is cheap since no matrix inversion is required; only a few matrix-vector products are necessary. Moreover, from the KKT conditions, the value of the optimal Lagrange multiplier is obtained explicitly, and equals

 λϵ = max(0, Lμ(ϵ⁻¹∥b − Aq∥ℓ2 − 1)),  q = xk − Lμ⁻¹∇fμ(xk).   (3.7)

Observe that λϵ can be computed before yk since it only depends on q and ϵ.
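The following sketch implements the update (3.6) together with the multiplier formula (3.7) and checks primal feasibility. The matrix A (with orthonormal rows, so A A∗ = I), b, xk and ϵ are synthetic stand-ins chosen only to exercise the formulas.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 64, 16
# m x n matrix with orthonormal rows, so that A*A is an orthogonal projector
A = np.linalg.qr(rng.standard_normal((n, m)))[0].T
b = rng.standard_normal(m)
eps, mu = 0.1, 0.01
L_mu = 1.0 / mu
xk = rng.standard_normal(n)
g = np.where(np.abs(xk) < mu, xk / mu, np.sign(xk))  # gradient of smoothed l1

q = xk - g / L_mu
lam = max(0.0, L_mu * (np.linalg.norm(b - A @ q) / eps - 1.0))   # (3.7)
v = (lam / L_mu) * (A.T @ b) + q
yk = v - (lam / (lam + L_mu)) * (A.T @ (A @ v))                  # (3.6)

# No matrix inversion anywhere; yk is feasible: ||b - A yk|| <= eps
print(np.linalg.norm(b - A @ yk))
```

When λϵ > 0 the constraint is active and ∥b − A yk∥ℓ2 equals ϵ (up to rounding), matching the complementary slackness condition above.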

### 3.3 Updating zk

To compute zk, we need to solve

 zk = argmin_{x∈Qp} (Lμ/σp) pp(x) + ⟨∑_{i≤k} αi∇fμ(xi), x − xk⟩,   (3.8)

where pp(x) is the primal prox-function. The point zk differs from yk since it is computed from a weighted cumulative gradient ∑_{i≤k} αi∇fμ(xi), making it less prone to zig-zagging, which typically occurs when we have highly elliptical level sets. This step keeps a memory of the previous steps and forces zk to stay near the prox-center.

A good primal prox-function is a smooth and strongly convex function that is likely to have some positive effect near the solution. In the setting of (LABEL:eq:cs), a suitable smoothing prox-function may be

 pp(x) = ½∥x − x0∥²ℓ2   (3.9)

for some x0 ∈ ℝⁿ, e.g. an initial guess of the solution. Other choices of primal feasible set Qp may lead to other choices of prox-functions. For instance, when Qp is the standard simplex, choosing an entropy distance for pp(x) is smarter and more efficient, see [43]. In this paper, the primal feasible set is quadratic, which makes the Euclidean distance a reasonable choice. What is more important, however, is that this choice allows very efficient computations of yk and zk, while other choices may considerably slow down each Nesterov iteration. Finally, notice that the bound on the error at iteration k in (2.3) is proportional to pp(x⋆); choosing x0 wisely (a good first guess) can make pp(x⋆) small. When nothing is known about the solution, a natural choice may be x0 = A∗b; this idea will be developed in Section 3.6.

With (3.9), the strong convexity parameter σp of pp is equal to 1, and to compute zk we need to solve

 zk = argmin_x (Lμ/2)∥x − x0∥²ℓ2 + (λ/2)∥b − Ax∥²ℓ2 + ⟨∑_{i≤k} αi∇fμ(xi), x − xk⟩   (3.10)

for some value of λ. Just as before, the solution is given by

 zk = (I − (λϵ/(λϵ + Lμ)) A∗A)((λϵ/Lμ) A∗b + x0 − (1/Lμ) ∑_{i≤k} αi∇fμ(xi)),   (3.11)

with a value of the Lagrange multiplier equal to

 λϵ = max(0, Lμ(ϵ⁻¹∥b − Aq∥ℓ2 − 1)),  q = x0 − Lμ⁻¹ ∑_{i≤k} αi∇fμ(xi).   (3.12)

In practice, the individual gradients ∇fμ(xi) need not be stored; one just has to store the cumulative gradient ∑_{i≤k} αi∇fμ(xi).

### 3.4 Computational complexity

The computational complexity of each of NESTA’s steps is clear. In large-scale problems, most of the work is in the application of A and A∗. Put CA for the complexity of applying A or A∗. The first step, namely, computing ∇fμ, only requires vector operations whose complexity is O(n). Steps 2 and 3 require the application of A or A∗ three times each (we only need to compute A∗b once). Hence, the total complexity of a single NESTA iteration is 6 CA + O(n), where CA is dominant.

The calculation above is in some sense overly pessimistic. In compressed sensing applications, it is common to choose A as a submatrix of a unitary transformation U, which admits a fast algorithm for matrix-vector products. In the sequel, it might be useful to think of A as a subsampled DFT. In this case, letting R be the m × n matrix extracting the observed measurements, we have A = RU. The trick then is to compute x in the U-domain directly. Making the change of variables x ↦ Ux, our problem is

 minimize f̂μ(x) subject to ∥b − Rx∥ℓ2 ≤ ϵ,

where f̂μ(x) = fμ(U∗x). The gradient of f̂μ is then

 ∇f̂μ(x) = U∇fμ(U∗x).

With this change of variables, Steps 2 and 3 do not require applying A or A∗ since

 yk = (I − (λϵ/(λϵ + Lμ)) R∗R)((λϵ/Lμ) R∗b + xk − (1/Lμ)∇f̂μ(xk)),

where R∗R is the diagonal matrix with 0/1 diagonal entries depending on whether a coordinate is sampled or not. As before, λϵ = max(0, Lμ(ϵ⁻¹∥b − Rq∥ℓ2 − 1)) with q = xk − Lμ⁻¹∇f̂μ(xk). The complexity of Step 2 is now O(n), and the same applies to Step 3.

Put CU for the complexity of applying U and U∗. The complexity of Step 1 is now 2 CU, so that this simple change of variables reduces the cost of each NESTA iteration to 2 CU + O(n). For example, in the case of a subsampled DFT (or something similar), the cost of each iteration is essentially that of two FFTs. Hence, each iteration is extremely fast.
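To make the change of variables concrete, here is a small sketch with a subsampled unitary DFT; the sizes and the sampling mask are arbitrary choices of ours. The operator A = RU is applied via one FFT plus subsampling, and A∗A acts through the 0/1 mask R∗R in the transform domain, which is exactly what makes Steps 2 and 3 cost only O(n).

```python
import numpy as np

n = 128
rng = np.random.default_rng(2)
mask = rng.permutation(n) < n // 4            # R*R as a 0/1 (boolean) diagonal
U = lambda v: np.fft.fft(v, norm="ortho")      # unitary DFT
Ustar = lambda v: np.fft.ifft(v, norm="ortho")

x = rng.standard_normal(n)
# A x  computed as  R (U x): subsample the transformed vector
Ax = U(x)[mask]
# A* y  computed as  U* (R* y): zero-fill, then inverse transform
y = np.zeros(n, dtype=complex)
y[mask] = Ax
AstarAx = Ustar(y)
# A*A = U* (R*R) U is an orthogonal projector: applying it twice changes nothing
AstarA2x = Ustar(np.where(mask, U(AstarAx), 0))
print(np.allclose(AstarAx, AstarA2x))   # True
```

In the U-domain, the update (3.6) only touches the boolean mask, and the two FFTs per iteration appear solely when evaluating the gradient via U and U∗, in line with the 2 CU + O(n) count above.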

### 3.5 Parameter selection

NESTA involves the selection of a single smoothing parameter μ and of a suitable stopping criterion. For the latter, our experience indicates that a robust and fairly natural stopping criterion is to terminate the algorithm when the relative variation of fμ is small. Define Δfμ as

 Δfμ := |fμ(xk) − f̄μ(xk)| / f̄μ(xk),  f̄μ(xk) := (1/min{10, k}) ∑_{l=1}^{min{10,k}} fμ(xk−l).   (3.13)

Then convergence is claimed when

 Δfμ<δ

for some δ > 0. In our experiments, the value of δ is set depending upon the desired accuracy.
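The stopping rule (3.13) is easy to implement with a short sliding window. This helper (whose names are ours) compares each new value of fμ with the mean of up to the last 10 values:

```python
from collections import deque

def make_stopping_test(delta, window=10):
    hist = deque(maxlen=window)          # f_mu(x_{k-1}), ..., f_mu(x_{k-window})
    def should_stop(f_val):
        if hist:
            mean = sum(hist) / len(hist)   # bar{f}_mu(x_k) in (3.13)
            stop = abs(f_val - mean) / mean < delta
        else:
            stop = False                   # no history yet at k = 0
        hist.append(f_val)
        return stop
    return should_stop
```

Inside the iteration loop, one would call `should_stop(f_mu(xk))` at each step and terminate when it returns True.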

The choice of μ is based on a trade-off between the accuracy of the smoothed approximation fμ (basically, limμ→0 fμ(x) = ∥x∥ℓ1) and the speed of convergence (the convergence rate is proportional to Lμ = 1/μ). With noiseless data, μ is directly linked to the desired accuracy. To illustrate this, we have observed in [7] that when the true signal x0 is exactly sparse and is actually the minimum ℓ1 solution under the equality constraints b = Ax0, the error on the nonzero entries is on the order of μ. The link between μ and accuracy will be further discussed in Section 4.

### 3.6 Accelerating NESTA with continuation

Inspired by homotopy techniques which find the solution to the lasso problem (1.4) for values of τ ranging in an interval, [34] introduces a fixed point continuation technique which solves ℓ1-penalized least-squares problems (1.3)

 (QPλ)  minimize λ∥x∥ℓ1 + ½∥b − Ax∥²ℓ2,

for a decreasing sequence of values of λ. The continuation solution approximately follows the path of solutions to the problem and, hence, the solutions to (1.1) and (1.4) may be found by solving a sequence of penalized least-squares problems.

The point of this is that it has been noticed (see [34, 45, 27]) that solving (1.3) (resp. the lasso (1.4)) is faster when λ is large (resp. τ is low). This observation greatly motivates the use of continuation for solving (1.3) for a fixed value of λ. The idea is simple: propose a sequence of problems with decreasing values of the parameter, λ0 > λ1 > ⋯ > λ, and use the intermediate solution as a warm start for the next problem. This technique has been used with some success in [31, 51]. Continuation has been shown to be a very successful tool for increasing the speed of convergence, in particular when dealing with large-scale problems and high dynamic range signals.

Likewise, our proposed algorithm can greatly benefit from a continuation approach. Recall that to compute yk, we need to solve

 yk = argmin_{x∈Qp} (Lμ/2)∥x − xk∥²ℓ2 + ⟨c, x⟩ = argmin_{x∈Qp} ∥x − (xk − Lμ⁻¹c)∥²ℓ2

for some vector c. Thus, with P_{Qp} the projector onto Qp, yk = P_{Qp}(xk − Lμ⁻¹c). Now two observations are in order.

• Computing yk is similar to a projected gradient step, with the reciprocal Lipschitz constant Lμ⁻¹ playing the role of the step size. Since Lμ⁻¹ is proportional to μ, the larger μ, the larger the step size, and the faster the convergence. This also applies to the sequence {zk}.

• For a fixed value of μ, the convergence rate of the algorithm obeys

 fμ(yk) − fμ(x⋆μ) ≤ 2Lμ∥x⋆μ − x0∥²ℓ2 / k²,

where x⋆μ is the optimal solution to min fμ over Qp. On the one hand, the convergence rate is proportional to Lμ = 1/μ, so a large value of μ is beneficial. On the other hand, choosing a good guess x0 close to x⋆μ provides a low value of pp(x⋆μ) = ½∥x⋆μ − x0∥²ℓ2, also improving the rate of convergence. Warm-starting with x0 from a previous solve not only changes the starting point of the algorithm, but it beneficially changes pp(x⋆μ) as well.

These two observations motivate the following continuation-like algorithm:

 Initialize μ0, x0 and the number of continuation steps T. For t ≥ 1,

 1. Apply Nesterov’s algorithm with μ = μ(t) and x0 = x⋆μ(t−1).
 2. Decrease the value of μ: μ(t+1) = γμ(t) with γ < 1.

 Stop when the desired value of μf is reached.

This algorithm iteratively finds the solutions to a succession of problems with decreasing smoothing parameters μ0 > μ(1) > ⋯ > μf, producing a sequence of (hopefully) finer estimates of x⋆μf; these intermediate solutions are cheap to compute and provide a string of convenient first guesses for the next problem. In practice, they are solved with less accuracy, making them even cheaper to compute.

The value of μf is based on the desired accuracy, as explained in Section 3.5. As for an initial value μ0, (3.1) makes clear that the smoothing parameter plays a role similar to a threshold; a first choice may then be on the order of ∥A∗b∥ℓ∞.
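In outline, the continuation scheme wraps the fixed-μ solver. In the sketch below, `nesterov_solve` is a hypothetical stand-in for one NESTA run at fixed μ warm-started at x0, and the geometric factor γ is chosen so that μ reaches μf after exactly T steps:

```python
def nesta_continuation(nesterov_solve, x0, mu0, mu_f, T):
    gamma = (mu_f / mu0) ** (1.0 / T)   # gamma < 1 whenever mu_f < mu0
    x, mu = x0, mu0
    for _ in range(T):
        mu *= gamma                     # decrease the smoothing parameter
        x = nesterov_solve(mu, x)       # warm start from the previous solution
    return x
```

Because each inner solve is warm-started, the intermediate problems can be solved to loose tolerance, as noted above.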

We illustrate the good behavior of the continuation-inspired algorithm by applying NESTA with continuation to solve a sparse reconstruction problem from partial frequency data. In this series of experiments, we assess the performance of NESTA while the dynamic range of the signals to be recovered increases.

The signals are s-sparse, that is, they have exactly s nonzero components; the signal size n and the sparsity level s are held fixed across trials. The amplitude of each nonzero entry is distributed uniformly on a logarithmic scale with a fixed dynamic range. Specifically, each nonzero entry is generated as follows:

 x[i] = η1[i] 10^{α η2[i]},   (3.14)

where η1[i] = ±1 with probability 1/2 (a random sign) and η2[i] is uniformly distributed in [0, 1]. The parameter α quantifies the dynamic range. Unless specified otherwise, a dynamic range of d dB means that α = d/20 (since for large s, α is approximately the logarithm base 10 of the ratio between the largest and the smallest magnitudes). For instance, 80 dB signals are generated according to (3.14) with α = 4.
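The signal model (3.14) translates directly into code; the sizes below are small placeholders of ours, not the values used in the paper's experiments:

```python
import numpy as np

def sparse_signal(n, s, dyn_range_db, rng):
    # Each nonzero entry: random sign times 10**(alpha * Uniform[0,1]),
    # with alpha = dynamic range (dB) / 20, as in (3.14)
    alpha = dyn_range_db / 20.0
    x = np.zeros(n)
    idx = rng.choice(n, size=s, replace=False)   # support of the s-sparse signal
    x[idx] = rng.choice([-1.0, 1.0], size=s) * 10.0 ** (alpha * rng.random(s))
    return x

x = sparse_signal(1024, 20, 80.0, np.random.default_rng(0))
```

For an 80 dB signal, the nonzero magnitudes span [1, 10⁴], so recovering the smallest entries indeed demands roughly five digits of accuracy, as argued in the introduction.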

The measurements Ax0 consist of m random discrete cosine measurements, so that A∗A is diagonalized by the DCT. Finally, b is obtained by adding a white Gaussian noise term with standard deviation σ. The initial value of the smoothing parameter is μ0 and the terminal value is μf; the algorithm terminates when the relative variation of fμ is lower than δ. NESTA with continuation is applied to 10 random trials for a varying number of continuation steps and various values of the dynamic range. Figure 1 graphs the value of fμ(xk) while applying NESTA with and without continuation as a function of the iteration count; the number of continuation steps is held fixed.

One can observe that computing the solution to (3.2) (solid line) takes a while when computed with the final value μ = μf; notice that NESTA seems to be slow at the beginning (number of iterations lower than 15). In the meantime, NESTA with continuation rapidly estimates a sequence of coarse intermediate solutions that converges to the solution of (3.2). In this case, continuation clearly enhances the global speed of convergence by a substantial factor. Figure 2 provides deeper insights into the behavior of continuation with NESTA, and shows the number of iterations required to reach convergence for varying numbers of continuation steps and different values of the dynamic range.

When the dynamic range is low or when the required accuracy is low, continuation is not as beneficial: intermediate continuation steps require a number of iterations which may not speed up overall convergence, and the step size associated with the terminal value μf already works well in this regime. When the dynamic range increases and we require more accuracy, however, continuation provides considerable improvements: the step size proportional to μf is then too conservative, and it takes a while to find the large entries of x⋆. Empirically, continuation then improves the speed of convergence markedly. As this improvement is likely to increase exponentially with the dynamic range (when expressed in dB), NESTA with continuation seems to be a better candidate for solving sparse reconstruction problems with high accuracy.

Interestingly, the behavior of NESTA with continuation seems to be quite stable: increasing the number of continuation steps does not dramatically increase the number of iterations. In practice, although the ideal number of continuation steps T is certainly signal dependent, we have observed that choosing a handful of continuation steps leads to reasonable results.

### 3.7 Some theoretical considerations

The convergence of NESTA with and without continuation is straightforward. The following theorem states that each continuation step with smoothing parameter $\mu_t$ converges to $x^\star_{\mu_t}$. Global convergence is proved by applying this theorem to the final continuation step.

###### Theorem 3.1

At each continuation step $t$, $\lim_{k \to \infty} y_k = x^\star_{\mu_t}$, and

$$f_{\mu_t}(y_k) - f_{\mu_t}\big(x^\star_{\mu_t}\big) \leq \frac{2 L_{\mu_t} \big\| x^\star_{\mu_t} - x_{\mu_{t-1}} \big\|_{\ell_2}^2}{k^2}.$$

Proof. Immediate by using [43, Theorem 2].

As mentioned earlier, continuation may be valuable for improving the speed of convergence. Let each continuation step $t$ stop after $\mathcal{N}_t$ iterations with

$$\mathcal{N}_t = \sqrt{\frac{2 L_{\mu_t}}{\gamma^t \delta_0}}\, \big\| x^\star_{\mu_t} - x^\star_{\mu_{t-1}} \big\|_{\ell_2}$$

so that we have

$$f_{\mu_t}(y_k) - f_{\mu_t}\big(x^\star_{\mu_t}\big) \leq \gamma^t \delta_0,$$

where the accuracy becomes tighter as $t$ increases. Then summing up the contributions of all the continuation steps gives

$$\mathcal{N}_c = \sqrt{\frac{2}{\mu_0 \delta_0}} \sum_{t=1}^{T} \gamma^{-t} \big\| x^\star_{\mu_t} - x^\star_{\mu_{t-1}} \big\|_{\ell_2}.$$

When NESTA is applied without continuation, the number of iterations required to reach convergence is

$$\mathcal{N} = \sqrt{\frac{2}{\mu_0 \delta_0}}\, \gamma^{-T} \big\| x^\star_{\mu_f} - x_0 \big\|_{\ell_2}.$$

Now the ratio $\mathcal{N}_c / \mathcal{N}$ is given by

$$\frac{\mathcal{N}_c}{\mathcal{N}} = \sum_{t=1}^{T} \gamma^{T-t}\, \frac{\big\| x^\star_{\mu_t} - x^\star_{\mu_{t-1}} \big\|_{\ell_2}}{\big\| x^\star_{\mu_f} - x_0 \big\|_{\ell_2}}. \tag{3.15}$$

Continuation is definitely worthwhile when the right-hand side is smaller than 1. Interestingly, this quantity is directly linked to the path followed by the sequence $(x^\star_{\mu_t})$; more precisely, it is related to the smoothness of this path. For instance, if all the intermediate points $x^\star_{\mu_t}$ belong to the segment $[x_0, x^\star_{\mu_f}]$ in an ordered fashion, then $\sum_{t} \| x^\star_{\mu_t} - x^\star_{\mu_{t-1}} \|_{\ell_2} = \| x^\star_{\mu_f} - x_0 \|_{\ell_2}$. Hence, since $\gamma^{T-t} < 1$ for $t < T$, we have $\mathcal{N}_c / \mathcal{N} < 1$ and continuation improves the convergence rate.

Figure LABEL:fig:solpaths illustrates two typical solution paths with continuation. When the sequence of solutions obeys (this is the case when and ), the solution path is likely to be “smooth;” that is, the solutions obey as on the left of Figure LABEL:fig:solpaths. The “nonsmooth” case on the right of Figure LABEL:fig:solpaths arises when the sequence of smoothing parameters does not provide intermediate estimates of $x^\star_{\mu_f}$ that are all better than . Here, computing some of the intermediate points is wasteful and continuation fails to be faster.

## 4 Accurate Optimization

A significant fraction of the numerical part of this paper focuses on comparing different sparse recovery algorithms in terms of speed and accuracy. In this section, we first demonstrate that NESTA can easily recover the exact solution to with a precision of 5 to 6 digits. Speaking of precision, we shall essentially use two criteria to evaluate accuracy.

• The first is the (relative) error on the objective functional

$$\frac{\|x\|_{\ell_1} - \|x^\star\|_{\ell_1}}{\|x^\star\|_{\ell_1}}, \tag{4.1}$$

where $x^\star$ denotes the optimal solution.

• The second is the accuracy of the optimal solution itself and is measured via

$$\|x - x^\star\|_{\ell_\infty}, \tag{4.2}$$

which gives a precise value of the accuracy per entry.
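Both criteria are computed with a couple of lines; the helper below is a minimal sketch, where `x_star` denotes a trusted optimal (or reference) solution obtained by other means.

```python
import numpy as np

def recovery_accuracy(x, x_star):
    """Relative l1-objective error (4.1) and entrywise l-infinity error (4.2)."""
    rel_obj = (np.abs(x).sum() - np.abs(x_star).sum()) / np.abs(x_star).sum()
    linf = np.max(np.abs(x - x_star))
    return rel_obj, linf
```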

### 4.1 Is NESTA accurate?

For general problem instances, the exact solution to (or equivalently ) cannot be computed analytically. Under some conditions, however, a simple formula is available when the optimal solution has exactly the same support and the same signs as the unknown (sparse) vector $x_0$ (recall the model $b = A x_0 + z$). Denote by $I$ the support of $x_0$. Then if $x_0$ is sufficiently sparse and if its nonzero entries are sufficiently large, the optimal solution $x^\star$ is given by

$$x^\star[I] = \big(A[I]^* A[I]\big)^{-1}\big(A[I]^* b - \lambda\, \mathrm{sgn}(x_0[I])\big), \tag{4.3}$$
$$x^\star[I^c] = 0, \tag{4.4}$$

see [12] for example. In this expression, $x_0[I]$ is the vector with the entries of $x_0$ indexed by $I$, and $A[I]$ is the submatrix of $A$ with column indices in $I$.
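For a dense matrix $A$, formula (4.3)-(4.4) is straightforward to evaluate. The sketch below is only a candidate solution: as the text notes, it is valid only when the optimal solution genuinely shares the support and signs of $x_0$, which must be verified afterwards via the KKT conditions.

```python
import numpy as np

def qp_solution_on_support(A, b, x0, lam):
    """Evaluate (4.3)-(4.4): the candidate solution supported on supp(x0)."""
    I = np.flatnonzero(x0)                         # support of x0
    AI = A[:, I]                                   # columns indexed by I
    x_star = np.zeros(x0.shape[0])
    x_star[I] = np.linalg.solve(AI.T @ AI, AI.T @ b - lam * np.sign(x0[I]))
    return x_star
```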

To evaluate NESTA’s accuracy, we set , , and (this is the number of nonzero coordinates of $x_0$). The absolute values of the nonzero entries of $x_0$ are distributed between and so that we have about dB of dynamic range. The measurements are discrete cosine coefficients selected uniformly at random. We add Gaussian white noise with standard deviation . We then compute the solution via (LABEL:eq:optimsol) and make sure it obeys the KKT optimality conditions, so that the optimal solution is known exactly.

We run NESTA with continuation with the value of . We use , and the number of continuation steps is set to . Table LABEL:Fistaccurate reports on numerical results. First, the value of the objective functional is accurate up to digits. Second, the computed solution is very accurate since we observe an error of . Now recall that the nonzero components of vary from about to so that we have high accuracy over a huge dynamic range. This can also be gleaned from Figure LABEL:fig:optvsfista which plots NESTA’s solution versus the optimal solution, and confirms the excellent precision of our algorithm.

### 4.2 Setting up a reference algorithm for accuracy tests

In general situations, a formula for the optimal solution is of course unavailable, and evaluating the accuracy of solutions requires defining a method of reference. In this paper, we will use FISTA [3] as such a reference since it is an efficient algorithm that also turns out to be extremely easy to use; in particular, no parameter has to be tweaked, except for the standard stopping criterion (maximum number of iterations and tolerance on the relative variation of the objective function).

We run FISTA with iterations on the same problem as above, and report its accuracy in Table LABEL:Fistaccurate. The $\ell_1$ norm is exact up to digits. Furthermore, Figure LABEL:fig:optvsfista shows the entries of FISTA’s solution versus those of the optimal solution, and one observes a very good fit (near perfect when the magnitude of a component of is higher than ). The error between FISTA’s solution and the optimal solution is equal to ; that is, the entries are exact up to . Because this occurs over an enormous dynamic range, we conclude that FISTA also gives very accurate solutions provided that sufficiently many iterations are taken. We have observed that running FISTA with a high number of iterations—typically greater than —provides accurate solutions to , and this is why we will use it as our method of reference in the forthcoming comparisons in this section and the next.
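For concreteness, FISTA can be sketched in a few lines. This is a simplified dense-matrix version of the algorithm of [3] for the Lagrangian problem $\min_x \tfrac12\|b - Ax\|_{\ell_2}^2 + \lambda \|x\|_{\ell_1}$, not the released code used in the benchmarks.

```python
import numpy as np

def fista(A, b, lam, n_iter=500):
    """FISTA for min_x 0.5*||b - A x||^2 + lam*||x||_1 (dense A, for clarity)."""
    L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1]); y = x.copy(); t = 1.0
    for _ in range(n_iter):
        g = y - A.T @ (A @ y - b) / L              # gradient step on the smooth part
        x_new = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-thresholding
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)              # momentum step
        x, t = x_new, t_new
    return x
```

As in NESTA, the iteration cost is dominated by one application of $A$ and one of $A^*$.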

### 4.3 The smoothing parameter μ and NESTA’s accuracy

By definition, $\mu$ fixes the accuracy of the approximation to the $\ell_1$ norm and, therefore, NESTA’s accuracy directly depends on this parameter. We now propose to assess the accuracy of NESTA for different values of $\mu$. The problem sizes are as before, namely, and , except that now the unknown $x_0$ is far less sparse with . The standard deviation of the additive Gaussian white noise is also higher, and we set .

Because of the larger values of and , it is no longer possible to obtain an analytic solution from (LABEL:eq:optimsol). Instead, we use FISTA to compute a reference solution $x_F$, using iterations and with , which gives . To be sure that FISTA’s solution is very close to the optimal solution, we check that the KKT stationarity condition is nearly satisfied. If $I^\star$ is the support of the optimal solution $x^\star$, this condition reads

$$A[I^\star]^* (b - A x^\star) = \lambda\, \mathrm{sgn}\big(x^\star[I^\star]\big),$$
$$\big\| A[(I^\star)^c]^* (b - A x^\star) \big\|_{\ell_\infty} \leq \lambda.$$

Now define $I$ to be the support of $x_F$. Then, here, $x_F$ obeys

$$\big\| A[I]^* (b - A x_F) - \lambda\, \mathrm{sgn}\big(x_F[I]\big) \big\|_{\ell_\infty} = 2.66 \times 10^{-10}\, \lambda,$$
$$\big\| A[I^c]^* (b - A x_F) \big\|_{\ell_\infty} \leq 0.99\, \lambda.$$

This shows that $x_F$ is extremely close to the optimal solution.
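The near-optimality check above amounts to two residuals, and any candidate solution can be screened this way before being used as a reference. A minimal sketch for a dense matrix:

```python
import numpy as np

def kkt_residuals(A, b, x, lam):
    """Stationarity residual on supp(x) and max off-support correlation."""
    corr = A.T @ (b - A @ x)                  # the correlations A*(b - Ax)
    on = np.abs(x) > 0                        # support of the candidate x
    stat = np.max(np.abs(corr[on] - lam * np.sign(x[on]))) if on.any() else 0.0
    dual = np.max(np.abs(corr[~on])) if (~on).any() else 0.0
    return stat, dual                         # want stat ~ 0 and dual <= lam
```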

NESTA is run with continuation steps for three different values of $\mu$ (the tolerance is set to , and respectively). Figure LABEL:fig:nestavsfista plots the solutions given by NESTA versus the “optimal solution” $x_F$. Clearly, as $\mu$ decreases, the accuracy of NESTA increases, just as expected. More precisely, notice in Table LABEL:Nestaccurate that for this particular experiment, decreasing $\mu$ by a factor of gives about one additional digit of accuracy on the optimal value.

According to this table, seems a reasonable choice to guarantee an accurate solution since one has between and digits of accuracy on the optimal value, and since the error is lower than . Observe that this value separates the nonzero entries from the noise floor (when ). In the extensive numerical experiments of Section LABEL:sec:numeric, we shall set and as default values.

## 5 Numerical comparisons

This section presents numerical experiments comparing several state-of-the-art optimization techniques designed to solve (LABEL:eq:bp) or (LABEL:eq:pls). To be as fair as possible, we propose comparisons with methods for which software is publicly available online. To the best of our knowledge, such extensive comparisons are currently unavailable. Moreover, whereas publications sometimes test algorithms on relatively easy and academic problems, we will subject optimization methods to hard but realistic reconstruction problems.

In our view, a challenging problem involves some or all of the characteristics below.

• High dynamic range. As mentioned earlier, most optimization techniques are able to find (more or less rapidly) the most significant entries (those with a large amplitude) of the signal. Accurately recovering the entries that have low magnitudes is more challenging.

• Approximate sparsity. Realistic signals are seldom exactly sparse, and coping with approximately sparse signals is therefore of paramount importance. In signal or image processing, for example, the wavelet coefficients of natural images contain many low-level entries that are worth retrieving.

• Large scale. Some standard optimization techniques, such as interior point methods, are known to provide accurate solutions. However, these techniques are not applicable to large-scale problems because of the large cost of solving linear systems. Further, many existing software packages fail to take advantage of fast algorithms for applying $A$ and $A^*$. We will focus on large-scale problems in which the number of unknowns is over a quarter of a million, i.e. .

### 5.1 State-of-the-art methods

Most of the algorithms discussed in this section are considered state-of-the-art in the sense that they are the most competitive among sparse reconstruction algorithms. To repeat ourselves, many of these methods have been improved after several years of research [36, 31], and many did not exist two years ago [34, 51]. For instance, [35] was submitted for publication less than three months before we put the final touches on this paper. Finally, our focus is on rapid algorithms, so we are interested in methods that can take advantage of fast algorithms for applying $A$ or $A^*$ to a vector. This is why we have not tested other good methods such as [32], for example.

#### 5.1.1 NESTA

Below, we applied NESTA with the following default parameters

$$x_0 = A^* b, \quad \mu = 0.02, \quad \delta = 10^{-7}$$

(recall that $x_0$ is the initial guess). The maximal number of iterations is set to ; if convergence is not reached after iterations, we record that the algorithm did not converge (DNC). Because NESTA requires two calls to either $A$ or $A^*$ per iteration, this is equivalent to declaring DNC after a fixed total number of calls to $A$ or $A^*$; hence, for the other methods, we declare DNC when the number of such calls exceeds this limit. When continuation is used, the extra parameters are set up as follows:

$$T = 4, \quad \mu_0 = \|x_0\|_{\ell_\infty}, \quad \gamma = (\mu/\mu_0)^{1/T},$$

and for $t = 1, \ldots, T$,

$$\mu_t = \gamma^t \mu_0, \quad \delta_t = 0.1 \cdot (\delta/0.1)^{t/T}.$$
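These defaults translate into the following schedule; the sketch below is a direct transcription of the formulas above, whose only inputs are the initial guess $x_0 = A^* b$ and the target values of $\mu$ and $\delta$.

```python
import numpy as np

def nesta_continuation_schedule(x0, mu=0.02, delta=1e-7, T=4):
    """Per-step smoothing parameters mu_t and tolerances delta_t, t = 1..T."""
    mu0 = np.max(np.abs(x0))                       # mu_0 = ||x0||_inf
    gamma = (mu / mu0) ** (1.0 / T)                # chosen so that mu_T = mu
    mus = [gamma ** t * mu0 for t in range(1, T + 1)]
    deltas = [0.1 * (delta / 0.1) ** (t / T) for t in range(1, T + 1)]
    return mus, deltas
```

By construction the last step runs at exactly the target pair $(\mu, \delta)$.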

Numerical results are reported and discussed in Section LABEL:sec:numreshd.

#### 5.1.2 Gradient Projections for Sparse Reconstruction (GPSR) [31]

GPSR was introduced in [31] to solve the standard $\ell_1$ minimization problem in Lagrangian form (). GPSR is based on the well-known projected gradient step technique,

$$v^{(k+1)} = P_Q\big(v^{(k)} - \alpha_k \nabla F(v^{(k)})\big),$$

for some projector $P_Q$ onto a convex set $Q$; this set contains the variable of interest $v$. In this equation, $F$ is the function to be minimized. In GPSR, the problem is recast so that the new variable $v$ has nonnegative entries (a standard change of variables in linear programming methods). The function $F$ is then

$$F(v) = \lambda\, \mathbf{1}^* v + \tfrac{1}{2}\, \big\| b - [A, -A]\, v \big\|_{\ell_2}^2,$$

where $\mathbf{1}$ is the vector of ones, and $v$ belongs to the nonnegative orthant: $v[i] \geq 0$ for all $i$. The projection onto $Q$ is then trivial. Different techniques for choosing the step size $\alpha_k$ (backtracking, Barzilai-Borwein [2], and so on) are discussed in [31]. The code is available at http://www.lx.it.pt/~mtf/GPSR/. In the forthcoming experiments, the parameters are set to their default values.
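One such projected gradient step can be sketched as follows. This is an illustrative dense-matrix version, not the GPSR code itself: `v` stacks the positive and negative parts of $x$, so the projection onto $Q$ reduces to clipping at zero.

```python
import numpy as np

def gpsr_step(A, b, v, lam, alpha):
    """One projected gradient step for F(v) = lam*1'v + 0.5*||b - [A,-A]v||^2."""
    B = np.hstack([A, -A])                     # [A, -A] acting on v = [u; w]
    grad = lam * np.ones(B.shape[1]) - B.T @ (b - B @ v)
    return np.maximum(v - alpha * grad, 0.0)   # projection onto Q = {v >= 0}

def split_to_x(v):
    """Recover x = u - w from the stacked nonnegative variable."""
    n = v.size // 2
    return v[:n] - v[n:]
```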

GPSR also implements continuation, and we test this version as well. All parameters were set to defaults except, per the recommendation of one of the GPSR authors to increase performance, the number of continuation steps was set to 40, the ToleranceA variable was set to , and the MiniterA variable was set to . In addition, the code itself was tweaked a bit; in particular, the stopping criterion for continuation steps (other than the final step) was changed. Future releases of GPSR will probably contain a similarly updated continuation stopping criterion.

#### 5.1.3 Sparse reconstruction by separable approximation (SpaRSA) [54]

SpaRSA is an algorithm for minimizing composite functions, i.e. sums of a smooth term and a separable non-smooth term, e.g. (). At every step, a subproblem of the form

 minimize∥x−