A Fast Active Set Block Coordinate Descent Algorithm for \ell_{1}-regularized least squares

# A Fast Active Set Block Coordinate Descent Algorithm for ℓ1-regularized least squares

M. De Santis, S. Lucidi, F. Rinaldi    Marianna De Santis Fakultät für Mathematik, Technische Universität Dortmund, Vogelpothsweg 87, 44227 Dortmund, Germany marianna.de.santis@tu-dortmund.de    Stefano Lucidi Dipartimento di Ingegneria Informatica Automatica e Gestionale, Sapienza Università di Roma, Via Ariosto, 25, 00185 Roma, Italy stefano.lucidi@dis.uniroma1.it    Francesco Rinaldi Dipartimento di Matematica, Università di Padova, Via Trieste, 63, 35121 Padova, Italy rinaldi@math.unipd.it
###### Abstract

The problem of finding sparse solutions to underdetermined systems of linear equations arises in several applications (e.g. signal and image processing, compressive sensing, statistical inference). A standard tool for dealing with sparse recovery is the -regularized least-squares approach that has been recently attracting the attention of many researchers.

In this paper, we describe an active set estimate (i.e. an estimate of the indices of the zero variables in the optimal solution) for the considered problem that tries to quickly identify as many active variables as possible at a given point, while guaranteeing that some approximate optimality conditions are satisfied. A relevant feature of the estimate is that it gives a significant reduction of the objective function when setting to zero all those variables estimated active. This enables to easily embed it into a given globally converging algorithmic framework.

In particular, we include our estimate into a block coordinate descent algorithm for -regularized least squares, analyze the convergence properties of this new active set method, and prove that its basic version converges with linear rate.

Finally, we report some numerical results showing the effectiveness of the approach.

Key words. -regularized least squares, active set, sparse optimization

AMS subject classifications. 65K05, 90C25, 90C06

## 1 Introduction

The problem of finding sparse solutions to large underdetermined linear systems of equations has received a lot of attention in the last decades. This is due to the fact that several real-world applications can be formulated as linear inverse problems. A standard approach is the so called - unconstrained optimization problem:

 minx∈Rn12∥Ax−b∥2+τ∥x∥1, \hb@xt@.01(1.1)

where , , and . We denote by the standard norm and by the norm defined as .

Several classes of algorithms have been proposed for the solution of Problem (LABEL:l2l1). Among the others, we would like to remind Iterative Shrinkage/Thresholding (IST) methods (see e.g. [3, 4, 9, 11, 34]), Augmented Lagrangian Approaches (see e.g. [2]), Second Order Methods (see e.g. [5, 18]), Sequential Deterministic (see e.g. [32, 33, 39]) and Stochastic (see e.g. [16, 28] and references therein) Block Coordinate Approaches, Parallel Deterministic (see e.g. [15] and references therein) and Stochastic (see e.g. [10, 29] and references therein) Block Coordinate Approaches, and Active-set strategies (see e.g. [20, 35, 36]).

The main feature of this class of problems is the fact that the optimal solution is usually very sparse (i.e. it has many zero components). Then, quickly building and/or correctly identifying the active set (i.e. the subset of zero components in an optimal solution) for Problem (LABEL:l2l1) is becoming a crucial task in the context of Big Data Optimization, since it can guarantee relevant savings in terms of CPU time. As a very straightforward example, we can consider a huge scale problem having a solution with just a few nonzero components. In this case, both the fast construction and the correct identification of the active set can considerably reduce the complexity of the problem, thus also giving us the chance to use more sophisticated optimization methods than the ones usually adopted.Various attempts have been made in order to use active set technique in the context of -regularized problems.

In [35, 36], Wen et al. proposed a two-stage algorithm, FPC-AS, where an estimate of the active variables set is driven by using a first-order iterative shrinkage method.

In [37], a block-coordinate relaxation approach with proximal linearized subproblems yields convergence to critical points, while identification of the optimal manifold (under a nondegeneracy condition) allows acceleration techniques to be applied on a reduced space.

In [23], the authors solve an -regularized log determinant program related to the problem of sparse inverse covariance matrix estimation combining a second-order approach with a technique to correctly identifying the active set.

An efficient version of the two-block nonlinear constrained Gauss-Seidel algorithm that at each iteration fixes some variables to zero according to a simple active set rule has been proposed in [27] for solving -regularized least squares.

In a recent paper [5], Nocedal et al. described an interesting family of second order methods for -regularized convex problems. Those methods combine a semi-smooth Newton approach with a mechanism to identify the active manifold in the given problem.

In the case one wants to solve very large problems, Block Coordinate Descent Algorithms (both Sequential and Parallel) represent a very good alternative and, sometimes, the best possible answer [33]. An interesting Coordinate Descent algorithm combining a Newton steps with a line search technique was described by Yuan et al. in [38]. In this context, the authors also proposed a shrinking technique (i.e. a heuristic strategy that tries to fix to zero a subset of variables according to a certain rule), which can be seen as a way to identify the active variables. In [33], some ideas on how to speed up their Block Coordinate Descent Algorithm by including an active set identification strategy are described, but no theoretical analysis is given for the resulting approach.

What we want to highlight here is that all the approaches listed above, but the one described in [5], estimate the final active set by using the current active set and perform subspace minimization on the remaining variables. In [5], the authors define an estimate that performs multiple changes in the active manifold by also including variables that are nonzero at a given point and satisfy some specific condition. Since this active set mechanism, due to the aggressive changes in the index set, can cause cycling, including the estimate into a globally converging algorithmic framework is not always straightforward.

In this work, we adapt the active set estimate proposed in [14] for constrained optimization problems to the -regularized least squares case. Our estimate, similarly to the one proposed in [5], does not only focus on the zero variables of a given point. Instead it tries to quickly identify as many active variables as possible (including the nonzero variables of the point), while guaranteeing that some approximate optimality conditions are satisfied.

The main feature of the proposed active set strategy is that a significant reduction of the objective function is obtained when setting to zero all those variables estimated active. This global property, which is strongly related to the fact that the components estimated active satisfy an approximate optimality condition, makes easy to use the estimate into a given globally converging algorithmic framework.

Furthermore, inspired by the papers [33, 38, 39], we describe a new Block Coordinate Descent Algorithm that embeds the considered active set estimate. At each iteration, the method first sets to zero the active variables, then uses a decomposition strategy for updating a bunch of the non-active ones. On the one hand, decomposing the non-active variables enables to handle huge scale problems that other active set approaches cannot solve in reasonable time. On the other hand, since the subproblems analyzed at every iteration explicitly take into account the -norm, the proposed algorithmic framework does not require a sign identification strategy (for the non-active variables), which is tipically needed when using other active set methods from the literature.

The paper is organized as follows. In Section LABEL:estimate, we introduce our active set strategy. In Section LABEL:algorithm, we describe the active set coordinate descent algorithm, and prove its convergence. We further analyze the convergence rate of the algorithm. In Section LABEL:numres, we report some numerical results showing the effectiveness of the approach. Finally, we draw some conclusions in Section LABEL:conclusions.

## 2 Notation and Preliminary Results

Throughout the paper we denote by , , and the original function in Problem (LABEL:l2l1), the quadratic term of the objective function in Problem (LABEL:l2l1), the gradient vector and the Hessian matrix of respectively. Explicitly

 q(x)=12∥Ax−b∥2,g(x)=A⊤(Ax−b),H=A⊤A.

Given a matrix , we further denote by and the maximum and the minimum eigenvalue of the matrix , respectively. Furthermore, with we indicate the set of indices , and with we indicate the submatrix of whose rows and columns indices are in . We also report the optimality conditions for Problem (LABEL:l2l1):

###### Proposition 2.1

is an optimal solution of Problem (LABEL:l2l1) if and only if

 ⎧⎪ ⎪⎨⎪ ⎪⎩x⋆i>0,gi(x⋆)+τ=0x⋆i<0,gi(x⋆)−τ=0x⋆i=0,−τ≤gi(x⋆)≤τ. \hb@xt@.01(2.1)

Furthermore, we define a continuous function that measures the violation of the optimality conditions in (and is connected to the Gauss-Southwell-r rule proposed in [33]), that is

 Φi(x)=−mid{gi(x)−τHii,xi,gi(x)+τHii}, \hb@xt@.01(2.2)

where mid indicates the median of .

Finally, we recall the concept of strict complementarity.

###### Definition 2.2

Strict complementarity holds if, for any , we have

 −τ

## 3 Active set estimate

All the algorithms that adopt active set strategies need to estimate a particular subset of components of the optimal solution . In nonlinear constrained minimization problems, for example, using an active set strategy usually means correctly identifying the set of active constraints at the solution. In our context, we deal with Problem (LABEL:l2l1) and the active set is considered as the subset of zero-components of .

###### Definition 3.1

Let be an optimal solution for Problem (LABEL:l2l1). We define the active set as follows:

 \@fontswitch¯A(x⋆)={i∈I:x⋆i=0}. \hb@xt@.01(3.1)

We further define as non-active set the complementary set of :

 \@fontswitch¯N(x⋆)=I∖\@fontswitch¯A(x⋆)={i∈{1,…,n}:x⋆i≠0}. \hb@xt@.01(3.2)

In order to get an estimate of the active set we rewrite Problem (LABEL:l2l1) as a box constrained programming problem and we use similar ideas to those proposed in [12].

Problem (LABEL:l2l1) can be equivalently rewritten as follows:

 min12∥A(u−v)−b∥2+τ∑ni=1(ui+vi)u≥0v≥0, \hb@xt@.01(3.3)

where . Indeed, we can transform a solution of Problem (LABEL:l2l1) into a solution of (LABEL:prob1) by using the following transformation:

 u⋆=max(0,x⋆),v⋆=max(0,−x⋆).

Equivalently, we can transform a solution of (LABEL:prob1) into a solution of Problem (LABEL:l2l1) by using the following transformation:

 x⋆=u⋆−v⋆.

The Lagrangian function associated to (LABEL:prob1) is

 \@fontswitchL(u,v,λ,μ)=12∥A(u−v)−b∥2+τn∑i=1(ui+vi)−λ⊤u−μ⊤v,

with vectors of Lagrangian multipliers. Let be an optimal solution of Problem (LABEL:prob1). Then, from necessary optimality conditions, we have

 \hb@xt@.01(3.4)

From (LABEL:lambdamu), we can introduce the following two multiplier functions

 λi(u,v)=gi(u−v)+τ;μi(u,v)=τ−gi(u−v). \hb@xt@.01(3.5)

By means of the multiplier functions, we can recall the non-active set estimate and active set estimate proposed in the field of constrained smooth optimization (see [14] and references therein):

 \@fontswitchN(u,v)={i:ui>ϵλi(u,v)}∪{i:vi>ϵμi(u,v)}, \hb@xt@.01(3.6)
 \@fontswitchA(u,v)=I∖\@fontswitchN(u,v), \hb@xt@.01(3.7)

where is a positive scalar.

We draw inspiration from (LABEL:nas1) and (LABEL:as1) to propose the new estimates of active and non-active set for Problem (LABEL:l2l1). Indeed, by using the relations

 u=max(0,x)andv=max(0,−x),

we can give the following definitions.

###### Definition 3.2

Let . We define the following sets as estimate of the non-active and active variables sets:

 \@fontswitchN(x)={i:max(0,xi)>ϵ(τ+gi(x))}∪{i:max(0,−xi)>ϵ(τ−gi(x))}, \hb@xt@.01(3.8)
 \@fontswitchA(x)=I∖\@fontswitchN(x). \hb@xt@.01(3.9)

In the next Subsections, we first discuss local and global properties of our estimate, then we compare it with other active set estimates.

### 3.1 Local properties of the active set estimate

Now, we describe some local properties (in the sense that those properties only hold into a neighborhood of a given point) of our active set estimate. In particular, the following theoretical result states that when the point is sufficiently close to an optimal solution the related active set estimate is a subset of the active set calculated in the optimal point (and it includes the optimal active variables that satisfy strict complementarity). Furthermore, when strict complementarity holds the active set estimate is actually equal to the optimal active set.

###### Theorem 3.3

Let be an optimal solution of Problem (LABEL:l2l1). Then, there exists a neighborhood of such that, for each in this neighborhood, we have

 \@fontswitch¯A+(x⋆)⊆\@fontswitchA(x)⊆\@fontswitch¯A(x⋆), \hb@xt@.01(3.10)

with .
Furthermore, if strict complementarity (LABEL:stc) holds in , then there exists a neighborhood of such that, for each in this neighborhood, we have

 \@fontswitchA(x)=\@fontswitch¯A(x⋆). \hb@xt@.01(3.11)

Proof. The proof follows from Theorem 2.1 in [14].

### 3.2 A global property of the active set estimate

Here, we analyze a global property of the active set estimate. In particular, we show that, for a suitably chosen value of the parameter appearing in Definition LABEL:def:est, by starting from a point and fixing to zero all variables whose indices belong to the active set estimate , it is possible to obtain a significant decrease of the objective function. This property, which strongly depends on the specific structure of the problem under analysis, represents a new interesting theoretical result, since it enables to easily embed the active set estimate into any globally converging algorithmic framework (in the next section, we will show how to include it into a specific Block Coordinate Descent method). Furthermore, the global property cannot be deduced from the theoretical results already reported in [14].

###### Assumption 1

Parameter appearing in Definition LABEL:def:est satisfies the following condition:

 0<ϵ<1λmax(A⊤A). \hb@xt@.01(3.12)
###### Proposition 3.4

Let Assumption LABEL:ass1 hold. Given a point and the related sets and , let be the point defined as

 y\@fontswitchA(z)=0,y\@fontswitchN(z)=z\@fontswitchN(z).

Then,

 f(y)−f(z)≤−12ϵ∥y−z∥2.

Proof. see Appendix LABEL:appmainresact.

### 3.3 Comparison with other active set strategies

Our active set estimate is somehow related to those proposed respectively by Byrd et al. in [5] and by Yuan et al. in [38]. It is also connected in some way to the IST Algorithm (ISTA), see e.g.  [3, 11]. Indeed, an ISTA step can be seen as a simple way to set to zero the variables in the context of -regularized least-squares problems.

Here, we would like to point out the similarities and the differences between those strategies and the one we propose in the present paper.

First of all, we notice that, at a generic iteration of a given algorithm, if is the related iterate and is an index estimated active by our estimate, that is,

 i∈\@fontswitchA(xk)={i:max(0,xki)≤ϵ(τ+gi(xk))}∩{i:max(0,−xki)≤ϵ(τ−gi(xk))},

this is equivalent to write

 xki∈[ϵ(gi(xk)−τ),ϵ(gi(xk)+τ)]and−τ≤gi(xk)≤τ, \hb@xt@.01(3.13)

which means that is sufficiently small and satisfies the optimality condition associated with a zero component (see (LABEL:optcondorpr)). As we will see, the estimate, due to the way it is defined, tends to be more conservative than other active set strategies (i.e. it might set to zero slightly smaller sets of variables). On the other hand, the global property analyzed in the previous section (i.e. decrease of the objective function when setting to zero the active variables) seems to indicate that the estimate truly contains indices related to variables that will be active in the optimal solution. As we will see later on, this important property does not hold when considering the other active set strategies analyzed here.

In the block active set algorithm for quadratic -regularized problems proposed in [5], the active set estimate, at a generic iteration , can be rewritten in the following way:

 \@fontswitchAkByrd={i:xki=0;gi(xk)∈(−τ,τ)}∪{i:xki<0;gi(xk)=−τ}∪{i:xki>0;gi(xk)=τ}.

Let and be an index estimated active by our estimate, from (LABEL:dualcondref), we get .

Then, in the case , implies . In fact, let . If we have so that . It is easy to see that the other way around is not true.

Other differences between the two estimates come out when considering indices such that . Let and, in particular, . If , then we get

 max(0,−xki)=−xki>ϵ2τ=ϵ(τ−gi(xk)),

so that . Using the same reasoning we can see that, in the case and, in particular, , it can happen

 max(0,xki)=xki>ϵ2τ=ϵ(τ+gi(xk)),

so that .

In [38], the active set estimate is defined as follows

 \@fontswitchAkYuan={i:xki=0;gi(xk)∈(−τ+Mk−1,τ−Mk−1)}, \hb@xt@.01(3.14)

where is a positive scalar that measures the violation of the optimality conditions. It is easy to see that our active set contains the one proposed in [38]. Furthermore, we have that variables contained in our estimate are not necessarily contained in the estimate (LABEL:stima:yuan). In particular, a big difference between our estimate and the one proposed in [38] is that we can also include variables that are non-zero at the current iterate.

As a final comparison, we would like to point out the differences between the ISTA strategy and our estimate. Consider the generic iteration of ISTA with the same used in our active set strategy:

 xk+1=argminx{q(xk)+g(xk)⊤(x−xk)+ϵ∥x−xk∥2+τ∥x∥1}. \hb@xt@.01(3.15)

From the optimality conditions of the inner problem in (LABEL:one), we have that the zero variables at belong to the following set:

 \@fontswitchAkISTA={i:ϵ(−τ+gi(xk))≤xki≤ϵ(τ+gi(xk))}. \hb@xt@.01(3.16)

We can easily see that . The opposite is not always true, apart from the variables . As a matter of fact, let us consider and . Then, we have that

 xki≤ϵ(τ+gi(xk))⇒i∈{i: max(0,xki)≤ϵ(τ+gi(xk))}xki≥ϵ(−τ+gi(xk))⇒−xki≤ϵ(τ−gi(xk))

In order to have it should be

 ϵ(τ−gi)≥max{0,−xki}=0

that is a tighter requirement with respect to the one within . A similar reasoning applies also to variables with . We would also like to notice that the ISTA step might generate unnecessary projections of variables to zero, thus being not always effective as a tool for identifying the active set.

In this final remark, we show that, when using the active set strategies analyzed above, a sufficient decrease of the objective function cannot be guaranteed by setting to zero the variables in the active set (i.e. Proposition LABEL:prop1 does not hold). This fact makes hard, in some cases, to include those active set strategies into a globally convergent algorithmic framework.

###### Remark 1

Proposition LABEL:prop1 does not hold for the active set strategies described above. This can be easily seen in the following case.

Let us assume that, at some iteration , it exists only one index , with , and . Let and be the point defined as for all , and . Then,

 f(y)=f(xk)+(g^ı(xk)−τ)(y^ı−xk^ı)+12(y^ı−xk^ı)2H^ı^ı.

Since and , we have , so that by setting to zero the active variable we get an increase of the objective function value.

The same reasoning applies also to the ISTA step, assuming that at some iteration , there exists only one index such that

 ϵ(−τ+g^ı(xk))

and .

Finally, it is easy to notice that, at each iteration , the active set estimate defined in [38] only keeps fixed to zero, at iteration , some of the variables that are already zero in , thus not changing the objective function value.

## 4 A Fast Active Set Block Coordinate Descent Algorithm

In this section, we describe our Fast Active SeT Block Coordinate Descent Algorithm (FAST-BCDA) and analyze its theoretical properties. The main idea behind the algorithm is that of exploiting as much as possible the good properties of our active set estimate, more specifically:

• the ability to identify, for sufficiently large, the “strong” active variables (namely, those variables satisfying the strict complementarity, see Theorem LABEL:activeest);

• the ability to obtain, at each iteration, a sufficient decrease of the objective function, by fixing to zero those variables belonging to the active set estimate (see Proposition LABEL:prop1 of the previous section).

As we have seen in the previous section, the estimate, due to the way it is defined, tends to be more conservative than other active set strategies (i.e. it might set to zero a slightly smaller set of variables at each iteration). Anyway, since for each block we exactly solve an -regularized subproblem, we can eventually force to zero some other variables in the non-active set. Another important consequence of including the -norm in the subproblems is that we do not need any sign identification strategy for the non-active variables.

At each iteration , the algorithm defines two sets , and executes two steps:

• it sets to zero all of the active variables;

• it minimizes only over a subset of the non-active variables, i.e. those which violate the optimality conditions the most.

More specifically, we consider the measure related to the violation of the optimality conditions reported in (LABEL:tsengest). We then sort in decreasing order the indices of non-active variables (i.e. the set of indices ) with respect to this measure and define the subset containing the first sorted indices.

The set is then partitioned into subsets of cardinality , such that . Then the algorithm performs subiterations. At the -th subiteration the algorithm considers the set and solves to optimality the subproblem we get from (LABEL:l2l1), by fixing all the variables but the ones whose indices belong to . Below we report the scheme of the proposed algorithm (see Algorithm LABEL:fig:FAST-CDA).

The convergence of FAST-BCDA is based on two important results. The first one is Proposition LABEL:prop1, which guarantees a sufficient decrease of the objective function by setting to zero the variables in the active set. The second one is reported in the proposition below. It shows that, despite the presence of the nonsmooth term, by exactly minimizing Problem (LABEL:l2l1) with respect to a subset of the variables (keeping all the other variables fixed), it is possible to get a sufficient decrease of the objective function in case .

###### Proposition 4.1

Given a point and a set , let be the solution of Problem (LABEL:l2l1), where all variables but the ones whose indices belong to are fixed to . Let be defined as

 yJ=w∗,yI∖J=zI∖J.

Then we have

 f(y)−f(z)≤−12λmin(HJJ)∥y−z∥2. \hb@xt@.01(4.1)

Proof. See Appendix LABEL:appcon.

Now, we introduce an assumption that will enable us to prove global convergence of our algorithm.

###### Assumption 2

The matrix satisfies the following condition

 minJλmin((A⊤A)JJ)≥σ>0, \hb@xt@.01(4.2)

where is any subset of such that , with cardinality of the blocks used in FAST-BCDA.

###### Remark 2

We notice that even though there are some similarities between Condition (LABEL:assunzione) and the well-known Restricted Isometry Property (RIP) condition with fixed order (see e.g. [6] for further details), Condition (LABEL:assunzione) is weaker than the RIP condition.

Finally, we are ready to state the main result concerning the global convergence of FAST-BCDA.

###### Theorem 4.2

Let Assumption LABEL:ass1 and Assumption LABEL:ass2 hold. Let be the sequence produced by Algorithm FAST-BCDA.

Then, either an integer exists such that is an optimal solution for Problem (LABEL:l2l1), or the sequence is infinite and every limit point of the sequence is an optimal point for Problem (LABEL:l2l1).

Proof. see Appendix LABEL:appcon.

Now, we discuss Assumptions LABEL:ass1 and LABEL:ass2 that are needed to guarantee convergence of FAST-BCDA.

### 4.1 Comments on the assumptions

Assumption LABEL:ass1 requires the evaluation of , which is not always easily computable for large scale problems. Hence, we describe an updating rule for the parameter , that enables to avoid any “a priori” assumption on .

In practice, at each iteration we need to find the smallest such that the value and the corresponding sets , give a point

 y0,k\@fontswitchAk=0andy0,k\@fontswitchNk=xk\@fontswitchNk

satisfying

 f(y0,k)≤f(xk)−γ∥y0,k−xk∥2, \hb@xt@.01(4.3)

with . Then, we can introduce a variation of FAST-BCDA, namely FAST-BCDA-, that includes the updating rule for the parameter in its scheme, and prove its convergence.

###### Theorem 4.3

Let Assumption LABEL:ass2 hold. Let be the sequence produced by Algorithm FAST-BCDA-.

Then, either an integer exists such that is an optimal solution for Problem (LABEL:l2l1), or the sequence is infinite and every limit point of the sequence is an optimal point for Problem (LABEL:l2l1).

Proof. The proof follows by repeating the same arguments of the proof of Theorem LABEL:teorema1 by replacing the relation (LABEL:app_prop1) with (LABEL:eps-alg).

Assumption LABEL:ass2, which we need to satisfy in order to guarantee convergence of both FAST-BCDA and FAST-BCDA-, is often met in practice if we consider blocks of 1 or 2 variables (i.e. equal to 1 or 2). Indeed, when solving blocks of 1 variable, we need to guarantee that any column of matrix is such that

 ∥Aj∥2≥σ>0.

This is often the case when dealing with overcomplete dictionaries for signal/image reconstruction (as the columns of matrix are usually normalized, see e.g. [1]). When using 2-dimensional blocks, we want no parallel columns in the matrix . This is a quite common requirement in the context of overcomplete dictionaries (as it corresponds to ask that mutual coherence is lower than 1, see e.g. [1]). Furthermore, the solution of 1-dimensional block subproblems can be determined in closed form by means of the well-known scalar soft-threshold function (see e.g. [3, 34]). Similarly, we can express in closed form the solution of 2-dimensional block subproblems.

Summarizing, thanks to the possibility to use an updating rule for , and due to the fact that we only use blocks of dimensions 1 or 2 in our algorithm, we have that Assumptions LABEL:ass1 and LABEL:ass2 are quite reasonable in practice.

### 4.2 Convergence rate analysis

Here, we report a result related to the convergence rate of FAST-BCDA with 1-dimensional blocks (namely FAST-1CDA). In particular, we show that it converges at a linear rate. In order to prove the result, we make an assumption that is common when analyzing the convergence rate of both algorithms for -regularized problems (see e.g. [21]) and algorithms for general problems (see e.g. [25]):

###### Assumption 3

Let be the sequence generated by FAST-1CDA. We have that

 limk→∞xk=x⋆, \hb@xt@.01(4.4)

where is an optimal point of problem (LABEL:l2l1).

Now, we state the theoretical result related to the linear convergence.

###### Theorem 4.4

Let Assumptions LABEL:ass1, LABEL:ass2 and LABEL:ass3 hold. Let be the sequence generated by FAST-1CDA.

Then converges at least Q-linearly to , where . Furthermore, converges at least R-linearly to .

Proof. See Appendix LABEL:appconrate.

## 5 Numerical Results

In this section, we report the numerical experiments related to FAST-BCDA. We implemented our method in MATLAB, and considered four different versions of it in the experiments:

• FAST-1CDA and FAST-2CDA, basic versions of FAST-BCDA where blocks of dimension and are respectively considered;

• FAST-1CDA-E and FAST-2CDA-E, “enhanced” versions of FAST-BCDA where again blocks of dimension and are respectively considered (see subsection LABEL:sec:acc for further details).

We first analyzed the performance of these four versions of our algorithm. Then, we compared the best one with other algorithms for -regularized least squares problems. Namely, we compared FAST-2CDA-E with ISTA [3, 11], FISTA [3], PSSgb [30], SpaRSA [34] and FPCAS [35].
All the tests were performed on an Intel Xeon(R) CPU E5-1650 v2 3.50 GHz using MATLAB R2011b.

We considered two different testing problems of the form (LABEL:l2l1), commonly used for software benchmarking (see e.g. [35, 18]). In particular, we generated artificial signals of dimension , with a number of observations and we set the number of nonzeros , with . The two test problems (P1 and P2) differ in the way matrix is generated:

• Considering as the Gaussian matrix whose elements are generated independently and identically distributed from the normal distribution , the matrix was generated by scaling the columns of .

• Considering as the matrix generated by using the MATLAB command

 A={sprand}(m,n,density),

with , the matrix was generated by scaling the columns of .

We would like to notice that the Hessian matrices related to instances of problem P1 have most of the mass on the diagonal. Then, those instances are in general easier to solve than the ones of problem P2.

Once the matrix was generated, the true signal was built as a vector with randomly placed spikes, with zero in the other components. Finally, for all problems, the vector of observations was chosen as , where is a Gaussian white noise vector, with variance . We set as in [2, 34]. We produced ten different random instances for each problem, for a total of instances. The comparison of the overall computational effort is carried out by using the performance profiles proposed by Dolan and Moré in [13], plotting graphs in a logarithmic scale.

For the value of (number of non-active variables to be used in ) we set for FAST-1CDA and for FAST-2CDA (these values are the ones that guarantee the best performances among the ones we tried). For what concerns the choice of the parameter used in the active set estimate, the easiest choice is that of setting to a fixed value. We tested several values and obtained the best results with and for FAST-1CDA and FAST-2CDA respectively. We further tested an implementation of both FAST-1CDA- and FAST-2CDA-. Since there were no significant improvements in the performance, we decided to keep the value fixed.

We would also like to spend a few words about the criterion for choosing the variables in . In some cases, we found more efficient using the following measure:

 |gi(xk)+τ|if xki>0;|gi(xk)−τ|if xki<0;max{0,−(gi(xk)+τ),gi(xk)−τ}if xki=0, \hb@xt@.01(5.1)

in place of the one reported in (LABEL:tsengest), which we considered for proving the theoretical results. The main feature of this new measure is that it only takes into account first order information (while (LABEL:tsengest) considers proximity of the component value to zero too). Anyway, replacing (LABEL:tsengest) with the new measure is not a big deal, since convergence can still be proved using (LABEL:alternative). Furthermore, linear rate can be easily obtained assuming that strict complementarity holds. Intuitively, considering only first order information in the choice of the variables should make more sense in our context, since proximity to zero is already taken into account when using the estimate to select the active variables.

### 5.1 Enhanced version of Fast-Bcda

By running our codes, we noticed that the cardinality of the set related to the non-active variables decreases quickly as the iterations go by. In general, very few iterations are needed to obtain the real non-active set. By this evidence, and keeping in mind the theoretical result reported in Section LABEL:estimate, we decided to develop an “enhanced” version of our algorithms, taking inspiration by the second stage of FPC-AS algorithm [35]. Once a “good” estimate of was obtained, we solved the following smooth optimization subproblem

 min12∥Ax−b∥2+τsign(x\@fontswitchNk)⊤x\@fontswitchNks.t.xi=0i∈\@fontswitchAk.

In practice, we considered an estimate “good” if both there are no changes in the cardinality of the set with respect to the last two iterations, and is lower or equal than a certain threshold (we fixed in our experiments).

### 5.2 Preliminary experiments

In order to pick the best version among the four we developed, we preliminary compared the performance of FAST-1CDA (FAST1), FAST-2CDA (FAST2), FAST-1CDA-E (FAST1-E) and FAST-2CDA-E (FAST2-E). In Figure LABEL:fig:PPcompprel, we report the performance profiles with respect to the CPU time.

As we can see, even if the four version of FAST-BCDA have similar behaviour, FAST-2CDA-E is the one that gives the overall best performance. We then choose FAST-2CDA-E as the algorithm to be compared with the other state-of-the art algorithms for -regularized problems.

### 5.3 Comparison with other algorithms

In this section, we report the numerical experience related to the comparison of FAST-2CDA-E with ISTA [3, 11], FISTA [3], PSSgb [30], SpaRSA [34] and FPCAS [35].

In our tests, we first ran FAST-2CDA to obtain a target objective function value, then ran the other algorithms until each of them reached the given target (see e.g. [34]). Any run exceeding the limit of iterations is considered failure. Default values were used for all parameters in SpaRSA [34] and FPCAS [35]. For PSSgb [30] we considered the two-metric projection method and we set the parameter options.quadraticInit to , since this setting can achieve better performance for problems where backtracking steps are required on each iteration (see http://www.cs.ubc.ca/~schmidtm/Software/thesis.html). In all codes, we considered the null vector as starting point and all matrices were stored explicitly. In Figure LABEL:fig:PPmatavec, we report the plot of the performance profiles related to the CPU time for all instances. From these profiles it is clear that FAST-2CDA-E outperforms all the other algorithms and that SpaRSA and PSSgb are the two best competitors. We then further compare, in Figure LABEL:fig:boxplots-all, FAST-2CDA-E, SpaRSA and PSSgb reporting the box plots related to the distribution of the CPU time. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually.

In particular, Figure LABEL:fig:boxplots-all shows the plots related to the distribution of the CPU time for all instances, for P1 instances and for P2 instances, respectively. For what concerns P1 instances, SpaRSA and PSSgb show a similar behavior, while observing the plot related to P2 instances SpaRSA shows a better performance. For both classes, FAST-2CDA-E shows the lowest median. As a further comparison among FAST-2CDA-E, SpaRSA and PSSgb, we report in Figure LABEL:fig:errP1 and in Figure LABEL:fig:errP2, the plots of the relative error vs. the CPU time for the P1 and the P2 instances respectively. In each plot, the curves are averaged over the ten runs for fixed and . Observing these plots, we notice that FAST-2CDA-E is able to reach better solutions with lower CPU time.

### 5.4 Real Examples

In this subsection, we test the efficiency of our algorithm on realistic image reconstruction problems. We considered six images: a SheppâLogan phantom available through the MATLAB Image Processing Toolbox and five widely used images downloaded from http://dsp.rice.edu/cscamera (the letter R, the mandrill, the dice, the ball, the mug). Each image has pixels. We followed the procedure described in [35] to generate the instances (i.e. matrix and vector ). What we want to highlight here is that the optimal solutions are unknown. Hence the reconstructed images can only be compared by visual inspection. Also in this case, we first ran FAST-2CDA to obtain a target objective function value, then ran the other algorithms until each of them reached the given target. The CPU-time needed for reconstructing the images is reported in Table LABEL:tabtempi. In Figure LABEL:fig:realim2, we report the images of the dice and of the mandrill reconstructed by FAST-2CDA-E, PSSgb and SpaRSA. It is interesting to notice that the quality of the reconstructed images can depend on the algorithm used. In Table LABEL:tabtempi, we can easily see that FAST-BCDA was faster in all problems.

## 6 Conclusions

In this paper, we devised an active set-block coordinate descent method (FAST-BCDA) for solving -regularized least squares problems. The way the active set estimate is calculated guarantees a sufficient decrease in the objective function at every iteration when setting to zero the variables estimated active. Furthermore, since the subproblems related to the blocks explicitly take into account the -norm, the proposed algorithmic framework does not require a sign identification strategy for the non-active variables.

Global convergence of the method is established. A linear convergence result is also proved. Numerical results are presented to verify the practical efficiency of the method, and they indicate that FAST-BCDA compares favorably with other state-of-the-art techniques.

We further would like to remark that the proposed active set strategy is independent from the specific algorithm we have designed and can be easily included into other algorithms for -regularized least squares, both sequential and parallel, to improve their performance. We finally highlight that the algorithmic scheme we described can be easily modified in order to work in a parallel fashion. Future work will be devoted to adapt the presented approach to handle convex -regularized problems.

Acknowledgments The authors would like to thank the Associate Editor and the anonymous Reviewers for their thorough and useful comments that significantly helped to improve the paper.

## A Main theoretical result related to the active set estimate

Here, we prove the main theoretical result related to the active set estimate.

Proof of Proposition LABEL:prop1. We first define the sets and =. By taking into account the definitions of the sets and and the points and , we have:

 f(y)=q(y)+τn∑i=1sign(yi)yi=q(y)+τ∑i∈\@fontswitchNsign(yi)yi+τ∑i∈\@fontswitchAsign(zi)yi. \hb@xt@.01(A.1)

from which

 f(y)=f(z)+(g\@fontswitchA(z)+τS\@fontswitchAe)⊤(y−z)\@fontswitchA+12(y−z)T\@fontswitchAH\@fontswitchA\@fontswitchA(y−z)\@fontswitchA,

where is the unit vector, and is the diagonal matrix defined as

 S\@fontswitchA=Diag(sign(z\@fontswitchA)),

with the function intended componentwise.

Since we have that the following inequality holds

 f(y)≤f(z)+(g\@fontswitchA(z)+τS\@fontswitchAe)⊤(y−z)\@fontswitchA+λmax(A⊤A)2∥(y−z)\@fontswitchA∥2.

Recalling (LABEL:eps-prop2) we obtain:

 f(y)≤f(z)+(g\@fontswitchA(z)+τS\@fontswitchAe)⊤(y−z)\@fontswitchA+12ϵ∥(y−z)\@fontswitchA∥2. \hb@xt@.01(A.2)

Then, we can write

 f(y)≤f(z)+(g\@fontswitchA(z)+τS\@fontswitchAe+1ϵ(y−z)\@fontswitchA)⊤(y−z)\@fontswitchA−12ϵ∥(y−z)\@fontswitchA∥2.

In order to prove the proposition, we need to show that

 (g\@fontswitchA(z)+τS\@fontswitchAe+1ϵ(