# A Flexible Coordinate Descent Method

Kimon Fountoulakis and Rachael Tappenden

K. Fountoulakis is with the International Computer Science Institute, Department of Statistics, University of California Berkeley, 1947 Center St, Ste. 600, Berkeley, CA 94704, USA. e-mail: kfount@berkeley.edu.

R. Tappenden (corresponding author) is with the School of Mathematics and Statistics, University of Canterbury, Private Bag 8400, Christchurch 8041, NZ. e-mail: rachael.tappenden@canterbury.ac.nz.
###### Abstract

We present a novel randomized block coordinate descent method for the minimization of a convex composite objective function. The method uses (approximate) partial second-order (curvature) information, so that the algorithm performance is more robust when applied to highly nonseparable or ill conditioned problems. We call the method Flexible Coordinate Descent (FCD). At each iteration of FCD, a block of coordinates is sampled randomly, a quadratic model is formed about that block and the model is minimized approximately/inexactly to determine the search direction. An inexpensive line search is then employed to ensure a monotonic decrease in the objective function and acceptance of large step sizes. We present several high probability iteration complexity results to show that convergence of FCD is guaranteed theoretically. Finally, we present numerical results on large-scale problems to demonstrate the practical performance of the method.

##### Keywords.

large scale optimization; second-order methods; curvature information; block coordinate descent; nonsmooth problems; iteration complexity; randomized.

##### AMS Classification.

49M15; 49M37; 65K05; 90C06; 90C25; 90C53.

## 1 Introduction

In this work we are interested in solving the following convex composite optimization problem

$$
\min_{x \in \mathbb{R}^N} F(x) := f(x) + \Psi(x), \tag{1}
$$

where $f$ is a smooth convex function and $\Psi$ is a (possibly) nonsmooth, (coordinate) separable, real valued convex function. Problems of the form (1) arise in many important scientific fields, and applications include machine learning yuanho (), regression IEEEhowto:Tibshirani () and compressed sensing IEEEhowto:CandesRombergTao (); Candes06 (); Donoho06 (). Often the term $f$ is a data fidelity term, and the term $\Psi$ represents some kind of regularization.

Frequently, problems of the form (1) are large-scale, i.e., the size of $N$ is of the order of a million or a billion. Large-scale problems impose restrictions on the types of methods that can be employed for the solution of (1). In particular, the methods should have low per iteration computational cost, otherwise completing even a single iteration of the method might require unreasonable time. The methods must also rely only on simple operations such as matrix vector products, and ideally, they should offer fast progress towards optimality.

First order methods, and in particular randomized coordinate descent methods, have found great success in this area because they take advantage of the underlying problem structure (separability and block structure), and satisfy the requirements of low computational cost and low storage requirements. For example, in Richtarik14 () the authors show that their randomized coordinate descent method was able to solve sparse problems with millions of variables in a reasonable amount of time.

Unfortunately, randomized coordinate descent methods have two significant drawbacks. Firstly, due to their coordinate nature, coordinate descent methods are usually efficient only on problems with a high degree of partial separability (see (Richtarik15, Equation (2)) or (Tappenden15, Definition 13) for a precise definition of ‘partial separability’), and performance suffers when there is a high dependency between variables. Secondly, as first-order methods, coordinate descent methods do not usually capture essential curvature information and have been shown to struggle on complicated sparse problems Fountoulakis15 (); l1regSCfg ().

The purpose of this work is to overcome these drawbacks by equipping a randomized block coordinate descent method with approximate partial second-order information. In particular, at every iteration of FCD a search direction is obtained by solving a local quadratic model approximately. The quadratic model is defined by a matrix representing approximate second order information, so that essential curvature information is incorporated. The model need only be minimized approximately to give an inexact search direction, so that the process is efficient in practice. (The termination condition for the quadratic subproblem is inspired by sqa (), and we discuss this in more detail later.) A line search is then employed along this inexact direction, in order to guarantee a monotonic decrease in the objective function and large step-sizes.

FCD is computationally efficient. At each iteration of FCD, a block of coordinates is randomly selected, which is inexpensive. The method allows the use of inexact search directions, i.e., the quadratic model is minimized approximately. Intuitively, it is expected that this will lead to a reduction in algorithm running time, compared with minimizing the model exactly at every iteration. Also, the line search step depends only on a subset of the coordinates (corresponding to the randomly selected block of coordinates), so it is much cheaper than working in the full dimensional space. We note that the per iteration computational cost of the FCD method may be slightly higher than that of other randomized coordinate descent methods. This is due to the incorporation of the matrix representing partial second order information: the matrix gives rise to a quadratic model that is (in general) nonseparable, so the associated subproblem may not have a closed form solution. However, in the numerical results presented later in this work, we show that the method is more robust (and efficient) in practice, and often FCD requires fewer iterations to reach optimality than other state-of-the-art methods. That is, we show that FCD is able to solve difficult problems on which other coordinate descent methods may struggle.

The FCD method is supported by theoretical convergence guarantees in the form of high probability iteration complexity results. These iteration complexity results provide an explicit expression for the number of iterations $K$ that guarantees that, for any given error tolerance $\epsilon > 0$ and confidence level $\rho \in (0,1)$, the probability that FCD achieves an $\epsilon$-optimal solution exceeds $1 - \rho$, i.e., $\mathbf{P}\big(F(x_K) - F^* \leq \epsilon\big) \geq 1 - \rho$.

### 1.1 Literature review

Coordinate descent methods are some of the oldest and simplest iterative methods, and they are often better known in the literature under various names such as Jacobi methods, or Gauss-Seidel methods, among others. It has been observed that these methods may suffer from poor practical performance, particularly on ill-conditioned problems, and until recently, often higher order methods have been favoured by the optimization community. However, as we enter the era of big data, coordinate descent methods are coming back into favour, because of their ability to provide approximate solutions to real world instances of very large/huge scale problems in a reasonable amount of time.

Currently, state-of-the-art randomized coordinate descent methods include that of Richtárik and Takáč Richtarik14 (), where the method can be applied to unconstrained convex composite optimization problems of the form (1). The algorithm is supported by theoretical convergence guarantees in the form of high probability iteration complexity results, and Richtarik14 () also reports very impressive practical performance on highly separable large scale problems. Their work has also been extended to the parallel case Richtarik15 (), to include acceleration techniques Fercoq13 (), and to include the use of inexact updates Tappenden13 ().

Other important works on coordinate descent methods include that of Tseng and Yun in tsengyun (); Tseng09 (); Tseng10 (), Nesterov Nesterov12 () for huge-scale problems, work in Lu13 () and Tappenden15 () that improves the complexity analysis of Richtarik14 () and Richtarik15 () respectively, coordinate descent methods for group lasso problems Qin13 (); Simon12 () or for problems with general regularizers ShalevSchwartz13 (); Wright12 (), a coordinate descent method that uses a sequential active set strategy desantis14 (), and coordinate descent for constrained optimization problems Necoara14 ().

Unfortunately, on ill-conditioned problems, or problems that are highly nonseparable, first order methods can display very poor practical performance, and this has prompted the study of methods that employ second order information. To this end, recently there has been a flurry of research on Newton-type methods for problems of the general form (1), or the special case where $\Psi$ is an $\ell_1$-regularizer. For example, Karimi and Vavasis Karimi14 () have developed a proximal quasi-Newton method for $\ell_1$-regularized least squares problems, Lee, Sun and Saunders Lee12 (); Lee13 () have proposed a family of Newton-type methods for solving problems of the form (1), and Scheinberg and Tang Scheinberg13 () present iteration complexity results for a proximal Newton-type method. Moreover, the authors in sqa () extended standard inexact Newton-type methods to the minimization of a composite objective involving a smooth convex term plus an $\ell_1$-regularizer term.

Facchinei and coauthors in facchinei14a (); facchinei14b (); facchinei14 () have also made a significant contribution to the literature on randomized coordinate descent methods. Two new algorithms are introduced in those works, namely FLEXA facchinei14b (); facchinei14 () and HyFLEXA facchinei14a (). Both methods can be applied to problems of the form (1) where is allowed to be nonconvex, and for HyFLEXA, is allowed to be nonseparable. We stress that nonconvexity of , and nonseparability of are outside the scope of this work. Nevertheless, these methods are pertinent to the current work, so we discuss them now.

For FLEXA and HyFLEXA (as is standard in the current literature), the block structure of the problem is fixed throughout the algorithm: the data vector $x \in \mathbb{R}^N$ is partitioned into $n$ blocks, $x_1, \dots, x_n$. At each iteration, one block, all blocks, or some number in between, are selected in such a way that some specified error bound is satisfied. Next, a local convex model is formed and that model is approximately minimized to find a search direction. Finally, a convex combination of the current point and the search direction is taken to give a new point, and the process is repeated. The model they propose is block separable, so the subproblems for each block are independent. (See (3) in facchinei14 () for the local/block quadratic model, or F1–F3 in facchinei14a () for a description of the strongly convex local/block model.) This has the advantage that the subproblem for each block can be solved/minimized in parallel, but has the disadvantage that no interaction between different blocks is captured by either algorithm. We also note that, while these algorithms are equipped with global convergence guarantees, iteration complexity results have not been established for either method, which is a drawback.

We make the following two comments about FLEXA and HyFLEXA. The first is that, although coordinate selection as presented in facchinei14 () and facchinei14a () must respect the block structure of the problem, if the regularizer is assumed to be coordinate separable (which is the assumption we make in this work), then the convergence analysis in facchinei14 () and facchinei14a () can be adapted to hold in the ‘non-fixed block’ setting. The second comment is that, while the works facchinei14 () and facchinei14a () only present global convergence guarantees, a variation of the method, presented in Razaviyayn14 (), is equipped with iteration complexity results. By contrast, the algorithm in the present paper incorporates partial (possibly inexact) second order information, allows inexact updates with a verifiable stopping condition, and is supported by iteration complexity guarantees.

Another important method that has recently been proposed is the randomized coordinate descent method of Qu et al. Qu15 (). One of the most notable features of their algorithm is that the block structure is not fixed throughout the algorithm. That is, at each iteration of their algorithm, a random subset of coordinates $S$ is selected from the set of coordinates $[N] := \{1, \dots, N\}$. Then, a quadratic model is minimized exactly to obtain an update, and a new point is generated by applying the update to the current point. Unfortunately, this algorithm is only applicable to strongly convex smooth functions, or strongly convex empirical risk minimization problems; it is not suitable for general convex problems of the form (1). Moreover, the matrices used to define their quadratic model must obey several (strong) assumptions, and the quadratic model must be minimized exactly. These restrictions may make the algorithm inconvenient from a practical perspective.

The purpose of this work is to build upon the positive features of some existing coordinate descent methods and create a general flexible framework that can be used to solve any problem of the form (1). We will adopt some of the ideas in sqa (); facchinei14 (); Qu15 () (including variable block structure; the incorporation of partial approximate second order information; the practicality of approximate solves and inexact updates; cheap line search) and create a new flexible randomized coordinate descent framework that is not only computationally practical, but is also supported by strong theoretical guarantees.

### 1.2 Major Contributions

In this section we list several of the major contributions of this paper, which correspond to the central ideas of our FCD algorithm.

1. Blocks can vary throughout iterations. Existing randomized coordinate descent methods initially partition the data into $n$ blocks ($x_1, \dots, x_n$) and, at each iteration, one/all/several of the blocks are selected according to some probability distribution, and those blocks are then updated. For those methods, the block partition is fixed throughout the entire algorithm. So, for example, coordinates in block $x_i$ will always be updated all together as a group, independently of any other block $x_j$, $j \neq i$. One of the main contributions of this work is that we allow the blocks to vary throughout the algorithm, i.e., the method is not restricted to a fixed block structure. No partition of $x$ need be initialized. Rather, when FCD is initialized, a parameter $\tau$ is chosen and then, at each iteration of FCD, a subset $S \subseteq [N]$ of size $\tau$ is randomly selected/generated, and the coordinates of $x$ corresponding to the indices in $S$ are updated. To the best of our knowledge, the only other paper that allows for this random mixing of coordinates/varying blocks strategy is the work by Qu et al. Qu15 (). (An earlier version of this work, which, to the best of our knowledge, was the first to propose varying random block selection, is cited by Qu15 ().) This variable block structure is crucial with regards to our next major contribution.

###### Remark 1

We note that the algorithms of Tseng and Yun tsengyun (); Tseng09 (); Tseng10 () also allow a certain amount of coordinate mixing. However, their algorithms enforce deterministic rules for coordinate subset selection (either a ‘generalized Gauss-Seidel’, or Gauss-Southwell rule), which is different from our randomized selection rule. This difference is important because their deterministic strategy can be expensive/inconvenient to implement in practice, whereas our random scheme is simple and cheap to implement.

2. Model: Incorporation of partial second order information. FCD uses a quadratic model to determine the search direction. The model is defined by any symmetric positive definite matrix $H_S(x)$, which depends on both the subset of coordinates $S$, and also on the current point $x$; i.e., $H_S(x)$ is not fixed, rather, it can change/vary throughout the algorithm. We stress that this is a key advantage over most existing methods, which enforce the symmetric positive definite matrix defining their quadratic model to be fixed with respect to the fixed block structure, and/or the iteration counter. To the best of our knowledge, the only works that allow the matrix to vary with respect to the subset $S$ are this one (FCD), and the work by Qu et al. Qu15 (). A crucial observation is that, if $H(x)$ approximates the Hessian, our approach allows every element of the Hessian to be accessed. On the other hand, other existing methods can only access blocks along the diagonal of (some perturbation of) the Hessian (including facchinei14a (); facchinei14b (); facchinei14 (); Richtarik15 (); Richtarik14 (); Tappenden15 ()).

Furthermore, the only restriction we make on the choice of the matrix $H_S(x)$ is that it be symmetric and positive definite. This is a much weaker assumption than made by Qu et al. Qu15 (), who insist that the matrix defining their quadratic model be a principal submatrix of some fixed/predefined matrix $M$ say, and that the large matrix $M$ be symmetric and sufficiently positive definite. (See Section 2.1 in Qu15 ().)

3. Inexact search directions. To ensure that FCD is computationally practical, it is imperative that the iterations are inexpensive, and this is achieved through the use of inexact updates. That is, the model can be approximately minimized, giving inexact search directions. This strategy makes intuitive sense because, if the current point is far from the true solution, it may be wasteful/unnecessary to minimize the model exactly. Any algorithm can be used to approximately minimize the model, and if $H_S(x)$ approximates the Hessian, then the search direction obtained by minimizing the model is an approximate Newton-type direction. Importantly, the stopping conditions for the inner solve are easy to verify; they depend upon quantities that are easy/inexpensive to obtain, or may be available as a byproduct of the inner search direction solver.

###### Remark 2

The precise form of our inexactness termination criterion is important because, not only is it computationally practical (i.e., easy to verify), it also allowed us to derive iteration complexity results for FCD (see point 5). We considered several other termination criterion formulations, but we were unable to obtain corresponding iteration complexity results (although global convergence results were possible). Currently, the majority of randomized CD methods require exact solves for their inner subproblems, and we believe that a major reason for this is that iteration complexity results are significantly easier to obtain in the exact case. Devising practical inexact termination criteria for randomized CD methods that also allow the derivation of iteration complexity results seems to be a major hurdle in this area, although some progress is being made, e.g., Cassioli13 (); Devolder13 (); Devolder11 (); Tappenden13 ().

4. Line search. FCD includes a line search to ensure a monotonic decrease of the objective function as iterates progress. (The line search is needed because we only make weak assumptions on the matrix .) The line search is inexpensive to perform because, at each iteration, it depends on a subset of coordinates only. One of the major advantages of incorporating second-order information combined with a line search is to allow, in practice, the selection of large step sizes (close to one). This is because unit step sizes can substantially improve the practical efficiency of a method. In fact, for all experiments that we performed, unit step sizes were accepted by the line search for the majority of the iterations.

5. Convergence theory. We provide a complete convergence theory for FCD. In particular, we provide high probability iteration complexity guarantees for the algorithm in both the convex and strongly convex cases, and in the cases of both inexact or exact search directions. Our results show that FCD converges at a sublinear rate for convex functions, and at a linear rate for strongly convex functions.

### 1.3 Paper Outline

The paper is organized as follows. In Section 2 we introduce the notation and definitions that are used throughout this paper, including the definition of the quadratic model, and stationarity conditions. A thorough description of the FCD algorithm is presented in Section 3, including how the blocks are selected/sampled at each iteration (Section 3.1), a description of the search direction, several suggestions for selecting the matrices , and analysis of the subproblem/model termination conditions (Section 3.2), a definition of the line search (Section 3.3) and we also present several concrete problem examples (Section 3.4). In Section 4 we show that the line search in FCD will always be satisfied for some positive step , and that the objective function value is guaranteed to decrease at every iteration. Section 5 introduces several technical results, and in Section 6 we present our main iteration complexity results. In Section 7 we give a simplified analysis of FCD in the smooth case. Finally, several numerical experiments are presented in Section 9, which show that the algorithm performs very well in practice, even on ill-conditioned problems.

## 2 Preliminaries

Here we introduce the notation and definitions that are used in this paper, and we also present some important technical results. Throughout the paper we denote the standard Euclidean norm by $\|\cdot\|$ and the ellipsoidal norm by $\|x\|_A := \langle Ax, x\rangle^{1/2}$, where $A$ is a symmetric positive definite matrix. Furthermore, $w \in \mathbb{R}^N_{++}$ denotes a vector in $\mathbb{R}^N$ with (strictly) positive components.

For a function $\Phi : \mathbb{R}^N \to \mathbb{R}$, the elements $s \in \mathbb{R}^N$ that satisfy

$$
\Phi(y) \geq \Phi(x) + \langle s, y - x \rangle, \qquad \forall y \in \mathbb{R}^N,
$$

are called the subgradients of $\Phi$ at the point $x$. In words, all elements $s$ defining a linear function that supports $\Phi$ at the point $x$ are subgradients. The set of all subgradients $s$ at the point $x$ is called the subdifferential of $\Phi$ at $x$, and it is denoted by $\partial\Phi(x)$.

##### Convexity.

A function $\Phi$ is strongly convex with convexity parameter $\mu_\Phi > 0$ if

$$
\Phi(y) \geq \Phi(x) + \langle s, y - x \rangle + \frac{\mu_\Phi}{2}\|y - x\|^2, \qquad \forall x, y \in \mathbb{R}^N, \tag{2}
$$

where $s \in \partial\Phi(x)$. If $\mu_\Phi = 0$, then the function $\Phi$ is said to be convex.

##### Convex conjugate and proximal mapping.

The proximal mapping of a convex function $\Psi$ at the point $x$ is defined as follows:

$$
\operatorname{prox}_\Psi(x) := \arg\min_{y \in \mathbb{R}^N} \Big\{ \Psi(y) + \tfrac{1}{2}\|y - x\|^2 \Big\}. \tag{3}
$$

Furthermore, for a convex function $\Psi$, its convex conjugate is defined as $\Psi^*(y) := \sup_{x \in \mathbb{R}^N} \{\langle y, x \rangle - \Psi(x)\}$. From Rockafellar06 (), we see that the proximal mappings of $\Psi$, and of its convex conjugate $\Psi^*$, are nonexpansive:

$$
\|\operatorname{prox}_\Psi(y) - \operatorname{prox}_\Psi(x)\| \leq \|y - x\| \quad\text{and}\quad \|\operatorname{prox}_{\Psi^*}(y) - \operatorname{prox}_{\Psi^*}(x)\| \leq \|y - x\|. \tag{4}
$$
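To make these proximal objects concrete, here is a small numerical sketch (our own illustration, not from the paper) for the choice $\Psi(\cdot) = \lambda\|\cdot\|_1$: its prox is componentwise soft thresholding, the prox of the conjugate follows from the Moreau decomposition $\operatorname{prox}_{\Psi^*}(x) = x - \operatorname{prox}_\Psi(x)$ (valid for any convex $\Psi$), and the nonexpansiveness property (4) can be checked directly.

```python
import numpy as np

def prox_l1(x, lam=1.0):
    """Proximal mapping of lam*||.||_1: componentwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l1_conjugate(x, lam=1.0):
    """Prox of the conjugate, via the Moreau decomposition
    prox_{Psi*}(x) = x - prox_{Psi}(x)."""
    return x - prox_l1(x, lam)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

# Nonexpansiveness (4): ||prox(y) - prox(x)|| <= ||y - x||
assert np.linalg.norm(prox_l1(y) - prox_l1(x)) <= np.linalg.norm(y - x) + 1e-12
assert np.linalg.norm(prox_l1_conjugate(y) - prox_l1_conjugate(x)) <= np.linalg.norm(y - x) + 1e-12
```

The soft-thresholding formula here is the standard closed form for the $\ell_1$-prox; other separable regularizers would need their own per-coordinate formulas.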
##### Coordinate decomposition of $\mathbb{R}^N$.

Let $U = [U_1, U_2, \dots, U_N]$ be a column permutation of the $N \times N$ identity matrix, viewed as a decomposition of $U$ into $N$ submatrices (column vectors), where $U_i \in \mathbb{R}^{N \times 1}$. It is clear that any vector $x \in \mathbb{R}^N$ can be written uniquely as $x = \sum_{i=1}^N U_i x_i$, where $x_i \in \mathbb{R}$ denotes the $i$th coordinate of $x$.

Define $[N] := \{1, 2, \dots, N\}$. Then, for $S \subseteq [N]$, we let $U_S \in \mathbb{R}^{N \times |S|}$ be the collection of columns from the matrix $U$ that have column indices in the set $S$. We denote the vector

$$
x_S := U_S^T x \in \mathbb{R}^{|S|}. \tag{5}
$$
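The decomposition (5) can be sketched in a few lines of NumPy (an illustrative example with our own naming):

```python
import numpy as np

N = 6
x = np.arange(1.0, N + 1)        # x = [1, 2, 3, 4, 5, 6]
S = [1, 3, 4]                    # a subset of coordinate indices (0-based here)

# U_S: the columns of the N x N identity matrix with indices in S
U_S = np.eye(N)[:, S]

x_S = U_S.T @ x                  # x_S = U_S^T x, as in (5)
assert np.array_equal(x_S, x[S]) # equivalent to plain fancy indexing
```

In an implementation one would of course use the indexing `x[S]` directly; the explicit matrix $U_S$ is only a notational device.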
##### Coordinate decomposition of Ψ.

The function $\Psi$ is assumed to be coordinate separable. That is, we assume that $\Psi$ can be decomposed as:

$$
\Psi(x) = \sum_{i=1}^N \Psi_i(x_i), \tag{6}
$$

where the functions $\Psi_i : \mathbb{R} \to \mathbb{R}$ are convex. We denote by $\Psi_S$, where $S \subseteq [N]$, the function

$$
\Psi_S(x_S) = \sum_{i \in S} \Psi_i(x_i), \tag{7}
$$

where $x_S$ is the collection of components from $x$ that have indices in the set $S$. The following relationship will be used repeatedly in this work. For $x \in \mathbb{R}^N$, index set $S \subseteq [N]$, and $t_S, \hat{t}_S \in \mathbb{R}^{|S|}$, it holds that:

$$
\Psi(x + U_S t_S) - \Psi(x + U_S \hat{t}_S) = \Psi_S(x_S + t_S) - \Psi_S(x_S + \hat{t}_S). \tag{8}
$$

Clearly, when $\hat{t}_S = 0$, we have the special case $\Psi(x + U_S t_S) - \Psi(x) = \Psi_S(x_S + t_S) - \Psi_S(x_S)$.

##### Subset Lipschitz continuity of f.

Throughout the paper we assume that the gradient of $f$ is Lipschitz, uniformly in $x$, for all subsets $S \subseteq [N]$. This means that, for all $x \in \mathbb{R}^N$ and $t_S \in \mathbb{R}^{|S|}$, we have

$$
\|\nabla_S f(x + U_S t_S) - \nabla_S f(x)\| \leq L_S \|t_S\|, \tag{9}
$$

where $L_S > 0$. An important consequence of (9) is the following standard inequality (Nesterov04, p. 57):

$$
f(x + U_S t_S) \leq f(x) + \langle \nabla_S f(x), t_S \rangle + \frac{L_S}{2}\|t_S\|^2. \tag{10}
$$
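For a quadratic $f(x) = \tfrac{1}{2}x^\top A x$ with $A$ symmetric positive semidefinite, the block constant $L_S$ can be taken as the largest eigenvalue of the principal submatrix $A_{SS}$, and the upper bound (10) then holds exactly. A minimal numerical sketch (our own example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((8, 8))
A = B.T @ B                           # symmetric positive semidefinite
f = lambda x: 0.5 * x @ A @ x         # smooth convex f
grad = lambda x: A @ x

S = [0, 2, 5]
A_SS = A[np.ix_(S, S)]
L_S = np.linalg.eigvalsh(A_SS).max()  # block Lipschitz constant for this f

x = rng.standard_normal(8)
t_S = rng.standard_normal(len(S))
x_new = x.copy(); x_new[S] += t_S     # x + U_S t_S

lhs = f(x_new)
rhs = f(x) + grad(x)[S] @ t_S + 0.5 * L_S * t_S @ t_S
assert lhs <= rhs + 1e-10             # inequality (10)
```

For non-quadratic $f$, $L_S$ is not available in closed form in general, which is one motivation for line search strategies of the kind FCD uses.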

Let $X^*$ denote the set of optimal solutions of (1), and let $x^*$ be any member of that set. We define

$$
\mathcal{R}(x) = \max_y \max_{x^* \in X^*} \{\|y - x^*\| : F(y) \leq F(x)\}, \tag{11}
$$

which is a measure of the size of the level set of $F$ determined by $x$. In this work we make the standard assumption that $\mathcal{R}(x_0)$ is bounded at the initial iterate $x_0$. We also define a scaled version of (11):

$$
\mathcal{R}_w(x) = \max_y \max_{x^* \in X^*} \{\|y - x^*\|_w : F(y) \leq F(x)\}, \tag{12}
$$

where $w \in \mathbb{R}^N_{++}$ and $\|y - x^*\|_w^2 = \sum_{i=1}^N w_i (y_i - x_i^*)^2$.

For fixed $S$, we define a piecewise quadratic approximation of $F$ around the point $x$ as $F(x + t) \approx f(x) + Q(x; t)$, where

$$
Q(x; t) := \langle \nabla f(x), t \rangle + \frac{1}{2}\|t\|_{H(x)}^2 + \Psi(x + t), \tag{13}
$$

and $H(x)$ is any symmetric positive definite matrix. We also define

$$
Q_S(x; t_S) := \langle \nabla_S f(x), t_S \rangle + \frac{1}{2}\|t_S\|_{H_S(x)}^2 + \Psi_S(x_S + t_S), \tag{14}
$$

where $H_S(x)$ is a square submatrix (principal minor) of $H(x)$ that corresponds to the selection of the columns and rows from $H(x)$ with column and row indices in the set $S$. Notice that $Q_S$ is the quadratic model for the collection of coordinates in $S$.

Similarly to (8), we have the following important relationship. For $x \in \mathbb{R}^N$, index set $S \subseteq [N]$, and $t_S, \hat{t}_S \in \mathbb{R}^{|S|}$, it holds that:

$$
\begin{aligned}
Q(x; U_S t_S) - Q(x; U_S \hat{t}_S)
&= \langle \nabla f(x), U_S t_S \rangle + \tfrac{1}{2}\|U_S t_S\|_{H(x)}^2 + \Psi(x + U_S t_S) \\
&\qquad - \langle \nabla f(x), U_S \hat{t}_S \rangle - \tfrac{1}{2}\|U_S \hat{t}_S\|_{H(x)}^2 - \Psi(x + U_S \hat{t}_S) \\
&= \langle \nabla_S f(x), t_S \rangle - \langle \nabla_S f(x), \hat{t}_S \rangle + \tfrac{1}{2}\|t_S\|_{H_S(x)}^2 - \tfrac{1}{2}\|\hat{t}_S\|_{H_S(x)}^2 \\
&\qquad + \Psi_S(x_S + t_S) - \Psi_S(x_S + \hat{t}_S) \\
&= Q_S(x; t_S) - Q_S(x; \hat{t}_S). \tag{15}
\end{aligned}
$$
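The identity (15) can be checked numerically for the illustrative choice $\Psi(\cdot) = \lambda\|\cdot\|_1$; the sketch below uses our own naming, with a random vector standing in for $\nabla f(x)$:

```python
import numpy as np

lam = 0.1
Psi = lambda v: lam * np.abs(v).sum()        # illustrative separable regularizer

def Q(x, t, g, H):
    """Full model (13), with g standing in for grad f(x)."""
    return g @ t + 0.5 * t @ H @ t + Psi(x + t)

def Q_S(x, t_S, g, H, S):
    """Block model (14): only the coordinates in S are involved."""
    H_SS = H[np.ix_(S, S)]
    return g[S] @ t_S + 0.5 * t_S @ H_SS @ t_S + lam * np.abs(x[S] + t_S).sum()

rng = np.random.default_rng(2)
N = 6
B = rng.standard_normal((N, N)); H = B.T @ B + np.eye(N)   # symmetric positive definite
x, g = rng.standard_normal(N), rng.standard_normal(N)
S = [1, 4]
tS, tS_hat = rng.standard_normal(2), rng.standard_normal(2)

def lift(t_S):
    t = np.zeros(N); t[S] = t_S; return t    # t = U_S t_S

# identity (15): the full-space and block model differences agree
d_full = Q(x, lift(tS), g, H) - Q(x, lift(tS_hat), g, H)
d_block = Q_S(x, tS, g, H, S) - Q_S(x, tS_hat, g, H, S)
assert abs(d_full - d_block) < 1e-10
```

The off-block terms of $\Psi$ and of the quadratic cancel in the difference, which is exactly why the stopping conditions later depend on the subset $S$ only.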

### 2.2 Stationarity conditions.

The following theorem gives the equivalence of some stationarity conditions of problem (1).

###### Theorem 2.1

The following are equivalent first order optimality conditions for problem (1); see PB13 ().

1. $\nabla f(x) + s = 0$ and $s \in \partial\Psi(x)$,

2. $0 \in \nabla f(x) + \partial\Psi(x)$,

3. $x = \operatorname{prox}_\Psi(x - \nabla f(x))$,

4. $\nabla f(x) + \operatorname{prox}_{\Psi^*}(x - \nabla f(x)) = 0$.

Let us define the continuous function

$$
g(x; t) := \nabla f(x) + H(x)t + \operatorname{prox}_{\Psi^*}\big(x + t - \nabla f(x) - H(x)t\big). \tag{16}
$$

By Theorem 2.1, the points $x$ that satisfy $g(x; 0) = 0$ are stationary points for problem (1). Hence, $\|g(x; 0)\|$ is a continuous measure of the distance of $x$ from the set of stationary points of problem (1). Furthermore, we define

$$
g_S(x; t_S) := \nabla_S f(x) + H_S(x)t_S + \operatorname{prox}_{\Psi_S^*}\big(x_S + t_S - \nabla_S f(x) - H_S(x)t_S\big). \tag{17}
$$
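As a sketch of how the measure (16) might be evaluated in practice for the illustrative choice $\Psi(\cdot) = \lambda\|\cdot\|_1$ (our own example), note that $\operatorname{prox}_{\Psi^*}$ is available through the Moreau decomposition $\operatorname{prox}_{\Psi^*}(z) = z - \operatorname{prox}_\Psi(z)$:

```python
import numpy as np

lam = 0.1
prox_psi  = lambda z: np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)  # Psi = lam*||.||_1
prox_conj = lambda z: z - prox_psi(z)       # Moreau: prox_{Psi*}(z) = z - prox_Psi(z)

def g_measure(x, t, grad_f, H):
    """The stationarity measure (16), specialised to Psi = lam*||.||_1."""
    z = x + t - grad_f(x) - H @ t
    return grad_f(x) + H @ t + prox_conj(z)

# sanity check on a 1-D lasso-type problem: f(x) = 0.5*(x - b)^2, whose
# composite minimizer is the known closed form x* = prox_psi(b)
b = np.array([2.0])
grad_f = lambda x: x - b
H = np.eye(1)
x_star = prox_psi(b)
assert abs(g_measure(x_star, np.zeros(1), grad_f, H)[0]) < 1e-12
```

The final assertion confirms that $g(x^*; 0) = 0$ at the minimizer, consistent with Theorem 2.1.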

## 3 The Algorithm

In this section we present the Flexible Coordinate Descent (FCD) algorithm for solving problems of the form (1). There are three key steps in the algorithm: (Step 3) the coordinates are sampled randomly; (Step 4) the model (14) is solved approximately until rigorously defined stopping conditions are satisfied; (Step 5) a line search is performed to find a step size that ensures a sufficient reduction in the objective value. Once these key steps have been performed, the current point $x_k$ is updated to give a new point $x_{k+1}$, and the process is repeated.

In the algorithm description, and later in the paper, we will use the following definition for the set $\Omega$:

$$
\Omega := \partial Q_S(x_k; t_S^k). \tag{18}
$$

We are now ready to present pseudocode for the FCD algorithm, while a thorough description of each of the key steps in the algorithm will follow in the rest of this section.

### 3.1 Selection of coordinates (Step 3)

One of the central ideas of this algorithm is that the coordinates to be updated at each iteration are chosen randomly. This allows the coordinates to be selected very quickly and cheaply. For all the results in this work, we use a ‘$\tau$-nice sampling’ Richtarik15 (). This sampling is described briefly below, and a full description can be found in Richtarik15 ().

Let $2^{[N]}$ denote the set of all subsets of $[N]$, which includes the empty set $\emptyset$ and $[N]$ itself. At every iteration of FCD, a subset $S$ is sampled from $2^{[N]}$. The probability of a set $S$ being selected is denoted by $P(S)$, and we are interested in probability distributions with $P(\emptyset) = 0$ and $P(S) > 0$ for all $S \neq \emptyset$, i.e., all non-empty sets have a positive probability of being selected.

In what follows we describe a (specific instance of a) uniform probability distribution (called a $\tau$-nice sampling). In particular, the uniform probability distribution is constructed in such a way that subsets in $2^{[N]}$ with the same cardinality have equal probability of being selected, i.e., $P(S_1) = P(S_2)$ whenever $|S_1| = |S_2|$.

###### Definition 1 (τ-nice sampling)

Given an integer $1 \leq \tau \leq N$, a set $S \in 2^{[N]}$ with $|S| = \tau$ is selected with probability one, and the sampling is uniquely characterised by its density function:

$$
P(S) = \binom{N}{\tau}^{-1} \;\; \forall\, S \in 2^{[N]} \text{ with } |S| = \tau, \qquad P(S) = 0 \;\; \forall\, S \in 2^{[N]} \text{ with } |S| \neq \tau. \tag{24}
$$
###### Assumption 3.1

In Step 3 of FCD (Algorithm 1), we assume that the sampling procedure used to generate the subset of coordinates $S$ is that given in Definition 1.

Below is a technical result (see Richtarik15 ()) that uses the probability distribution (24) and is used in the worst-case iteration complexity results of this paper. Let $\theta_1, \dots, \theta_N \in \mathbb{R}$ be some constants, and let $S$ be a $\tau$-nice sampling. Then

$$
\mathbf{E}\Big[\sum_{i \in S} \theta_i\Big] = \frac{\tau}{N} \sum_{i=1}^N \theta_i. \tag{25}
$$
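A $\tau$-nice sampling is straightforward to implement, and the identity (25) can be checked by Monte Carlo simulation (an illustrative sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
N, tau = 10, 3
theta = rng.random(N)

def tau_nice_sample():
    """Draw S uniformly from all subsets of {0, ..., N-1} with |S| = tau."""
    return rng.choice(N, size=tau, replace=False)

# Monte Carlo check of (25): E[sum_{i in S} theta_i] = (tau/N) * sum_i theta_i
trials = 50_000
est = np.mean([theta[tau_nice_sample()].sum() for _ in range(trials)])
exact = tau / N * theta.sum()
assert abs(est - exact) < 2e-2
```

Drawing $\tau$ indices without replacement is equivalent to drawing a set uniformly from all $\binom{N}{\tau}$ subsets of size $\tau$, which is exactly the density (24).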

### 3.2 The search direction and Hessian approximation (Step 4)

The search direction (Step 4 of FCD) is determined as follows. Given a set of coordinates $S$, FCD forms the model (14), and minimizes the model approximately until the stopping conditions (20) and (21) are satisfied, giving an ‘inexact’ search direction $t_S$. We also describe the importance of the choice of the matrix $H_S(x)$, which defines the model and represents approximate second order information. Henceforth, we will use the shorthand $H_S \equiv H_S(x)$.

#### 3.2.1 The search direction

The subproblem (19) (where $Q_S$ is defined in (14)) is approximately solved, and the search direction $t_S$ is accepted when the stopping conditions (20) and (21) are satisfied, for some $\eta_S \in [0, 1)$. Notice that

$$
\begin{aligned}
Q(x; U_S t_S) - Q(x; 0)
&= Q_S(x; t_S) - Q_S(x; 0) \\
&= \langle \nabla_S f(x), t_S \rangle + \tfrac{1}{2}\|t_S\|_{H_S}^2 + \Psi_S(x_S + t_S) - \Psi_S(x_S). \tag{26}
\end{aligned}
$$

Hence, from (26), the stopping conditions (20) and (21) depend on the subset $S$ only, and are therefore inexpensive to verify, meaning that they are implementable.

###### Remark 3

• At some iteration $k$, it is possible that $g_S(x_k; 0) = 0$. In this case, it is easy to verify that the optimal solution of subproblem (19) is $t_S = 0$. Therefore, before calculating $t_S$, we first check whether the condition $g_S(x_k; 0) = 0$ is satisfied.

• Notice that, unless $x_k$ is optimal (i.e., $g(x_k; 0) = 0$), there will always exist a subset $S$ of coordinates such that $g_S(x_k; 0) \neq 0$, which implies that $t_S \neq 0$. Hence, FCD will not stagnate.

#### 3.2.2 The Hessian approximation

One of the most important features of this method is that the quadratic model (14) incorporates second order information in the form of a symmetric positive definite matrix $H_S(x)$. This is key because, depending upon the choice of $H_S(x)$, it makes the method robust. Moreover, at each iteration, the user has complete freedom over the choice of $H_S(x)$.

We now provide a few suggestions for the choice of $H_S(x)$. (This list is not intended to be exhaustive.) Notice that in each case there is a trade off between a matrix that is inexpensive to work with, and one that is a more accurate representation of the true block Hessian.

1. Identity. Here, $H_S(x) = I$, so that no second order information is employed by the method.

2. Multiple of the identity. One could set $H_S(x) = cI$, for some $c > 0$. In particular, a popular choice is $c = L_S$, where $L_S$ is the Lipschitz constant defined in (9).

3. Diagonal matrix. When $H_S(x)$ is a diagonal matrix, both it and its inverse are inexpensive to work with. In particular, $H_S(x) = \operatorname{diag}(\nabla_{SS}^2 f(x))$ captures partial curvature information, and is an effective choice in practice, particularly when the diagonal of the Hessian is a good approximation to $\nabla^2 f(x)$. Moreover, if $f$ is quadratic, then $\nabla^2 f(x)$ is constant for all $x$, so the diagonal can be computed and stored at the start of the algorithm and elements can be accessed throughout the algorithm as necessary.

4. Principal minor of the Hessian. When $H_S(x) = \nabla_{SS}^2 f(x)$, second order information is incorporated into FCD, which can be very beneficial for ill-conditioned problems. However, this choice of $H_S(x)$ is usually more computationally expensive to work with. In practice, $H_S(x)$ may be used in a matrix-free way and not explicitly stored. For example, there may be an analytic formula for performing matrix-vector products with $H_S(x)$, or techniques from automatic differentiation could be employed; see IEEEhowto:wrightbook2 ().

5. Quasi-Newton approximation. One may choose $H_S(x)$ to be an approximation to $\nabla_{SS}^2 f(x)$ based on a limited-memory BFGS update scheme; see IEEEhowto:wrightbook2 (). This approach might be suitable in cases where the problem is ill-conditioned, but performing matrix-vector products with the exact Hessian is expensive.
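The options above might be assembled as in the following illustrative sketch (our own naming; the positive definite adjustments shown are of the kind suggested in Remark 4):

```python
import numpy as np

def make_H_S(hess_SS, kind="diag", c=1.0, delta=1e-8):
    """Illustrative options for the block matrix H_S(x).
    hess_SS: the principal submatrix of the Hessian for the block S."""
    m = hess_SS.shape[0]
    if kind == "identity":
        return np.eye(m)                          # option 1: no curvature
    if kind == "scaled":
        return c * np.eye(m)                      # option 2: e.g. c = L_S
    if kind == "diag":
        d = np.diag(hess_SS).copy()
        d[d <= 0] = delta                         # replace zeros to stay positive definite
        return np.diag(d)                         # option 3: diagonal curvature
    if kind == "full":
        lam_min = np.linalg.eigvalsh(hess_SS).min()
        shift = max(0.0, delta - lam_min)         # add a multiple of I if needed
        return hess_SS + shift * np.eye(m)        # option 4: principal minor
    raise ValueError(kind)

H = make_H_S(np.array([[2.0, 1.0], [1.0, 2.0]]), kind="full")
assert np.linalg.eigvalsh(H).min() > 0            # symmetric positive definite
```

A limited-memory BFGS variant (option 5) would maintain curvature pairs rather than an explicit matrix, so it is omitted from this sketch.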

###### Remark 4

We make the following remarks regarding the choice of .

1. If any of the matrices mentioned above are not positive definite, then they can be adjusted to make them so. For example, if is diagonal, any zero that appears on the diagonal can be replaced with a positive number. Moreover, if is not positive definite, a multiple of the identity can be added to it.

2. If $H_S(x_k)$ is chosen to be a diagonal matrix, then the quadratic model (14) is separable, so the subproblems for each coordinate in $S$ are independent. Moreover, in some cases this gives rise to a closed form solution for the search direction $t_S^k$. (For example, if $\Psi(x) = \tau\|x\|_1$, then soft thresholding may be used.)
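As an illustration of the second remark, the following hypothetical sketch solves the separable model for a diagonal $H_S = \mathrm{diag}(d)$ and $\Psi(x) = \tau\|x\|_1$ by coordinate-wise soft thresholding; the function names and interface are ours, not the paper's.

```python
import numpy as np

# Hypothetical sketch: closed-form solution of the separable model
#   min_t  g_i*t + 0.5*d_i*t^2 + tau*|x_i + t|   (one problem per coordinate)
# for a diagonal H_S = diag(d) and Psi(x) = tau*||x||_1.

def soft_threshold(z, r):
    """Proximal operator of r*|.| (soft thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - r, 0.0)

def coordinate_step(x_S, g_S, d_S, tau):
    """Exact minimiser t_S of the separable quadratic model."""
    return soft_threshold(x_S - g_S / d_S, tau / d_S) - x_S
```

For example, with $x_i = 1$, $g_i = 0.5$, $d_i = 2$ and $\tau = 0.3$, the step is $t_i = \mathrm{soft}(1 - 0.25,\, 0.15) - 1 = -0.4$.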

An advantage of the FCD algorithm (if Option 4 is used for $H_S(x_k)$) is that all elements of the Hessian can be accessed. This is because the set of coordinates $S$ can change at every iteration, and so too can the matrix $H_S(x_k)$. This makes FCD extremely flexible, which is particularly advantageous when there are large off-diagonal elements in the Hessian. The importance of incorporating information from off-diagonal elements is demonstrated in the numerical experiments in Section 9.

#### 3.2.3 Termination criteria for model minimization

In Step 4 of FCD, the termination criteria (20) and (21) are used to determine whether an acceptable search direction has been found. For composite functions of the form (1), both conditions are required to ensure that $t_S^k$ is a descent direction. Moreover, (20) is important for intuitively obvious reasons: the model should be decreased at each iteration. We now attempt to explain and justify the use of the second termination condition (21).

The parameter $\eta_S^k$ plays a key role; it determines how ‘inexactly’ the quadratic model (19) may be solved (or, equivalently, how ‘inexact’ the search direction is allowed to be). If one sets $\eta_S^k = 0$, then the model (19) is minimized exactly, leading to exact search directions. On the other hand, if $\eta_S^k > 0$ then the model is approximately minimized, leading to inexact search directions.

To obtain iteration complexity guarantees for FCD, the termination conditions must be chosen carefully. In particular, it becomes clear in Lemma 3 that (20) and (21) are the appropriate conditions. We now show that the conditions are mathematically sound. Rearranging (21), we obtain

$$\inf_{u \in \Omega} \|u - g_S(x_k; t_S^k)\|^2 + \|g_S(x_k; t_S^k)\|^2 \le \big(\eta_S^k \|g_S(x_k; 0)\|\big)^2. \tag{27}$$

The left hand side in (27) is a continuous function because the ‘$\inf$’ term involves an orthogonal projection onto the closed set $\Omega$, see (18), and $g_S(x_k; t_S^k)$ is continuous as a function of $t_S^k$. Furthermore, there exists a point such that (27) is satisfied, namely the minimizer of (19). In particular, if $t_S^*$ is the minimizer of (19), then from Theorem 2.1 we have that $g_S(x_k; t_S^*) = 0$. Hence, the left hand side in (27) is zero and the condition is satisfied for any $\eta_S^k \ge 0$. Since the right hand side in (27) is a non-negative constant, the left hand side is continuous, and the minimizer of (19) satisfies (27), there exists a closed ball centered at the minimizer within which (27) is satisfied. Therefore, any convergent method used to minimize (19) will eventually satisfy the termination condition (27) without solving (19) exactly, unless the right hand side in (27) is zero.

We stress that this projection operation is inexpensive for the applications that we consider, e.g., when $\Psi$ is the $\ell_1$-norm or the squared $\ell_2$-norm, or elastic net regularization, which is a combination of these two. In particular, the subdifferential $\partial\Psi^*_S$ has an ‘analytic’ form, which also allows a closed form solution of the left hand side of the termination criterion. Below we provide an example for the case where $\Psi(x) = \tau\|x\|_1$. Using the definition of $\Omega$ we have that the left hand side problem in (27) is equivalent to

$$\inf_{u - \nabla_S f(x_k) - H_S(x_k) t_S^k \,\in\, \partial\Psi^*_S(x_S^k + t_S^k)} \|u - g_S(x_k; t_S^k)\|^2.$$

By making the change of variables $v = u - \nabla_S f(x_k) - H_S(x_k) t_S^k$, we rewrite the previous problem as

$$\inf_{v \in \partial\Psi^*_S(x_S^k + t_S^k)} \|v + \nabla_S f(x_k) + H_S(x_k) t_S^k - g_S(x_k; t_S^k)\|^2.$$

Using the definition of $g_S(x_k; t_S^k)$ from (17), the previous problem is equivalent to

$$\inf_{v \in \partial\Psi^*_S(x_S^k + t_S^k)} \big\|v - \mathrm{prox}_{\Psi^*_S}\big(x_S^k + t_S^k - \nabla_S f(x_k) - H_S(x_k) t_S^k\big)\big\|^2. \tag{28}$$

The subdifferential $\partial\Psi^*_S$ has the following coordinate-wise form:

$$\partial\Psi^*_i(x_i^k + t_i^k) = \begin{cases} \tau & \text{if } x_i^k + t_i^k > 0, \\ [-\tau, \tau] & \text{if } x_i^k + t_i^k = 0, \\ -\tau & \text{if } x_i^k + t_i^k < 0, \end{cases} \qquad \forall\, i \in S.$$

Using the above analytic form of the subdifferential it is easy to see that problem (28) has a closed form solution that is inexpensive to compute.
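A minimal sketch of this closed-form evaluation, assuming $\Psi(x) = \tau\|x\|_1$ and the definitions (17)–(18): by the Moreau decomposition, the prox of the conjugate function is a clip to $[-\tau, \tau]$, and the infimum in (28) is then the squared distance from that clipped point to the coordinate-wise subdifferential interval. The interface and names are illustrative.

```python
import numpy as np

# Hypothetical sketch of evaluating the infimum term in (27), i.e. problem
# (28), in closed form for Psi(x) = tau*||x||_1.  The prox of the conjugate
# of tau*||.||_1 is a clip to [-tau, tau] (Moreau decomposition); the infimum
# over v is the squared distance to the subdifferential interval.

def inexactness_lhs(x_S, t_S, grad_S, H_S, tau):
    y = x_S + t_S
    z = y - grad_S - H_S @ t_S
    p = np.clip(z, -tau, tau)            # prox of the conjugate of tau*||.||_1
    lo = np.where(y > 0, tau, -tau)      # subdifferential interval: {tau},
    hi = np.where(y < 0, -tau, tau)      # [-tau, tau], or {-tau}
    proj = np.clip(p, lo, hi)            # nearest subgradient v
    return np.sum((p - proj) ** 2)
```

At the exact minimizer of (19) this quantity is zero, consistent with the discussion above.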

###### Remark 5

We stress that if the regularizer $\Psi$ is the $\ell_1$-norm or the squared $\ell_2$-norm, which are two very popular regularizers, our inexactness criterion (27) is inexpensive to implement in practice. However, for other, more general regularizers this may not be the case, and it may not be possible to evaluate (27) analytically. In such a case, it may be necessary to compute the minimizer of the quadratic model exactly, or to use some other iterative method, both of which come with an associated computational cost.

### 3.3 The line search (Step 5)

The stopping conditions (20) and (21) ensure that $t_S^k$ is a descent direction, but if the full step is taken, a reduction in the objective function value (1) is not guaranteed. To this end, we include a line search step in our algorithm to guarantee a monotonic decrease in the function $F$. Essentially, the line search guarantees sufficient decrease of $F$ at every iteration, where sufficient decrease is measured using the loss function (23).

In particular, for fixed $\theta \in (0,1)$, we require that (22) is satisfied for some $\alpha > 0$. (In Lemma 1 we prove the existence of such an $\alpha$.) Notice that

$$\ell(x; U_S t_S) - \ell(x; 0) = \langle \nabla_S f(x), t_S \rangle + \Psi(x + U_S t_S) - \Psi(x) = \langle \nabla_S f(x), t_S \rangle + \Psi_S(x_S + t_S) - \Psi_S(x_S), \tag{29}$$

which shows that the calculation of the right hand side of (22) depends only upon the block $S$, so it is inexpensive. Moreover, the line search condition (Step 5) involves the difference between function values $F(x_k) - F(x_k + \alpha U_S t_S^k)$. Fortunately, while function values can be expensive to compute, computing the difference between function values at successive iterates need not be. (See Section 3.4, and the numerical results in Section 9 and Table 6.)
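The backtracking procedure can be sketched as follows, assuming (22) takes the form $F(x_k) - F(x_k + \alpha U_S t_S^k) \ge \theta\,(\ell(x_k; 0) - \ell(x_k; \alpha U_S t_S^k))$, which is consistent with the quantities appearing in the proof of Lemma 1; the callables, halving factor, and iteration cap are all illustrative choices.

```python
# Hypothetical sketch of the backtracking line search: shrink alpha until the
# actual decrease F(x) - F(x + alpha*U_S*t_S) is at least theta times the
# model decrease ell(x;0) - ell(x; alpha*U_S*t_S).

def backtracking_alpha(F_decrease, model_decrease, theta=0.5, alpha=1.0,
                       shrink=0.5, max_iter=50):
    """F_decrease and model_decrease are callables of alpha alone."""
    for _ in range(max_iter):
        if F_decrease(alpha) >= theta * model_decrease(alpha):
            return alpha
        alpha *= shrink
    return alpha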

### 3.4 Concrete examples

Now we give two examples to show that the difference between function values at successive iterates can be computed efficiently, which demonstrates that the line-search is implementable. The first example considers a convex composite function, where the smooth term is a quadratic loss term and the second example considers a logistic regression problem.

##### Quadratic loss plus regularization example.

Suppose that $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ and $\Psi$ is not identically zero, where $A \in \mathbb{R}^{m \times N}$ and $b \in \mathbb{R}^m$. Then

$$F(x_k) - F(x_k + \alpha U_S t_S^k) = f(x_k) + \Psi_S(x_S^k) - f(x_k + \alpha U_S t_S^k) - \Psi_S(x_S^k + \alpha t_S^k) = \Psi_S(x_S^k) - \alpha \langle \nabla_S f(x_k), t_S^k \rangle - \frac{\alpha^2}{2} \|A_S t_S^k\|_2^2 - \Psi_S(x_S^k + \alpha t_S^k). \tag{30}$$

The calculation of (30), as a function of $\alpha$, depends only on the subset $S$; hence, it is inexpensive. Moreover, the quantities in (30) are often already needed in the computation of the search direction $t_S^k$, so regarding the line search step, they essentially come “for free”.
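A sketch of this computation, assuming $\Psi(x) = \tau\|x\|_1$ as an illustrative regularizer: with the residual $r = Ax_k - b$ maintained across iterations and $w = A_S t_S^k$ computed once, the difference is a cheap scalar function of $\alpha$, and no full evaluation of $F$ is required.

```python
import numpy as np

# Illustrative sketch of the cheap function-value difference for
# f(x) = 0.5*||Ax - b||^2 plus Psi(x) = tau*||x||_1 (an assumed choice).
# r is the maintained residual A@x - b; w = A_S @ t_S is one block mat-vec,
# reused for every trial value of alpha.

def quadratic_F_decrease(r, A_S, x_S, t_S, alpha, tau):
    """Returns F(x) - F(x + alpha*U_S*t_S) without a full F evaluation."""
    w = A_S @ t_S
    smooth = -alpha * (r @ w) - 0.5 * alpha**2 * (w @ w)
    psi = tau * (np.abs(x_S).sum() - np.abs(x_S + alpha * t_S).sum())
    return smooth + psi
```

(In this snippet the block $S$ is taken to be all coordinates, so $A_S = A$ and $x_S = x$; in general only the relevant columns and entries are passed in.)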

##### Logistic regression example.

Suppose that $f(x) = \sum_{j=1}^m \log(1 + e^{-b_j a_j^T x})$ and $\Psi$ is not identically zero, where $a_j^T$ is the $j$th row of a matrix $A \in \mathbb{R}^{m \times N}$ and $b_j$ is the $j$th component of a vector $b \in \mathbb{R}^m$. As before, we need to evaluate (30). Let us split the calculation into two parts. The first part, involving $\Psi_S$, is inexpensive, since it depends only on the subset $S$. The second part, involving $f$, is more expensive because it depends upon the logarithm. In this case, one can calculate $f(x_0)$ once at the beginning of the algorithm and then update it less expensively. In particular, let us assume that the following terms:

$$e^{-b_j a_j^T x_0} \;\; \forall j, \qquad \text{and} \qquad f(x_0) = \sum_{j=1}^m \log\big(1 + e^{-b_j a_j^T x_0}\big), \tag{31}$$

are calculated once and stored in memory. Then, at iteration $k$, the backtracking line search requires the calculation of $f(x_k + \alpha U_S t_S^k)$ for different values of $\alpha$. The most demanding task is the calculation of the products $a_j^T U_S t_S^k$, which is performed only once and is inexpensive since this operation depends only on the subset $S$. Given these products and (31), the calculation of $f(x_k + \alpha U_S t_S^k)$ for different values of $\alpha$ is inexpensive. At the end of the process, $e^{-b_j a_j^T x_{k+1}}$ for all $j$ and $f(x_{k+1})$ are given for free, and the process can be repeated at the next iteration.
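The caching scheme above can be sketched as follows; the function names and interface are illustrative.

```python
import numpy as np

# Illustrative sketch of the caching scheme: store c_j = exp(-b_j * a_j^T x_k)
# and f(x_k); after the one-off block products d_j = a_j^T (U_S t_S^k),
# f(x_k + alpha*U_S*t_S^k) costs only elementwise operations per trial alpha.

def logistic_f(A, b, x):
    """f(x) = sum_j log(1 + exp(-b_j * a_j^T x)), computed from scratch."""
    return np.log1p(np.exp(-b * (A @ x))).sum()

def f_after_step(cache, d, b, alpha):
    """f at the trial point, from cached c_j and products d_j = a_j^T u."""
    return np.log1p(cache * np.exp(-alpha * b * d)).sum()
```

The identity used is $e^{-b_j a_j^T (x + \alpha u)} = c_j \, e^{-\alpha b_j a_j^T u}$, so each trial $\alpha$ avoids any product with the full matrix $A$.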

###### Remark 6

The examples above show that, for quadratic loss functions and logistic regression problems, computing function values is relatively inexpensive. Thus, for problems with this, or similar, structure, FCD is competitive, even though it requires function evaluations for the line search. (The competitiveness of FCD is confirmed in our numerical experiments in Section 9.) However, we stress that, for other applications with more general objective functions, it may not be possible to perform function evaluations efficiently, in which case FCD may not be a suitable algorithm and a different algorithm could be used.

## 4 Bounded step-size and monotonic decrease of F

Throughout this section we denote $H_S^k := H_S(x_k)$. The following assumptions are made about $H_S^k$ and $f$. Assumption 4.1 states that the Hessian approximation must be positive definite for all subsets of coordinates $S$ at all iterations $k$. Assumption 4.2 states that the gradient of $f$ must be Lipschitz for all subsets $S$.

###### Assumption 4.1

There exist $0 < \lambda_S \le \Lambda_S < \infty$ such that the sequence of symmetric matrices $\{H_S^k\}$ satisfies:

$$0 < \lambda_S \le \lambda_{\min}(H_S^k) \quad \text{and} \quad \lambda_{\max}(H_S^k) \le \Lambda_S, \quad \text{for all } S \subseteq [N]. \tag{32}$$
###### Assumption 4.2

The function $f$ is smooth and satisfies (9) for all $S \subseteq [N]$.

The next assumption regards the relationship between the parameters introduced in Assumptions 4.1 and 4.2.

###### Assumption 4.3

The relation holds for all .

The following lemma shows that if $t_S^k$ is nonzero, then $F$ is decreased. The proof closely follows that of (sqa, Theorem ).

###### Lemma 1

Let Assumptions 3.1, 4.1, 4.2 and 4.3 hold, and let $\theta \in (0, 1)$. Given $x_k$, let $t_S^k$ and $x_{k+1}$ be generated by FCD (Algorithm 1). Then Step 6 of FCD will accept a step-size $\alpha$ that satisfies

$$\alpha \ge \alpha_S, \quad \text{where} \quad \alpha_S := \frac{(1-\theta)\lambda_S}{2 L_S} > 0. \tag{33}$$

Furthermore,

$$F(x_k) - F(x_k + \alpha U_S t_S^k) > \frac{\theta(1-\theta)\lambda_S^2}{4 L_S} \|t_S^k\|^2. \tag{34}$$
###### Proof

From (20), rearranging gives

$$\ell(x_k; 0) - \ell(x_k; U_S t_S^k) > \frac{1}{2}\|t_S^k\|_{H_S^k}^2 \ge \frac{\lambda_S}{2}\|t_S^k\|^2. \tag{35}$$

Denote $x_{k+1} := x_k + \alpha U_S t_S^k$. By Assumption 4.2, for all $\alpha \in (0, 1]$, we have

$$F(x_{k+1}) \le f(x_k) + \alpha \langle \nabla_S f(x_k), t_S^k \rangle + \frac{L_S}{2}\alpha^2 \|t_S^k\|^2 + \Psi(x_k + \alpha U_S t_S^k).$$

Subtracting both sides of the above from $F(x_k) = f(x_k) + \Psi(x_k)$ and rearranging gives

$$F(x_k) - F(x_{k+1}) \ge -\alpha \langle \nabla_S f(x_k), t_S^k \rangle - \frac{L_S}{2}\alpha^2 \|t_S^k\|^2 - \Psi(x_k + \alpha U_S t_S^k) + \Psi(x_k) = \ell(x_k; 0) - \ell(x_k; \alpha U_S t_S^k) - \frac{L_S}{2}\alpha^2 \|t_S^k\|^2. \tag{36}$$

By convexity of $\ell$ we have that

$$\ell(x_k; 0) - \ell(x_k; \alpha U_S t_S^k) \ge \alpha \big(\ell(x_k; 0) - \ell(x_k; U_S t_S^k)\big). \tag{37}$$

Then

$$F(x_k) - F(x_{k+1}) - \theta\big(\ell(x_k; 0) - \ell(x_k; \alpha U_S t_S^k)\big) \tag{38}$$