# Second-order Conditional Gradients

## Abstract

Constrained second-order convex optimization algorithms are the method of choice when a high accuracy solution to a problem is needed, due to the quadratic convergence rates these methods enjoy when close to the optimum. These algorithms require the solution of a constrained quadratic subproblem at every iteration. In the case where the feasible region can only be accessed efficiently through a linear optimization oracle, and computing first-order information about the function, although possible, is costly, the coupling of constrained second-order and conditional gradient algorithms leads to competitive algorithms with solid theoretical guarantees and good numerical performance.

## 1 Introduction

We focus on the optimization problem defined as

(1.1) |

where is strongly convex, has Lipschitz continuous gradients and is twice differentiable.

### 1.1 Newton and Quasi-Newton Methods

An immensely powerful approach when is to construct a second-order approximation to at the current iterate using first and second order information, denoted by , and move in the direction that minimizes this approximation, giving rise to a family of methods known as Newton methods (Kantorovich, 1948). A damped variant of the former applied to the minimization of a self-concordant function, converges globally, and shows quadratic local convergence when the iterates are close enough to the optimum (Nesterov and Nemirovskii, 1994). The global convergence of this method also extends to strongly convex and smooth function (Nesterov and Nemirovskii, 1994; Nesterov, 2013). Using a cubic regularized version of Newton’s method, the global convergence of the method can also be extended to a broader class of functions than that of self-concordant or strongly convex and smooth functions (Nesterov and Polyak, 2006).

When is a convex set, one can use a constrained analog of these methods (Levitin and Polyak, 1966), where a quadratic approximation to the function is minimized over at each iteration. There are two shortcomings to these methods. First, computing second-order information about , especially in high dimensions, is computationally expensive, this has led to the development of Quasi-Newton methods, that build an approximation to the hessian using gradient information acquired from past iterates (see Nocedal and Wright (2006) for an overview). Secondly, in many cases solving the approximate quadratic subproblem to optimality is not possible, and so inexact variants are used. This has resulted in numerous inexact Quasi-Newton methods, which inherit many of the favorable properties of Newton methods (Scheinberg and Tang, 2016; Lee et al., 2014). The global convergence analysis of these methods can be recast in terms of a notion related to the multiplicative stability of the Hessian, allowing for elegant proofs of global convergence for several variants, both in the unconstrained, and constrained case (Karimireddy et al., 2018b).

### 1.2 The Conditional Gradient Algorithms

The conditional gradient algorithm (Levitin and Polyak, 1966) (also known as the Frank-Wolfe algorithm (Frank and Wolfe, 1956)), is to be found on the other side of the computational cost spectrum. Instead of requiring the resolution of a constrained quadratic problem at each iteration, it requires the resolution of a constrained linear problem, and only uses first-order information about . As the conditional gradient algorithm maintains its iterates as convex combinations of extremal points of obtained from the linear optimization problem it is projection-free. The conditional gradient algorithm is a robust convex optimization algorithm that produces sparse solutions and, e.g., its Away-Step and Pairwise-Step variants enjoy a global linear convergence for strongly convex and smooth functions over polytopes (Lacoste-Julien and Jaggi, 2015); improved up to linear rates of convergence also hold for sharp functions (Kerdreux et al., 2019). It has been the method of choice in many machine learning applications, where projecting onto is computationally prohibitive, such as, e.g., in video co-localization (Joulin et al., 2014), greedy particle optimization in Bayesian inference (Futami et al., 2019) or in structural SVMs (Lacoste-Julien et al., 2013).

For problems where first-order information about is hard to compute, using second-order methods seems counter-intuitive, yet it allows the construction of a quadratic approximation to the function whose gradients are much cheaper to compute (another alternative is to replace the first-order oracle by a stochastic first order oracle to combat the computational cost of the first-order oracle; see e.g., (Robbins and Monro, 1951)). The observation that the progress made by moving towards the minimizer of is often much greater than the progress made by an iteration of any first-order method allows these algorithms to be competitive in practice (Schmidt et al., 2009) although working with much more expensive second-order approximations. We consider the case where both the first-order oracle for and projection oracle onto are computationally expensive, but linear programming oracles over are relatively cheap. This is a setup that is often found in many applications. In this setting, we show how conditional gradient algorithms can be coupled with Quasi-Newton methods to obtain competitive algorithms both in progress per iteration and in progress per unit time; while the former is expected the second is not.

### 1.3 Quadratic Subproblem Stopping Criterion

The algorithm presented in this work uses a conditional gradient algorithm to approximately minimize the quadratic approximation over . The basic idea is similar to the Conditional Gradient Sliding algorithm (Lan and Zhou, 2016) where quadratic subproblems appearing in accelerated methods are solved via conditional gradients, yet our analysis is quite different.

The stopping criterion for our subproblems relies on knowledge of the primal gap (or any lower bound on the primal gap bounded away from zero). Knowledge of the value of has been used to obtain optimal step sizes, known as Polyak steps, for first-order methods (Polyak, 1969; Nedic, 2002; Hazan and Kakade, 2019) as well as to obtain local estimates of the strong convexity parameter of a function for accelerated algorithms (Barré and d’Aspremont, 2019). This setup is well-known and important in the context of machine learning applications where the loss function is usually lower bounded by , i.e., . A similar assumption is made in Asi and Duchi (2019), to build optimization algorithms that have a robust performance with respect to different step sizes and algorithm parameters. Another important example is the approximate Carathéodory problem (Mirrokni et al., 2017; Combettes and Pokutta, 2019) where we want to find the decomposition of a point in a polytope as a convex combination of the vertices of and where the objective .

We stress that while we opted for the primal gap here as stopping criterion, as done in the aforementioned recent works, other choices together with the necessary changes to the argument are possible.

### 1.4 Contributions

The contribution can be summarized as follows:

**Projection-free Quasi-Newton Method.** We provide an inexact projection-free Quasi-Newton method, denoted as Second-order Conditional Gradients (SOCG) with a stopping criterion that relies on knowledge of the primal gap and achieves global linear convergence and quadratic local convergence with properly tuned parameters when close to the optimum. Note that (Ochs and Malitsky, 2019)[Example 11] also proposed a conditional gradient-based Newton method via their model function-based conditional gradient algorithm, however their approach is markedly different from ours: the steps performed in their algorithm are projected onto the feasible region using the Euclidean norm while the projections in our algorithm use a norm defined by a positive semi-definite matrix that approximates the Hessian.

We also discuss how the algorithm can be used when the primal gap is not known to obtain an algorithm that retains the attractive numerical performance of the original algorithm; see Section 5.2.

**Computational Experiments.** We compare our algorithm with other state-of-the-art projection free optimization algorithms and provide numerical evidence that the proposed algorithm is able to significantly outperform other algorithms in terms of per-iteration and per-unit-time progress.

## 2 Preliminaries

We denote the unique minimizer of Problem 1.1 by , and we define the scaled projection of onto , denoted by , as:

(2.1) | ||||

(2.2) |

where is the matrix norm defined by such that and is the Euclidean norm. As is strongly-convex with respect to , Problem 2.2 has a unique minimizer. Let denote the quadratic approximation of the function around the point , i.e.:

Where is an approximation of the true Hessian of at , i.e., that satisfies the following assumption:

###### Assumption 1.

There exists an such that for and :

(2.3) |

Moreover, as we have that . This condition is similar in spirit to Condition C in (Karimireddy et al., 2018b).

###### Assumption 2.

For all , there exists a such that .

## 3 Second-order Conditional Gradients Algorithm

The exact SOCG algorithm (Algorithm 1) defines iterates as where is a step size and:

(3.1) |

where the minimization problem is solved with a Conditional Gradients algorithm.

###### Remark 3.1.

The minimization problem shown in Equation 3.1 is equivalent to minimizing the quadratic approximation of over , that is:

(3.2) |

Computing the scaled projection shown in Equation 3.1 and Equation 3.2 is often too expensive, and so the problems are solved until for some ; we say that the scaled projection problem is solved to -optimality. In the case where for , the projected Newton algorithm is equivalent to the projected gradient descent algorithm, when the projections are computed with a Conditional Gradients algorithm. In fact, Algorithm 1 will lead to a valid descent direction for any positive definite matrix when minimizing a convex function with , as shown in Lemma A.2 in Appendix A.

The algorithm defined via Equation 3.1 is of interest when is an approximation of the Hessian in applications where computing first-order information of is computationally expensive, or when we want to find a solution to Problem 1.1 with high accuracy. In the former case we aim to form an approximation to the function , whose gradients are easy to compute, and which potentially provides significantly more progress than first-order methods per iteration. We denote by a procedure that updates the current approximation of the Hessian using information from past gradients and iterates.

(3.6) |

(3.7) |

### 3.1 Exact projections: for

We first assume that the scaled projection problem in Line 3 of Algorithm 1 is solved to optimality, i.e. for , and thus .

#### Global Convergence

The global convergence of Projected Newton Methods (a class of algorithms to which SOCG belongs to) is well-known for -smooth and -strongly convex functions when , we will simply restate the most important results. Proofs of these can be found in Appendix B and in the references cited.

###### Theorem 3.2.

###### Corollary 3.3.

If we pick using the exact bounded line search strategy shown in Equation 3.7, the SOCG algorithm (Algorithm 1) will make as least as much progress as it will make by choosing , therefore if we pick perform a bounded exact line search the primal gap will contract as:

Which follows by particularizing the results from Theorem 3.2 to .

#### Local convergence

When the iterates are close enough to a quadratic convergence rate in the distance to can be obtained.

###### Theorem 3.4.

### 3.2 Inexact projections: for

Solving the Problem in Equation 3.7 to optimality is computationally prohibitive, and because of this an -optimal solution is often used. Usually two approaches are used, either the problems are solved up to a certain multiplicative error or they are solved up to an additive error. We focus on the latter case, where we require knowledge of the primal gap of the Problem in Equation 3.7.

#### Global convergence: Inexact SOCG

###### Theorem 3.6.

Assume is -smooth and -strongly convex function, is convex, the step size sequence satisfies and the subproblems are solved to an accuracy such that:

(3.8) |

where is a, to be chosen, forcing parameter. Then the following bound on the primal gap of the iterates of the SOCG algorithm (Algorithm 1) holds:

#### Local convergence: Inexact SOCG

###### Theorem 3.8.

###### Corollary 3.9.

If the assumptions of Theorem 3.8 are satisfied, then there are three options: If we choose a parameter then the SOCG algorithm (Algorithm 1) will generate iterates that converge linearly to . If we choose as the distance from the iterates to will converge superlinearly. Lastly, if , where is a constant, the distance of the iterates to will converge quadratically.

## 4 Limited-memory BFGS

Several methods exist to obtain an approximation to the hessian from first-order information. Notable among them are the BFGS (Broyden, Fletcher, Goldfarb, and Shanno) and DFP (Davidon, Fletcher, and Powell) algorithms, which update with a rank-2 matrix at each iteration. However these methods explicitly require the storage of a dense matrix, so the memory requirements scale as , the updating requires operations and computing the gradient of using a dense requires multiplications. To avoid these requirements a Limited-memory BFGS (L-BFGS) algorithm that uses only the information from the past iterates can be used. This lowers the memory requirements to and requires only operations to update at each iteration. Moreover computing the gradient of with an L-BFGS reconstruction of the hessian requires multiplications (Nocedal and Wright, 2006).

## 5 Computations

The purpose of this computational section is to demonstrate the numerical advantage of using second-order approximations to the original function when minimizing functions with expensive first-order oracles, in the context of projection-free optimization.

### 5.1 Sparse Coding over the Birkhoff Polytope

Given a set of input data points with , sparse dictionary learning attempts to find a dictionary and a sparse representation with that minimizes:

(5.1) |

Where is the set of matrices with columns with norm less than one. This problem is of interest as many signal processing see performance boosts when given a learned dictionary that is able to give a sparse representation (see Mairal et al. (2010) for an overview), as opposed to a predefined dictionary obtained from Fourier or wavelet transforms. The elements in this learned dictionary are not required to be orthogonal, and they can form an undercomplete or an overcomplete dictionary.

The problem in Equation 5.1 is convex with respect to when is fixed, and vice-versa, and so it can be solved by alternating between minimizing with respect to with fixed , and minimizing with respect to with fixed (Lee et al., 2007) (see (Mairal et al., 2010) for an online learning approach). The latter problem is typically solved with a stochastic projected gradient descent (Aharon et al., 2006). We focus on a variation of the minimization with respect to with fixed , more concretely, we additionally require the rows of to have norm bounded below , the elements of be non-negative, and . A natural way to impose this is to solve the problem over the Birkhoff polytope. Given a set of vectors and , such that for all , we aim to solve the problem where is the Birkhoff polytope and has the form:

The gradient of amounts to computing:

The hessian of is given by the block diagonal matrix with where has the form . As the eigenvalues of a block-diagonal matrix are the eigenvalues of the blocks that form the diagonal, and is positive definite, the function is -strongly convex and -smooth. The complexity of the gradient computation scales as . Solving a linear program over the Birkhoff polytope using the Hungarian method has a complexity of . Thus it is more expensive to compute a gradient than it is to solve an LP over if is large. We generate synthetic data by creating a matrix with and entries sampled from a standard normal distribution, and vectors , with entries sampled from a standard normal distribution, in order to form . The set of vectors is generated by computing for all .

In order to use the subproblem accuracy described in Theorems 3.6 and 3.8 (with forcing parameter ), the value of is first computed to high accuracy, and used in the experiments. Figure 1 shows the performance of Algorithm 1 when the DICG algorithm (Garber and Meshi, 2016) is used, and the matrix exact hessian is used, i.e., (denoted by SOCG DICG), and also when an L-BFGS and BFGS method is used to generate (denoted by SOCG DICG (L-BFGS) and SOCG DICG (BFGS) respectively). These algorithms are compared to the vanilla Conditional Gradients algorithm (denoted by FW), the Away-Step and Pairwise-Step Conditional Gradients algorithms (AFW and PFW), the Lazy Away-Step Conditional Gradients algorithm (Braun et al., 2017) (AFW (L)), the aforementioned DICG algorithm, the Stochastic Variance-Reduced Conditional Gradients algorithm (Hazan and Luo, 2016) and the Conditional Gradient Sliding algorithm (Lan and Zhou, 2016). All algorithms use exact line search bounded between and .

The results in Figure 1, show that the SOCG algorithm (Algorithm 1) significantly outperforms other projection-free algorithms in this setting both in per-iteration and per-unit time progress. This stems from the fact that the calls to the first-order oracle to are much cheaper than the first-order oracle calls to in this example. With respect to SVRFW, we highlight that the cost of computing a stochastic gradient is the same as computing the non-stochastic gradient of , and we expect more progress with the latter.

### 5.2 Inverse covariance estimation over spectrahedron

In many applications the relationships between variables can be modeled with the use of undirected graphical models, such is the case for example in gene expression problems, where the goal is to find out which groups of genes are responsible for producing a certain outcome, given a gene dataset. When the underlying distribution of these variables is Gaussian, the problem of determining the relationship between variables boils down to finding patterns of zeros in the inverse covariance matrix of the distribution. A common approach to solving this problem relies on finding a -regularized maximum likelihood estimator of , so as to encourage sparsity, over the positive definite cone (Banerjee et al., 2008; Friedman et al., 2008), this is often called the Graphical Lasso.

Several optimization algorithms have been used to tackle this problem, such as interior point methods (Yuan and Lin, 2007), block coordinate descent or accelerated first-order algorithms (Banerjee et al., 2008), coordinate descent algorithms (Friedman et al., 2008) and even projected limited-memory quasi-Newton algorithms (Schmidt et al., 2009). We solve the same problem over the space of bounded trace positive semidefinite matrices, that is:

Where is the empirical covariance matrix of a set of datapoints drawn from a Gaussian distribution in , is a regularization parameter and . Note that the problem is convex and non-smooth. Evaluating has complexity if we compute the determinant with a LU decomposition, and evaluating the gradient is equal to , where represents Hadamard division, has complexity , dominated by the matrix inversion. Solving the linear program , where denotes Hadamard product, amounts to computing the largest eigenvector of , with a complexity of . The matrix is generated by computing a random orthonormal basis in and computing , where is uniformly distributed between and for .

When the primal gap of the problem is not known, one can substitute the subproblem accuracy rule in Theorem 3.6 and 3.8 by one that will still ensure that the local superlinear convergence is achieved. If we pick with (and we check afterwards that the directions obtained are descent directions, otherwise increase the accuracy of the subproblem solution) we can ensure this convergence by Lemma C.6 as . The performance of this strategy can be seen in Figure 2 for , and , when the Lazy Away-Step Conditional Gradients algorithm is used in the SOCG algorithm (Algorithm 1) with the L-BFGS strategy, denoted by SOCG AFW (L) (L-BFGS). All algorithms use a golden-section bounded line search between 0 and 1 in this comparison.

### 5.3 Structured Logistic Regression over unit ball

Given a binary classification task with labels and samples with and for all , we wish to solve:

where is the unit ball centered at the origin. Although projecting into any ball has complexity , and so projections are cheap, this feasible region is often used to compare the performance of projection-free algorithms between each other (see Lacoste-Julien and Jaggi (2015); Rao et al. (2015); Braun et al. (2019)). Solving a linear program program over the ball also has complexity . This experiment was also considered in Ghanbari and Scheinberg (2018) and Scheinberg and Tang (2016) to compare the performance of several Proximal Quasi-Newton methods in the context of minimization with a projection oracle.

The labels and samples used are taken from the training set of the gissette (Guyon et al., 2007) dataset, where and . Figure 3 shows the performance of Algorithm 1 with the Lazy Away-Step Conditional Gradient algorithm (Braun et al., 2019) and the L-BFGS strategy, denoted by SOCG AFW (L) (L-BFGS). All algorithms use a golden-section bounded line search between and in this comparison, and we use and for the SOCG algorithm.

The results in Figure 2 and 3 show that this simple modification to the subproblem accuracy strategy results in good performance both in per-iteration and per-unit-time progress.

## Acknowledgments

Research reported in this paper was partially supported by NSF CAREER Award CMMI-1452463. We would like to thank Gábor Braun for the helpful discussions.

## Appendix A Auxiliary Results

###### Lemma A.1.

Given a positive definite matrix , an satisfies:

(A.1) |

if and only if .

###### Proof.

() Using the first order optimality conditions for the scaled projection, :

(A.2) |

Which hold true if and only if , as Equation A.1 represents the first-order optimality conditions for Problem 1.1, of which is the unique optimal solution.

() Assume that . As is positive definite the objective function in the optimization problem in Equation A.1 is strongly convex, therefore the solution to the scaled projection problem is unique, and must verify the first order optimality conditions:

(A.3) |

Particularizing Equation A.3 for we see that the optimality conditions for the scaled projection problems are verified as is the solution to Problem 1.1. As the solution is unique we have that . ∎

###### Lemma A.2.

Assume that is positive definite and . Then the directions given by , where are descent directions at point , i.e., they satisfy .

## Appendix B Global Convergence

### b.1 Exact projections: for

We state some auxiliary results that will be used in the convergence proof of Algorithm 1 when for .

###### Lemma B.1.

###### Theorem B.2.

(Karimireddy et al., 2018b) Given an -smooth and -strongly convex function and a convex set , if Assumption 1 is satisfied, then the step size sequence guarantees a primal gap contraction at every iteration such that:

###### Proof.

Consider the iterates , the iterate can be written as:

(B.2) |

Using the -smoothness and the -strong convexity of the function we can write:

(B.3) | ||||

(B.4) | ||||

(B.5) | ||||

(B.6) | ||||

(B.7) | ||||

(B.8) | ||||

(B.9) | ||||

(B.10) | ||||

(B.11) |

Where the second and fifth inequality come from Assumption 1, the third inequality follows from the fact that , the fourth inequality from Lemma B.1, the sixth from the fact that we particularize for , the seventh from the fact that , and the last inequality from the -strong convexity and -smoothness of the function . Reordering the previous expression leads to:

∎

### b.2 Inexact projections: for

###### Theorem B.3.

Given an -smooth and -strongly convex function and a convex set , if Assumption 1 is satisfied, the step size sequence satisfies and the subproblems are solved to an additive accuracy such that:

where is a forcing term, then the following bound on the primal gap of the iterates of Algorithm 1 holds:

###### Proof.

Using the -smoothness and -strong convexity of and noting that we have that:

(B.12) | ||||

(B.13) | ||||

(B.14) | ||||

(B.15) | ||||

(B.16) | ||||

(B.17) | ||||

(B.18) |

Where the second inequality comes from the definition of , the third from Assumption 1, the fourth from the fact that and the fifth from Lemma B.1. If we denote by the iterate generated if the problem had been solved to optimality, where (see Equation B.2), we can see that . Plugging this into Equation B.18 we have:

(B.19) | ||||

(B.20) |

Where the last equality follows from the definition of . We can bound the second term in the previous expression using the arguments in the proof from Theorem 3.2 (note that this is the same term as the one shown in Equation B.6), which leads to: