Fast minimization of structured convex quartics

# Fast minimization of structured convex quartics

Brian Bullins
Princeton University
###### Abstract

We propose faster methods for unconstrained optimization of structured convex quartics, which are convex functions of the form

 f(x)=c⊤x+x⊤Gx+T[x,x,x]+124∥Ax∥44

for , , , and such that . In particular, we show how to achieve an -optimal minimizer for such functions with only calls to a gradient oracle and linear system solver, where is a problem-dependent parameter. Our work extends recent ideas on efficient tensor methods and higher-order acceleration techniques to develop a descent method for optimizing the relevant quartic functions. As a natural consequence of our method, we achieve an overall cost of calls to a gradient oracle and (sparse) linear system solver for the problem of -regression when , providing additional insight into what may be achieved for general -regression. Our results show the benefit of combining efficient higher-order methods with recent acceleration techniques for improving convergence rates in fundamental convex optimization problems.

## 1 Introduction

In this paper, we are interested in the unconstrained optimization problem

 minx∈Rdf(x), (1)

where is a convex function of the form

 f(x)=c⊤x+x⊤Gx+T[x,x,x]+124∥Ax∥44 (2)

for some , , , and such that and are the rows of . We will refer to functions of this form as structured convex quartics, as we are given an explicit decomposition of the fourth-order term, i.e.,

 ∇4f(x)=n∑i=1a⊗4i,x∈Rd.

While fast minimization of convex quadratic functions has been an area of significant research efforts (Cohen et al., 2015; Clarkson and Woodruff, 2017; Agarwal et al., 2017b, c), the structured convex quartic case has been less explored.

In this work, we present a method, called , whose total cost to find an -optimal minimizer is established in the following theorem.

###### Theorem 1.1.

Let be a convex function of the form (2). Then, under appropriate initialization, finds a point such that

 f(xN)−f(x∗)≤ε

with total computational cost , where is the time to calculate the gradient of , LSS is the time to solve a (sparse) linear system, and is a problem-dependent parameter.

In the case where , being the matrix multiplication constant, and for when the linear system is sufficiently sparse, our method improves upon (up to logarithmic factors) the previous best rate of (where is the radius of the box containing the relevant convex set), which can be achieved by using a fast cutting plane method (Lee et al., 2015).

We believe that, in addition to improving the complexity for a certain class of convex optimization problems, our approach illustrates the possibility of using an efficient local search-type method for some more difficult convex optimization tasks, such as -regression. This is in contrast to homotopy-based approaches (such as interior-point or path-following methods) (Nesterov and Nemirovskii, 1994; Bubeck et al., 2018a), cutting plane methods (Lee et al., 2015), and the ellipsoid method (Khachiyan, 1980).

### 1.1 Related work

In the general case, it has been shown to be NP-hard to find the global minimizer of a quartic polynomial (Murty and Kabadi, 1987; Parrilo and Sturmfels, 2003), or even to decide if the quartic polynomial is convex (Ahmadi et al., 2013). However, in this paper we are able to bypass these hardness results by guaranteeing the convexity of .

In terms of optimization for higher-order smooth convex functions, for functions whose Hessian is -Lipschitz, Monteiro and Svaiter (2013) achieve an error of after calls to a second-order Taylor expansion minimization oracle. Lower bounds have been established for the oracle complexity of higher-order smooth functions, (Arjevani et al., 2018; Agarwal and Hazan, 2018) which match the rate of Monteiro and Svaiter (2013) for , and recent progress has been made toward tightening these bounds.

Some recent work from Gasnikov et al. (2018), only available in Russian, establishes near-optimal rates for higher-order smooth optimization, though to the best of our understanding, it appears that the paper does not provide an explicit guarantee for the line search procedure. More recently, two independent works (Jiang et al., 2018; Bubeck et al., 2018b), published on the arxiv over the past few days, establish near-optimal rates for optimization of functions with higher-order smoothness, under an oracle model, along with an analysis of the binary search procedure. In this paper, while we consider only the case for , we go beyond the oracle model to establish an end-to-end complexity based on efficient approximations of tensor methods (Nesterov, 2018a). Furthermore, while our paper also relies on a careful handling of the binary search procedure, our approach requires the more general setting of higher-order smoothness with respect to matrix-induced norms, which does not appear to follow immediately from Jiang et al. (2018); Bubeck et al. (2018b).

## 2 Setup

Let be a symmetric positive-definite matrix, i.e., . We let for a matrix , and we denote the minimizer as . For any vector , we define its matrix-induced norm (w.r.t. ) as . Throughout the paper, we will let . We say a differentiable function is -uniformly convex (of degree ) with respect to if, for all ,

 f(y)≥f(x)+⟨∇f(x),y−x⟩+μpp∥y−x∥pB.

Note that for and , this definition captures the standard notion of strong convexity. As we shall see, since our aim is to minimize structured quartic functions, we will be concerned with this definition for and .

A related notion is that of (higher-order) smoothness. Namely, we say a -times differentiable function is smooth (of degree ) w.r.t. if the -th differential is Lipschitz continuous, i.e., for all ,

 ∥∇pf(y)−∇pf(x)∥∗B≤Lp∥y−x∥B,

where we define

 ∥∇pf(y)−∇pf(x)∥∗B\lx@stackreldef=maxh:∥h∥B≤1∣∣∇pf(y)[h]p−∇pf(x)[h]p∣∣ .

Again, since we our concerned with quartic functions, we will later show how is smooth with respect to the appropriate norm.

For that are smooth w.r.t. , we also have that, for all ,

 ∥∇f(y)−∇Φx,B(y)∥B−1≤L36∥y−x∥3B, (3)
 ∥∇2f(y)−∇2Φx,B(y)∥∗B≤L32∥y−x∥2B. (4)

It will eventually become necessary to handle the set of all points that might be reached by our method, starting from an initial point . To that end, we consider the following objects, beginning with the set

 K\lx@stackreldef={x:∥x−x0∥2B≤4∥x0−x∗∥2B}. (5)

Given this set, we now consider the maximum function value attained over , i.e., Finally, we let

 P\lx@stackreldef=maxx,y∈L∥x−y∥2B, (6)

where . We may also define . We note that, since is smooth, is a problem-dependent parameter, i.e., it depends on , , , and . As we will later show, the dependence on in the final convergence rate will only appear as part of logarithmic factors.

### 2.1 Properties of convex quartic functions

Throughout the paper, following the conventions of Nesterov (2018a), we will let

 Φx,p(y)\lx@stackreldef=f(x)+p∑i=11i!∇pf(x)[y−x]i, p≥1 (7)

denote the -th order Taylor approximation of , centered at . Furthermore, for that is smooth, we define a model function

 Ωx,p,B(y)\lx@stackreldef=Φx,p(y)+2pLp(p+1)!∥y−x∥p+1B. (8)

As we are only concerned with functions that are smooth, we will drop the subscript to define and

 Ωx,B(y)\lx@stackreldef=Ωx,3,B(y)=Φx(y)+L34∥y−x∥4B. (9)

Note that is smooth (of degree 3) w.r.t . The following theorem illustrates some useful properties of the model .

###### Theorem 2.1 (Nesterov (2018a), Theorem 1, for M=2l3).

Suppose is convex, 3-times differentiable, and smooth (of degree 3). Then, for any , we have

 0⪯∇2f(y)⪯∇2Φx(y)+L32∥y−x∥2BB.

Moreover, for all ,

 f(y)≤Ωx,B(y). (10)

With this representation of the model function in hand, we let

 TB(x)\lx@stackreldef=argminy∈RdΩx,B(y) (11)

denote a minimizer of the fourth-order model, centered at . The following lemma concerning , which will later prove useful, establishes a relaxed version of eq. (2.13) from Nesterov (2018a).

###### Lemma 2.2.

Let , and let be as in (11). Then, for all ,

 ⟨∇f(x),y−x⟩≥12L3^r2B(x,y)∥∇f(x)∥2B−1+3L38^r4B(x,y)−2Z(x,y)W(x,y)∥x−TB(y)∥B2L3^r2B(x,y), (12)

where

 Z(x,y)\lx@stackreldef=∥∇f(x)+L3^r2B(x,y)B(x−y)∥B−1,
 W(x,y)\lx@stackreldef=(∥B−1/2∥2∥H(x,y)∥∥B−1∥+L3∥x−TB(y)∥2B),

and

 H(x,y)\lx@stackreldef=∇2Ωy,B(TB(y))+12∇3Ωy,B(TB(y))[x−TB(y)].

In order to get a handle on the regularity properties of , we establish its smoothness and uniform convexity parameters w.r.t. .

###### Lemma 2.3 (L3 smoothness).

Suppose is of the form (2). Then, for all ,

 ∥∇3f(y)−∇3f(x)∥∗A⊤A≤∥y−x∥A⊤A. (13)
###### Lemma 2.4 (μ4 uniform convexity).

Suppose is of the form (2). Then, for all ,

 f(y)≥f(x)+⟨∇f(x),y−x⟩+n72∥y−x∥4A⊤A. (14)

We may also observe that is uniformly convex w.r.t. .

###### Lemma 2.5.

For all ,

 Ωx,B(z)≥Ωx,B(y)+⟨∇Ωx,B(y),z−y⟩+L312∥z−y∥4B. (15)

## 3 Minimizing structured convex quartics

In order to show an overall convergence rate for minimizing structured convex quartics, we shall see that the desired algorithm would be to find an exact minimizer of , for some at each iteration of the main algorithm. Thus, one of our main challenges will be to show that an approximate minimizer of is sufficiently accurate for the rest of the algorithm. To that end, we begin by considering the auxiliary minimization problem, for which our method converges at a linear rate to an -optimal minimizer. With this approximate minimizer in hand, we find that, when taking small enough, it provides a sufficiently accurate solution to be used as part of a binary search procedure, called . This approach is needed for finding an appropriate value which meets a certain approximation criterion.

Finally, once we have found an valid choice of and its corresponding , we show how they can be used as part of our main method, called , to lead to a final solution such that in iterations of . Furthermore, each of these iterations requires some polylogarithmic factors incurred by and .

### 3.1 Approximate auxiliary minimization

To begin, we consider the auxiliary minimization problem , where

 Γx,B(h)\lx@stackreldef=⟨∇f(x),h⟩+12h⊤∇2f(x)h+16∇3f(x)[h]3+L34∥h∥4B.

Note that is equivalent to , up to a change of variables. Our aim is to establish a minimization procedure which returns an -optimal solution in iterations, where is a problem-dependent parameter. Furthermore, each iteration is dominated by calls to a (sparse) linear system solver. This subroutine, which we call , is described in Section 5 of Nesterov (2018a) and is necessary for returning an approximate minimizer of . The approach involves showing that the auxiliary function is relatively smooth and convex (Lu et al., 2018), and further that each iteration of the method for minimizing such a function reduces to a minimization problem of the form

 −minλ>0w(λ), (16)

where

 w(λ)\lx@stackreldef=λ22+12⟨(√2λB+∇2f(x))−1ct,ct⟩

and

 ct\lx@stackreldef=∇Γx,B(ht)=∇f(x)+∇2f(x)ht+∇3f(x)[ht]2+L3∥ht∥2BBht.

As noted by Nesterov (2018a), this minimization problem is both one-dimensional and strongly convex, and so we may achieve global linear convergence. Taken together with the relative smoothness and convexity of , we have the following theorem.

###### Theorem 3.1 (Nesterov (2018a), eq.(5.9) (τ=√2). See also: Lu et al. (2018), Theorem 3.1).

For all , , generated by (Algorithm 1), we have that

 Γyk,B(ht)−Γyk,B(h∗)≤α(√2+12)t−1,

where and .

###### Corollary 3.2.

Let be the output from , for and , where . Then

 Ωyk,B(xk+1)−Ωyk,B(TB(yk))≤~εaam,

where each iteration requires time proportional to evaluating in order to compute , as well as calls to a (sparse) linear system solver.

###### Proof.

We first note that , and so . As observed by Nesterov (2018a) (see also: Appendix A in Agarwal et al. (2017a)), can be calculated in time proportional to the cost of evaluating , which takes time for of the form (2). In addition, Nesterov (2018a) notes that (16) can be found by any reasonable linearly convergent procedure, and so given access to the gradient of , this problem can be optimized (to sufficiently small error) in calls to a gradient oracle. Since

 ddλw(λ)=λ−√22c⊤t(√2λB+∇2f(x))−1B(√2λB+∇2f(x))−1ct,

calculating the gradient requires time.

Finally, since , by our choice of , it follows from Theorem 3.1 that

 Ωyk,B(xk+1)−Ωyk,B(TB(yk))≤~εaam.

As we shall see, it will become necessary to handle the approximation error from , and so we provide the following lemmas to that end.

###### Lemma 3.3.

Let , let be as output by , and let be as in (11). Then,

###### Proof.

By Lemma 2.5, we know that

 Ωyk,B(xk+1)−Ωyk,B(TB(yk)) ≥⟨∇Ωyk,B(TB(yk)),xk+1−TB(yk)⟩+L312∥xk+1−TB(yk)∥4B =L312∥xk+1−TB(yk)∥4B,

and so it follows from Corollary 3.2 that

###### Lemma 3.4.

Let . Then,

 ⟨∇f(xk+1),yk−xk+1⟩≥12L3^r2B(xk+1,yk)∥∇f(xk+1)∥2B−1+3L38^r4B(xk+1,yk)−3Z(xk+1,yk)W(xk+1,yk)~ε1/4aamL5/43^r2B(xk+1,yk) .
###### Proof.

The result follows from Lemmas 2.2 and 3.3. ∎

###### Lemma 3.5.

Let be the output from for . In addition, let . Then,

 ∣∣^r(xk+1,yk)2−r(yk)2∣∣≤6(~εaamL3)1/4P1/2+(12~εaamL3)1/2.
###### Proof.

Let . We have that

 ∣∣^r(xk+1,yk)2−r(yk)2∣∣ =∣∣∥TB(yk)−yk∥2B−∥xk+1−yk∥2B∣∣ =∣∣∥TB(yk)−yk∥2B−∥TB(yk)+β−yk∥2B∣∣ =∣∣∥TB(yk)−yk∥2B+2⟨β,B(TB(yk)−yk)⟩+∥β∥2B−∥TB(yk)−yk∥2B∣∣ ≤2∥β∥B∥TB(yk)−yk∥B+∥β∥2B.

Now, by Lemma 3.3, we know that , and so it follows from the definition of that

 ∣∣^r(xk+1,yk)2−r(yk)2∣∣≤6(~εaamL3)1/4P1/2+(12~εaamL3)1/2.

### 3.2 Search procedure for finding ρk

In this section, we establish the correctness of , our subroutine for finding an appropriate choice of , given as inputs. One of the key algorithmic components for achieving fast higher-order acceleration, as observed by Monteiro and Svaiter (2013) and Nesterov (2018b), is to determine such that , where we define

 ζk(ρ)\lx@stackreldef=∥TB(yk(ρ))−yk(ρ)∥2B, (17)
 yk(ρ)\lx@stackreldef=(1−τk(ρ))xk+τk(ρ)vk, (18)

and

 τk(ρ)\lx@stackreldef=21+√1+4L3Akρ . (19)

We will also need to define an approximate version

 ^ζk(ρ)\lx@stackreldef=∥xk+1(ρ)−yk(ρ)∥2B, (20)

where we let . We may observe that is continuous in , and furthermore that there exists some such that , since if , then , and if , then . Thus, we may reduce it to a binary search problem, under an appropriate initialization. For now, we assume that at each iteration , is given initial bounds and such that , thus ensuring it is a valid binary search procedure. We will later show how can provide with such guarantees.

An important part of managing this process is to limit how quickly can grow, as we will need to ensure a closeness in function value once our candidate bounds and are sufficiently close. The following theorem gives us precisely what we need, namely a differential inequality w.r.t. .

###### Theorem 3.6.

Let be as defined in (17), for some . Then we have that, for all ,

 ∣∣ζ′k(ρ)∣∣≤Rζk(ρ)1/2,

where is as defined in (33).

###### Theorem 3.7.

Given , as inputs, and chosen sufficiently small, the algorithm outputs and such that

 (1−~εrs)^ζk(ρk)≤ρk≤(1+~εrs)^ζk(ρk) (21)

where is as defined in (20).