# Gradient Method With Inexact Oracle for Composite Non-Convex Optimization

## Abstract

In this paper, we develop a new first-order method for composite non-convex minimization problems with simple constraints and inexact oracle. The objective function is given as a sum of a "hard", possibly non-convex part and a "simple" convex part. Informally speaking, oracle inexactness means that, for the "hard" part, at any point we can approximately calculate the value of the function and construct a quadratic function which approximately bounds this function from above. We give several examples of such inexactness: smooth non-convex functions with inexact Hölder-continuous gradient, and functions given by an auxiliary uniformly concave maximization problem, which can be solved only approximately. For the introduced class of problems, we propose a gradient-type method which allows the use of different proximal setups to adapt to the geometry of the feasible set, adaptively chooses the controlled oracle error, and allows for an inexact proximal mapping. We provide a convergence rate for our method in terms of the norm of the generalized gradient mapping and show that, in the case of an inexact Hölder-continuous gradient, our method is universal with respect to the Hölder parameters of the problem. Finally, in a particular case, we show that a small value of the norm of the generalized gradient mapping at a point means that a necessary condition of local minimum approximately holds at that point.

Keywords: nonconvex optimization, composite optimization, inexact oracle, Hölder-continuous gradient, complexity, gradient descent methods, first-order methods, parameter free methods, universal gradient methods.

AMS Classification: 90C30, 90C06, 90C26.

## Introduction

In this paper, we introduce a new first-order method for non-convex composite optimization problems with inexact oracle. Namely, our problem of interest is as follows:

 $\min_{x\in X\subseteq E}\{\psi(x):=f(x)+h(x)\},$ (1)

where $X$ is a closed convex set and $h(x)$ is a simple convex function, e.g. $\|x\|_1$. We assume that $f(x)$ is a general function endowed with an inexact first-order oracle, which is defined below (see Definition 1). Informally speaking, at any point we can approximately calculate the value of the function and construct a quadratic function which approximately bounds our $f$ from above. An example of a problem with this kind of inexactness is given in Bogolubsky et al. (2016), where the authors study a learning problem for a parametric PageRank model.

First-order methods have been widely developed since the earliest years of optimization theory, see, e.g., Polyak (1963). A recent renaissance in their development started more than ten years ago and was mostly motivated by fast-growing problem sizes in applications such as Machine Learning, Data Analysis, and Telecommunications. For many years, researchers mostly considered convex optimization problems, since they have good structure and allow one to estimate the rate of convergence of proposed algorithms. Recently, non-convex problems have started to attract fast-growing attention, as they often appear in Machine Learning, especially in Deep Learning. Thus, the high standards of research on algorithms for convex optimization have started to influence non-convex optimization. Namely, it has become very important for newly developed methods to have a rate of convergence with respect to some criterion. Usually, this criterion is the norm of the gradient mapping, which is a generalization of the gradient for constrained problems, see, e.g., Nesterov (2004).

Already in Polyak (1987), the author analyzed how different types of inexactness in gradient values influence the gradient method for unconstrained smooth convex problems. At the moment, the theory of convex optimization algorithms with inexact oracle is well developed in a series of papers: d'Aspremont (2008); Devolder et al. (2014); Dvurechensky and Gasnikov (2016). In d'Aspremont (2008), it was proposed to calculate the gradient of the objective function inexactly and to extend the Fast Gradient Method of Nesterov (2005) to use this inexact oracle information. In Devolder et al. (2014), a general concept of inexact oracle is introduced for convex problems, and Primal, Dual and Fast gradient methods are analyzed. In Dvurechensky and Gasnikov (2016), the authors develop a Stochastic Intermediate Gradient Method for problems with stochastic inexact oracle, which provides good flexibility for solving convex and strongly convex problems with both deterministic and stochastic inexactness.

The theory for non-convex smooth, non-smooth and stochastic problems is well developed in Ghadimi and Lan (2016); Ghadimi et al. (2016). In Ghadimi and Lan (2016), problems of the form (1) with $X=E$, where $f$ is a smooth non-convex function, are considered in the case when the gradient of $f$ is available exactly, as well as when it is available through stochastic approximation. Later, in Ghadimi et al. (2016), the authors generalized these methods to constrained problems of the form (1) in both deterministic and stochastic settings.

Nevertheless, it seems to us that gradient methods for non-convex optimization problems with deterministic inexact oracle lack sufficient development. The goal of this paper is to fill this gap.

It turns out that smooth minimization with inexact oracle is closely connected with minimization of functions with Hölder-continuous gradient. We say that a function $f$ has Hölder-continuous gradient on $X$ iff there exist $\nu\in[0,1]$ and $L_\nu\ge 0$ s.t.

 $\|\nabla f(x)-\nabla f(y)\|_{E,*}\le L_\nu\|x-y\|_E^\nu,\quad x,y\in X.$

In Devolder et al. (2014), it was shown that a convex problem with Hölder-continuous subgradient can be considered as a smooth problem with deterministic inexact oracle. Later, universal gradient methods for convex problems with Hölder-continuous subgradient were proposed in Nesterov (2015). These algorithms do not require knowledge of the Hölder parameter $\nu$ or the Hölder constant $L_\nu$. Thus, they are universal with respect to these parameters. Ghadimi et al. (2015) proposed methods for non-convex problems of the form (1), where $f$ has Hölder-continuous gradient. These methods rely on the Euclidean norm and are good when the Euclidean projection onto the set $X$ is simple.

Our contribution in this paper is as follows.

1. We generalize to the non-convex case the definition of inexact oracle in Devolder et al. (2014) and provide several examples where such inexactness can arise. We consider two types of errors: controlled errors, which can be made as small as desired, and uncontrolled errors, which can only be estimated.

2. We introduce a new gradient method for problem (1) and prove a theorem (see Theorem 1) on its rate of convergence in terms of the norm of the generalized gradient mapping. Our method is adaptive to the controlled oracle error, is capable of working with an inexact proximal mapping, and offers a flexible choice of proximal setup based on the geometry of the set $X$.

3. We show that, in the case of problems with inexact Hölder-continuous gradient, our method is universal, that is, it does not require knowledge of the Hölder parameter $\nu$ and Hölder constant $L_\nu$ of the function $f$, but provides the best known convergence rate uniformly in the Hölder parameter $\nu$.

Thus, we provide a universal algorithm for non-convex Hölder-smooth composite optimization problems with deterministic inexact oracle.

The rest of the paper is organized as follows. In Section 1, we define the deterministic inexact oracle for non-convex problems and provide several examples. In Section 2, we describe our algorithm and prove the convergence theorem. We also provide two corollaries for the particular cases of smooth functions and Hölder-smooth functions; note that the latter case includes the former one. Finally, we explain how convergence of the norm of the generalized gradient mapping to zero leads to a good approximation of a point where a necessary optimality condition for Problem (1) holds. Note that our reasoning differs from what can be found in the literature.

Notation. Let $E$ be a finite-dimensional real vector space and $E^*$ be its dual. We denote the value of a linear function $g\in E^*$ at $x\in E$ by $\langle g,x\rangle$. Let $\|\cdot\|_E$ be some norm on $E$ and $\|\cdot\|_{E,*}$ be its dual, i.e. $\|g\|_{E,*}=\max_{\|x\|_E\le 1}\langle g,x\rangle$.

## 1 Inexact Oracle

In this section, we define the inexact oracle and describe several examples where it naturally arises.

###### Definition 1.

We say that a function $f$ is equipped with an inexact first-order oracle on a set $X$ if there exists $\delta_u\ge 0$ and, at any point $x\in X$, for any number $\delta_c>0$, there exists a constant $L(\delta_c)>0$ and one can calculate $\tilde f(x,\delta_c,\delta_u)$ and $\tilde g(x,\delta_c,\delta_u)\in E^*$ satisfying

 $|f(x)-\tilde f(x,\delta_c,\delta_u)|\le\delta_c+\delta_u,$ (2)
 $f(y)-\tilde f(x,\delta_c,\delta_u)-\langle\tilde g(x,\delta_c,\delta_u),y-x\rangle\le\frac{L(\delta_c)}{2}\|x-y\|_E^2+\delta_c+\delta_u,\quad\forall y\in X.$ (3)

In this definition, $\delta_c$ represents the error of the oracle which we can control and make as small as we like. In contrast, $\delta_u$ represents the error which we cannot control. The idea behind the definition is that, at any point, we can approximately calculate the value of the function and construct an upper quadratic bound.

Let us now consider several examples.

### 1.1 Smooth Function with Inexact Oracle Values

Let us assume that

1. Function $f$ is $L$-smooth on $X$, i.e. it is differentiable and, for all $x,y\in X$, $\|\nabla f(x)-\nabla f(y)\|_{E,*}\le L\|x-y\|_E$.

2. Set $X$ is bounded with $\max_{x,y\in X}\|x-y\|_E\le D$.

3. There exist $\bar\delta^1_u,\bar\delta^2_u\ge 0$ and, at any point $x\in X$, for any $\bar\delta^1_c,\bar\delta^2_c>0$, we can calculate approximations $\bar f(x)$ and $\bar g(x)$ s.t. $|f(x)-\bar f(x)|\le\bar\delta^1_c+\bar\delta^1_u$, $\|\nabla f(x)-\bar g(x)\|_{E,*}\le\bar\delta^2_c+\bar\delta^2_u$.

Then, using $L$-smoothness of $f$, we obtain, for any $x,y\in X$,

 $f(y)\le f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|x-y\|_E^2$ (4)
 $\le\bar f(x)+\bar\delta^1_c+\bar\delta^1_u+\langle\bar g(x),y-x\rangle+\langle\nabla f(x)-\bar g(x),y-x\rangle+\frac{L}{2}\|x-y\|_E^2$ (5)
 $\le\bar f(x)+\langle\bar g(x),y-x\rangle+\frac{L}{2}\|x-y\|_E^2+\bar\delta^1_c+\bar\delta^1_u+(\bar\delta^2_c+\bar\delta^2_u)D.$ (6)

Thus, $(\bar f(x),\bar g(x))$ is an inexact first-order oracle with $\delta_c=\bar\delta^1_c+\bar\delta^2_c D$, $\delta_u=\bar\delta^1_u+\bar\delta^2_u D$, and $L(\delta_c)\equiv L$.
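As a sanity check, the following short Python experiment (an illustration with assumed constants, not taken from the paper) builds a noisy oracle for the $L$-smooth function $f(x)=\frac12\|x\|_2^2$ and verifies the upper bound (6) on random pairs of points; the noise levels `delta1`, `delta2` and the box used as $X$ are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
L = 1.0                       # f below is L-smooth with L = 1
D = 2.0 * np.sqrt(n)          # Euclidean diameter of the box X = [-1, 1]^n
delta1, delta2 = 1e-3, 1e-3   # assumed value / gradient error levels

def f(x):
    return 0.5 * float(x @ x)

def noisy_oracle(x):
    # returns f_bar, g_bar with |f_bar - f(x)| <= delta1
    # and ||g_bar - grad f(x)||_2 <= delta2
    f_bar = f(x) + delta1 * rng.uniform(-1.0, 1.0)
    g_bar = x + delta2 * rng.uniform(-1.0, 1.0, size=n) / np.sqrt(n)
    return f_bar, g_bar

# check the resulting upper bound (6) with delta_c + delta_u = delta1 + delta2 * D
x = rng.uniform(-1.0, 1.0, size=n)
for _ in range(1000):
    y = rng.uniform(-1.0, 1.0, size=n)
    f_bar, g_bar = noisy_oracle(x)
    rhs = (f_bar + g_bar @ (y - x) + 0.5 * L * (y - x) @ (y - x)
           + delta1 + delta2 * D)
    assert f(y) <= rhs + 1e-12
print("upper bound (6) holds on all sampled pairs")
```

Here the bound never fails because the gradient error term is controlled by $\delta_2\|y-x\|_E\le\delta_2 D$, exactly as in the derivation (4)-(6).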

### 1.2 Smooth Function with Hölder-Continuous Gradient

Assume that $f$ is differentiable and its gradient is Hölder-continuous, i.e. for some $\nu\in[0,1]$ and $L_\nu\ge 0$,

 $\|\nabla f(x)-\nabla f(y)\|_{E,*}\le L_\nu\|x-y\|_E^\nu,\quad\forall x,y\in X.$ (7)

Then

 $f(y)\le f(x)+\langle\nabla f(x),y-x\rangle+\frac{L_\nu}{1+\nu}\|x-y\|_E^{1+\nu},\quad\forall x,y\in X.$ (8)

It can be shown, see Nesterov (2015), Lemma 2, that, for all $\delta>0$ and any $x\in X$,

 $f(y)-\big(f(x)+\langle\nabla f(x),y-x\rangle\big)\le\frac{L(\delta)}{2}\|x-y\|_E^2+\delta,\quad\forall y\in X,$ (9)

where

 $L(\delta)=\left(\frac{1-\nu}{1+\nu}\cdot\frac{2}{\delta}\right)^{\frac{1-\nu}{1+\nu}}L_\nu^{\frac{2}{1+\nu}}.$ (10)

Thus, $(f(x),\nabla f(x))$ is an inexact first-order oracle with $\delta_c=\delta$, $\delta_u=0$, and $L(\delta_c)$ given by (10).

Note that, if $f(x)$ and $\nabla f(x)$ can only be calculated inexactly, as in Subsection 1.1, their approximations will again be an inexact first-order oracle.
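The bound (9)-(10) can be checked numerically. The sketch below is an illustration; the test function and constants are assumptions, not from the paper. It uses $f(x)=\frac{2}{3}|x|^{3/2}$, whose derivative $\mathrm{sign}(x)\sqrt{|x|}$ is Hölder-continuous with $\nu=1/2$ and $L_\nu=\sqrt{2}$, and verifies (9) on a grid for several values of $\delta$.

```python
import math

nu, L_nu = 0.5, math.sqrt(2.0)

def L_of(delta):
    # L(delta) from (10)
    return ((1 - nu) / (1 + nu) * 2.0 / delta) ** ((1 - nu) / (1 + nu)) \
        * L_nu ** (2.0 / (1 + nu))

f = lambda x: (2.0 / 3.0) * abs(x) ** 1.5
df = lambda x: math.copysign(math.sqrt(abs(x)), x)

for delta in (1e-1, 1e-2, 1e-3):
    M = L_of(delta)
    for i in range(-50, 51):
        for j in range(-50, 51):
            x, y = i / 10.0, j / 10.0
            # left-hand side of (9)
            lhs = f(y) - f(x) - df(x) * (y - x)
            assert lhs <= 0.5 * M * (y - x) ** 2 + delta + 1e-9
print("inexact-oracle bound (9) verified on the grid")
```

Note how $L(\delta)$ grows as $\delta$ shrinks: the controlled error trades off against the effective "Lipschitz constant" of the quadratic upper bound.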

### 1.3 Function Given by Maximization Subproblem

Assume that function is defined by an auxiliary optimization problem

 $f(x)=\max_{u\in U\subseteq H}\{\Psi(x,u):=-G(u)+\langle Au,x\rangle\},$ (11)

where $A:H\to E^*$ is a linear operator and $G(u)$ is a continuously differentiable uniformly convex function of degree $\rho\ge 2$ with parameter $\sigma_\rho>0$. The latter means that

 $\langle\nabla G(u_1)-\nabla G(u_2),u_1-u_2\rangle\ge\sigma_\rho\|u_1-u_2\|_H^\rho,\quad\forall u_1,u_2\in U,$ (12)

where $\|\cdot\|_H$ is some norm on $H$. Note that $f$ is differentiable and $\nabla f(x)=Au^*(x)$, where $u^*(x)$ is the optimal solution in (11) for fixed $x$.

Extending the proof in Nesterov (2015), we can prove the following.

###### Lemma 1.

If $G$ is uniformly convex on $U$, then the gradient of $f$ is Hölder-continuous with

 $\nu=\frac{1}{\rho-1},\qquad L_\nu=\frac{\|A\|_{H\to E^*}^{\frac{\rho}{\rho-1}}}{\sigma_\rho^{\frac{1}{\rho-1}}},$ (13)

where $\|A\|_{H\to E^*}=\max_{\|u\|_H\le 1}\|Au\|_{E,*}$.

Proof From the optimality conditions in (11), we obtain

 $\langle A^Tx_1-\nabla G(u(x_1)),u(x_2)-u(x_1)\rangle\le 0,$ (14)
 $\langle A^Tx_2-\nabla G(u(x_2)),u(x_1)-u(x_2)\rangle\le 0.$ (15)

Adding these inequalities, we obtain, by definition of uniformly convex function,

 $\langle A^T(x_1-x_2),u(x_1)-u(x_2)\rangle\ge\langle\nabla G(u(x_1))-\nabla G(u(x_2)),u(x_1)-u(x_2)\rangle\overset{(12)}{\ge}\sigma_\rho\|u(x_1)-u(x_2)\|_H^\rho.$ (16)

On the other hand,

 $\|A(u(x_1)-u(x_2))\|_{E,*}^2\le\|A\|_{H\to E^*}^2\|u(x_1)-u(x_2)\|_H^2$ (17)
 $\le\|A\|_{H\to E^*}^2\left(\frac{1}{\sigma_\rho}\langle A^T(x_1-x_2),u(x_1)-u(x_2)\rangle\right)^{2/\rho}$ (18)
 $\le\frac{\|A\|_{H\to E^*}^2}{\sigma_\rho^{2/\rho}}\|A(u(x_1)-u(x_2))\|_{E,*}^{2/\rho}\|x_1-x_2\|_E^{2/\rho}.$ (19)

Thus,

 $\|A(u(x_1)-u(x_2))\|_{E,*}^{2-2/\rho}\le\frac{\|A\|_{H\to E^*}^2}{\sigma_\rho^{2/\rho}}\|x_1-x_2\|_E^{2/\rho},$ (20)

which proves the Lemma. ∎

Let us now consider the situation when the maximization problem in (11) can be solved only inexactly by some auxiliary numerical method. It is natural to assume that, for any $x\in X$ and any $\delta>0$, we can calculate a point $u_x\in U$ s.t.

 $0\le f(x)-\Psi(x,u_x)=\Psi(x,u^*(x))-\Psi(x,u_x)\le\delta.$ (21)

Since $\ln$ is a concave function, for any $t>0$ and $\tau>0$, we have

 $\ln\left(\frac{1}{\rho}t^\rho+\frac{\rho-1}{\rho}\tau^{\frac{\rho}{\rho-1}}\right)\ge\frac{1}{\rho}\ln(t^\rho)+\frac{\rho-1}{\rho}\ln\left(\tau^{\frac{\rho}{\rho-1}}\right)=\ln(t\tau).$ (22)
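Inequality (22) is Young's inequality $t\tau\le\frac{1}{\rho}t^\rho+\frac{\rho-1}{\rho}\tau^{\frac{\rho}{\rho-1}}$ written through the concavity of $\ln$; a quick numerical check (illustrative only, with arbitrary sample values):

```python
import itertools

# verify t * tau <= t^rho / rho + (rho - 1)/rho * tau^(rho/(rho-1))
# over a logarithmic grid of positive t, tau and several degrees rho
for rho in (2.0, 3.0, 5.0):
    for t, tau in itertools.product([10.0 ** k for k in range(-3, 4)], repeat=2):
        rhs = t ** rho / rho + (rho - 1) / rho * tau ** (rho / (rho - 1))
        assert t * tau <= rhs * (1 + 1e-12)
print("Young's inequality holds on all sampled (t, tau)")
```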

Using this inequality with

 $t=\sigma_\rho^{1/\rho}\|u^*(x)-u_x\|_H,\qquad\tau=\frac{\|A\|_{H\to E^*}}{\sigma_\rho^{1/\rho}}\|y-x\|_E,$ (23)

we obtain, for any $y\in X$,

 $\langle A(u^*(x)-u_x),y-x\rangle\le\|A\|_{H\to E^*}\|u^*(x)-u_x\|_H\|y-x\|_E$ (24)
 $\le\frac{\sigma_\rho}{\rho}\|u^*(x)-u_x\|_H^\rho+\frac{\|A\|_{H\to E^*}^{\frac{\rho}{\rho-1}}}{\frac{\rho}{\rho-1}\,\sigma_\rho^{\frac{1}{\rho-1}}}\|y-x\|_E^{\frac{\rho}{\rho-1}}$ (25)
 $=\frac{\sigma_\rho}{\rho}\|u^*(x)-u_x\|_H^\rho+\frac{L_\nu}{1+\nu}\|y-x\|_E^{1+\nu},$ (26)

where $\nu$ and $L_\nu$ are defined in (13). At the same time, since $\Psi(x,u)$ in (11) is uniformly concave in the second argument, we have

 $\frac{\sigma_\rho}{\rho}\|u^*(x)-u_x\|_H^\rho\le\Psi(x,u^*(x))-\Psi(x,u_x)\overset{(21)}{\le}\delta.$ (27)

Combining this inequality with the previous one, we obtain

 $\langle A(u^*(x)-u_x),y-x\rangle\le\frac{L_\nu}{1+\nu}\|y-x\|_E^{1+\nu}+\delta.$ (28)

Since $f$ has Hölder-continuous gradient with parameters (13), using (8), we obtain

 $f(y)\le f(x)+\langle\nabla f(x),y-x\rangle+\frac{L_\nu}{1+\nu}\|x-y\|_E^{1+\nu}$ (29)
 $\overset{(21)}{\le}\Psi(x,u_x)+\delta+\langle Au_x,y-x\rangle+\langle A(u^*(x)-u_x),y-x\rangle+\frac{L_\nu}{1+\nu}\|x-y\|_E^{1+\nu}$ (30)
 $\overset{(28)}{\le}\Psi(x,u_x)+\langle Au_x,y-x\rangle+\frac{2L_\nu}{1+\nu}\|x-y\|_E^{1+\nu}+2\delta$ (31)
 $\overset{(9)}{\le}\Psi(x,u_x)+\langle Au_x,y-x\rangle+\frac{2L(\delta)}{2}\|x-y\|_E^2+4\delta.$ (32)

Thus, we have obtained that $(\Psi(x,u_x),Au_x)$ is an inexact first-order oracle with $\delta_c=4\delta$, $\delta_u=0$, and $L(\delta_c)$ equal to $2L(\delta)$ with $L(\delta)$ given by (10) and $\nu$, $L_\nu$ from (13).

To construct our algorithm for problem (1), we introduce, as is usually done, a proximal setup, see Ben-Tal and Nemirovski (2015). We choose a prox-function $d(x)$, which is continuous and convex on $X$ and

1. admits a continuous in $x\in X^0$ selection of subgradients $d'(x)$, where $X^0\subseteq X$ is the set of all $x$ where $d'(x)$ exists;

2. is $1$-strongly convex on $X$ with respect to $\|\cdot\|_E$, i.e., for any $x\in X^0$, $y\in X$, $d(y)\ge d(x)+\langle d'(x),y-x\rangle+\frac12\|y-x\|_E^2$.

We also define the corresponding Bregman divergence $V[\bar x](x)=d(x)-d(\bar x)-\langle d'(\bar x),x-\bar x\rangle$, $x\in X$, $\bar x\in X^0$. Standard proximal setups, e.g. Euclidean, entropy, $\ell_1$-ball, simplex, nuclear norm, spectahedron, can be found in Ben-Tal and Nemirovski (2015). We will use the Bregman divergence in the so-called composite prox-mapping

 $\min_{x\in X}\left\{\langle g,x\rangle+\frac{1}{\gamma}V[\bar x](x)+h(x)\right\},$ (33)

where $\bar x\in X^0$, $g\in E^*$, $\gamma>0$ are given. We allow this problem to be solved inexactly in the following sense.

###### Definition 2.

Assume that we are given $\bar x\in X^0$, $g\in E^*$, $\gamma>0$, $\delta_{pu}\ge 0$. We call a point $\tilde x=\tilde x(\bar x,g,\gamma,\delta_{pc},\delta_{pu})\in X^0$ an inexact composite prox-mapping iff, for any $\delta_{pc}>0$, we can calculate $\tilde x$ and there exists $p\in\partial h(\tilde x)$ s.t. it holds that

 $\left\langle g+\frac{1}{\gamma}\left[d'(\tilde x)-d'(\bar x)\right]+p,\,u-\tilde x\right\rangle\ge-\delta_{pc}-\delta_{pu},\quad\forall u\in X.$ (34)

We write

 $\tilde x=\operatorname{argmin}^{\delta_{pc}+\delta_{pu}}_{x\in X}\left\{\langle g,x\rangle+\frac{1}{\gamma}V[\bar x](x)+h(x)\right\}$ (35)

and define

 $g_X(\bar x,g,\gamma,\delta_{pc},\delta_{pu}):=\frac{1}{\gamma}(\bar x-\tilde x).$ (36)

This is a generalization of the inexact composite prox-mapping in Ben-Tal and Nemirovski (2015). Note that if $\tilde x$ is an exact solution of (33), inequality (34) holds with $\delta_{pc}=\delta_{pu}=0$ due to the first-order optimality condition. Similarly to Definition 1, $\delta_{pc}$ represents an error which can be controlled and made as small as desired, while $\delta_{pu}$ represents an error which cannot be controlled.
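For a concrete instance (an assumed illustration, not from the paper): with the Euclidean setup $d(x)=\frac12\|x\|_2^2$ on $X=E=\mathbb{R}^n$ and $h(x)=\lambda\|x\|_1$, problem (33) is solved exactly by soft-thresholding, so (34) holds with $\delta_{pc}=\delta_{pu}=0$, and $g_X$ in (36) is available in closed form:

```python
import numpy as np

def composite_prox(x_bar, g, gamma, lam):
    # soft-thresholding solves
    # argmin_x <g,x> + (1/(2*gamma))||x - x_bar||_2^2 + lam*||x||_1
    z = x_bar - gamma * g
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

def gradient_mapping(x_bar, g, gamma, lam):
    # g_X(x_bar, g, gamma, 0, 0) from (36)
    return (x_bar - composite_prox(x_bar, g, gamma, lam)) / gamma

gamma, lam = 0.5, 0.2
x_bar = np.array([1.0, -0.2, 0.0])
g = np.array([0.3, -0.1, 0.05])
x_tilde = composite_prox(x_bar, g, gamma, lam)

# exactness check: p = -g - (x_tilde - x_bar)/gamma must be a subgradient of
# lam*||.||_1 at x_tilde, which is (34) with delta_pc = delta_pu = 0
p = -g - (x_tilde - x_bar) / gamma
assert np.all(np.abs(p) <= lam + 1e-12)
assert np.allclose(p[x_tilde != 0], lam * np.sign(x_tilde[x_tilde != 0]))
print(x_tilde, gradient_mapping(x_bar, g, gamma, lam))
```

An inexact inner solver (e.g. a few iterations of coordinate descent) would instead produce an $\tilde x$ for which (34) holds only up to $\delta_{pc}+\delta_{pu}$.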

Our main scheme is Algorithm 1.
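The listing of Algorithm 1 is not reproduced here. The following Python sketch reconstructs the scheme suggested by the proof of Theorem 1 under strong simplifying assumptions (exact oracle, $h$ handled by an exact Euclidean prox, $\delta_u=\delta_{pu}=0$); the backtracking rule and constants are our guesses based on the proof, not the paper's exact Algorithm 1.

```python
import numpy as np

def adaptive_composite_gradient(f, grad_f, prox_h, x0, L0=1.0, eps=1e-3, N=100):
    """Adaptive (backtracking) composite gradient sketch: at step k the trial
    constant M starts from max(L0, M_{k-1}/2) and doubles until an acceptance
    test analogous to (38) holds."""
    x, M_prev = x0.copy(), L0
    for k in range(N):
        M = max(L0, M_prev / 2.0)
        while True:
            # candidate step: composite prox-mapping (33) with gamma = 1/M
            w = prox_h(x - grad_f(x) / M, 1.0 / M)
            # acceptance test (exact oracle, so delta_u = 0)
            lhs = f(w)
            rhs = (f(x) + grad_f(x) @ (w - x)
                   + 0.5 * M * np.sum((w - x) ** 2) + eps / (10.0 * M))
            if lhs <= rhs:
                break
            M *= 2.0
        x, M_prev = w, M
    return x

# toy run: f(x) = 0.5||x||^2, h = 0 (prox is the identity)
f = lambda x: 0.5 * float(x @ x)
grad_f = lambda x: x
prox_h = lambda z, gamma: z
x_out = adaptive_composite_gradient(f, grad_f, prox_h, np.array([3.0, -4.0]))
assert np.linalg.norm(x_out) < 1e-2
```

For this quadratic, the very first trial constant $M=L_0=1$ is accepted and the iterate jumps to the minimizer; on harder problems the doubling loop is what Theorem 1 bounds via (42).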

We will need the following simple extension of Lemma 1 in Ghadimi et al. (2016) to perform the theoretical analysis of our algorithm.

###### Lemma 2.

Let $\tilde x=\tilde x(\bar x,g,\gamma,\delta_{pc},\delta_{pu})$ be an inexact composite prox-mapping and $g_X(\bar x,g,\gamma,\delta_{pc},\delta_{pu})$ be defined in (36). Then, for any $\bar x\in X^0$, $g\in E^*$ and $\gamma>0$, it holds that

 $\gamma\langle g,g_X(\bar x,g,\gamma,\delta_{pc},\delta_{pu})\rangle\ge\gamma\|g_X(\bar x,g,\gamma,\delta_{pc},\delta_{pu})\|_E^2+\big(h(\tilde x(\bar x,g,\gamma,\delta_{pc},\delta_{pu}))-h(\bar x)\big)-\delta_{pc}-\delta_{pu}.$ (39)

Proof Taking $u=\bar x$ in (34) and rearranging terms, we obtain, by convexity of $h$ and strong convexity of $d$,

 $\langle g,\bar x-\tilde x\rangle\ge\frac{1}{\gamma}\langle d'(\tilde x)-d'(\bar x),\tilde x-\bar x\rangle+\langle p,\tilde x-\bar x\rangle-\delta_{pc}-\delta_{pu}\ge\frac{1}{\gamma}\|\tilde x-\bar x\|_E^2+(h(\tilde x)-h(\bar x))-\delta_{pc}-\delta_{pu}.$ (40)

Applying the definition (36), we finish the proof. ∎

Now we state the main result of the paper.

###### Theorem 1.

Assume that $f$ is equipped with an inexact first-order oracle in the sense of Definition 1 and that, for any $\varepsilon>0$, there exists an integer $M(\varepsilon)>0$ s.t. $L\left(\frac{\varepsilon}{20M(\varepsilon)}\right)\le M(\varepsilon)$. Assume also that there exists a number $\psi^*>-\infty$ such that $\psi(x)\ge\psi^*$ for all $x\in X$. Then, after $N$ iterations of Algorithm 1, it holds that

 $\|M_K(x_K-x_{K+1})\|_E^2\le\left(\sum_{k=0}^{N-1}\frac{1}{2M_k}\right)^{-1}\left(\psi(x_0)-\psi^*+N(4\delta_u+\delta_{pu})\right)+\frac{\varepsilon}{2},$ (41)

where $K\in\{0,\dots,N-1\}$ is the iteration with the smallest value of $\|M_k(x_k-x_{k+1})\|_E$.

Moreover, the total number of checks of Inequality (38) is not more than

 $2N-1+\log_2\frac{M_{N-1}}{L_0}.$ (42)

Proof First of all, let us show that the procedure of search for a point $w_k$ satisfying (37), (38) is finite. Let $i$ be the current number of performed checks of inequality (38) on step $k$; then $M_k$ is doubled with each new check. At the same time, by Definition 1, inequality (3) holds with constant $L(\delta_{c,k})$, where $\delta_{c,k}=\frac{\varepsilon}{20M_k}$. Hence, by the theorem's assumptions, after a finite number of doublings, $M_k\ge L(\delta_{c,k})$. At the same time, we have

 $\tilde f(w_k,\delta_{c,k},\delta_u)-\frac{\varepsilon}{20M_k}-\delta_u\overset{(2)}{\le}f(w_k)$ (43)
 $\overset{(3)}{\le}\tilde f(x_k,\delta_{c,k},\delta_u)+\langle\tilde g(x_k,\delta_{c,k},\delta_u),w_k-x_k\rangle+\frac{L(\delta_{c,k})}{2}\|w_k-x_k\|_E^2+\frac{\varepsilon}{20M_k}+\delta_u,$ (44)

which leads to (38) when $M_k\ge L(\delta_{c,k})$.

Let us now obtain the rate of convergence. We denote, for simplicity, $\tilde f_k=\tilde f(x_k,\delta_{c,k},\delta_u)$, $\tilde g_k=\tilde g(x_k,\delta_{c,k},\delta_u)$, $\tilde g_{X,k}=g_X\left(x_k,\tilde g_k,\frac{1}{M_k},\delta_{pc,k},\delta_{pu}\right)$. Note that

 $\tilde g_{X,k}\overset{(36)}{=}M_k(x_k-x_{k+1}).$ (45)

Using the definition of the inexact oracle, we obtain, for any $k=0,\dots,N-1$,

 $f(x_{k+1})-\frac{\varepsilon}{20M_k}-\delta_u=f(w_k)-\frac{\varepsilon}{20M_k}-\delta_u$ (46)
 $\overset{(2)}{\le}\tilde f(w_k,\delta_{c,k},\delta_u)$ (47)
 $\overset{(38)}{\le}\tilde f_k+\langle\tilde g_k,x_{k+1}-x_k\rangle+\frac{M_k}{2}\|x_{k+1}-x_k\|_E^2+\frac{\varepsilon}{10M_k}+2\delta_u$ (48)
 $\overset{(45)}{=}\tilde f_k-\frac{1}{M_k}\langle\tilde g_k,\tilde g_{X,k}\rangle+\frac{1}{2M_k}\|\tilde g_{X,k}\|_E^2+\frac{\varepsilon}{10M_k}+2\delta_u$ (49)
 $\overset{(2),(39)}{\le}f(x_k)+\frac{\varepsilon}{20M_k}+\delta_u-\left[\frac{1}{M_k}\|\tilde g_{X,k}\|_E^2+h(x_{k+1})-h(x_k)-\frac{\varepsilon}{20M_k}-\delta_{pu}\right]+\frac{1}{2M_k}\|\tilde g_{X,k}\|_E^2+\frac{\varepsilon}{10M_k}+2\delta_u.$ (50)

Rearranging terms and using $\psi=f+h$, we obtain

 $\psi(x_{k+1})\le\psi(x_k)-\frac{1}{2M_k}\|\tilde g_{X,k}\|_E^2+\frac{\varepsilon}{4M_k}+4\delta_u+\delta_{pu},\quad k=0,\dots,N-1.$

Summing up these inequalities, we get

 $\|\tilde g_{X,K}\|_E^2\sum_{k=0}^{N-1}\frac{1}{2M_k}\le\sum_{k=0}^{N-1}\frac{1}{2M_k}\|\tilde g_{X,k}\|_E^2\le\psi(x_0)-\psi(x_N)+\frac{\varepsilon}{4}\sum_{k=0}^{N-1}\frac{1}{M_k}+N(4\delta_u+\delta_{pu}),$

where $K=\arg\min_{k=0,\dots,N-1}\|\tilde g_{X,k}\|_E$.

Finally, since $\psi(x_N)\ge\psi^*$, $\frac{\varepsilon}{4}\sum_{k=0}^{N-1}\frac{1}{M_k}=\frac{\varepsilon}{2}\sum_{k=0}^{N-1}\frac{1}{2M_k}$, and $\tilde g_{X,K}=M_K(x_K-x_{K+1})$ by (45), we obtain

 $\|M_K(x_K-x_{K+1})\|_E^2\le\left(\sum_{k=0}^{N-1}\frac{1}{2M_k}\right)^{-1}\left(\psi(x_0)-\psi^*+N(4\delta_u+\delta_{pu})\right)+\frac{\varepsilon}{2},$ (51)

which is (41). The estimate for the number of checks of Inequality (38) is proved in the same way as in Nesterov and Polyak (2006), but we provide the proof for the reader's convenience. Let $i_k$ be the total number of checks of Inequality (38) on step $k$. At step $0$, the search starts from $L_0$ and $M_0=2^{i_0-1}L_0$; at step $k\ge 1$, it starts from $M_{k-1}/2$ and $M_k=2^{i_k-2}M_{k-1}$. Thus, $i_0=1+\log_2\frac{M_0}{L_0}$ and, for $k\ge1$, $i_k=2+\log_2\frac{M_k}{M_{k-1}}$. Then, the total number of checks of Inequality (38) is

 $\sum_{k=0}^{N-1}i_k=1+\log_2\frac{M_0}{L_0}+\sum_{k=1}^{N-1}\left(2+\log_2\frac{M_k}{M_{k-1}}\right)=2N-1+\log_2\frac{M_{N-1}}{L_0}.$ (52) ∎
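The telescoping in (52) can be verified numerically, assuming (as in the proof) that $M_0=2^{i_0-1}L_0$ and $M_k=2^{i_k-2}M_{k-1}$ for $k\ge1$; the accepted constants below are arbitrary illustrative values:

```python
import math

L0, N = 1.0, 20
# arbitrary accepted constants M_k (powers of two times L0, assumed)
exps = [3, 2, 4, 4, 5, 4, 3, 2, 6, 5, 5, 4, 3, 3, 2, 1, 1, 2, 3, 3]
M = [L0 * 2 ** e for e in exps]

# i_0 = 1 + log2(M_0/L0); i_k = 2 + log2(M_k/M_{k-1}) for k >= 1
checks = 1 + math.log2(M[0] / L0)
for k in range(1, N):
    checks += 2 + math.log2(M[k] / M[k - 1])

# the sum telescopes to 2N - 1 + log2(M_{N-1}/L0), as in (52)
assert checks == 2 * N - 1 + math.log2(M[N - 1] / L0)
print("total checks:", checks)
```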

Let us consider two corollaries of the theorem above. The first is the simple case when $L(\delta_c)\equiv L$ in Definition 1. The second is the case when $L(\delta_c)$ is given by (10).

###### Corollary 1.

Assume that there exists a constant $L>0$ s.t. the dependence $L(\delta_c)$ in Definition 1 satisfies $L(\delta_c)\le L$ for all $\delta_c>0$. Assume also that there exists a number $\psi^*>-\infty$ such that $\psi(x)\ge\psi^*$ for all $x\in X$. Then, after $N$ iterations of Algorithm 1, it holds that

 $\|M_K(x_K-x_{K+1})\|_E^2\le\frac{4L(\psi(x_0)-\psi^*)}{N}+4L(4\delta_u+\delta_{pu})+\frac{\varepsilon}{2}.$ (53)

Moreover, the total number of checks of Inequality (38) is not more than

 $2N+\log_2\frac{L}{L_0}.$

Proof By our assumptions, for all iterations $k$, $L(\delta_{c,k})\le L$; hence, we can apply Theorem 1. Let $i_k$ be the total number of checks of Inequality (38) on a step $k$. Then, for all $k$, the inequality $M_k\le 2L$ should hold; otherwise, the termination of the inner cycle would have happened earlier. Using these inequalities, we obtain

 $\left(\sum_{k=0}^{N-1}\frac{1}{2M_k}\right)^{-1}\le\left(\sum_{k=0}^{N-1}\frac{1}{4L}\right)^{-1}=\frac{4L}{N}.$

Thus (53) follows from Theorem 1. The same argument proves the second statement of the corollary. ∎

###### Corollary 2.

Assume that the dependence $L(\delta_c)$ in Definition 1 is given by (10) for some $\nu\in(0,1]$ and $L_\nu\ge 0$, i.e.

 $L(\delta_c)=\left(\frac{1-\nu}{1+\nu}\cdot\frac{2}{\delta_c}\right)^{\frac{1-\nu}{1+\nu}}L_\nu^{\frac{2}{1+\nu}},\quad\delta_c>0.$ (54)

Assume also that there exists a number $\psi^*>-\infty$ such that $\psi(x)\ge\psi^*$ for all $x\in X$. Then, after $N$ iterations of Algorithm 1, it holds that

 $\|M_K(x_K-x_{K+1})\|_E^2\le 2^{\frac{1+3\nu}{2\nu}}\left(\frac{1-\nu}{1+\nu}\cdot\frac{40}{\varepsilon}\right)^{\frac{1-\nu}{2\nu}}L_\nu^{\frac{1}{\nu}}\left(\frac{\psi(x_0)-\psi^*}{N}+(4\delta_u+\delta_{pu})\right)+\frac{\varepsilon}{2}.$ (55)

Moreover, the total number of checks of Inequality (38) is not more than

 $2N-1+\frac{1+\nu}{2\nu}+\frac{1-\nu}{2\nu}\log_2\left(40\cdot\frac{1-\nu}{1+\nu}\right)+\frac{1-\nu}{2\nu}\log_2\frac{1}{\varepsilon}$