Gradient Method With Inexact Oracle for Composite Non-Convex Optimization
Abstract
In this paper, we develop a new first-order method for composite nonconvex minimization problems with simple constraints and an inexact oracle. The objective function is given as the sum of a “hard”, possibly nonconvex part and a “simple” convex part. Informally speaking, oracle inexactness means that, for the “hard” part, at any point we can approximately calculate the value of the function and construct a quadratic function that approximately bounds this function from above. We give several examples of such inexactness: smooth nonconvex functions with inexact Hölder-continuous gradient, and functions given by an auxiliary uniformly concave maximization problem that can be solved only approximately. For the introduced class of problems, we propose a gradient-type method that allows the use of different proximal setups to adapt to the geometry of the feasible set, adaptively chooses the controlled oracle error, and allows for an inexact proximal mapping. We provide a convergence rate for our method in terms of the norm of the generalized gradient mapping and show that, in the case of an inexact Hölder-continuous gradient, our method is universal with respect to the Hölder parameters of the problem. Finally, in a particular case, we show that a small value of the norm of the generalized gradient mapping at a point means that a necessary condition for a local minimum approximately holds at that point.
Keywords: nonconvex optimization, composite optimization, inexact oracle, Hölder-continuous gradient, complexity, gradient descent methods, first-order methods, parameter-free methods, universal gradient methods.
AMS Classification: 90C30, 90C06, 90C26.
Introduction
In this paper, we introduce a new first-order method for nonconvex composite optimization problems with an inexact oracle. Namely, our problem of interest is as follows:
(1) 
where is a closed convex set, is a simple convex function, e.g. . We assume that is a general function endowed with an inexact first-order oracle, which is defined below (see Definition 1). Informally speaking, at any point we can approximately calculate the value of the function and construct a quadratic function that approximately bounds our from above. An example of a problem with this kind of inexactness is given in Bogolubsky et al. (2016), where the authors study a learning problem for a parametric PageRank model.
First-order methods have been developed since the earliest years of optimization theory, see, e.g., Polyak (1963). The recent renaissance in their development started more than ten years ago and was mostly motivated by fast-growing problem sizes in applications such as Machine Learning, Data Analysis, and Telecommunications. For many years, researchers mostly considered convex optimization problems, since they have good structure and allow one to estimate the rate of convergence of the proposed algorithms. Recently, nonconvex problems have started to attract fast-growing attention, as they appear often in Machine Learning, especially in Deep Learning. Thus, the high standards of research on algorithms for convex optimization have started to influence nonconvex optimization. Namely, it has become very important for newly developed methods to come with a rate of convergence with respect to some criterion. Usually, this criterion is the norm of the gradient mapping, which is a generalization of the gradient for constrained problems, see, e.g., Nesterov (2004).
Already in Polyak (1987), the author analyzed how different types of inexactness in gradient values influence the gradient method for unconstrained smooth convex problems. At the moment, the theory of convex optimization algorithms with inexact oracle is well developed in a series of papers: d’Aspremont (2008); Devolder et al. (2014); Dvurechensky and Gasnikov (2016). In d’Aspremont (2008), it was proposed to calculate the gradient of the objective function inexactly and to extend the Fast Gradient Method of Nesterov (2005) so that it can use inexact oracle information. In Devolder et al. (2014), a general concept of inexact oracle is introduced for convex problems, and Primal, Dual and Fast gradient methods are analyzed. In Dvurechensky and Gasnikov (2016), the authors develop the Stochastic Intermediate Gradient Method for problems with stochastic inexact oracle, which provides good flexibility for solving convex and strongly convex problems with both deterministic and stochastic inexactness.
The theory for nonconvex smooth, nonsmooth and stochastic problems is well developed in Ghadimi and Lan (2016); Ghadimi et al. (2016). In Ghadimi and Lan (2016), problems of the form (1), where and is a smooth nonconvex function, are considered in the case when the gradient of is exactly available, as well as when it is available through stochastic approximation. Later, in Ghadimi et al. (2016), the authors generalized these methods to constrained problems of the form (1) in both deterministic and stochastic settings.
Nevertheless, it seems to us that gradient methods for nonconvex optimization problems with deterministic inexact oracle lack sufficient development. The goal of this paper is to fill this gap.
It turns out that smooth minimization with inexact oracle is closely connected with minimization of functions with Hölder-continuous gradient. We say that a function has Hölder-continuous gradient on iff there exist and s.t.
In Devolder et al. (2014), it was shown that a convex problem with Hölder-continuous subgradient can be considered as a smooth problem with deterministic inexact oracle. Later, universal gradient methods for convex problems with Hölder-continuous subgradient were proposed in Nesterov (2015). These algorithms do not require knowledge of the Hölder parameter and the Hölder constant . Thus, they are universal with respect to these parameters. Ghadimi et al. (2015) proposed methods for nonconvex problems of the form (1), where has Hölder-continuous gradient. These methods rely on the Euclidean norm and work well when the Euclidean projection onto the set is simple.
Our contribution in this paper is as follows.

We generalize the definition of inexact oracle in Devolder et al. (2014) to the nonconvex case and provide several examples where such inexactness can arise. We consider two types of errors: controlled errors, which can be made as small as desired, and uncontrolled errors, which can only be estimated.

We introduce a new gradient method for problem (1) and prove a theorem (see Theorem 1) on its rate of convergence in terms of the norm of the generalized gradient mapping. Our method is adaptive to the controlled oracle error, can work with an inexact proximal mapping, and allows a flexible choice of proximal setup based on the geometry of the set .

We show that, in the case of problems with inexact Hölder-continuous gradient, our method is universal, that is, it does not require advance knowledge of the Hölder parameter and the Hölder constant for the function , but provides the best known convergence rate uniformly in the Hölder parameter .
Thus, we provide a universal algorithm for nonconvex Hölder-smooth composite optimization problems with deterministic inexact oracle.
The rest of the paper is organized as follows. In Section 1, we define the deterministic inexact oracle for nonconvex problems and provide several examples. In Section 2, we describe our algorithm and prove the convergence theorem. We also provide two corollaries for the particular cases of smooth functions and Hölder-smooth functions. Note that the latter case includes the former one. Finally, we explain how convergence of the norm of the generalized gradient mapping to zero leads to a good approximation of a point where a necessary optimality condition for Problem (1) holds. Note that our reasoning differs from what can be found in the literature.
Notation Let be a finite-dimensional real vector space and be its dual. We denote the value of a linear function at by . Let be some norm on , be its dual.
1 Inexact Oracle
In this section, we define the inexact oracle and describe several examples where it naturally arises.
Definition 1.
We say that a function is equipped with an inexact first-order oracle on a set if there exists and at any point for any number there exists a constant and one can calculate and satisfying
(2)  
(3) 
In this definition, represents the error of the oracle, which we can control and make as small as we like. In contrast, represents the error which we cannot control. The idea behind the definition is that at any point we can approximately calculate the value of the function and construct an upper quadratic bound.
Let us now consider several examples.
1.1 Smooth Function with Inexact Oracle Values
Let us assume that

Function is smooth on , i.e. it is differentiable and, for all , .

Set is bounded with .

There exist and at any point , for any , we can calculate approximations and s.t. , .
Then, using smoothness of , we obtain, for any ,
(4)  
(5)  
(6) 
Thus, is an inexact first-order oracle with , , and .
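The construction of this subsection can be checked numerically. The sketch below is only an illustration under stated assumptions: the Euclidean norm, a hypothetical test function `f` (a sum of cosines, whose gradient is 1-Lipschitz), a box as the bounded set, and hypothetical error levels `err_f` and `err_g` for the function value and the gradient. It verifies that the noisy value and gradient still give an upper quadratic bound once the additive error `err_f + err_g * D` is included, where `D` is the diameter of the set. This additive term is one valid choice for the sketch and is not claimed to coincide with the constants stated above.

```python
import math
import random

def f(x):
    # Hypothetical test function: nonconvex, gradient is 1-Lipschitz (Euclidean norm).
    return sum(math.cos(t) for t in x)

def grad_f(x):
    return [-math.sin(t) for t in x]

def noisy_oracle(x, err_f, err_g, rng):
    """Return (f_tilde, g_tilde) with |f_tilde - f(x)| <= err_f and
    ||g_tilde - grad f(x)||_2 <= err_g."""
    f_tilde = f(x) + rng.uniform(-err_f, err_f)
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(n * n for n in noise)) or 1.0
    scale = err_g * rng.random() / norm  # perturbation norm is at most err_g
    g_tilde = [g + scale * n for g, n in zip(grad_f(x), noise)]
    return f_tilde, g_tilde

def min_model_gap(trials=1000, dim=3, radius=1.0, err_f=1e-2, err_g=5e-2, seed=0):
    """Smallest value of (quadratic model + delta - f(y)) over random pairs x, y."""
    rng = random.Random(seed)
    L = 1.0                                # Lipschitz constant of grad_f
    D = 2.0 * radius * math.sqrt(dim)      # diameter of the box [-radius, radius]^dim
    delta = err_f + err_g * D              # additive error used in this sketch
    worst = float("inf")
    for _ in range(trials):
        x = [rng.uniform(-radius, radius) for _ in range(dim)]
        y = [rng.uniform(-radius, radius) for _ in range(dim)]
        f_t, g_t = noisy_oracle(x, err_f, err_g, rng)
        d = [b - a for a, b in zip(x, y)]
        model = f_t + sum(g * di for g, di in zip(g_t, d)) \
            + 0.5 * L * sum(di * di for di in d)
        worst = min(worst, model + delta - f(y))
    return worst
```

A nonnegative return value of `min_model_gap()` means the upper bound held on every sampled pair; this follows from smoothness of `f`, the error bounds on the oracle, and the Cauchy-Schwarz inequality.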
1.2 Smooth Function with Hölder-Continuous Gradient
Assume that is differentiable and its gradient is Hölder-continuous, i.e. for some and ,
(7) 
Then
(8) 
It can be shown, see Nesterov (2015), Lemma 2, that, for all and any ,
(9) 
where
(10) 
Thus, is an inexact first-order oracle with , , and given by (10).
Note that, if can only be calculated inexactly as in Subsection 1.1, their approximations will again be an inexact first-order oracle.
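The trade-off between the quadratic constant and the additive error can be illustrated numerically. The sketch below uses the form of the constant as stated in Lemma 2 of Nesterov (2015), where the additive term is `delta / 2`; the normalization of the controlled error in this paper may differ, so the names `smooth_constant`, `nu`, `L_nu` and `delta` are assumptions made for illustration only. The check is the scalar inequality behind the lemma, with `t` playing the role of the distance between the two points.

```python
def smooth_constant(nu, L_nu, delta):
    """Quadratic constant from Lemma 2 of Nesterov (2015).

    Assumes 0 <= nu <= 1, L_nu > 0, delta > 0. For nu = 1 the exponent is
    zero and the constant reduces to L_nu (Python evaluates 0.0 ** 0.0 as 1.0).
    """
    p = (1.0 - nu) / (1.0 + nu)
    return (p / delta) ** p * L_nu ** (2.0 / (1.0 + nu))

def min_gap(nu, L_nu, delta, t_max=10.0, steps=2000):
    """Smallest value on a grid of
    L/2 * t^2 + delta/2 - L_nu/(1+nu) * t^(1+nu), which should be >= 0."""
    L = smooth_constant(nu, L_nu, delta)
    worst = float("inf")
    for i in range(steps + 1):
        t = t_max * i / steps
        holder_term = L_nu / (1.0 + nu) * t ** (1.0 + nu)
        quadratic_term = 0.5 * L * t * t + 0.5 * delta
        worst = min(worst, quadratic_term - holder_term)
    return worst
```

Note that for `nu < 1` the constant grows as `delta` decreases, which is the reason an adaptive choice of the controlled error is useful.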
1.3 Function Given by Maximization Subproblem
Assume that function is defined by an auxiliary optimization problem
(11) 
where is a linear operator, is a continuously differentiable uniformly convex function of degree with parameter . The latter means that
(12) 
where is some norm on . Note that is differentiable and , where is the optimal solution in (11) for fixed .
Extending the proof in Nesterov (2015), we can prove the following.
Lemma 1.
If is uniformly convex on , then the gradient of is Hölder-continuous with
(13) 
where .
Proof From the optimality conditions in (11), we obtain
(14)  
(15) 
Adding these inequalities, we obtain, by definition of uniformly convex function,
(16) 
On the other hand,
(17)  
(18)  
(19) 
Thus,
(20) 
which proves the Lemma. ∎
Let us now consider the situation where the maximization problem in (11) can be solved only inexactly by some auxiliary numerical method. It is natural to assume that, for any and any , we can calculate a point s.t.
(21) 
Since is a concave function, for any and , we have
(22) 
Using this inequality with
(23) 
we obtain, for any ,
(24)  
(25)  
(26) 
where and are defined in (13). At the same time, since (11) is uniformly concave in its second argument, we have
(27) 
Combining this inequality with the previous one, we obtain
(28) 
Since has Hölder-continuous gradient with parameters (13), using (8), we obtain
(29)  
(30)  
(31)  
(32) 
Thus, we have obtained that is an inexact first-order oracle with , , and given by (10) with .
2 Adaptive Gradient Method for Problems with Inexact Oracle
To construct our algorithm for problem (1), we introduce, as is usually done, a proximal setup, see Ben-Tal and Nemirovski (2015). We choose a prox-function which is continuous, convex on and

admits a continuous in selection of subgradients , where is the set of all , where exists;

is strongly convex on with respect to , i.e., for any .
We also define the corresponding Bregman divergence , . Standard proximal setups, e.g. Euclidean, entropy, , simplex , nuclear norm, spectahedron can be found in Ben-Tal and Nemirovski (2015). We will use the Bregman divergence in the so-called composite prox-mapping
(33) 
where , , are given. We allow this problem to be solved inexactly in the following sense.
Definition 2.
Assume that we are given , , , . We call a point an inexact composite prox-mapping iff for any we can calculate and there exists s.t. it holds that
(34) 
We write
(35) 
and define
(36) 
This is a generalization of the inexact composite prox-mapping in Ben-Tal and Nemirovski (2015). Note that if is an exact solution of (33), inequality (34) holds with due to the first-order optimality condition. Similarly to Definition 1, represents an error which can be controlled and made as small as desired, and represents an error which cannot be controlled.
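To make the notion of a proximal setup concrete, the following sketch computes the Bregman divergence for two standard prox-functions: the squared Euclidean norm and the negative entropy on the simplex. The helper names (`bregman`, `euclid`, `neg_entropy`) are hypothetical and introduced only for illustration. For the entropy setup on the simplex the divergence coincides with the Kullback-Leibler divergence, and Pinsker's inequality gives strong convexity with parameter 1 with respect to the 1-norm.

```python
import math

def bregman(d, grad_d, x, z):
    """Bregman divergence V(x, z) = d(x) - d(z) - <grad d(z), x - z>."""
    return d(x) - d(z) - sum(g * (a - b) for g, a, b in zip(grad_d(z), x, z))

def euclid(x):
    # Euclidean prox-function: half of the squared 2-norm.
    return 0.5 * sum(t * t for t in x)

def euclid_grad(x):
    return list(x)

def neg_entropy(x):
    # Entropy prox-function; assumes strictly positive coordinates.
    return sum(t * math.log(t) for t in x)

def neg_entropy_grad(x):
    return [math.log(t) + 1.0 for t in x]

def kl(x, z):
    # Kullback-Leibler divergence between two points of the simplex interior.
    return sum(a * math.log(a / b) for a, b in zip(x, z))

x = [0.2, 0.3, 0.5]
z = [0.6, 0.3, 0.1]
v_euclid = bregman(euclid, euclid_grad, x, z)             # equals 0.5 * ||x - z||_2^2
v_entropy = bregman(neg_entropy, neg_entropy_grad, x, z)  # equals KL(x, z) on the simplex
l1_dist = sum(abs(a - b) for a, b in zip(x, z))
```

Choosing the entropy setup on the simplex is the usual way to adapt the method to the geometry of that set, since the corresponding prox-mapping has a closed form and the strong convexity constant does not degrade with the dimension.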
Our main scheme is Algorithm 1.
(37) 
(38) 
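For orientation, an adaptive gradient scheme of this general type can be sketched as follows. This is not a transcription of Algorithm 1; it is a simplified illustration under assumptions made here: the Euclidean proximal setup, an unconstrained problem, an exact oracle and an exact prox-mapping (all errors equal to zero), the simple part taken to be `lam` times the 1-norm, and a backtracking rule that halves the constant `M` at the start of each iteration and doubles it until the upper quadratic bound holds at the trial point. All names (`adaptive_gradient`, `soft_threshold`, `M0`, `lam`) are hypothetical.

```python
import math

def soft_threshold(v, tau):
    # Exact Euclidean prox-mapping of tau * ||.||_1.
    return [math.copysign(max(abs(t) - tau, 0.0), t) for t in v]

def adaptive_gradient(f, grad, x0, lam=0.0, M0=1.0, iters=100):
    """Backtracking proximal gradient sketch.

    Returns the last iterate, the smallest norm of the gradient mapping
    M * (x - w) seen so far, and the total number of bound checks.
    """
    x, M = list(x0), M0
    best, checks = float("inf"), 0
    for _ in range(iters):
        fx, g = f(x), grad(x)
        M *= 0.5                      # optimistic guess for the local constant
        while True:
            checks += 1
            w = soft_threshold([xi - gi / M for xi, gi in zip(x, g)], lam / M)
            d = [wi - xi for wi, xi in zip(w, x)]
            model = fx + sum(gi * di for gi, di in zip(g, d)) \
                + 0.5 * M * sum(di * di for di in d)
            if f(w) <= model + 1e-12:  # upper quadratic bound holds at w
                break
            M *= 2.0                  # bound violated: increase M and retry
        best = min(best, M * math.sqrt(sum(di * di for di in d)))
        x = w
    return x, best, checks

# Hypothetical nonconvex smooth test function with a 5-Lipschitz gradient.
def f_test(x):
    return sum(0.5 * t * t + math.cos(2.0 * t) for t in x)

def grad_test(x):
    return [t - 2.0 * math.sin(2.0 * t) for t in x]

def psi_test(x, lam=0.1):
    return f_test(x) + lam * sum(abs(t) for t in x)

x_start = [2.0, -1.5]
x_end, best_norm, n_checks = adaptive_gradient(f_test, grad_test, x_start, lam=0.1, iters=200)
```

In this simplified setting each accepted step decreases the composite objective by at least half of `M` times the squared step length, so the smallest gradient-mapping norm tends to zero, and the number of bound checks stays within a constant factor of the number of iterations plus a logarithmic term.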
We will need the following simple extension of Lemma 1 in Ghadimi et al. (2016) to perform the theoretical analysis of our algorithm.
Lemma 2.
Let be an inexact composite prox-mapping and be defined in (36). Then, for any , and , it holds
(39) 
Proof Taking in (34) and rearranging terms, we obtain, by convexity of and strong convexity of ,
(40) 
Applying the definition (36), we finish the proof. ∎
Now we state the main result.
Theorem 1.
Assume that is equipped with an inexact first-order oracle in the sense of Definition 1 and for any constants there exists an integer s.t. . Assume also that there exists a number such that for all . Then, after iterations of Algorithm 1, it holds that
(41) 
Moreover, the total number of checks of Inequality (38) is not more than
(42) 
Proof First of all, let us show that the procedure of searching for a point satisfying (37), (38) is finite. Let be the current number of performed checks of inequality (38) on the step . Then . At the same time, by Definition 1 . Hence, by the assumptions of the theorem, there exists s.t. . At the same time, we have
(43)  
(44)  
which leads to (38) when .
Let us now obtain the rate of convergence. We denote, for simplicity, , , Note that
(45) 
Using definition of , we obtain, for any ,
(46)  
(47)  
(48)  
(49)  
(50)  
This leads to
Summing up these inequalities, we get
Finally, since, for all and , we obtain
(51) 
which is (41). The estimate for the number of checks of Inequality (38) is proved in the same way as in Nesterov and Polyak (2006), but we provide the proof for the reader’s convenience. Let be the total number of checks of Inequality (38) on the step . Then and, for , . Thus, , . Then, the total number of checks of Inequality (38) is
(52) 
∎
Let us consider two corollaries of the theorem above. The first is the simple case when in Definition 1 . The second is the case when is given by (10).
Corollary 1.
Proof By our assumptions, for all iterations , there exists s.t. . Hence, we can apply Theorem 1. Let be the total number of checks of Inequality (38) on a step . Then, for all , the inequality should hold. Otherwise, the inner cycle would have terminated earlier. Using these inequalities, we obtain
Thus (53) follows from Theorem 1. The same argument proves the second statement of the corollary. ∎