Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information
We consider variants of trust-region and cubic regularization methods for non-convex optimization, in which the Hessian matrix is approximated. Under mild conditions on the inexact Hessian, and using approximate solution of the corresponding sub-problems, we provide iteration complexity to achieve -approximate second-order optimality which have shown to be tight. Our Hessian approximation conditions constitute a major relaxation over the existing ones in the literature. Consequently, we are able to show that such mild conditions allow for the construction of the approximate Hessian through various random sampling methods. In this light, we consider the canonical problem of finite-sum minimization, provide appropriate uniform and non-uniform sub-sampling strategies to construct such Hessian approximations, and obtain optimal iteration complexity for the corresponding sub-sampled trust-region and cubic regularization methods.
Consider the generic unconstrained optimization problem
where is smooth and non-convex. Over the last few decades, many optimization algorithms have been developed to solve (P0), e.g., [bertsekas1999nonlinear, nesterov2004introductory, boyd2004convex, sun2006optimization, nocedal2006numerical]. However, the large-scale nature of modern “big-data” problems has made the mere evaluation of the gradient or the Hessian of computationally prohibitive. As a result, faced with such computational challenges, many of the classical optimization algorithms might prove to be inefficient, if applicable at all. In this light, many of the recent research efforts have been centered around designing variants of classical algorithms which, by employing suitable approximations of the gradient and/or Hessian, improve upon the cost-per-iteration of their classical counterparts, while maintaining the original iteration complexity. Indeed, designing variants of classical methods which can strike a balance between per-iteration costs and iteration complexity has been at center-stage in fueling the research in optimization for machine learning and data analysis applications.
However, these efforts have mainly been, rather unequally, divided between first-order methods
The rest of this paper is organized as follows. In order to set the scene, in Section 1.1, we start off by describing the motivations behind our work. A very high-level bird’s-eye view of our approach in this paper is then given in Section 1.2. In Section 1.3, we briefly mention the works closely related to the present paper and in its light, outline, in details, our main contributions in Section 1.4. Notation and assumptions used throughout this paper are introduced in Section 2.1. In Section 2.2, we specify and briefly review the classical Newton-type methods considered in this paper, namely trust region and cubic regularization algorithms. Theoretical results regarding the convergence properties of the proposed variants of these classical methods for solving generic non-convex problem (P0) are presented in Section 3. In Section 4, we consider finite-sum minimization problem, which is a special formulation of problem (P0), introduce various randomized sub-sampling strategies as a way to construct inexact Hessian information, and give the corresponding algorithmic convergence guarantees. Conclusions and further thoughts are gathered in Section 5. The details of all the proofs are deferred to Appendix A.
The main focus of these efforts in the optimization community in general, and within the machine learning community in particular, has been on developing efficient first-order methods in which the gradient of is approximated. This is despite the fact that most first-order methods suffer from two main draw-backs. The most established disadvantage of such methods is that their performance can be seriously hindered by ill-conditioning [roosta2016sub1, roosta2016sub2, xu2016sub]. A more subtle, yet potentially more sever draw-back, is that the success of most first-order methods is tightly intertwined with fine-tunning (often many) hyper-parameters, most importantly, the step-size [berahas2017investigation]. In fact, it is highly unlikely that many of these methods exhibit acceptable performance on first try, and it often takes many trials and errors before one can see reasonable results. Despite these shortcomings, it is arguably the simplicity and the ease-of-implementation that contribute greatly in the widespread usage of these methods, specially within the machine learning community.
Contrary to first-order methods, second-order optimization algorithms have been shown to be highly resilient to problem ill-conditioning [roosta2016sub1, roosta2016sub2, xu2016sub], involve much less parameter tuning and are less sensitive to the choice of hyper-parameters [berahas2017investigation]. The added advantage of using the curvature information has classically allowed the second order methods to be optimization powerhouse for solving very non-linear and ill-conditioned problems, in particular within the scientific computing community. However, the hidden blessings offered by second-order methods come with an obvious curse: the computational costs involved in working with matrices arising in such methods, e.g., Hessians or Jacobians, make the application of many classical second-order algorithms rather prohibitive for modern large-scale problems. Arguably, this is a major contributing factor for why the machine learning community has, overwhelmingly, nurtured a distaste for such methods. In this light, there has recently been a surge of interest in designing efficient (stochastic) second-order variants, which are better suited for the large-scale nature of the modern problems, e.g., [roosta2016sub1, roosta2016sub1, roosta2016sub2, xu2016sub, martens2010deep, byrd2011use, byrd2012sample, bollapragada2016exact] (see Section 1.3 for further references).
However, within the context of, both, first and second-order methods, the bulk of the developed theory has been mostly limited to convex settings. Indeed, the status of corresponding research for non-convex problems lags significantly behind. In addition, compared to first-order algorithms, this discrepancy is ever more so pronounced for the class of second-order methods. For example, for a variety of non-convex applications, many authors have recently considered studying non-trivial theoretical guarantees of various first-order methods, that are mostly ad-hoc and inspired by their convex counterparts, e.g., [tu2015low, zheng2015convergent, bhojanapalli2016dropping, arora2015simple, netrapalli2013phase, boumal2016nonconvex, candes2015phase]. In comparison, the body of similar literature for second-order methods is completely undersized. This is rather understandable: in presence of non-convexity, the large-scale computational challenges are exacerbated, multiple folds over, by the difficulty of avoiding (possibly degenerate) saddle-points as well as finding (at least) a local minimum.
In spite of these challenges, the need for effective and efficient non-convex optimization methods has now been growing increasingly as a result of the emergence in popularity of neural networks [lecun2015deep, goodfellow2016deep]. Indeed, the recent success of deep neural nets in training various machine learning applications [krizhevsky2012imagenet, bahdanau2014neural, sutskever2014sequence, cho2014learning, vinyals2015show] has given rise to an incredible explosion of research outputs in various aspects of deep learning from their theoretical understanding [choromanska2015loss, cohen2016expressive, haeffele2015global] to their wide-range of applications [deng2014deep, schmidhuber2015deep, goodfellow2016deep]. The tremors of this explosion have recently reached the optimization community as well. As a result, there has been numerous recent results in understanding, design and extension of various general purpose first-order optimization algorithms which are suitable for non-convex problems such as neural networks. [jin2017escape, ge2015escaping, allen2017natasha, hazan2015beyond, levy2016power, allen2016variance, reddi2016stochastic, chaudhari2016entropy, raginsky2017non, ghadimi2013stochastic].
Although some of the above cited work can guarantee convergence to a second-order criticality, e.g., [jin2017escape, ge2015escaping, levy2016power], the majority of first-order methods lack such performance guarantees. Indeed, their convergence can be, at best, ensured to first-order critical points, which include saddle-points. However, it has been argued that converging to saddle points can be undesirable for obtaining good generalization errors with many non-convex machine learning models, such as deep neural networks [dauphin2014identifying, choromanska2015loss, saxe2013exact, lecun2012efficient]. In fact, it has also been shown that in certain settings, existence of “bad” local minima, i.e., sub-optimal local minima with high training error, can significantly hurt the performance of the trained model at test time [swirszcz2016local, fukumizu2000local]. In addition, important cases have been demonstrated where, stochastic gradient descent (SGD), which is, nowadays, arguably the optimization method of choice in machine learning, indeed stagnates at high training error [he2016deep]. Such high levels of training error can, in turn, appear as bottlenecks in further improving predictive performance in many machine learning applications. As a result, scalable algorithms which avoid saddle points and guarantee convergence to a (good) local minimum are highly desired.
It is well-known that employing the curvature information in the form of Hessian, in addition to the advantages mentioned above, can help with obtaining a method with desirable convergence to second-order criticality. However, the main apparent obstacle in doing so is, again, the serious computational challenges involved in working with the resulting large-scale Hessian matrices. Indeed, in most applications, the cost of evaluating the exact gradient is usually amortized by that of operations with the Hessian, e.g., matrix-vector products. Here is where appropriate Hessian approximations can prove crucial in obtaining scalable algorithms, which maintain the effectiveness offered by incorporating the exact curvature information, and yet exhibit a great deal of efficiency.
1.2 Bird’s-eye View of Our Approach
To accomplish this, here, we focus on trust-region (TR) [conn2000trust] and cubic regularization (CR) [cubic1981Griewank], two algorithms which are widely considered as among the most elegant and theoretically sound general-purpose Newton-type methods designed for non-convex problems (See Section 2.2 for a brief description of these methods). In particular, for solving (P0), we study the theoretical convergence properties of variants of these two algorithms in which the Hessian is suitably approximated
Such relaxed conditions are much weaker than the existing ones in the literature. Far from being of only technical interest, this relaxation allows for efficient constructions of the inexact Hessian via various simple approximation methods, of which techniques from Randomized Linear Algebra (RLA) are shown to be particularly effective. In doing so, we consider a very important class of optimization problems, i.e., large-scale finite-sum minimization, of the form
and its special case
where , each is a smooth but possibly non-convex function, and are given. Problems of the form (P1) and (P2) arise very often in machine learning, e.g., [shalev2014understanding] as well as scientific computing, e.g., [rodoas1, rodoas2]. In big-data regime where , the mere operations with the Hessian of , e.g., matrix-vector products, typically constitute the main bottleneck of computations. Here, we show that our relaxed Hessian approximation condition allows one to draw upon the sub-sampling ideas of [roosta2016sub1, roosta2016sub2, xu2016sub, bollapragada2016exact], to design variants of TR and CR in which the Hessian is (non-)uniformly sub-sampled. We then study the theoretical convergence properties of our proposed algorithms for general non-convex problems of the form (P1) and (P2).
This paper is mainly motivated by developing novel theory which supports simple constructions of approximate Hessian in practice. Extensive numerical examples demonstrating various practical aspects of our proposed algorithms for solving (P1) and (P2) are, instead, given in [xuNonconvexEmpirical2017]. Specifically, in [xuNonconvexEmpirical2017], we consider two classes of non-convex optimization problems that arise often in practice, i.e. non-linear least squares as well as deep learning and present extensive numerical experiments illustrating the empirical performance of the sub-sampled methods considered in this paper on both, real and synthetic data.
1.3 Related Work
Optimization methods that, as a way to increase efficiency, employ inexact gradient and/or Hessian information have long been an important subject of research. However, as alluded to in Section 1.1, in large-scale settings, the bulk of the research efforts have been centered around the design of (stochastic) first-order methods which, by approximating the gradient, achieve higher efficiency than their classical counterparts. Among many others, such methods range from a simple SGD [robbins1951stochastic, le2004large, li2014efficient, bertsekas1996neuro, bottou2010large, cotter2011better], to the most recent improvements such as those which incorporate the previous gradient directions in the current update, e.g., [roux2012stochastic, schmidt2013minimizing, defazio2014saga], variance reduction techniques, e.g., [johnson2013accelerating, xiao2014proximal, nitanda2014stochastic, bietti2016stochastic], decomposition methods, which directly optimize the dual problem, e.g., [shalev2013stochastic, shalev2012proximal], as well as their accelerations [lin2015universal, cotter2011better, shalev2014accelerated, nitanda2016accelerated].
The body of literature corresponding to efficient and large-scale second-order algorithms designed to solve modern big-data problems is not as dense. Since, in such methods, the operations with the Hessian constitute major computational bottlenecks, a classical line of research has been to try to deterministically construct an approximation of the Hessian in a way that the update is computationally feasible, and yet, still provides sufficient second order information. One such class of methods are quasi-Newton algorithms, which are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In such methods, the approximation to the Hessian is updated iteratively using only first order information from the gradients and the iterates through low-rank updates. Among these methods, the celebrated Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [nocedal2006numerical] and its limited memory version (L-BFGS) [nocedal1980updating, liu1989limited], are the most popular and widely used members of this class.
Another, rather more recent, line of research in this direction, is to construct the inexact Hessian information using the application of randomized methods. In large-scale scientific computing applications, stochastic variants of second-order algorithms using sub-sampling/sketching have long been successfully applied, e.g., see [aravkin2012robust, van2013lost, haber2014simultaneous, doas12, rodoas1, rodoas2, roszas, bui2014solving, haber2000optimization, burger2005survey, petra2012inexact, van2015penalty] and the references therein. This is specially so since many scientific computing problems exhibit high degree of ill-conditioning. In such situations, the mere gradient information is not sufficient for effectively solving the underlying optimization problem, and incorporating (approximate) Hessian can drastically help improve the performance.
For machine learning, however, the application of curvature information is much less mature and a few recent studies include [yu2010quasi, lin2008trust, byrd2011use, byrd2012sample, friedlander2012hybrid, martens2010deep, roux2010fast, ghanbari2017black]. However, there has lately been a great surge of interest in the design and application of second-order methods, most of which are motivated by machine learning applications. Specifically, for convex optimization, the stochastic approximation of the full Hessian matrix has been recently considered in [byrd2011use, byrd2012sample, wang2015subsampled, pilanci2015newton, erdogdu2015convergence, roosta2016sub1, roosta2016sub2, xu2016sub, Agarwal2016SecondOS, mutny2016stochastic, ye2016revisiting, bollapragada2016exact, mutny2017parallel, berahas2017investigation, eisen2017large]. In addition to inexact Hessian, a few of these methods study the fully stochastic case in which the gradient is also approximated, e.g., [roosta2016sub1, roosta2016sub2, bollapragada2016exact]. Many of these works have considered a very important class of optimization problems, widely known as finite-sum minimization problem, which is typically represented in the form of (P1). Lower bounds on the oracle complexity, i.e., the minimum calls to second-order oracle to achieve a sub-optimal solution, for second-order methods for generic convex problems of the form (P0) as well as finite-sum problems (P1) was recently obtained in [shamir2017oracle] and [arjevani2016oracle], respectively.
For non-convex machine learning problems, the literature on methods that employ Hessian approximation is significantly less developed than that of convex problems. This is despite the fact that the study of the application of Newton-type methods for general non-convex problems dates decades back, e.g., [levenberg1944method, marquardt1963algorithm]. Although, there are many heuristic and empirical studies of the application of curvature information for, mostly, deep-learning applications, e.g., see the pioneering work of [martens2010deep] along with the several follow-ups [wiesler2013investigations, ngiam2011optimization, vinyals2012krylov, he2016large, kiros2013training, chapelle2011improved], the theoretical understanding of these stochastic methods remains largely under-studied.
Within the context of trust-region methods, there have been several results which study the role of derivative-free and probabilistic models in general, and Hessian approximation in particular, e.g., see [conn2009global, chen2015stochastic, blanchet2016convergence, bandeira2014convergence, larson2016stochastic, shashaani2016astro, gratton2017complexity] and references therein. Compared to trust-region methods, however, cubic regularization with Hessian approximation is significantly less studied. The pioneering and seminal works of [cartis2011adaptiveI, cartis2011adaptiveII] are the first to allow for an approximation in place of the exact Hessian. There, an adaptive variant of cubic regularization, called ARC, is presented, and under deterministic Hessian approximation conditions, global convergence as well as iteration complexity results to obtain approximate second order critical point is provided. In [cartis2012complexity], this analysis is then extended to trust region methods and under the same deterministic Hessian approximation conditions, the corresponding iteration complexity results are given.
More recently, [kohler2017subsampledcubic] considers the finite-sum problem (P1) and by a direct application of the theoretical results of [cartis2011adaptiveI, cartis2011adaptiveII], presents a sub-sampled variant of ARC, in which the exact Hessian and the gradient are replaced by approximations using corresponding sub-samples. However, unfortunately, their analysis suffers from a rather vicious circle: the sample sizes used for approximating the Hessian and the gradient at each iteration are computed using these very same sub-sampled approximations; see [kohler2017subsampledcubic, Theorems 7 and 9] and, in particular, note the circular relationship between Steps 3 and 4 of [kohler2017subsampledcubic, Algorithm 1]. In other words, the approximate Hessian and gradient are formed based on an a priori unknown step which can only be determined after such approximations are formed. In addition, the sample sizes obtained in [kohler2017subsampledcubic] depend on the inverse of the step length at every iteration. Since the step-length eventually becomes increasingly small (see [cartis2011adaptiveI, Lemma 4.2, Lemma 5.1]), this rather unfortunate dependence, in turn, might cause the sample sizes to grow unboundedly large. Here, in addition to other contributions, we remedy these issues through a finer-grained analysis and give sub-sampling complexities, which do not depend on the step-size, can be set a priori and, in turn, be used throughout all iterations (see Section 1.4 for a summary of our main contributions).
Our contributions can be summarized as follows. We consider the generic non-convex optimization problem (P0) and establish worst-case optimal iteration complexities for variants of trust-region and adaptive cubic regularization methods in which the Hessian is only mildly approximated (Theorems 3.1, 3.2, and 3.2). Specifically, we show that under the following weak condition on the inexact Hessian, , at iteration
for some , Algorithms 1 and 2 can achieve the same worst-case iteration complexity (up to some constant) as that of the exact variants to obtain approximate second order critical solution (cf. Definition 1).
To the best of our knowledge, to establish optimal iteration complexity for obtaining approximate second order critical solution, all previous work that considered inexact Hessian information made use of stronger conditions than (6a). More specifically, for trust-region, many authors have considered the following condition \cref@addtoresetequationparentequation
|for some , where is the current trust-region radius, e.g., [bandeira2014convergence, gratton2017complexity] (although both [bandeira2014convergence, gratton2017complexity] study a more general framework under which the entire sub-problem model is probabilistically constructed and approximation extends beyond just the Hessian). For cubic regularization, the condition imposed on the inexact Hessian is often considered as|
for some , e.g., [cartis2011adaptiveI, cartis2011adaptiveII, cartis2012complexity] and other follow-up works. In fact, [cartis2012complexity] has also established optimal iteration complexity for trust-region algorithm under (1b). Both of (1a) and (1b), are stronger than the celebrated Dennis-Moré [dennis1974characterization] condition, i.e.,
In this paper, we relax Conditions (1a) and (1b) used in previous works and replace them with (6a). Under the more relaxed requirement (6a), not only the use of quasi-Newton methods to approximate the Hessian is theoretically justified, but also (and more relevant to our paper) many randomized matrix approximation techniques can be readily applied, e.g., [woodruff2014sketching, mahoney2011randomized, tropp2015introduction, tropp_concentration]. In contrast to the stronger requirements implied by (1a) or (1b), the less strict condition (6a) has the added advantage that the approximate Hessian matrix, , can be easily constructed in practice (Section 4).
In this light, for the finite-sum optimization framework, i.e., problems (P1) and (P2), we present sub-sampling schemes to probabilistically ensure (6a). To accomplish this, we draw upon approximate matrix multiplication results from RLA (Lemma 4.1 and Lemma 4.1). More specifically, we first consider the most general case of (P1) and give conditions under which uniform sampling (with or without replacement) guarantees (6a) (Lemma 4.1). We then consider (P2), a more specific, yet prevalent case in machine learning, propose non-uniform sampling strategy and show that it can indeed lead to a smaller sample-size as compared to the uniform sampling (Lemma 4.1). These sub-sampling complexities do not depend on the step-size, can be set a priori and, in turn, be used throughout all iterations. As a result, our sub-sampling strategies imply straightforward ways to probabilistically construct the approximate Hessian, , which can, in turn, be used in our proposed sub-sampled variants of trust-region and adaptive cubic regularization. Using the proposed sub-sampling schemes, we then give optimal iteration complexities for Algorithms 1 and 2 for optimization of non-convex finite-sum problems where the Hessian is approximated by means of sub-sampling (Theorems 4.2, 4.2 and 4.2).
We now present preliminaries which help with the clarity of the exposition as well as self-containment of the paper. More specifically, in Section 2.1, we first introduce the notation, definitions and main assumptions used throughout the paper. In Section 2.2, we then give a brief review of the classical trust region and cubic regularization algorithms and include several relevant references for the interested reader.
2.1 Notation and Assumptions
Throughout the paper, vectors are denoted by bold lowercase letters, e.g., , and matrices or random variables are denoted by regular upper case letters, e.g., . denotes the transpose of a real vector . For two vectors, , their inner-product is denoted as . For a vector , and a matrix , and denote the vector norm and the matrix spectral norm, respectively, while is the matrix Frobenius norm. and are the gradient and the Hessian of at , respectively, and denotes the identity matrix. For two symmetric matrices and , indicates that is symmetric positive semi-definite. The subscript, e.g., , denotes iteration counter and is the natural logarithm of . The inexact Hessian is denoted by , but for notational simplicity, we may use to, instead, denote the approximate Hessian evaluated at the iterate in iteration , i.e., . Throughout the paper, denotes a collection of indices from , with potentially repeated items and its cardinality is denoted by .
A very useful property of convex functions is that “local optimality” and “global optimality” are in fact, the same. Unfortunately, in non-convex settings, this is no longer the case. For example, even optimization of a degree four polynomial can be NP-hard [hillar2013most]. In fact, just checking whether a point is not a local minimum is NP-complete [murty1987some]. Thus, in non-convex settings, we are often left with designing algorithms that can guarantee convergence to approximate local optimality. In this light, thoughout this paper, we make constant use of the following definition of -Optimality:
Definition 1 (-Optimality).
Given , is an -optimal solution to the problem (P0), if
We note that -Optimality (even with ) does not necessarily imply closeness to any local minimum, neither in iterate nor in the objective value. However, if the saddle points of the function satisfy the strict-saddle property [ge2015escaping, lee2016gradient], then an -optimality guarantees vicinity to a local minimum for sufficiently small and .
For our analysis throughout the paper, we make the following standard assumption regarding the regularity of the exact Hessian of the objective function .
Assumption 1 (Hessian Regularity).
is twice differentiable and has bounded and Lipschitz continuous Hessian on the piece-wise linear path generated by the iterates, i.e. for some and all iterations \cref@addtoresetequationparentequation
where and are, respectively, the iterate and the update step at iteration .
Arguably, the most straightforward approach for globalization of many Newton-type algorithms is the application of line-search. However, it is known that near saddle points where the gradient magnitude can be small, traditional line search methods can be very ineffective and in fact produce iterates that can get stuck at a saddle point [nocedal2006numerical]. Trust region and cubic regularization methods are two elegant globalization alternatives that, specially recently, have attracted much attention. In this section, we give a brief review the classical TR and CR algorithms for solving (P0).
There are many similarities between TR and CR algorithms in terms of their theoretical and algorithmic properties as well as methods for solving their respective sub-problems. The main advantage of these methods is that they are reliably able to take advantage of the direction of negative curvature and escape saddle points. More specifically, if the Hessian at a saddle point contains a negative eigenvalue, these methods can leverage the corresponding direction of negative curvature to obtain decrease in the objective function values. Many problems of interest exhibit saddle points which include negative curvature direction [ge2015escaping, sun2015nonconvex]. As a result, TR and CR methods, if made scalable, can be very effective for solving many large-scale non-convex problems of interest for machine learning and scientific computing. In order to set the scene, in this section we briefly review these algorithms as as they pertain to the present paper.
TR methods [sorensen1982newton, conn2000trust] encompass a general class of iterative methods which specifically define a region around the current iterate within which they trust the model to be a reasonable approximation of the true objective function. They then find the step as a (approximate) minimizer of the model in this region. In effect, they choose the direction and length of the step simultaneously. If a step is not acceptable, they reduce the size of the region and find a new minimizer. The most widely used approximating model, which we consider here, is done via a quadratic function obtained from the second-order Taylor expansion of the true objective at the current iterate. More specifically, using the current iterate , the quadratic variant of TR algorithm finds the next iterate as where is a solution of the constrained sub-problem \cref@addtoresetequationparentequation
|Here, is the region in which we “trust” our quadratic model to be an acceptable approximation of the true objective for the current iteration.|
|The major bottleneck of computations in TR algorithm is the minimization of the constrained quadratic sub-problem (5a). The close connections between the trust region sub-problem (5a) and eigenvalue problems has prompted many authors to design effective methods for solving (5a), e.g., truncated conjugate-gradient methods [steihaug1983conjugate, toint1981towards], the dual-based algorithms [more1983computing, rendl1997semidefinite, sorensen1997minimization], the generalized Lanczos trust-region based methods [gould1999solving, lenders2016trlib], and more modern advancements [erway2009subspace, erway2009iterative, gould2010solving, hazan2015linear, ho2016second].|
|In terms of iteration complexity, there is also a vast body of results. For example, for a smooth non-convex objective and in order to obtain approximate first-order criticality, i.e., for some , the complexity of an (inexact) trust-region method, which ensures at least a Cauchy (steepest-descent-like) decrease at each iteration, is shown to be of the same order as that of steepest descent, i.e., ; e.g., [gratton2008recursiveI, gratton2008recursiveII, blanchet2016convergence, gratton2017complexity, cartis2012complexity]. Recent non-trivial modifications of the classical TR methods have also been proposed which improve upon the complexity to [curtis2014trust]. These bounds can be shown to be tight [cartis2010complexity] in the worst case. Under a more general algorithmic framework and in terms of objective function sub-optimality, i.e., , better complexity bounds, in the convex and strongly-convex settings, have been obtained which are of the orders of and , respectively [grapiglia2016worst].|
|For non-convex problems, however, it is more desired to obtain complexity bounds for achieving approximate second-order criticality, i.e., Definition 1. For this, bounds in the orders of and have been obtained in [cartis2012complexity] and [grapiglia2016worst], respectively. Similar bounds were also given in [gratton2017complexity] under probabilistic model. Most these complexities are obtained for trust region-type methods which, in addition to Cauchy decrease, guarantee a descent, at least, as good as that obtained from following the negative curvature direction (if present). Bounds of this order have shown to be optimal in certain cases [cartis2012complexity].|
where is the cubic regularization parameter chosen for the current iteration. Naively speaking, the role of the parameter in the course of the algorithm is very similar to the trust-region radius, . In fact, one can view as the reciprocal of ; see the comments following the proof of [cartis2011adaptiveI, Theorem 3.1] as well as the updating rules for the trust-region radius in [conn2000trust]. A direct comparison of [cartis2011adaptiveII, Lemmas 3.2 and 3.3] with [conn2000trust, Theorems 6.4.2, 6.4.3] also reveals the striking resemblance of the role of the trust region radius, , with cubic regularization parameter, .
To the best of our knowledge, the use of such regularization, was first introduced in the pioneering work of [cubic1981Griewank] as a means for introducing affine-invariance to Newton-type methods which are globally convergent. However, such regularization was subsequently further studied in the seminal works of [nesterov2006cubic, cartis2011adaptiveI, cartis2011adaptiveII], which provide an in-depth analysis of such regularization methods in a variety of setting.
As in the case of TR, the major bottleneck of CR involved solving the sub-problem (5b). Many authors have proposed various techniques for efficiently solving (5b). These methods range from employing generalized Lanczos-type iterations [cartis2011adaptiveI] in which (5b) is solved in successively embedded, but lower dimensional, Krylov subspaces, to some more recent alternative techniques, e.g., gradient based methods [carmon2016gradient, bianconcini2015use] and a method based on fast approximate matrix inversion [agarwal2016finding].
From the worst-case complexity point of view, CR has a better dependence on compared to TR. More specifically, [nesterov2006cubic] showed that, under global Lipschitz continuity assumption on the Hessian, if the sub-problem (5b) is solved exactly, then the resulting CR algorithm achieves the approximate first-order criticality with complexity of which is better than that of the TR (cf. Section 2.2.1). These results were extended by [cartis2011adaptiveI, cartis2011adaptiveII], to an algorithm, dubbed Adaptive Regularization with Cubics (ARC). In particular, the authors showed that the worst case complexity of can be achieved without requiring the knowledge of the Hessian’s Lipschitz constant, access to the exact Hessian, or multi-dimensional global optimization of the sub-problem (5b). These results were further refined in [cartis2012complexity] where it was shown that, not only, multi-dimensional global minimization of (5b) is unnecessary, but also the same complexity can be achieved with mere one or two dimensional search. This bound has been shown to be tight [cartis2011optimal]. As for the approximate second-order criticality, [cartis2012complexity] showed that at least is required. With further assumptions on the inexactness of sub-problem solution, they also show that one can achieve , which is shown to be tight [cartis2010complexity]. Better dependence on can be obtained if one assumes additional structure, such as convexity, e.g., see [nesterov2006cubic, cartis2012evaluation] as well as the acceleration scheme of [nesterov2008accelerating].
3 Algorithms and Convergence Analysis
We are now ready to present our main algorithms for solving the generic non-convex optimization (P0) along with their corresponding iteration complexity results to obtain a -optimal solution as in (2). More precisely, in Section 3.1 and 3.2, respectively, we present modifications of the TR and ARC methods which incorporate inexact Hessian information, according to mild Hessian approximation requirement (6a). As it can be seen, the bulk of the algorithms are almost identical to the case where the exact Hessian is used and the difference merely boils down to the use of as opposed to . As indicated in Section 2.1, for notational simplicity, in what follows, we use to denote .
Before we dive into details of the algorithms and analysis, we present the key condition we require on the approximated Hessian , which is stated as follows:
Condition 1 (Inexact Hessian Regularity).
For some , , the approximating Hessian, , satisfies \cref@addtoresetequationparentequation
where and are, respectively, the iterate and the update step at iteration .
Remark: We emphasize that (6a) constitutes a significant relaxation over Conditions (1a) and (1b) which are typically assumed in many previous works for analyzing similar algorithms. For example, for an approximate Hessian, , there is no straightforward way to verify (1b) since the search direction depends on the approximate Hessian itself. In contrast, Condition (6a) can be easily controlled by the spectral error between the approximate and exact Hessians, i.e., , (as an upper bound for ). Consequently, the inexact Hessian, , can be easily constructed using several quasi-Newton methods and, more relevant to our paper, many randomized matrix approximation techniques; see Section 4 for specific practical examples.
3.1 Trust Region with Inexact Hessian
Algorithm 1 depicts a trust-region algorithm where at each iteration , instead of the true Hessian , only an inexact approximation, , is used.
In Algorithm 1, we require that the sub-problem (7) is solved only approximately. Indeed, in large-scale problems, where the exact solution of the sub-problem is the main bottleneck of the computations, this is a very crucial relaxation. Such approximate solution of the sub-problem (7) has been adopted in many previous work. Here, we follow the inexactness conditions discussed in [conn2000trust], which are widely known as Cauchy and Eigenpoint conditions. Recall that the Cauchy and Eigen directions correspond, respectively, to one dimensional minimization of the sub-problem (7) along the directions given by the gradient and negative curvature.
Condition 2 (Sufficient Descent Relative to Cauchy and Eigen Directions [conn2000trust]).
Assume that we solve the sub-problem (7) approximately to find such that \cref@addtoresetequationparentequation
Here, is defined in (7), (Cauchy point) is along negative gradient direction and is along approximate negative curvature direction such that , for some (see Appendix B for a way to efficiently compute ).
One way to ensure that an approximate solution to the sub-problem (7) satisfies (8), is by replacing (7) with the following reduced-dimension problem, in which the search space is a two-dimensional sub-space containing vectors , and , i.e.,
Using the inexactness Condition 2, Theorem 3.1 gives the iteration complexity of Algorithm 1, which as discussed in Section 2.2.1, has been shown to be optimal. The proof of Theorem 3.1 can be found in Appendix A.2.
where is a hyper-parameter of Algorithm 1, is defined as in (8b). For Problem (P0) and under Assumption 1, if the approximate solution to the sub-problem (7) satisfies Condition 2, then Algorithm 1 terminates after at most
iterations. As it can be seem, the worst-case total number of iterations required by Algorithm 1 before termination, matches the optimal iteration complexity obtained in [cartis2012complexity]. Furthermore, from (6a), it follows that upon termination of Algorithm 1, in addition to , we have , i.e., the obtained solution satisfies -Optimality as in (2).
3.2 Adaptive Cubic Regularization with Inexact Hessian
Similar to Section 3.1, in this section, we present the algorithm and its corresponding convergence results for the case of adaptive cubic regularization with inexact Hessian. In particular, Algorithm 2 depicts a variant of ARC algorithm where at each iteration , the inexact approximation, , is constructed according to (6).
Similar to Algorithm 1, here we also require that the sub-problem (10) in Algorithm 2 is solved only approximately. Although similar inexact solutions to the sub-problem (10) by using Cauchy and Eigenpoint has been considered in several previous work, e.g., [cartis2012complexity], here we provide refined conditions which prove to be instrumental in obtaining iteration complexities with the relaxed Hessian approximation (6a), as opposed to the stronger condition (1b). For the proof that the Cauchy and Eigenpoint, indeed satisfy the inequalities (11), see Appendix A.3. In particular, Lemmas 5 and 6 in Appendix A.3, describe the model reduction obtained by Cauchy and eigen points more accurately than is usually found in similar literature.
Condition 3 (Sufficient Descent Relative to Cauchy and Eigen Directions).
Assume that we solve the sub-problem (10) approximately to find such that \cref@addtoresetequationparentequation
Here is defined in (10), (Cauchy point) is along negative gradient direction and is along approximate negative curvature direction such that for some (see Appendix B for a way to efficiently compute ).
Similar to Section 3.1, a natural way to ensure that the approximate solution to the sub-problem (10) satisfies (11), is by replacing the unconstrained high-dimensional sub-problem (10) with the following constrained but lower-dimensional problem, in which the search space is reduced to a two-dimensional sub-space containing vectors , and , i.e.,
Note that, if is an orthogonal basis for the sub-space “”, by a linear transformation, we can turn the above sub-problem into an unconstrained problem as
Theorem 3.2 gives the iteration complexity of Algorithm 2 for the case where the approximate solution to the sub-problem (10) is only required to satisfy the inexactness Condition 3. The proof of Theorem 3.2 can be found in Appendix A.3.
where are, respectively, defined as in (11b), (3a), (6b), and are the hyper-parameters of Algorithm 2. For Problem (P0) and under Assumption 1, if the approximate solution to the sub-problem (10) satisfies Condition 3, then Algorithm 2 terminates after at most
Condition 3 seems to be the bare minimum required to guarantee convergence to an approximate second-order criticality. Intuitively, however, if an approximate solution to the sub-problem (10) satisfies more than (11), i.e., if we solve (10) more exactly than just requiring (11), one could expect to be able to improve upon the iteration complexity of Theorem 3.2. Indeed, suppose we solve the reduced sub-problem on progressively embedded sub-spaces with increasingly higher dimensions, all of which including “”, and stop when the corresponding solution satisfies the following conditions.
Condition 4 (Sufficient Descent for Optimal Complexity).
If Condition 4 holds, then as initially pioneered in [cartis2011adaptiveI, cartis2011adaptiveII, cartis2012complexity], we can obtain the optimal iteration complexity for Algorithm 2, as shown in Theorem 3.2. The proof of Theorem 3.2 is given in Appendix A.3.2.
[Optimal Complexity of Algorithm 2]theoremtheoremSARCOptimal Consider any . Suppose the inexact Hessian, , satisfies Conditions (6) with the approximation tolerance, , in (6a) as where is as in (12), and . For Problem (P0) and under Assumption 1, if the approximate solution to the sub-problem (10) satisfies Conditions 3 and 4, then Algorithm 2 terminates after at most
4 Finite-Sum Minimization
In this section, we give concrete and highly practical examples to demonstrate ways to construct the approximate Hessian, which satisfies (6). By considering finite-sum minimization, a ubiquitous problem arising frequently in machine learning, we showcase the practical benefits of the proposed relaxed requirement (6a) for approximating Hessian, compared to the stronger alternatives (1a) and (1b). Specifically, in Section 4.1, we describe randomized techniques to appropriately construct the approximate Hessian, which is probabilistically guaranteed to satisfy (6). Convergence analysis of variants of Algorithms 1 and 2 which incorporate such particular randomized Hessian approximations are then given in Section 4.2.
4.1 Randomized Sub-Sampling
Indeed, a major advantage of (6a) over (1a) and (1b) is that there are many approximation techniques that can produce an inexact Hessian satisfying (6a). Of particular interest in our present paper is the application of randomized matrix approximation techniques, which have recently shown great success in the area of Randomized Linear Algebra at solving various numerical linear algebra tasks [woodruff2014sketching, mahoney2011randomized, drineas2016randnla]. For this, we consider the highly prevalent finite-sum minimization problem and employ random sampling as a way to construct approximations to the exact Hessian, which are, probabilistically, ensured to satisfy (6a). More specifically, in this section, we consider the optimization problem
where each is a smooth but possibly non-convex function. Many machine learning and scientific computing applications involve finite-sum optimization problems of the form (P1) where each is a loss (or misfit) function corresponding to observation (or measurement), e.g., [roosta2015randomized, roszas, rodoas1, rodoas2, roas1, haber2012effective, doas12, tibshirani1996regression, friedman2001elements, kulis2012metric, bottou2016optimization, sra2012optimization]. In particular, in machine learning applications, in (P1) corresponds to the empirical risk [shalev2014understanding] and the goal of solving (P0) is to obtain a solution with small generalization error, i.e., high predictive accuracy on “unseen” data.
Here, we consider (P1) in large-scale regime where . In such settings, the mere evaluations of the Hessian and the gradient increase linearly in . Indeed, for big-data problems, the operations with the Hessian, e.g., matrix-vector products involved in the (approximate) solution of the sub-problems (7) and (10), typically constitute the main bottleneck of computations, and in particular when , are computationally prohibitive. For the special case of problems of the form (P1) in which each is convex, randomized sub-sampling has shown to be very effective in reducing such costs, e.g., [roosta2016sub1, roosta2016sub2, xu2016sub, bollapragada2016exact]. In this section, we show that such randomized approximation techniques can indeed be effectively employed for the non-convex settings considered in this paper.
In this light, suppose we have a probability distribution, , over the indices of the set , such that for each index , we have and . Consider picking a sample of indices from , at each iteration, randomly according to the distribution . Let and denote the sample collection and its cardinality, respectively and define
to be the sub-sampled Hessian. In big-data regime when , if , such sub-sampling can offer significant computational savings.
Now, suppose \cref@addtoresetequationparentequation
In this case, we can naturally consider uniform distribution over , i.e., . Lemma 4.1 gives the sample size required for the inexact Hessian, , to probabilistically satisfy (6), for when the indices are picked uniformly at random with or without replacement. The proof of Lemma 4.1 is in Appendix A.4.
Indeed, if (17) holds, then (6a) follows with the same probability. In addition, if is constructed according to Lemma 4.1, it is easy to see that (6b) is satisfied with (in fact this is a deterministic statement). These two, together, imply that satisfies (6), with probability .
A Special Case: In certain settings, one might be able to construct a more “informative” distribution, , over the indices in the set , as opposed to oblivious uniform sampling. In particular, it might be advantageous to bias the probability distribution towards picking indices corresponding to those ’s which are more relevant, in certain sense, in forming the Hessian. If this is possible, then we can only expect to require smaller sample size as compared with oblivious uniform sampling. One such setting where this is possible is the finite-sum optimization of the form,
for some given data vectors . Finite-sum problems of the form (P2), which is indeed a special case of (P1), arise often in many machine learning problems [shalev2014understanding], e.g., logistic regression with least squares loss as in [xuNonconvexEmpirical2017, Example 4.2].
It is easy to see that, the Hessian of in this case can be written as
Now, consider the sampling distribution as
Note that the absolute values are needed since for non-convex , we might have (for the convex case where all , one can obtain stronger guarantees than Lemmas 4.1 and 4.1; see [xu2016sub]). Using non-uniform sampling distribution (18), Lemma 4.1 gives sampling complexity for the approximate Hessian of (P2) to, probabilistically, satisfy (6). The proof of Lemma 4.1 is in Appendix A.4.
As it can be seen from (15b) and (15c), since , the sampling complexity given by Lemma 4.1 always provides a smaller sample-size compared with that prescribed by Lemma 4.1. Indeed, the advantage of non-uniform sampling is more pronounced in cases where the distribution of ’s are highly skewed, i.e., a few large ones and many small ones, in which case we can have ; see numerical experiments in [xuNonconvexEmpirical2017].
4.2 Probabilistic Convergence Analysis
Now, we are in the position to give iteration complexity for Algorithms 1 and 2 where the inexact Hessian matrix is constructed according to Lemmas 4.1 or 4.1. Since the approximation is a probabilistic construction, in order to guarantee success, we need to ensure that we require a small failure probability across all iterations. In particular, in order to get an overall and accumulative success probability of for the entire iterations, the per-iteration failure probability is set as . This failure probability appears only in the “log factor” for sample size in all of our results, and so it is not the dominating cost. Hence, requiring that all iterations are successful for a large , only necessitates a small (logarithmic) increase in the sample size. For example, for , as in Theorem 3.2, we can set the per-iteration failure probability to , and ensure that when Algorithm 2 terminates, all Hessian approximations have been, accumulatively, successful with probability of . As a result, we can use the sub-sampling Lemmas 4.1 and 4.1 and obtain an approximating matrix which, with high probability, guarantees (6).
Using these results, we can have the following probabilistic, but optimal, guarantee on the worst-case iteration complexity of Algorithm 1 for solving finite-sum problem (P1) (or (P2)) and in the case where the inexact Hessian is formed by sub-sampling. Their proofs follow very similar line of reasoning as that used for obtaining the results of Section 3, and hence are omitted.
[Optimal Complexity of Algorithm 1 For Finite-Sum Problem]theoremtheoremSTR_prob Consider any . Let be as in (9), and set for some . Furthermore, for such , let the sample-size be as in (16) (or (19)). For Problem (P1) (or (P2)) and under Assumption 1, if the approximate solution to the sub-problem (7) satisfies Condition 2, then Algorithm 1 using the sub-sampled matrix as in (14) terminates in at most
iterations, upon which, with probability , we have that
Similarly, in the setting of optimization problems (P1) and (P2), with appropriate sub-sampling of the Hessian as in Lemmas 4.1 and 4.1, we can also obtain probabilistic worst-case iteration complexities for Algorithm 2 as in the deterministic case.
[Complexity of Algorithm 2 For Finite-Sum Problem]theoremtheoremSARC_prob Consider any . Let be as in (12), and set for some . Furthermore, for such , let the sample-size be as in (16) (or (19)). For Problem (P1) (or (P2)) and under Assumption 1, if the approximate solution to the sub-problem (10) satisfies Condition 3, then Algorithm 2 using the sub-sampled matrix as in (14) terminates in at most
iterations, upon which, with probability , we have that
[Optimal Complexity of Algorithm 2 For Finite-Sum Problem]theoremtheoremSARCOptimal_prob Consider any . Let be as in Theorem 3.2, and set for some . Furthermore, for such , let the sample-size be as in (16) (or (19)). For Problem (P1) (or (P2)) and under Assumption 1, if the approximate solution to the sub-problem (10) satisfies Conditions 3 and 4, then Algorithm 2 using the sub-sampled matrix as in (14) terminates in at most
iterations, upon which, with probability , we have that
As it can be seen, the main difference between Theorems 4.2 and 4.2 is in the solution to the sub-problem (10). More specifically, if in addition to (11), the approximate solution to the sub-problem (10), at each iteration, satisfies (13), then Theorem 4.2 gives optimal worst-case iteration complexity.
In this paper, we considered non-convex optimization settings and developed efficient variants of the trust region and adaptive cubic regularization methods in which the curvature information is approximated. For all of our proposed variants, under Hessian approximation requirements which are more relaxed than the existing ones in the literature and using only inexact solutions of the corresponding sub-problems, we obtained optimal iteration complexity to achieve approximate second order criticality.
The proposed relaxed requirement in approximating the curvature allows for straightforward construction of the inexact Hessian using various techniques such as quasi–Newton as well as many randomized matrix approximation methods. As a concrete and highly practical example, we considered the large-scale finite-sum optimization problem and proposed uniform and non-uniform sub-sampling strategies as ways to efficiently construct the desired approximate Hessian. Using such sampling schemes, we then, probabilistically, established optimal iteration complexity for randomized variants of trust region and adaptive cubic regularization methods in which the Hessian is sub-sampled.
In this paper, we focused on approximating the Hessian under the exact gradient information. Arguably, the bottleneck of the computations in such second-order methods involves the computations with the Hessian, e.g., matrix-vector products in the (approximate) solution of the sub-problem. In fact, the cost of the exact gradient computation is typically amortized by that of the operations with the Hessian. In spite of this, approximating the gradient can indeed improve upon the efficiency of the methods proposed in this paper. However, approximating the gradient introduces a new level of difficulties as well as research opportunities, which we intend to pursue in the future.
Finally, we mention that our focus here has been solely on developing the theoretical foundations of such randomized algorithms. Extensive empirical evaluations of the algorithms analyzed in this paper on various machine learning applications are given in the companion paper [xuNonconvexEmpirical2017].
We would like to acknowledge ARO, DARPA, and NSF for providing partial support of this work. \printbibliography
Appendix A Proofs of the Main Theorems
a.1 Proof Outlines
Although the relaxed Hessian approximation (6a) introduces many challenges, our proof techniques mainly follow similar line of reasoning as that in [cartis2011adaptiveI, cartis2011adaptiveII, cartis2012complexity]. Specifically, to prove worst-case iteration complexity for each method, we follow the following steps. Recall that, as indicated in Section 2.1, for notational simplicity, we use to denote the inexact Hessian .
a.2 Proof of Theorem 3.1
First, apply mean-value theorem on at ,
for some in the segment of . And then