Sample Efficient Stochastic VarianceReduced Cubic Regularization Method
Abstract
We propose a sample efficient stochastic variancereduced cubic regularization (LiteSVRC) algorithm for finding the local minimum efficiently in nonconvex optimization. The proposed algorithm achieves a lower sample complexity of Hessian matrix computation than existing cubic regularization based methods. At the heart of our analysis is the choice of a constant batch size of Hessian matrix computation at each iteration and the stochastic variance reduction techniques. In detail, for a nonconvex function with component functions, LiteSVRC converges to the local minimum within ^{2}^{2}2Here hides polylogarithmic factors Hessian sample complexity, which is faster than all existing cubic regularization based methods. Numerical experiments with different nonconvex optimization problems conducted on real datasets validate our theoretical results.
1 Introduction
We study the following unconstrained finitesum nonconvex optimization problem:
(1.1) 
where each is a general nonconvex function. Such nonconvex optimization problems are ubiquitous in machine learning, including training deep neural network (LeCun et al., 2015), robust linear regression (Yu and Yao, 2017) and nonconvex regularized logistic regression (Reddi et al., 2016b). In principle, finding the global minimum of (1.1) is generally a NPhard problem (Hillar and Lim, 2013) due to the lack of convexity.
Instead of finding the global minimum, various algorithms have been developed in the literature (Nesterov and Polyak, 2006; Cartis et al., 2011a; Carmon and Duchi, 2016; Agarwal et al., 2017; Xu et al., 2018; AllenZhu and Li, 2018) to find an approximate local minimum of (1.1). In particular, a point is said to be an approximate local minimum of if
(1.2) 
where are predefined precision parameters. It has been shown that such approximate local minima can be as good as global minima in some problems. For instance, Ge et al. (2016) proved that any local minimum is actually a global minimum in matrix completion problems. Therefore, to develop an algorithm to find an approximate local minimum is of great interest both in theory and practice.
A very important and popular method to find the approximate local minimum is cubicregularized (CR) Newton method, which was originally introduced by Nesterov and Polyak (2006). Generally speaking, in the th iteration, CR solves a subproblem which minimizes a cubicregularized secondorder Taylor expansion at current iterate . The update rule can be written as follows:
(1.3)  
(1.4) 
where is a penalty parameter used in CR. Nesterov and Polyak (2006) proved that to find an approximate local minimum of a nonconvex function , CR requires at most iterations. However, a main drawback for CR is that it needs to sample individual Hessian matrix to get the exact Hessian used in (1.3), which leads to a total Hessian sample complexity, i.e., number of queries to the stochastic Hessian for some and . Such computational cost will be extremely expensive when is large as is in many large scale machine learning problems.
To overcome the computational burden of CR based methods, some recent studies have proposed to use subsampled Hessian instead of the full Hessian (Kohler and Lucchi, 2017; Xu et al., 2017a) to reduce the Hessian complexity. In detail, Kohler and Lucchi (2017) proposed a subsampled cubicregularized Newton method (SCR), which uses a subsampled Hessian instead of full Hessian to reduce the per iteration sample complexity of Hessian evaluations. Xu et al. (2017a) proposed a refined convergence analysis of SCR, as well as a subsampled Trust Region algorithm (Conn et al., 2000). Nevertheless, SCR bears a much slower convergence rate than the original CR method, and the total Hessian sample complexity for SCR to achieve an approximate local minimum is . This suggests that the computational cost of SCR could be even worse than CR when .
In order to retain the fast convergence rate of CR and enjoy the computational efficiency of SCR, Zhou et al. (2018) proposed a stochastic variancereduced cubicregularized Newton methods (SVRC) to further improve the convergence rate of stochastic CR method. At the core of SVRC is an innovative semistochastic gradient, as well as a semistochastic Hessian (Gower et al., 2017; Wai et al., 2017). They proved that SVRC achieves an approximate local minimum with secondorder oracle complexity, which is defined to be the number of queries to the secondorder oracle, i.e., a triplet . However, the secondorder oracle complexity is dominated by the maximum number of queries to one of the elements in the triplet triplet , and therefore is not always accurate in reflecting the true computational complexity. For instance, Algorithm A with higher secondorder oracle complexity may be due to its need to query more stochastic gradients (’s) than Algorithm B, but it may need to query much fewer stochastic Hessians (’s) than Algorithm B. Given that the computational complexity of the stochastic Hessian matrix is while that of the stochastic gradient is only , Algorithm B can be more efficient than Algorithm A. In other words, an algorithm with higher secondorder oracle complexity is not necessarily slower than the other algorithm with lower secondorder oracle complexity. So it is more reasonable to use the Hessian sample complexity to evaluate the efficiency of cubic regularization methods when the dimension is not small. Recently, Wang et al. (2018) proposed another variance reduced stochastic cubic regularization algorithm, which achieves Hessian sample complexity^{3}^{3}3They actually missed additional Hessian sample complexity since their algorithms need to calculate the minimum eigenvalue of Hessian as a stopping criteria in each iteration. to converge to an approximate local minimum.
In this paper, in order to reduce the Hessian sample complexity, we develop a sample efficient stochastic variancereduced cubicregularized Newton method called LiteSVRC, which significantly reduces the sample complexity of Hessian matrix evaluations in stochastic CR methods. In detail, under milder conditions, we prove that LiteSVRC achieves a lower Hessian sample complexity than existing cubic regularization based methods. Numerical experiments with different types of nonconvex optimization problems on various real datasets are conducted to validate our theoretical results.
We summarize our major contributions as follows:

The proposed LiteSVRC algorithm only requires a constant batch size of Hessian evaluations at each iteration. In contrast, the batch size of Hessian evaluations at each iteration in Wang et al. (2018) is implicitly chosen based on the update of the next iterate.

Last but not the least, our results do not require the Lipschitz continuous condition of , which directly improves the results in Wang et al. (2018) that rely on this additional assumption.
1.1 Additional Related Work
Cubic Regularization and Trustregion Newton Method Traditional Newton method in convex setting has been widely studied in past decades (Bennett, 1916; Bertsekas, 1999). In the nonconvex setting, based upon cubicregularized Newton method (Nesterov and Polyak, 2006), Cartis et al. (2011a) proposed a practical framework of cubic regularization which uses an adaptive cubic penalty parameter and approximate cubic subproblem solver. Carmon and Duchi (2016); Agarwal et al. (2017) presented two fast cubicregularized methods where they used only gradient and Hessianvector product to solve the cubic subproblem. Tripuraneni et al. (2017) developed a stochastic cubic regularization algorithm based on Kohler and Lucchi (2017) where only gradient and Hessian vector product are used. The other line of related research is trustregion Newton methods (Conn et al., 2000; Carrizo et al., 2016; Curtis et al., 2017a, b), which have comparable performance guarantees as cubic regularization methods.
Finding Approximate Local Minima There is another line of work which focuses on finding approximate local minima using the negative curvature. Ge et al. (2015); Jin et al. (2017a) showed that (stochastic) gradient descent with an injected uniform noise over a small ball is able to converge to approximate local minima. Carmon et al. (2016); Royer and Wright (2017); AllenZhu (2017) showed that one can find approximate local minima faster than firstorder methods by using Hessian vector product to extract information of negative curvature. Xu et al. (2018); AllenZhu and Li (2018); Jin et al. (2017b) further proved that gradient methods with bounded perturbation noise are also able to find approximate local minima faster than the firstorder methods.
Variance Reduction Variancereduced techniques play an important role in our proposed algorithm. Roux et al. (2012); Johnson and Zhang (2013) proved that stochastic gradient descent(SGD) with variance reduction is able to converge to global minimum much faster than SGD in convex setting. In the nonconvex setting, Reddi et al. (2016a); AllenZhu and Hazan (2016) show that stochastic variancereduced gradient descent (SVRG) is able to converge to firstorder stationary point with the same convergence rate as gradient descent, yet with an improvement in gradient complexity.
The remainder of this paper is organized as follows: we present our proposed algorithm in Section 2. In Section 3, we present our theoretical analysis of the proposed algorithm and compare it with the stateoftheart Cubic Regularization methods. We conduct thorough numerical experiments on different nonconvex optimization problems and on different real world datasets to validate our theory in Section 4. We conclude our work in Section 5.
Notation: We use if , where is a constant independent of any parameters in our algorithm. We use to hide polynomial logarithm terms. We use to denote the 2norm of vector . For symmetric matrix , we use and to denote the spectral norm and Schatten  norm of . We denote the smallest eigenvalue of to be .
2 The Proposed Algorithm
In this section, we present our proposed algorithm LiteSVRC. As is displayed in Algorithm 1, our algorithm has epochs with each epoch length . At the beginning of the th epoch, we calculate the gradient and Hessian of as ‘reference’ of our algorithm, denoted by and respectively. Unlike CR which needs to calculate the full gradient and Hessian at each iteration, we only need to calculate them every iterations.
At the th iteration of the th epoch, we need to solve the CR subproblem defined in (1.3). Since the computational cost of and is expensive, we use the following semistochastic gradient and Hessian instead
(2.1)  
(2.2) 
where is the reference point at which and are computed, and are sampling index sets (with replacement), and are sizes of and . Note that similar semistochastic gradient and Hessian have been proposed in Johnson and Zhang (2013); Xiao and Zhang (2014) and Gower et al. (2017); Wai et al. (2017); Zhou et al. (2018); Wang et al. (2018) respectively. We choose the minibatch sizes of stochastic gradient and stochastic Hessian for Algorithm 1 as follows:
(2.3) 
where are two constants only depending on and .
Compared with the SVRC algorithm proposed in Zhou et al. (2018), our algorithm uses a lite version of semistochastic gradient (Johnson and Zhang, 2013; Xiao and Zhang, 2014), instead of the sophisticated one with Hessian information proposed in Zhou et al. (2018). Note that the additional Hessian information in the semistochastic gradient in Zhou et al. (2018) actually increases the Hessian sample complexity. Therefore, with the goal of reducing the Hessian sample complexity, the standard semistochastic gradient (Johnson and Zhang, 2013; Xiao and Zhang, 2014) used in this paper is more favored.
On the other hand, there are two major differences between our algorithm and the SVRC algorithms proposed in Wang et al. (2018): (1) our algorithm uses a constant Hessian minibatch size instead of an adaptive one in each iteration, and thus the parameter tuning of our algorithm is much easier. In sharp contrast, the minibatch size of the stochastic Hessian in the algorithm proposed by Wang et al. (2018) is dependent on the next iterate, which makes the update an implicit one and it is hard to tune the parameters in practice; and (2) our algorithm does not need to compute the minimum eigenvalue of the Hessian in each iteration, and thus really reduces the Hessian sample complexity as well as runtime complexity in practice. In contrast, the algorithm in Wang et al. (2018) needs to calculate the minimum eigenvalue of the Hessian as a stopping criteria in each iteration, which actually incurs additional Hessian sample complexity.
3 Main Theory
In this section, we present our theoretical results on the Hessian sample complexity of LiteSVRC.
We start with the following assumptions that are needed throughout our analysis:
Assumption 3.1 (Gradient Lipschitz).
There exists a constant , such that for all and
Assumption 3.2 (Hessian Lipschitz).
There exists a constant , such that for all and
These two assumptions are mild and widely used in the line of research for finding approximate global minima (Carmon and Duchi, 2016; Carmon et al., 2016; Agarwal et al., 2017; Wang et al., 2018). Next we present two key definitions, which play important roles in our analysis:
Definition 3.3.
We define the optimal gap as
(3.1) 
Definition 3.4.
Let be the iterate defined in Algorithm 1, where and . We define as follows:
(3.2) 
Definition 3.4 appears in Nesterov and Polyak (2006) with a slightly different form, which is used to describe how much a point is similar to a true local minimum. Recall the definition of approximate local minima in (1.2), it is easy to show the following fact: if holds for any , then is an approximate local minimum if and only if . We note that similar argument is also made in Zhou et al. (2018).
From now on, we will focus on bounding , which is equivalent to finding the approximate local minimum. The following theorem spells out the upper bound of .
Theorem 3.5.
Under Assumptions 3.1 and 3.2, suppose that and . Let be arbitrary chosen parameters, and and are positive parameters satisfying following induction equations for all and :
(3.3)  
(3.4)  
where are absolute constants. Then the output of Algorithm 1 satisfies the following inequality
(3.5) 
where is defined as follows
and is an absolute constant.
Remark 3.6.
Theorem 3.5 suggests that with a fixed number of inner loops , if we run Algorithm 1 for sufficiently large epochs, then we have a point sequence where . That being said, will converge to a local minimum, which is consistent with the convergence analysis in existing related work (Nesterov and Polyak, 2006; Kohler and Lucchi, 2017; Wang et al., 2018).
Now we give a specific choice of parameters mentioned in Theorem 3.5 to derive the total Hessian sample complexity of Algorithm 1.
Corollary 3.7.
Now we provide a comprehensive comparison between our algorithm and other related algorithms in Table 4. The algorithm proposed in Wang et al. (2018) has two versions: sample with replacement and sample without replace. For the completeness, we present both versions in Wang et al. (2018). From Table 4 we can see that LiteSVRC strictly outperforms CR by a factor of . LiteSVRC also outperforms SCR when , which suggests that the variance reduction scheme makes LiteSVRC perform better in the high accuracy regime. More importantly, our proposed LiteSVRC does not rely on the assumption that the function is Lipschitz continuous, which is required by the algorithm proposed in Wang et al. (2018). So in terms of Hessian sample complexity, our algorithm directly improves that of Wang et al. (2018) by a factor of .
4 Experiments
In this section, we conduct experiments on real world datasets to support our theoretical analysis of the proposed LiteSVRC algorithm. Following Zhou et al. (2018), we investigate two nonconvex problems on three different datasets, a9a (sample size: , dimension: ), ijcnn1 (sample size: , dimension: ) and covtype (sample size: , dimension: ), which are all common datasets used in machine learning.
algorithm  periteration  total  function  gradient  Hessian 
Lipschitz  Lipschitz  Lipschitz  
CR  No  No  Yes  
Nesterov and Polyak (2006)  
SCR  Yes  Yes  
Kohler and Lucchi (2017)  
Xu et al. (2017a)  
SVRC  No  No  Yes  
Zhou et al. (2018)^{4}^{4}4  
SVRC  Yes  Yes  Yes  
Wang et al. (2018)^{5}^{5}5  
SVRC  Yes  Yes  Yes  
Wang et al. (2018)^{5}^{5}5  
LiteSVRC  No  Yes  Yes  
(This paper) 
4.1 Baseline Algorithms
To evaluate our proposed algorithm, we compare the proposed LiteSVRC with the following baseline algorithms: (1) trustregion Newton methods (denoted by TR) Conn et al. (2000); (2) Adaptive Cubic regularization (Cartis et al., 2011a, b); (3) Subsampled Cubic regularization (Kohler and Lucchi, 2017); (4) Gradient Cubic regularization (Carmon and Duchi, 2016); (5) Stochastic Cubic regularization (Tripuraneni et al., 2017); (6) SVRC proposed in Zhou et al. (2018); (7) SVRCwithout proposed in Wang et al. (2018). Note that there are two versions of SVRC algorithm proposed in Wang et al. (2018), and the one based on sampling without replacement performs better in both theory and experiments, we therefore only compare with this one, which is denoted by SVRCwithout.
4.2 Implementation Details
For Subsampled Cubic and SVRCwithout, the sample size is dependent on (Kohler and Lucchi, 2017) and is dependent on (Wang et al., 2018), which make these two algorithms implicit algorithms. To address this issue, we follow the suggestion in Kohler and Lucchi (2017); Wang et al. (2018) and use and instead of and . Furthermore, we choose the penalty parameter for SVRC, SVRCwithout and LiteSVRC as constants which are suggested by the original papers of these algorithms. Finally, to solve the CR subproblem in each iteration, we choose to solve the subproblem approximately in the Krylov subspace spanned by Hessian related vectors, as used by Kohler and Lucchi (2017).
In the experiment, we choose two nonconvex regression problem as our objectives. Both of them consist of a loss function (can be nonconvex) and the following nonconvex regularizer
(4.1) 
where are the control parameters and is the th coordinate of . This regularizer has been widely used in nonconvex regression problem, which can be regarded as a special example of robust nonlinear regression (Reddi et al., 2016b; Kohler and Lucchi, 2017; Zhou et al., 2018; Wang et al., 2018).
4.3 Logistic Regression with Nonconvex Regularizer
The first problem is a binary logistic regression problem with a nonconvex regularizer . Given training data and label , , our goal is to solve the following optimization problem:
(4.2) 
where is the sigmoid function and and are the parameters that are used to define the nonconvex regularizers in (4.1) and are set differently for each dataset. In detail, we set for all three datasets, and set for a9a, ijcnn1 and covtype datasets respectively.
The experiment results on the binary logistic regression problem are displayed in Figure 1. The first row of the figure shows the plots of function value gap v.s. Hessian sample complexity of all the compared algorithms, and the second row presents the plots of function value gap v.s. CPU runtime (in second) of all the algorithms. It can be seen from Figure 1 that LiteSVRC performs the best among all algorithms regarding both sample complexity of Hessian and runtime on all three datasets, which is consistent with our theoretical analysis. We remark that SVRC performs the second best in most settings in terms of both Hessian sample complexity and runtime. It should also be noted that although SVRCwithout is also a variancereduced method similar to LiteSVRC and SVRC, it indeed performs much worse than other methods, because as we pointed out in the introduction, it needs to compute the minimum eigenvalue of the Hessian in each iteration, which actually makes the Hessian sample complexity even worse than Subsampled Cubic, let alone the runtime complexity.
4.4 Nonlinear Least Square with Nonconvex Regularizer
In this subsection, we consider another problem, namely, the nonlinear least square problem with a nonconvex regularizer defined in (4.1). The nonlinear least square problem is also studied in Xu et al. (2017b); Zhou et al. (2018). Given training data and , , our goal is to minimize the following problem
(4.3) 
Here is again the sigmoid function. The parameters and in the nonconvex regularizer for different datasets are set as follows: we set for all three datasets, and set for a9a, ijcnn1 and covtype datasets respectively. The experiment results are summarized in Figure 2, where the first row shows the plots of function value gap v.s. Hessian sample complexity and the second row presents the plots of function value v.s. CPU runtime (in second). It can be seen that LiteSVRC again achieves the best performance among all other algorithms regarding to both sample complexity of Hessian and runtime when the required precision is high, which supports our theoretical analysis again. SVRC performs the second best.
5 Conclusions
In this paper, we propose a new algorithm called LiteSVRC, which achieves lower sample complexity of Hessian compared with existing variance reduction based cubic regularization algorithms (Zhou et al., 2018; Wang et al., 2018). Extensive experiments on various nonconvex optimization problems and datasets validate our theory.
Appendix A Proof of the Main Theory
a.1 Proof of Theorem 3.5
Since our algorithm consists of inner loops and outer loops, we mainly focus on the analysis of one single step which is the th iteration in the th epoch, where .
Similar to other CR related work (Nesterov and Polyak, 2006; Cartis et al., 2011a; Kohler and Lucchi, 2017), our ultimate goal is to prove the following statement of one single loop:
(A.1) 
If (A.1) holds, then we just take summation over and in the above inequality, which yields the final result of Theorem 3.5. Unfortunately, (A.1) does not hold in general because of the existence of randomness in our algorithm. Nevertheless, by borrowing the idea from the analysis of SVRG in nonconvex setting (Reddi et al., 2016a), we propose to replace the function in (A.1) with the following Lyapunov function:
(A.2) 
where are parameters defined in Theorem 3.5. With the Lyapunov function in (A.2), we are able to prove the following key lemma that resembles (A.1) and holds in expectation:
Lemma A.1.
With Lemma A.1, we are ready to deliver the proof of our main theory.
Proof of Theorem 3.5.
a.2 Proof of Corollary 3.7
In this section, we provide the proof of our corollary for the sample complexity of LiteSVRC. To prove Corollary 3.7, we need following lemma:
Lemma A.2.
Proof of Corollary 3.7.
Since we already have by the parameter choice in Lemma A.2, we only need to make sure that . Take and , it is sufficient to let , where is a constant. Thus, as we need to sample Hessian at the beginning of each inner loop, and in each inner loop, we need to sample Hessian, then the total sample complexity of Hessian for Algorithm 1 is .
∎
Appendix B Proof of the Key Lemmas
b.1 Proof of Lemma a.1
For simplification, we denote and . In this section, we prove the key lemma about the Lyapunov function (A.2) used in the proof of our main theory. We define for the simplification of notation:
(B.1) 
where . Before we state the proof, we present some technical lemmas that are useful in our analysis.
Firstly, we give a sharp bound of . A very crucial observation is that we can bound the norm of gradient and the smallest eigenvalue of Hessian with , and defined in (B.1). Formally, we have the following lemma:
Lemma B.1.
Lemma B.1 suggests that to bound our target , we only need to focus on and .
Secondly, we bound . We first notice that can be bounded with and . Such bound can be derived straightly from Hessian Lipschitz condition:
Lemma B.2.
We also give following result to show how to bound with :
Lemma B.3.
Based on Lemmas B.1, B.2 and B.3, we have established the connection between and with only and .
Finally, we bound and with consequent vector and matrix concentration inequalities. Previous analysis of variancereduced method in nonconvex setting for firstorder method which only focus on the upper bound of variance of gives an upper bound only associated with , which guarantees the variance reduction (Reddi et al., 2016a; AllenZhu and Hazan, 2016). In our proof, we also need to bound the variance for stochastic Hessian . Thus we have following two lemmas: