Sample Efficient Stochastic Variance-Reduced Cubic Regularization Method


Dongruo Zhou, Pan Xu, and Quanquan Gu
Department of Computer Science, University of California, Los Angeles, CA 90095, USA
e-mail: drzhou@cs.ucla.edu, panxu@cs.ucla.edu, qgu@cs.ucla.edu
May 18, 2018

(The first version of this paper was submitted to UAI 2018 on March 9, 2018. This is the second version, with improved presentation and additional baselines in the experiments, submitted to NeurIPS 2018 on May 18, 2018.)
Abstract

We propose a sample-efficient stochastic variance-reduced cubic regularization (Lite-SVRC) algorithm for finding local minima efficiently in nonconvex optimization. The proposed algorithm achieves a lower sample complexity of Hessian matrix computation than existing cubic regularization based methods. At the heart of our analysis are the choice of a constant batch size for the Hessian matrix computation at each iteration and stochastic variance reduction techniques. In detail, for a nonconvex function with n component functions, Lite-SVRC converges to the local minimum within Õ(n + n^{2/3}/ε^{3/2}) Hessian sample complexity, where Õ hides poly-logarithmic factors, which is faster than all existing cubic regularization based methods. Numerical experiments with different nonconvex optimization problems conducted on real datasets validate our theoretical results.

1 Introduction

We study the following unconstrained finite-sum nonconvex optimization problem:

min_{x ∈ R^d} F(x) := (1/n) Σ_{i=1}^n f_i(x),    (1.1)

where each f_i: R^d → R is a general nonconvex function. Such nonconvex optimization problems are ubiquitous in machine learning, including training deep neural networks (LeCun et al., 2015), robust linear regression (Yu and Yao, 2017) and nonconvex regularized logistic regression (Reddi et al., 2016b). In principle, finding the global minimum of (1.1) is generally NP-hard (Hillar and Lim, 2013) due to the lack of convexity.

Instead of finding the global minimum, various algorithms have been developed in the literature (Nesterov and Polyak, 2006; Cartis et al., 2011a; Carmon and Duchi, 2016; Agarwal et al., 2017; Xu et al., 2018; Allen-Zhu and Li, 2018) to find an approximate local minimum of (1.1). In particular, a point x is said to be an (ε_g, ε_H)-approximate local minimum of F if

‖∇F(x)‖₂ ≤ ε_g,    λ_min(∇²F(x)) ≥ −ε_H,    (1.2)

where ε_g, ε_H > 0 are predefined precision parameters. It has been shown that such approximate local minima can be as good as global minima for some problems. For instance, Ge et al. (2016) proved that any local minimum is actually a global minimum in matrix completion problems. Therefore, developing algorithms that find an approximate local minimum is of great interest in both theory and practice.
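Condition (1.2) is straightforward to check numerically when the gradient and Hessian are available. The following minimal sketch (names and the toy inputs are illustrative, not from the paper) tests both parts of the condition:

```python
import numpy as np

def is_approx_local_min(grad, hess, eps_g, eps_h):
    """Check the approximate local minimum condition (1.2):
    ||grad||_2 <= eps_g and lambda_min(hess) >= -eps_h."""
    lam_min = np.linalg.eigvalsh(hess)[0]  # eigvalsh returns eigenvalues in ascending order
    return bool(np.linalg.norm(grad) <= eps_g and lam_min >= -eps_h)

# A point with a tiny gradient and a positive-definite Hessian qualifies;
# a strict saddle (a direction of negative curvature) does not.
print(is_approx_local_min(np.array([1e-4, 0.0]), np.eye(2), 1e-3, 1e-3))
print(is_approx_local_min(np.zeros(2), np.diag([1.0, -1.0]), 1e-3, 1e-3))
```

The second call fails the curvature test because the Hessian has eigenvalue −1, well below −ε_H.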

A very important and popular method for finding approximate local minima is the cubic-regularized (CR) Newton method, originally introduced by Nesterov and Polyak (2006). Generally speaking, at the k-th iteration, CR solves a sub-problem that minimizes a cubic-regularized second-order Taylor expansion at the current iterate x_k. The update rule can be written as follows:

h_k = argmin_{h ∈ R^d} ⟨∇F(x_k), h⟩ + (1/2)⟨∇²F(x_k) h, h⟩ + (M/6)‖h‖₂³,    (1.3)
x_{k+1} = x_k + h_k,    (1.4)

where M > 0 is the penalty parameter used in CR. Nesterov and Polyak (2006) proved that to find an (ε, √ε)-approximate local minimum of a nonconvex function F, CR requires at most O(ε^{−3/2}) iterations. However, a main drawback of CR is that it needs to sample n individual Hessian matrices to form the exact Hessian used in (1.3), which leads to an O(n/ε^{3/2}) total Hessian sample complexity, i.e., number of queries to the stochastic Hessian ∇²f_i(x) for some i and x. Such a computational cost is extremely expensive when n is large, as is the case in many large-scale machine learning problems.
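A cubic-regularized step of the form (1.3)-(1.4) can be sketched numerically. The solver below is a simple fixed-point iteration on the stationarity condition of the cubic model, assuming a well-conditioned positive-definite Hessian; it is not the solver used in the paper (practical implementations use Krylov-subspace or eigenvalue-based methods, e.g. Cartis et al., 2011a):

```python
import numpy as np

def cubic_subproblem(g, H, M, iters=200):
    """Approximately minimize the cubic model
        m(h) = <g, h> + 0.5 h'Hh + (M/6)||h||^3
    via its stationarity condition (H + (M/2)||h|| I) h = -g,
    iterated as a fixed point on r = ||h||. A sketch for
    well-conditioned positive-definite H."""
    d = g.shape[0]
    r = 0.0
    h = np.zeros(d)
    for _ in range(iters):
        h = np.linalg.solve(H + 0.5 * M * r * np.eye(d), -g)
        r = np.linalg.norm(h)
    return h

def cr_step(x, grad, hess, M):
    """One cubic-regularized Newton update in the spirit of (1.3)-(1.4)."""
    return x + cubic_subproblem(grad(x), hess(x), M)
```

For g = (1, 0), H = I and M = 3, the stationarity condition reduces to the scalar equation 1.5 r² + r − 1 = 0, whose positive root r ≈ 0.5486 the iteration recovers.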

To overcome the computational burden of CR based methods, some recent studies have proposed to use a sub-sampled Hessian instead of the full Hessian (Kohler and Lucchi, 2017; Xu et al., 2017a) to reduce the Hessian complexity. In detail, Kohler and Lucchi (2017) proposed a sub-sampled cubic-regularized Newton method (SCR), which uses a subsampled Hessian instead of the full Hessian to reduce the per-iteration sample complexity of Hessian evaluations. Xu et al. (2017a) provided a refined convergence analysis of SCR, as well as of a subsampled trust-region algorithm (Conn et al., 2000). Nevertheless, SCR bears a much slower convergence rate than the original CR method, and the total Hessian sample complexity for SCR to achieve an (ε, √ε)-approximate local minimum is Õ(1/ε^{5/2}). This suggests that the computational cost of SCR could be even worse than that of CR when n ≤ O(1/ε).

In order to retain the fast convergence rate of CR while enjoying the computational efficiency of SCR, Zhou et al. (2018) proposed a stochastic variance-reduced cubic-regularized Newton method (SVRC) to further improve the convergence rate of stochastic CR methods. At the core of SVRC are an innovative semi-stochastic gradient and a semi-stochastic Hessian (Gower et al., 2017; Wai et al., 2017). They proved that SVRC achieves an (ε, √ε)-approximate local minimum with Õ(n^{4/5}/ε^{3/2}) second-order oracle complexity, which is defined as the number of queries to the second-order oracle, i.e., the triplet (f_i(x), ∇f_i(x), ∇²f_i(x)). However, the second-order oracle complexity is dominated by the maximum number of queries to one of the elements of the triplet, and therefore is not always accurate in reflecting the true computational cost. For instance, Algorithm A may have a higher second-order oracle complexity than Algorithm B because it queries more stochastic gradients, while querying far fewer stochastic Hessians. Given that the computational cost of a stochastic Hessian is O(d²) while that of a stochastic gradient is only O(d), Algorithm A can in fact be more efficient than Algorithm B. In other words, an algorithm with higher second-order oracle complexity is not necessarily slower than one with lower second-order oracle complexity. It is therefore more reasonable to use the Hessian sample complexity to evaluate the efficiency of cubic regularization methods when the dimension d is not small. Recently, Wang et al. (2018) proposed another variance-reduced stochastic cubic regularization algorithm with a reduced Hessian sample complexity for converging to an (ε, √ε)-approximate local minimum. (Their analysis in fact misses an additional Hessian sample cost, since their algorithm needs to calculate the minimum eigenvalue of the Hessian as a stopping criterion at each iteration.)

In this paper, in order to reduce the Hessian sample complexity, we develop a sample efficient stochastic variance-reduced cubic-regularized Newton method called Lite-SVRC, which significantly reduces the sample complexity of Hessian matrix evaluations in stochastic CR methods. In detail, under milder conditions, we prove that Lite-SVRC achieves a lower Hessian sample complexity than existing cubic regularization based methods. Numerical experiments with different types of nonconvex optimization problems on various real datasets are conducted to validate our theoretical results.

We summarize our major contributions as follows:

  • The proposed Lite-SVRC algorithm only requires a constant batch size of Hessian evaluations at each iteration. In contrast, the batch size of Hessian evaluations at each iteration in Wang et al. (2018) is implicitly chosen based on the update of the next iterate.

  • We prove that Lite-SVRC converges to an (ε, √ε)-approximate local minimum of a nonconvex function within Õ(n + n^{2/3}/ε^{3/2}) Hessian sample complexity, which outperforms all the state-of-the-art cubic regularization algorithms, including those of Zhou et al. (2018) and Wang et al. (2018).

  • Last but not least, our results do not require Lipschitz continuity of F itself, which directly improves the results in Wang et al. (2018) that rely on this additional assumption.

1.1 Additional Related Work

Cubic Regularization and Trust-region Newton Methods The traditional Newton method in the convex setting has been widely studied in past decades (Bennett, 1916; Bertsekas, 1999). In the nonconvex setting, building upon the cubic-regularized Newton method (Nesterov and Polyak, 2006), Cartis et al. (2011a) proposed a practical framework for cubic regularization that uses an adaptive cubic penalty parameter and an approximate cubic sub-problem solver. Carmon and Duchi (2016); Agarwal et al. (2017) presented two fast cubic-regularized methods that use only gradients and Hessian-vector products to solve the cubic sub-problem. Tripuraneni et al. (2017) developed a stochastic cubic regularization algorithm based on Kohler and Lucchi (2017), where only gradients and Hessian-vector products are used. Another line of related research is trust-region Newton methods (Conn et al., 2000; Carrizo et al., 2016; Curtis et al., 2017a, b), which have performance guarantees comparable to cubic regularization methods.

Finding Approximate Local Minima There is another line of work which focuses on finding approximate local minima using the negative curvature. Ge et al. (2015); Jin et al. (2017a) showed that (stochastic) gradient descent with an injected uniform noise over a small ball is able to converge to approximate local minima. Carmon et al. (2016); Royer and Wright (2017); Allen-Zhu (2017) showed that one can find approximate local minima faster than first-order methods by using Hessian vector product to extract information of negative curvature. Xu et al. (2018); Allen-Zhu and Li (2018); Jin et al. (2017b) further proved that gradient methods with bounded perturbation noise are also able to find approximate local minima faster than the first-order methods.

Variance Reduction Variance reduction techniques play an important role in our proposed algorithm. Roux et al. (2012); Johnson and Zhang (2013) proved that stochastic gradient descent (SGD) with variance reduction converges to the global minimum much faster than SGD in the convex setting. In the nonconvex setting, Reddi et al. (2016a); Allen-Zhu and Hazan (2016) showed that stochastic variance-reduced gradient descent (SVRG) converges to a first-order stationary point at the same rate as gradient descent, yet with an improved gradient complexity.

The remainder of this paper is organized as follows: we present our proposed algorithm in Section 2. In Section 3, we present our theoretical analysis of the proposed algorithm and compare it with the state-of-the-art Cubic Regularization methods. We conduct thorough numerical experiments on different nonconvex optimization problems and on different real world datasets to validate our theory in Section 4. We conclude our work in Section 5.

Notation: We write a_n = O(b_n) if a_n ≤ C b_n, where C is a constant independent of any parameters in our algorithm. We use Õ(·) to hide poly-logarithmic factors. We use ‖v‖₂ to denote the 2-norm of a vector v. For a symmetric matrix H, we use ‖H‖₂ and ‖H‖_{S_p} to denote the spectral norm and the Schatten p-norm of H. We denote the smallest eigenvalue of H by λ_min(H).

2 The Proposed Algorithm

1:  Input: batch size parameters , penalty parameter , , initial point .
2:  for  do
3:     
4:     
5:     
6:     
7:     for  do
8:        
9:        
10:        Sample index set uniformly with replacement,
11:        
12:        
13:        
14:        
15:     end for
16:     
17:  end for
18:  Output: Uniformly randomly choose one as , for and .
Algorithm 1 Sample efficient stochastic variance-reduced cubic regularization method (Lite-SVRC)

In this section, we present our proposed algorithm, Lite-SVRC. As displayed in Algorithm 1, the algorithm runs for S epochs, each of length T. At the beginning of the s-th epoch, we calculate the full gradient and Hessian of F at the current reference point as the 'reference' of our algorithm, denoted by g̃ and H̃ respectively. Unlike CR, which needs to calculate the full gradient and Hessian at each iteration, we only need to calculate them once every T iterations.

At the t-th iteration of the s-th epoch, we need to solve the CR sub-problem defined in (1.3). Since computing the full gradient ∇F and Hessian ∇²F is expensive, we use the following semi-stochastic gradient and Hessian instead:

v_t = g̃ + (1/b_g) Σ_{i ∈ I_g} [∇f_i(x_t) − ∇f_i(x̃)],    (2.1)
U_t = H̃ + (1/b_h) Σ_{j ∈ I_h} [∇²f_j(x_t) − ∇²f_j(x̃)],    (2.2)

where x̃ is the reference point at which g̃ and H̃ are computed, I_g and I_h are sampling index sets (sampled with replacement), and b_g and b_h are the sizes of I_g and I_h. Note that similar semi-stochastic gradients and Hessians have been proposed in Johnson and Zhang (2013); Xiao and Zhang (2014) and Gower et al. (2017); Wai et al. (2017); Zhou et al. (2018); Wang et al. (2018), respectively. We choose the minibatch sizes of the stochastic gradient and stochastic Hessian in Algorithm 1 as follows:

(2.3)

where the two constants in (2.3) depend only on problem-dependent quantities.
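The estimators (2.1)-(2.2) can be sketched on a toy finite sum. The quadratic components below are illustrative (not from the paper); for quadratics the Hessian correction term vanishes, which makes the behavior easy to verify:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite sum: f_i(x) = 0.5 x'A_i x + b_i'x, so grad f_i(x) = A_i x + b_i
# and hess f_i(x) = A_i.
n, d = 20, 3
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2.0  # symmetrize
b = rng.standard_normal((n, d))

def grad_i(i, x):
    return A[i] @ x + b[i]

def hess_i(i, x):
    return A[i]

def semi_stochastic(x, x_ref, b_g, b_h):
    """Semi-stochastic gradient (2.1) and Hessian (2.2): correct the full
    gradient/Hessian computed at the reference point x_ref with minibatch
    differences sampled with replacement."""
    g_ref = np.mean([grad_i(i, x_ref) for i in range(n)], axis=0)
    H_ref = np.mean([hess_i(i, x_ref) for i in range(n)], axis=0)
    I_g = rng.integers(0, n, size=b_g)
    I_h = rng.integers(0, n, size=b_h)
    v = g_ref + np.mean([grad_i(i, x) - grad_i(i, x_ref) for i in I_g], axis=0)
    U = H_ref + np.mean([hess_i(j, x) - hess_i(j, x_ref) for j in I_h], axis=0)
    return v, U
```

Both estimators are unbiased; at x = x_ref the correction terms vanish and the estimates coincide with the exact full gradient and Hessian.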

Compared with the SVRC algorithm proposed in Zhou et al. (2018), our algorithm uses a lite version of the semi-stochastic gradient (Johnson and Zhang, 2013; Xiao and Zhang, 2014) instead of the more sophisticated one with Hessian information proposed in Zhou et al. (2018). Note that the additional Hessian information in the semi-stochastic gradient of Zhou et al. (2018) actually increases the Hessian sample complexity. Therefore, with the goal of reducing the Hessian sample complexity, the standard semi-stochastic gradient (Johnson and Zhang, 2013; Xiao and Zhang, 2014) used in this paper is preferable.

On the other hand, there are two major differences between our algorithm and the SVRC algorithms proposed in Wang et al. (2018): (1) our algorithm uses a constant Hessian minibatch size instead of an adaptive one at each iteration, and thus the parameter tuning of our algorithm is much easier. In sharp contrast, the minibatch size of the stochastic Hessian in the algorithm proposed by Wang et al. (2018) depends on the next iterate, which makes the update implicit and the parameters hard to tune in practice; and (2) our algorithm does not need to compute the minimum eigenvalue of the Hessian at each iteration, and thus truly reduces the Hessian sample complexity as well as the runtime in practice. In contrast, the algorithm in Wang et al. (2018) needs to calculate the minimum eigenvalue of the Hessian as a stopping criterion at each iteration, which incurs additional Hessian sample complexity.
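The overall loop structure of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cubic sub-problem is solved crudely by gradient descent on the model, and all hyperparameter values are illustrative rather than the theoretical choices:

```python
import numpy as np

def lite_svrc(grad_i, hess_i, n, x0, S=5, T=3, b_g=8, b_h=4, M=10.0, seed=0):
    """Sketch of the Lite-SVRC loop: refresh the full gradient/Hessian
    once per epoch of T inner iterations; in between, drive a
    cubic-regularized Newton step with constant-batch semi-stochastic
    estimates in the spirit of (2.1)-(2.2)."""
    rng = np.random.default_rng(seed)
    d = x0.shape[0]
    x = x0.astype(float).copy()
    for _ in range(S):                      # outer loop (epochs)
        x_ref = x.copy()                    # reference ("snapshot") point
        g_ref = np.mean([grad_i(i, x_ref) for i in range(n)], axis=0)
        H_ref = np.mean([hess_i(i, x_ref) for i in range(n)], axis=0)
        for _ in range(T):                  # inner loop
            I_g = rng.integers(0, n, size=b_g)   # sampled with replacement
            I_h = rng.integers(0, n, size=b_h)
            v = g_ref + np.mean([grad_i(i, x) - grad_i(i, x_ref) for i in I_g], axis=0)
            U = H_ref + np.mean([hess_i(j, x) - hess_i(j, x_ref) for j in I_h], axis=0)
            h = np.zeros(d)                 # crude solver for the cubic model
            for _ in range(300):
                model_grad = v + U @ h + 0.5 * M * np.linalg.norm(h) * h
                h -= 0.05 * model_grad
            x = x + h
    return x
```

On a toy objective F(x) = (1/n) Σ_i 0.5 ‖x − c_i‖², whose minimizer is the mean of the c_i, the sketch converges to the minimizer.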

3 Main Theory

In this section, we present our theoretical results on the Hessian sample complexity of Lite-SVRC.

We start with the following assumptions that are needed throughout our analysis:

Assumption 3.1 (Gradient Lipschitz).

There exists a constant L > 0 such that for all x, y ∈ R^d and all i ∈ [n], ‖∇f_i(x) − ∇f_i(y)‖₂ ≤ L‖x − y‖₂.

Assumption 3.2 (Hessian Lipschitz).

There exists a constant ρ > 0 such that for all x, y ∈ R^d and all i ∈ [n], ‖∇²f_i(x) − ∇²f_i(y)‖₂ ≤ ρ‖x − y‖₂.

These two assumptions are mild and widely used in the line of research on finding approximate local minima (Carmon and Duchi, 2016; Carmon et al., 2016; Agarwal et al., 2017; Wang et al., 2018). Next we present two key definitions, which play important roles in our analysis:

Definition 3.3.

We define the optimal gap Δ_F as

Δ_F := F(x_0) − inf_{x ∈ R^d} F(x).    (3.1)
Definition 3.4.

Let x_t be the iterates defined in Algorithm 1. We define μ(x) as follows:

μ(x) := max{ ‖∇F(x)‖₂^{3/2}, −λ_min(∇²F(x))³ / ρ^{3/2} }.    (3.2)

Definition 3.4 appears in Nesterov and Polyak (2006) in a slightly different form, and it describes how close a point is to a true local minimum. Recalling the definition of approximate local minima in (1.2), it is easy to show the following fact: with ε_H = √(ρ ε_g), a point x is an approximate local minimum if and only if μ(x) ≤ ε_g^{3/2}. We note that a similar argument is also made in Zhou et al. (2018).

From now on, we will focus on bounding μ(x_out), which is equivalent to finding an approximate local minimum. The following theorem spells out the upper bound of μ(x_out).

Theorem 3.5.

Under Assumptions 3.1 and 3.2, suppose that and . Let be arbitrarily chosen parameters, and let and be positive parameters satisfying the following recursive equations for all and :

(3.3)
(3.4)

where are absolute constants. Then the output of Algorithm 1 satisfies the following inequality

(3.5)

where is defined as follows

and is an absolute constant.

Remark 3.6.

Theorem 3.5 suggests that with a fixed number of inner iterations T, if we run Algorithm 1 for sufficiently many epochs, the right-hand side of (3.5) goes to zero. That is to say, x_out will converge to a local minimum, which is consistent with the convergence analyses in existing related work (Nesterov and Polyak, 2006; Kohler and Lucchi, 2017; Wang et al., 2018).

Now we give a specific choice of parameters mentioned in Theorem 3.5 to derive the total Hessian sample complexity of Algorithm 1.

Corollary 3.7.

Under the same conditions as in Theorem 3.5, let the batch size parameters satisfy and . Set the inner loop parameter and the cubic penalty parameter , where is an absolute constant. Then the output of Algorithm 1 is an approximate local minimum after

Õ(n + n^{2/3}/ε^{3/2})    (3.6)

stochastic Hessian evaluations.

Now we provide a comprehensive comparison between our algorithm and other related algorithms in Table 1. The algorithm proposed in Wang et al. (2018) has two versions: sampling with replacement and sampling without replacement. For completeness, we present both versions. From Table 1 we can see that Lite-SVRC strictly outperforms CR by a factor of n^{1/3}. Lite-SVRC also outperforms SCR when n is not too large relative to a polynomial in 1/ε, which suggests that the variance reduction scheme makes Lite-SVRC perform better in the high-accuracy regime. More importantly, our proposed Lite-SVRC does not rely on the assumption that the function F is Lipschitz continuous, which is required by the algorithm proposed in Wang et al. (2018). So in terms of Hessian sample complexity, our algorithm directly improves upon that of Wang et al. (2018).

4 Experiments

In this section, we conduct experiments on real-world datasets to support our theoretical analysis of the proposed Lite-SVRC algorithm. Following Zhou et al. (2018), we investigate two nonconvex problems on three different datasets, a9a (sample size: 32,561, dimension: 123), ijcnn1 (sample size: 49,990, dimension: 22) and covtype (sample size: 581,012, dimension: 54), which are all common machine learning datasets.

Table 1: Comparison of cubic regularization based methods in terms of per-iteration and total Hessian sample complexity, and the required Lipschitz conditions on the function, gradient and Hessian.

algorithm | per-iteration | total | function Lip. | gradient Lip. | Hessian Lip.
CR (Nesterov and Polyak, 2006) | n | O(n/ε^{3/2}) | No | No | Yes
SCR (Kohler and Lucchi, 2017; Xu et al., 2017a)^a | – | Õ(1/ε^{5/2}) | Yes | Yes | Yes
SVRC (Zhou et al., 2018)^b | – | Õ(n^{4/5}/ε^{3/2}) | No | No | Yes
SVRC, with replacement (Wang et al., 2018)^c | – | – | Yes | Yes | Yes
SVRC, without replacement (Wang et al., 2018)^c | – | – | Yes | Yes | Yes
Lite-SVRC (this paper) | b_h | Õ(n + n^{2/3}/ε^{3/2}) | No | Yes | Yes

^a Although the refined SCR in Xu et al. (2017b) does not need function Lipschitz continuity, the original SCR in Kohler and Lucchi (2017) needs it.
^b We adapt this result directly from the analysis of total second-order oracle calls in Zhou et al. (2018).
^c In Wang et al. (2018), both algorithms need to calculate the minimum eigenvalue of the Hessian at each iteration to decide whether the algorithm should continue, which adds additional Hessian sample complexity. We choose not to include this in the results in the table.

4.1 Baseline Algorithms

To evaluate our proposed algorithm, we compare Lite-SVRC with the following baseline algorithms: (1) the trust-region Newton method (denoted by TR) (Conn et al., 2000); (2) Adaptive Cubic regularization (Cartis et al., 2011a, b); (3) Subsampled Cubic regularization (Kohler and Lucchi, 2017); (4) Gradient Cubic regularization (Carmon and Duchi, 2016); (5) Stochastic Cubic regularization (Tripuraneni et al., 2017); (6) SVRC, proposed in Zhou et al. (2018); and (7) SVRC-without, proposed in Wang et al. (2018). Note that there are two versions of the SVRC algorithm proposed in Wang et al. (2018); since the one based on sampling without replacement performs better in both theory and experiments, we only compare with this version, which is denoted by SVRC-without.



4.2 Implementation Details

For Subsampled Cubic and SVRC-without, the Hessian sample size depends on the current step (Kohler and Lucchi, 2017) or on the next iterate (Wang et al., 2018), which makes these two algorithms implicit. To address this issue, we follow the suggestion in Kohler and Lucchi (2017); Wang et al. (2018) and use the corresponding quantities from the previous iteration instead. Furthermore, we choose the penalty parameters for SVRC, SVRC-without and Lite-SVRC to be the constants suggested by the original papers of these algorithms. Finally, to solve the CR sub-problem at each iteration, we solve it approximately in the Krylov subspace spanned by Hessian-related vectors, as done by Kohler and Lucchi (2017).


In the experiments, we choose two nonconvex regression problems as our objectives. Both consist of a loss function (possibly nonconvex) and the following nonconvex regularizer:

g(x) = λ Σ_{i=1}^d α x_i² / (1 + α x_i²),    (4.1)

where λ, α > 0 are control parameters and x_i is the i-th coordinate of x. This regularizer has been widely used in nonconvex regression problems and can be regarded as a special example of robust nonlinear regression (Reddi et al., 2016b; Kohler and Lucchi, 2017; Zhou et al., 2018; Wang et al., 2018).
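A small sketch of this regularizer and its gradient follows; the exact form of (4.1) used here, g(x) = λ Σ_i α x_i²/(1 + α x_i²), is an assumption based on the commonly used version in this line of work:

```python
import numpy as np

def reg(x, lam, alpha):
    """Nonconvex regularizer in the assumed form of (4.1):
    lam * sum_i alpha*x_i^2 / (1 + alpha*x_i^2). It is bounded by
    lam*d and acts as a smooth surrogate for a scaled l0 penalty."""
    ax2 = alpha * x**2
    return lam * np.sum(ax2 / (1.0 + ax2))

def reg_grad(x, lam, alpha):
    """Coordinate-wise derivative:
    d/dx_i [alpha x_i^2 / (1 + alpha x_i^2)] = 2 alpha x_i / (1 + alpha x_i^2)^2."""
    return lam * 2.0 * alpha * x / (1.0 + alpha * x**2) ** 2
```

A finite-difference check confirms the closed-form gradient.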

(a) a9a (b) ijcnn1 (c) covtype (d) a9a (e) ijcnn1 (f) covtype
Figure 1: Function value gap of different algorithms for nonconvex regularized logistic regression problems on different datasets. (a)-(c) are plotted w.r.t. Hessian sample complexity. (d)-(f) are plotted w.r.t. CPU runtime.

4.3 Logistic Regression with Nonconvex Regularizer

The first problem is binary logistic regression with the nonconvex regularizer (4.1). Given training data x_i ∈ R^d with labels y_i, i = 1, …, n, our goal is to solve the following optimization problem:

min_{w ∈ R^d} (1/n) Σ_{i=1}^n −[y_i log φ(x_i^⊤ w) + (1 − y_i) log(1 − φ(x_i^⊤ w))] + λ Σ_{i=1}^d α w_i² / (1 + α w_i²),    (4.2)

where φ(z) = 1/(1 + e^{−z}) is the sigmoid function, and λ and α are the parameters that define the nonconvex regularizer in (4.1) and are set differently for each dataset. In detail, we use the same value of α for all three datasets and set λ separately for the a9a, ijcnn1 and covtype datasets.
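The objective can be sketched directly; the label convention y_i ∈ {0, 1} and the exact regularizer form are assumptions of this sketch, not guaranteed to match the paper's setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_obj(w, X, y, lam, alpha):
    """Nonconvex-regularized logistic loss in the spirit of (4.2):
    F(w) = -(1/n) sum_i [y_i log s_i + (1-y_i) log(1-s_i)] + reg(w),
    with s_i = sigmoid(x_i' w) and the assumed regularizer (4.1)."""
    s = sigmoid(X @ w)
    loss = -np.mean(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))
    aw2 = alpha * w**2
    return loss + lam * np.sum(aw2 / (1.0 + aw2))
```

At w = 0 every prediction is s_i = 1/2 and the regularizer vanishes, so F(0) = log 2 regardless of the data, which gives a quick sanity check.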

The experimental results on the binary logistic regression problem are displayed in Figure 1. The first row of the figure shows the plots of function value gap vs. Hessian sample complexity for all the compared algorithms, and the second row presents the plots of function value gap vs. CPU runtime (in seconds). It can be seen from Figure 1 that Lite-SVRC performs the best among all algorithms with respect to both Hessian sample complexity and runtime on all three datasets, which is consistent with our theoretical analysis. We remark that SVRC performs second best in most settings in terms of both Hessian sample complexity and runtime. It should also be noted that although SVRC-without is a variance-reduced method similar to Lite-SVRC and SVRC, it performs much worse than the other methods: as we pointed out in the introduction, it needs to compute the minimum eigenvalue of the Hessian at each iteration, which makes its Hessian sample complexity even worse than that of Subsampled Cubic, let alone its runtime.



(a) a9a (b) ijcnn1 (c) covtype (d) a9a (e) ijcnn1 (f) covtype
Figure 2: Function value gap of different algorithms for nonlinear least square problems on different datasets. (a)-(c) are plotted w.r.t. Hessian sample complexity. (d)-(f) are plotted w.r.t. CPU runtime.

4.4 Nonlinear Least Square with Nonconvex Regularizer

In this subsection, we consider another problem, namely the nonlinear least square problem with the nonconvex regularizer defined in (4.1). The nonlinear least square problem is also studied in Xu et al. (2017b); Zhou et al. (2018). Given training data x_i ∈ R^d with labels y_i, i = 1, …, n, our goal is to minimize the following objective:

min_{w ∈ R^d} (1/n) Σ_{i=1}^n (y_i − φ(x_i^⊤ w))² + λ Σ_{i=1}^d α w_i² / (1 + α w_i²).    (4.3)

Here φ is again the sigmoid function. The parameters λ and α of the nonconvex regularizer are set as before: we use the same value of α for all three datasets and set λ separately for the a9a, ijcnn1 and covtype datasets. The experimental results are summarized in Figure 2, where the first row shows the plots of function value gap vs. Hessian sample complexity and the second row presents the plots of function value gap vs. CPU runtime (in seconds). It can be seen that Lite-SVRC again achieves the best performance among all algorithms with respect to both Hessian sample complexity and runtime when the required precision is high, which again supports our theoretical analysis. SVRC performs second best.
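As with the logistic objective, the nonlinear least square objective is short to sketch; the (1/n) averaging convention and the regularizer form are assumptions of this sketch:

```python
import numpy as np

def nls_obj(w, X, y, lam, alpha):
    """Nonconvex-regularized nonlinear least squares in the spirit of (4.3):
    F(w) = (1/n) sum_i (y_i - sigmoid(x_i' w))^2 + reg(w),
    with the assumed regularizer (4.1)."""
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    aw2 = alpha * w**2
    return np.mean((y - s) ** 2) + lam * np.sum(aw2 / (1.0 + aw2))
```

At w = 0 every prediction equals 1/2, so if all targets are 1/2 the objective is exactly zero, which gives a quick sanity check.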


5 Conclusions

In this paper, we propose a new algorithm called Lite-SVRC, which achieves a lower Hessian sample complexity than existing variance reduction based cubic regularization algorithms (Zhou et al., 2018; Wang et al., 2018). Extensive experiments on various nonconvex optimization problems and datasets validate our theory.



Appendix A Proof of the Main Theory

In this section, we provide the proofs of Theorem 3.5 and Corollary 3.7.

A.1 Proof of Theorem 3.5

Since our algorithm consists of inner and outer loops, we mainly focus on the analysis of a single step, namely the t-th iteration in the s-th epoch.

Similar to other CR related work (Nesterov and Polyak, 2006; Cartis et al., 2011a; Kohler and Lucchi, 2017), our ultimate goal is to prove the following statement of one single loop:

(A.1)

If (A.1) held, then we could simply sum the above inequality over all iterations, which would yield the final result of Theorem 3.5. Unfortunately, (A.1) does not hold in general because of the randomness in our algorithm. Nevertheless, borrowing the idea from the analysis of SVRG in the nonconvex setting (Reddi et al., 2016a), we propose to replace the function value in (A.1) with the following Lyapunov function:

(A.2)

where are parameters defined in Theorem 3.5. With the Lyapunov function in (A.2), we are able to prove the following key lemma that resembles (A.1) and holds in expectation:

Lemma A.1.

Under the same assumptions as in Theorem 3.5, let be the variables defined in Algorithm 1, let be the parameters defined in Theorem 3.5, and let be a constant. Then we have the following result:

(A.3)

where the expectation is taken over all randomness.

With Lemma A.1, we are ready to deliver the proof of our main theory.

Proof of Theorem 3.5.

Applying Lemma A.1, we sum up (A.3) from to , which yields

(A.4)

Substituting and into (A.4), we get

Then, taking the summation from to , we have

Because for all , we have

(A.5)

Finally, because we choose randomly over and , thus we have our result from (A.5):

This completes the proof. ∎

A.2 Proof of Corollary 3.7

In this section, we provide the proof of our corollary on the sample complexity of Lite-SVRC. To prove Corollary 3.7, we need the following lemma:

Lemma A.2.

With the parameter choice in Corollary 3.7, we further choose the parameters in Theorem 3.5 as

From now on, we can define and as variables depending only on and . Then we have that and are positive, and

(A.6)

where are two positive constants.

Proof of Corollary 3.7.

Since we already have by the parameter choice in Lemma A.2, we only need to ensure that . Taking and , it is sufficient to let , where is a constant. Thus, since we sample n Hessians at the beginning of each epoch and b_h Hessians at each inner iteration, the total Hessian sample complexity of Algorithm 1 is Õ(n + n^{2/3}/ε^{3/2}).
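The counting argument can be written out explicitly. Assuming the corollary's (elided) parameter choice is b_h = n^{2/3} and T = n^{1/3}, the per-epoch and total Hessian costs tally as:

```latex
\underbrace{n}_{\text{snapshot } \widetilde{H}}
  \;+\; \underbrace{T \cdot b_h}_{\text{inner iterations}}
  \;=\; n + n^{1/3}\cdot n^{2/3} \;=\; 2n
  \quad \text{Hessian samples per epoch},
\qquad
S \cdot 2n
  \;=\; \widetilde{O}\!\Big(\frac{\Delta_F}{T\,\epsilon^{3/2}}\Big)\cdot 2n
  \;=\; \widetilde{O}\!\Big(n + \frac{n^{2/3}}{\epsilon^{3/2}}\Big),
```

where the number of epochs S follows from the total iteration count, and the additive n accounts for the cost of at least one full snapshot.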

Appendix B Proof of the Key Lemmas

B.1 Proof of Lemma A.1

For simplicity, we denote and . In this section, we prove the key lemma about the Lyapunov function (A.2) used in the proof of our main theory. To simplify notation, we define:

(B.1)

where . Before we state the proof, we present some technical lemmas that are useful in our analysis.

First, we give a sharp bound on . A crucial observation is that we can bound the norm of the gradient and the smallest eigenvalue of the Hessian using , and defined in (B.1). Formally, we have the following lemma:

Lemma B.1.

Under the same assumption as in Theorem 3.5, let be variables defined by Algorithm 1. Then we have

(B.2)

where .

Lemma B.1 suggests that to bound our target , we only need to focus on and .

Second, we bound . We first notice that can be bounded by and . This bound can be derived directly from the Hessian Lipschitz condition:

Lemma B.2.

Under the same assumption as in Theorem 3.5, let be variables defined by Algorithm 1. Then we have the following result:

(B.3)

where .

We also give the following result, which shows how to bound with :

Lemma B.3.

Under the same assumption as in Theorem 3.5, let be variables defined by Algorithm 1, are parameters defined in Theorem 3.5, then we have the following result:

(B.4)

Based on Lemmas B.1, B.2 and B.3, we have established the connection between and with only and .

Finally, we bound and using vector and matrix concentration inequalities. Previous analyses of variance-reduced first-order methods in the nonconvex setting, which only need to control the variance of the semi-stochastic gradient, give an upper bound that guarantees variance reduction (Reddi et al., 2016a; Allen-Zhu and Hazan, 2016). In our proof, we also need to bound the variance of the semi-stochastic Hessian. Thus we have the following two lemmas:

Lemma B.4.

Under the same assumptions as in Theorem 3.5, let and be the iterates defined in Algorithm 1, and let be the batch size parameter defined in Theorem 3.5. Then we have

where the expectation is taken only over .

Lemma B.5.

Under the same assumptions as in Theorem 3.5, let and be the iterates defined in Algorithm 1, let be the batch size defined in Theorem 3.5, and let . Then we have

where the expectation is taken only over , .

Lemmas B.4 and B.5 suggest that with a careful selection of batch sizes, both and can be bounded by .











Now we are ready to prove Lemma A.1.

Proof of Lemma A.1.

First we combine (B.3) and (B.4) to get a bound on . Adding (B.3) and (B.4), we have