On the Convergence of SARAH and Beyond
The main theme of this work is a unifying algorithm, abbreviated as L2S, that can deal with (strongly) convex and nonconvex empirical risk minimization (ERM) problems. It broadens a recently developed variance reduction method known as SARAH. L2S enjoys a linear convergence rate for strongly convex problems, which also implies that the last iterate of SARAH’s inner loop converges linearly. For convex problems, different from SARAH, L2S can afford step and mini-batch sizes not dependent on the data size $n$, and its complexity for reaching a prescribed accuracy is established. For nonconvex problems, a corresponding complexity bound is established as well. Parallel to L2S there are a few side results. Leveraging an aggressive step size, D2S is proposed, which provides a more efficient alternative to L2S and SARAH-like algorithms. Specifically, D2S requires a reduced IFO complexity for strongly convex problems. Moreover, to avoid the tedious selection of the optimal step size, an automatic tuning scheme is developed, whose empirical performance is comparable to that of SARAH with a judiciously tuned step size.
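Consider the frequently encountered empirical risk minimization (ERM) problem
$$\min_{\mathbf{x}\in\mathbb{R}^d} F(\mathbf{x}) := \frac{1}{n}\sum_{i\in[n]} f_i(\mathbf{x}) \tag{1}$$
where $\mathbf{x}\in\mathbb{R}^d$ is the parameter to be learned; the set $[n] := \{1, 2, \ldots, n\}$ collects data indices; and $f_i$ is the loss function corresponding to datum $i$. Let $\mathbf{x}^*$ denote the optimal solution of (1) and assume $F(\mathbf{x}^*) > -\infty$. The standard method to solve (1) is gradient descent (GD), e.g., [16], which per iteration relies on the update $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\nabla F(\mathbf{x}_k)$, where $\eta$ is the step size (a.k.a. learning rate). For a strongly convex $F$, GD converges linearly to $\mathbf{x}^*$, meaning that after $k$ iterations it holds that $\|\mathbf{x}_k - \mathbf{x}^*\|^2 \le c^k\|\mathbf{x}_0 - \mathbf{x}^*\|^2$ with $c\in(0,1)$; while for convex $F$ it holds that $F(\mathbf{x}_k) - F(\mathbf{x}^*) = \mathcal{O}(1/k)$, and for nonconvex $F$ one has $\min_k \|\nabla F(\mathbf{x}_k)\|^2 = \mathcal{O}(1/k)$. However, finding $\nabla F(\mathbf{x}_k)$ per iteration can be computationally prohibitive when $n$ is huge. To cope with this, stochastic gradient descent (SGD) reduces the computational burden by drawing uniformly at random an index $i_k\in[n]$ per iteration, and updating via $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta_k\nabla f_{i_k}(\mathbf{x}_k)$ [22, 4]. Albeit computationally light, SGD comes with a slower convergence rate than GD [4, 7], which is mainly due to the variance of the gradient estimate $\nabla f_{i_k}(\mathbf{x}_k)$.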
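To make the two updates concrete, the following is a minimal sketch of GD and SGD on a toy least-squares ERM instance. The synthetic data, the loss $f_i(\mathbf{x}) = \frac{1}{2}(\mathbf{a}_i^\top\mathbf{x} - b_i)^2$, the step sizes, and all function names are illustrative assumptions and are not taken from this work.

```python
import numpy as np

def make_least_squares(n=200, d=5, seed=0):
    """Toy ERM data (hypothetical): f_i(x) = 0.5 * (a_i @ x - b_i)**2."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    return A, b

def full_grad(A, b, x):
    # Gradient of F(x) = (1/n) * sum_i f_i(x); touches all n data points.
    return A.T @ (A @ x - b) / len(b)

def stoch_grad(A, b, x, i):
    # Gradient of a single f_i; touches one data point.
    return A[i] * (A[i] @ x - b[i])

def gd(A, b, eta, iters):
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x - eta * full_grad(A, b, x)
    return x

def sgd(A, b, eta, iters, seed=1):
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        i = rng.integers(len(b))          # index drawn uniformly at random
        x = x - eta * stoch_grad(A, b, x, i)
    return x

A, b = make_least_squares()
L_F = np.linalg.eigvalsh(A.T @ A / len(b)).max()   # smoothness constant of F
x_gd = gd(A, b, eta=1.0 / L_F, iters=200)
x_sgd = sgd(A, b, eta=0.01, iters=200)
grad_norm_gd = np.linalg.norm(full_grad(A, b, x_gd))
grad_norm_sgd = np.linalg.norm(full_grad(A, b, x_sgd))
```

With the same number of iterations, GD spends `n` gradient evaluations per step while SGD spends one, which is the trade-off the paragraph above describes.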
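Table 1: A comparison of variance reduction algorithms. Columns: # IFO (SC), conv. rate (C), # IFO (C), # IFO (NC), where SC, C, and NC stand for strongly convex, convex, and nonconvex problems, respectively.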
It turns out that this variance can be reduced by capitalizing on the finite sum structure of ERM. The idea is to judiciously (often periodically) evaluate a snapshot gradient $\nabla F(\mathbf{x}_s)$ at a snapshot point $\mathbf{x}_s$, and use it as an anchor of the stochastic draws in subsequent iterates. As a result, the computational burden of GD is alleviated by stochastic gradients, while the variance of the gradient estimator is reduced using snapshot gradients. Members of the variance reduction family include those abbreviated as SDCA [24], SVRG [9, 20, 2], SAG [23], SAGA [5, 21], MISO [15], S2GD [11], SCSG [13] and SARAH [17, 18]. Most of these rely on the update $\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\mathbf{v}_k$, where $\eta$ is a constant step size and $\mathbf{v}_k$ is a carefully designed gradient estimator that takes advantage of the snapshot gradient. Variance reduction methods are faster than SGD for convex and nonconvex problems, and remarkably they converge linearly when $F$ is strongly convex. Beyond convergence rate, to fairly compare the complexities of GD and SGD with that of variance reduction algorithms, which combine snapshot gradients with stochastic ones, we will rely on the notion of the so-termed incremental first-order oracle (IFO) [1].
An IFO takes an index $i\in[n]$ and a point $\mathbf{x}\in\mathbb{R}^d$ as input, and returns the gradient $\nabla f_i(\mathbf{x})$.
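As a small illustration of how IFO calls can be tallied, the sketch below wraps per-datum gradients in a counter. The class name `CountingIFO` and the quadratic losses are hypothetical and serve only to show the bookkeeping: a full gradient costs $n$ calls, a stochastic gradient costs one.

```python
import numpy as np

class CountingIFO:
    """Wraps per-datum gradients and counts how often they are queried."""

    def __init__(self, grad_fi, n):
        self.grad_fi = grad_fi   # callable (i, x) -> gradient of f_i at x
        self.n = n
        self.calls = 0

    def __call__(self, i, x):
        self.calls += 1          # one query = one IFO call
        return self.grad_fi(i, x)

    def full_gradient(self, x):
        # One full gradient costs n IFO calls.
        return sum(self(i, x) for i in range(self.n)) / self.n

# Hypothetical losses f_i(x) = 0.5 * ||x - c_i||^2, so grad f_i(x) = x - c_i.
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
ifo = CountingIFO(lambda i, x: x - centers[i], n=len(centers))
g = ifo.full_gradient(np.zeros(2))   # 3 calls
g_i = ifo(1, np.zeros(2))            # 1 more call
```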
For a prescribed $\epsilon > 0$, a desirable algorithm obtains an $\epsilon$-accurate solution satisfying $\mathbb{E}[\|\nabla F(\mathbf{x})\|^2] \le \epsilon$ with minimal IFO complexity. Since an $n$-dependent step size can slow down iteration updates for convex problems, only $n$-independent step sizes will be considered. The IFO complexities of variance reduction algorithms are summarized in Table 1.
Footnote 1: Such a focus excludes works with $n$-dependent mini-batch sizes, e.g., [26], that also belong to the class of $n$-dependent step sizes due to the tradeoff between step size and mini-batch size [4].
Among variance reduction algorithms, the distinct feature of SARAH [17, 18] and its variants [6, 28, 26, 19] is that they rely on a biased gradient estimator formed by recursively using stochastic gradients. SARAH performs comparably to SVRG for strongly convex ERM, but outperforms SVRG for nonconvex losses, while unlike SAGA, it does not require storing a gradient table. SARAH’s analytical and practical merits notwithstanding, several issues remain unexplored. Indeed, there is no one-for-all algorithmic framework for SARAH-type algorithms. Specifically, analysis of SARAH with an $n$-independent step size/mini-batch size for convex problems is missing, since the analysis in [17] requires the non-divergence presumption, while SPIDER [6] focuses on nonconvex problems and its convergence properties on strongly convex ones remain unexplored. Besides, it is still unclear whether the dependence of SARAH’s IFO complexity on the condition number can be improved, as was done for SVRG in strongly convex problems [27]. These issues motivate our work, whose contributions are summarized next.
Unifying algorithm and novel analysis: We develop a loopless SARAH-type algorithm that we term L2S. It offers a unified algorithmic framework with provable convergence properties. In addition, one of our contributions is introducing a new method to analyze the problem. Specifically, i) for convex problems, it is established that with an -independent step size/mini-batch size, L2S has convergence rate , and requires IFO calls to find an -accurate solution; ii) for nonconvex problems the convergence rate of L2S is , and the IFO complexity to find a stationary point is ; and iii) for strongly convex problems, L2S converges linearly; and,
Improved condition number enhances SARAH’s practical merits: For strongly convex problems, by differentiating the smoothness of each loss function $f_i$, we develop a novel algorithm (abbreviated as D2S) that reduces the number of IFO calls needed to find an $\epsilon$-accurate solution. An automatic step size tuning scheme is also proposed, with empirical performance almost matching that of SARAH with an optimally tuned step size.
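Notation. Bold lowercase letters denote column vectors; $\mathbb{E}$ ($\mathbb{P}$) represents expectation (probability); $\|\mathbf{x}\|$ stands for the $\ell_2$-norm of a vector $\mathbf{x}$; and $\langle\mathbf{x}, \mathbf{y}\rangle$ denotes the inner product.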
This section reviews SARAH [17] and places emphasis on the quality of gradient estimates, which plays the central role in establishing SARAH’s convergence. Before diving into SARAH, we first state the assumptions posed on $F$ and $f_i$ that are involved in (strongly) convex and nonconvex problems.
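Assumption 1. Each $f_i$ has $L_i$-Lipschitz gradient, and $F$ has $L_F$-Lipschitz gradient; that is, $\|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\| \le L_i\|\mathbf{x} - \mathbf{y}\|$ and $\|\nabla F(\mathbf{x}) - \nabla F(\mathbf{y})\| \le L_F\|\mathbf{x} - \mathbf{y}\|$ for all $\mathbf{x}, \mathbf{y}\in\mathbb{R}^d$.
Assumption 1 requires each loss function to be sufficiently smooth, which is standard in variance reduction algorithms. For notational convenience, let $\bar{L} := \frac{1}{n}\sum_{i\in[n]} L_i$ and $L_{\max} := \max_{i\in[n]} L_i$. Clearly, it holds that $L_F \le \bar{L} \le L_{\max}$.
Assumption 2. Each $f_i$ is convex.
Assumption 3. $F$ is $\mu$-strongly convex, meaning there exists $\mu > 0$ so that $F(\mathbf{x}) - F(\mathbf{y}) \ge \langle\nabla F(\mathbf{y}), \mathbf{x} - \mathbf{y}\rangle + \frac{\mu}{2}\|\mathbf{x} - \mathbf{y}\|^2$ for all $\mathbf{x}, \mathbf{y}\in\mathbb{R}^d$.
Note that Assumption 2 implies that $F$ is also convex. Under Assumption 1, the condition number of a strongly convex function is $\kappa := L_F/\mu$; the average condition number is $\bar{\kappa} := \bar{L}/\mu$; and the maximum condition number is $\kappa_{\max} := L_{\max}/\mu$. It is not hard to see that $\kappa \le \bar{\kappa} \le \kappa_{\max}$.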
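The ordering of the three condition numbers can be checked numerically. The sketch below does so for ridge-regularized least squares, reading the condition number as $L_F/\mu$, the average one as the mean of the per-sample smoothness constants divided by $\mu$, and the maximum one as their maximum divided by $\mu$. This reading, the synthetic data, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 4, 0.1
A = rng.standard_normal((n, d))

# Hypothetical losses: f_i(x) = 0.5 * (a_i @ x - b_i)**2 + 0.5 * lam * ||x||^2.
L_i = np.sum(A**2, axis=1) + lam            # smoothness constant of each f_i
eigs = np.linalg.eigvalsh(A.T @ A / n)      # spectrum of the data-fit Hessian of F
L_F, mu = eigs.max() + lam, eigs.min() + lam
L_bar, L_max = L_i.mean(), L_i.max()

kappa = L_F / mu          # condition number of F
kappa_bar = L_bar / mu    # average condition number
kappa_max = L_max / mu    # maximum condition number
```

Here `L_F <= L_bar` follows because the largest eigenvalue of a positive semidefinite matrix is bounded by its trace, and the trace of `A.T @ A / n` is the mean of the squared row norms.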
2.1 SARAH for (Strongly) Convex Problems
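The detailed steps of SARAH are listed under Alg. 1. In a particular outer loop (lines 3 - 11) indexed by $s$, a snapshot gradient $\mathbf{v}_0^s = \nabla F(\mathbf{x}_0^s)$ is computed first to serve as an anchor of gradient estimates in the ensuing inner loop (lines 6 - 10). Then $\mathbf{x}_k^s$ is updated $m$ times based on $\mathbf{v}_k^s$ as
$$\mathbf{v}_k^s = \nabla f_{i_k}(\mathbf{x}_k^s) - \nabla f_{i_k}(\mathbf{x}_{k-1}^s) + \mathbf{v}_{k-1}^s, \qquad \mathbf{x}_{k+1}^s = \mathbf{x}_k^s - \eta\mathbf{v}_k^s. \tag{2}$$
Distinct from most variance reduction algorithms, SARAH’s gradient estimator is biased, since $\mathbb{E}[\mathbf{v}_k^s \mid \mathcal{F}_{k-1}] = \nabla F(\mathbf{x}_k^s) - \nabla F(\mathbf{x}_{k-1}^s) + \mathbf{v}_{k-1}^s \neq \nabla F(\mathbf{x}_k^s)$, where $\mathcal{F}_{k-1}$ denotes the $\sigma$-algebra generated by $\{\mathbf{x}_0^s, i_1, \ldots, i_{k-1}\}$. Albeit biased, $\mathbf{v}_k^s$ is carefully designed to ensure the estimation error relative to $\nabla F(\mathbf{x}_k^s)$ is bounded above, and stays proportional to the squared norm of the snapshot gradient, $\|\mathbf{v}_0^s\|^2$.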
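The outer/inner loop structure and the recursive estimator described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a transcription of Alg. 1: the function name `sarah`, the toy least-squares losses, the step size choice, and the use of the last inner iterate to start the next outer loop (instead of a random draw) are all assumptions.

```python
import numpy as np

def sarah(grad_fi, n, x0, eta, m, outer_loops, seed=0):
    """Sketch of SARAH; returns the final iterate and the number of IFO calls."""
    rng = np.random.default_rng(seed)
    x_tilde = np.asarray(x0, dtype=float)
    ifo_calls = 0
    for _ in range(outer_loops):
        x_prev = x_tilde
        # Snapshot gradient v_0 = grad F(x_0): costs n IFO calls.
        v = sum(grad_fi(i, x_prev) for i in range(n)) / n
        ifo_calls += n
        x = x_prev - eta * v
        for _ in range(m - 1):
            i = rng.integers(n)
            # Recursive (biased) estimator: 2 IFO calls per inner step.
            v = grad_fi(i, x) - grad_fi(i, x_prev) + v
            ifo_calls += 2
            x_prev, x = x, x - eta * v
        x_tilde = x   # assumption: keep the last inner iterate
    return x_tilde, ifo_calls

# Hypothetical toy problem: f_i(x) = 0.5 * (a_i @ x - b_i)**2.
rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
grad_fi = lambda i, x: A[i] * (A[i] @ x - b[i])
full_grad = lambda x: A.T @ (A @ x - b) / n

eta = 0.5 / np.max(np.sum(A**2, axis=1))   # half of 1 / max_i L_i
x_out, ifo_calls = sarah(grad_fi, n, np.zeros(d), eta, m=n, outer_loops=40)
ratio = np.linalg.norm(full_grad(x_out)) / np.linalg.norm(full_grad(np.zeros(d)))
```

Each outer loop in this sketch costs `n + 2 * (m - 1)` IFO calls, which is how the snapshot and recursive steps enter the complexity count.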
This estimation error bound of Lemma 1 is critical for analyzing SARAH, and instrumental to establishing its linear convergence for strongly convex problems. It is worth stressing that the step size of SARAH should be chosen by to ensure convergence, which can be larger than that of SVRG, whose step size should be less than . Despite the improvement, the step size could still be small when is large, which can slow down convergence. This prompts one to investigate means of selecting an even larger step size while maintaining the linear convergence rate. A larger step size would further challenge its manual tuning (via grid search), and thus motivates an automatic step size tuning scheme.
Establishing the convergence rate of SARAH with an $n$-independent step size remains open for convex problems. Regarding IFO complexity, the only existing analysis [17] implicitly assumes SARAH to be non-divergent, as confirmed by the following claim used to derive the IFO complexity.
Claim: [17, Theorem 3] If , , , and , it holds that .
The missing piece of this claim is that for a finite or , must be bounded; or equivalently, the algorithm must be assumed non-divergent. Such an assumption turns out to be challenging to eliminate using the analysis in . The present paper addresses the aforementioned issues analytically, and designs algorithms to boost the practical merits of SARAH.
2.2 SARAH for Nonconvex Problems
SARAH also works for nonconvex problems if one changes Line 11 of Alg. 1 into . The key for convergence again lies in the estimation error of .
Lemma 2 states that the estimation error of $\mathbf{v}_k^s$ is i) proportional to ; and, ii) larger when $k$ is larger in the outer loop $s$. Leveraging the estimation error bound, it was established that SARAH can find an $\epsilon$-accurate solution with IFO calls [18]. Despite its theoretically attractive IFO complexity, SARAH, like other variance reduced methods, is not as successful as expected for training neural networks. Part of the reason is that the reduced variance in the gradient estimates tends to have a negative impact on generalization performance. For instance, some empirical results show that SGD with a large batch size (leading to gradient estimates with small variance) tends to converge to a sharp minimum [10], which is widely accepted to generalize worse than a flat minimum; see Fig. 1 for an illustration. In addition, Fig. 1 also shows that with larger variance in the gradient estimate, it is easier to escape from a sharp minimum.
This empirical evidence suggests that the variance of gradient estimates is necessary for training neural networks. It turns out that the proposed algorithm can introduce extra variance (estimation error, rigorously speaking) compared with SARAH through a randomized scheduling of the snapshot gradient computation, while maintaining a fast convergence rate like that of SARAH.
3 Loopless SARAH
This section presents the LoopLess SARAH (L2S) algorithmic framework, which is capable of dealing with (strongly) convex and nonconvex ERM problems.
3.1 Loopless SARAH for Convex Problems
The subject here is problems with smooth and convex losses such as those obeying Assumptions 1 and 2. We find that SARAH is challenged analytically because of the random draw of the outer-loop iterate in Line 11 of Alg. 1, which necessitates SARAH’s ‘non-divergent’ assumption. A few works have identified this issue [18, 26, 19], but require an $n$-related mini-batch size or step size. The proposed L2S bypasses this $n$-dependence by removing the inner loop of SARAH and computing snapshot gradients following a random schedule. Furthermore, the convergence rate of L2S and the IFO complexity needed to find an $\epsilon$-accurate solution are established.
Footnote 2: These algorithms are designed for nonconvex problems; however, even assuming convexity we are unable to show convergence with a step size independent of $n$.
L2S is summarized in Alg. 2, and a detailed comparison of L2S with existing algorithms can be found in Appendix A. Besides the single loop structure, the most distinct feature of L2S is that $\mathbf{v}_k$ is a probabilistically computed snapshot gradient given by (4), where $i_k$ is again uniformly sampled. The gradient estimator $\mathbf{v}_k$ is still biased, since $\mathbb{E}[\mathbf{v}_k \mid \mathcal{F}_{k-1}] \neq \nabla F(\mathbf{x}_k)$. In L2S, the snapshot gradient is computed every $m$ iterates in expectation, while SARAH computes the snapshot gradient once every $m$ updates. Clearly, the limitation of SARAH is no longer present in L2S, but the emergent challenge is that one has to ensure a small estimation error to guarantee convergence. The difficulty arises from the randomness of the iteration at which a snapshot gradient is computed.
An equivalent manner to describe (4) is through a Bernoulli random variable $B_k$ whose pmf is
$$\mathbb{P}(B_k = 1) = \frac{1}{m}, \qquad \mathbb{P}(B_k = 0) = 1 - \frac{1}{m}. \tag{3}$$
If $B_k = 1$, a snapshot gradient is computed; otherwise, the estimated gradient is used for the update. Note that $\{B_k\}$ are i.i.d. for all $k$. Let $A_{k_1 \to k}$ denote the event that at iteration $k$ the last evaluated snapshot gradient was at $k_1$. In other words, $A_{k_1 \to k}$ is equivalent to $B_{k_1} = 1, B_{k_1+1} = 0, \ldots, B_k = 0$. Note that $k_1$ can take values from $0$ (no snapshot gradient computed) to $k$ (corresponding to $B_k = 1$). By definition $\{A_{k_1 \to k}\}_{k_1=0}^{k}$ are mutually disjoint for a given $k$, and one can show that their probabilities sum up to $1$ [see Lemma 9 in Appendix]. Exploiting these properties of $A_{k_1 \to k}$, the estimation error of $\mathbf{v}_k$ can be bounded.
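The loopless structure with a randomly scheduled snapshot gradient can be sketched as below. This is an illustrative reading rather than Alg. 2 itself: we assume each iteration recomputes the full gradient with probability $1/m$ and otherwise applies the SARAH-type recursion, that the initial estimate is a full gradient, and that the last iterate is returned. The function name `l2s` and the toy least-squares problem are hypothetical.

```python
import numpy as np

def l2s(grad_fi, n, x0, eta, m, iters, seed=0):
    """Sketch of a loopless SARAH-type method with Bernoulli snapshot schedule."""
    rng = np.random.default_rng(seed)
    full = lambda x: sum(grad_fi(i, x) for i in range(n)) / n
    x_prev = np.asarray(x0, dtype=float)
    v = full(x_prev)                    # initial estimate: a full gradient
    x = x_prev - eta * v
    snapshots = 1
    for _ in range(1, iters):
        if rng.random() < 1.0 / m:      # Bernoulli draw equals 1: new snapshot
            v = full(x)
            snapshots += 1
        else:                           # Bernoulli draw equals 0: recursive estimate
            i = rng.integers(n)
            v = grad_fi(i, x) - grad_fi(i, x_prev) + v
        x_prev, x = x, x - eta * v
    return x, snapshots

# Hypothetical toy problem: f_i(x) = 0.5 * (a_i @ x - b_i)**2.
rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
grad_fi = lambda i, x: A[i] * (A[i] @ x - b[i])
full_grad = lambda x: A.T @ (A @ x - b) / n

eta = 0.5 / np.max(np.sum(A**2, axis=1))
x_out, snapshots = l2s(grad_fi, n, np.zeros(d), eta, m=n, iters=4000)
ratio = np.linalg.norm(full_grad(x_out)) / np.linalg.norm(full_grad(np.zeros(d)))
```

Unlike a fixed inner loop, the gap between two snapshots here is random, so the total number of snapshots over `iters` iterations varies from run to run and is about `iters / m` on average.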
Comparing (5) with Lemma 1 reveals that conditioning on , in L2S is similar to the starting point of an outer loop in SARAH (i.e., ), while the following iterates mimic the behavior of SARAH’s inner loop. Taking expectation w.r.t. in (5), Lemma 3 further asserts that the estimation error depends on the exponentially moving average of norm square of past gradients.
The constant depends on the choice of , e.g., for . Based on Theorem 1, the convergence rate as well as the IFO complexity with different choices of and are specified in the following corollaries.
Choose a constant . If , then L2S has convergence rate and requires IFO calls to find with . If , the convergence rate of L2S is and IFO calls are needed to ensure .
In Corollary 1, the choice of does not depend on . Thus, relative to SARAH, L2S eliminates the non-divergence assumption and establishes the convergence rate. The IFO complexity of L2S is the same as that of SAGA, but outperforms SAGA on convergence rate when choosing .
On the other hand, an $n$-dependent step size is also supported by L2S. Though slightly violating our goal of an $n$-independent step size, we summarize this result next for completeness.
If we select , and , then L2S has convergence rate , and can find a solution satisfying after IFO calls.
With an $n$-dependent step size, in terms of IFO complexity L2S matches SVRG with an $n$-dependent step size [20].
3.2 Loopless SARAH for Nonconvex Problems
The scope of L2S can also be broadened to nonconvex problems under Assumption 1, that is, L2S with a proper step size is guaranteed to use IFO calls to find an -accurate solution. Compared with SARAH, the merit of L2S is that the extra estimation error introduced by the randomized scheduling of snapshot gradient computation can be helpful for exploring the landscape of the loss function. Such exploration may lead to a local (flat) minimum that generalizes better. The extra estimation error introduced by L2S can be seen from the following Lemma.
If Assumption 1 holds, L2S guarantees that for a given
In addition, the following inequality is true
Conditioning on the iteration at which the most recent snapshot gradient was computed, the subsequent iterates are comparable to an outer loop of SARAH. Similar to Lemma 2, the estimation error of $\mathbf{v}_k$ in (6) tends to be large when many iterations have elapsed since the last snapshot gradient. For L2S, it is possible that more than $m$ iterations elapse between two snapshot gradients, while this is impossible for SARAH since its inner loop length is fixed to be $m$. Thus, when this happens, the estimation error of $\mathbf{v}_k$ in L2S can be larger than that of SARAH. Furthermore, taking expectation w.r.t. the randomness of the snapshot schedule, the estimation error of $\mathbf{v}_k$ depends on the exponentially moving average of all past gradient estimates, which is different from Lemma 3 where the estimation error involves the past gradients. It turns out that such a past-estimate-based estimation error is difficult to control with only the exponentially decreasing weights, which also prompts a carefully designed ($m$-dependent) $\eta$.
With Assumption 1 holding, and choosing , the final L2S output satisfies
An intuitive explanation of the $m$-dependent $\eta$ is that with a small $m$, L2S evaluates a snapshot gradient more frequently [cf. (4)], which translates to a relatively small estimation error bound in Lemma 4. Given an accurate gradient estimate, it is thus reasonable to adopt a larger step size.
Selecting and , L2S converges with rate , and requires IFO calls to find a solution satisfying .
3.3 Loopless SARAH for Strongly Convex Problems
In addition to convex and nonconvex problems, a modified version of L2S that we term L2S for Strongly Convex problems (L2S-SC) converges linearly under Assumptions 1 – 3. As we have seen previously, L2S is closely related to SARAH, especially when conditioned on the iteration of the most recent snapshot gradient. Hence, we will first state a useful property of SARAH that will guide the design and analysis of L2S-SC.
As opposed to the random draw of (Line 11 of Alg. 1), Lemma 5 asserts that by properly selecting and , setting preserves the linear convergence of SARAH. On the other hand, choosing in Alg. 2 is observed to yield empirically the best performance  (we have not been able so far to establish its convergence properties). However, the value of is necessary for analysis [see (19) in Appendix].
L2S-SC is summarized in Alg. 3, where obtained in Lines 5 - 11 is a rewrite of (4) using introduced in (3) for the ease of presentation and analysis. L2S-SC differs from L2S in that when , steps back slightly as in Line 7. This "step back" is to allow for a rigorous analysis, and can be viewed as the counterpart of choosing instead of as in Lemma 5. Omitting Line 7 in practice does not deteriorate performance. In addition, the required to initialize L2S is comparable to the number of outer loops of SARAH, as one can also validate through the dependence in the linear convergence rate.
Extensions: For strongly convex problems, to boost the practical merits of L2S and SARAH, the Data Dependent SARAH (D2S) is developed in Appendix F. Leveraging an importance sampling scheme to enlarge the step size, D2S has a reduced IFO complexity. The enlarged step size of D2S further challenges manual step size tuning. To cope with this, the Barzilai-Borwein step size aided SARAH (B2S) is designed in Appendix G, with an established linear convergence rate when the step size is small. Supported by empirical tests, the performance of B2S turns out to be comparable to the best tuned SARAH.
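For readers unfamiliar with the Barzilai-Borwein idea, the sketch below computes the classical BB step size [3] from two consecutive iterates and gradients. It is the generic rule, not the B2S scheme of Appendix G (whose details are not reproduced here); the function name `bb_step` and the safeguard `eps` are assumptions.

```python
import numpy as np

def bb_step(x, x_prev, g, g_prev, eps=1e-12):
    """Classical Barzilai-Borwein step size <s, s> / <s, y>."""
    s = x - x_prev            # change in the iterate
    y = g - g_prev            # change in the gradient
    # Safeguard (assumption): avoid division by a vanishing curvature estimate.
    return float(s @ s) / max(float(s @ y), eps)

# Quadratic example F(x) = 0.5 * x^T H x with gradient H @ x.
H = np.diag([1.0, 4.0])
x_prev, x = np.array([0.0, 0.0]), np.array([1.0, 1.0])
eta_bb = bb_step(x, x_prev, H @ x, H @ x_prev)   # 2 / 5 = 0.4
```

On a quadratic, this step size always lies between the reciprocals of the largest and smallest Hessian eigenvalues, which is why it serves as a curvature-aware alternative to grid search.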
4 Numerical Tests
We apply the proposed algorithms to logistic regression to showcase their performance in strongly convex and convex cases. Specifically, consider the loss function
$$f_i(\mathbf{x}) = \ln\big(1 + \exp(-b_i\langle\mathbf{a}_i, \mathbf{x}\rangle)\big) + \frac{\lambda}{2}\|\mathbf{x}\|^2$$
where $(\mathbf{a}_i, b_i)$ is the (feature, label) pair of datum $i$. Datasets a3a, w1a, ijcnn1, covtype.binary, rcv1.binary, and real-sim are used in the numerical tests presented. Details regarding the datasets and implementation are deferred to Appendix H due to space limitations.
Footnote 3: All datasets are from LIBSVM, which is available online at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
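A minimal sketch of the logistic regression objective and its gradient is given next. We assume the $\ell_2$-regularized form with labels in $\{-1, +1\}$ and a regularization weight `lam` (with `lam = 0` giving the merely convex case); the function names and the random data are hypothetical and are not the LIBSVM datasets used in the tests.

```python
import numpy as np

def logistic_loss(x, A, b, lam=0.0):
    """Average logistic loss with optional l2 regularization; labels b in {-1, +1}."""
    z = b * (A @ x)
    # logaddexp(0, -z) = ln(1 + exp(-z)), evaluated stably.
    return np.mean(np.logaddexp(0.0, -z)) + 0.5 * lam * (x @ x)

def logistic_grad(x, A, b, lam=0.0):
    z = b * (A @ x)
    # 1 / (1 + exp(z)) written via tanh for numerical stability.
    weight = 0.5 * (1.0 - np.tanh(0.5 * z))
    return -(A.T @ (b * weight)) / len(b) + lam * x

# Hypothetical random data for a quick sanity check.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
b = rng.choice([-1.0, 1.0], size=20)
x = rng.standard_normal(3)
g = logistic_grad(x, A, b, lam=0.1)
```

With `lam > 0` the objective is strongly convex with modulus at least `lam`, which is the setting of the strongly convex tests; setting `lam = 0` recovers the convex case.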
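Figure 2: Tests on datasets (a) a3a, (b) w1a, (c) rcv1, and (d) real-sim. First row: L2S-SC on strongly convex problems; second row: L2S on convex problems.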
Test of L2S-SC on strongly convex problems. The performance of L2S-SC is shown in the first row of Fig. 2, and comparisons are drawn with SVRG, SARAH and SGD+ benchmarks. It can be seen that on a3a and rcv1 L2S-SC outperforms SARAH, while on the other datasets L2S-SC shows performance comparable to that of the best tuned SARAH. These results corroborate the theoretical findings for L2S-SC.
Test of L2S on convex problems. The performance of L2S for convex problems (regularization weight $\lambda = 0$) is shown in the second row of Fig. 2. SVRG, SARAH and SGD+ are adopted as benchmarks. It can be seen that on datasets a3a, rcv1, and real-sim L2S performs almost the same as the best tuned SARAH, while it outperforms SARAH on w1a.
Figure 3: Tests on MNIST with a feedforward neural network: (a) training loss; (b) test accuracy.
Test of L2S on neural networks. We perform classification on the MNIST dataset using a feedforward neural network. The network is trained for a fixed number of epochs, and the training loss and test accuracy are plotted in Fig. 3. The gray shadowed area indicates the smallest training loss (highest test accuracy) of SGD, while the green shadowed area represents the best performance of SARAH. There are a few common observations in both Fig. 3 (a) and (b): i) SGD converges much faster in the initial phase compared with variance reduced algorithms; ii) the fluctuation of L2S is larger than that of SARAH, implying that the randomized full gradient computation indeed introduces extra but controlled estimation error; and, iii) in the later stage of training, L2S begins to outperform SARAH, while in earlier epochs their performances are comparable. Note that before L2S outperforms SARAH, there is a deep drop in its accuracy. A possible explanation is that L2S explores for a local minimum with generalization merits thanks to the randomized snapshot gradient computation.
Footnote 4: Available online at http://yann.lecun.com/exdb/mnist/.
A unifying framework, L2S, is introduced to efficiently solve (strongly) convex and nonconvex ERM problems. It was established that for strongly convex problems, L2S converges linearly; for convex problems, an IFO complexity bound for finding an $\epsilon$-accurate solution was derived with an $n$-independent step size/mini-batch size; and for nonconvex problems, the IFO complexity was characterized as well. In addition, side results include the D2S algorithm for enhancing the practical merits of SARAH-type algorithms. D2S allows for an enlarged step size compared with SARAH, which further reduces the IFO complexity. Finally, automatic tuning of the step size was accomplished with a third algorithm (B2S). Merits of the proposed algorithms (L2S, D2S, and B2S) were corroborated by numerical tests.
References
[1] (2015) A lower bound for the optimization of finite sums. In Proc. Intl. Conf. on Machine Learning, Lille, France, pp. 78–86.
[2] (2016) Variance reduction for faster non-convex optimization. In Proc. Intl. Conf. on Machine Learning, New York City, NY, pp. 699–707.
[3] (1988) Two-point step size gradient methods. IMA Journal of Numerical Analysis 8 (1), pp. 141–148.
[4] (2016) Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838.
[5] (2014) SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, pp. 1646–1654.
[6] (2018) SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, pp. 687–697.
[7] (2013) Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368.
[8] (2006) Probability and random processes for electrical and computer engineers. Cambridge University Press.
[9] (2013) Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Advances in Neural Info. Process. Syst., Lake Tahoe, Nevada, pp. 315–323.
[10] (2016) On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.
[11] (2013) Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666.
[12] (2019) Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. arXiv preprint arXiv:1901.08689.
[13] (2017) Less than a single pass: stochastically controlled stochastic gradient. In Proc. Intl. Conf. on Artificial Intelligence and Statistics, Fort Lauderdale, Florida, pp. 148–156.
[14] (2017) Non-convex finite-sum optimization via SCSG methods. In Proc. Advances in Neural Info. Process. Syst., pp. 2348–2358.
[15] (2013) Optimization with first-order surrogate functions. In Proc. Intl. Conf. on Machine Learning, Atlanta, pp. 783–791.
[16] (2004) Introductory lectures on convex optimization: a basic course. Vol. 87, Springer Science & Business Media.
[17] (2017) SARAH: a novel method for machine learning problems using stochastic recursive gradient. In Proc. Intl. Conf. Machine Learning, Sydney, Australia.
[18] (2019) Optimal finite-sum smooth non-convex optimization with SARAH. arXiv preprint arXiv:1901.07648.
[19] (2019) ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1902.05679.
[20] (2016) Stochastic variance reduction for nonconvex optimization. In Proc. Intl. Conf. on Machine Learning, New York City, NY, pp. 314–323.
[21] (2016) Fast incremental method for nonconvex optimization. arXiv preprint arXiv:1603.06159.
[22] (1951) A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407.
[23] (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. In Proc. Advances in Neural Info. Process. Syst., Lake Tahoe, Nevada, pp. 2663–2671.
[24] (2013) Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research 14, pp. 567–599.
[25] (2016) Barzilai-Borwein step size for stochastic gradient descent. In Proc. Advances in Neural Info. Process. Syst., pp. 685–693.
[26] (2018) SpiderBoost: a class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690.
[27] (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24 (4), pp. 2057–2075.
[28] (2018) R-SPIDER: a fast Riemannian stochastic optimization algorithm with curvature independent rate. arXiv preprint arXiv:1811.04194.
Appendix A A Comparison of L2S and Existing Algorithms
Differences with SARAH [17] and SpiderBoost [26]: The main difference is that the L2S gradient estimator in (4) schedules the full gradient computation in a random manner. This difference in turn leads to a different analysis.
Differences with SPIDER [6]: The L2S gradient estimator in (4) is different from that of SPIDER. In addition, SPIDER applies a different (inexact) update rule to its gradient estimate. Furthermore, L2S is provably applicable to strongly convex problems, while the convergence properties of SPIDER in this case are not yet known.
Differences with SCSG [13]: Indeed, the equivalent inner loop length of L2S, defined as the number of iterations between two consecutive computations of snapshot gradients, is a random variable, which shares a similar idea with [14]. However, there are a few key differences in addition to the fact that SCSG is designed based on SVRG.
The main difference lies in the analysis techniques. Particularly, the randomness of the snapshot schedule is leveraged in different ways. In SCSG, the “forward” analysis is analogous to fixing the iteration at which a snapshot gradient is computed and exploring the randomness of future iterations, while our analysis takes the “backward” route, that is, fixing the current iteration and considering the randomness of when the most recent snapshot gradient was computed in the previous iterations. As a result, our “backward” analysis leads to a moving average structure [cf. Lemmas 3 and 4], an insight not offered by SCSG. In addition, our analysis is simpler than that of SCSG.
Another difference is that in L2S, the length of an inner loop, that is, the number of iterations between two consecutive snapshot gradient computations, is not a geometric random variable, and hence differs from SCSG. Indeed, the inner loop length in L2S is bounded by the total number of iterations, whereas a geometric random variable is unbounded;
The total number of updates is a fixed number in L2S, while it is a random variable in SCSG.
The final outputs are different, that is, in (nonconvex) L2S we randomly choose from all past iterates; while in SCSG it is randomly chosen from the outputs of the inner loops;
L-SVRG [12], which is closely linked with SCSG, was developed in parallel to our work. L-SVRG considers only strongly convex problems, whereas convex and nonconvex problems are also dealt with in this work. Besides, our analysis is significantly different from theirs.
Appendix B Useful Lemmas and Facts
Lemma 6 [16, Theorem 2.1.5]. If $F$ is convex and has $L$-Lipschitz gradient, then the following inequalities are true
Note that inequality (8a) does not require convexity.
Lemma 7 [16]. If $F$ is $\mu$-strongly convex and has $L$-Lipschitz gradient, the following inequalities are true
Appendix C Technical Proofs in Section 3.1
C.1 Proof of Lemma 3
The proof builds on the following lemmas.
The following equality is true for
where the last equation is due to . We can expand using the same argument. Note that we have , which suggests
Then taking expectation w.r.t. and expanding in (C.1), the lemma is proved. ∎
For a given , events and are disjoint when ; and .
If , by definition and are disjoint, since the most recent calculated snapshot gradient can only appear at either or . Then, since in each iteration, whether to compute a snapshot gradient or a gradient estimator is independent, we thus have
Hence we have
which completes the proof. ∎
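The statement that the event probabilities sum to one can be verified numerically. In the sketch below we assume that each iteration independently triggers a snapshot gradient with probability $1/m$ and that the initial estimate is a full gradient, so that "last snapshot at $k_1 = 0$" means no snapshot was drawn in iterations $1$ through $k$. The function name is hypothetical.

```python
def last_snapshot_probs(k, m):
    """Probabilities that the last snapshot before iteration k was at k1 = 0..k."""
    p = 1.0 / m                       # assumed per-iteration snapshot probability
    probs = [(1.0 - p) ** k]          # k1 = 0: no snapshot in iterations 1..k
    # k1 >= 1: snapshot at k1, then no snapshot in iterations k1+1..k.
    probs += [p * (1.0 - p) ** (k - k1) for k1 in range(1, k + 1)]
    return probs

probs = last_snapshot_probs(k=25, m=10)
total = sum(probs)
```

The sum telescopes: the geometric series over $k_1 \ge 1$ equals $1 - (1 - 1/m)^k$, which exactly complements the $k_1 = 0$ term.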
The implication of this lemma is that the law of total probability [8] holds, that is, for a random variable that happens in iteration , the following equation holds