
Variance Reduced Stochastic Gradient Descent with Neighbors

Thomas Hofmann, Department of Computer Science, ETH Zurich, Switzerland
Aurelien Lucchi, Department of Computer Science, ETH Zurich, Switzerland
Simon Lacoste-Julien, INRIA - Sierra Project-Team, École Normale Supérieure, Paris, France
Brian McWilliams, Department of Computer Science, ETH Zurich, Switzerland
Abstract

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, yet its slow convergence can be a computational bottleneck. Variance reduction techniques such as SAG, SVRG and SAGA have been proposed to overcome this weakness, achieving linear convergence. However, these methods are either based on computations of full gradients at pivot points, or on keeping per data point corrections in memory. Therefore, speed-ups relative to SGD may only materialize after a minimum number of epochs. This paper investigates algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points, which offers advantages in the transient optimization phase. As a side product, we provide a unified convergence analysis for a family of variance reduction algorithms, which we call memorization algorithms. We provide experimental results supporting our theory.

 

1 Introduction

We consider a general problem that is pervasive in machine learning, namely optimization of an empirical or regularized convex risk function. Given convex losses l_i and a μ-strongly convex regularizer Ω, one aims at finding a parameter vector w which minimizes the (empirical) expectation:

f(w) := (1/n) ∑_{i=1}^n f_i(w),   f_i(w) := l_i(w) + Ω(w)   (1)

We assume throughout that each f_i has L-Lipschitz-continuous gradients. Steepest descent can find the minimizer w*, but requires repeated computations of the full gradient f'(w), which becomes prohibitive for massive data sets. Stochastic gradient descent (SGD) is a popular alternative, in particular in the context of large-scale learning [2, 10]. SGD updates only involve the stochastic gradient f'_i(w) for an index i chosen uniformly at random, providing an unbiased gradient estimate, since E_i[f'_i(w)] = f'(w).
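
For concreteness, the plain SGD step referred to here can be sketched as follows (in notation consistent with the objective above; the step size schedule γ_t is an assumed symbol):

i_t \sim \mathrm{Uniform}\{1,\dots,n\}, \qquad
w^{t+1} = w^{t} - \gamma_t\, f'_{i_t}(w^{t}), \qquad
\mathbb{E}_{i_t}\big[f'_{i_t}(w^{t})\big] = f'(w^{t}).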

It is a surprising recent finding [11, 5, 9, 6] that the finite-sum structure of f allows for significantly faster convergence in expectation. Instead of the standard O(1/t) rate of SGD for strongly convex functions, it is possible to obtain linear convergence with geometric rates. While SGD requires asymptotically vanishing learning rates, often chosen to be of the form γ_t ∝ 1/t [7], these more recent methods introduce corrections that ensure convergence for constant learning rates.

Based on the work mentioned above, the contributions of our paper are as follows: First, we define a family of variance reducing SGD algorithms, called memorization algorithms, which includes SAGA and SVRG as special cases, and develop a unifying analysis technique for it. Second, we show geometric rates for all admissible constant step sizes, including a universal (μ-independent) step size choice, providing the first μ-adaptive convergence proof for SVRG. Third, based on the above analysis, we present new insights into the trade-offs between the freshness and the bias of the corrections computed from previous stochastic gradients. Fourth, we propose a new class of algorithms that resolves this trade-off by computing corrections based on stochastic gradients at neighboring points. We experimentally show its benefits in the regime of learning with a small number of epochs.

2 Memorization Algorithms

2.1 Algorithms

Variance Reduced SGD

Given an optimization problem as in (1), we investigate a class of stochastic gradient descent algorithms that generates an iterate sequence w^t (t ≥ 0) with updates taking the form:

w^+ = w − γ g_i(w),   g_i(w) := f'_i(w) − α_i + ᾱ   (2)

where ᾱ := (1/n) ∑_j α_j. Here w is the current and w^+ the new parameter vector, γ is the step size, and i is an index selected uniformly at random. The α_i are variance correction terms whose mean ᾱ is added back, which guarantees unbiasedness, E_i[g_i(w)] = f'(w). The aim is to define updates of asymptotically vanishing variance, i.e. g_i(w) → 0 as w → w*, which requires α_i → f'_i(w*). This implies that the corrections need to be designed in a way to exactly cancel out the stochasticity of f'_i(w*) at the optimum. How the memory α_i is updated distinguishes the different algorithms that we consider.
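
To make the structure of (2) concrete, here is a minimal Python-style sketch of one step of such an algorithm (the helper names grad_f and choose_slots are illustrative assumptions, not the paper's code):

import numpy as np

def memorization_step(w, alpha, alpha_bar, gamma, grad_f, choose_slots):
    """One step of the generic update (2) for a memorization algorithm (sketch).

    alpha           : (n, d) array of memorized corrections alpha_j
    alpha_bar       : running average (1/n) * sum_j alpha_j
    grad_f(j, w)    : returns the stochastic gradient f'_j(w)
    choose_slots(i) : returns the index set J of memory slots to refresh
    """
    n = alpha.shape[0]
    i = np.random.randint(n)
    g_i = grad_f(i, w)
    # variance-corrected, unbiased direction of Eq. (2)
    w_new = w - gamma * (g_i - alpha[i] + alpha_bar)
    # refresh the chosen memory slots with gradients at the current iterate w
    for j in choose_slots(i):
        new_alpha = g_i if j == i else grad_f(j, w)
        alpha_bar = alpha_bar + (new_alpha - alpha[j]) / n   # keep the average in sync
        alpha[j] = new_alpha
    return w_new, alpha, alpha_bar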

SAGA

The SAGA algorithm [4] maintains variance corrections α_i by memorizing stochastic gradients. The update rule is α_i^+ = f'_i(w) for the selected index i, and α_j^+ = α_j for j ≠ i. Note that these corrections will be used the next time the same index gets sampled. Setting ᾱ := (1/n) ∑_j α_j guarantees unbiasedness. Obviously, ᾱ can be updated incrementally. SAGA reuses the stochastic gradient f'_i(w) computed at the current step to update w as well as α_i.
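
In symbols (a sketch in the notation assumed above, including the incremental maintenance of the average):

\alpha_i^{+} = f'_i(w), \qquad \alpha_j^{+} = \alpha_j \;\;(j \neq i), \qquad
\bar{\alpha}^{+} = \bar{\alpha} + \tfrac{1}{n}\,\big(\alpha_i^{+} - \alpha_i\big).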

q-SAGA

We also consider q-SAGA, a method that updates q randomly chosen α_j variables at each iteration. This is a convenient reference point to investigate the advantages of “fresher” corrections. Note that in SAGA the corrections will be on average n iterations “old”. In q-SAGA this can be controlled to be n/q at the expense of additional gradient computations.

SVRG

We reformulate a variant of SVRG [5] in our framework using a randomization argument similar to (but simpler than) the one suggested in [6]. Fix q > 0 and, in each iteration, with probability q/n perform a complete update of all memory variables, otherwise leave them unchanged. While q-SAGA updates exactly q variables in each iteration, SVRG occasionally updates all n variables by triggering an additional sweep through the data. There is an option to not maintain the memory variables explicitly and to save on space by storing only the pivot point and its full gradient.
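
Schematically, in the notation assumed above (the pivot point is denoted \tilde{w}, an assumed symbol), this randomized SVRG variant performs:

\text{with prob. } q/n:\;\; \tilde{w} \leftarrow w,\;\; \alpha_j \leftarrow f'_j(\tilde{w}) \;\;(\forall j),\;\;
\bar{\alpha} = f'(\tilde{w}); \qquad \text{otherwise leave all } \alpha_j \text{ unchanged.}

In the space-saving form, only \tilde{w} and f'(\tilde{w}) are stored, and \alpha_i = f'_i(\tilde{w}) is recomputed on the fly whenever index i is sampled.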

Uniform Memorization Algorithms

Motivated by SAGA and SVRG, we define a class of algorithms, which we call uniform memorization algorithms.

Definition 1.

A uniform q-memorization algorithm evolves iterates w according to Eq. (2) and selects in each iteration a random index set J of memory locations to update according to

α_j^+ := f'_j(w) if j ∈ J,   α_j^+ := α_j otherwise   (3)

such that any j has the same probability q/n of being updated, i.e. P(j ∈ J) = q/n for all j.

Note that q-SAGA and the above SVRG variant are special cases: q-SAGA selects J as a uniformly random subset of size q, while SVRG selects J = {1, ..., n} with probability q/n and J = ∅ otherwise.
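
Written out, the selection distributions implied here are (a sketch; the uniform-over-q-subsets form for q-SAGA follows the description above):

q\text{-SAGA:}\;\; P(J) = \binom{n}{q}^{-1} \text{ if } |J| = q,\; 0 \text{ otherwise};\qquad
\text{SVRG:}\;\; P(\{1,\dots,n\}) = \tfrac{q}{n},\;\; P(\emptyset) = 1 - \tfrac{q}{n}.

In both cases every index j is updated with probability q/n, as required by Definition 1.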

N-SAGA

Because we need it in Section 3, we will also define an algorithm, which we call N-SAGA, which makes use of a neighborhood system N_i and which selects neighborhoods uniformly, i.e. J = N_i with probability 1/n. Note that Definition 1 then requires that each index j is contained in the same number q of neighborhoods.

Finally, note that for generalized linear models, where f_i depends on w only through the inner product ⟨x_i, w⟩, the update direction of each stochastic gradient is determined by the data point x_i, whereas the effective step length depends on the derivative of a scalar function. As used in [9], this leads to significant memory savings, as one only needs to store these scalars, since x_i is always given when performing an update.
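
For illustration, writing the loss part of the gradient in the generalized linear case (a sketch; the scalar ξ_i is an assumed symbol for the derivative of the scalar loss, and the regularizer gradient is assumed to be handled at the current iterate):

l_i'(w) = \xi_i(w)\, x_i, \qquad \xi_i(w) := \ell'\big(\langle x_i, w\rangle;\, y_i\big),

so that storing the scalar ξ_i evaluated at a past iterate suffices to reconstruct the memorized correction ξ_i x_i whenever index i is touched.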

2.2 Analysis

Recurrence of Iterates

The evolution equation (2) in expectation implies the following recurrence for the distance to the optimum (by crucially using the unbiasedness condition E_i[α_i − ᾱ] = 0):

E_i ‖w^+ − w*‖² = ‖w − w*‖² − 2γ ⟨f'(w), w − w*⟩ + γ² E_i ‖g_i(w)‖²   (4)

Here and in the rest of this paper, expectations are always taken only with respect to the sampled index i (conditioned on the past). We utilize a number of bounds (see [4]), which exploit strong convexity of f (wherever μ appears) as well as Lipschitz continuity of the f_i-gradients (wherever L appears):

(5)
(6)
(7)
(8)
(9)
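
For orientation, standard bounds of this kind include, for example (these are consequences of the μ-strong convexity of f, the L-smoothness of the f_i, and the variance decomposition; the exact forms used in the analysis are those of [4]):

\langle f'(w), w - w^*\rangle \ge \mu\,\|w - w^*\|^2, \qquad
\frac{1}{n}\sum_{i=1}^{n}\|f'_i(w) - f'_i(w^*)\|^2 \le 2L\,\big(f(w) - f(w^*)\big),

\mathbb{E}\|X - \mathbb{E}X\|^2 = \mathbb{E}\|X\|^2 - \|\mathbb{E}X\|^2, \qquad
\|x + y\|^2 \le 2\|x\|^2 + 2\|y\|^2 .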

Eq. (6) can be generalized [4] using ‖x + y‖² ≤ (1 + β)‖x‖² + (1 + β⁻¹)‖y‖² with β > 0. However, for the sake of simplicity, we sacrifice tightness and choose β = 1. Applying all of the above yields:

Lemma 1.

For the iterate sequence of any algorithm that evolves solutions according to Eq. (2), the following holds for a single update step, in expectation over the choice of i:

All proofs are deferred to the Appendix.

Ideal and Approximate Variance Correction

Note that in the ideal case of α_i = f'_i(w*), we would immediately get a condition for a contraction by choosing γ = 1/(2L), yielding a rate of 1 − ρ with ρ = μ/(2L), which is half the inverse of the condition number κ := L/μ.
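
As a worked instance of this statement (a sketch; the contraction follows from the strong convexity and smoothness bounds above):

\alpha_i \equiv f'_i(w^*),\;\; \gamma \le \tfrac{1}{2L}
\;\;\Longrightarrow\;\;
\mathbb{E}\,\|w^{+} - w^*\|^2 \le (1 - \gamma\mu)\,\|w - w^*\|^2,
\qquad \gamma = \tfrac{1}{2L}:\;\; \rho = \gamma\mu = \tfrac{1}{2\kappa}.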

How can we further bound the variance term in the case of “non-ideal” variance-reducing SGD? A key insight is that for memorization algorithms, where each α_i is a stochastic gradient evaluated at some previous iterate, we can apply the smoothness bound in Eq. (7):

(10)

Note that if we only had approximations of the memorized gradients, accurate on average to within ε (see Section 3), then we can use ‖x + y‖² ≤ 2‖x‖² + 2‖y‖² to get the somewhat worse bound:

(11)

Lyapunov Function

Ideally, we would like to show that, for a suitable choice of γ, each iteration results in a contraction E‖w^+ − w*‖² ≤ (1 − ρ)‖w − w*‖² with some 0 < ρ ≤ 1. However, the main challenge arises from the fact that the quantities α_i represent stochastic gradients from previous iterations. This requires a somewhat more complex proof technique. Adapting the Lyapunov function method from [4], we define upper bounds H_i on ‖α_i − f'_i(w*)‖² such that H_i → 0 as w → w*. We start from the initial iterate, (conceptually) initialize the H_i, and then update them in sync with the α_j,

(12)

so that we always maintain valid bounds ‖α_i − f'_i(w*)‖² ≤ H_i. The H_i are quantities showing up in the analysis, but they need not be computed. We now define a parameterized family of Lyapunov functions (this is a simplified version of the one appearing in [4], as we work in the unconstrained regime).

(13)
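
Schematically, one possibility consistent with this description is the following family (a hedged sketch; the weighting S > 0 is a free parameter, not a constant fixed by the text):

\mathcal{L}_S(w, H) := \|w - w^*\|^2 + \frac{S}{n}\sum_{i=1}^{n} H_i,
\qquad H_i \ge \|\alpha_i - f'_i(w^*)\|^2 .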

In expectation under a random update, the Lyapunov function changes in both of its parts. We can readily apply Lemma 1 to bound the change in the first part. The change in the second part is due to (12), which mirrors the update of the α_j variables. By crucially using the property that any j has the same probability q/n of being updated in (3), we get the following result:

Lemma 2.

For a uniform q-memorization algorithm, it holds that

(14)

Note that in expectation the shrinkage does not depend on the location of previous iterates, and the new increment is proportional to the sub-optimality of the current iterate w. Technically, this is how the possibly complicated dependency on previous iterates is dealt with in an effective manner.

Convergence Analysis

We first state our main Lemma about Lyapunov function contractions:

Lemma 3.

Fix the Lyapunov parameter and the target rate arbitrarily. For any uniform q-memorization algorithm with a sufficiently small step size γ such that

(15)

we have that

(16)

Note that (in the limit).

By maximizing the bounds in Lemma 3 over the admissible parameter choices, we obtain our main result, which provides guaranteed geometric rates for all step sizes up to a maximal admissible value.

Theorem 1.

Consider a uniform q-memorization algorithm. For any step size γ up to the maximal admissible value, the algorithm converges at a geometric rate of at least 1 − ρ(γ) with

(17)

where

(18)

We would like to provide more insights into this result.

Corollary 1.

In Theorem 1, the rate is maximized for a specific optimal step size γ*. We can write the corresponding rate ρ* as

(19)

In the big data regime the rate scales as q/n, whereas in the ill-conditioned case it scales as 1/κ.

The guaranteed rate is thus bounded by a term of order 1/κ in the regime where the condition number dominates (large κ) and by a term of order q/n in the opposite regime of large data (small κ). Note that in the large-data regime the rate grows proportionally with q, so there it pays off to increase freshness, as it affects the rate proportionally. In the ill-conditioned regime (κ > n), the influence of q vanishes.

Note that for step sizes beyond the optimal γ*, the rate decreases monotonically, yet the decrease is only minor. With the exception of a small neighborhood around the maximal admissible step size, the entire range of step sizes results in very similar rates. Underestimating γ*, however, leads to a (significant) slow-down proportional to the factor of underestimation.

As the optimal choice of the step size depends on μ, we would prefer step sizes that are μ-independent, thus giving rates that adapt to the local curvature (see [9]). It turns out that by choosing the step size that maximizes the worst-case rate over μ, we obtain a μ-agnostic step size whose rate is off by at most a constant factor:

Corollary 2.

Choosing this universal step size leads to the stated rate guarantee for all μ.

To gain more insights into the trade-offs for these fixed large universal step sizes, the following corollary details the range of rates obtained:

Corollary 3.

Choosing a step size in this range yields a corresponding range of guaranteed rates. In particular, a suitable choice recovers a rate roughly matching the one given in [4] for SAGA (the case q = 1).

3 Sharing Gradient Memory

3.1 ε-Approximation Analysis

As we have seen, fresher gradient memory, i.e. a larger choice for q, improves the guaranteed convergence rate. However, as long as one step of a q-memorization algorithm is as expensive as q steps of a 1-memorization algorithm, this insight does not lead to practical improvements per se. Yet, it raises the question whether we can accelerate these methods, in particular q-SAGA, by approximating the gradients stored in the α_i variables. Note that we always use the correct stochastic gradient f'_i(w) in the current update, and by averaging the corrections exactly, we will not introduce any bias in the update direction. Rather, we lose the guarantee of asymptotically vanishing variance at w*. However, as we will show, it is possible to retain geometric rates up to an ε-ball around w*.

We will focus on SAGA-style updates for concreteness and investigate an algorithm that mirrors q-SAGA, with the only difference that it maintains approximations to the true α_i variables. We aim to guarantee that these approximations are on average ε-accurate and will use Eq. (11) to modify the right-hand side of Lemma 1. We see that the approximation errors are scaled by the step size, which implies that we should aim for small learning rates, ideally without compromising the q-SAGA rate. From Theorem 1 and Corollary 1 we can see that a small step size can be chosen for sufficiently large q, which indicates that there is hope to dampen the effects of the approximations. We now make this argument more precise.

Theorem 2.

Consider a uniform q-memorization algorithm with α-updates that are on average ε-accurate. For any step size not exceeding the bound given by Corollary 5 in the appendix (which recovers the exact-case bound as ε → 0), we get

(20)

where the expectation is now the (unconditional) expectation over histories (in contrast to the conditional expectations used so far).

Corollary 4.

With we have

(21)

In the relevant case of sufficiently small ε, we thus converge towards some ε-ball around w* at a similar rate as for the exact method. For larger ε, we have to reduce the step size significantly to compensate for the extra variance and still converge to an ε-ball, resulting in a slower rate.

We also note that the geometric convergence of SGD with a constant step size to a neighborhood of the solution (also proven in [8]) arises as a special case of our analysis. By setting all α_i ≡ 0 in Lemma 1, plain SGD corresponds to a fixed approximation error determined by the gradient variance at the optimum. An approximate memorization algorithm can thus be interpreted as making this error an algorithmic parameter, rather than a fixed value as in SGD.

3.2 Algorithms

Sharing Gradient Memory

We now discuss our proposal of using neighborhoods for sharing gradient information between close-by data points. Thereby we avoid an increase in gradient computations relative to SAGA or q-SAGA, at the expense of suffering an approximation bias. This leads to a new trade-off between freshness and approximation quality, which can be resolved in non-trivial ways, depending on the desired final optimization accuracy.

We distinguish two types of quantities. First, the gradient memory α_i as defined by the reference algorithm q-SAGA. Second, the shared gradient memory state β_i, which is used in place of α_i in the update rule of Eq. (2). Assume that we select an index i for the weight update; we then generalize Eq. (3) as follows:

β_j^+ := f'_i(w) for j ∈ N_i,   β_j^+ := β_j otherwise   (22)

In the important case of generalized linear models, where one has f'_j(w) determined by a scalar times x_j, we can modify the relevant case in Eq. (22) so that the fresh scalar is shared while each neighbor keeps its own direction x_j. This has the advantage of using the correct direction, while reducing storage requirements.
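
A minimal sketch of this shared-memory update for a generalized linear model (Python-style; the helper names neighbors and scalar_grad as well as the exact treatment of the regularizer are illustrative assumptions, not the paper's implementation):

import numpy as np

def n_saga_glm_step(w, X, y, beta, beta_bar, gamma, lam, neighbors, scalar_grad):
    """One sharing step for a generalized linear model (sketch).

    beta[j]  : shared scalar memory for data point j; the implied (loss-part)
               correction is beta[j] * X[j]; the l2 regularizer lam * w is taken
               exactly at the current iterate for simplicity.
    beta_bar : cached average (1/n) * sum_j beta[j] * X[j].
    scalar_grad(z, y_j) : derivative of the loss w.r.t. the inner product z.
    neighbors(i)        : indices whose memory is refreshed when i is sampled.
    """
    n = X.shape[0]
    i = np.random.randint(n)
    xi = scalar_grad(X[i] @ w, y[i])              # fresh scalar at the current iterate
    g = xi * X[i] - beta[i] * X[i] + beta_bar     # variance-corrected loss gradient
    w_new = w - gamma * (g + lam * w)             # add the exact regularizer gradient
    # share the fresh scalar with the neighborhood of i (GLM variant of Eq. (22)):
    for j in neighbors(i):
        beta_bar = beta_bar + (xi - beta[j]) * X[j] / n   # keep the average in sync
        beta[j] = xi
    return w_new, beta, beta_bar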

Approximation Bounds

For our analysis, we need to control the approximation error ‖β_i − α_i‖. This obviously requires problem-specific investigations.

Let us first look at the case of ridge regression, where the loss is the squared error and the regularizer is the scaled squared norm, so that f'_i(w) = (⟨x_i, w⟩ − y_i) x_i + λw. Considering a memory location j ∈ N_i being updated with the gradient of data point i, we have

(23)

where the quantities involved depend only on the data. Note that they can be pre-computed, with the exception of the norm of w, which we only know at the time of an update.
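
To make the structure of such a bound explicit, here is a hedged derivation in the assumed ridge notation (writing r_i := ⟨x_i, w⟩ − y_i; the regularizer terms λw cancel in the difference, and the constants need not coincide exactly with those of Eq. (23)):

f'_i(w) - f'_j(w) = r_i\, x_i - r_j\, x_j = r_i\,(x_i - x_j) + (r_i - r_j)\, x_j,

\|f'_i(w) - f'_j(w)\| \;\le\; |r_i|\,\|x_i - x_j\| + \big(\|x_i - x_j\|\,\|w\| + |y_i - y_j|\big)\,\|x_j\| .

Everything here except ‖w‖ can be pre-computed from the data, which matches the remark above.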

Similarly, for regularized logistic regression with labels y_i ∈ {−1, +1}, we have f'_i(w) = −y_i σ(−y_i ⟨x_i, w⟩) x_i + λw, where σ denotes the logistic function. With the requirement on neighbors that y_i = y_j, we get

(24)

Again, the data-dependent quantities involved can be pre-computed and stored alongside the gradient memory.

εN-SAGA

We can use these bounds in two ways. First, assuming that the iterates stay within a norm-ball around the origin, we can derive a priori upper bounds

(25)

Obviously, the more compact the neighborhoods are, the smaller these bounds. This is most useful for the analysis. Second, we can specify a target accuracy ε and then prune neighborhoods dynamically. This approach is more practically relevant, as it allows us to directly control the approximation quality. However, a dynamically varying neighborhood violates Definition 1. We fix this in a sound manner by modifying the memory updates as follows:

(26)

This allows us to interpolate between sharing more aggressively (saving computation) and performing more computations in an exact manner. In the limit of ε → 0, we recover q-SAGA; in the limit of ε → ∞, we recover the first variant mentioned.

Computing Neighborhoods

Note that the pairwise Euclidean distances ‖x_i − x_j‖ show up in the bounds in Eqs. (23) and (24). In the classification case we also require y_i = y_j, whereas in the ridge regression case, we also want |y_i − y_j| to be small. Thus, modulo this filtering, this suggests the use of Euclidean distances as the metric for defining neighborhoods. Standard approximation techniques for finding near(est) neighbors can be used [1, 3]. This comes with a computational overhead, yet the additional costs will amortize over multiple runs or multiple data analysis tasks.
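
As an illustration of this pre-processing step (a sketch assuming scikit-learn is available; the same-label filter follows the classification requirement mentioned above):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_neighborhoods(X, y, k, classification=True):
    """Pre-compute, for every data point i, up to k Euclidean neighbors,
    optionally restricted to points carrying the same label."""
    nbrs = NearestNeighbors(n_neighbors=min(k + 1, len(X))).fit(X)
    _, idx = nbrs.kneighbors(X)                        # idx[i] typically contains i itself
    neighborhoods = []
    for i, candidates in enumerate(idx):
        cand = [j for j in candidates if j != i]
        if classification:
            cand = [j for j in cand if y[j] == y[i]]   # keep same-label neighbors only
        neighborhoods.append(np.array([i] + cand[:k])) # always include i itself
    return neighborhoods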

[Figure 1 panels: (a) Cov, (b) Ijcnn1, (c) Year; suboptimality plotted against gradient evaluations (top rows) and datapoint evaluations (bottom rows).]
Figure 1: Comparison of q-SAGA, εN-SAGA, SAGA and SGD (with decreasing and constant step size) on three datasets. The top two rows show the suboptimality as a function of the number of gradient evaluations for two different values of the regularization strength. The bottom two rows show the suboptimality as a function of the number of datapoint evaluations (i.e. number of stochastic updates) for the same two values.

4 Experimental Results

Algorithms

We present experimental results on the performance of the different variants of memorization algorithms for variance reduced SGD as discussed in this paper. SAGA has been uniformly superior to SVRG in our experiments, so we compare SAGA and εN-SAGA (from Eq. (26)), alongside SGD as a straw man and q-SAGA as a point of reference for possible speed-ups. We have chosen the same value for q in q-SAGA and for the neighborhood size in εN-SAGA. The same setting was used across all data sets and experiments.

Data Sets

As special cases for the choice of the loss function and regularizer in Eq. (1), we consider two commonly occurring problems in machine learning, namely least-squares regression and ℓ2-regularized logistic regression. We apply least-squares regression to the Million Song Year regression dataset from the UCI repository, and logistic regression to the cov and ijcnn1 datasets obtained from the libsvm website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets). In all cases we added an ℓ2-regularizer to ensure that the objective is strongly convex.

Experimental Protocol

We have run the algorithms in question in an i.i.d. sampling setting and averaged the results over 5 runs. Figure 1 shows the evolution of the suboptimality of the objective as a function of two different metrics: (1) the number of update steps performed (“datapoint evaluations”), and (2) the number of gradient computations (“gradient evaluations”). Note that SGD and SAGA compute one stochastic gradient per update step, unlike q-SAGA, which is included here not as a practically relevant algorithm, but as an indication of the potential improvements that could be achieved by fresher corrections. A fixed constant step size was used everywhere, except for plain SGD. In all cases, this was close to the optimal value suggested by our analysis; moreover, using the step size for SAGA suggested in previous work [9] did not appear to give better results. For plain SGD, we used a decreasing schedule of the form γ_t ∝ 1/t with constants optimized coarsely via cross-validation. The x-axis is expressed in units of n update steps (suggestively called “epochs”).

SAGA vs. constant step size SGD

As we can see, if we run SGD with the same constant step size as SAGA, it takes several epochs until SAGA really shows a significant gain. The constant step-size variant of SGD is faster in the early stages until it converges to a neighborhood of the optimum, where individual runs start showing a very noisy behavior.

SAGA vs. q-SAGA

q-SAGA outperforms plain SAGA quite consistently when counting stochastic update steps. This establishes optimistic reference curves of what we can hope to achieve with εN-SAGA. The actual speed-up is somewhat dataset dependent.

εN-SAGA vs. SAGA and q-SAGA

εN-SAGA with sufficiently small ε can realize much of the possible freshness gains of q-SAGA and performs very similarly for the first few (2-10) epochs, where it traces nicely between the SAGA and q-SAGA curves. We see solid speed-ups on all three datasets for both values of the regularization strength.

Asymptotics

It should be clearly stated that running εN-SAGA at a fixed ε for longer will not result in good asymptotics on the empirical risk. This is because, as the theory predicts, εN-SAGA cannot drive the suboptimality to zero, but rather levels off at a point determined by ε. In our experiments, the cross-over point with SAGA was typically after a few epochs. Note that the gains in the first epochs can be significant, though. In practice, one will either define a desired accuracy level and choose ε accordingly, or one will switch to SAGA for accurate convergence.

5 Conclusion

We have generalized variance reduced SGD methods under the name of memorization algorithms and presented a corresponding analysis, which applies uniformly to all such methods. We have investigated in detail the range of safe step sizes with their corresponding geometric rates as guaranteed by our theory. This has delivered a number of new insights, for instance about the trade-offs between small and large step sizes in different regimes, as well as about the role of the freshness of stochastic gradients evaluated at past iterates.

We have also investigated and quantified the effect of additional errors in the variance correction terms on the convergence behavior. Depending on how these errors scale, we have shown that they can be tolerated; yet, for small target accuracies, they may have a negative effect on the convergence rate, as much smaller step sizes are needed to still guarantee convergence to a small region. We believe this result to be relevant for a number of approximation techniques in the context of variance reduced SGD.

Motivated by these insights and the results of our analysis, we have proposed N-SAGA, a modification of SAGA that exploits similarities between training data points by defining a neighborhood system. Approximate versions of per-data-point gradients are then computed by sharing information among neighbors. This opens up the possibility of variance reduction in a streaming data setting, where each data point is only seen once. We believe this to be a promising direction for future work.

Empirically, we have been able to achieve consistent speed-ups for the initial phase of regularized risk minimization. This shows that approximate computation of variance correction terms constitutes a promising approach for trading off computation against solution accuracy.

Acknowledgments

We would like to thank Yannic Kilcher, Martin Jaggi, Rémi Leblond and the anonymous reviewers for helpful suggestions and corrections.

References

  • [1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.
  • [2] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pages 177–186. Springer, 2010.
  • [3] S. Dasgupta and K. Sinha. Randomized partition trees for nearest neighbor search. Algorithmica, 72(1):237–263, 2015.
  • [4] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
  • [5] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
  • [6] J. Konečnỳ and P. Richtárik. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.
  • [7] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • [8] M. Schmidt. Convergence rate of stochastic gradient with constant step size. UBC Technical Report, 2014.
  • [9] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.
  • [10] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical programming, 127(1):3–30, 2011.
  • [11] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.

Appendix A Appendix

Lemma 1.

For the iterate sequence of any algorithm that evolves solutions according to Eq. (2), the following holds for a single update step, in expectation over the choice of i:

Proof.

Starting from Eq. (4) we have

Lemma 2.

For a uniform q-memorization algorithm, it holds that

Proof.

From the uniformity property in Definition 1, it follows that

Exploiting this fact completes the proof. ∎

Lemma 3.

Fix the Lyapunov parameter and the target rate arbitrarily. For any uniform q-memorization algorithm with a sufficiently small step size γ such that

we have that

(27)

Note that (in the limit).

Proof.

From Lemma 1, we can see that we will have based on the part of . Hence, we can write the rate as , where .

Let us now apply both Lemma 1 and Lemma 2 to quantify the progress guaranteed to be made in one iteration of the algorithm in expectation, combining the changes to the iterate as well as those to the memory into the Lyapunov function. We then obtain

(28)

As we argued after Eq. (12), the definition of combined with property (10) ensure the crucial bound . Including it and gathering terms in the same “units”, we get:

(29)

We can further simplify the term in the second square brackets using the definition introduced above (in hindsight motivating that definition):

(30)

We require this term to be non-negative, so that we can safely drop it. This leads to an upper bound requirement on the step size:

(31)

The term in the first square brackets in Eq. (29) needs to match the claimed contraction factor in order to recover the rate. Inserting the relevant definition and dividing yields

(32)

We can summarize the derivation in the claimed combined inequality. ∎

Theorem 1.

Consider a uniform q-memorization algorithm. For any step size γ up to the maximal admissible value, the algorithm converges at a geometric rate of at least 1 − ρ(γ) with

where

Proof.

Consider a fixed step size. There are potentially (infinitely) many parameter choices that fulfill the condition in Eq. (15). Among those, the largest rate is obtained by maximizing over them. Note that for any choice that does not achieve Eq. (15) with equality for both terms, one can find a larger rate for the same step size by exploiting the slack in the first or the second inequality, respectively. We thus focus on step sizes that are maximal for some choice of the parameters. Equality with the second bound directly gives us

(33)

We plug this into the first bound and again equal , which yields an optimality condition for

(34)

and thus

(35)

It remains to check what the admissible range of is that achieves the bound in Eq. (15) as we required. The latter is determined by the constraints . From Eq. (34) we can read off for ,

(36)

At the other extreme of we can solve the resulting quadratic equation in

(37)

to get as claimed in Eq. (18) (excluding the second root which yields ). Moreover, for we choose to maximize the rate and have . ∎

Corollary 1.

In Theorem 1, the rate is maximized for a specific optimal step size γ*. We can write the corresponding rate ρ* as

In the big data regime the rate scales as q/n, whereas in the ill-conditioned case it scales as 1/κ.

Proof.

Plugging in the definitions and performing some symbolic simplifications yields the result. ∎

Corollary 2.

Choosing this universal step size leads to the stated rate.

Proof.

Write , then if :