Closing the convergence gap of SGD without replacement

Closing the convergence gap of SGD without replacement

Abstract

Stochastic gradient descent without replacement sampling is widely used in practice for model training. However, the vast majority of SGD analyses assumes data sampled with replacement, and when the function minimized is strongly convex, an rate can be established when SGD is run for iterations. A recent line of breakthrough work on SGD without replacement (SGDo) established an convergence rate when the function minimized is strongly convex and is a sum of smooth functions, and an rate for sums of quadratics. On the other hand, the tightest known lower bound postulates an rate, leaving open the possibility of better SGDo convergence rates in the general case. In this paper, we close this gap and show that SGD without replacement achieves a rate of when the sum of the functions is a quadratic, and offer a new lower bound of for strongly convex functions that are sums of smooth functions.

1 Introduction

Stochastic gradient descent (SGD) is a widely used first order optimization technique used to approximately minimize a sum of functions

In its most general form, SGD produces a series of iterates

where is the -th iterate, is a stochastic gradient defined below, is a random variable that determines the choice of a single or a subset of sampled functions , and represents the stepsize. With- and without-replacement sampling of the individual component functions are regarded as some of the most popular variants of SGD. During SGD with replacement sampling, the stochastic gradient is equal to and is a uniform number in , i.e., a with-replacement sample from the set of gradients . In the case of without-replacement sapling, the stochastic gradient is equal to and is the -th ordered element in a random permutation of the numbers in , i.e., a without-replacement sample.

In practice, SGD without replacement is much more widely used compared to its with-replacement counterpart, as it can empirically converge significantly faster [1, 12, 11]. However, in the land of theoretical guarantees, with-replacement SGD has been the focal point of convergence analyses. The reason for this is that analyzing stochastic gradients born with replacement is significantly more tractable for a simple reason: in expectation, the stochastic gradient is equal to the “true” gradient of , i.e., . This makes SGD amenable to analyses very similar to that of vanilla gradient descent (GD), which has been extensively studied under a large variety of function classes and geometric assumptions, e.g., see [2].

Unfortunately, the same cannot be said for SGD without replacement, which has long resisted non-vacuous convergence guaranteess. For example, although we have long knew that SGD with replacement can achieve a rate for strongly convex functions , the best bounds we had for SGD without replacement did not even match that rate, in contrast to empirical evidence. However, a recent series of breakthrough results have established convergence guarantees for SGD without replacement establishing, similar or better rates than SGD with replacement.

is strongly convex and a sum of quadratics
Lower bound for quadratics [13]
Upper bound in [6]
Our upper bound
is strongly convex and a sum of smooth functions
Lower bound for quadratics [13]
Upper bound in [7]
Our lower bound
Table 1: Comparison of our lower and upper bounds to current state-of-the-art results. With magenta we denote the state-of-the-art bounds we are comparing against, and with cyan the new bounds. Our matching bounds establish information theoretically optimal rates for SGD.NOTE: hides the logarithmic factors.

In [4], the authors establish for the first time that for sums of quadratics or smooth functions, there exist parameter regimes under which SGDo achieves an rate compared to the rate of with-replacement sampling SGD. In this case, if is considered a constant, then SGDo becomes times faster than SGD with replacement. More recently, [6] showed that for functions that are sums of quadratics, or smooth functions under a Hessian smoothness assumption, one could obtain an even faster rate of . [7] show that for Lipschitz convex functions SGDo is at least as fast as SGD with replacement, and for functions that are strongly convex and sum of smooth components one can achieve a rate of . This latter result was the first convergence rate that provably establishes the superiority of SGD without replacement even for the regime that is not a constant, as long as grows faster than .

This new wave of upper bounds has also been followed by new lower bounds. [13] establishes that there exist sums of quadratics on which SGDo cannot converge faster than . This lower bound gave rise to a gap between achievable rates and information theoretic impossibility. On one hand, there exists this aformentioned lower bound for sums of quadratics. On the other hand the best upper bound for sums of quadratics is and for the more general class of strongly convex functions that are sums of smooth functions the best rate is . This leaves open the question of whether the upper or lower bounds are loose. This is precisely the gap we close in this work.

Our Contributions: In this work, we establish tight bounds for SGDo. We close the gap between lower and upper bounds on two of the function classes that prior works have focused on: strongly convex functions that are i) sums of quadratics and ii) sums of smooth functions. Specifically, for i), we offer tighter convergence rates, i.e., an upper bound that matches the lower bound of [13]; as a matter of fact our convergence rates apply to general quadratic functions that are strongly convex, which is a little more general of a function class. For ii), we provide a new lower bound that matches the upper bound by [7]. A detailed comparison of current and proposed bounds can be found in Table 1.

A few words on the techniques used are in place. For our convergence rate on quadratic functions, we heavily rely on and combine the approaches of [7] and [6]. The convergence rate analyses proposed by [6] can be tightened by a more careful analysis that employs iterate coupling similar to [7], combined with new bounds on the deviation of the stochastic, without-replacement gradient from the true gradient of .

For our lower bound, we use a similar construction to the one in [13], with the difference that each of the individual function components is not a quadratic function, but rather a piece-wise quadratic. This particular function has the property we need: it is smooth, but not quadratic. By appropriately scaling the sharpness of the individual quadratics we construct a function that behaves in a way that SGD without replacement cannot converge faster than a rate of , no matter what stepsize one chooses.

We note that although our methods have an optimal dependence on and , we believe that the dependence on function parameters, e.g., strong convexity, Lipschitz, and smoothness, can potentially be improved.

2 Related Work

The recent flurry of work on without replacement sampling in stochastic optimization extends to several variants of stochastic algorithms beyond SGD. In [8, 16], the authors provide convergence rates for random cyclic coordinate descent, establishing for the first time that it can provably converge faster than stochastic coordinate descent with replacement sampling. This work is complemented by a lower bound on the gap between the random and non-random permutation variant of coordinate descent [15]. Several other works have focused on the random permutation variant of coordinate descent, e.g., see [5, 14] [14]. In [3], novel bounds are given for incremental Newton based methods. In [9], the authors present convergence bounds for with replacement sampling and distributed SGD. Finally, [17] present asymptotic bounds for SGDo for strongly convex functions, and show that with a constant stepsize it approaches the global optimizer to within smaller error radius compared to SGD with replacement.

3 Preliminaries and Notation

We focus on using SGDo to approximately find , the global minimizer of the following unconstrained minimization problem

In our convergence bounds, we denote by the total number of iterations of SGDo, and by the total number of epochs, i.e., passes over the data. Hence, . In our derivations, we denote by the -th iterate of the -th epoch. Consequentially, we have that .

Our results in the following sections rely on the following assumptions.

Assumption 1.

(Convexity of Components) is convex for all .

Assumption 2.

(Strong Convexity) is strongly convex with strong convexity parameter , that is

Assumption 3.

(Bounded Domain)

Assumption 4.

(Bounded Gradients)

Assumption 5.

(Lipschitz Gradients)

4 Optimal SGDo rates for quadratics

In this section, we will focus on strongly convex functions that are quadratic. We will provide a tight convergence rate that improves upon the rate by [6] and matches the lower bound of [13] up to logarithmic factors.

For strongly convex functions that are a sum of smooth functions, [7] offer a rate of , whereas for strongly convex quadratics [6] give a convergence rate of . A closer comparison of these two rates reveals that neither of them can be tight due to the following observation. Assume that . Then, that implies that

At the same time if we asusme that the number of data points is significantly larger than the number of epochs that we run SGDo for, i.e., we have that

In comparison, the known lower bound for quadratics given by [13] is . This makes one wonder what is the true convergence rate of SGDo in this case. We settle the optimal rates for quadratics here by providing an upper bound which matches the known lower bound (up to logarithmic factors).

For the special case of one dimensional quadratics, [13] proved an upper bound matching the one we prove in this paper. Further, the paper conjectures that the proof can be extended to the generic multidimensional case. However, the authors claim that the main technical barrier for this extension is that it requires a special case of a matrix-valued arithmetic-geometric mean inequality which has only been conjectured to be true but not yet proven. The authors further conjecture that their proof can be extended to general smooth and strongly convex functions, which turns out to not be true, as we show in Theorem 2. On the other hand, we believe that our proof can be extended to the more general family of strongly convex functions, where the Hessian is Lipschitz, similar to the the way [6] extend their proof to that case.

In addition to assumptions 1-5 above, here we also assume the following:

Assumption 6.

is a quadratic function

where is a positive semi-definite matrix.

Note that this assumption is a little more general than the assumption that is a sum of quadratics. Also, note that this assumption, in combination with the assumptions on strong convexity and Lipschitz gradients implies bounds on the minimum and maximum eigenvalues of the Hessian of , that is,

where is the identity matrix and means that for all .

Theorem 1.

Under Assumptions 1-6, let the step size of SGDo be

and the total number of epochs we run it for be

Then, after iterations SGDo achieves the following rate

where hides logarithmic factors.

We note at this point that in our bounds we hide factors relating to problem parameters, like , , , etc, as we believe that a possibly better dependence is possible.

The proof for Theorem  1 uses ideas from [6] and [7]. In particular, one of the central ideas in these two papers is that they aim to quantify the amount of progress made by SGDo over a single epoch. Both analyses decompose the progress of the iterates in an epoch as steps of full gradient descent plus some noise term.

Similar to [6], we use the fact that the Hessian of is constant, which helps us better estimate the value of gradients around the minimizer. In contrast to that work, we do not require all individual components to be quadratic, but rather the entire to be a quadratic function.

An imporatant result proved by [7] is that during an epoch, the iterates do not steer off too far away from the starting point of the epoch. This allows one to obtain a reasonably good bound on the noise term, when one tries to approximate the stochastic gradient with the true gradient of . In our analysis, we prove a slightly different version of the same result using an iterate coupling argument similar to the one in [7].

The analysis of [7] relies on computing the Wasserstein distance between the unconditional distribution of iterates and the distribution of iterates given a function sampled during an iteration. In our analysis, we use the same coupling, but we bypass the Wasserstein framework that [7] suggests and directly obtain a bound on how far the coupled iterates move away from each other during the course of an epoch. This results, in our view, to a somewhat simpler and shorter proof.

4.1 Sketch of proof for Theorem 1

Now we give an overview of the proof. As mentioned before, similar to the previous works, the key idea is to perform a tight analysis of the progress made during an epoch. This is captured by the following Lemma.

{restatable}

lemmalemupperBoundEpoch Let the step size be

and the total number of epochs we run SGDo be

Then there exist unviversal constants and such that for any epoch we have

(1)

At this point, we would like to remark that the bound on epochs may be a bit surprising as and are dependent. However, note that since , we can show that the bound on above is satisfied if we set the number of epochs to be greater than for some constant . Furthermore, we note that the dependence of on (i.e., the condition number of ) is most probably not optimal. In particular both [7] and [6] have a better dependence on the condition number.

Given the result in Lemma 4.1, proving Theorem 1 is a simple exercise. To do so, we can simply unroll the recursion (1) across consecutive epochs:

We can now use the fact that and consequently

This leads to the following bound

We further know that for -smooth and -strongly convex functions, the strong convexity parameter cannot be larger than the smoothness parameter, that is . This simple inequality further implies that

By setting the stepsize to be and noting that , and we have the following bound

This establishes Theorem 1.

4.2 With- and without-replacement stochastic gradients are close

One of the key lemmas in [7] establishes that once SGDo iterates get closer to the global minimizer , then any iterate at any time during an epoch stays close to the iterate at the beginning of that epoch. To be more precise, the lemma we refer to is the following.

Lemma 1.

(adapted from Lemma 5 in [7])

We would like to note that Lemma 1 is slightly different from the one in [7], which instead uses rather than , but their proof can be adapted to obtain this version above.

Now, consider the case when the iterates are very close to the optimum and hence . Then, Lemma 1 implies that does not grow quadratically in which would generically happen for gradient steps, but it rather grows linearly in . This is an important and useful fact for SGDo: it shows that all iterates within an epoch remain close to .

Hence, since the iterates of SGDo don’t move too much during an epoch, then the gradients computed throughout the epoch at points should be well approximated by gradients computed on the iterate. Roughly, this translates to the following observation: the gradient steps taken through a single epoch are almost equal to steps of full gradient descent computed at . This is in essence what allows SGDo to achieve better convergence than SGD - an epoch can more often than not be approximated by steps of gradient descent, rather than by a single step.

Now, let represent the random permutation of the functions during the -th epoch. Thus, is the index of the function chosen at the -th iteration of the -th epoch. Proving Lemma 1 requires proving that the function value of , in expectation, is almost equal to . In particular, we prove the following lemma in our supplemental material.

Lemma 2.

For any epoch and -th ordered iterate during that epoch, we have that

(2)

This lemma establishes that SGDo behaves almost like SGD with replacemnent, for which the following is true

To prove this lemma, [7] consider the conditional distribution of iterates, given the current function index, that is , and the unconditional distribution of the iterates . Then, they prove that the absolute difference can be upper bounded by the Wasserstein distance between these two distributions. To further upper bound the Wasserstein distance, they propose a coupling between the two distributions. To prove our slightly different version of Lemma 1, we proved (2) without using this Wasserstein framework. Instead, we use the same coupling argument to directly get a bound on (2). Below we explain the coupling and provide a short intuition.

Consider the condtional distribution of . If we take the distribution of , we can generate the support of by taking all permutations and by swapping and among them. This is essentially a coupling between these two distributions, proposed in [7]. Now, if we use this coupling to convert a permutation in to a permutation , the corresponding and would be within a distance of . This distance bound is Lemma 2 from [7], and for completeness, the proof of this is provided inside the detailed proof of Lemma 2 in the appendix.

We can now use such distance bound, and let denote a (random) vector whose norm is less than . Then,

Similarly, for any :

Therefore,

Similarly, we can prove that

Combining these two results we obtain (2).

The full proof of Theorem 1 requires some more nuanced bounding derivations, and the complete details can be found in the Appendix.

5 Lower bound for general case

In the previous section, we establish that for quadratic functions the lower-bound by [13] is essentially tight. This still leaves open the possibility that a tighter lower bound may exist for strongly convex functions that are not quadratic. After all, the best convergence rate known for strongly convex functions that are sums of smooth functions is of the order of .

Indeed, in this section, we show that the convergence rate of establihed by [7] is tight. In particular, we have the following for different step sizes. For steplength , [13] shows that . For any constant , a steplength of , the theorem of [13] can be adapted directly to get . Finally, for a certain constant (see the proof of Theorem 2 in the Appendix for the value of ), we show the following theorem

Theorem 2.

For , there exists a strongly convex function that is sum of smooth convex functions such that

(3)

Thus overall, we get that for any fixed step size .

Next, we try to explain the function construction and proof technique behind Theorem 2. The construction of the lower bound is similar to that in [13]. The difference is that the prior work considers quadratic functions, while we consider a slightly modified “piece-wise” quadratic function.

We specifically construct the following function :

Now, of the component functions , are of the 1st kind

and functions are of the 2nd kind

For our construction, we let set to be a big enough positive constant. See for example, Fig. 1.

Figure 1: Lower bound construction. Note that represents functions of the first kind, and represents functions of the second kind, and represents the overall function.

Next we ought to verify that this function abides to Assumptions 1-5.

Note that Assumption 1 is satisfied, as it can be seen that functions ’s are all continuous and convex. Furthermore, the strong convexity of can also be established in a straightforward way. Assumption 3 and 4 can be enforced artificially. Finally, let us focus on Assumption 5. To prove that this function has smooth gradients, we need to show

If , that is and lie on the same side of the origin, then this is simple to see because they both lie on the same quadratic. Otherwise, assume and . Also, assume that is function of the first kind, that is the linear term in is Then,

Overall, the difficulty in the analysis comes from the fact that unlike the functions considered in [13], our functions are piecewise quadratics.

Let us initialize at (the minimizer). We will show that in expectation, at the end of epochs, the iterate would be at a certain distance (in expectation). Note that the progress made over an epoch is just the sum of gradients (multiplied by ) over the epoch:

where represents the index of the -th function chosen in the -th epoch. Next, note that the gradients from the linear components are equal to , that is they are constant. Thus, they will cancel out over an epoch.

However the gradients from the quadratic components do not cancel out, and in fact that part of the gradient will not even be unbiased, in the sense that if , the gradient at from the quadratic component will be less in magnitude than the gradient from the quadratic component at .

The idea is to now ensure that if an epoch starts off near the minimizer, then the iterates spend a certain amount of time in region, so that they ‘accumulate’ a lot of gradients of the form , which makes the sum of the gradients at the end of the epoch biased away from the minimizer.

To ensure that the iterates spend some time in the region, we analyze the contribution of the linear components during the epoch. This is because when the iterates are already near the minimizer , the gradient contribution of the quadratic terms would be small, and the dominating component would come from the linear terms. What this means is that during the epoch, it is the linear terms which contribute the most towards the “iterate movement”.

Then, to obtain a lower bound matching the upper bound of [7], observe that it is indeed this contribution of the linear terms that we require to get a tight bound on. This is because, the upper bound from [7] was in fact directly dependent on the “noisy” movement during such an epoch (Lemma 5 from [7]). We give below the informal version of the main lemma for the proof:

Lemma 3.

Let be a random permutation of . Then for we have that

For the purpose of intuition, ignore the contribution of gradients from the quadratic terms. Then, the lemma above says that during an epoch, the gradients from the linear terms would move the iterates approximately away from the minimizer (after we multiply by the stepsize ).

This implies that in the middle of an epoch, with (almost) probability the iterates would be near and with (almost) probability the iterates would be near . Hence, over the epoch, the accumulated quadratic gradients multiplied by the stepsize would look like

If this happens for epochs, we get that the accumulated steps would be for . Since , we know that . Since is the minimizer of our function in this setting, we have constructed a case where SGDo achieves error

This completes the sketch of the proof and the complete proof of Theorem 2 is given in the Appendix.

5.1 Discussion on possible improvements

Theorem 2 hints that for faster convergence rates in the epoch based random shuffling SGD, we would not just require smooth and strongly convex functions, but also potentially require that the Hessians of such functions to be Lipschitz.

We conjecture that Hessian Lipschitzness is sufficient to get the convergence rate of Theorem 1. We think that this is interesting, because the optimal rates for both SGD with replacement and vanilla gradient Ddescent only require strong convexity (and gradient smoothness for SGD). However, here we prove that an optimal rate for SGDo requires the function to be quadratic as well (or at the very least Hessian Lipschitz), and SGDo seems to converge slower if the Hessian is not Lipschitz.

6 Conclusions and Future Work

SGD without replacement has long puzzled researchers. From a practical point of view, it always seems to outperform SGD with replacement, and is the algorithm of choice for training modern machine learning models. From a theoretical point of view, SGDo has resisted tight convergence analysis that establish its performance benefits. A recent wave of work established that indeed SGDo can be faster than SGD with replacement sampling, however a gap still remained between the achievable rates and the best known lower bounds.

In this paper we settle the optimal performance of SGD without replacement for functions that are quadratics, and strongly convex functions that are sums of smooth functions. Our results indicate that a possible improvement in convergence rates may require a fundamnetally different stepsize rule and significantly different function assumptions.

As future directions, we believe that it would be interesting to establish rates for variants of SGDo that do not re-permute the functions at every epoch. This is something that is common in practice, where a random permutation is only performed once every few epochs without a significant drop in performance. Current theoretical bounds are insufficient to explain this phenomenon, and a new theoretical breakthrough may be required to tackle it.

We however believe that one of the strongest new theoretical insights introduced by [7] and used in our analyses can be of significance in a potential attempt to analyze other variants of SGDo as the one above. This insight is that of iterate coupling. That is the property that SGDo iterates are only mildly perturbed after swapping only two elements of a permutation. Such a property is reminiscent to that of algorithmic stability, and a deeper connection between that and iterate coupling is left as a meanigful intellectual endeavor for future work.

Appendix A Proof of Theorem 1

Remark: The proofs of all the theorems, lemmas, claims and corollaries are provided in this Appendix (except the ones cited from other papers).

Proof.

The proof for upper bound uses the framework of [6], combined with some crucial ideas from [7]. In the block diagram below we connect the pieces needed to establish the proof. All lemmas and proofs follow.

Figure 2: A dependency graph for the proof of Theorem 1, giving short descriptions of the components required.

First, we use the following lemma which quantifies the expected progress made per epoch \lemupperBoundEpoch* Simply unrolling this equation for epochs, we get:

(Using the formula for sum of Geometric Progression)
(Since )
(Substituting )

where in the last inequality we used the fact that and hence . Therefore, we get that

a.1 Proof of Lemma 4.1

Proof.

Throughout this proof we will be working inside an epoch, so we would be skipping the super script in which denotes the -th epoch. Thus in this proof, refers to the iterate at beginning of that epoch. Let denote the permutation of used in this epoch. Therefore at the -th iteration of this epoch, we compute the gradient on . Next we define the error term (this is the same error term as the one defined in [6])

(4)

Then,

(5)
Theorem 3.

(Theorem 2.1.11 from  [10]) For strongly convex function ,

Next, we need to control and . To achieve that we introduce the following two lemmas.

Lemma 4.
(6)
Lemma 5.
(7)

Substituting these two inequalities in (5), we get that

(8)

Since , and , we can see that

(9)
(10)
(11)
(12)

Finally, using (9),(10), (11) and (12) in (8), we get

This completes the proof. The only thing left is to prove the following inequalities and :