Closing the convergence gap of SGD without replacement
Stochastic gradient descent without replacement sampling is widely used in practice for model training. However, the vast majority of SGD analyses assumes data sampled with replacement, and when the function minimized is strongly convex, an rate can be established when SGD is run for iterations. A recent line of breakthrough work on SGD without replacement (SGDo) established an convergence rate when the function minimized is strongly convex and is a sum of smooth functions, and an rate for sums of quadratics. On the other hand, the tightest known lower bound postulates an rate, leaving open the possibility of better SGDo convergence rates in the general case. In this paper, we close this gap and show that SGD without replacement achieves a rate of when the sum of the functions is a quadratic, and offer a new lower bound of for strongly convex functions that are sums of smooth functions.
Stochastic gradient descent (SGD) is a widely used first order optimization technique used to approximately minimize a sum of functions
In its most general form, SGD produces a series of iterates
where is the -th iterate, is a stochastic gradient defined below, is a random variable that determines the choice of a single or a subset of sampled functions , and represents the stepsize. With- and without-replacement sampling of the individual component functions are regarded as some of the most popular variants of SGD. During SGD with replacement sampling, the stochastic gradient is equal to and is a uniform number in , i.e., a with-replacement sample from the set of gradients . In the case of without-replacement sapling, the stochastic gradient is equal to and is the -th ordered element in a random permutation of the numbers in , i.e., a without-replacement sample.
In practice, SGD without replacement is much more widely used compared to its with-replacement counterpart, as it can empirically converge significantly faster [1, 12, 11]. However, in the land of theoretical guarantees, with-replacement SGD has been the focal point of convergence analyses. The reason for this is that analyzing stochastic gradients born with replacement is significantly more tractable for a simple reason: in expectation, the stochastic gradient is equal to the “true” gradient of , i.e., . This makes SGD amenable to analyses very similar to that of vanilla gradient descent (GD), which has been extensively studied under a large variety of function classes and geometric assumptions, e.g., see .
Unfortunately, the same cannot be said for SGD without replacement, which has long resisted non-vacuous convergence guaranteess. For example, although we have long knew that SGD with replacement can achieve a rate for strongly convex functions , the best bounds we had for SGD without replacement did not even match that rate, in contrast to empirical evidence. However, a recent series of breakthrough results have established convergence guarantees for SGD without replacement establishing, similar or better rates than SGD with replacement.
|is strongly convex and a sum of quadratics|
|Lower bound for quadratics |
|Upper bound in |
|Our upper bound|
|is strongly convex and a sum of smooth functions|
|Lower bound for quadratics |
|Upper bound in |
|Our lower bound|
In , the authors establish for the first time that for sums of quadratics or smooth functions, there exist parameter regimes under which SGDo achieves an rate compared to the rate of with-replacement sampling SGD. In this case, if is considered a constant, then SGDo becomes times faster than SGD with replacement. More recently,  showed that for functions that are sums of quadratics, or smooth functions under a Hessian smoothness assumption, one could obtain an even faster rate of .  show that for Lipschitz convex functions SGDo is at least as fast as SGD with replacement, and for functions that are strongly convex and sum of smooth components one can achieve a rate of . This latter result was the first convergence rate that provably establishes the superiority of SGD without replacement even for the regime that is not a constant, as long as grows faster than .
This new wave of upper bounds has also been followed by new lower bounds.  establishes that there exist sums of quadratics on which SGDo cannot converge faster than . This lower bound gave rise to a gap between achievable rates and information theoretic impossibility. On one hand, there exists this aformentioned lower bound for sums of quadratics. On the other hand the best upper bound for sums of quadratics is and for the more general class of strongly convex functions that are sums of smooth functions the best rate is . This leaves open the question of whether the upper or lower bounds are loose. This is precisely the gap we close in this work.
Our Contributions: In this work, we establish tight bounds for SGDo. We close the gap between lower and upper bounds on two of the function classes that prior works have focused on: strongly convex functions that are i) sums of quadratics and ii) sums of smooth functions. Specifically, for i), we offer tighter convergence rates, i.e., an upper bound that matches the lower bound of ; as a matter of fact our convergence rates apply to general quadratic functions that are strongly convex, which is a little more general of a function class. For ii), we provide a new lower bound that matches the upper bound by . A detailed comparison of current and proposed bounds can be found in Table 1.
A few words on the techniques used are in place. For our convergence rate on quadratic functions, we heavily rely on and combine the approaches of  and . The convergence rate analyses proposed by  can be tightened by a more careful analysis that employs iterate coupling similar to , combined with new bounds on the deviation of the stochastic, without-replacement gradient from the true gradient of .
For our lower bound, we use a similar construction to the one in , with the difference that each of the individual function components is not a quadratic function, but rather a piece-wise quadratic. This particular function has the property we need: it is smooth, but not quadratic. By appropriately scaling the sharpness of the individual quadratics we construct a function that behaves in a way that SGD without replacement cannot converge faster than a rate of , no matter what stepsize one chooses.
We note that although our methods have an optimal dependence on and , we believe that the dependence on function parameters, e.g., strong convexity, Lipschitz, and smoothness, can potentially be improved.
2 Related Work
The recent flurry of work on without replacement sampling in stochastic optimization extends to several variants of stochastic algorithms beyond SGD. In [8, 16], the authors provide convergence rates for random cyclic coordinate descent, establishing for the first time that it can provably converge faster than stochastic coordinate descent with replacement sampling. This work is complemented by a lower bound on the gap between the random and non-random permutation variant of coordinate descent . Several other works have focused on the random permutation variant of coordinate descent, e.g., see [5, 14] . In , novel bounds are given for incremental Newton based methods. In , the authors present convergence bounds for with replacement sampling and distributed SGD. Finally,  present asymptotic bounds for SGDo for strongly convex functions, and show that with a constant stepsize it approaches the global optimizer to within smaller error radius compared to SGD with replacement.
3 Preliminaries and Notation
We focus on using SGDo to approximately find , the global minimizer of the following unconstrained minimization problem
In our convergence bounds, we denote by the total number of iterations of SGDo, and by the total number of epochs, i.e., passes over the data. Hence, . In our derivations, we denote by the -th iterate of the -th epoch. Consequentially, we have that .
Our results in the following sections rely on the following assumptions.
(Convexity of Components) is convex for all .
(Strong Convexity) is strongly convex with strong convexity parameter , that is
4 Optimal SGDo rates for quadratics
In this section, we will focus on strongly convex functions that are quadratic. We will provide a tight convergence rate that improves upon the rate by  and matches the lower bound of  up to logarithmic factors.
For strongly convex functions that are a sum of smooth functions,  offer a rate of , whereas for strongly convex quadratics  give a convergence rate of . A closer comparison of these two rates reveals that neither of them can be tight due to the following observation. Assume that . Then, that implies that
At the same time if we asusme that the number of data points is significantly larger than the number of epochs that we run SGDo for, i.e., we have that
In comparison, the known lower bound for quadratics given by  is . This makes one wonder what is the true convergence rate of SGDo in this case. We settle the optimal rates for quadratics here by providing an upper bound which matches the known lower bound (up to logarithmic factors).
For the special case of one dimensional quadratics,  proved an upper bound matching the one we prove in this paper. Further, the paper conjectures that the proof can be extended to the generic multidimensional case. However, the authors claim that the main technical barrier for this extension is that it requires a special case of a matrix-valued arithmetic-geometric mean inequality which has only been conjectured to be true but not yet proven. The authors further conjecture that their proof can be extended to general smooth and strongly convex functions, which turns out to not be true, as we show in Theorem 2. On the other hand, we believe that our proof can be extended to the more general family of strongly convex functions, where the Hessian is Lipschitz, similar to the the way  extend their proof to that case.
In addition to assumptions 1-5 above, here we also assume the following:
is a quadratic function
where is a positive semi-definite matrix.
Note that this assumption is a little more general than the assumption that is a sum of quadratics. Also, note that this assumption, in combination with the assumptions on strong convexity and Lipschitz gradients implies bounds on the minimum and maximum eigenvalues of the Hessian of , that is,
where is the identity matrix and means that for all .
We note at this point that in our bounds we hide factors relating to problem parameters, like , , , etc, as we believe that a possibly better dependence is possible.
The proof for Theorem 1 uses ideas from  and . In particular, one of the central ideas in these two papers is that they aim to quantify the amount of progress made by SGDo over a single epoch. Both analyses decompose the progress of the iterates in an epoch as steps of full gradient descent plus some noise term.
Similar to , we use the fact that the Hessian of is constant, which helps us better estimate the value of gradients around the minimizer. In contrast to that work, we do not require all individual components to be quadratic, but rather the entire to be a quadratic function.
An imporatant result proved by  is that during an epoch, the iterates do not steer off too far away from the starting point of the epoch. This allows one to obtain a reasonably good bound on the noise term, when one tries to approximate the stochastic gradient with the true gradient of . In our analysis, we prove a slightly different version of the same result using an iterate coupling argument similar to the one in .
The analysis of  relies on computing the Wasserstein distance between the unconditional distribution of iterates and the distribution of iterates given a function sampled during an iteration. In our analysis, we use the same coupling, but we bypass the Wasserstein framework that  suggests and directly obtain a bound on how far the coupled iterates move away from each other during the course of an epoch. This results, in our view, to a somewhat simpler and shorter proof.
4.1 Sketch of proof for Theorem 1
Now we give an overview of the proof. As mentioned before, similar to the previous works, the key idea is to perform a tight analysis of the progress made during an epoch. This is captured by the following Lemma.
lemmalemupperBoundEpoch Let the step size be
and the total number of epochs we run SGDo be
Then there exist unviversal constants and such that for any epoch we have
At this point, we would like to remark that the bound on epochs may be a bit surprising as and are dependent. However, note that since , we can show that the bound on above is satisfied if we set the number of epochs to be greater than for some constant . Furthermore, we note that the dependence of on (i.e., the condition number of ) is most probably not optimal. In particular both  and  have a better dependence on the condition number.
We can now use the fact that and consequently
This leads to the following bound
We further know that for -smooth and -strongly convex functions, the strong convexity parameter cannot be larger than the smoothness parameter, that is . This simple inequality further implies that
By setting the stepsize to be and noting that , and we have the following bound
This establishes Theorem 1.
4.2 With- and without-replacement stochastic gradients are close
One of the key lemmas in  establishes that once SGDo iterates get closer to the global minimizer , then any iterate at any time during an epoch stays close to the iterate at the beginning of that epoch. To be more precise, the lemma we refer to is the following.
(adapted from Lemma 5 in )
Now, consider the case when the iterates are very close to the optimum and hence . Then, Lemma 1 implies that does not grow quadratically in which would generically happen for gradient steps, but it rather grows linearly in . This is an important and useful fact for SGDo: it shows that all iterates within an epoch remain close to .
Hence, since the iterates of SGDo don’t move too much during an epoch, then the gradients computed throughout the epoch at points should be well approximated by gradients computed on the iterate. Roughly, this translates to the following observation: the gradient steps taken through a single epoch are almost equal to steps of full gradient descent computed at . This is in essence what allows SGDo to achieve better convergence than SGD - an epoch can more often than not be approximated by steps of gradient descent, rather than by a single step.
Now, let represent the random permutation of the functions during the -th epoch. Thus, is the index of the function chosen at the -th iteration of the -th epoch. Proving Lemma 1 requires proving that the function value of , in expectation, is almost equal to . In particular, we prove the following lemma in our supplemental material.
For any epoch and -th ordered iterate during that epoch, we have that
This lemma establishes that SGDo behaves almost like SGD with replacemnent, for which the following is true
To prove this lemma,  consider the conditional distribution of iterates, given the current function index, that is , and the unconditional distribution of the iterates . Then, they prove that the absolute difference can be upper bounded by the Wasserstein distance between these two distributions. To further upper bound the Wasserstein distance, they propose a coupling between the two distributions. To prove our slightly different version of Lemma 1, we proved (2) without using this Wasserstein framework. Instead, we use the same coupling argument to directly get a bound on (2). Below we explain the coupling and provide a short intuition.
Consider the condtional distribution of . If we take the distribution of , we can generate the support of by taking all permutations and by swapping and among them. This is essentially a coupling between these two distributions, proposed in . Now, if we use this coupling to convert a permutation in to a permutation , the corresponding and would be within a distance of . This distance bound is Lemma 2 from , and for completeness, the proof of this is provided inside the detailed proof of Lemma 2 in the appendix.
We can now use such distance bound, and let denote a (random) vector whose norm is less than . Then,
Similarly, for any :
Similarly, we can prove that
Combining these two results we obtain (2).
The full proof of Theorem 1 requires some more nuanced bounding derivations, and the complete details can be found in the Appendix.
5 Lower bound for general case
In the previous section, we establish that for quadratic functions the lower-bound by  is essentially tight. This still leaves open the possibility that a tighter lower bound may exist for strongly convex functions that are not quadratic. After all, the best convergence rate known for strongly convex functions that are sums of smooth functions is of the order of .
Indeed, in this section, we show that the convergence rate of establihed by  is tight. In particular, we have the following for different step sizes. For steplength ,  shows that . For any constant , a steplength of , the theorem of  can be adapted directly to get . Finally, for a certain constant (see the proof of Theorem 2 in the Appendix for the value of ), we show the following theorem
For , there exists a strongly convex function that is sum of smooth convex functions such that
Thus overall, we get that for any fixed step size .
Next, we try to explain the function construction and proof technique behind Theorem 2. The construction of the lower bound is similar to that in . The difference is that the prior work considers quadratic functions, while we consider a slightly modified “piece-wise” quadratic function.
We specifically construct the following function :
Now, of the component functions , are of the 1st kind
and functions are of the 2nd kind
For our construction, we let set to be a big enough positive constant. See for example, Fig. 1.
Note that Assumption 1 is satisfied, as it can be seen that functions ’s are all continuous and convex. Furthermore, the strong convexity of can also be established in a straightforward way. Assumption 3 and 4 can be enforced artificially. Finally, let us focus on Assumption 5. To prove that this function has smooth gradients, we need to show
If , that is and lie on the same side of the origin, then this is simple to see because they both lie on the same quadratic. Otherwise, assume and . Also, assume that is function of the first kind, that is the linear term in is Then,
Overall, the difficulty in the analysis comes from the fact that unlike the functions considered in , our functions are piecewise quadratics.
Let us initialize at (the minimizer). We will show that in expectation, at the end of epochs, the iterate would be at a certain distance (in expectation). Note that the progress made over an epoch is just the sum of gradients (multiplied by ) over the epoch:
where represents the index of the -th function chosen in the -th epoch. Next, note that the gradients from the linear components are equal to , that is they are constant. Thus, they will cancel out over an epoch.
However the gradients from the quadratic components do not cancel out, and in fact that part of the gradient will not even be unbiased, in the sense that if , the gradient at from the quadratic component will be less in magnitude than the gradient from the quadratic component at .
The idea is to now ensure that if an epoch starts off near the minimizer, then the iterates spend a certain amount of time in region, so that they ‘accumulate’ a lot of gradients of the form , which makes the sum of the gradients at the end of the epoch biased away from the minimizer.
To ensure that the iterates spend some time in the region, we analyze the contribution of the linear components during the epoch. This is because when the iterates are already near the minimizer , the gradient contribution of the quadratic terms would be small, and the dominating component would come from the linear terms. What this means is that during the epoch, it is the linear terms which contribute the most towards the “iterate movement”.
Then, to obtain a lower bound matching the upper bound of , observe that it is indeed this contribution of the linear terms that we require to get a tight bound on. This is because, the upper bound from  was in fact directly dependent on the “noisy” movement during such an epoch (Lemma 5 from ). We give below the informal version of the main lemma for the proof:
Let be a random permutation of . Then for we have that
For the purpose of intuition, ignore the contribution of gradients from the quadratic terms. Then, the lemma above says that during an epoch, the gradients from the linear terms would move the iterates approximately away from the minimizer (after we multiply by the stepsize ).
This implies that in the middle of an epoch, with (almost) probability the iterates would be near and with (almost) probability the iterates would be near . Hence, over the epoch, the accumulated quadratic gradients multiplied by the stepsize would look like
If this happens for epochs, we get that the accumulated steps would be for . Since , we know that . Since is the minimizer of our function in this setting, we have constructed a case where SGDo achieves error
This completes the sketch of the proof and the complete proof of Theorem 2 is given in the Appendix.
5.1 Discussion on possible improvements
Theorem 2 hints that for faster convergence rates in the epoch based random shuffling SGD, we would not just require smooth and strongly convex functions, but also potentially require that the Hessians of such functions to be Lipschitz.
We conjecture that Hessian Lipschitzness is sufficient to get the convergence rate of Theorem 1. We think that this is interesting, because the optimal rates for both SGD with replacement and vanilla gradient Ddescent only require strong convexity (and gradient smoothness for SGD). However, here we prove that an optimal rate for SGDo requires the function to be quadratic as well (or at the very least Hessian Lipschitz), and SGDo seems to converge slower if the Hessian is not Lipschitz.
6 Conclusions and Future Work
SGD without replacement has long puzzled researchers. From a practical point of view, it always seems to outperform SGD with replacement, and is the algorithm of choice for training modern machine learning models. From a theoretical point of view, SGDo has resisted tight convergence analysis that establish its performance benefits. A recent wave of work established that indeed SGDo can be faster than SGD with replacement sampling, however a gap still remained between the achievable rates and the best known lower bounds.
In this paper we settle the optimal performance of SGD without replacement for functions that are quadratics, and strongly convex functions that are sums of smooth functions. Our results indicate that a possible improvement in convergence rates may require a fundamnetally different stepsize rule and significantly different function assumptions.
As future directions, we believe that it would be interesting to establish rates for variants of SGDo that do not re-permute the functions at every epoch. This is something that is common in practice, where a random permutation is only performed once every few epochs without a significant drop in performance. Current theoretical bounds are insufficient to explain this phenomenon, and a new theoretical breakthrough may be required to tackle it.
We however believe that one of the strongest new theoretical insights introduced by  and used in our analyses can be of significance in a potential attempt to analyze other variants of SGDo as the one above. This insight is that of iterate coupling. That is the property that SGDo iterates are only mildly perturbed after swapping only two elements of a permutation. Such a property is reminiscent to that of algorithmic stability, and a deeper connection between that and iterate coupling is left as a meanigful intellectual endeavor for future work.
Appendix A Proof of Theorem 1
Remark: The proofs of all the theorems, lemmas, claims and corollaries are provided in this Appendix (except the ones cited from other papers).
The proof for upper bound uses the framework of , combined with some crucial ideas from . In the block diagram below we connect the pieces needed to establish the proof. All lemmas and proofs follow.
First, we use the following lemma which quantifies the expected progress made per epoch \lemupperBoundEpoch* Simply unrolling this equation for epochs, we get:
|(Using the formula for sum of Geometric Progression)|
where in the last inequality we used the fact that and hence . Therefore, we get that
a.1 Proof of Lemma 4.1
Throughout this proof we will be working inside an epoch, so we would be skipping the super script in which denotes the -th epoch. Thus in this proof, refers to the iterate at beginning of that epoch. Let denote the permutation of used in this epoch. Therefore at the -th iteration of this epoch, we compute the gradient on . Next we define the error term (this is the same error term as the one defined in )
(Theorem 2.1.11 from ) For strongly convex function ,
Next, we need to control and . To achieve that we introduce the following two lemmas.