Stochastic Learning under Random Reshuffling

# Stochastic Learning under Random Reshuffling

Bicheng Ying12, Kun Yuan12, Stefan Vlaski12, and Ali H. Sayed2

This work was supported in part by NSF grants CCF-1524250 and ECCS-1407712. Emails: {ybc, kunyuan, svlaski}@ucla.edu and ali.sayed@epfl.ch. A short conference version of this work was presented in [1]. 1Department of Electrical Engineering, University of California, Los Angeles 2School of Engineering, École Polytechnique Fédérale de Lausanne, Switzerland
###### Abstract

In empirical risk optimization, it has been observed that stochastic gradient implementations that rely on random reshuffling of the data achieve better performance than implementations that rely on sampling the data uniformly. Recent works have pursued justifications for this behavior by examining the convergence rate of the learning process under diminishing step-sizes. This work focuses on the constant step-size case. In this case, convergence is guaranteed to a small neighborhood of the optimizer albeit at a linear rate. The analysis establishes analytically that random reshuffling outperforms uniform sampling by showing explicitly that iterates approach a smaller neighborhood of size around the minimizer rather than . Furthermore, we derive an analytical expression for the steady-state mean-square-error performance of the algorithm, which helps clarify in greater detail the differences between sampling with and without replacement. We also explain the periodic behavior that is observed in random reshuffling implementations.

{keywords}

Random reshuffling, stochastic gradient descent, mean-square performance, convergence analysis, mean-square-error expression.

## I Motivation

We consider minimizing an empirical risk function , which is defined as the sample average of loss values over a possibly large but finite training set:

 w⋆\lx@stackrelΔ=argminw∈RMJ(w)\lx@stackrelΔ=1NN∑n=1Q(w;xn), (1)

where the denotes the training data samples and the loss function is assumed convex and differentiable. We assume the empirical risk is strongly-convex which ensures that the minimizer, , is unique. Problems of the form (1) are common in many areas of machine learning including linear regression, logistic regression and their regularized versions.

When the size of the dataset is large, it is impractical to solve (1) directly via traditional gradient descent by evaluating the full gradient at every iteration. One simple, yet powerful remedy is to employ the stochastic gradient method (SGD) [2, 3, 4, 5, 6, 7, 8, 9]. Rather than compute the full gradient over the entire data set, these algorithms pick one index at random at every iteration, and employ to approximate . Specifically, at iteration , the update for estimating the minimizer is of the form[10]:

 wi=wi−1−μ∇wQ(wi−1;xni), (2)

where is the step-size parameter. Note that we are using boldface notation to refer to random variables. Traditionally, the index is uniformly distributed over the discrete set .

It has been noted in the literature [11, 12, 13, 14] that incorporating random reshuffling into the gradient descent implementation helps achieve better performance. More broadly than in the case of the pure SGD algorithm, it has also been observed that applying random reshuffling in variance-reduction algorithms, like SVRG[15], SAGA[16], can accelerate the convergence speed[17, 18, 19]. The reshuffling technique has also been applied in distributed system to reduce the communication and computation cost[20].

In random reshuffling implementations, the data points are no longer picked independently and uniformly at random. Instead, the gradient descent algorithm is run multiple times over the data where each run is indexed by and is referred to as an epoch. For each epoch, the original data is first reshuffled and then passed over in order. In this manner, the -th sample of epoch can be viewed as , where the symbol represents a uniform random permutation of the indices. We can then express the random reshuffling algorithm for the th epoch in the following manner:

 wki=wki−1−μ∇wQ(wki−1;xσk(i)),i=1,…,N (3)

with the boundary condition:

 wk0=wk−1N (4)

In other words, the initial condition for epoch is the last iterate from epoch . The boldface notation for the symbols and in (3) emphasizes the random nature of these variables due to the randomness in the permutation operation. While the samples over one epoch are no longer picked independently from each other, the uniformity of the permutation function implies the following useful properties[19, 21, 22]:

 σk(i)≠ σk(j),1≤i≠j≤N (5) P[ σk(i)=n ]= 1N,1≤n≤N (6) P[σk(i+1)=n|σk(1:i)]= (7)

where represents the collection of permuted indices for the samples numbered through .

Several recent works [12, 13, 23] have pursued justifications for the enhanced behavior of random reshuffling implementations over independent sampling (with replacement) by examining the convergence rate of the learning process under diminishing step-sizes. It has been analytically shown that the convergence rate under random reshuffling can be improved from for strongly convex objective functions in SGD[24] to , where is the number of iterations [13]. However, some of these justifications rely on loose bounds, or their conclusions are dependent on the sample size which is problematic for large datasets. Also, in the work [23], it only establishes that random reshuffling will not degrade performance relative to the stochastic gradient descent implementation.

In this work, we focus on a different setting than[12, 13, 23] involving random reshuffling under constant rather than decaying step-sizes. In this case, convergence is only guaranteed to a small neighborhood of the optimizer albeit at a linear rate. The analysis will establish analytically that random reshuffling outperforms independent sampling (with replacement) by showing that the mean-square-error of the iterate at the end of each run in the random reshuffling strategy will be in the order of . This is a significant improvement over the performance of traditional stochastic gradient descent, which is  [10]. Furthermore, we derive an analytical expression for the steady-state mean-square-error performance of the algorithm, which helps clarify in greater detail the differences between sampling with and without replacement. We also explain the periodic behavior that is observed in random reshuffling implementations.

## Ii Analysis of the Stochastic Gradient Under Random Reshuffling

### Ii-a Properties of the Gradient Approximation

We start by examining the properties of the stochastic gradient under random reshuffling. One main source of difficulty that we shall encounter in the analysis of performance under random reshuffling is the fact that a single sample of the stochastic gradient is now a biased estimate of the true gradient and, moreover, it is no longer independent of past selections, . This is in contrast to implementations where samples are picked independently at every iteration. Indeed, note that conditioned on previously picked data and on the previous iterate, we have:

 E[∇wQσk(i)(wki−1)|wki−1,σk(1:i−1)] = 1N−i+1∑n∉σk(1:i−1)∇wQ(wki−1) ≠ ∇J(wk0) (8)

The difference (8) is generally nonzero in view of the definition (1). For the first iteration of every epoch however, it can be verified that the following holds:

 E[∇wQσk(i)(wk0)∣∣wk0]\lx@stackrel(???)= 1NN∑n=1Q(wk0;xn) \lx@stackrel(???)= ∇J(wk0) (9)

since at the beginning of one epoch, no data has been selected yet. Perhaps surprisingly, we will be showing that the biased construction of the stochastic gradient estimate not only does not hurt the performance of the algorithm, but instead significantly improves it. In large part, the analysis will revolve around considering the accuracy of the gradient approximation over an entire epoch, rather than focusing on single samples at a time. Recall that by construction in random reshuffling, every sample is picked once and only once over one epoch. This means that the sample average (rather than the true mean) of the gradient noise process is zero since

 1NN∑i=1∇wQ(w;xσk(i))=∇J(w) (10)

for any and any reshuffling order . This property will become key in the analysis.

### Ii-B Convergence Analysis

We can now establish a key convergence and performance property for the random reshuffling algorithm, which provides solid analytical justification for its observed improved performance in practice.

To begin with, we assume that the risk function satisfies the following conditions, which are automatically satisfied by many learning problems of interest, such as mean-square-error or logistic regression analysis and their regularized versions — see, e.g., [25, 26, 27, 28, 29].

###### Assumption 1 (Condition on loss function).

It is assumed that is differentiable and has a -Lipschitz continuous gradient, i.e., for every and any :

 ∥∇wQ(w1;xn)−∇wQ(w2;xn)∥≤δn∥w1−w2∥ (11)

where . We also assume is -strongly convex:

 (∇wJ(w1)−∇wJ(w2))T(w1−w2) ≥ν∥w1−w2∥2 (12)

If we introduce , then each and are also -Lipschitz continuous.

The following theorem focuses on the convergence of the starting point of each epoch and establishes in (14) that it actually approaches a smaller neighborhood of size around . Afterwards, using this result, we also show that the same performance level holds for all iterates and not just for the starting points of the epochs.

To simplify the notation, we introduce the constant , which is the gradient noise variance at optimal point :

 K\lx@stackrelΔ=1NN∑n=1∥∇wQ(w⋆;xn)∥2 (13)
###### Theorem 1 (Stability of starting points).

Under assumption 1, the starting point of each run satisfies

 limsupk→∞E∥wk0−w⋆∥2≤ 4μ2δ2N2ν2K=O(μ2) (14)

when the step-size is sufficiently small, namely, for .

###### Proof.

Note first that

 wk+10\lx@stackrelΔ= wkN \lx@stackrel(???)= wkN−1−μ∇wQ(wkN−1;xσk(N)) ⋮ = wk0−μN∑i=1∇wQ(wki−1;xσk(i)) \lx@stackrel(???)= wk0−μN∇wJ(wk0) (15) −μN∑i=1(∇wQ(wki−1;xσk(i))−∇wQ(wk0;xσk(i)))\lx@stackrelΔ=gσk(i)(wki−1)

where we denote by the incremental gradient noise which is the mismatch between the gradient approximations evaluated at and . Next, we introduce the error vector:

 ˜wk0\lx@stackrelΔ=w⋆−wk0 (16)

and let be any scalar that we will specify further below. Subtracting from both sides of (15), squaring, and using Jensen’s inequality in step (a) below we get:

 ∥˜wk+10∥2 \lx@stackrel(a)≤ 1t∥˜wk0+μN∇wJ(wk0)∥2+μ21−t∥∥ ∥∥N∑i=1gσk(i)(wki−1)∥∥ ∥∥2 \lx@stackrel(b)≤ (17)

where step (b) uses the fact that:

 ∥∥ ∥∥N∑i=1xi∥∥ ∥∥2=N2∥∥ ∥∥N∑i=11Nxi∥∥ ∥∥2≤NN∑i=1∥xi∥2 (18)

We show in Appendix A that the rightmost term in (17) can be bounded by:

 N∑i=1∥∥gσk(i)(wki−1)∥∥2 (19)

while for the first term in (17) we have

 ∥˜wk0+μN∇wJ(wk0)∥2 = ∥˜wk0∥2+2μN(˜wk0)T∇wJ(wk0)+μ2N2∥∇wJ(wk0)∥2 = ∥˜wk0∥2−2μN(˜wk0)T(∇wJ(w⋆)−∇wJ(wk0)) +μ2N2∥∇wJ(w⋆)−∇wJ(wk0)∥2 (a)≤ (1−2μνN+μ2N2δ2)∥˜wk0∥2 (b)≤ (1−μνN2)2∥˜wk0∥2. (20)

Inequality holds because is -strongly convex and is -Lipschitz continuous. Inequality holds when is sufficiently small such that

 1−2μνN+μ2N2δ2≤(1−2μνN3)2 (21)

To satisfy (21), it is sufficient to set

 μ≤2ν3δ2N. (22)

Combining (19) and (II-B), we establish:

 ∥˜wk+10∥2≤ 1t(1−2μNν3)2∥˜wk0∥2 +μ2δ2N1−tμ2δ2N31−2μ2δ2N2(2δ2∥˜wk0∥2+K) (23)

We are free to choose . Thus, let . Then, we conclude that

 ∥˜wk+10∥2≤ (1−2μNν/3)∥˜wk0∥2 +3μ3δ2N32ν(1−2μ2δ2N2)(2δ2∥˜wk0∥2+K) (24) = (1−μNν2+3μ3δ4N3ν(1−2μ2δ2N2))∥˜wk0∥2 +3μ3δ2N3K2ν(1−2μ2δ2N2) (25)

If we assume is sufficiently small such that

 1−2μ2δ2N2≥34, (26)

then inequality (25) becomes

 ∥˜wk+10∥2≤ (1−23μNν+4μ3δ4N3ν)∥˜wk0∥2+2μ3δ2N3νK. (27)

If we further assume the step-size is sufficiently small such that

 1−23μNν+4μ3δ4N3ν≤1−12μNν (28)

then inequality (27) becomes

 ∥˜wk+10∥2≤ (1−12μNν)∥˜wk0∥2+2μ3δ2N3νK ≤ (1−12μNν)k∥˜w00∥2 ≤ (1−12μNν)k∥˜w00∥2+4μ2δ2N2ν2K. (29)

By taking expectations on both sides, we have

 E∥˜wk+10∥2≤ (1−12μNν)kE∥˜w00∥2+4μ2δ2N2ν2K, (30)

which implies that

 limsupk→∞E∥˜wk0∥2=O(μ2) (31)

Finally we find a sufficient range for for stability. To satisfy (22), (26) and (28), it is enough to set as

 μ ≤min{2ν3δ2N,1√8δN,ν√24δ2N}<ν5δ2N. (32)

The argument in this derivation provides a self-contained proof for the convergence result (14), which generalizes the approach from [1]. There, the bound (14) was derived from an intermediate property (23) in [1], which does not always hold. Here, the same result is re-derived and shown to hold irrespective of this property. Consequently, we are now able to obtain Lemma 1 from [1] as a corollary to our current result, as shown next.

Having established the stability of the first point of every epoch, we can now establish the stability of every point.

###### Corollary 1 (Full Stability).

Under assumption 1, it holds that

 limsupk→∞E∥wki−w⋆∥2= O(μ2) (33)

for all when the step-size is sufficiently small.

###### Proof.

We have

 E∥˜wki∥2≤ 2E∥wki−wk0∥2+2E∥˜wk0∥2 ≤ 2i−1∑j=0jE∥wkj+1−wkj∥2+2E∥˜wk0∥2 ≤ 2i−1∑j=0jE∥∇wQ(wkj;xσk(j))∥2+2E∥˜wk0∥2 ≤ 2μ2δ2i−1∑j=0jE∥˜wkj∥2+2E∥˜wk0∥2 (34)

Summing over ;

 N−1∑i=1E∥˜wki∥2≤ = ≤ μ2δ2N2N−1∑j=0E∥˜wkj∥2+2NE∥˜wk0∥2 = μ2δ2N2N−1∑j=1E∥˜wkj∥2+(2N+μ2δ2N2)E∥˜wk0∥2 (35)

Rearranging terms, we get

 N−1∑i=1E∥˜wki∥2≤ (36)

Let , then

 limsupk→∞N−1∑i=1E∥˜wki∥2=O(μ2) (37)

Noting that every term in the summation is non-negative, we conclude that for all :

 limsupk→∞E∥˜wkj∥2≤limsupk→∞N−1∑i=1E∥˜wki∥2=O(μ2) (38)

## Iii Illustrating Behavior and Periodicity

In this section we illustrate the theoretical findings so far by numerical simulations. We consider the following logistic regression problem:

 minwJ(w)=1NN∑n=1Q(w;hn,γ(n)), (39)

where is the feature vector, is the scalar label, and

 Q(w;hn,γn)\lx@stackrelΔ=ρ∥w∥2+ln(1+exp(−γ(n)hTnw)). (40)

The constant is the regularization parameter. In the first simulation, we compare the performance of the standard stochastic gradient descent (SGD) algorithm (2) with replacement and the random reshuffling (RR) algorithm (3). We set and . Each is generated from the normal distribution , where is a diagonal matrix with each diagonal entry generated from the uniform distribution . To generate , we first generate an auxiliary random vector with each entry following . Next, we generate from a uniform distribution . If then is set as ; otherwise is set as . We select during all simulations. Figure 1 illustrates the MSD performance of the SGD and RR algorithms when . It is observed that the RR algorithm oscillates during the steady-state regime, and that the MSD at the is the best among all iterates during epoch . Furthermore, it is also observed that RR has better MSD performance than SGD. Similar observations also occur in Fig. 2, where . It is worth noting that the gap between SGD and RR is much larger in Fig. 2 than in Fig. 1.

Next, in the second simulation we verify the conclusion that the MSD for the starting point of each epoch for the random reshuffling algorithm, i.e., , can achieve instead of . We still consider the regularized logistic regression problem (39) and (40), and the same experimental setting. Recall that in Lemma 1, we proved that

 limsupk→∞E∥˜wk0∥2≤ O(μ2), (41)

which indicates that when is reduced a factor of 10, the MSD-performance should be improved by at least dB. We observe a decay of about 20dB per decade in Fig. 3 for a logistic regression problem with data points and 30dB per decade in Fig. 4 with .

## Iv Introducing a Long-Term Model

We proved in the earlier sections that the mean-square error under random reshuffling approaches a small neighborhood around the minimizer. Our objective now is to assess more accurately the size of the constant that multiplies in the result, and examine how this constant may depend on various parameters including the amount of data, , and the form of the loss function . To do that, we proceed in two steps. First, we introduce an auxiliary long-term model in (50) below and subsequently determine how far the performance of this model is from the original system described by (49) further ahead.

### Iv-a Error Dynamics

In order to quantify the performance of the random reshuffling implementation more accurately than the figure obtained earlier, we will need to impose a condition on the smoothness of the Hessian matrix of the risk function.

###### Assumption 2 (Hessian is Lipschitz continuous).

The risk function has a Lipschitz continuous Hessian matrix, i.e., there exists a constant , such that

 ∥∇2wJ(w1)−∇2wJ(w2)∥≤κ∥w1−w2∥ (42)

Under this assumption, the gradient vector, , can be expressed in Taylor expansion in the form[25, p. 378]:

 ∇wJ(w)=∇2wJ(w⋆)(w−w⋆)+ξ(w),∀w (43)

where the residual term satisfies:

 ∥ξ(w)∥≤κ2∥w−w⋆∥2 (44)

As such, we can rewrite algorithm (3) in the form:

 ˜wki= ˜wki−1+μ∇wJ(wki−1) +μ(∇wQ(wki−1;xσk(i))−∇wJ(wki−1)) = ˜wki−1−μ∇2wJ(w⋆)˜wki−1+μξ(wki−1) +μ(∇wQ(wki−1;xσk(i))−∇wJ(wki−1)) (45)

To ease the notation, we introduce the Hessian matrix and the gradient noise process:

 H\lx@stackrelΔ= ∇2wJ(w⋆) sσk(i)(wki−1)\lx@stackrelΔ= ∇wQ(wki−1;xσk(i))−∇wJ(wki−1) (46)

so that (45) is simplified as:

 ˜wki=(I−μH)˜wki−1+μξ(wki−1)+μsσk(i)(wki−1) (47)

Now property (9) motivates us to expand (47) into the following error recursion by adding and subtracting the same gradient noise term evaluated at :

 ˜wki= (I−μH)˜wki−1+μsσk(i)(wk0) +μ(sσk(i)(wki−1)−sσk(i)(wk0))noise mismatch+μξ(wki−1) (48)

Iterating (48) and using (4) we can establish the following useful relation, which we call upon in the sequel:

 ˜wk+10= (I−μH)N˜wk0+μN∑i=1(I−μH)N−isσk(i)(wk0) +μN∑i=1(I−μH)N−i(sσk(i)(wki−1)−sσk(i)(wk0)) +μN∑i=1(I−μH)N−iξ(wki−1) (49)

Note that recursion (49) relates to , which are the starting points of two successive epochs. In this way, we have now transformed recursion (3), which runs from one sample to another within the same epoch, into a relation that runs from one starting point to another over two successive epochs.

To proceed, we will ignore the last two terms in (49) and consider the following approximate model, which we shall refer to as a long-term model.

 ˜w′k+10=(I−μH)N˜w′k0−μN∑i=1(I−μH)N−isσk(i)(wk0)\lx@stackrelΔ=s′(wk0) (50)

Obviously, the state evolution will be different than (49) and is therefore denoted by the prime notation, . Observe, however, that in model (50) the gradient noise process is still being evaluated at the original state vector, , and not at the new state vector, .

### Iv-B Performance of the Long-Term Model across Epochs

Note that the gradient noise in (50) has the form of a weighted sum over one epoch. This noise clearly satisfies the property:

 E[s′(wk0)|wk0]= 0 (51)

We also know that satisfies the Markov property, i.e., it is independent of all previous and , where , conditioned on . To motivate the next lemma consider the following auxiliary setting.

Assume we have a collection of vectors in whose sum is zero. We define a random walk over these vectors in the following manner. At each time instant, we select a random vector uniformly and with replacement from this set and move from the current location along the vector to the next location. If we keep repeating this construction, we obtain behavior that is represented by the right plot in Fig. 5. Assume instead that we repeat the same experiment except that now we assume the data is first reshuffled and then vectors are selected uniformly without replacement. Because of the zero sum property, and because sampling is now performed without replacement, we find that in this second implementation we always return to the origin after selections. This situation is illustrated in the left plot of the same Fig. 5. The next lemma considers this scenario and provides useful expressions that allow us to estimate the expected location after or more (unitl ) movements. These results will be used in the sequel in our analysis of the performance of stochastic learning under RR.

###### Lemma 1.

Suppose we have a set of vectors with the constraint . Assume the elements of are randomly reshuffled and then selected uniformly without replacement. Let be any nonnegative constant, be any symmetric positive semi-definite matrix, and introduce

 Rx\lx@stackrelΔ= 1NN∑i=1xixTi (52) Var(X)\lx@stackrelΔ= 1NN∑i=1∥xi∥2=\rm{\small Tr}(Rx) (53)

Define the following functions for any :

 f(n;X,β)\lx@stackrelΔ= E∥∥ ∥∥n∑j=1βn−jxσ(j)∥∥ ∥∥2 (54) F(n;X,B)\lx@stackrelΔ= E[n∑j=1Bn−jxσ(j)][n∑j=1xTσ(j)Bn−j] (55)

It then holds that

 f(n;X,β)= (∑n−1i=0β2i)N−(∑n−1i=0βi)2N−1Var(X) (56) F(n;X,B)= [∑n−1i=0BiRxBi]N−[∑n−1i=0Bi]Rx[∑n−1i=0Bi]N−1 (57)
###### Proof.

The proof is provided in Appendix B. ∎

We now return to the stochastic gradient implementation under random reshuffling. Recall from (10) that the stochastic gradient satisfies the zero sample mean property so that

 N∑i=1sσk(i)(w)=0 (58)

at any given point . Applying Lemma 1, we readily conclude that

 E[s′(wk0)s′(wk0)T|wk0] = N(∑N−1i=0(I−μH)iRks(I−μH)i)N−1 −[∑N−1i=0(I−μH)i]Rks[∑N−1i=0(I−μH)i]N−1 (59)

where

 Rks\lx@stackrelΔ=1NN∑n=1sn(wk0)sn(wk0)T (60)

Similarly, we conclude for the gradient noise at the optimal :

 R′⋆s\lx@stackrelΔ= E[s′(w⋆)s′(w⋆)T] = N(∑N−1i=0(I−μH)iR⋆s(I−μH)i)N−1 −[∑N−1i=0(I−μH)i]R⋆s[∑N−1i=0(I−μH)i]N−1 (61)

where

 R⋆s= 1NN∑i=0∇Q(w⋆;xi)∇Q(w⋆;xi)T (62)
###### Theorem 2 (Performance of Long-term Model).

Under assumptions 1 and 2, when the step size is small enough, the mean-square-deviation (MSD) of the long term model (50) is given by

 MSDltRR\lx@stac