# Stagewise Enlargement of Batch Size for SGD-based Learning

## Abstract

Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning, including training speed and generalization ability. A larger batch size typically results in fewer parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size more easily leads to a generalization gap. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods for setting the batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between the initialization and the optimum of the model parameter. Then, based on this theory, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD. More specifically, SEBS adopts a multi-stage scheme and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD which decreases the learning rate by stage, SEBS can reduce the number of parameter updates without increasing the generalization error. SEBS is suitable for SGD, momentum SGD and AdaGrad. Empirical results on real data verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.

Shen-Yi Zhao and Wu-Jun Li

Keywords: SGD, batch size

## 1 Introduction

Many machine learning models can be formulated as the following empirical risk minimization (ERM) problem:

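$$
\min_{w\in\mathbb{R}^d} F(w)=\frac{1}{n}\sum_{i=1}^{n} f(w;\xi_i), \tag{1}
$$

where $w$ denotes the model parameter, $\mathcal{I}=\{\xi_1,\ldots,\xi_n\}$ denotes the set of training instances sampled from distribution $\mathcal{D}$, and $f(w;\xi_i)$ denotes the loss on the $i$-th training instance.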

With the rapid growth of data, stochastic gradient descent (SGD) and mini-batch SGD (Robbins and Monro, 1951; Bottou, 1998) have become the most popular methods for solving the ERM problem in (1), and many variants of SGD have been proposed. Among these algorithms, the classical and most widely used one is the stagewise SGD which has been adopted in (Krizhevsky et al., 2012; He et al., 2016). Stagewise SGD is based on a multi-stage learning scheme. At the $s$-th stage, it runs the following iterations:

$$
w_{m+1}=w_m-\eta_s\Big(\frac{1}{b}\sum_{\xi\in\mathcal{B}_m}\nabla f(w_m;\xi)\Big), \tag{2}
$$

where $w_1$ is the initialization, $m=1,2,\ldots,M_s$, $\mathcal{B}_m$ is a mini-batch of instances randomly sampled from $\mathcal{I}$ with a batch size $b$, and $\eta_s$ is the learning rate, which is constant within each stage and decreases geometrically by stage. After the $s$-th stage is completed, the algorithm randomly picks a parameter from $\{w_m\}$ or the last one as the initialization of the next stage. For stagewise SGD with $S$ stages, the computation complexity (total number of gradient computations) is $\sum_{s=1}^{S} bM_s$ and the iteration complexity (total number of parameter updates) is $\sum_{s=1}^{S} M_s$. Recently, some work (Yuan et al., 2019) has theoretically proved that stagewise SGD is better than the original SGD, which adopts a polynomially decreasing learning rate, under the weak quasi-convexity and Polyak-Lojasiewicz (PL) conditions. Classical stagewise SGD methods (Krizhevsky et al., 2012; He et al., 2016) mainly focus on how to set the learning rate for a given constant batch size which is typically not too large.

From (2), we can find that, given a fixed computation complexity, a larger batch size results in fewer parameter updates. In distributed training, each parameter update typically needs one round of communication, and hence a larger batch size results in less frequent communication. Furthermore, a larger batch size can typically better utilize the computing power of current multi-core systems like GPUs to reduce computation time, as long as the mini-batch does not exceed the memory or computing limit of the system. Figure 1 gives an example to show that enlarging the batch size can reduce computation time. Hence, we would choose a larger batch size for SGD to reduce computation time if we did not take generalization error into consideration. However, a larger batch size more easily leads to a generalization gap (?; ?). Some work (?) points out that we need to train longer (with higher computation complexity) for large batch training to achieve a generalization error similar to that of small batch training. This is contrary to the original intention of large batch training. Hence, how to set a proper batch size for SGD has become an interesting but challenging topic.

Several works propose heuristic methods for large batch training (Goyal et al., 2017; You et al., 2017; McCandlish et al., 2018). Compared to classical stagewise SGD methods with a small constant batch size and a stagewise decreasing learning rate, these large batch methods need more tricks, which should be carefully tuned on different models and data sets. Furthermore, a theoretical guarantee about the iteration complexity and generalization error of these methods is missing. In addition, in our experiments we find that these methods might increase the generalization error if a large batch size is adopted from the initialization.

Some other methods propose to set the batch size dynamically. Some works (Friedlander and Schmidt, 2012; Byrd et al., 2012; De et al., 2017; Yin et al., 2018) relate the batch size to the noise of stochastic gradients. These methods need to determine the batch size in each iteration, which brings much extra cost. The method in (Smith et al., 2018) increases the batch size by relating SGD to a stochastic differential equation. However, a theoretical guarantee about the iteration complexity and generalization error is missing. Furthermore, some work (Yu and Jin, 2019) uses the stagewise training strategy. At each stage, the batch size starts from a small constant and is geometrically increased by iteration. However, the scaling ratio for the batch size cannot be large if convergence is to be guaranteed. Furthermore, in our experiments we also find that it might increase the generalization error.

In this paper, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set proper batch size for SGD. The main contributions of this paper are outlined as follows:

• We first provide theory to show that a proper batch size is related to the gap between the initialization and the optimum of the model parameter. Then, based on this theory, we propose SEBS, which adopts a multi-stage scheme and enlarges the batch size geometrically by stage.

• We theoretically prove that decreasing learning rate and enlarging batch size have the same effect on the performance of stagewise SGD.

• We theoretically prove that, compared to classical stagewise SGD which decreases learning rate by stage, SEBS can reduce the number of parameter updates (iteration complexity) without increasing generalization error when the total number of gradient computation (computation complexity) is fixed.

• Besides SGD, SEBS is also suitable for momentum SGD and adaptive gradient descent (AdaGrad) (Duchi et al., 2010). We also provide theoretical results about the number of parameter updates (iteration complexity) for momentum SGD and AdaGrad. To the best of our knowledge, this is the first work that analyzes the effect of batch size on the convergence of AdaGrad.

• Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.

## 2 Preliminaries

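First, we give the following notations. $\|\cdot\|$ denotes the $L_2$ norm. $\|\cdot\|_\infty$ denotes the $L_\infty$ norm. $w^*$ denotes the optimal solution (optimum) of (1). $\nabla f_{\mathcal{B}}(w)\triangleq\frac{1}{|\mathcal{B}|}\sum_{\xi\in\mathcal{B}}\nabla f(w;\xi)$ denotes the stochastic gradient of the mini-batch $\mathcal{B}$. $\forall a\in\mathbb{R}^d$, we use $a^{(j)}$ to denote the $j$-th element of $a$.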

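We also make the following assumptions.

Assumption (bounded variance). The variance of the stochastic gradient is bounded: $\mathbb{E}_{\xi}\|\nabla f(w;\xi)-\nabla F(w)\|^2\le\sigma^2$, $\forall w$.

Assumption (smoothness). $F(w)$ is $L$-smooth ($L>0$): $F(w)\le F(w')+\nabla F(w')^T(w-w')+\frac{L}{2}\|w-w'\|^2$, $\forall w,w'$.

Assumption (weak quasi-convexity). $F(w)$ is $\alpha$-weakly quasi-convex ($\alpha>0$):

$$
\nabla F(w)^T(w-w^*)\ge\alpha\big(F(w)-F(w^*)\big),\quad\forall w.
$$

Assumption (PL condition). $F(w)$ satisfies the $\mu$-Polyak-Lojasiewicz ($\mu$-PL, $\mu>0$) condition:

$$
\|\nabla F(w)\|^2\ge 2\mu\big(F(w)-F(w^*)\big),\quad\forall w.
$$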

Recently, both weak quasi-convexity and the PL condition have been observed for many machine learning models, including deep neural networks (Charles and Papailiopoulos, 2018; Yuan et al., 2019). The $\mu$-PL condition also implies quadratic growth (Karimi et al., 2016), i.e., $F(w)-F(w^*)\ge\frac{\mu}{2}\|w-w^*\|^2$. Another inequality (Nesterov, 2004) used in this paper is $\|\nabla F(w)\|^2\le 2L(F(w)-F(w^*))$. Please note that these two inequalities do not need the convexity assumption. We call $L/\mu$ the condition number of $F(w)$ under the PL condition.

## 3 SEBS

In this section, we present the details of SEBS for SGD, including the theory about the relationship between batch size and model initialization, the SEBS algorithm, and theoretical analysis of the training error and generalization error.

### 3.1 Relationship between Batch Size and Model Initialization

We start from the vanilla SGD with a constant batch size and learning rate, which can be written as follows:

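$$
w_{m+1}=w_m-\eta\nabla f_{\mathcal{B}_m}(w_m), \tag{3}
$$

where $m=1,2,\ldots,M$ and $|\mathcal{B}_m|=b$. The computation complexity (total number of gradient computations) is $C=Mb$. Let $\hat{w}$ denote a value randomly sampled from $\{w_1,\ldots,w_M\}$.

We aim to find how large the batch size can be without loss of performance. First, we can obtain the following property about (3):

Lemma 3.1. By setting $\eta\le\frac{\alpha}{2L}$, we have

$$
\mathbb{E}[F(\hat{w})-F(w^*)]\le\frac{\|w_1-w^*\|^2}{\alpha M\eta}+\frac{\eta\sigma^2}{\alpha b}. \tag{4}
$$

Remark. Another common upper bound for $\mathbb{E}[F(\hat{w})-F(w^*)]$ is from (?):

$$
\mathbb{E}[F(\hat{w})-F(w^*)]\le\frac{\|w_1-w^*\|^2}{2M\eta}+\frac{\eta G^2}{2},
$$

which uses the bounded gradient assumption $\|\nabla f(w;\xi)\|\le G$. Comparing it with the bounded variance assumption in Section 2, we can see that the bounded gradient assumption in (?) omits the effect of batch size.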

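Based on (4), we can get a learning rate which minimizes the right-hand side of (4). In fact, (4) also implies a proper batch size. Using $M=C/b$, we rewrite the right-hand side of (4) as follows:

$$
\psi(\eta,b)=\frac{b\|w_1-w^*\|^2}{\alpha C\eta}+\frac{\eta\sigma^2}{\alpha b}.
$$

Then, $\forall\eta,b$, we have:

$$
\psi(\eta,b)\ge\frac{2\|w_1-w^*\|\sigma}{\alpha\sqrt{C}}.
$$

For $\psi(\eta,b)$ to attain this minimum, the corresponding batch size $b^*$ and learning rate $\eta^*$ should satisfy:

$$
\eta^*=\frac{\|w_1-w^*\|\,b^*}{\sigma\sqrt{C}}\le\frac{\alpha}{2L}, \tag{5}
$$

where $\eta^*\le\frac{\alpha}{2L}$ is from Lemma 3.1.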
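As a worked step (a sketch that uses only the bound in (4) and the substitution $M=C/b$), the lower bound on $\psi(\eta,b)$ follows from the inequality $x+y\ge 2\sqrt{xy}$ for $x,y\ge 0$:

$$
\psi(\eta,b)=\frac{b\|w_1-w^*\|^2}{\alpha C\eta}+\frac{\eta\sigma^2}{\alpha b}
\ge 2\sqrt{\frac{b\|w_1-w^*\|^2}{\alpha C\eta}\cdot\frac{\eta\sigma^2}{\alpha b}}
=\frac{2\|w_1-w^*\|\sigma}{\alpha\sqrt{C}},
$$

with equality exactly when the two terms are equal, i.e., when $\frac{\eta}{b}=\frac{\|w_1-w^*\|}{\sigma\sqrt{C}}$. Only the ratio $\eta/b$ is pinned down by this condition, so for a fixed learning rate a smaller gap $\|w_1-w^*\|$ allows a proportionally larger batch size.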

From (5), we can find that, given a fixed computation complexity $C$, a proper batch size is related to the gap between the initialization and the optimum of the model parameter. More specifically, the smaller the gap between the initialization and the optimum of the model parameter is, the larger the batch size can be.

The theory of this subsection provides theoretical foundation for designing the SEBS algorithm in the following subsection.
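The relation between the gap and the batch size can be checked numerically. The following is a minimal sketch, not part of the original method description: the function names and all numeric values (`alpha`, `L`, `sigma`, `C`, the gaps) are hypothetical and chosen only for illustration. It evaluates the right-hand side of (4) with $M=C/b$ and the batch size that satisfies (5) for a fixed learning rate.

```python
import math

def psi(eta, b, gap, sigma, alpha, C):
    """Right-hand side of (4) after substituting M = C / b."""
    return b * gap**2 / (alpha * C * eta) + eta * sigma**2 / (alpha * b)

def psi_lower_bound(gap, sigma, alpha, C):
    """Lower bound 2 * gap * sigma / (alpha * sqrt(C)) on psi."""
    return 2.0 * gap * sigma / (alpha * math.sqrt(C))

def optimal_batch_size(eta, gap, sigma, C):
    """Batch size b* satisfying (5) for a given learning rate eta."""
    return eta * sigma * math.sqrt(C) / gap

# Hypothetical constants, for illustration only.
alpha, L, sigma, C = 1.0, 1.0, 2.0, 10_000
eta = alpha / (2 * L)  # largest learning rate allowed by Lemma 3.1

for gap in (10.0, 1.0, 0.1):  # gap = ||w_1 - w*||
    b_star = optimal_batch_size(eta, gap, sigma, C)
    print(f"gap={gap}: b*={b_star}, "
          f"psi={psi(eta, b_star, gap, sigma, alpha, C)}, "
          f"bound={psi_lower_bound(gap, sigma, alpha, C)}")
```

In this sketch the learning rate is held fixed, so shrinking the gap by a factor of 10 enlarges the returned batch size by the same factor, which is the behaviour that motivates enlarging the batch size stage by stage.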

### 3.2 SEBS Algorithm

In classical stagewise SGD (Krizhevsky et al., 2012; He et al., 2016), we can see that at each stage it actually runs the vanilla SGD with a constant batch size and learning rate. After each stage, it decreases the learning rate geometrically. In (Yuan et al., 2019), both theoretical and empirical results show that after each stage there is a geometric decrease in the training loss. This means that the gap between the current model parameter and the optimal solution (optimum) is smaller than that of previous stages. Based on the theory about the relationship between the batch size and model initialization from Section 3.1, we can actually enlarge the batch size in the next stage. Inspired by this, we propose our algorithm called stagewise enlargement of batch size (SEBS) for SGD-based learning.

SEBS adopts a multi-stage scheme, and enlarges the batch size geometrically by stage. The details of SEBS are presented in Algorithm 1. We can find that SEBS divides the whole learning procedure into $S$ stages. At the $s$-th stage, SEBS runs the penalty SGD in Algorithm 2, denoted as $\mathrm{pSGD}(f,\mathcal{I},\gamma,\tilde{w}_s,\eta_s,b_s,C_s)$. Here, $f$ denotes the loss function in (1), $\mathcal{I}$ denotes the training set, $\gamma$ is the coefficient of a quadratic penalty, $\tilde{w}_s$ is the initialization of the model parameter at the $s$-th stage, $b_s$ is the batch size at the $s$-th stage, $\eta_s$ is a constant learning rate, and $C_s$ is the computation complexity at the $s$-th stage. The output of pSGD, denoted as $\tilde{w}_{s+1}$, will be used as the model parameter initialization for the next stage.

The penalty SGD is a variant of vanilla SGD. Compared to vanilla SGD, there is an additional quadratic penalty in penalty SGD. If $\gamma=\infty$, penalty SGD degenerates to vanilla SGD. The quadratic penalty has been widely used in many recent variants of SGD (Allen-Zhu, 2018; Yu and Jin, 2019; Chen et al., 2019b, a; Yuan et al., 2019). Although it may slow down the convergence rate, it can improve the generalization ability.
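Algorithms 1 and 2 are not reproduced in this text, so the following NumPy code is only a minimal sketch of how the stage loop and the penalty SGD step could look. It assumes that each pSGD step minimizes $g_m^Tw+\frac{1}{2\eta}\|w-w_m\|^2+\frac{1}{2\gamma}\|w-\tilde{w}\|^2$ (the form suggested by the optimality condition used in Appendix A.1), that the stage output is a uniformly sampled iterate, and that both $b_s$ and $C_s$ are scaled by $\rho$ after each stage. The names `psgd`, `sebs`, `grad_fn` and all numeric values are hypothetical, and `gamma` is assumed to be finite.

```python
import numpy as np

def psgd(grad_fn, data, gamma, w_init, eta, b, C, rng):
    """One stage of penalty SGD (sketch): SGD plus the penalty ||w - w_init||^2 / (2 * gamma)."""
    M = max(1, C // b)  # number of parameter updates in this stage
    w = w_init.copy()
    iterates = []
    for _ in range(M):
        batch = data[rng.integers(0, len(data), size=b)]
        g = grad_fn(w, batch)  # mini-batch stochastic gradient
        # Closed-form minimizer of g^T u + ||u - w||^2 / (2 eta) + ||u - w_init||^2 / (2 gamma).
        w = (gamma * (w - eta * g) + eta * w_init) / (gamma + eta)
        iterates.append(w)
    return iterates[rng.integers(0, M)]  # uniformly sampled iterate (assumed output rule)

def sebs(grad_fn, data, w0, eta, b1, C1, rho, S, gamma, seed=0):
    """Stage loop (sketch): constant learning rate, batch size enlarged by rho per stage."""
    rng = np.random.default_rng(seed)
    w, b, C = w0, b1, C1
    n_updates = 0
    for _ in range(S):
        w = psgd(grad_fn, data, gamma, w, eta, b, C, rng)
        n_updates += max(1, C // b)
        b, C = int(rho * b), int(rho * C)  # assumed schedule: b_{s+1} = rho * b_s, C_{s+1} = rho * C_s
    return w, n_updates

# Toy usage on a hypothetical least-squares problem whose optimum is the data mean.
data = np.random.default_rng(1).normal(size=(1000, 5))

def grad_fn(w, batch):
    return w - batch.mean(axis=0)

w0 = np.full(5, 5.0)
w_final, n_updates = sebs(grad_fn, data, w0, eta=0.5, b1=4, C1=400, rho=2, S=4, gamma=10.0)
print(n_updates, np.linalg.norm(w_final - data.mean(axis=0)))
```

Because the batch size and the per-stage computation grow at the same rate in this sketch, every stage performs the same number of parameter updates even though later stages process more instances.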

### 3.3 Theoretical Analysis about Training Error

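First, we have the following one-stage training error for SEBS:

Lemma 3.3 (One-stage training error for SEBS). Let $\{w_m\}$ be the sequence produced by $\mathrm{pSGD}(f,\mathcal{I},\gamma,\tilde{w},\eta,b,C)$, where $\eta\le\frac{\alpha}{2L}$. Then we have:

$$
\mathbb{E}[F(w_\tau)-F(w^*)]\le\Big(\frac{1}{\alpha M\eta}+\frac{1}{\alpha\gamma}\Big)\|\tilde{w}-w^*\|^2+\frac{\sigma^2\eta}{\alpha b}, \tag{6}
$$

where $w_\tau$ is the output of pSGD and $M=C/b$.

We can find that the one-stage training error for SEBS is similar to that in (4). Hence, we can set the batch size of each stage in SEBS according to the gap $\|\tilde{w}_s-w^*\|$. Particularly, we can get the following convergence result:

Theorem 3.3. Let $F(\tilde{w}_1)-F(w^*)\le\epsilon_1$ and $\{\tilde{w}_s\}$ be the sequence produced by

$$
\tilde{w}_{s+1}=\mathrm{pSGD}(f,\mathcal{I},\gamma,\tilde{w}_s,\eta_s,b_s,C_s),
$$

where $C_s=\theta/\epsilon_s$, and

$$
\eta_s=\frac{\sqrt{2}\,b_s\epsilon_s}{\sigma\sqrt{\mu\theta}}\le\frac{\alpha}{2L}. \tag{7}
$$

Then we obtain $\mathbb{E}[F(\tilde{w}_s)-F(w^*)]\le\epsilon_s$. If $S=\log_\rho(\epsilon_1/\epsilon)$, then $\mathbb{E}[F(\tilde{w}_{S+1})-F(w^*)]\le\epsilon$. Here, $\epsilon_{s+1}=\epsilon_s/\rho$ with $\rho>1$, and $\theta$ and $\gamma$ are constants chosen such that $\frac{2\sqrt{2}\sigma}{\alpha\sqrt{\mu\theta}}+\frac{2}{\mu\alpha\gamma}\le\frac{1}{\rho}$ (see Appendix A.2).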

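In SEBS, if we set $\eta_s=\eta=\frac{\alpha}{2L}$, which is a constant, and set the batch size as

$$
b_s=\frac{\sigma\eta\sqrt{\mu\theta}}{\sqrt{2}\,\epsilon_s}=\frac{\alpha\sigma\sqrt{\mu\theta}}{2\sqrt{2}L\epsilon_s}=O\Big(\frac{1}{F(\tilde{w}_s)-F(w^*)}\Big), \tag{8}
$$

which means $b_{s+1}=\rho b_s$, then according to Theorem 3.3, we can obtain the computation complexity of SEBS:

$$
\sum_{s=1}^{S}C_s=\sum_{s=1}^{S}\frac{\theta}{\epsilon_s}\le O\Big(\frac{\sigma^2}{\alpha^2\mu\epsilon}\Big).
$$

This result is consistent with that in (Yuan et al., 2019) which sets . Hence, by setting the batch size according to (8), SEBS achieves the same performance as classical stagewise SGD on computation complexity. Please note that when the loss function is strongly convex, which means $\alpha=1$, the proved computation complexity above is optimal (?).

The iteration complexity of SEBS is as follows:

$$
\sum_{s=1}^{S}\frac{C_s}{b_s}=\sum_{s=1}^{S}O\Big(\frac{L\sqrt{\theta}}{\alpha\sigma\sqrt{\mu}}\Big)=O\Big(\frac{L}{\alpha^2\mu}\log\Big(\frac{1}{\epsilon}\Big)\Big).
$$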
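As a worked step, the per-stage number of parameter updates follows directly from $C_s=\theta/\epsilon_s$ and the batch size in (8):

$$
\frac{C_s}{b_s}=\frac{\theta}{\epsilon_s}\cdot\frac{2\sqrt{2}L\epsilon_s}{\alpha\sigma\sqrt{\mu\theta}}=\frac{2\sqrt{2}L\sqrt{\theta}}{\alpha\sigma\sqrt{\mu}},
$$

so $\epsilon_s$ cancels and every stage uses the same number of updates. Assuming $\theta=O\big(\frac{\sigma^2}{\alpha^2\mu}\big)$, which is what the computation complexity bound above suggests, this is $O\big(\frac{L}{\alpha^2\mu}\big)$ updates per stage, and multiplying by $S=O(\log(1/\epsilon))$ stages gives the stated iteration complexity.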

Then we can get the following conclusions:

• Compared to classical stagewise SGD which decreases learning rate by stage and adopts a constant batch size, SEBS reduces the iteration complexity from to , where is the upper bound for ;

• We can also observe that the iteration complexity of SEBS is independent of the variance $\sigma^2$, and hence is independent of the dimension $d$;

• According to (7) in Theorem 3.3, in order to get the convergence result, we need to keep the relation between the batch size and learning rate in each stage as follows:

$$
\frac{\eta_s}{b_s}=O(\epsilon_s).
$$

This relation implies that the following two strategies for adjusting batch size and learning rate:

 constant batch size & decrease learning rate (a)

and

 constant learning rate & enlarge batch size (b)

are equivalent in terms of training error. Both of them will not affect the computation complexity. Please note that strategy (a) has been widely adopted in classical stagewise training methods, and strategy (b) is proposed in SEBS.
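The two strategies can be compared with a few lines of Python. This is a minimal sketch under stated assumptions: the per-stage computation $C_s$ grows by a factor $\rho$ in both strategies (following $C_s=\theta/\epsilon_s$ with $\epsilon_{s+1}=\epsilon_s/\rho$), and the function names and numeric values are hypothetical. It shows that both strategies keep the same ratio $\eta_s/b_s$ in every stage while strategy (b) needs fewer parameter updates.

```python
import math

def stagewise_schedules(eta1, b1, C1, rho, S):
    """Per-stage (eta_s, b_s, C_s) for strategies (a) and (b); C_s grows by rho in both."""
    sched_a, sched_b = [], []
    for s in range(S):
        C_s = C1 * rho**s
        sched_a.append((eta1 / rho**s, b1, C_s))   # (a) constant batch size, decreasing learning rate
        sched_b.append((eta1, b1 * rho**s, C_s))   # (b) constant learning rate, enlarged batch size
    return sched_a, sched_b

def iteration_complexity(schedule):
    """Total number of parameter updates: sum over stages of C_s / b_s."""
    return sum(C_s // b_s for _, b_s, C_s in schedule)

# Hypothetical values, for illustration only.
sched_a, sched_b = stagewise_schedules(eta1=0.5, b1=128, C1=12_800, rho=2, S=4)

for (eta_a, b_a, _), (eta_b, b_b, _) in zip(sched_a, sched_b):
    # Both strategies keep the same ratio eta_s / b_s in every stage.
    assert math.isclose(eta_a / b_a, eta_b / b_b)

print("updates with strategy (a):", iteration_complexity(sched_a))
print("updates with strategy (b):", iteration_complexity(sched_b))
```

Both schedules process the same number of training instances in total; the saving of strategy (b) comes only from packing those instances into fewer, larger mini-batches.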

### 3.4 Theoretical Analysis about Generalization Error

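In this section, we analyze the generalization error of SEBS. The main tool we use for the generalization error is uniform stability (Hardt et al., 2016), which is defined as follows:

Definition. A randomized algorithm $\mathcal{A}$ is $\epsilon$-uniformly stable if for all data sets $\mathcal{I}_1$, $\mathcal{I}_2$ such that $\mathcal{I}_1$ and $\mathcal{I}_2$ differ in at most one instance, we have

$$
\epsilon_{stab}\triangleq\sup_{\xi}\mathbb{E}_{\mathcal{A}}[f(\tilde{w}_1;\xi)-f(\tilde{w}_2;\xi)]\le\epsilon,
$$

where $\tilde{w}_i$ is the output of $\mathcal{A}$ on data set $\mathcal{I}_i$, $i=1,2$. It has been proved (Hardt et al., 2016) that if $\mathcal{A}$ is $\epsilon$-uniformly stable, then

$$
\big|\mathbb{E}_{\mathcal{I},\mathcal{A}}\big[F(\tilde{w})-\mathbb{E}_{\xi\sim\mathcal{D}}[f(\tilde{w};\xi)]\big]\big|\le\epsilon,
$$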

where is the output of on data set . Hence, in the following content, we consider the two data sets and differing in only a single instance which is indexed by . Let be the output of SEBS on data set , be the sequences produced by SEBS at the last stage, be the corresponding randomly selected mini-batch of instances, . We omit the subscript and use to denote the learning rate, batch size and computation complexity in the last stage. We also define . Following (Hardt et al., 2016), we assume that . Then we have the following property about . {lemma} For one specific , if , then we get

$$
\delta_{m+1}\le\frac{\eta}{\gamma+\eta}\delta_1+\frac{\gamma(1+L\eta)}{\gamma+\eta}\delta_m.
$$

If , then we get

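$$
\delta_{m+1}\le\frac{\eta}{\gamma+\eta}\delta_1+\frac{b\gamma+(b-1)L\gamma\eta}{b(\gamma+\eta)}\delta_m+\frac{2\gamma\eta G}{b(\gamma+\eta)}.
$$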

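Using the recursive relation of $\delta_m$, we get the following uniform stability of SEBS:

Theorem 3.4. With the parameters defined in Theorem 3.3, we obtain

$$
\epsilon_{stab}\le\frac{C}{n}+\frac{(1+1/q)}{n}\Big(\frac{4\gamma G^2}{(\gamma+\eta)\mu\alpha}\Big)^{\frac{1}{1+q}}C^{\frac{q}{q+1}},
$$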

where . According to Theorem 3.4, we obtain the following two conclusions:

• This uniform stability is consistent with (Yuan et al., 2019). The stability error only depends on the computation complexity and has nothing to do with the batch size of each stage.

• Compared to the classical non-penalty SGD in (2) which actually corresponds to the penalty SGD with and , penalty SGD with a finite can improve the stability, and hence improve the generalization error.

Since and , by setting , we obtain a generalization error for SEBS.

## 4 SEBS for Momentum SGD and AdaGrad

Momentum SGD (mSGD) (Polyak, 1964; Tseng, 1998; Ghadimi and Lan, 2013, 2016) and adaptive gradient descent (AdaGrad) (McMahan and Streeter, 2010; Duchi et al., 2010) have been two of the most important and popular variants of SGD. In the following content, we will show that SEBS is also suitable for momentum SGD and AdaGrad. To the best of our knowledge, existing research on AdaGrad only analyzes the convergence property with $b=1$. This is the first work that analyzes the effect of batch size on the convergence of AdaGrad.

### 4.1 SEBS for Momentum SGD

Here, we propose to adapt SEBS for momentum SGD. The resulting algorithm is called mSEBS, which is presented in Algorithm 3. mSEBS divides the whole learning procedure into $S$ stages. At each stage, mSEBS runs Polyak's momentum SGD (Polyak, 1964), which is presented in Algorithm 4. Please note that mSEBS resets the momentum to zero after each stage for the convenience of the convergence proof. This is different from some mSGD implementations, like that in PyTorch, which do not reset the momentum to zero. In our experiments, we find that this difference does not have a significant influence.

Similar to SEBS for SGD, mSEBS can also achieve $O(1/\epsilon)$ computation complexity and $O(\log(1/\epsilon))$ iteration complexity, which is independent of $\sigma^2$ and $d$. Due to space limitations, we move the related theorems to the supplementary material.

### 4.2 SEBS for AdaGrad

Here, we propose to adapt SEBS for AdaGrad. The resulting algorithm is called AdaSEBS, which is presented in Algorithm 5. AdaSEBS also divides the whole learning procedure into $S$ stages. At the $s$-th stage, AdaSEBS runs AdaGrad, which is presented in Algorithm 6. In particular, AdaGrad runs the following iterations:

$$
w_{m+1}=\arg\min_{w}\; w^T\Big(\sum_{i=1}^{m}g_i\Big)+\frac{1}{\eta}\psi_m(w), \tag{9}
$$

where . is a diagonal matrix, in which the diagonal element is defined as , where . In existing research, is typically set to for convex loss functions (McMahan and Streeter, 2010; Duchi et al., 2010) and is typically set to for strongly convex loss functions (Duchi et al., 2010; ?).

Let be the sequence produced by AdaGrad in Algorithm 6. According to (Duchi et al., 2010), we obtain

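$$
\frac{\alpha}{M}\sum_{m=1}^{M}\mathbb{E}[F(w_m)-F(w^*)]\le\frac{1}{M\eta}\mathbb{E}\,\psi_M(w^*)+\frac{\eta}{2M}\sum_{m=1}^{M}\mathbb{E}\|g_m\|^2_{\psi^*_{m-1}},
$$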

where , and . If we take , then . When the gradient is relatively small, e.g., , the square root operation will make the upper bound of bad. Hence in this work, although is not necessarily strongly convex, we still set and get the following one-stage training error for AdaSEBS:

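Lemma 4.2 (One-stage training error for AdaSEBS). Let $\{w_m\}$ be the sequence produced by AdaGrad in Algorithm 6. Then we have

$$
\Big(\alpha-\frac{4L\eta}{\delta^2}\Big)\mathbb{E}[F(w_\tau)-F(w^*)]\le\frac{\delta^2}{2M\eta}\|\tilde{w}-w^*\|^2+\frac{2\sigma^2\eta}{b\delta^2},
$$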

where is the output of AdaGrad, , . Furthermore, if we choose such that and , then we have

$$
\mathbb{E}[F(w_\tau)-F(w^*)]\le\frac{4\sqrt{2}\,\|\tilde{w}-w^*\|\sigma}{\alpha\sqrt{C}}. \tag{10}
$$

According to Lemma 4.2, we actually prove a error while (Duchi et al., 2010) proves a error, where . Furthermore, our one-stage training error is independent of the input . Since the exact upper bound of is usually unknown, we can set a large without loss of training error in our presented AdaGrad. Hence, the error bound in (10) is better than that in (Duchi et al., 2010) in which a large may lead to a large error.

Then we have the following convergence result for AdaSEBS: {theorem} Let be the sequence produced by Algorithm 5, . By setting , and

$$
b_s=\frac{\alpha\sigma\sqrt{\mu\theta}}{8L\epsilon_s},\qquad\eta_s=\frac{\alpha\delta^2}{8L},
$$

we obtain . Here, , , . Similar to SEBS for vanilla SGD, we also obtain computation complexity and iteration complexity.

Recently, there is another work about stagewise AdaGrad, called SADAGrad (Chen et al., 2018), which mainly focuses on the case that is sparse. SADAGrad adopts a constant batch size and decreases the learning rate geometrically by stage. Under convex and quadratic growth condition, SADAGrad achieves an iteration complexity of , which is dependent on dimension . Hence, AdaSEBS is better than SADAGrad.

## 5 Experiments

First, we verify the theory about the relationship between batch size and model initialization. We consider a synthetic problem:

$$
\min_{w\in\mathbb{R}^{100}}F(w)=\frac{1}{2n}\sum_{i=1}^{n}(w-\xi_i)^TD(w-\xi_i), \tag{11}
$$

where , each data $\xi_i$ is sampled from the Gaussian distribution , and $D$ is a diagonal matrix with . The corresponding are , respectively, and the optimal solution of (11) is $w^*=\frac{1}{n}\sum_{i=1}^{n}\xi_i$. We run vanilla SGD in (3) with a fixed computation complexity to solve (11). We set the model parameter initialization , where , . For each , we aim to find the optimal batch size that can achieve the smallest value of $F(\hat{w})-F(w^*)$, where $\hat{w}$ is the output of the vanilla SGD algorithm in (3). We repeat 50 times and the average result about the optimal batch size is presented in Figure 2. We can find that the optimal batch size is almost proportional to $1/\|w_1-w^*\|$, and a larger learning rate implies a larger optimal batch size. These phenomena are consistent with our theory in (5), where $b^*=\eta^*\sigma\sqrt{C}/\|w_1-w^*\|$ for a fixed computation complexity $C$.
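The procedure of this synthetic experiment can be sketched as follows. This is only an illustrative sketch: the problem sizes, the diagonal matrix, the learning rate, the candidate batch sizes and the number of repetitions below are hypothetical and are not the settings behind Figure 2, and the helper names `loss_gap`, `run_sgd` and `best_batch_size` are introduced here for illustration. For a fixed computation complexity, it runs vanilla SGD (3) with each candidate batch size and returns the one with the smallest average value of $F(\hat{w})-F(w^*)$.

```python
import numpy as np

def loss_gap(w, w_star, diag):
    """F(w) - F(w*) for the quadratic in (11), i.e. 0.5 * (w - w*)^T D (w - w*)."""
    diff = w - w_star
    return 0.5 * float(diff @ (diag * diff))

def run_sgd(data, diag, w1, eta, b, C, rng):
    """Vanilla SGD (3) with M = C // b updates; returns an iterate sampled from w_1..w_M."""
    M = max(1, C // b)
    w = w1.copy()
    iterates = []
    for _ in range(M):
        iterates.append(w)
        batch = data[rng.integers(0, len(data), size=b)]
        w = w - eta * diag * (w - batch.mean(axis=0))  # mini-batch gradient of (11)
    return iterates[rng.integers(0, M)]

def best_batch_size(data, diag, w1, eta, C, candidates, repeats, rng):
    """Candidate batch size with the smallest average loss gap at fixed computation C."""
    w_star = data.mean(axis=0)  # optimum of (11)
    avg_gap = {}
    for b in candidates:
        gaps = [loss_gap(run_sgd(data, diag, w1, eta, b, C, rng), w_star, diag)
                for _ in range(repeats)]
        avg_gap[b] = sum(gaps) / repeats
    return min(avg_gap, key=avg_gap.get), avg_gap

# Hypothetical settings, chosen small so that the sketch runs quickly.
rng = np.random.default_rng(0)
n, d, C = 1000, 20, 4096
data = rng.normal(size=(n, d))
diag = np.ones(d)
w_star = data.mean(axis=0)
candidates = [1, 4, 16, 64, 256]

for gap in (10.0, 1.0):
    w1 = w_star + gap * np.ones(d) / np.sqrt(d)  # initialization at distance `gap` from w*
    b_best, avg_gap = best_batch_size(data, diag, w1, 0.1, C, candidates, 3, rng)
    print(f"gap={gap}: best batch size among candidates = {b_best}")
```

With so few repetitions the selected batch size is noisy; averaging over more runs, as in the experiment described above, would be needed before reading a trend from the output.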

Next, we consider a real problem which trains ResNet20 with 0.0001 weight decay on CIFAR10. The experiments are conducted on the PyTorch platform with an NVIDIA V100 GPU (32G GPU memory). For classical stagewise methods, we follow (He et al., 2016) which divides the learning rate by 10 at the , epochs. According to our theory that $\eta_s/b_s=O(\epsilon_s)$, in SEBS, mSEBS and AdaSEBS, the learning rate is constant and the batch size is scaled by at the , epochs. We set for illustration. In the experiments about vanilla SGD, we also compare SEBS with DB-SGD (Yu and Jin, 2019) in which the scaling ratio for batch size is . The initial batch size of these methods is . In the experiments about momentum SGD, we also compare mSEBS with the large batch training method LARS (You et al., 2017). The poly power and warm-up of LARS are the same as those in (You et al., 2017). We set the batch size, base learning rate and scaling factor of LARS as 4096, 3.2 and 0.01, respectively. The results are presented in Figure 3. We can find that SEBS, mSEBS and AdaSEBS can achieve similar performance, measured based on computation complexity (epochs), as their classical stagewise counterparts, especially when is large. When measured based on iteration complexity, which is directly related to computation time or wall-clock time, SEBS, mSEBS and AdaSEBS are more efficient than their classical stagewise counterparts. In particular, classical stagewise counterparts expend k parameter updates, while SEBS with only expends k parameter updates. Since DB-SGD increases the batch size in every epoch, it falls into a local minimum and the accuracy is worse than that of SEBS. We also try some other scaling ratios for DB-SGD, and DB-SGD still cannot achieve performance as good as classical stagewise SGD and SEBS on either training loss or test accuracy. Different from DB-SGD, SEBS increases the batch size after a stage which contains several epochs, and hence it achieves better performance than DB-SGD.
Although LARS expends fewer parameter updates than mSEBS, it only achieves test accuracy of , while mSGD and mSEBS with achieve test accuracy of . We also try to set the scaling factor of LARS as that in (You et al., 2017), but the test accuracy further drops .

We also compare mSEBS with momentum SGD (mSGD) by training ResNet18 and ResNet50 with 0.0001 weight decay on ImageNet. Data augmentation and initialization of (including the parameters of batch normalization layers) follow the code of PyTorch. In mSGD and mSEBS, the initial batch size is and the learning rate is . Following (He et al., 2016), we divide the learning rates of mSGD by and scale the batch size of mSEBS by , at the , epochs. The results are presented in Table 1. We can see that mSEBS achieves the same performance as momentum SGD on test accuracy. mSEBS scales the batch size to 36k after 60 epochs and saves about parameter updates in total. We also run the large batch training method in (Smith et al., 2018) to train ResNet50: the initial batch size and learning rate are 8192 and 3.2 respectively, the batch size is scaled by 10 at the 30th epoch, and the learning rate is divided by 10 at the 60th and 80th epochs. Although the method in (Smith et al., 2018) expends fewer parameter updates, its accuracy drops . Hence, SEBS is better than classical stagewise methods and more universal than large batch training methods.

## 6 Conclusion

In this paper, we propose a novel method called SEBS to set a proper batch size for SGD-based machine learning. Both theoretical and empirical results show that, compared to classical stagewise SGD methods, SEBS can reduce the number of parameter updates without loss of training error or test accuracy.

## Appendix A SEBS

### A.1 Proof of Lemma 3.3

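According to the update $w_{m+1}=\arg\min_{w}\; g_m^Tw+\frac{1}{2\eta}\|w-w_m\|^2+r(w)$, where $g_m=\nabla f_{\mathcal{B}_m}(w_m)$ and $r(w)=\frac{1}{2\gamma}\|w-\tilde{w}\|^2$, we get that

$$
\Big(g_m+\frac{1}{\eta}(w_{m+1}-w_m)+\nabla r(w_{m+1})\Big)^T(w_{m+1}-w)\le 0,\quad\forall w.
$$

Using the fact that $2(w_{m+1}-w_m)^T(w_{m+1}-w)=\|w_{m+1}-w_m\|^2+\|w_{m+1}-w\|^2-\|w_m-w\|^2$ and $r(w)$ is convex, we obtain

$$
\begin{aligned}
g_m^T(w_m-w)+r(w_{m+1})-r(w)
&\le\frac{\|w_m-w\|^2}{2\eta}-\frac{\|w_{m+1}-w\|^2}{2\eta}+g_m^T(w_m-w_{m+1})-\frac{1}{2\eta}\|w_m-w_{m+1}\|^2\\
&\le\frac{\|w_m-w\|^2}{2\eta}-\frac{\|w_{m+1}-w\|^2}{2\eta}+\frac{\eta}{2}\|g_m\|^2.
\end{aligned}
$$

Taking expectation on both sides, we obtain

$$
\begin{aligned}
\mathbb{E}[\nabla F(w_m)^T(w_m-w)+r(w_{m+1})-r(w)]
&\le\mathbb{E}\Big[\frac{\|w_m-w\|^2}{2\eta}-\frac{\|w_{m+1}-w\|^2}{2\eta}\Big]+\frac{\eta}{2}\mathbb{E}\big[\|g_m-\nabla F(w_m)\|^2+\|\nabla F(w_m)\|^2\big]\\
&\le\mathbb{E}\Big[\frac{\|w_m-w\|^2}{2\eta}-\frac{\|w_{m+1}-w\|^2}{2\eta}\Big]+\frac{\eta}{2}\mathbb{E}\Big[\frac{\sigma^2}{b}+\|\nabla F(w_m)\|^2\Big].
\end{aligned}
$$

Summing up from $m=1$ to $M$, we obtain

$$
\begin{aligned}
\sum_{m=1}^{M}\mathbb{E}[(\alpha-L\eta)(F(w_m)-F(w))+r(w_m)-r(w)]
&\le\frac{\|w_1-w\|^2}{2\eta}+\frac{M\sigma^2\eta}{2b}+\mathbb{E}[r(w_1)-r(w_{M+1})]\\
&\le\frac{\|w_1-w\|^2}{2\eta}+\frac{M\sigma^2\eta}{2b},
\end{aligned}
$$

which implies

$$
(\alpha-L\eta)\mathbb{E}[F(w_\tau)-F(w^*)]\le\frac{\|\tilde{w}-w^*\|^2}{2M\eta}+\frac{\sigma^2\eta}{2b}+\frac{1}{2\gamma}\|\tilde{w}-w^*\|^2.
$$

Since $\eta\le\frac{\alpha}{2L}$, we obtain

$$
\mathbb{E}[F(w_\tau)-F(w^*)]\le\frac{\|\tilde{w}-w^*\|^2}{\alpha M\eta}+\frac{\sigma^2\eta}{\alpha b}+\frac{1}{\alpha\gamma}\|\tilde{w}-w^*\|^2.
$$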

### A.2 Proof of Theorem 3.3

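Since $F(\tilde{w}_1)-F(w^*)\le\epsilon_1$, we use induction to prove the result. Assuming $\mathbb{E}[F(\tilde{w}_s)-F(w^*)]\le\epsilon_s$, and using the PL condition, we obtain

$$
\begin{aligned}
\mathbb{E}[F(\tilde{w}_{s+1})-F(w^*)]
&\le\Big(\frac{2}{\mu\alpha M_s\eta_s}+\frac{2}{\mu\alpha\gamma}\Big)\big(F(\tilde{w}_s)-F(w^*)\big)+\frac{\sigma^2\eta_s}{\alpha b_s}\\
&\le\Big(\frac{2b_s}{\mu\alpha C_s\eta_s}+\frac{2}{\mu\alpha\gamma}\Big)\epsilon_s+\frac{\sigma^2\eta_s}{\alpha b_s}.
\end{aligned}
$$

Since $C_s=\theta/\epsilon_s$ and $\eta_s=\frac{\sqrt{2}\,b_s\epsilon_s}{\sigma\sqrt{\mu\theta}}$, we obtain

$$
\mathbb{E}[F(\tilde{w}_{s+1})-F(w^*)]\le\frac{2\sqrt{2}\sigma\epsilon_s}{\alpha\sqrt{\mu\theta}}+\frac{2\epsilon_s}{\mu\alpha\gamma}.
$$

By setting $\theta$ and $\gamma$ as in Theorem 3.3, so that $\frac{2\sqrt{2}\sigma}{\alpha\sqrt{\mu\theta}}+\frac{2}{\mu\alpha\gamma}\le\frac{1}{\rho}$, we obtain

$$
\mathbb{E}[F(\tilde{w}_{s+1})-F(w^*)]\le\frac{\epsilon_s}{\rho}=\epsilon_{s+1}.
$$

Finally, we obtain that when $S=\log_\rho(\epsilon_1/\epsilon)$, $\mathbb{E}[F(\tilde{w}_{S+1})-F(w^*)]\le\epsilon$.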