Stagewise Enlargement of Batch Size for SGD-based Learning
Abstract
Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning, including training speed and generalization ability. A larger batch size typically results in fewer parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size can more easily lead to a generalization gap. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods for setting the batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between the initialization and the optimum of the model parameter. Then, based on this theory, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD. More specifically, SEBS adopts a multistage scheme and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD which decreases the learning rate by stage, SEBS can reduce the number of parameter updates without increasing generalization error. SEBS is suitable for SGD, momentum SGD and AdaGrad. Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.
Shen-Yi Zhao and Wu-Jun Li
Keywords: SGD, batch size.
1 Introduction
Many machine learning models can be formulated as the following empirical risk minimization (ERM) problem:
(1) 
where denotes the model parameter, denotes the set of training instances sampled from distribution , and denotes the loss on the th training instance.
With the rapid growth of data, stochastic gradient descent (SGD) and minibatch SGD (Robbins and Monro, 1951; Bottou, 1998) have become the most popular methods for solving the ERM problem in (1), and many variants of SGD have been proposed. Among these algorithms, the classical and most widely used one is the stagewise SGD which has been adopted in (Krizhevsky et al., 2012; He et al., 2016). Stagewise SGD is based on a multistage learning scheme. At the th stage, it runs the following iterations:
(2) 
where is the initialization, , is a minibatch of instances randomly sampled from with a batch size , is the learning rate which is a constant at each stage and decreases geometrically by stage. After the th stage is completed, the algorithm randomly picks a parameter from or the last one as the initialization of the next stage. For stagewise SGD with stages, the computation complexity (total number of gradient computations) is and the iteration complexity (total number of parameter updates) is . Recently, Yuan et al. (2019) theoretically proved that, under the weak quasi-convexity and Polyak-Łojasiewicz (PL) conditions, stagewise SGD is better than the original SGD, which adopts a polynomially decreasing learning rate. Classical stagewise SGD methods (Krizhevsky et al., 2012; He et al., 2016) mainly focus on how to set the learning rate for a given constant batch size, which is typically not too large.
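The stagewise scheme described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the scalar toy loss 0.5 * (w - x_i)^2, the names `stagewise_sgd`, `stage_budgets` and `decay`, and the choice of passing the last iterate to the next stage are assumptions made for the example.

```python
import random


def minibatch_grad(w, data, batch):
    """Average gradient of the toy per-instance loss 0.5 * (w - x_i)^2."""
    return sum(w - data[i] for i in batch) / len(batch)


def stagewise_sgd(data, w0, eta0, batch_size, stage_budgets, decay=0.5, seed=0):
    """Classical stagewise SGD: constant batch size, learning rate decayed by stage.

    stage_budgets[s] is the number of gradient computations allowed in stage s
    (a hypothetical parameterization). Returns the final parameter and the
    total number of parameter updates.
    """
    rng = random.Random(seed)
    w, eta, updates = w0, eta0, 0
    for budget in stage_budgets:
        # Each update consumes batch_size gradient computations.
        for _ in range(budget // batch_size):
            batch = rng.sample(range(len(data)), batch_size)
            w -= eta * minibatch_grad(w, data, batch)
            updates += 1
        eta *= decay  # geometric decrease of the learning rate by stage
    return w, updates
```

With a fixed budget per stage, the number of parameter updates in each stage is the budget divided by the batch size, which is why a constant small batch size leads to many updates.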
From (2), we can see that, given a fixed computation complexity, a larger batch size results in fewer parameter updates. In distributed training, each parameter update typically needs one round of communication, and hence a larger batch size results in less frequent communication. Furthermore, a larger batch size can typically better utilize the computing power of current multicore systems like GPUs to reduce computation time, as long as the minibatch does not exceed the memory or computing limit of the system. Figure 1 gives an example to show that enlarging the batch size can reduce computation time. Hence, if we do not take generalization error into consideration, we should choose a larger batch size for SGD to reduce computation time. However, a larger batch size can more easily lead to a generalization gap (?; ?). Some work (?) points out that, to achieve a generalization error similar to that of small-batch training, large-batch training needs to run longer (with higher computation complexity). This is contrary to the original intention of large-batch training. Hence, how to set a proper batch size for SGD has become an interesting but challenging topic.
Several works have proposed heuristic methods for large-batch training (Goyal et al., 2017; You et al., 2017; McCandlish et al., 2018). Compared to classical stagewise SGD methods with a small constant batch size and a learning rate decreased by stage, these large-batch methods need more tricks, which must be carefully tuned for different models and data sets. Furthermore, a theoretical guarantee on the iteration complexity and generalization error of these methods is missing. In addition, in our experiments we find that these methods might increase generalization error if a large batch size is adopted from the initialization.
Other methods set the batch size dynamically. (Friedlander and Schmidt, 2012; Byrd et al., 2012; De et al., 2017; Yin et al., 2018) relate the batch size to the noise of stochastic gradients. These methods need to determine the batch size in each iteration, which incurs considerable extra cost. (Smith et al., 2018) increases the batch size by relating SGD to a stochastic differential equation. However, a theoretical guarantee on the iteration complexity and generalization error is missing. Furthermore, (Yu and Jin, 2019) uses a stagewise training strategy: at each stage, the batch size starts from a small constant and is geometrically increased by iteration. However, the scaling ratio for the batch size cannot be large if convergence is to be guaranteed. In our experiments, we also find that this method might increase generalization error.
In this paper, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD. The main contributions of this paper are outlined as follows:

We first provide theory to show that a proper batch size is related to the gap between the initialization and the optimum of the model parameter. Then, based on this theory, we propose SEBS, which adopts a multistage scheme and enlarges the batch size geometrically by stage.

We theoretically prove that decreasing the learning rate and enlarging the batch size have the same effect on the performance of stagewise SGD.

We theoretically prove that, compared to classical stagewise SGD which decreases the learning rate by stage, SEBS can reduce the number of parameter updates (iteration complexity) without increasing generalization error when the total number of gradient computations (computation complexity) is fixed.

Besides SGD, SEBS is also suitable for momentum SGD and adaptive gradient descent (AdaGrad) (Duchi et al., 2010). We also provide theoretical results about the number of parameter updates (iteration complexity) for momentum SGD and AdaGrad. To the best of our knowledge, this is the first work that analyzes the effect of batch size on the convergence of AdaGrad.

Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.
2 Preliminaries
First, we give the following notations. denotes the norm. denotes the norm. denotes the optimal solution (optimum) of (1). denotes the stochastic gradient of the minibatch . , we use to denote the th element of .
We also make the following assumptions. The variance of the stochastic gradient is bounded: , .
is smooth (): , .
is weakly quasi-convex ():
satisfies the Polyak-Łojasiewicz (PL, ) condition:
Recently, both weak quasi-convexity and the PL condition have been observed for many machine learning models, including deep neural networks (Charles and Papailiopoulos, 2018; Yuan et al., 2019). The PL condition also implies quadratic growth (Karimi et al., 2016), i.e., . Another inequality (Nesterov, 2004) used in this paper is . Please note that these two inequalities do not need the convexity assumption. We call the condition number of under the PL condition.
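For reference, the textbook forms of the conditions named in this section are sketched below. The symbols ($F$ for the objective, $f_i$ for the loss on the $i$th instance, $w^*$ for an optimum, and the constants $\sigma$, $L$, $\alpha$, $\mu$) and the exact constants are assumptions chosen for illustration; the paper's own statements may be normalized differently.

```latex
\begin{align*}
\text{(bounded variance)} \quad
  & \mathbb{E}_i \|\nabla f_i(w) - \nabla F(w)\|^2 \le \sigma^2, \\
\text{($L$-smoothness)} \quad
  & F(u) \le F(w) + \nabla F(w)^\top (u - w) + \tfrac{L}{2}\|u - w\|^2, \\
\text{($\alpha$-weak quasi-convexity)} \quad
  & \nabla F(w)^\top (w - w^*) \ge \alpha \bigl(F(w) - F(w^*)\bigr), \\
\text{($\mu$-PL condition)} \quad
  & \|\nabla F(w)\|^2 \ge 2\mu \bigl(F(w) - F(w^*)\bigr), \\
\text{(quadratic growth)} \quad
  & F(w) - F(w^*) \ge \tfrac{\mu}{2}\|w - w^*\|^2, \\
\text{(consequence of smoothness)} \quad
  & \|\nabla F(w)\|^2 \le 2L \bigl(F(w) - F(w^*)\bigr).
\end{align*}
```

Under this normalization the condition number would be $\kappa = L/\mu$. The last inequality follows from smoothness alone, since $F(w^*) \le F\bigl(w - \tfrac{1}{L}\nabla F(w)\bigr) \le F(w) - \tfrac{1}{2L}\|\nabla F(w)\|^2$, which is why no convexity is needed.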
3 SEBS
In this section, we present the details of SEBS for SGD, including the theory about the relationship between batch size and model initialization, the SEBS algorithm, and the theoretical analysis of the training error and generalization error.
3.1 Relationship between Batch Size and Model Initialization
We start from the vanilla SGD with a constant batch size and learning rate, which can be written as follows:
(3) 
where , and . The computation complexity (total number of gradient computation) is . Let denote a value randomly sampled from .
We aim to find how large the batch size can be without loss of performance. First, we can obtain the following property about (3): Lemma 3.1. By setting , we have
(4) 
Another common upper bound for is from (?):
which uses the bounded gradient assumption . Compared with Assumption 2, we can see that the bounded gradient assumption in (?) omits the effect of the batch size.
Based on (4), we can get a learning rate which minimizes the right-hand side of (4). In fact, (4) also implies a proper batch size. Using , we rewrite the right-hand side of (4) as follows:
Then, we have: ,
To minimize , the corresponding batch size and learning rate should satisfy:
(5) 
where is from Lemma 3.1.
From (5), we can find that given a fixed computation complexity , a proper batch size is related to the gap between the initialization and optimum of the model parameter. More specifically, the smaller the gap between the initialization and optimum of the model parameter is, the larger the batch size can be.
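The trade-off behind this conclusion can be illustrated with a short worked minimization. The sketch below assumes that the bound has the generic shape of an initialization term that shrinks with the number of updates $M = C/b$ plus a variance term that shrinks with the batch size $b$; the constants $c_1, c_2 > 0$ and the shorthand $r = \|w_1 - w^*\|$ are placeholders, not the constants of (4) and (5).

```latex
\begin{align*}
g(b, \eta) &= \frac{r^2}{c_1 M \eta} + \frac{c_2 \eta \sigma^2}{b}
            = \frac{r^2}{c_1 C}\cdot\frac{b}{\eta} + c_2 \sigma^2 \cdot \frac{\eta}{b}
            && \text{(using } M = C/b\text{)}, \\
\frac{\partial g}{\partial x} &= \frac{r^2}{c_1 C} - \frac{c_2 \sigma^2}{x^2} = 0
            && \text{(with } x = b/\eta\text{)}, \\
\frac{b}{\eta} &= \frac{\sigma \sqrt{c_1 c_2 C}}{r},
\qquad
g_{\min} = 2 r \sigma \sqrt{\frac{c_2}{c_1 C}}.
\end{align*}
```

Under these assumptions, only the ratio $b/\eta$ is determined, and for a fixed learning rate the minimizing batch size scales like $1/r$: the closer the initialization is to the optimum, the larger the batch size can be.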
The theory of this subsection provides the theoretical foundation for designing the SEBS algorithm in the following subsection.
3.2 SEBS Algorithm
In classical stagewise SGD (Krizhevsky et al., 2012; He et al., 2016), each stage actually runs vanilla SGD with a constant batch size and learning rate. After each stage, the learning rate is decreased geometrically. In (Yuan et al., 2019), both theoretical and empirical results show that after each stage there is a geometric decrease in the training loss. This means that the gap between the current model parameter and the optimal solution (optimum) is smaller than that of previous stages. Based on the theory about the relationship between the batch size and model initialization from Section 3.1, we can therefore enlarge the batch size in the next stage. Inspired by this, we propose our algorithm, called stagewise enlargement of batch size (SEBS), for SGD-based learning.
SEBS adopts a multistage scheme, and enlarges the batch size geometrically by stage. The details of SEBS are presented in Algorithm 1. SEBS divides the whole learning procedure into stages. At the th stage, SEBS runs the penalty SGD in Algorithm 2, denoted as . Here, denotes the loss function in (1), denotes the training set, is the coefficient of a quadratic penalty, is the initialization of the model parameter at the th stage, is the batch size at the th stage, is a constant learning rate, and is the computation complexity at the th stage. The output of pSGD, denoted as , will be used as the model parameter initialization for the next stage.
The penalty SGD is a variant of vanilla SGD. Compared to vanilla SGD, penalty SGD has an additional quadratic penalty. If , penalty SGD degenerates to vanilla SGD. The quadratic penalty has been widely used in many recent variants of SGD (Allen-Zhu, 2018; Yu and Jin, 2019; Chen et al., 2019b, a; Yuan et al., 2019). Although it may slow down the convergence rate, it can improve the generalization ability.
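The two-level structure described above (an outer loop over stages that enlarges the batch size, and an inner penalty SGD) can be sketched as follows. This is an illustrative sketch under stated assumptions, not Algorithms 1 and 2 themselves: the parameter is a scalar, the penalty is taken to be (w - w_init)^2 / (2 * gamma), the last iterate is returned from each stage, the batch size is capped at the data set size, and the names `psgd`, `sebs`, `rho` and `stage_budgets` are hypothetical.

```python
import math
import random


def psgd(grad_fn, n, gamma, w_init, batch_size, eta, budget, rng):
    """Penalty SGD sketch: SGD on F(w) + (w - w_init)^2 / (2 * gamma).

    gamma = math.inf switches the penalty off, which recovers vanilla SGD.
    Returns the last iterate and the number of parameter updates.
    """
    w, updates = w_init, 0
    for _ in range(budget // batch_size):
        batch = [rng.randrange(n) for _ in range(batch_size)]
        g = sum(grad_fn(w, i) for i in batch) / batch_size
        penalty = 0.0 if math.isinf(gamma) else (w - w_init) / gamma
        w -= eta * (g + penalty)
        updates += 1
    return w, updates


def sebs(grad_fn, n, w0, eta, b0, rho, stage_budgets, gamma=math.inf, seed=0):
    """SEBS sketch: constant learning rate, batch size multiplied by rho per stage."""
    rng = random.Random(seed)
    w, b, total_updates = w0, b0, 0
    for budget in stage_budgets:
        w, k = psgd(grad_fn, n, gamma, w, int(b), eta, budget, rng)
        total_updates += k
        b = min(b * rho, n)  # assumption: never exceed the data set size
    return w, total_updates
```

For example, with three stages of 40 gradient computations each, `b0 = 1` and `rho = 2`, the stages use batch sizes 1, 2 and 4 and perform 40 + 20 + 10 = 70 updates, compared with 120 updates for a constant batch size of 1.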
3.3 Theoretical Analysis about Training Error
First, we have the following one-stage training error for SEBS:
Lemma 3.3 (One-stage training error for SEBS). Let be the sequence produced by , where . Then we have:
(6) 
where is the output of pSGD and . We can find that the one-stage training error for SEBS is similar to that in (4). Hence, we can set the batch size of each stage in SEBS according to the gap . In particular, we can get the following convergence result: Theorem 3.3. Let and be the sequence produced by
where , and
(7) 
Then we obtain . If , then . Here, , and .
In SEBS, if we set which is a constant, and set the batch size as
(8) 
which means , according to Theorem 3.3, we can obtain the computation complexity of SEBS:
This result is consistent with that in (Yuan et al., 2019) which sets . Hence, by setting the batch size according to (8), SEBS achieves the same performance as classical stagewise SGD on computation complexity. Please note that when the loss function is strongly convex, which means , the proved computation complexity above is optimal (?).
The iteration complexity of SEBS is as follows:
Then we can get the following conclusions:

Compared to classical stagewise SGD which decreases learning rate by stage and adopts a constant batch size, SEBS reduces the iteration complexity from to , where is the upper bound for ;

We can also observe that the iteration complexity of SEBS is independent of the variance , and hence is independent of the dimension ;

According to (7) in Theorem 3.3, in order to get the convergence result, we need to keep the relation between the batch size and learning rate in each stage as follows:
This relation implies that the following two strategies for adjusting the batch size and learning rate are equivalent in terms of training error: (a) keep the batch size constant and decrease the learning rate; (b) keep the learning rate constant and enlarge the batch size. Neither of them affects the computation complexity. Please note that strategy (a) has been widely adopted in classical stagewise training methods, and strategy (b) is proposed in SEBS.
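The difference between the two strategies shows up only in the number of parameter updates, which a small calculation makes concrete. The sketch below assumes that the relation above prescribes only the ratio of learning rate to batch size in each stage, and uses a hypothetical common scaling factor `rho` for both strategies; the function names and the numbers in the usage note are illustrative, not taken from the paper.

```python
def stage_schedules(eta0, b0, rho, num_stages):
    """Per-stage (learning rate, batch size) pairs for the two strategies."""
    # Strategy (a): constant batch size, learning rate divided by rho per stage.
    decay_lr = [(eta0 / rho ** s, b0) for s in range(num_stages)]
    # Strategy (b): constant learning rate, batch size multiplied by rho per stage.
    enlarge_batch = [(eta0, b0 * rho ** s) for s in range(num_stages)]
    return decay_lr, enlarge_batch


def total_updates(schedule, stage_budgets):
    """Parameter updates needed when stage s may use stage_budgets[s] gradients."""
    return sum(budget // b for (_, b), budget in zip(schedule, stage_budgets))
```

With `eta0 = 0.1`, `b0 = 128`, `rho = 4` and three stages of 1,280,000 gradient computations each, both schedules have the same learning-rate-to-batch-size ratio in every stage, but strategy (a) needs 30,000 updates while strategy (b) needs 10,000 + 2,500 + 625 = 13,125.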
3.4 Theoretical Analysis about Generalization Error
In this section, we analyze the generalization error of SEBS. The main tool we use for the generalization error is uniform stability (Hardt et al., 2016), which is defined as follows: Definition. A randomized algorithm is uniformly stable if for all data sets , such that and differ in at most one instance, we have
where is the output of on data set , . It has been proved (Hardt et al., 2016) that if is uniformly stable, then
where is the output of on data set . Hence, in the following, we consider the two data sets and differing in only a single instance which is indexed by . Let be the output of SEBS on data set , be the sequences produced by SEBS at the last stage, be the corresponding randomly selected minibatch of instances, . We omit the subscript and use to denote the learning rate, batch size and computation complexity in the last stage. We also define . Following (Hardt et al., 2016), we assume that . Then we have the following property about . Lemma. For one specific , if , then we get
If , then we get
Using the recursive relation of , we get the following uniform stability of SEBS: Theorem 3.4. With the defined in Theorem 3.3, we obtain
where . According to Theorem 3.4, we obtain the following two conclusions:

This uniform stability is consistent with (Yuan et al., 2019). The stability error only depends on the computation complexity and has nothing to do with the batch size of each stage.

Compared to the classical non-penalty SGD in (2) which actually corresponds to the penalty SGD with and , penalty SGD with a finite can improve the stability, and hence improve the generalization error.
Since and , by setting , we obtain a generalization error for SEBS.
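For reference, the uniform stability notion of Hardt et al. (2016) used in this section can be written as follows. The notation is an assumption made for this sketch: $\mathcal{A}$ is the randomized algorithm, $\mathcal{I}$ and $\mathcal{I}'$ are two data sets of the same size differing in at most one instance, $f(w; \xi)$ is the loss of parameter $w$ on instance $\xi$, and $F_{\mathcal{I}}$ and $F_{\mathcal{D}}$ denote the empirical and expected risks.

```latex
\begin{align*}
\text{($\epsilon$-uniform stability)} \quad
  & \sup_{\xi}\; \mathbb{E}_{\mathcal{A}}\bigl[ f(\mathcal{A}(\mathcal{I}); \xi)
      - f(\mathcal{A}(\mathcal{I}'); \xi) \bigr] \le \epsilon, \\
\text{(generalization in expectation)} \quad
  & \Bigl| \mathbb{E}_{\mathcal{I}, \mathcal{A}}\bigl[ F_{\mathcal{I}}(\mathcal{A}(\mathcal{I}))
      - F_{\mathcal{D}}(\mathcal{A}(\mathcal{I})) \bigr] \Bigr| \le \epsilon.
\end{align*}
```

The second line is the consequence of the first that is used to turn a stability bound into a bound on the generalization error.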
4 SEBS for Momentum SGD and AdaGrad
Momentum SGD (mSGD) (Polyak, 1964; Tseng, 1998; Ghadimi and Lan, 2013, 2016) and adaptive gradient descent (AdaGrad) (McMahan and Streeter, 2010; Duchi et al., 2010) have been two of the most important and popular variants of SGD. In the following content, we will show that SEBS is also suitable for momentum SGD and AdaGrad. To the best of our knowledge, existing research on AdaGrad only analyzes the convergence property with . This is the first work that analyzes the effect of batch size on the convergence of AdaGrad.
4.1 SEBS for Momentum SGD
Here, we propose to adapt SEBS for momentum SGD. The resulting algorithm is called mSEBS, which is presented in Algorithm 3. mSEBS divides the whole learning procedure into stages. At each stage, mSEBS runs Polyak’s momentum SGD (Polyak, 1964), which is presented in Algorithm 4. Please note that mSEBS resets the momentum to zero after each stage for the convenience of the convergence proof. This is different from some mSGD implementations, such as the one in PyTorch, which do not reset the momentum to zero. In our experiments, we find that this difference does not have a significant influence.
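The momentum reset between stages can be made explicit in a short sketch. This is an illustration under assumptions, not Algorithms 3 and 4: the parameter is a scalar, the heavy-ball update is written in the buffer form `u = beta * u + g; w -= eta * u`, the batch size is capped at the data set size, and the names `msgd_stage`, `msebs`, `beta` and `rho` are hypothetical.

```python
import random


def msgd_stage(grad_fn, n, w, batch_size, eta, beta, budget, rng):
    """One stage of heavy-ball momentum SGD with a fresh momentum buffer."""
    u, updates = 0.0, 0  # the momentum buffer starts at zero in every stage
    for _ in range(budget // batch_size):
        batch = [rng.randrange(n) for _ in range(batch_size)]
        g = sum(grad_fn(w, i) for i in batch) / batch_size
        u = beta * u + g  # accumulate momentum
        w -= eta * u      # parameter update
        updates += 1
    return w, updates


def msebs(grad_fn, n, w0, eta, beta, b0, rho, stage_budgets, seed=0):
    """mSEBS sketch: constant learning rate, batch size enlarged by rho per stage."""
    rng = random.Random(seed)
    w, b, total_updates = w0, b0, 0
    for budget in stage_budgets:
        # Calling msgd_stage afresh is what resets the momentum to zero.
        w, k = msgd_stage(grad_fn, n, w, int(b), eta, beta, budget, rng)
        total_updates += k
        b = min(b * rho, n)  # assumption: never exceed the data set size
    return w, total_updates
```

Keeping the buffer `u` alive across stages instead of re-creating it would correspond to the non-resetting behavior mentioned above.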
Similar to SEBS for SGD, mSEBS can also achieve computation complexity and iteration complexity which is independent of and . Due to space limitation, we move the related theorems to the supplementary material.
4.2 SEBS for AdaGrad
Here, we propose to adapt SEBS for AdaGrad. The resulting algorithm is called AdaSEBS, which is presented in Algorithm 5. AdaSEBS also divides the whole learning procedure into stages. At the th stage, AdaSEBS runs which is presented in Algorithm 6. In particular, AdaGrad runs the following iterations:
(9) 
where . is a diagonal matrix, in which the diagonal element is defined as , where . In existing research, is typically set to for convex loss functions (McMahan and Streeter, 2010; Duchi et al., 2010) and is typically set to for strongly convex loss functions (Duchi et al., 2010; ?).
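A single diagonal AdaGrad step with an adjustable exponent can be sketched as follows. The exact definition of the diagonal matrix is not reproduced here, so the form used below, `(delta + sum of squared gradients) ** p` per coordinate, is an assumption, as are the names `adagrad_step`, `acc`, `delta` and `p`. Setting `p=0.5` gives the usual square-root scaling, and `p=1.0` drops the square root.

```python
def adagrad_step(w, g, acc, eta, delta=1e-8, p=0.5):
    """One diagonal AdaGrad step on lists of floats.

    acc holds the running sum of squared gradients for each coordinate.
    Returns the updated parameter and the updated accumulator.
    """
    new_acc = [a + gj * gj for a, gj in zip(acc, g)]
    new_w = [
        wj - eta * gj / (delta + a) ** p
        for wj, gj, a in zip(w, g, new_acc)
    ]
    return new_w, new_acc
```

In a minibatch setting, `g` would be the averaged stochastic gradient of the current minibatch, so the batch size enters the update only through the noise level of `g`.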
Let be the sequence produced by AdaGrad in Algorithm 6. According to (Duchi et al., 2010), we obtain
where , and . If we take , then . When the gradient is relatively small, e.g., , the square root operation will make the upper bound of loose. Hence in this work, although is not necessarily strongly convex, we still set and get the following one-stage training error for AdaSEBS:
Lemma 4.2 (One-stage training error for AdaSEBS)
Let be the sequence produced by . Then we have
where is the output of AdaGrad, , . Furthermore, if we choose such that and , then we have
(10) 
According to Lemma 4.2, we actually prove a error while (Duchi et al., 2010) proves a error, where . Furthermore, our one-stage training error is independent of the input . Since the exact upper bound of is usually unknown, we can set a large without loss of training error in our presented AdaGrad. Hence, the error bound in (10) is better than that in (Duchi et al., 2010) in which a large may lead to a large error.
Then we have the following convergence result for AdaSEBS: Theorem. Let be the sequence produced by Algorithm 5, . By setting , and
we obtain . Here, , , . Similar to SEBS for vanilla SGD, we also obtain computation complexity and iteration complexity.
Recently, there is another work about stagewise AdaGrad, called SADAGrad (Chen et al., 2018), which mainly focuses on the case that is sparse. SADAGrad adopts a constant batch size and decreases the learning rate geometrically by stage. Under convex and quadratic growth condition, SADAGrad achieves an iteration complexity of , which is dependent on dimension . Hence, AdaSEBS is better than SADAGrad.
5 Experiments
First, we verify the theory about the relationship between batch size and model initialization. We consider a synthetic problem:
(11) 
where , each data is sampled from the Gaussian distribution , and is a diagonal matrix with . The corresponding are , respectively, and the optimal solution of (11) is . We run vanilla SGD in (3) with a fixed computation complexity to solve (11). We set the model parameter initialization , where , . For each , we aim to find the optimal batch size that can achieve the smallest value of , where is the output of the vanilla SGD algorithm in (3). We repeat the experiment 50 times, and the average result for the optimal batch size is presented in Figure 2. We can find that the optimal batch size is almost proportional to , and a larger learning rate implies a larger optimal batch size. These phenomena are consistent with our theory in (5) where for a fixed computation complexity .
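The search procedure described above can be sketched on a much smaller stand-in problem. The code below does not reproduce problem (11): it uses a hypothetical one-dimensional objective, the average of 0.5 * (w - x_i)^2 over Gaussian samples x_i, takes the last iterate as the output of SGD, and uses illustrative defaults for the data set size, the number of repetitions and the seed. The names `best_batch_size`, `dist` and `candidates` are assumptions.

```python
import random


def best_batch_size(dist, eta, budget, candidates, n=200, repeats=20, seed=0):
    """Scan candidate batch sizes under a fixed gradient budget.

    dist is the distance between the initialization and the optimum.
    Returns the candidate with the smallest average optimality gap and the
    dictionary of average gaps for all candidates.
    """
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w_star = sum(data) / n  # minimizer of the toy objective
    avg_gap = {}
    for b in candidates:
        total = 0.0
        for _ in range(repeats):
            w = w_star + dist
            for _ in range(budget // b):
                batch = [data[rng.randrange(n)] for _ in range(b)]
                w -= eta * (w - sum(batch) / b)
            total += 0.5 * (w - w_star) ** 2  # F(w) - F(w_star) for this toy
        avg_gap[b] = total / repeats
    return min(avg_gap, key=avg_gap.get), avg_gap
```

Calling `best_batch_size` for several values of `dist` and `eta` gives a rough, small-scale analogue of the sweep behind Figure 2; no particular outcome is claimed here.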


Next, we consider a real problem which trains ResNet20 with 0.0001 weight decay on CIFAR10. The experiments are conducted on the PyTorch platform with an NVIDIA V100 GPU (32G GPU memory). For classical stagewise methods, we follow (He et al., 2016) which divides the learning rate by 10 at the , epochs. According to our theory that , in SEBS, mSEBS and AdaSEBS, the learning rate is constant and the batch size is scaled by at the , epochs. We set for illustration. In the experiments about vanilla SGD, we also compare SEBS with DBSGD (Yu and Jin, 2019) in which the scaling ratio for batch size is . The initial batch size of these methods is . In the experiments about momentum SGD, we also compare mSEBS with the large-batch training method LARS (You et al., 2017). The poly power and warmup of LARS are the same as those in (You et al., 2017). We set the batch size, base learning rate, and scaling factor of LARS to 4096, 3.2, and 0.01, respectively. The results are presented in Figure 3. We can find that SEBS, mSEBS and AdaSEBS achieve performance similar to that of their respective classical stagewise counterparts when measured by computation complexity (epochs), especially when is large. When measured by iteration complexity, which is directly related to computation time or wall-clock time, SEBS, mSEBS and AdaSEBS are more efficient than their respective classical stagewise counterparts. In particular, classical stagewise counterparts require k parameter updates, while SEBS with requires only k parameter updates. Since DBSGD increases the batch size in every epoch, it falls into a local minimum and its accuracy is worse than that of SEBS. We also try some other scaling ratios for DBSGD, and DBSGD still cannot achieve performance as good as classical stagewise SGD and SEBS on either training loss or test accuracy. Different from DBSGD, SEBS increases the batch size after a stage which contains several epochs, and hence it achieves better performance than DBSGD.
Although LARS expends fewer parameter updates than mSEBS, it only achieves test accuracy of , while mSGD and mSEBS with achieve test accuracy of . We also try to set the scaling factor of LARS as that in (You et al., 2017), but the test accuracy further drops .
Model     Method                Initial batch size  Initial learning rate  #parameter updates  Test accuracy
ResNet18  mSGD                  256                 0.1                    450k                69.56%
          mSGD*                 256                 0.1                    450k                69.90%
          mSEBS                 256                 0.1                    160k                69.75%
ResNet50  mSGD                  256                 0.1                    450k                75.85%
          mSGD*                 256                 0.1                    450k                75.85%
          mSEBS                 256                 0.1                    160k                75.87%
          (Smith et al., 2018)  8192                3.2                    5.63k               73.44%
We also compare mSEBS with momentum SGD (mSGD) by training ResNet18 and ResNet50 with 0.0001 weight decay on ImageNet. Data augmentation and initialization of (including the parameters of batch normalization layers) follow the code of PyTorch
6 Conclusion
In this paper, we propose a novel method called SEBS to set a proper batch size for SGD-based machine learning. Both theoretical and empirical results show that SEBS can reduce the number of parameter updates without loss of training error and test accuracy, compared to classical stagewise SGD methods.
Appendix A SEBS
A.1 Proof of Lemma 3.3
According to the updates , we get that
Using the fact that and is convex, we obtain
Taking expectation on both sides, we obtain
Summing up from to , we obtain
which implies
Since , we obtain
A.2 Proof of Theorem 3.3
Since , we use induction to prove the result. Assuming , and using the PL condition, we obtain
Since and , we obtain
By setting and , we obtain
Finally, we obtain that when , .