Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources


Haibin Lin, Hang Zhang, Yifei Ma, Tong He, Zhi Zhang, Sheng Zha, Mu Li
Amazon Web Services
Palo Alto, CA
{haibilin, hzaws, yifeim, htong, zhiz, zhasheng, mli}@amazon.com
Abstract

With the increasing demand for compute to train deep learning algorithms and the rapid growth of computation resources in data centers, it is desirable to dynamically schedule different distributed deep learning tasks to maximize resource utilization and reduce cost. In this process, different tasks may receive varying numbers of machines at different times, a setting we call elastic distributed training. Despite the recent successes in large mini-batch distributed training, these methods are rarely tested in elastic distributed training environments, and in our experiments they suffer degraded performance when the learning rate is adjusted linearly and immediately with respect to the batch size. One difficulty we observe is that the noise in the stochastic momentum estimation accumulates over time and has delayed effects when the batch size changes. We therefore propose to smoothly adjust the learning rate over time to alleviate the influence of the noisy momentum estimation. Our experiments on image classification, object detection and semantic segmentation demonstrate that the proposed Dynamic SGD method achieves stable performance when varying the number of GPUs from 8 to 128. We also provide a theoretical understanding of the optimality of linear learning rate scaling and the effects of stochastic momentum.

1 Introduction

Deep learning has become the de facto standard class of algorithms in computer vision. Deeper and larger models keep boosting the state-of-the-art performance in image classification [he2016deep, szegedy2015going], object detection [ren2015faster, liu2016ssd] and semantic segmentation [long2015fully, chen2018deeplab]. In addition, computation-intensive tasks such as video classification [wang2018non, feichtenhofer2018slowfast], segmentation [li2018low, voigtlaender2019feelvos] and visual question answering [tapaswi2016movieqa, lei2018tvqa] attract growing interest as more computation resources become available. Despite the rapid growth of computation power in the data centers of research labs and leading IT companies, the demand for training deep learning models can barely be satisfied. To train a job using multiple machines, we may need to wait for a long time for the job to start due to limited resources. To finish high-priority training jobs in time, we may reserve resources in advance or stop other low-priority jobs.

Figure 1: Top-1 accuracy on the ImageNet dataset vs. different training methods. Static baseline refers to training with the mini-batch size and number of machines fixed. In elastic distributed training environments, the mini-batch size is dynamically updated. The proposed Dynamic SGD method enables stable training in such dynamic environments while benefiting from the speedup of elastic resources. Our elastic training method achieves model accuracy comparable to small mini-batch training (1K), and always outperforms the large mini-batch one (12K) under the same setting.

Allowing the resources of a training job to change dynamically can greatly reduce waiting time and improve resource utilization. We could start a job earlier when a portion of the requested resources is ready. We could also preempt resources from running low-priority training jobs for high-priority tasks without stopping them. In addition, cloud providers encourage users to use preemptible resources with significantly lower prices, such as spot instances on AWS [ec2spot], low-priority virtual machines on Azure [azure] and preemptible virtual machines on Google Cloud [google]. We refer to such a dynamic machine environment, where machines can be added to or removed from a task, as an elastic distributed training environment.

Mini-batch stochastic gradient descent (SGD) with momentum (e.g. heavy-ball momentum, Nesterov momentum, Adam [nesterov1983method, kingma2014adam]) is a widely used optimization method for training deep learning models in computer vision. Previous work focuses on asynchronous SGD (asyncSGD) for environments where machines are loosely coupled [NIPS2012_4687]. AsyncSGD, however, converges radically differently from synchronous mini-batch SGD [zhang2015staleness, reddi2015variance, zheng2017asynchronous]. It is also difficult for asyncSGD to match the model accuracy of synchronized mini-batch SGD with a similar computation budget [NIPS2012_4687, chen2016revisiting].

In this work, we focus on extending synchronized mini-batch SGD with momentum to the dynamic training environment. We aim to minimize the difference in convergence when the training environment is constantly changing. There are two straightforward approaches to extending synchronized SGD with momentum. One approach is to fix the mini-batch size but reassign mini-batches across machines when the number of machines changes. With the mini-batch size fixed, the number of machines has little impact on the optimization method, but adding many machines reduces computational efficiency, since the per-machine mini-batch size decreases linearly. The other approach fixes the per-machine mini-batch size and scales the total mini-batch size linearly with the number of machines, while changing the learning rate linearly at the same time [goyal2017accurate, devarakonda2017adabatch, smith2017don]. Empirically, we find this approach often leads to degraded model accuracy when the number of machines changes dramatically. Our theoretical analysis shows that the noise in the momentum state scales inversely with the mini-batch size; linearly scaling the learning rate when increasing the mini-batch size fails to account for this change in noise scale, which leads to optimization difficulty.

In this paper, we propose a new optimization strategy called Dynamic Mini-batch Stochastic Gradient Descent (Dynamic SGD) that smoothly adjusts the learning rate to stabilize the variance changes. We design multiple dynamic training settings by varying the number of GPUs between 8 and 128. We evaluate the proposed method on image classification (ResNet-50 [he2016deep], MobileNet [howard2017mobilenets] and InceptionV3 [DBLP:journals/corr/SzegedyVISW15] on ImageNet [deng2009imagenet]), object detection (SSD [liu2016ssd] on MS-COCO [lin2014microsoft]), and semantic segmentation (FCN [long2015fully] on ADE20K [zhou2017scene]). The experimental results demonstrate that the convergence of Dynamic SGD in dynamic environments consistently matches that of training in static environments.

The contributions of this paper include:

  1. Identifying the key challenge in extending synchronized SGD with momentum to dynamic training environments.

  2. Proposing a new method, which we call Dynamic SGD, that smoothly adjusts the learning rate when the number of machines changes in order to stabilize training.

  3. Extensive empirical studies benchmarking the proposed Dynamic SGD method against straightforward approaches on large-scale image classification, object detection and semantic segmentation tasks.

2 Background

2.1 Mini-batch SGD with Momentum

We first review mini-batch stochastic gradient descent (SGD) and SGD with momentum [robbins1951stochastic]. Given a network parameterized by a weight vector w and a labeled dataset X, the loss to minimize can be written in the following form [goyal2017accurate]:

L(w) = \frac{1}{|X|} \sum_{x \in X} l(x, w)    (1)

where l(x, w) is the loss for sample x with parameters w. SGD iteratively updates the parameters using the gradient estimated within each mini-batch to optimize the loss:

w_{t+1} = w_t - \eta \cdot \frac{1}{n} \sum_{x \in B_t} \nabla l(x, w_t)    (2)

where w_t is the network parameter at iteration t, η is the learning rate, and B_t is a mini-batch of n samples randomly drawn from the dataset X.

In practice, SGD with momentum helps accelerate the optimization. The momentum state keeps an exponentially weighted sum of past gradient estimates and updates the parameters with the following rule:

u_{t+1} = m u_t + \frac{1}{n} \sum_{x \in B_t} \nabla l(x, w_t), \qquad w_{t+1} = w_t - \eta u_{t+1}    (3)

where u_t is the momentum state of historical gradients at iteration t, and m is the decay ratio for the momentum state. The parameter update depends on both the gradient estimated within the current mini-batch and the exponentially weighted gradients from historical mini-batches.
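As a concrete illustration of Equations 2 and 3, the following minimal NumPy sketch applies momentum SGD steps to a parameter vector; the quadratic toy loss, the data, and the hyper-parameter values are illustrative assumptions, not taken from the paper (setting m = 0 recovers plain SGD as in Equation 2).

import numpy as np

def momentum_sgd_step(w, u, grad, lr, m=0.9):
    # Momentum SGD update (Equation 3): u keeps an exponentially
    # weighted sum of past mini-batch gradients; m = 0 gives plain SGD (Equation 2).
    u = m * u + grad
    w = w - lr * u
    return w, u

rng = np.random.default_rng(0)
w = rng.normal(size=4)
u = np.zeros_like(w)
for step in range(100):
    batch = rng.normal(size=(32, 4))             # a mini-batch of 32 samples (toy data)
    grad = np.mean(2.0 * (w - batch), axis=0)    # gradient of a toy squared loss, averaged over the batch
    w, u = momentum_sgd_step(w, u, grad, lr=0.05)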

2.2 Data Parallel Distributed Training

Input: mini-batch size n, number of workers k, and learning rate η
Randomly sample n inputs x_1, ..., x_n
Partition the inputs into k parts B_1, ..., B_k
for worker i = 1, ..., k do (run in parallel)
     Compute gradient g_i = \frac{1}{n} \sum_{x \in B_i} \nabla l(x, w_t) based on data partition B_i
     Allreduce gradients to obtain g = \sum_{i=1}^{k} g_i
     Update parameters by w_{t+1} = w_t - \eta g
end for
Algorithm 1 Data parallel distributed training for a single mini-batch.

In distributed training, multiple workers work together to finish a training job. A worker is a computational unit, such as a CPU or a GPU. Data parallelism defines how the training workload is partitioned across workers.

Assume we are using mini-batch SGD with a mini-batch size n on k workers, and each worker has a copy of the model parameters. In data parallelism, we partition a mini-batch into k parts, so each worker gets n/k examples and computes its local gradient. All local gradients are averaged over the k workers through synchronous communication to obtain the global gradient for this mini-batch. This global gradient is then used to update the model parameters by the SGD update rule. Algorithm 1 illustrates how a single mini-batch is computed.
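A minimal Python sketch of Algorithm 1, simulating the k workers within a single process: each worker computes a gradient on its shard, the shard gradients are summed (the allreduce step), and every replica applies the same SGD update. The toy loss and data are hypothetical placeholders, not the paper's implementation.

import numpy as np

def grad_fn(w, shard, n):
    # Per-shard gradient of a toy squared loss, already divided by the
    # total mini-batch size n, as in Algorithm 1 (illustrative loss only).
    return 2.0 * np.sum(w - shard, axis=0) / n

def data_parallel_step(w, batch, num_workers, lr):
    n = len(batch)
    shards = np.array_split(batch, num_workers)      # partition the mini-batch into k parts
    local_grads = [grad_fn(w, s, n) for s in shards] # each worker's local gradient
    g = np.sum(local_grads, axis=0)                  # allreduce: sum local gradients across workers
    return w - lr * g                                # identical SGD update on every replica

rng = np.random.default_rng(0)
w = rng.normal(size=4)
batch = rng.normal(size=(128, 4))
w = data_parallel_step(w, batch, num_workers=8, lr=0.1)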

3 Methods

In this section, we first describe the setup of the elastic distributed training environment, then study the optimization instability of momentum SGD in such an environment, and finally introduce Dynamic Mini-batch SGD to stabilize the training.

3.1 Elastic Distributed Training System

In a dynamic scheduling deep learning system, computation resources are managed, planned and distributed dynamically across training tasks based on their priorities and system availability. We refer to a distributed training environment with dynamically changing computation resources as Elastic Distributed Training.

An overview of the elastic distributed training diagram for an example task is shown in Figure 2 (note that the system typically serves more than one training task). The scheduler maintains the training state and coordinates the resources. It tracks the current and future number of workers by monitoring heartbeats from existing workers and availability notices. Each worker computes the gradients for its assigned data. The parameter server for the training task maintains the primary copy of the model parameters, updates the parameters based on the aggregated gradients from the workers, and sends the updated parameters back to the workers. In practice, the parameter server can partition the parameters among multiple machines to increase throughput.
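A toy sketch of the parameter-server update described above, written so that the number of reporting workers may differ from one step to the next; the class, its update rule, and the usage data are our illustrative assumptions, and real systems additionally shard parameters and overlap communication with computation.

import numpy as np

class ToyParameterServer:
    # Keeps the primary copy of the weights and applies momentum SGD (Equation 3)
    # to gradients aggregated from however many workers are currently active.
    def __init__(self, w, lr, momentum=0.9):
        self.w, self.u = w, np.zeros_like(w)
        self.lr, self.m = lr, momentum

    def step(self, worker_grads):
        # worker_grads: one gradient per active worker; the list length may change over time
        g = np.mean(worker_grads, axis=0)   # aggregate gradients from the current worker pool
        self.u = self.m * self.u + g        # momentum update
        self.w = self.w - self.lr * self.u
        return self.w                       # broadcast updated weights back to the workers

ps = ToyParameterServer(w=np.zeros(4), lr=0.1)
rng = np.random.default_rng(0)
for num_active in [8, 8, 12, 16]:           # the worker pool grows between steps
    grads = [rng.normal(size=4) for _ in range(num_active)]
    w = ps.step(grads)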

Figure 2: Overview of the elastic distributed training system with an example task. The scheduler keeps track of the training progress, monitors the worker pool, and dynamically assigns data and workers to each task based on priority and system availability. The task parameter server aggregates the parameter gradients from each worker and sends back the updated parameter weights. (The annotations in the figure denote the number of workers assigned to the current task at successive time steps.)

3.2 Dynamic Mini-batch SGD

Learning Rate Scaling.

Prior work linearly scales the learning rate with the mini-batch size, which has achieved success in large mini-batch training [goyal2017accurate]. Considering SGD without momentum as in Equation 2, the parameter update over k iterations with mini-batch size n can be approximated by a single update with a linearly scaled learning rate kη on the combined mini-batch of size kn, if we can assume ∇l(x, w_{t+j}) ≈ ∇l(x, w_t) for j < k. We discuss this in further detail in Appendix LABEL:app:1. We refer to this strategy of scaling the learning rate linearly with the mini-batch size as linear scaling. Despite its success in large mini-batch training, linear scaling fails in an elastic training environment because the effect of momentum is not compensated. To our knowledge, the effect of momentum under a changing mini-batch size has not been studied in previous work.
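The reasoning behind linear scaling can be made explicit with the approximation above; the following is a sketch of the standard argument in the style of [goyal2017accurate], not a new result. Taking k small steps on mini-batches B_1, ..., B_k of size n and taking one step with learning rate kη on their union yield nearly the same parameters when the gradients change slowly across those iterations:

w_{t+k} = w_t - \eta \sum_{j<k} \frac{1}{n} \sum_{x \in B_j} \nabla l(x, w_{t+j})
\;\approx\; w_t - k\eta \cdot \frac{1}{kn} \sum_{j<k} \sum_{x \in B_j} \nabla l(x, w_t) = \hat{w}_{t+1},

where \hat{w}_{t+1} denotes a single update with learning rate kη on the combined mini-batch of size kn, and the approximation uses ∇l(x, w_{t+j}) ≈ ∇l(x, w_t).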

SGD Momentum with Changing Mini-batch Size.

To see the weight update in momentum SGD (Equation 3) clearly, a common way to rewrite the equation is to substitute v_t = η u_t to absorb the learning rate η [goyal2017accurate]:

v_{t+1} = m v_t + \eta \cdot \frac{1}{n} \sum_{x \in B_t} \nabla l(x, w_t), \qquad w_{t+1} = w_t - v_{t+1}    (4)

which is identical to Equation 3 for a static learning rate η and mini-batch size n. For elastic training at iteration t, the mini-batch size is changed from n to kn (k is the changing ratio, which can be fractional). Applying linear learning rate scaling directly to momentum SGD, we obtain the parameter update:

v_{t+1} = m \cdot (k v_t) + k\eta \cdot \frac{1}{kn} \sum_{x \in B_t} \nabla l(x, w_t), \qquad w_{t+1} = w_t - v_{t+1}    (5)

where v_t is the momentum state estimated with the previous mini-batch size n, which is scaled by k times for momentum correction to maintain equivalence with Equation 3 [goyal2017accurate]. However, this compensation only considers the gradient scale (the scale of the momentum state is \frac{1}{1-m} times the gradient scale) instead of the noise scale in the momentum state.

Figure 3: Training and validation accuracy on ImageNet. The mini-batch size is increased from 1K to 12K at epoch 20. The proposed Dynamic SGD is robust to the mini-batch change, while training with linear scaling degrades.

Noise Scale in the Gradient and Momentum State.

SGD obtains an estimated gradient \hat{g} = \frac{1}{n} \sum_{x \in B} \nabla l(x, w) within a mini-batch B of size n to approximate the full gradient ∇L(w) on the entire dataset X. The gradient of each sample is a random variable whose expected value is the full gradient. The variance of the estimated gradient within a mini-batch scales inversely with the mini-batch size n [DBLP:journals/corr/abs-1812-06162] (assuming each sample is drawn independently from the dataset and n ≪ |X|). The gradient in a mini-batch thus gives a noisy estimate of the full gradient, and a larger mini-batch size provides a less noisy estimate.

The variance of the momentum state is proportional to the variance of the mini-batch gradient, i.e. Var(u_t) ∝ Var(\hat{g}) ∝ 1/n. Therefore, directly scaling up the momentum state by k for "momentum correction" as in Equation 5 increases its noise scale quadratically, to k^2 times that of the momentum state estimated on mini-batch size n, whereas a momentum state estimated directly on the larger mini-batch size kn would in fact be less noisy. We also observe unstable training under this setting in our experiments; the training curves are shown in Figure 3.
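The variance claims above can be traced with two standard identities; this is a brief sketch under the independence assumption, with σ² denoting the per-sample gradient variance (our notation, not the paper's):

\mathrm{Var}\Big(\frac{1}{n} \sum_{x \in B} \nabla l(x, w)\Big) = \frac{\sigma^2}{n},
\qquad
\mathrm{Var}(k \, v_t) = k^2 \, \mathrm{Var}(v_t).

Hence rescaling the previous momentum state by k multiplies its variance by k^2, while a momentum state accumulated from mini-batches of size kn would have variance proportional to 1/(kn); the corrected state is therefore disproportionately noisy right after the batch-size increase.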

Momentum Compensation.

To address the difficulty of adapting the previous momentum state, we introduce a momentum compensation factor α_t, which gradually changes over iterations to allow smooth adaptation of the momentum state when the mini-batch size increases by k times. The weight update follows Equation 5 with the scaled learning rate kη replaced by the compensated rate α_t · kη, where α_t is:

\alpha_t = \begin{cases} \frac{1}{k} + \left(1 - \frac{1}{k}\right) \frac{t - t_0}{T}, & t_0 \le t < t_0 + T \\ 1, & t \ge t_0 + T \end{cases}    (6)

where t_0 is the iteration at which the mini-batch size is changed, and T is the total number of compensation iterations, which we find works well in practice. (We only compensate for momentum adaptation when increasing the mini-batch size, because a reduced noise scale does not harm training, as shown in Figure 4.) We refer to mini-batch SGD adopting this momentum compensation strategy for elastic distributed training as Dynamic Mini-batch Stochastic Gradient Descent (Dynamic SGD). (We discuss other potential momentum compensation methods in the Future Work Section LABEL:sec:conclusion.) The proposed Dynamic SGD stabilizes training, as shown in Figure 3.
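The schedule below is a minimal sketch of this compensation idea, assuming a linear ramp of the effective learning rate from η to kη over T iterations after the batch-size change; compensated_lr and its parameters are our illustrative names, not from the paper's released code.

def compensated_lr(base_lr, k, t, t0, T):
    # Effective learning rate after the mini-batch size grows by a factor k at iteration t0.
    # Linear scaling alone would jump straight to k * base_lr; here the jump is spread over
    # T iterations so the momentum state can adapt smoothly (linear ramp assumed).
    if t < t0:
        return base_lr
    if t >= t0 + T:
        return k * base_lr
    progress = (t - t0) / float(T)
    return base_lr * (1.0 + (k - 1.0) * progress)

# Illustrative usage: batch size grows 12x at iteration 5000, compensated over 2000 iterations.
lrs = [compensated_lr(0.4, k=12, t=t, t0=5000, T=2000) for t in range(0, 10000, 500)]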

4 Related Work

4.1 Large Mini-batch Distributed Training

Prior work achieves great success in large mini-batch data-parallel training for deep convolutional neural networks. Li [li2017scaling] shows distributed training with mini-batch sizes up to 5K without a loss in accuracy on ImageNet. Goyal et al. [DBLP:journals/corr/GoyalDGNWKTJH17] employ a linear learning rate scaling rule with a warm-up scheme to train ResNet-50 for image classification with batch sizes up to 8K without reducing accuracy. The Layer-wise Adaptive Rate Scaling (LARS) optimization algorithm overcomes the optimization difficulties of batch sizes beyond 8K and scales ResNet-50 training to a 32K mini-batch; the mini-batch size is further increased to 64K using mixed-precision training [you2017large, jia2018highly]. Square-root learning rate scaling with longer training time has also been proposed for large mini-batch training [hoffer2017train]. Despite these successes in large mini-batch training, network training with a dynamically changing mini-batch size is rarely studied.

4.2 Stochastic Gradient Descent

Asynchronous stochastic gradient descent (asyncSGD) [NIPS2012_4687] assumes machines are loosely coupled and thus suits dynamic machine environments. Previous studies demonstrate that it is difficult for asyncSGD to match the model accuracy of synchronized SGD with a similar computation cost [NIPS2012_4687, chen2016revisiting]. Therefore we use synchronized SGD in this work instead of asyncSGD to achieve better model accuracy.

Our work also benefits from pioneering studies on the learning rate and mini-batch size. McCandlish et al. [DBLP:journals/corr/abs-1812-06162] analyze the largest useful mini-batch size based on the gradient noise scale. Prior work proposes increasing the mini-batch size instead of decaying the learning rate during training, resulting in less than 1% loss in accuracy on ImageNet when scaling the learning rate up to 3 times [smith2017don, devarakonda2017adabatch]. Jastrzębski et al. [jastrzkebski2017three] show theoretically that learning rate schedules can be replaced with batch size schedules. Although changing mini-batch sizes are used in these works, they require reserved computation resources on a fixed schedule rather than dynamically planned resources.

5 Experimental Results

In this section, we conduct a comprehensive benchmark of the proposed Dynamic SGD and baseline approaches on image classification, object detection and semantic segmentation. For image classification, we compare Dynamic SGD with the static baseline and linear scaling on the state-of-the-art network architectures ResNet-50 [he2016deep], MobileNet [howard2017mobilenets] and InceptionV3 [DBLP:journals/corr/SzegedyVISW15] on ImageNet [deng2009imagenet]. We then go beyond image classification and evaluate the proposed method for object detection using the Single Shot multi-box Detector (SSD) [liu2016ssd] on the MS-COCO dataset [lin2014microsoft], and for semantic segmentation using Fully Convolutional Networks (FCN) [long2015fully] on ADE20K [zhou2017scene].

Baseline Approaches.

In this experiment, we mainly compare the proposed Dynamic SGD with the following baselines:

  • Static Baseline: no elasticity. The number of workers is fixed during the training.

  • Fixed Mini-batch Size: fix the mini-batch size and (re)distribute the workload evenly to available workers.

  • Linear Scaling: linearly scale up the mini-batch size and learning rate based on the number of available workers.

5.1 Image Classification

We first briefly describe the implementation details of the baseline networks and the elastic training simulation. Then we compare the Dynamic SGD method with the baseline approaches under a suddenly increased number of workers using ResNet-50. Finally, we conduct a comprehensive study on state-of-the-art image classification models using a randomly changing number of GPUs.

Implementation Detail.

We adopt ResNet-50 [he2016deep], MobileNet 1.0 [howard2017mobilenets] and Inception V3 [DBLP:journals/corr/SzegedyVISW15] as the baseline models and evaluate the performance on the ImageNet-2012 [deng2009imagenet] dataset. The models are implemented in GluonCV (https://github.com/dmlc/gluon-cv) with MXNet [chen2015mxnet]. Each network is trained for 90 epochs using cosine learning rate decay [DBLP:journals/corr/LoshchilovH16a]. The learning rate is warmed up for 5 epochs [goyal2017accurate]. We use the stochastic gradient descent (SGD) optimizer with momentum 0.9 and weight decay 0.0001. We use 8 GPUs with a per-GPU mini-batch size of 128 as the baseline, and set the base learning rate to 0.4. We linearly scale up the learning rate when increasing the mini-batch size. The input size is 224 by 224 for ResNet-50 and MobileNet 1.0, and 299 by 299 for Inception V3. We do not apply weight decay to biases or to the γ and β parameters in batch normalization layers [jia2018highly, xie2018bag], and discuss its effect in the appendix. To study convergence when the mini-batch size changes, we simulate the gradient of a large mini-batch by accumulating gradients from multiple small mini-batches before applying the parameter update. For throughput analysis, we run the training job on multiple physical machines.
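To mirror the simulation described above, gradients from several small mini-batches can be summed before a single parameter update; the following framework-agnostic sketch uses a placeholder loss and data that we assume for illustration, rather than the paper's GluonCV/MXNet training script.

import numpy as np

def simulate_large_batch_step(w, grad_fn, small_batches, lr):
    # Accumulate gradients over several small mini-batches and apply one update,
    # so the optimizer behaves as if it had processed a single large mini-batch.
    accum = np.zeros_like(w)
    total = 0
    for batch in small_batches:
        accum += grad_fn(w, batch) * len(batch)   # undo per-batch averaging before accumulating
        total += len(batch)
    return w - lr * (accum / total)               # one update on the large-batch average gradient

# Toy usage: 12 small batches of 32 samples emulate one mini-batch of 384 (illustrative loss).
rng = np.random.default_rng(0)
toy_grad = lambda w, b: np.mean(2.0 * (w - b), axis=0)
w = rng.normal(size=4)
small_batches = [rng.normal(size=(32, 4)) for _ in range(12)]
w = simulate_large_batch_step(w, toy_grad, small_batches, lr=0.4)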

Increasing vs. Decreasing Mini-batch Size Influence.

(a) Increase the mini-batch size to 12 times at epoch 20.
(b) Decrease the mini-batch size to 1/12 at epoch 20.
Figure 4: Top-1 accuracy on the ImageNet validation set using ResNet-50, showing the influence of a sudden change in mini-batch size under the linear scaling approach. We use 1K as the base mini-batch size. We find that network training is relatively sensitive to an increase in mini-batch size, but robust to a decrease.

First we study the influence of a sudden change in the number of GPUs at epoch 20. We train the network with two configurations: 1) training starts on 8 GPUs and the number of GPUs increases to 96 at epoch 20, and 2) training starts on 96 GPUs and the number of GPUs decreases to 8 at epoch 20. In both configurations we scale the learning rate up or down linearly with the number of GPUs. The results are shown in Figure 4: increasing the number of GPUs to 96 produces a sharp drop in the training curve, indicating that the training process is drastically disrupted at epoch 20, whereas decreasing the number of GPUs has no visible effect.

Baseline Comparisons.

We then focus on the scenario with an increasing number of GPUs. The number of GPUs is increased from 8 to 96 at epoch 20 and at epoch 70. Besides linear scaling, we test 1) fixing the mini-batch size when increasing GPUs, i.e. reducing the per-GPU batch size, and 2) incorporating learning rate warm-up after the change. The results are in Table LABEL:tab:spike_12.
