Dynamic Minibatch SGD
for Elastic Distributed Training:
Learning in the Limbo of Resources
Abstract
With the increasing demand for computing power to train deep learning models and the rapid growth of computation resources in data centers, it is desirable to dynamically schedule different distributed deep learning tasks to maximize resource utilization and reduce cost. In this process, different tasks may receive varying numbers of machines at different times, a setting we call elastic distributed training. Despite the recent successes in large minibatch distributed training, these methods are rarely tested in elastic distributed training environments, and in our experiments they suffer degraded performance when the learning rate is adjusted linearly and immediately with the batch size. One difficulty we observe is that the noise in the stochastic momentum estimate accumulates over time and has delayed effects when the batch size changes. We therefore propose to adjust the learning rate smoothly over time to alleviate the influence of the noisy momentum estimate. Our experiments on image classification, object detection and semantic segmentation demonstrate that the proposed Dynamic SGD method achieves stable performance when varying the number of GPUs from 8 to 128. We also provide a theoretical understanding of the optimality of linear learning rate scheduling and the effects of stochastic momentum.
1 Introduction
Deep learning has become the de facto standard approach in computer vision. Deeper and larger models keep boosting the state-of-the-art performance in image classification [he2016deep, szegedy2015going], object detection [ren2015faster, liu2016ssd] and semantic segmentation [long2015fully, chen2018deeplab]. In addition, computation-intensive tasks such as video classification [wang2018non, feichtenhofer2018slowfast], segmentation [li2018low, voigtlaender2019feelvos] and visual question answering [tapaswi2016movieqa, lei2018tvqa] attract more interest from the community as more computation resources become available. Despite the rapid growth of computing power in the data centers of research labs and leading IT companies, the demand for training deep learning models can barely be satisfied. To train a job on multiple machines, we may need to wait a long time for the job to start due to limited resources. To finish high-priority training jobs in time, we may reserve resources in advance or stop other low-priority jobs.
Allowing the resources of a training job to change dynamically can greatly reduce the waiting time and improve resource utilization. We could start a job earlier, as soon as a portion of the requested resources is ready. We could also preempt resources from running low-priority training jobs for high-priority tasks without stopping the low-priority jobs. In addition, cloud providers encourage users to use preemptible resources at significantly lower prices, such as spot instances on AWS [ec2spot], low-priority virtual machines on Azure [azure] and preemptible virtual machines on Google Cloud [google]. We refer to such a dynamic machine environment, where machines can be added to or removed from a task, as an elastic distributed training environment.
Minibatch stochastic gradient descent (SGD) with momentum (e.g. heavy-ball momentum, Nesterov momentum, Adam [nesterov1983method, kingma2014adam]) is a widely used optimization method for training deep learning models in computer vision. Previous work focuses on asynchronous SGD (asyncSGD) for environments where machines are loosely coupled [NIPS2012_4687]. AsyncSGD, however, converges radically differently from synchronous minibatch SGD [zhang2015staleness, reddi2015variance, zheng2017asynchronous]. It is also difficult for asyncSGD to match the accuracy of a model trained with synchronized minibatch SGD under a similar computation budget [NIPS2012_4687, chen2016revisiting].
In this work, we focus on extending synchronized minibatch SGD with momentum to the dynamic training environment. We aim to minimize the difference in convergence when the training environment is constantly changing. There are two straightforward ways to extend synchronized SGD with momentum. One approach fixes the total minibatch size and reassigns minibatches across machines when the number of machines changes. With the minibatch size fixed, the number of machines has little impact on the optimization, but adding many machines reduces computational efficiency, since the per-machine minibatch size shrinks in inverse proportion to the number of machines. The other approach fixes the minibatch size per machine, so that the total minibatch size grows linearly with the number of machines, and changes the learning rate linearly at the same time [goyal2017accurate, devarakonda2017adabatch, smith2017don]. Empirically, we find this approach often leads to degraded model accuracy when the number of machines changes dramatically. Based on our theoretical analysis, the noise in the momentum state scales inversely with the minibatch size. Linearly scaling the learning rate when increasing the minibatch size fails to account for this difference in noise scale, which leads to the optimization difficulty.
In this paper, we propose a new optimization strategy called Dynamic Minibatch Stochastic Gradient Descent (Dynamic SGD) that smoothly adjusts the learning rate to stabilize the variance changes. We design multiple dynamic training environment settings by varying the number of GPUs between 8 and 128. We evaluate our proposed method on image classification (ResNet-50 [he2016deep], MobileNet [howard2017mobilenets] and Inception V3 [DBLP:journals/corr/SzegedyVISW15] on ImageNet [deng2009imagenet]), object detection (SSD [liu2016ssd] on MS-COCO [lin2014microsoft]), and semantic segmentation (FCN [long2015fully] on ADE20K [zhou2017scene]). The experimental results demonstrate that the convergence of Dynamic SGD in dynamic environments consistently matches the convergence of training in static environments.
The contributions of this paper include:

Identifying the key challenge in extending synchronized SGD with momentum to dynamic training environments.

Proposing a new method which we call Dynamic SGD to smoothly adjust the learning rate when the number of machines changes to stabilize the training.

Extensive empirical studies that benchmark the proposed Dynamic SGD method against straightforward approaches on large-scale image classification, object detection and semantic segmentation tasks.
2 Background
2.1 Minibatch SGD with Momentum
We first review minibatch stochastic gradient descent (SGD) and SGD with momentum [robbins1951stochastic]. Given a network parameterized by a vector $w$ and a labeled dataset $X$, the loss to minimize can be written in the following form [goyal2017accurate]:
$$L(w) = \frac{1}{|X|} \sum_{x \in X} l(x, w) \tag{1}$$
where $l(x, w)$ is the loss for the sample $x \in X$ with parameters $w$. SGD iteratively updates the parameters with the gradient estimated within each minibatch to optimize the loss:
$$w_{t+1} = w_t - \eta \frac{1}{n} \sum_{x \in \mathcal{B}} \nabla l(x, w_t) \tag{2}$$
where $w_t$ is the network parameter at iteration $t$, $\eta$ is the learning rate, and $n = |\mathcal{B}|$ is the number of samples randomly drawn from dataset $X$ in a minibatch $\mathcal{B}$.
In practice, SGD with momentum helps accelerate the optimization. The momentum state keeps an exponentially weighted sum of past gradient estimates, and the parameters are updated with the following rule:
$$u_{t+1} = m u_t + \frac{1}{n} \sum_{x \in \mathcal{B}} \nabla l(x, w_t), \qquad w_{t+1} = w_t - \eta u_{t+1} \tag{3}$$
where $u_t$ is the momentum state of historical gradients at iteration $t$, and $m$ is the decay ratio for the momentum state. The parameter update depends on both the gradient estimated within the current minibatch and the exponentially weighted gradients from historical minibatches.
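As a minimal sketch of the momentum update in Equation 3, the following NumPy code runs minibatch SGD with momentum on a toy least-squares problem. The problem, the function names and the hyperparameter values are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def minibatch_grad(w, X, y, idx):
    """Mean gradient of the squared loss 0.5 * (x.w - y)^2 over the sampled rows."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def momentum_sgd(X, y, n=32, lr=0.05, m=0.9, steps=300, seed=0):
    """Minibatch SGD with a momentum state u, following the form of Equation 3."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    u = np.zeros_like(w)                                   # momentum state
    for _ in range(steps):
        idx = rng.choice(len(X), size=n, replace=False)    # draw a minibatch
        u = m * u + minibatch_grad(w, X, y, idx)           # decay and accumulate
        w = w - lr * u                                     # parameter update
    return w

# Toy data (assumption): noiseless linear targets.
rng = np.random.default_rng(1)
X = rng.normal(size=(1024, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true
w_hat = momentum_sgd(X, y)
```

The learning rate multiplies the momentum state only at the parameter update, which is why a change in the learning rate interacts with the gradients already stored in the state.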
2.2 Data Parallel Distributed Training
In distributed training, multiple workers work together to finish a training job. A worker is a computational unit, such as a CPU or a GPU. Data parallelism defines how the training workload is partitioned across workers.
Assume we are using minibatch SGD with a minibatch size of $n$ on $K$ workers, and each worker has a copy of the model parameters. In data parallelism, we partition a minibatch into $K$ parts, so each worker gets $n/K$ examples and computes its local gradient. All local gradients are averaged over the workers through synchronous communication to obtain the global gradient for this minibatch. This global gradient is then used to update the model parameters by the SGD update rule. Algorithm 1 illustrates how a single batch is computed.
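The data-parallel step described above can be sketched as follows. This is an illustrative single-process simulation, not the paper's Algorithm 1: the squared-loss gradient, the function names and the use of `np.mean` as a stand-in for synchronous gradient averaging are all assumptions.

```python
import numpy as np

def local_gradient(w, X_shard, y_shard):
    """Mean squared-loss gradient computed on one worker's shard."""
    return X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

def data_parallel_gradient(w, X_batch, y_batch, num_workers):
    """Split a minibatch evenly, compute local gradients, then average them."""
    X_shards = np.split(X_batch, num_workers)    # requires an even split
    y_shards = np.split(y_batch, num_workers)
    local = [local_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    return np.mean(local, axis=0)                # stands in for the synchronous average

rng = np.random.default_rng(0)
X_batch, y_batch = rng.normal(size=(64, 3)), rng.normal(size=64)
w = rng.normal(size=3)
g_parallel = data_parallel_gradient(w, X_batch, y_batch, num_workers=8)
g_single = local_gradient(w, X_batch, y_batch)   # same minibatch on one worker
```

With equally sized shards, the average of the local gradients equals the gradient of the whole minibatch, so the number of workers does not change the update as long as the total minibatch size is fixed.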
3 Methods
In this section, we first describe the setup of the elastic distributed training environment. We then study the optimization instability of momentum SGD in such an environment and introduce Dynamic Minibatch SGD to stabilize the training.
3.1 Elastic Distributed Training System
In a dynamically scheduled deep learning system, the computation resources are managed, planned and distributed dynamically among different training tasks based on their priorities and system availability. We refer to the distributed training environment with dynamic computation resources as Elastic Distributed Training.
An overview of elastic distributed training for an example task is shown in Figure 2 (note that the system typically runs more than one training task). The scheduler maintains the training state and coordinates the resources. It tracks the current and future number of workers by monitoring heartbeats from existing workers and availability notices. Each worker computes the gradients for its assigned data. The parameter server for the training task maintains the primary copy of the model parameters, updates the parameters based on the aggregated gradients from the workers, and sends the updated parameters back to the workers. In practice, the parameter server can partition the parameters among multiple machines to increase throughput.
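To make the scheduler's bookkeeping concrete, here is a hypothetical sketch of heartbeat-based worker tracking. The class name, the timeout value and the worker identifiers are assumptions made for illustration; the paper does not specify how its scheduler is implemented.

```python
from dataclasses import dataclass, field

@dataclass
class HeartbeatTracker:
    """Hypothetical tracker: a worker counts as active while its heartbeats are recent."""
    timeout: float = 10.0                        # seconds of silence before a worker is dropped
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, worker_id: str, now: float) -> None:
        """Record the time of the latest heartbeat from a worker."""
        self.last_seen[worker_id] = now

    def active_workers(self, now: float) -> list:
        """Return the sorted ids of workers heard from within the timeout."""
        return sorted(w for w, t in self.last_seen.items() if now - t <= self.timeout)

tracker = HeartbeatTracker(timeout=10.0)
tracker.heartbeat("gpu-0", now=0.0)
tracker.heartbeat("gpu-1", now=0.0)
workers_at_5 = tracker.active_workers(now=5.0)    # both workers are still fresh
tracker.heartbeat("gpu-0", now=8.0)               # gpu-1 goes silent
workers_at_15 = tracker.active_workers(now=15.0)  # only gpu-0 remains
```

The length of the active list is the worker count that an optimizer would use to set the minibatch size for the next iteration.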
3.2 Dynamic Minibatch SGD
Learning Rate Scaling.
Prior work linearly scales the learning rate with the minibatch size, which has been successful in large minibatch training [goyal2017accurate]. Consider SGD without momentum as in Equation 2: the parameter update over $k$ iterations with minibatch size $n$ and learning rate $\eta$ can be approximated by a single update with the linearly scaled learning rate $k\eta$ on the combined minibatch of size $kn$, if we can assume $\nabla l(x, w_t) \approx \nabla l(x, w_{t+j})$ for $j < k$. We discuss this in further detail in the Appendix. We refer to this strategy of scaling the learning rate linearly with the minibatch size as linear scaling. Despite its success in large minibatch training, linear scaling does not carry over directly to an elastic training environment, because the effect of momentum is not compensated. To our knowledge, the effect of momentum under a changing minibatch size has not been studied in previous work.
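The linear scaling argument can be checked numerically. The sketch below assumes, as the approximation requires, that per-sample gradients do not change over the small steps, so fixed random vectors stand in for them; the variable names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, lr = 4, 8, 0.1
# Per-sample gradients, assumed constant across the k small steps.
sample_grads = rng.normal(size=(k * n, 3))
w0 = np.zeros(3)

# k small steps: minibatch size n, learning rate lr.
w_small = w0.copy()
for j in range(k):
    batch = sample_grads[j * n:(j + 1) * n]
    w_small = w_small - lr * batch.mean(axis=0)

# One large step: minibatch size k * n, learning rate k * lr.
w_large = w0 - (k * lr) * sample_grads.mean(axis=0)
```

Under this assumption the two results coincide. In real training the gradients drift between steps, so the match is only approximate.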
SGD Momentum with Changing Minibatch Size.
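To see the weight update of momentum SGD in Equation 3 more clearly, a common way to rewrite it is to substitute $\eta u_t$ with $v_t$, which absorbs the learning rate [goyal2017accurate]:
$$v_{t+1} = m v_t + \eta \frac{1}{n} \sum_{x \in \mathcal{B}} \nabla l(x, w_t), \qquad w_{t+1} = w_t - v_{t+1} \tag{4}$$
which is identical to Equation 3 for a static learning rate $\eta$ and minibatch size $n$ (as before, $w_t$ denotes the parameters and $m$ the momentum decay ratio). For elastic training, suppose that at iteration $t$ the minibatch size changes from $n$ to $kn$ ($k$ is the changing ratio, which can be fractional). Applying linear scaling of the learning rate directly to momentum SGD, we get the parameter update:
$$v_{t+1} = m k v_t + k\eta \frac{1}{kn} \sum_{x \in \mathcal{B}_k} \nabla l(x, w_t), \qquad w_{t+1} = w_t - v_{t+1} \tag{5}$$
where $\mathcal{B}_k$ is the new minibatch of size $kn$, and $v_t$ is the momentum state estimated on the previous minibatch size $n$, which is scaled $k$ times for momentum correction to maintain the equivalence with Equation 3 [goyal2017accurate]. However, this compensation only considers the gradient scale (Footnote 1: the scale of the momentum state is a constant multiple of the gradient scale) and not the noise scale in the momentum state.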
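The role of momentum correction can be illustrated with a short numerical check. The sketch below treats a list of fixed vectors as minibatch gradients and changes the learning rate by a ratio of 4 halfway through; the function names and values are assumptions. It compares the separate-state form of Equation 3 with the absorbed form, with and without rescaling the state when the rate changes.

```python
import numpy as np

def run_separate_state(w, grads, lrs, m):
    """Equation 3 form: the learning rate is applied at the parameter update."""
    u = np.zeros_like(w)
    for g, lr in zip(grads, lrs):
        u = m * u + g
        w = w - lr * u
    return w

def run_absorbed_state(w, grads, lrs, m, correct):
    """Absorbed form: the learning rate lives inside the state v."""
    v = np.zeros_like(w)
    prev_lr = lrs[0]
    for g, lr in zip(grads, lrs):
        scale = lr / prev_lr if correct else 1.0   # momentum correction factor
        v = m * scale * v + lr * g
        w = w - v
        prev_lr = lr
    return w

rng = np.random.default_rng(0)
w0 = rng.normal(size=4)
grads = [rng.normal(size=4) for _ in range(10)]    # stand-in minibatch gradients
lrs = [0.1] * 5 + [0.4] * 5                        # rate scaled by a ratio of 4
w_ref = run_separate_state(w0, grads, lrs, m=0.9)
w_corrected = run_absorbed_state(w0, grads, lrs, m=0.9, correct=True)
w_uncorrected = run_absorbed_state(w0, grads, lrs, m=0.9, correct=False)
```

With the correction the two forms agree, which is the equivalence that the rescaling is meant to preserve. The check says nothing about noise, since the gradients here are fixed.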
Noise Scale in the Gradient and Momentum State.
SGD uses the gradient estimated within a minibatch of size $n$ to approximate the full gradient on the entire dataset. The gradient of each sample is a random variable whose expected value is the full gradient. The variance of the gradient estimated within a minibatch scales inversely with the minibatch size [DBLP:journals/corr/abs181206162] (Footnote 2: this assumes, among other conditions, that each sample is drawn independently from the dataset). The minibatch gradient therefore gives a noisy estimate of the full gradient, and a larger minibatch size provides a less noisy estimate.
The variance of the momentum state is proportional to the variance of the minibatch gradient. Therefore, directly scaling up a momentum state accumulated at the previous minibatch size for “momentum correction” as in Equation 5 amplifies its noise quadratically: multiplying the state by a ratio $k$ multiplies its variance by $k^2$, well above the noise level expected of a momentum state accumulated at the larger minibatch size. We also observe unstable training in this setting in our experiments; the training curves are shown in Figure 3.
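Both variance statements can be checked with a small simulation. The sketch below assumes scalar per-sample gradients drawn independently from a normal distribution, which is a simplification chosen for illustration rather than a model taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, trials, k = 2.0, 20000, 4

def minibatch_grad_variance(n):
    """Empirical variance of the mean of n independent per-sample gradients."""
    samples = rng.normal(loc=1.0, scale=sigma, size=(trials, n))
    return samples.mean(axis=1).var()

var_n = minibatch_grad_variance(8)         # close to sigma**2 / 8
var_kn = minibatch_grad_variance(8 * k)    # close to sigma**2 / (8 * k)

# Scaling a noisy state by k multiplies its variance by k**2.
state = rng.normal(size=trials)            # stand-in for noisy momentum states
var_ratio = (k * state).var() / state.var()
```

The first ratio, `var_n / var_kn`, comes out near `k`, while `var_ratio` equals `k**2`. A state scaled up by `k` is therefore noisier than a state accumulated at the larger minibatch size would be.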
Momentum Compensation.
To address the difficulty of adapting previous momentum state, we introduce momentum compensation factor , which gradually changes over iterations to allow smooth adaptation of the momentum state when increasing minibatch size to times. The weight update is given by , where is:
(6) 
 is the iteration index when the minibatch size is changed, and is the total compensation iteration number, which we find works well (Footnote 3: we only compensate for momentum adaptation when increasing the minibatch size, because reducing the noise scale does not influence the training, as shown in Figure 4). We refer to minibatch SGD with this momentum compensation strategy for elastic distributed training as Dynamic Minibatch Stochastic Gradient Descent (Dynamic SGD) (Footnote 4: we discuss other potential momentum compensation methods in the Future Work section). The proposed Dynamic SGD stabilizes training as shown in Figure 3.
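The idea of a gradual adjustment can be sketched as follows. The code does not reproduce Equation 6: it assumes, purely for illustration, a linear ramp of the learning-rate multiplier from 1 to the batch-size ratio over a fixed number of iterations. The function name, the ramp shape and all numeric values are hypothetical.

```python
def compensation_factor(t, t0, T, k):
    """Hypothetical linear ramp from 1 to k over T iterations starting at t0."""
    if t < t0:
        return 1.0
    progress = min((t - t0) / T, 1.0)        # fraction of the ramp completed
    return 1.0 + (k - 1.0) * progress

# Illustrative values: the minibatch grows 12 times (e.g. 8 to 96 GPUs) at iteration 100.
base_lr, k, t0, T = 0.4, 12.0, 100, 50
lrs = [base_lr * compensation_factor(t, t0, T, k) for t in range(200)]
```

Immediate linear scaling would jump from `base_lr` to `k * base_lr` in a single step. The ramp reaches the same value after `T` iterations, which gives the momentum state time to be refreshed by gradients from the larger minibatch.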
4 Related Work
4.1 Large Minibatch Distributed Training
Prior work achieves great success in large minibatch data parallel training for deep convolutional neural networks. Li [li2017scaling] shows distributed training with a minibatch size of up to 5K without a loss in accuracy on ImageNet. Goyal et al. [DBLP:journals/corr/GoyalDGNWKTJH17] employ a linear learning rate scaling rule with a warmup scheme to train ResNet-50 for image classification with a batch size of up to 8K without reducing accuracy. The Layer-wise Adaptive Rate Scaling (LARS) optimization algorithm overcomes the optimization difficulties of training with batch sizes beyond 8K, and scales ResNet-50 training up to a 32K minibatch. The minibatch size is further increased to 64K using mixed precision training [you2017large, jia2018highly]. Square root learning rate scaling and longer training time are also proposed for training with a large minibatch size [hoffer2017train]. Despite these successes in large minibatch training, network training with a dynamic minibatch size is rarely studied.
4.2 Stochastic Gradient Descent
Asynchronous stochastic gradient descent (asyncSGD) [NIPS2012_4687] assumes machines are loosely coupled and thus suits the dynamic machine environment. Previous studies demonstrate that it is difficult for asyncSGD to match the model accuracy of synchronized SGD at a similar computation cost [NIPS2012_4687, chen2016revisiting]. Therefore we use synchronized SGD instead of asyncSGD in this work to achieve better model accuracy.
Our work also benefits from pioneering studies on learning rate and minibatch size. McCandlish et al. [DBLP:journals/corr/abs181206162] analyze the largest useful minibatch size based on the gradient noise scale. Prior work proposes increasing the minibatch size instead of decaying the learning rate during training, which results in less than 1% loss in accuracy on ImageNet when scaling the learning rate up to 3 times [smith2017don, devarakonda2017adabatch]. Jastrzębski et al. [jastrzkebski2017three] show through theoretical analysis that learning rate schedules can be replaced with batch size schedules. Although these works use changing minibatch sizes, they follow a fixed resource schedule rather than a dynamically planned one, so computation resources must be reserved in advance.
5 Experimental Results
In this section, we conduct a comprehensive benchmark of the proposed Dynamic SGD and baseline approaches on image classification, object detection and semantic segmentation. For image classification, we compare Dynamic SGD with the static baseline and linear scaling for the state-of-the-art network architectures ResNet-50 [he2016deep], MobileNet [howard2017mobilenets] and Inception V3 [DBLP:journals/corr/SzegedyVISW15] on ImageNet [deng2009imagenet]. We then go beyond image classification and evaluate the proposed method for object detection using the Single Shot MultiBox Detector (SSD) [liu2016ssd] on the MS-COCO dataset [lin2014microsoft], and for semantic segmentation using Fully Convolutional Networks (FCN) [long2015fully] on ADE20K [zhou2017scene].
Baseline Approaches.
In this experiment, we mainly compare the proposed Dynamic SGD with the following baselines:

Static Baseline: no elasticity. The number of workers is fixed during the training.

Fixed Minibatch Size: fix the minibatch size and (re)distribute the workload evenly to available workers.

Linear Scaling: linearly scale up the minibatch size and the learning rate based on the number of available workers.
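The difference between the two elastic baselines can be sketched as a pair of configuration functions. The default values (8 baseline workers, a per-worker minibatch of 128 and a base learning rate of 0.4) follow the implementation details reported for image classification; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepConfig:
    total_batch: int
    per_worker_batch: float
    lr: float

def fixed_minibatch(workers, total_batch=1024, base_lr=0.4):
    """Keep the total minibatch and learning rate; shrink each worker's share."""
    return StepConfig(total_batch, total_batch / workers, base_lr)

def linear_scaling(workers, per_worker_batch=128, base_workers=8, base_lr=0.4):
    """Keep the per-worker minibatch; scale the total and the learning rate linearly."""
    return StepConfig(workers * per_worker_batch, per_worker_batch,
                      base_lr * workers / base_workers)

fixed_96 = fixed_minibatch(96)     # about 10.7 samples per worker
linear_96 = linear_scaling(96)     # total minibatch 12288, learning rate near 4.8
```

Going from 8 to 96 workers, the fixed minibatch baseline leaves the optimization untouched but gives each worker very little work, while linear scaling keeps the workers busy at the cost of a 12 times jump in the learning rate.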
5.1 Image Classification
We first briefly describe the implementation details of the baseline networks and the elastic training simulation. We then compare the Dynamic SGD method with the baseline approaches when the number of workers suddenly increases, using ResNet-50. Finally, we conduct a comprehensive study on state-of-the-art image classification models with a randomly changing number of GPUs.
Implementation Detail.
We adopt ResNet-50 [he2016deep], MobileNet 1.0 [howard2017mobilenets] and Inception V3 [DBLP:journals/corr/SzegedyVISW15] as the baseline models and evaluate their performance on the ImageNet-2012 [deng2009imagenet] dataset. The models are implemented in GluonCV (https://github.com/dmlc/gluoncv) with MXNet [chen2015mxnet]. Each network is trained for 90 epochs using cosine learning rate decay [DBLP:journals/corr/LoshchilovH16a]. The learning rate is warmed up for 5 epochs [goyal2017accurate]. We use the stochastic gradient descent (SGD) optimizer with momentum 0.9 and weight decay 0.0001. We use 8 GPUs with a per-GPU minibatch size of 128 as the baseline, and set the base learning rate to 0.4. We linearly scale up the learning rate when increasing the minibatch size. The input size is 224 by 224 for ResNet-50 and MobileNet 1.0, and 299 by 299 for Inception V3. We do not apply weight decay to biases or to $\gamma$ and $\beta$ in batch normalization layers [jia2018highly, xie2018bag], and discuss its effect in the appendix. To study convergence when the minibatch size changes, we simulate the gradient of a large minibatch by accumulating gradients from multiple small minibatches before applying the parameter update. For throughput analysis, we run the training job on multiple physical machines.
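The learning rate schedule described above can be sketched as a function of the epoch and the number of GPUs. The numeric defaults come from the text; the linear warmup from zero, the application of cosine decay over the epochs after warmup, and the function name are assumptions, since the text does not spell them out.

```python
import math

def learning_rate(epoch, num_gpus, base_lr=0.4, base_gpus=8,
                  warmup_epochs=5, total_epochs=90):
    """Warmup followed by cosine decay, with the peak scaled linearly in the GPU count."""
    peak = base_lr * num_gpus / base_gpus              # linear scaling rule
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs            # assumed linear warmup from zero
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

baseline_peak = learning_rate(5, num_gpus=8)     # peak for the 8-GPU baseline
scaled_peak = learning_rate(5, num_gpus=96)      # peak with 96 GPUs
final_lr = learning_rate(90, num_gpus=8)         # end of training
```

`epoch` may be fractional, so the same function can be evaluated once per iteration.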
Increasing vs. Decreasing Minibatch Size Influence.
First we study the influence of a sudden change in the number of GPUs at epoch 20. We train the network with two configurations: 1) training starts with 8 GPUs and the number of GPUs increases to 96 at epoch 20, and 2) training starts with 96 GPUs and the number of GPUs decreases to 8 at epoch 20. In both configurations we scale the learning rate up or down linearly with the number of GPUs. The results are shown in Figure 4. When the number of GPUs increases to 96, the training curve drops sharply, indicating that the training process is drastically disrupted at epoch 20. In contrast, decreasing the number of GPUs has no visible effect.
Baseline Comparisons.
We then focus on the scenario with an increasing number of GPUs. The number of GPUs is increased from 8 to 96 at epoch 20 and epoch 70. Besides linear scaling, we test 1) fixing the minibatch size when increasing the number of GPUs, i.e. reducing the per-GPU batch size, and 2) incorporating learning rate warmup after the change. The results are in Table LABEL:tab:spike_12.