Dynamic Mini-batch SGD
for Elastic Distributed Training:
Learning in the Limbo of Resources
With an increasing demand for training power for deep learning algorithms and the rapid growth of computation resources in data centers, it is desirable to dynamically schedule different distributed deep learning tasks to maximize resource utilization and reduce cost. In this process, different tasks may receive varying numbers of machines at different times, a setting we call elastic distributed training. Despite the recent successes in large mini-batch distributed training, these methods are rarely tested in elastic distributed training environments, and in our experiments they suffer degraded performance when the learning rate is adjusted linearly and immediately with the batch size. One difficulty we observe is that the noise in the stochastic momentum estimation accumulates over time and has delayed effects when the batch size changes. We therefore propose to smoothly adjust the learning rate over time to alleviate the influence of the noisy momentum estimation. Our experiments on image classification, object detection and semantic segmentation demonstrate that the proposed Dynamic SGD method achieves stable performance when varying the number of GPUs from 8 to 128. We also provide a theoretical understanding of the optimality of linear learning rate scheduling and the effects of stochastic momentum.
Deep learning has become the de facto standard approach in computer vision. Deeper and larger models keep boosting the state-of-the-art performance in image classification [he2016deep, szegedy2015going], object detection [ren2015faster, liu2016ssd] and semantic segmentation [long2015fully, chen2018deeplab]. In addition, computation-intensive tasks such as video classification [wang2018non, feichtenhofer2018slowfast], segmentation [li2018low, voigtlaender2019feelvos] and visual question answering [tapaswi2016movieqa, lei2018tvqa] attract growing interest from the community as more computation resources become available. Despite the rapid growth of computation power in the data centers of research labs and leading IT companies, the demand for training deep learning models can barely be satisfied. To train a job using multiple machines, we may need to wait a long time for the job to start due to limited resources. To finish high-priority training jobs in time, we may reserve resources in advance or stop other low-priority jobs.
Allowing the resources for a training job to change dynamically can greatly reduce the waiting time and improve resource utilization. We could start a job earlier, when only a portion of the requested resources is ready. We could also preempt resources from running low-priority training jobs for high-priority tasks without stopping the low-priority jobs. In addition, cloud providers encourage users to use preemptible resources at significantly lower prices, such as spot instances in AWS [ec2spot], low-priority virtual machines in Azure [azure] and preemptible virtual machines on Google Cloud [google]. We refer to this dynamic machine environment, where machines can be added to or removed from a task, as an elastic distributed training environment.
Mini-batch stochastic gradient descent (SGD) with momentum (e.g. heavy-ball momentum, Nesterov momentum, Adam [nesterov1983method, kingma2014adam]) is a widely used optimization method to train deep learning models in computer vision. Previous work focuses on asynchronous SGD (asyncSGD) for environments where machines are loosely coupled [NIPS2012_4687]. AsyncSGD, however, converges radically differently from synchronous mini-batch SGD [zhang2015staleness, reddi2015variance, zheng2017asynchronous]. It is also difficult for asyncSGD to match the model accuracy trained with synchronized mini-batch SGD with similar computation budget [NIPS2012_4687, chen2016revisiting].
In this work, we focus on extending synchronized mini-batch SGD with momentum into the dynamic training environment. We aim to minimize the convergence difference when the training environment is constantly changing. There are two straightforward approaches to extend synchronized SGD with momentum. One approach is to fix the mini-batch size but reassign mini-batches across machines when the number of machines changes. With the mini-batch size fixed, the number of machines has little impact on the optimization method, but adding many machines reduces the computational efficiency, since the per-machine mini-batch size shrinks in inverse proportion to the number of machines. The other approach fixes the mini-batch size per machine, and updates the total mini-batch size linearly with the number of machines, while linearly changing the learning rate at the same time [goyal2017accurate, devarakonda2017adabatch, smith2017don]. Empirically, we find this approach often leads to degraded model accuracy when the number of machines changes dramatically. Based on our theoretical analysis, the noise in the momentum state scales inversely with the mini-batch size. Linearly scaling the learning rate when increasing the mini-batch size fails to account for this difference in noise scale, which makes the optimization difficult.
In this paper, we propose a new optimization strategy called Dynamic Mini-batch Stochastic Gradient Descent (Dynamic SGD) that smoothly adjusts the learning rate to stabilize the variance changes. We design multiple dynamic training environment settings by varying the number of GPUs between 8 and 128. We evaluate our proposed method on image classification (ResNet-50 [he2016deep], MobileNet [howard2017mobilenets] and InceptionV3 [DBLP:journals/corr/SzegedyVISW15] on ImageNet [deng2009imagenet]), object detection (SSD [liu2016ssd] on MS-COCO [lin2014microsoft]), and semantic segmentation (FCN [long2015fully] on ADE20K [zhou2017scene]). The experimental results demonstrate that the convergence of Dynamic SGD in dynamic environments consistently matches that of training in static environments.
The contributions of this paper include:
Identifying the key challenge to extend synchronized SGD with momentum into dynamic training environments.
Proposing a new method which we call Dynamic SGD to smoothly adjust the learning rate when the number of machines changes to stabilize the training.
Extensive empirical studies to benchmark the proposed Dynamic SGD method with straightforward approaches on large-scale image classification, object detection and semantic segmentation tasks.
2.1 Mini-batch SGD with Momentum
We first review mini-batch stochastic gradient descent (SGD) and SGD with momentum update [robbins1951stochastic]. Given a network parameterized by vector $w$ and a labeled dataset $X$, the loss to minimize can be written in the following form [goyal2017accurate]:
$$L(w) = \frac{1}{|X|} \sum_{x \in X} l(x, w), \tag{1}$$
where $l(x, w)$ is the loss for the sample $x$ with parameters $w$. SGD iteratively updates the parameters with the gradient estimated within each mini-batch $B$ to optimize the loss:
$$w_{t+1} = w_t - \eta \, \frac{1}{n} \sum_{x \in B} \nabla l(x, w_t), \tag{2}$$
where $w_t$ is the network parameter at iteration $t$, $\eta$ is the learning rate, and $n = |B|$ is the number of samples randomly drawn from dataset $X$ in a mini-batch.
In practice, SGD with momentum helps accelerate the optimization. The momentum state keeps the exponentially weighted past gradient estimates and updates the parameters with the following rule:
$$u_{t+1} = m \, u_t + \frac{1}{n} \sum_{x \in B} \nabla l(x, w_t), \qquad w_{t+1} = w_t - \eta \, u_{t+1}, \tag{3}$$
where $u_t$ is the momentum state of historical gradients at iteration $t$, and $m$ is the decay ratio for the momentum state. The parameter update depends on both the gradient estimated within the current mini-batch and the exponentially weighted gradients from historical mini-batches.
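As a minimal sketch of the momentum update described above, the following Python function performs one step on a parameter vector stored as a plain list. It assumes the heavy-ball form in which the momentum state accumulates the averaged mini-batch gradient and the parameters move by the learning rate times that state; the function name `momentum_sgd_step` and the toy gradients are illustrative assumptions, not code from the paper.

```python
from typing import List, Sequence, Tuple


def momentum_sgd_step(
    w: Sequence[float],
    u: Sequence[float],
    per_sample_grads: Sequence[Sequence[float]],
    lr: float,
    m: float,
) -> Tuple[List[float], List[float]]:
    """One heavy-ball momentum SGD step (illustrative sketch).

    w: current parameters, u: current momentum state,
    per_sample_grads: one gradient vector per sample in the mini-batch,
    lr: learning rate, m: momentum decay ratio.
    """
    n = len(per_sample_grads)
    # Mini-batch gradient estimate: average of the per-sample gradients.
    g = [sum(sample[i] for sample in per_sample_grads) / n for i in range(len(w))]
    # Momentum state: decayed history plus the current estimate.
    u_next = [m * ui + gi for ui, gi in zip(u, g)]
    # Parameter update uses the momentum state, not the raw gradient.
    w_next = [wi - lr * ui for wi, ui in zip(w, u_next)]
    return w_next, u_next


# Toy loss 0.5 * (w - x)^2 with samples x = 1 and x = 3 at w = 0,
# so the per-sample gradients (w - x) are -1 and -3.
w1, u1 = momentum_sgd_step([0.0], [0.0], [[-1.0], [-3.0]], lr=0.1, m=0.9)
```

Because the momentum state carries gradient estimates from earlier mini-batches, any noise in those estimates stays in the update for many iterations, which is the property the later sections rely on.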
2.2 Data Parallel Distributed Training
In distributed training, multiple workers work together to finish a training job. A worker is a computational unit, such as a CPU or a GPU. Data parallelism defines how training workloads are partitioned across workers.
Assume we are using mini-batch SGD on several workers, each holding a copy of the model parameters. In data parallelism, we partition each mini-batch evenly into one part per worker, so each worker receives an equal share of the examples and computes its local gradients on that share. All local gradients are averaged over the workers through synchronous communication to obtain the global gradient for this mini-batch. This global gradient is then used to update the model parameters by the SGD update rule. Algorithm 1 illustrates how a single batch is computed.
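The data-parallel step described above can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions: the toy per-sample loss 0.5 * (w - x)^2, the helper names `local_gradient` and `data_parallel_gradient`, and the plain average that stands in for the synchronous communication step are all illustrative and are not the paper's Algorithm 1.

```python
from typing import List, Sequence


def local_gradient(shard: Sequence[float], w: float) -> float:
    """Average gradient of the toy loss 0.5 * (w - x)^2 over one worker's shard."""
    return sum(w - x for x in shard) / len(shard)


def data_parallel_gradient(batch: Sequence[float], w: float, num_workers: int) -> float:
    """Split a mini-batch evenly across workers and average their local gradients."""
    if len(batch) % num_workers != 0:
        raise ValueError("mini-batch size must be divisible by the number of workers")
    per_worker = len(batch) // num_workers
    shards: List[Sequence[float]] = [
        batch[i * per_worker:(i + 1) * per_worker] for i in range(num_workers)
    ]
    # Each worker computes a gradient on its own shard.
    local = [local_gradient(shard, w) for shard in shards]
    # Averaging the local gradients stands in for the synchronous all-reduce.
    return sum(local) / num_workers


batch = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
global_grad = data_parallel_gradient(batch, w=0.0, num_workers=4)
single_worker_grad = local_gradient(batch, w=0.0)
```

With equal shard sizes, the average of the local gradients equals the gradient a single worker would compute on the whole mini-batch, so the number of workers does not change the update as long as the total mini-batch is fixed.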
In this section, we first describe the setup of the elastic distributed training environment, then study the optimization instability of momentum SGD in such an environment, and finally introduce Dynamic Mini-batch SGD to stabilize the training.
3.1 Elastic Distributed Training System
In a dynamic scheduling deep learning system, the computation resources are managed, planned and distributed dynamically for different training tasks based on their priorities and system availability. We refer to the distributed training environment with dynamic computation resources as Elastic Distributed Training.
An overview diagram of elastic distributed training for an example task is shown in Figure 2 (note that the system typically has more than one training task). The scheduler maintains the training state and coordinates the resources. It tracks the current and future number of workers by monitoring heartbeats from existing workers and availability notices. Each worker computes the gradients for its assigned data. The parameter server for the training task maintains the primary copy of the model parameters, updates the parameters based on the aggregated gradients from workers, and sends the updated parameters back to the workers. In practice, the parameter server can partition the parameters among multiple machines to increase throughput.
3.2 Dynamic Mini-batch SGD
Learning Rate Scaling.
Prior work linearly scales the learning rate according to the mini-batch size, which achieves success in large mini-batch training [goyal2017accurate]. Considering the SGD without momentum update in Equation 2, the parameter update for $k$ iterations using a mini-batch size of $n$ can be approximated by updating once with the linearly scaled learning rate $k\eta$ (where $\eta$ is the original learning rate) on the combined mini-batches with the size of $kn$, if we could assume $\nabla l(x, w_t) \approx \nabla l(x, w_{t+j})$ for $j < k$. We discuss this in further detail in the Appendix. We refer to this strategy of linearly scaling up the learning rate as linear scaling. Despite its success in large mini-batch training, linear scaling fails to apply to an elastic training environment, because the effect of momentum is not compensated. To our knowledge, the effect of momentum under a changing mini-batch size has not been studied in previous work.
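The approximation behind linear scaling can be checked numerically on a toy problem. The sketch below compares several small SGD steps against one step on the combined mini-batch with a proportionally larger learning rate. The quadratic per-sample loss 0.5 * (w - x)^2, the sample values, and the helper names are assumptions chosen only for illustration.

```python
from typing import List, Sequence


def mean_gradient(batch: Sequence[float], w: float) -> float:
    """Gradient of the toy loss 0.5 * (w - x)^2, averaged over the batch."""
    return sum(w - x for x in batch) / len(batch)


def k_small_steps(w: float, batches: Sequence[Sequence[float]], lr: float) -> float:
    """Plain SGD: one step per small mini-batch, gradient re-evaluated each time."""
    for batch in batches:
        w = w - lr * mean_gradient(batch, w)
    return w


def one_large_step(w: float, batches: Sequence[Sequence[float]], lr: float) -> float:
    """Linear scaling: one step on the combined batch with learning rate k * lr."""
    k = len(batches)
    combined: List[float] = [x for batch in batches for x in batch]
    return w - (k * lr) * mean_gradient(combined, w)


batches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
w_small = k_small_steps(0.0, batches, lr=0.01)
w_large = one_large_step(0.0, batches, lr=0.01)
gap = abs(w_small - w_large)
```

The two results are close but not identical: the gap comes entirely from the parameters moving between the small steps, which is exactly the assumption that the gradients stay approximately unchanged over those iterations.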
SGD Momentum with Changing Mini-batch Size.
To clearly see the weight update in momentum SGD as in Equation 3, a common way to rewrite the equation is to substitute $v_t$ for $\eta u_t$ (the momentum state $u_t$ multiplied by the learning rate $\eta$) to absorb the learning rate [goyal2017accurate]:
$$v_{t+1} = m \, v_t + \eta \, \frac{1}{n} \sum_{x \in B} \nabla l(x, w_t), \qquad w_{t+1} = w_t - v_{t+1}, \tag{4}$$
which is identical to Equation 3 for a static learning rate $\eta$ and mini-batch size $n$. For elastic training at iteration $t$, the mini-batch size is changed from $n$ to $kn$ ($k$ is the changing ratio, which can be fractional). Applying linear scaling of the learning rate directly to momentum SGD, we get the parameter update:
$$v_{t+1} = m \, k \, v_t + k\eta \, \frac{1}{kn} \sum_{x \in B'} \nabla l(x, w_t), \qquad w_{t+1} = w_t - v_{t+1}, \tag{5}$$
where $B'$ is the new mini-batch of size $kn$ and $v_t$ is the momentum state estimated on the previous mini-batch size $n$, which is scaled to $k$ times for momentum correction to maintain the equivalence with Equation 3 [goyal2017accurate]. However, this compensation only considers the gradient scale (the scale of the momentum state is proportional to the gradient scale) instead of the noise scale in the momentum state.
Noise Scale in the Gradient and Momentum State.
SGD obtains an estimated gradient within a mini-batch $B$ of size $n$, $g_B = \frac{1}{n} \sum_{x \in B} \nabla l(x, w)$, to approximate the full gradient $\nabla L(w)$ on the entire dataset $X$. The gradient of each sample is a random variable whose expected value is $\nabla L(w)$. The variance of the estimated gradient within a mini-batch scales inversely with the mini-batch size [DBLP:journals/corr/abs-1812-06162], assuming each $x$ is sampled independently from the dataset. The gradient in a mini-batch gives a noisy estimate of the full gradient, and a larger mini-batch size provides a less noisy estimate.
The variance of the momentum state is proportional to the variance of the mini-batch gradient, i.e. $\mathrm{Var}[v_t] \propto \mathrm{Var}[g_B] \propto 1/n$. Therefore, directly scaling up the momentum state estimated on mini-batch size $n$ by $k$ for “momentum correction” as in Equation 5 increases its noise scale quadratically, since $\mathrm{Var}[k v_t] = k^2 \, \mathrm{Var}[v_t]$, which is well above the noise expected of a momentum state estimated on mini-batch size $kn$. We also observe unstable training under this setting in the experiment, and the training curves are shown in Figure 3.
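The two variance claims above can be illustrated with a small Monte Carlo sketch. It assumes per-sample gradients drawn independently from a standard normal distribution, which is a stand-in for real gradients and not a claim about any particular network; the helper name `sample_minibatch_means`, the batch sizes and the ratio of 4 are likewise illustrative.

```python
import random
import statistics
from typing import List


def sample_minibatch_means(n: int, trials: int = 4000, seed: int = 0) -> List[float]:
    """Draw `trials` mini-batches of n i.i.d. N(0, 1) 'gradients' and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]


k, n = 4, 8
means_n = sample_minibatch_means(n)               # estimates from the small mini-batch
means_kn = sample_minibatch_means(k * n, seed=1)  # estimates from the k-times larger one

var_n = statistics.pvariance(means_n)
var_kn = statistics.pvariance(means_kn)
# Multiplying an old, noisy estimate by k multiplies its variance by k^2.
var_scaled = statistics.pvariance([k * g for g in means_n])
```

In this toy setting the larger mini-batch yields roughly a `k`-times smaller variance, while rescaling the old estimate by `k` raises its variance by `k ** 2`, so a state carried over from the small mini-batch and then scaled up is far noisier than one built from the large mini-batch.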
To address the difficulty of adapting previous momentum state, we introduce momentum compensation factor , which gradually changes over iterations to allow smooth adaptation of the momentum state when increasing mini-batch size to times. The weight update is given by , where is:
is the iteration index when the mini-batch size is changed, and is the total compensation iteration number, which we find works well. (We only compensate for momentum adaptation when increasing the mini-batch size, because reducing the noise scale does not influence the training, as shown in Figure 4.) We refer to the mini-batch SGD adopting this momentum compensation strategy for elastic distributed training as Dynamic Mini-batch Stochastic Gradient Descent (Dynamic SGD); other potential momentum compensation methods are discussed in the Future Work section. The proposed Dynamic SGD stabilizes training as shown in Figure 3.
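To make the idea of a gradually changing compensation factor concrete, here is a minimal sketch that assumes a linear ramp from 1 to the batch-size ratio over a fixed number of iterations. The linear shape, the function names `compensation_factor` and `effective_lr`, and the example numbers are illustrative assumptions and may differ from the exact schedule defined in the paper.

```python
def compensation_factor(t: int, t0: int, total_iters: int, k: float) -> float:
    """Assumed linear ramp of the scaling factor from 1 to k.

    t: current iteration, t0: iteration at which the mini-batch size grew,
    total_iters: number of iterations over which to ramp, k: batch-size ratio.
    """
    if total_iters <= 0:
        raise ValueError("total_iters must be positive")
    if t <= t0:
        return 1.0
    # Fraction of the ramp completed, capped at 1 once the ramp is over.
    progress = min((t - t0) / total_iters, 1.0)
    return 1.0 + (k - 1.0) * progress


def effective_lr(base_lr: float, t: int, t0: int, total_iters: int, k: float) -> float:
    """Learning rate under the assumed ramp, versus base_lr * k for immediate scaling."""
    return base_lr * compensation_factor(t, t0, total_iters, k)


# Hypothetical example: the mini-batch grows 4x at iteration 10, ramp over 100 iterations.
halfway = compensation_factor(60, t0=10, total_iters=100, k=4.0)
finished = compensation_factor(500, t0=10, total_iters=100, k=4.0)
```

Under this assumption the update never jumps by the full ratio in a single iteration, which gives the momentum state time to be refreshed with lower-noise gradients from the larger mini-batch before the full scaled learning rate is reached.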
4 Related Work
4.1 Large Mini-batch Distributed Training
Prior work achieves great success in large mini-batch data parallel training for deep convolutional neural networks. Li [li2017scaling] shows distributed training with a mini-batch size of up to 5K without a loss in accuracy on ImageNet. Goyal et al. [DBLP:journals/corr/GoyalDGNWKTJH17] employ a linear learning rate scaling rule with a warm-up scheme to train ResNet-50 for image classification with mini-batch sizes up to 8K without reducing accuracy. The Layer-wise Adaptive Rate Scaling (LARS) optimization algorithm overcomes the optimization difficulties of training with mini-batch sizes beyond 8K, and scales ResNet-50 training up to a 32K mini-batch. The mini-batch size is further increased to 64K using mixed precision training [you2017large, jia2018highly]. Square root learning rate scaling with a longer training time has also been proposed for large mini-batch training [hoffer2017train]. Despite their success in large mini-batch training, network training using a dynamic mini-batch size is rarely studied.
4.2 Stochastic Gradient Descent
Asynchronous stochastic gradient descent (asyncSGD) [NIPS2012_4687] assumes machines are loosely coupled and thus suits this dynamic machine environment. Previous studies demonstrate that it is difficult for asyncSGD to match the model accuracy of synchronized SGD at a similar computation cost [NIPS2012_4687, chen2016revisiting]. Therefore we use synchronized SGD in this work instead of asyncSGD to achieve better model accuracy.
Our work also benefits from pioneering studies on learning rate and mini-batch size. McCandlish et al. [DBLP:journals/corr/abs-1812-06162] analyze the largest useful mini-batch size based on the gradient noise scale. Prior work proposes to increase the mini-batch size instead of decaying the learning rate during training, which results in less than 1% loss in accuracy on ImageNet when scaling the learning rate up to 3 times [smith2017don, devarakonda2017adabatch]. Jastrzębski et al. [jastrzkebski2017three] show from theoretical analysis that learning rate schedules can be replaced with batch size schedules. Although these works use changing mini-batch sizes, they require reserved computation resources, because the resource schedule is fixed rather than dynamically planned.
5 Experimental Results
In this section, we conduct a comprehensive benchmark of the proposed Dynamic SGD and baseline approaches on image classification, object detection and semantic segmentation. For image classification, we compare Dynamic SGD with the static baseline and linear scaling for the state-of-the-art network architectures ResNet-50 [he2016deep], MobileNet [howard2017mobilenets] and InceptionV3 [DBLP:journals/corr/SzegedyVISW15] on ImageNet [deng2009imagenet]. Then we go beyond the image classification task and evaluate the proposed method for object detection using the Single Shot multi-box Object Detector (SSD) [liu2016ssd] on the MS-COCO dataset [lin2014microsoft], and for semantic segmentation using Fully Convolutional Networks (FCN) [long2015fully] on ADE20K [zhou2017scene].
In this experiment, we mainly compare the proposed Dynamic SGD with the following baselines:
Static Baseline: no elasticity. The number of workers is fixed during the training.
Fixed Mini-batch Size: fix the mini-batch size and (re)distribute the workload evenly to available workers.
Linear Scaling: linearly scale up the mini-batch size and learning rate based on number of available workers.
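The difference between the Fixed Mini-batch Size and Linear Scaling baselines can be summarized as a small configuration function. The sketch below uses the baseline numbers reported in the image classification setup (8 GPUs, a per-GPU mini-batch of 128, and a base learning rate of 0.4); the `StepConfig` class and the function names are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepConfig:
    total_batch: int       # samples per parameter update across all workers
    per_worker_batch: int  # samples each worker processes per update
    lr: float              # learning rate used for the update


def fixed_minibatch(base_workers: int, base_per_worker: int,
                    base_lr: float, workers: int) -> StepConfig:
    """Keep the total mini-batch and learning rate; shrink the per-worker share."""
    total = base_workers * base_per_worker
    return StepConfig(total, total // workers, base_lr)


def linear_scaling(base_workers: int, base_per_worker: int,
                   base_lr: float, workers: int) -> StepConfig:
    """Keep the per-worker mini-batch; scale total batch and learning rate linearly."""
    total = workers * base_per_worker
    return StepConfig(total, base_per_worker, base_lr * workers / base_workers)


# Going from the 8-GPU baseline to 128 GPUs under each strategy.
fixed = fixed_minibatch(8, 128, 0.4, workers=128)
scaled = linear_scaling(8, 128, 0.4, workers=128)
```

At 128 workers the fixed strategy leaves each worker with only 8 samples per update, which hurts hardware efficiency, while linear scaling keeps 128 samples per worker but changes both the total mini-batch and the learning rate by a factor of 16 at once.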
5.1 Image Classification
We first briefly describe the implementation details of the baseline network and the elastic training simulation. Then we compare the Dynamic SGD method with the baseline approaches under a sudden increase in the number of workers using ResNet-50. Finally, we conduct a comprehensive study on state-of-the-art image classification models using a randomly changing number of GPUs.
We adopt ResNet-50 [he2016deep], MobileNet 1.0 [howard2017mobilenets] and Inception V3 [DBLP:journals/corr/SzegedyVISW15] as the baseline models and evaluate the performance on the ImageNet-2012 [deng2009imagenet] dataset. The models are implemented in GluonCV (https://github.com/dmlc/gluon-cv) with MXNet [chen2015mxnet]. Each network is trained for 90 epochs using cosine learning rate decay [DBLP:journals/corr/LoshchilovH16a]. The learning rate is warmed up for 5 epochs [goyal2017accurate]. We use the stochastic gradient descent (SGD) optimizer and set the momentum to 0.9 and the weight decay to 0.0001. We use 8 GPUs with a per-GPU mini-batch size of 128 as the baseline, and set the base learning rate to 0.4. We linearly scale up the learning rate when increasing the mini-batch size. The input size is 224 by 224 for ResNet-50 and MobileNet 1.0, and 299 by 299 for Inception V3. We do not apply weight decay to biases or to $\gamma$ and $\beta$ in batch normalization layers [jia2018highly, xie2018bag], and discuss its effect in the appendix. To study convergence when the mini-batch size changes, we simulate the gradient of a large mini-batch by accumulating gradients from multiple small mini-batches before applying the parameter update. For throughput analysis, we run the training job on multiple physical machines.
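The gradient accumulation used to simulate a large mini-batch can be sketched as follows. This is a toy version under stated assumptions: the scalar per-sample gradient function, the sample values, and the helper name `accumulated_gradient` are illustrative, and the actual experiments would accumulate tensor gradients inside the training framework.

```python
from typing import Callable, Sequence


def accumulated_gradient(
    small_batches: Sequence[Sequence[float]],
    w: float,
    grad_fn: Callable[[float, float], float],
) -> float:
    """Average gradient over several small mini-batches, all evaluated at the same w."""
    total = 0.0
    count = 0
    for batch in small_batches:
        # Accumulate sums rather than means so every sample has equal weight.
        total += sum(grad_fn(x, w) for x in batch)
        count += len(batch)
    # A single division at the end yields the large mini-batch gradient.
    return total / count


def toy_grad(x: float, w: float) -> float:
    """Gradient of the toy per-sample loss 0.5 * (w - x)^2 with respect to w."""
    return w - x


small_batches = [[1.0, 2.0], [3.0, 4.0]]
simulated = accumulated_gradient(small_batches, w=0.0, grad_fn=toy_grad)
direct = sum(toy_grad(x, 0.0) for x in [1.0, 2.0, 3.0, 4.0]) / 4
```

Because the parameters are held fixed while the small mini-batches are processed, the accumulated result matches the gradient of one large mini-batch, so convergence under a changed mini-batch size can be studied without the corresponding number of physical GPUs.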
Increasing vs. Decreasing Mini-batch Size Influence.
First we study the influence of a sudden change in the number of GPUs at epoch 20. We train the network with two configurations: 1) training starts with 8 GPUs and the number of GPUs is increased to 96 at epoch 20, and 2) training starts with 96 GPUs and the number of GPUs is decreased to 8 at epoch 20. In both configurations we scale the learning rate up or down linearly with the number of GPUs. The results are shown in Figure 4. When the number of GPUs is increased to 96, the training curve drops sharply, indicating that the training process is drastically disrupted at epoch 20. On the other hand, decreasing the number of GPUs has no visible effect.
We then focus on the scenario with increasing number of GPUs. The number of GPUs is increased from 8 to 96 at epoch 20 and epoch 70. Besides linear scaling, we test 1) fixing the mini-batch size when increasing GPUs, i.e. reduce the per-GPU batch size, and 2) incorporating the learning rate warm up after the change. The results are in Table LABEL:tab:spike_12.