DC-S3GD:
Delay-Compensated Stale-Synchronous SGD for Large-Scale Decentralized Neural Network Training
Abstract
Data parallelism has become the de facto standard for training Deep Neural Networks on multiple processing units. In this work we propose DC-S3GD, a decentralized (without Parameter Server) stale-synchronous version of the Delay-Compensated Asynchronous Stochastic Gradient Descent (DC-ASGD) algorithm. In our approach, we allow for the overlap of computation and communication, and compensate the inherent error with a first-order correction of the gradients. We prove the effectiveness of our approach by training Convolutional Neural Networks with large batches and achieving state-of-the-art results.
Learning on Supercomputers (DLS) © 2019 IEEE
Keywords: neural networks, machine learning, deep learning, artificial intelligence, high performance computing.
I. Introduction
Training Deep Neural Networks (DNNs) is a time- and resource-consuming problem. For example, training a DNN to state-of-the-art accuracy on a single processing unit takes on the order of days, or even weeks [MLPerf]. For this reason, in recent years, several algorithms have been developed to allow users to perform parallel or distributed training of DNNs [Gupta7837841]. With the correct use of parallelism, training times can be reduced to hours, or even minutes [MLPerf, goyal2017accurate, krizhevsky2014weird, you2017large]. The reader interested in a broad survey of Deep Learning algorithms is referred to [DBLP:journals/corr/abs180209941], which is also a great resource for the taxonomy and classification of different parallel training strategies.
The most widely adopted type of training parallelism, and the one we employ in this work, is data parallelism: the DNN is replicated on different processing units, each replica is trained on a subset of the training data set, and updates (usually in the form of gradients) are regularly aggregated to create a single update, which is then applied to all the DNN replicas. The way updates are aggregated differs across algorithms in terms of communication scheme, distribution of roles among processing units, and message frequency and content. We will discuss different approaches and architectures in Section II.
In Section III we describe our approach, which constitutes a modification of the DC-ASGD algorithm proposed in [DBLP:journals/corr/ZhengMWCYML16]. Our approach shows promising results for Convolutional Neural Networks (CNNs): in Section IV we report the results obtained when training different networks on the well-known ImageNet-1k data set, which has established itself as the standard benchmark for CNN performance assessment.
In Section V we propose possible extensions to the presented algorithm, and outline what advantages they could bring.
II. Related Work
With the growing availability of parallel systems, such as clusters and supercomputers, both as on-premises and cloud solutions, the demand for fast, reliable, and efficient parallel training schemes has been fueling research in the Artificial Intelligence community [DBLP:journals/corr/abs180209941, goyal2017accurate, krizhevsky2014weird, you2017large, ma2017accelerated]. The most widespread technique, data parallelism, can be applied to many different areas, such as image classification, Reinforcement Learning, or Natural Language Processing [openai2018empirical]. When data-parallel training has to be scaled to large systems, convergence problems and loss of generalization arise from the fact that the global batch size becomes very large [Li:2018:VLL:3327345.3327535, smith2017bayesian, openai2018empirical].
As suggested in [DBLP:journals/corr/abs180209941], data-parallel training methods can be classified according to two independent aspects: synchronicity (or model consistency across different processes) and communication topology (centralized or decentralized). Synchronous methods are those which ensure that, after each training iteration, each process (or worker) holds a copy of exactly the same weights; asynchronous methods allow workers to get out of date, receiving updated weights only when they request them (usually after having computed a local update). Centralized communication schemes imply the existence of so-called Parameter Servers, processes which have the task of collecting weight gradients from the workers and sending back updated weights; in decentralized schemes, each worker participates in collective communications to compute the weight updates, e.g. via MPI allreduce calls.
II-A Advantages and Disadvantages of Different Training Schemes
Historically, when the first major Deep Learning toolkits (such as, e.g., TensorFlow [tensorflow2015whitepaper] or MXNet [chen2015mxnet]) started offering the possibility of parallel training, they did so by implementing techniques with centralized communication, i.e. with Parameter Servers (PSs). Like every centralized communication scheme, the PS paradigm does not scale efficiently: with a growing number of workers, the PSs become bottlenecks, and communication becomes of the many-to-few type. Nevertheless, asynchronous methods often use this paradigm, as it allows workers to send updates independently, without waiting for other workers to complete processing their batches. The most straightforward algorithm for this setting is Asynchronous SGD (ASGD), which has been improved over the years in many respects [DBLP:journals/corr/WangGCLY17, keuper2015asynchronous, DBLP:journals/corr/ZhengMWCYML16], but its core mechanism can be summarized as follows:

at the beginning of the computation, every worker receives an exact copy of the weights from the PSs

every worker processes a minibatch and sends the computed gradients to the PSs, which apply them to their local copy of the weights, and send the updated weights to the worker which initiated the communication

the worker proceeds to process another batch, while the PSs wait for gradients from other workers
The problem (and the subject of the mentioned improvements) of this approach resides in the fact that, after the first weight update, the weights on the PSs and on the workers will differ (except for the worker which communicated with the PSs last). This in turn creates an inconsistency between the weights used to compute the gradient (on the worker's side) and the weights which will be updated with that gradient on the PSs. This problem is often referred to as gradient staleness. Clearly, the larger the difference between the weights, the less accurate the update will be. If we assume that all workers have approximately the same processing speed, we can deduce that the PSs will receive gradients which are on average out of date by $N$ steps, where $N$ is the number of workers. This clearly has a large negative impact on convergence when $N$ is large. We will focus on one particular attempt to limit this effect, namely the DC-ASGD algorithm. The method computes an approximate first-order correction to modify the gradients received by the PSs. But even though this approach mitigates the problem, it can only work when the distance between the PSs' and the worker's weights is relatively small.
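To make the staleness mechanism concrete, the PS-based ASGD loop above can be simulated in a single process on a toy quadratic loss whose exact gradient is simply the weight vector. The function and variable names, the toy objective, and all hyperparameter values are ours, for illustration only:

```python
import numpy as np

# Toy single-process simulation of PS-based Asynchronous SGD on f(w) = 0.5*||w||^2,
# whose exact gradient at w is w itself. At each step a random worker "finishes" its
# batch: the PS applies a gradient computed on that worker's *stale* weights, and
# only that worker receives the fresh weights back.
def simulate_asgd(num_workers=4, steps=100, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    ps_weights = np.ones(8)                       # weights held by the Parameter Server
    worker_weights = [ps_weights.copy() for _ in range(num_workers)]
    for _ in range(steps):
        k = int(rng.integers(num_workers))        # a random worker finishes its batch
        grad = worker_weights[k]                  # gradient computed on stale weights
        ps_weights = ps_weights - lr * grad       # PS applies the stale gradient
        worker_weights[k] = ps_weights.copy()     # only worker k gets fresh weights
    return ps_weights

final = simulate_asgd()
```

With a small learning rate the iteration still converges despite the average staleness of roughly `num_workers` steps; pushing the learning rate or the worker count up makes the delayed updates destabilize the toy problem, mirroring the convergence issues discussed above.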
In recent years, large-scale training has been achieved by using different flavors of the most classic synchronous scheme, that is, Synchronous SGD (SSGD), in conjunction with decentralized communication. Again, even though many variants exist, the core mechanism is easy to summarize:

at the beginning of the computation, every worker receives an exact copy of the weights

when a worker has finished processing its minibatch, it participates in a blocking allreduce operation, where it shares the gradient it computed with all other workers

at the end of the allreduce, all workers possess the sum of the computed gradients, and they can use it to compute the same weight update

every worker proceeds to process another batch
This scheme has been thoroughly explored, and has only one drawback, which resides in the blocking nature of the allreduce operation: all workers have to wait for the slowest one (sometimes referred to as the straggler) before initiating the communication, and then they have to wait for the end of the communication to compute the update.
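The synchronous step above can be sketched in a few lines. Here the blocking allreduce is simulated by a plain sum over per-worker gradients of a toy per-worker quadratic loss; the loss, the names, and the hyperparameter values are illustrative only:

```python
import numpy as np

# One Synchronous SGD step, simulated in a single process: each of N workers computes
# a local gradient on its own mini-batch (here, the gradient of 0.5*||w - c_k||^2),
# a simulated allreduce sums them, and every worker applies the identical update.
def sync_sgd_step(weights, centers, lr=0.1):
    local_grads = [weights - c for c in centers]      # per-worker gradients
    summed = np.sum(local_grads, axis=0)              # the allreduce result (a sum)
    update = -lr * summed / len(centers)              # same update on every worker
    return weights + update

w = np.zeros(4)
centers = [np.full(4, float(k)) for k in range(4)]    # per-worker targets 0, 1, 2, 3
for _ in range(200):
    w = sync_sgd_step(w, centers)                     # w approaches the mean target
```

Because every worker sees the same summed gradient, all replicas stay bit-identical after each step, which is exactly the model-consistency property that distinguishes SSGD from the asynchronous schemes.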
Decentralized communication can also be used for a particular form of asynchronous methods, known as stale-synchronous methods. In stale-synchronous methods, workers are allowed to go out of sync by a maximum number of iterations (processed mini-batches) before having to wait for the other workers to initiate communication. This maximum number of iterations is called the maximum staleness.
As we will see in the next section, our method is a stale-synchronous, decentralized version of DC-ASGD, and in this work we will only focus on the variant with a maximum staleness of one.
III. Algorithm
Our algorithm is similar to the DC-ASGD method proposed in [DBLP:journals/corr/ZhengMWCYML16], with three main differences:

it eliminates the need of a Parameter Server in favor of a decentralized communication scheme;

it is stalesynchronous, and not fully asynchronous;

weights computed by different workers are averaged.
In the following sections, we will explain why these differences result in a novel and improved approach, compared to existing algorithms.
III-A Problem Setting
We quickly review the problem of data-parallel training of a DNN. For this work, we will focus on DNNs trained as multidimensional classifiers, where the input is a sample, denoted by $s$. The goal of training is to find a set of network weights $w$ which minimizes a loss function

$$L(w) = \frac{1}{|S|} \sum_{s \in S} \ell(w, s) \quad (1)$$

for a set of samples $S$, where $\ell$ is the per-sample classification loss function (cross-entropy loss in our case). Instead of reporting the final value of the loss function, it is usual to derive a figure of merit, which has the benefit of being more understandable by humans and applicable across different loss functions. In our case, we will use the top-1 error rate, which is simply the ratio of misclassified samples to the number of elements of $S$. We will measure the error both on the training data set and on the validation data set.
We will employ a common version of the classic mini-batch Stochastic Gradient Descent, usually referred to simply as Stochastic Gradient Descent (SGD), which solves the above-mentioned minimization problem iteratively, following

$$w^{t+1} = w^t - \eta \, \frac{1}{|B|} \sum_{s \in B} \nabla \ell(w^t, s) \quad (2)$$

where $B$ is a mini-batch, i.e. a subset of the training data set, $\eta$ is the learning rate, and $|B|$ is the mini-batch size, which has been proven to be an important factor determining how easily a network can be trained. We will adopt a simple variant of the SGD algorithm, namely the so-called momentum SGD, in which a momentum term [Qian99onthe] damps the updates and allows for faster learning.
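A minimal sketch of the momentum SGD update just described; the function name and the hyperparameter values are ours, chosen for illustration:

```python
import numpy as np

# Momentum SGD: a velocity term accumulates a damped running sum of gradients,
# and the weights step along the accumulated direction.
def momentum_sgd_step(weights, velocity, grad, lr=0.1, mu=0.9):
    velocity = mu * velocity + grad          # damped accumulation of gradients
    weights = weights - lr * velocity        # step along the accumulated direction
    return weights, velocity

# Minimize the toy loss 0.5*w^2 (gradient: w) starting from w = 5.
w, v = np.array([5.0]), np.zeros(1)
for _ in range(300):
    grad = w.copy()
    w, v = momentum_sgd_step(w, v, grad)     # w spirals toward 0
```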
In the synchronous parallel version, SGD works exactly in the same way, with the only difference that each worker computes gradients locally on the minibatch it processes, and then shares them with other workers by means of an allreduce call.
III-B DC-ASGD
Since our algorithm is a variation of DC-ASGD, we will briefly outline its most important feature, that is, the delay compensation. As illustrated in Section II-A, gradient staleness reduces the convergence rate, because of the difference between the weights held by the worker and those held by the PSs. In DC-ASGD, the gradients are modified to take this difference into account. The idea is to apply a first-order correction to the gradients, so that they are approximately equal to those which would have been computed using the PSs' copy of the weights. If the Hessian matrix computed at $w$, here denoted by $H(w)$, were known, one could compute the corrected gradients as

$$g(w') \approx g(w) + H(w)(w' - w) + O(\|w' - w\|^2)\,\mathbf{1}_n \quad (3)$$

where $w$ are the weights used by the worker, $w'$ are those held by the PS, and $\mathbf{1}_n$ is a vector with all components equal to one, $n$ being the dimension of the weights. The quadratic error term comes directly from the Taylor expansion used to derive this result, and we will denote it as $E$ for the rest of this work. In principle, the Hessian matrix could be computed analytically, but the product of its approximation (known as the pseudo-Hessian) with a vector is much cheaper to compute, as

$$H(w)(w' - w) \approx g(w) \odot g(w) \odot (w' - w) \quad (4)$$

where $\odot$ represents the Hadamard (or component-wise) product. Thus, we can rewrite (3) as

$$g(w') \approx g(w) + g(w) \odot g(w) \odot (w' - w) + E \quad (5)$$

Removing the error term and adding a variance control parameter $\lambda$ as defined in [DBLP:journals/corr/ZhengMWCYML16], we obtain the final form of the equation as

$$\tilde{g} = g(w) + \lambda \, g(w) \odot g(w) \odot (w' - w) \quad (6)$$

which is the one we base our algorithm on.
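The delay compensation of Eq. (6) amounts to a couple of element-wise operations. A minimal sketch, in which the variable names and the value of $\lambda$ are our own illustrative choices:

```python
import numpy as np

# Delay-compensated gradient in the style of Eq. (6): the Hessian-vector product is
# approximated by the element-wise product g * g * (w_ps - w_worker), scaled by the
# variance-control parameter lam.
def compensate(grad, w_worker, w_ps, lam=0.04):
    return grad + lam * grad * grad * (w_ps - w_worker)

g = np.array([1.0, -2.0])            # gradient computed on the worker's stale weights
w_worker = np.array([0.0, 0.0])      # stale weights used to compute g
w_ps = np.array([0.5, 0.5])          # up-to-date weights held by the PS
g_tilde = compensate(g, w_worker, w_ps)
# g + 0.04 * g^2 * 0.5 = [1.0 + 0.02, -2.0 + 0.08] = [1.02, -1.92]
```

Note that the correction vanishes when worker and PS weights coincide, and grows with both the gradient magnitude and the weight discrepancy, as the Taylor argument suggests.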
III-C DC-S3GD
In our decentralized setting, there is no PS, but since we implement a stale-synchronous method, workers can be expected to be out of sync. In fact, the main idea of our approach is to allow communication and computation to run in parallel, thus diminishing communication's impact on the total training run time. To allow for this, we make use of the non-blocking allreduce function which is part of the MPI standard, i.e. MPI_Iallreduce.
We now describe our method, which is also illustrated in Algorithm 1. We stress the fact that all processing units act as identical workers, only fed with different data. The only hyperparameters we need to set are the learning rate $\eta$, the momentum $\mu$, and the variance control parameter $\lambda$.

At the beginning of the computation, each worker receives the same set of initial weights $\bar{w}^0$ and a different mini-batch, which it processes to obtain a set of gradients $g_k^0$, where the bar over $w$ stresses the fact that the same value is held by all workers, the subscript denotes the worker index, and the superscript denotes the iteration. We will drop the superscripts when possible, to keep the notation concise.

Based on $g_k$, the worker uses a function $u$ to compute the update to its local weights. We denote the update as $\Delta w_k = u(g_k)$, and all workers share their local update with the others by starting a non-blocking allreduce operation.
While the allreduce operation is progressing, the worker updates its local copy of the weights:

$$w_k^1 = \bar{w}^0 + \Delta w_k^0 \quad (7)$$

and proceeds to process the next mini-batch, in order to compute new gradients $g_k^1$. After having processed the mini-batch, all workers wait for the allreduce operation to complete. In our implementation, the completion is checked by means of a call to MPI_Wait. After completion, each worker possesses an identical copy of $\sum_j \Delta w_j$, that is, the sum of all workers' updates of the previous iteration.
At this point, we can compute the average of the weights held by each worker, as

$$\bar{w} = \frac{1}{N} \sum_{j=1}^{N} w_j \quad (8)$$

Notice that, in principle, there is no guarantee that the mean value of the weights is actually meaningful, but studies such as [DBLP:journals/corr/abs180305407] suggest that averaging different weights can lead to better minima. The Euclidean distance from the weights possessed by the worker to the average weights is

$$d_k = \| w_k - \bar{w} \|_2 \quad (9)$$

Knowing this distance, each worker could replace its own copy of the weights with the average ones, but this is actually not needed. More importantly, by using a modified version of (6), the local gradient can be corrected and used to compute a local update that can be applied to the average weights. The correction equation becomes

$$\tilde{g}_k = g_k + \lambda \, g_k \odot g_k \odot (\bar{w} - w_k) \quad (10)$$

and thus the new update can be computed as

$$\Delta w_k = u(\tilde{g}_k) \quad (11)$$

and immediately shared with the other workers, by means of a new non-blocking allreduce call. Each worker will update its weights following

$$w_k \leftarrow \bar{w} + \Delta w_k \quad (12)$$

where we first move the weights to the average value and then update them, as a single operation. At this point, each worker can start a new iteration, by proceeding to process the next mini-batch.
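The loop above can be simulated in a single process by modeling the non-blocking allreduce as a sum of the previous iteration's updates. The toy per-worker loss, the choice of plain SGD as the local update function $u$, the value of $\lambda$, and all names are our own assumptions, not the paper's implementation:

```python
import numpy as np

# Single-process simulation of the DC-S3GD loop on a toy per-worker quadratic loss
# 0.5*||w - c_k||^2. The non-blocking allreduce is modeled by consuming the updates
# produced one iteration earlier, reproducing the maximum staleness of one.
def dc_s3gd(centers, steps=200, lr=0.1, lam=0.1):
    n = len(centers)
    w_avg = np.zeros_like(centers[0])                # consistent average weights
    w = [w_avg.copy() for _ in range(n)]             # per-worker (stale) weights
    prev_updates = [np.zeros_like(w_avg) for _ in range(n)]
    for _ in range(steps):
        # the allreduce of last iteration's updates completes: new average weights
        w_avg = w_avg + sum(prev_updates) / n
        new_updates = []
        for k in range(n):
            g = w[k] - centers[k]                    # local gradient on stale weights
            g = g + lam * g * g * (w_avg - w[k])     # delay compensation, Eq. (10)
            new_updates.append(-lr * g)              # plain-SGD local update (our u)
            w[k] = w_avg + new_updates[k]            # move to average + update, Eq. (12)
        prev_updates = new_updates
    return w_avg

centers = [np.full(4, float(k)) for k in range(4)]   # per-worker targets 0, 1, 2, 3
w_final = dc_s3gd(centers)                           # approaches the mean target 1.5
```

In a real run the two loop bodies execute concurrently (gradient computation overlapping MPI_Iallreduce); the simulation only preserves the data dependencies, which is enough to see that all workers stay within one update of the common average.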
A description of how $\lambda$ is computed at each iteration is given in IV-A.
III-D Advantages and Disadvantages of the Proposed Method
We compare the proposed approach to two methods described in II-A: SSGD and DC-ASGD.
III-D1 Comparison to SSGD
The main advantage over SSGD resides in the fact that communication costs are (at least partially) hidden in our approach. We can approximate the time taken by SSGD to complete an iteration over a mini-batch on $N$ nodes as

$$T_{SSGD} = T_c + T_{ar} \quad (13)$$

where $T_c$ is the time it takes a worker to process the mini-batch (including the feed-forward and back-propagation phases), and $T_{ar}$ is the time taken by the allreduce call to reduce the gradients across all nodes. For our method, a similar approximation can be made, and it yields

$$T_{DC\text{-}S3GD} = \max(T_c, T_{ar}) \quad (14)$$

which is an obvious consequence of the fact that the computation and allreduce operations run concurrently in our setting.
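The two timing models reduce to a one-line comparison each; the timings below are illustrative numbers of our own, not measurements:

```python
# Per-iteration time models in the spirit of Eqs. (13)-(14).
def t_ssgd(t_compute, t_allreduce):
    return t_compute + t_allreduce          # Eq. (13): phases are serialized

def t_dcs3gd(t_compute, t_allreduce):
    return max(t_compute, t_allreduce)      # Eq. (14): phases fully overlap

# Example: 0.8 s of compute and 0.3 s of allreduce per iteration.
speedup = t_ssgd(0.8, 0.3) / t_dcs3gd(0.8, 0.3)   # 1.1 / 0.8 = 1.375
```

The model makes the limiting cases explicit: when communication is negligible the two schemes coincide, and the best achievable speedup of 2x occurs when compute and communication take equal time.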
III-D2 Comparison to DC-ASGD
Similarly to the results derived in the previous section, we can approximate the runtime of a DC-ASGD iteration as

$$T_{DC\text{-}ASGD} = T_c + T_{PS} \quad (15)$$

where $T_{PS}$ is the total time needed by a worker to push its gradients to the PS and obtain the updated weights. Clearly, this time also includes the time the worker spends waiting for the PS to receive the gradients. Therefore, even though it is true that in DC-ASGD fast workers do not have to wait for stragglers, it is also true that the runtime depends heavily on the network and on the capability of the PSs. As mentioned in II-A, DC-ASGD's convergence degrades as the number of workers increases. This is because the Euclidean distance between the workers' and the PSs' weights is proportional to $N$. In our method, the distance used to compute the correction is that between the workers' and the average weights, which we expect to grow more slowly with $N$.
IV. Experiments
TABLE I: Results obtained training different CNNs on the ImageNet-1k data set.

| Network    | Batch Size | #Nodes | Train Accuracy | Val. Accuracy | Speed [img/sec] | Reference Val. Acc.                          |
|------------|------------|--------|----------------|---------------|-----------------|----------------------------------------------|
| ResNet-50  | 16k        | 32     | 80.7%          | 77.5%         | 2078            | 75.3% [you2017imagenet], SSGD                |
| ResNet-50  | 32k        | 32     | 80.3%          | 77.4%         | 2144            | 75.4% [you2017imagenet], SSGD                |
| ResNet-50  | 32k        | 64     | 78.5%          | 77.2%         | 3815            | 75.4% [you2017imagenet], SSGD                |
| ResNet-50  | 64k        | 64     | 76.6%          | 75.6%         | 4245            | 76.2% [DBLP:journals/corr/abs180711205], SSGD |
| ResNet-50  | 64k        | 128    | 75.6%          | 75.1%         | 7340            | 76.2% [DBLP:journals/corr/abs180711205], SSGD |
| ResNet-50  | 128k       | 128    | 70.0%          | 69.7%         | 8201            | 75.0% [osawa2018largescale], K-FAC           |
| ResNet-101 | 64k        | 64     | 78.3%          | 77.2%         | 2578            |                                              |
| ResNet-152 | 32k        | 64     | 80.9%          | 78.7%         | 1768            |                                              |
| VGG-16     | 16k        | 64     | 63.03%         | 69.2%         | 1206            |                                              |
In this section, we first describe how we set the training hyperparameters, and then we report the results obtained by training four standard CNNs on the ImageNet-1k data set.
IV-A Hyperparameter Settings and Update Schedules
As mentioned in III-A, to train the CNNs we employed a data-parallel version of SGD with momentum. For each network, we set the momentum to the value used to obtain the state-of-the-art results, and we kept it constant for the whole training, which consisted of 90 full epochs. For the learning rate $\eta$, we first define the theoretical learning rate as

$$\eta_{theor} = N \, \eta_{ref} \quad (16)$$

where $N$ is the number of workers, as usual, and $\eta_{ref}$ is the learning rate for single-node training: for the ResNet cases, we used as reference a learning rate of 0.1 for a batch size of 256 samples. This linear scaling is standard practice, and it seems to give stable results in our setting. For VGG, the base learning rate was 0.02. Another standard approach is to define a learning rate schedule. In our case, we adopted an iteration-dependent (and not epoch-dependent) schedule with linear warmup and linear decrease. The length of the warmup phase was initially defined as half of the total iterations, but we found empirically that after 15 epochs the training error would reach a plateau (for all batch sizes up to 64k samples), and thus we stopped the warmup phase at the reached learning rate and initiated a longer linear decrease phase, which would run until the end of the training. For the case of 128k samples, the plateau was reached after 20 epochs. Identification of the plateau was done by direct observation, but we believe it could easily be automated, e.g. by checking for training error reduction every five epochs during the warmup phase.
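The truncated-warmup schedule described above can be sketched as follows; the function and argument names, and the example iteration counts, are ours:

```python
# Iteration-dependent schedule: a linear warmup that would peak at mid-training is
# cut short at iteration t_stop, after which the learning rate decays linearly to
# zero over the remaining iterations.
def lr_schedule(t, total_iters, lr_max, t_stop):
    slope = lr_max / (total_iters / 2)            # warmup aimed at lr_max by mid-training
    lr_stop = slope * t_stop                      # value reached when warmup is cut short
    if t < t_stop:
        return slope * t                          # linear warmup phase
    return lr_stop * (total_iters - t) / (total_iters - t_stop)  # linear decay to zero

# Example: 90 "epochs" of 100 iterations each, warmup stopped after 15 epochs.
peak = lr_schedule(1500, 9000, lr_max=3.0, t_stop=1500)
```

With these example numbers the schedule peaks at one third of the theoretical maximum, matching the "one third for a 15-epoch warmup" fraction mentioned below, and the two branches are continuous at `t_stop`.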
To reduce overfitting, weight decay was applied to all weights, with the exception of those belonging to batch normalization layers. This technique has given the best results, and the reasoning behind it can be found in [DBLP:journals/corr/abs180711205]. Since this kind of regularization reduces the weights by a constant fraction, when the learning rate is very small (as can happen in our case at the very beginning and at the very end of the schedule), the weight decay can become larger than the update, therefore blocking convergence. To mitigate this problem, we decided to apply the same schedule we used for the learning rate also to the weight decay parameter. To compensate for the smaller effective regularization, we also multiply the weight decay hyperparameter by a constant factor, whose value we determined empirically. This factor was applied to the weight decay value usually adopted in the literature, namely 0.0001 for the ResNet topologies and VGG-16.
By stopping the warmup phase early, we reach only a small fraction of the maximum step length (e.g. one third for a 15-epoch warmup), and we note that the pseudo-Hessian correction term is very small compared to the computed gradients. We investigated possible correction rescaling techniques, and found that the best one added 0.5% to the validation accuracy once the step length reached its end-of-warmup value. We think that this correction term would have a larger influence for larger learning rates. The parameter $\lambda$, which is used to control the variance introduced by the correction step [DBLP:journals/corr/ZhengMWCYML16], was empirically found to give the best results when dynamically set as
$$\lambda^t = \lambda_0 \, \frac{\eta^t}{\eta_{theor}} \quad (17)$$

with $\lambda_0$ a constant.
IV-B Hardware and Software Configuration
We ran our experiments on a Cray XC system. Every node was equipped with two 24-core Intel Skylake processors with a clock speed of 2.4 GHz, and nodes were connected through the Cray Aries interconnect with a dragonfly topology. The use of CPUs only, which is in contrast with the more standard usage of a GPU cluster, allowed us to explore very large local mini-batch sizes (up to 1024 samples per local mini-batch). As a toolkit, we used a modified version of MXNet [MXNet], in conjunction with the Intel MKL-DNN libraries [mkldnn]. We chose MXNet because it offered an easy way to implement our algorithm: we modified the original Key-Value Store (KV Store), which is used to update weights after each iteration, so that it included the needed mechanics and MPI code. The MPI implementation was Cray MPICH. The source code can be made available upon direct request to the author.
IV-C Results
We report results obtained by training ResNet-50, ResNet-101, ResNet-152, and VGG-16 on the ImageNet-1k data set.
IV-C1 ResNet-50
As training ResNet-50 has become a reference benchmark, we investigated the performance of our method on this problem, for different settings. To maximize CPU usage, and to exploit the large memory available on CPU nodes, we use a local mini-batch size of 512 or 1024 samples. From the achieved accuracy values, shown in Table I, it can be seen that we manage to reach state-of-the-art accuracy on up to 64 nodes, with a batch size of 32k samples: the total training time, not considering network setup, is 503 minutes. Keeping the number of nodes at 64 and using a larger batch size results in a slight loss of accuracy and a speedup of 10%. Running the parallel training on 128 nodes, we still reach a reasonable accuracy for a total mini-batch size of 64k samples, in 260 minutes: in comparison to [MLPerf], where the target accuracy was 74.9%, we clearly outperform the best results obtained on CPUs, even accounting for the difference between total execution and training time, which never exceeded 10 minutes. From the reported results, we can see that employing a larger batch size on 128 nodes results in a large loss of accuracy. In Figure 1, the top-1 error for full training of ResNet-50 networks is shown. For each combination of node count and aggregate batch size, we plot the results of the training run which reached the lowest validation error.
IV-C2 Other Architectures
Table I lists the results we obtained training other CNNs. It is clear that we are able to reach state-of-the-art accuracy for all ResNet topologies. More importantly, our method is also capable of training VGG-16 with a mini-batch size of 16k samples, even though this is known to be a difficult task [DBLP:journals/corr/abs170803888].
In order to fairly assess our method's performance, we did not adapt the hyperparameters to the different topologies. The only tuning we performed was to extend the warmup phase to 20 epochs (thus, two ninths of the total training) when running on 64 or 128 nodes.
V. Conclusions
In this work, we proposed a new algorithm for distributed training, named DC-S3GD, which allows for the overlap of computation and communication by averaging in the parameter space (weights) and applying a first-order correction to the gradients. We showed that this approach can achieve state-of-the-art results for parallel DL training.
Many aspects could be improved, for example, more sophisticated methods, like LARS [DBLP:journals/corr/abs170803888], or Adam [Adam], could be used as local optimizers.
Another possible enhancement would be to allow more out-of-sync minimization steps to be taken by the local optimizers, and to see how this influences performance in terms of time-to-accuracy.
To reduce the error introduced in the correction step, the pseudo-Hessian could be replaced by an analytical version of the Hessian matrix.
In terms of maximum achieved accuracy, we ran some preliminary tests with a larger number of iterations, and in some cases, extending the training to 100 or 120 epochs improved the accuracy by 0.2-0.8%, even for the case of 128k samples per batch.
We believe this approach could also be applied to train neural networks of other types, such as those used for Natural Language Processing, or Reinforcement Learning, if a dataparallel scheme can be adopted.