Delay-Compensated Stale-Synchronous SGD for Large-Scale Decentralized Neural Network Training
Data parallelism has become the de facto standard for training Deep Neural Network on multiple processing units. In this work we propose DC-S3GD, a decentralized (without Parameter Server) stale-synchronous version of the Delay-Compensated Asynchronous Stochastic Gradient Descent (DC-ASGD) algorithm. In our approach, we allow for the overlap of computation and communication, and compensate the inherent error with a first-order correction of the gradients. We prove the effectiveness of our approach by training Convolutional Neural Network with large batches and achieving state-of-the-art results.
Learning on Supercomputers (DLS) © 2019 IEEE
Keywords: neural networks, machine learning, deep learning, artificial intelligence, high performance computing.
Training Deep Neural Networks (DNNs) is a time- and resource-consuming problem. For example, to train a DNN to state-of-the-art accuracy on a single processing unit, the total time needed is in the order of magnitude of days, or even weeks [MLPerf]. For this reason, in recent years, several algorithms have been developed to allow users to perform parallel or distributed training of DNNs [Gupta7837841]. With the correct use of parallelism, training times can be reduced down to hours, or even minutes, [MLPerf, goyal2017accurate, krizhevsky2014weird, you2017large]. The reader interested in a broad survey of Deep Learning algorithms is referred to [DBLP:journals/corr/abs-1802-09941], which is also a great resource for taxonomy and classification of different parallel training strategies.
The most widely adopted type of training parallelism, and the one we will employ in this work, is denominated data parallelism: the DNN is replicated on different Processing Units, each replica is trained on a subset of the training data set, and updates (usually in the form of gradients) are regularly aggregated, to create a single update which is then applied to all the DNN replicas. The way updates are aggregated differs across algorithms in terms of communication scheme, distribution of roles among processing units, and message frequency and content. We will discuss different approaches and architectures in Section II.
In Section III we describe our approach, which constitutes a modification to the DC-ASGD algorithm proposed in [DBLP:journals/corr/ZhengMWCYML16]. Our approach shows promising results for Convolutional Neural Networks (CNNs): in Section IV we report the results obtained when training different networks on the well-known ImageNet-1k data set, which has imposed itself as the standard benchmark for CNN performance assessment.
In Section V we propose possible extensions to the presented algorithm, and outline what advantages they could bring.
Ii Related Work
With the growing availability of parallel systems, such as clusters and supercomputers, both as on-premises or cloud solutions, the demand for fast, reliable, and efficient parallel training scheme has been fueling research in the Artificial Intelligence community [DBLP:journals/corr/abs-1802-09941, goyal2017accurate, krizhevsky2014weird, you2017large, ma2017accelerated]. The most widespread technique, data parallelism, can be applied to many different areas, such as image classification, Reinforcement Learning, or Natural Language Processing [openai2018empirical]. When data-parallel training has to be scaled to large systems, convergence problems and loss of generalization arise from the fact that the global batch size becomes very large [Li:2018:VLL:3327345.3327535, smith2017bayesian, openai2018empirical].
As suggested in [DBLP:journals/corr/abs-1802-09941], data-parallel training methods can be classified according to two independent aspects: synchronicity (or model consistency across different processes) and communication topology (centralized or decentralized). Synchronous methods are those which ensure that after each training iteration each process (or worker) holds a copy of exactly the same weights; asynchronous methods allow workers to get out of date, receiving updated weights only when they request them (usually after having computed a local update). Centralized communication schemes imply the existence of so-called Parameter Servers, processes which have the task of collecting weight gradients from workers, and send back updated weights; in decentralized schemes, each worker participates in collective communications to compute the weight updates, e.g. via MPI all-reduce calls.
Ii-a Advantages and Disadvantages of Different Training Schemes
Historically, when the first major Deep Learning toolkits (such as e.g., TensorFlow [tensorflow2015-whitepaper] or MXNet [chen2015mxnet]) started offering the possibility of parallel training, they did so by implementing techniques with centralized communication, i.e. with Parameter Servers (PSs). As every centralized communication scheme, the PS-paradigm does not scale efficiently. With a growing number of Workers, PSs become bottlenecks, and communication becomes of the many-to-few type. Nevertheless, asynchronous methods often use this paradigm, as it allows workers to send updates independently, without waiting for other workers to complete processing their batches. The most straight-forward algorithm for this setting is clearly the Asynchronous SGD, which has been improved during years with respect to many aspects [DBLP:journals/corr/WangGCLY17, keuper2015asynchronous, DBLP:journals/corr/ZhengMWCYML16], but its core mechanism can be summarized as follows:
at the beginning of the computation, every worker receives an exact copy of the weights from the PSs
every worker processes a mini-batch and sends the computed gradients to the PSs, which apply them to their local copy of the weights, and send the updated weights to the worker which initiated the communication
the worker proceeds to process another batch, while the PSs wait for gradients from other workers
The problem (and the subject of the mentioned improvements) of this approach resides in the fact that after the first weight update, the weights on the PSs and the workers will be different (except for the worker who communicated with the PSs last). This in turn creates an inconsistency between the weights used to compute the gradient (on the worker’s side) and the weights which will be updated with such gradient on the PSs. This problem is often reffered to as gradient staleness. Clearly, the larger the difference between the weights, the less accurate the update will be. If we assume that all workers have approximately the same processing speed, we can deduce that after iterations the PSs will receive gradients which are -on average- out of date by steps. This clearly has a large negative impact on convergence, when is large. We will focus on one particular attempt which has been made to limit this effect and is derived in the DC-ASGD algorithm. The method computes an approximated first-order correction to modify the gradients received by the PSs. But even though this approach mitigates the problem, it can only work when the distance between PSs’ and worker’s weights is relatively small.
In recent years, large-scale training was obtained by using different flavors of the most classic synchronous scheme, that is Synchronous SGD, in conjunction with decentralized communication. Again, even though many variants exist, the core mechanism is easy to summarize as follows:
at the beginning of the computation, every worker receives an exact copy of the weights
when a worker has finished processing its mini-batch, it participates in a blocking all-reduce operation, where it shares the gradient it computed with all other workers
at the end of the all-reduce, all workers possess the sum of the computed gradients, and they can use it to compute the same weight update
every worker proceeds to process another batch
This scheme has been thoroughly explored, and has one only drawback, which resides in the blocking nature of the all-reduce operation: all workers have to wait for the slowest one (sometimes referred to as straggler) before initiating the communication, and then they have to wait for the end of the communication to compute the update.
Decentralized communication can also be used for a particular form of asynchronous methods, which are known as stale-synchronous. In stale-synchronous methods, workers are allowed to go out of sync by a maximum number of iterations (processed mini-batches), before waiting for other ones to initiate communication. The maximum number of iterations is called maximum staleness.
As we will see in the next section, our method is a stale-synchronous centralized of DC-ASGD, and in this work, we will only focus on the version with a maximum staleness of one.
Our algorithm is similar to the DC-ASGD method proposed in [DBLP:journals/corr/ZhengMWCYML16], with three main differences
it eliminates the need of a Parameter Server in favor of a decentralized communication scheme;
it is stale-synchronous, and not fully asynchronous;
weights computed by different workers are averaged.
In the following sections, we will explain why these differences result in a novel and improved approach, compared to existing algorithms.
Iii-a Problem Setting
We quickly review the problem of data-parallel training of a DNN. For this work, we will focus on DNNs trained as multi-dimensional classifiers, where the input is a sample, denoted by . The goal of training is to find a set of network weights which minimizes a loss function
for a set of samples , where is the per-sample classification loss function (cross-entropy loss in our case). Instead of reporting the final value of the loss function, it is usual to derive a figure of merit, which has the benefits of being more understandable by humans and applicable to different loss functions. In our case, we will use the top-1 error rate, which is simply the rate of misclassified samples to the number of elements of . We will measure both the error obtained on the training data set and on the validation data set.
We will employ a common version of the classic Mini-batch Stochastic Gradient Descent, which is usually referred to as Stochastic Gradient Descent (SGD), and solves the above mentioned minimization problem in an iterative way, following
where is a mini-batch, i.e. a subset of the training data set, and is the mini-batch size, which has been proven to be an important factor, determining how easily a network can be trained. We will adopt a simple version of the SGD algorithm, namely the so-called momentum SGD, in which a momentum term [Qian99onthe] ensures that updates are damped, and allows for faster learning [Qian99onthe].
In the synchronous parallel version, SGD works exactly in the same way, with the only difference that each worker computes gradients locally on the mini-batch it processes, and then shares them with other workers by means of an all-reduce call.
Since our algorithm is a variation of DC-ASGD, we will briefly outline its most important feature, that is, the delay compensation. As illustrated in Section II-A, gradient staleness reduces the convergence rate, because of the difference between the weights held by the worker and those held by the PSs. In DC-ASGD, the gradients are modified to take this difference into account. Basically, the idea is to apply a first-order correction to the gradients, so that they are approximately equal to those which would have been computed using the PSs’ copy of the weights. If the Hessian matrix computed at , here denoted by , was known, one could compute the corrected gradients as
where are the weights used by the worker, are those held by the PS, and is a vector with all components equal to one, with being the dimension of the weights. The quadratic error term comes directly from the Taylor expansion used to derive this result, and we will denote it as for the rest of this work. In principle, the Hessian matrix could be computed analytically, but the product of its approximation (known as pseudo-Hessian) with a vector is computationally convenient to compute as
where represents the Hadamard (or component-wise) product. Thus, we can rewrite 3 as
Removing the error term and adding a variance control parameter as defined in [DBLP:journals/corr/ZhengMWCYML16], we obtain the final form of the equation as
which is the one we base our algorithm on.
In our centralized setting, there is no PS, but since we implement a stale-synchronous method, workers can be expected to be out of sync. In fact, the main idea of our approach is to allow for communication and computation to run in parallel, thus diminishing communication’s impact on the total training run time. To allow for this, we make use of the non-blocking all-reduce function which is part of the MPI standard, i.e. MPI_Iallreduce.
We now describe our method, which is also illustrated in Algorithm 1. We stress the fact that all processing units will act as identical workers, only fed with different data. The only hyper-parameters we will need to set are the learning rate , the momentum , and the variance control parameter .
At the beginning of the computation, each worker receives the same set of initial weights and a different mini-batch, which it processes to obtain a set of gradients , where the bar over stresses the fact that the same value is held by all workers, the subscript denotes the worker index, and the superscript denotes the iteration. We will drop the superscripts when possible, to keep the notation concise.
Based on , the worker uses a function to compute the update to its local weights. We denote the update as and all workers will share their local update with the others, by starting a non-blocking all-reduce operation.
While the all-reduce operation is progressing, the worker updates its local copy of the weights:
and proceeds to process the next mini-batch, in order to compute new gradients . After having processed the mini-batch, all workers wait for the all-reduce operation to complete. In our implementation, the completion is checked by means of a call to MPI_Wait. After completion, each worker possesses an identical copy of , that is the sum of all workers’ updates of the previous iteration.
At this point, we can compute the average of the weights held by each worker, as
Notice that in principle, there is no guarantee that the mean value of the weights is actually meaningful, but studies such as [DBLP:journals/corr/abs-1803-05407] suggest that averaging different weights can lead to better minima. The Euclidean distance from the weights possessed by the worker to the average weights is
Knowing this distance, each worker could replace its own copy of the weights with the average ones, but this is actually not needed. More importantly, by using a modified version of 6, the local gradient can be corrected and used to compute a local update that can be applied to the average weights. The correction equation becomes
and thus the new update can be computed as
and immediately shared with the other workers, by means of a new non-blocking all-reduce call. Each worker will update its weights following
where we first move weights to the average value and update them as a single operation. At this point, each worker can start a new iteration, by proceeding to process the next mini-batch.
A description of how is computed at each iteration is given in IV-A.
Iii-D Advantages and Disadvantages of the proposed Method
We compare the proposed approach to two methods described in II-A, SSGD and DC-ASGD.
Iii-D1 Comparison to SSGD
The main advantage over SSGD resides in the fact that communication costs are (at least partially) hidden in our approach. We can approximate the time taken by SSGD to complete an iteration over a mini-batch over nodes as
where is the time it takes a worker to process the mini-batch (including feed-forward and back-propagation phases), and is the time taken by the all-reduce call to reduce the gradients across all nodes. For our method, a similar approximation can be made, and it yields
which is an obvious consequence of the fact that the computation and all-reduce operations run concurrently in our setting.
Iii-D2 Comparison to DC-ASGD
Similarly to the results derived in the previous section, we can define an approximation to the run-time of a DC-ASGD iteration, denoted by , as
where is the total time needed by a worker to push its gradients to the PS and obtain the updated weights. Clearly, this time also includes time spent by the worker, waiting for the PS to receive the gradients. Therefore, even though it is true that in DC-ASGD fast workers do not have to wait for stragglers, it is also true that run-time depends heavily on the network and on the capability of PSs. As mentioned in II-A, DC-ASGD’s convergence decreases for increasing numbers of workers. This is because the Euclidean distance between the workers’ and the PSs’ weights, , is proportional to . In our method, the distance used to compute the correction is that between workers’ and average weights, which we expect to grow more slowly w.r.t. .
|Network||#Nodes||Train Accuracy||Val. Accuracy||Speed [img/sec]||Reference Val. Acc.|
|ResNet-50||16k||32||80.7%||77.5%||2078||75.3% [you2017imagenet], SSGD|
|ResNet-50||32k||32||80.3%||77.4%||2144||75.4% [you2017imagenet], SSGD|
|ResNet-50||32k||64||78.5%||77.2%||3815||75.4% [you2017imagenet], SSGD|
|ResNet-50||64k||64||76.6%||75.6%||4245||76.2% [DBLP:journals/corr/abs-1807-11205], SSGD|
|ResNet-50||64k||128||75.6%||75.1%||7340||76.2% [DBLP:journals/corr/abs-1807-11205], SSGD|
|ResNet-50||128k||128||70.0%||69.7%||8201||75.0% [osawa2018largescale], K-FAC|
In this section, we first describe how we set training hyper-parameters, and then we report results obtained by training four standard CNNs on the ImageNet-1k data set.
Iv-a Hyper-parameter Settings and Update Schedules
As mentioned in III-A, to train CNNs, we employed a data-parallel version of SGD with momentum. For each network, we set the momentum to the value used to obtain the state-of-the-art results, and we keep it constant for the whole training, which consisted in 90 full epochs. For the learning rate , we first define the theoretical learning rate as
where is the number of workers, as usual, and is the learning rate for single-node training: for ResNet cases, we used as reference a learning rate of 0.1 for a batch-size of 256 samples. This is standard practice, and it seems to give stable results for our setting. For VGG, the base learning rate was 0.02. Another standard approach is to define a learning rate schedule. In our case, we adopted an iteration-dependent (and not epoch-dependent) schedule with linear warm-up and linear decrease. The length of the warm-up phase was initially defined as half of the total iterations, but we found empirically that after 15 epochs, the training error would reach a plateau (for all batch sizes up to 64k samples), and thus we stopped the warm-up phase at the reached learning rate, and we initiated a longer linear decrease phase, which would run until the end of the training. For the case of 128k samples, the plateau was reached after 20 epochs. Identification of the plateau was done by direct observation, but we believe it could easily be automated, by e.g. checking for training error reduction every five epochs during the warm-up phase.
To reduce over-fitting, weight decay was applied to all weights, with the exception of those belonging to batch normalization layers. This technique has given the best results, and the reasoning behind it can be found in [DBLP:journals/corr/abs-1807-11205]. Since this kind of normalization reduces weights by a constant fraction, when the learning rate is very little (as it can happen in our case, when is very close to 0 or to ), the weight decay can become larger than the update, therefore blocking convergence. To mitigate this problem, we decided to apply the same schedule we used for learning rate, also for the weight decay parameter. To compensate for the smaller effective regularization, we also multiply the weight decay hyper-parameter by a constant factor . We find that gives us the best results. This factor was applied to the weight decay hyper-parameter value usually adopted in the literature, namely 0.0001 for ResNet topologies and VGG-16.
By stopping the warm-up phase early, we reach only a small fraction of the maximum step length (e.g. one third for a 15-epoch warm-up), and we note that the pseudo-Hessian correction term is very small compared to the computed gradients. We investigated possible correction re-scaling techniques, and we found that the best result was to add 0.5% to the validation accuracy, when the step reached the end of the warm-up phase. We think that this correction term would have a larger influence for larger learning rates. The parameter , which is used to control the variance introduced by correction step [DBLP:journals/corr/ZhengMWCYML16], was empirically found to give the best results when dynamically set as
Iv-B Hardware and Software Configuration
We ran our experiments on a Cray XC system. Every node was equipped with two 24-core Intel Skylake processors with a clock speed of 2.4 GHz and nodes were connected through Cray Aries with dragonfly topology. The use of CPUs only, which is in contrast with the more standard usage of a GPU-cluster, allowed us to explore very large local mini-batch sizes (up to 1024 samples per local mini-batch). As a toolkit, we used a modified version of MXNet [MXNet], in conjunction with the Intel MKL-DNN libraries [mkl-dnn]. We chose to use MXNet because it offered an easy way to implement our algorithm: we modified the original Key-Value Store (KV Store), which is used to update weights after each iteration, so that it included the needed mechanics and MPI code. The MPI implementation was Cray-mpich. The source code can be made available upon direct request to the author.
We report results obtained by training ResNet-50, ResNet-101, ResNet-152, and VGG-16 on the ImageNet-1k data set.
As training ResNet-50 has become a reference benchmark, we investigated performances of our method on such problem, for different settings. To maximize CPU usage, and to exploit the large memory available on CPU nodes, we use a local mini-batch size of 512 or 1024 samples. From the achieved accuracy values, shown in Table I, it can be seen that we manage to reach state-of-the-art accuracy on up to 64 nodes, with a batch size of 32k samples: the total training time, not considering network setup, is of 503 minutes. Keeping the number of nodes at 64 and using a larger batch size results in a slight loss of accuracy and a speed-up of 10%. Running the parallel training on 128 nodes, we still reach a reasonable accuracy for a total mini-batch size of 64k samples, in 260 minutes: in comparison to [MLPerf], where the target accuracy was 74.9%, we clearly outperform the best results obtained on CPUs, even accounting for the difference between total execution and training time, which never exceeded 10 minutes. From the reported results, we can see that employing a larger batch size on 128 nodes results in a large loss of accuracy. In Figure 1, top-1 error for full training of ResNet-50 networks is shown. For each combination of node count and aggregate batch size, we plot the results of the training run which reached the lowest validation error.
Iv-C2 Other Architectures
Table I lists the results we obtained training other CNNs. It is clear that we are able to reach state-of-the-art accuracy for all ResNet topologies. More importantly, our method is also capable of training VGG-16 with a mini-batch size of 16k samples, even though this is known to be a difficult task [DBLP:journals/corr/abs-1708-03888].
In order to fairly assess our method’s performances, we did not adapt the hyper-parameters for the different topologies. The only tuning we performed, was to extend the warm-up phase to 20 epochs (thus, two ninth of the total training) when running on 64 or 128 nodes.
In this work, we proposed a new algorithm for distributed training, named DC-S3GD, which allows for the overlap of computation and communication by averaging in the parameter space (weights) and applying a first-order correction to gradients. We showed that this approach can achieve state-of-the-art results for parallel DL training.
Many aspects could be improved, for example, more sophisticated methods, like LARS [DBLP:journals/corr/abs-1708-03888], or Adam [Adam], could be used as local optimizers.
Another possible enhancement would be to allow more out-of-sync minimization steps to be taken by local optimizers, and to see how this influences performances, in terms of time-to-accuracy.
To reduce the error introduced in the correction step, the pseudo-Hessian could be replaced by an analytical version of the Hessian matrix.
In terms of maximum achieved accuracy, we ran some preliminary tests with a larger number of iterations, and in some cases, extending the training to 100 or 120 epochs could improve the accuracy of 0.2-0.8%, even for the case of 128k samples per batch.
We believe this approach could also be applied to train neural networks of other types, such as those used for Natural Language Processing, or Reinforcement Learning, if a data-parallel scheme can be adopted.