Gradient Energy Matching
for Distributed Asynchronous Gradient Descent
Distributed asynchronous sgd has become widely used for deep learning in large-scale systems, but remains notorious for its instability when increasing the number of workers. In this work, we study the dynamics of distributed asynchronous sgd under the lens of Lagrangian mechanics. Using this description, we introduce the concept of energy to describe the optimization process and derive a sufficient condition ensuring its stability as long as the collective energy induced by the active workers remains below the energy of a target synchronous process. Making use of this criterion, we derive a stable distributed asynchronous optimization procedure, gem, that estimates and maintains the energy of the asynchronous system below or equal to the energy of sequential sgd with momentum. Experimental results highlight the stability and speedup of gem compared to existing schemes, even when scaling to one hundred asynchronous workers. Results also indicate better generalization compared to the targeted sgd with momentum.
Joeri R. Hermans University of Liège email@example.com Gilles Louppe University of Liège firstname.lastname@example.org
Preprint. Work in progress.
1 Introduction
In deep learning, stochastic gradient descent (sgd) and its variants have become the optimization method of choice for most training problems. For large-scale systems, a popular variant is distributed asynchronous optimization (dean2012large; chilimbi2014project) based on a master-slave architecture. That is, a set of $n$ workers individually contribute updates $\Delta\theta_t$ asynchronously to a master node holding the central variable $\tilde{\theta}_t$, under a global clock $t$, such that
$$\tilde{\theta}_{t+1} = \tilde{\theta}_t + \Delta\theta_t. \qquad (1)$$
Due to the presence of asynchronous updates, i.e. without locks or synchronization barriers, an implicit queuing model emerges in the system (2016arXiv160509774M), such that workers are updating $\tilde{\theta}_t$ with updates $\Delta\theta_t$ that are possibly based on a previous parameterization
$$\tilde{\theta}_{t-\tau_t} \qquad (2)$$
of the central variable, where $\tau_t$ is the number of updates that occurred between the time the worker responsible for the update at time $t$ pulled (read) the central variable, and committed (wrote) its update. The term $\tau_t$ is traditionally called the staleness or the delay of the update at time $t$. Assuming a homogeneous infrastructure, the expected staleness (2016arXiv160509774M) for a worker under a simple queuing model can be shown to be $\mathbb{E}[\tau] = n - 1$. In this setup, instability issues are common because committed updates are most likely based on past and outdated versions of the central variable, in particular as the number of workers increases. To mitigate this effect, previous approaches (2015arXiv151105950Z; jiang2017heterogeneity; 2017arXiv171002368H) suggest specifying a projection function $\Psi(\cdot)$ that modifies $\Delta\theta_t$ such that the instability arising from stale updates is mitigated. That is,
$$\tilde{\theta}_{t+1} = \tilde{\theta}_t + \Psi(\Delta\theta_t). \qquad (3)$$
In particular, considering staleness to have a negative effect, $n$-softsync (2015arXiv151105950Z) and dynsgd (jiang2017heterogeneity) make use of the staleness $\tau_t$ to weigh down an update $\Delta\theta_t$. While there is a significant amount of empirical evidence that these methods are able to converge, there is also an equivalent amount of theoretical and experimental evidence showing that parameter staleness can actually be beneficial, especially when the number of asynchronous workers is small (2016arXiv160509774M; 2017arXiv170603471Z; kurth2017deep). Most notably, (2016arXiv160509774M) identifies that asynchrony induces an implicit update momentum, which is beneficial if kept under control, but which can otherwise have a negative effect when it is too strong. These results suggest that approaches which use the staleness $\tau_t$, or the number of workers $n$, to weigh down updates impair the contribution of individual workers, which is not desired. As a result, asynchronous methods that are commonly used in practice are variants of downpour (dean2012large) or hogwild! (2011arXiv1106.5730N), where the number of asynchronous workers is typically restricted or the learning rate lowered to ensure stability while at the same time leveraging effects such as implicit momentum. As the number of workers increases, one might wonder what would be an effective way to measure how closely distributed asynchronous sgd adheres to the desired dynamics of a stable process. Answering this question requires a framework in which we are able to quantify this adherence, along with a definition of stability and desired behavior.
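To make the notion of staleness concrete, the following minimal sketch simulates workers that commit in a fixed round-robin order, an idealization of the homogeneous setting discussed above. The function name and the schedule are illustrative assumptions, not part of any reference implementation. Staleness is counted as the number of commits made by other workers between a worker's pull and its own commit.

```python
def round_robin_staleness(num_workers, num_rounds):
    """Return the staleness of every commit under a round-robin schedule."""
    clock = 0                       # global (parameter server) clock: commits so far
    pulled_at = [0] * num_workers   # every worker first pulls the initial central variable
    staleness = []
    for _ in range(num_rounds):
        for worker in range(num_workers):
            # Number of updates committed since this worker last pulled.
            staleness.append(clock - pulled_at[worker])
            clock += 1                  # the worker commits its update
            pulled_at[worker] = clock   # and immediately pulls the new central variable
    return staleness


stale = round_robin_staleness(num_workers=4, num_rounds=3)
steady = stale[4:]  # discard the warm-up round
print(steady)
```

After the warm-up round, every commit in this idealized schedule sees `num_workers - 1` intervening updates, which is why staleness grows with the number of workers.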
Our contributions are as follows:
We formulate stochastic gradient descent in the context of Lagrangian mechanics and derive a sufficient condition for ensuring the stability of a distributed asynchronous system.
Building upon this framework, we propose a variant of distributed asynchronous sgd, gem, that views the set of active workers as a whole and adjusts individual worker updates in order to match the dynamics of a target synchronous process.
Contrary to previous methods, this paradigm scales to a large number of asynchronous workers, as convergence is guaranteed by the target dynamics.
Speedup is attained by partitioning the effective mini-batch size over the workers, as the system is constructed such that they collectively adhere to the target dynamics.
|Time at global (parameter server) clock.|
|Central variable (central parameterization) at time .|
|Staleness of worker responsible for update .|
|The parameterization of the central variable used by worker responsible for update .|
|Update produced by worker responsible for update . In this work, is computed using sgd, i.e., .|
|Absolute value of an update .|
|Kinetic energy of the proxy at time .|
|Kinetic energy of the central variable at time .|
2 Stability from energy
Taking inspiration from physics, the traditional explicit momentum SGD update rule can be re-expressed within the framework of Lagrangian mechanics, as shown in Equations 4 and 5, where is Rayleigh’s dissipation function to describe the non-conservative friction force:
which is equivalent to the traditional explicit momentum sgd formulation. Since the Euler-Lagrange equation can be used to derive the equations of motion using energy terms, we can similarly express these energies in an optimization setting. Using Equation 4 and discretizing the update process, we find that the kinetic energy and the potential energy are defined at time as
where the loss function describes the potential under a learning rate .
Equations 8 and 9 only hold in a sequential (or synchronous) setting as the velocity term is ill-defined in the asynchronous case. However, since asynchronous methods are concerned with the optimization of the central variable, a worker can implicitly observe the velocity of , that is the parameter shift, between the moment it read the central variable and the moment it committed its update. In the asynchronous setting,
where was computed from . Therefore, the parameter shift between the time worker responsible for the update read and eventually committed is . From this, we define the kinetic energy of the central variable as observed from to produce as
Intuitively, our definition of the kinetic energy of the asynchronous system therefore corresponds to the collective kinetic energy induced by the last updates from every worker responsible for .
This definition allows us to measure the kinetic energy of the central variable. From this, we may now formulate a sufficient condition ensuring the stability and compliance of the asynchronous system. To accomplish this, we introduce the kinetic energy of a proxy, , for which we know, by assumption, that it eventually converges to a stationary solution . As a result, an asynchronous system can be said to be stable whenever
and compliant when
That is, stability and compliance are ensured whenever the kinetic energy collectively induced by the set of workers matches the kinetic energy of a proxy known to converge. Therefore, Equations 12 and 13 are sufficient, albeit not necessary, to ensure convergence.
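As a rough illustration of these two conditions, the sketch below compares per-parameter kinetic energies. It assumes a kinetic energy of the form one half times the squared parameter shift, reads stability as "not exceeding" the proxy energy and compliance as "matching" it. All function names and numerical values are hypothetical.

```python
def kinetic_energy(velocity):
    """Per-parameter kinetic energy, assuming the form 0.5 * v**2."""
    return [0.5 * v * v for v in velocity]


def is_stable(central_energy, proxy_energy):
    """Stable: the central variable never carries more energy than the proxy."""
    return all(c <= p for c, p in zip(central_energy, proxy_energy))


def is_compliant(central_energy, proxy_energy, tol=1e-12):
    """Compliant: the central variable energy matches the proxy energy."""
    return all(abs(c - p) <= tol for c, p in zip(central_energy, proxy_energy))


# Updates committed between a worker's pull and its own commit (two parameters).
recent_updates = [[-0.02, 0.01], [-0.01, 0.03], [-0.03, 0.02]]
# Parameter shift of the central variable as observed by that worker.
shift = [sum(column) for column in zip(*recent_updates)]

central_energy = kinetic_energy(shift)
proxy_energy = kinetic_energy([-0.05, 0.04])  # hypothetical proxy velocity
print(is_stable(central_energy, proxy_energy))
```

In this toy setting the three accumulated updates carry more energy than the proxy allows, so the stability check fails and the updates would need to be scaled down.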
3 Gradient Energy Matching
We now present a new variant of distributed asynchronous sgd, Gradient Energy Matching (gem), which aims at satisfying the compliance condition with respect to the dynamics of a target synchronous process, thereby ensuring its stability. In addition, we design gem so as to minimize computation at the master node, thereby reducing the risk of bottleneck.
The definition of the proxy can be chosen arbitrarily, as it depends on the desired dynamics of the central variable. In this work, we target a proxy for which the dynamics of behaves similarly to what would be computed sequentially in sgd with an explicit momentum term under a static learning rate . Traditionally, this is written as
with the corresponding kinetic energy
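The proxy dynamics described above can be sketched on a one-dimensional toy problem. The snippet assumes the common heavy-ball formulation (velocity decayed by the momentum term, then reduced by the scaled gradient) and a kinetic energy of one half times the squared velocity. The loss, the hyper-parameter values and the function name are illustrative assumptions.

```python
def momentum_sgd(grad, theta, lr=0.1, mu=0.9, steps=200):
    """Sequential sgd with explicit momentum; also records the kinetic energy."""
    velocity = 0.0
    energies = []
    for _ in range(steps):
        velocity = mu * velocity - lr * grad(theta)  # momentum update
        theta = theta + velocity                     # parameter update
        energies.append(0.5 * velocity ** 2)         # assumed kinetic energy
    return theta, energies


# Toy loss (theta - 3)^2 with gradient 2 * (theta - 3).
theta, energies = momentum_sgd(lambda x: 2.0 * (x - 3.0), theta=0.0)
print(theta, energies[0], energies[-1])
```

As the process approaches the stationary solution, the velocity and hence the kinetic energy vanish, which is the behavior the asynchronous system is meant to inherit from the proxy.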
3.2 Energy matching
Consider the situation of a small number of asynchronous workers, as is typically the case for downpour or hogwild!. Experimental results reported in the literature suggest that these methods usually converge to a stationary solution, which implies that the stability condition (Equation 12) is most likely satisfied. However, as the number of asynchronous workers increases, so does the kinetic energy of the central variable, and there is a limit beyond which the system dynamics no longer satisfy Equation 12. To address this, we introduce scaling factors , whose purpose is to modify updates such that the compliance condition remains satisfied. As a result, the central variable energy (Equation 11) is extended to
Scaling factors can be assigned in several ways, depending on the information available to each worker and on the desired usage of the individual updates . To prevent wasting computational resources, we aim at equally maximizing the effective contribution of all updates while achieving compliance with the proxy, i.e.,
Directly solving for yields
This mechanism constitutes the core of gem, as it is through scaling factors that the energy of the active workers is matched to the energy of the target proxy.
To show the reader how adjusts the active workers to collectively adhere to the proxy, let us consider the quantity . From 2016arXiv160509774M, we know that under a homogeneity assumption the average staleness is . In expectation, the behavior of the system is therefore similar to updates occurring in a round-robin fashion. In this case, Equation 11 reduces to
which shows that to ensure compliance, the updates of all workers have to collectively be scaled.
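The effect of the number of workers on the scaling can be sketched as follows. The snippet assumes that a single factor scales the sum of the updates in the staleness window, that energies take the form one half times a squared quantity, and that under a round-robin schedule this sum is roughly the number of workers times one update. The small constant `eps` and all names are illustrative assumptions.

```python
import math


def scaling_factor(proxy_energy, update_sum, eps=1e-8):
    """Solve 0.5 * (factor * update_sum)**2 == proxy_energy for factor."""
    return math.sqrt(2.0 * proxy_energy) / (abs(update_sum) + eps)


proxy_velocity = 0.5                       # hypothetical proxy velocity
proxy_energy = 0.5 * proxy_velocity ** 2   # assumed kinetic energy of the proxy
delta = 0.1                                # hypothetical magnitude of one worker update

# Round-robin idealization: the window contains one update per worker.
factors = {n: scaling_factor(proxy_energy, n * delta) for n in (1, 5, 10)}
matched = 0.5 * (factors[10] * 10 * delta) ** 2
print(factors, matched)
```

Under these assumptions the factor shrinks in inverse proportion to the number of workers, so that the collective energy stays matched to the proxy regardless of how many workers are active.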
3.3 Proxy estimation
While the previous derivation defines an assignment for , it remains to locally estimate the desired kinetic energy . This is particularly problematic since updates from the other workers are not available. As an approximation, we propose to aggregate local updates such that
with the first moment being defined as
Additionally, Appendix B examines the effects of scaling with an amplification factor to compensate for the staleness of the proxy due to the local approximation.
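A worker-local proxy estimate of this kind can be sketched as a running first moment of the worker's own updates. The sketch assumes an exponentially decayed accumulation with the momentum term and an energy of one half times the squared first moment. The class and method names are hypothetical.

```python
class LocalProxy:
    """Worker-local estimate of the proxy, built only from the worker's own updates."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.first_moment = 0.0

    def observe(self, update):
        """Accumulate one local update and return the estimated proxy energy."""
        self.first_moment = self.momentum * self.first_moment + update
        return 0.5 * self.first_moment ** 2


proxy = LocalProxy(momentum=0.9)
energies = [proxy.observe(-0.1) for _ in range(3)]
print(proxy.first_moment, energies)
```

Because the estimate uses only local information, no additional communication with the master node or with other workers is required.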
This concludes gem, as summarized in Algorithm 1.
4 Experiments
We evaluate gem against downpour (dean2012large) and an adaptive staleness technique which rescales updates by , similar to what is typically applied to stabilize stale gradients in -softsync (2015arXiv151105950Z) or dynsgd (jiang2017heterogeneity). We carry out experiments on MNIST (mnist) and ImageNet (imagenet) to illustrate the effectiveness and robustness of our method. All algorithms are implemented in PyTorch (paszke2017automatic) using the distributed sub-module. The source code we provide (https://github.com/montefiore-ai/gradient-energy-matching) is, to the best of our knowledge, the first PyTorch-native parameter server implementation. Experiments described in this section are carried out on general-purpose CPU machines without GPGPU accelerators.
4.1 Illustrative experiments
For all experiments described below we train a convolutional network (see Appendix A for specifications) on MNIST using gem, downpour or the adaptive staleness variant on a computer cluster as an MPI job. A fixed learning rate , momentum term and worker batch-size are applied unless stated otherwise. Additionally, every worker contributes a single epoch of training data. All results in this section are averaged over 3 consecutive runs.
Figure 1 illustrates the training loss of gem compared to downpour and adaptive staleness. We observe that gem remains stable and adheres to the target dynamics as designed, even when increasing the number of asynchronous workers. This shows that gem is able to cope with the induced staleness. Remarkably, gem remains stable even when scaling to 100 asynchronous workers, as shown in Figure 5. By contrast, downpour faces strong convergence issues when increasing the number of workers while adaptive staleness shows impaired efficiency as updates are rescaled proportionally to the staleness, which is often too aggressive.
Figure 0(a) shows the behavior of gem for an increasing number of workers and a fixed worker batch-size . When applying an asynchronous method, one would expect that the number of updates per worker to reach a given loss value would decrease since more updates are committed in total to the master node. However, since gem tries to make the workers collectively adhere to the dynamics that are expressed by the proxy, this behavior should not be expected. Indeed, Figure 0(a) shows loss profiles that are similar whereas the number of workers varies from 5 to 30. As designed, inspecting the median per-parameter value of , shown in Figure 5, reveals that the scaling factors steadily decrease whenever additional workers are added. Notwithstanding, we also do observe improvement as we increase the number of workers as it indirectly augments the effective batch size . Furthermore, we observe from Figure 5 the adaptivity of gem as the median changes with time. While this behavior may appear as a limitation, better efficiency can be achieved by simply considering proxies of different energies, as further discussed in Appendix B.
Wall-clock time speedup
In distributed synchronous optimization, wall-clock speedup is typically achieved by setting a desired effective batch-size , and splitting the computation of the batch over workers to produce updates computed individually over samples. In the asynchronous setting, the same strategy can be applied for gem to speed up the training procedure, as shown in Figure 5 for a fixed effective batch-size , a number of workers between 1 and 16 and a corresponding worker batch size between 256 and 16. As expected, and without loss in accuracy, we observe that wall-clock speedup is inversely proportional to , depending on the efficiency of the implementation. However, we also observe larger variability as increases due to the increased stochasticity of the updates.
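The partitioning of the effective batch over the workers amounts to simple arithmetic, sketched below. The effective batch size of 256 is inferred from the reported range (1 worker with 256 samples, 16 workers with 16 samples); the intermediate worker counts and the function name are illustrative assumptions.

```python
def worker_batch_size(effective_batch_size, num_workers):
    """Number of samples each worker processes per update."""
    if effective_batch_size % num_workers != 0:
        raise ValueError("effective batch size must be divisible by the number of workers")
    return effective_batch_size // num_workers


EFFECTIVE_BATCH_SIZE = 256
split = {n: worker_batch_size(EFFECTIVE_BATCH_SIZE, n) for n in (1, 2, 4, 8, 16)}
print(split)
```

Each worker then computes its update on fewer samples, which reduces the time per update but increases the stochasticity of individual updates.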
A common argument against large mini-batch sgd is the poor generalization performance that is typically associated with the resulting model due to easily obtainable sharp minima (keskar2016large; hochreiter1997flat; hochreiter1995simplifying; chaudhari2016entropy). Nevertheless, smith2017don, goyal2017accurate and krizhevsky2014one show that test accuracy is not significantly impacted when using larger learning rates to emulate the regularizing effects of smaller batches. Similarly, experimental results for gem do not show such behavior. Even when training with 100 asynchronous workers, as done in Figure 5 and which corresponds to , gem obtains a final test accuracy of 99.38%. By contrast, the targeted momentum sgd procedure starts to overfit, as shown in Figure 5. A possible intuitive explanation for this behavior is in line with the arguments presented in (chaudhari2016entropy; keskar2016large), i.e., that asynchrony (staleness) regularizes the central variable such that it is not attracted to sharper minima, therefore preferring wider basins.
4.2 Resilience against hyper-parameter misspecification
These experiments use the same setup as in the previous MNIST experiments, with the exception of the learning rate, which is intentionally set to a large value . To show that our method is able to adhere to the proxy dynamics, we compare gem against downpour. From Figure 6, we observe that downpour is not able to cope with the increased learning rate, while gem remains stable since the equivalent synchronous process converges as well under these settings.
We train AlexNet (krizhevsky2012imagenet) on the ImageNet data (imagenet) with an effective batch size of and . gem hyper-parameters are set to their default values, as specified in Algorithm 1. No learning rate scheduling is applied. All experiments ran for 24 hours with 8 and 16 asynchronous workers, each using 2 CPU cores due to cluster restrictions. A single core computes the gradients, while the other handles data-prefetching, data-processing, and logging. The results are summarized in Figure 7. The figure shows that gem is able to effectively handle the staleness on a realistic use case. Despite an identical effective batch size, we observe a significant improvement with regard to worker update efficiency, both in wall-clock time and in training loss. The reason for this effect remains to be studied. Figures 10 and 10 in supplementary materials show how individual tensors of an update are adjusted over time.
5 Related work
Many techniques (schaul2013adaptive; jiang2017heterogeneity; 2015arXiv151105950Z; chilimbi2014project; 2017arXiv171002368H; mnih2016asynchronous; 2016arXiv160509774M) have been proposed to stabilize distributed asynchronous sgd based on the perception that staleness usually has a negative effect on the optimization process. In 2016arXiv160509774M however, the main inspiration behind this work, staleness is shown to be equivalent to an asynchrony-induced implicit momentum, which explains its negative effect when it is too strong but also highlights its potential benefits when kept under control. Similarly, in our framework, implicit momentum translates to added kinetic energy of the central variable. Close to our work, the authors of YellowFin (2017arXiv170603471Z) build on top of 2016arXiv160509774M by proposing an adaptive scheme that tunes momentum dynamically. They show in an asynchronous setting how individual workers can adjust momentum terms to collectively fit a target momentum provided by the optimizer. In particular, this is achieved by incorporating a negative feedback loop that attempts to estimate the momentum term from the observed staleness. We note the opportunity to extend this strategy in our framework, by using YellowFin to estimate the optimal momentum term of the proxy and having gem act as a mechanism to tune individual workers, thereby removing the original feedback loop.
6 Conclusions
This work introduces Gradient Energy Matching, gem, a novel algorithm for distributed asynchronous sgd. The core principle of gem is to consider the set of active workers as a whole and to make them collectively adhere to the dynamics of a target synchronous process. More specifically, the estimated energy of the asynchronous system is matched with the energy of the target proxy by rescaling the gradient updates before sending them to the master node. Under the assumption that this proxy converges, the proposed method ensures the stability of the collective asynchronous system, even when scaling to a high number of workers. Wall-clock time speedup is attained by partitioning the effective mini-batch size across workers, while maintaining its stability and generalizability.
A direction for future theoretical work would be the study of the effects of asynchrony on the generalization performance in connection with dinhsharp and how gem differs from regular momentum. This could help explain the unexpected improvement observed in the ImageNet experiments. A further addition to our method would be to adjust the way the proxy is currently estimated, i.e., one could compute the proxy globally instead of locally. This would allow all workers to contribute information to produce better estimates of the (true) proxy. As a result, the scaling factors that would be applied to all worker updates would in turn be more accurate.
Acknowledgments
Joeri Hermans acknowledges the financial support from F.R.S-FNRS for his FRIA PhD scholarship. We are also grateful to David Colignon for his technical help.
Appendix A Model specification
The convolutional network in the MNIST experiments is fairly straightforward. It consists of 2 convolutional layers, each with a kernel size of 5, no padding, and stride and dilation set to 1. The first convolutional layer has 1 input channel and 10 output channels, whereas the second convolutional layer has 20 output channels; both are combined with the appropriate max-pooling operations. We apply dropout (dropout) using the default settings provided by PyTorch (paszke2017automatic) on the final convolutional layer. Finally, we flatten the filters and pass them through 3 fully connected layers with ReLU activations, followed by a logarithmic softmax.
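The size of the flattened feature vector implied by this specification can be checked with a few lines of arithmetic. The sketch assumes 28×28 MNIST inputs and a 2×2 max-pooling with stride 2 after each convolutional layer; the pooling size is not stated in the text and is an assumption, as are the helper names.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution (standard formula)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def pool_output_size(size, kernel=2, stride=2):
    """Spatial output size of a max-pooling layer without padding."""
    return (size - kernel) // stride + 1


size = 28                                              # assumed MNIST input resolution
size = pool_output_size(conv_output_size(size, 5))     # first conv + pooling
after_first = size
size = pool_output_size(conv_output_size(size, 5))     # second conv + pooling
flattened = 20 * size * size                           # 20 output channels
print(after_first, size, flattened)
```

Under these assumptions the first fully connected layer would receive 320 input features.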
Appendix B High-energy proxies
Depending on the definition of the proxy, the efficiency of the optimization procedure can be increased by adjusting its energy by a constant factor . More specifically, to increase the permissible energy in gem proportionally to , we simply redefine the proxy energy to
From this definition it is clear that amplifies the energy of the proxy, therefore allowing for larger worker contributions as the range of the proxy energy is extended. As depends on the proxy, we can directly include in as
Since gradually increasing does not have a significant effect on (because of , we modify Equation 26 for the purpose of this study to
This effect can be observed in Figure 8, where the training loss improves as increases. However, we note that increasing the proxy energy can have adverse effects on the convergence of the training as the energy levels expressed by the proxy approach the divergence limit.