Stochastic Modified Equations for the Asynchronous Stochastic Gradient Descent
We propose a stochastic modified equations (SME) for modeling the asynchronous stochastic gradient descent (ASGD) algorithms. The resulting SME of Langevin type extracts more information about the ASGD dynamics and elucidates the relationship between different types of stochastic gradient algorithms. We show the convergence of ASGD to the SME in the continuous time limit, as well as the SME’s precise prediction to the trajectories of ASGD with various forcing terms. As an application of the SME, we propose an optimal mini-batching strategy for ASGD via solving the optimal control problem of the associated SME.
In this paper, we consider the following empirical risk minimization problem commonly encountered in machine learning:
where represents the model parameters, denotes the loss function of the training sample , and is the size of the training sample set. Since the training set for most applications is of large size, stochastic gradient descent (SGD) is the most popular algorithm used in practice. In the simplest scenario, SGD randomly samples one instance at each iteration and updates the parameter by evaluating only the gradient of the selected . The stability and convergence rate of SGD have been studied in depth, for example, see hardt2015train (); needell2014stochastic (). However, the scalability of SGD is unfortunately restricted by its inherent sequential nature. To overcome this issue and hence accelerate the convergence, there has been a line of research devoted to asynchronous parallel SGDs. In the distributed computation scenario, an asynchronous stochastic gradient descent (ASGD) method parallelizes the computation on multiple processing units by (1) calculating multiple gradients simultaneously at different processors and (2) sending the results asynchronously back to the master for updating the model parameters agarwal2011distributed (); recht2011hogwild ().
1.1 Related Work
There has been a vast literature on the analysis of SGD, see for example Bottou et al. bottou2016optimization () for a comprehensive review of this subject. Some widely-used methods include AdaGrad duchi2011adaptive (), which extends SGD by adapting step sizes for different features, RMSProp tieleman2012lecture (), which resolves AdaGrad’s rapidly diminishing learning rates issue, and Adam kingma2014adam (), which combines the advantages of both AdaGrad and RMSProp with a parameter learning rates adaption based on the average of the second moments of the gradients. On the other hand, relatively few studies are devoted to ASGDs. Most of these studies for ASGD take an optimization perspective. Hogwild! recht2011hogwild () assumed data sparsity in order to run parallel SGD without locking successfully. Under various smoothness conditions on such as being strongly convex and ’s all Lipschitz, it showed that the convergence rate can be similar to the synchronous case. Duchi et al. duchi2013estimation () extended this result by developing an asynchronous dual averaging algorithm that allows problems to be non-smooth and non strongly-convex as well. Mitliagkas et al. mitliagkas2016asynchrony () observed that a standard queuing model of asynchrony correlates to the momentum, that is, the asynchrony produces momentum in SGD updates. There are also several methods using asynchrony either in parallel or distributedly, such as asynchronous stochastic coordinate descent algorithms liu2015asynchronous2 (); liu2015asynchronous (); nesterov2012efficiency (); richtarik2014iteration ().
Recently, Li et al. li2015stochastic () introduced the concept of the stochastic modified equation for SGDs (referred as SME-SGD in this report), where in the continuous-time limit an SGD is approximated by an appropriate (overdamped) Langevin equation. Compared to most convergence analysis that give upper bounds for (strongly) convex objects, this new framework not only provides more precise analysis for the leading order dynamics of SGD but also suggests adaptive hyper-parameter strategies using optimal control theory.
1.2 Our Contributions
Inspired by Li et al. li2015stochastic () mentioned above, we extend the application of SME here to characterize the dynamics of ASGD algorithms.
In Section 2, we first derive a stochastic modified equation for the asynchronous stochastic gradient descent, denoted shortly as SME-ASGD, for the case where the gradient of each loss function is linear. The derivation results in a Langevin equation, which has a unique invariant distribution solution with a convergence rate dominated by the temperature factor. Meanwhile, for the momentum SGD (MSGD), a similar Langevin equation denoted as SME-MSGD is derived and we show that the temperature factors for both derived SME agree. This comparison gives a Langevin dynamics explanation of why an asynchronous method gives rise to similar behavior as compared to the momentum-based methods mitliagkas2016asynchrony (). Then by introducing a new accumulative quantity, we derive a more general SME-ASGD for the general case in which the gradient of the loss function can be nonlinear. We show that the two SME-ASGDs are equivalent when the gradients are linear.
Section 3 provides some numerical analysis for SME-ASGD by providing a strong approximation estimation to the ASGD algorithm. Different from the usual convergence studies, we do not assume convexity on and but only require their gradients to be (uniformly) Lipschitz. Numerical results including non-linear forcing terms and non-convex objectives demonstrate that SME-ASGD provides much more accurate predications for the behavior of ASGD compared to SME-SGD derived in li2015stochastic (). In Section 4, we apply the optimal control theory to identify the optimal mini-batch for ASGD and the numerical simulations there verify that the suggested strategy gives a significantly better performance.
2 Stochastic Modified Equations
The asynchronous stochastic gradient descent (ASGD) carries out the following update at each step:
where is the step size, are i.i.d. uniform random variables taking values in , and is delayed read of the parameter used to update with random staleness .
We assume that the staleness are independent and that the sample selection process is mutually independent from the staleness process . ’s are all (uniformly) Lipschitz, that is, for each , there exists such that for any , we have . As a consequence, by taking , is also (uniformly) Lipschitz: . In addition, the staleness process follows the geometric distribution: (i.e., ) with probability for .
Making the geometric distribution assumption here is not only to simplify the computation, but also can be justified by considering the canonical queuing model younes2005verification (). For example, the computation at each processor may involve some randomized algorithm that requires each processor to do multiple independent trials until the result is accepted, thus resulting in a geometrically distributed computation time. Our derivation of SME models can be also easily generalized to other random staleness models as long as the expectation of read delays is finite.
2.1 Linear gradients
We first show the derivation of Langevin dynamics with the linear forcing term. Suppose that, for each is linear, or equivalently each is quadratic. While this is a fairly restrictive assumption, the derivation in this simplified scenario offers a more transparent view towards the stochastic modified equation for the asynchronous algorithm.
A key quantity for our derivation is the expected read given as follows as a weighted sum of :
Note that and . Plugging this into (2), we can rewrite ASGD as
by using the linearity of . Observe that the left hand side and the first term on the right hand side of (3) can be viewed as divided difference approximations to various time derivatives of . The second term on the right hand side is the usual gradient. The last term can be understood as the noise due to stochastic gradient and the read delays; it has mean , since the expectation can be decomposed as
The covariance matrix of the noise will be denoted as
conditioned on and we also denote the square root of by , i.e., . (and thus ) in general depends on the previous history of the trajectory, although such dependence is omitted in our notation.
In order to arrive at a continuous time stochastic modified equation from (3), we view as the evaluation of a function at time points where is the effective time step size for the corresponding stochastic modified equation, and it is chosen as . Let us introduce the auxiliary variables and reformulate (3) as a system of :
To obtain a SME, we first model the random term by a Gaussian random noise, that is, , where is the increment of a Brownian motion (thus and ) and the coefficient is chosen to match the variance. Assuming that is small, we arrive at a Langevin type equation:
When is a smooth confining potential (for example, is a quadratic potential), the process approaches to the minimum and can be approximated by a constant matrix up to a first order approximation for large time . When this constant matrix is a multiple of the identity matrix, in the standardized model is an ergodic Markov process with stationary distribution pavliotis2014stochastic ():
where is a normalization constant. In this case, the resulting friction is and the temperature is . When the constant matrix is not a multiple of identity, the stationary distribution takes such a form in a transformed coordinate system.
The reason why we care about the temperature parameter here is that it quantifies the variance of the noise, and therefore gives us more information about the asymptotic behavior of the optimization process. With such a tool, we can better analyze the connection between different stochastic gradient algorithms. Let us illustrate it by showing one example here: Mitliagkas et al. mitliagkas2016asynchrony () argues that there is some equivalence between adding asynchrony or momentum to the SGD algorithms, and they showed it by taking expectation to a simple queuing model and finding matched coefficients. Here, we investigate such relation by looking at the corresponding Langevin dynamics, specifically the temperature for both SMEs, which offers a more detailed dynamical comparison.
with a momentum parameter . (8) can be also viewed as a discretization of a second-order differential equation. By following a similar derivation with effective time step , and taking (details deferred to Appendix B), we end up with the following stochastic modified equation for MSGD (SME-MSGD)
where friction and temperature dominates the convergence rate to the stationary solution. Comparing SME-ASGD with SME-MSGD results in the following interesting observation.
In Theorems 3 and 5 in Mitliagkas et al.’s paper mitliagkas2016asynchrony (), the staleness’ geometric distribution parameter is taken to be , where is the number of mutually independent workers and is the momentum parameter. With their assumptions, when looking at (6) and (2.1) under the same time scale, which requires , we can see that . Since the corresponding temperature for the asynchronous method and momentum method are equal, we conclude that the perspective of stochastic modified equation given above explains the observation in mitliagkas2016asynchrony () that momentum method has certain equivalent performance as the asynchronous method.
2.2 Nonlinear gradients
We now consider the general case in which the gradient can be non-linear. One can still write the ASGD into a stochastic modified equation by viewing an averaged term as the position and viewing as the momentum in the Langevin dynamics. For this, let us define a new auxiliary variable which will be viewed as the position in SME as
where is to be determined. Directly following the definition, satisfies the difference equation
Moreover, we can rewrite the ASGD (2) into the form
The reason for us arranging terms in this way is to formulate a Langevin-type equation but moving the noise term from the momentum side () to the position side (). Notice that on the right hand side of (12), can be viewed as a noise with mean , and its conditional covariance matrix depends on (details deferred to Appendix B).
In order to view (11) and (12) as a time-discretization of a coupled system with the same time step size, we match , which requires the choice of . Setting the step size , same as in the linear case, and taking a Gaussian approximation to the noise , we arrive at the stochastic modified equation for the nonlinear case
Here . In order to close the system of equations, we derive an explicit evolution equation for
The derivation of (14) is shown in Appendix B. The combined system (13)–(14) will be referred as SME-ASGD, the stochastic modified equations for asynchronous SGD, for the general nonlinear-gradient case. We should point it out that unlike the linear-gradient case (6) , (13) has no known explicit formula for invariant measure even when converging to a constant matrix. Nevertheless, the ergodicity of (13) and (14) will be an interesting future direction to explore.
3 Approximation error of the stochastic modified equation
The difference between the time-discrete ASGD and the time-continuous SME-ASGD can be rigorously quantified as follows.
Under the Assumption 1 and assume further that the variance from the asynchronous gradients is uniformly bounded, there exists such that . Suppose all the iterates updated from the ASGD stay bounded, and the solutions for SME-ASGD and ASGD respectively up to time agree, i.e., , with as given previously, then the SME-ASGD approximates the ASGD, i.e., there exists constant depending only on such that
The assumption can be justified from (14) as is approximated by a constant matrix for large. This is because when the iterate approaches to the minimizer, the gradients are close to , and converges to be a constant vector.
The proof of the Theorem (3) follows from viewing the AGSD as a discretization of SME-ASGD and using the analysis of strong convergence for numerical schemes for stochastic differential equations (SDEs). One interesting observation is that, contrary to the standard Euler-Maruyama method for SDEs having strong order of convergence kloeden1992stochastic (), the above result indicates that ASGD, viewed as a discretization of SME-ASGD, has strong order . This is because the coefficient of the noise term in the SME-ASGD has , which is of order . The SME model proposed in li2015stochastic () has the same feature: the coefficient of the noise term there is of order . When , we can see the order equivalence between the two.
In the following, we provide numerical evidences for Theorem 3 with various loss functions . The results are shown in Figures 1 (for linear forcing) and 2 (for general forcing). For each example, through averaging over samples, we compare the results of ASGD with the predictions from both SME-ASGD (13) and the 2nd-order weak convergent SME-SGD proposed in Li et al.’s paper li2015stochastic ()
When is close to (i.e., the expected delay is short), one would naively expect that SME-SGD (16) would give rise to a reasonable approximation to ASGD as well. However, Figures 1 and 2 demonstrate that it is not the case: only SME-ASGD proposed here results in accurate path approximations for both the first and the second moments (in particular when enlarges), while the trajectories obtained from SME-SGD are way off.
A few remarks regarding the numerical results are in order here. (i) In Figure 1, the path oscillations happen to both ASGD and SME-ASGD due to a longer expected delay, but not to SME-SGD, even though we include staleness when computing for both models. That is because our SME-ASGD model contains in the forcing term, while the forcing term in SME-SGD is -independent. (ii) The convex function whose gradient in Figure 2 does not satisfy the general Ito conditions; however, by having good initial data and choosing smaller time step sizes, we can still obtain the minimizer without blowing up. (iii) Even for the non-convex function (double-well function in Figure 2), our SME-ASGD model is able to give a better prediction about which minimizer that a trajectory with given initial data will fall into; the percentage that SME-ASGD shows is very close to what in the ASGD case.
4 Optimal mini-batch size of ASGD
With much better understanding of dynamics of the ASGD algorithm using SME-ASGD, we are able to tune multiple hyper-parameters of ASGD using the predictions obtained from applying the stochastic optimal control theory to SME-ASGD. Here we demonstrate one such application: the optimal time-dependent mini-batch size for ASGD. By denoting the time-dependent batch size as with , one can write the iteration as
We may assume that the choice of mini-batch size is independent from and the staleness . This is because, even though changing the batch size will simultaneously change the clocks of all the processors, the staleness would not be changed as all the processors are impacted equally. Following the argument given in Section 2, we derive the corresponding SME
To simplify the discussion, let us consider for example the quadratic loss objective . By applying the Ito’s formula to this SME, we obtain the following second moment equation (a similar derivation is shown in Appendix C) for the evolution of the expected loss
As (19) is a linear system with constant coefficient matrix, its asymptotic behavior is determined by the eigenvalue of the coefficient matrix. An easy calculation shows that the eigenvalue with largest real part is given by , which has a negative real part, and thus the second moment of decays exponentially. Moreover, (19) provides us with the stationary solution for
Rather than applying the optimal control subject to the full second moment equation, we shall work with a simpler evolution equation that asymptotically approximates the dynamics (imposed as a constraint). More specifically, we pose the following optimal control problem for the time-dependent mini-batch size
where models – the quantity to minimize, is an admissible control set as the mini-batch size is greater than , and is a constant measuring the unit cost for introducing extra gradient samples throughout the time. The solution to (21) is given by (detailed computation deferred to Appendix D), with ,
In particular, (22) tells that we should use a small mini-batch size (even size ) during the early time (for ), since during this period the gradient flow dominate the dynamics. After the transition time when the noise starts to dominate, one shall apply mini-batch with size exponentially increasing in to reduce the variance. Figure 3 demonstrates that our proposed mini-batching strategy outperforms the ASGD with a constant batch size (for example, applied in dekel2012optimal (); gimpel2010distributed ()). Note that such strategy of increasing the batch size in later stage of training has been also suggested and used in recent works in training large neural networks, e.g., Goyal2017 (); Keskar2017 ().
In this paper, we have developed stochastic modified equations (SMEs) to model the asynchronous stochastic gradient descent (ASGD) algorithms in the continuous-time limit. When the gradient of the loss function is linear, the resulting SME can be put into a Langevin equation, whose solution is known to converge to the unique invariant measure with the convergence rate dictated by the corresponding temperature. We utilize such information to compare with the momentum SGD and prove the “asynchrony begets momentum” phenomenon. For the nonlinear gradient case, though the resulting SME does not have an explicitly known invariant measure, it still provides precise trajectory predictions for the discrete ASGD dynamics. Moreover, with SME available, we are able to find optimal hyper-parameters for ASGD algorithms by performing a moment analysis and leveraging the optimal control theory.
J.A. was partially supported by the Gene Golub Fellowship at ICME. J.L. is supported by the National Science Foundation under award DMS-1454939, and L.Y. is partially supported by the National Science Foundation under award DMS-1521830.
- (1) Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
- (2) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning, 2016. preprint, arXiv:1606.04838.
- (3) Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
- (4) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- (5) John Duchi, Michael I Jordan, and Brendan McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems, pages 2832–2840, 2013.
- (6) Kevin Gimpel, Dipanjan Das, and Noah A Smith. Distributed asynchronous online learning for natural language processing. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 213–222. Association for Computational Linguistics, 2010.
- (7) P. Goyal, P. Dollar, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2017. arXiv preprint, arXiv:1706.02677.
- (8) Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234, 2016.
- (9) N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
- (10) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
- (11) Peter E Kloeden and Eckhard Platen. Stochastic differential equations. In Numerical Solution of Stochastic Differential Equations, pages 103–160. Springer, 1992.
- (12) Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101–2110, 2017.
- (13) Ji Liu and Stephen J Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.
- (14) Ji Liu, Stephen J Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research, 16(1):285–322, 2015.
- (15) Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, and Christopher Ré. Asynchrony begets momentum, with an application to deep learning. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 997–1004. IEEE, 2016.
- (16) Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
- (17) Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
- (18) Grigorios A Pavliotis. Stochastic processes and applications: Diffusion processes, the Fokker-Planck and Langevin equations, volume 60. Springer, 2014.
- (19) Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
- (20) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
- (21) Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
- (22) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
- (23) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
- (24) Håkan Lorens Samir Younes. Verification and Planning for Stochastic Processes with Asynchronous Events. PhD thesis, Academy of Engineering Sciences, 2005.
Appendix A Appendix A: proof of Theorem 3
We look at the one step approximation in the base case, and the global approximation can be done by induction. Using the variation of constant formula, we know that the solution of
is given by
where as defined in (10). Plugging into the integral form of gives rise to
Splitting into and , we can make the following estimate
In the above derivation, we have applied the Ito isometry to the fourth term and then used
since . The fifth term, after applying the Cauchy-Schwarz inequality, is shown to be a discrete version of the covariance matrix
since the first two terms cancel. Because is Lipschitz, we can estimate the second term by
Because stays in a bounded domain, the third term can be bounded by
With these estimates available, we can choose a sufficiently large constant (depending on and the size of the domain containing the iterates from ASGD) such that
By Gronwall’s inequality, we have
The induction step is similar. With the assumption , we have the following estimate
Here the only difference is in the second term, which is not given but generated from SME. Denote . From (11), we observe that is indeed an approximation of by applying the Euler discretization to the ordinary differential equation part of the SME. Because the global truncation error for the Euler method in ODE is , we have
as shown before. Applying the Gronwall’s inequality again and letting , we obtain
As for all , one can conclude that there exist such that
Appendix B Appendix B: miscellaneous computations in SMEs
In this section, we provide the missing computations in Section 2. First, let us show that in subsection 2.2, the noise term has mean
The covariance matrix conditioned on is given by
b.1 Evolution equation of in (14)
First, we have
By expanding the terms in the expectation and treating them individually, we arrive at the following