A general system of differential equations to model first order adaptive algorithms.

# A general system of differential equations to model first order adaptive algorithms.

André Belotto da Silva*  and  Maxime Gazeau*
July 19, 2019
###### Abstract.

First order optimization algorithms play a major role in large scale machine learning. A new class of methods, called adaptive algorithms, were recently introduced to adjust iteratively the learning rate for each coordinate. Despite great practical success in deep learning, their behavior and performance on more general loss functions are not well understood. In this paper, we derive a non-autonomous system of differential equations, which is the continuous time limit of adaptive optimization methods. We prove global well-posedness of the system and we investigate the numerical time convergence of its forward Euler approximation. We study, furthermore, the convergence of its trajectories and give conditions under which the differential system, underlying all adaptive algorithms, is suitable for optimization. We discuss convergence to a critical point in the non-convex case and give conditions for the dynamics to avoid saddle points and local maxima. For convex and deterministic loss function, we introduce a suitable Lyapunov functional which allow us to study its rate of convergence. Several other properties of both the continuous and discrete systems are briefly discussed. The differential system studied in the paper is general enough to encompass many other classical algorithms (such as Heavy ball and Nesterov’s accelerated method) and allow us to recover several known results for these algorithms.

* Alphabetical order and equal contribution
Université Aix-Marseille, Institut de Mathématiques de Marseille, UMR CNRS 7373, Centre de Mathématiques et Informatique, 39, rue F. Joliot Curie, 13013 Marseille, France (andre-ricardo.belotto-da-silva@univ-amu.fr).
Borealis AI (maxime.gazeau@borealisai.com).
\pdfstringdefDisableCommands

## 1. Introduction

Optimization is at the core of many machine learning problems. Estimating the model parameters can often be formulated in terms of an unconstrained optimization problem of the form

 minθ∈Rdf(θ)where f:Rd→R is differentiable.

Gradient descent [Cauchy:1847], which only depends on the partial derivatives of , is the simplest discrete algorithm to address the optimization problem above

 (1.1) θk+1=θk−s∇f(θk).

Another popular iterative approach to solve the above smooth optimization is the proximal point algorithm [Rockafellar, ParikhBoyd]

 (1.2) θk+1=argminu(12s∥u−θk∥2+f(u))

These discrete methods can be studied solely from the standpoint of optimization performance. It can be proved that both algorithms converge to a critical point ( as ) [Nesterov2004] but also almost surely to a local minimizer [LSJR:16, LPPSJR:2017]. For convex functionals with globally Lipschitz gradient, both algorithms converges at linear rate , where is a minimal point of [Nesterov2004, Rockafellar, ParikhBoyd]. These results give important guarantees on the convergence of each method.

For small and constant learning rate , gradient descent (1.1) (resp. proximal point algorithm (1.2)) corresponds to the forward (resp. backward) Euler’s discretization of the gradient flow system

 (1.3) ˙θ(t)=−∇f(θ(t)),θ0=θ(0),

under the time scaling [Hairer:2006, SRBA:17]. The stable equilibria of this continuous system are given by the the set of strict (local) minima of the loss function and if the level sets of are bounded ( coercive for example), then its trajectories asymptotically converge to a critical point in the sense that as . Moreover for convex functions, a linear rate of convergence holds, which is analogue to those obtained for both gradient descent and proximal point algorithm.

The study of the continuous dynamical system is very useful. The well-behaved convergence properties of the gradient flow (1.3) allows an important number of choices to design an optimization algorithm [SRBA:17]. It, furthermore, provides valuable intuition to prove convergence of discrete systems: for example, continuous Lyapunov functional can be often adapted to the discrete counterparts.

For large scale machine learning, a stochastic version of gradient descent (SGD) is very popular in practice [Bottou:16, kiefer1952, RobbinsMonro]

 (1.4) θk+1=θk−s∇f(θk,ξk),

where is an unbiased estimator of the true gradient . It is well known that this algorithm does not always converge and theoretical analysis provide conditions that guarantee convergence to a critical point [MB:11, Bottou:16]. In particular, the learning rate should be decreasing and converging to zero. Under this condition, it feels natural to conjecture that the long time behavior of SGD is closely related to asymptotic behavior of the trajectories of equation (1.3). This method, called the ODE method, was introduced by Ljung [Ljung:77] and extensively studied after [Benaim, KushnerBook]. Even in the stochastic setting, a good understanding of the underlying continuous dynamical system is important.

The emergence of deep learning has spawned the recent popularity of a special class of optimizers: first order adaptive optimization algorithms (RMSprop [TH:12], Adagrad[DHS:11, DS:13], Adadelta [Z:12], Adam [KB:14]) were originally designed to solve unconstrained optimization problem for a stochastic cost function (minimizing empirical risk in supervised learning). These optimizers became very popular in deep learning for both supervised and unsupervised learning tasks as it is commonly observed that the value of the training loss decays faster than for stochastic gradient descent. However, few positive results can be found in the literature, and limitations have been found. For example, it has been shown empirically that Adam and RMSprop do do not always converge in the stochastic setting, even for a convex loss function [RKK:18].

### 1.1. Motivation and main results

Inspired by the history of gradient descent and stochastic gradient descent, we analyze discrete adaptive optimization algorithms by introducing their continuous time counterparts, with a focus on Adam. The techniques and analysis are similar for the other algorithms (and include classical accelerated methods).

Adaptive Moment estimation (Adam) [KB:14] is an iterative method generating a sequence . In addition to the parameters , it computes the exponential moving average of the gradient and the squared gradient. The algorithm can be formulated as follows: for any constants , and initial vectors and for all

 (1.5) ⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩gk=∇f(θk−1,ξk−1)mk=μkmk−1+(1−μk)gkvk=νkvk−1+(1−νk)g2kθk=θk−1−s mk/√vk+ε.

where the two parameters for the moving average, depending on the iterations, are given by

 (1.6) {μk=β1(1−βk−11)/(1−βk1)νk=β2(1−βk−12)/(1−βk2).

Because the coefficients depend on , the underlying dynamical system for Adam must be non-autonomous. In what follows, we show that Adam is a discretization of a class of differential equations and the connection is established in section 3.3. In this work, we address the following questions

1. Is there a general random continuous dynamical system underlying adaptive algorithms? Is this system wellposed?

2. Can adaptive optimization algorithms be formulated as a discretization (possibly an Euler discretization) of a continuous system? If yes, does this numerical approximation converge (as a numerical method)? With which order of convergence?

3. What are the asymptotic behavior of the continuous trajectories for the deterministic counterpart? What intrinsic property make them well adapted to the specificity of deep learning landscape?

More precisely, we introduce in section 3 a general continuous dynamical system (3.1) whose forward Euler approximation matches a large class of first order methods. The connection with adaptive algorithms is made more precise in section 3.3 and with accelerated methods in section 3.4. Our analysis is supplemented by several comments about practical use of adaptive algorithms (see subsection 3.5). Section 4 contains the assumptions and the statement of our main results. We provide conditions under which the random differential equation is well-posed in section 4.1 and we prove that its forward Euler discretization is convergent in section 4.2. Finally, we study the asymptotic behavior of the continuous deterministic trajectories in section 4.3. In the non-convex setting we prove, under mild assumptions, that the trajectories converge to the critical locus of (see Theorem 4.4). This result is supplemented with the analysis of the necessary conditions in order to avoid convergence to saddle or local maximum points of (see Theorem 4.7). For convex functions, we design a Lyapunov functional and obtain a rate of convergence to a neighborhood of the critical locus (see Theorem 4.10), which depends on the behavior over time of and . In particular, this indicates that the efficiency of adaptive algorithms is highly dependant on the problem. In sections 5, we specialize the convergence results to adaptive algorithms and accelerated methods. Finally, most proofs supporting the paper are postponed to the Appendix.

This work is intended to serve as a solid foundation for the posterior study in the discrete and stochastic settings. The deterministic convergence analysis confirms that the trajectories converge to the critical locus of the objective function, which leads to natural conjectures on the convergence in the discrete and stochastic setting. In particular, we believe that the Lyapunov functional, used in section 4.3, can be adapted to the stochastic discrete methods. We note that, nevertheless, a precise correspondence between results valid for a continuous ODE and the stochastic discrete counterparts is far from being obvious. Indeed, recall that Adam and RMSprop are not always converging in the stochastic setting, even for a convex loss function [RKK:18]. This behavior is induced by the stochasticity in the algorithm and it is important to use the right framework to analyze the convergence of Adam in the stochastic case. This framework is well understood for the stochastic gradient descent. For this counterexample [RKK:18], stochastic gradient descent blows up for constant learning rate but converges for decaying learning rates.

## 2. Related work

The rate of convergence for gradient descent is not optimal and depending on the class of functions belongs to, more efficient algorithms can be designed [NoceWrig06, Bonnans:2006, Boyd:2004, Nesterov2004, BubeckBook]. For smooth convex or strongly convex functions, Nesterov [Nesterov2004] introduced an accelerated gradient algorithm which was proven to be optimal (a lower bound matching an upper bound is provided)

 (2.1) vk+1 =θk−s∇f(θk) (2.2) θk+1 =vk+1+kk+3(vk+1−vk).

However, the key mechanism for acceleration is not well understood and have many interpretations [Bubeck2015AGA, HL:17, LRP:16]. A particular interesting interpretation of acceleration is through the lens of a second order differential equation of the form

 (2.3) ¨θ=a(t)˙θ+∇f(θ),θ(0)=θ0,˙θ(0)=ψ0,

where is a smooth, positive and decreasing function of time, having possibly a pole at zero. Even if this singularity has important implications for the choice of the initial velocity , we are more interested by the long term behavior of the solution to (2.3) and hence at . This system is called dissipative because its energy decreases over time. Most accelerated optimization algorithms can be seen as the numerical integration of equation (2.3). For the Heavy Ball method, the function is constant and is called the damping parameter [AABR:02, alvarez2000]. In [gadat2014, Cabot09, CEG:07], conditions on the rate of decay of and its limit are given in order for the trajectories of (2.3) to converge to a critical point of . This analysis highlights situations where (2.3) are fit (or not) for optimization. Intuitively, if decays too fast to zero (like ) the system will oscillate and won’t converge to a critical point. The case was studied more specifically in [SBC:15] and the authors draw interesting connections between (2.3) and Nesterov’s algorithm (2.1). The convergence rates obtained are and respectively, which match with the discrete algorithms by using the time identification [SBC:15]. Extension of this work are proposed in [WW:15, WWJ:16] in which the authors studied acceleration from a different continuous equation having theoretically exponential rate of convergence. However, a naïve discretization looses the nice properties of this continuous system and current work consists on finding a better one preserving the symplectic structure of the continuous flow [BJW:18].

## 3. Presentation of the model and connection to existing optimization algorithms

In this section, we briefly introduce some notations on vector’s operations used in the paper. In section 3.2, we present a general system of differential equations as well as a possible discretization of it. In sections 3.3 and 3.4, we explicitly make the connection between this numerical approximation and first order optimization methods. In section 3.5, we make some observations on the behavior of the deterministic version of Adam that we illustrate on toy problems.

### 3.1. Compact notation

In what follows, we use several times the same non standard operations on vectors. It is convenient to fix the notation of these operations. Given two vectors and of and constants , we use the following notation:

 u+ε =(u1+ε,…,ud+ε) u⊙v =(u1⋅v1,…,ud⋅vd) u/v =(u1/v1,…,ud/vd) [u]a =(ua1,…,uad) √u =(√u1,…,√ud)

### 3.2. Presentation of the continuous time model

Throughout this paper we study the following dynamical system (whose connection with various numerical optimization methods is established via the Euler approximation scheme (3.3)).

 (3.1) ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩˙θ(t)=−m(t)/√v(t)+ε˙m(t)=h(t)∇f(θ(t))−r(t)m(t)˙v(t)=p(t)[∇f(θ(t))]2−q(t)v(t),

where , , (if , then ). The above system has a momentum term (or memory term depending on the values for and ). We also consider the alternative system

 (3.2) ⎧⎨⎩˙θ(t)=−∇f(θ)/√v(t)+ε˙v(t)=p(t)[∇f(θ(t))]2−q(t)v(t),

The analysis of this second differential equations is similar (and simpler) to the first. However, it is interesting here to study the impact of the rescaling term . In this work, we always make the following hypotheses

###### Assumption 1.

The objective function is assumed to be a function defined in whose gradient is locally Lipschitz. The functions , , and are non-negative and non-increasing -functions defined over . We also require that:

 h(t)≢0,r(t)≢0, and either p(t)≢0,% or p(t)≡q(t)≡0

Additional assumptions about the regularity of those functions will be given in section 4.1. Moreover, we won’t make, at any point, the assumption that the ODE (3.1) has globally Lipschitz coefficients. This assumption would be too strong and is not satisfied for quadratic functions, for example. This constitutes an important technical difficulty in the remaining of the paper.

The system is supplemented with initial conditions at time . We denote by a solution of with initial condition . In order to establish a relation between the continuous and the optimization algorithms, we study the finite difference approximation of (3.1) by the forward Euler method

 (3.3) ⎧⎪⎨⎪⎩θk+1=θk−smk/√vk+εmk+1=(1−sr(tk+1))mk+sh(tk+1)∇f(θk+1)vk+1=(1−sq(tk+1))vk+sp(tk+1)[∇f(θk+1)]2

where . We chose this method because it fits well with Adam discrete system. Moreover its convergence can be proven under certain extra assumptions (see subsection 4.2). However it is certainly not the only choice of discretization. In particular, more efficient quadrature rules can be considered to get more accurate numerical integration [QuadratureRule:75, kythe2002computational] in the case of singular functions .

The connections between our model and optimization algorithms is made precise in the next sections. Indeed, we show that these algorithms exactly match with the forward method, which is a good approximation of the continuous trajectories.

###### Remark 3.1 (On the stochastic model).

In subsections 4.1 and 4.2 below, we enunciate our results in terms of the stochastic version of (3.1)

 (3.4) ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩˙θ(t)=−m(t)/√v(t)+ε˙m(t)=h(t)∇f(θ(t),ξ(t))−r(t)m(t)˙v(t)=p(t)[∇f(θ(t),ξ(t))]2−q(t)v(t),

where is a stochastic process with continuous sample paths. In this case, the forward Euler discretization yields:

 (3.5) ⎧⎪⎨⎪⎩θk+1=θk−smk/√vk+εmk+1=(1−sr(tk+1))mk+sh(tk+1)∇f(θk+1,ξk+1)vk+1=(1−sq(tk+1))vk+sp(tk+1)[∇f(θk+1,ξk+1)]2

Adagrad [DHS:11] was designed to incorporate knowledge of the geometry of the data previously observed during the training. For all ,

 (3.6)

with initial conditions and . Here, we recall that denotes the element-wise product . The adaptive part in the algorithm comes from the term which is precisely the preconditioning matrix used to scale the gradients. The algorithm (3.6) can be equivalently described by

 (3.7)

with initial condition and . By setting and , it is easy to conclude that the Adagrad’s update rule can be written as:

 (3.8) {θk+1=θk−α∇f(θk)/√ωkωk+1=ωk+α[∇f(θk+1)]2

with the re-scaled initial condition and . It is now easy to see that (3.8) is a forward Euler discretization of the system of differential equations (3.2) (which we call the Adagrad differential equation) with initial condition given by , and with and .

#### 3.3.2. RMSprop and Adam differential equations

The only difference between these two optimizers and Adagrad is how the preconditioning matrix is computed. In RMSprop, it consists of an exponentially decaying moving average rather than a sum of the previous gradients

 (3.9) {vk+1=βvk+(1−β)∇θf(θk)2θk+1=θk−s∇θf(θk)/√vk+1.

Adaptive Moment estimation (Adam) [KB:14] is a famous variant of RMSprop that incorporates a momentum equation. More precisely, it computes the exponential moving average of the gradient and the square gradient. This method combines the advantages of RMSprop [TH:12] in addition to the running average for the gradient. We recall that Adam has three hyperparameters: the learning rate and the exponential rate of decay for the moment estimates . The parameter is usually set to to avoid dividing by zero. This parameter is typically not tuned. As already formulated in the Introduction, the algorithm reads as follows: for any constants , and initial vectors and for all

 ⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩gk=∇f(θk−1)mk=μkmk−1+(1−μk)gkvk=νkvk−1+(1−νk)g2kθk=θk−1−s mk/(√vk+ε).

where the two parameters for the moving average are given by

 {μk=β1(1−βk−11)/(1−βk1)νk=β2(1−βk−12)/(1−βk2).

We rewrite the update for the parameters such that

 (3.10) θk=θk−1−s mk/√vk+ε.

This modification does not change anything in the behavior of the algorithm. By modifying the order of the updates and the value of the initial conditions, we can rewrite the above algorithm in a more suitable way for our analysis. Indeed, let be such that and , then the following recursive update rules are equivalent to Adam for all

 (3.11) ⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩θk+1=θk−s mk/√vk+εgk+1=∇f(θk+1)mk+1=μk+2mk+(1−μk+2)gk+1vk+1=νk+2vk+(1−νk+2)g2k+1

As a consequence, the initial velocity is .

###### Remark 3.2 (On the parameters μ and ν).

In the original formulation of the algorithm (as stated in (1.5)), the parameters and depends on the iterations to correct for the bias induced by the moving average. These coefficients can also be taken constant and and this does not change the conclusion of our work. In particular, let us consider the system (3.1) with

 h≡r≡1/α1,p≡q≡1/α2

where are two positive constants. It can be easily checked that the choice of and leads to the Adam optimizer (3.11) without rescaling.

Consider now the three parameter family of differential equations

 (3.12) ⎧⎪ ⎪⎨⎪ ⎪⎩˙θ=−m/√v+ε˙m=gA1(t,λ,α1,α2)(∇f(θ)−m)˙v=gA2(t,λ,α1,α2)(∇f(θ)2−v)

where the coefficients in (3.1) are given by

 (3.13) h≡r≡gA1(t,λ,α1,α2),p≡q≡gA2(t,λ,α1,α2),

and are positive real numbers and:

 (3.14) gAi(t,λ,α1,α2)=1−e−λ/αiλ(1−e−t/αi),i=1,2.

Note that both functions have a simple pole at and, furthermore, satisfy assumption 3 below. Now, let us consider the associated discretization (3.3) with learning rate and a sub-family of discrete models parametrized by which are given by

 (3.15) λ=s,βi=e−λ/αi,i=1,2.

It easily follows that for

 sgAi((k+1)s,λ,α1,α2)=1−β11−βk11−βk+11=1−μk+1,

which recovers Adam’s discrete system (3.11) (apart from small difference in the evaluation of ). Therefore, Adam is an Euler discretization of system (3.1) for the choice of function (3.13)-(3.14) and parameters (3.15).

### 3.4. Accelerated optimization algorithms

In this section, we show that our framework encompasses previous studies of accelerated methods.

#### 3.4.1. Heavy Ball differential equation

We consider the Heavy ball second order differential equation [alvarez2000]

 (3.16) ¨x+γ˙x+∇f(x)=0,

where . By taking and (and ), we obtain the system (3.1) with

 h(t)≡1,r(t)≡γ,andp(t)≡q(t)≡0.

Equation (3.3) simplifies to

 (3.17) {θk+1=θk−smkmk+1=(1−sγ)mk+s∇f(θk+1)

which corresponds to the classical Heavy ball methods with damping coefficient , momentum variable and learning rate . Implicit discretization has also been considered in [alvarez2000].

#### 3.4.2. Nesterov differential equation

Following [SBC:15], we consider the Nesterov second order differential equation, parametrized by the constant ,

 (3.18) ¨x+rt˙x+∇f(x)=0.

Similarly as in the Heavy Ball case, we define and and write the above equation as a system (3.1) with

 h(t)≡1,r(t)=r/t,andp(t)≡q(t)≡0.

In [SBC:15], the authors studied a slightly different forward Euler scheme and proved that the difference between the numerical scheme and the Nesterov algorithm goes to zero in the limit . Similar analysis holds here.

### 3.5. Considerations on adaptive algorithms

We make here empirical observations and discuss some limitations on adaptive algorithms. On 2D toy problems, we emphasize some facts limiting the applicability of such algorithms in practice.

#### 3.5.1. The discrete dynamics does not necessarily converge.

One strong limitation of Adam is the existence of discrete limit cycles in the sense that the algorithm produces oscillations that never damp out. If the discrete dynamics reaches such an equilibrium, the difference can not converge arbitrarily close to zero with an increasing number of steps. However, it reaches a neighborhood of the critical point whose radius is determined by the learning rate . Decaying the learning rate is therefore necessary to obtain convergence of the dynamics. Numerically, we found that Adam with suffers from the same phenomena but the limit cycles are more difficult to establish. We believe that the existence of such cycles depend on the local curvature of the function near the optimum.

###### Proposition 3.3 (Existence of a discrete limit cycle for Adam).

Let and . Then there exists a discrete limit cycle for (3.11).

###### Proof.

Let us assume that there exists a such that and that , where is the learning rate. It easily follows from the update rules that

 θk+1 =θk−s∇f(θk)√vk=s2−s=−θk vk+1 =(s/2)2

Therefore and the system has entered a discrete equilibrium. ∎

We illustrate this behavior in Figure 1 on the strongly convex function .

It is important to note that the value of the gap between and depends on the learning rate. Choosing a smaller learning rate will reduce this gap but cannot remove it.

#### 3.5.2. For Adam and RMSprop, the hyper-parameters β1,β2 are functions of learning rate s

The second observation is related to the hyper-parameters of the optimizers and give important guidance on how to tune them. As observed in section 3.3, the parameters and were chosen as functions of the learning rate . It is often the case in practice (in particular in stochastic optimization) to decay the learning rate during the training process. By doing so, the discrete dynamics is completely modified unless the ’s are adjusted to keep constant. The coefficients must be modified according to the formula (3.15), which we recall here

 βi=e−s/αi,i=1,2.

In plot b), Figure 2, we compute the logarithm of the error between different trajectories

1. This is the reference dynamics: and . According to formula (3.15),

2. The second dynamics: and

3. For the third dynamics, we keep the same learning rate but adjust .

#### 3.5.3. On the strength of Adam and RMSprop on flat surfaces

The convergence analysis in the convex case (see Section 4.3) seems to indicate that Adam and RMSprop are rather slow algorithms and the convergence is only guaranteed in a neighborhood of the global minimum. However, there are situations where they seem to perform consistently well. This is the case for flat surfaces (see figure 3).

It is common to consider adaptive algorithms as a whole. However, we do not believe this should be done. The variable has different asymptotic and it is easily seen on the continuous dynamics that the equilibrium are different. In the strictly convex setting, for Adam (with ), the equilibrium is given by (and respectively for RMSprop), while for Adagrad it is given by for some (large) constant . In particular because is non-decreasing, we believe that Amsgrad should be considered as a modification of Adagrad rather than Adam.

## 4. Statement of the main results

In this section, we state the main theoretical results of this paper: existence and uniqueness of solutions for the continuous equation (3.4), order of convergence for the forward Euler scheme (3.5) and convergence analysis for equation (3.1) (in the deterministic case).

### 4.1. Existence and uniqueness of solutions

In this section we present the main result about existence and solutions for the random ordinary differential equation (3.4). Let be a probability space where is a -algebra on and is a probability measure. A continuous random differential equation is an ordinary differential equation for almost every realization and we enjoy the fact that the techniques for the deterministic and stochastic system are essentially the same. However, the solution to a random differential equation has to be a stochastic process defined on an -independent time interval. We ask for the following assumptions on the noise process, the function and its gradient.

###### Assumption 2.
1. Let be an -valued stochastic process with continuous sample paths.

2. For almost all , is a continuous function.

3. For all , the function is and its gradient is locally Lipschitz i.e for all and all , there exists a constant , depending on the norm of and , such that almost surely

 ∥∇f(x,ν)−∇f(y,ν)∥≤L(∥x∥,∥y∥)∥x−y∥.

The function is almost surely bounded on the bounded sets.

4. There exist a positive constant such that for all almost surely

 ∥∇f(0,ν)∥≤B.

Therefore

 ∥∇f(x,ν)∥≤L(∥x∥)∥x∥+B.

The assumption on the continuity of the stochastic process is only necessary to prove that the solution is continuously differentiable. It can be relaxed if the continuity of is sufficient. Moreover, note that the ODE (3.4) is measurable in and continuous in , since is continuous in both variables, and is measurable in and has continuous sample paths. Finally, we assume that the gradient is almost surely locally Lipschitz. It corresponds to locally Lipschitz in the deterministic case and is a necessary condition in order to use the Banach contraction mapping theorem.

In order to study the existence of the solution at , we demand the following from the coefficients, which are allowed to have a pole at zero.

###### Assumption 3.

We assume one of the following condition

1. The functions have a simple pole at .

2. If (resp ), then (resp. ) can have (at most) a simple pole at zero.

3. In any other cases, all functions are assumed to be on . The initial conditions can be taken arbitrarily.

In cases and , furthermore, we demand that there exists a small time such that

 q(t)−2r(t)≤0,∀t<^t,

and that the initial conditions must be taken as:

 m0=∇f(θ0,ξ0)limt→0+h(t)/r(t),v0=[∇f(θ0,ξ0)]2limt→0+p(t)/q(t).

Note that these assumptions are verified by Adam’s (under certain assumptions on the parameters, e.g. ) and Nesterov equations (3.12) and (3.18), respectively. The potential singularity in time makes the analysis more technical and we rely on a compactness argument to prove wellposedness at . In order to illustrate this technical difficulty, let us consider the differential equation:

 ˙θ(t)=Ltθ(t),θ(0)=0

Note that there always exist a solution given by of the above system. Now the solutions of this equation with initial condition at are given by (for ). Therefore, uniqueness only holds if . Indeed, if , for every the solution converges to when . Finally we assume that the local variance is bounded, which is needed to prove existence and uniqueness at . More precisely:

###### Assumption 4.

There exists two constants and such that for all

 E(sup0

and

 E(sup0

for all initial conditions and .

We recall that the system (3.1) is supplemented with initial conditions. We say that they are admissible at if . If , then . We are now ready to enunciate our main result:

###### Theorem 4.1.

Suppose that the ODE (3.4) satisfies assumptions 1 and 2. In addition, for any and admissible initial condition , there exists a unique global solution to equation (3.4) such that -almost surely:

 θ ∈C2([t0,∞);Rd) and m,v ∈C1([t0,∞);Rd).

Suppose, furthermore, that assumptions 3 and 4 are also satisfied. Then, there exists a unique global solution to equation (3.4) such that -almost surely:

 θ m,v ∈C1((0,∞);Rd)∩C([0,∞);Rd).

The proof is postponed in Appendix B and is divided as follows: we first study existence and uniqueness for using the Banach fixed point theorem and a cut-off argument. Then we study a regularized system and from a compactness argument, we obtain that the system is well-posed for .

### 4.2. Convergence of the Euler discretization

We now study the validity of the approximation given by the discrete system (3.5) of the differential equation (3.4). As in the previous section, we work on the stochastic equations and we give some additional regularity assumptions on the noise process .

In the traditional finite-time convergence (and order of convergence) theory for numerical methods, one usually requires the vector field to be globally Lipschitz, which is not the case here (see Assumptions 2). This represents a technical difficulty, and we rely on weaker assumptions, such as the one’s requested in [HMS:2002], to prove our results. Furthermore, as pointed out in [Grune2001, KJ:07, Neckel], the stochastic driving process has usually at most Hölder continuous sample paths (and therefore are not continuously differentiable) and standard numerical analysis do not directly apply.

In order to be precise, let us now fix the notation. Fix a (final) time and consider an interval on which we study the approximation of the solution of (3.4). We recall that the step size is given by . Consider , the integer part of . We write for any and is a homogeneous partition of the interval . Now, let be a fixed initial condition and consider:

• The sequence given by the discrete system (3.5) with initial condition ;

• The sequence for all , where is the exact solution of ODE (3.4) with initial condition .

We study the almost sure order of convergence of the Euler approximation. Following [Grune2001] we prove that the strong order of convergence in probability is determined by the order of the Hölder continuity of the sample paths of the driving process. More precisely, as in [KJ:07, page 2931], we make the following assumption on the regularity in time of the stochastic process .

###### Assumption 5.

The stochastic process is assumed to have sample paths which are locally Hölder continuous with the same exponent i.e. there exists a such that almost surely and for all , there exists a constant

 ∥∥ξ(t)−ξ(t′)∥∥≤CT∣∣t−t′∣∣γ,∀t,t′≤T.

Moreover, we introduce the modulus of continuity of the gradient of on

 ωf(s,x,T)=supt≠u,t0≤t,u≤T,|t−u|≤s∥∇f(x,ξ(t))−∇f(x,ξ(u))∥

and we make the following assumption

###### Assumption 6.

There is a positive constant and a positive constant , bounded on bounded sets, such that

 ωf(s,x,T)≤C(∥x∥)supt≠u,t0≤t,u≤T,|t−u|≤s∥ξ(t)−ξ(u)∥α

These two conditions will determine the order of approximation of the Euler scheme. Following [Grune2001], we get:

###### Theorem 4.2.

Let and suppose that . Let us consider a compact set of and assume that the ODE (3.4) and discretization (3.5) satisfies assumptions 1, 2, 5 and 6. Then, there exists a constant (which only depends on and the compact ) such that for any admissible initial condition , the numerical scheme satisfies

 maxk=0,⋯,K∥xk−˜xk∥≤C(T,A0)smin(αγ,1).

### 4.3. Convergence analysis in continuous time

In this section, we study the asymptotic behavior of the solutions of (3.1). Our analysis is based on the following three steps:

• Topological convergence: Find sufficient conditions on the functions and in order for the solutions of equation (3.1) to converge to a critical value of , that is, when . In particular we do not require to be convex.

• Avoiding local maximum and saddles: We want to strengthen the result of part (1) and give sufficient conditions so that the dynamics avoid local maximum and saddles and only converge to local minimum. In other words, fix and denote by the set of initial conditions such that the limit set of the associated solution contains a critical point which is not a local minimum. We give, in subsection 4.3.2, sufficient conditions for the set to have Lebesgue measure zero.

• Rate of convergence: Under the convexity assumption, find the rate of convergence of to a local minimum.

###### Remark 4.3 (On the convexity assumption).

For non-convex functions, in a neighborhood of a local minimum, the function is not always locally convex. Consider, for example, the function which has the origin as a local minimum. It is not locally convex at since there is no sufficiently small neighborhood of the origin where is convex. However, consider the space of all function and fix a compact set . For almost every function , if is a local minimum of then is locally convex at (see Remark F.3 for a precise statement). Therefore a small perturbation of the function can induce local convexity. Consider the one-parameter family . Then for every , the function is locally convex at . This is illustrated in Figure 5.

In the remaining of this section, we give precise statements for all three steps. For each step, we will make appropriate assumptions on the objective function. However, the following two assumptions must be satisfied in all cases

###### Assumption 7.

There exists such that:

 2r(t)−q(t)2+h′(t)h(t)≥0,∀t>˜t
###### Assumption 8.

The solution of the ODE (3.1) is bounded.

We start by the topological convergence analysis.

#### 4.3.1. Topological convergence

In this part, we make an additional assumption on the asymptotic behavior of the coefficients.

###### Assumption 9.

Suppose that . Consider the functions:

 H(s)=h(1/s),R(s)=r(1/s),P(s)=p(1/s),Q(s)=q(1/s),

and suppose that these functions are in such that ; ; and either , or .

Let us observe that this assumption is satisfied when the coefficients do not converge to zero at infinity. Hence, it holds for Adam (3.12) and the Heavy ball differential equation (3.16) but not for the Nesterov’s acceleration equation (3.18). Under this assumption, we prove the convergence of the dynamics in the following sense:

###### Theorem 4.4 (Topological Convergence).

Suppose that assumptions 1, 7, 8 and 9 are verified (and that and if ). Then , and when , where is a critical value of .

The proof of this result is postponed in Appendix D. Our method is inspired by the work of Alvarez [alvarez2000], based on the energy functional of the system. However, we use a different argument, based on the Poincaré-Bendixson type arguments, which does not rely on convexity. In the next part, we improve the above result and we prove that the dynamics converge to a local minimum for almost all initial condition.

#### 4.3.2. Avoiding local maximum and saddles

Before stating the main hypothesis of this section, we need introduce the definition of isolated critical points and strict saddle points

###### Definition 4.5.

A critical point is called isolated if there is a neighborhood around that does not contain any other critical points.

###### Definition 4.6.

(Following [LPPSJR:2017, Definition 1]) A critical point of a function is said to be a strict saddle if there exists a strictly negative eigenvalue of the Hessian of at .

We now make the following assumption

###### Assumption 10.

The function is . Moreover each critical point which is not a local-minimum is a strict saddle and is isolated (as a critical point).

Now, fix a time and recall that the topological limit of a curve is given by:

 ω(θ(t))=⋂τ>t0¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯θ([τ,∞)).

Consider the set of initial conditions such that the limit set of the associated orbit contains a critical point which is not a local minimum

 St0:={x0=(θ0,m0,v0);ω(θ(t))∋θ⋆, where θ⋆ is a strict saddle}

The main result of this subsection is the following:

###### Theorem 4.7 (Avoiding Saddle and Local Maximum points).

Suppose that assumptions 1, 7, 8, 9 and 10 are satisfied. Then, for every , the set has Lebesgue measure zero. More precisely, the Hausdorff dimension of is smaller or equal to .

###### Remark 4.8.

It follows that, if is a random initial condition, then the solution converges to a local minimum of with total probability.

Our method to prove the above result relies on the central(-stable) manifold theory of differential equations (see [CMBook, Chapter 1] for detailed exposition). Similar results are proved for discrete systems having isolated critical points in [LSJR:16, LPPSJR:2017]. An extension of this result to non-isolated critical points can be found in [PP:2016]. In this work, we assume that critical points are isolated in order to exclude pathological differences between local and global center-stable manifold theory (see example 4.9 below). The general theory of central-stable manifolds (c.f. [CMBook, Theorem 3.2] and [PP:2016, Lemma 5]) can be applied if the differential equation has globally Lipschitz coefficients and assumption 10 is not mandatory. However, the global Lipschitz assumption is not met in our case. We believe, nevertheless, that assumption 10 can be relaxed to prove saddle point avoidance in the case of non-isolated critical points, c.f [PP:2016].

###### Example 4.9.

Consider the differential equation:

 ˙x =y−(y2−sin(x)2)sin(x)cos(x) ˙y =sin(x)cos(x)+(y2−sin(x)2))y.

and let us consider the orbits of the differential equation whose limit set contain the singularity . Note the following contrast:

• Local: consider a (very) small neighborhood of , then the only solutions which contain in their limit set have initial conditions in:

 (x