Shadowing Properties of Optimization Algorithms
Abstract
Ordinary differential equation (ODE) models of gradient-based optimization methods can provide insights into the dynamics of learning and inspire the design of new algorithms. Unfortunately, this thought-provoking perspective is weakened by the fact that — in the worst case — the error between the algorithm steps and its ODE approximation grows exponentially with the number of iterations. In an attempt to encourage the use of continuous-time methods in optimization, we show that, if some additional regularity on the objective is assumed, the ODE representations of Gradient Descent and Heavy-ball do not suffer from the aforementioned problem, once we allow for a small perturbation of the algorithm's initial condition. In the dynamical systems literature, this phenomenon is called shadowing. Our analysis relies on the concept of hyperbolicity, as well as on tools from numerical analysis.
1 Introduction
We consider the problem of minimizing a smooth function $f : \mathbb{R}^d \to \mathbb{R}$. This is commonly solved using gradient-based algorithms due to their simplicity and provable convergence guarantees. Two of these approaches — that prevail in machine learning — are: 1) Gradient Descent (GD), which computes a sequence of approximations $(x_k)_{k \ge 0}$ to the minimizer recursively:

(GD) $\quad x_{k+1} = x_k - h \nabla f(x_k),$

given an initial point $x_0$ and a learning rate $h > 0$; and 2) Heavy-ball (HB) [polyak1964some], which computes a different sequence of approximations such that

(HB) $\quad x_{k+1} = x_k - h \nabla f(x_k) + \beta (x_k - x_{k-1}),$

given an initial point $x_0 = x_{-1}$ and a momentum $\beta \in (0, 1)$. A method related to HB is Nesterov's accelerated gradient (NAG), for which $\nabla f$ is evaluated at a different point (see Eq. 2.2.22 in [nesterov2018lectures]).
Analyzing the convergence properties of these algorithms can be complex, especially for NAG, whose convergence proof relies on algebraic tricks that reveal little detail about the acceleration phenomenon, i.e. the celebrated optimality of NAG in convex smooth optimization. An alternative approach is to view these methods as numerical integrators of some ordinary differential equations (ODEs). For instance, GD performs the explicit Euler method on the gradient flow $\dot{X} = -\nabla f(X)$, and HB the semi-implicit Euler method on a damped second-order ODE [polyak1964some; shi2019acceleration]. This connection goes beyond the study of the dynamics of learning algorithms, and has recently been used to get a useful and thought-provoking viewpoint on the computations performed by residual neural networks [chen2018; ciccone2018nais].
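The numerical link between GD and the gradient flow can be checked directly. A minimal sketch, with an illustrative quadratic objective (not taken from the paper), verifying that a GD step is exactly an explicit Euler step on $\dot{X} = -\nabla f(X)$:

```python
import numpy as np

# Illustrative quadratic objective f(x) = 0.5 * x^T A x, so grad f(x) = A x.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
grad = lambda x: A @ x

h = 0.1
x_gd = np.array([1.0, -1.0])
x_euler = np.array([1.0, -1.0])

for _ in range(50):
    x_gd = x_gd - h * grad(x_gd)              # Gradient Descent step
    x_euler = x_euler + h * (-grad(x_euler))  # explicit Euler on dX/dt = -grad f(X)

# The two update rules are algebraically identical.
assert np.allclose(x_gd, x_euler)
```

The equivalence is exact step by step; the interesting question, addressed below, is how far these iterates drift from the *continuous* solution of the ODE.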
In optimization, the relation between discrete algorithms and ordinary differential equations is not new at all: the first contribution in this direction dates back to, at least, 1958 [gavurin1958nonlinear]. This connection has recently been revived by the work of [su2016; krichene2015], where the authors show that the continuous-time limit of NAG is a second-order differential equation with vanishing viscosity. This approach provides an interesting perspective on the somewhat mysterious acceleration phenomenon, connecting it to the theory of damped nonlinear oscillators and to Bessel functions. Based on the prior work of [wilson2016lyapunov; krichene2017acceleration] and follow-up works, it has become clear that the analysis of ODE models can provide simple intuitive proofs of (known) convergence rates and can also lead to the design of new discrete algorithms [zhang2018direct; betancourt2018symplectic; xu2018accelerated; xu2018continuous]. Hence, one area of particular interest has been to study the relation between continuous-time models and their discrete analogs, specifically understanding the error resulting from the discretization of an ODE. The conceptual challenge is that, in the worst case, the approximation error of any numerical integrator grows exponentially^{1}^{1}1The error between the numerical approximation and the actual trajectory with the same initial conditions is, for a $p$-th order method, $O(h^p (e^{Lt} - 1))$ at time $t$, with $L$ the Lipschitz constant of the vector field. as a function of the integration interval [chow1994shadowing; hairer2003geometric]; therefore, convergence rates derived for ODEs cannot be straightforwardly translated to the corresponding algorithms. In particular, obtaining convergence guarantees for the discrete case requires analyzing a discrete-time Lyapunov function, which often cannot be easily recovered from the one used for the continuous-time analysis [shi2018understanding; shi2019acceleration].
Alternatively, (sophisticated) numerical arguments can be used to get a rate through an approximation of such a Lyapunov function [zhang2018direct].
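The worst-case exponential growth of the integration error is easy to observe even in one dimension. A minimal sketch, assuming the expanding linear ODE $\dot{x} = x$ (the exact flow is $e^t$, while explicit Euler produces $(1 + h)^k$):

```python
import math

# dx/dt = L*x with L = 1: flow is exp(t); explicit Euler gives (1 + h)^k.
# The gap |exp(k h) - (1 + h)^k| grows roughly like exp(t) * O(h),
# i.e. exponentially in the integration interval t = k h.
h, L, x0 = 0.1, 1.0, 1.0
errors = []
for k in (10, 50, 100):
    t = k * h
    exact = x0 * math.exp(L * t)
    euler = x0 * (1 + h * L) ** k
    errors.append(abs(exact - euler))

# The error increases by orders of magnitude as the interval grows.
assert errors[0] < errors[1] < errors[2]
assert errors[2] / errors[0] > 100
```

This is precisely the pathology that shadowing circumvents: instead of comparing trajectories with the *same* initial condition, one compares against a nearby orbit.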
In this work, we follow a different approach and directly study conditions under which the flow of an ODE model is shadowed by (i.e. is uniformly close^{2}^{2}2Formally defined in Sec. 2. to) the iterates of an optimization algorithm. The key difference with previous work, which makes our analysis possible, is that we allow the algorithm — i.e. the shadow — to start from a slightly perturbed initial condition compared to the ODE (see Fig. 2 for an illustration of this point). We rely on tools from numerical analysis [hairer2003geometric] as well as concepts from dynamical systems [brin2002introduction], where solutions to ODEs and iterations of an algorithm are viewed as the same object, namely maps in a topological space. Specifically, our analysis builds on the theory of hyperbolic sets, which grew out of the works of Anosov [anosov1967geodesic] and Smale [smale1967differentiable] in the 1960s; it plays a fundamental role in several branches of dynamical systems, but has not, so far, been connected to optimization for machine learning.
In this work we pioneer the use of the theory of shadowing in optimization. In particular, we show that, if the objective is strongly-convex or if we are close to a hyperbolic saddle point, GD and HB are a shadow of (i.e. follow closely) the corresponding ODE models. We back up and complement our theory with experiments on machine learning problems.
To the best of our knowledge, our work is the first to focus on a (Lyapunov function independent) systematic and quantitative comparison of ODEs and algorithms for optimization. Also, we believe the tools we describe in this work can be used to advance our understanding of related machine learning problems, perhaps to better characterize the attractors of neural ordinary differential equations chen2018.
2 Background
This section provides an overview of the fundamental concepts in the theory of dynamical systems which we will use heavily in the rest of the paper.
2.1 Differential equations and flows
Consider the autonomous differential equation $\dot{X}(t) = g(X(t))$. Every $x \in \mathbb{R}^d$ represents a point in $\mathbb{R}^d$ (a.k.a. phase space) and $g : \mathbb{R}^d \to \mathbb{R}^d$ is a vector field which, at any point, prescribes the velocity of the solution that passes through that point. Formally, the curve $X : \mathbb{R} \to \mathbb{R}^d$ is a solution passing through $x_0$ at time $0$ if $\dot{X}(t) = g(X(t))$ for all $t \in \mathbb{R}$ and $X(0) = x_0$. We call this the solution to the initial value problem (IVP) associated with $g$ (starting at $x_0$). The following results can be found in [perko2013differential; khalil2002nonlinear].
Theorem 1.
Assume $g$ is Lipschitz continuous and $x_0 \in \mathbb{R}^d$. The IVP has a unique solution in $C^1(\mathbb{R}, \mathbb{R}^d)$.
This fundamental theorem tells us that, if we integrate $g$ for $t$ time units from position $x_0$, the final position is uniquely determined. Therefore, we can define the family of maps $(\varphi_t)_{t \in \mathbb{R}}$ — the flow of $g$ — such that $\varphi_t(x_0)$ is the solution at time $t$ of the IVP. Intuitively, we can think of $\varphi_t$ as determining the location of a particle starting at $x_0$ and moving via the velocity field $g$ for $t$ seconds. Since the vector field is static, we can move along the solution (in both directions) by iteratively applying this map (or its inverse). This is formalized below.
Proposition 1.
Assume $g$ is Lipschitz continuous. For any $x \in \mathbb{R}^d$, $\varphi_0(x) = x$ and, for any $t, s \in \mathbb{R}$, $\varphi_{t+s} = \varphi_t \circ \varphi_s$. In particular, $\varphi_t$ is a diffeomorphism^{3}^{3}3A differentiable map with well-defined and differentiable inverse. with inverse $\varphi_{-t}$.
In a way, the flow allows us to exactly discretize the trajectory of an ODE. Indeed, let us fix a step size $h$; the associated time-$h$ map $\varphi_h$ integrates the ODE for $h$ seconds starting from some $x \in \mathbb{R}^d$, and we can apply this map recursively to compute a sampled ODE solution. In this paper, we study how flows can approximate the iterations of some optimization algorithm, using the language of dynamical systems and the concept of shadowing.
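The group property of the flow and the time-$h$ sampling just described can be illustrated with the scalar ODE $\dot{x} = -x$, whose flow $\varphi_t(x) = e^{-t} x$ is known in closed form (an illustrative choice):

```python
import math

# For dx/dt = -x, the flow is phi_t(x) = exp(-t) * x (closed form).
phi = lambda t, x: math.exp(-t) * x

x0, t, s = 3.0, 0.7, 1.3
# Group property: phi_{t+s} = phi_t o phi_s, and phi_{-t} inverts phi_t.
assert math.isclose(phi(t + s, x0), phi(t, phi(s, x0)))
assert math.isclose(phi(-t, phi(t, x0)), x0)

# Sampling the solution with the time-h map: iterating phi_h reproduces phi_{k h}.
h, x = 0.25, x0
for k in range(1, 9):
    x = phi(h, x)
    assert math.isclose(x, phi(k * h, x0))
```

Iterating $\varphi_h$ thus yields an *exactly* sampled ODE trajectory, which is the object the optimization algorithm will be compared against.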
2.2 Dynamical systems and shadowing
A dynamical system on $\mathbb{R}^d$ is a map $\Psi : \mathbb{R}^d \to \mathbb{R}^d$. For $k \in \mathbb{N}$, we define $\Psi^k := \Psi \circ \cdots \circ \Psi$ ($k$ times). If $\Psi$ is invertible, then $\Psi^{-k} := \Psi^{-1} \circ \cdots \circ \Psi^{-1}$ ($k$ times). Since $\Psi^{k+m} = \Psi^k \circ \Psi^m$, these iterates form a group if $\Psi$ is invertible, and a semigroup otherwise. We proceed with more definitions.
Definition 1.
A sequence $(x_k)_{k \in \mathbb{N}}$ is a (positive) orbit of $\Psi$ if, for all $k \in \mathbb{N}$, $x_{k+1} = \Psi(x_k)$.
For the rest of this subsection, the reader may think of $\Psi$ as an optimization algorithm (such as GD, which maps $x_k$ to $x_{k+1} = x_k - h \nabla f(x_k)$) and of the orbit $(x_k)_{k \in \mathbb{N}}$ as its iterates. Also, the reader may think of $(y_k)_{k \in \mathbb{N}}$ as the sequence of points derived from the iterative application of the time-$h$ map $\varphi_h$, which is itself a dynamical system, from some $y_0$. The latter sequence represents our ODE approximation of the algorithm $\Psi$. Our goal in this subsection is to understand when a sequence $(y_k)_{k \in \mathbb{N}}$ is "close to" an orbit of $\Psi$. The first notion of such similarity is local.
Definition 2.
The sequence $(y_k)_{k \in \mathbb{N}}$ is a $\delta$-pseudo-orbit of $\Psi$ if, for all $k \in \mathbb{N}$, $\|y_{k+1} - \Psi(y_k)\| \le \delta$.
If $(y_k)_{k \in \mathbb{N}}$ is locally similar to an orbit of $\Psi$ (i.e. it is a pseudo-orbit of $\Psi$), then we may hope that such similarity extends globally. This is captured by the concept of shadowing.
Definition 3.
A $\delta$-pseudo-orbit $(y_k)_{k \in \mathbb{N}}$ of $\Psi$ is $\epsilon$-shadowed if there exists an orbit $(x_k)_{k \in \mathbb{N}}$ of $\Psi$ such that, for all $k \in \mathbb{N}$, $\|y_k - x_k\| \le \epsilon$.
It is crucial to notice that, as depicted in the figure above, we allow the shadowing orbit (a.k.a. the shadow) to start at a perturbed point $x_0 \ne y_0$. A natural question is the following: which properties must $\Psi$ have so that a general pseudo-orbit is shadowed? A lot of research has been carried out on this topic over the last decades (see e.g. [palmer2013shadowing; lanford1985introduction] for a comprehensive survey).
Shadowing for contractive/expanding maps.
A straightforward sufficient condition is related to contraction. $\Psi$ is said to be uniformly contracting if there exists $\rho \in (0, 1)$ (the contraction factor) such that, for all $x, y \in \mathbb{R}^d$, $\|\Psi(x) - \Psi(y)\| \le \rho \|x - y\|$. The next result can be found in [hayes2005survey].
Theorem 2.
(Contraction map shadowing theorem) Assume $\Psi$ is uniformly contracting with factor $\rho$. For every $\epsilon > 0$ there exists $\delta > 0$ such that every $\delta$-pseudo-orbit of $\Psi$ is $\epsilon$-shadowed by the orbit of $\Psi$ starting at $x_0 = y_0$, that is, with no perturbation of the initial condition. Moreover, one can take $\delta = (1 - \rho)\,\epsilon$.
The idea behind this result is simple: since all distances are contracted, errors that are made along the pseudo-orbit vanish asymptotically. For instructive purposes, we report the full proof.
Proof: We proceed by induction: the proposition is trivially true at $k = 0$, since $x_0 = y_0$; next, we assume $\|y_k - x_k\| \le \epsilon$ and we show validity for $k + 1$. We have
$$\|y_{k+1} - x_{k+1}\| \le \|y_{k+1} - \Psi(y_k)\| + \|\Psi(y_k) - \Psi(x_k)\| \le \delta + \rho \|y_k - x_k\| \le \delta + \rho\,\epsilon.$$
Finally, since $\delta = (1 - \rho)\epsilon$, $\|y_{k+1} - x_{k+1}\| \le \epsilon$. ∎
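The bound in the proof is easy to probe numerically. A minimal sketch, assuming an illustrative linear contraction and a randomly perturbed pseudo-orbit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniformly contracting map Psi(x) = rho * x (illustrative, rho < 1).
rho = 0.8
Psi = lambda x: rho * x

# Build a delta-pseudo-orbit: each step is perturbed by at most delta.
delta, n = 0.01, 200
y = [1.0]
for _ in range(n):
    y.append(Psi(y[-1]) + rng.uniform(-delta, delta))

# Shadow: the exact orbit started at the SAME initial condition x0 = y0.
x = [y[0]]
for _ in range(n):
    x.append(Psi(x[-1]))

eps = delta / (1 - rho)  # radius predicted by the contraction shadowing theorem
assert max(abs(a - b) for a, b in zip(y, x)) <= eps + 1e-12
```

Random perturbations typically partially cancel, so the observed deviation sits well inside the predicted radius $\delta / (1 - \rho)$.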
Next, assume $\Psi$ is invertible. If $\Psi$ is uniformly expanding (i.e. $\|\Psi(x) - \Psi(y)\| \ge \rho \|x - y\|$ with $\rho > 1$), then $\Psi^{-1}$ is uniformly contracting with contraction factor $1/\rho$, and we can apply the same reasoning backwards: consider the pseudo-orbit up to iteration $K$, and set $x_K = y_K$ (before, we had $x_0 = y_0$); then, apply the same reasoning as above using the map $\Psi^{-1}$ and find the initial condition $x_0$. In App. B.2 we show that this reasoning holds in the limit $K \to \infty$, i.e. that $x_0$ converges to a suitable point to construct a shadowing orbit under $\Psi$ (precise statement in Thm. B.2).
Shadowing in hyperbolic sets.
In general, for machine learning problems, the algorithm map can be a combination of the two cases above: an example is the pathological contraction-expansion behavior of Gradient Descent around a saddle point [du2017gradient]. To illustrate the shadowing properties of such systems, we shall start by taking $\Psi$ to be linear^{4}^{4}4This case is restrictive, yet it includes the important cases of the dynamics of GD and HB when $f$ is a quadratic (which is the case in, for instance, least squares regression)., that is, $\Psi(x) = Ax$ for some $A \in \mathbb{R}^{d \times d}$. $\Psi$ is called linear hyperbolic if the spectrum of $A$ does not intersect the unit circle $S^1 \subset \mathbb{C}$. If this is the case, we call $\sigma_s$ the part of the spectrum inside $S^1$ and $\sigma_u$ the part outside $S^1$. The spectral decomposition theorem (see e.g. Corollary 4.56 in [irwin2001smooth]) ensures that $\mathbb{R}^d$ decomposes into two invariant closed subspaces (a.k.a. the stable and unstable manifolds) in direct sum: $\mathbb{R}^d = E^s \oplus E^u$. We call $\Psi_s$ and $\Psi_u$ the restrictions of $\Psi$ to these subspaces and $A_s$ and $A_u$ the corresponding matrix representations; the theorem also ensures that the spectrum of $A_s$ is $\sigma_s$ and the spectrum of $A_u$ is $\sigma_u$. Moreover, using standard results in spectral theory (see e.g. App. 4 in [irwin2001smooth]), we can equip $E^s$ and $E^u$ with norms equivalent to the standard Euclidean norm such that, w.r.t. the new norms, $\Psi_u$ is uniformly expanding and $\Psi_s$ uniformly contracting. If $A$ is symmetric, then it is diagonalizable and the norms above can be taken to be Euclidean. To wrap up this paragraph — we can think of a linear hyperbolic system as a map that allows us to decouple its stable and unstable components consistently across iterations. Therefore, a shadowing result directly follows from a combination of Thm. 2 and B.2 [hayes2005survey].
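The stable/unstable splitting can be made concrete for a small symmetric matrix (an illustrative example; the eigenvalues are chosen off the unit circle):

```python
import numpy as np

# Linear map Psi(x) = A x with symmetric A (illustrative): eigenvalues 0.5 and
# 1.6, so the spectrum avoids the unit circle -> linear hyperbolic.
Q = np.linalg.qr(np.array([[1.0, 2.0], [0.3, -1.0]]))[0]  # orthonormal basis
A = Q @ np.diag([0.5, 1.6]) @ Q.T

eigvals, eigvecs = np.linalg.eigh(A)
stable = eigvecs[:, np.abs(eigvals) < 1]    # spans E^s (contracted)
unstable = eigvecs[:, np.abs(eigvals) > 1]  # spans E^u (expanded)

v_s, v_u = stable[:, 0], unstable[:, 0]
# Contraction on E^s, expansion on E^u.
assert np.linalg.norm(A @ v_s) < np.linalg.norm(v_s)
assert np.linalg.norm(A @ v_u) > np.linalg.norm(v_u)
# Invariance: A keeps each eigendirection inside its own subspace.
assert abs(np.dot(A @ v_s, v_u)) < 1e-12
```

Because $A$ is symmetric here, the subspaces are orthogonal and the Euclidean norm already witnesses contraction and expansion, matching the remark above.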
An important further question is whether the result above for linear maps can be generalized. From the classic theory of nonlinear systems [khalil2002nonlinear; araujo2009hyperbolic], we know that in a neighborhood of a hyperbolic point $p$ for $\Psi$ (that is, a point where $\partial\Psi(p)$ has no eigenvalues on $S^1$), $\Psi$ behaves like^{5}^{5}5For a precise description of this similarity, we invite the reader to read on topological conjugacy in araujo2009hyperbolic. a linear system. Similarly, in the analysis of optimization algorithms, it is often used that, near a saddle point, a Hessian-smooth function can be approximated by a quadratic [levy2016power; daneshmand2018escaping; ge2015escaping]. Hence, it should not be surprising that pseudo-orbits of $\Psi$ are shadowed in a neighborhood of a hyperbolic point, if such a neighborhood is invariant [lanford1985introduction]: this happens, for instance, if $p$ is a stable equilibrium point (see definition in [khalil2002nonlinear]).
Starting from the reasoning above, the celebrated shadowing theorem, which has its foundations in the work of Anosov [anosov1967geodesic] and was originally proved in [bowen1975omega], provides the strongest known result of this line of research: near a hyperbolic set of $\Psi$, pseudo-orbits are shadowed. Informally, a set is called hyperbolic if it is invariant and has clearly marked local directions of expansion and contraction, which make $\Psi$ behave similarly to a linear hyperbolic system. We provide the precise definition of a hyperbolic set and the statement of the shadowing theorem in App. B.1.
Unfortunately, despite the ubiquity of hyperbolic sets, it is practically infeasible to establish this property analytically aulbach1996six. Therefore, an important part of the literature coomes1995rigorous; sauer1991rigorous; chow1994shadowing; van1995numerical is concerned with the numerical verification of the existence of a shadowing orbit a posteriori, i.e. given a particular pseudoorbit.
3 The Gradient Descent ODE
We assume some regularity on the objective function $f$, which we seek to minimize.
(H1) $f$ is coercive^{6}^{6}6$f(x) \to \infty$ as $\|x\| \to \infty$., bounded from below and smooth ($f \in C^2(\mathbb{R}^d, \mathbb{R})$, $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$).
In this chapter, we study the well-known continuous-time model for GD: $\dot{X} = -\nabla f(X)$ (GD-ODE). Under (H1), Thm. 1 ensures that the solution to GD-ODE exists and is unique. We denote by $\varphi_h$ the corresponding time-$h$ map. We show that, under some additional assumptions, the orbit of $\varphi_h$ (which we indicate as $(y_k)_{k \in \mathbb{N}}$) is shadowed by an orbit of the GD map with learning rate $h$: $\Psi_{GD}(x) = x - h \nabla f(x)$.
Remark 1 (Bound on the ODE gradients).
Under (H1), let $\mathcal{G}(x_0) := \{\|\nabla f(\varphi_t(x_0))\| : t \ge 0\}$ be the set of gradient magnitudes experienced along the GD-ODE solution starting at any $x_0 \in \mathbb{R}^d$. It is easy to prove, using an argument similar to Prop. 2.2 in [cabot2009long], that coercivity implies $\sup \mathcal{G}(x_0) < \infty$. A similar argument holds for the iterates of Gradient Descent. Hence, for the rest of this chapter it is safe to assume that gradients are bounded: $\|\nabla f(y_k)\| \le M$ for all $k$. For instance, if $f$ is a convex quadratic centered at $x^*$, then we have $M = \|\nabla f(x_0)\|$.
The next result follows from the fact that GD implements the explicit Euler method on GDODE.
Proposition 2.
Assume (H1). $(y_k)_{k \in \mathbb{N}}$ is a $\delta$-pseudo-orbit of $\Psi_{GD}$ with $\delta = \frac{L M h^2}{2}$: for all $k \in \mathbb{N}$, $\|y_{k+1} - \Psi_{GD}(y_k)\| \le \frac{L M h^2}{2}$.
Proof.
Thanks to Thm. 1, since the solution of GD-ODE is a $C^2$ curve, we can write $y_{k+1} = \varphi_h(y_k)$ using Taylor's formula with Lagrange's remainder in Banach spaces (see e.g. Thm 5.2 in [coleman2012calculus]) around time $kh$. Namely: $\varphi_h(y_k) = y_k - h \nabla f(y_k) + e(h)$, where $e(h)$ is the approximation error as a function of $h$, which can be bounded as follows:
$$\|e(h)\| \le \frac{h^2}{2} \sup_{t} \|\ddot{X}(t)\| = \frac{h^2}{2} \sup_{t} \|\nabla^2 f(X(t)) \nabla f(X(t))\| \le \frac{L M h^2}{2}.$$
Hence, since $\Psi_{GD}(y_k) = y_k - h \nabla f(y_k)$, we have $\|y_{k+1} - \Psi_{GD}(y_k)\| = \|e(h)\| \le \frac{L M h^2}{2}$. ∎
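The $O(h^2)$ local error above matches the standard one-step behaviour of explicit Euler, which can be checked on the scalar ODE $\dot{x} = -x$ (an illustrative case, where $L = M = 1$ for $x_0 = 1$):

```python
import math

# One-step (local) error of explicit Euler on dx/dt = -x, started at x0 = 1:
# |exp(-h) - (1 - h)| <= h^2 / 2, matching delta = L*M*h^2 / 2 with L = M = 1.
for h in (0.1, 0.05, 0.01):
    local_err = abs(math.exp(-h) - (1 - h))
    assert local_err <= h * h / 2

# Halving h divides the local error by about 4 (second-order scaling in h).
e1 = abs(math.exp(-0.1) - 0.9)
e2 = abs(math.exp(-0.05) - 0.95)
assert 3.5 < e1 / e2 < 4.5
```

The quadratic-in-$h$ scaling of the local error is what, combined with contraction, yields a shadowing radius linear in $h$ below.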
3.1 Shadowing under strong convexity
As seen in Sec. 2.2, the last proposition provides the first step towards a shadowing result. We also discussed that if, in addition, $\Psi_{GD}$ is a contraction, we directly have shadowing (Thm. 2). Therefore, we start with the following assumption, which will be shown to imply contraction.
(H2) $f$ is $\mu$-strongly-convex: for all $x, y \in \mathbb{R}^d$, $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2$.
The next result follows from standard techniques in convex optimization (see e.g. jung2017fixed).
Proposition 3.
Assume (H1), (H2). If $h \le 1/L$, $\Psi_{GD}$ is uniformly contracting with $\rho = 1 - h\mu$.
We provide the proof in App. D.1 and sketch the idea using a quadratic form: let $f(x) = \frac{1}{2} x^\top A x$ with $A$ symmetric s.t. $\mu I \preceq A \preceq L I$; if $h \le 1/L$ then $\|\Psi_{GD}(x) - \Psi_{GD}(y)\| = \|(I - hA)(x - y)\| \le (1 - h\mu) \|x - y\|$, and Prop. 3 follows directly.
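The contraction factor can be verified directly on a diagonal quadratic (illustrative spectrum; the constants $\mu$, $L$ are assumptions of the sketch):

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x with mu*I <= A <= L*I (illustrative spectrum).
mu, L = 0.5, 4.0
A = np.diag([mu, 1.5, L])
h = 1.0 / L  # step size condition of Prop. 3

# GD map Psi(x) = (I - hA) x; its Lipschitz constant is the operator norm.
M = np.eye(3) - h * A
rho = np.linalg.norm(M, 2)
assert np.isclose(rho, 1 - h * mu)
assert rho < 1  # uniform contraction
```

The norm is attained at the eigendirection of smallest curvature $\mu$, which is why the contraction (and hence the shadowing radius below) is governed by $\mu$ rather than $L$.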
The shadowing result for strongly-convex functions is then a simple application of Thm. 2.
Theorem 3.
Assume (H1), (H2) and let $\epsilon > 0$ be the desired accuracy. Fix $h \le 1/L$; the orbit $(y_k)_{k \in \mathbb{N}}$ of $\varphi_h$ is $\epsilon$-shadowed by the orbit of $\Psi_{GD}$ starting at $x_0 = y_0$, provided that $h \le \frac{2\mu\epsilon}{LM}$.
Proof. By Prop. 2, $(y_k)_{k \in \mathbb{N}}$ is a $\delta$-pseudo-orbit of $\Psi_{GD}$ with $\delta = \frac{LMh^2}{2}$; by Prop. 3, $\Psi_{GD}$ is uniformly contracting with $\rho = 1 - h\mu$. The claim follows from Thm. 2, which requires $\delta \le (1 - \rho)\epsilon = h\mu\epsilon$, i.e. $h \le \frac{2\mu\epsilon}{LM}$. ∎
Notice that we can formulate the theorem in a dual way: namely, for every learning rate we can bound the ODE approximation error (i.e. find the shadowing radius).
Corollary 1.
Assume (H1), (H2). If $h \le 1/L$, $(y_k)_{k \in \mathbb{N}}$ is $\epsilon$-shadowed by the orbit of $\Psi_{GD}$ starting at $x_0 = y_0$, with $\epsilon = \frac{L M h}{2\mu}$.
This result ensures that if the objective is smooth and strongly-convex, then GD-ODE is a theoretically sound approximation of GD. We validate this in Fig. 1 by integrating GD-ODE analytically.
Sharpness of the result.
First, we note that the bound for $\delta$ in Prop. 2 cannot be improved; indeed, it coincides with the well-known local truncation error of Euler's method [hairer1996solving]. Next, pick $f(x) = x^2/2$, $x_0 = 1$ and $h = 1 = 1/L$. For this choice, gradients are smaller than 1 for both GD-ODE and GD, hence $M = 1$. Our formula for the global shadowing radius gives $\epsilon = \frac{LMh}{2\mu} = \frac{1}{2}$, equal to the local error — i.e. as tight as the well-established local result. In fact, GD jumps to 0 in one iteration, while $\varphi_h(x_0) = e^{-1} \approx 0.37$; hence the actual deviation is below $1/2$. For smaller steps, like $h = 0.1$, our formula predicts $\epsilon = 0.05$. In simulation, we have maximum deviation at $k \approx 10$, and it is around $0.02$, only 2.5 times smaller than our prediction.
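The sharpness discussion above can be probed numerically on the 1-D quadratic $f(x) = x^2/2$ (an illustrative reconstruction; step size $h = 0.1$ assumed), comparing GD iterates with the exact sampled flow $e^{-kh}$ against the radius predicted by Cor. 1:

```python
import math

# 1-D quadratic f(x) = x^2 / 2, so L = mu = 1 and M = 1 for x0 = 1.
h, x0 = 0.1, 1.0
radius = h / 2  # Cor. 1 prediction: eps = L*M*h / (2*mu) = h / 2

deviations = []
x = x0
for k in range(1, 200):
    x = (1 - h) * x              # GD iterate x_k = (1 - h)^k
    ode = x0 * math.exp(-h * k)  # exact GD-ODE flow sampled at t = k h
    deviations.append(abs(ode - x))

max_dev = max(deviations)
# The observed deviation stays within the predicted shadowing radius ...
assert max_dev <= radius
# ... and within a small constant factor of it, so the bound is fairly tight.
assert max_dev > radius / 4
```

No perturbation of the initial condition is needed here, consistent with the contraction case of Thm. 2.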
The convex case.
If $f$ is convex but not strongly-convex, GD is non-expanding and the error between $(y_k)$ and $(x_k)$ cannot be bounded by a constant^{7}^{7}7This is in line with the requirement of hyperbolicity in the shadowing theory: a convex function might have zero curvature, hence the corresponding gradient system is going to have an eigenvalue on the unit circle. but grows slowly: in App. C.1 we show the error is upper bounded by $k L M h^2 / 2$, i.e. it grows only linearly with the iteration counter $k$.
Extension to stochastic gradients.
We extend Thm. 3 to account for perturbations: let $\Psi_{SGD}(x) = x - h \tilde{\nabla} f(x)$, where $\tilde{\nabla} f$ is a stochastic gradient s.t. $\|\tilde{\nabla} f(x) - \nabla f(x)\| \le \sigma$. Then, for $\mu$ big enough, we can (proof in App. C.2) choose $h$ so that the orbit of the (deterministic) $\varphi_h$ is shadowed by the stochastic orbit of $\Psi_{SGD}$ starting from $y_0$. So, if the noise $\sigma$ is small enough, GD-ODE can be used to study SGD. This result is well known from stochastic approximation [kushner2003stochastic].
3.2 Towards nonconvexity: behaviour around local maxima and saddles
We first study the strong-concavity case and then combine it with strong-convexity to assess the shadowing properties of GD around a saddle.
Strongconcavity.
In this case, it follows from the same argument as Prop. 3 that GD is uniformly expanding with factor $\rho = 1 + h\mu$. As mentioned in the background section (see Thm. B.2 for the precise statement), this case is conceptually identical to strong-convexity once we reverse the arrow of time (so that expansions become contractions). We are allowed to make this step because, under (H1) and if $h < 1/L$, $\Psi_{GD}$ is a diffeomorphism (see e.g. [lee2016gradient], Prop. 4.5). In particular, the backwards GD map $\Psi_{GD}^{-1}$ is contracting with factor $1/(1 + h\mu)$. Consider now the initial part of an orbit of GD-ODE such that the gradient norms are still bounded by $M$, and let $y_K$ be the last point of such an orbit. It is easy to realize that $(y_K, y_{K-1}, \dots, y_0)$ is a pseudo-orbit, with reversed arrow of time, of $\Psi_{GD}^{-1}$. Hence, Thm. 3 directly ensures shadowing, choosing $x_K = y_K$ and iterating $\Psi_{GD}^{-1}$ to recover the initial condition $x_0$. Crucially — the initial condition of the shadow we found is slightly^{8}^{8}8Indeed, Thm. 3 applied backward in time from $y_K$ ensures that $\|x_0 - y_0\| \le \epsilon$. perturbed: $x_0 \ne y_0$ in general. Notice that, if we instead start GD from exactly $y_0$, the iterates will diverge from the ODE trajectory, since every error made along the pseudo-orbit is amplified. We show this for the unstable direction of a saddle in Fig. 1(a).
Quadratic saddles.
As discussed in Sec. 2, if the space can be split into stable (contracting) and unstable (expanding) invariant subspaces ($\mathbb{R}^d = E^s \oplus E^u$), then every pseudo-orbit is shadowed. This is a particular case of the shadowing theorem for hyperbolic sets [bowen1975omega]. In particular, we saw that if $\Psi$ is linear hyperbolic such a splitting exists, and $E^s$ and $E^u$ are the subspaces spanned by the eigenvectors associated with the stable and unstable eigenvalues, respectively. It is easy to realize that $\Psi_{GD}$ is linear if the objective is a quadratic; indeed, if $f(x) = \frac{1}{2}(x - x^*)^\top H (x - x^*)$, then $\Psi_{GD}(x) = x - hH(x - x^*)$. It is essential to note that hyperbolicity requires $I - hH$ to have no eigenvalue on $S^1$ — i.e. that the saddle has only directions of strictly positive or strictly negative curvature. This splitting allows us to study shadowing on $E^s$ and $E^u$ separately: for $E^s$ we can use the shadowing result for strong-convexity and for $E^u$ the shadowing result for strong-concavity, along with the computation of the initial condition for the shadow in these subspaces. We summarize this result in the next proposition, which we prove formally in App. C.3. To enhance understanding, we illustrate the procedure of construction of a shadow in Fig. 2.
Proposition 4.
Let $f$ be quadratic centered at $x^*$, with Hessian $H$ having no eigenvalues in the interval $(-\mu, \mu)$, for some $\mu > 0$. Assume the orbit $(y_k)_{k \in \mathbb{N}}$ of $\varphi_h$ is s.t. (H1) holds, with gradients bounded by $M$, up to iteration $K$. Let $\epsilon$ be the desired tracking accuracy; if $h \le \min\{1/L,\ 2\mu\epsilon/(LM)\}$, then $(y_k)$ is $\epsilon$-shadowed by an orbit of $\Psi_{GD}$ up to iteration $K$.
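The construction behind this result — matching the stable component at the start and the unstable component at the end — can be sketched on a 2-D saddle (illustrative Hessian $\mathrm{diag}(1, -1)$; the closed-form flow is used for the ODE):

```python
import math

# Quadratic saddle f(x) = 0.5 * (x1^2 - x2^2): Hessian diag(1, -1).
h, K = 0.01, 600
y0 = (1.0, 1e-3)  # start near the stable manifold

# Sampled GD-ODE trajectory: component-wise flow exp(-t) and exp(+t).
ys = [(y0[0] * math.exp(-h * k), y0[1] * math.exp(h * k)) for k in range(K + 1)]

def max_deviation(x0):
    """Run GD from x0 and track the worst distance to the sampled ODE."""
    x, dev = x0, 0.0
    for k in range(K + 1):
        dev = max(dev, abs(x[0] - ys[k][0]), abs(x[1] - ys[k][1]))
        x = ((1 - h) * x[0], (1 + h) * x[1])  # GD step on the saddle
    return dev

# Shadow: keep the stable component at k = 0, match the unstable one at k = K
# by pulling it back through the expanding map.
x0_shadow = (ys[0][0], ys[K][1] / (1 + h) ** K)

assert max_deviation(x0_shadow) <= h / 2        # within the shadowing radius
assert max_deviation(x0_shadow) < max_deviation(y0)  # perturbing x0 helps
```

Starting GD from exactly $y_0$ instead lets the unstable-direction error get amplified at every step, which is the divergence illustrated in Fig. 1(a).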
General saddles.
In App. C.4 we take inspiration from the literature on the approximation of stiff ODEs near stationary points [lubich1995runge; beyn1987numerical; larsson1994behavior] and use the Banach fixed-point theorem to generalize the result above to perturbed quadratic saddles $f(x) = \frac{1}{2} x^\top H x + g(x)$, where the perturbation $g$ is required to be smooth with a sufficiently small gradient Lipschitz constant. This condition is intuitive, since $g$ effectively counteracts the contraction/expansion induced by $H$.
Theorem 4.
Let $\bar{f}$ be a quadratic centered at $x^*$, with Hessian $H$ having no eigenvalues in the interval $(-\mu, \mu)$, for some $\mu > 0$. Let $f$ be our objective function, of the form $f(x) = \bar{f}(x) + g(x)$, with $g$ a smooth perturbation such that $\nabla g$ is Lipschitz with a sufficiently small constant. Assume the orbit $(y_k)_{k \in \mathbb{N}}$ of $\varphi_h$ on $f$ is s.t. (H1) (stated for $f$) holds, with gradients bounded by $M$, up to iteration $K$. Assume $h \le 1/L$ and let $\epsilon$ be the desired tracking accuracy; if, in addition, $h$ is small enough (precise condition in App. C.4), then $(y_k)$ is $\epsilon$-shadowed by an orbit of $\Psi_{GD}$ on $f$ up to iteration $K$.
We note that, in the special case of strongly-convex quadratics, the theorem above recovers the shadowing condition of Cor. 1 up to a constant factor, which is due to the different proof techniques.
Gluing landscapes.
The last result can be combined with Thm. 3 to capture the dynamics of GD-ODE when directions of negative curvature are encountered during the early stage of training, followed by a strongly-convex region as we approach a local minimum (such as the one in Fig. 3). Note that, since under (H1) the objective is $C^2$, there will be a few iterations in the "transition phase" (non-convex to convex) where the curvature is very close to zero. These few iterations are not captured by Thm. 3 and Thm. 4; indeed, the error behaviour in Fig. 3 is pathological around the transition. Nonetheless, as we showed for the convex case in Sec. 3.1, the approximation error during these iterations grows only slowly. In the numerical analysis literature, the procedure we just sketched was made precise in [chow1994shadowing], who proved that a gluing argument is successful if the number of unstable directions on the ODE path is non-increasing.
4 The Heavy-ball ODE
We now turn our attention to analyzing Heavy-ball, whose continuous representation is $\ddot{X} + \alpha \dot{X} + \nabla f(X) = 0$, where $\alpha$ is a positive number called the viscosity parameter. Following [maddison2018hamiltonian], we introduce the velocity variable $V = \dot{X}$ and consider the dynamics of $(X, V)$ (i.e. in phase space).

(HB-ODE) $\quad \dot{X} = V, \qquad \dot{V} = -\alpha V - \nabla f(X).$
Under (H1), we denote by $\varphi_h$ the corresponding joint time-$h$ map and by $(y_k, v_k)_{k \in \mathbb{N}}$ its orbit (i.e. the sampled HB-ODE trajectory). First, we show that a semi-implicit [hairer2003geometric] integration of Eq. (HB-ODE) yields HB.
Given a point $(x_k, v_k)$ in phase space, this integrator computes $(x_{k+1}, v_{k+1})$ as

(HB-PS) $\quad v_{k+1} = v_k - h \alpha v_k - h \nabla f(x_k), \qquad x_{k+1} = x_k + h v_{k+1}.$

Notice that $v_k = (x_k - x_{k-1})/h$ and $x_{k+1} = x_k + (1 - \alpha h)(x_k - x_{k-1}) - h^2 \nabla f(x_k)$, which is exactly one iteration of HB, with $\beta = 1 - \alpha h$ and learning rate $h^2$. We therefore have established a numerical link between HB and HB-ODE, similar to the one presented in [shi2019acceleration]. In the following, we use $\Psi_{HB}$ to denote the one-step map in phase space defined by HB-PS.
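This algebra can be confirmed numerically: iterating the semi-implicit phase-space scheme and iterating the HB recursion with the identification $\beta = 1 - \alpha h$, learning rate $h^2$ (consistent with the semi-implicit discretization; parameters are illustrative) produce the same trajectory:

```python
import numpy as np

# 1-D quadratic f(x) = 0.5 * x^2, so grad f(x) = x (illustrative).
grad = lambda x: x
alpha, h = 0.5, 0.1

# Semi-implicit Euler on the phase-space system (HB-PS).
x_ps, v_ps = 1.0, 0.0
traj_ps = [x_ps]
for _ in range(30):
    v_ps = v_ps - h * alpha * v_ps - h * grad(x_ps)  # velocity update first
    x_ps = x_ps + h * v_ps                           # then position update
    traj_ps.append(x_ps)

# Plain HB recursion with beta = 1 - alpha*h and learning rate h^2.
beta, lr = 1 - alpha * h, h * h
x_prev, x_cur = 1.0, 1.0  # x_{-1} = x_0, i.e. zero initial velocity
traj_hb = [x_cur]
for _ in range(30):
    x_next = x_cur - lr * grad(x_cur) + beta * (x_cur - x_prev)
    x_prev, x_cur = x_cur, x_next
    traj_hb.append(x_cur)

assert np.allclose(traj_ps, traj_hb)
```

Updating the velocity before the position is what makes the scheme semi-implicit; using the *old* velocity instead would give plain explicit Euler and a slightly different discrete map.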
Similarly to Remark 1, by Prop. 2.2 in [cabot2009long], (H1) implies that gradients are bounded by a constant $M$. Hence, we can get an analogue of Prop. 2 (see App. D.2 for the proof).
Proposition 5.
Assume (H1) and let $(z_k)_{k \in \mathbb{N}} = (y_k, v_k)_{k \in \mathbb{N}}$ be the sampled HB-ODE trajectory in phase space. Then, $(z_k)_{k \in \mathbb{N}}$ is a $\delta$-pseudo-orbit of $\Psi_{HB}$ with $\delta = O(h^2)$ (explicit constant in App. D.2).
Strongconvexity.
The next step, as done for GD, would be to consider strongly-convex landscapes and derive a formula for the shadowing radius (see Thm. 3). However, it is easy to realize that, in this setting, HB is not uniformly contracting. Indeed, it is notoriously not a descent method. Hence, it is unfortunately difficult to state an equivalent of Thm. 3 using similar arguments. We believe that the reason behind this difficulty lies at the very core of the acceleration phenomenon. Indeed, as noted by [ghadimi2015global], the currently known bounds for HB in the strongly-convex setting might be loose due to the tediousness of its analysis [shi2018understanding] — which is also reflected here. Hence, we leave this theoretical investigation (as well as the connection to acceleration and symplectic integration [shi2019acceleration]) to future research, and show instead experimental results in Sec. 5 and Fig. 4.
Quadratics.
In App. D.3 we show that, if $f$ is quadratic, then $\Psi_{HB}$ is linear hyperbolic. Hence, as discussed in the introduction and thanks to Prop. 5, there exists a norm^{9}^{9}9For GD, this was the Euclidean norm. For HB, the norm we have to pick is different, since (differently from GD) $\Psi_{HB}(z) = Az$ with $A$ non-symmetric. The interested reader can find more information in App. 4 of irwin2001smooth. under which we have shadowing, and we can recover a result analogous to Prop. 4 and to its perturbed variant (App. C.4). We show this empirically in Fig. 4 and compare with the GD formula for the shadowing radius.
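The linear-hyperbolicity claim can be checked by writing one HB step on a quadratic as multiplication of the state $(x_k, x_{k-1})$ by a fixed matrix (illustrative parameters):

```python
import numpy as np

# HB on a quadratic f(x) = 0.5 * lam * x^2 is linear in the state (x_k, x_{k-1}):
# x_{k+1} = (1 + beta - lr*lam) x_k - beta x_{k-1}.
lam, lr, beta = 2.0, 0.01, 0.9
A = np.array([[1 + beta - lr * lam, -beta],
              [1.0,                 0.0]])  # non-symmetric one-step matrix

eigs = np.linalg.eigvals(A)
# Spectrum off the unit circle -> linear hyperbolic (here: both inside).
assert all(abs(abs(e) - 1.0) > 1e-6 for e in eigs)
assert all(abs(e) < 1 for e in eigs)  # positive curvature: contraction
```

For these parameters the eigenvalues form a complex-conjugate pair of modulus $\sqrt{\beta} < 1$ (the determinant of $A$ is $\beta$), which is why a non-Euclidean norm is needed to witness contraction, as the footnote above points out.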
5 Experiments on empirical risk minimization
We consider the problem of binary classification of digits 3 and 5 from the MNIST dataset [lecun1998mnist]. We take $N$ training examples $(a_i, b_i)$, where $a_i$ is the $i$-th image (adding a bias) and $b_i$ is the corresponding label. We use a regularized sigmoid loss (non-convex) with regularization parameter $\lambda$. Compared to the cross-entropy loss (convex), this choice often leads to better generalization [shalev2011learning]. For 2 different choices of $\lambda$, using the full gradient, we simulate GD-ODE using fourth-order Runge-Kutta [hairer2003geometric] (high-accuracy integration) and run GD with learning rate $h$, which yields a steady decrease in the loss. We simulate HB-ODE and run HB under the same conditions, with a large momentum parameter (to induce significant momentum). In Fig. 5, we show the behaviour of the approximation error, measured in percentage w.r.t. the discretized ODE trajectory, until convergence. We make a few comments on the results.


Heavy regularization (in green) increases the contractiveness of GD around the solution, yielding a small approximation error (it converges to zero) after a few iterations — exactly as in Fig. 1. For a small $\lambda$ (in magenta), the error between the trajectories is bounded but slower to converge to zero, since local errors tend not to be corrected (cf. discussion for convex objectives in Sec. 3.1).

Locally, as we saw in Prop. 2, large gradients make the algorithm deviate significantly from the ODE. Since regularization increases the norm of the gradients experienced in early training, a larger $\lambda$ will cause the approximation error to grow rapidly in the first iterations (when gradients are large). Indeed, Cor. 1 predicts that the shadowing radius is proportional^{10}^{10}10Alternatively, looking at the formula in Cor. 1 and noting that regularization increases $L$ and $\mu$ by the same amount $\lambda$, the ratio $L/(2\mu)$ is only mildly affected, so the dominant effect of regularization on $\epsilon$ is through the larger gradient bound $M$. to $M$.

Since HB has momentum, we notice that it indeed converges faster than GD [sutskever2013importance]. As expected (see point 2), this has a bad effect on the global shadowing radius, which is 5 times bigger. On the other hand, the error from HB-ODE is also much quicker to decay to zero when compared to GD.
Last, in Fig. 6 we explore the effect of the learning rate on the shadowing radius and find a good match with the prediction of Cor. 1. Indeed, the experiment confirms that this relation is linear, $\epsilon \propto h$, with no dependency on the number of iterations (as opposed to the classical results discussed in the introduction).
All in all, we conclude from the experiments that the intuition developed from our analysis can potentially explain the behaviour of the GD-ODE and HB-ODE approximations in simple machine learning problems.
6 Conclusion
In this work we used the theory of shadowing to motivate the success of continuous-time models in optimization. In particular, we showed that, if the cost is strongly-convex or hyperbolic, then any GD-ODE trajectory is shadowed by a trajectory of GD, with a slightly perturbed initial condition. Weaker but similar results hold for HB. To the best of our knowledge, this work is the first to provide this type of quantitative link between ODEs and algorithms in optimization. Moreover, our work leaves open many directions for future research, including the derivation of a formula for the shadowing radius of Heavy-ball (which will likely give insights on acceleration), the extension to other algorithms (e.g. Newton's method) and to stochastic settings. Actually, a partial answer to the last question was recently provided for the strongly-convex case in [feng2019uniform]. In this work, the authors use backward error analysis to study how close SGD is to its approximation by a high-order (Hessian-dependent) stochastic modified equation. It would be interesting to derive a similar result for a stochastic variant of GD-ODE, such as the one studied in [orvieto2018continuous].
Acknowledgements.
We are grateful for the enlightening discussions on numerical methods and shadowing with Prof. Lubich (in Tübingen), Prof. Hofmann (in Zürich) and Prof. Benettin (in Padua). Also, we would like to thank Gary Bécigneul for his help in completing the proof of hyperbolicity of Heavy-ball and Foivos Alimisis for pointing out a mistake in the initial draft.
References
Appendix
Appendix A Notation and problem definition
We work in $\mathbb{R}^d$ with the Euclidean norm $\|\cdot\|$. If $A$ is a matrix, $\|A\|$ denotes its operator norm: $\|A\| = \sup_{\|x\| = 1} \|Ax\|$. We denote by $C^k(\mathbb{R}^d, \mathbb{R}^m)$ the family of $k$-times continuously differentiable functions from $\mathbb{R}^d$ to $\mathbb{R}^m$. We denote by $\partial g$ the Jacobian (matrix of partial derivatives) of $g$; if $m = 1$ then we denote its gradient by $\nabla g$. If the dimensions are clear, we simply write $C^k$.
We seek to minimize $f : \mathbb{R}^d \to \mathbb{R}$. We list below some assumptions we will refer to.
(H1) $f$ is coercive^{11}^{11}11$f(x) \to \infty$ as $\|x\| \to \infty$., bounded from below and smooth ($f \in C^2(\mathbb{R}^d, \mathbb{R})$, $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^d$).
(H2) $f$ is $\mu$-strongly-convex: for all $x, y \in \mathbb{R}^d$, $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2$.
Appendix B Additional results on shadowing
We state here some useful details on shadowing for the interested reader. Also, we propose a simple proof of the expansion map shadowing theorem based on [ombach1993simplest].
B.1 Shadowing near hyperbolic sets
We provide the definitions and results needed to state precisely the shadowing theorem [anosov1967geodesic]. Our discussion is based on [lanford1985introduction]. Let $\Psi : \mathbb{R}^d \to \mathbb{R}^d$ be a diffeomorphism.
Definition 4.
We say $p$ is a hyperbolic point for $\Psi$ if there exists a splitting $\mathbb{R}^d = E^s_p \oplus E^u_p$ into linear subspaces such that, for all $k \in \mathbb{N}$,
$$\|\partial(\Psi^k)(p)\, v\| \le C \rho^k \|v\| \ \text{ for } v \in E^s_p, \qquad \|\partial(\Psi^{-k})(p)\, v\| \le C \rho^k \|v\| \ \text{ for } v \in E^u_p,$$
where $C > 0$ and $\rho \in (0, 1)$ can be taken to depend only on $\Psi$, not on $p$.
Remark 2.
It can be shown that, if $p$ is hyperbolic for $\Psi$, then $\Psi(p)$ and $\Psi^{-1}(p)$ are also hyperbolic points. Moreover, using the chain rule, we have $\partial(\Psi^k)(p) = \partial\Psi(\Psi^{k-1}(p)) \cdots \partial\Psi(\Psi(p))\, \partial\Psi(p)$.
Definition 5.
A compact set $\Lambda \subset \mathbb{R}^d$ is said to be a hyperbolic set for $\Psi$ if it is invariant for $\Psi$ (i.e. $\Psi(\Lambda) = \Lambda$), each $p \in \Lambda$ is hyperbolic, and the constants (see Def. 4) can be taken to be independent of $p$. If the entire space $\mathbb{R}^d$ is hyperbolic, then $\Psi$ is called an Anosov diffeomorphism (from [anosov1967geodesic]).
We now state the main result in the literature, originally proved in [bowen1975omega], which essentially tells us that pseudo-orbits sufficiently near a hyperbolic set are shadowed.
[Shadowing theorem] Let $\Lambda$ be a hyperbolic set for $\Psi$, and let $\epsilon > 0$. Then there exists $\delta > 0$ such that, for every $\delta$-pseudo-orbit $\{y_k\}_{k \in \mathbb{Z}}$ with $\operatorname{dist}(y_k, \Lambda) \le \delta$ for all $k$, there is $x \in \mathbb{R}^d$ such that
\[\|\Psi^k(x) - y_k\| \le \epsilon \quad \text{for all } k \in \mathbb{Z}.\]
Furthermore, if $\epsilon$ is small enough, $x$ is unique.
Example.
The most famous example of an Anosov diffeomorphism is Arnold's cat map [brin2002introduction] $(x, y) \mapsto (2x + y,\, x + y) \bmod 1$ on the torus $\mathbb{T}^2$, which can be thought of as the quotient space $\mathbb{R}^2 / \mathbb{Z}^2$. It is shown in Fig. 7; details in [brin2002introduction].
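Since the Jacobian of the cat map is the constant matrix $\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$, hyperbolicity of every point reduces to a single spectral check. The sketch below (our own illustration, not part of the paper) performs it in plain Python:

```python
import math

# Arnold's cat map on the torus T^2 = R^2 / Z^2:
#   (x, y) -> (2x + y, x + y)  (mod 1).
def cat_map(x, y):
    """One step of Arnold's cat map."""
    return (2 * x + y) % 1.0, (x + y) % 1.0

# The Jacobian is the constant matrix A = [[2, 1], [1, 1]]
# (trace = 3, det = 1), with eigenvalues (3 +/- sqrt 5) / 2.
trace, det = 3.0, 1.0
lam_unstable = (trace + math.sqrt(trace**2 - 4 * det)) / 2  # ~2.618
lam_stable = (trace - math.sqrt(trace**2 - 4 * det)) / 2    # ~0.382

# One eigenvalue strictly inside the unit circle (stable subspace E^s),
# one strictly outside (unstable subspace E^u), uniformly in x:
# the cat map is an Anosov diffeomorphism.
assert lam_stable < 1.0 < lam_unstable
assert abs(lam_stable * lam_unstable - det) < 1e-12  # det = 1: area preserving
```

Because the splitting constants are the same at every point, the whole torus is a hyperbolic set in the sense of Def. 5.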
B.2 Expansion map shadowing theorem
First, we remind the reader of an important result in analysis.
[Banach fixed-point theorem] Let $(X, d)$ be a non-empty complete metric space with a contraction mapping $T : X \to X$. Then $T$ admits a unique fixed point $x^*$ (i.e. $T(x^*) = x^*$). Furthermore, $x^*$ can be found as follows: start with an arbitrary element $x_0 \in X$ and iteratively apply $T$; $x^*$ is the limit of the sequence $\{T^k(x_0)\}_{k \ge 0}$.
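As a quick illustration (a toy example of ours, not from the paper), the iteration prescribed by the theorem can be run on the contraction $T(x) = \cos(x)$, which is a contraction on $[0, 1]$ since $|T'(x)| = |\sin(x)| \le \sin(1) < 1$ there:

```python
import math

# Banach fixed-point iteration for the contraction T(x) = cos(x) on [0, 1].
# Starting from any point of [0, 1], the iterates T^k(x0) converge to the
# unique fixed point x* = cos(x*) (the so-called Dottie number, ~0.739085).
def T(x):
    return math.cos(x)

x = 0.0
for _ in range(200):
    x = T(x)

# x is now, up to floating-point precision, the fixed point of T.
assert abs(x - T(x)) < 1e-12
```

The proof of the expansion map shadowing theorem below applies exactly this mechanism, with $X$ a space of sequences rather than of points.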
We recall that $\Psi$ is said to be uniformly expanding if there exists $\lambda > 1$ (expansion factor) such that for all $x, y \in \mathbb{R}^d$, $\|\Psi(x) - \Psi(y)\| \ge \lambda \|x - y\|$. The next result is adapted from Prop. 1 in [ombach1993simplest].
[Expansion map shadowing theorem] If $\Psi$ is uniformly expanding with expansion factor $\lambda > 1$, then for every $\epsilon > 0$ there exists $\delta > 0$ such that every $\delta$-pseudo-orbit $\{y_k\}_{k \ge 0}$ of $\Psi$ is $\epsilon$-shadowed by the orbit of $\Psi$ starting at some $x_0$, that is $\|\Psi^k(x_0) - y_k\| \le \epsilon$ for all $k \ge 0$. Moreover,
(1) $\delta = (\lambda - 1)\, \epsilon$.
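The theorem can be checked numerically on the simplest uniformly expanding map, $\Psi(x) = \lambda x$ on $\mathbb{R}$ (a toy instance of our own; all constants are illustrative). Here the shadowing initial condition can even be written in closed form, by pulling every perturbation back through $\Psi^{-1}$:

```python
import random

# Numerical check of the expansion map shadowing theorem for the
# uniformly expanding map Psi(x) = lam * x, lam > 1 (toy instance).
# A delta-pseudo-orbit is epsilon-shadowed with epsilon = delta / (lam - 1).
lam, delta, N = 3.0, 1e-2, 15
psi = lambda x: lam * x

random.seed(0)
errors = [random.uniform(-delta, delta) for _ in range(N)]

# delta-pseudo-orbit: y_{k+1} = Psi(y_k) + e_k with |e_k| <= delta.
y = [0.1]
for e in errors:
    y.append(psi(y[-1]) + e)

# Shadowing initial condition: each perturbation e_j is absorbed into
# x_0 = y_0 + sum_j e_j / lam^(j + 1), then the true orbit is iterated.
x0 = y[0] + sum(e / lam ** (j + 1) for j, e in enumerate(errors))
x = [x0]
for _ in range(N):
    x.append(psi(x[-1]))

eps = delta / (lam - 1.0)
assert max(abs(a - b) for a, b in zip(x, y)) <= eps + 1e-9
```

Note that the orbits themselves diverge exponentially, yet the true orbit started at the slightly perturbed $x_0$ stays uniformly $\epsilon$-close to the pseudo-orbit, exactly as the theorem predicts.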
Proof.
Fix $\epsilon > 0$ and define $\delta := (\lambda - 1)\epsilon$. Let $\{y_k\}_{k \ge 0}$ be a $\delta$-pseudo-orbit of $\Psi$ and extend it to negative iterations: $y_{k-1} := \Psi^{-1}(y_k)$ for all $k \le 0$. We call the resulting sequence $y = \{y_k\}_{k \in \mathbb{Z}}$. We consider the set of sequences which are pointwise close to $y$:
\[X_\epsilon := \big\{ x = \{x_k\}_{k \in \mathbb{Z}} \ : \ \|x_k - y_k\| \le \epsilon \ \text{for all } k \in \mathbb{Z} \big\}.\]
We endow $X_\epsilon$ with the supremum metric $d_\infty$, defined as follows:
\[d_\infty(x, x') := \sup_{k \in \mathbb{Z}} \|x_k - x'_k\|.\]
It is easy to show that $(X_\epsilon, d_\infty)$ is complete. Next, we define an operator $T$ on sequences in $X_\epsilon$, such that
\[(T x)_k := \Psi^{-1}(x_{k+1}) \quad \text{for all } k \in \mathbb{Z}.\]
Notice that, if $x = T x$, then $x$ is an orbit of $\Psi$. If, in addition, $x \in X_\epsilon$, then by definition we have that $x$ shadows $y$. We clearly want to apply Thm. B.2, and we need to verify that
1) $T(X_\epsilon) \subseteq X_\epsilon$. This follows from the fact that $\Psi^{-1}$ is a contraction with contraction factor $1/\lambda$, similarly to the contraction map shadowing theorem (see Thm. 2). Let $x \in X_\epsilon$; for all $k \in \mathbb{Z}$,
\[\|(T x)_k - y_k\| \le \|\Psi^{-1}(x_{k+1}) - \Psi^{-1}(y_{k+1})\| + \|\Psi^{-1}(y_{k+1}) - \Psi^{-1}(\Psi(y_k))\| \le \frac{\epsilon}{\lambda} + \frac{\delta}{\lambda} = \epsilon.\]
2) $T$ is itself a contraction in $(X_\epsilon, d_\infty)$, since
\[d_\infty(T x, T x') = \sup_{k \in \mathbb{Z}} \|\Psi^{-1}(x_{k+1}) - \Psi^{-1}(x'_{k+1})\| \le \frac{1}{\lambda} d_\infty(x, x').\]
The statement of the expansion map shadowing theorem then follows directly from the Banach fixed-point theorem applied to $T$, which is a contraction on $X_\epsilon$. ∎
Appendix C Shadowing in optimization
We present here the proofs of some results we claim in the main paper.
C.1 Non-expanding maps (i.e. the convex setting)
{propapp}If $\Psi$ is uniformly non-expanding, i.e. $\|\Psi(x) - \Psi(y)\| \le \|x - y\|$ for all $x, y \in \mathbb{R}^d$, then every $\delta$-pseudo-orbit $\{y_k\}_{k \ge 0}$ of $\Psi$ is such that the orbit $\{x_k\}_{k \ge 0}$ of $\Psi$ starting at $x_0 = y_0$ satisfies $\|x_k - y_k\| \le k \delta$ for all $k \ge 0$.
Proof.
The proposition is trivially true at $k = 0$; next, we assume the proposition holds at $k$ and we show validity for $k + 1$. We have
\[\|x_{k+1} - y_{k+1}\| \le \|\Psi(x_k) - \Psi(y_k)\| + \|\Psi(y_k) - y_{k+1}\| \le \|x_k - y_k\| + \delta \le (k + 1)\delta.\]
∎
The GD map on a convex function is non-expanding, hence this result bounds the GD-ODE approximation error, which grows only linearly (i.e. slowly) as a function of the number of iterations.
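The linear growth is easy to observe numerically. The sketch below (our own; the quadratic $f(x) = x^2/2$ and all constants are illustrative) perturbs each step of the GD map by at most $\delta$, as a numerical ODE integrator would, and checks the bound $\|x_k - y_k\| \le k\delta$:

```python
import random

# The GD map Psi(x) = x - h * f'(x) on the convex quadratic f(x) = x^2 / 2
# is non-expanding for h <= 1: here |Psi(x) - Psi(y)| = (1 - h) |x - y|.
h, delta, N = 0.1, 1e-3, 100
psi = lambda x: x - h * x  # one GD step on f(x) = x^2 / 2

random.seed(0)

# delta-pseudo-orbit: a GD trajectory with per-step perturbations <= delta
# (e.g. the local error made by a numerical integrator of the ODE).
y = [1.0]
for _ in range(N):
    y.append(psi(y[-1]) + random.uniform(-delta, delta))

# Exact GD orbit started at the same initial point.
x = [y[0]]
for _ in range(N):
    x.append(psi(x[-1]))

# The deviation grows at most linearly: |x_k - y_k| <= k * delta.
for k in range(N + 1):
    assert abs(x[k] - y[k]) <= k * delta + 1e-12
```

In practice the observed deviation is usually far below the worst-case bound, since consecutive perturbations tend to cancel.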
C.2 Perturbed contracting maps (i.e. the stochastic gradients setting)
{propapp}Assume $\Psi$ is uniformly contracting with contraction factor $\rho < 1$. Let $\tilde\Psi$ be the perturbed dynamical system, that is $\tilde\Psi(x) = \Psi(x) + \xi(x)$ where $\|\xi(x)\| \le \sigma$ for all $x$. Let $\{y_k\}_{k \ge 0}$ be a $\delta$-pseudo-orbit of $\Psi$ and denote by $\{x_k\}_{k \ge 0}$ the orbit of $\tilde\Psi$ starting at $x_0 = y_0$, that is $x_{k+1} = \tilde\Psi(x_k)$. We have $\epsilon$-shadowing under
\[\frac{\delta + \sigma}{1 - \rho} \le \epsilon.\]
Proof.
Again as in Thm. 2, we proceed by induction: the proposition is trivially true at $k = 0$, since $x_0 = y_0$; next, we assume the proposition holds at $k$, i.e. $\|x_k - y_k\| \le \epsilon$, and we show validity for $k + 1$. We have
\[\|x_{k+1} - y_{k+1}\| \le \|\Psi(x_k) - \Psi(y_k)\| + \|\xi(x_k)\| + \|\Psi(y_k) - y_{k+1}\| \le \rho \epsilon + \sigma + \delta.\]
Since $\epsilon \ge \frac{\delta + \sigma}{1 - \rho}$, we get $\rho \epsilon + \sigma + \delta \le \epsilon$. ∎
In the setting of shadowing GD with perturbed (e.g. stochastic) gradients (Prop. 3), $\Psi$ is the GD map (contracting under strong convexity), $\delta$ is the one-step integration error and $\sigma$ bounds the gradient perturbation, which gives the consistency equation
\[\epsilon = \frac{\delta + \sigma}{1 - \rho}.\]
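The bound of Prop. C.2 can be checked numerically. In the sketch below (a toy instance of our own choosing), $\Psi$ is a linear contraction, the pseudo-orbit carries per-step error $\delta$, the perturbed system adds noise of size at most $\sigma$, and the resulting deviation stays within $(\delta + \sigma)/(1 - \rho)$:

```python
import random

# Perturbed contracting map: Psi contracts with factor rho < 1; the
# perturbed system is Psi_tilde(x) = Psi(x) + xi with |xi| <= sigma.
rho, delta, sigma, N = 0.9, 1e-3, 1e-3, 500
psi = lambda x: rho * x  # a contraction with factor rho

random.seed(0)

# delta-pseudo-orbit of Psi.
y = [1.0]
for _ in range(N):
    y.append(psi(y[-1]) + random.uniform(-delta, delta))

# Orbit of the perturbed system Psi_tilde, started at x_0 = y_0.
x = [y[0]]
for _ in range(N):
    x.append(psi(x[-1]) + random.uniform(-sigma, sigma))

# Shadowing radius from the consistency equation.
eps = (delta + sigma) / (1.0 - rho)
assert max(abs(a - b) for a, b in zip(x, y)) <= eps + 1e-12
```

Unlike in the non-expanding case, the contraction keeps the deviation bounded uniformly in the number of iterations, not just linearly growing.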
C.3 Hyperbolic maps (i.e. the quadratic saddle setting)
We prove the result for GD. This can be generalized to HB. A more powerful proof technique — which can tackle the effect of perturbations — can be found in App. C.4.
[restated Prop. 4] Let $f$ be quadratic with Hessian $H$ which has no eigenvalues in the interval $(-\nu, \nu)$, for some $\nu > 0$. Assume the orbit of GD-ODE is such that (H1) holds up to iteration $K$. Let $\epsilon > 0$ be the desired tracking accuracy; if the step size $h$ is small enough (as a function of $\epsilon$), then the GD-ODE trajectory is shadowed by an orbit of GD up to iteration $K$.
Proof.
From Prop. 2 and thanks to the hypothesis, we know the GD-ODE trajectory is a partial pseudo-orbit of the GD map with