Acceleration and Averaging
In Stochastic Mirror Descent Dynamics
We formulate and study a general family of (continuous-time) stochastic dynamics for accelerated first-order minimization of smooth convex functions.
Building on an averaging formulation of accelerated mirror descent, we propose a stochastic variant in which the gradient is contaminated by noise, and study the resulting stochastic differential equation. We prove a bound on the rate of change of an energy function associated with the problem, then use it to derive estimates of convergence rates of the function values, (a.s. and in expectation) both for persistent and asymptotically vanishing noise. We discuss the interaction between the parameters of the dynamics (learning rate and averaging weights) and the covariation of the noise process, and show, in particular, how the asymptotic rate of covariation affects the choice of parameters and, ultimately, the convergence rate.
We consider the constrained convex minimization problem
where is a closed, convex, subset of , and is a proper closed convex function, assumed to be differentiable with Lipschitz gradient, and we will denote the set of minimizers of this problem, assumed to be non-empty. First-order optimization methods play an important role in minimizing such functions, in particular in large-scale machine learning applications, in which the dimensionality (number of features) and size (number of samples) in typical datasets makes higher-order methods intractable. Many such algorithms can be viewed as a discretization of a continuous-time dynamics. The simplest example is gradient descent, which can be viewed as the discretization of the gradient flow dynamics , where denotes the time derivative of a trajectory . An important generalization of gradient descent, which elegantly handles the constraint set , was developed by Nemirovsky and Yudin (1983), and termed mirror descent: it couples a dual variable accumulating gradients, and its “mirror” primal variable . More specifically, the dynamics are given by
where is a Lipschitz function defined on the entire dual space , with values in the feasible set ; it is often referred to as a mirror map, and we will recall its definition and properties in Section 2. Mirror descent can be viewed as a generalization of projected gradient descent, where the Euclidean projection is replaced by the mirror map (Beck and Teboulle, 2003). This makes it possible to adapt the choice of the mirror map to the particular geometry of the problem, leading to closed-form solutions of the projection, or to better dependence on the dimension , see (Ben-Tal and Nemirovski, 2001), (Ben-Tal et al., 2001).
Although optimization methods are inherently discrete, the continuous-time point of view can help in their design and analysis, since it can leverage the rich literature on dynamical systems, control theory, and mechanics, see (Helmke and Moore, 1994), (Bloch, 1994), and the references therein. Continuous-time models are also commonly used in financial applications, such as option pricing (Black and Scholes, 1973), even though the actions are taken in discrete time. In convex optimization, beyond simplifying the analysis, continuous-time models have also motivated new algorithms and heuristics: mirror descent is one such example, since it was originally motivated in continuous-time (Chapter 3 in (Nemirovsky and Yudin, 1983)). In a more recent line of work ((Su et al., 2014), (Krichene et al., 2015), (Wibisono et al., 2016)), Nesterov’s accelerated method (Nesterov, 1983) was shown to be the discretization of a second-order ordinary differential equation (ODE), which, in the unconstrained case, can be interpreted as a damped non-linear oscillator (Cabot et al., 2009; Attouch et al., 2015). This motivated a restarting heuristic (O’Donoghue and Candès, 2015), which aims at further dissipating the energy of the oscillator. Krichene et al. (2015) generalized this ODE to mirror descent dynamics, and gave an averaging interpretation (a connection which was previously pointed out by Flammarion and Bach (2015) for quadratic functions). This averaging formulation is the starting point of this paper, in which we introduce and study a stochastic variant of accelerated mirror descent dynamics.
Stochastic dynamics and related work
The dynamics which we discussed so far (gradient descent, mirror descent, and their accelerated variants) are deterministic first-order dynamics, since they use the exact gradient . However, in many machine learning applications, evaluating the exact gradient can be prohibitively expensive, e.g. when the objective function involves the sum of loss functions over a training set, of the form , where indexes the training samples, and is a regularization function. Instead of computing the exact gradient , a common approach is to compute an unbiased, stochastic estimate of the gradient, given by , where is a uniformly random subset of , indexing a random batch of samples from the training set. This approach motivates the study of stochastic dynamics for convex optimization. But despite an extensive literature on stochastic gradient and mirror descent in discrete time, e.g. (Nemirovski et al., 2009), (Duchi et al., 2010), (Lan, 2012), (Johnson and Zhang, 2013), (Xiao and Zhang, 2014), and many others, few results are known for stochastic mirror descent in continuous-time. To the best of our knowledge, the only published results are by Raginsky and Bouvrie (2012) and Mertikopoulos and Staudigl (2016).
In its simplest form, the stochastic gradient flow dynamics can be described by the Itô stochastic differential equation (SDE) (Øksendal, 2003)
where denotes a standard Wiener process (Brownian motion). It is well known that this dynamics admits a unique invariant measure with density proportional to the Gibbs distribution . Such dynamics have recently played an important role in the analysis of sampling methods (Dalalyan, 2017), (Bubeck et al., 2015), (Cheng and Bartlett, 2017), (Cheng et al., 2017), where is taken to be the logarithm of a target distribution . The stationary distribution of the SDE has also been recently interpreted as an approximate Bayesian inference (Mandt et al., 2017), and to derive convergence rates (in expectation) for smooth, non-convex optimization where the objective is dissipative (Raginsky et al., 2017).
where is a constant volatility. In particular, they argued that the function values along sample trajectories do not converge to the minimum value of due to the persistent noise, but the optimality gap is bounded by a quantity proportional to . They also proposed a method to reduce the variance by simultaneously sampling multiple trajectories and linearly coupling them. Mertikopoulos and Staudigl (2016) extended the analysis in some important directions: they replaced the constant volatility with a general volatility matrix which can depend on the current point and current time , and studied two regimes: the vanishing noise regime given by the condition , in which case they prove almost sure convergence of solution trajectories; and the persistent noise regime, given by the condition uniformly in , in which case they define a rectified variant of , obtained by replacing the second equation by , where is a sensitivity parameter. The resulting dynamics is given by
Intuitively, a decreasing sensitivity reduces the impact of accumulated noise on the primal trajectory. In particular, they prove that with , the function values converge to the optimal value at a rate, almost surely. They also give concentration estimates around interior solutions in the strongly convex case. While these recent results paint a broad picture of mirror descent dynamics under different noise regimes, they leave many questions open: in particular, they do not provide estimates for convergence rates in the vanishing noise regime, which is an important regime in machine learning applications, since one can often control the variance of the gradient estimate, for example by gradually increasing the batch size, as done by Xiao and Zhang (2014). Besides, they do not study accelerated dynamics, and the interaction between acceleration and noise remains unexplored in continuous time.
In this paper, we answer many of the questions left open in previous works. We formulate and study a family of stochastic accelerated mirror descent dynamics, and we characterize the interaction between its different parameters: the volatility of the noise, the (primal and dual) learning rates, and the sensitivity of the mirror map. More specifically:
In Theorem 1, we give sufficient conditions for a.s. convergence of solution trajectories to the set of minimizers . In particular, we show that it is possible to guarantee almost sure convergence even when the volatility is unbounded asymptotically.
In Theorem 2, we derive a bound on the expected function values.
In Theorem 3, we provide estimates of sample trajectory convergence rates.
The rest of the paper is organized as follows: We start by reviewing the building blocks of our construction in Section 2, then formulate the stochastic dynamics in Section 3, and prove two instrumental lemmas. Section 4 is dedicated to the convergence results. We give additional numerical examples in Section 5, to illustrate the effect of acceleration on stochastic mirror descent dynamics. We conclude with a brief discussion in Section 6.
2 Accelerated Mirror Descent Dynamics
2.1 Smooth mirror maps
The mirror map is central in defining mirror descent dynamics. We first give a generic method for constructing mirror maps, adapted to the feasible set . We fix a pair of dual reference norms, , defined, respectively, on and its dual space . We say that a map is Lipschitz continuous on with constant if for all , . We recall that the effective domain of a convex function is the set , and its convex conjugate is defined on by . We recall that the sub-differential of at is the set , and that is said to be -strongly convex (w.r.t. ) if , .
Let be a -strongly convex function (w.r.t. ) with effective domain , and let be its convex conjugate. Then is finite and differentiable on all of , is -Lipschitz, and has values in : specifically, for all ,
This follows from standard results from convex analysis, e.g. Theorems 13.3 and 25.3 in (Rockafellar, 1970). To give an example of a Lipschitz mirror map, take to be the squared Euclidean norm, . Then , and the mirror map reduces to the Euclidean projection on . It is worth noting that although one can theoretically construct a smooth mirror map given any convex feasible set , using Proposition 1, this does not necessarily mean that the mirror map can be implemented efficiently, since in its general form, it is given by the solution to the problem (4). However, many convex sets have known mirror maps that are efficient to compute. For a concrete example, when is the probability simplex , choosing to be the negative entropy yields a closed-form mirror map given by , see e.g. Banerjee et al. (2005) for additional examples. We will make the following regularity assumption throughout the paper:
is convex and closed, is non-empty, is strongly convex continuous and non-negative on , is twice differentiable with a Lipschitz gradient, and is differentiable with Lipschitz gradient. We denote by the Lipschitz constant of , and by the Lipschitz constant of .
The assumption that is non-negative is made without loss of generality: since is strongly convex, its infimum is finite, and one can simply translate (without changing the mirror map). Some of our results will also require the feasible set to be compact, so we formulate the following assumption:
The feasible set is compact.
2.2 Averaging formulation of accelerated mirror descent
We start from the averaging formulation of accelerated mirror descent given by Krichene et al. (2015), and propose a variant which includes a time-varying sensitivity parameter, similar to Mertikopoulos and Staudigl (2016). We consider the following ODE:
with initial conditions . The ODE system is parameterized by the following functions, all assumed to be positive and continuous on .
is a non-decreasing, inverse sensitivity parameter. As we will see, the main role of will be to scale the noise term, in order to reduce its impact on the primal trajectory.
is a learning rate in the dual space.
is an averaging rate in the primal space. In order to see this connection with averaging, we can rewrite the primal ODE in integral form as a weighted average of the mirror trajectory as follows: let (equivalently, ), then the primal ODE is equivalent to
and integrating and rearranging,
2.3 Energy decay
Next, we define an energy function which will be central in our analysis. The analysis of continuous-time dynamics often relies on a Lyapunov argument (in reference to Lyapunov (1992)): one starts by defining a non-negative energy function, then bounding its rate of change along solution trajectories. This bound can then be used to prove convergence to the set of minimizers . We will consider a modified version of the energy function used by Krichene et al. (2016): given a positive, function , and a pair of optimal primal-dual points such that111Note that in general, may not be surjective (specifically, points on the boundary of may not be attained), but the analysis can be extended to such cases by replacing the Bregman divergence term in by the Fenchel coupling defined by Mertikopoulos and Staudigl (2016). and , let
Here, is the Bregman divergence (Bregman, 1967) associated to , defined by
Then we can prove a bound on the time derivative of along solution trajectories of , given in the following proposition. To keep the equations compact, we will occasionally omit explicit dependence on time, and write, e.g. instead of .
Suppose that Assumption 1 holds, and suppose that . Then under , for all ,
We start by recalling a Bregman identity which will be useful in the proof. We have for all and , (Theorem 23.5 in Rockafellar (1970)). Thus
We proceed by bounding the rate of change of the Bregman divergence term:
where the third equality uses the identity (9), and the last inequality follows from the assumption that is non-decreasing, and that is non-negative. Using this expression, we can then compute
where we used the chain rule in the first equality, we plugged in the expression of and from to obtain the second equality, and used convexity of in the last inequality. The assumption ensures that the middle term vanishes222Note that this assumption can be replaced by an adaptive rate similar to the heuristic developed in (Krichene et al., 2016)., which concludes the proof. ∎
As a consequence of the previous proposition, we can prove the following convergence rate:
Suppose that and that . Then under , for all
Starting from the bound (21), the first term is non-positive by assumption on . Integrating, we have
and we conclude by observing that by definition of ,
since the Bregman divergence term is non-negative). ∎
In the deterministic case, it appears from the bound of Corollary 1 that a strictly increasing would degrade the convergence rate. However, as we will see in the next section, this parameter will be essential in controlling the effect of noise.
We can recover the (non-accelerated) dynamics as a limiting case of : writing the second equation of as
we can see that can be formally recovered by taking infinite.
Nesterov’s accelerated method can be seen as a special case of dynamics in the unconstrained Euclidean case, with a quadratic . This is discussed in Appendix A.
3 Stochastic dynamics
We now formulate the stochastic variant of accelerated mirror descent dynamics (). Intuitively, we would like to replace the gradient terms in by a noisy gradient . More formally, we define a process which satisfies the Itô SDE
Here, is a standard Wiener process with respect to a given filtered probability space , and is a volatility matrix which satisfies the following assumptions:
The volatility matrix is measurable, Lipschitz in (uniformly in ), and continuous in for all .
The resulting stochastic dynamics are given by the SDE system:
with initial condition (we consider deterministic initial conditions for simplicity). The drift term in is identical to the deterministic case, and the volatility term represents the noise in the gradient. In particular, we note that the volatility is proportional to , to capture the fact that the gradient noise is scaled by the learning rate . This formulation is fairly general, and does not assume, in particular, that the different components of the gradient noise are independent, as we can see in the quadratic covariation of the gradient process :
where we defined the infinitesimal covariance matrix . Similarly, we have
We will denote
where is the induced matrix norm. Contrary to the work of (Raginsky and Bouvrie, 2012; Mertikopoulos and Staudigl, 2016), we do not assume, a priori, that is uniformly bounded in . Observe that when Assumption 2 holds (i.e. is compact), since is Lipschitz in and continuous in , is finite for all , and continuous.
3.1 Illustration of SAMD dynamics
We give an illustration of SAMD dynamics in Figure 1, on a simplex-constrained problem in . The feasible set is given by , and the mirror map is generated by the negative entropy restricted to the simplex, given by:
which is strongly convex with respect to the norm by Pinsker’s inequality. Its convex conjugate is given by
which is differentiable for all , and the mirror map is
which is Lipschitz w.r.t. the dual norms .
Note that for all and all ,
where is the vector of all ones. This can be verified directly using the expression of , but can also be seen as a consequence of the duality of sub-differentials (e.g. Theorem 23.5 in Rockafellar (1970)), which states that if and only if ; and since is the restriction of the negative entropy to the simplex, its sub-differential at is
where is the normal cone to at , which is simply the line (when is in the relative interior of the simplex).
Since the mirror map is constant along the normal to the simplex, we choose to project the dual variable on the hyperplane parallel to the simplex, for visualization purposes. This allows us to visualize the relevant component of the dual dynamics, and ignore a component which does not matter for convergence (but which could have high magnitudes if has a large component along the normal). Note that even numerically, projecting after each iteration helps improve numerical stability (without affecting the primal trajectory).
Finally in order to visualize the function values, we generate a triangular mesh of the simplex, then map it to the dual space. In other words, the colors in the primal space represent , and in the dual space represent . It is interesting to observe how the mirror map distorts the space between primal and dual spaces.
The objective function used in this example (which we use as a running numerical example to illustrate our results) is given by the sum of exponentials
where are vectors in .
3.2 Existence and uniqueness of a continuous solution
We give the following existence and uniqueness result.
By assumption, and are Lipschitz continuous, thus the function is Lipschitz on (since are positive continuous). Additionally, the function is Lipschitz on the compact set . The function is continuous, since it is by definition the supremum of continuous functions
and and are both compact.
Therefore, we can invoke the existence and uniqueness theorem for stochastic differential equations (Øksendal, 2003, Theorem 5.2.1). ∎
Note that since is arbitrary in the previous proposition, we can conclude that there exists a unique continuous solution on . In the previous proposition, we assumed that is compact to guarantee the conditions of the existence and uniqueness theorem, but this can be relaxed. Instead, we can directly assume additional regularity of the noise, e.g. that additionally satisfies for some , uniformly in .
Next, in order to analyze the convergence properties of the solution trajectories , we will need to bound the time-derivative of the energy function .
3.3 Energy decay
In this section, we state and prove two technical lemmas which will be useful in proving our main convergence results. We start by bounding the rate of change of the energy function along solution trajectories of .
Bounding the rate of change of the energy
Suppose Assumption 1 holds, and that the primal rate satisfies , and let be a continuous solution to . Then for all ,
where is the continuous, dimensional process given by
By definition of the energy function , and , which are Lipschitz continuous in (uniformly in on any bounded interval, since are continuous positive functions of ). Thus by the Itô formula for functions with Lipschitz continuous gradients (Errami et al., 2002), we have
and the Itô correction term
We can bound the last term using the following simple fact: Given two linear operators and , such that and (where and are the norms induced by the pair of dual norms and ), then we have since
Now, since is, by assumption, -Lipschitz, we have , and by definition (15) of , for all , therefore for all , , and ,
Combining the previous inequalities, we obtain the desired bound. ∎
Comparing the bound of Lemma 2 to its deterministic counterpart of Lemma 1, we can see that in the stochastic case, we have two additional terms: a drift term proportional to , which is due to the Itô correction term, and a volatility term given by .
To illustrate these terms, we generate solution trajectories for the sum-exponential example of Section 3.1, for both the deterministic and stochastic dynamics, and plot the values of the objective function and the energy function. For simplicity, we consider a constant volatility , a linear energy rate , and a sensitivity . The primal and dual weights are and . For the stochastic dynamics, we plot the mean and standard deviation of sample trajectories. The results are given in Figure 2 below.
For the sake of comparison, we also generate a similar simulation for (non-accelerated) mirror descent dynamics in Figure 3, using the same sensitivity . Comparing Figures 2 and 3, we can already observe some qualitative differences introduced by acceleration: the accelerated dynamics exhibit faster convergence, accompanied with typical oscillations around the minimizer. Note that these oscillations are not due to discretization or noise in the gradient estimate (as can be observed e.g. in discrete gradient descent with large step sizes). The oscillations are rather a property of the continuous-time accelerated dynamics. Besides, acceleration also seems to reduce the effect of noise on the primal trajectory (which appears visually smoother than its non-accelerated counterpart). To give some intuition, consider the following informal argument (which will be formalized in the results of Section 4). In the case of mirror descent, the primal variable is obtained as the mirror of , where
and the noise is cumulated in the Itô martingale term . In the case of accelerated mirror descent, there are two main differences:
First, using a time-varying dual rate in leads to a non-linear accumulation of noise in the Itô martingale .
The combined effect of averaging in the dual space (using ) and in the primal space (using ) will be apparent in Section 4 when we derive explicit convergence rates.
Bounding the Itô martingale term
Integrating the bound of Lemma 2 will allow us to bound changes in energy. This bound will involve the Itô martingale , where is defined in (16). In order to control this term, we give, in the following lemma, an asymptotic envelope (a consequence of the law of the iterated logarithm).
Let us denote the Itô martingale by , and its quadratic variation by . By definition of , we have
By the Dambis-Dubins-Schwartz time change theorem (e.g. Corollary 8.5.4 in (Øksendal, 2003)), there exists a Wiener process such that
We now proceed to bound . Using the expression (16) of , we have
where we defined . Since the mirror map has values in and is assumed compact, the diameter is finite, and for all . Thus, , and integrating,
Since is a non-decreasing process, two cases are possible: if is finite, then is a.s. finite and the result follows immediately. If , then
4 Convergence results
4.1 Almost sure convergence
The result of Theorem 1 makes it possible to guarantee almost sure convergence (albeit without an explicit convergence rate) when the noise is persistent ( is constant, or even increasing). To give a concrete example, suppose (with but can be positive), and let . Then , , , and , and the conditions of the theorem are satisfied. Therefore, with the appropriate choice of learning rate (and the corresponding averaging in the primal space given by ), one can recover almost sure convergence.
We start by giving an outline of the proof, which is similar to that of Theorem 4.1 in (Mertikopoulos and Staudigl, 2016), with some significant changes (we do not make the assumption that the minimizer is unique, and most importantly, the dynamics and the energy function are different, since averaging is essential in our case to handle the noise, since we do not assume that the volatility bound is vanishing). The argument proceeds in the following steps:
The first step is to prove that under the conditions of the theorem, the continuous solution of , , is an asymptotic pseudo trajectory (a notion defined and studied by Benaïm and Hirsch (1996) and Benaïm (1999)) of the deterministic flow . The definition is given below, but intuitively, this means that for large enough times, the sample paths of the process get arbitrarily close to , the solution trajectories of the deterministic dynamics.
Definition 4.1 (Asymptotic Pseudo Trajectory).
Let be the semi-flow associated to the deterministic dynamics , that is, is the solution of the deterministic dynamics with initial condition . A continuous function is an asymptotic pseudo trajectory (APT) for if for all ,
where is a distance on , e.g. .
The second step is to show that under the deterministic flow, the energy decreases enough for large enough times.
The third step is to prove that under the stochastic process, cannot stay bounded away from for all .
Finally, combining these steps, we argue that by (iii), eventually gets close to , then stays close by virtue of the asymptotic pseudo trajectory property (i), and the decrease of the energy under the deterministic flow (ii).
Proof of Theorem 1.
We start by specializing the energy function and the bounds on its time derivative to the setting of Theorem 1. Under the assumptions of the theorem (), simplifies to
where we added the subscript to insist on the fact that the energy function is “anchored” at . Note that since the minimizer is not necessarily unique, does not necessarily converge to for arbitrary . Thus, we define and use
where (by the fact that if and only if ).
Next, we observe that since is -Lipschitz and is -Lipschitz, we can bound the change of the energy due to small displacements in : we will use the fact that for any convex function with -Lipschitz gradient,