An ODE Method to Prove the Geometric Convergence of Adaptive Stochastic Algorithms

# An ODE Method to Prove the Geometric Convergence of Adaptive Stochastic Algorithms

Youhei Akimoto Anne Auger Nikolaus Hansen University of Tsukuba, Center for AI Research, 305-8573 Ibaraki, Japan Inria RandOpt Team, CMAP, Ecole Polytechnique, 91128 Palaiseau, France
###### Abstract

We develop a methodology to prove geometric convergence of the parameter sequence of a stochastic algorithm. The convergence is measured via a function that is similar to a Lyapunov function. Important algorithms that motivate the introduction of this methodology are stochastic algorithms deriving from optimization methods solving deterministic optimization problems. Among them, we are especially interested in analyzing comparison-based algorithms that typically derive from stochastic approximation algorithms with a constant step-size. We employ the so-called ODE method that relates a stochastic algorithm to its mean ODE, along with the Lyapunov-like function such that the geometric convergence of implies—in the case of a stochastic optimization algorithm—the geometric convergence of the expected distance between the optimum of the optimization problem and the search point generated by the algorithm. We provide two sufficient conditions such that decreases at a geometric rate. First, should decrease “exponentially” along the solution to the mean ODE. Second, the deviation between the stochastic algorithm and the ODE solution (measured with the function ) should be bounded by times a constant. We provide in addition practical conditions that allow to verify easily the two sufficient conditions without knowing in particular the solution of the mean ODE. Our results are any-time bounds on , so we can deduce not only asymptotic upper bound on the convergence rate, but also the first hitting time of the algorithm. The main results are applied to two comparison-based stochastic algorithms with a constant step-size for optimization on discrete and continuous domains.

###### keywords:
adaptive stochastic algorithm, comparison-based algorithm, geometric convergence, ordinary differential equation method, Lyapunov stability, optimization

<ltx:note>

• Global convergence on convex quadratic

• Geometric convergence on C2

• Future: strictly (or strongly) convex [Youhei: so far I haven’t found that monotonically decreases along the flow]

• Future: Exponential family (with ) with expectation parameterization (). Objective function . Then, . Probably, we need some condition corresponding to not too small in order to prove a uniform lower bound .

</ltx:note>

<ltx:note>

• connection to papers published in SIAM journals.

• send an email to George Yin to ask his opinion.

</ltx:note>

## 1 Introduction

The ODE method is a standard technique to prove the convergence of stochastic algorithms of the form

 θn+1=θn+αFn, (1)

where is a sequence of parameters taking values in , is a positive integer; is a sequence of random vectors in ; is the step size, also referred to as learning rate that can depend on the time index Kushner2003book (); Ljung1977tac (); Borkar2008book (); benaim2006stochastic2 (); benaim2005stochastic (). The method connects the convergence of the stochastic algorithm to the convergence of the solutions of the underlying “mean” ODE

 dθdt=F(θ),θ(0)=θ0, (2)

where is the conditional expectation given the natural filtration associated to . Here, we assume that is well defined and assume that the dependency of on the past is only through . The stochastic algorithm (1) provides a stochastic approximation of the solution of (2) and is referred to as a stochastic approximation algorithm. Often, the error between the stochastic algorithm (1) and the solution of the mean ODE is controlled by taking a sequence of learning rate (i.e., is time dependent) that is assumed to decrease to zero not too fast Kushner2003book (); Ljung1977tac (); Borkar2008book (); Borkar:2000vd ().

In this paper, we explore the use of the ODE method to prove the convergence of some specific algorithms that, at the most abstract level, are of the form (1) but where geometric convergence of a function of occurs. To set the ideas consider the example of an algorithm arising in the context of the optimization of a (black-box) function . The algorithm state is , where encodes the mean of a multivariate normal sampling distribution and its variance (the covariance matrix being proportional to the identity). The updates for read

 mn+1=mn+α∑λi=1W(i;xn,1,…,xn,λ)cratio(xn,i−mn)vn+1=vn+α∑λi=1W(i;xn,1,…,xn,λ)(1d∥xn,i−mn∥2−vn), (3)

where are candidate solutions sampled independently according to a multivariate normal distribution with mean and overall variance , and are weights assigned to the candidate solutions that are decreasing if the candidate solutions are ordered according to their -values, i.e., better solutions have larger weights ( is a constant to have a different learning rates for the mean and variance updates). Algorithm (3) is termed comparison-based because it uses objective function values only through comparisons of candidate solutions (that give the ranking of the solutions). Comparison-based algorithms are invariant to strictly increasing transformations of the objective function. Therefore, they do not assume convexity or even continuity of the objective function, whereas most gradient based algorithms require strong convexity of the function to guarantee their geometric convergence. The above algorithm (3) is (mainly empirically) known to converge geometrically to a local minimum of on wide classes of functions for a fixed learning rate . Both the mean and the variance converge geometrically (towards a local optimum of and zero, respectively). It is a simplified version of the state-of-the art stochastic search algorithm, namely Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which instead of adapts a full covariance matrix Hansen2001ec (); Hansen2003ec (); Hansen:2013vt (). More and more attentions are paid on these randomized search algorithms with successful applications Salimans2017 (), however, convergence proofs for such algorithms are known to be difficult to achieve. Related to convergence proofs, a central question is to show the convergence rate (and not only the convergence) such that methodologies that allow to prove geometric convergence is needed.

In this context, we extend the ODE method to be able to prove the geometric convergence of algorithms of the form (3). Our setting differs from the ones where the ODE method is usually employed, making thus the extension of the ODE method non trivial. Typically we have to cope with the following different points at the same time (some aspects have been addressed individually):

1. The step-size cannot decrease to zero, otherwise the geometric convergence is jeopardized while most previous works using the ODE method consider a sequence of step-sizes decreasing to zero. (The studies considering a fixed step-size will be discussed later on.)

2. All the (uncountably many) boundary points of the domain are typically equilibrium points of the mean ODE and we want the convergence towards one of them only, as only one equilibrium point means that the underlying optimization algorithm has found the optimum while the others would mean premature convergence. In the above example all the boundary points are equilibrium points of the associated ODE, so any neighborhood of the optimal parameter (equal to in the example above) contains uncountably many equilibrium points.

3. Geometric convergence of the algorithm does not usually mean geometric convergence of with the Euclidean distance. We hence need to resort to a meaningful function whose geometric convergence implies the geometric convergence of the quantities we are interested in. In the above example, we would like to show the geometric convergence of the distance between the generated solutions and the optimal solution in some stochastic sense, and hence needs to be chosen so that its geometric convergence implies the geometric convergence of the distance.

We formulate a few comments on the above list: Regarding the second point, having equilibrium points on the boundary is not a critical issue. Previous works address such cases, e.g., Buche2001sicon (). Yet having multiple equilibrium points that are all connected complicates the analysis. We want to prove the convergence only towards a specific equilibrium point while convergence towards the other points mean a failure of the optimization algorithm. Previous studies Ljung1978 (); Pemantle1990 (); Brandiere1998sicon (); Fang2000ieeetac () address multiple equilibrium, however, in those studies, the objective is to show the convergence of an algorithm towards a set of equilibrium, and our setting does not fit in those analysis—as we want to converge to a single point. The third point of the above list is critical, especially since any neighborhood of the optimal parameter has typically uncountably many equilibrium points. Hence we cannot choose (for instance) the Euclidean distance to the optimal parameter as potential function .

Our approach consists in finding a function that should satisfy conditions A1 and A2 presented below so as to imply the geometric convergence of to zero for an appropriate choice of the learning rate . The function can be thought as a distance to a desired parameter set such that if goes to zero geometrically, we can conclude to the geometric convergence of the algorithm we investigate, i.e.  is an upper bound on the quantity we want to prove the geometric convergence of. Differently from the Euclidean distance, however, to avoid the third difficulty listed above, needs to diverge towards plus infinity as converges to an equilibrium point which is not the point we want to converge to. We call the optimal point. The first condition on translates that the function should decrease along the trajectories of the solution of the ODE (2) (similarly to Lyapunov functions for proving the stability of ODEs). This decrease should be geometric. The condition reads more precisely:

• nonincreasing such that as , and for any and any

 Ψ(φ(t;θ))⩽ΔA1(t)Ψ(θ), (4)

where is the solution of (2) at time and .

The second condition is to control the deviation between the stochastic algorithm and the solution of the ODE:

• nondecreasing with respect to each argument such that as for any fixed , and for any and

 E0[Ψ(θN)]⩽Ψ(φ(Nα;θ0))+ΔA2(α,Nα)Ψ(θ0), (5)

where denotes the conditional expectation given .

Remark that the conditions above are stronger than typical assumptions done in the standard stochastic approximation settings. The reason is that our main focus is on getting the convergence rate of the algorithm (and not only its convergence).

<ltx:note>[</ltx:note>Youhei] see above

<ltx:note>[</ltx:note>Youhei] replaced linear with geometric. commented on the terminology.

<ltx:note>[</ltx:note>Youhei] I don’t emphasize the fact that we have the optimal point at the boundary as a difficulty anymore. <ltx:note>[</ltx:note>Youhei] done. see above.

<ltx:note>[</ltx:note>Youhei] done. see above

The flow solution of the ODE comes into play in both conditions (4) and (5). However it is very seldom that the solution of the ODE is known explicitly. We hence provide some practical conditions to be able to verify the conditions without knowing . The first condition developed in Theorem 11 (that implies (4)) is similar to conditions to obtain exponential stability of equilibrium points of ODE in Lyapunov’s theory. It states that the upper Dini directional derivative of in the direction of has to be smaller than or equal to a negative constant times . The second conditions, developed in Theorem 12 (that implies (5)) are based on the Lemma 1 in Chapter 9 of Borkar2008book () that provides conditions for the expected Euclidean distance between the stochastic algorithm and the solution of the underlying ODE to be bounded by a constant. We replace the constant with a constant times by introducing a different condition. We illustrate how to use the different results and in particular the practical conditions on two examples taken from the context of adaptive algorithms for optimization of black-box functions. The construction of the function is carefully discussed.

This paper is organized as follows. After the summary of the notations employed throughout this paper, we describe in Section 2 rank-based or comparison-based stochastic search algorithms that are examples of algorithms the methodology of the paper can be applied to, including two running examples, namely, a step-size adaptive evolution strategy and the population based incremental learning algorithm. In Section 3 we demonstrate a simple technique of the ODE method and illustrate the difficulties to prove the geometric convergence of stochastic algorithms via the ODE method. In Section 4 we provide a sufficient condition for to converge globally and geometrically towards zero and derive the first hitting time bound from it as a consequence. The choice of is discussed on the two concrete examples listed above. We prove practical conditions to verify the sufficient condition of our main theorem in Section 5, followed by the application of these practical conditions to the two example algorithms and the proofs of their geometric convergence. We end the paper in Section 6 with the extension of the main theorem to cover the cases where the ODE associated to the stochastic algorithm has multiple local attractors.

##### Notations

Let be the indicator function whose value is if event occurs, otherwise. Let denote the expectation taken over whose probability measure is , i.e., . If we take the expectation over all random variables appeared in the expression and if it is clear, we drop the subscripts and just write . Let denote the conditional expectation given a filtration . For any event , be the probability of occurring. Let be the conditional probability given filtration .

Let be the set of real numbers. Let and be the set of nonnegative and positive real numbers, respectively. Let and be the set of nonnegative and positive integers, respectively. For and satisfying , the open, closed, left open, right open intervals are denoted as , , , and , respectively. For and satisfying , denotes the set of integers between and including them. Let be the Euclidean norm for any real vector , and be the Mahalanobis norm given a positive definite symmetric matrix . For any real value , denotes the absolute value.

## 2 Example Algorithms

The methodology presented in this paper is strongly motivated by the class of comparison-based or rank-based stochastic search algorithms. While we have sketched an example of such algorithms in the introduction, we define more generally in this section the class of algorithms and derive the function in (2). We then define two example algorithms that will be used throughout the paper to illustrate our different results.

The objective is to minimize (without loss of generality) a function where is an arbitrary search space. The algorithm updates that parametrizes a family of probability distribution defined over . More precisely, at each iteration , multiple candidate solutions () are generated from the probability distribution . Their objective values are evaluated and higher weight values are assigned to better candidates solutions. The weight value assigned to each is defined as follows. Let be real, predefined constants that are nonincreasing, i.e., . Define a function such that

 W(i;x1,…,xλ)=∑k+lj=kwj+1l+1,where k=∑j≠i\mathbbmI{f(xj)

Note that for any combination of . Then, is the value assigned to . If all the candidate solutions have distinct objective values, we have , where is the number of points with better -values than among as defined in (6). In other words, is assigned to the th best point. Using the set of pairs of the candidate solutions and their weight value, the parameter of the probability distribution is updated. Let . The direction to adjust the parameter is defined by the sum of the product of the weight value and , i.e.,

 θn+1=θn+α∑λi=1W(i;xn,1,…,xn,λ)g(xn,i;θn). (7)

The algorithm (3) sketched in the introduction is one example of method following the previous update equation with and defined as

 g(x;θ=(m,v))=(cratio(x−m), 1d∥x−m∥2−v). (8)

We see that the update equation for in (7) depends on only through the weight functions that themselves depend on only through the ranking of the candidate solutions (see (6)) which explains the name of comparison or rank based algorithms. Hence such algorithms are invariant to rank preserving transformations. In other words, consider a function that is strictly monotonically increasing, we obtain the same update (7) for the algorithm optimizing or .

For this class of algorithms we can explicitly write the function using the following proposition whose proof is included in the appendix.

###### Proposition 1.

Let be i.i.d. following and define as in (6). Given , for any -integrable function , the random vector is -integrable and there exists a function such that the conditional expectation is

 F(θn)=∫Xu(q<θn(f(x)),q=θn(f(x)))g(x;θn)Pθn(dx),

where and are the probabilities of and , respectively, when , and is the function defined by

 u(p,q)=λ−1∑k=0λ−k−1∑l=0(k+l∑j=kwj+1l+1)λPT(λ−1,k,l,p,q), (9)

where is the trinomial probability mass function.

###### Remark 2.

If holds -almost surely in Proposition 1, we can simplify the result as

 F(θn)=∫Xu(q<θn(f(x)))g(x;θn)Pθn(dx) (10)

with defined by

 u(p)=λ−1∑k=0wk+1λPB(λ−1,k,p), (11)

where is the binomial probability mass function. It is a typical case when is a continuous domain.

In the following, we present two examples of algorithms that belong to the class of algorithms implicitly defined via (7) and whose convergence will be proved by the methodology presented in this paper.

### 2.1 Step-size Adaptive Evolution Strategy

Step-size adaptive evolution strategies Rechenberg1973 (); Schwefel1975 (); Hansen:2013wf () are rank-based optimization algorithms in continuous domain, that is . We consider the example corresponding to a simplification of the state-of-the-art CMA-ES algorithm but refer to Hansen:2013wf () for a recent overview of different standard methods. It is the method that was already sketched in the introduction. At each iteration, the algorithm samples candidate solutions from a multivariate Gaussian distribution parameterized by , where represents the mean vector of the Gaussian distribution and its overall variance. The parameter space is then . The parameter update follows (3), which is in the form (7) with the function that equals (8) and never leaves for if for all .

### 2.2 Population-Based Incremental Learning

The population-based incremental learning (PBIL) algorithm Baluja1995icml () is a probability model based search algorithm for optimization of a function defined on a binary space, . At each iteration, PBIL samples candidate solutions from a multivariate Bernoulli distribution parameterized by , where the th element of , denoted by , represents the probability of the th component of , denoted by , being . The probability parameter is then updated by the following formula

 θn+1=θn+α∑λi=1W(i;xn,1,…,xn,λ)(xn,i−θn). (12)

It is easy to see that this algorithm is of the form (7) and that the sequence never leaves for if for all .

### 2.3 Information Geometric Optimization

<ltx:note>New in 2017/02/24 ver</ltx:note>

The information geometric optimization (IGO) Ollivier2017jmlr () is a generic framework of probability model based search algorithms for black-box optimization of in an arbitrary search space . It takes a parametric family of probability distributions, , on as an input to the framework, and provides a procedure to update the distribution parameters , where is the number of parameters. It defines the utility function as a quantile-based transformation of , namely, , where is a non-increasing function. The objective of the IGO algorithm is to update the distribution parameters, , to increase the expected value of the utility function, , where is the probability density at a point given . Note that the function will be differentiable under a mild condition on independently of and . The idea is to take the gradient information of to optimize , which leads to optimizing indirectly. The IGO takes the so-called natural gradient Amari1998nc (), which is computed as the product of the inverse of the Fisher information matrix and the vanilla gradient of , namely, at . The natural gradient of can be approximated by the Monte-Carlo,

 ∇Jn∣θ=θn =∫un(x)pθn(x)∇θln(pθ(x))∣θ=θndx ≈1λλ∑i=1un(xn,i)∇θln(pθ(xn,i))∣θ=θn ≈1λλ∑i=1W(i;xn,1,…,xn,λ)∇θln(pθn(xn,i))∣θ=θn=:ˆ∇Jn(θn),

where is the number of samples, is the weight defined in (6) with a specific choice of . The resulting algorithms fit in the form (7), where is replaced with the natural gradient of the log-likelihood, . It is known that the PBIL and a variant of covariance matrix adaptive evolution strategy are derived as instantiations of the IGO framework.

### 2.4 Policy Gradient with Parameter Exploration

<ltx:note>New in 2017/02/24 ver</ltx:note>

Consider the control task of an agent, which has a parameterized policy, i.e., control law, , where is the observed state of the agent, is the policy parameter, and maps the observed state to an action. The agent interacts with the environment, where the next state of agent is determined by the unknown transition probability kernel, , where and are the current state and the current action, and is the next state. Given the history, , of states and actions in steps, the agent receives a reward, . The objective of the learning task is to optimize so as to maximize the expected reward . The control law is often modeled by a multi-layer perceptron, where is the vector consisting of link weights between neurons. Since the transition probability kernel, which can be deterministic, is unknown to the agent, the function to be minimized forms a black-box objective.

The Policy Gradient with Parameter Exploration (PGPE) Sehnke2010nn () takes a probability model on parameterized by . It takes the expectation of over to define the expected objective . The PGPE takes the vanilla gradient at . The vanilla gradient of can be approximated by the Monte-Carlo,

 ∇J(θ) =∫f(w)∇θln(q(w;θ))q(w;θ)dw≈1λλ∑i=1f(wi)∇θln(q(wi;θ)) ≈1λλ∑i=1(1NhNh∑j=1r(hi,j))∇θln(q(wi;θ))=:ˆ∇J(θ),

where is the number of samples, for are independent samples drawn from , is the number of histories observed for each , and are the histories generated by the agent with polity . If the transition in environment is deterministic, for any , hence no need to set . The parameter update reads

 θn+1=θn+Anˆ∇J(θn),

where is the iteration counter, is a generalized step-size.

In references Sehnke2010nn (); Zhao2012nn (), the independent Gaussian model is employed, where the parameter encode the mean vector and the standard deviation in each coordinate for . The generalized step-size is then the diagonal matrix such that . We find that it is proportional to the inverse of the Fisher information matrix of . Indeed, for this model is a diagonal matrix whose th and th diagonal elements are . Rewriting , the parameter update reads

 θn+1=θn+αI−1(θn)ˆ∇J(θn).

It turns out to be equivalent to the information geometric optimization algorithm, except that the nonlinear transformation in (6) is not performed.

### 2.5 Other Examples

The rank based search algorithms of the form (7) include probabilistic model based algorithms such as cross-entropy optimization algorithms Boer2005aor () and estimation of distribution algorithms Lozano2006book (). Though we are motivated to analyze rank-based algorithms, our approach is not limited to rank-based methods. It also includes randomized derivative-free methods such as natural evolution strategies (NES) wierstra2014jmlr () or adaptive simultaneous perturbation stochastic approximation (SPSA) spall2000adaptive (). Reinforcement learning algorithms for an episodic task such as policy gradients with parameter-based exploration Sehnke2010nn () are also included in our framework. The search space is not necessarily a continuous domain; it can be discrete as long as the probability distribution is parameterized by a continuous parameter vector as in the PBIL in Section 2.2.

## 3 Ordinary Differential Equation Methods

Ordinary Differential Equation (ODE) methods are major methodologies to prove the convergence of recursive stochastic algorithms of the form (1). We use in the sequel the following rewriting of the algorithms

 θn+1=θn+α(F(θn)+Mn), (13)

where is a martingale difference sequence. Throughout the paper, we assume that never leaves its domain for small enough. Under some regularity conditions, it can be proven that the stochastic sequence converges towards the attractor set of the associated ODE (2).

Here we demonstrate a simple ODE method that relates the behavior of a recursive stochastic algorithm (13) and its associated ODE (2). To illustrate the basic principle of the approach, we pose a relatively strong assumption that allows us to obtain an easy proof, namely . That is, we assume that the second moment of is bounded by a constant over the domain of . This assumption may not be satisfied especially when the domain of is unbounded.

Let be the flow derived from , i.e., is an extended solution to the initial value problem with at . More precisely, is an absolutely continuous function defined on that satisfies

 φ(t;θ0)=θ0+∫t0F(φ(τ;θ0))dτ. (14)

This definition of the solution is extended in the sense that it may have non differentiable points in a set of zero measure. The existence of a solution of the ODE of the form (14) is verifiable without knowing the explicit solution using, for example, Carathéodory’s existence theorem.

Let be a deterministic and possibly time-dependent step-size and define

 tn,k =n+k−1∑i=nαi and εn,k =n+k−1∑i=nα2i. (15)

Given and , we consider as the approximation stemming from (13) of the solution to the ODE at time with the initial condition . It follows from (13) that and from (14) that if the solution exists then it satisfies for any and , hence the difference between and can be expressed as

 θn+N−φ(tn,N;θn)=∑N−1i=0αn+i(F(θn+i)−F(φ(tn,i;θn)))+∑N−1i=0∫tn,i+1tn,i(F(φ(tn,i;θn))−F(φ(τ;θn)))dτ+∑N−1i=0αn+iMn+i. (16)

Assume that is Lipschitz continuous with as the Lipschitz constant. The second term on the RHS of (16) is considered as the sum of the errors due to the time-discretization of the ODE solution. It is not difficult to imagine that we can obtain a bound for the second term because of the Lipschitz continuity. The third term is the sum of the martingale difference noises. By the uncorrelation of martingale difference sequences, we can obtain a bound. The first term is bounded as because of the Lipschitz continuity of . Then, we can apply the discrete Gronwall inequality Clark1987279 () to obtain the following result. Its proof is found in the proof of Theorem 12, which is an extension of the following theorem.

###### Theorem 3.

Consider an algorithm of the form (13) on the domain with deterministic and possibly time-dependent step-size . Assume that for a finite constant , is Lipschitz continuous with the Lipschitz constant , i.e., . Let and . Then, for any there exists a unique solution satisfying (14) and for any and ,

 sup0⩽k⩽NEn[∥θn+k−φ(tn,k;θn)∥2]1/2⩽((LK/2)εn,N+Kε1/2n,N)exp(Ltn,N). (17)

If the step-size satisfies and , then for any fixed , and as . Hence, we obtain with (17) that

 limn→∞sup0⩽k⩽NEn[∥θn+k−φ(tn,k;θn)∥2]1/2=0. (18)

Roughly speaking, the limit behavior of for follows the continuous-time trajectory for any fixed . Therefore, if the associated mean ODE has a global attractor in the sense that for any , the sequence will converge towards the global attractor. However, the algorithms that we are interested in (e.g., rank-based search algorithms) are using a constant step-size in which case they exhibit convergence. We do not observe geometric convergence if the step-size is decreasing.

If the step-size is constant over time index , for any and , and as . Therefore, (17) immediately implies that

 limα→0sup0⩽k⩽NEn[∥θn+k−φ(kα;θn)∥2]1/2=0. (19)

Roughly speaking, the stochastic sequence can follow the continuous-time trajectory for with arbitrary precision by taking a sufficiently small . This is not only for the limit behavior, but it holds for any . However, since and , one can see from the RHS of (17) that the upper-bound found for the term increases infinitely as for any fixed . Therefore, we do not get the convergence of from this argument.

Theorem 3 in Chapter 9 of Borkar2008book (), for example, deals with a recursive algorithm with a constant step-size by utilizing a similar idea to Theorem 3, assuming that the associated ODE has a global exponential attractor , that is, one can pick independently of such that . Then, letting , we have that and , which indicates that for any there exists such that the RHS of (17) is no greater than for any . Then, we have

 E[∥θN(k+1)−θ∗∥2]1/2 ⩽E[(∥φ(Nα;θNk)−θ∗∥+∥θN(k+1)−φ(Nα;θNk)∥)2]1/2 ⩽E[∥φ(Nα;θNk)−θ∗∥2]1/2+E[∥θN(k+1)−φ(Nα;θNk)∥2]1/2 ⩽(1/2)E[∥θNk−θ∗∥2]1/2+C.

From the above inequality we obtain an upper bound . We can conclude that eventually becomes arbitrarily small as we take small enough, however, we do not derive the convergence of towards .

There are a few applications of ODE based methods to analyze algorithms to solve deterministic optimization. Yin et al. Yin:1995wj (); Yin1995ec () have analyzed a variant of evolution strategies, namely the (, )-evolution strategy, by means of the ODE method. Nevertheless, encodes only the mean vector in those aforementioned studies and the overall variance is controlled by a function of the gradient of the objective function. Hence the algorithm analyzed is significantly different from state-of-the-art evolution strategies that adapts and simultaneously. Moreover, since these studies are based on the asymptotic relation between the parameter sequence and the ODE where they assume the decreasing learning rate or a constant but infinitesimal learning rate, they can not obtain geometric convergence (Theorem 5.2 of Yin1995ec ()). Recently, a probability model based search algorithm that can be regarded as a variation of the covariance matrix adaptation evolution strategy has been proved globally convergent Zhou:2018tq (). Their algorithm is more practical than the one above, yet the analysis relies on the ODE method with decreasing step-size. Therefore, the convergence is not geometric.

Gerencsér and Vágó gerencser2001mathematics () have developed an approach based on the ODE method to prove the geometric convergence of SPSA for noise free optimization with a constant step-size. In their work, represents the current candidate solution in and is an estimate of the gradient of function that is approximated by taking a pair of symmetric perturbation with perturbation strength , i.e., , where is a random vector. The geometric convergence of the SPSA with a constant on a convex quadratic function has been proven but the approach cannot generalize to a context where not only is adapted. Remark also that it is crucial to adapt to obtain a good practical algorithm, in the same way that it is crucial to use line-search procedures to determine the step-size in gradient-based algorithms. This adaptive scenario is not covered in gerencser2001mathematics ().

Besides the fact that the algorithms we are interested to analyze run with a constant step-size, we have to overcome the following potential difficulties to prove the geometric convergence of the algorithms such as rank-based search algorithms. One is that the optimal parameter (that is, the parameter one optimally wants to converge to—which is usually clear for a given context) is typically located on the boundary of the domain , and it is not necessarily included in . In the case of adaptive ESs, the optimal parameter is . Since the variance of a probability distribution should be positive, is on the boundary and is not included in . Convergence towards points on the boundary which are not necessarily in the parameter set is not standard in ODE analysis.

Another issue is that the optimal parameter may not be the unique equilibrium point of the associated ODE (2) in the closure . In some cases, we might be able to extend the domain to by extending by continuity. However, for the algorithms we are interested in, we have for some . For example, if is zero in the step-size adaptive ES described in Section 2.1, we have , implying that for any where is zero. Therefore, in any -neighborhood of , there exists infinitely many equilibrium points of the associated ODE (2). This prevents us from choosing the most commonly used Lyapunov function to prove that the optimal point is a global attractor of the ODE (2). Indeed, the time derivative of along the ODE solution, i.e., , is arbitrarily close to zero near equilibrium points. This violates the Lyapunov’s stability criterion.

The last issue is that the geometric convergence of the parameter vector using the Euclidean metric is generally not what we want. For the rank-based optimization algorithm described in Section 2, we typically want to show the geometric convergence of candidate solutions, , generated from at each iteration , or the geometric convergence of the sequence of the mean vectors or any other representative points of . They may not be directly connected to the geometric convergence of in a Euclidean sense since one can pick arbitrary parameterization of the probability distribution when designing the algorithm.

## 4 Global Geometric Convergence via ODE Method

We present now our main results to prove the geometric convergence of algorithms exemplified in Section 2. It relies on finding a function that should satisfy conditions A1 and A2 presented in the introduction so as to imply the geometric convergence of to zero for an appropriate choice of the learning rate . The function will typically upper-bound the interesting quantities whose geometric convergence we want to prove—for example, if our objective is to show the geometric convergence of the expected distance between the candidate solution and the optimal point of , one should choose such that for some . The first condition A1 that should be satisfied by translates that the function decreases along the trajectories of the solutions of the ODE (2), similarly to Lyapunov functions to prove stability of equilibrium of ODEs. This decrease should however be geometric. The second condition A2 is to control the deviation between the stochastic trajectory and the solution of the ODE.

After stating that A1 and A2 imply the geometric convergence of in Theorem 4, we will derive a corollary on the first hitting time bound and discuss the choice of for two example algorithms that were presented in Section 2

### 4.1 Theorem: Sufficient Conditions for Geometric Convergence

In the following theorem we prove that the conditions A1 and A2 explicated above imply the global geometric convergence of the expectation of for a small enough belonging to a set characterized below.

###### Theorem 4.

Let be a sequence defined by (13) with given deterministically. Let be an absolutely continuous function satisfying (14). Suppose that there is a function satisfying the following two conditions:

A1

There exists nonincreasing such that as , and for any and any

 Ψ(φ(t;θ))⩽ΔA1(t)Ψ(θ); (20)
A2

There exists nondecreasing w.r.t. each argument such that as for any fixed , and for any and

 E0[Ψ(θN)]⩽Ψ(φ(Nα;θ0))+ΔA2(α,Nα)Ψ(θ0). (21)

Let and define . Then, is nonempty, and for any there exists at least one finite integer such that . Let the smallest of such be denoted by and if and otherwise. Then, for all and for all , the stochastic algorithm (13) with satisfies the following inequality

 E[Ψ(θn)]⩽γnα(¯γα/γNα−1α)Ψ(θ0). (22)

Moreover,

 limsupn→∞1nlnE[Ψ(θn)]⩽lnγα. (23)
###### Proof.

First we show that is nonempty. According to A1, we can take such that