Accelerating Optimization and Reinforcement Learning with Quasi-Stochastic Approximation

The ODE (ordinary differential equation) method has been a workhorse for algorithm design and analysis since the introduction of the stochastic approximation technique of Robbins and Monro in the early 1950s. It is now understood that convergence theory amounts to establishing robustness of Euler approximations for ODEs, while theory of rates of convergence requires finer probabilistic analysis. This paper sets out to extend this theory to quasi-stochastic approximation (QSA), based on algorithms in which the “noise” or “exploration” is based on deterministic signals, much like quasi-Monte Carlo. The main results are obtained under minimal assumptions: the usual Lipschitz conditions for ODE vector fields, and for rate results it is assumed that there is a well defined linearization near the optimal parameter , with Hurwitz linearization matrix . Algorithm design is performed in continuous time, in anticipation of discrete-time implementation based on Euler approximations, or high-fidelity alternatives.

The main contributions are summarized as follows:

  • If the algorithm gain is chosen as with and , then the rate of convergence of the algorithm is . There is also a well defined “finite-” approximation:

    where is a vector identified in the paper, and is bounded with zero temporal mean.

  • With gain the results are not as sharp: the rate of convergence holds only if is Hurwitz. Hence we obtain the optimal rate of convergence only if is chosen sufficiently large.

  • Based on the Ruppert-Polyak averaging technique of stochastic approximation, one would expect that a convergence rate of can be obtained by averaging:

    where the estimates are obtained using the gain in (i). The preceding sharp bounds imply that averaging results in convergence rate if and only if . This condition holds if the noise is additive, but appears to fail in general.

  • The theory is illustrated with applications to gradient-free optimization, and policy gradient algorithms for reinforcement learning.

Note: This pre-print is written in a tutorial style so that it is accessible to newcomers. It will form part of a handout for upcoming short courses on RL. A more compact version suitable for journal submission is in preparation.

AMS-MSC: 68T05, 93E35, 49L20

1 Introduction

The ODE method was coined by Ljung in his 1977 survey of stochastic approximation (SA) techniques for the analysis of recursive algorithms [25]. The theory of SA was born more than 25 years earlier with the publication of the work of Robbins and Monro [33], and research in this area has barely slowed in the 70 years since its publication. The goal of SA is to solve a simple root-finding problem: compute or approximate the vector solving , in which is defined by an expectation:


where , and is a random vector taking values in a set (assumed in this paper to be a subset of Euclidean space).

Given the evolution of this theory, it is useful to reconsider the meaning of the method. Rather than merely a method for analyzing an algorithm, the ODE method is an approach to algorithm design, broadly described in two steps:

  • Step 1: Design (and hence ) so that the following ODE is globally asymptotically stable:

  • Step 2: Obtain a discrete-time algorithm via a “noisy” Euler approximation:


    in which is the non-negative “gain” or “step-size” sequence, and the sequence has mean that vanishes as .

Each step may require significant ingenuity. Step 1 may be regarded as a control problem: design dynamics to reach a desirable equilibrium from each initial condition. There are algorithm design questions in Step 2, but in the original formulation of SA and nearly all algorithms that follow, recursion (3) takes the form


in which has the same distribution as , or the distribution of converges to that of as . A useful approximation requires assumptions on , the “noise” , and the step-size sequence . The required assumptions, and the mode of analysis, are not very different from what is required to successfully apply a deterministic Euler approximation [10].
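As a concrete illustration of Step 2, the recursion (4) can be simulated for a scalar root-finding problem. The mean field and the noise below are hypothetical choices, not taken from the applications studied later:

```python
import numpy as np

# Noisy Euler recursion theta_{n+1} = theta_n + a_{n+1} f(theta_n, noise),
# sketched for a hypothetical scalar problem with fbar(theta) = -(theta - 2),
# so the root is theta* = 2.
rng = np.random.default_rng(0)

def f_noisy(theta):
    # observed vector field: mean field plus zero-mean additive noise
    return -(theta - 2.0) + rng.normal(scale=0.5)

theta = 0.0
for n in range(1, 10_001):
    a_n = 1.0 / n            # standard vanishing step-size sequence
    theta += a_n * f_noisy(theta)
# theta is now close to the root theta* = 2
```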

The motivation for the abstract formulation of Step 2 is to reinforce the idea that we are not bound to the traditional recursion. For example, control variate techniques offer alternatives: an example is the introduction of the “advantage function” in policy gradient techniques for reinforcement learning (RL) [41, 40].

In much of the SA literature, and especially in the applications considered in this paper, it is assumed that is independent and identically distributed (i.i.d.), or more generally, it is a Markov chain. Just as quasi Monte-Carlo algorithms are motivated by fast approximation of integrals, the quasi-stochastic approximation (QSA) algorithms considered in this paper are designed to speed convergence for root finding problems. It is useful to pose the algorithm in continuous time:


where in the main results we restrict to gains of the form


The probing signal is generated from a deterministic (possibly oscillatory) signal rather than a stochastic process. Two canonical choices are the -dimensional mixtures of periodic functions:


for fixed vectors , phases , and frequencies . Under mild conditions on we can be assured of the existence of this limit defining the mean vector field:


Our aim is to obtain tight bounds between solutions of (5) and


where the choice of depends on the stability properties of the associated ODE (2) with constant gain.


It is assumed throughout the paper that (2) is globally asymptotically stable, with unique equilibrium denoted . Convergence of (5) is also assumed: sufficient conditions based on the existence of a Lyapunov function can be found in [6, 7]. This prior work is reviewed in the Appendix, along with a new sufficient condition extending the main result of [11].

We say that the rate of convergence of (5) is if


where is the estimation error. By careful design we can achieve , which is optimal in most cases (such as for Monte-Carlo – see Section 2.1).

Solutions to (9) and (2) are related by a time-transformation (see Lemma C.1—a common transformation in the SA literature), so that . Conditions are imposed so the rate of convergence is faster than , which justifies the following scaling:


We obtain general conditions under which is bounded and non-vanishing using the gain (6), which then implies the rate of convergence of (5) is . A key assumption is that the ODE is smooth near the equilibrium , so that there is a well defined linearization matrix .

Given this background, we are ready to summarize the main results:

1.  The following dichotomy is established:

  • With we obtain rate of convergence, provided is Hurwitz, regardless of the value of appearing in (6).

  • With we obtain the optimal rate of convergence provided is Hurwitz.

In either case, an exact approximation of the scaled error is obtained:


where is a vector identified in (45), and

This is assumed bounded in (justified for natural classes of probing signals in Appendix B).

2.  For those well-versed in stochastic approximation theory, the preceding conclusions would motivate the use of averaging:

where are obtained using the gain (6) with . The superscript refers to the averaging technique for stochastic approximation introduced independently by Ruppert and Polyak [34, 30].

In general, this approach fails: the rate of convergence of to is if and only if . A sufficient condition for this is additive noise, for which (5) becomes


In this case

3.  These theoretical results motivate application to gradient-free optimization, and policy-gradient techniques for RL. Theory and examples are presented in Sections 4 and 5.

Literature review

This paper spans three areas of active research:

1. Stochastic approximation Section 2 and some of the convergence theory in the appendix are adapted from [6, 7], which were inspired by the prior results in [27, 35]; [16] contains applications to gradient-free optimization with constraints. QSA methods appear to have originated in the domain of quasi-Monte Carlo methods applied to finance; see [22, 23].

The contributions here are a significant extension of [6, 7], which considered exclusively the special case in which the function is linear, with for matrices . The noise is thus additive: (13) holds with . Using the gain (6) with , the optimal rate of convergence was obtained under the assumption that is Hurwitz. The assumptions on using are stronger than what is imposed in stochastic approximation, which requires that is Hurwitz. On the other hand, the conclusions for stochastic approximation algorithms are weaker: from (12) we obtain

In the theory of stochastic approximation we must introduce an expectation, and settle for a much slower rate:


That is, the rate is rather than : see [20, 14] for refinements of this result and history.

The “ODE@∞” (73) was introduced in [11] for stability verification in stochastic approximation. Thm. A.1 is an extension of the Borkar-Meyn Theorem [11, 10], which has been refined considerably in recent years [31, 32].

2. Gradient free optimization The goal is to minimize a loss function over . It is possible to observe the loss function at any desired value, but no gradient information is available.

Gradient-free optimization has been studied in two, seemingly disconnected lines of work: techniques intended to directly approximate gradient descent through perturbation techniques, known as Simultaneous Perturbation Stochastic Approximation (SPSA), and Extremum-Seeking Control (ESC), which is formulated in a purely deterministic setting. The contributions of the present paper are motivated by both points of view, but lean more on the former. See [42, 24, 2] for the nearly century-old history of ESC.

Theory for SPSA began with the Kiefer-Wolfowitz algorithm [21], which requires access at each iteration to two perturbations per dimension to obtain a stochastic gradient estimate. This computational barrier was addressed in the work of Spall [36, 39, 37]. Most valuable for applications in RL is the one-measurement form of SPSA introduced in [37]: this can be expressed in the form (4), in which


where is a zero-mean and i.i.d. vector-valued sequence. The qSGD (quasi-Stochastic Gradient Descent) algorithm (56) is a continuous time analog of this approach.
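A minimal sketch of this idea, using a deterministic (quasi-stochastic) probing signal in place of the i.i.d. sequence, is given below. The quadratic loss, probing frequencies, and gain schedule are illustrative assumptions only:

```python
import numpy as np

# One-measurement gradient-free scheme with deterministic probing, in the
# spirit of (15): the single observation L(theta + eps * xi), scaled by xi,
# serves as a gradient estimate after averaging. Hypothetical quadratic loss.
theta_star = np.array([1.0, -2.0])

def loss(theta):
    return 0.5 * np.sum((theta - theta_star) ** 2)

omega = np.array([40.0, 57.0])   # distinct probing frequencies (assumed)
eps, dt = 0.5, 1e-3
theta = np.zeros(2)
t = 1.0
while t < 200.0:
    xi = np.sqrt(2.0) * np.sin(omega * t)   # zero mean, time-avg of xi xi' = I
    a = 1.0 / t                             # vanishing gain
    theta -= dt * a * (loss(theta + eps * xi) / eps) * xi
    t += dt
# theta is now close to theta_star
```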

The introduction of [29] suggests that there is an older history of improvements to SPSA in the Russian literature: see eqn. (2) of that paper and the surrounding discussion. Beyond history, the contributions of [29] include rates-of-convergence results for standard and new SPSA algorithms. Information-theoretic lower bounds for optimization methods that have access to noisy observations of the true function were derived in [19]. This class of algorithms also has some history in the bandits literature [1, 12].

In all of the SPSA literature surveyed above, a gradient approximation is obtained through the introduction of an i.i.d. probing signal. For this reason, the best possible rate is of order . The algorithms introduced in the present work are designed to achieve convergence rate for optimization and root-finding problems.

More closely related to the present work is [8], which treats SPSA using a specially designed class of deterministic probing sequences. That work contains no contributions comparable to those obtained here, but it motivates further research on rates of convergence for the algorithms proposed.

3. Policy gradient techniques These may be regarded as a special case of gradient-free optimization, in which the goal is to minimize average cost for an MDP based solely on input-output measurements. The standard dynamic programming formulation for optimal control is replaced with the following architecture: given a parameterized family of (possibly randomized) state-feedback policies , the goal is to minimize the associated average cost , where the expectation is in steady-state, subject to for all (the state description must be extended if the policy is randomized). Williams’ REINFORCE [43] is an early example, while the most popular algorithms today are of the “actor-critic” category, in which the approximation of the gradient of is based on an estimate of a value function. In the widely cited recent paper [26] it is shown that SPSA algorithms such as a modified version of (15) can sometimes outperform actor-critic methods.

Organization Section 2 contains some simple examples, introduced mainly to clarify notation and motivation, much of which is adapted from [6, 7]. The main results are summarized in Section 3, along with sketches of the proofs. Applications to gradient-free optimization are summarized in Section 4, and to policy gradient RL in Section 5. Section 6 contains conclusions and directions for future research. The appendix contains three sections: Appendix A concerns convergence theory of QSA based on [6, 7]. The main challenge is to establish boundedness of trajectories of the algorithm. A new sufficient condition is established, based on a generalization of the Borkar-Meyn Theorem. Ergodic theory for dynamical systems is required in the main assumptions: justification for a broad class of “probing signals” is contained in Appendix B. Appendix C contains the details of the proofs of the main results.

2 Simple Examples

The following subsections contain examples to illustrate the theory of QSA, along with a glimpse at applications. Sections 2.1 and 2.3 are adapted from [6, 7].

2.1 Quasi Monte-Carlo

Consider the problem of obtaining the integral over the interval of a function . In a standard Monte-Carlo approach we would draw independent random variables , with distribution uniform on the interval , and then average:


A QSA analog is described as follows: the probing signal is the one-dimensional sawtooth function, (modulo 1) and consider the analogous average


Alternatively, we can adapt the QSA model (5) to this example, with


The averaged function is then given by

so that is the unique root of . Algorithm (5) is given by:


Figure 1: Sample paths of Quasi Monte-Carlo estimates.

This Monte-Carlo approach (17) can be transformed into something resembling (19). Taking derivatives of each side of (17), we obtain using the product rule of differentiation, and the fundamental theorem of calculus,

This is precisely (19) with (not a great choice for an ODE design, since it is not bounded as ).

The numerical results that follow are based on , whose mean is . The differential equation (19) was approximated using a standard Euler scheme with sampling interval . Several variations were simulated, differentiated by the gain . Fig. 1 shows typical sample paths of the resulting estimates for a range of gains, and common initialization . In each case, the estimates converge to the true mean , but convergence is very slow for significantly less than one. Recall that the case is very similar to what was obtained from the Monte-Carlo approach (17).

Independent trials were conducted to obtain variance estimates. In each of independent runs, the common initial condition was drawn from , and the estimate was collected at time . Fig. 2 shows three histograms of estimates for standard Monte-Carlo (16), and QSA using gains and . An alert reader may wonder: why is the variance reduced by four orders of magnitude when the gain is increased from to ? The relative success of the high-gain algorithm is explained in Thm. 3.1.

Figure 2: Histograms of Monte-Carlo and Quasi Monte-Carlo estimates after independent runs. The optimal parameter is .
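The experiment above can be reproduced in a few lines. The integrand below is a hypothetical stand-in (the paper's own choice is not shown here), with mean 0.5 over the unit interval; the sawtooth probing signal and Euler discretization follow (19):

```python
import math

# QSA estimate (19) of the mean of y over [0, 1], using the sawtooth
# probing signal xi_t = t mod 1 and gain a_t = g / (1 + t).
# Hypothetical integrand: y(q) = q + sin(2*pi*q), with mean 0.5.
def qmc_estimate(g, T=1000.0, dt=1e-3, theta0=10.0):
    theta, t = theta0, 0.0
    while t < T:
        xi = t % 1.0
        y = xi + math.sin(2.0 * math.pi * xi)
        a = g / (1.0 + t)
        theta += dt * a * (y - theta)   # Euler step for d/dt Theta = a_t (y - Theta)
        t += dt
    return theta

est = qmc_estimate(g=5.0)   # larger gains converge faster (cf. Fig. 1)
```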

2.2 Constant gain algorithm

In later sections we will consider the linear approximation:


This provides insight, and sometimes we can show strong coupling between the linear and nonlinear QSA ODEs. We briefly consider here this linear model in which the gain is constant. Then QSA is a time-invariant linear system:

where is the error at time . For this simple model we can solve the ODE when the probing signal is the mixture of sinusoids (7b).

A linear system satisfies the principle of super-position. To put this to work, consider the probing signal (7b), and for each , consider the ODE

The principle states that the solution to the ODE is the sum:


We see that the response to the initial error decays to zero exponentially quickly. Moreover, to understand the steady-state behavior of the algorithm it suffices to fix a single value of .

Let’s keep things simple and stick to sinusoids; it is easier still to work with complex exponentials:

with and (dropping the scaling for simplicity, and the phase is easily returned by a time-shift). We can express the solution as a convolution:

Writing , the integral of the matrix exponential is expressed,

Using linearity once more, and the fact that the imaginary part of is , we arrive at a complete representation for (21):

Proposition 2.1. ()

Consider the linear model with Hurwitz, and probing signal (7b), for which the constant-gain QSA algorithm has the solution (21). Then for each and , with


Prop. 2.1 illustrates a challenge with fixed gain algorithms: if we want small steady-state error, then we require small (or large , but this brings other difficulties for computer implementation—never forget Euler!). However, if is very small, then the impact of the initial condition in (21) will persist for a long time.

The Ruppert-Polyak averaging technique can be used to improve the steady-state behavior—more on this can be found in Section 3 for vanishing-gain algorithms. It is easy to illustrate the value for the special case considered here. One form of the technique is to simply average some fraction of the estimates:


For example, means that we average the final 20%.

Corollary 2.2. ()

Suppose that the assumptions of Prop. 2.1 hold, so in particular is linear. Consider the averaged estimates (23) in which for fixed . Then,


and for each and , with equal to the integral of appearing in (22):

Hence, converges to at rate .    
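Corollary 2.2 is easy to test numerically. The sketch below simulates a scalar constant-gain model with A = -1 and a single sinusoidal disturbance (all constants are illustrative), and compares the final estimate with the windowed average (23):

```python
import math

# Constant-gain scalar model d/dt Theta = alpha*(-(Theta - 1) + sin(omega*t)),
# i.e. A = -1, equilibrium theta* = 1, single sinusoidal probing term
# (constants are illustrative). Compares the raw final estimate with the
# windowed average (23) over the final fraction kappa of the run.
def run(alpha=0.5, omega=20.0, T=50.0, dt=1e-3, kappa=0.2):
    n = int(T / dt)
    start_avg = int((1.0 - kappa) * n)   # first index of the averaging window
    theta, acc, count = 0.0, 0.0, 0
    for k in range(n):
        t = k * dt
        theta += dt * alpha * (-(theta - 1.0) + math.sin(omega * t))
        if k >= start_avg:
            acc += theta
            count += 1
    return theta, acc / count

final, averaged = run()
# averaging strips most of the steady-state oscillation around theta* = 1
```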

2.3 Application to policy iteration

Consider the nonlinear state space model in continuous time,

with , . Given a cost function , our goal is to approximate the optimal value function

and approximate the optimal policy. For this we first explain how policy iteration extends to the continuous time setting.

For any feedback law , denote the associated value function by

This solves a dynamic programming equation:

The policy improvement step in this continuous time setting defines the new policy as the minimizer:

Consequently, approximating the term in brackets is key to approximating PIA.

An RL algorithm is constructed through the following steps. First, add to each side of the fixed-policy dynamic programming equation:

The right-hand side motivates the following definition of the fixed-policy Q-function:

The policy update can be equivalently expressed , and this Q-function solves the fixed point equation


where for any function (note that this is a substitution, rather than the minimization appearing in Q-learning).

Consider now a family of functions parameterized by , and define the Bellman error for a given parameter as


A model-free representation is obtained, on recognizing that for any state-input pair ,


The error can be observed without knowledge of the dynamics f or even the cost function . The goal is to find that minimizes the mean square error:


We choose a feedback law with “exploration”:


chosen so that the resulting state trajectories are bounded for each initial condition, and that the joint process admits an ergodic steady state.

Whatever means we use to obtain the minimizer, this approximation technique defines an approximate version of PIA: given a policy and approximation , the policy is updated:


This procedure is repeated to obtain a recursive algorithm.

Least squares solution

Consider for fixed the loss function

If the function approximation architecture is linear,


in which . Then is a quadratic function of :

We leave it to the reader to find expressions for , , and .

In this special case we do not need gradient descent techniques: the matrices and can be represented as Monte-Carlo, as surveyed in Section 2.1, and then .

Gradient descent

The first-order condition for optimality is expressed as a root-finding problem: , and the standard gradient descent algorithm in ODE form is

with . This is an ODE of the form (2), whose QSA counterpart (5) is the QSA steepest descent algorithm,


where, based on (26), we can typically swap the derivative with respect to time and the derivative with respect to to obtain

The QSA gradient descent algorithm (31) is best motivated by a nonlinear function approximation, but it is instructive to see how the ODE simplifies for the linearly parameterized family (30). We have in this case

and using

so that (31) becomes


The convergence of (32) may be very slow if the matrix


has eigenvalues close to zero (see Prop. 3.3 for a simple explanation). This can be resolved through the introduction of a larger gain , or a matrix gain. One approach is to estimate from data and invert:


This might be motivated by the ODE approximation

This idea is the motivation for Zap SA and Zap Q-learning [15, 17].
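The effect of a matrix gain is easy to see in a noise-free sketch that isolates the conditioning issue: with a scalar gain, the eigen-direction with eigenvalue near zero converges very slowly, while a Zap-style matrix gain equalizes the rates. The quadratic objective below is hypothetical:

```python
import numpy as np

# Gradient flow theta' = -(1/t) * step for a hypothetical ill-conditioned
# quadratic 0.5*(theta - theta*)' G (theta - theta*): scalar gain versus a
# Zap-style matrix gain G^{-1}, which equalizes rates across eigen-directions.
G = np.diag([1.0, 0.05])             # slow eigen-direction: eigenvalue 0.05
theta_star = np.array([1.0, -1.0])

def run(matrix_gain, T=2000.0, dt=0.02):
    Ginv = np.linalg.inv(G)
    theta = np.zeros(2)
    t = 1.0
    while t < T:
        grad = G @ (theta - theta_star)
        step = Ginv @ grad if matrix_gain else grad
        theta -= dt * (1.0 / t) * step
        t += dt
    return np.linalg.norm(theta - theta_star)

err_scalar, err_zap = run(False), run(True)
# err_scalar stalls in the slow direction; err_zap decays like 1/T
```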

Figure 3: Comparison of QSA and Stochastic Approximation (SA) for policy evaluation.

Numerical example

Consider the LQR problem in which , and , with and . Given the known structure of the problem, we know that the function associated with any linear policy , takes the form

where solves the Lyapunov equation

This motivates a quadratic basis, which for the special case and becomes

and there is no harm in setting .

In order to implement the algorithm (34b) we begin by selecting an input of the form


where is a stabilizing controller and . Note that need not be the same whose value function we are trying to evaluate.

Figure 4: Iterations of PIA

The numerical results that follow are based on a double integrator with friction:

which can be expressed in state space form using :


We took a relatively large cost on the input:

and gain .
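For this LQR setting, the fixed-policy value function can be computed exactly by solving the Lyapunov equation, which provides a ground truth for the model-free estimates. The numbers below (friction coefficient, cost matrices, policy gain) are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

# Double integrator with friction in state-space form, and the value
# function M for a fixed linear policy u = -K x, obtained from the
# Lyapunov equation Acl' M + M Acl + Q + K' R K = 0. All numbers are
# illustrative assumptions (friction 0.1, Q = I, R = 10, K = [1, 1]).
A = np.array([[0.0, 1.0],
              [0.0, -0.1]])
B = np.array([[0.0],
              [1.0]])
Q, R = np.eye(2), np.array([[10.0]])
K = np.array([[1.0, 1.0]])           # a stabilizing gain (assumed)

Acl = A - B @ K
Qc = Q + K.T @ R @ K
# Solve the Lyapunov equation by vectorization (Kronecker identities)
n = Acl.shape[0]
Lyap = np.kron(np.eye(n), Acl.T) + np.kron(Acl.T, np.eye(n))
M = np.linalg.solve(Lyap, -Qc.reshape(-1)).reshape(n, n)
# M is symmetric positive definite when Acl is Hurwitz and Qc > 0
```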

Fig. 3 shows the evolution of the QSA algorithm for the evaluation of the policy using the stabilizing controller , with in (35) taken to be the sum of 24 sinusoids with random phase shifts, whose frequencies were sampled uniformly between and rad/s. The QSA algorithm is compared with the related SA algorithm in which is “white noise” rather than a deterministic signal.

Fig. 4 shows the weighted error for the feedback gains obtained using the approximate policy improvement algorithm (29) and the optimal controller (which can be easily computed for an LQR problem). Each policy evaluation was performed by the model-free algorithm (34). The PIA algorithm indeed converges to the optimal control gain .

3 Main Results

3.1 Assumptions

The following assumptions are imposed throughout the remainder of the paper:

  • The process is non-negative, monotonically decreasing, and

  • The functions and are Lipschitz continuous: for a constant ,

    There exists a constant , such that for all , ,

  • The ODE (2) has a globally asymptotically stable equilibrium .

The process in (A1) is a continuous time counterpart of the standard step size schedules in stochastic approximation. The Lipschitz condition in (A2) is standard, and (38) is only slightly stronger than ergodicity of as given by (8). General sufficient conditions on both and for the ergodic bound (38) are given in Lemma B.1. Assumptions (A1) and (A2), along with a strengthening of (A3), imply convergence of (5) to : see Appendix A.

We henceforth assume that (5) is convergent, and turn to identification of the convergence rate. This requires a slight strengthening of (A2):

  • The vector field is differentiable, with derivative denoted


    That is, is a matrix for each , with .

    Moreover, the derivative is Lipschitz continuous, and is Hurwitz.

The matrix-valued function is uniformly bounded over as a consequence of the global Lipschitz assumption on imposed in (A2). The Hurwitz assumption implies that the ODE (2) is exponentially asymptotically stable.

The final assumption is a substantial strengthening of the ergodic limit (38). For this it is simplest to adopt a “Markovian” setting in which the probing signal is itself the state process for a dynamical system:


where is continuous, with a compact subset of Euclidean space. A canonical choice is the -dimensional torus: , and defined to allow modeling of excitation as a mixture of sinusoids:


with distinct frequencies, ordered for convenience: . The dynamical system (40) is linear in this special case. It is ergodic, in a sense made precise in Lemma B.1.
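For the special case (41), the probing signal can be generated directly and its ergodicity checked empirically: time averages of the mixture of sinusoids converge to the mean under the invariant measure. The frequencies below are illustrative:

```python
import numpy as np

# Probing signal as the state of a dynamical system on the torus: each
# phase evolves as d/dt phi_i = omega_i (mod 2*pi), and the output is a
# mixture of sinusoids. Frequencies are illustrative; distinct values
# make the time averages converge to the mean under the invariant measure.
omega = np.array([1.0, np.sqrt(2.0), np.pi])   # distinct frequencies
ts = np.linspace(0.0, 2000.0, 400_001)
phi = (omega * ts[:, None]) % (2.0 * np.pi)    # trajectory on the 3-torus
samples = np.sin(phi)
time_avg = samples.mean(axis=0)                # approaches the zero mean
```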

Lemma B.1 also provides justification for the following assumptions for the special case (41), and under mild smoothness conditions on . We do not assume (41) in our main results.

  • The probing signal is the solution to (40), with a compact subset of Euclidean space. It has a unique invariant measure on , and satisfies the following ergodic theorems for the functions of interest, for each initial condition :

    • For each there exists a solution to Poisson’s equation with forcing function . That is,



    • The function , and derivatives and are Lipschitz continuous in . In particular, admits a derivative satisfying

      where and was defined in (39). Lipschitz continuity is assumed uniform with respect to the exploration process: for ,

    • Denote . The following limit exists:

      and the partial integrals are bounded:

Assumption (A5) (iii) is imposed because the vector arises in an approximation for the scaled error . This assumption is not much stronger than the others. In particular, the partial integrals will be bounded if there is a bounded solution to Poisson’s equation:

The notation in (42) is used to emphasize the parallels with Markov process and stochastic approximation theory: this is precisely the solution to Poisson’s equation (with forcing function ) that appears in theory of simulation of Markov processes, average-cost optimal control, and stochastic approximation [18, 3, 28, 5]. For the one-dimensional probing signal defined by the sawtooth function , , a solution to Poisson’s equation has a simple form:

It will be useful to make the change of notation: for