# A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

**Jalaj Bhandari** (jb3618@columbia.edu) and **Daniel Russo** (dan.joseph.russo@gmail.com)
Industrial Engineering and Operations Research, Columbia University

###### Abstract

Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a simple and explicit finite time analysis of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. A final section of the paper shows that all of our main results extend to the study of Q-learning applied to high-dimensional optimal stopping problems.

**Keywords:** reinforcement learning, temporal difference learning, finite sample bounds, stochastic gradient descent.

## 1 Introduction

Reinforcement learning (RL) offers a general paradigm for learning effective policies for stochastic control problems. At the core of RL is the task of value prediction: the problem of learning to predict cumulative discounted future reward as a function of the current state of the system. Usually, this is framed formally as the problem of estimating the value function corresponding to a given policy in a Markov decision process (MDP). Temporal difference learning (TD), first introduced by Sutton (1988), is the most widely used algorithm for this task. The method approximates the value function by a member of some parametric class of functions. The parameters of this approximation are then updated online in a simple iterative fashion as data is gathered.

While easy to implement, theoretical analysis of TD is quite subtle. A central challenge is that TD's incremental updates, which are cosmetically similar to stochastic gradient updates, are not true gradient steps with respect to any fixed loss function. This makes it difficult to show that the algorithm makes consistent, quantifiable progress toward any particular goal. Reinforcement learning researchers in the 1990s gathered both limited convergence guarantees for TD and examples of divergence. Many issues were clarified in the work of Tsitsiklis and Van Roy (1997), who established precise conditions for the asymptotic convergence of TD with linear function approximation, and provided counterexamples when these conditions are violated. With guarantees of asymptotic convergence in place, a natural next step is to understand the algorithm's statistical efficiency. How much data is required to reach a given level of accuracy? Can one give uniform bounds on this, or could data requirements explode depending on the problem instance? Twenty years after the work of Tsitsiklis and Van Roy (1997), such questions remain largely unsettled.

In this work, we take a step toward correcting this by providing a simple and explicit finite time analysis of temporal difference learning. We draw inspiration from the analysis of projected stochastic gradient descent. These analyses are simple (enough so that they are frequently taught in machine learning courses) and the explicit bounds they produce provide clear assurance of the robustness of SGD. Unfortunately, there are critical differences between TD and SGD, and as such these simple analyses do not apply to TD. Instead, past work on TD has needed to invoke powerful results from the theory of stochastic approximation. In this work, we uncover an approach to analyzing TD which, except for a few crucial steps, leverages the standard tools for finite time analysis of SGD. In addition to the several novel guarantees we derive in the paper, we feel the analysis offers insight into the dynamics of TD, and we hope our approach helps future researchers derive stronger bounds and principled improvements to the algorithm.

The most natural model for studying value prediction in RL is one where data consists of a single sequence of observed rewards and state transitions generated from applying the evaluation policy on the underlying Markov chain. The highly dependent nature of the sequence of observations introduces substantial complications. For this reason, many papers carry out their analyses in a simpler model where data tuples are sampled i.i.d. from the stationary distribution. In this work, we first develop the analysis for the simpler i.i.d. model. We then show how to extend our proof technique to the Markov case, yielding similar bounds, except with an additional error term that depends on the mixing time of the Markov chain. A final section of the paper shows that all of our main results extend to the study of a variant of Q-learning applied to optimal stopping problems.

The strong dependencies in the data make analysis in the Markov chain setting challenging, and for tractability, in that setting we study a variant of TD that uses a projection step. This projection imposes a uniform bound on the gradient noise, and for this reason many standard analyses of stochastic gradient algorithms rely on projection. Even with this projection step, the dependent nature of the data makes finite-time analysis challenging. Our proof uses a novel information-theoretic technique to control for the bias this introduces, which may serve as a useful tool for future analyses in reinforcement learning and stochastic approximation.

##### Related literature:

We discuss some of the existing work on the temporal difference learning algorithm and its various modifications used for reinforcement learning. Tsitsiklis and Van Roy (1997) established the asymptotic convergence of TD with linear function approximation using results from the theory of stochastic approximation. Following that, Konda (2002) showed the asymptotic normality of TD. However, there has been only limited work exploring finite time properties of TD. Two exceptions are the works of Korda and La (2015) and Dalal et al. (2017). While Korda and La (2015) give a variety of convergence rate results for different choices of stepsizes, doubts have been raised regarding the correctness of their results (Lakshminarayanan and Szepesvári, 2017). Additionally, their bounds contain many hard-to-interpret constants.

Dalal et al. (2017) establish a convergence rate for the parameter maintained by TD that decays polynomially in the number of iterations, at a rate governed by a tuning parameter that influences the stepsizes used by the algorithm. This work relies on some refined tools from the literature on stochastic approximation that give explicit error bounds when applying the ODE method. One strength of this approach is that their analysis works without modifications such as a projection step or iterate averaging. However, relative to the techniques in this paper, their analysis is quite difficult and results in bounds with hard-to-interpret constants along with slower convergence rates.

While this work was under review at COLT 2018, Lakshminarayanan and Szepesvári (2018) published their work analyzing linear stochastic approximation algorithms, exploring whether $O(1/T)$ convergence rates can be obtained using a universal constant step-size together with iterate averaging. The family of TD algorithms can be shown to fit this general framework. Such a convergence result was shown by Bach and Moulines (2013) for least-squares and logistic regression problems. Extending the work of Bach and Moulines (2013), they show that such a result is possible only under structural assumptions on the stochastic noise. As a by-product of their analysis, they were able to show an $O(1/T)$ convergence rate for TD(0) with iterate averaging using a universal constant step-size under the i.i.d. assumption, i.e. assuming steady-state behavior of the underlying Markov chain. In this work, we show a similar result in Theorem 7, albeit with problem-dependent decaying step-sizes. However, we want to emphasize the main differences: we take a completely different approach, following the classic stochastic gradient descent analysis. We believe this line of analysis is quite powerful, as it is both simpler and more generally applicable to different scenarios. In particular, we are easily able to relax the i.i.d. assumption and extend our analysis to the Markov chain sampling model, which is a more realistic setting. Our proof techniques also extend easily to analyzing the Q-function approximation algorithm proposed by Van Roy (1998) for approximating optimal stopping times. It is unclear whether these extensions can be naturally incorporated in the framework of Lakshminarayanan and Szepesvári (2018).

Although not directly related, we note some existing work on convergence results for generalizations and modifications of TD learning methods. The TD algorithm is only known to provably converge with linear function approximation and when the data is sampled from the policy under evaluation (on-policy learning) (Tsitsiklis and Van Roy, 1997). Recently, much research has gone into developing novel variants of the TD method which are applicable more generally. In this regard, Sutton et al. (2009a, b) develop gradient TD algorithms, called GTD and GTD2, and linear TD with gradient correction, called TDC, which provably converge in off-policy learning scenarios. Akin to this, Bhatnagar et al. (2009) introduce counterparts to GTD2 and TDC which can be shown to provably converge with non-linear function approximators. At their core, these methods use a slightly different objective function, the projected Bellman error; they have the same computational complexity as the TD method and essentially rely on the ODE method (Borkar and Meyn, 2000) for their convergence proofs. Liu et al. (2015) is the first, and probably the only, work to explore finite time properties of GTD and GTD2 with linear function approximation. Reformulating these with a primal-dual saddle-point objective function, they design true stochastic gradient descent methods and leverage ideas from the convergence analysis of saddle-point problems in the optimization literature (Juditsky et al., 2011; Nemirovski et al., 2009) to give finite time convergence bounds. However, as with the TD method, little is known about the finite time properties of other variants, and we believe our analysis could potentially offer some insights here too.

It is also important to mention the works of Antos et al. (2008), Lazaric et al. (2010), Ghavamzadeh et al. (2010), Pires and Szepesvári (2012), and Prashanth et al. (2014), who study finite time convergence properties of the least squares TD (LSTD) algorithm, and the work of Yu and Bertsekas (2009) analyzing the least squares policy evaluation (LSPE) algorithm. However, we remark that all of these belong to a different class of algorithms, often referred to as batch methods; the study of incremental methods such as the TD algorithm has proven to be much more challenging.

## 2 Problem formulation

##### Markov reward process.

We consider the problem of evaluating the value function of a given policy in a Markov decision process (MDP). We work in the on-policy setting, where data is generated by applying the policy in the MDP. Because the policy is applied automatically to select actions, such problems are most naturally formulated as value function estimation in a Markov reward process (MRP). An MRP is a 4-tuple $(\mathcal{S}, P, R, \gamma)$ (Sutton and Barto, 1998), where $\mathcal{S}$ is the set of states, $P$ is the Markovian transition kernel, $R$ is a reward function, and $\gamma \in (0,1)$ is a discount factor. We simplify the presentation and avoid measure-theoretic issues by assuming the state space is discrete (at most countably infinite). In this case $P(s'|s)$ specifies the probability of transitioning from a state $s$ to another state $s'$. The reward function $R(s, s')$ associates a reward with each state transition. We denote by $R(s) = \sum_{s' \in \mathcal{S}} P(s'|s) R(s, s')$ the expected instantaneous reward generated from an initial state $s$.

Let $V_\mu$ denote the value function associated with this MRP. It specifies the expected cumulative discounted future reward as a function of the state of the system. In particular,

$$V_\mu(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, s_{t+1}) \,\Big|\, s_0 = s\right],$$

where the expectation is over sequences of states and rewards generated according to the transition kernel and reward function, respectively. This value function obeys the Bellman equation $V_\mu = T_\mu V_\mu$, where the Bellman operator $T_\mu$ associates a value function $V$ with another value function $T_\mu V$ satisfying

$$(T_\mu V)(s) = R(s) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s)\, V(s') \qquad \forall s \in \mathcal{S}.$$

We assume rewards are bounded uniformly, with $\sup_{s,s'} |R(s,s')| \le r_{\max}$. Under this assumption, value functions are assured to exist and are the unique solution to Bellman's equation. We assume that the Markov reward process induced by following the policy is ergodic, with a unique stationary distribution $\pi$. For any two states $s$ and $s'$, $\lim_{t \to \infty} \mathbb{P}(s_t = s' \mid s_0 = s) = \pi(s')$.
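The Bellman operator and the exact value function are easy to compute on a small instance. The following sketch (with an entirely made-up 3-state kernel, rewards, and discount, used purely for illustration) verifies that $V_\mu$ solves Bellman's equation and that iterating $T_\mu$ converges to it:

```python
import numpy as np

# A tiny, hypothetical MRP: 3 states, a random transition kernel P,
# per-transition rewards R(s, s'), and discount gamma.
rng = np.random.default_rng(0)
n, gamma = 3, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)        # rows of P sum to 1
R_sp = rng.random((n, n))                # reward R(s, s') for each transition
R = (P * R_sp).sum(axis=1)               # expected one-step reward R(s)

def bellman(V):
    """(T V)(s) = R(s) + gamma * sum_s' P(s'|s) V(s')."""
    return R + gamma * P @ V

# The value function is the unique fixed point V = T V, i.e. (I - gamma*P) V = R.
V_exact = np.linalg.solve(np.eye(n) - gamma * P, R)

# T is a gamma-contraction, so repeated application converges to V_exact.
V = np.zeros(n)
for _ in range(400):
    V = bellman(V)
assert np.allclose(V, V_exact, atol=1e-8)
```

Solving $(I - \gamma P)V = R$ directly is possible here only because the toy state space is tiny; the function approximation developed next exists precisely because this is hopeless at scale.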

##### Value function approximation.

Given a fixed policy $\mu$, the problem is to efficiently estimate the corresponding value function using only observed rewards and state transitions. Unfortunately, due to the curse of dimensionality, most modern applications have intractably large state spaces, rendering exact value function learning hopeless. Instead, researchers resort to parametric approximations of the value function, for example by using a linear function approximator (Sutton and Barto, 1998) or a non-linear function approximator such as a neural network (Mnih et al., 2015). In this work, we consider a linear function approximation architecture where the true value-to-go is approximated by the inner product

$$V_\mu(s) \approx V_\theta(s) = \phi(s)^\top \theta,$$

where $\phi(s) \in \mathbb{R}^d$ is a fixed feature vector for state $s$, and $\theta \in \mathbb{R}^d$ is a parameter vector that is shared across states. When the state space is a finite set $\mathcal{S} = \{s_1, \ldots, s_n\}$, $V_\theta$ can be expressed compactly as

$$V_\theta = \begin{bmatrix} \phi(s_1)^\top \\ \vdots \\ \phi(s_n)^\top \end{bmatrix} \theta = \begin{bmatrix} \phi_1(s_1) & \cdots & \phi_d(s_1) \\ \vdots & & \vdots \\ \phi_1(s_n) & \cdots & \phi_d(s_n) \end{bmatrix} \theta = \Phi\theta,$$

where $\Phi \in \mathbb{R}^{n \times d}$ and $\theta \in \mathbb{R}^d$. We assume that the feature vectors $\phi_1, \ldots, \phi_d$, forming the columns of $\Phi$, are linearly independent.

##### Norms in value function and parameter space.

For a symmetric positive definite matrix $A$, define the inner product $\langle x, y \rangle_A = x^\top A y$ and the associated norm $\|x\|_A = \sqrt{x^\top A x}$. If $A$ is positive semi-definite rather than positive definite, $\|\cdot\|_A$ is called a semi-norm. Let $D$ denote the diagonal matrix whose elements are given by the entries of the stationary distribution $\pi$. Then, for two value functions $V$ and $V'$,

$$\|V - V'\|_D = \sqrt{\sum_{s \in \mathcal{S}} \pi(s)\left(V(s) - V'(s)\right)^2}$$

measures the mean-square difference between the value predictions under $V$ and $V'$ in steady state. This suggests a natural norm on the space of value parameters. In particular, for any $\theta, \theta' \in \mathbb{R}^d$,

$$\|V_\theta - V_{\theta'}\|_D = \sqrt{\sum_{s \in \mathcal{S}} \pi(s)\left(\phi(s)^\top(\theta - \theta')\right)^2} = \|\theta - \theta'\|_\Sigma,$$

where

$$\Sigma := \Phi^\top D \Phi = \sum_{s \in \mathcal{S}} \pi(s)\,\phi(s)\phi(s)^\top$$

is the steady-state feature covariance matrix.

##### Feature regularity.

We assume that all features have bounded second moments, so $\Sigma$ exists, and that any entirely redundant or irrelevant features have been removed, so $\Sigma$ has full rank. We also assume $\mathbb{E}[\|\phi(s)\|_2^2] \le 1$ for $s \sim \pi$, which can be ensured through feature normalization.

Let $\omega > 0$ denote the minimum eigenvalue of $\Sigma$. From our bound on the second moment of the features, the maximum eigenvalue of $\Sigma$ is at most $1$, so $1/\omega$ bounds the condition number of the feature covariance matrix. While we assume $\omega$ is positive, we provide some guarantees that are independent of the conditioning of this matrix. The following lemma is an immediate consequence of our assumptions.

**Lemma (Norm equivalence).** For all $x \in \mathbb{R}^d$, $\sqrt{\omega}\,\|x\|_2 \le \|x\|_\Sigma \le \|x\|_2$.
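Under these assumptions, the norm equivalence can be spot-checked numerically. The features and stationary distribution below are invented for illustration; each feature vector is normalized so that the second-moment bound holds:

```python
import numpy as np

# Hypothetical features and stationary distribution; only the definitions of
# D, Sigma, and omega follow the text.
rng = np.random.default_rng(1)
n, d = 6, 3
Phi = rng.normal(size=(n, d))
Phi /= np.maximum(1.0, np.linalg.norm(Phi, axis=1, keepdims=True))  # ||phi(s)||_2 <= 1
pi = rng.random(n)
pi /= pi.sum()
D = np.diag(pi)
Sigma = Phi.T @ D @ Phi                    # steady-state feature covariance
omega = np.linalg.eigvalsh(Sigma).min()    # minimum eigenvalue

# Check sqrt(omega)*||x||_2 <= ||x||_Sigma <= ||x||_2 on random vectors.
for _ in range(100):
    x = rng.normal(size=d)
    x_sigma = np.sqrt(x @ Sigma @ x)       # ||x||_Sigma
    assert np.sqrt(omega) * np.linalg.norm(x) <= x_sigma + 1e-9
    assert x_sigma <= np.linalg.norm(x) + 1e-9
```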

## 3 Temporal Difference Learning

We consider the classic temporal difference learning algorithm (Sutton, 1988). The algorithm starts with an initial parameter estimate $\theta_0$ and, at every time step $t$, observes one data tuple $(s_t, r_t, s'_t)$ consisting of the current state, the current reward, and the next state reached by playing the policy in the current state. This tuple is used to compute a loss function, taken to be the squared sample Bellman error. The algorithm then computes the next iterate by taking a gradient step. Some of our bounds guarantee the accuracy of the average iterate, denoted by $\bar\theta_t$. The version of TD presented below also makes online updates to this averaged iterate.

We present below the simplest variant of TD, known as TD(0). An extension to TD with eligibility traces, known as TD($\lambda$), is presented in a later section. It is also worth highlighting that we study online temporal difference learning, which makes incremental gradient-like updates to the parameter estimate based only on the most recent data observation. Such algorithms are widely used in practice, but are harder to analyze than batch TD methods.

**Algorithm: TD(0) with linear function approximation**

**Input:** initial guess $\theta_0$, stepsize sequence $\{\alpha_t\}_{t \ge 0}$.
**Initialize:** $\bar\theta_0 = \theta_0$.
**for** $t = 0, 1, 2, \ldots$ **do**
1. Observe tuple: $(s_t, r_t, s'_t)$.
2. Define target: $y_t = r_t + \gamma\,\phi(s'_t)^\top\theta_t$. /* sample Bellman operator */
3. Define loss function: $\ell_t(\theta) = \tfrac{1}{2}\left(y_t - \phi(s_t)^\top\theta\right)^2$. /* sample Bellman error */
4. Compute negative gradient (holding $y_t$ fixed): $g_t(\theta_t) = \left(y_t - \phi(s_t)^\top\theta_t\right)\phi(s_t)$.
5. Take a gradient step: $\theta_{t+1} = \theta_t + \alpha_t\,g_t(\theta_t)$. /* $\alpha_t$: stepsize */
6. Update averaged iterate: $\bar\theta_{t+1} = \bar\theta_t + \tfrac{1}{t+2}\left(\theta_{t+1} - \bar\theta_t\right)$. /* running average of $\theta_0, \ldots, \theta_{t+1}$ */
**end for**

At time $t$, TD takes a step in the direction of the negative gradient $g_t(\theta_t)$ evaluated at the current parameter. As a general function of $\theta$ and the tuple $(s_t, r_t, s'_t)$, this can be calculated as
$$g_t(\theta) = \phi(s_t)\left(r_t + \gamma\,\phi(s'_t)^\top\theta - \phi(s_t)^\top\theta\right). \tag{1}$$
The long-run dynamics of TD are closely linked to the expected negative gradient step when the tuple $(s_t, r_t, s'_t)$ follows its steady-state behavior:
$$\bar g(\theta) := \sum_{s, s' \in \mathcal{S}} \pi(s) P(s'|s) \left(R(s,s') + \gamma\,\phi(s')^\top\theta - \phi(s)^\top\theta\right)\phi(s) \qquad \forall \theta \in \mathbb{R}^d.$$
This can be rewritten more compactly in several useful ways. One is,
$$\bar g(\theta) = \mathbb{E}[\phi r] + \mathbb{E}\left[\phi(\gamma\phi' - \phi)^\top\right]\theta \tag{2}$$
where $\phi = \phi(s)$ is the feature vector of a random initial state $s \sim \pi$, $\phi' = \phi(s')$ is the feature vector of a random next state $s'$ drawn according to $P(\cdot|s)$, and $r = R(s, s')$. In addition, using that $V_\theta = \Phi\theta$, one finds
$$\bar g(\theta) = \Phi^\top D\left(T_\mu\Phi\theta - \Phi\theta\right). \tag{3}$$
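Equations (1)-(3) translate directly into code. The sketch below (hypothetical MRP and features, invented for illustration) runs TD(0) with an averaged iterate along a single trajectory, and compares against the point $\theta^*$ where $\bar g$ vanishes, computed by solving the linear system implied by (3):

```python
import numpy as np

# Hypothetical MRP + features; only the TD(0) update rule comes from the text.
rng = np.random.default_rng(2)
n, d, gamma = 5, 2, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
R_sp = rng.random((n, n))
Phi = rng.normal(size=(n, d))
Phi /= np.maximum(1.0, np.linalg.norm(Phi, axis=1, keepdims=True))

def td0(T, theta0, stepsize):
    """Run TD(0) along one trajectory; return final and averaged iterates."""
    theta, theta_bar, s = theta0.copy(), theta0.copy(), 0
    for t in range(T):
        s_next = rng.choice(n, p=P[s])                    # observe transition
        r = R_sp[s, s_next]                               # observe reward
        # g_t(theta) = phi(s)*(r + gamma*phi(s')^T theta - phi(s)^T theta), eq. (1)
        g = Phi[s] * (r + gamma * Phi[s_next] @ theta - Phi[s] @ theta)
        theta = theta + stepsize(t) * g
        theta_bar = theta_bar + (theta - theta_bar) / (t + 2)  # running average
        s = s_next
    return theta, theta_bar

# Reference point theta* where gbar vanishes, from (3):
# Phi^T D (I - gamma P) Phi theta* = Phi^T D Rbar, with D = diag(pi).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
D, Rbar = np.diag(pi), (P * R_sp).sum(axis=1)
theta_star = np.linalg.solve(Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi, Phi.T @ D @ Rbar)

theta_T, theta_bar_T = td0(20000, np.zeros(d), lambda t: 0.5 / np.sqrt(t + 1))
assert np.linalg.norm(theta_bar_T - theta_star) < np.linalg.norm(theta_star)
```

Here $\theta^*$ is computed only as a reference point for the sanity check; TD itself never has access to $P$ or $\pi$.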

## 4 Asymptotic convergence of temporal difference learning

The main challenge in analyzing TD is that the gradient steps are not true stochastic gradients with respect to any fixed objective. At every iteration, the expected step taken at time $t$ pulls the value estimate toward the target $T_\mu V_{\theta_t}$, but that target itself depends on $\theta_t$. So does this circular process converge? The key insight of Tsitsiklis and Van Roy (1997) was to interpret this as a stochastic approximation scheme for solving a fixed point equation known as the projected Bellman equation. Contraction properties together with general results from stochastic approximation theory can then be used to show convergence.

Should TD converge at all, it should be to a stationary point. Because the feature covariance matrix $\Sigma$ is full rank, there is a unique vector $\theta^*$ with $\bar g(\theta^*) = 0$ (this follows formally as a consequence of the first lemma in Subsection 6.2). We briefly review results that offer insight into, and proofs of, the asymptotic convergence of TD.

##### Understanding the TD Limit Point.

Tsitsiklis and Van Roy (1997) give an interesting characterization of the limit point $\theta^*$. They show it is the unique solution to the projected Bellman equation

$$\Phi\theta^* = \Pi_D T_\mu \Phi\theta^*. \tag{4}$$

Here $\Pi_D$ denotes the projection operator onto the subspace $\{\Phi x : x \in \mathbb{R}^d\}$ spanned by the features, with respect to the inner product $\langle\cdot,\cdot\rangle_D$. To see why this is the case, note that by (3),

$$0 = x^\top\bar g(\theta^*) = \left\langle \Phi x,\; T_\mu\Phi\theta^* - \Phi\theta^* \right\rangle_D \qquad \forall x \in \mathbb{R}^d.$$

That is, the Bellman error under $\theta^*$ is orthogonal to the space spanned by the features in the inner product $\langle\cdot,\cdot\rangle_D$. By definition, this means $\Pi_D\left(T_\mu\Phi\theta^* - \Phi\theta^*\right) = 0$, and hence $\theta^*$ must satisfy the projected Bellman equation.

The following lemma shows the projected Bellman operator is a contraction, and so in principle one could converge to the approximate value function $\Phi\theta^*$ by repeatedly applying $\Pi_D T_\mu$. TD appears to serve as a simple stochastic approximation scheme for solving that fixed point equation.

**Lemma** (Tsitsiklis and Van Roy, 1997)**.** The operator $\Pi_D T_\mu$ is a contraction with respect to $\|\cdot\|_D$ with modulus $\gamma$. That is,

$$\left\|\Pi_D T_\mu V_\theta - \Pi_D T_\mu V_{\theta'}\right\|_D \le \gamma\left\|V_\theta - V_{\theta'}\right\|_D \qquad \forall (\theta, \theta').$$

Finally, the limit of convergence comes with a competitive guarantee. From the contraction property above, a short argument shows

$$\left\|V_{\theta^*} - V_\mu\right\|_D \le \frac{1}{\sqrt{1-\gamma^2}}\left\|\Pi_D V_\mu - V_\mu\right\|_D. \tag{5}$$

See, for example, Chapter 6 of Bertsekas (2012). The left-hand side of Equation (5) measures the root-mean-squared deviation between the value predictions of the limiting TD value function and the true value function. On the right-hand side, the projected value function $\Pi_D V_\mu$ minimizes root-mean-squared prediction error among all value functions representable in the span of $\Phi$. If $V_\mu$ actually falls within the span of the features, there is no approximation error at all and TD converges to the true value function.
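On a small instance, $\theta^*$ can be obtained by solving the projected Bellman equation directly, and the bound (5) can be verified. All problem data below (kernel, rewards, features) are hypothetical illustrations:

```python
import numpy as np

# Hypothetical MRP and features; the linear system for theta* follows from (3)-(4).
rng = np.random.default_rng(3)
n, d, gamma = 6, 2, 0.8
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)                          # expected one-step rewards
Phi = rng.normal(size=(n, d))

# Stationary distribution: normalized left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
D = np.diag(pi)

# gbar(theta*) = 0  <=>  Phi^T D (I - gamma P) Phi theta* = Phi^T D R.
theta_star = np.linalg.solve(Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi, Phi.T @ D @ R)

V_mu = np.linalg.solve(np.eye(n) - gamma * P, R)            # true value function
Pi_D = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)    # projection in <.,.>_D

def norm_D(v):
    return np.sqrt(v @ D @ v)

# Fixed point check (4): Phi theta* = Pi_D T Phi theta*.
assert np.allclose(Pi_D @ (R + gamma * P @ Phi @ theta_star), Phi @ theta_star)

# Bound (5): ||V_theta* - V_mu||_D <= ||Pi_D V_mu - V_mu||_D / sqrt(1 - gamma^2).
lhs = norm_D(Phi @ theta_star - V_mu)
rhs = norm_D(Pi_D @ V_mu - V_mu) / np.sqrt(1 - gamma**2)
assert lhs <= rhs + 1e-10
```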

##### Asymptotic Convergence Via The ODE Method.

Like many analyses in reinforcement learning, the convergence proof of Tsitsiklis and Van Roy (1997) appeals to a powerful technique from the stochastic approximation literature known as the "ODE method." Under appropriate conditions, and assuming a decaying step-size sequence satisfying the Robbins–Monro conditions, this method establishes the asymptotic convergence of the stochastic recursion $\theta_{t+1} = \theta_t + \alpha_t g_t(\theta_t)$ as a consequence of the global asymptotic stability of the deterministic ODE $\dot\theta = \bar g(\theta)$. The critical step in the proof of Tsitsiklis and Van Roy (1997) is to use the contraction properties of the Bellman operator to establish that this ODE is globally asymptotically stable with equilibrium point $\theta^*$.

The ODE method vastly simplifies convergence proofs: first, because the continuous dynamics can be easier to analyze than discretized ones, and more importantly, because it avoids dealing with the stochastic noise in the problem. At the same time, by side-stepping these issues, the method offers little insight into the critical effect of stepsize sequences, problem conditioning, and mixing-time issues on algorithm performance.

## 5 Outline of analysis

The remainder of the paper focuses on finite time analysis of TD. Broadly, we establish two types of finite time bounds. In the case where the feature covariance matrix $\Sigma$ is well conditioned, we give bounds on the expected distance of the iterate from the TD fixed point $\theta^*$. We attain explicit bounds mirroring what one would expect from the literature on strongly convex stochastic optimization: results showing constant step-size TD converges at an exponential rate to within some radius of $\theta^*$, and convergence rates with appropriately decaying step-sizes. Note that by the norm-equivalence lemma, $\|V_\theta - V_{\theta^*}\|_D \le \|\theta - \theta^*\|_2$, so bounds on the distance of the iterate to the TD fixed point also imply bounds on the distance between value predictions.

These establish fast rates of convergence, but only if the problem is well conditioned, and the choice of step-sizes is also very sensitive to problem conditioning. Work on robust stochastic approximation (Nemirovski et al., 2009) argues instead for the use of comparatively large step-sizes together with iterate averaging. Following the spirit of this work, we also give explicit bounds on $\mathbb{E}\|V_{\bar\theta_T} - V_{\theta^*}\|_D^2$, which measures the mean-squared gap between the value predictions under the averaged iterate $\bar\theta_T$ and those under the fixed point. These yield slower convergence rates, but both the bounds and the step-sizes are completely independent of problem conditioning.

Our approach is to start by developing insights from simple, stylized settings, and then incrementally extending the analysis to more complex settings. The analysis is outlined below:

Noiseless Case:

Drawing inspiration from the ODE method discussed above, we start by analyzing the Euler discretization of the ODE $\dot\theta = \bar g(\theta)$, which is the deterministic recursion $\theta_{t+1} = \theta_t + \alpha\bar g(\theta_t)$. We call this method "mean-path TD(0)." As motivation, the section first considers a fictitious gradient descent algorithm designed to converge to the TD fixed point. We then develop striking analogues for mean-path TD of the key properties underlying the convergence of gradient descent. Easy proofs then yield two bounds mirroring those given for gradient descent.

Independent Noise:

Section 7 studies TD(0) under an i.i.d. observation model, where the data-tuples used by TD are drawn i.i.d. from the stationary distribution. The techniques used to analyze mean-path TD(0) extend easily to this setting, and the resulting bounds mirror standard guarantees for stochastic gradient descent.

Markov Noise:

In Section 8, we analyze TD in the more realistic setting where the data is collected from a single sample path of an ergodic Markov chain. This setting introduces significant challenges due to the highly dependent nature of the data. For tractability, we assume the Markov chain satisfies a certain uniform bound on the rate at which it mixes, and study a variant of TD that uses a projection step to ensure uniform boundedness of the iterates. In this case, our results essentially scale by a factor of the mixing time relative to the i.i.d. case.

Extension to TD($\lambda$):

We extend the analysis to TD with eligibility traces, known as TD($\lambda$).

Approximate Optimal Stopping:

A final section extends our results to the problem of optimal stopping with an approximate value function.

## 6 Analysis of Mean-Path TD

All practical applications of TD involve observation noise. A great deal of insight can be gained, however, by investigating a natural deterministic analogue of the algorithm. Here we study the recursion

$$\theta_{t+1} = \theta_t + \alpha\bar g(\theta_t), \qquad t \in \{0, 1, 2, \ldots\},$$

which is the Euler discretization of the ODE described in the previous section. We will refer to this iterative algorithm as mean-path TD. In this section, we develop key insights into the dynamics of mean-path TD that allow for a remarkably simple finite-time analysis of its convergence. Later sections of the paper show these ideas extend gracefully to analyses with observation noise.

The key to our approach is to develop properties of mean-path TD that closely mirror those of gradient descent on a particular quadratic loss function. To this end, in the next subsection, we review a simple analysis of gradient descent. In Subsection 6.2, we establish key properties of mean-path TD mirroring those used to analyze this gradient descent algorithm. Finally, Subsection 6.3 gives convergence rates of TD, with proofs and rates mirroring those given for gradient descent except for a constant that depends on the problem’s discount factor.

### 6.1 Gradient Descent on A Value Function Loss

Consider the cost function

$$f(\theta) = \left\|V_{\theta^*} - V_\theta\right\|_D^2 = \left\|\theta - \theta^*\right\|_\Sigma^2,$$

which measures the mean-squared gap between the value predictions under $\theta$ and those under the stationary point $\theta^*$ of TD. Consider as well a hypothetical algorithm that performs gradient descent on $f$, iterating $\theta_{t+1} = \theta_t - \alpha\nabla f(\theta_t)$ for all $t$. Of course, this algorithm is not implementable, as one does not know the limit point of TD. However, reviewing an analysis of such an algorithm will offer great insight into our eventual analysis of TD.

To start, a standard decomposition characterizes the evolution of the error of the iterate $\theta_t$:

$$\left\|\theta_{t+1} - \theta^*\right\|_2^2 = \left\|\theta_t - \theta^*\right\|_2^2 - 2\alpha\nabla f(\theta_t)^\top(\theta_t - \theta^*) + \alpha^2\left\|\nabla f(\theta_t)\right\|_2^2.$$

To use this decomposition, we need some understanding of $\nabla f(\theta_t)^\top(\theta_t - \theta^*)$, capturing whether the gradient points in the direction of $\theta^*$, as well as of the norm of the gradient $\|\nabla f(\theta_t)\|_2$. In this case $\nabla f(\theta) = 2\Sigma(\theta - \theta^*)$, from which we conclude

$$\nabla f(\theta)^\top(\theta - \theta^*) = 2\left\|\theta - \theta^*\right\|_\Sigma^2 = 2\left\|V_{\theta^*} - V_\theta\right\|_D^2. \tag{6}$$

In addition, one can show

$$\left\|\nabla f(\theta)\right\|_2 \le 2\left\|V_{\theta^*} - V_\theta\right\|_D, \tag{7}$$

which can be seen from the fact that $\|\Sigma x\|_2 \le \|x\|_\Sigma$ for any vector $x$, since the maximum eigenvalue of $\Sigma$ is at most $1$.

Now, using (6) and (7), we have that for stepsize $\alpha = 1/2$,

$$\left\|\theta_{t+1} - \theta^*\right\|_2^2 \le \left\|\theta_t - \theta^*\right\|_2^2 - \left\|V_{\theta^*} - V_{\theta_t}\right\|_D^2. \tag{8}$$

The distance to $\theta^*$ decreases in every step, and does so more rapidly if there is a large gap between the value predictions under $\theta_t$ and $\theta^*$. Combining this with the norm-equivalence lemma gives

$$\left\|\theta_{t+1} - \theta^*\right\|_2^2 \le (1-\omega)\left\|\theta_t - \theta^*\right\|_2^2 \le \cdots \le (1-\omega)^{t+1}\left\|\theta_0 - \theta^*\right\|_2^2. \tag{9}$$

Recall that $\omega$ denotes the minimum eigenvalue of $\Sigma$. This shows that the error converges at a fast geometric rate. However, the rate of convergence degrades if the minimum eigenvalue is close to zero. Such a convergence rate is therefore only meaningful if the feature covariance matrix is well conditioned.

By working in the space of value functions and performing iterate averaging, one can also give a guarantee that is independent of $\omega$. Recall the notation $\bar\theta_T = \frac{1}{T}\sum_{t=0}^{T-1}\theta_t$ for the averaged iterate. A simple proof from (8) shows

$$\left\|V_{\theta^*} - V_{\bar\theta_T}\right\|_D^2 \le \frac{1}{T}\sum_{t=0}^{T-1}\left\|V_{\theta^*} - V_{\theta_t}\right\|_D^2 \le \frac{\left\|\theta^* - \theta_0\right\|_2^2}{T}. \tag{10}$$

### 6.2 Key Properties of Mean-Path TD

This subsection establishes analogues for TD of the key properties (6) and (7) used to analyze gradient descent. First, our analysis builds on Lemma 7 of Tsitsiklis and Van Roy (1997), which uses the contraction properties of the projected Bellman operator to conclude

$$\bar g(\theta)^\top(\theta^* - \theta) > 0 \qquad \forall \theta \ne \theta^*. \tag{11}$$

That is, the expected update of TD always forms a positive angle with $\theta^* - \theta$. Though only Equation (11) was stated in their lemma, Tsitsiklis and Van Roy (1997) actually reach a much stronger conclusion in the proof itself. This result, given in the lemma below, establishes that the expected updates of TD point in a descent direction of the error $\|V_\theta - V_{\theta^*}\|_D^2$, and do so more strongly when the gap between the value functions under $\theta$ and $\theta^*$ is large. We will show that this more quantitative form of (11) allows for elegant finite time bounds on the performance of TD, though it seems the power of the result has not been appreciated in the literature. We also provide a new and more elementary proof.

Note that this lemma mirrors the property in Equation (6), but with the smaller constant $(1-\gamma)$ in place of $2$. This reflects that TD must converge to $\theta^*$ by bootstrapping, and may follow a less direct path to $\theta^*$ than the fictitious gradient descent method considered in the previous subsection.

**Lemma.** Let $\theta^*$ be the unique vector satisfying $\bar g(\theta^*) = 0$. For any $\theta \in \mathbb{R}^d$,

$$(\theta^* - \theta)^\top\bar g(\theta) \ge (1-\gamma)\left\|V_\theta - V_{\theta^*}\right\|_D^2.$$
*Proof.*

Consider a stationary sequence of states $(s_t : t \in \mathbb{Z})$, and set $\phi = \phi(s_0)$ and $\phi' = \phi(s_1)$. Similarly, define $\xi = (\theta^* - \theta)^\top\phi$ and $\xi' = (\theta^* - \theta)^\top\phi'$. By stationarity, these are two correlated random variables with the same marginal distribution. By definition, $\bar g(\theta^*) = 0$. Using the expression for $\bar g$ in equation (2),

$$\bar g(\theta) = \bar g(\theta) - \bar g(\theta^*) = \mathbb{E}\left[\phi(\gamma\phi' - \phi)^\top(\theta - \theta^*)\right] = \mathbb{E}\left[\phi(\xi - \gamma\xi')\right]. \tag{12}$$

Therefore

$$(\theta^* - \theta)^\top\bar g(\theta) = \mathbb{E}\left[\xi(\xi - \gamma\xi')\right] = \mathbb{E}[\xi^2] - \gamma\mathbb{E}[\xi\xi'] \ge (1-\gamma)\mathbb{E}[\xi^2] = (1-\gamma)\left\|V_{\theta^*} - V_\theta\right\|_D^2.$$

The inequality above uses the Cauchy–Schwarz inequality together with the fact that $\xi$ and $\xi'$ have the same marginal distribution to conclude $\mathbb{E}[\xi\xi'] \le \sqrt{\mathbb{E}[\xi^2]}\sqrt{\mathbb{E}[(\xi')^2]} = \mathbb{E}[\xi^2]$. ∎

The next lemma forms another key to our results. It upper bounds the norm of the expected negative gradient, providing an analogue of Equation (7).

**Lemma.** For any $\theta \in \mathbb{R}^d$, $\left\|\bar g(\theta)\right\|_2 \le 2\left\|V_{\theta^*} - V_\theta\right\|_D$.

*Proof.* Beginning from (12) in the proof of the previous lemma, we have

$$\left\|\bar g(\theta)\right\|_2 = \left\|\mathbb{E}\left[\phi(\xi - \gamma\xi')\right]\right\|_2 \le \sqrt{\mathbb{E}\left[\|\phi\|_2^2\right]}\sqrt{\mathbb{E}\left[(\xi - \gamma\xi')^2\right]} \le \sqrt{\mathbb{E}[\xi^2]} + \sqrt{\mathbb{E}[(\xi')^2]} = 2\sqrt{\mathbb{E}[\xi^2]},$$

where the second inequality uses that $\mathbb{E}[\|\phi\|_2^2] \le 1$ by assumption (together with the triangle inequality and $\gamma \le 1$), and the final equality uses that $\xi$ and $\xi'$ have the same marginal distribution. We conclude by again noting that $\mathbb{E}[\xi^2] = \|V_{\theta^*} - V_\theta\|_D^2$. ∎

The two lemmas above become particularly powerful when used in conjunction. Together, they show that the angle $\bar g(\theta)$ makes with the vector $\theta^* - \theta$ is bounded by a constant that depends on the discount factor. In particular, the first lemma shows the component of $\bar g(\theta)$ aligned with $\theta^* - \theta$ has norm at least $(1-\gamma)\|V_{\theta^*} - V_\theta\|_D$. Together with the second lemma and the Pythagorean theorem, we can conclude that the orthogonal component of $\bar g(\theta)$ has norm no greater than $\sqrt{4 - (1-\gamma)^2}\,\|V_{\theta^*} - V_\theta\|_D$. This is summarized in Figure 1.
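Both lemmas can be spot-checked numerically using the closed form (3) for $\bar g(\theta)$. The MRP below is hypothetical, with features normalized so that the second-moment assumption holds:

```python
import numpy as np

# Hypothetical MRP; gbar is evaluated via (3): gbar(theta) = Phi^T D (T Phi theta - Phi theta).
rng = np.random.default_rng(4)
n, d, gamma = 5, 2, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)
Phi = rng.normal(size=(n, d))
Phi /= np.maximum(1.0, np.linalg.norm(Phi, axis=1, keepdims=True))  # ||phi(s)||_2 <= 1

evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
D = np.diag(pi)

theta_star = np.linalg.solve(Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi, Phi.T @ D @ R)

def gbar(theta):   # expected TD update, equation (3)
    return Phi.T @ D @ (R + gamma * P @ Phi @ theta - Phi @ theta)

for _ in range(100):
    theta = theta_star + rng.normal(size=d)
    # gap2 = ||V_theta - V_theta*||_D^2
    u = Phi @ (theta - theta_star)
    gap2 = u @ D @ u
    # First lemma: (theta* - theta)^T gbar(theta) >= (1 - gamma) * gap2.
    assert (theta_star - theta) @ gbar(theta) >= (1 - gamma) * gap2 - 1e-9
    # Second lemma: ||gbar(theta)||_2 <= 2 * sqrt(gap2).
    assert np.linalg.norm(gbar(theta)) <= 2 * np.sqrt(gap2) + 1e-9
```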

### 6.3 Finite Time Analysis of Mean-Path TD

We now combine the insights of the previous subsection to establish convergence rates for mean-path TD. These mirror the bounds for gradient descent given in equations (9) and (10), except for an additional dependence on the discount factor. The first result bounds the distance between the value function under an averaged iterate and the value function under the TD stationary point. It gives a comparatively slow convergence rate, but does not depend at all on the conditioning of the feature covariance matrix. When this matrix is well conditioned, so that the minimum eigenvalue $\omega$ of $\Sigma$ is not too small, the geometric convergence rate given in the second part of the theorem dominates. Note that by the norm-equivalence lemma, bounds on $\|\theta - \theta^*\|_2$ always imply bounds on $\|V_\theta - V_{\theta^*}\|_D$.

**Theorem.** Consider a sequence of parameters $(\theta_t : t \in \mathbb{N})$ obeying the recursion

$$\theta_{t+1} = \theta_t + \alpha\bar g(\theta_t), \qquad t = 0, 1, 2, \ldots,$$

where $\alpha = (1-\gamma)/4$. Then

$$\left\|V_{\theta^*} - V_{\bar\theta_T}\right\|_D^2 \le \frac{4\left\|\theta^* - \theta_0\right\|_2^2}{(1-\gamma)^2 T}$$

and

$$\left\|\theta^* - \theta_T\right\|_2^2 \le \exp\left\{-\frac{(1-\gamma)^2\omega}{4}\,T\right\}\left\|\theta^* - \theta_0\right\|_2^2.$$
*Proof.*

For each $t$, we have

$$\left\|\theta^* - \theta_{t+1}\right\|_2^2 = \left\|\theta^* - \theta_t\right\|_2^2 - 2\alpha(\theta^* - \theta_t)^\top\bar g(\theta_t) + \alpha^2\left\|\bar g(\theta_t)\right\|_2^2.$$

Applying the two lemmas of Subsection 6.2 and using the constant step-size $\alpha = (1-\gamma)/4$, we get

$$\left\|\theta^* - \theta_{t+1}\right\|_2^2 \le \left\|\theta^* - \theta_t\right\|_2^2 - \left(2\alpha(1-\gamma) - 4\alpha^2\right)\left\|V_{\theta^*} - V_{\theta_t}\right\|_D^2 = \left\|\theta^* - \theta_t\right\|_2^2 - \frac{(1-\gamma)^2}{4}\left\|V_{\theta^*} - V_{\theta_t}\right\|_D^2. \tag{13}$$

Then

$$\frac{(1-\gamma)^2}{4}\sum_{t=0}^{T-1}\left\|V_{\theta^*} - V_{\theta_t}\right\|_D^2 \le \sum_{t=0}^{T-1}\left(\left\|\theta^* - \theta_t\right\|_2^2 - \left\|\theta^* - \theta_{t+1}\right\|_2^2\right) \le \left\|\theta^* - \theta_0\right\|_2^2.$$

Applying Jensen’s inequality gives

$$\left\|V_{\theta^*} - V_{\bar\theta_T}\right\|_D^2 \le \frac{1}{T}\sum_{t=0}^{T-1}\left\|V_{\theta^*} - V_{\theta_t}\right\|_D^2 \le \frac{4\left\|\theta^* - \theta_0\right\|_2^2}{(1-\gamma)^2 T},$$

as desired. Now, returning to (13), and applying Lemma 2 gives

 $$\|\theta^* - \theta_{t+1}\|_2^2 \le \|\theta^* - \theta_t\|_2^2 - \frac{(1-\gamma)^2}{4}\,\omega\,\|\theta^* - \theta_t\|_2^2 = \left(1 - \frac{\omega(1-\gamma)^2}{4}\right)\|\theta^* - \theta_t\|_2^2 \le \exp\left\{-\frac{\omega(1-\gamma)^2}{4}\right\}\|\theta^* - \theta_t\|_2^2.$$

Repeating this argument inductively gives the desired result.
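The recursion above is easy to simulate. The following is a minimal sketch (illustrative, not the paper's code) of mean-path TD, $\theta_{t+1} = \theta_t + \alpha\,\bar g(\theta_t)$ with $\alpha = (1-\gamma)/4$, on a toy two-state Markov reward process; the transition matrix `P`, expected rewards `rbar`, and feature map `Phi` are our own assumptions, chosen so that $\|\phi(s)\|_2 \le 1$.

```python
# Mean-path TD on a toy 2-state Markov reward process (illustrative sketch).
import numpy as np

gamma = 0.9
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])           # transition matrix
rbar = np.array([1.0, -1.0])         # expected one-step rewards
Phi = np.array([[1.0], [0.5]])       # row s is the feature vector phi(s)

# Stationary distribution pi of P and the weighting matrix D = diag(pi).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()
D = np.diag(pi)

def gbar(theta):
    """Expected TD(0) update: Phi^T D (rbar + gamma*P*Phi*theta - Phi*theta)."""
    return Phi.T @ D @ (rbar + gamma * P @ (Phi @ theta) - Phi @ theta)

# TD fixed point theta*, the unique solution of gbar(theta) = 0.
A = Phi.T @ D @ (np.eye(2) - gamma * P) @ Phi
theta_star = np.linalg.solve(A, Phi.T @ D @ rbar)

alpha = (1 - gamma) / 4              # step-size from the theorem
theta = np.zeros(1)
for _ in range(20_000):
    theta = theta + alpha * gbar(theta)

gap = np.linalg.norm(theta - theta_star)   # shrinks geometrically in t
```

The step-size from the theorem is conservative, so convergence is slow but, consistent with the second bound, the gap to $\theta^*$ contracts by a fixed factor each iteration.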

## 7 Analysis for the i.i.d. observation model

This section studies TD under an i.i.d. observation model, and establishes three explicit guarantees that mirror standard finite-time bounds available for SGD. Specifically, we study a model where the random tuples $(s_t, r_t, s_t')$ observed by the TD algorithm are sampled i.i.d. from the stationary distribution of the Markov reward process. This means that for all states $s$ and $s'$,

 $$\mathbb{P}\left[(s_t, r_t, s_t') = (s, R(s, s'), s')\right] = \pi(s)\,P(s' \,|\, s), \tag{14}$$

and the tuples are drawn independently across time. Note that the probabilities in Equation (14) correspond to a setting where the first state $s_t$ is drawn from the stationary distribution $\pi$, and then $s_t'$ is drawn from $P(\cdot \,|\, s_t)$. This model is widely used for analyzing RL algorithms. See for example Sutton et al. (2009a), Sutton et al. (2009b), Korda and La (2015), and Dalal et al. (2017).
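A minimal sketch (not from the paper) of this i.i.d. observation model: each tuple $(s, r, s')$ has $s \sim \pi$ and $s' \sim P(\cdot\,|\,s)$, drawn independently across time. The three-state transition matrix `P` and reward table `R` below are illustrative assumptions.

```python
# Sampling tuples (s, r, s') under the i.i.d. observation model of eq. (14).
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])      # P[s, s'] = P(s' | s)
R = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.0, 0.5],
              [-1.0, 1.0, 0.0]])     # R[s, s'] = R(s, s')

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a distribution."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()

pi = stationary_distribution(P)

def sample_tuple(rng):
    """Draw one tuple (s, r, s') according to equation (14)."""
    s = rng.choice(len(pi), p=pi)          # s ~ pi
    s_next = rng.choice(P.shape[1], p=P[s])  # s' ~ P(.|s)
    return s, R[s, s_next], s_next

s, r, s_next = sample_tuple(rng)
```

Because successive tuples are independent draws from the same distribution, the TD update directions $g_t(\theta)$ become i.i.d. for fixed $\theta$, which is what makes the SGD-style analysis below go through.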

Theorem 7 follows from a unified analysis that combines the techniques of the previous section with typical arguments used in the SGD literature. All bounds depend on $\sigma^2 = \mathbb{E}[\|g_t(\theta^*)\|_2^2]$, which roughly captures the variance of TD updates at the stationary point $\theta^*$. The bound in part (a) follows the spirit of work on so-called robust stochastic approximation (Nemirovski et al., 2009). It applies to TD(0) with iterate averaging and relatively large step-sizes. The result is a simple bound on the mean-squared gap between the value predictions under the averaged iterate and the TD fixed point. The main strength is that the step-sizes and the bound do not depend at all on the condition number of the feature covariance matrix. Note that the requirement that the horizon $T$ be known in advance is not critical; one can carry out the analysis using the step-size $\alpha_t = 1/\sqrt{t}$, but the bounds we attain only become meaningful once $T$ is sufficiently large, so we chose to simplify the exposition.

Parts (b) and (c) provide faster convergence rates in the case where the feature covariance matrix is well conditioned. Part (b) studies TD(0) applied with a constant step-size, which is common in practice. In this case, the iterate will never converge to the TD fixed point, but our results show the expected distance to it converges at an exponential rate below some level that depends on the choice of step-size. This is sometimes referred to as the rate at which the initial point is “forgotten”. Bounds like this justify the common practice of starting with large step-sizes, and sometimes dividing the step-sizes in half once it appears error is no longer decreasing. Part (c) attains an order $O(1/T)$ convergence rate for a carefully chosen decaying step-size sequence. This step-size sequence requires knowledge of the minimum eigenvalue $\omega$ of the feature covariance matrix, which plays a role similar to a strong convexity parameter in the optimization literature. In practice this would need to be estimated, possibly by constructing a sample average approximation to the feature covariance matrix. The proof of part (c) closely follows an inductive argument presented in Bottou et al. (2016).

{restatable}

[]thm Suppose TD(0) is applied in the i.i.d. observation model and set $\sigma^2 = \mathbb{E}[\|g_t(\theta^*)\|_2^2]$.

1. Let $\bar\theta_T = \frac{1}{T}\sum_{t=0}^{T-1}\theta_t$ denote the averaged iterate. For any $T$ and fixed step-size $\alpha = 1/\sqrt{T} \le (1-\gamma)/8$,

 $$\mathbb{E}\left[\|V_{\theta^*} - V_{\bar\theta_T}\|_D^2\right] \le \frac{\|\theta^* - \theta_0\|_2^2 + 4\sigma^2}{\sqrt{T}\,(1-\gamma)}.$$
2. For any fixed step-size $\alpha_0 \le (1-\gamma)/8$,

 $$\mathbb{E}\left[\|\theta^* - \theta_T\|_2^2\right] \le e^{-\alpha_0(1-\gamma)\omega T}\,\|\theta^* - \theta_0\|_2^2 + \alpha_0\left(\frac{4\sigma^2}{(1-\gamma)\,\omega}\right).$$
3. For the decaying step-size sequence $\alpha_t = \frac{\beta}{\lambda + t}$ with $\beta = \frac{2}{(1-\gamma)\,\omega}$ and $\lambda = \frac{16}{(1-\gamma)^2\,\omega}$,

 $$\mathbb{E}\left[\|\theta^* - \theta_T\|_2^2\right] \le \frac{\nu}{\lambda + T} \qquad \text{where} \qquad \nu = \max\left\{\frac{16\sigma^2}{(1-\gamma)^2\omega^2},\; \frac{16\,\|\theta^* - \theta_0\|_2^2}{(1-\gamma)^2\,\omega}\right\}.$$
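As a concrete sketch of part (c) (illustrative assumptions throughout; the toy MRP `P`, `R`, `Phi` is our own, not from the paper), the following runs TD(0) in the i.i.d. observation model with the step-sizes $\alpha_t = \beta/(\lambda + t)$ defined above. Note that $\alpha_0 = \beta/\lambda = (1-\gamma)/8$, matching the step-size condition used in the proof.

```python
# TD(0) with the part-(c) decaying step-sizes on a toy 2-state MRP (sketch).
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])
R = np.array([[1.0, 0.0],
              [0.0, -1.0]])          # reward table R(s, s')
Phi = np.array([[1.0], [0.5]])       # linear features with ||phi(s)|| <= 1

# Stationary distribution, feature covariance, and its minimum eigenvalue.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()
omega = np.linalg.eigvalsh(Phi.T @ np.diag(pi) @ Phi).min()

beta = 2.0 / ((1 - gamma) * omega)
lam = 16.0 / ((1 - gamma) ** 2 * omega)   # so alpha_0 = beta/lam = (1-gamma)/8

theta = np.zeros(1)
for t in range(100_000):
    s = rng.choice(2, p=pi)               # s_t ~ pi
    s2 = rng.choice(2, p=P[s])            # s'_t ~ P(.|s_t)
    td_error = R[s, s2] + gamma * Phi[s2] @ theta - Phi[s] @ theta
    theta = theta + (beta / (lam + t)) * td_error * Phi[s]

# Compare against the TD fixed point theta*.
rbar = (P * R).sum(axis=1)
A = Phi.T @ np.diag(pi) @ (np.eye(2) - gamma * P) @ Phi
theta_star = np.linalg.solve(A, Phi.T @ np.diag(pi) @ rbar)
```

Consistent with the $\nu/(\lambda+T)$ bound, the squared distance to $\theta^*$ decays at rate $O(1/T)$ in this example.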

Our proof is able to directly leverage the first lemma of Section 6.2, but the analysis requires the following extension of the second. {restatable}[]lem For any fixed $\theta$, $\sqrt{\mathbb{E}[\|g_t(\theta)\|_2^2]} \le \sigma + 2\|V_\theta - V_{\theta^*}\|_D$, where $\sigma^2 = \mathbb{E}[\|g_t(\theta^*)\|_2^2]$. {proof} For brevity of notation, set $\phi = \phi(s_t)$ and $\phi' = \phi(s_t')$. Define $\xi = \phi^\top(\theta - \theta^*)$ and $\xi' = (\phi')^\top(\theta - \theta^*)$. By stationarity, $\xi$ and $\xi'$ have the same marginal distribution. By definition, $\sigma^2 = \mathbb{E}[\|g_t(\theta^*)\|_2^2]$. Then, using the formula for $g_t(\theta)$ in equation (1), we have

 $$\begin{aligned} \sqrt{\mathbb{E}[\|g_t(\theta)\|_2^2]} &\le \sqrt{\mathbb{E}[\|g_t(\theta^*)\|_2^2]} + \sqrt{\mathbb{E}[\|g_t(\theta) - g_t(\theta^*)\|_2^2]} \\ &= \sigma + \sqrt{\mathbb{E}\left\|\phi\,(\phi - \gamma\phi')^\top(\theta - \theta^*)\right\|_2^2} = \sigma + \sqrt{\mathbb{E}\left\|\phi\,(\xi - \gamma\xi')\right\|_2^2} \\ &\le \sigma + \sqrt{\mathbb{E}\left[\|\phi\|_2^2\,|\xi - \gamma\xi'|^2\right]} \le \sigma + \sqrt{\mathbb{E}[|\xi|^2]} + \sqrt{\mathbb{E}[|\xi'|^2]} \\ &= \sigma + 2\,\|V_\theta - V_{\theta^*}\|_D. \end{aligned}$$

Using the fact that $(a+b)^2 \le 2a^2 + 2b^2$ for any real numbers $a$ and $b$, we also have the crude bound

 $$\mathbb{E}\left[\|g_t(\theta)\|_2^2\right] \le 4\sigma^2 + 8\,\|V_\theta - V_{\theta^*}\|_D^2. \tag{15}$$

As will be discussed further in the next section, the proof below relies critically on the fact that the random tuple $(s_t, r_t, s_t')$ is independent of $\theta_t$ conditioned on the history $\mathcal{H}_{t-1}$. This implies

 $$\mathbb{E}[g_t(\theta_t)] = \mathbb{E}\left[\mathbb{E}[g_t(\theta_t) \mid \mathcal{H}_{t-1}]\right] = \mathbb{E}[\bar g(\theta_t)].$$

In an analogous way, it lets us conclude by Lemma 7 that $\mathbb{E}[\|g_t(\theta_t)\|_2^2] \le 4\sigma^2 + 8\,\mathbb{E}[\|V_{\theta_t} - V_{\theta^*}\|_D^2]$.
{proof} of Theorem 7. For each $t \ge 0$, we have

 $$\|\theta^* - \theta_{t+1}\|_2^2 = \|\theta^* - \theta_t\|_2^2 - 2\alpha_t\, g_t(\theta_t)^\top(\theta^* - \theta_t) + \alpha_t^2\,\|g_t(\theta_t)\|_2^2.$$

Under the hypotheses of parts (a), (b) and (c), $\alpha_t \le (1-\gamma)/8$. Therefore, taking expectations and applying the first lemma of Section 6.2 together with the bound (15) gives

 $$\begin{aligned} \mathbb{E}\left[\|\theta^* - \theta_{t+1}\|_2^2\right] &\le \mathbb{E}\left[\|\theta^* - \theta_t\|_2^2\right] - \left(2\alpha_t(1-\gamma) - 8\alpha_t^2\right)\mathbb{E}\left[\|V_{\theta^*} - V_{\theta_t}\|_D^2\right] + 4\alpha_t^2\sigma^2 \\ &\le \mathbb{E}\left[\|\theta^* - \theta_t\|_2^2\right] - \alpha_t(1-\gamma)\,\mathbb{E}\left[\|V_{\theta^*} - V_{\theta_t}\|_D^2\right] + 4\alpha_t^2\sigma^2. \end{aligned} \tag{16}$$
##### Part (a).

Starting with (16), rearranging, and summing over $t = 0, \ldots, T-1$ implies

 $$\mathbb{E}\left[\sum_{t=0}^{T-1}\|V_{\theta^*} - V_{\theta_t}\|_D^2\right] \le \frac{\|\theta^* - \theta_0\|_2^2}{\alpha(1-\gamma)} + \frac{4\alpha T\sigma^2}{1-\gamma} = \frac{\sqrt{T}\,\|\theta^* - \theta_0\|_2^2}{1-\gamma} + \frac{4\sqrt{T}\,\sigma^2}{1-\gamma}.$$

Applying Jensen’s inequality, we find

 $$\mathbb{E}\left[\|V_{\theta^*} - V_{\bar\theta_T}\|_D^2\right] \le \frac{1}{T}\,\mathbb{E}\left[\sum_{t=0}^{T-1}\|V_{\theta^*} - V_{\theta_t}\|_D^2\right] \le \frac{\|\theta^* - \theta_0\|_2^2 + 4\sigma^2}{\sqrt{T}\,(1-\gamma)}.$$
##### Part (b).

Starting with (16) and applying Lemma 2 implies

 $$\mathbb{E}\left[\|\theta^* - \theta_{t+1}\|_2^2\right] \le \left(1 - \alpha_0(1-\gamma)\omega\right)\mathbb{E}\left[\|\theta^* - \theta_t\|_2^2\right] + 4\alpha_0^2\sigma^2. \tag{17}$$

Iterating this inequality establishes that for any $T$,

 $$\mathbb{E}\left[\|\theta^* - \theta_T\|_2^2\right] \le \left(1 - \alpha_0(1-\gamma)\omega\right)^T\|\theta^* - \theta_0\|_2^2 + 4\alpha_0^2\sigma^2\sum_{t=0}^{\infty}\left(1 - \alpha_0(1-\gamma)\omega\right)^t.$$

The result follows by solving the geometric series and using that $(1 - \alpha_0(1-\gamma)\omega)^T \le e^{-\alpha_0(1-\gamma)\omega T}$.
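Spelling out that last step (a routine calculation, included for completeness): summing the geometric series and applying $1 - x \le e^{-x}$ gives

$$4\alpha_0^2\sigma^2\sum_{t=0}^{\infty}\left(1 - \alpha_0(1-\gamma)\omega\right)^t = \frac{4\alpha_0^2\sigma^2}{\alpha_0(1-\gamma)\omega} = \alpha_0\left(\frac{4\sigma^2}{(1-\gamma)\,\omega}\right), \qquad \left(1 - \alpha_0(1-\gamma)\omega\right)^T \le e^{-\alpha_0(1-\gamma)\omega T},$$

which together yield the bound in part (b).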

##### Part (c).

Note that by the definitions of $\beta$, $\lambda$, and $\nu$, we have

 ν=max{4β2σ2,λ∥θ