# Finite-Sample Analysis for SARSA with Linear Function Approximation

###### Abstract

SARSA is an on-policy algorithm for learning a Markov decision process policy in reinforcement learning. We investigate the SARSA algorithm with linear function approximation under non-i.i.d. data, where a single sample trajectory is available. With a Lipschitz continuous policy improvement operator that is smooth enough, SARSA has been shown to converge asymptotically perkins2003convergent ; melo2008analysis . However, its non-asymptotic analysis is challenging and remained unsolved due to the non-i.i.d. samples and the fact that the behavior policy changes dynamically with time. In this paper, we develop a novel technique to explicitly characterize the stochastic bias of a class of stochastic approximation procedures with time-varying Markov transition kernels. Our approach enables non-asymptotic convergence analyses of this type of stochastic approximation algorithm, which may be of independent interest. Using our bias characterization technique and a gradient descent type of analysis, we provide a finite-sample analysis of the mean-square error of the SARSA algorithm. We then further study a fitted SARSA algorithm, which includes the original SARSA algorithm and its variant in perkins2003convergent as special cases. This fitted SARSA algorithm provides a more general framework for iterative on-policy fitted policy iteration, and is more memory- and computationally efficient. For this fitted SARSA algorithm, we also provide a finite-sample analysis.

## 1 Introduction

SARSA, originally proposed in Rummery1994 , is an on-policy reinforcement learning algorithm, which continuously updates the behavior policy towards attaining as large an accumulated reward as possible over time. Specifically, SARSA is initialized with a state and a policy. At each time instance, it takes an action based on the current policy, observes the next state, and receives a reward. Using the newly observed information, it first updates the estimate of the action-value function, and then improves the behavior policy by applying a policy improvement operator, e.g., $\epsilon$-greedy, to the estimated action-value function. This process is repeated iteratively until convergence (see Algorithm 1 for a precise description of the SARSA algorithm).
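As a concrete illustration of the loop just described, here is a minimal tabular sketch on a hypothetical two-state MDP with an $\epsilon$-greedy improvement operator; the environment, constants, and helper names are all illustrative assumptions, and this is not the function-approximation setting analyzed in the paper.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions (illustrative only).
n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> dist over s'
R = rng.standard_normal((n_states, n_actions))                    # reward r(s, a)

def eps_greedy(q_row, eps=0.1):
    """epsilon-greedy policy improvement applied to one row of Q."""
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(q_row)] += 1.0 - eps
    return probs

# Tabular SARSA: update Q along a single trajectory, improving the
# behavior policy after every step.
Q = np.zeros((n_states, n_actions))
s = 0
a = rng.choice(n_actions, p=eps_greedy(Q[s]))
for t in range(5000):
    alpha = 1.0 / (1 + t)                                    # diminishing step size
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=eps_greedy(Q[s_next]))  # on-policy action
    td = R[s, a] + gamma * Q[s_next, a_next] - Q[s, a]       # temporal difference
    Q[s, a] += alpha * td
    s, a = s_next, a_next
```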

With the tabular approach that stores the action-value function, the convergence of SARSA has been established in singh2000convergence . However, the tabular approach may not be applicable when the state space is large or continuous. For this reason, SARSA that incorporates parametrized function approximation is commonly used, and is more efficient and scalable. With the function approximation approach, SARSA is not guaranteed to converge in general when the $\epsilon$-greedy or softmax policy improvement operators are used gordon1996chattering ; de2000existence . However, under certain conditions, its convergence can be established. For example, a variant of SARSA with linear function approximation was constructed in perkins2003convergent , where between two policy improvements, a temporal difference (TD) learning algorithm is applied to learn the action-value function until convergence. The convergence of this algorithm was established in perkins2003convergent using a contraction argument under the condition that the policy improvement operator is Lipschitz continuous and the Lipschitz constant is not too large. The convergence of the original SARSA algorithm under the same Lipschitz condition was later established using an O.D.E. approach in melo2008analysis .

Previous studies on SARSA in perkins2003convergent ; melo2008analysis mainly focused on the asymptotic convergence analysis, which does not suggest how fast SARSA converges nor how the accuracy of the solution depends on the number of samples, i.e., the sample complexity. The goal of this paper is to provide such a non-asymptotic finite-sample analysis of SARSA and to further understand how the parameters of the underlying Markov process and the algorithm affect the convergence rate. Technically, such an analysis does not follow directly from the existing finite-sample analyses for temporal difference (TD) learning bhandari2018finite ; srikant2019 and Q-learning Shah2018 , where samples are taken from a Markov process with a fixed transition kernel. The analysis of SARSA necessarily needs to deal with samples taken from a Markov decision process with a time-varying transition kernel, and in this paper, we develop novel techniques to explicitly characterize the stochastic bias for a Markov decision process with a time-varying transition kernel, which may be of independent interest.

### 1.1 Contributions

In this paper, we design a novel approach to analyze SARSA and a more general fitted SARSA algorithm, and develop the corresponding finite-sample error bounds. In particular, we consider the on-line setting where a single sample trajectory with Markovian noise is available, i.e., samples are not independent and identically distributed (i.i.d.).

Bias characterization for time-varying Markov process. One major challenge in our analysis is due to the fact that the estimate of the “gradient” is biased with non-i.i.d. Markovian noise. Existing studies mostly focus on the case where the samples are generated according to a Markov process with a fixed transition kernel, e.g., TD learning bhandari2018finite ; srikant2019 and Q-learning with nearest neighbors Shah2018 , so that the uniform ergodicity of the Markov process can be exploited to decouple the dependency on the Markovian noise, and then to explicitly bound the stochastic bias. For Markov processes with a time-varying transition kernel, such a property of uniform ergodicity does not hold in general. In this paper, we develop a novel approach to explicitly characterize the stochastic bias induced by non-i.i.d. samples generated from Markov processes with time-varying transition kernels. The central idea of our approach is to construct auxiliary Markov chains, which are uniformly ergodic, to approximate the dynamically changing Markov process to facilitate the analysis. Our approach can also be applied more generally to analyze stochastic approximation (SA) algorithms with time-varying Markov transition kernels, which may be of independent interest.

Finite-sample analysis for on-policy SARSA. For the on-policy SARSA algorithm, as the estimate of the action-value function changes with time, the behavior policy also changes. By a gradient descent type of analysis bhandari2018finite and our bias characterization technique for analyzing time-varying Markov processes, we develop the finite-sample analysis for the on-policy SARSA algorithm with a continuous state space and linear function approximation. Our analysis is for the on-line case with a single sample trajectory and non-i.i.d. data. To the best of our knowledge, this is the first finite-sample analysis for this type of on-policy algorithm with time-varying behavior policy.

Fitted SARSA algorithm. We propose a more general on-line fitted SARSA algorithm, where between two policy improvements, a “fitted” step is taken to obtain a more accurate estimate of the action-value function of the corresponding behavior policy via multiple iterations rather than taking only a single iteration as in the original SARSA. In particular, it includes the variant of SARSA in perkins2003convergent as a special case, in which each fitted step is required to converge before doing policy improvement. We provide a non-asymptotic analysis for the convergence of the proposed algorithm. Interestingly, our analysis indicates that the fitted step can stop at any time (not necessarily until convergence) without affecting the overall convergence of the fitted SARSA algorithm.

### 1.2 Related Work

Finite-sample analysis for TD learning. The asymptotic convergence of the TD algorithm was established in Tsitsiklis1997 . The finite-sample analysis of the TD algorithm was provided in Dalal2018a ; Laksh2018 under the i.i.d. setting, and recently in bhandari2018finite ; srikant2019 under the non-i.i.d. setting, where a single sample trajectory is available. The finite-sample analysis of two-time-scale methods for TD learning was also studied very recently under the i.i.d. setting in dalal2017finite , under the non-i.i.d. setting with constant step sizes in gupta2019finite , and under the non-i.i.d. setting with diminishing step sizes in xu2019two . Differently from TD, the goal of which is to estimate the value function of a fixed policy, SARSA aims to continuously update its estimate of the action-value function to obtain an optimal policy. While samples of the TD algorithm are generated by following a time-invariant behavior policy, the behavior policy that generates samples in SARSA follows from an instantaneous estimate of the action-value function, which changes over time.

Q-learning with function approximation. The asymptotic convergence of Q-learning with linear function approximation was established in melo2008analysis under certain conditions. An approach based on a combination of Q-learning and kernel-based nearest neighbor regression was proposed in Shah2018 , which first discretizes the entire state space, and then uses the nearest neighbor regression method to estimate the action-value function. Such an approach was shown to converge, and a finite-sample analysis of the convergence rate was further provided. The Q-learning algorithms in melo2008analysis ; Shah2018 are off-policy algorithms, where a fixed behavior policy is used to collect samples, whereas SARSA is an on-policy algorithm with a time-varying behavior policy. Moreover, differently from the nearest neighbor approach, we consider SARSA with linear function approximation. These differences require different techniques to characterize the non-asymptotic convergence rate.

On-policy SARSA algorithm. SARSA was originally proposed in Rummery1994 , and using the tabular approach its convergence was established in singh2000convergence . With function approximation, SARSA is not guaranteed to converge if $\epsilon$-greedy or softmax policy improvement operators are used. With a smooth enough Lipschitz continuous policy improvement operator, the asymptotic convergence of SARSA was shown in melo2008analysis ; perkins2003convergent . In this paper, we further develop the non-asymptotic finite-sample analysis for SARSA under the Lipschitz continuity condition.

Fitted value/policy iteration algorithms. The least-squares temporal difference learning (LSTD) algorithms have been extensively studied in Brad1996 ; Boyan2002 ; Munos2008 ; Lazaric2010 ; Ghav2010 ; Pires2012 ; Prash2013 ; Tagorti2015 ; Tu2018 and references therein, where in each iteration a least square regression problem based on a batch data is solved. Approximate (fitted) policy iteration (API) algorithms further extend fitted value iteration with policy improvement. Several variants were studied, which adopt different objective functions, including least-squares policy iteration (LSPI) algorithms in Lagou2003 ; Lazaric2012 ; Yang2019 , fitted policy iteration based on Bellman residual minimization (BRM) in Antos2008 ; Farah2010 , and classification-based policy iteration algorithm in Lazaric2016 . The fitted SARSA algorithm in this paper uses an iterative way (TD(0) algorithm) to estimate the action-value function between two policy improvements, which is more memory and computationally efficient than the batch method. Differently from perkins2003convergent , we do not require a convergent TD(0) run for each fitted step. For this algorithm, we provide its non-asymptotic convergence analysis.

## 2 Preliminaries

### 2.1 Markov Decision Process

Consider a general reinforcement learning setting, where an agent interacts with a stochastic environment, which is modeled as a Markov decision process (MDP). Specifically, we consider an MDP that consists of $(\mathcal{S}, \mathcal{A}, \mathsf{P}, r, \gamma)$, where $\mathcal{S}$ is a continuous state space, and $\mathcal{A}$ is a finite action set. We further let $s_t$ denote the state at time $t$, and $a_t$ denote the action at time $t$. Then the measure $\mathsf{P}(\cdot\,|\,s,a)$ defines the action-dependent transition kernel for the underlying Markov chain $\{s_t\}_{t\ge 0}$: $\mathbb{P}(s_{t+1} \in U \,|\, s_t = s, a_t = a) = \mathsf{P}(U\,|\,s,a)$ for any measurable set $U \subseteq \mathcal{S}$. The one-stage reward at time $t$ is given by $r(s_t, a_t, s_{t+1})$, where $r$ is the reward function, which is assumed to be uniformly bounded, i.e., $|r(s,a,s')| \le r_{\max}$ for any $(s,a,s')$. Finally, $\gamma \in (0,1)$ denotes the discount factor.

A stationary policy $\pi$ maps a state $s \in \mathcal{S}$ to a probability distribution $\pi(\cdot\,|\,s)$ over $\mathcal{A}$, which does not depend on time. For a policy $\pi$, the corresponding value function is defined as the expected total discounted reward obtained by actions executed according to $\pi$: $V^\pi(s) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \,|\, s_0 = s\big]$. The action-value function is defined as $Q^\pi(s,a) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \,|\, s_0 = s, a_0 = a\big]$. The goal is to find an optimal policy that maximizes the value function from any initial state. The optimal value function is defined as $V^*(s) = \sup_\pi V^\pi(s)$. The optimal action-value function is defined as $Q^*(s,a) = \sup_\pi Q^\pi(s,a)$. The optimal policy is then greedy with respect to $Q^*$. It can be verified that $V^*(s) = \max_{a\in\mathcal{A}} Q^*(s,a)$. The Bellman operator $T$ is defined as $(TQ)(s,a) = \int_{\mathcal{S}} \big(r(s,a,s') + \gamma \max_{b\in\mathcal{A}} Q(s',b)\big)\,\mathsf{P}(ds'\,|\,s,a)$. It is clear that $T$ is a contraction in the sup norm defined as $\|Q\|_\infty := \sup_{s,a}|Q(s,a)|$, and the optimal action-value function $Q^*$ is the fixed point of $T$ bertsekas2011dynamic .
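For a finite-MDP analogue, the $\gamma$-contraction property of the Bellman operator can be checked numerically; the small random MDP below is purely an illustrative assumption (the setting in this paper has a continuous state space).

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # toy transition kernel P[s, a] -> dist over s'
R = rng.standard_normal((nS, nA))               # toy reward r(s, a)

def bellman(Q):
    """Bellman optimality operator (TQ)(s,a) = r(s,a) + gamma * E[max_b Q(s',b)]."""
    return R + gamma * P @ Q.max(axis=1)

# Contraction check: ||T Q1 - T Q2||_inf <= gamma * ||Q1 - Q2||_inf
Q1 = rng.standard_normal((nS, nA))
Q2 = rng.standard_normal((nS, nA))
lhs = np.abs(bellman(Q1) - bellman(Q2)).max()
rhs = gamma * np.abs(Q1 - Q2).max()
assert lhs <= rhs + 1e-12
```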

### 2.2 Linear Function Approximation

Let $\mathcal{Q} := \{Q_\theta : \theta \in \mathbb{R}^N\}$ be a family of real-valued functions defined on $\mathcal{S}\times\mathcal{A}$. We consider the problem where any function in $\mathcal{Q}$ is a linear combination of a set of $N$ fixed base functions $\phi_i : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$, for $i = 1, \dots, N$. Specifically, for $\theta \in \mathbb{R}^N$, $Q_\theta(s,a) = \sum_{i=1}^N \theta_i\,\phi_i(s,a) = \phi(s,a)^\top \theta$. We assume that $\|\phi(s,a)\|_2 \le 1$, $\forall (s,a) \in \mathcal{S}\times\mathcal{A}$, which can be ensured by normalizing the base functions. The goal is to find a $Q_\theta$ with a compact representation in $\theta$ to approximate the optimal action-value function with a continuous state space.
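A minimal sketch of such a linear parametrization follows; the feature map `phi` is a hypothetical choice used only for illustration, with its normalization mirroring the assumption that the feature vector has norm at most one.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8  # number of base functions (illustrative)

def phi(s, a):
    """Hypothetical feature map phi(s, a) in R^N, normalized so ||phi||_2 <= 1."""
    z = np.cos(np.arange(1, N + 1) * s + a)   # e.g., Fourier-style features
    return z / max(1.0, np.linalg.norm(z))

def q_theta(theta, s, a):
    """Linear action-value estimate Q_theta(s, a) = phi(s, a)^T theta."""
    return phi(s, a) @ theta

theta = rng.standard_normal(N)
assert np.linalg.norm(phi(0.3, 1)) <= 1.0 + 1e-12   # normalization holds
```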

## 3 Finite-Sample Analysis for SARSA

### 3.1 SARSA with Linear Function Approximation

We consider a $\theta$-dependent behavior policy, which changes with time. Specifically, at time $t$ the behavior policy is given by $\pi_{\theta_t} = \Gamma(Q_{\theta_t})$, where $\Gamma$ is a policy improvement operator, e.g., greedy, $\epsilon$-greedy, softmax and mellowmax Asadi2016 . Suppose that $\{(s_t, a_t, r_t)\}_{t \ge 0}$ is a sample trajectory of states, actions and rewards obtained from the MDP following the time-dependent behavior policy $\pi_{\theta_t}$ (see Algorithm 1). The projected SARSA with linear function approximation updates $\theta_t$ as follows:

$$\theta_{t+1} = \mathrm{Proj}_{2,R}\big(\theta_t + \alpha_t\, g_t(\theta_t)\big), \qquad (1)$$

where $\alpha_t$ is the step size, $\Delta_t$ denotes the temporal difference at time $t$: $\Delta_t = r(s_t, a_t, s_{t+1}) + \gamma\,\phi(s_{t+1}, a_{t+1})^\top \theta_t - \phi(s_t, a_t)^\top \theta_t$, and $g_t(\theta_t) = \Delta_t\,\phi(s_t, a_t)$. Here, $\mathrm{Proj}_{2,R}$ denotes the projection onto the $\ell_2$-ball of radius $R$. In this paper, we refer to $g_t(\theta_t)$ as the "gradient", although it is not the gradient of any function.

Here, the projection step is used to control the norm of the gradient $g_t(\theta_t)$, which is a commonly used technique to control the gradient bias bhandari2018finite ; kushner2010stochastic ; lacoste2012simpler ; bubeck2015convex ; nemirovski2009robust . With a small step size and a bounded gradient, $\theta_t$ does not change too fast. We note that gordon2000reinforcement showed that SARSA converges to a bounded region, and thus $\theta_t$ is bounded for all $t$. This implies that our analysis still holds without the projection step. We further note that even without exploiting the fact that $\theta_t$ is bounded, the finite-sample analysis for SARSA can still be obtained by combining our approach of analyzing the stochastic bias with an extension of the approach in srikant2019 . However, to convey the central idea of characterizing the stochastic bias of an MDP with a dynamically changing transition kernel, we focus on the projected SARSA in this paper.
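The projection step can be sketched as follows; the iterate, stochastic "gradient", step size, and radius in this snippet are illustrative values, not quantities from the analysis.

```python
import numpy as np

def proj_l2_ball(theta, R):
    """Project theta onto the closed l2 ball of radius R."""
    norm = np.linalg.norm(theta)
    return theta if norm <= R else (R / norm) * theta

# One projected update theta <- Proj(theta + alpha * g), which keeps
# ||theta||_2 <= R and hence keeps the "gradient" norm bounded.
theta = np.array([3.0, 4.0])       # illustrative iterate
g = np.array([1.0, -2.0])          # illustrative stochastic "gradient"
alpha, R = 0.1, 2.0
theta = proj_l2_ball(theta + alpha * g, R)
assert np.linalg.norm(theta) <= R + 1e-12
```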

We consider the following Lipschitz continuous policy improvement operator as in perkins2003convergent ; melo2008analysis . For any $\theta_1, \theta_2 \in \mathbb{R}^N$, the behavior policy is Lipschitz with respect to $\theta$:

$$\big|\pi_{\theta_1}(a\,|\,s) - \pi_{\theta_2}(a\,|\,s)\big| \le C\,\|\theta_1 - \theta_2\|_2, \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}, \qquad (2)$$

where $C$ is the Lipschitz constant. Further discussion about this assumption and its impact on the convergence is provided in Section 5. We further assume that for any fixed $\theta$, the Markov chain $\{s_t\}$ induced by the behavior policy $\pi_\theta$ and the transition kernel $\mathsf{P}$ is uniformly ergodic with the invariant measure denoted by $\mu_\theta$, and satisfies the following assumption.

###### Assumption 1.

There are constants $m > 0$ and $\rho \in (0,1)$ such that

$$\sup_{s \in \mathcal{S}} d_{TV}\big(\mathbb{P}(s_t \in \cdot \,|\, s_0 = s),\ \mu_\theta\big) \le m\,\rho^t, \quad \forall t \ge 0,$$

where $d_{TV}(P, Q)$ denotes the total-variation distance between the probability measures $P$ and $Q$.

We denote by $P_\theta$ the probability measure on $\mathcal{S}\times\mathcal{A}$ induced by the invariant measure $\mu_\theta$ and the behavior policy $\pi_\theta$. We assume that the base functions $\phi_i$'s are linearly independent in the Hilbert space $L^2(\mathcal{S}\times\mathcal{A}, P_{\theta^*})$, where $\theta^*$ is the limit point of Algorithm 1, which will be defined in the next section. For the space $L^2(\mathcal{S}\times\mathcal{A}, P_{\theta^*})$, two measurable functions on $\mathcal{S}\times\mathcal{A}$ are equivalent if they are identical except on a set of $P_{\theta^*}$-measure zero.

### 3.2 Finite-Sample Analysis

We first define $g(\theta; s,a,s',a') := \big(r(s,a,s') + \gamma\,\phi(s',a')^\top\theta - \phi(s,a)^\top\theta\big)\,\phi(s,a)$, $\bar g(\theta) := \mathbb{E}_\theta[g(\theta; s,a,s',a')]$, and $A_\theta := \mathbb{E}_\theta\big[\phi(s,a)\big(\gamma\,\phi(s',a') - \phi(s,a)\big)^\top\big]$, where $\mathbb{E}_\theta$ denotes the expectation in which $s$ follows the invariant probability measure $\mu_\theta$, $a$ is generated by the behavior policy $\pi_\theta(\cdot\,|\,s)$, $s'$ is the subsequent state of $s$ following action $a$, i.e., $s'$ follows from the transition kernel $\mathsf{P}(\cdot\,|\,s,a)$, and $a'$ is generated by the behavior policy $\pi_\theta(\cdot\,|\,s')$. It was shown in melo2008analysis that the algorithm in (1) converges to a unique point $\theta^*$, which satisfies the following relation: $\bar g(\theta^*) = 0$,
if the Lipschitz constant $C$ is not so large, so that $A_{\theta^*}$ is negative definite.¹

¹It can be shown that if the $\phi_i$'s are linearly independent in $L^2(\mathcal{S}\times\mathcal{A}, P_{\theta^*})$, then $A_{\theta^*}$ is negative definite perkins2003convergent ; Tsitsiklis1997 .

Recall from (2) that the policy $\pi_\theta$ is Lipschitz with respect to $\theta$ with Lipschitz constant $C$. We then make the following assumption perkins2003convergent ; melo2008analysis .

###### Assumption 2.

The Lipschitz constant $C$ is not so large, so that $A_{\theta^*}$ is negative definite, and we denote the largest eigenvalue of $A_{\theta^*}$ by $-w$, where $w > 0$.

The following theorems present the finite-sample bounds on the convergence of SARSA with diminishing and constant step sizes, respectively.

###### Theorem 1.

Theorem 1 indicates that SARSA achieves a faster convergence rate than that given by the existing finite-sample bound for Q-learning with nearest neighbors Shah2018 .

###### Theorem 2.

If the constant step size $\alpha$ is small enough, and the number of iterations $T$ is large enough, then the algorithm converges to a small neighborhood of $\theta^*$. For example, if $\alpha$ is chosen to decrease appropriately with $T$, the upper bound converges to zero as $T \to \infty$. The proof of this theorem is a straightforward extension of that of Theorem 1.

In order for Theorems 1 and 2 to hold, the projection radius $R$ shall be chosen such that $R \ge \|\theta^*\|_2$. However, $\|\theta^*\|_2$ is unknown in advance. We next provide an upper bound on $\|\theta^*\|_2$, which can be estimated in practice bhandari2018finite .

###### Lemma 1.

For the projected SARSA algorithm in (1), the limit point $\theta^*$ satisfies $\|\theta^*\|_2 \le r_{\max}/w$, where $-w < 0$ is the largest eigenvalue of $A_{\theta^*}$.

### 3.3 Outline of Technical Proof of Theorem 1

The major challenge in the finite-sample analysis of SARSA lies in analyzing the stochastic bias in the gradient, which is two-fold: (1) non-i.i.d. samples; and (2) a dynamically changing behavior policy.

First, as per the updating rule in (1), there is a strong coupling between the sample path and the iterates $\{\theta_t\}$, because the samples are used to compute the gradient and then $\theta_t$, which introduces a strong dependency between $\theta_t$ and the samples, and thus the bias in the gradient $g_t(\theta_t)$. Moreover, differently from TD learning and Q-learning, $\theta_t$ is further used (via the policy $\pi_{\theta_t}$) to generate the subsequent actions, which makes the dependency even stronger. Although the convergence can still be established using the O.D.E. approach melo2008analysis , in order to derive a finite-sample analysis, the stochastic bias in the gradient needs to be explicitly characterized, which makes the problem challenging.

Second, as $\theta_t$ updates, the transition kernel of the state-action pair process $\{(s_t, a_t)\}$ changes with time. Previous analyses, e.g., bhandari2018finite , rely on the facts that the behavior policy is fixed and that the underlying Markov process is uniformly ergodic, so that the Markov process reaches its stationary distribution quickly. In perkins2003convergent , a variant of SARSA was studied, where between two policy improvements, the behavior policy is fixed, and a TD method is used to estimate its action-value function until convergence. The behavior policy is then improved using a Lipschitz continuous policy improvement operator. In this way, for each given behavior policy, the induced Markov process can reach its stationary distribution quickly so that the analysis can be conducted. The SARSA algorithm studied in this paper does not possess these nice properties. The behavior policy of the SARSA algorithm changes at each time step, and the underlying Markov process does not necessarily reach a stationary distribution due to the lack of uniform ergodicity.

To provide a finite-sample analysis, our major technical novelty lies in the design of auxiliary Markov chains, which are uniformly ergodic, to approximate the original Markov chain induced by the SARSA algorithm, together with a careful decomposition of the stochastic bias. Using such an approach, the gradient bias can be explicitly characterized. Then, together with a gradient descent type of analysis, we derive the finite-sample analysis for the SARSA algorithm.

To illustrate the main idea of the proof, we provide a sketch. We note that Step 3 contains our major technical contributions of bias characterization for time-varying Markov processes.

###### Proof sketch.

We first introduce some notation. For any fixed $\theta$, define $\bar g(\theta) := \mathbb{E}_\theta\big[\big(r(s,a,s') + \gamma\,\phi(s',a')^\top\theta - \phi(s,a)^\top\theta\big)\,\phi(s,a)\big]$, where $s$ follows the stationary distribution $\mu_\theta$, and $a, s', a'$ are the subsequent actions and states generated according to the policy $\pi_\theta$ and the transition kernel $\mathsf{P}$. Here, $\bar g(\theta)$ can be interpreted as the noiseless gradient at $\theta$. We then define

$$\zeta_t(\theta) := \big\langle \theta - \theta^*,\ g_t(\theta) - \bar g(\theta) \big\rangle. \qquad (5)$$

Thus, $\zeta_t(\theta)$ measures the bias caused by using non-i.i.d. samples to estimate the gradient.

Step 1. Error decomposition. The error at each time step can be decomposed recursively as follows:

$$\|\theta_{t+1} - \theta^*\|_2^2 \le \|\theta_t - \theta^*\|_2^2 + 2\alpha_t\big\langle \theta_t - \theta^*,\ \bar g(\theta_t)\big\rangle + \alpha_t^2\,\|g_t(\theta_t)\|_2^2 + 2\alpha_t\,\zeta_t(\theta_t). \qquad (6)$$

Step 2. Gradient descent type analysis. The first three terms in (6) mimic the analysis of the gradient descent algorithm without noise, because the accurate gradient $\bar g(\theta_t)$ at $\theta_t$ is used.

Due to the projection step in (1), $\|g_t(\theta_t)\|_2$ is upper bounded by the constant $G := r_{\max} + (1+\gamma)R$. It can also be shown that

$$\big\langle \theta_t - \theta^*,\ \bar g(\theta_t) \big\rangle \le (\theta_t - \theta^*)^\top A_{\theta^*}\,(\theta_t - \theta^*) + c\,C\,\|\theta_t - \theta^*\|_2^2, \qquad (7)$$

where $c > 0$ is a constant that does not depend on the Lipschitz constant $C$. For a not so large $C$, i.e., when $\pi_\theta$ is smooth enough with respect to $\theta$, $A_{\theta^*}$ is negative definite and dominates the $c\,C$ term. Then, we have

$$\big\langle \theta_t - \theta^*,\ \bar g(\theta_t) \big\rangle \le -w_s\,\|\theta_t - \theta^*\|_2^2, \quad \text{for some } w_s > 0. \qquad (8)$$

Step 3. Stochastic bias analysis. This step consists of our major technical developments. The last term in (6) is the bias caused by using a single sample path with non-i.i.d. data and a time-varying behavior policy. For convenience, we rewrite $\zeta_t(\theta_t)$ as $\zeta(\theta_t, X_t)$, where $X_t := (s_t, a_t, s_{t+1}, a_{t+1})$. Bounding this term is challenging due to the strong dependency between $\theta_t$ and $X_t$.

We first show that $\zeta(\theta, X)$ is Lipschitz in $\theta$. Due to the projection step, $\theta_t$ changes slowly with $t$. Combining the two facts, we can show that for any $\tau < t$,

$$\mathbb{E}[\zeta(\theta_t, X_t)] \le \mathbb{E}[\zeta(\theta_{t-\tau}, X_t)] + c_1 \sum_{i=t-\tau}^{t-1} \alpha_i, \qquad (9)$$

where $c_1 > 0$ is a constant.

Such a step is intended to decouple the dependency between $\theta_{t-\tau}$ and $X_t$ by considering $\zeta(\theta_{t-\tau}, X_t)$ instead of $\zeta(\theta_t, X_t)$. If the Markov chain induced by SARSA were uniformly ergodic and satisfied Assumption 1, then for any fixed $\theta_{t-\tau}$, $X_t$ would reach its stationary distribution quickly for large $\tau$. However, such an argument is not necessarily true, since $\theta_t$ changes with time and thus the transition kernel of the Markov chain changes with time.

Our idea is to construct an auxiliary Markov chain to assist our proof. Consider the following new Markov chain. Before time $t-\tau$, the states and actions are generated according to the SARSA algorithm, but after time $t-\tau$, the behavior policy is kept fixed as $\pi_{\theta_{t-\tau}}$ to generate all the subsequent actions. We then denote by $\tilde X_t := (\tilde s_t, \tilde a_t, \tilde s_{t+1}, \tilde a_{t+1})$ the observations of the new Markov chain at times $t$ and $t+1$. For this new Markov chain, for large $\tau$, $\tilde X_t$ reaches the stationary distribution induced by $\pi_{\theta_{t-\tau}}$ and $\mathsf{P}$. It then can be shown that

$$\mathbb{E}[\zeta(\theta_{t-\tau}, \tilde X_t)] \le c_2\, m\,\rho^{\tau}, \qquad (10)$$

where $c_2 > 0$ is a constant, and $m$ and $\rho$ are as in Assumption 1.

The next step is to bound the difference between the Markov chain generated by the SARSA algorithm and the auxiliary Markov chain that we construct. Since the behavior policy changes slowly, due to its Lipschitz property and the small step size $\alpha_t$, the two Markov chains should not deviate from each other too much. It can be shown that for the case with diminishing step sizes (a similar argument holds for the case with a constant step size),

$$\mathbb{E}[\zeta(\theta_{t-\tau}, X_t)] - \mathbb{E}[\zeta(\theta_{t-\tau}, \tilde X_t)] \le c_3 \sum_{i=t-\tau}^{t-1} \alpha_i, \qquad (11)$$

where $c_3 > 0$ is a constant.
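The auxiliary-chain construction in Step 3 can be illustrated by a small simulation on a hypothetical finite MDP: run the time-varying chain up to time $t-\tau$, then fork a copy in which the policy parameter is frozen, so the forked chain is time-homogeneous and mixes to its stationary distribution. The toy kernel, the softmax policy, and the random-noise stand-in for the SARSA update are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy kernel

def policy(theta, s):
    """Hypothetical theta-dependent softmax behavior policy."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def step(theta, s):
    """Take one action and one environment transition."""
    a = rng.choice(n_actions, p=policy(theta, s))
    return a, rng.choice(n_states, p=P[s, a])

t, tau = 50, 10
theta = rng.standard_normal((n_states, n_actions))
s = 0
for i in range(t):
    if i == t - tau:
        # Fork the auxiliary chain: same state, but the policy parameter
        # is frozen at theta_{t-tau} from here on.
        s_aux, theta_frozen = s, theta.copy()
    a, s = step(theta, s)
    theta += 0.01 * rng.standard_normal(theta.shape)  # stand-in for the SARSA update

# Auxiliary chain: time-homogeneous after the fork, hence uniformly ergodic.
for i in range(tau):
    a_aux, s_aux = step(theta_frozen, s_aux)
```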

Step 4. Putting the first three steps together and recursively applying Step 1 complete the proof. ∎

## 4 Finite-sample Analysis for Fitted SARSA Algorithm

In this section, we introduce a more general on-policy fitted SARSA algorithm (see Algorithm 2), which provides a general framework for on-policy fitted policy iteration. Specifically, after each policy improvement, we perform a “fitted” step that consists of $N$ TD(0) iterations to estimate the action-value function of the current policy. This more general fitted SARSA algorithm contains the original SARSA algorithm Rummery1994 as a special case with $N = 1$ and the algorithm in perkins2003convergent as another special case with $N \to \infty$ (i.e., TD(0) is run until convergence). Moreover, the entire algorithm uses only one single Markov trajectory, instead of restarting from an initial state after each policy improvement perkins2003convergent . Differently from most existing fitted policy iteration algorithms, where a regression problem for model fitting is solved between two policy improvements, our fitted SARSA algorithm does not require a convergent TD iteration process between policy improvements. As will be shown, the on-policy fitted SARSA algorithm is guaranteed to converge for an arbitrary $N$. The overall sample complexity for this fitted algorithm will also be provided.

In fact, there is no need for the number of TD iterations in each fitted step to be the same. More generally, by setting the number of TD iterations differently across fitted steps, we can control the estimation accuracy of the action-value function between policy improvements using the finite-sample bound of TD bhandari2018finite . Our analysis can be extended to this general scenario in a straightforward manner, but the mathematical expressions become more involved. We thus focus on the simple case with the same $N$ to convey the central idea.
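The structure of the fitted algorithm, an outer loop of policy improvements, each followed by $N$ TD(0) iterations on one continuing trajectory, can be sketched as follows. The toy MDP, one-hot features, and softmax operator are illustrative assumptions, not Algorithm 2 itself; setting `N = 1` recovers the original SARSA loop.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy kernel
R = rng.standard_normal((n_states, n_actions))                    # toy rewards

def phi(s, a):
    """Hypothetical one-hot feature map phi(s, a)."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

def softmax_policy(theta, s, tau=2.0):
    """Smooth (Lipschitz) theta-dependent behavior policy."""
    q = np.array([phi(s, b) @ theta for b in range(n_actions)]) / tau
    q -= q.max()
    p = np.exp(q)
    return p / p.sum()

theta = np.zeros(n_states * n_actions)
N, alpha = 20, 0.1   # N TD(0) iterations per policy improvement; N = 1 is SARSA
s = 0
for k in range(100):                       # policy improvement steps
    theta_k = theta.copy()                 # freeze the behavior policy pi_{theta_k}
    a = rng.choice(n_actions, p=softmax_policy(theta_k, s))
    for i in range(N):                     # fitted step: TD(0) under pi_{theta_k}
        s2 = rng.choice(n_states, p=P[s, a])
        a2 = rng.choice(n_actions, p=softmax_policy(theta_k, s2))
        td = R[s, a] + gamma * phi(s2, a2) @ theta - phi(s, a) @ theta
        theta += alpha * td * phi(s, a)
        s, a = s2, a2                      # one continuing trajectory, no restarts
```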

The following theorem provides the finite-sample bound on the convergence of the fitted SARSA algorithm.

###### Theorem 3.

Consider the fitted SARSA algorithm with linear function approximation as in Algorithm 2. Suppose that Assumptions 1 and 2 hold.

(1) With a decaying step size for and , we have that

(12) |

where . For sufficiently large , , and hence For any given , to guarantee the accuracy for a small , the overall sample complexity is given by .

(2) With a constant step size for , we have that

(13) |

where .

Item (2) of Theorem 3 indicates that with a small enough constant step size and a large enough number of iterations, the fitted SARSA algorithm converges to a small neighborhood of $\theta^*$.

Theorem 3 further implies that each fitted step can take any number of TD iterations (not necessarily running until convergence) without affecting the overall convergence and sample complexity of the fitted SARSA algorithm. In particular, a comparison between the original SARSA and the fitted SARSA algorithms indicates that they have the same overall sample complexity. On the other hand, the fitted SARSA algorithm is more computationally efficient due to the following two facts: (a) with the same total number of samples, the general fitted SARSA algorithm applies the policy improvement operator fewer times; and (b) to apply the policy improvement operator, an inner product between $\phi(s,a)$ and $\theta$ needs to be computed for every action, the complexity of which scales linearly with the size of the action space $|\mathcal{A}|$.

## 5 Discussion of Lipschitz Continuity Assumption

In this section, we discuss the Lipschitz continuity assumption on the policy improvement operator , which plays an important role in the convergence of SARSA.

Using a tabular approach that stores the action-values, the convergence of the SARSA algorithm was established in singh2000convergence . However, an example given in gordon1996chattering shows that SARSA with function approximation and the $\epsilon$-greedy policy improvement operator can chatter, and does not converge. Later, gordon2000reinforcement showed that SARSA converges to a bounded region, although this region may be large, and does not diverge as Q-learning with linear function approximation can. One possible explanation of this non-convergent behavior of the SARSA algorithm with $\epsilon$-greedy and softmax policy improvement operators is the discontinuity in the action selection strategies perkins2002existence ; de2000existence . More specifically, a slight change in the estimate of the action-value function may result in a big change in the behavior policy, which thus yields a completely different estimate of the action-value function.

Toward further understanding the convergence of SARSA, de2000existence showed that approximate value iteration with softmax policy improvement is guaranteed to have fixed points, which however may not be unique, and perkins2002existence later showed that for any continuous policy improvement operator, fixed points of SARSA are guaranteed to exist. Then perkins2003convergent developed a convergent form of SARSA by using a Lipschitz continuous policy improvement operator, and demonstrated its convergence to the unique limit point when the Lipschitz constant is not too large. As discussed in perkins2002existence , the non-convergence example in gordon1996chattering does not contradict the convergence result in perkins2003convergent , because the example does not satisfy the Lipschitz continuity condition on the policy improvement operator, which is essential to guarantee the convergence of SARSA. In this paper, we follow this line of reasoning, and consider Lipschitz continuous policy improvement operators.

As discussed in perkins2003convergent , the Lipschitz constant $C$ shall be chosen not too large to ensure the convergence of the SARSA algorithm. However, to promote exploitation, one generally prefers a large Lipschitz constant, so that the agent can concentrate on actions with higher estimated action-values. In perkins2003convergent , an adaptive approach to choosing a policy improvement operator with a proper Lipschitz constant was proposed. It was also noted in perkins2003convergent that it is possible for convergence to be obtained with a much larger Lipschitz constant than the one suggested by Theorems 1, 2 and 3.
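As a concrete illustration of such an operator, a temperature-scaled softmax is a standard smooth choice: a larger temperature yields a smoother mapping from action-value estimates to the policy (a smaller Lipschitz constant), while a smaller temperature approaches the discontinuous greedy operator. The snippet below, with an illustrative temperature parameter, is only an assumed example, not necessarily the operator used in the cited works.

```python
import numpy as np

def softmax_operator(q_values, tau=1.0):
    """Softmax policy improvement: pi(a|s) proportional to exp(Q(s,a)/tau).

    A larger temperature tau gives a smoother operator (Lipschitz constant
    roughly inversely proportional to tau); tau -> 0 recovers the
    discontinuous greedy operator.
    """
    z = q_values / tau
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Smoothness check: a small (non-uniform) change in the action-value
# estimates produces only a small change in the behavior policy.
q1 = np.array([1.0, 1.1, 0.9])
q2 = q1 + np.array([1e-3, -1e-3, 0.0])
d = np.abs(softmax_operator(q1, tau=2.0) - softmax_operator(q2, tau=2.0)).max()
assert d <= 1e-2
```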

However, an important open problem for SARSA algorithms with a Lipschitz continuous operator (and also for other algorithms with continuous action selection de2000existence ) is that there is no theoretical characterization of the performance of the solutions that this type of algorithm produces. It is thus of future interest to further investigate the performance of the policy generated by the SARSA algorithm with a Lipschitz continuous operator.

## 6 Conclusion

In this paper, we presented the first finite-sample analysis for the SARSA algorithm with a continuous state space and linear function approximation. Our analysis applies to the on-line case with a single sample path and non-i.i.d. data. In particular, we developed a novel technique to handle the stochastic bias under a dynamically changing behavior policy, which enables non-asymptotic analysis of this type of stochastic approximation algorithm. We also presented a fitted SARSA algorithm, which provides a general framework for iterative on-policy fitted policy iteration, and provided its finite-sample analysis.

## Acknowledgement

We would like to thank the anonymous reviewer and the Area Chair for their valuable comments. The work of T. Xu and Y. Liang was supported in part by the U.S. National Science Foundation under Grants ECCS-1818904, CCF-1909291, and CCF-1900145.

## References

- (1) A. Antos, C. Szepesvari, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
- (2) K. Asadi and M. L. Littman. An alternative softmax operator for reinforcement learning. In Proc. International Conference on Machine Learning (ICML), 2016.
- (3) D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition, 2012.
- (4) J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.
- (5) J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–246, 2002.
- (6) S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
- (7) S. Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
- (8) G. Dalal, B. Szorenyi, G. Thoppe, and S. Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Proc. Conference on Learning Theory (COLT), 2018.
- (9) G. Dalal, B. Szorenyi, G. Thoppe, and S. Mannor. Finite sample analyses for TD(0) with function approximation. In Proc. AAAI Conference on Artificial Intelligence (AAAI), 2018.
- (10) D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization theory and Applications, 105(3):589–608, 2000.
- (11) A.-M. Farahmand, C. Szepesvari, and R. Munos. Error propagation for approximate policy and value iteration. In Proc. Advances in Neural Information Processing Systems (NIPS), 2010.
- (12) M. Ghavamzadeh, A. Lazaric, O. Maillard, and R. Munos. LSTD with random projections. In Proc. Advances in Neural Information Processing Systems (NIPS), 2010.
- (13) G. J. Gordon. Chattering in SARSA(λ) - a CMU Learning Lab internal report. 1996.
- (14) G. J. Gordon. Reinforcement learning with function approximation converges to a region. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 1040–1046, 2001.
- (15) H. Gupta, R. Srikant, and L. Ying. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. To appear in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.
- (16) H. Kushner. Stochastic approximation: a survey. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):87–96, 2010.
- (17) S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.
- (18) M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
- (19) C. Lakshminarayanan and C. Szepesvari. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
- (20) A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In Proc. International Conference on Machine Learning (ICML), 2010.
- (21) A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13:3041–3074, 2012.
- (22) A. Lazaric, M. Ghavamzadeh, and R. Munos. Analysis of classification-based policy iteration algorithms. Journal of Machine Learning Research, 17:583–612, 2016.
- (23) F. S. Melo, S. P. Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with function approximation. In Proc. International Conference on Machine Learning (ICML), pages 664–671. ACM, 2008.
- (24) A. Y. Mitrophanov. Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4):1003–1014, 2005.
- (25) R. Munos and C. Szepesvari. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9:815–857, May 2008.
- (26) A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- (27) T. J. Perkins and M. D. Pendrith. On the existence of fixed points for Q-learning and Sarsa in partially observable domains. In Proc. International Conference on Machine Learning (ICML), pages 490–497, 2002.
- (28) T. J. Perkins and D. Precup. A convergent form of approximate policy iteration. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 1627–1634, 2003.
- (29) B. A. Pires and C. Szepesvari. Statistical linear estimation with penalized estimators: An application to reinforcement learning. In Proc. International Conference on Machine Learning (ICML), 2012.
- (30) L. Prashanth, N. Korda, and R. Munos. Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. In Proc. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2013.
- (31) G. A. Rummery and M. Niranjan. Online Q-learning using connectionist systems. Technical Report, Cambridge University Engineering Department, Sept. 1994.
- (32) D. Shah and Q. Xie. Q-learning with nearest neighbors. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2018.
- (33) S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.
- (34) R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Proc. Annual Conference on Learning Theory (COLT), 2019.
- (35) M. Tagorti and B. Scherrer. On the rate of convergence and error bounds for LSTD(λ). In Proc. International Conference on Machine Learning (ICML), 2015.
- (36) J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, May 1997.
- (37) S. Tu and B. Recht. Least-squares temporal difference learning for the linear quadratic regulator. In Proc. International Conference on Machine Learning (ICML), 2018.
- (38) T. Xu, S. Zou, and Y. Liang. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. To appear in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.
- (39) Z. Yang, Y. Xie, and Z. Wang. A theoretical analysis of deep Q-learning. ArXiv: 1901.00137, Jan. 2019.

# Supplementary Materials

## Appendix A Useful Lemmas for Proof of Theorem 1

For the SARSA algorithm, recall that $g_t(\theta) = \big(r_t + \gamma\phi_{t+1}^{\top}\theta - \phi_t^{\top}\theta\big)\phi_t$ with $\phi_t := \phi(s_t,a_t)$, and define for any $\theta \in \mathbb{R}^d$,

$$
\bar g(\theta) = \mathbb{E}_{s\sim\mu_{\theta},\, a\sim\pi_{\theta}(\cdot|s),\, s'\sim\mathsf{P}(\cdot|s,a),\, a'\sim\pi_{\theta}(\cdot|s')}\Big[\big(r(s,a)+\gamma\phi(s',a')^{\top}\theta-\phi(s,a)^{\top}\theta\big)\phi(s,a)\Big]. \tag{14}
$$

It can be easily verified that $\bar g(\theta^{*})=0$. We then define $\zeta_t(\theta) := \big\langle \theta-\theta^{*},\, g_t(\theta)-\bar g(\theta)\big\rangle$.

###### Lemma 2.

For any $\theta_1, \theta_2$ such that $\|\theta_1\|_2 \le R$ and $\|\theta_2\|_2 \le R$,
$$
\|g_t(\theta_1)-g_t(\theta_2)\|_2 \le 2\|\theta_1-\theta_2\|_2.
$$

###### Proof.

By the definition of $g_t(\theta)$, we obtain

$$
\|g_t(\theta_1)-g_t(\theta_2)\|_2 = \big|\big(\gamma\phi_{t+1}-\phi_t\big)^{\top}(\theta_1-\theta_2)\big|\,\|\phi_t\|_2 \le \big\|\gamma\phi_{t+1}-\phi_t\big\|_2\|\theta_1-\theta_2\|_2 \le (1+\gamma)\|\theta_1-\theta_2\|_2 \le 2\|\theta_1-\theta_2\|_2, \tag{15}
$$

where the first two inequalities are due to the assumption that $\|\phi(s,a)\|_2 \le 1$ for all $(s,a)$, and the last inequality follows since $\gamma < 1$. ∎
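Assuming the standard SARSA update direction $g_t(\theta)=(r_t+\gamma\phi_{t+1}^{\top}\theta-\phi_t^{\top}\theta)\phi_t$ with features normalized so that $\|\phi\|_2\le 1$, the 2-Lipschitz property of $g_t$ can be checked numerically. The snippet below is an illustrative sketch (random features and rewards are assumptions for the example), not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, d = 0.9, 5

# Random features, rescaled so that ||phi||_2 <= 1 (the paper's normalization).
phi_t = rng.normal(size=d); phi_t /= max(1.0, np.linalg.norm(phi_t))
phi_t1 = rng.normal(size=d); phi_t1 /= max(1.0, np.linalg.norm(phi_t1))
r_t = 0.5  # an arbitrary bounded reward for the example

def g(theta):
    # TD direction g_t(theta) = (r_t + gamma*phi_{t+1}'theta - phi_t'theta) * phi_t
    return (r_t + gamma * phi_t1 @ theta - phi_t @ theta) * phi_t

# Check ||g_t(th1) - g_t(th2)|| <= 2 ||th1 - th2|| on many random pairs.
for _ in range(1000):
    th1, th2 = rng.normal(size=d), rng.normal(size=d)
    lhs = np.linalg.norm(g(th1) - g(th2))
    rhs = 2.0 * np.linalg.norm(th1 - th2)
    assert lhs <= rhs + 1e-12
print("Lipschitz bound held on all sampled pairs")
```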

The following lemma is useful to deal with the time-varying behavior policy.

###### Lemma 3.

For any $\theta_1$ and $\theta_2$ in $\mathbb{R}^d$,

$$
d_{TV}\big(\mu_{\theta_1},\mu_{\theta_2}\big) \le |\mathcal{A}|L\Big(\hat\lambda + \frac{m\rho^{\hat\lambda}}{1-\rho}\Big)\|\theta_1-\theta_2\|_2, \tag{16}
$$

and

$$
d_{TV}\big(\nu_{\theta_1},\nu_{\theta_2}\big) \le |\mathcal{A}|L\Big(2 + \hat\lambda + \frac{m\rho^{\hat\lambda}}{1-\rho}\Big)\|\theta_1-\theta_2\|_2, \tag{17}
$$

where $\hat\lambda := \lceil\log_\rho m^{-1}\rceil$, $L$ is the Lipschitz constant of the policy improvement operator, and $\nu_\theta$ denotes the distribution of the tuple $O=(s,a,s',a')$ with $s\sim\mu_\theta$, $a\sim\pi_\theta(\cdot|s)$, $s'\sim\mathsf{P}(\cdot|s,a)$ and $a'\sim\pi_\theta(\cdot|s')$.

###### Proof.

For $\theta_1$, $\theta_2 \in \mathbb{R}^d$, define the transition kernels respectively as follows: for any $s\in\mathcal{S}$ and any measurable $B\subseteq\mathcal{S}$,

$$
P_{\theta_i}(s,B) = \sum_{a\in\mathcal{A}} \pi_{\theta_i}(a|s)\,\mathsf{P}(B|s,a), \qquad i=1,2. \tag{18}
$$

Following from Theorem 3.1 in [24], we obtain

$$
d_{TV}\big(\mu_{\theta_1},\mu_{\theta_2}\big) \le \Big(\hat\lambda + \frac{m\rho^{\hat\lambda}}{1-\rho}\Big)\big\|P_{\theta_1}-P_{\theta_2}\big\|, \tag{19}
$$

where $\|\cdot\|$ is the operator norm: $\|A\| := \sup_{\|q\|_{TV}=1}\|qA\|_{TV}$, and $\|\cdot\|_{TV}$ denotes the total-variation norm. Then, we have

$$
\big\|P_{\theta_1}-P_{\theta_2}\big\| \le \sup_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\big|\pi_{\theta_1}(a|s)-\pi_{\theta_2}(a|s)\big| \le |\mathcal{A}|L\|\theta_1-\theta_2\|_2, \tag{20}
$$

which, together with (19), yields the first result. By definition, $\nu_{\theta_i}(s,a,s',a') = \mu_{\theta_i}(s)\pi_{\theta_i}(a|s)\mathsf{P}(s'|s,a)\pi_{\theta_i}(a'|s')$, for $i=1,2$. Therefore, the second result follows after a few steps of simple computations. ∎
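The perturbation bound from [24] rests on the fact that the stationary distribution of a uniformly ergodic chain varies continuously with its transition kernel — this is what makes the drifting behavior policy tractable. The following sketch (a toy finite-state chain with a hypothetical uniform-mixture perturbation, not from the paper) illustrates this sensitivity numerically.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a row-stochastic matrix via the unit eigenvalue."""
    w, v = np.linalg.eig(P.T)
    mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return mu / mu.sum()

rng = np.random.default_rng(2)
n = 5
P = rng.dirichlet(np.ones(n), size=n)  # base transition kernel (rows sum to 1)

def perturbed(eps):
    # Mix the base kernel with the uniform kernel; the kernel gap is O(eps),
    # standing in for the gap ||P_theta1 - P_theta2|| induced by a policy change.
    return (1.0 - eps) * P + eps * np.ones((n, n)) / n

mu0 = stationary(P)
gaps = {}
for eps in (0.1, 0.01, 0.001):
    gaps[eps] = 0.5 * np.abs(stationary(perturbed(eps)) - mu0).sum()
    print(f"eps={eps}: d_TV(mu_eps, mu_0) = {gaps[eps]:.6f}")
```

As the kernel perturbation shrinks, the total-variation gap between the stationary distributions shrinks proportionally, mirroring the linear dependence on $\|P_{\theta_1}-P_{\theta_2}\|$ in the bound above.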

###### Lemma 4.

For any $\theta_1, \theta_2$ such that $\|\theta_1\|_2 \le R$ and $\|\theta_2\|_2 \le R$,

$$
\|\bar g(\theta_1)-\bar g(\theta_2)\|_2 \le \Big(2 + G|\mathcal{A}|L\Big(2+\hat\lambda+\frac{m\rho^{\hat\lambda}}{1-\rho}\Big)\Big)\|\theta_1-\theta_2\|_2, \tag{21}
$$

where $G := r_{\max}+2R$.

###### Proof.

Let $O=(s,a,s',a')$ and $g(\theta,O) := \big(r(s,a)+\gamma\phi(s',a')^{\top}\theta-\phi(s,a)^{\top}\theta\big)\phi(s,a)$. Denote by $\nu_\theta$ the distribution of $O$ with $s\sim\mu_\theta$, $a\sim\pi_\theta(\cdot|s)$, $s'\sim\mathsf{P}(\cdot|s,a)$ and $a'\sim\pi_\theta(\cdot|s')$, so that $\bar g(\theta) = \mathbb{E}_{\nu_\theta}[g(\theta,O)]$. By the definition of $\bar g$, we have

$$
\|\bar g(\theta_1)-\bar g(\theta_2)\|_2 \le \big\|\mathbb{E}_{\nu_{\theta_1}}[g(\theta_1,O)] - \mathbb{E}_{\nu_{\theta_2}}[g(\theta_1,O)]\big\|_2 + \big\|\mathbb{E}_{\nu_{\theta_2}}[g(\theta_1,O)-g(\theta_2,O)]\big\|_2. \tag{22}
$$

The first term in (22) can be bounded as follows:

$$
\big\|\mathbb{E}_{\nu_{\theta_1}}[g(\theta_1,O)] - \mathbb{E}_{\nu_{\theta_2}}[g(\theta_1,O)]\big\|_2 \le G\, d_{TV}\big(\nu_{\theta_1},\nu_{\theta_2}\big) \le C_1\|\theta_1-\theta_2\|_2, \tag{23}
$$

where the second inequality follows from Lemma 3, and $C_1 := G|\mathcal{A}|L\big(2+\hat\lambda+\frac{m\rho^{\hat\lambda}}{1-\rho}\big)$.

The second term in (22) can be bounded as follows:

$$
\big\|\mathbb{E}_{\nu_{\theta_2}}[g(\theta_1,O)-g(\theta_2,O)]\big\|_2 = \Big\|\mathbb{E}_{\nu_{\theta_2}}\big[\phi(s,a)\big(\gamma\phi(s',a')-\phi(s,a)\big)^{\top}\big](\theta_1-\theta_2)\Big\|_2 \le (\gamma+\lambda_{\max})\|\theta_1-\theta_2\|_2, \tag{24}
$$

where the inequality uses $\big\|\mathbb{E}_{\nu_{\theta_2}}[\phi(s,a)\phi(s',a')^{\top}]\big\|_2 \le 1$, which holds since $\|\phi(s,a)\|_2 \le 1$ for all $(s,a)$.

Hence,

$$
\|\bar g(\theta_1)-\bar g(\theta_2)\|_2 \le (C_1+\gamma+\lambda_{\max})\|\theta_1-\theta_2\|_2 \le (C_1+2)\|\theta_1-\theta_2\|_2, \tag{25}
$$

where the last inequality follows from $\mathbb{E}_{\nu_{\theta_2}}[\phi(s,a)\phi(s,a)^{\top}] \preceq I$, where $I$ is the identity matrix, and $\lambda_{\max}$ is the largest eigenvalue of $\mathbb{E}_{\nu_{\theta_2}}[\phi(s,a)\phi(s,a)^{\top}]$. ∎

###### Lemma 5.

For all $t \ge 0$, $\|g_t(\theta_t)\|_2 \le r_{\max}+2R$.

###### Lemma 6.