# Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples

Tengyu Xu
Department of Electrical and Computer Engineering
The Ohio State University
xu.3260@osu.edu
Shaofeng Zou
Department of Electrical Engineering
University at Buffalo, The State University of New York
szou3@buffalo.edu
Yingbin Liang
Department of Electrical and Computer Engineering
The Ohio State University
liang.889@osu.edu
###### Abstract

Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under independently and identically distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. We show that two time-scale TDC can converge as fast as O(1/t^{2/3}) under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwise diminishing stepsize, and show that it asymptotically converges with an arbitrarily small error at a blockwise linear convergence rate. Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.

## 1 Introduction

In practice, it is very common that we wish to learn the value function of a target policy based on data sampled by a different behavior policy, in order to make maximum use of the data available. For such off-policy scenarios, it has been shown that conventional temporal difference (TD) algorithms Sutton (1988); Sutton and Barto (2018) and Q-learning Watkins and Dayan (1992) may diverge to infinity when using linear function approximation Baird (1995). To overcome the divergence issue in off-policy TD learning, Sutton et al. (2008, 2009); Maei (2011) proposed a family of gradient-based TD (GTD) algorithms, which were shown to have guaranteed convergence in off-policy settings and to be more flexible than on-policy learning in practice Maei (2018); Silver et al. (2014). Among those GTD algorithms, the TD with gradient correction (TDC) algorithm has been verified to have superior performance Maei (2011); Dann et al. (2014) and is widely used in practice. To elaborate, TDC uses the mean squared projected Bellman error as the objective function, and iteratively updates the function approximation parameter with the assistance of an auxiliary parameter that is also iteratively updated. These two parameters are typically updated with stepsizes diminishing at different rates, resulting in the two time-scale implementation of TDC, i.e., the function approximation parameter is updated at a slower time-scale and the auxiliary parameter is updated at a faster time-scale.

The convergence of two time-scale TDC and of general two time-scale stochastic approximation (SA) has been well studied. The asymptotic convergence has been shown in Borkar (2009); Borkar and Pattathil (2018) for two time-scale SA, and in Sutton et al. (2009) for two time-scale TDC, where both studies assume that the data are sampled in an independently and identically distributed (i.i.d.) manner. Under non-i.i.d. observed samples, the asymptotic convergence of general two time-scale SA and of TDC was established in Karmakar and Bhatnagar (2017); Yu (2017).

None of the above studies characterized how fast the two time-scale algorithms converge, i.e., they did not establish the non-asymptotic convergence rate, which is especially important for a two time-scale algorithm. In order for two time-scale TDC to perform well, it is important to properly choose the relative scaling rate of the stepsizes for the two time-scale iterations. In practice, this can be done by fixing one stepsize and treating the other stepsize as a tuning hyper-parameter Dann et al. (2014), which is very costly. The non-asymptotic convergence rate by nature captures how the scaling of the two stepsizes affects the performance, and hence can serve as a guide for choosing the two time-scale stepsizes in practice. Recently, Dalal et al. (2018b) established the non-asymptotic convergence rate for projected two time-scale TDC with i.i.d. samples under diminishing stepsize.

• One important open problem that still needs to be addressed is to characterize the non-asymptotic convergence rate for two time-scale TDC under non-i.i.d. samples and diminishing stepsizes, and to explore what such a result suggests for designing the stepsizes of the fast and slow time-scales accordingly. The existing method developed in Dalal et al. (2018b), which handles the non-asymptotic analysis of TDC with i.i.d. samples, does not admit a direct extension to the non-i.i.d. setting. Thus, new technical developments are necessary to solve this problem.

Furthermore, although diminishing stepsize offers accurate convergence, constant stepsize is often preferred in practice due to its much faster error decay (i.e., convergence) rate. For example, empirical results have shown that for one time-scale conventional TD, constant stepsize not only yields fast convergence, but also results in comparable convergence accuracy as diminishing stepsize Dann et al. (2014). However, for two time-scale TDC, our experiments (see Section 4.2) demonstrate that constant stepsize, although it yields faster convergence, incurs a much larger convergence error than diminishing stepsize. This motivates us to address the following two open issues.

• It is important to theoretically understand/explain why constant stepsize yields a large convergence error for two time-scale TDC. The existing non-asymptotic analysis for two time-scale TDC Dalal et al. (2018b) focused only on diminishing stepsize, and does not characterize the convergence rate of two time-scale TDC under constant stepsize.

• For two time-scale TDC, given the fact that constant stepsize yields a large convergence error but converges fast, whereas diminishing stepsize has a small convergence error but converges slowly, it is desirable to design a new update scheme for TDC that converges faster than under diminishing stepsize, while attaining a convergence error as small as that under diminishing stepsize.

In this paper, we comprehensively address the above issues.

### 1.1 Our Contribution

Our main contributions are summarized as follows.

We develop a novel non-asymptotic analysis for two time-scale TDC with a single sample path and under non-i.i.d. data. We show that under diminishing stepsizes α_t ∝ 1/t^σ and β_t ∝ 1/t^ν, chosen respectively for the slow and fast time-scales (where σ and ν are positive constants with ν < σ ≤ 1), the convergence rate can be as fast as O(1/t^{2/3}), which is achieved by σ = 1 and ν = 2/3. This recovers the convergence rate in Dalal et al. (2018b) for i.i.d. data as a special case, up to a logarithmic factor due to the non-i.i.d. data.

We also develop a non-asymptotic analysis for TDC under non-i.i.d. data and constant stepsize. In contrast to conventional one time-scale analysis, our result shows that the training error (at the slow time-scale) and the tracking error (at the fast time-scale) converge at different rates (due to different condition numbers), though both converge linearly to a neighborhood of the solution. Our result also characterizes the impact of the tracking error on the training error, and suggests that TDC under constant stepsize can converge faster than under diminishing stepsize, at the cost of a large training error caused by a large tracking error of the auxiliary parameter iteration in TDC.

We take a further step and propose a TDC algorithm under a blockwise diminishing stepsize, inspired by Yang et al. (2018) in conventional optimization, in which both stepsizes are constant within each block and decay across blocks. We show that TDC asymptotically converges with an arbitrarily small training error at a blockwise linear convergence rate, as long as the block lengths and the decay of the stepsizes across blocks are chosen properly. Our experiments demonstrate that TDC under blockwise diminishing stepsize converges as fast as vanilla TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.

From the technical standpoint, our proof develops new tools to handle the non-asymptotic analysis of the bias due to non-i.i.d. data for two time-scale algorithms under diminishing stepsizes that are not square summable, to bound the impact of the fast-time-scale tracking error on the slow-time-scale training error, and to recursively refine the error bound in order to sharpen the convergence rate.

### 1.2 Related Work

Due to the extensive studies on TD learning, we include here only the work most relevant to this paper.

On-policy TD and SA. The convergence of TD learning with linear function approximation under i.i.d. samples has been well established by using standard results in SA Borkar and Meyn (2000). The non-asymptotic convergence has been established in Borkar (2009); Kamal (2010); Thoppe and Borkar (2019) for general SA algorithms with martingale difference noise, and in Dalal et al. (2018a) for TD with i.i.d. samples. For the Markovian setting, the asymptotic convergence of TD(λ) has been established in Tsitsiklis and Van Roy (1997); Tadić (2001), and the non-asymptotic convergence has been provided for projected TD(λ) in Bhandari et al. (2018) and for linear SA with Markovian noise in Karmakar and Bhatnagar (2016); Ramaswamy and Bhatnagar (2018); Srikant and Ying (2019).

Off-policy one time-scale GTD. The convergence of one time-scale GTD and GTD2 (which are off-policy TD algorithms) was derived by applying standard results in SA Sutton et al. (2008, 2009); Maei (2011). The non-asymptotic analysis for GTD and GTD2 was conducted in Liu et al. (2015) by converting the objective function into a convex-concave saddle-point problem, and was further generalized to the Markovian setting in Wang et al. (2017). However, such an approach cannot be generalized to analyze the two time-scale TDC that we study here, because TDC does not have an explicit saddle-point representation.

Off-policy two time-scale TDC and SA. The asymptotic convergence of two time-scale TDC under i.i.d. samples has been established in Sutton et al. (2009); Maei (2011), and the non-asymptotic analysis has been provided in Dalal et al. (2018b) as a special case of two time-scale linear SA. Under the Markovian setting, the convergence of various two time-scale GTD algorithms has been studied in Yu (2017). The non-asymptotic analysis of two time-scale TDC under non-i.i.d. data has not been studied before, and is the focus of this paper.

General two time-scale SA has also been studied. The convergence of two time-scale SA with martingale difference noise was established in Borkar (2009), and its non-asymptotic convergence was provided in Konda et al. (2004); Mokkadem and Pelletier (2006); Dalal et al. (2018b); Borkar and Pattathil (2018). Some of these results can be applied to two time-scale TDC under i.i.d. samples (which fits into the special case of SA with martingale difference noise), but not to the non-i.i.d. setting. For two time-scale linear SA with more general Markovian noise, only asymptotic convergence was established in Tadic (2004); Yaji and Bhatnagar (2016); Karmakar and Bhatnagar (2017). In fact, our non-asymptotic analysis for two time-scale TDC can be of independent interest, and can be further generalized to study linear SA with more general Markovian noise.

## 2 Problem Formulation

### 2.1 Off-policy Value Function Evaluation

We consider the problem of policy evaluation for a Markov decision process (MDP) (𝒮, 𝒜, P, r, γ), where 𝒮 is a compact state space, 𝒜 is a finite action set, P is the transition kernel, r is the reward function bounded by r_max, and γ ∈ (0, 1) is the discount factor. A stationary policy π maps a state s ∈ 𝒮 to a probability distribution π(·|s) over 𝒜. At time-step t, suppose the process is in some state s_t. Then an action a_t is taken based on the distribution π(·|s_t), the system transitions to a next state s_{t+1} governed by the transition kernel P(·|s_t, a_t), and a reward r(s_t, a_t, s_{t+1}) is received. Assuming the associated Markov chain is ergodic, let μ_π be the induced stationary distribution of this MDP. The value function for policy π is defined as v^π(s) = E[∑_{t≥0} γ^t r(s_t, a_t, s_{t+1}) | s_0 = s, π], and it is known that v^π is the unique fixed point of the Bellman operator T^π, i.e., v^π = T^π v^π, where (T^π v)(s) = r^π(s) + γE[v(s′)|s] and r^π(s) is the expected reward of the Markov chain induced by policy π.
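The fixed-point property above can be checked directly on a small finite chain. The following sketch uses a hypothetical two-state chain with illustrative numbers only; it solves the linear Bellman equation v = r^π + γP^π v in closed form and verifies that the solution is the fixed point of T^π.

```python
import numpy as np

# Hypothetical 2-state chain induced by a policy pi (numbers are illustrative).
gamma = 0.9
P_pi = np.array([[0.8, 0.2],      # state-to-state transition kernel under pi
                 [0.4, 0.6]])
r_pi = np.array([1.0, 0.0])       # expected one-step reward under pi

# v = r_pi + gamma * P_pi v  <=>  (I - gamma * P_pi) v = r_pi
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Verify the fixed-point property v = T_pi v.
assert np.allclose(r_pi + gamma * P_pi @ v, v)
```

With linear function approximation (introduced below), this exact solve is unavailable, which is what motivates the sampled, incremental updates.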

We consider the policy evaluation problem in the off-policy setting. Namely, a sample path {(s_t, a_t, s_{t+1})}_{t≥0} is generated by the Markov chain according to the behavior policy π_b, but our goal is to obtain the value function of a target policy π, which is different from π_b.

### 2.2 Two Time-Scale TDC

When 𝒮 is large or infinite, a linear function v̂(s, θ) = ϕ(s)^⊤θ is often used to approximate the value function, where ϕ(s) ∈ ℝ^d is a fixed feature vector for state s and θ ∈ ℝ^d is a parameter vector. We can also write the linear approximation in the vector form as v̂(θ) = Φθ, where Φ is the |𝒮| × d feature matrix. To find a parameter θ with v̂(θ) ≈ v^π, the gradient-based TD algorithm TDC Sutton et al. (2009) updates the parameter by minimizing the mean-square projected Bellman error (MSPBE) objective, defined as

 J(θ) = E_{μ_{π_b}}[(v̂(s,θ) − Π T^π v̂(s,θ))²],

where Π is the orthogonal projection onto the linear function space {v̂(θ) : θ ∈ ℝ^d}, and Ξ denotes the diagonal matrix with the components of μ_{π_b} as its diagonal entries. Then, we define the matrices A, B, C and the vector b as

 A ≔ E_{μ_{π_b}}[ρ(s,a)ϕ(s)(γϕ(s′) − ϕ(s))^⊤],  B ≔ −γE_{μ_{π_b}}[ρ(s,a)ϕ(s′)ϕ(s)^⊤],
 C ≔ −E_{μ_{π_b}}[ϕ(s)ϕ(s)^⊤],  b ≔ E_{μ_{π_b}}[ρ(s,a)r(s,a,s′)ϕ(s)],

where ρ(s,a) = π(a|s)/π_b(a|s) is the importance weighting factor, with ρ_max being its maximum value. If A and C are both non-singular, then J(θ) is strongly convex and has θ* = −A^{−1}b as its global minimum, i.e., J(θ*) = 0. Motivated by minimizing the MSPBE objective function using stochastic gradient methods, TDC was proposed with the following update rules:

 θ_{t+1} = Π_{R_θ}(θ_t + α_t(A_t θ_t + b_t + B_t w_t)),  (1)
 w_{t+1} = Π_{R_w}(w_t + β_t(A_t θ_t + b_t + C_t w_t)),  (2)

where A_t, B_t, C_t and b_t are the sample-based estimates of A, B, C and b formed from the single transition (s_t, a_t, s_{t+1}), and Π_R(x) = argmin_{x′: ∥x′∥₂ ≤ R} ∥x − x′∥₂ is the projection operator onto a norm ball of radius R. The projection step is widely used in the stochastic approximation literature. As we will show later, iterations (1)-(2) are guaranteed to converge to the optimal parameter θ* if we choose the values of R_θ and R_w appropriately. TDC with the update rules (1)-(2) is a two time-scale algorithm. The parameter θ_t iterates at a slow time-scale determined by the stepsize α_t, whereas w_t iterates at a fast time-scale determined by the stepsize β_t. Throughout the paper, we make the following standard assumptions Bhandari et al. (2018); Wang et al. (2017); Maei (2011).
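For concreteness, a minimal sketch of one TDC update from a single transition is given below, using the sampled identities A_tθ_t + b_t = ρ_t δ_t ϕ(s_t) (with δ_t the off-policy TD error), B_t w_t = −γρ_t(ϕ(s_t)^⊤w_t)ϕ(s_{t+1}), and C_t w_t = −(ϕ(s_t)^⊤w_t)ϕ(s_t) implied by the definitions of A, B, C and b above. The projection radii and stepsizes passed in are hypothetical placeholders.

```python
import numpy as np

def project(x, radius):
    """Project x onto the Euclidean ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def tdc_step(theta, w, phi_s, phi_next, reward, rho, gamma, alpha, beta,
             r_theta=100.0, r_w=100.0):
    """One two time-scale TDC update (1)-(2) from a single transition.

    phi_s, phi_next: feature vectors of the current / next state;
    rho: importance weight pi(a|s) / pi_b(a|s).
    """
    # Off-policy TD error: A_t theta + b_t = rho * delta * phi(s).
    delta = reward + gamma * phi_next @ theta - phi_s @ theta
    # Slow time-scale update (1): TD step plus gradient correction B_t w_t.
    theta_new = theta + alpha * (rho * delta * phi_s
                                 - gamma * rho * (phi_s @ w) * phi_next)
    # Fast time-scale update (2): w tracks the solution of A theta + b + C w = 0.
    w_new = w + beta * (rho * delta * phi_s - (phi_s @ w) * phi_s)
    return project(theta_new, r_theta), project(w_new, r_w)
```

Since β_t is larger than α_t, the auxiliary iterate w_t equilibrates quickly relative to the drift of θ_t, which is exactly the two time-scale structure analyzed below.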

###### Assumption 1 (Problem solvability).

The matrices A and C are non-singular.

###### Assumption 2 (Bounded feature).

∥ϕ(s)∥₂ ≤ 1 for all s ∈ 𝒮, and ρ_max < ∞.

###### Assumption 3 (Geometric ergodicity).

There exist constants m > 0 and ρ ∈ (0, 1) such that

 sup_{s∈𝒮} d_TV(P(s_t ∈ · | s_0 = s), μ_{π_b}) ≤ m ρ^t,  ∀t ≥ 0,

where d_TV(P, Q) denotes the total-variation distance between the probability measures P and Q.

In Assumption 1, the matrix A is required to be non-singular so that the optimal parameter θ* = −A^{−1}b is well defined. The matrix C is non-singular when the feature matrix Φ has linearly independent columns. Assumption 2 can be ensured by normalizing the basis functions, and ρ_max is finite when π_b(a|s) is non-degenerate for all (s,a). Assumption 3 holds for any time-homogeneous Markov chain with finite state-space and any uniformly ergodic Markov chain with general state space. Throughout the paper, we require the projection radii to satisfy R_θ ≥ ∥θ*∥₂ and R_w to be correspondingly large. In practice, we can estimate R_θ and R_w as mentioned in Bhandari et al. (2018), or simply let R_θ and R_w be large enough.
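For a finite chain, Assumption 3 is easy to check numerically: compute the stationary distribution and verify that the worst-case total-variation distance to it shrinks with t. The 3-state chain below is a hypothetical example used purely for illustration.

```python
import numpy as np

# A small irreducible, aperiodic chain (hypothetical transition matrix).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu = mu / mu.sum()

def tv_from_worst_start(P, mu, t):
    """sup_s d_TV( P(s_t in . | s_0 = s), mu ) for a finite chain."""
    Pt = np.linalg.matrix_power(P, t)
    return 0.5 * np.max(np.abs(Pt - mu).sum(axis=1))

# The worst-case TV distance is non-increasing and decays geometrically.
dists = [tv_from_worst_start(P, mu, t) for t in range(1, 8)]
```

The geometric envelope m ρ^t in Assumption 3 is what makes the mixing-time arguments in the non-i.i.d. analysis go through.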

## 3 Main Theorems

### 3.1 Non-asymptotic Analysis under Diminishing Stepsize

Our first main result is the convergence rate of two time-scale TDC with diminishing stepsize. We define the tracking error z_t ≔ w_t − ψ(θ_t), where ψ(θ) ≔ −C^{−1}(Aθ + b) is the stationary point of the ODE ẇ(t) = Aθ + b + Cw(t) with θ being fixed. Let ϵ and ϵ′ be arbitrarily small positive constants.

###### Theorem 1.

Consider the projected two time-scale TDC algorithm in (1)-(2). Suppose Assumptions 1-3 hold. Suppose we apply diminishing stepsizes α_t = c_α/(1+t)^σ and β_t = c_β/(1+t)^ν, which satisfy 0 < ν < σ ≤ 1, and let ϵ and ϵ′ be arbitrarily small positive constants. Then we have for all t:

 E∥θ_t − θ*∥₂² ≤ O(e^{−Ω(t^{1−σ})}) + O(log t / t^σ) + O(h(σ,ν)),  (3)
 E∥z_t∥₂² ≤ O(log t / t^ν) + O(h(σ,ν)),  (4)

where

 h(σ,ν) = 1/t^ν if σ > 1.5ν,  and  h(σ,ν) = 1/t^{2(σ−ν)−ϵ} if ν < σ ≤ 1.5ν.  (5)

If σ = 1, with α_t = c_α/(1+t) and β_t = c_β/(1+t)^ν, we have for all t:

 E∥θ_t − θ*∥₂² ≤ O((log t)²/t) + O(log t/t^ν + h(1,ν))^{1−ϵ′}.  (6)

For the explicit expressions of the bounds in (3), (4) and (6), please refer to Appendix A.

We further explain Theorem 1 as follows. (a) In (3) and (5), since both ϵ and ϵ′ can be arbitrarily small, the convergence of E∥θ_t − θ*∥₂² can be almost as fast as O(1/t^ν) when σ > 1.5ν, and O(1/t^{2(σ−ν)}) when ν < σ ≤ 1.5ν. The best convergence rate is then almost as fast as O(1/t^{2/3}), achieved with σ = 1 and ν = 2/3. (b) If the data are generated i.i.d., then our bound improves by the logarithmic factor, and the best convergence rate matches that given in Dalal et al. (2018b), again with σ = 1 and ν = 2/3.

Theorem 1 characterizes the relationship between the convergence rate of E∥θ_t − θ*∥₂² and the stepsizes α_t and β_t. The first term of the bound in (3) corresponds to the convergence rate of the iterate with the full gradient, which decays exponentially with t. The second term is introduced by the bias and variance of the gradient estimator, which decay sublinearly with t. The last term arises due to the accumulated tracking error z_t, which specifically arises in two time-scale algorithms and captures how accurately w_t tracks ψ(θ_t). Thus, if w_t tracked the stationary point ψ(θ_t) perfectly in each step, only the first two terms in (3) would remain, which matches the results for one time-scale TD learning Bhandari et al. (2018); Dalal et al. (2018a). Theorem 1 indicates that asymptotically, (3) is dominated by the tracking error term O(h(σ,ν)), which depends on the diminishing rates of α_t and β_t. Since both ϵ and ϵ′ can be arbitrarily small, if the diminishing rate of β_t is close to that of α_t, then the tracking error is dominated by the slow drift of θ_t, which has an approximate order 1/t^{2(σ−ν)}; if the diminishing rate of β_t is much faster than that of α_t, then the tracking error is dominated by the accumulated bias, which has an approximate order 1/t^ν. Moreover, (5) and (6) suggest that for any fixed σ, the optimal diminishing rate of β_t is achieved by ν = 2σ/3.
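The trade-off in (5) can be verified numerically: the tracking-error exponent is ν on the branch σ > 1.5ν and 2(σ − ν) otherwise (ignoring the arbitrarily small ϵ), and maximizing it over ν recovers the choice ν = 2σ/3. A small sketch, with the polynomial stepsize forms assumed above:

```python
import numpy as np

def stepsizes(t, c_alpha=1.0, c_beta=1.0, sigma=1.0, nu=2.0 / 3.0):
    """Diminishing two time-scale stepsizes: slow alpha_t, fast beta_t."""
    return c_alpha / (1.0 + t) ** sigma, c_beta / (1.0 + t) ** nu

def tracking_rate(sigma, nu):
    """Exponent of the dominant tracking-error term h(sigma, nu) in (5),
    ignoring the arbitrarily small epsilon."""
    if sigma > 1.5 * nu:
        return nu                 # h = O(1/t^nu)
    return 2.0 * (sigma - nu)     # h = O(1/t^{2(sigma - nu)})

# For sigma = 1, the exponent is maximized at nu = 2*sigma/3 = 2/3,
# giving the O(1/t^{2/3}) rate discussed above.
best_nu = max(np.linspace(0.05, 0.95, 181),
              key=lambda nu: tracking_rate(1.0, nu))
```

The two branches meet exactly at ν = 2σ/3, where both give the exponent 2σ/3, which is why that choice balances slow drift against accumulated bias.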

From the technical standpoint, we develop novel techniques to handle the interaction between the training error and the tracking error, and to sharpen the error bounds recursively. The proof sketch and the detailed steps are provided in Appendix A.

### 3.2 Non-asymptotic Analysis under Constant Stepsize

As we remarked in Section 1, it has been demonstrated by empirical results Dann et al. (2014) that standard TD under constant stepsize not only converges fast, but also has a training error comparable to that under diminishing stepsize. However, this does not hold for TDC. When the two variables in TDC are both updated under constant stepsizes, our experiments demonstrate that constant stepsize yields fast convergence, but a large training error. In this subsection, we aim to explain why this happens by analyzing the convergence rates of the two variables in TDC, and the impact of one on the other.

The following theorem provides the convergence result for TDC with the two variables iteratively updated respectively by two different constant stepsizes.

###### Theorem 2.

Consider the projected TDC algorithm in (1)-(2). Suppose Assumptions 1-3 hold. Suppose we apply constant stepsizes α_t ≡ α and β_t ≡ β, chosen sufficiently small with α < β. We then have for all t ≥ 0:

 E∥θ_t − θ*∥₂² ≤ (1 − |λ_θ|α)^t (∥θ_0 − θ*∥₂² + C₁) + C₂ max{α, α ln(1/α)} + (C₃ max{β, β ln(1/β)} + C₄η)^{0.5},  (7)
 E∥z_t∥₂² ≤ (1 − |λ_w|β)^t ∥z_0∥₂² + C₅ max{β, β ln(1/β)} + C₆η,  (8)

where η ≔ α/β, λ_θ and λ_w are negative constants determined by the problem parameters, and C₁, C₂, C₃, C₄, C₅ and C₆ are positive constants independent of α and β. For explicit expressions of λ_θ, λ_w, and C₁–C₆, please refer to (67), (68), (69), (B), and (60) in the Supplementary Materials.

Theorem 2 shows that TDC with constant stepsize converges to a neighborhood of θ* exponentially fast. The size of the neighborhood depends on the second and third terms of the bound in (7), which arise from the bias and variance of the update of θ_t and from the tracking error in (8), respectively. The convergence of E∥z_t∥₂², although also exponentially fast to a neighborhood, occurs at a different rate due to a different condition number. We further note that as the stepsize parameters α and β approach 0 in a way such that η = α/β also approaches 0, E∥θ_t − θ*∥₂² approaches 0 as t → ∞, which matches the asymptotic convergence result for two time-scale TDC under constant stepsize in Yu (2017).

Diminishing vs constant stepsize: We next compare TDC under diminishing stepsize and under constant stepsize. Generally, Theorem 1 suggests that diminishing stepsize yields a better convergence guarantee (i.e., exact convergence to θ*) than constant stepsize in Theorem 2 (i.e., convergence to a neighborhood of θ*). In practice, constant stepsize is often preferred because diminishing stepsize may take much longer to converge. However, as Figure 2 in Section 4.2 shows, although TDC with a large constant stepsize converges fast, the training error caused by convergence only to a neighborhood is significantly worse than under diminishing stepsize. More specifically, when β is fixed, as α grows, the convergence becomes faster, but as a consequence the term C₄η due to the tracking error increases and results in a large training error. Alternatively, if α is made small enough that the training error becomes comparable to that under diminishing stepsize, then the convergence becomes very slow. This suggests that simply setting the stepsizes of TDC to be constant does not yield the desired performance, and motivates us to design an update scheme for TDC that enjoys an error convergence rate as fast as constant stepsize offers, while retaining accuracy comparable to that of diminishing stepsize.

### 3.3 TDC under Blockwise Diminishing Stepsize

In this subsection, we propose a blockwise diminishing stepsize scheme for TDC (see Algorithm 1), and study its theoretical convergence guarantee.

The idea of Algorithm 1 is to divide the iteration process into blocks and diminish the stepsizes blockwise, while keeping the stepsizes constant within each block. In this way, within each block the error of TDC decays fast due to the constant stepsizes, and an accurate solution is still achieved due to the blockwise decay of the stepsizes, as we demonstrate in Section 4. More specifically, the constant stepsizes α_s and β_s for block s are chosen to decay geometrically, so that the tracking error and the accumulated variance and bias become asymptotically small; and the block length T_s increases geometrically across blocks, so that the training error decreases geometrically blockwise. We note that the design of the algorithm is inspired by the method proposed in Yang et al. (2018) for conventional optimization problems.
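The block structure just described can be sketched as follows. All numerical constants here are hypothetical placeholders; Algorithm 1 sets the initial stepsizes, block length, and decay/growth factors from problem-dependent quantities.

```python
def blockwise_schedule(num_blocks, alpha0=0.1, beta0=0.5, t0=100,
                       decay=0.5, growth=2.0):
    """Blockwise diminishing stepsizes: constant within each block,
    decaying geometrically across blocks, with geometrically growing
    block lengths. Returns a list of (block_length, alpha_s, beta_s)."""
    schedule = []
    alpha, beta, length = alpha0, beta0, float(t0)
    for _ in range(num_blocks):
        schedule.append((int(length), alpha, beta))
        alpha *= decay     # stepsizes shrink geometrically across blocks
        beta *= decay
        length *= growth   # block lengths grow geometrically
    return schedule
```

Within block s, the iterates run the constant-stepsize updates (1)-(2) with (α_s, β_s); across blocks, the geometric decay drives the residual error of Theorem 2 toward zero.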

The following theorem characterizes the convergence of Algorithm 1.

###### Theorem 3.

Consider the projected TDC algorithm with blockwise diminishing stepsize as in Algorithm 1. Suppose Assumptions 1-3 hold. Suppose the blockwise stepsizes α_s, β_s and block lengths T_s are chosen as in Algorithm 1, where the two constants involved are independent of the target accuracy ϵ (see (C) and (75) in the Supplementary Materials for their explicit expressions). Then, after S blocks, we have

 E∥θS−θ∗∥22 ≤ϵ.

The total sample complexity, characterized explicitly in the Supplementary Materials, holds up to an arbitrarily small constant in the exponent.

Theorem 3 indicates that the sample complexity of TDC under blockwise diminishing stepsize is slightly better than that under diminishing stepsize. Our empirical results (see Section 4.3) also demonstrate that blockwise diminishing stepsize yields convergence as fast as constant stepsize and a training error comparable to that of diminishing stepsize. However, we point out that the advantage of blockwise diminishing stepsize does not come for free, but rather at the cost of some extra parameter tuning in practice to estimate the quantities used by Algorithm 1; whereas the diminishing stepsize scheme guided by our Theorem 1 requires tuning at most three parameters to obtain desirable performance.

## 4 Experimental Results

In this section, we provide numerical experiments to verify our theoretical results and the efficiency of Algorithm 1. More precisely, we consider Garnet problems Archibald et al. (1995), denoted as G(n_s, n_a, b, d), where n_s denotes the number of states, n_a denotes the number of actions, b denotes the number of possible next states for each state-action pair, and d denotes the number of features. The reward is state-dependent, and both the rewards and the feature vectors are generated randomly. The discount factor is fixed in all experiments, and we consider one such Garnet problem throughout. All plots report the evolution of the mean squared error E∥θ_t − θ*∥₂² over independent runs.
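A Garnet instance can be generated as sketched below. This is a common variant of the construction of Archibald et al. (1995); the exact generation details used in the experiments (e.g., the reward and feature distributions) are illustrative assumptions here.

```python
import numpy as np

def garnet(n_s, n_a, branching, num_features, seed=None):
    """Generate a random Garnet MDP: each (state, action) pair transitions
    to `branching` uniformly chosen next states with random probabilities;
    rewards are state-dependent. Feature construction is one simple choice."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_s, n_a, n_s))
    for s in range(n_s):
        for a in range(n_a):
            next_states = rng.choice(n_s, size=branching, replace=False)
            # Random transition probabilities over the chosen successors,
            # obtained by cutting [0, 1] at branching - 1 uniform points.
            cuts = np.sort(rng.uniform(size=branching - 1))
            probs = np.diff(np.concatenate(([0.0], cuts, [1.0])))
            P[s, a, next_states] = probs
        # State-dependent reward and random features, drawn uniformly.
    reward = rng.uniform(size=n_s)
    features = rng.uniform(size=(n_s, num_features))
    return P, reward, features
```

Running TDC on such an instance only requires sampling transitions from P under the behavior policy and reweighting them by ρ(s, a).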

### 4.1 Optimal Diminishing Stepsize

In this subsection, we provide numerical results to verify Theorem 1. We compare the performance of TDC updates with the same σ but different ν. We consider four different diminishing-stepsize settings, each with a fixed slow time-scale parameter σ; for each case, the fast time-scale stepsize β_t is run with several decay rates ν. Our results are reported in Figure 1, in which for each case the left figure reports the overall iteration process and the right figure reports the corresponding zoomed tail of the last 100000 iterations. It can be seen that in all cases, TDC iterations with the same slow time-scale stepsize share similar error decay rates (see the left plots), and the difference among the fast time-scale parameters ν is reflected in the behavior of the error convergence tails (see the right plots). We observe that ν = 2σ/3 yields the best error decay rate. This corroborates Theorem 1, which shows that the fast time-scale stepsize parameter ν affects only the tracking error term in (3), which dominates the error decay rate asymptotically.

### 4.2 Constant Stepsize vs Diminishing Stepsize

In this subsection, we compare the error decay of TDC under diminishing stepsize with that of TDC under four different constant stepsizes. For diminishing stepsize, we set σ and ν as suggested by Theorem 1 and tune the stepsize constants to the best. For the four constant-stepsize cases, we fix one stepsize in each case and tune the other to the best. The results are reported in Figure 2, in which for both the training and tracking errors, the left plot illustrates the overall iteration process and the right plot illustrates the corresponding zoomed error tails. The results suggest that although large constant stepsizes yield initially faster convergence than diminishing stepsize, they eventually oscillate around a large neighborhood of θ* due to the large tracking error. Small constant stepsizes can attain almost the same asymptotic accuracy as diminishing stepsize, but converge very slowly. We also observe a strong correlation between the training and tracking errors under constant stepsize, i.e., a larger training error corresponds to a larger tracking error, which corroborates Theorem 2 and suggests that the accuracy of TDC heavily depends on the decay of the tracking error z_t.

### 4.3 Blockwise Diminishing Stepsize

In this subsection, we compare the error decay of TDC under blockwise diminishing stepsize with that of TDC under diminishing stepsize and under constant stepsize, using the best-tuned parameter settings from Section 4.2 for the latter two algorithms. We report our results in Figure 3. It can be seen that TDC under blockwise diminishing stepsize converges faster than under diminishing stepsize and almost as fast as under constant stepsize. Furthermore, TDC under blockwise diminishing stepsize also has a training error comparable to that under diminishing stepsize: since the stepsizes decrease geometrically blockwise, the algorithm approaches a very small neighborhood of θ* in the later blocks. We can also observe that the tracking error under blockwise diminishing stepsize decreases rapidly blockwise.

## 5 Conclusion

In this work, we provided the first non-asymptotic analysis for the two time-scale TDC algorithm over a Markovian sample path. We developed a novel technique to handle the accumulated tracking error caused by the two time-scale update, with which we characterized the non-asymptotic convergence rate under general diminishing stepsize and under constant stepsize. We also proposed a blockwise diminishing stepsize scheme for TDC and proved its convergence. Our experiments demonstrated the performance advantage of such an algorithm over both the diminishing- and constant-stepsize TDC algorithms. Our techniques for the non-asymptotic analysis of two time-scale algorithms can also be applied to study other off-policy algorithms such as actor-critic Maei (2018) and gradient Q-learning Maei and Sutton (2010).

## References

• [1] T. Archibald, K. McKinnon, and L. Thomas (1995) On the generation of Markov decision processes. Journal of the Operational Research Society 46 (3), pp. 354–361.
• [2] L. Baird (1995) Residual algorithms: reinforcement learning with function approximation. In Machine Learning Proceedings, pp. 30–37.
• [3] J. Bhandari, D. Russo, and R. Singal (2018) A finite time analysis of temporal difference learning with linear function approximation. In Proc. Conference on Learning Theory (COLT), pp. 1691–1692.
• [4] V. S. Borkar and S. P. Meyn (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38 (2), pp. 447–469.
• [5] V. S. Borkar and S. Pattathil (2018) Concentration bounds for two time scale stochastic approximation. In Proc. Allerton Conference on Communication, Control, and Computing (Allerton), pp. 504–511.
• [6] V. S. Borkar (2009) Stochastic approximation: a dynamical systems viewpoint. Vol. 48, Springer.
• [7] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor (2018a) Finite sample analyses for TD(0) with function approximation. In Proc. AAAI Conference on Artificial Intelligence.
• [8] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor (2018b) Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Proc. Conference on Learning Theory (COLT).
• [9] C. Dann, G. Neumann, and J. Peters (2014) Policy evaluation with temporal differences: a survey and comparison. The Journal of Machine Learning Research 15 (1), pp. 809–883.
• [10] S. Kamal (2010) On the convergence, lock-in probability and sample complexity of stochastic approximation. SIAM Journal on Control and Optimization 48 (8), pp. 5178–5192.
• [11] P. Karmakar and S. Bhatnagar (2016) Dynamics of stochastic approximation with Markov iterate-dependent noise with the stability of the iterates not ensured. arXiv preprint arXiv:1601.02217.
• [12] P. Karmakar and S. Bhatnagar (2017) Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research 43 (1), pp. 130–151.
• [13] V. R. Konda and J. N. Tsitsiklis (2004) Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability 14 (2), pp. 796–819.
• [14] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik (2015) Finite-sample analysis of proximal gradient TD algorithms. In Proc. Uncertainty in Artificial Intelligence (UAI), pp. 504–513.
• [15] H. R. Maei (2011) Gradient temporal-difference learning algorithms. Ph.D. Thesis, University of Alberta.
• [16] H. R. Maei and R. S. Sutton (2010) GQ(λ): a general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proc. Artificial General Intelligence (AGI).
• [17] H. R. Maei (2018) Convergent actor-critic algorithms under off-policy training and function approximation. arXiv preprint arXiv:1802.07842.
• [18] A. Mokkadem and M. Pelletier (2006) Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. The Annals of Applied Probability 16 (3), pp. 1671–1702.
• [19] R. Srikant and L. Ying (2019) Finite-time error bounds for linear stochastic approximation and TD learning. arXiv preprint arXiv:1902.00923.
• [20] A. Ramaswamy and S. Bhatnagar (2018) Stability of stochastic approximations with 'controlled Markov' noise and temporal difference learning. IEEE Transactions on Automatic Control.
• [21] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In Proc. International Conference on Machine Learning (ICML).
• [22] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press.
• [23] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proc. International Conference on Machine Learning (ICML), pp. 993–1000.
• [24] R. S. Sutton, C. Szepesvári, and H. R. Maei (2008) A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems (NIPS) 21, pp. 1609–1616.
• [25] R. S. Sutton (1988) Learning to predict by the methods of temporal differences. Machine Learning 3 (1), pp. 9–44.
• [26] V. B. Tadić (2004) Almost sure convergence of two time-scale stochastic approximation algorithms. In Proc. American Control Conference, Vol. 4, pp. 3802–3807.
• [27] V. Tadić (2001) On the convergence of temporal-difference learning with linear function approximation. Machine Learning 42 (3), pp. 241–267.
• [28] G. Thoppe and V. Borkar (2019) A concentration bound for stochastic approximation via Alekseev’s formula. Stochastic Systems. Cited by: §1.2.
• [29] J. N. Tsitsiklis and B. Van Roy (1997) Analysis of temporal-diffference learning with function approximation. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 1075–1081. Cited by: §1.2.
• [30] Y. Wang, W. Chen, Y. Liu, Z. Ma, and T. Liu (2017) Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 5504–5513. Cited by: §1.2, §2.2.
• [31] C. J. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292. Cited by: §1.
• [32] V. Yaji and S. Bhatnagar (2016) Stochastic recursive inclusions in two timescales with non-additive iterate dependent Markov noise. arXiv preprint arXiv:1611.05961. Cited by: §1.2.
• [33] T. Yang, Y. Yan, Z. Yuan, and R. Jin (2018) Why does stagewise training accelerate convergence of testing error over SGD?. arXiv preprint arXiv:1812.03934. Cited by: §1.1, §3.3.
• [34] H. Yu (2017) On convergence of some gradient-based temporal-differences algorithms for off-policy learning. arXiv preprint arXiv:1712.09652. Cited by: §1.2, §1, §3.2.

Supplementary Materials

## Appendix A Technical Proofs for TDC under Decreasing Stepsize

We present the proof of Theorem 1 in four subsections. Section A.1 provides the proof sketch. Section A.2 contains the main part of the proof. Section A.3 includes all technical lemmas for the convergence proof of the fast time-scale iteration, and Section A.4 includes all technical lemmas for the convergence proof of the slow time-scale iteration.

### A.1 Proof Sketch of Theorem 1

###### Proof Sketch of Theorem 1.

The proof consists of four steps as we briefly describe here. The details are provided in Appendix A.2.

Step 1. Formulate training and tracking error updates. Instead of investigating the convergence of $\theta_t$ and $w_t$ directly, we substitute $w_t=z_t-C^{-1}(b+A\theta_t)$ into the TDC update (1)-(2) and analyze the update of TDC in terms of $\theta_t$ and the tracking error $z_t=w_t+C^{-1}(b+A\theta_t)$.

Step 2. Derive a preliminary bound on $\mathbb{E}\|z_t\|_2^2$. We decompose the mean square tracking error $\mathbb{E}\|z_t\|_2^2$ into an exponentially decaying term, a variance term, a bias term, and a slow drift term, and bound each term individually. This yields a preliminary upper bound on $\mathbb{E}\|z_t\|_2^2$.

Step 3. Recursively refine the bound on $\mathbb{E}\|z_t\|_2^2$. By recursively substituting the preliminary bound on $\mathbb{E}\|z_t\|_2^2$ into the slow drift term, we obtain a refined decay rate.

Step 4. Derive the bound on the training error. We decompose the training error into an exponentially decaying term, a variance term, a bias term, and a tracking error term, and bound each term individually. We then recursively substitute the decay rate of $\mathbb{E}\|z_t\|_2^2$ obtained in Step 3 into the tracking error term to obtain an upper bound on the training error. Combining all terms yields the final bound in (3). ∎
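The two time-scale interaction underlying Steps 1-4 can be illustrated by a toy deterministic recursion. This is a minimal sketch only: the scalars `A`, `B`, `C`, `b` and the stepsize exponents below are made-up stand-ins, not the paper's setting. The slow iterate `theta` uses stepsize $\alpha_t$, the fast iterate `w` uses the larger stepsize $\beta_t$, and the tracking error $z_t=w_t+C^{-1}(b+A\theta_t)$ vanishes as `w` tracks its target:

```python
# Toy one-dimensional two time-scale recursion (illustrative only).
# C < 0 makes the fast recursion stable; A < 0 makes the slow one stable.
A, B, C, b = -1.0, 0.5, -1.0, 1.0

theta, w = 0.0, 0.0
for t in range(100_000):
    alpha = 1.0 / (1 + t) ** 0.75   # slow time-scale stepsize
    beta = 1.0 / (1 + t) ** 0.5     # fast time-scale stepsize (larger)
    theta, w = (theta + alpha * (A * theta + b + B * w),
                w + beta * (A * theta + b + C * w))

z = w + (b + A * theta) / C          # tracking error z_t
print(theta, z)                      # theta approaches -b/A = 1, z approaches 0
```

In this deterministic toy, the fixed point of the slow recursion is $\theta^\ast=-b/A$, and the fast iterate converges to $-C^{-1}(b+A\theta^\ast)$, so the tracking error goes to zero.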

### A.2 Proof of Theorem 1

We provide the proof of Theorem 1 in the following four steps.

Step 1. Formulation of training error and tracking error updates. We define the tracking error vector $z_t=w_t+C^{-1}(b+A\theta_t)$. By substituting $w_t=z_t-C^{-1}(b+A\theta_t)$ into (1)-(2), we can rewrite the update rule of TDC in terms of $\theta_t$ and $z_t$ as follows:

$$\theta_{t+1}=\Pi_{R_\theta}\big(\theta_t+\alpha_t(f_1(\theta_t,O_t)+g_1(z_t,O_t))\big),\qquad(9)$$
$$z_{t+1}=\Pi_{R_w}\big(z_t+\beta_t(f_2(\theta_t,O_t)+g_2(z_t,O_t))-C^{-1}(b+A\theta_t)\big)+C^{-1}(b+A\theta_{t+1}),\qquad(10)$$

where

$$f_1(\theta_t,O_t)=(A_t-B_tC^{-1}A)\theta_t+(b_t-B_tC^{-1}b),\qquad g_1(z_t,O_t)=B_tz_t,$$
$$f_2(\theta_t,O_t)=(A_t-C_tC^{-1}A)\theta_t+(b_t-C_tC^{-1}b),\qquad g_2(z_t,O_t)=C_tz_t,$$

with $O_t$ denoting the observation at time step $t$. We further define

$$\bar f_1(\theta_t)=(A-BC^{-1}A)\theta_t+(b-BC^{-1}b),\qquad \bar g_1(z_t)=Bz_t,\qquad \bar g_2(z_t)=Cz_t.$$
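As a quick sanity check of the reparametrization in Step 1, here is a minimal scalar sketch (all numbers are arbitrary stand-ins for $A_t,B_t,b_t,A,C,b$, and the "raw" slow-time-scale direction $A_t\theta_t+b_t+B_tw_t$ is inferred from the definitions of $f_1$ and $g_1$, not quoted from the paper): substituting $w_t=z_t-C^{-1}(b+A\theta_t)$ into that direction recovers $f_1+g_1$ exactly.

```python
# Scalar check: A_t*theta + b_t + B_t*w == f1(theta, O_t) + g1(z, O_t)
# when w = z - (b + A*theta)/C.  All constants are made-up illustrations.
A_t, B_t, b_t = 0.3, -0.7, 1.2   # sampled (noisy) quantities
A, C, b = -1.0, -2.0, 0.5        # their steady-state counterparts

theta, z = 0.9, 0.4
w = z - (b + A * theta) / C                        # recover w_t from z_t
raw = A_t * theta + b_t + B_t * w                  # inferred raw direction
f1 = (A_t - B_t * A / C) * theta + (b_t - B_t * b / C)
g1 = B_t * z
print(abs(raw - (f1 + g1)))                        # essentially zero
```

The identity is pure algebra: expanding $B_t\big(z-C^{-1}(b+A\theta)\big)$ produces exactly the $-B_tC^{-1}A\theta$ and $-B_tC^{-1}b$ corrections in $f_1$, plus the $B_tz$ term in $g_1$.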

Step 2. Derive a preliminary bound on $\mathbb{E}\|z_t\|_2^2$. We bound the recursion of the tracking error vector $z_t$ in (10) as follows. For any $t\geq 0$, we derive

$$\begin{aligned}
\|z_{t+1}\|_2^2 &= \big\|\Pi_{R_w}\big(z_t+\beta_t(f_2(\theta_t,O_t)+g_2(z_t,O_t))-C^{-1}(b+A\theta_t)\big)+C^{-1}(b+A\theta_{t+1})\big\|_2^2\\
&= \big\|\Pi_{R_w}\big(z_t+\beta_t(f_2(\theta_t,O_t)+g_2(z_t,O_t))-C^{-1}(b+A\theta_t)\big)+\Pi_{R_w}\big(C^{-1}(b+A\theta_{t+1})\big)\big\|_2^2\\
&\le \big\|z_t+\beta_t(f_2(\theta_t,O_t)+g_2(z_t,O_t))+C^{-1}A(\theta_{t+1}-\theta_t)\big\|_2^2\\
&= \|z_t\|_2^2+2\beta_t\langle f_2(\theta_t,O_t),z_t\rangle+2\beta_t\langle g_2(z_t,O_t),z_t\rangle+2\langle C^{-1}A(\theta_{t+1}-\theta_t),z_t\rangle\\
&\quad+\big\|\beta_tf_2(\theta_t,O_t)+\beta_tg_2(z_t,O_t)+C^{-1}A(\theta_{t+1}-\theta_t)\big\|_2^2\\
&\le \|z_t\|_2^2+2\beta_t\langle \bar g_2(z_t),z_t\rangle+2\beta_t\langle f_2(\theta_t,O_t),z_t\rangle+2\beta_t\langle g_2(z_t,O_t)-\bar g_2(z_t),z_t\rangle\\
&\quad+2\langle C^{-1}A(\theta_{t+1}-\theta_t),z_t\rangle+3\beta_t^2\|f_2(\theta_t,O_t)\|_2^2+3\beta_t^2\|g_2(z_t,O_t)\|_2^2+3\big\|C^{-1}A(\theta_{t+1}-\theta_t)\big\|_2^2\\
&= \|z_t\|_2^2+2\beta_t\langle Cz_t,z_t\rangle+2\beta_t\langle f_2(\theta_t,O_t),z_t\rangle+2\beta_t\langle g_2(z_t,O_t)-\bar g_2(z_t),z_t\rangle\\
&\quad+2\langle C^{-1}A(\theta_{t+1}-\theta_t),z_t\rangle+3\beta_t^2\|f_2(\theta_t,O_t)\|_2^2+3\beta_t^2\|g_2(z_t,O_t)\|_2^2+3\big\|C^{-1}A(\theta_{t+1}-\theta_t)\big\|_2^2\\
&\le (1-\beta_t|\lambda_w|)\|z_t\|_2^2+2\beta_t\zeta_{f_2}(\theta_t,z_t,O_t)+2\beta_t\zeta_{g_2}(z_t,O_t)+2\langle C^{-1}A(\theta_{t+1}-\theta_t),z_t\rangle\\
&\quad+3\beta_t^2K_{f_2}^2+3\beta_t^2K_{g_2}^2+6\alpha_t^2\big\|C^{-1}\big\|_2^2\|A\|_2^2\big(K_{f_1}^2+K_{g_1}^2\big),
\end{aligned}$$
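The final $6\alpha_t^2$ term arises from the non-expansiveness of the projection together with the elementary inequality $(a+b)^2\le 2a^2+2b^2$; a sketch of this step (with $K_{f_1},K_{g_1}$ the uniform bounds on $\|f_1\|_2,\|g_1\|_2$ from the lemmas cited below):

```latex
\|\theta_{t+1}-\theta_t\|_2
  = \big\|\Pi_{R_\theta}\big(\theta_t+\alpha_t(f_1+g_1)\big)-\Pi_{R_\theta}(\theta_t)\big\|_2
  \le \alpha_t\|f_1+g_1\|_2
  \le \alpha_t\,(K_{f_1}+K_{g_1}),

3\big\|C^{-1}A(\theta_{t+1}-\theta_t)\big\|_2^2
  \le 3\alpha_t^2\|C^{-1}\|_2^2\,\|A\|_2^2\,(K_{f_1}+K_{g_1})^2
  \le 6\alpha_t^2\|C^{-1}\|_2^2\,\|A\|_2^2\,\big(K_{f_1}^2+K_{g_1}^2\big).
```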

where $\zeta_{f_2}(\theta_t,z_t,O_t):=\langle f_2(\theta_t,O_t),z_t\rangle$, $\zeta_{g_2}(z_t,O_t):=\langle g_2(z_t,O_t)-\bar g_2(z_t),z_t\rangle$, and $K_{f_1}$, $K_{g_1}$, $K_{f_2}$ and $K_{g_2}$ are positive constants; please refer to Lemmas 12, 13, 2 and 6 for their definitions. Then, taking the expectation over the filtration up to time step $t$ on both sides, we have

 E∥zt+1∥22 ≤(1−βt|λw|)E