Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples
Abstract
Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under independent and identically distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. We show that two time-scale TDC can converge as fast as under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwise diminishing stepsize, and show that it asymptotically converges with an arbitrarily small error at a blockwise linear convergence rate. Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys accuracy comparable to TDC under diminishing stepsize.
1 Introduction
In practice, it is very common that we wish to learn the value function of a target policy based on data sampled by a different behavior policy, in order to make maximum use of the available data. For such off-policy scenarios, it has been shown that conventional temporal difference (TD) algorithms Sutton (1988); Sutton and Barto (2018) and Q-learning Watkins and Dayan (1992) may diverge to infinity when using linear function approximation Baird (1995). To overcome the divergence issue in off-policy TD learning, Sutton et al. (2008, 2009); Maei (2011) proposed a family of gradient-based TD (GTD) algorithms, which were shown to have guaranteed convergence in off-policy settings and are more flexible than on-policy learning in practice Maei (2018); Silver et al. (2014). Among these GTD algorithms, the TD with gradient correction (TDC) algorithm has been verified to have superior performance Maei (2011); Dann et al. (2014) and is widely used in practice. To elaborate, TDC uses the mean squared projected Bellman error as the objective function, and iteratively updates the function approximation parameter with the assistance of an auxiliary parameter that is also iteratively updated. These two parameters are typically updated with stepsizes diminishing at different rates, resulting in the two time-scale implementation of TDC, i.e., the function approximation parameter is updated at a slower time-scale and the auxiliary parameter is updated at a faster time-scale.
The convergence of two time-scale TDC and of general two time-scale stochastic approximation (SA) has been well studied. Asymptotic convergence has been shown in Borkar (2009); Borkar and Pattathil (2018) for two time-scale SA, and in Sutton et al. (2009) for two time-scale TDC, where both lines of work assume that the data are sampled in an independent and identically distributed (i.i.d.) manner. Under non-i.i.d. samples, the asymptotic convergence of general two time-scale SA and of TDC was established in Karmakar and Bhatnagar (2017); Yu (2017).
None of the above studies characterized how fast two time-scale algorithms converge, i.e., they did not establish the non-asymptotic convergence rate, which is especially important for a two time-scale algorithm. In order for two time-scale TDC to perform well, it is important to properly choose the relative scaling of the stepsizes for the two time-scale iterations. In practice, this can be done by fixing one stepsize and treating the other stepsize as a tuning hyperparameter Dann et al. (2014), which is very costly. The non-asymptotic convergence rate by nature captures how the scaling of the two stepsizes affects the performance, and hence can serve as guidance for choosing the two time-scale stepsizes in practice. Recently, Dalal et al. (2018b) established the non-asymptotic convergence rate for projected two time-scale TDC with i.i.d. samples under diminishing stepsize.

One important open problem that still needs to be addressed is to characterize the non-asymptotic convergence rate of two time-scale TDC under non-i.i.d. samples and diminishing stepsizes, and to explore what such a result suggests for designing the stepsizes of the fast and slow time-scales accordingly. The existing method developed in Dalal et al. (2018b), which handles the non-asymptotic analysis of i.i.d.-sampled TDC, does not admit a direct extension to the non-i.i.d. setting. Thus, new technical developments are necessary to solve this problem.
Furthermore, although diminishing stepsize offers accurate convergence, constant stepsize is often preferred in practice due to its much faster error decay (i.e., convergence) rate. For example, empirical results have shown that for one time-scale conventional TD, constant stepsize not only yields fast convergence, but also results in convergence accuracy comparable to diminishing stepsize Dann et al. (2014). However, for two time-scale TDC, our experiments (see Section 4.2) demonstrate that constant stepsize, although it yields faster convergence, incurs a much larger convergence error than diminishing stepsize. This motivates us to address the following two open issues.

It is important to theoretically understand and explain why constant stepsize yields a large convergence error for two time-scale TDC. The existing non-asymptotic analysis for two time-scale TDC Dalal et al. (2018b) focused only on diminishing stepsize, and does not characterize the convergence rate of two time-scale TDC under constant stepsize.

For two time-scale TDC, given that constant stepsize yields a large convergence error but converges fast, whereas diminishing stepsize has a small convergence error but converges slowly, it is desirable to design a new update scheme for TDC that converges faster than under diminishing stepsize, while attaining a convergence error as small as that of diminishing stepsize.
In this paper, we comprehensively address the above issues.
1.1 Our Contribution
Our main contributions are summarized as follows.
We develop a novel non-asymptotic analysis for two time-scale TDC with a single sample path and under non-i.i.d. data. We show that under the diminishing stepsizes and , respectively for the slow and fast time-scales (where are positive constants and ), the convergence rate can be as large as , which is achieved by . This recovers the convergence rate (up to a factor due to non-i.i.d. data) in Dalal et al. (2018b) for i.i.d. data as a special case.
We also develop a non-asymptotic analysis for TDC under non-i.i.d. data and constant stepsize. In contrast to conventional one time-scale analysis, our result shows that the training error (at the slow time-scale) and the tracking error (at the fast time-scale) converge at different rates (due to different condition numbers), though both converge linearly to a neighborhood of the solution. Our result also characterizes the impact of the tracking error on the training error. It suggests that TDC under constant stepsize can converge faster than under diminishing stepsize, at the cost of a large training error, due to a large tracking error caused by the auxiliary parameter iteration in TDC.
We take a further step and propose a TDC algorithm under blockwise diminishing stepsize, inspired by Yang et al. (2018) in conventional optimization, in which both stepsizes are constant over a block and decay across blocks. We show that TDC asymptotically converges with an arbitrarily small training error at a blockwise linear convergence rate, as long as the block lengths and the decay of the stepsizes across blocks are chosen properly. Our experiments demonstrate that TDC under blockwise diminishing stepsize converges as fast as vanilla TDC under constant stepsize, and still enjoys accuracy comparable to TDC under diminishing stepsize.
From the technical standpoint, our proof develops new tools: to handle the non-asymptotic analysis of the bias due to non-i.i.d. data for two time-scale algorithms under diminishing stepsize without requiring square summability; to bound the impact of the fast time-scale tracking error on the slow time-scale training error; and to recursively refine the error bound in order to sharpen the convergence rate.
1.2 Related Work
Due to the extensive literature on TD learning, we include here only the work most relevant to this paper.
On-policy TD and SA. The convergence of TD learning with linear function approximation and i.i.d. samples has been well established by using standard results in SA Borkar and Meyn (2000). Non-asymptotic convergence has been established in Borkar (2009); Kamal (2010); Thoppe and Borkar (2019) for general SA algorithms with martingale difference noise, and in Dalal et al. (2018a) for TD with i.i.d. samples. For the Markovian setting, asymptotic convergence has been established in Tsitsiklis and Van Roy (1997); Tadić (2001) for TD(), and non-asymptotic convergence has been provided for projected TD() in Bhandari et al. (2018) and for linear SA with Markovian noise in Karmakar and Bhatnagar (2016); Ramaswamy and Bhatnagar (2018); R. Srikant (2019).
Off-policy one time-scale GTD. The convergence of one time-scale GTD and GTD2 (which are off-policy TD algorithms) was derived by applying standard results in SA Sutton et al. (2008, 2009); Maei (2011). Non-asymptotic analysis for GTD and GTD2 was conducted in Liu et al. (2015) by converting the objective function into a convex-concave saddle-point problem, and was further generalized to the Markovian setting in Wang et al. (2017). However, such an approach cannot be generalized for analyzing the two time-scale TDC that we study here, because TDC does not have an explicit saddle-point representation.
Off-policy two time-scale TDC and SA. The asymptotic convergence of two time-scale TDC under i.i.d. samples has been established in Sutton et al. (2009); Maei (2011), and a non-asymptotic analysis has been provided in Dalal et al. (2018b) as a special case of two time-scale linear SA. In the Markovian setting, the convergence of various two time-scale GTD algorithms has been studied in Yu (2017). The non-asymptotic analysis of two time-scale TDC under non-i.i.d. data has not been studied before, and is the focus of this paper.
General two time-scale SA has also been studied. The convergence of two time-scale SA with martingale difference noise was established in Borkar (2009), and its non-asymptotic convergence was provided in Konda et al. (2004); Mokkadem and Pelletier (2006); Dalal et al. (2018b); Borkar and Pattathil (2018). Some of these results can be applied to two time-scale TDC under i.i.d. samples (which fits into the special case of SA with martingale difference noise), but not to the non-i.i.d. setting. For two time-scale linear SA with more general Markovian noise, only asymptotic convergence has been established Tadic (2004); Yaji and Bhatnagar (2016); Karmakar and Bhatnagar (2017). In fact, our non-asymptotic analysis for two time-scale TDC can be of independent interest here, to be further generalized for studying linear SA with more general Markovian noise.
2 Problem Formulation
2.1 Off-policy Value Function Evaluation
We consider the problem of policy evaluation for a Markov decision process (MDP) , where is a compact state space, is a finite action set, is the transition kernel, is the reward function bounded by , and is the discount factor. A stationary policy maps a state to a probability distribution over . At timestep , suppose the process is in some state . Then an action is taken based on the distribution , the system transitions to a next state governed by the transition kernel , and a reward is received. Assuming the associated Markov chain is ergodic, let be the induced stationary distribution of this MDP, i.e., . The value function for policy is defined as: , and it is known that is the unique fixed point of the Bellman operator , i.e., , where is the expected reward of the Markov chain induced by policy .
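For a finite state space, the Bellman fixed point described above can be computed exactly by a linear solve, which is useful as a ground truth when evaluating approximate methods. A minimal sketch (the toy chain and the function name are our own illustration, not part of the paper):

```python
import numpy as np

def value_function(P_pi, r_pi, gamma):
    """Solve the Bellman fixed point V = r_pi + gamma * P_pi @ V exactly.

    P_pi:  (S, S) state-transition matrix induced by policy pi
    r_pi:  (S,)   expected one-step reward under pi
    gamma: discount factor in [0, 1)
    """
    S = len(r_pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

# Tiny 2-state chain; the solution satisfies the Bellman equation.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
r = np.array([1.0, 0.0])
V = value_function(P, r, gamma=0.9)
assert np.allclose(V, r + 0.9 * P @ V)
```

In function-approximation settings the state space is too large for this solve, which is exactly what motivates the linear approximation of the next subsection.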
We consider the policy evaluation problem in the off-policy setting. Namely, a sample path is generated by the Markov chain according to the behavior policy , but our goal is to obtain the value function of a target policy , which is different from .
2.2 Two Time-Scale TDC
When is large or infinite, a linear function is often used to approximate the value function, where is a fixed feature vector for state and is a parameter vector. We can also write the linear approximation in vector form as , where is the feature matrix. Our goal is to find a parameter with . The gradient-based TD algorithm TDC Sutton et al. (2009) updates the parameter by minimizing the mean-square projected Bellman error (MSPBE) objective, defined as
where is the orthogonal projection onto the function space and denotes the diagonal matrix with the components of as its diagonal entries. Then, we define the matrices , , and the vector as
where is the importance weighting factor with being its maximum value. If and are both nonsingular, is strongly convex and has as its global minimum, i.e., . Motivated by minimizing the MSPBE objective via stochastic gradient methods, TDC was proposed with the following update rules:
(1)  
(2) 
where , , , , and is the projection operator onto a norm ball of radius . The projection step is widely used in the stochastic approximation literature. As we will show later, the iterations (1)-(2) are guaranteed to converge to the optimal parameter if we choose the values of and appropriately. TDC with the update rules (1)-(2) is a two time-scale algorithm. The parameter iterates at a slow time-scale determined by the stepsize , whereas iterates at a fast time-scale determined by the stepsize . Throughout the paper, we make the following standard assumptions Bhandari et al. (2018); Wang et al. (2017); Maei (2011).
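To make the two time-scale structure concrete, the sketch below implements one projected TDC iteration in the importance-weighted form popularized by Sutton et al. (2009). The exact update used in this paper is given by the displays (1)-(2), so this should be read as an illustrative assumption rather than a verbatim reproduction; in particular the projection radii `R_theta` and `R_w` are placeholders.

```python
import numpy as np

def project(v, radius):
    """Project v onto the Euclidean ball of the given radius."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def tdc_step(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta,
             R_theta=100.0, R_w=100.0):
    """One projected two time-scale TDC update (a sketch in the spirit of
    Sutton et al. (2009); the radii R_theta, R_w are placeholder values).

    theta: slow time-scale parameter, stepsize alpha
    w:     fast time-scale auxiliary parameter, stepsize beta (beta >> alpha)
    rho:   importance weight pi(a|s) / pi_b(a|s)
    """
    delta = reward + gamma * phi_next @ theta - phi @ theta  # TD error
    theta_new = project(
        theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next),
        R_theta)
    w_new = project(
        w + beta * (rho * delta - phi @ w) * phi,
        R_w)
    return theta_new, w_new
```

Note that the correction term involving `w` is what distinguishes TDC from vanilla off-policy TD, and `beta` is taken larger than `alpha` so that `w` tracks its target on the faster time-scale.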
Assumption 1 (Problem solvability).
The matrix and are nonsingular.
Assumption 2 (Bounded feature).
for all and .
Assumption 3 (Geometric ergodicity).
There exist constants and such that
where denotes the totalvariation distance between the probability measures and .
In Assumption 1, the matrix is required to be nonsingular so that the optimal parameter is well defined. The matrix is nonsingular when the feature matrix has linearly independent columns. Assumption 2 can be ensured by normalizing the basis functions and when is nondegenerate for all . Assumption 3 holds for any time-homogeneous Markov chain with a finite state space and for any uniformly ergodic Markov chain with a general state space. Throughout the paper, we require and . In practice, we can estimate , and as mentioned in Bhandari et al. (2018), or simply set and large enough.
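Assumption 3 (geometric ergodicity) can be checked numerically for a small chain: the total-variation distance between the t-step distribution and the stationary distribution should decay at a geometric rate. A toy sketch, where the 2-state kernel is our own example rather than anything from the paper:

```python
import numpy as np

# The chain below has second eigenvalue 0.6, so TV distance decays ~ 0.6^t.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
xi = np.real(evecs[:, np.argmax(np.real(evals))])
xi = xi / xi.sum()

def tv_from_stationary(P, xi, t):
    """max over initial states s of TV( P^t(s, .), xi )."""
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[s] - xi).sum() for s in range(len(xi)))

dists = [tv_from_stationary(P, xi, t) for t in range(1, 6)]
assert all(d2 < d1 for d1, d2 in zip(dists, dists[1:]))  # geometric decay
```

For this chain the stationary distribution is (0.75, 0.25), and the decay constants play the roles of the (m, rho) pair in Assumption 3.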
3 Main Theorems
3.1 Non-asymptotic Analysis under Diminishing Stepsize
Our first main result is the convergence rate of two timescale TDC with diminishing stepsize. We define the tracking error: , where is the stationary point of the ODE given by , with being fixed. Let and be any constants that satisfy and .
Theorem 1.
Consider the projected two time-scale TDC algorithm in (1)-(2). Suppose Assumptions 1-3 hold. Suppose we apply diminishing stepsizes and , which satisfy , and . Let and be any constants in and , respectively. Then we have for :
(3)  
(4) 
where
(5) 
If , with and , we have for
(6) 
For the explicit expressions of the bounds in (3), (4) and (6), please refer to the corresponding equations in Appendix A.2.
We further explain Theorem 1 as follows: (a) In (3) and (5), since both and can be arbitrarily small, the convergence of can be almost as fast as when , and when . The best convergence rate is almost as fast as , with . (b) If the data are generated i.i.d., then our bound reduces to with when , and when . The best convergence rate is almost as fast as with , as given in Dalal et al. (2018b).
Theorem 1 characterizes the relationship between the convergence rate of and stepsizes and . The first term of the bound in (3) corresponds to the convergence rate of with full gradient , which exponentially decays with . The second term is introduced by the bias and variance of the gradient estimator which decays sublinearly with . The last term arises due to the accumulated tracking error , which specifically arises in two timescale algorithms, and captures how accurately tracks . Thus, if tracks the stationary point in each step perfectly, then we have only the first two terms in (3), which matches the results of one timescale TD learning Bhandari et al. (2018); Dalal et al. (2018a). Theorem 1 indicates that asymptotically, (3) is dominated by the tracking error term , which depends on the diminishing rate of and . Since both and can be arbitrarily small, if the diminishing rate of is close to that of , then the tracking error is dominated by the slow drift, which has an approximate order ; if the diminishing rate of is much faster than that of , then the tracking error is dominated by the accumulated bias, which has an approximate order . Moreover, (5) and (6) suggest that for any fixed , the optimal diminishing rate of is achieved by .
From the technical standpoint, we develop novel techniques to handle the interaction between the training error and the tracking error and sharpen the error bounds recursively. The proof sketch and the detailed steps are provided in Appendix A.
3.2 Non-asymptotic Analysis under Constant Stepsize
As remarked in Section 1, it has been demonstrated by empirical results Dann et al. (2014) that standard TD under constant stepsize not only converges fast, but also has a training error comparable to that under diminishing stepsize. However, this does not hold for TDC. When the two variables in TDC are both updated under constant stepsizes, our experiments demonstrate that constant stepsize yields fast convergence, but a large training error. In this subsection, we aim to explain why this happens by analyzing the convergence rates of the two variables in TDC, and the impact of one on the other.
The following theorem provides the convergence result for TDC with the two variables iteratively updated respectively by two different constant stepsizes.
Theorem 2.
Consider the projected TDC algorithm in (1)-(2). Suppose Assumptions 1-3 hold. Suppose we apply constant stepsizes and which satisfy , and . We then have for :
(7)  
(8) 
where with , and , , , and are positive constants independent of and . For explicit expressions of , , , and , please refer to (67), (68), (69), (B), and (60) in the Supplementary Materials.
Theorem 2 shows that TDC with constant stepsize converges to a neighborhood of exponentially fast. The size of the neighborhood depends on the second and third terms of the bound in (7), which arise from the bias and variance of the update of and from the tracking error in (8), respectively. Clearly, the convergence of , although also exponentially fast to a neighborhood, occurs at a different rate due to a different condition number. We further note that as the stepsize parameters , approach in a way such that , approaches as , which matches the asymptotic convergence result for two time-scale TDC under constant stepsize in Yu (2017).
Diminishing vs Constant Stepsize: We next compare TDC under diminishing stepsize and under constant stepsize. Generally, Theorem 1 suggests that diminishing stepsize yields a better convergence guarantee (i.e., converges exactly to ) than constant stepsize in Theorem 2 (i.e., converges to a neighborhood of ). In practice, constant stepsize is recommended because diminishing stepsize may take a much longer time to converge. However, as Figure 2 in Section 4.2 shows, although TDC with a large constant stepsize converges fast, the training error due to convergence only to a neighborhood is significantly worse than under diminishing stepsize. More specifically, when is fixed, as grows, the convergence becomes faster, but as a consequence, the term due to the tracking error increases and results in a large training error. Alternatively, if is small enough that the training error is comparable to that under diminishing stepsize, then the convergence becomes very slow. This suggests that simply setting the stepsizes to be constant for TDC does not yield the desired performance, and motivates us to design an update scheme for TDC that enjoys an error convergence rate as fast as constant stepsize offers, while retaining accuracy comparable to diminishing stepsize.
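The constant-vs-diminishing trade-off discussed above already shows up in a scalar stochastic recursion, sketched below as a toy illustration; the recursion and all constants are our own, not the TDC iterates.

```python
import numpy as np

# Toy trade-off: x_{t+1} = x_t - a_t * ((x_t - x*) + noise). A constant
# stepsize contracts fast but plateaus at a noise floor of order alpha,
# while the diminishing stepsize 1/(t+1) drives the error to zero slowly.
rng = np.random.default_rng(0)
x_star, T = 1.0, 20000

def run(stepsize):
    x, errs = 0.0, []
    for t in range(T):
        a = stepsize(t)
        x -= a * ((x - x_star) + rng.normal())
        errs.append((x - x_star) ** 2)
    return np.mean(errs[-1000:])  # mean squared error near the end

err_const = run(lambda t: 0.5)            # fast, but noisy plateau
err_dimin = run(lambda t: 1.0 / (t + 1))  # slower, but vanishing error
assert err_dimin < err_const
```

For the constant case the stationary mean squared error is roughly alpha/(2 - alpha) (about 1/3 here), whereas the 1/(t+1) schedule reduces to simple averaging with error of order 1/t, mirroring the exact-versus-neighborhood contrast between Theorems 1 and 2.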
3.3 TDC under Blockwise Diminishing Stepsize
In this subsection, we propose a blockwise diminishing stepsize scheme for TDC (see Algorithm 1), and study its theoretical convergence guarantee. In Algorithm 1, we define .
The idea of Algorithm 1 is to divide the iteration process into blocks and to diminish the stepsizes blockwise, keeping them constant within each block. In this way, within each block the error decays fast due to the constant stepsizes, while the blockwise decay of the stepsizes still yields an accurate solution, as we demonstrate in Section 4. More specifically, the constant stepsizes and for block are chosen to decay geometrically, so that the tracking error and the accumulated variance and bias are asymptotically small, and the block lengths increase geometrically across blocks, so that the training error decreases geometrically blockwise. We note that the design of the algorithm is inspired by the method proposed in Yang et al. (2018) for conventional optimization problems.
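A schedule in the spirit of Algorithm 1 might be sketched as follows; the geometric decay factor, growth factor, and initial values are placeholder assumptions, since the paper's exact constants are specified in Theorem 3.

```python
def blockwise_stepsizes(alpha0, beta0, T0, decay=0.5, growth=2, num_blocks=5):
    """Sketch of a blockwise diminishing stepsize schedule in the spirit of
    Algorithm 1 (decay/growth are placeholder constants): within block s the
    stepsizes (alpha_s, beta_s) are held constant; across blocks they shrink
    geometrically while the block length grows geometrically.
    Yields (alpha_s, beta_s, length_s) for each block."""
    alpha, beta, length = alpha0, beta0, T0
    for _ in range(num_blocks):
        yield alpha, beta, length
        alpha *= decay
        beta *= decay
        length = int(length * growth)

schedule = list(blockwise_stepsizes(0.1, 0.5, 100))
assert len(schedule) == 5  # stepsizes halve, lengths double, per block
```

Running TDC with each `(alpha_s, beta_s)` pair for `length_s` iterations reproduces the intended behavior: fast within-block decay with a geometrically shrinking error floor.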
The following theorem characterizes the convergence of Algorithm 1.
Theorem 3.
Consider the projected TDC algorithm with blockwise diminishing stepsize as in Algorithm 1. Suppose Assumptions 1-3 hold. Suppose , and , where and are constants independent of (see (C) and (75) in the Supplementary Materials for the explicit expressions of and ), and . Then, after blocks, we have
The total sample complexity is , where can be any arbitrarily small constant.
Theorem 3 indicates that the sample complexity of TDC under blockwise diminishing stepsize is slightly better than that under diminishing stepsize. Our empirical results (see Section 4.3) also demonstrate that blockwise diminishing stepsize yields convergence as fast as constant stepsize and a training error comparable to diminishing stepsize. However, we point out that the advantage of blockwise diminishing stepsize does not come for free, but rather at the cost of some extra parameter tuning in practice to estimate , , and , whereas the diminishing stepsize scheme guided by our Theorem 1 requires tuning at most three parameters to obtain desirable performance.
4 Experimental Results
In this section, we provide numerical experiments to verify our theoretical results and the efficiency of Algorithm 1. More precisely, we consider Garnet problems Archibald et al. (1995) denoted as , where denotes the number of states, denotes the number of actions, denotes the number of possible next states for each stateaction pair, and denotes the number of features. The reward is statedependent and both the reward and the feature vectors are generated randomly. The discount factor is set to in all experiments. We consider the problem. For all experiments, we choose . All plots report the evolution of the mean square error over independent runs.
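A Garnet instance as described above can be generated along the following lines. This is a hedged sketch: the benchmark prescribes the branching structure, while details such as the reward distribution and the Dirichlet weights over the b successor states are our own choices.

```python
import numpy as np

def garnet(n_s, n_a, b, rng=None):
    """Sketch of a Garnet(n_s, n_a, b) generator (Archibald et al., 1995):
    each state-action pair has b uniformly chosen successor states with
    random transition probabilities, and rewards are state-dependent and
    drawn at random."""
    rng = rng or np.random.default_rng()
    P = np.zeros((n_s, n_a, n_s))
    for s in range(n_s):
        for a in range(n_a):
            succ = rng.choice(n_s, size=b, replace=False)
            P[s, a, succ] = rng.dirichlet(np.ones(b))
    r = rng.uniform(0.0, 1.0, size=n_s)  # state-dependent reward
    return P, r

P, r = garnet(8, 3, 2, rng=np.random.default_rng(0))
assert np.allclose(P.sum(axis=-1), 1.0)  # each row is a valid distribution
```

Random feature vectors for linear function approximation would be drawn analogously, with one feature vector per state.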
4.1 Optimal Diminishing Stepsize
In this subsection, we provide numerical results to verify Theorem 1. We compare the performance of TDC updates with the same but different . We consider four different diminishing stepsize settings: (1) , ; (2) , ; (3) , ; (4) , . For each case with fixed slow time-scale parameter , the fast time-scale stepsize takes decay rates , , , , , and . Our results are reported in Figure 1, in which for each case the left plot shows the overall iteration process and the right plot shows the corresponding zoomed tail of the last 100,000 iterations. It can be seen that in all cases, TDC iterations with the same slow time-scale stepsize share similar error decay rates (see the left plots), and the differences among the fast time-scale parameters are reflected in the behavior of the error convergence tails (see the right plots). We observe that yields the best error decay rate. This corroborates Theorem 1, which shows that the fast time-scale stepsize parameter affects only the tracking error term in (3), which dominates the error decay rate asymptotically.




4.2 Constant Stepsize vs Diminishing Stepsize
In this subsection, we compare the error decay of TDC under diminishing stepsize with that of TDC under four different constant stepsizes. For diminishing stepsize, we set and , and tune their values to the best, which are given by , . For the four constant-stepsize cases, we fix for each case, and tune to the best. The resulting parameter settings are respectively as follows: ; ; ; and . The results are reported in Figure 2, in which for both the training and tracking errors, the left plot illustrates the overall iteration process and the right plot illustrates the corresponding zoomed error tails. The results suggest that although some large constant stepsizes ( and ) yield initially faster convergence than diminishing stepsize, they eventually oscillate around a large neighborhood of due to the large tracking error. Small constant stepsizes ( and ) can attain almost the same asymptotic accuracy as diminishing stepsize, but converge very slowly. We also observe a strong correlation between the training and tracking errors under constant stepsize, i.e., a larger training error corresponds to a larger tracking error, which corroborates Theorem 2 and suggests that the accuracy of TDC heavily depends on the decay of the tracking error .


4.3 Blockwise Diminishing Stepsize
In this subsection, we compare the error decay of TDC under blockwise diminishing stepsize with that of TDC under diminishing stepsize and under constant stepsize. We use the best-tuned parameter settings listed in Section 4.2 for the latter two algorithms, i.e., and for diminishing stepsize, and for constant stepsize. We report our results in Figure 3. It can be seen that TDC under blockwise diminishing stepsize converges faster than under diminishing stepsize and almost as fast as under constant stepsize. Furthermore, TDC under blockwise diminishing stepsize also has a training error comparable to diminishing stepsize: since the stepsizes decrease geometrically across blocks, the algorithm approaches a very small neighborhood of in the later blocks. We can also observe that the tracking error under blockwise diminishing stepsize decreases rapidly across blocks.


5 Conclusion
In this work, we provided the first non-asymptotic analysis for the two time-scale TDC algorithm over a Markovian sample path. We developed a novel technique to handle the accumulated tracking error caused by the two time-scale update, using which we characterized the non-asymptotic convergence rate under general diminishing stepsize and under constant stepsize. We also proposed a blockwise diminishing stepsize scheme for TDC and proved its convergence. Our experiments demonstrated the performance advantage of this algorithm over both the diminishing- and constant-stepsize TDC algorithms. Our techniques for the non-asymptotic analysis of two time-scale algorithms can be applied to study other off-policy algorithms such as actor-critic Maei (2018) and gradient Q-learning algorithms Maei and Sutton (2010).
References
 [1] (1995) On the generation of Markov decision processes. Journal of the Operational Research Society 46 (3), pp. 354–361.
 [2] (1995) Residual algorithms: reinforcement learning with function approximation. In Machine Learning Proceedings, pp. 30–37.
 [3] (2018) A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory (COLT), pp. 1691–1692.
 [4] (2000) The ODE method for convergence of stochastic approximation and reinforcement learning. Journal on Control and Optimization 38 (2), pp. 447–469.
 [5] (2018) Concentration bounds for two time scale stochastic approximation. In Proc. Allerton Conference on Communication, Control, and Computing (Allerton), pp. 504–511.
 [6] (2009) Stochastic approximation: a dynamical systems viewpoint. Vol. 48, Springer.
 [7] (2018) Finite sample analyses for TD(0) with function approximation. In Proc. AAAI Conference on Artificial Intelligence.
 [8] (2018) Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Proc. Conference on Learning Theory (COLT).
 [9] (2014) Policy evaluation with temporal differences: a survey and comparison. The Journal of Machine Learning Research 15 (1), pp. 809–883.
 [10] (2010) On the convergence, lock-in probability and sample complexity of stochastic approximation. Journal on Control and Optimization 48 (8), pp. 5178–5192.
 [11] (2016) Dynamics of stochastic approximation with Markov iterate-dependent noise with the stability of the iterates not ensured. arXiv preprint arXiv:1601.02217.
 [12] (2017) Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research 43 (1), pp. 130–151.
 [13] (2004) Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability 14 (2), pp. 796–819.
 [14] (2015) Finite-sample analysis of proximal gradient TD algorithms. In Proc. Uncertainty in Artificial Intelligence (UAI), pp. 504–513.
 [15] (2011) Gradient temporal-difference learning algorithms. Ph.D. Thesis, University of Alberta.
 [16] (2010) GQ(lambda): a general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proc. Artificial General Intelligence (AGI).
 [17] (2018) Convergent actor-critic algorithms under off-policy training and function approximation. arXiv preprint arXiv:1802.07842.
 [18] (2006) Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. The Annals of Applied Probability 16 (3), pp. 1671–1702.
 [19] (2019) Finite-time error bounds for linear stochastic approximation and TD learning. arXiv preprint arXiv:1902.00923.
 [20] (2018) Stability of stochastic approximations with 'controlled Markov' noise and temporal difference learning. Transactions on Automatic Control.
 [21] (2014) Deterministic policy gradient algorithms. In Proc. International Conference on Machine Learning (ICML).
 [22] (2018) Reinforcement learning: an introduction. MIT Press.
 [23] (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proc. International Conference on Machine Learning (ICML), pp. 993–1000.
 [24] (2008) A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems (NIPS) 21, pp. 1609–1616.
 [25] (1988) Learning to predict by the methods of temporal differences. Machine Learning 3 (1), pp. 9–44.
 [26] (2004) Almost sure convergence of two time-scale stochastic approximation algorithms. In Proc. American Control Conference, Vol. 4, pp. 3802–3807.
 [27] (2001) On the convergence of temporal-difference learning with linear function approximation. Machine Learning 42 (3), pp. 241–267.
 [28] (2019) A concentration bound for stochastic approximation via Alekseev's formula. Stochastic Systems.
 [29] (1997) Analysis of temporal-difference learning with function approximation. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 1075–1081.
 [30] (2017) Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 5504–5513.
 [31] (1992) Q-learning. Machine Learning 8 (3-4), pp. 279–292.
 [32] (2016) Stochastic recursive inclusions in two timescales with non-additive iterate dependent Markov noise. arXiv preprint arXiv:1611.05961.
 [33] (2018) Why does stagewise training accelerate convergence of testing error over SGD? arXiv preprint arXiv:1812.03934.
 [34] (2017) On convergence of some gradient-based temporal-differences algorithms for off-policy learning. arXiv preprint arXiv:1712.09652.
Supplementary Materials
Appendix A Technical Proofs for TDC under Diminishing Stepsize
We present the proof of Theorem 1 in four subsections. Section A.1 provides the proof sketch. Section A.2 contains the main part of the proof. Section A.3 includes all technical lemmas for the convergence proof of the fast time-scale iteration, and Section A.4 includes all technical lemmas for the convergence proof of the slow time-scale iteration.
A.1 Proof Sketch of Theorem 1
Proof Sketch of Theorem 1.
The proof consists of four steps as we briefly describe here. The details are provided in Appendix A.2.
Step 1. Formulate the training and tracking error updates. Instead of investigating the convergence of and directly, we substitute into the TDC updates (1)-(2) and analyze the TDC update in terms of and the tracking error .
Step 2. Derive a preliminary bound on . We decompose the mean square tracking error into an exponentially decaying term, a variance term, a bias term, and a slow drift term, and bound each term individually. We obtain a preliminary upper bound on of order .
Step 3. Recursively refine the bound on . By recursively substituting the preliminary bound on into the slow drift term, we obtain the refined decay rate .
Step 4. Derive the bound on . We decompose the training error into an exponentially decaying term, a variance term, a bias term, and a tracking error term, and bound each term individually. We then recursively substitute the decay rates of and into the tracking error term to obtain an upper bound on the training error of order . Combining the terms yields the final bound on in (3). ∎
A.2 Proof of Theorem 1
We provide the proof of Theorem 1 in the following four steps.
Step 1. Formulation of the training error and tracking error updates. We define the tracking error vector . By substituting into (1)-(2), we can rewrite the update rules of TDC in terms of and as follows:
(9)  
(10) 
where
with denoting the observation at time step . We further define
Step 2. Derive preliminary bound on . We bound the recursion of the tracking error vector in (10) as follows. For any , we derive
where , , , and , and are positive constants; please refer to Lemmas 12, 13, 2 and 6 for their definitions. Then, defining and taking the expectation over (the filtration up to state ) on both sides, we have