Finite-Time Analysis for Double Q-learning
Abstract
Although Q-learning is one of the most successful algorithms for finding the best action-value function (and thus the optimal policy) in reinforcement learning, its implementation often suffers from large overestimation of Q-function values incurred by random sampling. The double Q-learning algorithm proposed in Hasselt (2010) overcomes such an overestimation issue by randomly switching the update between two Q-estimators, and has thus gained significant popularity in practice. However, the theoretical understanding of double Q-learning is rather limited. So far only the asymptotic convergence has been established, which does not characterize how fast the algorithm converges. In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum by taking $\tilde{\Omega}\big(\big(\frac{1}{(1-\gamma)^6\epsilon^2}\big)^{\frac{1}{\omega}}+\big(\frac{1}{1-\gamma}\big)^{\frac{1}{1-\omega}}\big)$ iterations, where $\omega\in(0,1)$ is the decay parameter of the learning rate, and $\gamma$ is the discount factor. Our analysis develops novel techniques to derive finite-time bounds on the difference between two interconnected stochastic processes, which is new to the literature of stochastic approximation.
1 Introduction
Q-learning is one of the most successful classes of reinforcement learning (RL) algorithms, which aims at finding the optimal action-value function or Q-function (and thus the associated optimal policy) via off-policy data samples. The Q-learning algorithm was first proposed by Watkins and Dayan (1992), and since then, it has been widely used in various applications including robotics (Tai and Liu, 2016), autonomous driving (Okuyama et al., 2018), and video games (Mnih et al., 2015), to name a few. The theoretical performance of Q-learning has also been intensively explored. Its asymptotic convergence has been established in Tsitsiklis (1994); Jaakkola et al. (1994); Borkar and Meyn (2000); Melo (2001); Lee and He (2019). The non-asymptotic (i.e., finite-time) convergence rate of Q-learning was first obtained in Szepesvári (1998), and has been further studied in (Even-Dar and Mansour, 2003; Shah and Xie, 2018; Wainwright, 2019; Beck and Srikant, 2012; Chen et al., 2020) for synchronous Q-learning and in (Even-Dar and Mansour, 2003; Qu and Wierman, 2020) for asynchronous Q-learning.
One major weakness of Q-learning arises in practice due to the large overestimation of the action-value function (Hasselt, 2010; Hasselt et al., 2016). Practical implementations of Q-learning use the maximum sampled Q-function to estimate the maximum expected Q-function (where the expectation is taken over the randomness of the reward). Such an estimate often carries a large positive bias (Hasselt, 2010), and causes Q-learning to perform rather poorly. To address this issue, double Q-learning was proposed in Hasselt (2010), which keeps two Q-estimators (i.e., estimators for the Q-function), one for estimating the maximum Q-function value and the other one for the update, and continuously switches the roles of the two Q-estimators in a random manner. It was shown in Hasselt (2010) that such an algorithm effectively overcomes the overestimation issue of vanilla Q-learning. In Hasselt et al. (2016), double Q-learning was further demonstrated to substantially improve the performance of Q-learning with deep neural networks (DQNs) for playing Atari 2600 games. It has inspired many variants (Zhang et al., 2017; Abed-alguni and Ottom, 2018), found numerous applications (Zhang et al., 2018a, b), and has become one of the most common techniques for implementing Q-learning-type algorithms (Hessel et al., 2018).
Despite its tremendous empirical success and popularity in practice, the theoretical understanding of double Q-learning is rather limited. Only the asymptotic convergence has been provided, in Hasselt (2010); Weng et al. (2020c). There has been no non-asymptotic result on how fast double Q-learning converges. From a technical standpoint, such a finite-time analysis does not follow readily from those for vanilla Q-learning, because double Q-learning involves two randomly updated Q-estimators, and the coupling between these two random paths significantly complicates the analysis. This goes well beyond the existing techniques for analyzing vanilla Q-learning, which handle the random update of a single Q-estimator. Thus, the goal of this paper is to develop new finite-time analysis techniques that handle the two interconnected random update paths in double Q-learning and to provide the convergence rate.
1.1 Our contributions
The main contribution of this paper lies in providing the first finite-time analysis for double Q-learning with both synchronous and asynchronous implementations.

We show that synchronous double Q-learning with a polynomial learning rate $\alpha_t = 1/t^\omega$ (where $\omega \in (0,1)$) attains an $\epsilon$-accurate global optimum with probability at least $1-\delta$ by taking $\tilde{\Omega}\big(\big(\frac{1}{(1-\gamma)^6\epsilon^2}\big)^{\frac{1}{\omega}} + \big(\frac{1}{1-\gamma}\big)^{\frac{1}{1-\omega}}\big)$ iterations, where $\gamma$ is the discount factor, and the sizes $|\mathcal{S}|$ and $|\mathcal{A}|$ of the state space and action space, respectively, enter only through logarithmic factors.

We further show that under the same accuracy and high-probability requirements, asynchronous double Q-learning takes a number of iterations of the same order in $\epsilon$ and $\frac{1}{1-\gamma}$ with an additional polynomial dependence on $L$, where $L$ is the covering number specified by the exploration strategy.
Our results corroborate the design goal of double Q-learning, which opts for better accuracy by making less aggressive progress during execution in order to avoid overestimation. Specifically, our results imply that in the high-accuracy regime, double Q-learning achieves the same convergence rate as vanilla Q-learning in terms of the order-level dependence on $\frac{1}{\epsilon}$, which further indicates that the high-accuracy design of double Q-learning dominates the less aggressive progress in such a regime. In the low-accuracy regime, which is not what double Q-learning is designed for, the cautious progress of double Q-learning yields a slightly weaker convergence rate than Q-learning in terms of the dependence on $\frac{1}{1-\gamma}$.
From the technical standpoint, our proof develops new techniques beyond the existing finite-time analyses of vanilla Q-learning, which involve a single random iteration path. More specifically, we model the double Q-learning algorithm as two alternating stochastic approximation (SA) problems, where one SA captures the error propagation between the two Q-estimators, and the other captures the error dynamics between one Q-estimator and the global optimum. For the first SA, we develop new techniques to provide finite-time bounds on the two interrelated stochastic iterations of the Q-functions. Then we develop new tools to bound the convergence of the Bernoulli-controlled stochastic iterations of the second SA conditioned on the first SA.
1.2 Related work
Due to the rapidly growing literature on Q-learning, we review only the theoretical results that are most relevant to our work.
Q-learning was first proposed in Watkins and Dayan (1992) for finite state-action spaces. Its asymptotic convergence has been established in Tsitsiklis (1994); Jaakkola et al. (1994); Borkar and Meyn (2000); Melo (2001) through the study of various general SA algorithms that include Q-learning as a special case. Along this line, Lee and He (2019) characterized Q-learning as a switched linear system and applied the results of Borkar and Meyn (2000) to show asymptotic convergence, which was also extended to other Q-learning variants. Another line of research focuses on the finite-time analysis of Q-learning, which captures the convergence rate. Such non-asymptotic results were first obtained in Szepesvári (1998). A more comprehensive work (Even-Dar and Mansour, 2003) provided finite-time results for both synchronous and asynchronous Q-learning. Both Szepesvári (1998) and Even-Dar and Mansour (2003) showed that with linear learning rates, the convergence rate of Q-learning can be exponentially slow as a function of $\frac{1}{1-\gamma}$. To handle this, the so-called rescaled linear learning rate was introduced to avoid such an exponential dependence in synchronous Q-learning (Wainwright, 2019; Chen et al., 2020) and asynchronous Q-learning (Qu and Wierman, 2020). The finite-time convergence of synchronous Q-learning has also been analyzed with constant step sizes (Beck and Srikant, 2012; Chen et al., 2020). Moreover, the polynomial learning rate, which is also the focus of this work, was investigated for both synchronous (Even-Dar and Mansour, 2003; Wainwright, 2019) and asynchronous Q-learning (Even-Dar and Mansour, 2003). In addition, it is worth mentioning that Shah and Xie (2018) applied a nearest-neighbor approach to handle MDPs with infinite state spaces.
In contrast to the above extensive studies of vanilla Q-learning, the theoretical understanding of double Q-learning is limited. The only theoretical guarantee is the asymptotic convergence provided by Hasselt (2010); Weng et al. (2020c), which does not characterize how fast double Q-learning converges. This paper provides the first finite-time analysis for double Q-learning.
The vanilla Q-learning algorithm has also been studied in the function approximation setting, i.e., where the Q-function is approximated by a class of parameterized functions. In contrast to the tabular case, even with linear function approximation, Q-learning has been shown not to converge in general (Baird, 1995). Strong assumptions are typically imposed to guarantee the convergence of Q-learning with function approximation (Bertsekas and Tsitsiklis, 1996; Zou et al., 2019; Chen et al., 2019; Du et al., 2019; Xu and Gu, 2019; Weng et al., 2020a, b). Regarding double Q-learning, it remains open how to design double Q-learning algorithms under function approximation and under what conditions they have theoretically guaranteed convergence.
2 Preliminaries on Q-learning and Double Q-learning
In this section, we introduce the Q-learning and double Q-learning algorithms.
2.1 Q-learning
We consider a discounted Markov decision process (MDP) with a finite state space $\mathcal{S}$ and a finite action space $\mathcal{A}$. The transition probability of the MDP is given by $P$; that is, $P(\cdot \mid s, a)$ denotes the probability distribution of the next state given the current state $s$ and action $a$. We consider a random reward $R_t$ at time $t$ drawn from a fixed distribution, where $s'$ denotes the next state starting from $(s, a)$. In addition, we assume $|R_t| \le R_{\max}$. A policy $\pi$ characterizes the conditional probability distribution $\pi(\cdot \mid s)$ over the action space given each state $s \in \mathcal{S}$.
The action-value function (i.e., Q-function) for a given policy $\pi$ is defined as

(1) $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \,\middle|\, s_0 = s,\ a_0 = a,\ \pi \right],$
where $\gamma \in (0, 1)$ is the discount factor. Q-learning aims to find the Q-function of an optimal policy $\pi^*$ that maximizes the accumulated reward. The existence of such a $\pi^*$ has been proved in the classical MDP theory (Bertsekas and Tsitsiklis, 1996). The corresponding optimal Q-function, denoted by $Q^*$, is known to be the unique fixed point of the Bellman operator $\mathcal{T}$ given by

(2) $(\mathcal{T}Q)(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[\mathbb{E}[R_t \mid s, a, s'] + \gamma \max_{a' \in U(s')} Q(s', a')\Big],$
where $U(s)$ is the admissible set of actions at state $s$. It can be shown that the Bellman operator $\mathcal{T}$ is contractive in the supremum norm $\|Q\|_\infty := \max_{s,a} |Q(s,a)|$, i.e., it satisfies

(3) $\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty \le \gamma\, \|Q_1 - Q_2\|_\infty.$
The goal of Q-learning is to find $Q^*$, which further yields an optimal policy $\pi^*$. In practice, however, exact evaluation of the Bellman operator (2) is usually infeasible due to the lack of knowledge of the transition kernel of the MDP and the randomness of the reward. Instead, Q-learning draws random samples to estimate the Bellman operator and iteratively learns $Q^*$ as

(4) $Q_{t+1}(s, a) = (1 - \alpha_t(s, a))\, Q_t(s, a) + \alpha_t(s, a) \Big(R_t + \gamma \max_{a' \in U(s')} Q_t(s', a')\Big),$

where $R_t$ is the sampled reward, $s'$ is sampled by the transition probability $P(\cdot \mid s, a)$, and $\alpha_t(s, a) \in (0, 1]$ denotes the learning rate.
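As an illustration of the sample-based update (4), a minimal tabular implementation might look as follows (the single-state toy MDP and variable names are our own, chosen only to make the sketch self-contained):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning update: move Q(s, a) toward the
    empirical Bellman backup r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Toy check: a single-state MDP with deterministic reward 1 and
# gamma = 0.5 has Q* = 1 / (1 - 0.5) = 2.
Q = np.zeros((1, 1))
for t in range(1, 20001):
    q_learning_step(Q, 0, 0, 1.0, 0, alpha=1.0 / t**0.8, gamma=0.5)
print(round(Q[0, 0], 2))  # → 2.0
```

The polynomial learning rate $1/t^{0.8}$ used here matches the rate family analyzed later in the paper.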
2.2 Double Q-learning
Although Q-learning is a commonly used RL algorithm for finding the optimal policy, it can suffer from overestimation in practice (Smith and Winkler, 2006). To overcome this issue, Hasselt (2010) proposed the double Q-learning algorithm given in Algorithm 1.
Double Q-learning maintains two Q-estimators (i.e., Q-tables): $Q^A$ and $Q^B$. At each iteration of Algorithm 1, one Q-table is randomly chosen to be updated. The chosen Q-table generates a greedy optimal action, and the other Q-table is used to estimate the corresponding Bellman operator for updating the chosen table. Specifically, if $Q^A$ is chosen to be updated, we use $Q^A$ to obtain the optimal action $a^* = \arg\max_a Q^A(s', a)$ and then estimate the corresponding Bellman operator using $Q^B(s', a^*)$. As shown in Hasselt (2010), $\mathbb{E}[Q^B(s', a^*)]$ is likely smaller than $\max_a \mathbb{E}[Q^A(s', a)]$, where the expectation is taken over the randomness of the reward for the same state-action pair. In this way, the two-estimator framework of double Q-learning can effectively reduce the overestimation.
Synchronous and asynchronous double Q-learning: In this paper, we study the finite-time convergence rate of double Q-learning in two different settings: synchronous and asynchronous implementations. For synchronous double Q-learning (as shown in Algorithm 1), all the state-action pairs of the chosen Q-estimator are visited simultaneously at each iteration. For the asynchronous case, only one state-action pair is updated in the chosen Q-table. Specifically, in the latter case, we sample a trajectory $\{(s_t, a_t, R_t, i_t)\}_{t \ge 0}$ under a certain exploration strategy, where $i_t \in \{A, B\}$ denotes the index of the chosen Q-table at time $t$. Then the two Q-tables are updated based on the following rule: if $i_t = A$,

$Q^A_{t+1}(s_t, a_t) = Q^A_t(s_t, a_t) + \alpha_t(s_t, a_t)\big(R_t + \gamma\, Q^B_t(s_{t+1}, a^*) - Q^A_t(s_t, a_t)\big),$

and symmetrically for $i_t = B$ with the roles of $Q^A$ and $Q^B$ swapped, where $a^* = \arg\max_{a} Q^A_t(s_{t+1}, a)$ (and analogously with $Q^B$ in the symmetric case).
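The alternating update above can be sketched in code as follows; this is a schematic rendering of the rule (with our own variable names and toy MDP), in which the chosen table selects the greedy action and the other table evaluates it:

```python
import numpy as np

def double_q_step(QA, QB, s, a, r, s_next, alpha, gamma, rng):
    """One asynchronous double Q-learning update: a fair coin picks
    the table to update; the chosen table supplies the greedy action
    and the *other* table supplies the value estimate."""
    if rng.random() < 0.5:                   # update Q^A
        a_star = int(np.argmax(QA[s_next]))  # greedy w.r.t. Q^A
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
    else:                                    # update Q^B
        b_star = int(np.argmax(QB[s_next]))  # greedy w.r.t. Q^B
        QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])

# Toy check on a single-state MDP with reward 1 and gamma = 0.5,
# where Q* = 2: both tables approach Q* and hence each other.
rng = np.random.default_rng(0)
QA, QB = np.zeros((1, 1)), np.zeros((1, 1))
for t in range(1, 50001):
    double_q_step(QA, QB, 0, 0, 1.0, 0,
                  alpha=1.0 / t**0.8, gamma=0.5, rng=rng)
```

Note that each update writes only the chosen table, which is exactly the "less aggressive progress" discussed in Section 1.1: per iteration, each estimator moves on average half as often as in vanilla Q-learning.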
We next provide the boundedness property of the Q-estimators and the error between them in the following lemma, which is typically necessary for the finite-time analysis.
Lemma 1.
For either synchronous or asynchronous double Q-learning, let $Q_t(s, a)$ be the value of either Q-table corresponding to a state-action pair $(s, a)$ at iteration $t$. Suppose $\|Q_0\|_\infty \le V_{\max}$. Then we have $|Q_t(s, a)| \le V_{\max}$ and $|Q^A_t(s, a) - Q^B_t(s, a)| \le 2V_{\max}$ for all $t \ge 0$, where $V_{\max} = \frac{R_{\max}}{1-\gamma}$.
Lemma 1 can be proved by an induction argument using the triangle inequality and the uniform boundedness of the reward function, as shown in Appendix A.
3 Main results
We present our finite-time analysis for synchronous and asynchronous double Q-learning in this section, followed by a sketch of the proof for the synchronous case, which captures our main techniques. The detailed proofs of all results are provided in the Supplementary Materials.
3.1 Synchronous double Q-learning
Since the update of the two Q-estimators is symmetric, we can characterize the convergence rate of either Q-estimator, e.g., $Q^A$, to the global optimum $Q^*$. To this end, we first derive two important properties of double Q-learning that are crucial to our finite-time convergence analysis.
The first property captures the stochastic error between the two Q-estimators. Since double Q-learning alternates updates between these two estimators, such an error process must decay to zero in order for double Q-learning to converge. Furthermore, how fast this error converges determines the overall convergence rate of double Q-learning. The following proposition (an informal restatement of Proposition 1 in Section B.1) shows that the error process $\|Q^B_t - Q^A_t\|_\infty$ can be bounded blockwisely by an exponentially decreasing sequence $\{G_q\}_{q \ge 0}$. Conceptually, as illustrated in Figure 1, this error process is upper-bounded by the blue-colored piecewise linear curve.
Proposition 1.
(Informal) Consider synchronous double Q-learning under a polynomial learning rate $\alpha_t = 1/t^\omega$ with $\omega \in (0, 1)$. We divide the time horizon into blocks $[\tau_q, \tau_{q+1})$ for $q \ge 0$, where the block boundaries $\{\tau_q\}$ grow polynomially and the bounds $\{G_q\}$ decrease geometrically (see Section B.1). Fix $\epsilon > 0$. Then for any $q$ such that $G_q \ge \epsilon$ and under certain conditions on $\tau_1$ (see Section B.1), we have, with high probability,

$\|Q^B_t - Q^A_t\|_\infty \le G_q \quad \text{for all } t \in [\tau_q, \tau_{q+1}),$

where the failure probability and the positive constants involved are specified in Section B.1.
Proposition 1 implies that the two Q-estimators approach each other asymptotically, but it does not necessarily imply that they converge to the optimal action-value function $Q^*$. The next proposition (an informal restatement of Proposition 2 in Section B.2) shows that as long as the high-probability event in Proposition 1 holds, the error process between either Q-estimator (say $Q^A$) and the optimal Q-function, $\|Q^A_t - Q^*\|_\infty$, can be bounded blockwisely by an exponentially decreasing sequence $\{D_q\}_{q \ge 0}$ over the same blocks. Conceptually, as illustrated in Figure 1, this error process is upper-bounded by the yellow-colored piecewise linear curve.
Proposition 2.
(Informal) Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t = 1/t^\omega$ with $\omega \in (0, 1)$. We divide the time horizon into blocks $[\tau_q, \tau_{q+1})$ for $q \ge 0$, where the block boundaries $\{\tau_q\}$ grow polynomially and the bounds $\{D_q\}$ decrease geometrically (see Section B.2). Fix $\epsilon > 0$. Then for any $q$ such that $D_q \ge \epsilon$ and under certain conditions on $\tau_1$ (see Section B.2), we have, with high probability conditioned on the events $E$ and $F$,

$\|Q^A_t - Q^*\|_\infty \le D_q \quad \text{for all } t \in [\tau_q, \tau_{q+1}),$

where $E$ and $F$ denote certain events defined in (12) and (13) in Section B.2, and the positive constants involved are specified in Section B.2.
As illustrated in Figure 1, the two block sequences in Propositions 1 and 2 can be chosen to coincide with each other. Combining the above two properties with further mathematical arguments yields the following main theorem, which characterizes the convergence rate of double Q-learning. We provide a proof sketch for Theorem 1 in Section 3.3, which explains the main steps used to obtain the supporting Propositions 1 and 2 and how they further yield the main theorem.
Theorem 1.
Fix $\epsilon > 0$ and $\delta \in (0, 1)$. Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t = 1/t^\omega$ with $\omega \in (0, 1)$. Let $Q^A_T(s, a)$ be the value of $Q^A$ for a state-action pair $(s, a)$ at time $T$. Then we have $\mathbb{P}\big[\max_{s,a} |Q^A_T(s, a) - Q^*(s, a)| \le \epsilon\big] \ge 1 - \delta$, given that

(5) $T = \tilde{\Omega}\left(\left(\frac{1}{(1-\gamma)^6 \epsilon^2}\right)^{\frac{1}{\omega}} + \left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right),$

where $\tilde{\Omega}$ hides logarithmic factors (including the dependence on $|\mathcal{S}|$, $|\mathcal{A}|$, and $1/\delta$) and $V_{\max} = \frac{R_{\max}}{1-\gamma}$.
Theorem 1 provides a finite-time convergence guarantee in the high-probability sense for synchronous double Q-learning. Specifically, double Q-learning attains an $\epsilon$-accurate optimal Q-function with high probability within the number of iterations given in (5). Such a result can be further understood by considering the following two regimes. In the high-accuracy regime, in which $\epsilon$ is small, the dependence on $\epsilon$ dominates, and the time complexity is given by $\tilde{\Omega}\big(\big(\frac{1}{(1-\gamma)^6\epsilon^2}\big)^{\frac{1}{\omega}}\big)$, which is optimized as $\omega$ approaches 1. In the low-accuracy regime, in which $\epsilon$ is relatively large, the dependence on $\frac{1}{1-\gamma}$ dominates, and the time complexity can be optimized at $\omega = \frac{6}{7}$, which yields $\tilde{\Omega}\big(\frac{1}{(1-\gamma)^7 \epsilon^{7/3}}\big)$.
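Assuming the iteration complexity takes the form $T \asymp \big(\tfrac{1}{(1-\gamma)^6\epsilon^2}\big)^{1/\omega} + \big(\tfrac{1}{1-\gamma}\big)^{1/(1-\omega)}$ up to logarithmic factors (an assumption of this sketch), the optimal $\omega$ in the low-accuracy regime follows from balancing the two terms' exponents of $\frac{1}{1-\gamma}$:

```latex
\frac{6}{\omega} = \frac{1}{1-\omega}
\;\Longrightarrow\; 6(1-\omega) = \omega
\;\Longrightarrow\; \omega = \frac{6}{7},
\qquad\text{which gives}\qquad
T = \tilde{\Omega}\!\left(\frac{1}{(1-\gamma)^{7}\,\epsilon^{7/3}}\right).
```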
Furthermore, Theorem 1 corroborates the design effectiveness of double Q-learning, which overcomes the overestimation issue and hence achieves better accuracy by making less aggressive progress in each update. Specifically, comparing Theorem 1 with the time complexity bounds of vanilla synchronous Q-learning under a polynomial learning rate in Even-Dar and Mansour (2003) and Wainwright (2019) indicates that in the high-accuracy regime, double Q-learning achieves the same convergence rate as vanilla Q-learning in terms of the order-level dependence on $\frac{1}{\epsilon}$. Clearly, the high-accuracy design of double Q-learning dominates the performance. In the low-accuracy regime (which is not what double Q-learning is designed for), double Q-learning achieves a slightly weaker convergence rate than vanilla Q-learning in Even-Dar and Mansour (2003); Wainwright (2019) in terms of the dependence on $\frac{1}{1-\gamma}$, because its less aggressive progress dominates the performance.
3.2 Asynchronous Double Q-learning
In this subsection, we study asynchronous double Q-learning and provide its finite-time convergence result.
Differently from synchronous double Q-learning, in which all state-action pairs are visited in each update of the chosen Q-estimator, asynchronous double Q-learning visits only one state-action pair in each update of the chosen Q-estimator. Therefore, we make the following standard assumption on the exploration strategy (Even-Dar and Mansour, 2003):
Assumption 1.
(Covering number) There exists a covering number $L$ such that in $L$ consecutive updates of either the $Q^A$ or the $Q^B$ estimator, all the state-action pairs of the chosen Q-estimator are visited at least once.
Such a condition on the exploration is usually necessary for the finite-time analysis of asynchronous Q-learning. The same assumption was taken in Even-Dar and Mansour (2003). Qu and Wierman (2020) proposed a mixing-time condition, which is in the same spirit.
Assumption 1 essentially requires the sampling strategy to have good visitation coverage over all state-action pairs. Specifically, Assumption 1 guarantees that any $L$ consecutive updates of $Q^A$ visit each state-action pair of $Q^A$ at least once, and the same holds for $Q^B$. Since $2L$ iterations of asynchronous double Q-learning must make at least $L$ updates for either $Q^A$ or $Q^B$, Assumption 1 further implies that any state-action pair must be visited at least once during $2L$ iterations of the algorithm. In fact, our analysis allows a certain relaxation of Assumption 1 by only requiring each state-action pair to be visited during an interval with a certain probability. In such a case, we can still derive a finite-time bound by additionally dealing with a conditional probability.
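To make the covering requirement concrete, the following small helper (our own illustration, not part of the algorithm) checks whether a candidate $L$ is a valid covering number for a recorded sequence of visited state-action pairs:

```python
def satisfies_covering(visits, L, all_pairs):
    """Check Assumption 1 empirically: every window of L consecutive
    visited state-action pairs must contain each pair at least once."""
    all_pairs = set(all_pairs)
    if len(visits) < L:
        return False
    return all(
        all_pairs <= set(visits[i:i + L])
        for i in range(len(visits) - L + 1)
    )

# A cyclic exploration of a 2-state, 2-action MDP covers all four
# pairs in every window of length 4, but not in windows of length 3.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
trajectory = pairs * 5
print(satisfies_covering(trajectory, 4, pairs))  # True
print(satisfies_covering(trajectory, 3, pairs))  # False
```

The cyclic trajectory attains the smallest possible covering number, $L = |\mathcal{S}||\mathcal{A}| = 4$; a randomized exploration strategy would generally need a larger $L$.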
Next, we provide the finite-time result for asynchronous double Q-learning in the following theorem.
Theorem 2.
Fix $\epsilon > 0$ and $\delta \in (0, 1)$. Consider asynchronous double Q-learning under a polynomial learning rate $\alpha_t = 1/t^\omega$ with $\omega \in (0, 1)$. Suppose Assumption 1 holds. Let $Q^A_T(s, a)$ be the value of $Q^A$ for a state-action pair $(s, a)$ at time $T$. Then we have $\mathbb{P}\big[\max_{s,a} |Q^A_T(s, a) - Q^*(s, a)| \le \epsilon\big] \ge 1 - \delta$, given that
(6) 
Comparison of Theorems 1 and 2 indicates that the finite-time result of asynchronous double Q-learning matches that of synchronous double Q-learning in the order-level dependence on $\epsilon$ and $\frac{1}{1-\gamma}$. The difference lies in the extra dependence on the covering number $L$ in Theorem 2. Since synchronous double Q-learning visits all state-action pairs (i.e., takes $|\mathcal{S}||\mathcal{A}|$ sample updates) at each iteration, whereas asynchronous double Q-learning visits only one state-action pair (i.e., takes only one sample update) at each iteration, a more reasonable comparison between the two should be in terms of the overall sample complexity. In this sense, synchronous and asynchronous double Q-learning have sample complexities of $|\mathcal{S}||\mathcal{A}|\, T_s$ (where $T_s$ is given in (5)) and $T_a$ (where $T_a$ is given in (6)), respectively. Since in general the covering number satisfies $L \ge |\mathcal{S}||\mathcal{A}|$, synchronous double Q-learning is more efficient than asynchronous double Q-learning in terms of the overall sample complexity.
3.3 Proof Sketch of Theorem 1
In this subsection, we outline the technical proof of Theorem 1 and summarize the key ideas behind it. The detailed proof can be found in Appendix B.
Our goal is to study the finite-time convergence of the error $\|Q^A_t - Q^*\|_\infty$ between one Q-estimator and the optimal Q-function (this is without loss of generality due to the symmetry of the two estimators). To this end, our proof includes: (a) Part I, which analyzes the stochastic error propagation $\|Q^B_t - Q^A_t\|_\infty$ between the two Q-estimators; (b) Part II, which analyzes the error dynamics between one Q-estimator and the optimum conditioned on the error event in Part I; and (c) Part III, which bounds the unconditional error $\|Q^A_t - Q^*\|_\infty$. We describe each of the three parts in more detail below.
Part I: Bounding $\|Q^B_t - Q^A_t\|_\infty$ (see Proposition 1). The main idea is to upper bound $\|Q^B_t - Q^A_t\|_\infty$ by a decreasing sequence $\{G_q\}$ blockwisely with high probability, where each block $q$ (with $q \ge 0$) is defined by the time interval $[\tau_q, \tau_{q+1})$. The proof consists of the following four steps.
Step 1 (see Lemma 2): We characterize the dynamics of $u_t := Q^B_t - Q^A_t$ as an SA algorithm of the form

$u_{t+1}(s, a) = (1 - \alpha_t)\, u_t(s, a) + \alpha_t \big(h_t(s, a) + z_t(s, a)\big),$

where $h_t$ is a contractive mapping of $u_t$, and $\{z_t\}$ is a martingale difference sequence.
Step 2 (see Lemma 3): We derive lower and upper bounds on $u_t$ via two auxiliary sequences, one deterministic sequence driven by $G_q$ and one stochastic sequence driven by the martingale difference sequence from Step 1, which together sandwich $u_t(s, a)$ for any $t \ge \tau_q$, any state-action pair $(s, a)$, and any block $q$.
Step 3 (see Lemma 5 and Lemma 6): We bound $u_t$ blockwisely using an induction argument. Namely, we prove that $\|u_t\|_\infty \le G_q$ for all $t \in [\tau_q, \tau_{q+1})$ holds for every block $q$. By induction, we first observe that the claim holds for $q = 0$. Given any state-action pair $(s, a)$, we assume that the claim holds for block $q$. Then we show that it holds for block $q+1$, which follows by bounding the deterministic and stochastic sequences separately in Lemma 5 and Lemma 6, respectively.
Step 4 (see Section B.1.4): We apply the union bound (Lemma 8) to obtain the blockwise bound for all state-action pairs and all blocks.
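To give intuition for Steps 1–3, the following toy simulation (illustrative only; the constants, noise distribution, and scalar setting are our own choices) runs an SA recursion of the same shape, a contraction plus a bounded martingale-difference noise under a polynomial step size, and shows the iterate decaying toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, omega = 0.8, 0.85   # contraction factor and step-size decay
x = 1.0                    # initial error scale
for t in range(1, 200001):
    alpha = 1.0 / t**omega
    w = rng.uniform(-0.5, 0.5)  # bounded zero-mean (martingale difference) noise
    # SA recursion: contraction of x plus noise, weighted by the step size
    x = (1 - alpha) * x + alpha * (gamma * x + w)
# x has contracted from 1.0 to a small neighborhood of 0
```

The deterministic part contracts at rate $1 - \alpha_t(1-\gamma)$ per step, while the accumulated noise is controlled by the decaying step size; the blockwise argument in Steps 3 and 4 makes this decay quantitative and uniform over state-action pairs.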
Part II: Conditionally bounding $\|Q^A_t - Q^*\|_\infty$ (see Proposition 2). We upper bound $\|Q^A_t - Q^*\|_\infty$ by a decreasing sequence $\{D_q\}$ blockwisely conditioned on the following two events:

Event $E$: $\|Q^B_t - Q^A_t\|_\infty$ is upper bounded properly (see (12) in Section B.2), and

Event $F$: there are sufficiently many updates of $Q^A$ in each block (see (13) in Section B.2).
The proof of Proposition 2 consists of the following four steps.
Step 1 (see Lemma 10): We design a special relationship (illustrated in Figure 1) between the blockwise bounds $\{G_q\}$ and $\{D_q\}$ and their block separations.
Step 2 (see Lemma 11): We characterize the dynamics of the iteration residual $r_t := Q^A_t - Q^*$ as an SA algorithm: when $Q^A$ is chosen to be updated at iteration $t$, the residual is driven by the error between the Bellman operator and its sample-based empirical estimator, which is thus a martingale difference sequence, together with the difference $u_t = Q^B_t - Q^A_t$ defined in Part I.
Step 3 (see Lemma 12): We provide upper and lower bounds on $r_t$ via two auxiliary sequences, one deterministic sequence driven by $D_q$ and one stochastic sequence driven by the martingale difference sequence from Step 2, which together sandwich $r_t(s, a)$ for all $t \ge \tau_q$, all state-action pairs $(s, a)$, and all blocks $q$. In particular, if $Q^A$ is not updated at some iteration, then both sequences simply retain their values from the previous iteration.
Step 4 (see Lemma 13, Lemma 14 and Section B.2.4): Similarly to Steps 3 and 4 in Part I, we conditionally bound $\|r_t\|_\infty$ by $D_q$ for $t \in [\tau_q, \tau_{q+1})$ via bounding the deterministic and stochastic sequences and further taking the union bound.
Part III: Bounding $\|Q^A_t - Q^*\|_\infty$ (see Section B.3). We combine the results in the first two parts and provide a high-probability bound on $\|Q^A_T - Q^*\|_\infty$ with further probabilistic arguments, which exploit the high-probability bounds established in Proposition 1 and in Lemma 15.
4 Conclusion
In this paper, we provide the first finite-time results for double Q-learning, which characterize how fast double Q-learning converges under both synchronous and asynchronous implementations. For the synchronous case, we show that it achieves an $\epsilon$-accurate optimal Q-function with probability at least $1 - \delta$ within the number of iterations given in Theorem 1. A similar scaling order on $\epsilon$ and $\frac{1}{1-\gamma}$ also applies for asynchronous double Q-learning, but with extra dependence on the covering number. We develop new techniques to bound the error between two correlated stochastic processes, which may be of independent interest.
Acknowledgements
The work was supported in part by the U.S. National Science Foundation under the grant CCF-1761506 and the start-up fund of the Southern University of Science and Technology (SUSTech), China.
Broader Impact
Reinforcement learning has achieved great success in areas such as robotics and game playing, and has thus attracted broad interest and many potential real-world applications. Double Q-learning is a commonly used technique in deep reinforcement learning to improve the stability and speed of deep Q-learning. In this paper, we provide a fundamental analysis of the convergence rate of double Q-learning, which theoretically justifies its empirical success in practice. Such a theory also provides practitioners with desirable performance guarantees for further developing this technique into various transferable technologies.
Supplementary Materials
Appendix A Proof of Lemma 1
We prove Lemma 1 by induction.
First, the initial case is satisfied by the assumption, i.e., $\|Q^A_0\|_\infty \le V_{\max}$ and $\|Q^B_0\|_\infty \le V_{\max}$. (In practice we usually initialize the algorithm as $Q^A_0 = Q^B_0 = 0$.) Next, we assume that $\|Q^A_t\|_\infty \le V_{\max}$ and $\|Q^B_t\|_\infty \le V_{\max}$ hold at time $t$. It remains to show that these conditions still hold at time $t+1$.

Suppose $Q^A$ is chosen to be updated at time $t$. Only the updated entries change, and since $|R_t| \le R_{\max}$, we observe that

$|Q^A_{t+1}(s, a)| \le (1 - \alpha_t)\, |Q^A_t(s, a)| + \alpha_t \big(|R_t| + \gamma\, \|Q^B_t\|_\infty\big) \le (1 - \alpha_t) \frac{R_{\max}}{1-\gamma} + \alpha_t \Big(R_{\max} + \frac{\gamma R_{\max}}{1-\gamma}\Big) = \frac{R_{\max}}{1-\gamma} = V_{\max}.$

Similarly, we have $\|Q^B_{t+1}\|_\infty \le V_{\max}$, and the bound $\|Q^B_{t+1} - Q^A_{t+1}\|_\infty \le 2V_{\max}$ follows from the triangle inequality. Thus we complete the proof.
Appendix B Proof of Theorem 1
In this appendix, we provide a detailed proof of Theorem 1. Our proof includes: (a) Part I, which analyzes the stochastic error propagation $\|Q^B_t - Q^A_t\|_\infty$ between the two Q-estimators; (b) Part II, which analyzes the error dynamics between one Q-estimator and the optimum conditioned on the error event in Part I; and (c) Part III, which bounds the unconditional error $\|Q^A_t - Q^*\|_\infty$. We describe each of the three parts in more detail below.
B.1 Part I: Bounding $\|Q^B_t - Q^A_t\|_\infty$
The main idea is to upper bound $\|Q^B_t - Q^A_t\|_\infty$ by a decreasing sequence $\{G_q\}$ blockwisely with high probability, where each block or epoch $q$ (with $q \ge 0$) is defined by the time interval $[\tau_q, \tau_{q+1})$.
Proposition 1.
Fix $\epsilon > 0$ and $\delta \in (0, 1)$. Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t = 1/t^\omega$ with $\omega \in (0, 1)$. Let $\{G_q\}$ be a geometrically decreasing sequence of bounds with $G_0 = 2V_{\max}$, and let $\{\tau_q\}$ denote the corresponding epoch boundaries, with $\tau_1$ as the finishing time of the first epoch, which is required to be sufficiently large.
Then for any $q$ such that $G_q \ge \epsilon$, we have, with high probability, $\|Q^B_t - Q^A_t\|_\infty \le G_q$ for all $t \in [\tau_q, \tau_{q+1})$.
The proof of Proposition 1 consists of the following four steps.
Step 1: Characterizing the dynamics of $Q^B_t - Q^A_t$
We first characterize the dynamics of $u_t := Q^B_t - Q^A_t$ as a stochastic approximation (SA) algorithm in this step.
Lemma 2.
Proof.
Algorithm 1 indicates that at each time, either $Q^A$ or $Q^B$ is updated with equal probability. When updating $Q^A$ at time $t$, for each state-action pair $(s, a)$ we have

$Q^A_{t+1}(s, a) = Q^A_t(s, a) + \alpha_t \big(R_t + \gamma\, Q^B_t(s', a^*) - Q^A_t(s, a)\big), \quad a^* = \arg\max_{a'} Q^A_t(s', a'),$

where $s'$ denotes the sampled next state of the pair $(s, a)$. Similarly, when updating $Q^B$, we have

$Q^B_{t+1}(s, a) = Q^B_t(s, a) + \alpha_t \big(R_t + \gamma\, Q^A_t(s', b^*) - Q^B_t(s, a)\big), \quad b^* = \arg\max_{a'} Q^B_t(s', a').$
Therefore, we can rewrite the dynamics of $u_t$ as an SA recursion, where
Thus, we have