# Finite-Time Analysis for Double Q-learning

## Abstract

Although Q-learning is one of the most successful algorithms for finding the best action-value function (and thus the optimal policy) in reinforcement learning, its implementation often suffers from large overestimation of Q-function values incurred by random sampling. The double Q-learning algorithm proposed in Hasselt (2010) overcomes this overestimation issue by randomly switching the update between two Q-estimators, and has thus gained significant popularity in practice. However, the theoretical understanding of double Q-learning is rather limited. So far only asymptotic convergence has been established, which does not characterize how fast the algorithm converges. In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum by taking $\tilde{\Omega}\left(\left(\frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}}+\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations, where $\omega\in(0,1)$ is the decay parameter of the learning rate, and $\gamma$ is the discount factor. Our analysis develops novel techniques to derive finite-time bounds on the difference between two inter-connected stochastic processes, which is new to the literature of stochastic approximation.

## 1 Introduction

Q-learning is one of the most successful classes of reinforcement learning (RL) algorithms, which aims at finding the optimal action-value function or Q-function (and thus the associated optimal policy) via off-policy data samples. The Q-learning algorithm was first proposed by Watkins and Dayan (1992), and since then, it has been widely used in various applications including robotics (Tai and Liu, 2016), autonomous driving (Okuyama et al., 2018), and video games (Mnih et al., 2015), to name a few. The theoretical performance of Q-learning has also been intensively explored. The asymptotic convergence has been established in Tsitsiklis (1994); Jaakkola et al. (1994); Borkar and Meyn (2000); Melo (2001); Lee and He (2019). The non-asymptotic (i.e., finite-time) convergence rate of Q-learning was first obtained in Szepesvári (1998), and has been further studied in (Even-Dar and Mansour, 2003; Shah and Xie, 2018; Wainwright, 2019; Beck and Srikant, 2012; Chen et al., 2020) for synchronous Q-learning and in (Even-Dar and Mansour, 2003; Qu and Wierman, 2020) for asynchronous Q-learning.

One major weakness of Q-learning arises in practice due to the large overestimation of the action-value function (Hasselt, 2010; Hasselt et al., 2016). Practical implementation of Q-learning involves using the maximum sampled Q-function to estimate the maximum expected Q-function (where the expectation is taken over the randomness of the reward). Such an estimation often yields a large positive bias (Hasselt, 2010), and causes Q-learning to perform rather poorly. To address this issue, double Q-learning was proposed in Hasselt (2010), which keeps two Q-estimators (i.e., estimators for Q-functions), one for estimating the maximum Q-function value and the other one for update, and continuously changes the roles of the two Q-estimators in a random manner. It was shown in Hasselt (2010) that such an algorithm effectively overcomes the overestimation issue of vanilla Q-learning. In Hasselt et al. (2016), double Q-learning was further demonstrated to substantially improve the performance of Q-learning with deep neural networks (DQNs) for playing Atari 2600 games. It has inspired many variants (Zhang et al., 2017; Abed-alguni and Ottom, 2018), found many applications (Zhang et al., 2018a, b), and has become one of the most common techniques for applying Q-learning type algorithms (Hessel et al., 2018).

Despite its tremendous empirical success and popularity in practice, theoretical understanding of double Q-learning is rather limited. Only the asymptotic convergence was provided in Hasselt (2010); Weng et al. (2020c). There has been no non-asymptotic result on how fast double Q-learning converges. From the technical standpoint, such finite-time analysis for double Q-learning does not follow readily from that for vanilla Q-learning, because it involves two randomly updated Q-estimators, and the coupling between these two random paths significantly complicates the analysis. This goes well beyond the existing techniques for analyzing vanilla Q-learning, which handle the random update of a single Q-estimator. Thus, the goal of this paper is to develop new finite-time analysis techniques that handle the two inter-connected random path updates in double Q-learning and provide the convergence rate.

### 1.1 Our contributions

The main contribution of this paper lies in providing the first finite-time analysis for double Q-learning with both the synchronous and asynchronous implementations.

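• We show that synchronous double Q-learning with a learning rate $\alpha_t = 1/t^{\omega}$ (where $\omega\in(0,1)$) attains an $\epsilon$-accurate global optimum with probability at least $1-\delta$ by taking $\tilde{\Omega}\left(\left(\frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}}+\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations (suppressing logarithmic factors, which carry the dependence on $\delta$, $|\mathcal{S}|$ and $|\mathcal{A}|$), where $\gamma\in(0,1)$ is the discount factor, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the sizes of the state space and action space, respectively.

• We further show that under the same accuracy and high probability requirements, asynchronous double Q-learning takes $\tilde{\Omega}\left(\left(\frac{L^4}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}}+\left(\frac{L^2}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations, where $L$ is the covering number specified by the exploration strategy.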

Our results corroborate the design goal of double Q-learning, which opts for better accuracy by making less aggressive progress during the execution in order to avoid overestimation. Specifically, our results imply that in the high accuracy regime, double Q-learning achieves the same convergence rate as vanilla Q-learning in terms of the order-level dependence on $\epsilon$, which further indicates that the high accuracy design of double Q-learning dominates the less aggressive progress in such a regime. In the low-accuracy regime, which is not what double Q-learning is designed for, the cautious progress of double Q-learning yields a slightly weaker convergence rate than Q-learning in terms of the dependence on $1-\gamma$.

From the technical standpoint, our proof develops new techniques beyond the existing finite-time analysis of the vanilla Q-learning with a single random iteration path. More specifically, we model the double Q-learning algorithm as two alternating stochastic approximation (SA) problems, where one SA captures the error propagation between the two Q-estimators, and the other captures the error dynamics between the Q-estimator and the global optimum. For the first SA, we develop new techniques to provide the finite-time bounds on the two inter-related stochastic iterations of Q-functions. Then we develop new tools to bound the convergence of Bernoulli-controlled stochastic iterations of the second SA conditioned on the first SA.

### 1.2 Related work

Due to the rapidly growing literature on Q-learning, we review only the theoretical results that are highly relevant to our work.

Q-learning was first proposed in Watkins and Dayan (1992) under finite state-action space. Its asymptotic convergence has been established in Tsitsiklis (1994); Jaakkola et al. (1994); Borkar and Meyn (2000); Melo (2001) through studying various general SA algorithms that include Q-learning as a special case. Along this line, Lee and He (2019) characterized Q-learning as a switched linear system and applied the results of Borkar and Meyn (2000) to show the asymptotic convergence, which was also extended to other Q-learning variants. Another line of research focuses on the finite-time analysis of Q-learning, which can capture the convergence rate. Such non-asymptotic results were first obtained in Szepesvári (1998). A more comprehensive work (Even-Dar and Mansour, 2003) provided finite-time results for both synchronous and asynchronous Q-learning. Both Szepesvári (1998) and Even-Dar and Mansour (2003) showed that with linear learning rates, the convergence rate of Q-learning can be exponentially slow as a function of $\frac{1}{1-\gamma}$. To handle this, the so-called rescaled linear learning rate was introduced to avoid such an exponential dependence in synchronous Q-learning (Wainwright, 2019; Chen et al., 2020) and asynchronous Q-learning (Qu and Wierman, 2020). The finite-time convergence of synchronous Q-learning was also analyzed with constant step sizes (Beck and Srikant, 2012; Chen et al., 2020). Moreover, the polynomial learning rate, which is also the focus of this work, was investigated for both synchronous (Even-Dar and Mansour, 2003; Wainwright, 2019) and asynchronous Q-learning (Even-Dar and Mansour, 2003). In addition, it is worth mentioning that Shah and Xie (2018) applied the nearest neighbor approach to handle MDPs on infinite state space.

In contrast to the above extensive studies of vanilla Q-learning, theoretical understanding of double Q-learning is limited. The only theoretical guarantee is the asymptotic convergence provided by Hasselt (2010); Weng et al. (2020c), which does not characterize how fast double Q-learning converges. This paper provides the first non-asymptotic (i.e., finite-time) analysis for double Q-learning.

The vanilla Q-learning algorithm has also been studied for the function approximation case, i.e., the Q-function is approximated by a class of parameterized functions. In contrast to the tabular case, even with linear function approximation, Q-learning has been shown not to converge in general (Baird, 1995). Strong assumptions are typically imposed to guarantee the convergence of Q-learning with function approximation (Bertsekas and Tsitsiklis, 1996; Zou et al., 2019; Chen et al., 2019; Du et al., 2019; Xu and Gu, 2019; Weng et al., 2020a, b). Regarding double Q-learning, it remains an open question how to design double Q-learning algorithms under function approximation and under what conditions they have theoretically guaranteed convergence.

## 2 Preliminaries on Q-learning and Double Q-learning

In this section, we introduce the Q-learning and the double Q-learning algorithms.

### 2.1 Q-learning

We consider a $\gamma$-discounted Markov decision process (MDP) with a finite state space $\mathcal{S}$ and a finite action space $\mathcal{A}$. The transition probability of the MDP is given by $P$, that is, $P(\cdot|s,a)$ denotes the probability distribution of the next state given the current state $s$ and action $a$. We consider a random reward function $R_t$ at time $t$ drawn from a fixed distribution, where $\mathbb{E}[R_t(s,a,s')]=R^{s'}_{sa}$ and $s'$ denotes the next state starting from $(s,a)$. In addition, we assume $|R_t|\le R_{\max}$. A policy $\pi:=\pi(\cdot|s)$ characterizes the conditional probability distribution over the action space $\mathcal{A}$ given each state $s\in\mathcal{S}$.

The action-value function (i.e., Q-function) for a given policy is defined as

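$$Q^{\pi}(s,a) := \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}R_t(s,\pi(s),s')\,\Big|\,s_0=s,\,a_0=a\right] = \mathbb{E}_{\substack{s'\sim P(\cdot|s,a)\\ a'\sim\pi(\cdot|s')}}\left[R^{s'}_{sa}+\gamma Q^{\pi}(s',a')\right], \tag{1}$$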

where $\gamma\in(0,1)$ is the discount factor. Q-learning aims to find the Q-function of an optimal policy $\pi^*$ that maximizes the accumulated reward. The existence of such a $\pi^*$ has been proved in the classical MDP theory (Bertsekas and Tsitsiklis, 1996). The corresponding optimal Q-function, denoted as $Q^*$, is known as the unique fixed point of the Bellman operator $\mathcal{T}$ given by

$$\mathcal{T}Q(s,a)=\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[R^{s'}_{sa}+\gamma\max_{a'\in U(s')}Q(s',a')\right], \tag{2}$$

where $U(s')\subset\mathcal{A}$ is the admissible set of actions at state $s'$. It can be shown that the Bellman operator $\mathcal{T}$ is $\gamma$-contractive in the supremum norm $\|Q\|:=\max_{s,a}|Q(s,a)|$, i.e., it satisfies

$$\|\mathcal{T}Q-\mathcal{T}Q'\|\le\gamma\|Q-Q'\|. \tag{3}$$
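The contraction property (3) can be checked numerically on a small example. The following is a minimal sketch, assuming a randomly generated tabular MDP in which every action is admissible at every state; the helper names `bellman_operator`, `sup_norm_diff` and `random_mdp` are illustrative and not part of the original analysis.

```python
import random

def bellman_operator(Q, P, R, gamma):
    """(TQ)(s,a) = sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma * max_{a'} Q[s'][a'])."""
    n_states, n_actions = len(Q), len(Q[0])
    return [
        [
            sum(P[s][a][s2] * (R[s][a][s2] + gamma * max(Q[s2]))
                for s2 in range(n_states))
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]

def sup_norm_diff(Q1, Q2):
    """Supremum norm of Q1 - Q2 over all state-action pairs."""
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))

def random_mdp(n_states, n_actions, rng):
    """Random transition kernel P and mean rewards R (illustrative only)."""
    P, R = [], []
    for _ in range(n_states):
        P_s, R_s = [], []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            P_s.append([x / total for x in w])
            R_s.append([rng.uniform(-1.0, 1.0) for _ in range(n_states)])
        P.append(P_s)
        R.append(R_s)
    return P, R

rng = random.Random(0)
gamma = 0.9
P, R = random_mdp(4, 3, rng)
Q1 = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(4)]
Q2 = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(4)]

lhs = sup_norm_diff(bellman_operator(Q1, P, R, gamma),
                    bellman_operator(Q2, P, R, gamma))
rhs = gamma * sup_norm_diff(Q1, Q2)
print(lhs <= rhs)  # the contraction inequality (3) on this instance
```

The reward terms cancel in the difference of the two operator outputs, so only the $\gamma\max$ terms contribute, which is why the bound holds for any reward model.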

The goal of Q-learning is to find $Q^*$, which further yields $\pi^*(s)=\arg\max_{a\in U(s)}Q^*(s,a)$. In practice, however, exact evaluation of the Bellman operator (2) is usually infeasible due to the lack of knowledge of the transition kernel of the MDP and the randomness of the reward. Instead, Q-learning draws random samples to estimate the Bellman operator and iteratively learns $Q^*$ as

$$Q_{t+1}(s,a)=(1-\alpha_t(s,a))Q_t(s,a)+\alpha_t(s,a)\left(R_t(s,a,s')+\gamma\max_{a'\in U(s')}Q_t(s',a')\right), \tag{4}$$

where $R_t$ is the sampled reward, $s'$ is sampled by the transition probability given $(s,a)$, and $\alpha_t(s,a)$ denotes the learning rate.
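The update (4) can be sketched in a few lines of code. This is a minimal illustration, assuming a tabular Q-function stored as a nested list `Q[s][a]` and assuming all actions are admissible at the next state; the function name `q_learning_update` is hypothetical.

```python
def q_learning_update(Q, s, a, reward, s_next, alpha, gamma):
    """Apply one sample-based update of the form (4) to Q[s][a] in place."""
    # Sampled estimate of the Bellman operator at (s, a).
    target = reward + gamma * max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]

# Two states, two actions.
Q = [[0.0, 0.0], [1.0, 2.0]]
new_value = q_learning_update(Q, s=0, a=1, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(new_value)  # 0.5 * 0.0 + 0.5 * (1.0 + 0.9 * 2.0)
```

Note that the same table `Q` is used both to select the maximizing action and to evaluate it, which is the source of the overestimation discussed next.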

### 2.2 Double Q-learning

Although Q-learning is a commonly used RL algorithm to find the optimal policy, it can suffer from overestimation in practice (Smith and Winkler, 2006). To overcome this issue, Hasselt (2010) proposed double Q-learning given in Algorithm 1.

Double Q-learning maintains two Q-estimators (i.e., Q-tables): $Q^A$ and $Q^B$. At each iteration of Algorithm 1, one Q-table is randomly chosen to be updated. Then this chosen Q-table generates a greedy optimal action, and the other Q-table is used for estimating the corresponding Bellman operator for updating the chosen table. Specifically, if $Q^A$ is chosen to be updated, we use $Q^A$ to obtain the optimal action $a^*$ and then estimate the corresponding Bellman operator using $Q^B$. As shown in Hasselt (2010), $\mathbb{E}[Q^B(s',a^*)]$ is likely smaller than $\max_a\mathbb{E}[Q^A(s',a)]$, where the expectation is taken over the randomness of the reward for the same state-action pair. In this way, such a two-estimator framework of double Q-learning can effectively reduce the overestimation.
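The effect of the two-estimator construction can be illustrated with a small Monte Carlo experiment. The sketch below is a simplified stand-in rather than the setting analyzed in this paper: it assumes a single state with several actions whose true values are all zero, Gaussian reward noise, and two independent sample sets per action. The function name `estimate_biases` and all parameter values are illustrative assumptions.

```python
import random

def estimate_biases(n_actions=10, n_samples=10, n_trials=2000, seed=0):
    """Average single-estimator and double-estimator values of max_a E[X_a].

    Every action has true mean 0, so the true maximum is 0 and the returned
    averages are Monte Carlo estimates of the bias of each estimator.
    """
    rng = random.Random(seed)
    single_total, double_total = 0.0, 0.0
    for _ in range(n_trials):
        # Two independent sets of sample means, playing the roles of Q^A and Q^B.
        means_a = [sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
                   for _ in range(n_actions)]
        means_b = [sum(rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
                   for _ in range(n_actions)]
        # Single estimator: select and evaluate with the same samples.
        single_total += max(means_a)
        # Double estimator: select with set A, evaluate with set B.
        a_star = means_a.index(max(means_a))
        double_total += means_b[a_star]
    return single_total / n_trials, double_total / n_trials

single_bias, double_bias = estimate_biases()
print(single_bias, double_bias)
```

In this toy setting the single estimator is biased upward, while the double estimator is centered near zero because the selected index is independent of the samples used for evaluation.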

Synchronous and asynchronous double Q-learning: In this paper, we study the finite-time convergence rate of double Q-learning in two different settings: synchronous and asynchronous implementations. For synchronous double Q-learning (as shown in Algorithm 1), all the state-action pairs of the chosen Q-estimator are visited simultaneously at each iteration. For the asynchronous case, only one state-action pair is updated in the chosen Q-table. Specifically, in the latter case, we sample a trajectory $\{(s_t,a_t,R_t,i_t)\}_{t\ge0}$ under a certain exploration strategy, where $i_t\in\{A,B\}$ denotes the index of the chosen Q-table at time $t$. Then the two Q-tables are updated based on the following rule:

$$Q^{i}_{t+1}(s,a)=\begin{cases}Q^{i}_{t}(s,a), & (s,a)\neq(s_t,a_t)\text{ or }i\neq i_t;\\[4pt] (1-\alpha_t(s,a))Q^{i}_{t}(s,a)+\alpha_t(s,a)\left(R_t(s,a,s')+\gamma Q^{i^c}_{t}\left(s',\arg\max_{a'\in U(s')}Q^{i}_{t}(s',a')\right)\right), & \text{otherwise},\end{cases}$$

where $i^c=\{A,B\}\setminus i$.
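The asynchronous rule above can be sketched as a single update step. This is a minimal sketch under stated assumptions: the tables are nested lists, all actions are admissible, and the table index is drawn uniformly at random inside the function (in the text, the index is part of the sampled trajectory). The name `async_double_q_step` is hypothetical.

```python
import random

def async_double_q_step(QA, QB, s, a, reward, s_next, alpha, gamma, rng):
    """Update exactly one entry (s, a) of one randomly chosen table; return its name."""
    chosen = "A" if rng.random() < 0.5 else "B"
    Q_upd, Q_other = (QA, QB) if chosen == "A" else (QB, QA)
    # The greedy action comes from the table being updated ...
    greedy = max(range(len(Q_upd[s_next])), key=lambda b: Q_upd[s_next][b])
    # ... while its value is read from the other table.
    target = reward + gamma * Q_other[s_next][greedy]
    Q_upd[s][a] = (1 - alpha) * Q_upd[s][a] + alpha * target
    return chosen

rng = random.Random(1)
QA = [[0.0, 0.0], [1.0, 3.0]]
QB = [[0.0, 0.0], [2.0, 0.5]]
chosen = async_double_q_step(QA, QB, s=0, a=0, reward=1.0, s_next=1,
                             alpha=0.5, gamma=0.9, rng=rng)
print(chosen, QA[0][0], QB[0][0])
```

All entries other than the visited pair of the chosen table are left unchanged, matching the first case of the update rule.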

We next provide the boundedness property of the Q-estimators and the errors in the following lemma, which is typically necessary for the finite-time analysis.

###### Lemma 1.

For either synchronous or asynchronous double Q-learning, let $Q^i_t(s,a)$ be the value of either Q-table corresponding to a state-action pair $(s,a)$ at iteration $t$. Suppose $\|Q^i_0\|\le\frac{R_{\max}}{1-\gamma}$. Then we have $\|Q^i_t\|\le\frac{R_{\max}}{1-\gamma}$ and $\|Q^A_t-Q^B_t\|\le V_{\max}$ for all $t\ge0$, where $V_{\max}=\frac{2R_{\max}}{1-\gamma}$.

Lemma 1 can be proved by induction using the triangle inequality and the uniform boundedness of the reward function; see Appendix A.
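The boundedness in Lemma 1 can be observed empirically. The sketch below runs a synchronous variant of double Q-learning on a small hand-written MDP and tracks the largest table entry and the largest gap between the two tables. Several choices are assumptions made for illustration only: zero initialization, rewards drawn uniformly from $[-R_{\max},R_{\max}]$, the learning rate $\alpha_t=1/t^{\omega}$, equal-probability table selection, and the function name `sync_double_q`.

```python
import random

def sync_double_q(P, r_max, gamma, omega, n_iters, rng):
    """Synchronous double Q-learning with zero initialization.

    P[s][a] lists next-state probabilities. Returns both tables, the largest
    |Q| entry, and the largest |Q^A - Q^B| gap observed along the run.
    """
    n_states, n_actions = len(P), len(P[0])
    Q = {name: [[0.0] * n_actions for _ in range(n_states)] for name in "AB"}
    max_abs, max_gap = 0.0, 0.0
    for t in range(1, n_iters + 1):
        alpha = 1.0 / t ** omega
        upd, other = ("A", "B") if rng.random() < 0.5 else ("B", "A")
        old = [row[:] for row in Q[upd]]  # freeze time-t values of the chosen table
        for s in range(n_states):
            for a in range(n_actions):
                s_next = rng.choices(range(n_states), weights=P[s][a])[0]
                reward = rng.uniform(-r_max, r_max)
                greedy = max(range(n_actions), key=lambda b: old[s_next][b])
                target = reward + gamma * Q[other][s_next][greedy]
                Q[upd][s][a] = (1 - alpha) * old[s][a] + alpha * target
        for s in range(n_states):
            for a in range(n_actions):
                max_abs = max(max_abs, abs(Q["A"][s][a]), abs(Q["B"][s][a]))
                max_gap = max(max_gap, abs(Q["A"][s][a] - Q["B"][s][a]))
    return Q["A"], Q["B"], max_abs, max_gap

rng = random.Random(0)
r_max, gamma = 1.0, 0.9
P = [[[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
     [[0.2, 0.6, 0.2], [0.3, 0.3, 0.4]],
     [[0.9, 0.05, 0.05], [0.25, 0.25, 0.5]]]
QA, QB, max_abs, max_gap = sync_double_q(P, r_max, gamma, omega=0.8,
                                         n_iters=500, rng=rng)
print(max_abs <= r_max / (1 - gamma), max_gap <= 2 * r_max / (1 - gamma))
```

Each update is a convex combination of the old entry and a target bounded by $R_{\max}+\gamma\frac{R_{\max}}{1-\gamma}=\frac{R_{\max}}{1-\gamma}$, which is why the bound is preserved at every iteration.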

## 3 Main results

We present our finite-time analysis for the synchronous and asynchronous double Q-learning in this section, followed by a sketch of the proof for the synchronous case which captures our main techniques. The detailed proofs of all the results are provided in the Supplementary Materials.

### 3.1 Synchronous double Q-learning

Since the update of the two Q-estimators is symmetric, we can characterize the convergence rate of either Q-estimator, e.g., $Q^A$, to the global optimum $Q^*$. To this end, we first derive two important properties of double Q-learning that are crucial to our finite-time convergence analysis.

The first property captures the stochastic error $\|Q^B_t-Q^A_t\|$ between the two Q-estimators. Since double Q-learning alternates its updates between these two estimators, such an error process must decay to zero in order for double Q-learning to converge. Furthermore, how fast such an error converges determines the overall convergence rate of double Q-learning. The following proposition (which is an informal restatement of Proposition 1 in Section B.1) shows that such an error process can be block-wisely bounded by an exponentially decreasing sequence $\{G_q\}_{q\ge0}$. Conceptually, as illustrated in Figure 1, such an error process is upper-bounded by the blue-colored piece-wise linear curve.

###### Proposition 1.

(Informal) Consider synchronous double Q-learning under a polynomial learning rate with . We divide the time horizon into blocks for , where and with some . Fix . Then for any such that and under certain conditions on (see Section B.1), we have

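$$\mathbb{P}\left[\forall q\in[0,n],\ \forall t\in[\hat{\tau}_{q+1},\hat{\tau}_{q+2}),\ \|Q^B_t-Q^A_t\|\le G_{q+1}\right]\ge 1-c_2n\exp\left(-\frac{c_3\hat{\tau}_1^{\omega}\hat{\epsilon}^2}{V_{\max}^2}\right),$$

where the positive constants $c_2$ and $c_3$ are specified in Section B.1.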

Proposition 1 implies that the two Q-estimators approach each other asymptotically, but does not necessarily imply that they converge to the optimal action-value function $Q^*$. The next proposition (which is an informal restatement of Proposition 2 in Section B.2) shows that as long as the high probability event in Proposition 1 holds, the error process $\|Q^A_t-Q^*\|$ between either Q-estimator (say $Q^A$) and the optimal Q-function can be block-wisely bounded by an exponentially decreasing sequence. Conceptually, as illustrated in Figure 1, such an error process is upper-bounded by the yellow-colored piece-wise linear curve.

###### Proposition 2.

(Informal) Consider synchronous double Q-learning using a polynomial learning rate with . We divide the time horizon into blocks for , where and with some . Fix . Then for any such that and under certain conditions on (see Section B.2), we have

where and denote certain events defined in (12) and (13) in Section B.2, and the positive constants , and are specified in Section B.2.

As illustrated in Figure 1, the two block sequences $\{\hat{\tau}_q\}_{q\ge0}$ in Proposition 1 and $\{\tau_k\}_{k\ge0}$ in Proposition 2 can be chosen to coincide with each other. Combining the above two properties with further mathematical arguments then yields the following main theorem, which characterizes the convergence rate of double Q-learning. We provide a proof sketch for Theorem 1 in Section 3.3, which explains the main steps to obtain the supporting properties in Propositions 1 and 2 and how they further yield the main theorem.

###### Theorem 1.

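Fix and . Consider synchronous double Q-learning using a polynomial learning rate $\alpha_t=\frac{1}{t^{\omega}}$ with $\omega\in(0,1)$. Let $Q^A_T(s,a)$ be the value of $Q^A$ for a state-action pair $(s,a)$ at time $T$. Then we have $\mathbb{P}\left(\|Q^A_T-Q^*\|\le\epsilon\right)\ge1-\delta$, given that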

 (5)

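where $V_{\max}=\frac{2R_{\max}}{1-\gamma}$.

Theorem 1 provides the finite-time convergence guarantee in the high-probability sense for synchronous double Q-learning. Specifically, double Q-learning attains an $\epsilon$-accurate optimal Q-function with high probability by taking $\tilde{\Omega}\left(\left(\frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}}+\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations. Such a result can be further understood by considering the following two regimes. In the high accuracy regime, in which $\epsilon\ll1-\gamma$, the dependence on $\epsilon$ dominates, and the time complexity scales as $\tilde{\Omega}\left(\epsilon^{-2/\omega}\right)$, which is optimized as $\omega$ approaches 1. In the low accuracy regime, in which $\epsilon\gg1-\gamma$, the dependence on $1-\gamma$ dominates, and the time complexity can be optimized at $\omega=\frac{6}{7}$, which yields $\tilde{\Omega}\left(\frac{1}{(1-\gamma)^7}\right)$ in terms of the dependence on $1-\gamma$.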
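The trade-off in the low accuracy regime can be made explicit with a short calculation. The following is a sketch under an assumption: it takes the order-level form read off from (6) with $L$ and $R_{\max}$ treated as constants and $V_{\max}=\frac{2R_{\max}}{1-\gamma}$, suppresses logarithmic factors, and treats $\epsilon$ as fixed.

```latex
% Order-level iteration count (logarithmic factors suppressed):
T(\omega) \;=\; \tilde{\Omega}\!\left(
      (1-\gamma)^{-\frac{6}{\omega}}\,\epsilon^{-\frac{2}{\omega}}
      \;+\; (1-\gamma)^{-\frac{1}{1-\omega}} \right).

% For fixed \epsilon, the exponent of 1/(1-\gamma) is
e(\omega) \;=\; \max\left\{ \frac{6}{\omega},\; \frac{1}{1-\omega} \right\},
\qquad \omega \in (0,1).

% 6/\omega is decreasing and 1/(1-\omega) is increasing in \omega,
% so e(\omega) is minimized where the two coincide:
\frac{6}{\omega} = \frac{1}{1-\omega}
\;\Longleftrightarrow\; 6 - 6\omega = \omega
\;\Longleftrightarrow\; \omega = \frac{6}{7},
\qquad e\!\left(\tfrac{6}{7}\right) = 7 .
```

Under this assumption, the first term favors a large $\omega$ and the second a small one, and balancing them gives the seventh-power dependence on $\frac{1}{1-\gamma}$.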

Furthermore, Theorem 1 corroborates the design effectiveness of double Q-learning, which overcomes the overestimation issue and hence achieves better accuracy by making less aggressive progress in each update. Specifically, comparison of Theorem 1 with the time complexity bounds of vanilla synchronous Q-learning under a polynomial learning rate in Even-Dar and Mansour (2003) and Wainwright (2019) indicates that in the high accuracy regime, double Q-learning achieves the same convergence rate as vanilla Q-learning in terms of the order-level dependence on $\epsilon$. In this regime, the design of double Q-learning for high accuracy dominates the performance. In the low-accuracy regime (which is not what double Q-learning is designed for), double Q-learning achieves a slightly weaker convergence rate than vanilla Q-learning in Even-Dar and Mansour (2003); Wainwright (2019) in terms of the dependence on $1-\gamma$, because its less aggressive progress dominates the performance.

### 3.2 Asynchronous Double Q-learning

In this subsection, we study the asynchronous double Q-learning and provide its finite-time convergence result.

Unlike synchronous double Q-learning, in which all state-action pairs are visited for each update of the chosen Q-estimator, asynchronous double Q-learning visits only one state-action pair for each update of the chosen Q-estimator. Therefore, we make the following standard assumption on the exploration strategy (Even-Dar and Mansour, 2003):

###### Assumption 1.

(Covering number) There exists a covering number $L$, such that in $L$ consecutive updates of either the $Q^A$ or the $Q^B$ estimator, all the state-action pairs of the chosen Q-estimator are visited at least once.

The above condition on the exploration is usually necessary for the finite-time analysis of asynchronous Q-learning. The same assumption was adopted in Even-Dar and Mansour (2003). Qu and Wierman (2020) proposed a mixing time condition which is in the same spirit.

Assumption 1 essentially requires the sampling strategy to have good visitation coverage over all state-action pairs. Specifically, Assumption 1 guarantees that $L$ consecutive updates of $Q^A$ visit each state-action pair of $Q^A$ at least once, and the same holds for $Q^B$. Since $2L$ iterations of asynchronous double Q-learning must make at least $L$ updates for either $Q^A$ or $Q^B$, Assumption 1 further implies that any state-action pair $(s,a)$ must be visited at least once during $2L$ iterations of the algorithm. In fact, our analysis allows a certain relaxation of Assumption 1 by only requiring each state-action pair to be visited during an interval with a certain probability. In such a case, we can also derive a finite-time bound by additionally dealing with a conditional probability.
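To make Assumption 1 concrete, the following sketch checks whether a finite update trajectory is consistent with a candidate covering number. It is an illustration only: the trajectory format (a list of `(table, (s, a))` entries), the function name `satisfies_covering`, and the restriction to complete windows of a finite trajectory are all assumptions introduced here.

```python
from itertools import product

def satisfies_covering(updates, n_states, n_actions, L):
    """Check that every complete window of L consecutive updates of each table
    visits all state-action pairs at least once."""
    all_pairs = set(product(range(n_states), range(n_actions)))
    for table in ("A", "B"):
        visits = [pair for idx, pair in updates if idx == table]
        for start in range(len(visits) - L + 1):
            if not all_pairs <= set(visits[start:start + L]):
                return False
    return True

# One state, two actions; both tables cycle through the two pairs.
cyclic = [("A", (0, 0)), ("B", (0, 0)), ("A", (0, 1)), ("B", (0, 1))] * 3
# Table A never visits the pair (0, 1).
stuck = [("A", (0, 0)), ("B", (0, 0)), ("A", (0, 0)), ("B", (0, 1))] * 3

print(satisfies_covering(cyclic, 1, 2, L=2))
print(satisfies_covering(cyclic, 1, 2, L=1))
print(satisfies_covering(stuck, 1, 2, L=2))
```

A window of length $L$ can cover all pairs only if $L\ge|\mathcal{S}||\mathcal{A}|$, which is why the second call cannot succeed.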

Next, we provide the finite-time result for asynchronous double Q-learning in the following theorem.

###### Theorem 2.

Fix . Consider asynchronous double Q-learning under a polynomial learning rate $\alpha_t=\frac{1}{t^{\omega}}$ with $\omega\in(0,1)$. Suppose Assumption 1 holds. Let $Q^A_T(s,a)$ be the value of $Q^A$ for a state-action pair $(s,a)$ at time $T$. Then we have $\mathbb{P}\left(\|Q^A_T-Q^*\|\le\epsilon\right)\ge1-\delta$, given that

$$T=\Omega\left(\left(\frac{L^4V_{\max}^2}{(1-\gamma)^4\epsilon^2}\ln\frac{|\mathcal{S}||\mathcal{A}|L^4V_{\max}^2}{(1-\gamma)^5\epsilon^2\delta}\right)^{\frac{1}{\omega}}+\left(\frac{L^2}{1-\gamma}\ln\frac{\gamma V_{\max}}{(1-\gamma)\epsilon}\right)^{\frac{1}{1-\omega}}\right). \tag{6}$$

Comparison of Theorems 1 and 2 indicates that the finite-time result of asynchronous double Q-learning matches that of synchronous double Q-learning in the order dependence on $\epsilon$ and $1-\gamma$. The difference lies in the extra dependence on the covering number $L$ in Theorem 2. Since synchronous double Q-learning visits all state-action pairs (i.e., takes $|\mathcal{S}||\mathcal{A}|$ sample updates) at each iteration, whereas asynchronous double Q-learning visits only one state-action pair (i.e., takes only one sample update) at each iteration, a more reasonable comparison between the two should be in terms of the overall sample complexity. In this sense, synchronous and asynchronous double Q-learning algorithms have the sample complexities of $|\mathcal{S}||\mathcal{A}|T$ (where $T$ is given in (5)) and $T$ (where $T$ is given in (6)), respectively. Since in general $L\gg|\mathcal{S}||\mathcal{A}|$, synchronous double Q-learning is more efficient than asynchronous double Q-learning in terms of the overall sample complexity.

### 3.3 Proof Sketch of Theorem 1

In this subsection, we provide an outline of the technical proof of Theorem 1 and summarize the key ideas behind the proof. The detailed proof can be found in Appendix B.

Our goal is to study the finite-time convergence of the error $\|Q^A_t-Q^*\|$ between one Q-estimator and the optimal Q-function (this is without loss of generality due to the symmetry of the two estimators). To this end, our proof includes: (a) Part I, which analyzes the stochastic error propagation between the two Q-estimators $\|Q^B_t-Q^A_t\|$; (b) Part II, which analyzes the error dynamics between one Q-estimator and the optimum $\|Q^A_t-Q^*\|$ conditioned on the error event in Part I; and (c) Part III, which bounds the unconditional error $\|Q^A_t-Q^*\|$. We describe each of the three parts in more detail below.

Part I: Bounding $\|Q^B_t-Q^A_t\|$ (see Proposition 1). The main idea is to upper bound $\|Q^B_t-Q^A_t\|$ by a decreasing sequence $\{G_q\}_{q\ge0}$ block-wisely with high probability, where each block $q$ (with $q\ge0$) is defined by $t\in[\hat{\tau}_q,\hat{\tau}_{q+1})$. The proof consists of the following four steps.

Step 1 (see Lemma 2): We characterize the dynamics of $u^{BA}_t(s,a):=Q^B_t(s,a)-Q^A_t(s,a)$ as an SA algorithm as follows:

$$u^{BA}_{t+1}(s,a)=(1-\alpha_t)u^{BA}_t(s,a)+\alpha_t\left(h_t(s,a)+z_t(s,a)\right),$$

where $h_t$ is a contractive mapping of $u^{BA}_t$, and $z_t$ is a martingale difference sequence.

Step 2 (see Lemma 3): We derive lower and upper bounds on $u^{BA}_t$ via two sequences $X_{t;\hat{\tau}_q}$ and $Z_{t;\hat{\tau}_q}$ as follows:

$$-X_{t;\hat{\tau}_q}(s,a)+Z_{t;\hat{\tau}_q}(s,a)\le u^{BA}_t(s,a)\le X_{t;\hat{\tau}_q}(s,a)+Z_{t;\hat{\tau}_q}(s,a),$$

for any $t\ge\hat{\tau}_q$, state-action pair $(s,a)\in\mathcal{S}\times\mathcal{A}$, and $q\ge0$, where $X_{t;\hat{\tau}_q}$ is deterministic and driven by $G_q$, and $Z_{t;\hat{\tau}_q}$ is stochastic and driven by the martingale difference sequence $z_t$.

Step 3 (see Lemma 5 and Lemma 6): We block-wisely bound $u^{BA}_t(s,a)$ by induction. Namely, we prove that $\|u^{BA}_t\|\le G_q$ for $t\in[\hat{\tau}_q,\hat{\tau}_{q+1})$ holds for all $q\ge0$. We first observe that for $q=0$, $\|u^{BA}_t\|\le G_0$ holds. Given any state-action pair $(s,a)$, we assume that $|u^{BA}_t(s,a)|\le G_q$ holds for $t\in[\hat{\tau}_q,\hat{\tau}_{q+1})$. Then we show that $|u^{BA}_t(s,a)|\le G_{q+1}$ holds for $t\in[\hat{\tau}_{q+1},\hat{\tau}_{q+2})$, which follows by bounding the deterministic sequence $X$ and the stochastic sequence $Z$ separately in Lemma 5 and Lemma 6, respectively.

Step 4 (see Section B.1.4): We apply the union bound (Lemma 8) to obtain the block-wise bound for all state-action pairs and all blocks.

Part II: Conditionally bounding $\|Q^A_t-Q^*\|$ (see Proposition 2). We upper bound $\|Q^A_t-Q^*\|$ by a decreasing sequence block-wisely conditioned on the following two events:

• The first event: $\|Q^B_t-Q^A_t\|$ is upper bounded properly (see (12) in Section B.2), and

• The second event: there are sufficient updates of $Q^A_t$ in each block (see (13) in Section B.2).

The proof of Proposition 2 consists of the following four steps.

Step 1 (see Lemma 10): We design a special relationship (illustrated in Figure 1) between the two block-wise bound sequences and their block separations.

Step 2 (see Lemma 11): We characterize the dynamics of the iteration residual $r_t(s,a):=Q^A_t(s,a)-Q^*(s,a)$ as an SA algorithm as follows: when $Q^A$ is chosen to be updated at iteration $t$,

$$r_{t+1}(s,a)=(1-\alpha_t)r_t(s,a)+\alpha_t\left(\mathcal{T}Q^A_t(s,a)-Q^*(s,a)\right)+\alpha_tw_t(s,a)+\alpha_t\gamma u^{BA}_t(s',a^*),$$

where $w_t(s,a)$ is the error between the Bellman operator and the sample-based empirical estimator, and is thus a martingale difference sequence, and $u^{BA}_t$ has been defined in Part I.

Step 3 (see Lemma 12): We provide upper and lower bounds on $r_t$ via two sequences $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ as follows:

$$-Y_{t;\tau_k}(s,a)+W_{t;\tau_k}(s,a)\le r_t(s,a)\le Y_{t;\tau_k}(s,a)+W_{t;\tau_k}(s,a),$$

for all $t\ge\tau_k$, all state-action pairs $(s,a)\in\mathcal{S}\times\mathcal{A}$, and all $k\ge0$, where $Y_{t;\tau_k}$ is deterministic and driven by the block-wise bound, and $W_{t;\tau_k}$ is stochastic and driven by the martingale difference sequence $w_t$. In particular, if $Q^A$ is not updated at some iteration, then the sequences $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ keep their values from the previous iteration.

Step 4 (see Lemma 13, Lemma 14 and Section B.2.4): Similarly to Steps 3 and 4 in Part I, we conditionally bound $\|r_t\|$ block-wisely via bounding $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ and further taking the union bound.

Part III: Bounding $\|Q^A_t-Q^*\|$ (see Section B.3). We combine the results in the first two parts, and provide a high probability bound on $\|Q^A_t-Q^*\|$ with further probabilistic arguments, which exploit the high probability bounds on the two conditioning events established in Proposition 1 and in Lemma 15.

## 4 Conclusion

In this paper, we provide the first finite-time results for double Q-learning, which characterize how fast double Q-learning converges under both synchronous and asynchronous implementations. For the synchronous case, we show that it achieves an $\epsilon$-accurate optimal Q-function with probability at least $1-\delta$ by taking $\tilde{\Omega}\left(\left(\frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}}+\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations. A similar scaling order on $\epsilon$ and $1-\gamma$ also applies for asynchronous double Q-learning but with extra dependence on the covering number. We develop new techniques to bound the error between two correlated stochastic processes, which can be of independent interest.

## Acknowledgements

The work was supported in part by the U.S. National Science Foundation under the grant CCF-1761506 and the startup fund of the Southern University of Science and Technology (SUSTech), China.

## Broader Impact

Reinforcement learning has achieved great success in areas such as robotics and game playing, and thus has attracted broad interest and opened up more potential real-world applications. Double Q-learning is a commonly used technique in deep reinforcement learning to improve the implementation stability and speed of deep Q-learning. In this paper, we provided a fundamental analysis of the convergence rate of double Q-learning, which theoretically justifies the empirical success of double Q-learning in practice. Such a theory also provides practitioners with a desirable performance guarantee for further developing this technique into various transferable technologies.

Supplementary Materials

## Appendix A Proof of Lemma 1

We prove Lemma 1 by induction.

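First, it is easy to guarantee that the initial case is satisfied, i.e., $\|Q^A_0\|\le\frac{R_{\max}}{1-\gamma}$ and $\|Q^B_0\|\le\frac{R_{\max}}{1-\gamma}$. (In practice we usually initialize the algorithm as $Q^A_0=Q^B_0=0$.) Next, we assume that $\|Q^A_t\|\le\frac{R_{\max}}{1-\gamma}$ and $\|Q^B_t\|\le\frac{R_{\max}}{1-\gamma}$. It remains to show that such conditions still hold for $t+1$.

Observe that

$$\begin{aligned}\|Q^A_{t+1}(s,a)\| &= \left\|(1-\alpha_t)Q^A_t(s,a)+\alpha_t\left(R_t+\gamma Q^B_t\left(s',\arg\max_{a'\in U(s')}Q^A_t(s',a')\right)\right)\right\|\\ &\le(1-\alpha_t)\|Q^A_t\|+\alpha_t\|R_t\|+\alpha_t\gamma\|Q^B_t\|\\ &\le(1-\alpha_t)\frac{R_{\max}}{1-\gamma}+\alpha_tR_{\max}+\alpha_t\gamma\frac{R_{\max}}{1-\gamma}\\ &=\frac{R_{\max}}{1-\gamma}=\frac{V_{\max}}{2}.\end{aligned}$$

Similarly, we can have $\|Q^B_{t+1}\|\le\frac{R_{\max}}{1-\gamma}=\frac{V_{\max}}{2}$. The bound on the difference then follows from the triangle inequality, $\|Q^A_t-Q^B_t\|\le\|Q^A_t\|+\|Q^B_t\|\le V_{\max}$. Thus we complete the proof.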

## Appendix B Proof of Theorem 1

In this appendix, we provide a detailed proof of Theorem 1. Our proof includes: (a) Part I, which analyzes the stochastic error propagation between the two Q-estimators $\|Q^B_t-Q^A_t\|$; (b) Part II, which analyzes the error dynamics between one Q-estimator and the optimum $\|Q^A_t-Q^*\|$ conditioned on the error event in Part I; and (c) Part III, which bounds the unconditional error $\|Q^A_t-Q^*\|$. We describe each of the three parts in more detail below.

### B.1 Part I: Bounding $\|Q^B_t-Q^A_t\|$

The main idea is to upper bound $\|Q^B_t-Q^A_t\|$ by a decreasing sequence $\{G_q\}_{q\ge0}$ block-wisely with high probability, where each block or epoch $q$ (with $q\ge0$) is defined by $t\in[\hat{\tau}_q,\hat{\tau}_{q+1})$.

###### Proposition 1.

Fix and . Consider synchronous double Q-learning using a polynomial learning rate with . Let with and . Let for with and as the finishing time of the first epoch satisfying

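$$\hat{\tau}_1\ge\max\left\{\left(\frac{1}{1-\ln(2+\Delta)}\right)^{\frac{1}{\omega}},\ \left(\frac{128c(c+\kappa)V_{\max}^2}{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\sigma^2\xi^2\epsilon^2}\ln\left(\frac{64c(c+\kappa)V_{\max}^2}{\kappa^2\left(\frac{\Delta}{2+\Delta}\right)^2\sigma^2\xi^2\epsilon^2}\right)\right)^{\frac{1}{\omega}}\right\}.$$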

Then for any such that , we have

$$\mathbb{P}\left[\forall q\in[0,n],\ \forall t\in[\hat{\tau}_{q+1},\hat{\tau}_{q+2}),\ \|Q^B_t-Q^A_t\|\le G_{q+1}\right]$$

The proof of Proposition 1 consists of the following four steps.

#### Step 1: Characterizing the dynamics of $Q^B_t(s,a)-Q^A_t(s,a)$

We first characterize the dynamics of $u^{BA}_t(s,a):=Q^B_t(s,a)-Q^A_t(s,a)$ as a stochastic approximation (SA) algorithm in this step.

###### Lemma 2.

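Consider double Q-learning in Algorithm 1. Then we have

$$u^{BA}_{t+1}(s,a)=(1-\alpha_t)u^{BA}_t(s,a)+\alpha_tF_t(s,a),$$

where

$$F_t(s,a)=\begin{cases}Q^B_t(s,a)-R_t-\gamma Q^B_t(s_{t+1},a^*), & \text{w.p. }1/2,\\ R_t+\gamma Q^A_t(s_{t+1},b^*)-Q^A_t(s,a), & \text{w.p. }1/2.\end{cases}$$

In addition, $F_t$ satisfies

$$\left\|\mathbb{E}[F_t|\mathcal{F}_t]\right\|\le\frac{1+\gamma}{2}\left\|u^{BA}_t\right\|.$$

###### Proof.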

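Algorithm 1 indicates that at each time, either $Q^A$ or $Q^B$ is updated with equal probability. When updating $Q^A$ at time $t$, for each $(s,a)$ we have

$$\begin{aligned}u^{BA}_{t+1}(s,a)&=Q^B_{t+1}(s,a)-Q^A_{t+1}(s,a)\\ &=Q^B_t(s,a)-\left(Q^A_t(s,a)+\alpha_t\left(R_t+\gamma Q^B_t(s_{t+1},a^*)-Q^A_t(s,a)\right)\right)\\ &=(1-\alpha_t)Q^B_t(s,a)-\left((1-\alpha_t)Q^A_t(s,a)+\alpha_t\left(R_t+\gamma Q^B_t(s_{t+1},a^*)-Q^B_t(s,a)\right)\right)\\ &=(1-\alpha_t)u^{BA}_t(s,a)+\alpha_t\left(Q^B_t(s,a)-R_t-\gamma Q^B_t(s_{t+1},a^*)\right).\end{aligned}$$

Similarly, when updating $Q^B$, we have

$$\begin{aligned}u^{BA}_{t+1}(s,a)&=Q^B_{t+1}(s,a)-Q^A_{t+1}(s,a)\\ &=\left(Q^B_t(s,a)+\alpha_t\left(R_t+\gamma Q^A_t(s_{t+1},b^*)-Q^B_t(s,a)\right)\right)-Q^A_t(s,a)\\ &=(1-\alpha_t)Q^B_t(s,a)+\left(\alpha_t\left(R_t+\gamma Q^A_t(s_{t+1},b^*)-Q^A_t(s,a)\right)-(1-\alpha_t)Q^A_t(s,a)\right)\\ &=(1-\alpha_t)u^{BA}_t(s,a)+\alpha_t\left(R_t+\gamma Q^A_t(s_{t+1},b^*)-Q^A_t(s,a)\right).\end{aligned}$$

Therefore, we can rewrite the dynamics of $u^{BA}_t$ as $u^{BA}_{t+1}=(1-\alpha_t)u^{BA}_t+\alpha_tF_t$, where

$$F_t(s,a)=\begin{cases}Q^B_t(s,a)-R_t-\gamma Q^B_t(s_{t+1},a^*), & \text{w.p. }1/2,\\ R_t+\gamma Q^A_t(s_{t+1},b^*)-Q^A_t(s,a), & \text{w.p. }1/2.\end{cases}$$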
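The two algebraic identities above can be sanity-checked numerically. The sketch below is only an illustration: it draws random scalar stand-ins for the table entries, reward and step size (all names such as `qa_next` and `check_lemma2_identity` are hypothetical), applies each update rule directly, and compares the result with the claimed recursion.

```python
import random

def check_lemma2_identity(seed=0, trials=1000, gamma=0.9):
    """Return the largest deviation between the direct update of
    u = Q^B - Q^A and the recursion (1 - alpha) * u + alpha * F."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        qa = rng.uniform(-10, 10)       # Q^A_t(s, a)
        qb = rng.uniform(-10, 10)       # Q^B_t(s, a)
        qb_next = rng.uniform(-10, 10)  # Q^B_t(s_{t+1}, a*)
        qa_next = rng.uniform(-10, 10)  # Q^A_t(s_{t+1}, b*)
        r = rng.uniform(-1, 1)          # sampled reward R_t
        alpha = rng.random()            # learning rate in [0, 1)
        u = qb - qa

        # Case 1: Q^A is updated, Q^B is unchanged.
        qa_new = (1 - alpha) * qa + alpha * (r + gamma * qb_next)
        f_a = qb - r - gamma * qb_next
        err_a = abs((qb - qa_new) - ((1 - alpha) * u + alpha * f_a))

        # Case 2: Q^B is updated, Q^A is unchanged.
        qb_new = (1 - alpha) * qb + alpha * (r + gamma * qa_next)
        f_b = r + gamma * qa_next - qa
        err_b = abs((qb_new - qa) - ((1 - alpha) * u + alpha * f_b))

        worst = max(worst, err_a, err_b)
    return worst

worst_error = check_lemma2_identity()
print(worst_error < 1e-9)
```

Any deviation observed here is floating-point rounding, since the two sides agree exactly as polynomials in the drawn quantities.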

Thus, we have

 E [Ft(s,a)|Ft] =12(QBt(s,a)−Est+1[Rs′sa+γQBt(st+1,a∗)])+12(Est+1[Rs′sa+γQAt(st+1,b∗)]−QAt(s,a)) =12(QBt(s,a)−QAt(s,a))+γ2Est+1[QAt(st+1,b∗)−QBt