# Target Transfer Q-Learning and Its Convergence Analysis

###### Abstract

Reinforcement Learning (RL) technologies are powerful tools for learning to interact with environments and have been successfully applied to a variety of important applications. Q-learning, one of the most popular methods in RL, uses the temporal difference method to update the Q-function and asymptotically learns the optimal Q-function. Transfer learning aims to utilize knowledge learned from source tasks to help new tasks. For supervised learning, it has been shown that transfer learning can significantly improve the sample complexity of new tasks. Considering that data collection in RL is more time- and cost-consuming, and that Q-learning converges slowly compared to supervised learning, various transfer RL algorithms have been designed. However, most of them are heuristic, with no theoretical guarantee on the convergence rate. It is therefore important to clearly understand when and how transfer learning helps RL methods and to provide theoretical guarantees for the improvement in sample complexity. In this paper, we propose to transfer the Q-function learned in the source task as the target in the Q-learning of a new task when certain safe conditions are satisfied. We call this new transfer Q-learning method target transfer Q-learning. The safe conditions are necessary to avoid the harm to the new task brought by the transferred target and thus ensure the convergence of the algorithm. We study the convergence rate of target transfer Q-learning. We prove that if the two tasks are similar in terms of their MDPs, the optimal Q-functions of the two tasks are similar, which means the error of the transferred target Q-function in the new task is small. Moreover, the convergence rate analysis shows that target transfer Q-learning converges faster than Q-learning if the error of the transferred target Q-function is smaller than that of the current Q-function in the new task.
Based on our theoretical results and the relationship between the Q error and the Bellman error, we design the safe condition as: the Bellman error of the transferred target Q-function is less than that of the current Q-function. Our experiments are consistent with our theoretical findings and verify the effectiveness of the proposed target transfer Q-learning method.


Yue Wang† (†This work was done when the first author was visiting Microsoft Research Asia.), Qi Meng, Wei Chen, Yuting Liu, Zhi-Ming Ma, Tie-Yan Liu
School of Science, Beijing Jiaotong University, Beijing, China {11271012, ytliu}@bjtu.edu.cn
Microsoft Research, Beijing, China {meq, wche,Tie-Yan.Liu}@microsoft.com
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China mazm@amt.ac.cn

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Introduction

Reinforcement Learning (RL) (?) technologies are powerful tools for learning to interact with environments and have been successfully applied to a variety of important applications, such as robotics, computer games and so on (?; ?; ?; ?).

Q-learning (?) is one of the most popular RL algorithms; it uses the temporal difference method to update the Q-function. To be specific, Q-learning maps the current Q-function to a new Q-function using the Bellman operator and uses the difference between these two Q-functions to update the Q-function. Since the Bellman operator is a contraction mapping, Q-learning converges to the optimal Q-function (?). Compared with supervised learning algorithms, Q-learning converges much more slowly due to the interactions with the environment. At the same time, data collection in RL is both time- and cost-consuming. Thus, it is crucial to utilize available information to reduce the sample complexity of Q-learning.
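As a concrete illustration of the temporal difference update described above, a single tabular Q-learning step might be sketched as follows (the tiny two-state example and all hyperparameter values are our own illustrative assumptions, not part of the paper):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """One temporal-difference update: move Q(s, a) toward the Bellman target."""
    td_target = r + gamma * np.max(Q[s_next])   # Bellman target for (s, a)
    Q[s, a] += alpha * (td_target - Q[s, a])    # update by the temporal difference
    return Q

# Tiny deterministic example: in state 0, action 0 yields reward 1 and moves to state 1.
Q = np.zeros((2, 2))
Q = q_learning_step(Q, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q[0, 0])  # 0.5 * (1.0 + 0.9 * 0 - 0) = 0.5
```

The difference `td_target - Q[s, a]` is exactly the temporal difference between the Bellman-operator image of the current Q-function and the current Q-function itself.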

Transfer learning aims to improve the learning performance on a new task by utilizing knowledge/models learned from source tasks. Transfer learning has a long history in supervised learning (?; ?; ?). Recently, by leveraging experience from supervised transfer learning, researchers have developed different kinds of transfer learning methods for RL, which can be categorized into three classes: (1) instance transfer, in which old data are reused in the new task (?; ?); (2) representation transfer, such as reward shaping and basis function extraction (?; ?); (3) parameter transfer (?), in which the parameters of the source task are partially merged into the model of the new task. While supervised learning is a pure optimization problem, reinforcement learning is a more complex control problem. To the best of our knowledge, most existing transfer reinforcement learning algorithms are heuristic, with no theoretical guarantee on the convergence rate (?; ?; ?). As mentioned by (?), a transfer learning method may fail to help or even harm the new task, and in the absence of theory we do not know the reason. Therefore, it is very important to clearly understand how and when transfer learning helps reinforcement learning save sample complexity.

In this paper, we design a novel transfer learning method for Q-learning in RL with a theoretical guarantee. Different from existing transfer RL algorithms, we propose to transfer the Q-function learned in the source task as the temporal difference update target of the new task when certain safe conditions are satisfied. We call this new transfer Q-learning method target transfer Q-learning. The intuitive motivation is that when two RL tasks are similar to each other, their optimal Q-functions are similar, which means the transferred target is better (its error is smaller than that of the current Q-function). Combined with the fact that a better target Q-function in Q-learning helps accelerate convergence, we may expect the target transfer Q-learning method to outperform Q-learning. The safe conditions are necessary to avoid harm to the new task and thus ensure the convergence of the algorithm.

We prove that target transfer Q-learning has a theoretical guarantee on the convergence rate. Furthermore, if the two MDPs, and thus the optimal Q-functions, in the source and new RL tasks are similar, target transfer Q-learning converges faster than Q-learning. To be specific, we prove that the error of target transfer Q-learning consists of two parts: the initialization error and the sampling error. Both errors increase with the product of the discount factor and the relative Q-function error ratio (error ratio for simplicity), which measures the relative error of the target Q-function compared with the current Q-function in the new task. We call this product the discounted relative Q-function error ratio (discounted error ratio for simplicity). The smaller the discounted error ratio is, the faster the convergence. If the discounted error ratio is larger than 1, convergence is no longer guaranteed.

If the two RL tasks are similar, the learned Q-function in the source task will be close to the optimal Q-function compared with the current Q-function in the new task. Thus, the discounted error ratio will be small (especially in the early stage) when we transfer the learned Q-function from the source task as the target of the new task. Please note that traditional Q-learning is a special case of target transfer Q-learning whose discounted error ratio is the constant $\gamma$.

Therefore, our convergence analysis for target transfer Q-learning helps us design the safe condition: we transfer the target if doing so makes the discounted error ratio smaller than $\gamma$. We call this the error ratio safe condition. Specifically, in the early stage of training, the Q-function in the new task is not fully trained, and the learned Q-function in the source task is a better choice with a smaller error ratio. As the Q-function in the new task is updated, its error ratio becomes larger. When the discounted error ratio is close to or larger than $\gamma$, the safe condition is no longer satisfied, and we stop transferring the target to avoid the harm brought by transfer learning. Following the standard practice in Q-learning, we estimate the error ratio, i.e., the error of the Q-function w.r.t. the optimal Q-function, by the Bellman error.

Our experiments on synthetic MDPs fully support our convergence analysis and verify the effectiveness of our proposed target transfer Q-Learning with error ratio safe condition.

## Related Work

This section briefly outlines related work on transfer learning in reinforcement learning.

Transfer learning in RL (?; ?) aims to improve learning in new MDP tasks by borrowing knowledge from related but different, previously learned MDP tasks. In (?), the authors propose instance transfer in the Transfer Reinforcement Learning with Shared Dynamics (TRLSD) setting, in which only the reward function differs between MDPs. In (?), the authors propose representation transfer and learn an invariant feature space. The papers (?; ?) propose parameter transfer to guide exploration or to directly initialize the Q-function of the new task. In (?), the authors propose a meta-learning method for transfer learning in RL. All these works are evaluated empirically, with no theoretical analysis of the convergence rate.

A few works do provide convergence analysis. In (?), the authors use representation transfer but only consider the TRLSD setting. (?) propose a method using instance transfer; they give a theoretical analysis of asymptotic convergence but no finite-sample performance guarantee.

## Q Learning Background

Consider the reinforcement learning problem with a Markov decision process (MDP) $M = \langle S, A, P, R, \gamma \rangle$, where $S$ is the state space, $A$ is the action space, $P$ is the transition matrix and $P(s'|s,a)$ is the transition probability from state $s$ to state $s'$ after taking action $a$, $R$ is the reward function and $R(s,a)$ is the reward received at state $s$ if taking action $a$, and $\gamma \in (0,1)$ is the discount factor. A policy $\pi(a|s)$ indicates the probability of taking each action $a$ at each state $s$. The value function for policy $\pi$ is defined as $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, \pi\right]$. The action value function for policy $\pi$, also called the Q-function, is defined as $Q^\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a, \pi\right]$.

Without loss of generality, we assume that all rewards lie between 0 and 1. The optimal policy is denoted $\pi^*$, with value function $V^*$ and Q-function $Q^*$.

As we know, the Q-function in RL satisfies the following Bellman equation:

$$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s')\, Q^\pi(s',a').$$

Denote the right-hand side (RHS) of the equation as $(\mathcal{T}^\pi Q)(s,a)$; $\mathcal{T}^\pi$ is called the Bellman operator for policy $\pi$. Similarly, consider the optimal Bellman equation:

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a').$$

The RHS of this equation is denoted $(\mathcal{T} Q)(s,a)$; $\mathcal{T}$ is called the optimal Bellman operator. It can be proved that the optimal Bellman operator is a contraction mapping for the Q-function, so by the contraction mapping theorem it has a unique fixed point, which is the optimal Q-function. The Q-learning algorithm is designed based on this theory. Watkins introduced the Q-learning algorithm to estimate the value of state-action pairs in discounted MDPs (?):

$$Q_{t+1}(s,a) = (1-\alpha_t)\, Q_t(s,a) + \alpha_t \left( R(s,a) + \gamma \max_{a'} Q_t(s'_{s,a}, a') \right),$$

where $s'_{s,a}$ is the successor state sampled from $P(\cdot|s,a)$ and $\alpha_t$ is the learning rate.
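For a finite MDP given as explicit arrays, the optimal Bellman operator and its contraction property can be checked numerically. The sketch below (the randomly generated MDP is purely an illustrative assumption) verifies that $\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty \le \gamma \|Q_1 - Q_2\|_\infty$:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)        # make each P(.|s,a) a probability distribution
R = rng.random((S, A))

def bellman_optimal(Q):
    """(TQ)(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')."""
    return R + gamma * np.einsum('sat,t->sa', P, Q.max(axis=1))

# The optimal Bellman operator is a gamma-contraction in the max norm.
Q1, Q2 = rng.random((S, A)), rng.random((S, A))
lhs = np.abs(bellman_optimal(Q1) - bellman_optimal(Q2)).max()
rhs = gamma * np.abs(Q1 - Q2).max()
print(lhs <= rhs + 1e-12)  # True
```

Iterating `bellman_optimal` from any starting point therefore converges geometrically to the unique fixed point $Q^*$, which is exactly the theory Q-learning builds on.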

We introduce the max-norm error to measure the quality of a Q-function: $\epsilon_t = \|Q_t - Q^*\|_\infty = \max_{s,a} |Q_t(s,a) - Q^*(s,a)|$.

## Target Transfer Q-Learning

First, we formalize the transfer learning in RL problem. Second, we propose our new transfer Q-learning method, Target Transfer Q-Learning (TTQL), and introduce its intuition.

Transfer learning in RL (?; ?) aims to improve learning in new MDP tasks by borrowing knowledge from related but different, previously learned MDP tasks.

According to the definition of MDPs, $M = \langle S, A, P, R, \gamma \rangle$, we consider the situation in which two MDPs differ in the transition probability $P$, the reward function $R$ and the discount factor $\gamma$. Assume there are two MDPs: a source MDP $M_{source} = \langle S, A, P_1, R_1, \gamma_1 \rangle$ and a new MDP $M_{new} = \langle S, A, P_2, R_2, \gamma_2 \rangle$; let $Q_1^*$ and $Q_2^*$ be the corresponding optimal Q-functions. Let $M_{source}$ be the source domain, for which we have already learned $Q_1^*$. The goal of transfer in RL considered in this work is to use the information of $M_{source}$ and $Q_1^*$ to improve the learning speed in $M_{new}$.

To solve the problem mentioned above, we propose the TTQL method. TTQL uses the Q-function learned from the source task as the target Q-function in the new task when the safe conditions are satisfied. The safe condition ensures that the transferred target is used only if it can help accelerate training; otherwise, we replace it with the current Q-function in the new MDP's learning progress. We describe TTQL in Algorithm 1.
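A minimal sketch of the TTQL update in Algorithm 1 might look as follows; the boolean `is_safe` stands in for the safe-condition check discussed later, and all names and values are our own illustration rather than the paper's reference implementation:

```python
import numpy as np

def ttql_step(Q, Q_source, s, a, r, s_next, alpha, gamma, is_safe):
    """Target transfer Q-learning update: when the safe condition holds, the
    Bellman target is computed from the source task's Q-function; otherwise
    the update falls back to ordinary Q-learning on the current Q."""
    Q_target = Q_source if is_safe else Q
    td_target = r + gamma * np.max(Q_target[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 2))
Q_source = np.ones((2, 2))               # stand-in for the learned source Q-function
Q = ttql_step(Q, Q_source, s=0, a=0, r=1.0, s_next=1,
              alpha=0.5, gamma=0.9, is_safe=True)
print(Q[0, 0])  # 0.5 * (1.0 + 0.9 * 1.0) = 0.95
```

With `is_safe=False` the step reduces exactly to the standard Q-learning update, which is why traditional Q-learning is a special case of TTQL.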

The intuitive motivation is that when two RL tasks are similar to each other, their optimal Q-functions are similar. Thus the transferred target is better (its error is smaller than that of the current Q-function), and a better target helps accelerate convergence.

We define the distance between two MDPs as the max-norm distance between their optimal Q-functions, $d(M_1, M_2) = \|Q_1^* - Q_2^*\|_\infty$.

The following Proposition 1 shows the relation between the distance of two MDPs and the differences of their components.

###### Proposition 1.

Assume two MDPs $M_1 = \langle S, A, P_1, R_1, \gamma_1 \rangle$ and $M_2 = \langle S, A, P_2, R_2, \gamma_2 \rangle$, and let the corresponding optimal Q-functions be $Q_1^*$ and $Q_2^*$. Then we have

$$\|Q_1^* - Q_2^*\|_\infty \le \frac{\|R_1 - R_2\|_\infty}{1-\gamma} + \frac{\gamma \|P_1 - P_2\|_\infty}{(1-\gamma)^2} + \frac{|\gamma_1 - \gamma_2|}{(1-\gamma_1)(1-\gamma_2)} \quad (1)$$

for $\gamma = \max\{\gamma_1, \gamma_2\}$, where the three terms correspond to the available combinations of differing components among $(R, P, \gamma)$.

###### Proof.

Without loss of generality, we assume $P_1 \neq P_2$, $R_1 \neq R_2$ and $\gamma_1 \neq \gamma_2$; we will show that the other cases can be proved similarly. We define the following auxiliary MDPs: $M_a = \langle S, A, P_2, R_1, \gamma_1 \rangle$ and $M_b = \langle S, A, P_2, R_2, \gamma_1 \rangle$, and let the corresponding optimal Q-functions be $Q_a^*$ and $Q_b^*$. By the triangle inequality we have

$$\|Q_1^* - Q_2^*\|_\infty \le \|Q_1^* - Q_a^*\|_\infty + \|Q_a^* - Q_b^*\|_\infty + \|Q_b^* - Q_2^*\|_\infty. \quad (2)$$

Notice that in each term, the two MDPs differ in only one component. Using the results of (?), we have $\|Q_1^* - Q_a^*\|_\infty \le \frac{\gamma_1 \|P_1 - P_2\|_\infty}{(1-\gamma_1)^2}$, $\|Q_a^* - Q_b^*\|_\infty \le \frac{\|R_1 - R_2\|_\infty}{1-\gamma_1}$, and $\|Q_b^* - Q_2^*\|_\infty \le \frac{|\gamma_1 - \gamma_2|}{(1-\gamma_1)(1-\gamma_2)}$. Combining the above upper bounds and setting $\gamma = \max\{\gamma_1, \gamma_2\}$, we obtain inequality (1).

In other situations, we can construct auxiliary MDPs as above and use a similar procedure. After traversing all the available combinations of differing components among $(P, R, \gamma)$, we prove Proposition 1.

∎

By Proposition 1, we can conclude that if the two RL tasks are similar, in the sense that the components of the two MDPs are similar, the learned Q-function in the source task will be close to the optimal Q-function in the new task.
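The reward-difference case of this conclusion can be checked numerically: for two MDPs sharing $P$ and $\gamma$ but with perturbed rewards, $\|Q_1^* - Q_2^*\|_\infty$ should be at most $\|R_1 - R_2\|_\infty/(1-\gamma)$. The sketch below, with $Q^*$ computed by value iteration on a randomly generated MDP, is our own illustration of this bound:

```python
import numpy as np

def solve_q(P, R, gamma, iters=2000):
    """Compute Q* by iterating the optimal Bellman operator (value iteration on Q)."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * np.einsum('sat,t->sa', P, Q.max(axis=1))
    return Q

rng = np.random.default_rng(1)
S, A, gamma = 6, 2, 0.8
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R1 = rng.random((S, A))
R2 = np.clip(R1 + 0.05 * rng.standard_normal((S, A)), 0.0, 1.0)  # perturbed rewards

Q1, Q2 = solve_q(P, R1, gamma), solve_q(P, R2, gamma)
lhs = np.abs(Q1 - Q2).max()
rhs = np.abs(R1 - R2).max() / (1.0 - gamma)
print(lhs <= rhs + 1e-8)  # True: the Q* gap is controlled by the reward gap
```

Similar numeric checks can be run for perturbations of $P$ or $\gamma$ by perturbing those components instead.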

A natural question is when transferring the target has a performance guarantee. Here we need safe conditions, which are necessary to avoid harm to the new task and thus ensure the convergence of the algorithm. We can heuristically relate the safe condition to the distance between the two MDPs and the quality of the current Q-function. Its concrete form requires further investigation through quantitative theoretical analysis, and we present these results in the following section.

## Convergence Rate of TTQL

In this section, we present the convergence rate of Target Transfer Q-Learning (TTQL) and discuss the key factors that influence convergence. Theorem 1 analyzes the convergence of target transfer Q-learning. Theorems 2 and 3 analyze two key factors of the convergence rate. Theorem 4 discusses the overall convergence rate of TTQL.

First of all, Theorem 1 analyzes the convergence rate of the target transfer update, which is

$$Q_{t+1}(s,a) = (1-\alpha_t)\, Q_t(s,a) + \alpha_t \left( R(s,a) + \gamma \max_{a'} Q_t^{target}(s'_{s,a}, a') \right).$$

For simplicity, we denote $\epsilon_t = \|Q_t - Q^*\|_\infty$. We denote the error ratio $\beta_t = \|Q_t^{target} - Q^*\|_\infty / \|Q_t - Q^*\|_\infty$, and write $\beta$ if we do not specify the learning step $t$.

###### Theorem 1.

If $\gamma\beta < 1$, then with probability $1-\delta$ the error $\epsilon_T$ is bounded by the sum of a shrinking initialization error and a sampling error; the explicit bound is given as inequality (16) in the proof.

Before showing the proof of Theorem 1, we first introduce a modified Hoeffding inequality lemma, which bounds the distance between a weighted sum of bounded random variables and its expectation.

###### Lemma 1.

Let $X_1, \dots, X_n$ be independent random variables with $X_i \in [a_i, b_i]$ almost surely, and let $\lambda_1, \dots, \lambda_n \ge 0$. Then we have

$$\mathbb{P}\left( \left| \sum_{i=1}^n \lambda_i (X_i - \mathbb{E} X_i) \right| \ge \epsilon \right) \le 2 \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^n \lambda_i^2 (b_i - a_i)^2} \right). \quad (5)$$

###### Proof.

We first prove the one-sided inequality $\mathbb{P}\big(\sum_{i} \lambda_i (X_i - \mathbb{E} X_i) \ge \epsilon\big) \le \exp\big( \frac{-2\epsilon^2}{\sum_{i} \lambda_i^2 (b_i - a_i)^2} \big)$.

For $s > 0$, Markov's inequality and the independence of the $X_i$ imply

$$\begin{aligned}
\mathbb{P}\Big(\sum_{i} \lambda_i (X_i - \mathbb{E} X_i) \ge \epsilon\Big)
&= \mathbb{P}\Big(e^{s \sum_{i} \lambda_i (X_i - \mathbb{E} X_i)} \ge e^{s\epsilon}\Big) \\
&\le e^{-s\epsilon}\, \mathbb{E}\, e^{s \sum_{i} \lambda_i (X_i - \mathbb{E} X_i)} \\
&= e^{-s\epsilon} \prod_{i} \mathbb{E}\, e^{s \lambda_i (X_i - \mathbb{E} X_i)} \\
&\le e^{-s\epsilon} \prod_{i} e^{s^2 \lambda_i^2 (b_i - a_i)^2 / 8} \\
&= \exp\Big(-s\epsilon + \frac{s^2}{8} \sum_{i} \lambda_i^2 (b_i - a_i)^2\Big), \quad (6)
\end{aligned}$$

where the second inequality uses Hoeffding's lemma for the bounded variables $\lambda_i (X_i - \mathbb{E} X_i)$.

Now we consider the minimum of the right-hand side of the last inequality as a function of $s$, and denote $g(s) = -s\epsilon + \frac{s^2}{8} \sum_{i} \lambda_i^2 (b_i - a_i)^2$.

Note that $g$ is a quadratic function and achieves its minimum at $s^* = \frac{4\epsilon}{\sum_{i} \lambda_i^2 (b_i - a_i)^2}$. Thus we get

$$\mathbb{P}\Big(\sum_{i} \lambda_i (X_i - \mathbb{E} X_i) \ge \epsilon\Big) \le \exp\Big( \frac{-2\epsilon^2}{\sum_{i} \lambda_i^2 (b_i - a_i)^2} \Big). \quad (13)$$

We can easily obtain the two-sided statement of Lemma 1 by applying the same argument to $-X_i$ and taking a union bound. ∎
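The weighted Hoeffding bound can be sanity-checked by Monte Carlo simulation; the sketch below (random weights and uniform variables are our own illustrative choices) compares the empirical tail probability against the bound in (5):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 20000
lam = rng.random(n)                                   # nonnegative weights lambda_i
eps = np.sqrt(np.log(20.0) * np.sum(lam**2) / 2.0)    # chosen so the bound is 1/20

# X_i ~ Uniform[0, 1], so a_i = 0, b_i = 1 and E X_i = 1/2.
X = rng.random((trials, n))
dev = (lam * (X - 0.5)).sum(axis=1)

empirical_tail = np.mean(dev >= eps)
hoeffding_bound = np.exp(-2.0 * eps**2 / np.sum(lam**2))
print(empirical_tail <= hoeffding_bound)  # True (the empirical tail is far below the bound)
```

As is typical for Hoeffding-type bounds, the empirical tail is orders of magnitude below the guarantee; the bound's value is that it holds uniformly over all bounded distributions.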

###### Proof of Theorem 1.

Our analysis is derived in the following synchronous generalized Q-learning setting. Compared with traditional synchronous Q-learning^{1} (^{1}It is the same as, or more general than, the commonly used setting ((?); (?); (?); (?)).), we replace the target Q-function with an independent Q-function $Q_t^{target}$ rather than the current one $Q_t$:

$$Q_{t+1}(s,a) = (1-\alpha_t)\, Q_t(s,a) + \alpha_t \left( R(s,a) + \gamma \max_{a'} Q_t^{target}(s'_{s,a}, a') \right). \quad (14)$$

Let $\alpha_t$ satisfy the following condition:

(15) |

Note that if we set $\alpha_t$ to be a polynomially decaying learning rate, we can verify that it satisfies inequality (15). First of all, we decompose the update rule:

If we denote , and recall the definition of we can have

The last step holds because the max operator is a non-expansion. Taking the maximum over both sides of the inequality and using the recursion of $\epsilon_t$, we have

According to Lemma 1 (the weighted Hoeffding inequality), with probability $1-\delta$ we have

(16) |

∎

The convergence result reveals how the error ratio influences the convergence rate. In short, if we can find a better target Q-function, we can learn much faster.

We can see from Theorem 1 that two key factors influence the convergence rate: the initialization error and the sampling error. To make this clear, we analyze the orders of these two terms in Theorems 2 and 3, respectively.

###### Theorem 2.

Denote , and , we have

Based on the results of Theorem 2, we can get the following corollary directly.

###### Corollary 1.

The order of is:

, if ,.

, if .

, if .

The sufficient condition for the is

Before showing the proof of Theorem 2, we first introduce a Lemma which will be used.

###### Lemma 2.

If $f$ is non-negative and decreasing on $[m-1, n]$, then $\sum_{k=m}^{n} f(k) \le \int_{m-1}^{n} f(x)\,dx$.

###### Proof.

Since $f$ is decreasing, $f(k) \le \int_{k-1}^{k} f(x)\,dx$ for each integer $k$; summing over $k = m, \dots, n$ gives the result. ∎

###### Proof of Theorem 2.

(17) |

In step (a), we rewrite the product term as a summation. In (b), we drop one term outside the summation to align the indices of the sum. Step (c) follows from the concavity of the logarithm. Step (d) follows from the relation between summation and integral shown in Lemma 2. The last two steps simply rearrange terms and simplify.

If , ,

If ,

Note that term (f) is a constant.

In the first case, term (g) dominates the order of $\epsilon'_T$; in the second case, term (h) dominates; in the remaining case, the two terms are of the same order. In all cases, $\epsilon'_T$ converges to 0 as $T$ goes to $\infty$.
∎

Theorem 2 shows that this error term converges to 0, with a convergence rate highly related to the choice of learning rate and the discounted error ratio. The next theorem shows an upper bound on the coefficient of the initialization error.

###### Theorem 3.

Denote , and , we can bound as:

(18) |

where $C$ is a constant.

###### Proof of Theorem 3.

(19) | |||

(20) | |||

(21) | |||

(22) | |||

(23) | |||

(24) |

We rewrite the product term in the second equation as a summation. The third equation rearranges the terms. The first inequality follows from the concavity of the logarithm. The second inequality follows from the relation between summation and integral (Lemma 2). ∎

Theorem 3 shows that the coefficient of the initialization error converges to 0, and gives the order of its convergence rate.

###### Theorem 4.

TTQL will converge if we set the safe condition as $\beta_t < 1$, i.e., the transferred target is used only when its error is smaller than that of the current Q-function.

And the convergence rate is:

(25) |

Note that if the safe condition is satisfied, we set $Q_t^{target} = Q_{source}^*$ and thus $\beta_t < 1$; otherwise we set $Q_t^{target} = Q_t$ and $\beta_t = 1$.

We would like to make the following discussion:

(1) The distance between the two MDPs influences the convergence rate. According to Proposition 1, if two MDPs have similar components ($P$, $R$, $\gamma$), the optimal Q-functions of the two MDPs will be close. The discounted error ratio will then be relatively small, and the convergence rate will be improved.

(2) Q-learning is a special case. Please note that traditional Q-learning is a special case of target transfer Q-learning with $Q_t^{target} = Q_t$. Thus the error ratio is the constant $\beta_t = 1$, and our results reduce to the previous ones (?). This shows that if $\beta_t < 1$ in TTQL, then TTQL converges faster than traditional Q-learning.

(3) The TTQL method does converge under the safe condition. As shown in Theorem 4, the TTQL method converges, and the convergence rate changes with the discounted error ratio $\gamma\beta$: a smaller $\gamma\beta$ leads to a faster convergence rate. Intuitively, a smaller $\beta$ means that the target Q-function provides more information about the optimal Q-function. Besides, the discount factor $\gamma$ can be viewed as the "horizon" of the infinite-horizon MDP: a smaller $\gamma$ means that the expected long-term return is less influenced by future information and the immediate reward is assigned more weight.

(4) The safe condition is necessary. As mentioned above, the safe condition is $\beta_t < 1$. If the safe condition is satisfied, we set $Q_t^{target} = Q_{source}^*$ and thus $\beta_t < 1$; if it is not satisfied, we set $Q_t^{target} = Q_t$ and $\beta_t = 1$. So with the safe condition, the TTQL algorithm converges in every situation. At the beginning of training on the new task, due to the large error of the current Q-function, $\beta_t$ will be relatively small and transfer learning will be greatly helpful. The speedup comes down as the error of the current Q-function becomes smaller. Finally, when $\beta_t$ is equal to or larger than one, we remove the transferred target (i.e., set $Q_t^{target} = Q_t$) to avoid the harm brought by transfer learning.

## Discussion for Error Ratio Safe Condition

So far, we have concluded that TTQL converges, but the TTQL method needs the safe condition to guarantee convergence. In this section, we discuss the safe conditions.

At the beginning, we proposed a safe condition that can guarantee the convergence of the algorithm in general. Heuristically, the safe condition is related to the distance between the two MDPs and the quality of the current Q-function. Then, according to Theorem 1, the safe condition is $\beta_t < 1$, which we call the error ratio safe condition. In the transfer learning in RL setting, it means that the distance between the two MDPs needs to be smaller than the error of the current Q-function. In a real algorithm, it is impossible to compute the error of the current Q-function or the distance between the two MDPs precisely. However, it is easy to compute the Bellman error $\|\mathcal{T}Q - Q\|_\infty$. We can prove that the two metrics satisfy the relationship

$$(1-\gamma)\,\|Q - Q^*\|_\infty \le \|\mathcal{T}Q - Q\|_\infty \le (1+\gamma)\,\|Q - Q^*\|_\infty.$$

Following the standard practice in Q-learning, we therefore estimate the error of a Q-function w.r.t. the optimal Q-function, and hence the error ratio, by the Bellman error.
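Since the Bellman error brackets the unobservable Q error between constant factors, it gives a computable proxy for the safe-condition check. A sketch of the resulting test is below; it is our own illustration, and it assumes an explicitly known MDP (in practice the Bellman error would be estimated from samples):

```python
import numpy as np

def bellman_error(Q, P, R, gamma):
    """Max-norm Bellman error ||TQ - Q||_inf for an explicitly known finite MDP."""
    TQ = R + gamma * np.einsum('sat,t->sa', P, Q.max(axis=1))
    return np.abs(TQ - Q).max()

def safe_to_transfer(Q_current, Q_source, P, R, gamma):
    """Bellman-error version of the error ratio safe condition: transfer only
    while the source Q-function looks closer to Q* than the current one."""
    return bellman_error(Q_source, P, R, gamma) < bellman_error(Q_current, P, R, gamma)

rng = np.random.default_rng(3)
S, A, gamma = 5, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

Q_star = np.zeros((S, A))
for _ in range(2000):                    # value iteration: Q_star converges to Q*
    Q_star = R + gamma * np.einsum('sat,t->sa', P, Q_star.max(axis=1))

print(safe_to_transfer(np.zeros((S, A)), Q_star, P, R, gamma))  # True
```

Here the source Q-function is (near-)optimal and the current one is the all-zeros initialization, so the check correctly permits the transfer; as the current Q-function improves, its Bellman error shrinks and the check eventually turns the transfer off.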

###### Proof of the relation between $\|Q - Q^*\|_\infty$ and $\|\mathcal{T}Q - Q\|_\infty$.

Denote $\mathcal{T}$ as the optimal Bellman operator, which is a $\gamma$-contraction with fixed point $Q^*$. On the one hand, $\|Q - Q^*\|_\infty \le \|Q - \mathcal{T}Q\|_\infty + \|\mathcal{T}Q - \mathcal{T}Q^*\|_\infty \le \|Q - \mathcal{T}Q\|_\infty + \gamma \|Q - Q^*\|_\infty$, which gives $(1-\gamma)\|Q - Q^*\|_\infty \le \|\mathcal{T}Q - Q\|_\infty$. On the other hand, $\|\mathcal{T}Q - Q\|_\infty \le \|\mathcal{T}Q - \mathcal{T}Q^*\|_\infty + \|Q^* - Q\|_\infty \le (1+\gamma)\|Q - Q^*\|_\infty$.

So we have proved that

$$(1-\gamma)\,\|Q - Q^*\|_\infty \le \|\mathcal{T}Q - Q\|_\infty \le (1+\gamma)\,\|Q - Q^*\|_\infty.$$

∎

## Experiment

In this section, we report simulation experiments that support our convergence analysis and verify the effectiveness of the proposed target transfer Q-learning with the error ratio safe condition.

We consider a general MDP setting. We construct random MDPs by generating the transition probability $P$, the reward function $R$ and the discount factor $\gamma$, fixing the sizes of the state and action spaces at 50.
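Random MDPs of this kind can be generated along the following lines; the exact sampling scheme (Dirichlet rows for $P$, uniform rewards) is our own assumption for illustration, since the paper does not specify it:

```python
import numpy as np

def random_mdp(n_states=50, n_actions=4, gamma=0.9, seed=0):
    """Sample a random finite MDP: Dirichlet rows give a valid transition
    distribution P(.|s,a); rewards are drawn uniformly from [0, 1]."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.random((n_states, n_actions))
    return P, R, gamma

P, R, gamma = random_mdp()
print(P.shape, np.allclose(P.sum(axis=2), 1.0))  # (50, 4, 50) True
```

Varying the seed (or perturbing one component of a base MDP) yields source tasks at controlled distances from the new task, matching the experimental design described below.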

First of all, we generate 9 different MDPs () as source tasks and then generate the new MDP . Let be different from in and the distance from and increase as Similarly, MDPs is different from in , and MDPs