# Double Q(σ) and Q(σ,λ) Unifying Reinforcement Learning Control Algorithms

###### Abstract

Temporal-difference (TD) learning is an important field in reinforcement learning. Sarsa and Q-Learning are among the most used TD algorithms. The Q(σ) algorithm (Sutton and Barto (2017)) unifies both. This paper extends the Q(σ) algorithm to an on-line multi-step algorithm Q(σ, λ) using eligibility traces and introduces Double Q(σ) as the extension of Q(σ) to double learning. Experiments suggest that the new Q(σ, λ) algorithm can outperform the classical TD control methods Sarsa(λ), Q(λ) and Q(σ).

Markus Dumke Department of Statistics Ludwig-Maximilians-Universität München markus.dumke@campus.lmu.de

## 1 Introduction

Reinforcement Learning is a field of machine learning addressing the problem of sequential decision making. It is formulated as an interaction of an agent and an environment over a number of discrete time steps $t$. At each time step the agent chooses an action $A_t$ based on the environment's state $S_t$. The environment takes $A_t$ as an input and returns the next state observation $S_{t+1}$ and reward $R_{t+1}$, a scalar numeric feedback signal.

The agent is thereby following a policy $\pi$, which is the behavior function mapping a state to action probabilities

$$\pi(a \mid s) = P(A_t = a \mid S_t = s). \tag{1}$$

The agent's goal is to maximize the return $G_t$, which is the sum of discounted rewards,

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{T-1} \gamma^k R_{t+1+k}, \tag{2}$$

where $\gamma \in [0, 1]$ is the discount factor and $T$ is the length of the episode or infinity for a continuing task.
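As a small illustration of Equation 2, the following sketch computes the return for a finite list of rewards. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of the original text.

```python
def discounted_return(rewards, gamma):
    """Return G_t = sum_k gamma^k * R_{t+1+k} for the rewards observed after time t."""
    g = 0.0
    # Accumulate backwards so that each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [-1.0, -1.0, -1.0]  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=1.0))  # -3.0 (undiscounted)
print(discounted_return(rewards, gamma=0.5))  # -1.75 = -1 - 0.5 - 0.25
```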

While rewards are short-term signals about the goodness of an action, values represent the long-term value of a state or state-action pair. The action value function $q_\pi(s, a)$ is defined as the expected return taking action $a$ from state $s$ and thereafter following policy $\pi$:

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]. \tag{3}$$

Value-based reinforcement learning is concerned with finding the optimal action value function $q_*$. Temporal-difference learning is a class of model-free methods which estimates $q_\pi$ from sample transitions and iteratively updates the estimated values using observed rewards and estimated values of successor actions. At each step an update of the following form is applied:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \delta_t, \tag{4}$$

where $Q$ is an estimate of $q_\pi$, $\alpha$ is the step size and $\delta_t$ is the TD error, the difference between a newly computed target value and our current estimate. The following TD control algorithms can all be characterized by their different TD errors.

When the action values are represented as a table we call this tabular reinforcement learning; otherwise we speak of approximate reinforcement learning, e.g. when using a neural network to compute the action values. For the sake of simplicity the following analysis is done for tabular reinforcement learning but can be easily extended to function approximation.
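In the tabular case, Equation 4 amounts to a single in-place update of one table entry. A minimal sketch, assuming a dictionary keyed by hypothetical `(state, action)` pairs whose unseen entries default to 0:

```python
from collections import defaultdict

def td_update(Q, state, action, td_error, alpha):
    """Apply Q(S_t, A_t) <- Q(S_t, A_t) + alpha * delta_t and return the new value."""
    Q[(state, action)] += alpha * td_error
    return Q[(state, action)]

Q = defaultdict(float)  # assumption: all action values start at 0
new_value = td_update(Q, state="s0", action="right", td_error=-1.0, alpha=0.5)
print(new_value)  # -0.5
```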

## 2 TD control algorithms: From Sarsa to Q(σ)

Sarsa (Rummery and Niranjan (1994)) is a temporal-difference learning algorithm which samples states and actions using an $\epsilon$-greedy policy and then updates the values using Equation 4 with the following TD error

$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t). \tag{5}$$

The term $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$ is called the TD target and consists of the reward plus the discounted value of the next state and next action.
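The Sarsa step described above can be sketched as follows. The table layout (a dictionary mapping each state to a list of action values), the tie-breaking rule and all names are assumptions made for illustration only.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Assumption: ties are broken in favour of the lowest action index.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def sarsa_td_error(Q, s, a, reward, s_next, a_next, gamma):
    """Equation 5: TD target minus the current estimate."""
    td_target = reward + gamma * Q[s_next][a_next]
    return td_target - Q[s][a]

Q = {0: [0.0, 1.0], 1: [2.0, 0.5]}  # hypothetical two-state, two-action table
a_next = epsilon_greedy(Q[1], epsilon=0.0)  # epsilon = 0 is purely greedy
delta = sarsa_td_error(Q, s=0, a=1, reward=-1.0, s_next=1, a_next=a_next, gamma=1.0)
print(a_next, delta)  # 0 0.0
```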

Sarsa is an on-policy method, i.e. the TD target consists of $Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is sampled using the current policy. In general the policy used to sample the states and actions, the so-called behaviour policy $\mu$, can be different from the target policy $\pi$, which is used to compute the TD target. If behaviour and target policy are different we call this off-policy learning. An example of an off-policy TD control algorithm is the well-known Q-Learning algorithm proposed by Watkins (1989). As in Sarsa, states and actions are sampled using an exploratory behaviour policy, e.g. an $\epsilon$-greedy policy, but the TD target is computed using the greedy policy with respect to the current Q values. The TD error of Q-Learning is

$$\delta_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t). \tag{6}$$

Expected Sarsa generalizes Q-Learning to arbitrary target policies. The TD error is

$$\delta_t = R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1}) Q(S_{t+1}, a') - Q(S_t, A_t). \tag{7}$$

The current state-action pair is updated using the expectation of all subsequent action values with respect to the target policy. Q-Learning is a special case of Expected Sarsa if $\pi$ is the greedy policy with respect to $Q$ (Sutton and Barto (2017)). Of course Expected Sarsa could also be used as an on-policy algorithm if the target policy is chosen to be the same as the behaviour policy (Van Seijen et al. (2009)).
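The three one-step TD targets from Equations 5 to 7 differ only in how the next state's action values are summarised. The sketch below, with hypothetical function names and example values, also checks the special case mentioned above: with a greedy target policy the Expected Sarsa target coincides with the Q-Learning target.

```python
def sarsa_target(q_next, next_action, reward, gamma):
    """Uses the value of the action actually sampled in the next state."""
    return reward + gamma * q_next[next_action]

def q_learning_target(q_next, reward, gamma):
    """Uses the largest action value in the next state."""
    return reward + gamma * max(q_next)

def expected_sarsa_target(q_next, pi_next, reward, gamma):
    """Uses the expectation of the next action values under the target policy."""
    expected_value = sum(p * q for p, q in zip(pi_next, q_next))
    return reward + gamma * expected_value

q_next = [1.0, 3.0, 2.0]     # hypothetical action values in S_{t+1}
greedy_pi = [0.0, 1.0, 0.0]  # greedy target policy with respect to q_next
same = q_learning_target(q_next, -1.0, 0.9) == expected_sarsa_target(q_next, greedy_pi, -1.0, 0.9)
print(same)  # True
```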

Sutton and Barto (2017) propose a new TD control algorithm called Q(σ) which unifies Sarsa and Expected Sarsa. The TD target of this new algorithm is a weighted mean of the Sarsa and Expected Sarsa TD targets, where the parameter $\sigma$ controls the weighting. Q(1) is equal to Sarsa and Q(0) is equal to Expected Sarsa. For intermediate values of $\sigma$ new algorithms are obtained, which can achieve better performance (Asis et al. (2017)).

The TD error of Q(σ) is

$$\delta_t = R_{t+1} + \gamma \left( \sigma Q(S_{t+1}, A_{t+1}) + (1 - \sigma) \sum_{a'} \pi(a' \mid S_{t+1}) Q(S_{t+1}, a') \right) - Q(S_t, A_t). \tag{8}$$
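Equation 8 can be checked numerically at its two extremes. This is a minimal sketch with hypothetical argument names; `q_next` holds the action values of the next state and `pi_next` the target policy's probabilities there.

```python
def q_sigma_td_error(q_sa, q_next, pi_next, next_action, reward, gamma, sigma):
    """TD error of Q(sigma): a sigma-weighted mix of the Sarsa and Expected Sarsa targets."""
    sarsa_part = q_next[next_action]
    expected_part = sum(p * q for p, q in zip(pi_next, q_next))
    target = reward + gamma * (sigma * sarsa_part + (1.0 - sigma) * expected_part)
    return target - q_sa

q_next, pi_next = [1.0, 3.0], [0.5, 0.5]
for sigma in (1.0, 0.5, 0.0):
    # sigma = 1 gives the Sarsa error, sigma = 0 the Expected Sarsa error.
    print(sigma, q_sigma_td_error(0.0, q_next, pi_next, 0, 0.0, 1.0, sigma))
# 1.0 1.0
# 0.5 1.5
# 0.0 2.0
```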

## 3 Q(σ,λ): An on-line multi-step algorithm

The TD methods presented so far are one-step methods, which use only rewards and values from the next step $t+1$. These can be extended to use eligibility traces to incorporate data of multiple time steps.

An eligibility trace is a scalar numeric value for each state-action pair. Whenever a state-action pair is visited its eligibility is increased; otherwise the eligibility fades away over time. State-action pairs visited often will have a higher eligibility than those visited less frequently, and state-action pairs visited recently will have a higher eligibility than those visited a long time ago.

The accumulating eligibility trace (Singh and Sutton (1996)) uses an update of the form

$$E_{t+1}(s, a) = \begin{cases} \gamma \lambda E_t(s, a) + 1, & \text{if } A_t = a, S_t = s \\ \gamma \lambda E_t(s, a), & \text{otherwise.} \end{cases} \tag{9}$$

Whenever action $A_t$ is taken in state $S_t$ the eligibility of this pair is increased by 1, and the eligibility of all states and actions is decreased by a factor $\gamma \lambda$, where $\lambda \in [0, 1]$ is the trace decay parameter.

Then all state-action pairs are updated according to their eligibility trace

$$Q(s, a) \leftarrow Q(s, a) + \alpha \delta_t E_t(s, a). \tag{10}$$

The corresponding algorithm using the one-step Sarsa TD error and an update using eligibility traces is called Sarsa(λ). Though it looks like a one-step algorithm, it is in fact a multi-step algorithm, because the current TD error is assigned back to all previously visited states and actions weighted by their eligibility.
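To make the trace mechanics concrete, here is a minimal sketch of Equations 9 and 10 using dictionaries keyed by hypothetical `(state, action)` pairs. Applying the decay before the increment within one step is one reading of Equation 9 and is an assumption of this sketch.

```python
def accumulate_trace(E, visited, gamma, lam):
    """Equation 9: decay every trace by gamma * lambda, then add 1 for the visited pair."""
    for key in E:
        E[key] *= gamma * lam
    E[visited] = E.get(visited, 0.0) + 1.0

def update_all(Q, E, td_error, alpha):
    """Equation 10: move every traced pair in proportion to its eligibility."""
    for key, eligibility in E.items():
        Q[key] = Q.get(key, 0.0) + alpha * td_error * eligibility

Q, E = {}, {}
accumulate_trace(E, ("s0", "right"), gamma=0.9, lam=0.5)
accumulate_trace(E, ("s1", "up"), gamma=0.9, lam=0.5)
# The older pair now has eligibility 0.45, the most recent one 1.0.
update_all(Q, E, td_error=2.0, alpha=0.1)
# One TD error has updated both pairs: roughly 0.09 for the older, 0.2 for the newer.
```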

For off-policy algorithms like Q-Learning different eligibility updates have been proposed. Watkins's Q(λ) uses the same updates as long as the greedy action is chosen by the behaviour policy, but sets the eligibility values to 0 whenever a non-greedy action is chosen, assigning credit only to state-action pairs we would actually have visited if following the target policy $\pi$ and not the behaviour policy $\mu$. More generally the eligibility is weighted by the target policy's probability of the next action. The update rule is then

$$E_{t+1}(s, a) = \begin{cases} \gamma \lambda E_t(s, a)\, \pi(A_{t+1} \mid S_{t+1}) + 1, & \text{if } A_t = a, S_t = s \\ \gamma \lambda E_t(s, a)\, \pi(A_{t+1} \mid S_{t+1}), & \text{otherwise.} \end{cases} \tag{11}$$

Whenever an action occurs which is unlikely under the target policy, the eligibility of all previous states is decreased sharply. If the target policy is the greedy policy, a non-greedy action sets the eligibility to 0 for the complete history.
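The trace-cutting effect of Equation 11 under a greedy target policy can be seen in a few lines. In this sketch the helper names are hypothetical, and splitting the greedy probability evenly among tied actions is an assumption.

```python
def greedy_probability(q_values, action):
    """pi(action | state) under a greedy target policy (ties share the probability)."""
    best = max(q_values)
    ties = [a for a, q in enumerate(q_values) if q == best]
    return 1.0 / len(ties) if action in ties else 0.0

def policy_weighted_trace(E, visited, gamma, lam, pi_next):
    """Equation 11: decay by gamma * lambda * pi(A_{t+1} | S_{t+1}), then add 1 for the visited pair."""
    for key in E:
        E[key] *= gamma * lam * pi_next
    E[visited] = E.get(visited, 0.0) + 1.0

E = {("s0", 0): 1.0}
q_next = [1.0, 3.0, 2.0]                      # hypothetical action values in S_{t+1}
pi_next = greedy_probability(q_next, action=0)  # action 0 is not greedy here
policy_weighted_trace(E, ("s1", 2), gamma=0.9, lam=0.8, pi_next=pi_next)
print(E)  # {('s0', 0): 0.0, ('s1', 2): 1.0}
```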

In this paper we introduce a new kind of eligibility trace update to extend the Q(σ) algorithm to an on-line multi-step algorithm, which we will call Q(σ, λ). Recall that the one-step target of Q(σ) is a weighted average between the on-policy Sarsa and off-policy Expected Sarsa targets weighted by the factor $\sigma$:

$$\delta_t = R_{t+1} + \gamma \left( \sigma Q(S_{t+1}, A_{t+1}) + (1 - \sigma) \sum_{a'} \pi(a' \mid S_{t+1}) Q(S_{t+1}, a') \right) - Q(S_t, A_t). \tag{12}$$

We propose to weight the eligibility accordingly with the same factor $\sigma$. The eligibility is then a weighted average between the on-policy eligibility used in Sarsa(λ) and the off-policy eligibility used in Q(λ). The eligibility trace is updated at each step by

$$E_{t+1}(s, a) = \begin{cases} \gamma \lambda E_t(s, a) \left( \sigma + (1 - \sigma)\, \pi(A_{t+1} \mid S_{t+1}) \right) + 1, & \text{if } A_t = a, S_t = s \\ \gamma \lambda E_t(s, a) \left( \sigma + (1 - \sigma)\, \pi(A_{t+1} \mid S_{t+1}) \right), & \text{otherwise.} \end{cases} \tag{13}$$

When $\sigma = 1$ the one-step target of Q(σ) is equal to the Sarsa one-step target and therefore the eligibility update reduces to the standard accumulating eligibility trace update. When $\sigma = 0$ the one-step target of Q(σ) is equal to the Expected Sarsa target and accordingly the eligibility is weighted by the target policy's probability of the next action. For intermediate values of $\sigma$ the eligibility is weighted in the same way as the TD target. Asis et al. (2017) showed that n-step Q(σ) with an intermediate or dynamic value of $\sigma$ can outperform Q-Learning and Sarsa. By extending this algorithm to an on-line multi-step algorithm we can make use of the good initial performance of Sarsa(λ) combined with the good asymptotic performance of Q(λ). In comparison to the n-step Q(σ) algorithm (Asis et al. (2017)) the new Q(σ, λ) algorithm can learn on-line and is therefore likely to learn faster.
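The multiplicative factor in Equation 13 interpolates between the two earlier trace updates. A short sketch, with a hypothetical function name, shows the limiting cases:

```python
def trace_decay_factor(gamma, lam, sigma, pi_next_action):
    """Factor applied to every trace in Equation 13."""
    return gamma * lam * (sigma + (1.0 - sigma) * pi_next_action)

gamma, lam, pi_next_action = 0.9, 0.8, 0.1  # hypothetical values
on_policy = trace_decay_factor(gamma, lam, 1.0, pi_next_action)   # gamma * lambda, as in Equation 9
off_policy = trace_decay_factor(gamma, lam, 0.0, pi_next_action)  # gamma * lambda * pi, as in Equation 11
mixed = trace_decay_factor(gamma, lam, 0.5, pi_next_action)       # lies between the two
```

Because the target policy's probability is at most 1, the factor for any intermediate σ lies between the off-policy and on-policy factors.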

Pseudocode for tabular episodic Q(σ, λ) is given in Algorithm 1. This can be easily extended to continuing tasks and to function approximation using one eligibility trace per weight of the function approximator.

## 4 Double Q(σ) Algorithm

Double learning is another extension of the basic algorithms. It has been mostly studied with Q-Learning (Hasselt (2010)) and prevents the overestimation of action values when using Q-Learning in stochastic environments. The idea is to decouple action selection (which action is the best one?) and action evaluation (what is the value of this action?). The implementation is simple: instead of using only one value function we use two value functions $Q_A$ and $Q_B$. Actions are sampled according to an $\epsilon$-greedy policy with respect to $Q_A + Q_B$. Then at each step either $Q_A$ or $Q_B$ is updated, using Equation 14 if $Q_A$ is selected and Equation 15 if $Q_B$ is selected:

$$Q_A(S_t, A_t) \leftarrow Q_A(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q_B\left(S_{t+1}, \operatorname{argmax}_{a \in \mathcal{A}} Q_A(S_{t+1}, a)\right) - Q_A(S_t, A_t) \right) \tag{14}$$

$$Q_B(S_t, A_t) \leftarrow Q_B(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q_A\left(S_{t+1}, \operatorname{argmax}_{a \in \mathcal{A}} Q_B(S_{t+1}, a)\right) - Q_B(S_t, A_t) \right) \tag{15}$$
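The decoupling in Equations 14 and 15 can be sketched as a single function. Which table gets updated is left to the caller through the hypothetical `update_a` flag (a fair coin flip per step is a common choice, but that is an assumption here); the table layout and example values are also illustrative.

```python
def double_q_learning_update(QA, QB, s, a, reward, s_next, alpha, gamma, update_a):
    """Apply Equation 14 if update_a is True, otherwise Equation 15; returns the TD error."""
    Q_update, Q_eval = (QA, QB) if update_a else (QB, QA)
    n_actions = len(Q_update[s_next])
    # Action selection uses the table that is being updated ...
    best = max(range(n_actions), key=lambda b: Q_update[s_next][b])
    # ... while the value of that action is read from the other table.
    delta = reward + gamma * Q_eval[s_next][best] - Q_update[s][a]
    Q_update[s][a] += alpha * delta
    return delta

QA = {0: [0.0, 0.0], 1: [1.0, 5.0]}
QB = {0: [0.0, 0.0], 1: [2.0, -1.0]}
delta = double_q_learning_update(QA, QB, s=0, a=0, reward=0.0, s_next=1,
                                 alpha=0.5, gamma=1.0, update_a=True)
print(delta, QA[0][0])  # -1.0 -0.5
```

In the example, the first table rates action 1 highest in the next state, but the second table values that action at -1, so the optimistic estimate of 5 never enters the target.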

Double learning can also be used with Sarsa and Expected Sarsa as proposed by Michael Ganger and Hu (2016). Using double learning these algorithms can be more robust and perform better in stochastic environments. The decoupling of action selection and action evaluation is weaker than in Double Q-Learning because the next action $A_{t+1}$ is selected according to an $\epsilon$-greedy behavior policy using $Q_A + Q_B$ and evaluated either with $Q_A$ or $Q_B$. For Expected Sarsa the policy used for the target in Equation 7 could be the $\epsilon$-greedy behavior policy as proposed by Michael Ganger and Hu (2016), but it is probably better to use a policy according to $Q_A$ (if updating $Q_A$), because then it can also be used off-policy with Double Q-Learning as a special case, if $\pi$ is the greedy policy with respect to $Q_A$.

In this paper we propose the extension of double learning to Q(σ), called Double Q(σ), to obtain a new algorithm with the good learning properties of double learning, which generalizes (Double) Q-Learning, (Double) Expected Sarsa and (Double) Sarsa. Of course Double Q(σ) can also be used with eligibility traces.

Double Q(σ) has the following TD error when $Q_A$ is selected,

$$\delta_t = R_{t+1} + \gamma \left( \sigma Q_B(S_{t+1}, A_{t+1}) + (1 - \sigma) \sum_{a} \pi(a \mid S_{t+1}) Q_B(S_{t+1}, a) \right) - Q_A(S_t, A_t) \tag{16}$$

and

$$\delta_t = R_{t+1} + \gamma \left( \sigma Q_A(S_{t+1}, A_{t+1}) + (1 - \sigma) \sum_{a} \pi(a \mid S_{t+1}) Q_A(S_{t+1}, a) \right) - Q_B(S_t, A_t) \tag{17}$$

if $Q_B$ is selected. The target policy $\pi$ is computed with respect to the value function which is updated, i.e. with respect to $Q_A$ in Equation 16 and with respect to $Q_B$ in Equation 17.

Pseudocode for Double Q(σ) is given in Algorithm 2.
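The TD errors in Equations 16 and 17 can be sketched with one function that takes the table being updated and the table used for evaluation. This is an illustrative sketch rather than the paper's Algorithm 2: the greedy target policy, the table layout and all names are assumptions.

```python
def greedy_policy(q_values):
    """Greedy target policy as a probability vector (first maximum wins ties)."""
    probs = [0.0] * len(q_values)
    probs[max(range(len(q_values)), key=lambda a: q_values[a])] = 1.0
    return probs

def double_q_sigma_td_error(Q_update, Q_eval, s, a, reward, s_next, a_next, gamma, sigma):
    """Equation 16 when (Q_update, Q_eval) = (QA, QB), Equation 17 with the roles swapped."""
    pi = greedy_policy(Q_update[s_next])  # target policy from the table being updated
    expected = sum(p * q for p, q in zip(pi, Q_eval[s_next]))
    target = reward + gamma * (sigma * Q_eval[s_next][a_next] + (1.0 - sigma) * expected)
    return target - Q_update[s][a]

QA = {0: [0.0, 0.0], 1: [1.0, 5.0]}
QB = {0: [0.0, 0.0], 1: [2.0, -1.0]}
# sigma = 0 with a greedy target policy gives the Double Q-Learning error,
# sigma = 1 gives the Double Sarsa error for the sampled next action.
print(double_q_sigma_td_error(QA, QB, 0, 0, 0.0, 1, 0, 1.0, sigma=0.0))  # -1.0
print(double_q_sigma_td_error(QA, QB, 0, 0, 0.0, 1, 0, 1.0, sigma=1.0))  # 2.0
```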

## 5 Experiments

In this section the performance of the newly proposed Q(σ, λ) algorithm is tested on a gridworld navigation task and compared with the performance of classical TD control algorithms like Sarsa and Q-Learning as well as Q(σ).

The windy gridworld is a simple navigation task described by Sutton and Barto (1998). The goal is to get as fast as possible from a start state to a goal state using the actions left, right, up or down. In each column of the grid the agent is pushed upward by a wind. When an action would take the agent outside the grid, the agent is placed in the nearest cell inside the grid. The stochastic windy gridworld (Asis et al. (2017)) is a variant where state transitions are random: with a probability of 0.1 the agent will transition to one of the surrounding eight states independent of the action. The task is treated as an undiscounted episodic task with a reward of -1 for each transition. Figure 1 visualizes the gridworld.
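A transition function for the stochastic windy gridworld could look like the sketch below. The grid size, wind strengths, start and goal cells follow the standard windy gridworld example in Sutton and Barto's textbook and are assumptions here, as are the choices to apply the wind of the column the agent leaves and to skip the wind on random transitions.

```python
import random

ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # upward push per column (assumed layout)
START, GOAL = (3, 0), (3, 7)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
NEIGHBOURS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

def clip(value, low, high):
    """Place a coordinate in the nearest cell inside the grid."""
    return max(low, min(value, high))

def step(state, action, rng=random, slip_prob=0.1):
    """Return (next_state, reward, done) for one transition; row 0 is the top row."""
    row, col = state
    if rng.random() < slip_prob:
        # Random transition to one of the eight surrounding cells, ignoring the action.
        dr, dc = rng.choice(NEIGHBOURS)
        row, col = row + dr, col + dc
    else:
        dr, dc = MOVES[action]
        row, col = row + dr - WIND[col], col + dc
    next_state = (clip(row, 0, ROWS - 1), clip(col, 0, COLS - 1))
    return next_state, -1.0, next_state == GOAL

print(step((3, 3), "right", slip_prob=0.0))  # ((2, 4), -1.0, False)
print(step((4, 8), "left", slip_prob=0.0))   # ((3, 7), -1.0, True)
```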

Experiments were conducted using an $\epsilon$-greedy behaviour policy with a fixed value of $\epsilon$. The performance in terms of the average return over the first 100 episodes was measured for different values of $\sigma$ and $\lambda$ as a function of the step size $\alpha$. For the Expected Sarsa part of the update a greedy target policy was chosen, i.e. Q(0) is exactly Q-Learning. Results were averaged over 200 independent runs.

Figure 2 shows that an intermediate value of $\sigma$ performed better than Sarsa (Q(1)) and Q-Learning (Q(0)). The best performance was found by dynamically varying $\sigma$ over time, i.e. decreasing it by a factor of 0.99 after each episode. Multi-step bootstrapping with a trace decay parameter $\lambda > 0$ performed better than the one-step algorithms ($\lambda = 0$). Dynamically varying the value of $\sigma$ makes it possible to combine the good initial performance of Sarsa with the good asymptotic performance of Expected Sarsa. This confirms the results observed by Asis et al. (2017) for n-step algorithms.

Figure 1: The windy gridworld task. The goal is to move from the start state S to the goal state G while facing an upward wind in the middle of the grid, which is denoted in the numbers below the grid. Described by Sutton and Barto (1998).
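The dynamic schedule for σ used in the experiments, a multiplicative decay by 0.99 after each episode, can be sketched as follows. The starting value of 1 (pure Sarsa) is an assumption, since the text only states the decay factor.

```python
def sigma_schedule(initial_sigma, decay, n_episodes):
    """Return the value of sigma used in each episode under multiplicative decay."""
    sigma = initial_sigma
    values = []
    for _ in range(n_episodes):
        values.append(sigma)
        sigma *= decay  # applied after each episode
    return values

sigmas = sigma_schedule(initial_sigma=1.0, decay=0.99, n_episodes=100)
# sigmas[0] is 1.0 (Sarsa-like); by the 100th episode sigma has dropped to roughly 0.37,
# so the updates have moved most of the way towards the Expected Sarsa end.
```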

## 6 Conclusions

This paper has presented two extensions to the Q(σ) algorithm, which unifies Q-Learning, Expected Sarsa and Sarsa. Q(σ, λ) extends the algorithm to an on-line multi-step algorithm using eligibility traces and Double Q(σ) extends the algorithm to double learning. Empirical results suggest that Q(σ, λ) can outperform classic TD control algorithms like Sarsa(λ), Q(λ) and Q(σ). Dynamically varying $\sigma$ obtains the best results.

Future research might focus on the performance of Q(σ, λ) when used with non-linear function approximation and on different schemes to update $\sigma$ over time.

## References

• Asis et al. (2017) Asis, K. D., Hernandez-Garcia, J. F., Holland, G. Z. and Sutton, R. S. (2017). Multi-step reinforcement learning: A unifying algorithm, CoRR abs/1703.01327. http://arxiv.org/abs/1703.01327
• Hasselt (2010) Hasselt, H. V. (2010). Double Q-learning, in J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel and A. Culotta (eds), Advances in Neural Information Processing Systems 23, Curran Associates, Inc., pp. 2613–2621. http://papers.nips.cc/paper/3964-double-q-learning.pdf
• Michael Ganger and Hu (2016) Ganger, M., Duryea, E. and Hu, W. (2016). Double Sarsa and Double Expected Sarsa with shallow and deep learning, Journal of Data Analysis and Information Processing 4: 159–176.
• Rummery and Niranjan (1994) Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems, Technical report.
• Singh and Sutton (1996) Singh, S. P. and Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces, Machine Learning 22(1): 123–158.
• Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning, 1st edn, MIT Press, Cambridge, MA, USA.
• Sutton and Barto (2017) Sutton, R. S. and Barto, A. G. (2017). Reinforcement learning: An introduction. Accessed: 2017-08-01.
• Van Seijen et al. (2009) Van Seijen, H., Van Hasselt, H., Whiteson, S. and Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa, Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL'09. IEEE Symposium on, IEEE, pp. 177–184.
• Watkins (1989) Watkins, C. J. C. H. (1989). Learning from delayed rewards, PhD thesis, King's College, Cambridge.