Double Q($\sigma$) and Q($\sigma$, $\lambda$)
Unifying Reinforcement Learning Control Algorithms
Abstract
Temporal-difference (TD) learning is an important field in reinforcement learning. Sarsa and Q-Learning are among the most used TD algorithms. The Q($\sigma$) algorithm (Sutton and Barto (2017)) unifies both. This paper extends the Q($\sigma$) algorithm to an online multi-step algorithm Q($\sigma$, $\lambda$) using eligibility traces and introduces Double Q($\sigma$) as the extension of Q($\sigma$) to double learning. Experiments suggest that the new Q($\sigma$, $\lambda$) algorithm can outperform the classical TD control methods Sarsa($\lambda$), Q($\lambda$) and Q($\sigma$).
Markus Dumke, Department of Statistics, Ludwig-Maximilians-Universität München, markus.dumke@campus.lmu.de
1 Introduction
Reinforcement Learning is a field of machine learning addressing the problem of sequential decision making. It is formulated as an interaction of an agent and an environment over a number of discrete time steps $t$. At each time step the agent chooses an action $A_t$ based on the environment's state $S_t$. The environment takes $A_t$ as an input and returns the next state observation $S_{t+1}$ and reward $R_{t+1}$, a scalar numeric feedback signal.
The agent is thereby following a policy $\pi$, which is the behaviour function mapping a state to action probabilities
(1) $\pi(a \mid s) = P(A_t = a \mid S_t = s)$
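The agent's goal is to maximize the return $G_t$, which is the sum of discounted rewards,
(2) $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{T-t-1} \gamma^k R_{t+1+k}$
where $\gamma \in [0, 1]$ is the discount factor and $T$ is the length of the episode or infinity for a continuing task.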
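As a small illustration of the return defined above, the discounted sum over a finite reward sequence can be accumulated backwards from the end of the episode. This is only a sketch: the function name `discounted_return` and the example rewards are illustrative assumptions and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence R_{t+1}, R_{t+2}, ...

    Accumulates backwards using G = R + gamma * G, which is equivalent
    to summing gamma^k * R_{t+1+k} over k.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    # Three steps with reward 1 and gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With $\gamma = 1$ the function reduces to the plain sum of rewards, which corresponds to the undiscounted episodic setting.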
While rewards are short-term signals about the goodness of an action, values represent the long-term value of a state or state-action pair. The action value function $q_\pi(s, a)$ is defined as the expected return taking action $a$ from state $s$ and thereafter following policy $\pi$:
(3) $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
Value-based reinforcement learning is concerned with finding the optimal action value function $q_* = \max_\pi q_\pi$. Temporal-difference learning is a class of model-free methods which estimates $q_\pi$ from sample transitions and iteratively updates the estimated values using observed rewards and estimated values of successor actions. At each step an update of the following form is applied:
(4) $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \delta_t$
where $Q$ is an estimate of $q_\pi$, $\alpha$ is the step size and $\delta_t$ is the TD error, the difference between our current estimate and a newly computed target value. The following TD control algorithms can all be characterized by their different TD errors.
When the action values are represented as a table we call this tabular reinforcement learning; otherwise we speak of approximate reinforcement learning, e.g. when using a neural network to compute the action values. For the sake of simplicity the following analysis is done for tabular reinforcement learning but can be easily extended to function approximation.
2 TD control algorithms: From Sarsa to Q($\sigma$)
Sarsa (Rummery and Niranjan (1994)) is a temporal-difference learning algorithm which samples states and actions using an $\epsilon$-greedy policy and then updates the $Q$ values using Equation 4 with the following TD error
(5) $\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$
The term $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$ is called the TD target and consists of the reward plus the discounted value of the next state and next action.
Sarsa is an on-policy method, i.e. the TD target consists of $Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is sampled using the current policy. In general the policy used to sample the states and actions, the so-called behaviour policy $\mu$, can be different from the target policy $\pi$, which is used to compute the TD target. If behaviour and target policy are different we call this off-policy learning. An example of an off-policy TD control algorithm is the well-known Q-Learning algorithm proposed by Watkins (1989). As in Sarsa, states and actions are sampled using an exploratory behaviour policy, e.g. an $\epsilon$-greedy policy, but the TD target is computed using the greedy policy with respect to the current $Q$ values. The TD error of Q-Learning is
(6) $\delta_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)$
Expected Sarsa generalizes Q-Learning to arbitrary target policies. The TD error is
(7) $\delta_t = R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1}) Q(S_{t+1}, a') - Q(S_t, A_t)$
The current state-action pair is updated using the expectation of the action values in the next state, taken with respect to the target policy $\pi$. Q-Learning is a special case of Expected Sarsa if $\pi$ is the greedy policy with respect to $Q$ (Sutton and Barto (2017)). Of course Expected Sarsa could also be used as an on-policy algorithm if the target policy is chosen to be the same as the behaviour policy (Van Seijen et al. (2009)).
Sutton and Barto (2017) propose a new TD control algorithm called Q($\sigma$) which unifies Sarsa and Expected Sarsa. The TD target of this new algorithm is a weighted mean of the Sarsa and Expected Sarsa TD targets, where the parameter $\sigma \in [0, 1]$ controls the weighting. Q(1) is equal to Sarsa and Q(0) is equal to Expected Sarsa. For intermediate values of $\sigma$ new algorithms are obtained, which can achieve better performance (Asis et al. (2017)).
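The TD error of Q($\sigma$) is
(8) $\delta_t = R_{t+1} + \gamma \left( \sigma Q(S_{t+1}, A_{t+1}) + (1 - \sigma) \sum_{a'} \pi(a' \mid S_{t+1}) Q(S_{t+1}, a') \right) - Q(S_t, A_t)$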
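The relationship between the one-step TD errors above can be checked with a short sketch. The code below computes the weighted TD error for a tabular value function stored as a nested list `Q[state][action]`; the function names, the table layout and the example numbers are illustrative assumptions rather than the paper's implementation.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (ties go to the first maximum)."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def q_sigma_td_error(Q, s, a, r, s_next, a_next, target_probs, sigma, gamma):
    """One-step TD error mixing the Sarsa and Expected Sarsa targets.

    sigma = 1 uses only the sampled next action (Sarsa);
    sigma = 0 uses only the expectation under the target policy (Expected Sarsa).
    """
    sarsa_part = Q[s_next][a_next]
    expected_part = sum(p * q for p, q in zip(target_probs, Q[s_next]))
    target = r + gamma * (sigma * sarsa_part + (1.0 - sigma) * expected_part)
    return target - Q[s][a]


if __name__ == "__main__":
    Q = [[0.0, 1.0], [2.0, 4.0]]
    greedy = epsilon_greedy_probs(Q[1], 0.0)  # greedy target policy
    for sigma in (1.0, 0.5, 0.0):
        print(sigma, q_sigma_td_error(Q, 0, 1, -1.0, 1, 0, greedy, sigma, 0.9))
```

With a greedy target policy and `sigma = 0` the expectation collapses to the maximum next action value, so the result coincides with the Q-Learning TD error.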
3 Q($\sigma$, $\lambda$): An online multi-step algorithm
The TD methods presented so far are one-step methods, which use only rewards and values from the next step $t + 1$. These can be extended to use eligibility traces to incorporate data of multiple time steps.
An eligibility trace is a scalar numeric value for each state-action pair. Whenever a state-action pair is visited its eligibility is increased; otherwise the eligibility fades away over time. State-action pairs visited often will have a higher eligibility than those visited less frequently, and state-action pairs visited recently will have a higher eligibility than those visited a long time ago.
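The accumulating eligibility trace (Singh and Sutton (1996)) uses an update of the form
(9) $E_t(s, a) = \begin{cases} \gamma \lambda E_{t-1}(s, a) + 1 & \text{if } s = S_t \text{ and } a = A_t \\ \gamma \lambda E_{t-1}(s, a) & \text{otherwise} \end{cases}$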
Whenever taking action $A_t$ in state $S_t$ the eligibility of this pair is increased by 1, and for all states and actions the eligibility is decreased by a factor $\gamma \lambda$, where $\lambda \in [0, 1]$ is the trace decay parameter.
Then all state-action pairs are updated according to their eligibility trace
(10) $Q(s, a) \leftarrow Q(s, a) + \alpha \delta_t E_t(s, a) \quad \text{for all } s, a$
The corresponding algorithm using the one-step Sarsa TD error and an update using eligibility traces is called Sarsa($\lambda$). Though it looks like a one-step algorithm, it is in fact a multi-step algorithm, because the current TD error is assigned back to all previously visited states and actions weighted by their eligibility.
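To make the backward assignment of the TD error concrete, here is a minimal sketch of a single tabular update with accumulating traces and the Sarsa TD error. The nested-list layout of `Q` and `E`, the function name and the order of operations (increment the trace, update all values, then decay all traces) are assumptions chosen for illustration.

```python
def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One in-place Sarsa(lambda) update with accumulating eligibility traces."""
    # One-step Sarsa TD error for the observed transition.
    delta = r + gamma * Q[s_next][a_next] - Q[s][a]
    # The visited pair becomes more eligible.
    E[s][a] += 1.0
    for si in range(len(Q)):
        for ai in range(len(Q[si])):
            # Every pair receives a share of the TD error ...
            Q[si][ai] += alpha * delta * E[si][ai]
            # ... and its eligibility then fades by gamma * lambda.
            E[si][ai] *= gamma * lam
    return delta


if __name__ == "__main__":
    # Chain of three states with one action each; reward arrives on the second step.
    Q = [[0.0], [0.0], [0.0]]
    E = [[0.0], [0.0], [0.0]]
    sarsa_lambda_step(Q, E, 0, 0, 0.0, 1, 0, alpha=0.1, gamma=0.9, lam=0.5)
    sarsa_lambda_step(Q, E, 1, 0, 1.0, 2, 0, alpha=0.1, gamma=0.9, lam=0.5)
    print(Q)
```

In this example the reward observed on the second transition also changes the value of the first state-action pair, scaled by its remaining eligibility; a one-step method would leave that pair untouched.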
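For off-policy algorithms like Q-Learning different eligibility updates have been proposed. Watkins's Q($\lambda$) uses the same updates as long as the greedy action is chosen by the behaviour policy, but sets the eligibility values to 0 whenever a non-greedy action is chosen, assigning credit only to state-action pairs we would actually have visited if following the target policy $\pi$ and not the behaviour policy $\mu$. More generally, the eligibility carried over from the previous step is weighted by the target policy's probability of the action taken next. The update rule is then
(11) $E_t(s, a) = \begin{cases} \gamma \lambda \pi(A_t \mid S_t) E_{t-1}(s, a) + 1 & \text{if } s = S_t \text{ and } a = A_t \\ \gamma \lambda \pi(A_t \mid S_t) E_{t-1}(s, a) & \text{otherwise} \end{cases}$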
Whenever an action occurs that is unlikely under the target policy, the eligibility of all previously visited state-action pairs is decreased sharply. If the target policy is the greedy policy and a non-greedy action is taken, the eligibility is set to 0 for the complete history.
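In this paper we introduce a new kind of eligibility trace update to extend the Q($\sigma$) algorithm to an online multi-step algorithm, which we will call Q($\sigma$, $\lambda$). Recall that the one-step target of Q($\sigma$) is a weighted average between the on-policy Sarsa and off-policy Expected Sarsa targets weighted by the factor $\sigma$:
(12) $\delta_t = R_{t+1} + \gamma \left( \sigma Q(S_{t+1}, A_{t+1}) + (1 - \sigma) \sum_{a'} \pi(a' \mid S_{t+1}) Q(S_{t+1}, a') \right) - Q(S_t, A_t)$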
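In this paper we propose to weight the eligibility accordingly with the same factor $\sigma$. The eligibility is then a weighted average between the on-policy eligibility used in Sarsa($\lambda$) and the off-policy eligibility used in Q($\lambda$). The eligibility trace is updated at each step by
(13) $E_t(s, a) = \begin{cases} \gamma \lambda \left( \sigma + (1 - \sigma) \pi(A_t \mid S_t) \right) E_{t-1}(s, a) + 1 & \text{if } s = S_t \text{ and } a = A_t \\ \gamma \lambda \left( \sigma + (1 - \sigma) \pi(A_t \mid S_t) \right) E_{t-1}(s, a) & \text{otherwise} \end{cases}$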
When $\sigma = 1$ the one-step target of Q($\sigma$) is equal to the Sarsa one-step target and therefore the eligibility update reduces to the standard accumulating eligibility trace update. When $\sigma = 0$ the one-step target of Q($\sigma$) is equal to the Expected Sarsa target and accordingly the eligibility is weighted by the target policy's probability of the current action. For intermediate values of $\sigma$ the eligibility is weighted in the same way as the TD target. Asis et al. (2017) showed that n-step Q($\sigma$) with an intermediate or dynamic value of $\sigma$ can outperform Q-Learning and Sarsa. By extending this algorithm to an online multi-step algorithm we can make use of the good initial performance of Sarsa($\lambda$) combined with the good asymptotic performance of Q($\lambda$). In comparison to the n-step Q($\sigma$) algorithm (Asis et al. (2017)) the new Q($\sigma$, $\lambda$) algorithm can learn online and is therefore likely to learn faster.
Pseudocode for tabular episodic Q($\sigma$, $\lambda$) is given in Algorithm 1. This can be easily extended to continuing tasks and to function approximation using one eligibility per weight of the function approximator.
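A single update of the tabular method described in this section might look like the following Python sketch. It is not a transcription of Algorithm 1: the function name, the nested-list tables and the order of operations (increment the trace of the visited pair, update all values, then decay all traces with the mixed factor) are assumptions made for illustration, and `target_probs` is assumed to hold the target policy's action probabilities in the next state.

```python
def q_sigma_lambda_step(Q, E, s, a, r, s_next, a_next, target_probs,
                        alpha, gamma, lam, sigma):
    """One in-place tabular update combining the mixed TD error with mixed traces."""
    # TD error: weighted mean of the sampled and the expected next value.
    expected = sum(p * q for p, q in zip(target_probs, Q[s_next]))
    sampled = Q[s_next][a_next]
    delta = r + gamma * (sigma * sampled + (1.0 - sigma) * expected) - Q[s][a]

    # Accumulate eligibility for the visited pair.
    E[s][a] += 1.0

    # Trace decay mixes the on-policy factor (1) and the off-policy
    # factor (target probability of the action actually taken next).
    decay = gamma * lam * (sigma + (1.0 - sigma) * target_probs[a_next])

    for si in range(len(Q)):
        for ai in range(len(Q[si])):
            Q[si][ai] += alpha * delta * E[si][ai]
            E[si][ai] *= decay
    return delta


if __name__ == "__main__":
    Q = [[0.0, 0.0], [1.0, 2.0]]
    E = [[0.5, 0.0], [0.0, 0.0]]
    greedy_target = [0.0, 1.0]  # greedy with respect to Q[1]
    # The next action (index 0) is non-greedy, so with sigma = 0 all traces are cut.
    q_sigma_lambda_step(Q, E, 0, 1, 0.0, 1, 0, greedy_target,
                        alpha=0.5, gamma=1.0, lam=0.9, sigma=0.0)
    print(Q, E)
```

Setting `sigma=1.0` makes the decay factor equal to `gamma * lam`, so the sketch behaves like Sarsa($\lambda$) with accumulating traces; setting `sigma=0.0` with a greedy target policy cuts all traces after a non-greedy action, as in Watkins's Q($\lambda$).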
4 Double Q($\sigma$) Algorithm
Double learning is another extension of the basic algorithms. It has been mostly studied with Q-Learning (Hasselt (2010)) and prevents the overestimation of action values when using Q-Learning in stochastic environments. The idea is to decouple action selection (which action is the best one?) and action evaluation (what is the value of this action?). The implementation is simple: instead of using only one value function we use two value functions $Q_A$ and $Q_B$. Actions are sampled from an $\epsilon$-greedy policy with respect to $Q_A + Q_B$. Then at each step either $Q_A$ or $Q_B$ is updated, e.g. if $Q_A$ is selected by
(14) $A^* = \arg\max_a Q_A(S_{t+1}, a)$
(15) $Q_A(S_t, A_t) \leftarrow Q_A(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q_B(S_{t+1}, A^*) - Q_A(S_t, A_t) \right)$
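The decoupling of selection and evaluation can be sketched in a few lines. In the code below the caller decides which table plays the role of the updated function and which one evaluates the selected action; the function name, the argument names `Q_upd` and `Q_eval` and the nested-list tables are illustrative assumptions.

```python
import random


def double_q_learning_update(Q_upd, Q_eval, s, a, r, s_next, alpha, gamma):
    """Update Q_upd in place; the greedy action is chosen by Q_upd, valued by Q_eval."""
    # Action selection with the table being updated.
    best = max(range(len(Q_upd[s_next])), key=lambda b: Q_upd[s_next][b])
    # Action evaluation with the other table.
    delta = r + gamma * Q_eval[s_next][best] - Q_upd[s][a]
    Q_upd[s][a] += alpha * delta
    return delta


if __name__ == "__main__":
    QA = [[0.0, 0.0], [1.0, 5.0]]
    QB = [[0.0, 0.0], [3.0, 2.0]]
    # Pick the table to update at random, as described in the text.
    if random.random() < 0.5:
        double_q_learning_update(QA, QB, 0, 0, 1.0, 1, alpha=0.5, gamma=1.0)
    else:
        double_q_learning_update(QB, QA, 0, 0, 1.0, 1, alpha=0.5, gamma=1.0)
    print(QA, QB)
```

In the example, updating the first table uses the value 2.0 that the second table assigns to the selected action instead of the first table's own, possibly overestimated, value 5.0.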
Double learning can also be used with Sarsa and Expected Sarsa as proposed by Michael Ganger and Hu (2016). Using double learning these algorithms can be more robust and perform better in stochastic environments. The decoupling of action selection and action evaluation is weaker than in Double Q-Learning because the next action $A_{t+1}$ is selected according to an $\epsilon$-greedy behaviour policy using $Q_A + Q_B$ and evaluated either with $Q_A$ or $Q_B$. For Expected Sarsa the policy used for the target in Equation 7 could be the $\epsilon$-greedy behaviour policy as proposed by Michael Ganger and Hu (2016), but it is probably better to use a policy according to $Q_A$ (if updating $Q_A$), because then it can also be used off-policy with Double Q-Learning as a special case, if $\pi$ is the greedy policy with respect to $Q_A$.
In this paper we propose the extension of double learning to Q($\sigma$), called Double Q($\sigma$), to obtain a new algorithm with the good learning properties of double learning, which generalizes (Double) Q-Learning, (Double) Expected Sarsa and (Double) Sarsa. Of course Double Q($\sigma$) can also be used with eligibility traces.
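Double Q($\sigma$) has the following TD error when $Q_A$ is selected,
(16) $\delta_t = R_{t+1} + \gamma \left( \sigma Q_B(S_{t+1}, A_{t+1}) + (1 - \sigma) \sum_{a'} \pi(a' \mid S_{t+1}) Q_B(S_{t+1}, a') \right) - Q_A(S_t, A_t)$
and
(17) $\delta_t = R_{t+1} + \gamma \left( \sigma Q_A(S_{t+1}, A_{t+1}) + (1 - \sigma) \sum_{a'} \pi(a' \mid S_{t+1}) Q_A(S_{t+1}, a') \right) - Q_B(S_t, A_t)$
if $Q_B$ is selected. The target policy $\pi$ is computed with respect to the value function which is updated, i.e. with respect to $Q_A$ in Equation 16 and with respect to $Q_B$ in Equation 17.
Pseudocode for Double Q($\sigma$) is given in Algorithm 2.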
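The two symmetric TD errors can be expressed by one function that takes the updated table and the evaluating table as separate arguments. The sketch below is not a transcription of Algorithm 2; it assumes a greedy target policy computed from the updated table, and the function and argument names are hypothetical.

```python
def greedy_probs(q_values):
    """Deterministic greedy policy as a probability vector (ties go to the first maximum)."""
    best = max(range(len(q_values)), key=lambda b: q_values[b])
    return [1.0 if b == best else 0.0 for b in range(len(q_values))]


def double_q_sigma_td_error(Q_upd, Q_eval, s, a, r, s_next, a_next, sigma, gamma):
    """TD error for the table Q_upd, with next-state values taken from Q_eval."""
    # Assumption: the target policy is greedy with respect to the updated table.
    target_probs = greedy_probs(Q_upd[s_next])
    expected = sum(p * q for p, q in zip(target_probs, Q_eval[s_next]))
    sampled = Q_eval[s_next][a_next]
    target = r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
    return target - Q_upd[s][a]


if __name__ == "__main__":
    QA = [[0.0, 0.0], [1.0, 5.0]]
    QB = [[0.0, 0.0], [3.0, 2.0]]
    for sigma in (0.0, 0.5, 1.0):
        print(sigma, double_q_sigma_td_error(QA, QB, 0, 0, 1.0, 1, 0, sigma, 1.0))
```

With `sigma=0.0` the result equals the Double Q-Learning TD error, and with `sigma=1.0` it equals a Double Sarsa style TD error in which the sampled next action is evaluated by the other table.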
5 Experiments
In this section the performance of the newly proposed Q($\sigma$, $\lambda$) algorithm will be tested on a gridworld navigation task compared with the performance of classical TD control algorithms like Sarsa and Q-Learning as well as Q($\sigma$).
The windy gridworld is a simple navigation task described by Sutton and Barto (1998). The goal is to get as fast as possible from a start state to a goal state using the actions left, right, up or down. In each column of the grid the agent is pushed upward by a wind. When an action would take the agent outside the grid, the agent is placed in the nearest cell inside the grid. The stochastic windy gridworld (Asis et al. (2017)) is a variant where state transitions are random: with a probability of 0.1 the agent will transition to one of the surrounding eight states independent of the action. The task is treated as an undiscounted episodic task with a reward of $-1$ for each transition. Figure 2 visualizes the gridworld.
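The transition rule of this task can be sketched as a small step function. The grid size, wind strengths, start and goal cells below follow the windy gridworld example in Sutton and Barto's textbook and are assumptions here, since this paper does not list them; how the wind and the grid border interact with a random transition is also an assumption of this sketch.

```python
import random

# Assumed layout from Sutton and Barto's windy gridworld example:
# 7 rows x 10 columns, upward wind strength per column.
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
START, GOAL = (3, 0), (3, 7)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
NEIGHBOURS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def clip(row, col):
    """Place a position outside the grid in the nearest cell inside the grid."""
    return min(max(row, 0), ROWS - 1), min(max(col, 0), COLS - 1)


def step(state, action, rng=random, noise=0.1):
    """Return (next_state, reward, done) for one transition.

    With probability `noise` the agent moves to one of the eight surrounding
    cells regardless of the action (assumed here to ignore the wind).
    """
    row, col = state
    if rng.random() < noise:
        dr, dc = rng.choice(NEIGHBOURS)
        next_state = clip(row + dr, col + dc)
    else:
        dr, dc = ACTIONS[action]
        # Row 0 is the top row, so an upward wind decreases the row index.
        next_state = clip(row + dr - WIND[col], col + dc)
    return next_state, -1, next_state == GOAL


if __name__ == "__main__":
    print(step(START, "right", noise=0.0))
```

Passing `noise=0.0` gives the deterministic windy gridworld, which makes it easy to check individual transitions by hand.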
Experiments were conducted using an $\epsilon$-greedy behaviour policy with a fixed value of $\epsilon$. The performance in terms of the average return over the first 100 episodes was measured for different values of $\sigma$ and $\lambda$ as a function of the step size $\alpha$. For the Expected Sarsa part of the update a greedy target policy was chosen, i.e. Q(0) is exactly Q-Learning. Results were averaged over 200 independent runs.
Figure 2 shows that an intermediate value of $\sigma$ performed better than Sarsa (Q(1)) and Q-Learning (Q(0)). The best performance was found by dynamically varying $\sigma$ over time, i.e. decreasing $\sigma$ by a factor of 0.99 after each episode. Multi-step bootstrapping with a trace decay parameter $\lambda > 0$ performed better than the one-step algorithms ($\lambda = 0$). Dynamically varying the value of $\sigma$ makes it possible to combine the good initial performance of Sarsa with the good asymptotic performance of Expected Sarsa. This confirms the results observed by Asis et al. (2017) for n-step algorithms.
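The dynamic schedule mentioned above can be written as a short generator. The decay factor 0.99 is taken from the text; the starting value of 1 (pure Sarsa) is an assumption, as the initial value is not stated here, and the name `sigma_schedule` is hypothetical.

```python
def sigma_schedule(initial_sigma, decay, n_episodes):
    """Yield the mixing parameter used in each episode, decayed after every episode."""
    sigma = initial_sigma
    for _ in range(n_episodes):
        yield sigma
        sigma *= decay


if __name__ == "__main__":
    # Assumed start at 1.0; decay factor 0.99 as in the experiments.
    values = list(sigma_schedule(1.0, 0.99, 100))
    print(values[0], values[-1])
```

Under this assumed starting value the parameter falls from 1 to about 0.37 by the 100th episode, so the updates shift gradually from the Sarsa target towards the Expected Sarsa target over the evaluation window.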
6 Conclusions
This paper has presented two extensions to the Q($\sigma$) algorithm, which unifies Q-Learning, Expected Sarsa and Sarsa. Q($\sigma$, $\lambda$) extends the algorithm to an online multi-step algorithm using eligibility traces and Double Q($\sigma$) extends the algorithm to double learning. Empirical results suggest that Q($\sigma$, $\lambda$) can outperform classic TD control algorithms like Sarsa($\lambda$), Q($\lambda$) and Q($\sigma$). Dynamically varying $\sigma$ obtains the best results.
Future research might focus on the performance of Q($\sigma$, $\lambda$) when used with nonlinear function approximation and on different schemes to update $\sigma$ over time.
References
Asis, K. D., Hernandez-Garcia, J. F., Holland, G. Z. and Sutton, R. S. (2017). Multi-step reinforcement learning: A unifying algorithm, CoRR abs/1703.01327. http://arxiv.org/abs/1703.01327
Hasselt, H. V. (2010). Double Q-learning, in J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel and A. Culotta (eds), Advances in Neural Information Processing Systems 23, Curran Associates, Inc., pp. 2613–2621. http://papers.nips.cc/paper/3964-double-q-learning.pdf
Michael Ganger, E. D. and Hu, W. (2016). Double Sarsa and double expected Sarsa with shallow and deep learning, Journal of Data Analysis and Information Processing 4: 159–176.
Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems, Technical report.
Singh, S. P. and Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces, Machine Learning 22(1): 123–158.
Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning, 1st edn, MIT Press, Cambridge, MA, USA.
Sutton, R. S. and Barto, A. G. (2017). Reinforcement learning: An introduction. Accessed: 2017-08-01.
Van Seijen, H., Van Hasselt, H., Whiteson, S. and Wiering, M. (2009). A theoretical and empirical analysis of expected Sarsa, Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL'09. IEEE Symposium on, IEEE, pp. 177–184.
Watkins, C. J. C. H. (1989). Learning from delayed rewards, PhD thesis, King's College, Cambridge.