# Per-decision Multi-step Temporal Difference Learning with Control Variates

###### Abstract

Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme. These methods address a bias-variance trade-off between reliance on current estimates, which could be poor, and incorporating longer sampled reward sequences into the updates. Especially in the off-policy setting, where the agent aims to learn about a policy different from the one generating its behaviour, the variance in the updates can cause learning to diverge as the number of sampled rewards used in the estimates increases. In this paper, we introduce per-decision control variates for multi-step TD algorithms, and compare them to existing methods. Our results show that including the control variates can greatly improve performance on both on-policy and off-policy multi-step temporal difference learning tasks.

Kristopher De Asis, Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, kldeasis@ualberta.ca

Richard S. Sutton, Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, rsutton@ualberta.ca

## 1 Temporal Difference Learning

Temporal-difference (TD) methods (Sutton, 1988) combine ideas from Monte Carlo and dynamic programming methods, and are an important approach in reinforcement learning. They allow learning to occur from raw experience in the absence of a model of the environment’s dynamics, like with Monte Carlo methods, while computing estimates which bootstrap off of other estimates, like with dynamic programming. TD methods provide a way to learn online and incrementally in both prediction and control settings.

Several TD methods have been proposed. Sarsa (Rummery & Niranjan, 1994; Sutton, 1996) is a classical on-policy algorithm, where the policy being learned about, the target policy, is identical to the one generating the behaviour, the behaviour policy. However, Sarsa can be extended to learn off-policy, where the target policy can differ from the behaviour policy, through the use of per-decision importance sampling (Precup et al., 2000). Expected Sarsa (van Seijen et al., 2009) is another extension of Sarsa where instead of using the value of the current state-action pair to update the value of the previous state-action pair, it uses the expectation of the values of all actions in the current state under the target policy. Since Expected Sarsa takes the expectation under the target policy, it can be used off-policy without importance sampling to correct for the discrepancy between its target and behaviour policies. Q-learning (Watkins, 1989) is arguably the most popular off-policy TD control algorithm, as it can also perform off-policy learning without importance sampling, and it is equivalent to Expected Sarsa with a greedy target policy. The above methods are often described in the one-step case, but they can be extended across multiple time steps.

Multi-step TD methods, such as the $n$-step TD and TD($\lambda$) methods, create a spectrum of algorithms with one-step TD learning at one end and Monte Carlo methods at the other. Intermediate algorithms are created which, due to a bias-variance trade-off, can outperform either extreme (Jaakkola et al., 1994). Multi-step off-policy algorithms, especially ones with explicit use of importance sampling, have significantly larger variance than their on-policy counterparts (Sutton & Barto, 1998), and several proposals have been made to address this issue in the TD($\lambda$) space of algorithms (Munos et al., 2016; Mahmood et al., 2017).

In this paper, we focus on $n$-step TD algorithms as they provide exact computation of the multi-step return, have conceptual clarity, and provide the foundation for TD($\lambda$) methods. We formulate per-decision control variates for existing $n$-step TD algorithms, and give insight on their implications in the TD($\lambda$) space of algorithms. On problems with tabular representations as well as one with function approximation, we show that the introduction of per-decision control variates can improve the performance of existing $n$-step TD methods on both on-policy and off-policy prediction and control tasks.

## 2 One-Step TD Methods

The sequential decision-making problem in reinforcement learning is often modeled as a Markov decision process (MDP). Under the MDP framework, an agent interacts with an environment over a sequence of discrete time steps. At each time step $t$, the agent receives information about the environment’s current state, $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states in the MDP. The agent is to use this state information to select an action, $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(S_t)$ is the set of possible actions in state $S_t$. Based on the environment’s current state and the agent’s selected action, the agent receives a reward, $R_{t+1} \in \mathbb{R}$, and gets information about the environment’s next state, $S_{t+1} \in \mathcal{S}$, according to the environment model: $p(s', r \mid s, a) = P(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$.

The agent selects actions according to a policy, $\pi(a \mid s) = P(A_t = a \mid S_t = s)$, which gives a probability distribution across actions $a \in \mathcal{A}(s)$ for a given state $s$. Through policy iteration (Sutton & Barto, 1998), the agent can learn an optimal policy, $\pi^*$, where behaving under it will maximize the expected discounted return:

$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} \tag{1}$$

given a discount factor $\gamma \in [0, 1]$ and $T$ equal to the final time step in an episodic task, or $\gamma \in [0, 1)$ and $T$ equal to infinity for a continuing task.
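As a minimal sketch of Equation 1, the discounted return of a finite reward sequence can be accumulated backwards. The function name `discounted_return` and the list layout (`rewards[0]` standing in for $R_{t+1}$) are illustrative assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], where rewards[0] plays the role of R_{t+1}."""
    g = 0.0
    # Work backwards so that each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Undiscounted episodic case: three transitions with a reward of -1 each.
print(discounted_return([-1.0, -1.0, -1.0], gamma=1.0))
# Discounted case: 1 + 0.5 * 0 + 0.25 * 2.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```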

Value-based methods approach the sequential decision-making problem by computing value functions, which provide estimates of what the return will be from a particular state onwards. In prediction problems, also referred to as policy evaluation, the goal is to estimate the return under a particular policy as accurately as possible, and a state-value function is often estimated. It is defined to be the expected return when starting in state $s$ and following policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$. For control problems, the policy which maximizes the expected return is to be learned, and an action-value function from which a policy can be derived is instead estimated. It is defined to be the expected return when taking action $a$ in state $s$, and following policy $\pi$ thereafter:

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \tag{2}$$

Of note, the action-value function can still be used for prediction problems, and the state-value can be computed as an expectation across action-values under the policy for a given state:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a) \tag{3}$$

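One-step TD methods learn an approximate value function, such as $Q \approx q_\pi$ for action-values, by computing an estimate of the return, $\hat{G}_t$. First, Equation 2 can be written in terms of its successor state-action pairs, also known as the Bellman equation for $q_\pi$:

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Big] \tag{4}$$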

Based on Equation 4, one-step TD methods estimate the return by taking an action in the environment according to a policy, sampling the immediate reward, and bootstrapping off of the current estimates in the value function for the remainder of the return. The difference between this TD target and the value of the previous state-action pair is then computed, and is often referred to as the TD error. The previous state-action pair’s value is then updated by taking a step proportional to the TD error with step size $\alpha \in (0, 1]$:

$$\hat{G}_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \tag{5}$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ \hat{G}_t - Q(S_t, A_t) \big] \tag{6}$$

Equations 5 and 6 correspond to the Sarsa algorithm. It can be seen that in state $S_{t+1}$, it samples an action $A_{t+1}$ according to its behaviour policy, and then bootstraps off of the value of this state-action pair. With a sufficiently small step size, this estimates the expectation under its behaviour policy over the values of successor state-action pairs in Equation 4, allowing for on-policy learning.
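The one-step Sarsa update can be sketched for a tabular value function as follows. The dictionary-based table, the function name `sarsa_update`, and the `terminal` flag (which zeroes the bootstrap value at episode end) are assumptions made for illustration.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """Apply one tabular Sarsa update to Q[(s, a)] and return the TD error."""
    # TD target: sampled reward plus the discounted value of the sampled next pair.
    # Assumption: the value of a terminal state is taken to be zero.
    bootstrap = 0.0 if terminal else Q[(s_next, a_next)]
    target = r + gamma * bootstrap
    td_error = target - Q[(s, a)]
    # Step towards the target in proportion to the TD error.
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)
Q[("s1", "right")] = 2.0
delta = sarsa_update(Q, "s0", "right", r=-1.0, s_next="s1", a_next="right",
                     alpha=0.5, gamma=1.0)
print(delta, Q[("s0", "right")])
```

Here the TD target is $-1 + 2 = 1$, so the previously zero estimate moves halfway towards it.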

In the off-policy case, the discrepancy from $A_{t+1}$ being drawn from the behaviour policy needs to be corrected. One approach is to correct the affected terms with per-decision importance sampling. With actions sampled from a behaviour policy $\mu$, and a target policy $\pi$, the estimate of the return of off-policy Sarsa with per-decision importance sampling becomes:

$$\rho_t = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \tag{7}$$

$$\hat{G}_t = R_{t+1} + \gamma \rho_{t+1} Q(S_{t+1}, A_{t+1}) \tag{8}$$

Note that in the on-policy case, $\rho_t$ is always 1, so Equation 8 strictly generalizes the original on-policy TD target in Equation 5.

Another approach for the off-policy case is to compute the expectation of all successor state-action pairs under the target policy directly, instead of sampling and correcting the discrepancy. This approach has lower variance and is often preferred in the one-step setting for action-values, and gives the Expected Sarsa algorithm (van Seijen et al., 2009) characterized by the following TD target:

$$\hat{G}_t = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) \tag{9}$$
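The two off-policy targets can be compared on a small hand-made example. This is a sketch under assumed inputs: the action names, value estimates, and policy probabilities below are hypothetical, and the function names are not from the paper.

```python
def sarsa_is_target(r, gamma, q_next, a_next, pi_next, mu_next):
    """Off-policy Sarsa target with a per-decision importance sampling ratio."""
    rho = pi_next[a_next] / mu_next[a_next]
    return r + gamma * rho * q_next[a_next]


def expected_sarsa_target(r, gamma, q_next, pi_next):
    """Expected Sarsa target: expectation of next action-values under the target policy."""
    return r + gamma * sum(pi_next[a] * q_next[a] for a in q_next)


q_next = {"left": 1.0, "right": 3.0}      # hypothetical estimates at the next state
pi_next = {"left": 0.25, "right": 0.75}   # target policy at the next state
mu_next = {"left": 0.5, "right": 0.5}     # behaviour policy at the next state

expected = expected_sarsa_target(0.0, 1.0, q_next, pi_next)
# Averaging the sampled target over the behaviour policy recovers the expected target.
averaged = sum(mu_next[a] * sarsa_is_target(0.0, 1.0, q_next, a, pi_next, mu_next)
               for a in q_next)
print(expected, averaged)
```

Both targets agree in expectation; the sampled one varies with the action drawn (0.5 or 4.5 here), which is the extra variance that the expectation form avoids.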

## 3 Multi-Step TD Learning

TD algorithms are referred to as one-step TD algorithms when they only incorporate information from a single time step in the estimate of the return that the value function is being updated towards. In multi-step TD methods, a longer sequence of experienced rewards is used to estimate the return. For example, on-policy $n$-step Sarsa would update an action-value towards the following estimate:

$$\hat{G}_{t:t+n} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n Q(S_{t+n}, A_{t+n}) \tag{10}$$

Of note, $n$-step Expected Sarsa (Sutton & Barto, 2018) is identical up until the $n$-th step, where it instead bootstraps off of the expectation under the target policy:

$$\hat{G}_{t:t+n} = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n \sum_a \pi(a \mid S_{t+n})\, Q(S_{t+n}, a) \tag{11}$$

The $n$-step returns can also be written recursively, which is convenient in the more general per-decision off-policy case. If we define the following bootstrapping condition for a return that ends at horizon $h = t + n$:

$$\hat{G}_{h:h} = Q(S_h, A_h) \tag{12}$$

then the $n$-step extension of off-policy Sarsa with per-decision importance sampling, as characterized by Equations 7 and 8, can be written as:

$$\hat{G}_{t:h} = R_{t+1} + \gamma \rho_{t+1} \hat{G}_{t+1:h} \tag{13}$$

TD algorithms which update towards these $n$-step estimates of the return constitute the $n$-step TD algorithm family (Sutton & Barto, 2018). Their computational complexity increases with $n$, but they have the benefit of conceptual clarity and exact computation of the multi-step return. The $n$-step returns also provide the foundation for other multi-step TD algorithms.
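The recursive form of the importance-sampled $n$-step return can be sketched by unrolling from the bootstrapping step backwards. The function name and the list layout (`rewards[k]` for $R_{t+k+1}$, `rhos[k]` for $\rho_{t+k+1}$) are illustrative assumptions.

```python
def n_step_is_return(rewards, rhos, q_boot, gamma):
    """n-step off-policy Sarsa return with per-decision importance sampling.

    rewards[k] stands for R_{t+k+1} and rhos[k] for rho_{t+k+1}, k = 0..n-1.
    q_boot stands for Q(S_{t+n}, A_{t+n}), the value at the bootstrapping step.
    """
    g = q_boot  # base case of the recursion: the return at the horizon is Q itself
    for r, rho in zip(reversed(rewards), reversed(rhos)):
        # One step of the recursion: reward plus discounted, ratio-weighted remainder.
        g = r + gamma * rho * g
    return g


# On-policy (all ratios equal to 1) this reduces to the plain n-step Sarsa return.
print(n_step_is_return([-1.0, -1.0], [1.0, 1.0], q_boot=-3.0, gamma=1.0))
# Off-policy: the ratios rescale everything that follows the corresponding action.
print(n_step_is_return([-1.0, -1.0], [2.0, 0.5], q_boot=-3.0, gamma=1.0))
```

Because the ratios multiply along the trajectory, a single large $\rho$ scales the entire remainder of the return, which is one way to see why the variance grows with $n$.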

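Another family of multi-step per-decision TD algorithms, TD($\lambda$), is also used in practice. They are characterized by computing a geometrically weighted sum of $n$-step returns, denoted as the $\lambda$-return:

$$\hat{G}^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \hat{G}_{t:t+n} \tag{14}$$

It introduces a hyperparameter $\lambda \in [0, 1]$ where $\lambda = 0$ gives one-step TD, and increasing $\lambda$ effectively increases the number of sampled rewards included in the estimated return. Substituting the $n$-step Sarsa return (13) into Equation 14 gives the $\lambda$-return for the Sarsa($\lambda$) algorithm, and assuming $Q$ does not change, it can be expressed as a sum of one-step Sarsa’s TD errors, $\delta_k = R_{k+1} + \gamma \rho_{k+1} Q(S_{k+1}, A_{k+1}) - Q(S_k, A_k)$:

$$\hat{G}^\lambda_t = Q(S_t, A_t) + \sum_{k=t}^{\infty} \delta_k \prod_{i=t+1}^{k} \gamma \lambda \rho_i \tag{15}$$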

This shows that the $\lambda$-return for Sarsa($\lambda$) can be estimated by computing one-step TD errors, and decaying the weight of later TD errors at a rate of $\gamma \lambda \rho_{t+1}$ per step. Implementing this online and incrementally, an eligibility trace vector is maintained to track which state-action pairs led to the current step’s TD error. The traces of earlier state-action pairs are decayed at each step by the aforementioned decay rate, and each action-value is adjusted by the current TD error weighted by the trace of the corresponding state-action pair.
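The TD-error view can be checked numerically for a single $n$-step return: the recursively defined importance-sampled return equals the starting action-value plus a weighted sum of one-step Sarsa TD errors. The sketch below assumes fixed value estimates, and all numbers and function names are hypothetical.

```python
def recursive_return(rewards, rhos, q_sampled, gamma):
    """Importance-sampled n-step return, unrolled backwards from the horizon."""
    g = q_sampled[-1]  # bootstrap off Q at the final sampled state-action pair
    for r, rho in zip(reversed(rewards), reversed(rhos)):
        g = r + gamma * rho * g
    return g


def td_error_sum_return(q_start, rewards, rhos, q_sampled, gamma):
    """Q(S_t, A_t) plus importance-weighted one-step Sarsa TD errors."""
    total, weight, q_prev = q_start, 1.0, q_start
    for r, rho, q in zip(rewards, rhos, q_sampled):
        delta = r + gamma * rho * q - q_prev   # one-step Sarsa TD error
        total += weight * delta
        weight *= gamma * rho                  # decay applied to later TD errors
        q_prev = q
    return total


rewards = [-1.0, -1.0, -1.0]
rhos = [2.0, 0.5, 1.5]           # ratios for the three sampled next actions
q_sampled = [-4.0, -2.0, -3.0]   # Q at the three sampled next state-action pairs
a = recursive_return(rewards, rhos, q_sampled, gamma=0.9)
b = td_error_sum_return(-5.0, rewards, rhos, q_sampled, gamma=0.9)
print(a, b)
```

Multiplying the running weight by $\lambda$ as well would give the corresponding truncated $\lambda$-weighted sum, which is the decay that the eligibility trace applies.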

Contrasting with $n$-step TD methods, the computational complexity of TD($\lambda$) control algorithms scales with the size of the environment, $|\mathcal{S}| \times |\mathcal{A}|$. That is, there is an environment-specific increase in complexity, but it no longer scales with the number of sampled rewards in the estimate of the return.

## 4 Per-Decision Control Variates

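When trying to estimate the expectation of some variable $X$, control variates are often of the following form (Ross, 2013):

$$X^* = X + c \big( Y - \mathbb{E}[Y] \big) \tag{16}$$

where $Y$ is the outcome of another variable with a known expected value, and $c$ is a coefficient to be set. $X^*$ then has the following variance:

$$\mathrm{Var}(X^*) = \mathrm{Var}(X) + c^2\, \mathrm{Var}(Y) + 2c\, \mathrm{Cov}(X, Y) \tag{17}$$

From this, the variance can be minimized with the optimal coefficient $c^*$:

$$c^* = -\frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)} \tag{18}$$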
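A small simulation can illustrate the variance reduction from a control variate. This is a sketch under an assumed toy setup: $Y$ is uniform on $[0, 1]$, so its mean of 0.5 is known, and $X$ is a noisy linear function of $Y$. None of these choices come from the paper.

```python
import random
import statistics

random.seed(0)

# Assumed toy setup: Y ~ Uniform(0, 1) with known mean 0.5, X = 2Y + small noise.
ys = [random.random() for _ in range(10_000)]
xs = [2.0 * y + random.gauss(0.0, 0.1) for y in ys]

mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Sample estimate of the variance-minimizing coefficient, -Cov(X, Y) / Var(Y).
c_star = -cov_xy / statistics.variance(ys)

# Control-variate-adjusted samples: same expectation as X, lower variance.
x_cv = [x + c_star * (y - 0.5) for x, y in zip(xs, ys)]

print(round(c_star, 2))
print(statistics.variance(xs), statistics.variance(x_cv))
```

In this setup the coefficient comes out close to $-2$, and most of the variance of $X$ that is explained by $Y$ is removed, leaving roughly the variance of the added noise.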

Suppose the $n$-step Sarsa algorithm samples the importance sampling-corrected $n$-step return, jointly samples the importance sampling-corrected action-value (through the sampled action), and computes the expected action-value under the target policy. We get the following estimate of this term of the multi-step return:

$$\rho_{t+1} \hat{G}_{t+1:h} + c \Big( \rho_{t+1} Q(S_{t+1}, A_{t+1}) - \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) \Big) \tag{19}$$

Under the assumption that the current estimates are accurate, the action-values represent the expected return. Due to this, the sampled reward sequence and the action-value are, in expectation, perfectly correlated. The covariance term in Equation 18 would then be the variance of the action-value due to the policy, and from this, a reasonable choice for the coefficient would be $c = -1$. This gives:

$$\rho_{t+1} \hat{G}_{t+1:h} + \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - \rho_{t+1} Q(S_{t+1}, A_{t+1}) \tag{20}$$

Substituting this estimate into the recursive definition of $n$-step Sarsa (13) and maintaining the same bootstrapping condition in Equation 12 gives the following $n$-step return:

$$\hat{G}_{t:h} = R_{t+1} + \gamma \Big( \rho_{t+1} \hat{G}_{t+1:h} + \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - \rho_{t+1} Q(S_{t+1}, A_{t+1}) \Big) \tag{21}$$

Because $\mathbb{E}_\mu[\rho_{t+1} Q(S_{t+1}, A_{t+1})] = \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)$, the additional term does not introduce bias into the estimate. To provide an intuition of how it might reduce the variance in the estimate, we can consider some extreme cases of the importance sampling ratio. If $\rho_{t+1} = 0$, when the behaviour policy takes an action that the target policy would have never taken, it will bootstrap off of the expectation of its current estimates instead of cutting the return. If $\rho_{t+1}$ is much greater than $1$, an equivalent amount of its current action-value estimate is subtracted to compensate.
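The $n$-step return with the per-decision control variate can be sketched with the same backward recursion as the importance-sampled return, adding the expectation and subtracting the ratio-weighted sampled action-value at every step. The function name, argument layout, and numbers are illustrative assumptions.

```python
def n_step_cv_return(rewards, rhos, q_sampled, v_expected, gamma):
    """n-step Sarsa return with the per-decision control variate (coefficient -1).

    For k = 0..n-1:
      rewards[k]    stands for R_{t+k+1}
      rhos[k]       stands for rho_{t+k+1}
      q_sampled[k]  stands for Q(S_{t+k+1}, A_{t+k+1})
      v_expected[k] stands for sum_a pi(a|S_{t+k+1}) Q(S_{t+k+1}, a)
    """
    g = q_sampled[-1]  # bootstrapping condition at the horizon
    for r, rho, q, v in zip(reversed(rewards), reversed(rhos),
                            reversed(q_sampled), reversed(v_expected)):
        # Ratio-weighted remainder, plus expectation, minus ratio-weighted sample.
        g = r + gamma * (rho * g + v - rho * q)
    return g


# One step: the sampled terms cancel, leaving the Expected Sarsa target r + gamma * v.
print(n_step_cv_return([-1.0], [2.5], [-3.0], [-2.5], gamma=1.0))

# Two steps with a zero ratio at the first sampled action: the return is not cut,
# it falls back to bootstrapping off the expectation at that state.
print(n_step_cv_return([-1.0, -1.0], [0.0, 1.0], [-4.0, -3.0], [-3.5, -2.5], gamma=1.0))
```

In the second call the plain importance-sampled return would be just the first reward, whereas the control variate version evaluates to the first reward plus the expected action-value at the next state.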

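In the one-step case, the introduction of this control variate results in one-step Expected Sarsa’s target:

$$\hat{G}_{t:t+1} = R_{t+1} + \gamma \Big( \rho_{t+1} Q(S_{t+1}, A_{t+1}) + \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - \rho_{t+1} Q(S_{t+1}, A_{t+1}) \Big) = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)$$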

When applied at the bootstrapping step, it implicitly results in bootstrapping off of the expectation under the target policy as opposed to the importance sampling-corrected action-value. It can be viewed as an alternate generalization of Expected Sarsa to the multi-step setting, where the control variate is applied to the sampled reward sequence in addition to the bootstrapping step.

The control variate can be interpreted as performing an expectation correction at each step based on current estimates. Each reward in the trajectory depends on the sampled action at each step, but the algorithm aims to learn the expectation across all possible trajectories under a policy. The importance sampling-corrected action-value is a closer estimate to the sampled return, as the agent knows which action resulted in the immediate reward at each step. Because of this, the action-value is like a guess of what the remainder of the sampled reward sequence will be, and the difference between that and the expectation across all actions provides a per-step estimate of the discrepancy between the sampled reward sequence and the expectation across all reward sequences from the current step onwards.

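It can also be seen as implicitly performing adaptive $n$-step learning, adjusting the amount of information included based on how accurate its current estimates are. If we rearrange the $n$-step return:

$$\hat{G}_{t:h} = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) + \gamma \rho_{t+1} \big( \hat{G}_{t+1:h} - Q(S_{t+1}, A_{t+1}) \big) \tag{22}$$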

We get the one-step Expected Sarsa target, along with some difference between the actual sampled rewards and its current estimates. If the value estimates are poor, more rewards will be effectively included in the estimate, and vice-versa. If there is no stochasticity in the environment, it ends up approaching one-step Expected Sarsa as the estimates get close to the true value function.

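If we follow similar steps in the state-value case, with the bootstrapping condition $\hat{G}_{h:h} = V(S_h)$, we arrive at the following $n$-step return with a per-decision control variate:

$$\hat{G}_{t:h} = \rho_t \big( R_{t+1} + \gamma \hat{G}_{t+1:h} \big) + (1 - \rho_t) V(S_t) \tag{23}$$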

Of note, the state-value control variate disappears in the on-policy case, but the action-value one does not.
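A sketch of the state-value return with a per-decision control variate follows, assuming the recursive form $\hat{G}_{t:h} = \rho_t (R_{t+1} + \gamma \hat{G}_{t+1:h}) + (1 - \rho_t) V(S_t)$ with $\hat{G}_{h:h} = V(S_h)$. The function name, argument layout, and numbers are hypothetical.

```python
def n_step_cv_state_return(rewards, rhos, values, v_boot, gamma):
    """State-value n-step return with a per-decision control variate.

    For k = 0..n-1: rewards[k] stands for R_{t+k+1}, rhos[k] for rho_{t+k},
    and values[k] for V(S_{t+k}). v_boot stands for V(S_{t+n}).
    """
    g = v_boot
    for r, rho, v in zip(reversed(rewards), reversed(rhos), reversed(values)):
        # With rho = 1 the (1 - rho) * v term vanishes; with rho = 0 the
        # estimate falls back to the current value of the state.
        g = rho * (r + gamma * g) + (1.0 - rho) * v
    return g


# On-policy: the control variate disappears and the values of visited states
# play no role beyond the bootstrapping step.
print(n_step_cv_state_return([-1.0, -1.0], [1.0, 1.0], [-6.0, -4.0], -3.0, 1.0))
# Off-policy: the current estimates fill in for the down-weighted sampled return.
print(n_step_cv_state_return([-1.0, -1.0], [0.5, 2.0], [-6.0, -4.0], -3.0, 1.0))
```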

## 5 Relationship With Existing Algorithms

If we substitute the $n$-step Sarsa return with the per-decision control variate (21) into the definition of the $\lambda$-return in Equation 14, we can rearrange it into a sum of one-step Expected Sarsa’s TD errors, $\delta_k = R_{k+1} + \gamma \sum_a \pi(a \mid S_{k+1})\, Q(S_{k+1}, a) - Q(S_k, A_k)$:

$$\hat{G}^\lambda_t = Q(S_t, A_t) + \sum_{k=t}^{\infty} \delta_k \prod_{i=t+1}^{k} \gamma \lambda \rho_i \tag{24}$$

This is equivalent to using the eligibility trace decay rate of Sarsa($\lambda$), but backing up the TD error of one-step Expected Sarsa. That is, in the space of action-value TD($\lambda$) algorithms, having the one-step estimates of the return bootstrap off of the expectation under the target policy implicitly induces this per-decision control variate in the corresponding $n$-step returns.
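The rearrangement can be sketched in two steps. The shorthand $\bar{V}_{k} = \sum_a \pi(a \mid S_k)\, Q(S_k, a)$ and $Q_k = Q(S_k, A_k)$ is introduced here only for brevity and is an assumption of this sketch, as is treating the value estimates as fixed. First, subtracting $Q_t$ from the control variate return and unrolling the recursion down to the horizon $h = t + n$, where $\hat{G}_{h:h} - Q_h = 0$:

$$
\begin{aligned}
\hat{G}_{t:h} - Q_t &= \underbrace{R_{t+1} + \gamma \bar{V}_{t+1} - Q_t}_{\delta_t} + \gamma \rho_{t+1} \big( \hat{G}_{t+1:h} - Q_{t+1} \big) \\
&= \sum_{k=t}^{h-1} \delta_k \prod_{i=t+1}^{k} \gamma \rho_i
\end{aligned}
$$

Second, in the geometrically weighted sum, the TD error $\delta_k$ appears in every $n$-step return with $n \geq k - t + 1$, so its total weight is

$$
(1 - \lambda) \sum_{n = k - t + 1}^{\infty} \lambda^{n-1} = \lambda^{k-t},
$$

which supplies one factor of $\lambda$ for each factor of $\gamma \rho_i$ in the product.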

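An existing algorithm that also uses one-step Expected Sarsa’s TD error in its $\lambda$-return is the Tree-backup($\lambda$) algorithm (Precup et al., 2000). Denoting $\pi_t = \pi(A_t \mid S_t)$, Tree-backup($\lambda$) is characterized by the following equations:

$$\delta_t = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t), \qquad \hat{G}^\lambda_t = Q(S_t, A_t) + \sum_{k=t}^{\infty} \delta_k \prod_{i=t+1}^{k} \gamma \lambda \pi_i \tag{25}$$

If we look at $n$-step Tree-backup’s estimate of the return, we can show that it also includes the expectation correction terms:

$$\hat{G}_{t:h} = R_{t+1} + \gamma \Big( \pi_{t+1} \hat{G}_{t+1:h} + \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - \pi_{t+1} Q(S_{t+1}, A_{t+1}) \Big) \tag{26}$$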

The estimate takes some portion of the sampled reward sequence, and the difference between the expectation under the target policy and an equivalent portion of the sampled action-value estimate.
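The correspondence can be checked for one step of the recursion. The sketch below compares the usual way of writing a Tree-backup step (expectation over the actions not taken, plus the continued return along the taken action) against the control variate form with the target policy probability in place of the importance sampling ratio. All names and numbers are hypothetical.

```python
def tree_backup_standard(r, gamma, q_next, pi_next, a_next, g_next):
    """Tree-backup step: expected value of untaken actions plus the taken branch."""
    untaken = sum(pi_next[a] * q_next[a] for a in q_next if a != a_next)
    return r + gamma * (untaken + pi_next[a_next] * g_next)


def tree_backup_cv_form(r, gamma, q_next, pi_next, a_next, g_next):
    """Same step written as a control variate with pi(A|S) replacing the ratio."""
    expected = sum(pi_next[a] * q_next[a] for a in q_next)
    p = pi_next[a_next]
    return r + gamma * (p * g_next + expected - p * q_next[a_next])


q_next = {"left": 1.0, "right": 3.0}
pi_next = {"left": 0.25, "right": 0.75}
std = tree_backup_standard(0.0, 1.0, q_next, pi_next, "right", g_next=5.0)
cv = tree_backup_cv_form(0.0, 1.0, q_next, pi_next, "right", g_next=5.0)
print(std, cv)
```

Both forms give the same value because subtracting the taken action's weighted value from the full expectation leaves exactly the expectation over the untaken actions.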

The introduction of the control variates with the aforementioned choice of the control variate parameter results in an instance of a doubly robust estimator. The use of doubly robust estimators in off-policy policy evaluation has been investigated by Jiang et al. (2016) and Thomas et al. (2016). However, results from applying these approaches in an online, model-free setting, as well as their view as a multi-step generalization of Expected Sarsa, appear to be novel.

Harutyunyan et al. (2016) have acknowledged the implicit introduction of these terms when using the expectation form of the TD error in action-value TD($\lambda$) algorithms. However, their work investigated the off-policy correcting effects of including the difference between the expectation under the target policy and an action-value sampled from the behaviour policy (without importance sampling corrections). This work focuses on the effect of explicitly including the additional terms, with importance sampling, in the $n$-step setting for both on-policy and off-policy TD learning.

In the state-value setting, combining Equations 23 and 14 gives the following $\lambda$-return, with $\delta_k = R_{k+1} + \gamma V(S_{k+1}) - V(S_k)$:

$$\hat{G}^\lambda_t = V(S_t) + \rho_t \sum_{k=t}^{\infty} \delta_k \prod_{i=t+1}^{k} \gamma \lambda \rho_i \tag{27}$$

which is an intuitive generalization of off-policy per-decision importance sampling for state-values, having an additional importance sampling correction term for the first reward in the sequence. It can be seen that the inclusion of an action-dependent trace decay rate scaling a TD error, as opposed to the return estimate alone, implicitly induces the state-value control variate in the $n$-step estimate of the return.

## 6 Experiments

In this section, we focus on the action-value setting and investigate the performance of $n$-step Sarsa with the per-decision control variate (denoted as $n$-step CV Sarsa) on three problems. The first two are multi-step prediction tasks in a tabular environment, one being off-policy and one being on-policy. The remaining one is a control problem involving function approximation, evaluating the performance of $n$-step CV Sarsa beyond the tabular setting, as well as how it handles a changing (greedifying) policy.

Since $n$-step CV Sarsa ends up bootstrapping off of the expectation over action-values at the end of the reward sequence, we compare the algorithm to $n$-step Expected Sarsa as characterized by Equation 11. This allows for examining the effects of the control variate being applied to each reward in the reward sequence in an online and incremental setting.

### 6.1 $5 \times 5$ Grid World

The $5 \times 5$ Grid World is a 2-dimensional grid world having terminal states in two opposite corners. The actions consist of 4-directional movement, and moving into a wall transitions the agent to the same state. The agent starts in the center, and a reward of $-1$ is received at each transition. Experiments were run in both the off-policy and on-policy settings with no discounting ($\gamma = 1$), and the root-mean-square (RMS) error between the learned value function and the true value function was compared.
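The environment's dynamics can be sketched in a few lines. The choice of which two opposite corners are terminal, the (row, column) coordinate convention, and all names below are assumptions made for illustration.

```python
SIZE = 5
# Assumption: the terminal states are the top-left and bottom-right corners.
TERMINALS = {(0, 0), (SIZE - 1, SIZE - 1)}
# Moves as (row change, column change); north decreases the row index.
ACTIONS = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}
START = (SIZE // 2, SIZE // 2)  # the center cell


def step(state, action):
    """Return (next_state, reward, done) for one transition."""
    d_row, d_col = ACTIONS[action]
    row, col = state[0] + d_row, state[1] + d_col
    if not (0 <= row < SIZE and 0 <= col < SIZE):
        row, col = state  # moving into a wall leaves the state unchanged
    next_state = (row, col)
    return next_state, -1.0, next_state in TERMINALS


print(step(START, "north"))
print(step((0, 2), "north"))   # bumps into the top wall
print(step((0, 1), "west"))    # reaches a terminal corner
```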

#### 6.1.1 Off-policy Prediction

For the off-policy experiments, the target policy would move north with probability 0.5, and select a random action equiprobably otherwise. The behaviour policy was equiprobable random for all states. A parameter study was done for 1, 2, and 4 steps, and the RMS error was measured after 200 episodes. The results are averaged over 1000 runs, and can be seen in Figure 2.
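To give a feel for the importance sampling ratios in this setup, the sketch below builds the two policies as action distributions. It assumes that the "random action" is drawn uniformly from all four actions, north included; if north were excluded from the random draw, the numbers would differ. The function name is hypothetical.

```python
ACTIONS = ["north", "south", "east", "west"]


def mixture_policy(p_north):
    """Move north with probability p_north, otherwise pick uniformly among all actions."""
    probs = {a: (1.0 - p_north) / len(ACTIONS) for a in ACTIONS}
    probs["north"] += p_north
    return probs


target = mixture_policy(0.5)      # assumed form of the target policy
behaviour = mixture_policy(0.0)   # equiprobable random behaviour policy
ratios = {a: target[a] / behaviour[a] for a in ACTIONS}
print(ratios)

# The ratio has expectation 1 under the behaviour policy.
print(sum(behaviour[a] * ratios[a] for a in ACTIONS))
```

Under this assumption the ratio is 2.5 for north and 0.5 for every other action, so products of ratios along a trajectory can grow or shrink quickly as the number of steps increases.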

It can be seen that 2-step Expected Sarsa only outperforms 1-step Expected Sarsa for a very limited range of parameters, but is worse otherwise. 4-step Expected Sarsa was unable to learn for most parameter settings. When the control variate is applied to each reward, we can see that 2-step CV Sarsa outperforms 1-step Expected Sarsa for all parameter settings, and the variance is reduced relative to 2-step Expected Sarsa. Furthermore, 4-step CV Sarsa ends up being able to learn, and can outperform 2-step CV Sarsa for a reasonably wide range of parameters.

#### 6.1.2 On-policy Prediction

In the on-policy case, the target policy and behaviour policy were both equiprobable random for all states. The parameters tested are identical to the off-policy experiment with the addition of 8-step instances of each algorithm. The RMS error was measured after 200 episodes, and the results are also averaged over 1000 runs. The results are summarized in Figure 3.

2-step Expected Sarsa ends up performing better than 1-step Expected Sarsa for a wider range of parameters than in the off-policy case, but the best parameter settings for each perform similarly. Further increasing the number of steps results in relatively poor performance, and doesn’t do better than the best parameter setting of 1-step Expected Sarsa. Looking at $n$-step CV Sarsa, we can see that performance is drastically improved for all tested settings of $n$. Of note, while introducing the per-decision control variate resulted in lower variance for a reasonable range of parameters, assumptions were made regarding the accuracy of the value function when setting the control variate parameter in Equation 19. If the number of steps and the step size get too large, it can result in larger variance and divergence on parameter settings where $n$-step Expected Sarsa did not diverge. We did not investigate alternate methods of setting the control variate parameter in this work.

### 6.2 Mountain Car

To show that this use of control variates is compatible with function approximation, we ran experiments on mountain car as described by Sutton and Barto (1998). A reward of $-1$ is received at each step, and there is no discounting.

Because the environment’s state space is continuous, we used tile coding (Sutton & Barto, 1998) to produce a feature representation for use with linear function approximation. The tile coder used 16 tilings, an asymmetric offset by consecutive odd numbers, and each tile covered -th of the feature space in each direction.

We compared $n$-step Expected Sarsa and $n$-step CV Sarsa with 1, 2, 4, and 8 steps across different step sizes $\alpha$. Each algorithm learned on-policy with an $\epsilon$-greedy policy which selects an action greedily with respect to its value function with probability $1 - \epsilon$, and selects a random action equiprobably otherwise. In this experiment, $\epsilon$ was set to 0.1. We measured the return per episode up to 100 episodes, and averaged the results over 100 runs. The results for the best parameter setting for each algorithm can be found in Figure 5.
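Both algorithms need the action probabilities of the $\epsilon$-greedy policy to compute the expectation over action-values. A minimal sketch follows, assuming that the random action is drawn uniformly from all actions (so the greedy action can also be picked at random) and that ties are broken by the lowest index; the function names and action-values are hypothetical.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over a list of action-values."""
    n = len(q_values)
    # Assumption: ties between maximal actions are broken by the lowest index.
    best = max(range(n), key=lambda i: q_values[i])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def expected_value(q_values, probs):
    """Expectation of the action-values under the given action probabilities."""
    return sum(p * q for p, q in zip(probs, q_values))


q_values = [-20.0, -18.0, -25.0]   # hypothetical action-values for three actions
probs = epsilon_greedy_probs(q_values, epsilon=0.1)
print(probs)
print(expected_value(q_values, probs))
```

With a small $\epsilon$, the expectation sits close to the value of the greedy action, which is the action sampled most of the time.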

The two algorithms showed a similar trend in the parameters as in the $5 \times 5$ Grid World environment, but the differences were less pronounced. This is likely due to the task not requiring accurate value function estimates to be performed well, and the control variate having less of an effect with greedier target policies, because $Q(S_{t+1}, A_{t+1})$ gets relatively close to $\sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)$. Despite this, as seen in the results for the best parameter settings, $n$-step CV Sarsa still outperforms $n$-step Expected Sarsa on this task.

## 7 Discussion

From our experiments, $n$-step CV Sarsa appears to be an improved multi-step generalization of Expected Sarsa. In both on-policy and off-policy prediction tasks on the $5 \times 5$ Grid World environment, it generally resulted in lower variance as well as considerably lower error in the estimates compared to $n$-step Expected Sarsa, an algorithm which can be interpreted as only applying the control variate at the bootstrapping step. Moreover, when used on a continuous state space control problem with function approximation, applying the control variate on a per-reward level still resulted in greater performance in terms of average return per episode.

Despite the improvement on most of the tested parameter settings, the results also showed that the addition of the per-decision control variates can cause learning to diverge for large $n$ and large step size $\alpha$, even on settings where $n$-step Expected Sarsa did not diverge. It is suspected that this is due to assuming the estimates are accurate when setting the control variate parameter in Equation 19. This was not further investigated, but it could be an avenue for future work.

While our results focused on the action-value per-decision control variate, other experiments not included in this paper showed that the state-value per-decision control variate in Equation 23 can also be applied in the off-policy action-value setting. It resulted in performance in between that of $n$-step Expected Sarsa and $n$-step CV Sarsa, suggesting that it is beneficial to add it, but better to use the action-value control variate if the agent is learning action-values.

## 8 Conclusions

In this paper, we presented a way to derive per-decision control variates in both state-value and action-value $n$-step TD methods. The state-value control variate is only present in the off-policy setting, but the action-value control variate affects both on-policy and off-policy learning. In the action-value case, applying the per-decision control variate results in an alternative multi-step extension of Expected Sarsa. With this control variate perspective, the existing $n$-step Expected Sarsa algorithm can be interpreted as only applying a control variate at the bootstrapping step, when it can be applied to the sampled reward sequence as well. Our results on prediction and control problems show that applying it on a per-decision level can greatly improve the accuracy of the learned value function, and consequently improve performance in TD control.

We also showed how the per-decision control variates relate to TD($\lambda$) algorithms. This provided insight on how minor adjustments in the TD($\lambda$) space can implicitly induce these per-decision control variates in the underlying $n$-step returns, resulting in a more unified view of per-decision multi-step TD methods.

Our experiments were limited to the $n$-step TD setting without eligibility traces, and focused on learning action-values. We only considered a naive setting of the control variate scaling parameter $c$, and our results suggest that the way we set it can negatively affect learning for a few (relatively extreme) parameter combinations. Perhaps insight from the analytical optimal coefficient in Equation 18 can be used to adapt the control variate online to further improve performance.

#### Acknowledgements

The authors thank Yi Wan for insights and discussions contributing to the results presented in this paper, and the entire Reinforcement Learning and Artificial Intelligence research group for providing the environment to nurture and support this research. We gratefully acknowledge funding from Alberta Innovates – Technology Futures, Google DeepMind, and from the Natural Sciences and Engineering Research Council of Canada.

#### References

Harutyunyan, A., Bellemare, M. G., Stepleton, T., and Munos, R. (2016). Q($\lambda$) with off-policy corrections. arXiv:1509.05172.

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6(6), 1185-1201.

Jiang, N., and Li, L. (2016). Doubly Robust Off-policy Value Evaluation for Reinforcement Learning. Proceedings of The 33rd International Conference on Machine Learning, in PMLR 48:652-661.

Mahmood, A. R., Yu, H., and Sutton, R. S. (2017). Multi-step off-policy learning without importance sampling ratios. arXiv:1702.03006.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. arXiv:1606.02647.

Precup, D., Sutton, R. S., and Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759-766. Morgan Kaufmann.

Ross, S. M. (2013). Simulation. San Diego: Academic Press.

Rummery, G. A. (1995). Problem Solving with Reinforcement Learning. PhD Thesis, Cambridge University.

Rummery, G. A., and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUEF/F-INFENG/TR 166, Engineering Department, Cambridge University.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3, 9-44.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky, D. S. and Hasselmo, M. E. (eds.), Advances in Neural Information Processing Systems 8, pp. 1038-1044. MIT Press.

Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts.

Sutton, R. S., and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). Manuscript in preparation.

Thomas, P. and Brunskill, E. (2016). Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning. Proceedings of The 33rd International Conference on Machine Learning, in PMLR 48:2139-2148.

van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. (2009). A theoretical and empirical analysis of expected sarsa. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 177-184.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University.