# True Online Temporal-Difference Learning

###### Abstract

The temporal-difference methods TD($\lambda$) and Sarsa($\lambda$) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD($\lambda$) and true online Sarsa($\lambda$), respectively (van Seijen & Sutton, 2014). Algorithmically, these true online methods only make two small changes to the update rules of the regular methods, and the extra computational cost is negligible in most cases. However, they follow the ideas underlying the forward view much more closely. In particular, they maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD($\lambda$)/Sarsa($\lambda$) with regular TD($\lambda$)/Sarsa($\lambda$) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods is often better, but never worse than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. Besides the empirical results, we provide an in-depth analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal-difference methods can be derived by making changes to the online forward view and then rewriting the update equations.

A. Rupam Mahmood ashique@ualberta.ca

Patrick M. Pilarski patrick.pilarski@ualberta.ca

Marlos C. Machado machado@ualberta.ca

Richard S. Sutton sutton@cs.ualberta.ca

Reinforcement Learning and Artificial Intelligence Laboratory

University of Alberta

2-21 Athabasca Hall, Edmonton, AB

Canada, T6G 2E8

Maluuba Research

2000 Peel Street, Montreal, QC

Canada, H3A 2W5

Editor: George Konidaris

Keywords: temporal-difference learning, eligibility traces, forward-view equivalence

## 1 Introduction

Temporal-difference (TD) learning is a core learning technique in modern reinforcement learning (Sutton, 1988; Kaelbling et al., 1996; Sutton & Barto, 1998; Szepesvári, 2010). One of the main challenges in reinforcement learning is to make predictions, in an initially unknown environment, about the (discounted) sum of future rewards, the return, based on currently observed feature values and a certain behaviour policy. With TD learning it is possible to learn good estimates of the expected return quickly by bootstrapping from other expected-return estimates. TD($\lambda$) (Sutton, 1988) is a popular TD algorithm that combines basic TD learning with eligibility traces to further speed learning. The popularity of TD($\lambda$) can be explained by its simple implementation, its low computational complexity and its conceptually straightforward interpretation, given by its forward view. The forward view of TD($\lambda$) states that the estimate at each time step is moved towards an update target known as the $\lambda$-return, with $\lambda$ determining the fundamental trade-off between bias and variance of the update target. This trade-off has a large influence on the speed of learning and its optimal setting varies from domain to domain. The ability to improve this trade-off by adjusting the value of $\lambda$ is what underlies the performance advantage of eligibility traces.

Although the forward view provides a clear intuition, TD($\lambda$) closely approximates the forward view only for appropriately small step-sizes. Until recently, this was considered an unfortunate, but unavoidable part of the theory behind TD($\lambda$). This changed with the introduction of true online TD($\lambda$) (van Seijen & Sutton, 2014), which computes exactly the same weight vectors as the forward view at any step-size. This gives true online TD($\lambda$) full control over the bias-variance trade-off. In particular, true online TD(1) can achieve fully unbiased updates. Moreover, true online TD($\lambda$) only requires small modifications to the TD($\lambda$) update equations, and the extra computational cost is negligible in most cases.

We hypothesize that true online TD($\lambda$), and its control version true online Sarsa($\lambda$), not only have better theoretical properties than their regular counterparts, but also dominate them empirically. We test this hypothesis by performing an extensive empirical comparison between true online TD($\lambda$), regular TD($\lambda$) (which is based on accumulating traces), and the common variation based on replacing traces. In addition, we perform comparisons between true online Sarsa($\lambda$) and Sarsa($\lambda$) (with accumulating and replacing traces). The domains we use include random Markov reward processes, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment (Bellemare et al., 2013). The representations we consider range from tabular values to linear function approximation with binary and non-binary features.

Besides the empirical study, we provide an in-depth discussion of the theory behind true online TD($\lambda$). This theory is based on a new online forward view. The traditional forward view, based on the $\lambda$-return, is inherently an offline forward view, meaning that updates only occur at the end of an episode, because the $\lambda$-return requires data up to the end of an episode. We extend this forward view to the online case, where updates occur at every time step, by using a bounded version of the $\lambda$-return that grows over time. Whereas TD($\lambda$) approximates the traditional forward view only at the end of an episode, we show that TD($\lambda$) approximates this new online forward view at all time steps. True online TD($\lambda$) is equivalent to this new online forward view at all time steps. We prove this by deriving the true online TD($\lambda$) update equations directly from the online forward view update equations. This derivation forms a blueprint for the derivation of other true online methods. By making variations to the online forward view and following the same derivation as for true online TD($\lambda$), we derive several other true online methods.

This article is organized as follows. We start by presenting the required background in Section 2. Then, we present the new online forward view in Section 3, followed by the presentation of true online TD($\lambda$) in Section 4. Section 5 presents the empirical study. Furthermore, in Section 6, we present several other true online methods. In Section 7, we discuss related work in detail. Finally, Section 8 concludes.

## 2 Background

Here, we present the main learning framework. As a convention, we indicate scalar-valued random variables by capital letters (e.g., $S_t$, $R_t$), vectors by bold lowercase letters (e.g., $\boldsymbol{\theta}$, $\boldsymbol{\phi}$), functions by non-bold lowercase letters (e.g., $v$), and sets by calligraphic font (e.g., $\mathcal{S}$, $\mathcal{A}$). [Footnote 1: An exception to this convention is the TD error, a scalar-valued random variable that we indicate by $\delta_t$.]

### 2.1 Markov Decision Processes

Reinforcement learning (RL) problems are often formalized as Markov decision processes (MDPs), which can be described as 5-tuples of the form $\langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$, where $\mathcal{S}$ indicates the set of all states; $\mathcal{A}$ indicates the set of all actions; $p(s'|s,a)$ indicates the probability of a transition to state $s' \in \mathcal{S}$, when action $a \in \mathcal{A}$ is taken in state $s \in \mathcal{S}$; $r(s,a,s')$ indicates the expected reward for a transition from state $s$ to state $s'$ under action $a$; the discount factor $\gamma$ specifies how future rewards are weighted with respect to the immediate reward.

Actions are taken at discrete time steps $t = 0, 1, 2, \dots$ according to a policy $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$, which defines for each action the selection probability conditioned on the state. The return at time $t$ is defined as the discounted sum of rewards, observed after $t$:

$$
G_t := R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{i=1}^{\infty} \gamma^{i-1} R_{t+i},
$$

where $R_{t+1}$ is the reward received after taking action $A_t$ in state $S_t$. Some MDPs contain special states called terminal states. After reaching a terminal state, no further reward is obtained and no further state transitions occur. Hence, a terminal state can be interpreted as a state where each action returns to itself with a reward of 0. An interaction sequence from the initial state to a terminal state is called an episode.
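As a small illustration of the return defined above, the following Python sketch computes the discounted sum of a finite reward sequence. The function name `discounted_return` and the list-based reward format are assumptions made for this example; they do not come from the article.

```python
def discounted_return(rewards, gamma):
    """Discounted return for rewards [R_{t+1}, R_{t+2}, ...] observed after time t.

    Computes sum_i gamma**(i-1) * R_{t+i} by accumulating from the last
    reward backwards, so no explicit powers of gamma are needed.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For an episode that ends in a terminal state the list is finite, because no further reward is obtained after termination.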

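Each policy $\pi$ has a corresponding state-value function $v_\pi$, which maps any state $s \in \mathcal{S}$ to the expected value of the return from that state, when following policy $\pi$:

$$
v_\pi(s) := \mathbb{E}\{ G_t \mid S_t = s, \pi \}.
$$

In addition, the action-value function $q_\pi$ gives the expected return for policy $\pi$, given that action $a \in \mathcal{A}$ is taken in state $s \in \mathcal{S}$:

$$
q_\pi(s,a) := \mathbb{E}\{ G_t \mid S_t = s, A_t = a, \pi \}.
$$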

Because no further rewards can be obtained from a terminal state, the state-value and action-values for a terminal state are always 0.

There are two tasks that are typically associated with an MDP. First, there is the task of determining (an estimate of) the value function $v_\pi$ for some given policy $\pi$. The second, more challenging task is that of determining (an estimate of) the optimal policy $\pi_*$, which is defined as the policy whose corresponding value function has the highest value in each state:

$$
v_{\pi_*}(s) := \max_\pi v_\pi(s), \quad \text{for each } s \in \mathcal{S}.
$$

In RL, these two tasks are considered under the condition that the reward function and the transition-probability function are unknown. Hence, the tasks have to be solved using samples obtained from interacting with the environment.

### 2.2 Temporal-Difference Learning

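Let us consider the task of learning an estimate $\hat{v}$ of the value function $v_\pi$ from samples, where $\hat{v}$ is being estimated using linear function approximation. That is, $\hat{v}$ is the inner product between a feature vector $\boldsymbol{\phi}(s) \in \mathbb{R}^n$ of $s$, and a weight vector $\boldsymbol{\theta} \in \mathbb{R}^n$:

$$
\hat{v}(s, \boldsymbol{\theta}) := \boldsymbol{\theta}^\top \boldsymbol{\phi}(s).
$$

If $s$ is a terminal state, then by definition $\boldsymbol{\phi}(s) := \boldsymbol{0}$, and hence $\hat{v}(s, \boldsymbol{\theta}) = 0$.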

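We can formulate the problem of estimating $v_\pi$ as an error-minimization problem, where the error is a weighted average of the squared difference between the value of a state and its estimate:

$$
E(\boldsymbol{\theta}) := \frac{1}{2} \sum_i d_\pi(s_i) \Big( v_\pi(s_i) - \boldsymbol{\theta}^\top \boldsymbol{\phi}(s_i) \Big)^2,
$$

with $d_\pi$ the stationary distribution induced by $\pi$. The above error function can be minimized by using stochastic gradient descent while sampling from the stationary distribution, resulting in the following update rule:

$$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \alpha \frac{1}{2} \nabla_{\boldsymbol{\theta}} \Big( v_\pi(S_t) - \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t \Big)^2,
$$

using $\boldsymbol{\phi}_t$ as a shorthand for $\boldsymbol{\phi}(S_t)$. The parameter $\alpha$ is called the step-size. Using the chain rule, we can rewrite this update as:

$$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \Big( v_\pi(S_t) - \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t \Big) \nabla_{\boldsymbol{\theta}} \big( \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t \big) = \boldsymbol{\theta}_t + \alpha \Big( v_\pi(S_t) - \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t \Big) \boldsymbol{\phi}_t.
$$

Because $v_\pi(S_t)$ is in general unknown, an estimate $U_t$ of $v_\pi(S_t)$ is used, which we call the update target, resulting in the following general update rule:

$$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \Big( U_t - \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t \Big) \boldsymbol{\phi}_t. \tag{1}
$$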

There are many different update targets possible. For an unbiased estimator the full return can be used, that is, $U_t = G_t$. However, the full return has the disadvantage that its variance is typically very high. Hence, learning with the full return can be slow. Temporal-difference (TD) learning addresses this issue by using update targets based on other value estimates. While the update target is no longer unbiased in this case, the variance is typically much smaller, and learning much faster. TD learning uses the Bellman equations as its mathematical foundation for constructing update targets. These equations relate the value of a state to the values of its successor states:

$$
v_\pi(s) = \sum_a \pi(s,a) \sum_{s'} p(s'|s,a) \Big( r(s,a,s') + \gamma\, v_\pi(s') \Big).
$$

Writing this equation in terms of an expectation yields:

$$
v_\pi(s) = \mathbb{E}\{ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, \pi \}.
$$

Sampling from this expectation, while using linear function approximation to approximate $v_\pi$, results in the update target:

$$
U_t = R_{t+1} + \gamma\, \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_{t+1}.
$$

This update target is called a one-step update target, because it is based on information from only one time step ahead. Applying the Bellman equation multiple times results in update targets based on information further ahead. Such update targets are called multi-step update targets.

### 2.3 TD($\lambda$)

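The TD($\lambda$) algorithm implements the following update equations:

$$
\begin{aligned}
\delta_t &= R_{t+1} + \gamma\, \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_{t+1} - \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t, && (2)\\
\boldsymbol{e}_t &= \gamma\lambda\, \boldsymbol{e}_{t-1} + \boldsymbol{\phi}_t, && (3)\\
\boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t + \alpha\, \delta_t\, \boldsymbol{e}_t, && (4)
\end{aligned}
$$

for $t \geq 0$, and with $\boldsymbol{e}_{-1} = \boldsymbol{0}$. The scalar $\delta_t$ is called the TD error, and the vector $\boldsymbol{e}_t$ is called the eligibility-trace vector. The update of $\boldsymbol{e}_t$ shown above is referred to as the accumulating-trace update. As a shorthand, we will refer to this version of TD($\lambda$) as ‘accumulate TD($\lambda$)’, to distinguish it from a slightly different version that is discussed below. While these updates appear to deviate from the general, gradient-descent-based update rule given in (1), there is a close connection to this update rule. This connection is formalized through the forward view of TD($\lambda$), which we discuss in detail in the next section. Algorithm 1 shows the pseudocode for accumulate TD($\lambda$).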
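The accumulate TD($\lambda$) updates can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function name, the representation of an episode as a list of `(phi_t, reward, phi_next)` tuples, and the use of plain lists for vectors are all choices made for this example. The feature vector of a terminal state is assumed to be all zeros, as defined in Section 2.2.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def accumulate_td_lambda(episode, theta, alpha, gamma, lam):
    """Run one episode of accumulate TD(lambda) with linear function approximation.

    episode: list of (phi_t, R_{t+1}, phi_{t+1}) transitions (assumed format).
    theta:   initial weight vector; a copy is updated and returned.
    """
    theta = list(theta)
    e = [0.0] * len(theta)  # eligibility trace, e_{-1} = 0
    for phi, reward, phi_next in episode:
        # TD error: delta_t = R_{t+1} + gamma * theta^T phi_{t+1} - theta^T phi_t
        delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
        # Accumulating trace: e_t = gamma * lambda * e_{t-1} + phi_t
        e = [gamma * lam * e_i + p_i for e_i, p_i in zip(e, phi)]
        # Weight update: theta_{t+1} = theta_t + alpha * delta_t * e_t
        theta = [th + alpha * delta * e_i for th, e_i in zip(theta, e)]
    return theta
```

Because the trace keeps adding $\boldsymbol{\phi}_t$ for a feature that stays active, the effective size of a single weight update can exceed the nominal step-size $\alpha$.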

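Accumulate TD($\lambda$) can be very sensitive with respect to the $\alpha$ and $\lambda$ parameters. In particular, a large value of $\lambda$ combined with a large value of $\alpha$ can easily cause divergence, even on simple tasks with bounded rewards. For this reason, a variant of TD($\lambda$) is sometimes used that is more robust with respect to these parameters. This variant, which assumes binary features, uses a different trace-update equation:

$$
\boldsymbol{e}_t(i) = \begin{cases} \gamma\lambda\, \boldsymbol{e}_{t-1}(i), & \text{if } \boldsymbol{\phi}_t(i) = 0,\\ 1, & \text{if } \boldsymbol{\phi}_t(i) = 1, \end{cases} \qquad \text{for all features } i,
$$

where $\boldsymbol{x}(i)$ indicates the $i$-th component of vector $\boldsymbol{x}$. This update is referred to as the replacing-trace update. As a shorthand, we will refer to the version of TD($\lambda$) using the replacing-trace update as ‘replace TD($\lambda$)’.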
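To make the difference between the two trace updates concrete, the sketch below implements both for binary feature vectors stored as Python lists. The function names are hypothetical and chosen for this example only.

```python
def accumulating_trace(e_prev, phi, gamma, lam):
    """Accumulating trace: decay every component, then add the current features."""
    return [gamma * lam * e_i + p_i for e_i, p_i in zip(e_prev, phi)]


def replacing_trace(e_prev, phi, gamma, lam):
    """Replacing trace for binary features.

    A component is reset to 1 when its feature is active and decays by
    gamma * lambda otherwise.
    """
    return [1.0 if p_i == 1 else gamma * lam * e_i for e_i, p_i in zip(e_prev, phi)]
```

With a previous trace of `[0.5, 0.5]`, features `[1, 0]`, $\gamma = 1$ and $\lambda = 0.8$, the accumulating trace for the active feature grows beyond 1, while the replacing trace caps it at exactly 1; the inactive feature decays identically in both versions.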

## 3 The Online Forward View

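The traditional forward view relates the TD($\lambda$) update equations to the general update rule shown in Equation (1). Specifically, for small step-sizes the weight vector at the end of an episode computed by accumulate TD($\lambda$) is approximately the same as the weight vector resulting from a sequence of Equation (1) updates (one for each visited state) using a particular multi-step update target, called the $\lambda$-return (Sutton & Barto, 1998; Bertsekas & Tsitsiklis, 1996). The $\lambda$-return for state $S_t$ is defined as:

$$
G_t^{\lambda} := (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t^{(T-t)}, \tag{5}
$$

where $T$ is the time step the terminal state is reached, and $G_t^{(n)}$ is the $n$-step return, defined as:

$$
G_t^{(n)} := \sum_{k=1}^{n} \gamma^{k-1} R_{t+k} + \gamma^{n}\, \boldsymbol{\theta}_{t+n-1}^\top \boldsymbol{\phi}_{t+n}.
$$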

We call a method that updates the value of each visited state at the end of the episode an offline method; we call a method that updates the value of each visited state immediately after the visit (i.e., at the time step after the visit) an online method. TD($\lambda$) is an online method. The update sequence of the traditional forward view, however, corresponds with an offline method, because the $\lambda$-return requires data up to the end of an episode. This leaves open the question of how to interpret the weights of TD($\lambda$) during an episode. In this section, we provide an answer to this long-standing open question. We introduce a bounded version of the $\lambda$-return that only uses information up to a certain horizon and we use this to construct an online forward view. This online forward view approximates the weight vectors of accumulate TD($\lambda$) at all time steps, instead of only at the end of an episode.

### 3.1 The Online $\lambda$-Return Algorithm

The concept of an online forward view contains a paradox. On the one hand, multi-step update targets require data from time steps far beyond the time a state is visited; on the other hand, the online aspect requires that the value of a visited state is updated immediately. The solution to this paradox is to assign a sequence of update targets to each visited state. The first update target in this sequence contains data from only the next time step, the second contains data from the next two time steps, the third from the next three time steps, and so on. Now, given an initial weight vector and a sequence of visited states, a new weight vector can be constructed by updating each visited state with an update target that contains data up to the current time step. Below, we formalize this idea.

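We define the interim $\lambda$-return for state $S_t$ with horizon $h$ as follows:

$$
G_t^{\lambda|h} := (1-\lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{h-t-1} G_t^{(h-t)}. \tag{6}
$$

Note that this update target does not use data beyond the horizon $h$. $G_t^{\lambda|h}$ implicitly defines a sequence of update targets for $S_t$: $\{G_t^{\lambda|t+1}, G_t^{\lambda|t+2}, G_t^{\lambda|t+3}, \dots\}$. As time increases, update targets based on data further away become available for state $S_t$. At a particular time step $t$, a new weight vector is computed by performing an Equation (1) update for each visited state using the interim $\lambda$-return with horizon $t$, starting from the initial weight vector $\boldsymbol{\theta}_0$. Hence, at time step $t$, a sequence of $t$ updates occurs. To describe this sequence mathematically, we use weight vectors with two indices: $\boldsymbol{\theta}_k^t$. The superscript indicates the time step at which the updates are performed (this value corresponds with the horizon of the interim $\lambda$-returns that are used in the updates). The subscript is the iteration index of the sequence (it corresponds with the number of updates that have been performed at a particular time step). As an example, the update sequences for the first three time steps are:

$$
\begin{aligned}
t=1: \quad & \boldsymbol{\theta}_1^1 = \boldsymbol{\theta}_0^1 + \alpha \big( G_0^{\lambda|1} - (\boldsymbol{\theta}_0^1)^\top \boldsymbol{\phi}_0 \big) \boldsymbol{\phi}_0,\\
t=2: \quad & \boldsymbol{\theta}_1^2 = \boldsymbol{\theta}_0^2 + \alpha \big( G_0^{\lambda|2} - (\boldsymbol{\theta}_0^2)^\top \boldsymbol{\phi}_0 \big) \boldsymbol{\phi}_0,\\
& \boldsymbol{\theta}_2^2 = \boldsymbol{\theta}_1^2 + \alpha \big( G_1^{\lambda|2} - (\boldsymbol{\theta}_1^2)^\top \boldsymbol{\phi}_1 \big) \boldsymbol{\phi}_1,\\
t=3: \quad & \boldsymbol{\theta}_1^3 = \boldsymbol{\theta}_0^3 + \alpha \big( G_0^{\lambda|3} - (\boldsymbol{\theta}_0^3)^\top \boldsymbol{\phi}_0 \big) \boldsymbol{\phi}_0,\\
& \boldsymbol{\theta}_2^3 = \boldsymbol{\theta}_1^3 + \alpha \big( G_1^{\lambda|3} - (\boldsymbol{\theta}_1^3)^\top \boldsymbol{\phi}_1 \big) \boldsymbol{\phi}_1,\\
& \boldsymbol{\theta}_3^3 = \boldsymbol{\theta}_2^3 + \alpha \big( G_2^{\lambda|3} - (\boldsymbol{\theta}_2^3)^\top \boldsymbol{\phi}_2 \big) \boldsymbol{\phi}_2,
\end{aligned}
$$

with $\boldsymbol{\theta}_0^t := \boldsymbol{\theta}_0$ for all $t$. More generally, the update sequence at time step $t$ is:

$$
\boldsymbol{\theta}_{k+1}^t := \boldsymbol{\theta}_k^t + \alpha \big( G_k^{\lambda|t} - (\boldsymbol{\theta}_k^t)^\top \boldsymbol{\phi}_k \big) \boldsymbol{\phi}_k, \qquad \text{for } 0 \leq k < t. \tag{7}
$$

We define $\boldsymbol{\theta}_t$ (without superscript) as the final weight vector of the update sequence at time $t$, that is, $\boldsymbol{\theta}_t := \boldsymbol{\theta}_t^t$. We call the algorithm implementing Equation (7) the online $\lambda$-return algorithm. By contrast, we call the algorithm that implements the traditional forward view the offline $\lambda$-return algorithm.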
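The online $\lambda$-return algorithm can be written down directly as a (deliberately inefficient) Python sketch. It assumes the standard definitions: the $n$-step return bootstraps from the weight vector $\boldsymbol{\theta}_{t+n-1}$ that was final at time $t+n-1$, and the interim $\lambda$-return with horizon $h$ mixes the $n$-step returns for $n = 1, \dots, h-t$. All function names, and the list-based layout where `phis[t]` holds $\boldsymbol{\phi}_t$ and `rewards[k-1]` holds $R_k$, are assumptions for this example.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def n_step_return(t, n, phis, rewards, thetas, gamma):
    """n-step return from time t, bootstrapping with theta_{t+n-1}."""
    g = sum(gamma ** (k - 1) * rewards[t + k - 1] for k in range(1, n + 1))
    return g + gamma ** n * dot(thetas[t + n - 1], phis[t + n])


def interim_lambda_return(t, h, phis, rewards, thetas, gamma, lam):
    """Interim lambda-return for time t that uses no data beyond horizon h."""
    mix = sum(lam ** (n - 1) * n_step_return(t, n, phis, rewards, thetas, gamma)
              for n in range(1, h - t))
    tail = lam ** (h - t - 1) * n_step_return(t, h - t, phis, rewards, thetas, gamma)
    return (1 - lam) * mix + tail


def online_lambda_return(phis, rewards, theta0, alpha, gamma, lam):
    """Return [theta_0, theta_1, ..., theta_T] for one episode.

    phis:    [phi_0, ..., phi_T], with phi_T all zeros for a terminal state.
    rewards: [R_1, ..., R_T].
    """
    thetas = [list(theta0)]
    for h in range(1, len(rewards) + 1):
        theta = list(theta0)  # every sequence restarts from theta_0
        for k in range(h):    # one update per visited state S_0 .. S_{h-1}
            target = interim_lambda_return(k, h, phis, rewards, thetas, gamma, lam)
            err = target - dot(theta, phis[k])
            theta = [th + alpha * err * p for th, p in zip(theta, phis[k])]
        thetas.append(theta)  # theta_h is the final vector of the sequence
    return thetas
```

The nested loop makes the cost per time step grow with the length of the episode so far, which is the practical drawback of this forward-view formulation.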

The update sequence performed by the online $\lambda$-return algorithm at time step $T$ (the time step that a terminal state is reached) is very similar to the update sequence performed by the offline $\lambda$-return algorithm. In particular, note that $G_t^{\lambda|T}$ and $G_t^{\lambda}$ are the same, under the assumption that the weights used for the value estimates are the same. Because these weights are in practice not exactly the same, there will typically be a small difference. [Footnote 2: If $\lambda = 1$ there is never a difference because there is no bootstrapping.]

Figure 1 illustrates the difference between the online and offline $\lambda$-return algorithm, as well as accumulate TD($\lambda$), by showing the RMS error on a random walk task. The task consists of 10 states laid out in a row plus a terminal state on the left. Each state transitions with 70% probability to its left neighbour and with 30% probability to its right neighbour (or to itself in case of the right-most state). All rewards are 1 and . Furthermore, and . The right-most state is the initial state. Whereas the offline $\lambda$-return algorithm only makes updates at the end of an episode, the online $\lambda$-return algorithm, as well as accumulate TD($\lambda$), make updates at every time step.

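The comparison on the random walk task shows that accumulate TD($\lambda$) behaves similarly to the online $\lambda$-return algorithm. In fact, the smaller the step-size, the smaller the difference between accumulate TD($\lambda$) and the online $\lambda$-return algorithm. This is formalized by Theorem 1. The proof of the theorem can be found in Appendix A. The theorem uses the term $\boldsymbol{\Delta}_i^t$, which is defined as:

$$
\boldsymbol{\Delta}_i^t := \big( \bar{G}_i^{\lambda|t} - \boldsymbol{\theta}_0^\top \boldsymbol{\phi}_i \big) \boldsymbol{\phi}_i,
$$

with $\bar{G}_i^{\lambda|t}$ the interim $\lambda$-return for state $S_i$ with horizon $t$ that uses $\boldsymbol{\theta}_0$ for all value evaluations. Note that $\boldsymbol{\Delta}_i^t$ is independent of the step-size.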

###### Theorem 1

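Let $\boldsymbol{\theta}_0$ be the initial weight vector, $\boldsymbol{\theta}_t^{td}$ be the weight vector at time $t$ computed by accumulate TD($\lambda$), and $\boldsymbol{\theta}_t^{\lambda}$ be the weight vector at time $t$ computed by the online $\lambda$-return algorithm. Furthermore, assume that $\sum_{i=0}^{t-1} \boldsymbol{\Delta}_i^t \neq \boldsymbol{0}$. Then, for all time steps $t$:

$$
\frac{\| \boldsymbol{\theta}_t^{td} - \boldsymbol{\theta}_t^{\lambda} \|}{\| \boldsymbol{\theta}_t^{td} - \boldsymbol{\theta}_0 \|} \rightarrow 0, \qquad \text{as } \alpha \rightarrow 0.
$$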

Theorem 1 generalizes the traditional result to arbitrary time steps. The traditional result states that the difference between the weight vector at the end of an episode computed by the offline $\lambda$-return algorithm and the weight vector at the end of an episode computed by accumulate TD($\lambda$) goes to 0, if the step-size goes to 0 (Bertsekas & Tsitsiklis, 1996).

### 3.2 Comparison to Accumulate TD($\lambda$)

While accumulate TD($\lambda$) behaves like the online $\lambda$-return algorithm for small step-sizes, small step-sizes often result in slow learning. Hence, higher step-sizes are desirable. For higher step-sizes, however, the behaviour of accumulate TD($\lambda$) can be very different from that of the online $\lambda$-return algorithm. And as we show in the empirical section of this article (Section 5), when there is a difference, it is almost exclusively in favour of the online $\lambda$-return algorithm. In this section, we analyze why the online $\lambda$-return algorithm can outperform accumulate TD($\lambda$), using the one-state example shown in the left of Figure 2.

The right of Figure 2 shows the RMS error over the first 10 episodes of the one-state example for different step-sizes and $\lambda = 1$. While for small step-sizes accumulate TD($\lambda$) indeed behaves like the online $\lambda$-return algorithm, as predicted by Theorem 1, for larger step-sizes the difference becomes huge. To understand the reason for this, we derive an analytical expression for the value at the end of an episode.

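First, we consider accumulate TD($\lambda$). Because there is only one state involved, we indicate the value of this state simply by $V$. The update at the end of an episode is $V_T = V_{T-1} + \alpha\, \delta_{T-1}\, e_{T-1}$. In our example, $\delta_t = 0$ for all time steps $t$, except for $t = T-1$, where $\delta_{T-1} = R_T - V_{T-1}$. Because $\delta_t$ is 0 for all time steps except the last, $V_{T-1} = V_0$. Furthermore, $\phi_t = 1$ for all time steps $t$, resulting in $e_{T-1} = T$. Substituting all this in the expression for $V_T$ yields:

$$
V_T = V_0 + T\alpha\, (R_T - V_0). \tag{8}
$$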

So for accumulate TD($\lambda$), the total value difference is simply a summation of the value difference corresponding to a single update.

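Now, consider the online $\lambda$-return algorithm. The value at the end of an episode, $V_T$, is equal to $V_T^T$, resulting from the update sequence:

$$
V_{k+1}^T = V_k^T + \alpha \big( G_k^{\lambda|T} - V_k^T \big), \qquad \text{for } 0 \leq k < T.
$$

By incremental substitution, we can directly express $V_T$ in terms of the initial value, $V_0$, and the update targets:

$$
V_T = (1-\alpha)^T V_0 + \alpha \sum_{k=0}^{T-1} (1-\alpha)^{T-1-k}\, G_k^{\lambda|T}.
$$

Because $G_k^{\lambda|T} = R_T$ for all $k$ in our example, the weights of all update targets can be added together and the expression can be rewritten as a single pseudo-update, yielding:

$$
V_T = V_0 + \big( 1 - (1-\alpha)^T \big) (R_T - V_0). \tag{9}
$$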

The term $1 - (1-\alpha)^T$ in (9) acts like a pseudo step-size. For larger $\alpha$ or $T$ this pseudo step-size increases in value, but as long as $\alpha \leq 1$ the value will never exceed 1. By contrast, for accumulate TD($\lambda$) the pseudo step-size is $\alpha T$, which can grow much larger than 1 even for $\alpha < 1$, causing divergence of values. This is the reason that accumulate TD($\lambda$) can be very sensitive to the step-size and it explains why the optimal step-size for accumulate TD($\lambda$) is much smaller than the optimal step-size for the online $\lambda$-return algorithm in Figure 2 ( versus , respectively). Moreover, because the variance on the pseudo step-size is higher for accumulate TD($\lambda$), the performance at the optimal step-size for accumulate TD($\lambda$) is worse than the performance at the optimal step-size for the online $\lambda$-return algorithm.
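The two pseudo step-sizes discussed above can be compared numerically. The sketch below assumes the one-state setting of this subsection with $\lambda = 1$ and no discounting, so that accumulate TD($\lambda$) has pseudo step-size $\alpha T$ and the online $\lambda$-return algorithm has pseudo step-size $1 - (1-\alpha)^T$; the function names are hypothetical.

```python
def pseudo_step_accumulate(alpha, episode_length):
    """Pseudo step-size of accumulate TD(1) in the one-state example: alpha * T."""
    return alpha * episode_length


def pseudo_step_online(alpha, episode_length):
    """Pseudo step-size of the online lambda-return algorithm: 1 - (1 - alpha)**T."""
    return 1.0 - (1.0 - alpha) ** episode_length


def value_after_episode(v0, final_reward, pseudo_step):
    """Single pseudo-update of the state value towards the final reward."""
    return v0 + pseudo_step * (final_reward - v0)
```

For example, with $\alpha = 0.5$ and an episode of length 4, the first function gives a pseudo step-size of 2, so the value overshoots the target, while the second gives 0.9375, which keeps the value between its old estimate and the target.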

### 3.3 Comparison to Replace TD($\lambda$)

The sensitivity of accumulate TD($\lambda$) to divergence, demonstrated in the previous subsection, has long been known. In fact, replace TD($\lambda$) was designed to deal with this. But while replace TD($\lambda$) is much more robust with respect to divergence, it also has its limitations. One obvious limitation is that it only applies to binary features, so it is not generally applicable. But even in domains where replace TD($\lambda$) can be applied, it can perform poorly. The reason is that replacing previous trace values, rather than adding to them, reduces the multi-step characteristics of TD($\lambda$).

To illustrate this, consider the two-state example shown in the left of Figure 3. It is easy to see that the value of the left-most state is 2 and of the other state is 0. The state representation consists of only a single, binary feature that is 1 in both states and 0 in the terminal state. Because there is only a single feature, the state values cannot be represented exactly. The weight that minimizes the mean squared error assigns a value of 1 to both states, resulting in an RMS error of 1. Now consider the graph shown in the right of Figure 3, which shows the asymptotic RMS error for different values of $\lambda$. The error for accumulate TD($\lambda$) converges to the least mean squares (LMS) error for $\lambda = 1$, as predicted by the theory (Dayan, 1992). The online $\lambda$-return algorithm has the same convergence behaviour (due to Theorem 1). By contrast, replace TD($\lambda$) converges to the same value as TD(0) for any value of $\lambda$. The reason for this behaviour is that because the single feature is active at all time steps, the multi-step behaviour of TD($\lambda$) is fully removed, no matter the value of $\lambda$. Hence, replace TD($\lambda$) behaves exactly the same as TD(0) for any value of $\lambda$ at all time steps. As a result, it also behaves like TD(0) asymptotically.
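The claim that replace TD($\lambda$) collapses to TD(0) when a single binary feature is always active can be checked with a short Python sketch. The episode below is a hypothetical stand-in for the two-state example: it assumes a reward of 2 on the first transition and 0 on the second, which is consistent with the stated state values but is not taken from Figure 3. The function name and the `(phi_t, reward, phi_next)` episode format are likewise assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def replace_td_lambda(episode, theta, alpha, gamma, lam):
    """Run one episode of replace TD(lambda) for binary features."""
    theta = list(theta)
    e = [0.0] * len(theta)
    for phi, reward, phi_next in episode:
        delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
        # Replacing trace: active features are reset to 1, others decay.
        e = [1.0 if p_i == 1 else gamma * lam * e_i for e_i, p_i in zip(e, phi)]
        theta = [th + alpha * delta * e_i for th, e_i in zip(theta, e)]
    return theta


# Assumed two-state episode: one shared feature, terminal features are zero.
two_state_episode = [([1], 2.0, [1]), ([1], 0.0, [0])]
```

Because the trace of the always-active feature is reset to 1 at every step, calling `replace_td_lambda(two_state_episode, [0.0], 0.1, 1.0, lam)` yields the same weight for every `lam`, including `lam = 0`, which is TD(0).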

The two-state example very clearly demonstrates that there is a price paid by replace TD($\lambda$) to achieve robustness with respect to divergence: a reduction in multi-step behaviour. By contrast, the online $\lambda$-return algorithm, which is also robust to divergence, does not have this disadvantage. Of course, the two-state example, as well as the one-state example from the previous section, are extreme examples, merely meant to illustrate what can go wrong. But in practice, a domain will often have some characteristics of the one-state example and some of the two-state example, which negatively impacts the performance of both accumulate and replace TD($\lambda$).

## 4 True Online TD($\lambda$)

The online $\lambda$-return algorithm is impractical on many domains: the memory it uses, as well as the computation required per time step, increase linearly with time. Fortunately, it is possible to rewrite the update equations of the online $\lambda$-return algorithm to a different set of update equations that can be implemented with a computational complexity that is independent of time. In fact, this alternative set of update equations differs from the update equations of accumulate TD($\lambda$) only by two extra terms, each of which can be computed efficiently. The algorithm implementing these equations is called true online TD($\lambda$) and is discussed below.

### 4.1 The Algorithm

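For the online $\lambda$-return algorithm, at each time step a sequence of updates is performed. The length of this sequence, and hence the computation per time step, increases over time. However, it is possible to compute the weight vector resulting from the sequence at time step $t+1$ directly from the weight vector resulting from the sequence at time step $t$. This results in the following update equations (see Appendix B for the derivation):

$$
\begin{aligned}
\delta_t &= R_{t+1} + \gamma\, \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_{t+1} - \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t, && (10)\\
\boldsymbol{e}_t &= \gamma\lambda\, \boldsymbol{e}_{t-1} + \boldsymbol{\phi}_t - \alpha\gamma\lambda \big( \boldsymbol{e}_{t-1}^\top \boldsymbol{\phi}_t \big) \boldsymbol{\phi}_t, && (11)\\
\boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t + \alpha\, \delta_t\, \boldsymbol{e}_t + \alpha \big( \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t - \boldsymbol{\theta}_{t-1}^\top \boldsymbol{\phi}_t \big) \big( \boldsymbol{e}_t - \boldsymbol{\phi}_t \big), && (12)
\end{aligned}
$$

for $t \geq 0$, and with $\boldsymbol{e}_{-1} = \boldsymbol{0}$. Compared to accumulate TD($\lambda$), both the trace update and the weight update have an additional term.
We call a trace updated in this way a dutch trace; we call the term $\alpha \big( \boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t - \boldsymbol{\theta}_{t-1}^\top \boldsymbol{\phi}_t \big) \big( \boldsymbol{e}_t - \boldsymbol{\phi}_t \big)$ the TD-error time-step correction, or simply the $\delta$-correction.
Algorithm 2 shows pseudocode that implements these equations. [Footnote 3: When using a time-dependent step-size (e.g., when annealing the step-size) use the pseudocode from Section 6.1. For reasons explained in that section this requires a modified trace update. That pseudocode is the same as the pseudocode from van Seijen & Sutton (2014).]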
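A direct Python transcription of the true online TD($\lambda$) update equations (the TD error, the dutch trace, and the weight update with the $\delta$-correction) might look as follows. This is a sketch for a constant step-size only and is not a reproduction of Algorithm 2; the function name and the `(phi_t, reward, phi_next)` episode format are assumptions made for this example.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def true_online_td_lambda(episode, theta, alpha, gamma, lam):
    """Run one episode of true online TD(lambda) with a constant step-size.

    episode: list of (phi_t, R_{t+1}, phi_{t+1}) transitions (assumed format);
             the feature vector of a terminal state is all zeros.
    """
    theta = list(theta)
    theta_prev = list(theta)  # stands in for theta_{t-1}; irrelevant at t = 0
    e = [0.0] * len(theta)    # e_{-1} = 0
    for phi, reward, phi_next in episode:
        v = dot(theta, phi)
        v_prev = dot(theta_prev, phi)
        delta = reward + gamma * dot(theta, phi_next) - v
        # Dutch trace: the accumulating trace minus a correction term.
        e_phi = dot(e, phi)
        e = [gamma * lam * e_i + p_i - alpha * gamma * lam * e_phi * p_i
             for e_i, p_i in zip(e, phi)]
        # Weight update: TD(lambda) step plus the delta-correction.
        new_theta = [th + alpha * delta * e_i + alpha * (v - v_prev) * (e_i - p_i)
                     for th, e_i, p_i in zip(theta, e, phi)]
        theta_prev, theta = theta, new_theta
    return theta
```

At the first time step the trace equals $\boldsymbol{\phi}_0$, so the $\delta$-correction vanishes and the choice of `theta_prev` at that step does not matter. Per step the sketch performs a fixed number of inner products and element-wise updates, so its cost does not grow with the episode length.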

In terms of computation time, true online TD($\lambda$) has a (slightly) higher cost due to the two extra terms that have to be accounted for. While the computation-time complexity of true online TD($\lambda$) is the same as that of accumulate/replace TD($\lambda$) ($\mathcal{O}(n)$ per time step, with $n$ being the number of features), the actual computation time can be close to twice as much in some cases. In other cases (for example if sparse feature vectors are used), the computation time of true online TD($\lambda$) is only a fraction more than that of accumulate/replace TD($\lambda$). In terms of memory, true online TD($\lambda$) has the same cost as accumulate/replace TD($\lambda$).

### 4.2 When Can a Performance Difference be Expected?

In Section 3, a number of examples were shown where the online $\lambda$-return algorithm outperforms accumulate/replace TD($\lambda$). Because true online TD($\lambda$) is simply an efficient implementation of the online $\lambda$-return algorithm, true online TD($\lambda$) will outperform accumulate/replace TD($\lambda$) on these examples as well. But there will not be a performance difference in all cases. For example, it follows from Theorem 1 that when appropriately small step-sizes are used, the difference between the online $\lambda$-return algorithm/true online TD($\lambda$) and accumulate TD($\lambda$) is negligible. In this section, we identify two other factors that affect whether or not there will be a performance difference. While the focus of this section is on performance difference rather than performance advantage, our experiments will show that true online TD($\lambda$) always performs at least as well as accumulate TD($\lambda$) and replace TD($\lambda$). In other words, our experiments suggest that whenever there is a performance difference, it is in favour of true online TD($\lambda$).

The first factor is the $\lambda$ parameter and follows straightforwardly from the true online TD($\lambda$) update equations.

###### Proposition 1

For $\lambda = 0$, accumulate TD($\lambda$), replace TD($\lambda$) and the online $\lambda$-return algorithm / true online TD($\lambda$) behave the same.

Proof
For $\lambda = 0$, the accumulating-trace update, the replacing-trace update and the dutch-trace update all reduce to $\boldsymbol{e}_t = \boldsymbol{\phi}_t$. In addition, because $\boldsymbol{e}_t = \boldsymbol{\phi}_t$, the $\delta$-correction of true online TD($\lambda$) is 0.

Because the behaviour of TD($\lambda$) for small $\lambda$ is close to the behaviour of TD(0), it follows that significant performance differences will only be observed when $\lambda$ is large.

The second factor is related to how often a feature has a non-zero value. We start again with a proposition that highlights a condition under which the different TD($\lambda$) versions behave the same. The proposition makes use of an accumulating trace at time step $t-1$, $\boldsymbol{e}_{t-1}^{acc}$, whose non-recursive form is:

$$
\boldsymbol{e}_{t-1}^{acc} = \sum_{k=0}^{t-1} (\gamma\lambda)^{t-1-k}\, \boldsymbol{\phi}_k. \tag{13}
$$

Furthermore, the proposition uses $\boldsymbol{x}(i)$ to denote the $i$-th element of vector $\boldsymbol{x}$.

###### Proposition 2

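If for all features $i$ and at all time steps $t$

$$
\boldsymbol{e}_{t-1}^{acc}(i) \cdot \boldsymbol{\phi}_t(i) = 0, \tag{14}
$$

then accumulate TD($\lambda$), replace TD($\lambda$) and the online $\lambda$-return algorithm / true online TD($\lambda$) behave the same (for any $\lambda$).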

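Proof Condition (14) implies that if $\boldsymbol{\phi}_t(i) \neq 0$, then $\boldsymbol{e}_{t-1}(i) = 0$. From this it follows that for binary features the accumulating-trace update can be written as a replacing-trace update at every time step:

$$
\boldsymbol{e}_t(i) = \gamma\lambda\, \boldsymbol{e}_{t-1}(i) + \boldsymbol{\phi}_t(i) = \begin{cases} \gamma\lambda\, \boldsymbol{e}_{t-1}(i), & \text{if } \boldsymbol{\phi}_t(i) = 0,\\ 1, & \text{if } \boldsymbol{\phi}_t(i) = 1. \end{cases}
$$

Hence, accumulate TD($\lambda$) and replace TD($\lambda$) perform exactly the same updates.

Furthermore, condition (14) implies that $\boldsymbol{e}_{t-1}^\top \boldsymbol{\phi}_t = 0$. Hence, the accumulating-trace update can also be written as a dutch trace update at every time step:

$$
\boldsymbol{e}_t = \gamma\lambda\, \boldsymbol{e}_{t-1} + \boldsymbol{\phi}_t = \gamma\lambda\, \boldsymbol{e}_{t-1} + \boldsymbol{\phi}_t - \alpha\gamma\lambda \big( \boldsymbol{e}_{t-1}^\top \boldsymbol{\phi}_t \big) \boldsymbol{\phi}_t.
$$

In addition, note that the $\delta$-correction is proportional to $\boldsymbol{\theta}_t^\top \boldsymbol{\phi}_t - \boldsymbol{\theta}_{t-1}^\top \boldsymbol{\phi}_t$, which can be written as $(\boldsymbol{\theta}_t - \boldsymbol{\theta}_{t-1})^\top \boldsymbol{\phi}_t$. The value $\boldsymbol{\theta}_t - \boldsymbol{\theta}_{t-1}$ is proportional to $\boldsymbol{e}_{t-1}$ for accumulate TD($\lambda$). Because $\boldsymbol{e}_{t-1}^\top \boldsymbol{\phi}_t = 0$, accumulate TD($\lambda$) can add a $\delta$-correction at every time step without any consequence. This shows that accumulate TD($\lambda$) makes the same updates as true online TD($\lambda$).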

An example of a domain where the condition of Proposition 2 holds is a domain with tabular features (each state is represented with a unique standard-basis vector), where a state is never revisited within the same episode.

The condition of Proposition 2 holds approximately when the value $\boldsymbol{e}_{t-1}^{acc}(i) \cdot \boldsymbol{\phi}_t(i)$ is close to 0 for all features at all time steps. In this case, the different TD($\lambda$) versions will perform very similarly. It follows from Equation (13) that this is the case when there is a long time delay between the time steps that a feature has a non-zero value. Specifically, if there are always at least $n$ time steps between two subsequent times that a feature $i$ has a non-zero value, with $(\gamma\lambda)^n$ being very small, then $\boldsymbol{e}_{t-1}^{acc}(i) \cdot \boldsymbol{\phi}_t(i)$ will always be close to 0. Therefore, in order to see a large performance difference, the same features should have a non-zero value often and within a small time frame (relative to $\gamma\lambda$).

Summarizing the analysis so far: in order to see a performance difference, $\alpha$ and $\lambda$ should be sufficiently large, and the same features should have a non-zero value often and within a small time frame. Based on this summary, we can address a related question: on what type of domains will there be a performance difference between true online TD($\lambda$) with optimized parameters and accumulate/replace TD($\lambda$) with optimized parameters? The first two conditions suggest that the domain should result in a relatively large optimal $\alpha$ and optimal $\lambda$. This is typically the case for domains with a relatively low variance on the return. The last condition can be satisfied in multiple ways. It is for example satisfied by domains that have non-sparse feature vectors (that is, domains for which at any particular time step most features have a non-zero value).

### 4.3 True Online Sarsa($\lambda$)

TD($\lambda$) and true online TD($\lambda$) are policy evaluation methods. However, they can be turned into control methods in a straightforward way. From a learning perspective, the main difference is that the prediction of the expected return should be conditioned on the state and action, rather than only on the state. This means that an estimate of the action-value function $q_\pi$ is being learned, rather than of the state-value function $v_\pi$.

Another difference is that instead of having a fixed policy that generates the behaviour, the policy depends on the action-value estimates. Because these estimates typically improve over time, so does the policy. The (on-policy) control counterpart of TD($\lambda$) is the popular Sarsa($\lambda$) algorithm. The control counterpart of true online TD($\lambda$) is ‘true online Sarsa($\lambda$)’. Algorithm 3 shows pseudocode for true online Sarsa($\lambda$).

To ensure that accurate estimates are obtained for all state-action values, typically some exploration strategy has to be used. A simple, but often sufficient strategy is to use an $\epsilon$-greedy behaviour policy. That is, given current state $S_t$, with probability $\epsilon$ a random action is selected, and with probability $1-\epsilon$ the greedy action is selected:

$$A_t^{\text{greedy}} = \arg\max_a \theta_t^\top \psi(S_t, a),$$

with $\psi(s,a)$ an action-feature vector, and $\theta_t^\top \psi(s,a)$ a (linear) estimate of the action value of $(s,a)$ at time step $t$. A common way to derive an action-feature vector $\psi(s,a)$ from a state-feature vector $\phi(s)$ involves an action-feature vector of size $n\,|\mathcal{A}|$, where $n$ is the number of state features and $|\mathcal{A}|$ is the number of actions. Each action corresponds with a block of $n$ features in this action-feature vector. The features in $\psi(s,a)$ that correspond to action $a$ take on the values of the state features; the features corresponding to other actions have a value of 0.
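
The block construction and the $\epsilon$-greedy rule described above can be sketched as follows. This is only an illustration, not the authors' implementation: the function names, the use of NumPy, and the tie-breaking rule (first maximal action) are assumptions.

```python
import numpy as np

def action_features(phi, action, num_actions):
    """Copy the state features into the block of `action`; all other blocks stay 0."""
    n = phi.shape[0]
    psi = np.zeros(n * num_actions)
    psi[action * n:(action + 1) * n] = phi
    return psi

def epsilon_greedy(theta, phi, num_actions, epsilon, rng):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    # Linear action-value estimates, one per action.
    q = [theta @ action_features(phi, a, num_actions) for a in range(num_actions)]
    return int(np.argmax(q))  # assumption: ties go to the lowest action index
```

With `epsilon = 0` the function always returns the greedy action, which makes the behaviour easy to check on small hand-built weight vectors.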

## 5 Empirical Study

This section contains our main empirical study, comparing TD($\lambda$), as well as Sarsa($\lambda$), with their true online counterparts. For each method and each domain, a scan over the step-size $\alpha$ and the trace-decay parameter $\lambda$ is performed such that the optimal performance can be compared. In Section 5.4, we discuss the results.

### 5.1 Random MRPs

For our first series of experiments we used randomly constructed Markov reward processes (MRPs). (The code for the MRP experiments is published online at: https://github.com/armahmood/totd-rndmdp-experiments.) The process we used to construct the MRPs is based on the process used by Bhatnagar, Sutton, Ghavamzadeh and Lee (2009). An MRP can be interpreted as an MDP with only a single action per state. Consequently, there is only one policy possible. We represent a random MRP as a 3-tuple $(k, b, \sigma)$, consisting of $k$, the number of states; $b$, the branching factor (that is, the number of next states with a non-zero transition probability); and $\sigma$, the standard deviation of the reward.
An MRP is constructed as follows. The potential next states for a particular state are drawn from the total set of states at random, and without replacement. The transition probabilities to those states are randomized as well (by partitioning the unit interval at random cut points).
The expected value of the reward for a transition is drawn from a normal distribution with zero mean and unit variance. The actual reward is drawn from a normal distribution with a mean equal to this expected reward and standard deviation $\sigma$. There are no terminal states.
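
The construction procedure can be sketched as below. This is a minimal illustration rather than the published experiment code: the function names are hypothetical, and using `b - 1` uniformly drawn cut points to partition the unit interval is an assumption about the detail the text leaves open.

```python
import numpy as np

def make_random_mrp(k, b, rng):
    """Build transition probabilities and expected rewards for k states, branching factor b."""
    P = np.zeros((k, k))
    for s in range(k):
        # Potential next states: drawn at random, without replacement.
        successors = rng.choice(k, size=b, replace=False)
        # Assumption: b - 1 uniform cut points partition the unit interval.
        cuts = np.sort(rng.random(b - 1))
        P[s, successors] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    # Expected reward per transition ~ N(0, 1); only entries with P > 0 are ever used.
    expected_reward = rng.standard_normal((k, k))
    return P, expected_reward

def step(P, expected_reward, sigma, s, rng):
    """Sample the next state and a noisy reward with standard deviation sigma."""
    s_next = int(rng.choice(P.shape[0], p=P[s]))
    reward = float(rng.normal(expected_reward[s, s_next], sigma))
    return s_next, reward
```

Because there are no terminal states, `step` can be called indefinitely to generate one continuing trajectory.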

We compared the performance of TD($\lambda$) on three different MRPs: one with a small number of states, $(10, 3, 0.1)$; one with a larger number of states ($k = 100$); and one with a larger number of states ($k = 100$) but a low branching factor and no stochasticity for the reward ($\sigma = 0$). The same discount factor $\gamma$ is used for all three MRPs. Each MRP is evaluated using three different representations. The first representation consists of tabular features, that is, each state is represented with a unique standard-basis vector of $k$ dimensions. The second representation is based on binary features. This binary representation is constructed by first assigning indices, from 1 to $k$, to all states. Then, the binary encoding of the state index is used as a feature vector to represent that state. The length of a feature vector is determined by the total number of states: for $k = 10$, the length is 4; for $k = 100$, the length is 7. As an example, for $k = 10$ the binary feature vectors of states 1, 2 and 3 are $(0, 0, 0, 1)$, $(0, 0, 1, 0)$ and $(0, 0, 1, 1)$, respectively. Finally, the third representation uses non-binary features. For this representation each state is mapped to a 5-dimensional feature vector, with the value of each feature drawn from a normal distribution with zero mean and unit variance. After all the feature values for a state are drawn, they are normalized such that the feature vector has unit length. Once generated, the feature vectors are kept fixed for each state. Note that replace TD($\lambda$) cannot be used with this representation, because replacing traces are only defined for binary features (tabular features are a special case of this).
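
The three representations can be sketched in a few lines. The helper names are hypothetical, and writing the binary encoding with the most significant bit first is an assumption; the vector lengths (4 for 10 states, 7 for 100 states) follow from the text.

```python
import math
import random

def tabular_features(index, k):
    """Standard-basis vector of k dimensions for a state index in 1..k."""
    return [1 if i == index else 0 for i in range(1, k + 1)]

def binary_features(index, k):
    """Binary encoding of the state index; the length is set by the largest index k."""
    length = k.bit_length()
    # Assumption: most significant bit first.
    return [(index >> shift) & 1 for shift in reversed(range(length))]

def random_unit_features(dim, rng):
    """Non-binary features: N(0, 1) draws, normalized to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
```

In an experiment the non-binary vectors would be drawn once per state and then kept fixed, for example by storing them in a list indexed by state.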

In each experiment, we performed a scan over $\alpha$ and $\lambda$. Specifically, between 0 and 0.1, $\alpha$ is varied according to $10^i$ with $i$ varying from -3 to -1 with steps of 0.2, and from 0.1 to 2.0 (linearly) with steps of 0.1. In addition, $\lambda$ is varied from 0 to 0.9 with steps of 0.1 and from 0.9 to 1.0 with steps of 0.01. The initial weight vector is the zero vector in all domains. As performance metric we used the mean-squared error (MSE) with respect to the LMS solution during early learning (for $k = 10$, we averaged over the first 100 time steps; for $k = 100$, we averaged over the first 1000 time steps). We normalized this error by dividing it by the MSE under the initial weight estimate.
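
As a rough illustration, the scan described above corresponds to the following grid. Treating 0.1 and 0.9 as belonging to both ranges and listing each only once is an assumption, as is the rounding used to suppress floating-point noise in the linear parts.

```python
# Step-sizes: logarithmic part 10^i for i = -3, -2.8, ..., -1, then linear part up to 2.0.
alphas_log = [10 ** (-3 + 0.2 * i) for i in range(11)]       # 0.001 ... 0.1
alphas_lin = [round(0.1 * j, 1) for j in range(2, 21)]       # 0.2 ... 2.0
alphas = alphas_log + alphas_lin

# Trace-decay values: coarse steps up to 0.9, fine steps from 0.9 to 1.0.
lambdas_coarse = [round(0.1 * j, 2) for j in range(10)]          # 0.0 ... 0.9
lambdas_fine = [round(0.9 + 0.01 * j, 2) for j in range(1, 11)]  # 0.91 ... 1.0
lambdas = lambdas_coarse + lambdas_fine

# Every (alpha, lambda) pair is evaluated for every method.
grid = [(a, l) for a in alphas for l in lambdas]
```

Under these assumptions the grid has 30 step-sizes and 20 trace-decay values, so 600 settings per method and representation.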

Figure 4 shows the results for different $\lambda$ at the best value of $\alpha$. In Appendix C, the results for all $\alpha$ values are shown. The optimal performance of true online TD($\lambda$) is at least as good as the optimal performance of accumulate TD($\lambda$) and replace TD($\lambda$), on all domains and for all representations. A more in-depth discussion of these results is provided in Section 5.4.

### 5.2 Predicting Signals From a Myoelectric Prosthetic Arm

In this experiment, we compared the performance of true online TD($\lambda$) and TD($\lambda$) on a real-world data-set consisting of sensorimotor signals measured during the human control of an electromechanical robot arm. The source of the data is a series of manipulation tasks performed by a participant with an amputation, as presented by Pilarski et al. (2013). In this study, an amputee participant used signals recorded from the muscles of their residual limb to control a robot arm with multiple degrees-of-freedom (Figure 5). Interactions of this kind are known as myoelectric control (see, for example, Parker et al., 2006).

For consistency and comparison of results, we used the same source data and prediction learning architecture as published in Pilarski et al. (2013). In total, two signals are predicted: grip force and motor angle signals from the robot’s hand. Specifically, the target for the prediction is a discounted sum of each signal over time, similar to return predictions (see general value functions and nexting; Sutton et al., 2011; Modayil et al., 2014). Where possible, we used the same implementation and code base as Pilarski et al. (2013). Data for this experiment consisted of 58,000 time steps of recorded sensorimotor information, sampled at 40 Hz (i.e., approximately 25 minutes of experimental data). The state space consisted of a tile-coded representation of the robot gripper’s position, velocity, recorded gripping force, and two muscle contraction signals from the human user. A standard implementation of tile-coding was used, with ten bins per signal, eight overlapping tilings, and a single active bias unit. This results in a state space with 800,001 features, 9 of which were active at any given time. Hashing was used to reduce this space down to a vector of 200,000 features that are then presented to the learning system. All signals were normalized between 0 and 1 before being provided to the function approximation routine. The discount factor $\gamma$ for predictions of both force and angle was the same as in the results presented by Pilarski et al. (2013). Parameter sweeps over $\lambda$ and $\alpha$ were conducted for all three methods. The performance metric is the mean absolute return error over all 58,000 time steps of learning, normalized by dividing by the error for $\lambda = 0$.
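
The feature counts quoted above can be checked with a short calculation. The sketch assumes that each tiling is a joint grid over all five signals (position, velocity, force, and the two muscle signals); the function name is hypothetical.

```python
def tile_coding_size(num_signals, bins_per_signal, num_tilings, bias_units=1):
    """Return (total features, active features) for joint tilings plus bias units."""
    tiles_per_tiling = bins_per_signal ** num_signals
    total = num_tilings * tiles_per_tiling + bias_units
    active = num_tilings + bias_units  # one tile per tiling, plus the bias
    return total, active

total_features, active_features = tile_coding_size(
    num_signals=5, bins_per_signal=10, num_tilings=8)

# Duration of the recording: 58,000 steps at 40 Hz, expressed in minutes.
duration_minutes = 58_000 / 40 / 60
```

Under this assumption the numbers agree with the text: 8 × 10^5 + 1 = 800,001 features, of which 9 are active, and a recording of roughly 24 minutes.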

Figure 6 shows the performance for the angle as well as the force predictions at the best $\alpha$ value for different values of $\lambda$. In Appendix D, the results for all $\alpha$ values are shown. The relative performance of replace TD($\lambda$) and accumulate TD($\lambda$) depends on the predictive question being asked. For predicting the robot’s grip force signal (a signal with small magnitude and rapid changes), replace TD($\lambda$) is better than accumulate TD($\lambda$) at all $\lambda$ values larger than 0. However, for predicting the robot’s hand actuator position, a smoothly changing signal that varies within a range of 300–500, accumulate TD($\lambda$) dominates replace TD($\lambda$). On both prediction tasks, true online TD($\lambda$) dominates accumulate TD($\lambda$) and replace TD($\lambda$).

### 5.3 Control in the ALE Domain Asterix

In this final experiment, we compared the performance of true online Sarsa($\lambda$) with that of accumulate Sarsa($\lambda$) and replace Sarsa($\lambda$), on a domain from the Arcade Learning Environment (ALE) (Bellemare et al., 2013; Defazio & Graepel, 2014; Mnih et al., 2015), called Asterix. The ALE is a general testbed that provides an interface to hundreds of Atari 2600 games. (We used ALE version 0.4.4 for our experiments. The code for the Asterix experiments is published online at: https://github.com/mcmachado/TrueOnlineSarsa.)

In the Asterix domain, the agent controls a yellow avatar, which has to collect ‘potion’ objects, while avoiding ‘harp’ objects (see Figure 7 for a screenshot). Both potions and harps move across the screen horizontally. Every time the agent collects a potion it receives a reward of 50 points, and every time it touches a harp it loses a life (it has three lives in total). The agent can use the actions up, right, down, and left, combinations of two directions, and a no-op action, resulting in 9 actions in total. The game ends after the agent has lost three lives, or after 5 minutes, whichever comes first.

We use linear function approximation using features derived from the screen pixels. Specifically, we use what Bellemare et al. (2013) call the Basic feature set, which “encodes the presence of colours on the Atari 2600 screen.” It is obtained by first subtracting the game screen background (see Bellemare et al., 2013, sec. 3.1.1) and then dividing the remaining screen into tiles of size $10 \times 15$ pixels. Finally, for each tile, one binary feature is generated for each of the 128 available colours, encoding whether a colour is active or not in that tile. This generates 28,672 features (plus a bias term).
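
The feature count can be reproduced with a small sketch of a Basic-style encoding. It is not the ALE implementation: the 160 × 210 screen size, the 10 × 15 pixel tiles, the 128-colour palette, and the index layout are assumptions chosen to be consistent with the 28,672 features reported above, and the input format (a list of non-background pixels) is hypothetical.

```python
SCREEN_W, SCREEN_H = 160, 210     # assumed Atari 2600 screen size in ALE
TILE_W, TILE_H = 10, 15           # assumed tile size in pixels
NUM_COLOURS = 128                 # assumed palette size
COLS, ROWS = SCREEN_W // TILE_W, SCREEN_H // TILE_H  # 16 columns, 14 rows
NUM_FEATURES = COLS * ROWS * NUM_COLOURS

def basic_features(pixels):
    """Return sorted indices of active binary features.

    `pixels` is an iterable of (x, y, colour) tuples for the pixels that remain
    after background subtraction, with colour in 0..NUM_COLOURS-1.
    """
    active = set()
    for x, y, colour in pixels:
        tile = (y // TILE_H) * COLS + (x // TILE_W)
        active.add(tile * NUM_COLOURS + colour)
    return sorted(active)
```

Returning only the active indices keeps the representation sparse, which is convenient because only a small fraction of the features is non-zero on any frame.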

Because episode lengths can vary hugely (from about 10 seconds all the way up to 5 minutes), constructing a fair performance metric is non-trivial. For example, comparing the average return over a fixed number of initial episodes of two methods is only fair if they have seen roughly the same number of samples in those episodes, which is not guaranteed for this domain. On the other hand, looking at the total reward collected over a fixed number of initial samples is also not a good metric, because there is no negative reward associated with dying. To resolve this, we look at the return per episode, averaged over a fixed number of initial samples. More specifically, our metric consists of the average score per episode while learning for 20 hours (4,320,000 frames). In addition, we averaged the resulting number over 400 independent runs.
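
A minimal sketch of such a metric is given below. The frame budget follows from 20 hours at an assumed 60 frames per second; counting only episodes that finish within the budget is an assumption, since the text does not say how a partially completed final episode is handled.

```python
FRAMES_PER_SECOND = 60  # assumption: emulated Atari 2600 frame rate
FRAME_BUDGET = 20 * 60 * 60 * FRAMES_PER_SECOND  # 20 hours of game play

def average_score_per_episode(episodes, frame_budget):
    """Average score over the episodes that complete within the frame budget.

    `episodes` is a list of (score, num_frames) pairs in the order played.
    """
    total_score, used_frames, count = 0.0, 0, 0
    for score, num_frames in episodes:
        if used_frames + num_frames > frame_budget:
            break  # assumption: an unfinished episode is not counted
        used_frames += num_frames
        total_score += score
        count += 1
    return total_score / count if count else 0.0
```

A method that survives longer plays fewer episodes within the same budget, but each of them contributes a higher score, so the average is not biased by episode length in the way a fixed episode count would be.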

As with the evaluation experiments, we performed a scan over the step-size $\alpha$ and the trace-decay parameter $\lambda$. Specifically, we looked at all combinations of a fixed set of $\alpha$ values and a fixed set of $\lambda$ values (these values were determined during a preliminary parameter sweep). We used a fixed discount factor $\gamma$ and $\epsilon$-greedy exploration with a fixed $\epsilon$. The weight vector was initialized to the zero vector. Also, following Bellemare et al. (2013), we take an action every 5 frames. This decreases the algorithm’s running time and avoids “super-human” reflexes. The results are shown in Figure 8. On this domain, the optimal performance of all three versions of Sarsa($\lambda$) is similar.

Note that the way we evaluate a domain is computationally very expensive: we perform scans over $\alpha$ and $\lambda$, and use a large number of independent runs to get a low standard error. In the case of Asterix, this results in a large total number of runs per method. Unfortunately, this rigorous evaluation makes it prohibitive to run experiments on the full suite of ALE domains.

### 5.4 Discussion

Figure 9 summarizes the performance of the different TD($\lambda$) versions on all evaluation domains. Specifically, it shows the error for each method at its best settings of $\alpha$ and $\lambda$. The error is normalized by dividing it by the error at $\lambda = 0$ (remember that all versions of TD($\lambda$) behave the same for $\lambda = 0$). Because $\lambda = 0$ lies in the parameter range that is being optimized over, the normalized error can never be higher than 1. If for a method/domain the normalized error is equal to 1, this means that setting $\lambda$ higher than 0 either has no effect, or that the error gets worse. In either case, eligibility traces are not effective for that method/domain.
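
The normalization can be made concrete with a small sketch. It assumes that the baseline is the error at $\lambda = 0$ for the best step-size, and that the scan results are stored in a dictionary keyed by parameter pairs; both the function name and this data layout are hypothetical.

```python
def normalized_best_error(errors):
    """Best error over the whole scan, divided by the best error at lambda = 0.

    `errors` maps (alpha, lam) pairs to the measured error of one method.
    """
    best = min(errors.values())
    # Assumption: the baseline also uses its best step-size.
    baseline = min(err for (alpha, lam), err in errors.items() if lam == 0)
    return best / baseline
```

Because the pairs with `lam == 0` are part of the scan, the result cannot exceed 1, and a value of exactly 1 indicates that no setting with traces improved on TD(0).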

Overall, true online TD($\lambda$) is clearly better than accumulate TD($\lambda$) and replace TD($\lambda$) in terms of optimal performance. Specifically, for each considered domain/representation, the error for true online TD($\lambda$) is smaller than or equal to the error of accumulate/replace TD($\lambda$). This is especially impressive, given the wide variety of domains, and the fact that the computational overhead for true online TD($\lambda$) is small (see Section 4.1 for details).

The observed performance differences correspond well with the analysis from Section 4.2. In particular, note that MRP (10, 3, 0.1) has fewer states than the other two MRPs, and hence the chance that the same feature has a non-zero value within a small time frame is larger. The analysis correctly predicts that this results in larger performance differences. Furthermore, the 100-state MRP with deterministic rewards is less stochastic than the other 100-state MRP, and hence it has a smaller variance on the return. Here too, the experiments correspond with the analysis, which predicts that this results in a larger performance difference.

On the Asterix domain, the performance of the three Sarsa($\lambda$) versions is similar. This is in accordance with the evaluation results, which showed that the size of the performance difference is domain dependent. In the worst case, the performance of the true online method is similar to that of the regular method.

The optimal performance is not the only factor that determines how good a method is; what also matters is how easy it is to find this performance. The detailed plots in Appendices C and D reveal that the parameter sensitivity of accumulate TD($\lambda$) is much higher than that of true online TD($\lambda$) or replace TD($\lambda$). This is clearly visible for MRP (10, 3, 0.1) (Figure 10), as well as the experiments with the myoelectric prosthetic arm (Figure 13).

There is one more thing to take away from the experiments. In MRP (10, 3, 0.1) with non-binary features, replace TD($\lambda$) is not applicable and accumulate TD($\lambda$) is ineffective. However, true online TD($\lambda$) was still able to obtain a considerable performance advantage with respect to TD(0). This demonstrates that true online TD($\lambda$) expands the set of domains/representations where eligibility traces are effective. This could potentially have far-reaching consequences. Specifically, using non-binary features becomes a lot more interesting. Replacing traces are not applicable to such representations, while accumulating traces can easily result in divergence of values. For true online TD($\lambda$), however, non-binary features are not necessarily more challenging than binary features. Exploring new, non-binary representations could potentially further improve the performance for true online TD($\lambda$) on domains such as the myoelectric prosthetic arm or the Asterix domain.

## 6 Other True Online Methods

In Appendix B, it is shown that the true online TD($\lambda$) equations can be derived directly from the online forward view equations. By using different online forward views, new true online methods can be derived. Sometimes, small changes in the forward view, like using a time-dependent step-size, can result in surprising changes in the true online equations. In this section, we look at a number of such variations.

### 6.1 True Online TD($\lambda$) with Time-Dependent Step-Size

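When using a time-dependent step-size $\alpha_t$ in the base equation of the forward view (Equation 7) and deriving the update equations following the procedure from Appendix B, it turns out that a slightly different trace definition appears. We indicate this new trace using a ‘+’ superscript: $e_t^+$. For fixed step-size, this new trace definition is equal to:

$$e_t^+ = \alpha\, e_t .$$

Of course, using $e_t^+$ instead of $e_t$ also changes the weight vector update slightly. Below, the full set of update equations is shown:

$$\begin{aligned}
\delta_t &= R_{t+1} + \gamma\,\theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t, \\
e_t^+ &= \gamma\lambda\, e_{t-1}^+ + \alpha_t \phi_t - \alpha_t \gamma\lambda \big((e_{t-1}^+)^\top \phi_t\big)\phi_t, \\
\theta_{t+1} &= \theta_t + \delta_t\, e_t^+ + \big(\theta_t^\top \phi_t - \theta_{t-1}^\top \phi_t\big)\big(e_t^+ - \alpha_t \phi_t\big).
\end{aligned}$$

In addition, $e_{-1}^+ := 0$. We can simplify the weight update equation slightly, by using

$$\delta_t' = \delta_t + \theta_t^\top \phi_t - \theta_{t-1}^\top \phi_t,$$

which changes the update equations to (these equations are the same as in the original true online paper; van Seijen & Sutton, 2014):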

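$$\delta_t' = R_{t+1} + \gamma\,\theta_t^\top \phi_{t+1} - \theta_{t-1}^\top \phi_t \tag{15}$$

$$e_t^+ = \gamma\lambda\, e_{t-1}^+ + \alpha_t \phi_t - \alpha_t \gamma\lambda \big((e_{t-1}^+)^\top \phi_t\big)\phi_t \tag{16}$$

$$\theta_{t+1} = \theta_t + \delta_t'\, e_t^+ - \alpha_t \big(\theta_t^\top \phi_t - \theta_{t-1}^\top \phi_t\big)\phi_t \tag{17}$$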

Algorithm 2 shows the corresponding pseudocode. Of course, this pseudocode can also be used for constant step-size.
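
For readers who prefer code, the following is a minimal NumPy sketch of true online TD($\lambda$) with a (possibly time-dependent) step-size, written from the update rules of van Seijen & Sutton (2014). It is not a transcription of Algorithm 2: the function signature, the batch-style inputs, and the convention $\theta_{-1} = \theta_0$ for the first step are assumptions.

```python
import numpy as np

def true_online_td(features, rewards, alpha, gamma, lam, theta0=None):
    """Run true online TD(lambda) over one trajectory and return the final weights.

    features: array of shape (T + 1, n), the feature vectors phi_0 .. phi_T
    rewards:  array of shape (T,), the rewards R_1 .. R_T
    alpha:    scalar step-size, or array of shape (T,) with alpha_0 .. alpha_{T-1}
    """
    T, n = len(rewards), features.shape[1]
    alphas = np.broadcast_to(alpha, (T,))
    theta = np.zeros(n) if theta0 is None else np.array(theta0, dtype=float)
    e = np.zeros(n)                    # trace, already scaled by the step-size
    v_old = theta @ features[0]        # theta_{t-1}^T phi_t (assumes theta_{-1} = theta_0)
    for t in range(T):
        phi, phi_next, a = features[t], features[t + 1], alphas[t]
        v = theta @ phi                # theta_t^T phi_t
        v_next = theta @ phi_next      # theta_t^T phi_{t+1}
        delta = rewards[t] + gamma * v_next - v_old
        e = gamma * lam * e + a * phi - a * gamma * lam * (e @ phi) * phi
        theta = theta + delta * e - a * (v - v_old) * phi
        v_old = v_next                 # becomes theta_{t-1}^T phi_t at the next step
    return theta
```

A useful sanity check is that for `lam = 0` the trace reduces to `a * phi` and the two correction terms cancel, so the update coincides with ordinary TD(0).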

### 6.2 True Online Version of Watkins’s Q($\lambda$)

So far, we have only considered on-policy methods, that is, methods that evaluate a policy that is the same as the policy that generates the samples. However, the true online principle can also be applied to off-policy methods, for which the evaluation policy is different from the behaviour policy. As a simple example, consider Watkins’s Q($\lambda$) (Watkins, 1989). This is an off-policy method that evaluates the greedy policy given an arbitrary behaviour policy. It does this by combining accumulating traces with a TD error that uses the maximum state-action value of the successor state:

$$\delta_t = R_{t+1} + \gamma \max_a \theta_t^\top \psi(S_{t+1}, a) - \theta_t^\top \psi(S_t, A_t),$$

with $\psi(s,a)$ the action-feature vector of state-action pair $(s,a)$. In addition, traces are reset to 0 whenever a non-greedy action is taken.
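
A single update of the regular (not true online) Watkins’s Q($\lambda$) with linear function approximation might look as follows. This is a sketch under assumptions: the function signature is hypothetical, the ordering of the trace and weight updates follows the common textbook presentation, and an action that ties for the maximum is treated as greedy.

```python
import numpy as np

def watkins_q_step(theta, e, psi, reward, psi_next_all, next_action,
                   alpha, gamma, lam):
    """One update of Watkins's Q(lambda) with accumulating traces.

    psi:          action-feature vector of the current state-action pair
    psi_next_all: array (num_actions, n), action features of the successor state
    next_action:  index of the action the behaviour policy takes next
    """
    q_next = psi_next_all @ theta
    # TD error uses the maximum action value of the successor state.
    delta = reward + gamma * q_next.max() - theta @ psi
    e = e + psi                               # accumulating trace
    theta = theta + alpha * delta * e
    if q_next[next_action] == q_next.max():   # next action is greedy: decay the trace
        e = gamma * lam * e
    else:                                     # non-greedy action: reset the trace
        e = np.zeros_like(e)
    return theta, e
```

The reset in the final branch is what cuts off credit assignment after an exploratory action, which is why frequent exploration weakens the effect of the traces.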

From an online forward-view perspective, the strategy of Watkins’s Q($\lambda$) method can be interpreted as a growing update target that stops growing once a non-greedy action is taken. Specifically, let $\tau_t$ be the first time step after time step $t$ at which a non-greedy action is taken; then the interim update target for time step $t$ can be defined as:

with

Algorithm 5 shows the pseudocode for the true online method that corresponds with this update target definition. A problem with Watkins’s Q($\lambda$) is that if the behaviour policy is very different from the greedy policy, traces are reset very often, reducing the overall effect of the traces. In Section 7, we discuss more advanced off-policy methods.

### 6.3 Tabular True Online TD($\lambda$)

Tabular features are a special case of linear function approximation. Hence, the update equations for true online TD($\lambda$) presented so far also apply to the tabular case. However, we discuss it here separately, because the simplicity of this special case can provide extra insight.