# TIDBD: Adapting Temporal-difference Step-sizes Through Stochastic Meta-descent

###### Abstract

In this paper, we introduce a method for adapting the step-sizes of temporal difference (TD) learning. The performance of TD methods often depends on well-chosen step-sizes, yet few algorithms have been developed for setting the step-size automatically for TD learning. An important limitation of current methods is that they adapt a single step-size shared by all the weights of the learning system. A vector step-size enables greater optimization by specifying parameters on a per-feature basis. Furthermore, adapting parameters at different rates has the added benefit of being a simple form of representation learning. We generalize Incremental Delta Bar Delta (IDBD), a vectorized adaptive step-size method for supervised learning, to TD learning; we name the resulting method TIDBD. We demonstrate that TIDBD is able to find appropriate step-sizes in both stationary and non-stationary prediction tasks, outperforming ordinary TD methods and TD methods with scalar step-size adaptation; we demonstrate that it can differentiate between features which are relevant and irrelevant for a given task, performing representation learning; and we show on a real-world robot prediction task that TIDBD is able to outperform ordinary TD methods and TD methods augmented with AlphaBound and RMSprop.


All authors are with the Reinforcement Learning and Artificial Intelligence Laboratory (RLAI).

Alex Kearney
Department of Computing Science, University of Alberta
Edmonton, AB, Canada, T6G 2E1
kearney@ualberta.ca
Vivek Veeriah
Department of Computing Science, University of Alberta
Edmonton, AB, Canada, T6G 2E1
vivekveeriah@ualberta.ca
Jaden B. Travnik
Department of Computing Science, University of Alberta
Edmonton, AB, Canada, T6G 2E1
travnik@ualberta.ca
Richard S. Sutton
Department of Computing Science, University of Alberta
Edmonton, AB, Canada, T6G 2E1
rsutton@ualberta.ca
Patrick M. Pilarski
Departments of Computing Science & Medicine, University of Alberta
Edmonton, AB, Canada, T6G 2E1
pilarski@ualberta.ca

Submitted to 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

## 1 Step-size adaptation in temporal-difference learning

The problem of how to set step-sizes automatically is an important one for machine learning. The performance of many learning methods depends on a step-size parameter that scales weight updates. To ease the burden on practitioners, it is desirable to set this step-size parameter algorithmically, and over the years many such methods have been proposed. Some of these are fixed schedules, but in principle it is better to adapt the step-size based on experience. Such step-size adaptation methods are more suitable for large and long-lived machine learning systems, and are the focus of the present work.

Several interesting issues have been explored within step-size adaptation research. One issue is whether there is a single global step-size shared by all the weights in the learning system, or whether each weight has its own step-size. The former is simpler, of course, but the latter can be more powerful (Sutton, 1992; Schraudolph, 1999). Another issue is whether step-sizes always decrease, or whether they can increase as well as decrease over time. While strictly decreasing step-sizes are often effective in stationary problems, in non-stationary problems it is sometimes advantageous to increase the step-size.

In this paper, we focus on step-size adaptation in Temporal Difference (TD) learning methods. TD methods form key components of many reinforcement learning algorithms. Step-size adaptation in TD learning has received relatively little attention and involves interesting challenges that are less pressing in other learning methods. In particular, TD learning uses learned estimates as targets for further learning; this is known as bootstrapping, and because of it TD learning always involves a degree of nonstationarity, which, as noted above, increases the need for step-sizes to increase as well as decrease.

In this paper, we seek a step-size adaptation method for TD learning that satisfies the two criteria discussed above: 1) step-sizes should be able to both increase and decrease in order to compensate for the non-stationarity in TD learning, and 2) there should be a vector of many step-sizes in order to specify the step-size on a per-feature basis.

None of the existing methods for step-size adaptation in TD learning satisfy both of our criteria while also performing well in practice. HL($\lambda$) (Hutter et al., 2007) and AlphaBound (Dabney and Barto, 2012) have a single step-size which only decreases in value. RMSprop (Tieleman and Hinton, 2012) satisfies our criteria and can be trivially generalized to TD; however, it does not perform well in TD learning, as we demonstrate in this paper.

The work which has come closest to satisfying our criteria is the work by Dabney (2014) on SID and NOSID. These methods have increasing and decreasing step-sizes and work well within TD learning, but only adapt a single global step-size. These methods are based on IDBD, a method for supervised learning which adapts its step-size through stochastic meta-descent (Sutton, 1992). We adopt a similar approach in this paper. Like Dabney (2014), we extend IDBD to TD, but unlike in his work, we retain IDBD’s vectorized step-size adaptation. We name our method TIDBD (a shortened version of TD-IDBD, pronounced tid-bid). By generalizing IDBD to TD with vectorized step-sizes, TIDBD meets our criteria for a step-size adaptation method for TD learning.

## 2 Markov reward processes

Prediction problems are of importance to many real-world applications. Being able to anticipate the value of a signal from the environment can be seen as knowledge which is acquired through interaction (Sutton et al., 2011). For example, being able to anticipate the signal of a servomotor can inform a robot’s decisions and form a key component of its control systems (Edwards et al., 2016; Sherstan et al., 2015): predictions are a way for machines to anticipate events in their world and inform their decision-making processes.

Predictions may be thought of as estimating the value of a state in a Markov Reward Process (MRP). An MRP is described by the tuple $(\mathcal{S}, P, R, \gamma)$, where $\mathcal{S}$ is the set of all states, $P(s'|s)$ describes the probability of a transition from a state $s$ to a new state $s'$, $R(s, s')$ describes the reward observed on a transition from $s$ to $s'$, and $0 \le \gamma \le 1$ is a discount factor which determines how future reward is weighted. The goal in an MRP is to learn a value function $V(s)$ which estimates the expected return from a given state $s$, where the return is $G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$, or the discounted sum of all future rewards. Within the context of an MRP, a prediction is an estimation of the value of a state: the discounted future return of a signal. For example, the prediction signal could be the position of a robot’s gripper, which is later used as an input into a robot’s control system.
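As a small illustration of the return defined above, the following sketch computes the discounted return at every time-step of a finite recorded reward sequence by sweeping backwards through it. The function name and the convention that `rewards[t]` is the reward observed on the transition out of time-step `t` are assumptions made for this example, and truncating the infinite sum at the end of the recording is an approximation.

```python
def discounted_returns(rewards, gamma):
    """Return the discounted sum of future rewards for each time-step.

    Assumes rewards[t] is the reward observed on the transition out of
    step t, and that the sequence ends after the last reward.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For a signal such as a gripper position, the same routine would be applied with the recorded signal in place of `rewards`.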

## 3 TIDBD

We introduce Temporal-Difference Incremental Delta Bar Delta (TIDBD), an extension of IDBD to TD learning. First, we discuss TIDBD’s algorithmic implementation (Algorithm 1); a derivation as stochastic meta-descent is in the following section. As with conventional TD learning methods, TIDBD’s task is to learn an estimate $V(s|w)$ of the true value $v(s)$ of a state $s$, where $w$ is the weight vector. TIDBD estimates the value of a state using linear function approximation, $V(s|w) = w^\top \phi(s)$: the linear combination of a vector of learned weights $w$ and a feature vector $\phi(s)$ approximating the state $s$. The weights are updated by moving them in the direction of the TD error $\delta$ by the step-size $\alpha$, where $\alpha > 0$. Note that unlike other TD methods, TIDBD’s $\alpha$ is specified for a specific time-step $t$, as its value changes over time.

Eligibility traces, denoted $z$ on Line 7, allow the currently observed reward to be attributed to previous experiences; eligibility traces can be thought of as a decaying history of visited states. We make the distinction between two types of traces: accumulating traces (shown in Alg. 1), which simply continue adding to $z$ for each visit; and replacing traces, which will replace the value for each visit to a state. The former can cause instability, as features can have a weight greater than 1 given to their updates as a result of multiple visits to the same state. The TD error $\delta$ on Line 3 is the difference between the predicted return for the current state, $w^\top \phi(s)$, and the estimate of future return, $R + \gamma w^\top \phi(s')$. For a more detailed explanation of TD learning and eligibility traces, please consult the description given by Sutton and Barto (1998).
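The difference between the two kinds of trace can be sketched as follows. This is a minimal illustration that assumes binary features; the function names and the argument names `gamma` (discount) and `lam` (trace decay) are chosen for this example only.

```python
import numpy as np


def accumulating_trace(z, phi, gamma, lam):
    # Decay the old trace, then add the current features on top of it.
    return gamma * lam * z + phi


def replacing_trace(z, phi, gamma, lam):
    # Decay the old trace, but overwrite (rather than add to) the entries
    # of features that are active on this step. Assumes binary features.
    return np.where(phi > 0, phi, gamma * lam * z)
```

Visiting the same feature on two consecutive steps shows the effect described above: the accumulating trace for that feature exceeds 1, while the replacing trace stays at 1.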

TIDBD adds to ordinary TD by updating the values of the step-size $\alpha$ on a per-time-step basis; we define our step-size vector $\alpha$ using an exponentiation of the vector $\beta$ (Line 6). By exponentiating the step-size parameters, we ensure that the step-size is always positive and that the step-size moves by geometric steps. No fixed meta step-size would be appropriate for all features; moving each feature’s step-size by a percentage of its current value is beneficial.

The vector $\beta$ is the set of parameters we adapt in order to change the step-size vector $\alpha$. On Line 5, a meta step-size parameter $\theta$ is used to move $\beta$ at every time-step; $\beta$ is updated in the direction of the most recent change to the weight vector, $\delta\phi$, as scaled by the elements of $h$. The vector $h$ is a decaying trace of the current updates to the weight vector via $\alpha\delta z$ and previous updates (Line 9). Here $[x]^+$ is $x$ for $x > 0$ and $0$ for all other values of $x$.

This has the effect of changing $\beta$ based on what can be considered the correlation between current updates to our weight vector and previous updates. If the current weight update $\delta\phi$ is strongly correlated with previous weight updates $h$, then we are making many updates in the same direction and it would have been a more efficient use of data to have made a larger update instead. If the updates are negatively correlated, then we have overshot the target value, and should have used smaller step-sizes and smaller updates.
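The per-time-step update described in this section might be sketched as below. This is an illustrative reading of the prose with accumulating traces, not a transcription of Algorithm 1: the class name, argument names, and the choice to initialize the weights to zero are assumptions. Here `beta` holds the log step-sizes, `theta` is the meta step-size, `h` is the memory of recent weight updates, and `z` is the eligibility trace.

```python
import numpy as np


class TIDBD:
    """Sketch of TD(lambda) with per-feature step-sizes adapted by meta-descent."""

    def __init__(self, n_features, init_alpha, theta, gamma, lam):
        self.w = np.zeros(n_features)                       # learned weights
        self.beta = np.full(n_features, np.log(init_alpha)) # log step-sizes
        self.h = np.zeros(n_features)                       # trace of weight updates
        self.z = np.zeros(n_features)                       # eligibility trace
        self.theta, self.gamma, self.lam = theta, gamma, lam

    def update(self, phi, reward, phi_next):
        # TD error: bootstrapped target minus current prediction.
        delta = reward + self.gamma * (self.w @ phi_next) - (self.w @ phi)
        # Meta-descent step on the log step-sizes.
        self.beta += self.theta * delta * phi * self.h
        alpha = np.exp(self.beta)  # exponentiation keeps step-sizes positive
        # Accumulating eligibility trace.
        self.z = self.gamma * self.lam * self.z + phi
        # Ordinary TD weight update with a vector step-size.
        self.w += alpha * delta * self.z
        # Decay h (bracket clipped at zero) and add the latest update.
        decay = np.maximum(0.0, 1.0 - alpha * phi * self.z)
        self.h = self.h * decay + alpha * delta * self.z
        return delta
```

With `theta = 0` the log step-sizes never move and the sketch reduces to ordinary TD with a fixed step-size. Replacing traces would change only the line that updates `z`.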

## 4 TIDBD derivation

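We now derive TIDBD as stochastic meta-descent. We start the derivation of TIDBD by describing the update rule for $\beta$, the weights with which we define our step-size.

$$\beta_i(t+1) = \beta_i(t) - \frac{1}{2}\theta \frac{\partial \delta^2(t)}{\partial \beta_i} \tag{1}$$

TIDBD learns its step-size parameters $\beta$ by descending the meta-gradient $\frac{\partial \delta^2(t)}{\partial \beta_i}$. Here, our meta step-size is $\theta$. In (2), we simplify the update by approximating $\frac{\partial w_j(t)}{\partial \beta_i}$ as $0$ where $i \neq j$. We do this because the effect of changing the step-size for a particular weight will predominantly be on the weight itself; effects on other weights will be nominal.

$$\beta_i(t+1) = \beta_i(t) - \frac{1}{2}\theta \sum_j \frac{\partial \delta^2(t)}{\partial w_j(t)} \frac{\partial w_j(t)}{\partial \beta_i} \approx \beta_i(t) - \frac{1}{2}\theta \frac{\partial \delta^2(t)}{\partial w_i(t)} \frac{\partial w_i(t)}{\partial \beta_i} \tag{2}$$

The use of the TD error in the gradient introduces some subtleties. The estimate of the TD error depends on the predicted value of the future state $V(\phi(t+1))$, resulting in a biased gradient. Since the error’s target is dependent on the current values of the learned weight vector $w$, it will not produce a true-gradient descent method (Barnard, 1993).

We use the semi-gradient, taking into account the impact of changing the weight vector on the estimate $V(\phi(t))$, but not on the target $R(t) + \gamma V(\phi(t+1))$. While less robust than other forms of gradient descent, semi-gradient methods converge reliably and more quickly than true-gradient methods (Sutton and Barto, 1998).

$$-\frac{1}{2}\frac{\partial \delta^2(t)}{\partial w_i(t)} = -\delta(t)\frac{\partial \delta(t)}{\partial w_i(t)} = -\delta(t)\frac{\partial \left[R(t) + \gamma V(\phi(t+1)) - V(\phi(t))\right]}{\partial w_i(t)} \approx -\delta(t)\frac{\partial \left[-V(\phi(t))\right]}{\partial w_i(t)} = \delta(t)\phi_i(t) \tag{3}$$

We then complete the simplification of $\beta$’s update by defining an additional memory vector $h$ as $h_i(t) = \frac{\partial w_i(t)}{\partial \beta_i}$, so that $\beta_i(t+1) \approx \beta_i(t) + \theta\,\delta(t)\phi_i(t)h_i(t)$. We then derive the update for $h$.

$$h_i(t+1) = \frac{\partial w_i(t+1)}{\partial \beta_i} = \frac{\partial \left[w_i(t) + e^{\beta_i(t+1)}\delta(t)z_i(t)\right]}{\partial \beta_i} = h_i(t) + e^{\beta_i(t+1)}\delta(t)z_i(t) + e^{\beta_i(t+1)}\frac{\partial \delta(t)}{\partial \beta_i}z_i(t) + e^{\beta_i(t+1)}\frac{\partial z_i(t)}{\partial \beta_i}\delta(t) \tag{4}$$

This simplification leaves us with $\frac{\partial \delta(t)}{\partial \beta_i}$, derived in (5), and $\frac{\partial z_i(t)}{\partial \beta_i}$, derived in (6). We use the semi-gradient together with the same approximation as in (2) to simplify:

$$\frac{\partial \delta(t)}{\partial \beta_i} \approx -\frac{\partial}{\partial \beta_i}\sum_j w_j(t)\phi_j(t) \approx -\frac{\partial w_i(t)}{\partial \beta_i}\phi_i(t) = -h_i(t)\phi_i(t) \tag{5}$$

We finally simplify the following:

$$\frac{\partial z_i(t)}{\partial \beta_i} = \frac{\partial \left[\gamma\lambda z_i(t-1) + \phi_i(t)\right]}{\partial \beta_i} = \gamma\lambda\frac{\partial z_i(t-1)}{\partial \beta_i} \tag{6}$$

We see that (6) results in a decaying trace of the initialized value of the eligibility traces. Since eligibility traces are initialized to 0, this value will always be 0. Substituting (5) and (6) into (4), writing $\alpha_i(t+1) = e^{\beta_i(t+1)}$, and clipping the bracketed term at zero gives the update for $h$:

$$h_i(t+1) \approx h_i(t)\left[1 - \alpha_i(t+1)\phi_i(t)z_i(t)\right]^+ + \alpha_i(t+1)\delta(t)z_i(t) \tag{7}$$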

## 5 Does TIDBD outperform ordinary TD?

First, we assess the ability of TIDBD to improve upon traditional TD prediction in a simple tabular setting. In a tabular setting the advantages of vectorizing step-sizes and performing representation learning are abstracted away. By using a stationary tabular problem, we assess whether adapting step-sizes with TIDBD is an improvement over ordinary TD in general.

To make a suitable prediction task, we created an MRP from a grid-world problem originally described in Sutton and Barto (1998). As depicted in Figure 1(a), each tile in the 5 × 5 grid-world represents a state. The state transitions were the four cardinal directions (north, south, east, and west), which were taken by an equiprobable random policy. A transition which leaves the grid results in the agent staying in the same state and a reward of -1. Regardless of the transition taken in A or B, the learner transitions to state A’ or B’, respectively, with probability 1. A transition from A to A’ yields a reward of 10 and a transition from B to B’ yields a reward of 5. All other transitions receive a reward of 0. The start state was the top left-hand corner. A trial consisted of each prediction method learning a value function while the equiprobable random policy moved the agent around the grid world for 15,000 time-steps.
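The grid-world MRP described above is small enough that its true state values can be obtained by solving the Bellman equation directly, which gives a reference against which a learned estimate can be scored. The sketch below is one way to do this. The positions of A, A’, B, and B’ are assumptions taken from the layout of Sutton and Barto's original example, the discount `gamma` is left as an argument, and the unweighted mean over states in `msve` is also an assumption.

```python
import numpy as np

SIZE = 5
# Assumed positions (row, col), following Sutton and Barto's example layout.
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west


def index(cell):
    row, col = cell
    return row * SIZE + col


def build_mrp():
    """Transition matrix and expected reward under the equiprobable policy."""
    n = SIZE * SIZE
    P = np.zeros((n, n))
    r = np.zeros(n)
    for row in range(SIZE):
        for col in range(SIZE):
            s = index((row, col))
            for d_row, d_col in MOVES:
                if (row, col) == A:
                    nxt, reward = A_PRIME, 10.0
                elif (row, col) == B:
                    nxt, reward = B_PRIME, 5.0
                else:
                    nxt, reward = (row + d_row, col + d_col), 0.0
                    off_grid = not (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE)
                    if off_grid:
                        nxt, reward = (row, col), -1.0
                P[s, index(nxt)] += 0.25
                r[s] += 0.25 * reward
    return P, r


def true_values(gamma):
    """Solve v = r + gamma * P v for the exact state values."""
    P, r = build_mrp()
    return np.linalg.solve(np.eye(SIZE * SIZE) - gamma * P, r)


def msve(estimate, target):
    """Mean squared value error, averaged uniformly over states."""
    return float(np.mean((estimate - target) ** 2))
```

In a tabular setting each state has its own one-hot feature, so the learned weight vector can be compared with `true_values(gamma)` element by element at every time-step.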

We compared TIDBD and TD on initial step-sizes $\alpha$ distributed between 0.0005 and 0.5. For both methods and . We swept over 21 different meta step-sizes $\theta$ equally distributed within the range of . Note that when $\theta = 0$, TIDBD and TD are equivalent. The mean squared value error (MSVE) was calculated at each time-step using the difference between the actual value of all states and the current learned estimate.

Figure 1(b) shows the performance for all values of $\theta$ at all initial $\alpha$ values, indicating the sensitivity of TIDBD to these choices in this domain. Figure 2 depicts the MSVE of the TD and the TIDBD predictors averaged across 30 runs for sample initial $\alpha$ values across the sweep. Although TIDBD shows only a modest improvement over TD for the best initial step-size (0.0025), it still has lower variance. For every tested initial $\alpha$, there was a corresponding $\theta$ for which TIDBD improved upon ordinary TD.

In general, tuned TIDBD outperforms ordinary TD by adapting step-sizes online, attaining lower variance and error on a simple tabular prediction task.

## 6 Can TIDBD perform representation learning?

In our second experiment we explore the benefits of using a vectorized step-size. In particular, we evaluate TIDBD’s ability to give large step-sizes to relevant features while giving irrelevant features small step-sizes.

In a fashion similar to the grid-world experiment, we used the classic mountain-car problem to construct a MRP prediction problem. The goal of mountain-car is to reach the top of a steep hill in the shortest time possible. We trained a SARSA(0) agent with to reach the top of the hill in 160 time-steps and used this policy to define the transitions in our MRP. The TIDBD learner evaluates the value of a state based on the rewards received as the SARSA agent drives the car.

For the TIDBD learner and . Its state space consists of the real-valued position and velocity of the car, and 10 real-valued random numbers. These random values were added to act as irrelevant features, with which we assess TIDBD’s ability to perform representation learning. The resulting state is tile-coded with 10 tilings of size 10 × 10. A bias feature was concatenated to the feature vector, resulting in a size of 1001.

We expect the random features to receive small step-sizes, while the step-sizes of the relevant features grow.

Figure 3 depicts the magnitude of the step-sizes for two different features over time: one is relevant to the task, while the other is random noise. TIDBD clearly increases the value of the relevant step-size over time, while the irrelevant feature’s step-size remains close to zero. These two features are representative of the behaviour for all the other features. As anticipated, TIDBD identifies the relevant features necessary for making an accurate prediction, and assigns them larger step-sizes.

TIDBD is able to perform basic representation learning by assigning large step-sizes to relevant features, and smaller step-sizes to irrelevant features.

## 7 How robust is TIDBD?

In our final experiment, we determine how well TIDBD performs relative to both ordinary TD and other adaptive step-size methods. We examine the performance of TIDBD in comparison to ordinary TD, AlphaBound, and TD with RMSprop on a real-world robot prediction task. We assess the sensitivity of TIDBD to its settings in comparison to other methods, and whether it can improve predictions such as those used in robot control systems.

We replicated the experiment used by van Seijen et al. (2016) with the dataset from Edwards et al. (2016): the learning system predicts the position of a robot gripper as a user performs a dexterity training task while in control of the robot. Such predictions can be used to build more complex control systems which adapt to the user controlling the robot (Edwards et al., 2016; Sherstan et al., 2015), making it possible to continuously improve the control of these systems (Sherstan et al., 2015). A great challenge for forming these predictions is finding and setting appropriate parameters for a given prediction method. End-user time is precious, and designers cannot possibly test their prediction algorithms on datasets which are representative of all the situations the robot might encounter.

We performed a sweep over parameter settings in order to compare the TD methods for a selection of $\lambda$ values between 0 and 1. For all methods . Ordinary TD methods were swept over $\alpha$ settings between 0 and 2. AlphaBound was initialized with an initial step-size of 1, as specified in Dabney and Barto (2012). The initial step-size for both TD RMSprop and TIDBD was set to , where 9 is the number of active features. We are assessing whether or not TIDBD is easier to tune than other methods, so we choose an intuitively good initial step-size, but not necessarily the best possible value. To make RMSprop suitable for TD methods, we use the semi-gradient in its calculated weighted average. For TD RMSprop, , and decay rates were varied between 0 and 1. For TIDBD, $\theta$ values were swept between 0 and 0.02.
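One plausible way to use the semi-gradient inside RMSprop's running average is sketched below. This is an assumption about how such a baseline could be built, not the exact variant used in the experiment: the function name, the use of the trace-weighted TD error `delta * z` as the update direction, and the small constant `eps` are all choices made for this example.

```python
import numpy as np


def td_rmsprop_step(w, z, v, phi, reward, phi_next,
                    alpha, gamma, lam, decay, eps=1e-8):
    """One TD(lambda) step with an RMSprop-scaled semi-gradient update."""
    # TD error from the bootstrapped target and the current prediction.
    delta = reward + gamma * (w @ phi_next) - (w @ phi)
    # Accumulating eligibility trace.
    z = gamma * lam * z + phi
    # Semi-gradient update direction (the target is treated as fixed).
    g = delta * z
    # Running average of the squared update magnitude, per weight.
    v = decay * v + (1.0 - decay) * g ** 2
    # Scale each weight's update by the root of its running average.
    w = w + alpha * g / (np.sqrt(v) + eps)
    return w, z, v, delta
```

Unlike TIDBD, the scaling here depends only on the magnitude of recent updates, not on whether successive updates agree in sign.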

Our error is the difference between the true return of the hand position and the predicted position at each time-step. Algorithm performance is assessed at each combination of parameter settings using the cumulative absolute prediction error averaged over 24 independent trials of user data.

Each point in Figure 4 represents the best performing parameter settings for each $\lambda$ value. TIDBD with replacing traces outperforms all other methods at every $\lambda$, especially for low values of $\lambda$. In addition, for both variants of traces, TIDBD has low variance in comparison to other methods. Interestingly, TIDBD with accumulating traces suffers at higher $\lambda$ settings. The origin of this performance drop is in the interaction between our adaptive $\alpha$ and our eligibility traces. Accumulating traces will continue to assign credit to states if we visit them multiple times. If a state has been visited multiple times before its eligibility has decayed, the weight update will be larger because greater credit is assigned. Similarly, TIDBD uses a correlation of recent weight updates to determine the step-size of a feature. If a weight’s updates are highly correlated, its learning rate will adjust accordingly. Together, these interactions can cause instability for larger meta-step-sizes.

In Figure 5 we show a parameter study which demonstrates the parameter sensitivity of ordinary TD and TIDBD. Each sub-figure shows the error at each setting of $\lambda$ for all $\alpha$ or $\theta$ values in our given sweep. When observing the sensitivity of ordinary TD, we can see that values similar to the initial $\alpha$ we used for TIDBD perform as expected: they perform relatively well, but are not the best settings. Despite the use of a sub-optimal initial $\alpha$, TIDBD was able to outperform the ordinary TD methods for most $\lambda$ values.

In Figure 5 we see that the performance of TIDBD as a function of its meta-step-size $\theta$ is much less variable than that of the ordinary TD counterparts as a function of their step-size. The sensitivity of TIDBD with accumulating traces further elaborates on the performance problems at higher $\lambda$ values, depicting how divergence can occur with larger meta-step-sizes and larger $\lambda$ values. Of note is TIDBD’s performance with replacing traces. We can see that even when TIDBD starts from an $\alpha$ setting that is sub-optimal for the ordinary TD methods, any of the meta-step-sizes in our sweep brings the performance to the level of the best tuned TD methods, or even better. However, this comes at the risk of instability.

## 8 Conclusion, limitations, and future work

We presented an approach for generalizing Incremental Delta-Bar-Delta (IDBD) to temporal-difference learning and demonstrated that the effectiveness of IDBD carries over from supervised learning to TD. We derived TIDBD as stochastic meta-descent over the step-size parameter. We demonstrated that adapting step-sizes with TIDBD improves over ordinary TD methods with tuned static step-sizes, even on stationary problems. On non-stationary tasks, we showed that TIDBD is able to find appropriate step-sizes in general; TIDBD can discriminate between relevant and irrelevant features, giving appropriate step-sizes to each.

We examined TIDBD’s performance against both ordinary TD and other adaptive methods on data from a real-world robotic prediction problem. We found that for an ordinary, intuitive step-size which is commonly used on such tasks, the sensitivity to TIDBD’s meta-step-size was far less than the sensitivity to ordinary TD’s step-size. Using a sub-optimal step-size and a tuned meta-parameter, TIDBD outperformed ordinary TD and had less variance in average error. In addition, TIDBD outperformed the scalar step-size adaptation method AlphaBound for each parameter setting.

The greatest limitation of TIDBD is the need to tune the meta-parameter. While this is a limitation, there are signs that TIDBD is less sensitive to $\theta$ than TD is to $\alpha$. In addition, this sensitivity could possibly be mitigated by further extending TIDBD to include AutoStep’s (Mahmood et al., 2012) normalization. This would bring greater stability to TIDBD by attempting to eliminate divergence when using adaptive step-sizes. Additional stability can be sought by finding ways of better integrating traces into TIDBD to prevent over-compensation in applications where $\lambda$ is close to 1.0. This work focused solely on prediction tasks as a testbed to assess generalization: in the future, it would be desirable to assess TIDBD’s performance on control tasks.

In summary, we generalized IDBD to TD, creating TIDBD: an adaptive step-size method for TD learning which satisfies our requirement of being vectorized, and enabling step-sizes to increase and decrease in value. TIDBD shows promise for adapting step-sizes and performing representation learning for TD learning, enabling better solutions for life-long continual learning problems.

## References

- Barnard (1993) Barnard, E. (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 23(2):357–365.
- Dabney (2014) Dabney, W. (2014). Adaptive step-sizes for reinforcement learning. PhD thesis, University of Massachusetts Amherst.
- Dabney and Barto (2012) Dabney, W. and Barto, A. G. (2012). Adaptive step-size for online temporal difference learning. In Association for the Advancement of Artificial Intelligence (AAAI).
- Edwards et al. (2016) Edwards, A. L., Dawson, M. R., Hebert, J. S., Sherstan, C., Sutton, R. S., Chan, K. M., and Pilarski, P. M. (2016). Application of real-time machine learning to myoelectric prosthesis control: A case series in adaptive switching. Prosthetics and orthotics international, 40(5):573–581.
- Hutter et al. (2007) Hutter, M., Legg, S., et al. (2007). Temporal difference updating without a learning rate. In Conference on Neural Information Processing Systems (NIPS), pages 705–712.
- Mahmood et al. (2012) Mahmood, A. R., Sutton, R. S., Degris, T., and Pilarski, P. M. (2012). Tuning-free step-size adaptation. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2121–2124. IEEE.
- Schraudolph (1999) Schraudolph, N. N. (1999). Local gain adaptation in stochastic gradient descent. In Artificial Neural Networks, 1999. ICANN 99. Ninth International Conference on (Conf. Publ. No. 470), volume 2, pages 569–574. IET.
- Sherstan et al. (2015) Sherstan, C., Modayil, J., and Pilarski, P. M. (2015). A collaborative approach to the simultaneous multi-joint control of a prosthetic arm. In 2015 IEEE International Conference on Rehabilitation Robotics (ICORR), pages 13–18. IEEE.
- Sutton (1992) Sutton, R. S. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Association for the Advancement of Artificial Intelligence (AAAI), pages 171–176.
- Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press, Cambridge, 1st edition.
- Sutton et al. (2011) Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 761–768. International Foundation for Autonomous Agents and Multiagent Systems.
- Tieleman and Hinton (2012) Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2).
- van Seijen et al. (2016) van Seijen, H., Mahmood, A. R., Pilarski, P. M., Machado, M. C., and Sutton, R. S. (2016). True online temporal-difference learning. Journal of Machine Learning Research (JMLR), 17(145):1–40.