# Gamma-Nets: Generalizing Value Estimation Over Timescale

## Abstract

Temporal abstraction is a key requirement for agents making decisions over long time horizons, a fundamental challenge in reinforcement learning. There are many reasons why value estimates at multiple timescales might be useful; recent work has shown that value estimates at different timescales can be the basis for creating more advanced discounting functions and for driving representation learning. Further, predictions at many different timescales serve to broaden an agent’s model of its environment. One predictive approach of interest within an online learning setting is general value functions (GVFs), which represent models of an agent’s world as a collection of predictive questions, each defined by a policy, a signal to be predicted, and a prediction timescale. In this paper we present $\Gamma$-nets, a method for generalizing value function estimation over timescale, allowing a given GVF to be trained and queried for arbitrary timescales so as to greatly increase the predictive ability and scalability of a GVF-based model. The key to our approach is to use timescale as one of the value estimator’s inputs. As a result, the prediction target for any timescale is available at every timestep and we are free to train on any number of timescales. We first provide two demonstrations by 1) predicting a square wave and 2) predicting sensorimotor signals on a robot arm using a linear function approximator. Next, we empirically evaluate $\Gamma$-nets in the deep reinforcement learning setting using policy evaluation on a set of Atari video games. Our results show that $\Gamma$-nets can be effective for predicting arbitrary timescales, with only a small cost in accuracy as compared to learning estimators for fixed timescales. $\Gamma$-nets provide a method for accurately and compactly making predictions at many timescales without requiring a priori knowledge of the task, making them a valuable contribution to ongoing work on model-based planning, representation learning, and lifelong learning algorithms.

## 1 Value Functions and Timescale

Reinforcement learning (RL) studies algorithms in which an agent learns to maximize the amount of reward it receives over its lifetime. A key method in RL is the estimation of value—the expected cumulative sum of discounted future rewards (called the return). In loose terms this tells an agent how good it is to be in a particular state. The agent can then use value estimates to learn a policy—a way of behaving—which maximizes the amount of reward received.

Sutton et al. (2011) broadened the use of value estimation by introducing general value functions (GVFs), in which value estimates are made of other sensorimotor signals, not just reward. GVFs can be thought of as representing an agent’s model of itself and its environment as a collection of questions about future sensorimotor returns; a predictive representation of state (Dayan, 1993). A GVF is defined by three elements: 1) the policy, 2) the cumulant (the sensorimotor signal to be predicted), and 3) the prediction timescale, $\gamma$. Considering a simple mobile robot, examples of GVF questions include “How much current will my motors consume over the next 3 seconds if I spin clockwise?” or “How long until my bump sensor goes high if I drive forward?”

Modeling the world at many timescales is seen as a key problem in artificial intelligence (Sutton, 1995; Sutton et al., 1999). Further, there is evidence that humans and other animals make estimates of reward and other signals at numerous timescales (Tanaka et al., 2016). This paper focuses on generalizing value estimation over timescale. Our work can be seen as directly connected to the concept of nexting, in which animals and people make large numbers of predictions of sensory input at many, short-term, timescales (Gilbert, 2006). Modayil et al. (2014) demonstrated the concept of nexting using GVFs on a mobile robot. Until now, value estimation has generally been limited to a single fixed timescale. That is, for each desired timescale, a discrete and unique predictor was learned. However, there are situations where we may desire to have value estimates of the same cumulant over many different timescales. For example, consider an agent driving a car. Such an agent may make numerous predictions about the likelihood of colliding with various objects in its vicinity. The agent needs to consider the risk of collisions in both the near term and far term and the relevance of each may change with the speed of the car. If the engineer knew which timescales would be needed ahead of time they could design them into the system, but this is not the case for complex settings.

Here we present a novel class of algorithms which enables the explicit learning and inference of value estimates for any valid fixed discount. The key insights behind our approach are: 1) the timescale can be treated as an input parameter for inference and learning and 2) the estimated bootstrapped prediction target for any fixed timescale is available at every timestep. We demonstrate $\Gamma$-nets in three policy evaluation settings: 1) predicting a square wave, 2) predicting sensorimotor signals on a robot arm, and 3) predicting reward in Atari video games.

The ideas behind our approach are based on work by Schaul et al. (2015), which generalized value estimation across goals by providing a goal embedding vector as input to the value network. In contrast, our approach provides the discount, $\gamma$, as input. Xu et al. (2018) also provide $\gamma$ as input to their value and policy networks. They present a meta-learning approach which learns the best $\gamma$ to provide to an inner policy. Here we focus on determining what is necessary to effectively train a value network to generalize over timescale. Additionally, our algorithm trains on multiple timescales simultaneously.

## 2 Background

We model the environment as a Markov Decision Process. At each timestep $t$ the agent, in state $S_t$, takes action $A_t$ according to policy $\pi$ and transitions to state $S_{t+1}$ according to the transition probability $p(S_{t+1} \mid S_t, A_t)$. In the traditional RL setting the agent receives a reward $R_{t+1}$. The agent tries to learn a policy which maximizes the cumulative reward it receives in the future, which is defined as the return: $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$. In the case of GVFs we simply substitute our signal of interest, the cumulant $C$, for the reward $R$. The term $\gamma$ is referred to by several names including the timescale, the continuation function and the discount; it represents the amount of emphasis applied to future rewards and is the focus of this paper.

A value estimate is simply the expectation of the return: $v(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$. Temporal difference (TD) learning is a common class of algorithms used in RL for learning an approximation of value (Sutton and Barto, 1998). Estimation weights are typically trained by semi-gradient descent using the TD error: $\delta_t = R_{t+1} + \gamma v(S_{t+1}) - v(S_t)$.

While simple domains can be represented using tabular lookup, complex settings in which the state space is very large or infinite must use function approximation (FA) methods to estimate the value as $v(s; \mathbf{w})$, where $\mathbf{w}$ is a set of weights parameterizing the network. Function approximation has the advantage that states are not treated independently; rather, a learning step updates related states as well, allowing for generalization across the state space.
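As a concrete illustration of the TD error and semi-gradient update described above, the following is a minimal sketch for a linear value estimate. The function name `td0_update` and the single-state toy problem are assumptions made for illustration; they are not taken from the experiments in this paper.

```python
def td0_update(w, x, x_next, cumulant, gamma, step_size):
    """One semi-gradient TD(0) step for a linear estimate v(s) = w . x(s)."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = cumulant + gamma * v_next - v  # TD error
    # Semi-gradient: only v(S_t) is differentiated, so the gradient is x.
    new_w = [wi + step_size * delta * xi for wi, xi in zip(w, x)]
    return new_w, delta


# Toy problem (assumed): one state that loops to itself with cumulant 1.
# The true return is 1 / (1 - gamma) = 10 for gamma = 0.9.
w = [0.0]
for _ in range(2000):
    w, delta = td0_update(w, [1.0], [1.0], cumulant=1.0, gamma=0.9, step_size=0.1)
```

In this toy problem the single weight converges to the discounted return of a constant signal, which is a convenient sanity check for any TD implementation.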

## 3 Generalizing over Timescale

Our goal is to be able to predict the value function for any discount factor $\gamma$. While the GVF specification allows for $\gamma$ that are a function of the transition, here we focus solely on the case of fixed timescale. To achieve that goal, we propose $\Gamma$-nets: an architecture for value functions that operates not only on the state, but also on the desired target discount factor (see Figure 1). On each transition the network is trained on many $\gamma$ values. Thus, the $\Gamma$-net learns to generalize over arbitrary $\gamma$ values.

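Generating the error function for a $\Gamma$-net is also straightforward. For any single $\gamma_j$, the TD error is:

$$\delta_j = C_{t+1} + \gamma_j\, v(S_{t+1}, \gamma_j; \mathbf{w}) - v(S_t, \gamma_j; \mathbf{w}) \tag{1}$$

The total gradient can then be summed over all sampled $\gamma_j$ and applied to update the network.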
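The per-timescale TD error and the summed gradient can be sketched as follows. This is an illustrative toy only: the helper names (`dot`, `features`, `gamma_net_update`), the choice of a plain linear estimator, and the normalizing constant `max_tau` are assumptions, and a linear function of the raw timescale inputs would not generalize well in practice (the experiments below tilecode the inputs or use a neural network).

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def features(state_features, gamma, max_tau=100.0):
    """Append both timescale inputs (gamma and normalized tau) to the state features."""
    tau = 1.0 / (1.0 - gamma)
    return list(state_features) + [gamma, tau / max_tau]


def gamma_net_update(w, s, s_next, cumulant, gammas, step_size):
    """Accumulate the semi-gradient TD update over every sampled timescale."""
    grad = [0.0] * len(w)
    deltas = []
    for gamma in gammas:
        x = features(s, gamma)
        x_next = features(s_next, gamma)
        # TD error for this timescale; the same transition is reused for each gamma.
        delta = cumulant + gamma * dot(w, x_next) - dot(w, x)
        deltas.append(delta)
        for i, xi in enumerate(x):
            grad[i] += delta * xi
    new_w = [wi + step_size * gi for wi, gi in zip(w, grad)]
    return new_w, deltas


w = [0.0, 0.0, 0.0]
w, deltas = gamma_net_update(w, [1.0], [1.0], cumulant=1.0,
                             gammas=[0.0, 0.5, 0.9], step_size=0.1)
```

The point of the sketch is that a single observed transition yields a bootstrapped target for every timescale, so the number of timescales trained per step is a free design choice.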

Choosing $\gamma$ must be done with care. A naive approach might uniformly sample $\gamma$. However, value functions change non-linearly with $\gamma$. To illustrate this property, consider that $\gamma$ can be viewed as the probability of continuation, allowing us to derive the expected number of timesteps (ts) until termination of the return, $\tau$, as (see Sherstan (2015) for a derivation):

$$\tau = \frac{1}{1 - \gamma} \tag{2}$$

Table 1 shows selected values of $\gamma$ and their corresponding values of $\tau$. The relationship between $\gamma$ and $\tau$ is non-linear for large values of $\gamma$ (Figure 2). Thus, naively drawing $\gamma$ from a uniform distribution would tend to favor very short timescales. Conversely, drawing uniformly from $\tau$ would put little emphasis on short timescales. While the best method for selecting $\gamma$ for training is outside the scope of this paper, we provide some comparisons in our experiments. Note that throughout this paper we use the word timescale broadly, referring to either $\gamma$ or $\tau$ as appropriate. These parameters can be converted into one another using Eq. (2).

| $\gamma$ | $\tau$ (ts) |
|---|---|
| 0 | 1 |
| 0.5 | 2 |
| 0.8 | 5 |
| 0.9 | 10 |
| 0.95 | 20 |
| 0.975 | 40 |
| 0.983 | 60 |
| 0.9875 | 80 |
| 0.99 | 100 |
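The conversion between the two timescale parameterizations, $\tau = 1/(1-\gamma)$, can be checked with a few lines of code. The function names below are chosen for illustration only.

```python
def gamma_to_tau(gamma):
    """Expected number of timesteps until termination for a fixed discount gamma < 1."""
    return 1.0 / (1.0 - gamma)


def tau_to_gamma(tau):
    """Inverse conversion; tau must be at least 1 timestep."""
    return 1.0 - 1.0 / tau


# Print gamma/tau pairs for a handful of discounts.
for gamma in [0.0, 0.5, 0.8, 0.9, 0.95, 0.975, 0.9875, 0.99]:
    print(gamma, round(gamma_to_tau(gamma), 2))
```

Note how the step from $\gamma = 0.9$ to $\gamma = 0.99$ covers a change of 90 timesteps in $\tau$, while the step from $0$ to $0.9$ covers only 9.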

The representation of timescale used for input to the network may affect the network’s ability to represent different timescales. The $\gamma$ scale compresses long timescales but spreads short ones, and in the $\tau$ scale we have the opposite effect. Thus, providing both $\gamma$ and $\tau$ as input may allow for good discrimination at all timescales.
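One way to populate a set of training timescales from both scales is sketched below. The function name `sample_timescales`, the bounds of 1 and 100 timesteps, and the counts of three draws per scale are assumptions used for illustration; the exact settings for each experiment are described in the experimental sections.

```python
import random


def sample_timescales(rng, tau_min, tau_max, n_gamma, n_tau):
    """Return discounts drawn uniformly on the gamma scale and on the tau scale.

    The lower and upper bounds are always included, so the result has
    2 + n_gamma + n_tau elements.
    """
    gamma_min = 1.0 - 1.0 / tau_min
    gamma_max = 1.0 - 1.0 / tau_max
    gammas = [gamma_min, gamma_max]
    # Uniform on the gamma scale: emphasizes short timescales.
    gammas += [rng.uniform(gamma_min, gamma_max) for _ in range(n_gamma)]
    # Uniform on the tau scale, then converted to gamma: emphasizes long timescales.
    taus = [rng.uniform(tau_min, tau_max) for _ in range(n_tau)]
    gammas += [1.0 - 1.0 / tau for tau in taus]
    return gammas


rng = random.Random(0)
gammas = sample_timescales(rng, tau_min=1.0, tau_max=100.0, n_gamma=3, n_tau=3)
```

Drawing from both scales is a simple hedge when it is not known in advance which timescales matter most for the task.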

Finally, the magnitude of returns at different timescales can be very different. Larger returns can produce larger errors and correspondingly larger gradients, which can effectively dominate the network weights. In general it is the longer timescales which will produce larger magnitude returns, but returns can be constructed for which the opposite is true. To prevent large magnitude returns from dominating the network weights we need to scale the returns in some way. We want a general solution, as we may not know beforehand which timescales are most important, and thus seek a way to balance accuracy for all timescales. A general approach is given by van Hasselt et al. (2016), in which they continually normalize the target to have a mean of 0 and variance of 1. This allows them to handle rewards of varying magnitude. Here, we take a simpler approach focusing on keeping the magnitude of the returns, as a function of timescale, in the same ballpark, by learning the value of a scaled cumulant: $\tilde{C} = (1-\gamma)C$. This will scale the loss by timescale and should result in smaller network weights. However, the resulting prediction must then be rescaled by dividing by $(1-\gamma)$. If we instead redefine our value estimator as

$$\tilde{v}(s, \gamma; \mathbf{w}) = (1-\gamma)\, v(s, \gamma; \mathbf{w}) \tag{3}$$

then we can simply scale the TD loss by $(1-\gamma)$. In the following we show this derivation for n-step returns. The TD error for the n-step scaled cumulant is:

$$\tilde{\delta}_t = \sum_{i=0}^{n-1} \gamma^i (1-\gamma)\, C_{t+i+1} + \gamma^n\, \tilde{v}(S_{t+n}, \gamma; \mathbf{w}) - \tilde{v}(S_t, \gamma; \mathbf{w})$$

But, if we substitute in from Eq. 3 we have:

$$\tilde{\delta}_t = (1-\gamma)\left[\sum_{i=0}^{n-1} \gamma^i C_{t+i+1} + \gamma^n\, v(S_{t+n}, \gamma; \mathbf{w}) - v(S_t, \gamma; \mathbf{w})\right] = (1-\gamma)\, \delta_t$$

This results in scaled losses and gradients. This scaling can then be applied at either the loss or gradient level.
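The equivalence between learning a scaled cumulant and scaling the TD error can be checked numerically. The sketch below assumes the scaling factor is $(1-\gamma)$, that is $1/\tau$, and uses made-up cumulants and value estimates; the function name `n_step_td_error` is illustrative.

```python
def n_step_td_error(cumulants, v_start, v_end, gamma):
    """n-step TD error: discounted cumulants plus bootstrapped value minus current value.

    cumulants[i] is the cumulant observed i + 1 steps after the start state.
    """
    n = len(cumulants)
    partial_return = sum(gamma ** i * c for i, c in enumerate(cumulants))
    return partial_return + gamma ** n * v_end - v_start


gamma = 0.9
scale = 1.0 - gamma
cumulants = [1.0, 0.0, 2.0]   # made-up 3-step cumulant sequence
v_start, v_end = 4.0, 5.0     # made-up unscaled value estimates

delta = n_step_td_error(cumulants, v_start, v_end, gamma)
# Same error computed with the scaled cumulant and the scaled value estimates.
scaled_delta = n_step_td_error([scale * c for c in cumulants],
                               scale * v_start, scale * v_end, gamma)
```

Because the two quantities agree up to floating point error, the network can keep predicting unscaled values while the loss or gradient is multiplied by the scaling factor.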

## 4 Experiments

We first provide two proof-of-concept demonstrations using linear function approximation. The first is on a square wave signal, which is easily understood. The second is on a robot arm. Next we empirically evaluate $\Gamma$-nets in a deep learning setting by looking at performance on Atari games.

### 4.1 Square-wave

Our target signal was a repeating square wave 100 timesteps in length (Figure 3). Inputs were normalized and then tilecoded (Sutton and Barto, 1998) with 20 tilings of width 1.0, 20 tilings of width 0.5 and 30 tilings of width 0.1. Tiling positions were randomly shifted by small amounts at the time of initialization for each run. Value estimates were computed using linear function approximation on the output of the tilecoding and the final layer of weights was updated using TD(0) (Sutton and Barto, 1998). The algorithm used for this experiment is given in Algorithm 1. We also evaluated the impact of loss scaling. Unless otherwise stated: 1) timescale inputs were given on both the $\gamma$ and $\tau$ scales simultaneously, 2) the set of training timescales was 6 elements long, with the lower and upper bounds always included and two additional timescales drawn uniformly from each of the $\gamma$ and $\tau$ scales, 3) loss scaling was used. Results are shown in Figure 4. Each training run lasted for 50k timesteps and for each series 100 different runs were made. We show the normalized errors as a function of the prediction timescale, given on the $\tau$ scale. Results are averaged over the last 5k timesteps. For each $\tau$ we normalize by the maximum mean error across the series in the plot.

### 4.2 Predictions on a Robot Arm

In this experiment a human operated the shoulder rotation and elbow flexion joints of a robot arm by joystick. The task was to maintain contact between a rod held by the robot and the inside of a wire maze while moving in a counter-clockwise direction (Figure 4(a)). Fifty circuits of the maze were completed in approximately 12 minutes. Network inputs were the normalized positions of the shoulder and elbow servos as well as both $\gamma$ and normalized $\tau$. Inputs were tilecoded (Sutton and Barto, 1998) with 100 tilings of width 1.0 into a space of 2048 bits, and a bias unit was added, giving a feature vector of 2049 bits. Value estimates were computed by linear function approximation (LFA) and trained by TD(0). On each timestep a new set of training timescales was generated. The upper and lower bounds were included in the set, and one $\gamma$ and 29 $\tau$ were sampled uniformly from their respective scales, for a total of 32 timescales. More emphasis was placed on sampling from the $\tau$ scale because of the relatively high update rate (30 Hz, 1 ts $\approx$ 0.03 s), which makes the longer timescales the ones most likely to be important. Loss prescaling was used. We used a step-size of 0.1 divided by the number of active features. The step-size was linearly decayed to zero over the course of the training set. A baseline predictor with a fixed timescale was also trained using the same parameters as the $\Gamma$-net, except for the inclusion of the timescale input.

Figure 5 shows the $\Gamma$-net predicting shoulder joint speed at several timescales. Table 2 shows the cumulative sum of absolute error between the predictor and the return over the whole dataset after training. With this configuration the $\Gamma$-net outperformed the baseline for most of the timescales tested. For $\gamma = 0.99$ the baseline performed slightly better.

| $\gamma$ | $\Gamma$-net | Baseline |
|---|---|---|
| 0.9 | 1025 | 1124 |
| 0.9666 | 602 | 822 |
| 0.98333 | 379 | 440 |
| 0.99 | 273 | 253 |
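An error measure of this kind, the cumulative absolute difference between predictions and the observed discounted return, can be sketched as follows. The helper names are illustrative, and truncating the return at the end of the recorded data is an assumption about how a finite dataset would be handled.

```python
def discounted_returns(cumulants, gamma):
    """Return G_t for every timestep of a recorded signal, via a backward pass.

    cumulants[t] is the cumulant received after timestep t. The return is
    truncated at the end of the recording (an assumption of this sketch).
    """
    returns = [0.0] * len(cumulants)
    running = 0.0
    for t in reversed(range(len(cumulants))):
        running = cumulants[t] + gamma * running
        returns[t] = running
    return returns


def cumulative_abs_error(predictions, returns):
    """Sum of absolute differences between predictions and observed returns."""
    return sum(abs(p - g) for p, g in zip(predictions, returns))


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
error = cumulative_abs_error([2.0, 1.5, 0.0], returns)
```

The backward pass computes all returns in a single sweep, which is convenient when the same recording is evaluated at several timescales.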

### 4.3 Atari Environment

We examined the performance of $\Gamma$-nets under policy evaluation in the Arcade Learning Environment (ALE) (Bellemare et al., 2015). The agent’s policy was trained using the Dopamine project’s (Castro et al., 2018) implementation of the Rainbow agent (Hessel et al., 2018), which uses the same network architecture as the DQN agent (Mnih et al., 2015), but adds prioritized replay (Schaul et al., 2016), n-step returns, and a distributional representation of the value estimates (Bellemare et al., 2017).

The primary results presented are for the game Centipede with a Rainbow agent trained for 25 M frames, which we will refer to as Centipede@25M. Additional Atari games were evaluated using agents trained for 200 M frames, which we will refer to as Atari@200M. These agents were included as part of the Dopamine package and trained according to the specifications given in Castro et al. (2018). Results for Atari@200M are included in the appendix.

Figure 6 shows predictions on the early transitions of a single episode. For this episode the expected return was estimated by running 2000 Monte Carlo rollouts from each state visited along the way (dashed lines). The solid lines indicate the predictions after training for 20 M frames (using the direct configuration which will be described in following sections).

#### Training

The prediction networks were trained using samples of transitions generated by pretrained policies. Agents select actions using $\epsilon$-greedy over their Q-values. For generating the samples used for training the $\Gamma$-nets we use an evaluation mode with a smaller $\epsilon$ than was used during policy training. Transitions were generated sequentially and the environment was reset at the end of each episode or 27,000 steps, whichever came first. These transitions were saved to file in sequence and for each experiment they were reloaded in the same order. For each transition, we saved the reward as well as the activation of the final core layer of the agent’s network, which serves as the input to the $\Gamma$-nets. The network was composed of five fully-connected layers of sizes [512, 256, 128, 16, 1], with all but the final layer using ReLU activation. The architecture used is shown in Figure 7. Training of the $\Gamma$-nets proceeded as if the data was generated in an online fashion, as would be the case during policy learning. That is, the agent would read in transition samples from the file, add them to a prioritized replay buffer, and then train by sampling from the replay buffer. When a new sample was added to the buffer it was given the highest level of priority so that its probability of being sampled was high. As in policy training, we train on a batch of sampled transitions using n-step returns. To update the priority of a given sample in the batch we use the maximum squared loss across the training timescales.

A set of 8 training timescales was used, which always included the lower and upper bounds. An additional 6 were drawn on each timestep. Unless otherwise stated the sampling was done by drawing 3 timescales uniformly from each of the $\gamma$ and $\tau$ scales (for $\tau$ we drew from the integer scale, rather than float). Each network was trained for 20 M frames with network weights saved every 500k frames. Additional training details can be found in the appendix.

#### Evaluation

To evaluate predictive accuracy we created a set of evaluation points for each game. These were generated by running the agent in evaluation mode over multiple episodes. At the start of each episode a random offset, in steps, was chosen. Then, starting at the offset, the state of the environment and agent were saved every 30 steps (120 frames with a frameskip of 4). For Centipede@25M a total of 269 evaluation points were created in this way, including the episode start state. From each of these evaluation points we ran 1000 episodes until termination and then computed the average return. These were used as the baseline against which we computed our prediction error. To compute the prediction error for a given evaluation point we restored the agent’s and environment’s state and recorded the network’s predictions for the probe timescales. For comparison, we trained fixed timescale networks for all the probe timescales (plotted in fuchsia). These networks used exactly the same architecture as the $\Gamma$-net, but did not provide timescale as input to the network and only trained on the single fixed timescale. These probe networks also used loss scaling. For the Atari@200M results a reduced set of probe timescales was used.

We use a reference configuration of the $\Gamma$-net across the different plots. We plot this series in black and refer to it as direct, although the figure legends may give it a different label to call out the significance of its configuration for a given comparison. For this configuration both $\gamma$ and $\tau$ were provided as inputs to the network. Additionally, the set of training timescales was populated by drawing samples from both the $\gamma$ and $\tau$ scales, and loss scaling was used. For each of the other configurations only a single setting was modified from this reference.

#### Plotting

We focus our evaluation on the steady-state performance of the network, computing averages over the last 5 M frames of the 20 M frame runs (with evaluation at every 500k frames). Mean-squared error (MSE) for each experiment is presented as a function of the evaluation probe timescale given in $\tau$ (e.g., Figure 7(a)). For each $\tau$ we normalize across the different series by the largest mean error. Thus, the largest mean error for each $\tau$ is shown as 1.0. We do this to be able to clearly show results for all the different timescales in a single plot despite the large differences in magnitude. As a result, series can only be directly compared within a plot, not across plots. To rank each configuration for comparison we provide a bar chart (e.g., Figure 7(b)) which averages the normalized means and normalized variances of the MSE. That is, we take the normalized mean MSE for each $\tau$ and average across all $\tau$. Likewise we take the variance at each $\tau$, normalize it by the maximum variance for each $\tau$ and take the average across all $\tau$. Note that averaging this way is a biased approach in that it is dependent on which probe $\tau$ are used. For example, if we took many large $\tau$ and few small ones then our results would give more weight to the large $\tau$. In practice, the weighting of errors for different timescales will be task dependent.
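The per-timescale normalization and the averaged ranking score can be sketched as below. The series names and error values are invented for illustration, and the helper names are not taken from the experimental code.

```python
def normalize_per_timescale(mean_errors):
    """Divide each probe-timescale column by its largest value across series.

    mean_errors maps a series name to a list of mean MSE values, one per probe
    timescale. All values are assumed to be positive.
    """
    names = list(mean_errors)
    n_probes = len(mean_errors[names[0]])
    column_max = [max(mean_errors[name][j] for name in names) for j in range(n_probes)]
    return {name: [mean_errors[name][j] / column_max[j] for j in range(n_probes)]
            for name in names}


def rank_score(normalized):
    """Average the normalized errors over all probe timescales (lower is better)."""
    return {name: sum(values) / len(values) for name, values in normalized.items()}


# Invented example: two configurations, two probe timescales with very different magnitudes.
mean_errors = {"config_a": [1.0, 200.0], "config_b": [2.0, 100.0]}
normalized = normalize_per_timescale(mean_errors)
scores = rank_score(normalized)
```

In this invented example the two configurations tie at 0.75, even though their raw errors differ by orders of magnitude between the short and long probe, which is exactly the effect the normalization is meant to achieve.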

While conducting parameter sweeps it was observed that a particular network configuration might produce the lowest value of MSE but not actually be predictive. In this case the network would learn to always output a fixed value which captured the mean of the expected returns. Thus, we adopted a two step evaluation process. First, we took the evaluation points and concatenated them in sequence. We then computed the correlation between their expected returns and the predicted returns made by the network. If a configuration had a positive correlation then it would be considered for comparison with other architectures. We have also included the plots of correlation by probe timescale (Ex. Fig 7(c)). Correlation values are easily interpreted with the maximum (best) value of 1. This tells us how closely the shapes of the target sequence and the prediction sequence match.
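The correlation check can be sketched with a plain Pearson correlation. Returning 0 for a constant sequence is an assumption of this sketch (the statistic is undefined in that case); it simply marks a predictor that always outputs the same value as non-predictive.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        # Assumption: treat a constant sequence as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


expected_returns = [1.0, 2.0, 3.0]
tracking = pearson(expected_returns, [2.0, 4.0, 6.0])   # follows the shape of the target
constant = pearson(expected_returns, [5.0, 5.0, 5.0])   # always predicts the same value
```

A predictor that outputs only the mean of the returns can score well on MSE yet has no positive correlation with the targets, which is why both measures are used.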

All series are an average over 6 seeds and the shading indicates max and min values. Note that due to the high degree of overlap in many of the figures, color printing is required to discern individual series. Plots taken with respect to $\tau$ are produced by combining two different x-axes, allowing us to make both short and long timescales discernible. The point at which the axes are split is indicated by the vertical black line.

While our evaluation method seeks to discern differences in performance due to the various configurations, in reality most configurations perform similarly. In order to rank configurations we first considered the MSE and then the variance.

#### Embedding Comparison

We compare methods for combining the timescale inputs with the agent’s features using an embedding vector (Figure 8). The direct embedding concatenates the timescale inputs with the features. Xu et al. (2018) learned an embedding vector of size 16 which was concatenated with the features; we refer to this as l_embed. We also considered a Hadamard embedding (h_l_embed) in which a learned vector, the same length as the feature vector, was combined with the features using element-wise multiplication. Finally, we considered a matrix multiplication approach in which the timescales were given as inputs to a fully connected layer whose output was a square matrix with dimensions the same size as the feature vector. The embedding was then formed by multiplying this matrix with the features. We found little difference between the approaches in terms of their MSE or correlation. Overall the linear embedding appears the best choice based on its lower variance, but this did not hold universally for the other games evaluated (Figure 16). Learning and computation were both slower with the matrix multiplication approach (Figure 15) and linear activations were generally slightly better than ReLU (Figure 14).

#### Timescale Input Comparison

We examine how the input timescale representation affects prediction performance (Figures 9, 17). We consider whether to use $\gamma$ or $\tau$ inputs or both. The $\gamma$ input values are naturally scaled between 0 and 1, and the $\tau$ input values were normalized by dividing by the max $\tau$, which in these experiments was 100. We consistently see that using only $\gamma$ produced the worst performance (Asteroids, in Figure 17, is an exception). Providing $\tau$ or both $\gamma$ and $\tau$ performed very similarly, but we consistently observed that providing both representations performed best for very short timescales and had lower variance.

#### Distribution Comparison

We look at the effect of drawing the training timescales from different distributions (Figures 10, 18). We use a set of size 8, two of which are always the lower and upper bounds. Six additional timescales are drawn from a given distribution. We either draw all six uniformly from the $\gamma$ or $\tau$ scale or draw half from each. We see that drawing solely from $\gamma$ performs worst overall, particularly at longer timescales, as is expected. Surprisingly, $\gamma$ did not consistently outperform $\tau$ at very short timescales. If we consider all timescales and games evaluated there is no clear winner between drawing solely from $\tau$ or from both $\gamma$ and $\tau$. However, at very short timescales drawing from both tended to produce better results. Thus, we recommend drawing from both scales as a default.

#### Loss Scaling

We examined the effect that loss scaling has on network performance. Figure 11 shows that on Centipede@25M there is a clear benefit, with clearly lower MSE and variance. Scaling the loss was expected to improve short timescale performance. Surprisingly, in terms of MSE, the greatest impact was on longer timescales. However, such a pronounced difference was not seen in other Atari games (Figure 19). Instead we saw a general trend in which scaling did improve performance at short scales at the cost of performance at mid and long timescales, which was in line with our expectations (again, Asteroids was somewhat an exception).

#### Estimation by Interpolation

An alternative approach to estimating value at arbitrary timescales is to have multiple prediction heads, each at a fixed timescale, and then linearly interpolate between the nearest bracketing timescales. In Figure 12 we show results for such an interpolation. Here we took the previously trained probe networks (with scaled loss and the taper network architecture) and performed linear interpolation between them. Because of the non-linear relationship between $\gamma$ and $\tau$, the linear interpolation gives different weighting depending on whether the interpolation is done on the $\gamma$ or $\tau$ scale. Interpolating in these spaces is also compared. Results show that performance was fairly similar between the interpolation scales, but that the $\Gamma$-net did not perform as well. While it might have been expected that the ability of the neural network used by the $\Gamma$-net to capture the non-linearity of the timescales would give it an advantage, this was not shown in this experiment. Rather, we suspect that the increased accuracy of the probe networks allowed the interpolation approach to win out.
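The difference between interpolating on the two scales can be seen in a small sketch. The probe timescales and values below are invented, the function name is illustrative, and probe timescales are assumed to be distinct.

```python
def interpolate_value(target_tau, probe_taus, probe_values, scale="tau"):
    """Linearly interpolate between the two probe predictions bracketing target_tau.

    scale selects whether the interpolation weight is computed on the tau
    scale or on the gamma scale (gamma = 1 - 1 / tau).
    """
    def to_scale(tau):
        return tau if scale == "tau" else 1.0 - 1.0 / tau

    points = sorted(zip(probe_taus, probe_values))
    for (tau0, v0), (tau1, v1) in zip(points, points[1:]):
        if tau0 <= target_tau <= tau1:
            x0, x1, x = to_scale(tau0), to_scale(tau1), to_scale(target_tau)
            weight = (x - x0) / (x1 - x0)
            return v0 + weight * (v1 - v0)
    raise ValueError("target_tau lies outside the range of the probe timescales")


# Invented probe predictions at tau = 10 and tau = 20.
on_tau = interpolate_value(15.0, [10.0, 20.0], [1.0, 3.0], scale="tau")
on_gamma = interpolate_value(15.0, [10.0, 20.0], [1.0, 3.0], scale="gamma")
```

For the same target timescale, the midpoint on the $\tau$ scale sits two thirds of the way along the $\gamma$ scale, so the two interpolations weight the bracketing predictions differently.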

## 5 Discussion

We have empirically evaluated various approaches to constructing $\Gamma$-nets and compared their predictive accuracy to baseline predictors. While we sought to separate the impacts of the various approaches, in reality all of the variants we explored performed similarly. We have considered several different Atari games with deep learning architectures as well as a simulated signal and a robotics demonstration using a shallow architecture. Overall we found that $\Gamma$-nets worked reliably both for reward and sensorimotor prediction.

Despite the relatively minor differences in performance across the variants we do make some recommendations for implementation. Since there was no universal difference between the direct and l_embed embedding approaches we recommend just using the simplest, direct. If looking for a general approach that is not specifically adapted to the task then we recommend using both $\gamma$ and $\tau$ as inputs to the network as well as drawing samples from both scales in order to populate the set of training timescales. On the other hand, if longer timescales are preferred then it seems sufficient to use only $\tau$ for both input and sampling distribution. With regard to scaling the loss, a clear universal benefit has not been observed and we suggest that further investigation is required to determine the best way to balance the losses resulting from different timescales. Such an investigation is a clear opportunity for future work.

Our method is thus far limited to the fixed discounting case. However, one of the key generalizations of GVFs is to support transition-dependent discounting functions: $\gamma(S_t, A_t, S_{t+1})$ (White, 2017). This allows GVFs to be more expressive in terms of the types of returns they can estimate. Extending our method to support such discounting is clearly an important next step in this work.

There are several ways in which our work and that of Fedus et al. (2019) are complementary. First, they demonstrated that using value predictions at many different timescales could serve as useful auxiliary tasks for driving representation learning. A clear next step is to investigate whether or not $\Gamma$-nets could also serve as a useful auxiliary task. One of the advantages of TD algorithms is that they allow the agent to bootstrap estimation of the return from its existing estimates. This limits a single predictor to only capturing returns with geometric discounting. However, such returns can be used as a basis to form alternative returns, as is demonstrated by Figure 13. In fact, Fedus et al. (2019) used geometrically discounted value estimates at multiple scales as a basis to estimate hyperbolically discounted returns. $\Gamma$-nets could provide such a basis function using a single network.
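To illustrate how geometric discounts can serve as a basis for another discounting function, the sketch below uses the standard identity $\int_0^1 \gamma^{kt}\, d\gamma = 1/(1 + kt)$, which expresses a hyperbolic discount as an average over geometric ones. The midpoint Riemann sum, the function name, and the number of sample points are choices made for this sketch and are not a description of the procedure used by Fedus et al. (2019).

```python
def hyperbolic_from_geometric(t, k, n_gammas=1000):
    """Approximate the hyperbolic discount 1 / (1 + k * t) from geometric discounts.

    Uses a midpoint Riemann sum of gamma ** (k * t) over gamma in (0, 1).
    """
    width = 1.0 / n_gammas
    total = 0.0
    for i in range(n_gammas):
        gamma = (i + 0.5) * width  # midpoint of the i-th interval
        total += gamma ** (k * t) * width
    return total


approx = hyperbolic_from_geometric(t=5, k=1.0)
exact = 1.0 / (1.0 + 1.0 * 5)
```

Since a value function is linear in the discounted cumulants, the same weighting applied to value estimates at many $\gamma$ would give an estimate of a hyperbolically discounted value, and a network queried at arbitrary $\gamma$ could in principle supply those estimates.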

Long timescale predictions can be difficult to learn due to the higher variance of the returns. Romoff et al. (2019) presented an algorithm which computes values for an ordered set of timescales by predicting the differences between the values using separate network heads. Value estimates are constructed in a cascade where each timescale prediction adds to the one that came before it. They showed their method could improve estimation accuracy for longer timescales by leveraging the accuracy of the easier to learn shorter timescales. We might expect a similar effect using $\Gamma$-nets, where long timescale predictions could benefit from the short timescales being learned directly in the network. Our current evaluation approach is not fine grained enough to discern such a benefit. Thus, this area warrants further exploration.

$\Gamma$-nets are related to other works which seek to learn many different predictions simultaneously and tractably. The UVFA (Schaul et al., 2015), on which this work is based, generalizes over goals. The successor representation (SR) (Dayan, 1993) separates environment dynamics from reward, providing a way to transfer learning across tasks (Barreto et al., 2017; Sherstan et al., 2018). These ideas have been combined (Mankowitz et al., 2018; Ma et al., 2018) to enable transfer learning over multiple goals using off-policy learning. However, these methods still use fixed timescales; thus, a natural extension of $\Gamma$-nets is to combine them with these approaches.

The original motivation for this work was to use $\Gamma$-nets to create GVFs which form a predictive representation of state for use by the agent’s policy. It now seems that the best approach would be to use multiple heads with predictions at fixed timescales and let the policy network learn to generalize over those predictions as needed. Such an approach could be costly in terms of network weights, and $\Gamma$-nets might accomplish the same thing with a smaller network.

## 6 Conclusion

We presented $\Gamma$-nets, a simple technique for generalizing value estimation across timescale. This technique allows a system to make predictions for values of any timescale within the training regime of the network. We expect that this ability will be useful in areas such as predictive representations of state, i.e., modeling the world as a collection of predictions about future sensorimotor signals. In complex environments complete models are not feasible; thus, being able to query for predicted outcomes at any timescale makes a model potentially more compact and expressive. An investigation of $\Gamma$-nets in different control learning scenarios is an important area for future work, and we believe they may be of benefit to ongoing research in planning and lifelong learning. In particular, $\Gamma$-nets are complementary to approaches which seek to learn many things about the world simultaneously, such as the successor representation and universal value functions, suggesting that $\Gamma$-nets may provide us with a functional new tool for the pursuit of knowledgeable intelligent systems.

## Acknowledgements

The authors would like to thank the following colleagues for providing thoughtful suggestions on this work: Alex Kearney, Marlos C. Machado, and Matt Schlegel. Additionally, Brendan Bennett, Jesse Farebrother, and Vivek Veeriah provided helpful technical assistance. Initial stages of this work were funded by Cogitai, and additional support was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), Compute Canada, the Canada Research Chairs program, Alberta Innovates, and the Alberta Machine Intelligence Institute (Amii).

## Appendix A Atari Details

Various parameters are indicated in Table 3.

A brief sweep was made over the step-size parameter (also referred to as learning rate) for the Centipede@25M policy. Sweeps were made over the probe timescales as well as over various variants of the Γ-net, with 3 seeds each. The values tried were: . It was found that, almost universally, the value gave the lowest error when errors were aggregated over all probe timescales. This is also the step-size used in training the Rainbow agent. This step-size value was used for all reported experiments. Note that these sweeps were done on Centipede@25M experiments only.

Dopamine’s implementation of the prioritized replay buffer used fixed discounting for a single timescale. Thus, we needed to modify this implementation to return the n-step transitions and then apply discounting afterwards.
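To illustrate why returning the undiscounted n-step transitions is useful, the following is a minimal sketch, not the modified Dopamine code. It assumes a stored transition exposes its raw per-step rewards and a bootstrap value; the names `n_step_return` and `n_step_target` are hypothetical. Because discounting is applied after sampling, one stored transition can yield a target for any timescale.

```python
from typing import Dict, Sequence


def n_step_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of the stored per-step rewards: sum_k gamma^k * r_k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def n_step_target(rewards: Sequence[float], bootstrap_value: float,
                  gamma: float, terminal: bool = False) -> float:
    """n-step TD target: discounted rewards plus gamma^n times the bootstrap."""
    ret = n_step_return(rewards, gamma)
    if terminal:
        # No bootstrapping past the end of an episode.
        return ret
    return ret + (gamma ** len(rewards)) * bootstrap_value


# One stored 4-step transition (hypothetical values), reused for several timescales.
rewards = [1.0, 0.0, 2.0, 1.0]
targets: Dict[float, float] = {
    g: n_step_target(rewards, bootstrap_value=10.0, gamma=g)
    for g in (0.0, 0.5, 0.9)
}
```

With a fixed discount baked into the buffer, each of these targets would instead require its own stored return.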

We use a frame skip of 4, meaning that when an action is sent to the environment it is executed 4 times in a row and the resulting final frame is returned as the observation. The implementation also uses frame stacking in which a max pooling is taken over the last 2 consecutive frames in order to deal with flicker in the rendering of the game images. We used sticky-actions with a probability of 0.25. This means that when an action is sent to the environment there is a 25% chance that the environment will use the previous action instead. Every reset of the ALE environment restores the environment to the same initial state. State transitions are deterministic. The policy was trained with an ε-greedy value of 0.001, but for evaluation transitions were generated with ε reduced to 0.0001. Thus, during evaluation the largest source of stochasticity is the sticky-actions. Further, the agent sees the early states of the episode more frequently than the later states. As in policy training, one training update was performed for every 4 steps in the environment. Since every step in the environment corresponds to 4 frames, a training update was performed every 16 frames.
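The sticky-action behaviour and the frames-per-update arithmetic can be sketched as follows. This follows the description in the paragraph above rather than the ALE internals; the class name `StickyActionFilter` and the constant names are hypothetical.

```python
import random
from typing import Optional


class StickyActionFilter:
    """With probability `stick_prob`, replace the requested action with the previous one."""

    def __init__(self, stick_prob: float = 0.25, seed: int = 0) -> None:
        self.stick_prob = stick_prob
        self.rng = random.Random(seed)
        self.prev_action: Optional[int] = None

    def __call__(self, action: int) -> int:
        # The first action of an episode has no predecessor to stick to.
        if self.prev_action is not None and self.rng.random() < self.stick_prob:
            action = self.prev_action
        self.prev_action = action
        return action


FRAME_SKIP = 4        # frames executed per environment step
STEPS_PER_UPDATE = 4  # environment steps between training updates
frames_per_update = FRAME_SKIP * STEPS_PER_UPDATE  # 16 frames per update
```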

To train the sampled batch of transitions on multiple timescales, we tile the samples with each training timescale. Thus, for a batch size of 32 sampled transitions and 8 training timescales, the effective batch size is 256. This does add some additional computation time to the process, but this is also affected by the quality of the implementation. When sampling from the scale for we used the integer scale rather than float.
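The tiling step can be sketched as pairing every sampled transition with every training timescale. This is an illustration only: the function name `tile_with_timescales` and the eight timescale values are assumptions, not the values used in the experiments.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def tile_with_timescales(transitions: Sequence[T],
                         gammas: Sequence[float]) -> List[Tuple[T, float]]:
    """Repeat the sampled batch once per timescale, tagging each copy with its gamma."""
    return [(transition, gamma) for gamma in gammas for transition in transitions]


# 32 sampled transitions (placeholders) and 8 hypothetical training timescales.
batch = [f"transition_{i}" for i in range(32)]
train_gammas = [0.0, 0.5, 0.75, 0.875, 0.9375, 0.96875, 0.984375, 0.99]

tiled = tile_with_timescales(batch, train_gammas)
effective_batch_size = len(tiled)  # 32 * 8 = 256
```

Each tiled pair supplies the timescale as an input to the network alongside the transition, so a single forward pass covers all timescales for the batch.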

As with the policy, we use a target network which periodically copies weights from the online network; the online network is updated on each training step and the target network is used for bootstrapping. Note that TD learning is typically trained using a semi-gradient approach, in which gradients are not computed with respect to the bootstrapped target.
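A minimal sketch of a semi-gradient TD update with a separate target network is given below. It uses a linear value estimator in place of the deep network for brevity, and the function names are hypothetical. The bootstrapped target is computed from the target weights and treated as a constant, so only the online weights receive a gradient.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def semi_gradient_td_step(online_w: Sequence[float], target_w: Sequence[float],
                          x: Sequence[float], cumulant: float,
                          x_next: Sequence[float], gamma: float,
                          step_size: float) -> List[float]:
    """One TD(0) update of the online weights for a linear estimator."""
    # The bootstrap uses the target weights and is held fixed (no gradient through it).
    target = cumulant + gamma * dot(target_w, x_next)
    td_error = target - dot(online_w, x)
    # Gradient of the online prediction w.r.t. online_w is simply x.
    return [w + step_size * td_error * xi for w, xi in zip(online_w, x)]


def sync_if_due(online_w: Sequence[float], target_w: Sequence[float],
                step: int, sync_interval: int) -> List[float]:
    """Copy the online weights into the target weights every `sync_interval` steps."""
    return list(online_w) if step % sync_interval == 0 else list(target_w)


online = [0.0, 0.0]
target = [1.0, 1.0]
online = semi_gradient_td_step(online, target, x=[1.0, 0.0], cumulant=1.0,
                               x_next=[0.0, 1.0], gamma=0.5, step_size=0.1)
```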

Parameter | Value |
---|---|
Input dim | 84x84 |
dim | 512 |
Replay buffer size | 100000 |
Batch size | 32 |
n-step | 4 |
Min-replay history | 20000 |
Sync interval | 10000 |
Frameskip | 4 |
Sticky-actions | 0.25 |
Terminal on life loss | False |
Max steps per episode | 27000 |
Consecutive frame pooling | True |
ε-greedy: policy learning | 0.001 |
ε-greedy: evaluation | 0.0001 |
Adam optimizer: Step-size (learning rate) | |
Adam optimizer: eps | |

## Appendix B Additional Centipede@25M Figures

Here we present several additional figures of evaluation on the Centipede@25M policy. Figure 14 compares the performance of linear and ReLU embeddings on the embedding networks used in Figure 8. The ReLU embedding performs the same as the linear embedding for the concatenated architecture and performs worse with the Hadamard architecture. Figure 15 looks at a matrix embedding approach. We see that here too the ReLU embedding performs worst. Note that with the matrix embedding the learning was slower than for the direct embedding.

## Appendix C Atari@200M

Here we present the results of training Γ-nets on five Atari games using policies trained for 200 M frames (Figures 16–19): Asteroids, Atlantis, ChopperCommand, Centipede, and Qbert. We used pretrained networks from the Dopamine package (Castro et al., 2018), trained for 200 M frames. Network configurations are the same as those described in the paper. Each run was trained for 20 M frames and six seeds were run for each experiment. A reduced set of probe timescales was used: . For these results we also included learning curves (rightmost column). These learning curves are an average of normalized MSE taken across all evaluation timescales. For each timescale we normalize each of the series by the largest MSE of any series for that timescale. Then for each series we average the normalized MSE across all the timescales. As before, shaded areas indicate max and min.
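The normalization used for the learning curves can be sketched as below. The data layout (`mse[series][timescale]` as a list of MSE values over training checkpoints) and the function name are assumptions, as is the reading that the "largest MSE" is taken over all checkpoints of all series for a given timescale.

```python
from typing import Dict, List


def normalized_learning_curves(
        mse: Dict[str, Dict[float, List[float]]]) -> Dict[str, List[float]]:
    """Normalize per timescale by the largest MSE, then average across timescales."""
    timescales = list(next(iter(mse.values())).keys())
    # Largest MSE of any series for each timescale (assumed: over all checkpoints).
    max_mse = {g: max(max(series[g]) for series in mse.values()) for g in timescales}
    curves: Dict[str, List[float]] = {}
    for name, series in mse.items():
        n_points = len(series[timescales[0]])
        curves[name] = [
            sum(series[g][i] / max_mse[g] for g in timescales) / len(timescales)
            for i in range(n_points)
        ]
    return curves


# Two hypothetical series evaluated at two timescales and two checkpoints.
example = {
    "a": {0.5: [4.0, 2.0], 0.9: [100.0, 50.0]},
    "b": {0.5: [2.0, 1.0], 0.9: [50.0, 25.0]},
}
curves = normalized_learning_curves(example)
```

Normalizing before averaging keeps long timescales, whose returns and errors are much larger in magnitude, from dominating the aggregate curve.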

## References

- Successor Features for Transfer in Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), Long Beach, California, pp. 4055–4065. Cited by: §5.
- A Distributional Perspective on Reinforcement Learning. In International Conference on Machine Learning (ICML), Sydney, Australia, pp. 449–458. Cited by: §4.3.
- The Arcade Learning Environment: An Evaluation Platform for General Agents. In International Joint Conference on Artificial Intelligence (IJCAI), Lille, France, pp. 4148–4152. Cited by: §4.3.
- Dopamine: A Research Framework for Deep Reinforcement Learning. arXiv 1812.06110. Cited by: Appendix C, §4.3, §4.3.
- Improving Generalization for Temporal Difference Learning: The Successor Representation. Neural Computation 5 (4), pp. 613–624. Cited by: §1, §5.
- Hyperbolic Discounting and Learning over Multiple Horizons. arXiv 1902.06865. Cited by: §5.
- Stumbling on Happiness. Knopf. Cited by: §1.
- Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI Conference on Artificial Intelligence, New Orleans, USA. Cited by: §4.3.
- Universal Successor Representations for Transfer Reinforcement Learning. In International Conference on Learning Representations (ICLR), Vancouver, Canada. Cited by: §5.
- Unicorn: Continual Learning with a Universal, Off-policy Agent. arXiv 1802.08294. Cited by: §5.
- Human-level Control through Deep Reinforcement Learning. Nature 518 (7540), pp. 529–533. Cited by: §4.3.
- Multi-Timescale Nexting in a Reinforcement Learning Robot. Adaptive Behavior 22 (2), pp. 146–160. Cited by: §1.
- Separating Value Functions Across Time-scales. arXiv 1902.01883. Cited by: §5.
- Universal Value Function Approximators. pp. 1312–1320. Cited by: §1, §5.
- Prioritized Experience Replay. arXiv 1511.05952. Cited by: §4.3.
- Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation. In IEEE International Conference on Robots and Systems (IROS), Madrid, Spain, pp. 2997–3003. Cited by: §5.
- Towards Prosthetic Arms as Wearable Intelligent Robots. Master’s Thesis, University of Alberta. Cited by: §3.
- Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA. Cited by: §2, §4.1, §4.2.
- Horde: A Scalable Real-time Architecture for Learning Knowledge from Unsupervised Sensorimotor Interaction. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Vol. 2, Taipei, Taiwan, pp. 761–768. Cited by: §1.
- Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence 112 (1), pp. 181–211. Cited by: §1.
- TD Models: Modeling the World at a Mixture of Time Scales. In International Conference on Machine Learning (ICML), pp. 531–539. Cited by: §1.
- Prediction of Immediate and Future Rewards Differentially Recruits Cortico-basal Ganglia Loops. Behavioral Economics of Preferences, Choices, and Happiness 7 (8), pp. 593–616. Cited by: §1.
- Learning Values Across Many Orders of Magnitude. In Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain, pp. 4287–4295. Cited by: §3.
- Unifying Task Specification in Reinforcement Learning. In International Conference on Machine Learning (ICML), Sydney, Australia, pp. 3742–3750. Cited by: §5.
- Meta-Gradient Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS), Montreal, Canada, pp. 2396–2407. Cited by: §1, §4.3.4.