# Fast Task Inference with Variational Intrinsic Successor Features

###### Abstract

It has been established that diverse behaviors spanning the controllable subspace of a Markov decision process can be trained by rewarding a policy for being distinguishable from other policies (Gregor et al., 2016; Eysenbach et al., 2018; Warde-Farley et al., 2018). However, one limitation of this formulation is generalizing behaviors beyond the finite set being explicitly learned, as is needed for use on subsequent tasks. Successor features (Dayan, 1993; Barreto et al., 2017) provide an appealing solution to this generalization problem, but require defining the reward function as linear in some grounded feature space. In this paper, we show that these two techniques can be combined, and that each method solves the other’s primary limitation. To do so we introduce Variational Intrinsic Successor FeatuRes (VISR), a novel algorithm which learns controllable features that can be leveraged to provide enhanced generalization and fast task inference through the successor feature framework. We empirically validate VISR on the full Atari suite, in a novel setup wherein the rewards are only exposed briefly after a long unsupervised phase. Achieving human-level performance on 14 games and beating all baselines, we believe VISR represents a step towards agents that rapidly learn from limited feedback.


Steven Hansen DeepMind stevenhansen@google.com Will Dabney DeepMind wdabney@google.com André Barreto DeepMind andrebarreto@google.com Tom Van de Wiele DeepMind tomvandewiele@google.com David Warde-Farley DeepMind dwf@google.com Volodymyr Mnih DeepMind vmnih@google.com

Preprint. Under review.

## 1 Introduction

Advances in unsupervised learning heralded the recent improvements in machine learning, with early contributions due to pre-training of deep neural networks (Bengio et al., 2007) and, more recently, the overwhelming rate of progress in generative modeling due to fundamentals such as generative adversarial networks (GANs) (Goodfellow et al., 2014). We have repeatedly seen how representations from unsupervised learning can be leveraged to dramatically improve sample efficiency in a variety of supervised learning domains (Rasmus et al., 2015; Salimans et al., 2016).

However, in reinforcement learning (RL) the process of unsupervised learning has been less clearly defined due to the coupling between behavior, state-visitation, and the motivation behind that behavior. How to generate behaviors without the supervision provided by an external reward signal has long been studied under the psychological auspice of intrinsic motivation (Barto et al., 2004; Barto, 2013; Mohamed and Rezende, 2015), often with the goal of improved exploration (Şimşek and Barto, 2006; Oudeyer and Kaplan, 2009; Bellemare et al., 2016).

Exploration is inherently concerned with finding rewards, and thus cannot be considered entirely in their absence. However, there are other aspects of the RL problem that can benefit from an unsupervised phase in which no rewards are observed: representation and skill learning. We call this scenario in which an unsupervised phase without rewards is followed by a supervised phase in which rewards are observed no-reward reinforcement learning.

Just as unsupervised learning is often evaluated by the subsequent benefits to supervised learning, in no-reward RL the agent is evaluated by how quickly the representations and behaviors learned during the unsupervised phase adapt to a newly introduced reward signal. Such methods are therefore orthogonal and complementary to work on the exploration problem, which is concerned with finding the reward signal that these methods then attempt to use efficiently.

The current state-of-the-art for no-reward RL comes from a class of algorithms which, independent of reward, maximize the mutual-information between latent-variable policies and their behavior in terms of state visitation (Mohamed and Rezende, 2015; Gregor et al., 2016; Eysenbach et al., 2018; Warde-Farley et al., 2018). They all exhibit a great deal of diversity in behavior, with variational intrinsic control (Gregor et al., 2016, VIC) and diversity-is-all-you-need (Eysenbach et al., 2018, DIAYN) even providing a natural formalism for solving the no-reward RL problem. However, both methods suffer from poor generalization and a slow inference process when the reward signal is introduced. The fundamental problem they face is a need to generalize between different latent codes, a task to which neural networks alone seem poorly suited. Thus, an open problem for such methods is how to efficiently generalize to unseen latent codes.

Our first contribution is to solve this generalization and slow inference problem by making use of another recent advance in RL, successor features (Barreto et al., 2017). Successor features (SF) provide fast transfer learning between tasks that differ only in their reward function, which is assumed to be linear in some features. Until now, however, how to automatically construct these reward function features has been an open research problem (Barreto et al., 2018). Incredibly, we show that the mutual information maximizing procedure of VIC/DIAYN provides a solution to this feature learning problem. Thus, our second contribution is to show that VIC/DIAYN algorithms can be adapted to learn precisely the features needed by successor features. Our final contribution is to show that together these methods form an algorithm, Variational Intrinsic Successor FeatuRes (VISR), which makes significant progress on the no-reward RL problem.

## 2 No-reward reinforcement learning

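As usual, we assume that the interaction between agent and environment can be modeled as a Markov decision process (MDP, Puterman, 1994). An MDP is defined as a tuple $M \equiv (\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $p(\cdot \mid s, a)$ gives the next-state distribution upon taking action $a$ in state $s$, and $\gamma \in [0, 1)$ is a discount factor that gives smaller weights to future rewards. The function $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ specifies the reward received at transition $s \xrightarrow{a} s'$; more generally, we call any signal defined as $c : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ a cumulant (Sutton and Barto, 2018).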

As discussed in the introduction, in this paper we consider the scenario where the interaction of the agent with the environment can be split into two stages: an initial unsupervised phase in which the agent does not observe any rewards, and the usual supervised phase in which rewards are observable. We can model this scenario using a second MDP to represent the unsupervised phase, where $r(s, a, s') = 0$ for all $(s, a, s')$. Note that cumulants other than $r$ can be non-zero even during the unsupervised phase.

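During the supervised phase the goal of the agent is to find a policy $\pi : \mathcal{S} \to \mathcal{A}$ that maximizes the expected return $G_t = \sum_{i=0}^{\infty} \gamma^i R_{t+i}$, where $R_t = r(S_t, A_t, S_{t+1})$. A principled way to address this problem is to use methods derived from dynamic programming, which heavily rely on the concept of a value function (Puterman, 1994). The action-value function of a policy $\pi$ is defined as $Q^\pi(s, a) \equiv \mathbb{E}^\pi[G_t \mid S_t = s, A_t = a]$, where $\mathbb{E}^\pi[\cdot]$ denotes expected value when following policy $\pi$. Based on $Q^\pi$ we can compute a greedy policy

$$\pi'(s) \in \arg\max_a Q^\pi(s, a). \tag{1}$$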

The policy $\pi'$ is guaranteed to do at least as well as $\pi$, that is: $Q^{\pi'}(s, a) \geq Q^\pi(s, a)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. The computation of $Q^\pi$ and $\pi'$ are called policy evaluation and policy improvement, respectively; under certain conditions their successive application leads to the optimal value function $Q^*$, from which one can derive an optimal policy using (1). The alternation between policy evaluation and policy improvement is at the core of many RL algorithms, which usually carry out these steps only approximately (Sutton and Barto, 2018). Clearly, if we replace the reward $r$ with an arbitrary cumulant $c$ all the above still holds. In this case we will use $Q^\pi_c$ to refer to the value of $\pi$ under cumulant $c$ and the associated optimal policies will be referred to as $\pi_c$, where $\pi_c$ is the greedy policy (1) with respect to $Q^*_c$.
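The evaluation and improvement cycle can be illustrated on a small tabular problem. The following is a minimal sketch rather than part of the method itself: the two-state MDP, the arrays `P` and `R`, and the helper names are all assumptions made for illustration.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma, n_iters=500):
    """Iterative policy evaluation for a deterministic policy.

    P[s, a, s'] holds transition probabilities and R[s, a] expected rewards
    (both hypothetical toy arrays). Returns Q^pi with shape (S, A).
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q[np.arange(n_states), policy]  # V^pi(s) = Q^pi(s, pi(s))
        Q = R + gamma * (P @ V)             # Bellman expectation backup
    return Q

def greedy(Q):
    """Policy improvement, Eq. (1): pick argmax_a Q(s, a) in every state."""
    return Q.argmax(axis=1)

# Toy MDP: action 0 always leads to state 0, action 1 always leads to state 1;
# only taking action 1 while in state 1 is rewarded.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([[0.0, 0.0], [0.0, 1.0]])
gamma = 0.9

pi = np.array([0, 0])                       # initial policy: always action 0
Q_pi = evaluate_policy(P, R, pi, gamma)
pi_improved = greedy(Q_pi)
Q_improved = evaluate_policy(P, R, pi_improved, gamma)
```

In this toy case a single improvement step is enough for the agent to learn to stay in the rewarded state, and `Q_improved` dominates `Q_pi` elementwise, as the improvement guarantee requires.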

Usually it is assumed, either explicitly or implicitly, that during learning there is a cost associated with each transition in the environment, and therefore the agent must learn a policy as quickly as possible. Here we consider that such a cost is only significant in the supervised learning phase, so during the unsupervised phase the agent is essentially free to interact with the environment as much as desired. The goal in this stage is to collect information about the environment to speed up the supervised phase as much as possible. In what follows we will make this definition more precise.

## 3 Universal successor features and fast task inference

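Following Barreto et al. (2017, 2018), we assume that there exist features $\phi(s, a, s') \in \mathbb{R}^d$ such that the reward of the supervised phase can be written as

$$r(s, a, s') = \phi(s, a, s')^\top w, \tag{2}$$

where $w \in \mathbb{R}^d$ are weights that specify how desirable each feature component is, or a ‘task vector’ for short. Note that, unless we constrain $\phi$ somehow, (2) is not restrictive in any way: for example, by making $\phi_i(s, a, s') = r(s, a, s')$ for some $i$ we can clearly recover the rewards exactly. Barreto et al. (2017) note that (2) allows one to decompose the value of a policy $\pi$ as

$$Q^\pi(s, a) = \mathbb{E}^\pi\left[\sum_{i=0}^{\infty} \gamma^{i} \phi_{t+i} \,\middle|\, S_t = s, A_t = a\right]^\top w \equiv \psi^\pi(s, a)^\top w, \tag{3}$$

where $\phi_t = \phi(S_t, A_t, S_{t+1})$ and $\psi^\pi(s, a)$ are the successor features (SFs) of $\pi$. SFs can be seen as multidimensional value functions in which $\phi(s, a, s')$ play the role of rewards, and as such they can be computed using standard RL algorithms (Szepesvári, 2010).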

One of the benefits provided by SFs is the possibility of quickly evaluating a policy $\pi$. Suppose that during the unsupervised learning phase we have computed $\psi^\pi$; then, during the supervised phase, we can find a $w \in \mathbb{R}^d$ by solving a regression problem based on (2) and then compute $Q^\pi$ through (3). Once we have $Q^\pi$, we can apply (1) to derive a policy $\pi'$ that will likely outperform $\pi$.
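The two-step recipe in this paragraph (regress the task vector from observed rewards, then read off action values from precomputed successor features) can be sketched with NumPy. This is an illustrative sketch only: the synthetic features, the noise level, and the random stand-in for learned successor features are assumptions, not quantities from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples, n_actions = 5, 200, 4

# Hypothetical data: features phi observed on n_samples transitions and the
# rewards seen during the supervised phase (generated here from a known w).
w_true = rng.normal(size=d)
phi = rng.normal(size=(n_samples, d))
rewards = phi @ w_true + 0.01 * rng.normal(size=n_samples)

# Regression problem implied by Eq. (2): find w such that r ≈ phi^T w.
w_hat, *_ = np.linalg.lstsq(phi, rewards, rcond=None)

# Successor features of a fixed policy at a single state, one row per action.
# Random placeholders stand in for a learned psi^pi.
psi_s = rng.normal(size=(n_actions, d))

# Eq. (3): Q^pi(s, a) = psi^pi(s, a)^T w, followed by the greedy step of Eq. (1).
q_s = psi_s @ w_hat
greedy_action = int(np.argmax(q_s))
```

Note that no further interaction with the environment is needed once `w_hat` is available; evaluating the policy reduces to a matrix-vector product.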

Since $\pi$ was computed without access to the reward, it is not deliberately trying to maximize it. Thus, the solution relies on a single step of policy improvement (1) over a policy that is agnostic to the rewards. It turns out that we can do better than that by extending the strategy above to multiple policies. Let $e : (\mathcal{S} \to \mathcal{A}) \to \mathbb{R}^k$ be a policy-encoding mapping, that is, a function that turns policies $\pi$ into vectors in $\mathbb{R}^k$. Borsa et al.’s (2019) universal successor features (USFs) are defined as $\psi(s, a, e(\pi)) \equiv \psi^\pi(s, a)$. Note that, using USFs, we can evaluate any policy $\pi$ by simply computing $Q^\pi(s, a) = \psi(s, a, e(\pi))^\top w$.

Now that we can compute $Q^\pi$ for any $\pi$, we should be able to leverage this information to improve our previous solution based on a single policy. This is possible through generalized policy improvement (GPI, Barreto et al., 2017). Let $\psi$ be USFs, let $\pi_1$, $\pi_2$, …, $\pi_n$ be arbitrary policies, and let

$$\pi(s) \in \arg\max_a \max_i Q^{\pi_i}(s, a) = \arg\max_a \max_i \psi(s, a, e(\pi_i))^\top w. \tag{4}$$

It can be shown that (4) is a strict generalization of (1), in the sense that $Q^\pi(s, a) \geq Q^{\pi_i}(s, a)$ for all $\pi_i$, $s$, and $a$. This result can be extended to the case in which (2) holds only approximately and $\psi$ is replaced by a universal successor feature approximator (USFA) $\psi_\theta \approx \psi$ (Barreto et al., 2017, 2018; Borsa et al., 2019).
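As a concrete illustration of generalized policy improvement, the sketch below selects an action from the successor features of several policies at a single state. The function name `gpi_action` and the hand-picked arrays are hypothetical; only the max-over-policies, argmax-over-actions rule comes from the text.

```python
import numpy as np

def gpi_action(psi, w):
    """GPI at one state.

    psi has shape (n_policies, n_actions, d): the (universal) successor
    features of each candidate policy for every action. Returns
    argmax_a max_i psi_i(s, a)^T w.
    """
    q = psi @ w                      # (n_policies, n_actions) action values
    return int(q.max(axis=0).argmax())

w = np.array([1.0, 0.0])
psi = np.array([
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],   # policy 1: Q = [1, 0, 0]
    [[0.0, 0.0], [0.0, 0.0], [2.0, 5.0]],   # policy 2: Q = [0, 0, 2]
])

action_gpi = gpi_action(psi, w)          # uses both policies
action_single = gpi_action(psi[:1], w)   # with one policy, GPI reduces to Eq. (1)
```

With a single policy the rule collapses to ordinary greedy improvement, which is the sense in which GPI generalizes it.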

The above suggests an approach to the no-reward RL problem. First, during the unsupervised phase, the agent learns a USFA $\psi_\theta$. Then, the rewards observed at the early stages of the supervised phase are used to find an approximate solution $\hat{w}$ for (2). Finally, $n$ policies $\pi_i$ are generated and a policy $\pi$ is derived through (4). If the approximations used in this process are reasonably accurate, $\pi$ will be an improvement over $\pi_1$, $\pi_2$, …, $\pi_n$.

## 4 Empowered Conditional Policies

Features should be defined in such a way that the down-stream task reward is likely to be a simple function of them (see (2)). Since the no-reward reinforcement learning paradigm disallows knowledge of this task reward at training time, this amounts to utilizing a strong inductive bias that is likely to yield features relevant to the rewards of any ‘reasonable’ task.

One such bias is to only represent the subset of observation space that the agent can control. This can be accomplished by maximizing the mutual information between a policy conditioning variable and the agent’s behavior. There exist many algorithms for accomplishing this using similar methods (Gregor et al., 2016; Eysenbach et al., 2018; Warde-Farley et al., 2018). Since this objective has typically been characterized as ‘empowering’ the agent to control its environment (Salge et al., 2014), we refer to these methods as empowered conditional policies (ECP).

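The objective is to find policy parameters $\theta$ that maximize the mutual information ($I$) between some policy-conditioning variable, $z$, and some function $f$ of the history $\tau_{\pi_z}$ induced by the conditioned policy $\pi_z$, where $H$ is the entropy of some variable:

$$F(\theta) = I(z; f(\tau_{\pi_z})) = H(z) - H(z \mid f(\tau_{\pi_z})). \tag{5}$$

Let us assume that $z$ is drawn from a fixed (or at least non-parametric) distribution, so that $H(z)$ is a constant. This simplifies the objective to minimizing the conditional entropy of the conditioning variable given the history:

$$F(\theta) = -H(z \mid f(\tau_{\pi_z})) + \text{const}. \tag{6}$$

Commonly $f$ is assumed to return the final state, but for simplicity we will consider that $f$ samples a single state $s$ uniformly over $\tau_{\pi_z}$. When the history is sufficiently long, this corresponds to sampling from the steady state distribution induced by the policy:

$$F(\theta) = \sum_{s, z} p(s, z) \log p(z \mid s) + \text{const} = \mathbb{E}_{\pi, z}[\log p(z \mid s)] + \text{const}. \tag{7}$$

Replacing the intractable conditional distribution $p(z \mid s)$ with a variational approximation $q(z \mid s)$ yields a lower bound on this objective, which produces the loss function used in practice (see Agakov (2004) for details):

$$J(\theta) = -\mathbb{E}_{\pi, z}[\log q(z \mid s)]. \tag{8}$$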

The variational parameters can be optimized by minimizing the negative log likelihood of samples from the true conditional distribution, i.e. $q$ is a discriminator trying to predict the correct $z$ from behavior. However, it is not obvious how to optimize the policy parameters $\theta$, as they only affect the loss through the non-differentiable environment. Through analysis of the stochastic computation graph (Schulman et al., 2015), one can derive the appropriate score function estimator: the policy gradient with $\log q(z \mid s)$ playing the role of the rewards.
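For completeness, the step from the true conditional distribution to its variational approximation can be written out explicitly. The notation below ($p(z \mid s)$ for the true posterior over the conditioning variable, $q(z \mid s)$ for the discriminator) is assumed for this sketch; the argument itself is the standard one based on the non-negativity of the KL divergence.

$$
\begin{aligned}
\mathbb{E}_{s, z}[\log p(z \mid s)]
&= \mathbb{E}_{s, z}[\log q(z \mid s)] + \mathbb{E}_{s}\big[\mathrm{KL}\big(p(\cdot \mid s) \,\|\, q(\cdot \mid s)\big)\big] \\
&\geq \mathbb{E}_{s, z}[\log q(z \mid s)].
\end{aligned}
$$

Minimizing the loss in Equation 8 with respect to the discriminator therefore tightens the bound, while minimizing it with respect to the policy raises the bound itself.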

## 5 Variational Intrinsic Successor Features

When the ECP feature learning objective is plugged into the successor features framework, the policy-conditioning variable $z$ corresponds directly to the task vector $w$. Following the line of reasoning from before, we can go from wanting to maximize the mutual information between $w$ and the induced policy’s state distribution to minimizing the negative log-likelihood of a variational approximation, as in Equation 8. But whereas the other algorithms just consist of a conditional policy and a variational approximation, utilizing successor features requires the conditional policy to be split into the reward-predictive features $\phi$ and the successor representation $\psi$.

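Remember that the SF framework requires the rewards to be linear in the features $\phi$, while the ECP framework requires that the rewards should be equal to $\log q(w \mid s)$. Satisfying these two constraints is non-trivial, as the rewards depend on the linear weights $w$. For intuition, let us inspect how the most straightforward approach fails to satisfy both constraints. If we assume that $w$ is drawn from a standard-normal distribution, then it would make sense to parameterize $q$ as a fixed-variance Gaussian. However, this would lead to an equation that has no clear solution:

$$
\begin{aligned}
r(s; w) &= \phi(s)^\top w && \text{(SF constraint)} \\
&= \log q(w \mid s) && \text{(ECP constraint)} \\
&= -\tfrac{1}{2}\,(w - \mu(s))^\top (w - \mu(s)) + \text{const} && \text{(isotropic Gaussian)} \\
&= \mu(s)^\top w - \tfrac{1}{2}\, w^\top w - \tfrac{1}{2}\, \mu(s)^\top \mu(s) + \text{const} && \text{(inner product)}
\end{aligned}
\tag{9}
$$

where $\mu(s)$ is the mean predicted by the discriminator and the fixed variance is taken to be one.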

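Equation 9 shows the problem with this setup. Since the reward is a non-linear function of $w$, $\phi$ must depend on $w$ if their inner product is to be equal to the reward. To avoid this, we need the log of the probability density function to be equal to the inner product of the prediction and $w$. In this case $\phi$ can simply be replaced by that prediction. One way to obtain this is to resort to the von Mises-Fisher (VMF) distribution (the analog of the Gaussian on a sphere), since its density function has the desired form. Equation 10 shows the resulting substitution:

$$
\begin{aligned}
r(s; w) &= \phi(s)^\top w && \text{(SF constraint)} \\
&= \log q(w \mid s) && \text{(ECP constraint)} \\
&= \mu(s)^\top w + \text{const} && \text{(isotropic VMF)}
\end{aligned}
\tag{10}
$$

where $\mu(s)$ is the unit-norm prediction of the discriminator, the concentration is taken to be one, and the additive constant depends on neither $s$ nor $w$.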

This implies that the features $\phi$ required by the SF framework should be the predictions of the discriminator $q$. Figure 1 shows the resulting model. Training proceeds as in other ECP methods: by randomly sampling a task vector $w$ and then trying to infer it from the state produced by the conditioned policy. But the internal structure enforces the task/dynamics factorization as in the SF approach (Equation 2).
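The role of the sphere constraint can be checked numerically. The von Mises-Fisher density is proportional to $\exp(\kappa\, \mu^\top x)$, so its log-density is an inner product plus a constant. The sketch below, which assumes unit concentration and uses random unit vectors as hypothetical stand-ins for the discriminator prediction `mu` and the task vector `w`, also shows that on the unit sphere the Gaussian log-density collapses to the same inner product up to a constant shift.

```python
import numpy as np

def l2_normalize(x):
    """Project a vector onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 5
mu = l2_normalize(rng.normal(size=d))   # hypothetical discriminator prediction phi(s)
w = l2_normalize(rng.normal(size=d))    # task vector on the unit sphere

# VMF log-density with unit concentration, dropping the normalizing constant:
# linear in w, as the SF constraint requires.
reward = float(mu @ w)

# Unit-variance Gaussian log-density, also dropping the normalizing constant.
gaussian_log_density = float(-0.5 * np.sum((w - mu) ** 2))

# For unit-norm vectors, -0.5 * ||w - mu||^2 = mu^T w - 1.
shift = gaussian_log_density - reward
```

The quadratic terms that make the Gaussian case non-linear in the task vector become constants once both vectors have unit norm, which is one way to see why moving to the sphere resolves the conflict between the two constraints.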

### 5.1 Adding Generalized Policy Improvement to VISR

Now that successor features have been given a feature-learning mechanism, we can return to the second question raised at the end of Section 3: how can we obtain a diverse set of policies over which to apply GPI?

Recall that we are training a USFA $\psi(s, a, e(\pi))$ whose encoding function is $e(\pi) = w$ (that is, $\pi$ is the policy that tries to maximize the reward in Equation 10 for a particular value of $w$). So, the question of which policies to use with GPI comes down to the selection of a set of vectors $w$.

One natural candidate is the solution for a regression problem derived from Equation 2. Let us call this solution $\hat{w}$, that is, $\phi(s, a, s')^\top \hat{w} \approx r(s, a, s')$. But what should the other task vectors $w$ be? Given that task vectors are sampled from a uniform distribution over the unit sphere during training, there is no single subset that has any privileged status. So, following Borsa et al. (2018), we sample additional $w$’s on the basis of similarity to $\hat{w}$. Since the discriminator $q$ enforces similarity on the basis of probability under a VMF distribution, these additional $w$’s are sampled via a VMF centered on $\hat{w}$, with the concentration parameter $\kappa$ acting as a hyper-parameter specifying how diverse the additional $w$’s should be. Calculating the improved policy is thus done as follows:

$$
\begin{aligned}
Q(s, a) &\leftarrow \max_i \psi(s, a, w_i)^\top \hat{w}, \qquad w_i \sim \mathrm{VMF}(\mu = \hat{w}, \kappa), \\
\pi(s) &\leftarrow \arg\max_a Q(s, a).
\end{aligned}
\tag{11}
$$
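The procedure of sampling task vectors near the regressed solution and applying GPI over them can be sketched as follows. Several pieces here are assumptions made purely for illustration: `toy_usfa` is a hypothetical stand-in for a trained USFA, and `sample_near` replaces true VMF sampling with Gaussian perturbation followed by projection onto the unit sphere, which is a simpler substitute and not an exact VMF sampler.

```python
import numpy as np

def sample_near(w_hat, n, noise_scale, rng):
    """Stand-in for VMF sampling (an assumption, not a true VMF sampler):
    perturb w_hat with Gaussian noise and project back onto the unit sphere.
    A smaller noise_scale plays the role of a larger concentration."""
    candidates = w_hat + noise_scale * rng.normal(size=(n, w_hat.shape[0]))
    candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.vstack([w_hat, candidates])   # keep w_hat itself in the set

def gpi_q_values(usfa, state, task_vectors, w_hat):
    """Q(s, a) = max_i psi(s, a, w_i)^T w_hat for every action a."""
    q = np.stack([usfa(state, w_i) @ w_hat for w_i in task_vectors])
    return q.max(axis=0)

rng = np.random.default_rng(0)
d, n_actions = 5, 4
basis = rng.normal(size=(n_actions, d, d))

def toy_usfa(state, w):
    """Hypothetical USFA: returns psi(s, ., w) with shape (n_actions, d).
    The state is ignored in this toy version."""
    return basis @ w

w_hat = rng.normal(size=d)
w_hat /= np.linalg.norm(w_hat)
task_vectors = sample_near(w_hat, n=10, noise_scale=0.3, rng=rng)

q_gpi = gpi_q_values(toy_usfa, None, task_vectors, w_hat)
q_single = toy_usfa(None, w_hat) @ w_hat    # no GPI: only the policy for w_hat
action = int(q_gpi.argmax())
```

Because the regressed vector is always included among the candidates, the GPI values can never fall below those obtained from the single policy conditioned on it.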

## 6 Experiments

Our experiments are divided into four groups corresponding to Sections 6.1 to 6.4. First, we assess how well VISR does in the no-reward scenario described in Section 2. Since this setup is unique in the literature on the Atari Suite, for the full two-phase process we only compare to ablations of the full VISR model (Table 1). In order to frame performance relative to prior work, in Section 6.2 we also compare to results for algorithms that operate in a purely unsupervised manner (Table 2). Next, in Section 6.3, we contrast VISR’s performance to that of standard RL algorithms in a low data regime (Table 3). Finally, in Section 6.4, we assess how well the proposed approach of inferring the task through the solution of a regression derived from Equation 2 does as compared to alternative ECP methods.

### 6.1 No-reward reinforcement learning

To evaluate VISR, we impose the no-reward reinforcement learning setup on the full suite of 57 Atari games (Bellemare et al., 2013). This means that agents are allowed a long unsupervised training phase without access to rewards, followed by a short test phase with rewards. When solving the full no-reward problem, the test phase consists of solving the linear reward regression problem for all of the ablated models, which we accomplish via linear least-squares regression on a fixed data buffer. Four variations are tested. The full VISR algorithm includes both features learned through a VIC-style controllability loss and GPI to improve the execution of policies during both the train and test phases. The main baseline model, RF VISR, removes the controllability objective, instead learning SF over features given by a random convolutional network (the same architecture as the network in the full model). The remaining ablations remove GPI from each of these models. The ablation results shown in Table 1 confirm that all components of VISR play complementary roles in the overall functioning of our model (see Figure 1).

Algorithm | Median | Mean | Human-level | |
---|---|---|---|---|
RF VISR (No GPI) | | | | |
RF VISR | | | | |
VISR (No GPI) | | | | |
Full VISR | | | | |
Full VISR (unclipped) | 9.04 | 109.16 | 46 | 14 |

### 6.2 Unsupervised approaches

Comparing against fully unsupervised approaches, our main external baseline is the Intrinsic Curiosity Module (Pathak et al., 2017). This uses forward model prediction error in some feature-space to produce an intrinsic reward signal. Two variants have been evaluated on a 47 game subset of the Atari suite (Burda et al., 2018). One uses random features as the basis of their forward model (RF Curiosity), and the other uses features learned via an inverse-dynamics model (IDF Curiosity). It is important to note that, in addition to the extrinsic rewards, these methods did not use the terminal signals provided by the environment, whereas all other methods reported here do use them. The reason for not using the terminal signal was to avoid the possibility of the intrinsic reward reducing to a simple "do not die" signal. To rule this out, an explicit "do not die" baseline was run (Constant Reward NSQ), wherein the terminal signal remains and a small constant reward is given at every time-step. Finally, the full VISR model was run purely unsupervised. In practice this means not performing the fast-adaptation step (i.e. reward regression), instead switching between random vectors every time-steps (as is done during the training phase). Results shown in Table 2 make it clear that VISR outperforms the competing methods on all criteria.

Algorithm | Median | Mean | Human-level | |
---|---|---|---|---|
IDF Curiosity | | | | |
RF Curiosity | | | | |
Random-Task VISR | | | | |
Constant Reward NSQ | | | | |
RF VISR (No GPI) | | | | |
RF VISR | | | | |
VISR (No GPI) | 11 | | | |
Full VISR | 9.72 | 72.83 | 38 | |

### 6.3 Low-data reinforcement learning

Comparisons to reinforcement learning algorithms in the low-data regime are largely based on the similar analysis in Kaiser et al. (2019) on the 26 easiest games in the Atari suite (as judged by above random performance for their algorithm). In that work the authors introduce a model-based agent (SimPLe) and show that it compares favorably to standard reinforcement learning algorithms when data is limited. Three canonical reinforcement learning algorithms are compared against: proximal policy optimization (PPO) (Schulman et al., 2017), Rainbow (Hessel et al., 2017), and DQN (Mnih et al., 2015). For each, the results from the lowest data regime reported are used. In addition, we also compare to a version of N-step Q-learning (NSQ) that uses the same codebase and base network architecture as VISR. The results in Table 3 show that VISR is highly competitive with the other RL methods. Note that, while these methods are actually solving the full RL problem, VISR’s performance is based exclusively on the solution of the regression problem (Equation 2). This solution can also be used to “warm start” an agent, which can then refine its policy using any RL algorithm. We expect this version of VISR to have even better performance.

Algorithm | Median | Mean | Human-level | |
---|---|---|---|---|
DQN | | | | |
NSQ | | | | |
SimPLe | 26 | | | |
Rainbow | | | | |
PPO | 20.93 | | | |
RF VISR (No GPI) | | | | |
RF VISR | | | | |
VISR (No GPI) | 104.27 | 8 | | |
Full VISR | | | | |

### 6.4 Fast-inference

In the previous results, it was assumed that solving the linear reward regression problem is the best way to infer the appropriate task vector. However, Eysenbach et al. (2018) suggest a simpler approach: exhaustive search. As there are no guarantees that the extrinsic rewards will be linear in the learned features ($\phi$), it is non-obvious which approach works better in practice. (Since VISR utilizes a continuous space of possible task vectors, exhaustive search must be replaced with random search.)

We hypothesize that exploiting the reward-regression task inference mechanism given by VISR should yield more efficient inference than random search. To show this, episodes are rolled out using a trained VISR, each conditioned on a task vector chosen uniformly on a -dimensional sphere. From these initial episodes, one can either pick the task vector corresponding to the trajectory with the highest return (random search), or combine the data across all episodes and solve the linear regression problem. In each condition the VISR policy given by the inferred task vector is executed for episodes and the average returns are compared.

As shown in Figure 2, linear regression substantially improves performance despite using data generated specifically to aid in random search. The mean performance across all 57 games was for reward-regression, compared to random search at . Even more dramatically, the median score for reward-regression was compared to random search at . Overall, VISR outperformed the random search alternative on 41 of the 57 games, with one tie, using the exact same data for task inference. This corroborates the main hypothesis of this paper, namely, that endowing ECP with the fast task-inference provided by SF gives rise to a powerful method able to quickly learn competent policies when exposed to a reward signal.
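The difference between the two inference schemes can be illustrated on synthetic data. The sketch below is a toy model only, not a reproduction of the Atari experiment: it assumes that an agent conditioned on a candidate task vector visits features that are noisy unit vectors near that candidate, and that the rewards are linear in those features up to small noise. All names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_episodes, n_steps = 5, 10, 20

def unit(x):
    """Project vectors onto the unit sphere along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

w_true = unit(rng.normal(size=d))                       # hidden task vector
w_candidates = unit(rng.normal(size=(n_episodes, d)))   # one per episode

# Toy model of a trained agent: when conditioned on w_i, the features it
# visits are noisy unit vectors concentrated around w_i.
noise = 0.3 * rng.normal(size=(n_episodes, n_steps, d))
features = unit(w_candidates[:, None, :] + noise)
rewards = features @ w_true + 0.01 * rng.normal(size=(n_episodes, n_steps))

# Random search: keep the candidate whose episode had the highest return.
w_search = w_candidates[rewards.sum(axis=1).argmax()]

# Reward regression: pool all transitions and solve r ≈ phi^T w.
w_reg, *_ = np.linalg.lstsq(features.reshape(-1, d), rewards.reshape(-1), rcond=None)
w_reg = unit(w_reg)

# Cosine similarity of each inferred task vector to the hidden one.
cos_search = float(w_search @ w_true)
cos_reg = float(w_reg @ w_true)
```

Under these assumptions regression pools information across all episodes and can recover a direction that none of the sampled candidates matches, whereas random search is limited to the best of the sampled candidates.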

## 7 Conclusions

Our results suggest that VISR is the first algorithm to achieve notable performance on the full Atari task suite in the no-reward reinforcement learning setting, outperforming all baselines and buying performance equivalent to hundreds of thousands of interaction steps for any traditional reinforcement learning algorithm.

As a suggestion for future investigations, the somewhat underwhelming results for the fully unsupervised version of VISR suggest that there is much room for improvement. While curiosity-based methods are transient (i.e., asymptotically their intrinsic reward vanishes) and lack a fast adaptation mechanism, they do seem to encourage exploratory behavior slightly more than VISR. A possible direction for future work would be to use a curiosity-based intrinsic reward inside of VISR, to encourage it to better explore the space of controllable policies. Another interesting avenue for future investigation would be to combine VISR with the approach recently proposed by Ozair et al. (2019), to enforce that the policies computed by VISR are not only distinguishable but also far apart in a given metric space.

By combining ECP and SF, we proposed an approach, VISR, that solves two open questions in the literature: how to get reliable generalization in the former and how to compute features for the latter. Beyond the concrete method proposed here, we believe bridging the gap between ECP and SF is an insightful contribution that may inspire other useful methods.

## References

- Agakov [2004] D. Barber and F. Agakov. The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems, 16:201, 2004.
- Barreto et al. [2017] A. Barreto, W. Dabney, R. Munos, J. Hunt, T. Schaul, H. van Hasselt, and D. Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2017.
- Barreto et al. [2018] A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Zidek, and R. Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In Proceedings of the International Conference on Machine Learning (ICML), pages 501–510, 2018.
- Barto [2013] A. G. Barto. Intrinsic motivation and reinforcement learning. In Intrinsically motivated learning in natural and artificial systems, pages 17–47. Springer, 2013.
- Barto et al. [2004] A. G. Barto, S. Singh, and N. Chentanez. Intrinsically motivated learning of hierarchical collections of skills. In Proceedings of the 3rd International Conference on Development and Learning, pages 112–19, 2004.
- Bellemare et al. [2016] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
- Bellemare et al. [2013] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 06 2013.
- Bengio et al. [2007] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pages 153–160, 2007.
- Borsa et al. [2018] D. Borsa, A. Barreto, J. Quan, D. Mankowitz, R. Munos, H. van Hasselt, D. Silver, and T. Schaul. Universal successor features approximators. arXiv preprint arXiv:1812.07626, 2018.
- Borsa et al. [2019] D. Borsa, A. Barreto, J. Quan, D. Mankowitz, H. van Hasselt, R. Munos, D. Silver, and T. Schaul. Universal successor features approximators. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- Burda et al. [2018] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018.
- Dayan [1993] P. Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
- Espeholt et al. [2018] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
- Eysenbach et al. [2018] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function, 2018. URL http://arxiv.org/abs/1802.06070.
- Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Gregor et al. [2016] K. Gregor, D. J. Rezende, and D. Wierstra. Variational intrinsic control. CoRR, abs/1611.07507, 2016. URL http://arxiv.org/abs/1611.07507.
- Hessel et al. [2017] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. CoRR, abs/1710.02298, 2017.
- Hochreiter and Schmidhuber [1997] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Kaiser et al. [2019] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.
- Kapturowski et al. [2018] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney. Recurrent experience replay in distributed reinforcement learning. 2018.
- Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Mohamed and Rezende [2015] S. Mohamed and D. J. Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 2125–2133, 2015.
- Oudeyer and Kaplan [2009] P.-Y. Oudeyer and F. Kaplan. What is intrinsic motivation? a typology of computational approaches. Frontiers in neurorobotics, 1:6, 2009.
- Ozair et al. [2019] S. Ozair, C. Lynch, Y. Bengio, A. v. d. Oord, S. Levine, and P. Sermanet. Wasserstein dependency measure for representation learning. arXiv preprint arXiv:1903.11780, 2019.
- Pathak et al. [2017] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
- Puterman [1994] M. L. Puterman. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
- Rasmus et al. [2015] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems 28, pages 3546–3554. 2015.
- Salge et al. [2014] C. Salge, C. Glackin, and D. Polani. Empowerment–an introduction. In Guided Self-Organization: Inception, pages 67–114. Springer, 2014.
- Salimans et al. [2016] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in neural information processing systems, pages 2234–2242, 2016.
- Schulman et al. [2015] J. Schulman, N. Heess, T. Weber, and P. Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528–3536, 2015.
- Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Şimşek and Barto [2006] Ö. Şimşek and A. G. Barto. An intrinsic reward mechanism for efficient exploration. In Proceedings of the 23rd international conference on Machine learning, pages 833–840. ACM, 2006.
- Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018. URL https://mitpress.mit.edu/books/reinforcement-learning-second-edition.
- Szepesvári [2010] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
- Warde-Farley et al. [2018] D. Warde-Farley, T. Van de Wiele, T. Kulkarni, C. Ionescu, S. Hansen, and V. Mnih. Unsupervised control through non-parametric discriminative rewards. arXiv preprint arXiv:1811.11359, 2018.
- Werbos et al. [1990] P. J. Werbos et al. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

## Appendix A

### A.1 Network Architecture

A distributed reinforcement learning setup was utilized to accelerate experimentation as per Espeholt et al. [2018]. This involved having separate actors, each running on its own instance of the environment. After every roll-out of steps, the experiences are added to a queue. This queue is used by the centralized learner to calculate all of the losses and change the weights of the network, which are then passed back to the actors.
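The actor/learner split described above can be illustrated with a toy sketch. It is not the authors' implementation: the constants `ROLLOUT_LEN`, `NUM_ACTORS` and `ROLLOUTS_PER_ACTOR`, the `Learner` class, and the fake environment transitions are all hypothetical stand-ins, and an integer version counter replaces real network weights.

```python
import queue
import threading

ROLLOUT_LEN = 4          # hypothetical roll-out length (not the paper's value)
NUM_ACTORS = 3           # hypothetical number of actors
ROLLOUTS_PER_ACTOR = 5   # hypothetical amount of work per actor


class Learner:
    """Centralized learner: consumes roll-outs and publishes new weights."""

    def __init__(self):
        self._lock = threading.Lock()
        self._weights_version = 0  # stand-in for real network weights
        self.steps_consumed = 0

    def get_weights(self):
        with self._lock:
            return self._weights_version

    def update(self, rollout):
        # A real learner would compute losses and apply gradients here.
        with self._lock:
            self.steps_consumed += len(rollout)
            self._weights_version += 1


def actor(actor_id, learner, rollout_queue):
    """Acts in its own (fake) environment and enqueues fixed-length roll-outs."""
    for _ in range(ROLLOUTS_PER_ACTOR):
        weights = learner.get_weights()  # pull the latest weights before acting
        rollout = [(actor_id, t, weights) for t in range(ROLLOUT_LEN)]
        rollout_queue.put(rollout)


def run():
    learner = Learner()
    rollout_queue = queue.Queue()
    threads = [
        threading.Thread(target=actor, args=(i, learner, rollout_queue))
        for i in range(NUM_ACTORS)
    ]
    for thread in threads:
        thread.start()
    # The learner drains exactly as many roll-outs as the actors produce.
    for _ in range(NUM_ACTORS * ROLLOUTS_PER_ACTOR):
        learner.update(rollout_queue.get())
    for thread in threads:
        thread.join()
    return learner
```

Because each actor reads the weights only at the start of a roll-out, the data it produces can lag slightly behind the learner; a real system has to tolerate or correct for that lag.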

For convenience, the roll-out length also implicitly determines other hyper-parameters, namely how far backpropagation through time proceeds before truncation [Werbos et al., 1990], since the sequential structure of the data is lost outside of the roll-out window. The task vector is also resampled every steps for similar reasons.

The network architecture is the same convolutional residual network as in Espeholt et al. [2018], with the following exceptions. and each have their own instance of this network (i.e. there is no parameter sharing). The network is conditioned on a task vector which is pre-processed as in Borsa et al. [2019]. Additionally, we found that individual cumulants in benefited from additional capacity, so each of the cumulants used a separate MLP with hidden units to process the output of the network trunk. While the trunk of the IMPALA network has an LSTM [Hochreiter and Schmidhuber, 1997], it is excluded from the network, as initial testing found that it destabilized training on some games. A target network was used for , with an update period of updates.

### A.2 Hyper-parameters

Due to the high computational cost, hyper-parameter optimization was minimal. The hyper-parameters were fixed across games and only optimized on the subset of games shown in the fast-inference experiment in the main paper. The Adam optimizer [Kingma and Ba, 2014] was used with a learning rate of and an ε of as in Kapturowski et al. [2018]. The dimensionality of task vectors was swept over (with values between and considered), with eventually chosen. We suspect the optimal value correlates with the amount of data available for reward regression. The discount factor was . A standard batch size of was used. A constant ε-greedy action-selection strategy was used, with an ε of for both training and testing.

### A.3 Experimental Methods

All experiments were conducted as in Mnih et al. [2015]. The frames are scaled to x , normalized, and the most recent frames are stacked. At the beginning of each episode, between and no-ops are executed to provide a source of stochasticity. A 5-minute time limit is imposed on both training and testing episodes.
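The frame stacking and random no-op starts described above can be sketched as follows. This is an illustration under assumptions: the names `FrameStack` and `sample_noop_count` are hypothetical, the stack size and no-op range are left as parameters because their values are not given here, and filling the stack with copies of the first frame on reset is a common convention rather than something the text specifies.

```python
import random
from collections import deque


class FrameStack:
    """Keeps the most recent `num_frames` observations as one stacked input."""

    def __init__(self, num_frames):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: pad the stack with copies of the first frame of the episode.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(frame)
        return list(self.frames)


def sample_noop_count(min_noops, max_noops, rng=random):
    """Number of no-op actions to execute at episode start (inclusive range)."""
    return rng.randint(min_noops, max_noops)
```

A bounded `deque` keeps the stacking logic to a single append per step, and the random no-op count is what makes the start of each episode stochastic.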

All results (except some reported from other papers) are averages over random seeds per game per condition. Due to the high computational cost of the controlled fast-inference experiments, an online evaluation scheme was used for the other experiments. Rather than actually performing no-reward reinforcement learning as distinct phases, reward information was exposed to of the actors, which used the task vector resulting from solving the reward regression via ordinary least squares (OLS). (Footnote 2: Since the default settings of the Atari environment were used, the rewards were clipped, though more recent experiments suggest unclipped rewards would be superior.) This regression was continuously solved using the most recent experiences from these actors. This has the benefit of producing learning curves (Figure 3) showing how well a given network could perform the extrinsically defined task if training stopped at that time point.
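To make the reward regression concrete, the sketch below fits a task vector by OLS on a small batch of (feature, reward) pairs. It assumes rewards are modeled as linear in a learned feature vector, as in the successor feature framework; the function names, the pure-Python normal-equation solver, and the toy data are illustrative choices and not the authors' implementation.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        residual = m[row][n] - sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = residual / m[row][row]
    return x


def ols_task_vector(features, rewards):
    """Return w minimizing sum_i (rewards[i] - dot(features[i], w)) ** 2."""
    d = len(features[0])
    # Normal equations: (Phi^T Phi) w = Phi^T r.
    gram = [[sum(f[i] * f[j] for f in features) for j in range(d)] for i in range(d)]
    moment = [sum(f[i] * r for f, r in zip(features, rewards)) for i in range(d)]
    return solve_linear_system(gram, moment)


# Toy usage: rewards generated exactly by w = [2, 3].
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_rewards = [2.0, 3.0, 5.0]
toy_w = ols_task_vector(toy_features, toy_rewards)
```

In the online evaluation scheme, a regression like this would be re-solved as new experiences replace old ones in the window, so the inferred task vector tracks the current features. With few samples or correlated features the normal equations can be ill-conditioned, in which case a regularized or library least-squares solver would be the more robust choice.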