Count-Based Exploration with Neural Density Models
Abstract
Bellemare et al. (2016) introduced the notion of a pseudo-count, derived from a density model, to generalize count-based exploration to non-tabular reinforcement learning. This pseudo-count was used to generate an exploration bonus for a DQN agent and, combined with a mixed Monte Carlo update, was sufficient to achieve state of the art on the Atari 2600 game Montezuma’s Revenge. We consider two questions left open by their work: First, how important is the quality of the density model for exploration? Second, what role does the Monte Carlo update play in exploration? We answer the first question by demonstrating the use of PixelCNN, an advanced neural density model for images, to supply a pseudo-count. In particular, we examine the intrinsic difficulties in adapting Bellemare et al.’s approach when assumptions about the model are violated. The result is a more practical and general algorithm requiring no special apparatus. We combine PixelCNN pseudo-counts with different agent architectures to dramatically improve the state of the art on several hard Atari games. One surprising finding is that the mixed Monte Carlo update is a powerful facilitator of exploration in the sparsest of settings, including Montezuma’s Revenge.
1 Introduction
Exploration is the process by which an agent learns about its environment. In the reinforcement learning framework, this involves reducing the agent’s uncertainty about the environment’s transition dynamics and attainable rewards. From a theoretical perspective, exploration is now well-understood (e.g. Strehl & Littman, 2008; Jaksch et al., 2010; Osband et al., 2016), and Bayesian methods have been successfully demonstrated in a number of settings (Deisenroth & Rasmussen, 2011; Guez et al., 2012). On the other hand, practical algorithms for the general case remain scarce; fully Bayesian approaches are usually intractable in large state spaces, and the count-based method typical of theoretical results is not applicable in the presence of value function approximation.
Recently, Bellemare et al. (2016) proposed the notion of pseudo-count as a reasonable generalization of the tabular setting considered in the theory literature. The pseudo-count is defined in terms of a density model $\rho$ trained on the sequence of states experienced by an agent:
$$\hat{N}_n(x) = \rho_n(x)\,\hat{n}_n(x),$$
where $\hat{n}_n(x)$ can be thought of as a total pseudo-count computed from the model’s recoding probability $\rho'_n(x)$, the probability of $x$ computed immediately after training on $x$. As a practical application the authors used the pseudo-counts derived from the simple CTS density model (Bellemare et al., 2014) to incentivize exploration in Atari 2600 agents. One of the main outcomes of their work was substantial empirical progress on the infamously hard game Montezuma’s Revenge.
Their method critically hinged on several assumptions regarding the density model: 1) the model should be learning-positive, i.e. the probability assigned to a state should increase with training; 2) it should be trained online, using each sample exactly once; and 3) the effective model step-size should decay at a rate of $n^{-1}$. Part of their empirical success also relied on a mixed Monte Carlo/Q-Learning update rule, which permitted fast propagation of the exploration bonuses.
In this paper, we set out to answer several research questions related to these modelling choices and assumptions:

To what extent does a better density model give rise to better exploration?

Can the above modelling assumptions be relaxed without sacrificing exploration performance?

What role does the mixed Monte Carlo update play in successfully incentivizing exploration?
In particular, we explore the use of PixelCNN (van den Oord et al., 2016b, a), a state-of-the-art neural density model. We examine the challenges posed by this approach:
Model choice. Performing two evaluations and one model update at each agent step (to compute $\rho_n(x)$ and $\rho'_n(x)$) can be prohibitively expensive. This requires the design of a simplified, yet sufficiently expressive and accurate, PixelCNN architecture.
Model training. A CTS model can naturally be trained from sequentially presented, correlated data samples. Training a neural model in this online fashion requires more careful attention to the optimization procedure to prevent overfitting and catastrophic forgetting (French, 1999).
Model use. The theory of pseudo-counts requires the density model’s rate of learning to decay over time. Optimization of a neural model, however, imposes constraints on the step-size regime which cannot be violated without deteriorating effectiveness and stability of training.
The concept of intrinsic motivation has made a recent resurgence in reinforcement learning research, in great part due to a dissatisfaction with $\epsilon$-greedy and Boltzmann policies. Of note, Tang et al. (2016) maintain an approximate count by means of hash tables over features, which in the pseudo-count framework corresponds to a hash-based density model. Houthooft et al. (2016) used a second-order Taylor approximation of the prediction gain to drive exploration in continuous control. As research moves towards ever more complex environments, we expect the trend towards more intrinsically motivated solutions to continue.
2 Background
2.1 Pseudo-Count and Prediction Gain
Here we briefly introduce notation and results, referring the reader to (Bellemare et al., 2016) for technical details.
Let $\rho$ be a density model on a finite space $\mathcal{X}$, and $\rho_n(x)$ the probability assigned by the model to $x$ after being trained on a sequence of states $x_1, \dots, x_n$. Assume $\rho_n(x) > 0$ for all $x, n$. The recoding probability $\rho'_n(x)$ is then the probability the model would assign to $x$ if it were trained on that same $x$ one more time. We call $\rho$ learning-positive if $\rho'_n(x) \geq \rho_n(x)$ for all $x_1, \dots, x_n, x \in \mathcal{X}$. The prediction gain (PG) of $\rho$ is
$$\mathrm{PG}_n(x) = \log \rho'_n(x) - \log \rho_n(x). \tag{1}$$
A learning-positive $\rho$ implies $\mathrm{PG}_n(x) \geq 0$ for all $x \in \mathcal{X}$. For learning-positive $\rho$, we define the pseudo-count as
$$\hat{N}_n(x) = \frac{\rho_n(x)\,\bigl(1 - \rho'_n(x)\bigr)}{\rho'_n(x) - \rho_n(x)},$$
derived from postulating that a single observation of $x \in \mathcal{X}$ should lead to a unit increase in pseudo-count:
$$\rho_n(x) = \frac{\hat{N}_n(x)}{\hat{n}_n(x)}, \qquad \rho'_n(x) = \frac{\hat{N}_n(x) + 1}{\hat{n}_n(x) + 1},$$
where $\hat{n}_n(x)$ is the pseudo-count total. The pseudo-count generalizes the usual state visitation count function $N_n(x)$. Under certain assumptions on $\rho_n$, pseudo-counts grow approximately linearly with real counts. Crucially, the pseudo-count can be approximated using the prediction gain of the density model:
$$\hat{N}_n(x) \approx \left( e^{\mathrm{PG}_n(x)} - 1 \right)^{-1}.$$
Its main use is to define an exploration bonus. We consider a reinforcement learning (RL) agent interacting with an environment that provides observations $x$ and extrinsic rewards $r$ (see Sutton & Barto, 1998, for a thorough exposition of the RL framework). To the reward at step $n$ we add the bonus
$$r^+(x) := \bigl(\hat{N}_n(x)\bigr)^{-1/2},$$
which incentivizes the agent to try to re-experience surprising situations. Quantities related to prediction gain have been used for similar purposes in the intrinsic motivation literature (Lopes et al., 2012), where they measure an agent’s learning progress (Oudeyer et al., 2007). Although the pseudo-count bonus is close to the prediction gain, it is asymptotically more conservative and supported by stronger theoretical guarantees.
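The relationship between the exact pseudo-count, its prediction-gain approximation, and the bonus can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the toy density model is assumed to be a plain empirical-frequency estimator, for which the exact pseudo-count should recover the true visit count. The exact and approximate forms differ by the factor $1 - \rho'$, which is close to 1 when the model assigns a small probability to any single state.

```python
import math


def prediction_gain(rho: float, rho_prime: float) -> float:
    """PG = log rho'(x) - log rho(x)."""
    return math.log(rho_prime) - math.log(rho)


def pseudo_count(rho: float, rho_prime: float) -> float:
    """Exact pseudo-count rho * (1 - rho') / (rho' - rho)."""
    if rho_prime <= rho:
        raise ValueError("requires rho' > rho (learning-positive step)")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


def pseudo_count_from_pg(pg: float) -> float:
    """Approximate pseudo-count (exp(PG) - 1)^-1, defined for PG > 0."""
    return 1.0 / math.expm1(pg)


def exploration_bonus(count: float) -> float:
    """Bonus count^(-1/2)."""
    return count ** -0.5


# Toy empirical-frequency model (an assumption for illustration):
# state x was seen 3 times in 10 observations.
visits, total = 3, 10
rho = visits / total                    # probability before the extra update
rho_prime = (visits + 1) / (total + 1)  # recoding probability

exact = pseudo_count(rho, rho_prime)
approx = pseudo_count_from_pg(prediction_gain(rho, rho_prime))
bonus = exploration_bonus(exact)
print(exact, approx, bonus)
```

In this toy case the exact pseudo-count equals the true count of 3, while the approximation overestimates it because $\rho' \approx 0.36$ is far from small; in a large state space the two are much closer.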
2.2 Density Models for Images
The CTS density model (Bellemare et al., 2014) is based on the namesake algorithm, Context Tree Switching (Veness et al., 2012), a Bayesian variable-order Markov model. In its simplest form, the model takes as input a 2D image and assigns to it a probability according to the product of location-dependent L-shaped filters, where the prediction of each filter is given by a CTS algorithm trained on past images. In Bellemare et al. (2016), this model was applied to 3-bit greyscale, downsampled Atari 2600 frames (Fig. 1). The CTS model presents advantages in terms of simplicity and performance but is limited in expressiveness, scalability, and data efficiency.
In recent years, neural generative models for images have achieved impressive successes in their ability to generate diverse images in various domains (Kingma & Welling, 2013; Rezende et al., 2014; Gregor et al., 2015; Goodfellow et al., 2014). In particular, van den Oord et al. (2016b, a) introduced PixelCNN, a fully convolutional neural network composed of residual blocks with multiplicative gating units, which models pixel probabilities conditional on previous pixels (in the usual top-left to bottom-right raster-scan order) by using masked convolution filters. This model achieved state-of-the-art modelling performance on standard datasets, paired with the computational efficiency of a convolutional feedforward network.
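To make the masking idea concrete, the following sketch builds the binary mask that is multiplied into a convolution filter so that each output pixel depends only on pixels earlier in raster-scan order. It assumes a single-channel (greyscale) image; the function name and the `include_center` flag are hypothetical, chosen to mirror the common distinction between a first-layer mask that hides the current pixel and later-layer masks that may see it.

```python
from typing import List


def causal_mask(height: int, width: int, include_center: bool) -> List[List[int]]:
    """Return a 0/1 mask for a height x width filter (odd sizes assumed).

    Entries above the centre row, or left of the centre within the centre
    row, are 1. The centre entry is 1 only if include_center is True.
    Everything after the centre in raster-scan order is 0.
    """
    centre_row, centre_col = height // 2, width // 2
    mask = []
    for i in range(height):
        row = []
        for j in range(width):
            before = i < centre_row or (i == centre_row and j < centre_col)
            at_centre = i == centre_row and j == centre_col
            row.append(1 if before or (at_centre and include_center) else 0)
        mask.append(row)
    return mask


first_layer_mask = causal_mask(3, 3, include_center=False)
later_layer_mask = causal_mask(3, 3, include_center=True)
for row in first_layer_mask:
    print(row)
```

Multiplying each filter element-wise by such a mask before the convolution is what keeps the model autoregressive while still allowing all pixel probabilities to be computed in a single forward pass.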
2.3 Multi-Step RL Methods
A distinguishing feature of reinforcement learning is that the agent “learns on the basis of interim estimates” (Sutton, 1996). For example, the Q-Learning update rule is
$$Q(x, a) \leftarrow Q(x, a) + \alpha \underbrace{\left[ r(x, a) + \gamma \max_{a'} Q(x', a') - Q(x, a) \right]}_{\delta(x, a)},$$
linking the reward $r$ and next-state value function $Q(x', a')$ to the current state value function $Q(x, a)$. This particular form is the stochastic update rule with step-size $\alpha$ and involves the TD-error $\delta$. In the approximate reinforcement learning setting, such as when $Q$ is represented by a neural network, this update is converted into a loss to be minimized, most commonly the squared loss $\delta^2(x, a)$.
It is well known that better performance, both in terms of learning efficiency and approximation error, is attained by multi-step methods (Sutton, 1996; Tsitsiklis & van Roy, 1997). These methods interpolate between one-step methods (Q-Learning) and the Monte-Carlo update
$$Q(x, a) \leftarrow Q(x, a) + \alpha \underbrace{\left[ \sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) - Q(x, a) \right]}_{\delta_{MC}(x, a)},$$
where $(x_t, a_t)_{t \in \mathbb{N}}$ is a sample path through the environment beginning in $(x, a)$. To achieve their success on the hardest Atari 2600 games, Bellemare et al. (2016) used the mixed Monte-Carlo update (MMC)
$$Q(x, a) \leftarrow Q(x, a) + \alpha \left[ (1 - \beta)\,\delta(x, a) + \beta\,\delta_{MC}(x, a) \right],$$
with $\beta \in [0, 1]$. This choice was made for “computational and implementational simplicity”, and is a particularly coarse multi-step method. A better multi-step method is the recent Retrace($\lambda$) algorithm (Munos et al., 2016). Retrace($\lambda$) uses a product of truncated importance sampling ratios $c_1, c_2, \dots$ to replace $\delta$ with the error term
$$\delta_{\mathrm{Retrace}(\lambda)} := \sum_{t=0}^{\infty} \gamma^t \left( \prod_{s=1}^{t} c_s \right) \delta(x_t, a_t),$$
effectively mixing in TD-errors from all future time steps. Munos et al. showed that Retrace is safe (does not diverge when trained on data from an arbitrary behaviour policy), and efficient (makes the most of multi-step returns).
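The four error terms discussed in this section can be written down in a few lines. The sketch below is a scalar, tabular-style illustration rather than the agents' actual implementation: function names are hypothetical, the mixing weight `beta = 0.1` in the example is an arbitrary illustrative value, and the Retrace coefficients are assumed to take the form $c_s = \lambda \min(1, \pi(a_s | x_s) / \mu(a_s | x_s))$ from Munos et al. (2016), with the per-step importance ratios supplied by the caller.

```python
from typing import Sequence


def td_error(reward: float, gamma: float, next_q_max: float, q: float) -> float:
    """One-step Q-Learning error: r + gamma * max_a' Q(x', a') - Q(x, a)."""
    return reward + gamma * next_q_max - q


def mc_error(rewards: Sequence[float], gamma: float, q: float) -> float:
    """Monte-Carlo error: discounted return of the sample path minus Q(x, a)."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret - q


def mmc_error(delta: float, delta_mc: float, beta: float) -> float:
    """Mixed Monte-Carlo error: (1 - beta) * delta + beta * delta_MC."""
    return (1.0 - beta) * delta + beta * delta_mc


def retrace_error(td_errors: Sequence[float], ratios: Sequence[float],
                  gamma: float, lam: float) -> float:
    """sum_t gamma^t (prod_{s=1..t} c_s) delta_t with c_s = lam * min(1, ratio_s).

    ratios[s] is the importance ratio at step s; ratios[0] is unused because
    the product is empty for t = 0.
    """
    total, weight = 0.0, 1.0
    for t, delta in enumerate(td_errors):
        if t > 0:
            weight *= gamma * lam * min(1.0, ratios[t])
        total += weight * delta
    return total


# A sparse-reward path: the only reward arrives two steps in the future.
delta = td_error(reward=0.0, gamma=0.5, next_q_max=0.0, q=0.0)
delta_mc = mc_error(rewards=[0.0, 0.0, 1.0], gamma=0.5, q=0.0)
delta_mmc = mmc_error(delta, delta_mc, beta=0.1)
delta_retrace = retrace_error([1.0, 1.0, 1.0], [1.0, 2.0, 0.5], gamma=0.5, lam=1.0)
print(delta, delta_mc, delta_mmc, delta_retrace)
```

In the sparse-reward example the one-step error is zero while the Monte-Carlo term already carries the distant reward, which is the propagation effect the mixed update relies on. In the Retrace example, the ratio of 0.5 at the last step shrinks the weight of every later TD-error, illustrating how off-policy steps cut the effective horizon.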
3 Using PixelCNN for Exploration
As mentioned in the Introduction, the theory of using density models for exploration makes several assumptions that translate into concrete requirements for an implementation:

(a) The density model should be trained completely online, i.e. exactly once on each state experienced by the agent, in the given sequential order.

(b) The prediction gain (PG) should decay at a rate $n^{-1}$ to ensure that pseudo-counts grow approximately linearly with real counts.

(c) The density model should be learning-positive.
Simultaneously, a partly competing set of requirements is posed by the practicalities of training a neural density model and using it as part of an RL agent:

(d) For stability, efficiency, and to avoid catastrophic forgetting in the context of a drifting data distribution, it is advantageous to train a neural model in minibatches, drawn randomly from a diverse dataset.

(e) For effective training, a certain optimization regime (e.g. a fixed learning rate schedule) has to be followed.

(f) The density model must be computationally lightweight, to allow computing the PG (two model evaluations and one update) as part of every training step of an RL agent.
We investigate how to best resolve these tensions in the context of the Arcade Learning Environment (Bellemare et al., 2013), a suite of benchmark Atari 2600 games.
3.1 Designing a Suitable Density Model
Driven by (f) and aiming for an agent with computational performance comparable to DQN, we design a slim variant of the PixelCNN network. Its core is a stack of 2 gated residual blocks with 16 feature maps (compared to 15 residual blocks with 128 feature maps in vanilla PixelCNN). As was done with the CTS model, images are downsampled to $42 \times 42$ and quantized to 3-bit greyscale. See Appendix A for technical details.
3.2 Training the Density Model
Instead of using randomized minibatches, we train the density model completely online on the sequence of experienced states. Empirically we found that with minor tuning of optimization hyperparameters we could train the model as robustly on a temporally correlated sequence of states as on a sequence with randomized order (Fig. 2(left)).
Besides satisfying the theoretical requirement (a), completely online training of the density model has the advantage that $\rho'_n = \rho_{n+1}$, so that the model update performed for computing the PG need not be reverted [1].
[1] The CTS model allows querying the PG cheaply, without incurring an actual update of model parameters.
Another more subtle reason for avoiding minibatch updates of the density model (despite (d)) is a practical optimization issue. The (necessarily online) computation of the PG involves a model update and hence the use of an optimizer. Advanced optimizers used with deep neural networks, like the RMSProp optimizer (Tieleman & Hinton, 2012) used in this work, are stateful, tracking running averages of e.g. mean and variance of the model parameters. If the model is additionally trained from minibatches, the two streams of updates may show different statistical characteristics (e.g. different gradient magnitudes), invalidating the assumptions underlying the optimization algorithm and leading to slower or unstable training.
To determine a suitable online learning rate schedule, we train the model on a sequence of 1M frames of experience of a random-policy agent. We compare the loss achieved by training procedures following constant or decaying learning rate schedules (see Fig. 3). The lowest final training loss is achieved by a constant learning rate of $0.001$ or a decaying learning rate of $0.1 \cdot n^{-1/2}$. We settled our choice on the constant learning rate schedule as it showed greater robustness with respect to the choice of initial learning rate.
PixelCNN rapidly learns a sensible distribution over state space. Fig. 2(left) shows the model’s loss decaying as it learns to exploit image regularities. Spikes in its loss function quickly start to correspond to visually meaningful events, such as the starts of episodes (Fig. 2(middle)). A video of early density model training is provided in http://youtu.be/T6iaa8Z4eyE.
3.3 Computing the Pseudo-Count
From the previous section we obtain a particular learning rate schedule that cannot be arbitrarily modified without deteriorating the model’s training performance or stability. To achieve the required PG decay (b), we instead replace $\mathrm{PG}_n$ by $c_n \cdot \mathrm{PG}_n$ with a suitably decaying sequence $c_n$.
In experiments comparing actual agent performance we empirically determined that in fact the constant learning rate $0.001$, paired with a PG decay $c_n = c \cdot n^{-1/2}$, obtains the best exploration results on hard exploration games like Montezuma’s Revenge (see Fig. 2(right)). We find the model to be robust across 1-2 orders of magnitude for the value of $c$, and informally determine $c = 0.1$ to be a sensible configuration for achieving good results on a broad range of Atari 2600 games (see also Section 7).
Regarding (c), it is hard to ensure learning-positiveness for a deep neural model, and a negative PG can occur whenever the optimizer ‘overshoots’ a local loss minimum. As a workaround, we threshold the PG value at 0. To summarize, the computed pseudo-count is
$$\hat{N}_n(x) = \left( \exp\left( c \cdot n^{-1/2} \cdot \bigl(\mathrm{PG}_n(x)\bigr)_+ \right) - 1 \right)^{-1}.$$
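The summary above can be sketched in code under the following assumptions: the PG is scaled by a decaying factor of the form $c \cdot n^{-1/2}$, negative PG values are thresholded at 0, and the bonus is the inverse square root of the resulting pseudo-count. The function name is hypothetical, and the scale `c` is left as an explicit argument rather than fixed to any particular value.

```python
import math


def bonus_from_pg(pg: float, n: int, c: float) -> float:
    """Exploration bonus N^(-1/2) for N = (exp(c * n^(-1/2) * max(PG, 0)) - 1)^-1.

    Since N^(-1/2) = sqrt(exp(scaled) - 1), the bonus is computed directly,
    which avoids dividing by zero when the thresholded PG is 0.
    """
    if n < 1:
        raise ValueError("n counts model updates and must be >= 1")
    scaled = c * n ** -0.5 * max(pg, 0.0)
    return math.sqrt(math.expm1(scaled))


# A negative PG (optimizer overshoot) yields no bonus at all.
overshoot_bonus = bonus_from_pg(pg=-0.3, n=100, c=0.1)

# The same PG produces a smaller bonus later in training.
early_bonus = bonus_from_pg(pg=2.0, n=100, c=0.1)
late_bonus = bonus_from_pg(pg=2.0, n=10_000, c=0.1)
print(overshoot_bonus, early_bonus, late_bonus)
```

Working with `math.expm1` keeps the computation accurate when the scaled PG is tiny, which is the typical regime once $n$ grows large.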
4 Exploration in Atari 2600 Games
Having described our pseudo-count friendly adaptation of PixelCNN, we now study its performance on Atari games. To this end we augment the environment reward with a pseudo-count exploration bonus, yielding the combined reward $r(x) + r^+(x)$. As usual for neural network-based agents, we ensure the total reward lies in $[-1, 1]$ by clipping larger values.
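A minimal sketch of the reward combination, assuming that clipping is applied to the sum of the extrinsic reward and the bonus (the text does not say whether the extrinsic reward is clipped separately first). The function name is hypothetical.

```python
def combined_reward(extrinsic: float, bonus: float) -> float:
    """Sum of extrinsic reward and exploration bonus, clipped to [-1, 1]."""
    return max(-1.0, min(1.0, extrinsic + bonus))


no_reward_step = combined_reward(0.0, 0.3)    # bonus passes through unchanged
rewarded_step = combined_reward(1.0, 0.3)     # saturates at the upper bound
print(no_reward_step, rewarded_step)
```

Under this assumption the bonus has no effect on steps where the extrinsic reward already reaches the upper bound, so it matters most on the many steps with zero extrinsic reward.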
4.1 DQN with PixelCNN Exploration Bonus
Our first set of experiments provides the PixelCNN exploration bonus to a DQN agent (Mnih et al., 2015) [2]. At each agent step, the density model receives a single frame, with which it simultaneously updates its parameters and outputs the PG. We refer to this agent as DQN-PixelCNN.
[2] Unlike Bellemare et al. we use regular Q-Learning instead of Double Q-Learning (van Hasselt et al., 2016), as our early experiments showed no significant advantage of DoubleDQN with the PixelCNN-based exploration reward.
The DQN-CTS agent we compare against is derived from the one in (Bellemare et al., 2016). For better comparability, it is trained in the same online fashion as DQN-PixelCNN, i.e. the PG is computed whenever we train the density model. By contrast, the original DQN-CTS queried the PG at the end of each episode.
Unless stated otherwise, we always use the mixed Monte Carlo update (MMC) for the intrinsically motivated agents [3], but regular Q-Learning for the baseline DQN.
[3] The use of MMC in a replay-based agent poses a minor complication, as the MC return is not available for replay until the end of an episode. For simplicity, in our implementation we disregard this detail and set the MC return to 0 for transitions from the most recent episode.
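The bookkeeping implied by the remark on MC returns in a replay-based agent can be sketched as follows. This is an assumed, simplified layout (one reward list per episode, hypothetical function names), not the agents' actual replay code: returns are computed backwards once an episode has finished, and transitions from a still-running episode receive a placeholder MC return of 0.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step of a finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def replay_mc_returns(rewards: Sequence[float], gamma: float,
                      episode_finished: bool) -> List[float]:
    """MC returns stored with replayed transitions.

    For the most recent, unfinished episode the true return is unknown,
    so 0 is stored instead (the simplification described in the text).
    """
    if not episode_finished:
        return [0.0] * len(rewards)
    return discounted_returns(rewards, gamma)


finished = replay_mc_returns([0.0, 0.0, 1.0], gamma=0.5, episode_finished=True)
ongoing = replay_mc_returns([0.0, 0.0, 1.0], gamma=0.5, episode_finished=False)
print(finished, ongoing)
```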
Fig. 5 shows training curves of DQN compared to DQN-CTS and DQN-PixelCNN. On the famous Montezuma’s Revenge, both intrinsically motivated agents vastly outperform the baseline DQN. On other hard exploration games (Private Eye; or Venture, appendix Fig. 15), DQN-PixelCNN achieves state-of-the-art results, substantially outperforming DQN and DQN-CTS. The other two games shown (Asteroids, Berzerk) pose easier exploration problems, where the reward bonus should not provide large improvements and may have a negative effect by skewing the reward landscape. Here, DQN-PixelCNN behaves more gracefully and still outperforms DQN-CTS. We hypothesize this is due to a qualitative difference between the models (see Section 5).
Overall PixelCNN provides the DQN agent with a larger advantage than CTS, and often accelerates or stabilizes training even when not affecting peak performance. Out of 57 Atari games, DQN-PixelCNN outperforms DQN-CTS in 52 games by maximum achieved score, and 51 by AUC (methodology in Appendix B). See Fig. 6 for a high level comparison (appendix Fig. 15 for full training graphs). The greatest gains from using either exploration bonus are observed in games categorized as hard exploration games in the ‘taxonomy of exploration’ in (Bellemare et al., 2016, reproduced in Appendix D), specifically in the most challenging sparse reward games (e.g. Montezuma’s Revenge, Private Eye, Venture).
4.2 A Multi-Step RL Agent with PixelCNN
Empirical practitioners know that techniques beneficial for one agent architecture often can be detrimental for a different algorithm. To demonstrate the wide applicability of the PixelCNN exploration bonus, we also evaluate it with the more recent Reactor agent [4] (Gruslys et al., 2017). This replay-based actor-critic agent represents its policy and value function by a recurrent neural network and, crucially, uses the multi-step Retrace($\lambda$) algorithm for policy evaluation, replacing the MMC we use in DQN-PixelCNN.
[4] The exact agent variant is referred to as ‘$\beta$-LOO’ with $\beta = 1$.
To reduce impact on computational efficiency of this agent, we subsample intrinsic rewards: we perform updates of the PixelCNN model and compute the reward bonus on (randomly chosen) 25% of all steps, leaving the agent’s reward unchanged on other steps. We use the same PG decay schedule of $0.1 \cdot n^{-1/2}$, with $n$ the number of model updates.
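The subsampling scheme can be sketched as a thin wrapper around the (expensive) density-model step. All names are hypothetical; `update_and_score` stands in for whatever routine updates the density model on the current frame and returns its bonus, and each step is assumed to be selected independently with probability 0.25.

```python
import random
from typing import Callable


def subsampled_bonus(rng: random.Random,
                     update_and_score: Callable[[], float],
                     probability: float = 0.25) -> float:
    """Run the density-model step on a random fraction of agent steps.

    On skipped steps the model is not touched and the bonus is 0, so the
    agent's reward is left unchanged.
    """
    if rng.random() < probability:
        return update_and_score()
    return 0.0


# Stand-in for the density model: counts how often it is actually updated.
counter = {"updates": 0}


def fake_model_step() -> float:
    counter["updates"] += 1
    return 0.05  # constant dummy bonus for illustration


rng = random.Random(0)
bonuses = [subsampled_bonus(rng, fake_model_step) for _ in range(10_000)]
print(counter["updates"] / len(bonuses))
```

Because the model is only updated on the selected steps, the decay variable $n$ in the PG schedule would count these model updates rather than agent steps, in line with the text.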
Training curves for the Reactor/Reactor-PixelCNN agent compared to DQN/DQN-PixelCNN are shown in Fig. 7. The baseline Reactor agent is superior to the DQN agent, obtaining higher scores and learning faster in about 50 out of 57 games. It is further improved on a large fraction of games by the PixelCNN exploration reward (see Fig. 8; full training graphs in appendix Fig. 16).
The effect of the exploration bonus is rather uniform, yielding improvements on a broad range of games. In particular, Reactor-PixelCNN enjoys better sample efficiency (in terms of area under the curve, AUC) than vanilla Reactor. We hypothesize that, like other policy gradient algorithms, Reactor generally suffers from weaker exploration than its value-based counterpart DQN. This aspect is much helped by the exploration bonus, boosting the agent’s sample efficiency in many environments.
However, on hard exploration games with sparse rewards, Reactor seems unable to make full use of the exploration bonus. We believe this is because, in very sparse settings, the propagation of reward information across long horizons becomes crucial. The MMC takes one extreme of this view, directly learning from the observed returns. The Retrace($\lambda$) algorithm, on the other hand, has an effective horizon which depends on $\lambda$ and, critically, the truncated importance sampling ratio. This ratio results in the discarding of trajectories which are off-policy, i.e. unlikely under the current policy. We hypothesize that the very goal of the Retrace($\lambda$) algorithm to learn cautiously is what prevents it from taking full advantage of the exploration bonus!
5 Quality of the Density Model
PixelCNN can be expected to be more expressive and accurate than the less advanced CTS model, and indeed, samples generated after training are somewhat higher quality (Fig. 4). However, we are not using the generative function of the models when computing an exploration bonus, and a better generative model does not necessarily give rise to better probability estimates (Theis et al., 2016).
In Fig. 9 we compare the PG produced by the two models throughout 5K training steps. PixelCNN consistently produces PGs lower than CTS. More importantly, its PGs are smoother, exhibiting less variance between successive states, while showing more pronounced peaks at certain infrequent events. This yields a reward bonus that is less harmful in easy exploration games, while providing a strong signal in the case of novel or rare events.
Another distinguishing feature of PixelCNN is its non-decaying step-size. The per-step PG never completely vanishes, as the model tracks the most recent data. This provides an unexpected benefit: the agent remains mildly surprised by significant state changes, e.g. switching rooms in Montezuma’s Revenge. These persistent rewards act as milestones that the agent learns to return to. This is illustrated in Fig. 10, depicting the intrinsic reward over the course of an episode. The agent routinely revisits the right-hand side of the torch room, not because it leads to reward but just to “take in the sights”. A video of the episode is provided at http://youtu.be/232tOUPKPoQ [5].
[5] Another agent video on the game Private Eye can be found at http://youtu.be/kNyFygeUa2E.
Lastly, PixelCNN’s convolutional nature is expected to be beneficial for its sample efficiency. In Appendix C we compare to a convolutional CTS and confirm that this explains part, but not all of PixelCNN’s advantage over vanilla CTS.
6 Importance of the Monte Carlo Return
As with DQN-CTS, the success of DQN-PixelCNN hinges on the use of the mixed Monte Carlo update. The transient and vanishing nature of the exploration rewards requires the learning algorithm to latch on to these rapidly. The MMC serves this end as a simple multi-step method, helping to propagate reward information faster. An additional benefit lies in the fact that the Monte Carlo return helps bridge long horizons in environments where rewards are far apart and encountered rarely. On the other hand, it is important to note that the Monte Carlo return’s on-policy nature increases variance in the learning algorithm, and can prevent the algorithm’s convergence to the optimal policy when training off-policy. It can therefore be expected to adversely affect training performance in some games.
To distill the effect of the MMC on performance, we compare all four combinations of DQN with/without PixelCNN exploration bonus and with/without MMC. Fig. 11 shows the performance of these four agent variants (graphs for all games are shown in Fig. 17). These games were picked to illustrate several commonly occurring cases:

MMC speeds up training and improves final performance significantly (examples: Bank Heist, Time Pilot). In these games, MMC alone explains most or all of the improvement of DQN-PixelCNN over DQN.

MMC hurts performance (examples: Ms. Pac-Man, Breakout). Here too, MMC alone explains most of the difference between DQN-PixelCNN and DQN.

MMC and PixelCNN reward bonus have a compounding effect (example: H.E.R.O.).
Most importantly, the situation is rather different when we restrict our attention to the hardest exploration games with sparse rewards. Here the baseline DQN agent fails to make any training progress, and neither the Monte Carlo return nor the exploration bonus alone provides any significant benefit. Their combination, however, grants the agent rapid training progress and allows it to achieve high performance.
One effect of the exploration bonus in these games is to provide a denser reward landscape, enabling the agent to learn meaningful policies. Due to the transient nature of the exploration bonus, the agent needs to be able to learn from this reward signal faster than regular one-step methods allow, and MMC proves to be an effective solution.
7 Pushing the Limits of Intrinsic Motivation
In this section we explore the idea of a ‘maximally curious’ agent, whose reward function is dominated by the exploration bonus. For that we increase the PG scale, previously chosen conservatively to avoid adverse effects on easy exploration games.
Fig. 12 shows DQN-PixelCNN performance on the hardest exploration games when the PG scale is increased by 1-2 orders of magnitude. The algorithm seems fairly robust across a wide range of scales: the main effect of increasing this parameter is to trade off exploration (seeking maximal reward) with exploitation (optimizing the current policy).
As expected, a higher PG scale translates to stronger exploration: several runs obtain record peak scores (900 in Gravitar, 6,600 in Montezuma’s Revenge, 39,000 in Private Eye, 1,500 in Venture) surpassing the state of the art by a substantial margin (for previously published results, see Appendix D). Aggressive scaling speeds up the agent’s exploration and achieves peak performance rapidly, but can also deteriorate its stability and long-term performance. Note that in practice, because of the non-decaying step-size, the PG does not vanish. After reward clipping, an overly inflated exploration bonus can therefore become essentially constant, no longer providing a useful intrinsic motivation signal to the agent.
Another way of creating an entirely curiosity-driven agent is to ignore the environment reward altogether and train based on the exploration reward only (see Fig. 13). Remarkably, the curiosity signal alone is sufficient to train a high-performing agent (measured by environment reward!).
It is worth noting that agents with exploration bonus seem to ‘never stop exploring’: for different seeds, the agents make learning progress at very different times during training, a qualitative difference to vanilla DQN.
8 Conclusion
We demonstrated the use of PixelCNN for exploration and showed that its greater accuracy and expressiveness translate into a more useful exploration bonus than that obtained from previous models. While the current theory of pseudo-counts puts stringent requirements on the density model, we have shown that PixelCNN can be used in a simpler and more general setup, and can be trained completely online. It also proves to be widely compatible with both value-function and policy-based RL algorithms.
In addition to pushing the state of the art on the hardest exploration problems among the Atari 2600 games, PixelCNN improves speed of learning and stability of baseline RL agents across a wide range of games. The quality of its reward bonus is evidenced by the fact that on sparse reward games, this signal alone suffices to learn to achieve significant scores, creating a truly intrinsically motivated agent.
Our analysis also reveals the importance of the Monte Carlo return for effective exploration. The comparison with more sophisticated but fixed-horizon multi-step methods shows that its significance lies both in faster learning in the context of a useful but transient reward function, as well as bridging reward gaps in environments where extrinsic and intrinsic rewards are, or quickly become, extremely sparse.
Acknowledgements
The authors thank Tom Schaul, Olivier Pietquin, Ian Osband, Sriram Srinivasan, Tejas Kulkarni, Alex Graves, Charles Blundell, and Shimon Whiteson for invaluable feedback on the ideas presented here, and Audrunas Gruslys especially for providing the Reactor agent.
References
 Bellemare et al. (2014) Bellemare, Marc, Veness, Joel, and Talvitie, Erik. Skip context tree switching. Proceedings of the International Conference on Machine Learning, 2014.
 Bellemare et al. (2013) Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Bellemare et al. (2016) Bellemare, Marc G., Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Rémi. Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems, 2016.
 Deisenroth & Rasmussen (2011) Deisenroth, Marc P. and Rasmussen, Carl E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning, 2011.
 French (1999) French, Robert M. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
 Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
 Gregor et al. (2015) Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. In Proceedings of the International Conference on Machine Learning, 2015.
 Gruslys et al. (2017) Gruslys, Audrunas, Azar, Mohammad Gheshlaghi, Bellemare, Marc G., and Munos, Rémi. The Reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.
 Guez et al. (2012) Guez, Arthur, Silver, David, and Dayan, Peter. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, 2012.
 Houthooft et al. (2016) Houthooft, Rein, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. Variational information maximizing exploration. In Advances in Neural Information Processing Systems (NIPS), 2016.
 Jaksch et al. (2010) Jaksch, Thomas, Ortner, Ronald, and Auer, Peter. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
 Kingma & Welling (2013) Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2013.
 Lopes et al. (2012) Lopes, Manuel, Lang, Tobias, Toussaint, Marc, and Oudeyer, Pierre-Yves. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems, 2012.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Munos et al. (2016) Munos, Rémi, Stepleton, Tom, Harutyunyan, Anna, and Bellemare, Marc G. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.
 Osband et al. (2016) Osband, Ian, van Roy, Benjamin, and Wen, Zheng. Generalization and exploration via randomized value functions. In Proceedings of the International Conference on Machine Learning, 2016.
 Oudeyer et al. (2007) Oudeyer, Pierre-Yves, Kaplan, Frédéric, and Hafner, Verena V. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
 Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The International Conference on Machine Learning, 2014.
 Strehl & Littman (2008) Strehl, Alexander L. and Littman, Michael L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
 Sutton (1996) Sutton, Richard S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, 1996.
 Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Reinforcement learning: An introduction. MIT Press, 1998.
 Tang et al. (2016) Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke, Adam, Chen, Xi, Duan, Yan, Schulman, John, De Turck, Filip, and Abbeel, Pieter. #Exploration: A study of count-based exploration for deep reinforcement learning. arXiv preprint arXiv:1611.04717, 2016.
 Theis et al. (2016) Theis, Lucas, van den Oord, Aäron, and Bethge, Matthias. A note on the evaluation of generative models. In Proceedings of the International Conference on Learning Representations, 2016.
 Tieleman & Hinton (2012) Tieleman, Tijmen and Hinton, Geoffrey. RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA. Lecture 6.5 of Neural Networks for Machine Learning, 2012.
 Tsitsiklis & van Roy (1997) Tsitsiklis, John N. and van Roy, Benjamin. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
 van den Oord et al. (2016a) van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, 2016a.
 van den Oord et al. (2016b) van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, 2016b.
 van Hasselt et al. (2016) van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with Double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
 Veness et al. (2012) Veness, Joel, Ng, Kee Siong, Hutter, Marcus, and Bowling, Michael H. Context tree switching. In Proceedings of the Data Compression Conference, 2012.
 Wang et al. (2016) Wang, Ziyu, Schaul, Tom, Hessel, Matteo, van Hasselt, Hado, Lanctot, Marc, and de Freitas, Nando. Dueling network architectures for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1995–2003, 2016.
Appendix A PixelCNN Hyperparameters
The PixelCNN model used in this paper is a lightweight variant of the Gated PixelCNN introduced by van den Oord et al. (2016a). It consists of a masked convolution, followed by two residual blocks with masked convolutions with 16 feature planes, and another masked convolution producing feature planes, which are mapped by a final masked convolution to the output logits. Inputs are greyscale images, with pixel values quantized to bins.
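As an illustration of what "masked convolution" means here, the following minimal sketch builds the usual PixelCNN-style causal mask for a single-channel (greyscale) kernel. The function name `causal_mask`, the `include_center` flag, and the 3x3 kernel size in the usage note are assumptions made for illustration; the kernel sizes of the actual model are not restated in this appendix.

```python
def causal_mask(kernel_size, include_center):
    """Return a kernel_size x kernel_size list of 0/1 mask values.

    A weight is kept (1) only if it looks at a pixel that precedes the
    centre pixel in raster-scan order: any row above the centre, or the
    centre row strictly to the left of the centre. With include_center
    set, the centre weight itself is kept as well (commonly done in all
    layers after the first one).
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    centre = kernel_size // 2
    mask = []
    for row in range(kernel_size):
        mask_row = []
        for col in range(kernel_size):
            precedes = row < centre or (row == centre and col < centre)
            is_centre = row == centre and col == centre
            keep = precedes or (include_center and is_centre)
            mask_row.append(1 if keep else 0)
        mask.append(mask_row)
    return mask
```

Multiplying a convolution kernel elementwise by such a mask ensures that the prediction for a pixel depends only on pixels that come before it, which is what allows the network's outputs to be read as a valid factorized density. For example, `causal_mask(3, False)` keeps the full top row and the left cell of the middle row, and nothing else.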
The model is trained completely online, from the stream of Atari frames experienced by an agent. Optimization is performed with the (uncentered) RMSProp optimizer (Tieleman & Hinton, 2012) with momentum , decay and epsilon .
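For readers unfamiliar with the optimizer, the sketch below shows one common formulation of the uncentered RMSProp update with momentum, written in plain Python. The function names, the placement of epsilon inside the square root, and the learning-rate argument `lr` are assumptions; libraries differ in these details, and the hyperparameter values used in the paper are not reproduced here.

```python
import math


def init_state(num_params):
    """Create zero-initialised optimizer state for num_params parameters."""
    return {"ms": [0.0] * num_params, "mom": [0.0] * num_params}


def rmsprop_step(params, grads, state, lr, decay, momentum, epsilon):
    """Apply one uncentered RMSProp update and return the new parameters.

    ms  : running average of squared gradients (uncentered: the mean
          gradient is not subtracted before squaring).
    mom : momentum buffer that accumulates the scaled gradient steps.
    The state dict is updated in place.
    """
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["ms"][i] = decay * state["ms"][i] + (1.0 - decay) * g * g
        scaled = lr * g / math.sqrt(state["ms"][i] + epsilon)
        state["mom"][i] = momentum * state["mom"][i] + scaled
        new_params.append(p - state["mom"][i])
    return new_params
```

In an online setting such as the one described above, `rmsprop_step` would be called once per observed frame (or small batch of frames), with `grads` being the gradient of the model's negative log-likelihood on that frame.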
Appendix B Methodology
Unless otherwise stated, all agent performance graphs in this paper show the agent’s training performance, measured as the undiscounted per-episode return, averaged over 1M environment frames per data point.
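The averaging step can be sketched as follows. This is only one plausible reading of the procedure: the function name `bin_episode_returns` is hypothetical, and attributing each episode to the 1M-frame window in which it ends is an assumption that the text does not spell out.

```python
def bin_episode_returns(episodes, frames_per_point=1_000_000):
    """Average undiscounted episode returns within fixed-size frame windows.

    episodes: iterable of (end_frame, episode_return) pairs, where
        end_frame is the cumulative number of environment frames seen
        when the episode finished (1-based).
    Returns one value per window; a window in which no episode ended is
    reported as None.
    """
    sums, counts = {}, {}
    for end_frame, episode_return in episodes:
        if end_frame < 1:
            raise ValueError("end_frame must be at least 1")
        # Frames 1..frames_per_point fall into window 0, and so on.
        index = (end_frame - 1) // frames_per_point
        sums[index] = sums.get(index, 0.0) + episode_return
        counts[index] = counts.get(index, 0) + 1
    if not counts:
        return []
    return [
        sums[i] / counts[i] if i in counts else None
        for i in range(max(counts) + 1)
    ]
```

Each entry of the returned list corresponds to one data point on a training curve.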
The algorithm-comparison graphs (Fig. 6 and Fig. 8) show the relative improvement of one algorithm over another in terms of area under the curve (AUC). A comparison by maximum achieved score would yield similar overall results, but would underestimate the advantage in learning speed (sample efficiency) and stability that the intrinsically motivated and MMC-based agents show over the baselines.
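To make the AUC comparison concrete, here is a minimal sketch that integrates two training curves with the trapezoidal rule and reports the relative difference. The normalization used, (AUC_new - AUC_base) / |AUC_base|, is an assumption for illustration; the exact normalization behind the figures is not given in this appendix, and both function names are hypothetical.

```python
def area_under_curve(xs, ys):
    """Trapezoidal-rule area under the curve through the points (xs[i], ys[i])."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two points and equal-length inputs")
    total = 0.0
    for i in range(1, len(xs)):
        total += 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1])
    return total


def relative_auc_improvement(xs, ys_new, ys_base):
    """Relative AUC improvement of ys_new over ys_base on shared x values.

    Assumed normalization: difference in AUC divided by the magnitude of
    the baseline AUC.
    """
    auc_new = area_under_curve(xs, ys_new)
    auc_base = area_under_curve(xs, ys_base)
    if auc_base == 0:
        raise ValueError("baseline AUC is zero; relative improvement undefined")
    return (auc_new - auc_base) / abs(auc_base)
```

The point made in the text shows up directly in a toy case: the curves `[0, 100, 100]` and `[0, 0, 100]` (evaluated at `xs = [0, 1, 2]`) reach the same maximum score, so a maximum-score comparison sees no difference, while the AUC comparison credits the first curve for learning faster.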
Appendix C Convolutional CTS
In Section 4 we saw that DQN-PixelCNN outperforms DQN-CTS in most of the 57 Atari games, providing a more impactful exploration bonus in hard exploration games and a more graceful (less harmful) one in games where the learning algorithm does not benefit from the additional curiosity signal. One may wonder whether this improvement is due to PixelCNN being a generally more expressive and accurate density model, or simply to its convolutional nature, which gives it an advantage in generalization and sample efficiency over a model that represents pixel probabilities in a completely location-dependent way.
To answer this question, we developed a convolutional variant of the CTS model. This variant uses a single set of parameters, shared across all pixel locations, to condition a pixel’s value on its predecessors, instead of the location-dependent parameters of regular CTS. In Fig. 14 we contrast the performance of DQN, DQN-MC, DQN-CTS, DQN-ConvCTS and DQN-PixelCNN on 6 example games.
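The difference between location-dependent and shared parameters can be illustrated with a toy factored model. The sketch below is not CTS: it swaps in a simple Laplace-smoothed frequency estimator per context, and uses only the left and upper neighbours as context (with zero padding at the borders). The class name `FactoredCountModel` and all of its details are assumptions made purely to show the effect of parameter sharing.

```python
class FactoredCountModel:
    """Toy factored image density with Laplace-smoothed count estimators.

    With share_across_locations=False, every pixel location keeps its own
    table of counts per context (location-dependent parameters). With
    share_across_locations=True, one table per context is shared by all
    locations (the "convolutional" arrangement).
    """

    def __init__(self, num_values, share_across_locations):
        self.num_values = num_values
        self.share = share_across_locations
        self.counts = {}

    def _key(self, image, row, col):
        # Context: left and upper neighbour, zero-padded at the borders.
        left = image[row][col - 1] if col > 0 else 0
        up = image[row - 1][col] if row > 0 else 0
        context = (left, up)
        return context if self.share else (row, col, context)

    def prob(self, image):
        """Probability of the image as a product of per-pixel factors."""
        p = 1.0
        for row, line in enumerate(image):
            for col, value in enumerate(line):
                key = self._key(image, row, col)
                counts = self.counts.get(key, [0] * self.num_values)
                p *= (counts[value] + 1) / (sum(counts) + self.num_values)
        return p

    def update(self, image):
        """Record one observation of the image."""
        for row, line in enumerate(image):
            for col, value in enumerate(line):
                key = self._key(image, row, col)
                counts = self.counts.setdefault(key, [0] * self.num_values)
                counts[value] += 1
```

After a single update on the one-row binary image `[[1, 0, 0]]`, the shared model assigns the shifted image `[[0, 1, 0]]` a higher probability than the location-dependent model does, because what it learned about a bright pixel at one position transfers to other positions. This is the kind of generalization advantage that the convolutional CTS variant is meant to isolate.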
We first consider dense-reward games such as Q*Bert and Zaxxon, where most of the improvement comes from the MMC update and the exploration bonus hurts performance. Here convolutional CTS behaves much like PixelCNN, leaving agent performance unaffected, whereas regular CTS causes the agent to train more slowly or to reach an earlier performance plateau. On the sparse-reward games (Gravitar, Private Eye, Venture), however, convolutional CTS proves as inferior to PixelCNN as the vanilla CTS variant, failing to achieve the significant improvements over the baseline agents presented in this paper.
We conclude that while the convolutional aspect plays a role in the “softer” nature of the PixelCNN model compared to its CTS counterpart, it alone is insufficient to explain the massive exploration boost that the PixelCNN-derived reward provides to the DQN agent. The more advanced model’s accuracy advantage translates into a more targeted and useful curiosity signal for the agent, which distinguishes novel from well-explored states more clearly and allows for more effective exploration.
Appendix D The Hardest Exploration Games
| Easy Exploration: Human-Optimal | Easy Exploration: Human-Optimal (cont.) | Easy Exploration: Score Exploit | Hard Exploration: Dense Reward | Hard Exploration: Sparse Reward |
|---|---|---|---|---|
| Assault | Asterix | Beam Rider | Alien | Freeway |
| Asteroids | Atlantis | Kangaroo | Amidar | Gravitar |
| Battle Zone | Berzerk | Krull | Bank Heist | Montezuma’s Revenge |
| Bowling | Boxing | Kung-fu Master | Frostbite | Pitfall! |
| Breakout | Centipede | Road Runner | H.E.R.O. | Private Eye |
| Chopper Cmd | Crazy Climber | Seaquest | Ms. Pac-Man | Solaris |
| Defender | Demon Attack | Up n Down | Q*Bert | Venture |
| Double Dunk | Enduro | Tutankham | Surround | |
| Fishing Derby | Gopher | | Wizard of Wor | |
| Ice Hockey | James Bond | | Zaxxon | |
| Name this Game | Phoenix | | | |
| Pong | River Raid | | | |
| Robotank | Skiing | | | |
| Space Invaders | Stargunner | | | |

| Game | DQN | A3C-CTS | Prior. Duel | DQN-CTS | DQN-PixelCNN |
|---|---:|---:|---:|---:|---:|
| Freeway | 30.8 | 30.48 | 33.0 | 31.7 | 31.7 |
| Gravitar | 473.0 | 238.68 | 238.0 | 498.3 | 859.1 |
| Montezuma’s Revenge | 0.0 | 273.70 | 0.0 | 3705.5 | 2514.3 |
| Pitfall! | 286.1 | 259.09 | 0.0 | 0.0 | 0.0 |
| Private Eye | 146.7 | 99.32 | 206.0 | 8358.7 | 15806.5 |
| Solaris | 3482.8 | 2270.15 | 133.4 | 2863.6 | 5501.5 |
| Venture | 163.0 | 0.00 | 48.0 | 82.2 | 1356.25 |

Table 1 reproduces the taxonomy of Bellemare et al. (2016), which classifies the games available through the ALE by exploration difficulty. “Human-Optimal” refers to games where DQN-like agents achieve human-level or higher performance; “Score Exploit” refers to games where agents find ways to achieve superhuman scores without necessarily playing the game as a human would. “Sparse” and “Dense” rewards are qualitative descriptors of the game’s reward structure. See the original source for additional details.
Table 2 compares previously published results on the 7 hard exploration, sparse reward Atari 2600 games with results obtained by DQN-CTS and DQN-PixelCNN.