Generalization and Regularization in DQN

Jesse Farebrother, Marlos C. Machado, Michael Bowling
University of Alberta, Edmonton, AB, Canada
Corresponding author:

Deep reinforcement learning (RL) algorithms have shown an impressive ability to learn complex control policies in high-dimensional environments. However, despite the ever-increasing performance on popular benchmarks like the Arcade Learning Environment (ALE), policies learned by deep RL algorithms can struggle to generalize when evaluated in remarkably similar environments. These results are unexpected given the fact that, in supervised learning, deep neural networks often learn robust features that generalize across tasks. In this paper, we study the generalization capabilities of DQN in order to aid in understanding this mismatch between generalization in deep RL and supervised learning methods. We provide evidence suggesting that DQN overspecializes to the domain it is trained on. We then comprehensively evaluate the impact of traditional methods of regularization from supervised learning, $\ell_2$ and dropout, and of reusing learned representations to improve the generalization capabilities of DQN. We perform this study using different game modes of Atari 2600 games, a recently introduced modification for the ALE which supports slight variations of the Atari 2600 games used for benchmarking in the field. Despite regularization being largely underutilized in deep RL, we show that it can, in fact, help DQN learn more general features. These features can then be reused and fine-tuned on similar tasks, considerably improving the sample efficiency of DQN.


1 Introduction

Recently, reinforcement learning (RL) has proven very successful on complex high-dimensional problems, in large part due to the increase in computational power and to the use of deep neural networks for function approximation (e.g., Mnih et al., 2015; Silver et al., 2016). Despite the generality of the proposed solutions, applying these algorithms to slightly different environments generally requires agents to learn the new task from scratch. Practitioners often realize that the learned policies rarely generalize to other domains, even when they are remarkably similar, and that the learned representations are seldom reusable.

Deep neural networks, though, are lauded for their generalization capabilities (e.g., LeCun et al., 1998). Some communities heavily rely on reusing representations learned by neural networks. In computer vision, classification and segmentation algorithms are rarely trained from scratch; instead they are initialized with pre-trained models from larger datasets like ImageNet (e.g., Razavian et al., 2014; Long et al., 2015). The field of natural language processing has also seen successes in reusing and refining weights from certain layers of neural networks using pre-trained word embeddings, with more recent techniques able to reuse all weights of the network (e.g., Howard & Ruder, 2018).

In light of the successes of traditional supervised learning methods, the lack of generalization or reusable knowledge (e.g., policies, representation) acquired by current deep RL algorithms is somewhat surprising. In this paper we investigate whether the representation learned by deep RL methods can be generalized, or at the very least reused and refined on small variations to the task at hand. First, we evaluate the generalization capabilities of DQN (Mnih et al., 2015). We further explore whether the experience gained by the supervised learning community to improve generalization and to avoid overfitting could be used in deep RL. We employ conventional supervised learning techniques, albeit largely unexplored in deep RL, such as fine-tuning (i.e., reusing and refining the representation) and regularization. We show that a learned representation trained with regularization allows us to learn more general features capable of being reused and fine-tuned. Besides improving the generalization capabilities of the learned policies, this fine-tuning procedure has the potential to greatly improve sample efficiency in settings in which an agent might face multiple variations of the same task. Finally, the results we present here can also be seen as paving a way towards novel curriculum learning approaches for deep RL.

We perform our experiments using different game modes and difficulties of Atari 2600 games, a newly introduced feature of the Arcade Learning Environment (ALE; Bellemare et al., 2013). These game modes allow agents to be trained in one environment while being evaluated in a slightly different environment that still captures key concepts of the original environment (e.g., game sprites, agent goals, dynamics). This use of game modes is itself a novel approach for measuring our progress toward a longstanding goal of agents that can learn to be generally competent and generalize across tasks (Bellemare et al., 2013; Machado et al., 2018; Nichol et al., 2018). This paper also introduces the first baselines for the different modes of Atari 2600 games.

2 Background

2.1 Reinforcement Learning

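Reinforcement learning (RL) is a problem where an agent interacts with an environment with the goal of maximizing some form of cumulative long-term reward. RL problems are often modeled as a Markov decision process (MDP), defined by a 5-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$. At a discrete time step $t$ the agent observes the current state $s_t \in \mathcal{S}$ and chooses an action $a_t \in \mathcal{A}$ to probabilistically transition to the next state $s_{t+1} \in \mathcal{S}$ according to the transition dynamics function $\mathcal{P}(s' \mid s, a) \doteq \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$. The agent receives a reward signal $r_{t+1}$ according to the reward function $\mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The agent’s goal is to learn a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, defined as the conditional probability of taking action $a$ in state $s$, written as $\pi(a \mid s) \doteq \Pr(A_t = a \mid S_t = s)$. The learning agent refines its policy with the objective of maximizing the expected return, that is, the cumulative discounted reward incurred from time $t$, defined by $G_t \doteq \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma \in [0, 1)$ is the discount factor.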
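As a small illustration of the return described above, the following sketch computes the discounted sum of a finite reward sequence. The function name `discounted_return` and the example rewards are illustrative assumptions and are not taken from the experiments in this paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards so that each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical three-step episode with a reward of 1 at every step.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With a discount factor of 0.5 the three rewards contribute 1, 0.5, and 0.25, so later rewards count progressively less toward the return.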

Q-learning (Watkins & Dayan, 1992) is a traditional approach to learning an optimal policy from samples obtained from interactions with the environment. It is used to learn an optimal state-action value function via a bootstrapped iterative method. For a given policy $\pi$ we define the state-action value function as the expected return conditioned on a state and action, $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$. The agent iteratively updates its estimate $Q$ of the state-action value function based on samples from the environment using the update rule

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],$$

where $t$ denotes the current time step and $\alpha$ the step size. Generally, due to the exploding size of the state space in many real-world problems, it is intractable to learn a value for every state-action pair in the MDP, with researchers and practitioners often resorting to learning an approximation to $q_\pi$.
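The bootstrapped update described above can be sketched in tabular form as follows. This is a minimal sketch under stated assumptions: the dictionary-based table, the `terminal` flag, and the state and action names are hypothetical and only serve the example; `alpha` is the step size and `gamma` the discount factor.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha, gamma, terminal=False):
    """Apply one Q-learning update in place and return the TD error."""
    # Bootstrap from the greedy value of the next state (zero once the episode ended).
    next_value = 0.0 if terminal else max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * next_value - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)  # unseen state-action pairs default to a value of 0
q[("s1", "up")] = 2.0
delta = q_learning_update(q, "s0", "up", reward=1.0, next_state="s1",
                          actions=["up", "down"], alpha=0.5, gamma=0.9)
```

In this toy call the target is 1 + 0.9 * 2 = 2.8, so the estimate for `("s0", "up")` moves halfway from 0 toward 2.8.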

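DQN approximates the state-action value function such that $Q(s, a; \theta) \approx q_\pi(s, a)$, where $\theta$ denotes the weights of a neural network. The network takes as input some encoding of the current state $s_t$ and outputs $|\mathcal{A}|$ scalars corresponding to the state-action values for that given state. DQN is trained to minimize

$$L^{\mathrm{DQN}}(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(\mathcal{D})} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right],$$

where the transitions $(s, a, r, s')$ are uniformly sampled from $\mathcal{D}$, the experience replay buffer filled with experience collected by the agent. The weights $\theta^-$ of a duplicate network are updated less frequently for stability purposes.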
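To make the training objective concrete, the sketch below computes a mean squared temporal-difference error over a minibatch, using a separate, less frequently updated network for the bootstrap target. It is an illustration only: the callables `q_online` and `q_target`, the `done` flag for terminal transitions, and the toy values are assumptions, and the minibatch is assumed to have already been sampled uniformly from the replay buffer.

```python
def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over a minibatch of (s, a, r, s_next, done) tuples.

    q_online(s) and q_target(s) each return a list with one value per action.
    """
    total = 0.0
    for s, a, r, s_next, done in batch:
        # The target is computed with the duplicate (target) network.
        bootstrap = 0.0 if done else max(q_target(s_next))
        target = r + gamma * bootstrap
        total += (target - q_online(s)[a]) ** 2
    return total / len(batch)


# Toy stand-ins for the two networks: constant action values for any state.
toy_online = lambda s: [0.0, 1.0]
toy_target = lambda s: [2.0, 3.0]
toy_batch = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 2, True)]
loss = dqn_loss(toy_batch, toy_online, toy_target, gamma=0.5)
```

In a full agent the parameters of `q_online` would be adjusted by gradient descent on this quantity, while `q_target` would be refreshed with a copy of those parameters only occasionally.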

2.2 Supervised Learning

In the supervised learning problem we are given a dataset of examples represented by a matrix $X \in \mathbb{R}^{m \times n}$ with $m$ training examples of dimension $n$, and a vector $y \in \mathbb{R}^{m}$ denoting the output target $y_i$ for each training example $X_i$. We want to learn a function which maps each training example $X_i$ to its predicted output label $\hat{y}_i$. The goal is to learn a robust model that accurately predicts $y_i$ from $X_i$ while also being able to generalize to unseen examples. In this paper we focus on using a neural network parameterized by the weights $\theta$ to learn the function $f$ such that $f(X_i; \theta) = \hat{y}_i$. We typically train these models by minimizing

$$\min_{\theta} \; \frac{\lambda}{2} \lVert \theta \rVert_2^2 + \frac{1}{m} \sum_{i=1}^{m} L\big(y_i, f(X_i; \theta)\big),$$

where $L$ is a differentiable loss function which outputs a scalar determining the quality of the prediction (e.g., squared error loss). The first term is a form of regularization, i.e., $\ell_2$ regularization, which encourages generalization. $\ell_2$ regularization imposes a penalty on large weight vectors, with $\lambda$ being the weighted importance of the regularization term.
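The penalized objective described above can be illustrated with a linear model and a squared error loss. This sketch assumes the common form in which a penalty of (lam / 2) times the squared norm of the weights is added to the mean loss; the function name, the linear model, and the toy data are illustrative assumptions.

```python
def l2_regularized_loss(weights, examples, targets, lam):
    """Mean squared error of a linear model plus (lam / 2) * ||weights||^2."""
    m = len(examples)
    data_term = 0.0
    for x, y in zip(examples, targets):
        prediction = sum(w * xi for w, xi in zip(weights, x))
        data_term += (y - prediction) ** 2
    # The penalty grows with the squared magnitude of the weight vector.
    penalty = 0.5 * lam * sum(w * w for w in weights)
    return penalty + data_term / m


toy_weights = [1.0, 2.0]
toy_examples = [[1.0, 0.0], [0.0, 1.0]]
toy_targets = [1.0, 0.0]
penalized = l2_regularized_loss(toy_weights, toy_examples, toy_targets, lam=0.1)
unpenalized = l2_regularized_loss(toy_weights, toy_examples, toy_targets, lam=0.0)
```

The difference between the two values is exactly the penalty term, which is what pushes the optimizer toward smaller weights as `lam` increases.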

Another popular regularization technique is dropout (Srivastava et al., 2014). When using dropout, during forward propagation each neural unit has a chance of being set to zero according to a Bernoulli distribution with probability $p$, referred to as the dropout rate. Dropout discourages the network from relying on a small number of neurons to make a prediction, making it hard for the network to memorize the dataset.
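A minimal sketch of the dropout mechanism on a list of activations is given below. Rescaling the surviving units (often called inverted dropout) is an assumption made here so that the layer can be left untouched at evaluation time; it is one common convention and not necessarily the one used in the experiments. The function and variable names are hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each unit independently with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Inverted dropout: rescale surviving units so the expected activation is unchanged.
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)
masked = dropout([1.0] * 1000, rate=0.25, rng=rng)
untouched = dropout([1.0, 2.0], rate=0.5, rng=rng, training=False)
```

Roughly a quarter of the entries in `masked` are zero, while the evaluation-mode call returns its input unchanged.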

Prior to training, the network parameters are usually initialized through a stochastic process (e.g., Xavier initialization; Glorot & Bengio, 2010). We can also initialize the network using pre-trained weights from a different task. If we reuse one or more pre-trained layers we say the weights encoded by those layers will be fine-tuned during training (e.g., Razavian et al., 2014; Long et al., 2015).
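The two initialization options described above, stochastic initialization and reuse of pre-trained layers, can be sketched as follows. Everything in this block is a hypothetical stand-in: the layer names, the toy shapes, and the uniform initializer are assumptions for illustration and do not correspond to the DQN architecture or to Xavier initialization.

```python
import copy
import random


def init_layer(shape, rng):
    """Hypothetical stand-in for a stochastic initializer: small uniform values."""
    rows, cols = shape
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_network(shapes, rng, pretrained=None, reuse=()):
    """Create per-layer weights, copying the layers named in `reuse` from `pretrained`."""
    weights = {}
    for name, shape in shapes.items():
        if pretrained is not None and name in reuse:
            # Reused layers start from the earlier task's weights and are then fine-tuned.
            weights[name] = copy.deepcopy(pretrained[name])
        else:
            weights[name] = init_layer(shape, rng)
    return weights


shapes = {"conv1": (2, 2), "fc1": (2, 3), "out": (3, 2)}  # toy shapes only
rng = random.Random(0)
source = build_network(shapes, rng)
target = build_network(shapes, rng, pretrained=source, reuse=("conv1", "fc1"))
```

Passing every layer name in `reuse` corresponds to fine-tuning the whole network, while passing only the early layers leaves the remaining ones randomly initialized.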

3 The ALE as a platform for evaluating generalization

The Arcade Learning Environment (ALE) is a platform used to evaluate agents across dozens of Atari 2600 games (Bellemare et al., 2013). It has become one of the standard evaluation platforms in the field and has led to a number of exciting algorithmic advances (e.g., Mnih et al., 2015). The ALE poses the problem of general competency by having agents use the same learning algorithm to perform well in as many games as possible, while learning without using game-specific knowledge. Learning to play multiple games with the same agent, or learning to play a game faster by leveraging knowledge acquired in a different game, is much harder, with fewer successes being known (e.g., Rusu et al., 2016; Kirkpatrick et al., 2016; Parisotto et al., 2016; Schwarz et al., 2018; Espeholt et al., 2018).

In this paper, we use the different modes and difficulties of Atari 2600 games to evaluate a neural network’s ability to generalize in high-dimensional state spaces. Game modes, originally native to the Atari console, were recently added to the ALE (Machado et al., 2018). They give us modifications of the default environment dynamics and state space, often modifying sprites, velocities, and partial observability. These modes pose a tractable way to investigate generalization of RL agents in a high-dimensional environment. Instead of requiring an agent to play multiple games that are visually very different or even non-analogous, they require agents to play games that are visually very similar and that can be played with policies that are very similar, at least from a human perspective.

We use flavours (combinations of a mode and a difficulty) obtained from four games: Freeway, HERO, Breakout, and Space Invaders. In Freeway, the different modes vary the speed and number of vehicles, while different difficulties change how the player is penalized for running into a vehicle. In HERO, subsequent modes start the player off at increasingly harder levels of the game. The mode we use in Breakout makes the bricks partially observable. The modes we use in Space Invaders allow for oscillating shield barriers, increasing the width of the player sprite, and partially observable aliens. Full explanations of specific games, their modes, and their difficulties can be found in Appendix A. Figure 1 provides screenshots showing side-by-side comparisons of some of the modes explored in this paper. When reading the analyses of this paper it is important to keep in mind how remarkably similar these modes are.

Figure 1: Each column shows variation between two selected flavours of each game. From left to right: Freeway, Hero, Breakout, and Space Invaders.

4 Generalization of the learned policies and overfitting

In order to test the generalization capabilities of DQN we first evaluate whether a policy learned in one flavour can perform well in a different flavour. As mentioned above, different modes and difficulties of a single game look very similar. If the representation encodes a robust policy we might expect it to be able to generalize to slight variations of the underlying reward signal, game dynamics, or observations. Evaluating the learned policy in a similar but different flavour can be seen as evaluating generalization in RL, similar to cross-validation in supervised learning.

To evaluate DQN’s ability to generalize across flavours we evaluate the learned $\epsilon$-greedy policy on a new flavour after being trained for 50M frames in the default flavour, m0d0 (mode 0, difficulty 0). We measure the cumulative reward averaged over 100 episodes in the new flavour, adhering to the evaluation protocol suggested by Machado et al. (2018). The results are summarized in Table 1. Baseline results where the agent is trained from scratch for 50M frames in the flavour we use for evaluation are summarized in the baseline column. Theoretically, this baseline can be seen as an upper bound on the performance DQN can achieve in that flavour, as it represents the agent’s performance when evaluated in the same flavour it was trained on. Full baseline results with the agent’s performance after different numbers of frames can be found in Appendix B.
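The evaluation protocol described above, running a fixed epsilon-greedy policy and averaging the cumulative reward over episodes, can be sketched as follows. The environment interface (`env_reset`, `env_step`), the toy environment, and the constant action values are hypothetical stand-ins and not the ALE API; the value of epsilon is left as a parameter.

```python
import random


def epsilon_greedy(action_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


def evaluate(env_step, env_reset, q_values, epsilon, episodes, rng):
    """Average undiscounted episode score of a fixed epsilon-greedy policy."""
    total = 0.0
    for _ in range(episodes):
        state, done = env_reset(), False
        while not done:
            action = epsilon_greedy(q_values(state), epsilon, rng)
            state, reward, done = env_step(state, action)
            total += reward
    return total / episodes


def toy_reset():
    return 0


def toy_step(state, action):
    # Ten-step episode: action 1 earns a point, action 0 earns nothing.
    next_state = state + 1
    return next_state, float(action == 1), next_state >= 10


greedy_score = evaluate(toy_step, toy_reset, lambda s: [0.0, 1.0], 0.0, 100, random.Random(0))
```

No learning happens inside `evaluate`; the policy is frozen, which is what allows the score in a new flavour to be read as a measure of generalization.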

We can see in the results that the policies learned by DQN do not generalize well to different flavours, even when the flavours are remarkably similar. For example, in Freeway, a high-level policy applicable to all flavours is to go up while avoiding cars. Perhaps surprisingly, this does not seem to be what DQN learns. For example, the default flavour m0d0 and m4d0 have exactly the same sprites on the screen; the only difference is that in m4d0 some cars accelerate and decelerate over time. The close to optimal policy learned in m0d0 is only able to score 15.8 points when evaluated on m4d0, which is approximately half of what the policy learned from scratch in that flavour achieves (29.9 points). The learned policy performs even worse when evaluated on flavours that differ more from m0d0.

As previously mentioned, the different modes of HERO can be seen as giving the agent a curriculum or a natural progression. Interestingly, the agent trained in the default mode for 50M frames can progress to at least level 3 and sometimes level 4. Mode 1 starts the agent off at level 5, and performance in this mode suffers greatly during evaluation. There are very few game mechanics added to level 5, indicating that perhaps the agent is memorizing trajectories instead of learning a robust policy capable of solving each level.

The results in some flavours suggest that the agent is overfitting to the flavour it is trained on. We tested this hypothesis by periodically evaluating the policy being learned in each of the other flavours of that game. This process involved taking checkpoints of the network at regular intervals during training and evaluating the $\epsilon$-greedy policy in the prescribed flavour over a fixed number of episodes, again further averaged over five runs. The results obtained in Freeway, the game in which this overfitting trend is most pronounced, are depicted in Figure 2. Learning curves for all flavours can be found in Appendix C.

Game Variant Evaluation Learn Scratch
Freeway m1d0 () ()
m1d1 () ()
m4d0 () ()
Hero m1d0 () ()
m2d0 () ()
Breakout m12d0 () ()
Space Invaders m1d0 () ()
m1d1 () ()
m9d0 () ()
Table 1: Direct policy evaluation. Each game was initially trained in the default mode for 50M frames then evaluated in each listed game flavour. Reported numbers are the average over 5 runs. Standard deviation is reported between parentheses.
Figure 2: Performance of an agent that was trained in the default mode of Freeway and evaluated periodically in each corresponding mode. Results are averaged over five seeds. The y-axis is log scaled.

In Freeway, while we see the policy’s performance flattening out in m4d0, we do see the traditional bell-shaped curve associated with overfitting in the other modes. At first, improvements in the original policy do correspond to improvements in the performance of that policy in other domains. With time, it seems that the agent starts to refine its policy for the specific flavour it is being trained on, overfitting to that flavour. In the other games, whose flavours are significantly more complex in their dynamics and gameplay, we do not observe this prominent bell-shaped curve. For example, in Breakout, we actually observe a monotonic increase in performance throughout the evaluation process.

In conclusion, when looking at Table 1, it seems that the policies learned by DQN struggle to generalize to even small variations encountered in game flavours. This lack of generalization is surprising, and results as seen in Freeway exhibit a troubling notion of overfitting. Based on these results we aim to evaluate whether deep RL could benefit from established methods from supervised learning promoting generalization and reducing overfitting.

5 Regularization in deep RL

In order to evaluate the hypothesis that the observed lack of generalization is due to overfitting, we revisit some popular regularization methods from the supervised learning literature. The two forms of regularization we test are dropout and $\ell_2$ regularization.

First we want to understand the effect of regularization on evaluating the learned policy in a different flavour. We do so by applying dropout to the first four layers of the network during training, that is, the three convolutional layers and the first fully connected layer. We simultaneously apply $\ell_2$ regularization on all weights in the network, based on preliminary experiments that showed an additive effect when combining dropout and $\ell_2$ regularization. This is consistent with, for example, Srivastava et al.’s (2014) result that these methods provide benefit in tandem.

We follow the same evaluation scheme used when evaluating the unregularized policy in different flavours. We evaluate the policy learned after 50M frames of the default mode of each game. A grid search was performed on Freeway to find reasonable hyperparameters for the dropout rate $p$ and the weighted regularization parameter $\lambda$. These parameters were then used for each subsequent flavour. Notably, significantly smaller dropout values were required compared to heuristics used in supervised learning, although this could be due to the small size of the network in question. We ended up choosing a single value of $\lambda$, one dropout rate for the first three convolutional layers, and another for the first fully connected layer. We contrast these results with the results presented in the previous section. This evaluation protocol allows us to directly evaluate the effect of regularization on the learned policy’s ability to generalize. A baseline agent trained from scratch for 50M frames in each flavour is also provided. The results are presented in Table 2, with the evaluation learning curves available in the Appendix.

Game Variant Eval. with regularization Eval. without regularization Learn Scratch
Freeway m1d0 () ()
m1d1 () ()
m4d0 () ()
Hero m1d0 () ()
m2d0 () ()
Breakout m12d0 () ()
Space Invaders m1d0 () ()
m1d1 () ()
m9d0 () ()
Table 2: Policy evaluation using regularization. Each game was initially trained in the default mode for 50M frames with dropout and $\ell_2$ regularization then evaluated on each listed flavour. Reported numbers are the average over 5 runs. Standard deviation is reported between parentheses.
Figure 3: Performance of an agent that was evaluated periodically after being trained in the default flavour of Freeway with dropout and $\ell_2$ regularization. Results are averaged over five seeds. The y-axis is log scaled.

When using regularization during training we sometimes observe a performance hit in the default flavour. Dropout generally requires more training iterations to reach the same level of performance as training without dropout. Surprisingly, we did not observe this performance hit in all games. In any case, maximal performance in one flavour is not our goal. We are interested in the setting where one may be willing to accept lower performance on one task in order to obtain higher performance, or adaptability, on future tasks. Full baseline results using regularization in the default flavour can be found in Table 7 in the Appendix.

In most flavours, evaluating the policy trained with regularization does not negatively impact performance when compared to the performance of the policy trained without regularization. In some flavours we even see an increase in performance. Interestingly, when using regularization the agent in Freeway improves for all flavours and even learns a policy capable of outperforming the baseline learned from scratch in two of the three flavours. Moreover, in Freeway we now observe increasing performance during evaluation throughout most of the learning procedure as depicted in Figure 3. These results seem to confirm the notion of overfitting observed in Figure 2.

Despite slight improvements from these techniques, regularization by itself does not seem sufficient to enable policies to generalize across flavours. As shown in the next section, perhaps the real benefit of regularization in deep RL comes from the ability to learn more general features. These features may lead to a more adaptable representation which can be reused and subsequently fine-tuned on other flavours, which is often the case in supervised learning.

6 Value function fine-tuning

We hypothesize that the benefit of regularizing deep RL algorithms may not come from improvements during evaluation, but instead from having a good parameter initialization that can be adapted to new, similar tasks. We evaluate this hypothesis using two common practices in machine learning. First, we use the weights trained with regularization as the initialization for the entire network. We subsequently fine-tune all weights in the network. This is similar to what is performed in computer vision with supervised classification methods (e.g., Razavian et al., 2014). Secondly, we evaluate reusing and fine-tuning only early layers of the network. This has been shown to improve generalization in some settings (e.g., Yosinski et al., 2014), and is sometimes used in natural language processing (e.g., Mou et al., 2016; Howard & Ruder, 2018).

When fine-tuning the entire network, we take the weights of the network trained in the default flavour for 50M frames and use them to initialize the network commencing training in the new flavour for 50M frames. We perform this set of experiments twice: once for the weights trained without regularization, and again for the weights trained with regularization, as described in the previous section. Each result is averaged over five seeds. For comparison we provide a baseline trained from scratch for 50M and 100M frames in each flavour. Directly comparing the performance obtained after fine-tuning to the performance after 50M frames (Scratch) shows the benefit of reusing a representation learned in a different task instead of randomly initializing the network. Comparing the performance obtained after fine-tuning to the performance after 100M frames (Scratch) lets us take into consideration the whole learning process. The results are presented in Table 3.

Fine-tuning Regularized Fine-tuning Scratch
Game Variant 10M 50M 10M 50M 50M 100M
Freeway m1d0 () () () () ()
m1d1 () () () () ()
m4d0 () () () ()
Hero m1d0 () () () () ()
m2d0 () () () () ()
Breakout m12d0 () () () () () ()
Space Invaders m1d0 () () () () ()
m1d1 () () () () ()
m9d0 () () () () ()
Table 3: Experiments fine-tuning the entire network with and without regularization (dropout + $\ell_2$). An agent is trained with dropout + $\ell_2$ regularization in the default flavour of each game for 50M frames, then DQN’s parameters were used to initialize the fine-tuning procedure on each new flavour for 50M frames. The baseline agent is trained from scratch up to 100M frames. Standard deviation is reported between parentheses.

Fine-tuning from an unregularized representation yields conflicting conclusions. Although in Freeway we obtained positive fine-tuning results, we note that rewards are so sparse in m1d0 and m1d1 that this initialization is likely to be simply acting as a form of optimistic initialization, biasing the agent to go up. Because the agent observes rewards more often, it learns more quickly about the new flavour. However, the agent is still unable to reach the maximum score in these flavours.

The results of fine-tuning the regularized representation are more exciting. In Freeway we observe the highest scores on m1d0 and m1d1 throughout the whole paper. In HERO we vastly outperform fine-tuning from an unregularized representation. In Space Invaders we obtain higher scores across the board on average when comparing to the same amount of experience. These results suggest that reusing a regularized representation in deep RL might allow us to learn more general features which can be more successfully fine-tuned.

Moreover, initializing the network with a regularized representation has a big impact on the agent’s performance when compared to initializing the network randomly. These results are impressive when we consider the potential regularization has in reducing the sample complexity of deep RL algorithms. Such an observation also holds when we take the total number of frames seen between two flavours into consideration. When directly comparing one row of Regularized Fine-tuning to Scratch we are comparing two algorithms that observed 100M frames. However, to generate two rows of Scratch we used 200M frames while two rows of Regularized Fine-tuning used 150M frames (50M from scratch + 50M in each row). The distinction becomes bigger and bigger as more tasks are taken into consideration.
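The frame accounting in the paragraph above can be written out as a short calculation. The function below is only a sketch of that arithmetic, assuming, as stated in the text, 100M frames per flavour when learning from scratch, and a single 50M-frame run in the default flavour followed by 50M frames of fine-tuning per new flavour; the function name and default arguments are illustrative.

```python
def total_frames_millions(num_flavours, pretrain=50, finetune=50, scratch=100):
    """Return (scratch_total, finetune_total) in millions of frames."""
    scratch_total = scratch * num_flavours
    # Fine-tuning pays for the default-flavour run once, then 50M per new flavour.
    finetune_total = pretrain + finetune * num_flavours
    return scratch_total, finetune_total


two_rows = total_frames_millions(2)
ten_rows = total_frames_millions(10)
```

For two flavours this reproduces the 200M versus 150M comparison above, and the gap widens linearly as more flavours are added.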

We further investigate which layers may encode general features able to be fine-tuned. Inspiration was taken from other studies that have shown that neural networks can re-learn co-adaptations when their final layers are randomly initialized, sometimes improving generalization (Yosinski et al., 2014). We conjectured DQN may benefit from re-learning the co-adaptations between early layers comprising general features and the randomly initialized layers which ultimately assign state-action values. We hypothesized that it might be beneficial to re-learn the final layers from scratch since state-action values are ultimately conditioned on the flavour at hand. Therefore, we also evaluated whether fine-tuning only the convolutional layers, or the convolutional layers and the first fully connected layer, was more effective than fine-tuning the whole network. Surprisingly, this does not seem to be the case. The performance obtained when the whole network is fine-tuned (Table 3) is consistently better than when it is not (Table 4). We speculate that this might not be the case on more dissimilar tasks.

Regularized Fine-tuning
Regularized Fine-tuning
Game Variant 10M 50M 10M 50M 50M
Freeway m1d0 () () () ()
m1d1 () () () ()
m4d0 () () ()
Hero m1d0 () () ()
m2d0 () () ()
Breakout m12d0 () () () ()
Space Invaders m1d0 () () () ()
m1d1 () () () ()
m9d0 () () () ()
Table 4: Experiments fine-tuning early layers of the network trained with regularization. An agent is trained with dropout + $\ell_2$ regularization in the default flavour of each game for 50M frames, then DQN’s parameters were used to initialize the corresponding layers to be further fine-tuned on each new flavour. Remaining layers were randomly initialized. Compared against fine-tuning the entire network from Table 3. Standard deviation is reported between parentheses.

7 Discussion and conclusion

Many studies have tried to explain generalization of deep neural networks in supervised learning settings (e.g., Zhang et al., 2018; Dinh et al., 2017). Analyzing generalization and overfitting in deep RL has its own issues on top of the challenges posed in the supervised learning case. Actually, generalization in RL can be seen in different ways. We can talk about generalization in RL in terms of conditioned sub-goals within an environment (e.g., Andrychowicz et al., 2017; Sutton, 1995), learning multiple tasks at once (e.g., Teh et al., 2017; Parisotto et al., 2016), or sequential task learning as in a continual learning setting (e.g., Schwarz et al., 2018; Kirkpatrick et al., 2016). In this paper we evaluated generalization in terms of small variations of high-dimensional control tasks. This provides a candid evaluation method to study how well features and policies learned by deep neural networks in RL problems can generalize. The approach of studying generalization with respect to the representation learning problem intersects nicely with the aforementioned problems in RL where generalization is key.

The empirical evaluation presented in this paper has shown that traditional DQN seems to generalize poorly even between very similar high-dimensional control tasks. Given this lack of generality we investigated how dropout and $\ell_2$ regularization can be used to improve generalization in deep RL. Other forms of regularization in RL that have been explored in the past are sticky-actions, random initial states, entropy regularization (Zhang et al., 2018), and procedural generation of environments (Justesen et al., 2018). More related to our work, regularization in the form of weight constraints has been applied in the continual learning setting in order to reduce the catastrophic forgetting exhibited by fine-tuning on many sequential tasks (Kirkpatrick et al., 2016; Schwarz et al., 2018). Similar weight constraint methods have been explored in multitask learning (Teh et al., 2017).

Evaluation practices in RL often focus on training and evaluating agents on exactly the same task. Consequently, regularization has traditionally been underutilized in deep RL. With a renewed emphasis on generalization in RL, regularization applied to the representation learning problem can be a feasible method for improving generalization on closely related tasks. Our results suggest that training with dropout and $\ell_2$ regularization yields more general-purpose features which can be adapted to similar problems. Although other communities relying on deep neural networks have shown similar successes, this is of particular importance for the deep RL community, which struggles with sample efficiency (Henderson et al., 2018). This work is also related to recent meta-learning procedures like MAML (Finn et al., 2017) which aim to find a parameter initialization that can be quickly adapted to new tasks. In fact, some of the results here can also be seen in the light of curriculum learning. The regularization techniques we have evaluated here seem to be effective in leveraging situations where an easier task is presented first, sometimes leading to previously unseen performance levels (e.g., Freeway).

Finally, we believe it would be extremely beneficial for the field if we were able to develop algorithms that can generalize across tasks. Ultimately we want agents that can keep learning as they interact with the world in a continual learning fashion. The ability to generalize is essential. Throughout this paper we often avoided the expression transfer learning because we believe that succeeding in slightly different environments should actually be seen as a problem of generalization. Our results suggest that regularizing and fine-tuning representations in deep RL might be a viable approach towards improving sample efficiency and generalization on multiple tasks. It is particularly interesting that fine-tuning a regularized network was the most successful approach because this might also be applicable in continual learning settings where the environment changes without the agent being told so, and re-initializing layers of a network is obviously not an option. In this setting, the work from Kirkpatrick et al. (2016) and Schwarz et al. (2018) might be a great starting point as they provide a more thorough discussion of generalization in continual learning.


The authors would like to thank Matthew E. Taylor, Tom Van de Wiele, and Marc G. Bellemare for useful discussions, as well as Vlad Mnih for feedback on a preliminary draft of the manuscript. This work was supported by funding from NSERC and Alberta Innovates Technology Futures through the Alberta Machine Intelligence Institute (Amii). Computing resources were provided by Compute Canada through Calcul Québec.


  • Andrychowicz et al. (2017) Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems (NIPS), pp. 5055–5065, 2017.
  • Bellemare et al. (2013) Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Dinh et al. (2017) Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1019–1028, 2017.
  • Espeholt et al. (2018) Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1406–1415, 2018.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1126–1135, 2017.
  • Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256, 2010.
  • Henderson et al. (2018) Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • Howard & Ruder (2018) Jeremy Howard and Sebastian Ruder. Fine-tuned language models for text classification. CoRR, abs/1801.06146, 2018.
  • Justesen et al. (2018) Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Procedural level generation improves generality of deep reinforcement learning. CoRR, abs/1806.10729, 2018.
  • Kirkpatrick et al. (2016) James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. CoRR, abs/1612.00796, 2016.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440, 2015.
  • Machado et al. (2018) Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew J. Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Mou et al. (2016) Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. How transferable are neural networks in NLP applications? In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 479–489, 2016.
  • Nichol et al. (2018) Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, and John Schulman. Gotta learn fast: A new benchmark for generalization in RL. CoRR, abs/1804.03720, 2018.
  • Parisotto et al. (2016) Emilio Parisotto, Lei Jimmy Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • Razavian et al. (2014) Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 512–519, 2014.
  • Rusu et al. (2016) Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. CoRR, abs/1606.04671, 2016.
  • Schwarz et al. (2018) Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 4535–4544, 2018.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Sutton (1995) Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems (NIPS), pp. 1038–1044, 1995.
  • Teh et al. (2017) Yee Whye Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pp. 4499–4509, 2017.
  • Watkins & Dayan (1992) Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
  • Yosinski et al. (2014) Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (NIPS), pp. 3320–3328, 2014.
  • Zhang et al. (2018) Chiyuan Zhang, Oriol Vinyals, Rémi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. CoRR, abs/1804.06893, 2018.

Appendix A Game Modes


Freeway

(a) Freeway m0d0
(b) Freeway m1d0
(c) Freeway m4d0

In Freeway, a chicken must cross a road containing multiple lanes of moving traffic within a prespecified time limit. In all modes of Freeway, the agent is rewarded for reaching the top of the screen and is subsequently teleported to the bottom of the screen. If the chicken collides with a vehicle in difficulty 0, it gets bumped down one lane of traffic; in difficulty 1, the chicken is instead teleported to its starting position at the bottom of the screen. Mode 1 changes some vehicle sprites to include buses, adds more vehicles to some lanes, and increases the velocity of all vehicles. Mode 4 is almost identical to Mode 1; the only difference is that vehicles can oscillate between two speeds.


Hero

(d) Hero m0d0
(e) Hero m1d0
(f) Hero m2d0

In Hero you control a character who must navigate a maze in order to save a trapped miner within a cave system. The agent scores points for any forward progression such as clearing an obstacle or killing an enemy. Once the miner is rescued, the level is terminated and you continue to the next level with a different maze. Some levels have partially observable rooms, more enemies, and more difficult obstacles to traverse. Past the default mode, each subsequent mode starts off at increasingly harder levels denoted by a level number increasing by multiples of . The default mode starts you off at level , mode 1 starts at level , and so on.


Breakout

(g) Breakout m0d0
(h) Breakout m12d0

In Breakout you control a paddle which can move horizontally along the bottom of the screen. At the beginning of the game, or on loss of a life, a ball is set into motion and can bounce off the paddle and collide with bricks at the top of the screen. The objective of the game is to break all the bricks without letting the ball fall below your paddle's horizontal plane. Mode 12 of Breakout hides the bricks from the player until the ball collides with them, at which point the bricks flash for a brief moment before disappearing again.

Space Invaders

(i) Space Invaders m0d0
(j) Space Invaders m1d1
(k) Space Invaders m9d0

When playing Space Invaders you control a spaceship which can move horizontally along the bottom of the screen. There is a grid of aliens above you, and the objective of the game is to shoot all of the aliens. You are afforded some protection from the alien bullets by three barriers just above the spaceship. Difficulty 1 of Space Invaders widens your spaceship's sprite, making it harder to dodge enemy bullets. Mode 1 of Space Invaders causes the shields above you to oscillate horizontally. Mode 9 of Space Invaders is similar to Mode 12 of Breakout in that the aliens are only partially observable until struck by the player's bullet.

Appendix B Baseline Results

In all experiments performed in this paper we utilize the neural network architecture used by Mnih et al. (2015), that is, a convolutional neural network with three convolutional layers and two fully connected layers. A visualization of this network can be found in Figure 4. Hyperparameters are generally kept consistent with Machado et al. (2018). Below we provide a table of the key hyperparameters used in the baseline experiments.


Figure 4: Neural network architecture used by DQN to predict state-action values.
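To make the architecture concrete, the following sketch walks through the layer shapes and parameter counts of a network with three convolutional layers and two fully connected layers. The specific layer sizes (an 84×84×4 input, 32/64/64 filters, a 512-unit hidden layer) are taken from the published description in Mnih et al. (2015) and are not restated in this appendix, so treat them as assumptions here; the function and constant names are illustrative only.

```python
# Shape walk-through of a DQN-style network (layer sizes assumed from
# Mnih et al., 2015; they are not stated in this appendix).

def conv_out(size, kernel, stride):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1

# (out_channels, kernel_size, stride) for the three convolutional layers.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]
HIDDEN_UNITS = 512   # first fully connected layer
NUM_ACTIONS = 18     # full ALE action set, one output per action

def describe_network(input_size=84, input_channels=4):
    """Return per-layer output shapes, flattened size, and parameter count."""
    size, channels, params = input_size, input_channels, 0
    shapes = []
    for out_channels, kernel, stride in CONV_LAYERS:
        # Weights plus one bias per output channel.
        params += kernel * kernel * channels * out_channels + out_channels
        size = conv_out(size, kernel, stride)
        channels = out_channels
        shapes.append((channels, size, size))
    flat = channels * size * size
    params += flat * HIDDEN_UNITS + HIDDEN_UNITS          # hidden layer
    params += HIDDEN_UNITS * NUM_ACTIONS + NUM_ACTIONS    # output layer
    return shapes, flat, params

if __name__ == "__main__":
    shapes, flat, params = describe_network()
    for i, shape in enumerate(shapes, start=1):
        print(f"conv{i} output (channels, height, width): {shape}")
    print("flattened features:", flat)
    print("total parameters:", params)
```

Under these assumptions most of the parameters sit in the first fully connected layer, which is one reason the choice of where to apply dropout and other regularizers (convolutional versus fully connected layers) is listed separately among the hyperparameters.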


Key hyperparameters: learning rate; minibatch size; dropout rate (convolutional layers); dropout rate (fully connected layers); regularization term; replay buffer size; target update frequency; ε decay horizon (1M frames); initial ε; final ε; discount factor (0.99).


Each baseline run is trained for up to 100M frames in each game flavour. We decay epsilon (ε) linearly over the ε-decay period to allow for an exploratory period at the beginning of training. We use sticky actions (Machado et al., 2018), in which the previously executed action is repeated with a fixed probability instead of the newly selected action. We allow the agent access to all 18 primitive actions in the ALE; we use neither the reduced action set nor the lives signal.
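The two mechanisms described above, linear ε decay and sticky actions, can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: the function and class names are hypothetical, and the initial ε, final ε, and stickiness values are left as parameters because their values are not recoverable from this appendix. Only the 1M-frame decay horizon comes from the hyperparameter table.

```python
import random

def linear_epsilon(frame, eps_initial, eps_final, decay_frames):
    """Anneal epsilon linearly from eps_initial to eps_final over
    decay_frames frames, then hold it at eps_final."""
    fraction = min(max(frame / decay_frames, 0.0), 1.0)
    return eps_initial + fraction * (eps_final - eps_initial)

class StickyActions:
    """With probability `stickiness`, execute the previously executed
    action instead of the action the agent just selected."""

    def __init__(self, stickiness, rng=None):
        self.stickiness = stickiness
        self.rng = rng or random.Random()
        self.previous = None

    def reset(self):
        # Call at the start of every episode.
        self.previous = None

    def filter(self, action):
        if self.previous is not None and self.rng.random() < self.stickiness:
            action = self.previous
        self.previous = action
        return action

if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for frame in (0, 500_000, 1_000_000, 5_000_000):
        print(frame, linear_epsilon(frame, 1.0, 0.1, 1_000_000))
    sticky = StickyActions(stickiness=0.5, rng=random.Random(0))
    print([sticky.filter(a) for a in range(10)])
```

Keeping the stickiness filter outside the agent mirrors the idea that sticky actions are a property of the environment: the agent selects an action as usual and only observes the consequences of whichever action was actually executed.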

Furthermore, as a crude measure of environment complexity, we measure the return of the best single action an agent could take in a game flavour. Simply put, we iterate through every action in the action set, executing that action ε-greedily at every time step for a fixed number of episodes. These results were then averaged over multiple runs, with the standard deviation between runs reported in parentheses.
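The best-action baseline can be sketched as below. The environment is a hypothetical toy stand-in (not the ALE), and the ε value, episode count, and number of runs are parameters because the values used in the paper are not recoverable here. The population standard deviation is used only for simplicity; the paper does not say which estimator was reported.

```python
import random
import statistics

class ToyEnv:
    """Hypothetical stand-in environment: action 2 earns reward 1 per step,
    every other action earns 0, and episodes last 10 steps."""
    num_actions = 4
    horizon = 10

    def reset(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 2 else 0.0
        return reward, self.t >= self.horizon

def constant_action_return(env, action, epsilon, episodes, rng):
    """Mean episode return when `action` is executed epsilon-greedily,
    i.e. replaced by a uniformly random action with probability epsilon."""
    returns = []
    for _ in range(episodes):
        env.reset()
        total, done = 0.0, False
        while not done:
            explore = rng.random() < epsilon
            chosen = rng.randrange(env.num_actions) if explore else action
            reward, done = env.step(chosen)
            total += reward
        returns.append(total)
    return statistics.mean(returns)

def best_constant_action(env, epsilon, episodes, runs, seed=0):
    """Evaluate every action over several seeded runs and return the best
    action together with (mean, std between runs) for each action."""
    results = {}
    for action in range(env.num_actions):
        per_run = [
            constant_action_return(env, action, epsilon, episodes,
                                   random.Random(seed + run))
            for run in range(runs)
        ]
        results[action] = (statistics.mean(per_run),
                           statistics.pstdev(per_run))
    best = max(results, key=lambda a: results[a][0])
    return best, results

if __name__ == "__main__":
    best, results = best_constant_action(ToyEnv(), epsilon=0.1,
                                         episodes=20, runs=5)
    print("best action:", best)
    for action, (mean, std) in results.items():
        print(f"action {action}: {mean:.2f} ({std:.2f})")
```

A learned policy that barely exceeds this baseline in a target flavour has learned little beyond repeating one action, which is why the comparison is useful as a sanity check.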

Table 5: Baselines using vanilla DQN for all tested game variants. Columns: Game, Variant, 10M, 50M, 100M, Best Action. Rows: Freeway (m0d0, m1d0, m1d1, m4d0); Hero (m0d0, m1d0, m2d0); Breakout (m0d0, m12d0); Space Invaders (m0d0, m1d0, m1d1, m9d0).
Table 6: Baselines using dropout + regularization for each default flavour. Columns: Game, Variant, 10M, 50M, 100M, Best Action. Rows: Freeway (m0d0); Hero (m0d0); Breakout (m0d0); Space Invaders (m0d0).
Table 7: Comparison of baseline results with and without regularization in the default flavour. The baseline agent with regularization was trained with dropout and regularization. Columns: Game, Variant, then 10M, 50M, and 100M for the baseline, followed by 10M, 50M, and 100M for the baseline with regularization. Rows: Freeway (m0d0); Hero (m0d0); Breakout (m0d0); Space Invaders (m0d0).

Appendix C Policy Evaluation Learning Curves

We provide learning curves for evaluating a policy learned in the default flavour (m0d0) on each subsequent flavour of that game. Each subplot shows the results of evaluating the policy from a representation trained with and without regularization.


Checkpoints of the network weights were taken at a fixed frame interval during training, up to 50M frames in total. Each checkpoint was then evaluated in the target mode for a fixed number of episodes, averaged over five runs. Hyperparameters are kept consistent with the baseline experiments in Appendix B.
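The aggregation step of this protocol, turning per-run checkpoint evaluations into a single learning curve, can be sketched as follows. The data layout and the function name are assumptions made for illustration: each run is a list of (frames, mean evaluation return) pairs, and all runs are assumed to share the same checkpoint schedule.

```python
import statistics

def aggregate_curves(runs):
    """Combine per-run checkpoint evaluations into one curve.

    runs: list of runs, each a list of (frames, mean_return) pairs taken
    at the same checkpoints. Returns a list of (frames, mean, std) tuples,
    where std is the sample standard deviation across runs.
    """
    curve = []
    for i, (frames, _) in enumerate(runs[0]):
        values = [run[i][1] for run in runs]
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        curve.append((frames, statistics.mean(values), spread))
    return curve

if __name__ == "__main__":
    # Made-up numbers purely to show the data layout.
    runs = [
        [(1_000_000, 1.0), (2_000_000, 3.0)],
        [(1_000_000, 3.0), (2_000_000, 5.0)],
    ]
    for frames, mean, std in aggregate_curves(runs):
        print(f"{frames:>9} frames: {mean:.2f} ± {std:.2f}")
```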

Figure 5: Performance curves for policy evaluation results (panels a to i). The x-axis is the number of frames before we evaluated the ε-greedy policy from the default flavour on the target flavour. The y-axis is the cumulative reward the agent incurred.