
Novelty Search for Deep Reinforcement Learning Policy Network Weights by Action Sequence Edit Metric Distance

Ethan C. Jackson, The University of Western Ontario and Vector Institute
Mark Daley, The University of Western Ontario and Vector Institute

Reinforcement learning (RL) problems often feature deceptive local optima, and learning methods that optimize purely for reward signal often fail to learn strategies for overcoming them (Lehman and Stanley, 2011). Deep neuroevolution and novelty search have been proposed as effective alternatives to gradient-based methods for learning RL policies directly from pixels. In this paper, we introduce and evaluate the use of novelty search over agent action sequences by string edit metric distance as a means for promoting innovation. We also introduce a method for stagnation detection and population resampling inspired by recent developments in the RL community (Savinov et al., 2018; Ecoffet et al., 2019) that uses the same mechanisms as novelty search to promote and develop innovative policies. Our methods extend a state-of-the-art method for deep neuroevolution using a simple-yet-effective genetic algorithm (GA) designed to efficiently learn deep RL policy network weights (Such et al., 2017). Experiments using four games from the Atari 2600 benchmark were conducted. Results provide further evidence that GAs are competitive with gradient-based algorithms for deep RL. Results also demonstrate that novelty search over action sequences is an effective source of selection pressure that can be integrated into existing evolutionary algorithms for deep RL.

1. Introduction

Reinforcement learning (RL) (Sutton and Barto, 1998) problems often feature deceptive local optima that impose difficult challenges to many learning algorithms. Algorithms that optimize strictly for reward often produce degenerate policies that cause agents to under-explore their environments or under-develop strategies for increasing reward. Deceptive local optima have proved to be equally challenging for both gradient-based RL algorithms, including DQN (Mnih et al., 2015), and gradient-free algorithms including genetic algorithms (GAs) (Such et al., 2017).

Deceptive local optima in reinforcement learning have long been studied by the evolutionary algorithms community — with concepts including novelty search being introduced in response (Lehman and Stanley, 2011). The deep RL community has responded with similar ideas and tools, but in purely gradient-based learning frameworks. A good example is recent work from Google Brain and DeepMind that promotes episodic curiosity in deep RL benchmarks (Savinov et al., 2018). Both novelty search and episodic curiosity were designed to address deceptive local optima by substituting or supplementing the reward signal with some measure of behavioural novelty. In practice, an agent’s behaviour has usually been defined in terms of its environment. Behaviour is often quantified using information contained in environment observations. For example, agents that reach new locations (Such et al., 2017), or that reach the same location using an alternate route (Savinov et al., 2018), can be rewarded for their novel behaviour.

In this paper, we investigate whether agent behaviour can be quantified more generally and leveraged more directly. We investigate the following question: “Can the history of actions performed by agents be used to promote innovative behaviour in benchmark RL problems?” Towards answering this, we implemented two novel methods for incorporating behavioural history in an evolutionary algorithm designed to effectively train deep RL networks. The base algorithm is an approximate replication of Such et al.’s genetic algorithm (GA) for learning DQN (Mnih et al., 2015) network weights. This is a very simple yet effective gradient-free approach for learning DQN policies that are competitive with those produced by Deep Q-learning (Such et al., 2017).

Both methods are GA extensions based on Lehman and Stanley’s novelty search (Lehman and Stanley, 2011) — an evolutionary algorithm designed to avoid deceptive local optima by defining selection pressure in terms of behaviour instead of conventional optimization criteria such as reward signal. Novelty search has been shown to be an effective tool for promoting innovation in RL (Such et al., 2017). In this paper, we introduce the use of Levenshtein distance (Levenshtein, 1966) — a form of string edit metric distance — as the behavioural distance function in a novelty search for deep RL network weights.

The first method (Method I) is an implementation of novelty search in which, during training, the reward signal is completely substituted by a novelty score based on the Levenshtein distance between sequences of game actions. In a novelty search, behaviour characteristics are stored in an archive for a randomly-selected subset of individuals in each generation. We define the behaviour characteristic as the sequence of actions performed by an agent during the training episode. Selection pressure is then determined by computing the behavioural distance between individuals in the current population and those in the archive — which we define as Levenshtein distance.

The second method (Method II) is not a novelty search, but rather a modification to the Base GA that incorporates elements of novelty search to avoid population convergence to locally-optimal behaviours. The modified algorithm detects slowing learning progress as measured using game scores in validation episodes. When validation scores are non-increasing for a fixed number of episodes, the population is regenerated by sampling the archive for individuals whose behaviours were most novel compared to the current population — a concept related to restarting and recentering in evolutionary algorithms (Hughes et al., 2013).

Using two sets of experiments, we evaluated each method’s effectiveness for learning RL policies for four Atari 2600 games: Assault, Asteroids, MsPacman, and Space Invaders. We found that while Method I is less effective than the Base GA for learning high-scoring policies, it returns policies that are behaviourally distinct. For example, we observed more frequent use of obstacles and longer agent lifespans in some games. Method II was more effective than Method I for learning high-scoring policies. In two out of four games, it produced better-scoring policies than the Base GA, and in one out of four, it produced better-scoring policies than the original DQN learning method.

Importantly, and in contrast to previous uses of novelty search for deep RL, the behaviour characteristic and behavioural distance function used here do not require environment-specific knowledge. While such a requirement is not inherently a hindrance, it is convenient to have tools that work in more general contexts. Compared to related methods that use memories of observations (usually environment observations) to return to previous states (Ecoffet et al., 2019) or to re-experience or re-visit under-explored areas (Savinov et al., 2018), archives of action sequences are relatively compact, easy to store, and efficient to compare. As such, the methods presented in this paper can either be used as stand-alone frameworks, or as extensions to existing methods that use environmental memory to improve learning.

In the next section, we give an overview of the Base GA and architecture, the Atari benchmark problem, and our experimental setup. In Section 3 we provide a full definition of novelty search and details of our implementation based on action sequences and Levenshtein distance (Method I). In Section 4 we provide further details for Method II. Section 5 describes experiments and results, and is followed by discussion in Section 6.

2. Highly-Scalable Genetic Algorithms for Deep Reinforcement Learning

The conventional objective in RL is to produce an optimal policy — a function that maps states to actions such that reward, or gain in reward, is optimized. The methods introduced in this paper are extensions of a replicated state-of-the-art GA for learning deep RL policy network weights introduced by Such et al. in (Such et al., 2017).

2.1. DQN Architecture and Preprocessing

An RL policy network is an instance of a neural network that implements an RL policy. For comparability with related work, we used the DQN neural network architecture (Mnih et al., 2015) in all experiments. This network consists of three convolutional layers with 32, 64, and 64 filters, respectively, followed by one dense layer with 512 units. The convolutional filter sizes are 8×8, 4×4, and 3×3, respectively. The strides are 4, 2, and 1, respectively. All network weights are initialized using Glorot normal initialization. All network layer outputs use rectified linear unit (ReLU) activation. All game observations (frames) are downsampled to three-dimensional arrays whose third dimension reflects separate intensity channels for red, green, blue, and luminosity. Consecutive game observations are summed to rectify sprite flickering.

2.2. Seed-Based Genetic Algorithm

Perhaps surprisingly, very simple genetic algorithms have been shown to be competitive with Deep Q-learning for learning DQN architecture parameterizations (Such et al., 2017). In their paper, Such et al. introduced an efficient seed-based encoding that enables very large network parameterizations to be indirectly encoded by a list of deterministic pseudo-random number generator (PRNG) seeds. In contrast to a direct encoding, this representation scales with the number of evolutionary generations (typically thousands) rather than the number of network connections (typically millions or more). This encoding enables GAs to work at unprecedented scales for tuning neural network weights. More generally, it enables a new wave of exploration for evolutionary algorithms and deep learning.

For the present work, we implemented a GA and encoding approximately as described in (Such et al., 2017) using Keras (Chollet, 2017), a high-level interface for Tensorflow (Abadi et al., 2016), and NumPy (Oliphant, 2006). An individual in the GA’s population is encoded by a list of seeds for both Keras’ and NumPy’s deterministic PRNGs. The first seed is used to initialize network weights. Subsequent seeds are used to produce additive mutation noise. A constant scaling factor (mutation power) is used to scale down the intensity of noise added per generation.

A network parameterization is thus defined by:

    θ^0 = φ(τ_0)
    θ^n = θ^{n-1} + σ ε(τ_n),  n ≥ 1

where θ^n denotes the network weights at generation n, (τ_0, τ_1, …, τ_n) denotes the encoding of θ^n as a list of seeds, φ denotes a seeded, deterministic initialization function, ε denotes a seeded, deterministic, normally-distributed PRNG seeded with τ_n, and σ denotes a constant scaling factor (mutation power).

As in its introductory paper, the GA does not implement crossover, and mutation simply appends a randomly-generated seed to an individual’s list of seeds. The GA performs truncated selection — a process whereby the top T individuals are selected as reproduction candidates (parents) for the next generation. From these parents, the next generation’s population is uniformly, randomly sampled with replacement, and mutated.
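To make the encoding concrete, the following is a minimal pure-Python sketch of seed-based reconstruction and mutation. It substitutes the standard library's random module for the Keras and NumPy PRNGs used in the actual implementation; the toy weight count and helper names are illustrative, not taken from the paper's code.

```python
import random

MUTATION_POWER = 0.002  # sigma (mutation power) from Table 1
N_WEIGHTS = 8           # toy size; a real DQN has millions of weights

def reconstruct(seeds):
    """Rebuild a weight vector from its genome: the first seed drives
    initialization, each later seed drives one generation of additive
    Gaussian mutation noise."""
    rng = random.Random(seeds[0])
    weights = [rng.gauss(0.0, 1.0) for _ in range(N_WEIGHTS)]  # stand-in for Glorot init
    for seed in seeds[1:]:
        rng = random.Random(seed)
        weights = [w + MUTATION_POWER * rng.gauss(0.0, 1.0) for w in weights]
    return weights

def mutate(genome, seed_rng):
    """Mutation appends one fresh random seed; the parent genome is unchanged."""
    return genome + [seed_rng.randrange(2**31)]

# The same seed list always reconstructs the same network.
parent = [42]
child = mutate(parent, random.Random(0))
assert reconstruct(parent) == reconstruct(parent)
assert reconstruct(child) != reconstruct(parent)
```

Storing only seeds keeps a genome to a few integers per generation, at the cost of re-running initialization and every mutation step whenever the network must be evaluated.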

The GA also implements a form of elitism — a commonly used tactic to ensure that the best performing individual is preserved in the next generation without mutation. A separate set of validation episodes is used to help determine the elite individual during training. This has the effect of adding secondary selection pressure for generalizability and helps to reduce overfitting. More details are given in Section 5.
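The selection scheme just described can be sketched as follows. Here `fitness` and `validation_score` are hypothetical stand-ins for training and validation episode rollouts, which in the real system involve emulating Atari episodes; the truncation size is shrunk for the toy example.

```python
import random

TRUNCATION_SIZE = 3  # T; Table 1 uses 20, shrunk here for the toy example

def next_generation(population, fitness, validation_score, rng):
    """One generation of truncated selection with a validation-based elite."""
    # Truncated selection: keep the top-T genomes by training score.
    parents = sorted(population, key=fitness, reverse=True)[:TRUNCATION_SIZE]
    # Elitism: the parent generalizing best in validation survives unmutated.
    elite = max(parents, key=validation_score)
    children = [elite]
    while len(children) < len(population):
        parent = rng.choice(parents)                      # uniform, with replacement
        children.append(parent + [rng.randrange(2**31)])  # mutate: append a seed
    return children, elite

population = [[seed] for seed in range(10)]
children, elite = next_generation(population, sum, sum, random.Random(1))
assert elite == [9] and children[0] == [9]      # elite preserved, unmutated
assert all(len(c) == 2 for c in children[1:])   # everyone else gained a seed
```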

It is important to note that this encoding imposes network reconstruction costs that would not be incurred using a direct encoding. The compact representation, though, enables a degree of scalability that would not be practical using a direct encoding. Algorithm descriptions and source code for the Base GA, Method I, and Method II are provided in the Appendix and Digital Appendix, respectively. For further details on the Base GA, refer to (Such et al., 2017).

2.3. Atari 2600 Benchmark

The Atari 2600 Benchmark is provided as part of OpenAI Gym (Brockman et al., 2016) — an open-source platform for experimenting with a variety of reinforcement learning problems. Work by Mnih et al. (Mnih et al., 2013) introduced a novel method and architecture for learning to play games directly from pixels — a challenge that remains difficult (Hessel et al., 2018). Though many enhancements and extensions have been developed for DQN, no single learning method has emerged as generally dominant (Hessel et al., 2018; Ecoffet et al., 2019).

The games included in the Atari 2600 benchmark provide a diverse set of control problems. In particular, the games vary greatly in both gameplay and logic. In MsPacman, for example, part of the challenge comes from the fact that the rules for success change once MsPacman consumes a pill. To achieve a high score, the player or agent must shift strategies from escape to pursuit. This is quite different from Breakout for example — a game in which the optimal paddle position can be computed as a function of consecutive ball observations. The variety of problems provided by this benchmark makes it an interesting set to study.

Before designing experiments, it is important to ask whether the chosen methods are plausibly capable of learning high quality policies. In games like MsPacman, is it reasonable to expect that a strictly feed-forward network architecture like DQN should be capable of producing high-quality policies? Though we do not investigate this question in the experiments presented in this paper, we comment on it in Section 6.

2.4. Experimental Setup

The Base GA and encoding for our experiments are an approximate replication of the GA and encoding introduced by Such et al. in (Such et al., 2017). All code was written in Python and uses Keras and TensorFlow for network evaluation. All experiments were run on a CPU-only 32-core Microsoft Azure cloud-based virtual machine (Standard F32s_v2). The code is scalable to any number of threads and could be adapted to run on a distributed system. A single run of a Method II experiment (see Table 1) required roughly 120 hours of wall-clock compute time on this system.

3. Novelty Search Over Action Sequences

Reinforcement learning problems often feature deceptive local optima or sparse reward signals. For example, consider a simple platform game in which the player navigates the environment to collect rewards. Environmental obstacles, such as walls and stacked platforms, increase gameplay complexity and introduce latent optimization criteria. A simple example of such a game is visualized by Figure 1.

Figure 1. Example of a simple game stage with a deceptive local optimum. Assuming the goal is for the player to earn points by collecting as many diamonds as possible before using a door to exit the stage, a globally suboptimal policy may never learn to scale the wall to the player’s left and collect three additional diamonds.

To overcome such challenges, agents may need to develop behavioural or strategic innovations that are not exhibited by any agent in the initial population. While it is possible for innovations to appear strictly as a result of mutations using the Base GA, these innovations are only promoted to the next generation if they immediately yield a positive return in terms of reward signal. Introduced by Lehman and Stanley in (Lehman and Stanley, 2011), novelty search addresses environmental challenges in RL by redefining optimization criteria in terms of behavioural innovation. In the context of evolutionary algorithms, including GAs, a pure novelty search defines fitness in terms of novelty rather than reward. Novelty search requires the following additional components over a typical genetic algorithm: 1) a behaviour characteristic, 2) a behavioural distance function, and 3) an archive of stored behaviour characteristics.

3.1. Behaviour Characteristic

The behaviour characteristic of a policy π, denoted by b(π), is a description of some behavioural aspect of π with respect to its environment. For example, the behaviour characteristic of a policy for controlling an agent in a 2D navigation environment could be the coordinate pair of the agent’s location at the end of an episode. The behavioural distance between two behaviour characteristics is the output of a suitable distance metric function applied to two behaviour characteristics b(π_1) and b(π_2). For example, assuming that b maps a policy to the final resting coordinates of an agent in 2D space, the behavioural distance function could be Euclidean distance in R^2. Continuing with this example, an archive would consist of a randomly-selected subset of final resting coordinates reached by agents throughout training.
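For the 2D-navigation example above, a novelty score can be computed as the mean distance from a behaviour characteristic to its k nearest neighbours in the archive, following Lehman and Stanley's formulation. The following sketch uses illustrative coordinates and an assumed k; neither is taken from the paper.

```python
import math

def novelty(bc, archive, k=3, dist=math.dist):
    """Mean behavioural distance from bc to its k nearest archive members."""
    nearest = sorted(dist(bc, other) for other in archive)[:k]
    return sum(nearest) / len(nearest)

# Final resting coordinates of previously archived agents.
archive = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# An agent ending far from every archived endpoint is more novel.
assert novelty((10.0, 10.0), archive) > novelty((0.5, 0.5), archive)
```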

In previous work, both behaviour characteristics and behavioural distance functions were assumed to be domain-specific: they would not usually generalize to other environments. In this paper, we introduce a generalized formulation of novelty that applies to any game in the Atari 2600 benchmark, and that generalizes to many more control problems.

We define the behaviour characteristic of a policy to be the sequence of discrete actions performed by an agent in response to consecutive environment observations. These action sequences are encoded as strings of length F, where F is the maximum number of frames available during training. Characters are either elements of a game’s action space (distinct symbols that encode a button press) or a reserved character that encodes a death action or non-consumed frame.
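As a concrete illustration, one way to realize this encoding in Python is below. The mapping of action indices to characters and the use of '_' as the reserved padding character are our assumptions; the paper does not fix these symbols here.

```python
ACTION_CHARS = "0123456789abcdefgh"  # one character per discrete action (assumed)
PAD = "_"                            # assumed stand-in for the reserved death/no-frame character

def encode_actions(action_indices, max_frames):
    """Encode an episode's actions as a string of length F = max_frames,
    padding positions after the agent's death."""
    chars = [ACTION_CHARS[a] for a in action_indices]
    return "".join(chars) + PAD * (max_frames - len(chars))

# A four-action episode in an environment allowing up to eight frames.
assert encode_actions([0, 3, 3, 1], max_frames=8) == "0331____"
```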

3.2. Behavioural Distance Function

We define the behavioural distance function as an approximation of the Levenshtein distance (Levenshtein, 1966) between action sequences encoded by strings. Note that other string edit distance metrics, such as Hamming distance (Hamming, 1950), or distributional distance metrics, such as Kullback-Leibler divergence (Kullback and Leibler, 1951), could also be used as behavioural distance functions. We chose to base our behavioural distance function on Levenshtein distance because it captures temporal relationships between action sequences that the other metrics do not.

For example, two action sequences encoded by x12345 and 12345x are much closer in Levenshtein space (two edits: one deletion and one insertion) than in Hamming space (six edits: one substitution at each position). The Kullback-Leibler divergence between the distribution of actions in these two strings is zero since each action occurs exactly once, thus failing to discriminate the two policies by their statistics.

The additional descriptive power of Levenshtein distance comes with higher computational costs. The time complexity of computing the Levenshtein distance between two strings of length n is O(n^2) using the standard dynamic-programming algorithm. For large enough n, Levenshtein distance computations impose a bottleneck on learning – a problem we encountered in preliminary experiments.

To remedy this, we simply restrict the size of each compared string by splitting action sequences into fixed-length segments and computing the cumulative Levenshtein distance between corresponding segments. All experiments reported in this paper use a fixed segment length s for computing segmented Levenshtein distance. While some information is lost using this approach, the practical reduction in runtime necessitates the choice. The behavioural distance function, which computes segmented Levenshtein distance, is defined by Equation 3:


    d(x, y) = Σ_{i=1}^{m} lev(x_i, y_i)        (3)

where x and y are two action sequences encoded by strings, x_i and y_i denote their i-th segments, m is the number of segments, s is the length of each segment, and lev computes the Levenshtein distance between two strings. The number of segments is determined by computing m = F/s, where F is the number of characters in x and y, equal to the maximum number of frames available during training. In experiments, lev is computed using the Python package python-Levenshtein (Haapala, [n. d.]).

3.3. Hybrid Algorithm

In a pure novelty search, fitness in the GA would be defined entirely by novelty scores. The experiments reported in this paper for Method I use a hybrid algorithm in which, as in a pure novelty search, selection pressure during training is solely determined using novelty scores. To identify the generation elite, however, we use the validation game score instead of a novelty score. This is due to the episodic nature of our chosen behaviour characteristic: action sequences archived during training are specific to the training episode. To be consistent with other experiments using novelty search, we avoided introducing validation-specific archives for additional episodes. So while novelty is the dominant component of selection pressure, we make this distinction clear to differentiate the method from a pure novelty search. Experiments using Method I are discussed in Section 5.1.

4. Novelty-Based Population Resampling in Genetic Algorithms

Reward sparsity is highly variable between RL problems. The Atari 2600 game Montezuma’s Revenge, for example, is a complex platform game that requires significant exploration, puzzle-solving, and other strategies to complete. Until very recently, it had proved challenging to develop high-performing policies for this game without human-generated playthrough examples. A new method called Go-Explore was recently introduced as the state of the art for producing Montezuma’s Revenge policies (Ecoffet et al., 2019). Though it is not based on evolutionary algorithms or the DQN architecture, Go-Explore borrows ideas from novelty search – namely the use of an archive to store and recall states over the course of policy search.

Motivated by this result, we designed Method II as an extension to the Base GA that adds features inspired by Go-Explore. In particular, we designed experiments to test whether an archive of action sequences recorded throughout evolution could be effectively used for promoting innovation. Over the course of evolution, a randomly selected subset of individuals together with their action sequences are archived. This archive gradually collects individuals that could potentially lead to better policies than those that were selected for reproduction. Since the Base GA’s selection pressure is based entirely on game score, it remains susceptible to converging around locally optimal policies and to discarding innovations that do not yield immediate returns.

Since novelty scores are not computed to determine primary selection pressure, Method II is not a novelty search. Instead, novelty scores are only computed when the algorithm detects that policy generalizability has stagnated over some number of generations. In such cases, the algorithm generates a new population by sampling the archive for individuals whose behaviour characteristics are most distant from the current population. These sampled individuals are used as parents for the next generation and the GA proceeds otherwise identically as the Base GA.

As expected given DQN’s prior ineffectiveness for learning Montezuma’s Revenge, both Methods I and II were also unsuccessful on it in preliminary experiments. As a result, we excluded Montezuma’s Revenge from the main experiments, which are discussed in the next section.

5. Experiments

All experiments use the same four games: Assault, Asteroids, MsPacman, and Space Invaders. These games were chosen because they each feature gameplay that falls into one of two categories: one-dimensional or two-dimensional navigation. Assault and Space Invaders both allow the player or agent to move an avatar across a one-dimensional axis at the bottom of the game screen, while Asteroids and MsPacman allow a much greater range of exploration. Experiments using these four games also provide new results for the Base GA’s effectiveness for learning to play Atari using the DQN architecture.

In all experiments, we provide a baseline result using our replication of the GA described by Such et al. in (Such et al., 2017). The purpose of this baseline is to provide a replicated benchmark for using GAs to learn DQN architecture weights. While we acknowledge that many modified versions of the DQN architecture have been developed (Hessel et al., 2018), we use the original architecture to ensure comparability to a wide variety of existing results, thereby controlling for differences between algorithms rather than architectures. Video comparisons of the Base GA and Methods I and II are included in the Digital Appendix.

Hyperparameter                  Method I    Method II
Population Size (N)             100 + 1     1,000 + 1
Generations                     500         1,000
Truncation Size (T)             20          20
Mutation Power (σ)              0.002       0.002
Archive Probability             0.1         0.01
Max Frames Per Episode (F)      20,000      20,000
Training Episodes               1           1
Validation Episodes             5           30
Improvement Generations (IG)    n/a         10
Table 1. Hyperparameters for Method I and Method II experiments. Note that the Improvement Generations hyperparameter is only used in Method II experiments, and that baseline results do not use archiving. Population sizes are incremented to account for elites.

5.1. Method I

Method I was designed to test the merits of using novelty search over agent action sequences in the Atari 2600 benchmark. This method substitutes reward signal with a measure of behavioural novelty as the selection pressure in an evolutionary search for DQN architecture weights. For comparability with existing gradient-based (Mnih et al., 2015) and gradient-free (Such et al., 2017) methods, we evaluated Method I against the Base GA. Due to compute time constraints, these experiments were run at a smaller scale than for Method II. Hyperparameters are summarized by Table 1.

Method I training progress is visualized by Figure 2 and testing evaluation of Method I policies is summarized by Table 2. Overall, Method I does not produce policies that score better than either DQN or the Base GA. On the other hand, it is interesting to evaluate the behaviours of policies generated by (almost) completely ignoring the reward signal during training.

Game            Base GA Mean   Method I Mean   Base GA St. Dev.   Method I St. Dev.
Assault         812            488             228                158
Asteroids       1321           736             503                426
MsPacman        2325           1437            351                527
Space Invaders  500            474             303                195
Table 2. Comparison of Base GA and Method I testing results over 30 episodes not used in training or validation. Means and standard deviations are measured in game score units. Bolded means denote significantly better testing performance in a two-tailed t-test. The Base GA outperforms Method I in all but one game.

Results for Method I experiments suggest that novelty search indeed creates selection pressure for innovation. For example, in Space Invaders, we observed more regular use of obstacles by agents trained using Method I than by those trained using the Base GA. And in MsPacman, we observed that agents trained using Method I tended to explore more paths than their Base GA-trained counterparts. In two out of four games (Assault and Space Invaders), agents trained by Method I had significantly longer lifespans than those trained by the Base GA (see Table 3). This could either be due to the behavioural distance function’s sensitivity to differences in lifespan, or to defensive innovations that increase agent lifespan.

Game            Base GA Mean   Method I Mean   Base GA St. Dev.   Method I St. Dev.
Assault         3538           5242            995                1998
Asteroids       1263           1223            545                668
MsPacman        1112           931             137                142
Space Invaders  1264           1552            369                285
Table 3. Comparison of Base GA and Method I lifespans over 30 episodes not used in training or validation. Means and standard deviations are shown in numbers of frames over which agents survived. Bolded means denote significantly longer lifespans in a two-tailed t-test. Method I produced agents with significantly longer mean lifespans in testing in Assault and Space Invaders.

To determine whether Levenshtein distance is effectively different than lifespan as a behavioural distance function, we conducted an additional small-scale experiment using MsPacman (Method I-L). We observed that lifespan, measured by counting the number of frames an agent survives in its environment, is not an equivalent behavioural distance function to Levenshtein distance. See Table 4 for hyperparameters and Figure 3 for results.

Figure 2. Base GA and Method I learning progress (panels, left to right: Assault, Asteroids, MsPacman, Space Invaders).
Figure 3. Population mean game score over generations during training on MsPacman. Mean scores diverge after generation 160. Levenshtein distance (Method I) and lifespan are thus not equivalent behavioural distance functions.
Hyperparameter                Method I-L
Population Size (N)           100 + 1
Generations                   500
Truncation Size (T)           10
Mutation Power (σ)            0.004
Archive Probability           0.02
Max Frames Per Episode (F)    2,500
Training Episodes             2
Table 4. Hyperparameters for the Method I-L experiment. Validation episodes were not used; elites were determined using the highest game score in training over 2 episodes.

A problem with this approach is that by continually selecting for innovation, there may be insufficient evolutionary time for innovations to be optimized. Method II attempts to remedy this by integrating secondary selection pressure for novelty into an otherwise standard search for reward-optimizing policies.

5.2. Method II

Method II was designed to help the GA avoid stagnation or premature convergence to locally optimal solutions. This method adds two components to the Base GA: 1) stagnation detection, and 2) population resampling. Stagnation is detected by examining the trend of validation scores. In the Base GA, validation episodes are used solely to identify the elite individual of a population. In Method II, learning progress is declared to be stagnant when validation scores are non-increasing over 10 generations. This is reflected by the hyperparameter Improvement Generations (IG) in Table 1. Population resampling is achieved by sampling individuals from the archive to be the next generation’s parents. For Method II, novelty scores are used to select archived individuals whose policies were most different from the current population, according to the behavioural distance metric.
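A sketch of the two added components is given below. The exact stagnation test and archive format are our interpretation of the description above: `archive` is assumed to hold (genome, behaviour characteristic) pairs, and `novelty_fn` is any scoring function such as the segmented-Levenshtein novelty from Section 3.

```python
IMPROVEMENT_GENERATIONS = 10  # IG from Table 1

def stagnated(validation_scores):
    """True when the last IG validation scores never exceed the score
    recorded just before that window (i.e. no improvement)."""
    if len(validation_scores) <= IMPROVEMENT_GENERATIONS:
        return False
    window = validation_scores[-IMPROVEMENT_GENERATIONS:]
    return max(window) <= validation_scores[-IMPROVEMENT_GENERATIONS - 1]

def resample_parents(archive, population_bcs, n_parents, novelty_fn):
    """Choose archived genomes whose behaviour characteristics are most
    novel relative to the current population's."""
    ranked = sorted(archive,
                    key=lambda entry: novelty_fn(entry[1], population_bcs),
                    reverse=True)
    return [genome for genome, _ in ranked[:n_parents]]

# Toy check with 1-D behaviour characteristics and nearest-neighbour novelty.
nearest = lambda bc, pop: min(abs(bc - p) for p in pop)
archive = [("a", 0), ("b", 5), ("c", 10)]
assert resample_parents(archive, [0, 1], 2, nearest) == ["c", "b"]
assert stagnated([3.0] * (IMPROVEMENT_GENERATIONS + 1))
assert not stagnated([1.0, 2.0, 3.0])
```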

As a baseline, we tested whether novelty-based population resampling is better than sampling random individuals from the archive for learning MsPacman policies. Using the same evaluation criteria and hyperparameters as for Method II (see Table 1), we found that, for MsPacman, novelty-based population resampling is significantly better than random archive sampling. This result is summarized by Table 5 and motivated further evaluation of the method applied to other games.

We then evaluated Method II by comparing it to the Base GA. These experiments were run using hyperparameters similar to related work (Such et al., 2017); see Table 1. Method II training progress is visualized by Figure 4 and testing evaluation of Method II is summarized by Table 6. In testing, Method II yielded improved results over the Base GA in two out of four games and no significant change in the other two. We also compared Method II testing scores to those reported in (Mnih et al., 2015) for Deep Q-learning — see Table 7. Method II outperforms DQN methods in one game, and is outperformed by DQN methods in two games. These mixed results are consistent with previous work (Such et al., 2017).

Game       Random Mean   Method II Mean   Random St. Dev.   Method II St. Dev.
MsPacman   3377          3790             661               322
Table 5. Comparison of Method II (novelty-based population resampling) to random population resampling over 30 episodes not used in training or validation. In MsPacman, Method II yielded better mean game scores in testing than random population resampling in a two-tailed t-test.
Game            Base GA Mean   Method II Mean   Base GA St. Dev.   Method II St. Dev.
Assault         1219           1007             676                413
Asteroids       1263           1476             590                640
MsPacman        3385           3700             633                209
Space Invaders  615            1211             323                244
Table 6. Comparison of Base GA and Method II testing results over 30 episodes not used in training or validation. Means and standard deviations are measured in game score units. Bolded means denote significantly better testing performance in a two-tailed t-test. Method II improves learning in 2 out of 4 games over the Base GA.
Game            DQN (mean ± st. dev.)   Method II (mean ± st. dev.)
Assault         3359 ± 775              1007 ± 413
Asteroids       1629 ± 542              1476 ± 640
MsPacman        2311 ± 525              3700 ± 209
Space Invaders  1976 ± 893              1211 ± 244
Table 7. Comparison of DQN testing scores over 30 randomly-seeded episodes, as reported in (Mnih et al., 2015), to Method II. Means and standard deviations are measured in game score units. Significance was assessed with a two-tailed t-test. Method II outperforms DQN in one game, performs similarly to DQN in one game, and is outperformed by DQN in two games. These mixed results are consistent with previous comparisons between gradient-based and gradient-free learning methods (Such et al., 2017).
[Figure 4: learning-progress curves for Assault, Asteroids, MsPacman, and Space Invaders.]
Figure 4. Base GA and Method II learning progress. Mean denotes the population's mean game score over generations in training, high denotes the score of the top-performing individual over generations in training, and validation denotes the mean score of the best-generalizing individual over 30 differently-seeded environments. In each generation, the best individual in validation is designated as the elite. In 3 out of 4 games, validation scores reach a higher maximum under Method II. Whereas the Base GA seemingly failed to escape a local optimum, Method II was particularly effective at improving performance in Space Invaders.

6. Discussion

The results presented in this paper support recent work showing that GAs are effective at training deep neural networks for RL. We took advantage of this to explore whether the behaviour of agents could be effectively used as selection pressure in an evolutionary search for RL policies. While our implementation of novelty search based on Levenshtein distance was not as effective as the Base GA, we found that it produced potentially useful and informative policies. In particular, we found that novelty search over Levenshtein distances is not equivalent to a longevity search, and that the policies it produces may be more defensive than those produced by typical reward optimization.
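To make the behavioural-distance idea concrete, the sketch below scores the novelty of an agent's action sequence as its mean Levenshtein distance to its nearest neighbours in an archive of previously observed sequences. This is an illustration only, not the exact code used in our experiments; the names `novelty_score`, `archive`, and `k` are ours, and a library such as python-Levenshtein would normally replace the hand-rolled distance.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance; works on strings
    # or on lists of discrete actions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def novelty_score(actions, archive, k=5):
    # Mean edit distance to the k nearest sequences in a non-empty archive:
    # larger scores indicate behaviour unlike anything seen before.
    dists = sorted(levenshtein(actions, other) for other in archive)
    return sum(dists[:k]) / min(k, len(dists))
```

In a GA loop, `novelty_score` can replace (Method I) or supplement (Method II) the game score when ranking individuals for truncation selection.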

The combination of reward signal and novelty scores in Method II resulted in a net improvement in testing scores over the Base GA in the four games tested. During Space Invaders training it is particularly evident that, while the Base GA was showing signs of stagnation or convergence in validation performance, Method II effectively reoriented the search.
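A minimal sketch of the stagnation test underlying Method II follows: track the elite's validation score each generation and trigger novelty-based resampling when no generation-to-generation change over a window of recent generations shows improvement. The names `is_stagnant`, `v_scores`, `k`, and `min_delta` are illustrative, not taken from our implementation.

```python
def is_stagnant(v_scores, k=10, min_delta=0.0):
    # True when none of the last k generation-to-generation changes
    # in elite validation score exceeds min_delta.
    if len(v_scores) < k + 1:
        return False  # not enough history yet
    recent = v_scores[-(k + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    return all(d <= min_delta for d in deltas)
```

When the check fires, the parent pool is re-ranked by novelty score rather than game score for that generation, and the score history is reset.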

Method II yielded improved policies for MsPacman over both the Base GA and DQN method. On closer inspection, however, it is clear that all of the compared policies suffer from limited complexity. In no cases did we observe a successful strategy shift from escape to pursuit upon consumption of a pill. The lack of this emergent behaviour in any of the results we considered, in addition to sub-human performance in the current state-of-the-art based on a modified DQN architecture (Hessel et al., 2018), leads us to suspect that the DQN architecture combined with reward-signal optimization is not well-suited for effectively learning situational policies or discrete mode switching. In response, we think that emerging methods such as Differentiable Inductive Logic Programming (Evans and Grefenstette, 2018), a learning framework that enables logical rules to be inferred from large-scale data using neural networks, and a new wave of automated network architecture construction algorithms could be especially useful.

More broadly, the production and storage of policies with varying behaviours, including defensiveness, could have many applications in real-world control problems. In autonomous transport, for example, it could be desirable to evaluate potential policies with a wide range of behaviours in order to select the safest. Methods based on novelty search, like the ones introduced in this paper, could be used to purposefully learn diverse strategies for achieving the same goals. This concept has recently been shown to be effective in learning frameworks based on a wide variety of methods — see (Ecoffet et al., 2019) and (Savinov et al., 2018), each of which uses environment observations to help instil novelty, in addition to (Mouret and Clune, 2015) and (Pugh et al., 2016). Methods that already implement observation-based storage and comparison could benefit from the relatively low-cost inclusion of action sequences and string edit metric distances to diversify learned policies.

7. Future Work

The evolutionary algorithms community has developed and applied many methods for evolving network architectures and related structures. NEAT (Stanley and Miikkulainen, 2002) and HyperNEAT (Stanley et al., [n. d.]) are very popular methods for simultaneously evolving network architectures and weights, and Cartesian Genetic Programming (Miller, 2011) is a related method that uses more general basis functions than are typically used in neural networks. All of these methods have been successfully applied in RL problems.

In future work, we will extend the methods detailed in this paper to include automated network architecture search. A method inspired by NEAT and that uses a compact algebraic approach to modular network representation (Jackson et al., 2017) is currently in development. Given the scale at which Such et al.’s method enables GAs to train deep neural networks, we are optimistic that both existing and forthcoming methods for topology- and weight- evolving neural networks (TWEANNs) will be effective tools for solving increasingly complex problems in RL.

We are particularly eager to develop tools that combine the open-endedness of evolutionary algorithms with the reliability and robustness of functional modules, which could range from simple logical operators to convolutional network layers and beyond. Methods for searching the complex space of deep neural network architectures and hyperparameters have recently been developed for gradient-based learning (Negrinho and Gordon, 2017). And though similar methods like HyperNEAT are certainly able to learn high-quality RL policies (Atari 2600 Leaderboard, 2018), we think a method that combines recent advances in both evolutionary algorithms and gradient-based deep reinforcement learning could be even more effective.


  • Atari 2600 Leaderboard (2018) 2018. Atari 2600 Leaderboard.
  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: a system for large-scale machine learning.. In OSDI, Vol. 16. 265–283.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. Openai gym. arXiv preprint arXiv:1606.01540 (2016).
  • Chollet (2017) François Chollet. 2017. Keras. (2017).
  • Ecoffet et al. (2019) Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. 2019. Go-Explore: a New Approach for Hard-Exploration Problems. arXiv:1901.10995 (2019).
  • Evans and Grefenstette (2018) Richard Evans and Edward Grefenstette. 2018. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61 (2018), 1–64.
  • Haapala ([n. d.]) Antti Haapala. [n. d.]. python-Levenshtein. ([n. d.]).
  • Hamming (1950) Richard W Hamming. 1950. Error detecting and error correcting codes. Bell System technical journal 29, 2 (1950), 147–160.
  • Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. 2018. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Hughes et al. (2013) James Hughes, Sheridan Houghten, and Daniel Ashlock. 2013. Recentering, reanchoring & restarting an evolutionary algorithm. In Nature and Biologically Inspired Computing (NaBIC), 2013 World Congress on. IEEE, 76–83.
  • Jackson et al. (2017) Ethan C Jackson, James Alexander Hughes, Mark Daley, and Michael Winter. 2017. An algebraic generalization for graph and tensor-based neural networks. In Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2017 IEEE Conference on. IEEE, 1–8.
  • Kullback and Leibler (1951) Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The annals of mathematical statistics 22, 1 (1951), 79–86.
  • Lehman and Stanley (2011) Joel Lehman and Kenneth O Stanley. 2011. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary computation 19, 2 (2011), 189–223.
  • Levenshtein (1966) Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, Vol. 10. 707–710.
  • Miller (2011) Julian F Miller. 2011. Cartesian genetic programming. Cartesian Genetic Programming (2011), 17–34.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, and Others. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
  • Mouret and Clune (2015) Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909 (2015).
  • Negrinho and Gordon (2017) Renato Negrinho and Geoff Gordon. 2017. DeepArchitect: Automatically Designing and Training Deep Architectures. arXiv preprint arXiv:1704.08792 (2017).
  • Oliphant (2006) Travis E Oliphant. 2006. A guide to NumPy. Vol. 1. Trelgol Publishing USA.
  • Pugh et al. (2016) Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. 2016. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI 3 (2016), 40.
  • Savinov et al. (2018) Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. 2018. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274 (2018).
  • Stanley et al. ([n. d.]) Kenneth O Stanley, David D’Ambrosio, and Jason Gauci. [n. d.]. A Hypercube-Based Indirect Encoding for Evolving Large-Scale Neural Networks. ([n. d.]).
  • Stanley and Miikkulainen (2002) Kenneth O Stanley and Risto Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolutionary computation 10, 2 (2002), 99–127.
  • Such et al. (2017) Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. 2017. Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning. arXiv preprint arXiv:1712.06567 (2017).
  • Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Introduction to reinforcement learning. Vol. 135. MIT press Cambridge.

Appendix A Appendix

Input: mutation function M, population size N, number of generations G, truncation size T, individual initializer I, individual decoder D, fitness function F, training episodes, validation episodes, deterministic uniform PRNG R.
for i = 1, …, N do
     Append I(R) to population
for g = 1, …, G do
     policies ← map(D, population)
     trainingResults ← map(F, policies)      ▷ evaluated on training episodes
     Sort trainingResults by game score
     eliteCandidates ← best candidates in trainingResults
     validationResults ← map(F, eliteCandidates)      ▷ evaluated on validation episodes
     Sort validationResults by game score
     elite ← 1 best in validationResults
     Save elite to disk
     parents ← T best in trainingResults
     if g < G then
          newPopulation ← [elite]
          for i = 1, …, N − 1 do
               parent ← uniform-random choice from parents (using R)
               Append M(parent) to newPopulation
          population ← newPopulation
Algorithm 1 Base GA
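The Base GA loop can be rendered compactly in Python. The sketch below is a simplification under stated assumptions: the elite here is chosen by training fitness rather than held-out validation episodes, networks are abstracted into opaque individuals, and all names (`base_ga`, `init`, `mutate`, `fitness`) are ours.

```python
import random

def base_ga(init, mutate, fitness, pop_size, generations, trunc_size, seed=0):
    rng = random.Random(seed)  # deterministic PRNG, as in the paper
    population = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        elite = scored[0]              # best individual is carried over unchanged
        parents = scored[:trunc_size]  # truncation selection
        population = [elite] + [mutate(rng.choice(parents), rng)
                                for _ in range(pop_size - 1)]
    return max(population, key=fitness)
```

For example, with `init` returning 0, `mutate` adding ±1, and `fitness(x) = -(x - 7)**2`, the loop climbs toward 7 while elitism guarantees the best score never regresses.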
Input: mutation function M, population size N, number of generations G, truncation size T, individual initializer I, individual decoder D, fitness function F, training episodes, validation episodes, deterministic uniform PRNG R, archive insertion probability p, novelty function ν.
for i = 1, …, N do
     Append I(R) to population
for g = 1, …, G do
     policies ← map(D, population)
     trainingResults ← map(F, policies)
     for result in trainingResults do
          Append result's action sequence to the archive with probability p
     nScores ← map(ν, trainingResults)      ▷ novelty with respect to the archive
     Sort trainingResults by novelty score
     eliteCandidates ← best candidates in trainingResults
     validationResults ← map(F, eliteCandidates)
     Sort validationResults by game score
     elite ← 1 best in validationResults
     Save elite to disk
     parents ← T most novel in trainingResults
     if g < G then
          newPopulation ← [elite]
          for i = 1, …, N − 1 do
               parent ← uniform-random choice from parents (using R)
               Append M(parent) to newPopulation
          population ← newPopulation
Algorithm 2 Method I - Novelty Search
Input: mutation function M, population size N, number of generations G, truncation size T, individual initializer I, individual decoder D, fitness function F, training episodes, validation episodes, deterministic uniform PRNG R, archive insertion probability p, novelty function ν, number of improvement generations k.
for i = 1, …, N do
     Append I(R) to population
vScores ← []
for g = 1, …, G do
     policies ← map(D, population)
     trainingResults ← map(F, policies)
     for result in trainingResults do
          Append result's action sequence to the archive with probability p
     Sort trainingResults by game score
     eliteCandidates ← best candidates in trainingResults
     validationResults ← map(F, eliteCandidates)
     Sort validationResults by game score
     elite ← 1 best in validationResults
     Save elite to disk
     Append elite validation score to vScores
     parents ← T best in trainingResults
     if length(vScores) ≥ k then
          progress ← []
          for i = 2, …, length(vScores) do
               Append vScores[i] − vScores[i − 1] to progress
          if no value in progress is positive then      ▷ stagnation detected
               noveltyResults ← map(ν, trainingResults)
               Sort noveltyResults by novelty score
               parents ← T most novel in noveltyResults
               vScores ← []
     if g < G then
          newPopulation ← [elite]
          for i = 1, …, N − 1 do
               parent ← uniform-random choice from parents (using R)
               Append M(parent) to newPopulation
          population ← newPopulation
Algorithm 3 Method II - Stagnation Detection and Population Resampling