# CEM-RL: Combining evolutionary and gradient-based methods for policy search

###### Abstract

Deep neuroevolution and deep reinforcement learning (deep RL) algorithms are two popular approaches to policy search. The former is widely applicable and rather stable, but suffers from low sample efficiency. By contrast, the latter is more sample efficient, but the most sample efficient variants are also rather unstable and highly sensitive to hyper-parameter setting. So far, these families of methods have mostly been compared as competing tools. However, an emerging approach consists in combining them so as to get the best of both worlds. Two previously existing combinations use either a standard evolutionary algorithm or a goal exploration process together with the ddpg algorithm, a sample efficient off-policy deep RL algorithm. In this paper, we propose a different combination scheme using the simple cross-entropy method (cem) and td3, another off-policy deep RL algorithm which improves over ddpg. We evaluate the resulting algorithm, cem-rl, on a set of benchmarks classically used in deep RL. We show that cem-rl benefits from several advantages over its competitors and offers a satisfactory trade-off between performance and sample efficiency.

Aloïs Pourchot, Olivier Sigaud

Sorbonne Université, CNRS UMR 7222,

Institut des Systèmes Intelligents et de Robotique, F-75005 Paris, France

olivier.sigaud@isir.upmc.fr +33 (0) 1 44 27 88 53

## 1 Introduction

Policy search is the problem of finding a policy or controller maximizing some unknown utility function. Recently, research on policy search methods has witnessed a surge of interest due to the combination with deep neural networks, making it possible to find good enough continuous action policies in large domains. On the one hand, this combination gave rise to efficient deep reinforcement learning (deep RL) techniques (lillicrap2015continuous; schulman2015trust; schulman2017proximal). On the other hand, evolutionary methods, and particularly deep neuroevolution methods applying evolution strategies (ES) to the parameters of a deep network, emerged as a competitive alternative to deep RL due to their higher parallelization capability (salimans2016weight; conti2017improving; such2017deep).

Both families of techniques have clear distinguishing properties. Evolutionary methods are significantly less sample efficient than deep RL methods because they learn from complete episodes, whereas deep RL methods use elementary steps of the system as samples, and thus exploit more information (sigaud2018policy). In particular, off-policy deep RL algorithms can use a replay buffer to exploit the same samples as many times as useful, greatly improving sample efficiency. The sample efficiency of ESs can be improved using the "importance mixing" mechanism, but a recent study has shown that the capacity of importance mixing to improve sample efficiency by a factor of ten is still not enough to compete with off-policy deep RL (pourchot2018importance). Conversely, sample efficient off-policy deep RL methods such as the ddpg algorithm (lillicrap2015continuous) are known to be unstable and highly sensitive to hyper-parameter setting. Rather than opposing both families as competing solutions to the policy search problem, a richer perspective looks for a way to combine them so as to get the best of both worlds. As covered in Section 2, there have been very few attempts in this direction so far.

After some background in Section 3, we propose in Section 4 a new method that combines the cross-entropy method (cem) with td3, an off-policy deep RL algorithm which improves over ddpg. In Section LABEL:sec:study, we investigate experimentally the properties of this cem-rl method, showing its advantages both over the components taken separately and over a competing approach. Beyond the results of cem-rl, the conclusion of this work is that there is still a lot of unexplored potential in new combinations of evolutionary and deep RL methods.

## 2 Related work

Policy search is an extremely active research domain. The realization that evolutionary methods are an alternative to continuous action reinforcement learning and that both families share some similarity is not new (stulp12icml; stulp2012policy; stulp13paladyn) but so far works have mainly focused on comparing them (salimans2017evolution; such2017deep; conti2017improving). Under this perspective, it was shown in (duan2016benchmarking) that, despite its simplicity with respect to most deep RL methods, the Cross-Entropy Method (cem) was a strong baseline in policy search problems. Here, we focus on works which combine both families of methods.

Synergies between evolution and reinforcement learning have already been investigated under the light of the so-called Baldwin effect (simpson1953baldwin). This literature is somewhat related to research on meta-learning, where one seeks to evolve an initial policy from which a self-learned reinforcement learning algorithm will perform efficient improvement (wang2016learning; houthooft2018evolved; gupta2018meta). The key difference with respect to the method proposed here is that in this literature, the outcome of the RL process is not incorporated back into the genome of the agent, whereas here evolution and reinforcement learning update the same parameters in iterative sequences.

Closer to ours, the work of colas2018gep sequentially applies a goal exploration process (gep) to fill a replay buffer with purely exploratory trajectories and then applies ddpg to the resulting data. The gep shares many similarities with evolutionary methods, apart from its focus on diversity rather than on performance of the learned policies. The authors demonstrate on the Continuous Mountain Car and half-cheetah-v2 benchmarks that their combination, gep-pg, is more sample-efficient than ddpg, leads to better final solutions and induces less variance during learning. However, due to the sequential nature of the combination, the gep part does not benefit from the efficient gradient search of the deep RL part.

Another approach related to ours is the work of maheswaranathan2018guided, where the authors introduce optimization problems with a surrogate gradient, i.e. a direction which is correlated with the real gradient. They show that by modifying the covariance matrix of an ES to incorporate the information contained in the surrogate, a hybrid algorithm can be constructed. They provide a thorough theoretical investigation of their procedure, which they experimentally show capable of outperforming both a standard gradient descent method and a pure ES on several simple benchmarks. They argue that this method could be useful in RL, since surrogate gradients appear in Q-learning and actor-critic methods. However, a practical analysis of those claims remains to be performed. Their approach resembles ours, since they use a gradient method to enhance an ES. A notable difference is that they use the gradient information to directly change the distribution from which samples are drawn, whereas we use gradient information on the samples themselves, which changes the distribution only indirectly.

The work which is the closest to ours is khadka2018evolutionary. The authors introduce an algorithm called erl (for Evolutionary Reinforcement Learning), which is presented as an efficient combination of a deep RL algorithm, ddpg, and a population-based evolutionary algorithm. It takes the form of a population of actors, which are constantly mutated and selected based on their fitness. In parallel, a single ddpg agent is trained from the samples generated by the population. This single agent is then periodically inserted into the population. When the gradient-based policy improvement mechanism of ddpg is efficient, this individual outperforms its evolutionary siblings, gets selected into the next generation and draws the whole population towards higher performance. Through their experiments, khadka2018evolutionary demonstrate that this setup benefits from an efficient transfer of information between the RL algorithm and the evolutionary algorithm.

However, their combination scheme cannot be applied in the context of an ES. Indeed, in these methods a covariance matrix is used to produce the next generation but, because the additional individual from ddpg is generated in isolation, it may not comply with this covariance matrix. This is unfortunate because ESs are generally the most efficient evolutionary methods, and importance mixing can only be applied in their context to bring further sample efficiency improvement.

By contrast with the works outlined above, the method presented here combines cem and td3 in such a way that our algorithm benefits from the gradient-based policy improvement mechanism of td3, from the better stability of ESs, and may even profit from the additional sample efficiency brought by importance mixing.

## 3 Background

In this section, we provide a quick overview of the evolutionary and deep RL methods used throughout the paper.

### 3.1 Evolutionary algorithms, evolution strategies and EDAs

Evolutionary algorithms manage a limited population of individuals, and generate new individuals randomly in the vicinity of the previous elite individuals (back1996evolutionary). Evolution strategies can be seen as specific evolutionary algorithms where only one individual is retained from one generation to the next, this individual being the mean of the distribution from which new individuals are drawn. More specifically, an optimum individual is computed from the previous samples and the next samples are obtained by adding Gaussian noise to the current optimum. Finally, among ESs, Estimation of Distribution Algorithms (EDAs) are a specific family where the population is represented as a distribution using a covariance matrix $\Sigma$ (larranaga2001estimation). This covariance matrix defines a multivariate Gaussian function and samples at the next iteration are drawn with a probability proportional to this Gaussian function. Along iterations, the ellipsoid defined by $\Sigma$ is progressively adjusted to the top part of the hill corresponding to the local optimum. Various instances of EDAs, such as cem, cma-es, pi-cma, are covered in stulp12icml; stulp2012policy; stulp13paladyn. Here we focus on the first two.

### 3.2 The Cross-Entropy Method

The Cross-Entropy Method (cem) is a simple EDA where the number of elite individuals is fixed to a certain value $K_e$ (usually set to half the population). After all individuals of a population are evaluated, the $K_e$ most fit individuals are used to compute the new mean and variance of the population, from which the next generation is sampled after adding some limited extra variance to prevent premature convergence.

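In more detail, we denote by $\mu$ the mean of the distribution, and by $\Sigma$ its covariance. Each individual $x_i$ is sampled by adding Gaussian noise around $\mu$, according to the current covariance matrix $\Sigma$, i.e. $x_i \sim \mathcal{N}(\mu, \Sigma)$. The problem-dependent fitness of these new individuals is then computed. The $K_e$ top-performing (elite) individuals, $(z_i)_{i=1,\dots,K_e}$, are then used to update the parameters of the distribution as follows:

$$\mu_{new} = \sum_{i=1}^{K_e} \lambda_i z_i \tag{1}$$

$$\Sigma_{new} = \sum_{i=1}^{K_e} \lambda_i (z_i - \mu_{old})(z_i - \mu_{old})^T + \epsilon \mathcal{I} \tag{2}$$

where $(\lambda_i)_{i=1,\dots,K_e}$ are weights given to the individuals. Common choices for those weights are $\lambda_i = \frac{1}{K_e}$ or $\lambda_i = \frac{\log(1+K_e)/i}{\sum_{j=1}^{K_e} \log(1+K_e)/j}$. In the former, each individual is given the same importance, whereas the latter gives more importance to better individuals.

In this work, we add some noise in the form of $\epsilon \mathcal{I}$ to the usual covariance update. This prevents early convergence, which is otherwise inevitable given the greediness of cem. We choose an exponentially decaying $\epsilon$, by setting an initial and a final standard deviation, respectively $\sigma_{init}$ and $\sigma_{end}$, initializing $\epsilon$ to $\sigma_{init}$ and updating $\epsilon$ at each iteration with $\epsilon = \tau_{cem}\,\epsilon + (1 - \tau_{cem})\,\sigma_{end}$.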

An iteration of the cem algorithm is depicted in Figure (a).
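To make the update concrete, the following is a minimal NumPy sketch of one cem generation under stated assumptions: the elite is refit with a weighted mean, the covariance is computed around the previous mean, and a decaying noise term is added to the diagonal. The function names (`cem_weights`, `cem_step`, `run_cem`) and all default values (population size, `sigma_init`, `sigma_end`, `tau`, initial identity covariance) are hypothetical choices made for illustration, not settings taken from this paper.

```python
import numpy as np


def cem_weights(n_elite, rank_based=False):
    """Return elite weights that sum to 1 (uniform or rank-based)."""
    if not rank_based:
        return np.full(n_elite, 1.0 / n_elite)
    ranks = np.arange(1, n_elite + 1)
    raw = np.log(1 + n_elite) / ranks  # better-ranked elites weigh more
    return raw / raw.sum()


def cem_step(fitness, mu, cov, eps, pop_size, n_elite, rng, rank_based=False):
    """One generation: sample, evaluate, refit mean and covariance."""
    samples = rng.multivariate_normal(mu, cov, size=pop_size)
    scores = np.array([fitness(x) for x in samples])
    elite = samples[np.argsort(-scores)[:n_elite]]  # best individual first
    w = cem_weights(n_elite, rank_based)
    new_mu = w @ elite
    centered = elite - mu  # deviations from the previous mean
    new_cov = (w[:, None] * centered).T @ centered + eps * np.eye(mu.size)
    return new_mu, new_cov


def run_cem(fitness, dim, iterations=50, pop_size=20,
            sigma_init=1e-1, sigma_end=1e-3, tau=0.95, seed=0):
    """Run cem with an exponentially decaying extra-noise term (illustrative defaults)."""
    rng = np.random.default_rng(seed)
    mu, cov, eps = np.zeros(dim), np.eye(dim), sigma_init
    for _ in range(iterations):
        mu, cov = cem_step(fitness, mu, cov, eps, pop_size, pop_size // 2, rng)
        eps = tau * eps + (1 - tau) * sigma_end
    return mu
```

On a toy quadratic fitness such as `lambda x: -np.sum((x - target) ** 2)`, `run_cem` should drive the mean towards `target`; in policy search the fitness would instead be the return of an episode.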

### 3.3 cma-es

Like cem, cma-es is an EDA where the number of elite individuals is fixed to a certain value $K_e$. The mean and covariance of the new generation are constructed from those individuals. However, this construction is more elaborate than in cem. The top $K_e$ individuals are ranked according to their performance, and are assigned weights conforming to this ranking. Those weights measure the impact of individuals on the construction of the new mean and covariance. Quantities called "evolution paths" are also used to accumulate the search directions of successive generations. In fact, the updates in cma-es are shown to approximate the natural gradient, without explicitly modeling the Fisher information matrix (arnold11informationgeometric).

### 3.4 Importance mixing

Importance mixing is a mechanism to improve the sample efficiency of ESs. It was initially introduced in sun2009efficient and consisted in reusing some samples from the previous generation in the current one, to avoid the cost of re-evaluating the corresponding policies in the environment. The mechanism was recently extended in pourchot2018importance to reusing samples from any generation stored in an archive. Empirical results showed that importance mixing can improve sample efficiency by a factor of ten, and that most of these savings just come from using the samples from the previous generation, as performed by the initial algorithm. A pseudo-code of the importance mixing mechanism is given in Appendix LABEL:sec:im_algo and more details can be found in pourchot2018importance.
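As a rough illustration of the idea, the sketch below reuses samples of the previous generation through an accept/reject rule based on the ratio of the new and old Gaussian densities, in the spirit of sun2009efficient. It is a simplified sketch and not the pseudo-code of the appendix: the function names, the `max_tries` safeguard and the way the generation is filled up are assumptions made for this example.

```python
import numpy as np


def gaussian_logpdf(x, mu, cov):
    """Log-density of the multivariate Gaussian N(mu, cov) at x."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (mu.size * np.log(2 * np.pi) + logdet + maha)


def importance_mixing(old_samples, old_params, new_params, rng, max_tries=1000):
    """Build a generation of the same size, reusing old samples when possible.

    Returns (samples, reused); entries flagged as reused keep their old
    fitness and need no new evaluation in the environment.
    """
    (mu_old, cov_old), (mu_new, cov_new) = old_params, new_params
    n = len(old_samples)
    samples, reused = [], []
    # Step 1: keep an old sample z with probability min(1, p_new(z) / p_old(z)).
    for z in old_samples:
        log_ratio = (gaussian_logpdf(z, mu_new, cov_new)
                     - gaussian_logpdf(z, mu_old, cov_old))
        if rng.uniform() < np.exp(min(0.0, log_ratio)):
            samples.append(z)
            reused.append(True)
    # Step 2: fill up with fresh draws z' from the new distribution, accepted
    # with probability max(0, 1 - p_old(z') / p_new(z')).
    tries = 0
    while len(samples) < n:
        z = rng.multivariate_normal(mu_new, cov_new)
        log_ratio = (gaussian_logpdf(z, mu_old, cov_old)
                     - gaussian_logpdf(z, mu_new, cov_new))
        accept_prob = 1.0 - np.exp(min(0.0, log_ratio))
        tries += 1
        # Hypothetical safeguard: accept unconditionally after max_tries draws.
        if rng.uniform() < accept_prob or tries > max_tries:
            samples.append(z)
            reused.append(False)
    return np.array(samples), np.array(reused, dtype=bool)
```

When the old and new distributions coincide, every old sample is kept; when they barely overlap, almost all samples are drawn anew, so the savings depend on how much the distribution moves between generations.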

### 3.5 DDPG and TD3

The ddpg (lillicrap2015continuous) and td3 (fujimoto2018adressing) algorithms are two off-policy, actor-critic and sample efficient deep RL algorithms. The ddpg algorithm suffers from instabilities partly due to an overestimation bias in the way it updates the critic, and is known to be difficult to tune given its sensitivity to hyper-parameter settings. The availability of properly tuned code baselines incorporating several advanced mechanisms improves on the latter issue (baselines). The td3 algorithm rather improves on the former issue, limiting the over-estimation bias by using two critics and taking the lowest estimate of the action value functions in the update mechanisms. Our cem-rl algorithm uses td3 rather than ddpg as the former has been shown to consistently outperform the latter (fujimoto2018adressing). However, for the sake of comparison with erl, we also use cem in combination with ddpg in Section LABEL:sec:res2.
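The "lowest estimate" rule can be illustrated with a short sketch of the critic target. Only the minimum over two critics comes from the description above; the function name, the `done` mask and the discount factor `gamma` are assumptions of this example, and the other td3 mechanisms are left out.

```python
import numpy as np


def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Bootstrapped target using the smaller of the two critics' estimates.

    q1_next and q2_next are the two target critics' values for the next
    state-action pair; done is 1.0 where the episode terminated.
    """
    q_min = np.minimum(q1_next, q2_next)  # pessimistic estimate limits over-estimation
    return reward + gamma * (1.0 - done) * q_min
```

Both critics would then be regressed towards this shared target, so that an over-estimation by one critic is not propagated through bootstrapping.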

## 4 Methods

Our method combines cem and td3. The combination scheme can be explained as follows: the mean actor of the cem population, referred to as $\pi_\mu$, is first initialized with a random actor network. A critic network $Q^\pi$ is also initialized. At each iteration, a population of actors is sampled by adding Gaussian noise around the current mean $\pi_\mu$, according to the current covariance matrix $\Sigma$. Half of the resulting actors are directly evaluated. The corresponding fitness is computed as the cumulative reward obtained during an episode in the environment. The other half of the population follows the direction of the gradient given by the current critic $Q^\pi$, as performed in ddpg and td3. This gradient is applied for a fixed number of steps, before the actor gets evaluated too. cem then takes the top-performing half of the resulting global population to compute its new $\pi_\mu$ and $\Sigma$.
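The scheme of this paragraph can be sketched on flat parameter vectors as follows. This is a toy illustration under strong assumptions: actors are plain NumPy vectors rather than networks, `evaluate` stands in for an episode rollout, and `critic_gradient` is a hypothetical callback standing in for the policy gradient provided by the critic. The step size, number of gradient steps and noise level are illustrative, and the elite is weighted uniformly.

```python
import numpy as np


def cem_rl_iteration(mu, cov, evaluate, critic_gradient, rng,
                     pop_size=10, grad_steps=5, lr=0.05, eps=1e-3):
    """One generation: sample actors, improve half by gradient, refit with cem."""
    actors = rng.multivariate_normal(mu, cov, size=pop_size)
    half = pop_size // 2
    # Second half of the population: follow the critic's gradient for a
    # fixed number of steps before being evaluated.
    for i in range(half, pop_size):
        for _ in range(grad_steps):
            actors[i] = actors[i] + lr * critic_gradient(actors[i])
    # All actors, perturbed-only and gradient-updated, are evaluated.
    scores = np.array([evaluate(a) for a in actors])
    elite = actors[np.argsort(-scores)[:half]]  # top-performing half
    new_mu = elite.mean(axis=0)
    centered = elite - mu  # deviations from the previous mean
    new_cov = centered.T @ centered / half + eps * np.eye(mu.size)
    return new_mu, new_cov
```

If the gradient steps help, the updated actors enter the elite and pull the distribution with them; if they hurt, those actors are simply not selected, so a poor critic cannot directly corrupt the population.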

The steps used to evaluate all actors in the population are fed into the replay buffer. The critic is then trained pro rata to the quantity of new information introduced in the buffer at the current generation. For instance, if the population contains 10 individuals, and if each episode lasts 1000 time steps, then