# Region Growing Curriculum Generation

for Reinforcement Learning

###### Abstract

Learning a policy capable of moving an agent between any two states in the environment is important for many robotics problems involving navigation and manipulation. Due to the sparsity of rewards in such tasks, applying reinforcement learning in these scenarios can be challenging. Common approaches for tackling this problem include reward engineering with auxiliary rewards, requiring domain-specific knowledge or changing the objective.

In this work, we introduce a method based on region-growing that allows learning in an environment with any pair of initial and goal states. Our algorithm first learns how to move between nearby states and then increases the difficulty of the start-goal transitions as the agent’s performance improves. This approach creates an efficient curriculum for learning the objective behavior of reaching any goal from any initial state. In addition, we describe a method to adaptively adjust expansion of the growing region that allows automatic adjustment of the key exploration hyperparameter to environments with different requirements. We evaluate our approach on a set of simulated navigation and manipulation tasks, where we demonstrate that our algorithm can efficiently learn a policy in the presence of sparse rewards.

Region Growing Curriculum Generation

for Reinforcement Learning

Artem Molchanov, Karol Hausman, Stan Birchfield, Gaurav Sukhatme, University of Southern California Nvidia

## 1 Introduction

In recent years, deep reinforcement learning (Deep RL) has enjoyed success in many different applications, including playing Atari games [?], controlling a humanoid robot to perform various manipulation tasks [?; ?] and beating the world champion in Go [?]. The success and wide range of use cases of RL algorithms is partly due to the very general description of the problem that RL aims to solve, i.e., to learn autonomous behaviors given a high-level specification of a task by interacting with the environment. Such high-level specification is provided by a reward function, which must be sufficiently descriptive as well as easy to optimize for an RL algorithm to learn efficiently. These requirements make the design of the reward function challenging in practice, creating a bottleneck for even a wider set of applications for RL algorithms.

The problem of designing a reward function has been tackled in various ways. These include: i) learning the reward function from human demonstrations in the field of inverse reinforcement learning (IRL) [?; ?], ii) initializing the reinforcement learning process with demonstrations in imitation learning [?; ?], and iii) creating reward shaping functions that aim to guide the RL process to high-reward regions [?; ?]. Even though all of these methods have shown promising solutions to the problem of reward function design, they present other significant challenges such as the requirement of domain expertise or access to demonstration data.

Ideally, one would like to learn from a simple sparse binary reward that indicates completion of the task. Such a reward signal is natural for many goal-oriented tasks. It allows significant reduction of engineering effort, and in some cases can be used to learn very complicated skills from human feedback, where design of the reward function is very hard [?]. Despite being attractive, such a reward function creates significant difficulties for learning. This is due to the fact that it is very unlikely for an agent to generate the exact sequence of actions leading to solving the task from random exploration [?].

Recent efforts focus on learning from such sparse reward signals by constructing a curriculum from a continuous set of tasks [?; ?]. These methods exploit the simple intuition that tasks initialized closer to the goal should be easier to solve. Proximity to the goal is defined either explicitly [?] or through the number of random actions needed to reach the state from the goal [?]. Nevertheless, all of these methods have a common disadvantage: they are designed for either single-start or single-goal scenarios. In this paper, we address the situation in which the task contains both a continuous set of goals and a continuous set of initial conditions, thus broadening the applicability of our algorithm to a wide range of problems. In addition, we introduce a method to adaptively adjust expansion of the growing region, eliminating manual tuning of a key exploration hyperparameter whose optimal value varies across different environments.

## 2 Related Work

Intrinsic motivation. Learning from sparse rewards is a long-standing goal in RL. The most established way of coping with such scenarios has been reward shaping [?], which requires extensive engineering and domain specific knowledge. To address this problem, various researchers proposed curiosity and intrinsic motivation [?; ?] as a more general way of guiding learning in the absence of the main reward. Intrinsic motivation is typically introduced in the form of auxiliary rewards or loss components incentivizing exploration, that are not connected to the main objective. Such incentives could be based on counting visited states and/or maintaining a state-visitation density model [?; ?; ?], prediction error [?], prediction error-improvement of the learned model [?], predictive model uncertainty [?], neuro-correlation [?] or learning auxiliary tasks [?]. Despite a vast variety of approaches, many curiosity-inspired methods are prone to creating additional local minima in the learned objective function.

Curriculum learning. Another approach to learning in the presence of sparse rewards is to construct a curriculum of the task instances to ease the learning process. In this case, the agent initially learns from easy scenarios, where the chance of acquiring positive reward is relatively high, and the difficulty of the presented tasks is gradually increased until the final task is learned. The main advantage of such an approach is that the agent learns on the final objective directly, and thus avoids the problems of curiosity-driven methods. Traditionally, curriculum design has been explored from the perspective of manually engineered schedules in both supervised tasks [?; ?] and reinforcement learning scenarios [?; ?]. More recently, there have also been multiple approaches for automated curriculum generation for RL. [?] create curriculum in the form of an acyclic graph based on a transfer potential metric, [?] explored task sampling based on their current performance, and [?] utilized task performance improvement as a basis for task sampling. All of these approaches, as opposed to our method, are designed to perform well only in a discrete set of tasks with dense rewards.

Another related approach is presented in the recent work by [?] based on the idea of self play between two agents. The first agent plays the role of a teacher that sets the tasks for the second agent, who plays the role of a student who tries to repeat the teacher’s actions or reverse the environment to its original state. As mentioned by the authors and confirmed in [?], the asymmetric structure of this method often leads to a biased exploration resulting in the teacher and the student becoming stuck in a small subspace of the task. Our method avoids such situations by using random exploration to expand the set of goals and the initial conditions to the appropriate level of difficulty.

Another related piece of work is that of [?], who consider the problem of generating multiple goals of the appropriate level of difficulty using generative adversarial networks (GANs) [?]. Their approach is designed to learn a goal distribution and, thus, in its straightforward form, cannot learn to generalize to multiple initial conditions. In addition, since their approach contains a learned generative model, it tends to struggle when the dimensionality of the task representation is large and the number of examples is very limited, which is usually the case for robotics. We address this problem by generating tasks through the interaction with the environment.

The approach most related to ours is the concurrent work of [?]. We exploit similar core principles and assumptions, i.e., we utilize Brownian motion for growing the current task region and generate curriculum through reverse exploration of new tasks. We extend this approach to multi-goal and multi-start scenarios with infinitely many start-goal pairs, and present results in environments with sparse rewards. In addition, we address the question of controlling expansion of the growing region. Our algorithm adaptively changes the key exploration hyperparameter for environments with significantly different optimal settings. These contributions lead to improved resampling efficiency and eliminate the need of expensive hyperparameter tuning.

## 3 Background

We consider a reinforcement learning problem where an agent is represented by a global policy that aims to reach any goal in an environment. This section introduces a formal definition of the problem and our framework.

### 3.1 Markov decision process

We consider a discrete-time, finite-horizon Markov decision process (MDP) defined by a tuple , in which is the agent state set, is the action set, is the transition probability distribution, is a bounded reward function dependent on the goal state, where represents the goal set; is the initial state distribution, and is the time horizon. Our aim is to learn a stochastic policy parameterized by . We would like to point out that in order to communicate the goal to the agent, our formulation requires the policy to be conditioned on the goal specified by the environment, i.e. . The objective is to maximize the expected return, with the expected reward starting at being , where denotes the trajectory generated by executing actions sampled from the policy under environment dynamics .

### 3.2 Goal-dependent sparse reward function

In this work, we consider the problem of reaching any goal state in the environment from any initial state , where denotes a uniform distribution. For this purpose, we define a sparse binary reward function dependent on the goal:

(1) |

where is a set of states corresponding to the goal . We note that although the binary reward function in Eq. (1) is typically defined through some distance metric , our learning algorithm does not explicitly utilize this metric.

### 3.3 Assumptions

In this work we exploit several assumptions:

Assumption 1 The agent can be initialized at an arbitrary state . This assumption is a common requirement for many algorithms [?; ?] in the RL setting, especially those that exploit uniform initialization to generalize to multiple initial states.

Assumption 2 At least one initial state is provided to the algorithm, which we call a seed state.

Assumption 3 For every state , there exists a function that maps any state in the environment to the corresponding goal representation. This assumption is required since, in our algorithm, states encountered by the agent should be converted to the corresponding goal representations.

Assumption 4 For any pair of states there exists a trajectory that moves the agent from to . In other words, the agent can reach any state from any other state.

We note that although we explicitly introduced Assumption 4, it does not prevent our algorithm from being applied to a wider set of tasks where some states might not be mutually reachable. For example, if isolated or irreversible pairs of states exist, the algorithm nevertheless applies to all the reachable states, which depends on the initial state provided.

## 4 Approach

The main difficulty of training an RL agent in a sparse reward setting arises from the fact that it is unlikely for the agent to accomplish the task using random exploration if the initial state is far from the goal state. In this work, we take advantage of the intuition also exploited by [?] that the agent has a higher chance of success if the goal is located in close proximity to the initial state. In particular, if one could initialize learning by generating goal states that are close to the initial states, then the initial learning stages should progress much faster.

Since it can be highly nontrivial to engineer a correct distance metric directly in the observation space, we define the proximity of points by the number of actions it takes to reach one point from another.

Taking this into consideration, we propose the idea of gradually-growing reachability regions for generating a curriculum in a multi-goal setting. Our algorithm consists of two agents: a sampler and a learner. The sampler uses short chains of random actions to arrive at a state that is then added to the currently-explored set of points, which we refer to as the reachability region. This region is defined as the area where the learner has already mastered transitions between all pairs of points. As learning progresses, the sampler removes already-learned states from the reachability region and adds new points that have not been explored yet. This generates a natural curriculum for learning a global reaching policy, i.e., a policy capable of moving the agent between any two states in the environment. In the following, we first discuss the sampler and then describe the learner.

### 4.1 Filtering states

In the first part of the sampling algorithm, we focus on a criterion that indicates whether a particular set of states has been mastered. In order to select which states have already been mastered, we retain statistics of rewards received by the agent on every state within the current reachability region. We choose to follow a simple approach in which we only retain statistics of the points in the role of starting states, as opposed to retaining statistics on start-goal pairs. We define thresholds and that prevent states from being too hard or too easy, respectively. We refer to the set of all states in the current reachability region as . We keep a history of rewards in a vector and associate them with start states. If the average reward for a state in does not exceed the and thresholds we use the state for further resampling. This behavior is implemented in a helper function FilterStates that takes , , , and as input and returns the retained set of states as .

### 4.2 Adaptive state resampling

As previously mentioned, we define the proximity of the points through the action space, i.e., points are close to each other if they are reachable via short random trajectories. We use Brownian motion to sample new states to grow the region of learned state-goal pairs.

A major challenge of this approach is the selection of the variance for exploration. Poorly selected variance can result in either a very spread set of points that are hard to learn from, or a set of points that are too easily mastered, which in both cases results in a slow learning progress of the RL algorithm. We adjust the sampling variance dynamically using a method that is inspired by the integral part of a PID controller. Our approach adjusts the variance such that average reward in the current iteration () is close to a user-provided target reward (). In particular, every time before resampling, we update the sampling variance () according to the following procedure:

(2) |

where , is the control coefficient, is the maximum change of variance, and are the variance limits. Thus, if the success ratio systematically exceeds the preferred value, our method increases the variance, promoting faster exploration and vice versa.

We encode Eq. (2) in the helper function UpdateVariance that takes and as inputs and returns the new sampling variance. Resampling a set of new states is implemented in the helper function ResampleStates that takes the current set of states , the set of old mastered states , and the variance as inputs and returns the the new set of states. Resampling is carried out in two stages. First, we create an oversampled set of states by performing Brownian-motion rollouts, which we refer to as sampling rollouts. We use random actions generated by the sampler agent using and collect states visited by the agent.

Each of these rollouts is initialized at one of the states from the growing oversampled set. This set is initialized with the states retained in after filtering. At the second stage, we sample states uniformly from the oversampled set and add them to states sampled uniformly from to form the new current set of states.

### 4.3 Policy training

Algorithm 1 describes the policy training procedure including both the sampler and the learner agents. The sampler agent updates the reachability region (lines 1 – 1), while the learner follows a its own learning strategy (lines 1 – 1).

Our method starts by initializing the current state set , the corresponding vector of history of rewards and the pool of the previously learned states (line 1).

The sampler uses a fixed update period (line 1) to adjust the variance according to Eq. (2) (line 1) and proceeds to the filtering stage to find good states to propagate from (line 1). Once the filtering is finished, the sampler resamples a new set of states using Brownian motion (line 1).

The learner performs policy rollouts in every iteration (line 1) using the helper function Rollouts. This function follows a special start-goal pair sampling strategy. Start states for the rollouts are sampled uniformly from the current state set , whereas the goals are sampled from either (with probability ) or (with probability ). Once the batch of samples used for the user-chosen RL algorithm is accumulated, we update the policy (line 1).

Our approach is agnostic to the choice of agent optimization method; we only require that this method provides the UpdatePolicy function. In our experiments we use TRPO [?] as one of the most robust RL algorithms with an implementation available online.

## 5 Experiments

We implemented this approach in Python and applied it to two representative environments. We show empirically that this technique successfully trains agents in our multi-goal scenarios. Furthermore, we demonstrate that our dynamic variance selection is less sensitive to hyperparameters than other alternatives. In all of our experiments, we use the following parameters across all environments: , , , , , , , , , , .

### 5.1 Environments

The SparseReacher is an environment with a two-link manipulator based on the Reacher-v0 environment from OpenAI Gym [?]. We use it in a sparse reward setting: the agent receives a positive binary reward only when it touches the goal marker. This corresponds to the situation where the robot’s end effector is not further than from the center of the goal marker. In addition, the Cartesian velocities of the robot must be lower than . The episode ends as soon as the positive reward is acquired. As we observed in our experiments, such sparse reward makes this environment significantly more challenging, especially when the goal is to learn a policy that can reach any point in the robot’s workspace.

The goal in the Maze environment is to bring a cube of size to a goal location. The agent receives a reward only if the center of the cube lies still within an -radius of the goal location. The episode ends as soon as the positive reward is acquired. We define a variable time step in the environment that is dependent on the time it takes for the cube to settle after a force is applied. The table is constrained by the size and surrounded by walls, such that the cube cannot fall off the table. This environment has continuous action space that consists of two components of the force applied to the center of the cube, parallel to the table plane. Observations contain a 7-dimensional cube pose where the rotation is encoded as quaternion. We define a goal representation as a simple 2d position on the table.

This environment is challenging due to several aspects. First, the search space increases with , thus, the probability to encounter the target by chance is very small. Second, the cube has relatively complex dynamics compared to a simple point mass: it can be pushed or rolled depending on the direction and amount of force applied, and it exhibits a complex behavior when it comes into a contact with the wall. Third, the cube cannot simply roll over the goal to acquire a positive reward—it must stop at the precise location of the goal.

Both environments are shown in Fig. 1. For each environment, we select a single seed state to expand the growing region. For the Maze environment, we explicitly pick the most challenging scenario of the seed state located at the end of the central corridor since the policy has to learn how to precisely navigate inside of the narrow corridor entrance. Both environments can be naturally used in both single- and multi-goal settings. We also note that in every training scenario, in addition to the sparse reward, we add a very small negative reward for every time step to promote shorter episodes.

### 5.2 Reachability regions

Fig. 2 demonstrates how the region expansion proceeds during learning in the maze environment. In particular, it shows an interesting phenomenon associated with variance adaptation that we refer to as region clustering. During expansion, if the new set of points was selected too aggressively, our algorithm responds by decreasing the variance of the region expansion. Since this event by definition happens due to a poor performance (see Eq. (2)), there will be very few available states to sample from. Thus, the algorithm forms a cluster of newly resampled states located around a few states that passed through the filtering stage (see Fig. 1(b)). Later, as the learner agent improves, those clusters grow and connect, forming a single region which is illustrated in Fig. 1(c). Such behavior helps the learner to create new growing regions in isolated areas.

### 5.3 SparseReacher

Our results for the multi-goal version of the SparseReacher environment are shown in Fig. 2(a). We execute the learning process several times and provide the average reward for each iteration over six executions. We test our algorithm with two constant resampling variances ( and ) and with our adaptive variance. We also provide results for the case that does not use a reachability region, but instead samples start and goal states uniformly over the environments.

The environment is conservative and requires small exploration variances; we found that a constant variance of performs much better than a variance of . Our adaptive variance selection achieves a slightly higher average reward than the best hand-tuned constant variance. The simple uniform state sampling performs as well as our reachability approach when a bad constant variance is applied.

### 5.4 Maze

The experimental results for the multi-goal version of the Maze environments are shown in Fig. 2(b). As before, we provide the average reward for each iteration over six executions.

This environment requires more exploration than the SparseReacher environment; we found that when using a constant variance a value of the agent performs best, while results in a very poor learning performance. The reward of our adaptive variance selection is comparable to the best hand-tuned constant variance.

Uniform state-goal sampling performed surprisingly well, but as we can see our approach clearly indicates the benefits of generating a curriculum for learning. We would also like to note that uniform sampling exhibits the best of its capabilities to learn transitions in this environment, whereas we selected the worst seed state for our algorithm specifically to emphasize benefits of our dynamic reachability region.

### 5.5 Hyperparameter adaptation

The environments that we selected are representative in the spectrum of requirements for growing region expansion. The multi-goal version of the SparseReacher environment is more conservative and requires small exploration variances, whereas the Maze environment benefits from aggressive exploration, and hence high variances perform well. For example, Fig. 2(a) shows that, under constant variance, the learner completely fails to improve when the variance is set to a high value. On the other hand, Fig. 2(b) shows the opposite for the Maze, where the optimal variance value is close to the maximum value. Our adaptive variance approach performs similar to the optimal constant variance. Given that we have the same set of exploration hyperparameters for both environments, our approach eliminates the need to tune the key hyperparameter of the region growing curriculum learning method.

Fig. 4 shows the sampling variance evolution over training. Initially, our algorithm picks the largest and the smallest variance values for the Maze and the SparseReacher environments, respectively. In the case of SparseReacher, it keeps a low variance at the beginning, since random initialization of the policy weights results in actions of large magnitude. As the agent keeps learning, the exploration is gradually relaxed. Our algorithm regulates the variance in such a way that allows the learner to maintain the proper exploration pace, resulting in steep learning curves. We also find this idea connected to the approach proposed by [?] in the context of adversarial learning. In our scenario, since there is no loss for the sampler, we apply the equilibrium principle through balancing the success ratio for the learner.

We also evaluated the single-goal variations of our environments, where the seed state represents the only goal in each environment. In this scenario, variance adaptation showed similar benefits. On average the single-goal SparseReacher was learned 20–50% faster with variance adaptation than with manual tuning of a constant variance. For the Maze environment our algorithm is able to match the performance of the version with the constant sampler variance.

## 6 Conclusions and Future work

In this work we proposed a novel algorithm for learning a global policy capable of moving an agent in environments with any pair of start-goal transitions. Our algorithm is based on the idea of region growing, and it is capable of automatic adjustment of the region expansion to achieve an appropriate pace of learning without extensive hyperparameter tuning.

To apply our approach to real robotic systems in the future, we plan to address the following shortcomings. First, the algorithm could substantially benefit from parallel learning of a reversing policy, allowing it to return to some safe states within the current growing region. Second, the current version of our algorithm is sensitive to the choice of the seed state. For example, in experiments with the Maze environment, we observed that the success rate can be twice as good depending on the location of the seed state. This phenomenon occurs because for some seeds the policy can avoid learning how to reach states in the hardest subspace of the environment, namely, the corridor. In the future, we plan to address this problem by using a principle of skill chaining to learn a set of policies for different state regions.

## References

- [Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
- [Bellemare et al., 2016] Marc G. Bellemare, Sriram Srinivasan, et al. Unifying count-based exploration and intrinsic motivation. In NIPS, pages 1471–1479, 2016.
- [Bengio et al., 2015] Samy Bengio, Oriol Vinyals, et al. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, pages 1171–1179, 2015.
- [Berthelot et al., 2017] David Berthelot, Tom Schumm, and Luke Metz. BEGAN: boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017.
- [Brockman et al., 2016] Greg Brockman, Vicki Cheung, et al. OpenAI Gym. CoRR, abs/1606.01540, 2016.
- [Chebotar et al., 2017a] Y. Chebotar, K. Hausman, et al. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In ICML, 2017.
- [Chebotar et al., 2017b] Yevgen Chebotar, Mrinal Kalakrishnan, et al. Path integral guided policy search. In ICRA, pages 3381–3388, 2017.
- [Christiano et al., 2017] Paul F. Christiano, Jan Leike, et al. Deep reinforcement learning from human preferences. In NIPS, pages 4302–4310, 2017.
- [Duan et al., 2016] Yan Duan, Xi Chen, et al. Benchmarking deep reinforcement learning for continuous control. In ICML, pages 1329–1338, 2016.
- [Florensa et al., 2017] Carlos Florensa, David Held, et al. Reverse curriculum generation for reinforcement learning. In CoRL, pages 482–495, 2017.
- [Goodfellow et al., 2014] Ian J. Goodfellow, Jean Pouget-Abadie, et al. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
- [Heess et al., 2017] Nicolas Heess, Dhruva TB, et al. Emergence of locomotion behaviours in rich environments. CoRR, abs/1707.02286, 2017.
- [Held et al., 2017] David Held, Xinyang Geng, et al. Automatic goal generation for reinforcement learning agents. CoRR, abs/1705.06366, 2017.
- [Houthooft et al., 2016] Rein Houthooft, Xi Chen, et al. VIME: variational information maximizing exploration. In NIPS, pages 1109–1117, 2016.
- [Jaderberg et al., 2016] Max Jaderberg, Volodymyr Mnih, et al. Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397, 2016.
- [Kalakrishnan et al., 2012] Mrinal Kalakrishnan, Ludovic Righetti, et al. Learning force control policies for compliant robotic manipulation. In ICML, 2012.
- [Kearns et al., 2002] Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. Machine Learning, 49(2-3):193–208, 2002.
- [Levine et al., 2011] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Nonlinear inverse reinforcement learning with Gaussian processes. In NIPS, pages 19–27, 2011.
- [Lopes et al., 2012] M. Lopes, T. Lang, et al. Exploration in model-based reinforcement learning by empirically estimating learning progress. In NIPS, pages 206–214, 2012.
- [Martin et al., 2017] Jarryd Martin, Suraj N. Sasikumar, et al. Count-based exploration in feature space for reinforcement learning. In IJCAI, pages 2471–2478, 2017.
- [Matiisen et al., 2017] Tambet Matiisen, Avital Oliver, et al. Teacher-student curriculum learning. CoRR, abs/1707.00183, 2017.
- [Mnih et al., 2013] Volodymyr Mnih, Koray Kavukcuoglu, et al. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
- [Ostrovski et al., 2017] Georg Ostrovski, Marc G. Bellemare, et al. Count-based exploration with neural density models. In ICML, pages 2721–2730, 2017.
- [Oudeyer et al., 2007] P. Oudeyer, F. Kaplan, and V. Hafner. Intrinsic motivation systems for autonomous mental development. Evolutionary Computation, 11(2):265–286, 2007.
- [Popov et al., 2017] Ivaylo Popov, Nicolas Heess, et al. Data-efficient deep reinforcement learning for dexterous manipulation. CoRR, abs/1704.03073, 2017.
- [Schmidhuber, 2010] Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). Autonomous Mental Development, 2(3):230–247, 2010.
- [Schossau et al., 2016] Jory Schossau, Christoph Adami, et al. Information-theoretic neuro-correlates boost evolution of cognitive systems. Entropy, 18(1):6, 2016.
- [Schulman et al., 2015] John Schulman, Sergey Levine, et al. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
- [Sharma et al., 2017] Sahil Sharma, Ashutosh Jha, et al. Learning to multi-task by active sampling. CoRR, abs/1702.06053, 2017.
- [Silver et al., 2016] David Silver, Aja Huang, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- [Stadie et al., 2015] B. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. CoRR, abs/1507.00814, 2015.
- [Sukhbaatar et al., 2017] S. Sukhbaatar, I. Kostrikov, et al. Intrinsic motivation and automatic curricula via asymmetric self-play. CoRR, abs/1703.05407, 2017.
- [Svetlik et al., 2017] M. Svetlik, M. Leonetti, et al. Automatic curriculum graph generation for reinforcement learning agents. In AAAI, pages 2590–2596, 2017.
- [Wu and Tian, 2017] Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In ICLR, 2017.
- [Zaremba and Sutskever, 2014] Wojciech Zaremba and Ilya Sutskever. Learning to execute. CoRR, abs/1410.4615, 2014.