Search on the Replay Buffer:
Bridging Planning and Reinforcement Learning
Abstract
The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons. Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human need only provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a suboptimal solution. We introduce a general-purpose control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a particular subgoal. Planning algorithms can automatically find these waypoints, but only if provided with suitable abstractions of the environment, namely a graph consisting of nodes and edges. Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms. (Run our algorithm in your browser: http://bit.ly/rl_search)
Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine CMU, Google Brain, UC Berkeley beysenba@cs.cmu.edu
1 Introduction
How can agents learn to solve complex, temporally extended tasks? Classically, planning algorithms give us one tool for learning such tasks. While planning algorithms work well for tasks where it is easy to determine distances between states and easy to design a local policy to reach nearby states, both of these requirements become roadblocks when applying planning to high-dimensional (e.g., image-based) tasks. Learning algorithms excel at handling high-dimensional observations, but reinforcement learning (RL), that is, learning for control, fails to reason over long horizons to solve temporally extended tasks. In this paper, we propose a method that combines the strengths of planning and RL, resulting in an algorithm that can plan over long horizons in tasks with high-dimensional observations.
Recent work has introduced goal-conditioned RL algorithms (Schaul et al., 2015; Pong et al., 2018) that acquire a single policy for reaching many goals. In practice, goal-conditioned RL succeeds at reaching nearby goals but fails to reach distant goals; performance degrades quickly as the number of steps to the goal increases (Nachum et al., 2018; Levy et al., 2019). Moreover, goal-conditioned RL often requires large amounts of reward shaping (Chiang et al., 2019) or human demonstrations (Nair et al., 2018; Lynch et al., 2019), both of which can limit the asymptotic performance of the policy by discouraging the policy from seeking novel solutions.
We propose to solve long-horizon, sparse reward tasks by decomposing the task into a series of easier goal-reaching tasks. We learn a goal-conditioned policy for solving each of the goal-reaching tasks. Our main idea is to reduce the problem of finding these subgoals to solving a shortest path problem over states that we have previously visited, using a distance metric extracted from our goal-conditioned policy. We call this algorithm Search on Replay Buffer (SoRB), and provide a simple illustration of the algorithm in Figure 1.
Our primary contribution is an algorithm that bridges planning and deep RL for solving long-horizon, sparse reward tasks. We develop a practical instantiation of this algorithm using ensembles of distributional value functions, which allows us to robustly learn distances and use them for risk-aware planning. Empirically, we find that our method generates effective plans to solve long-horizon navigation tasks, even in image-based domains, without a map and without odometry. Comparisons with state-of-the-art RL methods show that SoRB is substantially more successful in reaching distant goals. We also observe that the learned policy generalizes well to navigate in unseen environments. In summary, graph search over previously visited states is a simple tool for boosting the performance of a goal-conditioned RL algorithm.
2 Bridging Planning and Reinforcement Learning
Planning algorithms must be able to (1) sample valid states, (2) estimate the distance between reachable pairs of states, and (3) use a local policy to navigate between nearby states. These requirements are difficult to satisfy in complex tasks with high-dimensional observations, such as images. For example, consider a robot arm stacking blocks using image observations. Sampling states requires generating photorealistic images, and estimating distances and choosing actions requires reasoning about dozens of interactions between blocks. Our method will obtain distance estimates and a local policy using an RL algorithm. To sample states, we will simply use a replay buffer of previously visited states as a non-parametric generative model.
2.1 Building Block: Goal-Conditioned RL
A key building block of our method is a goal-conditioned policy and its associated value function. We consider a goal-reaching agent interacting with an environment. The agent observes its current state $s \in \mathcal{S}$ and a goal state $s_g \in \mathcal{S}$. The initial state for each episode is sampled $s_1 \sim \rho(s)$, and dynamics are governed by the distribution $p(s_{t+1} \mid s_t, a_t)$. At every step, the agent samples an action $a \sim \pi(a \mid s, s_g)$ and receives a corresponding reward $r(s, a, s_g)$ that indicates whether the agent has reached the goal. The episode terminates as soon as the agent reaches the goal, or after $T$ steps, whichever occurs first. The agent's task is to maximize its cumulative, undiscounted reward. We use an off-policy algorithm to learn such a policy, as well as its associated goal-conditioned Q-function and value function:
$$Q(s, a, s_g), \qquad V(s, s_g) = \max_a Q(s, a, s_g)$$
We obtain a policy by acting greedily w.r.t. the Q-function: $\pi(a \mid s, s_g) = \arg\max_a Q(s, a, s_g)$. We choose an off-policy RL algorithm with goal relabelling (Kaelbling, 1993b; Andrychowicz et al., 2017) and distributional RL (Bellemare et al., 2017) not only for improved data efficiency, but also to obtain good distance estimates (see Section 2.2). We will use DQN (Mnih et al., 2013) for discrete action environments and DDPG (Lillicrap et al., 2015) for continuous action environments. Both algorithms operate by minimizing the Bellman error over transitions sampled from a replay buffer $\mathcal{B}$.
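Goal relabelling can be illustrated with a short sketch. The code below is not the authors' implementation: the function name `relabel_with_future_goals`, the tuple layout, and the choice of drawing goals from states visited later in the same trajectory are all assumptions made for illustration, as is the constant reward of -1 per step (the reward this paper uses for learning distances).

```python
import random

def relabel_with_future_goals(trajectory, rng=random):
    """Turn one trajectory into goal-conditioned transitions via relabelling.

    trajectory: list of (state, action, next_state) tuples.
    Each transition is paired with a goal drawn from the states reached at or
    after that step (a "future" strategy, assumed here for illustration).
    """
    transitions = []
    for t, (state, action, next_state) in enumerate(trajectory):
        future = rng.randrange(t, len(trajectory))   # index of a later step
        goal = trajectory[future][2]                 # state reached at that step
        done = next_state == goal                    # episode ends at the goal
        transitions.append((state, action, next_state, goal, -1.0, done))
    return transitions

# Toy trajectory with integer states (hypothetical data).
trajectory = [(0, "right", 1), (1, "right", 2), (2, "up", 3)]
relabelled = relabel_with_future_goals(trajectory, rng=random.Random(0))
```

Because every relabelled goal was actually reached later in the trajectory, each stored transition carries a goal the agent is known to be able to reach, which is what makes sparse goal-reaching rewards learnable from off-policy data.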
2.2 Distances from Goal-Conditioned Reinforcement Learning
To ultimately perform planning, we need to compute the shortest path distance between pairs of states. Following Kaelbling (1993b), we define a reward function that returns $-1$ at every step: $r(s, a, s_g) \triangleq -1$. The episode ends when the agent is sufficiently close to the goal, as determined by a state-identity oracle. Using this reward function and termination condition, there is a close connection between the Q-values and shortest paths. We define $d_{sp}(s, s_g)$ to be the shortest path distance from state $s$ to state $s_g$. That is, $d_{sp}(s, s_g)$ is the expected number of steps to reach $s_g$ from $s$ under the optimal policy. The value of state $s$ with respect to goal $s_g$ is simply the negative shortest path distance: $V(s, s_g) = -d_{sp}(s, s_g)$. We likewise define $d_{sp}(s, a, s_g)$ as the shortest path distance, conditioned on initially taking action $a$. Then Q-values also equal a negative shortest path distance: $Q(s, a, s_g) = -d_{sp}(s, a, s_g)$. Thus, goal-conditioned RL on a suitable reward function yields a Q-function that allows us to estimate shortest-path distances.
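The link between values and shortest paths can be checked on a toy problem. The sketch below is an illustration under simplifying assumptions (a small tabular state space with deterministic transitions; the function name `goal_conditioned_values` is hypothetical): running value iteration with a reward of -1 per step and termination at the goal yields values equal to the negative shortest path distance.

```python
import numpy as np

def goal_conditioned_values(adjacency, num_sweeps=100):
    """Tabular value iteration with reward -1 per step and termination at the goal.

    adjacency[i] lists the states reachable from state i in one step
    (deterministic dynamics, a simplifying assumption for this sketch).
    Returns an array with values[s, g] = value of state s for goal g.
    """
    n = len(adjacency)
    values = np.full((n, n), -np.inf)  # unreachable goals keep value -inf
    np.fill_diagonal(values, 0.0)      # already at the goal: no further cost
    for _ in range(num_sweeps):
        for s in range(n):
            for g in range(n):
                if s == g:
                    continue
                # Bellman backup: pay -1 now, then act optimally from the successor.
                best_next = max((values[s2, g] for s2 in adjacency[s]),
                                default=-np.inf)
                values[s, g] = -1.0 + best_next
    return values

# A directed 4-state chain with a shortcut: 0 -> 1 -> 2 -> 3 and 0 -> 2.
adjacency = [[1, 2], [2], [3], []]
values = goal_conditioned_values(adjacency)
shortest_path_distance = -values
```

In this toy graph the entry for an unreachable pair, such as state 3 to state 0, stays at negative infinity, which previews why a standard Q-network struggles to represent such values.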
2.3 The Replay Buffer as a Graph
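We build a weighted, directed graph directly on top of states in our replay buffer $\mathcal{B}$, so each node corresponds to an observation (e.g., an image). We add edges between nodes with weight (i.e., length) equal to their predicted distance, but ignore edges that are longer than MaxDist, a hyperparameter:
$$\mathcal{G} \triangleq (\mathcal{V}, \mathcal{E}, \mathcal{W}), \quad \mathcal{V} = \mathcal{B}, \quad \mathcal{E} = \mathcal{B} \times \mathcal{B} = \{e_{s_1 \to s_2} \mid s_1, s_2 \in \mathcal{B}\}, \quad \mathcal{W}(e_{s_1 \to s_2}) = \begin{cases} d_\pi(s_1, s_2) & \text{if } d_\pi(s_1, s_2) < \text{MaxDist} \\ \infty & \text{otherwise} \end{cases}$$
where $d_\pi(s_1, s_2)$ is the distance from $s_1$ to $s_2$ predicted by the learned value function.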
Given a start and goal state, we temporarily add each to the graph. We add directed edges from the start state to every other state, and from every other state to the goal state, using the same criteria as above. We use Dijkstra’s Algorithm to find the shortest path. See Appendix A for details.
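The graph construction and search described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the predicted distances are already available as a dense matrix `pairwise_dist` whose rows and columns include the start and goal states, and the helper names `build_graph` and `dijkstra` are hypothetical.

```python
import heapq
import numpy as np

def build_graph(pairwise_dist, max_dist):
    """Keep edge i -> j only when the predicted distance is below max_dist."""
    n = pairwise_dist.shape[0]
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i != j and pairwise_dist[i, j] < max_dist:
                graph[i][j] = float(pairwise_dist[i, j])
    return graph

def dijkstra(graph, start, goal):
    """Return (total_length, path), or (inf, []) if the goal is unreachable."""
    best = {start: 0.0}
    parent = {}
    frontier = [(0.0, start)]
    while frontier:
        dist, node = heapq.heappop(frontier)
        if node == goal:
            path = [node]
            while path[-1] != start:
                path.append(parent[path[-1]])
            return dist, path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph[node].items():
            new_dist = dist + weight
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                parent[neighbor] = node
                heapq.heappush(frontier, (new_dist, neighbor))
    return float("inf"), []

# Toy buffer: six states on a line, with distance equal to the position gap.
positions = np.arange(6)
pairwise_dist = np.abs(positions[:, None] - positions[None, :]).astype(float)
graph = build_graph(pairwise_dist, max_dist=2.5)
length, path = dijkstra(graph, start=0, goal=5)
no_edges = build_graph(pairwise_dist, max_dist=0.5)
unreachable = dijkstra(no_edges, start=0, goal=5)
```

Every hop in the returned path is shorter than `max_dist`, so each waypoint stays within the range where the goal-conditioned policy is expected to be reliable. Setting `max_dist` too low disconnects the graph, in which case no path is returned.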
2.4 Algorithm Summary
After learning a goal-conditioned Q-function, we perform graph search to find a set of waypoints and use the goal-conditioned policy to reach each. We view the combination of graph search and the underlying goal-conditioned policy as a new SearchPolicy, shown in Algorithm 1. The algorithm starts by using graph search to obtain the shortest path from the current state $s$ to the goal state $s_g$, planning over the states in our replay buffer $\mathcal{B}$. We then estimate the distance from the current state to the first waypoint, as well as the distance from the current state to the goal. In most cases, we then condition the policy on the first waypoint, $s_{w_1}$. However, if the goal state is closer than the next waypoint and the goal state is not too far away, then we directly condition the policy on the final goal. If the replay buffer is empty or there is not a path in the graph $\mathcal{G}$ to the goal, then Algorithm 1 resorts to standard goal-conditioned RL.
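The waypoint-selection rule in this summary can be written as a small function. The sketch below follows the prose description only; the function name `choose_policy_target`, the `distance_fn` helper, and the use of `max_dist` as the threshold for "not too far away" are assumptions, not details taken from Algorithm 1.

```python
def choose_policy_target(state, goal, waypoints, distance_fn, max_dist):
    """Pick the state that the goal-conditioned policy is conditioned on.

    waypoints: planned path from graph search, excluding `state` and ending at
               `goal`; empty when the buffer is empty or no path was found.
    distance_fn(a, b): predicted number of steps from a to b (hypothetical helper).
    """
    if not waypoints:
        return goal  # fall back to standard goal-conditioned RL
    first_waypoint = waypoints[0]
    dist_to_waypoint = distance_fn(state, first_waypoint)
    dist_to_goal = distance_fn(state, goal)
    if dist_to_goal < dist_to_waypoint and dist_to_goal < max_dist:
        return goal  # the goal is closer than the waypoint and within reach
    return first_waypoint

def line_distance(a, b):
    """Toy distance for states on a line (illustration only)."""
    return abs(a - b)

far = choose_policy_target(0, 10, [3, 6, 10], line_distance, max_dist=4)
near = choose_policy_target(9, 10, [7, 10], line_distance, max_dist=4)
fallback = choose_policy_target(0, 10, [], line_distance, max_dist=4)
```

In the first call the goal is ten steps away, so the policy is conditioned on the first waypoint. In the second call the goal is one step away and closer than the waypoint, so the policy targets the goal directly.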
3 Better Distance Estimates
The success of our SearchPolicy depends heavily on the accuracy of our distance estimates. This section proposes two techniques to learn better distances with RL.
3.1 Better Distances via Distributional Reinforcement Learning
Off-the-shelf Q-learning algorithms such as DQN (Mnih et al., 2013) or DDPG (Lillicrap et al., 2015) will fail to learn accurate distance estimates using the $-1$ reward function. The true value for a state and goal that are unreachable is $-\infty$, which cannot be represented by a standard, feed-forward Q-network. Simply clipping the Q-value estimates to be within some range avoids the problem of ill-defined Q-values, but empirically we found it challenging to train clipped Q-networks. We adopt distributional Q-learning (Bellemare et al., 2017), noting that it has a convenient form when used with the $-1$ reward function. Distributional RL discretizes the possible value estimates into a set of bins $B = (B_0, B_1, \dots, B_N)$. For learning distances, bins correspond to distances, so $B_i$ indicates the event that the current state and goal are $i$ steps away from one another. Our Q-function predicts a distribution $Q(s_t, s_g, a_t)$ over these bins, where $Q(s_t, s_g, a_t)_i$ is the predicted probability that states $s_t$ and $s_g$ are $i$ steps away from one another. To avoid ill-defined Q-values, the final bin, $B_N$, is a catch-all for predicted distances of at least $N$. Importantly, this gives us a well-defined method to represent large and infinite distances. Under this formulation, the targets $Q^*$ for our Q-values have a simple form:
$$Q^* = \begin{cases} (1, 0, \dots, 0) & \text{if } s_t = s_g \\ (0, Q_0, \dots, Q_{N-2}, Q_{N-1} + Q_N) & \text{if } s_t \neq s_g \end{cases}$$
As illustrated in Figure 2, if the state and goal are equivalent, then the target places all probability mass in bin 0. Otherwise, the targets are a right-shift of the current predictions. To ensure the target values sum to one, the mass in bin $N$ of the targets is the sum of bins $N-1$ and $N$ from the predicted values. Following Bellemare et al. (2017), we update our Q-function by minimizing the KL divergence between our predictions $Q^\theta$ and the target $Q^*$:
$$\min_\theta D_{\mathrm{KL}}\left(Q^* \,\|\, Q^\theta\right) \tag{1}$$
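The right-shift target and the KL objective can be sketched in NumPy. This is an illustrative sketch, assuming bins are indexed 0 through $N$ with bin $i$ meaning "$i$ steps away"; the function names are hypothetical, and `expected_distance` (the mean bin index) is one plausible way to turn the predicted distribution into a scalar distance, not necessarily the one the authors use.

```python
import numpy as np

def distance_target(predicted_probs, reached_goal):
    """Target distribution over distance bins 0..N for the -1-per-step reward.

    predicted_probs: predicted distribution to be shifted (in a TD update this
    would typically come from the next state; that detail is an assumption).
    """
    target = np.zeros_like(predicted_probs)
    if reached_goal:
        target[0] = 1.0                  # all mass on distance 0
        return target
    target[1:] = predicted_probs[:-1]    # right-shift: one step further away
    target[-1] += predicted_probs[-1]    # catch-all bin keeps its own mass too
    return target

def kl_divergence(target, pred, eps=1e-8):
    """KL(target || pred) for two categorical distributions."""
    target = np.clip(target, eps, 1.0)
    pred = np.clip(pred, eps, 1.0)
    return float(np.sum(target * (np.log(target) - np.log(pred))))

def expected_distance(probs):
    """Scalar distance estimate: the mean bin index under probs."""
    return float(np.dot(np.arange(len(probs)), probs))

predicted = np.array([0.1, 0.2, 0.3, 0.4])   # bins 0..3, so N = 3
shifted = distance_target(predicted, reached_goal=False)
at_goal = distance_target(predicted, reached_goal=True)
loss = kl_divergence(shifted, predicted)
```

Merging the last two bins keeps the target normalized, so mass that would shift past bin $N$ accumulates in the catch-all bin instead of being lost.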
3.2 Robust Distances via Ensembles of Value Functions
Since we ultimately want to use estimated distances to perform search, it is crucial that we have accurate distance estimates. It is challenging to robustly estimate the distance between all pairs of states in our buffer $\mathcal{B}$, some of which may not have occurred during training. If we fail and spuriously predict that a pair of distant states are nearby, graph search will exploit this "wormhole" and yield a path which assumes that the agent can "teleport" from one distant state to another. We seek to use a bootstrap (Bickel et al., 1981) as a principled way to estimate uncertainty for our Q-values. Following prior work (Osband et al., 2016; Lakshminarayanan et al., 2017), we implement an approximation to the bootstrap. We train an ensemble of Q-networks, each with independent weights, but trained on the same data using the same loss (Eq. 1). When performing graph search, we aggregate predictions from each Q-network in our ensemble. Empirically, we found that ensembles were crucial for getting graph search to work on image-based tasks, but we observed little difference in whether we took the maximum predicted distance or the average predicted distance.
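The aggregation step can be illustrated with a few lines of NumPy. The numbers below are made up for illustration and the function name `aggregate_distances` is hypothetical; the sketch only shows how taking the maximum or the mean across ensemble members suppresses a single spurious "wormhole" prediction.

```python
import numpy as np

def aggregate_distances(ensemble_dist, mode="max"):
    """Combine distance predictions of shape (K, n, n) from K ensemble members."""
    if mode == "max":
        return ensemble_dist.max(axis=0)   # pessimistic: largest estimate wins
    if mode == "mean":
        return ensemble_dist.mean(axis=0)
    raise ValueError(f"unknown mode: {mode}")

# Three hypothetical ensemble members scoring the same three states.
# Member 0 spuriously predicts a "wormhole" of length 1 from state 0 to state 2.
ensemble_dist = np.array([
    [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [9.0, 1.0, 0.0]],
    [[0.0, 1.0, 9.0], [1.0, 0.0, 1.0], [9.0, 1.0, 0.0]],
    [[0.0, 1.0, 8.0], [1.0, 0.0, 1.0], [9.0, 1.0, 0.0]],
])
pessimistic = aggregate_distances(ensemble_dist, mode="max")
average = aggregate_distances(ensemble_dist, mode="mean")
```

With either aggregation, the edge from state 0 to state 2 ends up far longer than the single outlier suggested, so a moderate MaxDist threshold would prune it from the graph.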
4 Related Work
Planning Algorithms: Planning algorithms (LaValle, 2006; Choset et al., 2005) efficiently solve long-horizon tasks, including those that stymie RL algorithms (see, e.g., Levine et al. (2011); Kavraki et al. (1996); Lau and Kuffner (2005)). However, these techniques assume that we can (1) efficiently sample valid states, (2) estimate the distance between two states, and (3) acquire a local policy for reaching nearby states, all of which make it challenging to apply these techniques to high-dimensional tasks (e.g., with image observations). Our method removes these assumptions by (1) sampling states from the replay buffer and (2,3) learning the distance metric and policy with RL. Some prior works have also combined planning algorithms with RL (Chiang et al., 2019; Faust et al., 2018; Savinov et al., 2018a), finding that the combination yields agents adept at reaching distant goals. Perhaps the most similar work is Semi-Parametric Topological Memory (Savinov et al., 2018a), which also uses graph search to find waypoints for a learned policy. We compare to SPTM in Section 5.3.
Goal-Conditioned RL: Goal-conditioned policies (Kaelbling, 1993b; Schaul et al., 2015; Pong et al., 2018) take as input the current state and a goal state, and predict a sequence of actions to arrive at the goal. Our algorithm learns a goal-conditioned policy to reach waypoints along the planned path. Recent algorithms (Andrychowicz et al., 2017; Pong et al., 2018) combine off-policy RL algorithms with goal relabelling to improve the sample complexity and robustness of goal-conditioned policies. Similar algorithms have been proposed for visual navigation (Anderson et al., 2018; Gupta et al., 2017; Zhu et al., 2017; Mirowski et al., 2016). A common theme in recent work is learning distance metrics to accelerate RL. While most methods (Florensa et al., 2019; Savinov et al., 2018b; Wu et al., 2018) simply perform RL on top of the learned representation, our method explicitly performs search using the learned metric.
Hierarchical RL: Hierarchical RL algorithms automatically learn a set of primitive skills to help an agent learn complex tasks. One class of methods (Kaelbling, 1993a; Parr and Russell, 1998; Sutton et al., 1999; Precup, 2000; Vezhnevets et al., 2017; Nachum et al., 2018; Frans et al., 2017; Bacon et al., 2017; Kulkarni et al., 2016) jointly learns a low-level policy for performing each of the skills together with a high-level policy for sequencing these skills to complete a desired task. Another class of algorithms (Fox et al., 2017; Şimşek et al., 2005; Drummond, 2002) focuses solely on automatically discovering these skills or subgoals. SoRB learns primitive skills that correspond to goal-reaching tasks, similar to Nachum et al. (2018). While jointly learning high-level and low-level policies can be unstable (see discussion in Nachum et al. (2018)), we sidestep the problem by using graph search as a fixed, high-level policy.
model         | real states | multi-step | prediction dimension
state-space   | ✓           | ✓          | 1000s+
latent-space  | ✗           | ✓          | 10s
inverse       | ✓           | ✗          | 10s
SoRB          | ✓           | ✓          | 1
Model-Based RL: RL methods are typically divided into model-free (Williams, 1992; Schulman et al., 2015b, a, 2017) and model-based (Watkins and Dayan, 1992; Lillicrap et al., 2015) approaches. Model-based approaches all perform some degree of planning, from predicting the value of some state (Silver et al., 2016; Mnih et al., 2013), obtaining representations by unrolling a learned dynamics model (Racanière et al., 2017), or learning a policy directly on a learned dynamics model (Sutton, 1990; Chua et al., 2018; Kurutach et al., 2018; Finn and Levine, 2017; Agrawal et al., 2016; Oh et al., 2015; Nagabandi et al., 2018). One line of work (Amos et al., 2018; Srinivas et al., 2018; Tamar et al., 2016; Lee et al., 2018) embeds a differentiable planner inside a policy, with the planner learned end-to-end with the rest of the policy. Other work (Watter et al., 2015; Lenz et al., 2015) explicitly learns a representation for use inside a standard planning algorithm. In contrast, SoRB learns to predict the distances between states, which can be viewed as a high-level inverse model. SoRB predicts a scalar (the distance) rather than actions or observations, making the prediction problem substantially easier. By planning over previously visited states, SoRB does not have to cope with infeasible states that can be predicted by forward models in state-space and latent-space.
5 Experiments
We compare SoRB to prior methods on two tasks: a simple 2D environment, and then a visual navigation task, where our method will plan over images. Ablation experiments will illustrate that accurate distance estimates are crucial to our algorithm's success.



5.1 Didactic Example: 2D Navigation
We start by building intuition for our method by applying it to two simple 2D navigation tasks, shown in Figure 3(a). The start and goal state are chosen randomly in free space, and reaching the goal often takes over 100 steps, even for the optimal policy. We used goal-conditioned RL to learn a policy for each environment, and then evaluated this policy on randomly sampled (start, goal) pairs of varying difficulty. To implement SoRB, we used exactly the same policy, both to perform graph search and then to reach each of the planned waypoints. In Figure 3(b), we observe that the goal-conditioned policy can reach nearby goals, but fails to generalize to distant goals. In contrast, SoRB successfully reaches goals over 100 steps away, with little drop in success rate. Figure 3(c) compares rollouts from the goal-conditioned policy and our policy. Note that our policy takes actions that temporarily lead away from the goal so the agent can maneuver through a hallway to eventually reach the goal.
5.2 Planning over Images for Visual Navigation
We now examine how our method scales to high-dimensional observations in a visual navigation task, illustrated in Figure 5. We use 3D houses from the SUNCG dataset (Song et al., 2017), similar to the task described by Shah et al. (2018). The agent receives either RGB or depth images and takes actions to move North/South/East/West. Following Shah et al. (2018), we stitch four images into a panorama, so the dimension of the resulting observation depends on the number of channels $C$ (3 for RGB, 1 for depth). At the start of each episode, we randomly sample an initial state and goal state. We found that sampling nearby goals (within 4 steps) more often (80% of the time) improved the performance of goal-conditioned RL. We use the same goal sampling distribution for all methods. The agent observes both the current image and the goal image, and should take actions that lead to the goal state. The episode terminates once the agent is within 1 meter of the goal. We also terminate if the agent has failed to reach the goal after 20 time steps, but treat the two types of termination differently when computing the TD error (see Pardo et al. (2017)). Note that it is challenging to specify a meaningful distance metric and local policy on pixel inputs, so it is difficult to apply standard planning algorithms to this task.
On this task, we evaluate four state-of-the-art prior methods: hindsight experience replay (HER) (Andrychowicz et al., 2017), distributional RL (C51) (Bellemare et al., 2017), semi-parametric topological memory (SPTM) (Savinov et al., 2018a), and value iteration networks (VIN) (Tamar et al., 2016). SoRB uses C51 as its underlying goal-conditioned policy. For VIN, we tuned the number of iterations as well as the number of hidden units in the recurrent layer. For SPTM, we performed a grid search over the threshold for adding edges, the threshold for choosing the next waypoint along the shortest path, and the parameters for sampling the training data. In total, we performed over 1000 experiments to tune baselines, more than an order of magnitude more than we used for tuning our own method. See Appendix F for details.
We evaluated each method on goals ranging from 2 to 20 steps from the start. For each distance, we randomly sampled 30 (start, goal) pairs, and recorded the average success rate, defined as reaching within 1 meter of the goal within 100 steps. We then repeated each experiment for 5 random seeds. In Figure 6, we plot each random seed as a transparent line; the solid line corresponds to the average across the 5 random seeds. While all prior methods degrade quickly as the distance to the goal increases, our method continues to succeed in reaching goals with probability around 90%. SPTM, the only prior method that also employs search, performs second best, but substantially worse than our method.
5.3 Comparison with Semi-Parametric Topological Memory


To understand why SoRB succeeds at reaching distant goals more frequently than SPTM, we examine the two key differences between the methods: (1) the goal-conditioned policy used to reach nearby goals and (2) the distance metric used to construct the graph. While SoRB acquires a goal-conditioned policy via goal-conditioned RL, SPTM obtains a policy by learning an inverse model with supervised learning. First, we compared the performance of the RL policy (used in SoRB) with the inverse model policy (used in SPTM). In Figure 6(a), the solid colored lines show that, without search, the policy used by SPTM is more successful than the RL policy, but performance of both policies degrades as the distance to the goal increases. We also evaluate a variant of our method that uses the policy from SPTM to reach each waypoint, and find (dashed lines) no difference in performance, likely because the policies are equally good at reaching nearby goals (within MaxDist steps). We conclude that the difference in goal-conditioned policies cannot explain the difference in success rate.
The other key difference between SoRB and SPTM is their learned distance metrics. When using distances for graph search, it is critical for the predicted distance between two states to reflect whether the policy can successfully navigate between those states: the model should be more successful at reaching goals which it predicts are nearby. We can naturally measure this alignment using the area under a precision-recall curve. Note that while SoRB predicts distances measured in steps, SPTM predicts whether two states are reachable, so its predictions will be in the range $[0, 1]$. Nonetheless, precision-recall curves only depend on the ordering of the predictions, not their absolute values. (We negate the distance prediction from SoRB before computing the precision-recall curve because small distances indicate that the policy should be more successful.) Figure 6(b) shows that the distances predicted by SoRB more accurately reflect whether the policy will reach the goal, as compared with SPTM. The average AUC across five random seeds is 22% higher for SoRB than SPTM. In retrospect, this finding is not surprising: while SPTM employs a learned, inverse model policy, it learns distances w.r.t. a random policy.
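The precision-recall comparison can be sketched as follows. This is an illustration with made-up data, not the authors' evaluation code: it uses average precision as one common estimator of the area under the precision-recall curve, and the function name `average_precision` is hypothetical. Predicted distances are negated so that a higher score means the policy is expected to succeed.

```python
import numpy as np

def average_precision(scores, labels):
    """Area under the precision-recall curve, estimated as average precision.

    scores: higher means "more likely to succeed".
    labels: 1 if the policy reached the goal, 0 otherwise.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    labels = np.asarray(labels)[order]
    true_positives = np.cumsum(labels)
    precision = true_positives / np.arange(1, len(labels) + 1)
    # Average the precision at each rank where a success is retrieved.
    return float(np.sum(precision * labels) / max(labels.sum(), 1))

# Hypothetical predictions for five (start, goal) pairs.
predicted_distance = np.array([2.0, 3.0, 5.0, 8.0, 12.0])
reached_goal = np.array([1, 1, 0, 1, 0])

auc = average_precision(-predicted_distance, reached_goal)
# Any order-preserving transform of the scores gives the same result.
auc_rescaled = average_precision(np.exp(-predicted_distance), reached_goal)
```

The second call shows why distances in steps and reachability probabilities can be compared on the same footing: the metric depends only on how the predictions rank the pairs.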
5.4 Better Distance Estimates


We now examine the ingredients in SoRB that contribute to its accurate distance estimates: distributional RL and ensembles of value functions. In a first experiment, we evaluated a variant of SoRB trained without distributional RL. As shown in Figure 7(a), this variant performed worse than the random policy, clearly illustrating that distributional RL is a key component of SoRB. The second experiment studied the effect of using ensembles of value functions. Recalling that we introduced ensembles to avoid erroneous distance predictions for distant pairs of states, we expect that ensembles will contribute most towards success at reaching distant goals. Figure 7(b) confirms this prediction, illustrating that ensembles provide a 10 to 20% increase in success at reaching goals that are at least 10 steps away. We run additional ablation analysis in Appendix C.
5.5 Generalizing to New Houses
We now study whether our method generalizes to new visual navigation environments. We train on 100 SUNCG houses, randomly sampling one per episode. We evaluated on a held-out test set of 22 SUNCG houses. In each house, we collect 1000 random observations and use those observations to perform search. We use the same goal-conditioned policy and associated distance function that we learned during training. As before, we measure the fraction of goals reached as we increase the distance to the goal. In Figure 9, we observe that SoRB reaches almost 80% of goals that are 10 steps away, about four times more than reached by the goal-conditioned RL agent. Our method succeeds in reaching 40% of goals 20 steps away, while goal-conditioned RL has a success rate near 0%. We repeated the experiment for three random seeds, retraining the policy from scratch each time. Note that there is no discernible difference between the three random seeds, plotted as transparent lines, indicating the robustness of our method to random initialization.
6 Discussion and Future Work
We presented SoRB, a method that combines planning via graph search and goal-conditioned RL. By exploiting the structure of goal-reaching tasks, we can obtain policies that generalize substantially better than those learned directly from RL. In our experiments, we show that SoRB can solve temporally extended navigation problems, traverse environments with image observations, and generalize to new houses in the SUNCG dataset. Our method relies heavily on goal-conditioned RL, and we expect advances in this area to make our method applicable to even more difficult tasks. While we used a stage-wise procedure, first learning the goal-conditioned policy and then applying graph search, in future work we aim to explore how graph search can improve the goal-conditioned policy itself, perhaps via policy distillation or obtaining better Q-value estimates. In addition, while the planning algorithm we use is simple (namely, Dijkstra's algorithm), we believe that the key idea of using distance estimates obtained from RL algorithms for planning will open doors to incorporating more sophisticated planning techniques into RL.
Acknowledgements: We thank Vitchyr Pong, Xingyu Lin, and Shane Gu for helpful discussions on learning goal-conditioned value functions, Aleksandra Faust and Brian Okorn for feedback on connections to planning, and Nikolay Savinov for feedback on the SPTM baseline. RS is supported by NSF grant IIS1763562, ONR grant N000141812861, AFRL CogDeCON, and Apple. Any opinions, findings and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of NSF, AFRL, ONR, or Apple.
References
 Agrawal et al. (2016) Agrawal, P., Nair, A. V., Abbeel, P., Malik, J., and Levine, S. (2016). Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074–5082.
 Amos et al. (2018) Amos, B., Jimenez, I., Sacks, J., Boots, B., and Kolter, J. Z. (2018). Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, pages 8289–8300.
 Anderson et al. (2018) Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., and van den Hengel, A. (2018). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683.
 Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058.
 Bacon et al. (2017) Bacon, P.-L., Harb, J., and Precup, D. (2017). The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence.
 Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 449–458. JMLR.org.
 Bickel et al. (1981) Bickel, P. J., Freedman, D. A., et al. (1981). Some asymptotic theory for the bootstrap. The annals of statistics, 9(6):1196–1217.
 Chiang et al. (2019) Chiang, H.-T. L., Faust, A., Fiser, M., and Francis, A. (2019). Learning navigation behaviors end-to-end with AutoRL. IEEE Robotics and Automation Letters, 4(2):2007–2014.
 Choset et al. (2005) Choset, H. M., Hutchinson, S., Lynch, K. M., Kantor, G., Burgard, W., Kavraki, L. E., and Thrun, S. (2005). Principles of robot motion: theory, algorithms, and implementation. MIT press.
 Chua et al. (2018) Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4759–4770.
 Drummond (2002) Drummond, C. (2002). Accelerating reinforcement learning by composing solutions of automatically identified subtasks. Journal of Artificial Intelligence Research, 16:59–104.
 Faust et al. (2018) Faust, A., Ramirez, O., Fiser, M., Oslund, K., Francis, A., Davidson, J., and Tapia, L. (2018). PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pages 5113–5120, Brisbane, Australia.
 Finn and Levine (2017) Finn, C. and Levine, S. (2017). Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE.
 Florensa et al. (2019) Florensa, C., Degrave, J., Heess, N., Springenberg, J. T., and Riedmiller, M. (2019). Self-supervised learning of image embedding for continuous control. arXiv preprint arXiv:1901.00943.
 Fox et al. (2017) Fox, R., Krishnan, S., Stoica, I., and Goldberg, K. (2017). Multilevel discovery of deep options. arXiv preprint arXiv:1703.08294.
 Frans et al. (2017) Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. (2017). Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767.
 Gupta et al. (2017) Gupta, S., Davidson, J., Levine, S., Sukthankar, R., and Malik, J. (2017). Cognitive mapping and planning for visual navigation. arXiv preprint arXiv:1702.03920, 3.
 Hadar and Russell (1969) Hadar, J. and Russell, W. R. (1969). Rules for ordering uncertain prospects. The American Economic Review, 59(1):25–34.
 Kaelbling (1993a) Kaelbling, L. P. (1993a). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the tenth international conference on machine learning, volume 951, pages 167–173.
 Kaelbling (1993b) Kaelbling, L. P. (1993b). Learning to achieve goals. In IJCAI, pages 1094–1099. Citeseer.
 Kavraki et al. (1996) Kavraki, L., Svestka, P., and Overmars, M. H. (1996). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580.
 Kulkarni et al. (2016) Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683.
 Kurutach et al. (2018) Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. (2018). Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592.
 Lakshminarayanan et al. (2017) Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413.
 Lau and Kuffner (2005) Lau, M. and Kuffner, J. J. (2005). Behavior planning for character animation. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 271–280. ACM.
 LaValle (2006) LaValle, S. M. (2006). Planning algorithms. Cambridge university press.
 Lee et al. (2018) Lee, L., Parisotto, E., Chaplot, D. S., Xing, E., and Salakhutdinov, R. (2018). Gated path planning networks. arXiv preprint arXiv:1806.06408.
 Lenz et al. (2015) Lenz, I., Knepper, R. A., and Saxena, A. (2015). DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems. Rome, Italy.
 Levine et al. (2011) Levine, S., Lee, Y., Koltun, V., and Popović, Z. (2011). Space-time planning with parameterized locomotion controllers. ACM Transactions on Graphics (TOG), 30(3):23.
 Levy et al. (2019) Levy, A., Platt, R., and Saenko, K. (2019). Hierarchical reinforcement learning with hindsight. In International Conference on Learning Representations.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 Lynch et al. (2019) Lynch, C., Khansari, M., Xiao, T., Kumar, V., Tompson, J., Levine, S., and Sermanet, P. (2019). Learning latent plans from play. arXiv preprint arXiv:1903.01973.
 Mirowski et al. (2016) Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al. (2016). Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
 Nachum et al. (2018) Nachum, O., Gu, S. S., Lee, H., and Levine, S. (2018). Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pages 3307–3317.
 Nagabandi et al. (2018) Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE.
 Nair et al. (2018) Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2018). Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE.
 Oh et al. (2015) Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pages 2863–2871.
 Osband et al. (2016) Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. (2016). Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pages 4026–4034.
 Pardo et al. (2017) Pardo, F., Tavakoli, A., Levdik, V., and Kormushev, P. (2017). Time limits in reinforcement learning. arXiv preprint arXiv:1712.00378.
 Parr and Russell (1998) Parr, R. and Russell, S. J. (1998). Reinforcement learning with hierarchies of machines. In Advances in neural information processing systems, pages 1043–1049.
 Pong et al. (2018) Pong, V., Gu, S., Dalal, M., and Levine, S. (2018). Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081.
 Precup (2000) Precup, D. (2000). Temporal abstraction in reinforcement learning. University of Massachusetts Amherst.
 Racanière et al. (2017) Racanière, S., Weber, T., Reichert, D., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., et al. (2017). Imagination-augmented agents for deep reinforcement learning. In Advances in neural information processing systems, pages 5690–5701.
 Savinov et al. (2018a) Savinov, N., Dosovitskiy, A., and Koltun, V. (2018a). Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653.
 Savinov et al. (2018b) Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., and Gelly, S. (2018b). Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274.
 Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320.
 Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015a). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
 Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015b). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
 Shah et al. (2018) Shah, P., Fiser, M., Faust, A., Kew, J. C., and Hakkani-Tur, D. (2018). FollowNet: Robot navigation by following natural language directions with deep reinforcement learning. arXiv preprint arXiv:1805.06150.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484.
 Şimşek et al. (2005) Şimşek, Ö., Wolfe, A. P., and Barto, A. G. (2005). Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the 22nd international conference on Machine learning, pages 816–823. ACM.
 Song et al. (2017) Song, S., Yu, F., Zeng, A., Chang, A. X., Savva, M., and Funkhouser, T. (2017). Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1746–1754.
 Srinivas et al. (2018) Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. (2018). Universal planning networks. arXiv preprint arXiv:1804.00645.
 Sutton (1990) Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier.
 Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211.
 Tamar et al. (2016) Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162.
 Vezhnevets et al. (2017) Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu, K. (2017). Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3540–3549. JMLR.org.
 Watkins and Dayan (1992) Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine learning, 8(3-4):279–292.
 Watter et al. (2015) Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. (2015). Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754.
 Williams (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256.
 Wu et al. (2018) Wu, Y., Tucker, G., and Nachum, O. (2018). The laplacian in rl: Learning representations with efficient approximations. arXiv preprint arXiv:1810.04586.
 Zhu et al. (2017) Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., and Farhadi, A. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3357–3364. IEEE.
References
Appendix A Efficient Shortest Path Computation
Our policy solves a shortest path problem every time it recomputes a new waypoint. Naïvely running Dijkstra’s algorithm to compute a shortest path among the states in our active set $\mathcal{B}$ requires $\mathcal{O}(|\mathcal{B}|^2)$ queries of our value function. While the search algorithm itself is fast, it is expensive to evaluate the value function on each pair of states at every time step. In our implementation (Algorithm 2), we amortize this computation across many calls to the policy. We periodically evaluate the value function on each pair of nodes in the replay buffer, and then use the Floyd-Warshall algorithm to compute the shortest path between all pairs. This takes $\mathcal{O}(|\mathcal{B}|^3)$ time, but only $\mathcal{O}(|\mathcal{B}|^2)$ calls to the value function. Let $D \in \mathbb{R}^{|\mathcal{B}| \times |\mathcal{B}|}$ be the resulting matrix storing the shortest path distances between all pairs of states in the active set. Now, given a start state $s$ and goal state $g$, the shortest path distance is
$$d_{\text{sp}}(s, g) = \min_{w_1, w_2 \in \mathcal{B}} \; d(s, w_1) + D[w_1, w_2] + d(w_2, g),$$
where $d(\cdot, \cdot)$ denotes the distance predicted by the value function.
This computation requires $\mathcal{O}(|\mathcal{B}|)$ calls to the value function, substantially better than the $\mathcal{O}(|\mathcal{B}|^2)$ calls required with the naïve implementation.
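The amortization described in this appendix can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy one-dimensional states, and `toy_distance` (a stand-in for the learned value function) are assumptions made for illustration. The all-pairs matrix is computed once with Floyd-Warshall; each later query only needs the distances from the start state to every buffer state and from every buffer state to the goal.

```python
import math

def floyd_warshall(weights):
    """All-pairs shortest paths over an n x n weight matrix (math.inf = no edge)."""
    n = len(weights)
    D = [row[:] for row in weights]  # copy so the input is left untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

def shortest_path_distance(d_from_start, D, d_to_goal):
    """Minimum over buffer pairs (i, j) of d(s, i) + D[i][j] + d(j, g)."""
    n = len(D)
    return min(
        d_from_start[i] + D[i][j] + d_to_goal[j]
        for i in range(n)
        for j in range(n)
    )

def toy_distance(a, b, max_dist=1.5):
    """Hypothetical stand-in for the learned distance; far pairs get no edge."""
    gap = abs(a - b)
    return gap if gap <= max_dist else math.inf

# Periodic step: O(n^2) distance evaluations, O(n^3) search time.
buffer_states = [0.0, 1.0, 2.0, 3.0]
D = floyd_warshall([[toy_distance(a, b) for b in buffer_states] for a in buffer_states])

# Per-query step: only 2n distance evaluations involving the start and goal.
start, goal = -0.5, 3.5
d_from_start = [toy_distance(start, w) for w in buffer_states]
d_to_goal = [toy_distance(w, goal) for w in buffer_states]
dist = shortest_path_distance(d_from_start, D, d_to_goal)
```

In this toy setup the start and goal are not directly connected, so the returned distance comes from a path through the buffer states.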
Appendix B Environments
We used two simple navigation environments, PointU and PointFourRooms, shown in Figure 3(a). In both environments, the observations are the location of the agent, $s = (x, y) \in \mathbb{R}^2$. The agent’s actions are added to the agent’s current position at every time step. We tuned the environments so that the goal-conditioned algorithm (which we will use as a baseline) would perform as well as possible. Observing that the agent would get stuck at corners, we modified the environment to automatically add Gaussian noise to the agent’s action. The resulting dynamics were
$$s_{t+1} = \text{proj}(s_t + a_t + \epsilon_t), \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2),$$
where proj() handles collisions with walls by projecting the state to the nearest free state. We used for PointU, and for the (larger) PointFourRooms environment.
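A minimal sketch of one environment step under these dynamics follows. It makes a strong simplifying assumption: the only walls are the outer bounds of a unit box, so projecting to the nearest free state reduces to clipping each coordinate. The real environments also have interior walls, which this sketch does not model. The names `proj` and `step`, the box bounds, and the example inputs are illustrative assumptions.

```python
import random

def proj(state, low=0.0, high=1.0):
    """Project onto the free space. Assumption: free space is the box [low, high]^2,
    for which the nearest free state is obtained by clipping each coordinate."""
    return tuple(min(max(x, low), high) for x in state)

def step(state, action, sigma, rng=random):
    """Add the action and Gaussian noise to the position, then resolve collisions."""
    noisy = tuple(s + a + rng.gauss(0.0, sigma) for s, a in zip(state, action))
    return proj(noisy)

# With sigma = 0 the step is deterministic; the x coordinate is clipped at the wall.
next_state = step((0.75, 0.5), (0.5, -0.25), sigma=0.0)

# With sigma > 0 the agent is jittered, but always stays inside the free space.
rng = random.Random(0)
noisy_states = [step((0.5, 0.5), (0.1, 0.1), sigma=0.5, rng=rng) for _ in range(100)]
```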
B.1 Visual Navigation
We ran most experiments on SUNCG house 0bda523d58df2ce52d0a1d90ba21f95c. We repeated all experiments on SUNCG house 0601a680273d980b791505cab993096a, with nearly identical results. We manually chose houses using the following criteria: (1) single story, (2) no humans, and (3) multiple rooms, to make planning challenging. During training, we sampled “nearby” goal states (within 4 steps) for 80% of episodes, and sampled goals uniformly at random for the remaining 20% of episodes. We tuned these parameters to make goal-conditioned RL work as well as possible. We implemented goal relabelling (Kaelbling, 1993b; Andrychowicz et al., 2017), choosing between (1) the originally sampled goal, (2) the current state, and (3) a future state in the same trajectory, each with probability 33%. The agent’s action space was to move North/South/East/West. Observations were panoramic images, created by concatenating the first-person views from each of the cardinal directions. We used ensembles of 3 value functions, each with entirely independent weights. For all neural networks conditioned on both the current observation and the goal observation, we concatenated the current observation and goal observation along their last channel. For RGB images, this resulted in an input with dimensions . For depth images, the concatenated input had dimension .
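The goal relabelling rule and the channel-wise concatenation described above can be sketched as follows. This is an illustration, not the authors' code: `relabel_goal` and `concat_last_channel` are hypothetical names, states are stand-in integers, and images are nested lists indexed as [height][width][channel].

```python
import random

def relabel_goal(trajectory, t, original_goal, rng=random):
    """Choose the goal for the transition at index t: the originally sampled goal,
    the current state, or a future state of the same trajectory (equal probability)."""
    choice = rng.choice(("original", "current", "future"))
    if choice == "original":
        return original_goal
    if choice == "current":
        return trajectory[t]
    last = len(trajectory) - 1
    # Fall back to the final state if t is already the last index.
    return trajectory[rng.randint(min(t + 1, last), last)]

def concat_last_channel(obs, goal):
    """Concatenate two [H][W][C] images along the channel axis, giving [H][W][2C]."""
    return [
        [obs_px + goal_px for obs_px, goal_px in zip(obs_row, goal_row)]
        for obs_row, goal_row in zip(obs, goal)
    ]

rng = random.Random(0)
trajectory = list(range(10))  # stand-in states 0..9
goals = [relabel_goal(trajectory, 3, original_goal=99, rng=rng) for _ in range(50)]

rgb_obs = [[[1, 2, 3]]]       # a 1 x 1 RGB "image"
rgb_goal = [[[4, 5, 6]]]
network_input = concat_last_channel(rgb_obs, rgb_goal)
```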
Appendix C Ablation Experiments


Because SoRB plans over a fixed replay buffer, one potential concern is that performance might degrade if the replay buffer is too small. To test this concern, we ran an experiment varying the size of the replay buffer. As shown in Figure 9(a), decreasing the replay buffer size by a factor of 10 led to no discernible drop in performance. While we do expect performance to drop if we further decrease the size of the replay buffer, the requirement of storing 100 states (even high-resolution images) seems relatively minor. In a second ablation experiment, we varied the MaxDist hyperparameter that governs when we stop adding new edges to the graph. As shown in Figure 9(b), SoRB is sensitive to this hyperparameter, with values that are too large or too small leading to worse performance. When the MaxDist parameter is too small, graph search fails to find a path to the goal state. As we increase MaxDist, we increase the probability of underestimating the distance between pairs of states. We expect that improvements in uncertainty quantification in RL will improve the stability of our method w.r.t. this hyperparameter.
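The role of MaxDist can be illustrated with a toy graph. The sketch below assumes that an edge is kept only when the predicted distance is at most MaxDist; the function names, the one-dimensional states, and the use of the absolute difference as a stand-in for the learned distance are all assumptions made for illustration.

```python
from collections import deque

def build_edges(pairwise, max_dist):
    """Keep an edge i -> j only if the predicted distance is at most max_dist."""
    n = len(pairwise)
    return {
        i: [j for j in range(n) if j != i and pairwise[i][j] <= max_dist]
        for i in range(n)
    }

def is_reachable(edges, src, dst):
    """Breadth-first search over the thresholded graph."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

states = [0.0, 2.0, 4.0, 6.0]  # buffer states spaced 2 apart
pairwise = [[abs(a - b) for b in states] for a in states]

too_small = is_reachable(build_edges(pairwise, max_dist=1.0), 0, 3)
moderate = is_reachable(build_edges(pairwise, max_dist=3.0), 0, 3)
```

With the small threshold no edges survive and search cannot reach the goal; with the moderate threshold neighbouring states are linked and a path exists. A very large threshold would also admit long edges, whose predicted distances are the ones most likely to be underestimated.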
Appendix D Tricks for Learning Distances with RL

Small learning rates: Especially for the image-based tasks, we found that RL completely failed when using a critic learning rate larger than 1e-4. Smaller learning rates work too, but take longer to converge.

Distributional RL: The value function update for distributional RL has a particularly nice form when values correspond to distances. Additionally, distributional RL implicitly clips the values, preventing the critic from predicting that unreachable states are infinitely far away.

Termination Condition: Carefully consider whether to set done = True at the end of each episode. In our setting the agent received a reward of -1 at each time step, so the value of each state was negative. An optimal agent therefore attempts to terminate the episode as quickly as possible. We only set done = True when the agent reached the goal state, not when the maximum number of time steps was reached or when it reached some other absorbing state.

Ensembles of Value Functions: Predicted distances from a single value function can be inaccurate for unseen (state, goal) pairs. When performing search using these predicted distances, inaccurately short predictions result in “wormholes” through the environment, where the agent mistakenly believes that two distant states are actually nearby. To mitigate this, we trained multiple, independent critics in parallel on the same data, and then aggregated predictions from each before doing search. Surprisingly, we found that taking the average predicted distance over the ensemble worked as well as taking the maximum predicted distance. We tried accelerating training by using shared convolutional layers for all critics in the ensemble, but found that this resulted in highly correlated distance predictions that exhibited the “wormhole” problem.

Normalizing Observations: For the visual navigation experiments, we normalized the observations to be in the interval [0, 1] by dividing by the maximum pixel intensity (32 for depth, 255 for RGB). Normalization was most important for the generalization experiment with RGB observations.
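Two of the tricks above, the termination condition and ensemble aggregation, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the discount is 1 as in the hyperparameter table, and each critic is represented by a categorical distribution over a small set of distance atoms (non-negative here for readability).

```python
def td_target(reward, next_value, reached_goal):
    """One-step target with discount 1. Bootstrapping stops only when the goal is
    reached, not when the episode hits its time limit."""
    return reward if reached_goal else reward + next_value

def expected_distance(probs, atoms):
    """Expected distance under one critic's categorical distribution."""
    return sum(p * a for p, a in zip(probs, atoms))

def aggregate_ensemble(ensemble_probs, atoms, mode="max"):
    """Combine per-critic expected distances; 'max' is the pessimistic choice."""
    distances = [expected_distance(p, atoms) for p in ensemble_probs]
    if mode == "max":
        return max(distances)
    if mode == "mean":
        return sum(distances) / len(distances)
    raise ValueError(f"unknown mode: {mode}")

atoms = [0.0, 1.0, 2.0, 3.0]
critic_a = [0.0, 1.0, 0.0, 0.0]  # confident the goal is 1 step away
critic_b = [0.0, 0.0, 0.5, 0.5]  # believes the goal is 2 or 3 steps away

pessimistic = aggregate_ensemble([critic_a, critic_b], atoms, mode="max")
average = aggregate_ensemble([critic_a, critic_b], atoms, mode="mean")
```

With a reward of -1 per step, `td_target(-1.0, -4.0, False)` keeps bootstrapping from the next state, while `td_target(-1.0, -4.0, True)` cuts the return off at the goal.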
Appendix E Failed Experiments

Lower bounds on Q-values: We attempted to use the search path to obtain a lower bound on the target Q-values during training. In the Bellman update, we replaced the distance predicted by the target Q-values with the minimum of (1) the distance predicted by the target Q-network and (2) the distance of the shortest path found by search. This can be interpreted as a generalization of the single-step lower bound from Kaelbling (1993b). Initial experiments showed that this approach slowed down learning, and in some cases prevented the algorithm from converging. We hypothesize that Q-learning is much more sensitive to error in the relative values of two actions than to error in the absolute value of any particular action. While our lower-bound method likely decreased the absolute error, it did not decrease the relative error (and may have even increased it).

TD3-style Ensemble Aggregation: In our main experiments, we aggregated distance predictions from the ensemble of distributional critics by first computing the expected distance of each critic, and then taking the maximum predicted distance. This approach ignores the fact that our critics are distributional. Inspired by the stability of TD3, we attempted to apply a similar approach to aggregating predictions from the ensemble of distributional critics. The naïve approach of taking the minimum for each atom does not work because the resulting distribution will not sum to one. Instead, we first compute the cumulative distribution function (CDF) of each critic and then take the pointwise maximum over the CDFs. Note that critics correspond to negative distance, so the maximum corresponds to being pessimistic. Finally, we convert the resulting CDF back into a PDF and return the corresponding expected distance. While this method has neat connections to second-order stochastic dominance and risk-averse expected utility maximizers (Hadar and Russell, 1969), we found that it worked poorly in practice.
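The CDF-based aggregation described above can be written out as a short sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, each critic is a categorical distribution over atoms sorted in increasing order, and the atoms hold negative distances.

```python
from itertools import accumulate

def pessimistic_cdf_aggregate(ensemble_probs, atoms):
    """Pointwise maximum over the critics' CDFs, converted back to a PDF.
    Atoms must be sorted ascending and represent negative distances."""
    cdfs = [list(accumulate(p)) for p in ensemble_probs]
    max_cdf = [max(cdf[i] for cdf in cdfs) for i in range(len(atoms))]
    # Differencing the CDF recovers a valid PDF that sums to one.
    pdf = [max_cdf[0]] + [max_cdf[i] - max_cdf[i - 1] for i in range(1, len(atoms))]
    expected_value = sum(p * a for p, a in zip(pdf, atoms))
    return -expected_value, pdf  # negate the value to report a distance

atoms = [-3.0, -2.0, -1.0, 0.0]
critic_a = [0.0, 0.0, 1.0, 0.0]  # expected distance 1
critic_b = [0.0, 0.5, 0.0, 0.5]  # expected distance 1, but more spread out

distance, pdf = pessimistic_cdf_aggregate([critic_a, critic_b], atoms)
```

In this example both critics have an expected distance of 1, yet the aggregated distance is 1.5: the pointwise maximum shifts probability mass toward lower values, which is the pessimistic direction when values are negative distances.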
Appendix F Hyperparameters
Unless otherwise noted, all baselines use the same hyperparameters as our method. Unless otherwise noted, parameters were not tuned.
F.1 Search on the Replay Buffer
| Parameter | Value | Comments |
| --- | --- | --- |
| learning rate | 1e-4 | Lower values also work, but training takes longer. Same for actor and critic. |
| training iterations | 1e6 environment steps | Performance changed little after 200k steps. |
| batch size | 64 | |
| train steps per environment step | 1:1 | |
| random steps at start of training | 1000 | |
| NN architecture (images) | Conv(16, 8, 4) + Conv(32, 4, 4) + FC(256) | Same for depth and RGB images. |
| optimizer | Adam | We used the default TensorFlow settings for . Same for actor and critic. |
| MaxDist | 3 | See Figure 10 |
| replay buffer size (training) | 100k | |
| replay buffer size (search) | 1k | See Figure 10 |
| gamma / discount | 1 | |
| $\epsilon$ | 0.1 | Exploration parameter for discrete actions, used for visual navigation. |
| OU-stddev, OU-damping | 1.0, 2.0 | Exploration parameters for continuous actions, used for didactic 2D navigation. |
| reward scale factor | 0.1 | Tuned for the DDPG baseline on the 2D navigation task. |
| target network update frequency | every 5 steps | |
| target network update rate ($\tau$) | 0.05 | |
F.2 Value Iteration Networks
| Parameter | Value | Comments |
| --- | --- | --- |
| number of iterations | 50 | Tuned over [1, 2, 5, 10, 20, 50]. Little effect. |
| hidden units in VI block | 100 | Tuned over [10, 30, 100, 300]. Little effect. |
F.3 Semi-Parametric Topological Memory
We first tuned the parameter on goal-reaching without search. Setting to the best found value, we performed a massive (over 1000 experiments) grid search over , , and the threshold for adding edges.
| Parameter | Value | Comments |
| --- | --- | --- |
| threshold for adding edges | 0.9 | Tuned over [0.1, 0.2, 0.5, 0.7, 0.9] |
| threshold for choosing the next waypoint along the shortest path | 0.5 | Tuned over [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0] |
| NN architecture | Conv(16, 8, 4) + Conv(32, 4, 4) + FC(256) | Same architecture (but different weights) for the retrieval and locomotor networks. |
| threshold for sampling nearby states in trajectory | 8 | Tuned over [1, 2, 4, 8] |
| margin between “close” and “far” states | 1 | Tuned over [1, 2, 4] |