Dynamical Distance Learning for
Semi-Supervised and Unsupervised
Skill Discovery
Abstract
Reinforcement learning requires manual specification of a reward function to learn a task. While in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. This shaping is difficult to specify by hand, particularly when the task is learned from raw observations, such as images. In this paper, we study how we can automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These dynamical distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. We show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances, while a small amount of preference supervision is used to determine the task goal, without any manually engineered reward function or goal examples. We evaluate our method both on a real-world robot and in simulation. We show that our method can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and just ten preference labels, without any other supervision. Videos of the learned skills can be found on the project website: https://sites.google.com/view/skillsviadistancelearning.
1 Introduction
The manual design of reward functions represents a major barrier to the adoption of reinforcement learning (RL), particularly in robotics, where vision-based policies can be learned end-to-end (levine2016end; haarnoja2018bsoft), but still require reward functions that themselves might need visual detectors to be designed by hand (singh2019end). While in principle the reward only needs to specify the goal of the task, in practice RL can be exceptionally time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. Prior work tackles such situations with dedicated exploration methods (houthooft2016vime; osband2016deep; andrychowicz2017hindsight), or by using large amounts of random exploration (mnih2015human), which is feasible in simulation but infeasible for real-world robotic learning. It is also common to employ heuristic shaping, such as the Cartesian distance to a goal for an object relocation task (mahmood2018setting; haarnoja2018composable). However, this kind of shaping is brittle and requires manual insight, and is often impossible when ground truth state observations are unavailable, such as when learning from image observations.
In this paper, we aim to address these challenges by introducing dynamical distance learning (DDL), a general method for learning distance functions that can provide effective shaping for goal-reaching tasks without manual engineering. Instead of imposing heuristic metrics that have no relationship to the system dynamics, we quantify the distance between two states in terms of the number of time steps needed to transition between them. This is a natural choice for dynamical systems, and prior works have explored learning such distances in simple and low-dimensional domains (kaelbling1993learning). While such distances can be learned using standard model-free reinforcement learning algorithms, such as Q-learning, we show that such methods generally struggle to acquire meaningful distances for more complex systems, particularly with high-dimensional observations such as images. We present a simple method that employs supervised regression to fit dynamical distances, and then uses these distances to provide reward shaping, guide exploration, and discover distinct skills.
The most direct use of DDL is to provide reward shaping for a standard deep RL algorithm, to optimize a policy to reach a given goal state. We can also formulate a semi-supervised skill learning method, where a user expresses preferences over goals, and the agent autonomously collects experience to learn dynamical distances in a self-supervised way. Finally, we can use DDL in a fully unsupervised method, where the most distant states are selected for exploration, resulting in an unsupervised reinforcement learning procedure that discovers difficult skills that reach dynamically distant states from a given start state. All of these applications avoid the need for manually designed reward functions, demonstrations, or user-provided examples, and involve minimal modification to existing deep RL algorithms.
DDL is a simple and scalable approach to learning dynamical distances that can readily accommodate raw image inputs and, as shown in our experiments, substantially outperforms prior methods that learn goal-conditioned policies or distances using approximate dynamic programming techniques, such as Q-learning. We show that using dynamical distances as a reward function in standard reinforcement learning methods results in policies that take the shortest path to a given goal, despite the additional shaping. Empirically, we compare the semi-supervised variant of our method to prior techniques for learning from preferences. We also compare our method to prior methods for unsupervised skill discovery on tasks ranging from 2D navigation to quadrupedal locomotion. Our experimental evaluation demonstrates that DDL can learn complex locomotion skills without any supervision at all, and that the preferences-based version of DDL can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and 10 preference labels, without any other supervision.
2 Related Work
Dynamical distance learning is most closely related to methods that learn goal-conditioned policies or value functions (schaul2015universal; sutton2011horde). Many of these works learn goal-reaching directly via model-free RL, often by using temporal difference updates to learn the distance function as a value function (kaelbling1996reinforcement; schaul2015universal; andrychowicz2017hindsight; pong2018temporal; nair2018visual; florensa2019self). For example, kaelbling1993learning learns a goal-conditioned Q-function to represent the shortest path between any two states, and andrychowicz2017hindsight learns a value function that resembles a distance to goals, under a user-specified low-dimensional goal representation. Unlike these methods, DDL learns policy-conditioned distances with an explicit supervised learning procedure, and then employs these distances to recover a reward function for RL. We experimentally compare to RL-based distance learning methods, and show that DDL attains substantially better results, especially with complex observations. Another line of prior work uses a learned distance to build a search graph over a set of visited states (savinov2018semi; eysenbach2019search), which can then be used to plan to reach new states via the shortest path. Our method also learns a distance function separately from the policy, but instead of using it to build a graph, we use it to obtain a reward function for a separate model-free RL algorithm.
The semi-supervised variant of DDL is guided by a small number of preference queries. Prior work has explored several ways to elicit goals from users, such as using outcome examples and a small number of label queries (singh2019end), or using a large number of relatively cheap preferences (christiano2017deep). The preference queries that our semi-supervised method uses are easy to obtain and, in contrast to prior work (christiano2017deep), we only need a small number of these queries to learn a policy that reliably achieves the user’s desired goal. Our method is also well suited for fully unsupervised learning, in which case DDL uses the distance function to propose goals for unsupervised skill discovery. Prior work on unsupervised reinforcement learning has proposed choosing goals based on a variety of unsupervised criteria, typically with the aim of attaining broad state coverage (nair2018visual; florensa2018automatic; eysenbach2018diversity; warde2018unsupervised; pong2019skewfit). Our method instead repeatedly chooses the most distant state as the goal, which produces rapid exploration and quickly discovers relatively complex skills. We provide a comparative evaluation in our experiments.
3 Preliminaries
In this work, we study control of systems defined by fully observed Markovian dynamics $p(s_{t+1} \mid s_t, a_t)$, where $\mathcal{S}$ and $\mathcal{A}$ are continuous state and action spaces. We aim to learn a stochastic policy $\pi(a_t \mid s_t)$ to reach a goal state $g \in \mathcal{S}$. We denote a trajectory by $\tau \sim \rho_\pi$, where $\rho_\pi$ is the trajectory distribution induced by the policy $\pi$, and $s_0$ is sampled from an initial state distribution $\rho(s_0)$. The policy can be optimized using any reinforcement learning algorithm by maximizing
$$J(\pi) = \mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right] \tag{1}$$
where $r$ is a bounded reward function and $\gamma$ is a discount factor. (In practice, we use soft actor-critic to learn the policy, which uses a related maximum entropy objective (haarnoja2018bsoft).) However, we do not assume that we have access to a shaped reward function. In principle, we could set the reward to $r(s_t) = 0$ if $s_t = g$ and $r(s_t) = -1$ otherwise to learn a policy to reach the goal in as few time steps as possible. Unfortunately, such a sparse reward signal is extremely hard to optimize, as it does not provide any gradient towards the optimal solution until the goal is actually reached. Instead, in Section 4, we will show that we can efficiently learn to reach goals by making use of a learned dynamical distance function.
4 Dynamical Distance Learning
The aim of our method is to learn policies that reach goal states. These goal states can be selected either in an unsupervised fashion, to discover complex skills, or selected manually by the user. The learning process alternates between two steps: in the distance evaluation step, we learn a policy-specific dynamical distance, which is defined in the following subsection. In the policy improvement step, the policy is optimized to reach the desired goal by using the distance function as the negative reward. This process will lead to a sequence of policies and dynamical distance functions that converge to an effective goal-reaching policy. Under certain assumptions, we can prove that this process converges to a policy that minimizes the distance from any state to any goal, as discussed in Appendix B. In this section, we define dynamical distances and describe our dynamical distance learning (DDL) procedure. In Section 5, we will describe the different ways that the goals can be chosen to instantiate our method as a semi-supervised or unsupervised skill learning procedure.
4.1 Dynamical Distance Functions
The dynamical distance associated with a policy $\pi$, which we write as $d^\pi(s_i, s_j)$, is defined as the expected number of time steps it takes for $\pi$ to reach a state $s_j$ from a state $s_i$, given that the two were visited in the same episode. (Dynamical distances are not true distance metrics, since they do not in general satisfy the triangle inequality.) Mathematically, the distance is defined as:
$$d^\pi(s_i, s_j) = \mathbb{E}_{\tau \sim \pi \,\mid\, s_i \in \tau,\ s_j \in \tau,\ j \geq i}\left[\sum_{t=i}^{j-1} \gamma^{t-i} c(s_t, s_{t+1})\right] \tag{2}$$
where $\tau$ is sampled from the conditional distribution of trajectories that pass through $s_i$ first and then $s_j$, and where $c(s_t, s_{t+1})$ is some local cost of moving from $s_t$ to $s_{t+1}$. For example, in a typical case in the absence of supervision, we can set $c(s_t, s_{t+1}) \equiv 1$ analogously to the binary reward function in Equation 1, in which case (for $\gamma = 1$) the sum reduces to $j - i$, and we recover the expected number of time steps to reach $s_j$. In principle, we could also trivially incorporate more complex local costs $c$, for example to include action costs. This modification would be straightforward, though we focus on the simple $c \equiv 1$ in our derivation and experiments. We include the discount factor $\gamma$ to extend the definition to infinitely long trajectories, but in practice we set $\gamma = 1$.
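As a small illustration of the definition above, the following sketch evaluates the discounted cost sum along one trajectory segment. The function name `segment_cost` and the list-of-costs representation are assumptions made for this example, not part of the original method description.

```python
def segment_cost(costs, i, j, gamma=1.0):
    """Discounted cost of moving from step i to step j along one trajectory.

    costs[t] is the local cost of the transition from step t to step t + 1.
    """
    return sum(gamma ** (t - i) * costs[t] for t in range(i, j))


# With unit costs and gamma = 1 the sum is simply the step count j - i.
unit_costs = [1.0] * 10
print(segment_cost(unit_costs, 2, 7))             # 5.0
# Discounting shrinks the sum below the raw step count.
print(segment_cost(unit_costs, 2, 7, gamma=0.9))
```

Averaging this quantity over trajectories that pass through both states would give a Monte Carlo estimate of the dynamical distance for a fixed policy.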
4.2 Distance Evaluation
In the distance evaluation step, we learn a distance function $d_\psi^\pi$, parameterized by $\psi$, to estimate the dynamical distance between pairs of states visited by a given policy $\pi_\phi$, parameterized by $\phi$. We first roll out the policy multiple times to sample trajectories $\tau$ of length $T$. The empirical distance between states $s_i, s_j \in \tau$, where $0 \leq i \leq j \leq T$, is given by $j - i$. Because the trajectories have a finite length, we are effectively ignoring the cases where reaching $s_j$ from $s_i$ would take more than $T - i$ steps, biasing this estimate toward zero, but since the bias becomes smaller for shorter distances, we did not find this to be a major limitation. We can now learn the distance function via supervised regression by minimizing
$$\mathcal{L}_d(\psi) = \frac{1}{2}\, \mathbb{E}_{\tau \sim \rho_\pi,\ i \sim [0, T],\ j \sim [i, T]}\left[\left(d_\psi^\pi(s_i, s_j) - (j - i)\right)^2\right] \tag{3}$$
As we will show in our experimental evaluation, this supervised regression approach makes it feasible to learn dynamical distances for complex tasks with raw image observations, something that has proven exceptionally challenging for methods that learn distances via goal-conditioned policies or value functions and rely on temporal difference-style methods. In direct comparisons, we find that such methods generally struggle to learn on the more complex tasks with image observations. On the other hand, a disadvantage of supervised regression is that it requires on-policy experience, potentially leading to poor sample efficiency. However, because we use the distance as an intermediate representation that guides off-policy policy learning, as we will discuss in Section 4.3, we did not find the on-policy updates for the distance to slow down learning. Indeed, our experiments in Section 6.1 show that we can learn a manipulation task on a real robot with roughly the same amount of experience as is necessary when using a well-shaped and hand-tuned reward function.
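The regression targets described in this subsection can be sketched in a few lines. This is a minimal illustration under stated assumptions: the helper names `sample_distance_targets` and `regression_loss` are hypothetical, states are plain floats, and the distance model is an arbitrary Python callable rather than the neural network used in the experiments.

```python
import random


def sample_distance_targets(trajectory, num_pairs, rng):
    """Sample (s_i, s_j, j - i) regression targets from a single trajectory."""
    last = len(trajectory) - 1
    samples = []
    for _ in range(num_pairs):
        i = rng.randint(0, last)  # i drawn uniformly from {0, ..., T}
        j = rng.randint(i, last)  # j drawn uniformly from {i, ..., T}
        samples.append((trajectory[i], trajectory[j], j - i))
    return samples


def regression_loss(distance_fn, samples):
    """Mean of 0.5 * (predicted distance - empirical distance)^2."""
    errors = [(distance_fn(s_i, s_j) - target) ** 2 for s_i, s_j, target in samples]
    return 0.5 * sum(errors) / len(errors)


# Toy trajectory on a line: the state at step t is simply the position t.
rng = random.Random(0)
trajectory = [float(t) for t in range(21)]
samples = sample_distance_targets(trajectory, num_pairs=256, rng=rng)

# On this toy trajectory the distance |s_j - s_i| fits the targets exactly.
print(regression_loss(lambda a, b: abs(b - a), samples))  # 0.0
```

In a real implementation the callable would be a parametric model trained by stochastic gradient descent on this loss; the sketch only shows how the labels come for free from the time indices.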
4.3 Policy Improvement
In the policy improvement step, we use $d_\psi^\pi$ to optimize a policy $\pi_\phi$, parameterized by $\phi$, to reach a goal $g$. In principle, we could optimize the policy by choosing actions that greedily minimize the distance to the goal, which essentially treats negative distances as the values of a value function, and would be equivalent to the policy improvement step in standard policy iteration. However, acting greedily with respect to the dynamical distance defined in Equation 2 would result in a policy that is optimistic with respect to the dynamics.
This is because the dynamical distance is defined as the expected number of time steps conditioned on the policy successfully reaching the second state from the first state, and therefore does not account for the case where the second state is not reached successfully. In some cases, this results in pathologically bad value functions. For example, consider an MDP where the agent can reach the goal using one of two paths. The first path has one intermediate state that leads to the target state with probability $p$, and to an absorbing terminal state with probability $1 - p$. The other path has two intermediate states, but allows the agent to reach the target every time. The optimal dynamical distance will be 2, regardless of the value of $p$, causing the policy to always choose the risky path and potentially miss the target completely.
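The two-path example can be checked with a short simulation. The sketch below is an illustration under assumptions: the function names are hypothetical, and the risky path is modeled as two transitions whose second step succeeds with probability $p$.

```python
import random

SAFE_PATH_STEPS = 3  # start -> two intermediate states -> goal, always succeeds


def risky_rollout(p, rng):
    """One attempt on the risky path: returns (steps, reached_goal)."""
    # Two transitions either way; the second one reaches the goal with probability p.
    return 2, rng.random() < p


def conditional_distance(p, num_rollouts, rng):
    """Mean step count over successful rollouts only, plus the success rate."""
    rollouts = [risky_rollout(p, rng) for _ in range(num_rollouts)]
    successes = [steps for steps, reached in rollouts if reached]
    return sum(successes) / len(successes), len(successes) / num_rollouts


for p in (0.9, 0.1):
    distance, success_rate = conditional_distance(p, 10_000, random.Random(0))
    # The conditional distance stays at 2.0 < SAFE_PATH_STEPS even when most attempts fail.
    print(p, distance, round(success_rate, 2))
```

Because failed rollouts are discarded before averaging, the estimate carries no information about how often the risky path misses the target, which is exactly why acting greedily on it is optimistic.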
The definition of dynamical distances in Equation 2 follows directly from how we learn the distance function, by choosing both $s_i$ and $s_j$ from the same trajectory. Conditioning on both $s_i$ and $s_j$ is needed when the state space is continuous or large, since visiting two given states by chance has zero or near-zero probability. We instead propose to use the distance as a negative reward, and apply reinforcement learning to minimize the cumulative distance on the path to the goal:
$$\mathcal{L}_\pi(\phi) = \mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t=0}^{\infty} \gamma^t d_\psi^\pi(s_t, g)\right] \tag{4}$$
This amounts to minimizing the cumulative distance over visited states, and thus taking a risky action becomes unfavorable if it takes the agent to a state that is far from the target at a later time. We further show that, under certain assumptions, the policy that optimizes Equation 4 will indeed acquire the correct behavior, as discussed in Appendix A, and will converge to a policy that takes the shortest path to the goal, as we show in Appendix B.
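To make the cumulative-distance objective concrete, the following sketch evaluates it for single trajectories. The distance values are made up for illustration, and `cumulative_distance` is a hypothetical helper, not the authors' implementation.

```python
def cumulative_distance(distances_to_goal, gamma=0.99):
    """Discounted sum of the distance-to-goal at every visited state of one trajectory."""
    return sum(gamma ** t * d for t, d in enumerate(distances_to_goal))


# Hypothetical distance-to-goal values along two trajectories from the same start state.
direct = [3, 2, 1, 0]        # approaches the goal at every step
detour = [3, 4, 3, 2, 1, 0]  # first moves away from the goal, then recovers

print(cumulative_distance(direct, gamma=1.0))  # 6.0
print(cumulative_distance(detour, gamma=1.0))  # 13.0
print(cumulative_distance(direct) < cumulative_distance(detour))  # True
```

Every step spent far from the goal adds to the loss, so a trajectory that drifts away is penalized for the whole time it stays away, not only for failing to arrive.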
4.4 Algorithm Summary
The dynamical distance learning (DDL) algorithm is described in Algorithm 1. Our implementation uses soft actor-critic (SAC) (haarnoja2018bsoft) as the policy optimizer, but one could also use any other off-the-shelf algorithm. In each iteration, DDL first samples a trajectory using the current policy, and saves it in a replay pool $\mathcal{D}$. In the second step, DDL updates the distance function by minimizing the loss in Equation 3. To avoid overfitting, the distance function is optimized for a fixed number of stochastic gradient steps. Note that this method requires that we use recent experience from $\mathcal{D}$, so as to learn the distance corresponding to the current policy. In the third step, DDL chooses a goal state from the recent experience buffer. We will describe two methods to choose these goal states in Section 5. In the fourth step, DDL updates the policy by taking gradient steps to minimize the loss in Equation 4. The implementation of this step depends on the RL algorithm of choice. These steps are then repeated until convergence.
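The four steps above can be sketched as a generic training loop. This is a structural outline only: `run_ddl`, its four callable arguments, and the `recent_window` parameter are hypothetical names introduced for this example, and the actual distance and policy updates are left to the caller.

```python
def run_ddl(collect_trajectory, update_distance, choose_goal, update_policy,
            num_iterations, recent_window=10):
    """Alternate the four DDL steps. All four callables are hypothetical hooks."""
    replay_pool = []
    goal = None
    for _ in range(num_iterations):
        # Step 1: roll out the current policy and store the trajectory.
        replay_pool.append(collect_trajectory())
        recent = replay_pool[-recent_window:]
        # Step 2: fit the distance function on recent experience only,
        # so that it tracks the current policy.
        update_distance(recent)
        # Step 3: choose a goal state from the recent experience.
        goal = choose_goal(recent)
        # Step 4: take gradient steps with the distance to the goal as negative reward.
        update_policy(replay_pool, goal)
    return replay_pool, goal


# Dummy hooks that only record the order in which they are called.
calls = []
pool, goal = run_ddl(
    collect_trajectory=lambda: calls.append("collect") or [0, 1, 2],
    update_distance=lambda recent: calls.append("distance"),
    choose_goal=lambda recent: calls.append("goal") or recent[-1][-1],
    update_policy=lambda pool, goal: calls.append("policy"),
    num_iterations=2,
)
print(calls[:4])  # ['collect', 'distance', 'goal', 'policy']
```

Passing the steps in as callables keeps the loop independent of the RL algorithm, which mirrors the statement that any off-the-shelf policy optimizer could be substituted.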
5 Goal Proposals
In the previous section, we discussed how we can utilize a learned distance function to efficiently optimize a goalreaching policy. However, a learned distance function is only meaningful if evaluated at states from the distribution it has been trained on, suggesting that the goal states should be chosen from the replay pool. Choosing a goal that the policy can already reach might at first appear strange, but it turns out to yield efficient directed exploration, as we explain next.
5.1 Semi-Supervised Learning from Preferences
DDL can be used to learn to reach specific goals elicited from a user. The simplest way to do this is for a user to provide the goal state directly, either by specifying the full state, or selecting the state manually from the replay pool. However, we can also provide a more convenient way to elicit the desired state with preference queries. In this setting, the user is repeatedly presented with a small slate of candidate states from the replay pool, and asked to select the one that they prefer most. In practice, we present the user with a visualization of the final state in several of the most recent episodes, and the user selects the one that they consider closest to their desired goal.
For example, if the user wishes to train a legged robot to walk forward, they might pick the state where the robot has progressed the largest distance in the desired direction. The required user effort in selecting these states is minimal, and most of the agent’s experience is still unsupervised, simply using the latest user-chosen state as the goal. In our experiments, we show that this semi-supervised learning procedure, which we call deep dynamical distance learning from preferences (DDLfP), can learn to rotate a valve with a real-world hand from just ten queries, and can learn simulated locomotion tasks using 100 simulated queries.
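A preference query of this kind can be sketched as follows. The names `query_preference` and `synthetic_user` are hypothetical, states are represented as plain floats (forward progress), and the synthetic user that ranks candidates by a known reward is a stand-in for the human operator.

```python
def query_preference(recent_trajectories, select_index, slate_size=5):
    """Show the final state of the most recent episodes and return the chosen one as the goal."""
    slate = [trajectory[-1] for trajectory in recent_trajectories[-slate_size:]]
    return slate[select_index(slate)]


def synthetic_user(true_reward):
    """Stand-in for a human: picks the candidate with the highest true reward."""
    def select_index(slate):
        return max(range(len(slate)), key=lambda k: true_reward(slate[k]))
    return select_index


# Each trajectory is a list of forward positions; the user prefers the furthest one.
trajectories = [[0.0, 0.4, 0.9], [0.0, 0.2, 0.1], [0.0, 0.8, 1.7]]
goal = query_preference(trajectories, synthetic_user(lambda x: x))
print(goal)  # 1.7
```

With a human in the loop, `select_index` would instead display the candidate states and return the index of the one the user clicked.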
5.2 Unsupervised Exploration and Skill Acquisition
We can also use DDL to efficiently acquire complex behaviors, such as locomotion skills, in a completely unsupervised fashion. Simple random exploration, such as $\epsilon$-greedy exploration or other strategies that add noise to the actions, can effectively cover states that are close to the starting state, in terms of dynamical distance. However, when high-reward states or goal states are far away from the start state, such naïve strategies are unlikely to reach them. From this observation, we can devise a simple and effective exploration strategy that uses our learned dynamical distances: since random exploration is biased to explore nearby states, we can simply set goals that are far from the current state according to their estimated dynamical distance to mitigate this bias. We call this variant of our method “dynamical distance learning - unsupervised” (DDL-US). Intuitively, this method causes the agent to explore the “frontier” of hard-to-reach states, either discovering shorter paths for reaching them, so that they are no longer on the frontier, or else finding new states further on the fringe through additive random exploration. In practice, we find that this allows the agent to quickly explore distant states in a directed fashion. In Section 6, we show that, by setting $g = \arg\max_{s \in \mathcal{D}} d_\psi^\pi(s_0, s)$, where $s_0$ is the initial state, we can acquire effective running gaits and pole balancing skills in a variety of simulated settings. While this approach is not guaranteed to discover interesting and useful skills in general, we find that, on a variety of commonly used benchmark tasks, this approach to unsupervised goal selection actually discovers behaviors that perform better with respect to the (unknown) task reward than previously proposed unsupervised reinforcement learning objectives.
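The unsupervised goal selection rule amounts to an argmax over stored states. In this minimal sketch, `propose_goal` is a hypothetical helper, states are floats on a line, and the absolute difference stands in for a learned distance model.

```python
def propose_goal(replay_states, initial_state, distance_fn):
    """Return the stored state with the largest estimated distance from the initial state."""
    return max(replay_states, key=lambda s: distance_fn(initial_state, s))


# Toy replay pool of 1-D states; absolute difference stands in for the learned distance.
replay_states = [0.0, 0.5, -1.2, 2.0, 1.1]
print(propose_goal(replay_states, 0.0, lambda a, b: abs(b - a)))  # 2.0
```

Restricting the candidates to stored states keeps the distance model inside the distribution it was trained on, as discussed at the start of Section 5.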
6 Experiments
[Figure: evaluation tasks. Surviving panel labels: (hardware), (simulation), DoublePendulum.]
Our experimental evaluation aims to study the following empirical questions: (1) Does supervised regression provide a good estimator of the true dynamical distance? (2) Is DDL applicable to real-world, vision-based robotic control tasks? (3) Does DDL provide an efficient method of learning skills a) from user-provided preferences, and b) completely unsupervised?
We evaluate our method both in the real world and in simulation on a set of state- and vision-based continuous control tasks. We consider a 9-DoF real-world dexterous manipulation task and 4 standard OpenAI Gym tasks (Hopper-v3, HalfCheetah-v3, Ant-v3, and InvertedDoublePendulum-v2). For all of the tasks, we parameterize our distance function as a neural network, and use soft actor-critic (SAC) (haarnoja2018soft) with the default hyperparameters to learn the policy. For state-based tasks, we use feedforward neural networks with two 256-unit hidden layers. For the vision-based tasks we add a convolutional preprocessing network before these fully connected layers, consisting of four convolutional layers, each with 64 3x3 filters. The image observations for all the vision-based tasks are 3072-dimensional (32x32 RGB images).
We study question (1) using a simple didactic example involving navigation through a two-dimensional S-shaped maze, which we present in Appendix C. The other two research questions are studied in the following sections.
6.1 Vision-Based Real-World Manipulation from Human Preferences
To study question (2), we apply DDLfP to a real-world vision-based robotic manipulation task. The domain consists of a 9-DoF “D’Claw” hand introduced by ahn2019robel, and the manipulation task requires the hand to rotate a valve 180 degrees, as shown in Figure 1. The human operator is queried for a preference every 10K environment steps. Both the vision- and state-based experiments with the real robot use 10 queries during the first 4 hours of an 8-hour training period. Note that, for this and all the subsequent experiments, DDLfP does not have access to the true reward, and must learn entirely from preference queries, which in this case are provided by a human operator.
Figure 3 presents the performance over the course of training. DDLfP uses 10 preference queries to learn the task, and its performance is comparable to that of SAC trained with a ground truth shaped reward function. We also show a comparison to variational inverse control with events (VICE) (singh2019end), a recent classifier-based reward specification framework. Instead of preference queries, VICE requires the user to provide examples of the desired goal state at the beginning of training (20 images in this case). For vision-based tasks, VICE requires the user to supply images of the desired outcome, which means physically arranging a scene and taking a picture of it. Preferences, on the other hand, require a user to simply select one state out of a small set, which can be done with a button press, even remotely, making it substantially less labor-intensive than VICE. As we can see in the experiments, DDLfP achieves similar performance with substantially less operator effort, using only a small number of preference queries. The series of goal preferences queried from the human operator are shown in Appendix D.
6.2 Ablations, Comparisons, and Analysis
Next, we analyze design decisions in our method and compare it to prior methods in simulation. First, we replace the supervised loss in Equation 3 of our DDL method with a temporal difference (TD) Q-learning-style update rule that learns dynamical distances with approximate dynamic programming. The results in Figure 5 show that, all else being equal, the TD-based method fails to learn successfully from both low-dimensional state and vision observations. Figure 5 further compares using the dynamical distance as the reward against using a reward of $-1$ for each step until the goal is reached, which corresponds to hindsight experience replay (HER) with goal sampling replaced with preference goals (andrychowicz2017hindsight). We see that dynamical distances allow the policy to reach the goal when learning both from state and from images, while HER is only successful when learning from low-dimensional states.
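The difference between the two ways of obtaining distance targets can be illustrated on a tabular chain. This sketch is an assumption-laden toy, not the ablation itself: the update $d(s) \leftarrow 1 + d(s')$ is one plausible form of a TD-style distance rule (the exact rule used in the ablation is not spelled out here), and the function names are hypothetical.

```python
def td_sweep(distances, goal):
    """One in-place sweep of d(s) <- 1 + d(s + 1) on a chain, with d(goal) = 0."""
    for s in range(goal):
        distances[s] = 1 + distances[s + 1]
    return distances


def supervised_targets(goal):
    """Regression targets j - i read directly off one rollout from state 0 to the goal."""
    return [goal - s for s in range(goal + 1)]


goal = 5
td = [0] * (goal + 1)
print(td_sweep(td, goal))        # [1, 1, 1, 1, 1, 0]
for _ in range(goal - 1):
    td_sweep(td, goal)
print(td)                        # [5, 4, 3, 2, 1, 0]
print(supervised_targets(goal))  # [5, 4, 3, 2, 1, 0]
```

In this toy, bootstrapped targets need several sweeps before information from the goal reaches the start state, while the supervised labels are correct after a single rollout. This only illustrates the structural difference; it does not by itself explain the empirical gap reported in Figure 5.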
These results are corroborated by prior results in the literature that have found that temporal difference learning struggles to capture the true value accurately (lillicrap2015continuous; fujimoto2018addressing). Note that prior work does not use the full state as the goal, but rather manually selects a low-dimensional subspace, such as the location of an object, forcing the distance to focus on task-relevant objects (andrychowicz2017hindsight). Our method learns distances between full image states (3072-dimensional) while HER uses 3-dimensional goals, a difference of roughly three orders of magnitude in dimensionality. This difficulty of learning complex image-based goals is further corroborated in prior work (pong2018temporal; nair2018visual; pong2019skewfit; warde2018unsupervised).
Figure 4 presents results for learning from preferences via DDLfP (in green) on a set of benchmark continuous control tasks, to further study question (3a). The plots show the true reward for each method on each task. DDLfP receives only sparse preferences as task-specific supervision, and the preferences in this case are provided synthetically, using the true reward of the task. However, this still provides substantially less supervision signal than access to the true reward for all samples. We compare to christiano2017deep, which also uses preferences for learning skills, but without the use of dynamical distances. The prior method is provided with 750 preference queries over the course of training, while our method uses 100 for all locomotion tasks, and only 1 query for InvertedDoublePendulum-v2, as the initial state and the goal state coincide. (In our case, one preference query amounts to choosing one of five states, whereas in christiano2017deep a query always consists of two state sequences.) Note that christiano2017deep uses an on-policy RL algorithm, which is less efficient than SAC. Nevertheless, DDLfP outperforms this prior method in terms of both final performance and learning speed on all tasks except Hopper-v3.
6.3 Acquiring Unsupervised Skills
Finally, we study question (3b), to understand how well DDL-US can acquire skills without any supervision. We structure these experiments analogously to the unsupervised skill learning experiments proposed by eysenbach2018diversity, and compare to the DIAYN algorithm, another unsupervised skill discovery method, proposed in their prior work. While our method maximizes the complexity of the learned skills by attempting to reach the furthest possible goal, DIAYN maximizes the diversity of learned skills. This of course produces different biases in the skills produced by the two methods. Figure 6 shows both learning curves and histograms of the skills learned in the locomotion tasks with the two methods, evaluated according to how far the simulated robot in each domain travels from the initial state. Our DDL-US method learns skills that travel further than DIAYN, while still providing a variety of different behaviors (e.g., travel in different directions). This experiment aims to provide a direct comparison to the DIAYN algorithm (eysenbach2018diversity), though a reasonable criticism is that maximizing dynamical distance is particularly well suited for the criteria proposed by eysenbach2018diversity. We also evaluated DDL-US on the InvertedDoublePendulum-v2 domain, where the task is to balance a pole on a cart. As can be seen from Figure 6, DDL-US can efficiently solve the task without the true reward, as reaching dynamically far states amounts to avoiding failure as far as possible.
7 Conclusion
We presented dynamical distance learning (DDL), an algorithm for learning dynamical distances that can be used to specify reward functions for goal-reaching policies, and support both unsupervised and semi-supervised exploration and skill discovery. Our algorithm uses a simple and stable supervised learning procedure to learn dynamical distances, which are then used to provide a reward function for a standard reinforcement learning method. This makes DDL straightforward to apply even with complex and high-dimensional observations, such as images. By removing the need for manual reward function design and manual reward shaping, our method makes it substantially more practical to employ deep reinforcement learning to acquire skills even with real-world robotic systems. We demonstrate this by learning a valve turning task with a real-world robotic hand, using 10 preference queries from a user, without any manual reward design or other examples or supervision. One of the main limitations of our current approach is that, although it can be used with an off-policy reinforcement learning algorithm, it requires on-policy data collection for learning the dynamical distances. While the resulting method is still efficient enough to learn directly in the real world, the efficiency of our approach can likely be improved in future work by lifting this limitation. This would not only make learning faster, but would also make it possible to pre-train dynamical distances using previously collected experience, potentially making it feasible to scale our method to a multi-task learning setting, where the same dynamical distance function can be used to learn multiple distinct skills.
We thank Vikash Kumar for the D’Claw robot design, Nicolas Heess for helpful discussion, and Henry Zhu and Justin Yu for their help on setting up and running the hardware experiments. This research was supported by the Office of Naval Research, the National Science Foundation through IIS-1651843 and IIS-1700696, and Berkeley DeepDrive.
References
Appendix A Correct Behavior in the Pathological MDP
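In this appendix we show that the policy that maximizes the objective in Equation 1, with the reward $r(s) = -d^\pi(s, g)$, where $d^\pi$ is given by Equation 2, prefers safe actions over risky actions. We assume that all trajectories visit $g$ and that $g$ is a terminal state.
Assume that $c(s)$ is an indicator function that is 0 if $s$ is a goal state or the terminal state and 1 for all the other states. We can now write the definition of $d^\pi$ as an infinite sum and substitute in Equation 1:
$$J(\pi) = -\mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}_{\tau' \sim \rho_\pi(\cdot \mid s'_t = s_t)}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} c(s'_{t'})\right]\right] \tag{5}$$
The first term ($t' = t$) in the inner sum depends only on $s_t$, which is given, and the term $c(s_t)$ can thus be moved outside the inner expectation:
$$J(\pi) = -\mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(c(s_t) + \mathbb{E}_{\tau' \sim \rho_\pi(\cdot \mid s'_t = s_t)}\left[\sum_{t'=t+1}^{\infty} \gamma^{t'-t} c(s'_{t'})\right]\right)\right] \tag{6}$$
Next, note that the statistics of the inner expectation over $\tau'$ are the same as the outer expectation over $\tau$, as they are both conditioned on the same $s_t$. Thus, we can condition the second expectation directly on $s_{t+1}$:
$$J(\pi) = -\mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(c(s_t) + \mathbb{E}_{\tau' \sim \rho_\pi(\cdot \mid s'_{t+1} = s_{t+1})}\left[\sum_{t'=t+1}^{\infty} \gamma^{t'-t} c(s'_{t'})\right]\right)\right] \tag{7}$$
We can now apply the same argument as before and move $c(s_{t+1})$ outside the inner expectation. Repeating these steps multiple times yields
$$J(\pi) = -\mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t=0}^{\infty} (t+1)\, \gamma^t c(s_t)\right] \tag{8}$$
Assuming that the agent always reaches the goal relatively quickly compared to the discount factor, such that $\gamma^t \approx 1$, the trajectories that take longer dominate the loss due to the $(t+1)$ factor. Therefore, an optimal agent prefers actions that reduce the risk of long, highly suboptimal trajectories, avoiding the pathological behavior discussed in Section 4.3.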
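The rearrangement used in this appendix can be checked numerically on a single deterministic trajectory: summing the discounted cost-to-go at every step equals weighting each step cost $c(s_t)$ by $(t+1)\gamma^t$. The sketch below is a sanity check under that simplifying assumption (one fixed trajectory, no expectation); `reweighted_objective_check` is a hypothetical helper name.

```python
def reweighted_objective_check(costs, gamma):
    """Compare the nested sum of costs-to-go with the (t + 1)-weighted flat sum."""
    horizon = len(costs)
    # Cost-to-go at step t: discounted sum of the remaining step costs.
    distances = [
        sum(gamma ** (k - t) * costs[k] for k in range(t, horizon))
        for t in range(horizon)
    ]
    nested = sum(gamma ** t * distances[t] for t in range(horizon))
    flat = sum((t + 1) * gamma ** t * costs[t] for t in range(horizon))
    return nested, flat


# Cost 1 at every non-goal state, 0 once the goal is reached at the last step.
nested, flat = reweighted_objective_check([1, 1, 1, 0], gamma=1.0)
print(nested, flat)  # 6.0 6.0
```

The growing weight on later steps is what makes long trajectories dominate the objective.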
Appendix B Policy Improvement when Using Distance as Reward
In this appendix we show that, when we use the negative dynamical distance as the reward function in RL, we can learn an optimal policy with respect to the true dynamical distance, leading to policies that optimize the actual number of time steps needed to reach the goal. This result is nontrivial, since the reward function does not at first glance directly optimize for shortest paths. Our proof relies on the assumption that the MDP has deterministic dynamics. However, this assumption holds in all of our experiments, since the MuJoCo benchmark tasks are governed by deterministic dynamics. Under this assumption, DDL will learn policies that take the shortest path to the goal at convergence, despite using the negative dynamical distance as the reward.
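Let $d^*(s, g)$ be the optimal distance from state $s$ to goal state $g$. Let $\bar\pi$ be the optimal policy for the reinforcement learning problem with reward $-d^\pi$. DDL can be viewed as alternating between fitting $d^\pi$ to the current policy $\pi$, and learning a new policy $\bar\pi$ that is optimal with respect to the reward function given by $-d^\pi$. (Footnote 4: Of course, the actual DDL algorithm interleaves policy updates and distance updates. In this appendix, we analyze the "policy iteration" variant that optimizes the policy to convergence, but the result can likely be extended to interleaved updates in the same way that policy iteration can be extended into an interleaved actor-critic method.) We can now state our main theorem as follows:
Theorem 1.
Under deterministic dynamics, for any state $s$ and goal $g$, we have:

1. $d^{\bar\pi}(s, g) \le d^\pi(s, g)$.

2. If $\bar\pi = \pi$, then $d^\pi(s, g) = d^*(s, g)$.
This implies that, when the policy converges, such that $\bar\pi = \pi$, the policy achieves the optimal distance to any goal, and therefore is the optimal policy for the shortest path reward function (e.g., the reward function that assigns a reward of $-1$ for any step that does not reach the goal).
Proof.
Part 1
Without loss of generality, we assume that our policy is deterministic, since the set of optimal policies in an MDP always includes at least one deterministic policy. Let us denote the action of policy $\pi$ on state $s$ as $\pi(s)$. We start by showing that $d^{\bar\pi}(s, g) \le d^\pi(s, g)$. We fix a particular goal $g$. Let $S_k$ be the set of states that take $k$ steps under $\pi$ to reach the goal. We show that $d^{\bar\pi}(s, g) \le k$ for all $s \in S_k$ for each $k$ by contradiction.
For $k = 0$, $S_0$ is just the single goal state and $d^{\bar\pi}(g, g) = 0$ by definition. For $k = 1$, for all $s \in S_1$, there is an action that reaches the goal state as the direct next state. Therefore, the optimized policy would still take the same action on these states and $d^{\bar\pi}(s, g) = 1$.
Now assume that the opposite is true, that $d^{\bar\pi}(s, g) > d^\pi(s, g)$ for some states. Then, there must be a smallest number $k$ and a state $s \in S_k$ such that $d^{\bar\pi}(s, g) > k$. Now let us denote the trajectory of states taken by $\pi$ starting from $s$ as $\tau = (s_0 = s, s_1, \ldots, s_k = g)$, and the trajectory taken by $\bar\pi$ as $\bar\tau = (s'_0 = s, s'_1, s'_2, \ldots)$. Let $R(\tau) = \sum_t \gamma^t d^\pi(s_t, g)$ denote the accumulated discounted sum of distance as defined in Equation 4. By our assumption $d^{\bar\pi}(s, g) > k$, we have $s'_k \neq g$, and since $\bar\pi$ is optimal with respect to the reward $-d^\pi$, we have
$$\sum_{t=0}^{k} \gamma^t d^\pi(s'_t, g) \le R(\bar\tau) \le R(\tau) = \sum_{t=0}^{k} \gamma^t d^\pi(s_t, g). \tag{9}$$
Since $d^\pi(s'_k, g) > 0 = d^\pi(s_k, g)$, there must be a time $t < k$ such that $d^\pi(s'_t, g) < d^\pi(s_t, g) = k - t$. Therefore $s'_t \in S_{k'}$ for some $k' < k - t$. However, starting from $s'_t$, we have $d^{\bar\pi}(s'_t, g) = d^{\bar\pi}(s, g) - t > k - t > k'$. Therefore, we reached a contradiction with our assumption that $d^{\bar\pi}(s'', g) \le k'$ for all $s'' \in S_{k'}$, such that $k' < k$. Therefore, $d^{\bar\pi}(s, g) \le d^\pi(s, g)$ holds for all states.
Part 2
Now we show the second part: if $\bar\pi = \pi$, then $d^\pi(s, g) = d^*(s, g)$. We prove this with a similar argument, grouping states by distance. Let $S^*_k$ be the set of states that take $k$ steps under the optimal policy $\pi^*$ to reach the goal. Note that, for any arbitrary policy $\pi$, we have $d^\pi(s, g) \ge d^*(s, g)$ by definition, since $d^*$ is the optimal distance.
Suppose that $d^\pi(s, g) > d^*(s, g)$ for some state $s$. Then there must be a smallest integer $k$ such that there exists a state $s \in S^*_k$ where $d^\pi(s, g) > k$. For all $k' < k$, we have $d^\pi(s'', g) = k'$ for all $s'' \in S^*_{k'}$. Now starting from that state $s$, let the trajectory of states taken by $\pi^*$ be $(s_0 = s, s_1, \ldots, s_k = g)$. Note that since $s \in S^*_k$, $s_1 \in S^*_{k-1}$. Let $\tilde\pi$ be the policy such that it agrees with $\pi^*$ on $s$ and agrees with $\pi$ everywhere else. At the first step, $\tilde\pi$ lands on state $s_1$. Since $s_1$ is $k - 1$ steps away from $g$ under $\pi^*$, $s_1$ must be $k - 1$ steps away under $\pi$, that is, $d^\pi(s_1, g) = k - 1$. Therefore, since $\tilde\pi$ and $\pi$ agree on all states that are less than $k$ steps away from goal $g$, $\tilde\pi$ would take the same action as $\pi$ and hence take another $k - 1$ steps to goal $g$. Denote the resulting trajectory of $\tilde\pi$ as $(\tilde s_0 = s, \tilde s_1 = s_1, \ldots, \tilde s_k = g)$, so that $d^\pi(\tilde s_t, g) = k - t$ for $t \ge 1$. Now let us denote the trajectory taken by $\pi$ as $(s'_0 = s, s'_1, s'_2, \ldots)$, for which $d^\pi(s'_t, g) = d^\pi(s, g) - t > k - t$. We compare the discounted sum of rewards of $\tilde\pi$ and $\pi$ under the reward function $-d^\pi$.
$$-\sum_{t=0}^{k} \gamma^t d^\pi(\tilde s_t, g) = -d^\pi(s, g) - \sum_{t=1}^{k} \gamma^t (k - t) > -d^\pi(s, g) - \sum_{t \ge 1} \gamma^t d^\pi(s'_t, g) = -\sum_{t \ge 0} \gamma^t d^\pi(s'_t, g). \tag{10}$$
Therefore, we can see that $\tilde\pi$ is a better policy than $\pi$. Then the optimal policy under this reward $-d^\pi$ must be different from $\pi$ on at least one state. Hence $\bar\pi \neq \pi$.
We have now reached the conclusion that if $d^\pi(s, g) > d^*(s, g)$ for some state $s$, then $\bar\pi \neq \pi$. Hence, by contraposition, if $\bar\pi = \pi$, then it must be that $d^\pi(s, g) = d^*(s, g)$ for all $s$. Our proof is thus complete.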
∎
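The "policy iteration" variant analyzed in this appendix can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions: the six-state deterministic graph, the discount factor, and all function names are hypothetical and do not come from the paper's experiments. It alternates between computing the number of steps the current policy needs to reach the goal and computing a policy that is optimal for the negative of that distance as reward, and then compares the result with the shortest-path distance.

```python
GOAL = 0
GAMMA = 0.99  # assumed discount factor

# Hypothetical deterministic graph: NEXT[s] lists the states reachable from s
# in one step. Every edge leads to a lower-numbered state, so it is acyclic.
NEXT = {0: [0], 1: [0], 2: [1], 3: [2], 4: [3, 1], 5: [4, 1]}


def policy_distance(policy):
    """Steps needed to reach GOAL from each state when following `policy`."""
    dist = {}
    for start in NEXT:
        state, steps = start, 0
        while state != GOAL:
            state, steps = policy[state], steps + 1
            if steps > len(NEXT):
                raise ValueError("policy never reaches the goal")
        dist[start] = steps
    return dist


def improve(dist):
    """Policy that is optimal for the per-step reward -dist[s]."""
    value = {s: 0.0 for s in NEXT}
    for _ in range(len(NEXT)):  # enough sweeps for this small acyclic graph
        for s in NEXT:
            if s != GOAL:
                value[s] = -dist[s] + GAMMA * max(value[n] for n in NEXT[s])
    return {s: GOAL if s == GOAL else max(NEXT[s], key=lambda n: value[n])
            for s in NEXT}


def ddl_policy_iteration(policy):
    """Alternate distance fitting and policy improvement until no change."""
    history = []
    while True:
        dist = policy_distance(policy)
        history.append(dist)
        new_policy = improve(dist)
        if new_policy == policy:
            return policy, history
        policy = new_policy


def shortest_distance():
    """Reference shortest-path distance to GOAL, by repeated relaxation."""
    best = {s: 0 if s == GOAL else float("inf") for s in NEXT}
    for _ in range(len(NEXT)):
        for s in NEXT:
            if s != GOAL:
                best[s] = 1 + min(best[n] for n in NEXT[s])
    return best


# Initial policy: always follow the long chain 5 -> 4 -> 3 -> 2 -> 1 -> 0.
initial_policy = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
final_policy, history = ddl_policy_iteration(initial_policy)
print("distances per iteration:", history)
print("shortest-path distances:", shortest_distance())
```

On this graph the distances never increase from one iteration to the next, and the converged policy attains the shortest-path distance, which is the behavior the theorem describes for deterministic dynamics.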
Appendix C Didactic Example
Our didactic example involves a simple 2D point robot navigating an S-shaped maze. The state space is two-dimensional, and the action is a two-dimensional velocity vector. This experiment is visualized in Figure 7. The black rectangles correspond to walls, and the goal is depicted with a blue star. The learned distance from all points in the maze to the goal is illustrated with a heat map, in which lighter colors correspond to closer states and darker colors to distant states. During the training, the initial state is chosen uniformly at random, and the policy is trained to reach the goal state. From the visualization, it is apparent that DDL learns an accurate estimate of the true dynamical distances in this domain. Note that, in contrast to naïve metrics, such as Euclidean distance, the dynamical distances conform to the walls and provide an accurate estimate of reachability, making them ideally suited for reward shaping.
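The contrast between Euclidean and dynamical distance can be reproduced in a discretized stand-in for this maze. The sketch below is an assumption-laden illustration rather than the paper's setup: it replaces the continuous point robot with a hypothetical 5x7 grid world with four-connected moves, and it computes exact step counts with breadth-first search instead of learning them.

```python
import math
from collections import deque

# Hypothetical S-shaped grid maze: '#' marks a wall, '.' marks free space.
MAZE = [
    ".......",
    "######.",
    ".......",
    ".######",
    ".......",
]
GOAL = (0, 0)  # (row, column) of the goal cell


def bfs_distances(maze, goal):
    """Number of moves needed to reach `goal` from every reachable free cell."""
    rows, cols = len(maze), len(maze[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and maze[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def euclidean(cell, goal=GOAL):
    """Straight-line distance that ignores the walls."""
    return math.hypot(cell[0] - goal[0], cell[1] - goal[1])


dynamical = bfs_distances(MAZE, GOAL)
for cell in ((2, 0), (0, 6)):
    print(cell, "euclidean:", euclidean(cell), "steps:", dynamical[cell])
```

The cell two rows below the goal is close in Euclidean terms but far in steps, because the path must go around both walls. A reward shaped by Euclidean distance would therefore rank these two cells in the wrong order.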


Appendix D Preference Queries for Real-World D'Claw Experiment