Near-Optimal Representation Learning
for Hierarchical Reinforcement Learning
Abstract
We study the problem of representation learning in goal-conditioned hierarchical reinforcement learning. In such hierarchical structures, a higher-level controller solves tasks by iteratively communicating goals which a lower-level policy is trained to reach. Accordingly, the choice of representation (the mapping of observation space to goal space) is crucial. To study this problem, we develop a notion of sub-optimality of a representation, defined in terms of expected reward of the optimal hierarchical policy using this representation. We derive expressions which bound the sub-optimality and show how these expressions can be translated to representation learning objectives which may be optimized in practice. Results on a number of difficult continuous-control tasks show that our approach to representation learning yields qualitatively better representations as well as quantitatively better hierarchical policies, compared to existing methods. (See videos at https://sites.google.com/view/representationhrl)
Ofir Nachum, Shixiang Gu, Honglak Lee & Sergey Levine (also at UC Berkeley)

Google Brain 
{ofirnachum,shanegu,honglak,slevine}@google.com 
1 Introduction
Hierarchical reinforcement learning has long held the promise of extending the successes of existing reinforcement learning (RL) methods (Gu et al., 2017; Schulman et al., 2015; Lillicrap et al., 2015) to more complex, difficult, and temporally extended tasks (Parr & Russell, 1998; Sutton et al., 1999; Barto & Mahadevan, 2003). Recently, goal-conditioned hierarchical designs, in which higher-level policies communicate goals to lower levels and lower-level policies are rewarded for reaching states (i.e. observations) which are close to these desired goals, have emerged as an effective paradigm for hierarchical RL (Nachum et al., 2018; Levy et al., 2017; Vezhnevets et al., 2017). In this hierarchical design, representation learning (the mapping between observation space and goal space) determines the types of subtasks the lower level can be instructed to perform, and is therefore a critical component determining the success or failure of a hierarchical agent.
Previous works have largely studied two ways to choose the representation: learning the representation end-to-end together with the higher- and lower-level policies (Vezhnevets et al., 2017), or using the state space as-is for the goal space (i.e., the goal space is a subspace of the state space) (Nachum et al., 2018; Levy et al., 2017). The former approach is appealing, but in practice often produces poor results (see Nachum et al. (2018) and our own experiments), since the resulting representation is under-defined; i.e., not all possible subtasks are expressible as goals in the space. On the other hand, fixing the representation to be the full state means that no information is lost, but this choice is difficult to scale to higher dimensions. For example, if the state observations are entire images, the higher level must output target images for the lower level, which can be very difficult.
We instead study how unsupervised objectives can be used to train a representation that is more concise than the full state, but also not as under-determined as in the end-to-end approach. In order to do so in a principled manner, we propose a measure of sub-optimality of a given representation. This measure aims to answer the question: How much does using the learned representation in place of the full representation cause us to lose, in terms of expected reward, against the optimal policy? This question is important, because a useful representation will compress the state, hopefully making the learning problem easier. At the same time, the compression might cause the representation to lose information, making the optimal policy impossible to express. It is therefore critical to understand how lossy a learned representation is, not in terms of reconstruction, but in terms of the ability to represent near-optimal policies on top of this representation.
Our main theoretical result shows that, for a particular choice of representation learning objective, we can learn representations for which the return of the hierarchical policy approaches the return of the optimal policy within a bounded error. This suggests that, if the representation is learned with a principled objective, the 'lossiness' in the resulting representation should not cause a decrease in overall task performance. We then formulate a representation learning approach that optimizes this bound. We further extend our result to the case of temporal abstraction, where the higher-level controller only chooses new goals at fixed time intervals. To our knowledge, this is the first result showing that hierarchical goal-setting policies with learned representations and temporal abstraction can achieve bounded sub-optimality against the optimal policy. We further observe that the representation learning objective suggested by our theoretical result closely resembles several other recently proposed objectives based on mutual information (van den Oord et al., 2018; Ishmael Belghazi et al., 2018; Hjelm et al., 2018), suggesting an intriguing connection between mutual information and goal representations for hierarchical RL. Results on a number of difficult continuous-control navigation tasks show that our principled representation learning objective yields good qualitative and quantitative performance compared to existing methods.
2 Framework
Following previous work (Nachum et al., 2018), we consider a two-level hierarchical policy on an MDP $\mathcal{M} = (S, A, R, T)$, in which the higher-level policy modulates the behavior of a lower-level policy by choosing a desired goal state and rewarding the lower-level policy for reaching this state. While prior work has used a subspace of the state space as goals (Nachum et al., 2018), in more general settings, some type of state representation is necessary. That is, consider a state representation function $f: S \to \mathbb{R}^d$. A two-level hierarchical policy on $\mathcal{M}$ is composed of a higher-level policy $\pi_{\mathrm{hi}}(g \mid s)$, where $g \in G = \mathbb{R}^d$ and $G$ is the goal space, that samples a high-level action (or goal) $g_t \sim \pi_{\mathrm{hi}}(g \mid s_t)$ every $c$ steps, for fixed $c$. A non-stationary, goal-conditioned, lower-level policy $\pi_{\mathrm{lo}}(a \mid s_t, g_t, s_{t+k}, k)$ then translates these high-level actions into low-level actions $a_{t+k} \in A$ for $k \in [0, c-1]$. The process is then repeated, beginning with the higher-level policy selecting another goal according to $s_{t+c}$. The policy $\pi_{\mathrm{lo}}$ is trained using a goal-conditioned reward; e.g. the reward of a transition $g, s, s'$ is $-D(f(s'), g)$, where $D$ is a distance function.
In this work we adopt a slightly different interpretation of the lower-level policy and its relation to $f$. Every $c$ steps, the higher-level policy chooses a goal $g_t$ based on a state $s_t$. We interpret this state-goal pair as being mapped to a non-stationary policy $\pi(a \mid s_{t+k}, k)$, $\pi \in \Pi$, where $\Pi$ denotes the set of all possible $c$-step policies acting on $\mathcal{M}$. We use $\Psi$ to denote this mapping from $S \times G$ to $\Pi$. In other words, on every $c$-th step, we encounter some state $s_t \in S$. We use the higher-level policy to sample a goal $g_t \sim \pi_{\mathrm{hi}}(g \mid s_t)$ and translate this to a policy $\pi_t = \Psi(s_t, g_t)$. We then use $\pi_t$ to sample actions $a_{t+k} \sim \pi_t(a \mid s_{t+k}, k)$ for $k \in [0, c-1]$. The process is then repeated from $s_{t+c}$.
Although the difference in this interpretation is subtle, the introduction of $\Psi$ is crucial for our subsequent analysis. The communication of $g_t$ is no longer as a goal which $\pi_{\mathrm{hi}}$ desires to reach, but rather, more precisely, as an identifier to a low-level behavior which $\pi_{\mathrm{hi}}$ desires to induce or activate.
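The control flow of this two-level interpretation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `rollout`, `env_step`, `pi_hi` and `psi`, the deterministic 1-D chain environment, and the hand-written low-level behavior are all assumptions made for the example. Every `c` steps the higher level picks a goal, the state-goal pair is mapped to a low-level policy, and that policy acts for the next `c` steps.

```python
def rollout(env_step, s0, pi_hi, psi, c, horizon):
    """Run a two-level policy that re-selects a goal every c steps."""
    s, trajectory = s0, []
    for t in range(horizon):
        k = t % c
        if k == 0:
            s_t = s              # state at which the goal is chosen
            g = pi_hi(s_t)       # high-level action (goal)
            pi = psi(s_t, g)     # low-level behavior identified by (s_t, g)
        a = pi(s, k)             # non-stationary: may depend on the step index k
        s_next = env_step(s, a)
        trajectory.append((s, g, a, s_next))
        s = s_next
    return trajectory

# Hypothetical toy problem: integer states on a line, actions in {-1, 0, +1}.
def env_step(s, a):
    return s + a

def pi_hi(s):
    return s + 3                 # always ask to end up three units to the right

def psi(s_t, g):
    def pi(s, k):
        return 1 if s < g else (-1 if s > g else 0)
    return pi

traj = rollout(env_step, 0, pi_hi, psi, c=3, horizon=6)
goals = [g for (_, g, _, _) in traj]
final_state = traj[-1][3]
```

In this toy run the goal stays fixed for three steps and is then re-sampled from the state reached at that point.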
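The mapping $\Psi$ is usually expressed as the result of an RL optimization over $\Pi$; e.g.,
$$\Psi(s_t, g) = \arg\max_{\pi \in \Pi} \sum_{k=1}^{c} \gamma^{k-1}\, \mathbb{E}_{P_\pi(s_{t+k} \mid s_t)}\big[-D(f(s_{t+k}), g)\big], \tag{1}$$
where we use $P_\pi(s_{t+k} \mid s_t)$ to denote the probability of being in state $s_{t+k}$ after following $\pi$ for $k$ steps starting from $s_t$. We will consider variations on this low-level objective in later sections. From Equation 1 it is clear how the choice of representation $f$ affects $\Psi$ (albeit indirectly).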
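To make the goal-reaching objective of Equation 1 concrete, the sketch below evaluates the discounted sum of negative distances to the goal for every open-loop action sequence in a small deterministic environment and returns the best one. It assumes a brute-force search in place of RL, and the helper names, the 2-D grid dynamics, the representation that keeps only the first coordinate, and the absolute-difference distance are all illustrative choices, not taken from the paper.

```python
from itertools import product

def discounted_goal_return(visited, g, f, dist, gamma):
    # sum over k = 1..c of gamma^(k-1) * -D(f(s_{t+k}), g)
    return sum(gamma ** i * -dist(f(s), g) for i, s in enumerate(visited))

def psi_brute_force(s_t, g, actions, c, step, f, dist, gamma):
    """Return the c-step open-loop plan maximizing the goal-reaching objective."""
    best_plan, best_val = None, float("-inf")
    for plan in product(actions, repeat=c):
        s, visited = s_t, []
        for a in plan:
            s = step(s, a)
            visited.append(s)
        val = discounted_goal_return(visited, g, f, dist, gamma)
        if val > best_val:
            best_plan, best_val = plan, val
    return best_plan, best_val

# Hypothetical setup: states are (x, y); the representation keeps only x.
step = lambda s, a: (s[0] + a[0], s[1] + a[1])
f = lambda s: s[0]
dist = lambda z, g: abs(z - g)

plan, val = psi_brute_force((0, 0), 2, [(1, 0), (-1, 0), (0, 1)], 2, step, f, dist, 0.9)
```

Because the representation discards `y`, moves along `y` can never reduce the distance to the goal, which illustrates how the choice of representation shapes the behaviors that goals can induce.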
We will restrict the environment reward function $R$ to be defined only on states. We use $R_{\max}$ to denote the maximal absolute reward: $R_{\max} = \sup_{s \in S} |R(s)|$.
3 Hierarchical Policy Sub-Optimality
In the previous section, we introduced two-level policies where a higher-level policy $\pi_{\mathrm{hi}}$ chooses goals $g$, which are translated to lower-level behaviors via $\Psi$. The introduction of this hierarchy leads to a natural question: How much do we lose by learning a $\pi_{\mathrm{hi}}$ which is only able to act on $\mathcal{M}$ via $\Psi$? The choice of $\Psi$ restricts the type and number of lower-level behaviors that the higher-level policy can induce. Thus, the optimal policy on $\mathcal{M}$ is potentially not expressible by $\pi_{\mathrm{hi}}$. Despite the potential lossiness of $\Psi$, can one still learn a hierarchical policy which is near-optimal?
To approach this question, we introduce a notion of sub-optimality with respect to the form of $\Psi$: Let $\pi^*_{\mathrm{hi}}(g \mid s, \Psi)$ be the optimal higher-level policy acting on $G$ and using $\Psi$ as the mapping from $G$ to low-level behaviors. Let $\pi^*_{\mathrm{hier}}$ be the corresponding full hierarchical policy on $\mathcal{M}$. We will compare $\pi^*_{\mathrm{hier}}$ to an optimal hierarchical policy $\pi^*$ agnostic to $\Psi$. To define $\pi^*$ we begin by introducing an optimal higher-level policy $\pi^{**}_{\mathrm{hi}}(\pi \mid s)$ agnostic to $\Psi$; i.e. every $c$ steps, $\pi^{**}_{\mathrm{hi}}$ samples a low-level behavior $\pi \in \Pi$ which is applied to $\mathcal{M}$ for the following $c$ steps. In this way, $\pi^{**}_{\mathrm{hi}}$ may express all possible low-level behaviors. We then denote $\pi^*$ as the full hierarchical policy resulting from $\pi^{**}_{\mathrm{hi}}$.
We would like to compare $\pi^*_{\mathrm{hier}}$ to $\pi^*$, and we do so in terms of state values. Let $V^{\pi}(s)$ be the future value achieved by a policy $\pi$ starting at state $s$. We define the sub-optimality of $\Psi$ as
$$\mathrm{SubOpt}(\Psi) = \sup_{s \in S} V^{\pi^*}(s) - V^{\pi^*_{\mathrm{hier}}}(s). \tag{2}$$
The state values $V^{\pi^*_{\mathrm{hier}}}(s)$ are determined by the form of $\Psi$, which is in turn determined by the choice of representation $f$. However, none of these relationships are direct. It is unclear how a change in $f$ will result in a change to the sub-optimality. In the following section, we derive a series of bounds which establish a more direct relationship between $\mathrm{SubOpt}(\Psi)$ and $f$. Our main result will show that if one defines $\Psi$ as a slight modification of the traditional objective given in Equation 1, then one may translate sub-optimality of $\Psi$ to a practical representation learning objective for $f$.
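The definition of sub-optimality can be illustrated on a tiny problem where both value functions are computable exactly. The sketch below is an assumption-laden toy, not an experiment from the paper: it uses single-step decisions, a deterministic five-state chain, a reward of 1 in the last state, and models the restriction imposed by the goal-to-behavior mapping simply as a smaller set of reachable actions (the `+2` jump cannot be expressed). The value convention `V(s) = R(s) + gamma * max_a V(s')` is also a choice made for the example.

```python
def value_iteration(n_states, actions, reward, gamma, iters=500):
    """Optimal values on a deterministic chain with V(s) = R(s) + gamma * max_a V(s')."""
    def step(s, a):
        return min(max(s + a, 0), n_states - 1)   # clip to the ends of the chain
    v = [0.0] * n_states
    for _ in range(iters):
        v = [reward(s) + gamma * max(v[step(s, a)] for a in actions)
             for s in range(n_states)]
    return v

gamma = 0.9
reward = lambda s: 1.0 if s == 4 else 0.0

v_star = value_iteration(5, [-1, 0, 1, 2], reward, gamma)   # all behaviors available
v_hier = value_iteration(5, [-1, 0, 1], reward, gamma)      # only behaviors goals can induce

# Sub-optimality: worst-case value gap over start states.
subopt = max(a - b for a, b in zip(v_star, v_hier))
```

Here the gap is largest at the state farthest from the reward, where losing the jump action costs two extra steps of discounting.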
4 Good Representations Lead to Bounded Sub-Optimality
In this section, we provide proxy expressions that bound the sub-optimality induced by a specific choice of $\Psi$. Our main result is Claim 4, which connects the sub-optimality of $\Psi$ to both goal-conditioned policy objectives (i.e., the objective in Equation 1) and representation learning (i.e., an objective for the function $f$).
4.1 Single-Step ($c = 1$) and Deterministic Policies
For ease of presentation, we begin by presenting our results in the restricted case of $c = 1$ and deterministic lower-level policies. In this setting, the class of low-level policies $\Pi$ may be taken to be simply $A$, where $a \in \Pi$ corresponds to a policy which always chooses action $a$. There is no temporal abstraction: The higher-level policy chooses a high-level action $g \in G$ at every step, which is translated via $\Psi$ to a low-level action $a \in A$. Our claims are based on quantifying how many of the possible low-level behaviors (i.e., all possible state-to-state transitions) can be produced by $\Psi$ for different choices of $g$. To quantify this, we make use of an auxiliary inverse goal model $\varphi(s, a)$, which aims to predict which goal $g$ will cause $\Psi$ to yield an action $\tilde{a} = \Psi(s, g)$ that induces a next-state distribution $P(s' \mid s, \tilde{a})$ similar to $P(s' \mid s, a)$. (In a deterministic, $c = 1$ setting, $\varphi$ may be seen as a state-conditioned action abstraction mapping $A \to G$.) We have the following theorem, which bounds the sub-optimality in terms of total variation divergences between $P(s' \mid s, a)$ and $P(s' \mid s, \tilde{a})$:
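Theorem 1.
If there exists $\varphi: S \times A \to G$ such that,
$$\sup_{s \in S,\, a \in A} D_{\mathrm{TV}}\big(P(s' \mid s, a) \,\|\, P(s' \mid s, \Psi(s, \varphi(s, a)))\big) \le \epsilon, \tag{3}$$
then $\mathrm{SubOpt}(\Psi) \le C\epsilon$, where $C = \frac{2\gamma}{(1-\gamma)^2} R_{\max}$.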
Theorem 1 allows us to bound the sub-optimality of $\Psi$ in terms of how recoverable the effect of any action in $A$ is, in terms of transition to the next state. One way to ensure that effects of actions in $A$ are recoverable is to have an invertible $\Psi$. That is, if there exists $\varphi: S \times A \to G$ such that $\Psi(s, \varphi(s, a)) = a$ for all $s, a$, then the sub-optimality of $\Psi$ is 0.
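The recoverability condition can be checked numerically when the dynamics are a small table. The sketch below is illustrative only: the one-state, three-action transition table, the two goals, and the hand-written mappings `psi` and `phi` are hypothetical, and the bound uses the constant $C = \frac{2\gamma}{(1-\gamma)^2} R_{\max}$, assumed here to be the one in the statement of Theorem 1.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)

def recoverability_epsilon(P, psi, phi):
    """Largest TV gap between the effect of an action and of its goal-induced stand-in."""
    return max(total_variation(P[(s, a)], P[(s, psi(s, phi(s, a)))]) for (s, a) in P)

def single_step_bound(eps, gamma, r_max):
    return 2.0 * gamma / (1.0 - gamma) ** 2 * r_max * eps

# Hypothetical dynamics: one state 's', three actions, two possible next states.
P = {
    ("s", "a0"): {"x": 0.9, "y": 0.1},
    ("s", "a1"): {"x": 0.8, "y": 0.2},
    ("s", "a2"): {"x": 0.1, "y": 0.9},
}
psi = lambda s, g: {"g0": "a0", "g2": "a2"}[g]               # goals can only induce a0 or a2
phi = lambda s, a: {"a0": "g0", "a1": "g0", "a2": "g2"}[a]   # a1 is approximated by a0

eps = recoverability_epsilon(P, psi, phi)
bound = single_step_bound(eps, gamma=0.5, r_max=1.0)
```

Action `a1` is not exactly reproducible through any goal, but its effect is within total variation 0.1 of `a0`, so the mapping is not invertible yet still yields a finite bound.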
However, in many cases it may not be desirable or feasible to have an invertible $\Psi$. Looking back at Theorem 1, we emphasize that its statement requires only the effect of any action to be recoverable. That is, for any $s \in S, a \in A$, we require only that there exist some $g \in G$ (given by $\varphi(s, a)$) which yields a similar next-state distribution. To this end, we have the following claim, which connects the sub-optimality of $\Psi$ to both representation learning and the form of the low-level objective.
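Claim 2.
Let $\rho(s)$ be a prior and $f, \varphi$ be so that, for $K(s' \mid s, a) \propto \rho(s') \exp(-D(f(s'), \varphi(s, a)))$,
$$\sup_{s \in S,\, a \in A} D_{\mathrm{KL}}\big(P(s' \mid s, a) \,\|\, K(s' \mid s, a)\big) \le \epsilon^2 / 8. \tag{4}$$
(Here $K$ may be interpreted as the conditional $P(\mathrm{state} = s' \mid \mathrm{repr} = \varphi(s, a))$ of the joint distribution $P(\mathrm{state} = s')\, P(\mathrm{repr} = z \mid \mathrm{state} = s') = \frac{1}{Z} \rho(s') \exp(-D(f(s'), z))$ for normalization constant $Z$.)
If the low-level objective is defined as
$$\Psi(s, g) = \arg\max_{a \in A} \mathbb{E}_{P(s' \mid s, a)}\big[-D(f(s'), g) + \log \rho(s') - \log P(s' \mid s, a)\big], \tag{5}$$
then the sub-optimality of $\Psi$ is bounded by $C\epsilon$.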
We provide an intuitive explanation of the statement of Claim 2. First, consider that the distribution $K(s' \mid s, a)$ appearing in Equation 4 may be interpreted as a dynamics model determined by $f$ and $\varphi$. By bounding the difference between the true dynamics $P(s' \mid s, a)$ and the dynamics $K(s' \mid s, a)$ implied by $f$ and $\varphi$, Equation 4 states that the representation $f$ should be chosen in such a way that dynamics in representation space are roughly given by $\varphi(s, a)$. This is essentially a representation learning objective for choosing $f$, and in Section 5 we describe how to optimize it in practice.
Moving on to Equation 5, we note that the form of $\Psi$ here is only slightly different from the one-step form of the standard goal-conditioned objective in Equation 1. Therefore, all together Claim 2 establishes a deep connection between representation learning (Equation 4), goal-conditioned policy learning (Equation 5), and sub-optimality. Specifically, if the low-level RL objective is expressed as in Equation 5, then to minimize the sub-optimality we need only optimize a representation learning objective based on Equation 4.
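For a finite state set, the dynamics model implied by a representation and an inverse goal model can be written out directly. The sketch below assumes that this model is proportional to a prior times `exp(-D(f(s'), z))`, where `z` plays the role of the inverse goal model's output; the three states, their scalar representations, the uniform prior, and the squared-difference distance are hypothetical choices made only to show how the KL term in the representation condition behaves.

```python
import math

def implied_dynamics(rho, reps, z, dist):
    """K(s') proportional to rho(s') * exp(-D(f(s'), z)), normalized over a finite state set."""
    unnorm = {s: rho[s] * math.exp(-dist(reps[s], z)) for s in rho}
    total = sum(unnorm.values())
    return {s: v / total for s, v in unnorm.items()}

def kl_divergence(p, q):
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0.0)

# Hypothetical example: three states with scalar representations f(s).
rho = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
reps = {"a": 0.0, "b": 1.0, "c": 2.0}
sq = lambda u, v: (u - v) ** 2

true_next = {"a": 0.1, "b": 0.8, "c": 0.1}      # true transition mostly lands on 'b'

k_good = implied_dynamics(rho, reps, 1.0, sq)   # inverse goal model points at f('b')
k_bad = implied_dynamics(rho, reps, 2.0, sq)    # inverse goal model points at f('c')

kl_good = kl_divergence(true_next, k_good)
kl_bad = kl_divergence(true_next, k_bad)
```

The KL divergence is smaller when the predicted goal sits near the representation of the states the transition actually reaches, which is the sense in which the condition asks for dynamics in representation space to be predictable.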
4.2 Temporal Abstraction ($c \ge 1$) and General Policies
We now move on to presenting the same results in the fully general, temporally abstracted setting, in which the higher-level policy chooses a high-level action $g \in G$ every $c$ steps, which is transformed via $\Psi$ to a $c$-step lower-level behavior policy $\pi \in \Pi$. In this setting, the auxiliary inverse goal model $\varphi(s, \pi)$ is a mapping from $S \times \Pi$ to $G$ and aims to predict which goal $g$ will cause $\Psi$ to yield a policy $\tilde{\pi} = \Psi(s, g)$ that induces future state distributions $P_{\tilde{\pi}}(s_{t+k} \mid s_t)$ similar to $P_{\pi}(s_{t+k} \mid s_t)$, for $k \in [1, c]$. We weight the divergences between the distributions by weights $w_k = 1$ for $k < c$ and $w_k = (1-\gamma)^{-1}$ for $k = c$. We denote $\bar{w} = \sum_{k=1}^{c} \gamma^{k-1} w_k$. The analogue to Theorem 1 is as follows:
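Theorem 3.
Consider a mapping $\varphi: S \times \Pi \to G$ and define $\epsilon_k: S \times \Pi \to \mathbb{R}$ for $k \in [1, c]$ as,
$$\epsilon_k(s_t, \pi) = D_{\mathrm{TV}}\big(P_{\pi}(s_{t+k} \mid s_t) \,\|\, P_{\Psi(s_t, \varphi(s_t, \pi))}(s_{t+k} \mid s_t)\big). \tag{6}$$
If
$$\sup_{s_t \in S,\, \pi \in \Pi} \frac{1}{\bar{w}} \sum_{k=1}^{c} \gamma^{k-1} w_k\, \epsilon_k(s_t, \pi) \le \epsilon, \tag{7}$$
then $\mathrm{SubOpt}(\Psi) \le C\epsilon$, where $C = \frac{2\gamma}{1-\gamma^c} R_{\max} \bar{w}$.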
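The weights and the constant in the temporally abstracted bound are simple to compute. The sketch below assumes the definitions $w_k = 1$ for $k < c$, $w_k = (1-\gamma)^{-1}$ for $k = c$, $\bar{w} = \sum_{k=1}^{c} \gamma^{k-1} w_k$ and $C = \frac{2\gamma}{1-\gamma^c} R_{\max} \bar{w}$; the function names are hypothetical.

```python
def weights(c, gamma):
    """w_k for k = 1..c: 1 for intermediate steps, 1/(1-gamma) for the final step."""
    return [1.0 if k < c else 1.0 / (1.0 - gamma) for k in range(1, c + 1)]

def w_bar(c, gamma):
    return sum(gamma ** (k - 1) * w for k, w in enumerate(weights(c, gamma), start=1))

def bound_constant(c, gamma, r_max):
    return 2.0 * gamma / (1.0 - gamma ** c) * r_max * w_bar(c, gamma)

c1 = bound_constant(1, 0.9, 1.0)     # single-step case
c10 = bound_constant(10, 0.99, 1.0)  # ten-step temporal abstraction
```

Under these definitions the geometric sum gives $\bar{w} = (1-\gamma)^{-1}$ for every $c$, so with $c = 1$ the constant reduces to the single-step value $\frac{2\gamma}{(1-\gamma)^2} R_{\max}$.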
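For the analogue to Claim 2, we simply replace the single-step KL divergences and low-level rewards with a discounted weighted sum thereof:
Claim 4.
Let $\rho(s)$ be a prior over $S$. Let $f, \varphi$ be such that,
$$\sup_{s_t \in S,\, \pi \in \Pi} \frac{1}{\bar{w}} \sum_{k=1}^{c} \gamma^{k-1} w_k\, D_{\mathrm{KL}}\big(P_{\pi}(s_{t+k} \mid s_t) \,\|\, K(s_{t+k} \mid s_t, \pi)\big) \le \epsilon^2 / 8, \tag{8}$$
where $K(s_{t+k} \mid s_t, \pi) \propto \rho(s_{t+k}) \exp(-D(f(s_{t+k}), \varphi(s_t, \pi)))$.
If the low-level objective is defined as
$$\Psi(s_t, g) = \arg\max_{\pi \in \Pi} \sum_{k=1}^{c} \gamma^{k-1} w_k\, \mathbb{E}_{P_{\pi}(s_{t+k} \mid s_t)}\big[-D(f(s_{t+k}), g) + \log \rho(s_{t+k}) - \log P_{\pi}(s_{t+k} \mid s_t)\big], \tag{9}$$
then the sub-optimality of $\Psi$ is bounded by $C\epsilon$.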
5 Learning
We now have the mathematical foundations necessary to learn representations that are provably good for use in hierarchical RL. We begin by elaborating on how we translate Equation 8 into a practical training objective for $f$ and auxiliary $\varphi$ (as well as a practical parameterization of policies $\pi$ as input to $\varphi$). We then continue to describe how one may train a lower-level policy to match the objective presented in Equation 9. In this way, we may learn $f$ and the lower-level policy to directly optimize a bound on the sub-optimality of $\Psi$. Pseudocode for the full algorithm is presented in the Appendix (see Algorithm 1).
5.1 Learning Good Representations
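Consider a representation function $f_\theta: S \to \mathbb{R}^d$ and an auxiliary function $\varphi_\theta: S \times \Pi \to \mathbb{R}^d$, parameterized by vector $\theta$. In practice, these are separate neural networks: $f_{\theta_1}, \varphi_{\theta_2}$, with $\theta = [\theta_1, \theta_2]$.
While the form of Equation 8 suggests optimizing a supremum over all $s_t$ and $\pi$, in practice we only have access to a replay buffer which stores experience $s_0, a_0, s_1, a_1, \dots$ sampled from our hierarchical behavior policy. Therefore, we propose to choose $s_t$ sampled uniformly from the replay buffer and use the subsequent $c$ actions $a_{t:t+c-1}$ as a representation of the policy $\pi$, where we use $a_{t:t+c-1}$ to denote the sequence $a_t, \dots, a_{t+c-1}$. Note that this is equivalent to setting the set of candidate policies $\Pi$ to $A^c$ (i.e., $\Pi$ is the set of $c$-step, deterministic, open-loop policies). This choice additionally simplifies the possible structure of the function approximator used for $\varphi_\theta$ (a standard neural net which takes in $s_t$ and $a_{t:t+c-1}$). Our proposed representation learning objective is thus,
$$J(\theta) = \mathbb{E}_{s_t,\, a_{t:t+c-1} \sim \mathrm{replay}}\big[J(\theta, s_t, a_{t:t+c-1})\big], \tag{10}$$
where $J(\theta, s_t, a_{t:t+c-1})$ will correspond to the inner part of the supremum in Equation 8.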
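The sampling scheme described above can be sketched as follows, under the assumption that a single episode is stored as aligned lists of states and actions. The function name `sample_window` and the list-based storage are hypothetical, and the sketch ignores episode boundaries and the times at which goals were re-sampled.

```python
import random

def sample_window(states, actions, c, rng=random):
    """Pick t uniformly and return (s_t, a_{t:t+c-1}, s_{t+1:t+c}) from one stored episode."""
    assert len(states) == len(actions) + 1, "expects s_0, a_0, s_1, ..., a_{T-1}, s_T"
    t = rng.randrange(len(actions) - c + 1)
    return states[t], actions[t:t + c], states[t + 1:t + c + 1]

# Hypothetical episode: integer states 0..10, each action moves one unit right.
states = list(range(11))
actions = [1] * 10
s_t, action_seq, future_states = sample_window(states, actions, c=3, rng=random.Random(0))
```

The returned action sequence stands in for the open-loop policy, and the returned future states are the observed samples of where that policy led.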
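We now define the inner objective $J(\theta, s_t, a_{t:t+c-1})$. To simplify notation, we use $E_\theta(s', s, \pi) = \exp(-D(f_\theta(s'), \varphi_\theta(s, \pi)))$ and use $K_\theta(s' \mid s, \pi)$ as the distribution over $S$ such that $K_\theta(s' \mid s, \pi) \propto \rho(s') E_\theta(s', s, \pi)$. Equation 8 suggests the following learning objective on each $s_t, \pi \equiv a_{t:t+c-1}$:
$$J(\theta, s_t, \pi) = \sum_{k=1}^{c} \gamma^{k-1} w_k\, D_{\mathrm{KL}}\big(P_{\pi}(s_{t+k} \mid s_t) \,\|\, K_\theta(s_{t+k} \mid s_t, \pi)\big) \tag{11}$$
$$= B + \sum_{k=1}^{c} -\gamma^{k-1} w_k\, \mathbb{E}_{P_{\pi}(s_{t+k} \mid s_t)}\big[\log K_\theta(s_{t+k} \mid s_t, \pi)\big] \tag{12}$$
$$= B + \sum_{k=1}^{c} -\gamma^{k-1} w_k\, \mathbb{E}_{P_{\pi}(s_{t+k} \mid s_t)}\big[\log E_\theta(s_{t+k}, s_t, \pi)\big] + \gamma^{k-1} w_k \log \mathbb{E}_{\tilde{s} \sim \rho}\big[E_\theta(\tilde{s}, s_t, \pi)\big], \tag{13}$$
where $B$ is a constant with respect to $\theta$ (it absorbs all $\theta$-independent terms, so its value may differ between lines). The gradient with respect to $\theta$ is then,
$$\sum_{k=1}^{c} -\gamma^{k-1} w_k\, \mathbb{E}_{P_{\pi}(s_{t+k} \mid s_t)}\big[\nabla_\theta \log E_\theta(s_{t+k}, s_t, \pi)\big] + \gamma^{k-1} w_k\, \frac{\mathbb{E}_{\tilde{s} \sim \rho}\big[\nabla_\theta E_\theta(\tilde{s}, s_t, \pi)\big]}{\mathbb{E}_{\tilde{s} \sim \rho}\big[E_\theta(\tilde{s}, s_t, \pi)\big]}. \tag{14}$$
The first term of Equation 14 is straightforward to estimate using experienced $s_{t+1:t+c}$. We set $\rho$ to be the replay buffer distribution, so that the numerator of the second term is also straightforward. We approximate the denominator of the second term using a mini-batch $\tilde{S}$ of states independently sampled from the replay buffer:
$$\mathbb{E}_{\tilde{s} \sim \rho}\big[E_\theta(\tilde{s}, s_t, \pi)\big] \approx |\tilde{S}|^{-1} \sum_{\tilde{s} \in \tilde{S}} E_\theta(\tilde{s}, s_t, \pi). \tag{15}$$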
This completes the description of our representation learning algorithm.
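A sample-based estimate of the per-example objective (Equation 13 without its constant, using the mini-batch approximation of Equation 15) might look like the sketch below. It is not the authors' code: it assumes the energy is `exp(-D(.,.))`, takes already-computed representation vectors as plain Python lists, uses a squared Euclidean distance as a stand-in for the distance function, and omits the gradient step that an automatic-differentiation library would supply.

```python
import math

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def representation_loss(f_future, phi, f_negatives, gamma, dist=sq_dist):
    """f_future[k-1] = f(s_{t+k}); phi = inverse goal model output; f_negatives = f of a state mini-batch."""
    c = len(f_future)
    # Mini-batch estimate of the expected energy under the prior.
    mean_energy = sum(math.exp(-dist(z, phi)) for z in f_negatives) / len(f_negatives)
    loss = 0.0
    for k in range(1, c + 1):
        w_k = 1.0 if k < c else 1.0 / (1.0 - gamma)
        coef = gamma ** (k - 1) * w_k
        # -log energy of the observed future state equals the distance itself.
        loss += coef * (dist(f_future[k - 1], phi) + math.log(mean_energy))
    return loss

# Hypothetical 1-D representations: futures near 0, negatives far away.
f_future = [[0.0], [0.0]]
f_negatives = [[3.0], [-3.0]]
loss_good = representation_loss(f_future, [0.0], f_negatives, gamma=0.5)
loss_bad = representation_loss(f_future, [3.0], f_negatives, gamma=0.5)
```

The loss is low when the inverse goal model's output is close to the representations of the states actually reached and far from those of unrelated states, which is the contrastive structure the text describes.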
Connection to Mutual Information Estimators.
The form of the objective we optimize (i.e. Equation 13) is very similar to mutual information estimators, most notably CPC (van den Oord et al., 2018). Indeed, one may interpret our objective as maximizing a mutual information $MI(s_{t+k}; s_t, \pi)$ via an energy function given by $E_\theta(s_{t+k}, s_t, \pi)$. The main differences between our approach and these previous proposals are as follows: (1) Previous approaches maximize a mutual information agnostic to actions or policy. (2) Previous approaches suggest defining the energy function as $\exp(f(s_{t+k})^{\top} M_k f(s_t))$ for some matrix $M_k$, whereas our energy function is based on the distance $D$ used for low-level reward. (3) Our approach is provably good for use in hierarchical RL, and hence our theoretical results may justify some of the good performance observed by others using mutual information estimators for representation learning. Different approaches to translating our theoretical findings to practical implementations may yield objectives more or less similar to CPC, some of which perform better than others (see Appendix D).
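5.2 Learning a Lower-Level Policy
Equation 9 suggests optimizing a policy $\pi_{s_t, g}(a \mid s_{t+k}, k)$ for every $s_t, g$. This is equivalent to the parameterization $\pi_{\mathrm{lo}}(a \mid s_t, g, s_{t+k}, k)$, which is standard in goal-conditioned hierarchical designs. Standard RL algorithms may be employed to maximize the low-level reward implied by Equation 9:
$$-D(f(s_{t+k}), g) + \log \rho(s_{t+k}) - \log P_{\pi}(s_{t+k} \mid s_t), \tag{16}$$
weighted by $w_k$ and where $\pi$ corresponds to $\pi_{\mathrm{lo}}$ when the state $s_t$ and goal $g$ are fixed. While the first term of Equation 16 is straightforward to compute, the log probabilities $\log \rho(s_{t+k})$ and $\log P_{\pi}(s_{t+k} \mid s_t)$ are in general unknown. To approach this issue, we take advantage of the representation learning objective for $f, \varphi$. When $f, \varphi$ are optimized as dictated by Equation 8, we have
$$\log P_{\pi}(s_{t+k} \mid s_t) \approx \log \rho(s_{t+k}) - D(f(s_{t+k}), \varphi(s_t, \pi)) - \log \mathbb{E}_{\tilde{s} \sim \rho}\big[E(\tilde{s}, s_t, \pi)\big]. \tag{17}$$
We may therefore approximate the low-level reward as
$$-D(f(s_{t+k}), g) + D(f(s_{t+k}), \varphi(s_t, \pi)) + \log \mathbb{E}_{\tilde{s} \sim \rho}\big[E(\tilde{s}, s_t, \pi)\big]. \tag{18}$$
As in Section 5.1, we use the sampled actions $a_{t:t+c-1}$ to represent $\pi$ as input to $\varphi$. We approximate the third term of Equation 18 analogously to Equation 15. Note that this is a slight difference from standard low-level rewards, which use only the first term of Equation 18 and are unweighted.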
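The approximate low-level reward can be sketched in a few lines. The function below assumes the three-term form described above: the negative distance to the goal, plus the distance to the inverse goal model's output, plus the log of a mini-batch estimate of the expected energy. The name `low_level_reward`, the list-valued representations, and the squared Euclidean distance are illustrative assumptions, and the per-step weight is left to the caller.

```python
import math

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def low_level_reward(f_next, g, phi, f_negatives, dist=sq_dist):
    """Approximate reward for reaching a state with representation f_next under goal g."""
    log_mean_energy = math.log(
        sum(math.exp(-dist(z, phi)) for z in f_negatives) / len(f_negatives)
    )
    return -dist(f_next, g) + dist(f_next, phi) + log_mean_energy

def standard_reward(f_next, g, dist=sq_dist):
    return -dist(f_next, g)     # the first term only, as in standard goal-conditioned designs

# Hypothetical 1-D values: the inverse goal model's output coincides with the goal.
r_full = low_level_reward([1.0], [2.0], [2.0], [[2.0], [2.0]])
r_std = standard_reward([1.0], [2.0])
```

When the inverse goal model's output equals the goal, the two distance terms cancel and only the log-normalizer remains, whereas the standard reward still penalizes the remaining distance.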
6 Related Work
Representation learning for RL has a rich and diverse existing literature, often interpreted as an abstraction of the original MDP. Previous works have interpreted the hierarchy introduced in hierarchical RL as an MDP abstraction of state, action, and temporal spaces (Sutton et al., 1999; Dietterich, 2000; Bacon et al., 2017). In goal-conditioned hierarchical designs, although the representation is learned on states, it is in fact a form of action abstraction (since goals are high-level actions). While previous successful applications of goal-conditioned hierarchical designs have either learned representations naively end-to-end (Vezhnevets et al., 2017), or not learned them at all (Levy et al., 2017; Nachum et al., 2018), we take a principled approach to representation learning in hierarchical RL, translating a bound on sub-optimality to a practical learning objective.
Bounding sub-optimality in abstracted MDPs has a long history, from early work in theoretical analysis on approximations to dynamic programming models (Whitt, 1978; Bertsekas & Castanon, 1989). Extensive theoretical work on state abstraction, also known as state aggregation or model minimization, has been done in both operations research (Rogers et al., 1991; Van Roy, 2006) and RL (Dean et al., 1997; Ravindran & Barto, 2002; Abel et al., 2017). Notably, Li et al. (2006) introduce a formalism for categorizing classic work on state abstractions such as bisimulation (Dean et al., 1997) and homomorphism (Ravindran & Barto, 2002) based on what information is preserved, which is similar in spirit to our approach. Exact state abstractions (Li et al., 2006) incur no performance loss (Dean et al., 1997; Ravindran & Barto, 2002), while their approximate variants generally have bounded sub-optimality (Bertsekas & Castanon, 1989; Dean et al., 1997; Sorg & Singh, 2009; Abel et al., 2017). While some of the prior work also focuses on learning state abstractions (Li et al., 2006; Sorg & Singh, 2009; Abel et al., 2017), these methods often apply exclusively to simple MDP domains, as they rely on techniques such as state partitioning or Q-value based aggregation, which are difficult to scale to our experimental domains. Thus, the key differentiation of our work from these prior works is that we derive bounds which may be translated to practical representation learning objectives. Our results on difficult continuous-control, high-dimensional domains are a testament to the potential impact of our theoretical findings.
Lastly, we note the similarity of our representation learning algorithm to recently introduced scalable mutual information maximization objectives such as CPC (van den Oord et al., 2018) and MINE (Ishmael Belghazi et al., 2018). This is not a surprise, since maximizing mutual information relates closely to maximum likelihood learning of energy-based models, and our bounds effectively correspond to bounds based on model-based predictive errors, a basic family of bounds in representation learning in MDPs (Sorg & Singh, 2009; Brunskill & Li, 2014; Abel et al., 2017). To our knowledge, no prior work has connected these mutual information estimators to representation learning in hierarchical RL, and ours is the first to formulate theoretical guarantees on sub-optimality of the resulting representations in such a framework.
7 Experiments
[Figure 2 panels: Ant Maze Env; XY; Ours; Ours (Images); VAE; VAE (Images); E2C; E2C (Images)]
We evaluate our proposed representation learning objective compared to a number of baselines:

XY: The oracle baseline which uses the position of the agent as the representation.

VAE: A variational autoencoder (Kingma & Welling, 2013) on raw observations.

E2C: Embed to control (Watter et al., 2015). A method which uses variational objectives to train a representation of states and actions which have locally linear dynamics.

E2E: End-to-end learning of the representation. The representation is fed as input to the higher-level policy and learned using gradients from the RL objective.

Whole obs: The raw observation is used as the representation. No representation learning. This is distinct from Nachum et al. (2018), in which a subset of the observation space was predetermined for use as the goal space.
We evaluate on the following continuous-control MuJoCo (Todorov et al., 2012) tasks (see Appendix C for details):

Ant (or Point) Maze: An ant (or point mass) must navigate a ⊃-shaped corridor.

Ant Push: An ant must push a large block to the side to reach a point behind it.

Ant Fall: An ant must push a large block into a chasm so that it may walk over it to the other side without falling.

Ant Block: An ant must push a small block to various locations in a square room.

Ant Block Maze: An ant must push a small block through a ⊃-shaped corridor.
In these tasks, the raw observation is the agent's coordinates and orientation as well as local coordinates and orientations of its limbs. In the Ant Block and Ant Block Maze environments we also include the coordinates and orientation of the block. We also experiment with more difficult raw representations by replacing the coordinates of the agent with a low-resolution top-down image of the agent and its surroundings. These experiments are labeled 'Images'.
[Figure 3 panels: Point Maze; Ant Maze; Ant Push; Ant Fall; Ant Block; Point Maze (Images); Ant Maze (Images); Ant Push (Images); Ant Fall (Images); Ant Block Maze]
[Figure 4 panels: Ant and block; Ant pushing small block through corridor; Representations]

For the baseline representation learning methods which are agnostic to the RL training (VAE and E2C), we provide comparative qualitative results in Figure 2. These representations are the result of taking a trained policy, fixing it, and using its sampled experience to learn 2D representations of the raw observations. We find that our method can successfully deduce the underlying near-optimal representation, even when the raw observation is given as an image.
We provide quantitative results in Figure 3. In these experiments, the representation is learned concurrently while learning a full hierarchical policy (according to the procedure in Nachum et al. (2018)). Therefore, this setting is especially difficult since the representation learning must learn good representations even when the behavior policy is very far from optimal. Accordingly, we find that most baseline methods completely fail to make any progress. Only our proposed method is able to approach the performance of the XY oracle.
For the 'Block' environments, we were curious what our representation learning objective would learn, since the coordinate of the agent is not the only near-optimal representation. For example, another suitable representation is the coordinates of the small block. To investigate this, we plotted (Figure 4) the trajectory of the learned representations of a successful policy (cyan), along with the representations of the same observations with agent perturbed (green) or with block perturbed (magenta). We find that the learned representations greatly emphasize the block coordinates over the agent coordinates, although in the beginning of the episode, there is a healthy mix of the two.
8 Conclusion
We have presented a principled approach to representation learning in hierarchical RL. Our approach is motivated by the desire to achieve maximum possible return, hence our notion of sub-optimality is in terms of optimal state values. Although this notion of sub-optimality is intractable to optimize directly, we are able to derive a mathematical relationship between it and a specific form of representation learning. Our resulting representation learning objective is practical and achieves impressive results on a suite of high-dimensional, continuous-control tasks.
Acknowledgments
We thank Bo Dai, Luke Metz, and others on the Google Brain team for insightful comments and discussions.
References
 Abel et al. (2017) David Abel, D Ellis Hershkowitz, and Michael L Littman. Near optimal behavior via approximate state abstraction. arXiv preprint arXiv:1701.04113, 2017.
 Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. arXiv preprint arXiv:1705.10528, 2017.
 Bacon et al. (2017) Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pp. 1726–1734, 2017.
 Barto & Mahadevan (2003) Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
 Bertsekas & Castanon (1989) Dimitri P Bertsekas and David Alfred Castanon. Adaptive aggregation methods for infinite horizon dynamic programming. IEEE transactions on Automatic Control, 34(6):589–598, 1989.
 Brunskill & Li (2014) Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In International Conference on Machine Learning, pp. 316–324, 2014.
 Dean et al. (1997) Thomas Dean, Robert Givan, and Sonia Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proceedings of the Thirteenth conference on Uncertainty in artificial intelligence, pp. 124–131. Morgan Kaufmann Publishers Inc., 1997.
 Dietterich (2000) Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
 Gu et al. (2017) Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 3389–3396. IEEE, 2017.
 Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
 Ishmael Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. Mine: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Levy et al. (2017) Andrew Levy, Robert Platt, and Kate Saenko. Hierarchical actor-critic. arXiv preprint arXiv:1712.00948, 2017.
 Li et al. (2006) Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for mdps. In ISAIM, 2006.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Nachum et al. (2018) Ofir Nachum, Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. NIPS, 2018.
 Parr & Russell (1998) Ronald Parr and Stuart J Russell. Reinforcement learning with hierarchies of machines. In Advances in neural information processing systems, pp. 1043–1049, 1998.
 Ravindran & Barto (2002) Balaraman Ravindran and Andrew G Barto. Model minimization in hierarchical reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 196–211. Springer, 2002.
 Rogers et al. (1991) David F Rogers, Robert D Plante, Richard T Wong, and James R Evans. Aggregation and disaggregation techniques and methodology in optimization. Operations Research, 39(4):553–582, 1991.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
 Sorg & Singh (2009) Jonathan Sorg and Satinder Singh. Transfer via soft homomorphisms. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems, Volume 2, pp. 741–748. International Foundation for Autonomous Agents and Multiagent Systems, 2009.
 Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 van den Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 Van Roy (2006) Benjamin Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.
 Vezhnevets et al. (2017) Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
 Watter et al. (2015) Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pp. 2746–2754, 2015.
 Whitt (1978) Ward Whitt. Approximations of dynamic programs, i. Mathematics of Operations Research, 3(3):231–243, 1978.
Appendix A Proof of Theorem 3 (Generalization of Theorem 1)
Consider the suboptimality with respect to a specific state , . Recall that is the hierarchical result of a policy , and note that may be assumed to be deterministic due to the Markovian nature of . We may use the mapping to transform to a high-level policy on and using the mapping :
(19) 
Let be the corresponding hierarchical policy. We will bound the quantity , which will bound . We follow logic similar to Achiam et al. (2017) and begin by bounding the total variation divergence between the discounted state visitation frequencies of the two policies.
Denote the step state transition distributions using either or as,
(20)  
(21) 
for . Considering as linear operators, we may express the state visitation frequencies of , respectively, as
(22)  
(23) 
where is a Dirac distribution centered at and
(24)  
(25) 
We will use to denote the every-$c$-steps discounted state frequencies of ; i.e.,
(26)  
(27) 
By the triangle inequality, we have the following bound on the total variation divergence :
(28) 
We begin by attacking the first term of Equation 28. We note that
(29) 
Thus the first term of Equation 28 is bounded by
(30) 
By expressing as a geometric series and employing the triangle inequality, we have , and we thus bound the whole quantity (30) by
(31) 
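The geometric-series step above can be checked numerically. The sketch below is illustrative only: `P` is a hypothetical column-stochastic matrix standing in for a multi-step transition operator, and `beta` is a hypothetical effective discount in (0, 1); neither name nor the random instance comes from the paper. It expands the inverse of I − beta·P as a geometric series in spirit and confirms that its induced 1-norm does not exceed 1/(1 − beta).

```python
import numpy as np

def neumann_inverse_norm(P, beta):
    """Induced 1-norm (max absolute column sum) of (I - beta * P)^{-1}."""
    n = P.shape[0]
    inv = np.linalg.inv(np.eye(n) - beta * P)
    return np.linalg.norm(inv, ord=1)

rng = np.random.default_rng(0)
P = rng.random((6, 6))
P /= P.sum(axis=0, keepdims=True)  # make every column sum to 1

beta = 0.9                          # assumed effective discount
norm = neumann_inverse_norm(P, beta)
bound = 1.0 / (1.0 - beta)          # sum of the geometric series in beta
```

For a column-stochastic `P` every power of `P` is again column-stochastic, so each column of the inverse sums to the geometric series exactly and the bound is attained.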
We now move to attack the second term of Equation 28. We may express this term as
(32) 
Furthermore, by the triangle inequality we have
(33) 
Therefore, recalling for and for , we may bound the total variation of the state visitation frequencies as
(34)  
(35)  
(36) 
By condition 7 of Theorem 3 we have,
(37) 
We now move to considering the difference in values. We have
(38)  
(39) 
Therefore, we have
(40)  
(41) 
as desired.
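The last step of this argument, passing from a total variation bound on discounted state visitation frequencies to a bound on the difference in values, can be illustrated on a small random Markov chain. This is a sketch of the standard argument under stated assumptions, not the paper's exact constants: the names `visitation`, `random_chain`, and `r_max`, the state-only rewards in [0, r_max], and the two random transition matrices are all hypothetical.

```python
import numpy as np

def visitation(P, mu, gamma):
    """Discounted visitation d = (1 - gamma) * (I - gamma * P)^{-1} mu.

    P is column-stochastic: P[j, i] = Pr(next state = j | current state = i).
    """
    n = P.shape[0]
    return (1.0 - gamma) * np.linalg.solve(np.eye(n) - gamma * P, mu)

def random_chain(rng, n):
    """Random column-stochastic transition matrix (illustrative)."""
    P = rng.random((n, n))
    return P / P.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
n, gamma, r_max = 8, 0.95, 1.0
P1, P2 = random_chain(rng, n), random_chain(rng, n)
mu = np.zeros(n)
mu[0] = 1.0                               # Dirac start distribution
R = rng.uniform(0.0, r_max, size=n)       # state rewards in [0, r_max]

d1, d2 = visitation(P1, mu, gamma), visitation(P2, mu, gamma)
v1 = d1 @ R / (1.0 - gamma)               # value under chain 1
v2 = d2 @ R / (1.0 - gamma)               # value under chain 2
tv = 0.5 * np.abs(d1 - d2).sum()          # total variation divergence
value_gap = abs(v1 - v2)
bound = 2.0 * r_max * tv / (1.0 - gamma)  # Hoelder: |<d1 - d2, R>| <= ||d1 - d2||_1 * r_max
```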
Appendix B Proof of Claim 4 (Generalization of Claim 2)
Consider a specific . Let . Note that the definition of may be expressed in terms of a KL divergence:
(42) 
Therefore we have,
(43) 
By condition 8 we have,
(44) 
Jensen’s inequality applied to the (concave) square-root function then implies
(45) 
Pinsker’s inequality now yields,
(46) 
Similarly, applying Jensen’s and Pinsker’s inequalities to the left-hand side of Equation 43 yields
(47) 
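The two inequalities used in this proof can be checked numerically. The sketch below draws random pairs of discrete distributions (a hypothetical test instance, not data from the paper) and checks Pinsker's inequality, D_TV ≤ sqrt(KL / 2), Jensen's inequality for the concave square root, E[sqrt(X)] ≤ sqrt(E[X]), and the chained bound on the expected total variation that results from combining them.

```python
import numpy as np

def kl(p, q):
    """KL divergence KL(p || q) in nats for strictly positive discrete p, q."""
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):
    """Total variation divergence between discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())

rng = np.random.default_rng(2)
pairs = [(rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5)))
         for _ in range(200)]
kls = np.array([kl(p, q) for p, q in pairs])
tvs = np.array([tv(p, q) for p, q in pairs])

# Pinsker, pair by pair.
pinsker_ok = bool(np.all(tvs <= np.sqrt(0.5 * kls) + 1e-12))
# Jensen on the square root, over the empirical average.
jensen_ok = bool(np.mean(np.sqrt(kls)) <= np.sqrt(np.mean(kls)) + 1e-12)
# Chained: average TV is bounded by sqrt of half the average KL.
chained_ok = bool(np.mean(tvs) <= np.sqrt(0.5 * np.mean(kls)) + 1e-12)
```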
Appendix C Experimental Details
C.1 Environments
The environments for Ant Maze, Ant Push, and Ant Fall are as described in Nachum et al. (2018). During training, target locations are selected randomly from all possible points in the environment (in Ant Fall, the target includes a $z$ coordinate as well). Final results are evaluated on a single difficult target point, the same as that used in Nachum et al. (2018).
The Point Maze is equivalent to the Ant Maze, with size scaled down by a factor of 2 and the agent replaced with a point mass. The point mass is controlled by two-dimensional actions: one action dimension determines a rotation on the pivot of the point mass, and the other determines a push or pull on the point mass in the direction of the pivot.
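As a rough illustration of the two-dimensional action described above, the following sketch steps a point mass kinematically. It is a simplification under loud assumptions: the function `point_step`, the state layout `(x, y, theta)`, the step size `dt`, applying the rotation before the translation, and the absence of momentum, action limits, and collisions are all hypothetical and are not taken from the paper or from the simulator.

```python
import numpy as np

def point_step(state, action, dt=1.0):
    """One kinematic step of a point mass (illustrative, not the simulator).

    state  = (x, y, theta): position and pivot orientation in radians.
    action = (rotation, push): rotation about the pivot, then a push
             (positive) or pull (negative) along the pivot direction.
    """
    x, y, theta = state
    rotation, push = action
    theta = (theta + dt * rotation) % (2.0 * np.pi)
    x = x + dt * push * np.cos(theta)
    y = y + dt * push * np.sin(theta)
    return np.array([x, y, theta])

start = np.array([0.0, 0.0, 0.0])
forward = point_step(start, (np.pi / 2, 1.0))   # turn left, then push
backward = point_step(start, (0.0, -1.0))       # no turn, pull
```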
[Figure panels: Ant (or Point) Maze; Ant Push; Ant Fall; Ant Block; Ant Block Maze; Top-Down View.]
For the ‘Images’ versions of these environments, we zero-out the coordinates in the observation and append a low-resolution top-down view of the environment. The view is centered on the agent and each pixel covers the size of a large block (equal in size to the width of the corridor in Ant Maze). The 3 channels correspond to (1) immovable blocks (walls, gray in the videos), (2) movable blocks (shown in red in videos), and (3) chasms where the agent may fall.
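The construction of the image observation can be sketched as follows. This is a minimal illustration under assumptions: the grid encoding (`-1` for free space), the helper names `top_down_view` and `image_observation`, the view size, the indices of the coordinates that get zeroed, and the choice to leave out-of-map cells empty are all hypothetical; only the agent-centered view and the three channel meanings follow the description above.

```python
import numpy as np

WALL, MOVABLE, CHASM = 0, 1, 2  # channel indices, ordered as in the text

def top_down_view(grid, agent_cell, view_size):
    """Agent-centered view of shape (view_size, view_size, 3).

    grid holds -1 for free space or one of the channel indices above;
    one grid cell corresponds to one pixel. Cells outside the map stay zero.
    """
    half = view_size // 2
    view = np.zeros((view_size, view_size, 3), dtype=np.float32)
    row0, col0 = agent_cell
    for i in range(view_size):
        for j in range(view_size):
            r, c = row0 - half + i, col0 - half + j
            inside = 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]
            if inside and grid[r, c] >= 0:
                view[i, j, grid[r, c]] = 1.0
    return view

def image_observation(obs, view, coord_idx=(0, 1)):
    """Zero-out the coordinate entries and append the flattened view."""
    out = np.array(obs, dtype=np.float32)
    out[list(coord_idx)] = 0.0
    return np.concatenate([out, view.ravel()])

grid = -np.ones((4, 4), dtype=int)
grid[0, 0], grid[1, 2], grid[2, 1] = WALL, MOVABLE, CHASM
view = top_down_view(grid, agent_cell=(1, 1), view_size=3)
obs_img = image_observation(np.arange(5.0) + 1.0, view)
```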
The Ant Block environment puts the ant in a square room next to a small movable block. The agent is rewarded based on negative L2 distance of the block to a desired target location. During training, these target locations are sampled randomly from all possible locations. Evaluation is on a target location diagonally opposite the ant.
The Ant Block Maze environment consists of the same ant and small movable block in a shaped corridor. During training, target locations are sampled randomly from all possible locations. Evaluation is on a target location at the end of the corridor.
C.2 Training Details
We follow the basic training details used in Nachum et al. (2018). Some differences are listed below:

We input the whole observation to the lower-level policy (Nachum et al. (2018) zero-out the coordinates for the lower-level policy).

We use a Huber function for , the distance function used to compute the low-level reward.

We use a goal dimension of size . We train the higher-level policy to output actions in . These actions correspond to desired deltas in state representation.

We use a Gaussian with standard deviation for high-level exploration.

Additional differences in low-level training (e.g. reward weights and discounting) are implemented according to Section 5.
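Several of the items above can be sketched together: a Huber distance, a goal interpreted as a desired delta in representation space, and Gaussian noise for high-level exploration. This is an illustration under assumptions rather than the training code: the function names, the Huber threshold `delta`, summing the Huber terms over dimensions, clipping the noisy goal to a box, and all numeric values in the example are hypothetical, and the additional reward terms of Section 5 are omitted.

```python
import numpy as np

def huber(x, delta=1.0):
    """Elementwise Huber function: quadratic near zero, linear in the tails."""
    abs_x = np.abs(x)
    return np.where(abs_x <= delta,
                    0.5 * x ** 2,
                    delta * (abs_x - 0.5 * delta))

def low_level_reward(rep_next, rep_start, goal_delta, delta=1.0):
    """Negative Huber distance to the target rep_start + goal_delta."""
    target = rep_start + goal_delta
    return -float(np.sum(huber(rep_next - target, delta)))

def explore(goal_delta, std, low, high, rng):
    """Add Gaussian exploration noise to a goal and clip it to [low, high]."""
    noisy = goal_delta + rng.normal(0.0, std, size=np.shape(goal_delta))
    return np.clip(noisy, low, high)

rng = np.random.default_rng(3)
rep_start = np.array([0.0, 0.0])    # representation when the goal is issued
goal = np.array([1.0, -2.0])        # desired delta in representation space
noisy_goal = explore(goal, std=0.5, low=-3.0, high=3.0, rng=rng)
reward_hit = low_level_reward(rep_start + goal, rep_start, goal)
reward_miss = low_level_reward(rep_start, rep_start, goal)
```

Reaching the target exactly gives the maximal reward of zero, and any miss is penalized quadratically when small and linearly when large.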
We parameterize with a feed-forward neural network with two hidden layers using ReLU activations. The hidden dimensions were 400 and 300, respectively. The network structure for is identical, and we use . These networks are trained with the Adam optimizer using learning rate .
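A network of the shape described above can be sketched in plain NumPy. Only the two hidden layers of sizes 400 and 300 with ReLU activations follow the text; the helper names, the He-style initialization, and the input and output dimensions in the example are hypothetical, and the Adam training loop is omitted.

```python
import numpy as np

def init_mlp(rng, in_dim, out_dim, hidden=(400, 300)):
    """Weights and biases for a feed-forward network (He-style init, assumed)."""
    sizes = (in_dim, *hidden, out_dim)
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """ReLU on the hidden layers, linear output layer."""
    h = np.asarray(x, dtype=np.float64)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(4)
params = init_mlp(rng, in_dim=30, out_dim=2)   # dimensions are placeholders
batch = rng.normal(size=(4, 30))               # four hypothetical observations
reps = mlp_forward(params, batch)
```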
Appendix D Objective Function Evaluation
[Figure panels: Point Maze; Ant Maze; Ant Push; Ant Fall; Ant Block; Point Maze (Images); Ant Maze (Images); Ant Push (Images); Ant Fall (Images); Ant Block Maze.]