Successor Options: An Option Discovery
Framework for Reinforcement Learning
Abstract
The options framework in reinforcement learning models the notion of a skill or a temporally extended sequence of actions. The discovery of a reusable set of skills has typically entailed building options that navigate to bottleneck states. This work adopts a complementary approach, where we attempt to discover options that navigate to landmark states. These states are prototypical representatives of well-connected regions and can hence access the associated region with relative ease. In this work, we propose Successor Options, which leverages Successor Representations to build a model of the state space. The intra-option policies are learnt using a novel pseudo-reward, and the model scales to high-dimensional spaces easily. Additionally, we also propose an Incremental Successor Options model that iterates between constructing Successor Representations and building options, which is useful when robust Successor Representations cannot be built solely from primitive actions. We demonstrate the efficacy of our approach on a collection of gridworlds, and on the high-dimensional robotic control environment of Fetch.
Manan Tomar , Rahul Ramesh and Balaraman Ravindran
Indian Institute of Technology Madras
rahul13ramesh@gmail.com, manan.tomar@gmail.com, ravi@cse.iitm.ac.in
1 Introduction
Reinforcement Learning (RL) [?] has garnered significant attention recently due to its success in challenging high-dimensional tasks [?; ?; ?]. Deep Learning has had a major role in the achievements of RL by enabling generalization across a large number of states using powerful function approximators. Deep learning must, however, be complemented by efficient exploration in order to discover solutions with reasonable sample complexities. Hierarchical Reinforcement Learning (HRL) is one potential strategy that mitigates the curse of dimensionality by operating on abstract state and action spaces. Recent work [?; ?; ?] has attempted to use a hierarchy of controllers, operating at different timescales, in order to search large state spaces rapidly.
The options framework [?] is an example of a hierarchical approach that models temporally extended actions or skills. Discovering "good" options can potentially allow for exploring the state space efficiently and transferring to various similar tasks. However, the discovery of reusable options is a meticulous task and has not been effectively addressed. While there are a number of approaches to this problem, a large fraction of the literature revolves around discovering options that navigate to bottleneck states [?; ?; ?; ?]. This work adopts a paradigm that fundamentally differs from the idea of identifying bottleneck states as subgoals for options. Instead, we attempt to discover landmark or prototypical states of well-connected regions. We empirically validate that navigating to landmark states leaves the agent well situated to subsequently navigate the associated well-connected region. Building on this intuition, we propose Successor Options, a subgoal discovery and intra-option policy learning framework.
Our method does not construct a graph of the state space explicitly but instead leverages Successor Representations (SR) [?] to learn the subgoals and the associated intra-option policies. The SR inherently captures the temporal structure between states and thus forms a reasonable proxy for the actual graph. Moreover, Successor Representations have been extended to work with function approximators [?; ?], allowing us to implicitly form the graphical structure of any high-dimensional state space with neural networks.
SRs attempt to assign similar representations to states with similar future states. Formally, the SR of a state $s$ is a vector representing the expected discounted visitation counts of all states in the future, starting from state $s$. The SR varies with the policy, since the expected visitation counts depend on the policy being executed. Since nearby states are expected to have similar successors, their Successor Representations are expected to be similar in nature. Hence, states in a well-connected region of the state space will have similar SRs (for example, states in a single room in a gridworld will have very similar SRs). Building on this intuition, one would like to identify a set of subgoals whose corresponding SRs are dissimilar to one another.
Successor Options proceeds as follows. The first step involves constructing the SRs of all states. The subgoals are then identified by clustering a large sample (or all) of the SR vectors and assigning the cluster centers as the various subgoals. The cluster centers translate to a set of subgoals that have vastly different successor states, meaning different subgoals provide access to different regions of the state space. Once the subgoals are identified, a novel pseudo-reward is used to build options that navigate to each of these subgoals. This process relies solely on primitive actions to navigate the state space when estimating the SRs. However, in large state spaces, full exploration through primitive actions might not be possible. To mitigate this, we propose the Incremental Successor Options algorithm. This method works in an iterative fashion, where each iteration involves an option discovery step and an SR learning step.
Besides the improved accessibility to any given state in the state space, Successor Options offers a number of other advantages over existing option discovery methods. While an intermediate clustering step segments the algorithm into distinct stages (introducing non-differentiability), the step is critical in many aspects. Firstly, the number of options is specified beforehand, which allows the model to adapt by finding the most suited subgoals. Hence, the algorithm does not require pruning redundant options from a very large set, unlike other works [?; ?; ?]. Furthermore, the discovered options are reward-agnostic and are hence transferable across multiple tasks. The principal contributions of this paper are as follows: (i) an automatic option discovery mechanism revolving around identifying landmark states, (ii) a novel pseudo-reward for learning the intra-option policies that extends to function approximators, and (iii) an incremental approach that alternates between exploration and option construction to navigate the state space in tasks with a fixed-horizon setup, where primitive actions fail to explore fully.
2 Preliminaries
Reinforcement Learning deals with sequential decision-making tasks and considers the interaction of an agent with an environment. It is traditionally modeled by a Markov Decision Process (MDP) [?], defined by the tuple $\langle S, A, P, \rho_0, R, \gamma \rangle$, where $S$ defines the set of states, $A$ the set of actions, $P$ the transition function, $\rho_0$ the probability distribution over initial states, $R$ the reward function and $\gamma$ the discount factor. In the context of optimal control, the objective is to learn a policy $\pi$ that maximizes the discounted return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$, where $R$ is the reward function.
Q-learning: Q-learning [?] attempts to estimate the optimal action-value function $Q^*(s, a)$. It exploits the Bellman optimality equation, whose repeated application leads to convergence to $Q^*$. The optimal action-value function can be used to behave optimally by selecting, in every state $s$, the action

$$a^* = \arg\max_{a} Q^*(s, a) \qquad (1)$$
[?] introduce Deep Q-learning, which extends Q-learning to high-dimensional spaces by using a neural network to model $Q(s, a)$.
Options and Semi-Markov Decision Processes: Options [?] provide a framework to model temporally extended actions. Formally, an option $o$ is defined by the 3-tuple $\langle \mathcal{I}, \beta, \pi_o \rangle$, where $\mathcal{I} \subseteq S$ is the initiation set, $\beta(s)$ the termination probability at each state and $\pi_o$ the intra-option policy. This work assumes that the intra-option policies satisfy the Markov assumption.
Successor Representation: The Successor Representation (SR) [?] represents a state in terms of its successors. The SR for a state $s$ is defined as a vector of size $|S|$, with the $s'$-th index equal to the discounted future occupancy of state $s'$ given that the agent starts from $s$. Since the SR captures the visitation of successor states, it depends directly on the policy $\pi$ and the transition dynamics $P$. More concretely, the SR can be written as follows:

$$\psi_\pi(s, s') = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, \mathbb{1}(s_t = s') \,\Big|\, s_0 = s \right] \qquad (2)$$
where $\mathbb{1}(\cdot)$ is the indicator function: 1 if its argument is true, else 0. The SR can be learnt in a temporal-difference (TD) fashion by writing it in terms of the SR of the next state:

$$\hat{\psi}(s) \leftarrow \hat{\psi}(s) + \alpha \left[\mathbb{1}_s + \gamma \, \hat{\psi}(s') - \hat{\psi}(s)\right] \qquad (3)$$
Equation 3 applies to sampled transitions $(s, s')$, where $\hat{\psi}$ is the estimate of the SR being learnt, $\alpha$ the learning rate and $\mathbb{1}_s$ a one-hot vector with all zeros except a 1 at the $s$-th position. Successor Representations can be naturally extended to the deep setting [?; ?] as follows (note that $\phi(s)$ is a $k$-dimensional feature representation of $s$, and $\theta$ is the set of parameters):

$$\psi_\theta(s) = \phi(s) + \gamma \, \mathbb{E}\left[\psi_\theta(s')\right] \qquad (4)$$
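As a concrete illustration, the tabular TD update of Equation 3 can be sketched in a few lines of NumPy (the function name and the hyperparameter values are ours, for illustration only):

```python
import numpy as np

def td_update_sr(psi, s, s_next, alpha=0.1, gamma=0.95):
    """One TD update of the tabular Successor Representation (Equation 3).

    psi is a (num_states, num_states) matrix whose row psi[s] is the SR
    of state s; (s, s_next) is a sampled transition under the current policy.
    """
    one_hot = np.zeros(psi.shape[1])
    one_hot[s] = 1.0  # the indicator vector with a 1 at position s
    psi[s] += alpha * (one_hot + gamma * psi[s_next] - psi[s])
    return psi
```

Repeatedly applying this update along trajectories of the behaviour policy converges to the SR of that policy.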
3 Proposed Method
Successor Options (SR-Options) adopts an approach that discovers options navigating to states that are representative of well-connected regions. The method holds a number of advantages, which include: (i) a robust subgoal identification step that uses clustering to obtain a set of options, with no two options being identical; (ii) learning useful options without an extrinsic reward, through latent learning; and (iii) an incremental approach that works in scenarios where primitive actions are unable to facilitate the option discovery process.
3.1 Successor Options
Subgoal discovery: In learning Successor Options, the first step involves learning the SR. The policy $\mu$ used to learn the SR determines a prior over the state space; as a result, the discovered subgoals will lie in those states that are more likely to be visited under $\mu$. Since we do not have any such preference in our experiments, we use the uniform random policy for $\mu$ throughout this work.
This is followed by clustering the states based on the learnt SR (we utilize K-means++ [?] for this purpose). Since the SR efficiently captures temporally close states, the generated clusters are spread across the state space, with each cluster assigned a set of densely connected states. We wish to learn options that navigate to the cluster centers, which act as landmark states. Since a cluster center may not correspond to the SR of any actual state, we select as subgoal the state whose SR has the largest cosine similarity with the cluster center.
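The subgoal selection step can be sketched as follows (a minimal NumPy sketch; the cluster centers would come from a k-means++ routine, and the function name is ours):

```python
import numpy as np

def select_subgoals(sr_matrix, centers):
    """For each cluster center, pick the state whose SR has the largest
    cosine similarity with that center.

    sr_matrix: (num_states, num_states), row s is the SR of state s.
    centers:   (k, num_states) cluster centers from k-means++.
    Returns an array of k subgoal state indices.
    """
    eps = 1e-12  # guard against zero rows (never-visited states)
    sr_unit = sr_matrix / (np.linalg.norm(sr_matrix, axis=1, keepdims=True) + eps)
    c_unit = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + eps)
    sims = c_unit @ sr_unit.T        # (k, num_states) cosine similarities
    return np.argmax(sims, axis=1)   # one landmark state per cluster
```

In practice `centers` can be obtained by running, e.g., scikit-learn's `KMeans` (which defaults to k-means++ initialization) over the SR vectors.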
Latent Learning: The pseudo-reward defined in Equation 5 is used to learn the intra-option policies:

$$r(s, a, s') = \psi_g(s') - \psi_g(s) \qquad (5)$$

where $\psi_g$ is the SR vector of the subgoal $g$.
For Equation 5, an agent transitions from state $s$ to state $s'$ under action $a$. $\psi_g(s)$ is the $s$-th component of the SR vector of subgoal $g$ (the state closest to the cluster center). $\psi_g(s)$ can also be understood as the discounted visitation count of state $s$, starting from state $g$. Hence, the reward is proportional to the change in the discounted visitation counts of the states involved in the transition. Why this reward? The reward drives the agent to states with the highest values of $\psi_g$, meaning the agent is led to states that have the highest visitation count when starting from $g$. The pseudo-reward hence drives the agent to landmark states, and learning the option can be understood as a hill-climbing task on the SR (see Figure 2). The option policy terminates when the agent reaches the state with the highest value of $\psi_g$, which occurs when the value function of the option policy becomes non-positive (this condition is used to decide option termination).
Hence, every subgoal has a corresponding pseudo-reward that navigates the agent to that subgoal before terminating. Furthermore, this reward is not handcrafted and is dense in nature (see Figure 2), which leads to faster learning. Note that an approximately developed SR is often a sufficient signal for learning the optimal policy of the option. Formally, the initiation set of each option is the set of all states $S$, the option deterministically terminates ($\beta(s) = 1$) at states where the option's value function is non-positive (all other states have $\beta(s) = 0$), and the option policy is dictated by the reward function in Equation 5.
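The pseudo-reward of Equation 5 is a one-liner in code (function name ours). A useful property: the rewards telescope along a trajectory, so the undiscounted return of any path from $s$ to the subgoal $g$ is $\psi_g(g) - \psi_g(s)$, regardless of the path taken:

```python
import numpy as np

def pseudo_reward(psi_g, s, s_next):
    """Equation 5: the change, across a transition, in the discounted
    visitation count stored in the subgoal's SR vector psi_g."""
    return psi_g[s_next] - psi_g[s]
```

This telescoping is what makes the dense reward consistent with the sparse goal of reaching the landmark state.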
Solving Tasks: The learnt options can be used under an SMDP framework to solve tasks that differ in their reward structure. One can use SMDP Q-learning with intra-option value function updates [?] for faster learning, since the learnt options are Markovian in nature.
3.2 Incremental Successor Options
SR-Options relies on primitive actions to build the SR of all states. Finite-horizon environments are good examples of scenarios where the SR cannot be learnt with a uniform random policy, which consequently leads to poor options. Hence, we propose an incremental approach in which we discover intermediate options that facilitate the SR learning process. Such an approach is critical in long-horizon tasks, where exploration can be aided using reward-agnostic options.
The algorithm (see Algorithm 1) starts by building the SR from primitive actions. This is followed by an option discovery step from the current SR matrix. In the next iteration, the options and actions are used in tandem to construct a more robust SR. Since we are interested in the SR of the uniform random policy, the SR is not updated when executing an option, but only when executing primitive actions (the update refers to Equation 3). The constructed intermediate options can be used in any manner, but one would ideally want to sample actions more frequently than options: since options navigate to specific subgoals, sampling them frequently would restrict the agent to certain states. After the SR is rebuilt, a new set of options is formed with this SR and the old set is discarded. The newly formed options are used in the next iteration and the process is repeated. Finally, when the SRs are sufficiently built, one can use SR-Options with the final SR matrix obtained from the incremental exploration procedure.
How are the options obtained in the incremental setup? Ideally, one would like to discover options that drive the agent towards unvisited parts of the state space. While the visitation count would be one such ideal metric, we use the L1-norm of the SR vector as a proxy for it [?]. Hence, only states with low-L1-norm SRs participate in the clustering. As shown in Algorithm 1, the clustering stage uses a set of candidate subgoals, which are a fraction of the set of reached states. Formally, a state $s$ is a candidate subgoal if $\eta_{low} \leq \lVert \psi(s) \rVert_1 \leq \eta_{high}$, where $\eta_{low}$ and $\eta_{high}$ are hyperparameters that decide the range of L1-norms of the selected states. Such a condition ensures that all candidate subgoal states have an SR that is neither fully developed nor extremely sparse and underdeveloped, thus providing a pseudo-reward that is easy to learn over.
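The candidate-subgoal filter can be sketched as below (a NumPy sketch; the default percentile thresholds follow the values reported in the experiments, and the function name is ours):

```python
import numpy as np

def candidate_subgoals(psi, low_pct=5, high_pct=40):
    """Keep visited states whose SR L1-norm falls between two percentiles
    of the visited-state norms: such SRs are neither fully developed
    nor extremely sparse or underdeveloped."""
    norms = np.abs(psi).sum(axis=1)   # L1-norm of each state's SR
    visited = norms > 0               # never-visited states are excluded
    lo = np.percentile(norms[visited], low_pct)
    hi = np.percentile(norms[visited], high_pct)
    return np.where(visited & (norms >= lo) & (norms <= hi))[0]
```

The returned state indices are then clustered to produce the next round of intermediate subgoals.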
3.3 Deep Successor Options
Deep Successor Options (Figure 3) extends SR-Options to the function approximation setting. [?; ?] propose Successor Features (SF), a model for generating the SR using neural networks. Deep Successor Options extends to continuous action spaces by learning through three branches: the reward prediction error (branch 1), the TD error for learning Successor Features (branch 2) and the option policy heads (branch 3). The first two branches usually share the same base representation $\phi(s)$. However, reward prediction is required only when one is interested in computing the Q-values. Since the Q-values (of the policy $\mu$) need not be estimated, we do not include the reward layer in our architecture for learning the SF. Unlike other works [?; ?], Deep Successor Options does not explicitly construct the graph, and the formulation hence works naturally with neural networks.
Once the SF is trained, a sample of the SF vectors is collected. As in the tabular case, the obtained vectors are clustered to produce SF cluster centers, which represent the various subgoals. The intra-option policies are learnt using the reward function presented in Equation 6, in which $c$ is the SF cluster centroid and $\phi(s)$ is the intermediate feature representation. This formulation degenerates to the tabular setup when $\phi(s)$ is a one-hot vector. The reward function is based on an identical intuition, where the options learn to navigate to landmark states (states with the highest value of $c^\top \phi(s)$). As shown in Figure 3, the options can be learnt using a separate head for each option (branch 3). Branch 2 remains frozen during the intra-option learning process, since it is responsible for determining the reward function of every option.
$$r(s, a, s') = c^\top \left(\phi(s') - \phi(s)\right) \qquad (6)$$
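In code, Equation 6 is again a one-liner (a sketch; the function name is ours), and with one-hot features it is easy to check that it reduces to the tabular pseudo-reward of Equation 5:

```python
import numpy as np

def sf_pseudo_reward(center, phi_s, phi_s_next):
    """Equation 6: change, across a transition, of the feature
    representation projected onto the SF cluster centroid."""
    return center @ (phi_s_next - phi_s)
```
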
4 Experiments
This section analyzes the answers to the following questions:
- How different are the subgoals discovered through SR-Options from the ones discovered through other techniques such as Eigenoptions? (Section 4.2)
- Why do we need a different exploration strategy when options are used? (Section 4.3)
- How do SR-Options fare empirically against other methods and baselines? (Section 4.4)
- How do SR-Options fare against Incremental SR-Options, in terms of discovered subgoals in a finite-horizon setting? (Section 4.5)
- And finally, how do SR-Options scale to handle continuous state and action spaces? (Section 4.6)
4.1 Tasks
We consider 4 gridworld tasks, a finite-horizon task, and the FetchReach environment [?]. There are 4 different gridworlds (see Figure 4) with varying complexities. Each of them has 5 actions: Noop, Left, Right, Up and Down. All transitions are completely deterministic. For the incremental setting, we consider grid4 (from Figure 4) and limit the horizon to 100 steps.
For the first setup, we consider 500 random start and end states and evaluate on the same. The reward is +10 for reaching the goal and 0 otherwise, with discount factor $\gamma$. For the second setup (finite horizon), we fix the start state to be the bottom-leftmost state and the goal to be the top-rightmost. The action space, reward, discount factor and transition function are identical to those of the first setup. For the FetchReach environment, we use the full state and action spaces of the task.
4.2 Discovered subgoals
This section demonstrates the qualitative difference between SR-Options and Eigenoptions [?] through Figure 4(a). The subgoals are visibly more diverse and spread out in the case of SR-Options. Furthermore, the discovered subgoals are landmark states, situated in the middle of well-connected regions.
4.3 Understanding Exploration with Options
An SMDP optimal control framework typically uses options and actions together and explores using a uniform random policy over options and actions. However, options lead the agent to specific subgoals, unlike actions. As a result (as seen in Figure 4(b)), the agent spends a majority of its time near these subgoals. Hence, we propose two schemes for exploration when options are used: the Non-Uniform (NU) scheme and the Adaptive-Exploration (AE) scheme. In the NU scheme, options and actions are sampled in the ratio $1 : \nu$, for a hyperparameter $\nu$. Hence, the agent navigates to a subgoal, following which a sequence of actions is used to explore the neighbourhood of that subgoal. However, different neighbourhoods have different sizes. Since SR-Options uses a clustering step, the size of the cluster can be used to change the ratio at which options and actions are sampled. Hence, we propose the AE scheme where, after picking option $o$, options and actions are sampled in the ratio $1 : \nu_o$, with $\nu_o$ dependent on the size of the cluster associated with $o$. Hence, the sampling ratio changes every time an option is picked, and the agent uses the most recently picked option to determine this ratio.
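The NU scheme can be sketched as follows (names and the exact probability parameterization are our own; the AE variant would differ only in making the ratio parameter depend on the cluster size of the most recently picked option):

```python
import numpy as np

def sample_nu(rng, num_options, num_actions, nu=15):
    """Non-Uniform scheme: options and primitive actions are sampled in
    the ratio 1:nu, i.e. an option is picked with probability 1/(1+nu),
    so roughly nu primitive actions follow each option execution."""
    if rng.random() < 1.0 / (1.0 + nu):
        return ("option", int(rng.integers(num_options)))
    return ("action", int(rng.integers(num_actions)))
```

A usage sketch: `sample_nu(np.random.default_rng(0), num_options=4, num_actions=5)` returns a `("option", i)` or `("action", j)` pair to execute next.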
4.4 Evaluating Successor Options
This section highlights the quantitative differences between Eigenoptions, Successor Options and Q-learning. We have 6 different methods, namely Q-learning, SR-Options, SR-Options with the NU scheme (SR-NU), SR-Options with the AE scheme (SR-AE), Eigenoptions, and Eigenoptions with the NU scheme (Eigen-NU). We evaluate on the first setup mentioned in Section 4.1, for 5 different seeds. Each seed involves evaluating on 500 different random start and end states. The grid1 and grid2 tasks are evaluated 100 times over 50,000 steps, and grid3 and grid4 are evaluated 100 times over 500,000 steps. The number of options in grid1, grid2, grid3 and grid4 is (4, 5, 10, 10) respectively, but we have observed the performance to be robust to this hyperparameter. The value of the option-to-action sampling ratio for the NU and AE schemes is (15, 15, 50, 50) respectively; we have observed that this parameter can be tuned further. The plots are presented in Figure 4(c), and SR-AE has the best training curve in all environments (with respect to area under the curve and performance at t = 0).
4.5 Incremental Successor Options
Incremental Successor Options is run using the second setup described in Section 4.1, where the horizon is limited to 100 steps. Figure 7 shows the nature of the discovered subgoals when SR-Options and Incremental SR-Options are used. Both algorithms are run for the same number of steps (including the time spent learning intermediate options). The lower threshold on the SR L1-norm is set to the 5th-percentile value and the upper threshold to the 40th-percentile value. Figure 6 plots the L1-norm of the SR vectors for the first 4 iterations of training. We observe a clear increase in the explored state space within the given horizon, while discovering subgoals that are well spread out.
4.6 Understanding Deep Successor Options
We use the FetchReach robotic control environment to examine the efficacy of Deep Successor Options. Figure 8 demonstrates that clustering over the Successor Representations naturally segregates the state space based on the 3-dimensional coordinates. Moreover, we learn the corresponding option policies (5 in total) using the intrinsic reward described in Equation 6, trained with the Proximal Policy Optimization (PPO) algorithm [?]. The option policies are observed to be diverse and have been visualized in a video: https://www.dropbox.com/s/9284c190vlkimym/sroptions.mp4?dl=0
5 Related Work
[?] introduce airport hierarchies, which designate certain states as airports or landmarks, with various levels defined on the basis of seniority. A state is assigned to be a landmark only if it is reachable from a threshold number of states. The airport analogy is similar to the spread of clusters obtained from SR-Options, since each airport also represents a group of similar states.
[?] describe a diverse-density-based solution that casts this problem as a multiple-instance learning task. The discovered solutions are bottlenecks, since they are present in a larger fraction of positive bags. [?] describe a betweenness-centrality-based approach that also naturally leads to bottleneck-based options. Subgoals based on relative novelty [?] identify states that could subsequently lead to vastly different states, a notion closely tied to bottleneck states. Graph partitioning methods have also been employed to find options [?; ?; ?]. These methods design options that transition from one well-connected region to another; since the subgoals are the boundaries between two well-connected regions, these methods also typically identify bottlenecks as subgoals.
Option-Critic [?] is an end-to-end differentiable model that learns options on a single task. However, this method is forced to specialize to a single task, and the learnt options are not easily transferable. Eigenoptions [?] use the eigenvectors of the Laplacian as rewards to learn intra-option policies. This method, however, lacks variety in subgoals, since ascending the different eigenvectors often corresponds to reaching the same subgoal. The clustering step provides flexibility regarding the number of options required, which is absent in the case of Eigenoptions. More recent work [?] attempts to use Successor Representations to obtain the eigenvectors of the Laplacian. However, the obtained options are identical to the options obtained from Eigenoptions (for reversible environments and under the uniform random policy), and our work hence differs significantly from it: Successor Options clusters the SR vectors, while [?] diagonalize the SR matrix to recover the eigenvectors of the graph Laplacian.
6 Conclusion
Successor Options is an option discovery framework that leverages Successor Representations to build options. Deep SR-Options is formulated to work in the function approximation setting, and the Incremental SR-Options model addresses the finite-horizon setting, where SRs cannot be constructed solely from primitive actions.
As future work, we aim to use Deep Successor Options to achieve optimal control on highdimensional sparse reward tasks. We believe that this is out of scope for this current work since highdimensional spaces require a more robust termination condition and a reliable Successor Features network. This work assumes that the initiation set is the set of all states, which may not be an optimal choice. Another avenue for experimentation is to learn options, using a mixture of the pseudoreward and an extrinsic reward.
References
 [Arthur and Vassilvitskii, 2007] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
 [Bacon et al., 2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. 2017.
 [Barreto et al., 2017] André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pages 4055–4065, 2017.
 [Dayan, 1993] Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
 [Kulkarni et al., 2016a] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
 [Kulkarni et al., 2016b] Tejas D Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.
 [Lakshminarayanan et al., 2016] Aravind S Lakshminarayanan, Ramnandan Krishnamurthy, Peeyush Kumar, and Balaraman Ravindran. Option discovery in hierarchical reinforcement learning using spatiotemporal clustering. arXiv preprint arXiv:1605.05359, 2016.
 [Lillicrap et al., 2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [Machado et al., 2017a] Marlos C Machado, Marc G Bellemare, and Michael Bowling. A laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956, 2017.
 [Machado et al., 2017b] Marlos C Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. Eigenoption discovery through the deep successor representation. arXiv preprint arXiv:1710.11089, 2017.
 [Machado et al., 2018] Marlos C Machado, Marc G Bellemare, and Michael Bowling. Countbased exploration with the successor representation. arXiv preprint arXiv:1807.11622, 2018.
 [McGovern and Barto, 2001] Amy McGovern and Andrew G Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. 2001.
 [Menache et al., 2002] Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut—dynamic discovery of subgoals in reinforcement learning. In European Conference on Machine Learning, pages 295–306. Springer, 2002.
 [Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [Moore et al., 1999] Andrew W Moore, L Baird, and LP Kaelbling. Multi-value-functions: Efficient automatic action hierarchies for multiple goal MDPs. 1999.
 [Plappert et al., 2018] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, and Wojciech Zaremba. Multigoal reinforcement learning: Challenging robotics environments and request for research, 2018.
 [Puterman, 1994] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. 1994.
 [Schulman et al., 2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 [Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 [Şimşek and Barto, 2004] Özgür Şimşek and Andrew G Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 95. ACM, 2004.
 [Şimşek and Barto, 2009] Özgür Şimşek and Andrew G Barto. Skill characterization based on betweenness. In Advances in neural information processing systems, pages 1497–1504, 2009.
 [Şimşek et al., 2005] Özgür Şimşek, Alicia P. Wolfe, and Andrew G. Barto. Identifying useful subgoals in reinforcement learning by local graph partitioning. pages 816–823. ACM Press, 2005.
 [Sutton and Barto, 1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 1998.
 [Sutton et al., 1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
 [Vezhnevets et al., 2017] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal Networks for Hierarchical Reinforcement Learning. arXiv:1703.01161 [cs], March 2017.
 [Watkins and Dayan, 1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.