Deictic Image Maps: An Abstraction For Learning Pose Invariant Manipulation Policies
In applications of deep reinforcement learning to robotics, it is often the case that we want to learn pose invariant policies: policies that are invariant to changes in the position and orientation of objects in the world. For example, consider a peg-in-hole insertion task. If the agent learns to insert a peg into one hole, we would like that policy to generalize to holes presented in different poses. Unfortunately, this is challenging with conventional methods. This paper proposes deictic image maps, a novel state and action abstraction that is invariant to pose shifts and can be used with deep reinforcement learning. We provide broad conditions under which optimal abstract policies are optimal for the underlying system. Finally, we show that the method can help solve challenging robotic manipulation problems.
Robert Platt, Colin Kohler, Marcus Gualtieri
College of Computer and Information Science, Northeastern University
360 Huntington Ave, Boston, MA 02115, USA
Keywords: Deep Reinforcement Learning, Robotic Manipulation
1 Introduction
Policies learned by deep reinforcement learning agents are generally not invariant to changes in the position and orientation of the camera or objects in the environment. For example, consider the peg-in-hole insertion task shown in Figure 1. A policy that can insert the peg into Hole A cannot necessarily accomplish the same thing for Hole B. In order to learn a policy that can perform both insertions, the agent must be presented with examples of the hole in both configurations during training. This is a significant problem in robotics because we often want to learn policies most easily expressed relative to an object (e.g., relative to the hole) rather than relative to an image. Without the ability to generalize over pose, deep reinforcement learning must be trained with experiences that span the space of all possible pose variations. For example, in order to learn an insertion policy for any hole configuration, the agent must be trained with task instances for hole configurations spanning SE(2), a three-dimensional space. The problem is even worse in SE(3), which spans six dimensions. The need for all this training data slows down learning and hampers the agent’s ability to generalize over other dimensions of variation such as task and shape. While pose invariance is not always desirable, the inability to generalize over pose can be a major problem.
This paper introduces deictic image maps, a projection of state and action onto an abstract space that generalizes over pose. The key idea is to encode a movement action in terms of an image of the world that is centered, aligned, and focused on the target pose of a collision free motion action. State is encoded as a history of actions expressed this way. Since this representation encodes the appearance of the environment relative to an action, it generalizes well over absolute positions and orientations. We call this a “deictic” representation because the state representation is contextualized by previous action choices. This paper explores the idea in the context of a broad class of robotic systems that we call move-effect systems where the robot is constrained to perform gross motions and end effector actions sequentially. We show that, in conjunction with deep Q-learning (DQN) [1, 2], this method can solve important move-effect problems more efficiently than would otherwise be possible. From a theoretical perspective, we show that for certain problems involving move-effect systems, the proposed state and action abstraction induces an MDP homomorphism [3, 4]. As a result, we are assured that optimal policies found in the abstract state and action space induce optimal policies for the original problem. Finally, we report results from simulated and real robotic experiments that illustrate how the method can be used to solve robotics problems that would otherwise be challenging for deep reinforcement learning.
2 Problem Statement
We will consider problems expressed for a class of robotic systems that we call move-effect systems.
Definition 2.1 (Move-Effect System).
A move-effect system is a discrete time system comprised of a set of effectors mounted on a fully actuated mobile base that operates in a Euclidean space of dimension . On every time step, the system perceives an image or signed distance function, , and then executes a collision-free motion, , of the base followed by a motion, , of the effectors.
A good example of a move-effect system is a robotic arm engaged in prehensile manipulation. The robot may move its hand (i.e. the “base”) to any desired reachable pose via a collision-free motion. Once there, it may execute an effector motion such as opening or closing the hand. More complex manipulation systems can also be formalized this way. For example, a domestic robot that navigates within a house performing simple tasks can be expressed as a move-effect system in SE(2) where the simple tasks to be performed are the “effector motions”.
2.1 Formulation as a Markov decision process
A Markov decision process (an MDP) is a tuple where denotes a set of system states and a discrete set of actions. denotes the transition probability function where is the probability of transitioning to state when action is taken from state . denotes the expected reward of executing action from state . The solution to an MDP is a control policy that maximizes the expected sum of future discounted rewards when acting under the policy.
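For readers who want the definition above in symbols, one conventional way to write it is given below. The letters $S$, $A$, $T$, $R$, the policy symbol $\pi$, and the discount factor $\gamma$ are standard textbook choices adopted here as assumptions; they are not necessarily the symbols used in the original equations.

```latex
\begin{align}
  \mathcal{M} &= (S, A, T, R), \\
  T(s, a, s') &= \Pr\left(s_{t+1} = s' \mid s_t = s,\ a_t = a\right), \\
  R(s, a) &= \mathbb{E}\left[\, r_t \mid s_t = s,\ a_t = a \,\right], \\
  \pi^{*} &\in \arg\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right],
  \qquad 0 \le \gamma < 1 .
\end{align}
```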
In order to use the MDP framework for move-effect systems, we need to define state and action sets. The action set is easy to define: it is the set of all move-effect pairs, each consisting of a base motion and an effector motion. State is harder to define because it is not observed directly. Instead, the agent has access to an image or signed distance function , acquired at the beginning of each time step via a camera and/or depth sensor. Here, denotes the positions of a finite set of points corresponding to the pixels or voxels in the image or signed distance function. The system also has access to the configuration of its effectors, , obtained using joint position sensors. We define state to be a history of the most recent observations and actions:
where and denote the current values for those two variables and , , and denote the
2.2 Move-effect problems are challenging
The standard MDP formulation described in Section 2.1 is not well suited to move-effect systems. For example, consider the grid-disk domain as shown in Figure 2. At the beginning of an episode, the grid is initialized with two disks placed in random cells. The agent may move the disks by performing pick and place actions. It receives a reward of +10 for picking up one disk and placing it adjacent to the other (either horizontally or vertically) and zero reward otherwise. Using DQN on the MDP formulation described above, the number of episodes needed to learn a good policy increases as the number of cells in the grid increases. This is illustrated in Figure 3 which shows learning curves averaged over 10 runs each for DQN applied to grid-disk for three different grid sizes: a grid, a grid, and a . This example is a move-effect system where is the set of 16 grid positions and contains exactly one pick and one place action (32 actions total). On each time step, the agent observes an image of the grid as well as a single bit that denotes the configuration of its hand, either open or closed (Equation 1 for ). Transitions are deterministic: a pick (resp. place) succeeds if executed for an occupied (resp. unoccupied) cell and fails otherwise. We are using standard ε-greedy DQN with dueling networks [5], no prioritized replay [6], a buffer size of 10k, a batch size of 10, and an episode length of 10 steps. Epsilon decreases linearly from 100% to 10% over the entire training session. The neural network has two convolutional+relu+pooling layers of 16 and 32 units respectively with a stride of 3 followed by one fully connected layer with 48 units. We use the Adam optimizer with a learning rate of 0.0003. The images are preprocessed by resizing to . The number of episodes (and hence the slope of decrease in ε) is chosen to be the shortest that on average enables the agent to reach 9.5 reward over 10 episodes. Notice that DQN learns much more slowly for the case than it does for . It does even worse for larger grid sizes.
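The grid-disk rules described above can be sketched as a small environment. This is a minimal illustration, not the authors' implementation: the class name `GridDisk`, the `(cell, effect)` action encoding, and the choice to end the episode on success are assumptions made for the sketch.

```python
import random


class GridDisk:
    """Toy grid-disk domain: two disks on an n x n grid, moved by pick and place."""

    def __init__(self, n=4, seed=None):
        self.n = n
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Place two disks in distinct random cells; the hand starts empty.
        cells = [(r, c) for r in range(self.n) for c in range(self.n)]
        self.disks = set(self.rng.sample(cells, 2))
        self.holding = False
        return self.observe()

    def observe(self):
        # Observation: binary occupancy image plus one bit for the hand state.
        grid = [[1 if (r, c) in self.disks else 0 for c in range(self.n)]
                for r in range(self.n)]
        return grid, int(self.holding)

    def step(self, cell, effect):
        """Apply a move-effect pair; effect is "pick" or "place"."""
        reward, done = 0, False
        if effect == "pick" and not self.holding and cell in self.disks:
            # A pick succeeds only on an occupied cell.
            self.disks.remove(cell)
            self.holding = True
        elif effect == "place" and self.holding and cell not in self.disks:
            # A place succeeds only on an unoccupied cell.
            (other,) = self.disks
            self.disks.add(cell)
            self.holding = False
            # +10 when the placed disk is horizontally or vertically adjacent.
            if abs(cell[0] - other[0]) + abs(cell[1] - other[1]) == 1:
                reward, done = 10, True  # ending the episode here is an assumption
        return self.observe(), reward, done
```

With `n=4` there are 16 cells and two effector motions, which gives the 32 actions mentioned in the text.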
3 Deictic image mapping
Move-effect systems have structure that makes problems involving these systems easier to solve than the results of Section 2.2 suggest. Notice that in the case of grid-disk, the optimal policy is most easily expressed relative to current disk positions on the grid: the agent must learn to pick up a disk and place it horizontally or vertically adjacent to the other. This is a result of an important problem symmetry: for the purposes of arranging disks, world configurations where objects in the world occupy the same relative configurations have similar (i.e. symmetric) transitions and reward outcomes. In order to improve learning performance, we need to encode the problem in a way that reflects this symmetry. Note that this cannot be accomplished just by switching to an actor-critic method like DDPG [7]. DDPG could make it easier for the agent to learn to generalize over position, but it cannot generalize over orientation. Instead, this section introduces deictic image state and action mappings that induce an abstract MDP that captures problem symmetries. We solve this abstract MDP using DQN.
3.1 Action mapping
Recall that in a move-effect system, each action is a move-effect pair, , where we use the notation to describe the destination of the base motion and to describe the effector motion at time . The hard part about expressing this action is expressing the component: whereas is relatively small, could be any position and orientation within the work space of the manipulator. Our key idea is to express in terms of the appearance of the world in the vicinity of rather than as coordinates in a geometric space. Recall that on each time step, the move-effect system observes an image or signed distance function, , expressed over a finite set of positions, , corresponding to pixels or voxels. Let denote a cropped region of centered on and aligned with the basis axes of , defined over the positions . is assumed to correspond to a closed region of . In our experiments, it is a square or cube centered on and aligned with . The action mapping is then where:
is an element of , and and are elements of . We call this a deictic image action mapping because it uses an image to encode the motion relative to other objects in the world. Because it represents motions this way, it generalizes well over translations and rotations in or . This can be viewed as an action abstraction [3, 4] and we will call the abstract action set.
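To make the cropping idea concrete, the following sketch extracts a square patch of a 2D image centered on a base-motion target. It is an illustration under simplifying assumptions: the function name `deictic_crop` is hypothetical, the patch size is a free parameter, out-of-bounds cells are zero-padded, and orientation is restricted to multiples of 90 degrees so that no interpolation is needed.

```python
import numpy as np


def deictic_crop(image, center, size, quarter_turns=0):
    """Return a size x size patch of `image` centered on `center` = (row, col).

    `size` is assumed to be odd. Cells outside the image are filled with zeros.
    The patch is rotated counterclockwise by `quarter_turns` * 90 degrees to
    align it with the (discretized) orientation of the base motion.
    """
    half = size // 2
    # Zero-pad so that crops near the border stay well defined.
    padded = np.pad(image, half, mode="constant", constant_values=0)
    r, c = center
    # After padding, original cell (r, c) sits at (r + half, c + half),
    # so this slice places it at the middle of the patch.
    patch = padded[r:r + size, c:c + size]
    return np.rot90(patch, k=quarter_turns)
```

Two scenes that differ only by a translation of both the objects and the target produce identical patches, which is the property the action mapping relies on.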
3.2 State mapping
The action mapping of Equation 2 induces a state abstraction. Recall that we defined state to be the history of the last observations and actions (Equation 1). We can simplify this representation by using the action abstraction of Equation 2. The easiest way to do this is to define abstract state to be the current robot state paired with a history of the most recently executed abstract actions:
where , , and are elements of and is the set of abstract states. We call this the original crops abstraction because the abstract action is evaluated using , the image or signed distance function as it was at the time the action was originally executed. When is understood, we will sometimes abbreviate . We refer to as the abstract state set.
3.3 DQN in the abstract state and action space
The state and action mappings introduced above induce an abstract MDP that can be solved using DQN as follows. Let denote the abstract action-value function, expressed over the abstract state and action space, encoded by a deep neural network. At the beginning of each time step, we evaluate the abstract state, , for the current state, . Note that the set of abstract actions that are feasible from this state is . Therefore, we can calculate the greedy action for the current state with respect to the abstract action-value function using:
After each time step, we store the underlying transition, in the standard way. Training is nearly the same as usual except that after sampling a mini-batch, we calculate targets using the abstract state-action value function, , rather than in the standard way, . The neural network that encodes takes an abstract state-action pair as input and outputs a single estimate of value. In our experiments, we use a standard convolutional architecture comprised of two convolution+relu+pooling layers followed by fully connected layers. In contrast to the standard DQN network implementation which evaluates all action values in a single pass, this architecture must do a single forward pass for each action candidate. However, each of these forward passes is faster because the image patches that are input to the network are typically much smaller than the full image that standard DQN would take.
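The action-selection and target computations described above can be sketched as follows. This is a schematic under stated assumptions: `q_fn` stands in for the neural network, candidates are given as `(underlying_action, abstract_action)` pairs, and the discount `gamma` is a hypothetical parameter whose value is not taken from the source.

```python
def greedy_abstract_action(q_fn, abstract_state, candidates):
    """Pick the candidate whose abstract state-action value is largest.

    candidates: list of (underlying_action, abstract_action) pairs.
    q_fn is called once per candidate, mirroring one forward pass per action.
    """
    best_action, best_value = None, float("-inf")
    for underlying_action, abstract_action in candidates:
        value = q_fn(abstract_state, abstract_action)
        if value > best_value:
            best_action, best_value = underlying_action, value
    return best_action, best_value


def td_target(q_fn, reward, next_abstract_state, next_candidates, done, gamma=0.9):
    """One-step target computed with the abstract action-value function."""
    if done or not next_candidates:
        return reward
    _, next_value = greedy_abstract_action(q_fn, next_abstract_state, next_candidates)
    return reward + gamma * next_value
```

The stored transition stays in the underlying state and action space; only the value estimates are evaluated on the abstract quantities.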
3.4 Comparison between deictic image mapping and a DQN baseline
First, we define a deictic image mapping for grid-disk. Recall that for an image and a base motion , is a cropped region of centered and aligned with . is the set of base motion target locations corresponding to the cells ( for grid-disk). In this example, we define to be the cell square region centered on . (We use zero-padding for positions on near the edge of the grid.) For example, Figure 4 (a) shows for the place action shown in Figure 2 (b). Since the agent is to place a disk to the left of an existing disk, the deictic image for this place action shows the existing disk to the right of an empty square where the place will occur. is the configuration of the effector: in this case just the configuration of the gripper jaws. Figure 4 (b) compares learning curves for DQN with deictic image states and actions versus the DQN baseline from Section 2.2 averaged over 10 runs each for the grid-disk domain. DQN is parameterized just as it was in Section 2.2. Notice that deictic image mapping speeds up learning considerably. Our experience has been that this kind of speedup is typical for deictic image mapping applied to move-effect systems.
4 Scaling up
By itself, the DQN implementation of deictic image mapping does not scale to real world problems. It is necessary to introduce techniques for handling the large action branching factor and for leveraging curriculum learning.
4.1 Large action branching factors
The fact that the DQN implementation of deictic image mapping must do a single forward pass of the neural network for each action candidate is a major computational problem for large action branching factors. Many problems of practical interest will involve tens or hundreds of thousands of potential actions. One approach to handling this would be to adapt an actor critic method such as DDPG to the deictic image mapping setting. However, we have not yet explored that approach. Instead, this paper handles the problem using the two strategies described below.
Keeping an estimate of the state-value function: The large action branching factor is a problem every time it is necessary to evaluate . This happens twice in DQN: once during action selection and again during training when evaluating target values for a mini-batch. We eliminate the second evaluation by estimating the abstract state-value function, , concurrently with the abstract action-value function, , using a separate neural network. is trained using targets obtained during action selection: each time we select an action for a given state, we update the network using the just-evaluated max for that state. This enables us to calculate a target using rather than .
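A minimal sketch of this bookkeeping is shown below. A dictionary with an incremental update stands in for the separate state-value network, which is an assumption made to keep the example self-contained; the names `StateValueEstimate`, `select_and_record`, and `td_target_from_v` are hypothetical.

```python
class StateValueEstimate:
    """Tabular stand-in for the separate abstract state-value network."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.v = {}

    def update(self, state_key, max_q):
        # Move the stored value toward the max that was just evaluated.
        old = self.v.get(state_key, 0.0)
        self.v[state_key] = old + self.lr * (max_q - old)

    def value(self, state_key):
        return self.v.get(state_key, 0.0)


def select_and_record(q_fn, v_est, state_key, candidates):
    """Greedy action selection that also trains the state-value estimate."""
    values = [q_fn(state_key, a) for a in candidates]
    best = max(range(len(values)), key=values.__getitem__)
    v_est.update(state_key, values[best])  # reuse the max computed for selection
    return candidates[best]


def td_target_from_v(v_est, reward, next_state_key, done, gamma=0.9):
    """Target that reads the state-value estimate instead of maximizing over actions."""
    return reward if done else reward + gamma * v_est.value(next_state_key)
```

The expensive maximization over all candidate actions now happens only during action selection; computing a mini-batch of targets costs one lookup (or one forward pass) per sample.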
Using hierarchy: We also speed up evaluation of through a hierarchy specifically designed for move-effect problems in . Given an abstract action , let be the function that rectifies into a globally fixed orientation. We define two neural networks, and . is trained in the standard way: for each pair in a mini-batch, is trained with the corresponding target. However, we also train with the same target. Essentially, the estimate is an average of the value function over all orientations for a given position. In order to maximize , we score each action using and then maximize over the top scoring actions. This is related to graduated non-convexity methods [8]. It is not guaranteed to provide an optimum, but it works well empirically on our problems.
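The two-stage maximization can be sketched as follows. The callables `q_coarse` (a score per position, computed from an orientation-rectified crop) and `q_fine` (a score per full pose), as well as the shortlist size `top_k`, are hypothetical stand-ins for the two networks and are assumptions of this sketch.

```python
def hierarchical_argmax(q_coarse, q_fine, positions, orientations, top_k):
    """Approximate argmax over (position, orientation) pairs.

    Stage 1 ranks positions with the cheap orientation-independent score.
    Stage 2 evaluates the full score only for the top_k shortlisted positions.
    The result is not guaranteed to be the true maximizer.
    """
    shortlist = sorted(positions, key=q_coarse, reverse=True)[:top_k]
    best_pose, best_value = None, float("-inf")
    for position in shortlist:
        for orientation in orientations:
            value = q_fine(position, orientation)
            if value > best_value:
                best_pose, best_value = (position, orientation), value
    return best_pose
```

The number of full evaluations drops from `len(positions) * len(orientations)` to `top_k * len(orientations)`, plus one coarse evaluation per position.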
4.2 Curriculum learning with deictic image mapping
Curriculum learning [9] can speed up training in reinforcement learning. This amounts to training the agent to solve a sequence of progressively more challenging tasks. The hard part is defining the task sequence in such a way that each task builds upon knowledge learned in the last. However, this is particularly easy with deictic image mapping. In this case, we just need to vary the discretization of on each curriculum task, starting with a coarse and ending with a fine discretization. For example, suppose we want to learn a grid-disk policy over a fine grid of possible disk locations. We would initially train using the coarse discretization described in Section 3.4 and subsequently train on a finer discretization. Although some of the image patches in the fine discretization will be different from what was experienced in the coarse discretization, these new image patches will have a similar appearance to nearby patches in the coarse discretization. Essentially, base motions that move to similar locations will be represented by similar image patches. As a result, policy knowledge learned at a coarse level will generalize appropriately at the fine level.
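A coarse-to-fine schedule of base-motion targets can be sketched as below. The region size and step sizes in the usage note are illustrative values chosen for the example, not the settings used in the experiments, and the function names are hypothetical.

```python
import numpy as np


def base_motion_targets(extent, step):
    """Grid of (x, y) targets covering the square [0, extent) with spacing `step`.

    `extent` and `step` are integers in the same unit (for example millimetres),
    which avoids floating-point drift in the grid coordinates.
    """
    coords = np.arange(0, extent, step)
    xs, ys = np.meshgrid(coords, coords, indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=1)


def curriculum(extent, steps):
    """One target set per curriculum stage, ordered from coarse to fine."""
    return [base_motion_targets(extent, s) for s in steps]
```

For instance, `curriculum(40, [10, 5])` yields 16 targets in the first stage and 64 in the second, and every coarse target reappears in the fine stage, so patches seen early remain valid later.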
4.3 Blocks world in simulation
Here we show that deictic image mapping in conjunction with the ideas described above can be used to solve problems where spans . Consider a pick and place problem for rectangular objects. At the beginning of each episode, two rectangular blocks with randomly sampled widths are placed in random positions and orientations as shown in Figure 5 (a). The agent must pick up one rectangle and place it adjacent to another so that the long ends of the rectangles roughly line up as shown in Figure 5 (b). We use a dense discretization of in this problem, resulting in 26.9k different actions that the agent can execute on each time step.
We solve this problem using DQN with deictic image mapping in conjunction with the ideas presented in this section. We use a curriculum of eight tasks starting with a task involving disks as in Figure 2 and ending with tasks involving blocks as shown in Figure 5. The granularity of the pick/place positions and orientations gradually increases from beginning to end of the eight-task curriculum. Everything is parameterized the same as it was in Section 2.2 except for the following: we vary the number of episodes per curriculum task as appropriate; we use prioritized replay with , , and ; and we start ε-greedy exploration with instead of (but still decreases linearly with time).
Figure 5 (c) shows the learning curve for this problem. Notice the “spikes” downward in the learning curve. Each spike corresponds to the start of a new task in the curriculum (a total of eight spikes including the one at episode zero). After the agent solves a task in the curriculum, this triggers a switch to a new task. Performance drops while the agent learns the new task but eventually recovers. The most important part of the learning curve is the far right – this represents performance on the full task instance. Performance does not reach 10 for these tasks because the random initial placement of the blocks makes some task instances infeasible (the blocks may be oriented such that it is impossible to achieve the correct placement). The full training curriculum executes in approximately 1.5 hours on a standard Intel Core i7-4790K running one NVIDIA 1080 graphics card. Overall, our approach does well on this problem. It would be hard to do as well on this task using a different deep RL approach.
4.4 Robot experiments
Deictic image mapping is also useful in practice. We performed experiments for a blocks world domain using a UR5 equipped with a Robotiq two finger gripper, as shown in Figure 6. This blocks world was the same as that described in Section 4.3 except that the blocks were cubic (5 cm on a side) and always placed in a single orientation. Initially, blocks were placed on a table in randomly selected positions within an cm region within the manipulator workspace. The robot executed collision-free end-to-end motions to and from pick and place configurations within this region. All motions were constrained to move the gripper into configurations where a line pointing in the direction of the gripper fingers was orthogonal to the plane of the table as shown in Figure 6. At the beginning of each time step, the gripper was moved above the table and looking down so that it could acquire a new depth image using a Structure IO sensor mounted to the wrist (the light blue device in Figure 6). This produced an image of the table similar to that shown in Figure 5 (a-b). Then, the system selected and performed a pick or place action consisting of a single collision free motion followed by an opening or closing motion of the gripper. An episode ended when a goal condition was satisfied or after 20 time steps.
|Task|Success Rate|Avg # Steps|Align Err (Block 1-2)|Align Err (Block 2-3)|Align Err (Block 3-4)|
|---|---|---|---|---|---|
|3 Block Row|94%|7.8| | | |
|3 Block Stack|90%|4.7|5.6|9.2| |
|4 Block Stack|84%|6.9|5.7|8.2|13.8|
We trained this system in simulation using OpenRAVE [10]. The simulation consisted of the table, the blocks setup, and a floating Robotiq two finger hand that acted as a Cartesian manipulator in the plane of the table. This simulated setup was identical to the physical setup except that we omitted simulation of the arm motion and kinematics and assumed that motions of the robotic manipulator always reached their target pose. The depth image was simulated by generating a point cloud in OpenRAVE and converting it into a depth image. A simulated grasp was deemed to succeed when the fingers were placed around an object within 2mm of the object center line. The neural network and other DQN parameters were the same as those used in the simulated blocks world experiment (Section 4.3) except that we used a four stage curriculum. The first stage trained on blocks presented with a step size of 10cm (64 total actions). The second stage used a 5cm step size, the third a 3cm step size, and the fourth a 1cm step size (12.482k total actions). Total training time was approximately 1.5 hours on a single Intel Core i7-4790K running two NVIDIA 1080 graphics cards.
We evaluated the approach for three different tasks defined over the blocks world domain: placing three blocks in a row, stacking three blocks, and stacking four blocks. All tasks were trained to convergence using the curriculum, parameters, and setup described above. Then, we took the policy learned in simulation and ran it directly on the physical robotic system. Although we trained the system for with a 1cm step size, we deployed with a 5mm step size, thereby giving us somewhat more accurate placements. All tasks started with the blocks placed randomly on the table. The system received a reward of +10 if a goal state was reached and the episode terminated. Table 1 shows the results for the three tasks averaged over 50 trials per task. Probably the most significant result is that we achieve relatively high success rates for tasks of this complexity. The 84% success rate for four stacked blocks is particularly noteworthy – it would be challenging for us to achieve this success rate using any other deep RL method. The “align err” columns show the average offset of a block relative to the block below. These results suggest that the lower success rate of the 4 block stack is simply a result of accumulation of alignment errors. If the CG of the top block is not within the “footprint” of the base block, then the stack will topple.
We can show that deictic image mapping is theoretically correct in the sense that optimal policies for the abstract MDP found using DQN as described in Section 3.3 induce optimal policies for the original system. We use the MDP homomorphism framework.
Definition 5.1 (MDP Homomorphism, Ravindran 2004 [3]).
An MDP homomorphism from an MDP onto is a tuple of mappings, , where is a state mapping, is a set of action mappings, and the following two conditions are satisfied for all , , and :
and are the abstract state and action sets, respectively. Recall that the action mapping introduced in Equation 2 encodes an action in terms of . Since is the image or signed distance function perceived by sensors, this is a state-dependent mapping. The MDP Homomorphism framework is specifically relevant to this situation because, unlike other abstraction frameworks [11, 12], it allows for state dependent action abstraction. If we can identify conditions under which the action mapping of Equation 2 in conjunction with the state mapping of Equation 3 is an MDP homomorphism, then a theorem exists that says that a solution to the abstract MDP induces a solution in the original problem:
Theorem 5.1 (Optimal value equivalence, Ravindran 2004 [3]).
Let be the homomorphic image of under the MDP homomorphism, . For any ,
Here, and denote the optimal action-value function for the underlying MDP and the abstract MDP respectively. We now show that the deictic image mapping induces an MDP homomorphism.
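For reference, the two homomorphism conditions of Definition 5.1 and the value equivalence of Theorem 5.1 can be written in the standard form used by Ravindran and Barto [3, 4]. The symbols below are notational assumptions of this sketch and may differ from the notation of the original equations: $f$ is the state mapping, $g_s$ is the state-dependent action mapping, primes mark quantities of the abstract MDP, and $f^{-1}(f(s'))$ is the set of underlying states that share an abstract state with $s'$.

```latex
\begin{align}
  % Reward condition: abstract reward matches the underlying reward.
  R'\big(f(s),\, g_s(a)\big) &= R(s, a), \\
  % Transition condition: abstract transition probability aggregates
  % the underlying probabilities over each block of equivalent states.
  T'\big(f(s),\, g_s(a),\, f(s')\big) &= \sum_{s'' \in f^{-1}(f(s'))} T(s, a, s''), \\
  % Optimal value equivalence.
  Q^{*}(s, a) &= Q'^{*}\big(f(s),\, g_s(a)\big).
\end{align}
```

The third line is what allows a greedy policy computed in the abstract space to be lifted back to the underlying system: any underlying action whose image under $g_s$ is optimal in the abstract MDP is optimal in the original MDP.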
Given a move-effect system with state set ; action set ; transition function such that for all , is conditionally independent of given ; an original crops mapping ; and an abstract reward function ; then there exist a reward function and an abstraction transition function for which is an MDP homomorphism from to .
To show Equation 5, we must identify an abstraction transition function such that for all ,
To show that exists that satisfies Equation 7, it is sufficient to show that is conditionally independent of and given and :
Notice that since we can express as , then is conditionally independent of and given , , and . However, since by assumption is conditionally independent of given and we know that it is conditionally independent of given by the definitions of those variables, we have shown Equation 8 and the theorem is proven. ∎
6 Related Work and Discussion
This work is related to prior work exploring applications of deixis in AI, starting with Agre and Chapman [13]. Evidence exists that the human brain leverages deictic representations during performance of motor tasks [14]. Whitehead and Ballard proposed an application of deixis to reinforcement learning, focusing on blocks world applications in particular [15]. The resulting system was recognized to be partially observable and a variety of methods were explored for combating the problem without much success [15, 16, 17]. This paper leverages the MDP homomorphism framework introduced by Ravindran and Barto [3, 4] to show that our particular representation does not result in suboptimal policies. MDP homomorphisms are related to other MDP abstraction methods [11, 12, 18]. However, unlike those methods, MDP homomorphisms allow for state-dependent action abstraction, a critical part of our proposed abstraction.
In contrast to a variety of recent work on abstraction in reinforcement learning [19, 20, 21, 22], the abstraction proposed here does not necessarily result in an MDP with a smaller state or action space. Rather, our focus is on projecting onto a representation that generalizes as we would like. Nevertheless, our representation naturally incorporates end-to-end collision-free motions, an important type of action abstraction in robotic manipulation. This paper is also closely related to recent work by Gualtieri et al. [23], who used a similar method to solve a pick and place task. The current paper expands upon that work by allowing for larger action sets, more complex tasks, and by providing the analysis of Section 5. The paper is also loosely related to recent grasp detection work by Mahler et al. [24] and ten Pas et al. [25], where the authors use detection in the context of a robotics task. Here, however, we explore tasks with longer time horizons.
This work has been supported in part by the National Science Foundation through IIS-1427081, IIS-1724191, and IIS-1724257, NASA through NNX16AC48A and NNX13AQ85G, ONR through N000141410047, Amazon through an ARA to Platt, and Google through a FRA to Platt.
References
- [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015.
- [3] B. Ravindran. An algebraic approach to abstraction in reinforcement learning. PhD thesis, University of Massachusetts at Amherst, 2004.
- [4] B. Ravindran and A. Barto. SMDP homomorphisms: An algebraic approach to abstraction in semi-Markov decision processes. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1011–1016, 2003.
- [5] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
- [6] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
- [7] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- [8] A. Blake and A. Zisserman. Visual reconstruction. MIT Press, 1987.
- [9] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. ACM, 2009.
- [10] R. Diankov and J. Kuffner. OpenRAVE: A planning architecture for autonomous robotics. Technical Report CMU-RI-TR-08-34, Robotics Institute, Pittsburgh, PA, July 2008.
- [11] R. Givan, T. Dean, and M. Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2):163–223, 2003.
- [12] T. Dean, R. Givan, and S. Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 124–131. Morgan Kaufmann Publishers Inc., 1997.
- [13] P. E. Agre and D. Chapman. Pengi: An implementation of a theory of activity. In AAAI, volume 87, pages 268–272, 1987.
- [14] D. H. Ballard, M. M. Hayhoe, P. K. Pook, and R. P. Rao. Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20(4):723–742, 1997.
- [15] S. D. Whitehead and D. H. Ballard. Learning to perceive and act by trial and error. Machine Learning, 7(1):45–83, 1991.
- [16] A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1995.
- [17] S. Finney, N. H. Gardiol, L. P. Kaelbling, and T. Oates. The thing that we tried didn’t work very well: deictic representation in reinforcement learning. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 154–161. Morgan Kaufmann Publishers Inc., 2002.
- [18] P. S. Castro and D. Precup. Using bisimulation for policy transfer in MDPs. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, pages 1399–1400. International Foundation for Autonomous Agents and Multiagent Systems, 2010.
- [19] I. P. Durugkar, C. Rosenbaum, S. Dernbach, and S. Mahadevan. Deep reinforcement learning with macro-actions. arXiv preprint arXiv:1606.04615, 2016.
- [20] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683, 2016.
- [21] P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017.
- [22] A. Levy, R. Platt, and K. Saenko. Hierarchical reinforcement learning with hindsight. arXiv preprint arXiv:1805.08180, 2018.
- [23] M. Gualtieri, A. ten Pas, and R. Platt. Pick and place without geometric object models. In IEEE Int’l Conf. on Robotics and Automation (ICRA). IEEE, 2018.
- [24] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312, 2017.
- [25] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt. Grasp pose detection in point clouds. The International Journal of Robotics Research, page 0278364917735594, 2017.