IRIS: Implicit Reinforcement without Interaction at Scale
for Learning Control from Offline Robot Manipulation Data
Learning from offline task demonstrations is a problem of great interest in robotics. For simple short-horizon manipulation tasks with modest variation in task instances, offline learning from a small set of demonstrations can produce controllers that successfully solve the task. However, leveraging a fixed batch of data can be problematic for larger datasets and longer-horizon tasks with greater variations. The data can exhibit substantial diversity and consist of suboptimal solution approaches. In this paper, we propose Implicit Reinforcement without Interaction at Scale (IRIS), a novel framework for learning from large-scale demonstration datasets. IRIS factorizes the control problem into a goal-conditioned low-level controller that imitates short demonstration sequences and a high-level goal selection mechanism that sets goals for the low-level and selectively combines parts of suboptimal solutions leading to more successful task completions. We evaluate IRIS across three datasets, including the RoboTurk Cans dataset collected by humans via crowdsourcing, and show that performant policies can be learned from purely offline learning. Additional results and videos at https://stanfordvl.github.io/iris/.
Recent research has successfully leveraged Reinforcement Learning (RL) for short-horizon robotic manipulation tasks, such as pushing and grasping objects [yu2016more, levine2016learning, fang2018learning]. However, RL algorithms face the burden of efficient exploration in large state and action spaces, and consequently need large amounts of environment interaction to successfully learn policies. Furthermore, leveraging RL for policy learning requires specifying a task-specific reward function that is often carefully shaped and crafted to assist in exploration. An appealing alternative to learning policies from scratch is to bring policy learning closer to the setting of supervised learning by leveraging prior experience. In Imitation Learning (IL), expert demonstrations are used to guide policy learning. The demonstrated data can be used in lieu of a reward function and also lessen the burden of exploration for the agent, ameliorating some of the aforementioned issues. However, Imitation Learning has primarily been applied to small scale datasets collected in a consistent manner (i.e. by one decision maker). In order to truly reap the benefits of supervised learning, it is useful to consider how large-scale, diverse supervision can be used for task learning. Large-scale human supervision has accelerated progress in computer vision and natural language processing [deng2009imagenet, rajpurkar2018squad2], but policy learning has witnessed no such success. The advent of supervision mechanisms that allow for the collection of thousands of task demonstrations in a matter of days [mandlekar2018roboturk] motivates the following question: does a policy learning algorithm necessarily need to interact with the system to learn a policy, or can a robust and performant policy be learned purely from external experiences provided in the form of a dataset? For example, consider a pick-and-place task where a robot has to pick up a soda can and place it on a shelf. 
We have access to a large set of task demonstrations collected via human supervision where the soda can was placed in several initial poses and people controlled the arm to demonstrate many different approaches for grasping the can and placing it on the shelf. We would like to use this dataset to train a policy that can successfully solve the task. In order to leverage large datasets for policy learning, we argue that it is important to develop methods that are tolerant to datasets that are both suboptimal and diverse, since large-scale human supervision is likely to produce data that is highly varied in terms of both quality and task solution approaches. For example, some approaches for moving to the can and grasping it can be more efficient than others, and there are many valid ways to pick the can up. By contrast, conventional imitation learning methods assume that demonstration data is near-optimal and unimodal, and most methods start to deteriorate significantly when expert demonstrations are of lower quality, or when multiple solutions are demonstrated.
We present Implicit Reinforcement without Interaction at Scale (IRIS), a novel framework that addresses the problem of offline policy learning from a large set of diverse and suboptimal demonstrations. IRIS consists of a low-level controller that performs unimodal goal-conditioned imitation over a short horizon and a high-level goal selection mechanism that generates a set of plausible goals and picks the one with the highest value. This factorization allows IRIS to learn a performant policy by selectively imitating local sequences from the suboptimal trajectories in the dataset. For example, the high-level mechanism might decide to have the low-level imitate sequences from one demonstrator’s trajectories to approach the can and then switch over to another demonstrator’s trajectory to grasp the can. Summary of Contributions:
We present Implicit Reinforcement without Interaction at Scale (IRIS), a framework that enables offline learning from a large set of diverse and suboptimal demonstrations by selectively imitating local sequences from the dataset.
We evaluate IRIS across three datasets collected on tasks of varying difficulty. The first dataset is a pedagogical dataset that exhibits significant diversity in the demonstrations. The second dataset exhibits significant suboptimality in the demonstrations (collected with one user). The third dataset is the RoboTurk dataset collected by humans via crowdsourcing. While our framework can leverage rewards if present in the demonstrations, the experiments only assume sparse task completion rewards that occur at the end of each demonstration.
Empirically, our experiments demonstrate that IRIS is able to leverage large-scale off-policy task demonstrations that exhibit suboptimality and diversity, and significantly outperforms other imitation learning and batch reinforcement learning baselines.
II Related Work
Imitation Learning and Learning from Demonstration: Imitation learning guides policy learning by leveraging a reference set of expert demonstrations. Imitation learning methods are either offline, such as Behavioral Cloning [pomerleau1989alvinn, ross2013learning, schulman2016learning], or online, such as Inverse Reinforcement Learning (IRL) [abbeel2011inverse, krishnan2019swirl]. Offline methods are sensitive to the quantity of demonstration data and can suffer from covariate shift, since no additional data is collected by the agent, while online methods can require significant additional interaction for successful policy learning. Furthermore, most imitation learning approaches are sensitive to the quality of expert demonstrations since they assume that the data is near-optimal.
Imitation and Reinforcement Learning from Suboptimal Demonstrations: Recent work has tried to leverage off-policy deep reinforcement learning in conjunction with a set of demonstrations to account for suboptimal data and learn policies that outperform the demonstrations [zhu2018reinforcement, nac, ddpgfd, nair2018overcoming, dqfd]. However, such approaches still require significant interaction to learn policies. Furthermore, off-policy deep RL can be unstable due to the compounding effects of bootstrapping value learning and function approximation [ross2014reinforcement, fujimoto2018off, bhatt2019crossnorm, achiam2019towards]. Other methods take the perspective of Batch Reinforcement Learning to try and leverage arbitrary off-policy data for policy learning without collecting additional experience [fujimoto2018off, kumar2019stabilizing, agarwal2019striving, liu2019off, jaques2019way]. While recent efforts in Batch RL have produced successful continuous control policies for simple locomotion domains [fujimoto2018off, kumar2019stabilizing], neither robot manipulation nor diverse demonstration data has been considered.
Goal-directed Reinforcement and Imitation Learning: Recent work has extended reinforcement learning [andrychowicz2017hindsight, pong2018temporal, nachum2018data] and imitation learning [ding2019goal, lynch2019learning] to condition on goal observations, enabling improved sample efficiency and off-policy learning. Similar to the architecture of IRIS, Nachum et al. [nachum2018data] decompose policy learning into a high-level policy that outputs goal observations and a low-level policy that conditions on goals and tries to achieve them, and Lynch et al. [lynch2019learning] learn a goal-conditioned RNN network to imitate sequences from teleoperation data.
Large-Scale Data Collection in Robotics: Self-Supervised Learning has been employed to collect and learn from large amounts of data for tasks such as grasping in both simulated [mahler2017dex, kasper2012kit, goldfeder2009columbia] and physical settings [levine2016learning, pinto2016supersizing, kalashnikov2018qt]. These methods collected hundreds of hours of robot interaction, although most of the interactions were not successful. By contrast, RoboTurk [mandlekar2018roboturk] is a platform that has been leveraged to collect large-scale datasets in simulation via crowdsourced human supervision, resulting in datasets with several successful demonstrations. We show in our experiments that IRIS can leverage such sources of demonstrations for successful policy learning without collecting additional samples of experience.
Every robot manipulation task can be formulated as a sequential decision making problem. Consider an infinite-horizon discrete-time Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma, \rho_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}(\cdot \mid s, a)$ is the state transition distribution, $R(s, a, s')$ is the reward function, $\gamma \in [0, 1)$ is the discount factor, and $\rho_0(\cdot)$ is the initial state distribution. At every step, an agent observes $s_t$, uses a policy $\pi$ to choose an action $a_t = \pi(s_t)$, and observes the next state $s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t)$ and reward $r_t = R(s_t, a_t, s_{t+1})$. The goal in reinforcement learning is to learn a policy $\pi$ that maximizes the expected return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right]$. To use this formulation for robotic task learning, we augment this MDP with a set of absorbing goal states $\mathcal{G} \subset \mathcal{S}$, where each goal state $s_g \in \mathcal{G}$ corresponds to a state of the world in which the task is considered to be solved. Similarly, every state $s_0 \sim \rho_0(\cdot)$ corresponds to a new task instance. To measure task success, we define a sparse reward function $R(s, a, s') = \mathbb{1}[s' \in \mathcal{G}]$. Consequently, maximizing expected returns corresponds to solving a task quickly and consistently.

Next, we formalize the structure of the datasets we aim to leverage for task learning.

Definition 3.1 (Goal-Reaching Trajectories) Let $\tau = (s_0, a_0, r_0, s_1, \ldots, s_N)$ be an $N$-length trajectory in the MDP, where $s_0 \sim \rho_0(\cdot)$ is an initial state from the MDP with rewards $r_t = R(s_t, a_t, s_{t+1})$, and states $s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t)$ produced by the MDP given the actions $(a_0, \ldots, a_{N-1})$. This trajectory is goal-reaching if the last state is a goal state, $s_N \in \mathcal{G}$.

In our setting, we assume access to a dataset of goal-reaching trajectories that has been collected by a set of policies. Our goal is to develop an offline learning algorithm that leverages this large batch of goal-reaching trajectories to learn a policy that maximizes task returns. Importantly, the algorithm cannot collect additional samples of experience in the MDP. Next, we outline some dataset properties that make learning in this setting challenging.

Suboptimal Data: There are no guarantees placed on the quality of data-generating policies - each trajectory may take longer than necessary to solve the task. Equivalently, for a given trajectory $\tau$ in the dataset it is possible that $\sum_{t=0}^{N-1} \gamma^t r_t < \mathbb{E}_{\pi^*}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \mid s_0\right]$, where $\pi^*$ is the optimal policy, so the task return of the demonstrated trajectory is worse than that of the optimal policy. Thus, this is different from the standard setting of imitation learning - the learned policy should not seek to imitate all demonstrated data due to variations in data quality.

Multimodal Data: Since many trajectories are in the dataset and multiple policies were used for generation, data can exhibit multimodality in how task instances are solved. For example, a soda can can be grasped from the top, or knocked down and then picked up on its side.
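To make the link between the sparse reward and "solving a task quickly" concrete, the following minimal sketch computes the discounted return of two goal-reaching trajectories of different lengths. It assumes the reward is collected exactly once, on the transition that enters a goal state; the discount factor of 0.99 and the trajectory lengths are hypothetical values chosen only for illustration.

```python
from typing import List


def sparse_rewards(num_steps: int) -> List[float]:
    """Rewards along a goal-reaching trajectory with `num_steps` transitions.

    Assumption: the reward is 1 only on the final transition, which enters
    a goal state, and 0 on every earlier transition.
    """
    if num_steps < 1:
        raise ValueError("a trajectory needs at least one transition")
    return [0.0] * (num_steps - 1) + [1.0]


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over the trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


GAMMA = 0.99  # hypothetical discount factor, chosen only for illustration

# A 100-step demonstration versus a 600-step demonstration of the same task.
fast_return = discounted_return(sparse_rewards(100), GAMMA)  # GAMMA ** 99
slow_return = discounted_return(sparse_rewards(600), GAMMA)  # GAMMA ** 599
```

Under these assumptions the return depends only on how many steps the trajectory takes to reach the goal, so a slower demonstration always has a strictly lower return than a faster one.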
IV IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control
IRIS consists of a low-level goal-conditioned controller and a high-level goal selection mechanism. The controller is conditioned on goal observations and trained to imitate short sequences of actions from the dataset that reached the corresponding goals. The goal selection mechanism consists of a conditional Variational Autoencoder (cVAE) [kingma2013auto] that samples nearby states to be used as goal proposals, and a value function trained using a variant of Batch-Constrained deep Q-Learning (BCQ) [fujimoto2018off]. Together, these components allow for selective imitation of local sequences in the dataset. The complete training loop is provided in Algorithm 1.
IV-A Low-Level Goal-Conditioned Imitation Controller
The low-level goal-conditioned controller is a goal-conditioned RNN (similar to [lynch2019learning]) trained on trajectory sequences of length $T$ (a hyperparameter). Consecutive state-action sequences $(s_t, a_t, \ldots, s_{t+T-1}, a_{t+T-1}, s_{t+T})$ are sampled from trajectories in the dataset. The last observation in each sequence, $s_{t+T}$, is treated as a goal $s_g$ that the RNN should try to reach, and the RNN is trained to output the action sequence $(a_t, \ldots, a_{t+T-1})$ by treating the observations $(s_t, \ldots, s_{t+T-1})$ as an input sequence, conditioned on the goal observation at every timestep (lines 4-5 in Algorithm 1). The loss function for the RNN is a simple Behavioral Cloning loss $\mathcal{L}_\theta = \sum_{j=t}^{t+T-1} \lVert a_j - \pi_\theta(s_j, s_g) \rVert_2^2$, where $\pi_\theta$ denotes the RNN policy. Consequently, the low-level controller is trained to perform unimodal imitation over short demonstration sequences to reach different goals. At test time, for a start observation $s_t$, the RNN can be used to approximately reach a goal $s_g$ by unrolling the network for $T$ timesteps.
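The sequence sampling and loss described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the policy is a stateless callable standing in for the RNN, the loss is assumed to be a summed squared error, and the names `sample_subsequence`, `bc_loss`, and the toy 1-D trajectory are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Stand-in for the goal-conditioned RNN: maps (observation, goal) to an action.
Policy = Callable[[Vector, Vector], Vector]


def sample_subsequence(
    states: Sequence[Vector],
    actions: Sequence[Vector],
    horizon: int,
    rng: random.Random,
) -> Tuple[List[Vector], List[Vector], Vector]:
    """Sample `horizon` consecutive (state, action) pairs from one trajectory.

    `states` holds one more entry than `actions`. The observation reached
    after the last sampled action is returned as the goal.
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")
    if horizon < 1 or horizon > len(actions):
        raise ValueError("horizon must be between 1 and len(actions)")
    start = rng.randrange(len(actions) - horizon + 1)
    observations = list(states[start:start + horizon])
    action_seq = list(actions[start:start + horizon])
    goal = states[start + horizon]
    return observations, action_seq, goal


def bc_loss(
    policy: Policy,
    observations: Sequence[Vector],
    action_seq: Sequence[Vector],
    goal: Vector,
) -> float:
    """Summed squared error between demonstrated and predicted actions."""
    total = 0.0
    for obs, demo_action in zip(observations, action_seq):
        predicted = policy(obs, goal)
        total += sum((p - a) ** 2 for p, a in zip(predicted, demo_action))
    return total


# Toy 1-D trajectory (hypothetical data): every action moves the state by +1.
states = [[float(i)] for i in range(6)]
actions = [[1.0] for _ in range(5)]
rng = random.Random(0)
observations, action_seq, goal = sample_subsequence(states, actions, 3, rng)
```

Because the goal is always the observation that the demonstrator actually reached at the end of the window, every sampled training example is a goal-reaching sequence for its own goal, regardless of whether the full trajectory was efficient.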
IV-B High-Level Goal Selection Mechanism
The high-level goal selection mechanism chooses goal states for the low-level to try and reach (similar to [nachum2018data]). The goal selection mechanism has two components: (1) a cVAE [kingma2013auto] to propose goal observations at a particular observation and (2) a value function that models the expected return of goal observations.

The cVAE is a conditional generative model that is trained on pairs of current and future observations $(s_t, s_{t+T})$ sampled from trajectories in the dataset (lines 5-7 in Algorithm 1). An encoder $E(s_t, s_{t+T})$ maps a current and future observation to the parameters $(\mu, \sigma)$ of a latent Gaussian distribution $\mathcal{N}(\mu, \sigma)$ and the decoder $D(s_t, z)$ is trained to reconstruct the future observation from the current observation and a latent sampled from the encoder distribution, $\hat{s}_{t+T} = D(s_t, z)$, $z \sim \mathcal{N}(\mu, \sigma)$. The encoder distribution is regularized with a KL-loss with weight $\beta$ [higgins2017beta] to encourage the encoder distribution to match a prior latent distribution $\mathcal{N}(0, I)$ so that at test-time, the decoder can be used as a conditional generative model by sampling latents $z \sim \mathcal{N}(0, I)$ and passing them through the decoder.

The value function consists of a state-action value function $Q(s, a)$ trained using a simple variant of Batch Constrained Q-Learning (BCQ) [fujimoto2018off] (lines 8-12 in Algorithm 1). The loss function for the value function is a modified version of the BCQ update, which maintains a cVAE to model a state-conditional action distribution $p(a \mid s)$ over the dataset, and a Q-network trained with a temporal difference loss, $\mathcal{L}_Q = \left( Q(s_t, a_t) - (r_t + \gamma \hat{Q}) \right)^2$. The target value $\hat{Q}$ is computed by considering a set of action proposals from the cVAE and maximizing the Q-network over the set of actions, $\hat{Q} = \max_{a_i \sim p(a \mid s_{t+1})} Q(s_{t+1}, a_i)$.

At test-time, given an observation $s_t$, a set of candidate goals $\{s_g^i\}$ is proposed by the cVAE. Then, the BCQ model evaluates the value of each goal by computing $V(s_g^i) = \max_{a_j} Q(s_g^i, a_j)$ with action proposals generated by the action cVAE, $a_j \sim p(a \mid s_g^i)$, and chooses the goal $s_g^* = \arg\max_{s_g^i} V(s_g^i)$. This goal is given to the goal-conditioned controller, which is unrolled for $T$ timesteps, after which a new goal is selected by the mechanism.
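The test-time goal selection step can be illustrated with a short sketch. The three callables below are hypothetical stand-ins for the learned components (the goal cVAE decoder, the action cVAE, and the Q-network); here they are simple deterministic toy functions on a 1-D state so the example runs without any learned model.

```python
from typing import Callable, List

Vector = List[float]
GoalProposer = Callable[[Vector, int], List[Vector]]    # stand-in for the goal cVAE
ActionProposer = Callable[[Vector, int], List[Vector]]  # stand-in for the action cVAE
QFunction = Callable[[Vector, Vector], float]           # stand-in for the Q-network


def goal_value(goal: Vector, propose_actions: ActionProposer,
               q_fn: QFunction, num_actions: int) -> float:
    """Value of a goal: the best Q-value over actions proposed at that goal."""
    return max(q_fn(goal, a) for a in propose_actions(goal, num_actions))


def select_goal(obs: Vector, propose_goals: GoalProposer,
                propose_actions: ActionProposer, q_fn: QFunction,
                num_goals: int, num_actions: int) -> Vector:
    """Propose candidate goals from `obs` and return the highest-value one."""
    candidates = propose_goals(obs, num_goals)
    return max(candidates,
               key=lambda g: goal_value(g, propose_actions, q_fn, num_actions))


# Toy stand-ins (all hypothetical): the task is to reach x = 10 on a line.
def toy_goal_proposer(obs: Vector, n: int) -> List[Vector]:
    # Evenly spaced candidate goals centred on the current observation.
    return [[obs[0] + k - n // 2] for k in range(n)]


def toy_action_proposer(state: Vector, n: int) -> List[Vector]:
    # Evenly spaced actions in [-0.5, 0.5]; requires n >= 2.
    return [[-0.5 + k / (n - 1)] for k in range(n)]


def toy_q(state: Vector, action: Vector) -> float:
    # Higher value the closer the next position is to x = 10.
    return -abs(state[0] + action[0] - 10.0)


chosen = select_goal([0.0], toy_goal_proposer, toy_action_proposer, toy_q,
                     num_goals=5, num_actions=3)
```

In a full control loop, the chosen goal would be handed to the low-level controller, which runs for a fixed number of steps before `select_goal` is called again from the new observation.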
V IRIS: Challenges of Purely Offline Data
In this section we elaborate on different properties of the method and how it addresses the challenges in our datasets.

Learning from diverse solution approaches: The goal-conditioned controller is trained to condition on future goal observations at a fine temporal resolution and produce unimodal action sequences. Consequently, it is not concerned with modeling diversity, but rather reproduces small action sequences in the dataset to move from one state to another. Meanwhile, the generative model in the goal selection mechanism proposes potential future observations that are reachable from the current observation - this explicitly models the diversity of solution approaches. In this way, IRIS decouples the problem into reproducing specific, unimodal sequences (policy learning) and modeling state trajectories that encapsulate different solution approaches (diversity), allowing for selective imitation.

Learning from suboptimal data: The low-level goal-conditioned controller operates for a small number of timesteps, so it has no need to account for suboptimal actions. This is because if the goal is to reach a state $s_{t+T}$ from $s_t$, and $T$ is sufficiently small, then a policy would only be able to improve by reaching $s_{t+T}$ in fewer than $T$ steps, which is a negligible improvement for small values of $T$. By contrast, the value learning component of the goal selection mechanism explicitly accounts for suboptimal solution approaches by evaluating the expected task returns of each goal and selecting the goal with the highest return.

Learning from off-policy datasets: Policy learning from arbitrary off-policy data can be challenging [fujimoto2018off, kumar2019stabilizing]. Following prior work, IRIS deals with this issue by constraining learning to occur within the distribution of training data. The goal-conditioned controller directly imitates sequences from the training data, and the generative goal model is also trained to propose goal observations from the training data.
Finally, the value learning component of the goal selection mechanism mitigates extrapolation error by making sure that the Q-network is only queried on state-action pairs that lie within the training distribution [fujimoto2018off].
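The idea of querying the Q-network only on in-distribution actions can be sketched as a batch-constrained temporal difference target. This is a simplified illustration, not the BCQ update as published: the full BCQ algorithm also uses a perturbation network and a pair of Q-networks, both omitted here. The `done` flag, the separate `target_q` callable, and the toy stand-ins are assumptions made for this example.

```python
from typing import Callable, List

Vector = List[float]
ActionProposer = Callable[[Vector, int], List[Vector]]  # stand-in for the action cVAE
QFunction = Callable[[Vector, Vector], float]           # stand-in for a Q-network


def td_target(reward: float, next_state: Vector, done: bool, gamma: float,
              propose_actions: ActionProposer, target_q: QFunction,
              num_actions: int) -> float:
    """TD target whose max ranges only over actions proposed by the generative model.

    Assumption: `done` marks a transition into an absorbing goal state, after
    which no further value is bootstrapped.
    """
    if done:
        return reward
    proposals = propose_actions(next_state, num_actions)
    return reward + gamma * max(target_q(next_state, a) for a in proposals)


def td_loss(q_value: float, target: float) -> float:
    """Squared temporal difference error for a single transition."""
    return (q_value - target) ** 2


# Toy stand-ins (hypothetical): two fixed proposals and a linear Q-function.
def toy_proposals(state: Vector, n: int) -> List[Vector]:
    return [[0.0], [1.0]][:n]


def toy_target_q(state: Vector, action: Vector) -> float:
    return state[0] + action[0]


target = td_target(0.0, [1.0], False, 0.9, toy_proposals, toy_target_q, 2)
terminal_target = td_target(1.0, [1.0], True, 0.9, toy_proposals, toy_target_q, 2)
```

The key design point is that the maximization never evaluates an arbitrary action: an action that the generative model would not propose at `next_state` cannot influence the target, which is how overestimated values for unseen state-action pairs are kept out of the update.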
| ||Graph Reach||Robosuite Lift||RoboTurk Cans|
|Model||Success Rate||Rollout Length||Task Return||Success Rate||Rollout Length||Task Return||Success Rate||Rollout Length||Task Return|
|IRIS, no Goal VAE||100% ± 0%||1895 ± 131||151.4 ± 18.9|
|IRIS, no Q||30.7% ± 3.68%||618 ± 38.5||167.9 ± 23.8|
|IRIS (Full Model)||81.3% ± 6.60%||523 ± 29.0||486.0 ± 49.7|
VI Experimental Setup
VI-A Tasks and Datasets
Graph Reach - A Pedagogical Example: We constructed a simple task in a 2D navigation domain where the agent begins each episode at a start location and must navigate to a goal. The start and goal locations are fixed across all episodes. We generate a large, varied dataset by leveraging a 5x5 grid of points to sample random paths from the start location to the goal, and collecting demonstration trajectories by playing noisy, random-magnitude actions to move along sampled random paths. Demonstration paths that deviate from the central path are made to take longer detours before joining the central path again (see Fig. 2). Several varied demonstrated paths are available in the dataset, and only certain parts of each path should be imitated to yield optimal performance. The algorithm needs to be able to recover a policy that follows the straight line path from the start to the goal by choosing to imitate pieces of the demonstrations in the dataset (for example the first, second, and third part of the three paths respectively, in the top right 3 images of Fig. 2). The dataset contains 250 demonstrations with an average completion time of 3844 timesteps.

Robosuite Lift - Suboptimal Demonstrations from a Human: We collected human demonstrations from a single human using RoboTurk [mandlekar2018roboturk] on the Robosuite Lifting task [fan2018surreal]. The goal is to actuate the Sawyer robot arm to grasp and lift the cube on the table. The demonstrator lifted the cube with a consistent grasping strategy, but took their time to grasp the cube, often moving the arm to the cube and then back, or actuating the arm from side to side near the cube, as shown in Fig. 2. This was done intentionally to ensure that there would be several state-action pairs in the dataset with little value. Algorithms need to avoid being misled by the suboptimal paths taken by the demonstrator. The dataset contains 137 demonstrations with an average completion time of 622 timesteps.
RoboTurk Can Pick and Place - Crowdsourced Demonstrations: We leverage the RoboTurk pilot dataset [mandlekar2018roboturk] to train policies on the Robosuite Can Pick and Place task [fan2018surreal]. While the original dataset contained over 1100 demonstrations, we present results on a filtered version consisting of the fastest 225 trajectories. These demonstrations were collected across multiple humans and exhibit significant suboptimality and diversity in the solution approaches. For example, some people chose to grasp the can in an upright position by carefully positioning the gripper above the can while others chose to knock the can over before grasping the can on its side. An example of the latter is shown in Fig. 2. This dataset contains 225 demonstrations with an average completion time of 589 timesteps.
VI-B Experiment Details
We compare IRIS to a Behavioral Cloning (BC) baseline that performs simple regression over state-action pairs in the dataset, a Recurrent Neural Network (RNN) variant of Behavioral Cloning that we call BC-RNN, and a Batch-Constrained Q-Learning (BCQ) baseline, which is a state-of-the-art Batch Reinforcement Learning algorithm for continuous control [fujimoto2018off]. We also compare against two variants of IRIS to evaluate the utility of each component - a version with no Q-function at the high-level (goal selection occurs by simply sampling the Goal VAE) and a version where a deterministic goal prediction network is used in lieu of the VAE (simple regression is used to train this network). We emphasize that all training is offline - no algorithm is allowed to collect additional samples of experience.
Our experiments answer the following questions. (1) Can IRIS successfully recover a performant policy by selectively imitating pieces of a varied set of demonstrations? (2) What benefits does our two-level decomposition provide for learning manipulation policies from diverse demonstration data? (3) How much data is necessary to train policies successfully on these tasks?

To answer (1), we present quantitative results across all datasets and baselines in Table I and also investigate qualitative model performance on the Graph Reach dataset in Fig. 4. Table I shows that while all models are able to solve the Graph Reach task consistently, variants of IRIS and BCQ are able to solve the task faster. To verify that IRIS is indeed imitating useful portions of demonstrated trajectories, we plot trajectories taken by the best BC (red), BCQ (green) and IRIS (orange) model in Fig. 4 and compare them to trajectories in the dataset (shown in blue). The plot demonstrates that our model has the capacity to imitate several different demonstrated modes from the dataset and leverage them to reach the goal quickly, while BCQ extrapolates to unseen states to reach the goal and attain similar performance. While this type of extrapolation is acceptable in a toy environment, this behavior can be harmful in more complex robot manipulation tasks such as our Lift and Can tasks. This toy dataset shows that IRIS is able to reproduce multimodal behaviors in the dataset and selectively interpolate between them to solve a task efficiently.

Next, we answer (2) by considering the more challenging Lift and Cans manipulation datasets. As Table I and the left two plots of Fig. 3 show, there is a stark contrast in performance between variants of IRIS and baselines. Our models achieve success rates of 70-80% and 20-30% on the Lift and Cans tasks respectively while baseline models can only attain 18% on the Lift task, and fail to solve the Cans task at all.
The only difference between the BC-RNN model and IRIS, no Goal VAE is that IRIS conditions the RNN on goal observations and these goal observations are generated at test-time by a network that was trained to predict observations $T$ timesteps into the future. The large performance gap between these two models implies that goal-directed imitation, which the baselines lack, is critical to deal with the multimodality in these datasets, and helps facilitate faithful imitation. Allowing for diverse goal predictions also significantly improves performance - IRIS, no Q achieves a 10% higher success rate than IRIS, no Goal VAE on the Cans dataset by replacing a deterministic goal prediction with a VAE. Finally, although using the value network for goal selection did not improve performance on the Cans dataset, using value selection allowed significant improvement on the Lift task. We hypothesize that the value function helps avoid situations where the demonstrator moved away from the cube or drifted from side to side on the Lift dataset by choosing goals that lead the arm closer to the cube. In summary, our decomposition allows behaviors from the demonstrations to be reproduced over an extended period of time while simultaneously allowing the high-level component flexibility in dictating which behaviors should be reproduced.

Finally, we answer (3) by training IRIS on smaller subsets of the manipulation datasets - small datasets consisting of the best 10% of the trajectories (in terms of completion time) and medium datasets consisting of the best 50% of the trajectories. The right two plots in Fig. 3 depict learning curves for IRIS on these datasets. The smaller-sized datasets lead to poor performance but the medium-sized Lift dataset has the same asymptotic performance as the full dataset. By contrast, the medium-sized Cans dataset restricts performance significantly.
This shows that for tasks with greater variation in task instance, IRIS benefits from having more data in the dataset.
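The small and medium subsets used for question (3) select trajectories by completion time. A minimal sketch of such a selection is given below; the function name `fastest_fraction`, the rounding rule, and the example lengths are hypothetical, since the text does not specify how subset sizes were rounded.

```python
from typing import List, Sequence


def fastest_fraction(lengths: Sequence[int], fraction: float) -> List[int]:
    """Indices of the fastest `fraction` of trajectories, shortest first.

    Assumption: the subset size is len(lengths) * fraction rounded to the
    nearest integer, with at least one trajectory always kept.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(round(len(lengths) * fraction)))
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return order[:keep]


# Hypothetical completion times (in timesteps) for ten demonstrations.
completion_times = [622, 410, 980, 555, 730, 300, 845, 512, 467, 690]

small_subset = fastest_fraction(completion_times, 0.10)
medium_subset = fastest_fraction(completion_times, 0.50)
```

Selecting by completion time keeps the demonstrations with the highest task return under a sparse completion reward, at the cost of discarding some of the diversity in initial conditions and solution approaches.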
We introduced IRIS, a framework for offline learning from a large set of diverse and suboptimal demonstrations that operates by selectively imitating local sequences from the dataset. We demonstrated that IRIS recovers performant policies from large manipulation datasets and significantly outperforms other baselines due to our decomposition of the problem into goal-conditioned imitation and a high-level goal selection mechanism. Future work will focus on extending and evaluating the framework on other manipulation tasks with increasing levels of complexity and longer time horizons.
Ajay Mandlekar acknowledges the support of the Department of Defense (DoD) through the NDSEG program. The authors would like to thank members of the NVIDIA Seattle Robotics Lab for several helpful discussions and feedback.