Towards Online Learning from Corrective Demonstrations
Abstract
Robots operating in real-world human environments will likely encounter task execution failures. To address this, we would like to allow co-present humans to refine the robot’s task model as errors are encountered. Existing approaches to task model modification require reasoning over the entire dataset and model, limiting the rate of corrective updates. We introduce the State-Indexed Task Updates (SITU) algorithm to efficiently incorporate corrective demonstrations into an existing task model by iteratively making local updates that only require reasoning over a small subset of the model. In future work, we will evaluate this approach with a user study.
Reymundo A. Gutierrez Department of Computer Science University of Texas at Austin Elaine Schaertl Short Department of Electrical and Computer Engineering University of Texas at Austin
Scott Niekum Department of Computer Science University of Texas at Austin Andrea L. Thomaz Department of Electrical and Computer Engineering University of Texas at Austin
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
1 Introduction
The dynamic and unstructured nature of real-world human environments precludes the modeling of all possible failure modes and their associated recovery policies. To address this issue, we would like to allow co-present humans to correct the robot’s behavior as execution failures are encountered. In particular, we are interested in developing techniques that enable naive users to provide corrections and improve the robot’s performance over time.
Much prior work has separately investigated the learning and refinement of individual primitives (?; ?; ?; ?; ?) and the sequencing of learned primitives (?). Recent efforts have focused on learning task models by jointly reasoning over action primitives and their sequencing (?; ?; ?). However, these methods scale poorly: they require reasoning over the entire collected dataset or the entire current model, which limits the rate of corrective updates.
In this work, we present State-Indexed Task Updates (SITU), a new method that uses corrective demonstrations to efficiently update task models represented as a finite-state automaton (FSA) by making iterative local updates that only require reasoning over a small subset of the task model. Given a segmented corrective demonstration, SITU first uses the world state to select the relevant subset of the task model. Then, SITU determines whether the demonstration segment is represented in the selected model subset, retraining and rebuilding model components as needed. This process repeats until the final segment of the demonstration has been processed.
2 Related Work
Much prior work has explored the topic of model refinement from corrections or subsequent demonstrations for individual primitives. Akgun et al. (2012) introduced a method for learning keyframe-based models of skills, and an interaction model to add/remove keyframes in the learned model in subsequent interactions. In work by Sauser et al. (2012), a teacher can provide tactile feedback to a robot during playback of a learned skill to adapt or correct object grasping behaviors. Jain et al. (2013) introduced a method for iterative improvement of trajectories in order to incorporate user preferences, with new demonstrations providing information regarding task constraints. A similar approach was explored by Bajcsy et al. (2018), with user demonstrations being used to modify only one task constraint dimension at a time. Our approach seeks to simultaneously update many primitives and overall task structure from a set of extended demonstrations.
Various approaches have addressed the problem of updating both primitives and task structure. Kappler et al. (2015) separate the two by learning a set of associative skill memories (ASMs) and manually specifying a manipulation graph that sequences the ASMs. New ASMs are learned to incorporate recovery behaviors as needed and attached to the manipulation graph through user specification. Niekum et al. (2015) incrementally learn a finite-state automaton (FSA) for a task by segmenting demonstrations with a beta-process autoregressive hidden Markov model (BP-AR-HMM) and constructing a graph using the segment sequences. Classifiers are trained at branching points to determine which primitives to execute. Corrections can be provided after execution failure and incorporated into the graph by resegmenting demonstrations and rebuilding the FSA. Gutierrez et al. (2018) provide a taxonomy of edit types on an FSA and learn corrections by instantiating models assuming a set of candidate corrections of each type. A state-based transitions autoregressive hidden Markov model (STARHMM) (Kroemer et al. 2015) is then trained on each of the candidate corrections using the provided corrective demonstrations, with the best model update selected according to a modified AIC score. Each of the above approaches presents scalability problems, particularly for the online case, as they require either manual specification or evaluation on the entire dataset or model. Our approach leverages state information and execution history to limit the number of primitives to consider at any time step.
Mohseni-Kabir et al. (2018) simultaneously learn task hierarchy and primitives through interactive demonstrations. The demonstrator’s narration is leveraged to identify boundaries between primitives, and a set of heuristics is used to query the demonstrator regarding the grouping of primitives. This approach allows for online updates to the model. However, its applicability is limited as it requires the abstract structure of the objects used (parts relations), full geometry to explore and learn constraints, and the retargeting of motion-captured human demonstrations onto the robot platform. Our approach gathers demonstrations directly on the robot and relies only on visual features to construct and update its models.
3 Algorithm
State-Indexed Task Updates (SITU) provides a method for incorporating corrective demonstrations, defined in this work as a partial demonstration given at or near execution failure. For any given task model, the corrective demonstration is a combination of modeled and unmodeled segments. Thus, incorporating the corrective demonstration into the existing task model requires determining which segments are unmodeled and adding/updating the associated components.
3.1 Problem Formulation
We model a task as a finite-state automaton (FSA), with nodes representing primitives and edges representing valid transitions between primitives. In this work, a node consists of a policy model ($\pi$) describing when to take which action and an initiation classifier ($c$) describing where a primitive can begin. More concretely, for each node $n_i$:
(1) $n_i = (\pi_i, c_i), \quad \pi_i : s \mapsto a, \quad c_i : s \mapsto \{0, 1\}$
where $a$ is the selected action and $s$ is the observed state.
The edges of the graph are encoded in the following two functions:
$\mathrm{parents}(n_i)$: returns the parents of node $n_i$
$\mathrm{children}(n_i)$: returns the children of node $n_i$
A full policy execution is computed by repeatedly selecting a primitive and executing its policy. Primitive selection follows the structure of the graph: if the current primitive is $n_i$, the most likely primitive from the set of its children, $\mathrm{children}(n_i)$, is selected as the next primitive. Using this task model, the problem of corrective updates reduces to determining which policies and classifiers need to be updated and how the corrective demonstrations progress through the FSA (i.e., the ordered set of traversed edges).
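The task model described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the class names (`Node`, `TaskFSA`), the list-of-floats state, the string action labels, and the toy confidence functions are all assumptions introduced here. "Most likely" is interpreted as the child with the highest initiation-classifier confidence.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

State = List[float]  # hypothetical state representation


@dataclass
class Node:
    """One primitive: a policy plus an initiation classifier."""
    name: str
    policy: Callable[[State], str]        # observed state -> action label
    confidence: Callable[[State], float]  # classifier confidence in [0, 1]


@dataclass
class TaskFSA:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: Set[Tuple[str, str]] = field(default_factory=set)  # (parent, child)

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def add_edge(self, parent: str, child: str) -> None:
        self.edges.add((parent, child))

    def parents(self, name: str) -> List[str]:
        return sorted(p for p, c in self.edges if c == name)

    def children(self, name: str) -> List[str]:
        return sorted(c for p, c in self.edges if p == name)

    def next_primitive(self, current: str, state: State) -> Optional[str]:
        """Pick the child whose initiation classifier is most confident."""
        candidates = self.children(current)
        if not candidates:
            return None
        return max(candidates, key=lambda n: self.nodes[n].confidence(state))


# Toy task: after reaching, either remove a lid or pour.
fsa = TaskFSA()
fsa.add_node(Node("reach", lambda s: "move", lambda s: 1.0))
fsa.add_node(Node("remove_lid", lambda s: "lift", lambda s: s[0]))
fsa.add_node(Node("pour", lambda s: "tilt", lambda s: 1.0 - s[0]))
fsa.add_edge("reach", "remove_lid")
fsa.add_edge("reach", "pour")

# state[0] stands in for "lid present" (1.0) or "lid absent" (0.0).
print(fsa.next_primitive("reach", [1.0]))
print(fsa.next_primitive("reach", [0.0]))
```

Storing edges as a set of (parent, child) pairs keeps both lookups simple; a larger model would likely keep adjacency lists instead.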
The approach presented here is not limited to particular policy or classifier models. However, we require that the classifiers provide a measure of classification confidence. For the policies, we require that the following operator is available for the selected policy representation:
(2) $\mathrm{FindPolicy}(d, \Pi) \rightarrow i \ \text{or} \ \emptyset$
where $d$ is a demonstration segment and $\Pi$ is a set of policies. Given the policy set, FindPolicy returns the index of the policy that the demonstration segment is an instance of, or returns empty if no such policy exists. The implementation details of FindPolicy are entirely dependent on the policy specification.
3.2 State-Indexed Task Updates
After segmenting the demonstration, SITU iterates over the demonstration segments and their associated start states (Lines 3–5 of Algorithm 1) and applies local updates to the FSA task model given the last executed primitive $n$, the current state $s$, the current demonstration segment $d$, and the set of applicable primitives $N_a$ (Lines 6–7 of Algorithm 1). A primitive is considered applicable if its classifier is activated by the current state, which restricts the set of primitives used in subsequent steps.
The local update algorithm is outlined in Algorithm 2. It makes use of a few helper functions that we define here. Cluster clusters the underlying demonstration segments in the given nodes and returns a new set of primitives retrained on the clustered data. LocalReconnect removes the old set of primitives from the task model and connects the new primitives according to the primitive traversal history of the underlying data. $\Pi(N)$ returns the ordered set of policies for the set of nodes $N$. $S(N)$ returns the state samples used to train the classifiers of nodes $N$. FitPolicy and FitClassifier train new policy and classifier models, respectively. UpdatePolicy and UpdateClassifier update existing policy and classifier models given new data, respectively. Finally, AddNode and AddEdge add new nodes and edges to the task model, respectively.
LocalUpdate takes as input the current primitive $n$, the current state $s$, the current demonstration segment $d$, the set of applicable primitives $N_a$, and the task model $G$. Using the FindPolicy operator, LocalUpdate determines if the current segment is an instance of one of the primitives in $N_a$ (Line 4 of Algorithm 2). If the current segment is not an instance of one of the primitives in $N_a$, a new primitive is created and connected to the existing FSA (Lines 5–12 of Algorithm 2). If the current segment is an instance of one of the primitives in $N_a$, that primitive is retrained with the new data the segment provides (Lines 14–20 of Algorithm 2).
The above description of Algorithm 2 left out the role of Lines 2 and 3. The FSA update algorithm as described so far can result in primitives that execute similar actions on similar regions of the state space. While this is not necessarily an impediment to successful execution, ideally we would like such redundant instances to be rolled into one primitive. Lines 2 and 3 of Algorithm 2 accomplish this by reconstructing the applicable primitive models with Cluster, effectively merging similar policy models. Because this clustering step is limited to only a small subset of the nodes, the FSA can be quickly updated with LocalReconnect. This is equivalent to modifying the first primitive in the redundant set with the data of the newer primitives.
In previous work, we defined a taxonomy of edit types that can be made to an FSA: node addition, edge addition, and node modification (Gutierrez et al. 2018). It is important for an algorithm designed to refine FSAs to perform these edit types, as they represent the learning of new skills, the learning of new skill transitions, and the refinement of previously learned skills. The above algorithms span the outlined edit types in the following manner. Node addition occurs during a local update when no sufficient policy model for the current demonstration segment is found (Lines 6–12 of Algorithm 2). Edge addition is performed in Lines 12 and 20 of Algorithm 2. The edge addition is mediated through the indexing of primitives via their initiation classifiers (Line 6 of Algorithm 1) and policy matching through the FindPolicy function (Line 4 of Algorithm 2). Finally, node modification occurs indirectly through the clustering (Line 2) and reconnecting (Line 3) steps of Algorithm 2, as described in the previous paragraph.
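The control flow of the outer SITU loop and the local update can be sketched as follows. This is a simplified illustration under several stated assumptions, not a transcription of Algorithms 1 and 2: states and keyframes are one-dimensional floats, a "policy" is just the stored training segments, `find_policy` matches on the mean keyframe value within a hypothetical tolerance, the classifier is a hypothetical distance check against stored start states, and the Cluster and LocalReconnect merging steps are omitted.

```python
from typing import Dict, List, Optional, Set, Tuple

Segment = List[float]  # toy 1-D keyframe sequence (assumption)
State = float          # toy 1-D world state (assumption)


class TaskModel:
    def __init__(self) -> None:
        self.segments: Dict[int, List[Segment]] = {}  # node -> policy data
        self.starts: Dict[int, List[State]] = {}      # node -> classifier data
        self.edges: Set[Tuple[int, int]] = set()

    def add_node(self, seg: Segment, s: State) -> int:
        node = len(self.segments)
        self.segments[node] = [seg]
        self.starts[node] = [s]
        return node


def seg_mean(segs: List[Segment]) -> float:
    flat = [x for seg in segs for x in seg]
    return sum(flat) / len(flat)


def activated(model: TaskModel, node: int, s: State, radius: float = 1.0) -> bool:
    """Stand-in initiation classifier: near any stored start state."""
    return any(abs(s - t) <= radius for t in model.starts[node])


def find_policy(seg: Segment, policies: List[List[Segment]],
                tol: float = 0.5) -> Optional[int]:
    """Return the index of the closest matching policy, or None."""
    best, best_d = None, tol
    for i, segs in enumerate(policies):
        d = abs(seg_mean([seg]) - seg_mean(segs))
        if d <= best_d:
            best, best_d = i, d
    return best


def local_update(model: TaskModel, last: Optional[int], s: State,
                 seg: Segment, applicable: List[int]) -> int:
    idx = find_policy(seg, [model.segments[n] for n in applicable])
    if idx is None:
        node = model.add_node(seg, s)        # node addition
    else:
        node = applicable[idx]               # node modification
        model.segments[node].append(seg)
        model.starts[node].append(s)
    if last is not None:
        model.edges.add((last, node))        # edge addition (no-op if present)
    return node


def situ(model: TaskModel, segments: List[Segment],
         start_states: List[State], last: Optional[int] = None) -> Optional[int]:
    for seg, s in zip(segments, start_states):
        # Index into the model by state: only activated primitives are considered.
        applicable = [n for n in model.segments if activated(model, n, s)]
        last = local_update(model, last, s, seg, applicable)
    return last


model = TaskModel()
situ(model, [[0.0, 0.2], [5.0, 5.2]], [0.0, 5.0])   # initial demonstration
situ(model, [[0.1, 0.3], [9.0, 9.1]], [0.1, 5.1])   # corrective demonstration
print(sorted(model.edges))
```

In this toy run the first corrective segment matches the existing first primitive and updates it, while the second segment matches nothing among the applicable primitives and becomes a new node connected from the first. Only the primitives activated by the current state are ever compared against a segment, which is the source of the locality.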
4 Keyframe-Based Algorithms
Prior work has shown that novice teachers can more successfully demonstrate tasks by providing keyframe demonstrations as opposed to trajectory demonstrations (?). Keyframes produce lower-noise demonstrations and achieve more consistent results. Motivated by this, we instantiate SITU for keyframe-based policies.
4.1 Segmentation
A variety of algorithms are available to produce the demonstration segments required as input to SITU. While we focus on keyframe demonstrations and utilize keyframes as segmentation points in this work, the presented approach can be used with any segmentation algorithm. However, we note that the overall performance is dependent on the consistency of segmentation across demonstrations.
One approach for segmenting keyframe demonstrations is to produce a trajectory for each demonstration and use any of the several existing trajectory segmentation algorithms. Another approach, used in this work, is to use changes in the reference object as segmentation points. We assume in this work that the reference object is provided by the demonstrator. This can be obtained from demonstration narration (e.g., “get the cup”) and has been accomplished in prior work (?). Thus, each segment produced by this approach is a sequence of keyframes.
4.2 Primitive Specification
The primitives consist of a policy model and a classifier model. For the classifier, we use a logistic regression model. Under this model, classification confidence is given in the form of a probability. The policy takes the form of a hidden Markov model (HMM) with multivariate Gaussian emissions over the end-effector pose relative to the reference object. The HMM is trained on a set of keyframe sequences, with each hidden state interpreted to correspond to an underlying ‘true’ keyframe. The policy is executed by sampling a keyframe trajectory and planning between subsequent keyframes.
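To illustrate how a logistic regression classifier yields a confidence and how that confidence could index the applicable primitives, consider the following sketch. The weights, biases, primitive names, and the 0.5 activation threshold are all hypothetical values chosen for illustration; in practice the parameters would be fit to the start states of each primitive's training segments.

```python
import math
from typing import Dict, List, Tuple

Params = Tuple[List[float], float]  # (weights, bias), assumed already trained


def confidence(weights: List[float], bias: float, state: List[float]) -> float:
    """Logistic regression probability that a primitive can start in `state`."""
    z = bias + sum(w * x for w, x in zip(weights, state))
    return 1.0 / (1.0 + math.exp(-z))


def applicable(classifiers: Dict[str, Params], state: List[float],
               threshold: float = 0.5) -> List[str]:
    """Primitives whose classifier is activated by the current state."""
    return [name for name, (w, b) in classifiers.items()
            if confidence(w, b, state) >= threshold]


# Hypothetical parameters for two primitives over a 2-D visual feature state.
classifiers = {
    "pour": ([2.0, 0.0], -1.0),
    "remove_lid": ([0.0, 3.0], -1.5),
}
state = [1.0, 0.0]
for name, (w, b) in classifiers.items():
    print(name, round(confidence(w, b, state), 3))
print(applicable(classifiers, state))
```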
4.3 Find Policy
Algorithm 2 specifies separate Cluster and FindPolicy steps. For our keyframe instantiation of SITU, we combine them. In order to find the policy of the demonstration segment $d$ in the policy set $\Pi$, we first train an HMM over the trajectory that results from linearly interpolating between the keyframes of $d$. We then cluster this new HMM and the HMM policies in $\Pi$ using the average KL divergence between policy pairs as the distance function. The cluster that contains the HMM of $d$ determines its policy membership. Each cluster is then used to train a new primitive, and the new primitives are connected to the existing graph according to the primitive traversal history contained in the training data.
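The combined clustering and matching step can be illustrated with a deliberately simplified stand-in. The KL divergence between two HMMs has no closed form and is typically estimated, so this sketch replaces each HMM with a single univariate Gaussian, for which the divergence is exact. The averaging of the two directed divergences, the single-link merging rule, and the threshold are assumptions made here for illustration and are not taken from the paper.

```python
import math
from typing import List, Tuple

Gaussian = Tuple[float, float]  # (mean, std), a stand-in for an HMM policy


def kl(p: Gaussian, q: Gaussian) -> float:
    """Closed-form KL(p || q) for univariate Gaussians."""
    (mp, sp), (mq, sq) = p, q
    return math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5


def avg_kl(p: Gaussian, q: Gaussian) -> float:
    """Average of the two directed divergences, used as a distance."""
    return 0.5 * (kl(p, q) + kl(q, p))


def cluster(models: List[Gaussian], threshold: float) -> List[int]:
    """Single-link clustering: merge any pair closer than `threshold`."""
    labels = list(range(len(models)))
    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            if avg_kl(models[i], models[j]) <= threshold:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels


# Two existing policies plus a model fit to the new demonstration segment.
policies = [(0.0, 1.0), (5.0, 1.0)]
new_segment_model = (0.2, 1.0)
labels = cluster(policies + [new_segment_model], threshold=1.0)
print(labels)
```

The label shared by the new segment's model and an existing policy gives the segment's policy membership; a label shared with no existing policy would correspond to FindPolicy returning empty.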
5 Proposed Evaluation
The described algorithms will be evaluated with a user study. Participants with no prior experience teaching a robot will be recruited from the local university campus. The participants will first familiarize themselves with keyframe teaching by performing a few simple exercises. After this practice session, they will provide demonstrations for the completion of a set of tasks. Examples include pouring from a pitcher and scooping contents from one bowl to another. The latter is shown in Figure 1. The tasks will then be modified such that a correction is needed for successful execution (e.g. adding a lid removal step to the pouring task). The task respecifications will be structured to span the set of edit types outlined in Section 3.2.
The performance will be evaluated by comparing the number and length of the demonstrations needed to successfully correct the task models against the number and length of demonstrations needed to construct a successful task model from scratch. The models constructed from all collected demonstrations (original+correction) will also be compared against the models that result from the separate corrective steps.
6 Conclusion
Robots operating in unstructured environments will likely encounter task execution failures. To address this, we would like to allow co-present humans to correct the robot’s behavior as errors are encountered. Existing approaches to task model modification require reasoning over the entire dataset and/or model, which limits the rate of corrective updates and presents interactive bottlenecks. We introduce the State-Indexed Task Updates (SITU) algorithm to efficiently incorporate corrective demonstrations into an existing task model by iteratively making local updates that only require reasoning over a small subset of the task model. In future work, we will evaluate this approach with a user study.
Acknowledgments
This work has taken place in the Socially Intelligent Machines (SIM) lab and the Personal Autonomous Robotics Lab (PeARL) at The University of Texas at Austin. This research is supported in part by the National Science Foundation (IIS-1638107, IIS-1724157) and the Office of Naval Research (#N00014-14-1-0003).
References
[Akgun and Thomaz 2016] Akgun, B., and Thomaz, A. 2016. Simultaneously learning actions and goals from demonstration. Autonomous Robots 40(2):211–227.
[Akgun et al. 2012] Akgun, B.; Cakmak, M.; Jiang, K.; and Thomaz, A. L. 2012. Keyframe-based learning from demonstration. International Journal of Social Robotics 4(4):343–355.
[Bajcsy et al. 2018] Bajcsy, A.; Losey, D. P.; O’Malley, M. K.; and Dragan, A. D. 2018. Learning from physical human corrections, one feature at a time. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction - HRI ’18, 141–149. ACM Press.
[Gutierrez et al. 2018] Gutierrez, R. A.; Chu, V.; Thomaz, A. L.; and Niekum, S. 2018. Incremental task modification via corrective demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 1126–1133. IEEE.
[Jain et al. 2013] Jain, A.; Wojcik, B.; Joachims, T.; and Saxena, A. 2013. Learning trajectory preferences for manipulators via iterative improvement. In Advances in Neural Information Processing Systems, 575–583.
[Kappler et al. 2015] Kappler, D.; Pastor, P.; Kalakrishnan, M.; Wüthrich, M.; and Schaal, S. 2015. Data-driven online decision making for autonomous manipulation. In Robotics: Science and Systems.
[Kroemer et al. 2015] Kroemer, O.; Daniel, C.; Neumann, G.; Van Hoof, H.; and Peters, J. 2015. Towards learning hierarchical skills for multi-phase manipulation tasks. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, 1503–1510. IEEE.
[Mohseni-Kabir et al. 2018] Mohseni-Kabir, A.; Li, C.; Wu, V.; Miller, D.; Hylak, B.; Chernova, S.; Berenson, D.; Sidner, C.; and Rich, C. 2018. Simultaneous learning of hierarchy and primitives for complex robot tasks. Autonomous Robots 1–16.
[Niekum et al. 2015] Niekum, S.; Osentoski, S.; Konidaris, G.; Chitta, S.; Marthi, B.; and Barto, A. G. 2015. Learning grounded finite-state representations from unstructured demonstrations. The International Journal of Robotics Research 34(2):131–157.
[Sauser et al. 2012] Sauser, E. L.; Argall, B. D.; Metta, G.; and Billard, A. G. 2012. Iterative learning of grasp adaptation through human corrections. Robotics and Autonomous Systems 60(1):55–71.