Do What I Want, Not What I Did:
Imitation of Skills by Planning Sequences of Actions*
Abstract
We propose a learningfromdemonstration approach for grounding actions from expert data and an algorithm for using these actions to perform a task in new environments. Our approach is based on an application of samplingbased motion planning to search through the tree of discrete, highlevel actions constructed from a symbolic representation of a task. Recursive samplingbased planning is used to explore the space of possible continuousspace instantiations of these actions. We demonstrate the utility of our approach with a magnetic structure assembly task, showing that the robot can intelligently select a sequence of actions in different parts of the workspace and in the presence of obstacles. This approach can better adapt to new environments by selecting the correct highlevel actions for the particular environment while taking human preferences into account.
I Introduction
Learning from demonstration has emerged as a useful paradigm to teach robots the skills they need to interact with the real world. The challenge in learning from demonstration is to generalize what is learned to new contexts and new tasks. Consider a moderately complex task such as assembling part of a structure, shown in Fig. 1 and defined by the PDDL in Fig. 3. The precise movements and the particular movement goals and parameters will vary from one situation to the next. When attempting to execute this task in a new environment, the robot must be able to select the particular actions, motions, and manipulation goals that will allow completion of the task in this new environment. By exploiting learned models for actions, we are able to demonstrate a planner that is able to produce solutions for performing tasks in an effective manner, is able to improve with additional demonstration data, and can adapt to new circumstances.
Adapting to new environments in the context of task and motion planning poses several challenges when attempting to generalize learned actions. Recently there has been significant progress in integrating symbolic task planning and continuous motion planning [1, 2, 3, 4, 5, 6], which have in the past evolved as two separate fields. At the same time, learning from demonstration has been established as a powerful tool for learning models of individual actions [7, 8]. Learning from demonstration has previously been connected to symbolic task planning [9, 10, 11], but these approaches are yet to be incorporated in the context of a motionconstrained task planning in a principled manner. This paper aims to address this gap.
In our approach, probabilistic models over features associated with each action are learned from human demonstrations and later refined in a supervised manner using additional robotgenerated examples scored by a human teacher. At the core of this approach lies a mapping from symbolic actions (e.g. approach, grasp) to physical motions encoded probabilistically as a distribution over observed features along each motion trajectory. Fig. 2 shows this relationship: multiple demonstrations connect predicate states, which allow us to learn a model of each action. Features are defined as a set of functional relations between the robot and its environment (e.g. relative position and orientation between robot endeffector and desired object to grasp).
Planning in a new environment is accomplished by updating each action distribution to remain as close as possible to the prior while satisfying new environment constraints such as different obstacles and object shapes. This is accomplished through importance sampling and optimal distribution reestimation using the crossentropy method [12, 13]. Transitions from symbolic states to actions are similarly encoded as a discrete probability distribution representing the “preference” of executing different actions. A product model is induced over a complete task from the sequence of probabilistic action models, together with discrete transition models. Planning a complete task then corresponds to optimally updating this model to reproduce the prior and satisfy the new scenario.
The contributions of this paper are: (1) a new method for reproducing demonstrated actions in novel environments, derived from samplingbased motion planning; (2) an algorithm for combining these learned actions for executing multistep tasks with multiple valid plans; and (3) experimental validation of this algorithm on a simple assembly task as shown in Fig. 1. Experiments in a 2D Android game domain were omitted for reasons of space.
Ii Related Work
Prior work exists in describing the relationship between high and lowlevel actions and in learning representations of actions from demonstration, but does not combine learning with action selection and motion planning.
ObjectAction Complexes (OACs) have been proposed as a way of formalizing actions unifying perception and learning that can be associated with learned lowlevel actions, and sequenced based on predicate effects by a symbolic planner [10]. The proposed method is similar to work such as [14], which grounded PDDL position predicates with Gaussian Mixture Models, and [11], which associated Dynamic Movement Primitives (DMPs) for particular actions with expected visual features.
Probabilistic models are commonly used in imitation learning, e.g. [15, 16]. Dynamic Movement Primitives (DMPs) are a policy representation that has proven useful for modeling lowlevel actions from demonstration as a set of dynamical systems [17]. Prior work has added object avoidance to these methods through potential fields [18, 19] or through reinforcement learning [20].
Pastor et al. used Path Integral Policy Improvement with DMPs and multiple human demonstrations to learn a model of expected features when executing two robotic tasks in [21]: shooting pool and flipping over a box with a pair of chopsticks. This method was expanded upon by Stulp et al., who proposed Path Integral Policy Improvement with Covariance Matrix Adaptation [22]. These techniques are closely related to the CrossEntropy Method for motion planning [13] from which we draw inspiration.
Our work is also related to the method proposed by Engbert et al. use the KL divergence between an expert demonstration and trajectories sampled from a Gaussian Process forward model to optimize imitation learning policies [23]. Similarly in [24] the authors propose a method for inverse reinforcement learning based on minimization of relative entropy. In addition, the proposed approach can be thought of as a parameterized set of actions; this has been shown to improve performance on policy learning in Markov Decision Processes [25].
Other work combines learned motion primitives into state machines for execution, but in a purely reactive way: selecting only the next action, rather than the next sequence. Examples include Niekum et al. [26], who build a task plan from unstructured demonstrations. Manschitz et al. [27] learned classifiers to determine the next action when sequencing motion primitives. Work by Kappler et al. [28] uses Associative Skill Memories to perform dexterous tasks.
Symbolic task planning and motion planning have commonly been integrated through algorithms that “fill in the gaps” in symbolic plans with callouts to continuousspace motion planners. Recent work in combined task and motion planning include work by by Plaku et al. [1], by Shivashankar et al [3], by Wolfe et al [2], and by Lagriffoul et al. [4]. These works do not analyze actions with a wide variety of goals and cost functions, focusing instead on exploration and pickandplace tasks. Similarly, Srivastava et al. [29] efficiently integrate task planning with continuousspace reasoning about goal positions, but still rely on callouts to a traditional motion planner to instantiate trajectories. Work by Toussaint describes a hierarchal approach for integrated task and motion planning that first examines feasible end states before optimizing kinematics and motion planning [30]. Unlike these methods, our approach jointly optimizes sequences of trajectories by adaptively allocating trajectory simulations to different actions. However, the proposed approach suggests directions for future work in improving efficiency in complex domains.
Iii Task Description
We assume existence of (1) a symbolic description of a task, and (2) labeled training data associating features with each lowlevel action that can appear in this domain. The symbolic description naturally decomposes the task into a sequence of predicate world states For the structure assembly task, part of the symbolic description is shown in Fig. 3. A world state is then defined as a combination of predicates. In turn, actions are the connections between these predicate states as shown in Fig. 2. Each in a given task is represented as a probability distribution over a set of features associated with a successful instantiation of a skill in a new environment given .
The features are denoted by and defined using the function through relationship where denotes time in the action starting at and ending at , is the robot state, and are the applied controls. With these definitions, a probabilistic model associated to each action is denoted by and is computed using unsupervised learning from expert demonstrations, typically assuming a parametric density . A joint model of a task consisting of multiple actions can be constructed using a density assuming conditional independence between actions.
Specific features are derived from the PDDL description of the task. For example, in Fig. 3, the approach action describes the arm moving to pick up a link object without knocking it over. In this case would return the relative position, orientation, and velocity between the robot end effector and the link object. To use the proposed method, one would provide the identifier for an action and a list of associated symbols from perception.
An optimal task is a sequence of actions that takes the robot from the initial state to goal that have the highest probability given our expert model, while also avoiding hard constraints such as collisions and joint limits:
(1)  
(2) 
Our goal is to learn a stochastic “symbolic” policy over the sequence of predicate states, as well as continuousspace “physical” policy generating trajectories for each action . We represent trajectories using parameters , where represents the space of all possible parameters resulting in valid trajectories in the new environment. Since robot perception and motion are uncertain, each parameter induces a density where
denotes the system trajectory. For instance, would typically define a reference trajectory and an associated tracking control law resulting in the density
In practice, given the trajectory will either be sampled using a highfidelity simulator or generated by the real robot.
Iv Planning Algorithm
Iva Local Planning Algorithm
First, we consider adaptation of only a single action to a new environment. When presented with a new environment, we pose the planning task as the problem of learning a new parameterized policy . To do so we employ a stochastic optimization technique using a surrogate distribution which is iteratively updated so that generated trajectories produce feature observations with high likelihoods under the expert distribution for action , where is the set of actions available from predicate state .
We follow the Cross Entropy Method described by Rubinstein et al. [12], particularly following its application to motion planning by Kobilarov [13]. This is accomplished by introducing an artificial surrogate distribution over that will induce a distribution over trajectories and over the corresponding features along these trajectories. The surrogate will then be iteratively optimized until it becomes optimally close (in a distribution sense) to the expert density without violating the constraints of the environment such as obstacles and joint limits. The surrogate model is built using a parametric density such as a multivariate Gaussian or a GMM with parameters . Assuming that a nominal (prior) parameter is known the problem can be formalized as the optimal estimation of the expectation
(3) 
The optimal importance sampling density [12] for estimating this integral is
(4) 
where the numerator in (4) can be thought of as the correlation between the expert feature distribution and the parameterized distribution . Unfortunately we cannot compute the solution to (4) as it involves computing the estimator from (3). Instead, we approximate this optimal by finding the appropriate parameters of . A logical way of doing this is to minimize the KullbackLeibler (KL) divergence:
(5) 
To find the value of that minimizes this expression, we approximate this solution by drawing i.i.d. samples from . In this case is a generated feature from robot state at time along the sampled trajectory for .
This can be more formally expressed as
(6)  
(7) 
If we assume that there is a bijection between a tuple along a trajectory and a feature then we have the following approximation
(8) 
since were sampled under , and substituting (8) into (7) results in:
(9) 
The necessary conditions for a minimum correspond to setting the gradient of (13) to zero, i.e. by solving the equality:
(10) 
where the weights are given by .
When (i.e. a single multivariate Gaussian with domain restricted to feasible parameter set ), the relationship (10) can be solved in closed form as
(11) 
where and . When is a GMM the minimization from Eq. (13) is performed using a weighted expectation maximization (EM) algorithm.
In practice, the optimal parameter is computed iteratively starting with some nominal choice which approximately covers the trajectory space of interest. At each iteration we draw samples and compute the next by minimizing (13). At the next iteration is set to and the process continues until the cost converges.
We add a fixed normalization term to the diagonal entries in of and of to make sure covariances stay welldefined. In addition, to prevent premature convergence, we introduce an extra parameter , which controls the size of steps taken at each iteration. In the case where is multivariate Gaussian, with as the optimal at iteration , we compute and as:
(12)  
Avoiding Obstacles and Joint Limits
We constrain to consist only of the space of valid trajectories, removing any samples that would collide with objects or pass joint limits. This means that when drawing our samples, we remove samples currently in collision or past joint limits in our new environment and continue to draw sample trajectories until we have all valid examples. This works effectively in practice as long as the task does not require generalization in environments with very narrow passages that the system has never been trained on. Such cases are extremely difficult since the probability of obtaining samples in the narrow passage is close to zero, unless an informative nominal density parameter is used with enough probability mass over such regions.
IvB Task Planning Algorithm
We wish to optimize parameters for all possible actions in a successful execution of the task, where our cost is the joint probability over any sequence of actions that represent a valid execution of the task as per Eq. (2). Our task planning approach takes the algorithm described in Section IVA and expands it into a recursive algorithm similar to Monte Carlo Tree Search.
First, consider the problem of choosing one of possible actions. We think of this as the choice of which action would be most similar to our expert’s demonstrations in other scenes, starting in symbolic state . We expand our notion of to include the switch between each possible action as . Substituting this into Eq. (13) gives us:
(13) 
where is the trajectory distribution associated with .
Action selection is modeled as a stochastic policy over possible worlds. We introduce a surrogate distribution into our trajectory search that captures the probability of choosing each future action from the current . When sampling trajectories, we draw the next action according to this probability.
Furthermore, we can extend this reasoning to consider which of a whole tree of possible actions is the most similar to an expert tree, allowing us to capture expert preferences for particular actions in addition to continuousspace trajectories. Assuming that all actions in a branch of the tree are independent given time, we can define the expert probability of a particular action starting at continuous robot state :
(14)  
(15) 
Where is the final state in sampled trajectory and represents the world after symbolic action . Eq. (15) describes the probability of all possible actions from a continuous world state occurring after execution of an action .
When recording a set of demonstrations starting in the same predicate state , we compute the conditional probability for action :
(16) 
We specify a surrogate distribution over possible choices of actions for a world given as . This probability is initialized as . In the case where this is updated as where trajectory samples have been drawn from . Otherwise we compute this as:
for starting state and action . In practice we use the step size to prevent this term from converging too quickly.
Each predicate state corresponds to a range of valid continuousspace states. The algorithm recursively samples from the trajectories associated with each successive action to map to continuous states. As shown in Alg. 1, we repeatedly call the Sample function from Alg. 2, providing it with the set of possible start states . We select a start state from these according to the cumulative probability of these actions. The process continues until we reach a userprovided horizon . This approach allows us to maintain a constant number of samples: over successive iterations, more samples will be devoted to promising regions of the search space.
This results in a recursive search strategy outlined in Alg. 1. The return of the Sample function is the average probability of all future actions and trajectories associated with each current start state. This value is used to compute a version of the weights in Eq. (13), where is replaced by the probability of all future actions from each trajectory. Fig. 4 illustrates how the algorithm works in practice.


V Experiments
We performed experiments in a simulated Barrett WAM arm and on a Universal Robot UR5, applied to an object manipulation task. The goal of this task was to build a structure of increasing complexity out of magnetic blocks, as per the task described in [31]. In our case, we only perform a part of the whole structure assembly task: we combine one link and one node object to create a substructure. The connections between different skills are described by the PDDL specification in Figure 3. We used FastDownward [32] to translate the PDDL into a graph of possible actions that can be performed assuming all feasibility predicates are true.
Figure 5 shows how the planner works in practice. It iteratively sampled out different motions for each selected action, choosing to approach the link from the front and then to mate it to the leftmost node. As the algorithm progressed, successively fewer samples were drawn from actions associated with the rightmost node.
We use three types of features: (1) the time in a particular state, (2) the gripper command variables, (3) the transforms between the end frame and the objects. Relevant features are determined by the parameters specified in the task description in Fig. 3. In cases where two objects are parameters, we used the transform between the inhand object and the other object. For these examples, , where values are computed from the offset between the current manipulation frame and and the relevant object. The values define a unit quaternion. The manipulation frame is either an end effector position or the coordinate frame associated with the object in the gripper, for actions defined where the handoccupied predicate is true.
We parameterize trajectories with Dynamic Movement Primitives with basis functions in the robot’s joint space, plus a goal pose . This allows us to find paths in the space of the robot arm, but to adapt to different possible continuousspace goals. We implemented the system using ROS [33] with Orocos KDL for inverse kinematics [34].
In practice, the “link” object can shift in unpredictable ways after a grasp action, so we adjust the plan after completion of the grasp action. We add noise to the parameters of the trajectory distribution associated with the subsequent align and place actions and replan. In the real robot experiment we omit this step due to the lack of accurate position information once the object is in the gripper.
The current implementation of the planner is singlethreaded. The single largest inefficiency was detecting collisions, followed by computation of inverse kinematics. Particularly in scenes with more obstacles, both of these are very important: inverse kinematics are required to adapt trajectories to different possible grasp points, and accurate collision detection guarantees safe execution. As inverse kinematics and collision detection are outside the purview of this paper, we did not focus on efficiency.
Va Simulation Experiments
We collected three demonstrations of each of the different skills with a dynamic simulation of the Barrett WAM arm. We then place these pieces in different positions in the environment, and validated our method by performing the task in different locations. The results of one performance in a novel environment are shown in Fig. 5.
To create a model of each of these skills, we collect three demonstrations of the object manipulation action using the same grasp, with two of the Barrett Hand’s fingers on the left side of the link and one on the right. The simulated WAM arm was teleoperated with a Razer Hydra to collect training data. The user provided examples of three different grasps: a direct approach and approaching from the left or the right. The user specified to indicate a preference for a direct approach.
We perform our task on scenes with one link and two nodes at different positions, and demonstrate task effectiveness for trials with different configurations of the world. The key measure of performance is how easily we can add extra training data to our model and how close these results will be to the target mate. We set and used trajectories, with a maximum of iterations. Our full algorithm used a depth of : a long enough horizon to plan the whole assembly task. Average likelihood of sampled trajectories converged exponentially as we proceeded through various iterations. Fig 6 shows distance to an ideal final mate after outliers were removed.
By way of comparison, we remove one or both of two parts of our algorithm. We use a single randomly selected task plan (“no options” in Fig. 6). We also compare against the case where our planner only examines the currently available actions, setting (“no lookahead” in Fig. 6). The “no options/no lookahead” case functions as our baseline: it uses the algorithm in Sec. IVA to reproduce an action based on a GMM. While our approach is technically unconstrained, due to the sampling method we implicitly constrain the trajectory search to a feasible set of valid trajectories. To demonstrate how we can improve performance by improving action models, a human user selected three successful trials from the automatic performance of this task and added them to the model (“auto” in Fig. 6).
Table I shows the number of planning failures associated with different environments. These are cases where the algorithm failed to find a trajectory with nonzero probability under the expert distributions defining each of our actions. Without the full algorithm, either the robot often cannot find a solution that will accomplish the task or performance is significantly degraded. The case where there are options and no lookahead is a good example. While the robot is almost always able to find a plan in this situation, the quality of plans is far worse, as shown by Fig. 6. In the higherperforming “Auto” case, the robot was always able to find a plan but few of these plans were successful: only 4/10 achieved highquality mates, and several outliers fell off the node and the table completely. This is because without knowledge of the place and release actions, the align action will often not terminate in a good state to complete the task.
Fig. 6 shows a comparison on successful trials in different environments without obstacles. The full algorithm was highly reliable and accurate, achieving less than 1 cm of placement error. Other versions of the algorithm made mistakes that planning alone could not recover from. In addition, performance of all versions of the algorithm showed improvement when extra data was added, though the full version of the algorithm was still better and more flexible.
Expert Data Only  With Auto Data  

No Options/No Lookahead  7  4 
No Options/Lookahead  3  1 
Options/No Lookahead  1  0 
Options/Lookahead*  0  0 
We also introduced different obstacles into the environment. In these cases, the algorithm is able to avoid these objects and still complete its required task. Figure 7 shows examples of these results. Since the planner removes paths that are in collision, this restricts the set of feasible trajectories. As in the upper left of Figure 7, the most likely action in a particular scenario might be an approach from a particular direction. Once this grasp is blocked, the planner can either attempt a less likely trajectory that results in that grasp, or it can approach from a different direction. This tradeoff illustrates why our approach is more powerful than adding a potential field term to action primitives as in [18] or similar work.
VB Real Robot Experiments
The UR5 was given the option of grabbing either of two links or a node object and combining them to create the same structure as in the simulation experiments. Demonstrations were provided in which the robot grasped either the node or the link first. The object localization technique described by Li et al. [35] was used to determine the poses of all objects in the scene.
Our system intelligently selected whether to grasp the link or the node, and selected which face of the link to grasp based on feasibility and presence of other obstacles. The UR5 had a fairly limited workspace and has a limited ability to interact with objects when compared to the Barrett WAM arm used in the simulation experiments, making this a more challenging problem. However, the robot was able to grasp both node and link objects and complete the task. Our video supplement provides examples of the UR5 performing this task in different configurations, as well as an overview of the algorithm and videos of the simulation experiments.
In particular, when presented with both link and node objects in different orientations, the robot was able to correctly select available faces not blocked by other obstacles. If the node was better aligned with the robot’s gripper, then the algorithm chose to grasp the node; if one of the links was better aligned, it would grasp this link.
Vi Conclusions
We described a practical approach for task and motion planning based on models of skills grounded from expert demonstrations of skills. By representing actions as probability distributions learned from expert demonstrations, we create a framework that can combine a broad range of actions to accomplish a task. We validated this approach with experiments in a structure assembly domain both in simulation and in a real robot. While we did not address efficiency in the implementation used in this paper, we will examine strategies for decreasing the number of costly trajectory evaluations and collision checks and apply our planner to larger and more complex tasks.
References
 [1] E. Plaku and G. D. Hager, “Samplingbased motion and symbolic action planning with geometric and differential constraints,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 5002–5008.
 [2] J. Wolfe, B. Marthi, and S. J. Russell, “Combined task and motion planning for mobile manipulation.” in ICAPS, 2010, pp. 254–258.
 [3] V. Shivashankar, K. N. Kaipa, D. S. Nau, and S. K. Gupta, “Towards integrating hierarchical goal networks and motion planners to support planning for humanrobot teams,” 2014.
 [4] F. Lagriffoul, D. Dimitrov, J. Bidot, A. Saffiotti, and L. Karlsson, “Efficiently combining task and motion planning using geometric constraints,” The International Journal of Robotics Research, p. 0278364914545811, 2014.
 [5] E. Plaku and S. Karaman, “Motion planning with temporallogic specifications: Progress and challenges,” AI Communications, no. Preprint, pp. 1–12.
 [6] J. Bajada, M. Fox, and D. Long, “Temporal planning with semantic attachment of nonlinear monotonic continuous behaviours,” in Proceedings of the 24th International Conference on Artificial Intelligence. AAAI Press, 2015, pp. 1523–1529.
 [7] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and autonomous systems, vol. 57, no. 5, pp. 469–483, 2009.
 [8] P. Pastor, M. Kalakrishnan, F. Meier, F. Stulp, J. Buchli, E. Theodorou, and S. Schaal, “From dynamic movement primitives to associative skill memories,” Robotics and Autonomous Systems, vol. 61, no. 4, pp. 351–361, 2013.
 [9] N. Krüger, J. Piater, F. Wörgötter, C. Geib, R. Petrick, M. Steedman, A. Ude, T. Asfour, D. Kraft, D. Omrcen, et al., “A formal definition of objectaction complexes and examples at different levels of the processing hierarchy,” PACOPLUS Technical Report, available fro m http://www. pacoplus. org, 2009.
 [10] M. Wachter, S. Schulz, T. Asfour, E. Aksoy, F. Worgotter, and R. Dillmann, “Action sequence reproduction based on automatic segmentation and objectaction complexes,” in Humanoid Robots (Humanoids), 2013 13th IEEERAS International Conference on. IEEE, 2013, pp. 189–195.
 [11] S. R. Ahmadzadeh, A. Paikan, F. Mastrogiovanni, L. Natale, P. Kormushev, and D. G. Caldwell, “Learning symbolic representations of actions from human demonstrations,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 3801–3808.
 [12] R. Y. Rubinstein and D. P. Kroese, The crossentropy method: a unified approach to combinatorial optimization. Springer, 2004.
 [13] M. Kobilarov, “Crossentropy motion planning,” International Journal of Robotics Research, vol. 31, no. 7, pp. 855–871, 2012.
 [14] K. Welke, P. Kaiser, A. Kozlov, N. Adermann, T. Asfour, M. Lewis, and M. Steedman, “Grounded spatial symbols for task planning based on experience,” in 13th International Conference on Humanoid Robots (Humanoids). IEEE/RAS, 2013.
 [15] S. Calinon, F. Guenter, and A. Billard, “On learning, representing, and generalizing a task in a humanoid robot,” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 37, no. 2, pp. 286–298, 2007.
 [16] S. Dong and B. Williams, “Motion learning in variable environments using probabilistic flow tubes,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 1976–1981.
 [17] S. Schaal, “Dynamic movement primitivesa framework for motor control in humans and humanoid robotics,” in Adaptive Motion of Animals and Machines. Springer, 2006, pp. 261–280.
 [18] D. Park, H. Hoffmann, P. Pastor, and S. Schaal, “Movement reproduction and obstacle avoidance with dynamic movement primitives and potential fields,” in 8th IEEERAS International Conference on Humanoid Robots. IEEE, 2008, pp. 91–98.
 [19] A. M. Ghalamzan E., C. Paxton, H. G. D., and L. Bascetta, “An incremental approach to learning generalizable robot tasks from human demonstration,” in Proceeding of IEEE International Conference on Robotics and Automation. IEEE, 2015.
 [20] P. Kormushev, S. Calinon, and D. G. Caldwell, “Robot motor skill coordination with embased reinforcement learning,” in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 3232–3237.
 [21] P. Pastor, M. Kalakrishnan, S. Chitta, E. Theodorou, and S. Schaal, “Skill learning and task outcome prediction for manipulation,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 3828–3834.
 [22] F. Stulp and O. Sigaud, “Path integral policy improvement with covariance matrix adaptation,” arXiv preprint arXiv:1206.4621, 2012.
 [23] P. Englert, A. Paraschos, J. Peters, and M. P. Deisenroth, “Modelbased imitation learning by probabilistic trajectory matching,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 1922–1927.
 [24] A. Boularias, J. Kober, and J. Peters, “Relative entropy inverse reinforcement learning.” in AISTATS, 2011, pp. 182–189.
 [25] W. Masson and G. Konidaris, “Reinforcement learning with parameterized actions,” arXiv preprint arXiv:1509.01644, 2015.
 [26] S. Niekum, S. Chitta, A. G. Barto, B. Marthi, and S. Osentoski, “Incremental semantically grounded learning from demonstration.” in Robotics: Science and Systems, vol. 9, 2013.
 [27] S. Manschitz, J. Kober, M. Gienger, and J. Peters, “Learning to sequence movement primitives from demonstrations,” in Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on. IEEE, 2014, pp. 4414–4421.
 [28] D. Kappler, P. Pastor, M. Kalakrishnan, M. Wüthrich, and S. Schaal, “Datadriven online decision making for autonomous manipulation,” 2015.
 [29] S. Srivastava, E. Fang, L. Riano, R. Chitnis, S. Russell, and P. Abbeel, “Combined task and motion planning through an extensible plannerindependent interface layer,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 639–646.
 [30] M. Toussaint, “Logicgeometric programming: An optimizationbased approach to combined task and motion planning,” in International Joint Conference on Artificial Intelligence, 2015.
 [31] J. Bohren, C. Papazov, D. Burschka, K. Krieger, S. Parusel, S. Haddadin, W. L. Shepherdson, G. D. Hager, and L. L. Whitcomb, “A pilot study in visionbased augmented telemanipulation for remote assembly over highlatency networks,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3631–3638.
 [32] M. Helmert, “The fast downward planning system.” J. Artif. Intell. Res.(JAIR), vol. 26, pp. 191–246, 2006.
 [33] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “Ros: an opensource robot operating system,” in ICRA workshop on open source software, vol. 3, no. 3.2, 2009, p. 5.
 [34] R. Smits, “KDL: Kinematics and Dynamics Library,” http://www.orocos.org/kdl.
 [35] C. Li, J. Bohren, E. Carlson, and G. D. Hager, “Hierarchical semantic parsing for object pose estimation in densely cluttered scenes,” in IEEE International Conference on Robotics and Automation (ICRA), 2016.