Co-active Learning to Adapt Humanoid Movement for Manipulation


\authorblockNRen Mao, John S. Baras, Yezhou Yang, and Cornelia Fermüller R. Mao and J. Baras are with the Department of Electrical and Computer Engineering and the ISR, Y. Yang and C. Fermüller are with the Department of Computer Science and the UMIACS, University of Maryland, College Park, Maryland 20742, USA. {neroam, baras} at umd.edu, yzyang at cs.umd.edu and fer at umiacs.umd.edu.
Abstract

In this paper we address the problem of interactively adapting robot movement to various environmental constraints. Motion primitives are generally adopted to generate target motion from demonstrations. However, their generalization capability is weak when facing novel environments. Additionally, traditional motion generation methods do not consider the versatile constraints arising from various users, tasks, and environments. In this work, we propose a co-active learning framework for learning to adapt the movement of a robot end-effector for manipulation tasks. It is designed to adapt the original imitation trajectories, which are learned from demonstrations, to novel situations with various constraints. The framework also considers the user’s feedback on the adapted trajectories, and it learns to adapt movement through human-in-the-loop interactions. The implemented system generalizes trained motion primitives to various situations with different constraints while considering user preferences. Experiments on a humanoid platform validate the effectiveness of our approach.

I Introduction

Trajectory learning from human demonstrations has been studied in the field of robotics for decades due to its wide range of applications in both industrial and domestic scenarios. Among the various approaches, Motion Primitives (MPs) aim to parameterize observed human motion and then reproduce it given different initial and target states. However, it is widely known that general MP methods, such as Dynamic Movement Primitives (DMPs) [1], have limited capability to generalize to novel environments involving additional constraints. Moreover, standard MP learning methods ignore user preferences regarding the tasks and the environments. For real-world humanoid applications, a practical robot movement learning framework needs to take user preferences and environmental constraints into consideration.

Fig. 1: System for learning movement adaptation for manipulation tasks. Dashed lines indicate feedback.

Consider a common example in which a human user teaches a humanoid how to transfer a bottle between different start and end states. Using an off-the-shelf approach, the robot can learn the motion by acquiring MPs from the demonstrated trajectories and apply them to generate new trajectories given different initial and target states. However, solely following the generated trajectories may fail if the environment is slightly altered, for example when a bowl blocks the trajectory as illustrated in Fig. 2. Here we assume that the robot only receives these constraints during the task execution (testing) phase; they are not present during the training phase. In this work, we propose an optimization-based framework to adapt trained movements to novel environments. The first goal of our system is to generate adapted trajectories, as shown in Fig. 2, that 1) follow the demonstrated trajectories in order to preserve the movement patterns, and 2) fulfill novel constraints perceived from the environment during the testing phase.

Fig. 2: Baxter Transferring a Leaking Bottle: (a) Movement imitation, which failed to avoid the bowl; (b) Movement adaptation with initial weights, which successfully avoided the bowl by a path above it but spilled water into the bowl; (c) Movement adaptation with learned weights in a new situation where the obstacle is located differently, which successfully avoided the bowl by a path around it and avoided spilling water into the bowl.

Moreover, novel environmental constraints perceived during the testing phase can be more complicated than just obstacles. Following the previous example, consider a situation where the target bottle is leaking. Ideally, an intelligent robot that understands the situation should avoid moving the bottle over the bowl and instead follow a path around it. Even though we could adjust the objective function during optimization for movement adaptation, what if in another scenario the robot is asked to transfer a knife and should pass above obstacles to prevent potential scratches? Such constraints are associated not only with the task context, i.e., a leaking bottle or a knife as the manipulated object, but also with the user’s preference, i.e., avoiding the bowl in a certain manner. Therefore, a human-in-the-loop on-line adaptation system is necessary to generate manipulation trajectories for different preferences. In the optimization framework presented in this paper for movement adaptation, we first treat the reward weights as adjustable parameters that alter the quality of the trajectory. Then, based on user feedback, the framework learns the preferred behavior that fulfills the constraints by updating the reward weights. The learned behavior can therefore be generalized to different situations where similar constraints are encountered. As illustrated in Fig. 2, after a few iterations of on-line learning, the robot is able to generate an adapted trajectory according to the learned preferences.

This paper proposes an approach for interactively learning movement adaptation for manipulation tasks. Fig. 1 illustrates the proposed system. The main contributions of this work are: 1) a system that generalizes movement learned from demonstrations to fulfill constraints perceived in novel environments, and adapts trajectories to various situations according to user preferences; 2) an approach for the robot to learn to adapt trajectories by updating reward weights based on the user’s feedback, so that the user can co-actively train the robot in-the-loop by demonstrating desired trajectories; 3) an implementation of the optimization scheme to adapt a transferring skill considering obstacles and different avoidance manners. We validate the implementation on a humanoid platform (Baxter), and the experimental results support our claims.

II Related Work

Various approaches have been proposed to enable robots to learn manipulation movements. Among them, imitation learning [2] focuses on mimicking human demonstrations, and learning from demonstration (LfD) techniques [3] are applicable. However, with these approaches the robot can only reproduce the learned movement in similar environments. To deal with novel environments, extended approaches [4] augment the trajectory generation with additional cost terms or a different objective function as a criterion of trajectory quality. The criterion is based on human experts’ prior knowledge about the task or environment before the execution phase, and the motion is then generalized under these predefined constraints in similar situations. These approaches do not consider varying user preferences. Here, we present another layer of exploration and learning that adapts the trained movement to novel environmental constraints, such as observed obstacles and task preferences.

Approaches for encoding trajectories as motion primitives [5] have been proposed for various forms of generalization and modulation, such as Gaussian mixture regression and Gaussian mixture models [6, 7]. In [8], a mixture model was used to estimate the entire movement skill from several sample trajectories. Another school of approaches derives from Hidden Markov Models [9]. One popular representation to encode motion from demonstrated trajectories is Dynamic Movement Primitives (DMPs), as introduced in [1]. It consists of differential equations with well-defined attractor properties and a non-linear learnable component that allows modeling of almost arbitrarily complex motion. Recently, Probabilistic Movement Primitives (ProMPs) [10] were proposed as an alternative, probabilistic representation. ProMPs learn a trajectory distribution from multiple demonstrations and modulate the movement by conditioning on desired target states. By incorporating the variance of the demonstrations, the ProMPs approach handles noise from different demonstrations and provides increased flexibility for reproducing movement. However, all these approaches can hardly deal with novel environments, such as those involving different obstacles. In our work, we first train our robot using ProMPs and then generalize the trained motion primitives to newly introduced environmental constraints.

In order to enable MPs to adapt to novel environments with obstacles [11, 12], Kober et al. [4] proposed an augmented version of DMPs which incorporates perceptual coupling to an external variable. They first learned the initial dynamic models by standard imitation learning and subsequently used a reinforcement learning method for self-improvement. Ghalamzan et al. [13] proposed a three-tiered approach for robot learning from demonstrations that can generalize noisy task demonstrations to a new target state and to an environment with obstacles. They encoded the nominal path generated from a Gaussian Mixture Model with DMPs and generated a trajectory for a new target state. They then adapted the DMP-generated trajectory to avoid obstacles by formulating an optimal control problem using a reward function learned from demonstrations by inverse optimal control. This approach allows a non-expert user to teach a robot the desired response to different objects, but it requires offline training in an environment containing those obstacles in order to learn the reward function. In real-world scenarios, however, human users often have different preferences for trajectory generation depending on the environment and task, and it is extremely challenging for them to provide optimal trajectories in every situation. Instead, in our approach, the human user can interactively provide sub-optimal suggestions on how to improve the trajectory, and the robot learns the preference for different constraints and incorporates it when generating more applicable trajectories.

User preferences over a robot’s trajectories have been studied in the field of human-robot interaction (HRI). Sisbot et al. [14] proposed to model user-specified preferences as constraints on the distance of the robot from the user, the visibility of the robot, and the user’s arm comfort; a path planner fulfilling such user preferences was then provided. Jain et al. [15] proposed a co-active learning method to learn user preferences over generated trajectories for manipulation tasks by iteratively taking sub-optimal user feedback, after which the optimal trajectory is selected based on the learned reward function. In our work, we adopt the co-active learning paradigm and further propose a reward formulation that models user preferences over constraints for movement generation. We then integrate it with movement adaptation through optimization-based planning.

III Co-active Learning for Movement Generalization

For the problem of robot learning from demonstrations [3], a common practice is to learn the skills offline by encoding the trajectories with movement patterns such as DMPs [16]. During the testing phase, these can then be used to generalize the movement to novel situations with slight alterations, such as different initial and target states. Nevertheless, this generalization capability does not extend to novel environments with different obstacles or to new task contexts with a variety of manipulated objects. In this paper, we propose a complementary framework for generalizing movement skills, learned offline from demonstrations, to novel situations, and in addition for learning on-line, co-actively from human feedback, the preferences of how to generalize.

When facing a novel situation, the robot is given a manipulation task context that describes the environment, the objects and any other task-related information. It can compute an imitation movement trajectory by generalizing the offline-learned skills to the new initial and target states. Such a trajectory can be executed if the new environment contains no obstacles and there are no other constraints inherited from the task.

To further generalize learned movement skills to more challenging situations, the robot has to generate an adapted trajectory $\tau^{A}$ based on the task context $c$ and the computed imitation trajectory $\tau^{I}$. Here we use a reward function $R(\tau^{A}; c, \tau^{I})$ to reflect how much reward the adapted trajectory can achieve for different contexts. Therefore, we can adapt the movement by solving an optimal control problem which outputs an adapted trajectory by maximizing the reward function $R$. The reward function consists of an Imitation Reward describing the tendency to follow the imitation trajectory $\tau^{I}$, a Control Reward describing the smoothness of executing the adapted trajectory, and a Response Reward describing the expected response given the environment. Although such a reward function can be recovered from demonstrations by Inverse Optimal Control, as [13] suggests, this assumes that the demonstrations come from experts acting according to an oracle reward function $R^{*}$. In practice, it is common for non-expert users to provide non-optimal trajectories. Also, [13] requires the manipulated objects or obstacles to be present during these demonstrations, and it is hard to update the learned reward function online when the robot faces situations that involve new objects. To learn the reward function that controls how the robot adapts trajectories under new contexts, we apply a co-active learning technique [15] in which the user only corrects the robot by providing an improved trajectory $\bar{\tau}$, and the robot then updates the parameters $\theta$ of $R$ based on the user’s feedback. It is worth noting that this feedback only needs to indicate $R^{*}(\bar{\tau}) > R^{*}(\tau^{A})$; the feedback trajectories themselves may be non-optimal. With iterations of improvement, the robot can learn a function $R$ that approximates the oracle $R^{*}$ tightly.
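
As a compact summary of this learning signal (using the notation just introduced; the precise features $\phi$ and weights $\theta$ are defined in Sec. IV, and this display is a sketch of the standard co-active update of [15] with learning rate $\alpha$ rather than a new result):

$R(\tau; c, \tau^{I}) = \theta^{\top}\phi(\tau; c, \tau^{I}), \qquad \theta \leftarrow \theta + \alpha\,\big(\phi(\bar{\tau}; c, \tau^{I}) - \phi(\tau^{A}; c, \tau^{I})\big)$

Each round of feedback thus moves the weights toward the features of the trajectory the user prefers and away from those of the trajectory the robot just executed.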

IV Our System

Overall, after the robot has learned the movement skill offline from demonstrations, when it faces a different task context in a novel environment, the testing phase includes three stages: 1) Movement Imitation, which computes an imitation trajectory by generalizing the demonstrated movement to new initial and target states; 2) Movement Adaptation, which generates an adapted trajectory under the new task and environment contexts by maximizing the given reward function; 3) Rewards Learning, which updates the parameters of the estimated reward function according to the user’s feedback through co-active learning. Fig. 1 illustrates the proposed framework. In the following sections, we detail and formulate each stage.

IV-A Movement Imitation

At the beginning, our system learns the movement skill offline in an environment without obstacles or other constraints. In this work, we adopt Probabilistic Movement Primitives (ProMPs) [10] for offline learning and movement imitation. ProMPs obtain a distribution over trajectories from multiple demonstrations, which captures the variations, and can easily be generalized to new initial and target states while imitating the movement.

To be specific, we consider that the robot's arm and end-effector have $d$ degrees of freedom (DOF), with the state at time $t$ denoted as $y_t \in \mathbb{R}^{d}$. The trajectory of the robot's end-effector is represented as a sequence $\tau = \{y_1, \ldots, y_T\}$. We model each dimension of $y_t$ using linear regression with time-dependent Gaussian basis functions $\Phi_t$ and a $K$-dimensional weight vector $w$ as

$y_t = \Phi_t^{\top} w + \epsilon_y$  (1)

where $\epsilon_y \sim \mathcal{N}(0, \Sigma_y)$ denotes zero-mean i.i.d. Gaussian noise. Given the underlying weight vector $w$, the probability of observing a trajectory $\tau$ is given by

$p(\tau \mid w) = \prod_{t=1}^{T} \mathcal{N}\big(y_t \mid \Phi_t^{\top} w,\ \Sigma_y\big)$  (2)

where $\Phi_t$ is the basis matrix evaluated at time step $t$ and $\Sigma_y$ is the observation noise covariance.

Iv-A1 Learning from Demonstrations

For each demonstration, the trajectory can be compactly represented by a weight vector $w$, which has fewer dimensions than the number of time steps. To capture trajectory variations from multiple demonstrations of the movement, a Gaussian distribution $p(w; \theta_w)$ over the weights is estimated. Therefore, the distribution of the trajectory can be represented as

$p(w; \theta_w) = \mathcal{N}\big(w \mid \mu_w,\ \Sigma_w\big)$  (3)
$p(\tau; \theta_w) = \int p(\tau \mid w)\, p(w; \theta_w)\, dw$  (4)

We can then estimate the parameters $\theta_w = \{\mu_w, \Sigma_w\}$ by maximum likelihood estimation, as suggested in [10].
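
As a minimal sketch of this step, assuming time-aligned demonstrations and a simple ridge-regression fit per demonstration in place of the full estimation procedure of [10] (all function and variable names here are illustrative):

```python
import numpy as np

def gaussian_basis(T, K, width=0.05):
    """Time-dependent Gaussian basis matrix Phi with shape (T, K)."""
    t = np.linspace(0, 1, T)[:, None]            # normalized time
    centers = np.linspace(0, 1, K)[None, :]      # basis centers
    Phi = np.exp(-0.5 * (t - centers) ** 2 / width ** 2)
    return Phi / Phi.sum(axis=1, keepdims=True)  # normalize per time step

def learn_promp(demos, K=20, reg=1e-6):
    """Estimate the Gaussian weight distribution N(mu_w, Sigma_w).

    demos: list of arrays, each (T, d) - one demonstrated trajectory per entry.
    Returns the basis matrix and the stacked weight distribution over all DOFs.
    """
    T, d = demos[0].shape
    Phi = gaussian_basis(T, K)                   # shared across DOFs
    W = []
    for tau in demos:
        # ridge regression per DOF: w = (Phi^T Phi + reg I)^-1 Phi^T y
        w = np.linalg.solve(Phi.T @ Phi + reg * np.eye(K), Phi.T @ tau)  # (K, d)
        W.append(w.ravel(order="F"))             # stack per-DOF weights
    W = np.stack(W)                              # (num_demos, K*d)
    mu_w = W.mean(axis=0)
    Sigma_w = np.cov(W, rowvar=False) + reg * np.eye(K * d)
    return Phi, mu_w, Sigma_w
```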

IV-A2 Trajectory Generation

In novel situations, the trajectory can be modulated by conditioning on different observed states. By adding an observation vector $y^{*}$ indicating the desired initial state and target state, observed with accuracy $\Sigma_y^{*}$, we can apply Bayes' theorem and represent the conditional distribution of $w$ as

$p(w \mid y^{*}) = \mathcal{N}\big(w \mid \mu_w + L\,(y^{*} - \Phi^{*\top}\mu_w),\ \Sigma_w - L\,\Phi^{*\top}\Sigma_w\big), \qquad L = \Sigma_w \Phi^{*}\big(\Sigma_y^{*} + \Phi^{*\top}\Sigma_w\Phi^{*}\big)^{-1}$  (5)

where $\Phi^{*}$ and $\Sigma_y^{*}$ are the basis matrix and observation noise augmented for the observation vector $y^{*}$.

With the conditional distribution of $w$, we can generate the conditional trajectory distribution and easily evaluate the mean and the variance of the trajectory at any time point according to Eq. (2) and Eq. (3). Therefore, the mean trajectory can be used as the imitation trajectory $\tau^{I} = \{y^{I}_1, \ldots, y^{I}_T\}$ in movement adaptation, and the variance $\Sigma_t$ at each time step can be used to indicate which parts or dimensions of the trajectory are more flexible to adapt. A larger variance reflects higher variation in the demonstrations, which means more flexibility to modify the corresponding part of the trajectory.
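
The conditioning in Eq. (5) reduces to standard Gaussian conditioning. A minimal sketch for a single DOF follows (the multi-DOF case applies the same update with block-structured matrices; the observation noise value and all names are illustrative):

```python
import numpy as np

def condition_promp(Phi, mu_w, Sigma_w, t_obs, y_obs, sigma_obs=1e-4):
    """Condition the weight distribution on observed states (e.g. start/goal).

    Phi:     (T, K) basis matrix for one DOF
    mu_w:    (K,)   prior mean of weights
    Sigma_w: (K, K) prior covariance of weights
    t_obs:   indices of observed time steps, e.g. [0, T-1]
    y_obs:   observed values at those steps
    """
    Phi_star = Phi[t_obs]                                   # (m, K)
    S = Phi_star @ Sigma_w @ Phi_star.T + sigma_obs * np.eye(len(t_obs))
    K_gain = Sigma_w @ Phi_star.T @ np.linalg.inv(S)        # Kalman-style gain
    mu_new = mu_w + K_gain @ (np.asarray(y_obs) - Phi_star @ mu_w)
    Sigma_new = Sigma_w - K_gain @ Phi_star @ Sigma_w
    # mean imitation trajectory and per-step variance for this DOF
    mean_traj = Phi @ mu_new
    var_traj = np.einsum('tk,kl,tl->t', Phi, Sigma_new, Phi) + sigma_obs
    return mean_traj, var_traj
```

The returned mean serves as the imitation trajectory $\tau^{I}$ and the per-step variance enters the Imitation Reward weighting in Sec. IV-B3.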

It is worth mentioning that, although we adopt ProMPs for movement imitation in this work, the proposed movement adaptation framework can be integrated with any other movement imitation learning technique.

IV-B Movement Adaptation

As mentioned before, if the environment of a new situation is exactly the same as the one during demonstration when the ProMPs are learned, e.g., with no obstacles, safety constraints or other new considerations, the robot can perform the movement optimally by directly following the imitation trajectory $\tau^{I}$ in discrete time generated by the learned ProMPs.

In this work, we want a system that can adapt to an environment with novel constraints. Thus, we model the movement adaptation as an optimal control problem with a fixed time horizon in discrete time. The output of the adaptation system is a new trajectory $\tau^{A} = \{x_1, \ldots, x_T\}$ in discrete time. The input consists of the task context $c$ that describes the environment, the objects and any other task-related information obtained from the perception module, the imitation trajectory $\tau^{I}$ generated from the learned ProMPs, and the reward function $R(\tau^{A}; c, \tau^{I})$ which represents the reward of the adapted trajectory in the new situation.

IV-B1 Optimization with Constraints

Let us consider that the perception module detects $n$ objects in the environment, which may be obstacles during the manipulation. Each object $j$ is abstracted as a sphere in space represented by its center location $o_j$ and semi-diameter $\rho_j$. We assume the reward function can be modeled as the accumulated sum of the rewards collected at each time step $t$:

$R(\tau^{A}; c, \tau^{I}) = \sum_{t=1}^{T} r(x_t, u_t)$  (6)

Because we are only modulating the trajectory, we can model the adaptation system as linear dynamics with a control signal $u_t$, since it does not involve real physical dynamics. Based on the embodiment of the robotic end-effector, we can compute the end-effector's position in spatial space, $p_t = g(x_t)$, following the kinematics model [17]. Then, considering obstacle avoidance in spatial space, the target optimal policy is defined by Eq. (7) with the constraints (8)-(13):

$\max_{u_{1:T}}\ \sum_{t=1}^{T} r(x_t, u_t)$  (7)
subj. to  $x_{t+1} = A\,x_t + B\,u_t$  (8)
$x_1 = x_{\mathrm{init}}$  (9)
$p_t = g(x_t)$  (10)
$x_{\min} \le x_t \le x_{\max}, \quad u_{\min} \le u_t \le u_{\max}$  (11)
$\|p_t - o_j\| \ge \rho_j + d_{\mathrm{safe}}, \quad j = 1, \ldots, n$  (12)
$p_T = p_{\mathrm{target}}$  (13)

where $A$ and $B$ are system matrices, Eq. (13) constrains the final position of the adapted trajectory, Eq. (11) constrains the trajectory within feasible limits, and Eq. (12) ensures the adapted trajectory avoids obstacles safely by keeping a minimum distance $d_{\mathrm{safe}}$ between the robot's end-effector and any object.

IV-B2 Model Predictive Control

In order to find an optimal solution of such a system with continuous state and action spaces, we adopt Model Predictive Control (MPC), which computes the optimal actions over a finite prediction horizon. Therefore, considering a prediction time horizon $H$, the optimal action $u_t^{*}$ at time step $t$ is obtained by solving:

$u_t^{*} = \arg\max_{u_{t:t+H-1}}\ \sum_{k=t}^{t+H-1} r(x_k, u_k), \quad \text{subject to Eqs. (8)--(12)}$  (14)

At each step $t$, the optimal actions for the next $H$ decision steps are computed, but only the action for the current step is executed. Therefore, the approach can deal with changing environments, as these changes are taken into account in the subsequent decision steps.
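
To make the receding-horizon computation concrete, the following is a minimal sketch of one MPC step in Python, assuming simplified single-integrator dynamics acting directly in Cartesian space (so the kinematics map is the identity) and a generic per-step reward callable; all names are illustrative, and the experiments in Sec. V use MATLAB's fmincon instead of this solver:

```python
import numpy as np
from scipy.optimize import minimize

def mpc_step(x_t, step_reward, obstacles, d_safe, horizon=5, u_max=0.05):
    """Solve one MPC step: maximize summed rewards over the horizon,
    subject to obstacle-distance constraints; return only the first action."""
    dim = x_t.shape[0]

    def rollout(u_flat):
        U = u_flat.reshape(horizon, dim)
        X, x = [], x_t.copy()
        for u in U:                  # single-integrator dynamics x_{k+1} = x_k + u_k
            x = x + u
            X.append(x)
        return np.array(X), U

    def neg_reward(u_flat):
        X, U = rollout(u_flat)
        return -sum(step_reward(x, u) for x, u in zip(X, U))

    def min_clearance(u_flat):       # must stay >= 0 for feasibility
        X, _ = rollout(u_flat)
        return min((np.linalg.norm(x - o) - (r + d_safe)
                    for x in X for (o, r) in obstacles), default=1.0)

    u0 = np.zeros(horizon * dim)
    res = minimize(neg_reward, u0, method="SLSQP",
                   bounds=[(-u_max, u_max)] * (horizon * dim),
                   constraints=[{"type": "ineq", "fun": min_clearance}])
    return res.x.reshape(horizon, dim)[0]   # execute only the first action
```

Calling this routine once per time step with the current state and executing only the returned action reproduces the receding-horizon behavior described above.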

IV-B3 Reward Function

In order to adapt robot movements to perform well in novel situations, considering only hard constraints such as obstacle avoidance, Eq. (12), does not suffice. Thus, our framework further models a reward function that reflects the amount of reward an adapted trajectory $\tau^{A}$ can gain given the context $c$ and the imitation trajectory $\tau^{I}$. As the reward function is assumed temporally discrete in Eq. (6), we model the reward at time step $t$ as the sum of three parts:

$r(x_t, u_t) = r^{im}_t + r^{ctrl}_t + r^{res}_t$  (15)

where the Imitation Reward $r^{im}_t$ models the tendency to follow the imitation trajectory $\tau^{I}$, the Control Reward $r^{ctrl}_t$ models the smoothness of executing the adapted trajectory, and the Response Reward $r^{res}_t$ characterizes the expected response to the environment. Meanwhile, $\theta = \{\theta_{im}, \theta_{c}, \theta_{res}\}$ are parameters that affect the behavior of the movement adaptation. We describe each reward function in detail as follows.

Imitation Reward

The Imitation Reward characterizes how well the adapted trajectory imitates the demonstrations through the distance between corresponding points on $\tau^{A}$ and $\tau^{I}$. Recall that Movement Imitation (Sec. IV-A2) provides the variance $\Sigma_t$ of the imitation trajectory, which indicates how flexibly we can adapt the trajectory. Considering $\Sigma_t$ to be diagonal for the sake of simplicity, we model the Imitation Reward by the weighted distance:

$r^{im}_t = -\,(x_t - y^{I}_t)^{\top}\, W_t\, (x_t - y^{I}_t)$  (16)
$W_t = \mathrm{diag}(\theta_{im})\, \Sigma_t^{-1}$  (17)

where $W_t$ is a weight matrix consisting of the parameters $\theta_{im}$ and the variances $\Sigma_t$ learned from the demonstrations, so that the demonstrated variations affect the adaptation rewards.

Control Reward

Control Reward characterizes the smoothness of executing the adapted trajectory through the following formulation:

$r^{ctrl}_t = -\,\theta_{c}\, \|u_t\|^{2}$  (18)

where $\theta_{c}$ is the parameter weighting this reward.
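
As a small illustrative sketch, the per-step Imitation and Control rewards above can be evaluated as follows, assuming the diagonal variance is stored as a per-dimension vector; all names are illustrative:

```python
import numpy as np

def imitation_control_reward(x_t, y_imit_t, var_t, u_t, theta_im, theta_c):
    """Per-step Imitation and Control rewards.

    x_t:      adapted state at step t
    y_imit_t: imitation trajectory state at step t
    var_t:    per-dimension variance of the imitation trajectory at step t
    theta_im: per-dimension imitation weights, theta_c: scalar control weight
    """
    diff = x_t - y_imit_t
    r_im = -np.sum(theta_im * diff ** 2 / var_t)   # flexible where variance is large
    r_ctrl = -theta_c * np.dot(u_t, u_t)           # penalize large control effort
    return r_im + r_ctrl
```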

Response Reward

The Response Reward describes the expected response to the environment, such as safety considerations for obstacles and for the object under manipulation. Here we give intuitive examples. Although we can ensure a minimum distance to obstacles using the constraint in Eq. (12), as human users we still expect the robot to transfer a cup full of water around a laptop instead of above it, to avoid spilling. Another example is that the user would prefer the robot, when manipulating sharp objects such as a knife, to keep a relatively larger distance from the human for safety. These examples indicate that we have preferences about how the robot avoids obstacles. Moreover, for safety we also prefer the robot to transfer a fragile object closer to the table top to maintain a safety margin. All of the above preferences are specific to the object under manipulation and the exact environment. Thus, we set the Response Reward such that the better the adapted trajectory fulfills these preferences, the higher the reward.

To formally represent the Response Reward, let us consider a scenario with $n$ obstacles on the table, whose leftmost and rightmost locations are $y_{\min}$ and $y_{\max}$ and whose surface height is $z_{table}$. We can then formulate the Response Reward as follows:

$r^{res}_t = \sum_{j=1}^{n} \theta_{o_j}^{\top}\, \phi_{o_j}(x_t) \;+\; \theta_{b}\, \phi_{b}(x_t) \;+\; \theta_{s}\, \phi_{s}(x_t)$  (19)
$\phi_{o_j}(x_t) = e^{-\|p_t - o_j\|}\, \big[\, \|p_t - o_j\|,\ \ \delta_t^{\top} \,\big]^{\top}$  (20)
$\phi_{b}(x_t) = e^{-(p_{t,y} - y_{\min})} + e^{-(y_{\max} - p_{t,y})}$  (21)
$\phi_{s}(x_t) = e^{-(p_{t,z} - z_{table})}$  (22)

where $\phi_{o_j}(x_t)$ represents the feature vector for preferences in avoiding obstacle $j$: its first element denotes the avoiding distance and the remaining elements form the deviation vector $\delta_t = p_t - p^{I}_t$ from the original imitation trajectory to the adapted one, as shown in Fig. 3. The preferred deviation vector is given as part of the reward weights $\theta_{o_j}$, and the inner product between the two vectors indicates the reward of deviating in the preferred direction. The exponential decay is applied so that these features are only effective when the robot's end-effector is close to the obstacles. $\phi_{b}(x_t)$ and $\phi_{s}(x_t)$ are features related to safety with respect to the borders and the surface of the table, and $\theta_{b}$ and $\theta_{s}$ are the weights corresponding to these features.
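
The per-step Response Reward features can likewise be sketched in code as below; positions are Cartesian, obstacles are (center, radius) pairs, and all names are illustrative:

```python
import numpy as np

def response_features(p_t, p_imit_t, obstacles, y_min, y_max, z_table):
    """Per-step Response Reward features.

    p_t:       (3,) adapted end-effector position at step t
    p_imit_t:  (3,) corresponding point on the imitation trajectory
    obstacles: list of (center, radius) tuples
    Returns a flat feature vector: per-obstacle [distance, deviation vector],
    followed by the table-border and table-surface safety features.
    """
    feats = []
    deviation = p_t - p_imit_t                       # deviation vector (Fig. 3)
    for center, _radius in obstacles:
        dist = np.linalg.norm(p_t - center)
        decay = np.exp(-dist)                        # active only near the obstacle
        feats.extend(decay * np.concatenate(([dist], deviation)))
    feats.append(np.exp(-(p_t[1] - y_min)) + np.exp(-(y_max - p_t[1])))  # borders
    feats.append(np.exp(-(p_t[2] - z_table)))        # closeness to the table surface
    return np.array(feats)

# The Response Reward at step t is the inner product of the learned weights
# with this feature vector: r_res = theta_res @ response_features(...).
```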

Given a set of parameters $\theta$, the MPC module generates an adapted trajectory $\tau^{A}$ by maximizing $R(\tau^{A}; c, \tau^{I})$. The robot can follow the adapted trajectory to execute the task in the novel situation. However, the generated trajectory may not be satisfying from the user's perspective, since the given or initialized parameters may not model the rewards accurately. To address this issue, after the movement execution our system allows the user to provide a better trajectory $\bar{\tau}$ as feedback, which is used to update the parameters in the Rewards Learning stage described next.

Fig. 3: Illustration of deviation vector feature: vector from original imitation trajectory to an adapted one.

IV-C Rewards Learning

In this section, we describe how our system learns the reward function. We assume there is an oracle reward function $R^{*}$ that reflects exactly how much reward the adapted trajectory gains in each context. The goal of this module is to estimate a reward function $R(\tau^{A}; c, \tau^{I})$, with parameters $\theta$ to be learned, that approximates the oracle reward tightly.

By rewriting Eq. (6) and Eq. (15) for the entire trajectory, we can express the reward function in a linear form in terms of features and weights:

$R(\tau^{A}; c, \tau^{I}) = \theta^{\top}\phi(\tau^{A}) = \theta_{im}^{\top}\phi_{im} + \theta_{c}\,\phi_{ctrl} + \theta_{res}^{\top}\phi_{res}$  (23)
$\phi_{im} = -\sum_{t=1}^{T} \Sigma_t^{-1}\big((x_t - y^{I}_t)\odot(x_t - y^{I}_t)\big)$  (24)
$\phi_{ctrl} = -\sum_{t=1}^{T} \|u_t\|^{2}$  (25)
$\phi_{res} = \sum_{t=1}^{T} \big[\,\phi_{o_1}(x_t)^{\top}, \ldots, \phi_{o_n}(x_t)^{\top},\ \phi_{b}(x_t),\ \phi_{s}(x_t)\,\big]^{\top}$  (26)

where $\odot$ denotes element-wise multiplication and $\phi_{im}$, $\phi_{ctrl}$ and $\phi_{res}$ represent the features of the entire trajectory corresponding to the Imitation, Control and Response Rewards, respectively.

Since the user only provides a feedback trajectory and the system cannot directly observe the reward function, we apply the co-active learning technique [15], in which the robot iteratively updates the parameters $\theta$ of $R$ based on the user's feedback. Note that this feedback only needs to indicate $R^{*}(\bar{\tau}) > R^{*}(\tau^{A})$; the feedback trajectories themselves may be non-optimal. Algorithm 1 gives our learning algorithm for movement adaptation.

Initialize $\theta^{(1)}$
for Iteration $i = 1$ to $N$ do
     Task Context and Environment Perception: obtain context $c_i$
     Movement Imitation:
          $\tau^{I}_i \leftarrow$ learned ProMPs conditioned on the initial and target states in $c_i$
     Movement Adaptation:
          $\tau^{A}_i \leftarrow \arg\max_{\tau} \theta^{(i)\top}\phi(\tau; c_i, \tau^{I}_i)$
          (solved by MPC, Eq. (14))
     Movement Execution: execute $\tau^{A}_i$
     if User Provides Feedback: trajectory $\bar{\tau}_i$ then
          $\theta^{(i+1)} \leftarrow \theta^{(i)} + \alpha_i\,\big(\phi(\bar{\tau}_i) - \phi(\tau^{A}_i)\big)$
          Weights Projection:
          $\theta^{(i+1)} \leftarrow \mathrm{Proj}_{\Theta}\big(\theta^{(i+1)}\big)$
     else
          $\theta^{(i+1)} \leftarrow \theta^{(i)}$
     end if
end for
Algorithm 1 Rewards Learning for Movement Adaptation

Note that $\alpha_i$ is a learning rate that decays over iterations, and $\Theta$ in the weights projection step is a bounded set ensuring that the updated parameters remain feasible. After iterations of improvement, the robot learns an estimated reward function that approximates the oracle reward function, as proved in [18]. By maximizing the estimated reward function $R$, the robot can generate an adapted trajectory $\tau^{A}$ that maximizes the reward in situation $c$ based on the imitation trajectory $\tau^{I}$.
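
A minimal sketch of the update and projection steps of Algorithm 1, assuming the trajectory-level features of Eqs. (24)-(26) are available as vectors; the bounds and names are illustrative:

```python
import numpy as np

def coactive_update(theta, phi_feedback, phi_adapted, iteration,
                    alpha0=1.0, lower=None, upper=None):
    """One co-active learning update of the reward weights.

    phi_feedback: features of the user's (possibly sub-optimal) feedback trajectory
    phi_adapted:  features of the trajectory the robot just executed
    """
    alpha = alpha0 / np.sqrt(iteration)              # decaying learning rate
    theta = theta + alpha * (phi_feedback - phi_adapted)
    # projection onto the bounded feasible set Theta
    if lower is not None:
        theta = np.maximum(theta, lower)
    if upper is not None:
        theta = np.minimum(theta, upper)
    return theta
```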

V Experiments

To validate the system described above, we design and conduct the following experiments on a Baxter humanoid platform. The Baxter robot is asked to perform manipulation tasks, such as cleaning, on a table top, with the surface height $z_{table}$, leftmost location $y_{\min}$ and rightmost location $y_{\max}$ given in robot spatial space in meters. It needs to learn to transfer the manipulated object between different locations while avoiding obstacles in the desired manner.

Fig. 4: Movement Imitation with ProMPs for the Transferring Task: (a) Imitation trajectory predicted in spatial space based on the prior movement and the task contexts; (b) Imitation trajectory for one joint in joint space, with the shaded area indicating the predicted variance.
Fig. 5: Learning to Adapt Movement for Transferring a Leaking Bottle: (a) Movement imitation, which failed to avoid the obstacle; (b) Movement adaptation with initial weights, which successfully avoided the obstacle by a path above it but risked spilling water; a feedback trajectory is provided afterwards; (c) Movement adaptation for a different situation with new task contexts and obstacle locations, using the weights updated after learning from the feedback trajectory, which successfully avoids the obstacle through a path around it. The corresponding execution on the Baxter platform is shown in Fig. 2.

During the off-line learning phase, the robot learns the movement skill from multiple kinesthetic demonstrations with no obstacles on the table. During the on-line learning stage, a variety of obstacles are placed randomly on the table, and we assume the robot can obtain their locations from the perception module. The system iteratively learns to adapt the movement skill to novel situations, such as avoiding obstacles in different manners, while following the movement pattern learned from the off-line demonstrations.

V-A Movement Imitation

In the first stage of the experiments, the robot learns the movement skill off-line from demonstrations. All trajectories are sampled discretely and normalized to $T$ steps for the transferring movement in joint space, and the left arm of the Baxter has seven degrees of freedom. The training trajectories are encoded by ProMPs with $K$ Gaussian basis functions so that the movement skill can be generalized to different initial and target states.

Fig. 4(a) shows an example of a generated imitation trajectory in spatial space for new task contexts using ProMPs. Fig. 4(b) shows the corresponding imitation trajectory of one joint in joint space. The blue crosses are the desired new initial and target states, and the shaded area is the estimated variance of the imitation trajectory, which reflects the variations across demonstrations. The true trajectory is a trajectory recorded from a user demonstration in the testing scenario for comparison. The predicted mean of the imitation trajectory generalizes well to the new initial and target states and follows the same movement pattern as the prior mean trajectory learned from demonstrations. Therefore, the robot can perform the task well by following this imitation trajectory if there are no obstacles or other safety constraints in the new situation.

V-B Learning Adaptation

When facing the task of transferring a leaking bottle, the robot may find a bowl with food inside as an obstacle on the table; its center location and the minimum safety distance are assumed to be obtained through perception.

For movement adaptation, we set the prediction horizon $H$ in model predictive control and select the system matrices $A$ and $B$ to make the system stable within the prediction window, as suggested in [13]. The joint limits can be found in the Baxter hardware specification, and a minimum safety distance from the table border is set. The weights of the reward function are initialized with default values. We then apply MATLAB's gradient-based optimization method fmincon to solve the optimization at each time step.

Fig. 5(a) shows the output of movement imitation for transferring the leaking bottle, which fails to avoid the obstacle even though the trajectory generalizes to the novel initial and target states. Fig. 5(b) shows the movement adaptation with the initial weights. At this point, no preference is specified in the reward function about how to avoid obstacles or about safety considerations near the table borders. Therefore, even though the adapted trajectory avoids the obstacle successfully, it may not be an ideal trajectory.

Fig. 6: Rewards Learning from User Feedback for Transferring a Leaking Bottle: (a) User feedback via kinesthetic demonstration; (b) Learning curve for adaptation under the same feedback.

To learn the user preference, we then provide feedback via kinesthetic demonstration, as illustrated in Fig. 6(a); the feedback trajectory is shown in Fig. 5 as a dashed line to indicate the user preference. Following Algorithm 1, the robot iteratively updates the reward weights based on the user feedback. The weights are limited via projection onto the bounded feasible set $\Theta$, where only the parameters indicating the preferred deviation direction are allowed to take negative values. To quantitatively validate the performance of our method in movement adaptation, we use the cumulative error between the adapted trajectory and the feedback trajectory as the learning error at iteration $i$. Since this metric is affected by the specific situation, such as the obstacles’ locations, we keep the feedback trajectory fixed and let the robot learn iteratively several times to record the “learning curve” under the same feedback. From Fig. 6(b), we can see that the error decreases and converges after several iterations, and only a few iterations are required to achieve an adapted trajectory with the desired preference according to the feedback.

After learning, the robot uses the updated weights for movement adaptation in a different situation with novel initial/target states and obstacle locations. Fig. 5(c) shows the adapted trajectory based on the weights updated after one iteration, where the robot successfully avoids the obstacle in the desired direction.

In a second scenario, where the robot transfers a knife around a fragile obstacle, the user may prefer the robot to avoid the obstacle by passing above it instead of around it. With the same method, adapted trajectories generated with the initial weights are shown in Fig. 7(a) and Fig. 7(c). With the user-provided feedback trajectory, the robot successfully learns the user-specified preferences for movement adaptation and generates improved adapted trajectories for different situations, as shown in Fig. 7(b) and Fig. 7(d).

Fig. 7: Baxter Learning to Adapt Movement for Transferring a Knife: (a)(c) Movement adaptation with initial weights, which successfully avoided the duck doll by going around it but risked scratches; a feedback trajectory is then provided to indicate the adaptation preferences; (b)(d) Movement adaptation for different situations, using the weights updated after learning from the feedback trajectory, which successfully avoided the duck doll by passing above it, as desired.

VI Conclusion and Future Work

We present a framework for learning to adapt the robot end-effector's movement for manipulation tasks. The proposed method generalizes offline-learned movement skills to novel situations, considering obstacle avoidance and other task-dependent constraints. It adapts the imitation trajectory generated from demonstrations, while maintaining the learned movement pattern and its variations, to avoid obstacles with the desired directions and distances and to keep a safety margin within the workspace. It also provides a way to learn how to adapt the movement through on-line interaction from the user's feedback.

Besides learning how to adapt movement from the user's feedback, the visual information of the objects and the environment could also indicate preferences for movement adaptation. For instance, the deviation direction for avoiding a knife could be inferred directly from the location of its blade in visual space. We are further investigating the possibility of directly learning the preferences for adapting movement from visual perception of the task context.

References

  • [1] A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal, “Dynamical movement primitives: learning attractor models for motor behaviors,” Neural computation, vol. 25, no. 2, pp. 328–373, 2013.
  • [2] T. Asfour, P. Azad, F. Gyarfas, and R. Dillmann, “Imitation learning of dual-arm manipulation tasks in humanoid robots,” International Journal of Humanoid Robotics, vol. 5, no. 02, pp. 183–202, 2008.
  • [3] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal, “Learning and generalization of motor skills by learning from demonstration,” in Robotics and Automation, 2009. ICRA’09. IEEE International Conference on.   IEEE, 2009, pp. 763–768.
  • [4] J. Kober, B. Mohler, and J. Peters, “Learning perceptual coupling for motor primitives,” in Intelligent Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Conference on.   IEEE, 2008, pp. 834–839.
  • [5] A. Gams, A. J. Ijspeert, S. Schaal, and J. Lenarčič, “On-line learning and modulation of periodic movements with nonlinear dynamical systems,” Autonomous robots, vol. 27, no. 1, pp. 3–23, 2009.
  • [6] F. Guenter, M. Hersch, S. Calinon, and A. Billard, “Reinforcement learning for imitating constrained reaching movements,” Advanced Robotics, vol. 21, no. 13, pp. 1521–1544, 2007.
  • [7] S. Calinon, F. D’halluin, E. L. Sauser, D. G. Caldwell, and A. G. Billard, “Learning and reproduction of gestures by imitation,” Robotics & Automation Magazine, IEEE, vol. 17, no. 2, pp. 44–54, 2010.
  • [8] S. M. Khansari-Zadeh and A. Billard, “Imitation learning of globally stable non-linear point-to-point robot motions using nonlinear programming,” in Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on.   IEEE, 2010, pp. 2676–2683.
  • [9] T. Inamura, I. Toshima, H. Tanie, and Y. Nakamura, “Embodied symbol emergence based on mimesis theory,” The International Journal of Robotics Research, vol. 23, no. 4-5, pp. 363–377, 2004.
  • [10] A. Paraschos, C. Daniel, J. R. Peters, and G. Neumann, “Probabilistic movement primitives,” in Advances in neural information processing systems, 2013, pp. 2616–2624.
  • [11] D.-H. Park, P. Pastor, S. Schaal et al., “Movement reproduction and obstacle avoidance with dynamic movement primitives and potential fields,” in Humanoid Robots, 2008. Humanoids 2008. 8th IEEE-RAS International Conference on.   IEEE, 2008, pp. 91–98.
  • [12] H. Hoffmann, P. Pastor, D.-H. Park, and S. Schaal, “Biologically-inspired dynamical systems for movement generation: automatic real-time goal adaptation and obstacle avoidance,” in Robotics and Automation, 2009. ICRA’09. IEEE International Conference on.   IEEE, 2009, pp. 2587–2592.
  • [13] A. M. Ghalamzan E., C. Paxton, G. D. Hager, and L. Bascetta, “An incremental approach to learning generalizable robot tasks from human demonstration,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on.   IEEE, 2015, pp. 5616–5621.
  • [14] E. A. Sisbot, L. F. Marin, and R. Alami, “Spatial reasoning for human robot interaction,” in Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on.   IEEE, 2007, pp. 2281–2287.
  • [15] A. Jain, S. Sharma, T. Joachims, and A. Saxena, “Learning preferences for manipulation tasks from online coactive feedback,” The International Journal of Robotics Research, p. 0278364915581193, 2015.
  • [16] R. Mao, Y. Yang, C. Fermuller, Y. Aloimonos, and J. S. Baras, “Learning hand movements from markerless demonstrations for humanoid tasks,” in Humanoid Robots (Humanoids), 2014 14th IEEE-RAS International Conference on.   IEEE, 2014, pp. 938–943.
  • [17] Z. Ju, C. Yang, and H. Ma, “Kinematics modeling and experimental verification of baxter robot,” in Control Conference (CCC), 2014 33rd Chinese.   IEEE, 2014, pp. 8518–8523.
  • [18] G. Ciná and U. Endriss, “Proving classical theorems of social choice theory in modal logic,” Autonomous Agents and Multi-Agent Systems, pp. 1–27, 2016.