C-3PO: Cyclic-Three-Phase Optimization for Human-Robot Motion Retargeting based on Reinforcement Learning

Taewoo Kim and Joo-Haeng Lee

Taewoo Kim is with the Department of Computer Software and Engineering, Korea University of Science and Technology, Daejeon, Republic of Korea (twkim0812@gmail.com). Joo-Haeng Lee is with the Human-Robot Interaction Research Group, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea (joohaeng@etri.re.kr).

Motion retargeting between heterogeneous polymorphs with different sizes and kinematic configurations requires comprehensive knowledge of kinematics and inverse kinematics. Moreover, it is non-trivial to provide a general solution that is independent of the kinematics. In this study, we developed a cyclic three-phase optimization method based on deep reinforcement learning for human-robot motion retargeting. Motion retargeting and reward calculation are performed on refined data in a latent space through the cyclic and filtering paths of our method. In addition, the human-in-the-loop three-phase approach provides a framework for improving the motion retargeting policy both quantitatively and qualitatively. Using the proposed C-3PO method, we successfully learned the motion retargeting skill between the human skeleton and the motions of the real NAO, Pepper, Baxter, and C-3PO robots.

I Introduction

Humans can effortlessly imitate the motions of others with different body sizes, or even of animals, thanks to an extraordinary motion retargeting skill: they grasp the target's motion attributes from visual information and map them appropriately onto their own joints. There have been several attempts to teach this motion retargeting skill to robots for motion imitation. Direct joint mapping [42, 50, 24] and inverse kinematics (IK)-solver-based methods [30, 31] require expertise in robot kinematics and are difficult to generalize because kinematic configurations differ between robots. They also suffer from the singular position problem [8] and a high IK calculation cost [31]. Recent machine-learning-based approaches learn imitation skills from demonstrations, which are collected by visual sensors [25, 7, 50, 24], motion capture (MoCap) [20, 32, 48, 18], and virtual reality (VR) devices [49, 2]. However, visual-sensor-based sampling (e.g., of the human skeleton) is very noisy and unstable. MoCap and VR methods incur additional cost, and the devices are inconvenient to wear. Direct teaching (DT) methods [14, 22, 45, 39] are intuitive, but it is difficult to collect a large number of demonstrations with them.

Fig. 1: Motion retargeting from humans to the C-3PO robot* using the C-3PO algorithm.

* https://www.turbosquid.com/3d-models/c-3po-star-wars-3d-obj/903731

In this study, we advance the three-phase framework developed in our previous work (“TeachMe: Three-phase learning framework for robotic motion imitation based on interactive teaching and reinforcement learning”, accepted at Ro-Man 2019) for learning human-robot motion retargeting skills. The robot agent learns a mapping policy between the human skeleton and the robot motion using reinforcement learning. Mapping is performed in the latent space, which is trained in phase 1. In phase 2, quantitative learning is performed based on a reward function and a simulator. Because the skeleton cannot provide the yaw angles of the wrist or the neck joint, we learn such motion details with a small set of direct teachings in phase 3. In our improved framework, we remove the noise in the skeleton data and utilize the latent space more actively using filtering and cyclic paths.

In reinforcement learning, the widely used temporal-difference (TD) method works effectively in a Markovian environment. For a robotic task to be Markovian, the state of the robot agent should include not only the angular position but also its rate of change (angular velocity), so that the next state can be predicted from the current state. However, robots using low-cost motors such as Dynamixel [28] may not provide accurate angular velocity due to sensor errors and delays in the control system [5]. Because our goal is to build a model that can be applied to such low-cost systems, we modeled motion retargeting as a non-Markovian problem in which the state of the agent contains only positional information without velocity. We learn the motion retargeting policy with the Monte-Carlo (MC) method, which works more effectively than the TD method in this non-Markovian environment.

Our main contributions can be summarized as follows:

1) We propose a novel architecture by reusing the network neglected in the previous work.

2) Based on the newly proposed cyclic and filtering paths, we define an extended state in a latent space and a refined reward function. This method shows higher performance than the one developed in our previous work.

3) Based on a unified policy and an encoder-decoder network, which embraces all motion classes, we show that our model can sufficiently perform social robot motion retargeting using the MC method in the non-Markovian environment.

Fig. 2: Our target motion classes chosen from NTU-DB.

II Related Work

Motion retargeting has been attracting significant attention in many research fields including robotics and computer graphics [3]. In this section, we review the related studies on motion retargeting and reinforcement learning.

II-A Motion Retargeting

Gleicher [13] proposed a method of motion retargeting to a new character with an identical kinematic structure and different segment lengths using geometrically constrained optimization and a simple objective function. For online motion retargeting, Choi et al. [6] improved offline motion retargeting by space-time constraints and inverse rate control. Monzani et al. [30] exploited an intermediate skeleton and an IK solver for retargeting a character's motion to a geometrically and topologically different one. Another study attempted to retarget a motion between characters with different skeleton configurations such as humans and dogs [16]. Baran and Popović [4] proposed an automatic rigging and modeling algorithm from 3D character shapes, called Pinocchio. Hecker et al. [15] proposed a real-time motion retargeting method for highly varied user-created characters using a particle IK solver. Park et al. [33] proposed example-based motion cloning. In their work, using scattered data interpolation, the animator clones the behavior of the source example motion by specifying the key postures between the source and the target with dynamic time-warping. They solved the time misalignment between the source and the target animation by fine-tuning the main algorithm process.

In the robotics field, there are many studies on motion retargeting between human motions and humanoid robots. Dariush et al. [9, 10] proposed an online motion retargeting method, which transfers human motions obtained from depth sensors to the humanoid robot ASIMO based on a constrained IK solver. Sen et al. [47] estimated a human pose from the 3D point cloud of a depth sensor and retargeted the pose to a humanoid robot without any skeleton and joint limitations. Ayusawa and Yoshida [1] presented a motion retargeting method that solves the geometric parameter identification for motion morphing and the motion optimization simultaneously. With MoCap sensors, IK-solver-based motion retargeting methods from humans to robots have been widely studied in recent years [46, 34].

Although most motion retargeting studies have used IK-solver-based approaches, in this study, we applied reinforcement learning to motion retargeting without using any IK solvers. We also exploited the fine tuning approach for pose correction after the main learning phase as in [33].

II-B Reinforcement Learning

In recent years, reinforcement learning (RL) has been used in various research areas including computer games [29, 23, 43], robotics [27], and animation [35], and has outperformed previous approaches. Many studies in robotics used RL for a specific task such as ball throwing [12], pick & place [17], vision-based robotic grasping [37], robotic navigation [11], and other robotic tasks in daily life [26]. Peng et al. [35] demonstrated learning skills such as locomotion, acrobatics, and martial arts on animation characters based on a reference motion and the proximal policy optimization (PPO) [40] RL algorithm. We adopted the reference motion and the PPO algorithm with a variational auto-encoder (VAE)-based [19] network architecture [12] in our learning model.

III Preliminaries

In this section, we describe the background knowledge and preliminary processes needed for a better understanding of the method.

III-A Deep Reinforcement Learning

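We model motion retargeting as an infinite-horizon discounted partially observable Markov decision process (POMDP), defined as a tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, r, \gamma, \rho_0)$ with a state space $\mathcal{S}$, partial observation space $\mathcal{O}$, action space $\mathcal{A}$, state transition probability function $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, discount factor $\gamma \in [0, 1)$, and initial state distribution $\rho_0$. The goal of the agent is to learn a deterministic policy $\pi_\theta: \mathcal{O} \to \mathcal{A}$ that maximizes the expected discounted reward over an infinite horizon:

$$\pi^{*} = \arg\max_{\pi_\theta} \mathbb{E}\left[ R_0 \right], \quad (1)$$

where the return is defined as follows:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r(s_{t+k}, a_{t+k}). \quad (2)$$

We adopted a PPO-based [40] actor-critic algorithm [21] to learn the policy parameters $\theta$ for the actor and $\psi$ for the critic network. The critic network evaluates the action-value of the policy. We define a Q-function, which describes the expected return under policy $\pi_\theta$ from action $a_t$ at state $s_t$, as follows:

$$Q^{\pi_\theta}(s_t, a_t) = \mathbb{E}\left[ R_t \mid s_t, a_t \right]. \quad (3)$$

During training, the agent's experience data, represented by a set of tuples $(s_t, a_t, r_t, s_{t+1})$, are stored in a rollout memory, where the states and actions are expressed through the encoded latent representations of a skeleton and a robot posture at time $t$. The experience tuples stored in the rollout memory are then used to optimize the actor and the critic network.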

Fig. 3: [Left] NTU-DB skeleton and each joint number. [Right] Transformation from camera to robot coordinates.
| Class Name | Refined Scene (Filtered / Total) | No. of Frames |
|---|---|---|
| Cheer up | 533 / 948 (56.2%) | 37,613 |
| Hand waving | 522 / 948 (55.0%) | 37,228 |
| Pointing with finger | 500 / 948 (52.7%) | 28,296 |
| Wipe face | 389 / 948 (41.0%) | 42,172 |
| Salute | 508 / 948 (53.5%) | 29,258 |
| Put the palms together | 493 / 948 (52.0%) | 27,994 |

TABLE I: NTU-DB Data Refinement Statistics.

III-B Source Dataset

For human-robot motion retargeting, we utilized the public human motion dataset NTU-DB [41]. Of the 12 motion classes initially chosen from a total of 60, six classes (e.g., shake head) were ruled out because their precise motions cannot easily be captured from the skeleton alone. The final six motion classes are {cheer up, hand waving, pointing with finger, wipe face, salute, put the palms together} (Fig. 2). We also excluded the data with severe noise and used 90% of the remaining data for training and 10% for evaluation (Table I). The NTU-DB data manipulation code can be found in our repository: https://github.com/gd-goblin/NTU_DB_Data_Loader.

III-C Data Pre-Processing

The skeleton data of the NTU-DB are given in camera coordinates, whereas the robot data are given in the robot's torso coordinates. Because the reward in phase 2 is calculated using direction vector similarities, coordinate alignment between the skeleton and the robot is essential. For proper alignment, we made the following assumptions.

At least within the selected motion classes:

  • No bending posture at the waist exists.

  • Therefore, the shoulder, torso, and pelvis center joints of the skeleton are coplanar.

  • The vector from the left shoulder joint to the right is always parallel to the ground.

Based on these assumptions, we performed coordinate alignment in two steps: 1) normalization with respect to (w.r.t.) the skeleton torso frame, and 2) rotation w.r.t. the robot basis frame. In the first step, each skeleton joint position is normalized by subtracting the torso position from all skeleton joints. For the second step, we first need to build a local coordinate frame identical to the robot torso frame. To do this, we obtain one vector from the joints numbered 1 and 2 in Fig. 3 and another from the joints numbered 5 and 9. We can then calculate the anterior axis from these two vectors and obtain the cranial vector from the resulting axes. From the normalized local coordinate frame, we create a direction cosine matrix (DCM) $C$ and transform the skeleton in camera coordinates using the DCM transpose, which is identical to the robot basis frame matrix:

$$\hat{p}_i = C^{\top} \bar{p}_i, \quad (5)$$

where $\bar{p}_i$ and $\hat{p}_i$ are the normalized joint positions and the positions transformed to the robot coordinates, respectively.
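The two alignment steps can be illustrated with a short NumPy sketch. It is not the authors' code: the use of cross products to build the axes, the choice of joint 1 as the torso origin, the right-to-left direction of the shoulder vector, and the function name `align_skeleton` are all assumptions made for this example.

```python
import numpy as np

def align_skeleton(joints_cam, torso=1, spine=2, left_shoulder=5, right_shoulder=9):
    """Express camera-frame joints (J x 3) in a torso-centered local frame.

    Joint indices are 1-based, following the numbering of Fig. 3.
    Assumed axes: x = anterior, y = lateral (right to left), z = cranial.
    """
    p = np.asarray(joints_cam, dtype=float)
    # Step 1: normalize by subtracting the torso position from every joint.
    centered = p - p[torso - 1]
    # Step 2: build an orthonormal local frame from two body vectors.
    up_hint = centered[spine - 1] - centered[torso - 1]
    lateral = centered[left_shoulder - 1] - centered[right_shoulder - 1]
    y_axis = lateral / np.linalg.norm(lateral)
    x_axis = np.cross(y_axis, up_hint)      # perpendicular to both body vectors
    x_axis /= np.linalg.norm(x_axis)
    z_axis = np.cross(x_axis, y_axis)       # completes the right-handed frame
    dcm = np.column_stack([x_axis, y_axis, z_axis])
    # Row-wise product equals dcm.T @ p for each joint position p.
    return centered @ dcm
```

Because the frame is rebuilt from the skeleton itself, the output does not depend on where the camera was placed, which is the property the direction-vector reward relies on.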

Fig. 4: Cyclic-three-phase optimization framework for human-robot motion retargeting. In phase 1, the latent manifolds for the skeleton and the robot motion are trained using the NTU-DB and the robot reference motion. Quantitative learning is performed using a simulator and a reward function in phase 2. The policy is optimized by DT-based fine-tuning in phase 3.

IV Method

We designed a three-phase framework to learn a human-robot motion retargeting skill. The policy evolves through learning by (in)direct human guidance at each phase. In addition, we applied filtering and cyclic paths and the n-step MC method to our policy network for better performance.

IV-A Problem Formulation

The skeleton generation function takes an image of a human posture at time t and generates a skeleton vector corresponding to that posture, where the raw skeleton data contain the x, y, z positions of all joints. The skeleton encoder then takes the transformed skeleton (Eq. (5)) derived from the raw skeleton data and generates a seven-dimensional latent representation. This skeleton latent vector can be decoded by the skeleton decoder for later use in skeleton reconstruction and latent representation learning. Similarly, the robot motion, defined by joint angles (rad) at time t, is encoded by the robot motion encoder into a latent vector, which can be decoded by the robot motion decoder for later use in robot motion reconstruction and latent representation learning. Our mapping policy performs motion retargeting by mapping the latent representation of the skeleton to that of the robot motion.

IV-B Phase 1: Learning Latent Manifold

In the first phase, we learn the latent manifolds of the skeleton and the robot motion using a VAE [19] (Fig. 4). The skeleton encoder consists of four fully connected (FC) layers with 512, 256, 128, and 64 units and ReLU activations, and encodes the transformed skeleton into a seven-dimensional latent vector. The skeleton decoder has the same structure as the encoder but in reverse order. We created a unified skeleton encoder-decoder by learning from the data of all six motion classes at once. The training data are a randomly selected 90% of the refined NTU-DB (Table I).
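As a rough illustration of the layer sizes described above, the following sketch builds a plain NumPy multilayer perceptron with the 512-256-128-64 hidden layers and a seven-dimensional output. It is not the authors' implementation: the input dimension of 75 (25 joints times 3 coordinates), the weight initialization, and the helper names `init_mlp` and `encode` are assumptions made for this example, and the VAE sampling step and the training loop are omitted.

```python
import numpy as np

# Layer widths: assumed 75-d input, four hidden FC layers, 7-d latent output.
SKELETON_ENCODER_SIZES = [75, 512, 256, 128, 64, 7]

def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for consecutive layer widths."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def encode(x, params):
    """Forward pass: ReLU after every hidden layer, linear latent output."""
    h = np.asarray(x, dtype=float)
    for i, (weight, bias) in enumerate(params):
        h = h @ weight + bias
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(0)
encoder_params = init_mlp(SKELETON_ENCODER_SIZES, rng)
latent = encode(rng.normal(size=(4, 75)), encoder_params)  # batch of 4 skeletons
```

A decoder in the same style would simply reverse the list of layer widths.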

To learn the robot motion encoder-decoder, we first need to sample a set of reference robot motion trajectories, one for the motion in each class. We generated reference motion trajectories for all classes using the V-REP [38] and Choregraphe [36] simulators. The reference motion generation took a few minutes per class on average. Based on the reference motion, an augmented training dataset was generated by iteratively adding uniform noise to the reference trajectories. We augmented the reference motion trajectories up to 20k frames per class and combined them to learn a unified robot motion encoder-decoder from a total of 120k augmented frames. The robot motion network consists of three FC layers with 256, 128, and 64 units and Tanh activations for the encoder, a seven-dimensional latent representation, and a decoder with the same layers in reverse order. Both the skeleton and robot motion networks are trained using an MSE loss with a learning rate of 1e-4, a weight decay of 1e-6, and a batch size of 128.
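The augmentation of a reference trajectory can be sketched as follows. This is a minimal example under stated assumptions: the noise range (`noise_scale`) is a hypothetical parameter, noise is added to the clean reference on every pass rather than accumulated, and the function name `augment_reference` is made up for illustration.

```python
import numpy as np

def augment_reference(reference, target_frames, noise_scale, rng=None):
    """Repeat a reference trajectory with uniform noise until target_frames rows exist.

    reference: (T, D) array of joint angles in radians.
    noise_scale: half-width of the uniform noise interval (assumed value).
    """
    rng = np.random.default_rng() if rng is None else rng
    reference = np.asarray(reference, dtype=float)
    chunks = [reference]                 # keep the clean reference as the first chunk
    total = len(reference)
    while total < target_frames:
        noise = rng.uniform(-noise_scale, noise_scale, size=reference.shape)
        chunks.append(reference + noise)
        total += len(reference)
    return np.concatenate(chunks, axis=0)[:target_frames]
```

Calling this once per class with `target_frames=20000` and concatenating the six results would give a dataset of the size mentioned in the text.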

Fig. 5: Training results on the NAO, Baxter, and C-3PO robot. Policies using FLT and cyclic show the best performance.

IV-C Phase 2: Learning Mapping Function

In the second phase, we learn the mapping policy for proper motion retargeting based on a simulator and a reward function. In the forward step, represented by the gray line in the second row of Fig. 4, the skeleton encoder encodes a raw skeleton into a latent vector at time t. The actor then maps it to a robot motion latent vector, and the decoded vector is transferred to the robot in the simulator. After processing one time step (50 ms), the simulator outputs the next state, which contains the relative x, y, z positions of the robot arm joints w.r.t. the torso frame, namely the left and right shoulder, elbow, and wrist joints. This position vector is used in the reward function of Eq. (6).

Eq. (6) describes the reward based on the similarities in the arm vectors between the skeleton and the robot. The vectors compared are the direction vectors of the upper and lower parts of the left and right arms for both the skeleton and the robot (see the right side of the second row in Fig. 4). These vectors are obtained by taking vector differences; e.g., the upper right robot arm vector is the difference between the right elbow and right shoulder positions. Even though we ruled out the skeletons with severe noise, noisy data still remain in our dataset. Thus, for the skeleton, we calculated the arm vectors from the reconstructed skeleton, which acts as a denoised version [19], in the reward calculation. The cosine-similarity-based reward is then normalized by multiplying by the error amplitude constant (-2.0) and applying the exponential function. In phase 2, there is a cyclic structure for learning the critic network based on the latent representation. The joint angle in the next step is encoded again and combined with the next latent vector of the skeleton to form the full state. The critic network evaluates the action-value of the agent based on this full state, which contains the information of both the skeleton and the robot motion. In Section V, we present comparative results showing that this cyclic architecture can improve the motion retargeting performance.

The objective of phase 2 is to find the optimal policy parameters that maximize the expected reward, where the next state is obtained by applying the decoded action to the simulator. Our unified policy network consists of three 512-unit FC layers with ReLU.
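The arm-vector reward of this phase can be approximated with the sketch below. Only the error amplitude constant (-2.0), the use of cosine similarity, and the exponential normalization come from the text; averaging the per-vector errors, defining the error as one minus the cosine similarity, and the function names are assumptions for illustration, so the exact form of Eq. (6) may differ.

```python
import numpy as np

ERROR_AMPLITUDE = -2.0  # constant stated in the text

def arm_vectors(shoulder, elbow, wrist):
    """Upper-arm and lower-arm direction vectors from three joint positions."""
    return elbow - shoulder, wrist - elbow

def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between u and v; eps guards against zero-length vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def phase2_reward(skeleton_vecs, robot_vecs):
    """Reward in (0, 1]: 1 when every arm vector pair points the same way."""
    errors = [1.0 - cosine_similarity(u, v)
              for u, v in zip(skeleton_vecs, robot_vecs)]
    return float(np.exp(ERROR_AMPLITUDE * np.mean(errors)))
```

In use, `skeleton_vecs` would hold the four arm vectors of the reconstructed skeleton and `robot_vecs` the matching vectors computed from the simulator's joint positions.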

IV-D Phase 3: Policy Optimization by Fine Tuning

Even though the policy performs the mapping between the latent manifolds learned from the reference motion in phase 1, false retargeting can still occur because the reward of phase 2 does not consider the posture of the head or the wrist. In the last phase, we correct this false retargeting using DT-based fine-tuning. First, we collected a ground truth dataset of 512 frames per class in about ten minutes using DT. The ground truth consists of a set of transformed skeleton frames, the corresponding robot joint angles, and the robot joint positions. Because the reward of phase 2 was calculated using the reconstructed skeleton, we constructed a cyclic structure that encodes the reconstructed skeleton (see phase 3 in Fig. 4). In the forward pass, the actor retargets the latent vector of a ground truth skeleton to generate a robot motion prediction. After one simulation step, the observed next robot state and the corresponding ground truth are encoded to calculate the reward of phase 3. This reward is calculated from the distance between the robot motion prediction and the corresponding ground truth in the latent space, with a human teaching error term (whose two parameters are set to 0 and 1e-3), and is then normalized between 0 and 1. The objective of phase 3 is to determine the optimal policy parameters under this reward.


IV-E n-Step Monte-Carlo Learning

In general, the MC method yields unbiased but high-variance estimates, whereas the TD method yields biased but low-variance estimates. This is because MC updates the policy empirically with the actual return, whereas TD estimates the expected return by bootstrapping [44]. MC is usually used in episodic environments; however, it can be applied to our motion retargeting problem because, although we modeled it as a non-episodic task, a reward is obtained at every time frame. Owing to this continuing environment with a reward at every step, we can apply the n-step MC to our problem:


where n represents the number of steps in the n-step MC. We present comparative results for the n-step MC and TD (Eq. (3)) methods in the next section.
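One way to read the n-step MC return is as a discounted sum of the next n observed rewards with no bootstrapped value term. The sketch below implements that reading; the truncation at the end of the rollout and the function name `n_step_returns` are assumptions, and the authors' exact formulation may differ.

```python
def n_step_returns(rewards, gamma, n):
    """Discounted sum of up to n actual rewards starting at each time step.

    No value estimate is bootstrapped, so the target uses observed rewards only.
    Windows that run past the end of the rollout are truncated.
    """
    horizon = len(rewards)
    returns = []
    for t in range(horizon):
        g = 0.0
        for k in range(min(n, horizon - t)):
            g += (gamma ** k) * rewards[t + k]
        returns.append(g)
    return returns

# With n = 1 the target reduces to the immediate reward at each frame.
example = n_step_returns([1.0, 1.0, 1.0, 1.0], gamma=0.5, n=3)
```

Larger n lets the target reflect more of the future trajectory at the price of higher variance, which matches the bias-variance trade-off discussed in this subsection.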

V Experiments

Intuitive Motion Retargeting. Our C-3PO algorithm can be applied to various robots with different kinematics and sizes. This method is more intuitive than the other methods such as direct joint mapping or IK-solver-based methods because it does not require knowledge about mathematical modeling of kinematics. Through this method, we can learn the motion retargeting skill by manually appointing major joints (e.g., shoulder) and generating simple reference motion. We were successfully able to teach motion retargeting skills to the real NAO, Pepper, Baxter and C-3PO robots (Fig.6).

Learning Details. We used the learning rates of 1e-4 for the actor and 2e-4 for the critic. The rest of the hyper-parameters were set as follows: rollout steps=2048, PPO epoch=5, mini batch size=32, =0.98, entropy coefficient=5e-3. V-REP and Choregraphe run at 20Hz. Each unified policy learning for 1 million frames takes about 8 h on i7-8700K and Titan Xp.

Fig. 6: Motion Retargeting result using our C-3PO algorithm. The Pepper was controlled using the NAO’s policy because they have identical kinematic configurations.

V-A Ablation Study on Network Architecture

To verify the performance in terms of network architecture, we evaluated the four combinations of the raw or the filtered skeleton with cyclic or acyclic path-based reward calculation. Fig. 5 presents the training results for these four cases. Owing to noise filtering, policies using the filtering path (FLT) perform far better than those using the raw skeleton. The cyclic path is also shown to help the policy output better actions. This ablation study shows that the proposed method is effective in improving the latent-space-based motion retargeting task for various types of robots with different kinematic configurations and sizes.

Columns: TD (FLT; cyclic) | MC (FLT; cyclic), 1-step | MC (FLT; cyclic), 3-step | MC (FLT; cyclic), 5-step

TABLE II: Performance comparison of the TD and n-step MC learning methods, evaluated by the average reward and standard deviation over 5k frames.

V-B Temporal Difference and n-step Monte-Carlo Learning

We evaluated the performance of TD and MC w.r.t. the number of steps over 5k frames. Because the policy with the filtering and cyclic paths showed the best performance in the previous subsection, we considered only the FLT and cyclic policy for the TD method, and the number of steps n of MC was set to 1, 3, and 5. The experimental results in Table II, evaluated by the average reward and its standard deviation, suggest that MC outperforms the TD method in the non-Markovian motion retargeting problem. Within MC, the 3-step variant outperforms the others, but the overall performance is similar, and there is no dramatic improvement beyond three steps.

Fig. 7: Policy optimization by DT-based fine tuning. With our method, the policy was able to learn motion details while retaining more of the retargeting skill learned in phase 2 than in our previous work.

V-C Policy Optimization Experiments on Phase 3

In our previous study, policy optimization by fine-tuning was performed using differences in the joint space. We were able to learn the motion details; however, there was a significant loss of the retargeting skill learned in the previous phase. We attribute this rapid collapse of the policy to a reward space mismatch; i.e., the phase 3 reward is obtained in the joint space, whereas the phase 2 reward is based on the Cartesian space. To learn the motion details while retaining the learned skill as much as possible, we optimized the policy by fine-tuning in the common latent space using a cyclic path. The ground truth dataset of 3k frames for all six motion classes was sampled and shuffled scene by scene. Except for the learning rate of the actor (2e-4) and the rollout steps (256), the learning parameters were identical to those in phase 2. In this experimental setting, we successfully corrected our policy, as shown in Fig. 7. As fine-tuning progresses, some of the motion retargeting skill of phase 2 is lost. However, learning through the cyclic path, where the reward is calculated from the distance in the latent space, helps preserve the motion retargeting skill of phase 2. Compared to our previous work, we used a smaller training dataset with the unified encoder-decoder and policy while retaining more of the pre-learned skill. Qualitative results can be found in our supplementary video: https://youtu.be/hXEQWXWDpTQ.

VI Discussion and Conclusion

In this study, we developed the C-3PO method for human-robot motion retargeting. In comparison with the previous work, we achieved a significant improvement in performance through the cyclic and filtering paths. In addition, we set up a non-Markovian environment so that our method can be applied to low-cost systems and solved it using the n-step MC method. Although we could train the motion details using a small set of direct teachings, a motion ambiguity problem occurred on rare occasions because we used frame-by-frame motion retargeting. In future work, we will study trajectory-based motion retargeting. We expect that this human-in-the-loop learning framework can be extended to other robotic tasks such as human-robot interaction, especially when objects are involved.


This work was supported by the UST Young Scientist Research Program through the University of Science and Technology (No. 18AS1810).

This work was supported by the ICT R&D program of MSIP/IITP [2017-0-00162, Development of Human-care Robot Technology for Aging Society].


  • [1] K. Ayusawa and E. Yoshida (2017) Motion retargeting for humanoid robots based on simultaneous morphing parameter identification and motion optimization. IEEE Transactions on Robotics 33 (6), pp. 1343–1357. Cited by: §II-A.
  • [2] S. Baek, S. Lee, and G. J. Kim (2003) Motion retargeting and evaluation for vr-based training of free motions. The Visual Computer 19 (4), pp. 222–242. Cited by: §I.
  • [3] J. Bandera, J. Rodriguez, L. Molina-Tanco, and A. Bandera (2012) A survey of vision-based architectures for robot learning by imitation. International Journal of Humanoid Robotics 9 (01), pp. 1250006. Cited by: §II.
  • [4] I. Baran and J. Popović (2007) Automatic rigging and animation of 3d characters. In ACM Transactions on graphics (TOG), Vol. 26, pp. 72. Cited by: §II-A.
  • [5] A. Bubnov, V. Emashov, and A. Chudinov (2015) Iterative method of measurement with a given accuracy for angular velocity errors. In 2015 International Siberian Conference on Control and Communications (SIBCON), pp. 1–4. Cited by: §I.
  • [6] K. Choi and H. Ko (2000) Online motion retargetting. The Journal of Visualization and Computer Animation 11 (5), pp. 223–235. Cited by: §II-A.
  • [7] J. B. Cole, D. B. Grimes, and R. P. Rao (2007) Learning full-body motions from monocular vision: dynamic imitation in a humanoid robot. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 240–246. Cited by: §I.
  • [8] J. J. Craig (2009) Introduction to robotics: mechanics and control, 3/e. Pearson Education India. Cited by: §I.
  • [9] B. Dariush, M. Gienger, A. Arumbakkam, C. Goerick, Y. Zhu, and K. Fujimura (2008) Online and markerless motion retargeting with kinematic constraints. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 191–198. Cited by: §II-A.
  • [10] B. Dariush, M. Gienger, A. Arumbakkam, Y. Zhu, B. Jian, K. Fujimura, and C. Goerick (2009) Online transfer of human motion to humanoids. International Journal of Humanoid Robotics 6 (02), pp. 265–289. Cited by: §II-A.
  • [11] A. Faust, K. Oslund, O. Ramirez, A. Francis, L. Tapia, M. Fiser, and J. Davidson (2018) PRM-rl: long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5113–5120. Cited by: §II-B.
  • [12] A. Ghadirzadeh, A. Maki, D. Kragic, and M. Björkman (2017) Deep predictive policy training using reinforcement learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2351–2358. Cited by: §II-B.
  • [13] M. Gleicher (1998) Retargetting motion to new characters. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp. 33–42. Cited by: §II-A.
  • [14] G. Grunwald, G. Schreiber, A. Albu-Schaffer, and G. Hirzinger (2003) Programming by touch: the different way of human-robot interaction. IEEE Transactions on Industrial Electronics 50 (4), pp. 659–666. Cited by: §I.
  • [15] C. Hecker, B. Raabe, R. W. Enslow, J. DeWeese, J. Maynard, and K. van Prooijen (2008) Real-time motion retargeting to highly varied user-created morphologies. In ACM Transactions on Graphics (TOG), Vol. 27, pp. 27. Cited by: §II-A.
  • [16] M. Hsieh, B. Chen, and M. Ouhyoung (2005) Motion retargeting and transition in different articulated figures. In Ninth International Conference on Computer Aided Design and Computer Graphics (CAD-CG’05), pp. 6–pp. Cited by: §II-A.
  • [17] S. James, A. J. Davison, and E. Johns (2017) Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. arXiv preprint arXiv:1707.02267. Cited by: §II-B.
  • [18] S. Kim, C. Kim, B. You, and S. Oh (2009) Stable whole-body motion generation for humanoid robots to imitate human motions. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2518–2524. Cited by: §I.
  • [19] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §II-B, §IV-B, §IV-C.
  • [20] J. Koenemann, F. Burget, and M. Bennewitz (2014) Real-time imitation of human whole-body motions by humanoids. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 2806–2812. Cited by: §I.
  • [21] V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §III-A.
  • [22] D. Kushida, M. Nakamura, S. Goto, and N. Kyura (2001) Human direct teaching of industrial articulated robot arms based on force-free control. Artificial Life and Robotics 5 (1), pp. 26–32. Cited by: §I.
  • [23] G. Lample and D. S. Chaplot (2017) Playing fps games with deep reinforcement learning. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §II-B.
  • [24] J. Lee et al. (2012) Full-body imitation of human motions with kinect and heterogeneous kinematic structure of humanoid robot. In 2012 IEEE/SICE International Symposium on System Integration (SII), pp. 93–98. Cited by: §I.
  • [25] J. Lei, M. Song, Z. Li, and C. Chen (2015) Whole-body humanoid robot imitation with pose similarity evaluation. Signal Processing 108, pp. 136–146. Cited by: §I.
  • [26] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §II-B.
  • [27] Y. Liu, A. Gupta, P. Abbeel, and S. Levine (2018) Imitation from observation: learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1118–1125. Cited by: §II-B.
  • [28] A. Mensink (2008) Characterization and modeling of a Dynamixel servo. Trabajo Individual de Investigación en el Electrical Engineering Control Engineering de la University of Twente. Cited by: §I.
  • [29] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §II-B.
  • [30] J. Monzani, P. Baerlocher, R. Boulic, and D. Thalmann (2000) Using an intermediate skeleton and inverse kinematics for motion retargeting. In Computer Graphics Forum, Vol. 19, pp. 11–19. Cited by: §I, §II-A.
  • [31] S. Mukherjee, D. Paramkusam, and S. K. Dwivedy (2015) Inverse kinematics of a NAO humanoid robot using Kinect to track and imitate human motion. In 2015 International Conference on Robotics, Automation, Control and Embedded Systems (RACE), pp. 1–7. Cited by: §I.
  • [32] C. Ott, D. Lee, and Y. Nakamura (2008) Motion capture based human motion recognition and imitation by direct marker control. In Humanoids 2008-8th IEEE-RAS International Conference on Humanoid Robots, pp. 399–405. Cited by: §I.
  • [33] M. J. Park and S. Y. Shin (2004) Example-based motion cloning. Computer Animation and Virtual Worlds 15 (3-4), pp. 245–257. Cited by: §II-A, §II-A.
  • [34] L. Penco, B. Clément, V. Modugno, E. M. Hoffman, G. Nava, D. Pucci, N. G. Tsagarakis, J. Mouret, and S. Ivaldi (2018) Robust real-time whole-body motion retargeting from human to humanoid. In 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), pp. 425–432. Cited by: §II-A.
  • [35] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne (2018) DeepMimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG) 37 (4), pp. 143. Cited by: §II-B.
  • [36] E. Pot, J. Monceaux, R. Gelin, and B. Maisonnier (2009) Choregraphe: a graphical tool for humanoid robot programming. In RO-MAN 2009-The 18th IEEE International Symposium on Robot and Human Interactive Communication, pp. 46–51. Cited by: §IV-B.
  • [37] D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine (2018) Deep reinforcement learning for vision-based robotic grasping: a simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291. Cited by: §II-B.
  • [38] E. Rohmer, S. P. Singh, and M. Freese (2013) V-REP: a versatile and scalable robot simulation framework. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1321–1326. Cited by: §IV-B.
  • [39] R. D. Schraft, C. Meyer, C. Parlitz, and E. Helms (2005) Powermate-a safe and intuitive robot assistant for handling and assembly tasks. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, pp. 4074–4079. Cited by: §I.
  • [40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §II-B, §III-A.
  • [41] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019. Cited by: §III-B.
  • [42] P. Shahverdi and M. T. Masouleh (2016) A simple and fast geometric kinematic solution for imitation of human arms by a NAO humanoid robot. In 2016 4th International Conference on Robotics and Mechatronics (ICROM), pp. 572–577. Cited by: §I.
  • [43] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484. Cited by: §II-B.
  • [44] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §IV-E.
  • [45] T. Tsumugiwa, R. Yokogawa, and K. Hara (2002) Variable impedance control based on estimation of human arm stiffness for human-robot cooperative calligraphic task. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), Vol. 1, pp. 644–650. Cited by: §I.
  • [46] A. E. Vijayan, S. Alexanderson, J. Beskow, and I. Leite (2018) Using constrained optimization for real-time synchronization of verbal and nonverbal robot behavior. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1955–1961. Cited by: §II-A.
  • [47] S. Wang, X. Zuo, R. Wang, F. Cheng, and R. Yang (2017) A generative human-robot motion retargeting approach using a single depth sensor. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 5369–5376. Cited by: §II-A.
  • [48] K. Yamane and J. Hodgins (2009) Simultaneous tracking and balancing of humanoid robots for imitating human motion capture data. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2510–2517. Cited by: §I.
  • [49] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel (2018) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §I.
  • [50] F. Zuher and R. Romero (2012) Recognition of human motions for imitation and control of a humanoid robot. In 2012 Brazilian Robotics Symposium and Latin American Robotics Symposium, pp. 190–195. Cited by: §I.