Proactive Action Visual Residual Reinforcement Learning for Contact-Rich Tasks Using a Torque-Controlled Robot
Contact-rich manipulation tasks are commonly found in modern manufacturing settings. However, manually designing a robot controller for such tasks is hard for traditional control methods, because the controller must effectively combine modalities with vastly different characteristics. In this paper, we first incorporate operational-space visual and haptic information into a reinforcement learning (RL) method to solve the target uncertainty problem in unstructured environments. Moreover, we propose the novel idea of introducing a proactive action to address the partially observable Markov decision process (POMDP) problem. With these two ideas, our method can both adapt to reasonable variations in unstructured environments and improve the sample efficiency of policy learning. We evaluated our method on a random-access memory (RAM) insertion task using a torque-controlled robot, and we compared its success rate against baselines built from traditional methods. The results show that our method is robust and tolerates environmental variations well.
For high-precision assembly tasks, the robot needs to combine high positioning accuracy with high flexibility. Designing a robot for these tasks is very challenging, although such tasks can be easily performed by humans. Several torque-controlled robots have been designed for cooperative tasks in industrial environments. These torque-controlled robots have seven revolute joints with torque sensors and use similar control algorithms. Torque-controlled robots are already safe enough when collisions occur with the environment or with humans. However, their effectiveness in real-life production scenarios is still not satisfactory.
Torque-controlled robots often serve computer, communication, and consumer electronics (3C) product lines, which usually involve small but complex assembly tasks and need to be adjusted quickly and frequently. A few 3C assembly lines exist today, but they take a long time to build and to set up with high precision, which makes them unsuitable for small and medium-sized enterprises (SMEs) that have automation needs but cannot afford to upgrade an entire production line.
Position uncertainty is quite normal in traditional human-based production lines. Some studies used simple fixed curves for exploration, but these have low robustness against positional and angular errors in insertion tasks, especially when the targets are not fixed accurately. Schimmels and Peshkin designed an admittance matrix for force-guided assembly in the absence of friction, and two years later they improved the admittance control law. However, a maximum limit on the friction value was still required. Stemmer et al. proposed the region of attraction (ROA) method, which uses vision and force perception to assemble objects of a specified shape, but the geometry of the parts is required.
In this paper, we equip a robot with a visual residual policy that combines multimodal feedback from vision and touch, two modalities with different frequencies and characteristics. Our primary contributions are:
1) We propose a visual reinforcement learning (RL) method that combines a visual-based fixed policy with a contact-based parametric policy; this method greatly enhances the robustness and efficiency of the RL algorithm.
2) We introduce a proactive action into the visual residual RL policy to address the partially observable Markov decision process (POMDP) problem, which helps to ensure the task success rate and the ability to tolerate environmental variations.
3) We conduct ablation and comparison studies to show the effect of each modality on the task success rate and to demonstrate the robustness of our method experimentally.
II Background and Related Work
II-A Torque-Controlled Robot Concepts
Torque-controlled robots have been developed for unstructured environments that are fundamentally different from the environments where classical industrial robots have been used. The torque sensor in each joint plays a key role in the robot controller. The basic controller consists of a torque feedback loop, which can be interpreted as the scaling of the motor inertia $B$ to the desired value $B_\theta$:
$$\tau_m = B B_\theta^{-1} u + (I - B B_\theta^{-1})\,\tau \qquad (1)$$
Here, $u$ is an intermediate control input that can shape the Cartesian or joint impedance behavior, and $\tau$ is the joint torque measured by the torque sensor, which is also the torque vector applied to the manipulator's joints. $\tau_m$ is the torque demanded from the motor controller. For the Cartesian impedance behavior, we have
$$u = J(\theta)^T \big( K_x\,(x_d - x(\theta)) - D_x\,\dot{x}(\theta) \big) + g(\theta) \qquad (2)$$
$K_x$ and $D_x$ are the positive definite, diagonal matrices of desired stiffness and damping; $x_d$ is the desired end-effector (EE) pose, and $x(\theta)$ is the EE pose computed from the motor position. $J(\theta)$ is the manipulator Jacobian; $\theta$ and $\theta_d$ are the measured and desired motor positions, respectively. $g(\theta)$ is the gravity function, which always comes from the CAD model or from parameter identification; this function inevitably has errors.
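To make the Cartesian impedance behavior concrete, the following is a minimal sketch of a spring-damper law mapped to joint torques through the Jacobian transpose, with gravity compensation added. It assumes diagonal stiffness and damping, a translational-only task space, and plain Python lists; the function names `jacobian_transpose_times` and `impedance_input` and the toy Jacobian are hypothetical and are not taken from the robot's controller.

```python
from typing import List, Sequence


def jacobian_transpose_times(jac: Sequence[Sequence[float]],
                             wrench: Sequence[float]) -> List[float]:
    """Return J^T w for a (task_dim x n_joints) Jacobian given as nested lists."""
    n_joints = len(jac[0])
    return [sum(jac[i][j] * wrench[i] for i in range(len(wrench)))
            for j in range(n_joints)]


def impedance_input(jac: Sequence[Sequence[float]],
                    stiffness: Sequence[float],
                    damping: Sequence[float],
                    x_des: Sequence[float],
                    x: Sequence[float],
                    x_dot: Sequence[float],
                    gravity: Sequence[float]) -> List[float]:
    """Joint-space input of a Cartesian spring-damper plus gravity compensation."""
    # Virtual spring pulls the EE toward x_des; virtual damper resists EE velocity.
    wrench = [k * (xd - xi) - d * vi
              for k, d, xd, xi, vi in zip(stiffness, damping, x_des, x, x_dot)]
    tau = jacobian_transpose_times(jac, wrench)
    return [t + g for t, g in zip(tau, gravity)]


if __name__ == "__main__":
    toy_jacobian = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # assumed 3x2 example
    u = impedance_input(toy_jacobian, [3000.0] * 3, [50.0] * 3,
                        [0.001, 0.0, 0.0], [0.0] * 3, [0.0] * 3, [0.5, 0.0])
    print(u)
```

With a stiffness of 3000 N/m, a 1 mm position error produces a restoring force of about 3 N, which illustrates why a compliant controller leaves millimeter-level position errors under contact friction.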
II-B Visual Servo Control in Manufacturing Applications
A vision sensor allows a robot to measure the environment without contact. Shirai and Inoue described how to use visual feedback to correct the position of a robot in order to increase assembly task accuracy. Position-based visual servo (PBVS) systems and image-based visual servo (IBVS) systems are the two major classes of visual servo control systems. The typical control structure of PBVS can be found in .
An end-effector-mounted camera can acquire the target depth and orientation information, which can be used directly for PBVS. However, the lens and the imaging sensor, the calibration of the intrinsic and extrinsic parameters, and reflections, shadows, and occlusions all exert a strong influence on the precision of the visual guidance.
II-C Reinforcement Learning for Assembly Tasks
RL offers a set of tools for the design of sophisticated robotic behaviors that are difficult to engineer. RL has been applied previously with great success to a variety of problems in robotic manipulation. Newman et al. inverted the mapping from the relative positions to the observed moments and trained a neural network to guide the robotic assembly. Inoue et al. used long short-term memory to learn algorithms with two threads (an action thread and a learning thread) for searching and inserting a peg into a tight hole; however, their methods required several pre-defined heuristics, flat searching surfaces, and a long training time.
Residual RL can take advantage of the efficiency of conventional controllers and the flexibility of RL. The idea is to inject prior information into an RL algorithm to speed up the training process instead of exploring randomly from scratch.
Specifying goals via images makes it possible to specify goals with minimal manual effort, such as taking a photo. Combining the senses of vision and touch could endow robots with a human-like ability to complete assembly tasks, which could provide robustness to sensor and actuator noise as well as position uncertainty. However, only a few studies have focused on contact-rich tasks in real industrial production, and they also require a sliding surface for the algorithms to search.
III Problem Statement and Method Overview
III-A Problem Statement
Position Uncertainty in Unstructured Environments
As discussed in Section I, position uncertainty is quite normal in human-based production lines. Workers can perform high-precision assembly tasks thanks to their intelligence, excellent visual ability, and dexterous hands, whereas these tasks are very challenging for robots, especially in unstructured production environments.
Also, torque-controlled robots have low position robustness to friction and obstruction in contact-rich tasks because of the low-stiffness design concept described in Section II-A. The limited control stiffness, together with the friction and obstruction in contact-rich tasks, leads to position control errors at the millimeter level. Torque-controlled robots are expected to achieve a desired dynamical relationship between environmental forces and robot movements in order to avoid damaging the environment or the targets; thus, the desired position and the desired contact force cannot be satisfied in the same dimension simultaneously. Moreover, the location of the targets is sometimes uncertain because of the insufficient accuracy of industrial assembly lines.
Using a visual method to correct the positions of the targets is an intuitive solution, but position control problems remain when the robot is in contact with the targets, for the reason explained in Section II-B, even when exploratory actions (e.g., the spiral exploration method) are implemented.
In 3C production lines, the insertion scenarios differ from the typical simplified peg-in-hole settings. For example, the random-access memory (RAM) insertion task has the following problems:
1) The RAM slot, like other slots, does not have a proper surface for the sliding behavior of the robot in the alignment stage, as shown in Figure 2, so sliding-type algorithms no longer work.
2) The objects (such as the RAM or a hard disk) easily get stuck on the structure near the slot or on the slot itself in the exploration/alignment stage, as shown in Figure 2.
Uncertain POMDP States
The main challenge for traditional policies is to design adaptable yet robust algorithms in the face of the inherent difficulty of modeling all possible interaction behaviors. RL enables us to find new control policies automatically for contact-rich problems where traditional heuristics have been used with unsatisfactory results.
Contact states are hard to estimate because of sensor noise and robot modeling errors, which turns the Markov decision process (MDP) into a POMDP; this makes it significantly harder to find an optimal policy and also requires more training time. Belief state tracking is one way to deal with the POMDP problem, but this method takes too much time to find an optimal policy.
III-B Method Overview
An eye-in-hand camera helps to solve the position uncertainty problem in unstructured environments for contact-rich tasks. The camera can align the features of the target and compensate for the position error of the robot. Visual feedback control can provide geometric object properties for the pre-reaching phase, but the camera alignment accuracy is always disturbed by the target material or the lighting. Force feedback control is quite helpful for providing contact information between the object and the environment for accurate localization and control under occlusions or bad vision conditions, and force information can be obtained easily from the proprioceptive data in the torque-controlled robot controller. Visual feedback and force feedback are complementary and sometimes concurrent during contact-rich manipulation. In this paper, we implemented the visual-based fixed policy combined with the contact-based parametric policy (see Figure 3) as follows:
1) To roughly locate the slot, we use one global image taken in teach mode with the RGB-D camera and rely only on PBVS control (i.e., the visual-based fixed policy) in this phase, because in free space the contact-based parametric policy cannot receive proper contact information.
2) After the rough location phase has finished, the robot moves to the target slot according to the pre-recorded transformation from the global image pose to the detailed image pose, which is recorded in the teach phase. When the RAM held by the EE comes into contact with the target slot, the detailed image, which is more accurate for locating the slot, is used to insert the RAM into the slot according to the method described in Section IV.
IV Policy and Controller Design
IV-A Policy Design
Visual Residual Reinforcement Learning
To take advantage of the high flexibility of RL and the high efficiency of conventional controllers, we combine the idea of residual RL with vision information; our method is expected to perform better than the original residual RL in a variable environment because it addresses the position uncertainty problem described in Section III-A1.
In residual RL, the policy $\pi$ is chosen by additively combining a fixed policy $\pi_H$ with a parametric RL policy $\pi_\theta$: $\pi(s) = \pi_H(s) + \pi_\theta(s)$. The fixed policy can help the agent move to the target, but it prevents the agent from exploring more states. To balance exploration and exploitation between the fixed policy and the parametric RL policy, we design the weighted residual RL as
$$\pi(s) = (1 - \alpha)\,\pi_H(s) + \alpha\,\pi_\theta(s)$$
Here, $\alpha$ is the action weight between the fixed policy and the parametric RL policy. The parametric policy is learned in the RL process to maximize the expected returns on the task. We use a P-controller as the hand-designed controller in our experiments for the visual-based fixed policy.
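The weighting idea can be sketched in a few lines. The snippet below assumes a convex combination in which a weight of 0 keeps only the fixed-policy action and a weight of 1 keeps only the learned action; this orientation of the weight, the function name `weighted_residual_action`, and the example step values are assumptions made for illustration.

```python
from typing import List, Sequence


def weighted_residual_action(fixed: Sequence[float],
                             learned: Sequence[float],
                             alpha: float) -> List[float]:
    """Blend a fixed-policy action with a learned-policy action.

    alpha = 0 keeps only the fixed action and alpha = 1 only the learned one
    (this orientation of the weight is an assumption of the sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(fixed) != len(learned):
        raise ValueError("actions must have the same dimension")
    return [(1.0 - alpha) * f + alpha * l for f, l in zip(fixed, learned)]


if __name__ == "__main__":
    visual_step = [0.4, -0.2, 0.0]  # hypothetical P-controller output (mm)
    rl_step = [0.0, 0.2, -0.2]      # hypothetical Q-table action (mm)
    print(weighted_residual_action(visual_step, rl_step, 0.5))
```

Setting the weight to one of its extremes reproduces the masking used in an ablation: one extreme disables the learned part, the other disables the visual part.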
First, we explain the detailed design of the fixed policy $\pi_H$. It represents a geometric relationship of robot states, namely a Euclidean distance calculated from visual and estimated depth information. We adopt a method that uses depth information in PBVS. Combining feature extraction with the depth information $Z$ of the features, we obtain the estimated target feature set and the current feature set, whose coordinates $(X, Y, Z)$ are expressed with respect to the camera coordinate frame following the perspective projection method:
$$X = \frac{u Z}{f}, \qquad Y = \frac{v Z}{f}$$
where $f$ is the focal length of the camera lens and $(u, v)$ gives the coordinates of an image feature expressed in pixel units. Iterative Closest Point (ICP) can then be used to obtain the coordinate transformation between the current feature set and the target feature set.
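The perspective projection step can be illustrated with a short sketch that back-projects pixel features with known depth into the camera frame. It assumes pixel coordinates measured from the principal point and a focal length expressed in pixels. The centroid difference at the end is only a crude stand-in for the translational part of the alignment when the two sets contain matched features; it is not an ICP implementation. All function names are hypothetical.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def back_project(u: float, v: float, depth: float, focal_px: float) -> Point3:
    """Invert u = f X / Z, v = f Y / Z for a known depth Z."""
    if focal_px <= 0.0 or depth <= 0.0:
        raise ValueError("focal length and depth must be positive")
    return (u * depth / focal_px, v * depth / focal_px, depth)


def centroid(points: List[Point3]) -> Point3:
    """Mean of a non-empty list of 3-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
            sum(p[2] for p in points) / n)


def translation_error(current: List[Point3], target: List[Point3]) -> Point3:
    """Offset that moves the current centroid onto the target centroid."""
    c, t = centroid(current), centroid(target)
    return (t[0] - c[0], t[1] - c[1], t[2] - c[2])


if __name__ == "__main__":
    current = [back_project(100.0, 50.0, 0.5, 500.0),
               back_project(-80.0, 40.0, 0.5, 500.0)]
    target = [back_project(110.0, 50.0, 0.5, 500.0),
              back_project(-70.0, 40.0, 0.5, 500.0)]
    print(translation_error(current, target))
```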
Here, the error is set according to Equation 5, where $t$ is the translation error vector and $\theta u$ gives the angle/axis representation of the rotation error. A velocity control scheme is then designed by imposing an exponential and decoupled decrease of the error (i.e., $\dot{e} = -\lambda e$) as:
Equation 6 is used in the rough location phase described in Section III-B. It gives the camera-frame velocity command expressed in the current camera frame, which can easily be transformed to the robot EE frame. In this paper, we first calculate the robot movement commands in the robot EE frame and then transform them to the base frame before sending them to Equation 2. Secondly, we directly use as the states of the fixed policy in the accurate location phase,
which is quite convenient to implement.
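The effect of imposing an exponential decrease on the error can be checked numerically. The sketch below applies a proportional velocity command to an error vector and integrates it with explicit Euler steps; it assumes that the error and the velocity are expressed in the same frame so that the error rate equals the commanded velocity. The gain, time step, and function names are illustrative choices, not values from the experiments.

```python
import math
from typing import List, Sequence


def velocity_command(error: Sequence[float], gain: float) -> List[float]:
    """Proportional command v = -gain * e, which yields de/dt = -gain * e."""
    return [-gain * e for e in error]


def simulate(error: Sequence[float], gain: float, dt: float, steps: int) -> List[float]:
    """Euler-integrate de/dt = v and return the error norm after each step."""
    e = list(error)
    norms = []
    for _ in range(steps):
        v = velocity_command(e, gain)
        e = [ei + vi * dt for ei, vi in zip(e, v)]
        norms.append(math.sqrt(sum(ei * ei for ei in e)))
    return norms


if __name__ == "__main__":
    # Hypothetical 10 mm / 20 mm translation error, unit gain, 0.1 s steps.
    for k, norm in enumerate(simulate([0.01, -0.02, 0.0], 1.0, 0.1, 5), start=1):
        print(k, round(norm, 6))
```

Each step shrinks the error by the factor (1 - gain * dt), so the same routine can be applied independently to the translation error and to the angle/axis rotation error, which is what "decoupled" refers to.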
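In this paper, we use the value-based Q-learning algorithm as the contact-based parametric RL policy $\pi_\theta$. The Q-function is implemented as a table with states as rows and actions as columns, and we update the table by using the Bellman equation:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta \big[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big]$$
where $\eta$ is the learning rate, $\gamma$ is the discount factor, and $r_t$ is the reward.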
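A tabular Q-learning update of this kind can be sketched as follows. The table is a list of rows (states) holding one value per action; the learning rate, discount factor, and function names are illustrative assumptions rather than the settings used in training.

```python
from typing import List


def q_update(q: List[List[float]], s: int, a: int, reward: float, s_next: int,
             lr: float = 0.1, gamma: float = 0.9) -> float:
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    td_target = reward + gamma * max(q[s_next])
    q[s][a] += lr * (td_target - q[s][a])
    return q[s][a]


def greedy_action(q: List[List[float]], s: int) -> int:
    """Index of the highest-valued action in row s (first one on ties)."""
    row = q[s]
    return max(range(len(row)), key=row.__getitem__)


if __name__ == "__main__":
    table = [[0.0, 0.0], [0.0, 2.0]]  # 2 states x 2 actions, toy values
    q_update(table, s=0, a=1, reward=1.0, s_next=1, lr=0.5, gamma=0.5)
    print(table, greedy_action(table, 0))
```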
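Most studies have modeled the robot manipulation task as a finite-horizon discounted Markov decision process (MDP) in an environment $E$, with a state space $\mathcal{S}$, an action space $\mathcal{A}$, state transition dynamics $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, a discount factor $\gamma \in (0, 1]$, and a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, in order to determine an optimal stochastic policy $\pi: \mathcal{S} \to \mathbb{P}(\mathcal{A})$.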
In practice, many contact states cannot be observed directly in manipulation tasks, which brings these tasks close to a POMDP problem. In our setting, the POMDP problem is confined to the modeling error of the torque-controlled robot, which makes it difficult to detect the contact states. Inspired by wild gorillas, who crossed a pool of water while using a walking stick to test the water depth, we improved our RL process by adding a proactive investigative action that reveals clear states during the RL process, as shown in Figure 4. This differs from approaches that continuously push on the target to obtain a detectable moment; here, the investigative action is one kind of proactive action.
We combine the investigative action with the original policy to construct a new policy, in which the clear state is determined by applying an investigative action of the torque-controlled robot to the environment. Consequently, the heuristic design of the investigative action prevents the learning process from falling into multiple unclear states.
In particular, the torque-controlled robot can output either movements or forces. In our experiments, the movements are taken as the actions in the action space, and the forces are taken as the investigative actions. Instead of continuously applying a 20 N force to detect the values of the moments in the search phase, we only command the controller to exert a force (10-25 N) in certain directions for a short time (0.5-1 s) as the investigative action, and we use the resulting movements or forces/moments to verify the contact state when the state is not clear. Our investigative action method can greatly reduce the friction and the probability of getting stuck when the robot performs the movement actions.
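The interplay between the movement actions and the investigative action can be sketched as a small decision routine: when the observed contact state is unclear, the force probe is executed first and the revealed state is then used to look up the movement action. The callables `observe` and `probe`, the use of `None` for an unclear state, and the function name are assumptions made for this sketch.

```python
from typing import Callable, List, Optional, Tuple


def act_with_probe(q_table: List[List[float]],
                   observe: Callable[[], Optional[int]],
                   probe: Callable[[], int]) -> Tuple[int, int, bool]:
    """Return (state, action, probed).

    observe() returns a state index, or None when the contact state is unclear.
    probe() applies the short force action and returns the state it reveals.
    """
    state = observe()
    probed = state is None
    if probed:
        state = probe()
    row = q_table[state]
    action = max(range(len(row)), key=row.__getitem__)
    return state, action, probed


if __name__ == "__main__":
    table = [[0.1, 0.9], [0.7, 0.2]]  # toy Q-table: 2 states x 2 movement actions
    print(act_with_probe(table, observe=lambda: None, probe=lambda: 1))
    print(act_with_probe(table, observe=lambda: 0, probe=lambda: 1))
```

Because the probe is only triggered for unclear states, the robot does not have to press continuously against the slot, which is the source of the reduced friction mentioned above.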
IV-B Controller Design
We use the increment equation $x_d = x + \Delta x$ to avoid the potential “far away” problem for safety reasons; $x_d$ is the desired EE pose, $x$ is the current EE pose, and $\Delta x$ is the increment action command from the agent.
The Cartesian impedance controller takes the Cartesian EE movement from the agent at 0.5 to 2 Hz, and the output joint torque is commanded to the robot at 1000 Hz. We calculate the desired EE pose by combining the increment with the current EE pose. The trajectory generator bridges the low-frequency output of the agent and the high-frequency impedance control of the robot, and it outputs the interpolated pose to the Cartesian impedance controller in Equation 2. The position and the quaternion representation of the orientation are given by a simple linear interpolator:
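One way to bridge a slow agent and a fast controller is to interpolate between the current pose and the desired pose over a fixed number of controller ticks. The sketch below interpolates the position linearly and the orientation by normalized linear interpolation of unit quaternions in (w, x, y, z) order. The choice of normalized linear interpolation, the quaternion ordering, and the function names are assumptions, since the text only states that a simple linear interpolator is used.

```python
import math
from typing import List, Sequence, Tuple


def lerp(p0: Sequence[float], p1: Sequence[float], s: float) -> List[float]:
    """Linear interpolation between two points for s in [0, 1]."""
    return [a + s * (b - a) for a, b in zip(p0, p1)]


def nlerp(q0: Sequence[float], q1: Sequence[float], s: float) -> List[float]:
    """Normalized linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # q and -q encode the same rotation; take the shorter arc
        q1 = [-c for c in q1]
    q = [a + s * (b - a) for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def interpolate_pose(p0: Sequence[float], q0: Sequence[float],
                     p1: Sequence[float], q1: Sequence[float],
                     n_steps: int) -> List[Tuple[List[float], List[float]]]:
    """Return n_steps intermediate poses, the last one being the goal pose."""
    return [(lerp(p0, p1, k / n_steps), nlerp(q0, q1, k / n_steps))
            for k in range(1, n_steps + 1)]


if __name__ == "__main__":
    # A 2 mm step along x with unchanged orientation, split into 4 set-points.
    for pose in interpolate_pose([0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0],
                                 [0.002, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], 4):
        print(pose)
```

With an agent running at 1 Hz and a controller at 1000 Hz, `n_steps` would be on the order of 1000, so each controller tick receives only a very small pose change.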
V Experiments: Design and Setup
We consider the insertion task in our experiments. The task can be described as moving already-grasped parts to their goal pose, as shown in Figure 1. This is the most common setting in manufacturing. The success of such tasks can be measured by the distance between the objects and their goal pose, especially in the Z direction (see Figure 1).
V-A Experiment Algorithm Design
In our weighted residual RL, actions are designed by combining the fixed policy with the parametric policy:
The fixed policy output is calculated by a hand-designed controller as given in Equation 7; the action weight helps to adjust the balance between exploration and exploitation. We set to (1,1,0.3,0,0,0) when the fixed policy is calculated. To identify a reasonable weight between the two components, we initially experimented with the weighted residual RL using a group of action weight parameters, such as 0.3, 0.5, and 0.7. The training experiments suggested that a weight of 0.5 gives the best policy output, although the weight can be increased or decreased around 0.5 according to the visual conditions in the implementation phase. We utilized the algorithm to detect states and implemented a slightly modified version in which the trained policies are constructed from the two aforementioned components. Here, the belief flag is set to 0 or 1 according to the moment threshold settings; a detectable moment (over the threshold) always gives a true belief state. Combined with the investigative action mentioned in Section IV-A2, the modified Q-learning algorithm trained quickly and converged easily.
We design Cartesian movement actions for this experiment. Each Cartesian movement dimension was set to +1 for a positive movement and -1 for a negative movement; therefore, we had actions. We set as the scale parameter to adjust the amplitude of actions as
Here, the positional and orientational movements are both expressed in the EE frame. The scale parameter is easy to choose because it is closely related to the assembly clearance and the visual accuracy; with our usual setting, the movement resolution is at the mm and rad level. We found that the accuracy of the orientational movements was sufficient when using the fixed policy alone, so our RL policy only outputs positional movement actions. This is a natural setting because visual feedback and force feedback are complementary during contact-rich manipulation.
The investigative action was designed as a force action in the robot EE frame lasting 1 s. The robot tries to apply the force but stops moving if the force exceeds 25 N or the movement exceeds 3 mm. The agent then obtains clear state feedback because of the large contact force and torque amplitude, as shown in Figure 5.
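The stop conditions of the investigative action can be written as a simple guard over a stream of measurements. The sketch below uses the limits stated above (25 N, 3 mm, 1 s) as defaults; the sample format `(time_s, force_n, travel_mm)`, the returned labels, and the function name are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def run_probe(samples: Iterable[Tuple[float, float, float]],
              max_force_n: float = 25.0,
              max_travel_mm: float = 3.0,
              duration_s: float = 1.0) -> str:
    """Scan (time_s, force_n, travel_mm) samples and report why the probe ends."""
    for t, force, travel in samples:
        if force > max_force_n:
            return "force_limit"
        if travel > max_travel_mm:
            return "travel_limit"
        if t >= duration_s:
            return "timeout"
    return "no_more_samples"


if __name__ == "__main__":
    log = [(0.0, 2.0, 0.0), (0.2, 12.0, 0.8), (0.4, 27.5, 1.1)]  # toy readings
    print(run_probe(log))
```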
Depending on the pose error between the current picture and the target picture and the depth information, the reward function was set as follows:
Here, is the norm of the x and y errors of the images, is the number of steps in one episode, and is the maximum steps in one episode.
We obtain the estimated 6-DoF external forces and moments along the X, Y, and Z axes in the EE frame from the Franka controller. Here we consider the contact forces and moments between the robot’s EE (i.e., the grasped RAM) and the slot as the MDP states as
We assume that the EE contacts the slot when the external force N or the external moments Nm, a value of means that a contact is made, and 0 means that there is no contact with the encoding states.
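The thresholding of forces and moments into discrete contact states can be sketched as follows. Each of the six measured components is mapped to a 0/1 flag, and the flags are packed into a row index for a Q-table. The threshold values in the example are placeholders, and the binary encoding and function names are assumptions of this sketch.

```python
from typing import Sequence, Tuple


def encode_contact_state(forces: Sequence[float], moments: Sequence[float],
                         force_thr: float, moment_thr: float) -> Tuple[int, ...]:
    """Map 3 forces and 3 moments to six 0/1 contact flags."""
    bits = [1 if abs(f) > force_thr else 0 for f in forces]
    bits += [1 if abs(m) > moment_thr else 0 for m in moments]
    return tuple(bits)


def state_index(bits: Sequence[int]) -> int:
    """Pack the flags into a row index for a Q-table (first flag is the MSB)."""
    index = 0
    for b in bits:
        index = index * 2 + b
    return index


if __name__ == "__main__":
    # Placeholder thresholds: 5 N for forces and 0.2 Nm for moments.
    flags = encode_contact_state((6.0, 0.0, -12.0), (0.0, 0.3, 0.0), 5.0, 0.2)
    print(flags, state_index(flags))
```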
V-B Experiment Environment and Task Setup
We used the Franka robot for the real robot experiments and set the Cartesian stiffness to 3000 N/m and 300 Nm/rad (the recommended upper limit). Two sensor modalities were available on the real hardware: proprioception and a red-green-blue (RGB) depth camera. The RGB and depth information were recorded using an eye-in-hand Intel RealSense Depth Camera D435i. The policy ran on a Dell Precision 5510 laptop and sent the updated position to the real-time controller, which calculated the joint torque command and sent it to the robot controller at 1000 Hz. We used a CORSAIR DDR3 RAM module and a motherboard as the training and testing environment.
In the ablation study, we evaluated our trained policy by masking different modalities, which gives the four baselines below:
1) No vision: masks out the visual part action; we set .
2) No RL policy: masks out the RL part action; we set .
3) Random policy: generates a random Q table.
4) No investigative action: masks out the investigative action and chooses random action when the state is not clear.
We set the maximum number of steps to 10 and add initial random errors in the x and y directions for each baseline, only in the ablation study.
In the comparison study, we compared the task success rate of our method with those of four other baselines in real scenarios (no limit on the maximum number of steps and no initial random errors for each baseline) while moving the motherboard. The baselines are as follows:
1) Baseline 1: normal teaching and direct insertion
2) Baseline 2: normal teaching with spiral exploration
3) Baseline 3: teaching with vision and direct insertion
4) Baseline 4: teaching with vision and spiral exploration
VI Experiments: Results and Discussion
Table I: Ablation study results
| Policy | Successes / trials | Time |
| No vision | 92/200 | 1.09 h |
| No RL policy | 112/200 | 0.65 h |
| Random RL policy | 77/200 | 2.59 h |
| No investigative action | 66/200 | 0.85 h |
| Our method | 179/200 | 1.18 h |
We trained our policy for 500 episodes, and each episode lasted a maximum of 50 steps. The training time for the exploration was approximately 150 minutes which is much less than . We specified discrete actions in this experiment, and the action execution had errors. Our policy can increase the probability of success and reduce the number of steps required, but it cannot guarantee success every time. We set random errors for the initial pose of the robot; sometimes, the robot succeeds in inserting by chance and obtains a high reward in the early stage of training.
Table I shows the results of the ablation study. The Random RL policy and No investigative action baselines performed poorly, with success rates of 38.5% and 33%, respectively. The No vision baseline had a 46% success rate because of overshooting caused by the discrete actions, whereas the No RL policy baseline had a 56% success rate because the RAM tended to get stuck on the short side of the slot. Our method had a success rate of 89.5%. It should be noted that the success rate of our method is limited by the maximum number of steps in the experiment.
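The success rates quoted above follow directly from the counts in Table I. The short script below recomputes them; the dictionary simply transcribes the table, and the function names are illustrative.

```python
from typing import Dict, Tuple

# (successes, trials) transcribed from Table I.
ABLATION_COUNTS: Dict[str, Tuple[int, int]] = {
    "No vision": (92, 200),
    "No RL policy": (112, 200),
    "Random RL policy": (77, 200),
    "No investigative action": (66, 200),
    "Our method": (179, 200),
}


def success_rate(successes: int, trials: int) -> float:
    """Success rate in percent."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return 100.0 * successes / trials


def summarize(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map each policy name to its success rate in percent."""
    return {name: success_rate(s, n) for name, (s, n) in counts.items()}


if __name__ == "__main__":
    for name, rate in summarize(ABLATION_COUNTS).items():
        print(f"{name}: {rate:.1f}%")
```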
We observed that the absence of either visual or correct force/moment information negatively affected the task success rate, and that a wrong policy performed even worse than no RL policy at all. The Random RL policy and No investigative action baselines had similar performances because, in both cases, the RL policy is frequently in conflict with the visual output action. None of the four baselines reached the same level of performance as the final method. With visual input alone, the robot sometimes cannot overcome the last small distance because of either the limited movement accuracy of the robot or contact friction, whereas the RL policy is capable of recovering from such issues, as demonstrated by our method. Without visual input, the robot requires more steps to find the proper pose for insertion and tends to overshoot for some actions (i.e., drop out of the slot).
| Baselines | Fix motherboard | Move motherboard |
Table II compares the success rates of the traditional-method baselines with that of our method. To simulate an industrial scenario, the additional random error and the maximum step limit used in the ablation study are removed. Baselines 1 and 2 work well only when the motherboard is fixed in the same position as in the teaching phase, so, to save time, we tested them only 20 times in the “move motherboard” case. The success rates of baselines 3 and 4 increase with vision correction, but they still have failure cases because of visual errors. Our method shows a strong ability to tolerate environmental variations and to recover from getting stuck, achieving full success, which meets the requirements of industrial scenarios. Please note that in the comparison study, the increase in success rates is also related to the removal of the initial errors and of the maximum step limit.
VII Conclusion and Future Work
In this paper, we combined RL with an operational-space visual controller to solve the position uncertainty problem in high-precision assembly tasks, and we proposed the idea of a proactive action to address the POMDP problem by using an investigative action.
Our visual residual RL algorithm overcomes shortcomings of the traditional visual servoing method. However, it inherits some traditional controller parameters, which makes the setup not fast enough; as a next step, we will extend our method toward an end-to-end trained approach.
Unfortunately, space does not permit more generalization tests in this paper. We did test the SSD insertion scenario shown in Figure 2 with our policy and achieved full success in 100 trials. We will continue to generalize the model and policy so that they can handle different parts and robot manipulators. The skill could then be packaged as a service that is delivered to robots in new factory lines with a short setup time. Our method uses a discrete set of actions to perform the insertion task; as an obvious next step, we will analyze the difference between this method and continuous-space learning techniques.
- A. Albu-Schäffer, S. Haddadin, C. Ott, A. Stemmer, T. Wimböck, and G. Hirzinger, “The dlr lightweight robot: design and control concepts for robots in human environments,” Industrial Robot: an international journal, vol. 34, no. 5, pp. 376–385, 2007.
- C. Gaz, M. Cognetti, A. Oliva, P. R. Giordano, and A. De Luca, “Dynamic identification of the franka emika panda robot with retrieval of feasible parameters using penalty-based optimization,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4147–4154, 2019.
- A. Albu-Schäffer, C. Ott, and G. Hirzinger, “A passivity based cartesian impedance controller for flexible joint robots-part ii: Full state feedback, impedance design and experiments,” in IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA’04. 2004, vol. 3. IEEE, 2004, pp. 2666–2672.
- C. Ott, A. Albu-Schäffer, A. Kugi, S. Stramigioli, and G. Hirzinger, “A passivity based cartesian impedance controller for flexible joint robots-part i: Torque feedback and gravity compensation,” in IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA’04. 2004, vol. 3. IEEE, 2004, pp. 2659–2665.
- A. Albu-Schäffer, C. Ott, and G. Hirzinger, “A unified passivity-based control framework for position, torque and impedance control of flexible joint robots,” The international journal of robotics research, vol. 26, no. 1, pp. 23–39, 2007.
- S. Haddadin, A. De Luca, and A. Albu-Schäffer, “Robot collisions: Detection, isolation, and identification,” Submitted to IEEE Transactions on Robotics, 2015.
- L. Robot. Desktop computer host automatic assembly line. Youtube. [Online]. Available: https://www.youtube.com/watch?v=GNqNVgLk1Mg
- H. Park, J.-H. Bae, J.-H. Park, M.-H. Baeg, and J. Park, “Intuitive peg-in-hole assembly strategy with a compliant manipulator,” in IEEE ISR 2013. IEEE, 2013, pp. 1–5.
- F. EMIKA. Ram. Youtube. [Online]. Available: https://www.youtube.com/watch?time_continue=1&v=HQt7XZB-rts&feature=emb_logo
- M. A. Peshkin, “Programmed compliance for error corrective assembly,” IEEE Transactions on Robotics and Automation, vol. 6, no. 4, pp. 473–482, 1990.
- J. M. Schimmels and M. A. Peshkin, “Admittance matrix design for force-guided assembly,” IEEE Transactions on Robotics and Automation, vol. 8, no. 2, pp. 213–227, 1992.
- ——, “Force-assembly with friction,” IEEE Transactions on Robotics and Automation, vol. 10, no. 4, pp. 465–479, 1994.
- A. Stemmer, G. Schreiber, K. Arbter, and A. Albu-Schäffer, “Robust assembly of complex shaped planar parts using vision and force,” in 2006 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. IEEE, 2006, pp. 493–500.
- Y. Shirai and H. Inoue, “Guiding a robot by visual feedback in assembling tasks,” Pattern recognition, vol. 5, no. 2, pp. 99–108, 1973.
- S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo control,” IEEE transactions on robotics and automation, vol. 12, no. 5, pp. 651–670, 1996.
- C. Teulière and E. Marchand, “Direct 3d servoing using dense depth maps,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 1741–1746.
- H. Fujimoto, “Visual servoing of 6 dof manipulator by multirate control with depth identification,” in 42nd IEEE International Conference on Decision and Control (IEEE Cat. No. 03CH37475), vol. 5. IEEE, 2003, pp. 5408–5413.
- R. Li and H. Qiao, “A survey of methods and strategies for high-precision robotic grasping and assembly tasks - some new trends,” IEEE/ASME Transactions on Mechatronics, vol. 24, no. 6, pp. 2718–2732, 2019.
- M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg, “Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks,” arXiv preprint arXiv:1810.10191, 2018.
- J. Luo, E. Solowjow, C. Wen, J. A. Ojea, and A. M. Agogino, “Deep reinforcement learning for robotic assembly of mixed deformable and rigid objects,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 2062–2069.
- G. Schoettler, A. Nair, J. Luo, S. Bahl, J. A. Ojea, E. Solowjow, and S. Levine, “Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards,” arXiv preprint arXiv:1906.05841, 2019.
- T. Inoue, G. De Magistris, A. Munawar, T. Yokoya, and R. Tachibana, “Deep reinforcement learning for high precision assembly tasks,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 819–825.
- J. Luo, E. Solowjow, C. Wen, J. A. Ojea, A. M. Agogino, A. Tamar, and P. Abbeel, “Reinforcement learning on variable impedance controller for high-precision robotic assembly,” arXiv preprint arXiv:1903.01066, 2019.
- W. S. Newman, Y. Zhao, and Y.-H. Pao, “Interpretation of force and moment signals for compliant peg-in-hole assembly,” in Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No. 01CH37164), vol. 1. IEEE, 2001, pp. 571–576.
- T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and S. Levine, “Residual reinforcement learning for robot control,” arXiv preprint arXiv:1812.03201, 2018.
- M. A. Lee, C. Florensa, J. Tremblay, N. Ratliff, A. Garg, F. Ramos, and D. Fox, “Guided uncertainty-aware policy optimization: Combining learning and model-based strategies for sample-efficient policy learning,” arXiv preprint arXiv:2005.10872, 2020.
- H. Park, J. Park, D.-H. Lee, J.-H. Park, M.-H. Baeg, and J.-H. Bae, “Compliance-based robotic peg-in-hole assembly strategy without force feedback,” IEEE Transactions on Industrial Electronics, vol. 64, no. 8, pp. 6299–6309, 2017.
- A. Y. Ng, “Shaping and policy search in reinforcement learning,” Ph.D. dissertation, University of California, Berkeley, 2003.
- M. L. Littman, A. R. Cassandra, and L. P. Kaelbling, “Efficient dynamic-programming updates in partially observable markov decision processes,” 1995.
- C. C. White, “A survey of solution techniques for the partially observed markov decision process,” Annals of Operations Research, vol. 32, no. 1, pp. 215–230, 1991.
- W. S. Lovejoy, “A survey of algorithmic methods for partially observed markov decision processes,” Annals of Operations Research, vol. 28, no. 1, pp. 47–65, 1991.
- P. Martinet, J. Gallice, and K. Djamel, “Vision based control law using 3d visual features,” in World Automation Congress, WAC’96, Robotics and Manufacturing Systems, vol. 3, 1996, pp. 497–502.
- P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor fusion IV: control paradigms and data structures, vol. 1611. International Society for Optics and Photonics, 1992, pp. 586–606.
- B. Siciliano and O. Khatib, Springer handbook of robotics. Springer, 2016.
- T. Breuer, M. Ndoundou-Hockemba, and V. Fishlock, “First observation of tool use in wild gorillas,” PLoS Biology, vol. 3, no. 11, 2005.