Online Tool and Task learning via Human Robot Interaction
This work describes the development of a robotic system that acquires knowledge incrementally through human interaction where new tools and motions are taught on the fly. The robotic system developed was one of the five finalists in the KUKA Innovation Award competition and demonstrated during the Hanover Messe 2018 in Germany. The main contributions of the system are a) a novel incremental object learning module - a deep learning based localization and recognition system - that allows a human to teach new objects to the robot, b) an intuitive user interface for specifying 3D motion task associated with the new object, c) a hybrid force-vision control module for performing compliant motion on an unstructured surface. This paper describes the implementation and integration of the main modules of the system and summarizes the lessons learned from the competition.
A key challenge in deploying robots in human environments is the uncertainty and ever-changing nature of the human environment. To adapt to the variability, the robot needs to constantly update its model of the world. In other words, the robot needs to be capable of learning incrementally.
One of the first steps of interacting with the world is to recognize new objects and know how to utilize them. This is a difficult problem to solve. One approach to tackle this problem is to leverage human’s knowledge and guidance whenever the robot is not able to make a decision on its own . Along this direction, in our previous work  we proposed a method to improve the robot’s visual perception incrementally and used human robot interaction (HRI) to learn new objects and correct false interpretations.
In this paper, we build on the same idea and develop a new system with a learning module that is more natural and efficient, and allows teaching tasks that are associated with the new objects. In particular, the new system improves on the following aspects: 1) when teaching a new object to the robot, it does not require that the object remains static anymore. It automatically tracks the object such that a human can teach a new object just by holding it in hand and showing it to the camera, 2) a novel incremental object detection system that has robust performance on newly learned objects, existing objects, and objects from unknown classes, and supports open-set recognition. 3) an intuitive user interface for specifying 3D motion task associated with the new object, 4) a hybrid vision-force control for performing compliant motions that require contact with unstructured surfaces.
Our developed robotic system was one of the top five finalists in the KUKA Innovation Award competition 2018 . This year’s theme of the competition was “Real-World Interaction Challenge”. The competition aimed to seek robotics solutions that adapted to the changing environment in the real world. We performed live demos at the Hannover Messe 2018. In these demos the audience brought new objects. Our robot system learned both the visual appearance of the new objects and how to use them in contact motions w.r.t. a sensed surface.
The rest of the paper is organized as follows. Section II outlines the overall system modules and their interconnections. Object localization module is described in section III. The details of the proposed incremental classification module is presented in Sec IV followed by the user interface design in Sec V. Details of the hybrid force-vision control module is explained in Sec VI. Experimental results and the demonstration at the Hannover Messe are presented in Sec VII. Section VIII concludes the paper and summarizes the lessons learned during the competition.
Ii System Overview
The overall architecture of our system is illustrated in Fig. 1. It learns new tools and tasks for contact motions on unstructured surfaces, in an interactive way. An example of a use case is to teach the robot how to comb hair, which requires teaching what is a comb, where to move the comb and how to properly align with the head during the movement. With our system, a user can teach the robot what a comb is just like teaching another human. The user only needs to hold a comb in his/her hand and rotate it to show different object poses to the camera. The learning module automatically tracks the object and stores the features. From there, the incremental detection module will be able to detect the comb and picks it up from the table. The user can then teach the movement trajectory by drawing paths on a touch-screen.
Our hardware consists of a KUKA LBR iiwa arm instrumented with a flexible Festo gripper , a pointgrey camera (for teaching objects), two RGB-D sensors and a touch-screen for user interaction. Our system is composed of three main modules, all of which are fully integrated with ROS :
Incremental Object Detection: an object detection system that allows incrementally adding new classes. We dubbed it as “Incremental” in contrast with the traditional method. We adopt a two-stage approach, object localization followed by incremental classification;
Visual User Interface for user interaction: enables the operator to seamlessly add/remove object classes, and the associated paths. The user defines the paths by drawing 2D trajectories on the touch-screen.
Hybrid force-vision control: this module enables the robot to perform 3D motion tasks that requires contact with unstructured surfaces. It receives the RGB-D sensor information and constructs the 3D trajectories of the motion.
In the following sections, we will describe these modules in detail.
Iii Object Localization
The objects are placed on a table in front of the robot manipulator. Before the objects can be classified into different categories, the robot needs to localize them. We adopt the Region Proposal Network from Faster RCNN  for the object localization. This network serves as a generic object detector, that predicts bounding boxes of the objects in an image and the associated objectness. Instead of using the downstream classification network layers in Faster RCNN, we use the incremental classification method described in next section such that we can incrementally add new classes. We use VGG16  as the backbone network, trained on MS-COCO  dataset. The image patches inside each predicted bounding box are passed as the input to the incremental classification stage. The location of each bounding box is sent to the path specification module. We optimize the speed of the localization by only processing the lastest image while skipping all the images that are observed during the last network inference.
We have tried other alternatives, like using a one-stage object detector, YOLOv2 . YOLOv2 predicts class and objectness for every anchor box, which comes from priors learned from the training dataset by clustering. In our setting, we keep the anchor boxes that had high objectnesses as the set of potential objects. For each of these objects, we pass them as input to the incremental classification stage and update its class label according to the classification result. YOLOv2 achieved great balance of speed and performance on PASCAL  and COCO datasets. However, it did not perform as well as the Faster RCNN based approarch. The main reason is that, while the priors learned from these datasets may be a good representation of object locations in the test set, they do not fit the real-world scenarios. The user may place an object anywhere on the table. If it’s in a location less represented in the dataset, it will get a lower objectness and may be considered a non-object.
Iv Incremental classification
One approach for object classification is to depend on large-scale a-priori training data. However, large-scale training datasets such as ImageNet  or MSCOCO do not contain all the objects and tools used in a variety of manipulation tasks. Nonetheless, the learned convolutional features from pre-training on large-scale training data can benefit from the incremental learning of new objects. Unlike the general classification problems where one image is required to be classified based on previous training data that do not capture the object poses, the classification in a human robot interaction setting has additional temporal information of the object undergoing different transformations. The different object poses can aid in building better classification modules and greatly benefit the low shot recognition problem. Finally, to enable the robot to operate in different environments and incrementally learn objects, it needs to acknowledge when an object presented is from an unknown class. The ability to recognize objects outside of the closed set in its own data is termed as the open-set recognition problem , and allows the robot to request the human help.
Our classification module is comprised of two stages: (1) Teaching Phase, (2) Inference Phase. During the teaching phase a human demonstrates the different object poses, while during the inference phase the robot is required to detect the novel objects that it has learned. This mimics children being taught about novel objects by a teacher or parent .
In the teaching phase a saliency method based on class activation maps (CAM)  from ResNet-50  is used to automatically detect the object being demonstrated. Although the learned object might be a novel one that does not belong to ImageNet set of classes, CAM will still be able to correctly localize the salient object as shown in Figure 2. The features from a ResNet-50 network pretrained on ImageNet are extracted for the image patches extracted based on CAM localization. The mean of activations as shown in a previous few-shot learning study  acts as a strong indicator for the object class. Thus the mean of activations from ResNet-50 for all object poses are used to represent each object. Each novel object being taught to the robot creates a new set of features and its corresponding mean.
During the inference phase, image patches based on the computed bounding boxes from the object localization module are extracted and their corresponding ResNet-50 features are computed. These act as the query features corresponding to the query objects, and the classification problem is dealt with as a retrieval problem. The query object is classified based on the nearest neighbour of the query feature. Nearest neighbour algorithm is used in our application since it requires open-set recognition: not only that we need to classify the objects, we also need to know when an object is from an unknown class. To do that, a distance ratio is computed between the first and second nearest neighbour distances. A higher ratio near 1 indicates an ambiguous classification and is rather classified as an Unknown object to the robot. While a lower ratio indicates higher discrimination between the first and second nearest classes, and the nearest neighbour class is used. An overview of the two phases are shown in Figure 3.
V User interface
The user interface and the workflow is depicted in Fig. 4. At the start of the system, the user decides whether to add a new class, if yes, the user needs to show the new object to the camera along with entering the name of the class. The system will collect samples of this new object and move on to the incremental object detection module. All the objects placed on the table in front of the robot will be detected and assigned as either one of the old classes or the new class. If the user does not want to add a new object, the system can either wait, or bypass the learning module and detect the object as one of the old classes. The user can then define a 3D motion by drawing on the touch screen, and pair this motion with one of the objects. The robotic manipulator will automatically pick up the object and apply the user-defined motion. The interface is implemented with the qt-ROS package.
The User Interface has three view panes:
Teaching pane: which allows the user to introduce new objects and define the class by entering its class name. It subscribes to the cropped images of new object and shows it to the user. The system collects 200 sample images of the new object, stores them and passes them to the classification module. Once the image capturing is done, the deep feature extraction process is activated and runs until completion.
Detection pane: This pane subscribes to the scene video stream and detects all the objects in the scene. Detected objects are displayed to the user and the corresponding names and locations of the objects are stored for the next phase which is motion specification and execution.
Task pane: This pane subscribes to the scene video and enables the user to specify paths by drawing on the 2D image stream from the scene. The user is able to modify the path, add multiple paths or delete the specified path. It is also possible to define area tasks, in which the user selects a region. The interface enclosed the area with a polygon and automatically calculates multiple strokes that covers the selected area. This feature is specifically useful for cleaning or polishing tasks of unstructured surfaces. Once satisfied with the path, the user needs to select an object to pair this path with. The robot will attempt to pick up this object and move it along the specified path. The idea is to decouple the object classes from the utility of the object. With this interface, we allows the user to define how to use the object, with the flexibility to pair/unpair the path from the object. Note that as the focus of this system is not on grasping, flexible Festo gripper fingers  is used and a simple table-top grasping strategy based on the orientation of the object is used to grasp the objects.
The goal of the controller is to constrain the end-effector motion to the path (on a surface) while maintaining a contact force () on the surface. In order to achieve this, we make use of so-called operational space control. Following the work of Khatib , we can decompose the motion task by projecting the motion onto orthogonal directions along the tangential () and normal ( directions, see Fig. 6(c). A dedicated controller is then designed for each direction. An additional controller is also required for orienting the robot’s tool.
Vi Hybrid force-vision control Module
The overview of the control module is illustrated in Fig. 6. User-defined 2D paths (drawn on a touch screen) are first projected on the 3D surface. To do this, we need to compute the surface normals.
Vi-a Surface Normal Extraction
This module computes the 3D path coordinates and the corresponding surface normals. The input to the module is a 2D path, which is converted to the 3D path coordinates via the direct correspondence between the RGB sensor and depth sensor. The Kinect sensor is calibrated such that these coordinates in the Kinect frame can be transformed into the robot base frame. To compute the surface normals, we fit a plane to the neighborhood patch of the target point . is obtained from the points within a radius from . We can then estimate the target normal by computing the smallest eigenvalue of the covariance matrix of . The reader is referred to  for the details.
The integral normal estimation implemented in the PCL library  is used. It optimizes the computation by taking advantage of the organized structure of the point cloud. The normal is calculated by taking the cross product of the two local tangential vectors formed by the right-left pixels and up-down pixels. Due to the noisy nature of the sensor data, we smooth out the tangential vectors by taking the average, or, an integral image . In our implementation we use the PCL Average 3D Gradient mode which creates 6 integral images to compute smoothed versions of horizontal and vertical 3D gradients .
Vi-B Path controller
It is assumed hearafter that the visual interface provides the desired path, i.e. a down-sampled sequence of 3D points on the surface and the corresponding normal directions at each point . With the desired path, the tangential direction at each point is computed as follows (see Fig. 6 (c)):
The desired path reference frame is then fully characterized by .
The desired orientation of the robot end-effector can also be obtained from the path on the surface. In general, the desired orientation111We use relative Euler angles with rotation order followed by followed by . could be set path dependant and could change along the the path. For example, suppose that we want to reorient the end-effector (tool) reference frame such that is aligned with and is aligned with the tangential direction . This can be achieved by constructing the rotation matrix from tool to base .
The goal of the controller is to constrain the end-effector motion to the path (on a surface) while maintaining a contact force () on the surface. In order to achieve this, we follow the work of Khatib  in operational space control and decompose the motion task by projecting it onto orthogonal directions along the tangential () and normal ( directions, see Fig. 6(c). Path following in the tangential plane (plane constructed by ,) is achieved by designing a compliant controller.
KUKA LBR iiwa robot is a torque controlled 7-DOF manipulator with integrated torque sensors at each link. Its dynamic equation is of the form
where is the joint angles, is the positive-definite inertia matrix, is the Coriolis matrix, is the gravitational force, the actuators torques, and is the external generalized force applied to the robot by the environment.222Note that for simplifying the notations, we drop the dependencies on unless it is required.
Assuming that the end-effector position and orientation is described by a set of local coordinates and the forward kinematics map ) is known, the mappings between joint and Cartesian velocities and accelerations are , , where is the manipulator Jacobian and has full row rank333The singular case can be treated using the method described in .. Now denote as the Cartesian error between actual Cartesian pose and the desired one .
Out controller sends the torque command:
which has a desired compliance behavior in the presence of external forces and torques at the end-effector . Matrices and are diagonal and specify the desired stiffness and damping in each direction. The stiffness gains in the tangential plane are set high while the stiffness in the normal direction is set to be very low.
Vii Case study: Demonstration at Kuka Innovation Award Competition 2018
Demo Setup As the final stage of the KUKA innovation award competition, we presented our system at Hanover Messe. It was in the format of a live demo, running for five days, 6 hours per day. The algorithm of the demo system ran on the following hardware. The user interface ran on a Thinkpad X230i touch-screen laptop. The backend system ran on two commodity computers, both of which had a Quad-core CPU running at 4GHz. One of them was for the message passing of the sensors and the processing of the point cloud. The other ran the incremental learning algorithms on a NVIDIA GTX 960. Despite the moderate computing power of the GPU, we could get near real-time performance from our localization and recognition algorithm, thanks to the optimization described in Section III. The inference time of our incremental detection model was on average 200ms per image, accounting for both the localization and the recognition.
The vision sensors included a pointgrey camera for teaching objects, an asus-X RGB-D sensor for extracting the 3D point cloud of the surface, a Kinect2 sensor for object detection and mapping 2D localization result to 3D. Both of the RGB-D sensors were calibrated with respect to the robot base frame. The accuracy of the surface reconstruction is limited by the resolution of the RGBD-sensor, in our case within the workspace of the robot. The compliance controller would be able to deal with the inaccuracy and ensure the continuous contact during the motion. When teaching the object, we cropped the image from pointgrey camera to 400 by 400 pixels and resized to 224 by 224 pixels before feeding into the network. The input to object detection module was the HD image stream from the Kinect2 sensor. Again, it was cropped, to 600 by 600 pixels, resized to 224 by 224 pixels, for a good balance of speed and performance.
What was unique about our live demo in contrast to a traditional lab setup, was that there was no direct control over the interaction with the audience nor the lighting conditions, which changed due to overhead sky-lights.
A full run of each competition demo was required by the organizers to run in a tight four-minute window, consisting of two phases, (1) a new object detection phase, and (2) a robot motion teaching phase with the new object. In phase one, the image collection of new object took 40 seconds plus another 40 seconds for deep features extraction. Presenting the object detection results to the audience took another 30 seconds. In phase two, visual motion teaching on the laptop touch screen took 10 seconds, and path planning 20 seconds. Finally autonomous execution of the surface contact motion took one and a half minute (the robot manipulator performed a reaching and grasping motion, then followed the trajectory defined on the unstructured surface). These were the average duration of each phase. The specific duration varied from demo to demo.
In order to test the limits of our system, we encouraged the visitors to bring their own objects and teach the robot. In Table I, we summarized the classification result of the new objects brought by the visitors at the demo.
|Successful classification||Unsuccessful classification|
|key chains||car key|
Discussion Overall, the system performed robustly even in the presence of all kinds of uncertainties. For example, a big challenge was to deal with lighting variations. There was an opening on the ceiling near the setup, causing a mixture of natural and indoor lighting that changed dramatically during the day. As shown in Table I, the incremental learning algorithm classified the object successfully on about 69% of all the object categories introduced by the audience. Of the failure cases, most were due to the following characteristics of the objects: flexible material, tiny shape, reflective surface, similar textures and colors to the objects in the training set. In total, we performed about 75 demos where about 65 were successful in the incremental learning part, which came to 87% success rate of the demos. Note that the running time of the incremental classification algorithm grows linearly with the total number of existing classes, due to the nearest neighbor approach. To ensure the speed of the demos, we reset the system to five classes at the end of each day.
In the motion execution phase, the contact motions were always successful as long as the defined path was within reach of the manipulator. The unsuccessful cases were either the gripper failed to pick up the object, or the defined path was not reachable. We provided failure recovery mode for each case. In the case of a pick up failure444As the focus of our work was not on grasping, a simple table-top grasping was considered for the objects based on the shape of the bounding box. Obviously we had limitations due to the nature of the provided gripper and we were not able to grasp objects that were too large (greater than in diameter) or too small (less than )., the user could stop the motion of the robot by pressing the button on the robot flange and manually pass the object to the gripper. In the case of non-reachable path, the robot would signal the user such that he/she can redefine the path.
This paper introduced a novel system that acquired knowledge incrementally from human, where new tools and motions can be taught on the fly. The usage of the system was showcased in live demonstrations as the final round of the KUKA Innovation Award competition. Our system made it possible for users to define new classes of object that could be recognized later. It also allowed the user to associate specific tasks for these new objects and perform actions with them.
At the demos, our system has attracted the attention from multiple companies for possible trials, including bin-picking applications where they were specifically interested in fast teaching of new objects. Also, there were interests from cosmetic industry for 3D path specification and automotive industry 3D contour inspection.
-  A. E. Leeper, K. Hsiao, M. Ciocarlie, L. Takayama, and D. Gossow, “Strategies for human-in-the-loop robotic grasping,” in Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction. ACM, 2012, pp. 1–8.
-  S. Valipour, C. P. Quintero, and M. Jägersand, “Incremental learning for robot perception through HRI,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2017, Vancouver, BC, Canada, September 24-28, 2017, 2017, pp. 2772–2777.
-  “Kuka innovation award,” https://www.kuka.com/en-ca/technologies/research-and-development/kuka-innovation-award/kuka-innovation-award-2018, accessed: 2018-09-14.
-  “Kuka lbr grippers,” https://www.kuka.com/en-ca/products/robotics-systems/robot-periphery/end-effectors/kuka-lbr-grippers, accessed: 2016-09-13.
-  M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “Ros: an open-source robot operating system,” in ICRA workshop on open source software, vol. 3, no. 3.2, 2009, p. 5.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” International Conference on Learning Representations (ICLR), 2015.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
-  J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” CVPR, 2017.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. Ieee, 2009, pp. 248–255.
-  A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1563–1572.
-  E. M. Markman, Categorization and naming in children: Problems of induction. Mit Press, 1989.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  S. Qiao, C. Liu, W. Shen, and A. L. Yuille, “Few-shot image recognition by predicting parameters from activations,” CoRR, abs/1706.03466, vol. 1, 2017.
-  O. Khatib, “A unified approach for motion and force control of robot manipulators: The operational space formulation,” IEEE Journal on Robotics and Automation, vol. 3, no. 1, pp. 43–53, February 1987.
-  R. B. Rusu, Semantic 3D object maps for everyday robot manipulation. Springer, 2013.
-  R. B. Rusu and S. Cousins, “3d is here: Point cloud library (pcl),” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 1–4.
-  S. Holzer, R. B. Rusu, M. Dixon, S. Gedikli, and N. Navab, “Adaptive neighborhood selection for real-time surface normal estimation from organized point cloud data using integral images,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 2684–2689.
-  C. P. Quintero, M. Dehghan, O. Ramirez, M. H. Ang, and M. Jagersand, “Flexible virtual fixture interface for path specification in tele-manipulation,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 5363–5368.