Transfer Learning for Unseen Robot Detection and Joint Estimation on a Multi-Objective Convolutional Neural Network
A significant problem of using deep learning techniques is the limited amount of data available for training. Datasets exist for popular problems such as item recognition and classification or self-driving cars, but data is very limited in the industrial robotics field. In previous work, we trained a multi-objective Convolutional Neural Network (CNN) to identify the robot body in the image and estimate the 3D positions of its joints from just a 2D image, but it was limited to a range of robots produced by Universal Robots (UR). In this work, we extend our method to work with a new robot arm, the Kuka LBR iiwa, which has a significantly different appearance and an additional joint. However, instead of collecting large datasets once again, we collect a number of smaller datasets containing a few hundred frames each and use transfer learning techniques on the CNN trained on UR robots to adapt it to a new robot with different shapes and visual features. We show that transfer learning is not only applicable in this field, but that it also requires smaller well-prepared training datasets, trains significantly faster and reaches accuracy similar to the original method, even improving on it in some aspects.
Industrial robotics has been associated with structured and well-defined environments for many years, and robot arms have achieved great performance in areas like manufacturing. Such work consists of hard-coded repetitive motions, where a machine can do a better job than a person in terms of fatigue, precision and non-stop operation. However, with developing hardware, growing computing power and advancing algorithms, the same systems are becoming more adaptive. Nowadays, instead of fencing off the robots, environment understanding and adaptive behaviour are part of the Industry 4.0 concept, where robots and people can share the same workspace and collaborate [lee2015cyber].
There are numerous approaches to sense the environment: laser scanners, stereo vision, RGB-D cameras, camera arrays, ultrasound sensors, motion capture systems. Each one has its own pros and cons, often either needing additional markers or calibrated devices or having a high price-tag. Very often there is still a significant amount of work needed to set up a new robustly working system.
The inspiration for environment understanding comes from biology, namely from how animals, and especially humans, are able to understand their surroundings. We are capable of learning what objects are, how they move, what their functionality is and how we should interact with them by looking at example situations. Furthermore, once we know how something works in one situation, it is very likely that the next time we see similar conditions we will find parallels between the two and figure out how to act by simply using our previously gained knowledge. That is the motivation for the transfer learning method, which takes a previously well-trained neural network and adjusts it to new conditions using a limited amount of training data and significantly shorter training time compared to the full training of the neural network.
Transfer learning has been used in a variety of fields. In many cases, the whole or part of the CNN trained on ImageNet is taken as a base network and then adjusted to a specific application [krizhevsky2012imagenet]. This has been proven to work for mid-level image representations in object classification, using the pre-trained network on natural images to adapt for medical image recognition and even emotion recognition [oquab2014learning] [greenspan2016guest] [ng2015deep]. Another interesting application of transfer learning is to use a fully trained network on night-time satellite imagery of poverty areas and adapt it to recognise poverty areas from daytime satellite imagery [xie2015transfer]. Furthermore, detailed analyses of the transfer learning approaches were made with surveys of the techniques used and various CNN structures [shin2016deep] [weiss2016survey].
The evidence that generalised visual features can be transferred to new systems has motivated us to use transfer learning to extend our previous work on recognising the robot and estimating the 3D positions of its joints from a simple 2D color camera image [miseikis2018multi]. Instead of using ImageNet or any other well-known pre-trained network, we take our multi-objective CNN, previously fully trained on Universal Robots, and adapt it to a new Kuka LBR iiwa robot arm. Additionally, the new dataset adds new unseen backgrounds, making the network even more robust.
The main goal of identifying the robot in a 2D camera image is to remove the need for fully calibrated camera-robot systems, allowing for more dynamic environments while still ensuring safe operation. This is crucial for workspaces shared between humans and robots. There are many good methods for real-time avoidance of dynamic obstacles and people, but most of them require a fully calibrated robot-camera system [mainprice2013human] [miseikis2016multi]. Despite some efficient Hand-Eye calibration methods, calibration is still a cumbersome process, as the operation of the robot has to be halted until it is completed [miseikis2016automatic]. Furthermore, robot identification can simplify the task of having mobile robots move around the floor without any special markings. By identifying other fixed robots, a mobile robot can both avoid possible collisions and localise itself relative to known fixed-base robots. By identifying other robots and knowing their exact position, the setup could be expanded to predict the behaviour of other machinery in the surrounding environment without a direct communication channel between them. This would be a very useful approach in swarm robotics applications.
This paper is organized as follows. We present the system setup and dataset collection in Section II. Then, we explain the proposed method and CNN structure and configuration in Section III and the transfer learning procedure in Section IV. We provide experiments and results in Section V, followed by relevant conclusions and future work in Section VI.
II System Setup and Dataset Collection
Training a deep learning network typically requires a large amount of diverse training data. The main problem lies in the necessity to have precise ground-truth information, which is given as a correct answer.
Our setup consisted of a vision sensor, in this case a Kinect V2 camera, placed in arbitrary positions overlooking the robot, with Hand-Eye calibration performed at each of the positions [Fankhauser2015KinectV2ForMobileRobotNavigation]. The calibration is done by placing a known marker on the end-effector of the robot and performing a number of movements until the calibration accuracy reaches the necessary precision. The result is a coordinate frame transformation between the camera and the robot base [heikkila2000flexible].
Given a precise coordinate frame transformation, the robot model is used together with live joint encoder readings to create a simplified mesh model defining the robot shape. The mesh is then transformed to the coordinate frame of the camera, and a depth image is estimated from the viewpoint of the camera. The result is a precise mask of the robot body in the camera image, which can be overlaid on a color image and used as ground truth data for teaching the CNN. The main benefit is that this process is fully automated using ROS with the MoveIt! package [sucan2013moveit]. The robot model is taken from the Unified Robot Description Format (URDF) files provided by the robot manufacturers [meeussen2012urdf].
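The geometric core of this ground-truth generation can be illustrated with a minimal sketch. It assumes a 4x4 homogeneous transform from the robot base to the camera (the output of Hand-Eye calibration) and a simple pinhole camera model. The names `T_cam_base`, `transform_point` and `project_pinhole`, as well as all numeric values, are hypothetical and chosen only for illustration. The actual pipeline relies on ROS and MoveIt! rather than on code like this.

```python
# Sketch: map a point given in the robot base frame into the camera image.
# Assumptions (not from the paper): a 4x4 homogeneous transform T_cam_base
# and a pinhole model with intrinsics fx, fy, cx, cy.

def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T (nested lists) to a 3D point p."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))

def project_pinhole(p_cam, fx, fy, cx, cy):
    """Project a camera-frame point to pixel coordinates (u, v)."""
    x, y, z = p_cam
    if z <= 0:
        return None  # point is behind the camera and therefore not visible
    return (fx * x / z + cx, fy * y / z + cy)

# Hypothetical calibration: robot base 2 m in front of the camera, axes aligned.
T_cam_base = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 0.0, 1.0],
]
joint_in_base = (0.5, 0.0, 0.0)
joint_in_cam = transform_point(T_cam_base, joint_in_base)
pixel = project_pinhole(joint_in_cam, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Applying the same two steps to every vertex of the mesh model and filling the projected triangles would yield a binary mask of the robot in the image.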
In our experiments, we use an already trained multi-objective CNN from the previous project [miseikis2018multi], which was trained from scratch on three robot models from Universal Robots: UR3, UR5 and UR10. In order to test the capabilities of transfer learning, new datasets using Kuka LBR iiwa were used. For comparison reasons, relatively large datasets, summarised in Table I, were collected for all the robots. These datasets consist of multiple recordings, each one with the camera placed at different angles and distances relative to the robot as well as having various backgrounds.
|Robot Type||Number of Datasets||Total Number of Samples|
|Kuka LBR iiwa|
Robot movements included a large variety of joint configurations resulting in many viewpoints of the robot. Furthermore, lighting conditions were varied for each of the recordings to allow for more robustness regarding the brightness and reflections.
The new datasets with the Kuka robot also included more dynamic backgrounds, with people moving around and even another Kuka robot placed further away that was not used in the experiments. Furthermore, in some cases the robot went out of the bounds of the color image. In total, 9 datasets of Universal Robots and 14 datasets of the Kuka robot were used. Each recording had a different camera placement, a different distance between the robot and the camera, varying lighting conditions and a new background. During each of the recordings, the robot was moving to give a large variety of joint configurations in the dataset.
At the completion of each movement, a trigger signal was sent in order to save the color image, the depth image, the Cartesian and joint coordinates of each of the robot joints and the ground-truth mask of the robot. All this information was later used to train the neural network. However, depth information was used only for training, while the recognition part of the system relies only on the color camera image as an input.
In order to normalise the input data, internal camera calibration was used to ensure a perfect overlap between color and depth images. All the input images are also rectified and have the resolution of pixels. Testing and validation sets were divided by the ratios of and respectively based on random sampling.
III CNN Structure and Configuration
The base of the multi-objective CNN is taken from previous work, where it was trained on a line of robots made by Universal Robots [miseikis2018multi]. The network simultaneously optimises for multiple heterogeneous outputs by using just a single image as an input.
The network in this paper is trained on four objectives:
Robot mask in the image
Robot type
3D Robot base position in relation to the camera
3D Position of the robot joints
The structure of the CNN is shown in Figure 3. The network shares a number of common convolutional layers and then branches for more objective-specific optimisation. Because there is a single training process, the features in the common layers are reused across all objectives.
III-A Loss Functions
Loss functions are used to evaluate the training progress and the achieved accuracy compared to the ground truth data. Our system optimises for four objectives simultaneously, resulting in four loss functions, which are later combined into one for the training process. First, each of the loss functions will be described separately followed by the explanation of how they are all connected into one.
The robot body takes up a relatively small area in the whole image. The area taken up by the robot body in UR datasets is between and for Kuka datasets, it is between of the whole image. Given a standard pixel classification loss function, there would be cases when an accuracy of over can be reached by classifying the whole image as a background. That is conceptually wrong, so the loss function was adjusted by using the foreground weight , which is calculated in Equation 1. It is based on the inverse probability of the foreground and background classes, where .
The background weight is calculated in Equation 2.
The loss function for the robot mask is defined by two steps. First, a per-pixel loss is calculated in Equation 3, where is , is and is the ground truth value from the mask image.
This is followed by a normalised loss calculation for the whole image in Equation 4. A normalisation factor , which is the number of pixels in the image, allows us to keep the same learning parameters independent of the input image size.
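As an illustration of the mask loss described above, the following sketch implements one plausible form of it: a class-weighted binary cross-entropy with inverse-probability weights, averaged over the number of pixels. The exact expressions of Equations 1 to 4 are not reproduced here, so the function names, the clamping constant `eps` and the toy values are assumptions made for illustration only.

```python
import math

def class_weights(mask):
    """Inverse-probability weights for foreground (1) and background (0) pixels.

    Assumes the mask contains at least one pixel of each class.
    """
    n = len(mask)
    p_fg = sum(mask) / n
    p_bg = 1.0 - p_fg
    return 1.0 / p_fg, 1.0 / p_bg

def mask_loss(pred, mask, w_fg, w_bg, eps=1e-7):
    """Class-weighted binary cross-entropy, normalised by the number of pixels."""
    total = 0.0
    for y, t in zip(pred, mask):
        y = min(max(y, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += -(w_fg * t * math.log(y) + w_bg * (1 - t) * math.log(1.0 - y))
    return total / len(mask)

# Toy flattened image: 8 background pixels and 2 robot pixels.
mask = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
w_fg, w_bg = class_weights(mask)

all_background = [0.01] * len(mask)      # predicts "no robot" everywhere
good_guess = [0.01] * 8 + [0.99] * 2     # predicts the robot pixels correctly
loss_bad = mask_loss(all_background, mask, w_fg, w_bg)
loss_good = mask_loss(good_guess, mask, w_fg, w_bg)
```

With these weights, a prediction that labels every pixel as background is penalised heavily even though most of its pixels are correct, which is the behaviour the weighting is meant to achieve.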
3D coordinates of the robot base and robot joints are defined as regression tasks. The loss function is based on the Euclidean distance between the estimated values and the ground truth values. For the robot joints estimation, the loss function is described in Equation 5, where is the number of joints, is the ground truth position of each joint and is the estimated values by the neural network.
The loss function for the coordinates of the robot base is calculated in Equation 6. is the ground truth position of the robot base in 3D, and is the estimated 3D position of the robot base. These positions are relative to the coordinate frame of the camera.
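A minimal sketch of the two regression losses follows. It assumes that the joint loss is the sum of per-joint Euclidean distances. Averaging over the joints would be an equally plausible choice, and the text above does not allow the two to be distinguished. All names and coordinates are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def joints_loss(truth, estimate):
    """Sum of per-joint Euclidean distances (assumption: sum, not mean)."""
    return sum(euclidean(t, e) for t, e in zip(truth, estimate))

def base_loss(truth, estimate):
    """Euclidean distance between true and estimated base position."""
    return euclidean(truth, estimate)

# Toy example with two joints, coordinates in the camera frame.
true_joints = [(0, 0, 0), (0, 0, 1)]
estimated_joints = [(3, 4, 0), (0, 0, 1)]
loss_joints = joints_loss(true_joints, estimated_joints)
loss_base = base_loss((0, 0, 2), (0, 0, 2))
```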
Classification of the robot type is defined as a categorical cross-entropy problem with multiple classes. is calculated in Equation 7, where is the ground truth labels, are the predicted labels and , where contains all the available types of robots in the dataset.
For the training of the multi-objective CNN and optimisation for all four objectives, a single loss function is needed. This was achieved by combining the previously defined loss functions into by having a weight element for each of the losses, as shown in Equation 8. The larger the weight , the higher the impact on the corresponding value.
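The classification loss and the combination step can be sketched as follows. The cross-entropy is the standard categorical form. The weighted sum mirrors the description of Equation 8, but the objective names, the individual loss values and the unit weights are placeholders and not the values used in the experiments.

```python
import math

def cross_entropy(true_one_hot, predicted_probs, eps=1e-7):
    """Categorical cross-entropy for a single sample."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(true_one_hot, predicted_probs))

def total_loss(losses, weights):
    """Weighted sum over named objectives; both dicts must share the same keys."""
    return sum(weights[name] * losses[name] for name in losses)

# Three hypothetical robot classes, the second one being the true class.
type_loss = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])

# Placeholder per-objective losses and weights (illustrative values only).
losses = {"mask": 0.30, "type": type_loss, "base": 0.05, "joints": 0.12}
weights = {"mask": 1.0, "type": 1.0, "base": 1.0, "joints": 1.0}
combined = total_loss(losses, weights)
```

Raising a single entry of `weights` increases the contribution of that objective to the gradient of the combined loss, which is how the relative importance of the objectives is tuned.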
IV Transfer Learning and Training
The benefit of the transfer learning technique is that the parameters contained in so-called frozen layers are copied from the previously trained network, while only some of the layers are trained during the process. This speeds up training and requires smaller training datasets compared to full CNN training. In this work, most of the convolutional layers had their parameters transferred and frozen, while all the fully connected layers and the last two convolutional layers for robot mask estimation were trained to adapt to the specific variation in visual features. The exact setup is shown in Figure 3. A frozen layer is one whose parameters are fixed after the transfer and not adjusted at all during training.
Weights for the loss function are kept identical to those in the previous work, both because they gave good results and to allow the results of the two works to be compared directly. The selected weight values were the following:
One important difference between the UR robots and the Kuka robot is the number of joints. The Universal Robots line has 6 joints, while the Kuka has 7. This difference changes the number of outputs for the 3D position estimation of the robot joints. However, because the fully connected layers, as well as the output layers, are trained, the network can be adjusted to accommodate the estimation of an extra joint.
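The idea of frozen layers can be illustrated with a framework-free toy sketch. Layers are represented as plain lists of weights. Frozen layers are copied from the source model and skipped during the update step, while the remaining layers are trained. The layer names, the fresh initial value for the trainable layers and the gradient values are all assumptions made for illustration and do not describe the actual network.

```python
# Toy illustration of parameter transfer with frozen layers (pure Python).

def transfer(source, frozen_names, fresh_init):
    """Copy frozen layers from `source`; give the remaining layers fresh values."""
    model = {}
    for name, weights in source.items():
        frozen = name in frozen_names
        values = list(weights) if frozen else [fresh_init] * len(weights)
        model[name] = {"weights": values, "trainable": not frozen}
    return model

def sgd_step(model, grads, lr):
    """One gradient descent step that leaves frozen layers untouched."""
    for name, layer in model.items():
        if not layer["trainable"]:
            continue  # frozen: parameters stay exactly as transferred
        layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grads[name])]

# Hypothetical pre-trained parameters of a three-layer network.
pretrained = {"conv1": [0.5, -0.2], "conv2": [0.1, 0.3], "fc_joints": [0.7, 0.9]}
model = transfer(pretrained, frozen_names={"conv1", "conv2"}, fresh_init=0.0)

grads = {name: [1.0, 1.0] for name in model}  # dummy gradients
sgd_step(model, grads, lr=0.1)
```

In a deep learning framework the same effect is usually obtained by marking the transferred layers as non-trainable, so that the optimiser never receives their parameters.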
Training was done by using datasets of different sizes containing the Kuka robot. Mini-batches were created in order to make the most out of the available GPU memory and all the data was randomly shuffled to reduce the biases. Before starting any training, parameters for the frozen layers were transferred from the old model fully-trained on UR datasets. This ensured that each training had an identical configuration in the beginning. The number of training samples varied by the experiment and the input size of the images was reduced by half from the original dimensions, down to pixels. The pixel intensity values of the input images were normalised to the range between 0 and 1. The learning rate was set to at the start of the training and then gradually decreased towards as the training progressed.
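The input normalisation and the decaying learning rate can be sketched as below. The sketch assumes 8-bit pixel intensities and a linear decay between a start and an end value. The actual schedule shape is not specified beyond being gradual, and the learning rates used here are placeholders rather than the values from the experiments.

```python
def normalise_pixels(pixels, max_value=255):
    """Scale pixel intensities to the range [0, 1] (assumes 8-bit input)."""
    return [p / max_value for p in pixels]

def learning_rate(step, total_steps, lr_start, lr_end):
    """Linearly interpolate from lr_start to lr_end (assumed schedule shape)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + frac * (lr_end - lr_start)

scaled = normalise_pixels([0, 51, 255])

# Placeholder values: start at 1e-3 and decay towards 1e-5 over 100 steps.
lr_first = learning_rate(0, 100, lr_start=1e-3, lr_end=1e-5)
lr_middle = learning_rate(50, 100, lr_start=1e-3, lr_end=1e-5)
lr_last = learning_rate(100, 100, lr_start=1e-3, lr_end=1e-5)
```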
V Experiments and Results
A number of experiments were carried out in order to determine the effectiveness of the transfer learning process. In order to find the optimum number of training samples needed for transfer learning, each experiment used a training set of a different size, all randomly sampled from the Kuka dataset. The testing set was identical for all the experiments.
|Measure||Full Training||Transfer Learning|
|Mask Accuracy, %|
|Robot Type Accuracy, %||—|
|Joint Pos Error (Median)|
|Base Pos Error (Median)|
|Training Time (hours)||hours||hours|
The evaluation was done using a testing set by comparing the output against the ground truth data. The robot mask accuracy is defined by counting the number of pixels in the CNN output image that match the ground truth mask. For the robot joint and base coordinates, Euclidean distance between the CNN estimated results and ground truth results was calculated. We compare the results of transfer learning method trained on the Kuka robot against our previously presented multi-objective CNN fully trained for UR robots [miseikis2018multi]. Results are summarised in Table II.
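The two evaluation measures described above can be written down compactly. The sketch below assumes flattened binary masks and 3D positions given as tuples. The function names and the toy values are hypothetical and serve only to show the calculation.

```python
import math
from statistics import median

def mask_accuracy(pred_mask, true_mask):
    """Percentage of pixels whose predicted label matches the ground truth."""
    matches = sum(1 for p, t in zip(pred_mask, true_mask) if p == t)
    return 100.0 * matches / len(true_mask)

def median_position_error(estimates, truths):
    """Median Euclidean distance between estimated and true 3D positions."""
    return median(math.dist(e, t) for e, t in zip(estimates, truths))

# Toy example: four pixels and three position estimates.
accuracy = mask_accuracy([1, 0, 0, 1], [1, 0, 1, 1])
error = median_position_error(
    [(0, 0, 0), (3, 4, 0), (0, 0, 2)],
    [(0, 0, 0), (0, 0, 0), (0, 0, 0)],
)
```

Reporting the median rather than the mean keeps a few badly estimated frames from dominating the position error.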
Compared to a fully trained system, the transfer learning results matched closely. As seen in Figure 3(a), the error in estimating 3D positions of robot joints was cm compared to cm in a fully trained system, while the robot mask accuracy difference was just with for transfer learning method and in a fully trained CNN. Robot base position estimation was actually more accurate in transfer learning method with an error of cm compared to cm. Resulting positions of robot base and joints mapped onto 2D and marked on the example dataset images are shown in Figure 5. The system was adapted for just one type of the new robot, so we did not evaluate the accuracy of robot type detection. The system adapted to an additional robot joint in the transfer learning method. There can be seen an increase in the error for Joint 3, and that could be partly caused by a different structure of the robot, as seen in Figure 3(b).
However, the main benefit of the transfer learning method can be seen in training time. By looking at the Figure 3(c), where loss calculation and training time is shown against the number of samples used for training, it can be clearly seen that the optimum result is around 300 training samples. To be specific, it was the experiment, where 312 training samples were used. It took a bit under 2 hours of training and the resulting loss was . Having more samples, the loss got down to , but it took significantly more time to train. An interesting point was that by using the whole training dataset, the loss increased back to and took around hours of training. The forward propagation time per sample averaged to ms. All the training and testing was done on Nvidia Geforce GTX 1080 Ti graphics card.
VI Conclusions and Future Work
In this paper, we have presented a transfer learning approach to adapt a previously trained multi-objective CNN to new types of robots. In general, the system identifies and localises the robot arm and estimates its base and joints’ positions in 3D. This allows a camera to be placed in any position and moved around without having to re-calibrate the camera-robot system with Hand-Eye calibration. In this work, we have shown that by starting from a fully trained system, significantly less training data is needed to adapt it to new robot models, which have a different shape and appearance and even more degrees of freedom.
The results have shown that the accuracy achieved by using transfer learning closely matches the results of the fully trained system and can even improve on them in some cases. This means that the system is able to adapt and learn to recognise new robots with just a limited amount of training data, similarly to what we do when learning new skills and practising them afterwards.
This work can be useful in dynamic environments where it is difficult to predict where robots, sensors and people are located, but operational safety has to be established. By expanding this method to numerous robots and other equipment, fixed setups and calibration can be discarded. Unfortunately, the accuracy is still at centimetre level, so the method is not applicable to precision tasks. However, in many adaptive human-robot and robot-robot interaction tasks, general obstacle avoidance and collaborative movements could be made possible.
Given precise robot body detection, another possible application could be self-inspection, in which the robot detects any unknown and unexpected damage. Similar to how new robots are added using transfer learning, typical damage could be taught to the system and identified by the robot scanning itself, observing its own reflection or having another robot scan it. This can be very useful in environments like disaster areas, where robots have to work autonomously for long periods of time, or when internal sensors give unusual readings and the hull should be inspected.
For future work, we plan to implement more robots as well as having robots on mobile platforms in the system. Instead of training on one new robot model, transfer learning will be used to expand the CNN to work with a line of robots, including the originally trained ones. Previously mentioned robot self-inspection is also of high interest, as well as adding collaborative tasks with people by tracking their movements using the latest skeleton tracking methods. Furthermore, more types of cameras will be tested and transition from one camera to another analysed.
This work is partially supported by The Research Council of Norway as a part of the Engineering Predictability with Embodied Cognition (EPEC) project, under grant agreement 240862, and by the Austrian Ministry for Transport, Innovation and Technology (BMVIT) within the project framework CollRob (Collaborative Robotics).