MaskedFusion: Mask-based 6D Object Pose Detection
MaskedFusion is a framework for estimating the 6D pose of objects from RGB-D data, with an architecture that leverages multiple stages in a pipeline to achieve accurate 6D poses. 6D pose estimation is an open challenge because real-world objects are complex and many problems can arise when capturing data from the real world, e.g., occlusions, truncations, and noise. Achieving accurate 6D poses will improve results in other open problems such as robot grasping or positioning objects in augmented reality. MaskedFusion improves upon DenseFusion; the key differences are that it pre-processes the data before it enters the Neural Network (NN), eliminating non-relevant data, and that it feeds additional features extracted from the object mask to the NN to improve its estimation. On the widely used LineMOD dataset, its average error is an improvement of more than 20% over the state-of-the-art method, DenseFusion.
With increasing automation and the need for robots that can work in unrestricted environments, the capacity to understand a scene in three dimensions is becoming very important. One of the main tasks is grasping objects in a cluttered environment, e.g., bin picking. In this work we improve the current state of the art in 6D object pose detection. Although there are already several important contributions in this area [9, 18, 25, 26, 27], this paper goes beyond the current accuracy and improves the pose estimation results by 20% on a widely used benchmark dataset.
6D pose estimation is an important open problem because it underpins several other tasks, such as robotic manipulation and augmented reality. In augmented reality, the pose of real objects affects the interpretation of the scene, and accurate poses for virtual objects improve the experience. 6D pose estimation is also useful in human-robot interaction tasks such as learning from demonstration and human-robot collaboration. With more accurate 6D poses, robots can achieve better grasping performance.
This is a challenging problem because of the variety of objects in the real world and the way they appear in captured scenes, with possible occlusions and truncations. Obtaining the data needed to retrieve the 6D pose is itself an open problem, as RGB-D data can sometimes be deceiving, e.g., for fully metallic objects or meshed office garbage bins. When the data captured from the real world has such problems, many methods fail or predict wrong poses.
We evaluated our method and compared its results with other methods using the LineMOD dataset. The main comparison is with DenseFusion, the state-of-the-art method in 6D pose estimation using RGB-D data. Our method is more accurate than DenseFusion, achieving an improvement of more than 20%, at the cost of a longer training process. It keeps the same inference time per image as DenseFusion, which makes it useful in robotics due to its high inference rate. We achieved this by eliminating unnecessary data around the object whose 6D pose we want to estimate, and by introducing the mask of the object into the convolutional neural network (CNN). This differs from previous approaches because, besides the RGB-D data, the network also uses the mask of the object.
In summary, our contribution is an improvement over DenseFusion that is more accurate, with the only disadvantage being a longer training time, while keeping the same inference time. As far as we know, our method surpasses the state-of-the-art results on the LineMOD dataset.
2 Related Work
In this section, we present the related work in 6D object pose estimation. It is possible to split the methods that do 6D pose estimation into three different categories. These categories are defined by the type of input data that the methods use.
Estimate 6D pose using RGB images: most methods in this category [1, 15, 22, 23] rely on detecting keypoints of the objects in a scene, matching them with the 3D render, and using the PnP algorithm to solve for the pose of the object. The methods that do not follow this technique use deep learning [14, 21], such as convolutional neural networks (CNN), to extract features from multiple viewpoint poses and later match new scenes with previously known poses. One of the most accurate methods for 6D pose estimation from RGB images is PVNet. It is a hybrid method that combines the two approaches mentioned above, and it can be robust to occlusion and truncation. PVNet has a neural network that predicts pixel-wise object labels and unit vectors that represent the direction from every pixel to every keypoint of the object. These keypoints are voted on and matched with the 3D model of the object to estimate its 6D pose.
Estimate 6D pose using point clouds: methods like PVFH and its predecessors [20, 19] solve 6D pose estimation at remarkable speed, but they are not reliable when the data is poorly captured, i.e., noisy. In this category, deep learning architectures such as PointNets [17, 18] and VoxelNet have also been emerging and have achieved good results on multiple datasets.
Estimate 6D pose using RGB-D data: in methods like PoseCNN and SSD-6D, the 6D poses are estimated directly from the image data and then further refined with the depth data. These types of approaches usually rely on expensive post-processing steps to make full use of the 3D input, e.g., Iterative Closest Point (ICP). Methods like Li et al. and Kehl et al. can use RGB-D data directly, since they rely only on extracting features from the RGB-D data. The RGB-D data only needs to be stacked as a four-channel image (RGB + Depth). In Kehl et al., the extracted features are then matched against a previously generated codebook. DenseFusion fuses the depth data with the RGB image while retaining the geometric structure of the input space. DenseFusion is most similar to PointFusion, which also keeps the geometric structure and appearance information of the object and later fuses this information in a heterogeneous architecture.
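The four-channel stacking mentioned above can be illustrated with a minimal NumPy sketch. The function name `stack_rgbd` and the choice of `float32` are assumptions made for this illustration; they are not taken from any of the cited methods.

```python
import numpy as np

def stack_rgbd(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Stack an (H, W, 3) RGB image and an (H, W) depth map into (H, W, 4).

    Hypothetical helper: the cited methods may normalise or scale the
    channels differently before stacking.
    """
    if rgb.shape[:2] != depth.shape:
        raise ValueError("RGB and depth images must share height and width")
    # Cast both inputs to float32 so colour and depth can share one array.
    return np.dstack([rgb.astype(np.float32), depth.astype(np.float32)])

# Tiny synthetic example: a black 4x6 image with a constant depth of 1.
rgb = np.zeros((4, 6, 3), dtype=np.uint8)
depth = np.ones((4, 6), dtype=np.uint16)
rgbd = stack_rgbd(rgb, depth)
```

The last channel of `rgbd` then holds the depth values, aligned pixel by pixel with the colour channels.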
We present an improved method that is inspired by, and shares some elements with, DenseFusion. Our method improves upon methods that fuse information from RGB and depth data, and it performs well in the tests that we have made. Figure 1 shows the pipeline of our architecture. Our method can be divided into four stages: instance segmentation, feature extraction, 6D pose estimation, and a pose refinement neural network.
In the first stage, instance segmentation is needed to extract a mask for each object present in the scene. In this stage only the RGB image is used.
Since our method is modular, any stage can be swapped for an alternative that produces the data expected by the next stage. It is therefore possible to use any instance segmentation method. The main objective of this stage is to find the mask of each object in the scene. As far as we know, the best method for instance segmentation is HTC, which combines Mask R-CNN and Cascade R-CNN and achieves excellent performance.
The masks obtained from the instance segmentation algorithm are used to crop the RGB-D data for each object in the scene. To crop the RGB-D data, we apply a bitwise AND between the RGB image and the mask, and also between the depth image and the mask. The masked RGB image is then cropped to the rectangle that encloses the whole object, and this smaller image serves as input to the second stage. The depth image is masked and cropped in the same way, and a point cloud is then generated from it; this point cloud also serves as input to the second stage of our pipeline. Figure 2 represents the flow of the data and the processing. Cropping the data with the mask is a pre-processing step that helps the NN because it discards the background and other non-relevant data around the object, leaving only the data that is most relevant to the 6D pose estimation.
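The masking, cropping and point cloud generation described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the point cloud uses a standard pinhole back-projection, and the intrinsics `fx`, `fy`, `cx`, `cy` and the millimetre `depth_scale` are placeholder values.

```python
import numpy as np

def crop_with_mask(rgb, depth, mask):
    """Zero out pixels outside the mask, then crop to the mask's bounding box."""
    mask = mask.astype(bool)
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask is empty")
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    # np.where keeps masked pixels and zeroes the rest, which matches a
    # bitwise AND with a 0/255 mask on 8-bit images.
    rgb_masked = np.where(mask[..., None], rgb, 0)
    depth_masked = np.where(mask, depth, 0)
    return rgb_masked[r0:r1, c0:c1], depth_masked[r0:r1, c0:c1], (r0, c0)

def depth_to_point_cloud(depth_crop, offset, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project non-zero depth pixels to 3D camera coordinates.

    Assumptions: pinhole camera model, depth stored in millimetres.
    """
    r0, c0 = offset
    v, u = np.nonzero(depth_crop)        # row/column indices inside the crop
    z = depth_crop[v, u] / depth_scale   # depth in metres
    x = (u + c0 - cx) * z / fx           # shift back to full-image columns
    y = (v + r0 - cy) * z / fy           # shift back to full-image rows
    return np.stack([x, y, z], axis=1)

# Synthetic 5x5 scene with an L-shaped object mask (5 pixels).
rgb = np.full((5, 5, 3), 200, dtype=np.uint8)
depth = np.full((5, 5), 1000, dtype=np.uint16)
mask = np.zeros((5, 5), dtype=np.uint8)
mask[1:3, 2:5] = 1
mask[1, 2] = 0
rgb_crop, depth_crop, offset = crop_with_mask(rgb, depth, mask)
cloud = depth_to_point_cloud(depth_crop, offset, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
```

Pixels that fall inside the bounding box but outside the mask are zero in both crops, so they contribute no points to the cloud.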
In the second stage, feature extraction, the data coming from the previous stage is passed to separate neural networks, one for each type of data. For the point cloud, PointNet is used and returns 500 features that represent the point cloud data. For the cropped RGB image, we used a Fully Convolutional Neural Network (FCNN) based on ResNet18 to extract 500 features from the image. Finally, for the mask image, we also used a ResNet18 FCNN to extract 500 features. The features extracted from each data source are then concatenated into a single feature vector that is sent to the third stage of our method. The features extracted from the mask represent the shape of the object, which helps the 6D pose estimation stage and improves its accuracy.
In the third stage, a custom NN receives the previously extracted features and estimates the 6D pose, that is, the rotation matrix and the translation vector of the object. This custom NN has five hidden layers, and we trained it with the loss function (1), which is also used by DenseFusion:

$$L_i^p = \frac{1}{M} \sum_{j} \left\| (R x_j + t) - (\hat{R}_i x_j + \hat{t}_i) \right\| \tag{1}$$

where $x_j$ denotes the $j$-th point of the $M$ randomly selected 3D points from the object's 3D model, $p = [R|t]$ is the ground truth pose, where $R$ is the rotation matrix of the object and $t$ is the translation. The predicted pose generated from the fused embedding of the $i$-th dense-pixel is represented by $\hat{p}_i = [\hat{R}_i|\hat{t}_i]$, where $\hat{R}_i$ denotes the predicted rotation and $\hat{t}_i$ the predicted translation. After training the 6D pose NN, its output ($\hat{p}$) can be retrieved after the third stage or sent to the next stage of the pipeline.
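The loss described above, the mean distance between sampled model points transformed by the ground truth pose and by a predicted pose, can be sketched in NumPy. This is a minimal illustration and not the training code: it assumes one predicted pose per dense pixel stacked along a leading axis, the function name is hypothetical, and the way the per-pixel losses are aggregated during training is not shown.

```python
import numpy as np

def pose_loss_per_pixel(points, R_gt, t_gt, R_pred, t_pred):
    """Mean point distance between ground truth and each predicted pose.

    points: (M, 3) sampled model points
    R_gt:   (3, 3) ground truth rotation,  t_gt:   (3,) ground truth translation
    R_pred: (N, 3, 3) predicted rotations, t_pred: (N, 3) predicted translations
    Returns an (N,) array with one loss value per prediction.
    """
    target = points @ R_gt.T + t_gt                            # (M, 3)
    # pred[n, m] = R_pred[n] @ points[m] + t_pred[n]
    pred = np.einsum("nij,mj->nmi", R_pred, points) + t_pred[:, None, :]
    dist = np.linalg.norm(pred - target[None, :, :], axis=2)   # (N, M)
    return dist.mean(axis=1)

# Two hypothetical predictions: a perfect one and one shifted by 1 cm along x.
points = np.array([[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]])
R_gt, t_gt = np.eye(3), np.zeros(3)
R_pred = np.stack([np.eye(3), np.eye(3)])
t_pred = np.array([[0.0, 0.0, 0.0], [0.01, 0.0, 0.0]])
losses = pose_loss_per_pixel(points, R_gt, t_gt, R_pred, t_pred)
```

In this example the first prediction has zero loss and the second has a loss equal to the 1 cm offset, since a pure translation moves every point by the same distance.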
The last stage, the pose refinement neural network, is the same as in DenseFusion. The authors of DenseFusion created a pose refinement NN that improves upon the pose estimated in the third stage. The output of our third stage simply serves as input to the DenseFusion pose refinement NN.
Table 1 columns: Objects | SSD-6D + ICP | PointFusion | DenseFusion | Average | Standard Deviation | MaskedFusion Individual Experiments
We tested our method on a widely used dataset, LineMOD. We used this dataset because it makes it easier to compare our method with previous methods that were also tested on LineMOD. We mainly compare our method with DenseFusion because we use its idea of fusing data as a baseline for our work, and we also use the DenseFusion refinement neural network and loss function for our 6D object pose estimation neural network.
4.1 LineMOD Dataset
LineMOD is one of the most used datasets for the 6D pose estimation problem. Many types of methods use this dataset, ranging from classical methods like [2, 5, 24] to the most recent deep learning approaches [13, 25, 26]. This dataset was captured with a Kinect, which internally aligns the RGB and depth images automatically. LineMOD has 15 low-textured objects (although we only use 13, as in previous methods) in over 18000 real images, with the ground truth pose annotated.
Each object is associated with a test image showing one annotated object instance with significant clutter but only mild occlusion, as shown in Figure 3. Each object also comes with its 3D model saved as a point cloud and a distance file with the maximum diameter of the object.
As in previous works [9, 16, 25, 26], we used the Average Distance of Model Points (ADD) as the evaluation metric for non-symmetric objects, and for the egg-box and glue we used the Average Closest Point Distance (ADD-S). We needed a different metric for these two objects because they are symmetric.
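In the ADD metric (equation 2), given the ground truth rotation $R$ and translation $t$ and the estimated rotation $\hat{R}$ and translation $\hat{t}$, the average distance is the mean of the pairwise distances between the 3D model points transformed by the ground truth pose and by the estimated pose:

$$\mathrm{ADD} = \frac{1}{m} \sum_{x \in \mathcal{M}} \left\| (R x + t) - (\hat{R} x + \hat{t}) \right\| \tag{2}$$

In equations (2) and (3), $\mathcal{M}$ represents the set of 3D model points and $m$ is the number of points.
For the symmetric objects (egg-box and glue), the matching between points is ambiguous for some poses. In these cases we used the ADD-S metric:

$$\mathrm{ADD\text{-}S} = \frac{1}{m} \sum_{x_1 \in \mathcal{M}} \min_{x_2 \in \mathcal{M}} \left\| (R x_1 + t) - (\hat{R} x_2 + \hat{t}) \right\| \tag{3}$$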
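To make the difference between the two metrics concrete, the following NumPy sketch computes ADD and ADD-S for a small symmetric point set. The function names and the four-point "object" are assumptions made for illustration; this is not the evaluation code used in the experiments.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform to an (m, 3) array of points."""
    return points @ R.T + t

def add_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding transformed model points."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    return np.linalg.norm(gt - est, axis=1).mean()

def add_s_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each ground truth point to its closest estimated point."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    # Pairwise distances, shape (m, m): rows are ground truth points,
    # columns are estimated points.
    pairwise = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return pairwise.min(axis=1).mean()

# Hypothetical object: four points that are symmetric under a 90 degree
# rotation about the z axis.
points = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                   [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
R_gt, t_gt = np.eye(3), np.zeros(3)
R_est = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_est = np.zeros(3)

add_value = add_metric(points, R_gt, t_gt, R_est, t_est)
add_s_value = add_s_metric(points, R_gt, t_gt, R_est, t_est)
```

Here the estimated pose is visually indistinguishable from the ground truth, yet ADD penalises it because every point is compared with its own rotated copy, while ADD-S is zero because each point finds an exact closest match. The pairwise distance matrix needs memory quadratic in the number of points, so large models are usually subsampled or handled with a nearest-neighbour structure.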
In this section we present the results of our tests and compare them with other methods that have been tested on the same dataset. We trained and tested our method five times, and we present the best, the worst, and the mean of these results. All experiments were executed on a desktop with an NVMe SSD, 64 GB of RAM, an NVIDIA GeForce GTX 1080 Ti, and an Intel Core i7-7700K.
In Figure 4, we show the test results for several different epochs. We trained both MaskedFusion and DenseFusion for 100 epochs, and every 10 epochs we tested them and plotted their mean errors. Figure 4 shows that even our worst values still outperform DenseFusion. The MaskedFusion average error was below that of DenseFusion in all epochs tested and, most importantly, our method entered first in the error mark. Our method got below error in epoch 30 and DenseFusion only entered in epoch 50. Figure 4 also shows the learning of our method and the mean values of our 5 experiments. Comparing the mean error of our best to DenseFusion we have an error of and DenseFusion has . We have more accuracy, which may lead to better placement of objects in a virtual scene or better grasping accuracy. We also need fewer epochs to achieve an error below .
Training our method for 100 epochs took 40 hours, compared with just 33 hours for DenseFusion. Since we have one more network to train, additional computation over the data, and more data flowing through our method, it is normal for the overall training to take longer.
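As a quick check of the training overhead implied by the two figures above (40 hours and 33 hours, both taken from the text), the relative increase works out as:

$$\frac{40\,\mathrm{h} - 33\,\mathrm{h}}{33\,\mathrm{h}} = \frac{7}{33} \approx 0.21$$

so training MaskedFusion for 100 epochs takes roughly 21% longer than training DenseFusion on this hardware.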
In Table 1 we present a per-object comparison of our test results with three other methods: SSD-6D, PointFusion and DenseFusion. The values presented in Table 1 result from the ADD metric (equation 2) and the ADD-S metric (equation 3). From Table 1, we conclude that our method has better overall accuracy than previous methods.
Regarding inference time, we have the same inference time per object in an image as DenseFusion, meaning that we have an average inference time per object of seconds. Both methods can achieve 30 frames per second. These times are measured without the instance segmentation stage (MaskedFusion) and the semantic segmentation stage (DenseFusion), as we assume that both methods already have the mask of each object and we start counting the time from that stage onward.
The average results from the 5 repetitions of our method are better than DenseFusion in 11 out of 13 objects. The only two objects with worse results were the symmetric objects (egg-box and glue). In our worst performing experiment, the second column of MaskedFusion Individual Experiments in Table 1, we achieved an average of , that is better than DenseFusion, and one of our lowest values was in the glue, but in the other 4 experiments we achieved a score of on this object. That network achieves better results in 12 out of 13 objects compared with DenseFusion. Finally, the best of our 5 repetitions improves the overall ADD from the 94.3% of DenseFusion to 97.8%.
Achieving robust 6D poses of objects captured from the real world is still an open challenge. MaskedFusion improved the state of the art in this area by achieving lower error than other methods. MaskedFusion improves upon DenseFusion while maintaining its inference time, at the cost of an increased training time. We achieved an error below with just 100 training epochs. MaskedFusion makes use of the masks of the objects previously retrieved during instance segmentation to identify and localize the object in the RGB image. The mask is used to remove non-relevant data from the input of the CNN and serves as an additional feature for the 6D pose estimation CNN. As future work, we intend to evaluate on other datasets, study the influence of the instance segmentation method on the MaskedFusion results (since the higher the accuracy of the former, the better the results expected from the latter), and work towards speeding up the training stage of MaskedFusion. The code is available at https://github.com/kroglice/MaskedFusion.
[1] (2014) Learning 6d object pose estimation using 3d object coordinates. In European conference on computer vision, pp. 536–551.
[2] (2017) Rotational subgroup voting and pose clustering for robust 3d object recognition. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 4137–4145.
[3] (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162.
[4] (2019) Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4974–4983.
[5] (2010) Model globally, match locally: efficient and robust 3d object recognition. In 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 998–1005.
[6] (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
[7] (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.
[8] (2011) Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In 2011 international conference on computer vision, pp. 858–865.
[9] (2017) SSD-6d: making rgb-based 3d detection and 6d pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1521–1529.
[10] (2016) Deep learning of local rgb-d patches for 3d object detection and 6d pose estimation. In European Conference on Computer Vision, pp. 205–220.
[11] (2018) A unified framework for multi-view multi-class object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 254–269.
[12] (2019) 3D object recognition and pose estimation for random bin-picking using partition viewpoint feature histograms. Pattern Recognition Letters 128, pp. 148–154.
[13] (2018) Deepim: deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698.
[14] (2017) 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7074–7082.
[15] (2017) 6-dof object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2011–2018.
[16] (2019) PVNet: pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4561–4570.
[17] (2018) Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927.
[18] (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
[19] (2010-10) Fast 3d recognition and pose using the viewpoint feature histogram. In Proceedings of the 23rd IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Taipei, Taiwan.
[20] (2009-10) Semantic 3d object maps for everyday manipulation in human living environments. Ph.D. Thesis, Computer Science department, Technische Universitaet Muenchen, Germany.
[21] (2018) Discovery of latent 3d keypoints via end-to-end geometric reasoning. In Advances in Neural Information Processing Systems, pp. 2059–2070.
[22] (2018) Real-time seamless single shot 6d object pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 292–301.
[23] (2018) Deep object pose estimation for semantic robotic grasping of household objects. In Conference on Robot Learning, pp. 306–316.
[24] (2018) 6D pose estimation using an improved method based on point pair features. In 2018 4th International Conference on Control, Automation and Robotics (ICCAR), pp. 405–409.
[25] (2019) Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3343–3352.
[26] (2017) Posecnn: a convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199.
[27] (2018) Pointfusion: deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 244–253.
[28] (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499.