Real-Time, Highly Accurate Robotic Grasp Detection using Fully Convolutional Neural Network with Rotation Ensemble Module

Dongwon Park, Yonghyeok Seo, Se Young Chun*

Dongwon Park, Yonghyeok Seo and Se Young Chun are with the Department of Electrical Engineering (EE), UNIST, Ulsan, 44919, Republic of Korea. *Corresponding author.

Rotation invariance has been an important topic in computer vision tasks. Ideally, robot grasp detection should be rotation-invariant. However, rotation-invariance in robotic grasp detection has only recently been studied, using rotation anchor boxes that are often time-consuming and unreliable for multiple objects. In this paper, we propose a rotation ensemble module (REM) for robotic grasp detection using convolutions that rotate network weights. Our proposed REM was able to outperform current state-of-the-art methods by achieving up to 99.2% (image-wise) and 98.6% (object-wise) accuracy on the Cornell dataset with real-time computation (50 frames per second). Our proposed method also yielded reliable grasps for multiple objects and up to a 93.8% success rate in a real-time robotic grasping task with a 4-axis robot arm on small novel objects, 11-56% higher than the baseline methods.

I Introduction

Robot grasping of novel objects has been investigated extensively, but it is still a challenging open problem in robotics. Humans instantly identify multiple grasps of novel objects (perception), plan how to pick them up (planning) and actually grasp them reliably (control). However, accurate robotic grasp detection, trajectory planning and reliable execution are quite challenging for robots. As the first step, detecting robotic grasps accurately and quickly from imaging sensors is an important task for successful robotic grasping.

Deep learning has been widely utilized for robotic grasp detection from an RGB-D camera and has achieved significant improvements over conventional methods. Lenz et al. first proposed deep learning classifier based robotic grasp detection methods that achieved up to 73.9% (image-wise) and 75.6% (object-wise) grasp detection accuracy on their in-house Cornell dataset [Lenz:2013uz, Lenz:2015ih]. However, computation was still slow (13.5 sec per image) due to sliding windows. Redmon and Angelova proposed deep learning regressor based grasp detection methods that yielded up to 88.0% (image-wise) and 87.1% (object-wise) accuracy with remarkably fast computation (76 ms per image) on the Cornell dataset [Redmon:2015eq]. Since then, many works have proposed deep neural network (DNN) based methods to improve performance in terms of detection accuracy and computation time. Fig. 1 summarizes computation time (frames per second) vs. grasp detection accuracy on the Cornell dataset with object-wise split for some previous works (Redmon [Redmon:2015eq], Kumra [Kumra:2017ko], Asif [Asif:2017bv], Chu [Chu:ek], Zhou [zhou2018fully], Zhang [zhang2018rprg]) and our proposed method. Note that recent works (except for our proposed method) using state-of-the-art DNNs such as [Asif:2017bv, Chu:ek, zhou2018fully, zhang2018rprg] seem to show a trade-off between computation time and grasp detection accuracy. For example, Zhou et al. [zhou2018fully] reported variants based on ResNet-101 and ResNet-50 [He:2016ib], respectively, which trade off the number of network parameters against computation time. Note that prediction accuracy is generally related to real successful grasping, while computation time matters for real-time applications with fast moving objects or stand-alone applications with limited power.

Fig. 1: Performance summary of computation time (frame per second) vs. grasp detection accuracy on the Cornell dataset with object-wise data split.

Rotation invariance has been an important topic in computer vision tasks such as face detection [rowley1997rotation], texture classification [greenspan1994overcomplete] and character recognition [kim1994practical], to name a few. The importance of rotation-invariant properties remains for recent DNN based approaches. In general, DNNs often require many more parameters and data augmentation with rotations to yield rotation-invariant outputs. Max pooling helps alleviate this issue, but since the pooling region is usually small [Jaderberg:2015voa], it only accounts for images rotated by very small angles. Recently, there have been some works on rotation-invariant neural networks such as rotating weights [cohen2016group, follmann2018rotationally], enlarged receptive fields using dilated convolutional neural networks (CNN) [YuKoltun2016] or a pyramid pooling layer [he2014spatial], rotation region proposals for recognizing arbitrarily placed texts [ma2018arbitrary] and a polar transform network to extract rotation-invariant features [esteves2017polar].

Ideally, robot grasp detection should be rotation-invariant. Rotation angle prediction in robot grasp detection has been done by regression of a continuous angle value [Redmon:2015eq], classification of discretized angles (e.g., [guo2017hybrid, Chu:ek]) or a rotation anchor box, a hybrid of regression and classification [zhang2018roi, zhou2018fully, zhang2018rprg]. Previous works either did not consider rotation-invariance or attempted rotation-invariant detection by rotating images or feature maps, which is often time-consuming, especially for multiple objects.

In this paper, we propose a rotation ensemble module (REM) for robotic grasp detection using convolutions that rotate network weights. This special structure allows the DNN to select rotation convolutions for each grid cell. Our proposed REM was evaluated on two different tasks: robotic grasp detection on the Cornell dataset [Lenz:2013uz, Lenz:2015ih] and real robotic grasping tasks with novel objects that were not used during training. Our proposed REM was able to outperform state-of-the-art methods such as [zhou2018fully] by achieving up to 99.2% (image-wise) and 98.6% (object-wise) accuracy on the Cornell dataset as shown in Fig. 1, with faster computation than [zhou2018fully]. Our proposed method was also able to yield up to a 93.8% success rate in a real-time robotic grasping task with a 4-axis robot arm on novel objects and to yield reliable grasps for multiple objects, unlike the rotation anchor box.

II Related works

II-A Spatial and rotational invariance

Max pooling layers often alleviate the issue of spatial variance in CNNs. To better achieve spatial-invariant image classification, Jaderberg et al. proposed the spatial transformer network (STN), a method of image (or feature) transformation that learns (affine) transformation parameters so as to improve the inference of the following neural network layers [Jaderberg:2015voa]. Lin et al. proposed to use STN repeatedly with an inverse compositional method by propagating warp parameters rather than images (or features) for improved performance [lin2017inverse]. Esteves et al. proposed a rotation-invariant network by replacing the grid generation of STN with a polar transform [esteves2017polar]; the input feature map (or image) was transformed into polar coordinates with the origin determined by the center of mass. Cohen and Welling proposed group equivariant convolutions and pooling with weight flips and four rotations with a stepsize of 90° [cohen2016group]. Follmann et al. proposed rotation-invariant features created using rotational convolutions and pooling layers [follmann2018rotationally]. Marcos et al. proposed a network with a different set of weights for each local window instead of weight rotation [marcos2017rotation].

II-B Object detection

Faster R-CNN used a region proposal network to generate region proposals and reduce computation time [ren2015faster]. YOLO was faster but less accurate than Faster R-CNN by directly predicting {x, y, w, h, class} without a region proposal network [Redmon:2016gh]. YOLO9000 stabilized the loss of YOLO by using anchor boxes inspired by the region proposal network and yielded much faster object detection than Faster R-CNN with comparable accuracy [Redmon:2017gn]. For rotation-invariant object detection, Shi et al. investigated face detection using a progressive calibration network that predicted rotation by 180°, 90° or an angle in [-45°, 45°] after sliding windows [shi2018real]. Ma et al. used a rotation region proposal network to transform regions for classification using rotation region-of-interest (ROI) pooling [ma2018arbitrary]. Note that the rotation angle was predicted using 1) a rotation anchor box, 2) regression or 3) classification.

II-C Robotic grasp detection

Deep learning based robot grasp detection methods belong to one of two types: two stage detectors (TSD) or one stage detectors (OSD). A TSD consists of a region proposal network and a detector [guo2017hybrid, Chu:ek, zhang2018roi, zhou2018fully, zhang2018rprg]. After extracting feature maps using proposals from the network in the first stage, objects are detected in the second stage. The region proposal network of a TSD generally helps to improve accuracy, but is often time-consuming due to feature map extractions. An OSD detects an object on each grid cell instead of generating region proposals, reducing computation time at the cost of prediction accuracy [Redmon:2015eq].

Lenz et al. proposed a TSD model that classifies object graspability using a sparse auto-encoder (SAE) with sliding windows for brute-force region proposals [Lenz:2015ih]. Redmon et al. developed a regression based OSD [Redmon:2015eq] using AlexNet [Krizhevsky:2012wl]. Guo et al. applied a ZFNet [Zeiler:2014fr] based TSD to robot grasping and formulated angle prediction as classification [guo2017hybrid]. Chu et al. further extended the TSD model of Guo by incorporating the recent ResNet [He:2016ib] [Chu:ek]. Zhou et al. also used ResNet for a TSD, but proposed the rotation anchor box [zhou2018fully]. Zhang et al. extended the TSD method of Zhou [zhou2018fully] by additionally predicting objects using ROIs [zhang2018roi]. DexNet 2.0 is also a TSD that predicts grasp candidates from a depth image and then selects the best one using its classifier, GQ-CNN [mahler2017dex].

III Method

III-A Problem setup and reparametrization

Fig. 2: (a) A 5D detection representation with location (x, y), rotation θ, gripper opening width w and plate size h. (b) For a grid cell with top left corner (c_x, c_y), all parameters of the 5D representation are illustrated, including a pre-defined anchor box (black dotted box) and a 5D detection representation (red box).

The goal of the problem is to predict 5D representations for multiple objects from a color image, where a 5D representation consists of location (x, y), rotation θ, width w, and height h, as illustrated in Fig. 2. Multi-grasp detection often directly estimates the 5D representation as well as its probability (confidence) p of being a class (or being graspable) for each grid cell. In summary, the quantities to estimate are (x, y, θ, w, h) with probability p.

For TSD, region proposal networks generate potential candidates for (x, y, w, h) [Chu:ek, guo2017hybrid, zhou2018fully, zhang2018roi] and a rotation region proposal network yields arbitrarily-oriented proposals (x, y, θ, w, h) [ma2018arbitrary]. Then, classification is performed on the proposals to yield their graspable probabilities p. The rotation region proposal network classifies rotation anchor boxes with a fixed angle stepsize and then regresses the angles.

For OSD, a set of (x, y, θ, w, h, p) is directly estimated [Redmon:2015eq]. Inspired by YOLO9000 [Redmon:2017gn], we propose the following reparametrization of the 5D grasp representation and its probability for robotic grasp detection:

x = σ(t_x) + c_x,  y = σ(t_y) + c_y,  w = p_w exp(t_w),  h = p_h exp(t_h),  p = σ(t_p),

where σ is a sigmoid function, p_h and p_w are the predefined height and width of the anchor box, respectively, and (c_x, c_y) is the top left corner of each grid cell. Therefore, a DNN directly estimates (t_x, t_y, t_θ, t_w, t_h, t_p) instead of (x, y, θ, w, h, p).
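Decoding the raw network outputs into a grasp box under this reparametrization is a few lines of arithmetic. The sketch below assumes the YOLO9000-style form given above; the function and variable names are ours, not from the paper's code:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_grasp(t, cell_xy, anchor_wh):
    """Decode raw outputs t = (tx, ty, t_theta, tw, th, tp) into a
    5D grasp (x, y, theta, w, h) and probability p, YOLO9000-style."""
    tx, ty, t_theta, tw, th, tp = t
    cx, cy = cell_xy          # top left corner of the grid cell
    pw, ph = anchor_wh        # predefined anchor width and height
    x = sigmoid(tx) + cx      # sigmoid keeps the offset inside the cell
    y = sigmoid(ty) + cy
    w = pw * np.exp(tw)       # scale relative to the anchor box
    h = ph * np.exp(th)
    theta = t_theta           # plain regression variant for the angle
    p = sigmoid(tp)           # graspability confined to (0, 1)
    return (x, y, theta, w, h), p
```

With all-zero raw outputs, the decoded box sits at the cell center offset 0.5 with exactly the anchor's width and height, which is the usual sanity check for this parametrization.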

III-B Parameter descriptions of the proposed OSD method

For the grid cells, locations (c_x, c_y) are defined as the top left corners of the grid cells. Thus, our proposed method estimates the offset (σ(t_x), σ(t_y)) from the top left corner of each grid cell. For a given cell, the range of (x, y) is confined to that cell due to the reparametrization using sigmoid functions.

We also adopt the anchor box approach [Redmon:2017gn] for robotic grasp detection. The reparametrization changes pure regression of (w, h) into regression & classification: classification picks the best representation among all anchor box candidates generated using the estimated (w, h) and a set of predefined aspect ratios.

We investigated three prediction methods for rotation θ. Firstly, a regressor directly predicts a continuous angle θ. Secondly, a classifier predicts a discretized angle class. Lastly, the anchor box approach with regressor & classifier predicts both an angle class and a residual offset to yield θ.

Predicting detection (grasp) probability is crucial for multibox approaches such as MultiGrasp [Redmon:2015eq]. The conventional ground truth for detection probability was 1 (graspable) or 0 (not graspable) [Redmon:2015eq]. Inspired by [Redmon:2017gn], we propose to use the IOU (Intersection Over Union) as the ground truth detection probability:

p = |R ∩ R*| / |R ∪ R*|,

where R is the predicted detection rectangle, R* is the ground truth detection rectangle, and |·| denotes the area of a rectangle.
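As a concrete illustration, the IOU of two axis-aligned rectangles can be computed as below. This is a simplification we introduce for clarity: the actual grasp rectangles are rotated, so an exact rotated IOU would require polygon clipping.

```python
def iou(rect_a, rect_b):
    """Intersection over union of two axis-aligned rectangles given as
    (x1, y1, x2, y2). Sketch only: real grasp rectangles are rotated."""
    ax1, ay1, ax2, ay2 = rect_a
    bx1, by1, bx2, by2 = rect_b
    # overlap lengths along each axis (clamped at zero)
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```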

Fig. 3: An illustration of incorporating our proposed REM in a DNN for robot grasp detection (a) and the architecture of our proposed REM with rotation convolutions (b).

III-C Rotation ensemble module (REM)

We propose a rotation ensemble module (REM) with rotation convolution and rotation activation to determine an ensemble weight associated with the angle class probability for each grid cell. We added our REM to the latter part of a robot grasp detection network since it is often effective to place geometric transform related layers in the latter part of the network, as with deformable convolutions [dai2017deformable]. A typical location for the REM in DNNs is illustrated in Fig. 3 (a).

Consider a typical scenario of convolution with input feature maps F, where N is the number of pixels and C is the number of channels. Let us denote by W a convolution kernel, where K is the spatial dimension of the kernel and there are D kernels per channel. Similar to group convolutions [cohen2016group], we propose R rotations of the weights to obtain rotated weights for each channel. Bilinear interpolations of four adjacent pixel values were used for generating rotated kernels. A rotation matrix is

M(θ_r) = [ cos θ_r  -sin θ_r ; sin θ_r  cos θ_r ],

where r is an index for rotations. Then, the rotated weights (or kernels) are W_r(u) = W(M(θ_r) u) for spatial location u. Finally, the output of these convolutional layers with rotation operators for the input F is

A_r = W_r ∗ F,

where ∗ is a convolution operator. This pipeline of operations is called "rotation convolution". A typical kernel size is K = 5.
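A minimal sketch of generating one rotated kernel via bilinear interpolation of four adjacent values (inverse mapping about the kernel center). This is our own illustrative code, not the paper's implementation:

```python
import numpy as np

def rotate_kernel(W, theta):
    """Rotate a K x K kernel by theta (radians) about its center, using
    bilinear interpolation of the four adjacent source values."""
    K = W.shape[0]
    c = (K - 1) / 2.0                       # kernel center
    out = np.zeros_like(W, dtype=float)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    for i in range(K):
        for j in range(K):
            # inverse rotation: find the source location for output (i, j)
            y = cos_t * (i - c) + sin_t * (j - c) + c
            x = -sin_t * (i - c) + cos_t * (j - c) + c
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            dy, dx = y - y0, x - x0
            # blend the four neighbors; out-of-bounds samples are dropped
            for yy, xx, wgt in [(y0, x0, (1 - dy) * (1 - dx)),
                                (y0, x0 + 1, (1 - dy) * dx),
                                (y0 + 1, x0, dy * (1 - dx)),
                                (y0 + 1, x0 + 1, dy * dx)]:
                if 0 <= yy < K and 0 <= xx < K:
                    out[i, j] += wgt * W[yy, xx]
    return out
```

Rotation by 0 reproduces the kernel exactly, and a 90° rotation moves a mass at the top-center of a 3 x 3 kernel to the left-center, which is an easy way to check the sign conventions.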

Our REM contains a rotation activation that aggregates the feature maps at all angles. Assume that an intermediate estimate of the angle class probabilities p_r is available in the REM. For each angle, an activation A_r is generated and all of them must be aggregated to yield one final feature map

A = Σ_r p_r ⊙ A_r,

where ⊙ is the Hadamard product. Thus, our proposed method utilizes the class probability (probability to grasp) to selectively aggregate activations along with the weight of the angle classification.
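One plausible reading of this aggregation, assuming scalar per-angle weights obtained by a softmax over angle logits (our own sketch, not the paper's code):

```python
import numpy as np

def rotation_activation(acts, angle_logits):
    """Aggregate R rotated activation maps acts (shape (R, H, W)) into one
    map using softmax weights p_r from per-angle logits: A = sum_r p_r * A_r."""
    e = np.exp(angle_logits - np.max(angle_logits))  # numerically stable softmax
    p = e / e.sum()                                  # angle class probabilities
    return np.tensordot(p, acts, axes=1)             # weighted sum over r
```

With equal logits this reduces to a plain average of the rotated activations; a confident angle classifier effectively selects a single rotation branch.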

Although the intermediate output is only partially used for rotation activation in the REM, it still contains valuable, compressed information about the final output: it can serve as a good initial bounding box. Thus, we designed our REM to decompress it and concatenate it at the end of the REM, as illustrated in Fig. 3 (b). This pipeline delivers that information indirectly to the final layer, and this structure seemed to decrease probability errors.

III-D Loss function for REM-equipped DNN

We re-designed the loss function for training robotic grasp detection DNNs to account for this additional REM. The output of the DNN and the intermediate output of the REM should each be converted into a prediction (x', y', θ', w', h', p'). Then, using the ground truth (x, y, θ, w, h, p), the loss function is defined as

L = λ_1 Σ_i 1_i ( ||(x_i, y_i, w_i, h_i) - (x'_i, y'_i, w'_i, h'_i)||_2^2 + AngLoss(θ_i, θ'_i) ) + λ_2 Σ_i CE(p_i, p'_i),

where 1_i is a mask with value 1 (ground truth exists for that grid cell) or 0 (no ground truth for that grid cell), ||·||_2 is the ℓ2 norm, CE is cross entropy, and AngLoss is one of these functions: CE for classification of θ, the ℓ2 norm for regression, or both for the rotation anchor box on θ. We chose fixed values for the weights λ_1 and λ_2.
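For a single grid cell with an angle classifier, the loss reduces to the sketch below. The exact weighting and normalization are not fully specified here, so λ1, λ2 and the precise form should be treated as assumptions:

```python
import numpy as np

def grasp_loss(pred_box, gt_box, pred_ang_prob, gt_ang_class,
               pred_p, gt_p, mask, lam1=1.0, lam2=1.0):
    """Single-cell loss sketch: mask-gated L2 on (x, y, w, h) plus cross
    entropy on the angle class, and cross entropy on grasp probability."""
    eps = 1e-7
    # L2 term on the box coordinates
    l2 = np.sum((np.asarray(pred_box, float) - np.asarray(gt_box, float)) ** 2)
    # AngLoss as cross entropy on the ground-truth angle class
    ang_ce = -np.log(np.clip(pred_ang_prob[gt_ang_class], eps, 1.0))
    # cross entropy on the (soft, IOU-valued) grasp probability
    p = np.clip(pred_p, eps, 1.0 - eps)
    prob_ce = -(gt_p * np.log(p) + (1.0 - gt_p) * np.log(1.0 - p))
    return mask * lam1 * (l2 + ang_ce) + lam2 * prob_ce
```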

IV Simulations and Experiments

We evaluated our proposed REM methods on the Cornell robotic grasp dataset [Lenz:2013uz, Lenz:2015ih] and on real robot grasping tasks with novel objects. The effectiveness of our REM was demonstrated in prediction accuracy, computation time and grasping success rate. Our proposed methods were compared with previous methods such as [Lenz:2015ih, Redmon:2015eq, guo2017hybrid, Chu:ek, zhou2018fully, zhang2018roi] based on results reported in the literature for the widely used Cornell dataset, as well as with our in-house implementations of some previous works.

IV-A Implementation details

It is challenging to fairly compare a robot grasp detection method with previous works such as [Lenz:2015ih, Redmon:2015eq, guo2017hybrid, Chu:ek, zhou2018fully, zhang2018roi]. Thanks to the common Cornell dataset, most works could compare their results with the numbers previous methods reported in the literature. However, considering the fast advances in computing power and DNN techniques, it is often not clear how much a proposed scheme actually contributed to the performance increase.

In this paper, we not only compared our REM methods with previous works on the Cornell dataset through the literature, but also implemented the core angle prediction schemes of previous works with modern DNNs: regression (Reg) as proposed by Redmon et al. [Redmon:2015eq], classification (Cls) as proposed by Guo et al. [guo2017hybrid] and rotation anchor box (Rot) as proposed by Zhou et al. [zhou2018fully]. While Redmon [Redmon:2015eq], Guo [guo2017hybrid] and Zhou [zhou2018fully] used AlexNet [Krizhevsky:2012wl], ZFNet [Zeiler:2014fr] and ResNet [He:2016ib], respectively, our in-house implementations Reg, Cls and Rot all used DarkNet-19 [darknet13]. While Guo and Zhou were based on Faster R-CNN (TSD) [ren2015faster], our implementations were based on YOLO9000 (OSD) [Redmon:2017gn].

We performed ablation studies for our REM to clarify which component affects the performance of rotated grasp detection most significantly. We placed our proposed REM at the 6th layer from the end of the detection network. We also performed simulations with rotation activation using angle and probability. For multiple robotic grasp detection, boxes were plotted when probabilities were 0.25 or higher.

All algorithms were tested on a platform with a GPU (NVIDIA 1080Ti), a CPU (Intel i7-7700K 4.20GHz) and 32GB memory. Our REM methods and the other in-house DNNs Reg, Cls and Rot were implemented in PyTorch.

IV-B Benchmark dataset and novel objects

Fig. 4: (a) Images from the Cornell dataset and (b) novel objects for real robot grasping task.

The Cornell robot grasp detection dataset [Lenz:2013uz, Lenz:2015ih] consists of images (RGB color and depth) of 240 different objects, as shown in Fig. 4(a), with ground truth labels of a few graspable rectangles and a few non-graspable rectangles per image. We used RG-D information without the B channel, just like the work of Redmon [Redmon:2015eq]. Each image was cropped, and five-fold cross validation was performed. Then, mean prediction accuracy was reported for image-wise and object-wise splits. The image-wise split divides the Cornell dataset into training and testing data with a 4:1 ratio randomly, without considering whether the same object appears in both. The object-wise split divides training and testing data with a 4:1 ratio such that the two sets do not contain the same object. We followed previous works for accuracy metrics [Lenz:2015ih, Redmon:2015eq, Kumra:2017ko]. A successful grasp detection is defined as follows: if the IOU (Jaccard index) is larger than a certain threshold (0.25, 0.3 or 0.35) and the difference between the output orientation and the ground truth orientation is less than 30°, then it is counted as a successful grasp detection.
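The success criterion can be written directly; the 180° wrap handles the symmetry of grasp rectangles. This is our own helper, with the IOU value assumed to be precomputed:

```python
def is_successful_grasp(pred_angle_deg, gt_angle_deg, iou_value,
                        iou_threshold=0.25):
    """Cornell-style success check: IOU above the threshold and angle
    difference under 30 degrees (angles are equivalent modulo 180)."""
    diff = abs(pred_angle_deg - gt_angle_deg) % 180.0
    diff = min(diff, 180.0 - diff)      # grasp rectangles are symmetric
    return iou_value > iou_threshold and diff < 30.0
```

For example, predicted 10° against ground truth 170° counts as a 20° difference, not 160°, because flipping the gripper by 180° gives the same grasp.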

Fig. 5: (Left) Robot experiment setup with a top-mounted RGB-D camera and a small 4-axis robot arm. (Right) Dimensional information on our robot gripper and an object.

We also performed real grasping tasks with our REM methods on 8 novel objects as shown in Fig. 4(b) (toothbrush, candy, earphone cap, cable, styrofoam bowl, L-wrench, nipper, pencil). Our proposed methods were applied to a small 4-axis robot arm (Dobot Magician, China) and an RGB-D camera (Intel RealSense D435, USA) viewing the robot and its workspace from the top. If the robot could pick and place an object, the trial was counted as a success. Our robot experiment setup is illustrated in Fig. 5.

IV-C Results for in-house implementations of previous works

Anchor Box Angle Prediction Image-wise Object-wise
25% 35% 25% 35%
N Reg 91.0 86.5 88.7 85.6
1 Reg 91.8 87.7 89.2 86.3
N Cls 97.2 93.1 96.1 93.1
1 Cls 97.3 94.1 96.6 92.9
1 Rot 98.3 94.4 96.6 93.6
TABLE I: Ablation studies on the Cornell dataset for anchor boxes with various aspect ratios (N) or one ratio (1), and for the angle prediction methods Reg, Cls and Rot.
Fig. 6: Grasp detection accuracy over epoch on the Cornell dataset using various methods for angle predictions: Rot: rotation anchor box, Cls: classification, Reg: regression, REM: ours.

Table I shows the results of ablation studies for our in-house implementations on the Cornell dataset, comparing anchor boxes with various aspect ratios (N) vs. one 1:1 ratio (1) and angle prediction methods: regression (Reg) vs. classification (Cls) vs. rotation anchor box (Rot). The results show that using a single 1:1 ratio (1) yields better accuracy than using a variety of anchor boxes (N). Among the angle prediction methods, the rotation anchor box yielded the best performance while regression yielded the lowest, consistent with the literature. Our in-house implementations seem to yield better accuracy than the original previous works, possibly due to the modern DNNs in our implementations: Reg - Redmon et al. [Redmon:2015eq], Cls - Guo et al. [guo2017hybrid] and Rot - Zhou et al. [zhou2018fully].

Fig. 6 shows the results of different angle prediction methods at IOU 25% over training epochs. We observed that Rot increased in accuracy more slowly than Cls in early epochs, and that Reg improved slowly overall. These slow initial convergences of Reg and Rot may be undesirable for re-training on additional data.

IV-D Results for our proposed REM on the Cornell dataset

Table II shows the results of the ablation studies for our proposed REM with its different components: rotation convolution (RC) and rotation activation (RA). RA can be trained using the rotation activation loss (RL) as shown in Fig. 3. We observed that RC by itself did not improve the performance, while RC & RA together significantly improved the accuracy. Comparable performance was observed when using RC & RA with Rot, but substantially lower performance with Reg.

Angle RC RA RL Image-wise Object-wise
25% 35% 25% 35%
Cls - - - 97.3 94.1 96.6 92.9
Cls O - - 97.6 94.1 97.3 92.7
Cls O O - 99.2 95.3 98.6 95.5
Cls O O O 98.6 94.9 97.3 94.1
Reg O O - 89.3 84.0 88.3 84.5
Rot O O - 98.5 95.6 98.0 94.0
TABLE II: The ablation studies on the Cornell dataset for our REM with RC, RA and RL.
Method Angle Type Img Obj Speed
25% 25% (FPS)
Lenz [Lenz:2015ih], SAE Cls TSD 73.9 75.6 0.08
Redmon [Redmon:2015eq], AlexNet Reg OSD 88.0 87.1 13.2
Kumra [Kumra:2017ko], ResNet-50 Reg TSD 89.2 88.9 16
Asif [Asif:2018ud] Reg OSD 90.2 90.6 41
Guo [guo2017hybrid]#a, ZFNet Cls TSD 93.2 82.8 -
Guo [guo2017hybrid]#c, ZFNet Cls TSD 86.4 89.1 -
Chu [Chu:ek], ResNet-50 Cls TSD 96.0 96.1 8.3
Zhou [zhou2018fully]#b, ResNet-50 Rot TSD 97.7 94.9 9.9
Zhou [zhou2018fully]#a, ResNet-101 Rot TSD 97.7 96.6 8.5
Zhang [zhang2018roi], ResNet-101 Rot TSD 93.6 93.5 25.2
Our REM, DarkNet-19 Cls OSD 99.2 98.6 50
TABLE III: Performance summary on Cornell dataset. Our proposed method yielded state-of-the-art prediction accuracy in both image-wise (Img) and object-wise (Obj) splits with real-time computation. The unit for performance is %.
Fig. 7: Grasp detection results on the Cornell dataset for (a) Reg, a modern version of Redmon [Redmon:2015eq], (b) Cls, a modern version of Guo [guo2017hybrid], (c) Rot, a modern version of Zhou [zhou2018fully] and (d) our proposed Cls+REM. (e) Ground truth labels in the Cornell dataset. Black boxes are grasp candidates and green-red boxes are the best grasp among them.
Fig. 8: Grasp detection results (cropped) on multiple novel objects including a nipper using (a) Reg, (b) Cls, (c) Rot and (d) ours (Cls + REM). Black boxes are grasp candidates and green-red boxes are the best grasp among them.
Fig. 9: Multiple robotic grasp detection results on several novel objects for (a) Reg, (b) Cls, (c) Rot and (d) our proposed Cls+REM. Black boxes are grasp candidates and green-red boxes are the best grasp among them.

Table III summarizes all evaluation results on the Cornell robotic grasp dataset for previous works and our proposed method. Our proposed method yielded state-of-the-art performance, up to 99.2% prediction accuracy for image-wise split and up to 98.6% for object-wise split, over the reported accuracies of the previous works listed in the table, with real-time computation at 50 frames per second (FPS). Note that AlexNet, DarkNet-19, ResNet-50 and ResNet-101 require 61.1, 20.8, 25.6 and 44.5 MB of parameters, respectively. Thus, our REM method achieved state-of-the-art results with a relatively small DNN (20.8MB) compared to other recent works using large DNNs such as ResNet-101 (44.5MB).

Fig. 7 illustrates grasp detection results on the Cornell dataset. Our proposed Cls+REM yielded grasp candidates that were close to the ground truth compared to other previous methods such as Reg and Cls.

IV-E Results for real grasping tasks on novel objects

Object Reg Cls Ours
toothbrush 5 / 8 8 / 8 8 / 8
candy 0 / 8 6 / 8 8 / 8
earphone cap 5 / 8 7 / 8 8 / 8
cable 3 / 8 6 / 8 7 / 8
styrofoam bowl 3 / 8 7 / 8 7 / 8
L-wrench 5 / 8 6 / 8 8 / 8
nipper 0 / 8 5 / 8 6 / 8
pencil 3 / 8 8 / 8 8 / 8
Average 3 / 8 6.6 / 8 7.5 / 8
TABLE IV: Performance summary of real robotic grasping tasks for 8 novel, small objects with 8 repetitions.

We applied all grasp detection methods trained on the Cornell dataset to real grasping tasks with novel (multiple) objects without re-training. Fig. 8 illustrates our robot grasp experiment with novel objects including a nipper. Multi-object, multi-grasp detection results on novel objects are shown in Fig. 9 for Reg, Cls, Rot and our Cls+REM, respectively. Both Cls and our Cls+REM generated better grasp candidates than Reg and Rot. Our REM seems to detect more reliable grasps and angles (e.g., pencil, L-wrench) than Rot. Real grasping results with our 4-axis robot arm are tabulated in Table IV. Possibly due to reliable angle detections, our proposed Cls+REM yielded a 93.8% grasping success rate, 11% higher than Cls. We did not perform real grasping with Rot, a modern version of Zhou [zhou2018fully], due to its unreliable angle predictions for multiple objects. However, the advantage of our Cls+REM over Rot seems clear in detection accuracy, computation speed and reliable angle predictions for multiple objects.

V Conclusion

We proposed the REM for robotic grasp detection, which outperformed state-of-the-art methods by achieving up to 99.2% (image-wise) and 98.6% (object-wise) accuracy on the Cornell dataset with fast computation (50 FPS) and reliable grasps for multiple objects. Our proposed method also yielded up to a 93.8% success rate in a real-time robotic grasping task with a 4-axis robot arm on small novel objects, 11-56% higher than the baseline methods.


Acknowledgment

This work was supported partly by the Technology Innovation Program or Industrial Strategic Technology Development Program (10077533, Development of robotic manipulation algorithm for grasping/assembling with the machine learning using visual and tactile sensing information) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) and partly by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI18C0316).

