# Learning Based Industrial Bin-picking Trained with Approximate Physics Simulator

###### Abstract

In this research, we tackle the problem of picking an object from a randomly stacked pile. Since the complex physical phenomena of contact among objects and fingers make it difficult to perform bin-picking with a high success rate, we introduce a learning based approach. To collect a sufficient amount of training data within a reasonable period of time, we introduce a physics simulator in which an approximation is used for collision checking. In this paper, we first formulate learning based robotic bin-picking by using a CNN (Convolutional Neural Network). We then obtain the optimum grasping posture of a parallel jaw gripper by using the CNN. Finally, we show that the effect of the approximation introduced in collision checking is relaxed if we use the exact 3D model to generate the depth image of the pile used as an input to the CNN.

## I Introduction

Randomized bin-picking refers to the problem of automatically picking an object from a randomly stacked pile. If randomized bin-picking is introduced into a production process, we need neither parts-feeding machines nor human workers to arrange the objects before a robot picks them. Although much research has been done on randomized bin-picking [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], it is still difficult due to the complex physical phenomena of contact among objects and fingers. To cope with this problem, learning based approaches have been studied [13, 14]. With a learning based approach, it is expected that the complex physical phenomena can be learned automatically and that robotic bin-picking with a high success rate can be realized.

In this paper, we study a learning based approach for robotic bin-picking. We introduce a CNN (Convolutional Neural Network) to predict whether or not a robot can successfully pick an object from the pile, given a depth image of the pile and a grasping pose of a parallel jaw gripper. Since our CNN outputs the success rate of picking, we search for the grasping pose that maximizes the success rate. However, learning based bin-picking with a CNN usually requires an extremely large amount of training data. To cope with this problem, this research aims to collect enough training data efficiently by introducing a physics simulator. Physics simulation of randomly stacked objects with complex shapes usually takes longer than simulation of simply shaped objects. To shorten the calculation time of the physics simulation used to collect the training data, we approximate the shape of the objects. This approximation is applied only when checking collision among objects and fingers. Although we introduce an approximation in the physics simulation, we do not want to reduce the accuracy of the prediction made by the CNN. One of the goals of our research is to answer the question of how we can relax the effect of object shape approximation on the accuracy of prediction.

The feature of our physics simulator is that, while the shape approximation is introduced for checking collision, the simulated depth image of the pile used as an input to the CNN is obtained by using objects with the original shape. To check the effect of the approximation on the accuracy of prediction, we focus on cases in the training data where a robot successfully picked an object with the approximated shape although it may fail in picking the same object with the original shape. Our finding is that, even if we approximate the object shape in collision checking, the effect of the approximation can be relaxed if we use the original object shape to construct the simulated depth image of the pile used as an input to the CNN.

The rest of this paper is organized as follows. After introducing previous works in Section II, we give an overview of our physics simulator in Section III. In Section IV, we explain our learning based bin-picking method. In Sections V and VI, we show results obtained by using our learning based method.

## II Related Works

So far, research on industrial bin-picking has mainly addressed image segmentation [1, 2, 3, 4], pose identification [5, 6, 7, 8], and picking methods [9, 10, 11, 12].

As for research on bin-picking methods, Ghita and Whelan [5] proposed picking the topmost object of the pile. Domae et al. [9] proposed a method for determining the grasping pose of an object directly from the depth image of the pile. Some researchers [7, 10, 11, 12] proposed methods for identifying the poses of multiple objects in the pile and picking one of them by using a grasp planning method. However, in conventional bin-picking methods, several parameters used in both visual recognition and grasp planning have to be carefully set up for each object to be picked. Since a robot usually has to pick many different objects to assemble a product, it is not easy to set up the parameters for all of them.

On the other hand, learning based approaches to randomized bin-picking are expected to break this barrier of conventional randomized bin-picking [13, 14, 17, 18]. Levine et al. [13] proposed an end-to-end approach using a deep neural network whose input is a 2D RGB image of the pile. However, their approach needs an extremely large amount of training data, which was collected through 800,000 picking trials over two months. Recently, there have been attempts to reduce the effort of collecting a large amount of training data by using a method called GraspGAN [18] and a cloud database [17]. In contrast, this research aims to collect enough training data within a reasonable time by introducing an approximate physics simulation. Our method searches for the grasping posture with the maximum success rate of picking.

Learning approaches have also been used for grasping a novel daily object placed on a table [20, 21, 22, 23] and for warehouse automation [15, 16]. ten Pas and Platt [22] developed a method for learning an antipodal grasp of a novel object by using an SVM (Support Vector Machine). Lenz et al. [21] used deep learning to detect an appropriate grasping pose of an object. Zeng et al. [15] proposed a learning based picking method for warehouse automation. However, industrial bin-picking differs from the warehouse application since the grasped objects do not appear in daily life, which makes it impossible to use generalized object recognition methods.

## III Physics Simulator

In this section, we give an overview of the physics simulator used in this research. We use PhysX as the physics engine. The overview of the simulator is shown in Fig. 2. In the simulation world, we assume a tray whose bottom surface is horizontal and flat. We also assume that gravity acts vertically downward. We use two rectangular objects to simulate a two-fingered parallel jaw gripper, which can translate, rotate about the vertical axis, and open/close its fingers.

To collect training data, we first drop a predefined number of objects from a predefined height with randomly defined poses. Then, we obtain a simulated depth image of the pile, assuming that a simulated 3D depth sensor faces vertically downward. We furthermore define the gripper’s horizontal position and its orientation about the vertical axis so that the gripper grasps the topmost object. To pick an object, the gripper first moves vertically downward, closes its fingers, and then moves vertically upward. After the gripper moves up, we judge whether or not an object has been successfully picked by checking the vertical positions of the objects. At each picking trial, we collect the following three pieces of information: 1) a depth image of the pile, 2) the gripper’s horizontal position and orientation about the vertical axis, and 3) success/failure of picking.
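The record collected at each trial and the success judgment can be sketched as follows. This is a minimal illustration, not the simulator's actual code: the names `PickingTrial`, `is_picked`, and `lift_threshold` are hypothetical, and the rule "at least one object ends up above a height threshold" is an assumption about how the vertical positions are checked.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class PickingTrial:
    """One simulated picking trial (field names are illustrative)."""
    depth_image: Sequence[Sequence[float]]  # depth image of the pile, seen from above
    gripper_xy: Tuple[float, float]         # horizontal gripper position
    gripper_yaw: float                      # orientation about the vertical axis [rad]
    success: bool                           # outcome of the trial


def is_picked(object_heights: Sequence[float], lift_threshold: float) -> bool:
    """Judge success from the vertical positions of the objects after lifting.

    Assumption: a trial counts as a success when at least one object
    is above `lift_threshold` once the gripper has moved up.
    """
    return any(z > lift_threshold for z in object_heights)


# Example: one object was lifted to 0.30 while the others stayed in the tray.
trial = PickingTrial(
    depth_image=[[0.0, 0.0], [0.0, 0.0]],
    gripper_xy=(0.10, 0.20),
    gripper_yaw=0.0,
    success=is_picked([0.01, 0.30, 0.02], lift_threshold=0.2),
)
```

Storing the outcome next to the depth image and gripper pose keeps each trial usable directly as one labeled training sample.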

As explained in the introduction, physics simulation of randomly stacked objects with complex shapes usually takes longer than simulation of simply shaped objects, since the calculation time of physics simulation usually depends on the number of contact points in the simulation world. To shorten the calculation time of the physics simulation used to collect training data, we approximate the shape of an object. This approximation is used only for checking collision among objects and fingers. In our method, the shape of an object is approximated by a set of shape primitives such as rectangular parallelepipeds. Fig. 3 shows our method for approximating an object shape. For a given polygonal model of an object, we apply convex decomposition [36], which decomposes the model into a set of convex parts. Then, we fit a rectangular parallelepiped to each convex part.
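The box-fitting step can be illustrated with a short sketch. The text does not state how each box is fitted or oriented, so the axis-aligned bounding box used here is an assumption, and the convex decomposition itself is taken as already done (for example by the method of [36]). The names `fit_box` and `approximate_object` are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (minimum corner, maximum corner)


def fit_box(vertices: Iterable[Point]) -> Box:
    """Return the axis-aligned bounding box of one convex part."""
    xs, ys, zs = zip(*vertices)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def approximate_object(convex_parts: Iterable[Iterable[Point]]) -> List[Box]:
    """Replace every convex part by its bounding box.

    The result is meant as a collision shape only; the depth image is
    still rendered from the original model.
    """
    return [fit_box(part) for part in convex_parts]


# Example: a tetrahedral part is replaced by a 1 x 2 x 3 box.
part = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
boxes = approximate_object([part])
```

The example also shows why the approximation can produce unrealistic grasps: the box contains volume that the original part does not occupy.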

We checked the calculation time of the physics simulation as shown in Fig. 4. We simulated picking an object from the pile four times, where each simulation includes the same number of the same objects but a different resolution of convex decomposition. As shown in the figure, the calculation time of the physics simulation increases as the number of convex parts generated by the convex decomposition increases. In the following, each object is decomposed into 10 rectangular parts as shown in Fig. 3.

Note that the convex decomposition is introduced only for collision checking in the physics simulation; it is not used to obtain the simulated depth image of the pile, which is the input to the CNN explained in the next section.

## IV Learning Based Approach

This section explains our learning based approach for randomized bin-picking.

### IV-A Convolutional Neural Network

We use a CNN (Convolutional Neural Network) [38, 39] to predict whether or not a robot can successfully pick an object from the pile. The overview of our CNN is shown in Fig. 5 and Table I. The inputs are a depth image of the pile (500 × 500 [pixel]) and the gripper’s pose before picking an object. Since we use a parallel jaw gripper grasping an object with an upright posture, the gripper can be expressed by a segment whose two end points correspond to the two fingers. To reduce the time needed to train the CNN, we extract a 250 × 250 [pixel] subset of the pile’s image. This is the input to the main channel of the CNN. On the other hand, a 250 × 250 [pixel] image of the segment expressing the gripper’s pose is the input to the side channel of the CNN. Our CNN is composed of serially connected convolutional and pooling layers. In the pooling layers, we apply max pooling of 2 × 2 [pixel]. Fully-connected layers are attached at the end of the convolutional and pooling layers. The last fully-connected layer classifies success/failure of picking. The success and failure rates are denoted by $y_s$ and $y_f$, respectively, and are obtained by using the following softmax function:

$$y_k = \frac{\exp(u_k)}{\exp(u_s) + \exp(u_f)}, \quad k \in \{s, f\} \tag{1}$$

where $u_k$ denotes the weighted input to the last fully-connected layer. The activation function used in the convolutional and fully-connected layers should avoid the vanishing gradient problem. For this reason, we use the following ReLU (Rectified Linear Unit) function [37] as the activation function:

$$f(x) = \max(0, x) \tag{2}$$
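The two functions named above, the softmax output and the ReLU activation, can be written in a few lines. This is a generic sketch of the standard definitions rather than the authors' implementation; subtracting the maximum before exponentiating is a common numerical-stability choice added here, not something stated in the text.

```python
import math
from typing import List, Sequence


def relu(x: float) -> float:
    """Rectified Linear Unit: passes positive inputs, clips negatives to zero."""
    return max(0.0, x)


def softmax(u: Sequence[float]) -> List[float]:
    """Map the inputs of the last layer to rates that sum to one."""
    m = max(u)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


# Two-class example: the first entry plays the role of the success rate,
# the second the failure rate.
success_rate, failure_rate = softmax([2.0, 0.0])
```

With two outputs, the success and failure rates are complementary, so thresholding the success rate at 0.5 is equivalent to picking the larger of the two.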

Table I: Overview of the CNN

| Layer | Filter | Function | Dropout | Pooling | Output size |
|---|---|---|---|---|---|
| Convolutional Layer 1A, 1B | 16 × 16 | ReLU | - | 2 × 2 | 55 × 55 × 32 |
| Convolutional Layer 2A, 2B | 8 × 8 | ReLU | - | 2 × 2 | 24 × 24 × 64 |
| Convolutional Layer 3 | 5 × 5 | ReLU | - | 2 × 2 | 10 × 10 × 64 |
| Convolutional Layer 4 | 3 × 3 | ReLU | - | 2 × 2 | 4 × 4 × 64 |
| Fully-connected Layer 1 | - | ReLU | 0.5 | - | 1 × 1 × 1024 |
| Fully-connected Layer 2 | - | ReLU | 0.5 | - | 1 × 1 × 1024 |
| Fully-connected Layer 3 | - | softmax | - | - | 1 × 1 × 2 |
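As a consistency check on Table I, the spatial output sizes 55, 24, 10 and 4 of the convolutional layers can be related by simple arithmetic. The sketch below assumes unpadded convolutions with stride 1 followed by 2 × 2 max pooling; the text does not state padding or stride, but these assumptions reproduce the sizes of layers 2 to 4. The first layer is not attempted because its stride and padding are not given. The function name `conv_pool_size` is hypothetical.

```python
def conv_pool_size(size: int, kernel: int, pool: int = 2, stride: int = 1) -> int:
    """Spatial size after an unpadded convolution followed by max pooling."""
    conv = (size - kernel) // stride + 1  # output size of the convolution
    return conv // pool                   # output size after pooling


size = 55      # spatial output size of convolutional layer 1
sizes = []
for kernel in (8, 5, 3):  # filter sizes of layers 2, 3 and 4
    size = conv_pool_size(size, kernel)
    sizes.append(size)

print(sizes)  # [24, 10, 4]
```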

### IV-B Discriminator

Our CNN predicts whether or not a robot can successfully pick an object. Given a 250 × 250 [pixel] subset of the pile’s depth image and a pose of the parallel jaw gripper, the CNN outputs the success rate of picking. If the success rate is more than 0.5, we judge that the robot will successfully pick an object. Otherwise, we judge that the robot will fail in picking an object.

### IV-C Optimum Grasping Pose Detection

To detect the optimum grasping pose, Lenz et al. [21] used a two-step DNN (Deep Neural Network), where multiple candidates of grasping poses are generated by a small neural network in the first step, and the optimal grasping pose is then detected by a larger neural network in the second step. However, this method has a high calculation cost since it uses a raster scan while changing the size and orientation of a rectangular window and iteratively uses two DNNs. In contrast, we apply a simple method to detect the grasping pose maximizing the success rate of picking (Fig. 6). Our method uses a raster scan with a fixed size and orientation of the rectangular window. We consider eight candidates of the gripper’s orientation for each rectangular window. By considering 6 × 6 candidates of the gripper’s position, we have 288 (8 × 6 × 6) candidates of grasping poses in total. For each grasping pose, we calculate the success rate of picking by using the CNN. Among the 288 candidates, we select the grasping pose with the highest success rate of picking.
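The exhaustive search over grasp candidates can be sketched as below. The scoring function stands in for the trained CNN and is passed in as a callable; the names `candidate_poses` and `best_pose` are hypothetical, and spacing the eight orientations evenly over half a turn is an assumption (a parallel jaw gripper is symmetric under a half-turn), not a detail given in the text.

```python
import math
from itertools import product
from typing import Callable, List, Tuple

Pose = Tuple[int, int, float]  # (column index, row index, yaw angle [rad])


def candidate_poses(n_positions: int = 6, n_orientations: int = 8) -> List[Pose]:
    """Return all (ix, iy, yaw) grasp candidates on a fixed grid."""
    yaws = [k * math.pi / n_orientations for k in range(n_orientations)]
    return list(product(range(n_positions), range(n_positions), yaws))


def best_pose(success_rate: Callable[[Pose], float]) -> Tuple[Pose, float]:
    """Evaluate every candidate and return the one with the highest rate."""
    best = max(candidate_poses(), key=success_rate)
    return best, success_rate(best)


# Dummy scorer used only for illustration; a trained network would go here.
def dummy_rate(pose: Pose) -> float:
    ix, iy, yaw = pose
    return 0.9 if (ix, iy) == (2, 3) and yaw == 0.0 else 0.1


pose, rate = best_pose(dummy_rate)
```

Because the candidate set is small and fixed, one forward pass per candidate is enough and no proposal network is required.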

Table II: Discrimination result

| | | Simulation: Success | Simulation: Failure |
|---|---|---|---|
| Discriminator | Success | 436 (TP) | 134 (FP) |
| Discriminator | Failure | 164 (FN) | 466 (TN) |

Precision = 0.765, Recall = 0.727, F-value = 0.745

## V Collection of Training Data

We performed the physics simulation of bin-picking for 6 hours with 15 threads and collected 6000 success data. The failure data are sampled so that the number of failure data equals the number of success data. 90 % of the data is used to train the CNN, and the remaining 10 % is used to verify the trained CNN. By rotating and inverting the depth images included in the training data, we extend the number of training data to 64800. By using the training data, we trained the CNN shown in Fig. 5 for 17 hours.
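The data set sizes in this section can be checked with a few lines of arithmetic. The sketch takes the split as 90 % / 10 %, which is consistent with the 1200 verification data used in the results. The augmentation factor of 6 is only what the stated totals imply; which rotations and inversions produce it is not specified in the text.

```python
n_success = 6000
n_failure = n_success              # failures are subsampled to match
n_total = n_success + n_failure    # 12000 labeled trials

n_train = n_total * 90 // 100      # 10800 trials for training
n_verify = n_total - n_train       # 1200 trials for verification

n_augmented = 64800                # training set size after augmentation
augmentation_factor = n_augmented // n_train

print(n_train, n_verify, augmentation_factor)  # 10800 1200 6
```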

## VI Results

### VI-A Discrimination

As shown in Table II, we verified the trained CNN by using 1200 verification data including 600 success and 600 failure cases. Fig. 7 shows four examples belonging to the four classes shown in Table II, (TP: True Positive), (TN: True Negative), (FP: False Positive), and (FN: False Negative), where red and blue figures show the success and failure cases, respectively. We judged a case as successful if the success rate is larger than 0.5. The F-value of our discriminator is 0.745; the errors include cases where a robot successfully picked up an object although the prediction was that the robot would fail. We analyze this prediction error in more detail in the following subsections.
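The precision, recall and F-value reported with Table II follow from the confusion-matrix counts by the standard definitions. The short sketch below recomputes them; the function name `precision_recall_f` is hypothetical.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F-value from confusion-matrix counts."""
    precision = tp / (tp + fp)  # fraction of predicted successes that succeeded
    recall = tp / (tp + fn)     # fraction of actual successes that were predicted
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


p, r, f = precision_recall_f(tp=436, fp=134, fn=164)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # 0.765 0.727 0.745
```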

### VI-B Derivation of Optimum Grasping Pose

By using the trained CNN and 20 verification data, we detected the optimum grasping pose as shown in Fig. 8, where the segments marked in red show the optimum grasping pose and the segments marked in yellow show grasping poses whose success rate is more than 0.9. We confirmed that, in all cases, the obtained grasping poses have a high graspability index [9]. Fig. 9 shows an experimental result where, for a given depth image of the pile, we determined the grasping pose by using the CNN trained with the physics simulation.

### VI-C Analysis of Model Approximation

Let us consider the effect of the approximation introduced in our physics simulation. We fit a rectangular parallelepiped to each convex decomposed part of a grasped object. While this approximation is used for checking collision among objects and fingers, the simulated depth image is obtained by using object models with the original shape. In our physics simulation, a robot sometimes stably grasps a part of an object with the approximated shape even though it may not be able to stably grasp the same part of the object with the original shape. However, even if we use such unrealistic training data caused by the approximation, its effect may be relaxed if we use the depth image of the object with the original shape as an input to the CNN.

To explain this phenomenon, we collected 200 cases of physics simulation where a robot successfully picked an isolated single object. Among the 200 cases, we selected 15 unrealistic cases, shown in Fig. 10, where a robot stably grasps an object contrary to our expectations. In these cases, a robot stably grasps a part that belongs to the approximated shape but is not included in the original shape. The figure also shows the success rate obtained by using the trained CNN. Interestingly, the success rate is low in most of the cases in spite of the fact that the robot successfully picks the object in the physics simulation. This implies that, in the discrimination result shown in Table II, (FP) and (FN) do not simply represent discrimination errors. We can consider that the effect of the approximation is relaxed if we use depth images of objects with the original shape to train the CNN.

Let us analyze the effect of shape approximation in more detail. As shown in Fig. 11, we make a robot pick an object placed on a tray. We prepared two kinds of objects to train the CNN: one is approximated by a rectangular parallelepiped when checking collision, and the other is not approximated. We change the rate of using the approximated object when training the CNN. After training the CNN, we estimate the success rate when a robot tries to pick a part of an object that is included in the approximated shape but not in the original one. The results are shown in Figs. 12 and 13, where a hexagonal prism and an elliptic cylinder are used, respectively. In both cases, if the rate of using the approximated object is less than 30 %, we can correctly estimate the outcome of picking, since success is predicted only if the success rate is larger than 0.5. This result means that, in randomized bin-picking, we can correctly estimate whether or not a robot can successfully pick an object from the pile if such a rough rectangular approximation is used at a rate of less than 30 %.
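The experiment that varies the rate of approximated objects amounts to composing training sets with a controlled mixing ratio. One way this could be set up is sketched below. The function `mix_training_set` and its parameters are hypothetical, and sampling with replacement is an assumption made to keep the example short; the text does not describe how the mixed sets were drawn.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_training_set(approx: Sequence[T], exact: Sequence[T],
                     approx_rate: float, n_samples: int,
                     seed: int = 0) -> List[T]:
    """Draw a training set in which a fraction `approx_rate` of the samples
    comes from simulations that used the approximated collision shape."""
    rng = random.Random(seed)
    n_approx = round(approx_rate * n_samples)
    picked = rng.choices(approx, k=n_approx)             # approximated shape
    picked += rng.choices(exact, k=n_samples - n_approx)  # original shape
    rng.shuffle(picked)
    return picked


# Example: 30 % of 100 samples come from the approximated-shape simulations.
samples = mix_training_set(["approx"], ["exact"], approx_rate=0.3, n_samples=100)
```

Training one network per mixing rate and querying each at the part that exists only in the approximated shape would then give curves of the kind shown in Figs. 12 and 13.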

## VII Conclusions

In this research, we studied learning based randomized bin-picking. We introduced an approximate physics simulation to effectively collect the training data within a short period of time. We first formulated the learning based method by using a CNN. Then, we obtained the optimum grasping posture of a parallel jaw gripper by using the CNN. Finally, we showed that the effect of the approximation introduced in collision checking is relaxed if we use the exact 3D model to generate the depth image of the pile used as an input to the CNN.

## References

- [1] M. J. Thurley, “Automated Online Measurement of Limestone Particle Size Distributions using 3D Range Data”, J. of Process Control, vol. 21, pp. 254-262, 2011.
- [2] S. Kristensen, S. Estable, M. Kossow, and R. Broesel, “Bin-picking with a Solid State Range Camera”, Robotics and Autonomous Systems, vol. 35, pp. 143-151, 2001.
- [3] I. Fryndental, “Segmentation of Sugar Beets using Image and Graph Processing”, Proc. of Int. Conf. on Pattern Recognition, pp. 1697-1699, 1998.
- [4] E. Al-Hujazi and A. Sood, “Range Image Segmentation with Applications to Robot Bin-Picking Using Vacuum Gripper”, IEEE Trans. SMC, vol. 20, no. 6, pp. 1313-1325, 1990.
- [5] O. Ghita and P. F. Whelan, “A bin picking system based on depth from defocus”, J. Machine Vision and Applications, vol. 13, no. 4, pp. 234-244, 2003.
- [6] J. Kirkegaard and T. B. Moeslund, “Bin-Picking based on Harmonic Shape Contexts and Graph-Based Matching”, Int. Conf. on Pattern Recognition, vol. 2, pp. 581-584, 2006.
- [7] S. Fuchs, S. Haddadin, M. Keller, S. Parusel, A. Kolb, and M. Suppa, “Cooperative Bin-Picking with Time-of-Flight Camera and Impedance Controlled DLR Lightweight Robot III”, Proc. of IEEE Int. Conf. on Intelligent Robots and Systems, pp. 4862-4867, 2010.
- [8] A. Zuo, J. Z. Zhang, K. Stanley, and Q. M. J. Wu, “A Hybrid Stereo Feature Matching Algorithm for Stereo Vision-Based Bin Picking”, J. Pattern Recognition and Artificial Intelligence, vol. 18, no. 8, pp. 1407-1422, 2004.
- [9] Y. Domae, H. Okuda, Y. Taguchi, K. Sumi, and T. Hirai, “Fast Graspability Evaluation on Single Depth Maps for Bin Picking with General Grippers”, Proc. of 2014 IEEE Int. Conf. on Robotics and Automation, pp. 1997-2004, 2014.
- [10] D.C. Dupuis, Simeon Lenard, M. A. Baumann, E. A. Croft, and J. J. Little, “Two-Fingered Grasp Planning for Randomized Bin-Picking”, Proc. of Robotics, Science and Systems 2008 Manipulation Workshop, 2008.
- [11] K. Harada, K. Nagata, T. Tsuji, N. Yamanobe, A. Nakamura, and Y. Kawai, “Probabilistic Approach for Object Bin Picking Approximated by Cylinders”, Proc. of IEEE Int. Conf. on Robotics and Automation, pp. 3727-3732, 2013.
- [12] K. Harada, T. Yoshimi, Y. Kita, K. Nagata, N. Yamanobe, T. Ueshiba, Y. Satoh, T. Masuda, R. Takase, T. Nishi, T. Nagami, T. Kawai, and O. Nakamura, “Project on Development of a Robot System for Random Picking–Grasp/Manipulation Planner for a Dual-arm Manipulator–”, Proc. of IEEE/SICE Int. Symposium on System Integration, pp. 583-589, 2014.
- [13] S. Levine, P. Pastor, A. Krizhevsky, D. Quillen, “Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection”, Preprints of Int. Symposium on Experimental Robotics, 2016.
- [14] K. Harada, W. Wan, T. Tsuji, K. Kikuchi, K. Nagata, and H. Onda, “Initial Experiments on Learning-Based Randomized Bin-Picking Allowing Finger Contact with Neighboring Objects”, Proc. of IEEE Int. Conf. on Automation Science and Engineering, pp. 1196-1202, 2016.
- [15] A. Zeng et al., “Robotic Pick-and-place of Novel Objects in Clutter with Multi-affordance Grasping and Cross Domain Image Matching”, https://arxiv.org/abs/1710.01330
- [16] G. Lin et al., “RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation”, https://arxiv.org/abs/1611.06612
- [17] J. Mahler et al., “Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics”, https://arxiv.org/abs/1703.09312
- [18] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, “Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping”, https://arxiv.org/abs/1709.07857, 2017.
- [19] L. Breiman, “Random Forests”, Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
- [20] N. Curtis and J. Xiao, “Efficient and Effective Grasping of Novel Objects through Learning and Adapting a Knowledge Base”, Proc. of IEEE Int. Conf. on Intelligent Robots and Systems, pp. 2252-2257, 2008.
- [21] I. Lenz, H. Lee, and A. Saxena, “Deep Learning for Detecting Robotic Grasps”, Int. J. Robotics Research, vol. 34, no. 4-5, pp. 705-724, 2015.
- [22] A. ten Pas and R. Platt, “Using Geometry to Detect Grasps in 3D Point Clouds”, Preprint of Int. Symposium on Robotics Research, 2015.
- [23] S. Ekvall and D. Kragic, “Learning and evaluation of the approach vector for automatic grasp generation and planning”, Proc. of IEEE Int. Conf. on Robotics and Automation, 2007.
- [24] G. Olague and R. Mohr, “Optimal Camera Placement for Accurate Reconstructions”, Pattern Recognition, vol. 35, no. 4, pp. 927-944, 2002.
- [25] R. Sablatnig, S. Tosovic, and M. Kampel, “Next View Planning for a Combination of Passive and Active Acquisition Techniques”, Proc. of 3DIM, pp. 62-69, 2003.
- [26] W. R. Scott, G. Roth, and J.-F. Rivest, “View Planning for Automated Three-dimensional Object Reconstruction and Inspection”, ACM Comput. Surv., vol. 35, no. 1, pp. 64-96, 2003.
- [27] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, “A Comparison and Evaluation of Multi-view Stereo Reconstruction Algorithms”, Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 519-528, 2006.
- [28] S. Wenhardt, B. Deutsch, J. Hornegger, H. Niemann, and J. Denzler, “An information theoretic approach for next best view planning in 3-d reconstruction”, Proc. of Int. Conf. on Pattern Recognition, 2006.
- [29] K.A. Tarabanis, P.K. Allen, and R.Y. Tsai, “A Survey of Sensor Planning in Computer Vision”, IEEE Trans. on Robotics and Automation, vol. 11, no. 1, pp. 86-104, 1995.
- [30] K. Harada, K. Kaneko, and F. Kanehiro, “Fast Grasp Planning for Hand/Arm Systems Based on Convex Model”, Proc. of IEEE Int. Conf. on Robotics and Automation, pp. 1162-1168, 2008.
- [31] K. Harada, T. Tsuji, S. Uto, N. Yamanobe, K. Nagata, and K. Kitagaki, “Stability of Soft-Finger Grasp under Gravity”, Proc. of IEEE Int. Conf. on Robotics and Automation, pp. 883-888, 2014.
- S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics, MIT Press, 2005.
- [32] K. Nagata, T. Miyasaka, D.N. Nenchev, N. Yamanobe, K. Maruyama, S. Kawabata, and Y. Kawai, “Picking up an Indicated Object in a Complex Environment”, Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2010.
- [33] A. Aldoma, F. Tombari, R.B. Rusu, and M. Vincze, “OUR-CVFH - Oriented, Unique and Repeatable Clustered Viewpoint Feature Histogram for Object Recognition and 6DOF Pose Estimation”, Pattern Recognition, Springer, pp. 113-122, 2012.
- [34] C. M. Stein, M. Schoeler, J. Papon, and F. Woergoetter, “Object Partitioning using Local Convexity”, Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2014.
- [35] PCL-Point Cloud Library, http://pointclouds.org/
- [36] K. Mamou and F. Ghorbel, “A Simple and Efficient Approach for 3D Mesh Approximate Convex Decomposition”, Proc. of IEEE Int. Conf. on Image Processing, pp. 3501-3504, 2009.
- [37] Y. LeCun, Y. Bengio, G. Hinton, “Deep Learning”, Nature, vol. 521, pp. 436-444, 2015.
- [38] O. Russakovsky et al., “Imagenet large scale visual recognition challenge”, Int. J. of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015.
- [39] K. Grauman and B. Leibe, Visual object recognition, Synthesis lectures on artificial intelligence and machine learning, vol. 5, no. 2, pp. 1-181, 2011.