Next-best-view Reaching for Improved Grasping in Clutter
Camera viewpoint selection is an important aspect of visual grasp detection, especially in clutter where many occlusions are present. Where other approaches use a static camera position or fixed data collection routines, our Multi-View Picking (MVP) controller uses an active perception approach to choose informative viewpoints based directly on a distribution of grasp pose estimates in real time, reducing uncertainty in the grasp poses caused by clutter and occlusions. In trials of grasping 20 objects from clutter, our MVP controller achieves 80% grasp success, outperforming a single-viewpoint grasp detector by 12%. We also show that our approach is both more accurate and more efficient than approaches which consider multiple fixed viewpoints.
Grasping and transporting objects is a canonical robotics problem which has seen great advancements in recent years, especially with regards to detection of grasp poses for previously unseen objects using only visual information. As the performance of these visual grasp detection systems has improved, so has the difficulty of standard benchmarking tasks for evaluation, seeing a shift away from grasping relatively simple, isolated objects to grasping geometrically and visually challenging objects in cluttered environments [mahler2017binpicking].
The use of cluttered environments and complex objects has resulted in more visually challenging scenarios for grasp detection, with the level of clutter impacting grasp detection performance [mahler2017binpicking]. Recent work has shown that improved visual information from point cloud fusion [ten2017grasp, arruda2016active] or viewpoint selection [gualtieri2017viewpoint] can improve the quality of grasp estimates in clutter. However, these are typically treated as a separate, fixed data-gathering step and do not directly take into account grasp detections from multiple viewpoints.
At the same time, improvements in grasp detection models and computational hardware have seen the time required to visually detect grasps (especially using deep learning methods) decrease from tens of seconds [lenz2015deep] to less than a second [mahler2017dex, viereck2017learning] to small fractions of a second [morrison2018closing]. Consequently, the largest contributor to grasp execution time has become the movement of the robotic arm executing the grasp, making grasp detection from multiple viewpoints feasible with very little impact to overall execution time.
To address the added difficulties of grasping in clutter, we propose to use the act of reaching towards a grasp as a method of data collection, making it a meaningful part of the grasping pipeline, rather than just a mechanical necessity.
To achieve this, we develop the Multi-View Picking (MVP) controller, which selects multiple informative viewpoints for an eye-in-hand camera while reaching to a grasp in order to reduce uncertainty in the grasp pose estimate caused by clutter and occlusion, resulting in an overall improvement in grasp success (Fig. 1). Unlike previous works in active perception for grasping which employ object-specific heuristics [gualtieri2017viewpoint] or a secondary task such as point cloud reconstruction [arruda2016active, kahn2015active, ten2017grasp], our approach directly uses entropy in the grasp pose estimation to influence control.
We validate our approach through trials of grasping from piles of 20 objects in clutter, and compare our results to baselines which represent common visual grasp detection approaches. Using our MVP controller, we achieve an 80% success rate in grasping from clutter, a 12% increase compared to a single-viewpoint grasp detection approach. Additionally, our method outperforms a baseline which considers multiple viewpoints over a predefined trajectory, achieving a higher grasp success rate with fewer distinct viewpoints and reduced mean time per grasp. This highlights the advantages of using our information-gain based approach which focuses on salient areas, reducing unnecessary data collection. By varying the cost associated with data collection during reaching, we show that it is possible to trade off between success rate and execution time, allowing the system to be optimised for either raw grasp success rate or overall efficiency.
II Related Work
In order to grasp and transport a wide range of unknown objects, a robotic system cannot rely on using offline information such as object models or object-specific grasp poses. Instead, it must use geometric information to compute stable grasp poses for previously unseen objects. Recently, many approaches combining visual inputs with machine-learning techniques – primarily Convolutional Neural Networks (CNNs) – have been widely and successfully applied to this problem [lenz2015deep, morrison2018closing, mahler2017dex, mahler2017binpicking, pinto2016supersizing, ten2017grasp, johns2016deep], which we refer to as visual grasp detection.
The robustness of many visual grasp detection systems is sensitive to factors such as sensor noise, robot control inaccuracies and visual occlusion, which is especially prevalent in cluttered environments. While the detrimental effects of sensor noise and poor robot precision can be minimised as part of the visual grasp detection algorithm, e.g. through sensor noise injection during training [johns2016deep, mahler2017dex, mahler2017binpicking, viereck2017learning], overcoming the issue of occlusion in clutter requires multiple camera viewpoints to be considered. For example, ten Pas et al. [ten2017grasp] showed that computing grasp poses using a fused point cloud from many viewpoints along a predefined trajectory resulted in a 9% increase in grasp success rate compared to using a point cloud collected from two static cameras. Rather than rely on a fixed data collection routine, Arruda et al. [arruda2016active] use an active perception approach to choose viewpoints which specifically aid point cloud reconstruction near potential finger contact points in an efficient manner.
Broadly, active perception is defined as the situation where a robot “adopts strategies for decisions of sensor placement or sensor configuration” in order to perform a task [chen2011active]. It is a concept which has been applied to a wide variety of robotic tasks, such as mapping [stachniss2005information, burgard2005coordinated], object modelling [pito1999solution], object identification [roy2004active, monica2018contour] and path planning [velez2011planning]. Common strategies for active perception focus on planning the expected next best action to efficiently minimise measurement uncertainty or maximise information gain via a metric such as Shannon entropy [vazquez2001viewpoint] or KL divergence [van2012maximally].
Active perception approaches have been applied to robotic grasping in prior work but, rather than directly using grasp quality as a metric, they rely on a secondary task such as object modelling [bone2008automated, aleotti2014perception], point cloud reconstruction while searching for graspable geometry [arruda2016active, kahn2015active] or localising previously seen objects [holz2014active] to generate next best view commands. This is partly due to the difficulty in defining a probability distribution over predicted grasp poses [gualtieri2017viewpoint], which we overcome by using a visual grasp detection system which generates a pixelwise distribution of grasp pose estimates [morrison2018closing].
Gualtieri and Platt [gualtieri2017viewpoint] apply active perception directly to grasp detection. They compute a distribution of viewpoints for object classes which are likely to improve the quality of detected grasps. However, this approach requires knowledge of the object’s class and doesn’t easily generalise to cluttered environments or more complex objects. They also show improved grasp estimates using a heuristic approach where the camera is aligned to the best detected grasp.
III Multi-View Picking
The choice of camera viewpoints plays an important role in the quality of visual grasp detection. In this work, we apply active perception techniques to compute the next best viewpoint for a robot with an eye-in-hand camera in real time while reaching for an object. Unlike previous work, our approach does not rely on any object-specific knowledge or heuristics, does not use a fixed data collection routine and uses visual grasp detection observations directly rather than a secondary metric such as point cloud reconstruction.
We develop an information-gain controller, which we call the Multi-View Picking (MVP) controller. The MVP controller combines visual grasp predictions from multiple camera viewpoints along a trajectory to a grasp pose, and seeks to minimise the uncertainty (entropy) associated with the grasp pose prediction by altering the trajectory to include informative viewpoints, overcoming the challenges associated with visual grasp detection in clutter. Our implementation of the MVP controller is described in the following sections, with an overview given in Fig. 3.
III-A Viewpoint Trajectory
We consider the case of a robotic arm with an antipodal gripper and an eye-in-hand camera reaching from an arbitrary starting pose to a grasp pose $g$. At any instant during this motion, the camera’s viewpoint $v = (x, y, z)$ is the 3D position of the camera, which we constrain to be parallel to the xy-plane, centred above a point $(x, y)$ in the workspace at a height $z$. We define a viewpoint trajectory $V = (v_0, v_1, \dots, v_T)$ for grasping as a set of discrete viewpoints from which we make a visual grasp detection observation while reaching for a grasp. The initial viewpoint $v_0$ is at a fixed height $z_{\max}$, and the trajectory ends when the camera reaches a height $z_{\min}$, at which point the best detected grasp is executed.
III-B Visual Grasp Detection
For visual grasp detection, we use the real-time Generative Grasping Convolutional Neural Network (GG-CNN) from [morrison2018closing]. Given a depth image as input, the GG-CNN produces a pixelwise visual grasp detection $G = (Q, \Phi, W)$, representing the grasp quality (the chance of grasp success), angle of rotation around the vertical axis and gripper width at each pixel of the input respectively (Fig. 2). Each pixel of $G$ represents an antipodal grasp $g = (\mathbf{p}, \phi, w, q)$, parameterised by the grasp’s centre position $\mathbf{p} = (x, y, z)$, rotation $\phi$, gripper width $w$ and quality $q$. The position $\mathbf{p}$ is computed from the measured depth and the camera’s intrinsic parameters.
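As an illustration of how a grasp centre can be recovered from a pixel and its measured depth, the following is a minimal sketch assuming an ideal pinhole camera without lens distortion. The function name and the intrinsic parameter names (`fx`, `fy`, `cx`, `cy`) are hypothetical and not taken from the authors' implementation; the result is expressed in the camera frame and would still need to be transformed into the robot's frame.

```python
def deproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with a depth in metres to a 3D point.

    Assumes an ideal pinhole model: (fx, fy) are focal lengths in pixels
    and (cx, cy) is the principal point. The point is in the camera frame.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# A pixel at the principal point lies on the optical axis.
centre_point = deproject_pixel(320, 240, 0.5, 600.0, 600.0, 320.0, 240.0)
```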
Other visual grasp detection systems regress a single grasp pose [kumra2017grasp, redmon2015realtime] or perform classification on sampled grasp candidates [mahler2017dex, lenz2015deep], which do not easily lend themselves to defining a probability distribution over grasp estimates. The GG-CNN is an ideal component of our active perception system because it directly generates a distribution of grasp estimates and also gives real-time computation (approx. 50Hz).
III-C Grid Map Representation
To combine observations at time-steps along the viewpoint trajectory, we represent the workspace of the robot as a 2D grid map, $M$, of square cells, each corresponding to a fixed physical area of the workspace (Table I). The grid map has the advantage of being computationally efficient compared to other representations, such as Gaussian Processes, which could be used in a similar way. Within each cell $(i, j)$, grasp quality observations ($q$) are counted in a vector discretised into a fixed number of quality intervals, and combined grasp quality and angle observations ($q$, $\phi$) are counted in a 2-dimensional histogram discretised into quality and angle intervals respectively (Fig. 3b). These counts represent the distribution of observations at each point, and form the basis for our information gain approach.
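The per-cell counting described above can be sketched as follows. This is only an illustration: the class and function names are hypothetical, the default bin counts (10 quality intervals, 18 angle intervals) are assumed values rather than the ones used in the experiments, and the angle range of $[-\pi/2, \pi/2]$ is an assumption about the detector's output.

```python
import math


def bin_index(value, low, high, n_bins):
    """Map a value in [low, high] to one of n_bins equal-width intervals."""
    frac = (value - low) / (high - low)
    return min(n_bins - 1, max(0, int(frac * n_bins)))


class GridCell:
    """Observation counts for a single grid cell (bin counts are assumed)."""

    def __init__(self, n_q_bins=10, n_phi_bins=18):
        self.n_q_bins = n_q_bins
        self.n_phi_bins = n_phi_bins
        # Counts of quality observations per quality interval.
        self.q_counts = [0] * n_q_bins
        # Joint counts of (quality, angle) observations.
        self.q_phi_counts = [[0] * n_phi_bins for _ in range(n_q_bins)]

    def add_observation(self, q, phi):
        """Count one observation; q in [0, 1], phi in [-pi/2, pi/2]."""
        k = bin_index(q, 0.0, 1.0, self.n_q_bins)
        a = bin_index(phi, -math.pi / 2, math.pi / 2, self.n_phi_bins)
        self.q_counts[k] += 1
        self.q_phi_counts[k][a] += 1


cell = GridCell()
cell.add_observation(0.95, 0.0)
cell.add_observation(0.91, 0.1)
```

Storing counts rather than raw observations keeps the memory per cell constant regardless of how many viewpoints contribute to it.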
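A grasp at cell $(i, j)$ is parameterised by the mean of observations within that cell:
$$g_{i,j} = (\mathbf{p}_{i,j}, \bar{\phi}_{i,j}, \bar{w}_{i,j}, \bar{q}_{i,j})$$
where $\mathbf{p}_{i,j}$ is the physical position at the centre of the cell and the mean observations are calculated as per below. (We drop the $i,j$ subscripts for the sake of readability.)
For a single cell with $n$ observations $(q_k, \phi_k, w_k)$, the mean quality observation is given by:
$$\bar{q} = \frac{1}{n} \sum_{k=1}^{n} q_k$$
and the mean angle is the vector mean of the angle observations [mardia2014statistics] weighted by the corresponding grasp quality observations:
$$\bar{\phi} = \operatorname{atan2}\left( \sum_{k=1}^{n} q_k \sin \phi_k, \; \sum_{k=1}^{n} q_k \cos \phi_k \right)$$
The mean grasp width for a cell is simply the mean of observations:
$$\bar{w} = \frac{1}{n} \sum_{k=1}^{n} w_k$$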
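The three per-cell means can be sketched as below, assuming the raw observations for a cell are available as `(q, phi, w)` tuples (a simplification, since the grid map stores discretised counts). The function name is hypothetical, and the angle is averaged as a plain quality-weighted vector mean; whether the angles are doubled first to account for the symmetry of antipodal grasps is not specified here.

```python
import math


def cell_means(observations):
    """Return (mean quality, mean angle, mean width) for one cell.

    observations: non-empty list of (q, phi, w) tuples, phi in radians.
    """
    n = len(observations)
    q_mean = sum(q for q, _, _ in observations) / n
    w_mean = sum(w for _, _, w in observations) / n
    # Quality-weighted vector mean of the angles.
    s = sum(q * math.sin(phi) for q, phi, _ in observations)
    c = sum(q * math.cos(phi) for q, phi, _ in observations)
    phi_mean = math.atan2(s, c)
    return q_mean, phi_mean, w_mean


# Two equally weighted, symmetric angle observations average to zero.
q_bar, phi_bar, w_bar = cell_means([(1.0, 0.2, 40.0), (1.0, -0.2, 60.0)])
```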
III-D MVP Controller
We formulate our MVP controller using an information gain approach, where viewpoints are selected to reduce the uncertainty in the grasp pose observations. Specifically, we aim to reduce the entropy of grasp quality observations in cells of the grid map which correspond to high-quality grasps.
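We can calculate the entropy of the grasp quality observations in a single grid cell as:
$$H = -\sum_{k=1}^{K} p_k \log p_k, \qquad p_k = \frac{n_k}{\sum_{l=1}^{K} n_l}$$
where $n_k$ is the number of grasp quality observations counted in the $k$-th of the $K$ quality intervals of that cell, and empty intervals contribute zero.
Here, we calculate entropy in the quality observations only, rather than the full distribution of grasp quality and angle in the 2-dimensional histogram, because entropy in the distribution of angle measurements is not always a good indicator of uncertainty in the measurement. For example, a small, spherical object can be graspable from any angle with high quality, so it may have a high entropy across the joint quality-angle histogram but low entropy when considering only the quality counts.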
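A minimal sketch of the per-cell entropy computation, assuming the quality observations of a cell are available as a list of interval counts. The function name is hypothetical and the natural logarithm is an arbitrary choice of base.

```python
import math


def histogram_entropy(counts):
    """Shannon entropy (in nats) of a histogram given as a list of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # No observations yet: treated as zero entropy here.
    entropy = 0.0
    for n in counts:
        if n > 0:
            p = n / total
            entropy -= p * math.log(p)
    return entropy


# Consistent observations give zero entropy; spread-out ones give more.
consistent = histogram_entropy([0, 0, 6, 0])
spread = histogram_entropy([2, 1, 2, 1])
```

Treating an unobserved cell as zero entropy is a design choice of this sketch; a real controller may prefer to treat unobserved cells differently so that they still attract exploration.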
We find that entropy in the distribution of grasp quality measurements is a good candidate for predicting informative viewpoints. The fifth column of Fig. 3a shows the entropy of measurements in a semi-cluttered scene at three time steps during a grasp. As shown in Fig. 3, areas of high entropy are present around areas of clutter, occlusion and complex geometry where the output of the grasp detector is highly variable and dependent on viewpoint, compared to “uninteresting” areas with certain measurements (and hence low entropy) regardless of viewpoint, such as the flat bottom surface of the workspace.
Calculating the expected information gain of an observation from a viewpoint, in terms of reduction in entropy, is an intractable problem in this case. As such, we use the common simplifying assumption that the total entropy of the observed area will provide a good approximation to computing the expected information gain [thrun2005probabilistic]. That is, viewpoints which observe areas of high entropy are likely to be more informative (i.e. reduce entropy) than observing areas which already have low entropy.
Hence, we approximate the expected information gain from an observation at a viewpoint $v$ as the weighted sum of entropy of the grid cells observed from that viewpoint:
$$I(v) = \sum_{(i,j) \in C_v} \omega_{i,j} \, H_{i,j}$$
where $C_v$ is the set of grid cell coordinates observable by the camera from viewpoint $v$, $H_{i,j}$ is the entropy of the grasp quality observations in cell $(i, j)$, and $\omega_{i,j}$ weights points by a Gaussian function based on their distance from the geometric centre of $C_v$, encouraging the controller to view high-entropy areas front-on rather than peripherally.
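The weighted sum of entropy can be sketched as follows. The function name, the representation of the visible region as a list of `(i, j)` indices, and the Gaussian width `sigma` (in cells) are all assumptions made for illustration.

```python
import math


def expected_information_gain(entropy_map, visible_cells, sigma=10.0):
    """Approximate information gain as a Gaussian-weighted sum of entropy.

    entropy_map: 2D list of per-cell entropies.
    visible_cells: list of (i, j) indices observable from the viewpoint.
    sigma: assumed width of the Gaussian weighting, in cells.
    """
    if not visible_cells:
        return 0.0
    # Geometric centre of the visible region.
    ci = sum(i for i, _ in visible_cells) / len(visible_cells)
    cj = sum(j for _, j in visible_cells) / len(visible_cells)
    gain = 0.0
    for i, j in visible_cells:
        d2 = (i - ci) ** 2 + (j - cj) ** 2
        weight = math.exp(-d2 / (2.0 * sigma ** 2))
        gain += weight * entropy_map[i][j]
    return gain


cells = [(i, j) for i in range(3) for j in range(3)]
central = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
peripheral = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
gain_central = expected_information_gain(central, cells, sigma=1.0)
gain_peripheral = expected_information_gain(peripheral, cells, sigma=1.0)
```

With the same amount of entropy in view, the viewpoint that sees it at the centre of the image scores higher than one that sees it at the edge.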
To predict the next best viewpoint, we calculate the utility of moving to a viewpoint centred above each cell in the grid map, which represents the desirability of moving to each viewpoint, as:
$$U(v) = I(v) - \mathrm{cost}(v)$$
where $I(v)$ is the approximate expected information gain at viewpoint $v$ and $\mathrm{cost}(v)$ is the cost associated with moving to the viewpoint $v$, tunable by the exploration cost parameter $c$. To encourage the controller to observe areas nearby the best detected grasp, rather than irrelevant distant points, the cost term represents the horizontal ($xy$) distance to the grid cell with the highest average grasp quality, centred at $(x^*, y^*)$:
$$\mathrm{cost}(v) = c \, \lVert (x_v, y_v) - (x^*, y^*) \rVert \, \frac{z_{\max} - z_v}{z_{\max} - z_{\min}}$$
The second term scales the cost based on the vertical position of the camera in the trajectory ($z_{\min} \le z_v \le z_{\max}$, where $z_{\max}$ and $z_{\min}$ are the initial and final camera heights). At the beginning of the trajectory there is zero cost, encouraging exploration of the workspace, which increases linearly as the end-effector descends, causing the controller to converge to the best grasp.
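The trade-off between information gain and cost can be sketched as below. This assumes that utility is the information gain minus the cost, and that the height-dependent factor is the fraction of the descent completed (zero at the initial height, one at the final height); both are assumptions consistent with the description rather than a confirmed form, and all function and parameter names are hypothetical.

```python
import math


def viewpoint_cost(view_xy, best_xy, z, z_max, z_min, c):
    """Cost of a viewpoint: horizontal distance to the best grasp,
    scaled by exploration cost c and by how far the camera has descended."""
    distance = math.hypot(view_xy[0] - best_xy[0], view_xy[1] - best_xy[1])
    descent = (z_max - z) / (z_max - z_min)  # 0 at the start, 1 at the end
    return c * distance * descent


def next_best_viewpoint(candidates, best_xy, z, z_max, z_min, c):
    """Pick the candidate (xy, information_gain) pair with maximum utility."""
    def utility(candidate):
        xy, gain = candidate
        return gain - viewpoint_cost(xy, best_xy, z, z_max, z_min, c)
    return max(candidates, key=utility)[0]


# One candidate above the best grasp, one more informative but 0.5 m away.
candidates = [((0.0, 0.0), 1.0), ((0.3, 0.4), 1.2)]
early = next_best_viewpoint(candidates, (0.0, 0.0), 0.5, 0.5, 0.2, c=1.0)
late = next_best_viewpoint(candidates, (0.0, 0.0), 0.2, 0.5, 0.2, c=1.0)
```

At the start of the reach the distant, more informative viewpoint wins; near the end the cost dominates and the controller prefers the viewpoint above the best grasp.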
The MVP controller generates a horizontal velocity command in the direction of maximum utility, shown in the final column of Fig. 3a. To enable direct comparison between our experiments, we scale the vertical component of velocity to maintain a constant absolute end-effector velocity.
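One way to keep the absolute end-effector speed constant is to assign whatever speed is not used horizontally to the vertical descent. The sketch below assumes a z-up frame (so descending is negative z) and a horizontal speed chosen by the controller; the function and parameter names are hypothetical.

```python
import math


def reach_velocity(direction_xy, horizontal_speed, total_speed):
    """Return (vx, vy, vz) with constant magnitude total_speed.

    direction_xy: horizontal direction of maximum utility (any length).
    horizontal_speed: desired horizontal speed, capped at total_speed.
    """
    dx, dy = direction_xy
    norm = math.hypot(dx, dy)
    if norm > 0.0:
        h = min(horizontal_speed, total_speed)
        vx, vy = h * dx / norm, h * dy / norm
    else:
        h, vx, vy = 0.0, 0.0, 0.0  # No preferred direction: descend only.
    # Remaining speed goes into the descent (negative z, assuming z is up).
    vz = -math.sqrt(total_speed ** 2 - h ** 2)
    return (vx, vy, vz)


vx, vy, vz = reach_velocity((1.0, 0.0), 0.06, 0.1)
```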
IV Robotic Experiments
We test and validate our approach through a number of trials of robotic grasping in clutter, with the setup shown in Fig. 4a. Our experiments are described in the following sections. Our software implementation of the system in the form of ROS nodes, primarily written in Python, will be made available online.
Experiments are performed using a Franka Emika Panda robot, fitted with custom, 3D-printed gripper fingers using the design from [guo2017design]. We use an Intel Realsense D435 depth camera, which is mounted to the robot’s end-effector.
We use a set of 40 objects, comprising 20 “adversarial” 3D-printed objects from the DexNet 2.0 [mahler2017dex] dataset (Fig. 4b; 3D-printable mesh files are available from https://berkeleyautomation.github.io/dex-net/) and 20 household objects (Fig. 4c). The adversarial objects have complex geometry, making them difficult to perceive and grasp. The household set contains objects of a wide variety of sizes and shapes, including deformable objects and visually challenging objects which are transparent or black.
For each experimental run, 20 objects are chosen at random, 10 from each of the adversarial and household sets, and emptied into a bin in an unstructured jumble. The robot then grasps objects one by one and places them into a second bin until all objects have been removed. A grasp is counted as a success if the object is successfully transported to the second bin. Scales on the second bin are used to record the success or failure.
We perform 9 experiments using the MVP controller, comprising 7 runs each, varying the exploration cost while keeping all other parameters listed in Table I constant. Additionally, we perform baseline experiments for comparison as described in the next section.
Table I: Experimental parameters
|Parameter||Value|
|Grid Map Size||68|
|Grid Cell Size||5 mm|
|Controller Update Rate||10 Hz|
|End-effector Velocity (During Reach)||0.1 m/s|
We compare our results to three baselines which represent common methods in other robotic grasping work. We complete 5 runs of each baseline using the method above. Where relevant, the parameters in Table I (including the end-effector velocity) are kept constant to allow for the best possible comparison.
Single Viewpoint: Most work in visual grasp detection considers only a single viewpoint for grasp detection [mahler2017dex, mahler2017binpicking, pinto2016supersizing, lenz2015deep, johns2016deep]. In this baseline we always execute the best grasp detected by the GG-CNN from a single viewpoint centred above the workspace.
Fixed Data Collection: ten Pas et al. [ten2017grasp] increase their grasp success rate by performing point cloud fusion along a fixed trajectory. Because our method uses grasp estimates rather than point clouds, we use our grid map representation to combine GG-CNN predictions along a fixed, spiral trajectory (Fig. 5), considering 25 and 50 uniformly spaced viewpoints in two experiments.
No Exploration: Gualtieri and Platt [gualtieri2017viewpoint] showed an increase in grasp success by aligning the camera to the axis of the best detected grasp. In this baseline we disable the exploration of our MVP controller, instead always generating a velocity command to align the camera to the best detected grasp at each time step.
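Table II: Experimental results
|Method||Exploration Cost ($c$)||Success Rate (%)||Mean Time (s)|
|Ours (MVP Controller)||0.0||80||10.5|
|Ours (MVP Controller)||0.05||80||10.2|
|Ours (MVP Controller)||0.1||79||9.8|
|Ours (MVP Controller)||0.2||79||9.2|
|Ours (MVP Controller)||0.3||77||9.0|
|Ours (MVP Controller)||0.4||77||9.1|
|Ours (MVP Controller)||0.5||75||9.0|
|Ours (MVP Controller)||0.6||74||9.1|
|Ours (MVP Controller)||0.7||74||9.1|
|Baseline: Single View||n/a||68||8.8|
|Baseline: No Expl.||n/a||74||9.1|
|Baseline: Fixed Data Collection (25 Views)||n/a||73||11.4|
|Baseline: Fixed Data Collection (50 Views)||n/a||78||11.4|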
The results of our experiments are shown in Table II. We assess each experiment based on three metrics:
Success Rate: The overall ratio of successful grasps to grasp attempts across all runs.
Mean Time per Pick: The average time per grasp attempt (in seconds), regardless of success.
Mean Picks Per Hour (MPPH): The overall efficiency of the system, representing the average rate of successful picks per hour, calculated as $\mathrm{MPPH} = \frac{3600}{\text{Mean Time per Pick}} \times \text{Success Rate}$.
It is important to note that we include the MPPH measurement as a way of comparing the different results, so we are primarily concerned with the relative difference of values rather than their overall magnitude, which is highly dependent on the fixed end-effector velocity that we use.
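As a worked example of the efficiency metric, the sketch below assumes MPPH is the number of grasp attempts per hour multiplied by the success rate. The function name is hypothetical. Because the success rates and times reported in Table II are rounded, values computed from them may differ slightly from MPPH figures quoted in the text.

```python
def mean_picks_per_hour(success_rate, mean_time_per_pick_s):
    """Successful picks per hour: attempts per hour times success rate.

    success_rate: fraction in [0, 1]; mean_time_per_pick_s: seconds.
    """
    attempts_per_hour = 3600.0 / mean_time_per_pick_s
    return attempts_per_hour * success_rate


# Rounded Table II entries: 79% success at 9.2 s, and 68% at 8.8 s.
mpph_mvp = mean_picks_per_hour(0.79, 9.2)
mpph_single = mean_picks_per_hour(0.68, 8.8)
```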
V-A Exploration During Grasping
We first investigate the effect of trading off between exploration and execution time. Varying the exploration cost $c$ from 0.7 (minimum exploration) to 0.0 (maximum exploration) results in a 6% increase in grasp success rate, improving from 74% to 80%, at the expense of 1.4 seconds per grasp on average. Two example trajectories with different exploration costs are shown in Fig. 5.
As the increased time associated with exploration does not scale linearly with the increase in grasp success rate, the overall efficiency of the system, measured in MPPH, is maximised between the two extremes, at an intermediate exploration cost, where there is an increase in grasp success rate (79%) but minimal extra time cost. As a result, the MVP controller can be optimised for either raw grasp success rate or overall efficiency by adjusting the cost of exploration.
As $c$ is increased, the results begin to plateau in the range 0.5 to 0.7 due to the cost term becoming dominant in this range, causing the controller to perform minimal exploration and converge directly to the grasp.
V-B Comparison to Baselines
Single Viewpoint: Our approach outperforms the single viewpoint baseline approach in terms of both success rate, improving up to 12% (68% vs. 80%), and MPPH, increasing by approximately 10% from 281 to 308 despite an increased mean time per pick. This reinforces the assertion that considering multiple viewpoints is an effective way to overcome the visual challenges associated with grasping in clutter.
No Exploration: The No Exploration baseline achieves similar results to our MVP controller when using a high exploration cost, which is unsurprising since the high exploration cost results in minimal exploration. However, for lower values of $c$, our MVP controller outperforms this baseline by 6% success rate, highlighting the added benefit of actively exploring based on uncertainty compared to using a heuristic such as aligning with the best detected grasp.
Fixed Data Collection: The Fixed Data Collection baseline reinforces the idea that incorporating multiple viewpoints can improve the success rate of a grasping system, with both experiments outperforming the Single Viewpoint baseline. Furthermore, the 50-viewpoint experiment outperforms the 25-viewpoint experiment by 5% grasp success rate. However, because the fixed trajectory always views the entire workspace uniformly, it results in a constant execution time which is longer than all other experiments and is unable to focus on salient areas of the workspace. As a result, our MVP controller outperforms this baseline with regards to all metrics, including a higher grasp success rate with fewer viewpoints. The main advantage comes from our information gain approach, which is able to focus on salient areas of the workspace to reduce unnecessary data collection.
V-C Automatically Adapting to Scene Complexity
As shown with the Fixed Data Collection baseline, using a set of fixed viewpoints for data collection can increase grasp detection accuracy. However, it results in longer, constant execution times, and is unable to focus attention on “interesting” parts of the scene. In contrast, our MVP controller is able to actively adapt to the complexity of the scene and provides a more efficient data collection process. Fig. 6 shows that for our MVP controller the mean time per pick is dependent on the number of objects in the workspace. As the number of objects in the scene increases, and with it the amount of clutter and potential occlusions, so does the mean time per pick, taking on average 20% longer when 20 objects are present compared to a single object.
We presented an active perception approach to grasping in clutter, in which we consider visual grasp detections from multiple viewpoints while reaching. Our work reinforces the importance of viewpoint selection and combining data from multiple viewpoints when grasping in clutter, but in contrast to previous work our Multi-View Picking controller uses an information-gain approach to select informative viewpoints that directly seek to reduce entropy in the grasp pose estimates caused by clutter and occlusions.
We validate our approach with several experiments in grasping from clutter. Our MVP controller achieves up to 80% grasp success while picking from cluttered piles of up to 20 objects, including adversarial objects with complex geometry, outperforming a single-viewpoint method by 12%.
Additionally, by using an information-gain approach, our MVP controller is able to adapt to the complexity of the scene, unlike other approaches which rely on fixed data collection routines as part of a visual grasp detection pipeline. Compared to such a method, our approach results in a higher grasp success rate while also being more efficient, requiring fewer viewpoints and less time per grasp.