Pick-Place With Uncertain Object Instance Segmentation and Shape Completion
In this paper we consider joint perception and control of a pick-place system. It is important to consider perception and control jointly as some actions are more likely to succeed than others given non-uniform, perceptual uncertainty. Our approach is to combine 3D object instance segmentation and shape completion with classical regrasp planning. We use the perceptual modules to estimate their own uncertainty and then incorporate this uncertainty as a regrasp planning cost. We compare 7 different regrasp planning cost functions, 4 of which explicitly model probability of plan execution success. Results show uncertainty-aware costs improve performance for complex tasks, e.g., for a bin packing task, object placement success is 6.2% higher in simulation and 4.1% higher in the real world with an uncertainty-aware cost versus the commonly used minimum-number-of-grasps cost.
Pick-place is prehensile manipulation where grasped objects are fixed in the hand as the arm moves and rest stably after being placed . Owing to the simplifying, static nature of this problem, an interesting computational aspect has emerged: search can be separated into a discrete phase over grasp-place combinations and a continuous phase over arm motions, naturally enabling hierarchical planning [2, 3]. It has been more or less assumed that, a planner exploiting this property can be combined with a perception module to create a working system. However, perception of the object’s geometry is a difficult and error-prone process in itself. Furthermore, separately designing perception and planning modules in this way is not necessarily optimal, e.g., these methods treat grasping a completely unobserved part of an object the same as grasping a part that is fully observed, which could clearly lead to avoidable failures.
One approach to this problem is to dispense with the idea of separate perception and planning modules and use reinforcement learning (RL) to learn a single module that does both. While some success has been achieved with this idea (e.g., [4, 5, 6]), training is time-consuming, the system is not robust to changes in either task or environment, and performance is often suboptimal, even for simple tasks (cf. placing mugs in  vs. a pipelined approach ).
Another approach is to plan in belief space, i.e., in probability distributions over the current state . While this is the most principled approach, there are still a couple of important drawbacks. First, these methods almost always require a detailed description of the observation and state transition models of the system, which can be very difficult to obtain (e.g., [9, 10]). Second, planning takes place in the space of probability distributions over states, which is continuous and, for practical problems, high dimensional. For these reasons, this approach has been confined to problems with few dimensions or other simplifying structure.
In this work, we take a different approach, which is to (a) use perception to predict the complete geometry of the objects and (b) incorporate instance segmentation and shape completion uncertainty as a planning cost at the level of discrete search. With only small modifications to existing components, we efficiently account for perceptual uncertainty. Our results for a bin packing task show that perception is indeed a significant source of error and that some of this error can be compensated for by penalizing uncertain grasps/places. We compare four different ways of modeling probability of pick-place execution success, including grasp quality (GQ), Monte-Carlo (MC) sampling, uncertainty at contact points (CU), and success prediction (SP) to two baselines. We find either SP or a combination of GQ and MC performs best, depending on the scenario. We test the applicability of the approach with real robot experiments on three benchmark tasks: block arrangement, bottle arrangement, and bin packing.
Ii Related Work
Ii-a Pick-place in deterministic, fully observed environments
Pick-place was often studied independently from perception. The structure of the problem for one movable object (called regrasping) was first explained by Tournassoud et al. . There is a discrete search component, for sequencing grasp-place combinations, and a continuous search component, for connecting grasp-place combinations with a motion plan. Alami et al. generalized regrasping to multiple, movable objects, pointed out the problem is NP-hard, and coined the term manipulation planning . Later they considered different cost functions for the discrete search, including path length and number of grasp changes . Nielsen and Kavraki gave a 2-level, probabilistically complete planner for manipulation planning . Manipulation planning is related to the more general concept of multi-modal planning, which deals with discontinuities in the configuration space . Wan et al. employ a 3-level planner, where the high-level planner provides a set of goal poses for the objects, the middle-level planner is a regrasp planner, and the low-level planner is a motion planner . For non-monotonic rearrangement problems (i.e., objects need moved more than once), a middle-level planner that displaced multiple objects was more efficient . Our approach is to start with a well-established regrasp (i.e., middle-level) planner and build an uncertainty capability upon it.
Ii-B Pick-place of novel objects
A few projects have considered novel-object pick-place, where the complete shapes of the objects are no longer given. The first to address this was Jiang et al.  who used random sampling with classification to identify placements that are likely to be stable and satisfy human preference. After this, we approached the problem with deep RL by learning a grasp/place value function [4, 5, 6]. Next, Manuelli et al. proposed a 4-component pipeline: (a) instance segmentation, (b) key point detection, (c) optimization-based planning for task-specific object displacements, and (d) grasp detection . Objects were minimally represented by key points, which are 3D points indicating task-relevant object parts, e.g., the top, bottom, and handle of a mug. Later, Gao and Tedrake augmented this with shape completion, which was useful for avoiding collisions when planning arm motions with the held object . Finally, Mitash et al. addressed the problem by fusing multiple sensor views and allowing a single regrasp as necessary, conservatively assuming the object is as large as its unobserved region . None of these considered multiple regrasps or compared different ways of explicitly accounting for segmentation and completion uncertainty, as we do here.
Ii-C Pick-place under uncertainty
A general approach to pick-place under arbitrary types of uncertainty is to solve a partially observable Markov decision process (POMDP). Kaelbling and Lozano-Pérez focus on symbolic planning in belief space with black-box geometric planners and state estimators . Xiao et al. use POMCP  to update their belief about the arrangement of a small set of known objects . Although the POMDP approach is very general, it requires significant computation and an accurate model of transition and sensor dynamics.
Ii-D Grasping under uncertainty
Our approach is to extend ideas from grasping under object shape uncertainty to pick-place planning. The two most common approaches to grasping under shape uncertainty are (a) evaluate force closure over an MC sampling of object shapes and (b) evaluate a probabilistic model of grasp success. Kehoe et al. took the MC approach and represented uncertainty as normally distributed polygonal vertices and center of mass with given means and variances . Hsiao et al. provide a probabilistic model for grasp success given multiple object detections and grasp quality evaluations . Soon afterward, Gaussian process implicit surfaces (GPISs) were proposed as a representation of object shape uncertainty for grasping [22, 23, 24, 25]. GPISs combine multiple observations of the object’s signed distance function (SDF) into a Gaussian process – a normal distribution over SDFs . Mahler et al. compare a probabilistic model (based on the variance of the GPIS at contact points) versus an MC approach . The MC approach does better but has higher computational cost. Laskey et al. improved the efficiency of MC sampling from the GPIS by employing multi-armed bandit techniques to reduce the number of evaluations for grasps that are unlikely to succeed . Li et al. conducted real-world experiments filtering grasps with different thresholds on variance of the GPIS at contact points . Finally, Lundell et al. represented objects as voxels, used a deep network to complete objects, and performed MC sampling using dropout .
Iii Problem Statement
Move-binary-effect system. A move-binary-effect system (cf. move-effect system ) is a discrete-time system consisting of a robotic manipulator, one or more depth sensors, and one or more objects, each situated 3D Euclidean space. The manipulator has configuration and is equipped with an effector with status empty or holding. The action of the robot is to move along a trajectory, , followed by an effector operation, either open or close. At each step, the depth sensors acquire a point cloud , sampling points on the object surfaces. The objects are rigid polyhedrons and can be either fixed or movable. At each step the robot observes , its effector status, and a point cloud and takes an action.
Rearrangement of unknown objects. Within a move-binary-effect system, given a set of goal arrangements (where an arrangement is a pose for each movable object w.r.t. a fixed frame, ), find a minimum number of actions that is guaranteed to achieve a goal arrangement.
Since objects are unknown, the set of goal arrangements cannot be specified explicitly as a list of poses. Instead, it is specified with a boolean property, e.g., using first-order logic (e.g., all bottles are upright on coasters). Reasonable variations of this problem are also possible, such find a minimum number of actions that, with given probability, achieves a goal arrangement .
As this problem is PSPACE-hard , approximate solutions are needed. Our approach is to break the problem into two subproblems: (a) find (possibly multiple) sequences of explicit, single-object displacements (independent of how it is moved by the robot) that are likely to be executable and to achieve a goal arrangement and (b) find a regrasp plan that is most likely to achieve a single-object, goal displacement. For this paper, we focus on subproblem (b); subproblem (a) is implemented task-by-task.
Iv System Overview
The proposed system for rearranging unknown objects is summarized in Fig. 1. For each perception-action cycle, the environment produces a point cloud, the geometry of the scene is estimated, a partial plan for achieving a goal arrangement is found, and the first pick-place of the plan is executed. Automatic resensing and replanning accounts for failures, similar to MPC . In this section, each component is briefly described; regrasping under segmentation and completion uncertainty – the main contribution – is described next (Section V).
The purpose of the perceptual modules is to reconstruct the geometry of the scene so we can apply geometric planning algorithms. Additionally, they must quantify their own uncertainty so that plans unlikely to succeed can be avoided. For both instance segmentation and shape completion, we have chosen point clouds as the input/output representation of objects. A point representation consumes less memory than uncompressed voxel grids, enables efficient planning, and, from our previous experience [4, 5, 6], exhibits good simulation-to-real domain transfer.
Object instance segmentation
The input to the segmentation module is a point cloud , and the output is a point cloud for each object, with , and uncertainties . Although any object instance segmentation method with this interface can be used in the proposed architecture, our implementation uses BoNet . BoNet produces an matrix, where is a predefined maximum number of objects, and each row is a distribution for each point over object ID. The column-wise maximum value of this matrix is used for , which is interpreted as the estimated probability each point is correctly segmented. (And, optionally, points with below a threshold can be omitted.)
The input to the shape completion module is a point cloud , and the output is a point cloud that is a dense sampling of points on all object faces, including faces not visible to the sensor. For robust regrasp planning, we also require an uncertainty estimate for each completed point, . Although any shape completion method with this interface can be used in the proposed architecture, our implementation uses a modified version of PCN . PCN consists of an encoder (two PointNet layers ) and a decoder (three fully connected, inner product layers
We use a 3-level planner, similar to Wan et al. .
The input to the arrangement planner is a list of completed clouds, , and the output is a set of triples , where is a target pose for object and is an associated goal cost. The reason the arrangement planner produces multiple goals for multiple objects is to increase the chances one of them is feasible. Besides, not all goals are equal: some may be more preferable to the task. For example, in bin packing, some placements will result in tighter packings than others. This is captured by the goal cost, . For this paper, we implement a different arrangement planner for each task.
The regrasp planner takes in the triples from the arrangement planner and produces a sequence of pick-places, i.e., effector poses, that displaces one object. If a regrasp plan is not found, more goals can be requested from the arrangement planner (as indicated by dashed lines in Fig. 1).
The motion planner finds a continuous motion between picks and places. Any off-the-shelf motion planner will do: we use a 3-level planner that first attempts a linear motion, then trajopt , and then RRT* with timeout . If no motion plan is found, the regrasp planner can be resumed from where it left off, but marking the infeasible section so the same solution is not found again.
V Regrasp Planning Under Uncertainty
Regrasps are needed due to kinematic constraints: the grasps at the object’s current pose may all be in collision or out of reach at the object’s goal poses. In this case, a number of temporary places (i.e., non-goal places) are needed. Our regrasp planner (Alg. 1) extends that of  to handle multiple goals for multiple objects, arbitrary additive costs, and discrete grasp/place sampling. Related planners (e.g, [3, 11, 12, 13, 15]) could also have been adapted to the purpose: the main point is to incorporate segmentation and shape completion uncertainty into the cost.
Alg. 1 is run in parallel for each object that has at least one goal pose. For steps (or until a timeout is reached), additional grasps (with costs ) and temporary placements (with costs ) are sampled, the regrasp graph is updated with the new samples, and A* with a consistent heuristic finds an optimal pick-place sequence, given the current samples. Grasps that are force-closure and temporary places that are stable, w.r.t. the completed cloud, are randomly sampled. The regrasp graph is represented by a matrix where rows refer to grasps and columns refer to places. When the object has been grasped, column changes are allowed to switch the object’s placement, and when the object has been placed, row changes are allowed to switch grasps. Matrix values are the sum of the corresponding grasp and place costs if the grasp/place combination is feasible (i.e., there is a collision-free IK solution) and infinity otherwise.
It is desirable to choose grasps that fix the actual object in the effector as the arm moves and temporary places that, when the effector releases the object, the object rests stably. This way, the pick-place plan will execute predictably. For the case of a parallel-jaw gripper, an antipodal grasp
V-a Maximize probability of regrasp plan execution success
The aim is to choose a regrasp plan that maximizes the joint probability each grasp is antipodal and each temporary place is stable, i.e., maximize Eq. 1, where is the event the th grasp is antipodal, is the event the th place is stable, and is total number of picks and places. Assuming each grasp/place is independent of previous steps in the plan, we arrive at Eq. 2
Negating Eq. 3 results in a non-negative cost. We account for additional factors in the regrasp planning problem, such as plan length and task cost, by adding these as objectives to a multi-criterion optimization problem ( pp. 181-184). Scalarization results in the cost (4), where are trade-off parameters and is the task cost associated with the goal placement (from the arrangement planner). This is the cost used by our regrasp planner. To complete the description, we next look at different ways of estimating and .
V-B Probability grasps are antipodal and places are stable
Grasp quality (GQ)
One way to estimate is via a measure of “grasp quality” evaluated on the nominal shape completion. For example, consider a distance measuring how far the line connecting the contact points is from the centers of both friction cones (cf.  pp. 233). Intuitively, if the line between contacts is near the edge of either friction cone, small errors in shape completion are likely to result in an incorrect antipodal assessment (see footnote 2).
To place this idea into our probabilistic framework, suppose for each grasp contact , the angle between the surface normal (i.e., the center of the friction cone) and the normalized, outward-pointing vector connecting the contacts is distributed according to a truncated normal distribution with mode and scale , where (Eq. 5) is derived from the nominal object shape and is given. The probability lies in the friction cone is then , where is the cumulative density function of the truncated normal distribution and is half the angle of the friction cone. We make the simplifying assumption that this probability is independent between contacts, giving Eq. 6.
The effect of the GQ estimator is simply to choose grasps that are as centered as possible in both friction cones, given the estimated object shape. The scale parameter makes the trade-off between regrasp plan length and centering of grasps: small prefers centered grasps over short plans and large prefers short plans over centered grasps.
Monte Carlo (MC)
Another approach is to estimate and via segmentation and completion samples. Suppose we are given a mechanism for sampling from the distribution , for , where is a random variable over object shapes and is the input point cloud. Such a mechanism could be implemented as an ensemble of segmentation and completion networks, e.g., multiple networks trained with different weight initializations or datasets. Or, this could be implemented as a pair of networks with randomized components, e.g., dropout (as in  for grasping under uncertainty) or VAEs. Or, the option used here for sake of comparison to the CU method, one could use the point-wise uncertainty outputs of the networks ( in Section IV-A), as follows.
For segmentation, the object ID for each point is independently sampled from the distributions given by the segmentation matrix. (To reduce noise, we only sample points whose -value is below a threshold.) For shape completion, assuming the th point’s offset from the nominal point is i.i.d. , is given by Eq. 7, since is the (estimated) probability the point is offset no more than . To summarize, to sample a shape: (a) sample a segmentation point-wise using the segmentation mask and (b) compute the shape completion given this segmentation, and then, for each point, (c) sample a direction uniformly at random and (d) sample an offset along this direction from a normal distribution with 0 mean and standard deviation given by Eq. 7.
Regardless of implementation, is estimated as and is estimated as where is the number of shape samples and #antipodal is the number shapes for which the th grasp is antipodal and #stable is the number of shapes for which the th place is stable.
Contact Uncertainty (CU)
Computing and using an MC method is expensive if is large. This motivates using the network uncertainties directly. The basic idea is to penalize grasp/place contact points with low -values. Formally, suppose for the th point, the segmentation network estimates , where is the event the th point is segmented correctly. Suppose the shape completion network estimates where is the event the th point is within Euclidean distance of a ground truth point. Assuming whether a grasp (place) is antipodal (stable) depends only on whether each contact point is correctly segmented and is within Euclidean distance of the nearest ground truth point, and assuming independence between contacts, and are estimated via Eq. 8 and 9, where contacts are explained in Fig. 3.
The uncertainty values from PCN (Section IV-A) are used to estimate . Estimating from the uncertainty values from BoNet (Section IV-A) is less straight-forward since, for each completed point we must associate a corresponding segmentation uncertainty. A heuristic we found that works well for this is, for each point in the completed cloud, take the nearest neighbor in the segmented cloud.
Success Prediction (SP)
and can also be estimated with a neural network. The encoding of grasp/place as input to the neural network is an important design choice that affects performance . Here, we encode grasps as the points from the shape completion, , inside the gripper’s closing region w.r.t. the gripper’s reference frame (cf. ). For places, the completed cloud, , is transformed to the place pose and translated with the bottom-center of the cloud at the origin. For network architecture, we use PCN with a single output with sigmoid activation, trained with the binary, cross-entropy loss. Training data is generated in simulation, so labeling ground truth antipodal/stable is straight-forward.
We evaluated the proposed system in the environment shown in Fig. 4, left, consisting of a UR5 arm, Robotiq 85 gripper, and Structure depth sensor on the following tasks:
Block arrangement. Arrange 5 rectangular blocks from tallest to shortest according to the longest edge (reminiscent of “blocks world” ).
In each case, same-category novel objects were tested. Novel-category objects were also tested for bin packing. Arrangement planners were designed separately for each task, without considering uncertainty. For brevity, we primarily discuss results for the most difficult task – bin packing. Similar trends are observed in the other tasks, but perception is more accurate, reducing the urgency to account for uncertainty.
Vi-B Simulation results
The environment was simulated by OpenRAVE  using 3DNet objects . Train/Test-1 categories included boat, bottle, box, car, dinosaur, mug, and wine glass. Test-2 (novel) categories included airplane, bowl, and stapler. Scenes were initialized without objects touching, as segmentation otherwise performs poorly. Grasps succeeded if (a) exactly 1 object intersected the hand closing region, (b) the antipodal condition with friction cone was met, and (b) the robot was collision-free. These conditions are conservative relative to reality, e.g., as the hand closes or the arm moves, an object in a non-antipodal grasp may rotate but still remain in the gripper and arrive near the goal. Unstable temporary places were recored, but otherwise ignored, because, in reality they never resulted in arrangement failures but required replanning. In Tables I and II, “Place Execution Success” refers to the proportion of successfully executed regrasp plans and “Packing Height of 5” refers to the end-of-episode packing height when 5/6 objects were placed.
Perception ablation study
The purpose of this study is to evaluate the perceptual modules in terms of pick-place performance and to quantify the potential benefit of accounting for uncertainty. We evaluate performance with ground truth perception (GT Seg. & Comp.), imperfect completion (GT Seg.), imperfect segmentation and completion (Percep.), and without shape completion (GT Seg. & No Comp.) Step and task costs were used, i.e., and in (4), where the task cost, , was the estimated final packing height in centimeters.
|GT Seg. & Comp.||GT Seg. (Train)||GT Seg. (Test-1)||Percep. (Train)||Percep. (Test-1)||GT Seg. & No Comp.|
|Place Execution Success||0.929 0.008||0.767 0.013||0.747 0.013||0.718 0.014||0.710 0.014||0.508 0.046|
|Grasp Antipodal||0.931 0.007||0.779 0.013||0.761 0.013||0.755 0.013||0.736 0.013||0.563 0.047|
|Temporary Place Stable||1.000 0.000||0.769 0.122||1.000 0.000||0.828 0.071||0.826 0.081||0.500 0.500|
|Packing height of 5 (cm)||12.27 0.315||12.36 0.331||12.18 0.306||12.37 0.447||12.44 0.307||–|
|Regrasp planning time (s)||35.62 1.103||38.46 1.115||38.68 1.141||35.76 1.059||35.05 1.077||15.86 1.482|
The results (shown in Table I) are mostly as expected. A clear drop in performance is observed as perception becomes imperfect (down 18% for imperfect completion and another 4% for imperfect segmentation). A slight drop is noticeable from train to test objects. Without shape completion, planning is crippled (not shown in table, regrasp plan found rate drops from 94.1% for Percep. Test-1 to just 10.0%). A similar but more extreme trend is seen with Test-2.
Regrasp cost comparison
We compare four different ways of evaluating probability grasps (places) are antipodal (stable) (Section V-B) to two baselines – “No Cost”, which takes the first regrasp plan found, and “Step Cost”, which includes the step cost term only ( in (4)). The step cost appears almost exclusively in the regrasping literature (e.g., [2, 3, 11, 14]). For simplicity, task cost is not included in this evaluation, but the 1st step of 16 packing solutions (top-2 solutions on each of 8 threads after 1 minute) are used as goal poses. “GQ + MC” refers to the case where GQ and MC costs are summed together.
|No Cost||Step Cost||GQ||MC||MC + GQ||CU||SP|
|Place Execution Success||0.651 0.013||0.725 0.012||0.748 0.012||0.756 0.012||0.787 0.011||0.712 0.013||0.779 0.012|
|Grasp Antipodal||0.737 0.011||0.751 0.012||0.794 0.011||0.811 0.011||0.830 0.010||0.743 0.012||0.823 0.010|
|Temporary Place Stable||0.784 0.024||0.857 0.097||0.845 0.030||0.904 0.028||0.883 0.031||0.848 0.054||0.959 0.018|
|Plan Length||2.665 0.031||2.038 0.008||2.293 0.021||2.222 0.019||2.201 0.018||2.105 0.013||2.233 0.019|
|Regrasp planning time (s)||4.904 0.230||7.201 0.393||84.56 0.827||90.10 0.892||126.5 1.029||72.00 0.835||86.61 1.040|
|Place Execution Success||0.412 0.017||0.417 0.017||0.395 0.017||0.458 0.017||0.422 0.017||0.429 0.017||0.465 0.017|
|Grasp Antipodal||0.484 0.017||0.449 0.017||0.450 0.017||0.504 0.017||0.472 0.017||0.457 0.017||0.518 0.017|
|Temporary Place Stable||0.704 0.051||0.714 0.125||0.533 0.075||0.750 0.083||0.800 0.082||0.778 0.101||0.686 0.080|
|Plan Length||2.514 0.036||2.094 0.015||2.247 0.024||2.167 0.020||2.150 0.019||2.118 0.017||2.193 0.022|
|Regrasp planning time (s)||6.030 0.237||8.484 0.408||51.61 1.113||58.56 1.064||71.38 1.333||50.92 1.177||53.35 1.159|
Results for bin packing are shown in Table II. For Test-1, GQ+MC has the best grasp performance while SP has the best temporary place stability rate. GQ+MC significantly outperforms GQ (for 1-sided, same-variance, unpaired -test, for execution success and for grasp antipodal), suggesting the network’s uncertainty estimates are useful for planning. For Test-2, SP has the best grasp performance (vs. step cost, for execution success and for grasp antipodal), and GQ+MC has the best temporary place stability rate. It was disappointing that CU did not significantly outperform the baselines for either dataset: it is apparently not sufficient to account for uncertainty only at the contact points. Also, while GQ has a significantly higher antipodal rate than the baselines for Test-1, the same is not true for Test-2, suggesting the GQ method can tolerate only small errors in shape completion.
|No Cost||Step Cost||GQ||MC||GQ + MC||CU||SP|
|Place Success||0.725 0.010||0.775 0.009||0.854 0.008||0.849 0.008||0.860 0.008||0.828 0.008||0.911 0.006|
|Grasp Antipodal||0.833 0.007||0.824 0.009||0.906 0.006||0.902 0.006||0.908 0.006||0.857 0.008||0.951 0.005|
|Temporary Place Stable||0.785 0.015||0.623 0.067||0.700 0.031||0.852 0.022||0.784 0.030||0.885 0.029||0.967 0.012|
|Plan Length||3.061 0.029||2.079 0.009||2.273 0.016||2.286 0.016||2.220 0.014||2.157 0.013||2.239 0.015|
|Regrasp planning time (s)||2.462 0.061||6.413 0.353||62.19 0.326||117.6 0.724||121.1 0.577||54.88 0.366||61.54 0.900|
For bin packing, we did not see a significant improvement for place stability over the step cost, but this is because regrasps were rare with the step cost, obscuring the significance of the results. To better test place stability, we designed a scenario with the same Test-1 objects used in bin packing, but where there is no bin and, for each episode, 1 of 5 objects, each with exactly 1 goal pose, has to be placed. The result is shown in Table III. In this case, MC, CU, and SP methods have significantly higher temporary place stability rates than no cost (which happened to do better than step cost) (, , and , respectively). Interestingly, unlike with packing, CU has a significantly higher place success rate compared to step cost (), so all methods significantly outperformed the baselines in terms of place success rate for the canonical task (for SP vs. step cost, ).
Vi-C Real robot results
The purpose of the real-world experiments is to (a) see if the perceptual components, trained with simulated data, work well with real sensor data and (b) verify the importance of uncertainty seen in simulation results. To answer part (a), no domain transfer was needed for bin packing. For blocks and bottles, BoNet (but not PCN) severely overfit to simulation data. This problem was mitigated by using the network trained for bin packing for blocks and adding simulated sensor noise for bottles. For part (b), results for bin packing with step and MC cost are shown in Table IV. Although the MC method appears to be doing better, the gap is relatively small (4.1%). This may be because many non-antipodal grasps still succeed in placing the object into the bin, as we see the grasp success rates are higher in reality. An example packing sequence and a regrasp with blocks are shown in Fig. 5.
|Place Success Rate||0.867 0.031||0.908 0.026|
|Grasp Success Rate||0.908 0.025||0.940 0.021|
|Number of Grasp Attempts||131||134|
|Number of Regrasps||11||14|
|Packing height of 5 (cm)||7.4 1.0||7.9 1.2|
We also compare bottle arrangement performance to our previous method, which uses RL to learn a pick-place policy . Many of the same bottles as before were included, but 4/15 of them were more challenging. Two of the bottles were difficult to distinguish orientation (size of tops near size of bottoms), and two were near the 8.5 cm gripper width. Results are shown in Table V. With the proposed method, all places were correct. Only the grasp success rate is lower than before, but all 3 grasp failures were with the wider bottles. Overall, we conclude the pipelined method performs much better (80% vs. 67% task success rate).
|Shape Completion||HSA |
|Number of Objects Placed||1.800 0.074||1.667 0.088|
|Task Success Rate||0.800 0.074||0.667 0.088|
|Grasp Success Rate||0.948 0.029||0.983 0.017|
|Place Success Rate||1.000 0.000||0.900 0.040|
These results demonstrate that object instance segmentation and shape completion are accurate enough to enable difficult pick-place tasks such as bin packing. However, perceptual errors are still a major cause of failures. Some of these failures can be avoided by simply not grasping or placing on object parts where uncertainty is high. We formalize this idea with four different regrasp costs which account for perceptual uncertainty, GQ, MC, CU, and SP. We find SP or a combination of GQ and MC performs best and is more robust than the classical step cost.
To guide future work, we note some important limitations with the current system. First, the regrasp planner is much slower when using a more sophisticated cost function than the step cost. This is mainly due to having to sample plenty of grasps and places, to improve the likelihood the plan is executed successfully, without having a good stopping criterion. Another issue is to identify under what conditions the overall system (Fig. 1) is guaranteed to converge to a goal arrangement if a feasible path to one exists. Finally, integrating additional views to decrease uncertainty is an important aspect that we do not examine here.
We thank Yuchen Xiao and Andreas ten Pas for reviewing an early draft of this paper and Lawson Wong and Chris Amato for discussions during early stages of this project.
- The “detailed output” layers were omitted in our implementation, and the CD loss was used for the shape completion branch. (See .)
- A parallel-jaw gripper forms an antipodal grasp on an object iff the line connecting the contact points lies inside both friction cones ( p. 233).
- Assuming knowledge that a previous grasp/place was successful does not decrease the joint probability of success, Eq. 2 is a lower bound.
- M. Mason, “Toward robotic manipulation,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 1–28, 2018.
- P. Tournassoud, T. Lozano-Pérez, and E. Mazer, “Regrasping,” in IEEE Int’l Conf. on Robotics and Automation, vol. 4, 1987, pp. 1924–1928.
- R. Alami, T. Siméon, and J.-P. Laumond, “A geometrical approach to planning manipulation tasks. the case of discrete placements and grasps,” in Int’l Symp. on Robotics Research. Cambridge, MA, USA: MIT Press, 1991, pp. 453–463.
- M. Gualtieri, A. ten Pas, and R. Platt, “Pick and place without geometric object models,” in IEEE Int’l Conf. on Robotics and Automation, 2018.
- M. Gualtieri and R. Platt, “Learning 6-DoF grasping and pick-place using attention focus,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87, Oct 2018, pp. 477–486.
- ——, “Learning manipulation skills via hierarchical spatial attention,” IEEE Transactions on Robotics, vol. 36, no. 4, pp. 1067–1078, 2020.
- L. Manuelli, W. Gao, P. Florence, and R. Tedrake, “kPAM: Keypoint affordances for category-level robotic manipulation,” in Int’l Symp. on Robotics Research, 2019.
- L. P. Kaelbling, M. Littman, and A. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial intelligence, vol. 101, no. 1-2, pp. 99–134, 1998.
- L. P. Kaelbling and T. Lozano-Pérez, “Integrated task and motion planning in belief space,” The Int’l Journal of Robotics Research, vol. 32, no. 9-10, pp. 1194–1227, 2013.
- Y. Xiao, S. Katt, A. ten Pas, S. Chen, and C. Amato, “Online planning for target object search in clutter under partial observability,” in IEEE Int’l Conf. on Robotics and Automation, 2019, pp. 8241–8247.
- R. Alami, J.-P. Laumond, and T. Siméon, “Two manipulation planning algorithms,” in Proceedings of the Workshop on Algorithmic Foundations of Robotics. A. K. Peters, Ltd., 1995, pp. 109–125.
- C. Nielsen and L. Kavraki, “A two level fuzzy prm for manipulation planning,” in IEEE/RSJ Int’l Conf. on Intelligent Robots and Systems, vol. 3, 2000, pp. 1716–1721.
- K. Hauser and V. Ng-Thow-Hing, “Randomized multi-modal motion planning for a humanoid robot manipulation task,” The Int’l Journal of Robotics Research, vol. 30, no. 6, pp. 678–698, 2011.
- W. Wan, H. Igawa, K. Harada, H. Onda, K. Nagata, and N. Yamanobe, “A regrasp planning component for object reorientation,” Autonomous Robots, vol. 43, no. 5, pp. 1101–1115, 2019.
- A. Krontiris and K. Bekris, “Dealing with difficult instances of object rearrangement.” in Robotics: Science and Systems, 2015.
- Y. Jiang, C. Zheng, M. Lim, and A. Saxena, “Learning to place new objects,” in Int’l Conf. on Robotics and Automation, 2012, pp. 3088–3095.
- W. Gao and R. Tedrake, “kPAM-SC: Generalizable manipulation planning using keypoint affordance and shape completion,” arXiv preprint arXiv:1909.06980, 2019.
- C. Mitash, R. Shome, B. Wen, A. Boularias, and K. Bekris, “Task-driven perception and manipulation for constrained placement of unknown objects,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5605–5612, 2020.
- D. Silver and J. Veness, “Monte-Carlo planning in large POMDPs,” in Advances in Neural Information Processing Systems, 2010, pp. 2164–2172.
- B. Kehoe, D. Berenson, and K. Goldberg, “Toward cloud-based grasping with uncertainty in shape: Estimating lower bounds on achieving force closure with zero-slip push grasps,” in Int’l Conf. on Robotics and Automation. IEEE, 2012, pp. 576–583.
- K. Hsiao, M. Ciocarlie, and P. Brook, “Bayesian grasp planning,” in ICRA Workshop on Mobile Manipulation, 2011.
- S. Dragiev, M. Toussaint, and M. Gienger, “Gaussian process implicit surfaces for shape estimation and grasping,” in IEEE Int’l Conf. on Robotics and Automation. IEEE, 2011, pp. 2845–2850.
- J. Mahler, S. Patil, B. Kehoe, J. van den Berg, M. Ciocarlie, P. Abbeel, and K. Goldberg, “GP-GPIS-OPT: Grasp planning with shape uncertainty using gaussian process implicit surfaces and sequential convex programming,” in IEEE Int’l Conf. on Robotics and Automation. IEEE, 2015, pp. 4919–4926.
- M. Laskey, J. Mahler, Z. McCarthy, F. Pokorny, S. Patil, J. van den Berg, D. Kragic, P. Abbeel, and K. Goldberg, “Multi-armed bandit models for 2D grasp planning with uncertainty,” in IEEE Int’l Conf. on Automation Science and Engineering. IEEE, 2015, pp. 572–579.
- M. Li, K. Hang, D. Kragic, and A. Billard, “Dexterous grasping under shape uncertainty,” Robotics and Autonomous Systems, vol. 75, pp. 352–364, 2016.
- J. Lundell, F. Verdoja, and V. Kyrki, “Robust grasp planning over uncertain shape completions,” in IEEE/RSJ Int’l Conf. on Intelligent Robots and Systems, 2019, pp. 1526–1532.
- G. Wilfong, “Motion planning in the presence of movable obstacles,” Annals of Mathematics and Artificial Intelligence, vol. 3, no. 1, pp. 131–150, 1991.
- M. Morari and J. Lee, “Model predictive control: past, present and future,” Computers & Chemical Engineering, vol. 23, no. 4-5, pp. 667–682, 1999.
- B. Yang, J. Wang, R. Clark, Q. Hu, S. Wang, A. Markham, and N. Trigoni, “Learning object bounding boxes for 3D instance segmentation on point clouds,” in Advances in Neural Information Processing Systems, 2019, pp. 6737–6746.
- W. Yuan, T. Khot, D. Held, C. Mertz, and M. Hebert, “PCN: Point completion network,” in Int’l Conf. on 3D Vision. IEEE, 2018, pp. 728–737.
- C. Qi, H. Su, K. Mo, and L. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Conf. on Computer Vision and Pattern Recognition. IEEE, 2017, pp. 652–660.
- J. Schulman, J. Ho, A. Lee, I. Awwal, H. Bradlow, and P. Abbeel, “Finding locally optimal, collision-free trajectories with sequential convex optimization.” in Robotics: Science and Systems, vol. 9, no. 1, 2013, pp. 1–10.
- S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” The Int’l Journal of Robotics Research, vol. 30, no. 7, pp. 846–894, 2011.
- R. Murray, Z. Li, and S. Sastry, A mathematical introduction to robotic manipulation. CRC press, 1994.
- S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
- M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt, “High precision grasp pose detection in dense clutter,” in IEEE Int’l Conf. on Intelligent Robots and Systems, 2016.
- H. Liang, X. Ma, S. Li, M. Görner, S. Tang, B. Fang, F. Sun, and J. Zhang, “PointNetGPD: Detecting grasp configurations from point sets,” in 2019 Int’l Conf. on Robotics and Automation. IEEE, 2019, pp. 3629–3635.
- D. Chapman, “Penguins can make cake,” AI magazine, vol. 10, no. 4, pp. 45–45, 1989.
- G. Wäscher, H. Haußner, and H. Schumann, “An improved typology of cutting and packing problems,” European journal of operational research, vol. 183, no. 3, pp. 1109–1130, 2007.
- F. Wang and K. Hauser, “Stable bin packing of non-convex 3D objects with a robot manipulator,” in Int’l Conf. on Robotics and Automation. IEEE, 2019, pp. 8698–8704.
- R. Diankov, “Automated construction of robotic manipulation programs,” Ph.D. dissertation, Robotics Institute, Carnegie Mellon University, 2010.
- W. Wohlkinger, A. Aldoma, R. Rusu, and M. Vincze, “3DNet: Large-scale object class recognition from CAD models,” in IEEE Int’l Conf. on Robotics and Automation, 2012, pp. 5384–5391.