CAGE: Context-Aware Grasping Engine
Semantic grasping is the problem of selecting stable grasps that are functionally suitable for specific object manipulation tasks. In order for robots to effectively perform object manipulation, a broad sense of context, including object and task constraints, needs to be accounted for. We introduce the Context-Aware Grasping Engine, a neural network structure based on the Wide & Deep model, to learn suitable semantic grasps from data. We quantitatively validate our approach against three prior methods on a novel dataset consisting of 14,000 semantic grasps for 44 objects, 7 tasks, and 6 different object states. Our approach outperformed all baselines by statistically significant margins, producing new insights into the importance of balancing memorization and generalization of contexts for semantic grasping. We further demonstrate the effectiveness of our approach on robot experiments in which the presented model successfully achieved 31 of 32 grasps.
Many methods have been developed to facilitate stable grasping of objects for robots [22, 24, 23, 28], some with optimality guarantees. However, stability is only the first step towards successful object manipulation. A grasp choice should also be based on a broader sense of context, such as object constraints (e.g., shape, material, and function) and task constraints (e.g., force and mobility). Considering all these factors allows us to grasp objects intelligently, such as grasping the closed blades of scissors when handing them to another person. In robotics, semantic grasping is the problem of selecting stable grasps that are functionally suitable for specific object manipulation tasks.
Most works on semantic grasping ground task-specific grasps to low-level features. Such features include object convexity, tactile data, and visual features. Leveraging similarities in low-level features allows grasps to be generalized between similar objects (e.g., from grasping a short round cup for pouring water to grasping a large tall cup for the same task). However, discovering structure in high-dimensional and irregular low-level features is hard, which has prevented these methods from generalizing to a wider range of objects and tasks.
As a supplement to other types of information, such as geometrical or topological data, semantic data has been shown to improve robot reasoning and knowledge inference across many robot applications, including task planning, plan repair, and semantic localization. Semantic information has also been used for semantic grasping, in the form of handwritten rules. For example, in one such approach, the suitable grasp for handing over a cup is defined to be on the middle of the body of the cup. However, manually specifying every rule for different objects and tasks is not scalable, and limits adaptation and generalization to new situations.
In this work, we address the problem of semantic grasping by first extracting abstract object semantics, such as affordances of object parts and object materials. We then apply the Wide & Deep model [8] to learn the relations between extracted semantic features and semantic grasps from data. The wide part of the model facilitates memorization of feature interactions while the deep part aids generalization to novel feature combinations. Jointly, the model is capable of capturing the complex reasoning patterns for selecting suitable grasps and also adapting to novel object and task combinations. Our work makes the following contributions:
- We introduce a novel semantic representation that incorporates affordance, material, object state, and task to inform grasp selection and promote generalization.
- We apply a neural network structure, based on the Wide & Deep model, to learn suitable grasps from data.
- We contribute a unique grasping dataset, SG14000, consisting of 14,000 semantic grasps for 44 objects, 7 tasks, and 6 different object states.
We quantitatively validate our approach against three prior methods on the above dataset. Our results show statistically significant improvements over existing approaches to semantic grasping. We further analyze how the semantic representation of objects facilitates generalization between object instances and object classes. Finally, we demonstrate how a mobile manipulator, the Fetch robot, can extract and reason about semantic information to execute semantically correct grasps on everyday objects.
II Related Work
Below we discuss previous approaches to semantic grasping along with related works in affordance and material detection.
II-A Semantic Grasping
Multiple previous works have learned semantic grasps in a data-driven manner, but they have relied on low-level or narrowly scoped features, making generalization challenging. In [11], semantic grasps appropriate for different tasks were learned by storing the visual and tactile data associated with each grasp. However, this example-based approach is limited in its ability to transfer grasps to different object categories. To improve generalization, Song et al. [32] used Bayesian Networks to model the relations between objects, grasps, and tasks; Hjelm et al. [15] learned a discriminative model based on visual features of objects. Instead of relying on low-level features, Aleotti et al. [1] modeled semantic grasps with object parts obtained from topological shape decomposition and learned the parts that should be grasped at different stages of a task. Lakani et al. [19] learned the co-occurrence frequency of manipulative affordances (affordances associated with the tasks) and executive affordances (affordances of the parts which need to be grasped). These last two works are closest in spirit to our approach, but we reason over a wider variety of contextual information.
Other works have used more abstract semantic representations, but leveraged rule-based inference mechanisms that must be encoded manually. Antanas et al. [3] encode suitable grasp regions based on probabilistic logic descriptions of tasks and segmented object parts. Works on object affordance detection [20, 9, 13, 26, 27] have leveraged the affordances of object parts to define the correspondences between affordances and grasp types (e.g., rim grasp for parts with the contain or scoop affordance). Detry et al. [12] built on these works by training a separate detection model using data generated by predefined heuristics to detect suitable grasp regions for each task. Kokic et al. [18] assigned preferences to affordances in different tasks, which are then used to determine grasp regions to use or avoid. Instead of manually specifying semantic grasps based on semantic representations, we learn the relations between these semantic features and suitable grasps from data.
II-B Affordance Detection
Affordance detection has been defined as the problem of pixel-wise labeling of object parts by functionality. Myers et al. [25] first proposed to use hand-crafted features to detect affordances from 2.5D data. Nguyen et al. [26] later learned useful features from data with an encoder-decoder structure. Building on these methods, Nguyen et al. [27] and Do et al. [13] developed models to simultaneously perform object detection and affordance segmentation. Joint training on these two objectives helps their models achieve state-of-the-art performance. Sawatzky et al. [30] tried to reduce the cost of labeling with weakly supervised learning. Chu et al. [9] explored transferring from synthetic data to real environments with unsupervised domain adaptation. We are the first to leverage part affordances to learn semantic grasping from data.
II-C Material Detection
Material detection is the problem of inferring an object’s constituent materials from observations of the object. While most approaches seek to identify material classes from images [31, 16, 5], in more recent work, Erickson et al. [14] used spectral data obtained from a hand-held spectrometer for material classification, achieving a validation accuracy of 94.6%. Previous works in robotics [21] have successfully used this approach to reason about object materials when constructing tools. We also leverage this approach to infer materials of objects.
III Problem Definition
Given a set of grasps $G$, our objective is to model the probability distribution $P(G \mid C)$, such that the probability is maximized for a grasp that is suitable for the context $C$. In this work, we model contexts as consisting of information related to both the objects, $O$, and the tasks, $T$. To avoid the computational complexity of directly modeling $P(G \mid C)$, we simplify the problem by learning a discriminative model for $P(L \mid G, C)$, where $L$ is a class label indicating whether grasps in $G$ are suitable for contexts in $C$. With this discriminative model, we can rank grasp candidates for various contexts $C$, ensuring that more suitable grasp candidates are ranked higher than less suitable ones given a context, i.e., $P(l \mid g_i, c) > P(l \mid g_j, c)$ if $g_i$ is more suitable than $g_j$ for the context $c$.
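As a minimal sketch of the ranking step described above, the snippet below sorts grasp candidates by the probability that a discriminative model assigns to the "suitable" label. The function name `rank_grasps`, the `score_fn` interface, and the toy scores are illustrative assumptions and are not part of the original system.

```python
from typing import Callable, Hashable, List, Sequence


def rank_grasps(
    grasps: Sequence[Hashable],
    context: Hashable,
    score_fn: Callable[[Hashable, Hashable], float],
) -> List[Hashable]:
    """Return grasps sorted from most to least suitable for the context.

    score_fn(grasp, context) stands in for the learned probability that
    the grasp is suitable in the given context (hypothetical interface).
    """
    return sorted(grasps, key=lambda g: score_fn(g, context), reverse=True)


# Toy scores standing in for a trained model (made-up values).
toy_scores = {
    ("handle", "pour"): 0.9,
    ("body", "pour"): 0.6,
    ("opening", "pour"): 0.05,
}
ranking = rank_grasps(
    ["opening", "handle", "body"],
    "pour",
    lambda g, c: toy_scores[(g, c)],
)
```

Only the relative order of the scores matters for ranking, so the model does not need to recover the full distribution over grasps.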
IV Context-Aware Grasping Engine
To address the above problem, we present the Context-Aware Grasping Engine (CAGE) (Figure 2), which takes as input the context $c$ and a set of grasp candidates $G$, and outputs a ranking of grasps ordered by their suitability to the context $c$. In this work, we model context as the label of the task being performed and the following information about object $o$: a point cloud of the object, a spectral reading of the object from the SCiO sensor for material recognition, and a label specifying the object state. (In the current implementation, the task and object state are hand labeled. In general, task information can be obtained from the robot’s task plan. Object state can be obtained through techniques such as [2] or [35].) For each grasp $g \in G$, CAGE extracts semantic features from both the context inputs and the grasp, forming the vector $s$. In this formulation, $s$ serves as a unified abstraction of $g$ and $c$, allowing us to model $P(l \mid g, c)$ as $P(l \mid s)$. Finally, we use $s$ as input into the Semantic Grasp Network, which ranks grasps based on $P(l \mid s)$. In Section V, we describe the process for extracting the semantic features $s$. Then in Section VI, we present details of the semantic grasp network.
V Semantic Representations
In this section, we introduce semantic representations of objects and grasps extracted from context inputs.
V-A Affordances of Object Parts
We use the affordance of each object part as one of the semantic features to capture constraints on suitable grasp regions. Part affordances present an abstract representation to help generalize among object instances and classes (e.g., the graspable affordance can be observed in most typical spatulas, and also in other objects such as knives and pans). Compared to other topological decompositions of objects (e.g., curvature-based segmentation [34]), the focus on functionalities also makes part affordances more suitable for determining task-dependent grasps.
We use the state-of-the-art affordance detection model, AffordanceNet [13], to extract part affordances. AffordanceNet takes as input an RGB image of the scene and simultaneously outputs class labels, bounding boxes, and part affordances of the detected object. The affordance prediction is produced as a 2D segmentation mask; however, grasp candidates often correspond to 3D points. Therefore, we superpose the affordance segmentation mask with the object point cloud segmented by plane model segmentation (we use code from http://wiki.ros.org/rail_segmentation) to map affordance labels from 2D pixels to 3D points.
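The mask-to-point-cloud mapping described above can be sketched as follows, under the assumption that the point cloud is organized, meaning it stores one 3D point per image pixel and is aligned with the RGB image. The function name `label_points` and all array shapes are illustrative; they are not taken from rail_segmentation or AffordanceNet.

```python
import numpy as np


def label_points(affordance_mask, organized_cloud, object_mask):
    """Attach a 2D affordance label to every valid 3D object point.

    affordance_mask: (H, W) int array of per-pixel affordance ids.
    organized_cloud: (H, W, 3) float array; NaN marks missing depth.
    object_mask:     (H, W) bool array from object segmentation.
    Returns (points, labels) with shapes (M, 3) and (M,).
    """
    # Keep pixels that belong to the object and have a finite 3D point.
    valid = object_mask & np.isfinite(organized_cloud).all(axis=2)
    return organized_cloud[valid], affordance_mask[valid]


# Tiny 2x2 example: one pixel lacks depth and one lies off the object.
affordance_mask = np.array([[1, 1], [2, 0]])
organized_cloud = np.zeros((2, 2, 3))
organized_cloud[0, 1] = np.nan
object_mask = np.array([[True, True], [True, False]])
points, labels = label_points(affordance_mask, organized_cloud, object_mask)
```

If the cloud is not pixel-aligned, the 3D points would instead need to be projected into the image with the camera intrinsics before looking up the mask.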
V-B Materials of Object Parts
In addition to affordances of object parts, we extract materials of object parts as another semantic feature to help rank grasp candidates. AffordanceNet and other affordance detection methods do not explicitly take material into account. Therefore, adding materials helps refine the semantics of objects and facilitates more complex reasoning (e.g., grasping the part of a knife with the cut affordance is not prohibited if the part is made of plastic).
In order to infer the materials of object parts, we follow the approach of [14] and use spectral readings of objects. A handheld spectrometer, the SCiO sensor, is used by the robot to obtain spectral readings of objects at various poses. Each spectral scan from the SCiO returns a 331-D vector of real values that can be classified by the neural network architecture in [14].
Given a set of grasp candidates, the semantic meaning of each grasp is determined based on the grasp’s relation to the object. Specifically, for each grasp, we assign the affordance and material of the object part closest to the grasp as the semantic representation of the grasp, which we call grasp affordance and grasp material, respectively. Modeling these two semantic features has proven to be extremely effective in our experiments, as many simple semantic grasping heuristics can be captured solely based on these two features. For example, grasps with the grasp affordance open are usually not suitable for the pouring task.
In order to extract the grasp affordance and grasp material, we use a kd-tree to efficiently search for the point on the object point cloud closest to the center of the grasp. The affordance and material of the object part on which that point lies are then assigned to the grasp.
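The nearest-point lookup described above can be sketched with a small pure-Python kd-tree. This is an illustrative sketch, not the authors' implementation: the helper names (`build_kdtree`, `nearest`, `grasp_semantics`), the per-point label lists, and the toy coordinates are all assumptions.

```python
from typing import List, Optional, Sequence

Point = Sequence[float]
# A node is (point index, split axis, left subtree, right subtree).
Node = Optional[tuple]


def build_kdtree(points: Sequence[Point], indices: List[int], depth: int = 0) -> Node:
    """Recursively split point indices at the median along alternating axes."""
    if not indices:
        return None
    axis = depth % 3
    ordered = sorted(indices, key=lambda i: points[i][axis])
    mid = len(ordered) // 2
    return (
        ordered[mid],
        axis,
        build_kdtree(points, ordered[:mid], depth + 1),
        build_kdtree(points, ordered[mid + 1:], depth + 1),
    )


def nearest(node: Node, points, query, best=None):
    """Return (squared distance, index) of the point closest to query."""
    if node is None:
        return best
    index, axis, left, right = node
    dist2 = sum((p - q) ** 2 for p, q in zip(points[index], query))
    if best is None or dist2 < best[0]:
        best = (dist2, index)
    diff = query[axis] - points[index][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, points, query, best)
    # Only cross the splitting plane if a closer point could lie beyond it.
    if diff ** 2 < best[0]:
        best = nearest(far, points, query, best)
    return best


def grasp_semantics(points, affordances, materials, grasp_center):
    """Assign the affordance and material of the closest object point."""
    tree = build_kdtree(points, list(range(len(points))))
    _, index = nearest(tree, points, grasp_center)
    return affordances[index], materials[index]


# Toy object: two points on a plastic handle, two on a metal container.
cloud = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1), (0.1, 0.0, 0.0), (0.1, 0.0, 0.1)]
point_affordances = ["grasp", "grasp", "contain", "contain"]
point_materials = ["plastic", "plastic", "metal", "metal"]
result = grasp_semantics(cloud, point_affordances, point_materials, (0.09, 0.01, 0.02))
```

In practice the tree would be built once per object and reused for every grasp candidate, which is where the kd-tree pays off over a linear scan.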
We combine the extracted semantic features above to create the unified semantic representation $s$. Including task, object state, grasp affordance, grasp material, part affordances, and part materials, $s$ has a dimension of $4 + 2N$, where $N$ is the number of parts.
VI Semantic Grasp Network
VI-A Computing Object Embedding
To effectively reason about the semantics of contexts, a model needs both memorization and generalization capabilities. Memorization facilitates forming reasoning patterns based on combinations of semantic features (e.g., combining the open part affordance, cup object, and pour task to avoid grasping the opening part of a cup when pouring liquid). Generalization helps unify previously seen and new contexts (e.g., generalizing the avoidance of grasps on the opening part of a bottle from the pour task to scoop and handover). Additionally, combining these two capabilities allows the model to refine generalized rules with exceptions (e.g., grasping the opening part of the cup for handover is fine if the cup is empty).
In this work, we present a solution utilizing the Wide & Deep model [8] to predict the probability $P(l \mid s)$ of a suitability label $l$ based on the extracted semantic features $s$ of the context and the grasp. The wide component of the model promotes memorization of feature combinations with a linear model, and the deep component aids generalization by learning dense embeddings of features. Below, we present details of both components.
VI-B The Wide Component
The wide component captures the frequent co-occurrence of semantic features and target labels. This part of the model takes in one-hot encodings of sparse features, which include the features representing the task, object state, grasp affordance, and grasp material. As illustrated in Figure 3 (left), the wide component is a generalized linear model of the following form:
$$y = \mathbf{w}^{\top} \mathbf{x} + b$$
where $y$ is the prediction, $\mathbf{x}$ is the concatenated encodings of all sparse features, $\mathbf{w}$ is the vector of model parameters, and $b$ is the bias.
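To make the wide component concrete, the sketch below one-hot encodes the four sparse features and applies a linear model to the concatenated encoding. The task, state, and material vocabularies follow Table I; the affordance list is an assumed subset, and the weights are made-up values rather than learned parameters. Cross-product feature transformations, which the original Wide & Deep paper also feeds to its wide part, are omitted here.

```python
TASKS = ["pour", "scoop", "poke", "cut", "lift", "hammer", "handover"]
STATES = ["hot", "cold", "empty", "filled", "lid on", "lid off"]
MATERIALS = ["plastic", "metal", "ceramic", "glass", "stone", "paper", "wood"]
# Assumed subset of affordance labels, for illustration only.
AFFORDANCES = ["grasp", "contain", "open", "cut", "support"]


def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot list over its vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def encode_sparse(task, state, grasp_affordance, grasp_material):
    """Concatenate the one-hot encodings of the four sparse features."""
    return (
        one_hot(task, TASKS)
        + one_hot(state, STATES)
        + one_hot(grasp_affordance, AFFORDANCES)
        + one_hot(grasp_material, MATERIALS)
    )


def wide_output(x, weights, bias):
    """Generalized linear model: dot product of weights and input plus bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


x = encode_sparse("pour", "filled", "open", "glass")
weights = [0.0] * len(x)
# Hand-set penalty on the "open" grasp affordance (made-up value).
weights[len(TASKS) + len(STATES) + AFFORDANCES.index("open")] = -2.0
y = wide_output(x, weights, bias=0.5)
```

Because each weight is tied to one specific feature value, a linear model of this kind can memorize frequently observed feature-label co-occurrences directly.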
VI-C The Deep Component
The deep component achieves generalizations of feature combinations by learning a low-dimensional dense embedding vector for each feature. For example, despite never encountering the pair of the pound task and the support affordance before, the network can still generalize to this new combination because the embedding vectors of both elements have been paired and trained individually with other affordances and tasks.
The deep component is a feed-forward neural network, as shown in Figure 3 (right). Sparse features are first converted into dense embedding vectors. The embedding vectors are initialized randomly and the values are learned through training. The embedding vectors, along with the object embedding (see Sec. VI-A) and other dense features (e.g., visual features), are combined to create the input into the hidden layers of the neural network. Each hidden layer has the following form:
$$\mathbf{a}^{(k+1)} = f\left(\mathbf{W}^{(k)} \mathbf{a}^{(k)} + \mathbf{b}^{(k)}\right)$$
where $k$ is the layer number, $f$ is the activation function, $\mathbf{W}^{(k)}$ and $\mathbf{b}^{(k)}$ are the neural network parameters, and $\mathbf{a}^{(k)}$ is the activation. The first activation is the input, i.e., $\mathbf{a}^{(0)}$ is the combined input vector described above.
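The forward pass of the deep component can be sketched as an embedding lookup followed by a stack of fully connected layers. The ReLU activation, the layer sizes, and the random weights below are assumptions made for illustration; the text does not specify them.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def deep_forward(x, layers, activation=relu):
    """Apply each hidden layer in turn: a <- activation(W @ a + b)."""
    a = x
    for W, b in layers:
        a = activation(W @ a + b)
    return a


rng = np.random.default_rng(0)
# Embedding table: 7 tasks mapped to 4-D dense vectors (assumed sizes).
task_embeddings = rng.normal(size=(7, 4))
task_vector = task_embeddings[0]  # look up the embedding of task id 0

# Two hidden layers with randomly initialized parameters (illustrative).
layers = [
    (rng.normal(size=(8, 4)), np.zeros(8)),
    (rng.normal(size=(3, 8)), np.zeros(3)),
]
final_activation = deep_forward(task_vector, layers)
```

During training, the embedding table rows would be updated together with the layer weights, which is what lets unseen feature combinations reuse embeddings learned from other pairings.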
VI-D Object Embedding
In this section, we discuss in detail how to create the object embedding that encodes semantic information of object parts. A naive approach is to simply concatenate embeddings of all parts. However, this approach can lead to features that are difficult to generalize because objects consist of different numbers of parts and a canonical order of parts for concatenation is undetermined.
Inspired by Graph Neural Networks [37], we instead create the object embedding with propagation and pooling. The semantic features of each part, including part affordance and part material, are first mapped to dense embedding vectors and then together passed through a propagation function, creating an embedding $\mathbf{h}_i$ for each part. All part embeddings are then combined through a pooling function (we use average pooling) to create the object embedding $\mathbf{e}_o$. Formally,
$$\mathbf{h}_i = f\left(\mathbf{W}_p \mathbf{p}_i + \mathbf{b}_p\right), \qquad \mathbf{e}_o = \frac{1}{N} \sum_{i=1}^{N} \mathbf{h}_i$$
where $\mathbf{p}_i$ represents the concatenated embeddings of a part’s affordance and material, $N$ is the number of parts for the object, $f$ is the activation function, and $\mathbf{W}_p$ and $\mathbf{b}_p$ are the parameters of the propagation function.
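The propagation-and-pooling idea can be sketched in a few lines. The ReLU activation, the dimensions, and the random inputs are assumptions made for illustration. The example checks the two properties that motivate this design over concatenation: the output size does not depend on the number of parts, and the result does not depend on part order.

```python
import numpy as np


def object_embedding(part_features, W, b):
    """Propagate each part through a shared layer, then average-pool.

    part_features: (N, D) array, one row per part (concatenated affordance
                   and material embeddings).
    W, b:          (H, D) weights and (H,) bias of the propagation function.
    Returns an (H,) object embedding.
    """
    part_embeddings = np.maximum(part_features @ W.T + b, 0.0)  # ReLU (assumed)
    return part_embeddings.mean(axis=0)


rng = np.random.default_rng(0)
W, b = rng.normal(size=(5, 6)), np.zeros(5)
two_parts = rng.normal(size=(2, 6))
three_parts = rng.normal(size=(3, 6))

embedding_two = object_embedding(two_parts, W, b)
embedding_three = object_embedding(three_parts, W, b)
embedding_three_reordered = object_embedding(three_parts[::-1], W, b)
```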
VI-E Joint Training of Wide and Deep Components
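To combine the wide and deep components, we compute a weighted sum of the outputs of both components. The sum is then fed into a softmax function to predict the probability that a given grasp $g$ has label $l$. For a multi-class prediction problem, the model’s prediction is:
$$P(l \mid g, c) = \mathrm{softmax}\left(\mathbf{W}_{\text{wide}}^{\top} \mathbf{x} + \mathbf{W}_{\text{deep}}^{\top} \mathbf{a}^{\text{final}} + \mathbf{b}\right)$$
where $\mathbf{W}_{\text{wide}}$ and $\mathbf{W}_{\text{deep}}$ are the model parameters of the wide and the deep components, $\mathbf{b}$ is the bias term, $\mathbf{x}$ is the concatenated sparse-feature encoding fed to the wide component, and $\mathbf{a}^{\text{final}}$ is the final activations of the deep component.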
The combined model is trained end to end with the training set of contexts and grasps, from which semantic features are extracted, as well as the corresponding ground-truth labels. Our training objective is to minimize the negative log-likelihood of the training data. We used Adam [17] for optimization with default parameters (learning rate $= 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$). We trained the models for the full 150 epochs.
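The joint prediction and the training objective can be sketched as follows. The parameter shapes, the three-label setup, and the zero-initialized weights are assumptions chosen so that the example is easy to check by hand; they do not reflect trained values.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()


def joint_predict(x_wide, a_deep, W_wide, W_deep, bias):
    """Sum the wide and deep contributions, then normalize with softmax.

    x_wide: (Dw,) sparse-feature encoding for the wide component.
    a_deep: (Dd,) final activations of the deep component.
    W_wide: (Dw, K), W_deep: (Dd, K), bias: (K,) for K class labels.
    """
    logits = W_wide.T @ x_wide + W_deep.T @ a_deep + bias
    return softmax(logits)


def negative_log_likelihood(probs, label):
    """Loss for one training example with ground-truth class index label."""
    return -np.log(probs[label])


num_labels = 3  # e.g. not suitable, suitable, very suitable
x_wide = np.array([1.0, 0.0, 1.0, 0.0])
a_deep = np.array([0.5, 0.2])
W_wide = np.zeros((4, num_labels))
W_deep = np.zeros((2, num_labels))
bias = np.zeros(num_labels)

probs = joint_predict(x_wide, a_deep, W_wide, W_deep, bias)
loss = negative_log_likelihood(probs, label=1)
```

With all parameters at zero the model is maximally uncertain, so every label receives probability 1/3 and the loss equals log 3. Gradient-based training with an optimizer such as Adam lowers this loss by updating both sets of weights at once.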
VII Experimental Setup
In this section we discuss the details of our novel semantic grasping dataset that is used to validate CAGE, the baseline approaches, and the evaluation metric.
Table I: Summary of the SG14000 dataset.
| Object classes | cup, spatula, bowl, pan, bottle |
| Materials | plastic, metal, ceramic, glass, stone, paper, wood |
| Tasks | pour, scoop, poke, cut, lift, hammer, handover |
| States | hot, cold, empty, filled, lid on, lid off |
VII-A SG14000 Dataset
In order to establish a benchmark dataset for semantic grasping, we collected a dataset of 14,000 semantic grasps for 700 different contexts (available at github.com/wliu88/rail_semantic_grasping). A summary of the dataset is shown in Table I. As shown in Figure 4, we chose objects with a variety of materials, shapes, and appearances to reflect the diversity of objects robots might encounter in real environments. For each object, we first sampled 20 grasp candidates with the Antipodal grasp sampler [33]. Then we labeled all grasps as very suitable, suitable, or not suitable for a given context. Using three class labels [12] allows methods to directly model positive and negative preferences (e.g., for handing over a cup filled with hot tea, grasping the body is preferred and grasping the opening should be avoided).
We compare CAGE to the following three baselines that are representative of existing works:
- Classification of Visual Features (VF) [15] performs semantic grasping based on low-level visual features. The features include object descriptors that represent object shape, elongatedness, and volume; visual descriptors such as image intensity and image gradients; and grasp features that encode grasping position and orientation. This method ranks grasp candidates based on prediction scores.
- Frequency Table of Affordances (FT) performs semantic grasping based on part affordances. This benchmark is adapted from [19], with the difference that our representation replaces executive affordances with contexts. This method learns the co-occurrence frequency of grasp affordances and contexts. More frequently used grasp affordances are ranked higher.
VII-C Evaluation Metric
We use Average Precision (AP) for evaluating the performance of all methods. As a single number that summarizes the Precision-Recall curve, AP is widely used for assessing ranking systems. An AP score of 1 indicates that all suitable grasps are ranked higher than unsuitable grasps. Mean Average Precision (MAP) is reported.
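For reference, the sketch below computes AP for a single ranked list and MAP over several lists, using the standard definition with binary relevance. Treating each grasp as simply suitable or unsuitable is an assumption made here; the text does not say how the three class labels are mapped onto relevance.

```python
def average_precision(ranked_relevance):
    """AP of one ranked list; entries are True for suitable grasps.

    Precision is measured at the rank of each suitable grasp and the
    values are averaged over all suitable grasps in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: the mean of AP over several ranked lists (e.g. one per context)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


perfect = average_precision([True, True, False])   # suitable grasps first
imperfect = average_precision([False, True, True])  # an unsuitable grasp leads
overall = mean_average_precision([[True, True, False], [False, True, True]])
```

In the second list the suitable grasps sit at ranks 2 and 3, giving precisions of 1/2 and 2/3 and an AP of 7/12.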
In this section, we compare our method to the baselines on three experiments. (Wherever data is split into training and test sets, we report results averaged over 10 random splits. Statistical significance was determined using a paired t-test, with each experiment treated as paired data; *, **, and *** mark increasingly strict significance levels.) We then demonstrate the effectiveness of our method on a real robot.
VIII-A Context-Aware Grasping
In this experiment, we examine each method’s ability to successfully rank suitable grasps under different contexts. We split the dataset by context, with 70% training and 30% testing.
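A split by context, as opposed to a split by individual grasp, keeps every grasp of a given context on the same side of the split, so test contexts are never seen during training. The sketch below illustrates one way to do this; the sample layout (dictionaries with a `context` key), the function name, and the seed handling are assumptions, not the authors' code.

```python
import random


def split_by_context(samples, train_fraction=0.7, seed=0):
    """Split samples so that no context appears in both train and test."""
    contexts = sorted({s["context"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(contexts)
    cut = int(round(train_fraction * len(contexts)))
    train_contexts = set(contexts[:cut])
    train = [s for s in samples if s["context"] in train_contexts]
    test = [s for s in samples if s["context"] not in train_contexts]
    return train, test


# Toy data: 10 contexts with 20 grasps each (sizes chosen for illustration).
samples = [
    {"context": f"context_{c}", "grasp": g}
    for c in range(10)
    for g in range(20)
]
train, test = split_by_context(samples)
```

Repeating the split with different seeds and averaging the scores corresponds to reporting results over several random splits.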
As shown in Figure 4(a), CAGE outperformed all baselines by statistically significant margins. This result highlights our model’s ability to collectively reason about the contextual information of grasps. VF also performed better than the other two baselines, demonstrating that visual features also help generalize the similarities observed in semantic grasps. However, the significantly better performance of CAGE shows that our representation of context can promote generalization not only at the visual level, but also at the semantic level. One possible reason that FT performed worse than CA is that it failed to distinguish different contexts and overfitted to the bias in the training data. This result again confirms that part affordances need to be treated as part of a broader set of semantic features that together determine the suitability of grasps.
|Wide and Deep||0.8409|
To gain deeper insights into our proposed model and the effect of different contextual information, we performed an ablation study on our model. As shown in Table II, removing either the Deep or the Wide component of the model hinders the model’s ability to generalize across different contexts, which is crucial for context-aware grasping. The ablation also showed that removing task information (e.g., pour, hammer) has a negative impact on performance because different tasks introduce drastically different constraints on grasp suitability. Additionally, the ablation shows that reasoning about object states (e.g., hot, full), which is lacking in prior work, is also important for accurately modeling semantic grasping.
VIII-B Object Instance Generalization
In this experiment, we evaluate semantic grasp generalization across object instances of a single object class. We first split the data by each task and object combination, then divided the resulting data sets into 70% training and 30% testing.
Figure 4(b) presents the results. CAGE again outperforms all other baselines. Note that since the train and test sets share the same object class and task, less contextual information is required by the models. However, the statistically significant improvement of CAGE over VF in this experiment reaffirms the importance of affordances of object parts as a representation for modeling objects and semantic grasps.
VIII-C Object Class Generalization
In this experiment, we test the generalization of semantic grasps between object classes. We split the dataset by task; for each task, we set aside data from one object class for testing and use remaining object classes for training.
As shown in Figure 4(c), only CAGE has statistically significant differences in performance from the other methods. The similar performance of VF and CA might be explained by the diverse set of objects in our data. Since objects in different categories have drastically different appearances, generalization of semantic grasps based on visual similarities is hard to achieve. FT also performed poorly on this experiment because it again cannot distinguish different object classes, while different object classes can have distinct grasps suitable for the same task.
VIII-D Robot Experiment
In our final experiment, we evaluated the effectiveness of our method on a Fetch robot [36] equipped with an RGBD camera, a 7-DOF arm, and a parallel-jaw gripper. For testing, we randomly withheld 32 contexts (shown in attached video) from our dataset, of which 16 were semantically feasible and 16 were not, covering 14 different objects and all 7 tasks. The objective of this experiment was to validate CAGE’s success rate in selecting semantically meaningful grasps for feasible cases, and its ability to reject grasps in semantically infeasible cases (e.g., hammer with a ceramic bowl, and scoop with a flat spatula).
We trained a CAGE model with the remaining data and tested grasping with each of the withheld contexts on the robot. The test for each context proceeded as follows. We placed the object defined in the context on the table, and 50 grasp candidates were automatically sampled from the RGBD data with the Antipodal grasp sampler [33]. CAGE extracted semantic information from the object and each grasp candidate, and then used the trained network to predict a ranking of all grasp candidates. If the trained network predicted that all grasps have suitability probabilities below a threshold (0.01 was used in the experiment), the robot rejected all grasp candidates. Otherwise, the robot executed the first grasp candidate from the ranked list.
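The accept-or-reject rule described above can be sketched as follows. The function name `select_grasp` and the dictionary input are illustrative assumptions; the 0.01 threshold is the value stated in the text.

```python
def select_grasp(grasp_probabilities, threshold=0.01):
    """Return the top-ranked grasp, or None if every candidate is rejected.

    grasp_probabilities maps a grasp id to its predicted suitability
    probability. All candidates fall below the threshold exactly when
    the best one does, so only the maximum needs to be checked.
    """
    if not grasp_probabilities:
        return None
    best = max(grasp_probabilities, key=grasp_probabilities.get)
    if grasp_probabilities[best] < threshold:
        return None
    return best


feasible_choice = select_grasp({"g1": 0.2, "g2": 0.7, "g3": 0.004})
infeasible_choice = select_grasp({"g1": 0.001, "g2": 0.005})
```

Returning `None` corresponds to the robot declaring the context semantically infeasible rather than executing a poor grasp.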
In the experiment, 15/16 (94%) semantically feasible contexts had successful grasps (stable grasp lifting object off the table). While 1 grasp failed due to a bad grasp sample (i.e. wrap grasp on a pan), 16/16 (100%) grasps were correct for the context. The remaining 16/16 (100%) semantically infeasible contexts were correctly predicted by the model to have zero suitable grasps. The material classification incorrectly predicted metal as glass once; however, the misclassification did not cause an error in the ranking of semantic grasps.
Shown in Figure 6 are examples of semantic grasps executed by the robot. With CAGE, the robot was able to select suitable grasps based on task (a and b), state (c and d), and material (e). Generalization of semantic grasps between object instances (f and g) and classes (h and i) is also demonstrated, due to the abstract semantic representations and the semantic grasp network.
This work addresses the problem of semantic grasping. We introduced a novel semantic representation that incorporates affordance, material, object state, and task to inform grasp selection and promote generalization. We also applied the Wide & Deep model to learn suitable grasps from data. Our approach results in statistically significant improvements over existing methods evaluated on the dataset of 14,000 semantic grasps for 44 objects, 7 tasks, and 6 different object states. An experiment on the Fetch robot demonstrated the effectiveness of our approach for semantic grasping on everyday objects.
[1] (2011) Part-based robot grasp planning from human demonstration. In 2011 IEEE International Conference on Robotics and Automation, pp. 4554–4560.
[2] (2018) Multi-modal predicate identification using dynamically learned robot controllers. In International Joint Conference on Artificial Intelligence (IJCAI).
[3] (2018) Semantic and geometric reasoning for robotic grasping: a probabilistic logic approach. Autonomous Robots, pp. 1–26.
[4] (2010) Towards performing everyday manipulation activities. Robotics and Autonomous Systems 58 (9), pp. 1085–1095.
[5] (2015-06) Material recognition in the wild with the materials in context database. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] (2000) Robotic grasping and contact: a review. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), Vol. 1, pp. 348–353.
[7] (2015) Towards robot adaptability in new situations. In 2015 AAAI Fall Symposium Series.
[8] (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10.
[9] (2019) Learning affordance segmentation for real-world robotic manipulation via synthetic images. IEEE Robotics and Automation Letters 4 (2), pp. 1140–1147.
[10] (2019) On the choice of grasp type and location when handing over an object. Science Robotics 4 (27), pp. eaau9757.
[11] (2012) Semantic grasping: planning robotic grasps functionally suitable for an object manipulation task. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1311–1317.
[12] (2017) Task-oriented grasping with semantic and geometric scene understanding. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3266–3273.
[13] (2018) Affordancenet: an end-to-end deep learning approach for object affordance detection. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–5.
[14] (2019) Classification of household materials via spectroscopy. IEEE Robotics and Automation Letters 4 (2), pp. 700–707.
[15] (2015) Learning human priors for task-constrained grasping. In International Conference on Computer Vision Systems, pp. 207–217.
[16] (2011) Toward robust material recognition for everyday objects. In BMVC, Vol. 2, pp. 6.
[17] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[18] (2017) Affordance detection for task-specific grasping using deep learning. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pp. 91–98.
[19] (2018) Exercising affordances of objects: a part-based approach. IEEE Robotics and Automation Letters 3 (4), pp. 3465–3472.
[20] (2019) Towards affordance detection for robot manipulation using affordance for parts and parts for affordance. Autonomous Robots 43 (5), pp. 1155–1172.
[21] Autonomous tool construction using part shape and attachment prediction.
[22] (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724.
[23] (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436.
[24] (2017) Dex-net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics.
[25] (2014) Affordance of object parts from geometric features. In Workshop on Vision meets Cognition, CVPR, Vol. 9.
[26] (2016) Detecting object affordances with convolutional neural networks. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2765–2770.
[27] (2017) Object-based affordances detection with convolutional neural networks and dense conditional random fields. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5908–5915.
[28] (2016) Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE international conference on robotics and automation (ICRA), pp. 3406–3413.
[29] (2012) Large-scale semantic mapping and reasoning with heterogeneous modalities. In 2012 IEEE International Conference on Robotics and Automation, pp. 3515–3522.
[30] (2017) Weakly supervised affordance detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2795–2804.
[31] (2019) Recognizing material properties from images. IEEE transactions on pattern analysis and machine intelligence.
[32] (2010) Learning task constraints for robot grasping using graphical models. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1579–1585.
[33] (2018) Using geometry to detect grasp poses in 3d point clouds. In Robotics Research, pp. 307–324.
[34] (2013) Decomposing cad models of objects of daily use and reasoning about their functional parts. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5943–5949.
[35] (2016) Learning multi-modal grounded linguistic semantics by playing “I spy”. In International Joint Conference on Artificial Intelligence (IJCAI).
[36] (2016) Fetch and freight: standard platforms for service robot applications. In Workshop on autonomous mobile service robots.
[37] (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434.