Visual Manipulation Relationship Network


Robotic grasping is one of the most important fields in robotics, and convolutional neural networks (CNNs) have made great progress in detecting robotic grasps. However, scenes containing multiple objects can invalidate existing CNN-based grasp detection algorithms, because these algorithms lack the manipulation relationships among objects that guide the robot to grasp things in the right order. Manipulation relationships are therefore needed to help the robot grasp and manipulate objects. This paper presents a new CNN architecture called Visual Manipulation Relationship Network (VMRN) that helps the robot detect targets and predict manipulation relationships in real time. To enable end-to-end training and meet the real-time requirements of robot tasks, we propose the Object Pairing Pooling Layer (OPL), which allows all manipulation relationships to be predicted in one forward pass. To train VMRN, we collect a dataset named Visual Manipulation Relationship Dataset (VMRD), consisting of 5185 images with more than 17000 object instances and the manipulation relationships between all possible pairs of objects in every image, labeled by manipulation relationship trees. The experimental results show that the new network architecture can detect objects and predict manipulation relationships simultaneously and meets the real-time requirements of robot tasks.


1 Introduction

Grasping is one of the most significant manipulation skills in everyday life. Robotic grasping has developed rapidly in recent years, but it is still far behind human performance and remains an unsolved problem. For example, when humans encounter a stack of objects like the ones shown in Fig. 1, they instinctively know how to grasp them. For a robot, grasping a stack of objects involves detecting the objects and their grasps, determining the grasping order and finally planning the grasping motion. These problems remain challenging and hinder the widespread use of robots in everyday life.

Recently, with its rapid development, deep learning has proved to be a useful tool in computer vision and has made impressive breakthroughs in many research fields, such as image classification [1] and object detection [2, 3, 4]. Deep learning has made so much progress because, through training, deep networks can extract features of objects or images that are far superior to hand-designed ones [5]. These advantages have encouraged researchers to apply deep learning to robotics. Some recent works have demonstrated the effectiveness of deep learning and convolutional neural networks (CNNs) in robotic perception [6, 7] and control [8, 9]. In particular, deep learning has achieved unprecedented performance in robotic grasp detection [10, 11, 12, 13]. Most current robotic grasp detection methods take RGB or RGB-D images as input and output vectorized and standardized grasps. These algorithms essentially solve an object detection problem: they treat grasps as a special kind of object and detect them with a neural network.

Figure 1: Importance of manipulation order. Left: A cup is on a book. Right: A phone is on a box. As shown in the two scenes, if manipulation relationships are not considered and the robot chooses to pick up the book or the box first, the cup or the phone may fall or even break.

However, this type of grasp detection algorithm can only deal with scenes containing a single target in robotic grasping experiments. The robot simply executes the grasp with the highest confidence score, which can have a devastating effect on objects in some multi-object scenes. For example, as shown in Fig. 1, a cup is placed on a book; if the detected grasp with the highest confidence score belongs to the book, so that the robot picks up the book first, the cup may fall and break. Therefore, in this paper, we focus on helping the robot find the right grasping order when it faces a stack of objects, which we define as manipulation relationship prediction.

Some recent works have used convolutional neural networks to predict the relationships between objects rather than only detecting objects [14, 15, 16, 17]. These works show that convolutional neural networks have the potential to understand the relationships between objects. Therefore, we aim to establish a neural-network-based method that lets the robot understand the manipulation relationships between objects in multi-object scenes, helping it complete more complicated grasping tasks.

In our work, we design a new network architecture named Visual Manipulation Relationship Network (VMRN) to simultaneously detect objects and predict the manipulation relationships. The network architecture has two stages. The output of the first-stage is the object detection result, and the output of the second-stage is the prediction result of the manipulation relationships. To train our network, we contribute a new dataset called Visual Manipulation Relationship Dataset (VMRD). The dataset contains 5185 images of hundreds of objects with 51530 manipulation relationships and the category and location information of each object. The format of the annotations refers to the PASCAL VOC dataset. In summary, the contributions of our work include three points:

  • We design a new convolutional neural network architecture to simultaneously detect objects and predict manipulation relationships, which meets the real-time requirements in robot tasks.

  • We collect a new dataset of hundreds of graspable objects, which includes the location and category information and the manipulation relationships between pairs of objects.

  • To the best of our knowledge, this is the first end-to-end architecture that predicts robotic manipulation relationships directly from images using a convolutional neural network.

The rest of this paper is organized as follows: Section 2 reviews the background and related work; Section 3 introduces the problem formulation, including object detection and the representation of manipulation relationships; Section 4 gives the details of our approach and network, including the training method; Section 5 describes the dataset; Section 6 presents the experimental results of manipulation relationship prediction, including object-based and image-based tests and some subjective results; and finally, Section 7 concludes and discusses future work.

2 Background

2.1 Object Detection

Object detection is the process of taking an image that contains several objects as input and locating and classifying as many target objects as possible in it. Sliding windows used to be the most common way to detect objects. In this approach, features of the target objects, such as HOG [18] and SIFT [19], are usually extracted first and then used to train a classifier, such as a Support Vector Machine, that classifies the candidates produced by the sliding window stage. The Deformable Parts Model (DPM) [20] is the most successful object detection algorithm of this type.

Recently, object detection algorithms based on deep features, such as the Region-based CNN (RCNN) family [4, 3] and the Single Shot Detector (SSD) family [2], have been shown to drastically outperform previous algorithms based on hand-designed features. According to the detection process, the main algorithms fall into two types: one-stage algorithms such as SSD [2] and two-stage algorithms such as Faster RCNN [3]. One-stage algorithms are usually faster, while two-stage algorithms often achieve better results [21].

Our work focuses not only on object detection but also on manipulation relationship prediction. The challenge is how to combine the relationship prediction stage with the object detection stage. To solve this problem, we design the Object Pairing Pooling Layer, which takes the object detection results and convolutional feature maps as input and generates the input of the manipulation relationship predictor. The details are described in the following sections.

2.2 Visual Relationship Detection

Visual relationship detection means understanding the relationships between objects in an image. Some previous works try to learn spatial relationships [22, 23]. Later, researchers attempt to collect relationships from images and videos and help models map these relationships from images to language [24, 25, 26, 27]. Recently, with the help of deep learning, relationship prediction between objects has made great progress [14, 15, 16, 17]. Lu et al. [14] collect a new dataset of object relationships called Visual Relationship Dataset and propose a new relationship prediction model consisting of visual and language parts, which outperforms previous methods. Liang et al. [15] are the first to combine deep reinforcement learning with relationship detection, and their model can sequentially output the relationships between objects in one image. Yu et al. [16] use internal and external linguistic knowledge to compute the conditional probability distribution of a predicate given a pair, which achieves better performance. Dai et al. [17] propose an integrated framework called Deep Relational Network for exploiting the statistical dependencies between objects and their relationships.

These works focus on relationships between objects that are represented by linguistic information, not on manipulation relationships. In our work, we introduce relationship detection methods to help the robot find the right order in which the objects should be manipulated. Because of the real-time requirements of robot systems, we also need a way to accelerate the prediction of manipulation relationships. Therefore, we propose an end-to-end architecture that differs from all previous works.

3 Problem Formulation

Figure 2: An example of a manipulation relationship tree. Left: Images including several objects. Middle: All pairs of objects and their manipulation relationships. Right: The manipulation relationship tree, in which the leaf nodes should be manipulated before the other nodes.
Figure 3: Network architecture of VMRN. Input of the network is images including several graspable objects. Feature extractor is a stack of convolution layers ( VGG [28] or ResNet [29]), which output feature maps with size of . These features are used by object detector and OPL to respectively detect objects and generate the feature groups of all possible object pairs which are used to predict manipulation relationships by manipulation relationship predictor.

3.1 Object Detection

The output of object detection is the location and category of each object. The location of an object is represented by a vertical (axis-aligned) bounding box, which is a 4-dimensional vector $(x_{min}, y_{min}, x_{max}, y_{max})$, where $(x_{min}, y_{min})$ and $(x_{max}, y_{max})$ are the coordinates of the upper-left vertex and the lower-right vertex, respectively. The category is encoded as an integer, the index of the maximum classification confidence score: $cls = \arg\max_{k \in \{1, \dots, N\}} p_k$, where $N$ is the number of categories and $p = (p_1, \dots, p_N)$ is the confidence score vector, each element of which is the likelihood that the object belongs to the corresponding category.

3.2 Manipulation Relationship Representation

In this paper, the manipulation relationship is the order of grasping. Therefore, we need an objective criterion to determine the grasping order, which is as follows: if moving one object will affect the stability of another object, this object should not be moved first. Since we only focus on the manipulation relationships between objects and are not concerned with linguistic information, a tree-like structure (two objects may share the same child), called the manipulation relationship tree in the following, can be constructed to represent the manipulation relationships of all the objects in each image. Objects are represented by nodes, and parent-child relationships between nodes indicate the manipulation relationships. In the manipulation relationship tree, the object represented by a parent node should be grasped after the object represented by its child node. Fig. 2 is an example of the manipulation relationship tree. A pen is on a remote controller, and the remote controller, an apple and a stapler are on a book. Therefore, the pen is the child of the remote controller, and the remote controller, the apple and the stapler are children of the book in the manipulation relationship tree.
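The ordering rule implied by the tree can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `grasp_order` and the representation of relationships as `(parent, child)` name pairs are hypothetical, and the traversal is an ordinary topological sort rather than anything specific to the network.

```python
from collections import defaultdict

def grasp_order(objects, parent_child_pairs):
    """Return one valid grasping order: every child is grasped before its parents."""
    children = defaultdict(set)  # parent -> objects that must be removed first
    parents = defaultdict(set)   # child  -> objects that depend on its removal
    for parent, child in parent_child_pairs:
        children[parent].add(child)
        parents[child].add(parent)

    # Number of children still in the scene for every object.
    remaining = {obj: len(children[obj]) for obj in objects}
    ready = sorted(obj for obj, n in remaining.items() if n == 0)  # leaf nodes
    order = []
    while ready:
        obj = ready.pop(0)
        order.append(obj)
        for parent in sorted(parents[obj]):
            remaining[parent] -= 1
            if remaining[parent] == 0:
                ready.append(parent)
    if len(order) != len(objects):
        raise ValueError("relationships contain a cycle")
    return order

# The scene of Fig. 2: a pen on a remote controller; the remote controller,
# an apple and a stapler on a book.
objects = ["book", "remote controller", "apple", "stapler", "pen"]
pairs = [
    ("remote controller", "pen"),
    ("book", "remote controller"),
    ("book", "apple"),
    ("book", "stapler"),
]
order = grasp_order(objects, pairs)
print(order)
```

Because a node may have several parents, the sketch counts outstanding children per node instead of walking a strict tree; any order it returns places the book last for this scene.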

4 Proposed Approach

The proposed network architecture is shown in Fig. 3. The inputs of our network are images, and the outputs are object detection results and manipulation relationship trees. Our network consists of three parts: the feature extractor, the object detector and the manipulation relationship predictor, with parameters denoted by $\Phi$, $\Omega$ and $\Theta$ respectively.

In our work, taking into account the real-time requirements of object detection, we use the Single Shot Detector (SSD) algorithm [2] as our object detector. SSD is a one-stage object detection algorithm based on CNNs. It uses multi-scale feature maps to regress and classify bounding boxes so that it can adapt to object instances of different sizes. The input of the object detector is the convolutional feature maps (in our work, we use VGG [28] or ResNet50 [29] features). Through object classification and multi-scale object location regression, we obtain the final object detection results. The result for each object is a 5-dimensional vector $(cls, x_{min}, y_{min}, x_{max}, y_{max})$. The inputs of the Object Pairing Pooling Layer (OPL) are then the object detection results and the convolutional features. By traversing all possible pairs of objects, OPL concatenates its outputs into a mini-batch for predicting manipulation relationships. Finally, the manipulation relationship between each pair of objects is predicted by the manipulation relationship predictor.

4.1 Object Pairing Pooling Layer

OPL is designed to enable end-to-end training of the whole network. In our work, the weights of the feature extractor are shared by the manipulation relationship predictor and the object detector. OPL is added between the feature extractor and the manipulation relationship predictor, as shown in Fig. 3, and takes the object locations (the online output of object detection or the offline ground truth bounding boxes) and the shared feature maps $CNN(I; \Phi)$ as input, where $I$ is the input image. It finds all possible pairs of objects ($n$ objects correspond to $n(n-1)$ ordered pairs) and assembles their features into a mini-batch to train the manipulation relationship predictor. In complex visual relationship prediction tasks, traversing all possible object pairs is time-consuming [30] because of the large number of objects in the scene and the sparsity of the relationships between them. In our manipulation relationship prediction task, however, there are only a few objects in the scene, so traversing all object pairs does not take long.

Let $O_i$ and $O_j$ stand for an object pair. OPL generates the features of $O_i$ and $O_j$, denoted by $CNN(O_i, O_j; \Phi)$, which include the features of the two objects and of their union. In detail, the features are cropped from the shared feature maps and adaptively pooled to a fixed spatial size ($H \times W$). The gradients with respect to the same object or union bounding box coming from the manipulation relationship predictor are accumulated and propagated backward to the front layers.
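The pairing step can be sketched as follows. This is a minimal illustration, not the layer itself: it only enumerates ordered object pairs and computes their union boxes, and it assumes boxes are `(xmin, ymin, xmax, ymax)` tuples. The helper names `union_box` and `object_pairs` are hypothetical. In the actual layer, each of the three boxes would additionally be cropped from the shared feature maps and adaptively pooled to a fixed size.

```python
from itertools import permutations

def union_box(a, b):
    """Smallest box (xmin, ymin, xmax, ymax) that covers both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def object_pairs(boxes):
    """Yield (i, j, box_i, box_j, union) for every ordered pair with i != j."""
    for i, j in permutations(range(len(boxes)), 2):
        yield i, j, boxes[i], boxes[j], union_box(boxes[i], boxes[j])

boxes = [(10, 20, 110, 90), (60, 40, 200, 160), (150, 10, 220, 70)]
pairs = list(object_pairs(boxes))
print(len(pairs))   # 3 objects give 3 * 2 = 6 ordered pairs
print(pairs[0][4])  # union box of objects 0 and 1
```

Ordered pairs are used because swapping the two objects can change the relationship class, so both directions of every pair are included.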

Figure 4: Method to label online data. First, we match the predicted bounding boxes to the ground truth by areas of overlap. Then we use the manipulation relationship between ground truth bounding boxes as the ground truth manipulation relationship between predicted bounding boxes to generate online data used to train manipulation relationship predictor.

4.2 Training Data of Relation Predictor

An extra CNN branch is cascaded after OPL to predict the manipulation relationships between objects. The training data for the manipulation relationship predictor is generated by OPL and includes two parts: online data $D_{on}$ and offline data $D_{off}$, coming from the object detection results and the ground truth bounding boxes respectively. That is to say, $D_R = D_{on} \cup D_{off}$. For each image, $D_R$ is a set of CNN features of all possible object pairs $(O_i, O_j)$ and their labels $R(O_i, O_j)$, where $R(O_i, O_j)$ is the manipulation relationship between $O_i$ and $O_j$. We mix online and offline data to train the manipulation relationship predictor because online data can be seen as an augmentation of offline data, while offline data can be seen as a correction of online data. Manipulation relationships between online object instances are labeled according to the manipulation relationships between the ground truth bounding boxes that overlap them the most. In Fig. 4, the object detection result is shown on the right. The manipulation relationship between the mobile phone and the box is determined in two steps: 1) match the detected bounding boxes of the mobile phone and the box to the ground truth ones by overlap; 2) use the manipulation relationship between the ground truth bounding boxes to label the manipulation relationship of the detected bounding boxes.
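The two labeling steps can be sketched in a few lines. The sketch rests on assumptions that are not stated in the text: overlap is measured by intersection over union, every detection is matched to its best ground truth box without a minimum-overlap threshold, and relationships are encoded with the class indices 1, 2 and 3 used in Section 4.3. The names `iou` and `label_online_pairs` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_online_pairs(detected, ground_truth, gt_relations, no_rel=3):
    """Label every ordered pair of detections with the relationship of the
    ground truth boxes they overlap most.

    gt_relations maps (gt_i, gt_j) -> class; missing pairs default to no_rel.
    """
    # Step 1: match each detection to the ground truth box with the largest overlap.
    match = [
        max(range(len(ground_truth)), key=lambda g: iou(d, ground_truth[g]))
        for d in detected
    ]
    # Step 2: copy the ground truth relationship onto the matched detections.
    labels = {}
    for i in range(len(detected)):
        for j in range(len(detected)):
            if i != j:
                labels[(i, j)] = gt_relations.get((match[i], match[j]), no_rel)
    return labels

# Ground truth: object 0 is a phone lying on object 1, a box.
ground_truth = [(40, 40, 100, 80), (20, 20, 160, 140)]
# Class 2: object 2 (box) is the parent of object 1 (phone); class 1 is the reverse view.
gt_relations = {(0, 1): 2, (1, 0): 1}
# Detections arrive in a different order: the box first, then the phone.
detected = [(22, 18, 158, 143), (42, 38, 98, 83)]
labels = label_online_pairs(detected, ground_truth, gt_relations)
print(labels)
```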

4.3 Loss Function of Relation Predictor

In our work, there are three manipulation relationship types between any two objects in one image:

  • 1) object 1 is the parent of object 2

  • 2) object 2 is the parent of object 1

  • 3) object 1 and object 2 have no manipulation relationship.

Therefore, our manipulation relationship prediction process is essentially a three-class classification problem for any pair of objects $(O_i, O_j)$. Let $\Theta$ denote the weights of the relation prediction branch. Note that because exchanging the subject and object may change the manipulation relationship type (for example, from type 1 to type 2), the predictions for $(O_i, O_j)$ and $(O_j, O_i)$ may differ. The manipulation relationship likelihood of $(O_i, O_j)$ is defined as:
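A standard way to write such a likelihood is a softmax over the three class scores. The form below is a sketch under that assumption; the score symbol $h_r$ for the output of the relation prediction branch is notation introduced here for illustration.

```latex
P_r(O_i, O_j; \Theta) =
  \frac{\exp\big(h_r(O_i, O_j; \Theta)\big)}
       {\sum_{k=1}^{3} \exp\big(h_k(O_i, O_j; \Theta)\big)},
  \qquad r \in \{1, 2, 3\}
```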


We choose the multi-class cross-entropy function as the loss function of manipulation relationship prediction:
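For a single object pair whose ground truth relationship is class $r$, the multi-class cross entropy reduces to the negative log-likelihood of that class. The symbols $L_r$ and $P_r$ below are assumed notation for the per-pair loss and the predicted likelihood of class $r$.

```latex
L_r(O_i, O_j; \Theta) = -\log P_r(O_i, O_j; \Theta)
```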


For each image, the manipulation relationship prediction loss includes two parts: the online data loss $L_{on}$ and the offline data loss $L_{off}$. The loss for the whole image is:
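One plausible form, assuming the two parts are mixed as a convex combination with a single balancing weight $\mu$ and that each part accumulates the per-pair losses over its own data, is sketched below. The symbols $L_R$, $L_{on}$, $L_{off}$, $D_{on}$ and $D_{off}$ are assumed notation.

```latex
L_R(I; \Theta) = \mu \, L_{on} + (1 - \mu) \, L_{off},
\qquad
L_{on} = \sum_{(O_i, O_j) \in D_{on}} L_r(O_i, O_j; \Theta),
\quad
L_{off} = \sum_{(O_i, O_j) \in D_{off}} L_r(O_i, O_j; \Theta)
```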


Here $\mu$ is the weight used to balance the importance of the online data $D_{on}$ and the offline data $D_{off}$. In our work, we set $\mu$ to 0.5.
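The pieces of this section can be put together in a small numerical sketch. It assumes a softmax likelihood, summed per-pair cross entropies and a convex combination of the online and offline parts; these are assumptions about the exact form, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Multi-class cross entropy for one object pair; target is a 0-based class index."""
    return -math.log(softmax(scores)[target])

def summed_loss(samples):
    """Sum of per-pair losses over (scores, target) samples; 0.0 for an empty set."""
    return sum(cross_entropy(s, t) for s, t in samples)

def relation_loss(online, offline, mu=0.5):
    """Assumed mixing rule: mu weights the online part, 1 - mu the offline part."""
    return mu * summed_loss(online) + (1.0 - mu) * summed_loss(offline)

# One online pair scored confidently for class 0, one offline pair with flat scores.
online = [([2.0, 0.5, 0.1], 0)]
offline = [([0.0, 0.0, 0.0], 2)]
print(relation_loss(online, offline))
```

With flat scores every class has probability 1/3, so the offline pair contributes log 3 before weighting, which is a quick sanity check for any implementation.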

4.4 Training Method

The whole network is trained end-to-end, which means that the object detector and manipulation relationship predictor are trained simultaneously.

Let $\Omega$ be the weights of the object detector and $D_O$ be the training data of the object detector, which includes the shared features of the whole image and the object detection ground truth. The loss function for the object detector is the same as the one described by Liu et al. in [2]:
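For reference, the SSD objective in [2] is a weighted sum of a classification confidence loss and a location loss, normalized by the number of matched default boxes. The form below follows that structure; the symbols $L_O$, $D_O$ and $\Omega$ are assumed notation, and $N$ here denotes the number of matched default boxes.

```latex
L_O(D_O; \Omega) = \frac{1}{N} \big( L_{conf} + \alpha \, L_{loc} \big)
```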


Here $\alpha$, the weight of the location loss, is set to 1 according to experience. As in [2], default bounding boxes are defined as a set of predetermined bounding boxes with a few fixed sizes, which serve as references during object detection. The location loss $L_{loc}$ is the smooth L1 loss between the ground truth bounding box and the matched default bounding box, with all bounding boxes encoded as offsets. The classification confidence loss $L_{conf}$ is also a multi-class cross-entropy loss.
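The smooth L1 term can be illustrated with a short sketch. It uses the common definition with a transition point at 1 and sums the loss over the four encoded box offsets; the function names are hypothetical and the offset encoding itself is not shown.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for large errors (transition at |x| = 1)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def location_loss(pred_offsets, target_offsets):
    """Sum of smooth L1 losses over the encoded box offsets of one matched box."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))

print(smooth_l1(0.5))   # small error: quadratic branch
print(smooth_l1(2.0))   # large error: linear branch
print(location_loss([0.5, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]))
```

The linear branch keeps gradients bounded for badly matched boxes, which is the usual motivation for preferring it over a plain squared error.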

Algorithm 1 Training Algorithm
Input: Training set of images $I$ with object locations, categories and manipulation relationships; pretrained VGG [28] or ResNet50 [29] parameters
Output: Feature extractor $\Phi$, object detector $\Omega$ and manipulation relationship predictor $\Theta$

Initialize feature extractor $\Phi$ using pretrained VGG or ResNet50, and initialize object detector $\Omega$ and manipulation relationship predictor $\Theta$ randomly.
Pretrain feature extractor $\Phi$ and object detector $\Omega$ for 10k iterations on images [2].
while the maximum number of iterations has not been reached do
      Randomly sample a mini-batch.
      Extract CNN features $CNN(I; \Phi)$.
      Detect objects and get a set of online predicted bounding boxes $B_{on}$.
      Get offline (ground truth) bounding boxes $B_{off}$.
      Feed $B_{on}$ and $B_{off}$ into OPL and get manipulation relationship training data $D_R = D_{on} \cup D_{off}$.
      Update object detector $\Omega$ and manipulation relationship predictor $\Theta$ using $L_O$ and $L_R$ respectively.
      Update feature extractor $\Phi$ using the combined loss of $L_O$ and $L_R$.
end while

The loss function of manipulation relationship prediction is detailed in Section 4.3. Combining the object detection loss $L_O$ and the manipulation relationship prediction loss $L_R$, the complete loss for the shared layers is:
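A plausible form, assuming the two losses are mixed as a convex combination with a single balancing weight $\lambda$, is sketched below. The symbols $L$, $L_O$, $L_R$, $\Phi$, $\Omega$ and $\Theta$ are assumed notation for the complete loss, the detection loss, the relationship loss and the three parameter sets.

```latex
L(I; \Phi, \Omega, \Theta) = \lambda \, L_O + (1 - \lambda) \, L_R
```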


Here $\lambda$ is used to balance the importance of $L_O$ and $L_R$. In our work, $\lambda$ is set to 0.5. According to the chain rule:
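Assuming the complete loss is the convex combination $\lambda L_O + (1 - \lambda) L_R$, the gradient reaching the shared feature extractor parameters (written $\Phi$ here as assumed notation) is the same weighted sum of the two branch gradients:

```latex
\frac{\partial L}{\partial \Phi}
  = \lambda \, \frac{\partial L_O}{\partial \Phi}
  + (1 - \lambda) \, \frac{\partial L_R}{\partial \Phi}
```

The second term is the part that flows back through OPL; dropping it corresponds to training without gradients from the manipulation relationship predictor.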


The complete training algorithm is shown in Algorithm 1.
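The schedule for the shared layers can be sketched as a tiny helper. It assumes that during the first 10k iterations only the detection loss is used and that afterwards the two losses are mixed with weight 0.5; the function name and its arguments are hypothetical, and the loss values are plain numbers standing in for quantities computed by a real training framework.

```python
def shared_layer_loss(l_o, l_r, iteration, lam=0.5, pretrain_iters=10_000):
    """Loss used to update the feature extractor at a given iteration.

    l_o: object detection loss, l_r: manipulation relationship loss.
    During pretraining only the detection loss is used (an assumption).
    """
    if iteration < pretrain_iters:
        return l_o
    return lam * l_o + (1.0 - lam) * l_r

print(shared_layer_loss(2.0, 4.0, iteration=0))       # pretraining phase
print(shared_layer_loss(2.0, 4.0, iteration=10_000))  # joint training phase
```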

5 Dataset

5.1 Data Collection

Unlike the Visual Relationship Dataset [14], our dataset focuses on manipulation relationships, so the objects it includes should be manipulatable or graspable. Moreover, a manipulation relationship dataset should contain not only objects localized in images but also a rich variety of position relationships.

Our data are collected and labeled using hundreds of objects from 31 categories. In total, there are 5185 images including 17688 object instances and 51530 manipulation relationships. The category and manipulation relationship distribution is shown in Fig. 5(a). The annotation format is similar to PASCAL VOC: each object node includes category information and bounding box location and, unlike PASCAL VOC, the index of the current node and the indexes of its parent nodes and child nodes. Some examples from our dataset are shown in Fig. 5(b).

During training, we randomly split the dataset into a training set and a testing set in a ratio of nine to one. In detail, the training set includes 4656 images, 15911 object instances and 46934 manipulation relationships, and the testing set contains the remaining 529 images, 1777 object instances and 4596 manipulation relationships.

Figure 5: Visual Manipulation Relationship Dataset. (a) Category and manipulation relationship distribution of our dataset. (b) Some dataset examples.
Author / Algorithm / Training Data   Obj. Det.  Rel.   Obj. Rec.  Obj. Prec.  Img.   Speed (ms)
Lu et al. [14]                       93.01      88.76  75.50      71.28       46.88  100
                                     91.72      88.76  74.33      75.19       49.72
VGG-VMRN (No Rel. Grad.)             93.01      88.36  77.28      73.04       50.66
ResNet-VMRN (No Rel. Grad.)          91.72      90.73  77.68      77.55       53.12
                                     94.18      92.80  82.64      77.76       60.49
                                     94.36      92.75  81.55      76.09       58.60  28
                                     94.09      93.36  82.29      78.01       63.14
                                     91.81      92.01  79.03      72.55       54.44
                                     92.71      91.86  79.33      74.71       55.95
                                     92.67      92.19  80.55      76.02       57.28
Table 1: Accuracy of Object Detection and Visual Manipulation Relationship Prediction

5.2 Labeling Criterion

Because our dataset focuses on manipulation relationships without linguistic or position information, instead of directly giving position relationships (under, on, beside and so on) between objects, we only give the order in which the objects should be manipulated: the manipulation relationship tree. This has several advantages over giving position relationships: 1) the manipulation relationships are simpler, which makes the relationship prediction task easier; 2) the prediction directly gives the manipulation relationships between objects, without the need to reconstruct them from position relationships.

Labeling requires a criterion that can be strictly enforced. Therefore, in our work, we set the following labeling criterion for manipulation relationships: when the movement of an object will affect the stability of other objects, the object should not be a leaf node of the manipulation relationship tree, which means that it should not be moved first. For example, in the upper-left image in Fig. 5(b), there are three objects: on the left there is an orange can, and on the right a red box is placed on a green box. If the green box is moved first, it will affect the stability of the red box, so it should not be a leaf node of the manipulation relationship tree. If the red box or the can is moved first, it will not affect the stability of any other object, so they should be leaf nodes.

6 Experiments

6.1 Training Settings

Our models are trained on a Titan Xp with 12 GB of memory. We train two Visual Manipulation Relationship Network (VMRN) models based on VGGNet and ResNet, called VGG-VMRN and ResNet-VMRN. Because the random object detection results are unstable at the beginning of training, the two VMRN models are pretrained without the manipulation relationship prediction loss for the first 10k iterations. Detailed training settings are listed in Table 2.




Learning Rate          0-80k iters: 1e-3       0-80k iters: 1e-3
                       80k-120k iters: 1e-4    80k-120k iters: 1e-4
Learning Rate Decay    0                       0
Weight Decay           3e-3                    1e-4
Batch Size             8                       8
Momentum               0.9                     0.9
Nesterov               True                    True
Max Epochs             120                     120
Iters per Epoch        1000                    1000
Framework              Torch7                  Torch7
Table 2: Training Settings
Figure 6: Result examples. Upper: examples with right object detection and manipulation relationship prediction. Lower: examples with wrong results (from left to right: redundant object detection, failing object detection, redundant manipulation relationship, failing manipulation relationship)

6.2 Testing Settings

Comparison Model To the best of our knowledge, there has been no research on robotic manipulation relationships so far. We therefore compare our experimental results with the Visual Appearance Model (VAM) of Lu et al. [14], modified to fit our task. VAM takes the union bounding box as input and outputs the relationship. In our work, however, exchanging the subject and object may change the manipulation relationship. Therefore, instead of using only the union bounding box, we feed the subject, object and union bounding boxes in parallel as input to obtain the final manipulation relationship.

Self Comparison To study the contribution of OPL and end-to-end training, we also evaluate our models trained with no gradients propagated back from the manipulation relationship predictor (VGG-VMRN (No Rel. Grad.) and ResNet-VMRN (No Rel. Grad.)). To explore the benefits of online and offline data, we also train our models with only online ($D_{on}$) or only offline ($D_{off}$) training data.

Metrics Three metrics are used in our experiments: 1) Manipulation Relationship Testing (Rel.): this metric focuses on the accuracy of the manipulation relationship model on ground truth object instance pairs, in which the input features or image patches of the manipulation relationship predictor are obtained from the offline ground truth bounding boxes; 2) Object-based Testing (Obj. Rec. and Obj. Prec.): this metric tests the accuracy based on object pairs. In this setting, the triplet formed by an object pair and its manipulation relationship is treated as a whole. A prediction is considered correct if both objects are detected correctly (the category is right and the IoU between the predicted bounding box and the ground truth is more than 0.5) and the predicted manipulation relationship is correct. We compute the recall (Obj. Rec.) and precision (Obj. Prec.) of our models during object-based testing; 3) Image-based Testing (Img.): this metric tests the accuracy based on the whole image. In this setting, an image is considered correct only when all possible triplets are predicted correctly.
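The object-based and image-based tests can be sketched as follows. The sketch makes several assumptions that the text does not fix: objects are `(category, box)` tuples, a triplet is `(object_1, object_2, relationship)`, each ground truth triplet can be matched at most once, and an image counts as correct when both recall and precision equal 1. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def object_correct(pred, gt, iou_thresh=0.5):
    """An object is correct when the category matches and IoU exceeds the threshold."""
    return pred[0] == gt[0] and iou(pred[1], gt[1]) > iou_thresh

def triplet_correct(pred, gt):
    """A triplet is correct when both objects and the relationship are correct."""
    return (pred[2] == gt[2]
            and object_correct(pred[0], gt[0])
            and object_correct(pred[1], gt[1]))

def object_based_scores(pred_triplets, gt_triplets):
    """Return (recall, precision) over triplets, matching each ground truth once."""
    matched, hits = set(), 0
    for p in pred_triplets:
        for k, g in enumerate(gt_triplets):
            if k not in matched and triplet_correct(p, g):
                matched.add(k)
                hits += 1
                break
    recall = hits / len(gt_triplets) if gt_triplets else 1.0
    precision = hits / len(pred_triplets) if pred_triplets else 1.0
    return recall, precision

def image_correct(pred_triplets, gt_triplets):
    """Assumed reading of the image-based test: nothing missed, nothing spurious."""
    return object_based_scores(pred_triplets, gt_triplets) == (1.0, 1.0)

# A cup on a book: the book is the parent of the cup.
gt_cup, gt_book = ("cup", (50, 30, 90, 80)), ("book", (20, 60, 160, 140))
gt_triplets = [(gt_cup, gt_book, 2), (gt_book, gt_cup, 1)]
pd_cup, pd_book = ("cup", (52, 32, 91, 79)), ("book", (22, 58, 158, 143))
# The second predicted relationship is wrong (class 3 instead of class 1).
pred_triplets = [(pd_cup, pd_book, 2), (pd_book, pd_cup, 3)]
print(object_based_scores(pred_triplets, gt_triplets))
print(image_correct(pred_triplets, gt_triplets))
```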

6.3 Analysis

Results are shown in Table 1. Compared with VAM, we can conclude that:

1) Performance is better: VAM performs worse than the proposed VMRN models in all three experimental settings. The gains mainly come from the end-to-end training process, which substantially improves the accuracy of manipulation relationship prediction (from 88.76% to 93.36%). This is confirmed by the self comparison below.

2) Speed is faster: The proposed VMRN models (VGG-VMRN and ResNet-VMRN) are both less time-consuming than VAM. The forward pass of OPL and the manipulation relationship predictor takes 5.5 ms per image on average. As described in [2], the SSD object detector takes 21.74 ms per image on a Titan X with a mini-batch of 1 image, so the whole network needs roughly 27 ms per image (21.74 ms + 5.5 ms) and manipulation relationship prediction has little effect on overall speed. In contrast, because of its huge network architecture and sequential prediction process, VAM spends 122 ms per image on average to predict all the manipulation relationships. Even when all possible triplets of one image are put into one batch, it still spends 86 ms per image.

The self comparison results indicate that the proposed VMRN models trained end-to-end outperform the models trained without the gradients from the manipulation relationship predictor. The improvement mainly comes from the influence of the manipulation relationship prediction loss $L_R$: the parameters of the network are adjusted to better predict the visual manipulation relationships, and the network becomes more holistic. As explored by Pinto et al. [31], multi-task learning in our network can help improve performance because of the diversity of data and the regularization in learning. Finally, we observe that using online and offline data simultaneously helps to improve the performance of the network, because the two kinds of data complement each other.

The difference between the performance of VGG-VMRN and ResNet-VMRN is also interesting. Gradients coming from the manipulation relationship prediction loss improve both networks, but the improvement on ResNet-VMRN is smaller than that on VGG-VMRN, as shown in Table 1. Note that the VGG-based feature extractor has 7.63 million parameters and the ResNet-based feature extractor has 1.45 million parameters, so the number of parameters may limit the performance ceiling of ResNet-VMRN. In the future, we will try a deeper ResNet as our base network.

Some subjective results are shown in Fig. 6. From the four examples in the first row, we can see that our model can simultaneously detect objects and manipulation relationships in one image. From the four examples in the second row, we can conclude that occlusion, similarity between different categories and visual illusions can have a negative influence on the predicted results.

7 Conclusions

In this paper, we focus on solving the problem of visual manipulation relationship prediction to help the robot manipulate things in the right order. We propose a new network architecture named Visual Manipulation Relationship Network and collect a dataset called Visual Manipulation Relationship Dataset to detect objects and predict manipulation relationships simultaneously, which meets the real-time requirements of robot platforms. The proposed Object Pairing Pooling Layer (OPL) not only accelerates manipulation relationship prediction by replacing sequential prediction with a single forward pass, but also improves the performance of the whole network by back-propagating the gradients from the manipulation relationship predictor.

However, due to the limited number of objects used in training, it is difficult for the object detector to generalize to objects whose appearance differs greatly from that of the objects in our dataset. In future work, we will expand our dataset with more graspable objects and combine grasp detection with VMRN to build an all-in-one network that can simultaneously detect objects and their grasp positions and predict the correct manipulation relationships.


This work was supported in part by NSFC No. 91748208, the National Key Research and Development Program of China under grant No. 2017YFB1302200 and 2016YFB1000903, NSFC No. 61573268, and Shaanxi Key Laboratory of Intelligent Robots.


  1. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
  2. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV), pages 21–37. Springer, 2016.
  3. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
  4. Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448. IEEE, 2015.
  5. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  6. Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 512–519. IEEE, 2016.
  7. Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research (IJRR), 2016.
  8. Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE, 2017.
  9. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 17(39):1–40, 2016.
  10. Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research (IJRR), 34(4-5):705–724, 2015.
  11. Joseph Redmon and Anelia Angelova. Real-time grasp detection using convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), pages 1316–1322. IEEE, 2015.
  12. Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In IEEE International Conference on Robotics and Automation (ICRA), pages 3406–3413. IEEE, 2016.
  13. Sulabh Kumra and Christopher Kanan. Robotic grasp detection using deep convolutional neural networks. In Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017.
  14. Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision (ECCV), pages 852–869. Springer, 2016.
  15. Xiaodan Liang, Lisa Lee, and Eric P Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4408–4417. IEEE, 2017.
  16. Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. Visual relationship detection with internal and external linguistic knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1974–1982, 2017.
  17. Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. arXiv preprint arXiv:1704.03114, 2017.
  18. Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.
  19. David G Lowe. Object recognition from local scale-invariant features. In Proceedings of the 7th IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1150–1157. IEEE, 1999.
  20. Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(9):1627–1645, 2010.
  21. Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  22. Stephen Gould, Jim Rodgers, David Cohen, Gal Elidan, and Daphne Koller. Multi-class segmentation with relative location prior. International Journal of Computer Vision (IJCV), 80(3):300–316, 2008.
  23. Carolina Galleguillos, Andrew Rabinovich, and Serge Belongie. Object categorization using co-occurrence, location and appearance. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
  24. Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(12):2891–2903, 2013.
  25. Vignesh Ramanathan, Congcong Li, Jia Deng, Wei Han, Zhen Li, Kunlong Gu, Yang Song, Samy Bengio, Charles Rosenberg, and Li Fei-Fei. Learning semantic relationships for better action retrieval in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1100–1109, 2015.
  26. Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond J Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), volume 2, page 9, 2014.
  27. C Lawrence Zitnick, Devi Parikh, and Lucy Vanderwende. Learning the visual interpretation of sentences. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1681–1688, 2013.
  28. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  29. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  30. Ji Zhang, Mohamed Elhoseiny, Scott Cohen, Walter Chang, and Ahmed Elgammal. Relationship proposal networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 2, 2017.
  31. Lerrel Pinto and Abhinav Gupta. Learning to push by grasping: Using multiple tasks for effective learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 2161–2168. IEEE, 2017.