S4Net: Single Stage Salient-Instance Segmentation
In this paper, we consider an interesting vision problem: salient instance segmentation. In addition to producing approximate bounding boxes, our network also outputs high-quality instance-level segments. Taking into account the category-independent property of each target, we design a single-stage salient instance segmentation framework with a novel segmentation branch. Our new branch considers not only the local context inside each detection window but also its surrounding context, enabling us to distinguish instances in the same scope even under obstruction. Our network is end-to-end trainable and runs fast (40 fps at the input resolution used in our experiments). We evaluate our approach on a publicly available benchmark and show that it outperforms the alternative solutions. In addition, we provide a thorough analysis of the design choices to help readers better understand the function of each part of our network. To facilitate the development of this area, our code will be available at https://github.com/RuochenFan/S4Net.
1 Introduction
Rather than recognizing all the objects in a scene, we humans only care about a small set of interesting objects/instances. A recent experiment [13] demonstrates that interesting objects are normally visually salient, reflecting the importance of detecting salient objects. In fact, localizing objects of interest is also essential for a wide range of computer graphics and computer vision applications. Such a capability allows many modern applications (e.g., image manipulation/editing [7, 49, 5] and robotic perception [48]) to provide initial regions that might be of interest to users or robots so that they can directly proceed to image editing or scene understanding. Similar to [30], in this paper we aim at detecting salient instances in an input image or photograph and simultaneously outputting accurate instance-level segments.
Benefiting from the multi-level features extracted by convolutional neural networks (CNNs), recent object detection methods [17, 40, 11] provide better tools for localizing the bounding boxes of semantic objects. There are also other works focusing on detecting the approximate positions of salient objects (detection windows). These methods are useful for finding salient objects, but they provide bounding boxes rather than the instance-level object segments required by image editing applications [7, 5]. Recent instance-level semantic segmentation methods [9, 20] produce high-quality segmentations for each instance; however, these works focus on semantic objects and hence are not suited to class-agnostic instance segmentation. Although it is possible to change their segmentation branches so that they can be applied to the binary case, these branches are based on either the RoIWarp or the RoIAlign [20] layer, which only covers the feature information inside the bounding boxes. Compared to instance-level semantic segmentation, instance-level salient object segmentation, like salient object detection, focuses more on contrast information [26, 25]. This requires more global context to be considered so as to highlight the instances of interest.
Taking the above into consideration, in this paper we present a novel single-stage salient instance segmentation framework. Based on a single-stage object detector, we introduce a new segmentation branch that produces pixel-wise segmentations for each salient instance. Unlike RoIPool [16, 21] and RoIAlign [20], which extract features only from limited scopes (bounding boxes), we propose a new region-based feature extraction layer, namely RoIMasking, which takes into account the features inside the bounding boxes as well as more global information. Interestingly, RoIMasking is completely quantization-free and scale-preserving, allowing more detailed information to be retained. Because salient instance segmentation relies more on global context, we design a new residual segmentation branch. This new branch not only substantially increases its own receptive field but also inherits the residual property of ResNets [22], leading to satisfactory performance. Beyond that, our model is end-to-end trainable and runs at 40 fps on a single GPU at the input resolution used in our experiments.
To sum up, our proposed approach makes the following contributions:
First, we propose an end-to-end single-shot salient instance segmentation framework. This model not only achieves state-of-the-art performance but also runs in real time.
Second, we design a new RoIMasking layer which preserves the original aspect ratio and resolution of the regions of interest while retaining the context information around them.
2 Related Works
Salient Object Detection. Salient object detection aims at jointly detecting the most distinguished objects and segmenting them out from a given scene. Early salient object detection methods mostly depended on either global or local contrast cues [6, 27]. They designed various hand-crafted features (e.g., color histograms and textures) for each region [2, 14, 43] and fused these features in either manually designed or learning-based manners. Because of their weak ability to preserve the integrity of salient instances and the instability of hand-crafted features, these methods were gradually replaced by later CNN-based data-driven methods [23, 32, 47, 52, 15, 31, 30]. These methods have two key problems when applied to the salient instance segmentation task. First, the integrity of the salient objects is difficult to preserve because the distinguished regions might be only parts of the interesting instances. Second, salient object detection is a binary problem and hence is not suited to instance-level segmentation. Beyond only detecting salient objects, MSRNet [30] designed a series of mechanisms based on MCG [39] to provide instance-level salient object segmentation. Nevertheless, this method relies heavily on the quality of the edge maps and hence often fails when processing complicated real-world scenes.
Object Detection. The goal of object detection is to produce all the bounding box candidates for semantic categories. Earlier works mostly relied on hand-engineered features (e.g., SIFT [37], SURF [4], and HOG [12]). They built different types of image pyramids so as to leverage more information across scales. Recently, the emergence of CNNs greatly promoted the development of object detectors. For example, R-CNN [17] and OverFeat [42] regarded CNNs as sliding window detectors for extracting high-level semantic information. Given a stack of pre-computed proposals [46, 8], these methods computed feature vectors for each proposal using CNNs and then fed the features into a classifier. Later works [21, 16] took entire images as inputs and applied region-based detectors to feature maps, substantially accelerating the running speed. Faster R-CNN [40] removed the dependence on pre-computed proposals by introducing a region proposal network (RPN) into CNNs. In this way, the whole network could be trained end-to-end, offering a better trade-off between accuracy and speed than previous works. However, all the methods discussed above aim at outputting reliable object proposals rather than instance segmentations.
Semantic Instance Segmentation. Earlier semantic instance segmentation methods [10, 18, 19, 38] were mostly based on segment proposals generated by segmentation methods [46, 39, 3]. Dai et al. predicted segmentation proposals by leveraging a multi-stage cascade to gradually refine rectangular regions from bounding box proposals. Li et al. [33] proposed to integrate the segment proposal network into an object detection network. More recently, He et al. [20] implemented the Mask R-CNN framework, extending the Faster R-CNN [40] architecture by introducing a segmentation branch. Despite their increasingly impressive results, these methods are not suitable for our task, as the segment proposals all belong to a pre-defined category collection. Sometimes the categories of interesting objects are unknown, and thus our task requires class-agnostic segment proposals.
Given the real-time demands of most applications, we design a single-shot salient instance segmentation framework, S4Net. Like Mask R-CNN [20], S4Net adds a new segmentation branch, here built on RoIMasking, to a single-shot object detector, and it is easy to implement.
The overall architecture of S4Net can be found in Fig. 3. Functionally, the framework can be separated into two components: a bounding box detector and a segmentation branch, both of which share the same base model, as shown in Fig. 3. As in most object detection works, we select ResNet-50 [22], pretrained on the ImageNet dataset [41], as our base model. (We also report the performance of other base models in our experiment section to verify the generality of the proposed framework.) For notational convenience, we divide ResNet-50 into five residual blocks, named conv1, …, conv5 as in most works [34, 22].
The Backbone. As pointed out in previous works, both local texture details and high-level abstract information are essential for detecting objects, but it is difficult for the base models themselves to contain both because of their bottom-up structure. Thus, inspired by feature pyramid networks [34] and RON [28], we adopt a U-shape hyper net structure to combine fine-grained details with highly abstracted information. The output feature maps from conv2 to conv5 of the ResNet are passed through convolutional layers with 256 channels to make a series of intermediate lateral layers. A reverse connection is made by upsampling an upper lateral layer and then combining it with a lower lateral layer by simple summation. Similar to [34], to avoid the aliasing effect introduced by this step, we add another convolutional layer with 256 channels after the summation.
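The reverse connection described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: nearest-neighbour 2x upsampling is an assumption (the text does not name the upsampling method), the names `upsample2x` and `reverse_connection` are hypothetical, and the convolution applied after the summation is omitted.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) array (assumed method)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def reverse_connection(upper, lower):
    """Merge a coarser lateral map into the next finer one by summation."""
    return upsample2x(upper) + lower

# Toy lateral maps: the upper one has half the spatial size of the lower one.
upper = np.full((4, 2, 2), 2.0)
lower = np.ones((4, 4, 4))
merged = reverse_connection(upper, lower)
```

Both inputs must already have the same number of channels, which is why the lateral layers are first brought to a common channel count.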
Single-Shot Object Detector. A detection branch is connected to the backbone to produce object bounding boxes. Considering the efficiency of the entire network, we adopt a single-shot object detector. To leverage the multi-level features extracted from our U-shape backbone, we introduce four heads, one connected to each lateral layer. All heads share one structure (following prior work) but use different strides in order to perform detection at multiple scales. Large objects are assigned to high-level feature maps with a large stride and more global information, while small objects are assigned to low-level feature maps with a small stride and high resolution. It is worth mentioning that in a single-shot detection model, negative samples far outnumber positive samples. This imbalance greatly degrades the quality of the resulting proposals. Taking this into account, we calculate the objectness loss of positive and negative samples separately. In order to suppress false positives, online hard example mining (OHEM) [44] is used to calculate the objectness loss for negative samples.
Salient Instance Segmentation. The bounding boxes predicted by the detection branch and the output of the lateral layer with stride 8 in the backbone are fed into our RoIMasking layer, which is described in detail below. The RoIMasking layer marks out the regions of interest in the feature maps and suppresses the information irrelevant to the interesting objects. A segmentation branch is then connected to the RoIMasking layer. Taking the feature maps produced by the RoIMasking layer as inputs, the segmentation branch outputs a series of saliency score maps through a fully convolutional structure.
3.2 RoIMasking
Fig. 4: (a) Binary masking; (b) Ternary masking.
RoIPool [16, 21] and RoIAlign [20] are two standard fixed-size operations for extracting the features of the regions of interest. Both RoIPool and RoIAlign sample a region of interest into a small fixed spatial extent. However, one of their drawbacks is that the sampling process is unable to maintain the original aspect ratio and resolution of the regions of interest. Besides, both RoIPool and RoIAlign focus on the regions inside the proposals and neglect the remaining area. In fact, the context around the regions of interest is clearly useful for saliency segmentation, but both RoIPool and RoIAlign discard it completely. In this subsection, we present a new resolution-preserving and quantization-free layer, called RoIMasking, to take the place of RoIPool or RoIAlign.
Binary RoIMasking. We first introduce a simplified version of RoIMasking which we call binary RoIMasking. Binary RoIMasking receives feature maps and the proposals predicted by the detection branch. A binary mask is generated according to the proposals: values inside a proposal are set to 1 and all others to 0. Fig. 4a provides an illustration, in which the bright area is associated with label 1 and the dark region with label 0. The output of the binary RoIMasking layer is the input feature maps multiplied by this mask. In Fig. 5, we show a typical example of the output feature maps. In the experiment part, we show that the proposed binary RoIMasking outperforms the RoIPool and RoIAlign baselines.
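As a rough illustration of the masking step just described, the NumPy sketch below builds the 0/1 mask for one proposal and multiplies it into a feature map. The box convention (`(x0, y0, x1, y1)` in feature-map pixels, end-exclusive) and the function names are assumptions made for this sketch only.

```python
import numpy as np

def binary_roi_mask(height, width, box):
    """Return an (H, W) mask that is 1 inside `box` and 0 elsewhere.

    `box` is (x0, y0, x1, y1) in feature-map pixels, end-exclusive
    (an assumed convention for this sketch).
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask

def binary_roi_masking(features, box):
    """Multiply a (C, H, W) feature map by the binary mask of one proposal."""
    _, height, width = features.shape
    return features * binary_roi_mask(height, width, box)[None, :, :]

features = np.ones((2, 6, 8), dtype=np.float32)
masked = binary_roi_masking(features, box=(2, 1, 5, 4))
```

The output keeps the spatial size of the input, so no resampling or coordinate quantization of the features takes place.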
Ternary RoIMasking. To make better use of the context information around the regions of interest, we further extend binary RoIMasking to a ternary case. Because of the ReLU activation function, there are no negative values in the feature maps before RoIMasking. So, to better highlight the salient instances inside the proposals, we set the mask pixels around the regions of interest to -1. In Fig. 4b, the area between the blue and orange boxes is the area where pixels are set to -1. In this way, the features around regions of interest are distinct from those inside the bounding boxes of the salient instances. This allows the segmentation branch not only to recognize which features belong to the regions of interest but also to make use of the context information around the salient instances. The feature map after ternary RoIMasking is illustrated in Fig. 5d. It is worth mentioning that this operation adds no computational cost to our model. Ternary RoIMasking leads to a large improvement, as we show in the experiment part. In the following, we abbreviate ternary RoIMasking as RoIMasking for notational convenience unless otherwise noted.
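The ternary mask can be sketched in the same way. In the NumPy example below, the width of the -1 band is taken to be `alpha` times the box size on every side, rounded to whole pixels and clipped to the feature map; that rule, the default `alpha`, the box convention and the function name are assumptions of this sketch, not details confirmed by the text above.

```python
import numpy as np

def ternary_roi_mask(height, width, box, alpha=1.0 / 3.0):
    """Return an (H, W) mask: 1 inside `box`, -1 in a band around it, 0 elsewhere.

    `box` is (x0, y0, x1, y1), end-exclusive. The band width on each side is
    assumed to be `alpha` times the box width/height, rounded to pixels.
    """
    x0, y0, x1, y1 = box
    pad_x = int(round(alpha * (x1 - x0)))
    pad_y = int(round(alpha * (y1 - y0)))
    # Outer (context) box, clipped to the feature map.
    ox0, oy0 = max(x0 - pad_x, 0), max(y0 - pad_y, 0)
    ox1, oy1 = min(x1 + pad_x, width), min(y1 + pad_y, height)
    mask = np.zeros((height, width), dtype=np.float32)
    mask[oy0:oy1, ox0:ox1] = -1.0   # context band (and box, overwritten next)
    mask[y0:y1, x0:x1] = 1.0        # region of interest
    return mask

features = np.ones((2, 9, 12), dtype=np.float32)  # stand-in for post-ReLU features
mask = ternary_roi_mask(9, 12, box=(3, 3, 9, 6))
masked = features * mask[None, :, :]
```

Because the incoming features are non-negative, the sign flip alone marks a value as context, while positions with mask value 0 are removed entirely.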
Fig. 5: (a) Input image; (b) Feature map before masking; (c) Binary RoIMasking; (d) Ternary RoIMasking.
3.3 Segmentation Branch
Taking into account the structure of our backbone, we take the feature maps from the lateral layer associated with conv3, which has a stride of 8, as the input to our segmentation branch, as a trade-off between global context and details. Before the RoIMasking layer, we first add a simple convolutional layer with 256 channels to compress the number of channels.
RoIMasking modifies the feature maps by highlighting the areas bounding the instances together with the corresponding surrounding context. However, it is still difficult for it to distinguish the salient instances from other instances inside the same scope. To this end, we add a new module, the salient instance discriminator (SID), to help distinguish instances even under large obstructions.
Seeing the whole instance is crucial for distinguishing instances, so we should ensure that the receptive field of SID is large enough. A detailed illustration of our SID module can be found in Fig. 3c. In addition to two residual blocks, we add two max-pooling layers with stride 1 and dilated convolutional layers with dilation rate 2 to enlarge the receptive field. All convolutional layers use stride 1. For the channel numbers, we set the first three layers to 128 and the rest to 64, which we find to be enough for salient instance segmentation.
3.4 Loss function
As described above, our framework has two sibling branches, for detection and saliency segmentation respectively. The detection branch performs objectness classification and coordinate regression, and the segmentation branch performs saliency segmentation. Therefore, we use a multi-task loss on each training sample to jointly train the model:
$$L = L_{obj} + L_{coord} + L_{seg}.$$
Since positive proposals are far fewer than negative ones in the detection branch, we adopt the following strategy. Let $P$ and $N$ be the collections of positive and negative proposals, and $N_P$ and $N_N$ be the numbers of positive and negative proposals ($N_P \ll N_N$); we then calculate the positive and negative objectness losses separately to avoid the domination of negative gradients during training. Thus we have:
$$L_{obj} = -\left(\frac{1}{N_P}\sum_{i \in P} \log p_i + \frac{1}{N_N}\sum_{j \in N} \log (1 - p_j)\right),$$
in which $p_i$ is the probability of the $i$-th proposal being positive.
A regression loss $\ell$ is used for coordinate regression, and the overall regression loss is the mean of the losses over all objects in an input image, which can be computed by
$$L_{coord} = \frac{1}{N_P}\sum_{i \in P} \ell(t_i, t_i^*),$$
in which $t_i^*$ is the regression target for the coordinates $t_i$ predicted by the detection branch for the $i$-th object.
We also use the cross-entropy loss for the segmentation branch. Although the predicted score map has the same size as the feature map input to the segmentation branch, because of RoIMasking only part of the score map is valid, namely the region set to 1 in the mask. Let $N_S$ be the number of valid pixels in the score map; the loss for the segmentation branch is:
$$L_{seg} = -\frac{1}{N_S}\sum_{k=1}^{N_S}\left(m_k^* \log m_k + (1 - m_k^*) \log (1 - m_k)\right),$$
in which $m_k$ is the predicted probability of pixel $k$ belonging to the foreground, and $m_k^*$ is 1 if pixel $k$ belongs to the foreground in the ground truth and 0 otherwise.
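The two cross-entropy terms described above can be sketched in NumPy as follows. This is an illustrative reading of the text, not the authors' code: the function names, the `eps` clipping used for numerical safety, and the requirement that at least one positive and one negative proposal exist are all assumptions of the sketch. The coordinate regression term is left out because the specific regression loss is not recoverable here.

```python
import numpy as np

def objectness_loss(probs, is_positive, eps=1e-7):
    """Cross-entropy with positives and negatives averaged separately.

    Assumes at least one positive and one negative proposal.
    """
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos_loss = -np.log(probs[is_positive]).mean()
    neg_loss = -np.log(1.0 - probs[~is_positive]).mean()
    return pos_loss + neg_loss

def segmentation_loss(pred, target, valid, eps=1e-7):
    """Mean per-pixel cross-entropy over the valid (mask == 1) pixels only."""
    pred = np.clip(np.asarray(pred, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    valid = np.asarray(valid, dtype=bool)
    ce = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return ce[valid].mean()

# One positive and three negatives: each group contributes its own mean.
obj = objectness_loss([0.5, 0.5, 0.5, 0.5], [True, False, False, False])
seg = segmentation_loss(np.full((4, 4), 0.5), np.zeros((4, 4)), np.eye(4, dtype=bool))
```

With a single mean over all four proposals, the lone positive would carry a quarter of the weight; with separate averaging it carries half, which is the effect the strategy is after.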
4 Experiments
In this section, we carry out ablation studies to analyze the function of each component of our method. We also perform thorough comparisons with state-of-the-art methods to show the effectiveness of our approach. As salient instance segmentation is a relatively new vision problem with few datasets released, we use only the dataset proposed in [30] for all experiments. This dataset contains 1,000 images with instance-level annotations. For fair comparisons, as done in [30], we randomly select 500 images for training, 200 for validation, and 300 for testing.
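A random 500/200/300 split of this kind can be sketched with the Python standard library. The function name, the seed, and the use of integer image IDs are hypothetical; the sketch does not reproduce the actual split used in the experiments.

```python
import random

def split_dataset(image_ids, n_train=500, n_val=200, seed=0):
    """Shuffle the IDs with a fixed seed and cut them into train/val/test."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(1000))
```

Fixing the seed keeps the three subsets disjoint and reproducible across runs.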
4.1 Implementation Details
Training and Testing. In the training phase, IoU is used to determine whether a bounding box proposal is a positive or negative sample in the detection branch. Recall that in Faster R-CNN, a bounding box proposal is assigned as a positive sample if the IoU between the proposal and a ground-truth bounding box is more than 0.7. The proposal is assigned as a negative sample if the IoU is less than 0.3, and it is ignored if the IoU lies between 0.3 and 0.7. However, we empirically found that using a single IoU threshold is clearly better at reducing false alarms than the double IoU thresholds of Faster R-CNN. So in our detection branch, a bounding box proposal is positive if its IoU is above 0.5 and negative if its IoU is below 0.5.
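The single-threshold assignment can be sketched in plain Python as below. The `(x0, y0, x1, y1)` box format and the function names are assumptions, and treating an IoU of exactly 0.5 as negative is a choice made for this sketch only.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def assign_label(proposal, gt_boxes, threshold=0.5):
    """Return 1 (positive) if the best IoU exceeds `threshold`, else 0."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    return 1 if best > threshold else 0

label_far = assign_label((0, 0, 2, 2), [(1, 0, 3, 2)])    # IoU = 1/3
label_near = assign_label((0, 0, 2, 2), [(0, 0, 2, 3)])   # IoU = 2/3
```

Unlike the two-threshold scheme, no proposal is ignored: every proposal receives one of the two labels.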
In the testing phase, the bounding boxes fed into the RoIMasking layer come from the detection branch. In the training phase, however, we directly feed the ground-truth bounding boxes into the RoIMasking layer. This provides the segmentation branch with more stable and valid training data and also accelerates the training process.
Hyper-parameters. Our proposed network is implemented in TensorFlow [1]. The input images are augmented by horizontal flipping. The hyper-parameters are set as follows: weight decay 0.0001 and momentum 0.9. We train our network on 2 GPUs for 20k iterations, with an initial learning rate of 0.004, which is divided by 10 at iteration 10k. It takes only 40 minutes to train the whole model.
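The learning rate schedule above amounts to a single step. A minimal sketch, in which the function name is hypothetical and the drop is assumed to take effect from iteration 10,000 onward:

```python
def learning_rate(iteration, base_lr=0.004, drop_at=10_000, factor=10.0):
    """Step schedule: `base_lr` before `drop_at`, `base_lr / factor` from then on."""
    return base_lr / factor if iteration >= drop_at else base_lr

# Sample the schedule at the start, around the drop, and at the end of training.
schedule = [learning_rate(it) for it in (0, 9_999, 10_000, 19_999)]
```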
4.2 Analysis of RoIMasking
Fig. 6: (a) Binary masking; (b) Ternary masking.
This subsection demonstrates the importance of the context information around the regions of interest in the feature maps and the effectiveness of ternary RoIMasking. To do so, we explore the impact of each activation in the feature maps before RoIMasking on the performance. Inspired by [50], we visualize the function of a specific neuron in this model by drawing a gradient map. After loading the fully trained model weights, we do a forward pass using a specific image. In this process, the activation value of the feature maps before RoIMasking, $F$, is extracted and stored. Next, we do a backward pass. Note that in the general training stage, back-propagation is performed to calculate the gradients of the total loss with respect to the weights of the neural network. In this experiment, however, we load the stored $F$ as a variable and regard the convolution kernels as constants. Back-propagation is performed to calculate the gradients of the saliency loss $L_{seg}$ with respect to each feature map $F_i$ input to RoIMasking:
$$G_i = \frac{\partial L_{seg}}{\partial F_i}.$$
The absolute value of $G_i$ reflects the importance of each feature map pixel to the saliency task. After summing $|G_i|$ along the channel dimension, the gradient map is obtained.
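The final reduction step, summing absolute gradients over channels, can be sketched in NumPy. To keep the example self-contained, a toy per-pixel linear scorer with a squared-error loss stands in for the trained segmentation branch; that scorer, its analytic gradient, and all function names are assumptions of this sketch and say nothing about the real network's gradients.

```python
import numpy as np

def gradient_map(grads):
    """Sum |dL/dF| over the channel axis of a (C, H, W) gradient tensor."""
    return np.abs(grads).sum(axis=0)

def toy_saliency_grads(features, mask, weights, target):
    """Analytic dL/dF for a toy per-pixel linear scorer (not the real branch).

    score = sum_c weights[c] * (features[c] * mask)
    L     = 0.5 * sum((score - target) ** 2)
    """
    masked = features * mask[None, :, :]
    score = np.tensordot(weights, masked, axes=1)   # (H, W)
    residual = score - target                       # (H, W)
    return weights[:, None, None] * mask[None, :, :] * residual[None, :, :]

features = np.ones((2, 4, 4))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
weights = np.array([1.0, -2.0])
grads = toy_saliency_grads(features, mask, weights, target=np.zeros((4, 4)))
gmap = gradient_map(grads)
```

In this toy setting a mask value of 0 blocks the gradient completely, so only positions with a non-zero mask value can show a response in the map.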
Fig. 6 shows the gradient maps for binary RoIMasking and ternary RoIMasking, respectively. The orange rectangle is the ground-truth bounding box of a salient instance. By definition, the pixels inside the orange rectangle in the ternary mask are set to 1 and the pixels between the orange and blue boxes are set to -1. In Fig. 6b there are evident responses in the ‘-1’ area, whereas in Fig. 6a there are only a few responses between the orange and blue boxes. This observation indirectly supports the importance of the context information around the regions of interest. More experimental results can be found in the subsequent parts.
4.3 Ablation Studies
In order to evaluate the effectiveness of each component of our proposed framework for instance-level salient object segmentation, we train our model on the salient instance segmentation dataset of [30]. Following the standard COCO metrics [36], we report mAP (average precision averaged over IoU thresholds) together with the scores at individual IoU thresholds, as well as separate scores for large, medium, and small instances. In order to analyze the ability to distinguish different instances, we divide the test set into two parts: one contains only separated instances, while the other comprises obstructed instances. Quantitative results for these two subsets are reported separately.
The Effect of RoIMasking. To evaluate the effectiveness of the proposed RoIMasking layer, we also consider using RoIPool and RoIAlign. We simply replace our RoIMasking with RoIPool or RoIAlign to perform two comparative experiments, keeping the other network structures and experimental conditions unchanged. Quantitative evaluation results are listed in Table 1. As can be seen, our proposed binary RoIMasking and ternary RoIMasking both outperform RoIPool and RoIAlign. Specifically, ternary RoIMasking improves on the RoIAlign result by around 2.5 points. This shows that considering more context information outside the proposals does help salient instance segmentation.
The SID Module. An evaluation of our SID module is shown in Table 1. For this experiment, we simply remove the SID module to show how much performance gain it brings to our framework. As can be seen in the right part of Table 1, the main difference between these two cases lies in the results on images with obstructions. There is no evident performance gain on samples with only separated objects but a large improvement on images with obstructions (+8%). This suggests that the major function of the SID module is to further distinguish different instances within the same region.
The Size of Context Regions. To better understand our RoIMasking layer, we analyze how large the context regions should be. We define an expansion coefficient $\alpha$ that sets the width of the ‘-1’ region in the RoI mask relative to the bounding box size of a salient instance; the valid region is therefore the bounding box enlarged by this band. By default, we set $\alpha$ to 1/3. We also tried different values of $\alpha$ to explore its influence on the final results, as shown in Table 2, and found that both larger and smaller values slightly harm the performance. This indicates that the default context size is enough for discriminating different instances.
The Number of Proposals. The number of proposals sent to the segmentation branch also affects the performance. According to our experiments, more proposals lead to better performance but higher computational cost. Fig. 8 shows the relationship between performance and speed as the number of proposals increases. Notice that the performance gain is not obvious when the number of proposals exceeds 20. Specifically, when we set the number of proposals to 100, only around 1.5% improvement is achieved but the running speed drops dramatically. Taking this into account, we use 20 proposals as a trade-off during the inference phase. Users may choose their own number of proposals according to their tasks.
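Limiting the number of proposals can be sketched in plain Python as below. The text does not say how the proposals are ranked, so selecting them by detection score is an assumption, as are the function name and the box format.

```python
def select_top_proposals(boxes, scores, k=20):
    """Keep the `k` highest-scoring boxes, best first (assumed ranking rule)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [boxes[i] for i in order[:k]]

boxes = [(0, 0, 4, 4), (1, 1, 5, 5), (2, 2, 6, 6)]
kept = select_top_proposals(boxes, scores=[0.2, 0.9, 0.5], k=2)
```

The parameter `k` is the knob that trades accuracy for speed in the discussion above.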
Table 4 (columns): Base models | mAP@0.5 | mAP@0.7 | Speed (FPS)
Base Models. Besides ResNet-50 [22], we also try three other popular base models: ResNet-101 [22], VGG16 [45], and MobileNet [24]. Table 4 lists the results for the different base models. As can be seen, base models with better classification performance also work better in our experiments. In terms of speed, the proposed S4Net achieves real-time processing. At our default input size, S4Net runs at 40.0 fps on a GTX 1080 Ti GPU. Furthermore, with MobileNet [24] as the base model, S4Net runs at 90.9 fps.
4.4 Comparisons with the State of the Art
As salient instance segmentation is a new problem, there is only one related work, MSRNet [30], that can be used for comparison. For a more thorough comparison, we also train an instance-level semantic segmentation model, FCIS [33], on the saliency dataset as an additional baseline. In this experiment, both our S4Net and FCIS are based on ResNet-50 [22] pre-trained on the ImageNet dataset. We report the results on the ‘test’ set.
Quantitative Analysis. The results of the comparative experiments are listed in Table 3. Our proposed S4Net achieves the best results on both reported metrics. Specifically, our approach improves on the baseline result presented in MSRNet [30] by about 27 points on one metric and by more than 17 points on the other, on the same dataset. Compared to FCIS [33], our method also wins by a large margin in every column of Table 3. Even when only binary RoIMasking is used or the SID module is removed, our approach still outperforms both MSRNet [30] and FCIS [33].
5 Conclusion
In this paper, we present S4Net, a single-stage salient instance segmentation framework that performs instance-level salient object segmentation in real time. Based on a single-stage object detector, we introduce a novel segmentation branch containing a novel RoIMasking layer and a salient instance discriminator (SID). Our RoIMasking layer preserves the original resolution and aspect ratio of the regions of interest and at the same time takes into account more context information outside the proposals. The SID module enlarges the receptive field of our segmentation branch, which further boosts the performance. Thorough experiments show that the proposed RoIMasking outperforms RoIAlign and RoIPool, especially for distinguishing instances in the same scope. Our S4Net achieves state-of-the-art performance on a publicly available benchmark. Finally, we hope this framework can find a useful niche in image manipulation and robotic perception.
References
- [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- [2] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE TPAMI, 2012.
- [3] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE TPAMI, 33(5):898–916, 2011.
- [4] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). 2008.
- [5] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2Photo: Internet image montage. ACM TOG, 28(5):124:1–10, 2009.
- [6] M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu. Global contrast based salient region detection. IEEE TPAMI, 2015.
- [7] M.-M. Cheng, F.-L. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu. RepFinder: Finding approximately repeated scene elements for image editing. In ACM TOG, volume 29, page 83, 2010.
- [8] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014.
- [9] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
- [10] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
- [11] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
- [12] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
- [13] L. Elazary and L. Itti. Interesting objects are visually salient. Journal of Vision, 2008.
- [14] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004.
- [15] G. Lee, Y.-W. Tai, and J. Kim. Deep saliency with encoded low level distance map and high level features. In CVPR, 2016.
- [16] R. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
- [17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- [18] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
- [19] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, pages 447–456, 2015.
- [20] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
- [21] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE TPAMI, 2015.
- [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [23] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr. Deeply supervised salient object detection with short connections. In CVPR, 2017.
- [24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [25] L. Itti and C. Koch. Computational modeling of visual attention. Nature Reviews Neuroscience, 2(3):194–203, 2001.
- [26] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, (11):1254–1259, 1998.
- [27] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li. Salient object detection: A discriminative regional feature integration approach. In CVPR, pages 2083–2090, 2013.
- [28] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen. RON: Reverse connection with objectness prior networks for object detection. 2017.
- [29] F.-F. Li, R. VanRullen, C. Koch, and P. Perona. Rapid natural scene categorization in the near absence of attention. Proceedings of the National Academy of Sciences, 2002.
- [30] G. Li, Y. Xie, L. Lin, and Y. Yu. Instance-level salient object segmentation. In CVPR, 2017.
- [31] G. Li and Y. Yu. Visual saliency based on multiscale deep features. In CVPR, pages 5455–5463, 2015.
- [32] G. Li and Y. Yu. Deep contrast learning for salient object detection. In CVPR, 2016.
- [33] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
- [34] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. 2017.
- [35] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
- [36] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [37] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
- [38] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In NIPS, 2015.
- [39] J. Pont-Tuset, P. Arbelaez, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE TPAMI, 2017.
- [40] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE TPAMI, 2017.
- [41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
- [42] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. 2014.
- [43] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE TPAMI, 2000.
- [44] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
- [45] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- [46] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
- [47] L. Wang, H. Lu, X. Ruan, and M.-H. Yang. Deep networks for saliency detection via local estimation and global search. In CVPR, pages 3183–3192, 2015.
- [48] C. Wu, I. Lenz, and A. Saxena. Hierarchical semantic labeling for task-relevant RGB-D perception. In Robotics: Science and Systems, 2014.
- [49] H. Wu, Y.-S. Wang, K.-C. Feng, T.-T. Wong, T.-Y. Lee, and P.-A. Heng. Resizing by symmetry-summarization. In ACM TOG, volume 29, page 159, 2010.
- [50] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
- [51] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech. Unconstrained salient object detection via proposal subset optimization. In CVPR, 2016.
- [52] R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In CVPR, pages 1265–1274, 2015.