Towards the Success Rate of One:
Real-time Unconstrained Salient Object Detection
In this work, we propose an efficient and effective approach for unconstrained salient object detection in images using deep convolutional neural networks. Instead of generating thousands of candidate bounding boxes and refining them, our network directly learns to generate a saliency map containing the exact number of salient objects. During training, we convert the ground-truth rectangular boxes to Gaussian distributions that better capture the ROI of individual salient objects. During inference, the network predicts Gaussian distributions centered at salient objects with appropriate covariances, from which bounding boxes are easily inferred. Notably, our network performs saliency map prediction without pixel-level annotations, salient object detection without object proposals, and salient object subitizing simultaneously, all in a single pass within a unified framework. Extensive experiments show that our approach outperforms existing methods on various datasets by a large margin, and achieves more than 100 fps with the VGG16 network on a single GPU during inference.
Saliency detection is the problem of finding the most distinct regions in a visual scene. It has attracted a great amount of attention due to its importance in object detection, image segmentation, image thumbnailing, video summarization, etc.
Saliency detection has been studied under three different scenarios. Early works attempt to predict human eye-fixation over an image , while later works increasingly focus on salient foreground segmentation [18, 19, 23, 41, 50], i.e., predicting a dense, pixel-level binary map to differentiate salient objects from background. However, this formulation cannot separate overlapping salient objects and requires pixel-level annotations that are expensive to acquire for large datasets. In contrast, salient object detection aims to locate and draw bounding boxes around salient objects. It only requires bounding box annotations, which significantly reduces the human labeling effort, and can easily separate overlapping objects. These advantages make salient object detection particularly valuable for real-world applications.
With the re-emergence of convolutional neural networks (CNNs), the computer vision community has witnessed numerous breakthroughs, including in salient object detection, thanks to the extraordinary discriminative ability of CNNs . Prior to CNNs, some works [26, 37, 40, 43] proposed heuristics to detect a single salient object in an image, while others [10, 36] rank a fixed-sized list of bounding boxes which might contain salient objects, without determining the exact detections. However, most of these methods do not solve the existence problem, i.e., determining whether any salient objects exist in an image at all, and simply rely on external binary classifiers to address it. Recently, saliency detection based on deep networks has achieved state-of-the-art performance. Zhang et al.  propose to use the MultiBox proposal network  to generate hundreds of candidate bounding boxes that are further ranked to output a compact set of salient objects. A probabilistic approach is proposed to filter and re-rank candidate boxes as a substitute for non-maximum suppression (NMS). To accurately localize salient objects, this approach requires a large number of class-agnostic proposals covering the whole image (see Figure 1), and its precision and recall drop significantly if only tens of boxes are used. The reason is that generic object proposals have a very low success rate of locating an object, i.e., only a few of them tightly enclose the ground-truth objects, while most are redundant. Even though additional refinement steps are applied , many false positives remain (see Figure 1). These additional steps also add overhead and make the framework infeasible for real-time applications.
In this paper, we address this problem by moving towards a success rate of one, i.e., generating the exact number of boxes for salient objects without object proposals. We present an end-to-end deep network for real-time salient object detection, dubbed RSD. Rather than generating a large set of candidate boxes and filtering them, our network directly predicts a saliency map with Gaussian distributions centered at salient objects, and infers bounding boxes from these distributions. Our network consists of two branches trained with a multi-task loss to perform saliency map prediction, salient object detection and subitizing simultaneously, all in a single pass within a unified framework. Notably, our RSD with VGG16 achieves more than 100 fps on a single GPU during inference, significantly faster than existing CNN-based approaches. To the best of our knowledge, this is the first work on real-time non-redundant bounding box prediction for simultaneous salient object detection, saliency map estimation and subitizing, without object proposals. We also show the possibility of generating accurate saliency maps without pixel-level annotations, formulating the task as a weakly-supervised approach that is more practical than fully-supervised ones.
Our contributions are summarized as follows. First, we present a unified deep network performing salient object detection, saliency map prediction and subitizing simultaneously in a single pass. Second, our network is trained with Gaussian distributions centered at ground-truth salient objects, which are more informative and discriminative than rectangular boxes for distinguishing multiple salient objects. Third, our approach outperforms state-of-the-art methods using object proposals by a large margin, and also produces comparable results on salient foreground segmentation datasets, even though we do not use any pixel-level annotations. Finally, our network achieves more than 100 fps during inference and is applicable to real-time systems.
2 Related Work
Salient object detection aims to mark important regions in an image with rectangles. Early works assume that there is only one dominant object in an image and utilize various hand-crafted features to detect salient objects [24, 43]. Salient objects are segmented out by a CRF model  or by bounding box statistics learned from a large image database . Some works [10, 36] demonstrate the ability to generate multiple overlapping bounding boxes in a single scene by combining multiple image features. Recently, Zhang et al.  apply deep networks with object proposals to achieve state-of-the-art results. However, these methods are not scalable to real-time applications due to the use of sliding windows, complex optimization, or expensive box sampling processes.
Object proposals, generated either by grouping superpixels [3, 4, 39] or from sliding windows [2, 51], have been widely used in object detection. However, generating a large number of proposals is a bottleneck for real-time detection [11, 12]. Recently, deep networks have been trained to generate proposals in an end-to-end manner to improve efficiency [8, 30]. SSD  and YOLO  instead adopt a grid structure to generate candidate boxes, but they still rely on a (smaller) set of proposal-like boxes. Different from previous methods, our approach does not use any proposals.
Object subitizing addresses the object existence problem by learning an external binary classifier [33, 43]. Zhang et al.  present a salient object subitizing model to remove detected boxes in images with no salient object. While the method in  addresses existence and localization problems at the same time, it still requires generating proposals recursively, which is inefficient.
Saliency map prediction produces a binary mask to segment salient objects from background. While both bottom-up methods using low-level image features [28, 45, 5, 20] and top-down methods [24, 43] have been proposed over the decades, many recent works utilize deep neural networks for this task [50, 19, 41, 23, 42, 21]. Li et al.  propose a model for visual saliency using multi-scale deep features computed by CNNs. Wang et al.  develop two deep neural networks to learn local features and global contrast with geometric features to predict the saliency score of each region. In , both global and local context are combined into a single deep network, while a fully convolutional network is applied in . Note that existing methods heavily rely on pixel-level annotations [50, 23, 42] or external semantic information, e.g., superpixels , which is not feasible for large-scale problems where human labeling is extremely sparse. In contrast, our approach is weakly supervised: it only requires bounding box annotations and produces promising saliency maps as a free by-product, along with salient object detection and subitizing.
3 Proposed Approach
Existing detection methods based on CNNs and object proposals [8, 11, 12, 30, 48] convert the problem of selecting candidate locations in the spatial domain of an image into a parameter estimation problem, e.g., finding four independent numbers as the coordinates of each bounding box. They use up to billions of parameters in fully connected (fc) layers [11, 12], which is computationally expensive and increases the risk of overfitting on small datasets. In contrast, our RSD approach discards proposals and directly solves the problem in the spatial domain. It reduces the number of parameters from billions to millions and achieves real-time speed. We predict a rough saliency map, from which we infer exactly as many boxes as there are ground-truth objects, guided by the subitizing output of our network. This unified framework addresses three closely related problems, saliency map prediction, subitizing and salient object detection, without allocating separate resources for each.
3.1 Network architecture
Our network is composed of the following components (see Figure 2). Images first go through a series of convolutional layers from any widely used backbone, such as VGG16 or ResNet-50. Specifically, we use the convolutional layers conv1_1 through conv5_3 from VGG16 , and conv1 through res4f from ResNet-50 . These layers capture low-level cues and high-level visual semantics. Two branches are connected to the feature maps of the last convolutional layer: a saliency map prediction branch and a subitizing branch. The saliency map prediction branch consists of two convolutional layers, conv_s1 and conv_s2, which continue processing the image in the spatial domain and produce a rough saliency map. The layer conv_s1 has 80 filters to produce intermediate saliency maps conditioned on different latent distributions of the objects (e.g., latent object categories); each of the 80 filters can be seen as generating a rough saliency map for one specific type of category. The layer conv_s2 summarizes these conditional maps into a single saliency map with a single filter followed by a sigmoid function. The subitizing branch predicts the number of salient objects, which can be 0, 1, 2, or more. It consists of the final fc layers for VGG16, and of all the remaining convolutional layers followed by a global average pooling layer and a single fc layer for ResNet-50.
3.2 Ground-truth preparation
The ground-truth for salient object detection only contains a set of numbers defining coordinates of bounding boxes tightly enclosing the objects. Although we can generate a binary mask based on these coordinates, i.e., 1 inside the bounding boxes and 0 elsewhere, it cannot separate overlapping objects or encode non-rigid boundaries well.
To address this problem, we propose to generate Gaussian distributions to represent salient objects, and use images with Gaussian distributions as ground-truth saliency maps. Given a ground-truth bounding box $b_k$ in an image of width $W$ and height $H$, let $(x_k, y_k, w_k, h_k)$ represent the coordinates of its center, its width, and its height. If the network has a stride of $s$ at the beginning of the saliency map prediction branch (e.g., 16 for VGG16), the ground-truth saliency map $G$ is an image of size $\lfloor W/s \rfloor \times \lfloor H/s \rfloor$, where $\lfloor \cdot \rfloor$ is the floor function. Its $j$-th element is then defined as

$$G_j = \max_{k=1,\dots,K} \mathcal{N}(\mathbf{p}_j \mid \boldsymbol{\mu}_k, \Sigma_k)\, \mathbb{1}[\mathbf{p}_j \in R_k], \qquad (1)$$

where $\mathbf{p}_j$ is the location vector, and $\boldsymbol{\mu}_k = (x_k/s,\, y_k/s)$ is the mean value. $K$ is the number of ground-truth bounding boxes in the image. $R_k$ represents the ROI inside bounding box $b_k$. $\mathbb{1}[\cdot]$ is an indicator function. The covariance matrix $\Sigma_k$ can be represented as

$$\Sigma_k = \mathrm{diag}\!\left(\left(\frac{w_k}{2s}\right)^{\!2},\; \left(\frac{h_k}{2s}\right)^{\!2}\right). \qquad (2)$$
By (1), we represent each bounding box as a normalized 2D Gaussian distribution, located at the center of the bounding box, with the co-variance determined by the bounding box’s height and width and truncated at the box boundary. As shown in Figure 3, the Gaussian shape ground-truth provides better separation for multiple objects compared to rectangular bounding boxes. It also naturally acts as spatial weighting to the ground-truth, so that the network learns to focus more on the center of objects instead of being distracted by background.
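The ground-truth construction above can be sketched in a few lines of NumPy. This is a minimal illustration of Eqs. (1)–(2), not the authors' code: the diagonal covariance with standard deviation proportional to the box size, the peak-normalization of each Gaussian, and the helper name `gaussian_gt_map` are all assumptions for the sketch.

```python
import numpy as np

def gaussian_gt_map(boxes, img_w, img_h, stride=16):
    """Rasterize ground-truth boxes (cx, cy, w, h), given in image pixels,
    into a truncated-Gaussian saliency map at the network's output stride.
    Assumed diagonal covariance: sigma = (box size) / (4 * stride)."""
    gw, gh = img_w // stride, img_h // stride
    gt = np.zeros((gh, gw), dtype=np.float32)
    ys, xs = np.mgrid[0:gh, 0:gw].astype(np.float32)
    for (cx, cy, w, h) in boxes:
        mx, my = cx / stride, cy / stride            # mean in map coordinates
        sx = max(w / (4 * stride), 1e-3)
        sy = max(h / (4 * stride), 1e-3)
        g = np.exp(-0.5 * (((xs - mx) / sx) ** 2 + ((ys - my) / sy) ** 2))
        g /= g.max()                                 # normalize peak to 1
        # truncate at the box boundary -- the indicator function in Eq. (1)
        inside = ((np.abs(xs - mx) <= w / (2 * stride)) &
                  (np.abs(ys - my) <= h / (2 * stride)))
        gt = np.maximum(gt, g * inside)              # overlay multiple objects
    return gt

gt = gaussian_gt_map([(160, 120, 96, 64), (400, 240, 64, 96)], 480, 320)
print(gt.shape)  # (20, 30)
```

The truncation mask keeps each Gaussian confined to its own box, which is what lets two nearby objects remain separated in the ground-truth map.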
3.3 Multi-task loss
Our network predicts a saliency map from an image and performs subitizing as well. During training, the network tries to minimize the difference between the ground-truth map and the predicted saliency map. Although the Euclidean loss is widely used to measure pixel-wise distance, it pushes gradients towards 0 when the values of most pixels are 0, which happens in our application for images with no salient object. Therefore, we use a weighted Euclidean loss to better handle this scenario, defined as

$$L_{map}(\mathbf{s}, \mathbf{g}) = \sum_j \left(1 + \beta\, \mathbb{1}[g_j > 0.5]\right) (s_j - g_j)^2, \qquad (3)$$

where $\mathbf{s}$ and $\mathbf{g}$ are the vectorized predicted and ground-truth saliency maps, respectively, and $s_j$ and $g_j$ represent their $j$-th elements. $\beta$ is a constant weight set to 5 in all our experiments. Essentially, the loss assigns more weight to pixels with a ground-truth value higher than 0.5, compared to those with values close to 0. In this way, the problem of vanishing gradients is alleviated, since the loss focuses more on pixels belonging to the real salient objects and is not dominated by background pixels. As a classifier, the subitizing branch minimizes the multinomial logistic loss $L_{sub}$ between the ground-truth number of objects $n$ and the predicted number of objects $\hat{n}$. The two losses are combined as our final multi-task loss
$$L = L_{map} + \lambda L_{sub}, \qquad (4)$$

where $\lambda$ is a weighting factor to balance the importance of the two losses. We set $\lambda$ so that the magnitudes of the two loss values are comparable.
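As a sanity check, the multi-task loss of Eqs. (3)–(4) can be sketched in NumPy. This is an illustrative re-implementation under stated assumptions, not the training code: the exact form of the per-pixel weight (here $1+\beta$ on salient pixels vs. $1$ elsewhere), the function names, and the toy values of `logits` and `lam` are all hypothetical.

```python
import numpy as np

def weighted_euclidean_loss(pred, gt, beta=5.0):
    """Pixel-wise squared error, up-weighting pixels whose ground-truth
    value exceeds 0.5 so salient regions are not drowned out by background."""
    w = np.where(gt > 0.5, 1.0 + beta, 1.0)
    return float(np.sum(w * (pred - gt) ** 2))

def subitizing_loss(logits, label):
    """Multinomial logistic (cross-entropy) loss over object counts."""
    z = logits - logits.max()                      # stabilized log-softmax
    log_probs = z - np.log(np.sum(np.exp(z)))
    return float(-log_probs[label])

def multitask_loss(pred_map, gt_map, logits, count, lam=1.0):
    # lam balances the two terms; its value here is illustrative only
    return weighted_euclidean_loss(pred_map, gt_map) + lam * subitizing_loss(logits, count)

pred = np.array([[0.2, 0.9], [0.1, 0.0]])
gt   = np.array([[0.0, 1.0], [0.0, 0.0]])
logits = np.array([0.1, 2.0, 0.3, 0.2])   # toy scores for counts 0, 1, 2, more
loss = multitask_loss(pred, gt, logits, count=1)
```

Note how the single salient pixel (ground truth 1.0) contributes six times as much to the map loss as a background pixel with the same error, which is the mechanism described above for avoiding vanishing gradients on mostly-empty maps.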
The loss in (4) defines a multi-task learning problem previously studied by other vision applications [11, 30]. It reduces required resources by sharing weights between different tasks, and acts as a regularization to avoid over-fitting. We use standard SGD with momentum and weight decay for learning the parameters of the network.
To ensure a fair comparison, we adopt the same two-stage training scheme suggested by . In the first stage, we initialize the network using weights trained on ImageNet  for classification and fine-tune it on the ILSVRC2014 detection dataset  by treating all objects as salient. In the second stage, we continue fine-tuning the network on the SOS dataset  for salient object subitizing and detection. Although all images in SOS are annotated for subitizing, some are not labeled for detection. Therefore, we do not back-propagate gradients to the saliency map prediction branch for images labeled as containing salient objects but lacking bounding box annotations. The loss function used to fine-tune on the SOS dataset is thus $L = \mathbb{1}[n_{box} > 0]\, L_{map} + \lambda L_{sub}$, where $n_{box}$ indicates the number of bounding box annotations in the image.
3.5 Bounding box generation
Our method leverages the saliency map prediction branch and the subitizing branch to infer the correct number and locations of bounding boxes. Given the output of the subitizing branch $\hat{n}$ (with confidence score $c$) and the rough saliency prediction map $S$, the goal is to find Gaussians that align with the predicted saliency map and are supported by the subitizing output, which can be formulated as

$$\min_{\{\boldsymbol{\mu}_k, \Sigma_k\},\, K}\; E_{map}\!\left(S, \{\boldsymbol{\mu}_k, \Sigma_k\}\right) + E_{sub}(K, \hat{n})\, \mathbb{1}[\hat{n} < n_{max}]\, \mathbb{1}[c \geq \tau], \qquad (5)$$

where $E_{map}$ captures the discrepancy between the predicted saliency map and the generated Gaussian map. $E_{sub}$ measures the disagreement between the subitizing output and the number of Gaussians $K$, from which the boxes' locations can be inferred. $n_{max}$ is the maximal possible output of the subitizing branch, i.e., the maximal number of salient objects. $c$ is the confidence score of the subitizing branch, and $\tau$ is a fixed confidence threshold that will be discussed later. In other words, if $\hat{n} = n_{max}$ or $c$ is lower than the threshold, we rely only on the predicted saliency map to determine the number and locations of salient objects. Since solving (5) directly is intractable, we propose a feasible and efficient greedy algorithm to approximate it, which predicts the centers and scales of boxes while optimizing the objective function. If $\hat{n} = 0$, our method does not generate any bounding boxes; otherwise it detects either a single object or multiple objects.
3.5.1 Single salient object detection
If $\hat{n} = 1$ and the confidence of the subitizing branch is larger than a pre-defined threshold $\tau$, we assume there is only a single object. We convert the saliency map $S$ to a binary map $B$ using a fixed threshold, and then perform contour detection using the fast Teh-Chin chain approximation  on $B$ to detect connected components and infer bounding boxes $\{b_i\}$. We define the ROI of box $b_i$ on the original map $S$ as $R_i$, from which the maximal value is assigned as its score $c_i$. The box with the highest score is selected as the salient object. The entire process is summarized in Algorithm 1.
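A proposal-free version of this single-object step can be sketched as follows. The paper uses the Teh-Chin chain approximation for contour detection; this sketch substitutes a plain connected-component search, and the threshold value, helper names, and toy map are illustrative assumptions.

```python
import numpy as np

def connected_boxes(binary):
    """4-connected components of a binary map -> bounding boxes (x0, y0, x1, y1)."""
    h, w = binary.shape
    seen = np.zeros_like(binary, dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not seen[sy, sx]:
                stack, x0, y0, x1, y1 = [(sy, sx)], sx, sy, sx, sy
                seen[sy, sx] = True
                while stack:                      # flood fill one component
                    y, x = stack.pop()
                    x0, y0, x1, y1 = min(x0, x), min(y0, y), max(x1, x), max(y1, y)
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes

def detect_single(saliency, thresh=0.5):
    """Algorithm 1 sketch: binarize, find components, score each box by the
    maximal saliency inside it, keep the highest-scoring one."""
    boxes = connected_boxes(saliency > thresh)
    if not boxes:
        return None
    scores = [saliency[y0:y1+1, x0:x1+1].max() for (x0, y0, x1, y1) in boxes]
    return boxes[int(np.argmax(scores))]

s = np.zeros((8, 8)); s[1:3, 1:3] = 0.6; s[5:7, 4:7] = 0.9
print(detect_single(s))  # (4, 5, 6, 6) -- the higher-peaked blob wins
```

Scoring a box by the maximum saliency inside it, rather than by its area, matches the intuition that the Gaussian ground-truth concentrates mass at object centers.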
3.5.2 Multiple salient object detection
If $\hat{n} > 1$, there may be multiple salient objects. When the subitizing branch outputs $\hat{n} = n_{max}$, or its confidence score $c < \tau$, we rely on the predicted saliency map to find as many reliable peaks as possible. Therefore, our method is able to detect an arbitrary number of salient objects (see Table 1). Otherwise, we try to find at least $\hat{n}$ reliable peaks. A multi-level thresholding scheme is proposed for robust peak detection and for balancing the two terms, $E_{map}$ and $E_{sub}$, in (5). Starting from a high threshold, a peak is discovered from $S$ following steps similar to Algorithm 1. Peaks are continuously identified and added to the set of peaks $P$ by reducing the threshold and repeating the process, until the cardinality of $P$ reaches or exceeds $\hat{n}$. Note that the predicted number of boxes depends on both the subitizing and saliency map prediction branches, and could be less or more than $\hat{n}$ if no threshold can separate reliable peaks or if more peaks are detected at different thresholds.
After the initial set of peaks is determined, peaks with low confidence are treated as noise and removed. We then find separating lines to isolate the remaining peaks into different non-overlapping regions. Each line perpendicular to the line segment connecting a pair of peaks is associated with a score, the maximal value of the pixels the line passes through on $S$. The line with the minimal score is selected as the separating line of the two peaks. In this way, we ensure that the separating line passes through the boundary between objects rather than through the objects themselves. These lines divide $S$ into different regions. Finally, for each peak $p_i$, we apply Algorithm 1 to its corresponding region of the saliency map to obtain a bounding box. Algorithm 2 summarizes the process.
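The multi-level peak search can be approximated with a simple local-maximum sweep. This sketch is a simplification of Algorithm 2: the actual method grows peaks from thresholded connected components as in Algorithm 1 and then partitions the map with separating lines, both omitted here; the threshold levels and function names are assumptions.

```python
import numpy as np

def local_peaks(saliency, thresh):
    """Grid points that are strict local maxima (8-neighborhood) above thresh."""
    h, w = saliency.shape
    peaks = []
    for y in range(h):
        for x in range(w):
            v = saliency[y, x]
            if v <= thresh:
                continue
            nb = saliency[max(0, y-1):y+2, max(0, x-1):x+2]
            if v >= nb.max() and (nb == v).sum() == 1:   # unique max in window
                peaks.append((x, y))
    return peaks

def find_peaks_multilevel(saliency, n_expected, levels=(0.9, 0.7, 0.5, 0.3)):
    """Multi-level thresholding sketch: sweep thresholds from high to low,
    collecting distinct peaks, until at least n_expected are found."""
    found = []
    for t in levels:
        for p in local_peaks(saliency, t):
            if p not in found:
                found.append(p)
        if len(found) >= n_expected:
            break
    return found

s = np.zeros((10, 10)); s[2, 2] = 0.95; s[7, 6] = 0.6
print(find_peaks_multilevel(s, 2))  # [(2, 2), (6, 7)]
```

Sweeping from a high threshold downward finds strong peaks first and stops early when the subitizing estimate is already satisfied, mirroring the greedy trade-off between the two terms of (5).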
4.1 Experimental setup
We evaluate our salient object detection method on four datasets: MSO, PASCAL-S, MSRA, and DUT-O. The MSO dataset is the test set of the SOS dataset, annotated for salient object detection. It contains images with multiple salient objects and many background images with no salient object. PASCAL-S is a subset of the PASCAL VOC 2010 validation set  annotated for the salient object segmentation problem. It contains images with multiple salient objects, with 8 subjects judging the saliency of each object segment. As suggested by , we define salient objects as those with a saliency score of at least 0.5, i.e., at least half of the subjects believe the object is salient, and use their surrounding rectangles as the ground-truth. We also compare the subitizing performance of our method and existing methods on this dataset. The MSRA and DUT-O datasets only contain images with a single salient object. For every image in these two datasets, five raw bounding box annotations are provided, which are converted to a single ground-truth box following the protocol in . We use only the SOS dataset for training and the others for evaluation. To verify that our RSD can also generate accurate pixel-wise saliency maps, we additionally compare our method with existing methods on the ECSSD  and PASCAL-S  datasets.
Parameters and settings.
In Algorithms 1 and 2, we use a fixed strong-evidence threshold $\tau$, a fixed set of peak detection thresholds, and vertical and horizontal lines as separating lines. Our real-time network based on VGG16 and our network based on ResNet-50 use different input image sizes. We smooth predicted saliency maps with a Gaussian filter before converting them to binary maps, choosing the standard deviation of the kernel according to the input size. In the first training stage, we use Xavier initialization for conv_s1 and conv_s2 and a Gaussian initializer for the final fc layer in the subitizing branch. For fine-tuning on SOS, we use SGD with momentum and weight decay, with separate learning rates for our VGG16- and ResNet-50-based methods. All timings are measured on an NVIDIA Titan X Pascal GPU, on a system with 128GB RAM and an Intel Core i7 6850K CPU.
Salient object detection.
We compare RSD with several existing methods, including the state-of-the-art approach of : SalCNN+MAP, SalCNN+NMS, SalCNN with Maximum Marginal Relevance (SalCNN+MMR), and MultiBox  with NMS. Unlike our RSD, which generates exactly as many bounding boxes as there are salient objects, the other methods have free parameters that determine the number of bounding boxes selected from hundreds of proposals, which greatly affects their performance. For a fair comparison, we sweep these free parameters and show their best results together with our performance point in Figure 4. Note that we use the same set of parameters on all datasets, while the other methods reach their best performance with different parameters on different datasets.
On the MSO and PASCAL-S datasets, which contain multiple salient objects, our RSD-ResNet produces the best results at the same precision or recall rate. RSD-VGG achieves precision/recall comparable to the state-of-the-art methods while running substantially faster. Although our subitizing branch covers only a limited range of counts, Table 1 shows that our RSD-ResNet also achieves the best results on images whose object count exceeds this range, based on the predicted saliency map. On the MSRA and DUT-O datasets, which contain a single salient object per image, both RSD-VGG and RSD-ResNet outperform the state-of-the-art methods by a large margin. Notably, our RSD-ResNet achieves nearly 15% and 10% absolute improvement in precision at the same recall rate on the MSRA and DUT-O datasets, respectively, which clearly indicates that our method, without any object proposals, is more powerful and robust even when it is allowed to generate only a single bounding box.
|Method|ResNet |RSD-ResNet|VGG |RSD-VGG|
|---|---|---|---|---|
|F1 Score ( objects)|79.2/77.4|78.9/77.0|71.6/72.6|72.5/70.7|
|F1 Score ( objects)|57.5/26.8|55.2/50.9|46.1/47.7|47.7/48.5|
We evaluate the subitizing performance of our RSD on the MSO dataset. First, we compare RSD with state-of-the-art methods in terms of solving the existence problem in Table 1. While our parameters are fixed, we vary the parameters of the other methods on each dataset to match our performance. For example, when comparing with RSD-ResNet, we tune the parameters of the other methods so that they achieve the same recall as ours, and then compare the number of false positives on the background images. We do the same for the comparison with RSD-VGG.
For predicting existence, both RSD-ResNet and RSD-VGG produce fewer false positives when there is no salient object. Additionally, we compare the counting performance of RSD with two baselines using vanilla ResNet-50 and VGG16 in Table 2. For a fair comparison, we use exactly the same training scheme and initialization for all networks. Our RSD produces better accuracy than vanilla ResNet-50 and VGG16, verifying that multi-task training helps the subitizing branch learn a better classifier by utilizing information from saliency map prediction.
Saliency map prediction.
In real-world scenarios, pixel-level annotations are difficult to collect, and it is challenging to generate precise saliency maps without such detailed labeling. As a weakly-supervised approach using only bounding boxes for salient foreground segmentation, our RSD still generates accurate saliency maps that align well with multiple salient objects in the scene. We compare RSD against five powerful unsupervised salient object segmentation algorithms, RC , SF , GMR , PCAS , and GBVS , and three state-of-the-art supervised methods, HDCT , DRFI , and GBVS+PatchCut . We evaluate the performance using precision-recall curves. Specifically, precision and recall are computed by binarizing the grayscale saliency map at varying thresholds [1, 28, 45, 44] and comparing the binary mask against the ground-truth. Our RSD approach performs surprisingly well considering that it only uses rough Gaussian maps as ground-truth. In particular, RSD-ResNet produces results comparable to the fully-supervised methods in terms of precision/recall, making it readily applicable to salient foreground segmentation without any pixel-level annotations.
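The PR-curve protocol described above can be sketched as follows; the threshold grid and function name are illustrative choices, not the evaluation code used in the paper.

```python
import numpy as np

def pr_curve(saliency, gt_mask, thresholds=np.linspace(0.0, 1.0, 11)):
    """Precision/recall of a grayscale saliency map against a binary ground
    truth, binarizing at a sweep of thresholds (standard PR-curve protocol)."""
    pr = []
    gt = gt_mask.astype(bool)
    for t in thresholds:
        pred = saliency >= t
        tp = np.logical_and(pred, gt).sum()
        precision = tp / max(pred.sum(), 1)   # guard against empty predictions
        recall = tp / max(gt.sum(), 1)
        pr.append((float(precision), float(recall)))
    return pr

s = np.array([[0.9, 0.2], [0.6, 0.1]])
g = np.array([[1, 0], [1, 0]])
curve = pr_curve(s, g)
print(curve[0])  # (0.5, 1.0) -- at threshold 0 everything is predicted salient
```

Sweeping the binarization threshold traces the full precision-recall trade-off of a single grayscale map, which is what Figure-style PR comparisons in this literature report.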
4.3 Ablation study
Although we use neither proposals nor a pruning stage such as NMS, our straightforward bounding box generation algorithm produces good results. Moreover, bounding boxes generated by our method align with the ground-truth better than those of existing approaches, leading to the best precision and recall, as shown in Figure 6. In this experiment, we let the other methods pick the parameters that reach the same recall as ours at a fixed IoU threshold, and then vary the IoU threshold to evaluate how performance changes. Notably, under a stricter IoU criterion, such as 0.8, RSD still maintains relatively high precision and recall, while the precision and recall of all the other methods drop sharply. At this IoU, even our fast RSD-VGG outperforms the state-of-the-art methods on all datasets by an average margin of around 10% in terms of both precision and recall. The results clearly demonstrate that our network predicts an accurate saliency map and generates only a few bounding boxes tightly enclosing the correct salient objects. Some qualitative results are presented in Figure 7. Our RSD approach clearly outperforms SalCNN+MAP in generating bounding boxes that more tightly enclose the ground-truth.
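The IoU criterion used in this ablation is the standard intersection-over-union between a predicted and a ground-truth box: a detection counts as correct when the IoU exceeds the chosen threshold (e.g., the stricter 0.8 above). A minimal sketch, with an assumed `(x0, y0, x1, y1)` box convention:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150 = 0.333...
```

Raising the IoU threshold from 0.5 to 0.8 demands much tighter localization, which is why methods relying on loosely-fitting proposals degrade sharply under this criterion.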
|Method||MSO Dataset||MSRA Dataset||DUT-O Dataset||PASCAL-S Dataset|
The behavior of detection methods usually differs between small and large objects. To better understand how our method works compared to existing methods, we further analyze its performance with respect to object size. Objects with an area above a fixed threshold are counted as large objects. For the MSO and DUT-O datasets, ground-truth boxes with an area below a smaller threshold are defined as small objects; we increase this threshold for the MSRA dataset to obtain a statistically reliable subset for performance estimation, since the salient objects in this dataset are generally larger. We evaluate precision and recall on small and large objects separately and show the results in Table 3.
Our RSD-ResNet clearly outperforms all the compared methods, achieving the best performance on the MSO dataset for both small and large objects. It also produces the best recall at the same precision for large objects on the MSRA dataset and for small objects on the DUT-O dataset, indicating that it discovers objects of different sizes well under various conditions. At the same recall, our RSD-ResNet greatly improves the precision, especially for small objects, which are difficult to locate with object proposal based approaches.
4.4 Run-time efficiency
By directly generating the saliency map in a single network forward pass, without proposals, our approach is extremely efficient for salient object detection during inference. We compare the run-time speed of SalCNN  and our approach in Table 4. With ResNet, our approach achieves nearly 20 fps, while SalCNN only runs at 10 fps. With VGG16, our method achieves an impressive 120 fps, much faster than SalCNN and readily applicable to real-time scenarios. This experiment confirms that we successfully improve the run-time speed of the network by removing the bottleneck of proposal generation and refinement.
We have presented a real-time unconstrained salient object detection framework based on deep convolutional neural networks, named RSD. By eliminating the steps of proposing and refining thousands of candidate boxes, our network learns to directly generate the exact number of salient objects. It performs saliency map prediction without pixel-level annotations, salient object detection without object proposals, and salient object subitizing simultaneously, all in a single pass within a unified framework. Extensive experiments show that our RSD approach outperforms existing methods on various datasets for salient object detection and subitizing, and produces comparable results for salient foreground segmentation. In particular, our approach based on the VGG16 network achieves more than 100 fps on average on a GPU during inference, significantly faster than the state-of-the-art approach, while being more accurate.
-  R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-tuned salient region detection. In CVPR, pages 1597–1604, 2009.
-  B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Trans. Pattern Anal. Mach. Intell., 34(11):2189–2202, 2012.
-  P. A. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marqués, and J. Malik. Multiscale combinatorial grouping. In CVPR, pages 328–335, 2014.
-  J. Carreira and C. Sminchisescu. CPMC: automatic object segmentation using constrained parametric min-cuts. IEEE Trans. Pattern Anal. Mach. Intell., 34(7):1312–1328, 2012.
-  M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell., 37(3):569–582, 2015.
-  J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
-  M. Donoser, M. Urschler, M. Hirzer, and H. Bischof. Saliency driven total variation segmentation. In ICCV, pages 817–824, 2009.
-  D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, pages 2155–2162, 2014.
-  M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
-  J. Feng, Y. Wei, L. Tao, C. Zhang, and J. Sun. Salient object detection by composition. In ICCV, pages 1028–1035, 2011.
-  R. B. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
-  R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
-  J. Harel, C. Koch, P. Perona, et al. Graph-based visual saliency. In NIPS, volume 1, page 5, 2006.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
-  L. Itti, C. Koch, E. Niebur, et al. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1254–1259, 1998.
-  P. Jiang, H. Ling, J. Yu, and J. Peng. Salient region detection by ufo: Uniqueness, focusness and objectness. In ICCV, pages 1976–1983, 2013.
-  J. Kim, D. Han, Y.-W. Tai, and J. Kim. Salient region detection via high-dimensional color transform. In CVPR, pages 883–890, 2014.
-  G. Lee, Y.-W. Tai, and J. Kim. Deep saliency with encoded low level distance map and high level features. In CVPR, 2016.
-  G. Li and Y. Yu. Visual saliency based on multiscale deep features. In CVPR, pages 5455–5463, 2015.
-  X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang. Saliency detection via dense and sparse reconstruction. In ICCV, pages 2976–2983, 2013.
-  X. Li, L. Zhao, L. Wei, M. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang. DeepSaliency: Multi-task deep neural network model for salient object detection. IEEE Trans. Image Processing, 25(8):3919–3930, 2016.
-  Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille. The secrets of salient object segmentation. In CVPR, pages 280–287, 2014.
-  N. Liu and J. Han. DHSNet: Deep hierarchical saliency network for salient object detection. In CVPR, pages 678–686, 2016.
-  T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum. Learning to detect a salient object. IEEE Trans. Pattern Anal. Mach. Intell., 33(2):353–367, 2011.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. In ECCV, pages 21–37, 2016.
-  Y. Luo, J. Yuan, P. Xue, and Q. Tian. Saliency density maximization for object detection and localization. In ACCV, pages 396–408, 2010.
-  L. Marchesotti, C. Cifarelli, and G. Csurka. A framework for visual saliency detection with applications to image thumbnailing. In ICCV, pages 2232–2239, 2009.
-  F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung. Saliency filters: Contrast based filtering for salient region detection. In CVPR, pages 733–740, 2012.
-  J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
-  S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
-  Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang. Region-based saliency detection and its application in object recognition. IEEE Trans. Circuits Syst. Video Technol., 24(5):769–779, 2014.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  C. Scharfenberger, S. L. Waslander, J. S. Zelek, and D. A. Clausi. Existence detection of objects in images for robot vision using saliency histogram features. In CRV, pages 75–82, 2013.
-  D. Simakov, Y. Caspi, E. Shechtman, and M. Irani. Summarizing visual data using bidirectional similarity. In CVPR, pages 1–8, 2008.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  P. Siva, C. Russell, T. Xiang, and L. Agapito. Looking beyond the image: Unsupervised learning for object saliency and detection. In CVPR, pages 3238–3245, 2013.
-  B. Suh, H. Ling, B. B. Bederson, and D. W. Jacobs. Automatic thumbnail cropping and its effectiveness. In UIST, pages 95–104, 2003.
-  C.-H. Teh and R. T. Chin. On the detection of dominant points on digital curves. IEEE Trans. Pattern Anal. Mach. Intell., 11(8):859–872, 1989.
-  J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
-  R. Valenti, N. Sebe, and T. Gevers. Image saliency by isocentric curvedness and color. In ICCV, pages 2185–2192. IEEE, 2009.
-  L. Wang, H. Lu, X. Ruan, and M.-H. Yang. Deep networks for saliency detection via local estimation and global search. In CVPR, pages 3183–3192, 2015.
-  L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan. Saliency detection with recurrent fully convolutional networks. In ECCV, pages 825–841, 2016.
-  P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li. Salient object detection for searched web images via global saliency. In CVPR, pages 3194–3201, 2012.
-  Q. Yan, L. Xu, J. Shi, and J. Jia. Hierarchical saliency detection. In CVPR, 2013.
-  C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via graph-based manifold ranking. In CVPR, pages 3166–3173, 2013.
-  J. Yang, B. Price, S. Cohen, Z. Lin, and M.-H. Yang. PatchCut: Data-driven object segmentation via local shape transfer. In CVPR, pages 1770–1778, 2015.
-  J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke, Z. Lin, X. Shen, B. Price, and R. Mech. Salient object subitizing. In CVPR, pages 4045–4054, 2015.
-  J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech. Unconstrained salient object detection via proposal subset optimization. In CVPR, 2016.
-  L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision, 8(7):32, 2008.
-  R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In CVPR, pages 1265–1274, 2015.
-  C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391–405, 2014.