Salient Object Detection: A Review and Benchmark

Ali Borji, Ming-Ming Cheng, Qibin Hou, Huaizu Jiang and Jia Li A. Borji is with the Center for Research in Computer Vision at University of Central Florida, Orlando, FL. E-mail: M.M. Cheng (Corresponding author) and Q. Hou are with CCCE, Nankai University, Tianjin, China. E-mail: {cmm.thu, andrewhoux} H. Jiang is with the Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, China. E-mail: J. Li is with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, and International Research Institute for Multidisciplinary Science at Beihang University. E-mail: A. Borji and M.-M. Cheng equally contributed to this work. Manuscript received xx 2017.

Salient Object Detection: A Large Scale Evaluation

Salient Object Detection: A Survey

Detecting and segmenting salient objects in natural scenes, often referred to as salient object detection, has attracted a lot of interest in computer vision. While many models have been proposed and several applications have emerged, a deep understanding of achievements and issues is still lacking. We aim to provide a comprehensive review of the recent progress in salient object detection and situate this field among other closely related areas such as generic scene segmentation, object proposal generation, and saliency for fixation prediction. Covering 228 publications, we survey i) roots, key concepts, and tasks, ii) core techniques and main modeling trends, and iii) datasets and evaluation metrics in salient object detection. We also discuss open problems such as evaluation metrics and dataset bias in model performance and suggest future research directions.


Salient object detection, bottom-up saliency, explicit saliency, visual attention, regions of interest

1 Introduction

Humans are able to detect visually distinctive, so-called salient, scene regions effortlessly and rapidly (i.e., the pre-attentive stage). These filtered regions are then perceived and processed in finer detail for the extraction of richer high-level information (i.e., the attentive stage). This capability has long been studied by cognitive scientists and has recently attracted a lot of interest in the computer vision community, mainly because it helps find the objects or regions that efficiently represent a scene and thus helps address complex vision problems such as scene understanding. Some topics that are closely or remotely related to visual saliency include: salient object detection [1], fixation prediction [2, 3], object importance [4, 5, 6], memorability [7], scene clutter [8], video interestingness [9, 10, 11, 12], surprise [13], image quality assessment [14, 15, 16], scene typicality [17, 18], aesthetics [11] and attributes [19]. Given space limitations, this paper cannot fully explore all the aforementioned research directions. Instead, we focus only on salient object detection, a research area that has developed greatly in the past twenty years, in particular since 2007 [20].

1.1 What is Salient Object Detection about?

“Salient object detection” or “salient object segmentation” is commonly interpreted in computer vision as a process that includes two stages: 1) detecting the most salient object and 2) segmenting the accurate region of that object. Models, however, rarely distinguish explicitly between these two stages (with few exceptions such as [21, 22, 23]). Following the seminal works by Itti et al. [24] and Liu et al. [25], models adopt the saliency concept to perform the two stages simultaneously. This is evidenced by the fact that these stages have not been separately evaluated. Further, mostly area-based scores have been employed for model evaluation (e.g., precision-recall). The first stage does not necessarily need to be limited to only one object. The majority of existing models, however, attempt to segment the most salient object, although their prediction maps can be used to find several objects in the scene. The second stage falls in the realm of classic segmentation problems in computer vision, with the difference that here accuracy is determined only by the most salient object.

Fig. 1: An example image in Borji et al.’s experiment [26] along with annotated salient objects. Dots represent 3-second free-viewing fixations.
Fig. 2: Sample results produced by different models. From left to right: input image, salient object detection [27], fixation prediction [24], image segmentation (regions with various sizes) [28], image segmentation (superpixels with comparable sizes) [29], and object proposals (true positives) [30].

In general, it is agreed that for good saliency detection a model should meet at least the following three criteria: 1) good detection: the probability of missing real salient regions and falsely marking the background as a salient region should be low, 2) high resolution: saliency maps should have high or full resolution to accurately locate salient objects and retain original image information, and 3) computational efficiency: as front-ends to other complex processes, these models should detect salient regions quickly.

1.2 Situating Salient Object Detection

Salient object detection models usually aim to detect only the most salient objects in a scene and segment the whole extent of those objects. Fixation prediction models, on the other hand, typically try to predict where humans look, i.e., a small set of fixation points [31, 32]. Since the two types of methods output a single continuous-valued saliency map, where a higher value in this map indicates that the corresponding image pixel is more likely to be attended, they can be used interchangeably.

A strong correlation exists between fixation locations and salient objects. Further, humans often agree with each other when asked to choose the most salient object in a scene [26, 23, 22]. These observations are illustrated in Fig. 1.

Unlike salient object detection and fixation prediction models, object proposal models aim at producing a small set, typically a few hundred or a few thousand, of overlapping candidate object bounding boxes or region proposals [33]. Object proposal generation and salient object detection are highly related. Saliency estimation is explicitly used as a cue in objectness methods [34, 35].

Image segmentation, a.k.a. semantic scene labeling or semantic segmentation, is one of the very well researched areas in computer vision (e.g., [36]). In contrast to salient object detection where the output is a binary map, these models aim to assign a label, one out of several classes such as sky, road, and building, to each image pixel.

Fig. 2 illustrates the difference among these themes.


[Timeline figure. 1st wave: Itti et al. [24], computational model. 2nd wave: Liu et al. [25], definition as a binary labeling problem, dataset with bounding boxes; Achanta et al. [37]: pixel-accuracy ground-truth dataset; Cheng et al. [38]: global contrast; Goferman et al. [39]: context-aware saliency; Perazzi et al. [27]: saliency filters; new models and datasets: [40, 41, 42, 43]. 3rd wave: deep models [44, 45, 46, 47, 48]; Hou et al. [49]: deeply supervised.]

Fig. 3: A simplified chronicle of salient object detection modeling. The first wave started with the Itti et al. model [24], followed by the second wave with the work of Liu et al. [25], who were the first to define saliency as a binary segmentation problem. The third wave started with the surge of deep learning models and the Li et al. model [47].

1.3 History of Salient Object Detection

One of the earliest saliency models, proposed by Itti et al.[24], generated the first wave of interest across multiple disciplines including cognitive psychology, neuroscience, and computer vision. This model is an implementation of earlier general computational frameworks and psychological theories of bottom-up attention based on center-surround mechanisms (e.g., Feature Integration Theory by Treisman and Gelade [50], Guided Search Model by Wolfe et al. [51], and the Computational Attention Architecture by Koch and Ullman [52]). In [24], Itti et al. show some examples where their model is able to detect spatial discontinuities in scenes. Subsequent behavioral (e.g., [53]) and computational investigations (e.g., [54]) used fixations as a means to verify the saliency hypothesis and to compare models.

A second wave of interest surged with the works of Liu et al. [25, 55] and Achanta et al. [56] who defined saliency detection as a binary segmentation problem. These authors were inspired by some earlier models striving to detect salient regions or proto-objects (e.g., Ma and Zhang [57], Liu and Gleicher [58], and Walther et al. [59]). A plethora of saliency models has emerged since then. It has been, however, less clear how this new definition relates to other established computer vision areas such as image segmentation (e.g., [60, 61]), category independent object proposal generation (e.g., [34, 62, 30]), fixation prediction (e.g., [54, 63, 64, 65, 66]), and object detection (e.g., [67, 68]).

A third wave of interest has appeared recently with the resurgence of convolutional neural networks (CNNs) [69], in particular with the introduction of fully convolutional neural networks [70]. Unlike the majority of classic methods based on contrast cues [1], CNN-based methods eliminate the need for hand-crafted features, alleviate the dependency on center bias knowledge, and hence have been adopted by many researchers. A CNN-based model normally contains hundreds of thousands of tunable parameters and neurons with variable receptive field sizes. Neurons with large receptive fields provide global information that can help better identify the most salient region in an image, while neurons with small receptive fields provide local information that can be leveraged to refine saliency maps produced by the top layers. This allows highlighting salient regions and refining their boundaries. These desirable properties enable CNN-based models to achieve unprecedented performance compared to hand-crafted feature-based models. CNN models are gradually becoming the mainstream direction in salient object detection.

2 Survey of the State of the Art

In this section, we review related works in three categories: 1) salient object detection models, 2) applications, and 3) datasets. Because some models are so similar that it is hard to draw sharp boundaries among them, here we mainly focus on the models contributing to the major “waves” in the chronicle shown in Fig. 3.

2.1 Old Testament: Classic Models

A large number of approaches have been proposed for detecting salient objects in images in the past two decades. Except for a few models which attempt to segment objects-of-interest (e.g., [71, 72, 73]), most of these approaches aim to identify the salient subsets (see Footnote 1) from images first (i.e., compute a saliency map) and then integrate them to segment the entire salient object.

Footnote 1: Visual subsets could be pixels, blocks, superpixels and regions. Blocks are rectangular patches uniformly sampled from the image (pixels are blocks). A superpixel or a region is a perceptually homogeneous image patch that is confined by intensity edges. Superpixels in the same image often have comparable but different sizes, while the shapes and sizes of regions may change remarkably.

In general, classic approaches can be categorized in two different ways depending on the types of operations or attributes they exploit.

1) Block-based vs. Region-based analysis. Two types of visual subsets have been utilized to detect salient objects: blocks and regions (see Footnote 2). Blocks were primarily adopted by early approaches, while regions became popular with the introduction of superpixel algorithms.

Footnote 2: In this review, the term “block” is used to represent pixels and patches, while “superpixel” and “region” are used interchangeably.

2) Intrinsic cues vs. Extrinsic cues. A key step in detecting salient objects is to distinguish them from distractors. To this end, some approaches propose to extract various cues only from the input image itself to highlight targets and to suppress distractors (i.e., the intrinsic cues). However, other approaches argue that intrinsic cues are often insufficient to distinguish targets from distractors, especially when they share common visual attributes. To overcome this issue, they incorporate extrinsic cues such as user annotations, depth maps, or statistical information of similar images to facilitate detecting salient objects in the image.

From the above model categorization, four combinations are thus possible. For better organization, we group the models in three major subgroups: 1) block-based models with intrinsic cues, 2) region-based models with intrinsic cues, and 3) models with extrinsic cues (both block- and region-based). Some approaches that may not easily fit into these subgroups will be discussed under the other classic models subgroup. Reviewed models are listed in Fig. 4 (Intrinsic models), Fig. 5 (Extrinsic models), and Fig. 6 (Other classic models).

2.1.1 Block-based Models with Intrinsic Cues

In this subsection, we mainly review salient object detection models which utilize intrinsic cues extracted from blocks. Following the seminal work of Itti et al. [24], salient object detection is widely defined as capturing the uniqueness, distinctiveness, or rarity in a scene.

In early works [57, 58, 56], uniqueness was often computed as the pixel-wise center-surround contrast. Hu et al. [74] represent the input image in a 2D space using the polar transformation of its features. Each region in the image is then mapped into a 1D linear subspace. Afterwards, the Generalized Principal Component Analysis (GPCA) [75] is used to estimate the linear subspaces without actually segmenting the image. Finally, salient regions are selected by measuring feature contrasts and geometric properties of regions. Rosin [76] proposes an efficient approach for detecting salient objects. His approach is parameter-free and requires only very simple pixel-wise operations such as edge detection, threshold decomposition and moment preserving binarization. Valenti et al. [77] propose an isophote-based framework where the saliency map is estimated by linearly combining the saliency maps computed in terms of curvedness, color boosting, and isocenters clustering.

In an influential study, Achanta et al. [37] adopt a frequency-tuned approach to compute full resolution saliency maps. The saliency of pixel $x$ is computed as:

$$s(x) = \left\| I_{\mu} - I_{\omega_{hc}}(x) \right\|^{2}, \qquad (1)$$

where $I_{\mu}$ is the mean pixel value of the image (e.g., RGB/Lab features) and $I_{\omega_{hc}}$ is a Gaussian blurred version of the input image (e.g., using a $5 \times 5$ kernel).
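The frequency-tuned idea above can be illustrated with a short NumPy sketch: each pixel's saliency is the squared distance between the mean image feature and a slightly blurred copy of the image. This is a minimal sketch under stated assumptions, not the reference implementation of [37]: the function names `binomial_blur` and `frequency_tuned_saliency` are hypothetical, the blur is approximated by a separable 5-tap binomial kernel, the input is assumed to be an H x W x C float array already in the desired color space, and the final normalization to [0, 1] is added only for convenience.

```python
import numpy as np

def binomial_blur(img):
    """Blur an H x W x C image with a separable 5-tap binomial kernel
    (used here as a cheap approximation of a small Gaussian)."""
    kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    out = img.astype(float)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (2, 2)
        padded = np.pad(out, pad, mode="edge")
        acc = np.zeros_like(out)
        for i, w in enumerate(kernel):
            sl = [slice(None)] * out.ndim
            sl[axis] = slice(i, i + out.shape[axis])
            acc += w * padded[tuple(sl)]
        out = acc
    return out

def frequency_tuned_saliency(img):
    """Per-pixel squared distance between the mean feature vector of the
    image and the blurred image, rescaled to [0, 1]."""
    img = img.astype(float)
    mean_feature = img.reshape(-1, img.shape[-1]).mean(axis=0)
    blurred = binomial_blur(img)
    sal = np.sum((blurred - mean_feature) ** 2, axis=-1)
    peak = sal.max()
    return sal / peak if peak > 0 else sal
```

Because the mean feature is dominated by the background, a small object whose color differs from the rest of the image receives a high score, while the blur suppresses pixel-level noise.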

Without any prior knowledge of the sizes of salient objects, multi-scale contrast is frequently adopted for robustness purposes [58, 25]. An $L$-layer Gaussian pyramid is first constructed (as in [58, 25]). The saliency score of pixel $x$ in the image at the $l$-th level of this pyramid (denoted as $I^{(l)}$) is defined as:

$$s(x) = \sum_{l=1}^{L} \sum_{x' \in \mathcal{N}(x)} \left\| I^{(l)}(x) - I^{(l)}(x') \right\|^{2}, \qquad (2)$$

where $\mathcal{N}(x)$ is a neighboring window centered at $x$ (e.g., $9 \times 9$ pixels). Even with such multi-scale enhancement, intrinsic cues derived at pixel-level are often too poor to support object segmentation. To address this, some works (e.g., [25, 56, 78, 79]) extended the contrast analysis to the patch level (i.e., a patch compared to its neighbors).
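The multi-scale contrast described above can be sketched as follows: at every pyramid level, each pixel accumulates the squared feature differences to the pixels in its local window, and the per-level maps are summed at the input resolution. This is an illustrative sketch only. The helper names are hypothetical, 2 x 2 block averaging stands in for proper Gaussian pyramid reduction, nearest-neighbor upsampling is an assumption, and the default `levels` and `radius` values are chosen for the toy example rather than taken from [58, 25].

```python
import numpy as np

def downsample(img):
    """Halve the resolution by 2 x 2 block averaging
    (a simple stand-in for Gaussian pyramid reduction)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def local_contrast(img, radius):
    """Sum of squared feature differences between each pixel and all
    pixels in its (2 * radius + 1)^2 window."""
    h, w = img.shape[:2]
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros((h, w))
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = padded[dy:dy + h, dx:dx + w]
            out += np.sum((img - shifted) ** 2, axis=-1)
    return out

def multiscale_contrast(img, levels=3, radius=1):
    """Accumulate local contrast over pyramid levels for an H x W x C image."""
    img = img.astype(float)
    h, w = img.shape[:2]
    total = np.zeros((h, w))
    level_img = img
    for _ in range(levels):
        contrast = local_contrast(level_img, radius)
        # Nearest-neighbor upsampling back to the input resolution.
        ys = np.minimum(np.arange(h) * contrast.shape[0] // h, contrast.shape[0] - 1)
        xs = np.minimum(np.arange(w) * contrast.shape[1] // w, contrast.shape[1] - 1)
        total += contrast[np.ix_(ys, xs)]
        level_img = downsample(level_img)
    peak = total.max()
    return total / peak if peak > 0 else total
```

On a flat image the score is zero everywhere, and on an image with a small bright square the highest responses sit on the square's edges, which illustrates why pixel-level contrast alone tends to highlight boundaries rather than whole objects.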

Later in [78], Klein and Frintrop proposed an information-theoretic approach to compute center-surround contrasts using the Kullback-Leibler divergence between distribution of features such as intensity, color and orientation. Li et al. [79] formulated the center-surround contrast as a cost-sensitive max-margin classification problem. The center patch is labeled as a positive sample while the surrounding patches are all used as negative samples. The saliency of the center patch is then determined by its separability from surrounding patches based on a trained cost-sensitive Support Vector Machine (SVM).
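As a rough illustration of the information-theoretic contrast mentioned above, the sketch below measures the Kullback-Leibler divergence between the feature distribution of a center patch and that of its surround. It is a simplified stand-in, not the method of [78]: the function name `kl_center_surround` is hypothetical, the distributions are estimated with plain histograms over a single scalar feature, and the bin count, value range, and smoothing constant `eps` are assumptions.

```python
import numpy as np

def kl_center_surround(center_vals, surround_vals, bins=8,
                       value_range=(0.0, 1.0), eps=1e-6):
    """KL divergence D(center || surround) between histogram estimates of
    a scalar feature (e.g., intensity) in the center and surround."""
    p, _ = np.histogram(center_vals, bins=bins, range=value_range)
    q, _ = np.histogram(surround_vals, bins=bins, range=value_range)
    # Additive smoothing avoids division by zero for empty bins.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```

A center whose feature distribution matches its surround yields a divergence of zero, whereas a bright patch on a dark surround yields a large value.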

Some works have defined patch uniqueness as its global contrast with other patches [39]. Intuitively, a patch is considered to be salient if it is remarkably distinct from its most similar patches, while their spatial distances are taken into account. Similarly, Borji and Itti computed local and global patch rarity in RGB and LAB color spaces and fused them to predict fixation locations [65]. In a recent work [80], Margolin et al. propose to define the uniqueness of a patch by measuring its distance to the average patch based on the observation that distinct patches are more scattered than non-distinct ones in the high-dimensional space. To further incorporate the patch distributions, the uniqueness of a patch is measured by projecting its path to the average patch onto the principal components of the image.
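The last idea in the paragraph above, measuring how far a patch lies from the average patch along the principal components, can be sketched in a few lines. This is a minimal sketch assuming that patches are given as rows of a matrix and that the distance is taken as the L1 norm of the PCA coordinates; the function name `patch_distinctiveness` is hypothetical, and the sketch omits everything else in [80] (multi-scale processing, color distinctiveness, and so on).

```python
import numpy as np

def patch_distinctiveness(patches):
    """L1 norm of each patch in the PCA coordinate system of all patches.

    patches: (n_patches, patch_dim) array of vectorized patches.
    """
    X = np.asarray(patches, dtype=float)
    centered = X - X.mean(axis=0)           # subtract the average patch
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt.T                # coordinates along principal axes
    return np.abs(coords).sum(axis=1)
```

Patches close to the average patch receive small scores, and patches that are scattered far from it receive large ones.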

To sum up, approaches in Sec. 2.1.1 aim to detect salient objects based on pixels or patches where only intrinsic cues are utilized. These approaches usually suffer from two shortcomings: i) high-contrast edges usually stand out instead of the salient object, and ii) the boundary of the salient object is not preserved well (especially when using large blocks). To overcome these issues, some methods propose to compute saliency based on regions. This offers two main advantages. First, the number of regions is far less than the number of blocks, which implies the potential to develop highly efficient and fast algorithms. Second, more informative features can be extracted from regions, leading to better performance. These region-based approaches will be discussed in the next subsection.

2.1.2 Region-based Models with Intrinsic Cues

Saliency models in the second subgroup adopt intrinsic cues extracted from image regions generated using methods such as graph-based segmentation [81], mean-shift [28], SLIC [29] or Turbopixels [82]. Different from the block-based models, region-based models often segment an input image into regions aligned with intensity edges first and then compute a regional saliency map.

As an early attempt, in [58], the regional saliency score is defined as the average saliency score of its contained pixels, defined in terms of multi-scale contrast. Yu et al. [83] propose a set of rules to determine the background scores of each region based on observations from background and salient regions. Saliency, defined as uniqueness in terms of global regional contrast, is widely studied in many approaches [84, 42, 85, 86, 87]. In [84], a region-based saliency algorithm is introduced by measuring the global contrast between the target region with respect to all other image regions. In a nutshell, an image is first segmented into $N$ regions $\{r_i\}_{i=1}^{N}$. Saliency of the region $r_i$ is measured as:

$$s(r_i) = \sum_{j=1}^{N} w_{ij} \, D_r(r_i, r_j), \qquad (3)$$

where $D_r(r_i, r_j)$ captures the appearance contrast between two regions. Higher saliency scores are assigned to regions with large global contrast. $w_{ij}$ is a weight term between regions $r_i$ and $r_j$, which incorporates spatial distance and region size. Perazzi et al. [27] demonstrate that if $D_r(r_i, r_j)$ is defined as the Euclidean distance of colors between $r_i$ and $r_j$, the global contrast can be computed using efficient filtering based techniques [88].
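A compact sketch of the regional global contrast described above is given below: every region is compared with every other region, and each pairwise appearance contrast is weighted by the other region's size and by spatial proximity. The concrete choices here are assumptions for illustration rather than the exact formulation of [84]: the function name `global_region_contrast` is hypothetical, appearance contrast is the Euclidean distance between mean region colors, the spatial weight is a Gaussian of the squared centroid distance, and `sigma` is an arbitrary default.

```python
import numpy as np

def global_region_contrast(colors, centers, sizes, sigma=0.4):
    """Weighted sum of color contrasts between each region and all others.

    colors:  (N, C) mean color per region
    centers: (N, 2) region centroids, normalized to [0, 1]
    sizes:   (N,)   region areas (e.g., fraction of image pixels)
    """
    colors = np.asarray(colors, dtype=float)
    centers = np.asarray(centers, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    color_dist = np.linalg.norm(colors[:, None, :] - colors[None, :, :], axis=-1)
    spatial_d2 = np.sum((centers[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    # Nearby and larger regions contribute more to the contrast.
    weights = sizes[None, :] * np.exp(-spatial_d2 / sigma ** 2)
    saliency = np.sum(weights * color_dist, axis=1)
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency
```

With three gray regions and one red region, the red region obtains the highest score because it contrasts with every other region, whereas each gray region contrasts only with the red one.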

In addition to color uniqueness, distinctiveness of complementary cues such as texture [85] and structure [89] is also considered for salient object detection. Margolin et al. [80] propose to combine the regional uniqueness and patch distinctiveness to form a saliency map. Instead of maintaining a hard region index for each pixel, a soft abstraction is proposed in [86] to generate a set of large scale perceptually homogeneous regions using histogram quantization and Gaussian Mixture Models (GMM). By avoiding the hard decision boundaries of superpixels, such soft abstraction provides large spatial support which results in a more uniform saliency region.

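| # | Model | Pub | Year | Elements | Hypothesis: Uniqueness | Hypothesis: Prior | Aggregation (Optimization) | Code |
|---|---|---|---|---|---|---|---|---|
| 1 | FG [57] | MM | 2003 | PI | L | - | - | NA |
| 2 | RSA [74] | MM | 2005 | PA | G | - | - | NA |
| 3 | RE [58] | ICME | 2006 | mPI + RE | L | - | LN | NA |
| 4 | RU [83] | TMM | 2007 | RE | - | P | LN | NA |
| 5 | AC [56] | ICVS | 2008 | mPA | L | - | LN | NA |
| 6 | FT [37] | CVPR | 2009 | PI | CS | - | - | C |
| 7 | ICC [77] | ICCV | 2009 | PI | L | - | LN | NA |
| 8 | EDS [76] | PR | 2009 | PI | - | ED | - | NA |
| 9 | CSM [90] | MM | 2010 | PI + PA | L | SD | - | NA |
| 10 | RC [84] | CVPR | 2011 | RE | G | - | - | C |
| 11 | HC [84] | CVPR | 2011 | RE | G | - | - | C |
| 12 | CC [91] | ICCV | 2011 | mRE | - | CV | - | NA |
| 13 | CSD [78] | ICCV | 2011 | mPA | CS | - | LN | NA |
| 14 | SVO [92] | ICCV | 2011 | PA + RE | CS | O | EM | M + C |
| 15 | CB [93] | BMVC | 2011 | mRE | L | CP | LN | M + C |
| 16 | SF [27] | CVPR | 2012 | RE | G | SD | NL | C |
| 17 | ULR [94] | CVPR | 2012 | RE | SPS | CP + CLP | - | M + C |
| 18 | GS [95] | ECCV | 2012 | PA/RE | - | B | - | NA |
| 19 | LMLC [96] | TIP | 2013 | RE | CS | - | BA | M + C |
| 20 | HS [42] | CVPR | 2013 | hRE | G | - | HI | EXE |
| 21 | GMR [97] | CVPR | 2013 | RE | - | B | - | M |
| 22 | PISA [89] | CVPR | 2013 | RE | G | SD + CP | NL | NA |
| 23 | STD [85] | CVPR | 2013 | RE | G | - | - | NA |
| 24 | PCA [80] | CVPR | 2013 | PA + PE | G | - | NL | M + C |
| 25 | GU [86] | ICCV | 2013 | RE | G | - | - | C |
| 26 | GC [86] | ICCV | 2013 | RE | G | SD | AD | C |
| 27 | CHM [79] | ICCV | 2013 | PA + mRE | CS + L | - | LN | M + C |
| 28 | DSR [98] | ICCV | 2013 | mRE | - | B | BA | M + C |
| 29 | MC [99] | ICCV | 2013 | RE | - | B | - | M + C |
| 30 | UFO [100] | ICCV | 2013 | RE | G | F + O | NL | M + C |
| 31 | CIO [101] | ICCV | 2013 | RE | G | O | GMRF | NA |
| 32 | SLMR [102] | BMVC | 2013 | RE | SPS | BC | - | NA |
| 33 | LSMD [103] | AAAI | 2013 | RE | SPS | CP + CLP | - | NA |
| 34 | SUB [87] | CVPR | 2013 | RE | G | CP + CLP + SD | - | NA |
| 35 | PDE [104] | CVPR | 2014 | RE | - | CP + B + CLP | - | NA |
| 36 | RBD [105] | CVPR | 2014 | RE | - | BC | LS | M |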
Fig. 4: Salient object detection models with intrinsic cues (sorted by year). Elements: {PI = pixel, PA = patch, RE = region}, where prefixes m and h indicate multi-scale and hierarchical versions, respectively. Hypothesis: {CP = center prior, G = global contrast, L = local contrast, ED = edge density, B = background prior, F = focusness prior, O = objectness prior, CV = convexity prior, CS = center-surround contrast, CLP = color prior, SD = spatial distribution, BC = boundary connectivity prior, SPS = sparse noises}. Aggregation (optimization): {LN = linear, NL = non-linear, AD = adaptive, HI = hierarchical, BA = Bayesian, GMRF = Gaussian MRF, EM = energy minimization, LS = least-square solver}. Code: {M = Matlab, C = C/C++, NA = not available, EXE = executable}.

In [93], Jiang et al. propose a multi-scale local region contrast based approach, which calculates saliency values across multiple segmentations for robustness purposes and combines these regional saliency values to obtain a pixel-wise saliency map. A similar idea for estimating regional saliency using multiple hierarchical segmentations is adopted in [42, 98]. Li et al. [79] extend the pairwise local contrast by building a hypergraph, constructed by non-parametric multi-scale clustering of superpixels, to capture both internal consistency and external separation of regions. Salient object detection is then cast as finding salient vertices and hyperedges in the hypergraph.

Salient objects, in terms of uniqueness, can also be defined as the sparse noises in a certain feature space where the input image is represented as a low-rank matrix [94, 102, 103]. The basic assumption is that non-salient regions (i.e., background) can be explained by the low-rank matrix while the salient regions are indicated by the sparse noises.

Based on such a general low-rank matrix recovery framework, Shen and Wu [94] propose a unified approach to incorporate traditional low-level features with higher-level guidance, e.g., center prior, face prior, and color prior, to detect salient objects based on a learned feature transformation (see Footnote 3). Instead, Zou et al. [102] propose to exploit bottom-up segmentation as a guidance cue of the low-rank matrix recovery for robustness. Similar to [94], high-level priors are also adopted in [103], where a tree-structured sparsity-inducing norm regularization is introduced to hierarchically describe the image structure in order to uniformly highlight the entire salient object.

Footnote 3: Though extrinsic ground-truth annotations are adopted to learn high-level priors and the feature transformation, we classify this model in intrinsic models to better organize the low-rank matrix recovery based approaches. Additionally, we treat face and color priors as universal intrinsic cues for salient object detection.

In addition to capturing uniqueness, a growing number of priors have been proposed for salient object detection. The spatial distribution prior [25] implies that the wider a color is distributed in the image, the less likely a salient object contains this color. The spatial distribution of superpixels can also be evaluated efficiently in linear time using a Gaussian blurring kernel, in a way similar to computing the global regional contrast in Eq. (3). Such a spatial distribution prior is also considered in [89], where it is evaluated in terms of both color and structure cues.

The center prior assumes that a salient object is more likely to be found near the image center. In other words, the background tends to be far away from the image center. Accordingly, the backgroundness prior is adopted for salient object detection [95, 97, 98, 99], assuming that a narrow border of the image is the background region, i.e., the pseudo-background. With this pseudo-background as a reference, regional saliency can be computed as the contrast of regions versus “background”. In [97], a two-stage saliency computation framework is proposed based on manifold ranking on an undirected weighted graph. In the first stage, the regional saliency scores are computed based on the relevances given to each side of the pseudo-background queries. In the second stage, the saliency scores are refined based on the relevances given to the initial foreground. In [98], saliency computation is formulated as the dense and sparse reconstruction errors w.r.t. the pseudo-background. The dense reconstruction error of each region is computed based on the Principal Component Analysis (PCA) basis of the background templates, while the sparse reconstruction error is defined as the residual based on the sparse representation of the background templates. These two types of reconstruction errors are propagated to pixels on multiple segmentations, which are then fused to form the final saliency map. Jiang et al. [99] formulate saliency detection via an absorbing Markov chain, where the transient and absorbing nodes are superpixels around the image center and border, respectively. The saliency of each superpixel is computed as the absorbed time of the corresponding transient node, i.e., the expected time for a random walk starting from that node to reach the absorbing nodes.
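The absorbing Markov chain formulation at the end of this paragraph rests on a standard result: if Q is the transient-to-transient block of the transition matrix, the expected number of steps before absorption is the solution of (I - Q) t = 1. The sketch below applies this result to a generic affinity graph. It is a simplified illustration, not the full pipeline of [99]: the function name `absorbed_time` is hypothetical, the affinity matrix is assumed to be given (for example, from color similarity between adjacent superpixels), and every node is assumed to have at least one neighbor.

```python
import numpy as np

def absorbed_time(affinity, absorbing):
    """Expected number of steps before a random walk is absorbed.

    affinity:  (n, n) non-negative affinity matrix between nodes
    absorbing: (n,) boolean mask, True for absorbing (border) nodes
    Returns one value per transient node, in node order.
    """
    affinity = np.asarray(affinity, dtype=float)
    absorbing = np.asarray(absorbing, dtype=bool)
    # Row-normalize affinities into transition probabilities.
    transition = affinity / affinity.sum(axis=1, keepdims=True)
    transient = ~absorbing
    Q = transition[np.ix_(transient, transient)]
    # Solve (I - Q) t = 1 instead of forming the inverse explicitly.
    return np.linalg.solve(np.eye(Q.shape[0]) - Q, np.ones(Q.shape[0]))
```

For a chain of five nodes whose two end nodes are absorbing, the three inner nodes have absorbed times 3, 4, and 3: the node farthest from the border takes longest to be absorbed, which is the intuition behind using absorbed time as a saliency score.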

Beyond these approaches, the generic objectness prior (see Footnote 4) is also used to facilitate salient object detection by leveraging object proposals [34]. Chang et al. [92] present a computational framework by fusing the objectness and regional saliency into a graphical model. These two terms are jointly estimated by iteratively minimizing the energy function that encodes their mutual interactions. In [100], regional objectness is defined as the average objectness values of its contained pixels, which is used for regional saliency computation. Jia and Han [101] compute the saliency of each region by comparing it to the “soft” foreground and background according to the objectness prior.

Footnote 4: Although it is learned from training data, we also tend to treat it as a universal intrinsic cue for salient object detection.

Salient object detection relying on the pseudo-background assumption may fail sometimes, especially when the object touches the image border. To this end, a boundary connectivity prior is utilized in [84, 105]. Intuitively, salient objects are much less connected to the image border than the ones in the background. Thus, the boundary connectivity score of a region could be estimated according to the ratio between its length along the image border and the spanning area of this region [105], which can be computed based on its geodesic distances to the pseudo-background and other regions, respectively. Such a boundary connectivity score is then integrated into a quadratic objective function to get the final optimized saliency map. It is worth noting that similar ideas of boundary connectivity prior are also investigated in [102] as segmentation prior and as surroundness in [106].
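The boundary connectivity intuition above can be illustrated on a hard segmentation: for each region, count how many of its pixels lie on the image border and normalize by the square root of its area. This is a deliberately simplified sketch. The function name `boundary_connectivity` is hypothetical, the input is assumed to be an integer label map with labels 0..K-1, and the soft, geodesic-distance-based estimation described in the text is replaced by plain pixel counts.

```python
import numpy as np

def boundary_connectivity(labels):
    """Border length divided by sqrt(area) for each region of a label map.

    labels: (H, W) array of non-negative integer region labels.
    """
    labels = np.asarray(labels)
    n_regions = int(labels.max()) + 1
    area = np.bincount(labels.ravel(), minlength=n_regions).astype(float)
    border = np.zeros(labels.shape, dtype=bool)
    border[0, :] = border[-1, :] = True
    border[:, 0] = border[:, -1] = True
    border_len = np.bincount(labels[border], minlength=n_regions).astype(float)
    # Guard against empty labels to avoid division by zero.
    return border_len / np.sqrt(np.maximum(area, 1.0))
```

A region that never touches the border scores zero, while a background region that surrounds it scores high; an object that merely touches the border over a short length still scores low relative to its area.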

The focusness prior, i.e., the observation that a salient object is often photographed in focus to attract more attention, has been investigated in [100, 107]. Jiang et al. [100] calculate the focusness from the degree of focal blur. By modeling such de-focus blur as the convolution of a sharp image with a point spread function, approximated by a Gaussian kernel, the pixel-level focusness is cast as estimating the standard deviation of the Gaussian kernel by scale space analysis. The regional focusness score is computed by propagating the focusness and/or sharpness at the boundary and interior edge pixels. The saliency score is finally derived from the non-linear combination of uniqueness (global contrast), objectness, and focusness scores.

Performance of salient object detection based on regions might be affected by the segmentation parameters. In addition to other approaches based on multi-scale regions [93, 42, 79], single-scale potential salient regions are extracted by solving the facility location problem in [87]. An input image is first represented as an undirected graph on superpixels, where a much smaller set of candidate region centers are then generated through agglomerative clustering. On this set, a submodular objective function is built to maximize the similarity. By applying a greedy algorithm, the objective function can be iteratively optimized to group superpixels into regions whose saliency values are further measured via the regional global contrast and spatial distribution.

The Bayesian framework is exploited for saliency computation [108, 96], formulated as estimating the posterior probability of a pixel being foreground given the input image. To estimate the saliency prior, a convex hull is first estimated around the detected interest points. The convex hull, which divides the image into an inner region and an outer region, provides a coarse estimation of foreground as well as background and can be adopted for likelihood computation. Liu et al. [104] adopt an optimization-based framework for detecting salient objects. Similar to [96], a convex hull is roughly estimated to partition an image into pure background and potential foreground. Then, saliency seeds are learned from the image, while a guidance map is learned from background regions, as well as human prior knowledge. Using these cues, a general Linear Elliptic System with Dirichlet boundary is introduced to model the diffusions from seeds to other regions to generate a saliency map.
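The Bayesian formulation at the start of this paragraph can be sketched with Bayes' rule applied per pixel: the foreground likelihood is estimated from colors inside the convex hull, the background likelihood from colors outside it, and both are combined with a prior map. The sketch is a minimal illustration under stated assumptions, not the procedure of [108, 96]: the function name `bayesian_saliency` is hypothetical, colors are assumed to be pre-quantized into `n_bins` integer bins, the hull is assumed to be given as a boolean mask, and likelihoods are plain normalized histograms.

```python
import numpy as np

def bayesian_saliency(prior, color_bins, inside_hull, n_bins, eps=1e-8):
    """Posterior probability of foreground for every pixel.

    prior:       (H, W) prior probability of foreground in [0, 1]
    color_bins:  (H, W) integer index of the quantized color of each pixel
    inside_hull: (H, W) boolean mask of the coarse foreground estimate
    """
    prior = np.asarray(prior, dtype=float)
    color_bins = np.asarray(color_bins)
    inside_hull = np.asarray(inside_hull, dtype=bool)
    hist_in = np.bincount(color_bins[inside_hull], minlength=n_bins).astype(float)
    hist_out = np.bincount(color_bins[~inside_hull], minlength=n_bins).astype(float)
    # Likelihood of each pixel's color under the foreground / background model.
    like_fg = (hist_in / max(hist_in.sum(), 1.0))[color_bins]
    like_bg = (hist_out / max(hist_out.sum(), 1.0))[color_bins]
    numerator = prior * like_fg
    return numerator / (numerator + (1.0 - prior) * like_bg + eps)
```

Even when the hull is only a coarse estimate that includes some background, colors that never appear outside the hull obtain a posterior close to one, while colors shared with the outer region are suppressed.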

Among all the models reviewed in this subsection, there are mainly three types of regions adopted for saliency computation. Irregular regions with varying sizes can be generated using the graph-based segmentation algorithm [81], the mean-shift algorithm [28], or clustering (quantization). On the other hand, with recent progress on superpixel algorithms, compact regions with comparable sizes, generated using the SLIC algorithm [29], the Turbopixel algorithm [82], etc., are also popular choices. The main difference between these two types of regions is whether the influence of region size should be taken into account. Furthermore, soft regions are also considered for saliency analysis, where every pixel maintains a probability of belonging to each of the regions (components), e.g., fitted by a GMM, instead of only a hard region label. To further enhance the robustness of segmentation, regions can be generated based on multiple segmentations or in a hierarchical way. Generally, single-scale segmentation is faster, while multi-scale segmentation can improve the overall performance.

To measure the saliency of regions, uniqueness, usually in the form of global and local regional contrasts, is the most frequently used feature. Further, more and more complementary priors for the regional saliency are investigated to improve the overall performance, such as backgroundness, objectness, focusness and boundary connectivity. Compared with the block-based saliency models, the extension of these priors is also the main advantage of the region-based saliency models. Furthermore, regions provide more sophisticated cues (e.g., color histogram) to better capture the salient object of a scene in contrast to pixels and patches. Another benefit of defining saliency using regions is related to the efficiency. Since the number of regions in an image is far less than the number of pixels, computing saliency at region level can significantly reduce the computational cost while producing full-resolution saliency maps.
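The efficiency argument can be made concrete with a small sketch: contrast is computed between region pairs only, and the region scores are then copied back to pixels to give a full-resolution map. The function name, the use of mean color as the region descriptor, and the size-weighted Euclidean contrast are assumptions for illustration, not the formulation of any specific model reviewed above.

```python
import numpy as np

def regional_global_contrast(labels, image):
    """Size-weighted global color contrast per region, returned per pixel.

    labels: (H, W) integer region ids in 0..n-1 (assumed given by any segmenter).
    image:  (H, W, C) color image.
    """
    labels = np.asarray(labels)
    image = np.asarray(image, dtype=np.float64)
    n = int(labels.max()) + 1
    flat = labels.ravel()
    sizes = np.bincount(flat, minlength=n).astype(np.float64)
    pixels = image.reshape(-1, image.shape[-1])
    # Mean color of every region, one channel at a time.
    means = np.stack(
        [np.bincount(flat, weights=pixels[:, c], minlength=n)
         for c in range(pixels.shape[1])],
        axis=1,
    )
    means /= np.maximum(sizes, 1.0)[:, None]
    # Pairwise color distances between regions: an (n, n) matrix, not (HW, HW).
    dist = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=-1)
    region_saliency = dist @ sizes  # contrast to all regions, weighted by size
    return region_saliency[labels]  # copy region scores back to every pixel
```

The pairwise step costs on the order of the squared number of regions, which is typically far smaller than the squared number of pixels.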

Notice that the approaches discussed in this subsection only utilize intrinsic cues. In the next subsection, we will review how to incorporate extrinsic cues to facilitate the detection of salient objects.

| # | Model | Pub | Year | Cues | Elements | Hypothesis: Uniqueness | Hypothesis: Prior | Aggregation (Optimization) | GT Form | Code |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LTD [25] | CVPR | 2007 | GT | mPI + PA + RE | L + CS | SD | CRF | BB | NA |
| 2 | OID [109] | ECCV | 2010 | GT | mPI + PA + RE | L + CS | SD | mixtureSVM | BB | NA |
| 3 | LGCR [110] | BMVC | 2010 | GT | RE | - | P | BDT | BM | NA |
| 4 | DRFI [40] | CVPR | 2013 | GT | mRE | L | B + P | RF | BM | M + C |
| 5 | LOS [111] | CVPR | 2014 | GT | RE | L + G | PRA + B + SD + CP | SVM | BM | NA |
| 6 | HDCT [112] | CVPR | 2014 | GT | RE | L + G | SD + P + HD | BDT + LS | BM | M |

| # | Model | Pub | Year | Cues | Elements | Hypothesis: Uniqueness | Hypothesis: Prior | Aggregation (Optimization) | GT Necessity | Code |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | VSIT [113] | ICCV | 2009 | SI | PA | - | SS | - | yes | NA |
| 8 | FIEC [114] | CVPR | 2011 | SI | PI + PA | L | - | LN | no | NA |
| 9 | SA [115] | CVPR | 2013 | SI | PI | - | CMP | CRF | yes | NA |
| 10 | LBI [35] | CVPR | 2013 | SI | PA | SP | - | - | no | M + C |

| # | Model | Pub | Year | Cues | Elements | Hypothesis: Uniqueness | Hypothesis: Prior | Aggregation (Optimization) | Type | Code |
|---|---|---|---|---|---|---|---|---|---|---|
| 11 | LC [116] | MM | 2006 | TC | PI + PA | L | - | LN | online | NA |
| 12 | VA [117] | ICPR | 2008 | TC | mPI + PA + RE | L | CS + SD + MCO | CRF | offline | NA |
| 13 | SEG [108] | ECCV | 2010 | TC | PA + PI | CS | MCO | CRF | offline | M + C |
| 14 | RDC [118] | CSVT | 2013 | TC | RE | L | - | - | offline | NA |

| # | Model | Pub | Year | Cues | Elements | Hypothesis: Uniqueness | Hypothesis: Prior | Aggregation (Optimization) | Image Number | Code |
|---|---|---|---|---|---|---|---|---|---|---|
| 15 | CSIP [119] | TIP | 2011 | SCO | mRE | - | RS | LN | two | M + C |
| 16 | CO [120] | CVPR | 2011 | SCO | PI + PA | G | RP | - | multiple | NA |
| 17 | CBCO [121] | TIP | 2013 | SCO | RE | G | SD + C | NL | multiple | NA |

| # | Model | Pub | Year | Cues | Elements | Hypothesis: Uniqueness | Hypothesis: Prior | Aggregation (Optimization) | Source | Code |
|---|---|---|---|---|---|---|---|---|---|---|
| 18 | LS [122] | CVPR | 2012 | DP | RE | G | DK | NL | stereo images | NA |
| 19 | DRM [123] | BMVC | 2013 | DP | RE | G | - | SVM | Kinect | NA |
| 20 | SDLF [107] | CVPR | 2014 | LF | mRE | G | F + B + O | NL | Lytro camera | NA |

Fig. 5: Salient object detection models with extrinsic cues, grouped by their adopted cues. Cues: GT = ground-truth annotation, SI = similar images, TC = temporal cues, SCO = saliency co-occurrence, DP = depth, and LF = light field. Saliency hypothesis: P = generic properties, PRA = pre-attention cues, HD = discriminativity in high-dimensional feature space, SS = saliency similarity, CMP = complement of saliency cues, SP = sampling probability, MCO = motion coherence, RP = repeatedness, RS = region similarity, C = corresponding, and DK = domain knowledge. Others: CRF = conditional random field, SVM = support vector machine, BDT = boosted decision tree, and RF = random forest.

2.1.3 Models with Extrinsic Cues

Models in the third subgroup adopt the extrinsic cues to assist the detection of salient objects in images and videos. In addition to the visual cues observed from the single input image, the extrinsic cues can be derived from the ground-truth annotations of the training images, similar images, the video sequences, a set of input images containing the common salient objects, depth maps, or light field images. In this section, we will review these models according to the types of used extrinsic cues. Fig. 5 lists all the models with extrinsic cues, where each method is highlighted with several pre-defined attributes.

Salient object detection with similar images. With the availability of an increasingly large amount of visual content on the web, salient object detection that leverages images visually similar to the input image has been studied in recent years. Generally, given the input image, similar images are first retrieved from a large collection of images. Salient object detection on the input image can then be assisted by examining these similar images.

In some studies, it is assumed that saliency annotations of these images are available. For example, Marchesotti et al. [113] propose to describe each indexed image by a pair of descriptors, which denote the feature descriptors (Fisher vectors) of the salient and non-salient regions according to the saliency annotations, respectively. To compute the saliency map, each patch of the input image is described by a Fisher vector. The saliency of each patch is computed according to its contrast with the foreground and background region features.

Alternatively, based on the observation that different features contribute differently to the saliency analysis on each image, Mai et al. [115] propose to learn image-specific rather than universal weights to fuse the saliency maps that are computed on different feature channels. To this end, the CRF aggregation model of saliency maps is trained only on the retrieved similar images to account for the dependence of aggregation on individual images (more technical details about [115] are discussed in Sect. 2.1.4).

Saliency based on similar images works well if large-scale image collections are available. Saliency annotation, however, is time consuming, tedious, and even intractable on such collections. To mitigate this, some methods leverage unannotated similar images. With web-scale image collections, Wang et al. [114] propose a simple yet effective saliency estimation algorithm. The pixel-wise saliency map is computed by comparing the input image with geometrically warped versions of its similar images, where the input image serves as the reference for the warping. The main insight is that similar images offer good approximations to the background regions while salient regions might not be well-approximated.
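The insight behind [114] can be illustrated with a minimal sketch: pixels that the similar images approximate poorly receive high saliency. The function name is hypothetical, the similar images are assumed to be already warped to the input, and both the L1 color distance and the averaging over images are assumptions of this sketch rather than details confirmed by the text above.

```python
import numpy as np

def similar_image_saliency(image, warped_similar):
    """Per-pixel approximation error of the input by pre-warped similar images.

    image:          (H, W, C) input image.
    warped_similar: sequence of K images of the same shape, assumed to be
                    already geometrically warped to the input (not done here).
    """
    image = np.asarray(image, dtype=np.float64)
    stack = np.asarray(warped_similar, dtype=np.float64)   # (K, H, W, C)
    diff = np.abs(stack - image[None]).sum(axis=-1)        # L1 color distance
    saliency = diff.mean(axis=0)                           # average over the K images
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency       # rescale to [0, 1]
```

Background pixels that the retrieved images reproduce well end up near zero, while an object absent from the retrieved images stands out.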

Siva et al. [35] propose a probabilistic formulation for saliency computation as a sampling problem. A patch is considered to be salient if it has a low probability of being sampled from the similar images. In other words, a higher saliency score is given to a patch if it is unique among a bag of patches extracted from similar images.

Co-salient object detection. Instead of concentrating on computing saliency on a single image, co-salient object detection algorithms focus on discovering the common salient objects shared by multiple input images. Such objects can be the same object seen from different viewpoints or objects of the same category sharing similar visual appearances. Note that the key characteristic of co-salient object detection algorithms is that their input is a set of images, while classical salient object detection models only need a single input image.

Co-saliency detection is closely related to the concept of image co-segmentation, which aims to segment similar objects from multiple images [124, 125]. As stated in [121], three major differences exist between co-saliency and co-segmentation. First, co-saliency detection algorithms only focus on detecting the common salient objects, while the similar but non-salient background might also be segmented out in co-segmentation approaches [126, 127]. Second, some co-segmentation methods, e.g., [125], need user input to guide the segmentation process in ambiguous situations. Third, salient object detection often serves as a pre-processing step, and thus algorithms more efficient than co-segmentation algorithms are preferred, especially over a large number of images.

Li and Ngan [119] propose a method to compute co-saliency for an image pair with some objects in common. The co-saliency is defined as the inter-image correspondence, i.e., low saliency values should be given to the dissimilar regions. Similarly, in [120], Chang et al. propose to compute co-saliency by exploiting the additional repeatedness property across multiple images. Specifically, the co-saliency score of a pixel is defined as the product of its traditional saliency score [39] and its repeatedness likelihood over the input images. Fu et al. [121] propose a cluster-based co-saliency detection algorithm by exploiting the well-established global contrast and spatial distribution concepts on a single image. Additionally, the corresponding cues over multiple images are introduced to account for the saliency co-occurrence.
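The multiplicative combination described for [120] can be written in a few lines. This is a minimal sketch: the function name is hypothetical, and both inputs are assumed to be precomputed maps with values in [0, 1]; how the repeatedness likelihood itself is estimated is not shown.

```python
import numpy as np

def co_saliency(single_saliency, repeatedness):
    """Pixel-wise product of a single-image saliency map and a repeatedness map."""
    s = np.clip(np.asarray(single_saliency, dtype=np.float64), 0.0, 1.0)
    r = np.clip(np.asarray(repeatedness, dtype=np.float64), 0.0, 1.0)
    return s * r  # high only where a pixel is both salient and repeated
```

The product suppresses regions that are salient in one image but do not recur across the set, as well as recurring but non-salient background.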

2.1.4 Other Classic Models

In this section, we review algorithms that aim to directly segment or localize salient objects with bounding boxes, and algorithms that are closely related to saliency detection. Some subsections offer a different categorization of some models covered in the previous sections (e.g., supervised vs. unsupervised). See Fig. 6.

Localization models. Liu et al. [25] convert the binary segmentation map to bounding boxes. The final output is a set of rectangles around salient objects. Feng et al. [128] define saliency for a sliding window as its composition cost using the remaining image parts. Based on an over-segmentation of the image, the local maxima, which can efficiently be found among all sliding windows in a brute-force manner, are assumed to correspond to salient objects.

The basic assumption in many previous approaches is that at least one salient object exists in the input image. This may not always hold as some background images contain no salient objects at all. In [129], Wang et al. investigate the problem of localizing and predicting the existence of salient objects on thumbnail images. Specifically, each image is described by a set of features extracted in multiple channels. The existence of salient objects is formulated as a binary classification problem. For localization, a regression function is learned using a Random Forest regressor on training samples to directly output the position of the salient object.

| # | Model | Pub | Year | Type | Code |
|---|---|---|---|---|---|
| 1 | COMP [128] | ICCV | 2011 | Localization | NA |
| 2 | GSAL [129] | CVPR | 2012 | Localization | NA |
| 3 | CTXT [130] | ICCV | 2011 | Segmentation | NA |
| 4 | LCSP [131] | IJCV | 2014 | Segmentation | NA |
| 5 | BENCH [132] | ECCV | 2012 | Aggregation | M |
| 6 | SIO [133] | SPL | 2013 | Optimization | NA |
| 7 | ACT [21] | PAMI | 2012 | Active | C |
| 8 | SCRT [22] | CVPR | 2014 | Active | NA |
| 9 | WISO [23] | TIP | 2014 | Active | NA |

Fig. 6: Other salient object detection models.


Segmentation models. Segmenting salient objects is closely related to the figure-ground problem, which is essentially a binary classification problem trying to separate the salient object from the background. Yu et al. [90] utilize the complementary characteristics of imperfect saliency maps generated by different contrast-based saliency models. Specifically, two complementary saliency maps are first generated for each image: a sketch-like map and an envelope-like map. The sketch-like map can accurately locate parts of the most salient object (i.e., a skeleton with high precision), while the envelope-like map can roughly cover the entire salient object (i.e., an envelope with high recall). With these two maps, reliable foreground and background regions can first be detected in each image to train a pixel classifier. By labeling all other pixels with this classifier, the salient object can be detected as a whole. This method is extended in [131] by learning the complementary saliency maps for the purpose of salient object segmentation.

Lu et al. [91] exploit the convexity (concavity) prior for salient object segmentation. This prior assumes that the region on the convex side of a curved boundary tends to belong to the foreground. Based on this assumption, concave arcs are first found on the contours of superpixels. For a concave arc, its concavity context is defined as the windows which are tightly close to the arc. An undirected weighted graph is then built over the superpixels with concave arcs, where the weights between vertices are determined by the summation of the concavity context on different scales in the hierarchical segmentation of the image. Finally, the Normalized Cut algorithm [134] is performed to separate the salient object from the background.

To leverage the contextual cues more effectively, Wang et al. [130] propose to integrate an auto-context classifier [135] into an iterative energy minimization framework to automatically segment the salient object. The auto-context model is a multi-layer Boosting classifier on each pixel and its surroundings to predict if it is associated with the target concept. The subsequent layer is built on the classification of the previous layer. Hence through the layered learning process, the spatial context is automatically utilized for more accurate segmentation of the salient object.

Supervised vs. unsupervised models. The majority of the existing learning-based works on saliency detection focus on the supervised scenario, i.e., learning a salient object detector given a set of training samples with ground-truth annotations. The aim here is to separate the salient elements from the background elements.

Each element (e.g., a pixel or a region) in the input image is represented by a feature vector $\mathbf{f} \in \mathbb{R}^{D}$, where $D$ is the feature dimension. Such a feature vector is then mapped to a saliency score $s$ based on the learned linear or non-linear mapping function $\phi$.

One can assume the mapping function is linear, i.e., $s = \mathbf{w}^{\top}\mathbf{f}$, where $\mathbf{w}$ denotes the combination weights of all components in the feature vector. Liu et al. [25] propose to learn the weights with the Conditional Random Field (CRF) model trained on the rectangular annotations of the salient objects. In a recent work [111], the large-margin framework is adopted to learn the weights $\mathbf{w}$.
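A linear mapping from feature vectors to saliency scores can be sketched as below. The least-squares fit is only a stand-in chosen for brevity; it is not the CRF training of [25] or the large-margin learning of [111], and both function names are hypothetical.

```python
import numpy as np

def fit_linear_weights(features, labels):
    """Least-squares estimate of w (a stand-in, not CRF or large-margin training).

    features: (N, D) matrix, one feature vector per element (pixel or region).
    labels:   (N,) ground-truth saliency values in [0, 1].
    """
    w, *_ = np.linalg.lstsq(features, labels, rcond=None)
    return w

def linear_saliency(features, w):
    """Saliency score of every element as the weighted sum of its features."""
    return features @ w
```

A non-linear model such as boosted decision trees or a random forest would replace `linear_saliency` with a learned non-linear function of the same feature vectors.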

Due to the highly non-linear essence of the saliency mechanism, however, the linear mapping might not perfectly capture the characteristics of saliency. To this end, such a linear integration is extended in [109], where a mixture of linear Support Vector Machines (SVM) is adopted to partition the feature space into a set of sub-regions that are linearly separable using a divide-and-conquer strategy. In each region, a linear SVM, its mixture weights, and the combination parameters of the saliency features are learned for better saliency estimation. Alternatively, other non-linear classifiers such as boosted decision trees (BDT) [110, 112] and the random forest (RF) [40] are also utilized.

Generally speaking, supervised approaches allow richer representations for the elements compared with the heuristic methods. In the seminal work of the supervised salient object detection, Liu et al. [25] propose a set of features including the local multi-scale contrast, regional center-surround histogram distance, and global color spatial distribution. Similar to models with only intrinsic cues, region-based representation for salient object detection has become increasingly popular as more sophisticated descriptors can be extracted at region level. Mehrani and Veksler [110] demonstrate promising results by considering generic regional properties, e.g., color and shape, which are widely used in other applications like image classification. Jiang et al. [40] propose a regional saliency descriptor including the regional local contrast, regional backgroundness, and regional generic properties. In [111, 112], each region is described by a set of features such as local and global contrast, backgroundness, spatial distribution, and the center prior. The pre-attentive features are also considered in [111].

Usually, the richer representations result in feature vectors with higher dimensions, e.g., those in [40] and [112]. With the availability of large collections of training samples, the learned classifier is capable of automatically integrating such richer features and picking up the most discriminative ones. Therefore, better performance can be expected compared with the heuristic methods.

Some models have utilized unsupervised techniques. In [35], saliency computation is formulated in a probabilistic framework as a sampling problem. The saliency of each image patch is proportional to its sampling probability from all of the patches, which are extracted from both the input image and the similar images retrieved from a corpus of unlabeled images. In [136], cellular automata are exploited for unsupervised salient object detection.

Aggregation and optimization models. Given $M$ saliency maps $\{S_i\}_{i=1}^{M}$, coming from different salient object detection models or hierarchical segmentations of the input image, aggregation models try to form a more accurate saliency map. Let $S_i(x)$ denote the saliency value of pixel $x$ in the $i$-th saliency map. In [132], Borji et al. propose a standard saliency aggregation method as follows:

$$S(x) = P(s_x = 1 \mid \mathbf{f}_x) \propto \frac{1}{Z} \sum_{i=1}^{M} \zeta\big(S_i(x)\big),$$

where $\mathbf{f}_x = (S_1(x), \ldots, S_M(x))$ collects the saliency scores for pixel $x$, $s_x = 1$ indicates that $x$ is labeled as salient, and $Z$ is a normalization constant. $\zeta(\cdot)$ is a real-valued function which can take the following forms:

$$\zeta_1(z) = z, \qquad \zeta_2(z) = \exp(z), \qquad \zeta_3(z) = -\frac{1}{\log(z)}.$$

Inspired by the aggregation model in [132], Mai et al. [115] propose two aggregation solutions. The first solution adopts the pixel-wise aggregation:

$$P(s_x = 1 \mid \mathbf{f}_x; \lambda) = \sigma\Big(\sum_{i=1}^{M} \lambda_i S_i(x) + \lambda_{M+1}\Big),$$

where $\lambda = \{\lambda_i \mid i = 1, \ldots, M+1\}$ is the set of model parameters and $\sigma(z) = 1/(1 + \exp(-z))$. However, one potential problem of such direct aggregation is that it ignores the interaction between neighboring pixels. Inspired by [55], they propose the second solution, which uses a CRF to aggregate the saliency maps of multiple methods and to capture the relation between neighboring pixels. The parameters of the CRF aggregation model are optimized on the training data. The saliency of each pixel is the posterior probability of being labeled as salient with the trained CRF.
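The two pixel-wise aggregation schemes can be sketched as follows. The function names are hypothetical, the min-max rescaling is an assumption standing in for the normalization, and the identity and exponential are used only as example choices of the real-valued function applied to each map; the CRF-based solution is not covered.

```python
import numpy as np

def standard_aggregation(maps, zeta=lambda z: z):
    """Sum zeta(S_i(x)) over the input maps and rescale the result to [0, 1]."""
    stack = np.asarray(maps, dtype=np.float64)        # (M, H, W)
    agg = zeta(stack).sum(axis=0)
    lo, hi = agg.min(), agg.max()
    return (agg - lo) / (hi - lo) if hi > lo else np.zeros_like(agg)

def logistic_aggregation(maps, weights, bias):
    """Pixel-wise logistic aggregation with one weight per map plus a bias."""
    stack = np.asarray(maps, dtype=np.float64)        # (M, H, W)
    z = np.tensordot(weights, stack, axes=1) + bias   # weighted sum over the M maps
    return 1.0 / (1.0 + np.exp(-z))
```

For instance, `standard_aggregation(maps, zeta=np.exp)` uses the exponential form, while the weights and bias of `logistic_aggregation` would be fitted on training data.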

Alternatively, Yan et al. [42] integrate the saliency maps computed on the hierarchical segmentations of the image into a tree-structured graphical model, where each node corresponds to a region in every hierarchy. Thanks to the tree structure, the saliency inference can be conducted efficiently using belief propagation. In fact, solving the three-layer hierarchical model is equivalent to applying a weighted average to all single-layer maps. Different from naive multi-layer fusion, this hierarchical inference algorithm can select optimal weights for each region instead of using global weighting.

Li et al. [133] propose to optimize the saliency values of all superpixels in an image to simultaneously meet several saliency criteria including visual rarity, center-bias and mutual correlation. Based on the correlations (similarity scores) between region pairs, the saliency value of each superpixel is optimized by quadratic programming when considering the influences of all the other superpixels. The formulation takes into account the spatial distance between each pair of regions and the distance from each region to the image center, together with half the image diagonal length. Zhu et al. [105] adopt a similar optimization-based framework to integrate multiple foreground/background cues with the smoothness terms to automatically infer saliency values.

The Bayesian framework is adopted to effectively integrate the complementary dense and sparse reconstruction errors [98]. A fully-connected Gaussian Markov Random Field between each pair of regions is constructed to enforce the consistency between salient regions [101], which leads to an efficient computation of the final regional saliency scores.

Active models. Inspired by the interactive segmentation models (e.g., [137, 138]), a new trend has emerged recently by explicitly decoupling the two stages of saliency detection mentioned in Sec. 1.1: a) detecting the most salient object and b) segmenting it. Some studies propose to perform active segmentation by utilizing the advantages of both fixation prediction and segmentation models. For example, Mishra et al. [21] combine multiple cues (e.g., color, intensity, texture, stereo and/or motion) to predict fixations. The “optimal” closed contour for the salient object around the fixation point is then segmented in polar space. Li et al. [22] propose a model composed of two components: a segmenter that proposes candidate regions and a selector that gives each region a saliency score (using a fixation prediction model). Similarly, Borji [23] proposes to first roughly locate the salient object at the peak of the fixation map (or its estimation using a fixation prediction model) and then segment the object using superpixels. The last two algorithms adopt annotations to determine the upper bound of the segmentation performance, propose datasets with multiple objects in scenes, and provide new insight into the inherent connections between fixation prediction and salient object segmentation.

Salient object detection on videos. In addition to the spatial information, a video sequence provides temporal cues, e.g., motion, which facilitate salient object detection. Zhai and Shah [116] first estimate the keypoint correspondences between two consecutive frames. Motion contrast is computed based on the planar motions (homography) between images, which are estimated by applying RANSAC on point correspondences. Liu et al. [117] extend their spatial saliency features [25] to the motion field resulting from the optical flow algorithm. With the colorized motion field as the input image, the local multi-scale contrast, regional center-surround distance, and global spatial distribution are computed and finally integrated in a linear way. Rahtu et al. [108] integrate the spatial saliency into the energy minimization framework by considering the temporal coherence constraint. Li et al. [118] extend the regional contrast-based saliency to the spatio-temporal domain. Given the over-segmentation of the frames of the video sequence, spatial and temporal region matchings between each two consecutive frames are estimated based on their color, texture, and motion features in an interactive manner on an undirected un-weighted matching graph. The saliency of a region is determined by computing its local contrast to the surrounding regions not only in the present frame but also in the temporal domain.

Salient object detection with depth. We live in real 3D environments where stereoscopic content provides additional depth cues for guiding visual attention and understanding the surroundings. This is further validated by Lang et al. [139] through experimental analysis of the importance of depth cues for eye fixation prediction. Recently, researchers have started to study how to exploit depth cues for salient object detection [122, 123]; these cues may be captured indirectly from stereo images or directly using a depth camera (e.g., Kinect).

The most straightforward extension is to apply the widely used hypotheses introduced in Sec. 2.1.1 and 2.1.2 to the depth channel, e.g., the global contrast on the depth map [122, 123]. Further, Niu et al. [122] demonstrate how to leverage the domain knowledge in stereoscopic photography to compute the saliency map. The input image is first segmented into regions. In practice, the attended regions are often assigned small or zero disparities to minimize the vergence-accommodation conflict. Thus, the first type of regional saliency is defined on the average disparity of each region, relative to the maximal and minimal disparities. Additionally, objects with negative disparities are perceived as popping out from the scene, which motivates the second type of regional stereo saliency. The final stereo saliency is a linear combination of the two types, computed with an adaptive weight.
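The two disparity-based cues can be illustrated with a small sketch. The formulas below are assumptions chosen to match the verbal description (regions near zero disparity score high on the first cue, regions with more negative disparity score high on the second); they are not the definitions of [122], and a fixed `weight` stands in for the adaptive weight.

```python
import numpy as np

def stereo_saliency(mean_disparity, weight=0.5):
    """Combine a zero-disparity cue and a pop-out cue per region (illustrative only).

    mean_disparity: (R,) average disparity of each region; negative values are
                    assumed to mean the region is perceived in front of the screen.
    """
    d = np.asarray(mean_disparity, dtype=np.float64)
    d_max, d_min = d.max(), d.min()
    scale = max(abs(d_max), abs(d_min))
    # Cue 1: regions with small or zero disparity are likely attended.
    s_zero = 1.0 - np.abs(d) / scale if scale > 0 else np.ones_like(d)
    # Cue 2: regions with smaller (more negative) disparity pop out of the scene.
    span = d_max - d_min
    s_pop = (d_max - d) / span if span > 0 else np.zeros_like(d)
    return weight * s_zero + (1.0 - weight) * s_pop
```

Setting `weight` to 1 or 0 isolates either cue, which is a convenient way to inspect their individual behavior.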

Salient object detection on light field. The idea of using light field for saliency detection was proposed in [107]. A light field, captured using a specially designed camera (e.g., Lytro), is essentially an array of images shot by a grid of cameras viewing the scene. The light field data offers two benefits for salient object detection: 1) it allows synthesizing a stack of images focusing at different depths, and 2) it provides an approximation of scene depth and occlusions.

With this additional information, Li et al. [107] first utilize the focusness and objectness priors to robustly choose the background and select the foreground candidates. Specifically, the layer with the estimated background likelihood score is used to estimate the background regions. Regions produced by the mean-shift algorithm that have high foreground likelihood scores are chosen as salient object candidates. Finally, the estimated background and foreground are utilized to compute the contrast-based saliency map on the all-focus image.

A new benchmark dataset for light-field saliency, known as HFUT-Lytro, has been recently introduced in [140].

2.2 New Testament: Deep Learning Based Models

All the methods that we have reviewed so far aim at detecting salient objects using heuristics. While hand-crafted features allow real-time detection performance, they suffer from several shortcomings that limit their ability in capturing salient objects in challenging scenarios.

Convolutional neural networks (CNNs) [69], as one of the most popular tools in machine learning, have been applied to many vision problems such as object recognition [141], semantic segmentation [70] and edge detection [142]. Recently, it has been shown that CNNs [47, 44] are also very effective when applied to salient object detection. Thanks to their multi-level and multi-scale features, CNNs are capable of accurately capturing the most salient regions without using any prior knowledge (e.g., segment-level information). Furthermore, multi-level features allow CNNs to better locate the boundaries of the detected salient regions, even when shades or reflections exist. By exploiting the strong feature learning ability of CNNs, a series of algorithms have been proposed to learn saliency representations from large amounts of data. These CNN-based models continuously refresh the records on almost all existing datasets and are becoming the mainstream solution. The rest of this subsection is dedicated to reviewing CNN-based models.

Basically, salient object detection models based on deep learning can be split into two main categories. The first category includes models that have used multi-layer perceptrons (MLPs) for saliency detection. In these models, the input image is usually over-segmented into single- or multi-scale small regions. Then, a CNN is used to extract high-level features, which are later fed to an MLP to determine the saliency value of a small region. Though high-level features are extracted from CNNs, unlike fully convolutional networks (FCNs), the spatial information from CNN features cannot be preserved because of the utilization of MLPs. To highlight the differences between these methods and FCN-based methods, we call them "Classic Convolutional Network based" (CCN-based) methods. The second category includes models that are based on "Fully Convolutional Networks" (FCN-based). The pioneering work of Long et al. [70] falls under this category and aims at solving the semantic segmentation problem. Since salient object detection is inherently a segmentation task, a number of researchers have adopted FCN-based architectures because of their capability in preserving spatial information.

Fig. 7 shows a list of CNN-based saliency models.

| # | Model | Pub | Year | #Training Images | Training Set | Pre-trained Model | Fully Conv |
|---|---|---|---|---|---|---|---|
| 1 | SuperCNN [44] | IJCV | 2015 | 800 | ECSSD | - | |
| 2 | LEGS [45] | CVPR | 2015 | 3,340 | MSRA-B + PASCALS | - | |
| 3 | MC [46] | CVPR | 2015 | 8,000 | MSRA10K | GoogLeNet [143] | |
| 4 | MDF [47] | CVPR | 2015 | 2,500 | MSRA-B | - | |
| 5 | HARF [48] | ICCV | 2015 | 2,500 | MSRA-B | - | |
| 6 | ELD [144] | CVPR | 2016 | nearly 9,000 | MSRA10K | VGGNet | |
| 7 | SSD-HS [145] | ECCV | 2016 | 2,500 | MSRA-B | AlexNet | |
| 8 | FRLC [146] | ICIP | 2016 | 4,000 | DUT-OMRON | VGGNet | |
| 9 | SCSD-HS [147] | ICPR | 2016 | 2,500 | MSRA-B | AlexNet | |
| 10 | DISC [148] | TNNLS | 2016 | 9,000 | MSRA10K | - | |
| 11 | LCNN [149] | Neuro | 2017 | 2,900 | MSRA-B + PASCALS | AlexNet | |
| 12 | DHSNET [150] | CVPR | 2016 | 6,000 | MSRA10K | VGGNet | |
| 13 | DCL [151] | CVPR | 2016 | 2,500 | MSRA-B | VGGNet [152] | |
| 14 | RACDNN [153] | CVPR | 2016 | 10,565 | DUT+NJU2000+RGBD | VGG | |
| 15 | SU [154] | CVPR | 2016 | 10,000 | MSRA10K | VGGNet | |
| 16 | CRPSD [155] | ECCV | 2016 | 10,000 | MSRA10K | VGGNet | |
| 17 | DSRCNN [156] | MM | 2016 | 10,000 | MSRA10K | VGGNet | |
| 18 | DS [157] | TIP | 2016 | nearly 10,000 | MSRA10K | VGGNet | |
| 19 | IMC [158] | WACV | 2017 | nearly 6,000 | MSRA10K | ResNet | |
| 20 | MSRNet [159] | CVPR | 2017 | 2,500 | MSRA-B + HKU-IS | VGGNet | |
| 21 | DSS [49] | CVPR | 2017 | 2,500 | MSRA-B | VGGNet | |

Fig. 7: CNN-based salient object detection models and the information they use during training. Models in the top part are all CCN-based while models in the bottom part are all FCN-based.

2.2.1 CCN-based Models

One-dimensional (1D) convolution based methods. As an early attempt, He et al. [44] followed a region-based approach to learn superpixel-wise feature representations. Their approach dramatically reduces the computational cost compared to pixel-wise CNNs, while also taking global context into consideration. However, representing a superpixel with the mean color is not informative enough. Further, the spatial structure of the image is difficult to fully recover using 1D convolution and pooling operations, leading to cluttered predictions, especially when the input image is a complex scene.

Leveraging local and global context. Wang et al. consider both local and global information for better detection of salient regions [160]. To this end, two subnetworks are designed for local estimation and global search, respectively. A deep neural network (DNN-L) is first used to learn local patch features to determine the saliency value of each pixel, followed by a refinement operation which captures the high-level objectness. For global search, they train another deep neural network (DNN-G) to predict the saliency value of each salient region using a variety of global features, such as geometric information and global contrast features. The top candidate regions are utilized to compute the final saliency map using a weighted summation.

In [46], similar to most of the classic salient object detection methods, both local context and global context are taken into account for constructing a multi-context deep learning framework. The input image is first fed to the global-context branch to extract global contrast information. Meanwhile, each image patch, which is a superpixel-centered window, is fed to the local-context branch for capturing local information. A binary classifier is finally used to determine the saliency value by minimizing a unified softmax loss between the prediction value and the ground truth label. A task-specific pre-training scheme is adopted to jointly optimize the designed multi-context model.

Lee et al.[144] exploit two subnetworks to encode low-level and high-level features separately. They first extract a number of features for each superpixel and feed them into a subnetwork composed of a stack of convolutional layers with kernel size. Then, the standard VGGNet [152] is used to capture high-level features. Both low- and high-level features are flattened, concatenated, and finally fed into a two-layer MLP to judge the saliency of each query region.

Bounding box based methods. In [48], Zou et al. propose a hierarchy-associated rich feature (HARF) extractor. To do so, a binary segmentation tree is first built for extracting hierarchical image regions and for analyzing the relationships between all pairs of regions. Two different methods are then used to compute two kinds of features ( and ) for regions at the leaf-nodes of the binary segmentation tree. They leverage all the intermediate features extracted from RCNN [161] to capture various characteristics of each image region. With these high-dimensional elementary features, both local regional contrasts and border regional contrasts for each elementary feature type are computed for building a more compact representation. Finally, the AdaBoost algorithm is adopted to gradually assemble weak decision trees to construct a composite strong regressor.

Kim et al.[145] design a two-branch CNN architecture to obtain the coarse- and fine representations of the coarse-level and fine-level patches, respectively. The selective search [162] method is utilized to generate a number of region candidates that are treated as the input to the two-branch CNN. Feeding the concatenation of the feature representations of the two branches into the final fully connected layer allows a coarse continuous map to be predicted. To further refine the coarse prediction map, a hierarchical segmentation method is used to sharpen the boundaries and improve the spatial consistency.

In [146], Wang et al. address salient object detection by employing the Fast R-CNN [161] framework. The input image is first segmented into multi-scale regions using both over-segmentation and edge-preserving methods. For each region, the external bounding box is used and the enclosed region is fed to the Fast R-CNN. A small network composed of multiple fully connected layers is connected to the ROI pooling layer for determining the saliency value of each region. Finally, an edge-based propagation method is proposed to suppress the background regions and make the resulting saliency map more uniform.

Kim et al.[147] train a CNN to predict the saliency shape of image patches. The selective search method is first used to localize a stack of image patches, each of which is taken as the input to the CNN. After predicting the shape of each patch, an intermediate mask is computed by accumulating the product of the mask of the predicted shape class and the corresponding probability and averaging over all the region proposals. To further refine the coarse prediction map, a shape class-based saliency detection with hierarchical segmentation (SCSD-HS) is used to incorporate more global information (often needed for saliency detection).

Fig. 8: Popular FCN-based architectures. One can see that apart from the classical architecture (a), more and more advanced architectures have been developed recently. Some of them (b, c, d, and e) exploit skip layers from different scales so as to learn multi-scale and multi-level features. Some of them (e, g, h, and i) adopt the encoder-decoder structure to better fuse high-level features with low-level ones. There are also some works (f, g, and i) that introduce side supervision as done in [142] in order to capture more detailed multi-level information. See Table 9 for details on these architectures.

Li et al.[149] leverage both high-level features from CNNs and low-level features extracted based on hand-crafted methods. To enhance the generalization and learning ability of CNNs, the original R-CNN is redesigned by adding local response normalization (LRN) to the first two layers. The selective search method [162] is utilized to generate a stack of squared patches as the input to the network. Both high-level and low-level features are fed to an SVM with the hinge loss to help judge the saliency of each squared region.

Models with multi-scale inputs. Li et al.[47] utilize a pre-trained CNN as a feature extractor. Given an input image, they first decompose it into a series of non-overlapping regions and then feed them into a CNN with three different-scale inputs. Three subnetworks are then employed to capture advanced features at different scales. The features obtained from patches at three scales are concatenated and then fed into a small MLP with only two fully connected layers as a regressor to output a distribution over binary saliency labels. To solve the problem of imperfect over-segmentation, a superpixel based saliency refinement method is used.
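The last stage of this pipeline, concatenating the features from three scales and passing them through a two-layer MLP that outputs a distribution over binary saliency labels, can be sketched as follows. This is only an illustration: the function name `mlp_saliency`, the ReLU activation, and all weight shapes are hypothetical and not taken from [47].

```python
import numpy as np

def mlp_saliency(feats, w1, b1, w2, b2):
    """Two fully connected layers mapping concatenated features to
    a distribution (P(non-salient), P(salient))."""
    x = np.concatenate(feats)             # join the per-scale feature vectors
    h = np.maximum(0.0, w1 @ x + b1)      # hidden layer; ReLU is an assumption
    z = w2 @ h + b2                       # two logits, one per label
    z = z - z.max()                       # stabilize the softmax numerically
    return np.exp(z) / np.exp(z).sum()    # softmax over the two labels
```

In such a setup the second entry of the returned vector would serve as the saliency score of the region whose features were passed in.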

Fig. 8 illustrates a number of popular FCN-based architectures. Fig. 9 lists different types of information leveraged by these architectures.

# Model SP SS RCL PCF IL CRF Arch.
1 DCL [151] Fig. 8(b)
2 CRPSD [155] Fig. 8(c)
3 DSRCNN [156] Fig. 8(f)
4 DHSNET [150] Fig. 8(g)
5 RACDNN [153] Fig. 8(h)
6 SU [154] Fig. 8(d)
7 DS [157] Fig. 8(a)
8 IMC [158] Fig. 8(a)
9 MSRNet [159] Fig. 8(h)
10 DSS [49] Fig. 8(i)
Fig. 9: Different types of information leveraged by existing FCN-based models. Acronyms include SP: Superpixel, SS: Side Supervision, RCL: Recurrent Convolutional Layer, PCF: Pure CNN Feature, IL: Instance-Level, CRF: Conditional Random Field, Arch: Architecture.

Discussion. As can be seen, the above mentioned MLP-based works rely mostly on segment-level information (e.g., image patches) and classification networks. These image patches are normally resized to a fixed size and are then fed to a classification network which is used to determine the saliency of each patch. Some of the models use multi-scale inputs to extract features in several scales. However, such a learning framework cannot fully leverage high-level semantic information. Further, spatial information cannot be propagated to the last fully connected layers, thus resulting in global information loss.

2.2.2 FCN-based Models

Unlike CCN-based models that operate at the patch level, fully convolutional networks (FCNs) [70] consider pixel-level operations to overcome the problems caused by fully connected layers such as blurriness and inaccurate predictions near the boundaries of salient objects. Due to desirable properties of FCNs, a great number of FCN-based salient object detection models have been introduced recently.

Li et al.[151] design a CNN with two complementary branches: a pixel-level fully convolutional stream (FCS) and a segment-wise spatial pooling stream (SPS). The FCS introduces a series of skip layers after the last convolutional layer of each stage and then the skip layers are fused together as the output of FCS. Notice that a stage of a CNN is composed of all the layers sharing the same resolution. The SPS leverages segment-level information for spatial pooling. Finally, the outputs of FCS and SPS are fused together, followed by a balanced sigmoid cross entropy loss layer as done in [142].
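A class-balanced sigmoid cross-entropy loss of the kind referred to here can be sketched in NumPy. The sketch assumes the HED-style weighting, in which the positive term is weighted by the fraction of negative pixels and the negative term by the fraction of positive pixels; the name `balanced_bce` and the `eps` clipping are illustrative choices rather than details from [151] or [142].

```python
import numpy as np

def balanced_bce(logits, target, eps=1e-7):
    """Class-balanced sigmoid cross-entropy, summed over all pixels.

    logits: float array of raw network outputs.
    target: binary array of the same shape (1 = salient).
    """
    target = np.asarray(target, dtype=np.float64)
    prob = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))  # sigmoid
    prob = np.clip(prob, eps, 1.0 - eps)           # avoid log(0)
    beta = 1.0 - target.mean()                     # fraction of non-salient pixels
    pos = -beta * target * np.log(prob)            # salient pixels, up-weighted when rare
    neg = -(1.0 - beta) * (1.0 - target) * np.log(1.0 - prob)
    return float((pos + neg).sum())
```

Because salient pixels are usually a minority, this weighting keeps the loss from being dominated by the background.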

Liu et al. [150] propose two subnetworks to produce a prediction map in a coarse-to-fine and global-to-local manner. The first subnetwork can be considered as an encoder whose goal is to generate a coarse global prediction. Then, a refinement subnetwork composed of a series of recurrent convolution layers is used to refine the coarse prediction map from coarse scales to fine scales.

In [155], Tang et al. consider both region-level saliency estimation and pixel-level saliency prediction. For pixel-level prediction, two side paths are connected to the last two stages of the VGGNet and then concatenated for learning multi-scale features. For region-level estimation, each given image is first over-segmented into multiple superpixels and then the Clarifai model [163] is used to predict the saliency of each superpixel. The original image and the two prediction maps are taken as the inputs to a small CNN to generate a more convincing saliency map as the final output.

Tang et al.[156] take the deeply supervised net [164] and adopt a similar architecture as in the holistically-nested edge detector [142]. Unlike HED, they replace the original convolutional layers in VGGNet with recurrent convolutional layers to learn local, global, and contextual information.

In [153], Kuen et al. propose a two-stage CNN by utilizing spatial transformer and recurrent network units. A convolutional-deconvolutional network is first used to produce an initial coarse saliency map. The spatial transformer network [165] is applied to extract multiple sub-regions from the original images, followed by a series of recurrent network units to progressively refine the predictions of these sub-regions.

Kruthiventi et al.  [154] consider both fixation prediction and salient object detection in a unified network. To capture multi-scale semantic information, four inception modules [143] are introduced which are connected to the output of the 2nd, 4th, 5th, and 6th stages, respectively. These four side paths are concatenated together and passed through a small network composed of two convolutional layers for reducing the aliasing effect of upsampling. Finally, the sigmoid cross entropy loss is used to optimize the model.

Li et al.[157] consider joint semantic segmentation and salient object detection. Similar to the FCN work [70], the two original fully connected layers in VGGNet [152] are replaced by convolutional layers. To overcome the fuzzy object boundaries caused by the down-sampling operations of CNNs, they make use of the SLIC [166] superpixels to model the topological relationships among superpixels in both spatial and feature dimensions. Finally, the graph Laplacian regularized nonlinear regression is used to change the combination of the predictions from CNNs and the superpixel graph from the coarse level to the fine level.

Zhang et al.[158] detect salient objects using saliency cues extracted by CNNs and a multi-level fusion mechanism. The Deeplab [167] architecture is first used to capture high-level features. To address the problem of large strides in Deeplab, a multi-scale binary pixel labeling method is adopted to improve spatial coherence, similar to [47].

The MSRNet [159] by Li et al. considers both salient object detection and instance-level salient object segmentation. A multi-scale CNN is used to simultaneously detect salient regions and contours. For each scale, features from upper layers are merged with features from lower layers to gradually refine the results. To generate a contour map, the MCG [168] approach is used to extract a small number of candidate bounding boxes and well-segmented regions that are used to help generate salient object instance segmentation. Finally, a fully connected CRF model [169] is employed for refining the spatial coherence.

Hou et al.[49] design a top-down model based on the HED architecture [142]. Instead of connecting independent side paths to the last convolutional layer of each stage, a series of short connections is introduced to build a strong relationship between each pair of side paths. As a result, features from upper layers with strong semantic information are propagated to lower layers, helping them accurately locate the exact positions of salient objects. In the meantime, rich detailed information from lower layers allows the irregular prediction maps from deeper layers to be refined. A special fusion mechanism is exploited to better combine the saliency maps predicted by different side paths.

Discussion. The foregoing approaches are all based on fully convolutional networks, which enable the point-to-point learning and end-to-end training strategies. Compared with CCN-based models, these methods make better use of the convolution operation and substantially decrease the time cost. More importantly, recent FCN-based approaches [49, 159] that utilize CNN features greatly outperform those methods with segment-level information.

To sum up, FCN-based models offer the following three advantages for saliency detection.

1) Local vs. global. As was mentioned in Sec. 2.2.1, earlier CNN-based models incorporate both local and global contextual information explicitly (embedded in separate networks [45, 46, 47]) or implicitly (using an end-to-end framework). This indeed agrees with the design principles behind many hand-crafted cues reviewed in previous sections. However, FCN-based methods are capable of learning both local and global information internally. Lower layers tend to encode more detailed information such as edge and fine components, while deeper layers favor global and semantically meaningful information. Such properties enable FCN-based networks to drastically outperform classic methods.

2) Pre-training and fine-tuning. The effectiveness of fine-tuning a pre-trained network has been demonstrated in many different applications. The network is typically pre-trained on the ImageNet dataset [170] for image classification. The learned knowledge can be applied to several different target tasks (e.g., object detection [161], object localization [171]) through simple fine-tuning. A similar strategy has been adopted in salient object detection [46, 151] and has resulted in superior performance compared to training from scratch. The learned features, more importantly, are able to capture high-level semantic knowledge on object categories, as the employed networks are pre-trained for scene and object classification tasks.

3) Versatile architectures. A CNN architecture is formed by a stack of distinct layers that transform the input images into an output map through a differentiable function. The diversity of FCNs allows designers to build different structures suited to their own needs.

Despite their great success, FCN-based models still fail in several cases. Typical examples include scenes with transparent objects, low contrast between foreground and background, and complex backgrounds, as shown in [49]. This calls for the development of more powerful architectures in the future.

Figure 10 provides a visual comparison of maps generated by classic and CNN-based models.



Fig. 10: Visual comparisons of two best classic methods (DRFI and DSR), according to [132] and two leading CNN-based methods (MDF and DSS).

(a) Content aware resizing [173]; (b) Image collage [174]; (c) View selection [175]; (d) Unsupervised learning [176]; (e) Mosaic [41]; (f) Image montage [177]; (g) Object manipulation [178]; (h) Semantic colorization [179]

Fig. 11: Sample applications of salient object detection. Images are reproduced from corresponding references.

3 Applications of Salient Object Detection

The value of salient object detection models lies in their applications in many areas of computer vision, graphics, and robotics. Salient object detection models have been utilized for several applications such as object detection and recognition [180, 181, 182, 183, 184, 185, 186], image and video compression [187, 188], video summarization [189, 190, 191], photo collage/media re-targeting/cropping/thumb-nailing [192, 193, 174], image quality assessment [194, 195, 196], image segmentation [197, 198, 199, 200], content-based image retrieval and image collection browsing [177, 201, 202, 203], image editing and manipulating [179, 175, 41, 178], visual tracking [204, 205, 206, 207, 208, 209, 210], object discovery [211, 212], and human-robot interaction [213, 214]. Fig. 11 shows example applications.

4 Datasets and Evaluation Measures

4.1 Salient Object Detection Datasets

As more models have been proposed in the literature, more datasets have been introduced to further challenge saliency detection models. Early attempts aim to collect images with salient objects being annotated with bounding boxes (e.g., MSRA-A and MSRA-B [25]), while later efforts annotate such salient objects with pixel-wise binary masks (e.g., ASD [37] and DUT-OMRON [97]). Typically, images that can be annotated with accurate masks contain only a few objects (usually one) and simple background regions. In contrast, recent attempts have been made to collect datasets with multiple objects in complex and cluttered backgrounds (e.g., [26, 23, 22]). As we mentioned in the Introduction section, a more sophisticated mechanism is required to determine the most salient object when several candidate objects are present in the same scene. For example, Borji [23] and Li et al. [22] use the peak of the human fixation map to determine which object is the most salient one (i.e., the one that humans look at the most; see Section 1.2).

A list of 22 salient object datasets, including 20 image datasets and 2 video datasets, is shown in Fig. 12. Notice that all images or video frames in these datasets are annotated with binary masks or rectangles. Subjects are often asked to label the salient object in an image with one object (e.g., [25]) or annotate the most salient one among several candidate objects (e.g., [26]). Some image datasets also provide the fixation data for each image, collected during a free-viewing task.

4.2 Evaluation Measures

Five universally-agreed, standard, and easy-to-compute measures for evaluating salient object detection models are described next. For the sake of simplicity, we use $S$ to represent the predicted saliency map normalized to $[0, 255]$ and $G$ to represent the ground-truth binary mask of salient objects. For a binary mask, we use $|\cdot|$ to represent the number of non-zero entries in the mask.

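Precision-recall (PR). A saliency map $S$ is first converted to a binary mask $M$, and then $Precision$ and $Recall$ are computed by comparing $M$ with the ground-truth $G$:

$$Precision = \frac{|M \cap G|}{|M|}, \qquad Recall = \frac{|M \cap G|}{|G|}.$$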
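As a minimal sketch, precision and recall for an already binarized map can be computed with NumPy as below. The function name `precision_recall` and the convention of returning 0 when a denominator is empty are assumptions for illustration.

```python
import numpy as np

def precision_recall(mask, gt):
    """Precision and recall of a binary mask M against a binary ground truth G."""
    mask = np.asarray(mask, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    overlap = np.logical_and(mask, gt).sum()   # |M & G|
    precision = overlap / max(mask.sum(), 1)   # guard: empty mask gives 0
    recall = overlap / max(gt.sum(), 1)        # guard: empty ground truth gives 0
    return float(precision), float(recall)
```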
The binarization of $S$ is the key step in the evaluation. There are three popular ways to perform the binarization. In the first solution, Achanta et al. [37] propose an image-dependent adaptive threshold for binarizing $S$, which is computed as twice the mean saliency of $S$:

$$T_a = \frac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y),$$
where $W$ and $H$ are the width and the height of the saliency map $S$, respectively.
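The adaptive threshold amounts to a one-line computation. In the sketch below, the name `adaptive_binarize` and the choice of `>=` (rather than `>`) when comparing against the threshold are assumptions.

```python
import numpy as np

def adaptive_binarize(sal):
    """Binarize a saliency map at twice its mean value."""
    sal = np.asarray(sal, dtype=np.float64)
    threshold = 2.0 * sal.mean()       # T_a = 2 / (W * H) * sum of S(x, y)
    return sal >= threshold, threshold
```

Note that a perfectly uniform non-zero map yields an empty mask under this rule, since every pixel then lies below twice the mean.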

The second way to binarize $S$ is to use a fixed threshold that varies from 0 to 255. For each threshold, a pair of ($Precision$, $Recall$) scores is computed and used to plot a precision-recall (PR) curve.
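The threshold sweep can be sketched as a loop over the 256 integer levels. The function name `pr_curve`, the `>=` comparison, and the zero-denominator guards are illustrative assumptions; the saliency map is assumed to hold values in [0, 255].

```python
import numpy as np

def pr_curve(sal, gt):
    """Precision and recall for every integer threshold in [0, 255]."""
    sal = np.asarray(sal)
    gt = np.asarray(gt, dtype=bool)
    n_gt = max(gt.sum(), 1)
    precisions, recalls = [], []
    for t in range(256):
        mask = sal >= t                                  # pixels at or above the threshold
        overlap = np.logical_and(mask, gt).sum()
        precisions.append(overlap / max(mask.sum(), 1))  # empty mask gives 0
        recalls.append(overlap / n_gt)
    return np.array(precisions), np.array(recalls)
```

Plotting the returned precisions against the recalls gives the PR curve; recall can only decrease as the threshold grows.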

The third way to perform the binarization is to use the GrabCut-like algorithm (e.g., as in [84]). Here, the PR curve is first computed and the threshold that leads to 95% recall is selected. With this threshold, the initial binary mask is generated, which can be used to initialize the iterative GrabCut segmentation [138]. After several iterations, the binary mask can be gradually refined.
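Only the initialization step of this third strategy is sketched here: picking the largest threshold whose recall still reaches the target. The sketch works per image against that image's ground truth, which is an assumption (the threshold could equally be chosen over a whole dataset), and the subsequent GrabCut iterations are not shown. The name `threshold_for_recall` is hypothetical.

```python
import numpy as np

def threshold_for_recall(sal, gt, target_recall=0.95):
    """Largest threshold in [0, 255] whose recall still reaches target_recall."""
    sal = np.asarray(sal)
    gt = np.asarray(gt, dtype=bool)
    n_gt = max(gt.sum(), 1)
    best = 0                               # threshold 0 keeps every pixel
    for t in range(256):
        recall = np.logical_and(sal >= t, gt).sum() / n_gt
        if recall >= target_recall:
            best = t                       # recall is non-increasing in t
    return best
```

The mask `sal >= best` would then serve as the initial foreground estimate handed to the segmentation step.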

F-measure. Often, neither $Precision$ nor $Recall$ can fully evaluate the quality of a saliency map. To this end, the F-measure is proposed as the weighted harmonic mean of $Precision$ and $Recall$ with a non-negative weight $\beta^2$:

$$F_\beta = \frac{(1 + \beta^2)\, Precision \times Recall}{\beta^2\, Precision + Recall}.$$
As suggested in many salient object detection works (e.g., [37]), $\beta^2$ is often set to 0.3 to weigh $Precision$ more. The reason is that the recall rate is not as important as precision (see also [55]). For instance, full recall can be easily achieved by setting the whole map to be foreground.
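The effect of the weight can be checked with a small helper. The name `f_measure` and the convention of returning 0 when both scores are 0 are assumptions; the default of 0.3 for the squared weight follows the common setting mentioned above.

```python
def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall (beta_sq is beta squared)."""
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0                     # both scores are zero
    return (1.0 + beta_sq) * precision * recall / denom
```

With this setting, a map that marks everything as foreground on an image with 25% salient pixels (precision 0.25, recall 1) scores about 0.30, whereas the mirrored case (precision 1, recall 0.25) scores about 0.59, which shows how precision is favored.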

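Receiver operating characteristics (ROC) curve. As above, false positive rates ($FPR$) and true positive rates ($TPR$) can be computed when binarizing the saliency map with a set of fixed thresholds:

$$TPR = \frac{|M \cap G|}{|G|}, \qquad FPR = \frac{|M \cap \bar{G}|}{|M \cap \bar{G}| + |\bar{M} \cap \bar{G}|},$$
where $\bar{M}$ and $\bar{G}$ denote the complements of the binary mask $M$ and the ground-truth $G$, respectively. The ROC curve is the plot of $TPR$ versus $FPR$ obtained by testing all possible thresholds.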

Area under ROC curve (AUC). While ROC is a 2D representation of a model’s performance, the AUC distills this information into a single scalar. As the name implies, it is calculated as the area under the ROC curve. A perfect model will score an AUC of 1, while random guessing will score an AUC of around 0.5.
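A minimal sketch of the ROC sweep and the resulting AUC is given below. It assumes a saliency map with values in [0, 255], adds an extra threshold of 256 so that the curve starts at (0, 0), and integrates with the trapezoid rule; the function name `roc_auc` and these conventions are illustrative choices.

```python
import numpy as np

def roc_auc(sal, gt):
    """ROC points over thresholds 0..256 and the area under the curve."""
    sal = np.asarray(sal)
    gt = np.asarray(gt, dtype=bool)
    n_pos = max(gt.sum(), 1)
    n_neg = max((~gt).sum(), 1)
    tpr, fpr = [], []
    for t in range(257):                       # t = 256 selects no pixel, giving (0, 0)
        mask = sal >= t
        tpr.append(np.logical_and(mask, gt).sum() / n_pos)
        fpr.append(np.logical_and(mask, ~gt).sum() / n_neg)
    tpr = np.array(tpr[::-1])                  # reverse so that FPR is non-decreasing
    fpr = np.array(fpr[::-1])
    auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))  # trapezoids
    return fpr, tpr, auc
```

Under these conventions a map identical to the ground truth scores 1, an inverted map scores 0, and a constant map scores 0.5.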

Mean absolute error (MAE). The overlap-based evaluation measures introduced above do not consider the true negative saliency assignments, i.e., the pixels correctly marked as non-salient. They favor methods that successfully assign high saliency to salient pixels but fail to detect non-salient regions. Moreover, for some applications [215], the quality of the weighted continuous saliency maps may be of higher interest than the binary masks. For a more comprehensive comparison, it is recommended to evaluate the mean absolute error (MAE) between the continuous saliency map $S$ and the binary ground-truth $G$, both normalized in the range [0, 1]. For a map of width $W$ and height $H$, the MAE score is defined as:

$$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} |S(x, y) - G(x, y)|.$$
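In code, the MAE is a single averaged absolute difference. The sketch below assumes both inputs have already been scaled to [0, 1]; the function name `mae` is illustrative.

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and a binary mask, both in [0, 1]."""
    sal = np.asarray(sal, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.abs(sal - gt).mean())
```

Unlike precision and recall, this score rewards correctly dark background pixels, so an all-zero map is penalized only in proportion to the salient area it misses.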
Please see [216] for more details on datasets and scores.

Dataset               Year   Imgs     Obj    Ann   Resolution    Sbj   Eye   I/V
MSRA-A [25, 217]      2007   20K      1      BB    400 × 300     3     -     I
MSRA-B [25, 217]      2007   5K       1      BB    400 × 300     9     -     I
SED1 [218, 132]       2007   100      1      PW    300 × 225     3     -     I
SED2 [218, 132]       2007   100      2      PW    300 × 225     3     -     I
ASD [37, 25]          2009   1000     1      PW    400 × 300     1     -     I
SOD [219, 60]         2010   300      3      PW    481 × 321     7     -     I
iCoSeg [125]          2010   643      1      PW    500 × 400     1     -     I
MSRA5K [93, 25]       2011   5K       1      PW    400 × 300     1     -     I
Infrared [220, 221]   2011   900      5      PW    1024 × 768    2     15    I
ImgSal [205]          2013   235      2      PW    640 × 480     19    50    I
CSSD [42]             2013   200      1      PW    400 × 300     1     -     I
ECSSD [42, 222]       2013   1000     1      PW    400 × 300     1     -     I
MSRA10K [223, 25]     2013   10K      1      PW    400 × 300     1     -     I
THUR15K [223, 25]     2013   15K      1      PW    400 × 300     1     -     I
DUT-OMRON [97]        2013   5,172    5      BB    400 × 400     5     5     I
Bruce-A [26, 54]      2013   120      4      PW    681 × 511     70    20    I
Judd-A [23, 224]      2014   900      5      PW    1024 × 768    2     15    I
PASCAL-S [22]         2014   850      5      PW    variable      12    8     I
UCSB [225]            2014   700      5      PW    405 × 405     100   8     I
OSIE [226]            2014   700      5      PW    800 × 600     1     15    I
RSD [227]             2009   62,356   var.   BB    variable      23    -     V
STC [228]             2011   4,870    1      BB    variable      1     -     V
Fig. 12: Overview of popular salient object datasets. Top: image datasets, Bottom: video datasets. Obj = objects per image; Ann = Annotation; Sbj = Subjects/Annotators; Eye = Eye tracking subjects; I/V = Image/Video.

5 Discussions

5.1 Design Choices

In the past two decades, hundreds of classic and deep learning based methods have been proposed for detecting and segmenting salient objects in scenes, and a large number of design choices have been explored. Although great successes have been achieved recently, there is still considerable room for improvement. Our detailed method summary (see Fig. 4 & Fig. 5) sends some clear messages about the commonly used design choices, which are valuable for the design of future algorithms. They are discussed next.

5.1.1 Heuristic vs. Learning From Data

Early methods were mainly based on heuristic (either local or global) cues to detect salient objects [37, 84, 27, 97]. Recently, saliency models based on learning algorithms have been shown to be very effective (see Fig. 4 and Fig. 5). Among these models, deep learning based methods greatly outperform conventional heuristic methods because of their ability to learn a large number of extrinsic cues from large datasets. Data-driven approaches for salient object detection seem to have a surprisingly good generalization ability. An emerging question, however, is whether the data-driven ideas for salient object detection conflict with the ease of use of these models. Most learning based approaches are only trained on a small subset of the MSRA5K dataset, and still consistently outperform other methods on all other datasets, which differ considerably from it. This suggests that it is worth further exploring data-driven salient object detection without losing the simplicity and ease-of-use advantages, in particular from an application point of view.

5.1.2 Hand-crafted vs. CNN-based Features

The first generation of learning-based methods was based on many hand-crafted features. An obvious drawback of these methods is their limited generalization capability, especially when applied to complex cluttered scenes. In addition, these methods mainly rely on over-segmentation algorithms, such as SLIC [166], which often leaves salient objects with high-contrast components incompletely detected. CNN-based models solve these problems, to some degree, even when complex scenes are considered. Because they learn multi-level features, CNNs can accurately locate where the salient objects are. Low-level features such as edges enable sharpening the boundaries of salient objects, while high-level features allow incorporating semantic information to identify salient objects.

5.1.3 Recent Advances in CNN-based Saliency Detection

Various CNN-based architectures have been proposed recently. Among these approaches, there are several promising choices that can be further explored in the future. The first one regards models with deep supervision. As shown in [49], deeply supervised networks strengthen the power of features at different layers. The second choice is the encoder-decoder architecture, which has been adopted in many segmentation-related tasks. These types of approaches gradually back-propagate high-level features to lower layers allowing effective fusion of multi-level features. Another choice is exploiting stronger baseline models, such as using very deep ResNets [229] instead of VGGNet [152].

5.2 Dataset Bias

Datasets have been consequential in the rapid progress in saliency detection. On the one hand, they supply large scale training data and enable comparing the performance of competing algorithms. On the other hand, each dataset is a unique sampling of an unlimited application domain, and contains a certain degree of bias.

To date, there seems to be a unanimous agreement on the presence of bias (i.e. skewness) in underlying structures of datasets. Consequently, some studies have addressed the effect of bias in image datasets. For instance, Torralba & Efros identify three biases in computer vision datasets, namely: selection bias, capture bias and negative set bias [230]. Selection bias is caused by preference of a particular kind of image during data gathering. It results in qualitatively similar images in a dataset. This is witnessed by the strong color contrast (see [22, 84]) in most frequently used salient object benchmark datasets [37]. Thus, two practices in dataset construction are preferred: i) having independent image selection and annotation process [22], and ii) detecting the most salient object first and then segmenting it. Negative set bias is the consequence of a lack of rich and unbiased negative set, i.e., one should avoid concentrating on a particular image of interest and datasets should represent the whole world. Negative set bias may affect the ground-truth by incorporating annotator’s personal preference to some object types. Thus, including a variety of images is encouraged in constructing a good dataset. Capture bias conveys the effect of image composition on the dataset. The most popular kind of such a bias is the tendency of composing objects in the central region of the image, i.e., center bias. The existence of bias in a dataset makes the quantitative comparisons very challenging and sometimes even misleading. For instance, a trivial saliency model which consists of a Gaussian blob at the image center, often scores higher than many fixation prediction models [231, 63, 232].
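The trivial center-prior baseline mentioned at the end of this paragraph is easy to reproduce. The sketch below builds a single Gaussian blob centered in the image; the function name `center_gaussian` and the width parameter `sigma_frac` (standard deviation as a fraction of each image dimension) are hypothetical choices, not values from the cited studies.

```python
import numpy as np

def center_gaussian(height, width, sigma_frac=0.25):
    """Saliency map consisting of one Gaussian blob centered in the image."""
    ys = np.arange(height) - (height - 1) / 2.0   # vertical offsets from the center
    xs = np.arange(width) - (width - 1) / 2.0     # horizontal offsets from the center
    sy, sx = sigma_frac * height, sigma_frac * width
    blob = np.exp(-(ys[:, None] ** 2) / (2 * sy ** 2)
                  - (xs[None, :] ** 2) / (2 * sx ** 2))
    return blob / blob.max()                      # rescale so the peak is 1
```

Scoring such a content-independent map with the measures of Sec. 4.2 gives a quick indication of how much center bias a dataset carries.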

5.3 Future Directions

Several promising research directions for constructing more effective models and benchmarks are discussed here.

5.3.1 Beyond Working with Single Images

Most benchmarks and saliency models discussed in this study deal with single images. Unfortunately, salient object detection on multiple input images, e.g., salient object detection on video sequences, co-salient object detection, and salient object detection over depth and light field images, is less explored. One reason behind this is the limited availability of benchmark datasets on these problems. For example, as mentioned in Sec. 4, there are only two publicly available benchmark datasets for video saliency (mostly cartoons and news). For these videos, only bounding boxes are provided for the key frames to roughly localize salient objects. Multi-modal data is becoming increasingly accessible and affordable. Integrating additional cues such as spatio-temporal consistency and depth will be beneficial for efficient salient object detection.

5.3.2 Instance-Level Salient Object Detection

Existing saliency models are object-agnostic (i.e., they do not split salient regions into objects). However, humans possess the capability of detecting salient objects at instance level. Instance-level saliency can be useful in several applications, such as image editing and video compression.

Two possible approaches for instance-level saliency detection are as follows. The first one regards using an object detection or object proposal method, e.g., Fast-RCNN [161], to extract a stack of object bounding box candidates and then segment salient objects in them. The second approach, initially proposed in [159], is leveraging edge information to distinguish different salient objects.

5.3.3 Versatile Network Architectures

As researchers' understanding of CNNs deepens, more and more interesting network architectures have been developed. It has been shown that using advanced baseline models and network architectures [151] can substantially improve the performance. On the one hand, deeper networks do help better capture salient objects because of their ability to extract high-level semantic information. On the other hand, apart from high-level information, low-level features [49, 159] should also be considered to build high resolution saliency maps.

5.3.4 Unanswered Questions

Some remaining questions include: How many (salient) objects are necessary to represent a scene? Does map smoothing affect the scores and model ranking? How is salient object detection different from other fields? What is the best way to tackle the center bias in model evaluation? What is the remaining gap between models and humans? A collaborative engagement with other related fields such as saliency for fixation prediction, scene labeling and categorization, semantic segmentation, object detection, and object recognition can help answer these questions, situate the field better, and identify future directions.

6 Summary and Conclusion

In this paper, we exhaustively review salient object detection literature with respect to its closely related areas. Detecting and segmenting salient objects is very useful. Objects in images automatically capture more attention than background stuff, such as grass, trees and sky. Therefore, if we can detect salient or important objects first, then we can perform detailed reasoning and scene understanding at the next stage. Compared to traditional special-purpose object detectors, saliency models are general, typically fast, and do not need heavy annotation. These properties allow processing a large number of images at low cost.

Exploring connections between salient object detection and fixation prediction models can help enhance performance of both types of models. In this regard, datasets that offer both salient object judgments of humans and eye movements are highly desirable. Conducting behavioral studies to understand how humans perceive and prioritize objects in scenes and how this concept is related to language, scene description and captioning, visual question answering, attributes, etc, can offer invaluable insights. Further, it is critical to focus more on evaluating and comparing salient object models to gauge future progress. Tackling dataset biases such as center bias and selection bias and moving towards more challenging images is important.

Although salient object detection and segmentation methods have made great strides in recent years, a very robust salient object detection algorithm that is able to generate high quality results for nearly all images is still missing. Even for humans, which object is the most salient in an image is sometimes quite an ambiguous question. To this end, a general suggestion:
Don’t ask what segments can do for you, ask what you can do for the segments. (Jitendra Malik)
is particularly important for building robust algorithms. For instance, when dealing with noisy Internet images, although salient object detection and segmentation methods do not guarantee robust performance on individual images, their efficiency and simplicity make it possible to automatically process a large number of images. This allows filtering images for the purposes of reliability and accuracy, running applications robustly [177, 84, 233, 174, 175, 179], and unsupervised learning [176].


  • [1] M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu, “Global contrast based salient region detection,” IEEE TPAMI, vol. 37, no. 3, pp. 569–582, 2015.
  • [2] Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva, and A. Torralba, “MIT saliency benchmark (2015),” 2015.
  • [3] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand, “Where should saliency models look next?” in ECCV, 2016, pp. 809–824.
  • [4] M. Spain and P. Perona, “Measuring and predicting object importance,” IJCV, vol. 91, no. 1, pp. 59–76, 2011.
  • [5] A. C. Berg, T. L. Berg, H. Daume, J. Dodge, A. Goyal, X. Han, A. Mensch, M. Mitchell, A. Sood, K. Stratos et al., “Understanding and predicting importance in images,” in CVPR, 2012, pp. 3562–3569.
  • [6] B. M. ’t Hart, H. C. Schmidt, C. Roth, and W. Einhäuser, “Fixations on objects in natural scenes: dissociating importance from salience,” Frontiers in Psychology, vol. 4, 2013.
  • [7] P. Isola, J. Xiao, A. Torralba, and A. Oliva, “What makes an image memorable?” in CVPR, 2011, pp. 145–152.
  • [8] R. Rosenholtz, Y. Li, and L. Nakano, “Measuring visual clutter,” J. Vision, vol. 7, no. 2, 2007.
  • [9] H. Katti, K. Y. Bin, T. S. Chua, and M. Kankanhalli, “Pre-attentive discrimination of interestingness in images,” in IEEE ICME, 2008, pp. 1433–1436.
  • [10] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool, “The interestingness of images,” in ICCV, 2013.
  • [11] S. Dhar, V. Ordonez, and T. L. Berg, “High level describable attributes for predicting aesthetics and interestingness,” in CVPR, 2011, pp. 1657–1664.
  • [12] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang, “Understanding and predicting interestingness of videos,” in AAAI, 2013.
  • [13] L. Itti and P. Baldi, “Bayesian surprise attracts human attention,” in NIPS, 2005, pp. 547–554.
  • [14] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE TIP, vol. 13, no. 4, pp. 600–612, 2004.
  • [15] Z. Wang, A. C. Bovik, and L. Lu, “Why is image quality assessment so difficult?” in IEEE ICASSP, vol. 4, 2002.
  • [16] W. Zhang, A. Borji, Z. Wang, P. Le Callet, and H. Liu, “The application of visual saliency models in objective image quality assessment: A statistical evaluation,” IEEE TNNLS, vol. 27, no. 6, pp. 1266–1278, 2016.
  • [17] J. Vogel and B. Schiele, “A semantic typicality measure for natural scene categorization,” in Pattern Recognition, 2004.
  • [18] K. A. Ehinger, J. Xiao, A. Torralba, and A. Oliva, “Estimating scene typicality from human ratings and image features,” in Annual Cognitive Science Conference, 2011.
  • [19] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in CVPR, 2009, pp. 1778–1785.
  • [20] H. Liu, S. Jiang, Q. Huang, C. Xu, and W. Gao, “Region-based visual attention analysis with its application in image browsing on small displays,” in ACM Multimedia, 2007.
  • [21] A. K. Mishra, Y. Aloimonos, L. F. Cheong, and A. Kassim, “Active visual segmentation,” IEEE TPAMI, vol. 34, 2012.
  • [22] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, “The secrets of salient object segmentation,” in CVPR, 2014.
  • [23] A. Borji, “What is a salient object? A dataset and a baseline model for salient object detection,” IEEE TIP, 2014.
  • [24] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE TPAMI, no. 11, pp. 1254–1259, 1998.
  • [25] T. Liu, J. Sun, N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” in CVPR, 2007, pp. 1–8.
  • [26] A. Borji, D. N. Sihite, and L. Itti, “What stands out in a scene? A study of human explicit saliency judgment,” Vision Research, vol. 91, pp. 62–77, 2013.
  • [27] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in CVPR, 2012, pp. 733–740.
  • [28] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE TPAMI, vol. 24, 2002.
  • [29] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE TPAMI, vol. 34, no. 11, 2012.
  • [30] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr, “BING: Binarized normed gradients for objectness estimation at 300fps,” in CVPR, vol. 2, 2014, p. 4.
  • [31] A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE TPAMI, vol. 35, no. 1, pp. 185–207, 2013.
  • [32] A. Borji, H. R. Tavakoli, D. N. Sihite, and L. Itti, “Analysis of scores, datasets, and models in visual saliency prediction,” in ICCV, 2013, pp. 921–928.
  • [33] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, “What makes for effective detection proposals?” IEEE TPAMI, vol. 38, no. 4, pp. 814–830, 2016.
  • [34] B. Alexe, T. Deselaers, and V. Ferrari, “What is an object?” in CVPR, 2010, pp. 73–80.
  • [35] P. Siva, C. Russell, T. Xiang, and L. Agapito, “Looking beyond the image: Unsupervised learning for object saliency and detection,” in CVPR, 2013, pp. 3238–3245.
  • [36] H.-D. Cheng, X. Jiang, Y. Sun, and J. Wang, “Color image segmentation: advances and prospects,” Pattern Recognition, vol. 34, no. 12, pp. 2259–2281, 2001.
  • [37] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, “Frequency-tuned salient region detection,” in CVPR, 2009.
  • [38] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, “Global contrast based salient region detection,” in CVPR, 2011.
  • [39] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” IEEE TPAMI, vol. 34, no. 10, 2012.
  • [40] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in IEEE CVPR, 2013, pp. 2083–2090.
  • [41] R. Margolin, L. Zelnik-Manor, and A. Tal, “Saliency for image manipulation,” The Visual Computer, pp. 1–12, 2013.
  • [42] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in CVPR, 2013, pp. 1155–1162.
  • [43] C. Yang, L. Zhang, and H. Lu, “Graph-regularized saliency detection with convex-hull-based center prior,” IEEE Signal Processing Letters, vol. 20, no. 7, pp. 637–640, 2013.
  • [44] S. He, R. Lau, W. Liu, Z. Huang, and Q. Yang, “SuperCNN: A superpixelwise convolutional neural network for salient object detection,” IJCV, vol. 115, no. 3, pp. 330–344, 2015.
  • [45] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, “Deep networks for saliency detection via local estimation and global search,” in CVPR, 2015, pp. 3183–3192.
  • [46] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep learning,” in CVPR, 2015, pp. 1265–1274.
  • [47] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in CVPR, 2015, pp. 5455–5463.
  • [48] W. Zou and N. Komodakis, “HARF: Hierarchy-associated rich features for salient object detection,” in ICCV, 2015, pp. 406–414.
  • [49] Q. Hou, M.-M. Cheng, X.-W. Hu, A. Borji, Z. Tu, and P. Torr, “Deeply supervised salient object detection with short connections,” in CVPR, 2017.
  • [50] A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognitive Psychology, pp. 97–136, 1980.
  • [51] J. M. Wolfe, K. R. Cave, and S. L. Franzel, “Guided search: an alternative to the feature integration model for visual search,” J. Exp. Psychol. Human., vol. 15, no. 3, p. 419, 1989.
  • [52] C. Koch and S. Ullman, “Shifts in selective visual attention: towards the underlying neural circuitry,” in Matters of Intelligence, 1987, pp. 115–141.
  • [53] D. Parkhurst, K. Law, and E. Niebur, “Modeling the role of salience in the allocation of overt visual attention,” Vision Research, vol. 42, no. 1, pp. 107–123, 2002.
  • [54] N. D. Bruce and J. K. Tsotsos, “Saliency based on information maximization,” in NIPS, 2005, pp. 155–162.
  • [55] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” IEEE TPAMI, vol. 33, no. 2, pp. 353–367, 2011.
  • [56] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk, “Salient region detection and segmentation,” in Comp. Vis. Sys., 2008.
  • [57] Y.-F. Ma and H.-J. Zhang, “Contrast-based image attention analysis by using fuzzy growing,” in ACM Multimedia, 2003.
  • [58] F. Liu and M. Gleicher, “Region enhanced scale-invariant saliency detection,” in ICME, 2006, pp. 1477–1480.
  • [59] D. Walther and C. Koch, “Modeling attention to salient proto-objects,” Neural Networks, vol. 19, no. 9, pp. 1395–1407, 2006.
  • [60] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE TPAMI, vol. 33, no. 5, pp. 898–916, 2011.
  • [61] D. R. Martin, C. C. Fowlkes, and J. Malik, “Learning to detect natural image boundaries using local brightness, color, and texture cues,” IEEE TPAMI, vol. 26, no. 5, pp. 530–549, 2004.
  • [62] I. Endres and D. Hoiem, “Category independent object proposals,” in ECCV, 2010, vol. 6315, pp. 575–588.
  • [63] T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to predict where humans look,” in ICCV, 2009, pp. 2106–2113.
  • [64] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in CVPR, 2007, pp. 1–8.
  • [65] A. Borji and L. Itti, “Exploiting local and global patch rarities for saliency detection,” in CVPR, 2012, pp. 478–485.
  • [66] A. Borji, “Boosting bottom-up and top-down visual features for saliency estimation,” in CVPR, 2012, pp. 438–445.
  • [67] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, vol. 1, 2001, pp. I–511.
  • [68] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE TPAMI, pp. 1627–1645, 2010.
  • [69] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [70] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015, pp. 3431–3440.
  • [71] G. Hua, Z. Liu, Z. Zhang, and Y. Wu, “Iterative local-global energy minimization for automatic extraction of objects of interest,” IEEE TPAMI, vol. 28, no. 10, pp. 1701–1706, 2006.
  • [72] B. C. Ko and J.-Y. Nam, “Automatic object-of-interest segmentation from natural images,” in ICPR, 2006, pp. 45–48.
  • [73] M. Allili and D. Ziou, “Object of interest segmentation and tracking by using feature selection and active contours,” in CVPR, 2007, pp. 1–8.
  • [74] Y. Hu, D. Rajan, and L.-T. Chia, “Robust subspace analysis for detecting visual attention regions in images,” in ACM Multimedia, 2005, pp. 716–724.
  • [75] R. Vidal, Y. Ma, and S. Sastry, “Generalized principal component analysis (GPCA),” IEEE TPAMI, vol. 27, no. 12, pp. 1945–1959, 2005.
  • [76] P. L. Rosin, “A simple method for detecting salient regions,” Pattern Recognition, vol. 42, no. 11, pp. 2363–2371, 2009.
  • [77] R. Valenti, N. Sebe, and T. Gevers, “Image saliency by isocentric curvedness and color,” in ICCV, 2009, pp. 2185–2192.
  • [78] D. A. Klein and S. Frintrop, “Center-surround divergence of feature statistics for salient object detection,” in ICCV, 2011, pp. 2214–2219.
  • [79] X. Li, Y. Li, C. Shen, A. R. Dick, and A. van den Hengel, “Contextual hypergraph modeling for salient object detection,” in ICCV, 2013, pp. 3328–3335.
  • [80] R. Margolin, A. Tal, and L. Zelnik-Manor, “What makes a patch distinct?” in CVPR, 2013, pp. 1139–1146.
  • [81] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” IJCV, pp. 167–181, 2004.
  • [82] A. Levinshtein, A. Stere, K. N. Kutulakos, D. J. Fleet, S. J. Dickinson, and K. Siddiqi, “TurboPixels: Fast superpixels using geometric flows,” IEEE TPAMI, pp. 2290–2297, 2009.
  • [83] Z. Yu and H.-S. Wong, “A rule based technique for extraction of visual attention regions based on real-time clustering,” IEEE TMM, pp. 766–784, 2007.
  • [84] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE TPAMI, vol. 37, no. 3, pp. 569–582, 2015.
  • [85] C. Scharfenberger, A. Wong, K. Fergani, J. S. Zelek, and D. A. Clausi, “Statistical textural distinctiveness for salient region detection in natural images,” in CVPR, 2013, pp. 979–986.
  • [86] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook, “Efficient salient region detection with soft image abstraction,” in ICCV, 2013, pp. 1529–1536.
  • [87] Z. Jiang and L. S. Davis, “Submodular salient region detection,” in CVPR, 2013, pp. 2043–2050.
  • [88] A. Adams, J. Baek, and M. A. Davis, “Fast high-dimensional filtering using the permutohedral lattice,” in Computer Graphics Forum, vol. 29, no. 2, 2010, pp. 753–762.
  • [89] K. Shi, K. Wang, J. Lu, and L. Lin, “PISA: Pixelwise image saliency by aggregating complementary appearance contrast measures with spatial priors,” in CVPR, 2013, pp. 2115–2122.
  • [90] H. Yu, J. Li, Y. Tian, and T. Huang, “Automatic interesting object extraction from images using complementary saliency maps,” in ACM Multimedia, 2010, pp. 891–894.
  • [91] Y. Lu, W. Zhang, H. Lu, and X. Xue, “Salient object detection using concavity context,” in ICCV, 2011, pp. 233–240.
  • [92] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai, “Fusing generic objectness and visual saliency for salient object detection,” in ICCV, 2011, pp. 914–921.
  • [93] H. Jiang, J. Wang, Z. Yuan, T. Liu, and N. Zheng, “Automatic salient object segmentation based on context and shape prior,” in BMVC, 2011.
  • [94] X. Shen and Y. Wu, “A unified approach to salient object detection via low rank matrix recovery,” in CVPR, 2012.
  • [95] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using background priors,” in ECCV, 2012, vol. 7574, pp. 29–42.
  • [96] Y. Xie, H. Lu, and M.-H. Yang, “Bayesian saliency via low and mid level cues,” IEEE TIP, vol. 22, no. 5, 2013.
  • [97] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in CVPR, 2013.
  • [98] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, “Saliency detection via dense and sparse reconstruction,” in ICCV, 2013, pp. 2976–2983.
  • [99] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang, “Saliency detection via absorbing Markov chain,” in ICCV, 2013.
  • [100] P. Jiang, H. Ling, J. Yu, and J. Peng, “Salient region detection by UFO: Uniqueness, focusness and objectness,” in ICCV, 2013.
  • [101] Y. Jia and M. Han, “Category-independent object-level saliency detection,” in ICCV, 2013.
  • [102] W. Zou, K. Kpalma, Z. Liu, J. Ronsin et al., “Segmentation driven low-rank matrix recovery for saliency detection,” in BMVC, 2013, pp. 1–13.
  • [103] H. Peng, B. Li, R. Ji, W. Hu, W. Xiong, and C. Lang, “Salient object detection via low-rank and structured sparse matrix decomposition,” in AAAI, 2013.
  • [104] R. Liu, J. Cao, G. Zhong, Z. Lin, S. Shan, and Z. Su, “Adaptive partial differential equation learning for visual saliency detection,” in CVPR, 2014.
  • [105] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in CVPR, 2014.
  • [106] J. Zhang and S. Sclaroff, “Saliency detection: A Boolean map approach,” in ICCV, 2013, pp. 153–160.
  • [107] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, “Saliency detection on light fields,” in CVPR, 2014.
  • [108] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, “Segmenting salient objects from images and videos,” in ECCV, 2010.
  • [109] P. Khuwuthyakorn, A. Robles-Kelly, and J. Zhou, “Object of interest detection by saliency learning,” in ECCV, 2010.
  • [110] P. Mehrani and O. Veksler, “Saliency segmentation based on learning and graph cut refinement,” in BMVC, 2010, pp. 1–12.
  • [111] S. Lu, V. Mahadevan, and N. Vasconcelos, “Learning optimal seeds for diffusion-based salient object detection,” in CVPR, 2014.
  • [112] J. Kim, D. Han, Y.-W. Tai, and J. Kim, “Salient region detection via high-dimensional color transform,” in CVPR, 2014.
  • [113] L. Marchesotti, C. Cifarelli, and G. Csurka, “A framework for visual saliency detection with applications to image thumbnailing,” in ICCV, 2009, pp. 2232–2239.
  • [114] M. Wang, J. Konrad, P. Ishwar, K. Jing, and H. Rowley, “Image saliency: From intrinsic to extrinsic context,” in CVPR, 2011.
  • [115] L. Mai, Y. Niu, and F. Liu, “Saliency aggregation: A data-driven approach,” in CVPR, 2013, pp. 1131–1138.
  • [116] Y. Zhai and M. Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in ACM Multimedia, 2006, pp. 815–824.
  • [117] T. Liu, N. Zheng, W. Ding, and Z. Yuan, “Video attention: Learning to detect a salient object sequence,” in ICPR, 2008.
  • [118] S. Bin, Y. Li, L. Ma, W. Wu, and Z. Xie, “Temporally coherent video saliency using regional dynamic contrast,” IEEE TCSVT, vol. 23, no. 12, pp. 2067–2076, 2013.
  • [119] H. Li and K. N. Ngan, “A co-saliency model of image pairs,” IEEE TIP, vol. 20, no. 12, pp. 3365–3375, 2011.
  • [120] K.-Y. Chang, T.-L. Liu, and S.-H. Lai, “From co-saliency to co-segmentation: An efficient and fully unsupervised energy minimization model,” in CVPR, 2011, pp. 2129–2136.
  • [121] H. Fu, X. Cao, and Z. Tu, “Cluster-based co-saliency detection,” IEEE TIP, vol. 22, no. 10, pp. 3766–3778, 2013.
  • [122] Y. Niu, Y. Geng, X. Li, and F. Liu, “Leveraging stereopsis for saliency analysis,” in CVPR, 2012, pp. 454–461.
  • [123] K. Desingh, K. M. Krishna, D. Rajan, and C. Jawahar, “Depth really matters: Improving visual salient region detection with depth,” in BMVC, 2013.
  • [124] C. Rother, T. P. Minka, A. Blake, and V. Kolmogorov, “Cosegmentation of image pairs by histogram matching - incorporating a global constraint into MRFs,” in CVPR, 2006.
  • [125] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, “iCoseg: Interactive co-segmentation with intelligent scribble guidance,” in CVPR, 2010, pp. 3169–3176.
  • [126] L. Mukherjee, V. Singh, and J. Peng, “Scale invariant cosegmentation for image groups,” in CVPR, 2011, pp. 1881–1888.
  • [127] G. Kim, E. P. Xing, F.-F. Li, and T. Kanade, “Distributed cosegmentation via submodular optimization on anisotropic diffusion,” in ICCV, 2011, pp. 169–176.
  • [128] J. Feng, Y. Wei, L. Tao, C. Zhang, and J. Sun, “Salient object detection by composition,” in ICCV, 2011, pp. 1028–1035.
  • [129] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li, “Salient object detection for searched web images via global saliency,” in CVPR, 2012, pp. 3194–3201.
  • [130] L. Wang, J. Xue, N. Zheng, and G. Hua, “Automatic salient object extraction with contextual cue,” in ICCV, 2011.
  • [131] Y. Tian, J. Li, S. Yu, and T. Huang, “Learning complementary saliency priors for foreground object segmentation in complex scenes,” IJCV, 2014.
  • [132] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE TIP, vol. 24, no. 12, pp. 5706–5722, 2015.
  • [133] J. Li, Y. Tian, L. Duan, and T. Huang, “Estimating visual saliency through single image optimization,” IEEE Signal Processing Letters, vol. 20, no. 9, pp. 845–848, 2013.
  • [134] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE TPAMI, vol. 22, no. 8, pp. 888–905, 2000.
  • [135] Z. Tu and X. Bai, “Auto-context and its application to high-level vision tasks and 3D brain image segmentation,” IEEE TPAMI, vol. 32, no. 10, pp. 1744–1757, 2010.
  • [136] Y. Qin, H. Lu, Y. Xu, and H. Wang, “Saliency detection via cellular automata,” in CVPR, 2015, pp. 110–119.
  • [137] Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum, “Lazy snapping,” ACM TOG, vol. 23, no. 3, pp. 303–308, 2004.
  • [138] C. Rother, V. Kolmogorov, and A. Blake, “‘GrabCut’: interactive foreground extraction using iterated graph cuts,” ACM TOG, vol. 23, no. 3, pp. 309–314, 2004.
  • [139] C. Lang, T. V. Nguyen, H. Katti, K. Yadati, M. S. Kankanhalli, and S. Yan, “Depth matters: Influence of depth cues on visual saliency,” in ECCV, 2012, pp. 101–115.
  • [140] J. Zhang, M. Wang, L. Lin, X. Yang, J. Gao, and Y. Rui, “Saliency detection on light field: A multi-cue approach,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 13, no. 3, p. 32, 2017.
  • [141] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105.
  • [142] S. Xie and Z. Tu, “Holistically-nested edge detection,” in ICCV, 2015, pp. 1395–1403.
  • [143] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015, pp. 1–9.
  • [144] G. Lee, Y.-W. Tai, and J. Kim, “Deep saliency with encoded low level distance map and high level features,” in CVPR, 2016.
  • [145] J. Kim and V. Pavlovic, “A shape-based approach for salient object detection using deep learning,” in ECCV, 2016, pp. 455–470.
  • [146] X. Wang, H. Ma, and X. Chen, “Salient object detection via fast R-CNN and low-level cues,” in IEEE ICIP, 2016, pp. 1042–1046.
  • [147] J. Kim and V. Pavlovic, “A shape preserving approach for salient object detection using convolutional neural networks,” in ICPR, 2016, pp. 609–614.
  • [148] T. Chen, L. Lin, L. Liu, X. Luo, and X. Li, “DISC: Deep image saliency computing via progressive representation learning,” IEEE TNNLS, vol. 27, no. 6, pp. 1135–1149, 2016.
  • [149] H. Li, J. Chen, H. Lu, and Z. Chi, “CNN for saliency detection with low-level feature integration,” Neurocomputing, vol. 226, pp. 212–220, 2017.
  • [150] N. Liu and J. Han, “DHSNet: Deep hierarchical saliency network for salient object detection,” in CVPR, 2016, pp. 678–686.
  • [151] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in CVPR, 2016, pp. 478–487.
  • [152] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [153] J. Kuen, Z. Wang, and G. Wang, “Recurrent attentional networks for saliency detection,” in CVPR, 2016, pp. 3668–3677.
  • [154] S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, and R. Venkatesh Babu, “Saliency unified: A deep architecture for simultaneous eye fixation prediction and salient object segmentation,” in CVPR, 2016, pp. 5781–5790.
  • [155] Y. Tang and X. Wu, “Saliency detection via combining region-level and pixel-level predictions with CNNs,” in ECCV, 2016, pp. 809–825.
  • [156] Y. Tang, X. Wu, and W. Bu, “Deeply-supervised recurrent convolutional neural network for saliency detection,” in ACM Multimedia, 2016, pp. 397–401.
  • [157] X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang, “DeepSaliency: Multi-task deep neural network model for salient object detection,” IEEE TIP, vol. 25, no. 8, pp. 3919–3930, 2016.
  • [158] J. Zhang, Y. Dai, and F. Porikli, “Deep salient object detection by integrating multi-level cues,” in IEEE WACV, 2017, pp. 1–10.
  • [159] G. Li, Y. Xie, L. Lin, and Y. Yu, “Instance-level salient object segmentation,” in CVPR, 2017.
  • [160] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, “Deep networks for saliency detection via local estimation and global search,” in CVPR, 2015, pp. 3183–3192.
  • [161] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014, pp. 580–587.
  • [162] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” IJCV, vol. 104, no. 2, pp. 154–171, 2013.
  • [163] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014, pp. 818–833.
  • [164] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Artificial Intelligence and Statistics, 2015, pp. 562–570.
  • [165] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in NIPS, 2015, pp. 2017–2025.
  • [166] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE TPAMI, vol. 34, no. 11, pp. 2274–2282, 2012.
  • [167] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” arXiv preprint arXiv:1606.00915, 2016.
  • [168] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marqués, and J. Malik, “Multiscale combinatorial grouping,” in CVPR, 2014.
  • [169] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected CRFs with Gaussian edge potentials,” in NIPS, 2011, pp. 109–117.
  • [170] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.
  • [171] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in CVPR, 2014, pp. 1717–1724.
  • [172] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in CVPR, 2013, pp. 2083–2090.
  • [173] G.-X. Zhang, M.-M. Cheng, S.-M. Hu, and R. R. Martin, “A shape-preserving approach to image resizing,” Computer Graphics Forum, vol. 28, no. 7, pp. 1897–1906, 2009.
  • [174] H. Huang, L. Zhang, and H.-C. Zhang, “Arcimboldo-like collage using internet images,” ACM TOG, vol. 30, no. 6, p. 155, 2011.
  • [175] H. Liu, L. Zhang, and H. Huang, “Web-image driven best views of 3D shapes,” The Visual Computer, 2012.
  • [176] J.-Y. Zhu, J. Wu, Y. Wei, E. Chang, and Z. Tu, “Unsupervised object class discovery via saliency-guided multiple class learning,” in CVPR, 2012, pp. 3218–3225.
  • [177] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu, “Sketch2Photo: internet image montage,” ACM TOG, 2009.
  • [178] C. Goldberg, T. Chen, F.-L. Zhang, A. Shamir, and S.-M. Hu, “Data-driven object manipulation in images,” Computer Graphics Forum, vol. 31, no. 21, pp. 265–274, 2012.
  • [179] A. Y.-S. Chia, S. Zhuo, R. K. Gupta, Y.-W. Tai, S.-Y. Cho, P. Tan, and S. Lin, “Semantic colorization with internet images,” ACM TOG, vol. 30, no. 6, p. 156, 2011.
  • [180] U. Rutishauser, D. Walther, C. Koch, and P. Perona, “Is bottom-up attention useful for object recognition?” in CVPR, 2004.
  • [181] C. Kanan and G. Cottrell, “Robust classification of objects, faces, and flowers using natural image statistics,” in CVPR, 2010, pp. 2472–2479.
  • [182] F. Moosmann, D. Larlus, and F. Jurie, “Learning saliency maps for object categorization,” in ECCV Workshop, 2006.
  • [183] A. Borji, M. N. Ahmadabadi, and B. N. Araabi, “Cost-sensitive learning of top-down modulation for attentional control,” Machine Vision and Applications, 2011.
  • [184] A. Borji and L. Itti, “Scene classification with a sparse set of salient regions,” in IEEE ICRA, 2011, pp. 1902–1908.
  • [185] H. Shen, S. Li, C. Zhu, H. Chang, and J. Zhang, “Moving object detection in aerial video based on spatiotemporal saliency,” Chinese Journal of Aeronautics, 2013.
  • [186] Z. Ren, S. Gao, L.-T. Chia, and I. Tsang, “Region-based saliency detection and its application in object recognition,” IEEE TCSVT, vol. PP, no. 99, pp. 1–1, 2013.
  • [187] C. Guo and L. Zhang, “A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression,” IEEE TIP, vol. 19, no. 1, pp. 185–198, 2010.
  • [188] L. Itti, “Automatic foveation for video compression using a neurobiological model of visual attention,” IEEE TIP, vol. 13, no. 10, pp. 1304–1318, 2004.
  • [189] Y.-F. Ma, X.-S. Hua, L. Lu, and H.-J. Zhang, “A generic framework of user attention model and its application in video summarization,” IEEE TMM, 2005.
  • [190] Y. J. Lee, J. Ghosh, and K. Grauman, “Discovering important people and objects for egocentric video summarization,” in CVPR, 2012, pp. 1346–1353.
  • [191] Q.-G. Ji, Z.-D. Fang, Z.-H. Xie, and Z.-M. Lu, “Video abstraction based on the visual attention model and online clustering,” Signal Processing: Image Communication, 2012.
  • [192] S. Goferman, A. Tal, and L. Zelnik-Manor, “Puzzle-like collage,” Computer Graphics Forum, 2010.
  • [193] J. Wang, L. Quan, J. Sun, X. Tang, and H.-Y. Shum, “Picture collage,” in CVPR, vol. 1, 2006, pp. 347–354.
  • [194] A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba, “Does where you gaze on an image affect your perception of quality? Applying visual attention to image quality metric,” in IEEE ICIP, vol. 2, 2007, pp. II–169.
  • [195] H. Liu and I. Heynderickx, “Studying the added value of visual attention in objective image quality metrics based on eye movement data,” in IEEE ICIP, 2009, pp. 3097–3100.
  • [196] A. Li, X. She, and Q. Sun, “Color image quality assessment combining saliency and FSIM,” in ICDIP, vol. 8878, 2013.
  • [197] M. Donoser, M. Urschler, M. Hirzer, and H. Bischof, “Saliency driven total variation segmentation,” in ICCV, 2009, pp. 817–824.
  • [198] Q. Li, Y. Zhou, and J. Yang, “Saliency based image segmentation,” in ICMT, 2011, pp. 5068–5071.
  • [199] C. Qin, G. Zhang, Y. Zhou, W. Tao, and Z. Cao, “Integration of the saliency-based seed extraction and random walks for image segmentation,” Neurocomputing, vol. 129, 2013.
  • [200] M. Johnson-Roberson, J. Bohg, M. Bjorkman, and D. Kragic, “Attention-based active 3D point cloud segmentation,” in IEEE IROS, 2010, pp. 1165–1170.
  • [201] S. Feng, D. Xu, and X. Yang, “Attention-driven salient edge(s) and region(s) extraction with application to CBIR,” Signal Processing, vol. 90, no. 1, pp. 1–15, 2010.
  • [202] J. Sun, J. Xie, J. Liu, and T. Sikora, “Image adaptation and dynamic browsing based on two-layer saliency combination,” IEEE Trans. Broadcasting, vol. 59, no. 4, pp. 602–613, 2013.
  • [203] L. Li, S. Jiang, Z. Zha, Z. Wu, and Q. Huang, “Partial-duplicate image retrieval via saliency-guided visually matching,” IEEE MultiMedia, vol. 20, no. 3, pp. 13–23, 2013.
  • [204] S. Stalder, H. Grabner, and L. Van Gool, “Dynamic objectness for adaptive tracking,” in ACCV, 2012.
  • [205] J. Li, M. Levine, X. An, X. Xu, and H. He, “Visual saliency based on scale-space analysis in the frequency domain,” IEEE TPAMI, vol. 35, no. 4, pp. 996–1010, 2013.
  • [206] G. M. García, D. A. Klein, J. Stückler, S. Frintrop, and A. B. Cremers, “Adaptive multi-cue 3d tracking of arbitrary objects,” in Pattern Recognition, 2012, pp. 357–366.
  • [207] A. Borji, S. Frintrop, D. N. Sihite, and L. Itti, “Adaptive object tracking by learning background context,” in IEEE CVPRW. IEEE, 2012, pp. 23–30.
  • [208] D. A. Klein, D. Schulz, S. Frintrop, and A. B. Cremers, “Adaptive real-time video-tracking for arbitrary objects,” in IEEE IROS, 2010, pp. 772–777.
  • [209] S. Frintrop and M. Kessel, “Most salient region tracking,” in IEEE ICRA, 2009, pp. 1869–1874.
  • [210] G. Zhang, Z. Yuan, N. Zheng, X. Sheng, and T. Liu, “Visual saliency based object tracking,” in ACCV, 2010.
  • [211] A. Karpathy, S. Miller, and L. Fei-Fei, “Object discovery in 3d scenes via shape analysis,” in ICRA, 2013, pp. 2088–2095.
  • [212] S. Frintrop, G. M. García, and A. B. Cremers, “A cognitive approach for object discovery,” in ICPR, 2014.
  • [213] D. Meger, P.-E. Forssén, K. Lai, S. Helmer, S. McCann, T. Southey, M. Baumann, J. J. Little, and D. G. Lowe, “Curious George: An attentive semantic robot,” Robotics and Autonomous Systems, vol. 56, no. 6, pp. 503–511, 2008.
  • [214] Y. Sugano, Y. Matsushita, and Y. Sato, “Calibration-free gaze sensing using saliency maps,” in CVPR, 2010.
  • [215] S. Avidan and A. Shamir, “Seam carving for content-aware image resizing,” ACM TOG, vol. 26, no. 3, p. 10, 2007.
  • [216] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE TIP, vol. 24, no. 12, pp. 5706–5722, 2015.
  • [217] “MSRA dataset,”
  • [218] S. Alpert, M. Galun, R. Basri, and A. Brandt, “Image segmentation by probabilistic bottom-up aggregation and cue integration,” in CVPR, 2007, pp. 1–8.
  • [219] V. Movahedi and J. H. Elder, “Design and perceptual validation of performance measures for salient object segmentation,” in IEEE CVPRW. IEEE, 2010, pp. 49–56.
  • [220] M. Brown and S. Süsstrunk, “Multi-spectral SIFT for scene category recognition,” in CVPR, 2011, pp. 177–184.
  • [221] Q. Wang, P. Yan, Y. Yuan, and X. Li, “Multi-spectral saliency detection,” Elsevier PRL, 2013.
  • [222] “MSRA10K dataset,”
  • [223] “THUR15K dataset,”
  • [224] “Judd-A dataset,”
  • [225] K. Koehler, F. Guo, S. Zhang, and M. P. Eckstein, “What do saliency models predict?” J. Vision, 2014.
  • [226] J. Xu, M. Jiang, S. Wang, M. S. Kankanhalli, and Q. Zhao, “Predicting human gaze beyond pixels,” J. Vision, 2014.
  • [227] J. Li, Y. Tian, T. Huang, and W. Gao, “A dataset and evaluation methodology for visual saliency in video,” in IEEE ICME, 2009, pp. 442–445.
  • [228] Y. Wu, N. Zheng, Z. Yuan, H. Jiang, and T. Liu, “Detection of salient objects with focused attention based on spatial and temporal coherence,” Chinese Science Bulletin, vol. 56, pp. 1055–1062, 2011.
  • [229] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [230] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in IEEE CVPR, 2011, pp. 1521–1528.
  • [231] B. W. Tatler, “The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions,” J. Vision, 2007.
  • [232] A. Borji, D. Sihite, and L. Itti, “Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study,” IEEE TIP, vol. 22, no. 1, pp. 55–69, 2013.
  • [233] M.-M. Cheng, N. J. Mitra, X. Huang, and S.-M. Hu, “Salientshape: group saliency in image collections,” The Visual Computer, vol. 30, no. 4, pp. 443–453, 2014.

Ali Borji received his BSc and MSc degrees in computer engineering from the Petroleum University of Technology, Tehran, in 2001 and Shiraz University, Shiraz, in 2004, respectively. He received his Ph.D. in cognitive neurosciences from the Institute for Studies in Fundamental Sciences (IPM), Tehran, Iran, in 2009 and was a postdoctoral scholar at iLab, USC, from 2010 to 2014. He is currently an assistant professor at the University of Central Florida.


Ming-Ming Cheng received his PhD degree from Tsinghua University in 2012. He is currently a research fellow at Oxford University, working with Prof. Philip Torr. His research interests include computer graphics, computer vision, image processing, and image retrieval. He has received the Google PhD fellowship award, the IBM PhD fellowship award, and the new PhD Researcher Award from the Chinese Ministry of Education.

Qibin Hou is currently a Ph.D. candidate at the College of Computer Science and Control Engineering, Nankai University, under the supervision of Prof. Ming-Ming Cheng. His research interests include deep learning, image processing, and computer vision.


Huaizu Jiang is currently a research assistant at the Institute of Artificial Intelligence and Robotics at Xi’an Jiaotong University. Before that, he received his BS and MS degrees from Xi’an Jiaotong University, China, in 2005 and 2009, respectively. He is interested in how to teach an intelligent machine to understand a visual scene as a human does. Specifically, his research interests include object detection, large-scale visual recognition, and (3D) scene understanding.


Jia Li received his B.E. degree from Tsinghua University in 2005 and his Ph.D. degree from the Chinese Academy of Sciences in 2011. From 2011 to 2013, he served as a research fellow and visiting assistant professor at Nanyang Technological University, Singapore. He is currently an associate professor at Beihang University, Beijing, China. His research interests include visual attention/saliency modeling, multimedia analysis, and vision from Big Data.
