Coarse-to-Fine Salient Object Detection with Low-Rank Matrix Recovery


Qi Zheng, Shujian Yu, Xinge You, Qinmu Peng, Wei Yuan

Q. Zheng, X. You, Q. Peng and Y. Wei are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China. S. Yu is with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA.

Despite the great potential of using the low-rank matrix recovery (LRMR) theory on the task of salient object detection, existing LRMR-based approaches scarcely consider the interrelationship among elements within the sparse components and suffer from high computational cost. In this paper, we propose a novel LRMR-based saliency detection method under a coarse-to-fine framework to circumvent these two limitations. The first step of our approach is to generate a coarse saliency map by integrating an $\ell_1$-norm sparsity constraint imposed on the sparse matrix and a Laplacian regularization for smoothness. Following this, we aim to exploit and reveal the interrelationship among sparse elements and to increase detection recall values near the object boundaries using a learned mapping function that precisely distinguishes foreground from background in cluttered or complex scenes. Extensive experiments on three benchmark datasets demonstrate that our method achieves enhanced performance compared with other state-of-the-art saliency detection approaches, and also verify the efficacy of our coarse-to-fine architecture.

Salient object detection, Coarse-to-fine, Low-rank matrix recovery, Learning-based refinement

I Introduction

Visual saliency has long been a fundamental problem in neuroscience, psychology, and computer vision [1, 2]. It refers to the identification of a portion of essential visual information for further processing. Recently, it has been extended from originally predicting eye fixations to identifying a region containing salient objects, known as salient object detection or saliency detection. Tremendous efforts have been devoted to saliency detection over the past decades owing to its extensive real-world use in multimedia applications, including image manipulation [3, 4], image/video quality assessment [5, 6] and virtual reality (VR) [7].

Existing approaches for saliency detection can be divided into two categories [1]: the top-down (or task-driven) approaches utilize high-level human perceptual knowledge (e.g., object labels, semantic information or background priors) to guide the estimation of saliency maps, whereas the bottom-up (or stimulus-driven) approaches are usually based on low-level visual information such as color, texture and localization. Compared with the top-down approaches, the bottom-up approaches require less computational power and exhibit better generality and scalability, although their detected salient regions are liable to be confused with background [1, 2].

A recent trend is to combine bottom-up cues with top-down priors to facilitate saliency detection using low-rank matrix recovery (LRMR) theory [8]. Generally speaking, these methods (e.g., [9, 10, 11]) assume that a natural scene image consists of visually consistent background regions (corresponding to a highly redundant part with low-rank structure) and distinctive foreground regions (corresponding to a visually salient part with sparse structure). For example, Yan et al. [9] applied LRMR to the response matrix of image patches obtained by sparse coding. Lang et al. [10] jointly decomposed multiple-feature matrices and then produced the saliency map by inference. Despite the promising results achieved by various LRMR-based methods, two challenges remain when it comes to real applications [12]:

  • Inter-correlation between elements in the sparse component is ignored. Specifically, the produced sparse component covers the regions of salient objects, but these regions may be scattered or incomplete due to a lack of guidance during the decomposition.

  • Existing methods either attempt to learn a dictionary or transformation that depends on a large amount of training data, which is typically hard to obtain or even unattainable, or introduce a fusion scheme that suffers from high computational cost.

In this paper, we propose a novel coarse-to-fine framework for salient object detection based on LRMR to circumvent these two limitations. Since LRMR is totally unsupervised, it is suited to roughly distinguishing salient regions from the background. Based on this coarse saliency, we then take into account the spatial relationship among the elements within the sparse component through learning-based refinement. Specifically, our framework features two successive modules: a coarse-processing module and a fine-tuning module. In the coarse-processing module, a low-rank matrix recovery model with a Laplacian constraint is proposed to roughly extract the target from its surrounding background. Through the decomposition, the detected background contains most of the regions that are irrelevant to the desired target, while the foreground may include some cluttered regions, i.e., both the target boundary and its local surrounding background, which severely decreases the detection precision. Therefore, a fine-tuning process is utilized to refine the coarse saliency. We select confident examples (i.e., positive and negative super-pixels) from the coarse saliency map to learn a mapping that considers spatial relationship, and the fine-tuned saliency value for the remaining tough super-pixels is determined by the classifier.

To summarize, our main contributions are threefold:

  • An effective saliency detection model, integrating $\ell_1$-norm sparsity constrained LRMR and Laplacian regularization, is proposed to roughly detect salient objects. We set this as our baseline model and demonstrate that it works well, especially in the scenario of multiple objects.

  • A learning-based refinement module is developed by considering the spatial adjacency relationship among image regions on the coarse saliency map obtained from our baseline model. It can assign more accurate saliency values to those obscure boundary regions in the coarse saliency map, promoting final wholeness of detected salient objects.

  • Extensive experiments on three benchmark datasets are conducted to demonstrate the superior performance and robustness of our method against other state-of-the-art approaches.

The remainder of this paper is organized as follows. Section II briefly reviews related work. In Section III, we present our coarse-to-fine framework for salient object detection in detail. Section IV shows the experimental results and analysis. Finally, Section V draws the conclusion.

II Related Work

An extensive review on saliency detection is beyond the scope of this paper. We refer interested readers to two recently published surveys [1, 2] for more details about existing bottom-up and top-down approaches for saliency detection. This section first briefly reviews the prevailing unsupervised bottom-up saliency detection methods, and then introduces several popular low-rank matrix recovery based methods that are closely related to our work.

II-A Popular Saliency Detection Methods

As a pioneering work, Itti et al. [13] suggested using “Center and Surround” filters to extract image features and to simulate the human vision system on multi-scale levels to generate saliency maps. Motivated by Itti’s framework, various contrast-based approaches have been developed in the past decades, which include local contrast based ones (e.g., [4, 14]), global contrast based ones (e.g., [15, 16, 17]), and those combining both local and global contrasts (e.g., [18, 19, 20]). Local contrast is sensitive to object boundaries and noise as it focuses on the difference among regions within an image patch, whereas global contrast has difficulty distinguishing similar colors or texture patterns.

On the other hand, frequency domain also provides a reliable avenue for salient object detection. For example, Hou et al. [21] analyzed spectral residual of an image in spectral domain, where the high-frequency components are considered as background. A similar work is presented by Fang et al. [22], where the standard Fast Fourier Transform (FFT) is substituted with Quaternion Fourier Transform (QFT). Other representative examples include [23, 24]. Generally, these methods are effective when salient objects are small, but they tend to detect only the boundary when objects are larger.

Graph-based models (e.g., [25, 26, 27]) are proposed to increase robustness and adaptability by constructing a graph with local or global nodes (e.g., image patches or regions). For example, Yang et al. [25] proposed a graph-based manifold ranking model to detect salient objects with foreground or background seeds. Based on this model, Wang et al. [26] added connectivity with and within boundary nodes in order to capture global saliency cues. This way, salient information among different nodes can be jointly exploited. However, a fully connected graph suffers from high computational cost.

Fig. 1: The general coarse-to-fine framework of our proposed LRMR based saliency detection method. Given an input image, we first conduct over-segmentation and feature extraction (see module (A)), i.e., over-segment the original image and extract basic features, and then extract coarse saliency map (see module (B)) via low-rank matrix decomposition to the feature matrix, and finally refine the saliency map (see module (C)) by learning a mapping function that maps features to saliency value.

II-B LRMR-based Saliency Detection Methods

The usage of low-rank matrix recovery (LRMR) theory on saliency detection was initiated by Yan et al. [9] and then extended in [28]. Specifically, the LRMR-based saliency detection approaches assume that an image can be decomposed into a redundancy part and a saliency part, which can be characterized by a low-rank component and a sparse component, respectively. Given a data matrix $\mathbf{F} = [\mathbf{f}_1, \ldots, \mathbf{f}_N] \in \mathbb{R}^{D \times N}$, where $\mathbf{f}_i$ represents one sample, the optimization problem can be formulated as follows

$$\min_{\mathbf{L},\mathbf{S}} \; \|\mathbf{L}\|_* + \lambda \|\mathbf{S}\|_1 \quad \text{s.t.} \quad \mathbf{F} = \mathbf{L} + \mathbf{S}, \tag{1}$$

where $\|\cdot\|_*$ and $\|\cdot\|_1$ denote the nuclear norm and the $\ell_1$-norm respectively, and $\lambda$ is a parameter balancing the rank term and the sparse term. Given the decomposition, the saliency map can be generated from the obtained sparse matrix $\mathbf{S}$.

Unfortunately, the early methods are typically data-dependent, i.e., the learned dictionaries or transformations depend heavily on the selected training images or image patches, and thus suffer from limited adaptability and generalization capability. To this end, various approaches have been developed in an unsupervised manner by either adopting a multitask scheme (e.g., [10]) or introducing extra priors (e.g., [11, 29]). For example, Lang et al. [10] jointly decomposed multiple-feature matrices instead of directly combining individual saliency maps. Zou et al. [11] introduced segmentation priors to cooperate with sparse saliency in an advanced manner. To preserve the wholeness of detected objects, saliency fusion models (e.g., [30, 31, 32, 33]) were proposed thereafter. For example, double low-rank matrix recovery (DLRMR) was suggested in [30] to fuse saliency maps detected by different approaches.

Although the above extensions improved the detection robustness to cluttered backgrounds, there still remain two open problems. First, extra priors [11] or sophisticated operations (such as saliency fusion [30, 33]) may introduce expensive computational cost. Second, all these methods ignore the inter-correlation among elements within the sparse component, and may therefore fail to detect the whole salient objects. The first work that pinpointed these two limitations is the recently proposed structured matrix decomposition (SMD) by Peng et al. [12]. Specifically, SMD introduced a tree-structured sparse constraint in order to efficiently emphasize the inter-correlation in a unified model:


where the matrix represents high-level priors [28], and denotes dot-product of matrices. The second term on the right side of equality denotes the structured-sparse constraint, is the -norm (), is the depth of index tree and is the total number of nodes at the th level. Here each node represents a graph that contains several adjacent super-pixels, and ( denotes set cardinality) is the sub-matrix of corresponding to node . By contrast, the third term is introduced to promote the performance under cluttered background, where is a parameter that trades off this regularization and the other two terms. is un-normalized graph Laplacian matrix.

Our work is directly motivated by SMD. However, two observations prompt us to propose our method:

  • Sparsity regularization on fine graphs could destroy spatial relationship among object parts, which is inevitable in SMD as the index tree is constructed in a fine-to-coarse way.

  • Laplacian constraint works more as a smooth term than an antagonism to cluttered background. It aims at improving the consistency of salient regions.

As discussed in SMD [12], a tree-structured sparsity-inducing norm is introduced to model the spatial contiguity and feature similarity among image patches, thus generating more precise and structurally consistent results. However, we argue that the structured-sparse constraint does not fit this case well. Specifically, an intuitive idea is that the regions within the same object should be popped out as a whole without disrupting the completeness. The spatial relationship is indeed taken into consideration during the construction of an index tree, i.e., merging with different thresholds to form graphs. However, such a relationship is not preserved if we naively impose the sparsity constraint on the graphs. Looking deeper, the -norm and -norm lead to row-sparsity and column-sparsity of a graph respectively, without any consideration of the spatial information. In fact, structured sparsity can be used for structure preservation [34, 35]. For example, continuous bits of a (binarized) genetic sequence can appear simultaneously in tumor diagnosis [36]. On the other hand, some features, such as eyes and nose, can be jointly considered as localized features for occluded face recognition [35]. In these cases, structured sparsity corresponds either to actual spatial positions or to dictionary elements representing specific spatial parts. In salient object detection, however, priors such as continuously distributed bits or determined object parts are unavailable or unattainable. As a result, utilizing the structured-sparse constraint can destroy object structure, especially in the case of multiple objects.

III Our Method

The goal of this work is to overcome the two challenges in existing low-rank matrix recovery (LRMR) based approaches by introducing a coarse-to-fine architecture. Based on the discussion above, we integrate the basic LRMR model in (1) and Laplacian regularization to generate a coarse saliency map. Then, we learn a mapping with features and spatial relationships amongst pairwise super-pixels in the coarse saliency map to obtain the final saliency. It is worth noting that we consider the spatial relationship among super-pixels in the refinement module, which increases robustness to cluttered background. The overall flowchart is illustrated in Fig. 1.

Fig. 2: Comparison of the effects of a four-layer index-tree structured constraint in SMD [12] and our coarse-to-fine architecture. In both examples, (a) shows the raw image; (b) shows the over-segmented super-pixels; (c) shows the coarse saliency map obtained by our baseline model (i.e., Eq. (3)); (d) shows the merged graph in the 2nd layer of the index tree; (e) shows the saliency map obtained by incorporating the tree constraint in both the 1st layer and the 2nd layer; (f) shows the merged graph in the 4th layer of the index tree; (g) shows the saliency map obtained by incorporating tree constraints from the 1st layer to the 4th layer; (h) shows the coarse graph constructed with salient super-pixels from the rough saliency map in (c); (i) shows the refined salient graph; (j) shows the refined saliency map given by the refined salient graph in (i); (k) shows the ground truth. From the two examples, we can see that the tree-structured constraint in shallow layers indeed improves the preservation of spatial relationship and the wholeness of the detected object. However, it works adversely in deeper layers in the scenario of multiple objects (or complex backgrounds, as shown in the supplementary material). By contrast, our coarse-to-fine model enhances object wholeness in a designated way, with much clearer boundaries and edges preserved, for both a single object in pure background and multiple objects. More examples are available in our supplementary material.

III-A Coarse Saliency from Low-Rank Matrix Recovery

Through the discussion in Sect. II-B, we can see that tree-structured regularization is not appropriate for salient object detection, especially in the circumstance of multiple objects. Thus we revert to the original $\ell_1$-norm sparsity constraint, yielding sparsity by treating each element individually. Besides, compared with the basic LRMR model in (1), the model in (2) introduces Laplacian regularization to address the issue of cluttered background. Admittedly, this regularization promotes the performance of LRMR under cluttered background. However, it acts more as a smooth term, which has been widely used in previous work (e.g., [1, 2, 27, 37]). Instead, we deal with cluttered background by considering the spatial relationship within sparse elements in a refinement module, while keeping the Laplacian regularization as a smooth term in our coarse module. Therefore, we obtain the coarse saliency using the basic LRMR model in (1) and the Laplacian constraint as a smooth term, which is formulated as follows

$$\min_{\mathbf{L},\mathbf{S}} \; \|\mathbf{L}\|_* + \alpha \|\mathbf{S}\|_1 + \beta \, \mathrm{Tr}\left(\mathbf{S}\mathbf{M_F}\mathbf{S}^{\top}\right) \quad \text{s.t.} \quad \mathbf{F} = \mathbf{L} + \mathbf{S}, \tag{3}$$

where the matrices $\mathbf{F}, \mathbf{L}, \mathbf{S} \in \mathbb{R}^{D \times N}$, $\mathbf{M_F} \in \mathbb{R}^{N \times N}$ is the un-normalized graph Laplacian matrix, and $\alpha$ and $\beta$ are regularization parameters. Once the low-rank matrix $\mathbf{L}$ and the sparse matrix $\mathbf{S}$ are determined, the saliency value of the $i$th super-pixel can be calculated as

$$Sal(i) = \|\mathbf{s}_i\|_1, \tag{4}$$

where $\mathbf{s}_i$ denotes the $i$th column of matrix $\mathbf{S}$. Note that $\mathbf{s}_i$ is a vector, thus its $\ell_1$-norm is the sum of the absolute values of its entries.
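As a small illustration of the two quantities just described, the sketch below evaluates an objective made of a nuclear-norm term, an $\ell_1$ term and a Laplacian smoothness term, and computes column-wise $\ell_1$ saliency from a sparse matrix. It is a minimal NumPy sketch for exposition only, not the authors' implementation: the function names, the toy chain graph and the final rescaling of saliency to [0, 1] are assumptions.

```python
import numpy as np


def coarse_objective(L, S, M, alpha, beta):
    """Nuclear norm of L + alpha * l1-norm of S + beta * Tr(S M S^T)."""
    nuclear = np.linalg.svd(L, compute_uv=False).sum()
    sparsity = np.abs(S).sum()
    smoothness = np.trace(S @ M @ S.T)
    return nuclear + alpha * sparsity + beta * smoothness


def column_saliency(S):
    """l1-norm of each column of S, rescaled to [0, 1] (rescaling is an assumption)."""
    sal = np.abs(S).sum(axis=0)
    peak = sal.max()
    return sal / peak if peak > 0 else sal


# Toy data: 2 features, 3 super-pixels connected as a chain 0-1-2.
S_toy = np.array([[0.0, 2.0, 0.0],
                  [0.0, -1.0, 0.5]])
W_toy = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
M_toy = np.diag(W_toy.sum(axis=1)) - W_toy  # un-normalized Laplacian D - W

sal_toy = column_saliency(S_toy)
obj_toy = coarse_objective(np.zeros_like(S_toy), S_toy, M_toy, alpha=1.0, beta=0.5)
```

In the toy example the middle super-pixel carries the largest column norm and therefore receives the highest saliency, while the smoothness term penalizes the differences between neighbouring columns of the sparse matrix.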

To demonstrate the effect of the structured-sparse regularization in (2) and the $\ell_1$-norm sparsity constraint in (3), we provide two examples in Fig. 2, and more can be found in our supplementary material. Specifically, we set a four-layer index tree for experimental verification. It should first be made clear that during the fine-to-coarse construction of the index tree, the bottom layer (Depth4) is composed of graphs that each contain a single super-pixel, while the top layer (Depth1) is composed of one graph containing all the super-pixels. The norm constraint is applied to each graph separately and then the results are summed.

The first image contains a single object in pure background. Comparing Fig. 2(c-1) with Fig. 2(e-1) and Fig. 2(g-1) respectively, we can observe that adding the constraint to Depth2 eliminates irrelevant background, while a deeper constraint is unnecessary for preserving the spatial structure of the object. Considering the construction of the index tree, the graph in the top layer of the fine-to-coarse structure corresponds to the coarse module in our coarse-to-fine architecture. This indicates that our coarse module is capable of maintaining a rough spatial structure of patches in a single object. Furthermore, our refinement module utilizes the spatial relationship among super-pixels to refine the salient graph in (i-1). This way, the several super-pixels below the flower are found to be surrounded by background regions and to relate only weakly to the main body of the flower, so they are classified as background, as shown in (j-1).

The second image contains multiple objects. Comparing Fig. 2(c-2) with Fig. 2(e-2) and Fig. 2(g-2), we can observe that adding the constraint to Depth2 promotes the structural wholeness of objects to some extent, while a deeper constraint destroys the spatial structure. This is because the tree-structured regularization term in deep layers encourages sparsity of single super-pixels or small groups of super-pixels, thus ignoring the wholeness of multiple objects. On the contrary, in our coarse-to-fine architecture, we consider multiple objects as a whole without disrupting their inner organization. Our coarse module first generates the rough saliency of those objects, and then the refinement module produces more accurate saliency for super-pixels around object boundaries by learning from patterns of the foreground and background, respectively. For example, some super-pixels in leg areas adjacent to the image boundary are originally considered as background, but are assigned higher saliency values by the refinement module, which improves the wholeness of objects.

Having illustrated the difference between the structured-sparse regularization in (2) and the $\ell_1$-norm sparsity constraint in (3), we present the optimization procedure of (3).

Optimization: The optimization problem in (3) can be efficiently solved via the alternating direction method of multipliers (ADMM) [38]. For simplification, we denote the projected feature matrix as . An auxiliary variable is introduced and problem (3) becomes


Lagrange multipliers and are introduced to remove the equality constraints, and the augmented Lagrangian function is constructed as


where is the penalty parameter.

The augmented Lagrangian function is minimized iteratively, and the stopping criteria at each step are given by (7) and (8)


The variables and can be alternately updated by minimizing the augmented Lagrangian function with other variables fixed. In this model, each variable can be updated with a closed form solution. With respect to and , they can be updated as follows


where the soft-thresholding operator is defined by

and , where SVD is the singular value decomposition.

Regarding and , we can update them as follows


where the parameter controls the convergence speed.
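To make the alternating scheme concrete, the following NumPy sketch implements one standard ADMM instantiation for a problem of the form: minimize the nuclear norm of L plus alpha times the $\ell_1$-norm of S plus beta times Tr(H M H^T), subject to F = L + S and S = H. The splitting, the names `H`, `Y1`, `Y2`, `mu`, `rho`, `mu_max`, the default values and the stopping rule (maximum absolute constraint residual) are assumptions made for this sketch and may differ from the authors' formulation.

```python
import numpy as np


def soft_threshold(X, tau):
    """Element-wise shrinkage: sign(x) * max(|x| - tau, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svt(X, tau):
    """Singular value thresholding: shrink the singular values of X by tau."""
    U, sig, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(sig, tau)) @ Vt


def admm_coarse(F, M, alpha, beta, mu=0.1, rho=1.1, mu_max=1e10,
                tol=1e-6, max_iter=500):
    """Assumed split: F = L + S and S = H, with multipliers Y1 and Y2."""
    F = np.asarray(F, dtype=float)
    N = F.shape[1]
    L, S, H = np.zeros_like(F), np.zeros_like(F), np.zeros_like(F)
    Y1, Y2 = np.zeros_like(F), np.zeros_like(F)
    I = np.eye(N)
    for _ in range(max_iter):
        # L-step: proximal operator of the nuclear norm.
        L = svt(F - S + Y1 / mu, 1.0 / mu)
        # S-step: the two quadratic terms average into a single target.
        target = 0.5 * ((F - L + Y1 / mu) + (H - Y2 / mu))
        S = soft_threshold(target, alpha / (2.0 * mu))
        # H-step: solve H (2*beta*M + mu*I) = mu*S + Y2 for H.
        A = 2.0 * beta * M + mu * I
        H = np.linalg.solve(A.T, (mu * S + Y2).T).T
        # Dual ascent on both constraints, then increase the penalty.
        r1, r2 = F - L - S, S - H
        Y1 = Y1 + mu * r1
        Y2 = Y2 + mu * r2
        mu = min(rho * mu, mu_max)
        if max(np.abs(r1).max(), np.abs(r2).max()) < tol:
            break
    return L, S


# Toy run: rank-1 background plus one deviating column, chain-graph Laplacian.
rng = np.random.default_rng(0)
F_demo = np.outer(rng.random(6), np.ones(10))
F_demo[:, 4] += 3.0
W_demo = np.diag(np.ones(9), 1) + np.diag(np.ones(9), -1)
M_demo = np.diag(W_demo.sum(axis=1)) - W_demo
L_hat, S_hat = admm_coarse(F_demo, M_demo, alpha=0.3, beta=0.01)
```

Each sub-problem has a closed-form solution: singular value thresholding for the low-rank block, element-wise soft-thresholding for the sparse block, and a linear solve for the auxiliary block. The growth factor `rho` trades convergence speed against how accurately the sub-problems track the original objective.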

III-B Learning-based Saliency Refinement

As we have discussed in Sect. II-B, the coarse saliency map generated using objective (3) ignores spatial relationship among adjacent super-pixels. To further improve the detection results, we attempt to refine the coarse saliency via mapping learning, which learns a projection that connects feature values and specific saliency values to refine the coarse saliency map.

Specifically, given the coarse saliency calculated by (4), we can roughly distinguish foreground from background. In order to obtain common interior feature of foreground and background respectively, we choose confident super-pixels based on their coarse saliency value. We set two thresholds to select confident super-pixel samples for background and for foreground respectively, i.e., super-pixels with saliency value lower than are considered as negative samples, and super-pixels with saliency value higher than are considered as positive ones. We denote as the sample matrix composed of both positive and negative samples, and as corresponding label matrix, where is the total number of confident samples. For the th positive sample, its label vector is , while for the th negative sample, its label vector is . See Fig. 3 for more intuitive examples.
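The selection of confident samples by two thresholds can be sketched as follows. This is an illustrative NumPy fragment: the names `t_low` and `t_high`, the threshold values in the example, and the two-dimensional label encoding (first entry for foreground, second for background) are assumptions made for the sketch.

```python
import numpy as np


def split_samples(sal, t_low, t_high):
    """Indices of positive, negative and tough super-pixels from coarse saliency."""
    sal = np.asarray(sal, dtype=float)
    negative = np.flatnonzero(sal < t_low)    # confident background
    positive = np.flatnonzero(sal > t_high)   # confident foreground
    tough = np.flatnonzero((sal >= t_low) & (sal <= t_high))
    return positive, negative, tough


def confident_labels(n_pos, n_neg):
    """Stack assumed label vectors: [1, 0] for positives, [0, 1] for negatives."""
    y_pos = np.tile([1.0, 0.0], (n_pos, 1))
    y_neg = np.tile([0.0, 1.0], (n_neg, 1))
    return np.vstack([y_pos, y_neg])


coarse = [0.05, 0.90, 0.40, 0.10, 0.75]
pos_idx, neg_idx, tough_idx = split_samples(coarse, t_low=0.2, t_high=0.7)
Y_conf = confident_labels(len(pos_idx), len(neg_idx))
```

Super-pixels whose coarse saliency falls between the two thresholds are left out of the confident set and are handled as tough samples.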

Fig. 3: Illustration for the process of learning-based saliency refinement. (a) Over-segmented RGB images (). (b) Coarse saliency maps and corresponding graph structure of salient super-pixels. (c) Positive samples (in orange), negative samples (in purple) and tough samples (in black) generated from coarse saliency map. The line-connections demonstrate spatial relationship around those tough samples. (d) Refined saliency of those tough samples and their spatial relationship.

In order to determine the saliency of those tough samples, we utilize their spatial relationships with these confident samples, as shown in Fig. 3. Based on the coarse saliency and adjacent relationship, we generate rough saliency for the th tough sample as follows


where is the number of super-pixels adjacent to the th tough sample, and denotes the number of pixels contained in the th super-pixel. Similarly, we formulate label vector of the th tough sample as , and the label matrix , where is the number of tough samples.
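One plausible reading of this step is a size-weighted average: the rough saliency of a tough super-pixel is the mean coarse saliency of its adjacent super-pixels, weighted by their pixel counts. The sketch below implements that reading only; the exact weighting used in the paper may differ, and the names `neighbors`, `n_pixels` and the `[s, 1 - s]` label layout are assumptions.

```python
def rough_saliency(tough_index, neighbors, sal, n_pixels):
    """Pixel-count-weighted mean of the coarse saliency of adjacent super-pixels."""
    adjacent = neighbors[tough_index]
    total = sum(n_pixels[j] for j in adjacent)
    return sum(n_pixels[j] * sal[j] for j in adjacent) / total


def tough_label(s):
    """Assumed soft label vector: [saliency, 1 - saliency]."""
    return [s, 1.0 - s]


# Super-pixel 2 is tough and touches super-pixels 0, 1 and 3.
neighbors_demo = {2: [0, 1, 3]}
sal_demo = [0.0, 1.0, 0.5, 0.5]
n_pixels_demo = [100, 300, 50, 100]
s_tough = rough_saliency(2, neighbors_demo, sal_demo, n_pixels_demo)
label_tough = tough_label(s_tough)
```

With this weighting, a large salient neighbour pulls the tough sample towards the foreground more strongly than a small one.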

Combining the coarse saliency for confident samples and tough samples, we build our saliency refining model as follows


where is the mapping to be learned, and are regularization parameters. Once the mapping is learned, saliency of those tough super-pixels are given by the first column of matrix .

Despite the simplicity of (16), one should note that the background regions are typically much larger than foreground objects. This leads to the issue of learning from imbalanced data. In order to overcome this limitation, we introduce a weighting strategy to balance the contributions of positive and negative samples in mapping learning, which is formulated as follows


where is the weight for the th confident sample. Actually, we can simplify (17) by combining the second term and the third term with generalized weights as follows


where is the weight for the th sample, either positive one, negative one or tough one, which is defined as follows

where are numbers of negative and positive samples, respectively, and we have . Optimization problem in (18) can be efficiently solved by


where is a diagonal matrix with , and is an identity matrix.
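A weighted least-squares mapping with a ridge penalty admits the closed form W = (X^T D X + lambda I)^{-1} X^T D Y, where D is diagonal with the per-sample weights. The NumPy sketch below shows this generic solution; the row-wise sample layout, the names `fit_mapping`, `d`, `lam`, and the example weights are assumptions, and the paper's specific weight definition is not reproduced here.

```python
import numpy as np


def fit_mapping(X, Y, d, lam):
    """Solve min_W sum_i d_i * ||x_i W - y_i||^2 + lam * ||W||_F^2 in closed form."""
    X = np.asarray(X, dtype=float)   # one sample per row
    Y = np.asarray(Y, dtype=float)   # one label vector per row
    d = np.asarray(d, dtype=float)   # one non-negative weight per sample
    XtD = X.T * d                    # scales column i of X^T by d_i
    A = XtD @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, XtD @ Y)


# Two confident samples (one positive, one negative) with unequal weights.
X_demo = np.eye(2)
Y_demo = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
W_demo = fit_mapping(X_demo, Y_demo, d=[1.0, 3.0], lam=1.0)

# Refined saliency of a tough sample is read from the first output column.
x_tough = np.array([[1.0, 1.0]])
refined = (x_tough @ W_demo)[:, 0]
```

Raising the weight of an under-represented class increases its influence on the learned mapping, which is the purpose of the balancing strategy described above.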

III-C Complexity analysis

Here we briefly discuss the computational complexity of optimization in Sect. III-A and Sect. III-B respectively, and we have , .

We set the th iteration for coarse saliency generation as an example. The time consumption mainly involves three kinds of operations, i.e., SVD, matrix inversion and matrix multiplication. Specifically, update for and is addressed by SVD, with the complexity of and , respectively. While major operations in updating include matrix inversion and matrix multiplication, with complexity of . Considering , the final computational complexity is . Compared with this, the optimization for the tree-structured sparsity in [12] requires no extra computational complexity. However, multi-scale segmentation in constructing the index tree introduces computational cost thus slows down the speed, as listed in Table III.

For saliency refinement, the solution in (19) involves matrix inversion and matrix multiplication, with the complexity of and , respectively. Considering , the final computational complexity is .

IV Experiments

To evaluate the performance of our model, we compare it with twelve other state-of-the-art methods covering the categories mentioned in Sect. II. Among them, three methods are low-rank decomposition based, i.e., SMD [12], SLR [11] and ULR [28], which share a similar motivation with our baseline model in (3). Moreover, we select five state-of-the-art methods depending on contrast or priors, including RBD [27], PCA [17], HS [39], HCT [40] and DSR [41]. The four remaining approaches include one graph-based method (MR [25]), two involving the frequency domain (SS [42], FT [43]), and one involving supervised training (DRFI [44]). We conduct experiments on three benchmark datasets, i.e., MSRA10K [15], ECSSD [16] and iCoSeg [45]. MSRA10K contains 10,000 images with a single object, iCoSeg contains 643 images with multiple objects and ECSSD contains 1,000 images with complicated background. Sample images from the three datasets can be found in Fig. 9.

Dataset                  MSRA10K                       iCoSeg                        ECSSD
Metric                   WF     OR     AUC    MAE      WF     OR     AUC    MAE      WF     OR     AUC    MAE
ULR [28]                 0.425  0.524  0.831  0.224    0.379  0.443  0.814  0.222    0.351  0.369  0.788  0.274
SLR [11]                 0.601  0.691  0.840  0.141    0.473  0.505  0.805  0.179    0.402  0.486  0.805  0.226
SMD [12]                 0.704  0.741  0.847  0.104    0.611  0.598  0.822  0.138    0.544  0.563  0.813  0.174
Ours (C)                 0.688  0.734  0.844  0.108    0.614  0.599  0.823  0.137    0.535  0.557  0.810  0.175
ULR (with refinement)    0.532  0.597  0.846  0.195    0.439  0.459  0.814  0.219    0.421  0.418  0.801  0.262
SLR (with refinement)    0.681  0.726  0.847  0.122    0.602  0.587  0.816  0.161    0.519  0.542  0.814  0.199
SMD (with refinement)    0.706  0.753  0.854  0.103    0.630  0.618  0.838  0.132    0.546  0.571  0.820  0.175
Ours (with refinement)   0.705  0.751  0.854  0.104    0.634  0.624  0.838  0.131    0.545  0.571  0.820  0.176
TABLE I: Comparison with the other low-rank methods and performance boost with different baselines on three datasets. The best two results are marked with red and blue respectively. Rows labeled "(with refinement)" denote the method with refinement.
Fig. 4: Visual comparison of our method (the coarse and the fine) with the other low-rank involved approaches. Columns from left to right: RGB image, ULR [28], SLR [11], SMD [12], Ours (C), then ULR, SLR, SMD and Ours with refinement, and the ground truth (GT). The three images are randomly selected from the MSRA10K, iCoSeg and ECSSD datasets, respectively.

IV-A Experimental setup

We follow Peng et al.'s methodology [12] to compare the performance of different models. The metrics include the precision-recall (PR) curve, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the weighted F-measure (WF) score, the overlapping ratio (OR) and the mean absolute error (MAE). This is a slightly improved variant of the methodology suggested by the benchmark [2], which includes the PR curve, ROC curve, MAE and F-measure. Among them, precision reflects the percentage of salient pixels correctly assigned, and recall corresponds to the fraction of detected salient pixels belonging to the salient object in the ground truth. The PR curve is obtained by setting a series of discrete thresholds ranging from $0$ to $255$ for a grayscale saliency map in the range $[0, 255]$. The ROC curve is measured in a similar way, and considers the hit rate (recall) and the false-alarm rate. Supposing saliency values are normalized to the range $[0, 1]$, the generated saliency map can be binarized with a given threshold, i.e., salient or non-salient. With four basic quantities, true positive ($TP$), true negative ($TN$), false positive ($FP$) and false negative ($FN$), the precision ($P$), recall/hit rate ($R$) and false-alarm rate ($FPR$) are calculated as $P = \frac{TP}{TP+FP}$, $R = \frac{TP}{TP+FN}$ and $FPR = \frac{FP}{FP+TN}$. Generally, precision decreases with the increase of recall, while the F-measure ($F_\beta$) is a trade-off between them, $F_\beta = \frac{(1+\beta^2) P \cdot R}{\beta^2 P + R}$, with $\beta^2 = 0.3$ in many works [27, 44].

Although these metrics are widely used, the authors of [46] pointed out that they suffer from an interpolation flaw, a dependency flaw and an equal-importance flaw, and proposed the weighted $F_\beta^w$-measure (WF) to alleviate these flaws, $F_\beta^w = \frac{(1+\beta^2) P^w \cdot R^w}{\beta^2 P^w + R^w}$, where the precision and recall are replaced with weighted ones. The overlapping ratio measures the overlap between the predicted (binarized) saliency map ($S$) and the ground-truth saliency map ($G$), $OR = \frac{|S \cap G|}{|S \cup G|}$. The mean absolute error provides a numerical difference between the continuous saliency map and the true saliency map.
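The threshold-based metrics can be computed directly from a saliency map and a binary ground truth. The following NumPy sketch covers precision, recall, false-alarm rate, the F-measure with beta squared set to 0.3, the overlapping ratio and the mean absolute error. The function names and the 2x2 toy maps are illustrative assumptions, and the weighted F-measure of [46] is not implemented here.

```python
import numpy as np


def binary_metrics(sal, gt, thresh, beta2=0.3):
    """Precision, recall, false-alarm rate, F-measure and overlap at one threshold."""
    pred = np.asarray(sal, dtype=float) >= thresh
    gt = np.asarray(gt).astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    denom = beta2 * precision + recall
    f_measure = (1 + beta2) * precision * recall / denom if denom else 0.0
    union = np.sum(pred | gt)
    overlap = tp / union if union else 0.0
    return precision, recall, fpr, f_measure, overlap


def mean_absolute_error(sal, gt):
    """Mean absolute difference between a [0, 1] saliency map and the ground truth."""
    return float(np.mean(np.abs(np.asarray(sal, float) - np.asarray(gt, float))))


sal_map = np.array([[0.9, 0.8],
                    [0.2, 0.6]])
gt_map = np.array([[1, 1],
                   [0, 0]])
p, r, fpr, f, ov = binary_metrics(sal_map, gt_map, thresh=0.5)
mae = mean_absolute_error(sal_map, gt_map)
```

Sweeping `thresh` over a grid of values and collecting (recall, precision) or (false-alarm rate, recall) pairs yields the PR and ROC curves, respectively.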

We adopt the SLIC algorithm [47] () for over-segmentation and extract commonly used features to compare fairly with the other approaches [11, 12, 28]. Initialization for variables and parameters in the coarse module are set as . Regularization parameters for coarse saliency generation are set as optimal ones, i.e., throughout the experiments except for parametric analysis. For the refinement module, we set and corresponding parametric sensitivity is provided in Sect. IV-D. As for homogenization, we consider location, contrast and background priors as done in [12]. All the experiments in this paper were conducted with MATLAB R2016b on an Intel i5-6500 3.2GHz Dual Core PC with 16GB RAM.

IV-B Experimental Analysis of the Proposed Model

IV-B1 Comparison of the coarse module with other LRMR-based methods

To evaluate the performance of our baseline model, i.e., the low-rank decomposition model with Laplacian constraint in (3), a thorough comparison with other low-rank based methods including ULR [28], SLR [11] and SMD [12] is provided in Table I and Fig. 4. From the qualitative comparison in Fig. 4, we can see that methods such as ULR and SLR fail to generate uniform detection results. On the contrary, salient objects detected by SMD [12] and our baseline model are much smoother. This indicates the effectiveness of the Laplacian constraint. From the quantitative comparison in Table I, we can see that our baseline model and SMD [12] outperform ULR [28] and SLR [11] by a large margin. It is worth noting that our baseline model is only slightly outperformed by SMD [12] on the MSRA10K and ECSSD datasets. On the iCoSeg dataset, our baseline model even achieves better results than SMD [12] in terms of all four metrics, indicating that the $\ell_1$-norm sparsity constraint can be as effective as the structured-sparse regularization. We attribute this to the essential limitation of addressing spatial relationship with a fixed-level index tree, which is especially evident in scenes with multiple objects. The result again supports our discussion in Sect. III-A that sparse regularization on deeper layers of the index tree can destroy object structure.

IV-B2 Analysis of the coarse-to-fine architecture

It can be observed in Fig. 4 that the salient objects detected by these low-rank approaches are incomplete and may even contain irrelevant background regions. This is because the basic low-rank model ignores the spatial relationship among object parts. Though SMD [12] attempts to handle this issue by replacing the original -norm sparsity constraint with a structured-sparse constraint, it has difficulty in achieving the goal, as discussed in Sect. III-A. Instead, we address the issue by cascading a mapping-learning step to produce finer saliency maps. We can see that our method generates more complete detection results than our baseline model, e.g., the persons in the second image and the dog in the third image. It also helps eliminate irrelevant background, e.g., the blue water in the first image. The quantitative comparison in Table I shows a clear performance boost for our model on all three benchmark datasets.

To further verify the general effectiveness of our coarse-to-fine architecture, we conduct additional experiments with different low-rank baseline models, i.e., ULR [28], SLR [11] and SMD [12]. The test results are also summarized in Table I. Compared with the original baseline models, an improvement is obtained after refining each of these baselines on all three datasets. The best performance is achieved by our method and by the SMD [12] model with refinement. Visual improvements similar to those discussed above can be observed in Fig. 4. They are especially obvious for the ULR [28] baseline, where clearer and more complete saliency maps are generated after refinement.

IV-C Comparison with State-of-the-Arts

To evaluate our coarse-to-fine model, we systematically compare it with twelve other state-of-the-art methods. PR curves on the three datasets are shown in Fig. 5, ROC curves are shown in Fig. 6, and the results for the four metrics mentioned above are listed in Table II. In addition, qualitative comparisons are provided in Fig. 9. The results show that, in most cases, our model ranks first or second on the three datasets under different criteria. Note that DRFI [44], which is a top-down method with supervised training, is reported as a reference.

(a) MSRA10K
Metric Ours SMD[12] DRFI[44] RBD[27] HCT[40] DSR[41] PCA[17] MR[25] SLR[11] SS[42] ULR[28] HS[39] FT[43]
WF 0.705 0.704 0.666 0.685 0.582 0.656 0.473 0.642 0.601 0.137 0.425 0.604 0.277
OR 0.751 0.741 0.723 0.716 0.674 0.654 0.576 0.693 0.691 0.148 0.524 0.656 0.379
AUC 0.854 0.847 0.857 0.834 0.847 0.825 0.839 0.601 0.840 0.801 0.831 0.833 0.690
MAE 0.104 0.104 0.114 0.108 0.143 0.121 0.185 0.125 0.141 0.255 0.224 0.149 0.231
(b) iCoSeg
Metric Ours SMD[12] DRFI[44] RBD[27] HCT[40] DSR[41] PCA[17] MR[25] SLR[11] SS[42] ULR[28] HS[39] FT[43]
WF 0.634 0.611 0.592 0.599 0.464 0.548 0.407 0.554 0.473 0.126 0.379 0.563 0.289
OR 0.624 0.598 0.582 0.588 0.519 0.514 0.427 0.573 0.505 0.164 0.443 0.537 0.387
AUC 0.838 0.822 0.839 0.827 0.833 0.801 0.798 0.795 0.805 0.630 0.814 0.812 0.717
MAE 0.131 0.138 0.139 0.138 0.179 0.153 0.201 0.162 0.179 0.253 0.222 0.176 0.223
(c) ECSSD
Metric Ours SMD[12] DRFI[44] RBD[27] HCT[40] DSR[41] PCA[17] MR[25] SLR[11] SS[42] ULR[28] HS[39] FT[43]
WF 0.545 0.544 0.547 0.513 0.446 0.514 0.364 0.496 0.402 0.128 0.351 0.454 0.195
OR 0.571 0.563 0.568 0.526 0.486 0.514 0.395 0.523 0.486 0.103 0.369 0.458 0.216
AUC 0.820 0.813 0.817 0.781 0.785 0.785 0.791 0.793 0.805 0.567 0.788 0.801 0.607
MAE 0.176 0.174 0.160 0.171 0.198 0.171 0.247 0.186 0.226 0.278 0.274 0.227 0.270
TABLE II: WF, OR, AUC, MAE of all methods on (a) MSRA10K, (b) iCoSeg and (c) ECSSD. The best three results are marked with red, green and blue respectively.
Fig. 5: PR curves of all methods on the (a) MSRA10K, (b) iCoSeg and (c) ECSSD datasets.
Fig. 6: ROC curves of all methods on the (a) MSRA10K, (b) iCoSeg and (c) ECSSD datasets.
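
As a point of reference for the MAE and OR rows of Table II, the following minimal Python sketch computes both quantities on a flattened toy map. It assumes the usual definitions (mean per-pixel absolute difference for MAE, intersection over union of the binarized map and the ground-truth mask for OR). The function names, the fixed binarization threshold and the toy values are illustrative assumptions and are not taken from the evaluation code used in this paper.

```python
from typing import Sequence


def mean_absolute_error(saliency: Sequence[float], truth: Sequence[int]) -> float:
    """Average per-pixel absolute difference between a saliency map and its mask."""
    if len(saliency) != len(truth) or not saliency:
        raise ValueError("maps must be non-empty and of equal length")
    return sum(abs(s - t) for s, t in zip(saliency, truth)) / len(saliency)


def overlapping_ratio(saliency: Sequence[float], truth: Sequence[int],
                      threshold: float) -> float:
    """Intersection over union between the binarized map and the mask.

    The threshold is an assumption of this sketch; the binarization rule
    used for Table II is not stated in this section.
    """
    predicted = [s >= threshold for s in saliency]
    intersection = sum(1 for p, t in zip(predicted, truth) if p and t)
    union = sum(1 for p, t in zip(predicted, truth) if p or t)
    return intersection / union if union else 1.0


# Toy 2x3 map, flattened row by row (illustrative values, not from the paper).
toy_saliency = [0.9, 0.8, 0.1, 0.7, 0.2, 0.0]
toy_truth = [1, 1, 0, 0, 0, 0]
toy_mae = mean_absolute_error(toy_saliency, toy_truth)
toy_or = overlapping_ratio(toy_saliency, toy_truth, threshold=0.5)
```

Lower MAE and higher OR are better, which matches the direction in which the entries of Table II are ranked.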

IV-C1 Results on single-object images

The MSRA10K dataset contains images with diverse objects of varying size, with only one object in each image. From Fig. 5 (a), Fig. 6 (a) and Table II (a), we can see that our method achieves the highest weighted F-measure and overlapping ratio and the lowest mean absolute error (tied with SMD [12]), while DRFI [44] obtains the highest AUC score. It is worth noting that our method outperforms DRFI [44] on these three metrics with just simple features and no supervision. Frequency-based methods like FT [43] perform poorly, as it is difficult to choose a proper scale to suppress the background without knowing the object size. Although SS [42] considers sparsity directly in the standard spatial space and the DCT space, it can only give a rough localization of the detected objects. In the PR curves, our method shows a clear advantage over the other approaches, while in the ROC curves, DRFI [44] and our method are the best two among the compared methods.
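
To make the PR curves concrete, the sketch below traces precision and recall for a saliency map binarized at several thresholds. It is a generic illustration of how such curves are commonly obtained; the function name, the threshold grid and the toy values are assumptions, and the exact protocol behind Fig. 5 may differ.

```python
from typing import List, Sequence, Tuple


def pr_curve(saliency: Sequence[float], truth: Sequence[int],
             thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """Return one (precision, recall) pair per binarization threshold."""
    positives = sum(truth)
    points = []
    for thr in thresholds:
        predicted = [s >= thr for s in saliency]
        true_pos = sum(1 for p, t in zip(predicted, truth) if p and t)
        detected = sum(predicted)
        # Convention of this sketch: precision is 1 when nothing is detected.
        precision = true_pos / detected if detected else 1.0
        recall = true_pos / positives if positives else 0.0
        points.append((precision, recall))
    return points


# Toy flattened map and mask (illustrative values, not from the paper).
toy_saliency = [0.9, 0.8, 0.1, 0.7, 0.2, 0.0]
toy_truth = [1, 1, 0, 0, 0, 0]
toy_points = pr_curve(toy_saliency, toy_truth, thresholds=[0.0, 0.5, 0.75])
```

On the toy map, raising the threshold removes false detections first, so precision rises while recall stays at 1. A method whose precision stays high as recall grows produces the flatter curves discussed above.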

IV-C2 Results on multiple-object images

The iCoSeg dataset contains images with multiple objects, either separate or adjacent. From Fig. 5 (b), Fig. 6 (b) and Table II (b), we can see that our method again achieves the highest weighted F-measure and overlapping ratio and the lowest mean absolute error, which shows that it is effective in multiple-object cases. In contrast, the performance of PCA [17], SLR [11], DSR [41] and ULR [28] drops sharply. As PCA [17] considers the dissimilarity between image patches and SLR [11] introduces a segmentation prior, they are more sensitive to the number of objects in a scene. DSR [41] relies on reconstruction error, so its precision drops quickly as recall increases. ULR [28] trains a feature transformation on the MSRA dataset, hence it performs poorly when detecting multiple objects. In the PR curves, our method is more stable as recall increases, while in the ROC curves, our method and DRFI [44] achieve the best performance with almost the same AUC score, outperforming the remaining approaches.

Methods Ours SMD[12] DRFI[44] RBD[27] HCT[40] DSR[41] PCA[17] MR[25] SLR[11] SS[42] ULR[28] HS[39] FT[43]
Time(s) 0.83 1.59 9.06 0.20 4.12 10.2 4.43 1.84 22.80 0.05 15.62 0.53 0.07
Code M+C M+C M+C M+C M M+C M+C M+C M+C M M+C EXE C
TABLE III: Average time consumption for each method to process an image in MSRA10K dataset.
Fig. 7: Parametric sensitivity analysis: (a) shows the variation of WF, OR, AUC, MAE w.r.t. by fixing . (b) shows the variation of WF, OR, AUC, MAE w.r.t. by fixing . (c) shows the variation of WF, OR, AUC, MAE w.r.t. by fixing .
Fig. 8: Parametric sensitivity analysis: (a) shows the variation of WF, OR, AUC, MAE w.r.t. . (b) shows the PR curve of different thresholding strategies. (c) shows the ROC curve of different thresholding strategies.

IV-C3 Results on complex scene images

The ECSSD dataset contains images with complicated backgrounds and objects of varying size. From Fig. 5 (c), Fig. 6 (c) and Table II (c), we can see that our method achieves the highest overlapping ratio and AUC score, and is outperformed by DRFI [44] in terms of weighted F-measure and mean absolute error. In the PR curves, our method performs similarly to SMD [12], while in the ROC curves, DRFI [44] and our method are the best two among the state-of-the-art methods. The results demonstrate that our method is competitive in complex scenes. Approaches such as HS [39], HCT [40], MR [25] and RBD [27], which depend on cues like contrast bias and center bias, fail to maintain good performance.
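
The AUC scores compared above summarize each ROC curve in a single number. One standard way to obtain the area without tracing the curve is the rank-statistic form: the probability that a randomly chosen foreground pixel receives a higher saliency value than a randomly chosen background pixel, with ties counted as one half. The sketch below implements that form in Python; whether the reported scores were computed this way is not stated in this section, so the function and the toy values should be read as assumptions.

```python
from typing import Sequence


def roc_auc(saliency: Sequence[float], truth: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise foreground/background comparison."""
    foreground = [s for s, t in zip(saliency, truth) if t]
    background = [s for s, t in zip(saliency, truth) if not t]
    if not foreground or not background:
        raise ValueError("both foreground and background pixels are required")
    wins = 0.0
    for f in foreground:
        for b in background:
            if f > b:
                wins += 1.0
            elif f == b:
                wins += 0.5  # ties contribute half a win
    return wins / (len(foreground) * len(background))


# Toy flattened map with one background pixel ranked above a foreground pixel.
toy_saliency = [0.9, 0.4, 0.1, 0.7, 0.2, 0.0]
toy_truth = [1, 1, 0, 0, 0, 0]
toy_auc = roc_auc(toy_saliency, toy_truth)
```

The pairwise loop is quadratic in the number of pixels, which is acceptable for a small illustration. A sort-based variant would be preferable for full-resolution maps.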

IV-C4 Visual comparison

Finally, to give an intuitive impression of the performance, Fig. 9 provides a visual comparison of detection results on images selected from the three benchmark datasets, which are diverse in object size, background complexity and number of objects. We can see that our method works well in most cases and is capable of providing a relatively complete detection. As analyzed above, the frequency-tuned method FT [43] tends either to filter out part of the object or to preserve part of the background. Basic low-rank matrix recovery methods like SLR [11] and ULR [28] are not robust enough to background clutter and fail to provide a uniform saliency map. Approaches depending on prior cues, such as HS [39], HCT [40], MR [25] and RBD [27], are more likely to miss object parts that are adjacent to the image boundary. The time consumption of all methods is provided in Table III, which demonstrates the efficiency of our method.

IV-D Analysis of Parameters

Fig. 9: Visual comparison of saliency maps generated by different methods. We select six images from the MSRA10K dataset, four from the iCoSeg dataset and four from the ECSSD dataset, which are arranged sequentially.

IV-D1 Parameters in coarse module

In our coarse module, the algorithm takes three parameters, i.e., the number of super-pixels in over-segmentation and the regularization parameters . We examine the sensitivity of our model to changes of on the iCoSeg dataset as an example. The analysis is conducted by tuning one parameter while fixing the other two. The performance changes in terms of WF, OR, AUC, MAE are shown in Fig. 7. For , we observe that similar results are achieved by varying and is a good trade-off between efficiency and performance, as larger requires more expensive computation. Besides, we observe that when is fixed (), the WF, OR and MAE performance decreases while the AUC performance initially increases, peaks within a range of from to , and then decreases. Thus, we choose the optimal . When is fixed (), the WF and OR performance initially increases and peaks within a range of from to . The AUC performance initially remains stable and then decreases, and the MAE performance initially remains stable, increases within a range of from to , and then decreases. Thus, we choose the optimal .
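
The "tune one parameter while fixing the other two" protocol can be sketched as a one-at-a-time sweep. In the Python sketch below, the parameter names `alpha` and `beta`, the default values, the grids and the scoring function `toy_score` are all hypothetical stand-ins chosen for illustration; they are not the parameters, values or model of this paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def one_at_a_time_sweep(
    evaluate: Callable[[Dict[str, float]], float],
    defaults: Dict[str, float],
    grid: Dict[str, Sequence[float]],
) -> Dict[str, List[Tuple[float, float]]]:
    """Vary each parameter over its grid while the others stay at their defaults."""
    results: Dict[str, List[Tuple[float, float]]] = {}
    for name, values in grid.items():
        curve = []
        for value in values:
            params = dict(defaults)  # copy, so the defaults are never mutated
            params[name] = value
            curve.append((value, evaluate(params)))
        results[name] = curve
    return results


# Hypothetical stand-in for "run the model and compute a metric such as WF".
def toy_score(params: Dict[str, float]) -> float:
    return -(params["alpha"] - 1.0) ** 2 - (params["beta"] - 2.0) ** 2


toy_curves = one_at_a_time_sweep(
    toy_score,
    defaults={"alpha": 1.0, "beta": 2.0},
    grid={"alpha": [0.0, 1.0, 2.0], "beta": [1.0, 2.0, 3.0]},
)
best_alpha = max(toy_curves["alpha"], key=lambda pair: pair[1])[0]
```

Each entry of `toy_curves` corresponds to one panel of a sensitivity plot: the metric as a function of a single parameter. A one-at-a-time sweep does not capture interactions between parameters, which is a known limitation of this kind of analysis.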

IV-D2 Parameters in refining module

In our refining module, the main parameter is the regularization parameter . The sensitivity in terms of WF, OR, AUC, MAE is shown in Fig. 8 (a). We observe that the WF and OR performance initially increases, peaks within a range of from to , and then decreases. The AUC performance initially increases, peaks within a range of from to , and then decreases. The MAE initially increases, peaks at , and then remains stable. Thus, we choose the optimal .

Moreover, we examine the sensitivity of our model to different thresholding strategies in our refining module. We fix the lower threshold, i.e., we set as the average value of the coarse saliency, and vary . PR curves and ROC curves of and are shown in Fig. 8 (b) and (c). We observe that our method performs similarly under the three strategies, which demonstrates its robustness.
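
The double-thresholding step can be illustrated with a short Python sketch. It fixes the lower threshold to the mean coarse saliency, as described above, and takes the upper threshold as an argument. Treating super-pixels at or above the upper threshold as positive examples, those below the lower threshold as negative examples, and the rest as the uncertain ones to be decided by the learned mapping is this sketch's reading of the text; the function name and the toy values are assumptions.

```python
from typing import List, Sequence, Tuple


def split_by_thresholds(
    coarse: Sequence[float], upper: float
) -> Tuple[List[int], List[int], List[int]]:
    """Split super-pixel indices into positive, negative and uncertain sets."""
    if not coarse:
        raise ValueError("coarse saliency must be non-empty")
    lower = sum(coarse) / len(coarse)  # lower threshold: mean coarse saliency
    if upper < lower:
        raise ValueError("upper threshold must not be below the mean saliency")
    positives = [i for i, s in enumerate(coarse) if s >= upper]
    negatives = [i for i, s in enumerate(coarse) if s < lower]
    uncertain = [i for i, s in enumerate(coarse) if lower <= s < upper]
    return positives, negatives, uncertain


# Toy coarse saliency for six super-pixels (illustrative values only).
toy_coarse = [0.9, 0.8, 0.1, 0.7, 0.2, 0.0]
toy_pos, toy_neg, toy_unc = split_by_thresholds(toy_coarse, upper=0.75)
```

With a mean of 0.45 on the toy values, only the super-pixel at index 3 falls between the two thresholds. Raising the upper threshold moves more super-pixels into the uncertain set, which is the kind of variation a thresholding-strategy comparison would probe.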

V Conclusion

In this paper, we proposed a coarse-to-fine architecture for salient object detection whose baseline is a low-rank matrix decomposition model with a Laplacian constraint. First, we applied the baseline model to decompose the feature matrix extracted from an over-segmented image into a low-rank component and a sparse component, which indicated background and foreground respectively. This produced a coarse saliency map, which might fail to cover the salient objects completely. Then, we refined the coarse saliency map by taking spatial relationships into account. We chose negative and positive super-pixel examples from the coarse saliency map to learn a projection, which determined the final saliency of the remaining hard-to-classify super-pixels. Owing to the coarse-to-fine architecture, our method achieved competitive results with higher efficiency. We comprehensively compared the baseline model and the proposed model with other approaches, and illustrated the general effectiveness of the coarse-to-fine architecture. Finally, parametric sensitivity analyses and time consumption were provided to show the robustness and efficiency of our model.


This work was supported in part by the National Key Technology Research and Development Program of the Ministry of Science and Technology of China (No. 2015BAK36B00), in part by the Key Science and Technology of Shenzhen (No. CXZZ20150814155434903), in part by the Key Program for International S&T Cooperation Projects of China (No. 2016YFE0121200), in part by the National Natural Science Foundation of China (No. 61571205), and in part by the National Natural Science Foundation of China (No. 61772220).


  • [1] A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 185–207, 2013.
  • [2] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE transactions on image processing, vol. 24, no. 12, pp. 5706–5722, 2015.
  • [3] C. Goldberg, T. Chen, F.-L. Zhang, A. Shamir, and S.-M. Hu, “Data-driven object manipulation in images,” in Computer Graphics Forum, vol. 31, no. 2pt1.   Wiley Online Library, 2012.
  • [4] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 10, pp. 1915–1926, 2012.
  • [5] K. Gu, S. Wang, H. Yang, W. Lin, G. Zhai, X. Yang, and W. Zhang, “Saliency-guided quality assessment of screen content images,” IEEE transactions on multimedia, vol. 18, no. 6, pp. 1098–1110, 2016.
  • [6] L. Tang, Q. Wu, W. Li, and Y. Liu, “Deep saliency quality assessment network with joint metric,” IEEE Access, vol. 6, pp. 913–924, 2018.
  • [7] V. Sitzmann, A. Serrano, A. Pavel, M. Agrawala, D. Gutierrez, B. Masia, and G. Wetzstein, “Saliency in VR: How do people explore virtual environments?” IEEE transactions on visualization and computer graphics, vol. 24, no. 4, pp. 1633–1642, 2018.
  • [8] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, p. 11, 2011.
  • [9] J. Yan, M. Zhu, H. Liu, and Y. Liu, “Visual saliency detection via sparsity pursuit,” IEEE Signal Processing Letters, vol. 17, no. 8, pp. 739–742, 2010.
  • [10] C. Lang, G. Liu, J. Yu, and S. Yan, “Saliency detection by multitask sparsity pursuit,” IEEE transactions on image processing, vol. 21, no. 3, pp. 1327–1338, 2012.
  • [11] W. Zou, K. Kpalma, Z. Liu, and J. Ronsin, “Segmentation driven low-rank matrix recovery for saliency detection,” in BMVC, 2013.
  • [12] H. Peng, B. Li, H. Ling, W. Hu, W. Xiong, and S. J. Maybank, “Salient object detection via structured matrix decomposition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 818–832, 2017.
  • [13] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE transactions on pattern analysis and machine intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
  • [14] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, “Automatic salient object segmentation based on context and shape prior.” in BMVC, vol. 6, no. 7, 2011.
  • [15] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 3, pp. 569–582, 2015.
  • [16] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [17] R. Margolin, A. Tal, and L. Zelnik-Manor, “What makes a patch distinct?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [18] A. Borji and L. Itti, “Exploiting local and global patch rarities for saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [19] S. Lu and J.-H. Lim, “Saliency modeling from image histograms,” in European Conference on Computer Vision, 2012.
  • [20] S. Lu, C. Tan, and J.-H. Lim, “Robust and efficient saliency modeling from image co-occurrence histograms,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 1, pp. 195–201, 2014.
  • [21] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
  • [22] Y. Fang, W. Lin, B.-S. Lee, C.-T. Lau, Z. Chen, and C.-W. Lin, “Bottom-up saliency detection model based on human visual sensitivity and amplitude spectrum,” IEEE transactions on multimedia, vol. 14, no. 1, pp. 187–198, 2012.
  • [23] J. Li, M. D. Levine, X. An, X. Xu, and H. He, “Visual saliency based on scale-space analysis in the frequency domain,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 4, pp. 996–1010, 2013.
  • [24] N. Imamoglu, W. Lin, and Y. Fang, “A saliency detection model using low-level features based on wavelet transform,” IEEE transactions on multimedia, vol. 15, no. 1, pp. 96–105, 2013.
  • [25] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [26] Q. Wang, W. Zheng, and R. Piramuthu, “Grab: Visual saliency via novel graph model and background priors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [27] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [28] X. Shen and Y. Wu, “A unified approach to salient object detection via low rank matrix recovery,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [29] Q. Zhang, Y. Liu, S. Zhu, and J. Han, “Salient object detection based on super-pixel clustering and unified low-rank representation,” Computer Vision and Image Understanding, vol. 161, pp. 51–64, 2017.
  • [30] J. Li, L. Luo, F. Zhang, J. Yang, and D. Rajan, “Double low rank matrix recovery for saliency fusion,” IEEE transactions on image processing, vol. 25, no. 9, pp. 4421–4432, 2016.
  • [31] J. Li, J. Ding, and J. Yang, “Visual salience learning via low rank matrix recovery,” in Asian Conference on Computer Vision, 2014, pp. 112–127.
  • [32] J. Li, J. Yang, C. Gong, and Q. Liu, “Saliency fusion via sparse and double low rank decomposition,” Pattern Recognition Letters, 2017.
  • [33] R. Huang, W. Feng, and J. Sun, “Saliency and co-saliency detection by low-rank multiscale fusion,” in IEEE International Conference on Multimedia and Expo (ICME), 2015.
  • [34] R. Jenatton, J.-Y. Audibert, and F. Bach, “Structured variable selection with sparsity-inducing norms,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2777–2824, 2011.
  • [35] R. Jenatton, G. Obozinski, and F. Bach, “Structured sparse principal component analysis,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
  • [36] F. Rapaport, E. Barillot, and J.-P. Vert, “Classification of arrayCGH data using fused SVM,” Bioinformatics, vol. 24, no. 13, pp. i375–i382, 2008.
  • [37] S. Lu, V. Mahadevan, and N. Vasconcelos, “Learning optimal seeds for diffusion-based salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [38] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,” in Advances in neural information processing systems, 2011.
  • [39] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [40] J. Kim, D. Han, Y.-W. Tai, and J. Kim, “Salient region detection via high-dimensional color transform and local spatial support,” IEEE transactions on image processing, vol. 25, no. 1, pp. 9–23, 2016.
  • [41] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, “Saliency detection via dense and sparse reconstruction,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.
  • [42] X. Hou, J. Harel, and C. Koch, “Image signature: Highlighting sparse salient regions,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 1, pp. 194–201, 2012.
  • [43] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, “Frequency-tuned salient region detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [44] J. Wang, H. Jiang, Z. Yuan, M.-M. Cheng, X. Hu, and N. Zheng, “Salient object detection: A discriminative regional feature integration approach,” International journal of computer vision, vol. 123, no. 2, pp. 251–268, 2017.
  • [45] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, “Interactively co-segmentating topically related images with intelligent scribble guidance,” International journal of computer vision, vol. 93, no. 3, pp. 273–292, 2011.
  • [46] R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [47] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.