Coarse-to-Fine Salient Object Detection
with Low-Rank Matrix Recovery
Abstract
Despite the great potential of applying low-rank matrix recovery (LRMR) theory to salient object detection, existing LRMR-based approaches scarcely consider the interrelationship among elements within the sparse component and suffer from high computational cost. In this paper, we propose a novel LRMR-based saliency detection method under a coarse-to-fine framework to circumvent these two limitations. The first step of our approach is to generate a coarse saliency map by integrating an ℓ1-norm sparsity constraint imposed on the sparse matrix and a Laplacian regularization for smoothness. Following this, we exploit the interrelationship among sparse elements and increase detection recall near object boundaries using a learned mapping function, which precisely distinguishes foreground from background in cluttered or complex scenes. Extensive experiments on three benchmark datasets demonstrate that our method achieves enhanced performance compared with other state-of-the-art saliency detection approaches, and also verify the efficacy of our coarse-to-fine architecture.
I Introduction
Visual saliency has long been a fundamental problem in neuroscience, psychology, and computer vision [1, 2]. It refers to the identification of the portion of visual information most essential for further processing. Recently, the task has been extended from predicting eye fixations to identifying regions containing salient objects, known as salient object detection or simply saliency detection. Tremendous effort has been devoted to saliency detection over the past decades owing to its extensive real-world applications in the multimedia realm, including image manipulation [3, 4], image/video quality assessment [5, 6] and virtual reality (VR) [7].
Existing approaches for saliency detection can be divided into two categories [1]: top-down (or task-driven) approaches utilize high-level human perceptual knowledge (e.g., object labels, semantic information or background priors) to guide the estimation of saliency maps, whereas bottom-up (or stimulus-driven) approaches are usually based on low-level visual information such as color, texture and location. Compared with top-down approaches, bottom-up approaches require less computational power and exhibit better generality and scalability, although their detected salient regions are liable to be confused with the background [1, 2].
A recent trend is to combine bottom-up cues with top-down priors to facilitate saliency detection using low-rank matrix recovery (LRMR) theory [8]. Generally speaking, these methods (e.g., [9, 10, 11]) assume that a natural scene image consists of visually consistent background regions (corresponding to a highly redundant part with low-rank structure) and distinctive foreground regions (corresponding to a visually salient part with sparse structure). For example, Yan et al. [9] applied LRMR to the response matrix of image patches obtained by sparse coding. Lang et al. [10] jointly decomposed multiple feature matrices and then produced the saliency map by inference. Despite the promising results achieved by various LRMR-based methods, two challenges arise in real applications [12]:

The inter-correlation between elements in the sparse component is ignored. Specifically, the produced sparse component covers regions of salient objects, but these regions may be scattered or incomplete due to the lack of guidance during the decomposition.

Existing methods either attempt to learn a dictionary or transformation that depends on a large amount of training data, which is typically hard or even impossible to obtain, or introduce a fusion scheme that suffers from high computational cost.
In this paper, we propose a novel coarse-to-fine framework for salient object detection based on LRMR to circumvent these two limitations. Since LRMR is totally unsupervised, it is well suited to roughly distinguishing salient regions from the background. Based on this coarse saliency, we then take into account the spatial relationship among the elements within the sparse component through learning-based refinement. Specifically, our framework features two successive modules: a coarse-processing module and a fine-tuning module. In the coarse-processing module, a low-rank matrix recovery model with a Laplacian constraint is proposed to roughly extract the target from its surrounding background. Through the decomposition, the detected background contains most of the regions that are irrelevant to the desired target, while the foreground may include some cluttered regions, i.e., both the target boundary and its local surrounding background, which severely decreases detection precision. Therefore, a fine-tuning process is employed to refine the coarse saliency. We select confident examples (i.e., positive and negative superpixels) from the coarse saliency map to learn a mapping that accounts for spatial relationships, and the fine-tuned saliency values for the remaining tough superpixels are determined by the learned classifier.
To summarize, our main contributions are threefold:

An effective saliency detection model, integrating ℓ1-norm sparsity-constrained LRMR and Laplacian regularization, is proposed to roughly detect salient objects. We set this as our baseline model and demonstrate that it works well, especially in scenarios with multiple objects.

A learning-based refinement module is developed by considering the spatial adjacency relationships among image regions on the coarse saliency map obtained from our baseline model. It assigns more accurate saliency values to obscure boundary regions in the coarse saliency map, promoting the wholeness of the detected salient objects.

Extensive experiments on three benchmark datasets are conducted to demonstrate the superior performance and robustness of our method against other state-of-the-art approaches.
II Related Work
An extensive review of saliency detection is beyond the scope of this paper. We refer interested readers to two recently published surveys [1, 2] for more details about existing bottom-up and top-down approaches. This section first briefly reviews prevailing unsupervised bottom-up saliency detection methods, and then introduces several popular low-rank matrix recovery based methods that are closely related to our work.
II-A Popular Saliency Detection Methods
As a pioneering work, Itti et al. [13] innovatively suggested using "Center and Surround" filters to extract image features and simulate the human visual system at multiple scales to generate saliency maps. Motivated by Itti's framework, various contrast-based approaches have been developed over the past decades, including local contrast based ones (e.g., [4, 14]), global contrast based ones (e.g., [15, 16, 17]), and those combining both local and global contrast (e.g., [18, 19, 20]). Local contrast is sensitive to object boundaries and noise, as it focuses on differences among regions within an image patch, whereas global contrast struggles to distinguish similar colors or texture patterns.
On the other hand, the frequency domain also provides a reliable avenue for salient object detection. For example, Hou et al. [21] analyzed the spectral residual of an image, where the high-frequency components are considered background. A similar work is presented by Fang et al. [22], where the standard Fast Fourier Transform (FFT) is substituted with the Quaternion Fourier Transform (QFT). Other representative examples include [23, 24]. Generally, these methods are effective when salient objects are small, but they tend to detect only the boundary when objects are larger.
Graph-based models (e.g., [25, 26, 27]) have been proposed to increase robustness and adaptability by constructing a graph over local or global nodes (e.g., image patches or regions). For example, Yang et al. [25] proposed a graph-based manifold ranking model to detect salient objects with foreground or background seeds. Based on this model, Wang et al. [26] added connectivity with and within boundary nodes in order to capture global saliency cues. This way, salient information among different nodes can be jointly exploited. However, a fully connected graph suffers from high computational cost.
II-B LRMR-based Saliency Detection Methods
The usage of low-rank matrix recovery (LRMR) theory for saliency detection was initiated by Yan et al. [9] and then extended in [28]. Specifically, LRMR-based saliency detection approaches assume that an image can be decomposed into a redundant part and a salient part, which can be characterized by a low-rank component and a sparse component, respectively. Given a data matrix F = [f_1, f_2, ..., f_n], where each column f_i represents one sample, the optimization problem can be formulated as follows
$$\min_{\mathbf{L},\mathbf{S}} \|\mathbf{L}\|_{*} + \lambda\|\mathbf{S}\|_{1}, \quad \text{s.t.}\ \ \mathbf{F} = \mathbf{L} + \mathbf{S} \tag{1}$$
where ||·||_* and ||·||_1 denote the nuclear norm and the ℓ1-norm respectively, and λ is a parameter balancing the rank term and the sparsity term. Given the decomposition, the saliency map can be generated from the obtained sparse matrix S.
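For concreteness, problem (1) is the classical robust PCA model, and it can be solved with the standard inexact ALM / ADMM scheme. The sketch below is a didactic Python/NumPy illustration with common default parameters, not the authors' implementation:

```python
import numpy as np

def soft_threshold(X, tau):
    """Element-wise shrinkage: proximal operator of the l1-norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca(F, lam=None, mu=1.0, rho=1.5, tol=1e-7, max_iter=500):
    """Solve  min ||L||_* + lam*||S||_1  s.t.  F = L + S  (inexact ALM)."""
    m, n = F.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))   # common default weight, not from the paper
    L = np.zeros_like(F); S = np.zeros_like(F); Y = np.zeros_like(F)
    for _ in range(max_iter):
        L = svt(F - S + Y / mu, 1.0 / mu)                 # low-rank update
        S = soft_threshold(F - L + Y / mu, lam / mu)      # sparse update
        R = F - L - S                                     # primal residual
        Y += mu * R
        mu *= rho
        if np.linalg.norm(R) / max(np.linalg.norm(F), 1e-12) < tol:
            break
    return L, S
```

The same shrinkage and singular-value-thresholding operators reappear in the ADMM updates of Sect. III-A.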
Unfortunately, the early methods are typically data-dependent, i.e., the learned dictionaries or transformations depend heavily on the selected training images or image patches, and thus suffer from limited adaptability and generalization capability. To this end, various approaches have been developed in an unsupervised manner, either adopting a multi-task scheme (e.g., [10]) or introducing extra priors (e.g., [11, 29]). For example, Lang et al. [10] jointly decomposed multiple feature matrices instead of directly combining individual saliency maps. Zou et al. [11] introduced segmentation priors to cooperate with sparse saliency. To preserve the wholeness of detected objects, saliency fusion models (e.g., [30, 31, 32, 33]) were proposed thereafter. For example, double low-rank matrix recovery (DLRMR) was suggested in [30] to fuse saliency maps detected by different approaches.
Although the above extensions improved detection robustness to cluttered backgrounds, two open problems remain. First, extra priors [11] or sophisticated operations (such as saliency fusion [30, 33]) may introduce expensive computational cost. Second, all these methods ignore the inter-correlation among elements within the sparse component, and may thus fail to detect whole salient objects. The first work to pinpoint these two limitations is the recently proposed structured matrix decomposition (SMD) by Peng et al. [12]. Specifically, SMD introduced a tree-structured sparsity constraint to efficiently emphasize the inter-correlation in a unified model:
$$\min_{\mathbf{L},\mathbf{S}} \|\mathbf{L}\|_{*} + \alpha\sum_{i=1}^{d}\sum_{j=1}^{n_i} v_j^i \big\|(\mathbf{S}\odot\boldsymbol{\Pi})_{G_j^i}\big\|_{p} + \beta\,\mathrm{tr}\big(\mathbf{S}\mathbf{M}_F\mathbf{S}^{\top}\big), \quad \text{s.t.}\ \ \mathbf{F} = \mathbf{L} + \mathbf{S} \tag{2}$$
where the matrix Π represents high-level priors [28], and ⊙ denotes the element-wise (dot) product of matrices. The second term on the right-hand side denotes the structured-sparse constraint: ||·||_p is the ℓp-norm (p ≥ 1), d is the depth of the index tree, n_i is the total number of nodes at the i-th level, and v_j^i is the weight of node G_j^i. Here each node G_j^i represents a graph that contains several adjacent superpixels, and S_{G_j^i} ∈ R^{D×|G_j^i|} (|·| denotes set cardinality) is the sub-matrix of S corresponding to node G_j^i. By contrast, the third term is introduced to promote performance under cluttered backgrounds, where β is a parameter that trades off this regularization against the other two terms, and M_F is the unnormalized graph Laplacian matrix.
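To make the tree-structured term concrete, the following sketch evaluates a sum of node-wise ℓp-norms over a toy index tree. The node structure and uniform weights are illustrative only; SMD derives them from multi-scale segmentation:

```python
import numpy as np

def tree_sparsity_norm(S, tree, weights=None, p=np.inf):
    """Evaluate a tree-structured norm: the (weighted) sum over all tree
    nodes of the l_p norm of the sub-matrix of S indexed by that node.

    `tree` is a list of levels; each level is a list of nodes; each node
    is a list of superpixel (column) indices of S."""
    total = 0.0
    for d, level in enumerate(tree):
        for j, node in enumerate(level):
            w = 1.0 if weights is None else weights[d][j]
            sub = S[:, node]                              # this node's graph
            total += w * np.linalg.norm(sub.ravel(), ord=p)
    return total
```

With p = ∞ each node contributes the magnitude of its strongest element, so zeroing a node's contribution requires suppressing the whole group, which is exactly the group-level sparsity behavior discussed above.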
Our work is directly motivated by SMD. However, two observations prompt us to propose our method:

Sparsity regularization on fine graphs can destroy the spatial relationship among object parts, which is inevitable in SMD since the index tree is constructed in a fine-to-coarse way.

The Laplacian constraint acts more as a smoothness term than as a remedy for cluttered backgrounds; it aims at improving the consistency of salient regions.
As discussed in SMD [12], a tree-structured sparsity-inducing norm is introduced to model the spatial contiguity and feature similarity among image patches, thus generating more precise and structurally consistent results. However, we argue that the structured-sparse constraint does not fit this task well. Intuitively, the regions within the same object should be popped out as a whole without disrupting its completeness. The spatial relationship is indeed taken into consideration during the construction of the index tree, i.e., merging superpixels with different thresholds to form graphs. However, this relationship is not preserved if we naively impose the sparsity constraint on the graphs. Going deeper, imposing the norm row-wise or column-wise leads to row-sparsity or column-sparsity of a graph, respectively, without any consideration of the spatial information. In fact, structured sparsity can be used for structure preservation [34, 35]. For example, contiguous bits of a (binarized) genetic sequence can appear simultaneously in tumor diagnosis [36]. Likewise, features such as eyes and nose can be jointly considered as localized features for occluded face recognition [35]. In these cases, structured sparsity either corresponds to actual spatial positions, or to dictionary elements representing specific spatial parts. In salient object detection, however, such priors, like contiguously distributed bits or predetermined object parts, are unavailable or unattainable. As a result, utilizing the structured-sparse constraint has the potential to destroy object structure, especially in the case of multiple objects.
III Our Method
The goal of this work is to overcome the two challenges in existing low-rank matrix recovery (LRMR) based approaches by introducing a coarse-to-fine architecture. Based on the discussion above, we integrate the basic LRMR model in (1) with Laplacian regularization to generate a coarse saliency map. Then, we learn a mapping from the features and spatial relationships among pairwise superpixels in the coarse saliency map to obtain the final saliency. It is worth noting that we consider the spatial relationship among superpixels in the refinement module, which increases robustness to cluttered backgrounds. The overall flowchart is illustrated in Fig. 1.
III-A Coarse Saliency from Low-Rank Matrix Recovery
From the discussion in Sect. II-B, we can see that tree-structured regularization is not appropriate for salient object detection, especially in the case of multiple objects. Thus we revert to the original ℓ1-norm sparsity constraint, which induces sparsity by treating each element individually. Besides, compared with the basic LRMR model in (1), the SMD model in (2) introduces Laplacian regularization to address cluttered backgrounds. Admittedly, this regularization improves the performance of LRMR under cluttered backgrounds. However, it functions more as a smoothness term, as widely used in previous work (e.g., [1, 2, 27, 37]). Instead, we deal with cluttered backgrounds by considering the spatial relationship among sparse elements in a refinement module, while keeping the Laplacian regularization as a smoothness term in our coarse module. Therefore, we obtain the coarse saliency using the basic LRMR model in (1) with a Laplacian constraint as a smoothness term, formulated as follows
$$\min_{\mathbf{L},\mathbf{S}} \|\mathbf{L}\|_{*} + \lambda\|\mathbf{S}\|_{1} + \beta\,\mathrm{tr}\big(\mathbf{S}\mathbf{M}_F\mathbf{S}^{\top}\big), \quad \text{s.t.}\ \ \mathbf{F} = \mathbf{L} + \mathbf{S} \tag{3}$$
where L, S ∈ R^{D×n} and M_F is the unnormalized graph Laplacian matrix. Once the low-rank matrix L and the sparse matrix S are determined, the saliency value of the i-th superpixel can be calculated as
$$\mathrm{Sal}(p_i) = \|\mathbf{s}_i\|_{1} \tag{4}$$
where s_i denotes the i-th column of matrix S. Note that s_i is a vector, so its ℓ1-norm is the sum of the absolute values of its entries.
To demonstrate the effect of the structured-sparse regularization in (2) versus the ℓ1-norm sparsity constraint in (3), we provide two examples in Fig. 2; more can be found in our supplementary material. Specifically, we set a four-layer index tree for experimental verification. It should first be made clear that during the fine-to-coarse construction of the index tree, the bottom layer (Depth 4) is composed of graphs each containing a single superpixel, while the top layer (Depth 1) is composed of one graph containing all the superpixels. The sparsity constraint is applied to each graph separately and the results are summed.
The first image contains a single object against a clean background. Comparing Fig. 2(c1) with Fig. 2(e1) and Fig. 2(g1), we can observe that adding the constraint to Depth 2 eliminates irrelevant background, while deeper constraints are unnecessary for preserving the spatial structure of the object. Considering the construction of the index tree, the graph in the top layer of the fine-to-coarse structure corresponds to the coarse module in our coarse-to-fine architecture. This indicates that our coarse module is capable of maintaining a rough spatial structure of patches within a single object. Furthermore, our refinement module utilizes the spatial relationship among superpixels to refine the salient graph in (i1). This way, the few superpixels below the flower are found to be surrounded by background regions and only weakly related to the main body of the flower, so they are classified as background, as shown in (j1).
The second image contains multiple objects. Comparing Fig. 2(c2) with Fig. 2(e2) and Fig. 2(g2), we can observe that adding the constraint to Depth 2 promotes the structural wholeness of the objects to some extent, while deeper constraints destroy the spatial structure. This is because the tree-structured regularization term in deep layers encourages sparsity of single superpixels or small groups of superpixels, thus ignoring the wholeness of multiple objects. On the contrary, our coarse-to-fine architecture considers multiple objects as a whole without disrupting their inner organization. Our coarse module first generates rough saliency for the objects, and then the refinement module produces more accurate saliency for superpixels around object boundaries by learning from the patterns of the foreground and background, respectively. For example, some superpixels in the leg areas adjacent to the image boundary are originally considered background, but are assigned higher saliency values by the refinement module, which improves the wholeness of the objects.
Having illustrated the difference between the structured-sparse regularization in (2) and the ℓ1-norm sparsity constraint in (3), we now present the optimization procedure for (3).
Optimization: The optimization problem in (3) can be efficiently solved via the alternating direction method of multipliers (ADMM) [38]. For simplicity, we denote the projected feature matrix as F. An auxiliary variable H is introduced and problem (3) becomes
$$\min_{\mathbf{L},\mathbf{S},\mathbf{H}} \|\mathbf{L}\|_{*} + \lambda\|\mathbf{S}\|_{1} + \beta\,\mathrm{tr}\big(\mathbf{H}\mathbf{M}_F\mathbf{H}^{\top}\big), \quad \text{s.t.}\ \ \mathbf{F} = \mathbf{L} + \mathbf{S},\ \ \mathbf{S} = \mathbf{H} \tag{5}$$
Lagrange multipliers Y_1 and Y_2 are introduced to handle the equality constraints, and the augmented Lagrangian function is constructed as
$$\mathcal{L} = \|\mathbf{L}\|_{*} + \lambda\|\mathbf{S}\|_{1} + \beta\,\mathrm{tr}\big(\mathbf{H}\mathbf{M}_F\mathbf{H}^{\top}\big) + \langle\mathbf{Y}_1,\, \mathbf{F}-\mathbf{L}-\mathbf{S}\rangle + \langle\mathbf{Y}_2,\, \mathbf{S}-\mathbf{H}\rangle + \frac{\mu}{2}\Big(\|\mathbf{F}-\mathbf{L}-\mathbf{S}\|_F^2 + \|\mathbf{S}-\mathbf{H}\|_F^2\Big) \tag{6}$$
where μ > 0 is the penalty parameter.
Iterative steps of minimizing the Lagrangian function are used to optimize (3), and the stopping criteria at step k are given by (7) and (8):
$$\|\mathbf{F}-\mathbf{L}_k-\mathbf{S}_k\|_F \,/\, \|\mathbf{F}\|_F < \varepsilon \tag{7}$$
$$\|\mathbf{S}_k-\mathbf{H}_k\|_F \,/\, \|\mathbf{F}\|_F < \varepsilon \tag{8}$$
The variables L, S, H, Y_1 and Y_2 can be alternately updated by minimizing the augmented Lagrangian function with the other variables fixed. In this model, each variable can be updated with a closed-form solution. With respect to L and S, they can be updated as follows
$$\mathbf{L}_{k+1} = \mathbf{U}\,\mathcal{S}_{1/\mu_k}[\boldsymbol{\Sigma}]\,\mathbf{V}^{\top} \tag{9}$$
$$\mathbf{S}_{k+1} = \mathcal{S}_{\lambda/\mu_k}\!\left[\tfrac{1}{2}\Big(\mathbf{F}-\mathbf{L}_{k+1}+\mathbf{H}_k+\tfrac{\mathbf{Y}_1^k-\mathbf{Y}_2^k}{\mu_k}\Big)\right] \tag{10}$$
where the soft-thresholding (shrinkage) operator is defined element-wise by S_τ[x] = sign(x)·max(|x| − τ, 0), and (U, Σ, V) = SVD(F − S_k + Y_1^k/μ_k), where SVD is the singular value decomposition.
Regarding H, Y_1, Y_2 and μ, we can update them as follows
$$\mathbf{H}_{k+1} = \big(\mu_k\mathbf{S}_{k+1}+\mathbf{Y}_2^k\big)\big(2\beta\mathbf{M}_F+\mu_k\mathbf{I}\big)^{-1} \tag{11}$$
$$\mathbf{Y}_1^{k+1} = \mathbf{Y}_1^k + \mu_k\big(\mathbf{F}-\mathbf{L}_{k+1}-\mathbf{S}_{k+1}\big) \tag{12}$$
$$\mathbf{Y}_2^{k+1} = \mathbf{Y}_2^k + \mu_k\big(\mathbf{S}_{k+1}-\mathbf{H}_{k+1}\big) \tag{13}$$
$$\mu_{k+1} = \rho\,\mu_k \tag{14}$$
where the parameter ρ > 1 controls the convergence speed.
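The only non-standard step above is the H-subproblem (11): because the Laplacian term is quadratic, it reduces to a linear system. The following short check verifies the closed form numerically (symbol names follow the text; the implementation is our illustration, not the authors' code):

```python
import numpy as np

def update_H(S, Y2, M_F, beta, mu):
    """Closed-form minimizer of
        beta * tr(H M_F H^T) + (mu/2) * ||S - H + Y2/mu||_F^2,
    i.e. the H-step of the ADMM scheme: H = (mu*S + Y2)(2*beta*M_F + mu*I)^(-1)."""
    n = M_F.shape[0]
    return (mu * S + Y2) @ np.linalg.inv(2.0 * beta * M_F + mu * np.eye(n))
```

Since M_F is symmetric positive semi-definite, 2βM_F + μI is always invertible for μ > 0, so this step is well defined at every iteration.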
III-B Learning-based Saliency Refinement
As discussed in Sect. II-B, the coarse saliency map generated using objective (3) ignores the spatial relationship among adjacent superpixels. To further improve the detection results, we refine the coarse saliency via mapping learning, which learns a projection connecting feature values to saliency values.
Specifically, given the coarse saliency calculated by (4), we can roughly distinguish foreground from background. In order to capture the common interior features of the foreground and the background respectively, we choose confident superpixels based on their coarse saliency values. We set two thresholds to select confident superpixel samples for the background and the foreground respectively, i.e., superpixels with saliency values lower than the background threshold are taken as negative samples, and superpixels with saliency values higher than the foreground threshold are taken as positive ones. We denote by X the sample matrix composed of both positive and negative samples, and by Y the corresponding label matrix, where N is the total number of confident samples. For the i-th positive sample, its label vector is (1, 0), while for the i-th negative sample, its label vector is (0, 1). See Fig. 3 for more intuitive examples.
In order to determine the saliency of the tough samples, we utilize their spatial relationships with these confident samples, as shown in Fig. 3. Based on the coarse saliency and the adjacency relationships, we generate a rough saliency for the i-th tough sample as follows
$$\hat{s}_i = \frac{\sum_{j\in\mathcal{N}_i} n_j\,\mathrm{Sal}(p_j)}{\sum_{j\in\mathcal{N}_i} n_j} \tag{15}$$
where N_i is the set of superpixels adjacent to the i-th tough sample, and n_j denotes the number of pixels contained in the j-th superpixel. Similarly, we formulate the label vector of the i-th tough sample from its rough saliency, and stack these vectors into the label matrix Y_t, where N_t is the number of tough samples.
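Eq. (15) is simply a pixel-count-weighted average of the coarse saliency over the adjacent superpixels, for example (the data structures below are hypothetical conveniences, not part of the model):

```python
def rough_tough_saliency(coarse_sal, pixel_counts, neighbors, i):
    """Pixel-count-weighted average of the coarse saliency of the
    superpixels adjacent to tough sample i, as in Eq. (15).

    coarse_sal[j]   -- coarse saliency of superpixel j
    pixel_counts[j] -- number of pixels in superpixel j
    neighbors[i]    -- indices of superpixels adjacent to sample i"""
    idx = neighbors[i]
    total = sum(pixel_counts[j] for j in idx)
    return sum(pixel_counts[j] * coarse_sal[j] for j in idx) / total
```

Larger neighbors thus dominate the estimate, which matches the intuition that a tough superpixel mostly surrounded by confident background should start with a low rough saliency.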
Combining the coarse saliency of confident samples and tough samples, we build our saliency refinement model as follows
$$\min_{\mathbf{W}}\ \|\mathbf{X}^{\top}\mathbf{W}-\mathbf{Y}\|_F^2 + \lambda_1\|\mathbf{X}_t^{\top}\mathbf{W}-\mathbf{Y}_t\|_F^2 + \lambda_2\|\mathbf{W}\|_F^2 \tag{16}$$
where W is the mapping to be learned, and λ_1 and λ_2 are regularization parameters. Once the mapping is learned, the saliency of the tough superpixels is given by the first column of the matrix X_t^⊤W.
Despite the simplicity of (16), one should note that background regions are typically much larger than foreground objects. This leads to the problem of learning from imbalanced data. To overcome this limitation, we introduce a weighting strategy to balance the contributions of positive and negative samples in mapping learning, formulated as follows
$$\min_{\mathbf{W}}\ \sum_{i=1}^{N} w_i\,\|\mathbf{W}^{\top}\mathbf{x}_i-\mathbf{y}_i\|_2^2 + \lambda_1\|\mathbf{X}_t^{\top}\mathbf{W}-\mathbf{Y}_t\|_F^2 + \lambda_2\|\mathbf{W}\|_F^2 \tag{17}$$
where w_i is the weight for the i-th confident sample. Actually, we can simplify (17) by merging the two data-fidelity terms with generalized weights as follows
$$\min_{\mathbf{W}}\ \sum_{i=1}^{N+N_t} w_i\,\|\mathbf{W}^{\top}\mathbf{x}_i-\mathbf{y}_i\|_2^2 + \lambda_2\|\mathbf{W}\|_F^2 \tag{18}$$
where w_i is the weight for the i-th sample, be it positive, negative or tough. The weights of the confident samples are set inversely proportional to N_- and N_+, the numbers of negative and positive samples respectively, so that the two classes contribute equally, while the weights of the tough samples absorb λ_1. The optimization problem in (18) can be efficiently solved in closed form by
$$\mathbf{W} = \big(\mathbf{X}\mathbf{D}\mathbf{X}^{\top} + \lambda_2\mathbf{I}\big)^{-1}\mathbf{X}\mathbf{D}\mathbf{Y} \tag{19}$$
where D is a diagonal matrix with D_{ii} = w_i, and I is the identity matrix.
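Eq. (19) is the normal-equation solution of the weighted ridge problem (18). A compact sketch follows; the matrix layout (columns of X are samples) is our assumption for illustration:

```python
import numpy as np

def solve_mapping(X, Y, w, lam2):
    """Closed-form solution of the weighted ridge problem in (18):
        min_W  sum_i w_i ||W^T x_i - y_i||^2 + lam2 ||W||_F^2
    =>  W = (X D X^T + lam2 I)^(-1) X D Y,  with D = diag(w).

    X: d x n feature matrix (columns are samples); Y: n x 2 label matrix."""
    D = np.diag(w)
    d = X.shape[0]
    return np.linalg.solve(X @ D @ X.T + lam2 * np.eye(d), X @ D @ Y)
```

Because λ_2 > 0 makes the system matrix positive definite, the solve is always well posed, regardless of how few confident samples survive the thresholding.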
III-C Complexity Analysis
Here we briefly discuss the computational complexity of the optimizations in Sect. III-A and Sect. III-B, respectively. Throughout, D denotes the feature dimension and n the number of superpixels, with D ≪ n.
Take the k-th iteration of the coarse saliency generation as an example. The time consumption mainly involves three kinds of operations: SVD, matrix inversion and matrix multiplication. Specifically, the update of L is addressed by SVD with complexity O(D²n), while the major operations in updating H include matrix inversion and matrix multiplication with complexity O(n³). Considering D ≪ n, the final computational complexity per iteration is O(n³). Compared with this, the optimization of the tree-structured sparsity in [12] requires no extra asymptotic complexity. However, the multi-scale segmentation used to construct the index tree introduces additional computational cost and thus slows down the method in practice, as listed in Table III.
For saliency refinement, the solution in (19) involves matrix inversion and matrix multiplication, with complexities of O(D³) and O(D²N) respectively, where N is the total number of samples. Considering D ≪ N, the final computational complexity is O(D²N).
IV Experiments
To evaluate the performance of our model, we compare it with twelve state-of-the-art methods covering the categories mentioned in Sect. II. Among them, three methods are low-rank decomposition based, i.e., SMD [12], SLR [11] and ULR [28], which share a similar motivation with our baseline model in (3). Moreover, we select five state-of-the-art methods based on contrast or priors, including RBD [27], PCA [17], HS [39], HCT [40] and DSR [41]. The four remaining approaches include one graph-based method (MR [25]), two involving the frequency domain (SS [42], FT [43]), and one involving supervised training (DRFI [44]). We conduct experiments on three benchmark datasets, i.e., MSRA10K [15], ECSSD [16] and iCoSeg [45]. MSRA10K contains 10,000 images with a single object, iCoSeg contains 643 images with multiple objects, and ECSSD contains 1,000 images with complicated backgrounds. Sample images from the three datasets can be found in Fig. 9.
Dataset   MSRA10K                      iCoSeg                       ECSSD
Metric    WF     OR     AUC    MAE     WF     OR     AUC    MAE     WF     OR     AUC    MAE
ULR [28]  0.425  0.524  0.831  0.224  0.379  0.443  0.814  0.222  0.351  0.369  0.788  0.274 
SLR [11]  0.601  0.691  0.840  0.141  0.473  0.505  0.805  0.179  0.402  0.486  0.805  0.226 
SMD [12]  0.704  0.741  0.847  0.104  0.611  0.598  0.822  0.138  0.544  0.563  0.813  0.174 
Ours (C)  0.688  0.734  0.844  0.108  0.614  0.599  0.823  0.137  0.535  0.557  0.810  0.175 
ULR  0.532  0.597  0.846  0.195  0.439  0.459  0.814  0.219  0.421  0.418  0.801  0.262 
SLR  0.681  0.726  0.847  0.122  0.602  0.587  0.816  0.161  0.519  0.542  0.814  0.199 
SMD  0.706  0.753  0.854  0.103  0.630  0.618  0.838  0.132  0.546  0.571  0.820  0.175 
Ours  0.705  0.751  0.854  0.104  0.634  0.624  0.838  0.131  0.545  0.571  0.820  0.176 
(Fig. 4 columns, left to right: RGB image, ULR [28], SLR [11], SMD [12], Ours (C), ULR, SLR, SMD, Ours, GT)
IV-A Experimental Setup
We follow Peng et al.'s methodology [12] to compare the performance of different models. The metrics include the precision-recall (PR) curve, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the weighted F-measure (WF) score, the overlapping ratio (OR) and the mean absolute error (MAE). This is a slightly extended variant of the methodology suggested by the benchmark [2], which includes the PR curve, ROC curve, MAE and F-measure. Among them, precision reflects the percentage of pixels correctly assigned as salient, and recall corresponds to the fraction of detected salient pixels belonging to the salient object in the ground truth. The PR curve is obtained by sweeping a series of discrete thresholds from 0 to 255 over a grayscale saliency map in the range [0, 255]. The ROC curve is computed in a similar way, using the hit rate (recall) and the false-alarm rate. Supposing saliency values are normalized to [0, 1], the saliency map can be binarized with a given threshold into salient and non-salient pixels. With the four basic quantities true positive (TP), true negative (TN), false positive (FP) and false negative (FN), the precision (P), recall/hit rate (R) and false-alarm rate (FA) are calculated as P = TP/(TP+FP), R = TP/(TP+FN) and FA = FP/(FP+TN). Generally, precision decreases as recall increases, and the F-measure F_β = (1+β²)PR/(β²P+R) is a trade-off between them, with β² = 0.3 in many works [27, 44].
Although these metrics are widely used, the authors of [46] pointed out that they suffer from an interpolation flaw, a dependency flaw and an equal-importance flaw, and proposed the weighted F-measure (WF) to alleviate these flaws, where precision and recall are replaced with weighted counterparts. The overlapping ratio measures the overlap between the predicted (binarized) saliency map S and the ground-truth saliency map G, OR = |S ∩ G| / |S ∪ G|. The mean absolute error provides the average pixel-wise difference between the continuous saliency map and the ground-truth saliency map.
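For reference, the basic per-image quantities behind these metrics can be computed as follows. This is a simplified single-threshold sketch; WF, AUC and the full PR/ROC sweeps follow the cited protocols [2, 46]:

```python
import numpy as np

def saliency_metrics(sal, gt, thresh=0.5, beta2=0.3):
    """Precision, recall, F-measure (beta^2 = 0.3, as in [27, 44]) and MAE
    for a saliency map `sal` in [0, 1] against a binary ground truth `gt`."""
    pred = sal >= thresh                          # binarize at one threshold
    tp = np.logical_and(pred, gt).sum()           # true positives
    prec = tp / max(pred.sum(), 1)                # P = TP / (TP + FP)
    rec = tp / max(gt.sum(), 1)                   # R = TP / (TP + FN)
    f = (1 + beta2) * prec * rec / max(beta2 * prec + rec, 1e-12)
    mae = np.abs(sal - gt.astype(float)).mean()   # MAE on the continuous map
    return prec, rec, f, mae
```

Sweeping `thresh` over 256 levels and collecting (P, R) and (R, FA) pairs yields the PR and ROC curves described above.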
We adopt the SLIC algorithm [47] for over-segmentation and extract commonly used features for a fair comparison with the other approaches [11, 12, 28]. The regularization parameters for coarse saliency generation are fixed to their optimal values throughout the experiments except for the parametric analysis. For the refinement module, the thresholds and regularization parameters are likewise fixed, and the corresponding parametric sensitivity analysis is provided in Sect. IV-D. As for homogenization, we consider location, contrast and background priors as done in [12]. All the experiments in this paper were conducted with MATLAB 2016b on an Intel i5-6500 3.2GHz dual-core PC with 16GB RAM.
IV-B Experimental Analysis of the Proposed Model
IV-B1 Comparison of the coarse module with other LRMR-based methods
To evaluate the performance of our baseline model, i.e., the low-rank decomposition model with Laplacian constraint in (3), a thorough comparison with other low-rank based methods including ULR [28], SLR [11] and SMD [12] is provided in Table I and Fig. 4. From the qualitative comparison in Fig. 4, we can see that methods such as ULR and SLR fail to generate uniform detection results. On the contrary, the salient objects detected by SMD [12] and our baseline model are much smoother, which indicates the effectiveness of the Laplacian constraint. From the quantitative comparison in Table I, we can see that our baseline model and SMD [12] outperform ULR [28] and SLR [11] by a large margin. It is worth noting that our baseline model is only slightly outperformed by SMD [12] on the MSRA10K and ECSSD datasets, while on the iCoSeg dataset our baseline model even achieves better results than SMD [12] on all four metrics, indicating that the ℓ1-norm sparsity constraint can be as effective as the structured-sparse regularization. We believe the reason lies in the essential limitation of addressing spatial relationships with a fixed-level index tree, which is especially evident in scenes with multiple objects. The result again verifies our discussion in Sect. III-A that sparsity regularization on deeper layers of the index tree has the potential to destroy object structure.
IV-B2 Analysis of the coarse-to-fine architecture
It can be observed in Fig. 4 that the salient objects detected by these low-rank approaches lack wholeness, and even contain irrelevant background regions. This is because the basic low-rank model ignores the spatial relationship among object parts. Though SMD [12] attempts to handle this issue by replacing the original ℓ1-norm sparsity constraint with a structured-sparse constraint, it has difficulty achieving this goal, as discussed in Sect. III-A. Instead, we address the issue by cascading mapping learning to produce finer saliency maps. We can see that our method generates more complete saliency detection results compared with our baseline model, e.g., the persons in the second image and the dog in the third image. Besides, it also helps eliminate irrelevant background, e.g., the blue water in the first image. From the quantitative comparison listed in Table I, we can see an obvious boost in the performance of our model on all three benchmark datasets.
To further verify the general effectiveness of our coarse-to-fine architecture, we conduct more experiments with different low-rank baseline models, i.e., ULR [28], SLR [11] and SMD [12]. Test results are also summarized in Table I. Compared with the original baseline models, an improvement is obtained after refinement on all three datasets. The best performance is achieved by our method and by the SMD [12] model with refinement. Similar visual improvements can be observed in Fig. 4; the effect is especially obvious for the ULR [28] baseline, where clearer and more complete saliency maps are generated after refinement.
IV-C Comparison with State-of-the-Art Methods
To evaluate the superiority of our coarse-to-fine model, we systematically compare it with the other twelve state-of-the-art methods. PR curves on the three datasets are shown in Fig. 5, ROC curves are shown in Fig. 6, and the results for the four metrics mentioned above are listed in Table II. Besides, qualitative comparisons are provided in Fig. 9. From the results we can see that, in most cases, our model ranks first or second on the three datasets under different criteria. Note that we report the result of DRFI [44] as a reference, as it is a top-down method with supervised training.
(a) MSRA10K  Metric  Ours  SMD[12]  DRFI[44]  RBD[27]  HCT[40]  DSR[41]  PCA[17]  MR[25]  SLR[11]  SS[42]  ULR[28]  HS[39]  FT[43]

WF  0.705  0.704  0.666  0.685  0.582  0.656  0.473  0.642  0.601  0.137  0.425  0.604  0.277  
OR  0.751  0.741  0.723  0.716  0.674  0.654  0.576  0.693  0.691  0.148  0.524  0.656  0.379  
AUC  0.854  0.847  0.857  0.834  0.847  0.825  0.839  0.601  0.840  0.801  0.831  0.833  0.690  
MAE  0.104  0.104  0.114  0.108  0.143  0.121  0.185  0.125  0.141  0.255  0.224  0.149  0.231  
(a)  Metric  Ours  SMD[12]  DRFI[44]  RBD[27]  HCT[40]  DSR[41]  PCA[17]  MR[25]  SLR[11]  SS[42]  ULR[28]  HS[39]  FT[43] 
WF  0.634  0.611  0.592  0.599  0.464  0.548  0.407  0.554  0.473  0.126  0.379  0.563  0.289  
OR  0.624  0.598  0.582  0.588  0.519  0.514  0.427  0.573  0.505  0.164  0.443  0.537  0.387  
AUC  0.838  0.822  0.839  0.827  0.833  0.801  0.798  0.795  0.805  0.630  0.814  0.812  0.717  
MAE  0.131  0.138  0.139  0.138  0.179  0.153  0.201  0.162  0.179  0.253  0.222  0.176  0.223  
(a)  Metric  Ours  SMD[12]  DRFI[44]  RBD[27]  HCT[40]  DSR[41]  PCA[17]  MR[25]  SLR[11]  SS[42]  ULR[28]  HS[39]  FT[43] 
WF  0.545  0.544  0.547  0.513  0.446  0.514  0.364  0.496  0.402  0.128  0.351  0.454  0.195  
OR  0.571  0.563  0.568  0.526  0.486  0.514  0.395  0.523  0.486  0.103  0.369  0.458  0.216  
AUC  0.820  0.813  0.817  0.781  0.785  0.785  0.791  0.793  0.805  0.567  0.788  0.801  0.607  
MAE  0.176  0.174  0.160  0.171  0.198  0.171  0.247  0.186  0.226  0.278  0.274  0.227  0.270 
IV-C1 Results on single-object images
The MSRA10K dataset contains images with diverse objects of varying size, with only one object per image. From Fig. 5 (a), Fig. 6 (a) and Table II (a), we can see that our method achieves the best results with the highest weighted F-measure and overlapping ratio and the lowest mean absolute error, while DRFI [44] obtains the highest AUC score. It is worth noting that our method outperforms DRFI [44] on the other three metrics despite using only simple features and no supervision. Frequency-based methods such as FT [43] perform poorly, since it is difficult to choose a proper scale to suppress the background without knowledge of the object size. SS [42] considers sparsity directly in the standard spatial and DCT domains, so it can only give a rough outline of the detected objects. In the PR curves, our method shows an obvious superiority over the other approaches, while in the ROC curves, DRFI [44] and our method are the best two among the competitive methods.
IV-C2 Results on multiple-object images
The iCoSeg dataset contains images with multiple objects, either separate or adjacent. From Fig. 5 (b), Fig. 6 (b) and Table II (b), we can see that our method again achieves the highest weighted F-measure and overlapping ratio and the lowest mean absolute error, which shows that it remains effective when multiple objects are present. However, the performance of PCA [17], SLR [11], DSR [41] and ULR [28] degrades considerably. Since PCA [17] considers the dissimilarity between image patches and SLR [11] introduces a segmentation prior, they are more sensitive to the number of objects in a scene. DSR [41] relies on reconstruction error, so its precision drops quickly as recall increases. ULR [28] trains a feature transformation on the MSRA dataset, hence it performs poorly when detecting multiple objects. In the PR curves, our method presents better stability as recall increases, while in the ROC curves, our method and DRFI [44] achieve the best performance with almost the same AUC score, outperforming the remaining approaches.
Table III. Running time and implementation code of all compared methods.
Methods  Ours  SMD[12]  DRFI[44]  RBD[27]  HCT[40]  DSR[41]  PCA[17]  MR[25]  SLR[11]  SS[42]  ULR[28]  HS[39]  FT[43]
Time (s)  0.83  1.59  9.06  0.20  4.12  10.2  4.43  1.84  22.80  0.05  15.62  0.53  0.07
Code  M+C  M+C  M+C  M+C  M  M+C  M+C  M+C  M+C  M  M+C  EXE  C
IV-C3 Results on complex-scene images
The ECSSD dataset contains images with complicated backgrounds and objects of varying size. From Fig. 5 (c), Fig. 6 (c) and Table II (c), we can see that our method achieves the highest overlapping ratio and AUC score, and is outperformed only by DRFI [44] in terms of weighted F-measure and mean absolute error. In the PR curves, our method performs similarly to SMD [12], while in the ROC curves, DRFI [44] and our method are the best two among the state-of-the-art methods. These results demonstrate that our method remains competitive in complex scenes. Approaches such as HS [39], HCT [40], MR [25] and RBD [27], which depend on cues like contrast bias and center bias, fail to maintain good performance.
IV-C4 Visual comparison
Finally, to give an intuitive impression of the performance, we provide a visual comparison of detection results on images selected from the three benchmark datasets, which are diverse in object size, background complexity and number of objects, as shown in Fig. 9. Our method works well in most cases and is capable of providing relatively complete detections. As analyzed above, the frequency-tuned method FT [43] tends either to filter out part of the object or to preserve part of the background. Basic low-rank matrix recovery methods such as SLR [11] and ULR [28] are not sufficiently robust to background clutter and fail to provide a uniform saliency map. Approaches depending on prior cues such as HS [39], HCT [40], MR [25] and RBD [27] are more likely to miss object parts adjacent to the image boundary. Finally, the time consumption of all methods is provided in Table III, which demonstrates the efficiency of our method.
IV-D Analysis of Parameters
(Fig. 9: each row shows the input image followed by the saliency maps of FT, ULR, SS, HS, SLR, MR, PCA, DSR, HCT, RBD, DRFI, SMD, Ours, and the ground truth GT.)
IV-D1 Parameters in the coarse module
In our coarse module, the algorithm takes three parameters, i.e., the number of superpixels in the over-segmentation and two regularization parameters. We examine the sensitivity of our model to changes in these parameters on the iCoSeg dataset as an example. The analysis is conducted by tuning one parameter while fixing the other two. The performance changes in terms of WF, OR, AUC and MAE are shown in Fig. 7. For the number of superpixels, we observe that similar results are achieved over a wide range of values, and a moderate value offers a good trade-off between efficiency and performance, since a larger number of superpixels requires more expensive computation. For the first regularization parameter (with the other fixed), the WF, OR and MAE performance decreases while the AUC performance initially increases, peaks over a certain range, and then decreases; we therefore choose the value at the peak. For the second regularization parameter (with the first fixed), the WF and OR performance initially increases and peaks over a certain range, the AUC performance is initially stable and then decreases, and the MAE performance is initially stable, rises over a certain range, and then decreases; we again choose the value giving the best overall trade-off.
IV-D2 Parameters in the refining module
In our fine module, the main parameter is the regularization parameter of the learned mapping. Its sensitivity in terms of WF, OR, AUC and MAE is shown in Fig. 8 (a). We observe that the WF and OR performance initially increases, peaks over a certain range, and then decreases; the AUC performance behaves similarly; and the MAE performance initially increases, peaks, and then remains stable. We choose the value giving the best overall trade-off.
Moreover, we also examine the sensitivity of our model to different thresholding strategies in the refining module. We fix the lower threshold to the average value of the coarse saliency and test varying upper thresholds. The corresponding PR and ROC curves are shown in Fig. 8 (b) and (c). We observe that our method performs similarly under the three strategies, which demonstrates its robustness.
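The double-threshold seed selection described above can be sketched as follows. The lower threshold is the mean coarse saliency, as in the text; the `upper` cut-off of 0.8 is a hypothetical value for illustration only, not a setting reported in the paper.

```python
import numpy as np

def select_seeds(coarse_sal, upper=0.8):
    """Split superpixels into confident foreground / background seeds and
    ambiguous ones, given per-superpixel coarse saliency values in [0, 1].
    The lower threshold is the mean coarse saliency; `upper` is an
    illustrative high-confidence foreground cut-off."""
    coarse_sal = np.asarray(coarse_sal, dtype=float)
    lower = coarse_sal.mean()
    pos = np.flatnonzero(coarse_sal >= upper)                        # foreground seeds
    neg = np.flatnonzero(coarse_sal < lower)                         # background seeds
    amb = np.flatnonzero((coarse_sal >= lower) & (coarse_sal < upper))
    return pos, neg, amb
```

The positive and negative seeds would then serve as training examples for the mapping, which assigns the final saliency of the ambiguous superpixels.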
V Conclusion
In this paper, we proposed a model based on low-rank matrix decomposition with a Laplacian constraint as the baseline of a coarse-to-fine architecture for salient object detection. We first applied the baseline model to decompose the feature matrix extracted from an over-segmented image into a low-rank component and a sparse component, which indicated the background and the foreground, respectively. This produced a coarse saliency map, which might fail to capture the salient objects in their entirety. We then refined the coarse saliency map by taking spatial relationships into account: negative and positive superpixel examples chosen from the coarse saliency map were used to learn a projection, which determined the final saliency of the remaining ambiguous superpixels. Owing to the coarse-to-fine architecture, our method achieved competitive results with higher efficiency. We comprehensively compared both the baseline model and the full model with other approaches, and illustrated the general effectiveness of the coarse-to-fine architecture. Finally, parameter sensitivity analyses and running-time comparisons were provided to show the robustness and efficiency of our model.
Acknowledgment
This work was supported in part by the National Key Technology Research and Development Program of the Ministry of Science and Technology of China (No. 2015BAK36B00), in part by the Key Science and Technology Program of Shenzhen (No. CXZZ20150814155434903), in part by the Key Program for International S&T Cooperation Projects of China (No. 2016YFE0121200), and in part by the National Natural Science Foundation of China (Nos. 61571205 and 61772220).
References
 [1] A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 185–207, 2013.
 [2] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE transactions on image processing, vol. 24, no. 12, pp. 5706–5722, 2015.
 [3] C. Goldberg, T. Chen, F.-L. Zhang, A. Shamir, and S.-M. Hu, “Data-driven object manipulation in images,” in Computer Graphics Forum, vol. 31, no. 2pt1. Wiley Online Library, 2012.
 [4] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 10, pp. 1915–1926, 2012.
 [5] K. Gu, S. Wang, H. Yang, W. Lin, G. Zhai, X. Yang, and W. Zhang, “Saliency-guided quality assessment of screen content images,” IEEE transactions on multimedia, vol. 18, no. 6, pp. 1098–1110, 2016.
 [6] L. Tang, Q. Wu, W. Li, and Y. Liu, “Deep saliency quality assessment network with joint metric,” IEEE Access, vol. 6, pp. 913–924, 2018.
 [7] V. Sitzmann, A. Serrano, A. Pavel, M. Agrawala, D. Gutierrez, B. Masia, and G. Wetzstein, “Saliency in VR: How do people explore virtual environments?” IEEE transactions on visualization and computer graphics, vol. 24, no. 4, pp. 1633–1642, 2018.
 [8] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, p. 11, 2011.
 [9] J. Yan, M. Zhu, H. Liu, and Y. Liu, “Visual saliency detection via sparsity pursuit,” IEEE Signal Processing Letters, vol. 17, no. 8, pp. 739–742, 2010.
 [10] C. Lang, G. Liu, J. Yu, and S. Yan, “Saliency detection by multi-task sparsity pursuit,” IEEE transactions on image processing, vol. 21, no. 3, pp. 1327–1338, 2012.
 [11] W. Zou, K. Kpalma, Z. Liu, and J. Ronsin, “Segmentation driven low-rank matrix recovery for saliency detection,” in BMVC, 2013.
 [12] H. Peng, B. Li, H. Ling, W. Hu, W. Xiong, and S. J. Maybank, “Salient object detection via structured matrix decomposition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 818–832, 2017.
 [13] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE transactions on pattern analysis and machine intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
 [14] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, “Automatic salient object segmentation based on context and shape prior,” in BMVC, vol. 6, no. 7, 2011.
 [15] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 3, pp. 569–582, 2015.
 [16] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 [17] R. Margolin, A. Tal, and L. Zelnik-Manor, “What makes a patch distinct?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 [18] A. Borji and L. Itti, “Exploiting local and global patch rarities for saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 [19] S. Lu and J.-H. Lim, “Saliency modeling from image histograms,” in European Conference on Computer Vision, 2012.
 [20] S. Lu, C. Tan, and J.-H. Lim, “Robust and efficient saliency modeling from image co-occurrence histograms,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 1, pp. 195–201, 2014.
 [21] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
 [22] Y. Fang, W. Lin, B.-S. Lee, C.-T. Lau, Z. Chen, and C.-W. Lin, “Bottom-up saliency detection model based on human visual sensitivity and amplitude spectrum,” IEEE transactions on multimedia, vol. 14, no. 1, pp. 187–198, 2012.
 [23] J. Li, M. D. Levine, X. An, X. Xu, and H. He, “Visual saliency based on scale-space analysis in the frequency domain,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 4, pp. 996–1010, 2013.
 [24] N. Imamoglu, W. Lin, and Y. Fang, “A saliency detection model using low-level features based on wavelet transform,” IEEE transactions on multimedia, vol. 15, no. 1, pp. 96–105, 2013.
 [25] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 [26] Q. Wang, W. Zheng, and R. Piramuthu, “Grab: Visual saliency via novel graph model and background priors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [27] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
 [28] X. Shen and Y. Wu, “A unified approach to salient object detection via low rank matrix recovery,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 [29] Q. Zhang, Y. Liu, S. Zhu, and J. Han, “Salient object detection based on superpixel clustering and unified low-rank representation,” Computer Vision and Image Understanding, vol. 161, pp. 51–64, 2017.
 [30] J. Li, L. Luo, F. Zhang, J. Yang, and D. Rajan, “Double low rank matrix recovery for saliency fusion,” IEEE transactions on image processing, vol. 25, no. 9, pp. 4421–4432, 2016.
 [31] J. Li, J. Ding, and J. Yang, “Visual salience learning via low rank matrix recovery,” in Asian Conference on Computer Vision, 2014, pp. 112–127.
 [32] J. Li, J. Yang, C. Gong, and Q. Liu, “Saliency fusion via sparse and double low rank decomposition,” Pattern Recognition Letters, 2017.
 [33] R. Huang, W. Feng, and J. Sun, “Saliency and co-saliency detection by low-rank multi-scale fusion,” in IEEE International Conference on Multimedia and Expo (ICME), 2015.
 [34] R. Jenatton, J.-Y. Audibert, and F. Bach, “Structured variable selection with sparsity-inducing norms,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2777–2824, 2011.
 [35] R. Jenatton, G. Obozinski, and F. Bach, “Structured sparse principal component analysis,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
 [36] F. Rapaport, E. Barillot, and J.-P. Vert, “Classification of arrayCGH data using fused SVM,” Bioinformatics, vol. 24, no. 13, pp. i375–i382, 2008.
 [37] S. Lu, V. Mahadevan, and N. Vasconcelos, “Learning optimal seeds for diffusion-based salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
 [38] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,” in Advances in neural information processing systems, 2011.
 [39] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 [40] J. Kim, D. Han, Y.-W. Tai, and J. Kim, “Salient region detection via high-dimensional color transform and local spatial support,” IEEE transactions on image processing, vol. 25, no. 1, pp. 9–23, 2016.
 [41] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, “Saliency detection via dense and sparse reconstruction,” in Proceedings of the IEEE International Conference on Computer Vision, 2013.
 [42] X. Hou, J. Harel, and C. Koch, “Image signature: Highlighting sparse salient regions,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 1, pp. 194–201, 2012.
 [43] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
 [44] J. Wang, H. Jiang, Z. Yuan, M.-M. Cheng, X. Hu, and N. Zheng, “Salient object detection: A discriminative regional feature integration approach,” International journal of computer vision, vol. 123, no. 2, pp. 251–268, 2017.
 [45] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, “Interactively co-segmenting topically related images with intelligent scribble guidance,” International journal of computer vision, vol. 93, no. 3, pp. 273–292, 2011.
 [46] R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
 [47] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.