Salient Object Detection with Purificatory Mechanism and Structural Similarity Loss

Abstract

Image-based salient object detection has made great progress over the past decades, especially after the revival of deep neural networks. With the aid of attention mechanisms that weight image features adaptively, recent advanced deep learning-based models encourage the predicted results to approximate the ground-truth masks with as large predictable areas as possible, thus achieving state-of-the-art performance. However, these methods do not pay enough attention to small areas prone to misprediction. As a result, it is still difficult to accurately locate salient objects due to the existence of regions with indistinguishable foreground and background and regions with complex or fine structures. To address these problems, we propose a novel convolutional neural network with a purificatory mechanism and a structural similarity loss. Specifically, in order to better locate preliminary salient objects, we first introduce the promotion attention, which is based on spatial and channel attention mechanisms and promotes attention to salient regions. Subsequently, for the purpose of restoring the indistinguishable regions, which can be regarded as the error-prone regions of a model, we propose the rectification attention, which is learned from the areas of wrong prediction and guides the network to focus on error-prone regions, thus rectifying errors. Through these two attentions, we use the Purificatory Mechanism to impose strict weights on different regions of the whole salient objects and purify the results of hard-to-distinguish regions, thus accurately predicting the locations and details of salient objects. In addition to paying different attention to these hard-to-distinguish regions, we also consider the structural constraints on complex regions and propose the Structural Similarity Loss. The proposed loss models the region-level pair-wise relationship between regions to assist these regions in calibrating their own saliency values.
In experiments, the proposed purificatory mechanism and structural similarity loss both effectively improve the performance, and the proposed approach outperforms 19 state-of-the-art methods on six datasets with a notable margin. Moreover, the proposed method is efficient and runs at over 27 FPS on a single NVIDIA 1080Ti GPU.

Salient object detection, purificatory mechanism, error-prone region, structural similarity

I Introduction

Visual saliency plays an essential role in the human vision system, which guides human beings to look at the most important information from visual scenes and can be well referred to as the allocation of cognitive resources on information [18, 25]. To model this mechanism of visual saliency, there are two main research branches in computer vision: fixation prediction [17] and salient object detection [4]. This work focuses on the second one (i.e., salient object detection, abbreviated as SOD), which aims to detect and segment the most visually distinctive objects. Over the past years, SOD has made significant progress, and it is also used as an important preliminary step for various vision tasks, such as object recognition [37], tracking [14] and image parsing [21].

Fig. 1: Difficulties that hinder the development of SOD. In (a)(b), there usually exist regions with similar foreground and background, which confuse the models and cause wrong predictions. In (c)(d), complex structures caused by complex illumination or color and fine hollows make it difficult to maintain structural integrity and clarity. Images and ground-truth masks (GT) are from ECSSD [54]. Results are generated by MLM [50] and our approach.

To address the SOD task, lots of learning-based methods [22, 31, 15, 33, 60, 44, 61, 7, 58, 57, 48, 47, 39] have been proposed in recent years, achieving impressive performance on existing benchmark datasets [54, 55, 27, 23, 42, 52]. However, there still exist two difficulties that hinder the development of SOD. First, it is hard to distinguish regions with similar foreground and background. As shown in Fig. 1(a)(b), such regions usually confuse the models and cause wrong predictions, and we name these regions the “error-prone regions” of models. Second, it is difficult to restore complex or fine structures. As displayed in Fig. 1(c)(d), complex structures (e.g., caused by complex illumination and color) and fine structures (e.g., hollows) make it difficult to maintain structural integrity and clarity. These two problems are especially difficult for existing SOD methods to deal with and greatly hinder the performance of SOD. Due to these difficulties, SOD remains a challenging vision task.

To deal with the first difficulty, some methods [30, 6, 59, 51, 63, 49] adopt attention mechanisms to weight the features adaptively and focus on salient regions. For example, Zhang et al. [62] introduced an attention-guided network to integrate multi-level contextual information by utilizing global and local attentions, consistently improving saliency detection performance. Chen et al. [6] proposed the reverse attention to guide side-output residual learning in a top-down manner to restore the salient object parts and details. Although these methods effectively aggregate different forms of features, their overall goal is to make the prediction results approach the ground-truth masks with as large an intersection as possible, which improves the accuracy of areas that are easy to predict. In other words, these methods mainly focus on improving the correctness of large predictable areas, but do not pay enough attention to small error-prone areas. To address the second problem, several methods [46, 26, 50, 29, 11, 36, 49, 39] consider solving the problem of inaccurate boundaries. For example, Wang et al. [46] proposed a local boundary refinement network to recover object boundaries by learning the local contextual information for each spatial position. Wu et al. [50] adopted the foreground contour and edge to guide each other, thereby leading to precise foreground contour prediction and reducing local noise. In these methods, special boundary branches and losses are proposed to attend to boundaries or local details. In this way, these methods mainly take account of the unary supervision to deal with complex and fine structures. But for many complex and fine structures that are influenced by the context, it is difficult to restore them accurately by considering only unary information.

Inspired by these observations and analyses, we propose a novel convolutional neural network with a purificatory mechanism and a structural similarity loss for image-based SOD. In the network, we propose the purificatory mechanism to purify salient objects by promoting predictable regions and rectifying indistinguishable regions. In this mechanism, we first introduce a simple but effective promotion attention based on spatial and channel attention mechanisms to provide the promotion ability, which assists in locating preliminary salient objects. Next, we propose a novel rectification attention, which predicts the error-prone areas and guides the network to pay more attention to these areas to rectify errors in terms of both features and losses. These two attentions are used to impose strict weights on different regions of the whole salient objects and form the purificatory mechanism. In addition, in order to better restore the complex or fine structures of salient objects, we propose a novel structural similarity loss to model and constrain the structural relation on complex regions for better calibrating the saliency values of regions, which can be regarded as an effective supplement to the pixel-level unary constraint. The purificatory mechanism and structural similarity loss are integrated in a progressive manner to pop out salient objects. Experimental results on six public benchmark datasets verify the effectiveness of our method, which consistently outperforms 19 state-of-the-art SOD models with a notable margin. Moreover, the proposed method is efficient and runs at about 27 FPS on a single NVIDIA 1080Ti GPU.

The main contributions of this paper include:

  1. we propose a novel Purificatory Mechanism, which purifies salient objects by promoting predictable regions and rectifying indistinguishable regions;

  2. we present the promotion attention and rectification attention. The former aims to efficiently locate preliminary salient objects, while the latter focuses on distinguishing the details of error-prone areas;

  3. we introduce a novel Structural Similarity Loss to restore the complex or fine structures of salient objects, which constrains the region-level pair-wise relationship between regions to serve as a supplement to the pixel-level unary constraints, assisting regions in calibrating their own saliency values;

  4. we conduct comprehensive experiments and the results verify the effectiveness of our proposed method which consistently outperforms 19 state-of-the-art algorithms on six datasets with a fast prediction.

The rest of this paper is organized as follows: Section II reviews the recent development of salient object detection, attention-based SOD methods and boundary-aware SOD methods. Section III presents the purificatory network in detail. Section IV presents the proposed structural similarity loss. In Section V, we evaluate the proposed model, and compare it with the state-of-the-art methods to validate the effectiveness of the model. We conclude the paper in Section VI.

II Related Work

In this section, we review the related works in three aspects. First, some representative salient object detection methods are introduced. Then, we present attention mechanisms and attention-based SOD methods. Finally, we review the boundary-aware SOD methods.

II-A Salient Object Detection

Hundreds of image-based SOD methods have been proposed in the past decades. Early methods mainly adopted hand-crafted local and global visual features as well as heuristic saliency priors such as color difference [1], distance transformation [40] and local/global contrast [19, 8]. More details about the traditional methods can be found in the survey [4].

With the development of deep learning, many deep neural network (DNN) based methods [22, 31, 15, 33, 60, 44, 61, 7, 58, 57, 48, 47, 39] have been proposed for SOD. Lots of deep models are devoted to fully utilizing feature integration to enhance the performance of DNNs. For example, Lee et al. [22] proposed to compare the low-level features with other parts of an image to form a low-level distance map. Then they concatenated the encoded low-level distance map with high-level features extracted by VGG [38], and connected them to a DNN-based classifier to evaluate the saliency of a query region. Liu et al. [31] presented DHSNet, which first made a coarse global prediction by learning various global structured saliency cues and then adopted a recurrent convolutional neural network to refine the details of saliency maps by integrating local contexts step by step, working in a global-to-local and coarse-to-fine manner.

In addition, Hou et al. [15] introduced short connections to the skip-layer structures, which provided rich multi-scale feature maps at each layer, performing salient object detection. Luo et al. [33] proposed a convolutional neural network by combining global and local information through a multi-resolution grid structure to simplify the model architecture and speed up the computation. Zhang et al. [60] adopted a framework to aggregate multi-level convolutional features into multiple resolutions, which were then combined to predict saliency maps in a recursive manner. Wang et al. [44] proposed a pyramid pooling module and a multi-stage refinement mechanism to gather contextual information and stage-wise results, respectively. Zhang et al. [61] utilized the deep uncertain convolutional features and proposed a reformulated dropout after specific convolutional layers to construct an uncertain ensemble of internal feature units. Chen et al. [7] incorporated human fixation with semantic information to simulate the human annotation process to form two-stream fixation-semantic CNNs, which were fused by an inception-segmentation module. Zhang et al. [58] proposed a novel bi-directional message passing model to integrate multi-level features for SOD.

These methods usually integrate multi-scale and multi-level features through complex structures to improve the representation ability of DNNs. To integrate these features simply and effectively, we add lateral connections to transfer encoded features to assist the decoder and adopt a top-down architecture to propagate high-level semantics to low-level details as guidance for locating salient objects as well as restoring object details.

II-B Attention-based Methods

The attention mechanism of DNNs is inspired by the human perception process, weighting features to encourage a model to focus on important information. The mechanism was first applied in machine translation [3] and then widely used in the field of computer vision due to its effectiveness. For example, Mnih et al. [35] applied an attention-based model to image classification tasks. In [5], SCA-CNN, which incorporates spatial and channel-wise attention mechanisms in a CNN, was proposed to modulate the sentence generation context in multi-layer feature maps, encoding where and what the visual attention is, for the task of image captioning. Chu et al. [9] combined a holistic attention model focusing on global consistency with a body part attention model focusing on detailed descriptions for human pose estimation. Fu et al. [12] proposed the dual attention network, which adopts a position attention module to aggregate features at each position and a channel attention module to emphasize interdependent channel maps for scene segmentation.

Due to the effectiveness of attention mechanisms for feature enhancement, they have also been applied to saliency detection. Liu et al. [30] proposed a pixel-wise contextual attention network that learns to attend to informative context locations for each pixel through two attentions, global attention and local attention, which guide the network to attend to global and local contexts, respectively. Feng et al. [11] designed attentive feedback modules to control the message passing between encoder and decoder blocks, which provides an opportunity for error correction. Zhang et al. [59] leveraged captioning to boost semantics for salient object detection and introduced a textual attention mechanism to weight the importance of each word in the caption. In [51], a holistic attention module was proposed to enlarge the coverage area of the initial saliency maps, since some objects in complex scenes are hard to segment completely. Zhao and Wu [63] presented a pyramid feature attention network to enhance the high-level context features and the low-level spatial structural features. Wang et al. [49] proposed a pyramid attention structure to improve the representation ability of the corresponding network layer with an enlarged receptive field.

In the above methods, attention mechanisms (spatial attention and channel attention) are used to enhance the localization and awareness of salient objects. These attentions play a good role in promoting feature attention to salient regions, but lack attention to small regions prone to mis-prediction. Unlike these methods, we propose the purificatory mechanism, which introduces two novel attentions: the promotion attention and the rectification attention. The former is dedicated to promoting the feature representation of salient regions, while the latter is dedicated to rectifying the features of error-prone regions.

II-C Boundary-aware Methods

Some methods [46, 26, 50, 29, 11, 36, 49, 39] consider unclear object boundaries and inconsistent local details to be important factors affecting the performance of SOD. Li et al. [26] considered contours as useful priors and proposed to facilitate feature learning in SOD by transferring knowledge from an existing contour detection model. In [29], an edge detection branch was used to assist the deep neural network in further sharpening the details of salient objects by joint training. Feng et al. [11] presented a boundary-enhanced loss for learning fine boundaries, which works with the cross-entropy loss for saliency detection. Qin et al. [36] also proposed a loss for boundary-aware SOD that guides the network to learn at three levels: pixel-level, patch-level and map-level. In [49], a salient edge detection module was introduced to emphasize the importance of salient edge information, encouraging better edge-preserving SOD. Su et al. [39] proposed a boundary-aware network, which splits salient objects into boundaries and interiors, extracts features from different regions to ensure the representation of each region, and then fuses them to obtain good results.

Fig. 2: The framework of our approach. We first extract the common features by the extractor, which provides features for the other three subnetworks. In detail, the promotion subnetwork produces the promotion attention to guide the model to focus on salient regions, while the rectification subnetwork gives the rectification attention for rectifying errors. These two kinds of attention are combined to form the purificatory mechanism, which is integrated in the purificatory subnetwork to refine the prediction of salient objects progressively.

These methods usually utilize special boundary branches and losses to attend to boundaries or local details. But for many complex and fine structures that are influenced by the context, it is difficult to restore them accurately by considering only unary information. Our method differs from these methods by introducing the structural similarity loss, which models and constrains the pair-wise structural relation on complex regions for better calibrating the saliency values of regions, and is an effective supplement to the pixel-level unary constraint.

III Purificatory Network

To address these problems (i.e., indistinguishable regions and complex structures), we propose a novel purificatory network (denoted as PurNet) for SOD. In this method, different regions are attended by corresponding attentions, i.e., promotion attention and rectification attention. The first one is to promote attention in salient regions and the second one aims to rectify errors for salient regions. In terms of the architecture, the network includes four parts: the feature extractor, the promotion subnetwork, the rectification subnetwork and the purificatory subnetwork. In this section, we first overview the whole purificatory network and then introduce each part separately. Details of the proposed approach are described as follows.

III-A Overview

A diagram of the top-down architecture with feature transferring and utilization is shown in Fig. 3. The proposed PurNet has a top-down basic architecture with lateral connections, which is inspired by the feature pyramid network (FPN) [28] and follows the encoder-decoder form. In our method, PurNet consists of four parts, where the first part (i.e., the extractor) provides the common features for the other three (regarded as decoders). Each of the remaining three parts forms an encoder-decoder pair with the feature extractor and decodes the received features respectively. Among these three decoders, the promotion subnetwork provides the promotion features and the rectification subnetwork provides the rectification features, while the purificatory subnetwork uses the purificatory mechanism to refine the prediction of SOD progressively.

III-B Feature Extractor

As shown in Fig. 3, the purificatory network takes ResNet-50 [13] as the feature extractor, which is modified by removing the last global pooling and fully connected layers for the pixel-level prediction task. The feature extractor has five residual modules for encoding. To obtain larger feature maps, the strides of all convolutional layers belonging to the last two residual modules are set to 1. To further enlarge the receptive fields of high-level features, we set the dilation rates [56] of the convolution layers in these two modules to 2 and 4, respectively. Given an input image, the feature extractor outputs the corresponding encoded feature map.

In order to integrate multi-level and multi-scale features, we adopt lateral connections to transfer the features of each encoding module to the decoder through a convolution layer with 128 kernels, which also compresses the channels of high-level features for later processing and integration. In addition, we use a top-down architecture to propagate high-level semantics to low-level details as guidance for locating salient objects as well as restoring object details. In this architecture, features from the same-level encoding module and higher-level decoding features are added, and a convolution layer with 128 kernels is used to decode these features. We use learnable deconvolution to perform upsampling to align and restore features.
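The lateral connections and top-down fusion described above can be sketched as follows in PyTorch. The 128-channel lateral features follow the text; the exact kernel sizes, the default encoder channel widths, and the use of `ConvTranspose2d` as the learnable deconvolution are assumptions, so this is an illustrative sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TopDownDecoder(nn.Module):
    """Sketch of the top-down decoder with lateral connections.

    encoder_channels are placeholder widths for the five encoding modules of
    a modified ResNet-50; kernel sizes are assumed (1x1 lateral, 3x3 decode).
    """

    def __init__(self, encoder_channels=(64, 256, 512, 1024, 2048)):
        super().__init__()
        # Lateral convolutions compress each encoder stage to 128 channels.
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, 128, kernel_size=1) for c in encoder_channels]
        )
        # Decoding convolutions fuse same-level and higher-level features.
        self.decoders = nn.ModuleList(
            [nn.Conv2d(128, 128, kernel_size=3, padding=1) for _ in encoder_channels]
        )
        # Learnable deconvolution for 2x upsampling between stages.
        self.upsample = nn.ConvTranspose2d(128, 128, kernel_size=4, stride=2, padding=1)

    def forward(self, encoder_feats):
        # encoder_feats: list of feature maps, ordered low level -> high level.
        laterals = [lat(f) for lat, f in zip(self.laterals, encoder_feats)]
        out = self.decoders[-1](laterals[-1])
        decoded = [out]
        for i in range(len(laterals) - 2, -1, -1):
            top = out
            if top.shape[-2:] != laterals[i].shape[-2:]:
                top = self.upsample(top)        # restore spatial resolution
            out = self.decoders[i](laterals[i] + top)  # element-wise addition
            decoded.insert(0, out)
        return decoded
```

Each of the three subnetworks would then consume such a list of 128-channel decoding features.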

For each of the following three subnetworks (i.e., the promotion, rectification and purificatory subnetworks), there is a corresponding set of learned decoding features. The three subnetworks mainly process these decoding features and predict the corresponding expected results.

Fig. 3: The backbone of the feature extractor. We adopt five residual modules for encoding, lateral connections that transfer the features of each encoding module to the decoder through a convolution layer with 128 kernels for utilizing multi-level and multi-scale features, and convolution layers with 128 kernels followed by an upsampling deconvolution layer for decoding and restoring features.

III-C Promotion Subnetwork

Promotion Attention

In general, when there are distractions in the background, the location of salient objects is difficult to detect, as shown in Fig. 1(a)(b). Therefore, some methods [30, 62, 6] make their models focus on the salient regions through spatial attention and channel attention mechanisms. Between these two mechanisms, the former can be used to enhance the localization capability, while the latter aims to enhance semantic information [63]. They both promote attention to salient regions and have proven to be effective. In order to take advantage of attention mechanisms and reduce the complicated manual design in existing methods, we propose a simple but effective promotion attention based on spatial and channel attention mechanisms to provide the promotion ability.

We present the structure of the promotion attention module in Fig. 4. This module is based on existing spatial and channel attentions without additional parameters. We denote the input convolutional features as $F$. The promotion attention $A_p$ is generated as follows:

$A_p = \mathcal{S}_s(F) \otimes \mathcal{S}_c(\mathrm{GAP}(F))$  (1)

where $\mathcal{S}_s$ and $\mathcal{S}_c$ denote the Softmax operation on the spatial and channel dimension respectively, $\mathrm{GAP}(\cdot)$ is the operation of global average pooling, and $\otimes$ represents the element-wise product.

In Eq. (1), the first item is the spatial attention, where a Softmax operation on the spatial dimension is directly conducted to obtain the spatial weights; the second item is the channel attention, where global average pooling is adopted to remove the spatial effect, yielding a vector whose length equals the number of channels, followed by a Softmax operation on the channel dimension to obtain the channel weights. In this manner, the attentions on the spatial and channel dimensions are decoupled, and they are integrated by an element-wise product operation. Some visual examples can be found in the third column of Fig. 5.
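The parameter-free attention described above can be written in a few lines. This is a best-effort sketch of the verbal description (spatial Softmax multiplied element-wise by the channel Softmax of the pooled feature vector); the function name is ours, not the paper's.

```python
import torch
import torch.nn.functional as fn

def promotion_attention(feat):
    """Parameter-free promotion attention: spatial Softmax of the features,
    multiplied element-wise by the channel Softmax of the globally
    average-pooled feature vector (a sketch of Eq. (1) in the text)."""
    b, c, h, w = feat.shape
    # Spatial attention: Softmax over the H*W positions of every channel map.
    spatial = fn.softmax(feat.view(b, c, h * w), dim=2).view(b, c, h, w)
    # Channel attention: global average pooling, then Softmax over channels.
    channel = fn.softmax(feat.mean(dim=(2, 3)), dim=1).view(b, c, 1, 1)
    # Element-wise product (with broadcasting) integrates the two attentions.
    return spatial * channel
```

Because each spatial Softmax sums to 1 per channel and the channel weights sum to 1, the resulting attention map sums to 1 over the whole tensor, acting as a joint distribution over positions and channels.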

Subnetwork

As shown in Fig. 2, the promotion attention module exists in the promotion subnetwork. In the promotion subnetwork, features from the five lateral connections of the feature extractor are first decoded. Then, each branch processes one of the different-level decoding features. For each branch, we represent the input decoding convolutional features as $F$ (the same features as in Eq. (1)). The promotion attention module is then utilized to weight the input features by the following operation:

$\tilde{F} = F \otimes A_p$  (2)

The generated features are then classified by a classifier, which includes two convolution layers with 128 kernels and a convolution layer with a single kernel, followed by a Sigmoid and an upsampling operation.

Fig. 4: The structure of the promotion attention module. The Softmax operation on the spatial dimension is used to extract the spatial attention, and global average pooling followed by the Softmax operation on the channel dimension is used to obtain the channel attention. The two attentions are multiplied to form the promotion attention.

For the sake of simplification, we denote the predicted maps of the five branches of the promotion subnetwork as $P_1, \dots, P_5$. As mentioned earlier, the promotion subnetwork aims to learn the promotion attention. To achieve this, we expect the outputs of the promotion attention modules to approximate the ground-truth mask of SOD (represented as $G$) by minimizing the loss:

$\mathcal{L}_{pro} = \sum_{i=1}^{5} \mathcal{L}_{bce}(P_i, G)$  (3)

where $\mathcal{L}_{bce}$ means the binary cross-entropy loss function with the following formulation:

$\mathcal{L}_{bce}(P, G) = -\sum_{j} \left[ g_j \log p_j + (1 - g_j) \log (1 - p_j) \right]$  (4)

where $p_j$ and $g_j$ represent the $j$th pixel of the predicted map and the ground-truth mask of salient objects, respectively.

By taking multi-level lateral features from the feature extractor as input, the promotion subnetwork can learn the promotion attention in a multi-scale manner, which is fed to the purificatory subnetwork to promote attention to salient regions and demonstrates a strong promotion ability for SOD.

Fig. 5: Visual examples of the purificatory mechanism. GT: ground-truth mask, PA: the promotion attention, RA: the rectification attention, Ours: prediction of our approach.

III-D Rectification Subnetwork

Rectification Attention

As shown in Fig. 1(a)(b), it is difficult to accurately define the attributes and locations of some error-prone areas (e.g., salient regions confused with background). Therefore, we propose the rectification attention to guide the model to focus on these error-prone areas for error correction.

The structure of the rectification attention module is shown in Fig. 6. This module exists in the rectification branch. We represent the input features as $F$. Two parallel convolution branches are then used to process the input features, where each branch has two convolution layers with 128 kernels followed by a convolution layer with a single kernel. We denote the outputs of these two branches as $F_g$ and $F_o$, which represent the features of gross regions and object regions (named the gross branch and the object branch). The gross features represent potential comprehensive features, while the object features represent predictable features of the object body, so their difference represents mispredicted features. Therefore, we use the subtraction $F_g - F_o$ as the features of error-prone regions. Next, the rectification attention $A_r$ is generated as follows:

$A_r = \mathcal{T}(F_g - F_o)$  (5)

where $\mathcal{T}$ is the Tanh function, which maps the features into the range $[-1, 1]$ to obtain the rectification attention. The rectification attention attends to error-prone regions, which are important but almost undiscovered information for SOD. Some examples of the rectification attention are shown in the fourth column of Fig. 5.
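A minimal sketch of the two-branch construction follows. The 128-channel layers and the single-kernel output follow the textual description; the 3x3 kernel sizes, the ReLU activations and the class/method names are assumptions of ours.

```python
import torch
import torch.nn as nn

class RectificationAttention(nn.Module):
    """Sketch of the rectification attention: a gross branch and an object
    branch produce F_g and F_o; Tanh of their difference highlights
    error-prone regions with values in [-1, 1]."""

    def __init__(self, in_channels=128):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, 1, 1),
            )
        self.gross = branch()   # potential comprehensive features F_g
        self.object = branch()  # predictable features of the object body F_o

    def forward(self, feat):
        f_g = self.gross(feat)
        f_o = self.object(feat)
        # Tanh maps the difference into [-1, 1] as the rectification attention.
        attention = torch.tanh(f_g - f_o)
        return attention, f_o
```

The returned object features can then be weighted by the attention and classified, as described for the rectification subnetwork below.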

Subnetwork

Similar to the promotion attention module, the rectification attention module exists in the rectification subnetwork, as shown in Fig. 6. In this subnetwork, features from the five lateral connections of the feature extractor are decoded and serve as the input to each rectification branch in a multi-level manner. For each branch, the rectification attention is used to weight the object features as follows:

$\tilde{F}_o = F_o \otimes A_r$  (6)

The generated features are fed to a classifier, which is the same as the classifier in the promotion subnetwork. We denote the object outputs of the five branches of the rectification subnetwork as $R_1, \dots, R_5$. The outputs of the classifiers are expected to approximate the ground-truth masks of SOD. The optimization objective to be minimized is as follows:

$\mathcal{L}_{rec} = \sum_{i=1}^{5} \mathcal{L}_{bce}(R_i, G)$  (7)
Fig. 6: The structure of the rectification branch. $\mathcal{T}$ is the Tanh function. The output of the Classifier predicts salient objects, while the output of the Regressor predicts errors in the saliency prediction.

In addition, an additional regressor is applied to the error-prone features $F_g - F_o$. The regressor consists of two convolution layers with 128 kernels and a convolution layer with a single kernel, followed by a Tanh operation. Its outputs are the error-prone predictions of the rectification subnetwork, denoted as $\hat{E}_1, \dots, \hat{E}_5$. The outputs of the regressors aim to approximate the error maps of the object predictions, where the error map is defined as $E_i = G - R_i$. Obviously, the value of $E_i$ is in the range $[-1, 1]$. In order to learn the error map, we drive the predicted error map to approach its ground truth by minimizing the KL-divergence:

$\mathcal{L}_{err} = \sum_{i=1}^{5} \mathcal{L}_{kl}(\hat{E}_i, E_i)$  (8)

where $\mathcal{L}_{kl}$ means the KL-divergence with the following formulation:

$\mathcal{L}_{kl}(\hat{E}, E) = \sum_{j} \mathcal{N}(e_j) \log \frac{\mathcal{N}(e_j)}{\mathcal{N}(\hat{e}_j)}$  (9)

where $\hat{e}_j$ and $e_j$ are the $j$th pixel of the predicted error map $\hat{E}$ and the ground-truth error map $E$, respectively. Also, $\mathcal{N}(\cdot)$ in the above equation is a normalization operation, which casts $\hat{e}_j$ and $e_j$ into the range [0, 1]. In our method, we add 1 to the input and divide by 2 as this operation.
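The error-map supervision above can be sketched as follows. The normalization is exactly the "(x + 1) / 2" described in the text; the direction of the divergence (ground truth relative to prediction) and the sum reduction are our assumptions.

```python
import torch

def error_map_kl_loss(pred_error, gt_error, eps=1e-8):
    """KL-divergence between predicted and ground-truth error maps.
    Both maps lie in [-1, 1]; N(x) = (x + 1) / 2 casts them into [0, 1]
    before the divergence is computed. eps guards against log(0)."""
    p = (gt_error + 1.0) / 2.0    # N(.) applied to the ground-truth error map
    q = (pred_error + 1.0) / 2.0  # N(.) applied to the predicted error map
    return (p * torch.log((p + eps) / (q + eps))).sum()
```

When the predicted error map matches the ground truth exactly, the loss is zero; any disagreement produces a positive penalty.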

Through these operations, the rectification subnetwork provides the rectification attention and the predicted error maps to the purificatory subnetwork, driving PurNet to focus on the error-prone regions and rectify wrong predictions.

Fig. 7: The partial structure of the purificatory subnetwork. The purificatory subnetwork integrates the promotion and rectification attentions by the purificatory mechanism.


III-E Purificatory Subnetwork

Usage of Promotion and Rectification Attention

Similar to the promotion subnetwork and rectification subnetwork, the purificatory subnetwork processes the features from feature extractor in a top-down manner, which can refine the SOD prediction progressively.

In our approach, the body of salient objects is first promoted with the help of the promotion attention, and then the error-prone regions of salient objects are rectified with the aid of the rectification attention. Therefore, these two attentions are combined to purify the salient objects. The purificatory mechanism is integrated in the purificatory subnetwork, whose structure is shown in Fig. 7. For the $i$th decoding stage, the input features $F_i$ are first weighted by the promotion attention $A_{p,i}$ through the operation:

$\hat{F}_i = F_i \otimes A_{p,i}$  (10)

Then a convolution layer with 128 kernels is used to convolve these features into $\hat{F}'_i$. Next, the rectification attention $A_{r,i}$ is used to weight the produced features as follows:

$\tilde{F}_i = \hat{F}'_i \otimes A_{r,i}$  (11)

The generated features are input to a classifier, which is the same as the classifier in the promotion subnetwork: two convolution layers with 128 kernels and a convolution layer with a single kernel, followed by a Sigmoid and an upsampling operation.

We represent the outputs of the five stages of the purificatory subnetwork as $S_1, \dots, S_5$. The outputs of the classifiers are expected to approximate the ground truths of SOD. The loss is formed by the following operation:

$\mathcal{L}_{pur} = \sum_{i=1}^{5} \mathcal{L}_{ibce}(S_i, G, \hat{E}_i)$  (12)

where $\hat{E}_i$ represents the predicted error maps and $\mathcal{L}_{ibce}$ means the improved binary cross-entropy loss function with the error map from the rectification subnetwork. We give its definition in Section III-E2.

Improved Loss Function

The predicted error map can be used to penalize the error-prone areas of the predicted saliency map in the purificatory subnetwork. With this extra constraint, the error-prone areas in the final prediction can be better refined. Toward this end, we propose to optimize the saliency maps to approximate the ground-truth masks of SOD by minimizing the improved binary cross-entropy loss (see Eq. (12)), which is defined as follows:

(13)

where the cross-entropy loss at each pixel is weighted by the absolute value of the corresponding pixel of the predicted error map. In this way, the loss penalizes the error-prone areas with a larger weight to rectify possible errors.
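As a sketch of how such an error-map-weighted cross-entropy could look; the exact weighting form (here 1 + |error|) is an assumption made for illustration, since the precise formula is given in Eq. (13):

```python
import numpy as np

def improved_bce(pred, gt, error_map, eps=1e-7):
    """Binary cross-entropy weighted per pixel by a predicted error map.

    Pixels flagged as error-prone (|error_map| close to 1) receive a larger
    weight, so mistakes there are penalized more heavily. All arrays share
    the same shape and lie in [0, 1].
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    bce = -(gt * np.log(pred) + (1.0 - gt) * np.log(1.0 - pred))
    weight = 1.0 + np.abs(error_map)  # reduces to plain BCE where no error is predicted
    return float(np.mean(weight * bce))
```

With a zero error map this reduces to the standard binary cross-entropy, matching the intent that only error-prone areas get extra penalty.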

IV Structural Similarity Loss

Through the purificatory mechanism, different regions (i.e., simple regions and error-prone regions) of salient objects are processed and the performance is greatly improved. In addition to paying different attention to these indistinguishable regions, we also consider the structural constraints on complex regions as useful information for salient object detection. Toward this end, we propose a novel structural similarity loss (as shown in Fig. 8) to constrain the region-level pairwise relationship between regions to calibrate the saliency values.

In general, current methods (e.g., [30, 6, 44]) mainly adopt the binary cross-entropy loss function as the optimization objective, which imposes a pixel-level unary constraint on the prediction by the formulation of Eq. (4). However, Eq. (4) only considers the relationship between each pixel and its corresponding ground-truth value, and does not take account of the relationship between different pixels or regions. As a result, the saliency of whole local areas is sometimes predicted completely incorrectly, which stems from the lack of region-level relationship constraints.

Fig. 8: Construction of the structural similarity matrix. (a) image, (b) super-pixels, (c) ground truth, (d) saliency map of our approach, (e)(f) structural matrices of the ground truth and saliency map.

To address this problem, we propose to model the region-level pair-wise relationship as a supplement to the unary constraint and correct the probable errors. For this purpose, we first construct a directed graph for each image, in which each node represents a region of the image and each directed edge represents the relation between two regions. Regions in an image are easily generated by existing methods [2, 41], and we adopt the SLIC algorithm [2] to over-segment an RGB image into super-pixels as regions. We apply the locations of the super-pixels of RGB images to their corresponding ground truths and predicted saliency maps, and thereby obtain the regions of the ground truths and predicted saliency maps.
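Given region labels (e.g., from a SLIC over-segmentation), the per-region saliency values described next can be computed as a simple per-region average. The function below is an illustrative numpy sketch with hypothetical names; it assumes `labels` assigns every pixel an integer region id:

```python
import numpy as np

def region_saliency(sal_map, labels, n_regions):
    """Average saliency of each super-pixel region.

    sal_map:   (H, W) saliency values in [0, 1]
    labels:    (H, W) integer region ids in [0, n_regions)
    returns:   (n_regions,) mean saliency per region
    """
    sal = sal_map.ravel()
    lab = labels.ravel()
    sums = np.bincount(lab, weights=sal, minlength=n_regions)
    counts = np.bincount(lab, minlength=n_regions)
    return sums / np.maximum(counts, 1)  # guard against empty regions
```

The same function applies unchanged to the ground-truth mask and the predicted saliency map, since both share the region labels of the RGB image.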

For the ground truth and predicted saliency map of an RGB image, we define the saliency value of a region as the average of the saliency values of the pixels in this region. To model the relationship of regions, we use the difference of the saliency values between the two corresponding nodes to represent the weight of each edge. Then, we construct a structural matrix M to model the overall pair-wise relationship of an image, as shown in Fig. 8: each entry of M represents the weight of the corresponding edge. In this manner, we can construct structural matrices for the ground-truth mask and the predicted saliency map of every image. The ground truth and saliency map are expected to have a similar structure, so we drive the structural matrix of the prediction to approximate that of the ground truth by minimizing the KL-divergence:

(14)

where the normalization operation of Eq. (8) is applied, and the two matrices correspond to the predicted saliency map and the ground-truth mask of an image, respectively. This loss function is named the structural similarity loss (denoted as SSL).

In this work, we apply the SSL to the outputs of each stage in the purificatory network, and the overall structural similarity loss is formulated as follows:

(15)

By combining the losses of Eqs. (3), (7), (8), (12) and (15), the overall learning objective can be formulated as follows:

(16)

where the parameters of all subnetworks are grouped into one set for convenience of presentation.

Models Year ECSSD DUT-OMRON PASCAL-S HKU-IS DUTS-TE XPIE
MAE Fβw Fβ MAE Fβw Fβ MAE Fβw Fβ MAE Fβw Fβ MAE Fβw Fβ MAE Fβw Fβ
KSR [45] 2016 0.132 0.633 0.810 0.131 0.486 0.625 0.157 0.569 0.773 0.120 0.586 0.773 - - - - - -
HDHF [24] 2016 0.105 0.705 0.834 0.092 0.565 0.681 0.147 0.586 0.761 0.129 0.564 0.812 - - - - - -
ELD [22] 2016 0.078 0.786 0.829 0.091 0.596 0.636 0.124 0.669 0.746 0.063 0.780 0.827 0.092 0.608 0.647 0.085 0.698 0.746
UCF [61] 2017 0.069 0.807 0.865 0.120 0.574 0.649 0.116 0.696 0.776 0.062 0.779 0.838 0.112 0.596 0.670 0.095 0.693 0.773
NLDF [33] 2017 0.063 0.839 0.892 0.080 0.634 0.715 0.101 0.737 0.806 0.048 0.838 0.884 0.065 0.710 0.762 0.068 0.762 0.825
Amulet [60] 2017 0.059 0.840 0.882 0.098 0.626 0.673 0.099 0.736 0.795 0.051 0.817 0.853 0.085 0.658 0.705 0.074 0.743 0.796
FSN [7] 2017 0.053 0.862 0.889 0.066 0.694 0.733 0.095 0.751 0.804 0.044 0.845 0.869 0.069 0.692 0.728 0.066 0.762 0.812
SRM [44] 2017 0.054 0.853 0.902 0.069 0.658 0.727 0.086 0.759 0.820 0.046 0.835 0.882 0.059 0.722 0.771 0.057 0.783 0.841
C2SNet [26] 2018 0.057 0.844 0.878 0.079 0.643 0.693 0.086 0.764 0.805 0.050 0.823 0.854 0.065 0.705 0.740 0.066 0.764 0.807
RA [6] 2018 0.056 0.857 0.901 0.062 0.695 0.736 0.105 0.734 0.811 0.045 0.843 0.881 0.059 0.740 0.772 0.067 0.776 0.836
Picanet [30] 2018 0.047 0.866 0.902 0.065 0.695 0.736 0.077 0.778 0.826 0.043 0.840 0.878 0.051 0.755 0.778 0.052 0.799 0.843
PAGRN [62] 2018 0.061 0.834 0.912 0.071 0.622 0.740 0.094 0.733 0.831 0.048 0.820 0.896 0.055 0.724 0.804 - - -
R3 [10] 2018 0.040 0.902 0.924 0.063 0.728 0.768 0.095 0.760 0.834 0.036 0.877 0.902 0.057 0.765 0.805 0.058 0.805 0.854
DGRL [46] 2018 0.043 0.883 0.910 0.063 0.697 0.730 0.076 0.788 0.826 0.037 0.865 0.888 0.051 0.760 0.781 0.048 0.818 0.859
RFCN [43] 2019 0.067 0.824 0.883 0.077 0.635 0.700 0.106 0.720 0.802 0.055 0.803 0.864 0.074 0.663 0.731 0.073 0.736 0.809
DSS [16] 2019 0.052 0.872 0.918 0.063 0.697 0.775 0.098 0.756 0.833 0.040 0.867 0.904 0.056 0.755 0.810 0.065 0.784 0.849
MLM [50] 2019 0.045 0.871 0.897 0.064 0.681 0.719 0.077 0.778 0.813 0.039 0.859 0.882 0.049 0.761 0.776 - - -
ICTBI [47] 2019 0.041 0.881 0.909 0.061 0.730 0.758 0.071 0.788 0.826 0.038 0.856 0.890 0.048 0.762 0.797 - - -
AFNet [11] 2019 0.042 0.886 0.916 0.057 0.717 0.761 0.073 0.797 0.839 0.036 0.869 0.895 0.046 0.785 0.807 0.047 0.822 0.859
Ours - 0.035 0.907 0.928 0.051 0.747 0.776 0.070 0.805 0.847 0.031 0.889 0.904 0.039 0.817 0.829 0.041 0.843 0.876
TABLE I: Performance on six benchmark datasets. Smaller MAE and larger Fβw and Fβ correspond to better performance. The best and second-best results are in red and blue fonts. “-” means the results could not be obtained, and marked results are post-processed by a dense conditional random field (CRF) [20]. Note that the backbone of R3Net is ResNeXt-101 [53].

V Experiments

V-A Experimental Setup

Datasets

To evaluate the performance of our method, we conduct experiments on six benchmark datasets [54, 55, 27, 23, 42, 52], briefly described as follows: ECSSD [54] consists of 1,000 images with complex and semantically meaningful objects. DUT-OMRON [55] has 5,168 complex images that are downsampled to a maximal side length of 400 pixels. PASCAL-S [27] includes 850 natural images that are pre-segmented into objects or regions, with salient object annotations obtained from an eye-tracking test with 8 subjects. HKU-IS [23] contains 4,447 images, which usually contain multiple disconnected salient objects or salient objects that touch the image boundary. DUTS [42] is a large-scale dataset containing 10,553 training images (named DUTS-TR) and 5,019 test images (named DUTS-TE); its images are challenging, with salient objects occupying various locations and scales as well as complex backgrounds. XPIE [52] is also a large dataset, with 10,000 images covering a variety of simple and complex scenes with various salient objects.

Evaluation Metrics

We choose the mean absolute error (MAE), weighted F-measure score (Fβw) [34], F-measure score (Fβ), and F-measure curve to evaluate our method. MAE is the average pixel-wise absolute difference between ground-truth masks and estimated saliency maps. In computing Fβ, we normalize the predicted saliency maps into the range [0, 255] and binarize them with a threshold sliding from 0 to 255 to compare the binary maps with ground-truth masks. At each threshold, Precision and Recall can be computed, and Fβ is computed as:

(17)

where the balance weight is set to 0.3 to emphasize Precision over Recall, as suggested in [1]. Then we can plot the F-measure curve based on all the binary maps over all saliency maps in a given dataset. We report Fβ using an adaptive threshold for generating a binary saliency map, where the threshold is computed as twice the mean value of the saliency map. In addition, Fβw is used to evaluate the overall performance (more details can be found in [34]).
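The adaptive-threshold F-measure described above can be sketched as follows; the array shapes and the clamping of the threshold to 1 are assumptions made for this illustration:

```python
import numpy as np

def adaptive_f_measure(sal, gt, beta2=0.3, eps=1e-8):
    """F-measure with the adaptive threshold (twice the map's mean value).

    sal: (H, W) saliency map in [0, 1]
    gt:  (H, W) ground-truth mask in {0, 1}
    beta2 = 0.3 emphasizes Precision over Recall, as in the text.
    """
    thr = min(2.0 * float(sal.mean()), 1.0)  # adaptive threshold, clamped to 1
    binary = sal >= thr
    gt = gt > 0.5
    tp = float(np.logical_and(binary, gt).sum())
    precision = tp / (binary.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1.0 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```

Sweeping the threshold from 0 to 255 instead of using the adaptive value, and recording Precision/Recall at each step, yields the F-measure curve.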

Training and Inference

We train the network in three stages as follows: 1⃝ we first train the feature extractor and purificatory subnetwork with their corresponding losses; 2⃝ we then fix the purificatory subnetwork and train the promotion and rectification subnetworks; 3⃝ finally, we train the whole network with the overall loss in Eq. (16).

We use the standard stochastic gradient descent algorithm to train our network end-to-end by optimizing the learning objective in Eq. (16). In the optimization process, the parameters of the feature extractor are initialized by the pre-trained ResNet-50 model [13]; the momentum is set to 0.9, and the learning rates of the remaining layers are set to 10 times larger than that of the backbone. Besides, we employ the “poly” learning rate policy for all experiments, similar to [32]. We train our network on the training set of the DUTS-TR dataset [42], as used in [46, 62, 30, 44], which comprises per-pixel ground-truth annotations for 10,553 images. The training images are resized to a fixed resolution for faster training and augmented with horizontal flipping. The training process takes about 20 hours and converges after 500k iterations (20k iterations for stage 1⃝, 50k iterations for stage 2⃝ and 200k iterations for stage 3⃝) with a mini-batch size of 8 on a single NVIDIA TITAN Xp GPU. During inference, all losses are removed, and one image is directly fed into the network to produce the saliency map at the output of the first stage of the purificatory network. The network runs at about 27 fps on a single NVIDIA 1080Ti GPU.
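The “poly” learning rate policy mentioned above is commonly defined as base_lr · (1 − iter/max_iter)^power; a minimal sketch, where the power value 0.9 is a common default assumed here rather than taken from the paper:

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """'Poly' policy: the learning rate decays smoothly to zero over max_iter steps."""
    return base_lr * (1.0 - float(iteration) / max_iter) ** power
```

At iteration 0 the schedule returns the base learning rate, and it reaches exactly zero at the final iteration.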

V-B Comparisons with the State-of-the-art

Fig. 9: The F-measure curves of 19 state-of-the-arts and our approach are listed across six benchmark datasets.
Fig. 10: Qualitative comparisons of the state-of-the-art algorithms and our approach. GT means ground-truth masks of salient objects.

We compare our approach, denoted as PurNet, with 19 state-of-the-art methods, including KSR [45], HDHF [24], ELD [22], UCF [61], NLDF [33], Amulet [60], FSN [7], SRM [44], C2SNet [26], RA [6], Picanet [30], PAGRN [62], R3Net [10], DGRL [46], RFCN [43], DSS [16], MLM [50], ICTBI [47] and AFNet [11]. For fair comparison, we obtain the saliency maps of different methods from the authors or by running the deployment codes provided by the authors.

Quantitative Evaluation

We evaluate 19 state-of-the-art SOD methods and our method on six benchmark datasets, and the results are listed in Tab. I. The proposed method consistently outperforms the other methods across all six datasets, especially on DUTS-TE and XPIE. As for Fβw, our method is noticeably improved from 0.785 to 0.817 on DUTS-TE and from 0.822 to 0.843 on XPIE compared to the second-best results. It is also worth noting that the Fβ of our method is significantly better than the second-best results on DUTS-TE (0.829 against 0.810) and XPIE (0.876 against 0.859). PurNet shows a similar and obvious improvement on the other three datasets. As for MAE, our method has clear advantages over the other state-of-the-art algorithms on all six datasets.

Models ECSSD DUTS-TE XPIE
MAE Fβw Fβ MAE Fβw Fβ MAE Fβw Fβ
VGG backbone [38]
DSS 0.052 0.872 0.918 0.056 0.755 0.810 0.065 0.784 0.849
MLM 0.045 0.871 0.897 0.049 0.761 0.776 - - -
AFNet 0.042 0.886 0.916 0.046 0.785 0.807 0.047 0.822 0.859
Ours 0.040 0.892 0.920 0.043 0.797 0.816 0.044 0.830 0.868
ResNet backbone [13]
DGRL 0.043 0.883 0.910 0.051 0.760 0.781 0.048 0.818 0.859
R3 0.040 0.902 0.924 0.057 0.765 0.805 0.058 0.805 0.854
ICTBI 0.041 0.881 0.909 0.048 0.762 0.797 - - -
Ours 0.035 0.907 0.928 0.039 0.817 0.829 0.041 0.843 0.876
TABLE II: Comparisons of state-of-the-art methods and the proposed method with different backbones. The best results for each backbone are in bold fonts.

For a more comprehensive demonstration, we also compare state-of-the-art methods and our method with different backbones. We use VGG-16 [38] as the backbone of our method instead of ResNet-50 [13] and train the new network without changing other settings. As listed in Tab. II, the proposed method consistently outperforms the other methods, which verifies that the proposed purificatory mechanism and structural similarity loss achieve strong performance with different backbones.

For overall comparisons, the F-measure curves of different methods are displayed in Fig. 9. The F-measure curves of our approach are consistently higher than those of the other state-of-the-art methods. These observations demonstrate the effectiveness and robustness of our purificatory network across various challenging datasets, and indicate that addressing the problem of SOD from the perspective of a purificatory mechanism is useful. Note that the results of DSS and RA on HKU-IS [23] are only conducted on the test set.

RM PA RA SSL ECSSD DUT-OMRON PASCAL-S
MAE Fβw Fβ MAE Fβw Fβ MAE Fβw Fβ
Baseline 0.046 0.864 0.895 0.064 0.700 0.725 0.077 0.776 0.820
Baseline + PA 0.044 0.877 0.919 0.055 0.719 0.760 0.079 0.778 0.838
Baseline + RA 0.043 0.878 0.917 0.053 0.725 0.765 0.076 0.781 0.838
Baseline + PM 0.039 0.892 0.924 0.055 0.734 0.768 0.071 0.798 0.848
Baseline + SSL 0.043 0.880 0.917 0.057 0.716 0.760 0.074 0.786 0.838
PurNet 0.035 0.907 0.928 0.051 0.747 0.776 0.070 0.805 0.847
TABLE III: Performance of different settings of the proposed method. PurNet is the proposed method. Meanings of the abbreviations are as follows: PM: Purificatory Mechanism, PA: Promotion Attention, RA: Rectification Attention, SSL: Structural Similarity Loss.

Qualitative Evaluation

Some examples of saliency maps generated by our approach and other state-of-the-art algorithms are shown in Fig. 10. We can see that salient objects pop out with accurate locations and details in the results of the proposed method. From rows 1 to 3 of Fig. 10, we find that many methods fail to even roughly locate the salient objects, whereas our method locates them with the help of the effective promotion attention. In addition, many methods often mis-segment the details of salient objects; we attribute this error to the fact that most existing methods lack constraints on error-prone areas. From rows 4 to 6 of Fig. 10, we observe that our method achieves better performance, which indicates its ability to process fine structures and rectify errors. More examples of complex scenes are shown in rows 7 and 8, where the proposed method also obtains impressive results. These observations indicate that addressing SOD from the perspective of the purificatory mechanism and region-level pair-wise constraints is effective.

V-C Ablation Studies

To validate the effectiveness of different components of the proposed method, we conduct several experiments on the benchmark datasets to compare the performance of our method under different experimental settings.

Effectiveness of the Purificatory Mechanism

To investigate the effectiveness of the proposed purificatory mechanism, we conduct ablation experiments with four different models for comparison. The first setting contains only the feature extractor and purificatory subnetwork, which is regarded as “Baseline”. To explore the respective effectiveness of the promotion attention and rectification attention, we build the second and third models by adding the promotion subnetwork (denoted as “Baseline + PA”) and the rectification subnetwork (denoted as “Baseline + RA”), respectively. In addition, we combine the two attention mechanisms (i.e., the purificatory mechanism) with the purificatory network as the fourth model, which is named “Baseline + PM”. We also list the proposed method with the purificatory mechanism and structural similarity loss as “PurNet”.

The comparison results of the above-mentioned models are listed in Tab. III. We can observe that the promotion attention and rectification attention greatly improve the performance compared with “Baseline”, which indicates the usefulness of the two attention mechanisms for SOD. In addition, “Baseline + RA” yields a larger improvement than “Baseline + PA”, which implies that the rectification of error-prone areas is important to SOD. Moreover, better performance is achieved by combining the two attentions (i.e., the purificatory mechanism), which verifies the compatibility of the two attentions and the effectiveness of the purificatory mechanism.

Effectiveness of the Structural Similarity Loss

To investigate the effectiveness of the proposed structural similarity loss (SSL), we conduct another experiment by combining only this loss with “Baseline”; this model is named “Baseline + SSL”. As listed in Tab. III, comparing “Baseline” and “Baseline + SSL” reveals a remarkable improvement brought by SSL, which shows that the loss plays an important role in the SOD task. In addition, by comparing “Baseline + PM” and “PurNet”, we find that SSL remains useful even when the results are already strong.

ECSSD DUT-OMRON PASCAL-S
MAE Fβw Fβ MAE Fβw Fβ MAE Fβw Fβ
S5 0.043 0.878 0.899 0.057 0.712 0.741 0.077 0.778 0.818
S4 0.043 0.880 0.899 0.057 0.715 0.742 0.076 0.781 0.819
S3 0.037 0.900 0.920 0.053 0.738 0.765 0.071 0.799 0.841
S2 0.035 0.906 0.927 0.051 0.746 0.775 0.070 0.804 0.846
S1 0.035 0.907 0.928 0.051 0.747 0.776 0.070 0.805 0.847
Fusion 0.038 0.894 0.918 0.054 0.733 0.758 0.073 0.794 0.837
TABLE IV: Comparisons of each side-output and their fusion. S1–S5 denote the five side-outputs and Fusion means the average of S1 to S5.

Performance of Each Side-output

In order to explore how to obtain the best prediction from the proposed network, we conduct an additional experiment to compare the performance of each side-output and their fusion in the purificatory subnetwork. As listed in Tab. IV, the performance of the last three side-outputs (i.e., the third, fourth and fifth) is consistently worse than that of the first two (i.e., the first and second). Moreover, the performance of the fusion is lower than that of the first, second and third side-outputs. These comparisons indicate that the saliency maps in our network are progressively refined from the higher layers to the lower layers. Thus, we choose the first side-output as the result during inference.

VI Conclusion

In this paper, we rethink two difficulties that hinder the development of salient object detection: indistinguishable regions and complex structures. To solve these two issues, we propose the purificatory network with the structural similarity loss. In this network, we introduce the promotion attention to improve the localization ability and semantic information for salient regions, which guides the network to focus on salient regions. We also propose the rectification subnetwork to provide the rectification attention for rectifying errors. The two attentions are combined to form the purificatory mechanism, which improves the promotable regions and rectifies the error-prone regions to purify salient objects. Moreover, we propose a novel region-level pair-wise structural similarity loss, which models and constrains the relationships between pair-wise regions and serves as a supplement to the unary constraint. Extensive experiments on six benchmark datasets have validated the effectiveness of the proposed approach.

Jia Li received the B.E. degree from Tsinghua University in 2005 and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, in 2011. He is currently an associate Professor with the School of Computer Science and Engineering, Beihang University, Beijing, China. Before he joined Beihang University in Jun. 2014, he used to conduct research in Nanyang Technological University, Peking University and Shanda Innovations. He is the author or coauthor of over 70 technical articles in refereed journals and conferences. His research interests include computer vision and multimedia big data, especially the learning-based visual content understanding. He is a senior member of IEEE, CIE and CCF. More information can be found at http://cvteam.net.

Jinming Su is currently pursuing his master degree with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University. He received the B.S. degree from School of Computer Science and Engineering, Northeastern University, in Jul. 2017. His research interests include computer vision, visual saliency analysis and deep learning.

Changqun Xia is currently an assistant Professor at Peng Cheng Laboratory, China. He received the Ph.D. degree from the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, in Jul. 2019. His research interests include computer vision and image understanding.

Yonghong Tian is currently a Boya Distinguished Professor with the School of EECS, Peking University, China, and is also the deputy director of Artificial Intelligence Research Center, Peng Cheng Laboratory, Shenzhen, China. His research interests include computer vision, multimedia big data, and brain-inspired computation. He is the author or coauthor of over 180 technical articles in refereed journals and conferences. He is a senior member of IEEE, CIE and CCF, a member of ACM.

References

  1. R. Achanta, S. Hemami, F. Estrada and S. Süsstrunk (2009) Frequency-tuned salient region detection. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1597–1604. Cited by: §II-A, §V-A2.
  2. R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua and S. Süsstrunk (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34 (11), pp. 2274–2282. Cited by: §IV.
  3. D. Bahdanau, K. Cho and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), Cited by: §II-B.
  4. A. Borji, M. Cheng, H. Jiang and J. Li (2015) Salient object detection: a benchmark. IEEE transactions on image processing (TIP) 24 (12), pp. 5706–5722. Cited by: §I, §II-A.
  5. L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu and T. Chua (2017) Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 5659–5667. Cited by: §II-B.
  6. S. Chen, X. Tan, B. Wang and X. Hu (2018) Reverse attention for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 234–250. Cited by: §I, §III-C1, TABLE I, §IV, §V-B.
  7. X. Chen, A. Zheng, J. Li and F. Lu (2017) Look, perceive and segment: finding the salient objects in images via two-stream fixation-semantic cnns. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1050–1058. Cited by: §I, §II-A, §II-A, TABLE I, §V-B.
  8. M. Cheng, N. J. Mitra, X. Huang, P. H. Torr and S. Hu (2015) Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 37 (3), pp. 569–582. Cited by: §II-A.
  9. X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille and X. Wang (2017) Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1831–1840. Cited by: §II-B.
  10. Z. Deng, X. Hu, L. Zhu, X. Xu, J. Qin, G. Han and P. Heng (2018) R3Net: recurrent residual refinement network for saliency detection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 684–690. Cited by: TABLE I, §V-B.
  11. M. Feng, H. Lu and E. Ding (2019) Attentive feedback network for boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1623–1632. Cited by: §I, §II-B, §II-C, TABLE I, §V-B.
  12. J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang and H. Lu (2019) Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3146–3154. Cited by: §II-B.
  13. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 770–778. Cited by: §III-B, §V-A3, §V-B1, TABLE II.
  14. S. Hong, T. You, S. Kwak and B. Han (2015) Online tracking by learning discriminative saliency map with convolutional neural network. In International Conference on Machine Learning (ICML), pp. 597–606. Cited by: §I.
  15. Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu and P. H. Torr (2017) Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3203–3212. Cited by: §I, §II-A, §II-A.
  16. Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu and P. H. Torr (2019) Deeply supervised salient object detection with short connections. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41 (4), pp. 815–828. Cited by: TABLE I, §V-B.
  17. L. Itti, C. Koch and E. Niebur (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (11), pp. 1254–1259. Cited by: §I.
  18. W. James, F. Burkhardt, F. Bowers and I. K. Skrupskelis (1890) The principles of psychology. Vol. 1, Macmillan London. Cited by: §I.
  19. D. A. Klein and S. Frintrop (2011) Center-surround divergence of feature statistics for salient object detection. In International Conference on Computer Vision (ICCV), pp. 2214–2219. Cited by: §II-A.
  20. P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems (NeurIPS), pp. 109–117. Cited by: TABLE I.
  21. B. Lai and X. Gong (2016) Saliency guided dictionary learning for weakly-supervised image parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3630–3639. Cited by: §I.
  22. G. Lee, Y. Tai and J. Kim (2016) Deep saliency with encoded low level distance map and high level features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 660–668. Cited by: §I, §II-A, TABLE I, §V-B.
  23. G. Li and Y. Yu (2015) Visual saliency based on multiscale deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5455–5463. Cited by: §I, §V-A1, §V-B1.
  24. G. Li and Y. Yu (2016) Visual saliency detection based on multiscale deep cnn features. IEEE transactions on image processing (TIP) 25 (11), pp. 5012–5024. Cited by: TABLE I, §V-B.
  25. J. Li and W. Gao (2014) Visual saliency computation: a machine learning perspective. Vol. 8408, Springer. Cited by: §I.
  26. X. Li, F. Yang, H. Cheng, W. Liu and D. Shen (2018) Contour knowledge transfer for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 355–370. Cited by: §I, §II-C, TABLE I, §V-B.
  27. Y. Li, X. Hou, C. Koch, J. M. Rehg and A. L. Yuille (2014) The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 280–287. Cited by: §I, §V-A1.
  28. T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 2117–2125. Cited by: §III-A.
  29. J. Liu, Q. Hou, M. Cheng, J. Feng and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3917–3926. Cited by: §I, §II-C.
  30. N. Liu, J. Han and M. Yang (2018) PiCANet: learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3089–3098. Cited by: §I, §II-B, §III-C1, TABLE I, §IV, §V-A3, §V-B.
  31. N. Liu and J. Han (2016) Dhsnet: deep hierarchical saliency network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 678–686. Cited by: §I, §II-A.
  32. W. Liu, A. Rabinovich and A. C. Berg (2015) Parsenet: looking wider to see better. arXiv preprint arXiv:1506.04579. Cited by: §V-A3.
  33. Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li and P. Jodoin (2017) Non-local deep features for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6609–6617. Cited by: §I, §II-A, §II-A, TABLE I, §V-B.
  34. R. Margolin, L. Zelnik-Manor and A. Tal (2014) How to evaluate foreground maps?. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 248–255. Cited by: §V-A2.
  35. V. Mnih, N. Heess and A. Graves (2014) Recurrent models of visual attention. In Advances in neural information processing systems (NeurIPS), pp. 2204–2212. Cited by: §II-B.
  36. X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan and M. Jagersand (2019) BASNet: boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7479–7489. Cited by: §I, §II-C.
  37. Z. Ren, S. Gao, L. Chia and I. W. Tsang (2014) Region-based saliency detection and its application in object recognition. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 24 (5), pp. 769–779. Cited by: §I.
  38. K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), Cited by: §II-A, §V-B1, TABLE II.
  39. J. Su, J. Li, Y. Zhang, C. Xia and Y. Tian (2019) Selectivity or invariance: boundary-aware salient object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3799–3808. Cited by: §I, §I, §II-A, §II-C.
  40. W. Tu, S. He, Q. Yang and S. Chien (2016) Real-time salient object detection with a minimum spanning tree. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 2334–2342. Cited by: §II-A.
  41. M. Van den Bergh, X. Boix, G. Roig, B. de Capitani and L. Van Gool (2012) Seeds: superpixels extracted via energy-driven sampling. In European conference on computer vision (ECCV), pp. 13–26. Cited by: §IV.
  42. L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin and X. Ruan (2017) Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 136–145. Cited by: §I, §V-A1, §V-A3.
  43. L. Wang, L. Wang, H. Lu, P. Zhang and X. Ruan (2019) Salient object detection with recurrent fully convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41 (7), pp. 1734–1746. Cited by: TABLE I, §V-B.
  44. T. Wang, A. Borji, L. Zhang, P. Zhang and H. Lu (2017) A stagewise refinement model for detecting salient objects in images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4019–4028. Cited by: §I, §II-A, §II-A, TABLE I, §IV, §V-A3, §V-B.
  45. T. Wang, L. Zhang, H. Lu, C. Sun and J. Qi (2016) Kernelized subspace ranking for saliency detection. In European Conference on Computer Vision (ECCV), pp. 450–466. Cited by: TABLE I, §V-B.
  46. T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan and A. Borji (2018) Detect globally, refine locally: a novel approach to saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3127–3135. Cited by: §I, §II-C, TABLE I, §V-A3, §V-B.
  47. W. Wang, J. Shen, M. Cheng and L. Shao (2019) An iterative and cooperative top-down and bottom-up inference network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5968–5977. Cited by: §I, §II-A, TABLE I, §V-B.
  48. W. Wang, J. Shen, X. Dong and A. Borji (2018) Salient object detection driven by fixation prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1711–1720. Cited by: §I, §II-A.
  49. W. Wang, S. Zhao, J. Shen, S. C. Hoi and A. Borji (2019) Salient object detection with pyramid attention and salient edges. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1448–1457. Cited by: §I, §II-B, §II-C.
  50. R. Wu, M. Feng, W. Guan, D. Wang, H. Lu and E. Ding (2019) A mutual learning method for salient object detection with intertwined multi-supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8150–8159. Cited by: Fig. 1, §I, §II-C, TABLE I, §V-B.
  51. Z. Wu, L. Su and Q. Huang (2019) Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3907–3916. Cited by: §I, §II-B.
  52. C. Xia, J. Li, X. Chen, A. Zheng and Y. Zhang (2017) What is and what is not a salient object? learning salient object detector by ensembling linear exemplar regressors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4142–4150. Cited by: §I, §V-A1.
  53. S. Xie, R. Girshick, P. Dollár, Z. Tu and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1492–1500. Cited by: TABLE I.
  54. Q. Yan, L. Xu, J. Shi and J. Jia (2013) Hierarchical saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1155–1162. Cited by: Fig. 1, §I, §V-A1.
  55. C. Yang, L. Zhang, H. Lu, X. Ruan and M. Yang (2013) Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3166–3173. Cited by: §I, §V-A1.
  56. F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR), Cited by: §III-B.
  57. Y. Zeng, H. Lu, L. Zhang, M. Feng and A. Borji (2018) Learning to promote saliency detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1644–1653. Cited by: §I, §II-A.
  58. L. Zhang, J. Dai, H. Lu, Y. He and G. Wang (2018) A bi-directional message passing model for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1741–1750. Cited by: §I, §II-A, §II-A.
  59. L. Zhang, J. Zhang, Z. Lin, H. Lu and Y. He (2019) CapSal: leveraging captioning to boost semantics for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6024–6033. Cited by: §I, §II-B.
  60. P. Zhang, D. Wang, H. Lu, H. Wang and X. Ruan (2017) Amulet: aggregating multi-level convolutional features for salient object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 202–211. Cited by: §I, §II-A, §II-A, TABLE I, §V-B.
  61. P. Zhang, D. Wang, H. Lu, H. Wang and B. Yin (2017) Learning uncertain convolutional features for accurate saliency detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 212–221. Cited by: §I, §II-A, §II-A, TABLE I, §V-B.
  62. X. Zhang, T. Wang, J. Qi, H. Lu and G. Wang (2018) Progressive attention guided recurrent network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 714–722. Cited by: §I, §III-C1, TABLE I, §V-A3, §V-B.
  63. T. Zhao and X. Wu (2019) Pyramid feature attention network for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3085–3094. Cited by: §I, §II-B, §III-C1.