CAGNet: Content-Aware Guidance for Salient Object Detection
Benefiting from Fully Convolutional Neural Networks (FCNs), saliency detection methods have achieved promising results. However, it is still challenging to learn effective features for detecting salient objects in complicated scenarios, in which i) non-salient regions may have a "salient-like" appearance; ii) the salient objects may have different-looking regions. To handle these complex scenarios, we propose a Feature Guide Network which exploits the nature of low-level and high-level features to i) make foreground and background regions more distinct and suppress the non-salient regions which have a "salient-like" appearance; ii) assign the foreground label to different-looking salient regions. Furthermore, we utilize a Multi-scale Feature Extraction Module (MFEM) for each level of abstraction to obtain multi-scale contextual information. Finally, we design a loss function which outperforms the widely used Cross-entropy loss. By adopting four different pre-trained models as the backbone, we show that our method is very general with respect to the choice of the backbone model. Experiments on five challenging datasets demonstrate that our method achieves state-of-the-art performance in terms of different evaluation metrics. Additionally, our approach contains fewer parameters than the existing ones, does not need any post-processing, and runs at a real-time speed of 28 FPS when processing a image.
Keywords: Saliency detection, Fully convolutional neural networks, Attention guidance
Salient object detection aims at localizing the most interesting and prominent parts of an image. Moreover, it is an effective pre-processing step for numerous computer vision tasks such as image classification flores2019saliency , image segmentation li2016robust ; zhi2018saliency ; cai2019saliency , video segmentation wang2015saliency , image editing chen2015improved ; zhang2018novel and object tracking hong2015online .
Traditional approaches are mostly based on low-level cues and hand-crafted features. For example, the method proposed in huo2016object uses color feature to detect salient objects. Some other methods use center prior to improve the performance of salient object detection aksac2017complex ; liang2018material . Because of the lack of semantic information, these methods have limited ability to detect the whole structure of salient objects in complex scenes. In recent years, the methods based on the Fully Convolutional Neural Networks (FCNs), such as luo2017non ; zhang2017amulet ; xi2019salient , have been widely used for saliency detection owing to their high capacity of modeling high-level semantics. Even though these methods have achieved promising results, there are still some challenges due to the complicated scenarios of some images. The learned features by these methods usually lack the ability to i) suppress the non-salient regions which have ”salient-like” appearance as depicted in the first row of Figure 1, ii) detect salient objects that have different-looking regions as depicted in the second row of Figure 1.
To address the above-mentioned challenges, we propose the Guide Module, which takes advantage of the nature of the high-level and low-level features. By adopting this module, high-level features, which lack the fine spatial details of low-level features, can exploit the nature of low-level features as a guidance to make foreground and background regions more distinct, and thus the network can suppress the non-salient regions that have a "salient-like" appearance. For example, as illustrated in the first row of Figure 1, although the triangular object has a "salient-like" appearance, it should not be labeled as a salient object, since it is not the most interesting and prominent part of the image. From Figure 1, we can see that our method (denoted as CAGNet-R) is able to completely suppress the whole triangular object. Furthermore, by adopting the Guide Module, high-level features, which can recognize the category of image regions because they contain rich semantic information, can guide the selection of low-level features. Inspired by the Channel Attention Block (CAB) proposed in yu2018learning , we give our model the ability to guide the selection of low-level features, which equips our network with the power of assigning the foreground label to different-looking salient regions. As illustrated in the second row of Figure 1, the appearance of the feet of the doll is different from the rest of the doll, but our method is able to highlight the whole doll as the salient object. Thus, by benefiting from the content-aware guidance provided by our Guide Modules, our method is able to handle these complicated scenarios.
Some previous salient object detection methods wang2016saliency ; wang2015deep ; zhang2017learning utilize subsequent single-scale convolutional and max pooling layers to produce deep features. Since salient objects have large variations in scale and location, the features learned by these methods might not be able to handle these complicated variations due to the limited field of view. Zhang et al. zhang2018bi use dilated convolutional layers for extracting multi-scale features. The dilated convolution inserts "holes" in the convolution kernels to enlarge the receptive field, which causes a loss of local information, especially when the dilation rate increases. This problem is called the "gridding issue", which was explored in wang2018understanding . To address these problems, we introduce the Multi-scale Feature Extraction Module (MFEM), which is capable of capturing multi-scale contextual information by enabling dense connections within the multi-scale regions in the feature map. For each level of abstraction (i.e., stage) of the pre-trained backbone, we perform convolutions by adopting a trivial convolutional layer and Global Convolutional Networks (GCNs) peng2017large with different kernel sizes. Then, the resulting feature maps are stacked to form multi-scale features. GCNs enable dense connections within a large region in the feature map and thus can alleviate the "gridding issue".
In this paper, we propose a Content-Aware Guidance Network, which we refer to as CAGNet, consisting of three networks: (i) Feature Extraction Network (FEN), (ii) Feature Guide Network (FGN), (iii) Feature Fusion Network (FFN).
The FEN produces multi-scale features at multiple levels of abstraction by adopting the MFEM at each level of a pre-trained backbone. The FGN takes the extracted multi-scale features of the FEN as input and guides the features for use by the FFN. Then, by using multiple add operations and Residual Refinement Modules (RRMs) in the FFN, the guided features are fused effectively. Our proposed RRM is a residual block with spatial attention, which refines the features by focusing on salient regions and avoiding distractions in the non-salient regions. In summary, the FEN, FGN, and FFN in our proposed architecture work collaboratively to generate a more accurate prediction. Additionally, while most saliency detection methods in the literature use the Cross-entropy loss for learning the salient objects, we design a loss function which outperforms the Cross-entropy loss by a large margin.
By conducting experiments on various backbones, we prove the robustness of our method. Furthermore, our method contains a lower number of parameters in comparison with the previous state-of-the-art methods. It is worth mentioning that since salient object detection is a pre-processing step for many computer vision tasks, it is important to evaluate the performance in terms of the running speed. Our method is capable of running at a real-time speed of 28 FPS, which guarantees that our network can be practically adopted as a pre-processing step for computer vision tasks.
In short, our main contributions are summarized as follows:
We propose the Feature Guide Network to equip our model with the power of i) making the foreground and background regions more distinct and suppressing the non-salient regions which have ”salient-like” appearance; ii) detecting salient objects that have different-looking regions.
To extract powerful multi-scale features, we propose the Multi-scale Feature Extraction Module, which adopts GCNs to enable dense connections within large regions. Additionally, this module helps the model to alleviate the "gridding issue".
We design a loss function that outperforms the widely-used Cross-entropy loss by a large margin.
Our method achieves great performance under different backbones, which shows that our proposed framework is very general with respect to the choice of the backbone model. It is interesting to note that while most methods in the saliency detection literature adopt a single backbone in their framework, we evaluate our framework on four different backbones to prove the generalization capability of our method.
The proposed method achieves the state-of-the-art on several challenging saliency detection datasets. Furthermore, our method contains a lower number of parameters compared to the previous state-of-the-art methods and can run at a real-time speed of 28 FPS.
2 Related work
Over the past years, numerous methods have been proposed for saliency detection. Traditional methods predict the saliency score based on hand-crafted features. Most of these methods utilize heuristic priors such as center prior aksac2017complex ; liang2018material , boundary background yang2013saliency , and color contrast cheng2014global . Aytekin et al. aytekin2018probabilistic propose a probabilistic framework to encode the boundary connectivity saliency cue and smoothness constraints into a global optimization problem. Shan et al. shan2018visual propose a graph-based approach and use background weight map to provide seeds for manifold ranking. Furthermore, they design a third-order smoothness framework to enhance the performance of manifold ranking. These methods, which are based on the traditional approaches, fail to capture semantic and high-level information of the objects.
Recently, deep Convolutional Neural Networks (CNNs) have shown their capabilities in extracting powerful features at multiple levels of abstraction. The CNN features can acquire a richer representation compared to the traditional hand-crafted features, and thus would result in performance improvement. In recent years, a vast number of methods have adopted CNNs for saliency detection task. For example, Li et al. li2015visual extract multi-scale features from a CNN and estimate the saliency score for each image super-pixel. Wang et al. wang2015deep employ two CNNs to combine local estimation of super-pixels and global proposal searching to predict saliency maps. Zhao et al. zhao2015saliency propose multi-context CNNs for exploiting local and global context for salient object detection. Although these CNN-based methods have shown better performance than the traditional methods, they are time-consuming because of taking image patches as input. Moreover, these methods fail to consider important spatial information of the whole image.
To overcome the above-mentioned problems, several methods have utilized FCNs to generate a pixel-wise prediction over the whole image directly. For instance, Li et al. li2016deep propose a multi-scale FCN to explore the semantic properties and visual contrast information of salient objects. Hou et al. hou2017deeply introduce short connections to combine features in different layers. Zhang et al. zhang2017amulet propose a resolution-based feature combination module to integrate multi-level feature maps into multiple resolutions, which captures spatial details and semantic information, simultaneously. Then, by fusing the predicted saliency maps in each resolution, the final saliency map is obtained. Zhang et al. zhang2018bi design a bi-directional message passing architecture to pass messages between multi-level features. Wang et al. wang2018detect propose to locate the salient objects globally and then refine them by taking advantage of local context information. Zhang et al. zhang2019hyperfusion employ a hyper-densely hierarchical feature fusion network to fuse the local and global multi-scale feature maps.
Most of the recent methods focus on using both high-level and low-level features for salient object detection. However, naively using these features may result in confusion for the network, and there needs to be an effective approach to use these features constructively. In this paper, we propose the Feature Guide Network which guides multi-level features to produce more effective features.
3 Our method
In this section, we first explain our proposed Content-Aware Guidance Network (CAGNet), consisting of three networks: (i) Feature Extraction Network which extracts multi-scale context information, (ii) Feature Guide Network which guides the extracted features by taking advantage of the spatial details of low-level features and the semantic information of high-level features, (iii) Feature Fusion Network which integrates guided features effectively to generate the saliency map. The architecture of the proposed CAGNet is illustrated in Figure 2. Finally, we describe our designed loss function that has better performance than the widely-used Cross-entropy loss.
3.1 Feature Extraction Network
The Feature Extraction Network consists of a pre-trained backbone, which takes the input image and produces multi-level feature maps, and Multi-scale Feature Extraction Modules (MFEMs), which we apply to the multi-level feature maps to capture multi-scale contextual features.
3.1.1 Pre-trained backbone
In this study, we examine different pre-trained models in our CAGNet as the backbone model, including VGG-16 simonyan2014very , ResNet50 he2016deep , NASNet-Mobile zoph2018learning , and NASNet-large zoph2018learning , which are denoted as CAGNet-V, CAGNet-R, CAGNet-M, and CAGNet-L, respectively. These backbones are used to produce features at different levels of abstraction. To fit the need of saliency detection task, we remove all the fully connected layers in these backbones. In VGG-16, the features after the last max pooling layer cannot introduce a new level of abstraction. Thus, we use a convolutional layer with kernels of size after the last max pooling layer in VGG-16 to produce a new level.
The output feature maps of all backbones are re-scaled by a factor of with respect to the input image. We take feature maps at four levels from each backbone. Given an input image with size , these feature maps have spatial sizes of with . The details of selected layers for different levels of abstraction in each backbone are shown in Table 1.
|Backbone||Level A||Level B||Level C||Level D|
3.1.2 Multi-scale Feature Extraction Module
Salient objects have large variations in scale and location in different images. Due to the variability of scale, using single scale convolution may not capture the right size. Moreover, due to the variability of location, using pyramid pooling as a multi-scale feature extractor, as proposed in wang2017stagewise , would cause the loss of important local information because of the large scale of pooling. Another approach to implement a multi-scale feature extractor is to use dilated convolutions like zhang2018bi , which enlarges the receptive field by inserting ”holes” in the convolution kernels, and thus would result in the loss of local information because of sparse connections. This problem, which is called the ”gridding issue”, was explored in wang2018understanding .
Based on the above observation, we find the Global Convolutional Networks (GCNs) peng2017large effective in addressing the "gridding issue". To avoid sparse connections and enable dense connections within a large region in the feature map, GCN utilizes a combination of $1 \times k + k \times 1$ and $k \times 1 + 1 \times k$ convolutions to implement the $k \times k$ convolution effectively with a lower number of parameters compared with the trivial $k \times k$ convolution. More details about the GCN can be found in peng2017large . Furthermore, to obtain multi-scale contextual information, by taking advantage of GCNs, we propose the Multi-scale Feature Extraction Module (MFEM). This module consists of GCNs with different kernel sizes and can learn multi-scale context information for multiple abstraction levels.
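The parameter saving mentioned above can be checked with a small calculation. The sketch below assumes the GCN layout described in peng2017large (a 1×k convolution followed by a k×1 convolution, in parallel with a k×1 convolution followed by a 1×k convolution, with the two branches summed). The channel count of 64 and the kernel sizes in the loop are illustrative values chosen here, not settings taken from this paper.

```python
def conv_params(k_h, k_w, c_in, c_out, bias=True):
    """Number of parameters of one 2-D convolution with a k_h x k_w kernel."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)


def gcn_params(k, c_in, c_out, bias=True):
    """Parameters of a GCN block: (1xk -> kx1) in parallel with (kx1 -> 1xk).

    Assumes both convolutions of a branch output c_out channels.
    """
    branch_a = conv_params(1, k, c_in, c_out, bias) + conv_params(k, 1, c_out, c_out, bias)
    branch_b = conv_params(k, 1, c_in, c_out, bias) + conv_params(1, k, c_out, c_out, bias)
    return branch_a + branch_b


def trivial_params(k, c_in, c_out, bias=True):
    """Parameters of a plain (trivial) k x k convolution."""
    return conv_params(k, k, c_in, c_out, bias)


if __name__ == "__main__":
    channels = 64  # hypothetical channel count, for illustration only
    for k in (7, 11, 15):
        print(k, gcn_params(k, channels, channels), trivial_params(k, channels, channels))
```

The GCN cost grows linearly in k while the trivial convolution grows quadratically, so the saving only appears for larger kernels; under these assumptions a 3×3 trivial convolution is still cheaper than a GCN with k = 3.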
As illustrated in Figure 3, in MFEM we perform convolutions by utilizing the trivial convolution and GCNs with . Then, the resulting feature maps are concatenated to form multi-scale features.
3.2 Feature Guide Network
By employing the Feature Extraction Network, multi-scale features at multiple levels of abstraction are produced. We use four different levels of the Feature Extraction Network to extract multi-scale features. These different levels have different recognition information. High-level features have semantic and global information because of the large field of view. Thus, these features can help the category recognition of image regions. Low-level features have spatial and local information due to the small field of view. Therefore, the information of low-level features can help to better locate the salient regions.
Based on above observation, we propose the Feature Guide Network to better exploit the diverse recognition abilities of different levels. Feature Guide Network is composed of multiple Guide Modules which help to produce more powerful features for saliency detection. As illustrated in Figure 4, Guide Module consists of Low-level Guide and High-level Guide branches. This module takes low-level and high-level features as inputs and outputs guided low-level and guided high-level features.
In saliency detection, some non-salient regions may have ”salient-like” appearance. As shown in the first row of Figure 1, the triangular object at the bottom of the image, which has ”salient-like” appearance, may cause confusion for saliency prediction. To address this challenge, we take advantage of the nature of the lower levels to guide higher levels. In the lower levels, the Feature Extraction Network captures finer spatial information because of its smaller field of view compared to the higher levels. Thus, by applying a convolution on concatenated high-level and low-level features, spatial weights are produced to weigh the spatial information of high-level features. With this design, high-level features, which lack the low-level cues, can exploit the fine spatial details of low-level features as a guidance to make salient and non-salient regions more distinguishable. Therefore, by guiding the spatial information of high-level features, our network is able to enhance the distinction of salient and non-salient regions and suppress the non-salient regions with ”salient-like” appearance.
In some complicated scenarios, salient regions may have different appearances. As illustrated in the second row of Figure 1, the appearance of the feet of the doll is different from the rest of the doll. Assigning the foreground label to these different-looking regions is challenging. To address this challenge, inspired by the Channel Attention Block (CAB) proposed in yu2018learning , we use the nature of high-level features to guide low-level features in our Feature Guide Network. High-level features have higher semantic information due to the large receptive field. By applying an architecture like Squeeze and Excitation Networks hu2018squeeze on concatenated high-level and low-level features, channel weights are generated to weigh the channels of low-level feature maps. In this way, by utilizing high-level semantic information, the low-level features are guided to produce more attentive features. Thus, Guide Modules provide content-aware guidance for multi-level features, which results in a more accurate prediction.
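The two kinds of guidance described in this section (spatial weights applied to high-level features, channel weights applied to low-level features) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: both feature maps are assumed to share the same spatial size, the convolution that produces the spatial weights is taken to be 1×1, the channel branch is a plain two-layer squeeze-and-excitation style bottleneck, and all weight arrays (`w_spatial`, `w1`, `w2`) are hypothetical parameters. The layer configuration actually used by the authors is given in Figure 4 and is not reproduced here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def guide_module(low, high, w_spatial, w1, w2):
    """Illustrative guide step on one pair of feature maps.

    low:       (H, W, C_l) low-level features
    high:      (H, W, C_h) high-level features, same spatial size (assumed)
    w_spatial: (C_l + C_h,) weights of an assumed 1x1 convolution
    w1, w2:    (C_l + C_h, hidden) and (hidden, C_l) bottleneck weights
    """
    concat = np.concatenate([low, high], axis=-1)      # (H, W, C_l + C_h)

    # Spatial weights: one value in (0, 1) per pixel, used to re-weigh
    # the spatial positions of the high-level features.
    spatial = sigmoid(concat @ w_spatial)[..., None]   # (H, W, 1)
    guided_high = high * spatial

    # Channel weights: global average pooling, bottleneck, sigmoid;
    # one value in (0, 1) per low-level channel.
    squeezed = concat.mean(axis=(0, 1))                # (C_l + C_h,)
    hidden = np.maximum(squeezed @ w1, 0.0)            # ReLU
    channel = sigmoid(hidden @ w2)                     # (C_l,)
    guided_low = low * channel

    return guided_low, guided_high
```

Both outputs keep the shape of their inputs, so the guided features can be passed on to a fusion stage unchanged.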
3.3 Feature Fusion Network
By adopting Feature Extraction Network and Feature Guide Network, guided multi-scale features at different levels of abstractions are obtained. To integrate these features effectively, we devise Feature Fusion Network. In this network, we use add operations to combine different feature maps. In order to refine the features effectively, we introduce Residual Refinement Module (RRM), which is schematically depicted in Figure 5. RRM is a residual block he2016deep ; he2016identity with spatial attention. This module is used to refine the features and has the ability of focusing on salient regions and avoiding distractions in the non-salient regions.
By adopting multiple RRM modules and add operations in Feature Fusion Network, finally the saliency map is obtained by utilizing a convolutional layer with two kernels with softmax activation.
3.4 Our designed loss function for learning the salient objects
In saliency detection literature, Cross-entropy loss function is widely used for learning the salient objects. However, the networks trained with Cross-entropy loss often differentiate boundary pixels with low confidence, which would result in the performance degradation. In this paper, we design a loss function that leads to better results compared to the Cross-entropy loss, as shown in the ablation analysis section. Let … , , and denote the training images, saliency map for the - training image, and ground truth for the - training image, respectively. Our designed loss is formulated as:
where , and are the balance parameters. We empirically set , , and . and are computed as:
where and are calculated similar to Precision and Recall:
where and , and is a regularization constant. calculates the discrepancy between the predicted saliency map and the ground truth :
where is computed as :
where denotes the total number of pixels. In ablation analysis section, we demonstrate that our designed loss function outperforms the Cross-entropy loss function.
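To make the structure of such a loss concrete, the following NumPy sketch combines a precision-like term, a recall-like term, and a mean absolute error term. It is a guess at the general shape only: the exact formulas and balance parameters are not recoverable from the text above, so the negative-log form of the first two terms, the default weights of 1.0, and the value of `eps` are all assumptions made for illustration.

```python
import numpy as np


def soft_precision_recall(pred, gt, eps=1e-7):
    """Precision- and recall-like quantities on continuous saliency values.

    pred: predicted saliency map with values in [0, 1]
    gt:   binary ground-truth mask of the same shape
    eps:  small regularization constant (assumed value)
    """
    true_pos = np.sum(pred * gt)
    precision = true_pos / (np.sum(pred) + eps)
    recall = true_pos / (np.sum(gt) + eps)
    return precision, recall


def saliency_loss(pred, gt, w_p=1.0, w_r=1.0, w_mae=1.0, eps=1e-7):
    """Weighted sum of precision, recall and MAE terms (weights hypothetical)."""
    precision, recall = soft_precision_recall(pred, gt, eps)
    loss_p = -np.log(precision + eps)      # assumed form of the precision term
    loss_r = -np.log(recall + eps)         # assumed form of the recall term
    loss_mae = np.mean(np.abs(pred - gt))  # average over all pixels
    return w_p * loss_p + w_r * loss_r + w_mae * loss_mae
```

With these definitions a perfect prediction yields a loss close to zero, while a prediction that misses the salient region is penalized by all three terms at once.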
4.1 Datasets and evaluation metrics
The proposed method is evaluated on five public salient object detection datasets. ECSSD yan2013hierarchical contains 1,000 semantically meaningful and complex images with multiple objects of different sizes. DUT-OMRON yang2013saliency consists of 5,168 challenging images with a high variety of content, each of which has a complex background and one or two salient objects. HKU-IS li2015visual contains 4,447 images with low color contrast. Images in this dataset are selected to include multiple foreground objects or objects touching the image boundary. The DUTS wang2017learning dataset is currently the largest salient object detection dataset and comprises 10,553 images in the training set and 5,019 images in the test set. Both training and test sets have very challenging scenarios. The PASCAL-S li2014secrets dataset has 850 natural images chosen from the PASCAL VOC 2010 everingham2010pascal segmentation dataset.
We use five metrics to evaluate the performance of our method as well as previous state-of-the-art saliency detection methods, including Precision-Recall (PR) curves, F-measure curves, Average F-measure (denoted as avgF) score, weighted F-measure (denoted as wF) score, and Mean Absolute Error (MAE) score. More detailed descriptions about these metrics can be found in borji2015salient ; margolin2014evaluate .
Precision is the fraction of correct salient pixels in the predicted saliency maps, and Recall is defined as the fraction of correct salient pixels in the ground truth. To calculate Precision and Recall, the binarized saliency map is compared against the ground truth mask. The threshold is varied from 0 to 1 to generate a sequence of binary masks. These binary masks are used to calculate (Precision, Recall) pairs and (F-measure, threshold) pairs to plot the PR curves and the F-measure curves.
The Average F-measure score is calculated by using the thresholding method suggested in achanta2009frequency . This threshold is used to generate binary maps for computing the F-measure, which is defined as:
$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
where $\beta^2$ is set to 0.3 to weight precision more than recall. The weighted F-measure score margolin2014evaluate is also adopted for evaluating the performance. Finally, the MAE score is calculated as the average pixel-wise absolute difference between the ground truth mask and the predicted saliency map.
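As a small worked example of these metrics, the sketch below computes the F-measure at a given threshold, using the standard definition F = (1 + β²)·Precision·Recall / (β²·Precision + Recall) with β² = 0.3, together with the MAE. The function names and the `eps` guard against division by zero are choices made here for illustration, and the threshold is passed in explicitly rather than derived with the method of achanta2009frequency .

```python
import numpy as np


def f_measure(pred, gt, threshold, beta_sq=0.3, eps=1e-8):
    """F-measure of a saliency map binarized at `threshold`.

    pred: saliency map with values in [0, 1]
    gt:   binary ground-truth mask of the same shape
    """
    binary = pred >= threshold
    gt = gt.astype(bool)
    true_pos = np.logical_and(binary, gt).sum()
    precision = true_pos / (binary.sum() + eps)  # correct pixels among predicted salient
    recall = true_pos / (gt.sum() + eps)         # correct pixels among true salient
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall + eps)


def mae(pred, gt):
    """Mean absolute pixel-wise difference between prediction and ground truth."""
    return float(np.mean(np.abs(pred - gt.astype(float))))


if __name__ == "__main__":
    pred = np.array([[0.9, 0.8], [0.2, 0.1]])
    gt = np.array([[1, 0], [0, 0]])
    print(f_measure(pred, gt, threshold=0.5), mae(pred, gt))
```

Sweeping `threshold` over the range from 0 to 1 and recording the resulting values gives the (F-measure, threshold) pairs used for the F-measure curves.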
4.2 Implementation details
We develop our proposed method in the Keras chollet2015keras framework using the TensorFlow tensorflow2015-whitepaper backend. The backbone models (i.e., VGG-16 simonyan2014very , ResNet-50 he2016deep , NASNet Mobile zoph2018learning , and NASNet Large zoph2018learning ) are initialized with ImageNet russakovsky2015imagenet weights. In our experiments, the input image is uniformly resized into pixels for training and testing. To reduce overfitting, two types of data augmentation are randomly employed: horizontal flipping and rotation (range of 0-12 degrees). We do not use a validation set and train the model until its training loss converges. All the experiments are performed using the stochastic gradient descent with a momentum coefficient and an initial learning rate of - which is divided by if no improvement in training loss is seen for epochs. We perform our experiments on an NVIDIA 1080 Ti GPU. We will release our code, the trained models, and the predicted saliency maps upon the publication of the manuscript.
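The learning-rate rule described above (divide the rate when the training loss stops improving for a number of epochs) can be sketched in plain Python as follows. The class name and the numeric values in the usage example are hypothetical, since the actual initial rate, division factor, and patience are not recoverable from the text. In Keras, a comparable behaviour is usually obtained with the `ReduceLROnPlateau` callback monitoring the training loss, although whether the authors used it is not stated.

```python
class PlateauSchedule:
    """Divide the learning rate by `factor` after `patience` stale epochs."""

    def __init__(self, initial_lr, factor, patience):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def step(self, train_loss):
        """Update the schedule with this epoch's training loss; return the new rate."""
        if train_loss < self.best_loss:
            self.best_loss = train_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.factor
                self.stale_epochs = 0
        return self.lr


if __name__ == "__main__":
    # Hypothetical settings, for illustration only.
    schedule = PlateauSchedule(initial_lr=0.1, factor=10.0, patience=2)
    for loss in [1.0, 0.9, 0.9, 0.9, 0.8]:
        print(loss, schedule.step(loss))
```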
4.3 Comparison with the state-of-the-art
We compare our method with 16 previous state-of-the-art methods, namely MDF li2015visual , RFCN wang2016saliency , UCF zhang2017learning , Amulet zhang2017amulet , NLDF luo2017non , DSS hou2017deeply , BMPM zhang2018bi , PAGR zhang2018progressive , PiCANet liu2018picanet , SRM wang2017stagewise , DGRL wang2018detect , MLMS wu2019mutual , AFNet feng2019attentive , CapSal zhang2019capsal , BASNet qin2019basnet , and CPD wu2019cascaded . For a fair comparison, we use the saliency maps provided by the authors.
Quantitative Evaluation. P-R curves and F-measure curves on the five datasets are shown in Figure 6 and Figure 7, respectively. We can see that our proposed method performs favorably against other methods in all cases. In particular, our CAGNet-L performs better than all other methods by a relatively large margin. Moreover, we compare our method with other previous state-of-the-art methods in terms of avgF score, wF score, and MAE score on five benchmark datasets in Table 2. As seen from this table, our method ranks first in most cases. It is interesting to note that our method contains fewer parameters than the existing ones, is end-to-end, and does not need any post-processing step such as CRF krahenbuhl2011efficient . It is also notable that although our CAGNet-M has significantly fewer parameters than the other networks (only 5.57 million parameters), it shows outstanding performance. This property is desirable for applications in which memory is limited. Furthermore, our CAGNet-V has a real-time speed of 28 FPS when processing a image, and therefore it can be practically adopted as a preprocessing step for computer vision tasks.
|Dataset||DUTS-TE wang2017learning||ECSSD yan2013hierarchical||DUT-O yang2013saliency||PASCAL-S li2014secrets||HKU-IS li2015visual|
Qualitative Evaluation. Some qualitative results are shown in Figure 8. Thanks to the proposed modules, it can be seen that our model is capable of highlighting the inner part of foreground regions in various complicated scenes. Furthermore, our model is able to suppress the background regions which are incorrectly labeled by other saliency detection methods. Thus, by taking advantage of different proposed modules, our method is able to handle various complex scenarios.
4.4 Ablation analysis
Our proposed CAGNet consists of three modules, including the Multi-scale Feature Extraction Module (MFEM), the Guide Module, and the Residual Refinement Module (RRM). We perform the ablation analysis on CAGNet-V by using three challenging large-scale datasets, namely DUTS-TE wang2017learning , DUT-O yang2013saliency , and HKU-IS li2015visual . In order to investigate the effectiveness of each module, we gradually add them to our base network. Our base network is obtained by applying the following modifications to the CAGNet: i) replacing the MFEM modules with convolutions with the same number of filters, ii) removing Guide Modules from the network (which means that the multi-level features are not multiplied by the channel weights and the spatial weights), iii) removing the RRM modules from the model.
We perform ablation analysis by adding each module to our base network in a stepwise manner. The results are shown in Table 3. In this table, the base network is denoted as Base.
The effectiveness of Guide Module. We add the High-level Guide branch, Low-level Guide branch, and the both High-level and Low-level Guide branches (i.e., the Guide Module) to the base network, which are denoted as HG, LG, and GM, respectively in Table 3. As seen from this table, the performance improves, which shows the beneficial effect of using our Guide Module. Using this module results in i) making salient and non-salient regions more distinct and suppressing the non-salient regions that have ”salient-like” appearance, ii) assigning foreground label to different-looking salient regions. To further investigate the effectiveness of our guide branches, we show a visual comparison for each branch in Figure 9. As it can be seen, when we add the High-level Guide branch to the base network (Base+HG), the non-salient regions that have ”salient-like” appearance are suppressed. Furthermore, when we add the Low-level Guide branch to the base network (Base+LG), different-looking salient regions (head of the pencil and the rest of the pencil, head of the bird and the rest of the bird) are labeled as salient.
|Dataset|DUTS-TE wang2017learning| | |DUT-O yang2013saliency| | |HKU-IS li2015visual| | |
|Metric|avgF|wF|MAE|avgF|wF|MAE|avgF|wF|MAE|
|Base + HG|0.7598|0.7204|0.0588|0.6571|0.6068|0.0819|0.8700|0.8398|0.0450|
|Base + LG|0.7650|0.7268|0.0588|0.6649|0.6178|0.0817|0.8714|0.8419|0.0447|
|Base + GM|0.7707|0.7335|0.0558|0.6687|0.6209|0.0780|0.8743|0.8451|0.0432|
|Base + GM + MFEM|0.8003|0.7779|0.0481|0.7256|0.6971|0.0616|0.8960|0.8776|0.0346|
|CE Loss Function|0.7591|0.7517|0.0524|0.7017|0.6793|0.0652|0.8783|0.8558|0.0398|
The effectiveness of MFEM. Based on the aforementioned architecture, we replace the convolutions with the MFEM modules. As seen in Table 3, our proposed MFEM has a beneficial effect on saliency detection and improves the results, which shows extracting multi-scale features can help to detect salient objects with different scales and locations.
The effectiveness of RRM. To reveal the effect of the RRMs, we add them to the aforementioned architecture. From Table 3, it can be observed that using our refinement module is helpful for saliency detection and improves the performance.
The effectiveness of our designed loss. To demonstrate the effectiveness of our designed loss function, we train our CAGNet-V with Cross-entropy, denoted as CE Loss Function, in Table 3. As seen in this table, our designed loss outperforms the cross-entropy loss by a significant margin.
To further demonstrate the effectiveness of our MFEM, we implement the MFEMs in CAGNet-V by adopting dilated convolutional layers (kernel size=3, dilation rates=1, 3, 5, 7), denoted as Dilated Convolution in Table 3. We can see that the performance degrades, which shows that our MFEM can capture more powerful multi-scale features by enabling dense connections within a large region in the feature map. We also implement the MFEMs in CAGNet-V by adopting trivial convolutional layers with kernel size=3, 7, 11, 15, denoted as Trivial Convolution in Table 3. As seen from this table, the performance gets worse compared to our CAGNet-V with the proposed MFEM. It is interesting to note that CAGNet-V with our proposed MFEM contains fewer parameters than the CAGNet-V with the MFEM implemented by adopting trivial convolutional layers (20.98 million against 27.03 million), which is due to the architectural design of GCNs.
We perform another experiment on CAGNet-V and train it with different settings for the parameter . The results are shown in Table 4. By considering the trade-off between the performance and the number of parameters, we have chosen for our method.
5 Conclusion and future work
In this paper, we propose a novel end-to-end framework that i) makes the foreground and background regions more distinct and suppresses non-salient regions that have a "salient-like" appearance, and ii) detects salient objects that have different-looking regions. Our proposed model also captures multi-scale contextual information effectively. The attentive multi-scale guided features learned by our method and the results obtained with our designed loss function indicate that this is a promising approach for saliency detection. Experimental evaluations over five datasets demonstrate that our proposed method outperforms the previous state-of-the-art methods under different evaluation metrics.
Given the performance and real-time speed of our approach and its advantage over previous approaches, we plan to use our saliency detector in industrial object-related applications, such as object-based surveillance and object tracking. However, in real-world scenarios, images are affected by noise, which degrades the performance of the most recently introduced saliency detectors ijcai2018, including ours. Figure 10 shows the predicted saliency maps for images corrupted by Additive White Gaussian Noise (AWGN). As seen, our model fails to output accurate saliency predictions in the presence of noise. This motivates us to enhance the robustness of our method by handling noise in an end-to-end manner.
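As an illustration of the corruption considered here, the following minimal sketch adds AWGN to an image. It assumes the image is a nested list of intensities in [0, 1]. The function name `add_awgn`, the noise level `sigma`, and the clipping back to [0, 1] are assumptions for illustration, not the exact protocol used to produce Figure 10.

```python
import random


def add_awgn(image, sigma, seed=None):
    """Return a copy of `image` corrupted by zero-mean Gaussian noise.

    image -- list of rows with intensities in [0, 1]
    sigma -- standard deviation of the noise (hypothetical setting)
    seed  -- optional seed for reproducibility
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        # Independent noise per pixel, clipped back to the valid range.
        noisy.append([min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row])
    return noisy


clean = [[0.2, 0.4, 0.6], [0.8, 0.5, 0.1]]
noisy = add_awgn(clean, sigma=0.1, seed=0)
print(noisy)
```

The noise is drawn independently for every pixel with the same variance, which is what makes it additive and white.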
- (1) C. F. Flores, A. Gonzalez-Garcia, J. van de Weijer, B. Raducanu, Saliency for fine-grained object recognition in domains with scarce training data, Pattern Recognition 94 (2019) 62–73.
- (2) Z. Li, G. Liu, D. Zhang, Y. Xu, Robust single-object image segmentation based on salient transition region, Pattern Recognition 52 (2016) 317–331.
- (3) X.-H. Zhi, H.-B. Shen, Saliency driven region-edge-based top down level set evolution reveals the asynchronous focus in image segmentation, Pattern Recognition 80 (2018) 241–255.
- (4) Q. Cai, H. Liu, Y. Qian, S. Zhou, X. Duan, Y.-H. Yang, Saliency-guided level set model for automatic object segmentation, Pattern Recognition 93 (2019) 147–163.
- (5) W. Wang, J. Shen, F. Porikli, Saliency-aware geodesic video object segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3395–3402.
- (6) Y. Chen, Y. Pan, M. Song, M. Wang, Improved seam carving combining with 3d saliency for image retargeting, Neurocomputing 151 (2015) 645–653.
- (7) G. Zhang, Z. Yuan, Q. Tong, M. Zheng, J. Zhao, A novel framework for background subtraction and foreground detection, Pattern Recognition 84 (2018) 28–38.
- (8) S. Hong, T. You, S. Kwak, B. Han, Online tracking by learning discriminative saliency map with convolutional neural network, in: International conference on machine learning, 2015, pp. 597–606.
- (9) L. Huo, L. Jiao, S. Wang, S. Yang, Object-level saliency detection with color attributes, Pattern Recognition 49 (2016) 162–173.
- (10) A. Aksac, T. Ozyer, R. Alhajj, Complex networks driven salient region detection based on superpixel segmentation, Pattern Recognition 66 (2017) 268–279.
- (11) J. Liang, J. Zhou, L. Tong, X. Bai, B. Wang, Material based salient object detection from hyperspectral images, Pattern Recognition 76 (2018) 476–490.
- (12) Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, P.-M. Jodoin, Non-local deep features for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6609–6617.
- (13) P. Zhang, D. Wang, H. Lu, H. Wang, X. Ruan, Amulet: Aggregating multi-level convolutional features for salient object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 202–211.
- (14) X. Xi, Y. Luo, P. Wang, H. Qiao, Salient object detection based on an efficient end-to-end saliency regression network, Neurocomputing 323 (2019) 265–276.
- (15) X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, M. Jagersand, Basnet: Boundary-aware salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7479–7489.
- (16) T. Wang, A. Borji, L. Zhang, P. Zhang, H. Lu, A stagewise refinement model for detecting salient objects in images, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4019–4028.
- (17) C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, N. Sang, Learning a discriminative feature network for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1857–1866.
- (18) L. Wang, L. Wang, H. Lu, P. Zhang, X. Ruan, Salient object detection with recurrent fully convolutional networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (7) (2019) 1734–1746.
- (19) L. Wang, H. Lu, X. Ruan, M.-H. Yang, Deep networks for saliency detection via local estimation and global search, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3183–3192.
- (20) P. Zhang, D. Wang, H. Lu, H. Wang, B. Yin, Learning uncertain convolutional features for accurate saliency detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 212–221.
- (21) L. Zhang, J. Dai, H. Lu, Y. He, G. Wang, A bi-directional message passing model for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1741–1750.
- (22) P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, G. Cottrell, Understanding convolution for semantic segmentation, in: 2018 IEEE winter conference on applications of computer vision (WACV), IEEE, 2018, pp. 1451–1460.
- (23) C. Peng, X. Zhang, G. Yu, G. Luo, J. Sun, Large kernel matters–improve semantic segmentation by global convolutional network, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4353–4361.
- (24) C. Yang, L. Zhang, H. Lu, X. Ruan, M.-H. Yang, Saliency detection via graph-based manifold ranking, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 3166–3173.
- (25) M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, S.-M. Hu, Global contrast based salient region detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3) (2014) 569–582.
- (26) C. Aytekin, A. Iosifidis, M. Gabbouj, Probabilistic saliency estimation, Pattern Recognition 74 (2018) 359–372.
- (27) D. Shan, X. Zhang, C. Zhang, Visual saliency based on extended manifold ranking and third-order optimization refinement, Pattern Recognition Letters 116 (2018) 1–7.
- (28) G. Li, Y. Yu, Visual saliency based on multiscale deep features, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5455–5463.
- (29) R. Zhao, W. Ouyang, H. Li, X. Wang, Saliency detection by multi-context deep learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1265–1274.
- (30) G. Li, Y. Yu, Deep contrast learning for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 478–487.
- (31) Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, P. H. Torr, Deeply supervised salient object detection with short connections, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3203–3212.
- (32) T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, A. Borji, Detect globally, refine locally: A novel approach to saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3127–3135.
- (33) P. Zhang, W. Liu, Y. Lei, H. Lu, Hyperfusion-net: Hyper-densely reflective feature fusion for salient object detection, Pattern Recognition 93 (2019) 521–533.
- (34) G. Li, Y. Xie, L. Lin, Y. Yu, Instance-level salient object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2386–2395.
- (35) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
- (36) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- (37) B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning transferable architectures for scalable image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697–8710.
- (38) J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
- (39) K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, in: European conference on computer vision, Springer, 2016, pp. 630–645.
- (40) Q. Yan, L. Xu, J. Shi, J. Jia, Hierarchical saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1155–1162.
- (41) L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, X. Ruan, Learning to detect salient objects with image-level supervision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 136–145.
- (42) Y. Li, X. Hou, C. Koch, J. M. Rehg, A. L. Yuille, The secrets of salient object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 280–287.
- (43) M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International journal of computer vision 88 (2) (2010) 303–338.
- (44) A. Borji, M.-M. Cheng, H. Jiang, J. Li, Salient object detection: A benchmark, IEEE transactions on image processing 24 (12) (2015) 5706–5722.
- (45) R. Margolin, L. Zelnik-Manor, A. Tal, How to evaluate foreground maps?, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 248–255.
- (46) R. Achanta, S. Hemami, F. Estrada, S. Süsstrunk, Frequency-tuned salient region detection, in: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2009), 2009, pp. 1597–1604.
- (47) F. Chollet, et al., Keras, https://keras.io (2015).
- (48) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, software available from
- (49) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International journal of computer vision 115 (3) (2015) 211–252.
- (50) X. Zhang, T. Wang, J. Qi, H. Lu, G. Wang, Progressive attention guided recurrent network for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 714–722.
- (51) N. Liu, J. Han, M.-H. Yang, Picanet: Learning pixel-wise contextual attention for saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3089–3098.
- (52) R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, E. Ding, A mutual learning method for salient object detection with intertwined multi-supervision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8150–8159.
- (53) M. Feng, H. Lu, E. Ding, Attentive feedback network for boundary-aware salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1623–1632.
- (54) L. Zhang, J. Zhang, Z. Lin, H. Lu, Y. He, Capsal: Leveraging captioning to boost semantics for salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6024–6033.
- (55) Z. Wu, L. Su, Q. Huang, Cascaded partial decoder for fast and accurate salient object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3907–3916.
- (56) P. Krähenbühl, V. Koltun, Efficient inference in fully connected crfs with gaussian edge potentials, in: Advances in neural information processing systems, 2011, pp. 109–117.
- (57) S. Cai, J. Huang, D. Zeng, X. Ding, J. Paisley, Menet: A metric expression network for salient object segmentation, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, 2018, pp. 598–605.