Learning Uncertain Convolutional Features for Accurate Saliency Detection
Deep convolutional neural networks (CNNs) have delivered superior performance in many computer vision tasks. In this paper, we propose a novel deep fully convolutional network model for accurate salient object detection. The key contribution of this work is to learn deep uncertain convolutional features (UCF), which encourage the robustness and accuracy of saliency detection. We achieve this via introducing a reformulated dropout (R-dropout) after specific convolutional layers to construct an uncertain ensemble of internal feature units. In addition, we propose an effective hybrid upsampling method to reduce the checkerboard artifacts of deconvolution operators in our decoder network. The proposed methods can also be applied to other deep convolutional networks. Compared with existing saliency detection methods, the proposed UCF model is able to incorporate uncertainties for more accurate object boundary inference. Extensive experiments demonstrate that our proposed saliency model performs favorably against state-of-the-art approaches. The uncertain feature learning mechanism as well as the upsampling method can significantly improve performance on other pixel-wise vision tasks.
Figure: Examples of checkerboard artifacts in pixel-wise vision tasks. (a) Feature visualization [?]. (b) Generative adversarial example [?]. (c) Saliency detection [?]. (d) Semantic segmentation [?].
Saliency detection aims to identify the most important and conspicuous objects or regions in an image. As a pre-processing procedure in computer vision, saliency detection has greatly benefited many practical applications such as object retargeting [?], scene classification [?], semantic segmentation [?] and visual tracking [?]. Although significant progress has been made [?], saliency detection remains very challenging due to complex factors in real-world scenarios. In this work, we focus on improving the robustness of saliency detection models, which has been largely overlooked in the literature.
Previous saliency detection methods utilize several hand-crafted visual features and heuristic priors. Recently, deep learning based methods have become increasingly popular and have set the benchmark on many datasets [?]. Their superior performance is partly attributed to their strong representation power in modeling object appearances and varied scenarios. However, existing methods mainly exploit this exceptional performance and fail to provide a probabilistic interpretation of the “black-box” learning in deep neural networks. A reasonable probabilistic interpretation can provide relational confidences alongside predictions and make the prediction system more robust [?]. In addition, since uncertainty is a natural part of any predictive system, modeling it is of crucial importance. For instance, because the object boundary strongly affects the prediction accuracy of a saliency model, it is desirable that the model provide meaningful uncertainties on where the boundaries of distinct objects lie. As far as we know, no existing work models and analyzes the uncertainty of deep learning based saliency detection methods.
Another important issue is the checkerboard artifact in pixel-wise vision tasks, which aim to generate images or feature maps from low to high resolution. Several typical examples are shown in Fig. ? (see [?] for more details). These artifacts can be fatal for deep CNN based approaches. For example, when the artifacts appear in the output of a fully convolutional network (FCN), the network training may fail and the prediction can be completely wrong [?]. We find that the actual cause of these artifacts is the upsampling mechanism, which generally utilizes the deconvolution operation. Thus, it is of great interest to explore new upsampling methods that better reduce the artifacts for pixel-wise vision tasks. Meanwhile, the artifacts are also closely related to the uncertainty learning of deep CNNs.
All of the issues discussed above motivate us to learn uncertain features (probabilistic learning) through deep networks to achieve accurate saliency detection. Our model has several unique features, as outlined below.
Different from existing saliency detection methods, our model is extremely simplified. It consists of an encoder FCN, a corresponding decoder FCN followed by a pixel-wise classification layer. The encoder FCN hierarchically learns visual features from raw images while the decoder FCN progressively upsamples the encoded feature maps to the input size for the pixel-wise classification.
Our model can learn deep uncertain convolutional features (UCF) for more accurate saliency detection. The key ingredient is inspired by dropout [?]. We propose a reformulated dropout (R-dropout), leading to an adaptive ensemble of the internal feature units in specific convolutional layers. Uncertain features are achieved with no additional parameterization.
We propose a new upsampling method to reduce the checkerboard artifacts of deconvolution operations. The new upsampling method has two obvious advantages. On the one hand, it separates upsampling (to generate higher resolution feature maps) from convolution (to extract convolutional features); on the other hand, it is compatible with the regular deconvolution.
The uncertain feature extraction and saliency detection are unified in an encoder-decoder network architecture. The parameters of the proposed model (i.e., weights and biases in all the layers) are jointly trained by end to end gradient learning.
Our methods show good generalization on saliency detection and other pixel-wise vision tasks. Without any post-processing steps, our model yields comparable or even better performance on public saliency detection, semantic segmentation and eye fixation datasets.
Recently, deep learning has delivered superior performance in saliency detection. For instance, Wang et al. [?] propose two deep neural networks to integrate local estimation and global search for saliency detection. Li et al. [?] train fully connected layers of multiple CNNs to predict the saliency degree of each superpixel. To deal with the problem that salient objects may appear in a low-contrast background, Zhao et al. [?] take global and local context into account and model the saliency prediction in a multi-context deep CNN framework. These methods perform excellently; however, all of them include fully connected layers, which are computationally expensive. Moreover, fully connected layers discard the spatial information of input images. To address these issues, Li et al. [?] propose an FCN trained under a multi-task learning framework for saliency detection. Wang et al. [?] design a recurrent FCN to leverage saliency priors and refine the coarse predictions.
Although motivated by a similar spirit, our method significantly differs from [?] in three aspects. First, the network architecture is very different. The FCN we use follows the encoder-decoder style, which is designed from the viewpoint of reconstructing the main information. In [?], the FCN originates from the FCN-8s [?], designed with both long and short skip connections for the segmentation task. Second, instead of simply using FCNs as predictors as in [?], our model learns uncertain convolutional features by using multiple reformulated dropouts, which improve the robustness and accuracy of saliency detection. Third, our model is equipped with a new upsampling method that naturally handles the checkerboard artifacts of deconvolution operations. The checkerboard artifacts can be reduced through training the entire neural network. In contrast, the artifacts are handled by hand-crafted methods in [?]. Specifically, [?] uses superpixel segmentation to smooth the prediction, while [?] uses an edge-aware erosion procedure.
Our work is also related to model uncertainty in deep learning. Gal et al. [?] mathematically prove that a multilayer perceptron (MLP) with dropout applied before every weight layer is equivalent to an approximation of the probabilistic deep Gaussian process. Though the provided theory is solid, a full verification on deep CNNs remains underexplored. Based on this fact, we take a further step in this direction and show that a reformulated dropout can be used in convolutional layers for learning uncertain feature ensembles. Another representative work on model uncertainty is the Bayesian SegNet [?], which predicts pixel-wise scene segmentation with a measure of the model uncertainty. It obtains the model uncertainty by Monte Carlo sampling: dropout is activated at test time to generate a posterior distribution of pixel class labels. Different from [?], our model focuses on learning uncertain convolutional features during training.
3 The Proposed Model
3.1 Network Architecture Overview
Our architecture is partly inspired by the stacked denoising auto-encoder [?]. We generalize the auto-encoder to a deep fully convolutional encoder-decoder architecture. The resulting network forms a novel hybrid FCN which consists of an encoder FCN for high-level feature extraction, a corresponding decoder FCN for low-level information reconstruction and a pixel-wise classifier for saliency prediction. The overall architecture is illustrated in Figure 1. More specifically, the encoder FCN consists of multiple convolutional layers with batch normalizations (BN) [?] and rectified linear units (ReLU), followed by non-overlapping max pooling. The corresponding decoder FCN additionally introduces upsampling operations to build feature maps up from low to high resolution. We use the softmax classifier for the pixel-wise saliency prediction. In order to achieve the uncertainty of learned convolutional features, we utilize the reformulated dropout (dubbed R-Dropout) after several convolutional layers. The detailed network configuration is included in supplementary materials. We will fully elaborate the R-Dropout, our new upsampling method and the training strategy in the following subsections.
3.2 R-Dropout for Deep Uncertain Convolutional Feature Ensemble
Dropout is typically interpreted as bagging a large number of individual models [?]. Although plenty of experiments show that dropout for fully connected layers improves the generalization ability of deep networks, there is a lack of research on using dropout for other types of layers, such as convolutional layers. In this subsection, we show that using a modified dropout after convolutional layers can be interpreted as a kind of probabilistic feature ensemble. In light of this fact, we provide a strategy for learning uncertain convolutional features. R-Dropout in Convolution: Assume $X \in \mathbb{R}^{W \times H \times C}$ is a 3D tensor, and $f$ is a convolution operation in CNNs, projecting $X$ to the $\mathbb{R}^{W' \times H' \times C'}$ space by the convolution $*$ with parameters $\mathbf{W}$ and $\mathbf{b}$:
$$f(X) = \mathbf{W} * X + \mathbf{b}. \tag{1}$$
Let $g(\cdot)$ be a non-linear activation function. When the original dropout [?] is applied to the outputs of $g$, the activation and its disturbed version $\hat{g}$ are given by
$$g(f(X)) = g(\mathbf{W} * X + \mathbf{b}), \tag{2}$$
$$\hat{g}(f(X)) = M \circ g(f(X)), \tag{3}$$
where $\circ$ denotes the element-wise product and $M$ is a binary mask matrix of size $W' \times H' \times C'$ with each element drawn independently from $\mathrm{Bernoulli}(p)$. Eq. (3) denotes the activation with dropout during training, and Eq. (2) denotes the activation at test time. In addition, Srivastava et al. [?] suggest scaling the activations $g(f(X))$ by $p$ at test time to obtain an approximate average of the unit activation.
Many commonly used activation functions, such as Tanh, ReLU and LReLU [?], have the property that $g(0) = 0$. Thus, Eq. (3) can be re-written as the R-Dropout formula,
$$\hat{g}(f(X)) = g(M \odot f(X)), \tag{4}$$
where $\odot$ denotes the cross-channel element-wise product. From the above equations, we can derive that when $M$ is still binary, Eq. (4) implies a kind of stochastic property.
If the proposed R-Dropout is followed by a convolutional layer, the forward propagation of the input is formulated as
$$\hat{a}^{l} = g(M^{l} \odot f^{l}), \qquad f^{l+1} = \mathbf{W}^{l+1} * \hat{a}^{l} + \mathbf{b}^{l+1}, \tag{6}$$
where $l$ is the layer number and $*$ is the convolution operation. As Eq. (6) shows, the disturbed activation $\hat{a}^{l}$ is convolved with the filter $\mathbf{W}^{l+1}$ to produce the convolved features $f^{l+1}$. In this way, the network will focus on learning the weight and bias parameters, i.e., $\mathbf{W}$ and $\mathbf{b}$, and the uncertainty introduced by the R-Dropout will be dissipated during the training of deep networks.
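The step from Eq. (3) to Eq. (4) rests on the property that the activation maps zero to zero. The following minimal sketch, which is not the implementation used in the paper, checks this numerically in plain Python: for a binary mask, masking after the activation and masking before it give the same values for ReLU, leaky ReLU and Tanh, but not for a sigmoid. The function names, the retention probability 0.5 and the leak slope 0.01 are assumptions chosen only for illustration.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def leaky_relu(z, slope=0.01):  # slope is an assumed illustrative value
    return z if z > 0 else slope * z


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def masked_after(g, pre_activations, mask):
    """Original dropout order: apply the activation, then multiply by the mask."""
    return [m * g(z) for z, m in zip(pre_activations, mask)]


def masked_before(g, pre_activations, mask):
    """Reformulated order: multiply by the mask, then apply the activation."""
    return [g(m * z) for z, m in zip(pre_activations, mask)]


rng = random.Random(0)
pre_activations = [rng.uniform(-3.0, 3.0) for _ in range(1000)]
keep_prob = 0.5  # assumed retention probability
mask = [1 if rng.random() < keep_prob else 0 for _ in pre_activations]

# Activations with g(0) = 0: both orders agree element by element.
for g in (relu, leaky_relu, math.tanh):
    assert masked_after(g, pre_activations, mask) == masked_before(g, pre_activations, mask)

# A sigmoid has g(0) = 0.5, so the two orders disagree wherever the mask is 0.
assert masked_before(sigmoid, [2.0], [0]) == [0.5]
assert masked_after(sigmoid, [2.0], [0]) == [0.0]
```

The check only covers binary masks; for a non-binary generator the two orders generally differ, which is where the reformulated version departs from the original dropout.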
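If the R-Dropout is instead followed by a max-pooling layer, the forward propagation of the input becomes
$$\hat{a}^{l} = g(M^{l} \odot f^{l}), \qquad a^{l+1}_{j} = \mathcal{P}\big(\hat{a}^{l}_{1}, \dots, \hat{a}^{l}_{n}\big), \quad i \in R^{l}_{j}. \tag{7}$$
Here $\mathcal{P}$ denotes the max-pooling function, $R^{l}_{j}$ is the $j$-th pooling region at layer $l$, $a^{l}_{i}$ is the activation of each neuron within $R^{l}_{j}$, and $n$ is the number of units in $R^{l}_{j}$. To formulate the uncertainty, without loss of generality, we suppose the activations in each pooling region are arranged in non-decreasing order, i.e., $a^{l}_{1} \le a^{l}_{2} \le \dots \le a^{l}_{n}$. As a result, $a^{l}_{i}$ will be selected as the pooled activation on the conditions that (1) $a^{l}_{i+1}, \dots, a^{l}_{n}$ are dropped out, and (2) $a^{l}_{i}$ is retained. According to probability theory, with $p$ the probability of retaining a unit and $q = 1 - p$, this event occurs with probability
$$p_{i} = p\, q^{\,n-i}. \tag{8}$$
Therefore, performing the R-Dropout before the max-pooling operation is exactly sampling from the following multinomial distribution to select an index $i$, and the pooled activation is then simply $a^{l}_{i}$:
$$a^{l+1}_{j} = a^{l}_{i}, \qquad i \sim \mathrm{Multinomial}(p_{0}, p_{1}, \dots, p_{n}), \tag{9}$$
where $p_{0} = q^{n}$ is the probability of the special event that all the units in a pooling region are dropped out.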
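The selection probabilities described above can be tabulated directly. The sketch below is illustrative only. It assumes a retention probability written `keep_prob`, non-negative activations sorted in non-decreasing order (as after a ReLU), and a pooled value of zero when every unit is dropped; the function names are hypothetical.

```python
import random


def pooled_index_probabilities(n, keep_prob):
    """Return [p_0, p_1, ..., p_n] for a pooling region of n sorted units.

    p_0 is the probability that every unit is dropped; p_i (i >= 1) is the
    probability that unit i is kept while all larger units i+1..n are dropped.
    """
    drop_prob = 1.0 - keep_prob
    probs = [drop_prob ** n]
    probs += [keep_prob * drop_prob ** (n - i) for i in range(1, n + 1)]
    return probs


def sample_pooled_activation(sorted_activations, keep_prob, rng):
    """Draw the pooled value for one region by sampling the selected index."""
    n = len(sorted_activations)
    probs = pooled_index_probabilities(n, keep_prob)
    index = rng.choices(range(n + 1), weights=probs, k=1)[0]
    # Index 0 is the "all units dropped" event; the pooled value is then zero.
    return 0.0 if index == 0 else sorted_activations[index - 1]


region = [0.1, 0.4, 0.7, 0.9]  # non-decreasing activations of one region
probs = pooled_index_probabilities(len(region), keep_prob=0.5)
sample = sample_pooled_activation(region, keep_prob=0.5, rng=random.Random(0))
```

With four units and a retention probability of 0.5, the largest activation is selected half of the time and the probabilities sum to one, so the pooled output is a random draw over the region rather than a fixed maximum.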
The latter strategy demonstrates the effectiveness of building uncertainty by employing the R-Dropout in convolutional layers. We adopt it to build our network architecture (see Figure 1). We will experimentally demonstrate in Section 4 that the R-Dropout based FCN yields strong results on the saliency detection datasets.
3.3 Hybrid Upsampling for Prediction Smoothing
In this subsection, we first explicate the cause of checkerboard artifacts by the deconvolution arithmetic [?]. Then we derive a new upsampling method to reduce the artifacts as much as possible for the network training and inference.
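Without loss of generality, we focus on a square input ($i \times i$), square kernel size ($k \times k$), the same stride ($s \times s$) and the same zero padding ($p \times p$) (if used) along both axes. Since we aim to implement upsampling, we set $s \ge 2$. In general, the convolution operation can be described by
$$O = I \circledast F_{s}, \tag{10}$$
where $I$ is the input, $F_{s}$ is the filter with stride $s$, $\circledast$ is the discrete convolution and $O$ is the output, whose dimension is $o = \lfloor (i + 2p - k)/s \rfloor + 1$. The convolution has an associated deconvolution described by $\tilde{i}'$, $k' = k$, $s' = 1$ and $p' = k - p - 1$, where $\tilde{i}'$ is the size of the stretched input obtained by adding $s - 1$ zeros between each input unit, and the output size of the deconvolution, for a deconvolution input of size $i'$, is
$$o' = s(i' - 1) + k - 2p. \tag{11}$$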
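The size arithmetic, and the uneven overlap that is commonly blamed for checkerboard patterns, can be sketched in one dimension. The snippet below is a hedged illustration rather than the paper's code: it uses the standard convolution and deconvolution output-size relations from the deconvolution arithmetic cited above, assumes zero padding of 0 in the overlap count, and all function names are hypothetical.

```python
def conv_output_size(i, k, s, p):
    """Output size of a convolution along one axis."""
    return (i + 2 * p - k) // s + 1


def deconv_output_size(i, k, s, p):
    """Output size of the associated deconvolution for an input of size i.

    Assumes the forward convolution's (input + 2p - k) is a multiple of s.
    """
    return s * (i - 1) + k - 2 * p


def overlap_counts(i, k, s):
    """Count how many input units contribute to each output position (p = 0)."""
    counts = [0] * deconv_output_size(i, k, s, 0)
    for unit in range(i):
        for tap in range(k):
            counts[unit * s + tap] += 1
    return counts


# Kernel size 3 with stride 2: interior positions alternate between 2 and 1.
uneven = overlap_counts(4, 3, 2)
# Kernel size 4 with stride 2: every interior position receives 2 contributions.
even = overlap_counts(4, 4, 2)
```

When the kernel size is not a multiple of the stride, neighbouring output positions receive different numbers of contributions, which is the overlapping issue that the filter-size restriction is meant to avoid.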
Based on the above observations, we propose two strategies to avoid the artifacts produced by the regular deconvolution. The first one is restricting the filter size. We can simply ensure that the filter size is a multiple of the stride size, avoiding the overlapping issue, i.e.,
$$k = \lambda s, \qquad \lambda \in \mathbb{N}^{+}. \tag{12}$$
Then the deconvolution will process the zero-inserted input with the equivalent convolution, producing a smooth output. However, because this method only changes the receptive fields of the output and cannot change the frequency distribution of the zero-inserted input, the artifacts can still leak through in several extreme cases. We therefore propose an alternative strategy which separates upsampling from the equivalent convolution. We first resize the original input to the desired size by interpolation, and then perform equivalent convolutions. Although this strategy may destroy the learned features in deep CNNs, we find that high resolution maps built by iteratively stacking this kind of upsampling reduce the artifacts substantially. In order to combine the strengths of both strategies, we introduce the hybrid upsampling method, which sums the outputs of the two strategies. Figure 2 illustrates the proposed upsampling method. In our proposed model, we use bilinear (or nearest-neighbor) operations for the interpolation. These interpolation methods are linear operations and can be embedded into deep CNNs as efficient matrix multiplications.
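A one-dimensional toy version of the two branches and their sum may help fix ideas. This is a sketch under stated assumptions, not the network's actual layers: it uses nearest-neighbor interpolation, hand-picked kernels instead of learned ones, a stride of 2, and a crop of (k - s)/2 samples per side so that both branches have the same length (which requires k - s to be even). All function names are hypothetical.

```python
def transposed_conv1d(x, kernel, stride, padding):
    """1D deconvolution as scatter-add, then crop `padding` from both ends."""
    full = [0.0] * (stride * (len(x) - 1) + len(kernel))
    for unit, value in enumerate(x):
        for tap, weight in enumerate(kernel):
            full[unit * stride + tap] += value * weight
    return full[padding:len(full) - padding]


def nearest_upsample1d(x, factor):
    """Nearest-neighbor interpolation: repeat every sample `factor` times."""
    return [value for value in x for _ in range(factor)]


def same_conv1d(x, kernel):
    """Odd-sized kernel with zero padding; output length equals input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(padded[pos + tap] * weight for tap, weight in enumerate(kernel))
        for pos in range(len(x))
    ]


def hybrid_upsample1d(x, deconv_kernel, conv_kernel, stride=2):
    """Sum a deconvolution branch and an interpolate-then-convolve branch."""
    # Cropping (k - s) / 2 per side makes the deconv output stride * len(x) long.
    padding = (len(deconv_kernel) - stride) // 2
    deconv_branch = transposed_conv1d(x, deconv_kernel, stride, padding)
    interp_branch = same_conv1d(nearest_upsample1d(x, stride), conv_kernel)
    return [a + b for a, b in zip(deconv_branch, interp_branch)]


flat_input = [1.0, 1.0, 1.0]
# Kernel size 3 is not a multiple of the stride: a flat input comes out striped.
striped = transposed_conv1d(flat_input, [0.5] * 3, stride=2, padding=0)
# Kernel size 4 (a multiple of the stride) plus the interpolation branch.
smooth = hybrid_upsample1d(flat_input, [0.5] * 4, [0.25, 0.5, 0.25])
```

On a constant input, the unrestricted deconvolution alternates between two values in its interior, while the hybrid output is constant away from the borders.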
3.4 Training the Entire Network
Since there is not enough saliency detection data to train our model from scratch, we utilize the front-end of the VGG-16 model [?] as our encoder FCN (13 convolutional layers and 5 pooling layers pre-trained on ILSVRC 2014 for the image classification task). Our decoder FCN is a mirrored version of the encoder FCN, and has multiple series of upsampling, convolution and rectification layers. Batch normalization (BN) is added to the output of every convolutional layer. We add the R-Dropout with an equal sampling rate after specific convolutional layers, as shown in Figure 1. For saliency detection, we randomly initialize the weights of the decoder FCN and fine-tune the entire network on the MSRA10K dataset [?], which is widely used in the salient object detection community (more details are given in Section 4). We convert the ground-truth saliency map of each image in that dataset to a 0-1 binary map. This transform matches the channel output of the FCN when we use the softmax cross-entropy loss given by Eq. (13) for separating the salient foreground from the general background:
$$L = -\sum_{x} \Big[ l_{x} \log p_{x} + (1 - l_{x}) \log (1 - p_{x}) \Big], \tag{13}$$
where $l_{x} \in \{0, 1\}$ is the label of a pixel $x$ in the image and $p_{x}$ is the probability that the pixel is the salient foreground. The value of $p_{x}$ is obtained from the output of the network. Before the training images are fed into our proposed model, the ImageNet mean [?] is subtracted from each image, and each image is rescaled to the same size ($448 \times 448$). For correspondence, we also rescale the 0-1 binary maps to the same size. The model is trained end to end using mini-batch stochastic gradient descent (SGD) with momentum and a learning rate decay schedule. The detailed settings of parameters are included in the supplementary material.
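For a two-class output, the softmax cross-entropy over pixels reduces to a binary cross-entropy on the foreground probability. The sketch below illustrates that reduced form only; it is not the training code, the summation over pixels (rather than averaging) and the clamping constant `eps` are assumptions, and the function name is hypothetical.

```python
import math


def pixelwise_cross_entropy(labels, foreground_probs, eps=1e-12):
    """Sum of binary cross-entropy terms over all pixels.

    labels: 0/1 ground-truth values, one per pixel.
    foreground_probs: predicted probability of the salient class per pixel.
    eps: small clamp (an assumption) that keeps log() finite.
    """
    total = 0.0
    for label, prob in zip(labels, foreground_probs):
        prob = min(max(prob, eps), 1.0 - eps)
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total


# Two pixels predicted at chance level contribute log(2) each.
chance_loss = pixelwise_cross_entropy([1, 0], [0.5, 0.5])
# A confident, correct prediction contributes almost nothing.
confident_loss = pixelwise_cross_entropy([1], [1.0])
```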
Because our model is a fully convolutional network, it can take images of arbitrary size as inputs at test time. After the feed-forward process, the output of the network is composed of a foreground excitation map ($M_{fe}$) and a background excitation map ($M_{be}$). We take the difference between $M_{fe}$ and $M_{be}$ and clip the negative values to obtain the resulting saliency map, i.e.,
$$S = \max(0,\, M_{fe} - M_{be}). \tag{14}$$
This subtraction strategy not only increases the pixel-level discrimination but also captures context contrast information. Optionally, we can take the ensemble of multi-scale predicted maps to further improve performance.
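The subtraction and clipping described above amount to a per-pixel operation. A minimal sketch, assuming the two excitation maps are given as equally shaped 2D lists of floats (the function name and the toy values are hypothetical):

```python
def saliency_from_excitations(foreground, background):
    """Clip (foreground - background) at zero, element by element.

    Both arguments are 2D lists (rows of floats) with identical shapes.
    """
    return [
        [max(0.0, fg - bg) for fg, bg in zip(fg_row, bg_row)]
        for fg_row, bg_row in zip(foreground, background)
    ]


foreground_map = [[0.75, 0.25], [0.5, 0.5]]
background_map = [[0.25, 0.75], [0.5, 0.25]]
saliency_map = saliency_from_excitations(foreground_map, background_map)
```

Pixels where the background response dominates are set to zero, so only locations with a positive foreground margin remain salient.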
In this section, we start by describing the experimental setup for saliency detection. Then, we thoroughly evaluate and analyze our proposed model on public saliency detection datasets. Finally, we provide additional experiments to verify the generalization of our methods on other pixel-wise vision tasks, i.e., semantic segmentation and eye fixation.
For training the proposed network, we simply augment the MSRA10K dataset [?] by mirror reflection and rotation, producing 80,000 training images in total.
For the detection performance evaluation, we adopt six widely used saliency detection datasets, as follows.
DUT-OMRON [?]. This dataset consists of 5,168 high quality images. Images in this dataset have one or more salient objects and relatively complex backgrounds. Thus, this dataset is difficult and challenging for saliency detection.
ECSSD [?]. This dataset contains 1,000 natural images, including many semantically meaningful and complex structures in the ground truth segmentations.
HKU-IS [?]. This dataset contains 4,447 images with high quality pixel-wise annotations. Images in this dataset are well chosen to include multiple disconnected objects or objects touching the image boundary.
PASCAL-S [?]. This dataset is carefully selected from the PASCAL VOC dataset [?] and contains 850 images.
SED [?]. This dataset contains two different subsets: SED1 and SED2. SED1 has 100 images, each containing only one salient object, while SED2 has 100 images, each containing two salient objects.
SOD [?]. This dataset has 300 images, and it was originally designed for image segmentation. Pixel-wise annotation of salient objects was generated by [?].
We implement our approach on the MATLAB R2014b platform with a modified Caffe toolbox [?]. We run our approach on a quad-core PC with an i7-4790 CPU (with 16G memory) and one NVIDIA Titan X GPU (with 12G memory). The training process of our model takes almost 23 hours and converges after 200k iterations of mini-batch SGD. The proposed saliency detection algorithm runs at about 7 fps with resolution (23 fps with resolution). The source code can be found at http://ice.dlut.edu.cn/lu/.
Saliency Evaluation Metrics:
We adopt three widely used metrics to measure the performance of all algorithms, i.e., the Precision-Recall (PR) curves, F-measure and Mean Absolute Error (MAE) [?]. The precision and recall are computed by thresholding the predicted saliency map and comparing the binary map with the ground truth. The PR curve of a dataset indicates the mean precision and recall of saliency maps at different thresholds. The F-measure is a balanced mean of average precision and average recall, and can be calculated by
$$F_{\beta} = \frac{(1 + \beta^{2}) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^{2} \times \mathrm{Precision} + \mathrm{Recall}}. \tag{15}$$
Following existing works [?] [?] [?] [?], we set $\beta^{2}$ to 0.3 to weigh precision more than recall. We report the performance when each saliency map is adaptively binarized with an image-dependent threshold. The threshold is determined to be twice the mean saliency of the image:
$$T = \frac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y), \tag{16}$$
where $W$ and $H$ are the width and height of an image, and $S(x, y)$ is the saliency value of the pixel at $(x, y)$.
We also calculate the mean absolute error (MAE) for fair comparisons, as suggested by [?]. The MAE evaluates the saliency detection accuracy by
$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \lvert S(x, y) - G(x, y) \rvert, \tag{17}$$
where $G$ is the binary ground truth mask.
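The per-image quantities behind these metrics can be sketched in a few lines. The code below is illustrative and makes several assumptions: saliency values lie in [0, 1], maps are 2D lists, a pixel counts as salient when its value is at least the adaptive threshold, the F-measure is returned as 0 when there are no true positives, and dataset-level averaging is left out. The function names are hypothetical.

```python
def adaptive_threshold(saliency):
    """Twice the mean saliency value of the map."""
    values = [v for row in saliency for v in row]
    return 2.0 * sum(values) / len(values)


def f_measure(saliency, ground_truth, beta_sq=0.3):
    """Weighted F-measure of one map binarized at its adaptive threshold."""
    threshold = adaptive_threshold(saliency)
    tp = fp = fn = 0
    for s_row, g_row in zip(saliency, ground_truth):
        for s, g in zip(s_row, g_row):
            predicted = s >= threshold
            if predicted and g == 1:
                tp += 1
            elif predicted:
                fp += 1
            elif g == 1:
                fn += 1
    if tp == 0:
        return 0.0  # assumed convention when nothing is correctly detected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def mean_absolute_error(saliency, ground_truth):
    """Average absolute difference between the saliency map and the binary mask."""
    diffs = [
        abs(s - g)
        for s_row, g_row in zip(saliency, ground_truth)
        for s, g in zip(s_row, g_row)
    ]
    return sum(diffs) / len(diffs)


toy_saliency = [[1.0, 0.5], [0.0, 0.0]]
toy_mask = [[1, 1], [0, 0]]
toy_threshold = adaptive_threshold(toy_saliency)
toy_f = f_measure(toy_saliency, toy_mask)
toy_mae = mean_absolute_error(toy_saliency, toy_mask)
```

In the toy example the threshold is 0.75, so only the first pixel is predicted salient: precision is 1, recall is 0.5, and the weighting toward precision keeps the F-measure well above the recall.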
Figure: (a) ECSSD; (b) SED1; (c) SED2.
4.2 Performance Comparison with State-of-the-art
We compare the proposed UCF algorithm with 10 other state-of-the-art methods, including 6 deep learning based algorithms (DCL [?], DS [?], ELD [?], LEGS [?], MDF [?], RFCN [?]) and 4 conventional counterparts (BL [?], BSCA [?], DRFI [?], DSR [?]). For fair comparison, we adopt the source codes with recommended parameters or the saliency maps of the competing methods.
As shown in Fig. ? and Table 1, our proposed UCF model can consistently outperform existing methods across almost all the datasets in terms of all evaluation metrics, which convincingly indicates the effectiveness of the proposed methods. Refer to the supplemental material for more results on DUT-OMRON, HKU-IS, PASCAL-S and SOD datasets.
From these results, we have several fundamental observations: (1) Our UCF model outperforms other algorithms on the ECSSD and SED datasets by a large margin in terms of F-measure and MAE. More specifically, our model improves the F-measure achieved by the best-performing existing algorithm by 3.9% and 6.15% on the ECSSD and SED datasets, respectively. The MAE is consistently improved. (2) Although our proposed UCF is not the best on the HKU-IS and PASCAL-S datasets, it is still very competitive (our model ranks second on these datasets). It is necessary to note that only the augmented MSRA10K dataset is used for training our model. The RFCN, DS and DCL methods are pre-trained on the additional PASCAL VOC segmentation dataset [?], which overlaps with the PASCAL-S and HKU-IS datasets. This fact may explain their success on the two datasets. However, their performance on other datasets is clearly inferior. (3) Compared with other methods, our proposed UCF achieves lower MAE on most of the datasets. This suggests that, through uncertain feature learning, our model is more confident in its predicted regions.
The visual comparison of different methods on typical images is shown in Fig. ?. Our saliency maps can reliably highlight the salient objects in various challenging scenarios, e.g., low contrast between objects and backgrounds (the first two rows), multiple disconnected salient objects (rows 3-4) and objects near the image boundary (rows 5-6). In addition, our saliency maps provide more accurate boundaries of salient objects (rows 1, 3 and 6-8).
Ablation Studies: To verify the contribution of each component, we also evaluate several variants of the proposed UCF model with different settings, as illustrated in Table 2. The corresponding performance is reported in Table 1. The V-A model is an approximation of the DeconvNet [?]. The comparison between V-A and V-B demonstrates that our uncertain learning mechanism indeed helps to learn more robust features for accurate saliency inference. The comparison between V-B and V-C shows the effects of the two upsampling strategies. The results imply that the interpolation strategy performs much better in saliency detection. The joint comparison of V-B, V-C and UCF confirms that our hybrid upsampling method is capable of better refining the output saliency maps. An example of the visual effects is illustrated in Fig. ?. In addition, the V-D and V-E models verify the usefulness of deconvolution and interpolation upsampling, respectively. The V-B and V-C models achieve competitive or even better results than other saliency methods. This further confirms the strength of our methods.
To verify the generalization of our methods, we perform additional experiments on other pixel-wise vision tasks. Following existing works [?], we simply change the classifier to 21 classes and perform the semantic segmentation task on the PASCAL VOC 2012 dataset [?]. Our UCF model is trained with the PASCAL VOC 2011 training and validation data, using Berkeley’s extended annotations [?]. We achieve strong results (mean IOU: 68.25, mean pix.accuracy: 92.19, pix.accuracy: 77.28), which are comparable with other state-of-the-art segmentation methods. In addition, though the segmentation performance gaps are not as large as in saliency detection, our new upsampling method indeed performs better than regular deconvolution (mean IOU: 67.45 vs 65.173, mean pix.accuracy: 91.21 vs 90.84, pix.accuracy: 76.18 vs 75.73). The task of eye fixation prediction is essentially different from our classification task. We use the Euclidean loss for the gaze prediction. We submit our results to the servers of the MIT300 [?], iSUN [?] and SALICON [?] benchmarks with standard setups. Our model also achieves comparable results, as shown in Table 3. All of the above results on semantic segmentation and eye fixation tasks indicate that our model generalizes well to other pixel-wise tasks.
In this paper, we propose a novel fully convolutional network for saliency detection. A reformulated dropout is utilized to facilitate probabilistic training and inference. This uncertain learning mechanism enables our method to learn uncertain convolutional features and yield more accurate saliency prediction. A new upsampling method is also proposed to reduce the artifacts of deconvolution operations and to explicitly encourage the network to learn accurate boundaries for saliency detection. Extensive evaluations demonstrate that our methods can significantly improve the performance of saliency detection and show good generalization on other pixel-wise vision tasks.
- Stochastic property means that one can use a specific probability distribution to generate the learnable tensor $M$ during each training iteration. The update of $M$ forms a stochastic process rather than a deterministic decision.
- In R-Dropout, the generator can be any probability distribution. The original dropout is a special case of the R-Dropout in which the generator is the Bernoulli distribution.
- The constraint on the size of the input $i$ can be relaxed by introducing another parameter $a \in \{0, \dots, s - 1\}$ that allows one to distinguish between the $s$ different cases that all lead to the same $i'$.