Video Salient Object Detection via Fully Convolutional Networks
This paper proposes a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: (1) training a deep video saliency model in the absence of sufficiently large, pixel-wise annotated video data; and (2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules that capture spatial and temporal saliency information, respectively. The dynamic saliency model, which explicitly incorporates saliency estimates from the static saliency model, directly produces spatiotemporal saliency inference without time-consuming optical flow computation. We further propose a novel data augmentation technique that simulates video training data from existing annotated image datasets, which enables our network to learn diverse saliency information and prevents overfitting to the limited number of training videos. Leveraging our synthetic video data (150K video sequences) and real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, thus producing accurate spatiotemporal saliency estimates. We advance the state of the art on the DAVIS dataset (MAE of 0.06) and the FBMS dataset (MAE of 0.07), and do so with much improved speed (2 fps with all steps).
Saliency detection has recently attracted a great amount of research interest. The reason behind this growing popularity lies in the effective use of these models in various vision tasks, such as image segmentation, object detection, and video summarization and compression, to name a few. Saliency models can be broadly classified into two categories: human eye fixation prediction and salient object detection. According to the type of input, they can be further categorized into static and dynamic saliency models. While static models take still images as input, dynamic models work on video sequences. In this paper, we focus on detecting distinctive regions in dynamic scenes. Convolutional neural networks (CNNs) have been successfully utilized in many fundamental areas of computer vision, including object detection [1, 4], semantic segmentation, and still-image saliency detection [7, 8]. Inspired by this, we investigate applying CNNs to another computer vision task, namely video saliency detection.
The first problem in applying CNNs to video saliency is the lack of sufficiently large, densely labelled video training data. As far as we know, the successes of CNNs in computer vision are largely attributed to the availability of large-scale annotated images (e.g., ImageNet). However, existing video datasets are too small to provide adequate training data for CNNs. In Table 1, we list the statistics of the ImageNet dataset and widely adopted video object segmentation datasets, including FBMS, SegTrackV2, VSB100 and DAVIS. It can be observed that the existing video datasets rarely match image datasets like ImageNet in either quality or quantity. Moreover, given the high correlation between frames of the same video clip, existing video datasets fall far short of the needs of training CNNs for pixel-level video applications such as video salient object detection. On the other hand, creating such a large-scale video dataset is currently infeasible in most cases, because annotating videos is complex and time-consuming. To this end, we propose a video data augmentation approach that synthetically generates labeled video training data by explicitly leveraging existing large-scale image segmentation datasets. The simulated video data are easily accessible and rapidly generated, are close to realistic video sequences, present various motion patterns and deformations, and are accompanied by automatically generated annotations and optical flow. The experimental results obtained with these automatically generated videos clearly demonstrate the practicability of our strategy.
Table 1: Statistics of the ImageNet dataset and the FBMS, SegTrackV2, VSB100 and DAVIS datasets.
Our video data synthesis approach removes an underlying obstacle to learning CNNs for many video processing applications, and dynamic saliency detection is no exception. Another challenge for detecting saliency in dynamic scenarios derives from the nature of the task. As suggested by human visual perception research [14, 15], when computing dynamic saliency maps, video saliency models need to consider both the spatial and the temporal characteristics of the scene. We propose a deep video saliency model that produces spatiotemporal saliency by fully exploring both static and dynamic saliency information. The proposed model adopts fully convolutional networks (FCNs) for pixel-wise saliency prediction. Using existing rich image saliency data, the static saliency is deeply exploited and explicitly encoded in the deep learning process by transferring and fine-tuning networks that have recently succeeded in image classification. For learning dynamic saliency cues, the proposed deep video saliency model learns from a large number of labelled videos, including both synthetically generated and natural video data, in a supervised learning mode. The static saliency is integrated into the dynamic saliency detection process, so that the final spatiotemporal saliency estimate is produced directly.
Another important contribution of this work is that our deep video saliency model is much more computationally efficient than existing video saliency models. Salient object detection is a key step in many image analysis tasks, as it not only identifies relevant parts of a visual scene but may also reduce computational complexity by filtering out irrelevant segments of the scene. In recent years, some notable video saliency models have been proposed and have proven useful in many computer vision applications, such as video segmentation and video re-timing. However, time efficiency is the common major bottleneck for the applicability of existing video saliency algorithms; most of the computation time is spent on optical flow computation. Additionally, from the perspective of learning deep networks in dynamic scenes, many schemes [20, 21, 22] take optical flow as input, which incurs high computational expense.
In this work, we propose a video saliency model that is both effective and efficient, and that frees itself from computationally expensive optical flow estimation. One of the key insights of this paper is that, unlike high-level video applications such as action detection, video saliency can be derived from short-term analysis of video frames. Thus we directly capture temporal saliency by learning deep networks from frame pairs, instead of using long-term video information such as optical flow from multiple adjacent video frames.
We comprehensively evaluate our method on the FBMS dataset, where the proposed video saliency model produces more accurate saliency maps than state-of-the-art methods. Meanwhile, it achieves a frame rate of 2 fps (including all steps) on a GPU. It is thus a practical video saliency detection model in terms of both speed and accuracy. We also report results on the newly released DAVIS dataset and observe performance improvements over current competitors.
To summarize, the main contributions of this paper are threefold:
We investigate convolutional neural networks for end-to-end training and pixel-wise saliency prediction in dynamic scenes. As far as we know, this is the first work to apply deep learning to video salient object detection.
We propose a novel training scheme based on synthetically generated video data, which explicitly leverages existing rich image datasets; both static and dynamic saliency information are encoded into a unified deep learning model.
Our methods are computationally efficient, much faster than traditional video saliency models and other deep networks in dynamic scenes.
The rest of this paper is structured as follows. An overview of related work is given in Section II. Section III defines our proposed deep saliency model. The proposed synthetic video generation approach is described in Section IV. Section V presents experimental results on different datasets and compares them with state-of-the-art methods. Finally, concluding remarks are given in Section VI.
II Related Work
In this section, we give a brief overview of recent work along two lines: saliency detection, and deep learning models in dynamic scenes.
II-A Saliency Detection
Saliency detection has been extensively studied in computer vision, and saliency models in general can be categorized into visual attention prediction and salient object detection. The former methods [14, 23, 24, 25] try to predict scene locations where a human observer may fixate. Salient object detection [26, 27, 28] aims at uniformly highlighting the salient regions, which has been shown to benefit a wide range of computer vision applications. More detailed reviews of saliency models can be found in [29, 30]. Saliency models can be further divided into static and dynamic ones according to their input. In this work, we aim at detecting salient object regions in videos.
Image saliency detection has been extensively studied for decades, and most of the methods are driven by the well-known bottom-up strategy. Early bottom-up models [26, 27] are mainly based on detecting contrast: they assume that salient regions in the visual field first pop out from their surroundings, and compute feature-based contrast according to various mathematical principles. Meanwhile, other mechanisms [28, 31, 32] have been proposed that adopt prior knowledge, such as a background prior or global information, to detect salient objects in still images. More recently, deep learning techniques have been introduced to image saliency detection. These methods [7, 33] typically use CNNs to examine a large number of region proposals, from which the salient objects are selected. Currently, more and more methods [34, 36, 37, 38] tend to learn in an end-to-end manner and directly generate pixel-wise saliency maps via fully convolutional networks (FCNs).
Compared with saliency detection in still images, detecting saliency in videos is a much more challenging problem because of the difficulty of detecting and utilizing temporal and motion information. So far, only a limited number of algorithms have been proposed for spatiotemporal saliency detection. Early models [50, 51, 52] can be viewed as simple extensions of existing static saliency models with an extra temporal dimension. Some more recent and notable approaches [2, 3, 6, 17, 45, 53] to this task have been proposed, showing inspiring performance and good potential in many computer vision applications [18, 67, 46, 68, 58]. However, the applicability of these approaches is severely limited by their high computational cost. The main computational bottleneck comes from optical flow estimation, which nevertheless contributes much to their promising results.
In recent years, the scope of saliency detection has been extended to capturing common saliency among related images or videos [40, 41, 42, 44, 47], inferring the salient event within video sequences, and scene understanding [48, 49, 43]. However, there are significant differences between these methods and traditional saliency detection, especially in their goals and core difficulties.
II-B Deep Learning Models in Dynamic Scenes
In this section, we mainly focus on well-known deep learning models for computer vision applications in dynamic scenes, including action recognition [20, 54], object segmentation [55, 22], object tracking [56, 57, 59, 60, 61], attention prediction and semantic segmentation, and explore their architectures and training schemes. This helps to clarify how our approach differs from previous efforts and to highlight its benefits in terms of effectiveness and efficiency.
Many approaches [56, 57, 62] directly feed single video frames into neural networks trained on image data and adopt various techniques for post-processing the results with temporal or motion information. Unfortunately, these neural networks forgo learning the temporal information, which is often very important in video processing applications.
A well-known architecture for training CNNs for action recognition in videos incorporates two-stream convolutional networks for learning complementary information on appearance and motion. Other works [21, 55] adopt this architecture for dynamic attention prediction and video object segmentation. However, these methods train their models on multi-frame dense optical flow, which causes a heavy computational burden.
In the areas of human pose estimation and video object processing, an online learning strategy has been introduced to improve performance [22, 54, 59, 60, 61]. Before processing an input video, these approaches generate various training samples for fine-tuning neural networks learned from image data, thus optimizing the models towards the object of interest in the test video sequence. These models are therefore quite time-consuming, and the fine-tuned models are specialized for specific classes of objects only.
In this work, we show that it is possible to learn to detect generic salient objects in dynamic scenes by training on videos and images in an entirely offline manner. We propose a novel technique for synthesizing video data by leveraging large amounts of image training data. The CNN model can be efficiently and entirely trained on rich video sequences and images, thus successfully learning both static and dynamic saliency features. Meanwhile, it directly learns the relationship between frames, dispensing with time-consuming motion computation. Thus, our algorithm is significantly faster than traditional video saliency methods and than deep learning architectures that require optical flow as input. In summary, our CNN model learns to detect video saliency in a fast and effective manner.
III Deep Networks for Video Saliency Detection
In this work, we describe a procedure for constructing and learning deep video saliency networks using a novel synthetic video data generation approach. Our approach generates a large amount of video data (150K paired frames) from existing image datasets, and combines these annotated video sequences with existing video data to learn deep video saliency networks. We first introduce the proposed CNN-based video saliency model in this section, and then describe our video synthesis approach in Sec. IV.
III-A Architecture Overview
We start with an overview of our deep video saliency model before going into details below. At a high level, we feed the frames of a video into a neural network, and the network successively outputs saliency maps in which brighter pixels indicate higher saliency values. The network is trained with video sequences and images and learns spatiotemporal saliency in general dynamic scenes. Fig. 1 shows the architecture of the proposed deep video saliency model. Inspired by classical human visual perception research [14, 15], which suggests that both static and dynamic saliency cues contribute to video saliency, we design our model with two modules that simultaneously consider the spatial and temporal characteristics of the scene.
The first module captures static saliency, taking a single frame as input. It adopts fully convolutional networks (FCNs) for generating a pixel-wise saliency estimate and builds on existing models pre-trained on large-scale image datasets. Benefiting from rich image saliency benchmarks, this module is efficiently trained to capture diverse static saliency information of objects of interest. This module is described in detail in Sec. III-B. The second module takes frame pairs and the static saliency from the first module as input, and generates the final dynamic saliency results. This network is trained on both synthetic and real labelled video data (see details in Sec. III-C).
III-B Deep Networks for Static Saliency
A static saliency network takes a single frame as input and produces a saliency map of the same size as the input. We model this process with a fully convolutional network (FCN). The bottom of this network is a stack of convolutional layers. A convolutional layer is defined on a shared-parameter (weight vector and bias) architecture and has translation-invariance characteristics. The input and output of each convolutional layer are sets of arrays, called feature maps, with size $h \times w \times c$, where $h$, $w$ and $c$ are the height, width and feature (channel) dimensionality, respectively. For the first convolutional layer, the input is the color image, with pixel size $h$ and $w$, and three channels. At the output, each feature map indicates a particular feature representation extracted at all locations of the input, which is obtained by convolving the input feature map with a trainable linear filter (or kernel) and adding a trainable bias parameter. If we denote the input feature map as $X$, whose convolution filters are determined by the kernel weights $W$ and bias $b$, then the output feature map is obtained via:
$$f_s(X; W, b) = W *_s X + b, \quad (1)$$
where $*_s$ is the convolution operation with stride $s$. After each convolutional layer, a point-wise nonlinearity (e.g., ReLU) is applied to improve the feature representation capability. Additionally, convolutional layers are often followed by some form of non-linear down-sampling (e.g., max pooling). This results in a robust feature representation that tolerates small variations in the location of the input feature map.
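As a rough illustration of the layer described above, the following NumPy sketch applies a strided convolution, a ReLU and a max pooling step to a small feature map. It is a didactic sketch only, not the implementation used in this work: the function names `conv2d`, `relu` and `max_pool`, the "valid" (no padding) boundary handling, and the loop-based evaluation are all assumptions made for readability. As in most CNN libraries, the "convolution" is computed as a cross-correlation.

```python
import numpy as np

def conv2d(x, w, b, stride=1):
    """Strided 'valid' convolution (cross-correlation, as in CNN libraries).

    x: input feature map, shape (h, w, c_in)
    w: kernel weights, shape (k, k, c_in, c_out)
    b: bias, shape (c_out,)
    """
    k = w.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.empty((out_h, out_w, w.shape[3]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Sum over the k x k x c_in window for every output channel.
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + b
    return out

def relu(x):
    """Point-wise nonlinearity."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h, w, c = x.shape
    h2, w2 = h // size, w // size
    x = x[:h2 * size, :w2 * size]
    return x.reshape(h2, size, w2, size, c).max(axis=(1, 3))
```

Both the stride and the pooling reduce the spatial resolution of the output, which is why an upsampling stage is needed before pixel-wise prediction.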
Due to the stride of the convolutional and feature pooling layers, the output feature maps are coarse and of reduced resolution. However, for saliency detection, we are more interested in pixel-wise saliency prediction. For upsampling the coarse feature map, multi-layer deconvolution (or backwards convolution) networks are put on top of the convolution networks:
$$Y = D_S(F_S(I; \Theta_F); \Theta_D), \quad (2)$$
where $I$ is the input image; $F_S(\cdot)$ denotes the output feature map generated by the convolutional layers with a total stride of $S$; and $D_S(\cdot)$ denotes the deconvolution layers that upsample the input by a factor of $S$ to ensure that the output $Y$ has the same spatial size as the input image $I$. The deconvolution operation is achieved by reversing the forward and backward passes of the corresponding convolution layer. All the parameters $\Theta$ of the convolution and deconvolution layers are learnable.
Finally, on top of the network, a convolutional layer with a $1 \times 1$ kernel maps the feature maps into a precise saliency prediction map through a sigmoid activation unit. We use the sigmoid layer for prediction so that each entry in the output has a real value between 0 and 1. Thanks to the use of an FCN, the network can operate on input images of arbitrary size and preserves spatial information. Fig. 2 illustrates the detailed configuration of our deep network for static saliency.
For training, all the parameters $\Theta$ are learned by minimizing a loss function, which is computed as the error between the probability map and the ground truth. As demonstrated in prior work, the use of an asymmetric weighted loss helps greatly in the case of unbalanced data. Considering that the numbers of salient and non-salient pixels are usually imbalanced, we compute a weighted cross-entropy loss. Given a training sample $(I, G)$ consisting of an image $I$ with size $h \times w \times 3$ and a groundtruth saliency map $G \in \{0, 1\}^{h \times w}$, the network produces a saliency probability map $P \in [0, 1]^{h \times w}$. For any given training sample, the training loss on the network prediction $P$ is thus given by
$$L(P, G) = -\sum_{i=1}^{h \times w} \big( (1 - \alpha)\, g_i \log p_i + \alpha\, (1 - g_i) \log (1 - p_i) \big), \quad (3)$$
where $g_i \in G$ and $p_i \in P$; $\alpha$ refers to the ratio of salient pixels in the ground truth $G$.
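The weighted cross-entropy loss described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the training code used in this work: we assume the usual class-balancing convention in which the salient term is weighted by one minus the ratio of salient pixels and the non-salient term by that ratio, and the function name `weighted_cross_entropy` and the clipping constant `eps` are hypothetical additions for numerical safety.

```python
import numpy as np

def weighted_cross_entropy(p, g, eps=1e-7):
    """Class-balanced cross-entropy between a probability map p and a binary mask g.

    p: predicted saliency probabilities in [0, 1], shape (h, w)
    g: ground-truth mask with values in {0, 1}, shape (h, w)
    """
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0); eps is an assumption
    alpha = g.mean()                 # ratio of salient pixels in the ground truth
    # Assumed weighting: the rarer salient class receives the larger weight.
    per_pixel = (1.0 - alpha) * g * np.log(p) + alpha * (1.0 - g) * np.log(1.0 - p)
    return -per_pixel.sum()
```

With this weighting, an image in which only a small fraction of pixels is salient does not let the background term dominate the gradient.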
We train the proposed architecture in an end-to-end manner. It is commonplace to initialize systems for many vision tasks with a prefix of a network trained for image classification. This has been shown to substantially reduce training time and improve accuracy. During training, our convolutional layers are initialized with the weights of the first five convolutional blocks of VGGNet, which was originally trained on 1.3 million images of the ImageNet dataset. The parameters of the remaining layers are randomly initialized. We then train our network with stochastic gradient descent (SGD) using backpropagation, minimizing the loss in Equ. 3. More implementation details are given in Sec. V-A.
III-C Deep Networks for Dynamic Saliency
Now we describe our spatiotemporal saliency network. As depicted in Fig. 3, the network has a structure similar to that of our static saliency network: it is based on an FCN and includes multi-layer convolution and deconvolution nets. The dynamic network learns dynamic saliency information jointly with the static saliency results, thus directly generating spatiotemporal saliency estimates.
The training set consists of a collection of synthetic and real video data, which efficiently utilizes existing large-scale, well-annotated image data (described in Sec. IV). More specifically, we feed a successive pair of frames $(I_t, I_{t+1})$ and the groundtruth $G_t$ of frame $I_t$ from the training set into this network for capturing dynamic saliency. Meanwhile, since saliency in dynamic scenes is driven by both static and dynamic saliency information, the network incorporates the saliency estimate generated by the static saliency network as a saliency prior indicating potential salient regions. Thus our dynamic saliency network directly generates the final spatiotemporal saliency estimate for frame $I_t$, by exploring dynamic saliency cues and leveraging the static saliency prior from the static saliency network.
We concatenate the frame pair $(I_t, I_{t+1})$ and the static saliency $P_t$ along the channel direction, generating a tensor $\mathbf{I}$ of size $h \times w \times 7$ (two three-channel frames plus the single-channel saliency map). We then feed $\mathbf{I}$ into our FCN-based dynamic saliency network, whose architecture is similar to that of the static saliency network. Only the first convolution layer is modified accordingly:
$$f(\mathbf{I}; W, b) = W_{I_t} * I_t + W_{I_{t+1}} * I_{t+1} + W_{P_t} * P_t + b, \quad (4)$$
where the $W$s represent the corresponding convolution kernels and $b$ is the bias parameter. During training, stochastic gradient descent (SGD) is employed to minimize the weighted cross-entropy loss described before. After training, given a frame pair and a static saliency prior, the deep dynamic saliency model outputs the final spatiotemporal saliency estimate. For testing, we first detect the static saliency map $P_t$ for frame $I_t$ via our static saliency network. Then the frame pair $(I_t, I_{t+1})$ and the static saliency map $P_t$ are fed into the dynamic saliency network to generate the final spatiotemporal saliency for frame $I_t$. After obtaining the video saliency estimate for frame $I_t$, we repeat this process for the next frame $I_{t+1}$ until reaching the end of the video sequence. More implementation details can be found in Sec. V-A. A qualitative and quantitative study of the effectiveness of our dynamic saliency model is given in Sec. V-C.
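The input construction and the frame-by-frame test procedure described above can be sketched as follows. This is an illustrative sketch, not the released implementation: `static_net` and `dynamic_net` are hypothetical callables standing in for the two trained networks, and the handling of the last frame (which has no successor, so the loop stops one frame early) is an assumption, since the text does not specify it.

```python
import numpy as np

def build_dynamic_input(frame_t, frame_next, static_saliency):
    """Stack two RGB frames and a one-channel saliency prior along the channel axis.

    frame_t, frame_next: arrays of shape (h, w, 3)
    static_saliency:     array of shape (h, w)
    returns:             array of shape (h, w, 7)
    """
    return np.concatenate(
        [frame_t, frame_next, static_saliency[..., None]], axis=2)

def run_video(frames, static_net, dynamic_net):
    """Produce one spatiotemporal saliency map per frame pair.

    static_net:  hypothetical callable, (h, w, 3) frame -> (h, w) saliency map
    dynamic_net: hypothetical callable, (h, w, 7) tensor -> (h, w) saliency map
    """
    saliency_maps = []
    # Assumption: the last frame has no successor, so it is not processed here.
    for t in range(len(frames) - 1):
        prior = static_net(frames[t])                      # static saliency of frame t
        x = build_dynamic_input(frames[t], frames[t + 1], prior)
        saliency_maps.append(dynamic_net(x))               # final estimate for frame t
    return saliency_maps
```

Because each step needs only two adjacent frames and one forward pass per network, no optical flow has to be computed at test time.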
Compared with the popular two-stream network structure used in [20, 55, 21], we merge the output of the static network into the dynamic saliency model, which directly produces spatiotemporal saliency results. This architecture brings two advantages. Firstly, the fusion of dynamic and static saliency is explicitly built into the dynamic saliency network, rather than training two-stream networks for spatial and temporal features and specially designing a fusion network for integrating them. Secondly, the proposed model directly infers the temporal information from two adjacent frames, whereas previous methods [20, 55] use optical flow images; our model thus achieves higher computational efficiency.
IV Synthetic Video Data Generation
So far, we have described our networks for video saliency detection. We now discuss our approach for training the network for dynamic saliency. As discussed in Sec. I, existing video datasets [10, 11, 12, 13] are insufficiently diverse and very limited in scale. As deep learning models are data-driven and have strong learning ability, directly learning deep networks on such video datasets would easily lead to overfitting. Noticing the gap between the requirements of learning neural networks for video processing and the lack of large-scale, high-quality annotated video data, we propose a technique for synthesizing video data from still frames.
Deriving full video sequences directly from a single image is infeasible. However, our video saliency network takes frame pairs as input, instead of the whole video sequence. That means we can simulate diverse but very short video sequences (only 2 frames in length) by fully utilizing well-labelled large-scale image datasets. Concretely, given a training sample $(I, G)$ from existing image saliency datasets, we wish to generate a pair of frames $(I, I')$ that presents various motion patterns, diverse deformations and smooth transformations, and is thus close to a real video signal. We start by simulating the correspondence between $I$ and $I'$, which is easier than directly inferring the adjacent frame $I'$. Let $\mathbf{x}$ denote a point position; the correspondence between $I$ and $I'$ can be represented as an optical flow field $\mathbf{v}$ via:
$$I(\mathbf{x}) = I'(\mathbf{x} + \mathbf{v}(\mathbf{x})). \quad (5)$$
The optical flow field $\mathbf{v}$ directly represents the pixel-level motion information between two neighboring frames. Next we only introduce how to set the vertical displacement, as the horizontal displacement is generated in a similar way.
We model the optical flow at the superpixel level, as the motion of similar adjacent pixels should be consistent. We oversegment the image into a group of superpixels. According to the groundtruth label, we further divide the superpixels into foreground superpixels and background ones. For simulating the diverse motion patterns of the background, we randomly select a number of background regions and randomly initialize their motion values (vertical displacements) within a bounded range. The motion values of the other background regions are initialized as zero. The motion patterns of the foreground are usually compact, as the whole foreground moves more regularly and purposefully than the background. Besides, the motion of different foreground parts is sometimes also diverse. For example, the whole body of a person may move in one direction while the arms or legs have different motions. To model this, we first randomly draw a value as the main motion of the foreground regions. Then we randomly set the motion values of the individual foreground regions around this main value, to represent the differences between foreground regions. This initialization process is visualized in Fig. 4-a.
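The initialization just described can be sketched as follows for the vertical displacement. This is an illustrative sketch under loudly stated assumptions: the uniform sampling, the parameter names `n_bg_seeds`, `max_disp` and `fg_jitter`, and their values are all hypothetical, since only the qualitative procedure (a few randomly moving background regions, a shared foreground motion plus small per-region deviations) is taken from the text.

```python
import numpy as np

def init_vertical_motion(is_foreground, n_bg_seeds, max_disp, fg_jitter, rng):
    """Assign an initial vertical displacement to every superpixel.

    is_foreground: boolean array, one entry per superpixel
    n_bg_seeds:    number of background superpixels given a random motion (assumed)
    max_disp:      bound on the sampled displacements (assumed)
    fg_jitter:     bound on per-region deviation from the main foreground motion (assumed)
    returns (u, is_seed): displacements and a mask of the seed regions
    """
    is_foreground = np.asarray(is_foreground, dtype=bool)
    u = np.zeros(len(is_foreground))
    is_seed = np.zeros(len(is_foreground), dtype=bool)

    bg = np.flatnonzero(~is_foreground)
    fg = np.flatnonzero(is_foreground)

    # A few randomly chosen background regions move; the rest start at zero.
    k = min(n_bg_seeds, len(bg))
    if k > 0:
        chosen = rng.choice(bg, size=k, replace=False)
        u[chosen] = rng.uniform(-max_disp, max_disp, size=k)
        is_seed[chosen] = True

    # The foreground shares one main motion, with small per-region deviations.
    main_motion = rng.uniform(-max_disp, max_disp)
    u[fg] = main_motion + rng.uniform(-fg_jitter, fg_jitter, size=len(fg))
    is_seed[fg] = True
    return u, is_seed
```

Calling the same routine a second time gives the horizontal displacement, so that each superpixel ends up with a two-dimensional initial motion vector.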
A similar process is adopted for generating the initial horizontal motion displacement, which gives an initial optical flow for $I$. Next, we propose an energy function for smoothing and propagating the initial optical flow globally, while preserving the difference between the foreground and background motion patterns. Let the initial motion vector of each superpixel $r_i$ be denoted as $\hat{\mathbf{v}}_i$; the final motion vector $\mathbf{v}_i$ is obtained by minimizing the following energy function (here we slightly reuse $\mathbf{v}$ to denote the optical flow vector of a superpixel, without ambiguity):
$$\mathbf{v} = \arg\min_{\mathbf{v}} \sum_{i} \lambda_i \|\mathbf{v}_i - \hat{\mathbf{v}}_i\|^2 + \sum_{(i,j) \in \mathcal{N}} w_{ij} \|\mathbf{v}_i - \mathbf{v}_j\|^2. \quad (6)$$
The first term is the unary constraint that each superpixel tends to keep its initial motion, while the smooth term gives the interactive constraint that neighboring superpixels have consistent motion patterns when their representative colors are similar. The superpixel neighborhood set $\mathcal{N}$ contains all the spatially adjacent superpixels; to further encourage the motion consistency of background regions, we also treat all the selected background regions as adjacent in this neighborhood system. The parameter $\lambda_i$ is a positive coefficient measuring how much we want to fit the initial motion. Typically, $\lambda_i \to \infty$ imposes the hard constraint that each region keeps exactly its initial motion. We set $\lambda_i$ according to the region type.
For the seed regions (the selected background regions and all the foreground regions), $\lambda_i$ is chosen so that they tend to preserve their initial motions; for the other regions, we place more influence on the smooth term, so that the initial motions can be propagated from the seed regions.
The weighting function $w_{ij}$ in Equ. 6 defines a similarity measure for adjacent superpixels $r_i$ and $r_j$, computed from $\mu_i$ and $\mu_j$,
where $\mu_i$ indicates the mean color vector of the pixels in superpixel $r_i$. We set the weight $w_{ij}$ to zero when two adjacent superpixels belong to the foreground and the background, respectively. In this way we enforce motion consistency inside the foreground and inside the background, while preserving the motion difference between them. Equ. 6 can be efficiently solved by convex optimization, yielding a smooth optical flow field $\mathbf{v}$. As shown in Fig. 4, based on $\mathbf{v}$, we can generate a simulated frame $I'$ and its corresponding annotation $G'$ from $(I, G)$.
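To make the smoothing step concrete, the sketch below minimizes a quadratic energy of the kind described above: a per-superpixel fidelity term weighted by a positive coefficient plus a weighted smoothness term over neighboring superpixels. The exact form is an assumption: we take squared Euclidean differences and count each neighboring pair once, in which case setting the gradient to zero gives a linear system whose matrix is the diagonal fidelity matrix plus the weighted graph Laplacian. The function name `propagate_motion` and the dense solver are illustrative choices, not the solver used in this work.

```python
import numpy as np

def propagate_motion(v_init, lam, edges, weights):
    """Minimize sum_i lam_i |v_i - v_init_i|^2 + sum_(i,j) w_ij |v_i - v_j|^2.

    v_init:  initial motion vectors, shape (n, 2)
    lam:     positive fidelity coefficients, shape (n,) (large for seed regions)
    edges:   list of (i, j) index pairs of neighboring superpixels, each pair once
    weights: similarity weight per edge (0 across the foreground/background boundary)
    returns: smoothed motion vectors, shape (n, 2)
    """
    v_init = np.asarray(v_init, dtype=float)
    lam = np.asarray(lam, dtype=float)

    # System matrix: diag(lam) + weighted graph Laplacian.
    A = np.diag(lam)
    for (i, j), w in zip(edges, weights):
        A[i, i] += w
        A[j, j] += w
        A[i, j] -= w
        A[j, i] -= w

    rhs = lam[:, None] * v_init
    return np.linalg.solve(A, rhs)
```

In this reading, a seed region with a large coefficient stays close to its initial motion, while a region with a small coefficient mostly inherits the motion of similarly colored neighbors; a zero weight across the foreground/background boundary prevents motion from leaking between the two.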
The proposed method is very fast and outputs the synthesized video frame pair, the optical flow, and the pixel-wise annotations simultaneously. The number of samples in existing image segmentation/saliency datasets is one to two orders of magnitude larger than in video segmentation datasets, allowing us to generate enough scenes. For each image sample of an image dataset, we generate ten simulated frames. Some simulated results can be observed in Fig. 5. In our experiments, we use two large image saliency datasets, MSRA10K and DUT-OMRON, generating more than 150K simulated videos with pixel-level annotations and optical flow within 3 hours (a processing speed of 14 fps on one CPU). These synthesized video data, combined with real video samples from existing video segmentation datasets, are fed into our model for learning general dynamic saliency information without over-fitting.
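Once a dense flow field is available, the second frame and its annotation can be produced by moving every pixel of the source image and mask along its flow vector. The sketch below shows one simple way to do this, forward warping with nearest-pixel rounding. It is an assumption-laden illustration rather than the procedure used in this work: the function name `forward_warp`, the channel convention (`flow[..., 0]` horizontal, `flow[..., 1]` vertical), and the choice to leave uncovered target pixels at their source values are all hypothetical.

```python
import numpy as np

def forward_warp(image, flow):
    """Move each pixel of `image` along `flow` (nearest-pixel forward warping).

    image: array of shape (h, w) or (h, w, c), e.g. a frame or its binary mask
    flow:  array of shape (h, w, 2); flow[..., 0] is the horizontal and
           flow[..., 1] the vertical displacement in pixels (assumed convention)
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)

    # Pixels that would leave the image are dropped.
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)

    # Assumption: target pixels that receive no source pixel keep the source value.
    out = image.copy()
    out[ty[valid], tx[valid]] = image[ys[valid], xs[valid]]
    return out
```

Applying the same function to the image and to its groundtruth mask keeps the synthesized annotation aligned with the synthesized frame.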
V Experimental Results
In this section, we describe our evaluation protocol and implementation details (Sec. V-A), provide exhaustive comparison results over two large datasets (80 videos in total, Sec. V-B), study the quantitative importance of the different components of our system (Sec. V-C), and assess its computational load (Sec. V-D).
V-A Experimental Setup
We report our performance on two public benchmark datasets: the Freiburg-Berkeley Motion Segmentation (FBMS) dataset and the Densely Annotated VIdeo Segmentation (DAVIS) dataset. The FBMS dataset contains 59 natural video sequences, covering various challenges such as large foreground and background appearance variation, significant shape deformation, and large camera motion. This dataset was originally used for motion segmentation, where non-salient but moving objects are also labeled as foreground. We offer more precise annotations for this dataset by labeling only the main salient objects. The FBMS dataset comes with a split into a training set of 29 video sequences and a test set of 30 video sequences. We also report our performance on the newly developed DAVIS dataset, which is one of the most challenging video segmentation benchmarks. It consists of 50 video sequences in total, with fully annotated pixel-level segmentation ground truth available for each frame. We report the performance of our method and the alternatives on the test set of the FBMS dataset and on the whole DAVIS dataset.
For training, we use two large image saliency datasets: MSRA10K and DUT-OMRON. The MSRA10K dataset, comprising 10K images, is widely used for saliency detection and covers a large variety of image contents: natural scenes, animals, indoor, outdoor, etc. Most of the images have a single salient object. The DUT-OMRON dataset is one of the most challenging image saliency datasets and contains 5172 images with multiple objects, complex structures and high background clutter. All the above datasets contain manually annotated groundtruth saliency. The video sequences of the whole SegTrackV2 dataset and the training set of the FBMS dataset are also used for training the dynamic saliency network, which amounts to about 3K frame pairs. Because the number of annotations provided by FBMS is very limited (only 46 frames are labeled for each video sequence), we provide 500 extra annotations.
The proposed deep video saliency network has been implemented with the popular Caffe library, an open source framework for CNN training and testing. For our static saliency network, the weights of the first five convolutional blocks are initialized from the VGGNet model trained on ImageNet; the other convolutional layers are initialized from a zero-mean Gaussian with a standard deviation of 0.01, and the biases are set to 0. Based on this, our network was trained on the MSRA10K and DUT-OMRON datasets for saliency detection in static scenes. Our dynamic saliency network is also initialized from the VGGNet network. For its first convolutional layer, we use Gaussian initialization, because its number of input channels differs from that of VGGNet. Benefiting from our video data synthesis approach, we can employ images and annotations from existing saliency segmentation datasets for training our video saliency model. The images and masks from the MSRA10K and DUT-OMRON datasets are used to generate more than 150K video clips. We then combine our simulated video data with real video data (3K frame pairs) from existing video segmentation datasets [11, 10] to form an aggregate video saliency training set, on which the whole video saliency model is trained.
For both networks, we use stochastic gradient descent (SGD) and a polynomial learning policy with initial learning rate of . The momentum and weight decay are set to 0.9 and 0.0005, respectively. The whole training process takes about 40 hours on a PC with a 3.4 GHz CPU, a TITAN X GPU, and 32 GB of RAM.
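As a rough sketch of what a polynomial learning policy does, the snippet below follows the form of Caffe's `poly` policy, `base_lr * (1 - iter / max_iter) ** power`. The momentum and weight decay are the values stated above; the base learning rate, iteration count and power are hypothetical placeholders, since they are not recoverable from the text.

```python
def poly_learning_rate(base_lr, iteration, max_iter, power):
    """Polynomial decay: base_lr * (1 - iteration / max_iter) ** power."""
    if not 0 <= iteration <= max_iter:
        raise ValueError("iteration must lie in [0, max_iter]")
    return base_lr * (1.0 - iteration / max_iter) ** power

# Values stated in the text.
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005

# Hypothetical values, chosen only to make the example runnable.
BASE_LR = 1e-3
MAX_ITER = 1000
POWER = 0.9

# The rate starts at BASE_LR and decays smoothly to zero at MAX_ITER.
schedule = [
    poly_learning_rate(BASE_LR, it, MAX_ITER, POWER)
    for it in (0, 250, 500, 750, 1000)
]
```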
V-B Performance Comparison
To evaluate the proposed approach, this section compares it against several top-performing alternatives: saliency via deep features (MD), saliency via absorbing Markov chain (MC), space-time saliency for time-mapping (TIMP), gradient-flow field based saliency (GAFL), geodesic distance based video saliency (SAGE), and saliency via random walk with restart (RWRV). The comparison is carried out on the test set of the FBMS dataset (30 video sequences) and the whole DAVIS dataset (50 video sequences). The first two methods target image saliency, while the latter four are designed for video saliency.
V-B1 Qualitative Results
Qualitative comparisons are presented in Fig. 6, where the top row shows example video frames and the second row shows the ground-truth salient objects. As seen, the image saliency method without deep learning (MC) unsurprisingly faces difficulties in dynamic scenes, due to its lack of inter-frame information and its reliance on hand-crafted features. The video saliency methods [3, 17] generate more visually promising results, but incur a higher computational load (detailed in Sec. V-D) and show relatively weak performance on complex backgrounds. MD is an image saliency model, yet it is competitive with the above bottom-up video saliency approaches, which demonstrates the power of deep learning models in saliency detection. Nevertheless, the proposed algorithm captures foreground salient objects more faithfully in most test cases. In particular, it performs well in challenging scenarios such as blurred backgrounds (lion01), varied object motion patterns (parkour) and large shape deformation (soapbox). This can be attributed to our video data synthesis, which offers diverse scene information and rich motion patterns. As a result, our method learns both static and dynamic saliency information and detects salient moving objects accurately even when their appearance is similar to the background.
V-B2 Quantitative Results
We report quantitative evaluation results on three widely used performance measures: precision-recall (PR) curves, F-measure and MAE.
We first employ precision-recall (PR) curves for performance evaluation. Precision is the fraction of pixels detected as salient that are truly salient, while recall is the fraction of ground-truth salient pixels that are detected. For each saliency map, we vary the cutoff threshold from 0 to 255 to generate 256 precision and recall pairs, which are used to plot a PR curve.
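The thresholding procedure can be sketched as follows. This is a minimal illustration rather than the evaluation code used for the reported results; it assumes an 8-bit saliency map, a binary ground-truth mask, binarization with `>=`, and a precision of 1.0 when no pixel is detected (a convention chosen here, not taken from the paper).

```python
import numpy as np

def precision_recall_curve(saliency, gt_mask):
    """Return 256 (precision, recall) pairs, one per threshold in 0..255.

    saliency: uint8 array with values in [0, 255].
    gt_mask:  array of the same shape, nonzero where the pixel is salient.
    """
    gt = np.asarray(gt_mask).astype(bool)
    n_gt = gt.sum()
    precisions, recalls = [], []
    for threshold in range(256):
        detected = saliency >= threshold          # binarize the saliency map
        true_pos = np.logical_and(detected, gt).sum()
        n_detected = detected.sum()
        # Assumed convention: precision is 1.0 when nothing is detected.
        precisions.append(true_pos / n_detected if n_detected > 0 else 1.0)
        recalls.append(true_pos / n_gt if n_gt > 0 else 0.0)
    return np.array(precisions), np.array(recalls)

# Toy 2x2 example: the bottom row is the salient object.
saliency = np.array([[0, 64], [128, 255]], dtype=np.uint8)
gt_mask = np.array([[0, 0], [1, 1]], dtype=bool)
precisions, recalls = precision_recall_curve(saliency, gt_mask)
```

At threshold 0 every pixel is detected, so recall is 1.0 and precision equals the fraction of salient pixels in the frame; raising the threshold trades recall for precision.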
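The F-measure is an overall performance measure computed as the weighted harmonic mean of precision and recall:
$$F_\beta = \frac{(1+\beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}},$$
where $\beta^2$ is set to weigh precision more than recall, as suggested in prior work. For each saliency map, we derive a sequence of F-measure values along the PR curve with the threshold varying from 0 to 255.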
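A small sketch of the weighted harmonic mean, written as F = (1 + β²)·P·R / (β²·P + R), is given below. The default `beta_sq=0.3` is an assumption: it is a value commonly used in the salient object detection literature, and the value used in this paper is not recoverable from the text.

```python
import numpy as np

def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall, element-wise.

    beta_sq < 1 weighs precision more than recall. The default 0.3 is an
    assumed, commonly used value, not one confirmed by the text.
    """
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    denom = beta_sq * precision + recall
    scores = np.zeros_like(denom)
    # Leave the score at 0 where precision and recall are both 0.
    np.divide((1.0 + beta_sq) * precision * recall, denom,
              out=scores, where=denom > 0)
    return scores

# One F-measure value per (precision, recall) pair along a PR curve.
scores = f_measure([0.5, 1.0, 0.0], [1.0, 1.0, 0.0])
```

With `beta_sq=1.0` the function reduces to the ordinary harmonic mean (F1); smaller values pull the score toward precision.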
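Table II: MAE of the model variants on FBMS and DAVIS; Δ is the change relative to the full model.
| Aspect | Variant | FBMS MAE | Δ | DAVIS MAE | Δ |
|---|---|---|---|---|---|
| Module | Static model in Sec. III-B | 8.19 | +0.54 | 7.17 | +0.81 |
| Module | Dynamic model in Sec. III-C | 9.43 | +1.78 | 8.32 | +1.96 |
| Training | Training set i: only using image data () | 9.27 | +1.62 | 7.53 | +1.17 |
| Training | Training set ii: only using video data () | 24.5 | +16.8 | 23.9 | +17.5 |
| Training | Training set iii: reduced training data () | 9.14 | +1.48 | 7.54 | +1.18 |
| Training | Training set iv: reduced training data () | 10.7 | +3.08 | 9.13 | +2.77 |
| Training | Training set v: reduced training data () | 12.8 | +5.18 | 10.9 | +4.58 |
| Training | Training set vi: reduced training data () | 13.5 | +5.83 | 12.7 | +6.39 |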
As neither precision nor recall considers the true negative saliency assignments, the mean absolute error (MAE) is also introduced as a complementary measure. MAE is defined as the average per-pixel absolute difference between an estimated saliency probability map $P$ and its corresponding ground truth $G$. Here, $P$ and $G$ are normalized to the interval [0, 1]. MAE is computed as:
$$\mathrm{MAE} = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} \left| P(i,j) - G(i,j) \right|,$$
where $h$ and $w$ refer to the height and width of the input frame image. MAE is meaningful in evaluating the applicability of a saliency model in a task such as object segmentation.
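The per-pixel average can be sketched directly in NumPy. The function name and the toy arrays are illustrative only; both inputs are assumed to be already normalized to [0, 1].

```python
import numpy as np

def mean_absolute_error(pred, gt):
    """Average per-pixel absolute difference between two maps in [0, 1]."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("saliency map and ground truth must have the same shape")
    # Mean over all h * w pixels of |pred - gt|.
    return float(np.abs(pred - gt).mean())

# Toy 2x2 frame: the top row is predicted perfectly, the bottom row is off by 0.5.
pred = [[0.0, 1.0], [0.5, 0.5]]
gt = [[0.0, 1.0], [1.0, 0.0]]
mae = mean_absolute_error(pred, gt)
```

Unlike precision and recall, this measure also rewards correctly predicted background pixels: the top-left pixel (a true negative) contributes zero error to the average.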
The precision-recall curves of all methods are reported in Fig. 7-a. As shown, our method significantly outperforms the state of the art on both the FBMS dataset and the DAVIS dataset. Our saliency method achieves the best precision rates, which demonstrates that our saliency maps are more precise and more responsive to the actual salient information. The F-scores are depicted in Fig. 7-b, in which our model achieves better scores than the other methods. Similar conclusions can be drawn from the MAE: in Fig. 7-c, our method achieves the lowest MAE among all compared methods.
V-C Validation of the Proposed Method
To give more insight into our algorithm and to objectively evaluate the contribution of its different components, we evaluate each component described in Sec. III as well as several variants of the proposed saliency model. We experiment on the test set of the FBMS dataset and on the DAVIS dataset, and measure performance using precision-recall curves and MAE.
V-C1 Ablation study
We first study the effect of each module of our deep saliency model. In Fig. 8, we present a qualitative comparison between the static saliency from our static network (Sec. III-B) and the final spatiotemporal saliency results from our whole model (Sec. III-C). It can be observed that, lacking dynamic information, the static saliency model has difficulty distinguishing salient objects from cluttered backgrounds in dynamic scenes. By jointly exploiting static and dynamic saliency stimuli, our deep video saliency model is able to estimate more accurate spatiotemporal saliency maps.
To quantitatively examine our static saliency network, we directly use the static saliency maps it generates as the final saliency estimates. From Table II, we can observe decreased performance (MAE 7.65→8.19 on FBMS, 6.36→7.17 on DAVIS), due to the lack of dynamic saliency information. Similarly, we train a dynamic network on the same training data without taking static saliency as a prior. Its performance also drops (MAE 7.65→9.43 on FBMS, 6.36→8.32 on DAVIS). We attribute this to the difficulty of directly capturing dynamic saliency information from two successive frames without any saliency prior or extra motion information. We can draw two important conclusions. First, the fusion of the static model and the dynamic model improves on both. Second, taking static saliency as prior information makes training the dynamic model easier and yields more accurate predictions.
V-C2 Training strategy
We also explore the effect of different training strategies. We first study the influence of our synthetic video data generation strategy in Sec. IV. We train our deep saliency model using only the synthetic data generated from images. Although the real video data make up only a small percentage of the training set, we still see an increase in MAE (7.65→9.27 on FBMS, 6.36→7.53 on DAVIS) when we only use synthetic data. The small performance decrease verifies the effectiveness of our data augmentation technique; on the other hand, it suggests that the synthetic data should not completely replace the real video data. We further explore the performance of our model when trained only on the real video data (about 3K frame pairs). In this case, our model suffers from over-fitting due to the high similarity of scenes within the same video, and MAE rises to 24.5 on FBMS and 23.9 on DAVIS. This also demonstrates the importance of our synthetic video data generation.
We next study the influence of the amount of training data. As the amount of training data is reduced (training sets iii to vi in Table II), MAE rises steadily, from 9.14 to 13.5 on FBMS and from 7.54 to 12.7 on DAVIS. This indicates that the deep learning model is data-driven; conversely, more training data should lead to improved performance.
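The Δ values quoted for the ablation variants are simply differences from the full model's MAE (7.65 on FBMS, 6.36 on DAVIS). The short check below reproduces them for three variants using the numbers reported in Table II; the dictionary layout and names are illustrative only.

```python
# MAE values as reported in the text and in Table II.
FULL_MODEL = {"FBMS": 7.65, "DAVIS": 6.36}
VARIANTS = {
    "static model only": {"FBMS": 8.19, "DAVIS": 7.17},
    "dynamic model without static prior": {"FBMS": 9.43, "DAVIS": 8.32},
    "synthetic image data only": {"FBMS": 9.27, "DAVIS": 7.53},
}

def mae_increase(variant):
    """MAE of a variant minus MAE of the full model, per dataset."""
    return {
        dataset: round(VARIANTS[variant][dataset] - FULL_MODEL[dataset], 2)
        for dataset in FULL_MODEL
    }

increases = {name: mae_increase(name) for name in VARIANTS}
```

Rows of the table that are reported to three significant digits (for example 24.5 for video-only training) do not reproduce their Δ exactly in this way, presumably because the Δ values were computed before rounding.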
V-D Runtime Analysis
Here we consider the speed of our saliency method. Our computing platform includes an Intel Xeon E7 CPU (12 cores) with 64 GB of memory and an Nvidia GeForce TITAN X GPU. We do not count I/O time, and we do not process multiple images in parallel. The running time of our method, compared against other video saliency methods [19, 3, 17, 6], is presented in Fig. 9.
Fig. 9 shows that runtime efficiency is the major bottleneck for the usability of previous video saliency algorithms, as a substantial amount of time is spent computing motion or edge information. In contrast, our method computes 480p saliency masks in as little as 0.47 seconds per frame, which is much faster than traditional video saliency methods. Our method does not rely on optical flow, edge maps or other pre-computed information, resulting in roughly an order of magnitude faster processing.
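The measurement protocol described above (frames already in memory so that I/O is excluded, one frame at a time) and the conversion from per-frame time to frame rate can be sketched as follows. The helper names and the stand-in `process_frame` callable are hypothetical; this is not the benchmarking code used for Fig. 9.

```python
import time

def seconds_per_frame(process_frame, frames):
    """Average wall-clock time per frame, processing frames sequentially.

    `frames` must already be loaded in memory so that I/O is not counted.
    """
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    return (time.perf_counter() - start) / len(frames)

def frames_per_second(sec_per_frame):
    """Convert a per-frame processing time into a frame rate."""
    return 1.0 / sec_per_frame

# The reported 0.47 s per frame corresponds to roughly 2 frames per second.
reported_fps = frames_per_second(0.47)

# Usage with a stand-in workload (a real model call would go here).
measured = seconds_per_frame(lambda frame: sum(frame), [[1, 2, 3]] * 10)
```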
In this work, we have presented a deep learning method for fast video saliency detection using convolutional neural networks. The proposed deep video saliency model has two modules, a static saliency network and a dynamic saliency network, which are designed to capture the spatial and temporal statistics of dynamic scenes. The saliency estimates from the static saliency network are incorporated into the dynamic saliency network, which enables our method to automatically learn how to fuse static saliency into dynamic saliency detection and to directly produce final spatiotemporal saliency results with a lower computational load. Furthermore, we proposed a novel data augmentation technique for synthesizing video data from still images, which enables our deep saliency model to learn generic spatial and temporal saliency and prevents overfitting.
Experimental results on two datasets, FBMS and DAVIS, have shown that our proposed method generates high-quality saliency maps. Additionally, our model avoids the main computational burden of previous video saliency models, which rely on optical flow estimation. Our saliency model is efficient, achieving a processing frame rate of 2 fps on a GPU.
-  R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
-  Y. Fang, Z. Wang, W. Lin, and Z. Fang, “Video saliency incorporating spatiotemporal cues and uncertainty weighting,” IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 3910–3921, 2014.
-  W. Wang, J. Shen, and L. Shao, “Consistent video saliency using local gradient flow optimization and global refinement,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4185–4196, 2015.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
-  H. Kim, Y. Kim, J.-Y. Sim, and C.-S. Kim, “Spatiotemporal saliency detection for video sequences based on random walk with restart,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2552–2564, 2015.
-  R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep learning,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1265–1274.
-  G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  T. Brox and J. Malik, “Object segmentation by long term analysis of point trajectories,” in European Conference on Computer Vision, 2010, pp. 282–295.
-  F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg, “Video segmentation by tracking many figure-ground segments,” in IEEE International Conference on Computer Vision, 2013, pp. 2192–2199.
-  F. Galasso, N. Shankar Nagaraja, T. Jimenez Cardenas, T. Brox, and B. Schiele, “A unified video segmentation benchmark: Annotation, metrics and analysis,” in IEEE International Conference on Computer Vision, 2013, pp. 3527–3534.
-  F. Perazzi, J. Pont-Tuset, B. McWilliams, L. V. Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  A. M. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognitive Psychology, vol. 12, no. 1, pp. 97–136, 1980.
-  P. Mital, T. J. Smith, S. Luke, and J. Henderson, “Do low-level visual features have a causal influence on gaze during dynamic scene viewing?” Journal of Vision, vol. 13, no. 9, pp. 144–144, 2013.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic video object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3395–3402.
-  W. Wang, J. Shen, R. Yang, and F. Porikli, “Saliency-aware video object segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  F. Zhou, S. Bing Kang, and M. F. Cohen, “Time-mapping using space-time saliency,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3358–3365.
-  K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
-  Ç. Bak, A. Erdem, and E. Erdem, “Two-stream convolutional networks for dynamic saliency prediction,” arXiv preprint arXiv:1607.04730, 2016.
-  A. Khoreva, F. Perazzi, R. Benenson, B. Schiele, and A. Sorkine-Hornung, “Learning video object segmentation from static images,” arXiv preprint arXiv:1612.02646, 2016.
-  L. Itti, C. Koch, E. Niebur et al., “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
-  J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in Advances in Neural Information Processing Systems, 2006, pp. 545–552.
-  T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to predict where humans look,” in IEEE International Conference on Computer Vision, 2009, pp. 2106–2113.
-  F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
-  M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 569–582, 2015.
-  W. Wang, J. Shen, L. Shao, and F. Porikli, “Correspondence driven saliency transfer,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5025–5034, 2016.
-  A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2013.
-  A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5706–5722, 2015.
-  Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using background priors,” in European Conference on Computer Vision, 2012, pp. 29–42.
-  W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2814–2821.
-  G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5455–5463.
-  Y. Tang and X. Wu, “Saliency detection via combining region-level and pixel-level predictions with cnns,” in European Conference on Computer Vision, 2016, pp. 809–825.
-  L. Wang, H. Lu, X. Ruan, and M.-H. Yang, “Deep networks for saliency detection via local estimation and global search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3183–3192.
-  L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, “Saliency detection with recurrent fully convolutional networks,” in European Conference on Computer Vision, 2016, pp. 825–841.
-  N. Liu and J. Han, “DHSnet: Deep hierarchical saliency network for salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 678–686.
-  X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang, “DeepSaliency: Multi-task deep neural network model for salient object detection,” IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3919–3930, 2016.
-  D. Zhang, J. Han, L. Jiang, S. Ye, and X. Chang, “Revealing event saliency in unconstrained video collection,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1746–1758, 2017.
-  W. Wang, J. Shen, X. Li, and F. Porikli, “Robust video object co-segmentation,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3137–3148, 2015.
-  W. Wang and J. Shen, “Higher-order image co-segmentation,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 1011–1021, 2016.
-  D. Zhang, D. Meng, and J. Han, “Co-saliency detection via a self-paced multiple-instance learning framework,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 5, pp. 865–878, 2017.
-  B. X. Nie, P. Wei, and S.-C. Zhu, “Monocular 3D human pose estimation by predicting depth on joints,” in IEEE International Conference on Computer Vision, 2017.
-  D. Zhang, J. Han, C. Li, J. Wang, and X. Li, “Detection of co-salient objects by looking deep and wide,” International Journal of Computer Vision, vol. 120, no. 2, pp. 215–232, 2016.
-  S.-h. Zhong, Y. Liu, F. Ren, J. Zhang, and T. Ren, “Video saliency detection via dynamic consistent spatio-temporal attention modelling,” in AAAI Conference on Artificial Intelligence, 2013.
-  W. Wang, J. Shen, H. Sun, and L. Shao, “Video co-saliency guided co-segmentation,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
-  D. Zhang, J. Han, J. Han, and L. Shao, “Cosaliency detection based on intrasaliency prior transfer and deep intersaliency mining,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1163–1176, 2016.
-  X. Lu, Y. Yuan, and X. Zheng, “Joint dictionary learning for multispectral change detection,” IEEE Transactions on Cybernetics, vol. 47, no. 4, pp. 884–897, 2017.
-  Y. Yuan, L. Mou, and X. Lu, “Scene recognition by manifold regularized deep learning architecture,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, pp. 2222–2233, 2015.
-  C. Guo, Q. Ma, and L. Zhang, “Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
-  H. J. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” Journal of Vision, vol. 9, no. 12, pp. 15–15, 2009.
-  V. Mahadevan and N. Vasconcelos, “Spatiotemporal saliency in dynamic scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 171–177, 2010.
-  F. Zhou, S. B. Kang, and M. F. Cohen, “Time-mapping using space-time saliency,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
-  J. Charles, T. Pfister, D. Magee, D. Hogg, and A. Zisserman, “Personalizing human video pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik, “Learning to segment moving objects in videos,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, “Hierarchical convolutional features for visual tracking,” in IEEE International Conference on Computer Vision, 2015, pp. 3074–3082.
-  L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual tracking with fully convolutional networks,” in IEEE International Conference on Computer Vision, 2015.
-  W. Wang, J. Shen, Y. Yu, and K.-L. Ma, “Stereoscopic thumbnail creation via efficient stereo saliency detection,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 8, pp. 2014–2027, 2017.
-  N. Wang and D.-Y. Yeung, “Learning a deep compact image representation for visual tracking,” in Advances in Neural Information Processing Systems, 2013, pp. 809–817.
-  K. Zhang, Q. Liu, Y. Wu, and M.-H. Yang, “Robust visual tracking via convolutional networks without training,” IEEE Transactions on Image Processing, vol. 25, no. 4, pp. 1779–1792, 2016.
-  H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  Y.-H. Tsai, G. Zhong, and M.-H. Yang, “Semantic co-segmentation in videos,” in European Conference on Computer Vision, 2016, pp. 760–775.
-  M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, “Feedforward semantic segmentation with zoom-out features,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3376–3385.
-  T. Liu, J. Sun, N. N. Zheng, X. Tang, and H. Y. Shum, “Learning to detect a salient object,” in IEEE Conference on Computer Vision and Pattern Recognition, 2007.
-  C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013.
-  B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang, “Saliency detection via absorbing markov chain,” in IEEE International Conference on Computer Vision, 2013.
-  W. Wang, J. Shen, J. Xie, and F. Porikli, “Super-trajectory for video segmentation,” in IEEE International Conference on Computer Vision, 2017.
-  W. Wang and J. Shen, “Deep cropping via attention box prediction and aesthetics assessment,” in IEEE International Conference on Computer Vision, 2017.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM International Conference on Multimedia, 2014.
-  R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009.