A Unified RGBT Saliency Detection Benchmark: Dataset, Baselines, Analysis and A Novel Approach
Abstract
Despite significant progress, image saliency detection still remains a challenging task in complex scenes and environments. Integrating multiple different but complementary cues, like RGB and thermal (RGBT) data, may be an effective way to boost saliency detection performance. The current research in this direction, however, is limited by the lack of a comprehensive benchmark. This work contributes such an RGBT image dataset, which includes 821 spatially aligned RGBT image pairs and their ground truth annotations for saliency detection purposes. The image pairs have high diversity, recorded under different scenes and environmental conditions, and we annotate 11 challenges on these image pairs to enable challenge-sensitive analysis of different saliency detection algorithms. We also implement 3 kinds of baseline methods with different modality inputs to provide a comprehensive comparison platform.
With this benchmark, we propose a novel approach, multi-task manifold ranking with cross-modality consistency, for RGBT saliency detection. In particular, we introduce a weight for each modality to describe its reliability, and integrate these weights into the graph-based manifold ranking algorithm to achieve adaptive fusion of different source data. Moreover, we incorporate cross-modality consistent constraints to integrate different modalities collaboratively. For the optimization, we design an efficient algorithm to iteratively solve several subproblems with closed-form solutions. Extensive experiments against other baseline methods on the newly created benchmark demonstrate the effectiveness of the proposed approach, and we also provide basic insights and potential future research directions for RGBT saliency detection.
I Introduction
Image saliency detection is a fundamental and active problem in computer vision. It aims at automatically highlighting salient foreground objects against the background, and has received increasing attention due to its wide range of applications in computer vision and graphics, such as object recognition, content-aware retargeting, video compression, and image classification. Despite significant progress, image saliency detection still remains a challenging task in complex scenes and environments.
Recently, integrating RGB data and thermal data (RGBT data) has proven effective in several computer vision problems, such as moving object detection [1, 2] and tracking [3]. Despite the potential of RGBT data, however, research on RGBT saliency detection is limited by the lack of a comprehensive image benchmark.
In this paper, we contribute a comprehensive image benchmark for RGBT saliency detection, and the following two main aspects are considered in creating this benchmark.

A good dataset should have reasonable size, high diversity and low bias [4]. Therefore, we use our recording system to collect 821 RGBT image pairs in different scenes and environmental conditions, and each image pair is aligned and annotated with ground truth. In addition, the category, size, number and spatial information of the salient objects are also taken into account to enhance diversity and challenge, and we present some statistics of the created dataset to analyze its diversity and bias. To analyze the challenge-sensitive performance of different algorithms, we annotate 11 different challenges according to the above-mentioned factors.

To the best of our knowledge, RGBT saliency detection has not been well investigated. Therefore, we implement some baseline methods to provide a comparison platform. On one hand, we regard RGB or thermal images as inputs to some popular methods to achieve single-modality saliency detection. These baselines can be utilized to identify the importance and complementarity of RGB and thermal information when compared with RGBT saliency detection methods. On the other hand, we concatenate the features extracted from the RGB and thermal modalities as the RGBT feature representations, and employ some popular methods to achieve RGBT saliency detection.
Salient object detection has been extensively studied in the past decades, and numerous models and algorithms have been proposed based on different mathematical principles or priors [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Most methods measure saliency via local center-surround contrast and the rarity of features over the entire image [5, 16, 17, 12]. In contrast, Gopalakrishnan et al. [18] formulated the object detection problem as a binary segmentation or labelling task on a graph. The most salient seeds and several background seeds were identified by the behavior of random walks on a complete graph and a regular graph. Then, a semi-supervised learning technique was used to infer the binary labels of the unlabelled nodes. Differently, Yang et al. [19] applied the manifold ranking technique to salient object detection, which requires only seeds from one class, initialized with either boundary priors or foreground cues. They then extended their work with several improvements [20], including multi-scale graph construction and a cascade scheme on a multi-layer representation. Based on the manifold ranking algorithms, Li et al. [21] generated pixel-wise saliency maps via regularized random walks ranking, and Wang et al. [22] proposed a new graph model which captures local/global contrast and effectively utilizes the boundary prior.
With the created benchmark, we propose a novel approach, multi-task manifold ranking with cross-modality consistency, for RGBT saliency detection. For each modality, we employ the idea of graph-based manifold ranking [19] for its good saliency detection performance in terms of accuracy and speed. Then, we assign each modality a weight to describe its reliability, which is capable of dealing with occasional perturbation or malfunction of individual sources, to achieve adaptive fusion of multiple modalities. To better exploit the relations among modalities, we impose cross-modality consistent constraints on the ranking functions of different modalities to integrate them collaboratively. Considering the manifold ranking in each modality as an individual task, our method is essentially formulated as a multi-task learning problem. For the optimization, we jointly optimize the modality weights and the ranking functions of multiple modalities by iteratively solving two subproblems with closed-form solutions.
This paper makes the following three major contributions for RGBT image saliency detection and related applications.

It creates a comprehensive benchmark for facilitating the evaluation of different RGBT saliency detection algorithms. The benchmark dataset includes 821 aligned RGBT image pairs with annotated ground truths, and we also present fine-grained annotations with 11 challenges to allow analysis of the challenge-sensitive performance of different algorithms. Moreover, we implement 3 kinds of baseline methods with different inputs (RGB, thermal and RGBT) for evaluation. This benchmark is available online for free academic usage.¹
¹ RGBT saliency detection benchmark's webpage: http://chenglongli.cn/people/lcl/journals.html
It proposes a novel approach, multi-task manifold ranking with cross-modality consistency, for RGBT saliency detection. In particular, we introduce a weight for each modality to represent its reliability, and incorporate cross-modality consistent constraints to achieve adaptive and collaborative fusion of different source data. The modality weights and ranking functions are jointly optimized by iteratively solving several subproblems with closed-form solutions.

It presents extensive experiments against other state-of-the-art image saliency methods with 3 kinds of inputs. The evaluation results demonstrate the effectiveness of the proposed approach. Through analyzing the quantitative results, we further provide basic insights and identify the potential of thermal information in RGBT saliency detection.
The rest of this paper is organized as follows. Sect. II introduces details of the RGBT saliency detection benchmark. The proposed model and the associated optimization algorithm are presented in Sect. III, and the RGBT saliency detection approach is introduced in Sect. IV. The experimental results and analysis are shown in Sect. V. Sect. VI concludes the paper.
II RGBT Image Saliency Benchmark
In this section, we introduce our newly created RGBT saliency benchmark, which includes the dataset with statistical analysis, baseline methods with different inputs, and evaluation metrics.
II-A Dataset
We collect 821 RGBT image pairs with our recording system, which consists of an online thermal imager (FLIR A310) and a CCD camera (SONY TD2073). For alignment, we uniformly select a number of point correspondences in each image pair, and compute the homography matrix by the least-squares algorithm. It is worth noting that this registration method can accurately align image pairs for the following two reasons. First, we carefully choose planar and non-planar scenes to make the homography assumption effective. Second, since the two camera views are nearly coincident by design, the transformation between the views is simple. After each image pair is aligned, we annotate the pixel-level ground truth using the more reliable modality. Fig. 1 shows some sample image pairs and their ground truths.
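The least-squares homography estimation from point correspondences can be sketched as follows (a minimal DLT-style sketch in NumPy; the function name and normalization choices are ours, not the authors' implementation):

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate a 3x3 homography H mapping src -> dst (N >= 4 point pairs)
    by solving the standard DLT linear system in the least-squares sense."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(A, dtype=float)
    # The least-squares solution is the right singular vector of A
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # normalize so that H[2, 2] = 1
```

In practice one would refine this estimate robustly (e.g., with RANSAC), but the linear solve above captures the alignment step described in the text.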
The image pairs in our dataset are recorded in approximately 60 scenes with different environmental conditions, and the category, size, number and spatial information of salient objects are also taken into account for enhancing the diversity and challenge. Specifically, the following main aspects are considered in creating the RGBT image dataset.

Illumination condition. The image pairs are captured under different light conditions, such as sunny, snowy, and nighttime scenes. The low illumination and illumination variation caused by different light conditions usually pose big challenges for RGB images.

Background factor. Two background factors are taken into account in our dataset. First, a background similar to the salient objects in appearance or temperature introduces ambiguous information. Second, it is difficult to separate objects accurately from a cluttered background.

Salient object attribute. We take different attributes of salient objects, including category (more than 60 categories), size (see the size distribution in Fig. 2 (b)) and number, into account in constructing our dataset for high diversity.

Object location. Most methods employ the spatial information (center and boundary of an image) of the salient objects as priors, which has been verified to be effective. However, some salient objects are not at the center or cross the image boundaries, and these situations invalidate the spatial priors. We incorporate these factors into our dataset construction to increase its challenge, and Fig. 2 presents the spatial distribution of salient objects under CB and CIB.
Challenge  Description 

BSO  Big Salient Object  the ratio of ground truth salient objects over image is more than 0.26. 
SSO  Small Salient Object  the ratio of ground truth salient objects over image is less than 0.05. 
LI  Low Illumination  the environmental illumination is low. 
BW  Bad Weather  the image pairs are recorded in bad weathers, such as snowy, rainy, hazy and cloudy. 
MSO  Multiple Salient Objects  the number of the salient objects in the image is more than 1. 
CB  Center Bias  the centers of salient objects are far away from the image center. 
CIB  Cross Image Boundary  the salient objects cross the image boundaries. 
SA  Similar Appearance  the salient objects have similar color or shape to the background. 
TC  Thermal Crossover  the salient objects have similar temperature to the background. 
IC  Image Clutter  the image is cluttered. 
OF  Out of Focus  the image is outoffocus. 
Algorithm  Feature  Technique  Venue  Year 

MST [15]  Lab & Intensity  Minimum spanning tree  IEEE CVPR  2016 
RRWR [21]  Lab  Regularized random walks ranking  IEEE CVPR  2015 
CA [23]  Lab  Cellular Automata  IEEE CVPR  2015 
GMR [19]  Lab  Graphbased manifold ranking  IEEE CVPR  2013 
STM [8]  LUV & Spatial information  Scalebased tree model  IEEE CVPR  2013 
GR [24]  Lab  Graph regularization  IEEE SPL  2013 
NFI [25]  Lab & Orientations & Spatial information  Nonlinear feature integration  Journal of Vision  2013 
MCI [26]  Lab & Spatial information  Multiscale context integration  IEEE TPAMI  2012 
SSKDE [27]  Lab  Sparse sampling and kernel density estimation  SCIA  2011 
BR [28]  Lab & Intensity & Motion  Bayesian reasoning  ECCV  2010 
SR [29]  Lab  Selfresemblance  Journal of Vision  2009 
SRM [16]  Spectrum  Spectral residual model  IEEE CVPR  2007 
Considering the above-mentioned factors, we annotate 11 challenges for our dataset to facilitate the challenge-sensitive performance analysis of different algorithms. They are: big salient object (BSO), small salient object (SSO), multiple salient objects (MSO), low illumination (LI), bad weather (BW), center bias (CB), cross image boundary (CIB), similar appearance (SA), thermal crossover (TC), image clutter (IC), and out of focus (OF). Tab. I shows the details, and Fig. 2 (a) presents the challenge distribution. We will analyze the performance of different algorithms on specific challenges using the fine-grained annotations in the experimental section.
II-B Baseline Methods
To provide a comparison platform, we implement 3 kinds of baseline methods with different modality inputs. On one hand, we regard RGB or thermal images as inputs to 12 popular methods to achieve single-modality saliency detection, including MST [15], RRWR [21], CA [23], GMR [19], STM [8], GR [24], NFI [25], MCI [26], SSKDE [27], BR [28], SR [29] and SRM [16]. Tab. II presents the details. These baselines can be utilized to identify the importance and complementarity of RGB and thermal information when compared with RGBT saliency detection methods. On the other hand, we concatenate the features extracted from the RGB and thermal modalities as the RGBT feature representations, and employ the above-mentioned methods to achieve RGBT saliency detection.
II-C Evaluation Metrics
There exist several metrics to evaluate the agreement between subjective annotations and experimental predictions. In this work, we use Precision-Recall (PR) curves, the F-measure metric and Mean Absolute Error (MAE) to evaluate all the algorithms. Given the saliency map binarized with threshold values from 0 to 255, precision is the ratio of correctly assigned salient pixels to all detected salient pixels, and recall is the ratio of correctly detected salient pixels to the ground-truth salient pixels. Different from PR curves, which use a fixed threshold for every image, the F-measure metric exploits an adaptive threshold for each image to perform the evaluation. The adaptive threshold is defined as:

T = \frac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y),   (1)

where W and H denote the width and height of an image, respectively, and S is the computed saliency map. The F-measure (F_\beta) is defined as follows with the precision (P) and recall (R) of the above adaptive threshold:

F_\beta = \frac{(1 + \beta^2) P R}{\beta^2 P + R},   (2)

where we set \beta^2 = 0.3 to emphasize the precision, as suggested in [17]. PR curves and the F-measure metric are aimed at quantitative comparison, while MAE better accounts for visual comparison by estimating the dissimilarity between a saliency map S and the ground truth G, which is defined as:

MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} | S(x, y) - G(x, y) |.   (3)
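Assuming the saliency map and ground truth are available as float and boolean arrays, the three metrics above can be sketched as (a minimal NumPy sketch; function names are illustrative):

```python
import numpy as np

def adaptive_threshold(S):
    # Eq. (1): twice the mean saliency value of the map.
    return 2.0 * S.mean()

def f_measure(S, G, beta2=0.3):
    # Binarize with the adaptive threshold, then compute Eq. (2).
    B = S >= adaptive_threshold(S)
    tp = np.logical_and(B, G).sum()
    precision = tp / max(B.sum(), 1)
    recall = tp / max(G.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def mae(S, G):
    # Eq. (3): mean absolute difference between map and ground truth.
    return np.abs(S - G.astype(float)).mean()
```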
III Graph-Based Multi-Task Manifold Ranking
The graph-based ranking problem is described as follows: given a graph and a node in this graph as a query, the remaining nodes are ranked based on their affinities to the given query. The goal is to learn a ranking function that defines the relevance between unlabelled nodes and queries.
This section introduces the graph-based multi-task manifold ranking model and the associated optimization algorithm. The optimized modality weights and ranking scores will be utilized for RGBT saliency detection in the next section.
III-A Graph Construction
Given a pair of RGBT images, we regard the thermal image as an additional image channel, and then employ the SLIC algorithm [30] to generate non-overlapping superpixels. We take these superpixels as nodes to construct a graph G = (V, E), where V is a node set and E is a set of undirected edges. In this work, any two nodes in V are connected if one of the following conditions holds: 1) they are neighboring; 2) they share a common boundary with the same neighboring node; 3) they are both on the four sides of the image, i.e., boundary nodes. Fig. 3 shows the details. The first and second conditions are employed to capture local smoothness cues, as neighboring superpixels tend to share similar appearance and saliency values. The third condition attempts to reduce the geodesic distance of similar superpixels. It is worth noting that we could explore more cues in RGB and thermal data to construct an adaptive graph that makes the best use of the intrinsic relationships among superpixels. We will study this issue in future work, as this paper emphasizes the multi-task manifold ranking algorithm.
If nodes v_i and v_j are connected, we assign the edge a weight as:

w_{ij}^k = e^{-\| c_i^k - c_j^k \| / \sigma^2},   (4)

where c_i^k denotes the mean feature of the i-th superpixel in the k-th modality, and σ is a scaling parameter.
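Given the mean feature of each superpixel in one modality and the edge list from the graph construction, the edge weights of Eq. (4) can be sketched as (an illustrative NumPy sketch; the names and the default σ² are our assumptions):

```python
import numpy as np

def edge_weights(means, edges, sigma2=0.1):
    """Compute w_ij = exp(-||c_i - c_j|| / sigma^2) for each graph edge.

    means: (n, d) array of mean superpixel features for one modality.
    edges: list of (i, j) index pairs produced by the graph construction.
    """
    W = np.zeros((len(means), len(means)))
    for i, j in edges:
        w = np.exp(-np.linalg.norm(means[i] - means[j]) / sigma2)
        W[i, j] = W[j, i] = w  # undirected graph: symmetric weights
    return W
```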
III-B Multi-Task Manifold Ranking with Cross-Modality Consistency
We first review the algorithm of graph-based manifold ranking that exploits the intrinsic manifold structure of data for graph labeling [31]. Given a superpixel feature set X = {x_1, ..., x_n}, some superpixels are labeled as queries and the rest need to be ranked according to their affinities to the queries, where n denotes the number of superpixels. Let f denote a ranking function that assigns a ranking value f_i to each superpixel x_i, and f can be viewed as a vector f = [f_1, ..., f_n]^T. In this work, we regard the query labels as initial superpixel saliency values, and the indication vector y = [y_1, ..., y_n]^T, with y_i = 1 if x_i is a query and y_i = 0 otherwise, is thus an initial superpixel saliency vector. Given y, the optimal ranking of queries is computed by solving the following optimization problem:

f^* = \arg\min_f \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} \Big( \frac{f_i}{\sqrt{d_{ii}}} - \frac{f_j}{\sqrt{d_{jj}}} \Big)^2 + \mu \sum_{i=1}^{n} (f_i - y_i)^2,   (5)

where D = diag{d_11, ..., d_nn} is the degree matrix with d_ii = Σ_j w_ij, diag indicates the diagonal operation, and μ is a parameter to balance the smoothness term and the fitting term.
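For a single modality, the minimizer of the objective in Eq. (5) has the familiar closed form f* = (I − αS)^{-1} y with S = D^{-1/2} W D^{-1/2} and α = 1/(1+μ), up to a positive scale factor that does not affect the ranking order. A minimal NumPy sketch:

```python
import numpy as np

def manifold_rank(W, y, mu=0.01):
    """Graph-based manifold ranking in closed form.

    W: (n, n) symmetric affinity matrix; y: (n,) query indicator vector.
    Returns f* = (I - alpha * S)^{-1} y with alpha = 1 / (1 + mu), where
    S is the symmetrically normalized affinity. The positive scale factor
    omitted here does not change the ranking order.
    """
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    alpha = 1.0 / (1.0 + mu)
    n = len(y)
    return np.linalg.solve(np.eye(n) - alpha * S, y)
```

On a small chain graph with the first node as query, nodes closer to the query receive higher ranking scores, as expected.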
Then, we apply manifold ranking on multiple modalities, and have

\min_{\{f^k\}} \sum_{k=1}^{K} \Big( \frac{1}{2} \sum_{i,j=1}^{n} w_{ij}^k \Big( \frac{f_i^k}{\sqrt{d_{ii}^k}} - \frac{f_j^k}{\sqrt{d_{jj}^k}} \Big)^2 + \mu \sum_{i=1}^{n} (f_i^k - y_i)^2 \Big),   (6)

where K denotes the number of modalities. From Eq. (6), we can see that it inherently assumes that the available modalities are independent and contribute equally. This may significantly limit the performance in dealing with occasional perturbation or malfunction of individual sources. Therefore, we propose a novel collaborative model for robustly performing salient object detection that i) adaptively integrates different modalities based on their respective reliabilities, and ii) collaboratively computes the ranking functions of multiple modalities by incorporating the cross-modality consistent constraints. The formulation of the multi-task manifold ranking algorithm is proposed as follows:

\min_{\{f^k\}, r} \sum_{k=1}^{K} (\alpha + r \odot r)^k \Big( \frac{1}{2} \sum_{i,j=1}^{n} w_{ij}^k \Big( \frac{f_i^k}{\sqrt{d_{ii}^k}} - \frac{f_j^k}{\sqrt{d_{jj}^k}} \Big)^2 + \mu \| f^k - y \|^2 \Big) + \lambda \| r \|^2 + \gamma \sum_{k<l} \| f^k - f^l \|^2,  s.t. \sum_{k=1}^{K} r^k = 1,   (7)

where α is an adaptive parameter vector, which is initialized after the first iteration (see Alg. 1), and r = [r^1, ..., r^K]^T is the modality weight vector. ⊙ denotes the element-wise product, and λ is a balance parameter. The third term is to avoid overfitting of r, and the last term encodes the cross-modality consistent constraints. The effectiveness of introducing these two terms is presented in Fig. 4. With some simple algebra, Eq. (7) can be rewritten as:

\min_{f, r} \sum_{k=1}^{K} (\alpha + r \odot r)^k \big( f^{k\top} (I - S^k) f^k + \mu \| f^k - y \|^2 \big) + \lambda \| r \|^2 + \gamma f^\top C f,  s.t. \sum_{k=1}^{K} r^k = 1,   (8)

where S^k = (D^k)^{-1/2} W^k (D^k)^{-1/2}, f = [f^{1\top}, ..., f^{K\top}]^\top, and C denotes the cross-modality consistent constraint matrix, which for two modalities is defined as:

C = \begin{pmatrix} I & -I \\ -I & I \end{pmatrix},   (9)

where the blocks I are identity matrices with the size of n × n.
III-C Optimization Algorithm
We present an alternating algorithm to optimize Eq. (8) efficiently, and denote its objective as

J(f, r) = \sum_{k=1}^{K} (\alpha + r \odot r)^k \big( f^{k\top} (I - S^k) f^k + \mu \| f^k - y \|^2 \big) + \lambda \| r \|^2 + \gamma f^\top C f.   (10)

Given r, Eq. (10) can be written as:

\min_f \sum_{k=1}^{K} \beta^k \big( f^{k\top} (I - S^k) f^k + \mu \| f^k - y \|^2 \big) + \gamma f^\top C f,   (11)

where β^k = (α + r ⊙ r)^k, and we reformulate it as follows:

\min_f f^\top (A + \mu B + \gamma C) f - 2\mu f^\top B \hat{y},   (12)

where \hat{y} = [y^\top, ..., y^\top]^\top, A is a block-diagonal matrix defined as A = diag{β^1 (I − S^1), ..., β^K (I − S^K)}, and B = diag{β^1 I, ..., β^K I}, where I is an identity matrix with the size of n × n. Taking the derivative of Eq. (12) with respect to f and setting it to zero, we have

f^* = \mu (A + \mu B + \gamma C)^{-1} B \hat{y}.   (13)
Given f, Eq. (10) can be written as:

\min_r \sum_{k=1}^{K} (r^k)^2 e^k + \lambda \| r \|^2,  s.t. \sum_{k=1}^{K} r^k = 1,   (14)

where e^k = f^{k\top} (I − S^k) f^k + μ ‖f^k − y‖². Taking the derivative of the Lagrangian of Eq. (14) with respect to r^k and setting it to zero, we obtain

r^k = \frac{(e^k + \lambda)^{-1}}{\sum_{l=1}^{K} (e^l + \lambda)^{-1}}.   (15)
A suboptimal solution can be achieved by alternating between the updating of f and r, and the whole algorithm is summarized in Alg. 1. Although the global convergence of the proposed algorithm is not proved, we empirically validate its fast convergence in our experiments. The optimized ranking functions and modality weights will be utilized for RGBT saliency detection in the next section.
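One iteration of the alternating scheme can be sketched as follows (this reflects our reading of the optimization, with the per-modality weight taken as (α + r ⊙ r)^k and r normalized to sum to one; an illustrative NumPy sketch, not the authors' released code):

```python
import numpy as np

def f_step(S_list, y, beta, mu, gamma):
    """Closed-form update of the stacked ranking vector (cf. Eq. (13)):
    f* = mu * (A + mu*B + gamma*C)^{-1} B y_hat."""
    K, n = len(S_list), len(y)
    A = np.zeros((K * n, K * n))
    B = np.zeros((K * n, K * n))
    for k, S in enumerate(S_list):
        blk = slice(k * n, (k + 1) * n)
        A[blk, blk] = beta[k] * (np.eye(n) - S)  # weighted smoothness blocks
        B[blk, blk] = beta[k] * np.eye(n)        # weighted fitting blocks
    # Cross-modality consistency matrix; for K = 2 it is [[I, -I], [-I, I]].
    C = np.kron(K * np.eye(K) - np.ones((K, K)), np.eye(n))
    y_hat = np.tile(y, K)
    f = mu * np.linalg.solve(A + mu * B + gamma * C, B @ y_hat)
    return f.reshape(K, n)

def r_step(S_list, y, f, mu, lam):
    """Closed-form modality weights (cf. Eq. (15)):
    r_k proportional to 1 / (e_k + lambda), normalized to sum to one."""
    n = len(y)
    e = np.array([fk @ ((np.eye(n) - S) @ fk) + mu * np.sum((fk - y) ** 2)
                  for S, fk in zip(S_list, f)])
    r = 1.0 / (e + lam)
    return r / r.sum()
```

With two identical modalities the scheme produces identical ranking vectors and equal weights, as one would expect from the symmetry of the model.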
IV Two-Stage RGBT Saliency Detection
In this section, we present the two-stage ranking scheme for unsupervised bottom-up RGBT saliency detection using the proposed algorithm with boundary priors and foreground queries.
IV-A Saliency Measure
Given an input RGBT image pair represented as a graph and some salient query nodes, the saliency of each node is defined as its ranking score computed by Eq. (8). In conventional ranking problems, the queries are manually labelled with the ground truth. In this work, we first employ the boundary prior widely used in other works [19, 11] to highlight the salient superpixels, and select highly confident superpixels (with low ranking scores in all modalities) belonging to the foreground objects as the foreground queries. Then, we perform the proposed algorithm to obtain the final ranking results, and combine them with their modality weights to compute the final saliency map.
IV-B Ranking with Boundary Priors
Based on the attention theories for visual saliency [32], we regard the boundary nodes as background seeds (the labelled data) to rank the relevance of all other superpixel nodes in the first stage.
Taking the bottom image boundary as an example, we utilize the nodes on this side as labelled queries and the rest as the unlabelled data, and initialize the indicator y in Eq. (8). Given y, the ranking values are computed by employing the proposed ranking algorithm, and we normalize them as \bar{f}_b^k with the range between 0 and 1. Similarly, given the top, left and right image boundaries, we can obtain the respective ranking values \bar{f}_t^k, \bar{f}_l^k, \bar{f}_r^k. We integrate them to compute the initial saliency map for each modality in the first stage:

S_1^k(i) = (1 - \bar{f}_b^k(i)) (1 - \bar{f}_t^k(i)) (1 - \bar{f}_l^k(i)) (1 - \bar{f}_r^k(i)),  i = 1, ..., n.   (16)
IV-C Ranking with Foreground Queries
Given S_1^k of the k-th modality, we set an adaptive threshold T^k = θ max(S_1^k) to generate the foreground queries, where max indicates the maximum operation, and θ is a constant, which is fixed to be 0.25 in this work. Specifically, we select the i-th superpixel as a foreground query of the k-th modality if S_1^k(i) > T^k. Then, we compute the ranking values and the modality weights in the second stage by employing our ranking algorithm. Similar to the first stage, we normalize the ranking value of the k-th modality as \bar{f}_2^k with the range between 0 and 1. Finally, the final saliency map can be obtained by combining the ranking values with the modality weights:

S_2(i) = \sum_{k=1}^{K} r^k \bar{f}_2^k(i).   (17)
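Assuming the per-modality ranking vectors are available, the two stages (Eqs. (16) and (17)) can be sketched as (an illustrative NumPy sketch; function names are ours):

```python
import numpy as np

def normalize(f):
    # Scale ranking values to the range [0, 1].
    return (f - f.min()) / max(f.max() - f.min(), 1e-12)

def stage_one_map(f_bottom, f_top, f_left, f_right):
    # Eq. (16): multiply the complemented, normalized boundary rankings.
    s = np.ones_like(f_bottom)
    for f in (f_bottom, f_top, f_left, f_right):
        s *= 1.0 - normalize(f)
    return s

def foreground_queries(s1, theta=0.25):
    # Select superpixels whose first-stage saliency exceeds the adaptive
    # threshold theta * max(s1) as foreground queries.
    return s1 > theta * s1.max()

def final_map(f2_list, r):
    # Eq. (17): weight each modality's second-stage ranking by r_k.
    return sum(rk * normalize(fk) for rk, fk in zip(r, f2_list))
```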
Algorithm  RGB (P / R / F)  Thermal (P / R / F)  RGBT (P / R / F)  Code Type  Runtime (s) 

BR [28]  0.724  0.260  0.411  0.648  0.413  0.488  0.804  0.366  0.520  M&C++  8.23 
SR [29]  0.425  0.523  0.377  0.361  0.587  0.362  0.484  0.584  0.432  M  1.60 
SRM [16]  0.411  0.529  0.384  0.392  0.520  0.380  0.428  0.575  0.411  M  0.76 
CA [23]  0.592  0.667  0.568  0.623  0.607  0.573  0.648  0.697  0.618  M  1.14 
MCI [26]  0.526  0.604  0.485  0.445  0.585  0.435  0.547  0.652  0.515  M&C++  21.89 
NFI [25]  0.557  0.639  0.532  0.581  0.599  0.541  0.564  0.665  0.544  M  12.43 
SSKDE [27]  0.581  0.554  0.532  0.510  0.635  0.497  0.528  0.656  0.515  M&C++  0.94 
GMR [19]  0.644  0.603  0.587  0.700  0.574  0.603  0.694  0.624  0.615  M  1.11 
GR [24]  0.621  0.582  0.534  0.639  0.544  0.545  0.705  0.593  0.600  M&C++  2.43 
STM [8]  0.658  0.569  0.581  0.647  0.603  0.579        C++  1.54 
MST [15]  0.627  0.739  0.610  0.665  0.655  0.598        C++  0.53 
RRWR [21]  0.642  0.610  0.589  0.689  0.580  0.596        C++  2.99 
Ours              0.716  0.713  0.680  M&C++  1.39 
MAE  CA  NFI  SSKDE  GMR  GR  BR  SR  SRM  MCI  STM  MST  RRWR  Ours 

RGB  0.163  0.126  0.122  0.172  0.197  0.269  0.300  0.199  0.211  0.194  0.127  0.171  0.109 
T  0.225  0.124  0.132  0.232  0.199  0.323  0.218  0.155  0.176  0.208  0.129  0.234  0.141 
RGBT  0.195  0.125  0.127  0.202  0.199  0.297  0.260  0.178  0.195        0.107 
V Experiments
In this section, we apply the proposed approach on our RGBT benchmark and compare it with the other baseline methods. The source codes and result figures will be provided with the benchmark for public usage in the community.
V-A Experimental Settings
For fair comparisons, we fix all parameters and other settings of our approach throughout the experiments, and use the default parameters released in the public codes of the other baseline methods.
In graph construction, we empirically generate superpixels and set the affinity parameters and , which control the edge strength between two superpixel nodes. In Alg. 1 and Alg. 2, we empirically set , (the first stage) and (the second stage). Herein, we use bigger balance weight in the second stage than in the first stage as the refined foreground queries are more confident than the background queries according to the boundary prior.
V-B Comparison Results
To justify the importance of thermal information, its complementary benefits to image saliency detection and the effectiveness of the proposed approach, we evaluate the compared methods with different modality inputs on the newly created benchmark introduced in Sect. II.
Overall performance. We first report the precision (P), recall (R) and F-measure (F) of the 3 kinds of methods on the entire dataset in Tab. III. Herein, as the public source codes of STM, MST and RRWR are encrypted, we only run these methods on the single modalities. From the evaluation results, we can observe that the proposed approach substantially outperforms all baseline methods. This comparison clearly demonstrates the effectiveness of our approach in adaptively incorporating thermal information. In addition, the superior performance of the RGBT baselines over both the RGB and thermal methods shows that thermal data are effective in boosting image saliency detection and complementary to RGB data. We also report the MAE of the 3 kinds of methods on the entire dataset in Tab. IV, whose results are largely consistent with Tab. III. The evaluation results further validate the effectiveness of the proposed approach, the importance of thermal information and the complementary benefits of RGBT data.
Fig. 6 shows some sample results of our approach against other baseline methods with different inputs. The results show that the proposed approach can detect the salient objects more accurately than other methods by adaptively and collaboratively leveraging RGBT data. It is worth noting that some results using a single modality are better than those using RGBT data. This is because the redundant information introduced by a noisy or malfunctioning modality sometimes degrades the fusion results.
Challenge-sensitive performance. To facilitate analysis of performance on different challenging factors, we evaluate the RGBT methods on subsets with different attributes (big salient object (BSO), small salient object (SSO), multiple salient objects (MSO), low illumination (LI), bad weather (BW), center bias (CB), cross image boundary (CIB), similar appearance (SA), thermal crossover (TC), image clutter (IC), and out of focus (OF); see Sect. II for details), and present the PR curves in Fig. 5. From the results we can see that our approach outperforms the other RGBT methods by a clear margin on most challenges except for BSO and BW, which validates the effectiveness of our method. In particular, under occasional perturbation or malfunction of one modality (e.g., LI, SA and TC), our method can effectively incorporate the other modality's information to detect salient objects robustly, justifying the complementary benefits of multiple source data.
For BSO, CA [23] achieves superior performance over ours, which may be attributed to the fact that CA takes the global color distinction and the global spatial distance into account to better capture global and local information. We can alleviate this problem by improving the graph construction to explore more relations among superpixels. For BW, most methods perform badly, but MCI obtains a big performance gain over ours, the second best. This suggests that considering multiple cues, like low-level considerations, global considerations, visual organizational rules and high-level factors, can handle extreme challenges in RGBT saliency detection, and we will integrate them into our framework to improve its robustness.
V-C Analysis of Our Approach
We discuss the details of our approach by analyzing the main components, efficiency and limitations.
Components. To justify the significance of the main components of the proposed approach, we implement two special versions for comparative analysis: 1) Ours-I, which removes the modality weights from the proposed ranking model, i.e., fixes the modality weights in Eq. (8), and 2) Ours-II, which removes the cross-modality consistent constraints from the proposed ranking model, i.e., drops the cross-modality term in Eq. (8).
The PR curves; representative precision, recall and F-measure; and MAE are presented in Fig. 7, and we can draw the following conclusions. 1) Our method substantially outperforms Ours-I, which demonstrates the significance of the introduced weight variables for achieving adaptive fusion of different source data. 2) The complete algorithm achieves superior performance over Ours-II, validating the effectiveness of the cross-modality consistent constraints.
Efficiency. The runtime of our approach against other methods is presented in Tab. III. It is worth mentioning that our approach is comparable with GMR [19], mainly due to the fast optimization of the proposed ranking model.
The experiments are carried out on a desktop with an Intel i7 3.4GHz CPU and 16GB RAM, and implemented on a mixed platform of C++ and MATLAB without any optimization. Fig. 8 shows the convergence curve of the proposed approach. Although it involves 5 runs of the optimization of the ranking model, our method costs about 1.39 seconds per image pair due to the efficiency of our optimization algorithm, which converges within approximately 5 iterations.
We also report the runtime of the other main procedures of the proposed approach at the typical image resolution. 1) The over-segmentation by the SLIC algorithm takes about 0.52 seconds. 2) The feature extraction takes approximately 0.24 seconds. 3) The first stage, including 4 ranking processes, costs about 0.42 seconds. 4) The second stage, including 1 ranking process, takes approximately 0.14 seconds. The over-segmentation and the feature extraction are the most time-consuming procedures (about 55%). Hence, by introducing efficient over-segmentation algorithms and feature extraction implementations, the computation time of our approach can be reduced substantially.
Limitations. We also present two failure cases of our method in Fig. 9. The reliability weights are sometimes wrongly estimated due to the effect of cluttered background, as shown in Fig. 9 (a), where the modality weights of the RGB and thermal data are 0.97 and 0.95, respectively. In such circumstances, our method generates bad detection results. This problem could be tackled by incorporating a measure of background clutter when optimizing the reliability weights, and will be addressed in our future work. In addition to reliability weight computation, our approach has another major limitation. The first-stage ranking relies on the boundary prior, and thus salient objects fail to be detected when they cross image boundaries. The insufficient foreground queries obtained after the first stage usually result in bad saliency detection performance in the second stage, as shown in Fig. 9 (b). We will handle this issue by selecting more accurate boundary queries according to other prior cues in future work.
V-D Discussions on RGBT Saliency Detection
We observe from the evaluations that integrating RGB and thermal data boosts saliency detection performance (see Tab. III). The improvements are even larger under certain challenges, e.g., low illumination, similar appearance, image clutter and thermal crossover (see Fig. 5), demonstrating the importance of thermal information in image saliency detection and the complementary benefits of RGBT data.
In addition, directly integrating RGB and thermal information sometimes leads to worse results than using a single modality (see Fig. 6), as redundant information is introduced by a noisy or malfunctioning modality. We address this issue by adaptively fusing the information from different modalities (our method), automatically determining the contribution weight of each modality to alleviate the effects of redundant information.
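The core of such adaptive fusion is a reliability-weighted combination of the per-modality saliency maps. The sketch below illustrates the idea only; the function name, the scalar weights, and the toy maps are assumptions for illustration, not the paper's actual weight-optimization scheme:

```python
import numpy as np

def fuse_saliency(s_rgb, s_t, w_rgb, w_t):
    """Adaptively fuse per-modality saliency maps with reliability weights.

    s_rgb, s_t : 2-D arrays of saliency values in [0, 1].
    w_rgb, w_t : scalar reliability weights; a noisy or malfunctioning
    modality receives a small weight so its redundant information is
    suppressed rather than averaged in at full strength."""
    return (w_rgb * s_rgb + w_t * s_t) / (w_rgb + w_t)

# A reliable RGB map and an unreliable thermal map (e.g., thermal
# crossover): down-weighting the thermal modality keeps the fused
# result close to the trustworthy RGB saliency.
s_rgb = np.array([[0.9, 0.1], [0.8, 0.2]])
s_t = np.array([[0.2, 0.7], [0.3, 0.6]])
fused = fuse_saliency(s_rgb, s_t, w_rgb=0.9, w_t=0.1)
```

With equal weights this reduces to naive averaging, which is exactly the failure mode described above when one modality is noisy.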
From the evaluation results, we also observe the following research potentials for RGBT saliency detection. 1) The ensemble of multiple models or algorithms (CA [23]) can achieve robust performance. 2) Some principles are crucial for effective saliency detection, e.g., global considerations (CA [23]), boundary priors (Ours, GMR [19], RRWR [21], MST [15]) and multi-scale context (STM [8]). 3) The exploitation of more relations among pixels or superpixels is important for highlighting salient objects (STM [8], MST [15]).
VI Conclusion
In this paper, we have presented a comprehensive image benchmark for RGBT saliency detection, which includes a dataset, three kinds of baselines and four evaluation metrics. With the benchmark, we have proposed a graph-based multi-task manifold ranking algorithm to achieve adaptive and collaborative fusion of RGB and thermal data in RGBT saliency detection. Through analyzing the quantitative and qualitative results, we have demonstrated the effectiveness of the proposed approach, and also provided some basic insights and potential research directions for RGBT saliency detection. Our future work will focus on the following aspects: 1) We will expand the benchmark to a larger one, including a larger dataset with more challenging factors and more popular baseline methods. 2) We will improve the robustness of our approach by studying other prior models [33] and graph construction.
References
 [1] B. Zhao, Z. Li, M. Liu, W. Cao, and H. Liu, “Infrared and visible imagery fusion based on region saliency detection for 24-hour-surveillance systems,” in Proceedings of the IEEE International Conference on Robotics and Biomimetics, 2013.
 [2] C. Li, X. Wang, L. Zhang, and J. Tang, “WELD: Weighted low-rank decomposition for robust grayscale-thermal foreground detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2016.
 [3] C. Li, H. Cheng, S. Hu, X. Liu, J. Tang, and L. Lin, “Learning collaborative sparse representation for grayscale-thermal tracking,” IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5743–5756, 2016.
 [4] A. Torralba and A. Efros, “Unbiased look at dataset bias,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011.
 [5] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in Proceedings of the Advances in Neural Information Processing Systems, 2006.
 [6] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 353–367, 2011.
 [7] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 [8] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 [9] J.-G. Yu, J. Zhao, J. Tian, and Y. Tan, “Maximal entropy random walk for region-based visual saliency,” IEEE Transactions on Cybernetics, vol. 44, no. 9, pp. 1661–1672, 2014.
 [10] G. Zhu, Q. Wang, and Y. Yuan, “Natas: Neural activity trace aware saliency,” IEEE Transactions on Cybernetics, vol. 44, no. 7, pp. 1014–1024, 2014.
 [11] K. Wang, L. Lin, J. Lu, C. Li, and K. Shi, “PISA: Pixelwise image saliency by aggregating complementary appearance contrast measures with edge-preserving coherence,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3019–3033, 2015.
 [12] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 569–582, 2015.
 [13] M. Liang and X. Hu, “Feature selection in supervised saliency prediction,” IEEE Transactions on Cybernetics, vol. 45, no. 5, pp. 900–912, 2015.
 [14] M. Jian, K.-M. Lam, J. Dong, and L. Shen, “Visual-patch-attention-aware saliency detection,” IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1575–1586, 2015.
 [15] W.-C. Tu, S. He, Q. Yang, and S.-Y. Chien, “Real-time salient object detection with a minimum spanning tree,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [16] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
 [17] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
 [18] V. Gopalakrishnan, Y. Hu, and D. Rajan, “Random walks on graphs for salient object detection in images,” IEEE Transactions on Image Processing, vol. 19, no. 12, pp. 3232–3242, 2010.
 [19] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 [20] L. Zhang, C. Yang, H. Lu, X. Ruan, and M. Yang, “Ranking saliency,” IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2016.2609426, 2017.
 [21] C. Li, Y. Yuan, W. Cai, Y. Xia, and D. Dagan Feng, “Robust saliency detection via regularized random walks ranking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [22] Q. Wang, W. Zheng, and R. Piramuthu, “Grab: Visual saliency via novel graph model and background priors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 [23] Y. Qin, H. Lu, Y. Xu, and H. Wang, “Saliency detection via cellular automata,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [24] C. Yang, L. Zhang, and H. Lu, “Graph-regularized saliency detection with convex-hull-based center prior,” IEEE Signal Processing Letters, vol. 20, no. 7, pp. 637–640, 2013.
 [25] E. Erdem and A. Erdem, “Visual saliency estimation by nonlinearly integrating features using region covariances,” Journal of Vision, vol. 13, no. 4, 2013.
 [26] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 10, pp. 1915–1926, 2012.
 [27] H. R. Tavakoli, E. Rahtu, and J. Heikkilä, “Fast and efficient saliency detection using sparse sampling and kernel density estimation,” in Proceedings of the Scandinavian Conference on Image Analysis, 2011.
 [28] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, “Segmenting salient objects from images and videos,” in Proceedings of the European Conference on Computer Vision, 2010.
 [29] H. J. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” Journal of Vision, vol. 9, no. 12, pp. 15–15, 2009.
 [30] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.
 [31] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Scholkopf, “Ranking on data manifolds,” in Proceedings of Neural Information Processing Systems, 2004.
 [32] L. Itti, C. Koch, E. Niebur et al., “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
 [33] L. Lin, W. Yang, C. Li, J. Tang, and X. Cao, “Inference with collaborative model for interactive tumor segmentation in medical image sequences,” IEEE Transactions on Cybernetics, vol. 46, no. 12, pp. 2796–2809, 2016.