A Unified RGB-T Saliency Detection Benchmark: Dataset, Baselines, Analysis and A Novel Approach

Chenglong Li, Guizhao Wang, Yunpeng Ma, Aihua Zheng, Bin Luo, and Jin Tang

The authors are with Anhui University, Hefei 230601, China. Email: jtang99029@foxmail.com.

Abstract

Despite significant progress, image saliency detection remains a challenging task in complex scenes and environments. Integrating multiple different but complementary cues, like RGB and Thermal (RGB-T), may be an effective way to boost saliency detection performance. The current research in this direction, however, is limited by the lack of a comprehensive benchmark. This work contributes such an RGB-T image dataset, which includes 821 spatially aligned RGB-T image pairs and their ground truth annotations for saliency detection. The image pairs are highly diverse and are recorded under different scenes and environmental conditions, and we annotate 11 challenges on these image pairs to support challenge-sensitive analysis of different saliency detection algorithms. We also implement 3 kinds of baseline methods with different modality inputs to provide a comprehensive comparison platform.

With this benchmark, we propose a novel approach, multi-task manifold ranking with cross-modality consistency, for RGB-T saliency detection. In particular, we introduce a weight for each modality to describe the reliability, and integrate them into the graph-based manifold ranking algorithm to achieve adaptive fusion of different source data. Moreover, we incorporate the cross-modality consistent constraints to integrate different modalities collaboratively. For the optimization, we design an efficient algorithm to iteratively solve several subproblems with closed-form solutions. Extensive experiments against other baseline methods on the newly created benchmark demonstrate the effectiveness of the proposed approach, and we also provide basic insights and potential future research directions for RGB-T saliency detection.

Index Terms: RGB-T benchmark, Saliency detection, Cross-modality consistency, Manifold ranking, Fast optimization.

I Introduction

Image saliency detection is a fundamental and active problem in computer vision. It aims at automatically highlighting salient foreground objects against the background, and has received increasing attention due to its wide range of applications in computer vision and graphics, such as object recognition, content-aware retargeting, video compression, and image classification. Despite significant progress, image saliency detection remains a challenging task in complex scenes and environments.

Recently, integrating RGB data and thermal data (RGB-T data) has proved effective in several computer vision problems, such as moving object detection [1, 2] and tracking [3]. Despite the potential of RGB-T data, however, research on RGB-T saliency detection is limited by the lack of a comprehensive image benchmark.

In this paper, we contribute a comprehensive image benchmark for RGB-T saliency detection, and the following two main aspects are considered in creating this benchmark.

  • A good dataset should have reasonable size, high diversity and low bias [4]. Therefore, we use our recording system to collect 821 RGB-T image pairs in different scenes and environmental conditions, and each image pair is aligned and annotated with ground truth. In addition, the category, size, number and spatial information of salient objects are also taken into account to enhance the diversity and challenge, and we present some statistics of the created dataset to analyze the diversity and bias. To analyze the challenge-sensitive performance of different algorithms, we annotate 11 different challenges according to the above-mentioned factors.

  • To the best of our knowledge, RGB-T saliency detection has not been well investigated. Therefore, we implement some baseline methods to provide a comparison platform. On one hand, we regard RGB or thermal images as inputs in some popular methods to achieve single-modality saliency detection. These baselines can be used to identify the importance and complementarity of RGB and thermal information by comparison with RGB-T saliency detection methods. On the other hand, we concatenate the features extracted from the RGB and thermal modalities as the RGB-T feature representations, and employ some popular methods to achieve RGB-T saliency detection.

Salient object detection has been extensively studied in past decades, and numerous models and algorithms have been proposed based on different mathematical principles or priors [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Most methods measure saliency through local center-surround contrast and the rarity of features over the entire image [5, 16, 17, 12]. In contrast, Gopalakrishnan et al. [18] formulated the object detection problem as a binary segmentation or labelling task on a graph. The most salient seeds and several background seeds were identified by the behavior of random walks on a complete graph and a $k$-regular graph. Then, a semi-supervised learning technique was used to infer the binary labels of the unlabelled nodes. In contrast, Yang et al. [19] applied the manifold ranking technique to salient object detection, which requires seeds from only one class, initialized with either the boundary priors or foreground cues. They then extended their work with several improvements [20], including multi-scale graph construction and a cascade scheme on a multi-layer representation. Based on the manifold ranking algorithms, Li et al. [21] generated pixel-wise saliency maps via regularized random walks ranking, and Wang et al. [22] proposed a new graph model which captured local/global contrast and effectively utilized the boundary prior.

With the created benchmark, we propose a novel approach, multi-task manifold ranking with cross-modality consistency, for RGB-T saliency detection. For each modality, we employ the idea of graph-based manifold ranking [19] because of its good saliency detection performance in terms of accuracy and speed. Then, we assign each modality a weight that describes its reliability, which makes the model capable of dealing with occasional perturbation or malfunction of individual sources, to achieve adaptive fusion of multiple modalities. To better exploit the relations among modalities, we impose the cross-modality consistent constraints on the ranking functions of different modalities to integrate them collaboratively. Considering the manifold ranking in each modality as an individual task, our method is essentially formulated as a multi-task learning problem. For the optimization, we jointly optimize the modality weights and the ranking functions of multiple modalities by iteratively solving two subproblems with closed-form solutions.

This paper makes the following three major contributions for RGB-T image saliency detection and related applications.

  • It creates a comprehensive benchmark to facilitate the evaluation of different RGB-T saliency detection algorithms. The benchmark dataset includes 821 aligned RGB-T image pairs with annotated ground truths, and we also present fine-grained annotations with 11 challenges to allow us to analyze the challenge-sensitive performance of different algorithms. Moreover, we implement 3 kinds of baseline methods with different inputs (RGB, thermal and RGB-T) for evaluations. This benchmark will be available online for free academic usage (RGB-T saliency detection benchmark's webpage: http://chenglongli.cn/people/lcl/journals.html).

  • It proposes a novel approach, multi-task manifold ranking with cross-modality consistency, for RGB-T saliency detection. In particular, we introduce a weight for each modality to represent the reliability, and incorporate the cross-modality consistent constraints to achieve adaptive and collaborative fusion of different source data. The modality weights and ranking function are jointly optimized by iteratively solving several subproblems with closed-form solutions.

  • It presents extensive experiments against other state-of-the-art image saliency methods with 3 kinds of inputs. The evaluation results demonstrate the effectiveness of the proposed approach. Through analyzing quantitative results, we further provide basic insights and identify the potentials of thermal information in RGB-T saliency detection.

The rest of this paper is organized as follows. Sect. II introduces details of the RGB-T saliency detection benchmark. The proposed model and the associated optimization algorithm are presented in Sect. III, and the RGB-T saliency detection approach is introduced in Sect. IV. The experimental results and analysis are shown in Sect. V. Sect. VI concludes the paper.

II RGB-T Image Saliency Benchmark

In this section, we introduce our newly created RGB-T saliency benchmark, which includes the dataset with statistical analysis, baseline methods with different inputs, and evaluation metrics.

Fig. 1: Sample image pairs with annotated ground truths and challenges from our RGB-T dataset.

II-A Dataset

We collect 821 RGB-T image pairs with our recording system, which consists of an online thermal imager (FLIR A310) and a CCD camera (SONY TD-2073). For alignment, we uniformly select a number of point correspondences in each image pair, and compute the homography matrix by the least-squares algorithm. It is worth noting that this registration method can accurately align image pairs for the following two reasons. First, we carefully choose the planar and non-planar scenes to make the homography assumption effective. Second, since we set up the two cameras so that their views are almost coincident, the transformation between the two views is simple. As each image pair is aligned, we annotate the pixel-level ground truth using the more reliable modality. Fig. 1 shows some sample image pairs and their ground truths.
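
The alignment step can be illustrated with a minimal sketch. It assumes the direct linear transform (DLT) as the least-squares solver, which the text does not specify, and the function names `estimate_homography` and `apply_homography` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def estimate_homography(src, dst):
    """Least-squares homography H (3x3) mapping src -> dst, both (N, 2), N >= 4.

    Assumption: DLT formulation; the paper only states "least-square algorithm".
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear equations in the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)
    # The right singular vector of the smallest singular value minimises ||A h||
    # subject to ||h|| = 1.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

With more than four correspondences the system is over-determined, so the SVD solution averages out small localisation errors in the selected points. In practice the coordinates are often normalised before solving to improve conditioning.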

The image pairs in our dataset are recorded in approximately 60 scenes with different environmental conditions, and the category, size, number and spatial information of salient objects are also taken into account for enhancing the diversity and challenge. Specifically, the following main aspects are considered in creating the RGB-T image dataset.

  • Illumination condition. The image pairs are captured under different light conditions, such as sunny, snowy, and nighttime. The low illumination and illumination variation caused by different light conditions usually bring big challenges in RGB images.

  • Background factor. Two background factors are taken into account for our dataset. First, similar background to the salient objects in appearance or temperature will introduce ambiguous information. Second, it is difficult to separate objects accurately from cluttered background.

  • Salient object attribute. We take different attributes of salient objects, including category (more than 60 categories), size (see the size distribution in Fig. 2 (b)) and number, into account in constructing our dataset for high diversity.

  • Object location. Most methods employ the spatial information (center and boundary of an image) of the salient objects as priors, which has been verified to be effective. However, some salient objects are not at the center or cross the image boundaries, and these situations violate the spatial priors. We incorporate these factors into our dataset construction to make it more challenging, and Fig. 2 presents the spatial distribution of salient objects on CB and CIB.

| Challenge | Description |
|---|---|
| BSO | Big Salient Object: the ratio of ground truth salient objects over image is more than 0.26. |
| SSO | Small Salient Object: the ratio of ground truth salient objects over image is less than 0.05. |
| LI | Low Illumination: the environmental illumination is low. |
| BW | Bad Weather: the image pairs are recorded in bad weather, such as snowy, rainy, hazy and cloudy. |
| MSO | Multiple Salient Objects: the number of the salient objects in the image is more than 1. |
| CB | Center Bias: the centers of salient objects are far away from the image center. |
| CIB | Cross Image Boundary: the salient objects cross the image boundaries. |
| SA | Similar Appearance: the salient objects have similar color or shape to the background. |
| TC | Thermal Crossover: the salient objects have similar temperature to the background. |
| IC | Image Clutter: the image is cluttered. |
| OF | Out of Focus: the image is out-of-focus. |

TABLE I: List of the annotated challenges of our RGB-T dataset.
Fig. 2: Dataset statistics.
| Algorithm | Feature | Technique | Venue | Year |
|---|---|---|---|---|
| MST [15] | Lab & Intensity | Minimum spanning tree | IEEE CVPR | 2016 |
| RRWR [21] | Lab | Regularized random walks ranking | IEEE CVPR | 2015 |
| CA [23] | Lab | Cellular automata | IEEE CVPR | 2015 |
| GMR [19] | Lab | Graph-based manifold ranking | IEEE CVPR | 2013 |
| STM [8] | LUV & Spatial information | Scale-based tree model | IEEE CVPR | 2013 |
| GR [24] | Lab | Graph regularization | IEEE SPL | 2013 |
| NFI [25] | Lab & Orientations & Spatial information | Nonlinear feature integration | Journal of Vision | 2013 |
| MCI [26] | Lab & Spatial information | Multiscale context integration | IEEE TPAMI | 2012 |
| SS-KDE [27] | Lab | Sparse sampling and kernel density estimation | SCIA | 2011 |
| BR [28] | Lab & Intensity & Motion | Bayesian reasoning | ECCV | 2010 |
| SR [29] | Lab | Self-resemblance | Journal of Vision | 2009 |
| SRM [16] | Spectrum | Spectral residual model | IEEE CVPR | 2007 |

TABLE II: List of the baseline methods with the used features, the main techniques and the publication information.

Considering the above-mentioned factors, we annotate 11 challenges for our dataset to facilitate challenge-sensitive performance analysis of different algorithms. They are: big salient object (BSO), small salient object (SSO), multiple salient objects (MSO), low illumination (LI), bad weather (BW), center bias (CB), cross image boundary (CIB), similar appearance (SA), thermal crossover (TC), image clutter (IC), and out of focus (OF). Tab. I shows the details, and Fig. 2 (a) presents the challenge distribution. We will analyze the performance of different algorithms on each specific challenge using the fine-grained annotations in the experimental section.
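
Some of these challenge definitions are purely geometric and can be checked directly from a ground-truth mask. The sketch below is only an illustration of the BSO, SSO and CIB criteria in Tab. I. The source does not say whether the annotation was automated, and the function name `size_and_boundary_tags` is hypothetical.

```python
import numpy as np

def size_and_boundary_tags(mask, big=0.26, small=0.05):
    """Return the subset of {"BSO", "SSO", "CIB"} implied by a binary mask.

    The thresholds 0.26 and 0.05 are the area ratios listed in Tab. I.
    """
    mask = np.asarray(mask).astype(bool)
    ratio = mask.mean()  # fraction of pixels labelled salient
    tags = set()
    if ratio > big:
        tags.add("BSO")
    if ratio < small:
        tags.add("SSO")
    # CIB: at least one salient pixel lies on the outermost rows or columns.
    if mask[0].any() or mask[-1].any() or mask[:, 0].any() or mask[:, -1].any():
        tags.add("CIB")
    return tags
```

The remaining challenges (for example LI, BW, SA and TC) depend on scene conditions or on both modalities and are not covered by this sketch.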

II-B Baseline Methods

To provide a comparison platform, we implement 3 kinds of baseline methods with different modality inputs. On one hand, we regard RGB or thermal images as inputs in 12 popular methods to achieve single-modality saliency detection, including MST [15], RRWR [21], CA [23], GMR [19], STM [8], GR [24], NFI [25], MCI [26], SS-KDE [27], BR [28], SR [29] and SRM [16]. Tab. II presents the details. These baselines can be used to identify the importance and complementarity of RGB and thermal information by comparison with RGB-T saliency detection methods. On the other hand, we concatenate the features extracted from the RGB and thermal modalities as the RGB-T feature representations, and employ the above-mentioned methods to achieve RGB-T saliency detection.

II-C Evaluation Metrics

Several metrics exist to evaluate the agreement between subjective annotations and experimental predictions. In this work, we use Precision-Recall (PR) curves, the $F_\beta$ metric and Mean Absolute Error (MAE) to evaluate all the algorithms. Given the saliency map binarized at each threshold value from 0 to 255, precision is the ratio of correctly detected salient pixels to all detected salient pixels, and recall is the ratio of correctly detected salient pixels to all ground-truth salient pixels. Unlike PR curves, which apply the same fixed threshold to every image, the $F_\beta$ metric uses an adaptive threshold for each image. The adaptive threshold is defined as:

$$T = \frac{2}{w \times h} \sum_{i=1}^{w} \sum_{j=1}^{h} S(i, j), \qquad (1)$$

where $w$ and $h$ denote the width and height of an image, respectively, and $S$ is the computed saliency map. The F-measure ($F_\beta$) is defined as follows with the precision ($P$) and recall ($R$) obtained under the above adaptive threshold:

$$F_\beta = \frac{(1 + \beta^2) \, P \times R}{\beta^2 \times P + R}, \qquad (2)$$

where we set $\beta^2 = 0.3$ to emphasize the precision as suggested in [17]. PR curves and the $F_\beta$ metric are aimed at quantitative comparison, while MAE better reflects visual comparison by estimating the dissimilarity between a saliency map $S$ and the ground truth $G$, and is defined as:

$$\mathrm{MAE} = \frac{1}{w \times h} \sum_{i=1}^{w} \sum_{j=1}^{h} \left| S(i, j) - G(i, j) \right|. \qquad (3)$$
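
The three metrics can be sketched in a few lines. The sketch assumes a saliency map scaled to [0, 1], a boolean ground-truth mask, the usual adaptive threshold of twice the mean saliency, and $\beta^2 = 0.3$ as commonly used following [17]. All function names are hypothetical.

```python
import numpy as np

def adaptive_threshold(sal):
    """Twice the mean saliency value (assumed form of the adaptive threshold)."""
    return 2.0 * sal.mean()

def precision_recall(binary, gt):
    """Precision and recall of a boolean detection against a boolean mask."""
    tp = np.logical_and(binary, gt).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    return precision, recall

def f_measure(sal, gt, beta2=0.3):
    """F-measure at the adaptive threshold; beta2 weights precision over recall."""
    p, r = precision_recall(sal >= adaptive_threshold(sal), gt)
    if p + r == 0:
        return 0.0
    return (1 + beta2) * p * r / (beta2 * p + r)

def mae(sal, gt):
    """Mean absolute difference between the saliency map and the ground truth."""
    return np.abs(sal - gt.astype(float)).mean()

def pr_curve(sal, gt, levels=256):
    """(precision, recall) at every integer threshold of the 8-bit map."""
    sal8 = np.round(sal * 255).astype(int)
    return [precision_recall(sal8 >= t, gt) for t in range(levels)]
```

Dataset-level numbers would then be averages of these per-image values. Whether the averaging is done per image or per threshold is not stated in the text, so that choice is left open here.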

III Graph-based Multi-Task Manifold Ranking

The graph-based ranking problem is described as follows: Given a graph and a node in this graph as query, the remaining nodes are ranked based on their affinities to the given query. The goal is to learn a ranking function that defines the relevance between unlabelled nodes and queries.

This section introduces the graph-based multi-task manifold ranking model and the associated optimization algorithm. The optimized modality weights and ranking scores will be utilized for RGB-T saliency detection in the next section.

Fig. 3: Illustration of graph construction.

III-A Graph Construction

Given a pair of RGB-T images, we regard the thermal image as one of the image channels, and then employ the SLIC algorithm [30] to generate non-overlapping superpixels. We take these superpixels as nodes to construct a graph $G = (V, E)$, where $V$ is a node set and $E$ is a set of undirected edges. In this work, any two nodes in $V$ are connected if one of the following conditions holds: 1) they are neighboring; 2) one shares a common boundary with a neighbor of the other; 3) they are both on the four sides of the image, i.e., boundary nodes. Fig. 3 shows the details. The first and second conditions are employed to capture local smoothness cues, as neighboring superpixels tend to share similar appearance and saliency values. The third condition attempts to reduce the geodesic distance of similar superpixels. It is worth noting that we can explore more cues in RGB and thermal data to construct an adaptive graph that makes the best use of the intrinsic relationships among superpixels. We will study this issue in future work, as this paper emphasizes the multi-task manifold ranking algorithm.
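
The three connection rules can be illustrated on a superpixel label map. The sketch below assumes a 2-D integer array `labels` with ids 0..n-1 (as produced by any over-segmentation), treats "neighboring" as sharing a horizontal or vertical pixel border, and reads the second rule as a two-hop connection. The function name is hypothetical.

```python
import numpy as np

def superpixel_adjacency(labels):
    """Boolean (n, n) adjacency over superpixels from an (H, W) label map."""
    labels = np.asarray(labels)
    n = int(labels.max()) + 1
    direct = np.zeros((n, n), dtype=bool)
    # Rule 1: superpixels that touch horizontally or vertically.
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        diff = a != b
        direct[a[diff], b[diff]] = True
        direct[b[diff], a[diff]] = True
    # Rule 2: superpixels that share a common neighbour (two hops apart).
    two_hop = (direct.astype(int) @ direct.astype(int)) > 0
    # Rule 3: all superpixels on the four image sides are mutually connected.
    border = np.unique(np.concatenate(
        [labels[0], labels[-1], labels[:, 0], labels[:, -1]]))
    closed = np.zeros((n, n), dtype=bool)
    closed[np.ix_(border, border)] = True
    adj = direct | two_hop | closed
    np.fill_diagonal(adj, False)  # no self-loops
    return adj
```

The resulting mask only says which edges exist; edge weights are then computed per modality on the connected pairs.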

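If nodes $i$ and $j$ are connected, we assign the edge between them a weight in the $k$-th modality as:

$$W_{ij}^{k} = e^{-\gamma^{k} \left\| c_i^{k} - c_j^{k} \right\|}, \qquad (4)$$

where $c_i^{k}$ denotes the mean of the $i$-th superpixel in the $k$-th modality, and $\gamma^{k}$ is a scaling parameter.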

III-B Multi-Task Manifold Ranking with Cross-Modality Consistency

We first review the algorithm of graph-based manifold ranking that exploits the intrinsic manifold structure of data for graph labeling [31]. Given a superpixel feature set $X = \{x_1, \ldots, x_n\}$, some superpixels are labeled as queries and the rest need to be ranked according to their affinities to the queries, where $n$ denotes the number of superpixels. Let $s: X \rightarrow \mathbb{R}^n$ denote a ranking function that assigns a ranking value $s_i$ to each superpixel $x_i$, and $s$ can be viewed as a vector $s = [s_1, \ldots, s_n]^T$. In this work, we regard the query labels as initial superpixel saliency values, and $s$ is thus an initial superpixel saliency vector. Let $y = [y_1, \ldots, y_n]^T$ denote an indication vector, in which $y_i = 1$ if $x_i$ is a query, and $y_i = 0$ otherwise. Given $G$, the optimal ranking of queries is computed by solving the following optimization problem:

$$\min_{s} \frac{1}{2} \left( \sum_{i,j=1}^{n} W_{ij} \left\| \frac{s_i}{\sqrt{D_{ii}}} - \frac{s_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{n} \left\| s_i - y_i \right\|^2 \right), \qquad (5)$$

where $D = \mathrm{diag}\{D_{11}, \ldots, D_{nn}\}$ is the degree matrix, and $D_{ii} = \sum_{j} W_{ij}$. $\mathrm{diag}$ indicates the diagonal operation. $\mu$ is a parameter to balance the smoothness term and the fitting term.
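
Single-modality manifold ranking has a closed-form solution, which the following sketch computes. It assumes the standard objective of [31], namely half the sum over all node pairs of $W_{ij}(s_i/\sqrt{D_{ii}} - s_j/\sqrt{D_{jj}})^2$ plus $(\mu/2)\|s - y\|^2$, with a symmetric affinity matrix and positive degrees. The function names and the default value of `mu` are illustrative only.

```python
import numpy as np

def manifold_ranking(W, y, mu=0.01):
    """Closed-form minimiser of the assumed manifold ranking objective."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    d = W.sum(axis=1)                       # node degrees D_ii
    inv_sqrt_d = 1.0 / np.sqrt(d)
    S = W * inv_sqrt_d[:, None] * inv_sqrt_d[None, :]   # D^{-1/2} W D^{-1/2}
    n = len(y)
    # Setting the gradient to zero gives 2 (I - S) s + mu (s - y) = 0.
    return np.linalg.solve(2.0 * (np.eye(n) - S) + mu * np.eye(n), mu * y)

def objective(W, s, y, mu):
    """Direct evaluation of the objective, used to check the solution."""
    a = s / np.sqrt(W.sum(axis=1))
    smooth = 0.5 * np.sum(W * (a[:, None] - a[None, :]) ** 2)
    return smooth + 0.5 * mu * np.sum((s - y) ** 2)
```

Up to a constant scale, the solution equals $(I - \alpha S)^{-1} y$ with $\alpha = 2/(2 + \mu)$, which is the familiar form of the manifold ranking solution. Only the relative order of the scores matters for ranking, so the scale factor is usually dropped.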

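Then, we apply manifold ranking on multiple modalities, and have

$$\min_{\{s^{k}\}} \frac{1}{2} \sum_{k=1}^{K} \left( \sum_{i,j=1}^{n} W_{ij}^{k} \left\| \frac{s_i^{k}}{\sqrt{D_{ii}^{k}}} - \frac{s_j^{k}}{\sqrt{D_{jj}^{k}}} \right\|^2 + \mu \left\| s^{k} - y \right\|^2 \right), \qquad (6)$$

where $K$ is the number of modalities and the superscript $k$ indexes the modality.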
Fig. 4: Illustration of the effectiveness of introducing the modality weights and the cross-modality consistency. (a) Input RGB and thermal images. (b) Results of our method without modality weights and cross-modality consistency are shown in the first and second rows, respectively. (c) Our results and the corresponding ground truth.

From Eq. (6), we can see that it inherently indicates that available modalities are independent and contribute equally. This may significantly limit the performance in dealing with occasional perturbation or malfunction of individual sources. Therefore, we propose a novel collaborative model for robustly performing salient object detection that i) adaptively integrates different modalities based on their respective modal reliabilities, ii) collaboratively computes the ranking functions of multiple modalities by incorporating the cross-modality consistent constraints. The formulation of the multi-task manifold ranking algorithm is proposed as follows:

(7)

where is an adaptive parameter vector, which is initialized after the first iteration (see Alg. 1), and is the modality weight vector. denotes the element-wise product, and is a balance parameter. The third term is to avoid overfitting of , and the last term is the cross-modality consistent constraints. The effectiveness of introducing these two terms is presented in Fig. 4. With some simple algebra, Eq. (7) can be rewritten as:

(8)

where , and denotes the cross-modality consistent constraint matrix, which is defined as:

(9)

where and are the identity matrices with the size of .

Algorithm 1 Optimization Procedure to Eq. (8)
Input: The matrix , the indication vector , and the parameters and ; Set ; , .
Output: , .
for  do
   Update by Eq. (13);
   if  then
      for  do
         ;
      end for
   end if
   for  do
      Update by Eq. (15);
   end for
   if  then
      Terminate the loop.
   end if
end for

III-C Optimization Algorithm

We present an alternating algorithm to optimize Eq. (8) efficiently, and denote

(10)

Given , Eq. (10) can be written as:

(11)

and we reformulate it as follows:

(12)

where , and is a block-diagonal matrix defined as: , where . . Taking the derivative of with respect to , we have

(13)

where is an identity matrix with the size of .

Given , Eq. (10) can be written as:

(14)

and we take the derivative of with respect to , and obtain

(15)

A sub-optimal solution can be achieved by alternating between the updates of and , and the whole algorithm is summarized in Alg. 1. Although the global convergence of the proposed algorithm is not proved, we empirically validate its fast convergence in our experiments. The optimized ranking functions and modality weights will be utilized for RGB-T saliency detection in the next section.
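
The alternating scheme can be illustrated on a stand-in objective. The sketch below is not the paper's Eq. (8): it assumes a per-modality smoothness term scaled by a squared weight, a fitting term, a penalty of the form $\|\Gamma \circ (1 - r)\|^2$ on the weights, and a consistency term between consecutive modalities. It also takes $\Gamma$ (`gamma`) as a given input instead of initializing it after the first iteration. All names and default values are hypothetical.

```python
import numpy as np

def normalized_laplacian(W):
    """I - D^{-1/2} W D^{-1/2} for a symmetric affinity matrix W."""
    inv = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - W * inv[:, None] * inv[None, :]

def objective(Ls, S, r, y, gamma, mu, lam):
    """Assumed objective: weighted smoothness + fitting + weight penalty + consistency."""
    fit = sum(r[k] ** 2 * S[k] @ Ls[k] @ S[k] + mu * np.sum((S[k] - y) ** 2)
              for k in range(len(Ls)))
    reg = np.sum((gamma * (1.0 - r)) ** 2)
    cons = lam * sum(np.sum((S[k] - S[k - 1]) ** 2) for k in range(1, len(Ls)))
    return fit + reg + cons

def alternate(Ws, y, gamma, mu=0.5, lam=0.1, iters=10):
    """Alternate two closed-form updates: ranking vectors S, then weights r."""
    gamma = np.asarray(gamma, dtype=float)
    K, n = len(Ws), len(y)
    Ls = [normalized_laplacian(np.asarray(W, dtype=float)) for W in Ws]
    # Difference operator on the stacked vector: (C S)_k = s_{k+1} - s_k.
    C = np.kron(np.diff(np.eye(K), axis=0), np.eye(n))
    r = np.ones(K)
    history = []
    for _ in range(iters):
        # S-step: with r fixed, the problem is quadratic, so one linear solve suffices.
        A = np.zeros((K * n, K * n))
        for k in range(K):
            A[k * n:(k + 1) * n, k * n:(k + 1) * n] = r[k] ** 2 * Ls[k]
        A += mu * np.eye(K * n) + lam * C.T @ C
        S = np.linalg.solve(A, mu * np.tile(y, K)).reshape(K, n)
        # r-step: with S fixed, each weight has a scalar closed form.
        cost = np.array([S[k] @ Ls[k] @ S[k] for k in range(K)])
        r = gamma ** 2 / (cost + gamma ** 2)
        history.append(objective(Ls, S, r, y, gamma, mu, lam))
    return S, r, history
```

Because each step exactly minimises the objective over one block of variables, the recorded objective values never increase. A modality whose ranking is less smooth on its own graph receives a smaller weight under this assumed penalty.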

Algorithm 2 Proposed Approach
Input: One RGB-T image pair, the parameters , , and .
Output: Saliency map .
// Graph construction
Construct the graph with superpixels as nodes, and calculate the affinity matrix and the degree matrix with the parameter ;
// First stage ranking
Utilize the boundary priors to generate the background queries;
Run Alg. 1 with the parameters and to obtain the initial saliency map (Eq. (16));
// Second stage ranking
Compute the foreground queries using the adaptive thresholds;
Run Alg. 1 with the parameters and to compute the final saliency map (Eq. (17)).

IV Two-Stage RGB-T Saliency Detection

In this section, we present the two-stage ranking scheme for unsupervised bottom-up RGB-T saliency detection using the proposed algorithm with boundary priors and foreground queries.

IV-A Saliency Measure

Given an input RGB-T image pair represented as a graph and some salient query nodes, the saliency of each node is defined as its ranking score computed by Eq. (8). In conventional ranking problems, the queries are manually labelled with the ground truth. In this work, we first employ the boundary prior widely used in other works [19, 11] to highlight the salient superpixels, and select highly confident superpixels (low ranking scores in all modalities) belonging to the foreground objects as the foreground queries. Then, we run the proposed algorithm to obtain the final ranking results, and combine them with their modality weights to compute the final saliency map.

IV-B Ranking with Boundary Priors

Based on the attention theories for visual saliency [32], we regard the boundary nodes as background seeds (the labelled data) to rank the relevances of all other superpixel nodes in the first stage.

Taking the bottom image boundary as an example, we utilize the nodes on this side as labelled queries and the rest as the unlabelled data, and initialize the indicator in Eq. (8). Given , the ranking values are computed by employing the proposed ranking algorithm, and we normalize as with the range between 0 and 1. Similarly, given the top, left and right image boundaries, we can obtain the respective ranking values , , . We integrate them to compute the initial saliency map for each modality in the first stage:

(16)
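
The boundary queries of the first stage can be sketched as follows. The sketch assumes a 2-D superpixel label map with ids 0..n-1 and uses min-max scaling for the normalization to [0, 1], which is an assumption because the text only states the target range. Both function names are hypothetical.

```python
import numpy as np

def boundary_queries(labels):
    """Indicator vectors marking the superpixels that touch each image side."""
    labels = np.asarray(labels)
    n = int(labels.max()) + 1
    sides = {"top": labels[0], "bottom": labels[-1],
             "left": labels[:, 0], "right": labels[:, -1]}
    queries = {}
    for name, ids in sides.items():
        y = np.zeros(n)
        y[np.unique(ids)] = 1.0   # 1 for query nodes on this side, 0 elsewhere
        queries[name] = y
    return queries

def normalize01(s):
    """Min-max rescale ranking values to the range [0, 1]."""
    s = np.asarray(s, dtype=float)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)
```

Each of the four indicator vectors is fed to the ranking algorithm in turn, which accounts for the four ranking runs of the first stage.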
Fig. 5: PR curves of the proposed approach with other baseline methods with RGB-T input on the entire dataset and several subsets, where the values are shown in the legend.

IV-C Ranking with Foreground Queries

Given of the -th modality, we set an adaptive threshold to generate the foreground queries, where indicates the maximum operation, and is a constant, which is fixed to be 0.25 in this work. Specifically, we select the -th superpixel as the foreground query of the -th modality if . Therefore, we compute the ranking values and the modality weights in the second stage by employing our ranking algorithm. Similar to the first stage, we normalize the ranking value of the -th modality as with the range between 0 and 1. Finally, the final saliency map can be obtained by combining the ranking values with the modality weights:

(17)
| Algorithm | RGB P | RGB R | RGB F | Thermal P | Thermal R | Thermal F | RGB-T P | RGB-T R | RGB-T F | Code Type | Runtime (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BR [28] | 0.724 | 0.260 | 0.411 | 0.648 | 0.413 | 0.488 | 0.804 | 0.366 | 0.520 | M&C++ | 8.23 |
| SR [29] | 0.425 | 0.523 | 0.377 | 0.361 | 0.587 | 0.362 | 0.484 | 0.584 | 0.432 | M | 1.60 |
| SRM [16] | 0.411 | 0.529 | 0.384 | 0.392 | 0.520 | 0.380 | 0.428 | 0.575 | 0.411 | M | 0.76 |
| CA [23] | 0.592 | 0.667 | 0.568 | 0.623 | 0.607 | 0.573 | 0.648 | 0.697 | 0.618 | M | 1.14 |
| MCI [26] | 0.526 | 0.604 | 0.485 | 0.445 | 0.585 | 0.435 | 0.547 | 0.652 | 0.515 | M&C++ | 21.89 |
| NFI [25] | 0.557 | 0.639 | 0.532 | 0.581 | 0.599 | 0.541 | 0.564 | 0.665 | 0.544 | M | 12.43 |
| SS-KDE [27] | 0.581 | 0.554 | 0.532 | 0.510 | 0.635 | 0.497 | 0.528 | 0.656 | 0.515 | M&C++ | 0.94 |
| GMR [19] | 0.644 | 0.603 | 0.587 | 0.700 | 0.574 | 0.603 | 0.694 | 0.624 | 0.615 | M | 1.11 |
| GR [24] | 0.621 | 0.582 | 0.534 | 0.639 | 0.544 | 0.545 | 0.705 | 0.593 | 0.600 | M&C++ | 2.43 |
| STM [8] | 0.658 | 0.569 | 0.581 | 0.647 | 0.603 | 0.579 | - | - | - | C++ | 1.54 |
| MST [15] | 0.627 | 0.739 | 0.610 | 0.665 | 0.655 | 0.598 | - | - | - | C++ | 0.53 |
| RRWR [21] | 0.642 | 0.610 | 0.589 | 0.689 | 0.580 | 0.596 | - | - | - | C++ | 2.99 |
| Ours | - | - | - | - | - | - | 0.716 | 0.713 | 0.680 | M&C++ | 1.39 |

TABLE III: Average precision (P), recall (R), and F-measure (F) of our method against different kinds of baseline methods on the newly created dataset. The code type and runtime (second) are also presented. The bold fonts of results indicate the best performance, and “M” is the abbreviation of MATLAB.
| MAE | CA | NFI | SS-KDE | GMR | GR | BR | SR | SRM | MCI | STM | MST | RRWR | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RGB | 0.163 | 0.126 | 0.122 | 0.172 | 0.197 | 0.269 | 0.300 | 0.199 | 0.211 | 0.194 | 0.127 | 0.171 | 0.109 |
| T | 0.225 | 0.124 | 0.132 | 0.232 | 0.199 | 0.323 | 0.218 | 0.155 | 0.176 | 0.208 | 0.129 | 0.234 | 0.141 |
| RGB-T | 0.195 | 0.125 | 0.127 | 0.202 | 0.199 | 0.297 | 0.260 | 0.178 | 0.195 | - | - | - | 0.107 |

TABLE IV: Average MAE score of our method against different kinds of baseline methods on the newly created dataset. The bold fonts of results indicate the best performance.
Fig. 6: Sample results of the proposed approach and other baseline methods with different modality inputs. (a) Input RGB and thermal image pair and their ground truth. (b-i) The results of the baseline methods with RGB, thermal and RGB-T inputs. (j-m) The results of the baseline methods with RGB and thermal inputs. (n) The results of our approach.

V Experiments

In this section, we apply the proposed approach over our RGB-T benchmark and compare with other baseline methods. The source codes and result figures will be provided with the benchmark for public usage in the community.

V-A Experimental Settings

For fair comparisons, we fix all parameters and other settings of our approach in the experiments, and use the default parameters released in their public codes for other baseline methods.

In graph construction, we empirically generate superpixels and set the affinity parameters and , which control the edge strength between two superpixel nodes. In Alg. 1 and Alg. 2, we empirically set , (the first stage) and (the second stage). Herein, we use bigger balance weight in the second stage than in the first stage as the refined foreground queries are more confident than the background queries according to the boundary prior.

V-B Comparison Results

To justify the importance of thermal information, the complementary benefits to image saliency detection and the effectiveness of the proposed approach, we evaluate the compared methods with different modality inputs on the newly created benchmark, which has been introduced in Sect. II.

Overall performance. We first report the precision ($P$), recall ($R$) and F-measure ($F$) of 3 kinds of methods on the entire dataset in Tab. III. Herein, as the public source codes of STM, MST and RRWR are encrypted, we only run these methods on the single modality. From the evaluation results, we can observe that the proposed approach outperforms all baseline methods in terms of F-measure (0.680, versus 0.618 for CA, the best RGB-T baseline). This comparison demonstrates the effectiveness of our approach for adaptively incorporating thermal information. In addition, most RGB-T baselines perform better than their RGB and thermal counterparts, which indicates that the thermal data are effective for boosting image saliency detection and complementary to RGB data. We also report the MAE of 3 kinds of methods on the entire dataset in Tab. IV, the results of which are almost consistent with Tab. III. The evaluation results further validate the effectiveness of the proposed approach, the importance of thermal information and the complementary benefits of RGB-T data.

Fig. 6 shows some sample results of our approach against other baseline methods with different inputs. The results show that the proposed approach can detect the salient objects more accurately than other methods by adaptively and collaboratively leveraging RGB-T data. It is worth noting that some results using a single modality are better than those using RGB-T data. This is because the redundant information introduced by a noisy or malfunctioning modality sometimes degrades the fusion results.

Challenge-sensitive performance. To analyze performance under different challenging factors, we evaluate RGB-T methods on subsets with different attributes (big salient object (BSO), small salient object (SSO), multiple salient objects (MSO), low illumination (LI), bad weather (BW), center bias (CB), cross image boundary (CIB), similar appearance (SA), thermal crossover (TC), image clutter (IC), and out of focus (OF), see Sect. II for details) and present the PR curves in Fig. 5. The results show that our approach outperforms other RGB-T methods by a clear margin on most challenges, except for BSO and BW, which validates the effectiveness of our method. In particular, under occasional perturbation or malfunction of one modality (e.g., LI, SA and TC), our method can effectively incorporate information from the other modality to detect salient objects robustly, justifying the complementary benefits of multiple source data.

For BSO, CA [23] achieves superior performance over ours, which may be attributed to the fact that CA takes the global color distinction and the global spatial distance into account to better capture the global and local information. We can alleviate this problem by improving the graph construction to explore more relations among superpixels. For BW, most methods perform poorly, but MCI obtains a big performance gain over ours, which is the second best. This suggests that considering multiple cues, like low-level considerations, global considerations, visual organizational rules and high-level factors, can handle extreme challenges in RGB-T saliency detection, and we will integrate them into our framework to improve the robustness.

V-C Analysis of Our Approach

We discuss the details of our approach by analyzing the main components, efficiency and limitations.

Fig. 7: PR curves, the representative precision, recall and F-measure and MAE of the proposed approach with its variants on the entire dataset.

Components. To justify the significance of the main components of the proposed approach, we implement two special versions for comparative analysis, including: 1) Ours-I, that removes the modality weights in the proposed ranking model, i.e., fixes in Eq. (8), and 2) Ours-II, that removes the cross-modality consistent constraints in the proposed ranking model, i.e., sets in Eq. (8).

The PR curves, representative precision, recall and F-measure, and MAE are presented in Fig. 7, from which we can draw the following conclusions. 1) Our method substantially outperforms Ours-I, demonstrating the significance of the introduced weight variables for adaptive fusion of different source data. 2) The complete algorithm outperforms Ours-II, validating the effectiveness of the cross-modality consistent constraints.
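For reference, the quantities reported in Fig. 7 can be computed along the following lines. This is a minimal sketch rather than the benchmark's evaluation code: the function names, the flat-list representation of a saliency map, the single fixed threshold, and the weight beta2 = 0.3 (a common choice in the saliency literature) are all assumptions made here for illustration.

```python
from typing import List, Tuple


def mae(saliency: List[float], truth: List[int]) -> float:
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    return sum(abs(s - t) for s, t in zip(saliency, truth)) / len(truth)


def precision_recall_f(
    saliency: List[float],
    truth: List[int],
    threshold: float,
    beta2: float = 0.3,  # assumed weighting; not taken from this section
) -> Tuple[float, float, float]:
    """Binarize the map at `threshold`, then score it against the mask."""
    pred = [1 if s >= threshold else 0 for s in saliency]
    true_pos = sum(p * t for p, t in zip(pred, truth))
    n_pred, n_true = sum(pred), sum(truth)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_true if n_true else 0.0
    denom = beta2 * precision + recall
    f_measure = (1 + beta2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy 4-pixel example (hypothetical values).
toy_map = [0.9, 0.8, 0.2, 0.1]
toy_mask = [1, 0, 1, 0]
print(precision_recall_f(toy_map, toy_mask, threshold=0.5))
print(mae(toy_map, toy_mask))
```

Sweeping the threshold over the range of saliency values and collecting the (recall, precision) pairs yields a PR curve of the kind plotted in the figures.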

Fig. 8: Average convergence curve of the proposed approach on the entire dataset.

Efficiency. The runtime of our approach against other methods is presented in Tab. III. It is worth mentioning that the runtime of our approach is comparable with that of GMR [19], mainly due to the fast optimization of the proposed ranking model.

The experiments are carried out on a desktop with an Intel i7 3.4GHz CPU and 16GB RAM, and implemented on a mixed platform of C++ and MATLAB without any optimization. Fig. 8 shows the convergence curve of the proposed approach. Although it involves 5 runs of the ranking-model optimization, our method costs about 1.39 seconds per image pair due to the efficiency of our optimization algorithm, which converges within approximately 5 iterations.

We also report the runtime of the main procedures of the proposed approach with the typical resolution of pixels. 1) The over-segmentation by the SLIC algorithm takes about 0.52 seconds. 2) The feature extraction takes approximately 0.24 seconds. 3) The first stage, including 4 ranking processes, costs about 0.42 seconds. 4) The second stage, including 1 ranking process, takes approximately 0.14 seconds. The over-segmentation and the feature extraction are the most time-consuming procedures (about 55% of the total). Hence, by introducing more efficient over-segmentation algorithms and feature extraction implementations, we can achieve much better computation time with our approach.
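The share of about 55% follows directly from the reported timings, as the short check below shows. The dictionary keys are labels chosen here for illustration; the numbers are those reported in the text.

```python
# Per-stage timings in seconds, as reported in the text.
stage_seconds = {
    "slic_over_segmentation": 0.52,
    "feature_extraction": 0.24,
    "first_stage_ranking": 0.42,   # 4 ranking processes
    "second_stage_ranking": 0.14,  # 1 ranking process
}
total_reported = 1.39  # seconds per image pair, as reported

preprocessing = (
    stage_seconds["slic_over_segmentation"] + stage_seconds["feature_extraction"]
)
preprocessing_share = preprocessing / total_reported
itemized_total = sum(stage_seconds.values())

print(f"pre-processing share of total: {preprocessing_share:.1%}")
print(f"itemized stages sum to {itemized_total:.2f} s of {total_reported:.2f} s")
```

The four itemized stages sum to 1.32 seconds, so roughly 0.07 seconds of the reported 1.39 seconds is not broken down in the text.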

Fig. 9: Two failure cases of our method. The input RGB, thermal images, the ground truth and the results generated by our method are shown in (a) and (b), respectively.

Limitations. We also present two failure cases generated by our method in Fig. 9. The modality weights are sometimes wrongly estimated due to the effect of cluttered background, as shown in Fig. 9 (a), where the modality weights of RGB and thermal data are 0.97 and 0.95, respectively. In such circumstances, our method generates poor detection results. This problem could be tackled by incorporating a measure of background clutter when optimizing the modality weights, and will be addressed in our future work. In addition to the weight computation, our approach has another major limitation. The first-stage ranking relies on the boundary prior, and thus salient objects fail to be detected when they cross image boundaries. The insufficient foreground queries obtained after the first stage usually result in poor saliency detection performance in the second stage, as shown in Fig. 9 (b). We will handle this issue by selecting more accurate boundary queries according to other prior cues in future work.
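The boundary-prior limitation can be reproduced on a toy example. The sketch below uses the standard single-modality manifold ranking iteration of [31], f ← αSf + (1 − α)y with S = D^{-1/2} W D^{-1/2}, not the multi-task model proposed in this paper. The 1-D chain graph, the feature values, the parameters alpha and sigma, the function names, and the use of one combined boundary query instead of four separate rankings are all assumptions made for illustration.

```python
import math
from typing import List


def manifold_rank(features: List[float], queries: List[float],
                  alpha: float = 0.99, sigma: float = 0.1,
                  iters: int = 2000) -> List[float]:
    """Iterative manifold ranking on a chain graph of neighbouring nodes."""
    n = len(features)
    # Affinity between adjacent nodes decays with their feature difference.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w = math.exp(-abs(features[i] - features[i + 1]) / sigma)
        W[i][i + 1] = W[i + 1][i] = w
    d = [sum(row) for row in W]
    # Symmetrically normalized affinity S = D^{-1/2} W D^{-1/2}.
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    f = list(queries)
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n))
             + (1 - alpha) * queries[i] for i in range(n)]
    return f


def boundary_saliency(features: List[float]) -> List[float]:
    """Use both end nodes as background queries; saliency = 1 - normalized rank."""
    queries = [0.0] * len(features)
    queries[0] = queries[-1] = 1.0
    f = manifold_rank(features, queries)
    lo, hi = min(f), max(f)
    return [1.0 - (v - lo) / (hi - lo) for v in f]


# Object (feature 0.9) in the interior: it is separated from the boundary queries.
interior = boundary_saliency([0.1, 0.1, 0.1, 0.9, 0.9, 0.1, 0.1])
# Object touching the left boundary: it is itself used as a background query.
crossing = boundary_saliency([0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1])
print([round(v, 2) for v in interior])
print([round(v, 2) for v in crossing])
```

In the first call the two object nodes receive the highest saliency. In the second, the object touches the boundary, is treated as background by the queries, and its saliency collapses toward zero, which mirrors the failure mode described for Fig. 9 (b).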

V-D Discussions on RGB-T Saliency Detection

We observe from the evaluations that integrating RGB data and thermal data boosts saliency detection performance (see Tab. III). The improvements are even larger under certain challenges, i.e., low illumination, similar appearance, image clutter and thermal crossover (see Fig. 5), demonstrating the importance of thermal information in image saliency detection and the complementary benefits of RGB-T data.

In addition, directly integrating RGB and thermal information sometimes leads to worse results than using a single modality (see Fig. 6), as redundant information is introduced by the noisy or malfunctioning modality. We can address this issue by adaptively fusing information from different modalities (our method), which automatically determines the contribution weights of the modalities to alleviate the effects of redundant information.

From the evaluation results, we also observe the following potential research directions for RGB-T saliency detection. 1) The ensemble of multiple models or algorithms (CA [23]) can achieve robust performance. 2) Some principles are crucial for effective saliency detection, e.g., global considerations (CA [23]), boundary priors (Ours, GMR [19], RRWR [21], MST [15]) and multiscale context (STM [8]). 3) The exploitation of more relations among pixels or superpixels is important for highlighting the salient objects (STM [8], MST [15]).

VI Conclusion

In this paper, we have presented a comprehensive image benchmark for RGB-T saliency detection, which includes a dataset, three kinds of baselines and four evaluation metrics. With the benchmark, we have proposed a graph-based multi-task manifold ranking algorithm to achieve adaptive and collaborative fusion of RGB and thermal data in RGB-T saliency detection. Through analyzing the quantitative and qualitative results, we have demonstrated the effectiveness of the proposed approach, and also provided some basic insights and potential research directions for RGB-T saliency detection. Our future work will focus on the following aspects: 1) We will expand the benchmark, including a larger dataset with more challenging factors and more popular baseline methods. 2) We will improve the robustness of our approach by studying other prior models [33] and graph construction.

References

  • [1] B. Zhao, Z. Li, M. Liu, W. Cao, and H. Liu, “Infrared and visible imagery fusion based on region saliency detection for 24-hour-surveillance systems,” in Proceedings of the IEEE International Conference on Robotics and Biomimetics, 2013.
  • [2] C. Li, X. Wang, L. Zhang, and J. Tang, “Weld: Weighted low-rank decomposition for robust grayscale-thermal foreground detection,” IEEE Transactions on Circuits and Systems for Video Technology, 2016.
  • [3] C. Li, H. Cheng, S. Hu, X. Liu, J. Tang, and L. Lin, “Learning collaborative sparse representation for grayscale-thermal tracking,” IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5743–5756, 2016.
  • [4] A. Torralba and A. Efros, “Unbiased look at dataset bias,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011.
  • [5] J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in Proceedings of the Advances in Neural Information Processing Systems, 2006.
  • [6] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 353–367, 2011.
  • [7] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [8] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [9] J.-G. Yu, J. Zhao, J. Tian, and Y. Tan, “Maximal entropy random walk for region-based visual saliency,” IEEE Transactions on Cybernetics, vol. 44, no. 9, pp. 1661–1672, 2014.
  • [10] G. Zhu, Q. Wang, and Y. Yuan, “Natas: Neural activity trace aware saliency,” IEEE Transactions on Cybernetics, vol. 44, no. 7, pp. 1014–1024, 2014.
  • [11] K. Wang, L. Lin, J. Lu, C. Li, and K. Shi, “Pisa: Pixelwise image saliency by aggregating complementary appearance contrast measures with edge-preserving coherence,” IEEE Transactions on Image Processing, vol. 24, no. 10, pp. 3019–3033, 2015.
  • [12] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 569–582, 2015.
  • [13] M. Liang and X. Hu, “Feature selection in supervised saliency prediction,” IEEE Transactions on Cybernetics, vol. 45, no. 5, pp. 900–912, 2015.
  • [14] M. Jian, K.-M. Lam, J. Dong, and L. Shen, “Visual-patch-attention-aware saliency detection,” IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1575–1586, 2015.
  • [15] W.-C. Tu, S. He, Q. Yang, and S.-Y. Chien, “Real-time salient object detection with a minimum spanning tree,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [16] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
  • [17] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [18] V. Gopalakrishnan, Y. Hu, and D. Rajan, “Random walks on graphs for salient object detection in images,” IEEE Transactions on Image Processing, vol. 19, no. 12, pp. 3232–3242, 2010.
  • [19] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [20] L. Zhang, C. Yang, H. Lu, X. Ruan, and M. Yang, “Ranking saliency,” IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2016.2609426, 2017.
  • [21] C. Li, Y. Yuan, W. Cai, Y. Xia, and D. Dagan Feng, “Robust saliency detection via regularized random walks ranking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [22] Q. Wang, W. Zheng, and R. Piramuthu, “Grab: Visual saliency via novel graph model and background priors,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [23] Y. Qin, H. Lu, Y. Xu, and H. Wang, “Saliency detection via cellular automata,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [24] C. Yang, L. Zhang, and H. Lu, “Graph-regularized saliency detection with convex-hull-based center prior,” IEEE Signal Processing Letters, vol. 20, no. 7, pp. 637–640, 2013.
  • [25] E. Erdem and A. Erdem, “Visual saliency estimation by nonlinearly integrating features using region covariances,” Journal of vision, vol. 13, no. 4, 2013.
  • [26] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 10, pp. 1915–1926, 2012.
  • [27] H. R. Tavakoli, E. Rahtu, and J. Heikkilä, “Fast and efficient saliency detection using sparse sampling and kernel density estimation,” in Proceedings of the Scandinavian Conference on Image Analysis, 2011.
  • [28] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, “Segmenting salient objects from images and videos,” in Proceedings of the European Conference on Computer Vision, 2010.
  • [29] H. J. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” Journal of vision, vol. 9, no. 12, pp. 15–15, 2009.
  • [30] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.
  • [31] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Scholkopf, “Ranking on data manifolds,” in Proceedings of Neural Information Processing Systems, 2004.
  • [32] L. Itti, C. Koch, E. Niebur et al., “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
  • [33] L. Lin, W. Yang, C. Li, J. Tang, and X. Cao, “Inference with collaborative model for interactive tumor segmentation in medical image sequences,” IEEE Transactions on Cybernetics, vol. 46, no. 12, pp. 2796–2809, 2016.