# Ro-SOS: Metric Expression Network for Robust Salient Object Segmentation

###### Abstract

Although deep CNNs have brought significant improvement to image saliency detection, most CNN-based models are sensitive to distortions such as compression and noise. In this paper, we propose an end-to-end generic salient object segmentation model called Metric Expression Network (MEnet) that tolerates such distortions. Within MEnet, a new topological metric space is constructed, whose implicit metric is determined by the deep network. As a result, we manage to semantically group all pixels of the observed image within this latent space into two regions: a salient region and a non-salient region. With this architecture, all feature extraction is carried out at the pixel level, enabling fine-grained output boundaries for the salient objects. Moreover, we provide a general analysis of the noise robustness of the network in terms of its Lipschitz constant and Jacobian. Experiments demonstrate that robust saliency maps facilitating object segmentation can be generated by the proposed metric. Tests on several public benchmarks show that MEnet achieves desirable performance. Furthermore, by direct computation and measurement of robustness, the proposed method outperforms previous CNN-based methods on distorted inputs.

## I Introduction

Image saliency detection and segmentation is of significant interest in computer vision and pattern recognition. It aims to simulate the perceptual mechanism of the human visual system by distinguishing between salient regions and background. It is a fundamental problem in computer vision with a wide range of applications, including image segmentation [1], object detection and recognition [2], image quality assessment [3], visual tracking [4], and image editing and manipulation [5].

Recent saliency detection studies can be divided into two categories: handcrafted-feature-based and learning-based approaches. In earlier literature, the majority of saliency detection methods used handcrafted features. Traditional low-level features for such saliency detection models mainly consist of color, intensity, texture, edge and structure [6, 7, 8]. Though handcrafted features with heuristic priors perform well in simple scenes, they are not robust in more challenging cases, such as when salient regions share similar colors with the background.

Deep convolutional neural networks (CNNs) [9] have achieved tremendous success in many computer vision problems, including image classification [10, 11], object detection [12] and semantic segmentation [13], owing to their ability to learn powerful feature representations from data automatically. Likewise, the accuracy of saliency detection has been significantly improved by various deep CNN based approaches [14, 15, 16, 17]. For instance, Wang et al. [18] developed a multistage refinement mechanism to effectively combine high-level semantics with low-level image features to produce high-resolution saliency maps. [19, 20, 21] exploited multi-scale convolutional features for object segmentation. Nevertheless, even with this success there is still plenty of room for improvement: very little work pays attention to robustness in distorted scenes, while the performance of neural networks is susceptible to typical distortions such as noise [22].

Recently, metric learning has received much attention in computer vision, e.g., in image segmentation [23], face recognition [24] and human identification [25], for measuring similarity between objects. Inspired by the metric learning framework, we propose a saliency model that works in a learned metric space: a deep metric learning architecture for saliency segmentation with potentially distorted images. We use semantic features extracted from a deep CNN to learn a homogeneous metric space. The features are computed at the pixel level and allow distinguishing between salient regions and background through a distance measure. Simultaneously, we introduce a novel loss function that combines metric learning with cross entropy. We also use multi-level information for feature extraction, similar to approaches such as Hypercolumns [26] and U-Net [27]. We experiment on several benchmark datasets and achieve state-of-the-art level results. Moreover, the proposed model is robust to distorted images.

The rest of this paper is organized as follows. Section II reviews related work on saliency segmentation and metric learning. Section III describes the details of our model. Section IV discusses the performance of MEnet compared with other state-of-the-art saliency segmentation approaches and its robustness to distorted scenes, and Section V concludes.

## II Related Work

### II-A Instance-Aware Semantic Segmentation

Since the introduction of the Fully Convolutional Network (FCN) [28], CNNs trained end-to-end have been able to address semantic segmentation problems and reach state-of-the-art performance, which has facilitated subsequent research in the field. U-Net [27] advanced FCN by balancing the number of downsampling and upsampling layers, making the structure symmetric. Evidence shows that semantic segmentation performance benefits from multi-scale information [29, 30, 31, 32, 33, 34]. DeepLab [31] proposed atrous spatial pyramid pooling (ASPP) to leverage the full power of multiple scales.

### II-B Metric Learning

In recent years, it has been shown that learning a distance metric from training data can improve performance. Metric learning is therefore popular in face recognition [35, 24], image classification [36, 37], human re-identification [38], information retrieval [39] and visual tracking [40]. With the rise of deep learning, several metric learning models based on deep convolutional neural networks have been proposed. For example, in visual tracking [40], metric learning was used to measure the difference between adjacent frames of a video. In image retrieval [39], deep metric learning is used to learn a nonlinear feature space in which the relationships between images are easy to measure.

In the past two years, metric learning has also been applied to saliency detection. Lu et al. [41] proposed an adaptive metric learning model based on global and local information, which uses Fisher vectors to represent super-pixel blocks and then measures the distance between saliency and background. [42] employed metric learning to learn a point-to-set metric that explicitly computes the distance from single points to sets of correlated points, which makes it possible to distinguish salient regions from background.

### II-C CNN-Based Salient Segmentation Approaches

Over the past decades, salient segmentation models have been developed and widely used in computer vision tasks. In particular, CNN-based methods have obtained much better performance than traditional methods built on handcrafted features. The majority of CNN-based salient segmentation models exploit local features, global features, or both, and most report that using both performs best. For instance, in [16], Li et al. proposed a deep contrast network that consists of a pixel-level fully convolutional stream and a segment-wise spatial pooling stream and predicts saliency maps at the pixel level, with a fully connected CRF used as refinement. Zhao et al. [43] combined global context and local context to train a multi-context deep learning framework for saliency detection. Wang et al. [44] used one deep neural network (DNN-L) to learn local patch features and another (DNN-G) to score salient regions based on the initial local saliency map and global features. Lee et al. [45] proposed a unified deep learning framework that utilizes both high-level and low-level features for saliency detection. In [19], Zhang et al. presented a simplified convolutional neural network that combines local and global information and implemented a loss function that penalizes errors on the boundary.

Several other salient segmentation works are based on multi-level features. For example, in [20], Liu et al. proposed a deep hierarchical network for salient segmentation that first uses GV-CNN to detect salient objects from a global perspective and then applies HRCNN as a refinement to recover the details of the saliency map step by step. Zhang et al. [21] presented a multi-level feature aggregation network that first integrates multi-level feature maps and then adaptively learns to combine these feature maps to predict saliency maps.

In this paper, we propose a novel symmetric encoder-decoder CNN architecture for salient segmentation. Our approach differs from the methods mentioned above. Our model provides receptive fields large enough to capture rich feature information in the convolutional operations. Unlike the methods of [21, 20, 19], we do not need any pre-trained model, and we use a different up-sampling method. Simultaneously, we construct an effective loss function for predicting saliency maps. The following section gives the details of the proposed model.

## III Metric Expression Network (MEnet)

We illustrate our model architecture in Figure 2. An encoder-decoder CNN first generates feature maps at different scales (blocks); through convolution and up-sampling, these yield a feature vector for each pixel of the image according to how it maps through the layers. The extracted features are then fed, via convolutions, into a metric loss and a cross-entropy loss for saliency detection, as described below.

### III-A Encoder-Decoder CNN for Feature Extraction

In SegNet [46] and U-Net, an encoder-decoder is used to extract multi-scale features, and we use a similar structure in the proposed model. Since global information plays an important role in saliency segmentation [44], we use convolutions and pooling layers to enlarge the receptive field of the model and compress all the feature information into small feature maps, shown as the white box in Figure 2. Through the decoder module, we up-sample these feature maps, and the feature map at each scale represents information at one semantic level. We therefore propose a symmetric encoder-decoder CNN architecture.

The encoder-decoder network of Figure 2 uses a deep symmetric CNN architecture with short connections, indicated by blue arrows. It consists of an encoder half (left) and a decoder half (right), each block of which applies one of the two basic blocks shown in Figure 3. For encoding, at each down-sampling step we double the number of feature channels using a convolution with stride 2. For decoding, each step consists of up-sampling the feature map by a deconvolution, also with stride 2, after concatenating the input with the corresponding short connection. In the decoder path, we concatenate the corresponding feature maps from the encoder path; this part is similar to U-Net. The difference is that U-Net is designed for edge detection: it works well even though it crops feature maps from the encoder path (Fig. 1 of the U-Net paper), since cropping does not hurt edge detection. For saliency segmentation, in contrast, we maintain the size of the feature maps to make full use of all the information, since the receptive field needs to be much larger. We believe the full-size feature maps contain rich global information about the original image that later layers can use for better prediction. Our goal in using a symmetric CNN is to generate feature maps at different scales, which are concatenated to give each pixel of the input image a feature vector containing multi-scale information. Furthermore, rather than classifying these features directly, we apply a few more convolutions to balance the dimensional unevenness, as described in the following paragraphs.

We ultimately want to distinguish salient objects from background, and so want to map image pixels into a feature space where the distance across salient and background regions is large but the distance within a region is small. Previous work in this direction has shown that deep CNNs can learn feature representations that capture local and global context information for saliency segmentation [43].

Therefore, as shown in Figure 2, we convert the blocks from the 13 different scales of the encoder-decoder network into a bundle of feature maps, as indicated by the green dashed lines. That is, in the feature extraction part, each scale generates one output feature map of the same size via a single convolution and up-sampling, while the first "feature map" is simply obtained by convolving the original image across its RGB channels. The proposed algorithm is partially similar to the Hypercolumns model, but they differ: during testing, Hypercolumns takes the outputs of all layers, upsamples them using bilinear interpolation and sums them for the final prediction, and during training it predicts heatmaps from feature maps of different scales by stacking additional convolutional layers. In that respect Hypercolumns is closer to DHSNet [20], which utilizes multi-scale saliency labels for segmentation. In contrast, MEnet upsamples every scale of feature map to the same size during training. Because the components of these 13-scale features are uneven relative to each other, they cannot be fed directly to classification, e.g., by placing a loss function on them. The components from different scales should be balanced, and one way to do this is to filter the 13-dimensional features by convolution. Consequently, after concatenating the feature maps from each level, we apply a convolution with 16 kernels to generate the final feature map, balancing the dimensional characteristics under the constraint of minimizing the cross entropy (see the following section). In this case, the final feature vector for each pixel lies in $\mathbb{R}^{16}$.
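As a concrete illustration of the multi-scale fusion described above, the NumPy sketch below upsamples toy single-channel scale maps to a common resolution, stacks them, and linearly mixes the channels as a stand-in for the balancing convolution. The function names, toy sizes and nearest-neighbour upsampling are illustrative assumptions, not the paper's exact operators:

```python
import numpy as np

def upsample_nn(fmap, factor):
    # nearest-neighbour upsampling of a (H, W) map by an integer factor
    return np.repeat(np.repeat(fmap, factor, axis=0), factor, axis=1)

def fuse_scales(maps, full_size, mix):
    """Upsample each single-channel scale map to full_size, stack them,
    then linearly mix the per-scale channels (a stand-in for the learned
    convolution that balances the scales) into a C-dim vector per pixel."""
    stacked = np.stack(
        [upsample_nn(m, full_size // m.shape[0]) for m in maps], axis=-1)
    return stacked @ mix  # (H, W, n_scales) @ (n_scales, C) -> (H, W, C)

rng = np.random.default_rng(0)
scales = [rng.standard_normal((s, s)) for s in (4, 8, 16)]  # toy "scales"
mix = rng.standard_normal((3, 16))                          # mixing weights
features = fuse_scales(scales, 16, mix)                     # (16, 16, 16)
```

In the real network the per-scale maps come from convolution and deconvolution layers and the mixing weights are learned end-to-end; the sketch only shows the shape bookkeeping.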

### III-B Loss Function

Most previous work on saliency detection based on deep learning uses cross entropy (CE) to optimize the network [16, 19]. The loss function is written as follows:

$$\mathcal{L}_{CE}(\theta_1) = -\sum_{i}\sum_{z\in\Omega}\sum_{c\in\{0,1\}} \mathbf{1}\!\left(l_z^{(i)}=c\right)\log P\!\left(l_z^{(i)}=c \mid \theta_1\right) \tag{1}$$

where $\theta_1$ is the set of learnable parameters of the network, $\Omega$ is the pixel domain of the image, the superscript $(i)$ indexes the $i$-th image in the training set, $\mathbf{1}(\cdot)$ is the indicator function, and $c\in\{0,1\}$, where $c=1$ denotes a salient pixel and $c=0$ a non-salient pixel. $P(l_z^{(i)}=c\mid\theta_1)$ is the label probability of pixel $z$ predicted by the network. In MEnet, we generate $P$ via a convolution with 2 kernels from the feature extraction part, as shown in Figure 2.
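The per-pixel cross entropy above can be sketched for a single image as follows (a NumPy toy that assumes the softmax probabilities are already computed; names and shapes are our own):

```python
import numpy as np

def pixel_cross_entropy(probs, labels):
    """Mean per-pixel cross entropy for one image.
    probs: (H, W, 2) softmax output; labels: (H, W) ints, 1 = salient."""
    h, w = labels.shape
    # pick, at each pixel, the predicted probability of the true label
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.log(p_true + 1e-12).mean()

# a 2x2 toy image with uniform (0.5/0.5) predictions gives loss = ln 2
probs = np.full((2, 2, 2), 0.5)
labels = np.array([[1, 0], [0, 1]])
loss = pixel_cross_entropy(probs, labels)
```

During training the framework's built-in softmax cross-entropy layer plays this role; the sketch only makes the summation in Equation (1) explicit.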

Inspired by metric learning, we also introduce a metric loss (ML) function, defined in Equation (2). In our network, the input is an RGB image, and all images are resized to a fixed resolution $W\times H$ before being fed in. The output is a feature metric space generated by the 16-kernel convolution in Figure 2, of size $W\times H\times C$ (in our method we set $C=16$). Each pixel in the image corresponds to a $C$-dimensional vector in the salient feature maps. The metric loss function is defined as follows:

$$\mathcal{L}_{ML}(\theta_2) = \sum_{i}\sum_{z\in\Omega}\left(\frac{1}{N_{+}}\sum_{z+}\left\|f_z^{(i)}-f_{z+}^{(i)}\right\|_2 \;-\; \frac{1}{N_{-}}\sum_{z-}\left\|f_z^{(i)}-f_{z-}^{(i)}\right\|_2\right) \tag{2}$$

where $\theta_2$ is the set of learnable parameters of the network that determine $f$, and $f_z^{(i)}$ denotes the feature vector corresponding to pixel $z$ in the $i$-th image of the training set. We write $f_{z+}^{(i)}$ (or $f_{z-}^{(i)}$), with $z+, z-\in\Omega$, to mean that $f_{z+}^{(i)}$ is a positive (negative) feature vector with respect to $f_z^{(i)}$: $f_z^{(i)}$ and $f_{z+}^{(i)}$ come from the same region (salient or non-salient), while $f_{z-}^{(i)}$ comes from the other region; $N_+$ and $N_-$ are the numbers of such positive and negative pixels. We use the Euclidean distance to calculate the distance between two feature vectors.

This loss function (2) seeks an encoder-decoder network that enlarges the distance between any pair of feature vectors from different regions and reduces the distance between those from the same region. In this way, each of the two regions is expected to become homogeneous. By a straightforward deduction, it is equivalent to

$$\mathcal{L}_{ML}(\theta_2) = \sum_{i}\sum_{z\in\Omega}\left(\left\|f_z^{(i)}-\mu_{z+}^{(i)}\right\|_2 \;-\; \left\|f_z^{(i)}-\mu_{z-}^{(i)}\right\|_2\right) \tag{3}$$

where we average all $f_{z+}^{(i)}$ and $f_{z-}^{(i)}$ in Equation (3) to get $\mu_{z+}^{(i)}$ and $\mu_{z-}^{(i)}$; that is, $\mu_{z+}^{(i)}$ is the mean feature vector of the region pixel $z$ belongs to, while $\mu_{z-}^{(i)}$ is the mean over the other region. Intuitively, Equation (3) enforces that feature vectors extracted from the same region be close to the center of that region while staying away from the center of the other region in the salient feature space. In this way we obtain a more robust distance evaluation between the salient object and background. We also add a second cross-entropy loss as a constraint, sharing the same network architecture, and empirically the combined results are significantly better than using only either the metric or the cross-entropy loss. Therefore, our final loss function is defined as below:

$$\mathcal{L}(\theta) = \mathcal{L}_{CE}(\theta_1) + \lambda\,\mathcal{L}_{ML}(\theta_2) \tag{4}$$

where $\theta$ collects the learnable parameters and the weight $\lambda$ is set to 1 in our experiments.
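A minimal NumPy sketch of the metric term in the centre form of Equation (3) is given below; the real loss is computed inside the training framework, and the toy shapes and names here are our own:

```python
import numpy as np

def metric_loss(features, labels):
    """Centre-form metric loss: pull each pixel's feature toward the
    centre of its own region, push it away from the other centre.
    features: (H, W, C); labels: (H, W) with 1 = salient, 0 = background."""
    f = features.reshape(-1, features.shape[-1])
    y = labels.reshape(-1).astype(bool)
    mu_pos, mu_neg = f[y].mean(axis=0), f[~y].mean(axis=0)
    own = np.where(y[:, None], mu_pos, mu_neg)    # each pixel's own centre
    other = np.where(y[:, None], mu_neg, mu_pos)  # the opposite centre
    return (np.linalg.norm(f - own, axis=1)
            - np.linalg.norm(f - other, axis=1)).mean()

# toy check: two well-separated regions give a negative (good) loss
feats = np.zeros((2, 2, 2))
feats[0, :, :] = [1.0, 0.0]   # salient pixels cluster at (1, 0)
feats[1, :, :] = [0.0, 1.0]   # background pixels cluster at (0, 1)
labs = np.array([[1, 1], [0, 0]])
loss = metric_loss(feats, labs)   # -sqrt(2): tight clusters, far apart
```

Minimizing this quantity drives each pixel toward its own region's centre, which is exactly the behaviour Equation (3) rewards.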

### III-C Semantic Distance Expression

If we train MEnet to minimize the loss function $\mathcal{L}(\theta)$, we obtain a converged network $T_{\theta^*}$, where $\theta^*$ is the converged state of $\theta$. Given an observed input image for testing, with pixel domain $\Omega$, we usually describe a pixel $z$ by its intensities across the channels. But it is difficult to define a semantic distance between two pixels $z_1, z_2$ from intensities alone, e.g., by their Euclidean distance. However, through the transformation $T_{\theta^*}$, we obtain the corresponding feature vectors $f_{z_1}, f_{z_2}$ to represent the input. The distance can then be expressed by $\|f_{z_1}-f_{z_2}\|_2$, and finally the saliency map for saliency segmentation is obtained by:

$$S(z) = \frac{\left\|f_z-\mu_b\right\|_2}{\left\|f_z-\mu_b\right\|_2 + \left\|f_z-\mu_s\right\|_2} \tag{5}$$

where $f_z$ is the feature vector of pixel $z$, and $\mu_b$ and $\mu_s$ are the centers of the background region $\Omega_b$ and the salient region $\Omega_s$, computed only from the $\mathcal{L}_{ML}$ component of the loss function (4) within the whole converged network $T_{\theta^*}$. Note that $\Omega_b$ and $\Omega_s$ are not the accurate segmentation; they are further investigated in the experiment part. To conclude, by the network transformation we succeed in expressing semantic distance through distances between feature vectors. As illustrated in Figure 4, we anticipate that through this space transformation, the intra-class distance becomes smaller than the inter-class distance.
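The read-out above can be sketched as a normalized distance to the two learned centres; the exact form below is our assumed instantiation of Equation (5), with toy centres and shapes:

```python
import numpy as np

def saliency_map(features, mu_s, mu_b):
    """Saliency as the normalized distance to the background centre:
    a feature near mu_s scores ~1, a feature near mu_b scores ~0.
    features: (H, W, C); mu_s, mu_b: (C,) region centres."""
    d_s = np.linalg.norm(features - mu_s, axis=-1)
    d_b = np.linalg.norm(features - mu_b, axis=-1)
    return d_b / (d_s + d_b + 1e-12)

mu_s = np.array([1.0, 0.0])   # toy salient centre
mu_b = np.array([0.0, 1.0])   # toy background centre
feats = np.zeros((1, 2, 2))
feats[0, 0] = mu_s            # a pixel exactly at the salient centre
feats[0, 1] = mu_b            # a pixel exactly at the background centre
smap = saliency_map(feats, mu_s, mu_b)
```

The output is naturally in [0, 1], so it can be used directly as a continuous saliency map or thresholded for a binary segmentation.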

### III-D Noise Robustness Analysis

Treating images as vectors, denote the input of the network as $x$ and the output as $y$. Then by differentiation we obtain

$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial h_L}\,\frac{\partial h_L}{\partial h_{L-1}}\cdots\frac{\partial h_1}{\partial x} \tag{6}$$

where $h_l$ denotes the output of the $l$-th layer, and each factor is a partial derivative computed within back-propagation. That is, we differentiate each layer's output w.r.t. its corresponding input.

First, in the situation of a one-dimensional output ($y\in\mathbb{R}$), the multi-dimensional mean value theorem gives

$$y(x_1) - y(x_2) = \nabla y(\xi)^{\top}(x_1 - x_2) \tag{7}$$

for some vector $\xi$ whose components lie in the regions bounded by the corresponding components of $x_1$ and $x_2$; the operation above is an inner product. Then by Cauchy's inequality,

$$|y(x_1) - y(x_2)| \le \left\|\nabla y(\xi)\right\|_2\, \left\|x_1 - x_2\right\|_2 \tag{8}$$

Similarly, for an $m$-dimensional output, we have

$$\left\|y(x_1) - y(x_2)\right\|_2 \le \left\|J_y(\xi)\right\|\, \left\|x_1 - x_2\right\|_2 \tag{9}$$

Let $x$ and $\tilde{x}$ denote the real data and its distorted version, respectively. Assuming the distortion is small, $\tilde{x}$ is close enough to $x$ that the inequality still holds with the Jacobian evaluated near $x$. Denote by $\Delta x = \tilde{x}-x$ the error of the input and by $\Delta y$ the error of the output; then

$$\left\|\Delta y\right\|_2 \le \left\|J_y(x)\right\|\, \left\|\Delta x\right\|_2 \tag{10}$$

Thus, we can measure the robustness of the network at an input $x$, denoted $R(x)$, by evaluating the norm of the Jacobian matrix at that data point, i.e., $R(x)=\|J_y(x)\|$.
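Inequality (10) can be checked numerically on a toy piecewise-linear "network" (one ReLU followed by a linear layer, our own toy construction rather than MEnet), comparing the output change against the spectral norm of the exact Jacobian:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))

def net(x):
    # toy one-hidden-layer "network": ReLU then a linear map
    return W @ np.maximum(x, 0.0)

def jacobian(x):
    # exact Jacobian: the ReLU mask zeroes the inactive columns of W
    return W * (x > 0)[None, :]

x = rng.standard_normal(6)
x[np.abs(x) < 1e-2] = 1e-2           # keep x away from the ReLU kink
dx = 1e-4 * rng.standard_normal(6)   # a small input perturbation
lhs = np.linalg.norm(net(x + dx) - net(x))                 # ||dy||
rhs = np.linalg.norm(jacobian(x), 2) * np.linalg.norm(dx)  # ||J|| ||dx||
```

Because no ReLU changes sign under this tiny perturbation, `lhs` equals the linearized response and is bounded by `rhs`, exactly the relation that motivates using the Jacobian norm as a robustness score.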

#### III-D1 Estimation of robustness

Suppose an arbitrary dataset $D$; then we can estimate the robustness over it by

$$\hat{R} = \frac{1}{|D|}\sum_{x\in D}\left\|J_y(x)\right\| \tag{11}$$

where $D$ can be chosen as the training, validation or test set.

#### III-D2 Approximation of the Jacobian

In the case of unknown gradients, we can perform numerical differentiation to approximate the Jacobian. Since the input is high-dimensional (and, without loss of generality, we take the output to be one-dimensional), it is inefficient to compute the numerical differential directly; instead, we estimate it in a Monte Carlo manner w.r.t. the error direction, i.e.,

$$\left\|J_y(x)\right\| \approx \frac{1}{|E|}\sum_{e\in E}\frac{\left\|y(x+\epsilon e) - y(x)\right\|_2}{\epsilon} \tag{12}$$

where $E$ stands for the set of directions in which the error is taken, assumed to be evenly distributed. Furthermore, the scheme can adopt any distortion type of interest, such as JPEG compression, whether or not the error is random.
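The Monte Carlo probe above can be sketched as follows; the sanity check uses a linear map, whose directional derivatives are exact, and all names are illustrative:

```python
import numpy as np

def mc_jacobian_norm(f, x, n_dirs=64, eps=1e-5, seed=0):
    """Monte-Carlo sensitivity estimate at x: probe random unit
    directions e and average ||f(x + eps*e) - f(x)|| / eps."""
    rng = np.random.default_rng(seed)
    base = f(x)
    total = 0.0
    for _ in range(n_dirs):
        e = rng.standard_normal(x.shape)
        e /= np.linalg.norm(e)                     # unit direction
        total += np.linalg.norm(f(x + eps * e) - base) / eps
    return total / n_dirs

# for a linear map the estimate must lie between the smallest and
# largest singular values (here about 0.414 and 2.414)
A = np.array([[1.0, 2.0], [0.0, 1.0]])
est = mc_jacobian_norm(lambda v: A @ v, np.zeros(2))
```

Replacing the random unit directions with actual distortion patterns (e.g. JPEG residuals) gives the distortion-specific variant mentioned in the text.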

#### III-D3 Theoretical upper bound of the Jacobian

For an estimate-free analysis, we can still measure robustness by calculating a theoretical element-wise upper bound of the Jacobian matrix, denoted $\bar{J}$. We then obtain a Lipschitz constant $K$ such that, for $\|x_1-x_2\|$ sufficiently small,

$$\left\|y(x_1) - y(x_2)\right\|_2 \le K\,\left\|x_1 - x_2\right\|_2 \tag{13}$$

In practice, since convolutional and fully connected layers perform linear operations, their derivatives are constant with respect to the input once the model parameters are fixed. For the nonlinear operations in MEnet, such as ReLU, pooling and softmax, the derivative for arbitrary input is trivially upper-bounded by 1; back-propagating these upper bounds then gives a total upper bound on the Jacobian matrix. Table III shows the direct comparison of Jacobian statistics with other methods on selected datasets.
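The bound-propagation idea can be sketched for a toy chain of linear layers interleaved with 1-Lipschitz nonlinearities: propagating the absolute weight matrices dominates any realizable Jacobian. The toy weights are our own:

```python
import numpy as np

def jacobian_upper_bound(weights):
    """Element-wise upper bound on the Jacobian of a chain of linear
    layers interleaved with nonlinearities whose derivative magnitude
    is at most 1 (ReLU, max-pooling, etc.): propagate |W| matrices."""
    bound = np.abs(weights[0])
    for w in weights[1:]:
        bound = np.abs(w) @ bound
    return bound

W1 = np.array([[1.0, -2.0], [0.5, 1.0]])   # toy first layer
W2 = np.array([[-1.0, 1.0]])               # toy second layer
B = jacobian_upper_bound([W1, W2])         # [[1.5, 3.0]]

# any ReLU activation pattern yields a Jacobian dominated by B
J = W2 @ np.diag([1.0, 0.0]) @ W1          # one possible ReLU mask
```

Unlike the Monte Carlo estimate, this bound holds for every input, at the cost of being looser.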

## IV Experiments

We test on several public saliency datasets and their distorted versions, comparing with state-of-the-art saliency detection methods. We use the Caffe framework [47] to train our model, and PyTorch [48] and TensorFlow [49] to compare MEnet with other models.

### IV-A Datasets

The datasets we consider are: MSRA10K [7], DUT-OMRON (DUT-O) [8], HKU-IS [50], ECSSD [51], MSRA1000 (MSRA1K) [52] and SOD [53]. MSRA10K contains 10000 images; it is the largest dataset and covers a large variety of contents. HKU-IS contains 4447 images, most of which contain two or more salient objects. ECSSD contains 1000 images. DUT-OMRON contains 5168 images and was originally designed for image segmentation. This dataset is very challenging since most of its images contain complex scenes; existing saliency detection models have yet to achieve high accuracy on it. MSRA1K contains 1000 images, all of which belong to MSRA10K. SOD contains 300 images.

### IV-B Training

We use stochastic gradient descent (SGD) for optimization, and MSRA10K and HKU-IS are selected for training. For MSRA10K, 8500 images are used for training, 500 for validation, and MSRA1K for testing; HKU-IS was divided into approximately 80/5/15 training/validation/testing splits. To prevent overfitting, all of our models use random cropping and flipping as data augmentation. We utilize batch normalization [54] to speed up the convergence of MEnet.

Most experiments are performed on a PC with Intel(R) Xeon(R) CPU I7-6900k, 96GB RAM and GTX TITAN X Pascal. Some later experiments are performed on Google Colab.
We use a 4-convolutional-layer block in the upsampling and downsampling operations; the depth of MEnet is therefore 52 layers. The parameter sizes are shown in Figure 2 and Figure 3. We set the learning rate to 0.1 with weight decay, a momentum of 0.9 and a mini-batch size of 5, and train for 110,000 iterations.
Since salient and non-salient pixels are very imbalanced, converging to a good local optimum is challenging. Inspired by object detection methods such as SSD [55], we adopt hard negative mining to address this problem. This sampling scheme keeps the salient-to-non-salient sample ratio equal to 1, eliminating label bias. (Code and more related details are given at: https://github.com/SherylHYX/Ro-SOS-Metric-Expression-Network-MEnet-for-Robust-Salient-Object-Segmentation.)
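A simplified per-image version of this balanced sampling is sketched below; the selection-by-largest-loss rule is our assumption of how "hard" negatives are chosen, and the names are illustrative:

```python
import numpy as np

def hard_negative_sample(labels, pixel_loss):
    """Hard-negative-mining sketch: keep every salient pixel and select
    equally many non-salient pixels with the largest current loss, so
    the salient : non-salient ratio used for training is 1 : 1."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    hardest = neg[np.argsort(pixel_loss[neg])[::-1][:len(pos)]]
    return pos, hardest

labels = np.array([1, 0, 0, 0, 1, 0])            # 2 salient, 4 background
loss = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.5])  # current per-pixel loss
pos, hard = hard_negative_sample(labels, loss)   # picks negatives 1 and 3
```

Training the loss only on `pos` and `hard` keeps the gradient signal balanced even when background pixels vastly outnumber salient ones.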

*Table I: F-measure with adaptive threshold (higher is better) and MAE (lower is better) compared with state-of-the-art methods.*

| Data | Metric | MC | ELD | DHSNet | DS | DCL | UCF | Amulet | NLDF | SRM | MEnet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DUT-O | F-measure | 0.622 | 0.618 | – | 0.646 | 0.660 | 0.645 | 0.654 | 0.691 | 0.718 | 0.732 |
| DUT-O | MAE | 0.094 | 0.092 | – | 0.084 | 0.095 | 0.132 | 0.098 | 0.080 | 0.071 | 0.074 |
| HKU-IS | F-measure | 0.733 | 0.779 | 0.859 | 0.790 | 0.844 | 0.820 | 0.841 | 0.873 | 0.877 | 0.879 |
| HKU-IS | MAE | 0.099 | 0.072 | 0.053 | 0.079 | 0.063 | 0.072 | 0.052 | 0.048 | 0.046 | 0.044 |
| ECSSD | F-measure | 0.779 | 0.810 | 0.877 | 0.834 | 0.857 | 0.854 | 0.873 | 0.880 | 0.892 | 0.880 |
| ECSSD | MAE | 0.106 | 0.080 | 0.060 | 0.079 | 0.078 | 0.078 | 0.060 | 0.063 | 0.056 | 0.060 |
| MSRA1K | F-measure | 0.885 | 0.882 | – | 0.858 | 0.922 | – | – | – | 0.894 | 0.928 |
| MSRA1K | MAE | 0.044 | 0.037 | – | 0.059 | 0.035 | – | – | – | 0.045 | 0.028 |
| SOD | F-measure | 0.497 | 0.540 | 0.595 | 0.552 | 0.573 | 0.557 | 0.550 | 0.591 | 0.617 | 0.594 |
| SOD | MAE | 0.160 | 0.150 | 0.124 | 0.141 | 0.147 | 0.186 | 0.160 | 0.130 | 0.120 | 0.139 |

*Table II: F-measure comparison on distorted images (AWGN and JPEG compression).*

| Data | Distortion | MC | ELD | DHSNet | DS | DCL | UCF | Amulet | NLDF | SRM | MEnet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DUT-O | AWGN | 0.487 | 0.534 | – | 0.452 | 0.472 | 0.582 | 0.574 | 0.517 | 0.533 | 0.673 |
| DUT-O | Compression | 0.512 | 0.558 | – | 0.508 | 0.543 | 0.564 | 0.538 | 0.574 | 0.565 | 0.699 |
| HKU-IS | AWGN | 0.554 | 0.622 | 0.723 | 0.567 | 0.589 | 0.720 | 0.726 | 0.637 | 0.572 | 0.751 |
| HKU-IS | Compression | 0.602 | 0.687 | 0.736 | 0.614 | 0.679 | 0.705 | 0.688 | 0.700 | 0.663 | 0.811 |
| ECSSD | AWGN | 0.603 | 0.671 | 0.726 | 0.611 | 0.618 | 0.730 | 0.738 | 0.669 | 0.583 | 0.740 |
| ECSSD | Compression | 0.650 | 0.730 | 0.753 | 0.650 | 0.658 | 0.724 | 0.711 | 0.694 | 0.665 | 0.810 |
| MSRA1K | AWGN | 0.750 | 0.806 | – | 0.711 | 0.760 | – | – | – | 0.785 | 0.899 |
| MSRA1K | Compression | 0.788 | 0.841 | – | 0.772 | 0.833 | – | – | – | 0.820 | 0.914 |
| SOD | AWGN | 0.403 | 0.443 | 0.492 | 0.373 | 0.388 | 0.439 | 0.453 | 0.430 | 0.443 | 0.504 |
| SOD | Compression | 0.410 | 0.458 | 0.475 | 0.421 | 0.406 | 0.451 | 0.431 | 0.455 | 0.433 | 0.536 |

*Table III: statistics of the absolute Jacobian values g for MEnet and NLDF on the test datasets.*

| Data | Model | max(abs(g)) | min(abs(g)) | median(abs(g)) | mean(abs(g)) | var(abs(g)) |
|---|---|---|---|---|---|---|
| DUT | MEnet | 1.36E-08 | 4.11E-16 | 1.21E-10 | 2.62E-10 | 5.01E-19 |
| DUT | NLDF | 1.91E-07 | 8.66E-15 | 2.37E-09 | 4.88E-09 | 6.43E-17 |
| ECSSD | MEnet | 1.04E-08 | 3.34E-16 | 9.70E-11 | 2.05E-10 | 2.43E-19 |
| ECSSD | NLDF | 1.84E-07 | 7.24E-15 | 1.96E-09 | 4.36E-09 | 5.68E-17 |
| HKU-IS | MEnet | 7.80E-09 | 2.80E-16 | 7.39E-11 | 1.57E-10 | 1.59E-19 |
| HKU-IS | NLDF | 1.89E-07 | 7.18E-15 | 2.05E-09 | 4.52E-09 | 5.85E-17 |
| MSRA1000 | MEnet | 1.12E-08 | 2.77E-16 | 7.76E-11 | 1.90E-10 | 3.24E-19 |
| MSRA1000 | NLDF | 2.17E-07 | 9.02E-15 | 2.35E-09 | 5.21E-09 | 8.27E-17 |
| SOD | MEnet | 1.22E-08 | 4.79E-16 | 1.24E-10 | 2.56E-10 | 3.32E-19 |
| SOD | NLDF | 1.75E-07 | 6.87E-15 | 1.97E-09 | 4.21E-09 | 5.11E-17 |

*Table IV: ablation of the multi-scale design and the metric loss (F-measure and MAE); the MEnet column repeats the full-model results of Table I.*

| Data | Metric | CE-plain | CE-only | MEnet |
|---|---|---|---|---|
| DUT-O | F-measure | 0.631 | 0.678 | 0.732 |
| DUT-O | MAE | 0.098 | 0.084 | 0.074 |
| HKU-IS | F-measure | 0.803 | 0.872 | 0.879 |
| HKU-IS | MAE | 0.064 | 0.056 | 0.044 |
| ECSSD | F-measure | 0.794 | 0.855 | 0.880 |
| ECSSD | MAE | 0.093 | 0.072 | 0.060 |
| MSRA1K | F-measure | 0.884 | 0.915 | 0.928 |
| MSRA1K | MAE | 0.037 | 0.034 | 0.028 |
| SOD | F-measure | 0.525 | 0.555 | 0.594 |
| SOD | MAE | 0.156 | 0.159 | 0.139 |

### IV-C Upsampling Operation

In our approach, we use two upsampling methods. As shown in Figure 8, we use deconvolution [28] as the upsampling method in the decoder. Since feature maps are generated at different scales, we use a simple upscaling step to bring output feature maps of different sizes to a common size for concatenation. The response of each upscale operation is defined as:

$$y_l = \mathrm{upsample}\!\left(x_l;\ \frac{s_{\mathrm{out}}}{s_l}\right) \tag{14}$$

where $x_l$ and $y_l$ denote the input and output respectively, $l$ is the convolutional level, $s_l$ is the spatial size of the input at level $l$, and $s_{\mathrm{out}}/s_l$ is the resulting upscale factor.

### IV-D Quantitative Evaluation

For evaluation we use three standard criteria: F-measure with an adaptive threshold, mean absolute error (MAE) [56, 57], and the precision-recall (PR) curve [58]. The PR curve is widely used; it is obtained by comparing the result with the ground truth through binary masks generated with a threshold sliding from 0 to 255. For the F-measure, the adaptive threshold [59] is defined as twice the mean saliency of the image, as shown in Equation (15).

$$T = \frac{2}{W\times H}\sum_{x=1}^{W}\sum_{y=1}^{H} S(x,y) \tag{15}$$

where $W$ and $H$ denote the width and height of the final saliency map $S$, respectively. The F-measure is defined as

$$F_{\beta} = \frac{(1+\beta^2)\cdot\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^2\cdot\mathrm{Precision}+\mathrm{Recall}} \tag{16}$$

where $\beta^2 = 0.3$, as is conventional in the saliency literature. Different from PR curves, MAE evaluates the average dissimilarity between the saliency map and the ground truth at every pixel: it is defined as the average pixel-wise absolute difference between the binary ground truth $G$ and the saliency map $S$,

$$\mathrm{MAE} = \frac{1}{W\times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|S(x,y)-G(x,y)\right| \tag{17}$$
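The three criteria can be sketched directly from Equations (15)-(17); the NumPy code below assumes a binary ground truth and uses beta^2 = 0.3, the usual convention in this literature:

```python
import numpy as np

def adaptive_threshold(smap):
    # Eq. (15): twice the mean saliency of the map
    return 2.0 * smap.mean()

def f_measure(smap, gt, beta2=0.3):
    """F-measure (Eq. (16)) at the adaptive threshold of Eq. (15)."""
    binary = smap >= adaptive_threshold(smap)
    tp = np.logical_and(binary, gt.astype(bool)).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(int(gt.sum()), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def mae(smap, gt):
    # Eq. (17): mean absolute pixel-wise difference
    return np.abs(smap - gt.astype(float)).mean()

gt = np.array([[1, 0], [0, 0]])
perfect = gt.astype(float)   # a perfect saliency map scores F = 1, MAE = 0
```

PR curves (not shown) follow the same pattern, but sweep a fixed threshold from 0 to 255 instead of using the adaptive one.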

### IV-E Performance Comparison

We compare MEnet with 9 state-of-the-art models for saliency detection: MC [43], ELD [45], DCL [16], DHSNet [20], DS [60], UCF [61], Amulet [21], SRM [18], NLDF [19] and 2 traditional metric learning methods: AML [41] and Lu’s method [42].

#### IV-E1 Visual Comparison

A visual comparison with other state-of-the-art methods is shown in Figure 6. To illustrate the efficiency of MEnet, we select a variety of difficult circumstances from different datasets. MEnet performs better in these challenging circumstances, e.g., when the salient region is similar to the background or the scene is complex.

#### IV-E2 F-measure and MAE

We also compare our approach with state-of-the-art salient segmentation methods in terms of F-measure scores and MAE, shown in Table I. Note that the stronger baselines (e.g., DHSNet, NLDF, Amulet, SRM and UCF) require a pre-trained model, and a conditional random field (CRF) [62] is used as post-processing in DCL. Although MEnet is trained from scratch, it is still comparable with state-of-the-art models (in average F-measure with adaptive threshold and MAE), particularly on the challenging datasets DUT-O and HKU-IS. Our model also generates each saliency map quickly on a GPU.

#### IV-E3 PR curve

We also compare our approach with existing methods using PR curves, which are widely used to evaluate salient segmentation models. As shown in Figure 5, since MSRA5K, which contains MSRA1K, is treated as a training dataset, we only depict the PR curves produced by our approach and previous methods on four datasets. From Figure 5, it is clear that MEnet is comparable with state-of-the-art models without any pre/post-processing.

### IV-F Robustness Evaluation

Note that MEnet is not trained on distorted images, the same as previous works. To show the robustness of MEnet in the distorted setting, we work with public datasets corrupted by additive white Gaussian noise (AWGN) and JPEG compression (with random strengths). We compare F-measure scores in Table II, where MEnet clearly outperforms the other methods. Additionally, we show PR curves of our approach in Figure 7. Since the saliency maps generated by the metric-loss prediction tend to be binary, they are ill-suited to PR curves, which require continuous saliency values; we therefore use the saliency maps generated by the CE prediction to draw PR curves. From Figure 7, we observe that the proposed method performs somewhat better than others on distorted datasets. As shown in Figure 9, with growing noise variance the performance of other methods degrades rapidly, while MEnet remains robust. This robustness owes to the fact that multi-scale features and the metric loss are integrated into one structure: abundant features from both low and high levels are fully utilized, and the metric loss correlates every pixel with the remaining pixels during optimization. A similar metric-loss idea has been shown to be robust in human re-identification [25], since it is insensitive to lighting, deformation and viewing angle, which can be regarded as "noise". In practice, images are easily affected by noise and compression, so the proposed design is beneficial for constructing a robust model.

#### IV-F1 Evaluation on distorted images

To better illustrate the efficiency of MEnet on distorted images, we also compare our approach with state-of-the-art methods qualitatively. We select several images disturbed by AWGN or JPEG compression with different parameters, as Figures 10 and 11 demonstrate. From Figure 10, we observe that MEnet obtains accurate saliency maps under small noise variance, as shown in the top three rows, and also outperforms other methods in the bottom two rows. Figure 11 illustrates that for images disturbed by JPEG compression, MEnet detects large parts of the salient objects accurately, better than previous methods.

#### IV-F2 Jacobian on test datasets

To further illustrate the robustness of MEnet, we compare Jacobian statistics on several datasets that the models were not trained on. Since the Jacobian is given with respect to each input and output pixel, dimension-reducing statistics are applied. As can be observed in Table III, MEnet outperforms the other method by an order of magnitude in all criteria. For this comparison, MEnet runs on PyTorch and NLDF on TensorFlow.

| Data | Index | AML | Lu’s | MEnet |
|---|---|---|---|---|
| ECSSD | | 0.667 | 0.715 | 0.880 |
| | | 0.165 | 0.136 | 0.060 |
| MSRA1K | | 0.794 | 0.806 | 0.928 |
| | | 0.089 | 0.080 | 0.028 |

### IV-G Advantages of MEnet

To intuitively illustrate the advantages of MEnet, we visualize several feature maps for analysis. As the layers of each scale go deeper, the receptive field of each neuron becomes larger. As shown in Figure 8, each convolutional layer contains different semantic information, and going deeper allows the model to capture richer structures. Within the decoding part, scale2-de, scale3-de, and scale4-de are sensitive to the salient region, whereas scale1-de responds more strongly to the background region; other layers, such as scale0-de, can delineate the boundary of salient objects.
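The claim that the receptive field grows with depth follows the standard recurrence $r_l = r_{l-1} + (k_l - 1)\,j_{l-1}$, $j_l = j_{l-1} s_l$, where $k_l$ and $s_l$ are the kernel size and stride of layer $l$ and $j_l$ is the cumulative stride ("jump"). A minimal sketch, with illustrative layer specs rather than MEnet's actual configuration:

```python
def receptive_field(layers):
    """Cumulative receptive field of a stack of (kernel, stride) conv layers.

    Implements r_l = r_{l-1} + (k_l - 1) * jump_{l-1}, jump_l = jump_{l-1} * s_l.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r
```

For example, two 3x3 stride-1 convolutions give a 5x5 receptive field, and inserting a stride-2 layer roughly doubles the growth contributed by all subsequent layers, which is why deeper scales see much larger image regions.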

To show the effectiveness of the proposed multi-scale feature extraction and loss function, we evaluate different strategies for semantic saliency detection/segmentation, as shown in Table IV. The difference between CE-only and CE-plain is that CE-plain does not utilize multi-scale information, which degrades performance. We also note that introducing the metric loss further improves MEnet.

### IV-H Failure Cases

However, as shown in Figure 12, MEnet sometimes mistakes background content for a salient object under the provided ground truth. We also observe that our approach can fail to detect parts of transparent objects, so there is still room for improvement. In these cases, however, the mistakes are often semantically reasonable rather than caused by a clear flaw in the model.

## V Conclusion

An end-to-end deep metric learning architecture (called MEnet) for salient object segmentation is presented in this paper. Within this architecture, multi-scale feature extraction is used to obtain rich semantic information, which is then combined with deep metric learning. Our network maps pixels into a “saliency space” in which Euclidean distances can be measured; in this space, salient image elements (pixels) are effectively separated from the background. Moreover, MEnet is trained from scratch and requires no pre- or post-processing. Experimental results on benchmark datasets demonstrate the strong performance of our model, and comparisons on distorted images together with numerical robustness evaluations confirm its robustness to distortion.

## References

- [1] Q. Li, Y. Zhou, and J. Yang, “Saliency based image segmentation,” in Multimedia Technology (ICMT), 2011 International Conference on, pp. 5068–5071, IEEE, 2011.
- [2] Z. Ren, S. Gao, L.-T. Chia, and I. W.-H. Tsang, “Region-based saliency detection and its application in object recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 5, pp. 769–779, 2014.
- [3] A. Li, X. She, and Q. Sun, “Color image quality assessment combining saliency and fsim,” in Fifth International Conference on Digital Image Processing, pp. 88780I–88780I, International Society for Optics and Photonics, 2013.
- [4] G. Zhang, Z. Yuan, N. Zheng, X. Sheng, and T. Liu, “Visual saliency based object tracking,” Computer Vision–ACCV 2009, pp. 193–203, 2010.
- [5] R. Margolin, L. Zelnik-Manor, and A. Tal, “Saliency for image manipulation,” The Visual Computer, vol. 29, no. 5, pp. 381–392, 2013.
- [6] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3166–3173, 2013.
- [7] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 569–582, 2015.
- [8] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3166–3173, 2013.
- [9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [10] W. Li, G. Wu, F. Zhang, and Q. Du, “Hyperspectral image classification using deep pixel-pair features,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 2, pp. 844–853, 2016.
- [11] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, “Deep convolutional neural networks for hyperspectral image classification,” Journal of Sensors, vol. 2015, 2015.
- [12] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikainen, “Deep learning for generic object detection: A survey,” arXiv Preprint, arXiv:1809.02165, 2018.
- [13] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, and J. Garcia-Rodriguez, “A survey on deep learning techniques for image and video semantic segmentation,” Applied Soft Computing, vol. 70, pp. 41–65, 2018.
- [14] G. Li and Y. Yu, “Visual saliency detection based on multiscale deep cnn features,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5012–5024, 2016.
- [15] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, “Saliency detection with recurrent fully convolutional networks,” in European conference on computer vision, pp. 825–841, Springer, 2016.
- [16] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 478–487, 2016.
- [17] J. Kuen, Z. Wang, and G. Wang, “Recurrent attentional networks for saliency detection,” in Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, pp. 3668–3677, 2016.
- [18] T. Wang, A. Borji, L. Zhang, P. Zhang, and H. Lu, “A stagewise refinement model for detecting salient objects in images,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- [19] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, and P.-M. Jodoin, “Non-local deep features for salient object detection,” in IEEE CVPR, 2017.
- [20] N. Liu and J. Han, “Dhsnet: Deep hierarchical saliency network for salient object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 678–686, 2016.
- [21] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan, “Amulet: Aggregating multi-level convolutional features for salient object detection,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- [22] Z. Chen, W. Lin, S. Wang, L. Xu, and L. Li, “Image quality assessment guided deep neural networks training,” arXiv preprint arXiv:1708.03880, 2017.
- [23] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy, “Semantic instance segmentation via deep metric learning,” arXiv preprint arXiv:1703.10277, 2017.
- [24] J. Hu, J. Lu, and Y.-P. Tan, “Discriminative deep metric learning for face verification in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1875–1882, 2014.
- [25] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person re-identification,” in Pattern Recognition (ICPR), 2014 22nd International Conference on, pp. 34–39, IEEE, 2014.
- [26] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 447–456, 2015.
- [27] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, pp. 234–241, Springer, 2015.
- [28] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
- [29] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
- [30] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3640–3649, 2016.
- [31] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
- [32] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in Proceedings of the IEEE international conference on computer vision, pp. 2650–2658, 2015.
- [33] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1915–1929, 2012.
- [34] P. H. Pinheiro and R. Collobert, “Recurrent convolutional neural networks for scene labeling,” in 31st International Conference on Machine Learning (ICML), no. CONF, 2014.
- [35] M. Guillaumin, J. Verbeek, and C. Schmid, “Is that you? metric learning approaches for face identification,” in Computer Vision, 2009 IEEE 12th international conference on, pp. 498–505, IEEE, 2009.
- [36] T. Hastie and R. Tibshirani, “Discriminant adaptive nearest neighbor classification,” IEEE transactions on pattern analysis and machine intelligence, vol. 18, no. 6, pp. 607–616, 1996.
- [37] Z. Zhang, J. T. Kwok, and D.-Y. Yeung, “Parametric distance metric learning with label information,” in IJCAI, p. 1450, Citeseer, 2003.
- [38] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2288–2295, IEEE, 2012.
- [39] Z. Li and J. Tang, “Weakly supervised deep metric learning for community-contributed image retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1989–1999, 2015.
- [40] J. Hu, J. Lu, and Y.-P. Tan, “Deep metric learning for visual tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 11, pp. 2056–2068, 2016.
- [41] S. Li, H. Lu, Z. Lin, X. Shen, and B. Price, “Adaptive metric learning for saliency detection,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3321–3331, 2015.
- [42] J. You, L. Zhang, J. Qi, and H. Lu, “Salient object detection via point-to-set metric learning,” Pattern Recognition Letters, vol. 84, pp. 85–90, 2016.
- [43] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1265–1274, 2015.
- [44] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, “Deep networks for saliency detection via local estimation and global search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3183–3192, 2015.
- [45] G. Lee, Y.-W. Tai, and J. Kim, “Deep saliency with encoded low level distance map and high level features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 660–668, 2016.
- [46] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
- [47] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678, ACM, 2014.
- [48] A. Paszke, S. Gross, S. Chintala, and G. Chanan, “Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration,” PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration, vol. 6, 2017.
- [49] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
- [50] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5455–5463, 2015.
- [51] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1155–1162, 2013.
- [52] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” IEEE Transactions on Pattern analysis and machine intelligence, vol. 33, no. 2, pp. 353–367, 2011.
- [53] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2, pp. 416–423, IEEE, 2001.
- [54] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning, pp. 448–456, 2015.
- [55] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision, pp. 21–37, Springer, 2016.
- [56] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5706–5722, 2015.
- [57] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 733–740, IEEE, 2012.
- [58] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, “Saliency detection via dense and sparse reconstruction,” in Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 2976–2983, IEEE, 2013.
- [59] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in Computer vision and pattern recognition, 2009. cvpr 2009. ieee conference on, pp. 1597–1604, IEEE, 2009.
- [60] X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang, “Deepsaliency: Multi-task deep neural network model for salient object detection,” IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3919–3930, 2016.
- [61] P. Zhang, D. Wang, H. Lu, H. Wang, and B. Yin, “Learning uncertain convolutional features for accurate saliency detection,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- [62] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Advances in neural information processing systems, pp. 109–117, 2011.