Enhanced-alignment Measure for Binary Foreground Map Evaluation
Abstract
The existing binary foreground map (FM) measures address various types of errors in either pixel-wise or structural ways. These measures consider pixel-level matches or image-level information independently, while cognitive vision studies have shown that human vision is highly sensitive to both global information and local details in scenes. In this paper, we take a detailed look at current binary FM evaluation measures and propose a novel and effective E-measure (Enhanced-alignment measure). Our measure combines local pixel values with the image-level mean value in a single term, jointly capturing image-level statistics and local pixel matching information. We demonstrate the superiority of our measure over the available measures on 4 popular datasets via 5 meta-measures, including ranking models for applications, demoting generic maps and random Gaussian noise maps, ground-truth switch, and human judgments. We find large improvements in almost all the meta-measures. For instance, in terms of application ranking, we observe improvements ranging from 9.08% to 19.65% compared with other popular measures.
1 Introduction
Please take a look at Fig. 1. You see the output of a binary foreground segmentation model and a random Gaussian noise map. While it is clear that the foreground map (FM) is much closer to the ground-truth (GT) map, to date the most common measures (e.g., IOU [9], F1, and JI [12]) as well as recently proposed ones, including Fbw [24] and VQ [31], favor the noise map over the estimated map. This is one of the problems that we address in this paper (see experiments in Sec. 4.3). To solve it, we propose a novel measure that does much better than existing ones.
The comparison between a binary foreground estimated map and a human-labeled ground-truth binary map is common in various computer vision tasks, such as image retrieval [20], image segmentation [29], object detection and recognition [13, 30], foreground extraction [4], and salient object detection [11, 5], and is crucial for making statements regarding which models perform better.
Three widely used measures for comparing a foreground map (FM) with the GT are the Fβ-measure [2], the Jaccard Index (JI) [12], and intersection over union (IOU) [9]. Various evaluations based on these measures [8, 24, 31] and other measures (e.g., [27, 32, 26]) have been reported in the past. All of these evaluations, however, have used measures that address pixel-wise similarities and often discard structural similarities. Recently, Fan et al. [10] proposed the structure measure (S-measure), which achieves great performance. However, this measure is designed for evaluating non-binary maps, and some of its components (e.g., the uniform distribution term) are not well defined for the binary-map case.
Table 1: Popular measures for binary foreground map evaluation, with their pros and cons.

No. | Measure        | Year | Pub.  | Pros                                       | Cons
1   | IOU/F1/JI [12] | 1901 | BSVSN | easy to calculate                          | loses image-level statistics
2   | CM [27]        | 2010 | CVPRW | considers both region and contour          | sensitive to noise
3   | Fbw [24]       | 2014 | CVPR  | assigns different weights to errors        | sensitive to error location; complicated
4   | VQ [31]        | 2015 | TIP   | weights errors by a psychological function | subjective measure
5   | S-measure [10] | 2017 | ICCV  | considers structure similarity             | focuses on non-binary map properties
In contrast, here we propose a novel measure known as the E-measure (Enhanced-alignment measure), which consists of a single term accounting for both pixel-level and image-level properties. We show that this approach leads to an effective and efficient way of evaluating binary foreground maps, as demonstrated in Fig. 2, where three foreground maps with colored borders (blue, red, yellow) are evaluated against the ground-truth map. Compared to 3 popular measures (Fbw, VQ, and CM), only our measure correctly ranks the three maps, considering both structural information and global shape coverage. We achieve this by jointly taking into account image-level statistics (the mean value of the FM) and pixel-level matching. Our measure (described in detail in Sec. 3) can correctly rank the estimated segmentation maps. Our main contributions are as follows:

We propose a simple measure consisting of a compact term that simultaneously captures image-level statistics and local pixel matching information. Using 5 meta-measures on 4 public datasets, we experimentally show that our measure is significantly better than the traditional IOU, F1/JI, and CM measures and recently proposed ones including VQ, Fbw and the S-measure.

To assess the measures, we also propose a new meta-measure (SOTA vs. Noise) and build a new dataset containing 555 binary foreground maps ranked by humans. We use this dataset to examine the ranking consistency between current measures and human judgments.
2 Related Work
A summary of popular measures for binary foreground map evaluation can be found in Tab. 1. Here, we explain these measures and discuss their pros and cons.
The Fβ-measure [2, 7, 22] is a common measure that simultaneously considers Precision and Recall:

F_{\beta} = \frac{(1+\beta^{2})\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\beta^{2}\cdot \mathrm{Precision} + \mathrm{Recall}}    (1)

where β is a parameter that trades off Precision = TP/(TP+FP) against Recall = TP/(TP+FN); True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) are the 4 basic quantities. Setting β = 1 leads to the classic F1 measure. Another widely used F1-based measure is the Jaccard Index (JI) [12], also known as the IOU measure:
\mathrm{JI} = \frac{TP}{TP + FP + FN}    (2)
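As a quick illustration (not from the paper; the function names are our own), both quantities can be computed directly from the four basic counts, and their mutual relation can be checked numerically:

```python
# Hedged sketch: F-beta and Jaccard from the four basic counts.
def f_beta(tp, fp, fn, beta2=1.0):
    # beta2 is beta squared; beta2 = 1 gives the classic F1 measure.
    return (1 + beta2) * tp / ((1 + beta2) * tp + beta2 * fn + fp)

def jaccard(tp, fp, fn):
    # Jaccard index / IOU: TP / (TP + FP + FN).
    return tp / (tp + fp + fn)

tp, fp, fn = 80, 10, 10
f1, ji = f_beta(tp, fp, fn), jaccard(tp, fp, fn)
assert abs(f1 - 2 * ji / (1 + ji)) < 1e-12  # F1 = 2*JI / (1 + JI)
```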
The F1 and IOU measures are related as F1 = 2·JI/(1+JI). Shi et al. [31] proposed another measure for subjective object segmentation assessment; essentially, their measure is also based on the F1 measure. Margolin et al. [24] proposed a more involved measure called the weighted F-measure (Fbw):
F^{w}_{\beta} = \frac{(1+\beta^{2})\cdot \mathrm{Precision}^{w}\cdot \mathrm{Recall}^{w}}{\beta^{2}\cdot \mathrm{Precision}^{w} + \mathrm{Recall}^{w}}    (3)
It assigns different weights to errors in different locations.
All the measures mentioned above are closely related to Fβ. They can be computed by considering each pixel position independently, and thus ignore important image-level information, which leads to suboptimal performance in identifying noise (Fig. 1), structure errors (Fig. 2), and different shapes (Fig. 3).
Movahedi et al. [27] proposed the contour mapping (CM) measure. This measure, however, is sensitive to noise (see Fig. 1), which results in poor performance, especially under meta-measure 3 as explained later (Sec. 4.3 & Tab. 2). The recently proposed S-measure [10] focuses on non-binary foreground map (FM) evaluation. It considers region-level structure similarity on a 2×2 grid over the segmentation map as well as object-level properties (e.g., uniformity and contrast). However, these properties are not well defined for binary maps.
3 The Proposed Measure
In this section, we explain the details of our new measure for evaluating binary foreground maps. An important advantage of our measure is its simplicity: it consists of a compact term that simultaneously captures global statistics and local pixel matching information. As a result, our measure performs better than the current popular measures. Our E-measure framework is illustrated in Fig. 4.
3.1 Motivation
Despite the previous success of binary map measures, recent measures such as the S-measure still do not perform well enough on binary maps. It is often the case that these measures assign a higher score to a binary generic map than to a state-of-the-art (SOTA) estimated segmentation map (see Sec. 4.2). The reason is that, on binary maps, the S-measure puts emphasis on luminance comparison, contrast comparison and the dispersion probability. While it makes sense to compute these terms for non-binary maps, whose real values in [0, 1] can be treated as the probability that a pixel belongs to the foreground, such properties are not well defined for binary maps. As a result, using continuous assumptions can lead to erroneous evaluation on binary maps.
Cognitive vision studies have shown that the human visual system is highly sensitive to structures (e.g., global information and local details) in scenes. Accordingly, it is important to consider local and global information simultaneously when evaluating the similarity between an FM and a GT.
Based on the above observations, we design a novel measure tailored to binary map evaluation. Our measure works on well-defined properties of binary maps and combines local pixel values with the image-level mean value in one term, capturing image-level statistics and local pixel matching information jointly. Experiments (Sec. 4) show that our measure outperforms previous measures on binary maps.
3.2 Alignment Term
To design a compact term that simultaneously captures local pixel matching information and global statistics, we define a bias matrix \varphi as the distance between each pixel value of an input binary map I and its global mean \mu_{I}:
\varphi_{I} = I - \mu_{I}\cdot \mathbb{A}    (4)
where \mathbb{A} is a matrix in which all element values are 1 and whose size is identical to that of I. We compute two bias matrices \varphi_{GT} and \varphi_{FM} for the binary ground-truth map GT and the binary foreground map FM, respectively. The bias matrix can be viewed as centering the signal by removing its mean intensity, which eliminates errors due to intrinsic variations or large numerical differences.
Our bias matrix has a strong relationship to luminance contrast [35]. Consequently, we consider the correlation (via the Hadamard product) between \varphi_{GT} and \varphi_{FM} as a simple and effective way to quantify bias-matrix similarity. We therefore define an alignment matrix \xi as follows:
\xi_{FM} = \frac{2\,\varphi_{GT}\circ\varphi_{FM}}{\varphi_{GT}\circ\varphi_{GT} + \varphi_{FM}\circ\varphi_{FM}}    (5)
where \circ denotes the Hadamard product. The alignment matrix has the property that \xi_{FM}(x, y) > 0 if and only if \varphi_{GT}(x, y) and \varphi_{FM}(x, y) have the same sign, i.e., the two inputs are aligned at position (x, y). The value of each element also depends on the global means, taking global statistics into account. These properties make Equ. (5) suitable for our purpose.
3.3 Enhanced Alignment Term
The absolute value of \xi_{FM} depends on the similarity of \varphi_{GT} and \varphi_{FM}. When the two maps are highly similar, a further increase in similarity may increase positive values at aligned positions and decrease negative values at unaligned positions; summed together, the total value of \xi_{FM} does not necessarily go up as we would expect. Therefore, we need a mapping function that suppresses decreases (i.e., has a smaller derivative) in negative-value (\xi < 0) regions and strengthens increases in positive-value (\xi > 0) regions.
To achieve this goal, a convex function is needed. We have tested other forms of mapping functions, such as higher-order polynomials and trigonometric functions, but found the quadratic form f(\xi) = \frac{1}{4}(1+\xi)^{2} (shown in Fig. 4 (g)) to be simple, effective, and best in our experiments. We use it to define the enhanced alignment matrix as:
\phi_{FM} = f(\xi_{FM}) = \frac{1}{4}\,(1+\xi_{FM})^{2}    (6)
3.4 Enhanced Alignment Measure
Using the enhanced alignment matrix \phi_{FM} to capture the two properties (pixel-level matching and image-level statistics) of a binary map, we define our final E-measure as:
Q_{FM} = \frac{1}{w\times h}\sum_{x=1}^{w}\sum_{y=1}^{h}\phi_{FM}(x, y)    (7)
where w and h are the width and the height of the map, respectively. Using this measure to evaluate the foreground map (FM) and the noise map in Fig. 1, we correctly rank the maps consistently with the application ranking (see below).
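For concreteness, the pipeline of Equ. (4)-(7) can be sketched in a few lines of numpy. This is our own minimal reading of the measure; the function name and the epsilon guard against all-zero bias products are our additions, not part of the paper:

```python
import numpy as np

def e_measure(fm, gt, eps=1e-12):
    fm = fm.astype(np.float64)   # binary {0, 1} foreground map
    gt = gt.astype(np.float64)   # binary {0, 1} ground-truth map
    # Equ. (4): bias matrices -- remove each map's global mean.
    phi_fm = fm - fm.mean()
    phi_gt = gt - gt.mean()
    # Equ. (5): alignment matrix via Hadamard products.
    xi = 2.0 * phi_gt * phi_fm / (phi_gt * phi_gt + phi_fm * phi_fm + eps)
    # Equ. (6): enhanced alignment matrix, quadratic mapping (1 + xi)^2 / 4.
    phi = (1.0 + xi) ** 2 / 4.0
    # Equ. (7): average over all w * h pixels.
    return phi.mean()

gt = np.zeros((10, 10), dtype=np.uint8)
gt[2:8, 2:8] = 1
assert e_measure(gt, gt) > 0.99                    # perfect map scores ~1
assert e_measure(1 - gt, gt) < e_measure(gt, gt)   # inverted map scores lower
```

A perfect map aligns everywhere (every \xi entry near 1), while a fully inverted map misaligns everywhere (every entry near -1), which the quadratic mapping sends toward 1 and 0, respectively.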
4 Experiments
In this section, we compare our E-measure with 5 popular measures for binary foreground map evaluation on 4 public salient object detection datasets, as in [10].
Meta-Measure. To test the quality of an evaluation measure, we use the meta-measure methodology. The basic idea is to define desired criteria for the quality of the results and to assess how well a measure satisfies those criteria [28]. We use 4 meta-measures proposed in [24, 28, 10], as well as a new one (Sec. 4.3) introduced here. Results are listed in Tab. 2.
Datasets & Models. The employed datasets are PASCAL-S [19], ECSSD [36], HKU-IS [16], and SOD [25]. We use 10 state-of-the-art (SOTA) salient object detection models, including 3 traditional ones (ST [23], DRFI [33], and DSR [18]) and 7 deep learning based ones (DCL [17], RFCN [34], MC [37], MDF [16], DISC [6], DHS [21], and ELD [14]), to generate non-binary maps. To obtain binary foreground maps, we threshold the non-binary maps with the image-dependent adaptive thresholding method of [1].
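The adaptive thresholding of [1] binarizes a saliency map at an image-dependent level set to twice the map's mean value; a minimal sketch (the function name is ours):

```python
import numpy as np

def adaptive_binarize(saliency):
    # Image-dependent threshold of Achanta et al. [1]: twice the mean saliency.
    s = saliency.astype(np.float64)
    return (s >= 2.0 * s.mean()).astype(np.uint8)
```

Only pixels at least twice as salient as the image average survive as foreground, so the threshold adapts to each map's overall brightness.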
4.1 Meta-Measure 1: Application Ranking
The first meta-measure specifies that the ranking of foreground maps by an evaluation measure should be consistent with their ranking by an application. Fig. 5 illustrates the application ranking. Assume that the application's output when using the GT map is the optimal output. We then feed a series of estimated maps to the application and obtain a sequence of outputs, ordered from most similar to most dissimilar, and compare this sequence with the optimal one. The more similar a map is to the GT map, the closer its application output should be to the GT output.
As Margolin et al. [24] claimed, applications including image retrieval, object detection and segmentation yield similar results. For a fair comparison, we follow Margolin et al. and use the content-based image retrieval application for this meta-measure. The implementation of this application is described under Application Realization below; other applications can be implemented similarly.
Here, we use the 1 − ρ measure, with ρ the Spearman rank correlation [3], to examine the agreement between measure ranking and application ranking. Its value falls in the range [0, 2]: a value of 0 means the measure ranking and the application ranking are identical, while completely reversed orders give a value of 2.
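Given the stated properties (range [0, 2], 0 for identical and 2 for reversed order), this ranking distance can be sketched as 1 − ρ; the snippet below is our own reading, assuming tie-free rankings given as permutations of 0..n−1:

```python
def rank_distance(rank_a, rank_b):
    # 1 - Spearman's rho for two tie-free rankings (permutations of 0..n-1).
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    rho = 1.0 - 6.0 * d2 / (n * (n * n - 1))
    return 1.0 - rho   # 0: identical order, 2: completely reversed

assert rank_distance([0, 1, 2, 3], [0, 1, 2, 3]) == 0.0
assert rank_distance([0, 1, 2, 3], [3, 2, 1, 0]) == 2.0
```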
In Tab. 2 we observe a significant improvement over the popular evaluation measures. Compared to the best prior measure, ours improves the performance by 19.65%, 9.08%, 18.42% and 9.64% on the PASCAL-S, ECSSD, SOD and HKU-IS datasets, respectively. Fig. 2 illustrates an example of how well our measure predicts the preference of these applications.
Application Realization. Content-based image retrieval finds the images most similar to a query image in a dataset [15]. Similarity is determined by various features, such as color histograms and the color and edge directivity descriptor (CEDD). We use LIRE [15] with CEDD to score the binary foreground maps.
Firstly, to ignore the background and obtain foreground features, we generate a combined GT image or FM image by combining the image with its GT map or FM map (see Fig. 6 (a)-(d)); we denote the combined results by I^G and I^F. Secondly, for each combined image we use LIRE to retrieve a list of the 100 most similar combined images, ordered from most similar to most dissimilar. The GT output L^G = {g_1^G, ..., g_100^G} is the ordered list returned when using the combined GT, with ordered score list S^G = {s_1^G, ..., s_100^G}, where each score gives the degree of similarity. Likewise, for the FM we obtain L^F and S^F. Thirdly, for each g_i^F in L^F, we search L^G for an element g_j^G equal to g_i^F. If such a g_j^G exists, we take its index j and the corresponding score s_j^G. Each score of the FM is then assigned as:
\omega_{i}^{F} = \begin{cases} s_{j}^{G}, & \text{if } \exists\, g_{j}^{G}\in L^{G}: g_{j}^{G} = g_{i}^{F},\\ 0, & \text{otherwise.} \end{cases}    (8)
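A small sketch of this assignment step, with hypothetical variable names of our own: each item retrieved for the FM inherits the GT score of the same image when that image also appears in the GT retrieval list, and 0 otherwise:

```python
def assign_scores(fm_list, gt_list, gt_scores):
    # Map each GT-retrieved image to its similarity score s_j^G.
    gt_index = {img: j for j, img in enumerate(gt_list)}
    return [gt_scores[gt_index[img]] if img in gt_index else 0.0
            for img in fm_list]

# FM retrieves 'b', an unseen image 'x', then 'a'; scores follow the GT list.
assert assign_scores(['b', 'x', 'a'], ['a', 'b', 'c'],
                     [0.9, 0.5, 0.2]) == [0.5, 0.0, 0.9]
```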
4.2 Meta-Measure 2: SOTA vs. Generic Maps
The second meta-measure is that an evaluation measure should assign higher scores to maps obtained by SOTA models than to trivial maps without any meaningful content. Here, we use a centered circle as the generic map; one example can be seen in Fig. 7. We expect the FM in (c) to receive a higher score than the generic map in (d).
Table 2: Quantitative comparison using meta-measures MM1-MM4 on the 4 datasets (lower is better for all).

Measure    |       PASCAL-S        |         ECSSD         |          SOD          |        HKU-IS         | MM4
           | MM1    MM2     MM3    | MM1    MM2     MM3    | MM1    MM2     MM3    | MM1    MM2     MM3    |
CM         | 0.610  49.78%  100.0% | 0.504  34.62%  100.0% | 0.723  29.89%  56.22% | 0.613  25.26%  100.0% | 1.492
VQ         | 0.339  17.97%  15.32% | 0.294  7.445%  6.162% | 0.335  9.143%  14.05% | 0.331  3.067%  1.800% | 0.161
IOU/F1/JI  | 0.307  9.426%  5.597% | 0.272  4.097%  1.921% | 0.342  4.571%  6.857% | 0.303  0.900%  0.197% | 0.124
Fbw        | 0.308  5.147%  4.265% | 0.280  2.945%  1.152% | 0.361  6.286%  5.714% | 0.312  0.535%  0.083% | 0.149
S-measure  | 0.315  2.353%  0.000% | 0.279  1.152%  0.000% | 0.374  1.714%  0.000% | 0.312  0.141%  0.000% | 0.140
Ours       | 0.247  3.093%  0.000% | 0.247  0.641%  0.000% | 0.273  0.571%  0.000% | 0.274  0.084%  0.000% | 0.121
We count the number of times a generic map scores higher than the mean score obtained by the 10 SOTA models mentioned in Sec. 4 (the mis-ranking rate). As suggested in [24], the mean score is robust to situations in which a certain model generates a poor result. The evaluation scores of the 10 maps must exceed a threshold for an image to be selected as a "good map"; about 80% of the maps in each dataset pass this selection and are used for this meta-measure. The lower the mis-ranking rate, the better the measure. Our measure outperforms the current measures on ECSSD, SOD and HKU-IS, but not on the PASCAL-S dataset.
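The mis-ranking rate described above can be formalized as follows (our own formalization; the names are hypothetical): over all selected images, count how often the trivial map's score beats the mean score of the SOTA maps:

```python
def misranking_rate(trivial_scores, sota_score_lists):
    # Fraction of images where the trivial (e.g. generic) map outscores
    # the mean score of the SOTA foreground maps.
    wrong = sum(1 for t, sota in zip(trivial_scores, sota_score_lists)
                if t > sum(sota) / len(sota))
    return wrong / len(trivial_scores)

# Two images: on the first the trivial map wins (0.9 > 0.6), on the second
# it loses (0.1 < 0.6), so the rate is 0.5.
assert misranking_rate([0.9, 0.1], [[0.5, 0.7], [0.5, 0.7]]) == 0.5
```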
4.3 Meta-Measure 3: SOTA vs. Random Noise
The property on which we base our third meta-measure is that an evaluation measure should, on average, prefer the map generated by a SOTA model over a random noise map.
We perform this experiment similarly to meta-measure 2, but this time we use a Gaussian random noise map instead of the generic map of Sec. 4.2. Our measure achieves the best performance, since it considers local pixel matching and global statistics jointly. Note that, as stated in Sec. 4.2, the mean score is robust to a single failure case from a certain SOTA model; the mean score over the foreground maps of the SOTA group should therefore always be higher than the score of a noise map. However, only our measure and the S-measure achieve the lowest mis-ranking rate.
4.4 Meta-Measure 4: Human Ranking
The fourth meta-measure examines the ranking correlation between an evaluation measure and human ranking. To the best of our knowledge, no binary foreground map dataset ranked by human beings existed before. To create one, we randomly selected maps ranked by the application of meta-measure 1 from the four above-mentioned datasets (PASCAL-S, SOD, ECSSD and HKU-IS). We then asked 10 subjects to rank these maps and kept only the maps for which all viewers agreed on the ranking. We name our dataset FM-Database (available at http://dpfan.net/e-measure/); it contains 185 images, each with 3 ranked estimated maps (555 maps in total).
To quantitatively assess the correlation between human ranking and measure ranking, we again use the ranking measure of meta-measure 1. The lower the score, the more consistent an evaluation measure is with human ranking. As can be seen, our measure outperforms the other measures. Fig. 2 illustrates an example of how well our measure predicts human preference.
4.5 Meta-Measure 5: Ground-Truth Switch
The property on which we base the fifth meta-measure is that the score of a "good map" should decrease when it is evaluated against a wrong GT map. Analyzing the 4 popular datasets (PASCAL-S, SOD, ECSSD, HKU-IS), we consider a map "good" when it scores at least 0.8 out of 1 under the F1 measure. Following Margolin et al. [24], we count the percentage of times an evaluation measure assigns a higher score when using the wrong GT map. All evaluation measures perform well (average results over the 4 datasets: VQ 0.000925%, CM 0.001675%, IOU/JI/F1 0.00515%, S-measure 0.0014%, and ours 0.0523%). Our measure trails the other measures by about 0.05%.
5 Conclusion and Future Work
In this paper, we analyzed various binary foreground map evaluation measures that consider errors at different levels, including the pixel, region, boundary and object levels. They can be classified as considering either pixel-level or image-level errors independently. To address this shortcoming, we proposed the simple E-measure, which considers both types of errors simultaneously. Our measure is highly effective and efficient. Extensive experiments using 5 meta-measures on 4 popular datasets demonstrate its effectiveness compared to existing measures. Finally, we created a new dataset (740 maps), consisting of 185 ground-truth maps and 555 human-ranked maps, to examine the correlation between evaluation measures and human judgments.
Limitation. Compared to our metric, the S-measure is mainly designed to capture structural similarity. Images in the PASCAL-S dataset contain more structured objects than the other 3 datasets (ECSSD, SOD, HKU-IS); therefore, the S-measure is slightly better than our metric on PASCAL-S. One failure case can be found in Fig. 10.
Future Work. We will investigate the potential of building a new segmentation model based on the E-measure. Moreover, since our metric consists of simple differentiable functions, a new loss function based on the E-measure can be developed. To help future exploration in this area, our code and dataset will be made publicly available on the web.
Acknowledgments
This research was supported by NSFC (NO. 61620106008, 61572264), Huawei Innovation Research Program, and Fundamental Research Funds for the Central Universities.
References
 [1] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk. Frequencytuned salient region detection. In CVPR. IEEE, 2009.
 [2] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE TPAMI, 33(5):898–916, 2011.
 [3] D. Best and D. Roberts. Algorithm as 89: the upper tail probabilities of spearman’s rho. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24(3):377–379, 1975.
 [4] A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segmentation using an adaptive GMMRF model. In ECCV, pages 428–441. Springer, 2004.
 [5] A. Borji, M.M. Cheng, H. Jiang, and J. Li. Salient object detection: A benchmark. IEEE TIP, 24(12):5706–5722, 2015.
 [6] T. Chen, L. Lin, L. Liu, X. Luo, and X. Li. Disc: Deep image saliency computing via progressive representation learning. IEEE T Neur. Net. Lear., 27(6):1135–1149, 2016.
 [7] M.M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.M. Hu. Global contrast based salient region detection. IEEE TPAMI, 37(3):569–582, 2015.
 [8] G. Csurka, D. Larlus, F. Perronnin, and F. Meylan. What is a good evaluation measure for semantic segmentation? In BMVC, volume 27, page 2013. Citeseer, 2013.
 [9] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
 [10] D.P. Fan, M.M. Cheng, Y. Liu, T. Li, and A. Borji. Structuremeasure: A new way to evaluate foreground maps. In ICCV, pages 4550–4557. IEEE, 2017.
 [11] Q. Hou, M.M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr. Deeply supervised salient object detection with short connections. IEEE TPAMI, 2018.
 [12] P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bull Soc Vaudoise Sci Nat, 37:547–579, 1901.
 [13] C. Kanan and G. Cottrell. Robust classification of objects, faces, and flowers using natural image statistics. In CVPR, pages 2472–2479. IEEE, 2010.
 [14] G. Lee, Y.W. Tai, and J. Kim. Deep saliency with encoded low level distance map and high level features. In CVPR, pages 660–668. IEEE, 2016.
 [15] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain. Contentbased multimedia information retrieval: State of the art and challenges. ACM T Multim. Comput., 2(1):1–19, 2000.
 [16] G. Li and Y. Yu. Visual saliency based on multiscale deep features. In CVPR, pages 5455–5463. IEEE, 2015.
 [17] G. Li and Y. Yu. Deep contrast learning for salient object detection. In CVPR, pages 478–487. IEEE, 2016.
 [18] X. Li, H. Lu, L. Zhang, X. Ruan, and M.H. Yang. Saliency detection via dense and sparse reconstruction. In ICCV, pages 2976–2983. IEEE, 2013.
 [19] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille. The secrets of salient object segmentation. In CVPR, pages 280–287. IEEE, 2014.
 [20] G. Liu and D. Fan. A model of visual attention for natural image retrieval. In Information Science and Cloud Computing Companion (ISCCC), pages 728–733. IEEE, 2013.
 [21] N. Liu and J. Han. DHSNet: Deep hierarchical saliency network for salient object detection. In CVPR, pages 678–686. IEEE, 2016.
 [22] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.Y. Shum. Learning to detect a salient object. IEEE TPAMI, 33(2):353–367, 2011.
 [23] Z. Liu, W. Zou, and O. Le Meur. Saliency tree: A novel saliency detection framework. IEEE TIP, 23(5):1937–1952, 2014.
 [24] R. Margolin, L. ZelnikManor, and A. Tal. How to evaluate foreground maps? In CVPR, pages 248–255. IEEE, 2014.
 [25] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, pages 416–423. IEEE, 2001.
 [26] K. McGuinness and N. E. O’connor. A comparative evaluation of interactive segmentation algorithms. Pattern Recognition, 43(2):434–444, 2010.
 [27] V. Movahedi and J. H. Elder. Design and perceptual validation of performance measures for salient object segmentation. In IEEE CVPRW, pages 49–56, 2010.
 [28] J. PontTuset and F. Marques. Supervised evaluation of image segmentation and object proposal techniques. IEEE TPAMI, 38(7):1465–1478, 2016.
 [29] C. Qin, G. Zhang, Y. Zhou, W. Tao, and Z. Cao. Integration of the saliencybased seed extraction and random walks for image segmentation. Neurocomputing, 129:378–391, 2014.
 [30] U. Rutishauser, D. Walther, C. Koch, and P. Perona. Is bottomup attention useful for object recognition? In CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–37. IEEE, 2004.
 [31] R. Shi, K. N. Ngan, S. Li, R. Paramesran, and H. Li. Visual quality evaluation of image object segmentation: Subjective assessment and objective measure. IEEE TIP, 24(12):5033–5045, 2015.
 [32] P. Villegas and X. Marichal. Perceptuallyweighted evaluation criteria for segmentation masks in video sequences. IEEE TIP, 13(8):1092–1103, 2004.
 [33] J. Wang, H. Jiang, Z. Yuan, M.M. Cheng, X. Hu, and N. Zheng. Salient object detection: A discriminative regional feature integration approach. IJCV, 123(2):251–268, 2017.
 [34] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan. Saliency detection with recurrent fully convolutional networks. In ECCV, pages 825–841. Springer, 2016.
 [35] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 13(4):600–612, 2004.
 [36] Y. Xie, H. Lu, and M.H. Yang. Bayesian saliency via low and mid level cues. IEEE TIP, 22(5):1689–1698, 2013.
 [37] R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multicontext deep learning. In CVPR, pages 1265–1274. IEEE, 2015.