Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation
Abstract
We propose a general framework for searching surrogate losses for mainstream semantic segmentation metrics. This is in contrast to existing loss functions manually designed for individual metrics. The searched surrogate losses can generalize well to other datasets and networks. Extensive experiments on PASCAL VOC and Cityscapes demonstrate the effectiveness of our approach. Code shall be released.
1 Introduction
Loss functions are indispensable components in training deep networks, as they drive the feature learning process for various applications with specific evaluation metrics. However, most metrics, like the commonly used 0-1 classification error, are non-differentiable in their original forms and cannot be directly optimized via gradient-based methods. Empirically, the cross-entropy loss serves well as an effective surrogate objective for a variety of categorization tasks. This is especially prevalent in image semantic segmentation, where various evaluation metrics have been designed to address diverse tasks focusing on different scenarios. Some metrics measure accuracy on the whole image, while others focus more on the segmentation boundaries. Although cross-entropy and its variants work well for many metrics, the misalignment between network training and evaluation still exists and inevitably leads to performance degradation.
Typically, there are two ways of designing metric-specific loss functions in semantic segmentation. The first is to modify the standard cross-entropy loss to meet the target metric (Ronneberger et al., 2015; Wu et al., 2016). The other is to design clever surrogate losses for specific evaluation metrics (Rahman and Wang, 2016; Milletari et al., 2016). Despite the improvements, these handcrafted losses require expertise and are non-trivial to extend to other evaluation metrics.
We propose a general framework for searching surrogate losses for mainstream non-differentiable segmentation metrics. The metrics are first relaxed to the continuous domain by substituting the one-hot prediction and logical operations, which are the non-differentiable parts in most metrics, with their differentiable approximations. Parameterized functions are introduced to approximate the logical operations, ensuring that the loss surfaces are smooth yet accurate for training. The loss parameterization functions can be of arbitrary families defined on $[0,1]$. Parameter search is further conducted on the chosen family so as to optimize the network performance on the validation set under the given evaluation metric. Two essential constraints are introduced to regularize the parameter search space. We find that the searched surrogate losses can effectively generalize to other networks and datasets. Extensive experiments on PASCAL VOC (Everingham et al., 2015) and Cityscapes (Cordts et al., 2016) show that our approach delivers accuracy superior to the existing losses specifically designed for individual segmentation metrics, with a mild computational overhead.
Our contributions can be summarized as follows: 1) Our approach is the first general framework of surrogate loss search for mainstream segmentation metrics. 2) We propose an effective parameter regularization and parameter search algorithm, which finds loss surrogates optimizing the target metric performance with mild computational overhead. 3) The surrogate losses obtained via the proposed search framework advance our understanding of loss function design and are novel contributions in themselves, because they differ from existing loss functions specifically designed for individual metrics and are transferable across datasets and networks.
2 Related Work
Loss function design is an active topic in deep network training (Ma, 2020). In the area of image semantic segmentation, the cross-entropy loss is widely used (Ronneberger et al., 2015; Chen et al., 2018). But the cross-entropy loss is designed for optimizing the global accuracy measure (Rahman and Wang, 2016; Patel et al., 2020), which is not aligned with many other metrics. Numerous studies have been conducted to design proper loss functions for the prevalent evaluation metrics. For the mIoU metric, many works (Ronneberger et al., 2015; Wu et al., 2016) incorporate class frequency to mitigate the class imbalance problem. For the boundary F1 score, the losses at boundary regions are up-weighted (Caliva et al., 2019; Qin et al., 2019), so as to deliver more accurate boundaries. These works carefully analyze the properties of specific evaluation metrics and design the loss functions in a fully handcrafted way, which requires expertise. By contrast, we propose a unified framework for deriving parameterized surrogate losses for various evaluation metrics, in which the parameters are searched by reinforcement learning in an automatic way. The networks trained with the searched surrogate losses deliver accuracy on par with or even superior to those trained with the best handcrafted losses.
Surrogate losses are introduced to derive loss gradients for non-differentiable evaluation metrics. There are usually two approaches to designing them. The first is to handcraft an approximated differentiable metric function. For the IoU measure, Rahman and Wang (2016) propose to approximate the intersection and union separately using the softmax probabilities in a differentiable form, and show its effectiveness on binary segmentation tasks. Berman et al. (2018) further deal with multi-class segmentation problems by extending mIoU from binary inputs to the continuous domain with the convex Lovász extension, and their method outperforms the standard cross-entropy loss in multi-class segmentation tasks. For the F1 measure, the dice loss is proposed by Milletari et al. (2016) as a direct objective by substituting the binary prediction with the softmax probability. In spite of these successes, such losses do not apply to other metrics.
The second solution is to train a network to approximate the target metric. Nagendar et al. (2018) train a network to approximate mIoU. Patel et al. (2020) design a neural network to learn embeddings of predictions and ground truths for tasks other than segmentation. This line of research focuses on minimizing the approximation error w.r.t. the target metrics, but there is no guarantee that the approximations provide good loss signals for training. Moreover, these approximated losses are only employed in a post-tuning setup, still relying on cross-entropy pre-trained models. Our method significantly differs in that we search surrogate losses to directly optimize the evaluation metrics in applications.
AutoML is a long-pursued goal of machine learning (He et al., 2019). Recently, a subfield of AutoML, neural architecture search (NAS), has attracted much attention due to its success in automating the process of neural network architecture design (Zoph and Le, 2017; Pham et al., 2018; Liu et al., 2018). As an essential element of training, the loss function has also raised researchers' interest in automating its design process. Li et al. (2019) and Wang et al. (2020) design search spaces based on existing human-designed loss functions and search for the best combination parameters. There are two issues: a) the search process outputs whole network models rather than loss functions, so for every new network or dataset the expensive search procedure must be conducted again; and b) the search spaces are filled with variants of cross-entropy, which cannot resolve the misalignment between the cross-entropy loss and many target metrics. By contrast, our method outputs searched surrogate loss functions whose form closely follows the target metrics, and which are transferable between networks and datasets.
3 Revisiting Evaluation Metrics for Semantic Segmentation
Various evaluation metrics are defined for semantic segmentation, addressing diverse tasks focusing on different scenarios. Most of them fall into three typical classes: Acc-based, IoU-based, and F1-score-based. This section revisits the evaluation metrics under a unified notation set.
Table 1 summarizes the mainstream evaluation metrics. The notation is as follows: suppose the validation set is composed of $N$ images, labeled with categories from $C$ classes (background included). Let $I_n$ be the $n$-th image and $Y_n = \{y^{j}_{n,h,w}\}$ be the corresponding ground-truth segmentation mask. Here $y_{n,h,w} \in \{0,1\}^C$ is a one-hot vector, where $y^{j}_{n,h,w}$ indicates whether the pixel at spatial location $(h, w)$ belongs to the $j$-th category ($j = 1, \dots, C$). In evaluation, the ground-truth segmentation mask is compared to the network prediction $\hat{Y}_n = \{\hat{y}^{j}_{n,h,w}\}$, where $\hat{y}^{j}_{n,h,w} \in \{0,1\}$. $\hat{Y}_n$ is quantized from the continuous scores produced by the network (by the argmax operation).
Type  Name  Formula

Acc-based  Global Accuracy  $\mathrm{gAcc} = \dfrac{\sum_{n,h,w,j} \mathrm{AND}(\hat{y}^{\,j}_{n,h,w},\, y^{j}_{n,h,w})}{\sum_{n,h,w} 1}$  (1)

Acc-based  Mean Accuracy  $\mathrm{mAcc} = \dfrac{1}{C} \sum_{j} \dfrac{\sum_{n,h,w} \mathrm{AND}(\hat{y}^{\,j}_{n,h,w},\, y^{j}_{n,h,w})}{\sum_{n,h,w} y^{j}_{n,h,w}}$  (2)

IoU-based  Mean IoU  $\mathrm{mIoU} = \dfrac{1}{C} \sum_{j} \dfrac{\sum_{n,h,w} \mathrm{AND}(\hat{y}^{\,j}_{n,h,w},\, y^{j}_{n,h,w})}{\sum_{n,h,w} \mathrm{OR}(\hat{y}^{\,j}_{n,h,w},\, y^{j}_{n,h,w})}$  (3)

IoU-based  Frequency Weighted IoU  $\mathrm{FWIoU} = \sum_{j} \dfrac{\sum_{n,h,w} y^{j}_{n,h,w}}{\sum_{n,h,w,k} y^{k}_{n,h,w}} \cdot \dfrac{\sum_{n,h,w} \mathrm{AND}(\hat{y}^{\,j}_{n,h,w},\, y^{j}_{n,h,w})}{\sum_{n,h,w} \mathrm{OR}(\hat{y}^{\,j}_{n,h,w},\, y^{j}_{n,h,w})}$  (4)

IoU-based  Boundary IoU  as Eq. (3), evaluated only on the boundary pixels given by $\mathrm{BD}(Y) = \mathrm{XOR}(Y,\, \mathrm{minpool}(Y))$ and $\mathrm{BD}(\hat{Y})$  (5)

F1-score-based  Boundary F1 Score  $\mathrm{precision} = \dfrac{\sum \mathrm{AND}(\mathrm{BD}(\hat{Y}),\, \mathrm{maxpool}(\mathrm{BD}(Y)))}{\sum \mathrm{BD}(\hat{Y})}$, $\mathrm{recall} = \dfrac{\sum \mathrm{AND}(\mathrm{BD}(Y),\, \mathrm{maxpool}(\mathrm{BD}(\hat{Y})))}{\sum \mathrm{BD}(Y)}$, $\mathrm{BF1} = \dfrac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$  (6)
Acc-based metrics. The global accuracy measure (gAcc) counts the fraction of pixels that are correctly classified. It can be written with the logical operator AND, as in Eq. (1). The gAcc metric counts each pixel equally, so the results on long-tailed categories have little impact on the metric value. The mean accuracy (mAcc) metric mitigates this by normalizing within each category, as in Eq. (2).
IoU-based metrics. The evaluation is based on set similarity rather than pixel accuracy. The intersection-over-union (IoU) score is evaluated between the prediction and the ground-truth mask of each category. The mean IoU (mIoU) metric averages the IoU scores of all categories, as in Eq. (3).
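As a concrete reading of Eq. (3), mIoU can be computed from integer label maps with the AND/OR operations directly. This NumPy sketch uses our own function name and shapes (it is illustrative, not the paper's evaluation code), and skips categories absent from both maps:

```python
import numpy as np

def miou(pred, gt, num_classes):
    """Mean IoU over classes, computed from integer label maps.

    Per class j: IoU_j = |AND(pred_j, gt_j)| / |OR(pred_j, gt_j)|,
    then IoU scores are averaged over the classes that appear.
    """
    ious = []
    for j in range(num_classes):
        p = pred == j                        # one-hot slice for class j
        g = gt == j
        inter = np.logical_and(p, g).sum()   # the AND term of Eq. (3)
        union = np.logical_or(p, g).sum()    # the OR term of Eq. (3)
        if union > 0:                        # skip classes absent everywhere
            ious.append(inter / union)
    return float(np.mean(ious))
```

A perfect prediction yields an mIoU of 1.0; any disagreement lowers the per-class ratios.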
Among the variants, the frequency weighted IoU (FWIoU) metric weighs each category's IoU score by the category's pixel count, as in Eq. (4). The boundary IoU (BIoU) (Kohli and Torr, 2009) metric only considers the segmentation quality around the boundary, so it picks out the boundary pixels in evaluation and ignores the rest. It can be calculated with Eq. (5), in which $\mathrm{BD}(Y)$ denotes the boundary region of mask $Y$, derived by applying the XOR operation between the mask and its min-pooled counterpart. The kernel size of the min pooling can be chosen manually; a larger kernel size corresponds to a wider boundary, which gives rise to more error tolerance in the metric.
F1-score-based metrics. The F1-score is a criterion that takes both precision and recall into consideration. A well-known metric of this type is the boundary F1-score (BF1-score) (Csurka et al., 2013), which is widely used for evaluating boundary segmentation accuracy. The computation of precision and recall in the BF1-score is as in Eq. (6), where $\mathrm{BD}(Y)$ and $\mathrm{BD}(\hat{Y})$ are derived as in Eq. (5). Max pooling, $\mathrm{maxpool}(\cdot)$, is applied on the boundary regions to allow error tolerance.
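To make Eqs. (5)-(6) concrete, the boundary map and BF1 score can be sketched with plain NumPy pooling. The helper names, kernel sizes, and tolerance are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def _pool(mask, k, op, pad_mode):
    """k×k min/max pooling with stride 1 on a binary {0,1} mask."""
    pad = k // 2
    m = np.pad(mask, pad, mode=pad_mode)
    H, W = mask.shape
    out = m[pad:pad + H, pad:pad + W].copy()
    for dy in range(k):
        for dx in range(k):
            out = op(out, m[dy:dy + H, dx:dx + W])
    return out

def boundary(mask, k=3):
    """BD(Y) = XOR(Y, minpool(Y)): the pixels eroded away by min pooling."""
    mp = _pool(mask, k, np.minimum, "edge")
    return np.logical_xor(mask, mp)

def bf1(pred, gt, k=3, tol=3):
    """Boundary F1: precision/recall between boundary maps; max pooling
    (dilation) of the target boundary provides the tolerance of Eq. (6)."""
    pb = boundary(pred, k).astype(np.uint8)
    gb = boundary(gt, k).astype(np.uint8)
    prec = (pb & _pool(gb, tol, np.maximum, "constant")).sum() / max(pb.sum(), 1)
    rec = (gb & _pool(pb, tol, np.maximum, "constant")).sum() / max(gb.sum(), 1)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)
```

For a 3×3 square of ones, min pooling erodes it to its center pixel, so the boundary map is the remaining ring of 8 pixels; a prediction identical to the ground truth scores BF1 = 1.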
4 Auto Seg-Loss Framework
In the Auto Seg-Loss framework, the evaluation metrics are transformed into continuous surrogate losses with learnable parameters, which are further optimized. Fig. 1 illustrates our approach.
4.1 Extending metrics to surrogates
As shown in Section 3, most segmentation metrics are non-differentiable because they take one-hot prediction maps as input and contain binary logical operations. We extend these metrics into continuous loss surrogates by smoothing the non-differentiable operations within.
Extending the One-hot Operation. The one-hot prediction map, $\hat{Y}$, is derived by picking the highest-scoring category at each pixel, which is then turned into one-hot form. Here, we approximate the one-hot predictions with softmax probabilities, as
$$\tilde{y}^{\,j}_{n,h,w} = \frac{\exp(z^{j}_{n,h,w})}{\sum_{k=1}^{C} \exp(z^{k}_{n,h,w})}, \qquad (7)$$
where $z^{j}_{n,h,w}$ is the category score output by the network (without normalization). The approximated one-hot prediction is denoted by $\tilde{Y}$.
Extending Logical Operations. As shown in Table 1, the non-differentiable logical operations $\mathrm{AND}$, $\mathrm{OR}$, and $\mathrm{XOR}$ are indispensable components in these metrics. Because the $\mathrm{XOR}$ operation can be constructed from $\mathrm{AND}$ and $\mathrm{OR}$, $\mathrm{XOR}(y_1, y_2) = \mathrm{OR}(y_1, y_2) - \mathrm{AND}(y_1, y_2)$, we focus on extending $\mathrm{AND}$ and $\mathrm{OR}$ to the continuous domain.
Following the common practice, the logical operators are substituted with arithmetic operators
$$\mathrm{AND}(y_1, y_2) = y_1 y_2, \qquad \mathrm{OR}(y_1, y_2) = y_1 + y_2 - y_1 y_2, \qquad (8)$$
where $y_1, y_2 \in \{0, 1\}$. Eq. (8) can be directly extended to take continuous $\tilde{y}_1, \tilde{y}_2 \in [0, 1]$ as inputs. By such an extension, together with the approximated one-hot operation, a naïve version of differentiable surrogate losses can be obtained. The strength of such surrogates is that they are directly derived from the metrics, which significantly reduces the gap between training and evaluation. However, there is no guarantee that the loss surfaces formed by naïvely extending Eq. (8) provide accurate loss signals. To adjust the loss surfaces, we parameterize the $\mathrm{AND}$ and $\mathrm{OR}$ functions as
$$f_{\mathrm{AND}}(\tilde{y}_1, \tilde{y}_2; \theta) = g(\tilde{y}_1; \theta)\, g(\tilde{y}_2; \theta), \qquad f_{\mathrm{OR}}(\tilde{y}_1, \tilde{y}_2; \theta) = g(\tilde{y}_1; \theta) + g(\tilde{y}_2; \theta) - g(\tilde{y}_1; \theta)\, g(\tilde{y}_2; \theta), \qquad (9)$$
where $g(\cdot; \theta)$ is a scalar function parameterized by $\theta$.
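For intuition, the naïve extension alone (softmax probabilities plus Eq. (8)) already yields a differentiable surrogate. The following NumPy sketch of a soft mIoU loss uses our own function names and shapes, and is illustrative rather than the paper's implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_miou_loss(scores, gt_onehot):
    """Naive differentiable mIoU surrogate: Eq. (3) with argmax replaced
    by softmax and AND(a,b)=ab, OR(a,b)=a+b-ab from Eq. (8).
    Shapes: (H, W, C) for both raw scores and one-hot ground truth."""
    y = softmax(scores)                                       # soft one-hot
    inter = (y * gt_onehot).sum(axis=(0, 1))                  # soft AND
    union = (y + gt_onehot - y * gt_onehot).sum(axis=(0, 1))  # soft OR
    return 1.0 - float((inter / union).mean())
```

A confident correct prediction drives the loss toward 0, while a confidently wrong one drives it toward 1, so the gradient points in a metric-aligned direction even without parameterization.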
The parameterized functions can be from arbitrary function families defined on $[0, 1]$, e.g., piecewise linear functions and piecewise Bézier curves. Within a chosen function family, the parameters control the shape of the loss surfaces. We seek to search for the optimal parameters so as to maximize the given evaluation metric.
Meanwhile, optimal parameter search is non-trivial. With the introduced parameters, the plasticity of the loss surfaces is strong. The parameterized loss surfaces may well be chaotic, or be far away from the target evaluation metric even at binary inputs. For more effective parameter search, we regularize the loss surfaces by introducing two constraints on $g$.
The truth-table constraint is introduced to enforce that the surrogate loss surfaces take the same values as the evaluation metric at binary inputs. This is applied by enforcing
$$g(0; \theta) = 0, \qquad g(1; \theta) = 1. \qquad (10)$$
Thus, the parameterized functions preserve the behavior of the corresponding logical operations on binary inputs $\{0, 1\}$.
The monotonicity constraint is introduced based on the observation of a monotonicity tendency in the truth tables of $\mathrm{AND}$ and $\mathrm{OR}$. It pushes the loss surfaces towards a benign landscape, avoiding dramatic non-smoothness. The monotonicity constraint is enforced on $f_{\mathrm{AND}}$ and $f_{\mathrm{OR}}$, as $\partial f_{\mathrm{AND}} / \partial \tilde{y}_i \ge 0$ and $\partial f_{\mathrm{OR}} / \partial \tilde{y}_i \ge 0$ for $i = 1, 2$.
Applying the chain rule and the truth-table constraint, the monotonicity constraint implies
$$\frac{\mathrm{d}\, g(\tilde{y}; \theta)}{\mathrm{d} \tilde{y}} \ge 0. \qquad (11)$$
Empirically we find it is important to enforce these two constraints in parameterization.
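To illustrate Eqs. (9)-(11), the sketch below instantiates $f_{\mathrm{AND}}$ and $f_{\mathrm{OR}}$ with an example scalar family $g(x) = x^{\gamma}$ ($\gamma > 0$), which satisfies both constraints. This power family is purely our illustrative choice; in practice the Bézier family of Section 4.2 would be substituted for $g$:

```python
def f_and(y1, y2, g):
    # parameterized AND of Eq. (9): g(y1) * g(y2)
    return g(y1) * g(y2)

def f_or(y1, y2, g):
    # parameterized OR of Eq. (9): g(y1) + g(y2) - g(y1) * g(y2)
    return g(y1) + g(y2) - g(y1) * g(y2)

def make_power_g(gamma):
    """Example family x**gamma: g(0)=0 and g(1)=1 (truth-table constraint),
    and g is non-decreasing on [0,1] for gamma > 0 (monotonicity)."""
    return lambda x: x ** gamma
```

With any such admissible $g$, the parameterized operations reproduce the logical truth tables exactly at binary inputs while reshaping the surface in between.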
Extending Evaluation Metrics. Now we can extend the metrics into surrogate losses by a) replacing the one-hot predictions with softmax probabilities, and b) substituting the logical operations with the parameterized functions. Note that if a metric contains several logical operations, their parameters are not shared. The collection of parameters in one metric is denoted as $\Theta$. For a segmentation network $\mathcal{N}$ with weights $\omega$ and an evaluation dataset $S$, the score of the evaluation metric is denoted as $\xi(\mathcal{N}_{\omega}, S)$, and the parameterized surrogate loss is denoted as $\mathcal{L}(\mathcal{N}_{\omega}, S; \Theta)$.
4.2 Surrogate parameterization
Here we choose the piecewise Bézier curve for parameterizing $g$, as it is easy to enforce the constraints via its control points. We also verify the effectiveness of parameterizing $g$ by piecewise linear functions. See Fig. 2 for visualization and Appendix B for more details.
A piecewise Bézier curve consists of a series of quadratic Bézier curves, where the last control point of one curve segment coincides with the first control point of the next. If there are $M$ segments in a piecewise Bézier curve, the $k$-th segment is defined as
$$B_k(t) = (1 - t)^2 P_{k,0} + 2t(1 - t) P_{k,1} + t^2 P_{k,2}, \quad t \in [0, 1], \qquad (12)$$
where $t$ traverses the $k$-th segment, and $P_{k,i} = (x_{k,i}, y_{k,i})$ denotes the $i$-th control point on the $k$-th segment, in which $x$ and $y$ index the 2-d plane axes. A piecewise Bézier curve with $M$ segments has $2M + 1$ control points in total. To parameterize $g(\cdot; \theta)$, we assign
$$\tilde{y} = (1 - t)^2 x_{k,0} + 2t(1 - t) x_{k,1} + t^2 x_{k,2}, \qquad (13a)$$
$$g(\tilde{y}; \theta) = (1 - t)^2 y_{k,0} + 2t(1 - t) y_{k,1} + t^2 y_{k,2}, \qquad (13b)$$
$$\text{s.t.} \quad x_{k,0} \le \tilde{y} \le x_{k,2}, \qquad (13c)$$
where $\theta = \{(x_{k,i}, y_{k,i})\}$ is the control point set, $k = 0, \dots, M - 1$, $i = 0, 1, 2$. Given an input $\tilde{y}$, the segment index $k$ and the transversal parameter $t$ are derived from Eq. (13c) and Eq. (13a), respectively. Then $g(\tilde{y}; \theta)$ is assigned as in Eq. (13b). Because $g$ is defined on $[0, 1]$, we arrange the control points along the $x$ axis in non-decreasing order, with the $x$ coordinates of the first and the last control points fixed at $0$ and $1$, respectively.
The strength of the piecewise Bézier curve is that the curve shape is defined explicitly via the control points. Here we enforce the truth-table and the monotonicity constraints on the control points via
$$y_{0,0} = 0, \quad y_{M-1,2} = 1, \qquad \text{(truth-table constraint)}$$
$$y_{k,0} \le y_{k,1} \le y_{k,2}, \quad \forall k. \qquad \text{(monotonicity constraint)}$$
To fulfill the above restrictions during optimization, each intermediate control point coordinate is parameterized by its relative position between its two neighboring coordinates, e.g.,
$$x_{k,1} = x_{k,0} + \theta^{x}_{k,1} \left(x_{k,2} - x_{k,0}\right), \qquad y_{k,1} = y_{k,0} + \theta^{y}_{k,1} \left(y_{k,2} - y_{k,0}\right),$$
with $y_{0,0} = 0$ and $y_{M-1,2} = 1$ fixed. So every parameter is in the range $(0, 1)$, and it is straightforward to compute the actual coordinates of the control points from this parameterized form. Such parameterization makes the parameters independent of each other, and thus simplifies the optimization. By default, we use a piecewise Bézier curve with two segments to parameterize $g$.
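Evaluating $g$ from Eqs. (12)-(13) can be sketched as follows. The function name and control-point layout are ours; the quadratic in Eq. (13a) is solved for $t$ in closed form, assuming $x$-monotone control points (as the constraints guarantee):

```python
import numpy as np

def bezier_eval(x, ctrl):
    """Evaluate a piecewise quadratic Bézier g(x) on [0, 1].

    ctrl: list of segments, each 3 control points [(x0,y0),(x1,y1),(x2,y2)],
    with consecutive segments sharing an endpoint and x non-decreasing.
    """
    for (x0, y0), (x1, y1), (x2, y2) in ctrl:
        if x0 <= x <= x2:
            # solve x = (1-t)^2 x0 + 2t(1-t) x1 + t^2 x2 for t in [0, 1]
            a, b, c = x0 - 2 * x1 + x2, 2 * (x1 - x0), x0 - x
            if abs(a) < 1e-12:                  # degenerate: linear in t
                t = -c / b if abs(b) > 1e-12 else 0.0
            else:                               # '+' root lies in [0, 1]
                t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
            return (1 - t) ** 2 * y0 + 2 * t * (1 - t) * y1 + t ** 2 * y2
    raise ValueError("x outside [0, 1]")
```

Placing all control points on the diagonal recovers the identity mapping (the naïve surrogate of Eq. (8)); moving the interior points bends the loss surface while the endpoint constraints keep the truth table intact.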
4.3 Surrogate parameter optimization
Algorithm 1 describes our parameter search algorithm. The training set is split into two subsets, $S_{\mathrm{train}}$ for training and $S_{\mathrm{val}}$ for evaluation in the search algorithm, respectively. Specifically, given a segmentation network $\mathcal{N}$ with weights $\omega$, our search target is the parameter set $\Theta$ that maximizes the evaluation metric on the held-out subset:
$$\max_{\Theta} \; \xi\!\left(\mathcal{N}_{\omega^{*}(\Theta)}, S_{\mathrm{val}}\right), \qquad \text{s.t.} \quad \omega^{*}(\Theta) = \arg\min_{\omega} \mathcal{L}\!\left(\mathcal{N}_{\omega}, S_{\mathrm{train}}; \Theta\right). \qquad (14)$$
To optimize Eq. (14), the segmentation network is trained with SGD as the inner-level problem. At the outer level, the surrogate parameters are searched via the PPO2 algorithm (Schulman et al., 2017). The process consists of $T$ sampling steps. In the $t$-th step, we aim to explore the search space around the distribution obtained from the $(t-1)$-th step. Here, $K$ parameter samples $\{\Theta_{t,k}\}_{k=1}^{K}$ are drawn independently from a truncated normal distribution (Burkardt, 2014), $\Theta_{t,k} \sim \mathcal{N}^{\mathrm{trunc}}(\mu_t, \sigma^2 I)$, with each variable in the range $[0, 1]$. In it, $\mu_t$ and $\sigma^2 I$ denote the mean and covariance of the parent normal distribution ($\sigma$ is fixed as 0.2 in this paper), and $\mu_t$ summarizes the information from the $(t-1)$-th step. $K$ surrogate losses are constructed with the sampled parameters, which drive the training of $K$ segmentation networks separately. To optimize the outer-level problem, we evaluate these models with the target metric and take the evaluation scores as rewards for PPO2. Following the PPO2 algorithm, $\mu_{t+1}$ is computed by maximizing the clipped objective
$$\frac{1}{K} \sum_{k=1}^{K} \min\!\left(\frac{p(\Theta_{t,k}; \mu, \sigma^2 I)}{p(\Theta_{t,k}; \mu_t, \sigma^2 I)}\, r_{t,k}, \;\; \mathrm{clip}\!\left(\frac{p(\Theta_{t,k}; \mu, \sigma^2 I)}{p(\Theta_{t,k}; \mu_t, \sigma^2 I)},\, 1 - \epsilon,\, 1 + \epsilon\right) r_{t,k}\right),$$
where $\min(\cdot, \cdot)$ picks the smaller of its inputs, $\mathrm{clip}(\cdot, 1 - \epsilon, 1 + \epsilon)$ clips the ratio to be within $1 - \epsilon$ and $1 + \epsilon$, and $p(\cdot; \mu, \sigma^2 I)$ is the PDF of the truncated normal distribution. The reward $r_{t,k}$ is the evaluation score of the $k$-th trained model; note that the mean reward of the $K$ samples is subtracted when computing $r_{t,k}$, for better convergence. After $T$ steps, the mean $\mu_t$ with the highest average evaluation score is output as the final parameters $\Theta^{*}$.
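The outer loop can be sketched as follows. For brevity this stand-in replaces the PPO2 update with a simple best-sample hill-climbing step and a toy reward in place of training and evaluating segmentation networks; `search`, `sample_truncated`, and all names are our own, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_truncated(mu, sigma, size):
    """Rejection-sample N(mu, sigma^2) truncated to [0, 1], elementwise."""
    s = rng.normal(mu, sigma, size)
    while ((s < 0) | (s > 1)).any():
        bad = (s < 0) | (s > 1)
        s[bad] = rng.normal(np.broadcast_to(mu, size)[bad], sigma)
    return s

def search(reward, dim, steps=10, samples=8, sigma=0.2):
    """Sample loss parameters around mu, score each (here: a toy reward
    instead of a trained network), and move mu toward the best sample."""
    mu = np.full(dim, 0.5)
    best_mu, best_r = mu.copy(), reward(mu)
    for _ in range(steps):
        thetas = [sample_truncated(mu, sigma, dim) for _ in range(samples)]
        rewards = [reward(th) for th in thetas]
        mu = thetas[int(np.argmax(rewards))]
        if max(rewards) > best_r:
            best_mu, best_r = mu.copy(), max(rewards)
    return best_mu
```

In the actual framework, each `reward(th)` call corresponds to training a proxy segmentation network with the surrogate built from `th` and evaluating the target metric on the held-out subset, which is why the light proxy task matters for efficiency.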
Empirically, we find the searched losses have good transferability, i.e., they can be applied to different datasets and networks. Benefiting from this, we use a light proxy task for parameter search: a smaller image size, a shorter learning schedule, and a lightweight network. Thus, the whole search process is quite efficient (8 hours on PASCAL VOC with 8 NVIDIA Tesla V100 GPUs). More details are in Appendix A. In addition, the search process needs to be conducted only once for a specific metric, and the resulting surrogate loss can be directly used for training henceforth.
5 Experiments
We evaluate on the PASCAL VOC 2012 (Everingham et al., 2015) and the Cityscapes (Cordts et al., 2016) datasets. We use Deeplabv3+ (Chen et al., 2018) with ResNet-50/101 (He et al., 2016) as the network model. During the surrogate parameter search, we randomly sample 1500 training images in PASCAL VOC and 500 training images in Cityscapes to form the hold-out evaluation set, respectively. The remaining training images form the training set used in the search. The initial sampling mean is set so that $g$ is initialized close to an identity mapping. The backbone network is ResNet-50. The images are downsampled to a resolution of 128 × 128. SGD lasts only 1000 iterations with a mini-batch size of 32. After the search procedure, we retrain the segmentation networks with ResNet-101 using the searched losses on the full training set and evaluate them on the actual validation set. The retraining settings are the same as for Deeplabv3+ (Chen et al., 2018), except that the loss function is substituted by the obtained surrogate loss. The search time is measured on 8 NVIDIA Tesla V100 GPUs. More details are in Appendix A.
5.1 Searching for Different Metrics
In Table 2, we compare our searched surrogate losses against the widely-used cross-entropy loss and its variants, as well as other metric-specific surrogate losses. We also sought to compare with the AutoML-based method of Li et al. (2019), which was originally designed for other tasks, but we could not obtain reasonable results due to convergence issues. The results show that our searched losses are on par with or better than the previous losses on their target metrics. It is interesting to note that the obtained surrogates for boundary metrics (such as BIoU and BF1) only focus on the boundary areas; see Appendix C for further discussion. We also tried training segmentation networks driven jointly by the searched mIoU and BIoU/BF1 surrogate losses. Such combined losses refine the boundaries while keeping reasonable global performance.
Dataset  PASCAL VOC  Cityscapes  

Loss Function  mIoU  FWIoU  BIoU  BF1  mAcc  gAcc  mIoU  FWIoU  BIoU  BF1  mAcc  gAcc 
Cross Entropy  78.69  91.31  70.61  65.30  87.31  95.17  79.97  93.33  62.07  62.24  87.01  96.44 
WCE (Ronneberger et al., 2015)  69.60  85.64  61.80  37.59  92.61  91.11  73.01  90.51  53.07  51.19  89.22  94.56 
DPCE (Caliva et al., 2019)  79.82  91.76  71.87  66.54  87.76  95.45  80.27  93.38  62.57  65.99  86.99  96.46 
SSIM (Qin et al., 2019)  79.26  91.68  71.54  66.35  87.87  95.38  80.65  93.22  63.04  72.20  86.88  96.39 
DiceLoss (Milletari et al., 2016)  77.78  91.34  69.85  64.38  87.47  95.11  79.30  93.25  60.93  59.94  86.38  96.39 
Lovász (Berman et al., 2018)  79.72  91.78  72.47  66.65  88.64  95.42  77.67  92.51  56.71  53.48  82.05  96.03
Searched mIoU  80.97  92.09  73.44  68.86  88.23  95.68  80.67  93.30  63.05  67.97  87.20  96.44 
Searched FWIoU  80.00  91.93  75.14  65.67  89.23  95.44  79.42  93.33  61.71  59.68  87.96  96.37 
Searched BIoU  48.97  69.89  79.27  38.99  81.28  62.64  45.89  39.80  63.89  38.29  62.80  58.15 
Searched BF1  1.93  0.96  7.39  74.83  6.51  2.66  6.78  3.19  18.37  77.40  12.09  8.19 
Searched mAcc  69.80  85.86  72.85  35.62  92.66  91.28  74.10  90.79  54.62  53.45  89.22  94.75 
Searched gAcc  79.73  91.76  74.09  64.41  88.95  95.47  79.41  93.30  61.65  62.04  87.08  96.51 
Searched mIoU + BIoU  81.19  92.19  76.89  69.56  88.36  95.75  80.43  93.34  63.88  65.87  87.03  96.45 
Searched mIoU + BF1  78.72  90.80  71.81  73.57  86.70  94.88  78.30  93.00  61.62  71.73  87.13  96.23 
5.2 Generalization of the Loss
Generalization among datasets. Table 3 evaluates the generalization ability of our searched loss surrogates across datasets. Due to limited computational resources, we train networks only with the searched mIoU, BF1 and mAcc surrogate losses. The results show that our searched surrogate losses generalize well between these two datasets, despite their quite different scenes and categories.
Datasets  Cityscapes → VOC  VOC → Cityscapes

Loss Function  mIoU  FWIoU  BIoU  BF1  mAcc  gAcc  mIoU  FWIoU  BIoU  BF1  mAcc  gAcc 
Cross Entropy  78.69  91.31  70.61  65.30  87.31  95.17  79.97  93.33  62.07  62.24  87.01  96.44 
Searched mIoU  80.05  91.72  73.97  67.61  88.01  95.45  80.67  93.31  62.96  66.48  87.36  96.44 
Searched BF1  1.84  0.93  7.42  75.85  6.48  1.47  6.67  3.20  19.00  77.99  12.12  4.09 
Searched mAcc  70.90  86.29  73.43  37.18  93.19  91.43  73.50  90.68  54.34  54.04  88.66  94.68 
Generalization among segmentation networks. The surrogate losses are searched with ResNet-50 + DeepLabv3+ on PASCAL VOC. The searched losses then drive the training of ResNet-101 + DeepLabv3+, PSPNet (Zhao et al., 2017) and HRNet (Sun et al., 2019) on PASCAL VOC. Table 4 shows the results, which demonstrate that our searched loss functions can be applied to various semantic segmentation networks.
Network  R50-DeepLabv3+  R101-DeepLabv3+  R101-PSPNet  HRNetV2p-W48

Loss Function  mIoU  BF1  mAcc  mIoU  BF1  mAcc  mIoU  BF1  mAcc  mIoU  BF1  mAcc 
Cross Entropy  76.22  61.75  85.43  78.69  65.30  87.31  77.91  64.70  85.71  76.35  61.19  85.12 
Searched mIoU  78.35  66.93  85.53  80.97  68.86  88.23  78.93  65.65  87.42  77.26  63.52  86.80 
Searched BF1  1.35  70.81  6.05  1.43  73.54  6.12  1.62  71.84  6.33  1.34  68.41  5.99 
Searched mAcc  69.82  36.92  91.61  69.80  35.62  92.66  71.66  39.44  92.06  68.22  35.90  91.46 
5.3 Ablation
Parameterization and constraints. Table 6 ablates the parameterization and the search-space constraints. A naïve surrogate without parameters delivers much lower accuracy, and without the constraints the performance drops or the search even fails, indicating that the parameterization and both constraints are essential.
Proxy tasks for parameter search. Table 6 ablates this. The bottom row is our default setting, with a lightweight backbone, a downsampled image size, and a shorter learning schedule. The default setting delivers accuracy on par with heavier settings, which is consistent with the generalization ability of our surrogate losses. Thus we can improve the search efficiency via light proxy tasks.
Parameter search algorithm. Fig. 3 compares the employed PPO2 (Schulman et al., 2017) algorithm with random search. The much better performance of PPO2 suggests that surrogate loss search is nontrivial and reinforcement learning helps to improve the search efficiency.
Parameter  Truthtable  Monotonicity  VOC mIoU 
✗  ✗  ✗  46.99 
✓  ✗  ✗  Fail 
✓  ✓  ✗  77.76 
✓  ✓  ✓  80.64 
Backbone  Image Size  Iterations  Time (hours)  VOC mIoU

R50  256 × 256  1000  33.0  81.15
R50  128 × 128  2000  17.1  80.56
R101  128 × 128  1000  13.3  80.75
R50  128 × 128  1000  8.5  80.97
6 Conclusion
The introduced Auto Seg-Loss is a powerful framework for searching parameterized surrogate losses for mainstream segmentation evaluation metrics. The non-differentiable operators are substituted by their parameterized continuous counterparts, and the parameters are optimized to improve the final evaluation metrics under essential constraints. It would be interesting to extend the framework to more tasks, like object detection, pose estimation, and machine translation.
Acknowledgments
The work is supported by the National Key R&D Program of China (2020AAA0105200), Beijing Academy of Artificial Intelligence and the Institute for Guo Qiang of Tsinghua University.
Appendix A Implementation details
Datasets. We evaluate our approach on the PASCAL VOC 2012 (Everingham et al., 2015) and the Cityscapes (Cordts et al., 2016) datasets. For PASCAL VOC, we follow Chen et al. (2017) to augment with the extra annotations provided by Hariharan et al. (2011). For Cityscapes, we follow the standard evaluation protocol in Cordts et al. (2016).
During the surrogate parameter search, we randomly sample 1500 training images in PASCAL VOC and 500 training images in Cityscapes to form the hold-out evaluation set, respectively. The remaining training images form the training set used in the search. After the search procedure, we retrain the segmentation networks with the searched losses on the full training set and evaluate them on the actual validation set.
Implementation Details. We use Deeplabv3+ (Chen et al., 2018) with an ImageNet-pretrained ResNet-50/101 (He et al., 2016) backbone as the network model. The segmentation head is randomly initialized. In surrogate parameter search, the backbone is ResNet-50, and the initial sampling mean is set so that $g$ is initialized close to an identity mapping. The training and validation images are downsampled to a resolution of 128 × 128. In SGD training, the mini-batch size is 32 images, and the training lasts 1000 iterations. The initial learning rate is 0.02, decayed polynomially with power 0.9 and a minimum learning rate of 1e-4. The momentum and weight decay are set to 0.9 and 5e-4, respectively. For faster convergence, the learning rate of the segmentation head is multiplied by 10. The search procedure is conducted for $T$ steps, with $K$ loss parameter samples drawn in each step. In PPO2 (Schulman et al., 2017), the clipping threshold is $\epsilon = 0.2$, and the sampling mean is updated by 100 gradient steps per sampling step. After the surrogate parameter search, the retraining settings are the same as for Deeplabv3+ (Chen et al., 2018), except that the loss function is substituted by the searched surrogate loss function. The backbone is ResNet-101 by default.
Appendix B Parameterization with piecewise linear functions
Here we choose the continuous piecewise linear function for parameterizing $g$, where the form of the constraints and parameters is very similar to that of the piecewise Bézier curve described in Section 4.2. Experimental results on PASCAL VOC 2012 (Everingham et al., 2015) are presented at the end of this section.
A continuous piecewise linear function consists of multiple line segments, where the right endpoint of one line segment coincides with the left endpoint of the next. Suppose there are $M$ line segments in a piecewise linear function; then the $k$-th line segment is defined as
$$B_k(t) = (1 - t) P_{k,0} + t P_{k,1}, \quad t \in [0, 1], \qquad (15)$$
where $t$ traverses the $k$-th line segment, and $P_{k,0} = (x_{k,0}, y_{k,0})$ and $P_{k,1} = (x_{k,1}, y_{k,1})$ are the left and right endpoints of the $k$-th line segment, respectively, in which $x$ and $y$ index the 2-d plane axes.
To parameterize $g$ via continuous piecewise linear functions, we assign
$$\tilde{y} = (1 - t)\, x_{k,0} + t\, x_{k,1}, \qquad (16a)$$
$$g(\tilde{y}; \theta) = (1 - t)\, y_{k,0} + t\, y_{k,1}, \qquad (16b)$$
$$\text{s.t.} \quad x_{k,0} \le \tilde{y} \le x_{k,1}, \qquad (16c)$$
where $\theta = \{(x_{k,i}, y_{k,i})\}$ is the collection of all endpoints. Given an input $\tilde{y}$, the segment index $k$ and the transversal parameter $t$ are derived from Eq. (16c) and Eq. (16a), respectively. Then $g(\tilde{y}; \theta)$ is assigned as in Eq. (16b).
Because $g$ is a function defined on $[0, 1]$, we arrange the endpoints in the $x$ axis as
$$0 = x_{0,0} \le x_{0,1} = x_{1,0} \le \dots \le x_{M-2,1} = x_{M-1,0} \le x_{M-1,1} = 1, \qquad (17)$$
where the $x$ coordinates of the first and the last endpoints are fixed at $0$ and $1$, respectively.
We enforce the two constraints introduced in Section 4.1 on the search space through the parameters $\theta$. The two constraints can be formulated as
$$y_{0,0} = 0, \quad y_{M-1,1} = 1, \qquad \text{(truth-table constraint)}$$
$$y_{k,0} \le y_{k,1}, \quad \forall k. \qquad \text{(monotonicity constraint)}$$
In practice, we divide the domain $[0, 1]$ into $M$ subintervals uniformly and fix the $x$ coordinates of the endpoints at the intersections of these intervals, i.e., $x_{k,0} = k / M$ and $x_{k,1} = (k + 1) / M$. The specific form of the parameters is then given by the $y$-increments of the segments,
$$\theta = \{\Delta y_k\}_{k=0}^{M-1}, \qquad y_{k,1} = \sum_{j \le k} \Delta y_j, \quad y_{k,0} = y_{k-1,1}.$$
According to the constraints, the parameters need to satisfy
$$\Delta y_k \ge 0 \;\; \forall k, \qquad \sum_{k=0}^{M-1} \Delta y_k = 1.$$
In order to meet the above constraints during the surrogate parameter search, we first sample parameters from a normal distribution without truncation, and then apply the Softmax operation over the sampled parameters. The normalized parameters are used as the actual parameters of the piecewise linear functions.
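The softmax construction above can be sketched as follows (function names are ours, illustrative only): unconstrained parameters are mapped to positive $y$-increments that sum to one, so the resulting piecewise linear $g$ automatically satisfies both constraints:

```python
import numpy as np

def pwl_from_params(theta):
    """Build g from unconstrained params theta (one per segment): softmax
    makes the y-increments positive and sum to 1, so g(0)=0, g(1)=1
    (truth table) and g is non-decreasing (monotonicity)."""
    e = np.exp(theta - np.max(theta))
    dy = e / e.sum()                              # positive increments, sum 1
    ys = np.concatenate([[0.0], np.cumsum(dy)])   # y-knots: y0=0, yM=1
    xs = np.linspace(0.0, 1.0, len(theta) + 1)    # uniform x-knots
    return lambda x: np.interp(x, xs, ys)
```

With all-zero parameters the increments are uniform and $g$ is the identity; skewing one parameter concentrates the slope in the corresponding segment while both constraints remain satisfied by construction.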
In our implementation, we use piecewise linear functions with five line segments. The effectiveness is presented in Table 7. The searched losses parameterized with piecewise linear functions are on par with or better than the previous losses on their target metrics, and achieve performance very similar to that obtained with the piecewise Bézier curve parameterization.
Loss Function  mIoU  FWIoU  BIoU  BF1  mAcc  gAcc 

Cross Entropy  78.69  91.31  70.61  65.30  87.31  95.17 
WCE (Ronneberger et al., 2015)  69.60  85.64  61.80  37.59  92.61  91.11 
DPCE (Caliva et al., 2019)  79.82  91.76  71.87  66.54  87.76  95.45 
SSIM (Qin et al., 2019)  79.26  91.68  71.54  66.35  87.87  95.38 
DiceLoss (Milletari et al., 2016)  77.78  91.34  69.85  64.38  87.47  95.11 
Lovász (Berman et al., 2018)  79.72  91.78  72.47  66.65  88.64  95.42
Searched mIoU  80.94  92.01  73.22  67.32  90.12  95.46 
Searched FWIoU  79.05  91.78  71.47  64.24  89.77  95.31 
Searched BIoU  43.62  70.50  75.37  46.23  53.21  82.60 
Searched BF1  1.87  1.03  6.85  76.02  6.54  2.17 
Searched mAcc  74.33  88.77  65.96  46.81  92.34  93.26 
Searched gAcc  78.95  91.51  69.90  62.65  88.76  95.19 
Searched mIoU + BIoU  81.24  92.48  75.74  68.19  90.03  95.42 
Searched mIoU + BF1  79.11  91.38  71.71  73.55  89.28  95.17 
Appendix C Visualization and discussion on boundary segmentation
During the retraining stage, we find the segmentation results trained with the surrogates for the BIoU and BF1 metrics particularly interesting. To further investigate their properties, we visualize the segmentation results trained with the surrogates for boundary metrics.
Boundary segmentation. As shown in Table 2 and Table 7, despite the great improvements achieved on the BIoU and BF1 scores by training with their respective surrogate losses, the other metrics show a significant drop. Fig. 4 and Fig. 5 visualize the segmentation results of the surrogate losses for mIoU, BIoU/BF1, and mIoU + BIoU/BF1. It can be seen that the surrogate losses for BIoU/BF1 guide the network to focus on object boundaries while ignoring other regions, and thus fail to meet the needs of other metrics. Training with surrogate losses for both mIoU and BIoU/BF1 refines the boundaries while maintaining good mIoU performance.
Boundary tolerance of the BF1 metric. Boundary metrics (e.g., BIoU and BF1) introduce a tolerance on boundary regions to allow slight misalignment in boundary prediction. Interestingly, we find that using the surrogate loss for BF1 with non-zero tolerance leads to sawtooth artifacts around the predicted boundaries, as shown in Fig. 6. Such sawtooth waves stay within the tolerance range, so they do not hurt the BF1 scores. When the boundary tolerance range in the BF1 score is reduced, the sawtooth phenomenon gets penalized, and the corresponding surrogate losses learn to remove such sawtooth waves.
References
 The Lovász-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4413–4421. Cited by: Table 7, §2, Table 2.
 The truncated normal distribution. Department of Scientific Computing Website, Florida State University, pp. 1–35. Cited by: §4.3.
 Distance map loss penalty term for semantic segmentation. In International Conference on Medical Imaging with Deep Learning–Extended Abstract Track, Cited by: Table 7, §2, Table 2.
 DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: Appendix A.
 Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818. Cited by: Appendix A, §2, §5.
 The cityscapes dataset for semantic urban scene understanding. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. Cited by: Appendix A, §1, §5.
 What is a good evaluation measure for semantic segmentation?. In Proceedings of the British Machine Vision Conference (BMVC), Vol. 27, pp. 2013. Cited by: §3.
 The pascal visual object classes challenge: a retrospective. International Journal of Computer Vision 111 (1), pp. 98–136. Cited by: Appendix A, Appendix B, §1, §5.
 Semantic contours from inverse detectors. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 991–998. Cited by: Appendix A.
 Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: Appendix A, §5.
 AutoML: a survey of the state-of-the-art. arXiv preprint arXiv:1908.00709. Cited by: §2.
 Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision 82 (3), pp. 302–324. Cited by: §3.
 AM-LFS: AutoML for loss function search. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 8410–8419. Cited by: §2, §5.1.
 DARTS: differentiable architecture search. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Cited by: §2.
 Segmentation loss odyssey. arXiv preprint arXiv:2005.13449. Cited by: §2.
 V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: Table 7, §1, §2, Table 2.
 NeuroIoU: learning a surrogate loss for semantic segmentation. In Proceedings of the British Machine Vision Conference (BMVC), pp. 278. Cited by: §2.
 Learning surrogates via deep embedding. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2, §2.
 Efficient neural architecture search via parameters sharing. In Proceedings of the 35th International Conference on Machine Learning, pp. 4095–4104. Cited by: §2.
 BASNet: boundary-aware salient object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7479–7489. Cited by: Table 7, §2, Table 2.
 Optimizing intersection-over-union in deep neural networks for image segmentation. In International Symposium on Visual Computing, pp. 234–244. Cited by: §1, §2, §2.
 U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: Table 7, §1, §2, Table 2.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: Appendix A, §4.3, §5.3.
 Deep high-resolution representation learning for human pose estimation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5693–5703. Cited by: §5.2.
 Loss function search for face recognition. In Proceedings of the 37th International Conference on Machine Learning, Cited by: §2.
 Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885. Cited by: §1, §2.
 Pyramid scene parsing network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890. Cited by: §5.2.
 Neural architecture search with reinforcement learning. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Cited by: §2.