Threshold Choice Methods: the Missing Link
José Hernández-Orallo (jorallo@dsic.upv.es)
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València, Spain

Peter Flach (Peter.Flach@bristol.ac.uk)
Intelligent Systems Laboratory
University of Bristol, United Kingdom

Cèsar Ferri (cferri@dsic.upv.es)
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València, Spain

July 16, 2019
Many performance metrics have been introduced in the literature for the evaluation of classification performance, each of them with different origins and areas of application. These metrics include accuracy, macro-accuracy, area under the ROC curve or the ROC convex hull, the mean absolute error and the Brier score or mean squared error (with its decomposition into refinement and calibration). One way of understanding the relation among these metrics is by means of variable operating conditions (in the form of misclassification costs and/or class distributions). Thus, a metric may correspond to some expected loss over different operating conditions. One dimension for the analysis has been the distribution for this range of operating conditions, leading to some important connections in the area of proper scoring rules. We demonstrate in this paper that there is an equally important dimension which has so far not received attention in the analysis of performance metrics. This new dimension is given by the decision rule, which is typically implemented as a threshold choice method when using scoring models. In this paper, we explore many old and new threshold choice methods: fixed, score-uniform, score-driven, rate-driven and optimal, among others. By calculating the expected loss obtained with these threshold choice methods for a uniform range of operating conditions we give clear interpretations of the 0-1 loss, the absolute error, the Brier score, the AUC and the refinement loss respectively. Our analysis provides a comprehensive view of performance metrics as well as a systematic approach to loss minimisation which can be summarised as follows: given a model, apply the threshold choice methods that correspond with the available information about the operating condition, and compare their expected losses.
In order to assist in this procedure we also derive several connections between the aforementioned performance metrics, and we highlight the role of calibration in choosing the threshold choice method.
Keywords: Classification performance metrics, Cost-sensitive Evaluation, Operating Condition, Brier Score, Area Under the ROC Curve (AUC), Calibration Loss, Refinement Loss.
1 Introduction
The choice of a proper performance metric for evaluating classification [hand1997construction] is an old but still lively debate which has incorporated many different performance metrics along the way. Besides accuracy (Acc, or, equivalently, the error rate or 0-1 loss), many other performance metrics have been studied. The most prominent and well-known metrics are the Brier Score (BS, also known as Mean Squared Error) [brier1950verification] and its decomposition in terms of refinement and calibration [murphy1973new], the absolute error (MAE), the log(arithmic) loss (or cross-entropy) [Good52] and the area under the ROC curve (AUC, also known as the Wilcoxon-Mann-Whitney statistic, proportional to the Gini coefficient and to the Kendall's tau distance to a perfect model) [SDM00, Fawcett06]. There are also many graphical representations and tools for model evaluation, such as ROC curves [SDM00, Fawcett06], ROC isometrics [Fla03], cost curves [DH00, drummondandHolte2006], DET curves [martin1997det], lift charts [piatetsky1999estimating], calibration maps [cohen2004properties], etc. A survey of graphical methods for classification predictive performance evaluation can be found in [Prati2011].
When we have a clear operating condition which establishes the misclassification costs and the class distributions, there are effective tools such as ROC analysis [SDM00, Fawcett06] to establish which model is best and what its expected loss will be. However, the question is more difficult in the general case when we do not have information about the operating condition where the model will be applied. In this case, we want our models to perform well in a wide range of operating conditions. In this context, the notion of 'proper scoring rule', see e.g. [murphy1970scoring], sheds some light on some performance metrics. Some proper scoring rules, such as the Brier Score (MSE loss), the log-loss, boosting loss and error rate (0-1 loss) have been shown in [BSS05] to be special cases of an integral over a Beta density of costs, see e.g. [gneiting2007strictly, reid2010composite, reid2011information, brummer2010thesis]. Each performance metric is derived as a special case of the Beta distribution. However, this analysis focusses on scoring rules which are 'proper', i.e., metrics that are minimised for well-calibrated probability assessments or, in other words, get the best (lowest) score by forecasting the true beliefs. Much less is known (in terms of expected loss for varying distributions) about other performance metrics which are non-proper scoring rules, such as AUC. Moreover, even its role as a classification performance metric has been put into question [hand2009measuring].
All these approaches make some (generally implicit and poorly understood) assumptions on how the model will work for each operating condition. In particular, it is generally assumed that the threshold which is used to discriminate between the classes will be set according to the operating condition. In addition, it is assumed that the threshold will be placed at the estimated probability that equals the operating condition. This is natural if we focus on proper scoring rules. Once all this is settled and fixed, different performance metrics represent different expected losses by using the distribution over the operating condition as a parameter. However, this threshold choice is only one of the many possibilities.
In our work we make these assumptions explicit through the concept of a threshold choice method, which we argue forms the 'missing link' between a performance metric and expected loss. A threshold choice method sets a single threshold on the scores of a model in order to arrive at classifications, possibly taking circumstances in the deployment context into account, such as the operating condition (the class or cost distribution) or the intended proportion of positive predictions (the predicted positive rate). Building on this new notion of threshold choice method, we are able to systematically explore how known performance metrics are linked to expected loss, resulting in a range of results that are not only theoretically well-founded but also practically relevant.
The basic insight is the realisation that there are many ways of converting a model (understood throughout this paper as a function assigning scores to instances) into a classifier that maps instances to classes (we assume binary classification throughout). Put differently, there are many ways of setting the threshold given a model and an operating point. We illustrate this with an example concerning a very common scenario in machine learning research. Consider two models A and B, a naive Bayes model and a decision tree respectively (induced from a training dataset), which are evaluated against a test dataset, producing a score distribution for the positive and negative classes as shown in Figure 1. ROC curves of both models are shown in Figure 2. We will assume that at this evaluation time we do not have information about the operating condition, but we expect that this information will be available at deployment time.
If we ask the question of which model is best we may rush to calculate its AUC and BS (and perhaps other metrics), as given by Table 1. However, we cannot give an answer because the question is underspecified. First, we need to know the range of operating conditions the model will work with. Second, we need to know how we will make the classifications, or in other words, we need a decision rule, which can be implemented as a threshold choice method when the model outputs scores. For the first dimension (already considered by the work on proper scoring rules), if we have no knowledge about the operating conditions, we can assume a distribution, e.g., a uniform distribution, which considers all operating conditions equally likely. For the second (new) dimension, we have many options.
Table 1
performance metric    model A    model B
AUC                   0.79       0.67
Brier score           0.33       0.24
For instance, we can just set a fixed threshold at 0.5. This is what naive Bayes and decision trees do by default. This decision rule works as follows: if the score is greater than 0.5 then predict positive, otherwise predict negative. With this precise decision rule, we can now ask the question about the expected loss. Assuming a uniform distribution for operating conditions, we can effectively calculate the answer on the dataset: 0.510 for model A and 0.375 for model B (the first row of Table 2).
But we can use better decision rules. We can use decision rules which adapt to the operating condition. One of these decision rules is the score-driven threshold choice method, which sets the threshold equal to the operating condition or, more precisely, to a cost proportion $c$. Another decision rule is the rate-driven threshold choice method, which sets the threshold in such a way that the proportion of predicted positives (or predicted positive rate), simply known as 'rate' and denoted by $r$, equals the operating condition. Using these three different threshold choice methods for the models A and B we get the expected losses shown in Table 2.
Table 2
threshold choice method                      expected loss model A    expected loss model B
Fixed ($T = 0.5$)                            0.510                    0.375
Score-driven ($T = c$)                       0.328                    0.231
Rate-driven ($T$ s.t. $\mathrm{rate}(T) = c$)    0.188                    0.248
In other words, only when we specify or assume a threshold choice method can we convert a model into a classifier for which it makes sense to consider its expected loss. In fact, as we can see in Table 2, very different expected losses are obtained for the same model with different threshold choice methods. And this is the case even assuming the same uniform cost distribution for all of them.
Once we have made this (new) dimension explicit, we are ready to ask new questions. How many threshold choice methods are there? Table 3 shows six of the threshold choice methods we will analyse in this work, along with their notation. Only the score-fixed and the score-driven methods have been analysed in previous works in the area of proper scoring rules. In addition, a seventh threshold choice method, known as the optimal threshold choice method, denoted by $T^{o}$, has been (implicitly) used in a few works [DH00, drummondandHolte2006, hand2009measuring].
Table 3
Threshold choice method    Fixed                      Chosen uniformly             Driven by o.c.
Using scores               score-fixed ($T^{sf}$)     score-uniform ($T^{su}$)     score-driven ($T^{sd}$)
Using rates                rate-fixed ($T^{rf}$)      rate-uniform ($T^{ru}$)      rate-driven ($T^{rd}$)
We will see that each threshold choice method is linked to a specific performance metric. This means that if we decide (or are forced) to use a threshold choice method then there is a recommended performance metric for it. The results in this paper show that accuracy is the appropriate performance metric for the score-fixed method, MAE fits the score-uniform method, BS is the appropriate performance metric for the score-driven method, and AUC fits both the rate-uniform and the rate-driven methods. All these results assume a uniform cost distribution.
The good news is that inter-comparisons are still possible: given a threshold choice method we can calculate expected loss from the relevant performance metric. The results in Table 2 allow us to conclude that model A achieves the lowest expected loss for uniformly sampled cost proportions, if we are wise enough to choose the appropriate threshold choice method (in this case the rate-driven method) to turn model A into a successful classifier. Notice that this cannot be said by just looking at Table 1 because the metrics in this table are not comparable to each other. In fact, there is no single performance metric that ranks the models in the correct order, because, as already said, expected loss cannot be calculated for models, only for classifiers.
1.1 Contributions and structure of the paper
The contributions of this paper to the subject of model evaluation for classification can be summarised as follows.

The expected loss of a model can only be determined if we select a distribution of operating conditions and a threshold choice method. We need to set a point in this two-dimensional space. Along the second (usually neglected) dimension, several new threshold choice methods are introduced in this paper.

We answer the question: "if one is choosing thresholds in a particular way, which performance metric is appropriate?" by giving an explicit expression for the expected loss for each threshold choice method. We derive linear relationships between expected loss and many common performance metrics. The most remarkable one is the vindication of AUC as a measure of expected classification loss for both the rate-uniform and rate-driven methods, contrary to recent claims in the literature [hand2009measuring].

One fundamental and novel result shows that the refinement loss of the convex hull of a ROC curve is equal to expected optimal loss as measured by the area under the optimal cost curve. This sets an optimistic (but also unrealistic) bound for the expected loss.

Conversely, from the usual calculation of several well-known performance metrics we can derive expected loss. Thus, classifiers and performance metrics become easily comparable. With this we do not choose the best model (a concept that does not make sense) but we choose the best classifier (a model with a particular threshold choice method).

By cleverly manipulating scores we can connect several of these performance metrics, either by the notion of evenly-spaced scores or perfectly calibrated scores. This provides an additional way of analysing the relation between performance metrics and, of course, threshold choice methods.

We use all these connections to better understand which threshold choice method should be used, and in which cases some are better than others. The analysis of calibration plays a central role in this understanding, and also shows that non-proper scoring rules do have their role and can lead to lower expected loss than proper scoring rules, which are, as expected, more appropriate when the model is well-calibrated.
This set of contributions provides an integrated perspective on performance metrics for classification around the ‘missing link’ which we develop in this paper: the notion of threshold choice method.
The remainder of the paper is structured as follows. Section 2 introduces some notation, the basic definitions for operating condition, threshold, expected loss, and particularly the notion of threshold choice method, which we will use throughout the paper. Section 3 investigates expected loss for fixed threshold choice methods (score-fixed and rate-fixed), which are the base for the rest. We show that, not surprisingly, the expected loss for these threshold choice methods is the 0-1 loss (accuracy or macro-accuracy depending on whether we use cost proportions or skews). Section 4 presents the results that the score-uniform threshold choice method has MAE as its associated performance metric and the score-driven threshold choice method leads to the Brier score. We also show that one dominates over the other. Section 5 analyses the non-fixed methods based on rates. Somewhat surprisingly, both the rate-uniform threshold choice method and the rate-driven threshold choice method lead to linear functions of AUC, with the latter always being better than the former. All this vindicates the rate-driven threshold choice method but also AUC as a performance metric for classification. Section 6 uses the optimal threshold choice method, connects the expected loss in this case with the area under the optimal cost curve, and derives its corresponding metric, which is refinement loss, one of the components of the Brier score decomposition. Section 7 analyses the connections between the previous threshold choice methods and metrics by considering several properties of the scores: evenly-spaced scores and perfectly calibrated scores. This also helps to understand which threshold choice method should be used depending on how good the scores are. Finally, Section 8 closes the paper with a thorough discussion of results, related work, and an overall conclusion with future work and open questions. There is an appendix which includes some technical results for the optimal threshold choice method and some examples.
2 Background
In this section we introduce some basic notation and definitions we will need throughout the paper. Some other definitions will be delayed and introduced when needed. The most important definitions we will need are introduced below: the notion of threshold choice method and the expression of expected loss.
2.1 Notation and basic definitions
A classifier is a function that maps instances $x$ from an instance space $X$ to classes $y$ from an output space $Y$. For this paper we will assume binary classifiers, i.e., $Y = \{0, 1\}$. A model is a function $m: X \to \mathbb{R}$ that maps examples to real numbers (scores) on an unspecified scale. We use the convention that higher scores express a stronger belief that the instance is of class 1. A probabilistic model is a function $m: X \to [0, 1]$ that maps examples to estimates $\hat{p}(1|x)$ of the probability of example $x$ to be of class 1. In order to make predictions in the $Y$ domain, a model can be converted to a classifier by fixing a decision threshold $t$ on the scores. Given a predicted score $s = m(x)$, the instance $x$ is classified in class 1 if $s > t$, and in class 0 otherwise.
For a given, unspecified model and population from which data are drawn, we denote the score density for class $k$ by $f_k$ and the cumulative distribution function by $F_k$. Thus, $F_0(t) = \int_{-\infty}^{t} f_0(s)\,ds$ is the proportion of class 0 points correctly classified if the decision threshold is $t$, which is the sensitivity or true positive rate at $t$. Similarly, $F_1(t) = \int_{-\infty}^{t} f_1(s)\,ds$ is the proportion of class 1 points incorrectly classified as 0 or the false positive rate at threshold $t$; $1 - F_1(t)$ is the true negative rate or specificity. Note that we use 0 for the positive class and 1 for the negative class, but scores increase with $\hat{p}(1|x)$. That is, $F_0(t)$ and $F_1(t)$ are monotonically non-decreasing with $t$. This has some notational advantages and is the same convention as used by, e.g., Hand [hand2009measuring].
Given a dataset $D \subset X$ of size $n = |D|$, we denote by $D_k$ the subset of examples in class $k \in \{0, 1\}$, and set $n_k = |D_k|$ and $\pi_k = n_k / n$. Clearly $\pi_0 + \pi_1 = 1$. We will use the term class proportion for $\pi_0$ (other terms such as 'class ratio' or 'class prior' have been used in the literature). Given a model and a threshold $t$, we denote by $R(t)$ the predicted positive rate, i.e., the proportion of examples that will be predicted positive (class 0) if the threshold is set at $t$. This can also be defined as $R(t) = \pi_0 F_0(t) + \pi_1 F_1(t)$. The average score of actual class $k$ is $\bar{s}_k = \int_0^1 s f_k(s)\,ds$. Given any strict order for a dataset of $n$ examples we will use the index $i$ on that order to refer to the $i$-th example. Thus, $s_i$ denotes the score of the $i$-th example and $y_i$ its true class.
We define partial class accuracies as $\mathit{Acc}_0(t) = F_0(t)$ and $\mathit{Acc}_1(t) = 1 - F_1(t)$. From here, (micro-average) accuracy is defined as $\mathit{Acc}(t) = \pi_0 \mathit{Acc}_0(t) + \pi_1 \mathit{Acc}_1(t)$ and macro-average accuracy as $\mathit{MAcc}(t) = (\mathit{Acc}_0(t) + \mathit{Acc}_1(t))/2$.
We denote by $U_{a,b}(x)$ the continuous uniform distribution of variable $x$ over an interval $[a, b] \subset \mathbb{R}$. If this interval is $[0, 1]$ then $a, b$ can be omitted. The family of continuous distributions Beta is denoted by $B_{\alpha,\beta}$. The Beta distributions are always defined in the interval $[0, 1]$. Note that the continuous uniform distribution is a special case of the Beta family, i.e., $B_{1,1} = U$.
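The quantities of this subsection can be estimated directly from a labelled sample. The following is a minimal sketch, assuming empirical cumulative distributions (the fraction of scores at or below the threshold) and the convention that class 0 is the positive class and is predicted when the score does not exceed the threshold. The function names and the toy data are illustrative assumptions, not part of the paper.

```python
def empirical_cdf(class_scores, t):
    """Empirical F_k(t): fraction of the given scores that are <= t."""
    return sum(s <= t for s in class_scores) / len(class_scores)


def summarise(scores, labels, t):
    """Class proportion, F_0, F_1, predicted positive rate, Acc and MAcc at threshold t."""
    s0 = [s for s, y in zip(scores, labels) if y == 0]  # positive class
    s1 = [s for s, y in zip(scores, labels) if y == 1]  # negative class
    pi0 = len(s0) / len(scores)
    pi1 = 1.0 - pi0
    f0 = empirical_cdf(s0, t)   # true positive rate at t
    f1 = empirical_cdf(s1, t)   # false positive rate at t
    acc0, acc1 = f0, 1.0 - f1   # partial class accuracies
    return {
        "pi0": pi0,
        "F0": f0,
        "F1": f1,
        "rate": pi0 * f0 + pi1 * f1,        # R(t)
        "acc": pi0 * acc0 + pi1 * acc1,     # micro-average accuracy
        "macc": (acc0 + acc1) / 2,          # macro-average accuracy
    }


# Toy data (hypothetical): scores estimate the probability of class 1.
scores = [0.1, 0.2, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
summary = summarise(scores, labels, t=0.5)
```

With balanced classes, as in this toy sample, micro-average and macro-average accuracy coincide; they differ as soon as the class proportion moves away from one half.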
2.2 Operating conditions and expected loss
When a model is deployed for classification, the conditions might be different to those during training. In fact, a model can be used in several deployment contexts, with different results. A context can entail different class distributions, different classification-related costs (either for the attributes, for the class or any other kind of cost), or some other details about the effects that the application of a model might entail and the severity of its errors. In practice, a deployment context or operating condition is usually defined by a misclassification cost function and a class distribution. Clearly, there is a difference between operating when the cost of misclassifying 0 into 1 is equal to the cost of misclassifying 1 into 0 and doing so when the former is ten times the latter. Similarly, operating when classes are balanced is different from when there is an overwhelming majority of instances of one class.
One general approach to cost-sensitive learning assumes that the cost does not depend on the example but only on its class. In this way, misclassification costs are usually simplified by means of cost matrices, where we can express that some misclassification costs are higher than others [Elk01]. Typically, the costs of correct classifications are assumed to be 0. This means that for binary models we can describe the cost matrix by two values $c_k \geq 0$, representing the misclassification cost of an example of class $k$. Additionally, we can normalise the costs by setting $b = c_0 + c_1$ and $c = c_0 / b$; we will refer to $c$ as the cost proportion. Since this can also be expressed as $c = (1 + c_1/c_0)^{-1}$, it is often called 'cost ratio' even though, technically, it is a proportion ranging between 0 and 1.
The loss which is produced at a decision threshold $t$ and a cost proportion $c$ is then given by the formula:
(1)  $Q_c(t; c) \triangleq c_0 \pi_0 (1 - F_0(t)) + c_1 \pi_1 F_1(t) = b\{c \pi_0 (1 - F_0(t)) + (1 - c) \pi_1 F_1(t)\}$
This notation assumes the class distribution to be fixed. In order to take both class proportion and cost proportion into account we introduce the notion of skew, which is a normalisation of their product:
(2)  $z \triangleq \frac{c_0 \pi_0}{c_0 \pi_0 + c_1 \pi_1} = \frac{c \pi_0}{c \pi_0 + (1 - c)(1 - \pi_0)}$
From equation (1) we obtain
(3)  $\frac{Q_c(t; c)}{c_0 \pi_0 + c_1 \pi_1} = z (1 - F_0(t)) + (1 - z) F_1(t) \triangleq Q_z(t; z)$
This gives an expression for loss at a threshold $t$ and a skew $z$. We will assume that the operating condition is either defined by the cost proportion (using a fixed class distribution) or by the skew. We then have the following simple but useful result.
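Lemma 1.
If $\pi_0 = \pi_1$ then $z = c$ and $Q_z(t; z) = \frac{2}{b} Q_c(t; c)$.
Proof.
If $\pi_0 = \pi_1 = 1/2$ then Equation (2) reduces to $z = c_0 / (c_0 + c_1) = c$, and the normalising term in Equation (3) is $c_0 \pi_0 + c_1 \pi_1 = b/2$, so that $Q_z(t; z) = \frac{2}{b} Q_c(t; c)$. ∎
This justifies taking $b = 2$, which means that $Q_z$ and $Q_c$ are expressed on the same 0-1 scale, and are also commensurate with error rate which assumes $c_0 = c_1 = 1$. The upshot of Lemma 1 is that we can transfer any expression for loss in terms of cost proportion to an equivalent expression in terms of skew by just setting $\pi_0 = \pi_1 = 1/2$ and $z = c$. Notice that if $c = 1/2$ then $z = \pi_0$, so in that case skew denotes the class distribution as operating condition.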
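The relation between costs, cost proportion and skew can be checked numerically. The sketch below assumes the standard forms of the two losses: the cost-proportion loss $b\{c\pi_0(1 - F_0(t)) + (1 - c)\pi_1 F_1(t)\}$ and the skew loss $z(1 - F_0(t)) + (1 - z)F_1(t)$. The function names, the costs 3 and 7, and the values of $F_0(t)$ and $F_1(t)$ are hypothetical choices for illustration only.

```python
def cost_proportion(c0, c1):
    """Return (c, b) with b = c0 + c1 and c = c0 / b."""
    b = c0 + c1
    return c0 / b, b


def skew(c, pi0):
    """Skew z: normalised product of cost proportion and class proportion."""
    return c * pi0 / (c * pi0 + (1 - c) * (1 - pi0))


def loss_c(f0, f1, c, pi0, b=2.0):
    """Loss for cost proportion c, given F_0(t), F_1(t) and class proportion pi0."""
    return b * (c * pi0 * (1 - f0) + (1 - c) * (1 - pi0) * f1)


def loss_z(f0, f1, z):
    """Loss for skew z, given F_0(t) and F_1(t)."""
    return z * (1 - f0) + (1 - z) * f1


# Hypothetical costs: misclassifying class 0 costs 3, misclassifying class 1 costs 7.
c, b = cost_proportion(c0=3.0, c1=7.0)

# Lemma 1: with balanced classes the skew equals the cost proportion ...
z_balanced = skew(c, pi0=0.5)
# ... and the two losses differ only by the factor 2/b.
lhs = loss_z(0.75, 0.25, z_balanced)
rhs = (2.0 / b) * loss_c(0.75, 0.25, c, pi0=0.5, b=b)

# With equal costs (c = 1/2) the skew reduces to the class proportion.
z_equal_costs = skew(0.5, pi0=0.2)
```

Here `lhs` and `rhs` agree up to floating-point rounding, which is the content of Lemma 1 for this particular choice of values.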
It is important to distinguish the information we may have available at each stage of the process. At evaluation time we may not have some information that is available later, at deployment time. In many real-world problems, when we have to evaluate or compare models, we do not know the cost proportion or skew that will apply during deployment. One general approach is to evaluate the model on a range of possible operating points. In order to do this, we have to set a weight or distribution on cost proportions or skews. In this paper, we will mostly consider the continuous uniform distribution (but other distribution families, such as the Beta distribution, could be used).
A key issue when applying a model under different operating conditions is how the threshold is chosen in each of them. If we work with a classifier, this question vanishes, since the threshold is already settled. However, in the general case when we work with a model, we have to decide how to establish the threshold. The key idea proposed in this paper is the notion of a threshold choice method, a function which converts an operating condition into an appropriate threshold for the classifier.
Definition 1.
Threshold choice method. A threshold choice method is a (possibly non-deterministic) function $T: [0, 1] \to \mathbb{R}$ such that given an operating condition it returns a decision threshold. The operating condition can be either a skew $z$ or a cost proportion $c$; to differentiate these we use the subscript $z$ or $c$ on $T$. Superscripts are used to identify particular threshold choice methods. Some threshold choice methods we consider in this paper take additional information into account, such as a default threshold or a target predicted positive rate; such information is indicated by square brackets. So, for example, the score-fixed threshold choice method for cost proportions considered in the next section is indicated thus: $T^{sf}_c[t](c)$.
When we say that $T$ may be non-deterministic, it means that the result may depend on a random variable and hence may itself be a random variable according to some distribution.
We introduce the threshold choice method as an abstract concept since there are several reasonable options for the function $T$, essentially because there may be different degrees of information about the model and the operating conditions at evaluation time. We can set a fixed threshold ignoring the operating condition; we can set the threshold by looking at the ROC curve (or its convex hull) and using the cost proportion or the skew to intersect the ROC curve (as ROC analysis does); we can set a threshold looking at the estimated scores, especially when they represent probabilities; or we can set a threshold independently from the rank or the scores. The way in which we set the threshold may dramatically affect performance. Just as importantly, the performance metric used for evaluation must be in accordance with the threshold choice method.
In the rest of this paper, we explore a range of different methods to choose the threshold (some deterministic and some non-deterministic). We will give proper definitions of all these threshold choice methods in the corresponding sections.
Given a threshold choice function $T_c$, the loss for a particular cost proportion is given by $Q_c(T_c(c); c)$. Following Adams and Hand [AdamsHand1999] we define expected loss as a weighted average over operating conditions.
Definition 2.
Given a threshold choice method for cost proportions $T_c$ and a probability density function over cost proportions $w_c$, expected loss $L_c$ is defined as
(4)  $L_c \triangleq \int_0^1 Q_c(T_c(c); c)\, w_c(c)\, dc$
Incorporating the class distribution into the operating condition we obtain expected loss over a distribution of skews:
(5)  $L_z \triangleq \int_0^1 Q_z(T_z(z); z)\, w_z(z)\, dz$
It is worth noting that if we plot $Q_c(T_c(c); c)$ or $Q_z(T_z(z); z)$ against $c$ and $z$, respectively, we obtain cost curves as defined by [DH00, drummondandHolte2006]. Cost curves are also known as risk curves (see, e.g. [reid2011information], where the plot can also be shown in terms of priors, i.e., class proportions).
Equations (4) and (5) illustrate the space we explore in this paper. Two parameters determine the expected loss: $w_c(c)$ and $T_c(c)$ (respectively $w_z(z)$ and $T_z(z)$). While much work has been done on a first dimension, by changing $w_c(c)$ or $w_z(z)$, particularly in the area of proper scoring rules, no work has systematically analysed what happens when changing the second dimension, $T_c(c)$ or $T_z(z)$.
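The expected loss over cost proportions can be approximated numerically for any threshold choice method. The following is a minimal sketch, assuming a uniform weight over cost proportions, $b = 2$, empirical cumulative distributions, and a midpoint rule for the integral. The function names and the toy data are illustrative assumptions. Two methods are compared: a fixed threshold at 0.5 and the score-driven rule that sets the threshold equal to the cost proportion.

```python
def empirical_cdf(class_scores, t):
    """Empirical F_k(t): fraction of the given scores that are <= t."""
    return sum(s <= t for s in class_scores) / len(class_scores)


def loss_c(scores, labels, t, c, b=2.0):
    """Loss at threshold t and cost proportion c (class 0 is the positive class)."""
    s0 = [s for s, y in zip(scores, labels) if y == 0]
    s1 = [s for s, y in zip(scores, labels) if y == 1]
    pi0 = len(s0) / len(scores)
    pi1 = 1.0 - pi0
    return b * (c * pi0 * (1 - empirical_cdf(s0, t))
                + (1 - c) * pi1 * empirical_cdf(s1, t))


def expected_loss(threshold_method, scores, labels, steps=1000):
    """Midpoint-rule estimate of the expected loss for uniform cost proportions."""
    total = 0.0
    for i in range(steps):
        c = (i + 0.5) / steps                      # midpoint of the i-th cell
        total += loss_c(scores, labels, threshold_method(c), c)
    return total / steps


# Toy data (hypothetical): scores estimate the probability of class 1.
scores = [0.1, 0.2, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

fixed = expected_loss(lambda c: 0.5, scores, labels)       # threshold ignores c
score_driven = expected_loss(lambda c: c, scores, labels)  # threshold equals c
```

On this toy sample the score-driven rule gives a lower expected loss than the fixed threshold, which illustrates that the same model yields different expected losses under different threshold choice methods.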
3 Expected loss for fixed-threshold classifiers
The easiest way to choose the threshold is to set it to a predefined value $t$, independently from the model and also from the operating condition. This is, in fact, what many classifiers do (e.g. Naive Bayes chooses $t = 0.5$ independently from the model and independently from the operating condition). We will see the straightforward result that this threshold choice method corresponds to 0-1 loss (either micro-average accuracy, Acc, or macro-average accuracy, MAcc). Part of these results will be useful to better understand some other threshold choice methods.
Definition 3.
The score-fixed threshold choice method is defined as follows:
(6)  $T^{sf}_c[t](c) \triangleq T^{sf}_z[t](z) \triangleq t$
This choice has been criticised in two ways, but is still frequently used. Firstly, choosing 0.5 as a threshold is not generally the best choice even for balanced datasets or for applications where the test distribution is equal to the training distribution (see, e.g. [lachiche2003improving] on how to get much more from a Bayes classifier by simply changing the threshold). Secondly, even if we are able to find a better value than 0.5, this does not mean that this value is best for every skew or cost proportion; this is precisely one of the reasons why ROC analysis is used [provost2001robust]. Only when we know the deployment operating condition at evaluation time is it reasonable to fix the threshold according to this information. So, either by common choice or because we are in this latter case, suppose that we are going to use the same threshold independently of skews or cost proportions. Given this threshold choice method, the question is: if we must evaluate a model before application for a wide range of skews and cost proportions, which performance metric should be used? This is what we answer below.
If we plug $T^{sf}_c$ (Equation (6)) into the general formula of the expected loss for a range of cost proportions (Equation (4)) we have:
(7)  $L^{sf}_c(t) \triangleq \int_0^1 Q_c(T^{sf}_c[t](c); c)\, w_c(c)\, dc = \int_0^1 Q_c(t; c)\, w_c(c)\, dc$
We obtain the following straightforward result.
Theorem 2.
If a classifier sets the decision threshold at a fixed value $t$ irrespective of the operating condition or the model, then expected loss under a uniform distribution of cost proportions is equal to the error rate at that decision threshold: $L^{sf}_{U(c)}(t) = 1 - \mathit{Acc}(t)$.
Proof.
Taking $b = 2$ and a uniform weight over cost proportions, $L^{sf}_{U(c)}(t) = \int_0^1 2\{c \pi_0 (1 - F_0(t)) + (1 - c) \pi_1 F_1(t)\}\, dc = \pi_0 (1 - F_0(t)) + \pi_1 F_1(t) = 1 - \mathit{Acc}(t)$, since $\int_0^1 c\, dc = \int_0^1 (1 - c)\, dc = 1/2$. In words, the expected loss is equal to the class-weighted average of false positive rate and false negative rate, which is the (micro-average) error rate. ∎
So, the expected loss under a uniform distribution of cost proportions for the score-fixed threshold choice method is the error rate of the classifier at that threshold. That means that accuracy can be seen as a measure of classification performance in a range of cost proportions when we choose a fixed threshold. This interpretation is reasonable, since accuracy is a performance metric which is typically applied to classifiers (where the threshold is fixed) and not to models outputting scores. This is exactly what we did in Table 2. We calculated the expected loss for the fixed threshold at 0.5 for a uniform distribution of cost proportions, and we got $1 - \mathit{Acc}$ = 0.510 and 0.375 for models A and B respectively.
Similarly, if we plug $T^{sf}_z$ (Equation (6)) into the general formula of the expected loss for a range of skews (Equation (5)) we have:
(8)  $L^{sf}_z(t) \triangleq \int_0^1 Q_z(T^{sf}_z[t](z); z)\, w_z(z)\, dz = \int_0^1 Q_z(t; z)\, w_z(z)\, dz$
Using Lemma 1 we obtain the equivalent result for skews:
Corollary 3.
If a classifier sets the decision threshold at a fixed value $t$ irrespective of the operating condition or the model, then expected loss under a uniform distribution of skews is equal to the macro-average error rate at that decision threshold: $L^{sf}_{U(z)}(t) = (1 - F_0(t))/2 + F_1(t)/2 = 1 - \mathit{MAcc}(t)$.
The previous results show that 0-1 losses are appropriate to evaluate models in a range of operating conditions if the threshold is fixed for all of them. In other words, accuracy and macro-accuracy can be the right performance metrics for classifiers even in a cost-sensitive learning scenario. The situation occurs when one assumes a particular operating condition at evaluation time while the classifier has to deal with a range of operating conditions at deployment time.
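As a quick numerical check of Theorem 2 and Corollary 3, consider a fixed threshold $t$ with the hypothetical values $F_0(t) = 0.8$, $F_1(t) = 0.3$ and $\pi_0 = 0.7$ (chosen only for illustration), and take $b = 2$. The superscript and subscript notation for the two expected losses is the one assumed in this sketch.

```latex
\begin{align*}
\mathit{Acc}(t) &= \pi_0 F_0(t) + \pi_1 (1 - F_1(t)) = 0.7 \cdot 0.8 + 0.3 \cdot 0.7 = 0.77, \\
L^{sf}_{U(c)}(t) &= \pi_0 (1 - F_0(t)) + \pi_1 F_1(t) = 0.7 \cdot 0.2 + 0.3 \cdot 0.3 = 0.23 = 1 - \mathit{Acc}(t), \\
\mathit{MAcc}(t) &= \tfrac{1}{2} \bigl( F_0(t) + 1 - F_1(t) \bigr) = \tfrac{1}{2} (0.8 + 0.7) = 0.75, \\
L^{sf}_{U(z)}(t) &= \tfrac{1}{2} (1 - F_0(t)) + \tfrac{1}{2} F_1(t) = 0.10 + 0.15 = 0.25 = 1 - \mathit{MAcc}(t).
\end{align*}
```

The two expected losses differ here only because the classes are unbalanced; with $\pi_0 = 1/2$ they would coincide.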
In order to prepare for later results we also define a particular way of setting a fixed classification threshold, namely to achieve a particular predicted positive rate. One could say that such a method quantifies the proportion of positive predictions made by the classifier. For example, we could say that our threshold is fixed to achieve a rate of 30% positive predictions and the rest negatives. This of course involves ranking the examples by their scores and setting a cutting point at the appropriate position, something which is frequently known as ‘screening’.
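Definition 4.
Define the predicted positive rate at threshold $t$ as $R(t) = \pi_0 F_0(t) + \pi_1 F_1(t)$, and assume the cumulative distribution functions $F_0$ and $F_1$ are invertible; then we define the rate-fixed threshold choice method for rate $r$ as:
(9)  $T^{rf}_c[r](c) \triangleq R^{-1}(r)$
If $F_0$ and $F_1$ are not invertible, they have plateaus and so does $R$. This can be handled by deriving $t$ from the centroid of a plateau.
The rate-fixed threshold choice method for skews is defined as:
(10)  $T^{rf}_z[r](z) \triangleq R_z^{-1}(r)$
where $R_z(t) = F_0(t)/2 + F_1(t)/2$.
The corresponding expected loss for cost proportions is
(11)  $L^{rf}_c(r) \triangleq \int_0^1 Q_c(T^{rf}_c[r](c); c)\, w_c(c)\, dc = \int_0^1 Q_c(R^{-1}(r); c)\, w_c(c)\, dc$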
The notion of setting a threshold based on a rate is closely related to the problem of quantification [DBLP:journals/datamine/Forman08, bella2010quantification] where the goal is to correctly estimate the proportion for each of the classes (in the binary case, the positive rate is sufficient). This threshold choice method allows the user to set the quantity of positives, which can be known (from a sample of the test) or can be estimated using a quantification method. In fact, some quantification methods can be seen as methods to determine an absolute fixed threshold that ensures a correct proportion for the test set.
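Note that Equation (11) is closely related to Theorem 2. If we determine the threshold which produces a rate, i.e., if we determine $R^{-1}(r)$, we get the expected loss as an accuracy. Formally, for a uniform distribution of cost proportions we have:
$$L^{rf}_{U(c)} = \pi_0 \left(1 - F_0(R^{-1}(r))\right) + \pi_1 F_1(R^{-1}(r)) = 1 - Acc(R^{-1}(r)) \tag{12}$$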
Fortunately, it is straightforward to obtain the threshold which produces a rate; it can be derived by sorting the examples by their scores and placing the cutpoint where the rate equals the rank divided by the number of examples (e.g. if we have $n$ examples, the cutpoint $i$ makes $r = i/n$).
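The sorting procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `rate_fixed_threshold` and `predicted_positive_rate` are hypothetical, scores are assumed to be distinct, and, following the convention of this paper, an example is predicted positive (class 0) when its score is at or below the threshold.

```python
def rate_fixed_threshold(scores, r):
    """Return a threshold t such that about a fraction r of the examples
    satisfy score <= t, i.e. are predicted positive (class 0).

    Assumes distinct scores; with ties the achieved rate may differ from r.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("rate r must lie in [0, 1]")
    ordered = sorted(scores)
    n = len(ordered)
    i = round(r * n)              # cutpoint index, so that r is about i / n
    if i == 0:
        return float("-inf")      # no example is predicted positive
    if i == n:
        return ordered[-1]        # every example is predicted positive
    # Place the cut halfway between the i-th and (i+1)-th smallest score.
    return (ordered[i - 1] + ordered[i]) / 2.0


def predicted_positive_rate(scores, t):
    """Empirical R(t): proportion of examples with score <= t."""
    return sum(s <= t for s in scores) / len(scores)


scores = [0.9, 0.1, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6, 0.5, 0.05]
t = rate_fixed_threshold(scores, 0.3)     # aims at 30% positive predictions
rate = predicted_positive_rate(scores, t)
```

With these ten hypothetical scores the cut falls between the third and fourth smallest score, so three examples are predicted positive.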
4 Threshold choice methods using scores
In the previous section we looked at accuracy and error rate as performance metrics for classifiers and gave their interpretation as expected losses. In this and the following sections we consider performance metrics for models that do not require fixing a threshold choice method in advance. Such metrics include $AUC$, which evaluates ranking performance, and the Brier score or mean squared error, which evaluates the quality of probability estimates. We will deal with the latter in this section. We will therefore assume that scores range between 0 and 1 and represent posterior probabilities for class 1. This means that we can sample thresholds uniformly or derive them from the operating condition. We first introduce two performance metrics that are applicable to probabilistic scores.
The Brier score is a well-known performance metric for probabilistic models. It is an alternative name for the Mean Squared Error or MSE loss [brier1950verification], especially for binary classification.
Definition 5.
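$BS(m, D)$ denotes the Brier score of model $m$ on data $D$; we will usually omit $m$ and $D$ when clear from the context. $BS$ is defined as follows:
$$BS \triangleq \pi_0 BS_0 + \pi_1 BS_1, \qquad BS_0 \triangleq \int_0^1 s^2 f_0(s)\, ds, \qquad BS_1 \triangleq \int_0^1 (1-s)^2 f_1(s)\, ds$$
From here, we can define a prior-independent version of the Brier score (or a macro-average Brier score) as follows:
$$mBS \triangleq \frac{BS_0 + BS_1}{2} \tag{13}$$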
The Mean Absolute Error ($MAE$) is another simple performance metric which has been rediscovered many times under different names.
Definition 6.
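$MAE(m, D)$ denotes the Mean Absolute Error of model $m$ on data $D$; we will again usually omit $m$ and $D$ when clear from the context. $MAE$ is defined as follows:
$$MAE \triangleq \pi_0 MAE_0 + \pi_1 MAE_1, \qquad MAE_0 \triangleq \int_0^1 s f_0(s)\, ds = \bar{s}_0, \qquad MAE_1 \triangleq \int_0^1 (1-s) f_1(s)\, ds = 1 - \bar{s}_1$$
where $\bar{s}_k$ is the mean score of class $k$. We can define a macro-average $MAE$ as follows:
$$mMAE \triangleq \frac{MAE_0 + MAE_1}{2} = \frac{\bar{s}_0 + (1 - \bar{s}_1)}{2} \tag{14}$$
It can be shown that $MAE$ is equivalent to the Mean Probability Rate (MPR) [LL02] for discrete classification [PRL09].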
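As an illustration of the two definitions above, the following sketch computes the Brier score, the Mean Absolute Error and their macro-average versions on a finite sample. It assumes, as in this section, that scores are estimated probabilities of class 1 and that class 0 is the positive class; the function name `brier_and_mae` and the toy data are hypothetical.

```python
def brier_and_mae(scores, labels):
    """Empirical BS, mBS, MAE and mMAE for scores in [0, 1].

    labels use 0 for the positive class and 1 for the negative class;
    a score is read as the estimated probability of class 1.
    """
    s0 = [s for s, y in zip(scores, labels) if y == 0]
    s1 = [s for s, y in zip(scores, labels) if y == 1]
    n = len(scores)
    pi0, pi1 = len(s0) / n, len(s1) / n           # class proportions
    bs0 = sum(s ** 2 for s in s0) / len(s0)        # squared error on class 0
    bs1 = sum((1 - s) ** 2 for s in s1) / len(s1)  # squared error on class 1
    mae0 = sum(s0) / len(s0)                       # absolute error on class 0
    mae1 = sum(1 - s for s in s1) / len(s1)        # absolute error on class 1
    return {
        "BS": pi0 * bs0 + pi1 * bs1,
        "mBS": (bs0 + bs1) / 2,
        "MAE": pi0 * mae0 + pi1 * mae1,
        "mMAE": (mae0 + mae1) / 2,
    }


scores = [0.2, 0.9, 0.7, 0.4]   # hypothetical probabilistic scores
labels = [0, 1, 1, 1]           # one positive, three negatives
metrics = brier_and_mae(scores, labels)
```

Because the toy sample is imbalanced, the prior-weighted and macro-average versions of each metric differ.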
4.1 The score-uniform threshold choice method leads to $MAE$
We now demonstrate how varying a model’s threshold leads to an expected loss that is different from accuracy. First, we explore a threshold choice method which considers that we have no information at all about the operating condition, neither at evaluation time nor at deployment time. We just employ the interval between the maximum and minimum value of the scores, and we randomly select the threshold using a uniform distribution over this interval.
Definition 7.
Assuming a model’s scores are expressed on a bounded scale $[l, u]$, the score-uniform threshold choice method is defined as follows:
$$T^{su}_c(c) \triangleq T^{su}_z(z) \triangleq U_{l,u} \tag{15}$$
where $U_{l,u}$ denotes a threshold drawn uniformly at random from $[l, u]$.
Given this threshold choice method, the question is: if we must evaluate a model before application for a wide range of skews and cost proportions, which performance metric should be used?
Theorem 4.
Assuming probabilistic scores and the score-uniform threshold choice method, expected loss under a uniform distribution of cost proportions is equal to the model’s mean absolute error.
Proof.
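First of all we note that the threshold choice method does not take the operating condition into account, and hence we can work with $c = 1/2$ (the loss $Q_c(t; c)$ is linear in $c$, so its average over uniform $c$ equals its value at $c = 1/2$). Then,
$$L^{su}_{U(c)} = \int_l^u Q_c(t; 1/2)\, \frac{1}{u-l}\, dt = \frac{1}{u-l} \int_l^u \{\pi_0 (1 - F_0(t)) + \pi_1 F_1(t)\}\, dt = \frac{\pi_0 (\bar{s}_0 - l) + \pi_1 (u - \bar{s}_1)}{u - l}$$
The last step makes use of the following useful property, where $\bar{s}_k$ is the mean score of class $k$.
$$\int_l^u F_k(t)\, dt = \left[ t F_k(t) \right]_l^u - \int_l^u t f_k(t)\, dt = u F_k(u) - l F_k(l) - \bar{s}_k = u - \bar{s}_k$$
Setting $l = 0$ and $u = 1$ for probabilistic scores, we obtain the final result:
$$L^{su}_{U(c)} = \pi_0 \bar{s}_0 + \pi_1 (1 - \bar{s}_1) = MAE$$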
∎
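Theorem 4 can also be checked numerically on a finite sample. The sketch below is an illustration under stated assumptions: it averages the error rate (which is $Q_c$ at $c = 1/2$) over a fine grid of thresholds in $[0, 1]$ and compares the result with the mean absolute error. The function names, the grid size and the toy data are hypothetical.

```python
def error_rate(scores, labels, t):
    """Q_c(t; 1/2): proportion of misclassified examples when class 0
    (positive) is predicted for score <= t and class 1 otherwise."""
    wrong = sum((s <= t) != (y == 0) for s, y in zip(scores, labels))
    return wrong / len(scores)


def score_uniform_loss(scores, labels, steps=10_000):
    """Average error rate over a midpoint grid of thresholds in [0, 1]."""
    thresholds = ((k + 0.5) / steps for k in range(steps))
    return sum(error_rate(scores, labels, t) for t in thresholds) / steps


def mean_absolute_error(scores, labels):
    """MAE of scores read as estimated probabilities of class 1."""
    return sum(abs(s - y) for s, y in zip(scores, labels)) / len(scores)


scores = [0.2, 0.9, 0.7, 0.4]   # hypothetical probabilistic scores
labels = [0, 1, 1, 1]           # 0 = positive, 1 = negative
loss = score_uniform_loss(scores, labels)
mae = mean_absolute_error(scores, labels)
```

Up to the discretisation error of the grid, `loss` and `mae` coincide, which is the empirical counterpart of the integration-by-parts property used in the proof.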
This gives a baseline loss if we choose thresholds randomly and independently of the model. Using Lemma 1 we obtain the equivalent result for skews:
Corollary 5.
Assuming probabilistic scores and the score-uniform threshold choice method, expected loss under a uniform distribution of skews is equal to the model’s macro-average mean absolute error:
$$L^{su}_{U(z)} = \frac{\bar{s}_0 + (1 - \bar{s}_1)}{2} = mMAE$$
4.2 The score-driven threshold choice method leads to the Brier score
We will now consider the first threshold choice method to take the operating condition into account. Since we are dealing with probabilistic scores, this method simply sets the threshold equal to the operating condition (cost proportion or skew). This is a natural criterion as it has been used especially when the model is a probability estimator and we expect to have perfect information about the operating condition at deployment time. In fact, this is a direct choice when working with proper scoring rules, since when rules are proper, scores are assumed to be a probabilistic assessment. The use of this threshold choice method can be traced back to Murphy [murphy1966note] and, perhaps, implicitly, much earlier. More recently, and in a different context from proper scoring rules, Drummond and Holte [drummondandHolte2006] say it is a common example of a “performance independence criterion”. Referring to figure 22 in their paper, which uses the score-driven threshold choice, they say: “the performance independent criterion, in this case, is to set the threshold to correspond to the operating conditions. For example, if $PC(+) = 0.2$ the Naive Bayes threshold is set to 0.2”. The term $PC(+)$ is equivalent to our ‘skew’.
Definition 8.
Assuming a model’s scores are expressed on a probability scale $[0, 1]$, the score-driven threshold choice method is defined for cost proportions as follows:
$$T^{sd}_c(c) \triangleq c \tag{16}$$
and for skews as
$$T^{sd}_z(z) \triangleq z \tag{17}$$
Given this threshold choice method, the question is: if we must evaluate a model before application for a wide range of skews and cost proportions, which performance metric should be used? This is what we answer below.
Theorem 6 ([ICML11Brier]).
Assuming probabilistic scores and the score-driven threshold choice method, expected loss under a uniform distribution of cost proportions is equal to the model’s Brier score.
Proof.
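If we plug $T^{sd}_c$ (Equation (16)) into the general formula of the expected loss (Equation (4)) we have the expected score-driven loss:
$$L^{sd}_c \triangleq \int_0^1 Q_c(T^{sd}_c(c); c)\, w_c(c)\, dc = \int_0^1 Q_c(c; c)\, w_c(c)\, dc \tag{18}$$
And if we use the uniform distribution and the definition of $Q_c$ (Equation (1)):
$$L^{sd}_{U(c)} = \int_0^1 Q_c(c; c)\, dc = \int_0^1 2\{c\, \pi_0 (1 - F_0(c)) + (1 - c)\, \pi_1 F_1(c)\}\, dc \tag{19}$$
In order to show this is equal to the Brier score, we expand the definition of $BS_0$ and $BS_1$ using integration by parts:
$$BS_0 = \int_0^1 s^2 f_0(s)\, ds = \left[ s^2 F_0(s) \right]_0^1 - \int_0^1 2 s F_0(s)\, ds = 1 - \int_0^1 2 s F_0(s)\, ds = \int_0^1 2 s (1 - F_0(s))\, ds$$
$$BS_1 = \int_0^1 (1-s)^2 f_1(s)\, ds = \left[ (1-s)^2 F_1(s) \right]_0^1 + \int_0^1 2 (1-s) F_1(s)\, ds = \int_0^1 2 (1-s) F_1(s)\, ds$$
Taking their weighted average, we obtain
$$BS = \pi_0 BS_0 + \pi_1 BS_1 = \int_0^1 \{\pi_0\, 2 s (1 - F_0(s)) + \pi_1\, 2 (1-s) F_1(s)\}\, ds \tag{20}$$
which, after reordering of terms and change of variable, is the same expression as Equation (19).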
∎
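The equality established in Theorem 6 can be illustrated numerically. The sketch below assumes the cost-weighted loss $Q_c(t; c) = 2\{c\,\pi_0 (1 - F_0(t)) + (1 - c)\,\pi_1 F_1(t)\}$ with empirical distributions, sets the threshold equal to $c$, and integrates over a midpoint grid of cost proportions. The function names, the grid size and the toy data are hypothetical.

```python
def q_c(scores, labels, t, c):
    """Empirical Q_c(t; c) = 2 * (c * pi0 * (1 - F0(t)) + (1 - c) * pi1 * F1(t)).

    Class 0 (positive) is predicted when score <= t.
    """
    n = len(scores)
    # positives predicted negative: pi0 * (1 - F0(t)) as a proportion of n
    fn = sum(1 for s, y in zip(scores, labels) if y == 0 and s > t)
    # negatives predicted positive: pi1 * F1(t) as a proportion of n
    fp = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= t)
    return 2.0 * (c * fn + (1.0 - c) * fp) / n


def score_driven_loss(scores, labels, steps=10_000):
    """Midpoint-rule integral of Q_c(c; c) over c in [0, 1]."""
    cs = ((k + 0.5) / steps for k in range(steps))
    return sum(q_c(scores, labels, c, c) for c in cs) / steps


def brier_score(scores, labels):
    """Mean squared error of scores read as probabilities of class 1."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


scores = [0.2, 0.9, 0.7, 0.4]   # hypothetical probabilistic scores
labels = [0, 1, 1, 1]           # 0 = positive, 1 = negative
loss = score_driven_loss(scores, labels)
bs = brier_score(scores, labels)
```

Each positive with score $s$ contributes $\int_0^s 2c\, dc = s^2$ and each negative contributes $\int_s^1 2(1-c)\, dc = (1-s)^2$, which is why the two quantities agree up to grid error.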
It is now clear why we just put the Brier score from Table 1 as the expected loss in Table 2. We calculated the expected loss for the score-driven threshold choice method for a uniform distribution of cost proportions as its Brier score.
Theorem 6 was obtained in [ICML11Brier] (the threshold choice method there was called ‘probabilistic’) but it is not completely new in itself. In [murphy1966note] we find a similar relation to expected utility, expressed in terms of the so-called probability score plus a constant term. Apart from the sign (which is explained because Murphy works with utilities and we work with costs), the difference in the constant term is explained because Murphy’s utility (cost) model is based on a cost matrix where we have a cost for one of the classes (in meteorology the class ‘protect’) independently of whether we have a right or wrong prediction (‘adverse’ or ‘good’ weather). The only case in the matrix with a 0 cost is when we have ‘good’ weather and ‘no protect’. It is interesting to see that the result only differs by a constant term, which supports the idea that whenever we can express the operating condition with a cost proportion or skew, the results will be portable to each situation with the inclusion of some constant terms (which are the same for all classifiers). In addition to this result, it is also worth mentioning another work by Murphy [murphy1969measures] where he makes a general derivation for the Beta distribution.
After Murphy, in the last four decades, there has been extensive work on the so-called proper scoring rules, where several utility (cost) models and several distributions for the cost have been used. This has led to relating Brier score (square loss), logarithmic loss, 0-1 loss and other losses which take the scores into account. For instance, in [BSS05] we have a comprehensive account of how all these losses can be obtained as special cases of the Beta distribution. The result given in Theorem 6 would be a particular case for the uniform distribution (which is a special case of the Beta distribution) and a variant of Murphy’s results. Nonetheless, it is important to remark that the results we have just obtained in Section 4.1 (and those we will get in Section 5) are new because they are not obtained by changing the cost distribution but rather by changing the threshold choice method. The threshold choice method used (the score-driven one) is not put into question in the area of proper scoring rules. But Theorem 6 can now be seen as a result which connects these two different dimensions, cost distribution and threshold choice method, thus giving the Brier score an even more prominent role.
We can derive an equivalent result using empirical distributions [ICML11Brier]. In that paper we show how the loss can be plotted in cost space, leading to the Brier curve whose area below is the Brier score.
Finally, using skews we arrive at the prior-independent version of the Brier score.
Corollary 7.
$L^{sd}_{U(z)} = mBS$, where $mBS$ is the macro-average Brier score of Equation (13).
It is interesting to analyse the relation between $L^{su}_{U(c)}$ and $L^{sd}_{U(c)}$ (similarly between $L^{su}_{U(z)}$ and $L^{sd}_{U(z)}$). Since the former gives the $MAE$ and the second gives the Brier score (which is the MSE), from the definitions of $MAE$ and Brier score, we get that, assuming scores are between $0$ and $1$, we have:
$$MAE = L^{su}_{U(c)} \geq L^{sd}_{U(c)} = BS$$
$$mMAE = L^{su}_{U(z)} \geq L^{sd}_{U(z)} = mBS$$
Since $MAE$ and $BS$ have the same terms but the second squares them, and all the values which are squared are between 0 and 1, then the $BS$ must be lower or equal. This is natural, since the expected loss is lower if we get reliable information about the operating condition at deployment time. So, the difference between the Brier score and $MAE$ is precisely the gain we can get by having (and using) the information about the operating condition at deployment time. Notice that all this holds regardless of the quality of the probability estimates.
5 Threshold choice methods using rates
We show in this section that $AUC$ can be translated into expected loss for varying operating conditions in more than one way, depending on the threshold choice method used. We consider two threshold choice methods, where each of them sets the threshold to achieve a particular predicted positive rate: the rate-uniform method, which sets the rate in a uniform way; and the rate-driven method, which sets the rate equal to the operating condition.
We recall the definition of a ROC curve and its area first.
Definition 9.
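The ROC curve [SDM00, Fawcett06] is defined as a plot of $F_1(t)$ (i.e., false positive rate at decision threshold $t$) on the $x$-axis against $F_0(t)$ (true positive rate at $t$) on the $y$-axis, with both quantities monotonically non-decreasing with increasing $t$ (remember that scores increase with $\hat{p}(1|x)$ and 1 stands for the negative class). The Area Under the ROC curve ($AUC$) is defined as:
$$AUC \triangleq \int_0^1 F_0(s)\, dF_1(s) = \int_{-\infty}^{+\infty} F_0(s) f_1(s)\, ds = \int_{-\infty}^{+\infty} (1 - F_1(s)) f_0(s)\, ds \tag{21}$$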
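On a finite sample, the area under the ROC curve can be estimated as the proportion of (positive, negative) pairs that are ranked correctly. The sketch below follows the convention of this paper, in which class 0 is the positive class and positives are expected to receive lower scores than negatives; counting ties as one half is an assumption of this illustration, and the function name `auc` and the toy data are hypothetical.

```python
def auc(scores, labels):
    """Pairwise estimate of AUC under the convention that class 0 is
    positive and should receive lower scores than class 1 (negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 0]
    neg = [s for s, y in zip(scores, labels) if y == 1]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p < q:
                wins += 1.0      # pair ranked correctly
            elif p == q:
                wins += 0.5      # tie counted as half (assumption)
    return wins / (len(pos) * len(neg))


scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical scores
labels = [0, 0, 1, 1]            # two positives, two negatives
area = auc(scores, labels)       # three of the four pairs are ranked correctly
```

The quadratic loop is kept for readability; a rank-based computation gives the same value in $O(n \log n)$ time.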
5.1 The rate-uniform threshold choice method leads to $AUC$
The rate-fixed threshold choice method places the threshold in such a way that a given predicted positive rate is achieved. However, if this proportion may change easily or we are not going to have (reliable) information about the operating condition at deployment time, an alternative idea is to consider a non-deterministic choice or a distribution for this quantity. One reasonable choice can be a uniform distribution.
Definition 10.
The rate-uniform threshold choice method non-deterministically sets the threshold to achieve a uniformly randomly selected rate:
$$T^{ru}_c(c) \triangleq T^{rf}_c[U_{0,1}](c) \tag{22}$$
$$T^{ru}_z(z) \triangleq T^{rf}_z[U_{0,1}](z) \tag{23}$$
In other words, it sets a relative quantity (from 0% positives to 100% positives) in a uniform way, and obtains the threshold from this uniform distribution over rates. Note that for a large number of examples, this is the same as defining a uniform distribution over examples or, alternatively, over cutpoints (between examples), as explored in [ICML11CoherentAUC].
There are reasons for considering this a reasonable threshold choice method. It is a generalisation of the rate-fixed threshold choice method which considers all the imbalances (class proportions) equally likely whenever we make a classification. It assumes that we will not have any information about the operating condition at deployment time.
As done before for other threshold choice methods, we analyse the question: given this threshold choice method, if we must evaluate a model before application for a wide range of skews and cost proportions, which performance metric should be used?
The corresponding expected loss for cost proportions is
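$$L^{ru}_c \triangleq \int_0^1 \int_0^1 Q_c(T^{rf}_c[r](c); c)\, U(r)\, w_c(c)\, dr\, dc = \int_0^1 \int_0^1 Q_c(R^{-1}(r); c)\, U(r)\, w_c(c)\, dr\, dc \tag{24}$$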
We then have the following result.
Theorem 8 ([ICML11CoherentAUC]).
Assuming the rate-fixed threshold choice method, expected loss for uniform cost proportion and uniform rate decreases linearly with $AUC$ as follows:
$$L^{ru}_{U(c)} = \pi_0 \pi_1 (1 - 2AUC) + \frac{1}{2}$$
Proof.
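First of all we note that the threshold choice method does not take the operating condition into account, and hence we can work with $c = 1/2$. Furthermore, $r = R(t)$ and hence $dr = R'(t)\, dt = \{\pi_0 f_0(t) + \pi_1 f_1(t)\}\, dt$. Then,
$$L^{ru}_{U(c)} = \int_0^1 Q_c(R^{-1}(r); 1/2)\, dr = \int_{-\infty}^{\infty} Q_c(t; 1/2)\, R'(t)\, dt = \int_{-\infty}^{\infty} \{\pi_0 (1 - F_0(t)) + \pi_1 F_1(t)\} \{\pi_0 f_0(t) + \pi_1 f_1(t)\}\, dt$$
$$= \pi_0 \pi_1 \int_{-\infty}^{\infty} \{(1 - F_0(t)) f_1(t) + F_1(t) f_0(t)\}\, dt + \pi_0^2 \int_{-\infty}^{\infty} (1 - F_0(t)) f_0(t)\, dt + \pi_1^2 \int_{-\infty}^{\infty} F_1(t) f_1(t)\, dt$$
The first term can be related to $AUC$:
$$\int_{-\infty}^{\infty} \{(1 - F_0(t)) f_1(t) + F_1(t) f_0(t)\}\, dt = (1 - AUC) + (1 - AUC) = 2(1 - AUC)$$
The remaining two terms are easily solved:
$$\int_{-\infty}^{\infty} (1 - F_0(t)) f_0(t)\, dt = \int_{-\infty}^{\infty} F_1(t) f_1(t)\, dt = \frac{1}{2}$$
Putting everything together we obtain $L^{ru}_{U(c)} = 2\pi_0\pi_1 (1 - AUC) + \pi_0^2/2 + \pi_1^2/2$. Since $\pi_0^2 + \pi_1^2 = 1 - 2\pi_0\pi_1$, this can be rewritten to $L^{ru}_{U(c)} = \pi_0\pi_1 (1 - 2AUC) + 1/2$. ∎
Footnote 1: If we do not assume a uniform distribution for cost proportions we would obtain a different integral, but expected loss would still be linear in $AUC$ (David Hand, personal communication).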
Corollary 9.
Assuming the rate-fixed threshold choice method, expected loss for uniform skew and uniform rate decreases linearly with $AUC$ as follows:
$$L^{ru}_{U(z)} = \frac{1 - 2AUC}{4} + \frac{1}{2}$$
We see that expected loss for uniform skew ranges from 1/4 for a perfect ranker that is harmed by suboptimal threshold choices, to 3/4 for the worst possible ranker that puts positives and negatives the wrong way round, yet gains some performance by putting the threshold at or close to one of the extremes.
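The linear relation between expected loss and the area under the ROC curve can be illustrated with a small simulation. The sketch below is an illustration under stated assumptions, not the authors' experiment: scores are drawn from two hypothetical Beta distributions, the rate is chosen uniformly among the $n + 1$ cutpoints of the ranking, and the error rate (the loss at $c = 1/2$) is averaged over those cutpoints. All function names are hypothetical.

```python
import random


def make_data(n, pi0, seed=0):
    """Hypothetical synthetic data: class 0 (positive) scores are drawn from
    Beta(2, 4) and class 1 (negative) scores from Beta(4, 2)."""
    rng = random.Random(seed)
    labels = [0 if rng.random() < pi0 else 1 for _ in range(n)]
    scores = [rng.betavariate(2, 4) if y == 0 else rng.betavariate(4, 2)
              for y in labels]
    return scores, labels


def auc(scores, labels):
    """Proportion of (positive, negative) pairs in which the positive has the
    lower score. Ties are ignored (scores are assumed continuous)."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    pos_seen = wins = 0
    for i in order:
        if labels[i] == 0:
            pos_seen += 1
        else:
            wins += pos_seen      # positives ranked below this negative
    n_pos = labels.count(0)
    return wins / (n_pos * (len(labels) - n_pos))


def rate_uniform_loss(scores, labels):
    """Error rate (Q_c at c = 1/2) averaged over all n + 1 cutpoints."""
    n = len(scores)
    order = sorted(range(n), key=scores.__getitem__)
    errors = labels.count(0)      # cutpoint 0: all positives are missed
    total = errors
    for i in order:               # move one more example to the positive side
        errors += -1 if labels[i] == 0 else 1
        total += errors
    return total / (n * (n + 1))


scores, labels = make_data(n=5000, pi0=0.3)
pi0 = labels.count(0) / len(labels)
pi1 = 1.0 - pi0
a = auc(scores, labels)
empirical = rate_uniform_loss(scores, labels)
predicted = pi0 * pi1 * (1 - 2 * a) + 0.5   # linear-in-AUC formula
```

For a sample of this size `empirical` and `predicted` differ only by a finite-sample correction of order $1/n$.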
Intuitively, these formulae can be understood as follows. Setting a randomly sampled rate is equivalent to setting the decision threshold to the score of a randomly sampled example. With probability $\pi_0$ we select a positive and with probability $\pi_1$ we select a negative. If we select a positive, then the expected true positive rate is $1/2$ (as on average we select the middle one); and the expected false positive rate is $1 - AUC$ (as one interpretation of $AUC$ is the expected proportion of negatives ranked correctly wrt. a random positive). Similarly, if we select a negative then the expected true positive rate is $AUC$ and the expected false positive rate is $1/2$. Put together, the expected true positive rate is $\pi_0/2 + \pi_1 AUC$ and the expected false positive rate is $\pi_1/2 + \pi_0 (1 - AUC)$. The proportion of true positives among all examples is thus
$$\pi_0 \left( \pi_0/2 + \pi_1 AUC \right) = \pi_0^2/2 + \pi_0\pi_1 AUC$$
and the proportion of false positives is
$$\pi_1 \left( \pi_1/2 + \pi_0 (1 - AUC) \right) = \pi_1^2/2 + \pi_0\pi_1 (1 - AUC)$$
We can summarise these expectations in the following contingency table (all numbers are proportions relative to the total number of examples):
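$$\begin{array}{l|cc|c}
 & \text{Predicted } + & \text{Predicted } - & \\ \hline
\text{Actual } + & \pi_0^2/2 + \pi_0\pi_1 AUC & \pi_0^2/2 + \pi_0\pi_1 (1 - AUC) & \pi_0 \\
\text{Actual } - & \pi_1^2/2 + \pi_0\pi_1 (1 - AUC) & \pi_1^2/2 + \pi_0\pi_1 AUC & \pi_1 \\ \hline
 & 1/2 & 1/2 & 1
\end{array}$$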
The column totals are, of course, as expected: if we randomly select an example to split on, then the expected split is in the middle.
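The expected contingency table can be written down directly from the class proportion of positives and the area under the ROC curve. The following sketch does so and makes the marginals easy to inspect; the function name `expected_contingency` and the example values are hypothetical.

```python
def expected_contingency(pi0, auc):
    """Expected proportions (TP, FN, FP, TN) when the threshold is set at the
    score of a uniformly sampled example; pi0 is the proportion of positives
    and auc the area under the ROC curve."""
    pi1 = 1.0 - pi0
    tp = pi0 ** 2 / 2 + pi0 * pi1 * auc          # actual +, predicted +
    fn = pi0 ** 2 / 2 + pi0 * pi1 * (1 - auc)    # actual +, predicted -
    fp = pi1 ** 2 / 2 + pi0 * pi1 * (1 - auc)    # actual -, predicted +
    tn = pi1 ** 2 / 2 + pi0 * pi1 * auc          # actual -, predicted -
    return tp, fn, fp, tn


tp, fn, fp, tn = expected_contingency(pi0=0.3, auc=0.8)
expected_loss = fn + fp   # proportion of misclassified examples
```

The sum of the two off-diagonal cells reproduces the expression $\pi_0\pi_1(1 - 2AUC) + 1/2$ for the expected loss under uniform cost proportions.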
While in this paper we concentrate on the case where we have access to population densities $f_k(s)$ and distribution functions $F_k(t)$, in practice we have to work with empirical estimates. In [ICML11CoherentAUC] we provide an alternative formulation of the main results in this section, relating empirical loss to the $AUC$ of the empirical ROC curve. For instance, the expected loss for uniform skew and uniform instance selection is calculated in [ICML11CoherentAUC] to be $\frac{n}{n+1} \frac{1 - 2AUC}{4} + \frac{1}{2}$, showing that for smaller samples the reduction in loss due to $AUC$ is somewhat smaller.
5.2 The rate-driven threshold choice method leads to $AUC$
Naturally, if we can have precise information of the operating condition at deployment time, we can use the information about the skew or cost to adjust the rate of positives and negatives to that proportion. This leads to a new threshold selection method: if we are given skew (or cost proportion) $z$ (or $c$), we choose the threshold $t$ in such a way that we get a proportion of $z$ (or $c$) positives. This is an elaboration of the rate-fixed threshold choice method which does take the operating condition into account.
Definition 11.
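The rate-driven threshold choice method for cost proportions is defined as
$$T^{rd}_c(c) \triangleq T^{rf}_c[c](c) = R^{-1}(c) \tag{25}$$
The rate-driven threshold choice method for skews is defined as
$$T^{rd}_z(z) \triangleq T^{rf}_z[z](z) = R_z^{-1}(z) \tag{26}$$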
Given this threshold choice method, the question is again: if we must evaluate a model before application for a wide range of skews and cost proportions, which performance metric should be used? This is what we answer below.
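If we plug $T^{rd}_c$ (Equation (25)) into the general formula of the expected loss for a range of cost proportions (Equation (4)) we have:
$$L^{rd}_c \triangleq \int_0^1 Q_c(T^{rd}_c(c); c)\, w_c(c)\, dc = \int_0^1 Q_c(R^{-1}(c); c)\, w_c(c)\, dc \tag{27}$$
And now, from this definition, if we use the uniform distribution for $w_c(c)$, we obtain this new result.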
Theorem 10.
Expected loss for uniform cost proportions using the rate-driven threshold choice method is linearly related to $AUC$ as follows:
$$L^{rd}_{U(c)} = \pi_0 \pi_1 (1 - 2AUC) + \frac{1}{3}$$
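The statement of Theorem 10 can be illustrated numerically. The sketch below is an illustration under stated assumptions: scores come from two hypothetical Beta distributions, and for each cost proportion $c$ on a midpoint grid the threshold is placed so that a proportion of about $c$ of the examples is predicted positive. The resulting average loss is then compared with $\pi_0\pi_1(1 - 2AUC) + 1/3$. All function names are hypothetical.

```python
import random


def make_data(n, pi0, seed=1):
    """Hypothetical synthetic data: class 0 (positive) scores are drawn from
    Beta(2, 4) and class 1 (negative) scores from Beta(4, 2)."""
    rng = random.Random(seed)
    labels = [0 if rng.random() < pi0 else 1 for _ in range(n)]
    scores = [rng.betavariate(2, 4) if y == 0 else rng.betavariate(4, 2)
              for y in labels]
    return scores, labels


def auc(scores, labels):
    """Proportion of (positive, negative) pairs in which the positive has the
    lower score. Ties are ignored (scores are assumed continuous)."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    pos_seen = wins = 0
    for i in order:
        if labels[i] == 0:
            pos_seen += 1
        else:
            wins += pos_seen
    n_pos = labels.count(0)
    return wins / (n_pos * (len(labels) - n_pos))


def rate_driven_loss(scores, labels):
    """Midpoint-rule approximation of the integral of Q_c(R^{-1}(c); c) over c.

    For c = (k + 0.5) / n the k lowest-scoring examples are predicted
    positive, so the predicted positive rate k / n is within 1 / (2n) of c.
    """
    n = len(scores)
    order = sorted(range(n), key=scores.__getitem__)
    fn = labels.count(0)          # positives still predicted negative
    fp = 0                        # negatives already predicted positive
    total = 0.0
    for k, i in enumerate(order):
        c = (k + 0.5) / n
        total += 2.0 * (c * fn + (1.0 - c) * fp) / n   # Q_c at this cutpoint
        if labels[i] == 0:
            fn -= 1
        else:
            fp += 1
    return total / n


scores, labels = make_data(n=4000, pi0=0.3)
pi0 = labels.count(0) / len(labels)
pi1 = 1.0 - pi0
a = auc(scores, labels)
empirical = rate_driven_loss(scores, labels)
predicted = pi0 * pi1 * (1 - 2 * a) + 1.0 / 3.0
```

The two values agree up to a discretisation error of order $1/n$. Compared with the rate-uniform case, the constant term drops from $1/2$ to $1/3$.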
Proof.
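By a change of variable we have $c = R(t)$ and hence $dc = R'(t)\, dt = \{\pi_0 f_0(t) + \pi_1 f_1(t)\}\, dt$, and thus
$$L^{rd}_{U(c)} = \int_0^1 Q_c(R^{-1}(c); c)\, dc = \int_{-\infty}^{\infty} Q_c(t; R(t))\, R'(t)\, dt = \int_{-\infty}^{\infty} 2\{R(t)\, \pi_0 (1 - F_0(t)) + (1 - R(t))\, \pi_1 F_1(t)\}\, R'(t)\, dt$$
$$= \int_{-\infty}^{\infty} 2\{\pi_0 R(t) - R(t)^2\}\, R'(t)\, dt + \int_{-\infty}^{\infty} 2 \pi_1 F_1(t)\, R'(t)\, dt$$
where the last step uses $\pi_0 F_0(t) + \pi_1 F_1(t) = R(t)$.
All terms in the first integral can be reduced to $R(t)$:
$$\int_{-\infty}^{\infty} 2\{\pi_0 R(t) - R(t)^2\}\, R'(t)\, dt = \int_0^1 2 (\pi_0 c - c^2)\, dc = \pi_0 - \frac{2}{3}$$
The second integral provides the link to $AUC$:
$$\int_{-\infty}^{\infty} 2 \pi_1 F_1(t)\, R'(t)\, dt = 2\pi_1 \int_{-\infty}^{\infty} F_1(t) \{\pi_0 f_0(t) + \pi_1 f_1(t)\}\, dt = 2\pi_0\pi_1 (1 - AUC) + \pi_1^2$$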