On the Consistency of AUC Pairwise Optimization

Wei Gao and Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Abstract

AUC (area under the ROC curve) is an important evaluation criterion widely used in diverse learning tasks such as class-imbalance learning, cost-sensitive learning, learning to rank and information retrieval. Many learning approaches have been developed to optimize AUC, but owing to its non-convexity and discontinuity, most approaches work with pairwise surrogate losses such as exponential loss, hinge loss, etc.; an important theoretical problem is therefore to study AUC consistency based on minimizing pairwise surrogate losses.

In this paper, we introduce the generalized calibration for AUC optimization, and prove that the generalized calibration is necessary yet insufficient for AUC consistency. We then provide a new sufficient condition for the AUC consistency of learning approaches based on minimizing pairwise surrogate losses, and from this finding, we prove that exponential loss, logistic loss and distance-weighted loss are consistent with AUC. In addition, we derive the $p$-norm hinge loss and general hinge loss that are consistent with AUC. We also derive the regret bounds for exponential loss and logistic loss, and present the regret bounds for more general surrogate losses in the realizable setting. Finally, we prove regret bounds that disclose the equivalence between the pairwise exponential surrogate loss of AUC and the exponential surrogate loss of accuracy, and one direct consequence of this finding is the equivalence between AdaBoost and RankBoost in the limit of infinite sample.

Keywords: AUC, consistency, surrogate loss, cost-sensitive learning, learning to rank, RankBoost, AdaBoost
Journal: Artificial Intelligence Journal

1 Introduction

AUC (Area Under the ROC Curve) is an important evaluation criterion that has been adopted in diverse learning tasks such as cost-sensitive learning, class-imbalance learning, learning to rank, information retrieval, etc. (Elkan, 2001; Freund et al., 2003; Cortes and Mohri, 2004; Balcan et al., 2007; Ailon and Mohri, 2008; Clémençon and Vayatis, 2009; Clémençon et al., 2009; Kotlowski et al., 2011; Flach et al., 2011), where traditional criteria such as accuracy, precision, recall, etc. are inadequate (Provost et al., 1998; Provost and Fawcett, 2001) since AUC is insensitive to the class distribution.

Owing to its non-convexity and discontinuity, it is not easy, or even infeasible, to optimize AUC directly, since such optimization often yields NP-hard problems. As a compromise to avoid these computational difficulties, pairwise surrogate losses that can be optimized more efficiently are usually adopted in practical algorithms, e.g., exponential loss (Freund et al., 2003; Rudin and Schapire, 2009), hinge loss (Brefeld and Scheffer, 2005; Joachims, 2005; Zhao et al., 2011), least square loss (Gao et al., 2013), etc.

An important theoretical problem is how well minimizing such convex surrogate losses improves the actual AUC; in other words, does the expected risk of learning with surrogate losses converge to the Bayes risk of AUC? Consistency (also called Bayes consistency) guarantees that optimizing a surrogate loss yields an optimal function attaining the Bayes risk in the limit of infinite sample. Thus, the above problem can be formally stated as: is the optimization of surrogate losses consistent with AUC?

1.1 Our Contribution

We first introduce the generalized calibration for AUC optimization based on minimizing pairwise surrogate losses, and find that the generalized calibration is necessary yet insufficient for AUC consistency. For example, hinge loss and absolute loss are calibrated but inconsistent with AUC. The underlying reason is that, for pairwise surrogate losses, minimizing the expected risk over the whole distribution is not equivalent to minimizing the conditional risk on each pair of instances.

We then provide a new sufficient condition for the AUC consistency of learning approaches based on minimizing pairwise surrogate losses. From this finding, we prove that exponential loss, logistic loss and distance-weighted loss are consistent with AUC. In addition, we derive the $p$-norm hinge loss and general hinge loss that are consistent with AUC. We also derive the regret bounds for exponential loss and logistic loss, and present the regret bounds for more general surrogate losses in the realizable setting.

Finally, we provide regret bounds that disclose the equivalence between the pairwise exponential surrogate loss of AUC and the exponential surrogate loss of accuracy; in other words, the exponential surrogate loss of accuracy is consistent with AUC, while the pairwise surrogate loss of AUC is consistent with accuracy once a proper threshold is selected. One direct consequence of this finding is the equivalence between AdaBoost and RankBoost in the limit of infinite sample.

1.2 Related Work

The studies on AUC can be traced back to the 1970s in signal detection theory (Egan, 1975), and it has been widely used as a criterion in the medical area and in machine learning (Provost et al., 1998; Provost and Fawcett, 2001; Elkan, 2001). In model selection, AUC has also been shown, both theoretically and empirically, to be a better measure than accuracy (Huang and Ling, 2005). AUC can be estimated under parametric (Zhou et al., 2002), semi-parametric (Hsieh and Turnbull, 1996) and non-parametric (Hanley and McNeil, 1982) assumptions; the non-parametric estimate of AUC is widely applied in machine learning and data mining, and is equivalent to the Wilcoxon-Mann-Whitney (WMW) statistic of ranks (Hanley and McNeil, 1982). In addition, Hand (2009) and Flach et al. (2011) present the incoherent and coherent explanations of AUC as a measure of aggregated classifier performance, respectively.
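
For concreteness, the non-parametric estimate mentioned above can be computed directly from a finite sample. The following sketch (plain Python with NumPy; the function and variable names are our own) evaluates the WMW statistic, i.e., the fraction of (positive, negative) pairs ranked correctly, with ties counted as one half.

import numpy as np

def auc_wmw(scores, labels):
    """Empirical AUC as the Wilcoxon-Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly, ties counted as 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == +1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]      # all positive-negative score differences
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

# toy check: a score function ranking every positive above every negative gives AUC = 1
print(auc_wmw([0.9, 0.8, 0.3, 0.1], [+1, +1, -1, -1]))   # -> 1.0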

AUC has long been regarded as a performance measure for information retrieval and learning to rank, especially for bipartite ranking (Cohen et al., 1999; Freund et al., 2003; Cortes and Mohri, 2004; Rudin and Schapire, 2009; Rudin, 2009). Various generalization bounds have been presented to understand prediction beyond the training sample (Agarwal et al., 2005; Usunier et al., 2005; Cortes et al., 2007; Clemenćon et al., 2008; Agarwal and Niyogi, 2009; Rudin and Schapire, 2009; Wang et al., 2012; Kar et al., 2013). In addition, the learnability of AUC has been studied in (Agarwal and Roth, 2005; Gao and Zhou, 2013b).

Consistency is an important theoretical issue in machine learning. For example, Breiman (2004) showed that arcing-style greedy boosting algorithms with exponential loss converge to the Bayes classifier, and Bühlmann and Yu (2003) proved the consistency of boosting algorithms with respect to least square loss. Lin (2002) and Steinwart (2005) studied the consistency of support vector machines. For binary classification, Zhang (2004b) and Bartlett et al. (2006) provided the most fundamental and comprehensive analyses, and many famous algorithms such as boosting, logistic regression and SVMs have been proven to be consistent. Further, consistency studies on multi-class learning and multi-label learning have been addressed in (Zhang, 2004a; Tewari and Bartlett, 2007) and in (Gao and Zhou, 2011, 2013a), respectively. The consistency of learning to rank has also been well studied (Clemenćon et al., 2008; Cossock and Zhang, 2008; Xia et al., 2008, 2009; Duchi et al., 2010).

In contrast to previous studies on consistency (Zhang, 2004a, b; Bartlett et al., 2006; Tewari and Bartlett, 2007; Gao and Zhou, 2011, 2013a) that focused on single instances, our work concerns pairwise surrogate losses defined over pairs of instances from different classes. This difference means that previous consistency analyses could restrict attention to the conditional risk, whereas our analysis has to consider the whole distribution, because, as shown in Lemma 1, minimizing the expected risk over the whole distribution is not equivalent to minimizing the conditional risk. This is a challenge for the study of AUC consistency based on minimizing pairwise surrogate losses.

Clemenćon et al. (2008) formulated ranking problems in a statistical framework and achieved faster rates of convergence under noise assumptions based on new inequalities. They also studied the consistency of ranking rules, whereas our work studies the consistency of score functions based on pairwise surrogate losses. As a consequence, calibration was shown to be a necessary and sufficient condition in (Clemenćon et al., 2008), whereas we will show that calibration is a necessary yet insufficient condition in our setting; e.g., hinge loss and absolute loss are calibrated but inconsistent with AUC (as shown in Section 3).

Duchi et al. (2010) studied the consistency of supervised ranking, but their setting is quite different from ours. Firstly, the problem settings are different: they considered “instances” consisting of a query, a set of inputs and a weighted graph, and the goal is to order the inputs according to the weighted graph; we consider instances with positive or negative labels, and the goal is to rank positive instances higher than negative ones. Further, they established the inconsistency of the logistic loss, exponential loss and hinge loss even in low-noise settings, whereas our work shows that the logistic loss and exponential loss are consistent but the hinge loss is inconsistent.

Kotlowski et al. (2011) studied AUC consistency based on minimizing univariate surrogate losses (e.g., exponential loss and logistic loss), and this result was generalized to a broad class of proper (composite) losses by Agarwal (2013) with simpler techniques. These two studies focused on univariate surrogate losses, whereas our work considers pairwise surrogate losses, which have been widely used in the literature (Freund et al., 2003; Brefeld and Scheffer, 2005; Joachims, 2005; Rudin and Schapire, 2009; Zhao et al., 2011; Gao et al., 2013).

1.3 Organization

Section 2 introduces preliminaries. Section 3 shows that generalized calibration is necessary yet insufficient for AUC consistency, and presents a new sufficient condition along with consistent surrogate losses. Section 4 presents regret bounds for exponential loss and logistic loss, as well as regret bounds for general surrogate losses under the realizable setting. Section 5 discloses the equivalence between the exponential surrogate losses of AUC and accuracy. Section 6 presents detailed proofs and Section 7 concludes this work.

2 Preliminaries

Let $\mathcal{X}$ be an instance space and $\mathcal{Y}=\{+1,-1\}$ the label set. We denote by $\mathcal{D}$ an unknown (underlying) distribution over $\mathcal{X}\times\mathcal{Y}$, and by $\mathcal{D}_{\mathcal{X}}$ the instance-marginal distribution over $\mathcal{X}$. Further, we denote $p=\Pr[y=+1]$ and the conditional probability $\eta(x)=\Pr[y=+1\mid x]$. The cases $p=1$ (all positive instances) and $p=0$ (all negative instances) are trivial, and we assume $0<p<1$ throughout this work.

For a score function $f\colon\mathcal{X}\to\mathbb{R}$, the AUC w.r.t. the distribution $\mathcal{D}$ is given by

$$\mathrm{AUC}(f)=\mathrm{E}\Big[\mathbb{I}[f(x)>f(x')]+\tfrac{1}{2}\mathbb{I}[f(x)=f(x')]\;\Big|\;y=+1,\,y'=-1\Big],$$

where $(x,y)$ and $(x',y')$ are drawn identically and independently according to distribution $\mathcal{D}$, and $\mathbb{I}[\cdot]$ is the indicator function which returns 1 if its argument is true and 0 otherwise. Maximizing the AUC is equivalent to minimizing the expected risk

$$R(f)=\mathrm{E}\Big[\mathbb{I}[f(x)<f(x')]+\tfrac{1}{2}\mathbb{I}[f(x)=f(x')]\;\Big|\;y=+1,\,y'=-1\Big],\qquad(1)$$

where the expectation is taken over $(x,y)$ and $(x',y')$ drawn i.i.d. from distribution $\mathcal{D}$, and $R(f)$ is also called the ranking loss. It is easy to obtain $R(f)=1-\mathrm{AUC}(f)$. Denote by $R^*=\inf_f R(f)$ the Bayes risk, where the infimum is taken over all measurable functions. By simple calculation, we can get the set of optimal functions as

$$\mathcal{B}=\big\{f\colon\big(f(x)-f(x')\big)\big(\eta(x)-\eta(x')\big)>0\ \text{for all}\ x,x'\ \text{with}\ \eta(x)\neq\eta(x')\big\}.\qquad(2)$$

It is easy to see that the ranking loss is non-convex and discontinuous, and thus direct optimization often leads to NP-hard problems. In practice, surrogate losses that can be optimized by efficient algorithms are usually adopted. For AUC, a commonly used formulation is based on pairwise surrogate losses of the form

$$\Phi(f,x,x')=\phi\big(f(x)-f(x')\big)$$

for a positive instance $x$ and a negative instance $x'$, where $\phi$ is a convex function, e.g., exponential loss $\phi(t)=e^{-t}$ (Freund et al., 2003; Rudin and Schapire, 2009), hinge loss $\phi(t)=\max(0,1-t)$ (Brefeld and Scheffer, 2005; Joachims, 2005; Zhao et al., 2011), least square loss $\phi(t)=(1-t)^2$ (Gao et al., 2013), etc.
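
As an illustration of this pairwise formulation, the following sketch (Python with NumPy; the function names and the toy data are our own) evaluates the empirical pairwise surrogate risk of a scoring function for several common choices of $\phi$.

import numpy as np

# common choices of the convex surrogate phi(t), applied to t = f(x) - f(x')
phi_exp      = lambda t: np.exp(-t)                    # exponential loss
phi_hinge    = lambda t: np.maximum(0.0, 1.0 - t)      # hinge loss
phi_logistic = lambda t: np.log1p(np.exp(-t))          # logistic loss
phi_square   = lambda t: (1.0 - t) ** 2                # least square loss

def pairwise_surrogate_risk(phi, scores, labels):
    """Average phi(f(x) - f(x')) over all (positive x, negative x') pairs."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == +1], scores[labels == -1]
    return float(np.mean(phi(pos[:, None] - neg[None, :])))

scores = np.array([1.2, 0.4, -0.3, -1.0])
labels = np.array([+1, +1, -1, -1])
for name, phi in [("exponential", phi_exp), ("hinge", phi_hinge),
                  ("logistic", phi_logistic), ("least square", phi_square)]:
    print(name, pairwise_surrogate_risk(phi, scores, labels))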

For a pairwise surrogate loss, we define the expected $\phi$-risk as

$$R_\phi(f)=\mathrm{E}_{x,x'\sim\mathcal{D}_{\mathcal{X}}}\Big[\eta(x)\big(1-\eta(x')\big)\phi\big(f(x)-f(x')\big)+\eta(x')\big(1-\eta(x)\big)\phi\big(f(x')-f(x)\big)\Big],\qquad(3)$$

and denote by $R_\phi^*=\inf_f R_\phi(f)$ the optimal expected $\phi$-risk, where the infimum is taken over all measurable functions. Given two instances $x,x'$, we denote by $C(f,x,x')$ the conditional $\phi$-risk

$$C(f,x,x')=\eta(x)\big(1-\eta(x')\big)\phi\big(f(x)-f(x')\big)+\eta(x')\big(1-\eta(x)\big)\phi\big(f(x')-f(x)\big),\qquad(4)$$

where $\eta(\cdot)$ is the conditional probability defined above, and it holds that $R_\phi(f)=\mathrm{E}_{x,x'\sim\mathcal{D}_{\mathcal{X}}}[C(f,x,x')]$. For convenience, we denote by $\eta=\eta(x)$ and $\eta'=\eta(x')$. Then, writing $\alpha$ for the difference $f(x)-f(x')$, we define the optimal conditional $\phi$-risk

$$C^*(x,x')=\inf_{\alpha\in\mathbb{R}}\Big\{\eta(1-\eta')\phi(\alpha)+\eta'(1-\eta)\phi(-\alpha)\Big\},\qquad(5)$$

and further define

$$C^-(x,x')=\inf_{\alpha\colon\alpha(\eta-\eta')\le 0}\Big\{\eta(1-\eta')\phi(\alpha)+\eta'(1-\eta)\phi(-\alpha)\Big\}.\qquad(6)$$

3 AUC Consistency

We first define the AUC consistency as follows:

Definition 1

The surrogate loss $\phi$ is said to be consistent with AUC if, for every sequence of functions $\{f_n\}_{n\ge1}$, the following holds over all distributions $\mathcal{D}$ on $\mathcal{X}\times\mathcal{Y}$: if $R_\phi(f_n)\to R_\phi^*$ then $R(f_n)\to R^*$.

In binary classification, Bartlett et al. (2006) showed that classification calibration is necessary and sufficient for the consistency of error. Motivated by this work, we generalize calibration to AUC as follows:

Definition 2

The surrogate loss $\phi$ is said to be calibrated if

$$C^-(x,x')>C^*(x,x')\quad\text{for every }x,x'\text{ with }\eta\neq\eta',$$

where $C^*(x,x')$ and $C^-(x,x')$ are defined by Eqns. (5) and (6), respectively.

We now study the relationship between calibration and AUC consistency. Recall that $R_\phi(f)=\mathrm{E}_{x,x'\sim\mathcal{D}_{\mathcal{X}}}[C(f,x,x')]$, and we first observe that

$$R_\phi^*=\inf_f\mathrm{E}_{x,x'\sim\mathcal{D}_{\mathcal{X}}}\big[C(f,x,x')\big]\;\ge\;\mathrm{E}_{x,x'\sim\mathcal{D}_{\mathcal{X}}}\big[C^*(x,x')\big].\qquad(7)$$

Notice that the equality in Eqn. (7) does not hold for many commonly used surrogate losses such as hinge loss, least square hinge loss, least square loss, absolute loss, etc., as shown by the following lemma:

Lemma 1

For hinge loss $\phi(t)=\max(0,1-t)$, least square hinge loss $\phi(t)=(\max(0,1-t))^2$, least square loss $\phi(t)=(1-t)^2$ and absolute loss $\phi(t)=|1-t|$, there exist distributions under which the inequality in Eqn. (7) is strict, i.e., $R_\phi^*>\mathrm{E}_{x,x'\sim\mathcal{D}_{\mathcal{X}}}[C^*(x,x')]$.

Lemma 1 shows that minimizing the expected $\phi$-risk over the whole distribution is not equivalent to minimizing the conditional $\phi$-risk on each pair of instances from different classes. Therefore, for pairwise surrogate losses, the study of AUC consistency should focus on the expected $\phi$-risk over the whole distribution rather than the conditional $\phi$-risk on each pair of instances. This is quite different from binary classification, where minimizing the expected risk over the whole distribution is equivalent to minimizing the conditional risk on each instance, and thus the study of consistency for binary classification focuses on the conditional risk, as illustrated in (Zhang, 2004b; Bartlett et al., 2006).

Proof We present a detailed proof for hinge loss by contradiction; similar considerations apply to the other losses. Suppose that there exists a function $f$ such that

$$C(f,x,x')=C^*(x,x')\quad\text{for every pair of instances }x,x',$$

so that the equality in Eqn. (7) holds. For simplicity, we consider three different instances $x_1,x_2,x_3$ such that

$$\eta(x_1)>\eta(x_2)>\eta(x_3).$$

The conditional risk of hinge loss is given by

$$C(f,x,x')=\eta(1-\eta')\max\big(0,1-(f(x)-f(x'))\big)+\eta'(1-\eta)\max\big(0,1+(f(x)-f(x'))\big),$$

and minimizing $C(f,x,x')$ gives $f(x)-f(x')=1$ if $\eta(x)>\eta(x')$. From the assumption that

$$C(f,x_i,x_j)=C^*(x_i,x_j)\quad\text{for every pair }(x_i,x_j),$$

we have $f(x_1)-f(x_2)=1$, $f(x_2)-f(x_3)=1$ and $f(x_1)-f(x_3)=1$; the first two equalities give $f(x_1)-f(x_3)=2$, and they are contrary to each other. ∎
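
To make the contradiction concrete, the following sketch (Python; the $\eta$ values are our own choice, not taken from the proof above) minimizes the expected pairwise hinge risk over the scores of three instances by grid search and shows that the per-pair conditional risks cannot all reach their minima simultaneously.

import itertools
import numpy as np

eta = np.array([0.9, 0.6, 0.2])          # assumed conditional probabilities, eta(x1) > eta(x2) > eta(x3)
hinge = lambda t: max(0.0, 1.0 - t)

def pair_risk(fi, fj, ei, ej):
    # conditional risk of one pair: eta_i(1-eta_j)phi(f_i-f_j) + eta_j(1-eta_i)phi(f_j-f_i)
    return ei * (1 - ej) * hinge(fi - fj) + ej * (1 - ei) * hinge(fj - fi)

def expected_risk(f):
    # expected hinge risk over the three (unordered) pairs, uniform marginal
    return sum(pair_risk(f[i], f[j], eta[i], eta[j])
               for i, j in itertools.combinations(range(3), 2))

grid = np.arange(-3.0, 3.01, 0.05)       # only score differences matter, so fix f3 = 0
best = min(((expected_risk((a, b, 0.0)), (a, b, 0.0)) for a in grid for b in grid),
           key=lambda x: x[0])
f_opt = best[1]
print("minimizer of the expected risk:", f_opt)
for i, j in itertools.combinations(range(3), 2):
    achieved = pair_risk(f_opt[i], f_opt[j], eta[i], eta[j])
    optimal = min(pair_risk(a, 0.0, eta[i], eta[j]) for a in grid)
    print(f"pair ({i + 1},{j + 1}): achieved {achieved:.3f} vs. per-pair optimum {optimal:.3f}")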

3.1 Calibration is Necessary yet Insufficient for AUC Consistency

We first prove that calibration is a necessary condition for AUC consistency by the following lemma:

Lemma 2

If the surrogate loss $\phi$ is consistent with AUC, then $\phi$ is calibrated; moreover, if $\phi$ is convex, then it is differentiable at $0$ with $\phi'(0)<0$.

The proof is partly motivated by (Bartlett et al., 2006), and we defer it to Section 6.1. For the converse direction, we first observe that hinge loss is inconsistent with AUC:

Lemma 3

For hinge loss $\phi(t)=\max(0,1-t)$, the surrogate loss is inconsistent with AUC.

The detailed proof is deferred to Section 6.2. In addition to hinge loss, the absolute loss is also proven to be inconsistent with AUC:

Lemma 4

For absolute loss $\phi(t)=|1-t|$, the surrogate loss is inconsistent with AUC.

The detailed proof is presented in Section 6.3. It is noteworthy that hinge loss and absolute loss are convex and differentiable at $0$ with $\phi'(0)=-1<0$, and thus they are calibrated, whereas Lemmas 3 and 4 show their inconsistency with AUC, respectively. Therefore, calibration is not a sufficient condition for AUC consistency.

Combining Lemmas 2-4, we have

Theorem 1

Calibration is necessary yet insufficient for AUC consistency.

This theorem shows that the study on AUC consistency is not parallel to that of binary classification where the classification calibration is necessary and sufficient for the consistency of error in (Bartlett et al., 2006). The main difference is that, for AUC consistency, minimizing the expected risk over the whole distribution is not equivalent to minimizing the conditional risk on each pair of instances as shown in Lemma 1.

3.2 Sufficient Condition for AUC Consistency

Based on the previous analysis, we present a new sufficient condition for AUC consistency, and the detailed proof is deferred to Section 6.4.

Theorem 2

The surrogate loss $\phi$ is consistent with AUC if $\phi\colon\mathbb{R}\to\mathbb{R}$ is a convex, differentiable and non-increasing function with $\phi'(0)<0$.

Uematsu and Lee (2011) also proved the inconsistency of hinge loss and presented a sufficient condition, but our proof technique is considerably simpler, especially for the inconsistency of hinge loss. In addition, we provided a necessary condition in the previous subsection and will present regret bounds later.

Based on Theorem 2, many surrogate losses are proven to be consistent with AUC as follows:

Corollary 1

For exponential loss $\phi(t)=e^{-t}$, the surrogate loss is consistent with AUC.

Corollary 2

For logistic loss $\phi(t)=\ln(1+e^{-t})$, the surrogate loss is consistent with AUC.

Marron et al. (2007) introduced the distance-weighted discrimination method to deal with high-dimension, small-sample-size problems, and its loss was reformulated by Bartlett et al. (2006), for any $\gamma>0$, as follows:

$$\phi(t)=\begin{cases}1/t & \text{if }t\ge\gamma,\\ (2\gamma-t)/\gamma^2 & \text{if }t<\gamma.\end{cases}\qquad(8)$$

Based on Theorem 2, we can also derive its consistency as follows:

Corollary 3

For the distance-weighted loss given by Eqn. (8) with $\gamma>0$, the surrogate loss is consistent with AUC.

It is noteworthy that the hinge loss is not differentiable at $t=1$, so we cannot apply Theorem 2 directly to study its consistency. Lemma 3 proves its inconsistency and also shows the difficulty of guaranteeing consistency without differentiability, even if the surrogate loss is convex and non-increasing with $\phi'(0)<0$. We now derive some variants of hinge loss that are consistent, for example, the $p$-norm hinge loss

$$\phi(t)=\big(\max(0,1-t)\big)^p.$$

From Theorem 2, we can get the AUC consistency of the $p$-norm hinge loss:

Corollary 4

For the $p$-norm hinge loss with $p>1$, the surrogate loss is consistent with AUC.

From this corollary, the consistency of the least square hinge loss $\phi(t)=(\max(0,1-t))^2$ follows immediately. We further define the general hinge loss as follows:

(9)

It is easy to obtain the AUC consistency of general hinge loss from Theorem 2:

Corollary 5

For general hinge loss given by Eqn. (9) with , the surrogate loss is consistent with AUC.

Hinge loss is inconsistent with AUC, but we can use a consistent surrogate loss, e.g., the general hinge loss, to approximate the hinge loss in the limit of its parameter. In addition, it is also interesting to derive other surrogate loss functions that are consistent with AUC under the guidance of Theorem 2.
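
As a quick numerical sanity check of the conditions in Theorem 2, the sketch below (Python with NumPy; the probed losses and tolerances are our own choices) verifies non-increasingness, convexity and the sign of $\phi'(0)$ on a grid. Differentiability still has to be checked analytically: the hinge loss has a kink at $t=1$, so Theorem 2 does not apply to it (and Lemma 3 shows it is in fact inconsistent), whereas the other three losses below are differentiable everywhere.

import numpy as np

losses = {
    "exponential":        lambda t: np.exp(-t),
    "logistic":           lambda t: np.log1p(np.exp(-t)),
    "p-norm hinge (p=2)": lambda t: np.maximum(0.0, 1.0 - t) ** 2,
    "hinge":              lambda t: np.maximum(0.0, 1.0 - t),   # kink at t = 1
}

ts = np.linspace(-4.0, 4.0, 2001)
h = ts[1] - ts[0]

for name, phi in losses.items():
    vals = phi(ts)
    d1 = np.gradient(vals, h)                         # numerical first derivative
    nonincreasing = bool(np.all(np.diff(vals) <= 1e-12))
    convex = bool(np.all(np.diff(d1) >= -1e-6))       # derivative never decreases
    slope0 = d1[np.argmin(np.abs(ts))]                # phi'(0), should be negative
    print(f"{name:>20}: non-increasing={nonincreasing}, convex={convex}, phi'(0)={slope0:.3f}")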

4 Regret Bounds

In this section, we first present the regret bounds for exponential loss and logistic loss, and then study the regret bounds for general losses under the realizable setting.

4.1 Regret Bounds for Exponential Loss and Logistic Loss

Corollaries 1 and 2 show that the exponential loss and logistic loss are consistent with AUC, respectively. We further study their regret bounds based on the following special property:

Lemma 5

For exponential loss and logistic loss, it holds that $R_\phi^*=\mathrm{E}_{x,x'\sim\mathcal{D}_{\mathcal{X}}}[C^*(x,x')]$, i.e., the equality in Eqn. (7) holds.

Proof We provide the detailed proof for exponential loss; a similar argument applies to logistic loss. For every instance $x$, we set

$$f^*(x)=\frac{1}{2}\ln\frac{\eta(x)}{1-\eta(x)}.$$

It remains to prove $R_\phi(f^*)=\mathrm{E}_{x,x'\sim\mathcal{D}_{\mathcal{X}}}[C^*(x,x')]$. Based on the above equation, we have, for instances $x,x'$:

$$f^*(x)-f^*(x')=\frac{1}{2}\ln\frac{\eta(x)\big(1-\eta(x')\big)}{\eta(x')\big(1-\eta(x)\big)},$$

which exactly minimizes the conditional risk $\eta(1-\eta')e^{-\alpha}+\eta'(1-\eta)e^{\alpha}$ when substituted for $\alpha=f(x)-f(x')$. ∎

It is noteworthy that Lemma 5 is specific to the exponential loss and logistic loss, and it does not hold for other surrogate loss functions such as hinge loss, general hinge loss, $p$-norm hinge loss, etc. Based on Lemma 5, we study the regret bounds for exponential loss and logistic loss by focusing on the conditional risk. We first present a general theorem as follows:
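
The closed-form minimizer used in the proof of Lemma 5 can be checked numerically. Under the conditional risk of the pairwise exponential loss, the minimizing score difference is $\alpha^*=\tfrac{1}{2}\ln\big(\eta(1-\eta')/(\eta'(1-\eta))\big)$, i.e., the difference of half-log-odds scores; the sketch below (Python with NumPy; the $\eta$ values are illustrative) compares a grid-search minimizer against this closed form.

import numpy as np

def cond_risk_exp(alpha, eta, eta_p):
    # conditional risk of a pair for exponential loss, alpha = f(x) - f(x')
    return eta * (1 - eta_p) * np.exp(-alpha) + eta_p * (1 - eta) * np.exp(alpha)

eta, eta_p = 0.8, 0.3                      # assumed conditional probabilities of the two instances
alphas = np.linspace(-5.0, 5.0, 200001)
alpha_grid = alphas[np.argmin(cond_risk_exp(alphas, eta, eta_p))]

# closed form: difference of the half-log-odds scores f*(x) = 0.5 * ln(eta / (1 - eta))
alpha_closed = 0.5 * np.log(eta * (1 - eta_p) / (eta_p * (1 - eta)))

print(f"grid-search minimizer: {alpha_grid:.4f}")      # approx 1.1168
print(f"closed-form minimizer: {alpha_closed:.4f}")    # approx 1.1168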

Theorem 3

For some and , we have

if the surrogate loss satisfies , and if is such that

This proof is partly motivated by Zhang (2004b), and we defer it to Section 6.5. Based on this theorem, we can get the following regret bounds for the exponential loss and logistic loss:

Corollary 6

For exponential loss, it holds that  .

Corollary 7

For logistic loss, it holds that  .

The detailed proofs of Corollaries 6 and 7 are given in Sections 6.6 and 6.7, respectively.

4.2 Regret Bounds for Realizable Setting

Now we define the realizable setting as:

Definition 3

A distribution $\mathcal{D}$ is said to be realizable if $\eta(x)\in\{0,1\}$ for each $x\in\mathcal{X}$.

Such a setting has been studied for bipartite ranking (Rudin and Schapire, 2009) and multi-class classification (Long and Servedio, 2013). Under this setting, we have the following regret bounds:

Theorem 4

For some , we have

if , and if for and for .

Proof For convenience, denote by $\mathcal{D}_+$ and $\mathcal{D}_-$ the positive and negative instance distributions, respectively. From Eqn. (1), we have

and thus when . From Eqn. (3), we get the -risk . Then

which completes the proof.∎

Based on this theorem, we have the following regret bounds:

Corollary 8

For exponential loss, hinge loss, general hinge loss, $p$-norm hinge loss, and least square loss, we have

and for logistic loss, we have

It is noteworthy that the hinge loss is consistent with AUC under the realizable setting yet inconsistent for the general case as shown in Lemma 3. Corollaries 6 and 7 show regret bounds for exponential loss and logistic loss in the general case, respectively, whereas the above corollary provides tighter regret bounds under the realizable setting.

5 Equivalence Between AUC and Accuracy Optimization with Exponential Loss

In this section, we analyze the relationship between the exponential loss for AUC and that for accuracy, and present regret bounds to show their equivalence.

In binary classification, we learn a score function $f\colon\mathcal{X}\to\mathbb{R}$ and make predictions based on $\mathrm{sgn}(f(x))$. The goal is to improve the accuracy by minimizing the error

$$R_{\mathrm{err}}(f)=\mathrm{E}_{(x,y)\sim\mathcal{D}}\Big[\mathbb{I}[yf(x)<0]+\tfrac{1}{2}\mathbb{I}[f(x)=0]\Big].$$

We denote by $R^*_{\mathrm{err}}=\inf_f R_{\mathrm{err}}(f)$ the Bayes error, where the infimum is taken over all measurable functions, and it is easy to obtain the set of optimal solutions for accuracy as follows:

$$\big\{f\colon f(x)\big(\eta(x)-1/2\big)>0\ \text{whenever}\ \eta(x)\neq 1/2\big\}.$$

In binary classification, the most popular formulation for surrogate losses is $\phi(yf(x))$, where $\phi$ is a convex function, e.g., hinge loss (Vapnik, 1998), exponential loss (Freund and Schapire, 1997), logistic loss (Friedman et al., 2000), etc. We define the $\phi$-risk for accuracy as

$$R^{\mathrm{acc}}_\phi(f)=\mathrm{E}_{(x,y)\sim\mathcal{D}}\big[\phi\big(yf(x)\big)\big],$$

and we further denote by $R^{\mathrm{acc},*}_\phi=\inf_f R^{\mathrm{acc}}_\phi(f)$ the optimal $\phi$-risk for accuracy, where the infimum is taken over all measurable functions.

We begin with a regret bound as follows:

Theorem 5

For a classifier and exponential loss , we have

The detailed proof is presented in Section 6.8. This theorem shows that a good classifier, learned by optimizing the exponential loss of accuracy, also optimizes the pairwise exponential loss of AUC.

For a ranking function $f$, we first need a proper threshold to construct a classifier. Here, we present a simple way to select the threshold: choose the threshold minimizing the expected exponential surrogate loss of accuracy, which, for the convex and smooth exponential loss, admits a closed-form solution. Based on such a threshold, we have

Theorem 6

For a score ranking function and exponential loss , we have

by selecting the threshold .

The proof is presented in Section 6.9. From this theorem, we can see that a ranking function $f$, learned by optimizing the pairwise exponential loss of AUC, also optimizes the exponential loss of accuracy once a proper threshold is selected.
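
To illustrate the thresholding step, the sketch below (Python with NumPy; data and names are illustrative) assumes the threshold is chosen to minimize the empirical exponential loss of accuracy, $\theta^*=\arg\min_\theta\frac{1}{n}\sum_i e^{-y_i(f(x_i)-\theta)}$, which has the closed form used in the function, and cross-checks it against a brute-force search.

import numpy as np

def select_threshold_exp(scores, labels):
    """Threshold minimizing the empirical exponential loss of accuracy;
    setting the derivative in theta to zero gives the closed form below."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    a = np.sum(np.exp(-scores[labels == +1]))   # positive-class term
    b = np.sum(np.exp(+scores[labels == -1]))   # negative-class term
    return 0.5 * np.log(b / a)

scores = np.array([2.1, 1.4, 0.9, -0.2, -1.5])
labels = np.array([+1, +1, -1, -1, -1])
theta = select_threshold_exp(scores, labels)
print("threshold:", theta)                       # approx 1.125
print("predictions:", np.sign(scores - theta))   # [+1, +1, -1, -1, -1]

# sanity check against a brute-force search over candidate thresholds
cands = np.linspace(-5.0, 5.0, 100001)
emp = [np.mean(np.exp(-labels * (scores - t))) for t in cands]
print("brute-force threshold:", cands[int(np.argmin(emp))])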

Combining Corollary 6, Theorems 5 and 6, and (Zhang, 2004b, Theorem 2.1), we have

Theorem 7

For a classifier and exponential loss , we have

For a ranking function and exponential loss , we have

by selecting the threshold .

This theorem shows the asymptotic equivalence between the exponential surrogate loss of accuracy and the pairwise exponential surrogate loss of AUC. Thus, the surrogate loss of accuracy is consistent with AUC, while the pairwise surrogate loss of AUC is consistent with accuracy once a proper threshold is chosen. One direct consequence of this theorem is that AdaBoost and RankBoost are equivalent asymptotically, i.e., both of them optimize AUC and accuracy simultaneously in the limit of infinite sample, because AdaBoost and RankBoost essentially optimize the exponential surrogate loss of accuracy and the pairwise exponential surrogate loss of AUC, respectively.

Rudin and Schapire (2009) established the equivalence between AdaBoost and RankBoost for finite training samples. For that purpose, they assumed that the negative and positive classes contribute equally, although this is often not the case in practice. Our work does not make such an assumption, and we consider the limit of infinite sample. Moreover, our regret bounds, which show the equivalence between AUC and accuracy optimization with the exponential surrogate loss, provide a new explanation of the equivalence between AdaBoost and RankBoost.

6 Proofs

In this section, we provide some detailed proofs for our results.

6.1 Proof of Lemma 2

If is not calibrated, then there exist and s.t. and , that is,

This implies the existence of some such that

We consider an instance space with marginal probability and conditional probability and . We then construct a sequence by picking up , and it is easy to get that

This shows the inconsistency of $\phi$; therefore, calibration is a necessary condition for AUC consistency.

For convex $\phi$, we will show that the condition that $\phi$ is differentiable at $0$ with $\phi'(0)<0$ is necessary for AUC consistency. For convenience, we consider a two-instance space with fixed marginal probabilities and conditional probabilities.

We first prove that if the consistent surrogate loss $\phi$ is differentiable at $0$, then $\phi'(0)<0$. Assume $\phi'(0)\ge 0$; for convex $\phi$, we have

for . It follows that

(10)

which implies that $\phi$ is not calibrated, and this is contrary to the consistency of $\phi$.

We now prove that the convex loss $\phi$ is differentiable at $0$. Assume that $\phi$ is not differentiable at $0$; then we can find two distinct subgradients of $\phi$ at $0$ such that

and it is sufficient to consider the following cases:

  1. For , we select and . It is obvious that , and for any , we have

  2. For or , we select and , and for any , it holds that

  3. For , we select and . We have , and for any , it holds that

Therefore, for any and , there exist and such that