# Semi-Supervised AUC Optimization based on Positive-Unlabeled Learning

Tomoya Sakai, Gang Niu, Masashi Sugiyama
###### Abstract

Maximizing the area under the receiver operating characteristic curve (AUC) is a standard approach to imbalanced classification. So far, various supervised AUC optimization methods have been developed, and these have also been extended to semi-supervised scenarios to cope with small sample problems. However, existing semi-supervised AUC optimization methods rely on strong distributional assumptions, which are rarely satisfied in real-world problems. In this paper, we propose a novel semi-supervised AUC optimization method that does not require such restrictive assumptions. We first develop an AUC optimization method based only on positive and unlabeled data (PU-AUC) and then extend it to semi-supervised learning by combining it with a supervised AUC optimization method. We theoretically prove that, without the restrictive distributional assumptions, unlabeled data contribute to improving the generalization performance in PU and semi-supervised AUC optimization methods. Finally, we demonstrate the practical usefulness of the proposed methods through experiments.

## 1 Introduction

Maximizing the area under the receiver operating characteristic curve (AUC) (Hanley and McNeil, 1982) is a standard approach to imbalanced classification (Cortes and Mohri, 2004). While the misclassification rate relies on the sign of the score of a single sample, AUC is governed by the ranking of the scores of two samples. Based on this principle, various supervised methods for directly optimizing AUC have been developed so far and demonstrated to be useful (Herschtal and Raskutti, 2004; Zhao et al., 2011; Rakhlin et al., 2012; Kotlowski et al., 2011; Ying et al., 2016).

However, collecting labeled samples is often expensive and laborious in practice. To mitigate this problem, semi-supervised AUC optimization methods that can utilize unlabeled samples have been developed (Amini et al., 2008; Fujino and Ueda, 2016). These semi-supervised methods rely solely on the assumption that an unlabeled sample that is “similar” to a labeled sample shares the same label. However, such a restrictive distributional assumption (often referred to as the cluster assumption or the entropy minimization principle) is rarely satisfied in practice, and thus the practical usefulness of these semi-supervised methods is limited (Cozman et al., 2003; Sokolovska et al., 2008; Li and Zhou, 2015; Krijthe and Loog, 2017).

On the other hand, it has been recently shown that unlabeled data can be effectively utilized without such restrictive distributional assumptions in the context of classification from positive and unlabeled data (PU classification) (du Plessis et al., 2014). Furthermore, based on recent advances in PU classification (du Plessis et al., 2014, 2015; Niu et al., 2016), a novel semi-supervised classification approach has been developed that combines supervised classification with PU classification (Sakai et al., 2017). This approach inherits the advantage of PU classification that the restrictive distributional assumptions are not necessary, and it has been demonstrated to perform excellently in experiments.

Following this line of research, we first develop an AUC optimization method from positive and unlabeled data (PU-AUC) in this paper. Previously, a pairwise ranking method for PU data has been developed (Sundararajan et al., 2011), which can be regarded as an AUC optimization method for PU data. However, it merely regards unlabeled data as negative data and thus the obtained classifier is biased. On the other hand, our PU-AUC method is unbiased and we theoretically prove that unlabeled data contribute to reducing an upper bound on the generalization error with the optimal parametric convergence rate without the restrictive distributional assumptions.

Then we extend our PU-AUC method to the semi-supervised setup by combining it with a supervised AUC optimization method. Theoretically, we again prove that unlabeled data contribute to reducing an upper bound on the generalization error with the optimal parametric convergence rate without the restrictive distributional assumptions, and we further prove that the variance of the empirical risk of our semi-supervised AUC optimization method can be smaller than that of the plain supervised counterpart. The latter claim suggests that the proposed semi-supervised empirical risk is also useful in the cross-validation phase. Finally, we experimentally demonstrate the usefulness of the proposed PU and semi-supervised AUC optimization methods.

## 2 Preliminary

We first describe our problem setting and review an existing supervised AUC optimization method.

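Let covariate $x \in \mathbb{R}^d$ and its corresponding label $y \in \{\pm 1\}$ be equipped with probability density $p(x, y)$, where $d$ is a positive integer. Suppose we have sets of positive and negative samples:

$$
\mathcal{X}_{\mathrm{P}} := \{x^{\mathrm{P}}_i\}_{i=1}^{n_{\mathrm{P}}} \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{P}}(x) := p(x \mid y=+1), \quad \text{and} \quad
\mathcal{X}_{\mathrm{N}} := \{x^{\mathrm{N}}_j\}_{j=1}^{n_{\mathrm{N}}} \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{N}}(x) := p(x \mid y=-1).
$$

Furthermore, let $g: \mathbb{R}^d \to \mathbb{R}$ be a decision function, with classification carried out based on its sign: $\widehat{y} = \operatorname{sign}(g(x))$.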

The goal is to train a classifier by maximizing the AUC (Hanley and McNeil, 1982; Cortes and Mohri, 2004) defined and expressed as

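$$
\begin{aligned}
\mathrm{AUC}(g) &:= \mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\mathrm{N}}[I(g(x^{\mathrm{P}}) \ge g(x^{\mathrm{N}}))]] \\
&= 1 - \mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\mathrm{N}}[I(g(x^{\mathrm{P}}) < g(x^{\mathrm{N}}))]] \\
&= 1 - \mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\mathrm{N}}[\ell_{0\text{-}1}(g(x^{\mathrm{P}}) - g(x^{\mathrm{N}}))]],
\end{aligned} \tag{1}
$$

where $\mathbb{E}_{\mathrm{P}}$ and $\mathbb{E}_{\mathrm{N}}$ are the expectations over $p_{\mathrm{P}}(x)$ and $p_{\mathrm{N}}(x)$, respectively. $I(\cdot)$ is the indicator function, which is replaced with the zero-one loss, $\ell_{0\text{-}1}(m) = (1 - \operatorname{sign}(m))/2$, to obtain the last equation. Let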

$$
f(x, x') := g(x) - g(x')
$$

be a composite classifier. Maximizing the AUC corresponds to minimizing the second term in Eq. (1). Practically, to avoid the discrete nature of the zero-one loss, we replace the zero-one loss with a surrogate loss $\ell(m)$ and consider the following PN-AUC risk (Herschtal and Raskutti, 2004; Kotlowski et al., 2011; Rakhlin et al., 2012):

$$
R_{\mathrm{PN}}(f) := \mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\mathrm{N}}[\ell(f(x^{\mathrm{P}}, x^{\mathrm{N}}))]]. \tag{2}
$$

In practice, we train a classifier by minimizing the empirical PN-AUC risk defined as

$$
\widehat{R}_{\mathrm{PN}}(f) := \frac{1}{n_{\mathrm{P}} n_{\mathrm{N}}} \sum_{i=1}^{n_{\mathrm{P}}} \sum_{j=1}^{n_{\mathrm{N}}} \ell(f(x^{\mathrm{P}}_i, x^{\mathrm{N}}_j)).
$$
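As a minimal sketch of the empirical quantities above, the following pure-Python snippet averages over every positive-negative pair of decision values. The function names (`empirical_auc`, `empirical_pn_auc_risk`), the choice of the squared loss, and the toy scores are illustrative assumptions, not part of the original method description.

```python
def squared_loss(m):
    """Squared surrogate loss (1 - m)^2 applied to a pairwise margin m."""
    return (1.0 - m) ** 2


def empirical_auc(scores_p, scores_n):
    """Fraction of (positive, negative) pairs with g(x^P) >= g(x^N)."""
    pairs = [(sp, sn) for sp in scores_p for sn in scores_n]
    return sum(1.0 for sp, sn in pairs if sp >= sn) / len(pairs)


def empirical_pn_auc_risk(scores_p, scores_n, loss=squared_loss):
    """Average surrogate loss of f(x^P, x^N) = g(x^P) - g(x^N) over all pairs."""
    total = sum(loss(sp - sn) for sp in scores_p for sn in scores_n)
    return total / (len(scores_p) * len(scores_n))


# Toy decision values g(x) for three positives and two negatives (made up).
g_pos = [2.0, 1.0, 0.5]
g_neg = [0.0, 1.5]
auc = empirical_auc(g_pos, g_neg)
risk = empirical_pn_auc_risk(g_pos, g_neg)
```

Both functions cost one loss evaluation per pair, so they scale with the product of the two sample sizes; this is acceptable for a sketch but not for large data sets.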

Similarly to the classification-calibrated loss (Bartlett et al., 2006) in misclassification rate minimization, the consistency of AUC optimization in terms of loss functions has been studied recently (Gao and Zhou, 2015; Gao et al., 2016). These studies showed that minimization of the AUC risk with a consistent loss function is asymptotically equivalent to that with the zero-one loss function. The squared loss $\ell_{\mathrm{S}}(m) := (1-m)^2$, the exponential loss $\ell_{\mathrm{E}}(m) := \exp(-m)$, and the logistic loss $\ell_{\mathrm{L}}(m) := \log(1+\exp(-m))$ are shown to be consistent, while the hinge loss $\ell_{\mathrm{H}}(m) := \max(0, 1-m)$ and the absolute loss $\ell_{\mathrm{A}}(m) := |1-m|$ are not consistent.
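For reference, the losses named above can be tabulated on a few margins. This sketch assumes the standard textbook forms of each loss and one common convention for the zero-one loss, $(1-\operatorname{sign}(m))/2$; the dictionary name `SURROGATES` is hypothetical. It only evaluates the losses and does not by itself demonstrate consistency.

```python
import math


def zero_one_loss(m):
    """Zero-one loss (1 - sign(m)) / 2, with sign(0) = 0 (assumed convention)."""
    sign = (m > 0) - (m < 0)
    return (1 - sign) / 2


# Assumed standard forms of the surrogate losses, as functions of the margin m.
SURROGATES = {
    "squared": lambda m: (1.0 - m) ** 2,
    "exponential": lambda m: math.exp(-m),
    "logistic": lambda m: math.log1p(math.exp(-m)),
    "hinge": lambda m: max(0.0, 1.0 - m),
    "absolute": lambda m: abs(1.0 - m),
}

margins = [-2.0, -0.5, 0.0, 0.5, 2.0]
table = {name: [loss(m) for m in margins] for name, loss in SURROGATES.items()}
```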

## 3 Proposed Method

In this section, we first propose an AUC optimization method from positive and unlabeled data and then extend it to a semi-supervised AUC optimization method.

### 3.1 PU-AUC Optimization

In PU learning, we do not have negative data, but we can use unlabeled data drawn from the marginal density $p(x)$ in addition to positive data:

$$
\mathcal{X}_{\mathrm{U}} := \{x^{\mathrm{U}}_k\}_{k=1}^{n_{\mathrm{U}}} \overset{\mathrm{i.i.d.}}{\sim} p(x) = \theta_{\mathrm{P}}\, p_{\mathrm{P}}(x) + \theta_{\mathrm{N}}\, p_{\mathrm{N}}(x), \tag{3}
$$

where

$$
\theta_{\mathrm{P}} := p(y=+1) \quad \text{and} \quad \theta_{\mathrm{N}} := p(y=-1).
$$

We derive an equivalent expression to the PN-AUC risk that depends only on positive and unlabeled data distributions without the negative data distribution. In our derivation and theoretical analysis, we assume that $\theta_{\mathrm{P}}$ and $\theta_{\mathrm{N}}$ are known. In practice, they are replaced by their estimates obtained, e.g., by the methods of du Plessis et al. (2017), Kawakubo et al. (2016), and references therein.

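From the definition of the marginal density in Eq. (3), we have

$$
\mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\mathrm{U}}[\ell(f(x^{\mathrm{P}}, x^{\mathrm{U}}))]]
= \theta_{\mathrm{P}}\, \mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\bar{\mathrm{P}}}[\ell(f(x^{\mathrm{P}}, \bar{x}^{\mathrm{P}}))]] + \theta_{\mathrm{N}} R_{\mathrm{PN}}(f),
$$

where $\mathbb{E}_{\bar{\mathrm{P}}}$ denotes the expectation over $p_{\mathrm{P}}(\bar{x}^{\mathrm{P}})$. Dividing the above equation by $\theta_{\mathrm{N}}$ and rearranging it, we can express the PN-AUC risk in Eq. (2) based on PU data (the PU-AUC risk) as

$$
R_{\mathrm{PU}}(f) := \frac{1}{\theta_{\mathrm{N}}} \mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\mathrm{U}}[\ell(f(x^{\mathrm{P}}, x^{\mathrm{U}}))]]
- \frac{\theta_{\mathrm{P}}}{\theta_{\mathrm{N}}} \mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\bar{\mathrm{P}}}[\ell(f(x^{\mathrm{P}}, \bar{x}^{\mathrm{P}}))]]
= R_{\mathrm{PN}}(f). \tag{4}
$$

We refer to the method minimizing the PU-AUC risk as PU-AUC optimization. We will theoretically investigate the superiority of $R_{\mathrm{PU}}$ in Section 4.1.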
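The equivalence between the PU-AUC risk and the PN-AUC risk can be checked numerically. The sketch below treats small score lists as exact finite distributions (an assumption made purely for illustration), so that the unlabeled set is the exact mixture of positives and negatives and the identity holds up to floating-point error. The helper names `pair_mean` and `pu_auc_risk` are hypothetical.

```python
def pair_mean(scores_a, scores_b, loss):
    """Mean of loss(g(a) - g(b)) over all pairs: a plug-in double expectation."""
    total = sum(loss(a - b) for a in scores_a for b in scores_b)
    return total / (len(scores_a) * len(scores_b))


def pu_auc_risk(scores_p, scores_u, theta_p, loss):
    """Plug-in PU-AUC risk: (1/theta_N) E_P E_U[loss] - (theta_P/theta_N) E_P E_Pbar[loss]."""
    theta_n = 1.0 - theta_p
    pu_term = pair_mean(scores_p, scores_u, loss) / theta_n
    # Population-style average: the pairs i = i' are kept here on purpose.
    pp_term = theta_p / theta_n * pair_mean(scores_p, scores_p, loss)
    return pu_term - pp_term


def squared_loss(m):
    return (1.0 - m) ** 2


# Finite toy "distributions" of decision values (made-up numbers).
g_pos = [2.0, 1.0, 0.5]
g_neg = [0.0, 1.5, -1.0, 0.3, -0.2, 0.9]
g_unl = g_pos + g_neg                 # exact mixture of the two classes
theta_p = len(g_pos) / len(g_unl)     # class prior implied by the mixture

risk_pn = pair_mean(g_pos, g_neg, squared_loss)
risk_pu = pu_auc_risk(g_pos, g_unl, theta_p, squared_loss)
```

With independently sampled data the two quantities agree only in expectation, and a sample-based estimator would average the positive-positive term over distinct pairs rather than all pairs.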

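To develop a semi-supervised AUC optimization method later, we also consider AUC optimization from negative and unlabeled data, which can be regarded as a mirror of PU-AUC optimization. From the definition of the marginal density in Eq. (3), we have

$$
\begin{aligned}
\mathbb{E}_{\mathrm{U}}[\mathbb{E}_{\mathrm{N}}[\ell(f(x^{\mathrm{U}}, x^{\mathrm{N}}))]]
&= \theta_{\mathrm{P}}\, \mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\mathrm{N}}[\ell(f(x^{\mathrm{P}}, x^{\mathrm{N}}))]] + \theta_{\mathrm{N}}\, \mathbb{E}_{\mathrm{N}}[\mathbb{E}_{\bar{\mathrm{N}}}[\ell(f(x^{\mathrm{N}}, \bar{x}^{\mathrm{N}}))]] \\
&= \theta_{\mathrm{P}} R_{\mathrm{PN}}(f) + \theta_{\mathrm{N}}\, \mathbb{E}_{\mathrm{N}}[\mathbb{E}_{\bar{\mathrm{N}}}[\ell(f(x^{\mathrm{N}}, \bar{x}^{\mathrm{N}}))]],
\end{aligned}
$$

where $\mathbb{E}_{\bar{\mathrm{N}}}$ denotes the expectation over $p_{\mathrm{N}}(\bar{x}^{\mathrm{N}})$. Dividing the above equation by $\theta_{\mathrm{P}}$ and rearranging it, we can obtain the PN-AUC risk in Eq. (2) based on negative and unlabeled data (the NU-AUC risk):

$$
R_{\mathrm{NU}}(f) := \frac{1}{\theta_{\mathrm{P}}} \mathbb{E}_{\mathrm{U}}[\mathbb{E}_{\mathrm{N}}[\ell(f(x^{\mathrm{U}}, x^{\mathrm{N}}))]]
- \frac{\theta_{\mathrm{N}}}{\theta_{\mathrm{P}}} \mathbb{E}_{\mathrm{N}}[\mathbb{E}_{\bar{\mathrm{N}}}[\ell(f(x^{\mathrm{N}}, \bar{x}^{\mathrm{N}}))]]
= R_{\mathrm{PN}}(f). \tag{5}
$$

We refer to the method minimizing the NU-AUC risk as NU-AUC optimization.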

### 3.2 Semi-Supervised AUC Optimization

Next, we propose a novel semi-supervised AUC optimization method based on positive-unlabeled learning. The idea is to combine the PN-AUC risk with the PU-AUC/NU-AUC risks, similarly to Sakai et al. (2017). (In Sakai et al. (2017), the combination of the PU and NU risks was also considered and found to be less favorable than the combination of the PN and PU/NU risks. For this reason, we focus on the latter in this paper.)

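First of all, let us define the PNPU-AUC and PNNU-AUC risks as

$$
\begin{aligned}
R^{\gamma}_{\mathrm{PNPU}}(f) &:= (1-\gamma) R_{\mathrm{PN}}(f) + \gamma R_{\mathrm{PU}}(f), \\
R^{\gamma}_{\mathrm{PNNU}}(f) &:= (1-\gamma) R_{\mathrm{PN}}(f) + \gamma R_{\mathrm{NU}}(f),
\end{aligned}
$$

where $\gamma \in [0, 1]$ is the combination parameter. We then define the PNU-AUC risk as

$$
R^{\eta}_{\mathrm{PNU}}(f) :=
\begin{cases}
R^{\eta}_{\mathrm{PNPU}}(f) & (\eta \ge 0), \\
R^{-\eta}_{\mathrm{PNNU}}(f) & (\eta < 0),
\end{cases} \tag{6}
$$

where $\eta \in [-1, 1]$ is the combination parameter. We refer to the method minimizing the PNU-AUC risk as PNU-AUC optimization. We will theoretically discuss the superiority of $R^{\gamma}_{\mathrm{PNPU}}$ and $R^{\gamma}_{\mathrm{PNNU}}$ in Section 4.1.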
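The switching rule of the PNU-AUC risk can be sketched as follows. As an illustrative assumption, the score lists are again treated as exact finite distributions with the unlabeled set being the exact class mixture, so every value of the combination parameter should return the same number as the plain PN-AUC risk. The names `pair_mean` and `pnu_auc_risk` are hypothetical, and the exponential loss is used only as an example.

```python
import math


def pair_mean(scores_a, scores_b, loss):
    """Mean of loss(g(a) - g(b)) over all pairs."""
    total = sum(loss(a - b) for a in scores_a for b in scores_b)
    return total / (len(scores_a) * len(scores_b))


def pnu_auc_risk(scores_p, scores_n, scores_u, theta_p, eta, loss):
    """PNPU-AUC risk with gamma = eta if eta >= 0, else PNNU-AUC risk with gamma = -eta."""
    theta_n = 1.0 - theta_p
    r_pn = pair_mean(scores_p, scores_n, loss)
    if eta >= 0:
        gamma = eta
        r_xu = (pair_mean(scores_p, scores_u, loss)
                - theta_p * pair_mean(scores_p, scores_p, loss)) / theta_n
    else:
        gamma = -eta
        r_xu = (pair_mean(scores_u, scores_n, loss)
                - theta_n * pair_mean(scores_n, scores_n, loss)) / theta_p
    return (1.0 - gamma) * r_pn + gamma * r_xu


def exponential_loss(m):
    return math.exp(-m)


# Made-up decision values; the unlabeled set is the exact mixture.
g_pos = [2.0, 1.0, 0.5]
g_neg = [0.0, 1.5, -1.0, 0.3, -0.2, 0.9]
g_unl = g_pos + g_neg
theta_p = len(g_pos) / len(g_unl)
risks = {eta: pnu_auc_risk(g_pos, g_neg, g_unl, theta_p, eta, exponential_loss)
         for eta in (-1.0, -0.5, 0.0, 0.5, 1.0)}
```

On real samples the risks for different values of the combination parameter differ through estimation noise, which is what makes the choice of the parameter matter in practice.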

### 3.3 Discussion about Related Work

Sundararajan et al. (2011) proposed a pairwise ranking method for PU data, which can be regarded as an AUC optimization method for PU data. Their approach simply regards unlabeled data as negative data and applies the ranking SVM (Joachims, 2002) to PU data so that the score of positive data tends to be higher than that of unlabeled data. Although this approach is simple and was shown to be computationally efficient in experiments, the obtained classifier is biased. From the mathematical viewpoint, the existing method ignores the second term in Eq. (4) and minimizes only the first term with the hinge loss function. However, the effect of ignoring the second term is not negligible when the class prior, $\theta_{\mathrm{P}}$, is not sufficiently small. In contrast, our proposed PU-AUC risk includes the second term so that the PU-AUC risk is equivalent to the PN-AUC risk.
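The size of this bias can be made explicit with a short derivation, sketched here using only the mixture form of $p(x)$ in Eq. (3) and the definition of $R_{\mathrm{PN}}$ in Eq. (2); $\bar{x}^{\mathrm{P}}$ denotes an independent copy of a positive sample. The objective that treats unlabeled data as negative satisfies

$$
\mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\mathrm{U}}[\ell(f(x^{\mathrm{P}}, x^{\mathrm{U}}))]]
= \theta_{\mathrm{N}} R_{\mathrm{PN}}(f) + \theta_{\mathrm{P}}\, \mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\bar{\mathrm{P}}}[\ell(f(x^{\mathrm{P}}, \bar{x}^{\mathrm{P}}))]].
$$

The extra term depends on $f$, so minimizing the left-hand side is generally not equivalent to minimizing $R_{\mathrm{PN}}(f)$. Its weight relative to the PN-AUC risk is $\theta_{\mathrm{P}}/\theta_{\mathrm{N}}$, which becomes negligible only when positive samples are rare in the unlabeled data.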

Our semi-supervised AUC optimization method can be regarded as an extension of the work by Sakai et al. (2017). They considered the misclassification rate as a measure to train a classifier and proposed a semi-supervised classification method based on the recently proposed PU classification method (du Plessis et al., 2014, 2015). On the other hand, we train a classifier by maximizing the AUC, which is a standard approach for imbalanced classification. To this end, we first developed an AUC optimization method for PU data, and then extended it to a semi-supervised AUC optimization method. Thanks to the AUC maximization formulation, our proposed method is expected to perform better than the method proposed by Sakai et al. (2017) for imbalanced data sets.

## 4 Theoretical Analyses

In this section, we theoretically analyze the proposed risk functions. We first derive generalization error bounds of our methods and then discuss variance reduction.

### 4.1 Generalization Error Bounds

Recall the composite classifier $f(x, x') = g(x) - g(x')$. As the classifier $g$, we assume the linear-in-parameter model given by

$$
g(x) = \sum_{\ell=1}^{b} w_{\ell}\, \phi_{\ell}(x) = w^{\top} \phi(x),
$$

where ${}^{\top}$ denotes the transpose of vectors and matrices, $b$ is the number of basis functions, $w = (w_1, \ldots, w_b)^{\top}$ is a parameter vector, and $\phi(x) = (\phi_1(x), \ldots, \phi_b(x))^{\top}$ is a basis function vector. Let $\mathcal{F}$ be a function class of bounded hyperplanes:

$$
\mathcal{F} := \{ f(x, x') = w^{\top}(\phi(x) - \phi(x')) \mid \|w\| \le C_{w};\ \forall x: \|\phi(x)\| \le C_{\phi} \},
$$

where $C_{w}$ and $C_{\phi}$ are certain positive constants. This assumption is reasonable because the $\ell_2$-regularizer included in training and the use of bounded basis functions, e.g., the Gaussian kernel basis, ensure that the minimizer of the empirical AUC risk belongs to such a function class $\mathcal{F}$. We assume that a surrogate loss is bounded from above by $C_{\ell}$ and denote the Lipschitz constant by $L$. For simplicity, we focus on a surrogate loss satisfying $\ell_{0\text{-}1}(m) \le \ell(m)$ (our theoretical analysis can be easily extended to losses satisfying $\ell_{0\text{-}1}(m) \le M \ell(m)$ with a certain $M > 0$). For example, the squared loss and the exponential loss satisfy the condition (these losses are bounded in our setting, since the input to $\ell(m)$, i.e., $f$, is bounded).

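Let

$$
I(f) = \mathbb{E}_{\mathrm{P}}[\mathbb{E}_{\mathrm{N}}[\ell_{0\text{-}1}(f(x^{\mathrm{P}}, x^{\mathrm{N}}))]]
$$

be the generalization error of $f$ in AUC optimization. For convenience, we define

$$
h(\delta) := 2\sqrt{2}\, L C_{\ell} C_{w} C_{\phi} + \frac{3}{2}\sqrt{2\log(2/\delta)}.
$$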

In the following, we prove the generalization error bounds of both PU and semi-supervised AUC optimization methods.

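For the PU-AUC/NU-AUC risks, we prove the following generalization error bounds (the proof is available in Appendix B):

###### Theorem 1.

For any $\delta > 0$, the following inequalities hold separately with probability at least $1 - \delta$ for all $f \in \mathcal{F}$:

$$
\begin{aligned}
I(f) &\le \widehat{R}_{\mathrm{PU}}(f) + h(\delta/2)\left(\frac{1}{\theta_{\mathrm{N}}\sqrt{\min(n_{\mathrm{P}}, n_{\mathrm{U}})}} + \frac{\theta_{\mathrm{P}}}{\theta_{\mathrm{N}}\sqrt{n_{\mathrm{P}}}}\right), \\
I(f) &\le \widehat{R}_{\mathrm{NU}}(f) + h(\delta/2)\left(\frac{1}{\theta_{\mathrm{P}}\sqrt{\min(n_{\mathrm{N}}, n_{\mathrm{U}})}} + \frac{\theta_{\mathrm{N}}}{\theta_{\mathrm{P}}\sqrt{n_{\mathrm{N}}}}\right),
\end{aligned}
$$

where $\widehat{R}_{\mathrm{PU}}$ and $\widehat{R}_{\mathrm{NU}}$ are unbiased empirical risk estimators corresponding to $R_{\mathrm{PU}}$ and $R_{\mathrm{NU}}$, respectively.

Theorem 1 guarantees that $I(f)$ can be bounded from above by the empirical risk, $\widehat{R}(f)$, plus the confidence terms of order

$$
\mathcal{O}_p\!\left(\frac{1}{\sqrt{n_{\mathrm{P}}}} + \frac{1}{\sqrt{n_{\mathrm{U}}}}\right) \quad \text{and} \quad \mathcal{O}_p\!\left(\frac{1}{\sqrt{n_{\mathrm{N}}}} + \frac{1}{\sqrt{n_{\mathrm{U}}}}\right).
$$

Since $n_{\mathrm{P}}$ ($n_{\mathrm{N}}$) and $n_{\mathrm{U}}$ can increase independently in our setting, this is the optimal convergence rate without any additional assumptions (Vapnik, 1998; Mendelson, 2008).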
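To get a feel for how the confidence term of the PU-AUC bound behaves, the sketch below evaluates it for a few sample sizes. It reads $h(\delta)$ as $2\sqrt{2}\,L C_{\ell} C_{w} C_{\phi} + \tfrac{3}{2}\sqrt{2\log(2/\delta)}$, and all constants passed in are hypothetical values chosen only for illustration; the function names are likewise assumptions.

```python
import math


def h(delta, lipschitz, c_loss, c_w, c_phi):
    """h(delta) = 2*sqrt(2)*L*C_l*C_w*C_phi + (3/2)*sqrt(2*log(2/delta))."""
    complexity = 2.0 * math.sqrt(2.0) * lipschitz * c_loss * c_w * c_phi
    return complexity + 1.5 * math.sqrt(2.0 * math.log(2.0 / delta))


def pu_confidence_term(n_p, n_u, theta_p, delta, **constants):
    """Term added to the empirical PU-AUC risk in the first bound of Theorem 1."""
    theta_n = 1.0 - theta_p
    rate = 1.0 / (theta_n * math.sqrt(min(n_p, n_u)))
    rate += theta_p / (theta_n * math.sqrt(n_p))
    return h(delta / 2.0, **constants) * rate


# Hypothetical constants, for illustration only.
consts = dict(lipschitz=1.0, c_loss=1.0, c_w=1.0, c_phi=1.0)
small = pu_confidence_term(n_p=100, n_u=100, theta_p=0.3, delta=0.05, **consts)
large = pu_confidence_term(n_p=10000, n_u=10000, theta_p=0.3, delta=0.05, **consts)
more_u = pu_confidence_term(n_p=100, n_u=10**6, theta_p=0.3, delta=0.05, **consts)
```

Multiplying both sample sizes by 100 shrinks the term by a factor of 10, while adding unlabeled data beyond the number of positive samples leaves this particular bound unchanged because of the $\min(n_{\mathrm{P}}, n_{\mathrm{U}})$ factor.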

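For the PNPU-AUC and PNNU-AUC risks, we prove the following generalization error bounds (the proof is also available in Appendix B):

###### Theorem 2.

For any $\delta > 0$, the following inequalities hold separately with probability at least $1 - \delta$ for all $f \in \mathcal{F}$:

$$
\begin{aligned}
I(f) &\le \widehat{R}^{\gamma}_{\mathrm{PNPU}}(f) + h(\delta/3)\left(\frac{1-\gamma}{\sqrt{\min(n_{\mathrm{P}}, n_{\mathrm{N}})}} + \frac{\gamma}{\theta_{\mathrm{N}}\sqrt{\min(n_{\mathrm{P}}, n_{\mathrm{U}})}} + \frac{\gamma\theta_{\mathrm{P}}}{\theta_{\mathrm{N}}\sqrt{n_{\mathrm{P}}}}\right), \\
I(f) &\le \widehat{R}^{\gamma}_{\mathrm{PNNU}}(f) + h(\delta/3)\left(\frac{1-\gamma}{\sqrt{\min(n_{\mathrm{P}}, n_{\mathrm{N}})}} + \frac{\gamma}{\theta_{\mathrm{P}}\sqrt{\min(n_{\mathrm{N}}, n_{\mathrm{U}})}} + \frac{\gamma\theta_{\mathrm{N}}}{\theta_{\mathrm{P}}\sqrt{n_{\mathrm{N}}}}\right),
\end{aligned}
$$

where $\widehat{R}^{\gamma}_{\mathrm{PNPU}}$ and $\widehat{R}^{\gamma}_{\mathrm{PNNU}}$ are unbiased empirical risk estimators corresponding to $R^{\gamma}_{\mathrm{PNPU}}$ and $R^{\gamma}_{\mathrm{PNNU}}$, respectively.

Theorem 2 guarantees that $I(f)$ can be bounded from above by the empirical risk, $\widehat{R}(f)$, plus the confidence terms of order

$$
\mathcal{O}_p\!\left(\frac{1}{\sqrt{n_{\mathrm{P}}}} + \frac{1}{\sqrt{n_{\mathrm{N}}}} + \frac{1}{\sqrt{n_{\mathrm{U}}}}\right).
$$

Again, since $n_{\mathrm{P}}$, $n_{\mathrm{N}}$, and $n_{\mathrm{U}}$ can increase independently in our setting, this is the optimal convergence rate without any additional assumptions.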

### 4.2 Variance Reduction

In the existing semi-supervised classification method based on PU learning, the variance of the empirical risk was proved to be smaller than that of the supervised counterpart under certain conditions (Sakai et al., 2017). Similarly, we here investigate whether the proposed semi-supervised risk estimators have smaller variance than their supervised counterpart.

Let us introduce the following variances and covariances:

$$
\begin{aligned}
\sigma^2_{\mathrm{PN}}(f) &= \mathrm{Var}_{\mathrm{PN}}[\ell(f(x^{\mathrm{P}}, x^{\mathrm{N}}))], \\
\sigma^2_{\mathrm{PP}}(f) &= \mathrm{Var}_{\mathrm{P}\bar{\mathrm{P}}}[\ell(f(x^{\mathrm{P}}, \bar{x}^{\mathrm{P}}))], \\
\sigma^2_{\mathrm{NN}}(f) &= \mathrm{Var}_{\mathrm{N}\bar{\mathrm{N}}}[\ell(f(x^{\mathrm{N}}, \bar{x}^{\mathrm{N}}))], \\
\tau_{\mathrm{PN,PP}}(f) &= \mathrm{Cov}_{\mathrm{PN,P}\bar{\mathrm{P}}}[\ell(f(x^{\mathrm{P}}, x^{\mathrm{N}})), \ell(f(x^{\mathrm{P}}, \bar{x}^{\mathrm{P}}))], \\
\tau_{\mathrm{PN,NN}}(f) &= \mathrm{Cov}_{\mathrm{PN,N}\bar{\mathrm{N}}}[\ell(f(x^{\mathrm{P}}, x^{\mathrm{N}})), \ell(f(x^{\mathrm{N}}, \bar{x}^{\mathrm{N}}))], \\
\tau_{\mathrm{PU,PP}}(f) &= \mathrm{Cov}_{\mathrm{PU,P}\bar{\mathrm{P}}}[\ell(f(x^{\mathrm{P}}, x^{\mathrm{U}})), \ell(f(x^{\mathrm{P}}, \bar{x}^{\mathrm{P}}))], \\
\tau_{\mathrm{NU,NN}}(f) &= \mathrm{Cov}_{\mathrm{NU,N}\bar{\mathrm{N}}}[\ell(f(x^{\mathrm{U}}, x^{\mathrm{N}})), \ell(f(x^{\mathrm{N}}, \bar{x}^{\mathrm{N}}))].
\end{aligned}
$$

Here, $\mathrm{Var}_{\mathrm{PN}}$, $\mathrm{Var}_{\mathrm{P}\bar{\mathrm{P}}}$, and $\mathrm{Var}_{\mathrm{N}\bar{\mathrm{N}}}$ are the variances over $p_{\mathrm{P}}(x^{\mathrm{P}})p_{\mathrm{N}}(x^{\mathrm{N}})$, $p_{\mathrm{P}}(x^{\mathrm{P}})p_{\mathrm{P}}(\bar{x}^{\mathrm{P}})$, and $p_{\mathrm{N}}(x^{\mathrm{N}})p_{\mathrm{N}}(\bar{x}^{\mathrm{N}})$, respectively. $\mathrm{Cov}_{\mathrm{PN,P}\bar{\mathrm{P}}}$, $\mathrm{Cov}_{\mathrm{PN,N}\bar{\mathrm{N}}}$, $\mathrm{Cov}_{\mathrm{PU,P}\bar{\mathrm{P}}}$, and $\mathrm{Cov}_{\mathrm{NU,N}\bar{\mathrm{N}}}$ are the covariances over $p_{\mathrm{P}}(x^{\mathrm{P}})p_{\mathrm{N}}(x^{\mathrm{N}})p_{\mathrm{P}}(\bar{x}^{\mathrm{P}})$, $p_{\mathrm{P}}(x^{\mathrm{P}})p_{\mathrm{N}}(x^{\mathrm{N}})p_{\mathrm{N}}(\bar{x}^{\mathrm{N}})$, $p_{\mathrm{P}}(x^{\mathrm{P}})p(x^{\mathrm{U}})p_{\mathrm{P}}(\bar{x}^{\mathrm{P}})$, and $p_{\mathrm{N}}(x^{\mathrm{N}})p(x^{\mathrm{U}})p_{\mathrm{N}}(\bar{x}^{\mathrm{N}})$, respectively.

Then, we have the following theorem (the proof is available in Appendix C):

###### Theorem 3.

Assume $n_{\mathrm{U}} \to \infty$. For any fixed $f$, the minimizers of the variance of the empirical PNPU-AUC and PNNU-AUC risks are respectively obtained by

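$$
\gamma_{\mathrm{PNPU}} = \mathop{\mathrm{argmin}}_{\gamma} \mathrm{Var}[\widehat{R}^{\gamma}_{\mathrm{PNPU}}(f)] = \frac{\psi_{\mathrm{PN}} - \psi_{\mathrm{PP}}/2}{\psi_{\mathrm{PN}} + \psi_{\mathrm{PU}} - \psi_{\mathrm{PP}}}, \tag{7}
$$

$$
\gamma_{\mathrm{PNNU}} = \mathop{\mathrm{argmin}}_{\gamma} \mathrm{Var}[\widehat{R}^{\gamma}_{\mathrm{PNNU}}(f)] = \frac{\psi_{\mathrm{PN}} - \psi_{\mathrm{NN}}/2}{\psi_{\mathrm{PN}} + \psi_{\mathrm{NU}} - \psi_{\mathrm{NN}}}, \tag{8}
$$

where

$$
\begin{aligned}
\psi_{\mathrm{PN}} &= \frac{1}{n_{\mathrm{P}} n_{\mathrm{N}}} \sigma^2_{\mathrm{PN}}(f), \\
\psi_{\mathrm{PU}} &= \frac{\theta_{\mathrm{P}}^2}{\theta_{\mathrm{N}}^2 n_{\mathrm{P}}^2} \sigma^2_{\mathrm{PP}}(f) - \frac{\theta_{\mathrm{P}}}{\theta_{\mathrm{N}}^2 n_{\mathrm{P}}} \tau_{\mathrm{PU,PP}}(f), \\
\psi_{\mathrm{PP}} &= \frac{1}{\theta_{\mathrm{N}} n_{\mathrm{P}}} \tau_{\mathrm{PN,PU}}(f) - \frac{\theta_{\mathrm{P}}}{\theta_{\mathrm{N}} n_{\mathrm{P}}} \tau_{\mathrm{PN,PP}}(f), \\
\psi_{\mathrm{NU}} &= \frac{\theta_{\mathrm{N}}^2}{\theta_{\mathrm{P}}^2 n_{\mathrm{N}}^2} \sigma^2_{\mathrm{NN}}(f) - \frac{\theta_{\mathrm{N}}}{\theta_{\mathrm{P}}^2 n_{\mathrm{N}}} \tau_{\mathrm{NU,NN}}(f), \\
\psi_{\mathrm{NN}} &= \frac{1}{\theta_{\mathrm{P}} n_{\mathrm{N}}} \tau_{\mathrm{PN,NU}}(f) - \frac{\theta_{\mathrm{N}}}{\theta_{\mathrm{P}} n_{\mathrm{N}}} \tau_{\mathrm{PN,NN}}(f).
\end{aligned}
$$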

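Additionally, we have $\mathrm{Var}[\widehat{R}^{\gamma}_{\mathrm{PNPU}}(f)] < \mathrm{Var}[\widehat{R}_{\mathrm{PN}}(f)]$ for any $\gamma \in (0, 2\gamma_{\mathrm{PNPU}})$ if $\psi_{\mathrm{PN}} + \psi_{\mathrm{PU}} > \psi_{\mathrm{PP}}$ and $2\psi_{\mathrm{PN}} > \psi_{\mathrm{PP}}$. Similarly, we have $\mathrm{Var}[\widehat{R}^{\gamma}_{\mathrm{PNNU}}(f)] < \mathrm{Var}[\widehat{R}_{\mathrm{PN}}(f)]$ for any $\gamma \in (0, 2\gamma_{\mathrm{PNNU}})$ if $\psi_{\mathrm{PN}} + \psi_{\mathrm{NU}} > \psi_{\mathrm{NN}}$ and $2\psi_{\mathrm{PN}} > \psi_{\mathrm{NN}}$.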

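This theorem means that, if $\gamma$ is chosen appropriately, our proposed risk estimators, $\widehat{R}^{\gamma}_{\mathrm{PNPU}}$ and $\widehat{R}^{\gamma}_{\mathrm{PNNU}}$, have smaller variance than the standard supervised risk estimator $\widehat{R}_{\mathrm{PN}}$. A practical consequence of Theorem 3 is that when we conduct cross-validation for hyperparameter selection, we may use our proposed risk estimators $\widehat{R}^{\gamma}_{\mathrm{PNPU}}$ and $\widehat{R}^{\gamma}_{\mathrm{PNNU}}$ instead of the standard supervised risk estimator $\widehat{R}_{\mathrm{PN}}$ since they are more stable (see Section 5.3 for details).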
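The variance-minimizing weight in Eq. (7) is easy to evaluate once the $\psi$ quantities are available. The sketch below plugs in hypothetical $\psi$ values and, as an explicit assumption, models the variance as the quadratic $(1-\gamma)^2\psi_{\mathrm{PN}} + \gamma^2\psi_{\mathrm{PU}} + \gamma(1-\gamma)\psi_{\mathrm{PP}}$, which is one form whose stationary point coincides with Eq. (7); the exact variance expression is given in the proof referenced above, not here.

```python
def gamma_pnpu(psi_pn, psi_pu, psi_pp):
    """Variance-minimizing combination weight from Eq. (7)."""
    return (psi_pn - psi_pp / 2.0) / (psi_pn + psi_pu - psi_pp)


def variance_model(gamma, psi_pn, psi_pu, psi_pp):
    """Assumed quadratic model of Var[R^gamma_PNPU]; gamma = 0 is the supervised case."""
    return ((1.0 - gamma) ** 2 * psi_pn
            + gamma ** 2 * psi_pu
            + gamma * (1.0 - gamma) * psi_pp)


# Hypothetical psi values, chosen only to illustrate the formula.
psi = dict(psi_pn=0.04, psi_pu=0.01, psi_pp=0.005)
best = gamma_pnpu(**psi)
var_supervised = variance_model(0.0, **psi)
var_best = variance_model(best, **psi)
```

Under this model the variance at the optimal weight is below the supervised value whenever the quadratic is convex and the optimal weight is positive, which mirrors the kind of condition stated in the theorem.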

## 5 Practical Implementation

In this section, we explain the implementation details of our proposed methods.

### 5.1 General Case

In practice, the AUC risks $R$ introduced above are replaced with their empirical versions $\widehat{R}$, where the expectations in $R$ are replaced with the corresponding sample averages.

Here, we focus on the linear-in-parameter model given by

$$
g(x) = \sum_{\ell=1}^{b} w_{\ell}\, \phi_{\ell}(x) = w^{\top} \phi(x),
$$

where ${}^{\top}$ denotes the transpose of vectors and matrices, $b$ is the number of basis functions, $w = (w_1, \ldots, w_b)^{\top}$ is a parameter vector, and $\phi(x) = (\phi_1(x), \ldots, \phi_b(x))^{\top}$ is a basis function vector. The linear-in-parameter model allows us to express the composite classifier as

$$
f(x, x') = w^{\top} \bar{\phi}(x, x'),
$$

where

$$
\bar{\phi}(x, x') := \phi(x) - \phi(x')
$$

is a composite basis function vector. We train the classifier by minimizing the $\ell_2$-regularized empirical AUC risk:

$$
\min_{w}\ \widehat{R}(f) + \lambda \|w\|^2,
$$

where $\lambda$ is the regularization parameter.
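The pieces of this formulation can be sketched in a few lines of pure Python. The Gaussian kernel basis, its centers and bandwidth, and the function names below are illustrative assumptions; the objective simply averages a user-supplied loss over given pairs and adds the squared-norm penalty.

```python
import math


def gaussian_basis(x, centers, bandwidth=1.0):
    """phi(x): one Gaussian kernel value per center (centers, bandwidth assumed)."""
    return [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2.0 * bandwidth ** 2))
        for c in centers
    ]


def g(x, w, centers):
    """Linear-in-parameter decision function g(x) = w^T phi(x)."""
    return sum(wl * pl for wl, pl in zip(w, gaussian_basis(x, centers)))


def f(x, x_prime, w, centers):
    """Composite classifier f(x, x') = w^T (phi(x) - phi(x'))."""
    phi, phi_prime = gaussian_basis(x, centers), gaussian_basis(x_prime, centers)
    return sum(wl * (a - b) for wl, a, b in zip(w, phi, phi_prime))


def objective(w, pairs, centers, lam, loss):
    """Regularized empirical risk: mean pairwise loss + lam * ||w||^2."""
    risk = sum(loss(f(x, xp, w, centers)) for x, xp in pairs) / len(pairs)
    return risk + lam * sum(wl ** 2 for wl in w)
```

For example, with `centers = [(0.0, 0.0), (1.0, 1.0)]` and `w = [0.5, -1.0]`, the value `f(x, x_prime, w, centers)` equals `g(x, w, centers) - g(x_prime, w, centers)` for any two points, which is exactly the composite form above.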

### 5.2 Analytical Solution for Squared Loss

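For the squared loss $\ell_{\mathrm{S}}(m) := (1-m)^2$, the empirical PU-AUC risk (we discuss how to estimate the PU-AUC risk in Appendix A) can be expressed as

$$
\begin{aligned}
\widehat{R}_{\mathrm{PU}}(f) &= \frac{1}{\theta_{\mathrm{N}} n_{\mathrm{P}} n_{\mathrm{U}}} \sum_{i=1}^{n_{\mathrm{P}}} \sum_{k=1}^{n_{\mathrm{U}}} \ell_{\mathrm{S}}(f(x^{\mathrm{P}}_i, x^{\mathrm{U}}_k)) \\
&\quad - \frac{\theta_{\mathrm{P}}}{\theta_{\mathrm{N}} n_{\mathrm{P}} (n_{\mathrm{P}}-1)} \sum_{i=1}^{n_{\mathrm{P}}} \sum_{i'=1}^{n_{\mathrm{P}}} \ell_{\mathrm{S}}(f(x^{\mathrm{P}}_i, x^{\mathrm{P}}_{i'})) + \frac{\theta_{\mathrm{P}}}{\theta_{\mathrm{N}} (n_{\mathrm{P}}-1)} \\
&= 1 - 2 w^{\top} \widehat{h}_{\mathrm{PU}} + w^{\top} \widehat{H}_{\mathrm{PU}} w - w^{\top} \widehat{H}_{\mathrm{PP}} w,
\end{aligned}
$$

where

$$
\begin{aligned}
\widehat{h}_{\mathrm{PU}} &:= \frac{1}{\theta_{\mathrm{N}} n_{\mathrm{P}}} \Phi_{\mathrm{P}}^{\top} \mathbf{1}_{n_{\mathrm{P}}} - \frac{1}{\theta_{\mathrm{N}} n_{\mathrm{U}}} \Phi_{\mathrm{U}}^{\top} \mathbf{1}_{n_{\mathrm{U}}}, \\
\widehat{H}_{\mathrm{PU}} &:= \frac{1}{\theta_{\mathrm{N}} n_{\mathrm{P}}} \Phi_{\mathrm{P}}^{\top} \Phi_{\mathrm{P}} - \frac{1}{\theta_{\mathrm{N}} n_{\mathrm{P}} n_{\mathrm{U}}} \Phi_{\mathrm{U}}^{\top} \mathbf{1}_{n_{\mathrm{U}}} \mathbf{1}_{n_{\mathrm{P}}}^{\top} \Phi_{\mathrm{P}} \\
&\quad - \frac{1}{\theta_{\mathrm{N}} n_{\mathrm{P}} n_{\mathrm{U}}} \Phi_{\mathrm{P}}^{\top} \mathbf{1}_{n_{\mathrm{P}}} \mathbf{1}_{n_{\mathrm{U}}}^{\top} \Phi_{\mathrm{U}} + \frac{1}{\theta_{\mathrm{N}} n_{\mathrm{U}}} \Phi_{\mathrm{U}}^{\top} \Phi_{\mathrm{U}}, \\
\widehat{H}_{\mathrm{PP}} &:= \frac{2\theta_{\mathrm{P}}}{\theta_{\mathrm{N}} (n_{\mathrm{P}}-1)} \Phi_{\mathrm{P}}^{\top} \Phi_{\mathrm{P}} - \frac{2\theta_{\mathrm{P}}}{\theta_{\mathrm{N}} n_{\mathrm{P}} (n_{\mathrm{P}}-1)} \Phi_{\mathrm{P}}^{\top} \mathbf{1}_{n_{\mathrm{P}}} \mathbf{1}_{n_{\mathrm{P}}}^{\top} \Phi_{\mathrm{P}}, \\
\Phi_{\mathrm{P}} &:= (\phi(x^{\mathrm{P}}_1), \ldots, \phi(x^{\mathrm{P}}_{n_{\mathrm{P}}}))^{\top}, \\
\Phi_{\mathrm{U}} &:= (\phi(x^{\mathrm{U}}_1), \ldots, \phi(x^{\mathrm{U}}_{n_{\mathrm{U}}}))^{\top},
\end{aligned}
$$

and $\mathbf{1}_{n}$ is the $n$-dimensional vector whose elements are all one. With the $\ell_2$-regularizer, we can analytically obtain the solution by

$$
\widehat{w}_{\mathrm{PU}} := (\widehat{H}_{\mathrm{PU}} - \widehat{H}_{\mathrm{PP}} + \lambda I_{b})^{-1} \widehat{h}_{\mathrm{PU}},
$$

where $I_{b}$ is the $b$-dimensional identity matrix.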

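The computational complexities of computing $\widehat{h}_{\mathrm{PU}}$, $\widehat{H}_{\mathrm{PU}}$, and $\widehat{H}_{\mathrm{PP}}$ are $\mathcal{O}((n_{\mathrm{P}}+n_{\mathrm{U}})b)$, $\mathcal{O}((n_{\mathrm{P}}+n_{\mathrm{U}})b^2)$, and $\mathcal{O}(n_{\mathrm{P}} b^2)$, respectively. Then, solving a system of linear equations to obtain the solution $\widehat{w}_{\mathrm{PU}}$ requires the computational complexity of $\mathcal{O}(b^3)$. In total, the computational complexity of this PU-AUC optimization method is $\mathcal{O}((n_{\mathrm{P}}+n_{\mathrm{U}})b^2 + b^3)$.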
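The closed-form solution can be sketched with NumPy as follows. The function names and the random design matrices are illustrative assumptions; the matrix expressions follow the squared-loss formulas of this section, and a second function recomputes the same risk pair by pair so that the quadratic form can be cross-checked.

```python
import numpy as np


def pu_auc_squared_loss_fit(phi_p, phi_u, theta_p, lam):
    """Closed-form PU-AUC solution for the squared loss.

    phi_p: (n_P, b) basis values of positive samples; phi_u: (n_U, b) for unlabeled.
    """
    n_p, b = phi_p.shape
    n_u = phi_u.shape[0]
    theta_n = 1.0 - theta_p
    sum_p = phi_p.sum(axis=0)  # Phi_P^T 1
    sum_u = phi_u.sum(axis=0)  # Phi_U^T 1
    h_pu = sum_p / (theta_n * n_p) - sum_u / (theta_n * n_u)
    cross = np.outer(sum_u, sum_p) / (theta_n * n_p * n_u)
    H_pu = ((phi_p.T @ phi_p) / (theta_n * n_p) - cross - cross.T
            + (phi_u.T @ phi_u) / (theta_n * n_u))
    H_pp = 2.0 * theta_p / (theta_n * (n_p - 1)) * (
        phi_p.T @ phi_p - np.outer(sum_p, sum_p) / n_p)
    # Solve the linear system instead of forming an explicit inverse.
    w = np.linalg.solve(H_pu - H_pp + lam * np.eye(b), h_pu)
    return w, h_pu, H_pu, H_pp


def pu_auc_risk_direct(w, phi_p, phi_u, theta_p):
    """Same empirical risk computed from pairwise squared losses."""
    theta_n = 1.0 - theta_p
    s_p, s_u = phi_p @ w, phi_u @ w
    pu = np.mean((1.0 - (s_p[:, None] - s_u[None, :])) ** 2)
    pp_all = (1.0 - (s_p[:, None] - s_p[None, :])) ** 2
    n_p = len(s_p)
    pp = (pp_all.sum() - n_p) / (n_p * (n_p - 1))  # diagonal terms all equal 1
    return pu / theta_n - theta_p / theta_n * pp
```

Forming the Gram matrices dominates the cost when the sample sizes are much larger than the number of basis functions, which matches the complexity discussion above.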

As given by Eq. (6), our PNU-AUC optimization method consists of the PNPU-AUC risk and the PNNU-AUC risk. For the squared loss $\ell_{\mathrm{S}}$, the empirical PNPU-AUC risk can be expressed as

 ˆRγPNPU(f) =1−γnPnNnP∑i=1nN∑j=1ℓS(f(xPi,xNj))+γθNnPnUnP∑i=1nU∑k=1ℓS(f(xPi,xUk)) =(1−γ)−2(1−γ)w⊤ˆhPN+(1−γ)w⊤ˆHPNw =+γ−2γw⊤ˆhPU+γw⊤ˆ