A Robust AUC Maximization Framework with Simultaneous Outlier Detection and Feature Selection for Positive-Unlabeled Classification
Abstract
Positive-unlabeled (PU) classification is a common scenario in real-world applications such as healthcare, text classification, and bioinformatics, in which we observe only a few samples labeled as “positive” together with a large volume of “unlabeled” samples that may contain both positive and negative samples. Building a robust classifier for the PU problem is very challenging, especially for complex data where negative samples dominate and mislabeled samples or corrupted features exist. To address these three issues, we propose a robust learning framework that unifies AUC maximization (a robust metric for biased labels), outlier detection (for excluding wrong labels), and feature selection (for excluding corrupted features). We provide generalization error bounds for the proposed model; they give valuable insight into the theoretical performance of the method and lead to useful practical guidance. For example, to train a model, the included unlabeled samples are sufficient as long as their number is comparable to the number of positive samples in the training process. Empirical comparisons and two real-world applications, surgical site infection (SSI) prediction and EEG seizure detection, are also conducted to show the effectiveness of the proposed model.
1 Introduction
Positive-unlabeled (PU) classification is quite common in many real-world applications such as healthcare (Kaur and Wasan, 2006; Peng et al., 2009), text classification (Liu et al., 2002; Yu et al., 2002), time series classification (Nguyen et al., 2011), and bioinformatics (Yang et al., 2012). PU classification is defined as follows: given a few samples labeled as “positive” and a high volume of “unlabeled” samples that contain both negative and positive samples, learn a binary classifier. Existing works for PU classification (Calvo et al., 2007; Li et al., 2011; Lee and Liu, 2003; Li et al., 2009; Yang et al., 2012; Elkan and Noto, 2008) work well for well-conditioned data. However, some important issues remain unsolved for information-rich but complex data, so the performance is often unsatisfactory. In particular, we consider the following three issues for complex data:

The data is highly biased: negative samples dominate. For example, in abnormal event detection, e.g., earthquake detection or seizure detection (considered in our experiment), no more than 0.1% of the samples are positive. In this case, a dummy classifier that classifies all data as negative achieves a prediction accuracy of at least 99.9%. As a result, the commonly used prediction error is no longer a robust and stable objective or metric.

The labeled positive samples may contain incorrect labels. This often happens in practice when the labels are provided by humans. Although the percentage of samples with wrong labels (called outliers in this paper) could be small, such outliers could seriously hurt the performance.

Samples contain redundant or irrelevant features. Redundant features could seriously hurt the performance by causing overfitting, especially when the number of samples is very limited.
Although each single issue has been considered in the existing literature (du Plessis et al., 2014; Hodge and Austin, 2004; Zhang, 2009), a naive step-by-step solution that combines existing methods does not work well, since any one issue can affect the other two. For example, if we first apply a feature selection algorithm, followed by an outlier detection algorithm, the features selected in the first step could be totally wrong because the selection does not account for possible outliers, and vice versa. Therefore, only when all issues are considered simultaneously is it possible to find the correct features and outliers. Unfortunately, little is known about how these three issues can be handled simultaneously, which makes the PU problem on complex data more challenging. This motivates us to find an integrated solution that handles all three issues together.
To address these three issues jointly for PU classification, we propose a robust learning framework that unifies AUC (area under the curve) maximization (a robust metric for biased labels), outlier detection (for excluding wrong labels and bad samples), and feature selection (for excluding corrupted features). First, existing works for PU classification (Calvo et al., 2007; Li et al., 2011; Lee and Liu, 2003; Li et al., 2009; Yang et al., 2012; Elkan and Noto, 2008) mainly use the misclassification error or the recall value as the performance metric to guide the training of the classifier. However, in the PU learning setting, where the training set is highly biased, AUC serves as a more robust metric (Cortes and Mohri, 2004) since it is invariant to the percentage of positive samples. Although AUC optimization has been studied before, e.g., (Rakotomamonjy, 2004; Brefeld and Scheffer, 2005), the AUC metric cannot be directly adopted for the PU problem, because negative samples are unavailable in the PU scenario. To overcome this difficulty, we propose to use a blind AUC (BAUC) criterion to approximate the target AUC, and we show that, in theory, maximizing BAUC is equivalent to maximizing AUC. Second, as the “unlabeled” samples are actually a heterogeneous collection of samples that contain outliers, the learning formulation needs to automatically identify the outliers and leave them out of the training process of the classifier. As a matter of fact, as the name “unlabeled” suggests, positive samples with wrong labels or positive samples with corrupted feature values are likely to occur and contribute to the “outlier” category of the unlabeled samples. Third, as feature selection is a critical aspect of mitigating overfitting, we also need to ensure that our learning formulation is able to incorporate this functionality.
The main contributions of this paper are summarized as follows:

The proposed BAUC-OF model is more robust than existing classification error minimization frameworks (for example, Plessis et al. (2015)), particularly in two aspects: 1) there is no need to set a prior value for the percentage of positive samples in the training process, which has been a difficulty in many applications; and 2) both outlier detection and feature selection are integrated with the AUC maximization formulation.

Generalization error bounds are also provided for the proposed model. They give valuable insight into the theoretical performance of the method and reveal how some important parameters of the PU problem (such as the dimensionality of the features and the sample sizes of both the positive and the “unlabeled” samples) relate to the performance of the learned classifier. These insights also lead to useful practical guidance, for example, that the unlabeled samples are sufficient as long as their number is comparable to the number of positive samples in the training process.

Empirical studies have been conducted on a thorough collection of datasets, demonstrating that the proposed method outperforms state-of-the-art approaches.

Last but not least, it is worth mentioning that the proposed outlier detection and feature selection techniques can easily be extended to other formulations. That is, although our proposed BAUC-OF model is motivated by general PU problems, the proposed outlier detection and feature selection can also be integrated with other PU learning formulations that have been developed for specific PU problems. To the best of our knowledge, this is the first work to simultaneously select features and identify outliers.
2 Related Works
This section reviews related work on the PU problem, outlier detection, and feature selection.
Learning From PU Data. PU learning is mainly about learning a binary classifier from a dataset containing both positive and unlabeled data (Elkan and Noto, 2008). The labeled positive data is assumed to be selected randomly from the population. The traditional approach to PU learning is to simply treat all unlabeled data as negative samples, which may result in biased solutions. Several methods have been proposed to mitigate this bias. One-class classification (Moya and Hush, 1996) uses only the positive samples in the training set; such works include De Bie et al. (2007) and Manevitz and Yousef (2001). Further, some nonconvex loss functions have been introduced to mitigate this bias, for example (Smola et al., 2009; du Plessis et al., 2014). Finally, Plessis et al. (2015) propose a convex formulation that can still cancel this bias for the PU problem. In some applications where the class prior is known, PU learning can be reduced to solving a cost-sensitive classification problem (Elkan, 2001). Works that adjust the weights inside the loss functions according to the class prior include Scott and Blanchard (2009); Blanchard et al. (2010); Li and Liu (2003); Lee and Liu (2003). However, an inaccurate estimate of the class prior will increase the classification error. Thus, some research efforts on the PU problem have also investigated different ways to estimate the class prior, such as (Elkan and Noto, 2008; Blanchard et al., 2010; Du Plessis and Sugiyama, 2014). Other approaches include the graph-based approach of Pelckmans and Suykens (2009) and the bootstrap-based approach of Mordelet and Vert (2014); we omit the details here. In this paper, we propose an AUC-based PU learning framework in which the AUC metric is used to guide the learning process. We show that this robust metric is especially suitable for PU learning and can be integrated with outlier detection and feature selection to achieve better performance than state-of-the-art PU learning approaches.
Outlier Detection. To the best of our knowledge, the outlier issue has not been considered in PU classification. Thus, we provide here a brief overview of outlier detection methods in general settings. Existing outlier detection research can be roughly categorized into variance-based approaches and model-based approaches. The variance-based approaches detect outliers through a set of criteria that measure the difference between a data point and the rest of the dataset. These criteria can come from statistical analysis (Yamanishi and Takeuchi, 2001; Yamanishi et al., 2004), distance metrics (Knorr and Ng, 1999), and density ratios (Aggarwal and Yu, 2001; Jiang et al., 2001; Breunig et al., 2000). They usually work well in low-dimensional spaces when the amount of data is not too large, but they are difficult to integrate into other models such as PU learning. The model-based approaches detect outliers through certain models. Typical methods include regularized principal component regression (Walczak, 1995), regularized partial least squares (Hubert and Branden, 2003), SVM-based algorithms (Jordaan and Smits, 2004), and others. More detailed reviews can be found in Hodge and Austin (2004). None of these works has been integrated with PU learning models.
Feature Selection / Overfitting. Feature selection is very important in machine learning, especially for applications where the number of features is much larger than the number of data points. While feature selection covers a wide spectrum of approaches, for example (He et al., 2006; Yao et al., 2017), the sparse learning area is most related to our study. Generally, there are two types of sparse learning approaches. The first type consists of convex relaxation formulations, represented by $\ell_1$-norm-based approaches such as the LASSO formulation developed in (Tibshirani, 1996) and further extended in (Ng, 2004; Zou, 2006; Friedman et al., 2008). The second type consists of nonconvex formulations, represented by $\ell_0$-norm-based (or greedy) approaches such as OMP (Tropp, 2004), FoBa (Zhang, 2009; Liu et al., 2013), and projected-gradient-based methods (Yuan et al., 2014; Nguyen et al., 2014). Usually, the $\ell_1$-based methods are convex relaxations of the $\ell_0$-norm-based methods. Theoretical studies have also been conducted to compare the performance of the $\ell_0$-norm-based approaches with the $\ell_1$-based methods, such as (Zhang, 2009; Liu et al., 2013). The feature selection framework has also been extended to multiclass settings (Obozinski et al., 2006; Chapelle and Keerthi, 2008; Xu et al., 2017).
3 Blind AUC Formulation with Outlier Detection and Feature Selection
We first propose the blind AUC (BAUC) metric for the PU problem and show its connection with the AUC for the binary classification problem in Section 3.1. Then, Section 3.2 introduces the proposed formulation that unifies BAUC maximization with simultaneous outlier detection and feature selection, and shows how the outlier detection and feature selection can be further integrated with other PU formulations. In Section 3.3, we develop the optimization algorithm for solving the proposed model.
3.1 Blind AUC (BAUC)
The AUC (Hanley and McNeil, 1982; Mason and Graham, 2002) metric is defined as
$$\mathrm{AUC}(f) := \mathbb{E}_{x^+ \sim \mathcal{D}^+,\; x^- \sim \mathcal{D}^-}\left[\mathbf{1}\big(f(x^+) > f(x^-)\big)\right],$$
where $f$ is a scoring function, for example, a linear function $f(x) = w^\top x$. $\mathcal{D}^+$ and $\mathcal{D}^-$ denote the distributions of positive samples and negative samples, respectively. The indicator function $\mathbf{1}(\cdot)$ returns $1$ if the condition is satisfied and $0$ otherwise. Intuitively, $\mathrm{AUC}(f)$ measures the probability that the score of $x^+$ is greater than that of $x^-$ when $x^+$ and $x^-$ are randomly sampled from the positive class and the negative class. It is known that AUC is a more stable and robust metric than accuracy for biased binary classification problems. Thus, it has been widely used to guide the training of classification models in binary classification.
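Although the maximization of AUC is our ultimate goal in the PU problem, it is hard to apply directly because negative labels are not available in the PU problem. Therefore, we consider a blind AUC (BAUC) for the PU problem. In particular, BAUC blindly treats all unlabeled samples as negative samples and is defined as
$$\mathrm{BAUC}(f) := \mathbb{E}_{x^+ \sim \mathcal{D}^+,\; x^u \sim \mathcal{D}^u}\left[\mathbf{1}\big(f(x^+) > f(x^u)\big)\right],$$
where $\mathcal{D}^u$ is the distribution of unlabeled samples. Using BAUC, we can derive the following empirical formulation to learn the classifier from the positive training set $X^+$ (with $n_+$ samples) and the unlabeled training set $X^u$ (with $n_u$ samples):
$$\max_f \;\; \widehat{\mathrm{BAUC}}(f) := \frac{1}{n_+ n_u} \sum_{x_i \in X^+} \sum_{x_j \in X^u} \mathbf{1}\big(f(x_i) > f(x_j)\big). \tag{1}$$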
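It is easy to verify that
$$\mathbb{E}\big[\widehat{\mathrm{BAUC}}(f)\big] = \mathrm{BAUC}(f),$$
where $\widehat{\mathrm{BAUC}}(f)$ denotes the empirical objective in (1). This follows from the linearity of expectation: the expected value of a sum of random variables equals the sum of their individual expected values. The details are given in the proof of Theorem 2.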
Note that one can approximately maximize (1) by replacing the indicator function with a surrogate function, e.g., the hinge loss or the logistic loss.
Although we can only empirically maximize BAUC, the following Theorem 1 suggests that maximizing BAUC essentially maximizes AUC (recall that this is our ultimate goal). In particular, Theorem 1 reveals that BAUC depends on AUC linearly, which indicates that when BAUC achieves its maximum, AUC achieves its maximum too.
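As a minimal sketch of the empirical objective in (1) and of a smooth surrogate for it, the following Python snippet counts the fraction of (positive, unlabeled) pairs that are ranked correctly and evaluates a pairwise logistic loss on the same pairs. The function names and the use of precomputed score lists are assumptions made for illustration; they are not taken from the original formulation.

```python
import math


def empirical_bauc(pos_scores, unl_scores):
    """Fraction of (positive, unlabeled) pairs in which the positive scores strictly higher."""
    wins = sum(1 for sp in pos_scores for su in unl_scores if sp > su)
    return wins / (len(pos_scores) * len(unl_scores))


def logistic_pairwise_loss(pos_scores, unl_scores):
    """Smooth surrogate: mean of log(1 + exp(-(s_pos - s_unl))) over all pairs."""
    total = 0.0
    for sp in pos_scores:
        for su in unl_scores:
            z = sp - su
            # numerically stable evaluation of log(1 + exp(-z))
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / (len(pos_scores) * len(unl_scores))
```

The surrogate decreases as positives are scored further above unlabeled samples, so minimizing it pushes the pairwise ranking in the same direction as maximizing the empirical BAUC.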
Theorem 1.
For the binary classification problem, given an arbitrary scoring function $f$, there exists a linear dependence between its BAUC value and its AUC value:
$$\mathrm{BAUC}(f) = (1-\pi)\,\mathrm{AUC}(f) + \frac{\pi}{2},$$
where $\pi$ is the percentage of positive samples.
Proof.
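From the definition of BAUC, and since the unlabeled distribution is the mixture $\mathcal{D}^u = \pi\,\mathcal{D}^+ + (1-\pi)\,\mathcal{D}^-$, we have
$$\mathrm{BAUC}(f) = \pi\,\mathbb{E}_{x^+ \sim \mathcal{D}^+,\; \tilde{x}^+ \sim \mathcal{D}^+}\left[\mathbf{1}\big(f(x^+) > f(\tilde{x}^+)\big)\right] + (1-\pi)\,\mathbb{E}_{x^+ \sim \mathcal{D}^+,\; x^- \sim \mathcal{D}^-}\left[\mathbf{1}\big(f(x^+) > f(x^-)\big)\right].$$
The term $\mathbb{E}_{x^+, \tilde{x}^+}\left[\mathbf{1}\big(f(x^+) > f(\tilde{x}^+)\big)\right]$ is a constant, because the probability that a randomly chosen positive sample is ranked higher than another randomly chosen positive sample from the same data set should always be $1/2$. So we have
$$\mathrm{BAUC}(f) = \frac{\pi}{2} + (1-\pi)\,\mathrm{AUC}(f),$$
which proves the theorem. ∎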
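The linear relation in Theorem 1 can be checked numerically. The sketch below assumes the relation takes the form BAUC = (1 − π)·AUC + π/2, which is what the mixture argument in the proof yields, and assumes one-dimensional Gaussian scores with hypothetical means (1 for positives, 0 for negatives). The function names, distributions, and sample sizes are illustrative choices, not values from the experiments.

```python
import bisect
import random


def pairwise_auc(high_scores, low_scores):
    """Fraction of pairs (a, b), a from high_scores and b from low_scores, with a > b."""
    ordered = sorted(low_scores)
    # bisect_left returns how many entries of `ordered` are strictly smaller than a
    wins = sum(bisect.bisect_left(ordered, a) for a in high_scores)
    return wins / (len(high_scores) * len(low_scores))


def check_theorem1(pi=0.2, n_pos=500, n_unl=5000, seed=0):
    """Compare the empirical BAUC with (1 - pi) * AUC + pi / 2 on simulated scores."""
    rng = random.Random(seed)
    pos = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]        # labeled positives
    n_hidden = round(pi * n_unl)                             # positives hidden in the unlabeled set
    hidden_pos = [rng.gauss(1.0, 1.0) for _ in range(n_hidden)]
    neg = [rng.gauss(0.0, 1.0) for _ in range(n_unl - n_hidden)]
    auc = pairwise_auc(pos, neg)                   # needs the true negatives
    bauc = pairwise_auc(pos, hidden_pos + neg)     # treats every unlabeled sample as negative
    predicted = (1.0 - pi) * auc + pi / 2.0
    return auc, bauc, predicted
```

With these settings the empirical BAUC and the value predicted from the AUC should agree up to sampling noise, even though the BAUC computation never uses the negative labels.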
3.2 Integration of BAUC Maximization with Outlier Detection and Feature Selection
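The maximization of the empirical BAUC over the positive training set $X^+$ (of size $n_+$) and the unlabeled training set $X^u$ (of size $n_u$) is equivalent to minimizing one minus the empirical BAUC:
$$\min_f \;\; \frac{1}{n_+ n_u} \sum_{x_i \in X^+} \sum_{x_j \in X^u} \mathbf{1}\big(f(x_i) \le f(x_j)\big).$$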
While the concept of an outlier is diverse, in this paper we mainly consider outliers of two kinds: 1) samples that are wrongly labeled as positive; and 2) positive samples whose feature values are corrupted for whatever reason, whose existence distorts the distribution of the data points. To identify those outliers, we construct a vector $\epsilon \in \mathbb{R}^{n_+}$ in which each positive sample $x_i$ corresponds to one coordinate, denoted by $\epsilon_i$. With this notation, the following optimization formulation can be derived:
$$\min_{f,\,\epsilon} \;\; \frac{1}{n_+ n_u} \sum_{x_i \in X^+} \sum_{x_j \in X^u} \mathbf{1}\big(f(x_i) + \epsilon_i \le f(x_j)\big) \tag{2}$$
$$\text{s.t.} \quad \|\epsilon\|_0 \le s_\epsilon. \tag{3}$$
The key motivation behind this formulation is to use $\epsilon_i$ to adjust the score of an outlier instead of modifying its label or feature values, although it has an equivalent effect. The constraint in (3) restricts the maximal number of outliers by a user-defined parameter $s_\epsilon$, and the nonzero elements of the optimal $\epsilon$ indicate outliers. To the best of our knowledge, this is the first time the $\ell_0$ norm is applied to outlier detection, although it has been used for feature selection before.
Next we integrate the model with feature selection capacity. The basic task is to restrict the hypothesis space for $f$. Here, we restrict our interest to linear scoring functions, that is, $f(x) = w^\top x$, where $w$ parameterizes the scoring function $f$. It is worth mentioning that our proposed framework could be extended to nonlinear models as well. Here, we mainly consider three types of sparsity hypothesis ($w \in \Omega$) for $w$ by defining $\Omega$ as
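$$\Omega(s) := \{w \in \mathbb{R}^p : \|w\|_0 \le s\}, \tag{4a}$$
$$\Omega(\mathcal{G}, s) := \{w \in \mathbb{R}^p : |\{g \in \mathcal{G} : w_g \ne 0\}| \le s\}, \tag{4b}$$
$$\Omega(\mathcal{G}, \mathbf{s}) := \{w \in \mathbb{R}^p : \|w_g\|_0 \le s_g \;\; \forall g \in \mathcal{G}\}, \tag{4c}$$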
where $\mathcal{G}$ is the set of disjoint group index sets, $s$ is a scalar that specifies the upper bound of the feature size, and $\mathbf{s}$ is a vector. $\Omega(s)$ is the commonly used sparsity hypothesis space in sparse learning (Tibshirani, 1996); $\Omega(\mathcal{G}, s)$ denotes the group sparsity set (Huang and Zhang, 2010; Zhang et al., 2010); and $\Omega(\mathcal{G}, \mathbf{s})$ is the exclusive sparsity set enforcing the selection diversity (Campbell and Allen, 2015).
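To make the three sparsity hypotheses concrete, the sketch below shows one way to project a weight vector onto each kind of set by hard thresholding. It assumes the plain sparsity set bounds the number of nonzero entries, the group sparsity set bounds the number of nonzero groups, and the exclusive sparsity set bounds the number of nonzero entries inside each group; it also assumes the groups are disjoint and cover all indices. The function names are hypothetical.

```python
import math


def project_l0(w, s):
    """Keep the s largest-magnitude entries of w and zero out the rest."""
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    keep = set(order[:s])
    return [w[i] if i in keep else 0.0 for i in range(len(w))]


def project_group(w, groups, s):
    """Keep the s groups with the largest Euclidean norm; zero out all other groups."""
    norms = [math.sqrt(sum(w[i] ** 2 for i in g)) for g in groups]
    order = sorted(range(len(groups)), key=lambda k: norms[k], reverse=True)
    keep = {i for k in order[:s] for i in groups[k]}
    return [w[i] if i in keep else 0.0 for i in range(len(w))]


def project_exclusive(w, groups, s_vec):
    """Within each group g, keep at most s_vec[g] largest-magnitude entries."""
    out = [0.0] * len(w)
    for g, s_g in zip(groups, s_vec):
        order = sorted(g, key=lambda i: abs(w[i]), reverse=True)
        for i in order[:s_g]:
            out[i] = w[i]
    return out
```

For example, with `w = [3.0, -1.0, 0.5, -4.0, 2.0, 0.1]` and groups `[[0, 1, 2], [3, 4, 5]]`, the plain projection with `s = 2` keeps the entries 3.0 and -4.0, the group projection with `s = 1` keeps the whole second group, and the exclusive projection with budgets `[1, 2]` keeps one entry from the first group and two from the second.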
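To put everything together, the final model can be summarized as follows:
$$\min_{w,\,\epsilon} \;\; \frac{1}{n_+ n_u} \sum_{x_i \in X^+} \sum_{x_j \in X^u} \mathbf{1}\big(w^\top x_i + \epsilon_i \le w^\top x_j\big) \quad \text{s.t.} \quad \|\epsilon\|_0 \le s_\epsilon, \;\; w \in \Omega.$$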
Since the indicator function (or equivalently the 0-1 loss function) is not continuous, the common treatment is to approximate it with a convex and continuous surrogate function, such as the hinge loss or the logistic loss. Without loss of generality, we focus here on the logistic loss (similar algorithms and theories can be applied to other smooth loss functions). This leads to the following formulation with Outlier detection and Feature selection (named BAUC-OF):
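$$\min_{w,\,\epsilon} \;\; \frac{1}{n_+ n_u} \sum_{x_i \in X^+} \sum_{x_j \in X^u} \ell\big(w^\top x_i - w^\top x_j + \epsilon_i\big) + \frac{\alpha}{2}\|w\|^2 + \frac{\beta}{2}\|\epsilon\|^2 \tag{5}$$
$$\text{s.t.} \quad \|\epsilon\|_0 \le s_\epsilon, \;\; w \in \Omega, \tag{6}$$
where $\ell$ is defined as $\ell(z) := \log(1 + \exp(-z))$. Note that the two additional terms $\frac{\alpha}{2}\|w\|^2$ and $\frac{\beta}{2}\|\epsilon\|^2$ serve as regularization terms to cope with the possibility that $w$ or $\epsilon$ diverges in the optimization process. $\alpha$ and $\beta$ are usually set to small values.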
The proposed outlier detection and feature selection scheme is quite flexible and can easily be incorporated into other existing PU frameworks. Most existing models for the PU problem minimize the misclassification error. It is of interest to compare our proposed BAUC-based models with these error-based methods. In particular, since we also integrate the model with outlier detection and feature selection, in this section we further illustrate how the counterpart of BAUC-OF can be developed in the error-minimization framework using a recent development (Plessis et al., 2015). Following (Plessis et al., 2015), the error minimization is given as:
(7) 
Applying the logistic loss function to approximate the indicator function, (7) can be written as (Plessis et al., 2015):
We then introduce the outlier detection and feature selection into this model to obtain the error minimization formulation with outlier detection and feature selection (named ERR-OF) below:
s.t.  (8) 
3.3 Optimization
This section introduces the optimization algorithm for solving the proposed BAUC-OF formulation in (6). Eq. (6) is a constrained smooth nonconvex optimization problem. The nonconvexity is due to the constraints on the outlier vector and on the weight vector. A natural idea is to apply the commonly used projected gradient descent algorithm. However, the AUC formulation involves a huge number of interaction terms between the positive samples and the unlabeled samples. To reduce the complexity of each iteration, we use a stochastic gradient to approximate the exact gradient. In particular, we iteratively sample one positive sample and one unlabeled sample from the training sets to calculate an unbiased stochastic gradient of the objective,
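and apply the projected gradient step to obtain the next iterate:
$$w^{t+1} = P_{\Omega}\big(w^{t} - \gamma\, g_w^{t}\big), \qquad \epsilon^{t+1} = P_{\{\epsilon \,:\, \|\epsilon\|_0 \le s_\epsilon\}}\big(\epsilon^{t} - \gamma\, g_\epsilon^{t}\big),$$
where $g_w^{t}$ and $g_\epsilon^{t}$ denote the stochastic gradients with respect to $w$ and $\epsilon$ at iteration $t$, $P$ denotes the projection onto the indicated constraint set, $\gamma$ is the learning rate, and the projection steps for $w$ and $\epsilon$ have closed-form solutions.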
While the convergence of the projected stochastic gradient (PSG) method for convex optimization has been well studied (the generic convergence rate is $O(1/\sqrt{T})$ after $T$ iterations), its convergence for nonconvex optimization had rarely been studied until very recently. Thanks to the method developed in (Nguyen et al., 2014), one can follow their analysis to establish the convergence rate of PSG for (6). Omitting the tedious statements and proofs, we simply state the result below: under some mild conditions, PSG converges to a ball around the optimal solution to (6):
where and are the optimal solution to (6). is a number smaller than . It depends on the restricted condition number of the objective of (6). The radius of the ball depends on two terms: and . is the variance due to the use of the “stochastic” gradient, while is the observation noise while collecting the data.
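The optimization procedure described in this section can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes a pairwise logistic loss on the score difference plus an outlier offset, squared-norm regularizers with small weights `alpha` and `beta`, the plain sparsity constraint on the weights, and hard thresholding as the closed-form projection. All names, default values, and the regularizer scaling are assumptions.

```python
import math
import random


def hard_threshold(v, k):
    """Projection onto {v : ||v||_0 <= k}: keep the k largest-magnitude entries."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def logistic_grad(z):
    """Derivative of log(1 + exp(-z)) with respect to z, computed stably."""
    if z >= 0:
        e = math.exp(-z)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(z))


def psg_bauc_of(X_pos, X_unl, s_w, s_eps, alpha=1e-3, beta=1e-3,
                lr=0.1, steps=2000, seed=0):
    """Projected stochastic gradient on a pairwise logistic loss with sparse w and eps."""
    rng = random.Random(seed)
    w = [0.0] * len(X_pos[0])
    eps = [0.0] * len(X_pos)
    for _ in range(steps):
        i = rng.randrange(len(X_pos))   # sampled positive sample
        j = rng.randrange(len(X_unl))   # sampled unlabeled sample
        diff = [a - b for a, b in zip(X_pos[i], X_unl[j])]
        z = sum(wk * dk for wk, dk in zip(w, diff)) + eps[i]
        g = logistic_grad(z)
        # gradient step on w followed by projection onto the sparsity set
        w = hard_threshold(
            [wk - lr * (g * dk + alpha * wk) for wk, dk in zip(w, diff)], s_w)
        # the pair loss touches only eps[i]; the ridge term shrinks every coordinate
        eps = [ek - lr * beta * ek for ek in eps]
        eps[i] -= lr * g
        eps = hard_threshold(eps, s_eps)
    return w, eps


def empirical_bauc(w, X_pos, X_unl):
    """Fraction of (positive, unlabeled) pairs ranked correctly by the linear score."""
    sp = [sum(wk * xk for wk, xk in zip(w, x)) for x in X_pos]
    su = [sum(wk * xk for wk, xk in zip(w, x)) for x in X_unl]
    return sum(a > b for a in sp for b in su) / (len(sp) * len(su))
```

Each iteration costs time linear in the number of features and positive samples, independent of the number of positive-unlabeled pairs, which is the reason for sampling pairs instead of forming the full gradient.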
Computational time analysis. We now discuss the computational time of the proposed algorithm. AUC optimization is essentially a ranking-based algorithm. For this kind of method, the computational complexity grows because the data are paired, so the computational time is highly dependent on the data size. Fortunately, Theorem 2 shows that when the number of labeled positive samples is fixed, the marginal gain from including more unlabeled samples is very minor. Since the dataset in PU learning is highly biased, i.e., positive samples are scarce, this indicates that we can still achieve a good result using only a moderate amount of data. This suggests a useful way of deciding the sample size of the unlabeled set in practice: when the number of unlabeled samples is substantially larger than the number of positive samples, it is not necessary to include all the unlabeled samples in early training stages, which avoids a heavy computational burden. In practice, one can gradually increase the number of unlabeled samples until the performance does not change significantly.
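The practical recipe above (start with a modest unlabeled subset and enlarge it until the metric stabilizes) can be sketched as a small driver loop. The callback `fit_and_score`, the doubling schedule, and the tolerance are hypothetical choices made for illustration; the callback is assumed to train on the given number of unlabeled samples and return the resulting score.

```python
def grow_unlabeled_set(n_pos, n_unl_total, fit_and_score, growth=2.0, tol=1e-3):
    """Enlarge the unlabeled subset until the score changes by less than tol.

    fit_and_score(m) is assumed to train with m unlabeled samples and return a score.
    Returns the list of (subset size, score) pairs that were evaluated.
    """
    size = min(n_unl_total, n_pos)   # start with as many unlabeled as positive samples
    history = []
    while True:
        score = fit_and_score(size)
        history.append((size, score))
        converged = len(history) > 1 and abs(score - history[-2][1]) < tol
        if converged or size == n_unl_total:
            return history
        size = min(n_unl_total, max(size + 1, int(size * growth)))
```

With a score that saturates as the subset grows, the loop stops well before the full unlabeled set is used, which is where the computational saving comes from.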
4 Theoretical Guarantee
This section studies the theoretical performance of the proposed model and algorithm.
Theorem 2.
Given two datasets and , let and , and assume that all data points in are i.i.d samples from the distribution , and all data points in are i.i.d samples from the distribution , with probability at least we have:
where
and is defined as
where is the total number of features.
Proof.
To prove the error bound between and , we start to measure the difference between and . First from the definition of
we can take its expectation
Then we apply the connection between and in Theorem 1 to obtain the following connection:
(9) 
For simplicity, we use to denote in the following. Next we estimate the probabilistic error bound between and .
(10)  
(11) 
We first consider the case . We only provide the upper bound for (10) (The upper bound for (11) can be obtained similarly.)
(12) 
Fixing and in (10), we have
(13) 
which is bounded from Hoeffding’s inequality. is defined as the # of possible configuration of on points in for . We have
(14) 
where the second inequality uses the VC dimension for linear classifier (Vapnik, 2006). So we get the upper bound for (10) by (13) and (14)
(15) 
Similarly we can obtain the upper bound for (11)
(16) 
So we can get (17) by (15) and (16):
(17) 
Let the right hand side in the above inequality be bounded by , we have
holds with probability less than , and . Using the dependence in (9), we obtain
with probability at least .
To show the bound for , we only need to estimate (12) by taking as and estimate the upper bound for
Then we can follow the proof for to obtain the bound for .
We conclude the proof by considering the last case . Following the same idea before, we only need to estimate (12) by taking as and estimate the upper bound for
It completes the proof. ∎
This theorem provides the upper bound of the difference between the empirically obtained and the true . This leads to the following interesting observations:

When the number of unlabeled samples exceeds the number of positive samples, increasing the number of unlabeled samples further improves this bound only marginally.

The complexity of affects this bound significantly. Let us consider the case . Note that the error bound linearly depends on the sparsity parameter . When , the error bound converges to zero, which coincides with the consistency analysis for sparse signal recovery (for example, (Zhang, 2011)). Actually, the error bounds for and suggest similar observations.

It is worthy of pointing out that if the super group set contains singleton groups, then it is known that and the provided error bounds for and are the same. And if the super set only contains a single group and , then it is known that and the suggested error bounds are the same as well.

In addition, we can see that both and have better error bound (i.e. smaller ) than when the number of nonzero elements are the same. Suppose we have groups, and all the groups have the same size (for convenience, suppose is dividable by ). Let all the models have no more than nonzero elements. In this case, we compare the models under the sparsity hypotheses , and where . According to Theorem 2, we know that
and
While . Therefore, we have that and .
5 Experiments
In this section, we thoroughly evaluate the proposed BAUC-OF model. First, using synthetic data, we test how the number of unlabeled training samples affects the AUC value in order to validate our theoretical analysis in Theorem 2. Then, we compare the proposed model with the error minimization model using both synthetic data and real datasets. Finally, we apply the proposed method to two real-world applications: the prediction of surgical site infection (SSI) and the detection of seizures.
5.1 Empirical Validation of Theorem 2
This section conducts empirical experiments to study how the number of unlabeled samples affects the value, and evaluate the difference between the and the empirical . Here, is calculated using the training data, and is calculated using the testing data with a sufficient number of samples. The positive samples follow the Gaussian distribution while the negative samples follow the Gaussian distribution . Note that this case is not linearly separable. The number of positive samples is fixed as . We also generate unlabeled samples where 10 % of them are generated from the distribution model of positive samples. We gradually increase the size of unlabeled data from to . All experiments are repeated 10 times to obtain the mean and variance of the performance metrics.
We apply the PSG algorithm to solve the BAUC model (without feature selection and outlier detection) on synthetic datasets with various sizes of unlabeled sets. Results are reported in Figure 1, where the two curves correspond to the AUC and the empirical BAUC, respectively. The results indicate that when the number of unlabeled samples is more than 5 times the number of positive samples, the improvement in AUC becomes quite minor and the estimation error between the AUC and the empirical BAUC does not change much. This observation is consistent with our analysis in Theorem 2. It essentially suggests that it is not necessary to include all the unlabeled data in the training process when the unlabeled data points are substantially more numerous than the positive samples.
5.2 Comparison of the Proposed Model (6) with the State-of-the-Art Methods
This section compares the proposed model with the error minimization model and its variants. Specifically, the comparison involves 9 algorithms: SVM under the ideal case (true labels of negative samples are known), one-class SVM (a popular one-class classification algorithm) (Manevitz and Yousef, 2001), Biased SVM (BSVM) (Hoi et al., 2004) (a state-of-the-art algorithm), ERR (the leading algorithm recently developed in Plessis et al. (2015), that is, (8) without outlier detection and feature selection), ERR-O (Plessis et al. (2015) + the proposed outlier detection), ERR-OF (Plessis et al. (2015) + the proposed outlier detection and feature selection), BAUC ((6) without outlier detection and feature selection), BAUC-O ((6) only with outlier detection), and BAUC-OF in (6) (the complete version of the proposed model). Among them, the SVM under the ideal case, where the true labels of the negative samples are known, serves as the baseline or gold standard against which we evaluate the performance of all the other algorithms.
Synthetic datasets: We consider the binary classification task. Each feature vector contains relevant features while the remaining is irrelevant. The relevant features are generated from and for positive and negative samples respectively. Irrelevant features are generated from . The wrong positive samples (outliers) are generated from . The training set contains 100 positive samples (containing wrong samples or outliers) and 300 unlabeled samples (20 positive samples + 280 negative samples). This leads to 30000 pairs in the formulation. Test samples are generated in the same way with 1200 positive samples and 2800 negative samples. We varied the number of outliers and the number of features to compare all algorithms.
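A data set with the sample counts described above (100 labeled positives including outliers, and 300 unlabeled samples made of 20 positives and 280 negatives, giving 30000 pairs) could be generated along the following lines. The distribution parameters, the default numbers of features and outliers, and the choice to draw outliers like negatives are assumptions for illustration only; they are not the settings used in the experiments.

```python
import random


def make_pu_dataset(n_features=40, n_relevant=10, n_labeled_pos=100, n_outliers=5,
                    n_unl_pos=20, n_unl_neg=280, seed=0):
    """Generate a labeled-positive set (with outliers) and an unlabeled set."""
    rng = random.Random(seed)

    def draw(mean_relevant):
        relevant = [rng.gauss(mean_relevant, 1.0) for _ in range(n_relevant)]
        irrelevant = [rng.gauss(0.0, 1.0) for _ in range(n_features - n_relevant)]
        return relevant + irrelevant

    # ASSUMED means: +1 for positives, -1 for negatives and for mislabeled outliers
    labeled_pos = [draw(+1.0) for _ in range(n_labeled_pos - n_outliers)]
    labeled_pos += [draw(-1.0) for _ in range(n_outliers)]
    unlabeled = [draw(+1.0) for _ in range(n_unl_pos)]
    unlabeled += [draw(-1.0) for _ in range(n_unl_neg)]
    rng.shuffle(unlabeled)   # hide which unlabeled samples are positive
    return labeled_pos, unlabeled
```

Varying `n_features` or `n_outliers` while keeping the other arguments fixed mirrors the two sweeps described in the text.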
#F  SVM(ideal)  One-class SVM  BSVM  ERR  BAUC  ERR-O  ERR-OF  BAUC-O  BAUC-OF  

40  88.01±0.60  66.58±7.04  85.58±2.21  84.14±1.51  85.10±1.39  88.07±0.71  87.97±0.72  88.02±0.69  88.21±0.63  
80  86.60±0.90  63.56±6.58  78.17±2.31  78.45±1.72  80.60±1.56  86.00±0.84  86.16±0.77  85.98±0.90  86.22±1.64  
120  84.39±1.04  66.84±5.12  75.11±1.76  76.03±1.71  78.50±1.67  84.45±0.76  84.65±1.02  84.64±0.69  86.28±1.07  
160  83.01±1.01  62.46±4.55  70.52±1.86  72.32±2.36  75.02±2.42  82.40±1.28  82.81±1.71  82.39±1.37  83.03±2.08  
200  82.30±1.30  60.49±4.01  70.12±1.52  71.11±0.87  74.11±0.78  81.43±0.52  82.94±1.63  81.29±0.66  83.77±2.22  
240  81.37±1.59  60.05±5.66  68.85±1.64  68.64±2.87  71.69±2.46  79.55±1.50  81.47±2.36  79.36±1.60  81.54±2.36  
280  80.44±1.44  60.69±5.29  66.93±1.65  68.55±2.04  71.16±1.84  79.20±1.33  81.50±1.51  78.86±1.12  81.02±2.22  
320  79.09±1.58  59.12±3.02  66.85±1.25  68.19±1.92  70.81±2.00  78.60±1.26  81.08±1.09  78.49±1.13  81.47±1.22  
360  79.02±1.24  57.24±5.03  63.77±0.83  65.53±1.96  68.19±2.02  76.47±1.54  79.61±2.29  76.64±1.75  78.44±2.20  
400  76.94±1.54  57.42±6.32  63.09±1.02  65.39±1.94  67.99±1.12  76.13±1.54  78.27±2.34  75.92±1.47  78.77±3.10  
#O  SVM(ideal)  One-class SVM  BSVM  ERR  BAUC  ERR-O  ERR-OF  BAUC-O  BAUC-OF  

1  84.10±0.85  80.30±2.40  77.13±1.74  79.84±1.89  80.21±1.70  80.94±1.72  83.14±2.53  80.88±1.64  83.93±2.30  
2  83.90±1.13  78.06±2.80  75.94±2.03  78.97±0.74  79.58±0.91  81.33±0.73  83.19±1.21  81.28±0.85  84.02±1.82  
3  82.79±0.87  77.60±2.13  73.22±1.25  77.75±1.43  78.56±1.23  81.38±1.15  83.80±1.35  81.29±1.06  84.17±1.20  
4  83.52±0.83  75.37±3.36  72.67±2.51  76.29±1.61  77.78±1.39  81.41±1.31  83.07±2.05  81.24±1.37  83.49±1.67  
5  83.42±1.22  71.64±4.48  71.59±1.69  75.07±1.95  77.13±1.76  81.57±1.32  83.75±2.01  81.52±1.39  83.91±1.95  
6  82.41±1.53  67.30±6.58  69.12±2.13  73.70±2.35  76.17±2.33  81.71±1.60  83.74±2.26  81.64±1.40  83.77±2.44  
7  82.10±1.63  61.47±5.1  67.09±1.49  70.41±2.87  73.29±2.45  80.89±1.83  82.86±2.41  80.97±1.77  83.30±2.90  
8  81.63±0.87  57.70±4.59  65.67±1.69  69.03±2.16  71.86±1.94  81.22±1.19  83.51±2.01  81.18±1.48  83.77±1.73  
Datasets  SVM(ideal)  One-class SVM  BSVM  ERR  ERR-O  ERR-OF  BAUC  BAUC-O  BAUC-OF  

disease 1 vs disease 2  97.08  57.42  89.86  95.69  95.69  97.08  95.55  95.55  97.22 
disease 1 vs health  91.74  61.27  88.10  89.27  90.00  94.99  88.52  88.52  94.94 
disease 2 vs health  87.70  60.28  87.00  92.58  92.58  93.71  92.47  92.47  93.58 
health vs disease 1,2  84.09  51.22  78.14  77.33  77.33  83.17  77.19  77.30  83.21 
health vs all  78.05  53.25  76.54  77.27  77.28  77.91  77.39  77.39  77.74 
SPECTF  81.43  52.30  79.31  80.00  80.00  80.66  80.24  80.36  80.73 
Readmission  73.22  71.48  72.89  72.60  72.60  72.74  72.80  72.80  72.92 
Readmission(outlier)  67.40  70.73  64.00  67.50  71.11  71.14  67.61  72.38  72.42 
Hill-Valley  96.39  52.34  84.15  88.17  88.19  88.19  94.66  95.82  95.82  
Hill-Valley(noise)  87.30  53.43  80.73  81.28  81.51  81.51  84.28  84.29  84.29  
Real datasets: Five real datasets are used to validate the proposed model, including Arrhythmia, SPECTF Heart, Readmission, noiseless Hill-Valley, and noisy Hill-Valley (Lichman, 2013). The first real dataset is the Arrhythmia data from the UCI repository. By choosing different groups of labels as the positive class and the negative class, we get five learning scenarios, as shown in Table 3. In this dataset, labels 1, 2, and 10 are chosen as health, disease type 2, and disease type 1, respectively. These three labels are chosen because the number of people in these classes is large enough. The sizes of the training sets for five learning scenarios are respectively, with the number of positive data being in all these sets. The second dataset is the SPECTF Heart Data Set. We choose label 0 as the positive class. The size of the training set is 80, with 50% in the positive class. The third and fourth datasets (i.e., Readmission and Readmission (outlier) in Table 3) are generated from a medical readmission dataset. In our experiments, we randomly choose 20 positive samples (no readmission) and 30 negative samples to form the training set. For the experiment with outliers (i.e., the fourth dataset), we randomly add 3 negative samples to the training set. The last two datasets are the noiseless version and the noisy version of the Hill-Valley dataset. We randomly choose 50 positive samples (Hill) and 150 negative samples (Valley) to form our training set, and the remaining samples are used for testing. Note that, for all the training sets, we use the true class prior in the error-based algorithms. At last, during the experiments, of the randomly chosen positive data inside the training sets is known to the algorithms.
Parameter tuning: Unless otherwise specified, we tune the hyperparameters in our model (6) as follows. In the experiment, we choose . There are four hyperparameters: , , (the outlier upper bound), and (the feature sparsity). Since and are just used to restrict the magnitude of and , the performance is less sensitive to these two hyperparameters. So in practice, they are chosen to be small values, e.g., . and are important to the performance. They serve the same purpose as the weight of norm sparse regularization, but they are discrete and much easier to tune. We initialize both and by small integers (e.g., and of the total number of features) and increase the value of each hyperparameter in a greedy manner, until the performance on the training set stops improving.
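The greedy search over the two discrete hyperparameters (the outlier upper bound and the feature sparsity) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `greedy_tune`, the `score` callback (standing in for "train the model and return the training performance"), the unit step sizes, and the upper limits are all hypothetical.

```python
from typing import Callable, Tuple


def greedy_tune(score: Callable[[int, int], float],
                start: Tuple[int, int],
                upper: Tuple[int, int],
                step: Tuple[int, int] = (1, 1)) -> Tuple[int, int, float]:
    """Greedily increase two integer hyperparameters while the score improves.

    `score(n_outliers, n_features)` is a hypothetical callback that trains
    the model with the given budgets and returns the training performance.
    """
    current = list(start)
    best = score(current[0], current[1])
    improved = True
    while improved:
        improved = False
        # Try increasing each hyperparameter in turn; keep a change only
        # if it strictly improves the score.
        for i in (0, 1):
            candidate = list(current)
            candidate[i] += step[i]
            if candidate[i] > upper[i]:
                continue
            value = score(candidate[0], candidate[1])
            if value > best:
                current, best = candidate, value
                improved = True
    return current[0], current[1], best


if __name__ == "__main__":
    # Toy score with a single peak, used only to exercise the search.
    def toy_score(a: int, b: int) -> float:
        return -(a - 3) ** 2 - (b - 5) ** 2

    print(greedy_tune(toy_score, start=(1, 1), upper=(10, 10)))
```

Because the search stops at the first point where neither increment helps, it may settle on a local optimum; starting from small integers keeps the selected outlier and feature budgets conservative.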
The results for synthetic data are shown in Table 1 and Table 2, and the results for real data are shown in Table 3. Overall, the proposed OF model outperforms the other models. The performance of all the PU learning algorithms without feature selection and outlier detection declines greatly when redundant features and outliers exist. Compared with the traditional classification problem (ideal SVM), the performance of PU learning algorithms decreases more rapidly. We therefore conclude that PU learning is much more sensitive to irrelevant features and noise in the dataset. Intuitively, when all kinds of uncertainty (unknown labels, irrelevant features, and outliers) are combined and correlated, the problem becomes much more complicated than the sum of the separate problems. When the proposed feature selection and outlier detection are included in the learning process, the performance improves significantly. One-class classification is very sensitive to outliers, as seen in the tables, since it relies entirely on the observed positive labeled data to make decisions.
For the real datasets, the SVM under the ideal case usually performs best, but its performance can deteriorate in the presence of outliers. The benefit of feature selection is significant in the Readmission (outlier) data because there are outliers (false positive samples) in the training sets. For datasets such as Arrhythmia and SPECTF Heart, where the number of features is not large and there may be no outliers, all algorithms except the one-class SVM tend to achieve the same performance. As before, the performance of the one-class SVM can be very sensitive to the problem type. On all datasets, OF achieves performance very similar to the ideal case, indicating that the proposed method is a powerful tool for dealing with real-world problems.
5.3 Real-world Application I: Prediction of Surgical Site Infection
PU problems are common in healthcare. Here, we study the performance of our proposed method on a prediction problem for surgical site infection (SSI). Predicting the onset of SSI from risk factors and wound characteristics is an important question. However, it is usually difficult to identify all the SSI patients, since doing so may require keeping track of patients for quite a while after surgery. It is not uncommon that the final status of many patients (whether or not they develop SSI) is unknown.
This results in a typical PU problem. Our study includes 464 subjects in total. For each subject, 37 clinical variables (wound characteristics such as the induration amount, wound edge distance, and wound edge color; physiological factors such as heart rate, diastolic RR, and systolic RR) are measured at multiple time points. To test our algorithm, we split the data into training and testing data, where the training data consist of 80 subjects, i.e., 40 positive samples (infected) and 40 negative samples (not infected). Further, for the training data, only 35 positive subjects are assumed to be known to us, and the rest of the subjects form the unlabeled sample. For every subject, we use the measurement of the first 9 days after surgery, resulting in a total number of features for each subject as . Correspondingly, group sparsity is used in our algorithm for feature selection. By employing cross-validation over a wide range of choices of the tuning parameters, we finally choose 7 groups of features and 1 outlier in the experiments. The performance of the competing algorithms is shown in Figure 2, which suggests that the proposed OF outperforms the other algorithms.
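The construction of the PU training set described above (40 positive and 40 negative training subjects, of which only 35 positives are revealed as labeled) can be sketched as follows. The function name `make_pu_split`, the random seed, and the class balance of the toy label vector are assumptions made for illustration; the source does not state how many of the 464 subjects are infected.

```python
import random


def make_pu_split(labels, n_pos, n_neg, n_labeled_pos, seed=0):
    """Build a PU training set from fully labeled data.

    Returns three sorted index lists: labeled positives, unlabeled
    training samples (hidden positives plus all training negatives),
    and test samples.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    train_pos = rng.sample(pos, n_pos)
    train_neg = rng.sample(neg, n_neg)

    # Only a subset of the training positives keeps its label.
    labeled = set(rng.sample(train_pos, n_labeled_pos))
    train = set(train_pos) | set(train_neg)

    unlabeled = sorted(train - labeled)
    test = [i for i in range(len(labels)) if i not in train]
    return sorted(labeled), unlabeled, test


if __name__ == "__main__":
    # Toy labels: 464 subjects with a made-up class balance.
    toy_labels = [1] * 100 + [0] * 364
    labeled, unlabeled, test = make_pu_split(toy_labels, 40, 40, 35)
    print(len(labeled), len(unlabeled), len(test))
```

With these settings, the unlabeled set holds the 5 hidden positives together with the 40 training negatives, and every subject outside the training set is kept for testing.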
5.4 Real-world Application II: Seizure Detection from EEG Signals
Automatic seizure detection from EEG signals is very important in seizure prevention and control. While EEG signals provide rich information that can be leveraged to build prediction models, employing domain experts to segment the massive EEG signals and assign labels to the segments is time-consuming. It is not uncommon that manual labeling can only be applied to a few segments, resulting in a typical PU problem. The EEG dataset was acquired from 8 epileptic mice (4 males and 4 females) at 10-14 weeks of age at Baylor College of Medicine. EEG recording electrodes (Teflon-coated silver wire, diameter) were placed in frontal cortex, somatosensory cortex, hippocampal CA1, and dentate gyrus. Spontaneous EEG activity (filtered between 0.1 Hz and 1 kHz, sampled at 2 kHz) was recorded in freely moving mice for 2 hours per day over 3 days. An example is shown in Figure 3.
The EEG sequence for each mouse contains 261673 continuous time points (or signals), of which 21673 are labeled as seizure. At each time point, we extract 264 features for each signal, including nonlinear energy, FFT, RMS value, zero-crossing, Hjorth parameters, and entropy (Greene et al., 2008), using a window length of 2056.
We construct the training dataset with 140 labeled positive signals and 160 unlabeled signals. The remaining signals form the testing set. By employing cross-validation over a wide range of choices of the tuning parameters, we gather the performance of the competing algorithms, as shown in Figure 4. Finally, outliers and features are selected. Most features coming from FFT are excluded. This suggests that the feature selection works as intended, because only signals at some specific frequencies matter in this case. The proposed OF outperforms the other algorithms.
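A few of the window-based features named above (RMS value, zero-crossing count, and Hjorth parameters) can be computed as in the sketch below. It assumes the standard textbook definitions of these quantities and a fixed step between windows; the function names, the `step` parameter, and the toy sine signal are illustrative choices, and the sketch does not reproduce the full 264-feature pipeline.

```python
import math
from statistics import pvariance


def rms(x):
    """Root-mean-square value of a segment."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def zero_crossings(x):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(x, x[1:]) if (a < 0) != (b < 0))


def hjorth(x):
    """Hjorth activity, mobility, and complexity of a segment.

    Assumes the segment and its first difference are not constant
    (otherwise a variance in a denominator is zero).
    """
    dx = [b - a for a, b in zip(x, x[1:])]
    ddx = [b - a for a, b in zip(dx, dx[1:])]
    activity = pvariance(x)
    mobility = math.sqrt(pvariance(dx) / activity)
    complexity = math.sqrt(pvariance(ddx) / pvariance(dx)) / mobility
    return activity, mobility, complexity


def window_features(signal, window, step):
    """Slide a window over the signal and compute one feature tuple per window."""
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        seg = signal[start:start + window]
        feats.append((rms(seg), zero_crossings(seg), *hjorth(seg)))
    return feats


if __name__ == "__main__":
    # Toy signal: four periods of a sine wave, 64 samples per period.
    toy = [math.sin(2 * math.pi * k / 64) for k in range(256)]
    for row in window_features(toy, window=128, step=64):
        print(row)
```

For a pure sinusoid the Hjorth complexity is close to 1, which gives a quick sanity check on the implementation before applying it to real recordings.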
6 Conclusion
Learning robust classifiers from positive and unlabeled data (the PU problem) is very challenging. In this paper, we propose a robust formulation that systematically addresses the challenging issues of the PU problem. We unify AUC maximization, outlier detection, and feature selection in an integrated formulation, and study its theoretical performance through generalization error bounds, which yield some important guidelines for practice. Extensive numerical studies using both synthetic data and real-world data demonstrate the superiority and efficacy of the proposed method compared with other state-of-the-art methods.
References
 C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. SIGMOD, 2001.
 G. Blanchard, G. Lee, and C. Scott. Semisupervised novelty detection. The Journal of Machine Learning Research, 2010.
 U. Brefeld and T. Scheffer. Auc maximizing support vector learning. In ICML workshop, 2005.
 M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Lof: identifying density-based local outliers. In ACM sigmod record, 2000.
 B. Calvo, P. Larrañaga, and J. A. Lozano. Learning bayesian classifiers from positive and unlabeled examples. Pattern Recognition Letters, 2007.
 F. Campbell and G. I. Allen. Within group variable selection through the exclusive lasso. In arXiv:1505.07517, 2015.
 O. Chapelle and S. S. Keerthi. Multiclass feature selection with support vector machines. In Proceedings of the American statistical association, 2008.
 C. Cortes and M. Mohri. Auc optimization vs. error rate minimization. NIPS, 2004.
 T. De Bie, L.-C. Tranchevent, L. M. Van Oeffelen, and Y. Moreau. Kernel-based data fusion for gene prioritization. Bioinformatics, 2007.
 M. C. Du Plessis and M. Sugiyama. Class prior estimation from positive and unlabeled data. IEICE TRANSACTIONS on Information and Systems, 2014.
 M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In NIPS, 2014.
 C. Elkan. The foundations of cost-sensitive learning. IJCAI, 2001.
 C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In SIGKDD, 2008.
 J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. In Biostatistics, 2008.
 B. Greene, S. Faul, W. Marnane, G. Lightbody, I. Korotchikova, and G. Boylan. A comparison of quantitative eeg features for neonatal seizure detection. Clinical Neurophysiology, 2008.
 J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 1982.
 X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. In NIPS, 2006.
 V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial intelligence review, 2004.
 C.H. Hoi, C.H. Chan, K. Huang, M. R. Lyu, and I. King. Biased support vector machine for relevance feedback in image retrieval. In IJCNN, 2004.
 J. Huang and T. Zhang. The benefit of group sparsity. In The Annals of Statistics, 2010.
 M. Hubert and K. V. Branden. Robust methods for partial least squares regression. Journal of Chemometrics, 2003.
 M.F. Jiang, S.S. Tseng, and C.M. Su. Two-phase clustering process for outliers detection. Pattern recognition letters, 2001.
 E. M. Jordaan and G. F. Smits. Robust outlier detection using svm regression. IJCNN, 2004.
 H. Kaur and S. K. Wasan. Empirical study on applications of data mining techniques in healthcare. In Journal of Computer Science, 2006.
 E. M. Knorr and R. T. Ng. Finding intensional knowledge of distance-based outliers. In VLDB, 1999.
 W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In ICML, 2003.
 W. Li, Q. Guo, and C. Elkan. A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE Transactions on Geoscience and Remote Sensing, 2011.
 X. Li and B. Liu. Learning to classify texts using positive and unlabeled data. In IJCAI, 2003.
 X. Li, P. S. Yu, and B. Liu. Positive unlabeled learning for data stream classification. In SDM, 2009.
 M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
 B. Liu, W. S. Lee, P. S. Yu, and X. L. Li. Partially supervised classification of text documents. In ICML, 2002.
 J. Liu, R. Fujimaki, and J. Ye. Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint. ICML, 2013.
 L. M. Manevitz and M. Yousef. One-class svms for document classification. Journal of machine Learning research, 2001.
 S. J. Mason and N. E. Graham. Areas beneath the relative operating characteristics (roc) and relative operating levels (rol) curves: Statistical significance and interpretation. Quarterly Journal of the Royal Meteorological Society, 2002.
 F. Mordelet and J.P. Vert. A bagging svm to learn from positive and unlabeled examples. Pattern Recognition Letters, 2014.
 M. M. Moya and D. R. Hush. Network constraints and multi-objective optimization for one-class classification. Neural Networks, 1996.
 A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, 2004.
 M. N. Nguyen, X. L. Li, and S. K. Ng. Positive unlabeled leaning for time series classification. In IJCAI, 2011.
 N. Nguyen, D. Needell, and T. Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. arXiv:1407.0088, 2014.
 G. Obozinski, B. Taskar, and M. Jordan. Multi-task feature selection. Statistics Department, UC Berkeley, Tech. Rep, 2006.
 K. Pelckmans and J. A. Suykens. Transductively learning from positive examples only. In ESANN, 2009.
 Y. T. Peng, C. Y. Lin, M. T. Sun, and K. C. Tsai. Healthcare audio event classification using hidden markov models and hierarchical hidden markov models. In ICME, 2009.
 M. D. Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
 A. Rakotomamonjy. Optimizing area under roc curve with svms. In ROCAI, 2004.
 C. Scott and G. Blanchard. Novelty detection: Unlabeled data definitely help. In AISTATS, 2009.
 A. Smola, L. Song, and C. H. Teo. Relative novelty detection. In AISTATS, 2009.
 R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 1996.
 J. A. Tropp. Greed is good: Algorithmic results for sparse approximation. Information Theory, IEEE Transactions on, 2004.
 V. Vapnik. Estimation of dependences based on empirical data. Springer Science & Business Media, 2006.
 B. Walczak. Outlier detection in multivariate calibration. Chemometrics and intelligent laboratory systems, 1995.
 J. Xu, F. Nie, and J. Han. Feature selection via scaling factor integrated multiclass support vector machines. In IJCAI, 2017.
 K. Yamanishi and J.i. Takeuchi. Discovering outlier filtering rules from unlabeled data: combining a supervised learner with an unsupervised learner. In SIGKDD, 2001.
 K. Yamanishi, J.I. Takeuchi, G. Williams, and P. Milne. Online unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 2004.
 P. Yang, X. L. Li, J. P. Mei, C. K. Kwoh, and S. K. Ng. Positive-unlabeled learning for disease gene identification. Bioinformatics, 2012.
 C. Yao, Y.F. Liu, B. Jiang, J. Han, and J. Han. Lle score: a new filter-based unsupervised feature selection method based on nonlinear manifold embedding and its application to image recognition. IEEE Transactions on Image Processing, 2017.
 H. Yu, J. Han, and K. C. Chang. Pebl: positive example based learning for web page classification using svm. In SIGKDD, 2002.
 X.T. Yuan, P. Li, and T. Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. ICML, 2014.
 S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, and D. N. Metaxas. Automatic image annotation using group sparsity. CVPR, 2010.
 T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In NIPS, 2009.
 T. Zhang. Sparse recovery with orthogonal matching pursuit under rip. Information Theory, IEEE Transactions on, 2011.
 H. Zou. The adaptive lasso and its oracle properties. In Journal of the American statistical association, 2006.