# Classification from Positive, Unlabeled and Biased Negative Data

Yu-Guan Hsieh
yu-guan.hsieh@ens.fr
Work done as an intern at RIKEN Center for Advanced Intelligence Project.
Gang Niu
RIKEN Center for Advanced Intelligence Project
gang.niu@riken.jp
Masashi Sugiyama
RIKEN Center for Advanced Intelligence Project
& The University of Tokyo
sugi@k.u-tokyo.ac.jp
###### Abstract

Positive-unlabeled (PU) learning addresses the problem of learning a binary classifier from positive (P) and unlabeled (U) data. It is often applied to situations where negative (N) data are difficult to label fully. However, collecting a non-representative N set that contains only a small portion of all possible N data can be much easier in many practical situations. This paper studies a novel classification framework which incorporates such biased N (bN) data in PU learning. The fact that the training N data are biased also makes our work very different from standard semi-supervised learning. We provide an empirical risk minimization-based method to address this PUbN classification problem. Our approach can be regarded as a variant of traditional example-reweighting algorithms, with the weight of each example computed through a preliminary step that draws inspiration from PU learning. We also derive an estimation error bound for the proposed method. Experimental results demonstrate the effectiveness of our algorithm in not only PUbN learning scenarios but also ordinary PU learning scenarios on several benchmark datasets.

## 1 Introduction

In conventional binary classification, examples are labeled as either positive (P) or negative (N), and we train a classifier on these labeled examples. In contrast, positive-unlabeled (PU) learning addresses the problem of learning a classifier from P and unlabeled (U) data, without the need to explicitly identify N data (Elkan2008LearningCF; Ward2009PresenceonlyDA).

PU learning finds its usefulness in many real-world problems. For example, in one-class remote sensing classification (li2011positive), we seek to extract a specific land-cover class from an image. While it is easy to label examples of this specific land-cover class of interest, examples not belonging to this class are too diverse to be exhaustively annotated. The same problem arises in text classification, as it is difficult or even impossible to compile a set of N samples that provides a comprehensive characterization of everything that is not in the P class (liu2003building; fung2006text). Besides, PU learning has also been applied to other domains such as outlier detection (hido2008inlier; scott2009novelty), medical diagnosis (zu2011learn), or time series classification (nguyen2011positive).

By carefully examining the above examples, we find that the most difficult step is often to collect a fully representative N set, whereas labeling only a small portion of all possible N data is relatively easy. Therefore, in this paper, we propose to study the problem of learning from P, U and biased N (bN) data, which we name PUbN learning hereinafter. We suppose that in addition to P and U data, we also gather a set of bN samples, governed by a distribution distinct from the true N distribution. As described previously, this can be viewed as an extension of PU learning, but such bias may also occur naturally in some real-world scenarios. For instance, let us presume that we would like to judge whether a subject is affected by a particular disease based on the result of a physical examination. While the data collected from the patients represent the P distribution rather well, healthy subjects that request the examination are in general highly biased with respect to the whole healthy subject population.

We are not the first to be interested in learning with bN data. In fact, both li2010neg and fei2015social attempted to solve similar problems in the context of text classification. li2010neg simply discarded negative samples and performed ordinary PU classification. It was also mentioned in the paper that bN data could be harmful. fei2015social adopted another strategy. The authors considered that even gathering unbiased U data is difficult and learned the classifier from only P and bN data. However, their method is specific to text classification because it relies on effective similarity measures between documents. Therefore, our work differs from these two in that the classifier is trained simultaneously on P, U and bN data, without resorting to domain-specific knowledge. The presence of U data allows us to address the problem from a statistical viewpoint, and thus the proposed method can be applied to any PUbN learning problem in principle.

In this paper, we develop an empirical risk minimization-based algorithm that combines both PU learning and importance weighting to solve the PUbN classification problem. We first estimate the probability that an example is sampled into the P or the bN set. Based on this estimate, we regard bN and U data as N examples with instance-dependent weights. In particular, we assign larger weights to U examples that we believe to appear less often in the P and bN sets. P data are treated as P examples with unit weight but also as N examples with usually small or zero weight whose actual value depends on the same estimate.

The contributions of the paper are three-fold:

1. We formulate the PUbN learning problem as an extension of PU learning and propose an empirical risk minimization-based method to address the problem. We also theoretically establish an estimation error bound for the proposed method.

2. We experimentally demonstrate that the classification performance can be effectively improved thanks to the use of bN data during training. In other words, PUbN learning yields better performance than PU learning.

3. Our method can be easily adapted to ordinary PU learning. Experimentally we show that the resulting algorithm allows us to obtain new state-of-the-art results on several PU learning tasks.

Relation with Semi-supervised Learning. With P, N and U data available for training, our problem setup may seem similar to that of semi-supervised learning (Chapelle:2010:SL:1841234; oliver2018realistic). Nonetheless, in our case, N data are biased and often represent only a small portion of the whole N distribution. Therefore, most of the existing methods designed for the latter cannot be directly applied to the PUbN classification problem. Furthermore, our focus is on deducing a risk estimator using the three sets of data, whereas in semi-supervised learning the main concern is often how U data can be utilized for regularization (grandvalet2005semi; belkin2006manifold; laine2017temporal; miyato2016distributional). The two should be compatible and we believe adding such regularization to our algorithm can be beneficial in many cases.

Relation with Dataset Shift. PUbN learning can also be viewed as a special case of dataset shift¹ (quionero2009dataset) if we consider that P and bN data are drawn from the training distribution while U data are drawn from the test distribution. Covariate shift (shimodaira2000improving; book:Sugiyama+Kawanabe:2012) is another special case of dataset shift that has been studied intensively. In the covariate shift problem setting, training and test distributions have the same class conditional distribution and only differ in the marginal distribution of the independent variable. One popular approach to tackle this problem is to reweight each training example according to the ratio of the test density to the training density (huang2007correcting; sugiyama2008direct). Nevertheless, simply training a classifier on a reweighted version of the labeled set is not sufficient in our case since there may be examples with zero probability of being labeled. It is also important to notice that the problem of PUbN learning is intrinsically different from that of covariate shift and neither of the two is a special case of the other.

¹ Dataset shift refers to any case where training and test distributions differ. The term sample selection bias (heckman1979sample; zadrozny2004learning) is sometimes used to describe the same thing. However, strictly speaking, sample selection bias actually refers to the case where training instances are first drawn from the test distribution and then a subset of these data is systematically discarded due to a particular mechanism.

## 2 Problem Setting

In this section, we briefly review the formulations of PN, PU and PNU classification and introduce the problem of learning from P, U and bN data.

### 2.1 Standard Binary Classification

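Let $x \in \mathbb{R}^d$ and $y \in \{+1, -1\}$ be random variables following an unknown probability distribution with density $p(x, y)$. Let $g: \mathbb{R}^d \to \mathbb{R}$ be an arbitrary decision function for binary classification and $\ell: \mathbb{R} \to \mathbb{R}_+$ be a loss function of margin $y g(x)$ that usually takes a small value for a large margin. The goal of binary classification is to find $g$ that minimizes the classification risk:

$$R(g) = \mathbb{E}_{(x,y)\sim p(x,y)}\left[\ell(y g(x))\right], \tag{1}$$

where $\mathbb{E}_{(x,y)\sim p(x,y)}[\cdot]$ denotes the expectation over the joint distribution $p(x, y)$. When we care about classification accuracy, $\ell$ is the zero-one loss $\ell_{01}(z) = (1 - \mathrm{sign}(z))/2$. However, for ease of optimization, $\ell_{01}$ is often substituted with a surrogate loss such as the sigmoid loss $\ell_{\mathrm{sig}}(z) = 1/(1 + \exp(z))$ or the logistic loss $\ell_{\log}(z) = \ln(1 + \exp(-z))$ during learning.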
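The three losses mentioned above can be sketched in a few lines of Python. This is only an illustration under the assumption that the standard definitions apply, namely $(1 - \mathrm{sign}(z))/2$ for the zero-one loss, $1/(1 + \exp(z))$ for the sigmoid loss and $\ln(1 + \exp(-z))$ for the logistic loss; the function names are hypothetical and do not come from the paper.

```python
import math


def zero_one_loss(z):
    """Zero-one loss on the margin z = y * g(x), i.e. (1 - sign(z)) / 2."""
    sign = (z > 0) - (z < 0)  # sign(0) = 0, so the loss is 0.5 on the boundary
    return (1 - sign) / 2


def sigmoid_loss(z):
    """Sigmoid loss 1 / (1 + exp(z)), evaluated without overflow."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def logistic_loss(z):
    """Logistic loss ln(1 + exp(-z)), evaluated without overflow."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

Both surrogates decrease as the margin grows, which is what makes them usable in place of the zero-one loss during gradient-based training.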

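In standard supervised learning scenarios (PN classification), we are given P and N data that are sampled independently from $p_{\mathrm{P}}(x) = p(x \mid y=+1)$ and $p_{\mathrm{N}}(x) = p(x \mid y=-1)$ as $\mathcal{X}_{\mathrm{P}} = \{x_i^{\mathrm{P}}\}_{i=1}^{n_{\mathrm{P}}}$ and $\mathcal{X}_{\mathrm{N}} = \{x_i^{\mathrm{N}}\}_{i=1}^{n_{\mathrm{N}}}$. Let us denote by $R^+_{\mathrm{P}}(g) = \mathbb{E}_{x\sim p_{\mathrm{P}}(x)}[\ell(g(x))]$, $R^-_{\mathrm{N}}(g) = \mathbb{E}_{x\sim p_{\mathrm{N}}(x)}[\ell(-g(x))]$ partial risks and $\pi = p(y=+1)$ the P prior. We have the equality $R(g) = \pi R^+_{\mathrm{P}}(g) + (1-\pi) R^-_{\mathrm{N}}(g)$. The classification risk (1) can then be empirically approximated from data by

$$\hat{R}_{\mathrm{PN}}(g) = \pi \hat{R}^+_{\mathrm{P}}(g) + (1-\pi)\hat{R}^-_{\mathrm{N}}(g),$$

where $\hat{R}^+_{\mathrm{P}}(g) = \frac{1}{n_{\mathrm{P}}}\sum_{i=1}^{n_{\mathrm{P}}} \ell(g(x_i^{\mathrm{P}}))$ and $\hat{R}^-_{\mathrm{N}}(g) = \frac{1}{n_{\mathrm{N}}}\sum_{i=1}^{n_{\mathrm{N}}} \ell(-g(x_i^{\mathrm{N}}))$. By minimizing $\hat{R}_{\mathrm{PN}}$ we obtain the ordinary empirical risk minimizer $\hat{g}_{\mathrm{PN}}$.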

### 2.2 PU Classification

In PU classification, instead of N data we only have access to a set of U samples $\mathcal{X}_{\mathrm{U}} = \{x_i^{\mathrm{U}}\}_{i=1}^{n_{\mathrm{U}}}$ drawn from the marginal density $p(x)$. Several effective algorithms have been designed to address this problem. liu2002partially proposed the S-EM approach that first identifies reliable N data in the U set and then runs the Expectation-Maximization (EM) algorithm to build the final classifier. The biased support vector machine (Biased SVM) introduced in liu2003building regards U samples as N samples with smaller weights. mordelet2014bagging solved the PU problem by aggregating classifiers trained to discriminate P data from a small random subsample of U data.

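More recently, attention has been paid to the unbiased risk estimator proposed in du2014analysis and du2015convex. The key idea is to use the following equality:

$$(1-\pi) R^-_{\mathrm{N}}(g) = R^-_{\mathrm{U}}(g) - \pi R^-_{\mathrm{P}}(g),$$

where $R^-_{\mathrm{U}}(g) = \mathbb{E}_{x\sim p(x)}[\ell(-g(x))]$ and $R^-_{\mathrm{P}}(g) = \mathbb{E}_{x\sim p_{\mathrm{P}}(x)}[\ell(-g(x))]$. This equality is acquired by exploiting the fact $p(x) = \pi p_{\mathrm{P}}(x) + (1-\pi) p_{\mathrm{N}}(x)$. As a result, we can approximate the classification risk (1) by

$$\hat{R}_{\mathrm{PU}}(g) = \pi \hat{R}^+_{\mathrm{P}}(g) - \pi \hat{R}^-_{\mathrm{P}}(g) + \hat{R}^-_{\mathrm{U}}(g), \tag{2}$$

where $\hat{R}^-_{\mathrm{P}}(g) = \frac{1}{n_{\mathrm{P}}}\sum_{i=1}^{n_{\mathrm{P}}} \ell(-g(x_i^{\mathrm{P}}))$ and $\hat{R}^-_{\mathrm{U}}(g) = \frac{1}{n_{\mathrm{U}}}\sum_{i=1}^{n_{\mathrm{U}}} \ell(-g(x_i^{\mathrm{U}}))$. We then minimize $\hat{R}_{\mathrm{PU}}$ to obtain another empirical risk minimizer $\hat{g}_{\mathrm{PU}}$. Note that as the loss is always positive, the classification risk (1) that $\hat{R}_{\mathrm{PU}}(g)$ approximates is also positive. However, kiryo2017positive pointed out that when the model of $g$ is too flexible, that is, when the function class $\mathcal{G}$ is too large, $\hat{R}_{\mathrm{PU}}(\hat{g}_{\mathrm{PU}})$ indeed goes negative and the model seriously overfits the training data. To alleviate overfitting, the authors observed that $R^-_{\mathrm{U}}(g) - \pi R^-_{\mathrm{P}}(g) = (1-\pi) R^-_{\mathrm{N}}(g) \ge 0$ and proposed the non-negative risk estimator for PU learning:

$$\tilde{R}_{\mathrm{PU}}(g) = \pi \hat{R}^+_{\mathrm{P}}(g) + \max\left\{0,\ \hat{R}^-_{\mathrm{U}}(g) - \pi \hat{R}^-_{\mathrm{P}}(g)\right\}. \tag{3}$$

In terms of implementation, stochastic optimization was used and when $r = \hat{R}^-_{\mathrm{U}}(g) - \pi \hat{R}^-_{\mathrm{P}}(g)$ becomes negative for a mini-batch, they performed a step of gradient ascent along $\nabla r$ to make the mini-batch less overfitted.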
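To make the difference between estimators (2) and (3) concrete, here is a minimal Python sketch that evaluates both on lists of classifier outputs. It assumes the sigmoid loss and takes the raw outputs $g(x)$ on the P and U samples as inputs; the helper names `mean` and `pu_risks` are hypothetical, and the gradient-ascent step used during training is not reproduced.

```python
import math


def sigmoid_loss(z):
    """Sigmoid loss 1 / (1 + exp(z)), evaluated without overflow."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pu_risks(g_p, g_u, prior, loss=sigmoid_loss):
    """Return (unbiased, non-negative) PU risk estimates, Eqs. (2) and (3).

    g_p, g_u: classifier outputs g(x) on the P and U samples.
    prior: class prior pi = p(y = +1), assumed known.
    """
    r_p_plus = mean(loss(v) for v in g_p)    # P samples treated as positive
    r_p_minus = mean(loss(-v) for v in g_p)  # P samples treated as negative
    r_u_minus = mean(loss(-v) for v in g_u)  # U samples treated as negative
    negative_part = r_u_minus - prior * r_p_minus
    unbiased = prior * r_p_plus + negative_part
    non_negative = prior * r_p_plus + max(0.0, negative_part)
    return unbiased, non_negative
```

When a flexible model pushes every U sample far into the negative side, `negative_part` drops below zero and only the clipped estimator stays non-negative, which is the overfitting symptom described above.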

### 2.3 PNU Classification

In semi-supervised learning (PNU classification), P, N and U data are all available. An abundance of works have been dedicated to solving this problem. Here we in particular introduce the PNU risk estimator proposed in sakai2016semi. By directly leveraging U data for risk estimation, it is the most comparable to our method. The PNU risk is simply defined as a linear combination of PN and PU/NU risks. Let us just consider the case where PN and PU risks are combined, then for some $\gamma \in (0, 1)$, the PNU risk estimator is expressed as

$$\begin{aligned}
\hat{R}^{\gamma}_{\mathrm{PNU}}(g) &= \gamma \hat{R}_{\mathrm{PN}}(g) + (1-\gamma) \hat{R}_{\mathrm{PU}}(g) \\
&= \pi \hat{R}^+_{\mathrm{P}}(g) + \gamma(1-\pi) \hat{R}^-_{\mathrm{N}}(g) + (1-\gamma)\left(\hat{R}^-_{\mathrm{U}}(g) - \pi \hat{R}^-_{\mathrm{P}}(g)\right).
\end{aligned} \tag{4}$$

We can again consider the non-negative correction by forcing the term $\gamma(1-\pi) \hat{R}^-_{\mathrm{N}}(g) + (1-\gamma)(\hat{R}^-_{\mathrm{U}}(g) - \pi \hat{R}^-_{\mathrm{P}}(g))$ to be non-negative. In the rest of the paper, we refer to the resulting algorithm as non-negative PNU (nnPNU) learning (see Appendix D.3 for an alternative definition of nnPNU and the corresponding results).
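The combination in Equation (4) can be sketched as follows. The snippet is a hedged illustration rather than the authors' implementation: it assumes the sigmoid loss, applies the non-negative correction to the whole N-related term as described above, and uses hypothetical names (`pnu_risk`, `mean`).

```python
import math


def sigmoid_loss(z):
    """Sigmoid loss 1 / (1 + exp(z)), evaluated without overflow."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pnu_risk(g_p, g_n, g_u, prior, gamma, loss=sigmoid_loss, non_negative=True):
    """PNU risk of Eq. (4) from classifier outputs on P, N and U samples."""
    r_p_plus = mean(loss(v) for v in g_p)
    r_p_minus = mean(loss(-v) for v in g_p)
    r_n_minus = mean(loss(-v) for v in g_n)
    r_u_minus = mean(loss(-v) for v in g_u)
    # Term that estimates (1 - prior) * R_N^-(g) from N data and from P/U data.
    negative_part = (gamma * (1.0 - prior) * r_n_minus
                     + (1.0 - gamma) * (r_u_minus - prior * r_p_minus))
    if non_negative:
        negative_part = max(0.0, negative_part)
    return prior * r_p_plus + negative_part
```

Setting `gamma=1` recovers the PN risk and `gamma=0` recovers the (non-negative) PU risk, so `gamma` interpolates between trusting labeled N data and trusting U data.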

### 2.4 PUbN Classification

In this paper, we study the problem of PUbN learning. It differs from usual semi-supervised learning in the fact that labeled N data are not fully representative of the underlying N distribution $p_{\mathrm{N}}(x)$. To take this point into account, we introduce a latent random variable $s$ and consider the joint distribution $p(x, y, s)$ with constraint $p(s=+1 \mid x, y=+1) = 1$. Equivalently, $p(y=-1 \mid x, s=-1) = 1$. Let $\rho = p(y=-1, s=+1)$. Both $\pi$ and $\rho$ are assumed known throughout the paper. In practice they often need to be estimated from data (jain2016estimating; ramaswamy2016mixture; Plessis:2017:CEL:3085961.3085999). In place of ordinary N data we collect a set of bN samples

$$\mathcal{X}_{\mathrm{bN}} = \{x_i^{\mathrm{bN}}\}_{i=1}^{n_{\mathrm{bN}}} \sim p(x \mid y=-1, s=+1).$$

The goal remains the same: we would like to minimize the classification risk $R(g)$.

## 3 Method

In this section, we propose a risk estimator for PUbN classification and establish an estimation error bound for the proposed method. Finally we show how our method can be applied to PU learning as a special case when no bN data are available.

### 3.1 Risk Estimator

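Let $R^-_{\mathrm{bN}}(g) = \mathbb{E}_{x\sim p(x \mid y=-1, s=+1)}[\ell(-g(x))]$ and $R^-_{s=-1}(g) = \mathbb{E}_{x\sim p(x \mid s=-1)}[\ell(-g(x))]$. Since $p(x, y=-1) = p(x, y=-1, s=+1) + p(x, s=-1)$, we have

$$R(g) = \pi R^+_{\mathrm{P}}(g) + \rho R^-_{\mathrm{bN}}(g) + (1-\pi-\rho) R^-_{s=-1}(g). \tag{5}$$

The first two terms on the right-hand side of the equation can be approximated directly from data by writing $\hat{R}^+_{\mathrm{P}}(g)$ and $\hat{R}^-_{\mathrm{bN}}(g) = \frac{1}{n_{\mathrm{bN}}}\sum_{i=1}^{n_{\mathrm{bN}}} \ell(-g(x_i^{\mathrm{bN}}))$. We therefore focus on the third term $\bar{R}^-_{s=-1}(g) = (1-\pi-\rho) R^-_{s=-1}(g)$. Our approach is mainly based on the following theorem. We relegate all proofs to the appendix.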

###### Theorem 1.

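Let $\sigma(x) = p(s=+1 \mid x)$. For all $\eta \in [0, 1]$ and $h: \mathbb{R}^d \to [0, 1]$ satisfying the condition $h(x) > \eta \Rightarrow \sigma(x) > 0$, the risk $\bar{R}^-_{s=-1}(g)$ can be expressed as

$$\begin{aligned}
\bar{R}^-_{s=-1}(g) ={}& \mathbb{E}_{x\sim p(x)}\left[\mathbb{1}_{h(x)\le\eta}\,\ell(-g(x))\,(1-\sigma(x))\right] \\
&+ \pi\,\mathbb{E}_{x\sim p(x\mid y=+1)}\left[\mathbb{1}_{h(x)>\eta}\,\ell(-g(x))\,\frac{1-\sigma(x)}{\sigma(x)}\right] \\
&+ \rho\,\mathbb{E}_{x\sim p(x\mid s=+1,\,y=-1)}\left[\mathbb{1}_{h(x)>\eta}\,\ell(-g(x))\,\frac{1-\sigma(x)}{\sigma(x)}\right].
\end{aligned} \tag{6}$$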

In the theorem, $\bar{R}^-_{s=-1}(g)$ is decomposed into three terms, and when the expectation is substituted with the average over training samples, these three terms are approximated respectively using data from $\mathcal{X}_{\mathrm{U}}$, $\mathcal{X}_{\mathrm{P}}$ and $\mathcal{X}_{\mathrm{bN}}$. The choice of $h$ and $\eta$ is thus crucial because it determines what each of the three terms tries to capture in practice. Ideally, we would like $h$ to be an approximation of $\sigma$. Then, for $x$ such that $h(x)$ is close to 1, $\sigma(x)$ is close to 1, so the last two terms on the right-hand side of the equation can be reasonably evaluated using $\mathcal{X}_{\mathrm{P}}$ and $\mathcal{X}_{\mathrm{bN}}$ (i.e., samples drawn from $p(x \mid s=+1)$). Conversely, if $h(x)$ is small, $\sigma(x)$ is small and such samples can hardly be found in $\mathcal{X}_{\mathrm{P}}$ or $\mathcal{X}_{\mathrm{bN}}$. Consequently the first term appearing in the decomposition is approximated with the help of $\mathcal{X}_{\mathrm{U}}$. Finally, in the empirical risk minimization paradigm, $\eta$ becomes a hyperparameter that controls how important U data is against P and bN data when we evaluate $\bar{R}^-_{s=-1}(g)$. The larger $\eta$ is, the more attention we would pay to U data.

One may be curious about why we do not simply approximate the whole risk using only U samples, that is, set $\eta$ to 1. There are two main reasons. On the one hand, if we have a very small U set, which means $n_{\mathrm{U}} \ll n_{\mathrm{P}}$ and $n_{\mathrm{U}} \ll n_{\mathrm{bN}}$, approximating a part of the risk with labeled samples should help us reduce the estimation error. This may seem unrealistic but sometimes unbiased U samples can also be difficult to collect (takashi2018binary). On the other hand, more importantly, we have empirically observed that when the model of $g$ is highly flexible, even a sample regarded as N with small weight gets classified as N in the latter stage of training and performance of the resulting classifier can thus be severely degraded. Introducing $\eta$ alleviates this problem by avoiding treating all U data as N samples.

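As $\sigma$ is not available in reality, we propose to replace $\sigma$ by its estimate $\hat\sigma$ in (6). We further substitute $h$ with the same estimate and obtain the following expression:

$$\begin{aligned}
\bar{R}^-_{s=-1,\eta,\hat\sigma}(g) ={}& \mathbb{E}_{x\sim p(x)}\left[\mathbb{1}_{\hat\sigma(x)\le\eta}\,\ell(-g(x))\,(1-\hat\sigma(x))\right] \\
&+ \pi\,\mathbb{E}_{x\sim p(x\mid y=+1)}\left[\mathbb{1}_{\hat\sigma(x)>\eta}\,\ell(-g(x))\,\frac{1-\hat\sigma(x)}{\hat\sigma(x)}\right] \\
&+ \rho\,\mathbb{E}_{x\sim p(x\mid s=+1,\,y=-1)}\left[\mathbb{1}_{\hat\sigma(x)>\eta}\,\ell(-g(x))\,\frac{1-\hat\sigma(x)}{\hat\sigma(x)}\right].
\end{aligned}$$

We notice that $\bar{R}^-_{s=-1,\eta,\hat\sigma}$ depends both on $\eta$ and $\hat\sigma$. It can be directly approximated from data by

$$\begin{aligned}
\hat{\bar{R}}^-_{s=-1,\eta,\hat\sigma}(g) ={}& \frac{1}{n_{\mathrm{U}}}\sum_{i=1}^{n_{\mathrm{U}}}\left[\mathbb{1}_{\hat\sigma(x_i^{\mathrm{U}})\le\eta}\,\ell(-g(x_i^{\mathrm{U}}))\,(1-\hat\sigma(x_i^{\mathrm{U}}))\right] \\
&+ \frac{\pi}{n_{\mathrm{P}}}\sum_{i=1}^{n_{\mathrm{P}}}\left[\mathbb{1}_{\hat\sigma(x_i^{\mathrm{P}})>\eta}\,\ell(-g(x_i^{\mathrm{P}}))\,\frac{1-\hat\sigma(x_i^{\mathrm{P}})}{\hat\sigma(x_i^{\mathrm{P}})}\right] \\
&+ \frac{\rho}{n_{\mathrm{bN}}}\sum_{i=1}^{n_{\mathrm{bN}}}\left[\mathbb{1}_{\hat\sigma(x_i^{\mathrm{bN}})>\eta}\,\ell(-g(x_i^{\mathrm{bN}}))\,\frac{1-\hat\sigma(x_i^{\mathrm{bN}})}{\hat\sigma(x_i^{\mathrm{bN}})}\right].
\end{aligned}$$

We are now able to derive the empirical version of Equation (5) as

$$\hat{R}_{\mathrm{PUbN},\eta,\hat\sigma}(g) = \pi \hat{R}^+_{\mathrm{P}}(g) + \rho \hat{R}^-_{\mathrm{bN}}(g) + \hat{\bar{R}}^-_{s=-1,\eta,\hat\sigma}(g). \tag{7}$$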
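The empirical risk (7) can be sketched in plain Python as below. This is a minimal illustration, not the authors' code: it assumes the sigmoid loss, takes precomputed classifier outputs $g(x)$ and estimates $\hat\sigma(x)$ for each of the three sample sets, and uses hypothetical names (`pubn_risk`, `weighted_tail`).

```python
import math


def sigmoid_loss(z):
    """Sigmoid loss 1 / (1 + exp(z)), evaluated without overflow."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pubn_risk(g_p, g_bn, g_u, s_p, s_bn, s_u, prior, rho, eta, loss=sigmoid_loss):
    """Empirical PUbN risk of Eq. (7).

    g_*: classifier outputs g(x) on the P, bN and U samples.
    s_*: estimates sigma_hat(x) of p(s = +1 | x) on the same samples.
    eta is assumed to lie in [0, 1], so sigma_hat > eta implies sigma_hat > 0.
    """
    def weighted_tail(g_vals, s_vals):
        # Labeled samples with sigma_hat > eta also count as negatives,
        # with importance weight (1 - sigma_hat) / sigma_hat.
        return mean(loss(-g) * (1.0 - s) / s if s > eta else 0.0
                    for g, s in zip(g_vals, s_vals))

    r_p_plus = mean(loss(g) for g in g_p)
    r_bn_minus = mean(loss(-g) for g in g_bn)
    # U samples with sigma_hat <= eta count as negatives with weight 1 - sigma_hat.
    u_term = mean(loss(-g) * (1.0 - s) if s <= eta else 0.0
                  for g, s in zip(g_u, s_u))
    s_minus = u_term + prior * weighted_tail(g_p, s_p) + rho * weighted_tail(g_bn, s_bn)
    return prior * r_p_plus + rho * r_bn_minus + s_minus
```

With `eta=1.0` the two labeled correction terms vanish and the third term of (5) is estimated from U data alone, which matches the discussion of why intermediate values of $\eta$ are preferred.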

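Estimating $\sigma$. If we regard $s$ as a class label, the problem of estimating $\sigma$ is then equivalent to training a probabilistic classifier separating the classes with $s=+1$ and $s=-1$. Observing that $\int \ell(\epsilon g(x))\, p(x, s=+1)\, dx = \pi\,\mathbb{E}_{x\sim p(x\mid y=+1)}[\ell(\epsilon g(x))] + \rho\,\mathbb{E}_{x\sim p(x\mid y=-1, s=+1)}[\ell(\epsilon g(x))]$ for $\epsilon \in \{+1, -1\}$, it is straightforward to apply nnPU learning with availability of $\mathcal{X}_{\mathrm{P}}$, $\mathcal{X}_{\mathrm{bN}}$ and $\mathcal{X}_{\mathrm{U}}$ to minimize $\mathbb{E}_{(x,s)\sim p(x,s)}[\ell(s g(x))]$. In other words, here we regard $\mathcal{X}_{\mathrm{P}}$ and $\mathcal{X}_{\mathrm{bN}}$ as P and $\mathcal{X}_{\mathrm{U}}$ as U, and attempt to solve a PU learning problem by applying nnPU. Since we are interested in the class-posterior probabilities, we minimize the risk with respect to the logistic loss and apply the sigmoid function to the output of the model to get $\hat\sigma(x)$. However, the above risk estimator accepts any reasonable $\hat\sigma$ and we are not limited to using nnPU for computing $\hat\sigma$. For example, the least-squares fitting approach proposed in Kanamori2009ALA for direct density ratio estimation can also be adapted to solving the problem.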
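The preliminary step described above can be sketched as an nnPU-style objective in which the P and bN sets jointly play the role of the positive class for the label $s$, with total mass $\pi + \rho$. The snippet below is a hedged illustration under that reading; `sigma_objective` and `sigma_hat` are hypothetical names, `f` denotes the raw output of the auxiliary model, and no optimizer is included.

```python
import math


def logistic_loss(z):
    """Logistic loss ln(1 + exp(-z)), evaluated without overflow."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def sigma_objective(f_p, f_bn, f_u, prior, rho):
    """nnPU-style objective for the label s, from model outputs f(x)."""
    def labeled_risk(eps):
        # Estimates the integral of loss(eps * f(x)) against p(x, s = +1),
        # split into its P part (mass prior) and its bN part (mass rho).
        return (prior * mean(logistic_loss(eps * f) for f in f_p)
                + rho * mean(logistic_loss(eps * f) for f in f_bn))

    unlabeled_risk = mean(logistic_loss(-f) for f in f_u)
    # Non-negative correction on the estimated s = -1 part of the risk.
    return labeled_risk(+1) + max(0.0, unlabeled_risk - labeled_risk(-1))


def sigma_hat(f):
    """Map a model output to an estimate of p(s = +1 | x) with the sigmoid."""
    if f >= 0:
        return 1.0 / (1.0 + math.exp(-f))
    e = math.exp(f)
    return e / (1.0 + e)
```

Because the logistic loss is used, the sigmoid of the minimizer's output can be read as a class-posterior estimate, which is what the weights in the PUbN risk require.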

### 3.2 Estimation Error Bound

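Here we establish an estimation error bound for the proposed method. Let $\mathcal{G}$ be the function class from which we find a function. The Rademacher complexity of $\mathcal{G}$ for the samples of size $n$ drawn from $q(x)$ is defined as

$$\mathfrak{R}_{n,q}(\mathcal{G}) = \mathbb{E}_{\mathcal{X}\sim q^n}\,\mathbb{E}_{\theta}\left[\sup_{g\in\mathcal{G}} \frac{1}{n}\sum_{x_i\in\mathcal{X}} \theta_i\, g(x_i)\right],$$

where $\mathcal{X} = \{x_1, \ldots, x_n\}$ and $\theta = \{\theta_1, \ldots, \theta_n\}$ with each $x_i$ drawn from $q(x)$ and $\theta_i$ as a Rademacher variable (mohri2012foundations). In the following we will assume that $\mathfrak{R}_{n,q}(\mathcal{G})$ vanishes asymptotically as $n \to \infty$. This holds for most of the common choices of $\mathcal{G}$ if proper regularization is considered (bartlett2002rademacher; golowich2018size). Assume additionally the existence of $C_g > 0$ such that $\sup_{g\in\mathcal{G}} \|g\|_\infty \le C_g$ as well as $C_\ell > 0$ such that $\sup_{|z|\le C_g} \ell(z) \le C_\ell$. We also assume that $\ell$ is Lipschitz continuous on the interval $[-C_g, C_g]$ with a Lipschitz constant $L_\ell$.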

###### Theorem 2.

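Let $g^* = \arg\min_{g\in\mathcal{G}} R(g)$ be the true risk minimizer and $\hat{g}_{\mathrm{PUbN},\eta,\hat\sigma} = \arg\min_{g\in\mathcal{G}} \hat{R}_{\mathrm{PUbN},\eta,\hat\sigma}(g)$ be the PUbN empirical risk minimizer. We suppose that $\hat\sigma$ is a fixed function independent of data used to compute $\hat{R}_{\mathrm{PUbN},\eta,\hat\sigma}$ and $\eta \in (0, 1]$. Denote by $p_{\mathrm{P}}(x) = p(x \mid y=+1)$ and $p_{\mathrm{bN}}(x) = p(x \mid y=-1, s=+1)$ the P and bN marginals. Let $\zeta = p(\hat\sigma(x) \le \eta)$ and $\epsilon = \mathbb{E}_{x\sim p(x)}[|\hat\sigma(x) - \sigma(x)|^2]$. Then for any $\delta > 0$, with probability at least $1 - \delta$,

$$\begin{aligned}
R(\hat{g}_{\mathrm{PUbN},\eta,\hat\sigma}) - R(g^*) \le{}& 4 L_\ell\, \mathfrak{R}_{n_{\mathrm{U}},p}(\mathcal{G}) + \frac{4\pi L_\ell}{\eta}\, \mathfrak{R}_{n_{\mathrm{P}},p_{\mathrm{P}}}(\mathcal{G}) + \frac{4\rho L_\ell}{\eta}\, \mathfrak{R}_{n_{\mathrm{bN}},p_{\mathrm{bN}}}(\mathcal{G}) \\
&+ 2 C_\ell \sqrt{\frac{\ln(6/\delta)}{2 n_{\mathrm{U}}}} + \frac{2\pi C_\ell}{\eta} \sqrt{\frac{\ln(6/\delta)}{2 n_{\mathrm{P}}}} + \frac{2\rho C_\ell}{\eta} \sqrt{\frac{\ln(6/\delta)}{2 n_{\mathrm{bN}}}} \\
&+ 2 C_\ell \sqrt{\zeta\epsilon} + \frac{2 C_\ell}{\eta} \sqrt{(1-\zeta)\epsilon}.
\end{aligned}$$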

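Theorem 2 shows that as $n_{\mathrm{U}} \to \infty$, $n_{\mathrm{P}} \to \infty$ and $n_{\mathrm{bN}} \to \infty$, the bound on $R(\hat{g}_{\mathrm{PUbN},\eta,\hat\sigma}) - R(g^*)$ reduces to its last two terms, which are of order $\sqrt{\epsilon}$. Furthermore, if there is $C_{\mathcal{G}} > 0$ such that $\mathfrak{R}_{n,q}(\mathcal{G}) \le C_{\mathcal{G}}/\sqrt{n}$², the convergence rate is $\mathcal{O}_p(1/\sqrt{n_{\mathrm{P}}} + 1/\sqrt{n_{\mathrm{bN}}} + 1/\sqrt{n_{\mathrm{U}}})$, where $\mathcal{O}_p$ denotes the order in probability. As for $\epsilon$, knowing that $\hat\sigma$ is also estimated from data in practice³, its value depends on both the estimation algorithm and the number of samples that are involved in the estimation process. For example, in our approach we applied nnPU with the logistic loss to obtain $\hat\sigma$, so the excess risk can be written as $\mathbb{E}_{x\sim p(x)}[\mathrm{KL}(\sigma(x) \,\|\, \hat\sigma(x))]$, where by abuse of notation $\mathrm{KL}(a \,\|\, b) = a\ln(a/b) + (1-a)\ln((1-a)/(1-b))$ denotes the KL divergence between two Bernoulli distributions with parameters respectively $a$ and $b$. It is known that $\epsilon \le \frac{1}{2}\,\mathbb{E}_{x\sim p(x)}[\mathrm{KL}(\sigma(x) \,\|\, \hat\sigma(x))]$ (zhang2004statistical). The excess risk itself can be decomposed into the sum of the estimation error and the approximation error. kiryo2017positive showed that under mild assumptions the estimation error part converges to zero when the sample size increases to infinity in nnPU learning. It is however impossible to get rid of the approximation error part which is fixed once we fix the function class $\mathcal{G}$. To circumvent this problem, we can either resort to kernel-based methods with universal kernels (zhang2004statistical) or simply enlarge the function class when we get more samples.

² For instance, this holds for linear-in-parameter model class $\mathcal{G} = \{g(x) = w^\top \phi(x) \mid \|w\| \le C_w, \|\phi\|_\infty \le C_\phi\}$, where $C_w$ and $C_\phi$ are positive constants (mohri2012foundations).

³ These data, according to Theorem 2, must be different from those used to evaluate $\hat{R}_{\mathrm{PUbN},\eta,\hat\sigma}(g)$. This condition is however violated in most of our experiments. See Appendix D.2 for more discussion.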
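The link between the squared error of $\hat\sigma$ and the KL-based excess risk can be checked numerically. The sketch below assumes the inequality in question is the pointwise Pinsker-type bound $(a - b)^2 \le \frac{1}{2}\mathrm{KL}(a \,\|\, b)$ for Bernoulli parameters $a$ and $b$, and verifies it on a grid; `bernoulli_kl` and `pinsker_gap` are hypothetical names introduced for this illustration.

```python
import math


def bernoulli_kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), with 0 * ln(0) = 0."""
    total = 0.0
    for p, q in ((a, b), (1.0 - a, 1.0 - b)):
        if p > 0.0:
            total += p * math.log(p / q)
    return total


def pinsker_gap(grid):
    """Smallest value of KL(a || b) / 2 - (a - b)^2 over all pairs in the grid."""
    return min(0.5 * bernoulli_kl(a, b) - (a - b) ** 2
               for a in grid for b in grid)


# Parameters strictly inside (0, 1), so every logarithm is finite.
grid = [i / 20 for i in range(1, 20)]
gap = pinsker_gap(grid)
```

A non-negative `gap` on the grid is consistent with the bound; taking expectations over $x$ on both sides then relates $\epsilon$ to the excess risk of the logistic-loss classifier.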

### 3.3 PU Learning Revisited

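In PU learning scenarios, we only have P and U data and bN data are not available. Nevertheless, if we let $y$ play the role of $s$ and ignore all the terms related to bN data, our algorithm is naturally applicable to PU learning. Let us name the resulting algorithm PUbN\N, then

$$\hat{R}_{\mathrm{PUbN}\setminus\mathrm{N},\eta,\hat\sigma}(g) = \pi \hat{R}^+_{\mathrm{P}}(g) + \hat{\bar{R}}^-_{y=-1,\eta,\hat\sigma}(g),$$

where $\hat\sigma$ is an estimate of $p(y=+1 \mid x)$ and $\hat{\bar{R}}^-_{y=-1,\eta,\hat\sigma}(g)$ is the empirical counterpart of

$$\bar{R}^-_{y=-1,\eta,\hat\sigma}(g) = \mathbb{E}_{x\sim p(x)}\left[\mathbb{1}_{\hat\sigma(x)\le\eta}\,\ell(-g(x))\,(1-\hat\sigma(x))\right] + \pi\,\mathbb{E}_{x\sim p(x\mid y=+1)}\left[\mathbb{1}_{\hat\sigma(x)>\eta}\,\ell(-g(x))\,\frac{1-\hat\sigma(x)}{\hat\sigma(x)}\right].$$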

PUbN\N can be viewed as a variant of the traditional two-step approach in PU learning which first identifies possible N data in U data and then performs ordinary PN classification to distinguish P data from the identified N data. However, being based on state-of-the-art nnPU learning, our method is more promising than other similar algorithms. Moreover, by explicitly considering the posterior $p(y=+1 \mid x)$, we attempt to correct the bias induced by taking into account only confident negative samples. The benefit of using an unbiased risk estimator is that the resulting algorithm is always statistically consistent, i.e., the estimation error converges in probability to zero as the number of samples grows to infinity.

## 4 Experiments

In this section, we experimentally investigate the proposed method and compare its performance against several baseline methods.

### 4.1 Basic Setup

We focus on training neural networks with stochastic optimization. For simplicity, in an experiment, $\hat\sigma$ and $g$ always use the same model and are trained for the same number of epochs. All models are learned using AMSGrad (j.2018on) as the optimizer and the logistic loss as the surrogate loss unless otherwise specified. To determine the value of $\eta$, we introduce another hyperparameter $\tau$ and choose $\eta$ such that $\#\{x \in \mathcal{X}_{\mathrm{U}} \mid \hat\sigma(x) \le \eta\} = \tau(1-\pi-\rho)\,n_{\mathrm{U}}$. In all the experiments, an additional validation set, equally composed of P, U and bN data, is sampled for both hyperparameter tuning and choosing the model parameters with the lowest validation loss among those obtained after every epoch. Regarding the computation of the validation loss, we use the PU risk estimator (2) with the sigmoid loss for $g$ and an empirical approximation of $\mathbb{E}_{x\sim p(x)}[|\hat\sigma(x) - \sigma(x)|^2] - \mathbb{E}_{x\sim p(x)}[\sigma(x)^2]$ for $\hat\sigma$ (see Appendix B).
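The threshold selection can be sketched as an order-statistic lookup. The snippet assumes that the rule is to pick $\eta$ so that the number of U samples with $\hat\sigma(x) \le \eta$ equals $\tau(1-\pi-\rho)\,n_{\mathrm{U}}$, rounded to the nearest integer; `choose_eta` is a hypothetical helper and the tie handling is an illustrative choice.

```python
def choose_eta(sigma_u, prior, rho, tau):
    """Pick eta so that about tau * (1 - prior - rho) * n_U unlabeled samples
    satisfy sigma_hat(x) <= eta.

    sigma_u: estimates sigma_hat(x) on the U samples, each in [0, 1].
    """
    n_u = len(sigma_u)
    target = int(round(tau * (1.0 - prior - rho) * n_u))
    target = min(target, n_u)
    if target <= 0:
        # No sample should fall below the threshold (assuming sigma_hat > 0).
        return 0.0
    # The target-th smallest estimate; ties may let a few extra samples through.
    return sorted(sigma_u)[target - 1]
```

Since $1-\pi-\rho$ is the mass of the $s=-1$ population, $\tau = 1$ roughly asks for as many thresholded U samples as there are expected $s=-1$ samples in the U set.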

### 4.2 Effectiveness of the Algorithm

We assess the performance of the proposed method on three benchmark datasets: MNIST, CIFAR-10 and 20 Newsgroups. Experimental details are given in Appendix C. In particular, since all the three datasets are originally designed for multiclass classification, we group different categories together to form a binary classification problem.

Baselines. When $\mathcal{X}_{\mathrm{bN}}$ is given, two baseline methods are considered. The first one is nnPNU adapted from Section 2.3. In the second method, named PUPN, we train two binary classifiers: one is learned with nnPU while we regard $s$ as the class label, and the other is learned from $\mathcal{X}_{\mathrm{P}}$ and $\mathcal{X}_{\mathrm{bN}}$ to separate P samples from bN samples. A sample is classified in the P class only if it is so classified by both classifiers. When $\mathcal{X}_{\mathrm{bN}}$ is not available, nnPU is compared with the proposed PUbN\N.
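The decision rule of the PUPN baseline can be written down directly. The sketch below assumes that each of the two classifiers outputs a real-valued score whose positive sign means "positive class"; the function and argument names are hypothetical.

```python
def pupn_predict(score_s, score_p_vs_bn):
    """Combine the two PUPN classifiers for one sample.

    score_s: output of the nnPU classifier trained with s as the label
             (positive means "P or bN", negative means "s = -1").
    score_p_vs_bn: output of the classifier separating P from bN samples
             (positive means "P").
    Returns +1 only if both classifiers vote for the positive side.
    """
    return +1 if (score_s > 0 and score_p_vs_bn > 0) else -1
```

A sample rejected by the first classifier is treated as an unlabeled-type negative, and one rejected by the second as a bN-type negative, so either rejection yields the N class.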

Sampling bN Data. To sample $\mathcal{X}_{\mathrm{bN}}$, we suppose that the bias of N data is caused by a latent prior probability change (sugiyama2007mixture; hu2018does) in the N class. Let $z \in \mathcal{Z} = \{1, \ldots, S\}$ be some latent variable which we call a latent category, where $S$ is a constant. It is assumed

$$\begin{aligned}
p(x \mid z, y=-1) &= p(x \mid z, y=-1, s=+1), \\
p(z \mid y=-1) &\ne p(z \mid y=-1, s=+1).
\end{aligned}$$

In the experiments, the latent categories are the original class labels of the datasets. Concrete definitions of $\mathcal{X}_{\mathrm{bN}}$ with experimental results are summarized in Table 1.
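The sampling scheme above amounts to changing the prior over latent categories while keeping the within-category distribution fixed. A minimal sketch follows, assuming the N data are available as a dictionary from latent category to a list of examples; `sample_bn`, the category names and the probabilities are hypothetical and only illustrate the mechanism.

```python
import random


def sample_bn(negatives_by_category, bn_category_probs, n_bn, rng):
    """Draw n_bn biased negative samples under a latent prior change.

    negatives_by_category: dict mapping a latent category z to its N examples.
    bn_category_probs: dict giving p(z | y = -1, s = +1) for each category;
        it may differ from the true p(z | y = -1) and may contain zeros.
    rng: a random.Random instance, passed in for reproducibility.
    """
    categories = list(bn_category_probs)
    weights = [bn_category_probs[z] for z in categories]
    samples = []
    for _ in range(n_bn):
        # First pick a latent category according to the biased prior ...
        z = rng.choices(categories, weights=weights, k=1)[0]
        # ... then pick an example uniformly inside it, so p(x | z, y = -1) is unchanged.
        samples.append(rng.choice(negatives_by_category[z]))
    return samples
```

For example, giving zero probability to some categories reproduces the situation where entire parts of the N distribution never appear among the labeled negatives.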

Results. Overall, our proposed method consistently achieves the best or comparable performance in all the scenarios, including those of standard PU learning. Additionally, using bN data can effectively help improve classification performance. However, the choice of algorithm is essential. Both nnPNU and the naive PUPN are able to leverage bN data to enhance classification accuracy in only relatively few tasks. In contrast, the proposed PUbN successfully reduces the misclassification error most of the time.

Clearly, the performance gain that we can obtain from the availability of bN data is case-dependent. On CIFAR-10, the greatest improvement is achieved when we regard mammals (i.e. cat, deer, dog and horse) as the P class and draw samples from the latent categories bird and frog as labeled negative data. This is not surprising because birds and frogs are more similar to mammals than vehicles, which makes the classification harder specifically for samples from these two latent categories. By explicitly labeling these samples as N data, we allow the classifier to make better predictions for these difficult samples.

### 4.3 Why Does PUbN\N Outperform nnPU?

Our method, specifically designed for PUbN learning, naturally outperforms other baseline methods in this problem. Nonetheless, Table 1 also shows that the proposed method, when applied to PU learning, achieves significantly better performance than the state-of-the-art nnPU algorithm. Here we numerically investigate the reason behind this phenomenon.

Besides nnPU and PUbN\N, we compare with unbiased PU (uPU) learning (2). Both uPU and nnPU are learned with the sigmoid loss, learning rate for MNIST, initial learning rate for CIFAR-10, and learning rate for 20 Newsgroups. This is because uPU learning is unstable with the logistic loss. The other parts of the experiments remain unchanged. On the test sets we compute the false positive rates, false negative rates and misclassification errors for the three methods and plot them in Figure 1. We first notice that PUbN\N still outperforms nnPU trained with the sigmoid loss. In fact, the final performance of the nnPU classifier does not change much when we replace the logistic loss with the sigmoid loss.

In kiryo2017positive, the authors observed that uPU overfits training data with the risk going negative. In other words, a large portion of U samples are classified to the N class. This is confirmed in our experiments by an increase of the false negative rate and a decrease of the false positive rate. nnPU remedies the problem by introducing the non-negative risk estimator (3). While the non-negative correction successfully prevents the false negative rate from going up, it also causes more N samples to be classified as P compared to uPU. However, since the gain in terms of false negative rate is enormous, in the end nnPU achieves a lower misclassification error. By further identifying possible N samples after nnPU learning, we expect that our algorithm can yield a lower false positive rate than nnPU without misclassifying too many P samples as N as in the case of uPU. Figure 1 suggests that this is indeed the case. In particular, we observe that on MNIST, our method achieves the same false positive rate as uPU whereas its false negative rate is comparable to that of nnPU.
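The three quantities compared in this analysis can be computed from test labels and predictions as in the sketch below. It assumes labels and predictions take values in $\{+1, -1\}$ and that both classes are present in the test set; `error_rates` is a hypothetical helper, not part of the original experiments.

```python
def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate, misclassification error).

    y_true, y_pred: sequences of labels in {+1, -1} of equal length.
    Assumes at least one positive and one negative example in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    n_pos = sum(1 for t, _ in pairs if t == +1)
    n_neg = sum(1 for t, _ in pairs if t == -1)
    false_pos = sum(1 for t, p in pairs if t == -1 and p == +1)
    false_neg = sum(1 for t, p in pairs if t == +1 and p == -1)
    return false_pos / n_neg, false_neg / n_pos, (false_pos + false_neg) / len(pairs)
```

A classifier that pushes too many U samples to the N class shows up here as a low false positive rate paired with a high false negative rate, which is the uPU pattern discussed above.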

## 5 Conclusion

This paper studied the PUbN classification problem, where a binary classifier is trained on P, U and bN data. The proposed method is a two-step approach inspired by both PU learning and importance weighting. The key idea is to assign an appropriate weight to each example in order to evaluate the classification risk using the three sets of data. We theoretically established an estimation error bound for the proposed risk estimator and experimentally showed that our approach successfully leveraged bN data to improve the classification performance on several real-world datasets. A variant of our algorithm was able to achieve state-of-the-art results in PU learning.

## Appendix A Proofs

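### A.1 Proof of Theorem 1

We notice that $(1-\pi-\rho)\,p(x \mid s=-1) = p(x, s=-1)$ and that when $h(x) > \eta$, we have $\sigma(x) > 0$, which allows us to write $p(x, s=-1) = \frac{p(x, s=-1)}{p(x, s=+1)}\,p(x, s=+1)$. We can thus decompose $\bar{R}^-_{s=-1}(g)$ as follows:

$$\begin{aligned}
\bar{R}^-_{s=-1}(g) ={}& \int \ell(-g(x))\, p(x, s=-1)\, dx \\
={}& \int \mathbb{1}_{h(x)\le\eta}\,\ell(-g(x))\, p(x, s=-1)\, dx + \int \mathbb{1}_{h(x)>\eta}\,\ell(-g(x))\, p(x, s=-1)\, dx \\
={}& \int \mathbb{1}_{h(x)\le\eta}\,\ell(-g(x))\, \frac{p(x, s=-1)}{p(x)}\, p(x)\, dx \\
&+ \int \mathbb{1}_{h(x)>\eta}\,\ell(-g(x))\, \frac{p(x, s=-1)}{p(x, s=+1)}\, p(x, s=+1)\, dx.
\end{aligned}$$

By writing $p(x, s=-1) = (1-\sigma(x))\,p(x)$ and $p(x, s=+1) = \sigma(x)\,p(x)$, we have

$$\begin{aligned}
\bar{R}^-_{s=-1}(g) ={}& \int \mathbb{1}_{h(x)\le\eta}\,\ell(-g(x))\,(1-\sigma(x))\, p(x)\, dx \\
&+ \int \mathbb{1}_{h(x)>\eta}\,\ell(-g(x))\, \frac{1-\sigma(x)}{\sigma(x)}\, p(x, s=+1)\, dx.
\end{aligned}$$

We obtain Equation (6) after replacing $p(x, s=+1)$ by $\pi\, p(x \mid y=+1) + \rho\, p(x \mid y=-1, s=+1)$.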
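The identity proved above can be sanity-checked on a small discrete example, where integrals become finite sums. The sketch below uses a made-up joint distribution over three points that respects the constraint $p(s=+1 \mid x, y=+1) = 1$, takes $h = \sigma$, and compares both sides of the decomposition; all numbers and names are illustrative assumptions.

```python
import math


def sigmoid_loss(z):
    """Sigmoid loss 1 / (1 + exp(z)); inputs here are small, so no overflow."""
    return 1.0 / (1.0 + math.exp(z))


# Toy joint masses over x in {0, 1, 2}; they sum to one.
P_POS = [0.30, 0.10, 0.00]   # p(x, y=+1), where s=+1 necessarily
P_BN = [0.10, 0.10, 0.00]    # p(x, y=-1, s=+1)
P_REST = [0.05, 0.15, 0.20]  # p(x, s=-1)
G = [1.5, -0.3, -2.0]        # arbitrary decision values g(x)


def theorem1_sides(eta):
    """Return (direct value, decomposed value) of the s = -1 risk term."""
    p_x = [a + b + c for a, b, c in zip(P_POS, P_BN, P_REST)]
    sigma = [(a + b) / m for a, b, m in zip(P_POS, P_BN, p_x)]
    h = sigma  # ideal choice: h(x) > eta >= 0 then implies sigma(x) > 0
    direct = sum(sigmoid_loss(-G[x]) * P_REST[x] for x in range(3))
    decomposed = 0.0
    for x in range(3):
        neg_loss = sigmoid_loss(-G[x])
        if h[x] <= eta:
            # Term evaluated under the marginal p(x).
            decomposed += p_x[x] * neg_loss * (1.0 - sigma[x])
        else:
            # pi * p(x | y=+1) + rho * p(x | y=-1, s=+1) = p(x, s=+1).
            decomposed += (P_POS[x] + P_BN[x]) * neg_loss * (1.0 - sigma[x]) / sigma[x]
    return direct, decomposed
```

Calling `theorem1_sides` with different values of `eta` in $[0, 1]$ should give matching pairs up to floating-point error, reflecting that the decomposition holds for any admissible threshold.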

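### A.2 Proof of Theorem 2

For $\hat\sigma$ and $\eta$ given, let us define

$$R_{\mathrm{PUbN},\eta,\hat\sigma}(g) = \pi R^+_{\mathrm{P}}(g) + \rho R^-_{\mathrm{bN}}(g) + \bar{R}^-_{s=-1,\eta,\hat\sigma}(g).$$

The following lemma establishes the uniform deviation bound from $\hat{R}_{\mathrm{PUbN},\eta,\hat\sigma}$ to $R_{\mathrm{PUbN},\eta,\hat\sigma}$.

###### Lemma 1.

Let $\hat\sigma$ be a fixed function independent of data used to compute $\hat{R}_{\mathrm{PUbN},\eta,\hat\sigma}$ and $\eta \in (0, 1]$. For any $\delta > 0$, with probability at least $1 - \delta$,

$$\begin{aligned}
\sup_{g\in\mathcal{G}} \left|\hat{R}_{\mathrm{PUbN},\eta,\hat\sigma}(g) - R_{\mathrm{PUbN},\eta,\hat\sigma}(g)\right| \le{}& 2 L_\ell\, \mathfrak{R}_{n_{\mathrm{U}},p}(\mathcal{G}) + \frac{2\pi L_\ell}{\eta}\, \mathfrak{R}_{n_{\mathrm{P}},p_{\mathrm{P}}}(\mathcal{G}) + \frac{2\rho L_\ell}{\eta}\, \mathfrak{R}_{n_{\mathrm{bN}},p_{\mathrm{bN}}}(\mathcal{G}) \\
&+ C_\ell \sqrt{\frac{\ln(6/\delta)}{2 n_{\mathrm{U}}}} + \frac{\pi C_\ell}{\eta} \sqrt{\frac{\ln(6/\delta)}{2 n_{\mathrm{P}}}} + \frac{\rho C_\ell}{\eta} \sqrt{\frac{\ln(6/\delta)}{2 n_{\mathrm{bN}}}}.
\end{aligned}$$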
###### Proof.

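For ease of notation, let

$$\begin{aligned}
R_{\mathrm{P}}(g) &= \mathbb{E}_{x\sim p_{\mathrm{P}}(x)}\left[\ell(g(x)) + \mathbb{1}_{\hat\sigma(x)>\eta}\,\ell(-g(x))\,\frac{1-\hat\sigma(x)}{\hat\sigma(x)}\right], \\
R_{\mathrm{bN}}(g) &= \mathbb{E}_{x\sim p_{\mathrm{bN}}(x)}\left[\ell(-g(x))\left(1 + \mathbb{1}_{\hat\sigma(x)>\eta}\,\frac{1-\hat\sigma(x)}{\hat\sigma(x)}\right)\right], \\
R_{\mathrm{U}}(g) &= \mathbb{E}_{x\sim p(x)}\left[\mathbb{1}_{\hat\sigma(x)\le\eta}\,\ell(-g(x))\,(1-\hat\sigma(x))\right], \\
\hat{R}_{\mathrm{P}}(g) &= \frac{1}{n_{\mathrm{P}}}\sum_{i=1}^{n_{\mathrm{P}}}\left[\ell(g(x_i^{\mathrm{P}})) + \mathbb{1}_{\hat\sigma(x_i^{\mathrm{P}})>\eta}\,\ell(-g(x_i^{\mathrm{P}}))\,\frac{1-\hat\sigma(x_i^{\mathrm{P}})}{\hat\sigma(x_i^{\mathrm{P}})}\right], \\
\hat{R}_{\mathrm{bN}}(g) &= \frac{1}{n_{\mathrm{bN}}}\sum_{i=1}^{n_{\mathrm{bN}}}\left[\ell(-g(x_i^{\mathrm{bN}}))\left(1 + \mathbb{1}_{\hat\sigma(x_i^{\mathrm{bN}})>\eta}\,\frac{1-\hat\sigma(x_i^{\mathrm{bN}})}{\hat\sigma(x_i^{\mathrm{bN}})}\right)\right], \\
\hat{R}_{\mathrm{U}}(g) &= \frac{1}{n_{\mathrm{U}}}\sum_{i=1}^{n_{\mathrm{U}}}\left[\mathbb{1}_{\hat\sigma(x_i^{\mathrm{U}})\le\eta}\,\ell(-g(x_i^{\mathrm{U}}))\,(1-\hat\sigma(x_i^{\mathrm{U}}))\right].
\end{aligned}$$

From the sub-additivity of the supremum operator, we have

$$\begin{aligned}
\sup_{g\in\mathcal{G}} \left|\hat{R}_{\mathrm{PUbN},\eta,\hat\sigma}(g) - R_{\mathrm{PUbN},\eta,\hat\sigma}(g)\right| \le{}& \pi \sup_{g\in\mathcal{G}} \left|\hat{R}_{\mathrm{P}}(g) - R_{\mathrm{P}}(g)\right| + \rho \sup_{g\in\mathcal{G}} \left|\hat{R}_{\mathrm{bN}}(g) - R_{\mathrm{bN}}(g)\right| \\
&+ \sup_{g\in\mathcal{G}} \left|\hat{R}_{\mathrm{U}}(g) - R_{\mathrm{U}}(g)\right|.
\end{aligned}$$

As a consequence, to conclude the proof, it suffices to prove that with probability at least $1 - \delta/3$, the following bounds hold separately:

$$\sup_{g\in\mathcal{G}} \left|\hat{R}_{\mathrm{P}}(g) - R_{\mathrm{P}}(g)\right| \le \frac{2 L_\ell}{\eta}\, \mathfrak{R}_{n_{\mathrm{P}},p_{\mathrm{P}}}(\mathcal{G}) + \frac{C_\ell}{\eta} \sqrt{\frac{\ln(6/\delta)}{2 n_{\mathrm{P}}}}, \tag{8}$$

$$\sup_{g\in\mathcal{G}} \left|\hat{R}_{\mathrm{bN}}(g) - R_{\mathrm{bN}}(g)\right| \le \frac{2 L_\ell}{\eta}\, \mathfrak{R}_{n_{\mathrm{bN}},p_{\mathrm{bN}}}(\mathcal{G}) + \frac{C_\ell}{\eta} \sqrt{\frac{\ln(6/\delta)}{2 n_{\mathrm{bN}}}}, \tag{9}$$

$$\sup_{g\in\mathcal{G}} \left|\hat{R}_{\mathrm{U}}(g) - R_{\mathrm{U}}(g)\right| \le 2 L_\ell\, \mathfrak{R}_{n_{\mathrm{U}},p}(\mathcal{G}) + C_\ell \sqrt{\frac{\ln(6/\delta)}{2 n_{\mathrm{U}}}}. \tag{10}$$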

Below we prove (8). (9) and (10) are proven similarly.

Let be the function defined by . For , since