Classification with Noisy Labels
by Importance Reweighting
In this paper, we study a classification problem in which sample labels are randomly corrupted. In this scenario, there is an unobservable sample with noise-free labels. However, before being observed, the true labels are independently flipped with a probability $\rho \in [0, 0.5)$, and the random label noise can be class-conditional. Here, we address two fundamental problems raised by this scenario. The first is how to best use the abundant surrogate loss functions designed for the traditional classification problem when there is label noise. We prove that any surrogate loss function can be used for classification with noisy labels by using importance reweighting, with a consistency assurance that the label noise does not ultimately hinder the search for the optimal classifier of the noise-free sample. The other is the open problem of how to obtain the noise rate $\rho$. We show that the rate is upper bounded by the conditional probability $P(\hat{Y} \mid X)$ of the noisy sample. Consequently, the rate can be estimated, because the upper bound can be easily reached in classification problems. Experimental results on synthetic and real datasets confirm the efficiency of our methods.
1 Introduction
Classification crucially relies on the accuracy of the dataset labels. In some situations, observation labels are easily corrupted and, therefore, inaccurate. Designing learning algorithms that account for noisy labeled data is therefore of great practical importance and has attracted a significant amount of interest in the machine learning community.
The random classification noise (RCN) model, in which each label is flipped independently with a probability $\rho \in [0, 0.5)$, has been proposed; it was proven to be PAC-learnable by Angluin and Laird  soon after the noise-free PAC learning model was introduced by Valiant . Many related works then followed: Kearns  proposed the statistical query model to learn with RCN. The restriction he enforced is that learning is based not on the particular properties of individual random examples, but instead on the global statistical properties of large samples. Such an approach to learning seems intuitively more robust. Lawrence and Schölkopf  proposed a Bayesian model for this noise and applied it to sky moving in images. Biggio et al.  enabled support vector machine learning with RCN via a kernel matrix correction. Yang et al.  developed multiple kernel learning for classification with noisy labels using stochastic programming. The interested reader is referred to further examples in the survey . However, most of these algorithms are designed for specific surrogate loss functions. How to use and benefit from the large number of surrogate loss functions designed for the traditional (noise-free) classification problem is therefore important to investigate when solving classification problems in the presence of label noise.
Aslam and Decatur  proved that RCN exploited using a 0-1 loss function is PAC-learnable if the function class is of finite VC-dimension. Manwani and Sastry  analyzed the tolerance properties of RCN for risk minimization under several frequently used surrogate loss functions and showed that many of them do not tolerate RCN. Natarajan et al.  reported two methods for learning asymmetric RCN models, in which the random label noise is class-conditional. Their methods exploit many different surrogate loss functions : the first method uses unbiased estimators of surrogate loss functions for empirical risk minimization, but the unbiased estimator may be non-convex, even if the original surrogate loss function is convex; their second method uses label-dependent costs. The latter approach is based on the idea that there exists an $\alpha \in (0, 1)$ such that the minimizer of the expected risk as assessed using the $\alpha$-weighted 0-1 loss function $U_\alpha(t, y) = (1 - \alpha) 1_{\{y = 1\}} 1_{\{t \le 0\}} + \alpha 1_{\{y = -1\}} 1_{\{t > 0\}}$ over the noisy sample distribution, where $t$ is the predicted value and $y$ is the label of the example, has the same sign as that of the Bayes classifier which minimizes the expected risk as assessed using the 0-1 loss function over the clean sample distribution; see, for example, Theorem 9 in . The method is notable because it can be applied to all the convex and classification-calibrated surrogate loss functions. (If a surrogate loss function is classification-calibrated and the sample size is sufficiently large, the surrogate loss function will help learn the same optimal classifier as the 0-1 loss function does; see Theorem 1 in Bartlett et al. .) This modification is based on the asymmetric classification-calibrated results  and cannot be used to improve the performance of symmetric RCN problems or the algorithms that employ the non-classification-calibrated surrogate loss functions.
To best use and benefit from the abundant surrogate loss functions designed for the traditional classification problems, here we propose an importance reweighting method in which any surrogate loss function designed for a traditional classification problem can be used for classification with noisy labels. In our method, the weights are non-negative, so the convexity of objective functions does not change. In addition, our method inherits most batch learning optimization procedures designed for traditional classification problems with different regularizations; see, for example, [13, 14, 15, 16, 17].
Although many works have focused on the RCN model, how to best estimate the noise rate $\rho$ remains an open problem  and severely limits the practical application of the existing algorithms. Most previous works make the assumption that the noise rate is known or learn it using cross-validation, which is time-consuming and lacks a guarantee of generalization accuracy. In this paper, we set the noise rate to be asymmetric and unknown and denote the flip probability of positive labels and the flip probability of negative labels by $\rho_{+1}$ and $\rho_{-1}$, respectively. We show that the noise rate $\rho_{+1}$ (or $\rho_{-1}$) is upper bounded by the conditional probability $P(\hat{Y} = -1 \mid X)$ (or $P(\hat{Y} = +1 \mid X)$) of the noisy data. Moreover, the upper bound can be reached if there exists an $x$ such that the probability $P(Y = -1 \mid x)$ (or $P(Y = +1 \mid x)$) of the “clean” sample is zero, which is very likely to hold for classification problems. The noise rates are therefore estimated by finding the minimal $P(\hat{Y} \mid X)$ over the noisy training sample.
1.1 Related Works
Kearns and Li  introduced the malicious noise (MN) model, in which an adversary can access the sample and randomly replace a fraction of them with adversarial ones. It has been proven that any nontrivial target function class cannot be PAC learned with accuracy $\epsilon$ and malicious noise rate $\eta \ge \epsilon / (1 + \epsilon)$; see, for example, [18, 19, 20]. Long and Servedio  proved that an algorithm for learning $\gamma$-margin half-spaces that minimizes a convex surrogate loss function for misclassification risk cannot tolerate malicious noise at a rate greater than $\Theta(\epsilon \gamma)$. They therefore proposed an algorithm that does not optimize a convex loss function and that can tolerate a higher rate of malicious noise than order $\Theta(\epsilon \gamma)$. Further details about the MN model can be found in .
Cesa-Bianchi et al.  considered a more complicated model in which both the features and the labels are corrupted by additive zero-mean, variance-bounded noise. They used unbiased estimates of the gradient of the surrogate loss function to learn from the noisy sample in an online learning setting. Perceptron algorithms that tolerate RCN have also been widely studied; see, for example, [24, 25, 26, 27]. See Khardon and Wachman  for a survey of noise-tolerant variants of perceptron algorithms.
As well as these model-motivated algorithms, many algorithms that exploit robust surrogate loss functions have been designed for learning with any kind of feature and label noise. Robust surrogate loss functions, such as the Cauchy loss function  and correntropy (also known as the Welsch loss function) [30, 31], have been empirically shown to be robust to noise. Some other algorithms, such as confidence weighted learning , have also been proposed for noise-tolerant learning.
To the best of our knowledge, the only work related to learning the unknown noise rate was proposed by Scott et al. . Inspired by the theory of mixture proportion estimation , they provided estimators for the inversed noise rates $\pi_{+1} = P(Y = -1 \mid \hat{Y} = +1)$ and $\pi_{-1} = P(Y = +1 \mid \hat{Y} = -1)$. However, there were no efficient algorithms that could be used to calculate the estimators until Scott  proposed an efficient algorithm for optimizing them during the preparation of this manuscript. By using Bayes’ rule, the noise rates $\rho_{\pm 1} = P(\hat{Y} = \mp 1 \mid Y = \pm 1)$ satisfy $\rho_{\pm 1} = \pi_{\mp 1} P(\hat{Y} = \mp 1) / P(Y = \pm 1)$. However, our method for estimating the noise rates is essentially different from that of Scott et al.  because $P(Y = \pm 1)$ is unknown. The inversed noise rates can be used to design algorithms for classification with label noise; see, for example, . In this paper, we also design importance reweighting algorithms for classification with label noise by employing the inversed noise rates.
The rest of this paper is organized as follows. The problem is set up in Section 2. Section 3 presents some useful results applied to the traditional classification problem. In Section 4, we discuss how to perform classification in the presence of RCN and benefit from the abundant surrogate loss functions and algorithms designed for the traditional classification problem. In Section 5, we discuss how to reduce the uncertainty introduced by RCN by estimating the conditional probability of the noisy sample; theoretical guarantees for the consistency of the learned classifiers are provided; certain convergence rates are also characterized in this section. In Section 6, an approach for estimating the noise rates is proposed. We also provide a detailed comparison between the theory of noise rate estimation and that of the inversed noise rate estimation in this section. We present the proofs of our assertions in Section 7. In Section 8, we present experimental results on synthetic and benchmark datasets, before concluding in Section 9.
2 Problem Setup
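Let $D$ be the distribution of a pair of random variables $(X, Y) \in \mathcal{X} \times \{-1, +1\}$, where $\mathcal{X} \subseteq \mathbb{R}^d$. Our goal is to predict a label $Y$ for any given observation $X$ using a sample drawn i.i.d. from the distribution $D$. However, in many real-world classification problems, sample labels are randomly corrupted. We therefore consider the asymmetric RCN model (see ). Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an i.i.d. sample drawn from the distribution $D$ and $(X_1, \hat{Y}_1), \ldots, (X_n, \hat{Y}_n)$ the corresponding corrupted ones. The asymmetric RCN model is given by:
$$P(\hat{Y} = +1 \mid Y = -1) = \rho_{-1}, \qquad P(\hat{Y} = -1 \mid Y = +1) = \rho_{+1},$$
where $\rho_{+1}, \rho_{-1} \in [0, 1)$ and $\rho_{+1} + \rho_{-1} < 1$.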
We denote by $D_\rho$ the distribution of the corrupted variables $(X, \hat{Y})$. In our setting, the “clean” sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ and the noise rates $\rho_{+1}$ and $\rho_{-1}$ are not available for learning algorithms. The classifier and noise rates are learned only by using the knowledge from the corrupted sample $(X_1, \hat{Y}_1), \ldots, (X_n, \hat{Y}_n)$.
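The corruption process described in this section can be simulated directly. The following is a minimal sketch, assuming labels in {-1, +1}; the names `flip_labels`, `rho_pos` (flip probability of positive labels) and `rho_neg` (flip probability of negative labels) are illustrative choices of this sketch, not notation from the paper.

```python
import numpy as np

def flip_labels(y, rho_pos, rho_neg, rng):
    """Flip clean labels in {-1, +1} under class-conditional (asymmetric) noise.

    rho_pos: probability that a +1 label is observed as -1 (assumed name).
    rho_neg: probability that a -1 label is observed as +1 (assumed name).
    """
    y = np.asarray(y)
    flip_prob = np.where(y == 1, rho_pos, rho_neg)  # per-example flip rate
    flip = rng.random(y.shape[0]) < flip_prob       # independent Bernoulli draws
    return np.where(flip, -y, y)

rng = np.random.default_rng(0)
y_clean = rng.choice([-1, 1], size=100_000)
y_noisy = flip_labels(y_clean, rho_pos=0.2, rho_neg=0.4, rng=rng)

# Empirical flip rates, computed here only because the clean labels are known
# in a simulation; a learner in the setting above never sees y_clean.
emp_rho_pos = np.mean(y_noisy[y_clean == 1] == -1)
emp_rho_neg = np.mean(y_noisy[y_clean == -1] == 1)
print(emp_rho_pos, emp_rho_neg)
```

In a simulation the empirical flip rates should be close to the chosen rates, which gives a simple sanity check before any learning algorithm is applied to the noisy sample.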
3 The Traditional Classification Problem
Classification is a fundamental machine learning problem. One intuitive way to learn the classifier is to find a decision function $f \in F$ such that the expected risk $R_{1,D}(f) = E_{(X,Y) \sim D}[1_{\{\mathrm{sign}(f(X)) \ne Y\}}]$ is minimized, where $F$ is the function class for searching and $1_{\{\cdot\}}$ is the 0-1 (indicator) loss. However, two problems remain when minimizing the expected risk: first, the 0-1 loss function is neither convex nor smooth, and second, the distribution $D$ is unknown. The solutions to these two problems are summarized below.
For the problem that the 0-1 loss function is neither convex nor smooth, abundant convex surrogate loss functions (most are smooth) with the classification-calibrated property [11, 12] have been proposed. These surrogate loss functions, such as square loss, logistic loss, and hinge loss, have been proven useful in many real-world applications. Apart from the convex classification-calibrated surrogate loss functions, many other non-convex surrogate loss functions empirically proven to be robust to noise, such as Cauchy loss and Welsch loss, are also frequently used. In this paper, we show that all these surrogate loss functions, as well as the non-classification-calibrated surrogate loss functions, such as the asymmetric exponential loss function (see Example 8 in)
can be used directly for classification in the presence of RCN by employing the importance reweighting method.
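For the problem that the distribution $D$ is unknown, the empirical risk is proposed to approximate the expected risk. The empirical risk is defined as
$$\hat{R}_{\ell,D}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i),$$
where the corresponding expected risk is
$$R_{\ell,D}(f) = E_{(X,Y) \sim D}[\ell(f(X), Y)]$$
and $\ell$ denotes any surrogate loss function. The classifier is then learned by empirical risk minimization (ERM)  over the function class $F$:
$$f_n = \arg\min_{f \in F} \hat{R}_{\ell,D}(f).$$
The consistency of $R_{\ell,D}(f_n)$ to $\min_{f \in F} R_{\ell,D}(f)$ is therefore essential for designing surrogate loss functions and learning algorithms. Let
$$f^* = \arg\min_{f \in F} R_{\ell,D}(f).$$
It  is easily proven that
$$R_{\ell,D}(f_n) - R_{\ell,D}(f^*) \le 2 \sup_{f \in F} |R_{\ell,D}(f) - \hat{R}_{\ell,D}(f)|.$$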
The right hand side term is known as the generalization error, and the consistency is guaranteed by convergence of the generalization error. We note that learning algorithms which are based on ERM, such as those using Tikhonov or manifold regularization, will not have a slower convergence rate of consistency than that of ERM. In this paper, we therefore provide consistency guarantees for learning algorithms dealing with RCN by deriving the generalization error bounds of the corresponding ERM algorithms.
4 Learning with Importance Reweighting
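Importance reweighting is widely used for domain adaptation , but here we introduce it to classification in the presence of label noise. One observation  from the field of importance reweighting is as follows:
$$R_{\ell,D}(f) = E_{(X,Y) \sim D}[\ell(f(X), Y)] = E_{(X,\hat{Y}) \sim D_\rho}\left[\frac{P_D(X, Y)}{P_{D_\rho}(X, \hat{Y})} \ell(f(X), \hat{Y})\right] = E_{(X,\hat{Y}) \sim D_\rho}[\beta(X, \hat{Y}) \ell(f(X), \hat{Y})] = R_{\beta\ell,D_\rho}(f),$$
where $D$ is the clean distribution, $D_\rho$ is the distribution of the noisy pairs $(X, \hat{Y})$, and $\beta(X, \hat{Y}) = P_D(X, Y) / P_{D_\rho}(X, \hat{Y})$ with the numerator evaluated at $Y = \hat{Y}$.
For the problem of classification in the presence of label noise, note that $P_D(X) = P_{D_\rho}(X)$. We therefore have
$$\beta(X, \hat{Y}) = \frac{P_D(X, Y)}{P_{D_\rho}(X, \hat{Y})} = \frac{P_D(Y \mid X)}{P_{D_\rho}(\hat{Y} \mid X)}.$$
Thus, even though the labels are corrupted, classification can still be implemented, provided that the weight $\beta(X, \hat{Y})$ applied to the loss $\ell(f(X), \hat{Y})$ is available.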
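Lemma 1
The asymmetric RCN problem can be addressed by reweighting the surrogate loss functions of the traditional classification problem via importance reweighting. The weight given to a noisy example $(X, \hat{Y})$ is
$$\beta(X, \hat{Y}) = \frac{P_{D_\rho}(\hat{Y} \mid X) - \rho_{-\hat{Y}}}{(1 - \rho_{+1} - \rho_{-1}) P_{D_\rho}(\hat{Y} \mid X)},$$
where $\rho_{-\hat{Y}}$ is the flip rate of the class opposite to the observed label, i.e., $\rho_{-1}$ when $\hat{Y} = +1$ and $\rho_{+1}$ when $\hat{Y} = -1$.
The weight is non-negative if $P_{D_\rho}(\hat{Y} \mid X) \ne 0$. If $P_{D_\rho}(\hat{Y} \mid X) = 0$, we intuitively let $\beta(X, \hat{Y}) = 0$.
A classifier can therefore be learned for the “clean” data in the presence of asymmetric RCN by minimizing the following reweighted empirical risk:
$$\hat{R}_{\beta\ell,D_\rho}(f) = \frac{1}{n} \sum_{i=1}^{n} \beta(X_i, \hat{Y}_i) \ell(f(X_i), \hat{Y}_i).$$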
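The reweighting step can be checked numerically on a small discrete problem. The sketch below assumes the weight takes the form (q - rho_opposite) / ((1 - rho_pos - rho_neg) * q), where q is the noisy conditional probability of the observed label and rho_opposite is the flip rate of the other class; this form follows from the class-conditional noise model. The function names, the toy distribution and the choice of logistic loss are illustrative assumptions of this sketch.

```python
import numpy as np

def importance_weight(q_obs, y_obs, rho_pos, rho_neg):
    """Weight for noisy examples.

    q_obs: noisy conditional probability of the observed label given x.
    y_obs: observed (possibly flipped) labels in {-1, +1}.
    The rate subtracted is the flip rate of the opposite class.
    """
    q_obs = np.asarray(q_obs, dtype=float)
    rho_opposite = np.where(np.asarray(y_obs) == 1, rho_neg, rho_pos)
    denom = (1.0 - rho_pos - rho_neg) * q_obs
    safe = np.where(q_obs > 0, denom, 1.0)
    # The weight is set to 0 where the noisy conditional probability is 0.
    return np.where(q_obs > 0, (q_obs - rho_opposite) / safe, 0.0)

def logistic_loss(score, y):
    return np.log1p(np.exp(-y * score))

def reweighted_empirical_risk(score, y_obs, q_obs, rho_pos, rho_neg):
    """Average of weight times loss over a noisy sample."""
    beta = importance_weight(q_obs, y_obs, rho_pos, rho_neg)
    return np.mean(beta * logistic_loss(score, y_obs))

# Toy discrete problem: three feature values with known clean posteriors.
p_x = np.array([0.5, 0.3, 0.2])      # P(X = x)
p_pos = np.array([0.9, 0.5, 0.0])    # clean P(Y = +1 | x)
score = np.array([1.5, 0.1, -2.0])   # an arbitrary decision function f(x)
rho_pos, rho_neg = 0.2, 0.3

# Noisy posterior implied by the class-conditional noise model.
q_pos = (1 - rho_pos) * p_pos + rho_neg * (1 - p_pos)

clean_risk = 0.0
reweighted_noisy_risk = 0.0
for y, p_y, q_y in [(1, p_pos, q_pos), (-1, 1 - p_pos, 1 - q_pos)]:
    loss = logistic_loss(score, y)
    clean_risk += np.sum(p_x * p_y * loss)
    beta = importance_weight(q_y, np.full(3, y), rho_pos, rho_neg)
    reweighted_noisy_risk += np.sum(p_x * q_y * beta * loss)

print(clean_risk, reweighted_noisy_risk)
```

The two printed risks should agree up to floating-point error, since the weight converts an expectation under the noisy distribution into one under the clean distribution. Any other surrogate loss could be substituted for `logistic_loss` without changing the check.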
By the following proposition, based on Talagrand’s Lemma (see, e.g., Lemma 4.2 in ), we show that, given , the above weighted empirical risk will converge to the unweighted expected risk of the “clean” data for any . So, can be approximated by .
Given the conditional probability and the noise rates and . Let be upper bounded by . Then, for any , with probability at least , we have
where , and the Rademacher complexity  is defined by
and are i.i.d. Rademacher variables.
The Rademacher complexity has a convergence rate of order $O(\sqrt{1/n})$ . If the function class has proper conditions on its variance, the Rademacher complexity will quickly converge and is of order $O(1/n)$; see, for example, . The generalization bound in Proposition 1 is derived using the Rademacher complexity method. Many other hypothesis complexities and methods can also be employed to derive the generalization bound.
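The typical decay of the Rademacher complexity with the sample size can be illustrated for a simple function class. The sketch below is an assumption-laden illustration rather than the setting of Proposition 1: it uses the norm-bounded linear class {x -> <w, x> : ||w||_2 <= radius}, Gaussian features, and the 1/n normalization (the normalization in the paper's definition may differ by a constant factor). For this class the supremum over the class has a closed form, so only the Rademacher signs need to be sampled.

```python
import numpy as np

def linear_class_rademacher(x, radius, n_draws, rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= radius}, using the 1/n normalization.

    For this class the supremum has a closed form:
    sup_w (1/n) sum_i s_i <w, x_i> = (radius / n) * ||sum_i s_i x_i||_2.
    """
    n = x.shape[0]
    signs = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher variables
    return radius * np.mean(np.linalg.norm(signs @ x, axis=1)) / n

rng = np.random.default_rng(0)
complexities = {}
for n in (100, 10_000):
    x = rng.standard_normal((n, 5))
    complexities[n] = linear_class_rademacher(x, radius=1.0, n_draws=200, rng=rng)

# A 100-fold increase in n should shrink the complexity by roughly a factor of 10.
ratio = complexities[100] / complexities[10_000]
print(complexities, ratio)
```

A ratio near 10 is what a rate of order $\sqrt{1/n}$ predicts for this class; faster rates require additional conditions on the function class, as noted above.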
The consistency rate will therefore be inherited for learning with label noise, provided that the conditional probability and noise rates are accurately estimated.
Based on Proposition 1, we can now state our first main result for classification in the presence of label noise using our framework of importance reweighting.
Theorem 1
Any surrogate loss function designed for the traditional classification problem can be used for classification in the presence of asymmetric RCN by employing the importance reweighting method. The consistency rate for classification with asymmetric RCN will be the same as that of the corresponding traditional classification algorithm, provided that the conditional probability and noise rates are accurately estimated.
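As an illustration of this result with one particular surrogate loss, the sketch below trains a one-dimensional logistic model on noisy labels with and without importance weights. Several things here are assumptions of the sketch and not part of the paper's method: the synthetic Gaussian data, the plain gradient descent routine `fit_weighted_logistic`, and, most importantly, the use of the exact noisy posterior and the true noise rates to compute the weights, which isolates the reweighting step from the estimation problems discussed next.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weighted_logistic(x, y, weights, lr=0.5, n_iter=2000):
    """Gradient descent on the weighted logistic loss for score = w * x + b."""
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        margin = y * (w * x + b)
        # d/dz log(1 + exp(-z)) = -sigmoid(-z), scaled by the example weight.
        coef = -weights * sigmoid(-margin) * y
        w -= lr * np.mean(coef * x)
        b -= lr * np.mean(coef)
    return w, b

rng = np.random.default_rng(0)
n, rho_pos, rho_neg = 5000, 0.1, 0.4
y_clean = rng.choice([-1, 1], size=n)
x = rng.normal(loc=2.0 * y_clean, scale=1.0)   # X | Y = y ~ N(2y, 1)
flip = rng.random(n) < np.where(y_clean == 1, rho_pos, rho_neg)
y_noisy = np.where(flip, -y_clean, y_clean)

# Oracle quantities (assumption): for this toy model the clean log-odds are 4x,
# so the noisy posterior and the weights are available in closed form.
p_pos = sigmoid(4.0 * x)
q_pos = (1 - rho_pos - rho_neg) * p_pos + rho_neg
q_obs = np.where(y_noisy == 1, q_pos, 1 - q_pos)
rho_opposite = np.where(y_noisy == 1, rho_neg, rho_pos)
beta = (q_obs - rho_opposite) / ((1 - rho_pos - rho_neg) * q_obs)

w_plain, b_plain = fit_weighted_logistic(x, y_noisy, np.ones(n))
w_rw, b_rw = fit_weighted_logistic(x, y_noisy, beta)

def clean_accuracy(w, b):
    return np.mean(np.sign(w * x + b) == y_clean)

print("unweighted:", clean_accuracy(w_plain, b_plain))
print("reweighted:", clean_accuracy(w_rw, b_rw))
```

With oracle weights, the reweighted fit targets the same risk as training on clean labels, so its decision threshold should sit near zero for this symmetric toy problem. The unweighted fit may be shifted by the asymmetric noise; how much depends on the noise rates and the data.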
The trade-off for using and benefitting from the abundant surrogate loss functions designed for traditional classification problems is the need to estimate the distribution $P_{D_\rho}(\hat{Y} \mid X)$ and noise rates $\rho_{\pm 1}$. Next, we address how to estimate the distribution and the noise rates separately.
5 Conditional Probability Estimation
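We have shown that the uncertainty introduced by classification label noise can be reduced by the knowledge of weight
$$\beta(X, \hat{Y}) = \frac{P_D(Y \mid X)}{P_{D_\rho}(\hat{Y} \mid X)}.$$
In the asymmetric RCN problem,
$$\beta(X, \hat{Y}) = \frac{P_{D_\rho}(\hat{Y} \mid X) - \rho_{-\hat{Y}}}{(1 - \rho_{+1} - \rho_{-1}) P_{D_\rho}(\hat{Y} \mid X)},$$
and therefore the weight can be learned by using the noisy sample and the noise rates. In this section, we present three methods to estimate the conditional probability $P_{D_\rho}(\hat{Y} \mid X)$ with consistency analyses; how to estimate the noise rates is discussed in the next section.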
5.1 The Probabilistic Classification Method
The conditional probability $P_{D_\rho}(\hat{Y} \mid X)$ can be estimated by a simple probabilistic classification method, where the corresponding link function maps the outputs of the learned predictor to the interval $[0, 1]$ and thus can be interpreted as probabilities. However, such a method is parametric and makes the strong assumption that the target conditional distribution is of the form of the link function used. For example, if the logistic loss function is employed, the learned distribution will be of the form
$$\hat{P}_{D_\rho}(\hat{Y} \mid X; f) = \frac{1}{1 + \exp(-\hat{Y} f(X))}.$$
When the logistic regression is correctly specified, i.e., there exists $f \in F$ such that $\hat{P}_{D_\rho}(\hat{Y} \mid X; f)$ is equal to the target conditional distribution $P_{D_\rho}(\hat{Y} \mid X)$, the logistic regression is optimal in the sense that the approximation error is minimized (being zero). However, when the model is misspecified, which is often the case in practice, a large approximation error may be introduced even if the hypothesis class is chosen to be relatively large, which will hinder the statistical consistency for learning the target weight function $\beta(X, \hat{Y})$.
We found that employing the probabilistic classification method to estimate the conditional probability did not perform well. Its empirical validation is therefore omitted in this paper.
5.2 The Kernel Density Estimation Method
In this subsection, we introduce the kernel density estimation method to estimate the conditional probability , which has the consistency property for learning the target weight function .
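Using Bayes’ rule, we have
$$P_{D_\rho}(\hat{Y} \mid X) = \frac{P_{D_\rho}(X \mid \hat{Y}) P_{D_\rho}(\hat{Y})}{P_{D_\rho}(X)}.$$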
When the dimensionality of is low and the sample size is sufficiently large, the probabilities and can be easily and efficiently estimated using the noisy sample.
If we use
and the kernel density estimation method
to estimate and , respectively (where is a universal kernel, see ), the consistency of classification with label noise (or learning the target weight function ) is guaranteed by the following theorem.
Let be an estimator for using equations , and , and
For any , we have
When and are estimated separately, although the consistency property is guaranteed by mapping features into a universal kernel induced reproducing kernel Hilbert space (RKHS), the convergence rate may be slow. Note that the kernel density estimation method is non-parametric and thus it often requires a large sample size. Since density estimation is known to be a hard problem for high-dimensional variables, in practice, it is preferable to directly estimate the density ratio  and avoid estimating the densities separately.
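The kernel density route to the noisy conditional probability can be sketched in a few lines. The code below is a minimal illustration under assumptions that are not taken from the paper: a Gaussian kernel (one example of a universal kernel), a hand-picked bandwidth, one-dimensional synthetic data, and the function names `gaussian_kde_at` and `noisy_posterior_kde`. Class priors are estimated by label frequencies and class-conditional densities by kernel averages, then combined with Bayes' rule.

```python
import numpy as np

def gaussian_kde_at(x_query, x_train, bandwidth):
    """Average Gaussian kernel between each query point and the training points."""
    diff = x_query[:, None, :] - x_train[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    d = x_train.shape[1]
    norm = (2 * np.pi * bandwidth ** 2) ** (d / 2)
    return np.mean(np.exp(-sq_dist / (2 * bandwidth ** 2)), axis=1) / norm

def noisy_posterior_kde(x_query, x_train, y_noisy, bandwidth):
    """Estimate P(noisy label = +1 | x) via Bayes' rule with class-wise KDEs."""
    prior_pos = np.mean(y_noisy == 1)
    dens_pos = gaussian_kde_at(x_query, x_train[y_noisy == 1], bandwidth)
    dens_neg = gaussian_kde_at(x_query, x_train[y_noisy == -1], bandwidth)
    num = prior_pos * dens_pos
    den = num + (1 - prior_pos) * dens_neg
    # Fall back to 0.5 where both density estimates vanish numerically.
    return np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.5)

rng = np.random.default_rng(0)
n, rho_pos, rho_neg = 4000, 0.1, 0.3
y_clean = rng.choice([-1, 1], size=n)
x_train = rng.normal(loc=2.0 * y_clean, scale=1.0).reshape(-1, 1)
flip = rng.random(n) < np.where(y_clean == 1, rho_pos, rho_neg)
y_noisy = np.where(flip, -y_clean, y_clean)

# Far on the positive side almost every clean label is +1, so the noisy
# posterior should be close to 1 - rho_pos; far on the negative side it
# should be close to rho_neg.
x_query = np.array([[3.0], [-3.0]])
est = noisy_posterior_kde(x_query, x_train, y_noisy, bandwidth=0.5)
print(est)
```

The bandwidth matters considerably in practice and becomes harder to choose as the dimension grows, which is the practical motivation for estimating the density ratio directly.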
5.3 The Density Ratio Estimation Method
Density ratio estimation  mitigates the curse of dimensionality that affects kernel density estimation, and the ratio can be estimated accurately for high-dimensional variables. Therefore, in this subsection, we introduce density ratio estimation to estimate the conditional probability distribution $P_{D_\rho}(\hat{Y} \mid X)$ for classification in the presence of RCN.
Three methods are frequently used for density ratio estimation: the moment matching approach, the probabilistic classification approach and the ratio matching approach; see . Since the probabilistic classification approach may introduce a large approximation error, in practice, the moment matching and ratio matching methods are preferable , where the density ratio can be modelled by employing linear or non-linear functions. If proper reproducing kernel Hilbert spaces are chosen to be the hypothesis classes, the approximation errors of the moment matching and ratio matching methods could be small. Although these methods introduce approximation errors for learning the weight $\beta$, their efficiency has been widely demonstrated empirically [48, 49, 50].
In this paper, we exploit the ratio matching approach that employs the Bregman divergence  (KLIEP ) to estimate the conditional probability distribution $P_{D_\rho}(\hat{Y} \mid X)$. It is proven that the ratio matching approach exploiting the Bregman divergence  is consistent with the optimal approximation in the hypothesis class. (Parametric modeling is used for estimating the density ratio; we provide the proof of consistency in Section 7.6.)
The following theorem provides an assurance that our importance reweighting method that exploits density ratio estimation is consistent.
When employing the density ratio estimation method to estimate the conditional probability distribution and , if the hypothesis class for estimating the density ratio is chosen properly so that the approximation error is zero, for any , we have
where is the same as that defined in Theorem 2, and .
The convergence rate is characterized in the following proposition.
The convergence rate in Proposition 2 could be a certain rate of order because , where and denote the number of positive labels and negative labels of the noisy sample, respectively.
6 Estimating the Noise Rates
Most existing algorithms designed for RCN problems need the knowledge of the noise rates. Scott et al. [33, 34] developed lower bounds for the inversed noise rates and , under the irreducibility assumption, which are consistent with the target inversed noise rates and can therefore be used as estimators for the inversed noise rates. However, the convergence rate could be slow. Then, during the preparation of this manuscript, Scott  released an efficient implementation to estimate the inversed noise rates and introduced the distributional assumption and to the label noise classification problem. The distributional assumption is sufficient for the irreducibility assumption and thus is slightly stronger. Scott then proved that the distributional assumption ensures an asymptotic convergence rate of order for estimating the inversed noise rates.
To the best of our knowledge, no efficient method has been proposed to estimate the noise rates and how to estimate them remains an open problem . We first provide upper bounds for the noise rates and show that with a mild assumption on the “clean” data, they can be used to efficiently estimate the noise rates.
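Theorem 4
We have that
$$\rho_{-y} \le P_{D_\rho}(\hat{Y} = y \mid X), \quad y \in \{-1, +1\},$$
where $\rho_{-y} = P(\hat{Y} = y \mid Y = -y)$ is the rate at which labels of class $-y$ are flipped.
Moreover, if the assumption holds that there exists $x_{-y} \in \mathcal{X}$, such that $P_D(Y = y \mid x_{-y}) = 0$, we have
$$\rho_{-y} = P_{D_\rho}(\hat{Y} = y \mid x_{-y}).$$
Theorem 4 shows that under the assumption that there exists $x_{-y} \in \mathcal{X}$, such that $P_D(Y = y \mid x_{-y}) = 0$, $\min_{X \in \mathcal{X}} \hat{P}_{D_\rho}(\hat{Y} = y \mid X)$ is a consistent estimator for the noise rates. The convergence rate for estimating the noise rates is the same as that of estimating the conditional distribution $P_{D_\rho}(\hat{Y} \mid X)$. We therefore could obtain fast convergence rates for estimating the noise rates via finite sample analysis. For example, if the hypothesis class has proper conditions on its variance, the Rademacher complexity will quickly converge and is of order $O(1/n)$.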
We have proven the consistency property of the joint estimation of the weight and classifier in Theorems 2 and 3, and characterized the convergence rates of the joint estimation in Proposition 2. According to Theorem 4, the results can be easily extended to the joint estimation of the weight, noise rate and classifier of our importance reweighting method. We provide detailed proofs in the supplementary material.
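For classification problems, the assumption in Theorem 4 is easily satisfied. If an observation $X$ is far from the target classifier, it is likely that the conditional probability $P_D(Y = +1 \mid X)$ (or $P_D(Y = -1 \mid X)$) is equal to zero or very small. With the assumption that there exist $x_{-1}, x_{+1} \in \mathcal{X}$ such that $P_D(Y = +1 \mid x_{-1})$ and $P_D(Y = -1 \mid x_{+1})$ are very small, we can efficiently estimate the flip rate $\rho_{-y}$ of class $-y$, for $y \in \{-1, +1\}$, by
$$\hat{\rho}_{-y} = \min_{X \in \mathcal{X}} \hat{P}_{D_\rho}(\hat{Y} = y \mid X),$$
where $\hat{P}_{D_\rho}(\hat{Y} = y \mid X)$ is an estimate of the conditional probability of the noisy label. In our experiments, we estimate $\rho_{-y}$ by
$$\hat{\rho}_{-y} = \min_{X \in \{X_1, \ldots, X_n\}} \hat{P}_{D_\rho}(\hat{Y} = y \mid X).$$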
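The minimum-based noise rate estimator described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `estimate_noise_rates` is hypothetical, and the "estimated" posterior used in the demonstration is the exact noisy posterior of a synthetic Gaussian model, standing in for the output of a conditional probability estimator such as those of Section 5. With a genuinely estimated posterior, the minimum inherits the estimation error and can be biased downward.

```python
import numpy as np

def estimate_noise_rates(post_pos):
    """post_pos[i]: estimated P(noisy label = +1 | x_i) on the training sample.

    Returns (rho_pos_hat, rho_neg_hat): estimated flip rates of the positive
    and the negative class, each taken as the smallest estimated conditional
    probability of observing the opposite label.
    """
    post_pos = np.asarray(post_pos, dtype=float)
    rho_neg_hat = np.min(post_pos)         # min over x of P(noisy = +1 | x)
    rho_pos_hat = np.min(1.0 - post_pos)   # min over x of P(noisy = -1 | x)
    return rho_pos_hat, rho_neg_hat

rng = np.random.default_rng(0)
rho_pos, rho_neg = 0.1, 0.3
y_clean = rng.choice([-1, 1], size=2000)
x = rng.normal(loc=2.0 * y_clean, scale=1.0)

# Stand-in for an estimated posterior (assumption): the exact noisy posterior
# of this toy model, whose clean log-odds are 4x.
p_pos = 1.0 / (1.0 + np.exp(-4.0 * x))
post_pos = (1 - rho_pos - rho_neg) * p_pos + rho_neg

rho_pos_hat, rho_neg_hat = estimate_noise_rates(post_pos)
print(rho_pos_hat, rho_neg_hat)
```

The estimates are accurate here because the sample contains points where the clean posterior is essentially zero or one, which is exactly the assumption under which the upper bound is attained.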
Note that . Thus, can be consistently estimated by if there exists an such that , where is the slope to the point in the receiver operating characteristic (ROC) space defined in [33, 35]. In the proof of Theorem 4, we also derived that
Since is non-negative, our estimator is consistent based on the assumption that there exists an such that . With this in mind, we can improve the theoretical analysis for estimating the inversed noise rates in [33, 35] (and the mixture proportion estimation) by employing finite sample analysis.
We can design importance reweighting algorithms for classification with label noise by employing the inversed noise rates.
When using the importance reweighting method to address the asymmetric RCN problem, the weight given to a noisy example can be derived by exploiting the inversed noise rates:
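The weight is non-negative if $P_{D_\rho}(\hat{Y} \mid X) \ne 0$. (The inversed noise rates, written $\pi_{+1}$ and $\pi_{-1}$ here, are defined so that $\pi_{+1} + \pi_{-1} < 1$; see .) If $P_{D_\rho}(\hat{Y} \mid X) = 0$, we intuitively let $\beta(X, \hat{Y}) = 0$.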
We employed Scott’s method  to estimate the inversed noise rates and found that the importance reweighting method exploiting the estimated inversed noise rates did not perform well, so the results are omitted. There might be two reasons which could possibly explain the poor performance: (1) Scott’s estimator has the form of density ratio estimation, and is more complex than our estimator , which has the form of the conditional distribution. (2) How to choose the kernel width to obtain the ROC in Scott’s method has remained elusive.
7 Proofs
In this section, we provide detailed proofs of the assertions made in previous sections.
7.1 Proof of Lemma 1
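For the label noise problem, we have shown that
$$\beta(X, \hat{Y}) = \frac{P_D(Y \mid X)}{P_{D_\rho}(\hat{Y} \mid X)}.$$
When the label noise is of asymmetric RCN, we have
$$P_{D_\rho}(\hat{Y} = +1 \mid X) = P(\hat{Y} = +1 \mid Y = +1) P_D(Y = +1 \mid X) + P(\hat{Y} = +1 \mid Y = -1) P_D(Y = -1 \mid X)$$
$$= (1 - \rho_{+1}) P_D(Y = +1 \mid X) + \rho_{-1} (1 - P_D(Y = +1 \mid X)) = (1 - \rho_{+1} - \rho_{-1}) P_D(Y = +1 \mid X) + \rho_{-1}.$$
Similarly, it gives
$$P_{D_\rho}(\hat{Y} = -1 \mid X) = (1 - \rho_{+1} - \rho_{-1}) P_D(Y = -1 \mid X) + \rho_{+1}.$$
We therefore have
$$P_D(Y = \hat{Y} \mid X) = \frac{P_{D_\rho}(\hat{Y} \mid X) - \rho_{-\hat{Y}}}{1 - \rho_{+1} - \rho_{-1}}, \qquad \beta(X, \hat{Y}) = \frac{P_{D_\rho}(\hat{Y} \mid X) - \rho_{-\hat{Y}}}{(1 - \rho_{+1} - \rho_{-1}) P_{D_\rho}(\hat{Y} \mid X)}.$$
We intuitively let $\beta(X, \hat{Y}) = 0$, if $P_{D_\rho}(\hat{Y} \mid X) = 0$. Since $1 - \rho_{+1} - \rho_{-1} > 0$ and the two displays above give $P_{D_\rho}(\hat{Y} \mid X) - \rho_{-\hat{Y}} = (1 - \rho_{+1} - \rho_{-1}) P_D(Y = \hat{Y} \mid X) \ge 0$, we can conclude that $\beta(X, \hat{Y}) \ge 0$.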
7.2 Proof of Proposition 1
We start by introducing the Rademacher complexity method  for deriving generalization bounds.
Let be independent Rademacher variables, be i.i.d. variables and be a real-valued function class. The Rademacher complexity of the function class over the variable is defined as
Theorem 5 ()
Let be a real-valued function class on , and
The following theorem, proven utilizing Theorem 5 and Hoeffding’s inequality, plays an important role in deriving the generalization bounds.
Theorem 6 ()
Let be an -valued function class on , and . Then, for any and any , with probability at least , we have
According to Theorem 6, we can easily prove that for any -valued function class and , with probability at least , the following holds
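Since $\beta(X, \hat{Y})$ is upper bounded by
$$\frac{1}{1 - \rho_{+1} - \rho_{-1}},$$
using the Lipschitz composition property of Rademacher complexity, which is also known as Talagrand’s Lemma (see, e.g., Lemma 4.2 in ), we have
$$R(\beta \circ \ell \circ F) \le \frac{1}{1 - \rho_{+1} - \rho_{-1}} R(\ell \circ F),$$
where $R(\cdot)$ denotes the Rademacher complexity of the indicated function class.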
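Proposition 1 can be proven together with the fact that $R_{\ell,D}(f) = R_{\beta\ell,D_\rho}(f)$, i.e., the expected risk on the clean distribution equals the reweighted expected risk on the noisy distribution.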
7.3 Proof of Theorem 2
We begin with the following lemma.
Let be a universal kernel, where is a feature map into a feature space. Let
Then, and will converge to their target distributions and in the induced RKHS , respectively.
The proof relies on the following theorem proven by Gretton et al. .
Let be the space of all probability distributions on an RKHS induced by a universal kernel . Define as the expectation operator that . The operator is a bijection between and .
Proof of Lemma 3. Since
using the weak law of large numbers, for any , we have
So, will converge to its target distribution .
We then prove that