Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates
Learning with noisy labels is a common problem in supervised learning. Existing approaches require practitioners to specify noise rates, i.e., a set of parameters controlling the severity of label noise in the problem. In this work, we introduce a technique to learn from noisy labels that does not require a priori specification of the noise rates. In particular, we introduce a new family of loss functions that we name peer loss functions. Our approach uses a standard empirical risk minimization (ERM) framework with peer loss functions. Peer loss functions associate each training sample with a certain form of “peer” samples, which evaluate a classifier’s predictions jointly. We show that, under mild conditions, performing ERM with peer loss functions on the noisy dataset leads to the optimal or a near-optimal classifier, as if performing ERM over the clean training data, which we do not have access to. To the best of our knowledge, this is the first result on “learning with noisy labels without knowing noise rates” with theoretical guarantees. We pair our results with an extensive set of experiments, where we compare with state-of-the-art techniques for learning with noisy labels. Our results show that the peer-loss-based method consistently outperforms the baseline benchmarks. Peer loss provides a way to simplify model development when facing potentially noisy training labels, and can serve as a robust candidate loss function in such situations.
The quality of supervised learning models depends on the training data. In practice, label noise can arise for a host of reasons. For instance, the observed labels may represent human observations of a ground truth label. In this case, human annotators may observe the label imperfectly due to differing degrees of expertise or measurement error; see, e.g., medical examples such as labeling MRI images from patients. Many prior approaches to this problem in the machine learning literature aim to develop algorithms to learn models that are robust to label noise (Bylander, 1994; Cesa-Bianchi et al., 1999, 2011; Ben-David et al., 2009; Scott et al., 2013; Natarajan et al., 2013; Scott, 2015). Typical approaches require a priori knowledge of noise rates, i.e., a set of parameters that control the severity of label noise. Working with unknown noise rates is difficult in practice: often, one must estimate the noise rates from data, which may require additional data collection (Natarajan et al., 2013; Scott, 2015; van Rooyen and Williamson, 2015) (e.g., a redundant set of noisy labels for each sample point, or a set of ground truth labels for tuning these parameters) and may introduce estimation error that can affect the final model in less predictable ways.
In this paper, we introduce a new family of loss functions, peer loss functions, to empirical risk minimization (ERM), for a broad class of learning with noisy labels problems. Peer loss functions operate under different noise rates without requiring either a priori knowledge of the embedded noise rates or an estimation procedure. This family of loss functions builds on approaches developed in the peer prediction literature (Miller et al., 2005; Dasgupta and Ghosh, 2013; Shnayder et al., 2016), which studies how to elicit information from self-interested agents without verification. Typical approaches in the peer prediction literature design scoring functions that score each reported datum using another noisy reference answer, without accessing ground truth information. We borrow this idea and the associated scoring functions by treating each classifier’s prediction as an agent’s private information to be elicited and evaluated, and the noisy label as an imperfect reference from a “noisy label agent”. The peer loss evaluates a classifier’s predictions using noisy labels on both the targeted samples and a particular form of “peer” samples, which turns out to capture the true risk of the classifier, up to an affine transformation.
The main contributions of this work are:
To the best of our knowledge, this is the first work proposing a loss function that i) is robust to label noise with formal theoretical guarantees and ii) requires no prior knowledge of the noise rates. We believe the second feature is non-trivial progress, and offers a promising solution to deploy in a noisy training environment.
We present extensive experimental results to validate the usefulness of peer loss (Section 5 and Appendix). This result is encouraging as it removes the long-standing requirement of learning the noise rates before any of the existing methods can be applied.
We will contribute to the community by publishing our code and implementations.
1.1 Related Work
Learning from Noisy Labels
Our work fits within a stream of research on learning with noisy labels. A large stream of research on this topic works with the random classification noise (RCN) model, where observed labels are flipped independently with a fixed probability (Bylander, 1994; Cesa-Bianchi et al., 1999, 2011; Ben-David et al., 2009; Scott et al., 2013; Natarajan et al., 2013; Scott, 2015). Recently, learning with asymmetric noisy data (also referred to as class-conditional random classification noise (CCN)) for binary classification problems has been rigorously studied in (Stempfel and Ralaivola, 2009; Scott et al., 2013; Natarajan et al., 2013; Scott, 2015; van Rooyen and Williamson, 2015; Menon et al., 2015). For a more thorough survey of classical results on learning with noisy data, please refer to (Frénay and Verleysen, 2014). More recent developments include an importance re-weighting algorithm (Liu and Tao, 2016), a noisy deep neural network learning setting (Sukhbaatar and Fergus, 2014), and learning from massive noisy data for image classification (Xiao et al., 2015), among many others.
Our work also builds on the literature for peer prediction (Prelec, 2004; Miller et al., 2005; Witkowski and Parkes, 2012; Radanovic and Faltings, 2013; Witkowski et al., 2013; Dasgupta and Ghosh, 2013; Shnayder et al., 2016; Liu and Chen, 2017). The seminal work (Miller et al., 2005) established that strictly proper scoring rules (Gneiting and Raftery, 2007) could be adopted in the peer prediction setting for eliciting truthful reports from self-interested agents. A sequence of follow-up works has relaxed the assumptions imposed therein (Witkowski and Parkes, 2012; Radanovic and Faltings, 2013; Witkowski et al., 2013; Radanovic et al., 2016; Liu and Chen, 2017). Most relevant to us are (Dasgupta and Ghosh, 2013; Shnayder et al., 2016), where a correlated agreement (CA) type of mechanism was proposed. CA evaluates a report’s correlations with another reference agent; its specific form inspired our peer loss.
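Notations and preliminaries: For positive integer $n$, denote by $[n] := \{1, 2, \ldots, n\}$. Suppose $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ are drawn from a joint distribution $\mathcal{D}$, with their marginal distributions denoted as $\mathbb{P}_X, \mathbb{P}_Y$ respectively. We assume $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{-1, +1\}$, that is, we consider a binary classification problem. Denote by $p := \mathbb{P}(Y = +1) \in (0, 1)$. There are $N$ training samples $(x_1, y_1), \ldots, (x_N, y_N)$ drawn i.i.d. from $\mathcal{D}$.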
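Instead of observing the $y_n$s, the learner can only collect a noisy set of training labels $\tilde{y}_n$s, generated according to the $y_n$s and a certain error rate model; that is, we observe a dataset $\{(x_n, \tilde{y}_n)\}_{n=1}^{N}$. We assume a uniform error model for all the training samples we collect, in that errors in the $\tilde{y}_n$s follow the same error rate model: denoting the random variable for noisy labels as $\tilde{Y}$, we define
$$e_{+1} := \mathbb{P}(\tilde{Y} = -1 \mid Y = +1), \quad e_{-1} := \mathbb{P}(\tilde{Y} = +1 \mid Y = -1),$$
such that $e_{+1} + e_{-1} < 1$. This condition is not unlike the one imposed in the existing learning literature (Natarajan et al., 2013), and it simply implies that the noisy labels are positively correlated with the true labels (informative about the true labels). Label noise is conditionally independent of the features, that is, the error rate is uniform across the $x_n$s: $\mathbb{P}(\tilde{Y} = y' \mid Y = y) = \mathbb{P}(\tilde{Y} = y' \mid X, Y = y)$ for all $y, y' \in \{-1, +1\}$. Denote the distribution of the noisy data $(X, \tilde{Y})$ as $\tilde{\mathcal{D}}$.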
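$f: \mathcal{X} \to \mathbb{R}$ is a real-valued decision function, and its risk w.r.t. the 0-1 loss $\mathbb{1}(\cdot)$ is defined as
$$R_{\mathcal{D}}(f) := \mathbb{E}_{(X,Y) \sim \mathcal{D}}\left[\mathbb{1}(f(X), Y)\right].$$
The Bayes optimal classifier $f^*$ is the one that minimizes the 0-1 risk:
$$f^* = \arg\min_{f} R_{\mathcal{D}}(f).$$
Denote this optimal risk as $R^*$. Instead of minimizing the above 0-1 risk, the learner often uses a surrogate loss function $\ell: \mathbb{R} \times \{-1, +1\} \to \mathbb{R}_{+}$, and finds an $f \in \mathcal{F}$ that minimizes the following error: $\mathbb{E}_{(X,Y) \sim \mathcal{D}}[\ell(f(X), Y)]$. We denote the following risk measures:
$$R_{\ell, \mathcal{D}}(f) := \mathbb{E}_{(X,Y) \sim \mathcal{D}}\left[\ell(f(X), Y)\right], \quad R_{\ell, \tilde{\mathcal{D}}}(f) := \mathbb{E}_{(X,\tilde{Y}) \sim \tilde{\mathcal{D}}}\left[\ell(f(X), \tilde{Y})\right].$$
When there is no confusion, we will also short-hand $\mathbb{E}_{(X,Y) \sim \mathcal{D}}$ as $\mathbb{E}_{\mathcal{D}}$. Using $D$ to denote a dataset collected from distribution $\mathcal{D}$ (correspondingly $\tilde{D}$ for $\tilde{\mathcal{D}}$), the empirical risk measure for $f$ is defined as
$$\hat{R}_{\ell, D}(f) := \frac{1}{|D|} \sum_{(x, y) \in D} \ell(f(x), y).$$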
2.1 Learning with noisy labels
Typical methods for learning with noisy labels include bias-removing surrogate loss functions (Natarajan et al., 2013). For instance, Natarajan et al. (2013) tackle this problem by defining an “unbiased” surrogate loss function $\tilde{\ell}$ over $\ell$ to help “remove” noise, when $e_{+1} + e_{-1} < 1$: $\tilde{\ell}$ is identified such that when a prediction is evaluated against a noisy label using this surrogate loss function, the prediction is as if evaluated against the ground-truth label using $\ell$ in expectation. Hence the loss of the prediction is “unbiased”, that is, for any prediction $t$, $\mathbb{E}_{\tilde{Y} \mid y}[\tilde{\ell}(t, \tilde{Y})] = \ell(t, y)$ [Lemma 1, (Natarajan et al., 2013)].
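The unbiasedness property can be checked numerically. The sketch below assumes the standard form of the noise-corrected loss from Natarajan et al. (2013), written with class-dependent flip rates; the function names and the logistic base loss are illustrative choices, not part of this paper.

```python
import math


def logistic_loss(t, y):
    """Logistic loss for a real-valued score t and a label y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * t))


def unbiased_loss(loss, t, y_noisy, e_pos, e_neg):
    """Noise-corrected loss evaluated on a noisy label.

    e_pos = P(noisy = -1 | clean = +1), e_neg = P(noisy = +1 | clean = -1).
    Assumed form: ((1 - e_{-y}) * loss(t, y) - e_{y} * loss(t, -y)) / (1 - e_pos - e_neg).
    """
    flip = {+1: e_pos, -1: e_neg}
    denom = 1.0 - e_pos - e_neg
    return ((1.0 - flip[-y_noisy]) * loss(t, y_noisy)
            - flip[y_noisy] * loss(t, -y_noisy)) / denom


def expected_noisy_loss(loss, t, y_clean, e_pos, e_neg):
    """Expectation of the corrected loss over the noisy label, given the clean label."""
    flip = e_pos if y_clean == +1 else e_neg
    return ((1.0 - flip) * unbiased_loss(loss, t, y_clean, e_pos, e_neg)
            + flip * unbiased_loss(loss, t, -y_clean, e_pos, e_neg))
```

Note that `unbiased_loss` needs both flip rates as inputs, which is exactly the requirement that peer loss is designed to remove.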
One important note is that most, if not all, existing solutions require knowledge of the error rates $e_{+1}, e_{-1}$. Previous works either assume this knowledge, or need additional clean labels or redundant noisy labels to estimate the rates. This becomes the bottleneck in applying these techniques in practice. Our work is motivated by the desire to remove this limitation.
2.2 Peer Prediction: Information Elicitation without Verification
Peer prediction is a technique developed to truthfully elicit information when there is no ground truth verification. Suppose we are interested in eliciting private observations about a binary event $y \in \{-1, +1\}$ generated according to a random variable $Y$. There are $K$ agents indexed by $[K]$. Each of them holds a noisy observation of $y$, denoted as $y_i$, $i \in [K]$. We would like to elicit the $y_i$s, but they are completely private and we will not observe $y$ to evaluate agents’ reports. Denote by $r_i$ the reported data from each agent $i$. It is entirely possible that $r_i \neq y_i$ if agents are not compensated properly for their information. Results in peer prediction have proposed scoring or reward functions that evaluate an agent’s report using the reports of other peer agents. For example, a peer prediction mechanism may reward agent $i$ for her report $r_i$ using $S(r_i, r_j)$, where $r_j$ is the report of a randomly selected reference agent $j \in [K] \setminus \{i\}$. The scoring function $S$ is designed so that truth-telling is a strict Bayesian Nash Equilibrium (implying other agents truthfully report their $y_j$), that is, for all $i$,
$$\mathbb{E}_{y_j}\left[S(y_i, y_j) \mid y_i\right] > \mathbb{E}_{y_j}\left[S(r_i, y_j) \mid y_i\right], \quad \forall r_i \neq y_i.$$
There is a rich literature on proposing and studying peer prediction scoring functions, but we will focus on the following knowledge-free peer prediction mechanism, which requires only a minimal amount of prior knowledge of the data sources to implement.
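Correlated Agreement (CA) (Shnayder et al., 2016; Dasgupta and Ghosh, 2013) is a recently established peer prediction mechanism for a multi-task setting (we provide other examples of peer prediction functions in the Appendix). CA is also the core and the focus of our subsequent sections on developing peer prediction based loss functions. This mechanism builds on a $\Delta$ matrix that captures the stochastic correlation between the two sources of predictions $y_i$ and $y_j$. Denote the following mapping function: $g(1) = -1$, $g(2) = +1$. $\Delta \in \mathbb{R}^{2 \times 2}$ is then defined as a square matrix with its entries defined as follows:
$$\Delta_{k,l} = \mathbb{P}(y_i = g(k), y_j = g(l)) - \mathbb{P}(y_i = g(k))\,\mathbb{P}(y_j = g(l)), \quad k, l = 1, 2.$$
The intuition of the above matrix is that each entry of $\Delta$ captures the marginal correlation between the two predictions. $\operatorname{Sgn}(\Delta)$ is defined as the sign matrix of $\Delta$, with $\operatorname{Sgn}(x) = 1$ if $x > 0$ and $\operatorname{Sgn}(x) = 0$ otherwise. Define the following score matrix
$$M(y, y') := \operatorname{Sgn}\left(\Delta_{g^{-1}(y),\, g^{-1}(y')}\right), \tag{2}$$
where $g^{-1}$ is the inverse function of $g$. CA requires each agent $i$ to perform multiple tasks: denote agent $i$’s observations for the $N$ tasks as $y_{i,1}, \ldots, y_{i,N}$. Ultimately the scoring function $S(\cdot)$ for each task $k$ that is shared between $i$ and $j$ is defined as follows: randomly draw two other tasks $k_1, k_2$ with $k_1 \neq k_2 \neq k$,
$$S(y_{i,k}, y_{j,k}) := M(y_{i,k}, y_{j,k}) - M(y_{i,k_1}, y_{j,k_2}).$$
Note a key difference between the first and second terms is that the second term is defined for two independent peer tasks $k_1, k_2$ (as the reference answers). It was established in (Shnayder et al., 2016) that if $y_j$ is categorical w.r.t. $y_i$, i.e., $\mathbb{P}(y_j = y' \mid y_i = y) < \mathbb{P}(y_j = y')$ for all $y' \neq y$, then $S(\cdot)$ is strictly truthful (Theorem 4.4, Shnayder et al. (2016)).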
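As a minimal sketch of the CA construction, the code below builds the correlation matrix from a joint distribution over two binary sources, takes its sign matrix, and scores one task against two peer tasks. It indexes the matrices directly by the labels in {-1, +1} instead of going through an index mapping, and all function names are hypothetical.

```python
LABELS = (-1, +1)


def delta_matrix(joint):
    """joint[(a, b)] = P(source_i = a, source_j = b) for a, b in {-1, +1}.

    Returns Delta[(a, b)] = joint probability minus product of marginals.
    """
    p_i = {a: sum(joint[(a, b)] for b in LABELS) for a in LABELS}
    p_j = {b: sum(joint[(a, b)] for a in LABELS) for b in LABELS}
    return {(a, b): joint[(a, b)] - p_i[a] * p_j[b] for a in LABELS for b in LABELS}


def score_matrix(delta):
    """Sign matrix: 1 where the entry of Delta is positive, 0 otherwise."""
    return {key: 1 if value > 0 else 0 for key, value in delta.items()}


def ca_score(M, reports_i, reports_j, k, k1, k2):
    """Score task k: agreement on the shared task minus agreement on
    two different peer tasks k1 and k2."""
    return M[(reports_i[k], reports_j[k])] - M[(reports_i[k1], reports_j[k2])]


# Example: source i is a balanced clean label, source j flips +1 with
# probability 0.3 and -1 with probability 0.2 (illustrative numbers).
example_joint = {(+1, +1): 0.35, (+1, -1): 0.15, (-1, +1): 0.10, (-1, -1): 0.40}
example_M = score_matrix(delta_matrix(example_joint))
```

In this example the two sources are positively correlated, so the score matrix has ones on the diagonal and zeros elsewhere.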
3 Learning with noisy data: the peer prediction approach
In this section, we show that peer prediction scoring functions, when specified properly, will adopt Bayes optimal classifier as their maximizers (or minimizers for the corresponding loss form).
We first state our problem of learning with noisy labels as a peer prediction problem. The connection is made by firstly rephrasing the two data sources, the classifiers and the noisy labels, from agents’ perspective. For a task , say for example, denote the noisy labels as . In general, can be interpreted as the agent that observes for a set of randomly drawn feature vectors : Suppose the agent’s observations are defined as follows (similar to the definition of ): Denote another agent whose observations “mimic” the Bayes optimal classifier . Again denote this optimal classifier agent as :
Suppose we would like to elicit predictions from the optimal classifier agent , while the reports from the noisy label agent will serve as the reference reports. Both and are randomly assigned a task , and each of them observes a signal and respectively. Denote the report from agent as . A scoring function is called to induce strictly truthfulness if the following fact holds: , Taking the negative of (changing a reward score one aims to maximize to a loss to minimize) we also have ,, implying when taking as the loss function, minimizing w.r.t. will return us the Bayes optimal classifier . Our idea can be summarized easily using Fig. 1.
When there is no ambiguity, we will shorthand as , with keeping in mind that encode the randomness in . Suppose is able to elicit the Bayes optimal classifier () using , we have the following theorem formally:
This proof can be done by showing that any classifier other than the Bayes optimal one corresponds to a mis-reporting strategy, thus establishing its non-optimality. We emphasize that requiring a strictly truthful peer prediction scoring function is not overly restrictive. We provide discussions in the Appendix.
4 Peer Loss Function
We now present peer loss, a family of loss functions inspired by a particular peer prediction mechanism, the correlated agreement (CA), as presented in Section 2.2. We are going to show that peer loss is able to induce the minimizer of a concept class , under a broad set of non-restrictive conditions.
To give a gentle start, we repeat the setting of CA for our classification problem in the setting we introduced above in Section 3.
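$\Delta$ and scoring matrix
First recall that $\Delta \in \mathbb{R}^{2 \times 2}$ is a square matrix with entries defined between the optimal classifier’s predictions $f^*(X)$ and the noisy labels $\tilde{Y}$:
$$\Delta_{k,l} = \mathbb{P}(f^*(X) = g(k), \tilde{Y} = g(l)) - \mathbb{P}(f^*(X) = g(k))\,\mathbb{P}(\tilde{Y} = g(l)), \quad k, l = 1, 2.$$
Recall $g$ is simply a mapping function: $g(1) = -1$, $g(2) = +1$. Then the scoring matrix $M$, the sign matrix of $\Delta$, is computed.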
Consider a binary class label case: , the noises in the labels are and . Then we have
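For each sample $(x_n, \tilde{y}_n)$, randomly draw another two samples $(x_{n_1}, \tilde{y}_{n_1})$, $(x_{n_2}, \tilde{y}_{n_2})$ such that $n_1 \neq n_2$. We will name $(x_{n_1}, \tilde{y}_{n_1})$, $(x_{n_2}, \tilde{y}_{n_2})$ as $n$’s peer samples. The scoring function for each sample point is defined as follows:
$$S(f(x_n), \tilde{y}_n) := M(f(x_n), \tilde{y}_n) - M(f(x_{n_1}), \tilde{y}_{n_2}).$$
Recall $M$ is defined in Eqn. (2). Define the loss function $\ell_{CA}$ as the negative of $S$:
$$\ell_{CA}(f(x_n), \tilde{y}_n) := \left(1 - M(f(x_n), \tilde{y}_n)\right) - \left(1 - M(f(x_{n_1}), \tilde{y}_{n_2})\right). \tag{3}$$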
According to Theorem 1, we already know that minimizing is going to find the optimal Bayes classifier, if and are categorical:
When and , and ( and ) are categorical.
We need to know $\operatorname{Sgn}(\Delta)$ in order to specify $M$, which requires knowing certain information about the optimal classifier $f^*$ and the noisy labels $\tilde{Y}$. We show that for the cases that the literature is broadly interested in, $\operatorname{Sgn}(\Delta)$ is simply the identity matrix:
If , then , i.e., the identity matrix.
means that the optimal classifier is at least informative ((Liu and Chen, 2017)) - if otherwise, we can flip the classifier’s output to obtain one.
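When $M$ is the identity matrix, $M(y, y') = 1$ if $y = y'$, and 0 otherwise. $\ell_{CA}$ defined in Eqn. (3) reduces to the following form:
$$\mathbb{1}_{\text{peer}}(f(x_n), \tilde{y}_n) = \mathbb{1}(f(x_n), \tilde{y}_n) - \mathbb{1}(f(x_{n_1}), \tilde{y}_{n_2}). \tag{4}$$
To see this, note for instance that $1 - M(f(x_n), \tilde{y}_n) = \mathbb{1}(f(x_n), \tilde{y}_n)$. Replacing $\mathbb{1}(\cdot)$ with any generic loss $\ell(\cdot)$ we define:
$$\ell_{\text{peer}}(f(x_n), \tilde{y}_n) = \ell(f(x_n), \tilde{y}_n) - \ell(f(x_{n_1}), \tilde{y}_{n_2}). \tag{5}$$
We name the above loss peer loss. This strikingly simple form implies that knowing that $e_{-1} + e_{+1} < 1$ holds is all we need to specify $\ell_{\text{peer}}$. The rest of the presentation focuses on $\ell_{\text{peer}}$ defined in Eqn. (5), keeping in mind that replacing $\ell$ with $\mathbb{1}$ in $\ell_{\text{peer}}$ recovers $\mathbb{1}_{\text{peer}}$.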
ERM with peer loss
(1) Peer loss is a “multi-sample” loss. For each sample point , we need to pair it with uniformly randomly sampled “peer samples” and - we further illustrate this in Fig. 3 in Appendix. (2) The definition of does not require the knowledge of either or .
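A minimal sketch of the empirical peer risk follows, reading the peer loss as the loss on the sample itself minus the loss on the prediction for one peer evaluated against the noisy label of a second peer. Drawing both peers to be different from the current sample is an assumption of this sketch, and the function and parameter names are hypothetical.

```python
import math
import random


def logistic_loss(score, label):
    """Logistic loss for a real-valued score and a label in {-1, +1}."""
    return math.log(1.0 + math.exp(-label * score))


def zero_one_loss(score, label):
    """0-1 loss: 1 when the sign of the score disagrees with the label."""
    return 0.0 if score * label > 0 else 1.0


def empirical_peer_risk(f, xs, noisy_ys, loss, rng):
    """Average of loss(f(x_n), y~_n) - loss(f(x_n1), y~_n2) over the dataset.

    n1 and n2 are two distinct indices drawn uniformly at random, both
    different from n. Requires at least three samples.
    """
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three samples to draw two peers")
    total = 0.0
    for i in range(n):
        # Draw three distinct indices, drop i if present, keep two peers.
        candidates = [j for j in rng.sample(range(n), 3) if j != i]
        n1, n2 = candidates[0], candidates[1]
        total += loss(f(xs[i]), noisy_ys[i]) - loss(f(xs[n1]), noisy_ys[n2])
    return total / n
```

No noise rate appears anywhere in the computation; only the noisy labels and the randomly drawn peer indices are needed.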
4.1 Property of Peer Loss
We will denote , and denote by the expected peer loss of when , as well as its peer samples, are drawn i.i.d. from distribution . We now present a key property of peer loss, which shows that its risk over the noisy labels is simply an affine transformation of its risk over the clean ones.
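The affine relationship can be checked exactly on a small discrete problem. The sketch below assumes the scaling factor is $1 - e_{+1} - e_{-1}$ with no additive offset, which is an assumption of this example; the toy distribution and all names are hypothetical.

```python
from itertools import product

# Hypothetical toy problem with three feature values.
P_X = [0.2, 0.5, 0.3]      # P(X = x)
ETA = [0.9, 0.4, 0.1]      # P(Y = +1 | X = x) for the clean labels
E_POS, E_NEG = 0.3, 0.1    # P(noisy=-1 | clean=+1), P(noisy=+1 | clean=-1)


def peer_risk(f, eta):
    """Exact expected 0-1 peer loss of f (a tuple of +1/-1, one per x)
    when the label satisfies P(label = +1 | x) = eta[x]."""
    p_pos = sum(px * e for px, e in zip(P_X, eta))  # marginal P(label = +1)
    # First term: error of f(X) against the label of the same sample.
    own = sum(px * ((1 - e) if fx == 1 else e) for px, e, fx in zip(P_X, eta, f))
    # Peer term: f(X_n1) against an independent label drawn from the marginal.
    peer = sum(px * ((1 - p_pos) if fx == 1 else p_pos) for px, fx in zip(P_X, f))
    return own - peer


noisy_eta = [e * (1 - E_POS) + (1 - e) * E_NEG for e in ETA]
scale = 1 - E_POS - E_NEG
max_gap = max(abs(peer_risk(f, noisy_eta) - scale * peer_risk(f, ETA))
              for f in product((-1, 1), repeat=len(P_X)))
```

Because the relation holds for every classifier on this toy problem, ranking classifiers by noisy peer risk gives the same order as ranking them by clean peer risk.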
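Denote by $\tilde{f}^*_{\mathbb{1}_{\text{peer}}}$ the minimizer over $f \in \mathcal{F}$ of the expected 0-1 peer loss on the noisy distribution. With Lemma 3, we can easily prove the following:
When $p = 0.5$, $\tilde{f}^*_{\mathbb{1}_{\text{peer}}} \in \arg\min_{f \in \mathcal{F}} R_{\mathcal{D}}(f)$.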
The above theorem states that for a class-balanced dataset with , peer loss induces the same minimizer as the one that minimizes the 0-1 loss on the clean data. When removing the constraint of , i.e., , we have . In practice we can balance the dataset so that . But when , denote by , we have the following theorem:
When , suppose the following conditions hold: (1) ; (2) ; (3) , where . Then , if is bounded with denoting its max and min.
Condition (1) is a well-adopted assumption in the literature of learning with noisy labels. When , we have conditions (2) and (3) hold: When is small, i.e., is closer to , this condition becomes weaker, as we will afford to have a small but also a small .
4.2 -weighted peer loss
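We take a further look at the case with $p \neq 0.5$. Denote by $R_{+1}(f) := \mathbb{P}(f(X) = -1 \mid Y = +1)$ and $R_{-1}(f) := \mathbb{P}(f(X) = +1 \mid Y = -1)$. It is easy to prove:
Minimizing $\mathbb{E}[\mathbb{1}_{\text{peer}}(f(X), Y)]$ is equivalent to minimizing $R_{-1}(f) + R_{+1}(f)$.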
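A short worked derivation of this equivalence, assuming $R_{+1}(f)$ and $R_{-1}(f)$ denote the class-conditional error rates $\mathbb{P}(f(X) = -1 \mid Y = +1)$ and $\mathbb{P}(f(X) = +1 \mid Y = -1)$, $p = \mathbb{P}(Y = +1)$, and the peer pair $(X_{n_1}, Y_{n_2})$ is drawn independently:

```latex
\begin{align*}
\mathbb{P}(f(X) \neq Y) &= p\,R_{+1}(f) + (1-p)\,R_{-1}(f), \\
\mathbb{P}(f(X) = +1) &= p\,\bigl(1 - R_{+1}(f)\bigr) + (1-p)\,R_{-1}(f), \\
\mathbb{P}(f(X_{n_1}) \neq Y_{n_2}) &= (1-p)\,\mathbb{P}(f(X) = +1) + p\,\mathbb{P}(f(X) = -1), \\
\mathbb{E}\bigl[\mathbb{1}_{\text{peer}}(f(X), Y)\bigr]
  &= \mathbb{P}(f(X) \neq Y) - \mathbb{P}(f(X_{n_1}) \neq Y_{n_2}) \\
  &= 2p(1-p)\,\bigl(R_{+1}(f) + R_{-1}(f) - 1\bigr).
\end{align*}
```

Since $2p(1-p) > 0$ for $p \in (0, 1)$, the expected peer loss on clean data is an increasing function of the unweighted sum $R_{+1}(f) + R_{-1}(f)$, whatever the class prior is.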
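However, minimizing the true risk $R_{\mathcal{D}}(f)$ is equivalent to minimizing $p \cdot R_{+1}(f) + (1-p) \cdot R_{-1}(f)$, a weighted sum of $R_{+1}(f)$ and $R_{-1}(f)$ with weights $p$ and $1-p$. The above observation, as well as the failure to reproduce the strong theoretical guarantee when $p \neq 0.5$, motivated us to study an $\alpha$-weighted version of peer loss, to make it robust to the case $p \neq 0.5$. We propose the following $\alpha$-weighted peer loss by adding a weight $\alpha \geq 0$ to the second term, the peer term:
$$\ell_{\alpha\text{-peer}}(f(x_n), \tilde{y}_n) = \ell(f(x_n), \tilde{y}_n) - \alpha \cdot \ell(f(x_{n_1}), \tilde{y}_{n_2}).$$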
Denote as when replacing with , as the optimal classifier under , and . Then we have:
Let . Then .
Denote . Several remarks follow:
When , we have , we recover the earlier definition of .
When , , we recover for the clean learning setting.
When the signs of and are the same, . Otherwise, . In other words, when the noise changes the relative quantitative relationship of and , and vice versa.
Knowing requires certain knowledge of when . Though we do not claim this knowledge, this result implies tuning (using validation data) may improve the performance.
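The role of the weight can be illustrated with an exact computation on a toy problem with an unbalanced prior. The sketch assumes the optimal weight has the form $\alpha^* = 1 - (1 - e_{-1} - e_{+1}) \cdot \delta_p / \delta_{\tilde{p}}$, where $\delta_p$ and $\delta_{\tilde{p}}$ are the differences between the positive and negative class priors for the clean and noisy labels. This form is an assumption of the example, chosen because it agrees with the remarks above (it equals 1 for a balanced prior and 0 without noise); the distribution and names are hypothetical.

```python
from itertools import product

# Hypothetical toy problem with an unbalanced prior (p = 0.64).
P_X = [0.4, 0.4, 0.2]      # P(X = x)
ETA = [0.9, 0.6, 0.2]      # P(Y = +1 | X = x) for the clean labels
E_POS, E_NEG = 0.1, 0.3    # P(noisy=-1 | clean=+1), P(noisy=+1 | clean=-1)


def marginal(eta):
    """Marginal probability of a +1 label."""
    return sum(px * e for px, e in zip(P_X, eta))


def zero_one_risk(f, eta):
    """Exact 0-1 risk of f (a tuple of +1/-1, one per x)."""
    return sum(px * ((1 - e) if fx == 1 else e) for px, e, fx in zip(P_X, eta, f))


def alpha_peer_risk(f, eta, alpha):
    """Exact expected 0-1 peer loss with the peer term weighted by alpha."""
    p_pos = marginal(eta)
    peer = sum(px * ((1 - p_pos) if fx == 1 else p_pos) for px, fx in zip(P_X, f))
    return zero_one_risk(f, eta) - alpha * peer


noisy_eta = [e * (1 - E_POS) + (1 - e) * E_NEG for e in ETA]
scale = 1 - E_POS - E_NEG
delta_p = 2 * marginal(ETA) - 1
delta_p_noisy = 2 * marginal(noisy_eta) - 1
alpha_star = 1 - scale * delta_p / delta_p_noisy  # assumed form of the optimal weight

classifiers = list(product((-1, 1), repeat=len(P_X)))
# If the weighted noisy risk is an affine function of the clean 0-1 risk,
# these offsets are identical for every classifier.
offsets = [alpha_peer_risk(f, noisy_eta, alpha_star) - scale * zero_one_risk(f, ETA)
           for f in classifiers]
offset_spread = max(offsets) - min(offsets)
best_noisy = min(classifiers, key=lambda f: alpha_peer_risk(f, noisy_eta, alpha_star))
best_clean = min(classifiers, key=lambda f: zero_one_risk(f, ETA))
```

Computing `alpha_star` here uses the clean prior and the noise rates, which are unavailable in practice; this is why the weight is treated as a hyperparameter to tune on validation data.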
With probability at least ,
4.3 Calibration and Generalization
So far our results have focused on minimizing 0-1 losses, which is hard in practice. We provide evidence of the calibration and convexity of $\ell_{\text{peer}}$, and of $\ell_{\alpha\text{-peer}}$ in general, with a generic and differentiable calibrated loss. We consider an $\ell$ that is classification calibrated, convex and $L$-Lipschitz.
Classification calibration describes the property that the convergence to optimality using a loss function would also guarantee the convergence to optimality with 0-1 loss:
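$\ell$ is classification calibrated if there exists a convex, invertible, nondecreasing transformation $\Psi_{\ell}$ with $\Psi_{\ell}(0) = 0$ s.t. $\Psi_{\ell}\left(R_{\mathcal{D}}(f) - R^*\right) \leq R_{\ell, \mathcal{D}}(f) - \min_{f} R_{\ell, \mathcal{D}}(f)$.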
Denote . Below we provide sufficient conditions for to be calibrated.
is classification calibrated when either of the following two conditions holds: (1) (i.e., ), , and satisfies the following: (2) , and .
(1) states that the optimal classifier not only achieves the smallest risk over the clean distribution but also performs the worst on the “opposite” distribution with flipped labels. (2) is satisfied by some common loss functions, such as the square loss and the logistic loss, as noted in (Natarajan et al., 2013).
Under the calibration condition, and denote the corresponding calibration function for as . Denote by We have the following generalization bound:
The following generalization bound holds for with probability at least :
where is Rademacher complexity of .
In our experiments, we resort to neural networks, which are more robust to non-convex loss functions. Nonetheless, despite the fact that $\ell_{\text{peer}}$ is not convex in general, [Lemma 5, (Natarajan et al., 2013)] informs us that as long as the empirical risk is close to some convex function, mirror gradient type algorithms will converge to a small neighborhood of the optimal point when performing ERM with $\ell_{\text{peer}}$. A natural candidate for this convex function is the expected peer loss, which the empirical risk approaches as the number of samples grows. We provide sufficient conditions for the expected peer loss to be convex in the Appendix (Lemma 8).
|Task||Equalized Prior||No Prior Equalization|
|Breast Cancer||0.2, 0.4||0.63||0.534||0.538||0.73||0.674||0.672||0.698||0.672|
We implemented a two-layer ReLU Multi-Layer Perceptron (MLP) for classification tasks on 10 UCI benchmarks and applied our peer loss to update its parameters. We show the robustness of peer loss with increasing rates of label noise on 10 real-world datasets. We compare the performance of our peer loss based method with the surrogate loss method (Natarajan et al., 2013) (with known error rates), C-SVM (Liu et al., 2003) and PAM (Khardon and Wachman, 2007), which are state-of-the-art methods for dealing with random binary classification noise, as well as a neural network solution with binary cross entropy loss (NN). We use a cross-validation set to tune the parameters specific to the algorithms. For surrogate loss, we use the true error rates $e_{-1}$ and $e_{+1}$ instead of learning them on the validation set. Thus, surrogate loss can be considered a favored baseline method. The accuracy of a classification algorithm is defined as the fraction of examples in the test set classified correctly with respect to the clean, true label. For given noise rates $e_{-1}$ and $e_{+1}$, labels of the training data are flipped accordingly.
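The label-flipping step can be sketched as follows, assuming each training label is flipped independently with the rate of its clean class. The function name and the use of a seeded generator are illustrative choices, not a description of the authors' implementation.

```python
import random


def flip_labels(labels, e_neg, e_pos, rng):
    """Return noisy copies of clean labels in {-1, +1}.

    A +1 label becomes -1 with probability e_pos, and a -1 label becomes +1
    with probability e_neg; each label is flipped independently.
    """
    noisy = []
    for y in labels:
        rate = e_pos if y == 1 else e_neg
        noisy.append(-y if rng.random() < rate else y)
    return noisy


# Example usage with the noise rates (0.2, 0.4) that appear in Table 1.
rng = random.Random(0)
clean = [1] * 10 + [-1] * 10
noisy = flip_labels(clean, e_neg=0.2, e_pos=0.4, rng=rng)
```

Test accuracy is still measured against the clean labels, so only the training labels would pass through such a function.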
A subset of the experimental results is shown in Table 1. A full table with all details can be found in the Appendix. Equalized Prior means that we pre-sample the dataset to guarantee $p = 0.5$. For this case we used $\ell_{\text{peer}}$ without $\alpha$ (or rather $\alpha = 1$ as in $\ell_{\alpha\text{-peer}}$). For $p \neq 0.5$, we use a validation dataset (with noisy labels) to tune $\alpha$. Our method is competitive across all datasets and even outperforms the surrogate loss method with access to the true error rates on a number of datasets. Fig. 2 shows that our method can prevent over-fitting when facing noisy labels. More results are available in the Appendix.
This paper introduces peer loss, a family of loss functions that enables training a classifier over noisy labels without using explicit knowledge of the noise rates of the labels. We provide both theoretical justification and extensive experimental evidence.
- Convexity, classification, and risk bounds. Journal of the American Statistical Association 101 (473), pp. 138–156.
- Agnostic online learning. In COLT 2009.
- Learning linear threshold functions in the presence of classification noise. In Proceedings of the seventh annual conference on Computational learning theory, pp. 340–347.
- Sample-efficient strategies for learning in the presence of noise. Journal of the ACM (JACM) 46 (5), pp. 684–719.
- Online learning of noisy data. IEEE Transactions on Information Theory 57 (12), pp. 7907–7931.
- Crowdsourced judgement elicitation with endogenous proficiency. In Proceedings of the 22nd international conference on World Wide Web, pp. 319–330.
- Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems 25 (5), pp. 845–869.
- Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), pp. 359–378.
- Noise tolerant variants of the perceptron algorithm. J. Mach. Learn. Res. 8, pp. 227–248.
- Deep learning. Nature 521, pp. 436–44.
- Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining, ICDM ’03, Washington, DC, USA, pp. 179–.
- Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence 38 (3), pp. 447–461.
- Machine Learning aided Peer Prediction. ACM EC.
- Learning from corrupted binary labels via class-probability estimation. In International Conference on Machine Learning, pp. 125–134.
- Eliciting informative feedback: The peer-prediction method. Management Science 51 (9), pp. 1359–1373.
- Learning with noisy labels. In Advances in neural information processing systems, pp. 1196–1204.
- A bayesian truth serum for subjective data. science 306 (5695), pp. 462–466.
- A robust bayesian truth serum for non-binary signals. In Proceedings of the 27th AAAI Conference on Artificial Intelligence, AAAI ’13.
- Incentives for effort in crowdsourcing using the peer truth serum. ACM Transactions on Intelligent Systems and Technology (TIST) 7 (4), pp. 48.
- Classification with asymmetric label noise: consistency and maximal denoising. In COLT, pp. 489–511.
- A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS.
- Informed Truthfulness in Multi-Task Peer Prediction. ACM EC.
- Informed truthfulness in multi-task peer prediction. In Proceedings of the 2016 ACM Conference on Economics and Computation, pp. 179–196.
- Learning svms from sloppily labeled data. In International Conference on Artificial Neural Networks, pp. 884–893.
- Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080 2 (3), pp. 4.
- Learning in the presence of corruption. arXiv preprint arXiv:1504.00091.
- A robust bayesian truth serum for small populations. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, AAAI ’12.
- Dwelling on the Negative: Incentivizing Effort in Peer Prediction. In Proceedings of the 1st AAAI Conference on Human Computation and Crowdsourcing (HCOMP’13).
- Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699.
Illustration of our implementation of peer loss
Other peer prediction functions
Other notable examples include the quadratic and logarithmic scoring functions, defined as follows:
Quadratic scoring function:
Logarithmic scoring function:
We know the following is true:
Lemma 5 (Miller et al. (2005)).
defined in Example 1 & 2 induce strict truthfulness when and are stochastically relevant.
with defining stochastic relevance as follows:
and are stochastically relevant if s.t.
Similarly we conclude that when and are stochastic relevant, the correlated agreement scoring rule, quadratic scoring rule and logarithmic scoring rule are strictly truthful. This stochastic relevance condition essentially states that the optimal classifier is statistically different from the noisy data source on some signals. Stochastic relevance is further satisfied in the binary classification setting when , under the assumption that , as similarly imposed in learning with noisy labels literature (Scott et al., 2013; Natarajan et al., 2013; Scott, 2015).
and are stochastically relevant if and only if .
Since can be written as a function of and , due to conditional independence between and (conditional on ), by chain rule