Towards Robust Learning with Different Label Noise Distributions
Noisy labels are an unavoidable consequence of automatic image labeling processes to reduce human supervision. Training in these conditions leads Convolutional Neural Networks to memorize label noise and degrade performance. Noisy labels are therefore dispensable, while image content can be exploited in a semi-supervised learning (SSL) setup. Handling label noise then becomes a label noise detection task. Noisy/clean samples are usually identified using the small loss trick, which is based on the observation that clean samples represent easier patterns and, therefore, exhibit a lower loss. However, we show that different noise distributions make the application of this trick less straightforward. We propose to continuously relabel all images to reveal a loss that facilitates the use of the small loss trick with different noise distributions. SSL is then applied twice, once to improve the clean-noisy detection and again for training the final model. We design an experimental setup for better understanding the consequences of differing label noise distributions and find that non-uniform out-of-distribution noise better resembles real-world noise. We show that SSL outperforms other alternatives when using oracles and demonstrate substantial improvements across five datasets of our label noise Distribution Robust Pseudo-Labeling (DRPL). We further study the effects of label noise memorization via linear probes and find that in most cases intermediate features are not affected by label noise corruption. Code and details to reproduce our framework will be made available.
Modern representation learning, i.e. the extraction of useful information to build classifiers or other predictors, in computer vision is led by Convolutional Neural Networks (CNNs) [9, 28, 4, 31, 49, 25, 11]. Their widespread use is attributable to their capability to model complex patterns when vast amounts of labeled data are available. This supervision requirement limits exploiting the vast amounts of web images as it is infeasible to label every image for each particular task. However, leveraging these data might drive visual representation learning a step forward.
What can we do to relax supervision? One could adopt transfer learning or domain adaptation, where representations from a source domain are transferred to a target domain where fewer labels are available. This approach, however, makes the target domain dependent on the source one. Learning representations from scratch in the target domain, on the other hand, may lead to better representations. Several alternatives exist: semi-supervised learning (SSL), which jointly learns from few labeled images and extensive unlabeled ones [6, 1]; self-supervised learning, where data provides the supervision [29, 24]; or learning with label noise, where automatic labeling processes introduce noise in the observed labels [36, 44]. This paper focuses on this last alternative, which has attracted much recent interest [43, 44, 2, 42].
Learning with label noise is challenging; recent studies on the generalization capabilities of deep networks demonstrate that noisy labels are easily fit by CNNs, harming generalization. There is, however, a key observation on how CNNs memorize corruptions: they tend to learn easy patterns first and these patterns are closer to clean data patterns, i.e. correctly labeled images [45, 3], thus exhibiting lower loss than images with noisy or incorrect labels. This phenomenon is commonly named small loss [36, 44, 43] and recent works exploit this small loss trick to identify clean and noisy samples [14, 34, 2].
Approaches dealing with label noise can be categorized into: loss correction [32, 30, 22, 2], when the loss is weighted to correct the label noise effect; relabeling [36, 44], when all observed labels are corrected by an estimation of the true labels; and approaches that discard noisy labels to transform the problem into SSL [10, 23]. Despite the variety of approaches and comparative evaluations, it is not clear which of them behaves best. Most approaches are exhaustively compared on CIFAR data and then tested on real-world datasets such as WebVision. Although substantial improvements have been achieved in recent years when training from scratch on CIFAR, transfer learning is usually used when approaches are evaluated on WebVision [22, 43, 7]. This misalignment might harm the ability to understand representation learning from scratch, as it has been shown that fine-tuning pre-trained weights is robust to label noise.
In light of these limitations, we undertake an exhaustive study on different label noise distributions and dataset complexities and propose a general solution to tackle all of them. In particular, we adopt the 32×32 and 64×64 versions of the ImageNet dataset and artificially introduce 4 different label noise types (2 in-distribution and 2 out-of-distribution). We find that discarding noisy labels and training with a semi-supervised technique outperforms all other approaches considered. The novelty here lies in the clean-noisy detection strategy, which is based on both relabeling and the small loss trick. Relabeling modifies the dataset noise, separating the loss of clean and noisy samples to facilitate the use of the small loss trick. SSL via pseudo-labeling is then applied twice, once to refine the clean-noisy detection and once for the final training stage. Our contributions include:
A framework to study label noise with different distributions providing a more realistic understanding.
An experimental demonstration, based on using oracles, that SSL substantially outperforms loss correction based on label noise transition matrices .
A study of label noise memorization and its importance for representation learning, showing that despite the performance degradation for source and target tasks due to the label noise fitting, the intermediate representations do not often suffer such degradation.
An extensive evaluation in five datasets using multiple label noise distributions and training from scratch, demonstrating both superior performance of our label noise Distribution Robust Pseudo-Labeling (DRPL) approach and contributing to a better understanding of existing methods.
2 Related work
Label noise is a well-known problem in machine learning. Recent efforts in image classification focus on dealing with in-distribution noise, where the set of possible labels is known and noisy labels belong to this set. However, label noise might also come from outside the distribution, which occurs in many real-world scenarios.
Several label noise distributions can affect dataset annotations, namely uniform or non-uniform random label noise. Uniform label noise means the true labels are flipped to a different class with uniform random probability. Non-uniform noise has different flipping probabilities for each class.
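As a minimal sketch of these two definitions, the snippet below corrupts labels by sampling each observed label from the row of a transition matrix indexed by the true label. The function names, the 4-class setup and the non-uniform matrix are illustrative assumptions and are not taken from the experiments in this paper.

```python
import random

def uniform_transition(num_classes, noise_level):
    """Row-stochastic matrix: keep the label with probability 1 - noise_level,
    otherwise flip uniformly to one of the other classes."""
    off = noise_level / (num_classes - 1)
    return [[1.0 - noise_level if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]

def corrupt_labels(labels, transition, rng):
    """Sample each observed label from the row of its true label."""
    classes = list(range(len(transition)))
    return [rng.choices(classes, weights=transition[y])[0] for y in labels]

rng = random.Random(0)
T_uniform = uniform_transition(num_classes=4, noise_level=0.4)
# Hypothetical non-uniform matrix: every class has its own flip distribution.
T_nonuniform = [
    [0.6, 0.4, 0.0, 0.0],
    [0.1, 0.7, 0.2, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.3, 0.0, 0.7],
]
true_labels = [rng.randrange(4) for _ in range(10000)]
noisy_uniform = corrupt_labels(true_labels, T_uniform, rng)
noisy_nonuniform = corrupt_labels(true_labels, T_nonuniform, rng)
observed_noise = sum(a != b for a, b in zip(true_labels, noisy_uniform)) / len(true_labels)
```

With uniform noise every class is corrupted at the same rate, whereas in the non-uniform example class 2 is never corrupted and class 0 is only ever confused with class 1.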
Loss correction approaches [32, 30, 2, 43] either modify the loss directly or the network probabilities to compensate for the incorrect guidance provided by the noisy samples. One such approach extends the loss with a perceptual term that introduces a reliance on the model prediction. Han et al. adopt the same approach, but differ in the estimation of the perceptual term as network predictions are replaced by class estimations based on class prototypes. These approaches are, however, limited in that the noisy label always affects the objective. Arazo et al. address this by dynamically weighting the original and perceptual terms based on clean-noisy per-sample probabilities given by a label noise model. Patrini et al. estimate the label noise transition matrix, which specifies the probability of one label being flipped to another, and correct the softmax probability by multiplying it by this matrix. In the same spirit, Yao et al. propose to estimate this matrix in a Bayesian non-parametric form and deduce a dynamic label regression method to iteratively infer the latent true labels and jointly train the classifier and model the noise.
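To make the transition-matrix correction concrete, the following sketch multiplies the softmax output by a transition matrix before taking the cross-entropy with the observed label. It assumes the common convention that entry `T[i][j]` is the probability of true class `i` being observed as class `j`; the matrix values, logits and function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_corrected_loss(logits, noisy_label, T):
    """Cross-entropy on noisy-label probabilities q_j = sum_i p_i * T[i][j]."""
    p = softmax(logits)
    q = [sum(p[i] * T[i][j] for i in range(len(p))) for j in range(len(p))]
    return -math.log(q[noisy_label])

# Hypothetical matrix: class 0 is observed as class 1 in 20% of the cases.
T = [[0.8, 0.2, 0.0],
     [0.0, 1.0, 0.0],
     [0.1, 0.1, 0.8]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

logits = [4.0, 0.0, 0.0]  # the network is confident in class 0
loss_plain = -math.log(softmax(logits)[1])
loss_forward = forward_corrected_loss(logits, 1, T)
```

Because the matrix explains part of the disagreement between the prediction (class 0) and the observed label (class 1), the corrected loss is lower than the plain cross-entropy; with an identity matrix the two coincide.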
Other loss correction approaches reduce the contribution of noisy samples to the loss.  propose to use a mentor network that learns a curriculum (i.e. a weight for each sample) to guide a student network that learns under noise conditions. Similarly,  learn a curriculum based on an unsupervised estimation of data complexity. Wang et al.  introduce a weighting scheme in the loss to reduce the contribution of noisy samples, and a metric learning framework to pull noisy samples representations away from those of clean ones. Robust loss functions are studied in , which define the generalized cross-entropy loss by jointly exploiting the benefits of mean absolute error and cross-entropy losses.  propose the Determinant-based Mutual Information, a generalized version of mutual information that is provably insensitive to noise patterns and amounts.
Other approaches relabel the noisy samples by modeling their noise through conditional random fields , or CNNs  using a small set of clean samples, which limits their applicability. Tanaka et al.  have, however, demonstrated that it is possible to do sample relabeling using the network predictions as soft labels.  further improve  by using estimated label distributions as soft pseudo-labels.
A simple approach to dealing with label noise is to remove the corrupted data. This is not only challenging because difficult clean samples may be confused with noisy ones , but also removes the possibility of exploiting the noisy samples for representation learning. It has, however, recently been demonstrated [10, 23] that it is useful to discard samples that are likely to be noisy while still using them in a semi-supervised setup.  define clean samples as those whose prediction agrees with the label with a high certainty, whereas  use high softmax probabilities to distinguish clean samples after performing negative learning, i.e. minimizing the probability of predicting a class when the label comes from a random label flip that is likely incorrect.
3 From label noise to semi-supervised learning
Image classification can be formulated as learning a model from a set of training examples, each consisting of an image and the one-hot encoding of its true label. In our case, the model is a CNN whose parameters are its weights and biases. As we are considering classification under label noise, the observed label can be noisy (i.e. the image and its label form a noisy sample). This training under label noise can be redefined through SSL, where the samples are split into a set of unlabeled samples and a set of labeled samples. Ideally, the labeled set would contain the clean samples and the unlabeled set the noisy ones. In practice, the clean and noisy sample sets are unknown and must be estimated. Detected clean (noisy) samples are used as the labeled (unlabeled) set and, as such, detection accuracy will influence the SSL success. The remainder of this section introduces the proposed two-stage label noise detection (Subsections 3.1 and 3.2) and briefly explains the SSL approach (Subsection 3.3).
3.1 Label noise detection: first stage
Recent literature assumes that using the small loss trick when training with cross-entropy leads to accurate clean-noisy discrimination [14, 34, 2]. However, as Figure 1 shows, different label noise distributions result in different behavior. Noisy samples tend to have higher loss than clean samples for both in-distribution and out-of-distribution noise but the different noise types exhibit different complexities. The uniform noise types (Figure 1 (left)) exhibit higher separation between the losses of clean (blue) and noisy (red) samples than the non-uniform ones (Figure 1 (right)). This shows that a straightforward application of the small loss trick will likely encounter some difficulties.
Based on the evidence that clean data is easier to fit than mislabelled data , we propose to identify the clean data by fully relabeling all samples using the network predictions and analyzing which samples still fit the original labels. The relabeling approach optimizes the following loss function:
where the epoch index selects the training phase, a constant defines the number of warm-up epochs with the original (potentially noisy) labels, and two weighted regularization terms (detailed in prior work) are included to ensure convergence. The first phase (the warm-up epochs) trains without relabeling and uses a high learning rate to prevent fitting the label noise, while the second (the epochs after warm-up) relabels all samples by computing soft pseudo-labels that are re-estimated every epoch using the softmax predictions. For simplicity we omit the epoch index inside Eq. (1).
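The two-phase target selection described here can be sketched as follows. This is an illustration under stated assumptions: the regularization terms are omitted, the pseudo-label is simply the current softmax prediction, the warm-up boundary is taken as inclusive, and all names and numbers are hypothetical.

```python
import math

def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a one-hot or soft target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))

def training_target(epoch, warmup_epochs, original_label, pseudo_label):
    """Original label during warm-up; soft pseudo-label once relabeling starts."""
    return original_label if epoch <= warmup_epochs else pseudo_label

# Hypothetical 3-class sample whose observed label (class 2) is wrong.
original_label = [0.0, 0.0, 1.0]
probs = [0.85, 0.10, 0.05]   # current softmax prediction
pseudo_label = list(probs)   # assumed to be refreshed from the predictions each epoch

loss_warmup = cross_entropy(training_target(10, 40, original_label, pseudo_label), probs)
loss_relabel = cross_entropy(training_target(60, 40, original_label, pseudo_label), probs)
# Loss with respect to the original label, tracked only for clean-noisy detection.
loss_original = cross_entropy(original_label, probs)
```

In the relabeling phase the training loss of this sample becomes small because the target follows the prediction, while the loss measured against the original label stays high, which is the signal later used for detection.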
The new labels or pseudo-labels no longer represent the original label noise, but the noise from inaccurate network predictions. Figure 2 illustrates the benefits of this approach on the detection of clean and noisy samples. The relabeling approach progressively fits the new labels (Figure 2 (a)) as it is highly affected by confirmation bias, i.e. overfitting to incorrect pseudo-labels predicted by the network. However, this reveals an interesting property in the cross-entropy loss with respect to the original labels (Figure 2 (b)). Clean samples tend to agree with the original labels (blue loss is low) substantially better than noisy samples (red loss is high). This facilitates distinguishing between clean and noisy samples using the loss, which does not occur in a standard training (Figure 1). Given this loss with respect to the original labels (marked with (*) in what follows), we adopt a probabilistic version of the small loss trick. In particular, that version fits a 2-component Beta Mixture Model (BMM) to the loss to model the loss of clean (noisy) samples using the first (second) component, as lower losses correspond to clean samples. The probability of each sample being clean or noisy is then estimated using the posterior probability from the mixture model. We use this BMM approach on the loss to detect clean and noisy samples, thus defining the initial labeled and unlabeled sets as:
where the posterior probability of a sample being noisy is computed from its loss after the relabeling approach and compared against a threshold to detect clean and noisy samples. This threshold should be small to select only highly probable clean samples (a fixed value is used unless otherwise stated).
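A sketch of this detection step is given below, assuming a 2-component Beta mixture fitted with EM and weighted method-of-moments updates, which is one common way to fit such a model and not necessarily the exact procedure used here. The synthetic losses, the threshold value `0.1` and all function names are illustrative assumptions.

```python
import math
import random

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def weighted_moments(xs, ws):
    """Weighted method-of-moments estimate of the Beta parameters
    (assumes the weighted variance is nonzero)."""
    total = sum(ws)
    mean = sum(w * x for w, x in zip(ws, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / total
    scale = mean * (1 - mean) / var - 1
    return mean * scale, (1 - mean) * scale

def noisy_posteriors(losses, iters=20, eps=1e-4):
    """Fit a 2-component Beta mixture with EM; return P(noisy | loss) per sample."""
    lo, hi = min(losses), max(losses)
    # Min-max normalize and clip so that every value lies strictly inside (0, 1).
    xs = [min(max((l - lo) / (hi - lo), eps), 1 - eps) for l in losses]
    params = [(1.0, 2.0), (2.0, 1.0)]  # low-loss and high-loss initial guesses
    mix = [0.5, 0.5]

    def posterior(x):
        lik = [mix[k] * beta_pdf(x, *params[k]) for k in range(2)]
        total = sum(lik) + 1e-300
        return [v / total for v in lik]

    for _ in range(iters):
        resp = [posterior(x) for x in xs]              # E-step
        for k in range(2):                             # M-step
            ws = [r[k] for r in resp]
            params[k] = weighted_moments(xs, ws)
            mix[k] = sum(ws) / len(xs)
    means = [a / (a + b) for a, b in params]
    noisy_component = 0 if means[0] > means[1] else 1  # higher mean loss = noisy
    return [posterior(x)[noisy_component] for x in xs]

# Synthetic losses: the first 600 samples play the role of clean data.
rng = random.Random(0)
losses = ([rng.betavariate(2, 8) for _ in range(600)]
          + [rng.betavariate(8, 2) for _ in range(400)])
p_noisy = noisy_posteriors(losses)
threshold = 0.1  # illustrative value, not taken from the text
labeled = [i for i, p in enumerate(p_noisy) if p <= threshold]
unlabeled = [i for i, p in enumerate(p_noisy) if p > threshold]
```

A small threshold keeps only samples that the mixture assigns to the low-loss component with high confidence, trading recall of clean samples for a cleaner labeled set.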
3.2 Label noise detection: second stage
The second stage of our clean-noisy sample detector refines the first stage estimation by training a new semi-supervised model using the labeled and unlabeled sets from Eqs. (3) and (4). This new model leads to a loss with respect to the original labels that further facilitates the clean-noisy detection, as training is performed with far fewer corrupted labels. Again, we detect clean and noisy samples by fitting a BMM to this loss and compute the final labeled and unlabeled sets by selecting samples similarly to Eqs. (3) and (4). As Figure 2 (c) shows, clean and noisy samples become easily separable in the loss, leading to wider separation in posterior probabilities (Figure 2 (d)). This allows the use of a higher threshold than in the first stage with a lower risk of introducing noisy samples. We therefore apply a maximum a posteriori threshold over the posterior probabilities, i.e. each sample is assigned to the more probable mixture component. Subsection 5.2 demonstrates the capabilities of our label noise detection method for different label noise distributions.
3.3 The semi-supervised learning approach
We adopt a recent pseudo-labeling SSL approach for simplicity and performance. This approach performs relabeling or pseudo-labeling as presented in the first stage, but the pseudo-labels are only estimated for the unlabeled samples. Mixup data augmentation and label noise regularization are also applied to alleviate confirmation bias and make pseudo-labeling effective. The SSL approach is applied twice. The first time, SSL is used in the second stage of the label noise detection, i.e. with the initial labeled and unlabeled sets (see Subsection 3.1) to train the second-stage detection model, whereas the second time it is applied using the final labeled and unlabeled sets (see Subsection 3.2) to train the final label-noise robust model. The supplementary material provides a summary of the proposed method.
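For readers unfamiliar with mixup, a minimal sketch of the augmentation on a single pair of samples is shown below. It follows the standard formulation (a convex combination of two inputs and of their label vectors with a Beta-distributed coefficient); the toy vectors and the function name are assumptions made for illustration.

```python
import random

def mixup(x1, y1, x2, y2, alpha, rng):
    """Convexly combine two samples and their (soft) labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)
# Toy 2-dimensional inputs with a one-hot label and a soft pseudo-label.
mixed_x, mixed_y, lam = mixup([1.0, 0.0], [1.0, 0.0, 0.0],
                              [0.0, 1.0], [0.1, 0.2, 0.7],
                              alpha=1.0, rng=rng)
```

Since both label vectors sum to one, the mixed target is still a valid probability distribution, so it can be used directly with a cross-entropy loss on soft targets.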
4 Experimental setup
CIFAR data is commonly corrupted for fast experimentation with label noise, whereas real-world performance against label noise is evaluated in datasets like WebVision. To the best of our knowledge, however, the related work overlooks the differences that might exist between artificially introduced noise and real-world noise. This section introduces the proposed label noise framework aimed at a better understanding of label noise (Subsection 4.1) and further describes other datasets commonly used for evaluation (Subsections 4.2 and 4.3). We use a PreAct ResNet-18 in ImageNet32/64 and CIFAR and a ResNet-18 for WebVision. We always train from scratch (except for the transfer learning experiments in Subsection 5.4), using SGD with a momentum of 0.9, weight decay, a batch size of 128 and an initial learning rate of 0.1 for every stage of our method. Note that we do not use validation sets in any experiment. We take this decision due to the difficulty of defining a clean validation set in a real-world scenario and, more importantly, due to the fact that having clean data allows direct application of SSL, which leads to superior performance. The supplementary material provides further experimental details.
We propose to use ImageNet32/64 for fast experimentation and higher flexibility in better understanding label noise. ImageNet32/64 are 32×32 and 64×64 downsampled versions of the well-known ImageNet classification dataset. This dataset contains 1.2M images uniformly distributed over 1000 classes. To introduce label noise we split the dataset into in-distribution (ID) classes and out-of-distribution (OOD) classes. The split is performed to study both ID noise, as is typically done in the literature, and also the less frequently considered OOD noise. We use 100 randomly selected ID classes in all our experiments (the remaining 900 classes are OOD), thus leading to 127K images. We study both uniform and non-uniform noise in both ID and OOD scenarios. To introduce uniform noise for ID we randomly flip the true label to another of the ID labels using uniform probabilities and excluding the true label, whereas for OOD we randomly select a class among the OOD classes and use one of its images to replace the ID image. To introduce non-uniform noise we use a label noise transition matrix designed to be as realistic as possible. To this end, we average and apply row-wise unit-based normalization to the confusion matrices of the pre-trained ImageNet networks VGG-16, ResNet-50, Inception-v3, and DenseNet-161. We truncate this matrix and re-normalize it to the ID classes for ID noise and the OOD classes for OOD noise. We follow the same process as the uniform case to introduce noise, but using the row distributions corresponding to the true label of each image: we randomly flip the label for ID noise, while changing the image content for OOD noise. For a specific noise level, we always keep the corresponding fraction of clean samples in each class and modify the remainder.
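The construction of the non-uniform noise distributions can be sketched as follows. The two small confusion matrices stand in for those of the pre-trained networks, the split into three ID and two OOD classes is a toy example, and zeroing the entry of the true class before re-normalizing is an assumption made here so that a flip always changes the label.

```python
def row_normalize(matrix):
    """Scale each row to sum to 1 (assumes every row has a nonzero sum)."""
    return [[v / sum(row) for v in row] for row in matrix]

def average(matrices):
    """Element-wise mean of equally sized square matrices."""
    n, size = len(matrices), len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(size)]
            for i in range(size)]

def noise_rows(confusions, id_classes, target_classes):
    """Flip distribution of each ID class over the target classes."""
    mean_conf = row_normalize(average(confusions))
    rows = []
    for c in id_classes:
        # Assumption: the true class is excluded from its own flip distribution.
        rows.append([0.0 if t == c else mean_conf[c][t] for t in target_classes])
    return row_normalize(rows)

# Hypothetical confusion counts of two classifiers over 5 classes.
conf_a = [[80, 10, 4, 5, 1], [6, 85, 3, 2, 4], [2, 5, 88, 1, 4],
          [3, 2, 1, 90, 4], [1, 6, 2, 3, 88]]
conf_b = [[78, 12, 3, 6, 1], [8, 82, 4, 1, 5], [3, 4, 86, 2, 5],
          [2, 3, 2, 89, 4], [2, 5, 1, 4, 88]]
id_classes, ood_classes = [0, 1, 2], [3, 4]

id_noise = noise_rows([conf_a, conf_b], id_classes, id_classes)    # label flips
ood_noise = noise_rows([conf_a, conf_b], id_classes, ood_classes)  # image replacement
```

Each row of `id_noise` gives the classes a label would be flipped to, while each row of `ood_noise` gives the OOD classes from which a replacement image would be drawn for that ID class.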
We use standard data augmentation by random horizontal flips and random 4 (8) pixel translations for ImageNet32 (64) in training. During the first stage of the label noise detection, we train for 40 epochs before starting 60 epochs of relabeling and reduce the learning rate by a factor of 10 in epochs 45 and 80. In the second stage of the label noise detection, we train 175 epochs and reduce the learning rate in epochs 100 and 150. The final SSL stage has 300 epochs with learning rate reductions in epochs 150 and 225.
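The three training schedules above share the same step pattern, which can be written compactly as below. The epoch counts and milestones are those stated in this paragraph; the assumption that each reduction divides the rate by 10 and takes effect at the start of the named epoch, as well as the function and dictionary names, are illustrative.

```python
def learning_rate(epoch, milestones, base_lr=0.1, factor=0.1):
    """Step schedule: multiply the rate by `factor` at every milestone reached."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * factor ** drops

# (total epochs, milestones) for each stage on ImageNet32/64.
SCHEDULES = {
    "detection stage 1": (100, (45, 80)),   # 40 warm-up + 60 relabeling epochs
    "detection stage 2": (175, (100, 150)),
    "final SSL": (300, (150, 225)),
}

stage1_rates = [learning_rate(e, SCHEDULES["detection stage 1"][1])
                for e in range(1, SCHEDULES["detection stage 1"][0] + 1)]
```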
The CIFAR-10/100 datasets have 10/100 classes of 32×32 images split into 50K images for training and 10K for testing. We follow the criteria of prior work for label noise addition. For uniform noise, labels are randomly assigned excluding the original label. For non-uniform noise, labels are flipped with a fixed probability to similar classes in CIFAR-10 (i.e. truck → automobile, bird → airplane, deer → horse, cat ↔ dog), whereas for CIFAR-100 label flips are done to the next class circularly within the super-classes. Standard data augmentation by random horizontal flips and random 4 pixel translations is used in training. We train as in ImageNet32/64 in CIFAR-100, whereas in CIFAR-10 we slightly modify the first stage by training 130 epochs (70 before relabeling) and reduce the learning rate in epochs 75 and 110. This increase in training epochs is to ensure a better model before relabeling, as classes in CIFAR-10 are more different than in ImageNet32/64 and CIFAR-100, thus incorrect predictions during relabeling are less informative. We use a slightly higher first-stage threshold to assure sufficient labeled samples for SSL.
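The CIFAR-10 non-uniform flips can be illustrated with a small lookup table. The direction of each flip (three one-way flips plus a two-way cat/dog flip) is the reading assumed here, and the flip probability and sample counts are arbitrary illustrative values.

```python
import random

# Assumed reading of the class pairs: three one-way flips plus cat <-> dog.
CIFAR10_FLIPS = {
    "truck": "automobile",
    "bird": "airplane",
    "deer": "horse",
    "cat": "dog",
    "dog": "cat",
}

def flip_to_similar(labels, flip_prob, rng):
    """Flip labels of the affected classes with probability flip_prob."""
    return [CIFAR10_FLIPS[l] if l in CIFAR10_FLIPS and rng.random() < flip_prob else l
            for l in labels]

rng = random.Random(0)
labels = ["truck"] * 5000 + ["frog"] * 5000
noisy = flip_to_similar(labels, flip_prob=0.4, rng=rng)
truck_flip_rate = sum(n == "automobile" for n in noisy[:5000]) / 5000
```

Classes without a similar counterpart, such as frog, are never corrupted, so the overall noise level of the dataset is lower than the per-class flip probability.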
We use WebVision 1.0 to evaluate performance on real-world label noise. We evaluate our approach using only the first 50 classes of WebVision as done in prior work, resulting in a dataset of 137K images with resolution 224×224 after resizing and cropping. However, unlike most approaches in the literature [22, 7], we train all compared methods from scratch to better understand the effect of label noise. We use random horizontal flips during training and resize images before taking random crops. For the first stage of the label noise detection, we train 40 epochs before starting 60 epochs of relabeling and reduce the learning rate by a factor of 10 twice (epochs 45 and 80). For the second stage, we train 150 epochs and reduce the learning rate in epochs 100 and 125. The final SSL stage has 200 epochs with learning rate reductions in epochs 150 and 175.
5.1 SSL vs label noise transition matrix
Correcting the loss by using the label noise transition matrix has recently attracted a lot of interest [19, 43, 41]. Estimating the label noise transition matrix, however, is a challenging task as label flips from one class to another have to be estimated. It seems simpler, on the other hand, to detect clean and noisy samples and discard the noisy labels.
| Noise type (level) | Oracle SSL | Oracle forward |
| Uniform ID (80%) | 69.06 / 78.02 | 34.66 / 49.98 |
| Non-uniform ID (50%) | 73.24 / 81.80 | 65.70 / 73.44 |
Table 1 presents a study using oracles for both tasks (i.e. perfect knowledge of the clean and noisy samples and a known label noise transition matrix) to shed light on the potential of each approach under ideal conditions. The results show that SSL surpasses label noise transition matrix correction for both uniform and non-uniform noise. We believe this is an important finding that suggests further research on making the label noise transition matrix methods more effective.
5.2 Label noise detection comparison
Transforming the supervised training with label noise into SSL requires detecting the noise to, ideally, discard the labels and turn noisy samples into unlabeled ones. As commented in Section 1, many approaches use the small loss trick, i.e. considering low loss samples as clean ones, to accomplish such detection. However, Figure 1 shows that different noise distributions present different challenges, limiting a straightforward application of this trick. We confirm this limitation in Figure 3, where we compare, for different label noise distributions, the Receiver Operating Characteristic (ROC) curves for our label noise detection method and the small loss trick (using cross-entropy with and without mixup ). The proposed approach clearly outperforms straightforward application of the small loss trick as is usually done by many recent works [14, 2, 34]. It also outperforms two recently proposed label noise detection methods [10, 23], showing consistent improvements across different label noise distributions (see Table 2). Note that we encounter some limitations addressing high levels of non-uniform out-of-distribution noise, which occurs due to the nature of the classes used as noise. We are using the ImageNet confusion matrix in the validation set to introduce noise and we have 100 (900) in- (out-of-) distribution classes, thus using the most challenging classes as out-of-distribution noise.
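The area under a ROC curve such as those compared here can be computed directly from per-sample scores. The sketch below uses the rank interpretation of the AUC (the probability that a randomly chosen noisy sample receives a higher score than a randomly chosen clean one); the loss values are made up for illustration.

```python
def roc_auc(scores, is_noisy):
    """Probability that a random noisy sample scores higher than a random
    clean one (ties count one half). Quadratic cost, fine for a small sketch."""
    noisy = [s for s, n in zip(scores, is_noisy) if n]
    clean = [s for s, n in zip(scores, is_noisy) if not n]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in noisy for b in clean)
    return wins / (len(noisy) * len(clean))

# Hypothetical per-sample losses used as the noisy-sample score.
losses = [0.2, 0.4, 0.3, 1.5, 0.35, 2.1]
is_noisy = [False, False, False, True, True, True]
auc = roc_auc(losses, is_noisy)
```

An AUC of 1 means the score separates clean from noisy samples perfectly, while 0.5 corresponds to a score that carries no information about label noise.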
5.3 Comparison with related work
We select representative top-performing loss correction [30, 2], relabeling , and label noise robust regularization approaches  to compare against our label noise Distribution Robust Pseudo-Labeling (DRPL) approach in Table 3. The proposed method gives remarkable improvements across different levels and distributions of label noise. Note that, unlike most methods compared, our method shows little degradation between the best (reported in Table 3) and last epoch accuracy (see extended results in the supplementary material). In general, R  behaves consistently across label noise levels and distributions, while DB  has problems with non-uniform noise. FW  and M  tend to exhibit worse performance. An important observation is that non-uniform out-of-distribution noise exhibits much less degradation than other noise types for all methods. This is reasonable as out-of-distribution samples whose content is close to an in-distribution class will contribute to improved representation learning due to semantic similarities, and the network predictions for these samples will not harm the model for in-distribution classes. This behavior resembles real-world noisy data as observed in the WebVision dataset results in Subsection 5.5.
5.4 Effects of label noise memorization
Figure 4 provides some intuition. It shows the Class Activation Maps for the true class and the predicted class for undetected noisy samples. The network skips relevant areas for the true class, attending to areas that might help explain (i.e. memorize) the noisy label. For example, the network focuses on the left part of the bridge instead of the right to better fit the noisy label dock (first row), and on the whole lamp instead of the candles for the noisy label lobster (second row). Further examples are provided in the supplementary material.
Does memorization negatively impact visual representation learning? We follow the standard approach of using linear probes  to verify the utility of features under different noise levels and distributions in a target task. Specifically, we train a linear classifier on the global average pooled activations obtained after each of the 4 PreAct ResNet-18 blocks. Figure 5 presents the results in ImageNet64. Our model (O) clearly outperforms mixup (M) in the target task performance using features from the last block. Better source models (i.e. trained with less noise) also tend to produce better target performance. However, an interesting finding is that for both uniform and non-uniform noise, the final accuracy exhibits degradation in L4, while for earlier features no degradation is observed even for M (a model that has memorized the noise). An exception is 80% of uniform noise, where degradation is found across all blocks, but with more discriminative representations learned by our approach. Similar results are observed in ImageNet32 (see supplementary material).
5.5 Other datasets and real-world frameworks
Table 4 compares our DRPL approach with related work in CIFAR-10/100, showing improved performance over all other approaches (extended results in the supplementary material). We evaluate the same approaches as in ImageNet32/64 [30, 36, 46, 2] and also add further recent approaches [43, 44, 42]. The figures given are from our own evaluation runs (details in the supplementary material). Improved results for uniform and non-uniform noise in CIFAR confirm the improvements observed in ImageNet32/64. The additional recent approaches, GCE , DMI , and PEN  are outperformed by our approach DRPL in most cases (with the exception being PEN for low noise levels in CIFAR-100) with GCE and DMI far from top-performance.
We also compare DRPL against related work in the first 50 classes of WebVision 1.0 dataset  to verify the practical use of our method in real-world noise scenarios. Table 4 shows the proposed approach is more accurate than all compared approaches. Note, however, that many approaches give very similar performance. Surprisingly, a straightforward approach like M , not specifically designed to deal with label noise, gives the second highest accuracy. The similar performance among approaches is something that can be also observed for non-uniform out-of-distribution noise in ImageNet32/64. Another similarity here with non-uniform out-of-distribution noise is the small amount of degradation in performance at the last training epoch (reported in the supplementary material for all comparisons). Furthermore, the reduced noise level in WebVision, estimated at around 20% in , and with 17% of detected noisy samples in our 50 class subset, helps to explain performance similarities across approaches: this is also seen in CIFAR and ImageNet32/64 with low noise levels. We present detected noisy images in Figure 9 (more examples are provided in the supplementary material), which are mainly from outside the distribution.
The proposed label noise detection method outperforms other state-of-the-art methods across label noise distributions (see Subsection 5.2) and its combination with a modern semi-supervised learning technique surpasses many recent approaches (see Tables 3 and 4). When the proposed pipeline is applied in a scenario with little or no noise, however, performance drops considerably. For example, Table 4 shows top performances of 78.33 and 76.31 for competing approaches in CIFAR-100 with 0% and 10% noise, respectively, while our approach achieves 72.27 and 72.40. These approaches do not discard samples, suggesting that the information discarded by our approach is important for achieving top performance. This highlights that high loss is indicative not only of incorrect labels, but also of difficult samples. Important information can be discarded when high loss is used to discard labels of presumed noisy samples; the semi-supervised approach then determines whether this information is recovered.
In addition to discarded labels, undetected noisy samples also require consideration. Including a noisy sample in the labeled set corrupts it and harms learning from unlabeled data. Subsection 5.4 investigated network behavior with respect to these samples, showing how the network memorizes the noisy labels (Figure 4) and that this memorization mostly degrades the features of the last block, while the features of earlier blocks remain discriminative in most cases (Figure 5). These results open the door to tackling noise in different ways at different depths.
Methods behave differently against artificially introduced noise in CIFAR than in real-world scenarios like WebVision. This is clear from the performance in the last epoch for all methods (reported in the extension of Table 4 in the supplementary material), where many of them overfit the label noise. Substantial accuracy degradation is seen in many methods in CIFAR, while in WebVision degradations are minor and mixup (M), which does not deal specifically with label noise, outperforms most approaches. We believe that evaluation with out-of-distribution noise in ImageNet32/64 provides a better understanding of real-world noise. Moreover, simplifying label noise to a particular distribution might be unrealistic and a combination of uniform and non-uniform in- and out-of-distribution noise might be better (e.g. the first row in Figure 9 presents a shark (in-distribution class) and other images (out-of-distribution) as noisy samples). It would be worth studying conditioning noise on specific class subsets, rather than overall class-conditional flips, as classes have different prototype samples that may be more or less difficult to fit for other classes.
We propose a framework to study multiple label noise distributions and a straightforward approach based on label noise detection and semi-supervised learning to tackle them all. We show that, in ideal conditions, semi-supervised learning outperforms loss correction based on an oracle label noise transition matrix. We also provide intuitions about the network behavior when memorizing noisy labels and show that such memorization does not often harm learning discriminative intermediate representations. Results in five datasets support the generality and robustness of our approach and help in better understanding real-world label noise.
This work was supported by Science Foundation Ireland (grant numbers SFI/15/SIRG/3283 and SFI/12/RC/2289).
Appendix A Supplementary material for: Towards Robust Learning with Different Label Noise Distributions
A.1 Proposed method: algorithm
Algorithm 1 summarizes the proposed label noise Distribution Robust Pseudo-Labeling (DRPL) approach to deal with label noise.
A.2 Extended implementation details
The and hyperparameters are set as in ; we have not sought careful tuning. We use the default parameters for the semi-supervised method. Two important hyperparameters in our method are the thresholds and to detect label noise in each label noise detection stage. In the first stage, the idea is to get sufficient data for semi-supervised learning with labels that are as clean as possible. We select in ImageNet32/64 and WebVision, and in CIFAR. We slightly increase the threshold in CIFAR to ensure that enough data is selected for semi-supervised learning to succeed, as we had some problems selecting enough data in CIFAR-100 (occasionally fewer than 4K samples were selected, which according to results in is not enough to prevent performance degradation). We keep the same configuration in CIFAR-10 to demonstrate its generality.
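The first-stage selection described above can be sketched as follows. This is only an illustration: it assumes that each sample already has a score in [0, 1] estimating how likely its label is noisy, and that samples scoring at or below the threshold are kept as labeled. The function name, the origin of the scores, and the `min_labeled` check are hypothetical. Under this reading, raising the threshold selects more samples, which is consistent with the slightly higher threshold used in CIFAR.

```python
def split_by_noise_probability(noise_prob, threshold, min_labeled=0):
    """Split sample indices into labeled (presumed clean) and unlabeled sets.

    noise_prob:  per-sample scores in [0, 1]; higher means more likely noisy
                 (assumed to come from a previous detection step).
    threshold:   samples with a score <= threshold keep their label.
    min_labeled: minimum labeled-set size considered sufficient for
                 semi-supervised learning (hypothetical safeguard).
    """
    labeled = [i for i, p in enumerate(noise_prob) if p <= threshold]
    unlabeled = [i for i, p in enumerate(noise_prob) if p > threshold]
    enough = len(labeled) >= min_labeled
    return labeled, unlabeled, enough
```

When `enough` is false, the labeled set may be too small for semi-supervised learning, which is the situation that motivated the higher CIFAR threshold.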
Training details for compared methods:
F : In ImageNet32/64 and CIFAR we train for 200 epochs with an initial learning rate of 0.1 that we divide by 10 in epochs 100 and 150. For WebVision we train for 125 epochs and reduce the learning rate in epochs 75 and 120. Forward correction always starts in epoch 50.
M : In ImageNet32/64 and CIFAR we train for 300 epochs with an initial learning rate of 0.1 that we divide by 10 in epochs 100 and 250. For WebVision we train for 200 epochs and reduce the learning rate in epochs 75 and 120. The mixup parameter is set to 1.
R : In ImageNet32/64 and CIFAR we train the first stage of the method as in the first stage of our label noise detection method, i.e. we train for 100 epochs with an initial learning rate of 0.1 that we divide by 10 in epochs 45 and 80. Relabeling starts in epoch 40. For the second stage we train for 120 epochs with an initial learning rate of 0.1 that we divide by 10 in epochs 40 and 80. Other hyperparameters are kept as in .
DB : We use the code associated with the official implementation of . We keep the default configuration used for CIFAR-10/100 and use it also in ImageNet32/64. For WebVision we train for 200 epochs with an initial learning rate of 0.1 that we divide by 10 in epochs 100 and 175 (bootstrapping starts in epoch 102).
GCE and DMI : We use the code associated with the official implementation of . We keep the default configuration used for CIFAR-10 for both CIFAR-10 and CIFAR-100. We respect the use of the best model in the pre-training phase using cross-entropy loss, although such selection would not be straightforward without using a clean set in a real scenario.
P : We use the official implementation of . We had difficulties configuring this method for the different datasets, as similar configurations did not work in CIFAR-10/100 and WebVision. We always use , , and learning rate of 0.1 in the first and second stages, whereas we use 0.2 as starting learning rate in the third stage. For CIFAR-10 different hyperparameters are used in to deal with different noise distributions and noise levels, which we did not find reasonable. Therefore, we set a common configuration with and training the suggested number of epochs . In CIFAR-100 we tried the configuration suggested in the paper and it did not converge to reasonable performance, thus we reduced the learning rate from 0.35 to 0.1 to make it stable and use as suggested. For WebVision we used the same configuration as in CIFAR-10 and trained for 40 epochs in the first stage, 60 in the second and 100 in the third. Note that we tried the suggested CIFAR-100 configuration in WebVision, but it led to poor performance.
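Several of the schedules listed above divide the learning rate by 10 at fixed epochs. A minimal sketch of such a step schedule is given below. It assumes 0-indexed epochs and that the reduced rate applies from the milestone epoch onward; the function name and signature are illustrative and not part of any of the compared implementations.

```python
def step_lr(epoch, base_lr=0.1, milestones=(100, 150), factor=0.1):
    """Return the learning rate used at `epoch` under a step schedule.

    The rate is multiplied by `factor` once for every milestone that has
    been reached (assumption: the change applies from that epoch onward).
    """
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** reached
```

For example, the ImageNet32/64 and CIFAR schedule for F corresponds to `milestones=(100, 150)`, and the one for M to `milestones=(100, 250)`.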
A.3 Extended results: Label noise memorization
Figure 7 presents Class Activation Maps for the true class (middle) and the predicted class (right) for undetected noisy samples. As shown in Figure 4 of the paper, the network skips relevant areas for the true class and shifts attention to areas that might help in explaining (i.e. memorizing) the noisy label. Sometimes the network extends the class activation maps to cover image regions that share visual similarities with the incorrect class, while other times it omits characteristic areas of the true class. The former situation can be observed in rows one to four, where important areas are expanded, whereas the latter is seen in the last two rows. There, the blades of the snowplow are possibly skipped to help explain the noisy label pickup, while the theater curtain is skipped and attention is focused on the theater’s seats, which resemble the noisy label keyboard.
Figure 8 reports the results of the linear probes experiment in ImageNet32, showing similar characteristics as those in ImageNet64 reported in the paper. Similar conclusions can be extracted, highlighting little or no degradation of intermediate features when learning with label noise.
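For readers unfamiliar with linear probes, the idea is to train only a linear classifier on top of frozen intermediate features and use its accuracy as a measure of how discriminative those features are. The sketch below is a toy, dependency-free illustration using softmax regression trained with full-batch gradient descent. The function names, hyperparameters and synthetic features are assumptions for this example and do not reproduce the setup used in our experiments.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(W, b, x):
    """Linear scores W x + b for one feature vector."""
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=100):
    """Fit a linear classifier on frozen features (features are never updated)."""
    dim, n = len(features[0]), len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(num_classes)]
        gb = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(scores(W, b, x))
            for c in range(num_classes):
                # Gradient of the cross-entropy loss w.r.t. the logit of class c.
                err = probs[c] - (1.0 if c == y else 0.0)
                gb[c] += err / n
                for d in range(dim):
                    gW[c][d] += err * x[d] / n
        for c in range(num_classes):
            b[c] -= lr * gb[c]
            for d in range(dim):
                W[c][d] -= lr * gW[c][d]
    return W, b


def accuracy(W, b, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    hits = 0
    for x, y in zip(features, labels):
        s = scores(W, b, x)
        hits += int(max(range(len(s)), key=s.__getitem__) == y)
    return hits / len(labels)


def toy_features(n_per_class=20, seed=0):
    """Synthetic 2-D 'frozen features' for two well-separated classes."""
    rng = random.Random(seed)
    feats = [[rng.gauss(-2.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(n_per_class)]
    feats += [[rng.gauss(2.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(n_per_class)]
    return feats, [0] * n_per_class + [1] * n_per_class
```

In practice the features would be extracted from a given layer of the trained network and the probe would be evaluated on held-out data; a high probe accuracy then indicates that the layer remains discriminative despite label noise.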
A.4 Extended results: ImageNet32/64
A.5 Extended results: CIFAR-10/100
Table A.5 gives additional results for CIFAR-10/100, adding the performance in the last epoch which, as found in ImageNet32/64, reveals the degradation of some methods due to label noise memorization (e.g. CE, M).
A.6 Extended results: WebVision
Table 8 extends the results obtained in WebVision (first 50 classes) by including the performance in the last training epoch when training with a ResNet-18 from scratch. Figure 9 shows examples of detected noisy images.
- E. Arazo, D. Ortego, P. Albert, N.E. O’Connor, and K. McGuinness. Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning. arXiv: 1908.02983, 2019.
- E. Arazo, D. Ortego, P. Albert, N.E. O’Connor, and K. McGuinness. Unsupervised Label Noise Modeling and Loss Correction. In International Conference on Machine Learning (ICML), 2019.
- D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M.S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, and S. Lacoste-Julien. A Closer Look at Memorization in Deep Networks. In International Conference on Machine Learning (ICML), 2017.
- W.H. Beluch, T. Genewein, A. Nürnberger, and J.M. Köhler. The Power of Ensembles for Active Learning in Image Classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
- D. Berthelot, N. Carlini, I.J. Goodfellow, N. Papernot, A. Oliver, and C. Raffel. MixMatch: A Holistic Approach to Semi-Supervised Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- P. Chen, B. Liao, G. Chen, and S. Zhang. Understanding and Utilizing Deep Neural Networks Trained with Noisy Labels. In International Conference on Machine Learning (ICML), 2019.
- J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- D. DeTone, T. Malisiewicz, and A. Rabinovich. Deep image homography estimation. arXiv: 1606.03798, 2016.
- Y. Ding, L. Wang, D. Fan, and B. Gong. A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
- C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional Two-stream Network Fusion for Video Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- B. Frenay and M. Verleysen. Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2014.
- S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M.R. Scott, and D. Huang. CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images. In European Conference on Computer Vision (ECCV), 2018.
- B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- J. Han, P. Luo, and X. Wang. Deep Self-Learning From Noisy Labels. In IEEE International Conference on Computer Vision (ICCV), 2019.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. In European Conference on Computer Vision (ECCV), 2016.
- D. Hendrycks, K. Lee, and M. Mazeika. Using Pre-Training Can Improve Model Robustness and Uncertainty. In International Conference on Machine Learning (ICML), 2019.
- D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel. Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- G. Huang, Z. Liu, L. Van der Maaten, and K.Q. Weinberger. Densely Connected Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- A.H. Jiang, D.L.-K. Wong, G. Zhou, D.G. Andersen, J. Dean, G.R. Ganger, G. Joshi, M. Kaminsky, M. Kozuch, Z.C. Lipton, and P. Pillai. Accelerating Deep Learning by Focusing on the Biggest Losers. arXiv: 1910.00762, 2019.
- L. Jiang, Z. Zhou, T. Leung, L.J. Li, and L. Fei-Fei. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. In International Conference on Machine Learning (ICML), 2018.
- Y. Kim, J. Yim, J. Yun, and J. Kim. NLNL: Negative Learning for Noisy Labels. In IEEE International Conference on Computer Vision (ICCV), 2019.
- A. Kolesnikov, X. Zhai, and L. Beyer. Revisiting Self-Supervised Visual Representation Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. Dense-Captioning Events in Videos. In IEEE International Conference on Computer Vision (ICCV), 2017.
- A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- W. Li, L. Wang, W. Li, E. Agustsson, and L. Van Gool. WebVision Database: Visual Learning and Understanding from Web Data. arXiv: 1708.02862, 2017.
- Y. Ono, E. Trulls, P. Fua, and K. Moo Yi. LF-Net: Learning Local Features from Images. arXiv: 1805.09662, 2018.
- D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan. Learning Features by Watching Objects Move. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu. Making Deep Neural Networks Robust to Label Noise: A Loss Correction Approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In International Conference on Learning Representations (ICLR), 2015.
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv: 1409.1556, 2014.
- H. Song, M. Kim, and J.-G. Lee. SELFIE: Refurbishing Unclean Samples for Robust Deep Learning. In International Conference on Machine Learning (ICML), 2019.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa. Joint Optimization Framework for Learning with Noisy Labels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- A. Vahdat. Toward Robustness against Label Noise in Training Deep Discriminative Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie. Learning From Noisy Large-Scale Datasets With Minimal Supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- M. Wang and W. Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
- Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S.-T. Xia. Iterative Learning With Open-Set Noisy Labels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, and M. Sugiyama. Are Anchor Points Really Indispensable in Label-Noise Learning? In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- Y. Xu, P. Cao, Y. Kong, and Y. Wang. L_DMI: An Information-theoretic Noise-robust Loss Function. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- J. Yao, H. Wu, Y. Zhang, I.W. Tsang, and J. Sun. Safeguarded Dynamic Label Regression for Noisy Supervision. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
- K. Yi and J. Wu. Probabilistic End-To-End Noise Correction for Learning With Noisy Labels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.
- H. Zhang, M. Cisse, Y.N. Dauphin, and D. Lopez-Paz. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations (ICLR), 2018.
- R. Zhang, P. Isola, and A. Efros. Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Z. Zhang and M. Sabuncu. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.