Self-Adaptive Training: Bridging Supervised and Self-Supervised Learning
Abstract
We propose self-adaptive training—a unified training algorithm that dynamically calibrates and enhances the training process using model predictions, without incurring extra computational cost—to advance both supervised and self-supervised learning of deep neural networks. We analyze the training dynamics of deep networks on training data that are corrupted by, e.g., random noise and adversarial examples. Our analysis shows that model predictions are able to magnify useful underlying information in data, and that this phenomenon occurs broadly even in the absence of any label information, highlighting that model predictions could substantially benefit the training process: self-adaptive training improves the generalization of deep networks under noise and enhances self-supervised representation learning. The analysis also sheds light on understanding deep learning, e.g., a potential explanation of the recently-discovered double-descent phenomenon in empirical risk minimization and the collapsing issue of state-of-the-art self-supervised learning algorithms. Experiments on the CIFAR, STL and ImageNet datasets verify the effectiveness of our approach in three applications: classification with label noise, selective classification and linear evaluation. To facilitate future research, the code has been made publicly available at https://github.com/LayneH/self-adaptive-training.
1 Introduction
Deep neural networks have received significant attention in machine learning and computer vision, in part due to the impressive performance achieved by supervised learning approaches in the ImageNet challenge. With the help of massive labeled data, deep neural networks have advanced the state-of-the-art to an unprecedented level on many fundamental tasks, such as image classification [2, 3], object detection [4] and semantic segmentation [5]. However, data acquisition is notoriously costly, error-prone and even infeasible in certain cases, and deep neural networks suffer significantly from overfitting in these scenarios. On the other hand, the great success of self-supervised pre-training in natural language processing (e.g., GPT [6, 7, 8] and BERT [9]) highlights that learning universal representations from unlabeled data can be even more beneficial for a broad range of downstream tasks.
In light of this, much effort has been devoted to learning representations without human supervision in computer vision. Several recent studies show promising results and largely close the performance gap between supervised and self-supervised learning. To name a few, contrastive learning approaches [10, 11, 12] solve the instance-wise discrimination task [13] as a proxy objective for representation learning. Extensive studies demonstrate that self-supervisedly learned representations are generic and can even outperform their supervised pre-trained counterparts when fine-tuned on certain downstream tasks.
Our work advances both the supervised and self-supervised learning settings. Instead of designing two distinct algorithms for each learning paradigm separately, in this paper, we explore the possibility of a unified algorithm that bridges supervised and self-supervised learning. Our exploration is based on two observations on the learning of deep neural networks.
Observation I: To begin with, we observe that deep neural networks are able to magnify useful underlying information through their predictions under various supervised settings. To take a closer look at this phenomenon, we inspect the standard Empirical Risk Minimization (ERM) training dynamics of deep models on the CIFAR-10 dataset [14] with 40% of the data corrupted at random (see Section 2.1 for details) and report the accuracy curves in Fig. 1(a). We can see that, under all four corruptions, the peak of the accuracy curve on the clean training set (80%) is much higher than the percentage of clean data in the noisy training set (60%).
Observation II: Furthermore, we observe a similar phenomenon even in extreme scenarios where the supervised signals are completely random. As a concrete example, we generate two kinds of random noise as the training (real-valued) targets for each image: 1) the output probability of another model on the same training images, where that model is randomly initialized and then frozen (the red curve in Fig. 2); 2) random noise that is drawn i.i.d. from a standard Gaussian distribution (the green curve in Fig. 2), which is similar to the approach of [15]. We train two deep models by encouraging their outputs on each image to align with the training targets of 1) and 2), respectively (see Sec. 2.1 for details). We then evaluate the two deep models by learning linear classifiers on their representations. Fig. 2 displays the accuracy curves. We can clearly see that, by fitting either of the two kinds of noise (the red and green curves), the representations of the deep models are substantially improved compared with those of a randomly initialized network (the horizontal line).
These two observations indicate that model predictions can magnify useful information in data, which further suggests that incorporating predictions into the training process could significantly benefit model learning. With this in mind, we propose self-adaptive training—a carefully designed approach that dynamically uses model predictions as a guiding principle in the design of training algorithms—which bridges supervised and self-supervised learning in a unified framework. Our approach is conceptually simple yet calibrates and significantly enhances the learning of deep neural networks in multiple ways.
1.1 Summary of our contributions
Self-adaptive training sheds light on understanding and improving the learning of deep neural networks.

We analyze the standard ERM training process of deep networks on four kinds of corruption (see Fig. 1(a)). We describe the failure scenarios of ERM and observe that useful information from data has been distilled into model predictions in the first few epochs. We show that this phenomenon occurs broadly even in the absence of any label information (see Fig. 2). These insights motivate us to propose self-adaptive training—a unified training algorithm for both supervised and self-supervised learning—to improve the learning of deep networks by dynamically incorporating model predictions into training, without requiring modifications to existing network architectures or incurring extra computational cost.

We show that self-adaptive training improves the generalization of deep networks under both label-wise and instance-wise random noise (see Fig. 1 and 3). Besides, self-adaptive training exhibits a single-descent error-capacity curve (see Fig. 4). This is in sharp contrast to the recently-discovered double-descent phenomenon in ERM, which might be a result of overfitting of noise. Moreover, while adversarial training may easily overfit adversarial noise, our approach mitigates the overfitting issue and improves adversarial accuracy by 3% over the state-of-the-art (see Fig. 5).

We illustrate that self-adaptive training enhances the self-supervised representation learning of deep networks (see Fig. 2). Our study challenges the dominant training mechanism of recent self-supervised algorithms, which typically involves multiple augmented views of the same images at each training step: self-adaptive training achieves remarkable performance despite requiring only a single view of each image for training, which significantly reduces the heavy cost of data pre-processing and model training on extra views.
Self-adaptive training has three applications and advances the state-of-the-art by a significant margin.

Learning with noisy labels, where the goal is to improve the performance of deep networks on clean test data in the presence of training label noise. On the CIFAR datasets, our approach obtains up to 9% absolute classification accuracy improvement over the state-of-the-art. On the ImageNet dataset, our approach improves over ERM by 3% under a 40% noise rate.

Selective classification, which aims to trade prediction coverage off against classification accuracy. Self-adaptive training achieves up to 50% relative improvement over the state-of-the-art on three datasets under various coverage rates.

Linear evaluation, which uses a linear classifier to evaluate the representations of a self-supervised pre-trained model. Self-adaptive training performs on par with or even better than the state-of-the-art while requiring only single-view training.
2 Self-Adaptive Training
2.1 Blessing of model predictions
On corrupted data
Recent works [16, 17] cast doubt on ERM training: techniques such as uniform convergence might be unable to explain the generalization of deep neural networks, because ERM easily overfits training data even when the training data are partially or completely corrupted by random noise. To take a closer look at this phenomenon, we conduct experiments on the CIFAR-10 dataset [14], where we split the original training data into a training set (the first 45,000 data pairs) and a validation set (the last 5,000 data pairs). We consider four random noise schemes according to prior work [16], where the data are partially corrupted with a given probability (the noise rate): 1) Corrupted labels. Labels are assigned uniformly at random; 2) Gaussian. Images are replaced by random Gaussian samples with the same mean and standard deviation as the original image distribution; 3) Random pixels. Pixels of each image are shuffled using independent random permutations; 4) Shuffled pixels. Pixels of each image are shuffled using a fixed permutation pattern. We consider the performance on both the noisy and the clean sets (i.e., the original uncorrupted data), while the models can only access the noisy training sets.
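To make the four corruption schemes concrete, the following is a minimal NumPy sketch of how such noisy training sets can be generated; the function name, array layout, and seeding are illustrative, not the paper's actual data pipeline.

```python
import numpy as np

def corrupt_dataset(images, labels, noise_rate, scheme, num_classes=10, seed=0):
    """Corrupt a fraction `noise_rate` of (images, labels) under one of the
    four schemes described above. `images` has shape (N, H, W, C)."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n = len(images)
    idx = rng.choice(n, size=int(noise_rate * n), replace=False)  # corrupted subset
    if scheme == "corrupted_labels":      # labels assigned uniformly at random
        labels[idx] = rng.integers(0, num_classes, size=len(idx))
    elif scheme == "gaussian":            # replace images by Gaussian samples
        mu, sigma = images.mean(), images.std()
        images[idx] = rng.normal(mu, sigma, size=images[idx].shape)
    elif scheme == "random_pixels":       # independent permutation per image
        for i in idx:
            flat = images[i].reshape(-1, images.shape[-1])
            images[i] = flat[rng.permutation(len(flat))].reshape(images[i].shape)
    elif scheme == "shuffled_pixels":     # one fixed permutation for all images
        perm = rng.permutation(images.shape[1] * images.shape[2])
        for i in idx:
            flat = images[i].reshape(-1, images.shape[-1])
            images[i] = flat[perm].reshape(images[i].shape)
    return images, labels, idx
```

Note that only the selected subset is corrupted; the remaining samples (and, for image-space corruptions, all labels) stay intact, matching the partial-corruption setting above.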
Figure 1(a) displays the accuracy curves of ERM models trained on the noisy training sets under the four kinds of random corruption: ERM easily overfits noisy training data and achieves nearly perfect training accuracy. However, the four subfigures exhibit very different generalization behaviors, which are indistinguishable if we only look at the accuracy curve on the noisy training set (the red curve). In Figure 1(a), the accuracy increases in the early stage and the generalization errors grow quickly only after certain epochs. Intuitively, stopping at an early epoch improves generalization in the presence of label noise (see the first column of Figure 1(a)); however, it remains unclear how to properly identify such an epoch. Moreover, the early-stop mechanism may significantly hurt the performance on the clean validation sets, as we can see in the last three columns of Figure 1(a). Our approach is motivated by these failure scenarios of ERM and goes beyond ERM. We begin by making the following observation in the leftmost subfigure of Figure 1(a): the peak of the accuracy curve on the clean training set (80%) is much higher than the percentage of clean data in the noisy training set (60%). This finding was also previously reported by [18, 19, 20] under label corruption and suggests that model predictions might be able to magnify useful underlying information in data. We confirm this finding and show that the pattern occurs more broadly under various kinds of corruption (see the last three subfigures of Figure 1(a)).
On unlabeled data  We notice that supervised learning with 100% noisy labels is equivalent to unsupervised learning if we simply discard the meaningless labels. Therefore, it is interesting to analyze how a deep model behaves in such an extreme case. Here, we conduct experiments on the CIFAR-10 dataset [14] and consider two kinds of random noise as the training (real-valued) targets for deep learning models: 1) the output features of another model on the same training images, where that model is randomly initialized and then frozen; 2) random noise that is drawn i.i.d. from a standard Gaussian distribution and then fixed. The training of the deep model is then formulated as minimizing the mean square error between the normalized model predictions and these two kinds of random noise. To monitor the training, we learn a linear classifier on top of each model's frozen representation to evaluate it.
Figure 2 shows the linear evaluation accuracy curves of models trained on these two kinds of noise. We observe that, perhaps surprisingly, the model trained by predicting fixed random Gaussian noise (the green curve) achieves 57% linear evaluation accuracy, which is substantially higher than the 38% accuracy of a randomly initialized network (the dashed horizontal line). This intriguing observation shows that deep neural networks are able to distill underlying information from data into their predictions, even when the supervision signals are completely replaced by random noise. Furthermore, although predicting the output of another network can also improve representation learning (the red curve), its performance is worse than that of the second scheme. We hypothesize that this might be a result of inconsistent training targets: the output of the network depends on the input training image, which is randomly transformed by data augmentation at each epoch. This suggests that the consistency of training targets should also be taken into account in the design of training algorithms.
Inspiration for our methodology  Based on our analysis, we propose a unified algorithm, Self-Adaptive Training, for both supervised and self-supervised learning. Self-adaptive training incorporates model predictions to augment the training process of deep models in a dynamic and consistent manner. Our methodology is general and very effective: self-adaptive training significantly improves the generalization of deep neural networks on corrupted data and the representation learning of deep models without human supervision.
2.2 Meta algorithm: Self-Adaptive Training
Given a set of training images {x_i}_{i=1}^n and a deep network f_θ parametrized by θ, our approach records a training target t_i for each data point accordingly. We first obtain the predictions of the deep network as
(1)  p_i = S(f_θ(x_i)),
where S(·) is a normalization function. Then, the training targets track all historical model predictions during training and are updated by an Exponential Moving Average (EMA) scheme as
(2)  t_i ← α · t_i + (1 − α) · p_i.
The EMA scheme in Equation (2) alleviates the instability issue of model predictions, smooths out t_i during the training process, and enables our algorithm to completely change the training labels if necessary. The momentum term α controls the weight on the model predictions. Finally, we update the weights θ of the deep network by Stochastic Gradient Descent (SGD) on the loss function L(p_i, t_i) at each training iteration.
We summarize the meta algorithm of self-adaptive training in Algorithm 1. The algorithm is conceptually simple, flexible, and has three components adapting to different learning settings: 1) the training-target initialization; 2) the normalization function S; 3) the loss function L. In the following sections, we elaborate on the instantiation of these components for specific learning settings.
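The core of the meta algorithm is the EMA target update of Equation (2). A minimal sketch follows; the helper name and the toy values are ours, not part of Algorithm 1, and the network, normalization S, and loss L are left abstract.

```python
import numpy as np

def ema_update_targets(targets, predictions, alpha=0.9):
    """Eq. (2): t_i <- alpha * t_i + (1 - alpha) * p_i, applied element-wise."""
    return alpha * targets + (1.0 - alpha) * predictions

# Toy illustration: a target drifts toward a stable model prediction.
t = np.array([1.0, 0.0, 0.0])     # initial target (e.g., a one-hot label)
p = np.array([0.1, 0.8, 0.1])     # a stable model prediction
for _ in range(50):
    t = ema_update_targets(t, p)
# after many steps the target has essentially been replaced by the prediction
```

Because both the initial target and the prediction are probability vectors, the convex EMA combination remains a probability vector throughout, so the targets can be fed directly to a cross-entropy-style loss.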
Convergence analysis  To simplify the analysis, we consider a linear regression problem with data X ∈ R^{n×d}, training targets t ∈ R^n and a linear model f_w(x) = x^⊤w, where w ∈ R^d. Let S be the identity mapping and L be the mean square error. Then the optimization for this regression problem (corresponding to L in Algorithm 1) can be written as
(3)  min_{w, t} L(w, t) = (1/2) ‖Xw − t‖₂².
Let w_k and t_k be the model parameters and training targets at the k-th training step, respectively, and let η denote the learning rate for the gradient descent update. Algorithm 1 alternately minimizes problem (3) over w and t as
(4)  w_{k+1} = w_k − η X^⊤(X w_k − t_k),
(5)  t_{k+1} = α t_k + (1 − α) X w_{k+1}.
Proposition 1.
Let λ_max be the maximal eigenvalue of the matrix X^⊤X. If the learning rate η ∈ (0, 2/λ_max), then
(6)  lim_{k→∞} L(w_k, t_k) = 0.
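The alternating updates (4)–(5) can be checked numerically. The sketch below simulates them on random data with a step size satisfying the condition of Proposition 1; all names and sizes are illustrative. Writing r_k = Xw_k − t_k, the two updates give r_{k+1} = α(I − ηXX^⊤)r_k, so the residual contracts by at least a factor α per step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 8
X = rng.normal(size=(n, d))
t = rng.normal(size=n)            # initial (random) training targets
w = np.zeros(d)
alpha = 0.9
lam_max = np.linalg.eigvalsh(X.T @ X).max()
eta = 1.0 / lam_max               # satisfies eta in (0, 2 / lam_max)

def loss(w, t):
    return 0.5 * np.sum((X @ w - t) ** 2)

losses = [loss(w, t)]
for _ in range(200):
    w = w - eta * X.T @ (X @ w - t)        # Eq. (4): gradient step on w
    t = alpha * t + (1 - alpha) * (X @ w)  # Eq. (5): EMA step on targets
    losses.append(loss(w, t))
# the loss decreases monotonically toward zero, as Proposition 1 predicts
```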
3 Improved Generalization of Deep Models
3.1 Supervised Self-Adaptive Training
Instantiation  We consider a C-class classification problem and denote the images by x_i and the labels by y_i. Given a data pair (x_i, y_i), our approach instantiates the three components of the meta Algorithm 1 for supervised learning as follows:

Targets initialization. Since the labels are provided, the training target t_i is directly initialized as the one-hot encoding of the label y_i.

Normalization function. We use the softmax function to normalize the model predictions into probability vectors p_i = softmax(f_θ(x_i)), such that Σ_j p_{i,j} = 1.

Loss function. Following the common practice in supervised learning, the loss function is implemented as the cross-entropy loss between model predictions and training targets, i.e., L(p_i, t_i) = −Σ_j t_{i,j} log p_{i,j}, where p_{i,j} and t_{i,j} represent the j-th entries of p_i and t_i, respectively.
During the training process, we fix the targets t_i at their initial values in the first E_s training epochs, and update them according to Equation (2) in each following epoch. The number of initial epochs E_s allows the model to capture informative signals in the data set and excludes the ambiguous information provided by model predictions in the early stage of training.
Sample reweighting  Based on the scheme presented above, we introduce a simple yet effective reweighting scheme on each sample. Concretely, given training target t_i, we set
(7)  ω_i = max_j t_{i,j}.
The sample weight ω_i reveals the labeling confidence of the sample. Intuitively, all samples are treated equally in the first E_s epochs. As the targets t_i are updated, our algorithm pays less attention to potentially erroneous data and learns more from potentially clean data. This scheme also allows corrupted samples to regain attention if they are confidently corrected.
Putting everything together  We use stochastic gradient descent to minimize
(8)  L(f_θ) = (1 / Σ_i ω_i) · Σ_i ω_i · L(p_i, t_i)
during the training process. Here, the denominator Σ_i ω_i normalizes the per-sample weights and stabilizes the loss scale. We summarize Supervised Self-Adaptive Training and display the pseudo-code in Algorithm 2. Intuitively, the optimal choice of the hyperparameter E_s should be related to the epoch where overfitting occurs (cf. Fig. 1(a)). For convenience, we directly fix the hyperparameters E_s and α to default values if not specified. Experiments on the sensitivity of our algorithm to these hyperparameters are deferred to Sec. 5.2. Our approach requires no modification to existing network architectures and incurs almost no extra computational cost.
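Equations (7)–(8) can be sketched in a few lines of NumPy; the function below is our illustrative reading of the reweighted supervised loss, not the reference implementation.

```python
import numpy as np

def supervised_sat_loss(logits, targets):
    """Sample-reweighted cross entropy of Eqs. (7)-(8).
    logits: (B, C) raw network outputs; targets: (B, C) soft training targets."""
    # softmax normalization (the function S for the supervised setting)
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    ce = -(targets * np.log(p + 1e-12)).sum(axis=1)       # per-sample cross entropy
    omega = targets.max(axis=1)                           # Eq. (7): confidence weights
    return (omega * ce).sum() / omega.sum()               # Eq. (8)
```

With one-hot targets every ω_i equals 1, so during the first E_s epochs the loss reduces to the standard cross entropy; once the targets soften, low-confidence samples are down-weighted automatically.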
Methodology differences with prior work  Supervised self-adaptive training consists of two components: a) label correction; b) sample reweighting. With these two components, our algorithm is robust to both instance-wise and label-wise noise, and is ready to be combined with various training schemes, such as natural and adversarial training, without incurring multiple rounds of training. In contrast, a vast majority of works on learning from corrupted data follow a pre-processing-then-training fashion with an emphasis on label-wise noise only: this line of research either discards samples based on the disagreement between noisy labels and model predictions [21, 22, 23, 24], or corrects noisy labels [25, 26]; [27] investigated a more generic approach that corrects both label-wise and instance-wise noise. However, their approach inherently suffers from extra computational overhead. Besides, unlike the general scheme in robust statistics [28] and other reweighting methods [29, 30] that use an additional optimization step to update the sample weights, our approach directly obtains the weights from accumulated model predictions and is thus much more efficient.
3.2 Improved generalization under random noise
We consider the noise scheme (including noise type and noise level) and the model capacity as two factors that affect the generalization of deep networks under random noise. We analyze self-adaptive training by varying one of the two factors while fixing the other.
Varying noise schemes  We use ResNet-34 [3] and re-run the same experiments as in Figure 1(a), replacing ERM with our approach. In Figure 1(b), we plot the accuracy curves of models trained with our approach on the four corrupted training sets and compare with Figure 1(a). We highlight the following observations.

Our approach mitigates the overfitting issue in deep networks. The accuracy curves on the noisy training sets (i.e., the red dashed curves in Figure 1(b)) nearly converge to the percentage of clean data in the training sets, and do not reach perfect accuracy.

The generalization errors of self-adaptive training (the gap between the red and blue dashed curves in Figure 1(b)) are much smaller than those in Figure 1(a). We further confirm this observation by displaying the generalization errors of the models trained on the four noisy training sets under various noise rates in the leftmost subfigure of Figure 3. The generalization errors of ERM consistently grow as we increase the injected noise level. In contrast, our approach significantly reduces the generalization errors across all noise levels, from 0% (no noise) to 90% (overwhelming noise).

The accuracy on the clean sets (the cyan and yellow solid curves in Figure 1(b)) is monotonically increasing and converges to higher values than its counterparts in Figure 1(a). We also show the clean validation errors in the right two subfigures of Figure 3. The figures show that the error of self-adaptive training is consistently much smaller than that of ERM.
Varying model capacity  We notice that such an analysis is related to a recently-discovered intriguing phenomenon [33, 34, 35, 36, 31, 37, 32] in modern machine learning models: as the capacity of the model increases, the test error initially decreases, then increases, and finally shows a second descent. This phenomenon is termed double descent [31] and has been widely observed in deep networks [32]. To evaluate the double-descent phenomenon on self-adaptive training, we follow exactly the same experimental setting as [32]: we vary the width parameter of ResNet-18 [3] and train the networks on the CIFAR-10 dataset with 15% of training labels corrupted at random (details are given in Appendix B.1).
Figure 4 shows the test error curves. Self-adaptive training overall achieves much lower test error than ERM, except when using extremely small models that underfit the training data. This suggests that our approach can improve the generalization of deep networks, especially when the model capacity is reasonably large. Besides, we observe that the curve of ERM clearly exhibits the double-descent phenomenon, while the curve of our approach is monotonically decreasing as the model capacity increases. Since the double-descent phenomenon may vanish when label noise is absent [32], our experiment indicates that this phenomenon may be a result of overfitting of noise and can be bypassed by a proper design of the training process, such as self-adaptive training.
3.3 Improved generalization under adversarial noise
Adversarial noise [38] differs from random noise in that the noise is model-dependent and imperceptible to humans. We use the state-of-the-art adversarial training algorithm TRADES [39] as our baseline to evaluate the performance of self-adaptive training under adversarial noise. Algorithmically, TRADES minimizes
(9)  min_θ E_{(x, y)} [ CE(f_θ(x), y) + (1/λ) · max_{x̃ ∈ B(x, ε)} KL(f_θ(x) ‖ f_θ(x̃)) ],
where f_θ(x) is the model prediction, ε is the maximal allowed perturbation, CE stands for cross entropy, KL stands for the Kullback–Leibler divergence, and the hyperparameter λ controls the trade-off between robustness and accuracy. We replace the CE term in the TRADES loss with our method. The models are evaluated using robust accuracy on adversarial examples generated by the white-box AutoAttack [40] with ε = 0.031 (the evaluation under the projected gradient descent attack [41] is given in Fig. 14 of Appendix C). We set the initial learning rate to 0.1 and decay it by a factor of 0.1 at epochs 75 and 90, respectively. We choose λ as suggested by [39] and use E_s = 70, α = 0.9 for our approach. Experimental details are given in Appendix B.2.
We display the robust accuracy on the CIFAR-10 test set after epoch E_s = 70 in Figure 5. It shows that the robust accuracy of TRADES reaches its highest value around the epoch of the first learning rate decay (epoch 75) and decreases afterwards, which suggests that overfitting might happen if we train the model without early stopping. On the other hand, our method considerably mitigates the overfitting issue in adversarial training and consistently improves the robust accuracy of TRADES by 1%–3%, which indicates that self-adaptive training can improve generalization in the presence of adversarial noise.
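As a sketch of how the CE term of Equation (9) is replaced while the KL regularizer is kept, the function below computes the per-batch objective from given natural and adversarial logits. Generating the adversarial examples (the inner maximization) is omitted, and the function name and default trade-off value are our assumptions, not values from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

def trades_style_loss(logits_nat, logits_adv, targets, inv_lambda=1.0):
    """Outer objective of Eq. (9), with the CE term computed against the
    (possibly corrected) training targets t_i instead of the raw labels.
    `inv_lambda` plays the role of 1/lambda; its value is illustrative."""
    p_nat, p_adv = softmax(logits_nat), softmax(logits_adv)
    ce = -(targets * np.log(p_nat + 1e-12)).sum(axis=1)    # accuracy term
    kl = (p_nat * (np.log(p_nat + 1e-12)                   # robustness term
                   - np.log(p_adv + 1e-12))).sum(axis=1)
    return (ce + inv_lambda * kl).mean()
```

When the adversarial logits coincide with the natural ones, the KL term vanishes and the objective reduces to the (target-corrected) cross entropy alone.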
4 Improved Representation Learning
4.1 Self-Supervised Self-Adaptive Training
Instantiation  We consider training images {x_i} without labels and use a deep network followed by a non-linear projection head as the encoder f_θ. Then, we instantiate the meta Algorithm 1 for self-supervised learning as follows:

Target initialization. Since labels are absent, each training target t_i is randomly and independently drawn from a standard normal distribution.

Normalization function. We directly normalize each representation by dividing it by its ℓ2 norm.

Loss function. The loss function is implemented as the Mean Square Error (MSE) between the normalized model predictions and training targets.
The above instantiation of our meta algorithm suffices to learn decent representations, as exhibited by the blue curve in Fig. 2. However, as discussed in Sec. 2.1, the consistency of training targets plays an essential role, especially in self-supervised representation learning; we therefore introduce two components that further improve the representation learning of self-adaptive training.
Momentum encoder and predictor  We follow prior works [10, 42] and employ a momentum encoder f_ξ, whose parameters ξ are also updated by an EMA scheme as
(10)  ξ ← β · ξ + (1 − β) · θ.
With the slowly-evolving f_ξ, we can obtain the representation
(11)  z_i = f_ξ(x_i),
and construct the target t_i from z_i following the EMA scheme in Equation (2).
Furthermore, to prevent the model from outputting the same representation for every image in each iteration (i.e., collapsing), we further use a predictor h to transform the output of the encoder into the prediction
(12)  q_i = h(f_θ(x_i)),
where the predictor h has the same number of output units as the encoder f_θ.
Putting everything together  We normalize q_i to q̄_i = q_i/‖q_i‖₂ and t_i to t̄_i = t_i/‖t_i‖₂. Finally, the MSE loss between the normalized predictions and accumulated representations,
(13)  L = Σ_i ‖q̄_i − t̄_i‖₂²,
is minimized to update the encoder f_θ and predictor h. We term this variant Self-Supervised Self-Adaptive Training, summarize the pseudo-code in Algorithm 3 and the overall pipeline in Fig. 6. Our approach is straightforward to implement in practice and requires only single-view training, which significantly alleviates the heavy computational burden of data augmentation operations.
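A single training step combining Equations (10)–(13) can be sketched as follows, with the encoder and predictor reduced to plain linear maps; all shapes, names, and constants are illustrative rather than taken from Algorithm 3.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Divide each row by its l2 norm (the normalization function S)."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
dim, feat = 16, 8
theta = rng.normal(size=(dim, feat))   # encoder weights (stands in for f_theta)
xi = theta.copy()                      # momentum encoder weights (f_xi)
H = rng.normal(size=(feat, feat))      # predictor h
x = rng.normal(size=(4, dim))          # a batch of 4 inputs, single view each
targets = l2_normalize(rng.normal(size=(4, feat)))  # random initial targets t_i

beta, alpha = 0.99, 0.9
xi = beta * xi + (1 - beta) * theta            # Eq. (10): momentum encoder update
z = x @ xi                                     # Eq. (11): momentum representation
targets = alpha * targets + (1 - alpha) * z    # Eq. (2): EMA target update
q = (x @ theta) @ H                            # Eq. (12): prediction via predictor
loss = np.mean(np.sum((l2_normalize(q)         # Eq. (13): MSE on the unit sphere
                       - l2_normalize(targets)) ** 2, axis=1))
```

Because both vectors are unit-normalized, the per-sample loss lies in [0, 4] and equals 2 − 2·cos(q̄_i, t̄_i), so minimizing it aligns each prediction with its accumulated target.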
Methodology differences with prior works  BYOL [42] formulates the self-supervised training of deep models as predicting the representation of one augmented view of an image from another augmented view of the same image. Self-supervised self-adaptive training shares some similarities with BYOL, since neither method needs to contrast against negative examples to prevent the collapsing issue. However, instead of directly using the output of the momentum encoder as training targets, our approach uses the accumulated predictions as training targets, which contain all historical view information for each image. As a result, our approach requires only a single view during training, which is much more efficient, as shown in Sec. 4.3 and Fig. 7. Besides, NAT [15] used an online clustering algorithm to assign a noise vector to each image as the training target for representation learning. Unlike NAT, which fixes the noise while updating the noise assignment during training, our approach uses the noise only as the initial training targets and updates them with model predictions during the subsequent training process. InstDisc [13] uses a memory bank to store the representation of each image in order to construct positive and negative samples for a contrastive objective. By contrast, our method gets rid of negative samples and only matches the prediction with the training target of the same image (i.e., the positive sample).
4.2 Bypassing collapsing issues
We note that there exist trivial local minima when we directly optimize the MSE loss between predictions and training targets due to the absence of negative pairs: the encoder can simply output a constant feature vector for every data point to minimize the training loss, a.k.a. the collapsing issue [42]. Despite the existence of collapsing solutions, self-adaptive training intrinsically prevents collapse. The initial targets are different for different classes (under the supervised setting) or even for different images (under the self-supervised setting), enforcing the models to learn different representations for different classes/images. Our empirical study in Fig. 1, 2 of the main body and Fig. 11 of the Appendix strongly supports that deep neural networks are able to learn meaningful information from corrupted data or even random noise, and bypass model collapse. Based on this, our approach significantly improves the supervised (see Fig. 1(a)) and self-supervised learning (see the blue curve in Fig. 2) of deep neural networks.
4.3 Is multi-view training indispensable?
The success of state-of-the-art self-supervised learning approaches [10, 11, 42] largely hinges on the multi-view training scheme: they essentially use strong data augmentation operations to create multiple views (crops) of the same image and then match the representation of one view with the other views of the same image. Despite the promising performance, these methods suffer heavily from the computational burden of data pre-processing and training on extra views. Concretely, as shown in Fig. 7, the prior methods BYOL [42] and MoCo [10] incur doubled training time compared with standard supervised cross-entropy training. In contrast, since our method requires only single-view training, it is only slightly slower than the supervised method and much faster than MoCo and BYOL.
We further conduct experiments to evaluate the performance of the multi-view training scheme on the CIFAR-10 and STL-10 datasets (see Sec. 7.1 for details), which casts doubt on its necessity for learning a good representation. As shown in Table I, although the performance of MoCo and BYOL is nearly halved on both datasets when using the single-view training scheme, self-supervised self-adaptive training achieves comparable results under both settings. Moreover, our approach with single-view training even slightly outperforms MoCo and BYOL with multi-view training. We attribute this superiority to the guiding principle of our algorithm: by dynamically incorporating the model predictions, each training target contains relevant information about all historical views of each image and, therefore, implicitly enforces the model to learn representations that are invariant to the historical views.
4.4 On the power of momentum encoder and predictor
Momentum encoder and predictor are two important components of self-supervised self-adaptive training and BYOL. The work of BYOL [42] showed that the algorithm may converge to a trivial solution if either of them is removed, not to mention removing both. As shown in Table II, however, our results challenge this conclusion: 1) with the predictor, the linear evaluation accuracy of either BYOL (>85%) or our method (>90%) is non-trivial, regardless of the absence or configuration of the momentum encoder; 2) without the predictor, a momentum encoder with sufficiently large momentum can also improve the performance and bypass the collapsing issue. The results suggest that although both the predictor and the momentum encoder are indeed crucial for the performance of representation learning, either one of them with a proper configuration (i.e., the momentum term) suffices to avoid collapse. We note that the latest version of [42] also found that the momentum encoder can be removed without collapse when carefully tuning the learning rate of the predictor. Our results, however, are obtained using the same training setting.
Moreover, we find that self-supervised self-adaptive training exhibits impressive resistance to collapse despite using only single-view training. Our approach can learn decent representations even when the predictor and momentum encoder are both removed (see the seventh row of Table II). We hypothesize that this resistance comes from the consistency of the training targets due to our EMA scheme. This hypothesis is also partly supported by the observation that the learning of BYOL heavily depends on the slowly-evolving momentum encoder.
TABLE II
Method     Predictor  Momentum  Accuracy
BYOL [42]  ✗          0.0       25.64
           ✗          0.99      26.87
           ✗          0.999     72.44
           ✓          0.0       85.22
           ✓          0.99      91.78
           ✓          0.999     90.68
Ours       ✗          0.0       78.36
           ✗          0.99      79.68
           ✗          0.999     83.58
           ✓          0.0       90.18
           ✓          0.99      92.27
           ✓          0.999     90.92
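The momentum-encoder update studied in Table II maintains an exponential moving average of the online encoder's parameters. A minimal, framework-agnostic sketch (the scalar "parameters" and the value m = 0.99 are purely illustrative):

```python
def momentum_update(online_params, momentum_params, m=0.99):
    """EMA update of the momentum encoder's parameters:
    theta_momentum <- m * theta_momentum + (1 - m) * theta_online.
    m = 0.0 reduces to directly copying the online encoder (no momentum
    encoder); a large m (e.g., 0.999) makes the target network evolve slowly."""
    return [m * pm + (1.0 - m) * po
            for po, pm in zip(online_params, momentum_params)]

# Toy example with scalar "parameters".
online = [1.0, 2.0]
momentum = [0.0, 0.0]
momentum = momentum_update(online, momentum, m=0.99)
# After one step, the momentum parameters move 1% of the way toward the online ones.
```

With m close to 1 the momentum encoder changes slowly, which is exactly the slowly-evolving behavior that the discussion above identifies as helpful against collapse.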
TABLE III
Backbone   Method                  CIFAR10                       CIFAR100
           Label Noise Rate        0.2    0.4    0.6    0.8      0.2    0.4    0.6    0.8
ResNet34   ERM + Early Stopping    85.57  81.82  76.43  60.99    63.70  48.60  37.86  17.28
           Label Smoothing [43]    85.64  71.59  50.51  28.19    67.44  53.84  33.01  9.74
           Forward [44]            87.99  83.25  74.96  54.64    39.19  31.05  19.12  8.99
           Mixup [45]              93.58  89.46  78.32  66.32    69.31  58.12  41.10  18.77
           Trunc [46]              89.70  87.62  82.70  67.92    67.61  62.64  54.04  29.60
           Joint Opt [26]          92.25  90.79  86.87  69.16    58.15  54.81  47.94  17.18
           SCE [47]                90.15  86.74  80.80  46.28    71.26  66.41  57.43  26.41
           DAC [48]                92.91  90.71  86.30  74.84    73.55  66.92  57.17  32.16
           SELF [24]               –      91.13  –      63.59    –      66.71  –      35.56
           ELR [49]                92.12  91.43  88.87  80.69    74.68  68.43  60.05  30.27
           Ours                    94.14  92.64  89.23  78.58    75.77  71.38  62.69  38.72
WRN-28-10  ERM + Early Stopping    87.86  83.40  76.92  63.54    68.46  55.43  40.78  20.25
           MentorNet [29]          92.0   89.0   –      49.0     73.0   68.0   –      35.0
           DAC [48]                93.25  90.93  87.58  70.80    75.75  68.20  59.44  34.06
           SELF [24]               –      93.34  –      67.41    –      72.48  –      42.06
           Ours                    94.84  93.23  89.42  80.13    77.71  72.60  64.87  44.17
5 Application I: Learning with Noisy Labels
Given the improved generalization of self-adaptive training over ERM under noise, we present applications of our approach, which outperforms the state-of-the-art by a significant margin.
5.1 Problem formulation
Given a set of noisy training data {(x_i, ỹ_i)}_{i=1}^n drawn from a noisy distribution D̃, where ỹ_i is the noisy label for each uncorrupted sample x_i, the goal is to be robust to the label noise in the training data and to improve the classification performance on clean test data sampled from the clean distribution D.
5.2 Experiments on CIFAR datasets
Setup  We consider the case where the labels are assigned uniformly at random with different noise rates. Following previous work [46, 48], we conduct the experiments on the CIFAR10 and CIFAR100 datasets [14] and use ResNet34 [3] / Wide ResNet-28-10 [50] as our base classifiers. The networks are implemented in PyTorch [51] and optimized using SGD with an initial learning rate of 0.1, momentum of 0.9, weight decay of 0.0005, batch size of 256, and 200 total training epochs. The learning rate is decayed to zero using a cosine annealing schedule [52]. We use standard data augmentation of random horizontal flipping and cropping. We report the average performance over 3 trials.
Main results  We summarize the experiments in Table III. Most of the results are directly cited from the original papers with the same experimental settings; the results of Label Smoothing [43], Mixup [45], Joint Opt [26] and SCE [47] are reproduced by re-running the official open-sourced implementations. From the table, we can see that our approach outperforms the state-of-the-art methods in most entries by 1%–5% on both the CIFAR10 and CIFAR100 datasets, using different backbones. Notably, unlike the Joint Opt, DAC and SELF methods that require multiple iterations of training, our method enjoys the same computational budget as ERM.
TABLE IV
                               CIFAR10          CIFAR100
Label Noise Rate               0.4     0.8      0.4     0.8
Ours                           92.64   78.58    71.38   38.72
w/o Reweighting                92.49   78.10    69.52   36.78
w/o Exponential Moving Average 72.00   28.17    50.93   11.57
Ablation study and hyperparameter sensitivity  First, we report the performance of ERM equipped with a simple early stopping scheme in the first row of Table III. We observe that our approach achieves substantial improvements over this baseline, demonstrating that simply early-stopping the training process is a suboptimal solution. Then, we report the influence of two individual components of our approach: the Exponential Moving Average (EMA) and the sample reweighting scheme. As displayed in Table IV, removing either component considerably hurts the performance under all noise rates, and removing the EMA scheme leads to a significant performance drop. This suggests that properly incorporating model predictions is important in our approach. Finally, we analyze the sensitivity of our approach to the hyperparameters α and E_s in Table V, where we fix one parameter while varying the other. The performance is stable for various choices of α and E_s, indicating that our approach is insensitive to hyperparameter tuning.
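The two components ablated in Table IV can be illustrated with a minimal, framework-agnostic sketch: the target update t_i ← α·t_i + (1−α)·p_i and a confidence-based sample weight follow the description in the text, while the toy class probabilities below are purely illustrative and the exact update in Algorithm 1 may differ in details:

```python
def update_target(target, prediction, alpha=0.9):
    """EMA update of the soft training target for one sample:
    t_i <- alpha * t_i + (1 - alpha) * p_i."""
    return [alpha * t + (1.0 - alpha) * p for t, p in zip(target, prediction)]

def sample_weight(target):
    """Reweighting: weight each sample by the confidence of its current target."""
    return max(target)

# Toy example with 3 classes: a (possibly wrong) one-hot label drifts
# toward the model's confident prediction over repeated updates.
target = [1.0, 0.0, 0.0]       # noisy label says class 0
prediction = [0.1, 0.8, 0.1]   # model is confident in class 1
for _ in range(50):
    target = update_target(target, prediction, alpha=0.9)
w = sample_weight(target)      # weight reflects the target's confidence
```

After many updates the target approaches the model's prediction, so a sample whose target never becomes confident is automatically down-weighted, which is the behavior the ablation above isolates.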
TABLE V
Momentum α (E_s fixed)     0.6    0.8    0.9    0.95   0.99
Accuracy                   90.17  91.91  92.64  92.54  84.38
Start epoch E_s (α fixed)  20     40     60     80     100
Accuracy                   89.58  91.89  92.64  92.26  88.83
TABLE VI
                  ResNet50        ResNet101
Label Noise Rate  0.0     0.4     0.0     0.4
ERM               76.8    69.5    78.2    70.2
Ours              77.2    71.5    78.7    73.5
5.3 Experiments on ImageNet dataset
The work of [53] suggested that the ImageNet dataset [54] contains annotation errors of its own, even after several rounds of cleaning. Therefore, in this subsection, we use ResNet50/101 [3] to evaluate self-adaptive training on large-scale ImageNet under both the standard setup (i.e., using the original labels) and the case where 40% of the training labels are corrupted. We provide the experimental details in Appendix B.3 and report model performance on the ImageNet validation set in terms of top-1 accuracy in Table VI. We can see that self-adaptive training consistently improves over the ERM baseline by a considerable margin under all settings and models. Specifically, the improvement can be as large as 3% in absolute terms for the larger ResNet101 when 40% of the training labels are corrupted. The results validate the effectiveness of our approach on a large-scale dataset and larger models.
5.4 Label recovery of self-adaptive training
We demonstrate that our approach is able to recover the true labels from noisy training labels: we obtain the recovered label of each sample as the argmax of its moving-average target t_i and compute the recovered accuracy as (1/n) Σ_i 1{argmax_j t_{i,j} = y_i*}, where y_i* is the clean label of each training sample. When 40% of the labels are corrupted in the CIFAR10 and ImageNet training sets, our approach successfully corrects a huge number of labels and obtains recovered accuracies of 94.6% and 81.1%, respectively. We also display the confusion matrix of the recovered labels w.r.t. the clean labels on CIFAR10 in Figure 8, from which we see that our approach performs well for all classes.
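The recovered-accuracy computation can be sketched as follows; this is a simplified illustration where `targets` holds the moving-average target vectors and `clean_labels` the uncorrupted labels (both names are ours, not from the paper):

```python
def recovered_accuracy(targets, clean_labels):
    """Fraction of samples whose moving-average target's argmax equals the clean label."""
    correct = 0
    for t, y in zip(targets, clean_labels):
        recovered = max(range(len(t)), key=t.__getitem__)  # argmax_j t_{i,j}
        correct += int(recovered == y)
    return correct / len(targets)

# Toy example: 3 samples, 3 classes; the third target points to the wrong class.
targets = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3]]
clean = [0, 1, 2]
acc = recovered_accuracy(targets, clean)  # 2 of 3 recovered correctly
```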
5.5 Investigation of sample weights
We further inspect the reweighting scheme of self-adaptive training. Following the procedure in Section 5.4, we display the average sample weights in Figure 9. In the figure, the (i, j)-th block contains the average weight of samples with clean label i and recovered label j; the white areas represent cells that contain no samples. We see that the weights on the diagonal blocks are clearly higher than those on the off-diagonal blocks. The figure indicates that, aside from its impressive ability to recover correct labels, self-adaptive training properly down-weights the noisy examples.
TABLE VII
Dataset        Coverage  Ours        Deep Gamblers [55]  SelectiveNet [56]  SR [57]     MC-dropout [57]
CIFAR10        100       6.05±0.20   6.12±0.09           6.79±0.03          6.79±0.03   6.79±0.03
               95        3.37±0.05   3.49±0.15           4.16±0.09          4.55±0.07   4.58±0.05
               90        1.93±0.09   2.19±0.12           2.43±0.08          2.89±0.03   2.92±0.01
               85        1.15±0.18   1.09±0.15           1.43±0.08          1.78±0.09   1.82±0.09
               80        0.67±0.10   0.66±0.11           0.86±0.06          1.05±0.07   1.08±0.05
               75        0.44±0.03   0.52±0.03           0.48±0.02          0.63±0.04   0.66±0.05
               70        0.34±0.06   0.43±0.07           0.32±0.01          0.42±0.06   0.43±0.05
SVHN           100       2.75±0.09   3.24±0.09           3.21±0.08          3.21±0.08   3.21±0.08
               95        0.96±0.09   1.36±0.02           1.40±0.01          1.39±0.05   1.40±0.05
               90        0.60±0.05   0.76±0.05           0.82±0.01          0.89±0.04   0.90±0.04
               85        0.45±0.02   0.57±0.07           0.60±0.01          0.70±0.03   0.71±0.03
               80        0.43±0.01   0.51±0.05           0.53±0.01          0.61±0.02   0.61±0.01
Dogs vs. Cats  100       3.01±0.17   2.93±0.17           3.58±0.04          3.58±0.04   3.58±0.04
               95        1.25±0.05   1.23±0.12           1.62±0.05          1.91±0.08   1.92±0.06
               90        0.59±0.04   0.59±0.13           0.93±0.01          1.10±0.08   1.10±0.05
               85        0.25±0.11   0.47±0.10           0.56±0.02          0.82±0.06   0.78±0.06
               80        0.15±0.06   0.46±0.08           0.35±0.09          0.68±0.05   0.55±0.02
6 Application II: Selective Classification
6.1 Problem formulation
Selective classification, a.k.a. classification with rejection, trades classifier coverage off against accuracy [58], where coverage is defined as the fraction of classified samples in the dataset; the classifier is allowed to output "don't know" for certain samples. The task focuses on the noise-free setting and allows the classifier to abstain on potential out-of-distribution samples or samples that lie in the tail of the data distribution, that is, to make predictions only on samples for which it is confident.
Formally, a selective classifier is a composition of two functions (f, g), where f is a conventional c-class classifier and g is a selection function that reveals the underlying uncertainty of inputs. Given an input x, the selective classifier outputs
(f, g)(x) = f(x) if g(x) ≥ τ, and "don't know" otherwise,    (14)
for a given threshold τ that controls the trade-off.
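A minimal sketch of the selective classifier in Equation (14). Instantiating g as the maximum softmax probability is one common choice (Softmax Response), not the specific selection function adopted in Sec. 6.2:

```python
def selective_classify(probs, tau):
    """Selective classifier (f, g): return the predicted class if g(x) >= tau,
    else None ("don't know"). Here g(x) is the maximum softmax probability."""
    g = max(probs)
    if g >= tau:
        return max(range(len(probs)), key=probs.__getitem__)  # f(x) = argmax
    return None

confident = selective_classify([0.9, 0.05, 0.05], tau=0.5)   # predicts class 0
uncertain = selective_classify([0.4, 0.35, 0.25], tau=0.5)   # abstains
```

Raising τ abstains on more inputs (lower coverage) in exchange for lower error on the inputs that are classified, which is exactly the trade-off measured in Table VII.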
6.2 Approach
Inspired by [48, 55], we adapt the approach presented in Algorithm 1 to the selective classification task. We introduce an extra (c+1)-th class (representing abstention) during training and replace the selection function in Equation (14) with one minus the predicted probability of the abstention class. In this way, we can train a selective classifier in an end-to-end fashion. Besides, unlike previous works that provide no explicit signal for learning the abstention class, we use model predictions as a guideline in the design of the learning process.
Given a mini-batch B of data pairs {(x_i, y_i)}, the model prediction p_i and its exponential moving average t_i for each sample, we optimize the classifier by minimizing:
L = −(1/|B|) Σ_{i∈B} [ t_{i,y_i} log p_{i,y_i} + (1 − t_{i,y_i}) log p_{i,c+1} ],    (15)
where y_i is the index of the non-zero element in the one-hot label vector. The first term measures the cross-entropy loss between the prediction and the original label y_i, in order to learn a good multi-class classifier. The second term acts as the selection function and identifies uncertain samples in the dataset. t_{i,y_i} dynamically trades off these two terms: if t_{i,y_i} is very small, the sample is deemed uncertain and the second term enforces the selective classifier to learn to abstain on this sample; if t_{i,y_i} is close to 1, the loss recovers standard cross-entropy minimization and enforces the selective classifier to make a confident prediction.
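A per-sample sketch of the loss in Equation (15), assuming the model outputs a (c+1)-dimensional probability vector whose last entry is the abstention class; the eps term guards against log 0 and is an implementation detail of this sketch:

```python
import math

def abstention_loss(p, y, t_y, eps=1e-12):
    """Per-sample loss of Eq. (15):
    -( t_y * log p[y] + (1 - t_y) * log p[c+1] ),
    where p has c+1 entries and the last one is the abstention class."""
    return -(t_y * math.log(p[y] + eps) + (1.0 - t_y) * math.log(p[-1] + eps))

# When t_y -> 1, the loss reduces to standard cross entropy on the true class;
# when t_y -> 0, it only encourages probability mass on the abstention class.
p = [0.6, 0.1, 0.1, 0.2]                    # 3 classes + abstention
ce_like = abstention_loss(p, y=0, t_y=1.0)  # == -log 0.6
abstain = abstention_loss(p, y=0, t_y=0.0)  # == -log 0.2
```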
6.3 Experiments
Setup  We conduct the experiments on three datasets: CIFAR10 [14], SVHN [59] and Dogs vs. Cats [60]. We compare our method with previous state-of-the-art methods on selective classification, including Deep Gamblers [55], SelectiveNet [56], Softmax Response (SR) and MC-dropout [57]. The experiments are based on the official open-sourced implementations.
Main results  The results of prior methods are cited from the original papers and summarized in Table VII. We see that our method achieves up to 50% relative improvement compared with all other methods under various coverage rates on all datasets. Notably, Deep Gamblers also introduces an extra abstention class but does not apply model predictions. The improved performance of our method comes from the use of model predictions in the training process.
7 Application III: Linear Evaluation Protocol in Self-Supervised Learning
7.1 Experimental setup
Datasets and data augmentations  We conduct experiments on three benchmarks: CIFAR10/CIFAR100 [14] with 50K images, and STL10 [63] with 105K images. The choice of data augmentations follows prior works [10, 11]: we take a random crop from each image and resize it to the original size (i.e., 32×32 for CIFAR10/CIFAR100 and 96×96 for STL10); the crop is then transformed by random color jittering, random horizontal flipping, and random grayscale conversion.
Network architecture  The encoders consist of a backbone of ResNet18 [3]/ResNet50 [3]/AlexNet [64] and a projector instantiated by a multi-layer perceptron (MLP). We use the output of the last global average pooling layer of the backbone as the extracted feature vector. Following prior works [11], the output vectors of the backbone are transformed by the projector MLP to dimension 256. Besides, the predictor is also instantiated by an MLP with the same architecture as the projector. In our implementation, all the MLPs have one hidden layer of size 4,096, followed by a batch normalization [61] layer and a ReLU activation [65].
Self-supervised pretraining settings  We optimize the networks using the SGD optimizer with momentum of 0.9 and weight decay of 0.00005. By default, we use a batch size of 512 for all methods in all experiments and train the networks for 800 epochs using 4 GTX 1080Ti GPUs. The base learning rate is set to 2.0 and is scaled linearly with respect to the batch size following [66]. During training, the learning rate is warmed up for the first 30 epochs and then adjusted according to a cosine annealing schedule [52].
Linear evaluation protocol  Following common practice [10, 11, 42], we evaluate the representations learned by self-supervised pretraining using the linear classification protocol. That is, we remove the projector and the predictor, fix the parameters of the backbone of the encoder, and train a supervised linear classifier on top of the features extracted from the encoder. The linear classifier is trained for 100 epochs with weight decay 0.0 and batch size 512. The initial learning rate is set to 0.4 and decayed by a factor of 0.1 at the 60th and 80th training epochs. The performance is measured in terms of top-1 accuracy of the linear classifier on test data.
TABLE VIII
Backbone  Method            CIFAR10  CIFAR100  STL10
AlexNet   SplitBrain [67]   67.1     39.0      –
          DeepCluster [68]  77.9     41.9      –
          InstDisc [13]     70.1     39.4      –
          AND [69]          77.6     47.9      –
          SeLa [70]         83.4     57.4      –
          CMC [71]          –        –         83.28
          Ours              83.55    59.80     83.75
ResNet50  MoCo [10]         93.20    69.48     91.95
          SimCLR [11]       93.08    67.92     90.90
          BYOL [42]         93.48    68.48     92.40
          Ours              94.04    70.16     92.60
7.2 Comparison with the stateoftheart
We first compare our self-supervised self-adaptive training with three state-of-the-art methods: the contrastive learning methods MoCo [10] and SimCLR [11], and the bootstrap method BYOL [42]. For a fair comparison, we use the same code base and the same experimental settings for all methods, following their official open-sourced implementations. We carefully tune the hyperparameters on the CIFAR10 dataset for each method and use the same parameters on the remaining datasets. Besides, we also conduct experiments using AlexNet as the backbone and compare the performance of our method with the reported results of prior methods. The results are summarized in Table VIII. We can see that, despite using only single-view training, our self-supervised self-adaptive training consistently obtains better performance than all other methods on all datasets with different backbones.
7.3 Sensitivity of self-supervised self-adaptive training
Sensitivity to hyperparameters  We study how the two momentum parameters affect the performance of our approach and report the results in Table IX. By varying one of the parameters while fixing the other, we observe that self-supervised self-adaptive training performs consistently well, which suggests that our approach is not sensitive to the choice of hyperparameters.
Sensitivity to data augmentation  Data augmentation is one of the most essential ingredients of recent self-supervised learning methods: a large body of these methods, including ours, formulate the training objective as learning representations that encode the information shared across different views generated by data augmentation. As a result, prior methods such as MoCo and BYOL fail in the single-view training setting. Self-supervised self-adaptive training, on the other hand, maintains a training target for each image that contains all historical view information of this image. Therefore, we conjecture that our method should be more robust to data augmentation. To validate this conjecture, we evaluate our method and BYOL under settings where some of the augmentation operators are removed. The results are shown in Fig. 9(a). Removing any augmentation operator hurts the performance of both methods, while our method is less affected. Specifically, when all augmentation operators except random crop and flip are removed, the performance of BYOL drops to 86.6% while our method still obtains 88.6% accuracy.
TABLE IX
Target momentum (encoder momentum fixed)   0.5    0.6    0.7    0.8    0.9
Accuracy                                   91.68  91.76  92.27  92.08  91.68
Encoder momentum (target momentum fixed)   0.9    0.95   0.99   0.995  0.999
Accuracy                                   90.68  91.60  92.27  92.16  90.92
Sensitivity to training batch size  Recent contrastive learning methods require large-batch training (e.g., batch size 1024 or even larger) for optimal performance, due to the need to compare against massive numbers of negative samples. BYOL does not use negative samples and suggests that this issue can be mitigated. Since our method also dispenses with negative samples, we directly compare with BYOL at different batch sizes to evaluate the sensitivity of our method to batch size. For each method, the base learning rate is linearly scaled according to the batch size while the rest of the settings are kept unchanged. The results are shown in Fig. 9(b). We can see that our method exhibits a smaller performance drop than BYOL at various batch sizes. Concretely, the accuracy of BYOL drops by 2% at batch size 128 while that of ours drops by only 1%.
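The linear scaling rule [66] used for the learning rate can be written as a one-line helper. The reference batch size of 512 below is an assumption tied to the default setup in Sec. 7.1 (base learning rate 2.0 at batch size 512), not a value stated explicitly in the text:

```python
def scaled_lr(base_lr, batch_size, reference_batch=512):
    """Linear learning-rate scaling [66]: lr grows proportionally to batch size.
    reference_batch is the batch size at which base_lr was tuned (assumed to be
    512 here, matching the default pretraining setup)."""
    return base_lr * batch_size / reference_batch

lr_default = scaled_lr(2.0, 512)  # default setup: unchanged
lr_small = scaled_lr(2.0, 128)    # smaller batch -> proportionally smaller lr
```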
8 Related Works
Generalization of deep networks  Previous work [16] systematically analyzed the capability of deep networks to fit random noise. Their results show that traditional wisdom fails to explain the generalization of deep networks. Another line of works [33, 34, 35, 36, 31, 37, 32] observed an intriguing double-descent risk curve in the bias-variance trade-off; [31, 32] claimed that this observation challenges the conventional textbook U-shaped risk curve. Our work shows that this observation may stem from overfitting of noise; the phenomenon vanishes under a proper design of the training process such as self-adaptive training. To improve the generalization of deep networks, [43, 72] proposed label smoothing regularization, which uniformly distributes a small amount of labeling weight to all classes and uses the resulting soft label for training; [45] introduced mixup augmentation, which extends the training distribution by dynamic interpolations between randomly paired input images and their associated targets during training. This line of research is similar to ours in that both use soft labels during training. However, self-adaptive training is able to recover true labels from noisy labels and is more robust to noise.
Robust learning from corrupted data  Aside from the preprocessing-training approaches discussed in the last paragraph of Section 3.1, there have been many other works on learning from noisy data. To name a few, [73, 20] showed that deep neural networks tend to fit clean samples first and that overfitting of noise occurs in the later stage of training; [20] further proved that early stopping can mitigate the issues caused by label noise. [74, 75] incorporated model predictions into training by simple interpolation between labels and model predictions; we demonstrate that our exponential moving average and sample reweighting schemes enjoy superior performance. Other works [46, 47] proposed alternative loss functions to cross entropy that are robust to label noise; they are orthogonal to ours and can readily cooperate with our approach, as shown in Appendix C.4. Beyond the corrupted data setting, recent works [76, 77] proposed self-training schemes that also use model predictions as training targets. However, they suffer from the heavy cost of multiple iterations of training, which is avoided by our approach. Temporal Ensembling [78] incorporated the 'ensemble' predictions as pseudo-labels for training. Different from ours, Temporal Ensembling focuses on the semi-supervised learning setting and only accumulates predictions for unlabeled data.
Self-supervised learning  Aiming to learn powerful representations, most self-supervised learning approaches first solve a proxy task without human supervision. For example, prior works proposed recovering the input using auto-encoders [79, 80], generating pixels in the input space [81, 82], predicting rotation [83] and solving jigsaw puzzles [84]. Recently, contrastive learning methods [13, 85, 71, 10, 11, 86] significantly advanced self-supervised representation learning. These approaches essentially use strong data augmentation to create multiple views (crops) of the same image and discriminate the representations of different views of the same image (i.e., positive samples) from the views of other images (i.e., negative samples). Bootstrap methods eliminate the discrimination between positive and negative data pairs: the works of [68, 70] alternately perform clustering on the representations and then use the cluster assignments as classification targets to update the model; [12] swaps the cluster assignments between two views of the same image as training targets; [42] simply predicts the representation of one view from the other view of the same image; [15] formulates the self-supervised training objective as predicting a set of predefined noises. Our work follows the path of bootstrap methods. Going further, self-adaptive training is a general training algorithm that bridges the supervised and self-supervised learning paradigms. Our approach casts doubt on the necessity of costly multi-view training and works well with the single-view training scheme.
9 Conclusion
In this paper, we explore the possibility of a unified framework to bridge the supervised and self-supervised learning of deep neural networks. We first analyze the training dynamics of deep networks under these two learning settings and observe that useful information from data is distilled into model predictions. This observation holds broadly even in the presence of data corruption and the absence of labels, which motivates us to propose Self-Adaptive Training—a general training algorithm that dynamically incorporates model predictions into the training process. We demonstrate that our approach improves the generalization of deep neural networks under various kinds of training data corruption and enhances representation learning using accumulated model predictions. Finally, we present three applications of self-adaptive training: learning with noisy labels, selective classification, and the linear evaluation protocol in self-supervised learning, where our approach significantly advances the state-of-the-art.
Appendix A Proof
Proposition 1.
Let λ_max be the maximal eigenvalue of the matrix. If the learning rate is sufficiently small relative to 1/λ_max, then
(16) 
Proof.
Note that the matrix is positive semi-definite and can be diagonalized, where the diagonal matrix contains its eigenvalues and the orthogonal matrix contains the corresponding eigenvectors. Multiplying both sides of Equation (17) by the eigenvector matrix, we have
(18) 
From Equation (4), we have
(19) 
Subtracting the same quantity from both sides of Equation (19), we obtain
(20) 
where . Therefore,
(21) 
Because the matrix is positive semi-definite and the learning rate is bounded as assumed, all elements of the diagonal matrix are smaller than 1. When the condition of the proposition holds, each element of the resulting diagonal matrix is greater than 1, and we have
(22) 
∎
Appendix B Experimental Setups
B.1 Double descent phenomenon
Following previous work [32], we optimize all models using the Adam [87] optimizer with a fixed learning rate of 0.0001, batch size of 128, common data augmentation, and no weight decay, for 4,000 epochs. For our approach, we use the hyperparameters of the standard ResNet18 (width of 64) and dynamically adjust them for other models according to model capacity as:
(23) 
B.2 Adversarial training
[38] reported that imperceptibly small perturbations of input data (i.e., adversarial examples) can cause deep neural networks trained by ERM to make arbitrary predictions. Since then, a large literature has been devoted to improving the adversarial robustness of deep neural networks. Among these methods, the adversarial training algorithm TRADES [39] achieves state-of-the-art performance. TRADES decomposes the robust error (w.r.t. adversarial examples) into the sum of the natural error and a boundary error, and proposes to minimize:
min_f E_{(x,y)} [ CE(f(x), y) + β · max_{‖x′−x‖≤ε} KL(f(x) ‖ f(x′)) ],    (24)
where f(·) is the model prediction, ε is the maximal allowed perturbation, CE stands for cross entropy, and KL stands for the Kullback–Leibler divergence. The first term corresponds to ERM and maximizes natural accuracy; the second term pushes the decision boundary away from the data points to improve adversarial robustness; the hyperparameter β controls the trade-off between natural accuracy and adversarial robustness. We evaluate self-adaptive training on this task by replacing the first term of Equation (24) with our approach.
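A per-sample sketch of the TRADES objective, given the prediction on an adversarial example. The inner maximization over x′ (typically approximated with PGD) is omitted here and assumed to have already produced `p_adv`:

```python
import math

def cross_entropy(p, y, eps=1e-12):
    """CE between a predicted distribution p and label y."""
    return -math.log(p[y] + eps)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def trades_loss(p_natural, p_adv, y, beta):
    """TRADES loss for one sample: CE on the natural input plus beta times
    the KL divergence between natural and adversarial predictions."""
    return cross_entropy(p_natural, y) + beta * kl_divergence(p_natural, p_adv)

p_nat = [0.8, 0.1, 0.1]
p_adv = [0.5, 0.3, 0.2]
loss = trades_loss(p_nat, p_adv, y=0, beta=6.0)
```

When the adversarial prediction equals the natural one, the KL term vanishes and the loss reduces to plain cross entropy, matching the decomposition into natural and boundary errors described above.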
Our experiments are based on the official open-sourced implementation.
B.3 ImageNet
We use ResNet50/ResNet101 [3] as base classifiers. Following the original paper [3] and [52, 66], we use SGD to optimize the networks with a batch size of 768, base learning rate of 0.3, momentum of 0.9, weight decay of 0.0005, and 95 total training epochs. The learning rate is linearly increased from 0.0003 to 0.3 in the first 5 epochs (i.e., warm-up), and then decayed to 0 using a cosine annealing schedule [52]. Following common practice, we use random resizing, cropping and flipping augmentation during training. The hyperparameters of our approach are set separately for the standard setup and the 40% label noise setting. The experiments are conducted in PyTorch [51] with distributed training and mixed-precision training.
Appendix C Additional Experimental Results
C.1 ERM may suffer from overfitting of noise
In [16], the authors showed that models trained by standard ERM can easily fit randomized data. However, they only analyzed the generalization error in the presence of corrupted labels. In this paper, we report the whole training process and also consider the performance on clean sets (i.e., the original uncorrupted data). Figure 10(a) shows the four accuracy curves (on the clean and noisy training and validation sets, respectively) for each model trained on one of four corrupted training sets. Note that the models only have access to the noisy training sets (i.e., the red curve); the other three curves are shown for illustration purposes only. We draw two principal observations from the figures: (1) The accuracy on the noisy training and validation sets is close at the beginning, and the gap increases monotonically with the training epoch; the generalization errors (i.e., the gap between the accuracy on the noisy training and validation sets) are large at the end of training. (2) The accuracy on the clean training and validation sets is consistently higher than the percentage of clean data in the noisy training set; this occurs around the epochs between underfitting and overfitting.
Our first observation raises concerns about the overfitting issue in the ERM training dynamics, which has also been reported by [20]. However, the work of [20] only considered the case of corrupted labels and proposed an early stopping mechanism to improve the performance on clean data. Our analysis of broader corruption schemes shows that early stopping might be suboptimal and may hurt the performance under other types of corruption (see the last three columns in Figure 10(a)).
The second observation implies that model predictions under ERM can capture and amplify useful signals in the noisy training set, even though the training dataset is heavily corrupted. While this was also reported in [16, 18, 19, 20] for the case of corrupted labels, we show that a similar phenomenon occurs more generally under other kinds of corruption. This observation sheds light on our approach, which incorporates model predictions into the training procedure.
C.2 Improved generalization of self-adaptive training on random noise
Training accuracy w.r.t. correct labels on different portions of data  For a more intuitive demonstration, we split the CIFAR10 training set (with 40% label noise) into two portions: 1) the untouched portion, i.e., the elements of the training set that were left untouched; 2) the corrupted portion, i.e., the elements of the training set that were indeed randomized. The accuracy curves on these two portions w.r.t. the correct training labels are shown in Figure 13. We observe that the accuracy of ERM on the corrupted portion first increases in the first few epochs and then eventually decreases to 0. In contrast, self-adaptive training calibrates the training process and consistently fits the correct labels.
Study on extreme noise  We further re-run the same experiments as in Figure 1 of the main text by injecting extreme noise (i.e., a noise rate of 80%) into the CIFAR10 dataset. We report the corresponding accuracy curves in Figure 11, which show that our approach significantly improves generalization over ERM even when random noise dominates the training data. This again justifies our observations in Section 3.
Effect of data augmentation  All our previous studies were performed with common data augmentation (i.e., random cropping and flipping). Here, we further report the effect of data augmentation. We adjust the introduced hyperparameters due to the severer overfitting when data augmentation is absent. Figure 12 shows the corresponding generalization errors and clean validation errors. We observe that, for both ERM and our approach, the errors clearly increase when data augmentation is absent (compared with those in Figure 3). However, the gain from augmentation is limited, and the generalization error of standard ERM can still be very large with or without data augmentation. Directly replacing the standard training procedure with our approach brings bigger gains in generalization regardless of data augmentation. This suggests that data augmentation helps but is not essential for improving the generalization of deep neural networks, which is consistent with the observation in [16].
C.3 Epoch-wise double descent phenomenon
[32] reported that, for sufficiently large models, the test error vs. training epoch curve also exhibits a double-descent phenomenon, which they termed epoch-wise double descent. In Figure 15, we reproduce the epoch-wise double descent phenomenon for ERM and inspect self-adaptive training. We observe that our approach (the red curve) exhibits slight double descent because overfitting starts before the initial E_s epochs. Once the training targets are updated (i.e., after E_s = 40 training epochs), the red curve decreases monotonically. This observation again indicates that the double-descent phenomenon may stem from overfitting of noise and can be avoided by our algorithm.
TABLE X
                  CIFAR10                       CIFAR100
Label Noise Rate  0.2    0.4    0.6    0.8      0.2    0.4    0.6    0.8
SCE [47]          90.15  86.74  80.80  46.28    71.26  66.41  57.43  26.41
Ours              94.14  92.64  89.23  78.58    75.77  71.38  62.69  38.72
Ours + SCE        94.39  93.29  89.83  79.13    76.57  72.16  64.12  39.61
C.4 Cooperation with Symmetric Cross Entropy
[47] showed that the Symmetric Cross Entropy (SCE) loss is robust to underlying label noise in the training data. Formally, given a training target t and a model prediction p, the SCE loss is defined as:
ℓ_SCE(t, p) = −α Σ_j t_j log p_j − β Σ_j p_j log t_j,    (25)
where the first term is the standard cross entropy loss and the second term is its reversed version. In this section, we show that self-adaptive training can cooperate with this noise-robust loss and enjoys a further performance boost without extra cost.
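A minimal sketch of the SCE loss in Equation (25). The original implementation clips log 0 with a constant, whereas this sketch simply uses a small eps; the probability values below are illustrative only:

```python
import math

def sce_loss(t, p, alpha=1.0, beta=0.1, eps=1e-12):
    """Symmetric Cross Entropy (Eq. 25): alpha * CE(t, p) + beta * reversed CE(p, t)."""
    ce = -sum(tj * math.log(pj + eps) for tj, pj in zip(t, p))
    rce = -sum(pj * math.log(tj + eps) for tj, pj in zip(t, p))
    return alpha * ce + beta * rce

t = [0.0, 1.0, 0.0]    # one-hot (or EMA soft) target
p = [0.2, 0.7, 0.1]    # model prediction
loss = sce_loss(t, p)  # alpha=1, beta=0.1, as in the experiments below
```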
Setup  Most experimental settings are kept the same as in Section 5.2. For the hyperparameters α and β introduced by the SCE loss, we directly set them to 1 and 0.1, respectively, in all our experiments.
Results  We summarize the results in Table X. We can see that, although self-adaptive training already achieves very strong performance, considerable further gains can be obtained when it is equipped with the SCE loss. Concretely, the improvement is as large as 1.5% when label noise of 60% is injected into the CIFAR100 training set. This also indicates that our approach is flexible and can be further extended.
TABLE XI
Method  CIFAR10  Corruption Level @ CIFAR10-C
                 1      2      3      4      5
ERM     95.32    88.44  83.22  77.26  70.40  58.91
Ours    95.80    89.41  84.53  78.83  71.90  60.77
C.5 Out-of-distribution generalization
In this section, we consider out-of-distribution (OOD) generalization, where models are evaluated on unseen test distributions outside the training distribution.
Setup To evaluate OOD generalization performance, we use the CIFAR-10-C benchmark [88], which is constructed by applying 15 types of corruption to the original CIFAR-10 test set at 5 levels of severity. Performance is measured by the average accuracy over the 15 corruption types. We mainly follow the training details in Section 5.2 and adjust the hyperparameters accordingly.
Results We summarize the results in Table XI. Regardless of the presence of corruption and its severity level, our method consistently outperforms ERM by a considerable margin, which becomes larger as the corruption grows more severe. This experiment indicates that self-adaptive training may provide implicit regularization that benefits OOD generalization.
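The averaging scheme above is straightforward; a minimal sketch follows, assuming a precomputed accuracy table keyed by (corruption type, severity). The three corruption names are placeholders for the benchmark's 15 types.

```python
# Placeholder subset of the benchmark's 15 corruption types.
CORRUPTIONS = ["gaussian_noise", "shot_noise", "impulse_noise"]

def mean_corruption_accuracy(acc_table, severity):
    """Average accuracy over corruption types at one severity level.

    acc_table maps (corruption_name, severity) -> accuracy in [0, 100].
    """
    accs = [acc_table[(c, severity)] for c in CORRUPTIONS]
    return sum(accs) / len(accs)

# Example with dummy accuracies at severity level 1:
table = {("gaussian_noise", 1): 90.0,
         ("shot_noise", 1): 88.0,
         ("impulse_noise", 1): 86.0}
```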
C.6 Cost of maintaining probability vectors
Take the large-scale ImageNet dataset [54] as an example. ImageNet consists of about 1.2 million images categorized into 1,000 classes. Storing one 1,000-dimensional probability vector per image in single-precision format for the entire dataset requires about 1.2 × 10⁶ × 1000 × 32 bits ≈ 4.8 GB, which shrinks further in the self-supervised learning setting, where only a low-dimensional feature vector is recorded for each image. The cost is acceptable since modern GPUs usually have 20 GB or more of dedicated memory, e.g., the NVIDIA Tesla A100 has 40 GB of memory. Moreover, the vectors can be stored in CPU memory or even on disk and loaded along with the images to further reduce the cost.
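The estimate above is simple arithmetic and can be checked directly; `target_storage_gb` is a hypothetical helper, and the decimal-gigabyte convention (10⁹ bytes) is an assumption.

```python
def target_storage_gb(num_images, dim, bytes_per_float=4):
    """Storage for one single-precision float vector of length `dim`
    per image, in decimal gigabytes (10^9 bytes)."""
    return num_images * dim * bytes_per_float / 1e9

# Supervised setting: a 1000-class probability vector per ImageNet image.
print(target_storage_gb(1_200_000, 1000))  # 4.8
```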
References
[1] L. Huang, C. Zhang, and H. Zhang, “Self-adaptive training: Beyond empirical risk minimization,” in Advances in Neural Information Processing Systems, vol. 33, 2020.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[5] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[6] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.
[7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[8] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020.
[9] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.
[10] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[11] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International Conference on Machine Learning, 2020.
[12] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” in Advances in Neural Information Processing Systems, vol. 33, 2020.
[13] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
[14] A. Krizhevsky and G. E. Hinton, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
[15] P. Bojanowski and A. Joulin, “Unsupervised learning by predicting noise,” in International Conference on Machine Learning, 2017, pp. 517–526.
[16] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations, 2017.
[17] V. Nagarajan and J. Z. Kolter, “Uniform convergence may be unable to explain generalization in deep learning,” in Advances in Neural Information Processing Systems, 2019, pp. 11611–11622.
[18] D. Rolnick, A. Veit, S. Belongie, and N. Shavit, “Deep learning is robust to massive label noise,” arXiv preprint arXiv:1705.10694, 2017.
[19] M. Y. Guan, V. Gulshan, A. M. Dai, and G. E. Hinton, “Who said what: Modeling individual labelers improves classification,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[20] M. Li, M. Soltanolkotabi, and S. Oymak, “Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 4313–4324.
[21] C. E. Brodley, M. A. Friedl et al., “Identifying and eliminating mislabeled training instances,” in Proceedings of the National Conference on Artificial Intelligence, 1996, pp. 799–805.
[22] C. E. Brodley and M. A. Friedl, “Identifying mislabeled training data,” Journal of Artificial Intelligence Research, vol. 11, pp. 131–167, 1999.
[23] X. Zhu, X. Wu, and Q. Chen, “Eliminating class noise in large datasets,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 920–927.
[24] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox, “SELF: Learning to filter noisy labels with self-ensembling,” in International Conference on Learning Representations, 2020.
[25] H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi, “Label refinery: Improving ImageNet classification through label progression,” arXiv preprint arXiv:1805.02641, 2018.
[26] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa, “Joint optimization framework for learning with noisy labels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5552–5560.
[27] C.-M. Teng, “Correcting noisy data,” in International Conference on Machine Learning. Citeseer, 1999, pp. 239–248.
[28] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. John Wiley & Sons, 2005, vol. 589.
[29] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, “MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels,” in International Conference on Machine Learning, 2018, pp. 2304–2313.
[30] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” in International Conference on Machine Learning, 2018, pp. 4334–4343.
[31] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019.
[32] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, “Deep double descent: Where bigger models and more data hurt,” in International Conference on Learning Representations, 2020.
[33] M. Opper, “Statistical mechanics of learning: Generalization,” The Handbook of Brain Theory and Neural Networks, pp. 922–925, 1995.
[34] ——, “Learning to generalize,” Frontiers of Life, vol. 3, no. part 2, pp. 763–775, 2001.
[35] M. S. Advani, A. M. Saxe, and H. Sompolinsky, “High-dimensional dynamics of generalization error in neural networks,” Neural Networks, vol. 132, pp. 428–446, 2020.
[36] S. Spigler, M. Geiger, S. d’Ascoli, L. Sagun, G. Biroli, and M. Wyart, “A jamming transition from under- to over-parametrization affects loss landscape and generalization,” arXiv preprint arXiv:1810.09665, 2018.
[37] M. Geiger, S. Spigler, S. d’Ascoli, L. Sagun, M. Baity-Jesi, G. Biroli, and M. Wyart, “Jamming transition as a paradigm to understand the loss landscape of deep neural networks,” Physical Review E, vol. 100, no. 1, p. 012115, 2019.
[38] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations, 2014.
[39] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan, “Theoretically principled trade-off between robustness and accuracy,” in International Conference on Machine Learning, 2019, pp. 7472–7482.
[40] F. Croce and M. Hein, “Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks,” in International Conference on Machine Learning, 2020.
[41] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations, 2018.
[42] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” in Advances in Neural Information Processing Systems, vol. 33, 2020.
[43] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[44] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, “Making deep neural networks robust to label noise: A loss correction approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1944–1952.
[45] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in International Conference on Learning Representations, 2018.
[46] Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Advances in Neural Information Processing Systems, 2018, pp. 8778–8788.
[47] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, “Symmetric cross entropy for robust learning with noisy labels,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 322–330.
[48] S. Thulasidasan, T. Bhattacharya, J. Bilmes, G. Chennupati, and J. Mohd-Yusof, “Combating label noise in deep learning using abstention,” in International Conference on Machine Learning, 2019, pp. 6234–6243.
[49] S. Liu, J. Niles-Weed, N. Razavian, and C. Fernandez-Granda, “Early-learning regularization prevents memorization of noisy labels,” in Advances in Neural Information Processing Systems, 2020.
[50] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
[51] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
[52] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in International Conference on Learning Representations, 2017.
[53] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[54] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[55] Z. Liu, Z. Wang, P. P. Liang, R. Salakhutdinov, L.-P. Morency, and M. Ueda, “Deep gamblers: Learning to abstain with portfolio theory,” in Advances in Neural Information Processing Systems, 2019.
[56] Y. Geifman and R. El-Yaniv, “SelectiveNet: A deep neural network with an integrated reject option,” in International Conference on Machine Learning, 2019, pp. 2151–2159.
[57] ——, “Selective classification for deep neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 4878–4887.
[58] R. El-Yaniv and Y. Wiener, “On the foundations of noise-free selective classification,” Journal of Machine Learning Research, vol. 11, no. May, pp. 1605–1641, 2010.
[59] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.
[60] “Dogs vs. cats dataset,” https://www.kaggle.com/c/dogs-vs-cats.
[61] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
[62] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[63] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223.
[64] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 1097–1105.
[65] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in International Conference on Machine Learning, 2010.
[66] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
[67] R. Zhang, P. Isola, and A. A. Efros, “Split-brain autoencoders: Unsupervised learning by cross-channel prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1058–1067.
[68] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 132–149.
[69] J. Huang, Q. Dong, S. Gong, and X. Zhu, “Unsupervised deep learning by neighbourhood discovery,” in International Conference on Machine Learning, 2019, pp. 2849–2858.
[70] Y. M. Asano, C. Rupprecht, and A. Vedaldi, “Self-labelling via simultaneous clustering and representation learning,” in International Conference on Learning Representations, 2020.
[71] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in European Conference on Computer Vision, 2020.
[72] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, “Regularizing neural networks by penalizing confident output distributions,” arXiv preprint arXiv:1701.06548, 2017.
[73] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio et al., “A closer look at memorization in deep networks,” in International Conference on Machine Learning. JMLR.org, 2017, pp. 233–242.
[74] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” in Workshop at the International Conference on Learning Representations, 2015.
[75] B. Dong, J. Hou, Y. Lu, and Z. Zhang, “Distillation ≈ early stopping? Harvesting dark knowledge utilizing anisotropic information retrieval for overparameterized neural network,” arXiv preprint arXiv:1910.01255, 2019.
[76] T. Furlanello, Z. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, “Born again neural networks,” in International Conference on Machine Learning, 2018, pp. 1607–1616.
[77] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves ImageNet classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10687–10698.
[78] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” in International Conference on Learning Representations, 2017.
[79] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in International Conference on Machine Learning, 2008, pp. 1096–1103.
[80] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[81] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference on Learning Representations, 2014.
[82] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[83] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” in International Conference on Learning Representations, 2018.
[84] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision. Springer, 2016, pp. 69–84.
[85] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[86] J. Li, P. Zhou, C. Xiong, R. Socher, and S. C. Hoi, “Prototypical contrastive learning of unsupervised representations,” arXiv preprint arXiv:2005.04966, 2020.
[87] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
[88] D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” in International Conference on Learning Representations, 2019.