Self-Adaptive Training: Bridging the Supervised and Self-Supervised Learning
We propose self-adaptive training, a unified training algorithm that dynamically calibrates and enhances the training process using model predictions without incurring extra computational cost, to advance both supervised and self-supervised learning of deep neural networks. We analyze the training dynamics of deep networks on training data that are corrupted by, e.g., random noise and adversarial examples. Our analysis shows that model predictions are able to magnify useful underlying information in data and that this phenomenon occurs broadly even in the absence of any label information, highlighting that model predictions could substantially benefit the training process: self-adaptive training improves the generalization of deep networks under noise and enhances self-supervised representation learning. The analysis also sheds light on understanding deep learning, e.g., it offers a potential explanation of the recently discovered double-descent phenomenon in empirical risk minimization and of the collapsing issue of state-of-the-art self-supervised learning algorithms. Experiments on the CIFAR, STL and ImageNet datasets verify the effectiveness of our approach in three applications: classification with label noise, selective classification and linear evaluation. To facilitate future research, the code has been made publicly available at https://github.com/LayneH/self-adaptive-training.
Deep neural networks have received significant attention in machine learning and computer vision, in part due to the impressive performance achieved by supervised learning approaches in the ImageNet challenge. With the help of massive labeled data, deep neural networks advance the state-of-the-art to an unprecedented level on many fundamental tasks, such as image classification [2, 3], object detection and semantic segmentation. However, data acquisition is notoriously costly, error-prone and even infeasible in certain cases. Furthermore, deep neural networks suffer significantly from overfitting in these scenarios. On the other hand, the great success of self-supervised pre-training in natural language processing (e.g., GPT [6, 7, 8] and BERT) highlights that learning universal representations from unlabeled data can be even more beneficial for a broad range of downstream tasks.
Regarding this, much effort has been devoted to learning representations without human supervision for computer vision. Several recent studies show promising results and largely close the performance gap between supervised and self-supervised learning. To name a few, the contrastive learning approaches [10, 11, 12] solve the instance-wise discrimination task as a proxy objective for representation learning. Extensive studies demonstrate that representations learned in a self-supervised manner are generic and even outperform their supervised pre-trained counterparts when they are fine-tuned on certain downstream tasks.
Our work advances both the supervised learning and self-supervised learning settings. Instead of designing two distinct algorithms for each learning paradigm separately, in this paper, we explore the possibility of a unified algorithm that bridges the supervised and self-supervised learning. Our exploration is based on two observations on the learning of deep neural networks.
Observation I: To begin with, we observe that deep neural networks are able to magnify useful underlying information through their predictions under various supervised settings. To take a closer look at this phenomenon, we inspect the standard Empirical Risk Minimization (ERM) training dynamics of deep models on the CIFAR10 dataset with 40% of the data corrupted at random (see Section 2.1 for details) and report the accuracy curves in Fig. 0(a). We can see that, under all four corruptions, the peak of the accuracy curve on the clean training set (80%) is much higher than the percentage of clean data in the noisy training set (60%).
Observation II: Furthermore, we observe a similar phenomenon even in the extreme scenario where the supervised signals are completely random. As a concrete example, we generate two kinds of random noise as the training (real-valued) targets for each image: 1) the output probability of another model on the same training images, where the model is randomly initialized and then frozen (the red curve in Fig. 2); 2) random noise that is drawn i.i.d. from the standard Gaussian distribution (the green curve in Fig. 2), which is similar to a previously proposed approach. We train two deep models by encouraging their output on each image to align with the training targets of 1) and 2), respectively (see Sec. 2.1 for details). We then evaluate the two deep models by learning linear classifiers on their representations. Fig. 2 displays the accuracy curves. We can clearly see that, when fitting either of the two kinds of noise (the red and green curves), the representations of the deep model are substantially improved compared with those of a randomly initialized network (the horizontal line).
These two observations indicate that model predictions can magnify useful information in data, which further suggests that incorporating predictions into the training process could significantly benefit model learning. With this in mind, we propose self-adaptive training, a carefully designed approach that dynamically uses model predictions as a guiding principle in the design of training algorithms and that bridges supervised learning and self-supervised learning in a unified framework. Our approach is conceptually simple yet calibrates and significantly enhances the learning of deep neural networks in multiple ways.
1.1 Summary of our contributions
Self-adaptive training sheds light on understanding and improving the learning of deep neural networks.
We analyze the standard ERM training process of deep networks on four kinds of corruptions (see Fig. 0(a)). We describe the failure scenarios of ERM and observe that useful information from data has been distilled to model predictions in the first few epochs. We show that this phenomenon occurs broadly even in the absence of any label information (see Fig. 2). These insights motivate us to propose self-adaptive training, a unified training algorithm for both supervised and self-supervised learning, to improve the learning of deep networks by dynamically incorporating model predictions into training, without requiring modification to the existing network architecture or incurring extra computational cost.
We show that self-adaptive training improves generalization of deep networks under both label-wise and instance-wise random noise (see Fig. 1 and 3). Besides, self-adaptive training exhibits a single-descent error-capacity curve (see Fig. 4). This is in sharp contrast to the recently-discovered double-descent phenomenon in ERM which might be a result of overfitting of noise. Moreover, while adversarial training may easily overfit adversarial noise, our approach mitigates the overfitting issue and improves adversarial accuracy by 3% over the state-of-the-art (see Fig. 5).
We illustrate that self-adaptive training enhances the self-supervised representation learning of deep networks (see Fig. 2). Our study challenges the dominant training mechanism of recent self-supervised algorithms, which typically involves multiple augmented views of the same image at each training step: self-adaptive training achieves remarkable performance despite requiring only a single view of each image for training, which significantly reduces the heavy cost of data pre-processing and model training on extra views.
Self-adaptive training has three applications and advances the state-of-the-art by a significant margin.
Learning with noisy labels, where the goal is to improve the performance of deep networks on clean test data in the presence of training label noise. On the CIFAR datasets, our approach obtains up to 9% absolute classification accuracy improvement over the state-of-the-art. On the ImageNet dataset, our approach improves over ERM by 3% under a 40% noise rate.
Selective classification, which aims to trade prediction coverage off against classification accuracy. Self-adaptive training achieves up to 50% relative improvement over the state-of-the-art on three datasets under various coverage rates.
Linear evaluation, which uses a linear classifier to evaluate the representations of a self-supervised pre-trained model. Self-adaptive training performs on par with or even better than the state-of-the-art while requiring only single-view training.
2 Self-Adaptive Training
2.1 Blessing of model predictions
On corrupted data
Recent works [16, 17] cast doubt on ERM training: techniques such as uniform convergence might be unable to explain the generalization of deep neural networks, because ERM easily overfits training data even when the training data are partially or completely corrupted by random noise. To take a closer look at this phenomenon, we conduct experiments on the CIFAR10 dataset, of which we split the original training data into a training set (consisting of the first 45,000 data pairs) and a validation set (consisting of the last 5,000 data pairs). We consider four random noise schemes according to prior work, where the data are partially corrupted with probability $p$: 1) Corrupted labels. Labels are assigned uniformly at random; 2) Gaussian. Images are replaced by random Gaussian samples with the same mean and standard deviation as the original image distribution; 3) Random pixels. Pixels of each image are shuffled using independent random permutations; 4) Shuffled pixels. Pixels of each image are shuffled using a fixed permutation pattern. We consider the performance on both the noisy and the clean sets (i.e., the original uncorrupted data), while the models only have access to the noisy training sets.
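The four corruption schemes can be sketched in a few lines. This is a minimal illustration and not the authors' data pipeline: the function name `corrupt_dataset`, the scheme strings, the flattened-list image format and the use of a single global mean and standard deviation for the Gaussian scheme are all assumptions made here.

```python
import random


def corrupt_dataset(images, labels, scheme, p, num_classes=10, seed=0):
    """Corrupt each (image, label) pair independently with probability p.

    images: list of flattened images (lists of floats of equal length).
    scheme: "labels", "gaussian", "random_pixels" or "shuffled_pixels".
    Returns (new_images, new_labels, corrupted_flags).
    """
    rng = random.Random(seed)
    num_pixels = len(images[0])

    # Global pixel statistics, used by the Gaussian scheme (an assumption).
    flat = [v for img in images for v in img]
    mean = sum(flat) / len(flat)
    std = (sum((v - mean) ** 2 for v in flat) / len(flat)) ** 0.5

    # One permutation shared by every image, used by "shuffled_pixels".
    fixed_perm = list(range(num_pixels))
    rng.shuffle(fixed_perm)

    new_images, new_labels, flags = [], [], []
    for img, label in zip(images, labels):
        img = list(img)
        corrupted = rng.random() < p
        if corrupted:
            if scheme == "labels":
                label = rng.randrange(num_classes)      # uniform random label
            elif scheme == "gaussian":
                img = [rng.gauss(mean, std) for _ in img]
            elif scheme == "random_pixels":
                rng.shuffle(img)                        # fresh permutation per image
            elif scheme == "shuffled_pixels":
                img = [img[j] for j in fixed_perm]      # same permutation for all
            else:
                raise ValueError(f"unknown scheme: {scheme}")
        new_images.append(img)
        new_labels.append(label)
        flags.append(corrupted)
    return new_images, new_labels, flags
```

Note that a uniformly drawn label can coincide with the true label, so the fraction of wrong labels is somewhat lower than $p$ under the "labels" scheme.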
Figure 0(a) displays the accuracy curves of ERM models trained on the noisy training sets under four kinds of random corruptions: ERM easily overfits noisy training data and achieves nearly perfect training accuracy. However, the four subfigures exhibit very different generalization behaviors, which are indistinguishable if we only look at the accuracy curve on the noisy training set (the red curve). In Figure 0(a), the accuracy increases in the early stage and the generalization errors grow quickly only after certain epochs. Intuitively, stopping at an early epoch improves generalization in the presence of label noise (see the first column in Figure 0(a)); however, it remains unclear how to properly identify such an epoch. Moreover, the early-stop mechanism may significantly hurt the performance on the clean validation sets, as we can see in the last three columns of Figure 0(a). Our approach is motivated by the failure scenarios of ERM and goes beyond ERM. We begin by making the following observation in the leftmost subfigure of Figure 0(a): the peak of the accuracy curve on the clean training set (80%) is much higher than the percentage of clean data in the noisy training set (60%). This finding was also previously reported by [18, 19, 20] under label corruption and suggests that model predictions might be able to magnify useful underlying information in data. We confirm this finding and show that the pattern occurs more broadly under various kinds of corruptions (see the last three subfigures of Figure 0(a)).
On unlabelled data We notice that supervised learning with 100% noisy labels is equivalent to unsupervised learning if we simply discard the meaningless labels. Therefore, it would be interesting to analyze how a deep model behaves in such an extreme case. Here, we conduct experiments on the CIFAR10 dataset and consider two kinds of random noise as the training (real-valued) targets for deep learning models: 1) the output features of another model on the same training images, where the model is randomly initialized and then frozen; 2) random noise that is drawn i.i.d. from the standard Gaussian distribution and then fixed. The training of the deep model is then formulated as minimizing the mean square error between the $\ell_2$-normalized model predictions and these two kinds of random noise. To monitor the training, we learn a linear classifier on top of each model, whose representation is kept fixed, to evaluate that representation.
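As a rough illustration of the second scheme, the objective can be written as a mean squared error between normalized outputs and targets that are drawn once and then kept fixed. The sketch below is an assumption-laden toy: `noise_fitting_loss` and the list-based vectors are hypothetical, only the predictions are normalized (following the wording above), and no network or optimizer is involved.

```python
import math
import random


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def noise_fitting_loss(outputs, targets):
    """Mean squared error between l2-normalized outputs and fixed targets."""
    total, count = 0.0, 0
    for out, tgt in zip(outputs, targets):
        for o, t in zip(l2_normalize(out), tgt):
            total += (o - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
# Fixed Gaussian targets: drawn once per image and never resampled.
targets = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
# Stand-ins for the network outputs on the same three images.
outputs = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]
loss = noise_fitting_loss(outputs, targets)
```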
Figure 2 shows the linear evaluation accuracy curves of models trained on these two kinds of noise. We observe that, perhaps surprisingly, the model trained by predicting fixed random Gaussian noise (the green curve) achieves 57% linear evaluation accuracy, which is substantially higher than the 38% accuracy of a randomly initialized network (the dashed horizontal line). This intriguing observation shows that deep neural networks are able to distill underlying information from data into their predictions, even when the supervision signals are completely replaced by random noise. Furthermore, although predicting the output of another network can also improve representation learning (the red curve), its performance is worse than that of the second scheme. We hypothesize that this might be a result of inconsistent training targets: the output of the network depends on the input training image, which is randomly transformed by data augmentation at each epoch. This suggests that the consistency of training targets should also be taken into account in the design of the training algorithm.
Inspiration for our methodology Based on our analysis, we propose a unified algorithm, Self-Adaptive Training, for both supervised and self-supervised learning. Self-adaptive training incorporates model predictions to augment the training process of deep models in a dynamic and consistent manner. Our methodology is general and very effective: self-adaptive training significantly improves the generalization of deep neural networks on the corrupted data and the representation learning of deep models without human supervision.
2.2 Meta algorithm: Self-Adaptive Training
Given a set of training images $\{x_i\}_n$ and a deep network $f_\theta(\cdot)$ parametrized by $\theta$, our approach records a training target $t_i$ for each data point. We first obtain the predictions of the deep network as
$$p_i = \rho(f_\theta(x_i)), \tag{1}$$
where $\rho(\cdot)$ is a normalization function. Then, the training targets track all historical model predictions during training and are updated by an Exponential-Moving-Average (EMA) scheme as
$$t_i \leftarrow \alpha \times t_i + (1 - \alpha) \times p_i. \tag{2}$$
The EMA scheme in Equation (2) alleviates the instability of model predictions, smooths out $t_i$ during the training process and enables our algorithm to completely change the training labels if necessary. The momentum term $\alpha$ controls the weight on the model predictions. Finally, we update the weights $\theta$ of the deep network by Stochastic Gradient Descent (SGD) on the loss function $\mathcal{L}(p_i, t_i)$ at each training iteration.
We summarize the meta algorithm of self-adaptive training in Algorithm 1. The algorithm is conceptually simple, flexible and has three components that adapt to different learning settings: 1) the initialization of the training targets; 2) the normalization function $\rho(\cdot)$; 3) the loss function $\mathcal{L}$. In the following sections, we will elaborate on the instantiation of these components for specific learning settings.
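To make the target update concrete, here is a minimal sketch of the EMA scheme. It assumes the update has the usual form "keep a fraction `alpha` of the old target and mix in the new prediction"; `ema_update` and `track_target` are hypothetical helper names, and the predictions are hard-coded instead of coming from a network.

```python
def ema_update(target, prediction, alpha):
    """EMA step: keep alpha of the old target, mix in (1 - alpha) of the prediction."""
    return [alpha * t + (1.0 - alpha) * p for t, p in zip(target, prediction)]


def track_target(initial_target, predictions, alpha=0.9):
    """Apply the EMA update once per recorded prediction and return every target."""
    history = [list(initial_target)]
    for pred in predictions:
        history.append(ema_update(history[-1], pred, alpha))
    return history


# A sample labeled as class 0 whose prediction stays at class 1 with 0.9 confidence.
history = track_target([1.0, 0.0], [[0.1, 0.9]] * 10, alpha=0.9)
final = history[-1]
```

In this toy run the target starts as the one-hot vector for class 0 and, after ten updates, puts more mass on class 1 than on class 0, which illustrates how accumulated predictions can overturn an initial label.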
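Convergence analysis To simplify the analysis, we consider a linear regression problem with data $\{x_i\}_n$, training targets $\{t_i\}_n$ and a linear model $f_\theta(x) = x\theta$, where $x_i \in \mathbb{R}^{1 \times d}$, $t_i \in \mathbb{R}$ and $\theta \in \mathbb{R}^{d \times 1}$. Let $X = [x_1; \dots; x_n] \in \mathbb{R}^{n \times d}$, $t = [t_1; \dots; t_n] \in \mathbb{R}^{n \times 1}$. Then the optimization for this regression problem (corresponding to the loss $\mathcal{L}$ in Algorithm 1) can be written as
$$\min_{\theta, t} \; \frac{1}{2} \| X\theta - t \|_2^2. \tag{3}$$
Let $\theta^{(k)}$ and $t^{(k)}$ be the model parameters and training targets at the $k$-th training step, respectively. Let $\eta$ denote the learning rate for the gradient descent update. Algorithm 1 alternately minimizes problem (3) over $\theta$ and $t$ as
$$\theta^{(k)} = \theta^{(k-1)} - \eta X^\top \big( X\theta^{(k-1)} - t^{(k-1)} \big), \qquad t^{(k)} = \alpha\, t^{(k-1)} + (1 - \alpha)\, X\theta^{(k)}.$$
Let $d_{\max}$ be the maximal eigenvalue of the matrix $X^\top X$. If the learning rate $\eta < \frac{1 + \alpha}{\alpha\, d_{\max}}$, then
$$\| X\theta^{(k)} - t^{(k)} \|_2^2 \to 0 \quad \text{as } k \to \infty.$$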
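The convergence claim can be checked numerically on a toy problem. The sketch below assumes the objective $\frac{1}{2}\|X\theta - t\|_2^2$, a plain gradient step on $\theta$ with learning rate `eta`, and an EMA step on $t$ with momentum `alpha`. Under these assumptions the residual $r^{(k)} = X\theta^{(k)} - t^{(k)}$ satisfies $r^{(k)} = \alpha (I - \eta X X^\top) r^{(k-1)}$, so it contracts when `eta` is below `(1 + alpha) / (alpha * d_max)`, where `d_max` is the largest eigenvalue of $X^\top X$. The matrix `X`, the targets `t0` and all helper names are made up for the example.

```python
def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def max_eigenvalue(M, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    v = [1.0] * len(M)
    for _ in range(iters):
        w = matvec(M, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return sum(a * b for a, b in zip(v, matvec(M, v)))  # Rayleigh quotient


def residual_after(X, t0, alpha, eta, steps):
    """Alternate a gradient step on theta with an EMA step on t.

    Returns the squared residual ||X theta - t||^2 after `steps` rounds.
    """
    Xt = transpose(X)
    theta = [0.0] * len(X[0])
    t = list(t0)
    for _ in range(steps):
        pred = matvec(X, theta)
        grad = matvec(Xt, [p - ti for p, ti in zip(pred, t)])
        theta = [th - eta * g for th, g in zip(theta, grad)]
        pred = matvec(X, theta)
        t = [alpha * ti + (1.0 - alpha) * p for ti, p in zip(t, pred)]
    return sum((p - ti) ** 2 for p, ti in zip(matvec(X, theta), t))


X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
t0 = [1.0, -1.0, 2.0]
alpha = 0.9
XtX = [matvec(transpose(X), col) for col in transpose(X)]  # X^T X (symmetric)
d_max = max_eigenvalue(XtX)
bound = (1.0 + alpha) / (alpha * d_max)

small = residual_after(X, t0, alpha, eta=0.75 * bound, steps=300)  # below the bound
large = residual_after(X, t0, alpha, eta=1.25 * bound, steps=300)  # above the bound
```

With a learning rate below the bound the squared residual shrinks towards zero, while a learning rate above it makes the residual grow in this example.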
3 Improved Generalization of Deep Models
3.1 Supervised Self-Adaptive Training
Instantiation We consider a $c$-class classification problem and denote the images by $x_i$ and the labels by $y_i \in \{0, 1\}^c$ (one-hot vectors). Given a data pair $(x_i, y_i)$, our approach instantiates the three components of the meta Algorithm 1 for supervised learning as follows:
Targets initialization. Since the labels are provided, the training target $t_i$ is directly initialized as $t_i \leftarrow y_i$.
Normalization function. We use the softmax function to normalize the model predictions into probability vectors $p_i = \mathrm{softmax}(f_\theta(x_i))$, such that $p_i \in [0, 1]^c$.
Loss function. Following the common practice in supervised learning, the loss function is implemented as the cross entropy loss between model predictions $p_i$ and training targets $t_i$, i.e., $\mathcal{L}_{\mathrm{CE}}(p_i, t_i) = -\sum_j t_{i,j} \log p_{i,j}$, where $t_{i,j}$ and $p_{i,j}$ represent the $j$-th entry of $t_i$ and $p_i$, respectively.
During the training process, we fix $t_i$ in the first $E_s$ training epochs, and update the training targets $t_i$ according to Equation (2) in each following training epoch. The number of initial epochs $E_s$ allows the model to capture informative signals in the data set and excludes ambiguous information that is provided by model predictions in the early stage of training.
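The three supervised components and the warm-up rule can be sketched as follows. This is a toy illustration under stated assumptions: the names `update_target` and `warmup_epochs` are hypothetical, epochs are counted from zero, and the EMA form is the same assumed form as in Equation (2).

```python
import math


def softmax(logits):
    """Normalization function: map logits to a probability vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred, target, eps=1e-12):
    """Cross entropy between a probability vector and a (possibly soft) target."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))


def update_target(target, pred, epoch, warmup_epochs, alpha):
    """Keep the target fixed during warm-up, then apply the EMA update."""
    if epoch < warmup_epochs:
        return list(target)
    return [alpha * t + (1.0 - alpha) * p for t, p in zip(target, pred)]


target = [0.0, 1.0, 0.0]            # one-hot label used as the initial target
pred = softmax([2.0, 0.5, -1.0])     # the model currently favors class 0
loss = cross_entropy(pred, target)
during_warmup = update_target(target, pred, epoch=3, warmup_epochs=5, alpha=0.9)
after_warmup = update_target(target, pred, epoch=5, warmup_epochs=5, alpha=0.9)
```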
Sample re-weighting Based on the scheme presented above, we introduce a simple yet effective re-weighting scheme on each sample. Concretely, given training target $t_i$, we set
$$w_i = \max_j \, t_{i,j}.$$
The sample weight $w_i$ reveals the labeling confidence of this sample. Intuitively, all samples are treated equally in the first $E_s$ epochs. As the targets $t_i$ are updated, our algorithm pays less attention to potentially erroneous data and learns more from potentially clean data. This scheme also allows the corrupted samples to regain attention if they are confidently corrected.
Putting everything together We use stochastic gradient descent to minimize:
$$\mathcal{L}(f) = -\frac{1}{\sum_i w_i} \sum_i w_i \sum_j t_{i,j} \log p_{i,j}$$
during the training process. Here, the denominator normalizes per sample weights and stabilizes the loss scale. We summarize the Supervised Self-Adaptive Training and display the pseudocode in Algorithm 2. Intuitively, the optimal choice of hyper-parameter should be related to the epoch where overfitting occurs, which is around -th epoch according to Fig. 0(a). For convenience, we directly fix the hyper-parameters , by default if not specified. Experiments on the sensitivity of our algorithm to hyper-parameters are deferred to Sec. 5.2. Our approach requires no modification to existing network architecture and incurs almost no extra computational cost.
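A minimal sketch of the re-weighted objective follows. It assumes that the sample weight is the largest entry of the training target, which matches the description of the weight as a labeling confidence, and that the loss is the weighted cross entropy divided by the sum of the weights. `reweighted_loss` is a hypothetical name and the probability vectors are made up.

```python
import math


def cross_entropy(pred, target, eps=1e-12):
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))


def reweighted_loss(preds, targets):
    """Weighted cross entropy, normalized by the sum of the sample weights."""
    weights = [max(t) for t in targets]  # assumed weight: largest target entry
    weighted = sum(w * cross_entropy(p, t)
                   for w, p, t in zip(weights, preds, targets))
    return weighted / sum(weights)


preds = [[0.7, 0.3], [0.4, 0.6]]
confident = [[1.0, 0.0], [0.0, 1.0]]   # untouched one-hot targets, all weights 1
mixed = [[1.0, 0.0], [0.5, 0.5]]       # second target has become ambiguous

loss_confident = reweighted_loss(preds, confident)
loss_mixed = reweighted_loss(preds, mixed)
```

With one-hot targets every weight equals one and the expression reduces to the ordinary mean cross entropy; an ambiguous target such as `[0.5, 0.5]` contributes with half the weight.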
Methodology differences with prior work Supervised self-adaptive training consists of two components: a) label correction; b) sample re-weighting. With the two components, our algorithm is robust to both instance-wise and label-wise noise, and can be readily combined with various training schemes such as natural and adversarial training, without incurring multiple rounds of training. In contrast, the vast majority of works on learning from corrupted data follow a preprocessing-training fashion with an emphasis on label-wise noise only: this line of research either discards samples based on disagreement between noisy labels and model predictions [21, 22, 23, 24], or corrects noisy labels [25, 26]. Prior work has also investigated a more generic approach that corrects both label-wise and instance-wise noise; however, that approach inherently suffers from extra computational overhead. Besides, unlike the general scheme in robust statistics and other re-weighting methods [29, 30] that use an additional optimization step to update the sample weights, our approach directly obtains the weights based on accumulated model predictions and thus is much more efficient.
3.2 Improved generalization under random noise
We consider noise scheme (including noise type and noise level) and model capacity as two factors that affect the generalization of deep networks under random noise. We analyze self-adaptive training by varying one of the two factors while fixing the other.
Varying noise schemes We use ResNet-34  and rerun the same experiments in Figure 0(a) by replacing ERM with our approach. In Figure 0(b), we plot the accuracy curves of models trained with our approach on four corrupted training sets and compare with Figure 0(a). We highlight the following observations.
Our approach mitigates the overfitting issue in deep networks. The accuracy curves on noisy training sets (i.e., the red dashed curves in Figure 0(b)) nearly converge to the percentage of clean data in the training sets, and do not reach perfect accuracy.
The generalization errors of self-adaptive training (the gap between the red and blue dashed curves in Figure 0(b)) are much smaller than those in Figure 0(a). We further confirm this observation by displaying the generalization errors of the models trained on the four noisy training sets under various noise rates in the leftmost subfigure of Figure 3. Generalization errors of ERM consistently grow as we increase the injected noise level. In contrast, our approach significantly reduces the generalization errors across all noise levels from 0% (no noise) to 90% (overwhelming noise).
The accuracy on the clean sets (cyan and yellow solid curves in Figure 0(b)) is monotonically increasing and converges to higher values than the corresponding curves in Figure 0(a). We also show the clean validation errors in the right two subfigures in Figure 3. The figures show that the error of self-adaptive training is consistently much smaller than that of ERM.
Varying model capacity We notice that such analysis is related to a recently discovered intriguing phenomenon [33, 34, 35, 36, 31, 37, 32] in modern machine learning models: as the capacity of the model increases, the test error initially decreases, then increases, and finally shows a second descent. This phenomenon is termed double descent and has been widely observed in deep networks. To evaluate the double-descent phenomenon on self-adaptive training, we follow exactly the same experimental settings as prior work: we vary the width parameter of ResNet-18 and train the networks on the CIFAR10 dataset with 15% of the training labels corrupted at random (details are given in Appendix B.1).
Figure 4 shows the curves of test error. It shows that self-adaptive training overall achieves much lower test error than ERM, except when using extremely small models that underfit the training data. This suggests that our approach can improve the generalization of deep networks, especially when the model capacity is reasonably large. Besides, we observe that the curve of ERM clearly exhibits the double-descent phenomenon, while the curve of our approach is monotonically decreasing as the model capacity increases. Since the double-descent phenomenon may vanish when label noise is absent, our experiment indicates that this phenomenon may be a result of overfitting to noise and that we can bypass it by a proper design of the training process such as self-adaptive training.
3.3 Improved generalization under adversarial noise
Adversarial noise is different from random noise in that the noise is model-dependent and imperceptible to humans. We use the state-of-the-art adversarial training algorithm TRADES as our baseline to evaluate the performance of self-adaptive training under adversarial noise. Algorithmically, TRADES minimizes
$$\mathbb{E}_{x, y} \Big\{ \mathrm{CE}\big(p(x), y\big) + \max_{\|\tilde{x} - x\|_\infty \le \epsilon} \mathrm{KL}\big(p(x), p(\tilde{x})\big) / \lambda \Big\},$$
where $p(\cdot)$ is the model prediction, $\epsilon$ is the maximal allowed perturbation, CE stands for the cross entropy, KL stands for the Kullback-Leibler divergence, and the hyper-parameter $\lambda$ controls the trade-off between robustness and accuracy. We replace the CE term in the TRADES loss with our method. The models are evaluated using robust accuracy, where adversarial examples are generated by white-box AutoAttack with $\epsilon$ = 0.031 (the evaluation under the projected gradient descent attack is given in Fig. 14 of Appendix C). We set the initial learning rate to 0.1 and decay it by a factor of 0.1 at epochs 75 and 90, respectively. We choose $\lambda$ as suggested by prior work and use $E_s$ = 70, $\alpha$ = 0.9 for our approach. Experimental details are given in Appendix B.2.
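For a single example and a given adversarial input, a TRADES-style objective is the sum of an accuracy term and a robustness term. The sketch below is an illustration under assumptions, not the TRADES or self-adaptive training implementation: the inner maximization that searches for the adversarial example is omitted, the robustness term is assumed to be scaled by `1 / lam`, and passing a soft target in place of the one-hot label is only one possible reading of "replacing the CE term". All names and numbers are hypothetical.

```python
import math


def cross_entropy(pred, target, eps=1e-12):
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def trades_style_loss(p_clean, p_adv, target, lam):
    """Accuracy term plus robustness term scaled by 1 / lam (assumed form)."""
    return cross_entropy(p_clean, target) + kl_divergence(p_clean, p_adv) / lam


p_clean = [0.8, 0.2]        # prediction on the clean input
p_adv = [0.5, 0.5]          # prediction on an already-found adversarial input
one_hot = [1.0, 0.0]        # original label
soft_target = [0.9, 0.1]    # hypothetical accumulated target replacing the label

baseline = trades_style_loss(p_clean, p_adv, one_hot, lam=0.5)
adapted = trades_style_loss(p_clean, p_adv, soft_target, lam=0.5)
```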
We display the robust accuracy on the CIFAR10 test set after $E_s$ = 70 epochs in Figure 5. It shows that the robust accuracy of TRADES reaches its highest value around the epoch of the first learning rate decay (epoch 75) and decreases later, which suggests that overfitting might happen if we train the model without early stopping. On the other hand, our method considerably mitigates the overfitting issue in adversarial training and consistently improves the robust accuracy of TRADES by 1% to 3%, which indicates that self-adaptive training can improve the generalization in the presence of adversarial noise.
4 Improved Representations Learning
4.1 Self-Supervised Self-Adaptive Training
Instantiation We consider training images $\{x_i\}_n$ without labels and use a deep network followed by a non-linear projection as the encoder $f_\theta$. Then, we instantiate the meta Algorithm 1 for self-supervised learning as follows:
Target initialization. Since the labels are absent, each training target $t_i$ is randomly and independently drawn from the standard normal distribution.
Normalization function. We directly normalize each representation by dividing it by its $\ell_2$ norm.
Loss function. The loss function is implemented as the Mean Square Error (MSE) between the normalized model predictions $p_i$ and the training targets $t_i$.
The above instantiation of our meta algorithm suffices to learn decent representations, as exhibited by the blue curve in Fig. 2. However, as discussed in Sec. 2.1, the consistency of the training targets also plays an essential role, especially in self-supervised representation learning. We therefore introduce two components that further improve the representation learning of self-adaptive training.
With the slowly-evolving momentum encoder $f_m$, we can obtain the representation
$$z_i = f_m(x_i)$$
and construct the target $t_i$ following the EMA scheme in Equation (2).
Furthermore, to prevent the model from outputting the same representation for every image in each iteration (i.e., collapsing), we further use a predictor $g$ to transform the output of the encoder into the prediction
$$p_i = g(f_\theta(x_i)),$$
where $g$ has the same number of output units as $f_\theta$.
Putting everything together We $\ell_2$-normalize $p_i$ to $\tilde{p}_i$ and $t_i$ to $\tilde{t}_i$. Finally, the MSE loss between the normalized predictions and accumulated representations
$$\mathcal{L}(\tilde{p}_i, \tilde{t}_i) = \| \tilde{p}_i - \tilde{t}_i \|_2^2$$
is minimized to update the encoder $f_\theta$ and the predictor $g$. We term this variant Self-Supervised Self-Adaptive Training, and summarize the pseudo-code in Algorithm 3 and the overall pipeline in Fig. 6. Our approach is straightforward to implement in practice and requires only single-view training, which significantly alleviates the heavy computational burden of data augmentation operations.
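The pieces of the self-supervised variant can be put together in a small sketch. Everything here is an assumption made for illustration: the vectors stand in for the predictor output and the momentum-encoder output, the target is updated before the loss is computed (the order used in Algorithm 3 is not reproduced here), and the momentum value is arbitrary.

```python
import math


def l2_normalize(vec, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def ema(old, new, momentum):
    """Exponential moving average, usable for targets or encoder parameters."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]


def sat_ssl_loss(predictions, targets):
    """Batch mean of the squared distance between l2-normalized vectors."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p_n, t_n = l2_normalize(p), l2_normalize(t)
        total += sum((a - b) ** 2 for a, b in zip(p_n, t_n))
    return total / len(predictions)


# One step for a single image, with made-up vectors.
target = [0.5, -1.2, 0.3]   # initial target, e.g. a Gaussian draw
z = [1.0, 0.0, 0.0]         # stand-in for the momentum-encoder representation
p = [0.9, 0.1, 0.0]         # stand-in for the predictor output
target = ema(target, z, momentum=0.7)   # accumulate the current view
loss = sat_ssl_loss([p], [target])
```

Because both vectors have unit norm after normalization, the per-sample loss lies between 0 (same direction) and 4 (opposite directions).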
Methodology differences with prior works BYOL formulated the self-supervised training of deep models as predicting the representation of one augmented view of an image from the other augmented view of the same image. Self-supervised self-adaptive training shares some similarities with BYOL since neither method needs to contrast against negative examples to prevent the collapsing issue. Instead of directly using the output of the momentum encoder as training targets, our approach uses the accumulated predictions as training targets, which contain all historical view information for each image. As a result, our approach requires only a single view during training, which is much more efficient as shown in Sec. 4.3 and Fig. 7. Besides, NAT used an online clustering algorithm to assign a noise vector to each image as the training target of representation learning. Unlike NAT, which fixes the noise while updating the noise assignment during training, our approach uses the noise as the initial training target and updates the noise by model predictions in the subsequent training process. InstDisc uses a memory bank to store the representation of each image in order to construct positive and negative samples for the contrastive objective. By contrast, our method gets rid of the negative samples and only matches the prediction with the training targets of the same image (i.e., positive samples).
4.2 Bypassing collapsing issues
We note that there exist trivial local minima when we directly optimize the MSE loss between the prediction and the training target due to the absence of negative pairs: the encoder can simply output a constant feature vector for every data point to minimize the training loss, a.k.a. the collapsing issue. Despite the existence of the collapsing solution, self-adaptive training intrinsically prevents collapsing. The initial targets are different for different classes (under the supervised setting) or even for different images (under the self-supervised setting), forcing the models to learn different representations for different classes/images. Our empirical study in Fig. 1, 2 of the main body and Fig. 11 of the Appendix strongly supports that deep neural networks are able to learn meaningful information from corrupted data or even random noise, and bypass model collapse. Based on this, our approach significantly improves the supervised learning (see Fig. 0(a)) and self-supervised learning (see the blue curve in Fig. 2) of deep neural networks.
4.3 Is multi-view training indispensable?
The success of state-of-the-art self-supervised learning approaches [10, 11, 42] largely hinges on the multi-view training scheme: they essentially use strong data augmentation operations to create multiple views (crops) of the same image and then match the representation of one view with the other views of the same image. Despite the promising performance, these methods suffer heavily from the computational burden of data pre-processing and of training on extra views. Concretely, as shown in Fig. 7, the prior methods BYOL and MoCo incur doubled training time compared with standard supervised cross entropy training. In contrast, since our method requires only single-view training, its training is only slightly slower than the supervised method and is much faster than MoCo and BYOL.
We further conduct experiments to evaluate the performance of the multi-view training scheme on the CIFAR10 and STL10 datasets (see Sec. 7.1 for details), which cast doubt on its necessity for learning a good representation. As shown in Table I, although the performances of MoCo and BYOL are nearly halved on both datasets when using the single-view training scheme, self-supervised self-adaptive training achieves comparable results under both settings. Moreover, our approach with single-view training even slightly outperforms MoCo and BYOL with multi-view training. We attribute this superiority to the guiding principle of our algorithm: by dynamically incorporating the model predictions, each training target contains the relevant information about all the historical views of each image and, therefore, implicitly encourages the model to learn representations that are invariant to the historical views.
4.4 On the power of momentum encoder and predictor
Momentum encoder and predictor are two important components of self-supervised self-adaptive training and BYOL. The work of BYOL showed that the algorithm may converge to a trivial solution if one of them is removed, not to mention removing both of them. As shown in Table II, however, our results challenge their conclusion: 1) with the predictor, the linear evaluation accuracy of either BYOL (> 85%) or our method (> 90%) is non-trivial, regardless of the absence or the configuration of the momentum encoder; 2) without the predictor, the momentum encoder with a sufficiently large momentum can also improve the performance and bypass the collapsing issue. The results suggest that although both the predictor and the momentum encoder are indeed crucial for the performance of representation learning, either one of them with a proper configuration (i.e., the momentum term) suffices to avoid collapsing. We note that the latest version of  also found that the momentum encoder can be removed without collapse when carefully tuning the learning rate of the predictor. Our results, however, are obtained using the same training setting.
Moreover, we find that self-supervised self-adaptive training exhibits impressive resistance to collapsing, despite using only single-view training. Our approach can learn decent representations even when the predictor and the momentum encoder are both removed (see the seventh row of Table II). We hypothesize that the resistance comes from the consistency of the training targets due to our EMA scheme. This hypothesis is also partly supported by the observation that the learning of BYOL heavily depends on the slowly-evolving momentum encoder.
|Backbone||Method (label noise rate)||CIFAR10 0.2||CIFAR10 0.4||CIFAR10 0.6||CIFAR10 0.8||CIFAR100 0.2||CIFAR100 0.4||CIFAR100 0.6||CIFAR100 0.8|
|ResNet-34||ERM + Early Stopping||85.57||81.82||76.43||60.99||63.70||48.60||37.86||17.28|
|ResNet-34||Label Smoothing ||85.64||71.59||50.51||28.19||67.44||53.84||33.01||9.74|
|ResNet-34||Joint Opt ||92.25||90.79||86.87||69.16||58.15||54.81||47.94||17.18|
|WRN28-10||ERM + Early Stopping||87.86||83.40||76.92||63.54||68.46||55.43||40.78||20.25|
5 Application I: Learning with Noisy Label
Given the improved generalization of self-adaptive training over ERM under noise, we apply our approach to learning with noisy labels, where it outperforms the state-of-the-art by a significant margin.
5.1 Problem formulation
Given a set of noisy training data $\{(x_i, \tilde{y}_i)\}_{i=1}^{n} \sim \tilde{\mathcal{D}}$, where $\tilde{\mathcal{D}}$ is the distribution of noisy data and $\tilde{y}_i$ is the noisy label for each uncorrupted sample $x_i$, the goal is to be robust to the label noise in the training data and improve the classification performance on clean test data that are sampled from the clean distribution $\mathcal{D}$.
5.2 Experiments on CIFAR datasets
Setup We consider the case in which the labels are assigned uniformly at random with different noise rates. Following previous work [46, 48], we conduct the experiments on the CIFAR10 and CIFAR100 datasets  and use ResNet-34  / Wide ResNet-28-10  as our base classifier. The networks are implemented in PyTorch  and optimized using SGD with an initial learning rate of 0.1, momentum of 0.9, weight decay of 0.0005, batch size of 256 and 200 total training epochs. The learning rate is decayed to zero using a cosine annealing schedule . We use standard data augmentation of random horizontal flipping and cropping. We report the average performance over 3 trials.
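As a rough illustration of the optimizer configuration above, the following sketch performs SGD steps with momentum and weight decay on a flat list of parameters. It assumes the coupled weight-decay convention used by PyTorch's `torch.optim.SGD` (the decay term is added to the gradient before the momentum buffer is updated, with no dampening and no Nesterov momentum). The helper name `sgd_step` and the toy numbers are hypothetical.

```python
def sgd_step(params, grads, velocity, lr=0.1, momentum=0.9, weight_decay=5e-4):
    """One SGD update with momentum and (coupled) weight decay.

    Assumed convention: d = grad + weight_decay * w; v = momentum * v + d;
    w = w - lr * v. Returns the new parameters and the new momentum buffer.
    """
    new_params, new_velocity = [], []
    for w, g, v in zip(params, grads, velocity):
        d = g + weight_decay * w      # gradient plus L2 penalty term
        v = momentum * v + d          # momentum buffer accumulates d
        new_velocity.append(v)
        new_params.append(w - lr * v)
    return new_params, new_velocity


# Two steps on a single scalar parameter with a constant toy gradient of 0.5.
params, velocity = sgd_step([1.0], [0.5], [0.0])
params2, velocity2 = sgd_step(params, [0.5], velocity)
```

In practice the learning rate passed to such a step would itself follow the cosine annealing schedule mentioned above rather than stay fixed at 0.1.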
Main results We summarize the experiments in Table III. Most of the results are directly cited from the original papers with the same experiment settings; the results of Label Smoothing , Mixup , Joint Opt  and SCE  are reproduced by rerunning the official open-sourced implementations. From the table, we can see that our approach outperforms the state-of-the-art methods in most entries by 1% to 5% on both CIFAR10 and CIFAR100 datasets, using different backbones. Notably, unlike the Joint Opt, DAC and SELF methods, which require multiple iterations of training, our method enjoys the same computational budget as ERM.
|Label Noise Rate||0.4||0.8||0.4||0.8|
|- Exponential Moving Average||72.00||28.17||50.93||11.57|
Ablation study and hyper-parameter sensitivity First, we report the performance of ERM equipped with a simple early stopping scheme in the first row of Table III. We observe that our approach achieves substantial improvements over this baseline, which demonstrates that simply early stopping the training process is a sub-optimal solution. We then report the influence of the two individual components of our approach: the Exponential Moving Average (EMA) and the sample re-weighting scheme. As displayed in Table IV, removing either component considerably hurts the performance under all noise rates, and removing the EMA scheme leads to a significant performance drop. This suggests that properly incorporating model predictions is important in our approach. Finally, we analyze the sensitivity of our approach to the parameters $\alpha$ and $E_s$ in Table V, where we fix one parameter while varying the other. The performance is stable for various choices of $\alpha$ and $E_s$, indicating that our approach is insensitive to hyper-parameter tuning.
|Label Noise Rate||0.0||0.4||0.0||0.4|
5.3 Experiments on ImageNet dataset
The work of  suggested that the ImageNet dataset  contains annotation errors of its own even after several rounds of cleaning. Therefore, in this subsection, we use ResNet-50/101  to evaluate self-adaptive training on the large-scale ImageNet under both the standard setup (i.e., using the original labels) and the case in which 40% of the training labels are corrupted. We provide the experimental details in Appendix B.3 and report model performance on the ImageNet validation set in terms of top-1 accuracy in Table VI. We can see that self-adaptive training consistently improves over the ERM baseline by a considerable margin under all settings using different models. Specifically, the absolute improvement can be as large as 3% for the larger ResNet-101 when 40% of the training labels are corrupted. The results validate the effectiveness of our approach on a large-scale dataset and with larger models.
5.4 Label recovery of self-adaptive training
We demonstrate that our approach is able to recover the true labels from noisy training labels: we obtain the recovered labels from the moving average targets $t_i$ and compute the recovered accuracy as $\frac{1}{N}\sum_{i} \mathbb{1}\{\arg\max_j y_{i,j} = \arg\max_j t_{i,j}\}$, where $y_i$ is the clean label of each training sample. When 40% of the labels are corrupted in the CIFAR10 and ImageNet training sets, our approach successfully corrects a large number of labels and obtains recovered accuracy of 94.6% and 81.1%, respectively. We also display the confusion matrix of the recovered labels w.r.t. the clean labels on CIFAR10 in Figure 8, from which we see that our approach performs well for all classes.
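The recovered accuracy described above can be sketched in a few lines. The sketch assumes that each moving-average target is stored as a list of per-class scores and that the clean labels are given as integer class indices; the function names and toy data are hypothetical.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def recovered_accuracy(targets, clean_labels):
    """Fraction of samples whose moving-average target points at the clean label."""
    if len(targets) != len(clean_labels):
        raise ValueError("targets and clean_labels must have the same length")
    recovered = [argmax(t) for t in targets]
    correct = sum(int(r == y) for r, y in zip(recovered, clean_labels))
    return correct / len(clean_labels)


# Toy example: four samples, three classes; the last target is still wrong.
toy_targets = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
toy_clean = [0, 1, 2, 1]
toy_accuracy = recovered_accuracy(toy_targets, toy_clean)
```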
5.5 Investigation of sample weights
We further inspect the re-weighting scheme of self-adaptive training. Following the procedure in Section 5.4, we display the average sample weights in Figure 9. In the figure, the $(i, j)$-th block contains the average weight of samples with clean label $i$ and recovered label $j$; the white areas represent the case in which no sample lies in the cell. We see that the weights on the diagonal blocks are clearly higher than those on non-diagonal blocks. The figure indicates that, aside from its impressive ability to recover the correct labels, self-adaptive training can properly down-weight the noisy examples.
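The block-wise averages shown in such a figure can be computed as sketched below. The sketch assumes, as one plausible reading of the re-weighting scheme, that the weight of a sample is the largest entry of its moving-average target; this choice, the function names and the toy data are assumptions made for illustration only.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def average_weight_blocks(targets, clean_labels, num_classes):
    """Average sample weight per (clean label, recovered label) block.

    Assumption: weight of a sample = max entry of its moving-average target.
    Blocks that contain no sample are returned as None ("white areas").
    """
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, y in zip(targets, clean_labels):
        j = argmax(t)            # recovered label
        sums[y][j] += max(t)     # assumed sample weight
        counts[y][j] += 1
    return [
        [sums[i][j] / counts[i][j] if counts[i][j] else None
         for j in range(num_classes)]
        for i in range(num_classes)
    ]


blocks = average_weight_blocks(
    targets=[[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]],
    clean_labels=[0, 0, 0],
    num_classes=2,
)
```

In this toy case the diagonal block `blocks[0][0]` averages two confident samples, while the off-diagonal block `blocks[0][1]` holds the single, less confident sample whose recovered label disagrees with its clean label.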
|Dataset||Coverage||Ours||Deep Gamblers ||SelectiveNet ||SR ||MC-dropout |
|Dogs vs. Cats||100||3.01±0.17||2.93±0.17||3.58±0.04||3.58±0.04||3.58±0.04|
6 Application II: Selective Classification
6.1 Problem formulation
Selective classification, a.k.a. classification with rejection, trades classifier coverage off against accuracy , where the coverage is defined as the fraction of classified samples in the dataset; the classifier is allowed to output “don’t know” for certain samples. The task focuses on the noise-free setting and allows the classifier to abstain on potential out-of-distribution samples or samples that lie in the tail of the data distribution, that is, to make predictions only on samples for which it is confident.
Formally, a selective classifier is a composition of two functions $(f, g)$, where $f$ is the conventional $c$-class classifier and $g$ is the selection function that reveals the underlying uncertainty of inputs. Given an input $x$, the selective classifier outputs
$$(f, g)(x) = \begin{cases} \text{Abstain}, & g(x) > \tau \\ f(x), & \text{otherwise} \end{cases} \tag{14}$$
for a given threshold $\tau$ that controls the trade-off.
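A minimal sketch of such a selective classifier and of the coverage it induces is given below. It assumes that the selection function returns an uncertainty score and that the classifier abstains when this score exceeds the threshold; the names `selective_predict`, `coverage_and_error` and the toy inputs are hypothetical.

```python
ABSTAIN = None  # sentinel returned when the classifier refuses to predict


def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def selective_predict(class_scores, uncertainty, threshold):
    """Return the predicted class, or ABSTAIN if uncertainty > threshold."""
    if uncertainty > threshold:
        return ABSTAIN
    return argmax(class_scores)


def coverage_and_error(predictions, labels):
    """Coverage = fraction of classified samples; error is measured on them only."""
    kept = [(p, y) for p, y in zip(predictions, labels) if p is not ABSTAIN]
    coverage = len(kept) / len(labels)
    error = sum(int(p != y) for p, y in kept) / len(kept) if kept else 0.0
    return coverage, error


toy_predictions = [
    selective_predict([0.1, 0.9], uncertainty=0.2, threshold=0.5),
    selective_predict([0.6, 0.4], uncertainty=0.8, threshold=0.5),  # abstains
    selective_predict([0.7, 0.3], uncertainty=0.1, threshold=0.5),
]
toy_coverage, toy_error = coverage_and_error(toy_predictions, labels=[1, 0, 1])
```

Lowering the threshold makes the classifier abstain more often, which reduces coverage and typically reduces the error on the remaining samples; this is the trade-off the threshold controls.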
Inspired by [48, 55], we adapt the approach presented in Algorithm 1 to the selective classification task. We introduce an extra $(c+1)$-th class (representing abstention) during training and replace the selection function $g(\cdot)$ in Equation (14) by $f(\cdot)_{c+1}$. In this way, we can train a selective classifier in an end-to-end fashion. Besides, unlike previous works that provide no explicit signal for learning the abstention class, we use model predictions as a guideline in the design of the learning process.
Given a mini-batch of data pairs $\{(x_i, y_i)\}_m$, model predictions $p_i$ and their exponential moving average $t_i$ for each sample, we optimize the classifier $f$ by minimizing:
$$\mathcal{L}(f) = -\frac{1}{m}\sum_{i}\left[t_{i,y_i}\log p_{i,y_i} + (1 - t_{i,y_i})\log p_{i,c+1}\right],$$
where $y_i$ is the index of the non-zero element in the one-hot label vector $\mathbf{y}_i$. The first term measures the cross-entropy loss between the prediction and the original label $\mathbf{y}_i$, in order to learn a good multi-class classifier. The second term acts as the selection function and identifies uncertain samples in the dataset. $t_{i,y_i}$ dynamically trades off these two terms: if $t_{i,y_i}$ is very small, the sample is deemed uncertain and the second term forces the selective classifier to learn to abstain on this sample; if $t_{i,y_i}$ is close to 1, the loss recovers the standard cross-entropy minimization and forces the selective classifier to make a perfect prediction.
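The two-term objective described above can be sketched as follows. The sketch assumes that each prediction is a probability vector whose last entry is the abstention class, that the trade-off weight is the moving-average target evaluated at the labelled class, and that probabilities are clamped by a small `EPS` to keep the logarithm finite; all names and the clamping constant are hypothetical.

```python
import math

EPS = 1e-12  # assumed clamp to avoid log(0); not specified in the text


def selective_loss(probs, ema_targets, labels):
    """Mean over the batch of -[w * log p_y + (1 - w) * log p_abstain].

    probs[i]       : predicted probabilities, last entry = abstention class
    ema_targets[i] : moving-average target for sample i
    labels[i]      : integer index of the labelled class
    w              : ema_targets[i][labels[i]], the per-sample trade-off weight
    """
    total = 0.0
    for p, t, y in zip(probs, ema_targets, labels):
        w = t[y]
        total += w * math.log(max(p[y], EPS))
        total += (1.0 - w) * math.log(max(p[-1], EPS))
    return -total / len(labels)


toy_probs = [[0.5, 0.25, 0.25]]  # two real classes plus abstention
confident = selective_loss(toy_probs, [[1.0, 0.0, 0.0]], [0])
uncertain = selective_loss(toy_probs, [[0.0, 1.0, 0.0]], [0])
```

When the weight is 1 the loss reduces to the ordinary cross entropy on the labelled class, and when it is 0 only the abstention probability is rewarded, matching the two limiting cases discussed in the text.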
Setup We conduct the experiments on three datasets: CIFAR10 , SVHN  and Dogs vs. Cats .
We compare our method with previous state-of-the-art methods on selective classification, including Deep Gamblers , SelectiveNet , Softmax Response (SR) and MC-dropout .
The experiments are based on official open-sourced implementation
Main results The results of prior methods are cited from original papers and are summarized in Table VII. We see that our method achieves up to 50% relative improvements compared with all other methods under various coverage rates, on all datasets. Notably, Deep Gamblers also introduces an extra abstention class in their method but without applying model predictions. The improved performance of our method comes from the use of model predictions in the training process.
7 Application III: Linear Evaluation Protocol in Self-Supervised Learning
7.1 Experimental setup
Datasets and data augmentations We conduct experiments on three benchmarks: CIFAR10/CIFAR100  with 50K images; STL10  with 105K images. The choice of data augmentations follows prior works [10, 11]: we take a random crop from each image and resize it to the original size (i.e., 32×32 for CIFAR10/CIFAR100 and 96×96 for STL10); the crop is then transformed by random color jittering, random horizontal flip, and random grayscale conversion.
Network architecture The encoders and consist of a backbone of ResNet-18 /ResNet-50 /AlexNet  and a projector that is instantiated by a multi-layer perceptron (MLP). We use the output of last global average pooling layer of the backbone as the extracted feature vectors. Following prior works , the output vectors of the backbone are transformed by the projector MLP to dimension 256. Besides, the predictor is also instantiated by an MLP with the same architecture as the projector in . In our implementation, all the MLPs have one hidden layer of size 4,096, followed by a batch normalization  layer and ReLU activation .
Self-supervised pre-training settings We optimize the networks using the SGD optimizer with momentum of 0.9 and weight decay of 0.00005. By default, we use a batch size of 512 for all methods in all experiments and train the networks for 800 epochs using 4 GTX 1080Ti GPUs. The base learning rate is set to 2.0 and is scaled linearly with respect to batch size following . During training, the learning rate is warmed up for the first 30 epochs and then adjusted according to a cosine annealing schedule .
Linear evaluation protocol Following common practice [10, 11, 42], we evaluate the representation learned by self-supervised pre-training using the linear classification protocol. That is, we remove the projector in and the predictor , fix the parameters of the backbone of the encoder and train a supervised linear classifier on top of the features extracted from the encoder. The linear classifier is trained for 100 epochs with weight decay 0.0 and batch size 512. The initial learning rate is set to 0.4 and decayed by a factor of 0.1 at the 60th and 80th training epochs. The performance is measured in terms of top-1 accuracy of the linear classifier on test data.
7.2 Comparison with the state-of-the-art
We first compare our self-supervised self-adaptive training with three state-of-the-art methods: the contrastive learning methods MoCo  and SimCLR , and the bootstrap method BYOL . For a fair comparison, we use the same code base and the same experimental settings for all the methods, following their official open-sourced implementations. We carefully adjust the hyper-parameters on the CIFAR10 dataset for each method and use the same parameters on the remaining datasets. Besides, we also conduct experiments using AlexNet as the backbone and compare the performance of our method with the reported results of prior methods. The results are summarized in Table VIII. We can see that, despite using only single-view training, our self-supervised self-adaptive training consistently obtains better performance than all other methods on all datasets with different backbones.
7.3 Sensitivity of self-supervised self-adaptive training
Sensitivity to hyper-parameters We study how the two momentum parameters and affect the performance of our approach and report the results in Table IX. By varying one of the parameters while fixing the other, we observe that self-supervised self-adaptive training performs consistently well, which suggests that our approach is not sensitive to the choice of hyper-parameters.
Sensitivity to data augmentation Data augmentation is one of the most essential ingredients of recent self-supervised learning methods: a large body of these methods, including ours, formulate the training objective as learning a representation that encodes the information shared across different views that are generated by data augmentation. As a result, prior methods, like MoCo and BYOL, fail in the single-view training setting. Self-supervised self-adaptive training, on the other hand, maintains a training target for each image, which contains all historical view information of this image. Therefore, we conjecture that our method should be more robust to the choice of data augmentation. To validate our conjecture, we evaluate our method and BYOL under settings in which some of the augmentation operators are removed. The results are shown in Fig. 9(a). Removing any augmentation operator hurts the performance of both methods, while our method is less affected. Specifically, when all augmentation operators except random crop and flip are removed, the performance of BYOL drops to 86.6% while our method still obtains 88.6% accuracy.
Sensitivity to training batch size Recent contrastive learning methods require large batch size training (e.g., 1024 or even larger) for optimal performance, due to the need of comparing with massive negative samples. BYOL does not use negative samples and suggests that this issue can be mitigated. Here, since our method also gets rid of the negative samples, we make direct comparisons with BYOL at different batch sizes to evaluate the sensitivity of our method to batch size. For each method, the base learning rate is linearly scaled according to the batch size while the rest of settings are kept unchanged. The results are shown in Fig. 9(b). We can see that our method exhibits a smaller performance drop than BYOL at various batch sizes. Concretely, the accuracy of BYOL drops by 2% at batch size 128 while that of ours drops by only 1%.
8 Related Works
Generalization of deep networks Previous work  systematically analyzed the capability of deep networks to overfit random noise. Their results show that traditional wisdom fails to explain the generalization of deep networks. Another line of works [33, 34, 35, 36, 31, 37, 32] observed an intriguing double-descent risk curve from the bias-variance trade-off. [31, 32] claimed that this observation challenges the conventional U-shaped risk curve in the textbook. Our work shows that this observation may stem from overfitting of noise; the phenomenon vanishes with a proper design of the training process such as self-adaptive training. To improve the generalization of deep networks, [43, 72] proposed label smoothing regularization that uniformly distributes $\epsilon$ of the labeling weight to all classes and uses this soft label for training;  introduced mixup augmentation that extends the training distribution by dynamic interpolations between randomly paired input images and the associated targets during training. This line of research is similar to ours as both methods use soft labels in the training. However, self-adaptive training is able to recover true labels from noisy labels and is more robust to noise.
Robust learning from corrupted data Aside from the preprocessing-training approaches that have been discussed in the last paragraph of Section 3.1, there have also been many other works on learning from noisy data. To name a few, [73, 20] showed that deep neural networks tend to fit clean samples first and that overfitting of noise occurs in the later stage of training.  further proved that early stopping can mitigate the issues that are caused by label noise. [74, 75] incorporated model predictions into training by simple interpolation of labels and model predictions. We demonstrate that our exponential moving average and sample re-weighting schemes enjoy superior performance. Other works [46, 47] proposed alternative loss functions to cross entropy that are robust to label noise. They are orthogonal to ours and can readily be combined with our approach, as shown in Appendix C.4. Beyond the corrupted data setting, recent works [76, 77] propose a self-training scheme that also uses model predictions as training targets. However, they suffer from the heavy cost of multiple iterations of training, which is avoided by our approach. Temporal Ensembling  incorporated the “ensemble” predictions as pseudo-labels for training. Different from ours, Temporal Ensembling focuses on the semi-supervised learning setting and only accumulates predictions for unlabeled data.
Self-supervised learning Aiming to learn powerful representations, most self-supervised learning approaches first solve a proxy task without human supervision. For example, prior works proposed recovering the input using an auto-encoder [79, 80], generating pixels in the input space [81, 82], predicting rotation  and solving jigsaw puzzles . Recently, contrastive learning methods [13, 85, 71, 10, 11, 86] significantly advanced self-supervised representation learning. These approaches essentially use strong data augmentation techniques to create multiple views (crops) of the same image and discriminate the representations of different views of the same image (i.e., positive samples) from the views of other images (i.e., negative samples). Bootstrap methods eliminate the discrimination of positive and negative data pairs: the works of [68, 70] alternately performed clustering on the representations and then used the cluster assignments as classification targets to update the model;  swapped the cluster assignments between the two views of the same image as training targets;  simply predicted the representation of one view from the other view of the same image;  formulated the self-supervised training objective as predicting a set of predefined noises. Our work follows the path of bootstrap methods. Going further than them, self-adaptive training is a general training algorithm that bridges the supervised and self-supervised learning paradigms. Our approach casts doubt on the necessity of the costly multi-view training and works well with the single-view training scheme.
9 Conclusion
In this paper, we explore the possibility of a unified framework to bridge the supervised and self-supervised learning of deep neural networks. We first analyze the training dynamics of deep networks under these two learning settings and observe that useful information from data is distilled into model predictions. The observation holds broadly even in the presence of data corruptions and the absence of labels, which motivates us to propose Self-Adaptive Training, a general training algorithm that dynamically incorporates model predictions into the training process. We demonstrate that our approach improves the generalization of deep neural networks under various kinds of training data corruptions and enhances representation learning using accumulated model predictions. Finally, we present three applications of self-adaptive training, namely learning with noisy labels, selective classification and the linear evaluation protocol in self-supervised learning, where our approach significantly advances the state-of-the-art.
Appendix A Proof
Let be the maximal eigenvalue of the matrix , if the learning rate , then
Note that is positive semi-definite and can be diagonalized as , where the diagonal matrix contains the eigenvalues of and the matrix contains the corresponding eigenvectors, . And let . Multiplying both sides of Equation (17) by , we have
From Equation (4), we have
Subtracting both sides of Equation (19) by , we obtain
where . Therefore,
Because is positive semi-definite and , all elements in are smaller than 1. When , each element of the diagonal matrix is greater than -1, and we have
Appendix B Experimental Setups
B.1 Double descent phenomenon
Following previous work , we optimize all models using Adam  optimizer with fixed learning rate of 0.0001, batch size of 128, common data augmentation, weight decay of 0 for 4,000 epochs. For our approach, we use the hyper-parameters for standard ResNet-18 (width of 64) and dynamically adjust them for other models according to the relation of model capacity as:
B.2 Adversarial training
Prior work reported that imperceptibly small perturbations around input data (i.e., adversarial examples) can cause ERM-trained deep neural networks to make arbitrary predictions. Since then, a large body of literature has been devoted to improving the adversarial robustness of deep neural networks. Among them, the adversarial training algorithm TRADES  achieves state-of-the-art performance. TRADES decomposes the robust error (w.r.t. adversarial examples) into the sum of natural error and boundary error, and proposes to minimize:
$$\frac{1}{n}\sum_{i}\left\{\mathrm{CE}\big(p(x_i), y_i\big) + \max_{\|\tilde{x}_i - x_i\|_\infty \le \epsilon} \mathrm{KL}\big(p(x_i), p(\tilde{x}_i)\big) / \lambda\right\} \tag{24}$$
where $p(\cdot)$ is the model prediction, $\epsilon$ is the maximal allowed perturbation, CE stands for cross entropy, and KL stands for Kullback–Leibler divergence. The first term corresponds to ERM that maximizes the natural accuracy; the second term pushes the decision boundary away from data points to improve adversarial robustness; the hyper-parameter $\lambda$ controls the trade-off between natural accuracy and adversarial robustness. We evaluate self-adaptive training on this task by replacing the first term of Equation (24) with our approach.
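The structure of this two-term objective can be sketched as below. The sketch makes several assumptions: the adversarial predictions are supplied by the caller (the inner maximization that searches for the perturbed input is omitted), the KL term is divided by the trade-off hyper-parameter, and probabilities are clamped by a small `EPS`. Function names and toy values are hypothetical and this is not the official TRADES implementation.

```python
import math

EPS = 1e-12  # assumed clamp to keep logarithms finite


def cross_entropy(probs, label):
    """CE between a predicted distribution and an integer label."""
    return -math.log(max(probs[label], EPS))


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log(max(pi, EPS) / max(qi, EPS))
        for pi, qi in zip(p, q) if pi > 0.0
    )


def trades_style_loss(clean_probs, adv_probs, labels, lam):
    """Batch mean of CE(clean, label) + KL(clean || adversarial) / lam."""
    total = 0.0
    for p, q, y in zip(clean_probs, adv_probs, labels):
        total += cross_entropy(p, y) + kl_divergence(p, q) / lam
    return total / len(labels)


clean = [[0.5, 0.5]]
no_shift = trades_style_loss(clean, [[0.5, 0.5]], [0], lam=1.0)
shifted = trades_style_loss(clean, [[0.9, 0.1]], [0], lam=1.0)
shifted_weak = trades_style_loss(clean, [[0.9, 0.1]], [0], lam=6.0)
```

If the adversarial prediction equals the clean one, the second term vanishes and only the natural cross entropy remains; under the assumed weighting, a larger trade-off value reduces the influence of the robustness term.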
Our experiments are based on the official open-sourced implementation
We use ResNet-50/ResNet-101  as the base classifier. Following the original paper  and [52, 66], we use SGD to optimize the networks with a batch size of 768, base learning rate of 0.3, momentum of 0.9, weight decay of 0.0005 and 95 total training epochs. The learning rate is linearly increased from 0.0003 to 0.3 in the first 5 epochs (i.e., warmup), and then decayed to 0 using a cosine annealing schedule . Following common practice, we use random resizing, cropping and flipping augmentation during training. The hyper-parameters of our approach are set to and under standard setup, and are set to and under 40% label noise setting. The experiments are conducted in PyTorch  with distributed training and mixed precision training.
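The learning-rate schedule described above (linear warmup from 0.0003 to 0.3 over the first 5 epochs, then cosine annealing to 0 by epoch 95) can be sketched as a function of the epoch. Whether the rate is updated per epoch or per iteration is not stated, so the sketch accepts fractional epochs; the function name is hypothetical.

```python
import math


def warmup_cosine_lr(epoch, total_epochs=95, warmup_epochs=5,
                     base_lr=0.3, warmup_start_lr=0.0003):
    """Learning rate at a (possibly fractional) epoch.

    Linear warmup from warmup_start_lr to base_lr, then cosine decay to 0.
    """
    if epoch < warmup_epochs:
        fraction = epoch / warmup_epochs
        return warmup_start_lr + (base_lr - warmup_start_lr) * fraction
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Sample the schedule once per epoch for inspection or plotting.
schedule = [warmup_cosine_lr(e) for e in range(96)]
```

The schedule is continuous at the end of warmup, since the cosine branch starts exactly at the base learning rate.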
Appendix C Additional Experimental Results
C.1 ERM may suffer from overfitting of noise
Prior work showed that a model trained by standard ERM can easily fit randomized data. However, it only analyzed the generalization errors in the presence of corrupted labels. In this paper, we report the whole training process and also consider the performance on clean sets (i.e., the original uncorrupted data). Figure 0(a) shows the four accuracy curves (on clean and noisy training and validation sets, respectively) for each model that is trained on one of four corrupted training sets. Note that the models only have access to the noisy training sets (i.e., the red curve); the other three curves are shown only for illustration. We conclude with two principal observations from the figures: (1) The accuracy on the noisy training and validation sets is close at the beginning and the gap increases monotonically w.r.t. the epoch. The generalization errors (i.e., the gap between the accuracy on the noisy training and validation sets) are large at the end of training. (2) The accuracy on the clean training and validation sets is consistently higher than the percentage of clean data in the noisy training set. This occurs around the epochs between underfitting and overfitting.
Our first observation raises concerns about the overfitting issue of ERM training dynamics, which has also been reported by . However, the work of  only considered the case of corrupted labels and proposed using an early-stopping mechanism to improve the performance on clean data. On the other hand, our analysis of the broader corruption schemes shows that early stopping might be sub-optimal and may hurt the performance under other types of corruptions (see the last three columns in Figure 0(a)).
The second observation implies that model predictions by ERM can capture and amplify useful signals in the noisy training set, although the training dataset is heavily corrupted. While this was also reported in [16, 18, 19, 20] for the case of corrupted labels, we show that a similar phenomenon occurs more generally under other kinds of corruptions. This observation motivates our approach, which incorporates model predictions into the training procedure.
C.2 Improved generalization of self-adaptive training on random noise
Training accuracy w.r.t. correct labels on different portions of data For a more intuitive demonstration, we split the CIFAR10 training set (with 40% label noise) into two portions: 1) the untouched portion, i.e., the elements in the training set which were left untouched; 2) the corrupted portion, i.e., the elements in the training set which were indeed randomized. The accuracy curves on these two portions w.r.t. the correct training labels are shown in Figure 13. We can observe that the accuracy of ERM on the corrupted portion increases in the first few epochs and then eventually decreases to 0. In contrast, self-adaptive training calibrates the training process and consistently fits the correct labels.
Study on extreme noise We further rerun the same experiments as in Figure 1 of the main text by injecting extreme noise (i.e., a noise rate of 80%) into the CIFAR10 dataset. We report the corresponding accuracy curves in Figure 11, which shows that our approach significantly improves the generalization over ERM even when random noise dominates the training data. This again justifies our observations in Section 3.
Effect of data augmentation All our previous studies are performed with common data augmentation (i.e., random cropping and flipping). Here, we further report the effect of data augmentation. We adjust the introduced hyper-parameters as , due to more severe overfitting when data augmentation is absent. Figure 12 shows the corresponding generalization errors and clean validation errors. We observe that, for both ERM and our approach, the errors clearly increase when data augmentation is absent (compared with those in Figure 3). However, the gain from data augmentation is limited, and the generalization errors of standard ERM can still be very large with or without it. Directly replacing the standard training procedure with our approach brings bigger gains in terms of generalization regardless of data augmentation. This suggests that data augmentation can help but is not essential for improving the generalization of deep neural networks, which is consistent with the observation in .
C.3 Epoch-wise double descent phenomenon
Prior work reported that, for sufficiently large models, the curve of test error against training epoch also exhibits a double-descent phenomenon, which they termed epoch-wise double descent. In Figure 15, we reproduce the epoch-wise double descent phenomenon on ERM and inspect self-adaptive training. We observe that our approach (the red curve) exhibits a slight double descent because overfitting starts before the initial $E_s$ epochs have elapsed. Once the training targets are being updated (i.e., after $E_s$ = 40 training epochs), the red curve decreases monotonically. This observation again indicates that the double-descent phenomenon may stem from overfitting of noise and can be avoided by our algorithm.
|Label Noise Rate||0.2||0.4||0.6||0.8||0.2||0.4||0.6||0.8|
|Ours + SCE||94.39||93.29||89.83||79.13||76.57||72.16||64.12||39.61|
C.4 Cooperation with Symmetric Cross Entropy
Prior work showed that the Symmetric Cross Entropy (SCE) loss is robust to underlying label noise in training data. Formally, given training target $t$ and model prediction $p$, the SCE loss is defined as:
$$\mathcal{L}_{sce} = -w_1\sum_{j} t_j \log p_j - w_2\sum_{j} p_j \log t_j,$$
where the first term is the standard cross entropy loss and the second term is the reversed version. In this section, we show that self-adaptive training can cooperate with this noise-robust loss and enjoy a further performance boost without extra cost.
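A minimal sketch of a symmetric cross entropy of this form is given below: a weighted sum of the standard cross entropy and its reversed version, in which the roles of target and prediction are swapped. The default weights 1 and 0.1 mirror the values quoted in the setup of this section; the clamping constant `eps`, which keeps the logarithm of a zero target entry finite, is an assumption, as are the function and argument names.

```python
import math


def sce_loss(target, pred, w_ce=1.0, w_rce=0.1, eps=1e-4):
    """Symmetric cross entropy between a target and a predicted distribution.

    ce  = -sum_j target_j * log(pred_j)    (standard cross entropy)
    rce = -sum_j pred_j   * log(target_j)  (reversed cross entropy)
    Entries are clamped to at least `eps` inside the logarithm (assumption).
    """
    ce = -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))
    rce = -sum(p * math.log(max(t, eps)) for t, p in zip(target, pred))
    return w_ce * ce + w_rce * rce


one_hot_loss = sce_loss([1.0, 0.0], [0.5, 0.5])
soft_loss = sce_loss([0.5, 0.5], [0.5, 0.5])
```

With a one-hot target the reversed term is dominated by the clamped zero entries, whereas a soft target such as the moving-average targets used by self-adaptive training keeps both terms well behaved.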
Setup Most experimental settings are kept the same as in Section 5.2. For the hyper-parameters introduced by the SCE loss, we directly set them to 1 and 0.1, respectively, in all our experiments.
Results We summarize the results in Table X. We can see that, although self-adaptive training already achieves very strong performance, considerable gains can be obtained when it is equipped with the SCE loss. Concretely, the improvement is as large as 1.5% when label noise of 60% is injected into the CIFAR100 training set. It also indicates that our approach is flexible and can be further extended.
C.5 Out-of-distribution generalization
In this section, we consider out-of-distribution (OOD) generalization, where the models are evaluated on unseen test distributions outside the training distribution.
Setup To evaluate the OOD generalization performance, we use the CIFAR10-C benchmark , which is constructed by applying 15 types of corruption to the original CIFAR10 test set at 5 levels of severity. The performance is measured by the average accuracy over the 15 types of corruption. We mainly follow the training details in Section 5.2 and adjust .
Results We summarize the results in Table XI. Regardless of the presence of corruption and of the corruption level, our method consistently outperforms ERM by a considerable margin, which becomes larger when the corruption is more severe. The experiment indicates that self-adaptive training may provide implicit regularization for OOD generalization.
C.6 Cost of maintaining probability vectors
Take the large-scale ImageNet dataset  as an example. ImageNet consists of about 1.2 million images categorized into 1000 classes. Storing such vectors in single precision format for the entire dataset requires $1.2 \times 10^6 \times 1000 \times 32$ bit $\approx 4.8$ GB, which is reduced to about 1.2 GB under the self-supervised learning setting that records a 256-d feature for each image. The cost is acceptable since modern GPUs usually have 20GB or more dedicated memory, e.g., NVIDIA Tesla A100 has 40GB memory. Moreover, the vectors can be stored in CPU memory or even on disk and loaded along with the images to further reduce the cost.
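The storage estimate above is simple arithmetic and can be checked with a short sketch. It assumes roughly 1.2 million images, 4 bytes per single-precision value, 1 GB = 10^9 bytes, and a 256-dimensional feature in the self-supervised case (the projector output dimension stated in Section 7.1); the function name is hypothetical.

```python
def storage_gb(num_samples, dim, bytes_per_value=4):
    """Storage in gigabytes (1 GB = 10**9 bytes) for one vector per sample."""
    return num_samples * dim * bytes_per_value / 1e9


NUM_IMAGES = 1_200_000  # approximate ImageNet training-set size quoted above

supervised_gb = storage_gb(NUM_IMAGES, dim=1000)      # one probability per class
self_supervised_gb = storage_gb(NUM_IMAGES, dim=256)  # assumed 256-d feature
```

Under these assumptions the supervised setting needs about 4.8 GB and the self-supervised setting about 1.2 GB, both well within the memory budgets mentioned in the text.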
- L. Huang, C. Zhang, and H. Zhang, “Self-adaptive training: beyond empirical risk minimization,” in Advances in Neural Information Processing Systems, vol. 33, 2020.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
- J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
- A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
- T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186.
- K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
- T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International Conference on Machine Learning, 2020.
- M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” in Advances in Neural Information Processing Systems, vol. 33, 2020.
- Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
- A. Krizhevsky and G. E. Hinton, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
- P. Bojanowski and A. Joulin, “Unsupervised learning by predicting noise,” in International Conference on Machine Learning, 2017, pp. 517–526.
- C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations, 2017.
- V. Nagarajan and J. Z. Kolter, “Uniform convergence may be unable to explain generalization in deep learning,” in Advances in Neural Information Processing Systems, 2019, pp. 11 611–11 622.
- D. Rolnick, A. Veit, S. Belongie, and N. Shavit, “Deep learning is robust to massive label noise,” arXiv preprint arXiv:1705.10694, 2017.
- M. Y. Guan, V. Gulshan, A. M. Dai, and G. E. Hinton, “Who said what: Modeling individual labelers improves classification,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- M. Li, M. Soltanolkotabi, and S. Oymak, “Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 4313–4324.
- C. E. Brodley and M. A. Friedl, “Identifying and eliminating mislabeled training instances,” in Proceedings of the National Conference on Artificial Intelligence, 1996, pp. 799–805.
- C. E. Brodley and M. A. Friedl, “Identifying mislabeled training data,” Journal of Artificial Intelligence Research, vol. 11, pp. 131–167, 1999.
- X. Zhu, X. Wu, and Q. Chen, “Eliminating class noise in large datasets,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 920–927.
- D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox, “SELF: Learning to filter noisy labels with self-ensembling,” in International Conference on Learning Representations, 2020.
- H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi, “Label refinery: Improving ImageNet classification through label progression,” arXiv preprint arXiv:1805.02641, 2018.
- D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa, “Joint optimization framework for learning with noisy labels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5552–5560.
- C.-M. Teng, “Correcting noisy data,” in International Conference on Machine Learning, 1999, pp. 239–248.
- P. J. Rousseeuw and A. M. Leroy, Robust regression and outlier detection. John Wiley & Sons, 2005, vol. 589.
- L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, “MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels,” in International Conference on Machine Learning, 2018, pp. 2304–2313.
- M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” in International Conference on Machine Learning, 2018, pp. 4334–4343.
- M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15 849–15 854, 2019.
- P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, “Deep double descent: Where bigger models and more data hurt,” in International Conference on Learning Representations, 2020.
- M. Opper, “Statistical mechanics of learning: Generalization,” The Handbook of Brain Theory and Neural Networks, pp. 922–925, 1995.
- ——, “Learning to generalize,” Frontiers of Life, vol. 3, no. part 2, pp. 763–775, 2001.
- M. S. Advani, A. M. Saxe, and H. Sompolinsky, “High-dimensional dynamics of generalization error in neural networks,” Neural Networks, vol. 132, pp. 428–446, 2020.
- S. Spigler, M. Geiger, S. d’Ascoli, L. Sagun, G. Biroli, and M. Wyart, “A jamming transition from under-to over-parametrization affects loss landscape and generalization,” arXiv preprint arXiv:1810.09665, 2018.
- M. Geiger, S. Spigler, S. d’Ascoli, L. Sagun, M. Baity-Jesi, G. Biroli, and M. Wyart, “Jamming transition as a paradigm to understand the loss landscape of deep neural networks,” Physical Review E, vol. 100, no. 1, p. 012115, 2019.
- C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations, 2014.
- H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan, “Theoretically principled trade-off between robustness and accuracy,” in International Conference on Machine Learning, 2019, pp. 7472–7482.
- F. Croce and M. Hein, “Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks,” in International Conference on Machine Learning, 2020.
- A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations, 2018.
- J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” in Advances in Neural Information Processing Systems, vol. 33, 2020.
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
- G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, “Making deep neural networks robust to label noise: A loss correction approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1944–1952.
- H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” in International Conference on Learning Representations, 2018.
- Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Advances in Neural Information Processing Systems, 2018, pp. 8778–8788.
- Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, “Symmetric cross entropy for robust learning with noisy labels,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 322–330.
- S. Thulasidasan, T. Bhattacharya, J. Bilmes, G. Chennupati, and J. Mohd-Yusof, “Combating label noise in deep learning using abstention,” in International Conference on Machine Learning, 2019, pp. 6234–6243.
- S. Liu, J. Niles-Weed, N. Razavian, and C. Fernandez-Granda, “Early-learning regularization prevents memorization of noisy labels,” in Advances in Neural Information Processing Systems, 2020.
- S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
- I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in International Conference on Learning Representations, 2017.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
- Z. Liu, Z. Wang, P. P. Liang, R. Salakhutdinov, L.-P. Morency, and M. Ueda, “Deep gamblers: Learning to abstain with portfolio theory,” in Advances in Neural Information Processing Systems, 2019.
- Y. Geifman and R. El-Yaniv, “SelectiveNet: A deep neural network with an integrated reject option,” in International Conference on Machine Learning, 2019, pp. 2151–2159.
- ——, “Selective classification for deep neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 4878–4887.
- R. El-Yaniv and Y. Wiener, “On the foundations of noise-free selective classification,” Journal of Machine Learning Research, vol. 11, pp. 1605–1641, 2010.
- Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.
- “Dogs vs. cats dataset,” https://www.kaggle.com/c/dogs-vs-cats.
- S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 1097–1105.
- V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in International Conference on Machine Learning, 2010.
- P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
- R. Zhang, P. Isola, and A. A. Efros, “Split-brain autoencoders: Unsupervised learning by cross-channel prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1058–1067.
- M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 132–149.
- J. Huang, Q. Dong, S. Gong, and X. Zhu, “Unsupervised deep learning by neighbourhood discovery,” in International Conference on Machine Learning, 2019, pp. 2849–2858.
- Y. M. Asano, C. Rupprecht, and A. Vedaldi, “Self-labelling via simultaneous clustering and representation learning,” in International Conference on Learning Representations, 2020.
- Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in European Conference on Computer Vision, 2020.
- G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, “Regularizing neural networks by penalizing confident output distributions,” arXiv preprint arXiv:1701.06548, 2017.
- D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio et al., “A closer look at memorization in deep networks,” in International Conference on Machine Learning. JMLR.org, 2017, pp. 233–242.
- S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” in Workshop at the International Conference on Learning Representations, 2015.
- B. Dong, J. Hou, Y. Lu, and Z. Zhang, “Distillation ≈ early stopping? Harvesting dark knowledge utilizing anisotropic information retrieval for overparameterized neural network,” arXiv preprint arXiv:1910.01255, 2019.
- T. Furlanello, Z. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, “Born again neural networks,” in International Conference on Machine Learning, 2018, pp. 1607–1616.
- Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves ImageNet classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 687–10 698.
- S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” in International Conference on Learning Representations, 2017.
- P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in International Conference on Machine Learning, 2008, pp. 1096–1103.
- D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
- D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference on Learning Representations, 2014.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” in International Conference on Learning Representations, 2018.
- M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision. Springer, 2016, pp. 69–84.
- A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- J. Li, P. Zhou, C. Xiong, R. Socher, and S. C. Hoi, “Prototypical contrastive learning of unsupervised representations,” arXiv preprint arXiv:2005.04966, 2020.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
- D. Hendrycks and T. Dietterich, “Benchmarking neural network robustness to common corruptions and perturbations,” in International Conference on Learning Representations, 2018.