Improving Generalization by Controlling Label-Noise Information in Neural Network Weights

Abstract

In the presence of noisy or incorrect labels, neural networks have the undesirable tendency to memorize information about the noise. Standard regularization techniques such as dropout, weight decay, or data augmentation sometimes help, but do not prevent this behavior. If one considers neural network weights as random variables that depend on the data and the stochasticity of training, the amount of memorized information can be quantified with the Shannon mutual information between weights and the vector of all training labels given inputs, I(w : y ∣ x). We show that for any training algorithm, low values of this term correspond to reduced memorization of label-noise and better generalization bounds. To obtain these low values, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. We illustrate the effectiveness of our approach on versions of MNIST, CIFAR-10, and CIFAR-100 corrupted with various noise models, and on the large-scale Clothing1M dataset, which has noisy labels.


1 Introduction

Supervised learning with deep neural networks has shown great success in the last decade. Despite having millions of parameters, modern neural networks generalize surprisingly well. However, their training is particularly susceptible to noisy labels, as shown by Zhang et al. (2016) in their analysis of generalization error. In the presence of noisy or incorrect labels, networks start to memorize the training labels, which degrades the generalization performance Chen et al. (2019). At the extreme, standard architectures have the capacity to achieve 100% classification accuracy on training data, even when labels are assigned at random Zhang et al. (2016). Furthermore, standard explicit or implicit regularization techniques such as dropout, weight decay or data augmentation do not directly address nor completely prevent label memorization Zhang et al. (2016); Arpit et al. (2017).

Poor generalization due to label memorization is a significant problem because many large, real-world datasets are imperfectly labeled. Label noise may be introduced when building datasets from unreliable sources of information or using crowd-sourcing resources like Amazon Mechanical Turk. A practical solution to the memorization problem is likely to be algorithmic, as sanitizing labels in large datasets is costly and time consuming. Existing approaches for addressing the problem of label-noise and generalization performance include deriving robust loss functions Natarajan et al. (2013); Ghosh et al. (2017); Zhang and Sabuncu (2018); Xu et al. (2019), loss correction techniques Sukhbaatar et al. (2014); Tong Xiao et al. (2015); Goldberger and Ben-Reuven (2017); Patrini et al. (2017), re-weighting samples Jiang et al. (2017); Ren et al. (2018), detecting incorrect samples and relabeling them Reed et al. (2014); Tanaka et al. (2018); Ma et al. (2018), and employing two networks that select training examples for each other Han et al. (2018); Yu et al. (2019). We propose an information-theoretic approach that directly addresses the root of the problem. If a classifier is able to correctly predict a training label that is actually random, it must have somehow stored information about this label in the parameters of the model. To quantify this information, Achille and Soatto (2018) consider weights as a random variable, w, that depends on the stochasticity in training data and parameter initialization. The entire training dataset is considered a random variable consisting of a vector of inputs, x, and a vector of labels for each input, y. The amount of label memorization is then given by the Shannon mutual information between weights and labels conditioned on inputs, I(w : y ∣ x). Achille and Soatto (2018) show that this term appears in a decomposition of the commonly used expected cross-entropy loss, along with three other individually meaningful terms.
Surprisingly, cross-entropy rewards large values of I(w : y ∣ x), which may promote memorization if labels contain information beyond what can be inferred from x. Such a result highlights that in addition to the network's representational capabilities, the loss function, or more generally the learning algorithm, plays an important role in memorization. To this end, we wish to study the utility of limiting I(w : y ∣ x), and how it can be used to modify training algorithms to reduce memorization.

Our main contributions towards this goal are as follows: 1) We show that low values of I(w : y ∣ x) correspond to reduced memorization of label-noise and lead to better generalization gap bounds. 2) We propose training methods that control memorization by regularizing label-noise information in weights. When the training algorithm is a variant of stochastic gradient descent, one can achieve this by controlling label-noise information in gradients. A promising way of doing this is through an additional network that tries to predict the classifier gradients without using label information. We experiment with two training procedures that incorporate gradient prediction in different ways: one which uses the auxiliary network to penalize the classifier, and another which uses the predicted gradients to train it. In both approaches, we employ a regularization that penalizes the L2 norm of predicted gradients to control their capacity. The latter approach can be viewed as a search over training algorithms, as it implicitly looks for a loss function that balances training performance with label memorization. 3) Finally, we show that the auxiliary network can be used to detect incorrect or misleading labels. To illustrate the effectiveness of the proposed approaches, we apply them to corrupted versions of MNIST, CIFAR-10, and CIFAR-100 with various label noise models, and to the Clothing1M dataset, which already contains noisy labels. We show that methods based on gradient prediction yield drastic improvements over standard training algorithms (like cross-entropy loss), and outperform competitive approaches designed for learning with noisy labels.

2 Label-Noise Information in Weights

We begin by formally introducing a measure of label-noise information in weights and discuss its connections to memorization and generalization. Throughout the paper we use several information-theoretic quantities, such as entropy, H(X); mutual information, I(X : Y); Kullback-Leibler divergence, KL(p ∥ q); and their conditional variants Cover and Thomas (2006).

Consider a setup in which a labeled dataset, S = (x, y), consisting of data x and categorical labels y, is generated from a distribution p(x, y). A training algorithm for learning the weights w of a fixed probabilistic classifier f(y ∣ x, w) can be denoted as a conditional distribution A(w ∣ S). Given any training algorithm A, its training performance can be measured using the expected cross-entropy:

$$H_{p,f}(\mathbf{y} \mid \mathbf{x}, w) = \mathbb{E}_{S}\,\mathbb{E}_{w \mid S}\left[\sum_{i=1}^{n} -\log f\big(y^{(i)} \mid x^{(i)}, w\big)\right].$$

Achille and Soatto (2018) present a decomposition of this expected cross-entropy, which reduces to the following when the data generating process is fixed (i.e., p(x, y) is constant):

$$H_{p,f}(\mathbf{y} \mid \mathbf{x}, w) = H(\mathbf{y} \mid \mathbf{x}) - I(w : \mathbf{y} \mid \mathbf{x}) + \mathbb{E}_{\mathbf{x}, w}\, \mathrm{KL}\big(p(\mathbf{y} \mid \mathbf{x}, w)\, \big\|\, f(\mathbf{y} \mid \mathbf{x}, w)\big). \qquad (1)$$

The problem of minimizing this expected cross-entropy is equivalent to selecting an appropriate training algorithm. If the labels contain information beyond what can be inferred from inputs (meaning non-zero H(y ∣ x)), such an algorithm may do well by memorizing the labels through the second term of (1). Indeed, minimizing the empirical cross-entropy loss, (1/n) Σ_{i=1}^{n} −log f(y^(i) ∣ x^(i), w), does exactly that Zhang et al. (2016).

2.1 Decreasing I(w:y∣x) Reduces Memorization

To demonstrate that I(w : y ∣ x) is directly linked to memorization, we prove that any algorithm with small I(w : y ∣ x) overfits less to label-noise in the training set.

Theorem 1.

Consider a dataset S = (x, y) of n i.i.d. samples, x = (x^(1), …, x^(n)) and y = (y^(1), …, y^(n)), where the domain of labels is a finite set, Y. Let A(w ∣ S) be any training algorithm, producing weights w for a possibly stochastic classifier f(y ∣ x, w). Let ŷ^(i) denote the prediction of the classifier on the i-th example and let e^(i) be a random variable corresponding to predicting y^(i) incorrectly. Then, the following inequality holds:

$$\mathbb{E}\left[\sum_{i=1}^{n} e^{(i)}\right] \ge \frac{H(\mathbf{y}\mid\mathbf{x}) - I(w:\mathbf{y}\mid\mathbf{x}) - n\,H\big(e^{(1)}\big)}{\log\big(|\mathcal{Y}|-1\big)}.$$

This result establishes a lower bound on the expected number of prediction errors on the training set, which increases as I(w : y ∣ x) decreases. For example, consider a corrupted version of the MNIST dataset where each label is changed with probability 0.8 to a uniformly random incorrect label. By the above bound, every algorithm for which I(w : y ∣ x) = 0 will make at least 80% prediction errors on the training set in expectation. In contrast, if the weights retain 1 bit of label-noise information per example, the classifier will make at least 40.5% errors in expectation. The proof of Thm. 1 and additional discussion on the dependence between error probability and I(w : y ∣ x) are presented in the supplementary material (Sec. A.1).
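The arithmetic behind these numbers can be checked with a short script. Under uniform corruption with probability p over k classes, H(y ∣ x) equals h(p) + p log₂(k − 1) bits per example (h being binary entropy), and, treating all examples symmetrically, Thm. 1 forces the per-example error rate r to satisfy r log₂(k − 1) + h(r) ≥ H(y ∣ x) − I(w : y ∣ x)/n. The sketch below (helper names are ours) bisects for the smallest feasible r:

```python
import math

def h(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def noise_entropy(p, k):
    """Per-example H(y|x) under uniform corruption with probability p, k classes."""
    return h(p) + p * math.log2(k - 1)

def min_error_rate(info_bits_per_example, p=0.8, k=10):
    """Smallest r with r*log2(k-1) + h(r) >= H(y|x) - info, per Thm. 1."""
    target = noise_entropy(p, k) - info_bits_per_example
    # r*log2(k-1) + h(r) is increasing on [0, 1 - 1/k], so bisection applies
    lo, hi = 0.0, 1.0 - 1.0 / k
    for _ in range(60):
        mid = (lo + hi) / 2
        if mid * math.log2(k - 1) + h(mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi

# zero label-noise information: at least 80% training errors
print(round(min_error_rate(0.0), 3))
# one bit per example: at least ~40.5% training errors
print(round(min_error_rate(1.0), 3))
```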

Thm. 1 provides theoretical guarantees that memorization of noisy labels is prevented when I(w : y ∣ x) is small, in contrast to standard regularization techniques, such as dropout, weight decay, and data augmentation, which only slow it down Zhang et al. (2016); Arpit et al. (2017). To demonstrate this empirically, we compare an algorithm that controls I(w : y ∣ x) (presented in Sec. 3) against these regularization techniques on the aforementioned corrupted MNIST setup. We see in Fig. 1 that explicitly preventing memorization of label-noise information leads to optimal training performance (20% training accuracy) and good generalization on a non-corrupted validation set. Other approaches quickly exceed 20% training accuracy by incorporating label-noise information, and generalize poorly as a consequence. The classifier here is a fully connected neural network with 4 hidden layers, each having 512 ReLU units. The rates of dropout and weight decay were selected according to the performance on a validation set.

2.2 Decreasing I(w:y∣x) Improves Generalization

The information that weights contain about a training dataset has previously been linked to generalization (Xu and Raginsky, 2017). The following bound relates the expected difference between train and test performance to the mutual information I(w : S).

Theorem 2.

Xu and Raginsky (2017) Suppose ℓ is a loss function such that ℓ(f(x̄), ȳ) is a σ-sub-Gaussian random variable for each w. Let S be the training set, A(w ∣ S) be the training algorithm, and (x̄, ȳ) be a test sample independent from S and w. Then the following holds:

$$\left|\,\mathbb{E}\left[\ell\big(f(\bar{x}),\bar{y}\big) - \frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x^{(i)}),y^{(i)}\big)\right]\right| \le \sqrt{\frac{2\sigma^2}{n}\, I(w:S)}. \qquad (2)$$

For good test performance, learning algorithms need to have both a small generalization gap and good training performance. The latter may require retaining more information about the training set, meaning there is a natural conflict between increasing training performance and decreasing the generalization gap bound of (2). Furthermore, the information in weights can be decomposed as follows: I(w : S) = I(w : x) + I(w : y ∣ x). We claim that one needs to prioritize reducing I(w : y ∣ x) over I(w : x) for the following reason. When noise is present in the training labels, fitting this noise implies a non-zero value of I(w : y ∣ x), which grows linearly with the number of samples n. In such cases, the generalization gap bound of (2) becomes a constant and does not improve as n increases. To get meaningful generalization bounds via (2) one needs to limit I(w : y ∣ x). We hypothesize that for efficient learning algorithms, this condition might also be sufficient.
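To spell out why linear growth makes the bound vacuous: if fitting the noise forces I(w : y ∣ x) ≥ cn for some constant c > 0, then the right-hand side of (2) is bounded below by a constant,

```latex
\sqrt{\frac{2\sigma^2}{n}\, I(w:S)}
\;\ge\; \sqrt{\frac{2\sigma^2}{n}\, I(w:\mathbf{y}\mid\mathbf{x})}
\;\ge\; \sqrt{\frac{2\sigma^2}{n}\, c\, n}
\;=\; \sqrt{2\sigma^2 c},
```

so the guarantee cannot shrink with more data unless I(w : y ∣ x) grows sublinearly in n.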

3 Methods Limiting Label Information

We now consider how to design training algorithms that control I(w : y ∣ x). We assume f(y ∣ x, w) = s(a(x, w)), with a(x, w) as the output of a neural network and s as the softmax function. We consider the case when f is trained with a variant of stochastic gradient descent for T iterations. The inputs and labels of a mini-batch at iteration t are denoted by x_t and y_t respectively, and are selected using a deterministic procedure (such as cycling through the dataset, or using pseudo-randomness). Let w_0 denote the weights after initialization, and w_t the weights after iteration t. Let ℓ be some classification loss function (e.g., cross-entropy loss) and g^ℓ_t = ∇_{w_{t−1}} ℓ(w_{t−1}; x_t, y_t) be the gradient at iteration t. Let g_t denote the gradients used to update the weights, possibly different from g^ℓ_t. Let the update rule be w_t = w_{t−1} − η_t g_t, and let w = w_T be the final weights (denoted with w for convenience).

To limit I(w : y ∣ x), the following sections discuss two approximations which relax the computational difficulty while still providing meaningful bounds: 1) first, we show that the information in weights can be replaced by information in the gradients; 2) we introduce a variational bound on the information in gradients. The bound employs an auxiliary network that predicts gradients of the original loss without label information. We then explore two ways of incorporating predicted gradients: (a) using them in a regularization term for gradients of the original loss, and (b) using them to train the classifier.

3.1 Penalizing Information in Gradients

Looking at (1), it is tempting to add I(w : y ∣ x) as a regularizer to the objective and minimize over all training algorithms:

$$\min_{A(w\mid S)}\; H_{p,f}(\mathbf{y}\mid\mathbf{x},w) + I(w:\mathbf{y}\mid\mathbf{x}). \qquad (3)$$

This is equivalent to minimizing the expected KL divergence term of (1), as H(y ∣ x) does not depend on the training algorithm. Unfortunately, the optimization problem of (3) is hard to solve for two major reasons. First, the optimization is over training algorithms (rather than over the weights of a classifier, as in the standard machine learning setup). Second, the penalty I(w : y ∣ x) is hard to compute or approximate.

To simplify the problem of (3), we relate information in weights to information in gradients as follows:

$$I(w:\mathbf{y}\mid\mathbf{x}) \le I(g_{1:T}:\mathbf{y}\mid\mathbf{x}) = \sum_{t=1}^{T} I\big(g_t:\mathbf{y}\mid\mathbf{x}, g_{<t}\big). \qquad (4)$$

Hereafter, we focus on constraining I(g_t : y ∣ x, g_{<t}) at each iteration. Our task becomes choosing a loss function ℓ such that I(g_t : y ∣ x, g_{<t}) is small and f is a good classifier. One key observation is that if our task is to minimize label-noise information in gradients, it may be helpful to consider gradients with respect to the last layer only and compute the remaining gradients using back-propagation. As these steps of back-propagation do not use labels, by the data processing inequality, subsequent gradients have at most as much label information as the last-layer gradient.

To simplify the information-theoretic quantities, we add a small independent Gaussian noise to the gradients of the original loss: g̃^ℓ_t = g^ℓ_t + ξ_t, where ξ_t ∼ N(0, σ²I) and σ is small. With this convention, we formulate the following regularized objective function:

$$L_{\text{reg}}(w; x_t, y_t) = L(w; x_t, y_t) + \lambda\, I\big(\tilde{g}^{\ell}_t : \mathbf{y} \mid \mathbf{x}, g_{<t}\big), \qquad (5)$$

where λ > 0 is a regularization coefficient. The regularization term I(g̃^ℓ_t : y ∣ x, g_{<t}) is a function of x and w_{t−1}. Computing this function would allow the optimization of (5) through gradient descent: g_t = ∇_{w_{t−1}} L_reg(w_{t−1}; x_t, y_t). Importantly, label-noise information is equal in both g_t and g̃^ℓ_t, as the additional gradient coming from the regularization term is constant given x and g_{<t}:

$$I\big(g_t : \mathbf{y} \mid \mathbf{x}, g_{<t}\big) = I\big(\tilde{g}^{\ell}_t : \mathbf{y} \mid \mathbf{x}, g_{<t}\big).$$

Therefore, by minimizing I(g̃^ℓ_t : y ∣ x, g_{<t}) in (5) we minimize I(g_t : y ∣ x, g_{<t}), which is used to upper bound I(w : y ∣ x) in (4). We rewrite this regularization in terms of entropy and discard the constant term, H(ξ_t):

$$I\big(\tilde{g}^{\ell}_t : \mathbf{y} \mid \mathbf{x}, g_{<t}\big) = H\big(\tilde{g}^{\ell}_t \mid \mathbf{x}, g_{<t}\big) - H(\xi_t). \qquad (6)$$

3.2 Variational Bounds on Gradient Information

The first term in (6) is still challenging to compute, as we typically have only one sample from the unknown distribution p(g̃^ℓ_t ∣ x, g_{<t}). Nevertheless, we can upper bound it with the cross-entropy H_{p,q}(g̃^ℓ_t ∣ x, g_{<t}), where q_ϕ(g̃^ℓ_t ∣ x, g_{<t}) is a variational approximation of p(g̃^ℓ_t ∣ x, g_{<t}):

$$H\big(\tilde{g}^{\ell}_t \mid \mathbf{x}, g_{<t}\big) \le \mathbb{E}\left[-\log q_{\phi}\big(\tilde{g}^{\ell}_t \mid \mathbf{x}, g_{<t}\big)\right].$$

This bound is valid when ϕ is a constant or a random variable that depends only on x. With this upper bound, (5) reduces to:

$$L_{\text{reg}}(w; x_t, y_t) = L(w; x_t, y_t) - \lambda\, \mathbb{E}\left[\log q_{\phi}\big(\tilde{g}^{\ell}_t \mid \mathbf{x}, g_{<t}\big)\right]. \qquad (7)$$

This formulation introduces a soft constraint on the classifier by attempting to make its gradients predictable without the labels y, effectively reducing I(g_t : y ∣ x, g_{<t}).

Assuming s(a_t) denotes the predicted class probabilities of the classifier and ℓ is the cross-entropy loss, the gradient with respect to the logits a_t is s(a_t) − y_t (assuming y_t has a one-hot encoding). Thus, g̃^ℓ_t = s(a_t) − y_t + ξ_t. Since s(a_t) can be computed without labels, the only part of this expression that q_ϕ needs to predict is y_t, which has no dependence on w; hence it would not serve as a meaningful regularizer. Instead, we descend an additional level and look at the gradients of the final layer parameters. When the final layer of a(x, w) is fully connected with inputs z_t and weights U (i.e., a_t = U z_t), the gradient with respect to its parameters is equal to (s(a_t) − y_t) z_t^⊤, whose label-dependent part is y_t z_t^⊤. There is now dependence on w through z_t, as this quantity can be reduced by setting the norm of z_t to a small value. We choose to parametrize q_ϕ as a Gaussian distribution with mean μ_ϕ(x_t, w_{t−1}) and fixed covariance, where μ_ϕ is computed with another neural network. Under this assumption, H_{p,q}(g̃^ℓ_t ∣ x, g_{<t}) becomes proportional to:

$$\mathbb{E}\left[\big\| y_t z_t^{\top} + \xi_t - \mu_{\phi}(x_t, w_{t-1}) \big\|_2^2\right] = \mathbb{E}\left[\|\xi_t\|_2^2\right] + \mathbb{E}\left[\big\| y_t z_t^{\top} - \mu_{\phi}(x_t, w_{t-1}) \big\|_2^2\right].$$

Ignoring constants and approximating the expectation above with one Monte Carlo sample computed using the label y_t, the objective of (7) becomes:

$$L_{\text{reg}}(w; x_t, y_t) = L(w; x_t, y_t) + \lambda \left[\, \|z_t\|_2^2\, \big\| y_t - s\big(r_{\phi}(x_t)\big) \big\|_2^2 \,\right]. \qquad (8)$$

Here s(r_ϕ(x_t)) denotes the softmax output of the auxiliary network. While this may work in principle, in practice the dependence on w is felt only through the norm of z_t, making the regularizer too weak to have much effect on the overall objective. We confirm this experimentally in Sec. 4. To introduce more complex dependencies on w, one would need to model the gradients of deeper layers.
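Both gradient identities used above are standard and easy to check numerically. The following self-contained NumPy sketch (toy shapes of our own choosing) confirms ∂ℓ/∂a = s(a) − y and ∂ℓ/∂U = (s(a) − y) z⊤ by central finite differences:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def ce(a, y_idx):
    # cross-entropy loss given logits a and the index of the true class
    return -np.log(softmax(a)[y_idx])

rng = np.random.default_rng(0)
k, d = 5, 7
U = rng.normal(size=(k, d))   # final fully connected layer
z = rng.normal(size=d)        # its input z_t
y_idx = 2
y = np.eye(k)[y_idx]
a = U @ z                     # logits a_t = U z_t

grad_a = softmax(a) - y       # claimed: dl/da = s(a) - y
grad_U = np.outer(grad_a, z)  # claimed: dl/dU = (s(a) - y) z^T

eps = 1e-6
num_a = np.array([(ce(a + eps * np.eye(k)[i], y_idx) -
                   ce(a - eps * np.eye(k)[i], y_idx)) / (2 * eps)
                  for i in range(k)])
num_U = np.zeros_like(U)
for i in range(k):
    for j in range(d):
        E = np.zeros_like(U); E[i, j] = eps
        num_U[i, j] = (ce((U + E) @ z, y_idx) - ce((U - E) @ z, y_idx)) / (2 * eps)

assert np.allclose(num_a, grad_a, atol=1e-5)
assert np.allclose(num_U, grad_U, atol=1e-5)
```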

3.3 Predicting Gradients without Label Information

An alternative approach is to use the gradients predicted by q_ϕ to update the classifier weights, i.e., to sample g_t ∼ q_ϕ(g_t ∣ x, g_{<t}). This is a much stricter condition, as it implies I(g_t : y ∣ x, g_{<t}) = 0 (again assuming ϕ is a constant or a random variable that depends only on x). Note that minimizing E[−log q_ϕ(g̃^ℓ_t ∣ x, g_{<t})] makes the predicted gradient a good substitute for the cross-entropy gradient g^ℓ_t. Therefore, we write down the following objective function:

$$\min_{\phi}\; L'(w_{t-1}, \phi; x_t) - \lambda\, \mathbb{E}_{\tilde{g}^{\ell}_t}\left[\log q_{\phi}\big(\tilde{g}^{\ell}_t \mid \mathbf{x}, g_{<t}\big)\right], \qquad (9)$$

where L′(w_{t−1}, ϕ; x_t) is some probabilistic function accessed implicitly, such that ∇_{w_{t−1}} L′(w_{t−1}, ϕ; x_t) = g_t ∼ q_ϕ(g_t ∣ x, g_{<t}). We found that this approach performs significantly better than the penalizing approach of (7).

We choose to predict the gradients with respect to the logits only and compute the remaining gradients using backpropagation. We consider two distinct parameterizations for q_ϕ, Gaussian: a normal distribution with mean μ_ϕ(x_t, w_{t−1}) and fixed isotropic covariance, and Laplace: a product of univariate Laplace distributions with the same mean and fixed scale, with μ_ϕ being an auxiliary neural network as before. Under these Gaussian and Laplace parameterizations, E[−log q_ϕ(g̃^ℓ_t ∣ x, g_{<t})] becomes proportional to ∥g̃^ℓ_t − μ_ϕ(x_t, w_{t−1})∥₂² and ∥g̃^ℓ_t − μ_ϕ(x_t, w_{t−1})∥₁ respectively. In the Gaussian case μ_ϕ is updated with a mean square error (MSE) loss, while in the Laplace case it is updated with a mean absolute error (MAE) loss. The former is expected to be faster at learning, but less robust to noise Ghosh et al. (2017).
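For concreteness, the Gaussian case unpacks (up to additive constants) as

```latex
-\log q_{\phi}\big(\tilde{g}^{\ell}_t \mid \mathbf{x}, g_{<t}\big)
= \frac{1}{2\sigma_q^2}\,\big\|\tilde{g}^{\ell}_t - \mu_{\phi}(x_t, w_{t-1})\big\|_2^2
+ \frac{k}{2}\log\big(2\pi\sigma_q^2\big),
```

where k is the dimension of the gradient. Maximizing the variational likelihood of the noisy cross-entropy gradients is therefore exactly an MSE regression of μ_ϕ onto g̃^ℓ_t; replacing the Gaussian density with a Laplace density replaces the squared L2 norm with an L1 norm, giving MAE.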

In both approaches of (7) and (9), the classifier can still overfit if q_ϕ overfits. There are multiple ways to prevent this. One can choose μ_ϕ to be a small network, or pre-train and freeze some of its layers in an unsupervised fashion. In this work, we choose to control the L2 norm of the mean of the predicted gradients, ∥μ_ϕ(x_t, w_{t−1})∥₂, while keeping the variance fixed. This can be viewed as limiting the capacity of the gradients g_t.

Proposition 1.

If g_t = μ_ϕ(x_t, w_{t−1}) + ϵ_t, where ϵ_t ∼ N(0, σ_q² I_k) is independent noise and E[∥μ_ϕ(x_t, w_{t−1})∥₂²] ≤ L², then the following inequality holds:

$$I\big(g_t : \mathbf{y} \mid \mathbf{x}, g_{<t}\big) \le \frac{k}{2}\log\left(1 + \frac{L^2}{k\,\sigma_q^2}\right).$$

The proof is provided in the supplementary section A.2. The same bound holds when ϵ_t is sampled from a product of univariate zero-mean Laplace distributions with variance σ_q², since the proof relies only on ϵ_t being zero-mean and having variance σ_q². The final objective of our main method becomes:

$$\min_{\phi}\; L'(w, \phi; x_t) - \lambda\, \mathbb{E}_{\tilde{g}^{\ell}_t}\left[\log q_{\phi}\big(\tilde{g}^{\ell}_t \mid \mathbf{x}, g_{<t}\big)\right] + \beta\, \big\|\mu_{\phi}(x_t, w_{t-1})\big\|_2^2. \qquad (10)$$

We name this approach LIMIT: limiting label information memorization in training. We denote the variants with Gaussian and Laplace distributions as LIMIT_G and LIMIT_L respectively. The pseudocode of LIMIT is presented in the supplementary material (Alg. 1). Note that in contrast to the previous approach of (5), this follows the spirit of (3), in the sense that the optimization over ϕ can be seen as an optimization over training algorithms; namely, learning a loss function implicitly through its gradients. With this interpretation, the gradient-norm penalty can be viewed as a way to smooth the learned loss, which is a good inductive bias and facilitates learning.
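As a heavily simplified, runnable illustration of the main idea, the sketch below trains a linear softmax classifier using gradients produced by an equally simple linear auxiliary model on synthetic data. Following the "-S" variant, the predicted mean is used without sampling; here we parameterize the mean of the logit gradient as s(a_t) − s(r_ϕ(x_t)), so the auxiliary model is fit to the (noisy) labels with MSE while the classifier itself never touches them. All names, sizes, and the choice of linear models are our own toy assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# synthetic 3-class problem; 40% of training labels replaced at random
n, d, k = 600, 10, 3
W_true = rng.normal(size=(d, k))
X = rng.normal(size=(n, d))
y_clean = (X @ W_true).argmax(axis=1)
y_train = y_clean.copy()
flip = rng.random(n) < 0.4
y_train[flip] = rng.integers(0, k, size=int(flip.sum()))
Y = np.eye(k)[y_train]

W = np.zeros((d, k))   # classifier weights (linear softmax model)
V = np.zeros((d, k))   # auxiliary "network" q_phi, also linear here
lr = 0.5

for step in range(2000):
    S = softmax(X @ W)   # classifier probabilities s(a_t)
    R = softmax(X @ V)   # auxiliary prediction s(r_phi(x_t))
    # classifier step: predicted gradient w.r.t. logits, mu = s(a_t) - s(r_phi);
    # no label information flows into W
    W -= lr * X.T @ (S - R) / n
    # auxiliary step: MSE between its softmax output and the noisy labels
    # (gradient of ||R - Y||^2 pushed through the softmax Jacobian)
    T = (R - Y) * R
    grad_logits = 2 * (T - R * T.sum(axis=1, keepdims=True))
    V -= lr * X.T @ grad_logits / n

acc = ((X @ W).argmax(axis=1) == y_clean).mean()
```

Because the auxiliary model can only regress toward the average label given the input, it tends toward the clean class, and the classifier inherits that behavior instead of memorizing the flipped labels.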

4 Experiments

We set up experiments with noisy datasets to see how well the proposed methods perform for different types and amounts of label noise. The simplest baselines in our comparison are the standard cross-entropy (CE) and mean absolute error (MAE) loss functions. The next baseline is the forward correction approach (FW) proposed by Patrini et al. (2017), where the label-noise transition matrix is estimated and used to correct the loss function. Finally, we include the recently proposed determinant mutual information (DMI) loss, which is the log-determinant of the confusion matrix between predicted and given labels Xu et al. (2019). Both the FW and DMI baselines require initialization with the best result of the CE baseline. To avoid small experimental differences, we implement all baselines, closely following the original implementations of FW and DMI. We train all baselines except DMI using the ADAM optimizer Kingma and Ba (2014). As DMI is very sensitive to the learning rate, we tune it by choosing the best value from a grid. For all baselines, model selection is done by choosing the model with the highest accuracy on a validation set that follows the noise model of the corresponding training set. All scores are reported on a clean test set. Additional experimental details, including the hyperparameter grids, are presented in supplementary section B. The implementation of the proposed method and the code for replicating the experiments is available at https://github.com/hrayrhar/limit-label-memorization.

4.1 MNIST with Uniform Label Corruption

To compare the variants of our approach discussed earlier and see which ones work well, we run experiments on the MNIST dataset with corrupted labels. In this experiment, we use a simple uniform label-noise model, where each label is set to an incorrect value uniformly at random with probability p. In our experiments we try 4 values of p: 0%, 50%, 80%, and 89%. We split the 60K images of MNIST into training and validation sets, containing 48K and 12K samples respectively. For each noise amount we try 3 different training set sizes. All classifiers and auxiliary networks are 4-layer CNNs, with a shared architecture presented in the supplementary (Sec. B). For this experiment we include two additional baselines where additive noise (Gaussian or Laplace) is added to the gradients with respect to the logits. We denote these baselines "CE + GN" and "CE + LN". The comparison with these two baselines demonstrates that the proposed method does more than simply reduce information in gradients via noise. We also consider a variant of LIMIT where, instead of sampling from q_ϕ, we use the predicted mean μ_ϕ(x_t, w_{t−1}).
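The uniform corruption model is a few lines of code; a sketch (function name ours), which flips each label with probability p to a uniformly random *incorrect* class:

```python
import numpy as np

def corrupt_uniform(labels, p, num_classes, seed=0):
    """With probability p, replace each label by a uniformly random incorrect one."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip = rng.random(len(labels)) < p
    # adding an offset in {1, ..., num_classes-1} modulo num_classes never
    # reproduces the original label, so flipped labels are always incorrect
    offsets = rng.integers(1, num_classes, size=int(flip.sum()))
    labels[flip] = (labels[flip] + offsets) % num_classes
    return labels

y = np.zeros(10000, dtype=int)
y_noisy = corrupt_uniform(y, 0.8, 10)
changed = (y_noisy != y).mean()  # close to 0.8
```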

Table 1 shows the test performance of the different approaches averaged over 5 training/validation splits. Standard deviations and additional combinations of noise amount and training set size are presented in the supplementary (see Tables 4 and 5). Additionally, Fig. 2 shows the training and testing performance of the best methods during training when p = 0.8 and all training samples are used. Overall, variants of LIMIT produce the best results and improve significantly over standard approaches. The variants with a Laplace distribution perform better than those with a Gaussian distribution, likely due to the robustness of MAE. Interestingly, LIMIT works well and trains faster when the sampling of g_t from q_ϕ is disabled (rows with "-S"). Thus, hereafter we consider this as our primary approach. As expected, the soft regularization approach of (7) and the cross-entropy variants with noisy gradients perform significantly worse than LIMIT, and we exclude these baselines from further experiments. Additionally, we tested the importance of penalizing the norm of the predicted gradients by comparing the training and testing performance of LIMIT with varying regularization strength in the supplementary (Fig. A1). We found that this penalty is essential for preventing memorization.

In our approach, the auxiliary network should not be able to distinguish correct and incorrect samples, unless it overfits. Indeed, we found that it learns to predict "correct" gradients on examples with incorrect labels (Sec. C). Motivated by this, we use the distance between the predicted and the cross-entropy gradients to detect samples with incorrect or confusing labels (Fig. 3). When using this distance as a score for classifying the correctness of a label, we get a 99.87% ROC AUC score (Sec. C).
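To make this detection scheme concrete, here is a self-contained simulation: the labels are synthetic, the auxiliary prediction is a hand-built stand-in concentrated on the clean class (which is what it tends to learn, since uniform noise is unpredictable from the input), and the `roc_auc` helper is our own. The paper's actual score is the distance between the predicted and the cross-entropy gradients:

```python
import numpy as np

def roc_auc(scores, is_noisy):
    """ROC AUC via the Mann-Whitney statistic: P(noisy score > clean score)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = np.asarray(is_noisy, dtype=bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
n, k = 1000, 10
y_clean = rng.integers(0, k, n)
is_noisy = rng.random(n) < 0.2
y_given = np.where(is_noisy, (y_clean + rng.integers(1, k, n)) % k, y_clean)

# stand-in for the auxiliary network: probabilities peaked on the clean class
probs = np.full((n, k), 0.05 / (k - 1))
probs[np.arange(n), y_clean] = 0.95
onehot = np.eye(k)[y_given]

# score: distance between the given label and the auxiliary prediction,
# a proxy for the predicted-vs-cross-entropy gradient distance
scores = np.linalg.norm(onehot - probs, axis=1)
auc = roc_auc(scores, is_noisy)  # near-perfect separation in this toy setup
```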

4.2 CIFAR with Uniform and Pair Noise

Next we consider a harder dataset, CIFAR-10 Krizhevsky and Hinton (2009), with two label noise models: uniform noise and pair noise. For pair noise, certain classes are confused with another similar class. Following the setup of Xu et al. (2019) we use the following four pairs: truck → automobile, bird → airplane, deer → horse, cat → dog. Note that H(y ∣ x) is much smaller for this type of noise than for uniform noise. We split the 50K images of CIFAR-10 into training and validation sets, containing 40K and 10K samples respectively. For the CIFAR experiments we use ResNet-34 networks He et al. (2016) with standard data augmentation, consisting of random horizontal flips and random 28x28 crops padded back to 32x32. For our proposed methods, the auxiliary network is a ResNet-34 as well. We noticed that for more difficult datasets it may happen that, while q_ϕ still learns to produce good gradients, the updates with these less informative gradients corrupt the initialization of the classifier. For this reason, we add an additional variant of LIMIT which initializes the classifier network with the best CE baseline, similar to the DMI and FW baselines.
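A pair-noise corruptor is also only a few lines. The class indices below are the standard CIFAR-10 ones, assumed here for illustration (airplane=0, automobile=1, bird=2, cat=3, deer=4, dog=5, frog=6, horse=7, ship=8, truck=9), and the function name is ours:

```python
import numpy as np

# assumed standard CIFAR-10 indices for the four pairs used in the text:
# truck -> automobile, bird -> airplane, deer -> horse, cat -> dog
PAIRS = {9: 1, 2: 0, 4: 7, 3: 5}

def corrupt_pairs(labels, p, pairs=PAIRS, seed=0):
    """Flip each label of a source class to its paired class with probability p."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    for src, dst in pairs.items():
        mask = (labels == src) & (rng.random(len(labels)) < p)
        labels[mask] = dst
    return labels

y = np.repeat(np.arange(10), 100)
y_pair = corrupt_pairs(y, 0.4)
```

Unlike uniform noise, only four source classes are affected, which is why the label entropy given the input is much lower under this model.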

Table 2 presents the results on CIFAR-10. Again, variants of LIMIT improve significantly over standard baselines, especially in the case of uniform label noise. As expected, the results are better when the classifier is initialized with the best CE model (similar to FW and DMI). As in the case of MNIST, our approach helps even when the dataset is noiseless.

CIFAR-100.  To test the proposed methods on a classification task with many classes, we apply them to CIFAR-100 with 40% uniform noise. We use the same networks as for CIFAR-10. The results presented in Table 2 indicate several interesting phenomena. First, training with the MAE loss fails, which has been observed in other works as well Zhang and Sabuncu (2018). The gradient of MAE with respect to the logits is 2 s(a)_y (s(a) − y). When the predicted probability of the correct class, s(a)_y, is small, there is little signal to fix the mistake. In fact, in the case of CIFAR-100, s(a)_y is approximately 0.01 at the beginning, slowing down the training. The performance of FW degrades as the approximation errors of the noise transition matrix become large. DMI does not give a significant improvement over CE, due to numerical issues with computing the determinant of a 100x100 confusion matrix. LIMIT_L performs worse than the other variants, as training with MAE becomes challenging. However, its performance improves when it is initialized with the CE model. LIMIT_G does not suffer from the mentioned problem and works with or without initialization.
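The vanishing-signal claim can be checked directly. With MAE written as ∥s(a) − y∥₁ = 2(1 − s(a)_y), its gradient with respect to the logits is 2 s(a)_y (s(a) − y), so at a uniform initialization over k = 100 classes the MAE update is exactly 2 s(a)_y = 0.02 times the cross-entropy one. A quick numerical check (our own toy setup):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

k = 100
a = np.zeros(k)            # logits at a uniform initialization
y_idx = 0
y = np.eye(k)[y_idx]
s = softmax(a)             # s[y_idx] == 1/k == 0.01

grad_ce = s - y                    # cross-entropy gradient w.r.t. logits
grad_mae = 2 * s[y_idx] * (s - y)  # MAE gradient w.r.t. logits: 2 s_y (s - y)

# sanity check against finite differences of the MAE loss 2*(1 - s_y)
eps = 1e-6
num = np.array([
    (2 * (1 - softmax(a + eps * np.eye(k)[j])[y_idx]) -
     2 * (1 - softmax(a - eps * np.eye(k)[j])[y_idx])) / (2 * eps)
    for j in range(k)])
assert np.allclose(num, grad_mae, atol=1e-8)

ratio = np.abs(grad_mae).sum() / np.abs(grad_ce).sum()  # == 2 * s[y_idx] == 0.02
```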

4.3 Clothing1M

Finally, in our last experiment, we consider the Clothing1M dataset Xiao et al. (2015), which has 1M images labeled with one of 14 possible clothing labels. The dataset has very noisy training labels, with roughly 40% of examples incorrectly labeled. More importantly, the label noise in this dataset is realistic and instance dependent. For this dataset we use ResNet-50 networks and employ standard data augmentation, consisting of random horizontal flips and random crops of size 224x224 after resizing images to size 256x256. The results shown in the last column of Table 2 demonstrate that DMI and LIMIT with initialization perform the best, producing similar results.

5 Related Work

Our approach is related to many works that study memorization and learning with noisy labels. Our work also builds on theoretical results studying how generalization relates to information in neural network weights. In this section we present the related work and discuss the connections.

5.1 Learning with Noisy Labels

Learning with noisy labels is a longstanding problem and has been studied extensively Frenay and Verleysen (2014). Many works have studied and proposed loss functions that are robust to label noise. Natarajan et al. (2013) propose robust loss functions for binary classification with label-dependent noise. Ghosh et al. (2017) generalize this result to the multiclass classification problem and show that the mean absolute error (MAE) loss function is tolerant to label-dependent noise. Zhang and Sabuncu (2018) propose a new loss function, called generalized cross-entropy (GCE), that interpolates between MAE and CE with a single parameter. Xu et al. (2019) propose a new loss function (DMI), which is equal to the log-determinant of the confusion matrix between predicted and given labels, and show that it is robust to label-dependent noise. These loss functions are robust in the sense that, in the regime of infinite data, the best performing hypotheses on clean data and on noisy data coincide. When training on finite datasets, training with these loss functions may still result in memorization of training labels.

Another line of research seeks to estimate the label-noise and correct the loss function accordingly Sukhbaatar et al. (2014); Tong Xiao et al. (2015); Goldberger and Ben-Reuven (2017); Patrini et al. (2017); Hendrycks et al. (2018); Yao et al. (2019). Some works use meta-learning to treat the problem of noisy/incomplete labels as a decision problem in which one determines the reliability of a sample Jiang et al. (2017); Ren et al. (2018); Shu et al. (2019). Others seek to detect incorrect examples and relabel them Reed et al. (2014); Tanaka et al. (2018); Ma et al. (2018); Han et al. (2019); Arazo et al. (2019). Han et al. (2018); Yu et al. (2019) employ an approach where two networks select training examples for each other using the small-loss trick. While our approach also has a teaching component, the auxiliary network uses all samples instead of filtering. Li et al. (2019) propose a meta-learning approach that optimizes a classification loss along with a consistency loss between the predictions of a mean teacher and the predictions of the model after a single gradient descent step on a synthetically labeled mini-batch.

Some approaches assume particular label-noise models, while our approach assumes only that H(y ∣ x) > 0, which may happen because of any type of label noise or attribute noise (e.g., corrupted images or partially observed inputs). Additionally, the techniques used to derive our approach can be adopted for regression or multilabel classification tasks. Furthermore, some methods require access to a small clean validation set, which is not required in our approach.

5.2 Information in Weights and Generalization

Defining and quantifying information in neural network weights is an open challenge and has been studied by multiple authors. One approach is to relate information in weights to their description length. A simple way of measuring description length was proposed by Hinton and van Camp (1993) and reduces to the L2 norm of the weights. Another way to measure it is through the intrinsic dimension of the objective landscape Li et al. (2018); Blier and Ollivier (2018). Li et al. (2018) observed that the description length of neural network weights grows when they are trained with noisy labels, indicating memorization of labels.

Achille and Soatto (2018) define information in weights as the KL divergence from the posterior of weights to the prior. In a subsequent study, they provide generalization bounds involving this KL divergence term Achille and Soatto (2019). Similar bounds were derived in the PAC-Bayesian setup and have been shown to be non-vacuous Dziugaite and Roy (2017). With an appropriate selection of prior on weights, the above KL divergence becomes the Shannon mutual information between the weights and the training dataset, I(w : S). Xu and Raginsky (2017) derive generalization bounds that involve this latter quantity. Pensia et al. (2018) upper bound I(w : S) when the training algorithm consists of iterative noisy updates. They use the chain rule of mutual information, as we did in (4), and bound the information in each update by adding independent noise. It has been observed that adding noise to gradients can help to improve generalization in certain cases Neelakantan et al. (2015). Another approach restricts information in gradients by clipping them Menon et al. (2020).

Achille and Soatto (2018) also introduce the term I(w : y ∣ x) and show the decomposition of the cross-entropy described in (1). In a recent work, Yin et al. (2020) consider a similar term in the context of meta-learning and use it as a regularizer to prevent memorization of meta-testing labels. Given a meta-learning dataset, they consider the information in the meta-weights about the labels of meta-testing tasks given the inputs of those tasks. They bound this information with a variational upper bound, using multivariate Gaussian distributions for both the conditional distribution and its variational approximation. For isotropic Gaussians with equal covariances, the KL divergence reduces to a simple L2 distance, which was studied by Hu et al. (2020) as a regularizer for achieving robustness to label noise. Note that this bounds not only the information in weights about the labels but also the information about the inputs. In contrast, we bound only I(w : y ∣ x) and work with information in gradients.

6 Conclusion and Future Work

Several recent theoretical works have highlighted the importance of the information about the training data that is memorized in the weights. We distinguished two components of it and demonstrated that the conditional mutual information of weights and labels given inputs is closely related to memorization of labels and generalization performance. By bounding this quantity in terms of information in gradients, we derived the first practical schemes for controlling label-noise information in the weights and demonstrated that they outperform existing approaches for learning with noisy labels. In the future, we plan to explore ways of improving the bound of (4) and to design better bottlenecks for the gradients. Additionally, we aim to extend the presented ideas to reducing instance memorization.

Supplementary material: Improving Generalization by Controlling Label-Noise Information in Neural Network Weights

Appendix A Proofs

This section presents the proofs and some remarks that were not included in the main text due to space constraints.

A.1 Proof of Thm. 1

Theorem 1.

(Thm. 1 restated) Consider a dataset {(x(i), y(i))}, i = 1, …, n, of n i.i.d. samples, with x = (x(1), …, x(n)) and y = (y(1), …, y(n)), where the domain of labels is a finite set Y with |Y| ≥ 3. Let A be any training algorithm, producing weights w for a possibly stochastic classifier. Let ŷ(i) denote the prediction of the classifier on the i-th example and let e(i) = 1{ŷ(i) ≠ y(i)} be a random variable corresponding to predicting y(i) incorrectly. Then, the following holds:

$$\mathbb{E}\left[\sum_{i=1}^{n} e^{(i)}\right] \ge \frac{H(y \mid x) - I(w : y \mid x) - n H(e^{(1)})}{\log(|\mathcal{Y}| - 1)}.$$
Proof.

For each example i we consider the following Markov chain:

$$y^{(i)} \rightarrow \begin{bmatrix} x \\ y \end{bmatrix} \rightarrow \begin{bmatrix} x^{(i)} \\ w \end{bmatrix} \rightarrow \hat{y}^{(i)}.$$

In this setup Fano’s inequality gives a lower bound for the error probability:

$$H(e^{(i)}) + P(e^{(i)} = 1)\log(|\mathcal{Y}| - 1) \ge H(y^{(i)} \mid x^{(i)}, w), \qquad (10)$$

which can be written as:

$$P(e^{(i)} = 1) \ge \frac{H(y^{(i)} \mid x^{(i)}, w) - H(e^{(i)})}{\log(|\mathcal{Y}| - 1)}.$$

Summing this inequality for i = 1, …, n, we get

$$\begin{aligned} \sum_{i=1}^{n} P(e^{(i)} = 1) &\ge \frac{\sum_{i=1}^{n}\left(H(y^{(i)} \mid x^{(i)}, w) - H(e^{(i)})\right)}{\log(|\mathcal{Y}| - 1)} \\ &\ge \frac{\sum_{i=1}^{n}\left(H(y^{(i)} \mid x, w) - H(e^{(i)})\right)}{\log(|\mathcal{Y}| - 1)} \\ &\ge \frac{H(y \mid x, w) - n H(e^{(1)})}{\log(|\mathcal{Y}| - 1)}. \end{aligned}$$

The correctness of the last step follows from the fact that the e(i) are identically distributed and that total correlation is always non-negative Cover and Thomas (2006):

$$\sum_{i=1}^{n} H(y^{(i)} \mid x, w) - H(y \mid x, w) = \mathrm{TC}(y \mid x, w) \ge 0.$$

Finally, using the fact that H(y ∣ x, w) = H(y ∣ x) − I(w : y ∣ x), we obtain the desired result:

$$\mathbb{E}\left[\sum_{i=1}^{n} e^{(i)}\right] \ge \frac{H(y \mid x) - I(w : y \mid x) - n H(e^{(1)})}{\log(|\mathcal{Y}| - 1)}. \qquad (11)$$

Remark 1.  If we let k = |Y| and let r = (1/n) E[∑ e(i)] denote the expected training error rate, then we can rewrite (11) as follows:

$$r \ge \frac{H(y^{(1)} \mid x^{(1)}) - I(w : y \mid x)/n - H(r)}{\log(k - 1)}. \qquad (12)$$

Solving this inequality for r is challenging. One can simplify the right-hand side by bounding H(r) ≤ 1, assuming that entropies are measured in bits. However, this will loosen the bound. Alternatively, we can find the smallest r for which (12) holds and claim that the expected training error rate is at least that value.

Remark 2.  If k = 2, then log(k − 1) = 0, and Fano's inequality (10) reduces to H(e(i)) ≥ H(y(i) ∣ x(i), w); averaging over i and using the concavity of the binary entropy leads to:

$$H(r) \ge H(y^{(1)} \mid x^{(1)}) - I(w : y \mid x)/n.$$

Remark 3.  When we have uniform label noise where a label is incorrect with probability p (p ≤ 1/k) and I(w : y ∣ x) = 0, the bound of (12) is tight, i.e., it implies that r ≥ p. To see this, we note that H(y(1) ∣ x(1)) = H(p) + p log(k − 1), putting which in (12) gives us:

$$r \ge \frac{H(p) + p\log(k - 1) - H(r)}{\log(k - 1)} = p + \frac{H(p) - H(r)}{\log(k - 1)}. \qquad (13)$$

Therefore, when r = p the inequality holds with equality, implying that the smallest feasible r is at most p. To show that it equals p, we need to show that for any r < p, the inequality above does not hold. Let r be an arbitrary number from [0, p); then we can continue (13) as follows:

$$\begin{aligned} r &\ge p + \frac{H(p) - H(r)}{\log(k - 1)} \\ &\ge p + \frac{H(p) - \left(H(p) + (r - p)H'(p)\right)}{\log(k - 1)} && \text{(as } H(x) \text{ is concave)} \\ &\ge p + \frac{-(r - p)\log(k - 1)}{\log(k - 1)} && \text{(as } \log(k - 1) \le H'(p) \text{ for such } p\text{)} \\ &= 2p - r. \end{aligned}$$

This implies that r ≥ p, which forms a contradiction with r < p.

When I(w : y ∣ x) > 0, we can find the smallest such r by a numerical method. Fig. A0 plots the resulting lower bound on r as a function of I(w : y ∣ x)/n when the label noise is uniform. When the label noise is not uniform, the bound of (12) becomes loose, as Fano's inequality becomes loose. We leave the problem of deriving better lower bounds in such cases for future work.
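The numerical procedure mentioned above can be sketched as follows, assuming uniform label noise with flip probability p, entropies measured in bits, and H(y(1) ∣ x(1)) = H(p) + p log(k − 1) as in Remark 3; the helper names and the grid search are our own, not from the paper:

```python
import numpy as np

def binary_entropy(r):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    r = np.clip(r, 1e-12, 1 - 1e-12)
    return -r * np.log2(r) - (1 - r) * np.log2(1 - r)

def error_lower_bound(p, k, info_per_sample=0.0, grid=100001):
    """Smallest r in [0, 1] satisfying (12) under uniform label noise.

    p:               probability that a label is incorrect
    k:               number of classes (k >= 3)
    info_per_sample: I(w : y | x) / n, measured in bits
    """
    h_y_given_x = binary_entropy(p) + p * np.log2(k - 1)
    r = np.linspace(0.0, 1.0, grid)
    rhs = (h_y_given_x - info_per_sample - binary_entropy(r)) / np.log2(k - 1)
    feasible = r >= rhs
    return float(r[feasible][0]) if feasible.any() else 1.0
```

For example, `error_lower_bound(0.4, 10)` returns approximately 0.4, while increasing `info_per_sample` lowers the bound towards zero.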

A.2 Proof of Prop. 1

Proposition 1.

(Prop. 1 restated) If g_t = μ_t + ε_t, where ε_t ∼ N(0, σ_q² I_d) is an independent noise and E[μ_tᵀ μ_t] ≤ L², then the following inequality holds:

$$I(g_t : y \mid x, g_{<t}) \le \frac{d}{2}\log\left(1 + \frac{L^2}{d\sigma_q^2}\right).$$
Proof.

Given that μ_t and ε_t are independent and ε_t has zero mean (so the cross terms vanish), let us bound the expected squared L2 norm of g_t:

$$\mathbb{E}\left[g_t^T g_t\right] = \mathbb{E}\left[(\epsilon_t + \mu_t)^T(\epsilon_t + \mu_t)\right] = \mathbb{E}\left[\epsilon_t^T \epsilon_t\right] + \mathbb{E}\left[\mu_t^T \mu_t\right] \le d\sigma_q^2 + L^2.$$

Among all d-dimensional random variables g with E[gᵀg] ≤ dσ_q² + L², the zero-mean Gaussian with covariance ((dσ_q² + L²)/d) I_d has the largest entropy, given by (d/2) log(2πe (dσ_q² + L²)/d). Therefore,

$$H(g_t) \le \frac{d}{2}\log\left(\frac{2\pi e\,(d\sigma_q^2 + L^2)}{d}\right).$$

With this we can upper bound the mutual information as follows:

$$\begin{aligned} I(g_t : y \mid x, g_{<t}) &= H(g_t \mid x, g_{<t}) - H(g_t \mid y, x, g_{<t}) \\ &\le H(g_t) - H(\epsilon_t) \\ &\le \frac{d}{2}\log\left(\frac{2\pi e\,(d\sigma_q^2 + L^2)}{d}\right) - \frac{d}{2}\log\left(2\pi e \sigma_q^2\right) \\ &= \frac{d}{2}\log\left(1 + \frac{L^2}{d\sigma_q^2}\right). \end{aligned} \qquad (14)$$

Note that the proof will work for an arbitrary ε_t that has zero mean and independent components, where the second moment of each component is bounded by σ_q². This holds because in such cases H(g_t) ≤ (d/2) log(2πe (dσ_q² + L²)/d) (as Gaussians have the highest entropy for a fixed L2 norm) and the transition of (14) remains correct. Therefore, the same result holds when ε_t is sampled from a product of univariate zero-mean Laplace distributions with scale parameter σ_q/√2 (which makes the second moment equal to σ_q²).
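Assuming the proposition's final bound takes the form (d/2) log(1 + L²/(dσ_q²)) — obtained by subtracting the Gaussian noise entropy from the maximum-entropy bound above — it can be evaluated numerically; the helper below is our own sketch, with entropies in nats:

```python
import numpy as np

def gradient_info_bound(d, L, sigma_q):
    """Upper bound (in nats) on I(g_t : y | x, g_{<t}):
    (d / 2) * log(1 + L**2 / (d * sigma_q**2)).

    Since log(1 + x) <= x, the bound never exceeds L**2 / (2 * sigma_q**2),
    regardless of the dimension d.
    """
    return 0.5 * d * np.log(1.0 + L**2 / (d * sigma_q**2))
```

Increasing the noise scale σ_q drives the bound towards zero, at the cost of noisier updates.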

A similar result has been derived by Pensia et al. (2018) (Lemma 5) to bound the information in noisy iterative updates.

Appendix B Experimental Details

In this section we describe the details of experiments and implementations.

Classifier architectures.

The architecture of the classifiers used in the MNIST experiments is presented in Table 3. The ResNet-34 used in the CIFAR-10 and CIFAR-100 experiments differs from the standard ResNet-34 architecture (which is designed for larger images) in two ways: (a) the first convolutional layer has 3x3 kernels and stride 1, and (b) the max pooling layer after it is skipped. The architecture of the ResNet-50 used in the Clothing1M experiment follows the original He et al. (2016).

Hyperparameter search.

The CE, MAE, and FW baselines have no hyperparameters. For DMI, we tuned the learning rate by selecting the best value from a predefined list. The soft regularization approach of (8) has two hyperparameters, both selected by grid search. The objective of LIMIT instances has two terms; consequently, we need only one hyperparameter instead of two: we fix the weight of one term and select the weight of the other by grid search. When sampling is enabled, we additionally select the noise scale by grid search. In the MNIST and CIFAR experiments, we trained all models for up to 400 epochs and terminated training early when the best validation accuracy had not improved during the last 100 epochs. All models for Clothing1M were trained for 30 epochs.
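The early-stopping criterion can be sketched as follows (the implementation and helper name are ours; the text only states the rule):

```python
def should_stop(val_accuracies, patience=100):
    """Stop once the best validation accuracy was achieved more than
    `patience` epochs ago.

    val_accuracies: list of validation accuracies, one per finished epoch.
    """
    if len(val_accuracies) <= patience:
        return False
    # index of the first epoch achieving the best accuracy so far
    best_epoch = max(range(len(val_accuracies)), key=val_accuracies.__getitem__)
    return best_epoch < len(val_accuracies) - patience
```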

Appendix C Additional Results

Effectiveness of gradient norm penalty.

In the main text we discussed that the proposed approach may overfit if the gradient predictor overfits, and proposed penalizing the L2 norm of predicted gradients as a simple remedy for this issue. To demonstrate the effectiveness of this regularization, we present the training and testing accuracy curves of LIMIT with varying values of the penalty coefficient in Fig. A1. We see that increasing this coefficient decreases overfitting on the training set and usually results in better generalization.
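A minimal sketch of the penalized auxiliary objective (our own simplification: the squared-error prediction term and the function name are assumptions, not the paper's exact objective; only the L2 penalty on predicted gradients is taken directly from the text):

```python
import numpy as np

def aux_objective(pred_grad, ce_grad, lam):
    """Gradient-prediction loss plus the L2 penalty on predicted gradients.

    pred_grad: gradient predicted by the auxiliary network (without labels)
    ce_grad:   cross-entropy gradient computed with the (possibly noisy) label
    lam:       penalty coefficient; larger values shrink predicted gradients
    """
    prediction_loss = np.sum((pred_grad - ce_grad) ** 2)
    penalty = lam * np.sum(pred_grad ** 2)
    return prediction_loss + penalty
```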

Detecting incorrect samples.

In the proposed approach, the auxiliary network should not be able to distinguish correct and incorrect samples, unless it overfits. In fact, Fig. A2 shows that if we look at the norm of predicted gradients, examples with correct and incorrect labels are indistinguishable in easy cases (MNIST with 80% uniform noise and CIFAR-10 with 40% uniform noise) and have large overlap in harder cases (CIFAR-10 with 40% pair noise and CIFAR-100 with 40% uniform noise). Therefore, we hypothesize that the auxiliary network learns to utilize incorrect samples effectively by predicting "correct" gradients. This also hints that the distance between the predicted and cross-entropy gradients might be useful for detecting samples with incorrect or confusing labels. Fig. A3 confirms this intuition, demonstrating that this distance separates correct and incorrect samples perfectly in easy cases (MNIST with 80% uniform noise and CIFAR-10 with 40% uniform noise) and separates them well in harder cases (CIFAR-10 with 40% pair noise and CIFAR-100 with 40% uniform noise). If we interpret this distance as a score for classifying the correctness of a label, we get a 91.1% ROC AUC score in the hardest case, CIFAR-10 with 40% pair noise, and more than 99% in the easier cases.

Motivated by these results, we use this analysis to detect samples with incorrect or confusing labels in the original MNIST, CIFAR-10, and Clothing1M datasets. We present a few incorrect/confusing labels for each class in Figures A4 and A5.
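This scoring procedure can be sketched as follows (a minimal implementation of our own: `pred_grads` stands for the gradients predicted by the auxiliary network, the cross-entropy gradient is taken with respect to the logits, and the ROC AUC is computed via the standard rank statistic):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def label_noise_scores(logits, labels, pred_grads):
    """Score each sample by the L2 distance between the cross-entropy
    gradient w.r.t. the logits, softmax(logits) - onehot(label), and the
    gradient predicted by the auxiliary network; larger scores suggest
    incorrect or confusing labels."""
    n, _ = logits.shape
    ce_grads = softmax(logits)
    ce_grads[np.arange(n), labels] -= 1.0
    return np.linalg.norm(ce_grads - pred_grads, axis=1)

def roc_auc(scores, is_incorrect):
    """ROC AUC as the probability that a random incorrect sample scores
    higher than a random correct one (assumes both groups are non-empty)."""
    pos, neg = scores[is_incorrect], scores[~is_incorrect]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties
```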

Quantitative results.

Tables 4, 5, 6, and 7 present test accuracy comparisons on multiple corrupted versions of MNIST and CIFAR-10. The presented error bars are standard deviations. In the case of MNIST, we compute them over 5 training/validation splits. In the case of CIFAR-10, due to the high computational cost, we have only one run per model and dataset pair; the standard deviations are computed by resampling the corresponding test sets 1000 times with replacement.
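The resampling procedure is the standard bootstrap; a minimal sketch (helper name ours):

```python
import numpy as np

def bootstrap_accuracy_std(correct, n_resamples=1000, seed=0):
    """Standard deviation of test accuracy under resampling the test set
    with replacement; `correct` is a boolean array of per-example hits."""
    rng = np.random.default_rng(seed)
    n = len(correct)
    accs = [correct[rng.integers(0, n, size=n)].mean()
            for _ in range(n_resamples)]
    return float(np.std(accs))
```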