# Auxiliary Image Regularization for Deep CNNs with Noisy Labels

## Abstract

Precisely-labeled data sets with a sufficient number of samples are very important for training deep convolutional neural networks (CNNs). However, many of the available real-world data sets contain erroneously labeled samples, and those errors substantially hinder the learning of accurate CNN models. In this work, we consider the problem of training a deep CNN model for image classification with mislabeled training samples, an issue that is common in real image data sets with tags supplied by amateur users. To solve this problem, we propose an auxiliary image regularization technique, optimized by the stochastic Alternating Direction Method of Multipliers (ADMM) algorithm, that automatically exploits the mutual context information among training images and encourages the model to select reliable images to robustify the learning process. Comprehensive experiments on benchmark data sets clearly demonstrate that our proposed regularized CNN model is resistant to label noise in training data.

## 1 Introduction

Deep Convolutional Neural Network (CNN) models have seen great success in solving general object recognition problems [14]. However, due to their extremely large parameter space, the performance of deep models relies heavily on the availability of a sufficiently large number of training examples. In practice, collecting images as well as their accurate annotations at a large scale is usually tedious and expensive. On the other hand, there are millions of freely available images with user-supplied tags that can be easily collected from the web. Being able to exploit this rich resource seems promising for learning a deep classification model. The labels of these web images, however, tend to be much noisier, and hence challenging to learn from.

In this work, we consider the problem of image classification in this challenging scenario where annotations of the training examples are noisy. In particular, we are interested in how to learn a deep CNN model that is able to produce robust image representations and classification results in the presence of noisy supervision. Deep CNN models have been recognized to be sensitive to sample and label noise in recent works [20]. Thus, several methods [20] have been developed to alleviate the negative effect of noisy labels in learning the deep models. However, most of the existing methods only consider modeling a fixed category-level label confusion distribution and cannot alleviate the effect of noise in representation learning for each sample.

We propose a novel auxiliary image regularizer (AIR) to address this issue of deceptive training annotations. Intuitively, the proposed regularizer exploits the structure of the data and automatically retrieves useful auxiliary examples to collaboratively facilitate training of the classification model. Here, structure of the data means the nonlinear manifold structure underlying images from multiple categories, learned by a well-trained deep model on another data set. To some extent, the AIR regularizer can be seen as seeking some “nearest neighbors” within the training examples to regularize the fitting of a deep CNN model to noisy samples and improve its classification performance in the presence of noise.

Robustifying classification models via regularization is a common practice to enhance robustness of the models. Popular regularizers include Tikhonov regularization (the $\ell_2$-norm) and the $\ell_1$-norm on the model parameters [22]. However, an effective regularizer for training a deep CNN model is still absent, especially for handling the above learning problem with faulty labels. To the best of our knowledge, this work is among the first to introduce an effective regularizer for deep CNN models to handle label noise.

Inspired by related work in natural language processing [26], we use a group sparse norm to automatically select auxiliary images. Figure 1 shows an exemplar overview of our model. In contrast to previous works imposing the regularization on the model parameters, we propose to construct groups of input image features and apply the group sparse regularizer on the response maps. Imposing such group sparsity regularization on the classifier response enables it to actively select the relevant and useful features: it gives higher learning weights to the groups that are informative for the classification task and forces the weights of irrelevant or noisy groups toward zero. The activated auxiliary images implicitly provide guiding information for training the deep models. We solve the associated optimization problem via ADMM [10], recently popularized by [4]. In particular, we use the stochastic ADMM method of [19] and [1] on this large-scale problem.

We demonstrate the effect of AIR on image classification via deep CNNs, where we synthetically corrupt the training annotations. We investigate how the proposed method identifies informative images and filters out noisy ones among the candidate auxiliary images. Going one step further, we then explore how the proposed method improves learning of image classification from user-supplied tags and handles the inherent noise in these tags. Comprehensive experiments on benchmark data sets, shown in Section 4, clearly demonstrate the effectiveness of our proposed method for the large-scale image classification task.

### 1.1 Related Work

A large body of existing work proposes to employ sparsity-inducing [22] or group-sparsity-inducing [8] norms for effective model regularization and better model selection, with applications to dictionary learning [17] and image representation learning [25], to name a few examples from the field of computer vision. However, most of these works focus on imposing a structured prior on the model parameters. The idea of exploiting the group sparsity structure within raw data was recently proposed in [26] to solve text recognition problems, by exploiting the intrinsic structure among sentences in a document. However, as far as we know, no existing work has investigated how to exploit the structural information among data for a deep model, as we address here, and we are among the first to propose such a regularized deep model based on auxiliary data.

In this work, we are particularly interested in learning from noisily labeled image data, where only a limited number of training examples are supplied with clean labels. Among the most recent contributions, [21] solve this problem by learning the noise distribution through an extra noise layer added to the deep model, while [24] explore this problem from a probabilistic graphical model point of view and train a classifier in an end-to-end learning procedure. [11] introduce a robust logistic regression method for classification with user-supplied tags. [7] deal with arbitrary outliers in the data through a robust logistic regression method, estimating the parameters in a linear programming scheme. Different from those existing works, we automatically exploit the contextual information from useful auxiliary images via a new regularization on deep CNN models.

## 2 Auxiliary Image Regularizer

### 2.1 Problem Setup

Suppose we are given a set of training images $D = \{(x_i, y_i)\}_{i=1}^{N}$. Some training samples in $D$ have noisy annotations $y_i$. We do not assume any specific distribution on the noise in $D$. Indeed, we only assume the number of noisy labels does not exceed the number of correct labels. This significantly relaxes the assumptions imposed in previous works [20]. For instance, [20] requires the noise to follow a fixed confusion distribution. Our goal is to learn a deep CNN model for the data set $D$ while we have access to a deep network pre-trained on an independent set $S$ containing sufficiently many accurately annotated examples. We use this pre-trained network to produce representations of images in the main set $D$. We employ the popular AlexNet architecture [14] to build our CNN model. The top layer, which accounts for classification, is parameterized by $w$, and we apply the usual empirical risk minimization method to learn this parameter:

$$\min_{w} \; \frac{1}{N}\sum_{i=1}^{N} \ell(w; x_i, y_i),$$

where $\ell(w; x_i, y_i)$ is the classification loss on an individual training example in the set $D$. Adding a regularizer $\Omega(w)$ to the loss function is a common approach in solving large-scale classification problems to prevent overfitting and train a more accurate and generalizable classifier:

$$\min_{w} \; \frac{1}{N}\sum_{i=1}^{N} \ell(w; x_i, y_i) + \lambda\,\Omega(w).$$

Popular regularizers include the $\ell_1$-norm, the $\ell_2$-norm, and the elastic net [22, 27].

### 2.2 Auxiliary Image Regularizer

In this work, beyond imposing a prior structure on the model parameter $w$, we propose a novel regularizer that exploits the data structure within $D$ to handle the sample noise. It is defined as

$$\Omega_{\mathrm{AIR}}(w) = \|\Phi w\|_{2,1},$$

where $\Phi$ is a matrix built from features of the candidate auxiliary images (constructed below) and $\|\cdot\|_{2,1}$ denotes the group norm defined as $\|v\|_{2,1} = \sum_{g \in \mathcal{G}} \alpha_g \|v_g\|_2$. In the group norm, $\mathcal{G}$ is an index set of all groups/partitions within the vector $v$, the $\alpha_g$ are some positive weights, and $v_g$ denotes the sub-vector indexed by group $g$. The norm induces a group sparsity that encourages the coefficients outside a small number of groups to be zero.
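As a concrete illustration, the weighted group norm can be sketched in a few lines of NumPy (the function name, the uniform default weights, and the toy vector below are ours, not from the paper):

```python
import numpy as np

def group_l21_norm(v, groups, weights=None):
    """Weighted l2,1 group norm: sum_g alpha_g * ||v_g||_2.

    `groups` is a list of index arrays partitioning v; `weights` holds the
    positive per-group weights alpha_g (uniform if omitted)."""
    if weights is None:
        weights = np.ones(len(groups))
    return sum(a * np.linalg.norm(v[g]) for a, g in zip(weights, groups))

# A vector split into two groups of three entries each: the second group
# is entirely zero, so only the first contributes to the norm.
v = np.array([3.0, 4.0, 0.0, 0.0, 0.0, 0.0])
groups = [np.arange(0, 3), np.arange(3, 6)]
norm = group_l21_norm(v, groups)  # ||(3,4,0)||_2 + ||(0,0,0)||_2 = 5.0
```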

We do not impose the group norm on the parameter $w$ directly. Instead, we encourage the response of the deep model on the learned representations of the image data set $D$ to be group-wise sparse, such that only a small number of images contribute to the learning of the model, while other non-relevant images are filtered out. These image representations are extracted from the model pre-trained on the well-labeled data set $S$. Therefore, the regularization draws information from features trained on the available data $S$.

Each image forms a “group” of active features. The regularizer encourages the features of the “good” and “stable” images to be used for model learning, while noisy additional activations are disregarded. The active images are the ones that are categorized well in the feature space obtained from the deep model. For a new image, the subset of active features close to those stable images in the learned manifold will be weighted most highly (hence “neighbor regularization”).

Towards this target, the matrix $\Phi$ is constructed as $\Phi = [\Phi_1; \ldots; \Phi_m]$ with $\Phi_i = \mathrm{diag}(f_i)$. Here, $\Phi_i$ is a diagonal matrix consisting of the features $f_i$ (such as the outputs of the fc7 layer) of image $i$, representing the $i$-th group in the group norm regularization setting. The group norm enforces the resulting response vector $\Phi w$ to be group-wise sparse, namely only a small number of images are active. These active images are auxiliary examples for learning that are automatically identified by the model. They contribute additional information to model learning. In multi-class classification, $w$ is a matrix and the group norm regularizer is defined as the sum of the group sparsity norms of each of its columns.
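A minimal sketch of this construction, assuming $\Phi$ is formed by vertically stacking the per-image diagonal matrices $\mathrm{diag}(f_i)$ (a dense toy version with illustrative names; in practice $\Phi$ would be stored as a sparse matrix):

```python
import numpy as np

def build_phi(features):
    """Stack the per-image diagonal matrices Phi_i = diag(f_i) into Phi.

    `features` is an (m, p) array of image features (e.g. fc7 outputs);
    the result is (m*p, p), so rows i*p .. (i+1)*p-1 form the i-th group."""
    m, p = features.shape
    phi = np.zeros((m * p, p))
    for i, f in enumerate(features):
        phi[i * p:(i + 1) * p, :] = np.diag(f)
    return phi

feats = np.array([[1.0, 2.0], [3.0, 4.0]])  # two images, two features each
phi = build_phi(feats)
# phi @ w concatenates the elementwise products f_i * w, one per image:
w = np.array([10.0, 100.0])
response = phi @ w  # -> [10., 200., 30., 400.]
```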

Our proposed auxiliary image regularized (AIR) model is then defined as:

$$\min_{w} \; \frac{1}{N}\sum_{i=1}^{N} \ell(w; x_i, y_i) + \lambda\,\|\Phi w\|_{2,1}.$$

Our goal is to classify an independent target data set that has corrupted labels.

A deep Convolutional Neural Network (CNN) is parameterized layer-wise by $\{W_1, \ldots, W_L\}$, and in this work we only impose the auxiliary regularizer on the top (last) layer of the CNN, whose parameter we denote by $w$. The objective function we are going to work with for training a deep CNN model is

$$\min_{W_1, \ldots, W_L} \; \frac{1}{N}\sum_{i=1}^{N} \ell(W_1, \ldots, W_L; x_i, y_i) + \lambda\,\|\Phi w\|_{2,1}.$$

A popular loss function used for image classification with CNNs is the following cross-entropy loss with softmax:

$$\ell(w; x, y) = -\sum_{k=1}^{K} \mathbb{1}\{y = k\}\, \log \frac{\exp(w_k^\top f(x))}{\sum_{j=1}^{K} \exp(w_j^\top f(x))},$$

with the output feature $f(x)$ being a function of the lower-layer parameters and the raw input image. Here, $\mathbb{1}\{\cdot\}$ is an indicator function.
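For concreteness, the softmax cross-entropy loss for a single example can be sketched as follows (the numerically stable log-sum-exp shift is a standard implementation detail, not specific to the paper):

```python
import numpy as np

def softmax_cross_entropy(scores, label):
    """Cross-entropy loss with softmax for one example.

    `scores` holds the per-class responses w_k^T f(x); `label` is the true
    class index, playing the role of the indicator function."""
    shifted = scores - scores.max()  # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

scores = np.array([2.0, 2.0, 2.0])       # uniform scores over 3 classes
loss = softmax_cross_entropy(scores, 0)  # -log(1/3) = log(3)
```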

## 3 Optimization

The standard backpropagation technique via stochastic gradient descent cannot handle the non-smooth regularization term well and exhibits a slow convergence rate. We here demonstrate how the loss function in Section 2.2 can be optimized via ADMM. In this section, we use $w$ to denote the parameter of the top layer.

After introducing an auxiliary variable $q = \Phi w$, our optimization problem becomes:

$$\min_{w,\, q} \; \frac{1}{N}\sum_{i=1}^{N} \ell(w; x_i, y_i) + \lambda\,\|q\|_{2,1} \quad \text{s.t.} \quad \Phi w = q.$$

The matrix $\Phi$ is a sparse matrix with only one non-zero entry in each row, which corresponds to an entry in a feature vector; the first $p$ rows of $\Phi$ correspond to the features of the first sample, indicated as $\Phi_1$ in Section 2.2. In other words, within each group the matrix $\Phi$ gives each parameter value $w_j$ a weight equal to the $j$-th entry of that image's feature vector. Feature values indicate the activity of the corresponding visual feature of an image in the network model. Each term in the group regularizer is normalized by the size of the group through the weights $\alpha_g$, such that all of the groups have the same overall effect in the regularizer.

The approximated augmented Lagrangian [19] for the objective function above, with $u$ being the Lagrange variable, is:

$$\mathcal{L}_k(w, q, u) = \hat{\ell}_k(w) + \lambda\,\|q\|_{2,1} + \langle u,\, \Phi w - q\rangle + \frac{\rho}{2}\|\Phi w - q\|_2^2 + \frac{1}{2\eta_k}\|w - w_k\|_2^2,$$

where $\hat{\ell}_k(w) = \ell(w_k) + \langle g_k,\, w - w_k\rangle$ replaces the loss $\ell$ by its first-order approximation at $w_k$, with $k$ indicating the iteration number. Here, $g_k$ is the gradient of the loss at the $k$-th iteration computed over a mini-batch of training samples, and the step size $\eta_k$ is set according to [1]. Then applying the Stochastic Alternating Direction Method of Multipliers (SADMM) [19], followed by a non-uniform averaging step that gives higher weights to the recent updates, inspired by [1], yields the following alternating updates of the variables $(w, q, u)$:

We use an adaptive $\rho$ by increasing its value by a constant factor in each iteration up to a fixed maximum value. The variable $w$ can be updated in closed form as:

$$w_{k+1} = \Big(\rho\,\Phi^\top\Phi + \tfrac{1}{\eta_k}I\Big)^{-1}\Big(\tfrac{1}{\eta_k}w_k - g_k + \Phi^\top(\rho\, q_k - u_k)\Big).$$

The update of $q$ has a closed-form solution given by the proximal operator of the $\ell_{2,1}$-norm (group soft-thresholding) [2]:

$$q_{k+1} = \mathrm{prox}_{(\lambda/\rho)\|\cdot\|_{2,1}}\Big(\Phi w_{k+1} + \tfrac{1}{\rho}u_k\Big),$$

where the soft-thresholding operator for the group norm acts group-wise as

$$\big(\mathrm{prox}_{\tau\|\cdot\|_{2,1}}(z)\big)_g = \max\Big(0,\; 1 - \frac{\tau\,\alpha_g}{\|z_g\|_2}\Big)\, z_g.$$

Since the update of each group $q_g$ is independent of the updates of the other groups, all $q_g$ can be updated in parallel to further reduce the computational cost.
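A per-group sketch of this soft-thresholding step (with the group weight folded into the threshold `tau`; names are illustrative):

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_2 on one group: shrink the group's
    l2 norm by tau, zeroing the whole group when ||v||_2 <= tau."""
    norm = np.linalg.norm(v)
    if norm <= tau:
        return np.zeros_like(v)
    return (1.0 - tau / norm) * v

g1 = group_soft_threshold(np.array([3.0, 4.0]), 1.0)  # norm 5 -> 4: [2.4, 3.2]
g2 = group_soft_threshold(np.array([0.3, 0.4]), 1.0)  # norm 0.5 -> zeroed out
```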

Thereafter, we apply a non-uniform averaging step on $w$ as:

$$\bar{w}_k = (1 - \theta_k)\,\bar{w}_{k-1} + \theta_k\, w_k,$$

where $\theta_k$ decays with $k$ so that recent iterates receive higher weights; similar updates apply for $\bar{q}$ and $\bar{u}$.
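Putting the updates together, a toy NumPy sketch of the SADMM loop might look as follows. The closed-form $w$-update, the scalar soft-thresholding, and the averaging weight $\theta_k = 2/(k+1)$ are our illustrative choices under stated simplifications (each row of `phi` is its own group, and the toy check uses exact rather than mini-batch gradients), not the paper's exact recipe:

```python
import numpy as np

def stochastic_admm(grad_fn, phi, lam, rho, eta, n_iters):
    """Sketch of SADMM for min_w L(w) + lam * ||phi w||_{2,1}
    with the split q = phi w. grad_fn(w, k) returns a gradient of L."""
    p = phi.shape[1]
    w = np.zeros(p)
    q = np.zeros(phi.shape[0])
    u = np.zeros(phi.shape[0])  # dual variable
    w_bar = np.zeros(p)
    for k in range(1, n_iters + 1):
        # w-update: linearized loss plus quadratic ADMM term (closed form)
        g = grad_fn(w, k)
        A = rho * phi.T @ phi + np.eye(p) / eta
        b = w / eta - g + phi.T @ (rho * q - u)
        w = np.linalg.solve(A, b)
        # q-update: soft-thresholding of phi w + u / rho (scalar groups)
        z = phi @ w + u / rho
        q = np.sign(z) * np.maximum(np.abs(z) - lam / rho, 0.0)
        # dual update
        u = u + rho * (phi @ w - q)
        # non-uniform averaging: weight iterate k proportionally to k
        theta = 2.0 / (k + 1)
        w_bar = (1.0 - theta) * w_bar + theta * w
    return w_bar

# Toy check: L(w) = 0.5 * ||w - t||^2 with phi = I and a tiny lam should
# drive the averaged iterate close to t.
t = np.array([1.0, 1.0])
w_hat = stochastic_admm(lambda w, k: w - t, np.eye(2),
                        lam=1e-6, rho=1.0, eta=0.1, n_iters=300)
```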

## 4 Experiments on Deep CNN Model

In this section, we explore the robustness added to the classifier by the proposed auxiliary regularizer and evaluate the performance of the regularized CNN model proposed in Section 2.2 on different benchmark data sets. First, we examine the effect of the auxiliary regularizer when noisy labels are added to clean data sets manually. Second, we investigate its influence on the robustness of the model trained on a freely-available user-tagged data set.

In all of our experiments, we use the AlexNet CNN model pre-trained on the ILSVRC2012 data set [12] and fine-tune its last layer on the target data set $D$. Furthermore, we set the regularization parameter $\lambda$ to a very small number, inversely proportional to the length of each feature vector, and tune the remaining hyper-parameters (batch size and the SADMM parameters) of the updates above on a cross-validation set for each experiment. The initial value of $\rho$ is fixed across all experiments. We cross-validate the regularization parameter for our SVM baseline over a fixed candidate set. We define our loss function to be a softmax cross-entropy and apply the AIR regularizer on the top layer of the CNN model.

### 4.1 Experiments with Synthetic Noisy Labels

First, we conduct image classification on a subset of the ImageNet7k data set [6]. We use a pre-trained AlexNet CNN model and fine-tune its last layer on 50 randomly selected classes from the ImageNet7k data set, chosen as leaf categories of “animal”, each of which contains an equal number of samples. We randomly flipped half of the labels among all 50 categories. This is exactly the problem described in Section 2.1.

We perform a similar experiment on the MNIST data set with 10 categories of handwritten digits. We use a confusion matrix $Q$ to define the distribution of noisy labels among the 10 categories, following the same settings as in [21] to determine the probability $Q_{ij}$ of changing label $i$ to label $j$ for all pairs $(i, j)$. Different levels of noise are applied by setting the diagonal values of $Q$ according to the noise level and normalizing each row to a proper distribution.
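A hypothetical sketch of this label-corruption procedure, where the diagonal of the confusion matrix holds the probability of keeping the clean label and the remaining mass is spread uniformly off-diagonal (the uniform off-diagonal choice is our simplification):

```python
import numpy as np

def corrupt_labels(labels, Q, rng):
    """Flip label i to label j with probability Q[i, j], where Q is a
    row-stochastic confusion matrix (diagonal = keep probability)."""
    n_classes = Q.shape[0]
    return np.array([rng.choice(n_classes, p=Q[y]) for y in labels])

rng = np.random.default_rng(0)
n_classes, keep = 10, 0.6
# Keep each label with probability 0.6; spread the rest uniformly.
Q = np.full((n_classes, n_classes), (1.0 - keep) / (n_classes - 1))
np.fill_diagonal(Q, keep)
clean = rng.integers(0, n_classes, size=2000)
noisy = corrupt_labels(clean, Q, rng)
flip_rate = float(np.mean(noisy != clean))  # close to the 0.4 noise level
```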

We empirically investigate the robustness gained from applying the proposed AIR regularizer compared to a linear SVM classifier as well as Robust Logistic Regression (RoLR) [7], which assumes a constant fraction of outliers. The SVM and RoLR classifiers are applied to the last layer of the same network. The results, shown in Table 1, demonstrate that AIR offers significant improvements over the performance of the deep CNN with SVM or RoLR. On the ImageNet7k data, AIR yields a large performance gain compared to the SVM.

We also trained the deep CNN with AIR on the CIFAR-10 data set, which contains 10 different categories. We used the same type of confusion matrix as explained above, exactly matching the matrix defined in [21], to randomly corrupt the labels. Since the images in the CIFAR-10 and MNIST data sets are small (32x32 and 28x28 pixels, respectively), we re-sized and cropped them to the input size of the network, and thereafter extracted their fc7 features from the pre-trained model described above. We tested the robustness of the model on different batches of CIFAR-10 with various noise levels. In Figure 2, we compare the accuracy of the AIR model to that of a regularized SVM, the deep model of [21] that adds a noise layer to the CudaConv network, and the plain CudaConv network. CudaConv refers to the network with three convolutional layers, with the model architecture and hyper-parameter settings used in [21], given by [13].

Figure 2 illustrates that the SVM model suffers as the number of incorrect labels grows, whereas AIR remains robust even under large amounts of corruption. Moreover, the performance of both CudaConv+learned-Q and CudaConv depends heavily on the number of training samples, while AIR reaches the same classification accuracy with significantly fewer data points at the same noise level.

To show the benefit of stochastic ADMM for solving our proposed regularized optimization problem in comparison with stochastic gradient descent (SGD), we re-ran the same experiment on the CIFAR-10 data set but trained the model with SGD in the presence of label noise. This experiment results in a markedly lower accuracy than that achieved by stochastic ADMM under the same settings discussed in Section 3.

We also trained AlexNet with the regularizer in an end-to-end scheme on CIFAR-10 with incorrect labels, where the weights of the last layer are initialized with the learned weights from AlexNet+ft-last-SVM and AlexNet+ft-last-AIR. The classification accuracy obtained from the AIR initialization exceeds that of the SVM initialization. This shows that applying the proposed AIR regularizer even only in the last layer improves the accuracy of the whole deep model trained end-to-end.

The AIR regularizer automatically incorporates the inherent structure in the image data by forcing the weights of noisy groups toward zero and giving higher learning weights to the stable groups. We illustrate this characteristic of AIR in Figure 3 by comparing the distribution of activations of noisy-labeled images with that of clean images in the ImageNet7k experiment across learning iterations. Here, “activation” refers to the $\ell_2$-norm of the response associated with each group, i.e., $\|\Phi_i w\|_2$. The distributions of these two sets of activations highly overlap in the first iteration and gradually become more and more distinct from each other, revealing the ability of the auxiliary regularizer to find the images with noisy annotations. We manually compute the same activation scores per image using the SVM-learned weights and compare the corresponding distribution in the right plot of Figure 3. The activation scores for images with clean and noisy labels learned from the SVM still overlap notably after the same number of training epochs as AIR.
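The activation score of each image, used for the distributions above, can be sketched as the $\ell_2$-norm of that image's slice of the response vector (a toy version with illustrative names):

```python
import numpy as np

def activation_scores(response, group_size):
    """Per-image activation: the l2 norm of each image's slice of phi @ w,
    assuming all groups (images) have the same number of features."""
    return np.linalg.norm(response.reshape(-1, group_size), axis=1)

# Three images with 2 features each; the middle image is fully suppressed.
response = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
scores = activation_scores(response, 2)  # -> [5., 0., 1.]
```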

### 4.2 Experiments with Real Noisy Labels

In this section, we examine the performance of the deep image classifier on images with user-supplied tags from the publicly available multi-label NUS-WIDE-LITE data set [5], a subset of the large Flickr data set. This data set contains 81 different tags with a total of 55615 samples divided into two equal-sized train and test sets. After ignoring the subset of training samples not annotated by any tag, we keep the remaining samples in the training set. We train the classifier on the user-tagged images of the training data and evaluate the performance on the test set with ground-truth labels.

We followed the same experimental settings as explained in Section 4. We compare the performance of the deep model with different classifiers applied to the last layer of AlexNet. In Figure 6, we plot averaged-per-image precision and recall values when we assign the highest-ranked predictions to each image. As a secondary metric, we compare AIR's performance with the baselines in terms of Mean Average Precision (MAP) [15], which does not depend on the top rankings but on the full ranking for each image, measuring both the quality of image ranking per label and the quality of tag ranking per image. The robustness to noisy user-tags obtained from the auxiliary regularizer is significant, as shown in Figure 6. A few sample images and their top-5 predictions by both the AIR and SVM regularizers are presented in Figure 7.

### 4.3 Visualization of Selected Auxiliary Images

Finally, we try to understand whether the model indeed has the expected ability of retrieving informative images during the training process. To this end, we visualize the automatically selected auxiliary images in the left plot of Figure ?. The figure refers to the experiments on the ImageNet7k data set with noisy labels, with the same settings as explained in Section 4.1, and displays the images whose corresponding groups were active (selected during the optimization) as well as the images that were filtered out (suppressed). The images in this figure are ranked by their activation scores, so that clean images are expected to rank higher and appear in the top rows of the plot. As in Section 4.1, we rank images based on the activation scores obtained from the SVM-learned weights on the right-hand side of Figure ?. Indeed, the figure shows that AIR forces the weights of noisy or non-informative images to zero and encourages the model to select clean and informative images during training much more accurately than the SVM. This also explains why our proposed model is robust to noisy-labeled training examples, as shown in the previous experiments.

### 4.4 Scalability of the Auxiliary Image Regularizer

To reduce the memory requirement for the matrix $\Phi$ on large data sets, we can randomly select a small number of groups to be considered in the AIR regularizer. When a large number of data points is available, randomly ignoring groups in the regularization of the response will not substantially affect the learning process, which is influenced by the distribution of informative images in the feature space. To verify this point, we repeat the experiment on the CIFAR-10 data set with synthetic label noise but with only a small fraction of the groups used in the regularization term. This significant memory reduction causes only a small drop in accuracy. The experiment shows that sampling from the groups does not considerably reduce the final classification accuracy in the case of large data sets, while saving memory cost significantly.
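A sketch of this group-subsampling idea, keeping a random fraction of the candidate auxiliary images before building the regularizer (the names and the 10% fraction below are illustrative):

```python
import numpy as np

def subsample_groups(features, frac, rng):
    """Keep a random fraction of the groups (images) in the regularizer by
    subsampling rows of the feature matrix before building Phi."""
    m = features.shape[0]
    n_keep = max(1, int(round(frac * m)))
    keep = rng.choice(m, size=n_keep, replace=False)
    return features[np.sort(keep)]

rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 8))    # 1000 candidate auxiliary images
sub = subsample_groups(feats, 0.10, rng)  # keep 10% of the groups
```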

## 5 Summary and Future Work

We introduced a new regularizer based on overlapping group norms for deep CNN models to improve image classification accuracy when training labels are noisy. The regularizer is adaptive, in that it automatically incorporates the inherent structure in the image data. Our experiments demonstrated that the regularized model performs well for both synthetic and real noisy labels: it leads to a substantial enhancement in performance on the benchmark data sets compared with standard models. In the future, we will explore the effect of AIR on robustifying the classifier in an end-to-end scheme, where the error information from the auxiliary regularizer back-propagates through the inner layers of the deep model to produce robust image representations.

### References

1. Towards an optimal stochastic alternating direction method of multipliers.
Azadi, Samaneh and Sra, Suvrit. In Proceedings of the 31st International Conference on Machine Learning, pp. 620–628, 2014.
2. Convex optimization with sparsity-inducing norms.
Bach, Francis, Jenatton, Rodolphe, Mairal, Julien, Obozinski, Guillaume, et al. Optimization for Machine Learning
3. Optimization with sparsity-inducing penalties.
Bach, Francis, Jenatton, Rodolphe, Mairal, Julien, and Obozinski, Guillaume. Foundations and Trends in Machine Learning
4. Distributed optimization and statistical learning via the alternating direction method of multipliers.
Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Foundations and Trends in Machine Learning
5. NUS-WIDE: A real-world web image database from National University of Singapore.
Chua, Tat-Seng, Tang, Jinhui, Hong, Richang, Li, Haojie, Luo, Zhiping, and Zheng, Yan-Tao. In Proc. of ACM Conf. on Image and Video Retrieval (CIVR’09), Santorini, Greece., July 8-10, 2009.
6. What does classifying more than 10,000 image categories tell us?
Deng, Jia, Berg, Alexander C, Li, Kai, and Fei-Fei, Li. In Computer Vision–ECCV 2010, pp. 71–84. Springer, 2010.
7. Robust logistic regression and classification.
Feng, Jiashi, Xu, Huan, Mannor, Shie, and Yan, Shuicheng. In Advances in Neural Information Processing Systems, pp. 253–261, 2014.
8. A note on the group lasso and a sparse group lasso.
Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. arXiv preprint arXiv:1001.0736
9. A dual algorithm for the solution of nonlinear variational problems via finite element approximation.
Gabay, Daniel and Mercier, Bertrand. Computers & Mathematics with Applications
10. Sur l’approximation, par elements finis d’ordre un, et la resolution, par penalisation-dualite d’une classe de problemes de dirichlet non lineaires.
Glowinski, Roland and Marroco, A. ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique
11. Image classification and retrieval from user-supplied tags.
Izadinia, Hamid, Farhadi, Ali, Hertzmann, Aaron, and Hoffman, Matthew D. arXiv preprint arXiv:1411.6909
12. Caffe: Convolutional architecture for fast feature embedding.
Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. arXiv preprint arXiv:1408.5093
13. cuda-convnet.
Krizhevsky, Alex. URL https://code.google.com/p/cuda-convnet/.
14. Imagenet classification with deep convolutional neural networks.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. In Advances in neural information processing systems, pp. 1097–1105, 2012.
15. Socializing the semantic gap: A comparative survey on image tag assignment, refinement and retrieval.
Li, Xirong, Uricchio, Tiberio, Ballan, Lamberto, Bertini, Marco, Cees, Snoek, and Bimbo, Alberto Del. http://arxiv.org/pdf/1503.08248v2.pdf
16. Splitting algorithms for the sum of two nonlinear operators.
Lions, Pierre-Louis and Mercier, Bertrand. SIAM Journal on Numerical Analysis
17. Supervised dictionary learning.
Mairal, Julien, Ponce, Jean, Sapiro, Guillermo, Zisserman, Andrew, and Bach, Francis R. In Advances in neural information processing systems, pp. 1033–1040, 2009.
18. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images.
Nguyen, Anh, Yosinski, Jason, and Clune, Jeff. arXiv preprint arXiv:1412.1897
19. Stochastic alternating direction method of multipliers.
Ouyang, Hua, He, Niao, Tran, Long, and Gray, Alexander. In Proceedings of the 30th International Conference on Machine Learning, pp. 80–88, 2013.
20. Learning from noisy labels with deep neural networks.
Sukhbaatar, Sainbayar and Fergus, Rob. arXiv preprint arXiv:1406.2080
21. Training convolutional networks with noisy labels.
Sukhbaatar, Sainbayar, Bruna, Joan, Paluri, Manohar, Bourdev, Lubomir, and Fergus, Rob. arXiv preprint arXiv:1406.2080
22. Regression shrinkage and selection via the lasso.
Tibshirani, Robert. Journal of the Royal Statistical Society. Series B (Methodological)
23. Visualizing data using t-SNE.
Van der Maaten, Laurens and Hinton, Geoffrey. Journal of Machine Learning Research
24. Learning from massive noisy labeled data for image classification.
Xiao, Tong, Xia, Tian, Yang, Yi, Huang, Chang, and Wang, Xiaogang. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699, 2015.
25. Linear spatial pyramid matching using sparse coding for image classification.
Yang, Jianchao, Yu, Kai, Gong, Yihong, and Huang, Thomas. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 1794–1801. IEEE, 2009.
26. Making the most of bag of words: Sentence regularization with alternating direction method of multipliers.
Yogatama, Dani and Smith, Noah. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 656–664, 2014.
27. Regularization and variable selection via the elastic net.
Zou, Hui and Hastie, Trevor. Journal of the Royal Statistical Society: Series B (Statistical Methodology)