Deep Learning from Noisy Image Labels with Quality Embedding
Abstract
There is an emerging trend to leverage noisy image datasets in many visual recognition tasks. However, the label noise in these datasets severely degrades the performance of deep learning approaches. Recently, one mainstream approach has been to introduce a latent label to handle label noise, which has shown promising improvements in network design. Nevertheless, the mismatch between latent labels and noisy labels still affects the predictions of such methods. To address this issue, we propose a quality embedding model, which explicitly introduces a quality variable to represent the trustworthiness of noisy labels. Our key idea is to identify the mismatch between the latent and noisy labels by embedding the quality variables into different subspaces, which effectively minimizes the noise effect. At the same time, high-quality labels can still be used for training. To instantiate the model, we further propose a Contrastive-Additive Noise network (CAN), which consists of two important layers: (1) the contrastive layer estimates the quality variable in the embedding space to reduce the noise effect; and (2) the additive layer aggregates the prior predictions and noisy labels as the posterior to train the classifier. Moreover, to tackle the optimization difficulty, we derive an SGD algorithm with reparameterization tricks, which makes our method scalable to big data. We evaluate the proposed method on a range of noisy image datasets. Comprehensive results demonstrate that CAN outperforms state-of-the-art deep learning approaches.
I Introduction
While editorially labeled image data is crucial to visual classification [1, 2, 3, 4] and to weakly supervised detection and segmentation [5, 6, 7, 8, 9, 10], collecting such datasets in large volume can be prohibitively expensive. Non-editorial means, such as social tagging and crowdsourcing, have been explored as efficient alternatives [11, 12, 13]. For example, a plethora of tagged images is available on the Flickr website, which provides valuable labeled resources for building image classifiers. The challenge, however, is that social tags are highly noisy when used as labels. As a result, deep learning from noisy image labels has attracted increasing attention [14].
Previous studies in the machine learning community have investigated label noise [15, 16, 17, 18, 19] for non-deep approaches. For example, Vikas et al. [15] introduce per-annotator parameters that transit latent predictions to noisy labels. For parameter estimation, they resort to an EM optimization algorithm, which is also adopted in contemporaneous works. However, it is not straightforward to apply these studies to deep learning methods because of the computational cost of EM optimization.
With the success of deep learning in computer vision [1, 2, 3, 4], training neural networks with noisy image labels has also been explored [20, 14, 21, 22, 23, 24, 25, 26, 27, 28, 29]. These methods fall into two categories: building a robust loss function and modeling the latent label. The former paradigm is heuristic and usually depends on nontrivial hyperparameter selection. For instance, Reed et al. [24] construct a weighted combination of noisy image labels and predictions to supervise the network training. However, it is unclear how the weight should be set to match real-world label noise. As a popular example of the latter paradigm, Sukhbaatar et al. [14] model the latent label to handle label noise. Specifically, the classifier is trained on latent labels, so the label noise does not directly affect the classifier. However, they adapt latent labels to noisy labels with a linear transition layer, which cannot sufficiently model the label corruption, and label noise can still pass through this layer and degrade performance. The deficiency of the above deep learning methods is that they do not explicitly model the trustworthiness of noisy labels. Considering noise only implicitly, in the loss function or through the latent label, may fail to capture the nature of the noise, e.g., flips and outliers.
In this paper, we follow the latter paradigm and propose a quality embedding model. Fig. 1 illustrates our idea and its advantage in reducing the noise effect. For example, in Fig. 1(a), the latent labels and predictions of the first three cat images must be approximately consistent because of their content similarity. However, a mismatch occurs between the second prediction and the corresponding annotation because of label noise. For the fourth image, the prediction, induced by an estimation error of the latent label, also conflicts with the fourth annotation. As a result, these two kinds of mismatch are mixed together in backpropagation. On the other hand, if we explicitly introduce a quality variable to model the trustworthiness of noisy labels, as in Fig. 1(b), label noise can be reduced more effectively. For example, if the quality variable of the second sample is embedded in the non-trustworthy subspace, the latent label can be disturbed accordingly, which prevents the mismatch error caused by label noise from being backpropagated. For the fourth sample, whose quality variable is estimated to lie in the trustworthy subspace, the latent label still transits to the final prediction and causes the mismatch, so the supervision from the correct annotation is fed back normally.
Mathematically, we illustrate the corresponding graphical model in Fig. 2. Different from previous latent-label-based deep learning approaches, a quality variable is specially introduced to model the trustworthiness of noisy labels. By embedding the quality variable into different subspaces, the shortcoming illustrated in Fig. 1(a) can be resolved as in Fig. 1(b). To instantiate our probabilistic model with a deep neural network, we further design a Contrastive-Additive Noise network (CAN), shown in Fig. 3. For parameter learning, we optimize an evidence lower bound [30, 31, 32] plus a variational mutual information regularizer, and derive an SGD algorithm. The major contributions of this paper can be summarized in four parts as follows.

To address the shortcoming of existing latent-label-based deep learning approaches, we propose a quality embedding model that introduces a quality variable to represent the trustworthiness of noisy labels. By embedding the quality variable into different subspaces, the negative effect of label noise can be effectively reduced. At the same time, the supervision from high-quality labels can still be backpropagated normally for training.

To instantiate the quality embedding model, we design a Contrastive-Additive Noise network. Specifically, it consists of two important layers: (1) the contrastive layer estimates the quality variable in the embedding space to reduce the noise effect; (2) the additive layer aggregates prior predictions and noisy labels as the posterior to train the classifier.

To tackle the optimization difficulty, we apply reparameterization tricks and derive an efficient SGD algorithm, which makes our model scalable to big data.

We conduct a range of experiments to demonstrate that CAN outperforms existing state-of-the-art deep learning methods on noisy datasets. We further present a qualitative analysis of the quality embedding, the latent label estimation and the noise pattern to give deeper insight into our model.
The rest of this paper is organized as follows. Section II briefly reviews related work on learning with noisy labels in deep learning. Section III introduces our quality embedding model, its instantiation as the Contrastive-Additive Noise network, and the optimization algorithm. Section IV validates the effectiveness of our method over a range of experiments. Section V concludes the paper.
II Related Work
Social websites and crowdsourcing platforms provide an effective way to gather a large number of low-cost annotations for images. However, in visual recognition tasks such as image classification, the noise among labels can severely degrade the performance of classification models [33]. To exploit the great value of noisy labels, several noise-aware deep learning methods have been proposed for image classification. Here, we briefly review these related works.
Robust loss function. This line of research aims at designing a robust loss function to alleviate the noise effect. For instance, Joulin et al. [34] weight the cross-entropy loss with the sample number to balance the emphasis of noise in positive and negative instances. Izadinia et al. [23] estimate a global ratio of positive samples to weaken the supervision in the loss function. Reed et al. [24] consider the consistency of predictions on similar images and apply bootstrapping to the loss function. They substitute the noisy label with a weighted combination of the noisy label and the prediction to encourage consistent output. Recently, Li et al. [28] reweight the noisy label with a soft label learned from side information. They train a teacher network on a clean dataset and compute the soft label by leveraging a knowledge graph. The soft label is then combined with the noisy label in the loss function to guide the student model's learning. Andreas et al. [29] rectify labels in the cross-entropy loss with a label-correction network trained on an extra clean dataset. While these methods modify the labels in the loss function by reweighting or rectification, our approach additionally models the trustworthiness of noisy image labels to reduce the noise effect on training.
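To make the weighted-combination idea concrete, the following is a minimal sketch of a bootstrapped target in a per-label binary cross-entropy. It is an illustration under stated assumptions rather than the formulation of [24]: the function name, the mixing weight `beta` and its default value are hypothetical, and the multi-label binary form is a simplification.

```python
import math

def soft_bootstrap_loss(noisy_labels, predictions, beta=0.8, eps=1e-12):
    """Binary cross-entropy against a blend of noisy labels and predictions.

    beta is a hypothetical mixing weight: beta = 1 trusts the noisy labels
    completely, smaller values lean on the model's own predictions.
    """
    loss = 0.0
    for y, p in zip(noisy_labels, predictions):
        target = beta * y + (1.0 - beta) * p      # bootstrapped target
        p = min(max(p, eps), 1.0 - eps)           # clamp to avoid log(0)
        loss -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return loss / len(noisy_labels)

# A label that disagrees with a confident prediction is penalized less
# once the prediction is blended into the target.
plain = soft_bootstrap_loss([1.0], [0.1], beta=1.0)
blended = soft_bootstrap_loss([1.0], [0.1], beta=0.8)
```

The sketch also shows why the weight matters: it is a single global value that has to be tuned by hand, regardless of how reliable each individual label is.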
Modeling the latent labels. This paradigm aims at modeling the latent labels to train the classifier, and at building a transition that adapts the latent labels to the noisy labels. With the success of deep learning in image recognition, this idea has received considerable attention. Mnih et al. [20] first propose a latent variable model on aerial images, which assumes that the noise is symmetric and random. Building on it, [14, 27] use a linear adaptation layer to model asymmetric label noise and add the layer on top of a deep neural network. This transition layer can be regarded as a confusion matrix representing label flip probabilities. However, the matrix depends only on the distribution of labels and ignores the image content. Chen et al. [12] apply a two-stage approach to model the latent label and learn the transition to the noisy label, in which a clean dataset is used. Different from methods that model label transition at the dataset level, Xiao et al. [26] propose a probabilistic graphical model that perturbs the label at the image level. However, the model also needs a small amount of clean data to learn the conditional probability, which may constrain its generalization. To demonstrate that human-centric noisy labels exhibit a specific structure that can be modeled, Misra et al. [22] build two parallel classifiers: one deals with image recognition and the other models the human reporting bias. However, this approach still suffers from the problem illustrated in Fig. 1(a), since similar images have similar latent variables. Although these methods take advantage of deep neural networks to model the latent label, the simple transition cannot sufficiently model the label corruption. We go further by uncovering the annotation quality from the training data and using it to guide the learning of our model.
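The linear transition layer discussed above can be sketched as a row-stochastic confusion matrix applied to the classifier output. The function name, the 3-class setup and the flip probabilities below are hypothetical and serve only to illustrate the mechanism.

```python
def noisy_label_distribution(latent_probs, transition):
    """Map a distribution over latent classes to one over noisy classes.

    transition[i][j] is the assumed probability that latent class i is
    annotated as class j; every row sums to 1.
    """
    num_classes = len(transition[0])
    return [
        sum(latent_probs[i] * transition[i][j] for i in range(len(latent_probs)))
        for j in range(num_classes)
    ]

# Hypothetical example: class 0 is annotated as class 1 in 30% of cases.
transition = [
    [0.7, 0.3, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
latent = [0.9, 0.05, 0.05]
noisy = noisy_label_distribution(latent, transition)
```

Note that the same matrix is applied to every image, which is the limitation pointed out above: the transition depends only on the label distribution and not on the image content.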
III Quality Embedding Models
III-A Preliminaries
Consider a noisy image dataset in which each item is a tuple consisting of one image and its noisy labels. The image can be given either as the original image or as a feature vector extracted from it. The noisy labels form a binary vector with one entry per category, indicating which labels are annotated. However, this vector may be corrupted by annotation noise and thus be incorrect. We assume that each image has an underlying clean (latent) label vector, and we introduce a quality variable, embedded in a multi-dimensional Gaussian space, to represent the annotation quality of its noisy labels. For ease of reference, the notation used in this paper is listed in Table I.
Formally, this is a multi-label, multi-class classification problem with noise in the labels. We aim to train a deep classifier from these noisy training samples. Many other tasks are consistent with this setting, such as weakly supervised object detection and segmentation [5, 6, 7, 8, 9, 10] with web data.
III-B Quality Embedding Model
III-B1 Quality embedding
In this section, we introduce a quality variable in parallel with the latent label; the two jointly transit to the noisy image label. Our probabilistic graphical model is illustrated in Fig. 2. In the generative process, the latent label vector depends purely on the image, and we model this dependency with a conditional distribution of the latent label given the image (the classifier). The noisy label vector, in contrast, is generated from both the annotation quality and the latent label, which we model with a conditional distribution of the noisy label given these two variables. In the inference process, the distributions of the latent label and of the quality variable are both modeled conditionally on the image and the noisy label. We represent them with two variational distributions, which play the role of posterior approximations.
According to the graphical model in Fig. 2, given the training set, we have the following log-likelihood.
(1) 
However, the log-likelihood function is difficult to compute explicitly. We instead optimize an adjustable evidence lower bound (ELBO) [30, 31, 32]. The ELBO is obtained by introducing two variational distributions that approximate the true posterior distributions of the latent label and the quality variable. The form of our ELBO is given in Eq. (2).
(2) 
The above bound is a good approximation of the marginal likelihood, which provides a basis for model selection [32]. When the gap between the marginal likelihood and the ELBO becomes zero, the variational distributions coincide with the true distributions.
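For orientation, a bound of this kind can be written out under the factorization described above. The symbols are assumptions introduced here for illustration: x for the image, y for the noisy label vector, z for the latent label vector, s for the quality variable, q for the variational distributions, and P(s) for a fixed prior on the quality variable. The display is a generic sketch consistent with the graphical model in the text, obtained from Jensen's inequality, and is not necessarily the exact form of Eq. (1) and Eq. (2).

```latex
\begin{aligned}
\ln P(y \mid x)
  &= \ln \sum_{z} \int P(y \mid z, s)\, P(z \mid x)\, P(s)\, \mathrm{d}s \\
  &\ge \mathbb{E}_{q(z \mid x, y)\, q(s \mid x, y)}\!\left[\ln P(y \mid z, s)\right]
     - \mathrm{KL}\!\left(q(z \mid x, y)\,\|\,P(z \mid x)\right)
     - \mathrm{KL}\!\left(q(s \mid x, y)\,\|\,P(s)\right)
\end{aligned}
```

Under these assumptions, the gap between the two sides equals the KL divergence from the product of the two variational distributions to the true joint posterior of z and s, which is why a vanishing gap means that the variational distributions match the true ones.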
Notation  Description 

number of training items  
number of categories  
dimension of the quality variable  
image variable  
noisy label vector variable  
latent label vector variable  
quality vector variable  
parameter of classifier network  
parameter of noise network  
parameter of annotation quality network  
parameter of latent label network  
index of an item  
th observed image  
th observed noisy label vector  
th latent label vector  
th quality vector  
mean of Gaussian distribution  
covariance diagonal of Gaussian distribution  
regularization coefficient  
th sample from Gumbel distribution  
th sample from Gaussian distribution  
temperature in GumbelSoftMax  
timevarying coefficient 
III-B2 Variational mutual information regularizer
Although Fig. 2 presents the structural prior of our probabilistic model, optimizing the ELBO may not converge to the desired optimum, since modeling the distributions with neural networks introduces considerable flexibility. This is a common problem in Bayesian models, and a general solution is posterior regularization [35]. Posterior regularization enforces the desired expectation while retaining computational efficiency. Such methods have been applied in clustering [36], classification [37] and image generation [38]. In this paper, we regularize the variational distributions of the latent label and the quality variable from the perspective of mutual information maximization. We derive the regularizers as follows,
(3)  
Here the regularizers are expressed through the mutual information between two distributions and the entropy of a variable. As can be seen in Eq. (3), maximizing the mutual information is equivalent to minimizing the entropy of the latent label and of the quality variable under their variational distributions. For the latent label, such posterior regularization pushes the probabilities toward the extreme points. For the quality variable, it encourages the distribution to have a low variance.
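The effect of the entropy terms can be checked numerically with the standard closed-form entropies of a Bernoulli variable and of a Gaussian with diagonal covariance. This is a minimal sketch; the function names are illustrative and not taken from the paper.

```python
import math

def bernoulli_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def diag_gaussian_entropy(variances):
    """Differential entropy of a Gaussian with diagonal covariance."""
    return 0.5 * sum(math.log(2.0 * math.pi * math.e * v) for v in variances)

# Entropy shrinks as a label probability moves toward 0 or 1 ...
label_entropies = [bernoulli_entropy(p) for p in (0.5, 0.9, 0.99)]
# ... and as the variance of the quality variable decreases.
wide = diag_gaussian_entropy([1.0, 1.0])
narrow = diag_gaussian_entropy([0.1, 0.1])
```

Minimizing these quantities therefore drives the latent label probabilities toward the extreme points and the quality distribution toward low variance, as stated above.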
III-B3 Objective
Combining Eq. (2) with Eq. (3), our objective becomes the maximization of the ELBO together with the mutual information regularizer. Note that a coefficient is substituted into Eq. (3) to weight the effect of the regularization in the optimization. For simplicity, we rewrite the goal as the following minimization problem instead of a maximization.
(4) 
From Eq. (4), our model differs from previous methods mainly in three aspects. First, the transition from the latent label to the noisy label is conditioned on both the latent label and the quality variable, while previous methods [20, 14, 27] depend only on the latent label. Second, previous works [20, 14, 27, 26, 22] use a linear transition, while our model uses a nonlinear implementation. Third, the latent label and the quality variable are approximated with variational distributions from the posterior perspective, while previous works [26, 28, 29] may have to rely on an extra clean dataset or other label knowledge.
III-C Contrastive-Additive Noise Network
In this section, we instantiate our model with the Contrastive-Additive Noise network (CAN) shown in Fig. 3. In brief, CAN consists of four modules (encoder, sampler, decoder and classifier), which correspond to the different parts of our model. In the following, we describe the design in detail.
III-C1 Architectures
The encoder module models the two variational distributions, one over the latent label and one over the quality variable. Concretely, we first forward the image through a neural network to generate a prior label judgement. Then, from this prior judgement and the noisy label, we model the distribution parameters with two elaborately designed layers. The neural network is chosen according to the type of the input: if the input is the original image, a convolutional neural network can be applied, whereas if it is a feature vector, a fully-connected network can be chosen. In Fig. 3, we take the convolutional neural network as an example. The sampler module implements Monte Carlo sampling of the latent label and the quality variable. It receives the output of the encoder module and draws from the Gumbel and Gaussian distributions to generate a set of samples of the two variables. We discuss this part in detail, together with the reparameterization tricks, in the next section. The decoder module is a neural network for the noise model, consisting of two groups of (linear, ReLU) layers followed by a sigmoid layer. It takes the sampler output and recovers the noisy labels. Previous works [20, 14, 27, 26, 22] usually use a linear transition from the latent label to the noisy label. We consider a nonlinear transition because of the heterogeneous quality variable. The classifier module, our most important target, employs the same network as the one that produces the prior label judgement in the encoder module. It is trained with the KL divergence between the variational distribution of the latent label and the classifier's prediction.
III-C2 Contrastive layer and additive layer
We now describe the two important layers in the encoder module. The variational distribution of the quality variable is a multi-dimensional Gaussian, so both its mean and its variance need to be modeled. We use the contrastive layer for this estimation. Internally, it forwards the prior label judgement and the noisy label through a shared fully-connected layer with ReLU, and transforms the difference of the two outputs into the mean and the variance with another fully-connected layer.
The contrastive layer is built on the assumption that the quality variable is related to the difference between the prior label judgement and the noisy label. We evaluate their difference in a latent space with the shared layer and decide which subspace the quality variable is embedded in with the second layer. This embedding mechanism lets us identify the label quality explicitly and subsequently helps to reduce the noise effect in the noise model. To our knowledge, this idea has not been proposed in previous noise-aware deep learning approaches [20, 14, 21, 22, 23, 24, 25, 26, 27, 28, 29].
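The forward pass of the contrastive layer can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the layer sizes, the toy weights, and the choice of predicting a log-variance (exponentiated to keep the variance positive) are hypothetical and not taken from the paper.

```python
import math

def linear(weights, bias, vec):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def relu(vec):
    return [max(0.0, v) for v in vec]

def contrastive_layer(prior_pred, noisy_label, shared, head):
    """Return (mean, variance) of the quality variable.

    shared and head are (weights, bias) pairs with hypothetical shapes;
    head outputs 2*d values: d means followed by d log-variances (assumed).
    """
    h_pred = relu(linear(shared[0], shared[1], prior_pred))
    h_label = relu(linear(shared[0], shared[1], noisy_label))
    diff = [a - b for a, b in zip(h_pred, h_label)]   # contrast in the latent space
    out = linear(head[0], head[1], diff)
    d = len(out) // 2
    mean = out[:d]
    variance = [math.exp(v) for v in out[d:]]          # exp keeps the variance positive
    return mean, variance

# Toy parameters (illustrative only): 2 categories, hidden size 2, d = 1.
shared = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
head = ([[1.0, -1.0], [0.0, 0.0]], [0.0, 0.0])

agree_mean, agree_var = contrastive_layer([0.9, 0.1], [0.9, 0.1], shared, head)
clash_mean, clash_var = contrastive_layer([0.9, 0.1], [0.0, 1.0], shared, head)
```

With these toy weights, a noisy label that agrees with the prior judgement is mapped to a mean of zero, while a conflicting label is mapped elsewhere in the embedding space, which is the behaviour the subspace argument relies on.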
The variational distribution of the latent label consists of independent Bernoulli distributions, one per category, so one probability per category needs to be modeled. We design an additive layer to learn these parameters. Internally, it uses two non-shared fully-connected layers to transform the prior label judgement and the noisy label into a latent space, and then feeds their sum into another fully-connected layer followed by a sigmoid function.
This design learns a posterior label from the prior label judgement and the noisy label through a nonlinear combination implemented by a neural network. Previous methods [34, 23, 24, 28, 29] use a weight in their loss function to linearly combine the noisy label with a "soft" label obtained from the prediction, a clean dataset or other side information. They usually require nontrivial manual tuning, whereas our combination is learned automatically by the network.
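A forward pass of the additive layer can be sketched in the same spirit. The layer sizes and the identity weights below are hypothetical toy values, chosen only so that the output is easy to follow.

```python
import math

def linear(weights, bias, vec):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def additive_layer(prior_pred, noisy_label, fc_pred, fc_label, fc_out):
    """Return one Bernoulli probability per category for the posterior label."""
    h_pred = linear(fc_pred[0], fc_pred[1], prior_pred)       # non-shared layer 1
    h_label = linear(fc_label[0], fc_label[1], noisy_label)   # non-shared layer 2
    summed = [a + b for a, b in zip(h_pred, h_label)]         # addition in latent space
    return [sigmoid(v) for v in linear(fc_out[0], fc_out[1], summed)]

# Toy parameters (illustrative only): identity weights, zero biases.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
posterior = additive_layer([0.9, 0.2], [1.0, 0.0], identity, identity, identity)
```

In a trained network the three layers would have learned weights, so the balance between prediction and annotation is determined by training rather than by a hand-tuned mixing weight.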
The whole network can be trained end-to-end, as explained in the next section. During training, the noise effect is reduced by the quality-variable branch, and at the same time the posterior label is estimated by the additive layer to ensure more reliable training. We demonstrate the effectiveness of our network in the experiments.
III-D Optimization
In this section, we analyze the difficulty of the optimization and derive an SGD algorithm with reparameterization tricks.
III-D1 The reparameterization tricks
The first term on the right-hand side of Eq. (4) has no closed form when either of the variational distributions is not conjugate to the noise model, let alone when these distributions are modeled with deep neural networks, as in this paper. The general remedy is Monte Carlo sampling. However, Paisley et al. [31] have shown that, when the derivative is taken with respect to the parameters of the sampling distributions, the sampling-based estimate exhibits high variance. In this case, a large number of samples is required for an accurate estimate, which leads to high GPU load and computational burden. Fortunately, reparameterization tricks [39, 40] have been explored in recent years to overcome this difficulty, and they have shown promising efficiency in discrete and continuous representation learning. In brief, the idea behind reparameterization is to decompose the integration variable into a parameter-related part and a parameter-free variate. After integration by substitution, Monte Carlo sampling on the parameter-free variate has a small variance. Accordingly, we apply the reparameterization trick of [40] to the discrete latent label and that of [39] to the continuous quality variable.
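In these reparameterizations, a temperature controls the discreteness of the samples. The Gumbel noise terms are obtained by transforming draws from Uniform(0,1) and, together with the Gaussian noise, are the parameter-free variates, while the distribution parameters produced by the encoder are the parameter-related parts. With the above reparameterization tricks, we have the following low-variance sampling estimate,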
(5) 
where the number of samples of the latent label and the quality variable is specified for each image. Based on Eq. (5), the first term on the right-hand side of Eq. (4) can be estimated efficiently, even when the number of samples is set to 1 during training.
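The two reparameterizations can be sketched with the standard library alone. The sketch assumes the binary form of the Gumbel-softmax relaxation for each Bernoulli label and a per-dimension Gaussian for the quality variable; the function names and default values are illustrative.

```python
import math
import random

def sample_gumbel(rng):
    """Gumbel(0, 1) noise obtained by transforming a Uniform(0, 1) draw."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def relaxed_bernoulli_sample(prob, temperature, rng):
    """Relaxed sample in (0, 1) for a label with probability prob of being 1."""
    prob = min(max(prob, 1e-12), 1.0 - 1e-12)
    on = (math.log(prob) + sample_gumbel(rng)) / temperature
    off = (math.log(1.0 - prob) + sample_gumbel(rng)) / temperature
    top = max(on, off)                                  # subtract max for stability
    e_on, e_off = math.exp(on - top), math.exp(off - top)
    return e_on / (e_on + e_off)

def gaussian_sample(mean, variance, rng):
    """mean + std * eps, where eps ~ N(0, 1) is the parameter-free variate."""
    return mean + math.sqrt(variance) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
label_samples = [relaxed_bernoulli_sample(0.99, 0.1, rng) for _ in range(2000)]
quality_samples = [gaussian_sample(2.0, 0.25, rng) for _ in range(5000)]
```

In both functions the randomness enters only through noise that does not depend on the distribution parameters, so gradients with respect to `prob`, `mean` and `variance` can pass through the sample in an automatic differentiation framework.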
III-D2 Stochastic variational gradient
The remaining terms on the right-hand side of Eq. (4) can be computed explicitly, and we present their derivation in the appendix. Substituting Eq. (5) and Eq. (A) (in the appendix) back into Eq. (4), the objective is differentiable with respect to the parameters of all distributions. We can therefore learn the parameters of each distribution with an SGD algorithm, even though they are all modeled with deep neural networks. This is important for deep learning, especially on large datasets. Denoting the parameters of the classifier network, the noise network, the annotation quality network and the latent label network separately, their gradients can be computed with the chain rule as follows.
(6) 
where an abbreviated notation is used to save space. Note that, although we have the above gradients for CAN, two undesirable problems arise in the optimization: (1) it is not easy to precisely decouple the backpropagated information between the latent label and the quality variable, i.e., to squeeze out the clean label information for the latent label and leave the quality-related information to the quality variable; (2) the label order of the latent label and that of the noisy label may become inconsistent during optimization. For example, the category in the first dimension of the latent label may correspond to the category in the second dimension of the noisy label. To avoid these two problems, we asymmetrically inject auxiliary information into the optimization procedure in an annealing way, that is, we replace the corresponding gradient with Eq. (7).
(7) 
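where the auxiliary term is the gradient of the cross-entropy loss between the estimated latent label and the noisy label, weighted by a time-varying coefficient. In this equation, the update is initially determined by the cross-entropy gradient and then progressively anneals to the original gradient as training proceeds. This guarantees the decoupling in backpropagation by imposing an asymmetric constraint on the latent label, and keeps the label order of the latent label and the noisy label consistent during optimization.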
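The annealing idea can be sketched as a convex blend of two gradients. The exact form of Eq. (7) and of the time-varying coefficient is not reproduced here, so the exponential schedule, the `decay` constant and the function names below are assumptions made purely for illustration.

```python
import math

def annealing_coefficient(step, decay=1e-3):
    """Hypothetical schedule: equals 1 at step 0 and decays toward 0."""
    return math.exp(-decay * step)

def annealed_gradient(grad_model, grad_cross_entropy, step, decay=1e-3):
    """Blend the auxiliary cross-entropy gradient with the model gradient."""
    alpha = annealing_coefficient(step, decay)
    return [
        alpha * g_ce + (1.0 - alpha) * g
        for g, g_ce in zip(grad_model, grad_cross_entropy)
    ]

early = annealed_gradient([1.0, -1.0], [0.2, 0.4], step=0)
late = annealed_gradient([1.0, -1.0], [0.2, 0.4], step=100000)
```

Early in training the update follows the cross-entropy signal from the noisy labels, which ties each dimension of the latent label to the intended category, and later it follows the gradient of the full objective.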
Table II: Per-category AP and mAP (%) on V07TE for models trained on WEB.
Model  aer  bik  brd  boa  btl  bus  car  cat  cha  cow  tbl  dog  hrs  mbk  prs  plt  shp  sfa  trn  tv  mAP 
ResnetN  93.5  85.3  90.1  85.1  51.2  82.3  84.8  91.2  59.3  87.1  72.1  88.7  91.3  88.9  76.1  54.4  87.6  70.0  90.4  61.4  79.5 
LearnQ  92.8  86.1  91.0  87.8  50.2  84.9  85.1  90.9  59.2  88.3  71.1  90.1  91.2  88.1  78.3  56.6  89.1  73.1  90.7  64.3  80.4 
ICNM  92.5  86.2  90.5  87.9  47.7  84.0  84.8  90.6  59.8  88.3  72.7  89.8  91.5  87.2  77.0  57.0  88.9  71.5  91.2  65.7  80.3 
Bootstrap  94.0  88.4  90.3  88.2  51.7  83.8  86.5  91.0  65.4  88.0  77.4  90.4  91.8  90.8  79.8  55.2  92.8  75.2  90.8  66.4  81.9 
CAN  95.5  87.0  91.4  89.9  60.1  85.5  87.6  92.0  67.2  90.1  77.7  91.8  93.3  90.6  82.1  56.0  93.6  80.7  94.5  70.6  83.8 
Table III: Per-category AP and mAP (%) on V12TE for models trained on WEB.
Model  aer  bik  brd  boa  btl  bus  car  cat  cha  cow  tbl  dog  hrs  mbk  prs  plt  shp  sfa  trn  tv  mAP 
ResnetN  98.4  81.1  92.9  88.7  57.0  87.4  73.2  96.6  63.3  90.0  63.9  94.3  95.0  92.9  76.8  43.8  92.9  67.2  93.1  65.1  80.7 
LearnQ  98.4  83.8  93.8  88.5  53.5  87.8  73.7  96.5  64.3  90.6  62.6  94.6  96.1  91.6  78.4  46.8  92.8  69.0  94.0  65.4  81.1 
ICNM  98.1  82.9  93.6  88.9  53.4  87.7  72.3  96.2  64.7  91.2  66.3  94.2  96.2  91.4  78.0  44.0  93.5  69.3  94.4  66.9  81.2 
Bootstrap  98.6  84.1  93.6  90.9  56.3  89.8  75.5  96.3  69.8  91.6  69.9  94.4  95.8  93.2  82.2  43.2  92.8  70.9  95.4  67.4  82.6 
CAN  98.8  84.1  95.3  93.2  62.1  90.8  77.0  97.9  72.6  94.4  73.5  96.1  97.7  94.3  82.4  45.5  95.8  71.4  95.8  68.6  84.4 
The optimization procedure can be interpreted as a probabilistic autoencoder [39]. However, our model differs from the traditional autoencoder, as illustrated in Fig. 4. A conventional autoencoder is symmetric: the observed knowledge is encoded into latent variables and decoded back to itself. For instance, in Fig. 4(a), the observation is encoded into latent variables, which are then used to decode the same observation. This design is usually used in generative models and their applications, such as image generation [41, 42]. In Fig. 4(b), our model uses an auxiliary variable in the encoding-decoding procedure: the image and the noisy label are used to encode the latent label and the quality variable, which are then used to decode only the noisy label. At the same time, a discriminative model is involved and jointly optimized with our autoencoder.
IV Experiments
In this section, we conduct quantitative and qualitative experiments to show the superiority of CAN in classification. Specifically, we compare CAN with state-of-the-art methods and investigate its performance with respect to training size, hyperparameter sensitivity and artificial noise. To give deeper insight into how CAN works, we analyze the quality embedding, the latent label estimation and the noise transition in the network.
IV-A Datasets
We use five image datasets in the experiments.
WEB (https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67): This dataset is a subset of YFCC100M [43], collected from a social image-sharing website. It is formed by randomly selecting images from YFCC100M that belong to the 20 categories of PASCAL VOC [44]. The statistics of this dataset are shown in the left panel of Fig. 5. There are 97,836 samples in total, and the number of samples in each category ranges from 4k to 8k. Most images in this dataset belong to one class, and about 10k images belong to two or more. Labels in this dataset may contain annotation errors.
AMT (https://www.microsoft.com/enus/research/publication/learningfromthewisdomofcrowdsbyminimaxentropy/): This dataset was collected by Zhou et al. [18] from the Amazon Mechanical Turk platform. They submitted images of 4 dog breeds from the Stanford Dog dataset [45] to Turkers and collected their annotations. To ease classification, Zhou et al. also provide a 5376-dimensional feature for each image. The statistics of this dataset are illustrated in the right panel of Fig. 5. There are 7,354 samples in total, and the number of samples in each category is between 1k and 2k. All images in this dataset belong to one class. Labels in this dataset may contain annotation errors.
V07 (http://host.robots.ox.ac.uk/pascal/VOC/voc2007/): This dataset is provided for the 20-category classification task in the PASCAL VOC Challenge 2007 [44]. It consists of two subsets: training (V07TR) and test (V07TE). There are 5,011 samples in V07TR and 4,592 samples in V07TE. All labels in this dataset are clean.
V12 (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/): This dataset is provided for the 20-category classification task in the PASCAL VOC Challenge 2012 [46]. It consists of two subsets: training (V12TR) and test (V12TE). There are 11,540 samples in V12TR and 10,991 samples in V12TE. All labels in this dataset are clean.
SD4 (http://vision.stanford.edu/aditya86/ImageNetDogs/): This last dataset consists of 4 categories of dogs (the same as in [18]) from the Stanford Dog dataset [45]. It is a fine-grained categorization dataset with 837 samples in total. We randomly partition the samples into training (SD4TR) and test (SD4TE) subsets. All labels in this dataset are clean.
IV-B Experimental Setup
For the WEB, V07 and V12 datasets, a 34-layer residual network [4] is adopted as the convolutional network in CAN, and the same configuration is applied to all baselines for fairness. In the training phase, we first resize the short side of each image to 224 and then follow the transformations of the residual network implementation (https://github.com/facebook/fb.resnet.torch) to preprocess images. In the test phase, we average the results over six crops of each image as the final prediction. For the AMT and SD4 datasets, we directly use the features provided by [18]. Hence, a 3-layer perceptron (5376-1024, ReLU, 1024-30, ReLU, 30-4) is adopted in place of the convolutional network in CAN. Both the temperature in the Gumbel-softmax function and the annealing coefficient in Eq. (7) vary during training according to the same formula. The number of samples in the sampler is set to 1 following [39]. The regularization coefficient is empirically set to 0.3. The batch size is set to 50, and the learning rate starts from 0.01 and is divided by 10 every 30 epochs. All experiments run for 90 epochs. As evaluation metrics, we adopt Average Precision (AP) and mean Average Precision (mAP), as in [44, 46].
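As a reference for the evaluation metric, the following sketch computes a non-interpolated average precision for one category and the mean over categories. The official PASCAL VOC protocols define their own interpolated variants, so this is an illustrative simplification rather than the exact evaluation code; the function names are hypothetical.

```python
def average_precision(scores, labels):
    """Mean of the precision values measured at each positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(score_columns, label_columns):
    """Average the per-category AP values."""
    aps = [average_precision(s, l) for s, l in zip(score_columns, label_columns)]
    return sum(aps) / len(aps)

# Two toy categories scored on the same four images.
ap_mixed = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
ap_perfect = average_precision([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
toy_map = mean_average_precision(
    [[0.9, 0.8, 0.7, 0.6], [0.9, 0.8, 0.7, 0.6]],
    [[1, 0, 1, 0], [1, 1, 0, 0]],
)
```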
In the following sections on "model comparison", "impact of training size" and "hyperparameter sensitivity", we train all models on the WEB and AMT datasets and test them on the V07TE, V12TE and SD4TE datasets. Note that models trained on the WEB dataset are evaluated on both V07TE and V12TE, since these datasets share the same categories, while models trained on the AMT dataset are evaluated only on SD4TE. For the "artificial noise" section, we first add a controlled amount of noise to the V07TR, V12TR and SD4TR datasets and then train all models on them. Finally, we test the models on the V07TE, V12TE and SD4TE datasets.
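One simple way to add a controlled amount of noise to clean labels is to replace each label with a different class at a fixed rate. The corruption procedure used in the experiments is not detailed in this section, so the uniform flipping, the single-label representation and the function name below are assumptions for illustration only.

```python
import random

def corrupt_labels(labels, num_classes, noise_ratio, seed=0):
    """Replace each class index with a different class with probability noise_ratio."""
    rng = random.Random(seed)
    noisy = []
    for label in labels:
        if rng.random() < noise_ratio:
            # Draw uniformly from the other classes (assumed noise model).
            choices = [c for c in range(num_classes) if c != label]
            noisy.append(rng.choice(choices))
        else:
            noisy.append(label)
    return noisy

clean = [0, 1, 2, 3] * 25
untouched = corrupt_labels(clean, num_classes=4, noise_ratio=0.0)
all_flipped = corrupt_labels(clean, num_classes=4, noise_ratio=1.0)
```

Fixing the seed keeps the corrupted training set identical across the compared models, which makes results at the same noise level comparable.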
IV-C Classification Results
IV-C1 Training with real-world noisy datasets
To demonstrate the effectiveness of the proposed method in classification, we compare CAN with three state-of-the-art approaches: LearnQ [14], ICNM [22] and Bootstrap [24]. In addition, two baselines, ResnetN and MLPN, are included, which directly train the 34-layer residual network and the 3-layer perceptron on the WEB dataset and the AMT dataset, respectively. The classification performance for each category on the V07TE, V12TE and SD4TE datasets is reported in Tables II, III and IV.
From the results in Tables II and III, we find that CAN outperforms all baselines in terms of mAP and shows improvement in almost all categories. For example, on the V07TE dataset, CAN achieves 83.8 mAP, which outperforms ResnetN by 4.3 mAP and the best baseline, Bootstrap, by 1.9 mAP. In challenging categories such as "bottle", "chair" and "sofa", it also achieves significant improvement. In contrast, although the results of LearnQ, ICNM and Bootstrap are better than those of ResnetN, their improvement is limited. Similarly, in Table IV, CAN outperforms the baselines by at least 2.8 mAP, while LearnQ, ICNM and Bootstrap improve by only about 1.6 mAP over MLPN.
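The margins quoted above follow directly from the mAP columns of the result tables. The short check below recomputes them from the reported values; the variable names are illustrative.

```python
# mAP values copied from the mAP columns of Table II (V07TE) and Table IV (SD4TE).
v07_map = {"ResnetN": 79.5, "LearnQ": 80.4, "ICNM": 80.3, "Bootstrap": 81.9, "CAN": 83.8}
sd4_map = {"MLPN": 77.2, "LearnQ": 78.7, "ICNM": 78.9, "Bootstrap": 78.8, "CAN": 81.7}

# Gains of CAN on V07TE over the plain network and over the best baseline.
v07_gain_over_resnet = round(v07_map["CAN"] - v07_map["ResnetN"], 1)
v07_gain_over_best = round(
    v07_map["CAN"] - max(v for k, v in v07_map.items() if k != "CAN"), 1
)

# Gains on SD4TE: CAN over the best baseline, and each noise-aware baseline over MLPN.
sd4_gain_over_best = round(
    sd4_map["CAN"] - max(v for k, v in sd4_map.items() if k != "CAN"), 1
)
sd4_baseline_gains = {
    k: round(v - sd4_map["MLPN"], 1)
    for k, v in sd4_map.items()
    if k not in ("MLPN", "CAN")
}
```

The three noise-aware baselines gain between 1.5 and 1.7 mAP over MLPN, which averages to the "about 1.6" stated in the text.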
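Table IV: AP per category and mAP on the SD4TE dataset.
| Model | nft | nwt | iwh | swh | mAP |
|---|---|---|---|---|---|
| MLPN | 78.1 | 73.2 | 80.9 | 76.5 | 77.2 |
| LearnQ | 80.5 | 73.7 | 83.0 | 77.7 | 78.7 |
| ICNM | 80.5 | 72.8 | 83.9 | 78.3 | 78.9 |
| Bootstrap | 80.7 | 72.5 | 83.7 | 78.1 | 78.8 |
| CAN | 82.0 | 79.0 | 81.8 | 83.8 | 81.7 |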
Based on the above experiments, we offer the following interpretations. (1) LearnQ and ICNM, which only introduce the latent label to handle label noise, cannot sufficiently prevent noise from degrading the classifier. (2) Bootstrap shares with CAN the idea of estimating the posterior label for training. However, its loss function uses a linear combination of predictions and noisy labels, which still cannot prevent errors caused by label noise from being back-propagated. (3) Our approach, which on the one hand models the trustworthiness of noisy labels to reduce the noise effect, and on the other hand estimates the latent label from the posterior perspective to train the classifier, shows better classification performance.
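Table V: mAP of CAN on each test set for different values of the regularizer coefficient (column headers).
| Dataset | 0 | 0.2 | 0.5 | 1 | 2 | 5 | 10 |
|---|---|---|---|---|---|---|---|
| V07TE | 82.9 | 83.5 | 84.8 | 83.6 | 80.7 | 78.8 | 77.0 |
| V12TE | 84.3 | 85.2 | 84.1 | 83.0 | 80.8 | 78.3 | 76.6 |
| SD4TE | 78.6 | 80.7 | 80.4 | 79.9 | 76.4 | 73.9 | 71.3 |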
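Table VI: Classification performance under different corruption probabilities P. Columns are grouped by test set (V07TE, V12TE, SD4TE).
| P | V07TE ResnetN | V07TE LearnQ | V07TE ICNM | V07TE Bootstrap | V07TE CAN | V12TE ResnetN | V12TE LearnQ | V12TE ICNM | V12TE Bootstrap | V12TE CAN | SD4TE MLPN | SD4TE LearnQ | SD4TE ICNM | SD4TE Bootstrap | SD4TE CAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 6.4 | 9.1 | 9.2 | 8.9 | 8.6 | 5.2 | 8.4 | 8.4 | 8.2 | 10.5 | 29.6 | 26.9 | 27.0 | 27.8 | 30.1 |
| 0.8 | 33.4 | 28.0 | 28.5 | 30.1 | 36.1 | 26.6 | 23.7 | 23.8 | 25.1 | 28.0 | 41.6 | 39.6 | 39.7 | 38.6 | 49.7 |
| 0.6 | 53.0 | 56.4 | 57.0 | 59.3 | 63.2 | 49.2 | 49.7 | 49.6 | 51.8 | 55.3 | 51.5 | 60.4 | 60.8 | 58.7 | 63.9 |
| 0.4 | 70.2 | 72.0 | 71.6 | 73.3 | 79.4 | 69.0 | 70.3 | 70.5 | 72.6 | 78.4 | 73.4 | 72.7 | 73.1 | 73.5 | 77.1 |
| 0.2 | 78.2 | 80.1 | 79.6 | 81.0 | 83.6 | 80.0 | 81.3 | 81.4 | 82.2 | 84.5 | 86.1 | 89.0 | 89.2 | 89.3 | 91.1 |
| 0.0 | 86.8 | 85.4 | 85.4 | 85.5 | 85.3 | 89.7 | 88.3 | 88.3 | 88.5 | 87.3 | 96.4 | 95.9 | 95.8 | 96.2 | 94.3 |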
IV-C2 Impact of training size
To explore the reliability of the proposed method as the training size changes, we compare CAN with the other methods on datasets of different scales. We randomly sample subsets of the WEB and AMT datasets at different ratios for training, and illustrate the results of all methods on V07TE, V12TE and SD4TE in Fig. 6.
As Fig. 6 shows, the results of all methods on these datasets decline as the training size decreases. However, CAN consistently performs better than the other models. For instance, in the left panel of Fig. 6, when the training size is 20%, CAN achieves 81.0 mAP on the V07TE dataset, while ICNM and LearnQ are even worse than the simplest model, ResnetN (79.4 mAP). Similar evidence can be found in the middle and right panels. These results demonstrate the reliability of CAN on datasets of different scales.
In Fig. 6, we also find that the decline on the SD4TE dataset is more pronounced than that on the V07TE and V12TE datasets. This is because, even for the 20% subset, about 20k training samples remain in the WEB dataset, whereas only about 1.6k samples remain in the AMT dataset, which may be too few to learn the classifier.
IV-C3 Hyperparameter sensitivity
To investigate the reliability of CAN under different regularizer coefficients, we set the coefficient to 0, 0.2, 0.5, 1, 2, 5 and 10 and validate its effect in each case. The results are shown in Table V. From this table, we find that the performance on all datasets first rises to a peak and then gradually decreases as the coefficient increases. For example, CAN achieves 85.2 mAP on the V12TE dataset when the coefficient is 0.2, but drops significantly to 76.6 mAP when it is 10. This indicates that (1) a regularizer of proper strength encourages our model to find a good solution; and (2) too strong a regularization may drive the solution away from the optimum. Empirically, setting the coefficient between 0 and 1 makes the variational mutual information regularizer collaborate well with the KL-divergences.
IV-C4 Controlled experiments with artificial noise
In the previous sections, all models are trained on WEB and AMT with their given noise, which does not reveal how the models behave at different noise levels. To show the superiority of CAN, we add a controlled amount of noise to the V07TR, V12TR and SD4TR datasets for training, and then compare the classification performance of all models on the V07TE, V12TE and SD4TE datasets. Noise is added by setting a corruption probability P, which randomly decides whether or not the elements of each clean label vector are shuffled. We list the model performance under different settings of P in Table VI.
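A minimal sketch of this corruption procedure is given below, under stated assumptions: the helper name `corrupt_labels`, the fixed seed and the use of Python's `random` module are illustrative choices, and "shuffle" is taken to mean a uniformly random permutation of the entries of a label vector.

```python
import random


def corrupt_labels(label_vectors, p, seed=0):
    """With probability p, shuffle the entries of each clean label vector.

    Assumption: each vector is a multi-hot list such as [1, 0, 0, 1].
    """
    rng = random.Random(seed)
    noisy = []
    for vector in label_vectors:
        vector = list(vector)      # copy so the clean labels stay intact
        if rng.random() < p:
            rng.shuffle(vector)    # may occasionally reproduce the original order
        noisy.append(vector)
    return noisy


clean = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
noisy = corrupt_labels(clean, p=0.6)
```

Since shuffling only permutes entries, the number of positive labels per sample is preserved; with P=0 the labels are returned unchanged, and with P=1 every vector is passed through the shuffle.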
As shown in Table VI, when the corruption probability P=1.0, the classification results of all models are close to random. As P varies from 1.0 to 0, all models improve, since some clean samples become available for training. In particular, when P is set to 0.8, 0.6, 0.4 or 0.2, CAN robustly outperforms the other baselines. However, when the training data becomes purely clean, i.e., P=0, all noise-aware models are worse than ResnetN and MLPN. Table VI indicates that (1) the performance of all existing models is strongly related to the noise level in the datasets, and all noise-aware models perform poorly under heavy noise; (2) when the training data is clean, noise-aware models may be worse than models that do not consider noise; and (3) CAN shows advantages over existing methods at different noise levels.
IV-D Model Visualization
To give deeper insight into how CAN works, in this section we present a qualitative analysis of quality embedding, latent label estimation and noise transition in CAN.
IV-D1 Quality embedding
The quality variable is estimated in the embedding space by the contrastive layer. To visualize this mechanism, we forward all the training samples through CAN to compute their quality embeddings. By checking the consistency between the prior prediction (thresholded at 0.5) and the noisy label, we then binarize each embedding as a trustworthy or non-trustworthy embedding. Considering only the Gaussian mean of each quality variable together with the embedding type, we obtain a low-dimensional visualization of the quality embedding with the t-SNE package [47].
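The consistency check used to binarize the embeddings can be sketched as follows. This is only an illustration: the function name is hypothetical, labels are assumed to be 0/1 for a single category, and the source does not say whether a prediction exactly equal to 0.5 counts as positive (a strict inequality is assumed here).

```python
def is_trustworthy(prior_probability, noisy_label, threshold=0.5):
    """True when the thresholded prior prediction agrees with the noisy label."""
    predicted = 1 if prior_probability > threshold else 0
    return predicted == noisy_label


# (prior prediction, noisy label) pairs for one category; values are made up.
samples = [(0.92, 1), (0.10, 1), (0.30, 0), (0.75, 0)]
flags = [is_trustworthy(p, y) for p, y in samples]
```

Each flag would then serve as the embedding type that is shown alongside the Gaussian mean of the corresponding quality variable in the low-dimensional plot.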
In Fig. 7, two exemplar categories, “aeroplane” and “bike”, from the WEB dataset and two exemplar categories, “Norfolk Terrier” and “Norwich Terrier”, from the AMT dataset are presented. As shown in Fig. 7, the embedding in each category exhibits two distinguishable clusters. This indicates that CAN can identify mismatches between latent labels and noisy labels and selectively embed the quality variable into different subspaces based on the training samples. Thus, label noise can be effectively reduced with the aid of the quality variable.
In addition, we find that the embeddings for the first two categories in Fig. 7 are better than those for the last two. This is because the categories in the WEB and AMT datasets differ notably in the number and diversity of training samples. For example, there are about 4,200 different images and annotations in “aeroplane”, while there are only about 200 different images and 1,300 annotations in “Norfolk Terrier”. Thus, the embedding in the first two categories is uniformly distributed, whereas in the last two categories it is discretely cluttered.
IV-D2 Latent label estimation
The latent label is estimated from the posterior perspective by the additive layer. To visualize this estimation, we forward all the training samples through CAN to compute the output of the additive layer. In Fig. 8, we present 20 examples from the WEB dataset and 8 examples from the AMT dataset.
From Fig. 8, we observe that (1) the annotations in the WEB dataset may be totally unrelated to the image content, e.g., “bottle” for the first aeroplane image; and (2) in AMT, the Turkers also assign wrong labels to the fine-grained images. The former error usually comes from the batch annotation function provided by the Flickr website, and the latter usually from the limited domain knowledge of the Turkers. Nevertheless, the estimation shows that our additive layer still successfully rectifies the wrong labels. Thus, by training on these latent labels, CAN achieves better performance than the baselines.
IV-D3 Noise transition
To explore how the quality embedding mediates the mismatch between latent labels and noisy labels, we investigate the transition patterns between them. First, we forward all the training samples through CAN to compute quality embeddings and latent labels. Second, we use K-means to binarize the quality embeddings (considering only the Gaussian mean) into trustworthy and non-trustworthy embeddings. Third, we count the transitions from latent labels to noisy labels conditioned on the two types of embeddings. In Fig. 9, we plot the two transition patterns as heatmaps for the WEB dataset and the AMT dataset, respectively.
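The third step, counting transitions conditioned on the embedding type, can be sketched as below. The sketch makes simplifying assumptions that are not stated in the source: each sample carries a single latent class index and a single noisy class index, and the trustworthy flags are taken as given (in the procedure above they would come from the K-means step). All names are hypothetical.

```python
def transition_counts(latent, noisy, trusted, num_classes):
    """Count latent -> noisy label transitions separately per embedding type.

    Returns a dict with one num_classes x num_classes count matrix for
    trustworthy (True) and one for non-trustworthy (False) embeddings;
    rows index the latent label, columns the noisy label.
    """
    counts = {
        True: [[0] * num_classes for _ in range(num_classes)],
        False: [[0] * num_classes for _ in range(num_classes)],
    }
    for z, y, t in zip(latent, noisy, trusted):
        counts[t][z][y] += 1
    return counts


# Toy data for three classes; values are made up for illustration.
latent = [0, 0, 1, 1, 2, 2]
noisy = [0, 0, 1, 2, 2, 0]
trusted = [True, True, True, False, True, False]
counts = transition_counts(latent, noisy, trusted, num_classes=3)
```

In this toy example the trustworthy matrix is purely diagonal, while the non-trustworthy matrix holds only off-diagonal mass, which mirrors the qualitative contrast between the two kinds of heatmaps.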
As shown for the WEB dataset in Fig. 9, the diagonal of the transition pattern conditioned on trustworthy embedding is dominant. In this case, noisy labels are considered reliable, and thus transitions should mainly happen between identical labels. In contrast, the transition pattern conditioned on non-trustworthy embedding is diffuse, because in this case noisy labels are considered incorrect and transitions usually happen between different labels. The transition patterns on the AMT dataset in Fig. 9 have the same characteristics. Fig. 9 indicates that CAN relies on the quality embedding to automatically disturb the latent label to match the noisy label.
The transition pattern conditioned on non-trustworthy embedding usually reflects real-world noise, and some interesting patterns can be found. For instance, according to the second panel of Fig. 9, the “plt” class has few transitions to other classes, while the transition between “prs” and “tv” has a high value. This suggests that (1) people who upload “pottedplant” images to social websites almost never annotate them wrongly; and (2) for “tv” images, some people focus on the persons in the TV program, while others pay attention to the TV itself. Similarly, in the fourth panel of Fig. 9, transitions on AMT usually occur between dogs of similar appearance, i.e., “Norfolk Terrier” and “Norwich Terrier”, or “Irish Wolfhound” and “Scottish Wolfhound”. This reflects that the breeds within each of these pairs are, in some sense, harder to distinguish than other pairs.
V Conclusion
In this paper, we present a quality embedding model for learning a classifier from noisy image labels, which effectively avoids back-propagating errors from label noise. To instantiate the model, a Contrastive-Additive Noise network is designed. For parameter estimation, we derive an efficient SGD optimization algorithm by applying recent discrete and continuous reparameterization tricks. We demonstrate that our model outperforms other noise-aware deep learning methods on several noisy training datasets. In addition, detailed visualization of three key parts is presented to give deeper insight into our model. However, we only validate our model on image data in this paper; other types of content can be explored in future work.
Appendix A Computation for KL-divergences and regularizers
The remaining four terms in the RHS of Eq. (4) can be calculated without sampling. For example, for the latent label , both and are two-dimensional multinomial probabilities. Their KL-divergence term and regularizer can be simplified by enumerating each dimension. For the quality variable , it is from the dimensional Gaussian space whose parameters are implicitly modeled with a network of input and . If we assume its prior is like [39], it is easy to compute their KL-divergence and the regularizer due to the conjugation. In Eq. (8), we give their simplifications respectively.
(8) 
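As a hedged illustration of the kind of closed forms referred to above, and not a reconstruction of Eq. (8), consider two standard cases. The notation is introduced here for illustration only: a diagonal Gaussian posterior with mean \mu and variance \sigma^2 in d dimensions against a standard normal prior (the prior used in [39], assumed here), and a two-state latent label with posterior probability q against prior probability p. Both KL-divergences can then be evaluated without sampling:

```latex
\begin{align*}
\mathrm{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  &= \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right), \\
\mathrm{KL}\big(\mathrm{Bern}(q) \,\|\, \mathrm{Bern}(p)\big)
  &= q \log \frac{q}{p} + (1 - q) \log \frac{1 - q}{1 - p}.
\end{align*}
```

The second expression is exactly the "enumerate each dimension" computation for a two-dimensional multinomial: one term per state of the latent label.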
Acknowledgment
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
 [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014.
 [3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [5] C. Wang, W. Ren, K. Huang, and T. Tan, “Weakly supervised object localization with latent category learning,” in European Conference on Computer Vision. Springer, 2014, pp. 431–445.
 [6] H. Bilen and A. Vedaldi, “Weakly supervised deep detection networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2846–2854.
 [7] L. Wang, G. Hua, J. Xue, Z. Gao, and N. Zheng, “Joint segmentation and recognition of categorized objects from noisy web image collection,” IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 4070–4086, 2014.
 [8] W. Zhang, S. Zeng, D. Wang, and X. Xue, “Weakly supervised semantic segmentation for social images,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [9] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele, “Simple does it: Weakly supervised instance and semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [10] Z. Lu, Z. Fu, T. Xiang, P. Han, L. Wang, and X. Gao, “Learning from weak and noisy labels for semantic segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 3, pp. 486–500, 2017.
 [11] S. K. Divvala, A. Farhadi, and C. Guestrin, “Learning everything about anything: Webly-supervised visual concept learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
 [12] X. Chen and A. Gupta, “Webly supervised learning of convolutional networks,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1431–1439.
 [13] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017.
 [14] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus, “Training convolutional networks with noisy labels,” Computer Science, 2015.
 [15] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy, “Learning from crowds,” Journal of Machine Learning Research, vol. 11, no. Apr, pp. 1297–1322, 2010.
 [16] N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari, “Learning with noisy labels,” Advances in Neural Information Processing Systems, vol. 26, pp. 1196–1204, 2013.
 [17] T. Liu and D. Tao, “Classification with noisy labels by importance reweighting,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 38, no. 3, p. 447, 2014.
 [18] D. Zhou, S. Basu, Y. Mao, and J. C. Platt, “Learning from the wisdom of crowds by minimax entropy,” in Advances in Neural Information Processing Systems, 2012, pp. 2195–2203.
 [19] B. Frenay and M. Verleysen, “Classification in the presence of label noise: A survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, May 2014.
 [20] V. Mnih and G. Hinton, “Learning to label aerial images from noisy data,” in International Conference on Machine Learning, 2012.
 [21] S. Azadi, J. Feng, S. Jegelka, and T. Darrell, “Auxiliary image regularization for deep cnns with noisy labels,” 2016.
 [22] I. Misra, C. Lawrence Zitnick, M. Mitchell, and R. Girshick, “Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2930–2939.
 [23] H. Izadinia, B. C. Russell, A. Farhadi, M. D. Hoffman, and A. Hertzmann, “Deep classifiers from image tags in the wild,” in The Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions, 2015, pp. 13–18.
 [24] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” Computer Science, 2014.
 [25] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu, “Making neural networks robust to label noise: a loss correction approach,” arXiv preprint arXiv:1609.03683, 2016.
 [26] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from massive noisy labeled data for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2691–2699.
 [27] I. Jindal, M. Nokleby, and X. Chen, “Learning deep networks from noisy labels with dropout regularization,” in Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 967–972.
 [28] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and J. Li, “Learning from noisy labels with distillation,” arXiv preprint arXiv:1703.02391, 2017.
 [29] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie, “Learning from noisy large-scale datasets with minimal supervision,” Computer Vision and Pattern Recognition (CVPR), 2017.
 [30] M. J. Wainwright, M. I. Jordan et al., “Graphical models, exponential families, and variational inference,” Foundations and Trends® in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
 [31] D. M. Blei, M. I. Jordan, and J. W. Paisley, “Variational bayesian inference with stochastic search,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), J. Langford and J. Pineau, Eds. New York, NY, USA: ACM, 2012, pp. 1367–1374.
 [32] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational inference: A review for statisticians,” Journal of the American Statistical Association, no. just-accepted, 2017.
 [33] D. F. Nettleton, A. Orriols-Puig, and A. Fornells, “A study of the effect of different types of noise on the precision of supervised learning techniques,” Artificial Intelligence Review, vol. 33, no. 4, pp. 275–306, 2010.
 [34] A. Joulin, L. V. D. Maaten, A. Jabri, and N. Vasilache, Learning Visual Features from Large Weakly Supervised Data. Springer International Publishing, 2015.
 [35] K. Ganchev, J. Gillenwater, B. Taskar et al., “Posterior regularization for structured latent variable models,” Journal of Machine Learning Research, vol. 11, no. Jul, pp. 2001–2049, 2010.
 [36] A. Krause, P. Perona, and R. G. Gomes, “Discriminative clustering by regularized information maximization,” in Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, Eds. Curran Associates, Inc., 2010, pp. 775–783.
 [37] J. Zhu, N. Chen, and E. P. Xing, “Bayesian inference with posterior regularization and applications to infinite latent svms,” Journal of Machine Learning Research, vol. 15, p. 1799, 2014.
 [38] X. Chen, X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 2172–2180.
 [39] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” stat, vol. 1050, p. 10, 2014.
 [40] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016.
 [41] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-vae: Learning basic visual concepts with a constrained variational framework,” 2016.
 [42] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with pixelcnn decoders,” in Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
 [43] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. J. Li, “The new data and new challenges in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2015.
 [44] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge 2007 results,” 2007.
 [45] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset for fine-grained image categorization,” in First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
 [46] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge 2012 results,” 2012.
 [47] L. Van Der Maaten, “Accelerating t-sne using tree-based algorithms.” Journal of machine learning research, vol. 15, no. 1, pp. 3221–3245, 2014.