Progressive Identification of True Labels for Partial-Label Learning
Abstract
Partial-label learning is one of the important weakly supervised learning problems, where each training example is equipped with a set of candidate labels that contains the true label. Most existing methods elaborately designed learning objectives as constrained optimizations that must be solved in specific manners, making their computational complexity a bottleneck for scaling up to big data. The goal of this paper is to propose a novel framework of partial-label learning without implicit assumptions on the model or optimization algorithm. More specifically, we propose a general estimator of the classification risk, theoretically analyze its classifier-consistency, and establish an estimation error bound. We then explore a progressive identification method for approximately minimizing the proposed risk estimator, where the update of the model and the identification of true labels are conducted in a seamless manner. The resulting algorithm is model-independent and loss-independent, and compatible with stochastic optimization. Thorough experiments demonstrate that it sets the new state of the art.
1 Introduction
In practice, the increasing demand for massive data makes it inevitable that not all examples are equipped with high-quality labels. The low efficiency of learning from weak supervision [1] becomes a critical bottleneck for supervised learning methods. We focus on an important type of weakly supervised learning problem called partial-label learning (PLL) [2, 3, 4], where each training example is associated with multiple candidate labels among which exactly one is true. This problem arises in many real-world tasks such as automatic image annotation [5, 6], web mining [7], ecoinformatics [8], etc.
Related research on PLL was pioneered by a multiple-label learning (MLL) approach [9]. Despite the same form of supervision information, i.e., each training example is assigned a set of candidate labels, a vital difference between MLL and PLL is that the goal of PLL is identifying the only true label among the candidate labels, whereas for MLL identifying an arbitrary label in the candidate label set is acceptable. In practical implementation, [9] formulated the learning objective by minimizing the KL divergence between the prior probability and the model-based conditional distribution, and solved the optimization by the Expectation-Maximization (EM) algorithm, resulting in a procedure that iterates between estimating the prior probability and training the model.
In PLL, the true labels of training examples are obscured by candidate labels that hinder the learning; the key to success is, therefore, identifying the true labels. Recall that [9] attempts to recover the prior information as the fitting target of the model, and then the label with maximum prior probability is naturally identified as the true label. This foundational work inspired successors to propose EM-based methods for reasonably recovering non-uniform prior information on the labels. Various efforts have been made to design learning objectives from the perspective of modifying learning models and loss functions, or adding proper regularizations and constraints. [10, 11] proposed a method to train linear dictionaries with sparse constraints, and then [12] extended the linear dictionary by the kernel trick. [13] adapted boosting techniques for maintaining the prior information in each iteration. In [14, 15], the optimization of the prior probability is formulated as a quadratic programming (QP) problem. In addition, maximum margin methods [2, 16] identified the true labels by defining a multi-class maximum margin problem [17] to be solved by some off-the-shelf SVM implementation [18]. Some non-parametric methods have also been proposed: [19] proposed a k-nearest-neighbor method, and [4] constructed a QP problem for better leveraging the underlying distribution of the data.
Most existing methods elaborately designed tricky constrained learning objectives that are coupled to specific optimization algorithms. A non-linear time complexity in the total volume of data becomes a major limiting factor when these methods face large amounts of training data. Furthermore, [20] proved that stochastic optimization algorithms, with which few of the existing methods are compatible, are the best choice for large-scale learning problems considering the estimation–optimization trade-off, so the PLL problem remains practically challenging.
In this paper, we would like to propose a novel framework of PLL without implicit assumptions on the model and optimization algorithm, for scaling to big data. For this purpose, we critically rethink whether the unconstrained method in [9] is worth further exploring.
Our answer is twofold. First, the decade-old method used a simple linear model fed by hand-crafted features, which lacks the power to represent and discriminate, especially when encountering large-scale learning problems; we also observe that almost all existing methods exploited non-deep models. Fortunately, the learning objective in [9] is easy to instantiate with deep neural networks, with an enormous benefit in practice [21]. Second, the question we explore is essentially whether we can advance the pioneering work without depending solely on the modification of the learning model. We first derive a general risk estimator leading to classifier-consistency, and then propose a new method that takes a step further towards this goal. We also prove that the proposed method is a generalization of [9]. We further experimentally verify its superiority on benchmark datasets corrupted by manual candidate labels, controlled UCI datasets, and real-world partial-label datasets. Our contributions can be summarized as follows:

Theoretically, we propose a new risk estimator for PLL and analyze its classifier-consistency, i.e., the classifier learned from partially labeled data converges to the optimal one learned from ordinarily supervised data under mild conditions. Then, we establish an estimation error bound for it.

Practically, we propose a progressive identification method for approximately optimizing the above risk estimator. The proposed method operates in a mini-batch training manner, where the update of the model and the identification of true labels are accomplished seamlessly. The resulting method is model-independent and loss-independent, and is compatible with stochastic optimization (e.g., [22, 23, 24]).
2 Basic Risk Estimator for PLL
In this section, we formulate PLL, propose a general risk estimator, and establish the estimation error bound.
2.1 Preliminaries
In ordinary multi-class classification, let $\mathcal{X} \subseteq \mathbb{R}^d$ be the instance space and $\mathcal{Y} = \{1, 2, \ldots, k\}$ be the label space, where $d$ is the feature space dimension and $k$ is the number of classes. Let $p(x, y)$ be the underlying joint distribution of the random variables $(x, y) \in \mathcal{X} \times \mathcal{Y}$. The goal is to learn a decision function $f: \mathcal{X} \to \mathbb{R}^k$ that minimizes the estimator of the classification risk:
$R(f) = \mathbb{E}_{p(x, y)}\big[\mathcal{L}(f(x), y)\big],$ (1)
where $\mathbb{E}_{p(x, y)}$ is short for the expectation over $p(x, y)$, $\mathcal{L}: \mathbb{R}^k \times \mathcal{Y} \to \mathbb{R}_{+}$ is the loss function, and $f_y(x)$ is an estimate of $p(y \mid x)$. Typically, the classifier is assumed to take the following form: $F(x) = \arg\max_{y \in \mathcal{Y}} f_y(x)$.
The hypothesis in the hypothesis space $\mathcal{F}$ with minimal risk is the optimal decision function. In this case, $f^\star = \arg\min_{f \in \mathcal{F}} R(f)$.
2.2 Risk Estimator of PLL
Next, we formulate the problem of PLL. The candidate label set is an element of the power set of $\mathcal{Y}$, whose cardinality is $2^k - 2$ (the empty set and the complete set are not included). Let $(x, \bar{Y})$ be the random variables defined on $\mathcal{X} \times \bar{\mathcal{Y}}$ with the joint distribution $\bar{p}(x, \bar{Y})$, which may be decomposed into the ordinary distribution $p(x, y)$ and the label-set conditional distribution $p(\bar{Y} \mid x, y)$. Note that in PLL, the true label $y$ is invisible to the learner, so we can only observe $(x, \bar{Y})$. In this way, we minimize the following risk for PLL: $\bar{R}(f) = \mathbb{E}_{\bar{p}(x, \bar{Y})}\big[\bar{\mathcal{L}}(f(x), \bar{Y})\big],$
where $\bar{Y} \in \bar{\mathcal{Y}}$ and $\bar{\mathcal{Y}}$ is the power set of $\mathcal{Y}$ (excluding the empty set and the complete set). $\bar{\mathcal{L}}$ is a loss function defined specially for PLL, and we will specify its form later. In the same way as in the ordinary multi-class case, we can obtain the optimal solution $\bar{f}^\star = \arg\min_{f \in \mathcal{F}} \bar{R}(f)$,
and we can further learn a classifier by $\bar{F}(x) = \arg\max_{y \in \mathcal{Y}} \bar{f}^\star_y(x)$.
Our task now becomes deriving a new loss function $\bar{\mathcal{L}}$ such that the learned $\bar{F}$ is equivalent to the $F$ learned in ordinary multi-class learning if the training sample size is sufficiently large.
Note that in $\bar{Y}$ there is only one true label. Imagine the case where the learned $f$ is good enough to approximate $p(y \mid x)$. Then the output associated with the true label will incur the minimal loss among all candidate labels. Motivated by this, we write our loss function as
$\bar{\mathcal{L}}(f(x), \bar{Y}) = \min_{y \in \bar{Y}} \mathcal{L}(f(x), y),$ (2)
which immediately leads to a new risk estimator, namely,
$\bar{R}(f) = \mathbb{E}_{\bar{p}(x, \bar{Y})}\big[\min_{y \in \bar{Y}} \mathcal{L}(f(x), y)\big].$ (3)
So what is the effect of optimizing the newly defined risk for PLL? To show the effect, we will prove that the proposed risk estimator ensures that $\bar{F}$ converges to $F$ under reasonable assumptions.
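To make the estimator concrete, Eq. (3) can be approximated on a finite sample in a few lines. The sketch below uses the cross-entropy as the base loss; the function name and array layout are illustrative, not part of the paper:

```python
import numpy as np

def pll_min_loss(logits, candidate_mask):
    """Empirical version of Eq. (3): for each example, take the minimal
    cross-entropy loss over its candidate labels, then average.
    logits: (n, k) real scores; candidate_mask: (n, k) boolean array."""
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    losses = -log_p                    # per-label cross-entropy losses
    losses[~candidate_mask] = np.inf   # non-candidates cannot attain the min
    return losses.min(axis=1).mean()
```

If the candidate set contains the model's favored label, the min picks that label's small loss; otherwise the smallest loss among the candidates is paid.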
Our first assumption is that our learning is conducted under the deterministic case.
Assumption 1.
Consider the general deterministic learning scenario in which the label $y$ of $x$ is uniquely determined by some measurable function. When $y$ is uniquely determined,
$p\big(y \in \bar{Y} \mid x, y\big) = 1.$ (4)
This is the basic assumption made by previous PLL works (e.g., [2, 3, 26, 4, 14]) that the true label must be included in the candidate label set.
By definition, the Bayes error is defined over all measurable functions, and a hypothesis achieving the Bayes error is a Bayes optimal decision function. Under Assumption 1, the Bayes error can reach the value $0$. In this way, if we use flexible models such as deep neural networks, our hypothesis space will be large enough to contain one classifier reaching the Bayes error, which can be as low as zero in this case.
Clearly, the true label is determined by $x$ and defined in terms of the conditional probabilities as $y = \arg\max_{y' \in \mathcal{Y}} p(y' \mid x)$. Then, according to Eq. (4), we can also have
$p\big(\arg\max_{y' \in \mathcal{Y}} p(y' \mid x) \in \bar{Y}\big) = 1.$ (5)
We make another assumption on the classifier learned in ordinary multi-class learning.
Assumption 2.
By minimizing the expected risk $R(f)$, we can obtain $f^\star_y(x) = p(y \mid x)$ for all $y \in \mathcal{Y}$.
Note that such an assumption can be satisfied when using the cross-entropy loss (Chapter 5 in [21]) or the mean squared error loss [27].
With the above assumptions, we can present our main theorem, which specifies that the classifier $F$ learned in ordinary multi-class learning and the classifier $\bar{F}$ learned in PLL by minimizing $\bar{R}(f)$ are equivalent.
Theorem 1. (Classifier-Consistency) Under Assumptions 1 and 2, the classifier $\bar{F}$ learned in PLL by minimizing $\bar{R}(f)$ is equivalent to the classifier $F$ learned in ordinary multi-class learning.
Theorem 1 is proved by substituting the optimal decision function $f^\star$ into $\bar{R}(f)$, which shows that $\bar{R}(f^\star) = 0$. Given that the hypothesis space is flexible enough, the Bayes error can be achieved, which means $\bar{R}(f)$ is minimized by $f^\star$, i.e., $\bar{f}^\star = f^\star$. This further ensures $\bar{F} = F$. A complete proof can be found in Appendix A.
2.3 Estimation Error Bound
In this section we establish the estimation error bound for the proposed estimator.
Assume that $\widehat{R}(f)$ is the empirical counterpart of $\bar{R}(f)$, and denote by $\hat{f}$ the optimal solution obtained by minimizing $\widehat{R}(f)$. We will upper bound the difference between $\bar{R}(\hat{f})$ and $\bar{R}(\bar{f}^\star)$ by upper bounding $\sup_{f \in \mathcal{F}} |\widehat{R}(f) - \bar{R}(f)|$. We have the following estimation error bound.
Theorem 2.
(Estimation Error Bound) Assume the loss function $\mathcal{L}(f(x), y)$ is Lipschitz continuous with respect to $f(x)$ with a Lipschitz constant $L_\ell$. Let $\mathfrak{R}_n(\mathcal{F})$ be the Rademacher complexity of $\mathcal{F}$ given sample size $n$, and let the loss function be upper bounded by $M$. Then, with probability at least $1 - \delta$, we have
To prove the estimation error bound, we first need the following conclusion between the estimation error bound and the generalization error bound.
Lemma 2.
The estimation error can be bounded by
$\bar{R}(\hat{f}) - \bar{R}(\bar{f}^\star) \le 2 \sup_{f \in \mathcal{F}} \big| \widehat{R}(f) - \bar{R}(f) \big|.$
That is, the generalization error can be used to bound the estimation error of the ERM algorithm. Then we use the following generalization error bound.
Theorem 3.
([29]) Let $\bar{\mathcal{L}} \circ \mathcal{F} = \{\bar{\mathcal{L}} \circ f \mid f \in \mathcal{F}\}$ and $\mathfrak{R}_n(\bar{\mathcal{L}} \circ \mathcal{F})$ be the Rademacher complexity of $\bar{\mathcal{L}} \circ \mathcal{F}$. If the loss function is upper bounded by $M$, then for any $\delta > 0$, with probability at least $1 - \delta$, we have
$\sup_{f \in \mathcal{F}} \big| \widehat{R}(f) - \bar{R}(f) \big| \le 2\, \mathfrak{R}_n(\bar{\mathcal{L}} \circ \mathcal{F}) + M \sqrt{\frac{\ln(2/\delta)}{2n}}.$
We further bound the relationship between $\mathfrak{R}_n(\bar{\mathcal{L}} \circ \mathcal{F})$ and $\mathfrak{R}_n(\mathcal{F})$.
Lemma 3.
Let $\bar{\mathcal{L}}$ be defined in Eq. (2), and let $\mathcal{L}$ be Lipschitz continuous with respect to $f(x)$ with a Lipschitz constant $L_\ell$. Then we have
Combining Lemma 2, Theorem 3 and Lemma 3, we can prove Theorem 2. Please find in Appendix A the detailed proof.
Note that if for all , we can have the following corollary.
Corollary 4.
Assume the condition above holds for all examples and all other conditions are the same as in Theorem 2; then we have
Corollary 4 implies that the smaller the candidate label set $\bar{Y}$, the better the learned classifier, given that the true label lies in $\bar{Y}$. This agrees with our intuition on PLL. This section discussed an expected risk estimator for PLL; in the next section, we discuss how to approximately optimize the proposed risk.
3 Progressive Identification of True Labels
Obviously, it is not easy to directly perform stochastic gradient descent on Eq. (3), due to the non-differentiability of the min operator. However, we would still like to train deep networks to obtain $f$ by stochastic optimization, with its great practical benefit. This motivates our efforts to propose a novel progressive identification method for approximately minimizing Eq. (3). To this end, we first assume that the loss function can be decomposed onto each label,
where $\ell(f(x), j)$ is the label-wise loss of predicting label $j$. In this way, with appropriate confidences $w_{ij}$ associated with the training example $(x_i, \bar{Y}_i)$, we can have the relaxed empirical loss
$\widehat{R}_w(f) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j \in \bar{Y}_i} w_{ij}\, \ell(f(x_i), j),$ (6)
where $w_i = (w_{i1}, \ldots, w_{ik})$
refers to a point in the standard simplex in $\mathbb{R}^k$, i.e., $w_{ij} \ge 0$ and $\sum_{j=1}^{k} w_{ij} = 1$, with $w_{ij} = 0$ for $j \notin \bar{Y}_i$. $w_{ij}$ can be interpreted as the corresponding label confidence, i.e., the confidence of the $j$-th label being the true label of the $i$-th example. Ideally, the confidence of the true label would tend to $1$ progressively, which means that we finally have full confidence about which label is the true label. We will explain how to eventually achieve such an ideal situation.
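With the cross-entropy as the label-wise loss, the relaxed loss of Eq. (6) is simply a confidence-weighted average. A minimal numpy sketch (the function name is illustrative), where each row of `weights` is zero outside the candidate set and sums to one:

```python
import numpy as np

def relaxed_loss(logits, weights):
    """Relaxed empirical loss of Eq. (6) with cross-entropy as the
    label-wise loss. weights: (n, k) confidences on the simplex,
    zero outside each candidate set."""
    z = logits - logits.max(axis=1, keepdims=True)   # stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -(weights * log_p).sum(axis=1).mean()
```

When a row of `weights` is one-hot, this reduces to the ordinary cross-entropy loss, so supervised learning is recovered as a special case.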
The method begins with training an initial model based on the uniform confidences:
$w_{ij} = \frac{1}{|\bar{Y}_i|} \text{ if } j \in \bar{Y}_i, \text{ and } w_{ij} = 0 \text{ otherwise}.$ (7)
According to the memorization effect [30, 31] of deep networks, deep networks remember “frequent patterns” in the first few iterations. If the given partially labeled data has a reasonable ambiguity degree [9, 26], the deep networks tend to remember the true labels in the initial epochs, guiding us towards a classifier that gives relatively low prediction losses on the more probable true labels. In this way, the initial informative predictions are used to update the confidences for further training:
$w_{ij} = \frac{f_j(x_i)}{\sum_{j' \in \bar{Y}_i} f_{j'}(x_i)} \text{ if } j \in \bar{Y}_i, \text{ and } w_{ij} = 0 \text{ otherwise}.$ (8)
In summary, we begin with training a neural network to optimize the risk using the uniform confidences given in Eq. (7). Then we update the confidences by Eq. (8) after each iteration, and continue training the neural network using the newly updated confidences. The procedure of our algorithm is summarized in Algorithm 1, where we call our proposal PRODEN (PROgressive iDENtification).
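Putting the loop together, the alternation between model updates and confidence updates can be sketched for a linear softmax model trained by full-batch gradient steps. The function names, hyperparameters, and toy setting below are illustrative only, not the paper's reference implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilize
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def proden_linear(X, candidate_mask, lr=0.1, epochs=100, seed=0):
    """Sketch of the PRODEN loop for a linear model: uniform confidences
    (Eq. 7), confidence-weighted cross-entropy updates (Eq. 6), and
    confidence re-estimation from model outputs (Eq. 8)."""
    n, d = X.shape
    k = candidate_mask.shape[1]
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((d, k))
    # Eq. (7): uniform confidences over each candidate set
    w = candidate_mask / candidate_mask.sum(axis=1, keepdims=True)
    for _ in range(epochs):
        p = softmax(X @ W)
        # gradient of the confidence-weighted cross-entropy w.r.t. W
        W -= lr * X.T @ (p - w) / n
        # Eq. (8): renormalize model outputs inside the candidate set
        q = p * candidate_mask
        w = q / (q.sum(axis=1, keepdims=True) + 1e-12)  # small eps for safety
    return W, w
```

Note that identification and training interleave at every step; there is no inner loop trained to convergence, in contrast to the EM procedure discussed next.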
At first glance, the proposed method shares some similarities with the iterative EM method proposed in [9]. In fact, the method in [9] has a tendency to overfit in the M-step, and has the limitation of using only one specific loss function. Conversely, PRODEN makes use of a more effective learning framework to mitigate overfitting and, besides, is flexible enough to use other loss functions. We provide more detailed arguments on the superiority and generality of the proposed method in the following.
First, the iterative EM method in previous works trains the model until convergence in the M-step, but overemphasizing the convergence may result in redundant computation and overfitting, as the model will eventually fit the initial inexact prior knowledge and make a less informative estimate in the E-step, on which the subsequent learning is based. To mitigate the overfitting issue, our method advances the procedure by merging the E-step and the M-step. Since the model is trained in a seamless manner without a clear separation of E-step and M-step, the confidences can be updated at any iteration, so convergence is not necessary in our training procedure.
Second, in the deep learning era, loss functions are one of the key elements, and many useful loss functions have been proposed, such as the mean squared error loss [27] and the mean absolute error loss [32]. Accordingly, methods are expected to be loss-independent, allowing the usage of any loss function [33]. However, existing EM methods are restricted to some specific loss function; e.g., [9] limits the loss function to the KL divergence, which is equivalent to the cross-entropy loss. Such a restriction on loss functions is not suitable for practical use. In contrast, our proposal is flexible enough to be compatible with a large group of loss functions. Moreover, we show in the appendix that the proposal of [9] is a special case of ours.
4 Experiments
In this section, we verify the effectiveness of the proposed method PRODEN. We first analyze different strategies of PRODEN in training deep neural networks on benchmark datasets corrupted by manual candidate labels, and then compare it with non-deep state-of-the-art PLL methods on controlled UCI datasets and real-world datasets.
4.1 Experiments with deep networks
Dataset  # Train  # Test  # Feature  # Class  Model 
MNIST  60,000  10,000  784  10  Linear Model, MLP (depth 5) 
Fashion-MNIST  60,000  10,000  784  10  Linear Model, MLP (depth 5) 
Kuzushiji-MNIST  60,000  10,000  784  10  Linear Model, MLP (depth 5) 
CIFAR-10  50,000  10,000  3,072  10  ConvNet (depth 10), ResNet (depth 32) 
Datasets Experiments are conducted on four widely adopted benchmarks, MNIST, Fashion-MNIST, Kuzushiji-MNIST and CIFAR-10, which are summarized in Table 1. We manually corrupted these datasets into partially labeled versions. Firstly, we probabilistically add each negative label into the candidate label set with a flipping probability $q$. Secondly, for the training examples that have only one candidate label (the true label), we randomly flip one negative label to a positive label in order to ensure that all the training examples are partially labeled.
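The corruption procedure above can be sketched as follows (an assumed implementation; the function name and generator seeding are ours, not the paper's):

```python
import numpy as np

def corrupt_to_partial_labels(y, k, q, seed=0):
    """Assumed implementation of the corruption above: every negative
    label joins the candidate set independently with flipping probability
    q; examples left with only the true label get one random extra
    negative label so that every example is partially labeled."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mask = rng.random((n, k)) < q          # flip negatives with prob. q
    mask[np.arange(n), y] = True           # the true label is always kept
    for i in np.flatnonzero(mask.sum(axis=1) == 1):
        extra = rng.choice(np.delete(np.arange(k), y[i]))
        mask[i, extra] = True              # guarantee at least two candidates
    return mask
```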
Baselines In order to analyze the proposed method, we compare it with six baselines:

PN-oracle means supervised learning from ordinarily supervised data. It is merely for a proof of concept.

PN-transform means decomposing multiple candidate labels into many single labels, so that we can use any ordinary multi-class classification method.

PRODEN-iterative means updating the label confidences in the iterative EM manner.

PRODEN-deterministic means updating the label confidences in a hard manner, i.e., the confidence of the label with maximum model output over the candidate set equals $1$, and $0$ otherwise.

PRODEN-naive means never updating the uniform confidences.

GA [33] means complementary-label learning with gradient ascent.
PLL may be tackled by methods learning from complementary labels, each of which specifies a class that an example does not belong to. A set of candidate labels can be regarded as the inverse case of a set of multiple complementary labels. We compare the proposed method with a state-of-the-art complementary-label learning method.
Experimental setup The robustness to partially labeled data is tested by running the proposed method and the comparing methods under low-level partial circumstances and under extremely partial circumstances. Table 1 describes the models on each dataset, where MLP refers to a multi-layer perceptron, ConvNet follows the architecture in [31], and ResNet refers to residual networks [34]. The optimizer is stochastic gradient descent (SGD) [22] with momentum 0.9. We train these models for 500 epochs with the cross-entropy loss function in all the experiments. Because the M-step of the iterative EM method is time-consuming, PRODEN-iterative updates label confidences every 100 epochs, which suffices to demonstrate the overfitting issue. We implement all methods in PyTorch, and conduct all the experiments on an NVIDIA Tesla V100 GPU. Please find the details in Appendix B.
Results We record inductive results, which indicate the classification accuracy on the test set. The experimental results are shown in Figure 1. Transductive results, which reflect the ability to identify the true labels in the training set, can be found in Appendix C.
Firstly, we observe the performance under the low-level partial circumstances (cf. the left two columns). The proposed method PRODEN is always the best method and is comparable to PN-oracle with all the models. PRODEN-iterative is comparable to PRODEN with the linear model, but its performance deteriorates drastically with complex models because the overfitting issue is severe; these results are consistent with the discussions in Section 3, and the residual blocks can remedy this problem. The superiority of PRODEN always stands out compared with GA. Besides, PRODEN is much more stable than GA without a learning rate decay strategy, because the progressively identified true labels make the training examples much cleaner. The poor performance of PRODEN-deterministic shows that hard labels discard the helpful learning history. PN-transform and PRODEN-naive suffer from overfitting with deep models, where the performance even drops behind their linear counterparts; the reason for this phenomenon is that more complex architectures have stronger capacity to fit all labels.
Secondly, we investigate these methods under the extremely partial circumstances (cf. the right two columns). In the easier learning scenarios (MNIST, Fashion-MNIST), PRODEN is comparable to PN-oracle even when the flipping probability is considerably large. In the hardest learning scenario (CIFAR-10), the progressive identification method can also alleviate overfitting. The generalization degradation of PRODEN-iterative is more serious here; in contrast, GA, which is designed for extremely partial circumstances (complementary labels), performs better and surpasses PRODEN-iterative. It is impressive that PRODEN still achieves superior performance against all the comparing methods.
4.2 Comparison with non-deep PLL approaches
We further verify the performance of the proposed method by comparing it with six state-of-the-art PLL approaches that cannot be generalized to deep networks:

SURE [14]: an iterative EM approach [suggested configuration: ].

CLPL [3]: a parametric approach transforming the PLL problem into a binary learning problem [suggested configuration: SVM with squared hinge loss].

ECOC [35]: a disambiguation-free approach adapting the binary decomposition strategy to PLL [suggested configuration: ].

PL-SVM [2]: a maximum margin approach [suggested configuration: ].

PL-kNN [19]: a non-parametric approach [suggested configuration: ].

IPAL [4]: a nonparametric approach [suggested configuration: ].
The original comparing approaches are implemented in Matlab. For a fair comparison, all the parametric approaches exploit a linear model. We train PRODEN for a fixed number of epochs on all the datasets and average the classification accuracy over the last several epochs as the final results. For each dataset, ten-fold cross-validation is performed.
PRODEN-Linear against  
SURE  CLPL  ECOC  PL-SVM  PL-kNN  IPAL  
[Configuration 1]  22/4/2  18/7/3  15/2/11  27/1/0  17/3/8  20/1/7 
[Configuration 2]  22/6/0  20/7/1  17/2/9  26/2/0  20/1/7  18/2/8 
[Configuration 3]  21/6/1  19/8/1  20/0/8  28/0/0  19/1/8  21/0/7 
[Configuration 4]  24/4/0  22/5/1  16/4/8  28/0/0  19/2/7  12/2/14 
Total  89/20/3  79/27/6  68/8/36  109/3/0  75/7/30  71/5/36 
PRODEN-Linear  SURE  CLPL  ECOC  PL-SVM  PL-kNN  IPAL  
Lost  81.59±3.46  71.33±3.57  74.87±4.30  49.03±8.36  75.31±3.81  36.73±2.99  72.12±4.48 
MSRCv2  43.44±3.28  46.88±4.67  36.53±4.59  41.53±3.25  35.85±4.41  41.36±2.89  50.80±4.46 
BirdSong  68.90±0.72  58.92±1.28  63.56±1.40  71.58±1.81  49.90±2.07  64.94±1.42  72.06±1.55 
Soccer Player  55.32±0.56  49.41±0.86  36.82±1.04  53.70±2.02  46.29±0.96  49.62±0.67  55.03±0.77 
Yahoo!News  67.48±0.65  45.49±1.15  46.21±0.90  66.22±1.01  56.85±0.91  41.07±1.02  66.79±1.22 
Controlled UCI datasets The characteristics of the UCI datasets are reported in Appendix B. Following the widely-used controlling protocol, we generate several artificial partial-label datasets from each UCI dataset by adjusting the controlling parameters $p$, $r$ and $\varepsilon$. Here, $p$ controls the proportion of partially labeled examples, $r$ controls the number of false positive labels in each candidate label set, and $\varepsilon$ controls the co-occurring probability between the true label and a specific false positive label. There are 4 configurations of controlling parameters, corresponding to the results for each UCI dataset.
Figure 2 shows the classification accuracy of the comparing approaches as $p$ ranges from 0.1 to 0.7 under Configuration 2. Figure 3 illustrates the classification accuracy as $\varepsilon$ varies from 0.1 to 0.7 under Configuration 4. To further statistically compare the proposed method with the other algorithms, Table 2 summarizes the win/tie/loss counts between PRODEN-Linear and the comparing approaches. We find that the accuracy of PRODEN-Linear is highly competitive with all the parametric approaches. Compared with the best non-parametric method IPAL, PRODEN-Linear still significantly outperforms it in 63.4% of cases, and it is easy to use other networks to further advance the classification ability of PRODEN.
Real-world datasets The characteristics of the real-world partial-label datasets are summarized in Appendix B. Table 3 reports the mean classification accuracy as well as the standard deviation of each method. Out of the 30 statistical tests, PRODEN-Linear achieves superior or at least comparable performance against all the comparing methods in 73.3% of cases, and is outperformed by them in only 10% of cases. This confirms that the advantage of PRODEN is afforded not only by the deep neural networks but, more importantly, by the progressive identification process.
5 Conclusion
In this paper, we proposed a progressive identification method for partial-label learning, named PRODEN. We first proposed a novel risk estimator, which optimizes the minimal loss incurred by the candidate labels. We then proved theoretically that the classifier learned with the proposed risk estimator is equivalent to the classifier learned in ordinary multi-class learning under mild conditions. We also derived an estimation error bound, which theoretically guarantees the learned classifier's performance. Furthermore, we proposed a progressive identification method for approximately minimizing the proposed risk estimator; our key idea is to conduct identification and classification seamlessly to mitigate the overfitting problem. Finally, we experimentally demonstrated that the proposed method can successfully train linear, fully connected, convolutional and residual networks, and confirmed the superiority of the progressive identification process.
Appendix A Proofs
A.1 Proof of Lemma 1
According to [36], since the loss is non-negative, minimizing the conditional risk is an alternative to minimizing $R(f)$. The conditional risk can be written as
Given that $f(x)$ is an estimate of $p(y \mid x)$, $f(x)$ lies in the $(k-1)$-dimensional probability simplex, i.e., $\sum_{j=1}^{k} f_j(x) = 1$ and $f_j(x) \ge 0$. Adding the constraints into the objective function according to the Lagrange multiplier method [37], we have
To minimize it, we take the partial derivative with respect to $f_j(x)$ and set it to $0$:
Setting the derivative to zero ensures $f_y(x) = p(y \mid x)$, which concludes the proof. ∎
A.2 Proof of Theorem 1
Substituting the optimal decision function $f^\star$ into the PLL risk estimator $\bar{R}(f)$, we have
Note that the sixth equality is due to $f^\star_y(x) = p(y \mid x)$ (Assumption 2) and Eq. (4) (Assumption 1). Hence we have
According to the assumption that the hypothesis space is flexible enough, the Bayes error can be achieved, i.e., $\bar{R}(f^\star) = 0$, which means $\bar{R}(f)$ has been minimized, i.e., $\bar{f}^\star = f^\star$. Therefore $\bar{F} = F$. ∎
A.3 Proof of Lemma 2
The proof is completed. ∎
A.4 Proof of Lemma 3
By the definition of $\bar{\mathcal{L}}$ in Eq. (2), $\bar{\mathcal{L}}(f(x), \bar{Y}) = \min_{y \in \bar{Y}} \mathcal{L}(f(x), y)$. Given sample size $n$, we first prove the result in the case $|\bar{Y}| = 2$. The min operator can be written as $\min(a, b) = \frac{1}{2}\big(a + b - |a - b|\big).$
In this way, we can write
Since the absolute value is a 1-Lipschitz function, by Talagrand's contraction lemma [38], the last term can be bounded:
Combining the two bounds above, we have
The general case can be derived from the case $|\bar{Y}| = 2$ by an immediate recurrence.
Again we apply Talagrand's contraction lemma,
The proof is completed. ∎
A.5 Supplementary theorem on Section 3
Proof.
First, recall that the learning objective in [9] is formulated as:
(11) 
Here, the former term represents the unknown prior probability and the latter is the model-based conditional distribution.
Then, in Eq. (6), the loss function can be specified as the cross-entropy loss, $\ell(f(x), j) = -\log f_j(x)$. We can easily find that the cross-entropy loss is linear in the confidences, i.e., the confidence-weighted sum of label-wise cross-entropy losses equals the cross-entropy against the confidence vector.
Thus, the confidences can be moved into the loss function:
$\widehat{R}_w(f) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j \in \bar{Y}_i} w_{ij} \log f_j(x_i).$ (12)
where the confidence vector serves as the estimate of the unknown prior probability.
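The equivalence can also be checked numerically: since the entropy of the confidences is constant in the model, the KL divergence objective of [9] and the soft-target cross-entropy of Eq. (12) differ by a model-independent constant and therefore share minimizers. A small check with illustrative values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

w = np.array([0.7, 0.0, 0.3])             # confidences (fixed soft target)
p = softmax(np.array([2.0, -1.0, 0.5]))   # model-based conditional distribution

nz = w > 0                                # convention: 0 * log 0 = 0
kl = np.sum(w[nz] * np.log(w[nz] / p[nz]))
soft_ce = -np.sum(w[nz] * np.log(p[nz]))
entropy = -np.sum(w[nz] * np.log(w[nz]))
# KL(w || p) = cross-entropy(w, p) - entropy(w); entropy(w) does not
# depend on the model, so minimizing KL and minimizing the soft-target
# cross-entropy are equivalent in p.
assert np.isclose(kl, soft_ce - entropy)
```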
Appendix B Dataset information and experimental setting
B.1 Benchmark datasets
MNIST This is a grayscale image dataset of handwritten digits from 0 to 9, where the size of the images is 28*28.
The linear model was a linear-in-input model, 784-10, and used softmax activation in the output layer. An L2 regularization was added, where the regularization parameter was fixed to 1e-5. The model was trained by SGD with a fixed learning rate of 1e-3 and a fixed batch size.
The MLP used for training on MNIST was a fully connected network with ReLU as the activation function: 784-300-300-300-300-10. The softmax activation function was also used in the output layer. Batch normalization [39] was applied before hidden layers. The regularization parameter was 1e-3 and the learning rate was 1e-2.
Fashion-MNIST This is also a grayscale image dataset, similar to MNIST, but here each datum is associated with a label from 10 fashion item classes.
The models and optimizer were the same as for MNIST, except that the regularization parameter was 1e-5 for the MLP.
Kuzushiji-MNIST This is another variant of the MNIST dataset, where each example is associated with a label from 10 cursive Japanese (Kuzushiji) characters.
The models and optimizer were the same as for MNIST, except that the regularization parameter was 1e-6 for the linear model and 1e-4 for the MLP.
CIFAR-10 This dataset consists of 60,000 color images in 10 classes.
The detailed architecture of ConvNet [31] was as follows.
0th (input) layer: (32*32*3)
1st to 3rd layers: [C(3*3, 128)]*3 - Max Pooling
4th to 6th layers: [C(3*3, 256)]*3 - Max Pooling
7th to 9th layers: C(3*3, 512) - C(3*3, 256) - C(3*3, 128) - Max Pooling
10th layer: Average Pooling - 10
where C(3*3, 128) means 128 channels of 3*3 convolutions followed by the Leaky-ReLU (LReLU) activation function [40], [ · ]*3 means 3 such layers, etc. Again, the softmax activation function was used in the output layer. Besides, dropout (with the dropout rate set to 50%) and batch normalization were also used. The model was trained by SGD with the default momentum parameter. The regularization parameter was 1e-4 and the batch size was 500. In addition, the initial learning rate was 1e-2 and was decreased per epoch by a multiplicative decay,
where the decay rate was 0.9.
The detailed architecture of ResNet32 [34] was as follows.
0th (input) layer: (32*32*3)
1st to 11th layers: C(3*3, 16) - [C(3*3, 16), C(3*3, 16)]*5
12th to 21st layers: [C(3*3, 32), C(3*3, 32)]*5
22nd to 31st layers: [C(3*3, 64), C(3*3, 64)]*5
32nd layer: Average Pooling - 10
where [ ·, · ] means a building block [34]. The optimization setup was the same as for MNIST, except that the regularization parameter was 1e-3 and the fixed learning rate was 5e-2.
B.2 UCI datasets and real-world datasets
The characteristics of UCI datasets and realworld datasets are reported in Table 4 and Table 5 respectively.
We first normalized these datasets by z-scores, by convention. On all these datasets, PRODEN used a linear model trained by SGD with momentum 0.9. The regularization parameter was 1e-3 and the learning rate was 1e-3 on usps; the regularization parameter was 1e-2 and the learning rate was 1e-1 on ecoli, deter, glass, Lost, MSRCv2, and BirdSong; the regularization parameter was 1e-3 and the learning rate was 1e-2 on Soccer Player; and the regularization parameter was 1e-4 and the learning rate was 1e-2 on Yahoo!News. On regular-scale datasets (#examples ≤ 5,000), we used a full-batch update, whereas on large-scale datasets (#examples > 5,000), the batch size was fixed to 1,000.
Dataset  # Examples  # Feature  # Class  Configurations 
ecoli  336  7  8  [Configuration 1] , 
deter  358  23  6  [Configuration 2] , 
glass  214  10  29  [Configuration 3] , 
usps  9,298  256  10  [Configuration 4] , , 
Dataset  # Examples  # Feature  # Class  # Avg. CLs  Task Domain 
Lost  1122  108  16  2.23  automatic face naming [41] 
MSRCv2  1758  48  23  3.16  object classification [8] 
BirdSong  4998  38  13  2.18  bird song classification [42] 
Soccer Player  17472  279  171  2.09  automatic face naming [43] 
Yahoo!News  22991  163  219  1.91  automatic face naming [44] 
Appendix C Additional experimental results
C.1 Transductive results on benchmark datasets
The transductive results, which reflect the ability to identify the true labels in the training sets, are shown in Figure 4.
We find that these results are consistent with the findings in Figure 1: the proposed method can successfully train linear, fully connected, convolutional and residual networks with stochastic optimization. The progressive identification process successfully mitigates the overfitting issue of the iterative EM method, and outperforms the other comparing methods even when the flipping probability is considerably large.
C.2 Classification accuracy on controlled UCI datasets
Figure 5 and Figure 6 show the classification accuracy of the compared approaches as the flipping probability ranges from 0.1 to 0.7 under Configuration 1 (C. 1) and Configuration 3 (C. 3), respectively.
Similar to Figure 2 and Figure 3, the proposed method PRODEN-Linear is highly competitive with all the parametric methods, and still significantly outperforms the best non-parametric approach, IPAL, in most cases.
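As an illustration of the controlled setup, the following sketch generates candidate label sets by flipping each false label into the candidate set independently with the given probability, always keeping the true label; the exact generating procedure under each configuration may differ from this assumption:

```python
import numpy as np

def make_candidate_sets(y, num_classes, flip_prob, seed=0):
    """Candidate label sets for controlled partial-label data (illustrative).

    Each incorrect label joins an example's candidate set independently
    with probability flip_prob; the true label is always a candidate.
    Returns an (n, num_classes) boolean mask.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    candidates = rng.random((n, num_classes)) < flip_prob
    candidates[np.arange(n), y] = True   # true label always included
    return candidates

y = np.array([0, 1, 2, 3])
cands = make_candidate_sets(y, num_classes=5, flip_prob=0.3)
# Every row contains its true label; extra labels appear at rate ~0.3.
```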
References
 Z. Zhou, “A brief introduction to weakly supervised learning,” National Science Review, vol. 5, no. 1, pp. 44–53, 2017.
 N. Nguyen and R. Caruana, “Classification with partial labels,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08), (Las Vegas, NV), pp. 381–389, 2008.
 T. Cour, B. Sapp, and B. Taskar, “Learning from partial labels,” Journal of Machine Learning Research, vol. 12, no. 5, pp. 1501–1536, 2011.
 M. Zhang and F. Yu, “Solving the partial label learning problem: An instance-based approach,” in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI’15), (Buenos Aires, Argentina), pp. 4048–4054, 2015.
 Z. Zeng, S. Xiao, K. Jia, T. Chan, S. Gao, D. Xu, and Y. Ma, “Learning by associating ambiguously labeled images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13), (Portland, OR), pp. 708–715, 2013.
 C. Chen, V. M. Patel, and R. Chellappa, “Learning from ambiguously labeled face images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 7, pp. 1653–1667, 2018.
 J. Luo and F. Orabona, “Learning from candidate labeling sets,” in Advances in Neural Information Processing Systems 23 (NIPS’10), (Vancouver, Canada), pp. 1504–1512, 2010.
 L. Liu and T. G. Dietterich, “A conditional multinomial mixture model for superset label learning,” in Advances in Neural Information Processing Systems 25 (NIPS’12), (Lake Tahoe, NV), pp. 548–556, 2012.
 R. Jin and Z. Ghahramani, “Learning with multiple labels,” in Advances in Neural Information Processing Systems 16 (NIPS’03), (Vancouver, Canada), pp. 921–928, 2003.
 Y. Chen, V. M. Patel, J. K. Pillai, R. Chellappa, and P. J. Phillips, “Dictionary learning from ambiguously labeled data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13), (Portland, OR), pp. 353–360, 2013.
 Y. Chen, V. M. Patel, J. K. Pillai, R. Chellappa, and P. J. Phillips, “Ambiguously labeled learning using dictionaries,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2076–2088, 2014.
 A. Shrivastava, V. M. Patel, and R. Chellappa, “Nonlinear dictionary learning with partially labeled data,” Pattern Recognition, vol. 48, no. 11, pp. 3283–3292, 2015.
 C. Tang and M. Zhang, “Confidence-rated discriminative partial label learning,” in 31st AAAI Conference on Artificial Intelligence (AAAI’17), (San Francisco, CA), 2017.
 L. Feng and B. An, “Partial label learning with self-guided retraining,” in 33rd AAAI Conference on Artificial Intelligence (AAAI’19), (Honolulu, HI), pp. 3542–3549, 2019.
 L. Feng and B. An, “Partial label learning by semantic difference maximization,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19), (Macao, China), pp. 2294–2300, 2019.
 F. Yu and M. Zhang, “Maximum margin partial label learning,” Machine Learning, vol. 106, no. 4, pp. 573–593, 2017.
 C. Hsu and C. Lin, “A comparison of methods for multiclass support vector machines,” IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.
 R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, no. Aug, pp. 1871–1874, 2008.
 E. Hüllermeier and J. Beringer, “Learning from ambiguously labeled examples,” Intelligent Data Analysis, vol. 10, no. 5, pp. 419–439, 2006.
 L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,” in Advances in Neural Information Processing Systems 20 (NIPS’07), vol. 20, (Vancouver, Canada), pp. 1–8, 2007.
 I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
 H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951.
 J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
 D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR’15), 2015.
 V. N. Vapnik, “An overview of statistical learning theory,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999.
 L. Liu and T. G. Dietterich, “Learnability of the superset label learning problem,” in Proceedings of 31st International Conference on Machine Learning (ICML’14), (Beijing, China), pp. 1629–1637, 2014.
 H. L. Seal, The Historical Development of the Gauss Linear Model. Yale University, 1968.
 X. Yu, T. Liu, M. Gong, and D. Tao, “Learning with biased complementary labels,” in Proceedings of the 15th European Conference on Computer Vision (ECCV’18), (Munich, Germany), pp. 68–83, 2018.
 P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 463–482, 2002.
 D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. Kanwal, T. Maharaj, and A. Fischer, “A closer look at memorization in deep networks,” in Proceedings of 34th International Conference on Machine Learning (ICML’17), (Sydney, Australia), pp. 233–242, 2017.
 B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” in Advances in Neural Information Processing Systems 31 (NeurIPS’18), (Montreal, Canada), pp. 8527–8537, 2018.
 A. Ghosh, H. Kumar, and P. Sastry, “Robust loss functions under label noise for deep neural networks,” in 31st AAAI Conference on Artificial Intelligence (AAAI’17), (San Francisco, CA), 2017.
 T. Ishida, G. Niu, A. K. Menon, and M. Sugiyama, “Complementary-label learning for arbitrary losses and models,” in Proceedings of 36th International Conference on Machine Learning (ICML’19), (Long Beach, CA), pp. 2971–2980, 2019.
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), (Las Vegas, NV), pp. 770–778, 2016.
 M. Zhang, F. Yu, and C. Tang, “Disambiguation-free partial label learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10, pp. 2155–2167, 2017.
 H. Masnadi-Shirazi and N. Vasconcelos, “On the design of loss functions for classification: Theory, robustness to outliers, and SavageBoost,” in Advances in Neural Information Processing Systems 22 (NIPS’09), (Vancouver, Canada), pp. 1049–1056, 2009.
 D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997.
 M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
 S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning (ICML’15), (Lille, France), pp. 448–456, 2015.
 A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proceedings of 30th International Conference on Machine Learning (ICML’13), vol. 30, (Atlanta, GA), p. 3, 2013.
 G. Panis and A. Lanitis, “An overview of research activities in facial age estimation using the FG-NET aging database,” in Proceedings of the 13th European Conference on Computer Vision (ECCV’14), (Zurich, Switzerland), pp. 737–750, 2014.
 F. Briggs, X. Z. Fern, and R. Raich, “Rank-loss support instance machines for MIML instance annotation,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12), (Beijing, China), pp. 534–542, 2012.
 Z. Zeng, S. Xiao, K. Jia, T. Chan, S. Gao, D. Xu, and Y. Ma, “Learning by associating ambiguously labeled images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13), (Portland, OR), pp. 708–715, 2013.
 M. Guillaumin, J. Verbeek, and C. Schmid, “Multiple instance metric learning from automatically labeled bags of faces,” Lecture Notes in Computer Science, vol. 63, no. 11, pp. 634–647, 2010.