Progressive Identification of True Labels for Partial-Label Learning

Abstract

Partial-label learning is an important weakly supervised learning problem in which each training example is equipped with a set of candidate labels that contains the true label. Most existing methods elaborately design learning objectives as constrained optimizations that must be solved in specific manners, making their computational complexity a bottleneck for scaling up to big data. The goal of this paper is to propose a novel framework of partial-label learning without implicit assumptions on the model or optimization algorithm. More specifically, we propose a general estimator of the classification risk, theoretically analyze its classifier-consistency, and establish an estimation error bound. We then explore a progressive identification method for approximately minimizing the proposed risk estimator, where the update of the model and the identification of true labels are conducted in a seamless manner. The resulting algorithm is model-independent and loss-independent, and compatible with stochastic optimization. Thorough experiments demonstrate that it sets the new state of the art.

1 Introduction

In practice, the increasing demand for massive data makes it inevitable that not all examples are equipped with high-quality labels. Poor ability to exploit weak supervision [1] becomes a critical bottleneck for supervised learning methods. We focus on an important type of weakly supervised learning problem called partial-label learning (PLL) [2, 3, 4], where each training example is associated with multiple candidate labels among which exactly one is true. This problem arises in many real-world tasks such as automatic image annotation [5, 6], web mining [7], ecoinformatics [8], etc.

Related research on PLL was pioneered by a multiple-label learning (MLL) approach [9]. Despite the same form of supervision information, namely that each training example is assigned a set of candidate labels, a vital difference between MLL and PLL is that the goal of PLL is to identify the single true label among the candidate labels, whereas for MLL identifying an arbitrary label in the candidate label set is acceptable. In practical implementation, [9] formulated the learning objective by minimizing the KL divergence between the prior probability and the model-based conditional distribution, and solved the optimization by the Expectation-Maximization (EM) algorithm, resulting in a procedure that iterates between estimating the prior probability and training the model.

The true labels of training examples in PLL are obscured by candidate labels that hinder learning; the key to success is, therefore, identifying the true labels. Recall that [9] attempts to recover the prior information as the fitting target of the model, so that the label with maximum prior probability is naturally identified as the true label. This foundational work inspired successors to propose EM-based methods for reasonably recovering non-uniform prior information on the labels. Various efforts have been made to design learning objectives from the perspective of modifying learning models and loss functions, or adding proper regularizations and constraints. [10, 11] proposed a method to train linear dictionaries with sparse constraints, and [12] extended the linear dictionary by the kernel trick. [13] adapted boosting techniques to maintain the prior information in each iteration. In [14, 15], the optimization of the prior probability is formulated as a quadratic programming (QP) problem. In addition, maximum margin methods [2, 16] identified the true labels by defining a multi-class maximum margin problem [17] to be solved by some off-the-shelf SVM implementation [18]. Some non-parametric methods have also been proposed: [19] proposed a k-nearest-neighbor method, and [4] constructed a QP problem to better leverage the underlying distribution of the data.

Most existing methods elaborately designed tricky constrained learning objectives that are coupled to specific optimization algorithms. A super-linear time complexity in the total volume of data becomes a major limiting factor when these methods face large amounts of training data. Furthermore, [20] proved that stochastic optimization algorithms, with which few of the existing methods are compatible, are the best choice for large-scale learning problems considering the estimation-optimization tradeoff, so the PLL problem is still practically challenging.

In this paper, we propose a novel framework of PLL without implicit assumptions on the model or the optimization algorithm, so that it can scale to big data. For this purpose, we critically rethink whether the unconstrained method in [9] is worth exploring.

Our answer is twofold. First, the decade-old method used a simple linear model fed by handcrafted features, which lacks the representation and discrimination power needed for large-scale learning problems; we also observe that almost all existing methods exploit non-deep models. Fortunately, the learning objective in [9] is easily instantiated by deep neural networks, with an enormous benefit in practice [21]. Second, the question we explore is essentially whether we can advance the pioneering work without depending solely on modifications of the learning model. We first derive a general risk estimator leading to classifier-consistency, and then propose a new method that brings us a step further towards this goal. We also prove that the proposed method is a generalization of [9]. We further experimentally verify its superiority on benchmark datasets corrupted by manual candidate labels, controlled UCI datasets, and real-world partial-label datasets. Our contributions can be summarized as follows:

• Theoretically, we propose a new risk estimator for PLL, and analyze the classifier-consistency, i.e., the classifier learned from partially labeled data converges to the optimal one learned from ordinary supervised data under mild conditions. Then, we establish an estimation error bound for it.

• Practically, we propose a progressive identification method for approximately optimizing the above risk estimator. The proposed method operates in a mini-batched training manner where the update of the model and the identification of true labels are accomplished seamlessly. The resulting method is model-independent and loss-independent, and is compatible with stochastic optimization (e.g., [22, 23, 24]).

2 Basic Risk Estimator for PLL

In this section, we formulate PLL, propose a general risk estimator, and establish the estimation error bound.

2.1 Preliminaries

In ordinary multi-class classification, let $\mathcal{X}\subseteq\mathbb{R}^d$ be the instance space and $\mathcal{Y}=[c]=\{1,\dots,c\}$ be the label space, where $d$ is the feature space dimension and $c$ is the number of classes. Let $p(X,Y)$ be the underlying joint distribution of the random variables $(X,Y)\in\mathcal{X}\times\mathcal{Y}$. The goal is to learn a decision function $g:\mathcal{X}\to\mathbb{R}^c$ that minimizes the estimator of the classification risk:

$$R(g)=\mathbb{E}_{X,Y}\big[\ell(\{g_1(X),\dots,g_c(X)\},Y)\big], \tag{1}$$

where $\mathbb{E}_{X,Y}$ is short for $\mathbb{E}_{(X,Y)\sim p(X,Y)}$, $\ell$ is the loss function, and $g_i(X)$ is an estimate of $P[Y=i\mid X]$. Typically, the classifier $f$ is assumed to take the following form:

$$f(X)=\arg\max_{i\in[c]} g_i(X).$$

The hypothesis $g^*=\arg\min_{g\in\mathcal{G}} R(g)$ in the hypothesis space $\mathcal{G}$ with minimal risk is the optimal decision function.

Since the joint distribution $p(X,Y)$ is usually unknown, the expectation in Eq. (1) is typically approximated by the average over the training examples $\{(x_i,y_i)\}_{i=1}^n$, and the minimizer is returned by the empirical risk minimization (ERM) principle [25].

2.2 Risk Estimator of PLL

Next, we formulate the problem of PLL. The candidate label set $S$ takes values in the power set of $\mathcal{Y}$, whose effective cardinality is $2^c-2$ (the empty set and the complete set are not included). Let $(X,S)$ be the random variables defined on $\mathcal{X}\times\mathcal{S}$ with joint distribution $p(X,S)$, which may be decomposed into the ordinary distribution $p(X,Y)$ and the label set conditional distribution $P[S\mid X,Y]$. Note that in PLL the true label is invisible to the learner, so we can only access $\{(x_i,s_i)\}_{i=1}^n$. In this way, we minimize the following risk for PLL:

$$R_{\mathrm{PLL}}(g)=\mathbb{E}_{X,S}\big[\ell_{\mathrm{PLL}}(\{g_1(X),\dots,g_c(X)\},S)\big],$$

where $S\in\mathcal{S}$ and $\mathcal{S}$ is the power set of $\mathcal{Y}$ (excluding the empty set and $\mathcal{Y}$ itself). $\ell_{\mathrm{PLL}}$ is a loss function defined specially for PLL, and we will specify its form later. In the same way as in the ordinary multi-class case, we can obtain the optimal solution

$$g^*_{\mathrm{PLL}}=\arg\min_{g\in\mathcal{G}} R_{\mathrm{PLL}}(g),$$

and we can further learn a classifier by

$$f^*_{\mathrm{PLL}}(X)=\arg\max_{i\in[c]} g^*_{\mathrm{PLL},i}(X).$$

Our task now becomes deriving a new loss function $\ell_{\mathrm{PLL}}$ such that the learned $f^*_{\mathrm{PLL}}$ is equivalent to the $f^*$ learned in ordinary multi-class learning if the training sample size is sufficiently large.

Note that in $S$ there is only one true label. Imagine the case where the learned $g$ is good enough to approximate $P[Y\mid X]$. Then the output associated with the true label will incur the minimal loss among all candidate labels. Motivated by this, we write our loss function as

$$\ell_{\mathrm{PLL}}=\min_{Y'\in S}\ell(\{g_1(X),\dots,g_c(X)\},Y'), \tag{2}$$

which immediately leads to a new risk estimator, namely,

$$R_{\mathrm{PLL}}(g)=\mathbb{E}_{X,S}\Big[\min_{Y'\in S}\ell(\{g_1(X),\dots,g_c(X)\},Y')\Big]. \tag{3}$$
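As a concrete sketch (our own, not from the paper's codebase), the empirical version of Eq. (3) with the cross-entropy loss can be written in a few lines of NumPy; the function names and the boolean-mask encoding of candidate sets are our choices:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def min_candidate_risk(logits, candidate_mask):
    """Empirical version of Eq. (3): for each example, keep only the minimal
    per-label loss over its candidate set, then average over examples.

    logits:         (n, c) raw outputs g_1(x), ..., g_c(x)
    candidate_mask: (n, c) boolean array, True where a label is a candidate
    """
    per_label_loss = -np.log(softmax(logits))   # cross-entropy of each label
    per_label_loss[~candidate_mask] = np.inf    # non-candidates never attain the min
    return per_label_loss.min(axis=1).mean()
```

Stochastic gradients of this objective are problematic exactly because of the hard `min`, which motivates the relaxation in Section 3.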

So what is the effect of optimizing this newly defined risk for PLL? To show the effect, we will prove that the proposed risk estimator ensures that $f^*_{\mathrm{PLL}}$ converges to $f^*$ under reasonable assumptions.

Our first assumption is that learning is conducted in the deterministic scenario.

Assumption 1.

Consider the general deterministic learning scenario in which the label $Y$ of $X$ is uniquely determined by some measurable function. When $Y$ is uniquely determined,

$$P[Y\in S]=1. \tag{4}$$

This is the basic assumption in PLL made by previous PLL works (e.g., [2, 3, 26, 4, 14]) that the true label must be included in the candidate label set.

By definition, the Bayes error is defined over all measurable functions, and a hypothesis achieving the Bayes error is a Bayes optimal decision function. Under Assumption 1, the Bayes error can reach the value $0$. In this way, if we use flexible models such as deep neural networks, our hypothesis space will be large enough to contain a classifier reaching the Bayes error, which can be as low as zero in this case.

Clearly, the true label is determined by $X$ and defined in terms of the conditional probabilities as $Y=\arg\max_{Y'\in[c]}P[Y'\mid X]$. Then, according to Eq. (4), we also have

$$Y=\arg\max_{Y'\in[c]}P[Y'\mid X]=\arg\max_{Y'\in S}P[Y'\mid X]. \tag{5}$$

We make another assumption on the classifier learned in ordinary multi-class learning.

Assumption 2.

By minimizing the expected risk $R(g)$, we can obtain $g^*_i(X)=P[Y=i\mid X]$ for all $i\in[c]$.

Note that such an assumption can be satisfied when using the cross-entropy loss (Chapter 5 in [21]) or the mean squared error loss [27].

Lemma 1.

([28]) When $\ell$ is the cross-entropy loss or the mean squared error loss, Assumption 2 is satisfied.

With the above assumptions, we can state our main theorem, which specifies that the classifier learned in ordinary multi-class learning and the classifier learned in PLL by minimizing $R_{\mathrm{PLL}}$ are equivalent.

Theorem 1.

(Classifier-consistency) Suppose Assumptions 1 and 2 are satisfied. Then, when the hypothesis space is flexible enough, the optimal classifier learned in PLL is equivalent to the optimal classifier learned in ordinary multi-class learning, i.e., $f^*_{\mathrm{PLL}}=f^*$.

Theorem 1 is proved by substituting $g^*$ into $R_{\mathrm{PLL}}$, which shows that $R_{\mathrm{PLL}}(g^*)=R(g^*)$. Given that the hypothesis space is flexible enough, the Bayes error can be achieved, which means $R_{\mathrm{PLL}}$ is minimized by $g^*$, i.e., $g^*_{\mathrm{PLL}}=g^*$. This further ensures $f^*_{\mathrm{PLL}}=f^*$. A complete proof can be found in Appendix A.

2.3 Estimation Error Bound

In this section we establish the estimation error bound for the proposed estimator.

Assume that $\hat{R}_{\mathrm{PLL}}$ is the empirical counterpart of $R_{\mathrm{PLL}}$, and denote by $\hat{g}_{\mathrm{PLL}}$ the optimal solution obtained by minimizing $\hat{R}_{\mathrm{PLL}}$. We will upper bound the difference between $R_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})$ and $R_{\mathrm{PLL}}(g^*_{\mathrm{PLL}})$ by upper bounding $\sup_{g\in\mathcal{G}}|R_{\mathrm{PLL}}(g)-\hat{R}_{\mathrm{PLL}}(g)|$. We have the following estimation error bound.

Theorem 2.

(Estimation Error Bound) Assume the loss function $\ell$ is Lipschitz continuous with respect to $g(X)$ with a Lipschitz constant $L_\ell$. Let $\mathfrak{R}_n(\mathcal{G})$ be the Rademacher complexity of $\mathcal{G}$ given sample size $n$, and let the loss function be upper bounded by $M$. Then, with probability at least $1-\delta$, we have

$$R_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})-R_{\mathrm{PLL}}(g^*_{\mathrm{PLL}})\le 4cL_\ell\,\mathfrak{R}_n(\mathcal{G})+M\sqrt{\frac{\log(1/\delta)}{2n}}.$$

To prove the estimation error bound, we first need the following relationship between the estimation error and the generalization error.

Lemma 2.

The estimation error can be bounded by

$$R_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})-R_{\mathrm{PLL}}(g^*_{\mathrm{PLL}})\le 2\sup_{g\in\mathcal{G}}\big|R_{\mathrm{PLL}}(g)-\hat{R}_{\mathrm{PLL}}(g)\big|.$$

That is, the generalization error can be used to bound the estimation error of the ERM algorithm. Then we use the following generalization error bound.

Theorem 3.

([29]) Let $\mathfrak{R}_n(\ell_{\mathrm{PLL}}\circ\mathcal{G})$ be the Rademacher complexity of $\ell_{\mathrm{PLL}}\circ\mathcal{G}$. If the loss function is upper bounded by $M$, then for any $\delta>0$, with probability at least $1-\delta$, we have

$$\sup_{g\in\mathcal{G}}\big|R_{\mathrm{PLL}}(g)-\hat{R}_{\mathrm{PLL}}(g)\big|\le 2\,\mathfrak{R}_n(\ell_{\mathrm{PLL}}\circ\mathcal{G})+M\sqrt{\frac{\log(1/\delta)}{2n}}.$$

We further bound $\mathfrak{R}_n(\ell_{\mathrm{PLL}}\circ\mathcal{G})$ in terms of $\mathfrak{R}_n(\mathcal{G})$.

Lemma 3.

Let $\ell_{\mathrm{PLL}}$ be defined as in Eq. (2), and let $\ell$ be Lipschitz continuous with respect to $g(X)$ with a Lipschitz constant $L_\ell$. Then we have

$$\mathfrak{R}_n(\ell_{\mathrm{PLL}}\circ\mathcal{G})\le c\,\mathfrak{R}_n(\ell\circ\mathcal{G})\le cL_\ell\,\mathfrak{R}_n(\mathcal{G}).$$

Combining Lemma 2, Theorem 3, and Lemma 3, we can prove Theorem 2; the detailed proof can be found in Appendix A.

Note that if $|s_i|\le s$ for all $i\in[n]$, we have the following corollary.

Corollary 4.

Assume $|s_i|\le s$ for all $i\in[n]$, with all other conditions the same as in Theorem 2. Then we have

$$R_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})-R_{\mathrm{PLL}}(g^*_{\mathrm{PLL}})\le 4sL_\ell\,\mathfrak{R}_n(\mathcal{G})+M\sqrt{\frac{\log(1/\delta)}{2n}}.$$

Corollary 4 implies that the smaller the candidate label sets, the better the learned classifier, given that the true label lies in the set. This agrees with our intuition on PLL. This section discussed an expected risk estimator for PLL; in the next section, we discuss how to approximately optimize the proposed risk.

3 Progressive Identification of True Labels

Obviously, it is not easy to directly perform stochastic gradient descent on Eq. (3), due to the non-differentiability of the $\min$ operator. However, we would still like to train deep networks by stochastic optimization, with its great practical benefit. This motivates our efforts to propose a novel progressive identification method for approximately minimizing Eq. (3). To this end, we first assume that the loss function can be decomposed over the labels, i.e.,

$$\ell_{\mathrm{PLL}}(\{g_1(X),\dots,g_c(X)\},S)=\sum_{i=1}^c\tilde{\ell}\big(g_i(X),\mathbb{I}(i\in S)\big),$$

where $\tilde{\ell}$ is the label-wise loss and $\mathbb{I}(\cdot)$ is the indicator function. In this way, with appropriate weights $w_{ij}$ associated with the training examples $(x_i,s_i)$, we have the relaxed empirical loss

$$\hat{R}_{\mathrm{PLL}}=\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^c w_{ij}\,\tilde{\ell}\big(g_j(x_i),\mathbb{I}(j\in s_i)\big), \tag{6}$$

where

$$w_i\in\Delta^{c-1}\ \ \forall i\in[n],\qquad w_{ij}=0\ \ \forall i\in[n],\ \forall j\notin s_i.$$

$\Delta^{c-1}$ refers to the standard simplex in $\mathbb{R}^c$, i.e., $\Delta^{c-1}=\{w\in\mathbb{R}^c \mid w_j\ge 0,\ \sum_{j=1}^c w_j=1\}$. $w_{ij}$ can be interpreted as the corresponding confidence, i.e., the confidence of the $j$-th label being the true label of the $i$-th example. Ideally, the confidence of the true label would tend to $1$ progressively, which means that we finally have full confidence in which label is the true label. We will explain how to eventually achieve such an ideal situation.
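For illustration, Eq. (6) with the cross-entropy label-wise loss reduces to a confidence-weighted negative log-likelihood. This minimal NumPy sketch (names ours) assumes each row of `w` lies on the simplex and is zero outside the candidate set:

```python
import numpy as np

def relaxed_empirical_risk(probs, w):
    """Eq. (6) with the cross-entropy label-wise loss: candidate label j of
    example i contributes w_ij * (-log g_j(x_i)); since w_ij = 0 outside the
    candidate set, non-candidate labels contribute nothing."""
    log_probs = np.log(np.clip(probs, 1e-12, 1.0))   # guard against log(0)
    return (-(w * log_probs).sum(axis=1)).mean()
```

When a row of `w` is one-hot, this reduces to the ordinary cross-entropy against that single label, which is how full identification recovers supervised learning.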

The method begins with training an initial model based on the uniform confidences:

$$w^0_{ij}=\begin{cases}1/|s_i| & \text{if } j\in s_i\\ 0 & \text{otherwise}.\end{cases} \tag{7}$$

According to the memorization effects [30, 31] of deep networks, deep networks remember "frequent patterns" in the first few iterations. If the given partially labeled data has a reasonable ambiguity degree [9, 26], the deep networks tend to memorize the true labels in the initial epochs, guiding us towards a classifier that gives relatively low prediction losses on the more probable true labels. In this way, the initial informative predictions are used to update the confidences for further training:

$$w_{ij}=\begin{cases}\dfrac{g_j(x_i)}{\sum_{k\in s_i} g_k(x_i)} & \text{if } j\in s_i\\[2mm] 0 & \text{otherwise}.\end{cases} \tag{8}$$

In summary, we begin with training a neural network to optimize the risk using the uniform confidences given in Eq. (7). Then we update the confidences by Eq. (8) after each iteration, and continue to train the neural network using the new updated confidences. The procedure of our algorithm is summarized in Algorithm 1 where we call our proposal PRODEN (PROgressive iDENtification).
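Algorithm 1 is not reproduced here, but the two confidence rules can be sketched in NumPy as follows (our own reading of Eqs. (7) and (8); the function names and the commented training loop are assumptions, not the released implementation):

```python
import numpy as np

def init_confidences(candidate_mask):
    """Eq. (7): uniform confidences over each example's candidate set."""
    m = candidate_mask.astype(float)
    return m / m.sum(axis=1, keepdims=True)

def update_confidences(probs, candidate_mask):
    """Eq. (8): renormalize the current model outputs over the candidate
    set; non-candidate labels stay at zero confidence."""
    masked = probs * candidate_mask
    return masked / masked.sum(axis=1, keepdims=True)

# Training loop sketch (model and optimizer are placeholders):
#   w = init_confidences(mask)
#   for each mini-batch (x, mask_batch, w_batch):
#       probs = model(x)                              # softmax outputs
#       loss  = -(w_batch * log(probs)).sum(1).mean() # Eq. (6) with CE
#       take an SGD step on loss
#       w_batch = update_confidences(probs, mask_batch)
```

The point of merging the two steps is that confidences are refreshed every mini-batch from the current model, so no inner loop needs to run to convergence.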

At first glance, the proposed method shares some similarities with the iterative EM method proposed in [9]. In fact, the method in [9] has a tendency to overfit in the M-step and is limited to one specific loss function. Conversely, PRODEN makes use of a more effective learning framework to avoid overfitting, and it is also flexible enough to use other loss functions. We provide more detailed arguments on the superiority and generality of the proposed method in the following.

First, the iterative EM method in previous works trains the model until convergence in the M-step, but overemphasizing convergence may result in redundant computation and overfitting, as the model will eventually fit the initial inexact prior knowledge and make a less informative estimate in the E-step on which subsequent learning is based. To mitigate the overfitting issue, our method advances the procedure by merging the E-step and M-step. Since the model is trained in a seamless manner without a clear separation of E-step and M-step, the confidences can be updated at any iteration, so that convergence is not necessary in our training procedure.

Second, in the deep learning era, loss functions are one of the key elements, and many useful loss functions have been proposed, such as the mean squared error loss [27] and the mean absolute error loss [32]. Accordingly, methods are preferred to be loss-independent, allowing the use of any loss function [33]. However, existing EM methods are restricted to specific loss functions; e.g., [9] limits the loss function to the KL divergence, which is equivalent to the cross-entropy loss. Such a restriction on loss functions is not suitable for practical use. In contrast, our proposal is flexible enough to be compatible with a large group of loss functions. Moreover, we will show in the appendix that the proposal of [9] is a special case of ours.

4 Experiments

In this section, we verify the effectiveness of the proposed method PRODEN. First we analyze different strategies of PRODEN in training deep neural networks on benchmark datasets corrupted by manual candidate labels, and then compare it with non-deep state-of-the-art PLL methods on controlled UCI datasets and real-world datasets.

4.1 Experiments with deep networks

Datasets  Experiments are conducted on four widely adopted benchmarks, MNIST, Fashion-MNIST, Kuzushiji-MNIST, and CIFAR-10, which are summarized in Table 1. We manually corrupted these datasets into partially labeled versions. First, we probabilistically add each negative label into the candidate label set with a flipping probability $q$. Second, for the training examples that have only one candidate label (the true label), we randomly flip one negative label to positive in order to ensure that all the training examples are partially labeled.
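The corruption procedure described above can be sketched as follows (our own illustrative implementation; the function name and seeding are assumptions, not the paper's released code):

```python
import numpy as np

def make_partial_labels(y_true, num_classes, q, seed=0):
    """Corrupt ordinary labels into candidate label sets: every negative
    label enters the set independently with flipping probability q, the
    true label is always kept, and any example left with a singleton set
    receives one extra random negative label."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    mask = rng.random((n, num_classes)) < q          # flip negative labels
    mask[np.arange(n), y_true] = True                # true label always included
    for i in np.flatnonzero(mask.sum(axis=1) == 1):  # singleton candidate sets
        negatives = np.delete(np.arange(num_classes), y_true[i])
        mask[i, rng.choice(negatives)] = True
    return mask
```

Larger `q` yields larger candidate sets, i.e., harder (more ambiguous) partial-label problems.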

Baselines  In order to analyze the proposed method, we compare it with six baselines:

• PN-oracle means supervised learning from ordinary supervised data. It serves merely as a proof of concept.

• PN-transform means decomposing multiple candidate labels into many single labels, so that any ordinary multi-class classification method can be used.

• PRODEN-iterative means updating the label confidences in the iterative EM manner.

• PRODEN-deterministic means updating the label confidences in a hard manner, i.e., the confidence of the label with the maximum model output over the candidate set equals $1$ and the others equal $0$.

• PRODEN-naive means never updating the uniform confidences.

• GA [33] means complementary label learning with gradient ascent.

PLL may also be tackled by methods for learning from complementary labels, where a complementary label specifies a class that an example does not belong to. A set of candidate labels can be regarded as an inverse case of multiple complementary labels. We therefore compare the proposed method with a state-of-the-art complementary-label learning method.

Experimental setup  The robustness to partially labeled data is tested by running the proposed method and the comparing methods under mildly partial circumstances (small $q$) and under extremely partial circumstances (large $q$). Table 1 describes the models on each dataset, where MLP refers to a multi-layer perceptron, ConvNet follows the architecture in [31], and ResNet refers to residual networks [34]. The optimizer is stochastic gradient descent (SGD) [22] with momentum 0.9. We train these models for 500 epochs with the cross-entropy loss in all the experiments. Because the M-step of the iterative EM method is time-consuming, PRODEN-iterative updates the label confidences every 100 epochs, which suffices to demonstrate the overfitting issue. We implement all methods in PyTorch and conduct all the experiments on an NVIDIA Tesla V100 GPU. Please find the details in Appendix B.

Results  We record inductive results, i.e., the classification accuracy on the test set; these are shown in Figure 1. Transductive results, which reflect the ability to identify the true labels in the training set, can be found in Appendix C.

First, we observe the performance under mildly partial circumstances (cf. the left two columns). The proposed method PRODEN is always the best method and is comparable to PN-oracle with all the models. PRODEN-iterative is comparable to PRODEN with the linear model, but its performance deteriorates drastically with complex models because the overfitting issue is severe; these results are consistent with the discussions in Section 3, and the residual blocks can partly remedy this problem. The superiority of PRODEN always stands out compared with GA. Besides, PRODEN is much more stable than GA without a learning rate decay strategy, because the progressively identified true labels make the training examples much cleaner. The poor performance of PRODEN-deterministic shows that hard labels discard the helpful learning history. PN-transform and PRODEN-naive suffer from overfitting with deep models, where the performance even drops behind their linear counterparts; the reason is that more complex architectures have stronger capacity to fit all the candidate labels.

Second, we investigate these methods under extremely partial circumstances (cf. the right two columns). In the easier learning scenarios (MNIST, Fashion-MNIST), PRODEN is comparable to PN-oracle even when the flipping probability is considerably large. In the hardest learning scenario (CIFAR-10), the progressive identification method can also alleviate overfitting. The generalization degeneration of PRODEN-iterative is more serious here, and GA, which is designed for extremely partial circumstances (complementary labels), surpasses PRODEN-iterative instead. It is impressive that PRODEN still achieves superior performance against all the comparing methods.

4.2 Comparison with non-deep PLL approaches

We further verify the performance of the proposed method by comparing it with six state-of-the-art PLL approaches which cannot be generalized to deep networks:

• SURE [14]: an iterative EM approach [suggested configuration: ].

• CLPL [3]: a parametric approach transforming the PLL problem to the binary learning problem [suggested configuration: SVM with squared hinge loss].

• ECOC [35]: a disambiguation-free approach adapting the binary decomposition strategy to PLL [suggested configuration: ].

• PLSVM [2]: a maximum margin approach [suggested configuration: ].

• PLkNN [19]: a non-parametric approach [suggested configuration: ].

• IPAL [4]: a non-parametric approach [suggested configuration: ].

The original comparing approaches are implemented in Matlab. For a fair comparison, all the parametric approaches exploit a linear model. We train PRODEN for the same number of epochs on all the datasets and average the classification accuracy over the last epochs as the final results. For each dataset, ten-fold cross-validation is performed.

Controlled UCI datasets  The characteristics of the UCI datasets are reported in Appendix B. Following the widely used controlling protocol, we generate several artificial partial-label datasets from each UCI dataset by adjusting three controlling parameters: one controls the proportion of partially labeled examples, one controls the number of false positive labels in each candidate label set, and one controls the co-occurring probability between the true label and a specific false positive label. There are 4 configurations of the controlling parameters for each UCI dataset.

Figure 2 shows the classification accuracy of the comparing approaches as the proportion of partially labeled examples ranges from 0.1 to 0.7 (Configuration 2). Figure 3 illustrates the classification accuracy as the co-occurring probability varies from 0.1 to 0.7 (Configuration 4). To further statistically compare the proposed method with the other algorithms, Table 2 summarizes the win/tie/loss counts between PRODEN-Linear and the comparing approaches. We find that the accuracy of PRODEN-Linear is highly competitive with all the parametric approaches. Compared with the best non-parametric method, IPAL, PRODEN-Linear still significantly outperforms it in 63.4% of cases, and other networks can easily be substituted to further advance the classification ability of PRODEN-Linear.

Real-world datasets  The characteristics of the real-world partial-label datasets are summarized in Appendix B. Table 3 reports the mean classification accuracy as well as the standard deviation of each method. Out of the 30 statistical tests, we can see that PRODEN-Linear achieves superior or at least comparable performance against all the comparing methods in 73.3% of cases, and is outperformed by them in only 10% of cases. This confirms that the advantage of PRODEN is afforded not only by deep neural networks but, more importantly, by the progressive identification process.

5 Conclusion

In this paper, we proposed a progressive identification method for partial-label learning, named PRODEN. We first proposed a novel risk estimator, which optimizes the minimal loss incurred by the candidate labels. We then proved theoretically that the classifier learned with the proposed risk estimator is equal to the classifier learned in ordinary multi-class learning under mild conditions. We also derived an estimation error bound, which theoretically guarantees the learned classifier's performance. Furthermore, we proposed a progressive identification method for approximately minimizing the proposed risk estimator. Our key idea is to conduct identification and classification seamlessly to mitigate the overfitting problem. Finally, we experimentally demonstrated that the proposed method can successfully train linear, fully connected, convolutional, and residual networks, and confirmed the superiority of the progressive identification process.

Appendix A Proofs

A.1 Proof of Lemma 1

According to [36], since the loss is non-negative, minimizing the conditional risk $C(g)$ is an alternative to minimizing $R(g)$. With the cross-entropy loss, the conditional risk can be written as

$$C(g)=-\sum_{i=1}^c P[Y=i\mid X=x]\log(g_i(x)).$$

Given that $g_i(x)$ is an estimate of $P[Y=i\mid x]$, $g(x)$ lies on the $(c-1)$-dimensional probability simplex, i.e., $g_i(x)\ge 0$ and $\sum_{i=1}^c g_i(x)=1$. Adding this constraint into the objective function by the Lagrange multiplier method [37], we have

$$L=C(g)-\lambda\Big(\sum_{i=1}^c g_i(x)-1\Big).$$

To minimize $L$, we take the partial derivative of $L$ with respect to $g_i(x)$ and set it to $0$:

$$\frac{\partial L}{\partial g_i}=-\frac{P[Y=i\mid X]}{g_i(x)}-\lambda=0,\qquad g^*_i(x)=-\frac{1}{\lambda}P[Y=i\mid x].$$

The simplex constraint $\sum_{i=1}^c g^*_i(x)=1$ ensures $\lambda=-1$, so that $g^*_i(x)=P[Y=i\mid x]$, which concludes the proof.∎

A.2 Proof of Theorem 1

Substituting the optimal decision function $g^*$ into the PLL risk estimator $R_{\mathrm{PLL}}$, we have

$$\begin{aligned}
R_{\mathrm{PLL}}(g^*) &= \mathbb{E}_{X,S}\Big[\min_{Y'\in S}\ell(\{g^*_1(X),\dots,g^*_c(X)\},Y')\Big] \\
&= \int \sum_{S}\min_{Y'\in S}\ell(\{g^*_1(X),\dots,g^*_c(X)\},Y')\,P[S\mid X]\,P[X]\,\mathrm{d}X \\
&= \int \sum_{S}\min_{Y'\in S}\ell(\{g^*_1(X),\dots,g^*_c(X)\},Y')\sum_{\tilde{Y}}P[S,\tilde{Y}\mid X]\,P[X]\,\mathrm{d}X \\
&= \int \sum_{\tilde{Y}}\sum_{S}\min_{Y'\in S}\ell(\{g^*_1(X),\dots,g^*_c(X)\},Y')\,P[S,\tilde{Y}\mid X]\,P[X]\,\mathrm{d}X \\
&= \int \sum_{\tilde{Y}}\sum_{S}\min_{Y'\in S}\ell(\{g^*_1(X),\dots,g^*_c(X)\},Y')\,P[S\mid X,\tilde{Y}]\,P[\tilde{Y}\mid X]\,P[X]\,\mathrm{d}X \\
&= \int \sum_{\tilde{Y}}\sum_{S}\ell(\{g^*_1(X),\dots,g^*_c(X)\},Y)\,P[S\mid X,\tilde{Y}]\,P[\tilde{Y}\mid X]\,P[X]\,\mathrm{d}X \\
&= \int \sum_{\tilde{Y}}\ell(\{g^*_1(X),\dots,g^*_c(X)\},Y)\sum_{S}P[S\mid X,\tilde{Y}]\,P[\tilde{Y}\mid X]\,P[X]\,\mathrm{d}X \\
&= \int \sum_{\tilde{Y}}\ell(\{g^*_1(X),\dots,g^*_c(X)\},Y)\,P[X,\tilde{Y}]\,\mathrm{d}X \\
&= R(g^*).
\end{aligned}$$

Note that the sixth equality holds because $g^*_i(X)=P[Y=i\mid X]$ (Assumption 2) and $P[Y\in S]=1$ (Assumption 1), which together give

$$\arg\min_{Y'\in S}\ell(\{g^*_1(X),\dots,g^*_c(X)\},Y')=Y.$$

According to the assumption that the hypothesis space is flexible enough, the Bayes error can be achieved, i.e., $R(g^*)$ equals the Bayes error, which means $R_{\mathrm{PLL}}$ is minimized by $g^*$, i.e., $g^*_{\mathrm{PLL}}=g^*$. Therefore $f^*_{\mathrm{PLL}}=f^*$. ∎

A.3 Proof of Lemma 2

$$\begin{aligned}
R_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})-R_{\mathrm{PLL}}(g^*_{\mathrm{PLL}}) &= \big(R_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})-\hat{R}_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})\big)+\big(\hat{R}_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})-\hat{R}_{\mathrm{PLL}}(g^*_{\mathrm{PLL}})\big) \\
&\quad+\big(\hat{R}_{\mathrm{PLL}}(g^*_{\mathrm{PLL}})-R_{\mathrm{PLL}}(g^*_{\mathrm{PLL}})\big) \\
&\le \big(R_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})-\hat{R}_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})\big)+\big(\hat{R}_{\mathrm{PLL}}(g^*_{\mathrm{PLL}})-R_{\mathrm{PLL}}(g^*_{\mathrm{PLL}})\big) \\
&\le 2\sup_{g\in\mathcal{G}}\big|R_{\mathrm{PLL}}(g)-\hat{R}_{\mathrm{PLL}}(g)\big|,
\end{aligned}$$

where the first inequality uses $\hat{R}_{\mathrm{PLL}}(\hat{g}_{\mathrm{PLL}})\le\hat{R}_{\mathrm{PLL}}(g^*_{\mathrm{PLL}})$, since $\hat{g}_{\mathrm{PLL}}$ minimizes the empirical risk.

The proof is completed. ∎

A.4 Proof of Lemma 3

By the definition of $\ell_{\mathrm{PLL}}$, $\ell_{\mathrm{PLL}}=\min_{Y'\in S}\ell(\{g_1(X),\dots,g_c(X)\},Y')$. Given sample size $n$, we first prove the result in the case $|S|=2$, i.e., $S=\{y_1,y_2\}$. The $\min$ operator can be written as

$$\min\{g_1,g_2\}=\frac{1}{2}\big[g_1+g_2-|g_1-g_2|\big].$$

In this way, we can write

$$\begin{aligned}
\mathfrak{R}_n(\ell_{\mathrm{PLL}}\circ\mathcal{G}) &= \mathbb{E}_\sigma\Big[\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^n\sigma_i\,\min\{\ell(g(x_i),y_1),\ell(g(x_i),y_2)\}\Big] \\
&= \mathbb{E}_\sigma\Big[\sup_{g\in\mathcal{G}}\frac{1}{2n}\sum_{i=1}^n\sigma_i\big[\ell(g(x_i),y_1)+\ell(g(x_i),y_2)-|\ell(g(x_i),y_1)-\ell(g(x_i),y_2)|\big]\Big] \\
&\le \mathbb{E}_\sigma\Big[\sup_{g\in\mathcal{G}}\frac{1}{2n}\sum_{i=1}^n\sigma_i\,\ell(g(x_i),y_1)\Big]+\mathbb{E}_\sigma\Big[\sup_{g\in\mathcal{G}}\frac{1}{2n}\sum_{i=1}^n\sigma_i\,\ell(g(x_i),y_2)\Big] \\
&\qquad+\mathbb{E}_\sigma\Big[\sup_{g\in\mathcal{G}}\frac{1}{2n}\sum_{i=1}^n\sigma_i\,\big|\ell(g(x_i),y_1)-\ell(g(x_i),y_2)\big|\Big] \\
&= \frac{1}{2}\big(\mathfrak{R}_n(\ell\circ\mathcal{G})+\mathfrak{R}_n(\ell\circ\mathcal{G})\big)+\mathbb{E}_\sigma\Big[\sup_{g\in\mathcal{G}}\frac{1}{2n}\sum_{i=1}^n\sigma_i\,\big|\ell(g(x_i),y_1)-\ell(g(x_i),y_2)\big|\Big].
\end{aligned}$$

Since $|\cdot|$ is a 1-Lipschitz function, by Talagrand's contraction lemma [38], the last term can be bounded:

$$\mathbb{E}_\sigma\Big[\sup_{g\in\mathcal{G}}\frac{1}{2n}\sum_{i=1}^n\sigma_i\,\big|\ell(g(x_i),y_1)-\ell(g(x_i),y_2)\big|\Big]\le \mathbb{E}_\sigma\Big[\sup_{g\in\mathcal{G}}\frac{1}{2n}\sum_{i=1}^n\sigma_i\big(\ell(g(x_i),y_1)-\ell(g(x_i),y_2)\big)\Big]\le \frac{1}{2}\big(\mathfrak{R}_n(\ell\circ\mathcal{G})+\mathfrak{R}_n(\ell\circ\mathcal{G})\big).$$

Combining the two bounds above, we have

$$\mathfrak{R}_n(\ell_{\mathrm{PLL}}\circ\mathcal{G})\le \mathfrak{R}_n(\ell\circ\mathcal{G})+\mathfrak{R}_n(\ell\circ\mathcal{G})=2\,\mathfrak{R}_n(\ell\circ\mathcal{G}).$$

The general case can be derived from the case $|S|=2$ by an immediate recurrence, using $\min\{g_1,\dots,g_k\}=\min\{g_1,\min\{g_2,\dots,g_k\}\}$, which gives $\mathfrak{R}_n(\ell_{\mathrm{PLL}}\circ\mathcal{G})\le c\,\mathfrak{R}_n(\ell\circ\mathcal{G})$.

Again we apply the Talagrand’s contraction lemma,

$$c\,\mathfrak{R}_n(\ell\circ\mathcal{G})\le cL_\ell\,\mathfrak{R}_n(\mathcal{G}).$$

The proof is completed. ∎

A.5 Supplementary theorem on Section 3

Theorem 5.

The learning objective in [9] is a special case of Eq. (6).

Proof.

First, recall the learning objective in [9] is formulated as:

$$\hat{R}_{\mathrm{PLL}}=\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^c \mathrm{KL}\big[\hat{P}[j\mid x_i]\,\big\|\,P[j\mid x_i;g]\big]. \tag{11}$$

Here, $\hat{P}[j\mid x_i]$ represents the unknown prior probability and $P[j\mid x_i;g]$ is the model-based conditional distribution.

Then, in Eq. (6), the loss function can be specified as the cross-entropy loss: $\tilde{\ell}_{\mathrm{CE}}(g_j(x_i),t)=-t\log g_j(x_i)$. We can easily find that the cross-entropy loss is linear in its second argument, i.e.,

$$w_{ij}\,\tilde{\ell}_{\mathrm{CE}}\big(g_j(x_i),\mathbb{I}(j\in s_i)\big)=\tilde{\ell}_{\mathrm{CE}}\big(g_j(x_i),w_{ij}\,\mathbb{I}(j\in s_i)\big).$$

Thus, the confidences can be moved into the loss function:

$$\hat{R}_{\mathrm{PLL}}=\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^c \tilde{\ell}_{\mathrm{CE}}\big(g_j(x_i),z_{ij}\big), \tag{12}$$

where $z_{ij}=w_{ij}\,\mathbb{I}(j\in s_i)$, and $z_{ij}$ is the estimate of $\hat{P}[j\mid x_i]$.

For optimizing $g$ in Eq. (12), we can add a term that is independent of $g$ (the negative entropy of $z$) to the learning objective:

$$\begin{aligned}
\hat{R}_{\mathrm{PLL}} &= \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^c\big[\tilde{\ell}_{\mathrm{CE}}(g_j(x_i),z_{ij})-\tilde{\ell}_{\mathrm{CE}}(z_{ij},z_{ij})\big] \\
&= \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^c\big[-z_{ij}\log g_j(x_i)+z_{ij}\log z_{ij}\big] \\
&= \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^c z_{ij}\log\frac{z_{ij}}{g_j(x_i)} \\
&= \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^c \mathrm{KL}\big[z_{ij}\,\big\|\,g_j(x_i)\big],
\end{aligned}$$

which is essentially equivalent to Eq. (11). Therefore, our learning objective is a strict extension of [9]; conversely, the learning objective in [9] is a special case of the proposed method. ∎
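As a quick numeric sanity check of this equivalence (our own, assuming the cross-entropy form $\tilde{\ell}_{\mathrm{CE}}(g,t)=-t\log g$): the weighted cross-entropy objective and the KL objective differ only by an entropy term that does not depend on $g$, so they share the same minimizer in $g$. The concrete numbers below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([0.7, 0.3, 0.0])                         # confidences z_ij (zero off-candidates)
entropy_term = -(z[z > 0] * np.log(z[z > 0])).sum()   # independent of g

for _ in range(3):
    g = rng.dirichlet(np.ones(3))                         # random model outputs
    ce = -(z * np.log(g)).sum()                           # Eq. (12) with the CE loss
    kl = (z[z > 0] * np.log(z[z > 0] / g[z > 0])).sum()   # Eq. (11) form
    assert np.isclose(ce - kl, entropy_term)              # they differ by a constant
```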

Appendix B Dataset information and experimental setting

B.1 Benchmark datasets

MNIST  This is a grayscale image dataset of handwritten digits from 0 to 9, where the size of the images is 28*28.

The linear model was a linear-in-input model ($d$-10) and used softmax activation in the output layer. An $\ell_2$-regularization was added, where the regularization parameter was fixed to 1e-5. The model was trained by SGD with a fixed learning rate of 1e-3 and a fixed batch size.

The MLP used for training MNIST was a fully connected network with ReLU as the activation function: $d$-300-300-300-300-10. The softmax activation function was also used in the output layer. Batch normalization [39] was applied before the hidden layers. The regularization parameter was 1e-3 and the learning rate was 1e-2.

Fashion-MNIST  This is also a grayscale image dataset similar to MNIST, but here each example is associated with a label from 10 fashion item classes.

The models and optimizer were the same as MNIST, except that the regularization parameter was 1e-5 for MLP.

Kuzushiji-MNIST  This is another variant of MNIST dataset, and each example is associated with a label from 10 cursive Japanese (Kuzushiji) characters.

The models and optimizer were the same as MNIST, except that the regularization parameter was 1e-6 for linear model and 1e-4 for MLP.

CIFAR-10  This dataset consists of 60,000 color images in 10 classes.

The detailed architecture of ConvNet [31] was as follows.

0th (input) layer:   (32*32*3)-

1st to 3rd layers:   Max Pooling-[C(3*3, 128)]*3-

4th to 6th layers:   Max Pooling-[C(3*3, 256)]*3-

7th to 9th layers:   Max Pooling-C(3*3, 512)-C(3*3, 256)-C(3*3, 128)-

10th layer:   Average Pooling-10

where C(3*3, 128) means 128 channels of 3*3 convolutions followed by the Leaky-ReLU (LReLU) activation function [40], and [ · ]*3 means 3 such layers, etc. Again, the softmax activation function was used in the output layer. Besides, dropout (with the dropout rate set to 50%) and batch normalization were also used. The model was trained by SGD with the default momentum parameters. The regularization parameter was 1e-4 and the batch size was 500. In addition, the initial learning rate was 1e-2 and was multiplied by

 decay⌊epoch/50⌋,

where decay was 0.9.
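The step decay of the learning rate described above can be written as a one-line schedule (a sketch; the function name is illustrative):

```python
def learning_rate(epoch, initial_lr=1e-2, decay=0.9, step=50):
    # step decay: initial_lr * decay ** floor(epoch / step)
    return initial_lr * decay ** (epoch // step)
```

That is, the rate stays at 1e-2 for the first 50 epochs, then drops by a factor of 0.9 every 50 epochs.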

The detailed architecture of ResNet-32 [34] was as follows.

0th (input) layer:   (32*32*3)-

1st to 11th layers:   C(3*3, 16)-[C(3*3, 16), C(3*3, 16)]*5-

12th to 21st layers:   [C(3*3, 32), C(3*3, 32)]*5-

22nd to 31st layers:   [C(3*3, 64), C(3*3, 64)]*5-

32nd layer:   Average Pooling-10

where [ ·, · ] means a building block [34]. The optimization setup was the same as for MNIST, except that the regularization parameter was 1e-3 and the fixed learning rate was 5e-2.

B.2 UCI datasets and real-world datasets

The characteristics of UCI datasets and real-world datasets are reported in Table 4 and Table 5 respectively.

First, we normalized these datasets by z-score by convention. On all these datasets, PRODEN used a linear model trained by SGD with momentum 0.9. The regularization parameter and learning rate were as follows: 1e-3 and 1e-3 on usps; 1e-2 and 1e-1 on ecoli, deter, glass, Lost, MSRCv2, and BirdSong; 1e-3 and 1e-2 on Soccer Player; and 1e-4 and 1e-2 on Yahoo!News. On regular-scale datasets (at most 5,000 examples), we used a full batch update, whereas on large-scale datasets (more than 5,000 examples), the batch size was fixed to 1000.
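The z-score normalization standardizes each feature to zero mean and unit variance (a NumPy sketch; the eps term is our addition, guarding only against constant features):

```python
import numpy as np

def z_score(X, eps=1e-12):
    # standardize each feature (column) to zero mean, unit variance;
    # eps avoids division by zero on constant features
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)
```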

C.1 Transductive results on benchmark datasets

The transductive results, which reflect the ability to identify the true labels in the training sets, are shown in Figure 4.

These results are consistent with the findings in Figure 1: the proposed method can successfully train linear models, fully connected networks, convolutional networks, and residual networks with stochastic optimization. The progressive identification process successfully mitigates the overfitting issue of the iterative EM method, and outperforms the other compared methods even when the flipping probability is considerably large.

C.2 Classification accuracy on controlled UCI datasets

Figure 5 and Figure 6 show the classification accuracy of the compared approaches as the flipping probability ranges from 0.1 to 0.7 under cases (C.1) and (C.3), respectively.

Similar to Figure 2 and Figure 3, the proposed method PRODEN-Linear is highly comparable to all the parametric methods; compared with the best non-parametric approach, IPAL, PRODEN-Linear still significantly outperforms it in most cases.
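For reference, a common protocol for synthesizing candidate label sets under a uniform flipping probability q can be sketched as follows (an illustrative NumPy sketch; the function name and the binary indicator-matrix representation are our assumptions): the true label is always kept, and each other label enters the candidate set independently with probability q.

```python
import numpy as np

def make_candidate_sets(y_true, num_classes, q, seed=None):
    # binary indicator matrix of shape (n, num_classes): the true
    # label is always a candidate; every other label is flipped in
    # independently with probability q
    rng = np.random.default_rng(seed)
    n = len(y_true)
    candidates = (rng.random((n, num_classes)) < q).astype(int)
    candidates[np.arange(n), y_true] = 1
    return candidates
```

With q = 0 this reduces to ordinary supervised learning (each candidate set is a singleton), while larger q yields increasingly ambiguous supervision.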

References

1. Z. Zhou, “A brief introduction to weakly supervised learning,” National Science Review, vol. 5, no. 1, pp. 44–53, 2017.
2. N. Nguyen and R. Caruana, “Classification with partial labels,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08), (Las Vegas, NV), pp. 381–389, 2008.
3. T. Cour, B. Sapp, and B. Taskar, “Learning from partial labels,” Journal of Machine Learning Research, vol. 12, no. 5, pp. 1501–1536, 2011.
4. M. Zhang and F. Yu, “Solving the partial label learning problem: An instance-based approach,” in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI’15), (Buenos Aires, Argentina), pp. 4048–4054, 2015.
5. Z. Zeng, S. Xiao, K. Jia, T. Chan, S. Gao, D. Xu, and Y. Ma, “Learning by associating ambiguously labeled images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13), (Portland, OR), pp. 708–715, 2013.
6. C. Chen, V. M. Patel, and R. Chellappa, “Learning from ambiguously labeled face images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 7, pp. 1653–1667, 2018.
7. J. Luo and F. Orabona, “Learning from candidate labeling sets,” in Advances in Neural Information Processing Systems 23 (NIPS’10), (Vancouver, Canada), pp. 1504–1512, 2010.
8. L. Liu and T. G. Dietterich, “A conditional multinomial mixture model for superset label learning,” in Advances in Neural Information Processing Systems 25 (NIPS’12), (Lake Tahoe, NV), pp. 548–556, 2012.
9. R. Jin and Z. Ghahramani, “Learning with multiple labels,” in Advances in Neural Information Processing Systems 16 (NIPS’03), (Vancouver, Canada), pp. 921–928, 2003.
10. Y. Chen, V. M. Patel, J. K. Pillai, R. Chellappa, and P. J. Phillips, “Dictionary learning from ambiguously labeled data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13), (Portland, OR), pp. 353–360, 2013.
11. Y. Chen, V. M. Patel, J. K. Pillai, R. Chellappa, and P. J. Phillips, “Ambiguously labeled learning using dictionaries,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2076–2088, 2014.
12. A. Shrivastava, V. M. Patel, and R. Chellappa, “Non-linear dictionary learning with partially labeled data,” Pattern Recognition, vol. 48, no. 11, pp. 3283–3292, 2015.
13. C. Tang and M. Zhang, “Confidence-rated discriminative partial label learning,” in 31st AAAI Conference on Artificial Intelligence (AAAI’17), (San Francisco, CA), 2017.
14. L. Feng and B. An, “Partial label learning with self-guided retraining,” in 33rd AAAI Conference on Artificial Intelligence (AAAI’19), (Honolulu, HI), pp. 3542–3549, 2019.
15. L. Feng and B. An, “Partial label learning by semantic difference maximization,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19), (Macao, China), pp. 2294–2300, 2019.
16. F. Yu and M. Zhang, “Maximum margin partial label learning,” Machine Learning, vol. 106, no. 4, pp. 573–593, 2017.
17. C. Hsu and C. Lin, “A comparison of methods for multiclass support vector machines,” IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.
18. R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, “Liblinear: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, no. Aug, pp. 1871–1874, 2008.
19. E. Hüllermeier and J. Beringer, “Learning from ambiguously labeled examples,” Intelligent Data Analysis, vol. 10, no. 5, pp. 419–439, 2006.
20. L. Bottou and O. Bousquet, “The tradeoffs of large scale learning,” in Advances in Neural Information Processing Systems 20 (NIPS’07), vol. 20, (Vancouver, Canada), pp. 1–8, 2007.
21. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
22. H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400–407, 1951.
23. J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
24. D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR’15), 2015.
25. V. N. Vapnik, “An overview of statistical learning theory,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999.
26. L. Liu and T. G. Dietterich, “Learnability of the superset label learning problem,” in Proceedings of 31st International Conference on Machine Learning (ICML’14), (Beijing, China), pp. 1629–1637, 2014.
27. H. L. Seal, The Historical Development of the Gauss Linear Model. Yale University, 1968.
28. X. Yu, T. Liu, M. Gong, and D. Tao, “Learning with biased complementary labels,” in Proceedings of the 15th European Conference on Computer Vision (ECCV’18), (Munich, Germany), pp. 68–83, 2018.
29. P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 463–482, 2002.
30. D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. Kanwal, T. Maharaj, and A. Fischer, “A closer look at memorization in deep networks,” in Proceedings of 34th International Conference on Machine Learning (ICML’17), (Sydney, Australia), pp. 233–242, 2017.
31. B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” in Advances in Neural Information Processing Systems 31 (NeurIPS’18), (Montreal, Canada), pp. 8527–8537, 2018.
32. A. Ghosh, H. Kumar, and P. Sastry, “Robust loss functions under label noise for deep neural networks,” in 31st AAAI Conference on Artificial Intelligence (AAAI’17), (San Francisco, CA), 2017.
33. T. Ishida, G. Niu, A. K. Menon, and M. Sugiyama, “Complementary-label learning for arbitrary losses and models,” in Proceedings of 36th International Conference on Machine Learning (ICML’19), (Long Beach, CA), pp. 2971–2980, 2019.
34. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the 29th IEEE conference on Computer Vision and Pattern Recognition (CVPR’16), (Las Vegas, NV), pp. 770–778, 2016.
35. M. Zhang, F. Yu, and C. Tang, “Disambiguation-free partial label learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10, pp. 2155–2167, 2017.
36. H. Masnadi-Shirazi and N. Vasconcelos, “On the design of loss functions for classification: theory, robustness to outliers, and savageboost,” in Advances in Neural Information Processing Systems 22 (NIPS’09), (Vancouver, Canada), pp. 1049–1056, 2009.
37. D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997.
38. M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
39. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning (ICML’15), (Lille, France), pp. 448–456, 2015.
40. A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proceedings of 30th International Conference on Machine Learning (ICML’13), vol. 30, (Atlanta, GA), p. 3, 2013.
41. G. Panis and A. Lanitis, “An overview of research activities in facial age estimation using the fg-net aging database,” in Proceedings of the 13th European Conference on Computer Vision (ECCV’14), (Zurich, Switzerland), pp. 737–750, 2014.
42. F. Briggs, X. Z. Fern, and R. Raich, “Rank-loss support instance machines for miml instance annotation,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’12), (Beijing, China), pp. 534–542, 2012.
43. Z. Zeng, S. Xiao, K. Jia, T. Chan, S. Gao, D. Xu, and Y. Ma, “Learning by associating ambiguously labeled images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13), (Portland, OR), pp. 708–715, 2013.
44. M. Guillaumin, J. Verbeek, and C. Schmid, “Multiple instance metric learning from automatically labeled bags of faces,” Lecture Notes in Computer Science, vol. 63, no. 11, pp. 634–647, 2010.