
# Online Algorithms for Multiclass Classification using Partial Labels

## Abstract

In this paper, we propose online algorithms for multiclass classification using partial labels. We propose two variants of Perceptron, called Avg Perceptron and Max Perceptron, to deal with partially labeled data. We also propose Avg Pegasos and Max Pegasos, which are extensions of the Pegasos algorithm. We provide mistake bounds for Avg Perceptron and a regret bound for Avg Pegasos. We show the effectiveness of the proposed approaches by experimenting on various datasets and comparing them with the standard Perceptron and Pegasos.

###### Keywords:
Online Learning · Pegasos · Perceptron.

## 1 Introduction

Multiclass classification is a well-studied problem in machine learning. Typically, it is assumed that the true label of every example in the training data is known. In many applications, however, we don't have access to the true class label, as labeling data is an expensive and time-consuming process. Instead, we get a set of candidate labels for every example. This setting is called multiclass learning with partial labels. The true or ground-truth label is assumed to be one of the labels in the partial label set. Partially labeled data is relatively easier to obtain and thus provides a cheap alternative to learning with exact labels.

Learning with partial labels is referred to as superset label learning [11], ambiguous label learning [1], and by other names in different papers. Many proposed models try to disambiguate the correct labels from the incorrect ones. One popular approach is to treat the unknown correct label in the candidate set as a latent variable and then use an Expectation-Maximization type algorithm to estimate the correct label as well the model parameters iteratively ([9], [16], [11], [7], [1]). Other approaches to label disambiguation include using a maximum margin formulation [18], which alternates between ground truth identification and maximizing the margin from the ground-truth label to all other labels. Another model assumes that the ground truth label is the one to which the maximum score is assigned in the candidate label set by the model [12]. Then the margin between this ground-truth label and all other labels not in the candidate set is maximized.

Some approaches try to predict the label of an unseen instance by averaging the candidate labeling information of its nearest neighbors in the training set ([19], [8]). Some formulations combine the partial label learning framework with other frameworks like multi-label learning [17]. There are also specific approaches that do not try to disambiguate the label set directly. For example, Zhang et al. [20] introduced an algorithm that works to utilize the entire candidate label set using a method involving error-correcting codes.

A general risk minimization framework for learning with partial labels is discussed in Cour et al. ([2], [3]). In this framework, any standard convex loss function can be modified to be used in the partial label setting. For a single instance, since the ground-truth label is not available, an average over the scores in the candidate label set is taken as a proxy to calculate the loss. Nguyen and Caruana [12] propose a risk minimization approach based on a non-convex max-margin loss for a partial label setting.

In this paper, we propose online algorithms for multiclass classification using partially labeled data. The Perceptron algorithm [13] is one of the earliest online learning algorithms. Perceptron for multiclass classification is proposed in [6]. A unified framework for designing online update rules for multiclass classification was provided in [4]. An online variant of the support vector machine [15], called Pegasos, is proposed in [14]. This algorithm is shown to achieve $O(\frac{\log T}{T})$ average regret (where $T$ is the number of rounds). Once again, all these online approaches assume that we know the true label for each example.

Online multiclass learning with partial labels has remained an unaddressed problem. In this paper, we propose several online multiclass algorithms using partial labels. Our key contributions in this paper are as follows.

1. We propose Avg Perceptron and Max Perceptron, which are extensions of Perceptron to handle partial labels. Similarly, we propose Avg Pegasos and Max Pegasos, which are extensions of the Pegasos algorithm.

2. We derive mistake bounds for Avg Perceptron in both the separable and the general case. Similarly, we provide a regret bound for Avg Pegasos.

3. We also provide thorough experimental validation of our algorithms using datasets of different dimensions and compare the performance of the proposed algorithms with standard multiclass Perceptron and Pegasos.

## 2 Multiclass Classification Using Partially Labeled Data

We now formally discuss the problem of multiclass classification given a partially labeled training set. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the feature space from which the instances are drawn and let $[K] = \{1,\ldots,K\}$ be the output label space. Every instance $x \in \mathcal{X}$ is associated with a candidate label set $Y \subseteq [K]$. The set of labels not present in the candidate label set is denoted by $\bar{Y}$. Obviously, $Y \cup \bar{Y} = [K]$.¹ The ground-truth label associated with $x$ is denoted by lowercase $y$. It is assumed that the actual label lies within the candidate set (i.e., $y \in Y$). The goal is to learn a classifier $h: \mathcal{X} \rightarrow [K]$. Let us assume that $h$ is a linear classifier. Thus, $h$ is parameterized by a matrix of weights $W \in \mathbb{R}^{d \times K}$ and is defined as $h(x) = \arg\max_{i \in [K]} w_i \cdot x$, where $w_i$ (the $i$-th column vector of $W$) denotes the parameter vector corresponding to the $i$-th class. The discrepancy between the true label and the predicted label is captured using the 0-1 loss $L_{0\text{-}1}(h(x), y) = I\{h(x) \neq y\}$. Here, $I\{\cdot\}$ is the 0-1 indicator function, which evaluates to 1 when the condition inside is true and to 0 otherwise. However, in the case of partial labels, we use the partial (ambiguous) 0-1 loss [2] as follows.

$$L_A(h(x), Y) = I\{h(x) \notin Y\} \qquad (1)$$

Minimizing $L_A$ is difficult as it is not continuous. Thus, we use continuous surrogates of $L_A$. A convex surrogate of $L_A$ is the average prediction hinge loss (APH) [2], which is defined as follows.

$$L_{APH}(h(x), Y) = \left[1 - \frac{1}{|Y|}\sum_{i \in Y} w_i \cdot x + \max_{j \notin Y} w_j \cdot x\right]_+ \qquad (2)$$

where $|Y|$ is the size of the candidate label set and $[a]_+ = \max(0, a)$. $L_{APH}$ is shown to be a convex surrogate of $L_A$ in [3]. There is another, non-convex, surrogate loss function called the max prediction hinge loss (MPH) [12] that can be used for partial labels, which is defined as follows:

$$L_{MPH}(h(x), Y) = \left[1 - \max_{i \in Y} w_i \cdot x + \max_{j \notin Y} w_j \cdot x\right]_+ \qquad (3)$$

In this paper, we present online algorithms based on stochastic gradient descent on $L_{APH}$ and $L_{MPH}$.
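To make the three losses concrete, here is a minimal Python sketch (illustrative only, not the authors' code): `W` is the $d \times K$ weight matrix with one column per class, `x` an instance, and `Y` the candidate label set.

```python
import numpy as np

def partial_zero_one_loss(W, x, Y):
    # L_A (Eq. 1): 1 if the predicted class falls outside the candidate set Y.
    return float(int(np.argmax(W.T @ x)) not in Y)

def aph_loss(W, x, Y):
    # L_APH (Eq. 2): hinge on (average score over Y) minus (best score outside Y).
    scores = W.T @ x
    Ybar = [k for k in range(W.shape[1]) if k not in Y]
    avg_in = np.mean([scores[i] for i in Y])
    max_out = max(scores[j] for j in Ybar)
    return max(0.0, 1.0 - avg_in + max_out)

def mph_loss(W, x, Y):
    # L_MPH (Eq. 3): hinge on (best score in Y) minus (best score outside Y).
    scores = W.T @ x
    Ybar = [k for k in range(W.shape[1]) if k not in Y]
    max_in = max(scores[i] for i in Y)
    max_out = max(scores[j] for j in Ybar)
    return max(0.0, 1.0 - max_in + max_out)
```

Note that since the maximum score over $Y$ is at least the average, $L_{MPH}(h(x), Y) \le L_{APH}(h(x), Y)$ for every $W$, $x$ and $Y$.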

## 3 Multiclass Perceptron using Partial Labels

In this section, we propose two variants of multiclass Perceptron using partial labels. Let the instance observed at time $t$ be $x_t$ and its corresponding candidate label set be $Y_t$. The weight matrix at time $t$ is $W^t$, and the $k$-th column of $W^t$ is denoted by $w_k^t$. To update the weights, we propose two different schemes: (a) Avg Perceptron (using stochastic gradient descent on $L_{APH}$) and (b) Max Perceptron (using stochastic gradient descent on $L_{MPH}$). We use the following sub-gradients of $L_{APH}$ and $L_{MPH}$.

$$\nabla_{w_k} L_{APH} = \begin{cases} 0, & \text{if } \frac{1}{|Y|}\sum_{i \in Y} w_i \cdot x - \max_{j \in \bar{Y}} w_j \cdot x \ge 1\\[4pt] -\frac{x}{|Y|}, & \text{if } \frac{1}{|Y|}\sum_{i \in Y} w_i \cdot x - \max_{j \in \bar{Y}} w_j \cdot x < 1 \text{ and } k \in Y\\[4pt] x, & \text{if } \frac{1}{|Y|}\sum_{i \in Y} w_i \cdot x - \max_{j \in \bar{Y}} w_j \cdot x < 1 \text{ and } k = \arg\max_{j \in \bar{Y}} w_j \cdot x\\[4pt] 0, & \text{if } \frac{1}{|Y|}\sum_{i \in Y} w_i \cdot x - \max_{j \in \bar{Y}} w_j \cdot x < 1,\ k \in \bar{Y} \text{ and } k \neq \arg\max_{j \in \bar{Y}} w_j \cdot x \end{cases} \qquad (4)$$
$$\nabla_{w_k} L_{MPH} = \begin{cases} 0, & \text{if } \max_{j \in Y} w_j \cdot x - \max_{j \in \bar{Y}} w_j \cdot x \ge 1\\[4pt] -x, & \text{if } \max_{j \in Y} w_j \cdot x - \max_{j \in \bar{Y}} w_j \cdot x < 1 \text{ and } k = \arg\max_{i \in Y} w_i \cdot x\\[4pt] x, & \text{if } \max_{j \in Y} w_j \cdot x - \max_{j \in \bar{Y}} w_j \cdot x < 1 \text{ and } k = \arg\max_{i \in \bar{Y}} w_i \cdot x \end{cases} \qquad (5)$$

We initialize the weight matrix $W^1$ as a matrix of zeros. At trial $t$, the update rule for $w_i$ can be written as:

$$w_i^{t+1} = w_i^t - \eta\,\nabla_{w_i} L(h_t(x_t), Y_t)$$

where $\eta$ is the step size and $\nabla_{w_i} L$ is found using Eq. (4) and (5). The complete descriptions of Avg Perceptron and Max Perceptron are provided in Algorithms 1 and 2, respectively.
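A single Avg Perceptron trial under this update rule might look as follows (an illustrative sketch under the conventions above; the function name is ours). When the average-vs-max margin is violated, every column in $Y$ receives $+\eta x/|Y|$ and the best-scoring column outside $Y$ receives $-\eta x$, matching the sub-gradient in Eq. (4):

```python
import numpy as np

def avg_perceptron_update(W, x, Y, eta=1.0):
    # One trial of Avg Perceptron: w_i <- w_i - eta * grad_{w_i} L_APH (Eq. 4).
    # W: (d, K) weight matrix, x: (d,) instance, Y: candidate label list.
    K = W.shape[1]
    Ybar = [k for k in range(K) if k not in Y]
    scores = W.T @ x
    avg_in = np.mean([scores[i] for i in Y])
    j_star = max(Ybar, key=lambda j: scores[j])   # argmax over the complement of Y
    W = W.copy()                                  # keep the caller's matrix intact
    if avg_in - scores[j_star] < 1:               # margin violated: non-zero sub-gradient
        for i in Y:
            W[:, i] += eta * x / len(Y)           # -eta * (-x / |Y|)
        W[:, j_star] -= eta * x                   # -eta * (+x)
    return W
```

Max Perceptron differs only in that the positive part of the update goes entirely to the single best-scoring column inside $Y$ (Eq. (5)).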

### 3.1 Mistake Bound Analysis

In the partial label setting, we say that a mistake happens when the predicted class label for an example does not belong to its partial label set. We first define two variants of linear separability in the partial label setting as follows.

###### Definition 1 (Average Linear Separability in Partial Label Setting)

Let $(x_1, Y_1), \ldots, (x_T, Y_T)$ be the training set for multiclass classification with partial labels. We say that the data is average linearly separable if there exist $w_1, \ldots, w_K$ and $\gamma > 0$ such that

$$\frac{1}{|Y_t|}\sum_{i \in Y_t} w_i \cdot x_t - \max_{j \in \bar{Y}_t} w_j \cdot x_t \ge \gamma, \quad \forall t \in [T].$$

Thus, average linear separability implies that $L_A(h(x_t), Y_t) = 0$ for all $t \in [T]$.

###### Definition 2 (Max Linear Separability in Partial Label Setting)

Let $(x_1, Y_1), \ldots, (x_T, Y_T)$ be the training set for multiclass classification with partial labels. We say that the data is max linearly separable if there exist $w_1, \ldots, w_K$ and $\gamma > 0$ such that

$$\max_{i \in Y_t} w_i \cdot x_t - \max_{j \in \bar{Y}_t} w_j \cdot x_t \ge \gamma, \quad \forall t \in [T].$$

Thus, max linear separability implies that $L_A(h(x_t), Y_t) = 0$ for all $t \in [T]$.
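Both notions are easy to check on a finite sample. A small sketch (helper names ours, same weight-matrix conventions as before):

```python
import numpy as np

def avg_margin(W, x, Y):
    # Average-prediction margin from Definition 1 for a single (x, Y) pair.
    scores = W.T @ x
    Ybar = [k for k in range(W.shape[1]) if k not in Y]
    return np.mean([scores[i] for i in Y]) - max(scores[j] for j in Ybar)

def max_margin(W, x, Y):
    # Max-prediction margin from Definition 2 for a single (x, Y) pair.
    scores = W.T @ x
    Ybar = [k for k in range(W.shape[1]) if k not in Y]
    return max(scores[i] for i in Y) - max(scores[j] for j in Ybar)

def is_avg_separable(W, data, gamma):
    # data: iterable of (x, Y) pairs
    return all(avg_margin(W, x, Y) >= gamma for x, Y in data)

def is_max_separable(W, data, gamma):
    return all(max_margin(W, x, Y) >= gamma for x, Y in data)
```

Since the maximum over $Y$ dominates the average over $Y$, average linear separability with margin $\gamma$ implies max linear separability with the same margin (but not conversely).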

We bound the number of mistakes made by Avg Perceptron (Algorithm 1) as follows.

###### Theorem 3.1 (Mistake Bound for Avg Perceptron Under Average Linear Separability)

Let $(x_1, Y_1), \ldots, (x_T, Y_T)$ be the examples presented to Avg Perceptron, where $\|x_t\| \le R$ and $|Y_t| \ge c$ for all $t \in [T]$. Let $W^*$ (with $\|W^*\| = 1$) be such that the data is average linearly separable with margin $\gamma$ (Definition 1). Then we get the following mistake bound for the Avg Perceptron algorithm.

$$m \le \frac{2 + \left[\frac{1}{c} + 1\right]R^2}{\gamma^2}$$

where $m$ is the number of mistakes, $c = \min_{t \in [T]} |Y_t|$, $R = \max_{t \in [T]} \|x_t\|$ and $\gamma$ is the margin of separation.

The proof is given in Appendix A. We first notice that the bound is inversely related to the minimum label set size. This is intuitively obvious, as the smaller the candidate label set, the larger the chance of having a non-zero loss. When $c = 1$, the bound reduces to the normal multiclass Perceptron mistake bound for linearly separable data as given in [4]. Also, the number of mistakes is inversely proportional to $\gamma^2$. Linear separability (Definition 1) may not always hold for the training data. Thus, it is important to see how Avg Perceptron performs in such cases. We now bound the number of updates in $T$ rounds for partially labeled data which is linearly non-separable in the sense of Definition 1.

###### Theorem 3.2 (Mistake Bound for Avg Perceptron in Non-Separable Case)

Let $(x_1, Y_1), \ldots, (x_T, Y_T)$ be an input sequence presented to Avg Perceptron. Let $W^*$ (with $\|W^*\| = 1$) be a weight matrix corresponding to a multiclass classifier. Then, for a fixed $\gamma > 0$, let $d_t = \max\left\{0,\ \gamma - \left[\frac{1}{|Y_t|}\sum_{i \in Y_t} w_i^* \cdot x_t - \max_{j \in \bar{Y}_t} w_j^* \cdot x_t\right]\right\}$. Let $\Delta = \sqrt{\sum_{t=1}^T d_t^2}$, $R = \max_t \|x_t\|$ and $c = \min_t |Y_t|$. Then, the mistake bound for Avg Perceptron is as follows.

$$\sum_{t=1}^{T} L_A(h_t(x_t), Y_t) \le \frac{2Z^2}{\gamma^2} + \frac{2KR^2 + \Delta^2}{(\gamma/Z)^2}$$

where $K = \frac{1}{c} + 1$ and $Z = \sqrt{1 + \frac{\Delta^2}{D^2}}$ with $D$ as chosen in the proof.

The proof is provided in Appendix B.

## 4 Online Multiclass Pegasos Using Partial Labels

Pegasos [14] is an online algorithm originally proposed for the exact label setting. In Pegasos, a regularizer on the weights is minimized along with the hinge loss, making the overall objective function strongly convex. The strong convexity enables the algorithm to achieve $O(\frac{\log T}{T})$ average regret in $T$ trials. The objective function of Pegasos at trial $t$ is the following.

$$f(W, x_t, Y_t) = \frac{\lambda}{2}\|W\|^2 + L(h(x_t), Y_t)$$

Here, $\lambda$ is a regularization constant and $\|W\|$ is the Frobenius norm of the weight matrix. Let $W_t$ be the weight matrix at the beginning of trial $t$. Then, $W_{t+1}$ is found as $W_{t+1} = \Pi_B(W_t - \eta_t \nabla_t)$. Here, $\nabla_t$ is a sub-gradient of $f(W, x_t, Y_t)$ at $W_t$, $\eta_t = \frac{1}{\lambda t}$ is the step size at trial $t$ and $\Pi_B$ is a projection operation onto the set $B = \{W : \|W\| \le \frac{1}{\sqrt{\lambda}}\}$, which is defined as $\Pi_B(W) = \min\left(1, \frac{1/\sqrt{\lambda}}{\|W\|}\right)W$. Thus, $\|W_t\| \le \frac{1}{\sqrt{\lambda}}$ for all $t$.

We now propose extensions of Pegasos [14] for online multiclass learning using partially labeled data. We again propose two variants of Pegasos: (a) Avg Pegasos (using the average prediction hinge loss, Eq. (2)) and (b) Max Pegasos (using the max prediction hinge loss, Eq. (3)). We first note that $\nabla_t$ can be written as:

$$\nabla_t = \lambda W_t + \nabla_{W_t} L \qquad (6)$$

where $\nabla_{W_t} L$ is given by Eq. (4) (for $L_{APH}$) and Eq. (5) (for $L_{MPH}$). Complete descriptions of Avg Pegasos and Max Pegasos are given in Algorithms 3 and 4, respectively.
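One trial of Avg Pegasos — a sub-gradient step on the regularized objective followed by projection onto the ball $\|W\| \le 1/\sqrt{\lambda}$ — can be sketched as follows (illustrative only; the function name is ours):

```python
import numpy as np

def avg_pegasos_step(W, x, Y, lam, t):
    # One trial of Avg Pegasos: W <- Proj_B(W - eta_t * (lam*W + grad L_APH)).
    K = W.shape[1]
    Ybar = [k for k in range(K) if k not in Y]
    scores = W.T @ x
    grad = lam * W                                # regularizer part of Eq. (6)
    avg_in = np.mean([scores[i] for i in Y])
    j_star = max(Ybar, key=lambda j: scores[j])
    if avg_in - scores[j_star] < 1:               # APH sub-gradient (Eq. 4)
        for i in Y:
            grad[:, i] -= x / len(Y)
        grad[:, j_star] += x
    eta = 1.0 / (lam * t)                         # step size at trial t
    W_new = W - eta * grad
    norm = np.linalg.norm(W_new)                  # Frobenius norm
    radius = 1.0 / np.sqrt(lam)
    if norm > radius:                             # projection onto the ball B
        W_new *= radius / norm
    return W_new
```

Max Pegasos is obtained by swapping in the MPH sub-gradient of Eq. (5); the step-size schedule and projection are unchanged.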

### 4.1 Regret Bound Analysis of Avg Pegasos

We now derive the regret bound for Avg Pegasos.

###### Theorem 4.1

Let $(x_1, Y_1), \ldots, (x_T, Y_T)$ be an input sequence where $\|x_t\| \le R$ and $|Y_t| \ge c$ for all $t$. Let $f$ be as defined in Section 4. Then the regret of Avg Pegasos is given as:

$$\frac{1}{T}\sum_{t=1}^{T} f(W_t, x_t, Y_t) - \min_W \frac{1}{T}\sum_{t=1}^{T} f(W, x_t, Y_t) \le \frac{G^2 \ln T}{\lambda T}$$

where $G = \sqrt{\lambda} + \sqrt{1 + \frac{1}{c}}\,R$ and $c = \min_t |Y_t|$.

The proof is given in Appendix C. We again see that the regret bound is inversely related to the minimum candidate label set size.

## 5 Experiments

We now describe the experimental results. We perform experiments on Ecoli, Satimage, Dermatology, and USPS datasets (available on UCI repository [5]) and MNIST dataset [10]. We perform experiments using the proposed algorithms Avg Perceptron, Max Perceptron, Avg Pegasos, and Max Pegasos. For benchmarking, we use Perceptron and Pegasos based on exact labels.

For all the datasets, the candidate or partial label set for each instance contains the true label and some labels selected uniformly at random from the remaining labels. After every trial, the average misclassification rate (the average of the 0-1 loss, with respect to the true label, over the examples seen up to that trial) is calculated. This sets a hard evaluation criterion for the algorithms. The number of rounds for each dataset is selected by observing when the error curves start to converge. For every dataset, we repeat the process of generating partial label sets and plotting the error curves 100 times and average the instantaneous error rates across the 100 runs. The final plots for each dataset have the average instantaneous error rate on the Y-axis and the number of rounds on the X-axis.
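The protocol above — candidate sets built as the true label plus uniformly drawn distractors, and a running 0-1 error measured against the true label — can be sketched as (helper names ours):

```python
import random

def make_partial_label_set(y_true, num_classes, set_size, seed=None):
    # Candidate set: the true label plus (set_size - 1) distinct labels
    # drawn uniformly at random from the remaining classes.
    rng = random.Random(seed)
    others = [k for k in range(num_classes) if k != y_true]
    return [y_true] + rng.sample(others, set_size - 1)

def running_error(predictions, true_labels):
    # Average 0-1 misclassification rate w.r.t. the TRUE label after each trial.
    errors, wrong = [], 0
    for t, (p, y) in enumerate(zip(predictions, true_labels), start=1):
        wrong += (p != y)
        errors.append(wrong / t)
    return errors
```

Averaging `running_error` curves over repeated random draws of the candidate sets gives the plotted curves.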

For every dataset, we plot the error rate curves for all the algorithms for different candidate label set sizes. This helps us understand how the online algorithms behave as the candidate label set size increases. For the Dermatology dataset, which contains six classes, we take candidate label sets of sizes 2 and 4, as shown in Fig. 1. We see that the average prediction loss based algorithms perform better in both cases. The results for the Ecoli dataset for candidate label sets of sizes 2, 4 and 6 are shown in Fig. 2. Here, we find that the Max Pegasos algorithm performs comparably to the algorithms based on the average prediction loss for candidate label set sizes 2 and 4. But for candidate label set size 6, the max prediction loss performs significantly worse than the average prediction loss based algorithms.

The results for the Satimage and USPS datasets are shown in Figs. 3 and 4, respectively. For Satimage, Max Pegasos performs best for the label set of size 2. But for label set size 4, the average prediction loss based algorithms perform much better. For USPS, though Max Perceptron and Max Pegasos perform better than the average prediction loss based algorithms for candidate label set sizes 2 and 4, for label set sizes 6 and 8, the average prediction loss based algorithms perform much better. The results for MNIST are provided in Fig. 5. Here we observe that Max Perceptron and Max Pegasos perform much better than the other algorithms for label set sizes 2 and 4. However, for label set sizes 6 and 8, Avg Pegasos performs best.

Overall, we see that for smaller label set sizes, the max prediction loss performs quite well. However, the average prediction loss performs best for larger candidate label set sizes. Studying the convergence and theoretical properties of the non-convex max prediction loss is an exciting direction for future work.

## 6 Conclusion

In this paper, we proposed online algorithms for classifying partially labeled data. This is very useful in real-life scenarios where multiple annotators give different labels for the same instance. We presented algorithms based on Perceptron and Pegasos. We also provided mistake bounds for the Perceptron based algorithms and a regret bound for the Pegasos based algorithm. We also provided an experimental comparison of all the algorithms on various datasets. The results show that though the average prediction loss is convex, the non-convex max prediction loss can also be useful for small label set sizes. Providing a theoretical analysis for the max prediction loss can be a useful endeavor in the future.

## Appendix A Proof of Theorem 3.1

###### Proof

Assume that at round $t$, the algorithm fails to classify $x_t$ with the proper margin using the weight matrix $W^t$, that is, $\frac{1}{|Y_t|}\sum_{i \in Y_t} w_i^t \cdot x_t - \max_{j \in \bar{Y}_t} w_j^t \cdot x_t < 1$. So, the weights are updated using the rule $w_i^{t+1} = w_i^t + \tau_i^t x_t$, where the coefficients $\tau_i^t$ are as specified in Algorithm 1. To prove the theorem, we bound $\|W^{T+1}\|^2$ from above and below. First, we derive the lower bound via $\sum_{i=1}^K w_i^* \cdot w_i^{t+1}$.

$$\begin{aligned} \sum_{i=1}^{K} w_i^* \cdot w_i^{t+1} &= \sum_{i=1}^{K} w_i^* \cdot (w_i^t + \tau_i^t x_t) = \sum_{i=1}^{K} w_i^* \cdot w_i^t + \sum_{i=1}^{K} \tau_i^t (w_i^* \cdot x_t)\\ &= \sum_{i=1}^{K} w_i^* \cdot w_i^t + \frac{1}{|Y_t|}\sum_{i \in Y_t} w_i^* \cdot x_t - \max_{j \in \bar{Y}_t} w_j^* \cdot x_t\\ &\ge \sum_{i=1}^{K} w_i^* \cdot w_i^t + \gamma\, I\left\{\tfrac{1}{|Y_t|}\textstyle\sum_{i \in Y_t} w_i^t \cdot x_t - \max_{j \in \bar{Y}_t} w_j^t \cdot x_t < 1\right\} \end{aligned} \qquad (7)$$

We get the above expression due to the assumption that $W^*$ classifies all points with margin at least $\gamma$. Summing Eq. (7) from $t = 1$ to $T$, and using the fact that the algorithm made $m$ mistakes in $T$ trials, we get the following.

$$\sum_{t=1}^{T}\sum_{i=1}^{K} w_i^* \cdot w_i^{t+1} \ge \sum_{t=1}^{T}\sum_{i=1}^{K} w_i^* \cdot w_i^{t} + \gamma\sum_{t=1}^{T} I\left\{\tfrac{1}{|Y_t|}\textstyle\sum_{i \in Y_t} w_i^t \cdot x_t - \max_{j \in \bar{Y}_t} w_j^t \cdot x_t < 1\right\}$$
$$\Rightarrow\ \sum_{i=1}^{K} w_i^* \cdot w_i^{T+1} \ge \sum_{i=1}^{K} w_i^* \cdot w_i^{1} + \gamma m \ge \gamma m\ \Rightarrow\ W^* \cdot W^{T+1} \ge \gamma m \qquad (8)$$

where we used the fact that $W^1 = 0$ and hence $\sum_{i=1}^K w_i^* \cdot w_i^1 = 0$. Here, $W^* \cdot W^{T+1} = \sum_{i=1}^K w_i^* \cdot w_i^{T+1}$ denotes the Frobenius inner product between $W^*$ and $W^{T+1}$. Then, using the Cauchy-Schwarz inequality, we get the following.

$$(W^* \cdot W^{T+1})^2 = \left(\sum_{i=1}^{K} w_i^* \cdot w_i^{T+1}\right)^2 \le \left(\sum_{i=1}^{K} \|w_i^*\|_2\,\|w_i^{T+1}\|_2\right)^2 \le \left(\sum_{i=1}^{K} \|w_i^*\|_2^2\right)\left(\sum_{i=1}^{K} \|w_i^{T+1}\|_2^2\right) = \|W^*\|^2\,\|W^{T+1}\|^2 \qquad (9)$$

From Eq. (8) and (9), and using the assumption that $\|W^*\| = 1$, we get:

$$\|W^{T+1}\|^2 \ge m^2\gamma^2 \qquad (10)$$

Now, we derive an upper bound on $\|W^{T+1}\|^2$. Suppose that at trial $t$, the example $x_t$ is misclassified. Thus,

$$\|W^{t+1}\|^2 = \sum_{i=1}^{K}\|w_i^{t+1}\|^2 = \sum_{i=1}^{K}\|w_i^t + \tau_i^t x_t\|^2 = \sum_{i=1}^{K}\|w_i^t\|^2 + 2\sum_{i=1}^{K}\tau_i^t (w_i^t \cdot x_t) + \|x_t\|^2\sum_{i=1}^{K}(\tau_i^t)^2 \qquad (11)$$

Using $\sum_{i=1}^K \tau_i^t (w_i^t \cdot x_t) < 1$, $\sum_{i=1}^K (\tau_i^t)^2 = \frac{1}{|Y_t|} + 1$ and $\|x_t\| \le R$ in Eq. (11), we get the following.

$$\|W^{t+1}\|^2 - \|W^t\|^2 \le \left(2 + \left[\frac{1}{|Y_t|} + 1\right]R^2\right) I\left\{\frac{1}{|Y_t|}\sum_{i \in Y_t} w_i^t \cdot x_t - \max_{j \in \bar{Y}_t} w_j^t \cdot x_t < 1\right\}$$

We know that $|Y_t| \ge c$ for all $t$ and there are $m$ mistakes. Summing the above equation over $t = 1$ to $T$, we get:

$$\|W^{T+1}\|^2 - \|W^1\|^2 \le 2m + \left[\frac{1}{c} + 1\right]mR^2\ \Rightarrow\ \|W^{T+1}\|^2 \le 2m + \left[\frac{1}{c} + 1\right]mR^2 \qquad (12)$$

where $c = \min_{t \in [T]} |Y_t|$. Thus, combining the upper and lower bounds from Eq. (10) and (12), we get the following.

$$m^2\gamma^2 \le \|W^{T+1}\|^2 \le 2m + \left[\frac{1}{c} + 1\right]mR^2\ \Rightarrow\ m \le \frac{2 + \left[\frac{1}{c} + 1\right]R^2}{\gamma^2}$$

## Appendix B Proof of Theorem 3.2

###### Proof

If $\Delta = 0$, the problem reduces to the linearly separable case; thus, we assume $\Delta > 0$, which means there exists $t$ such that $d_t > 0$. Thus, the data is not linearly separable with respect to $W^*$. We now transform the linearly non-separable data to separable data. We extend each instance $x_t \in \mathbb{R}^d$ to $\tilde{x}_t \in \mathbb{R}^{d+T}$ as follows. The first $d$ coordinates of $\tilde{x}_t$ are set to $x_t$. The $(d+t)$-th coordinate of $\tilde{x}_t$ is set to $D$, whose value will be determined later, while the rest of the coordinates of $\tilde{x}_t$ are set to 0. We extend the weight matrix $W^*$ to $M \in \mathbb{R}^{(d+T) \times K}$ as follows. We set the first $d$ coordinates of each column of $M$ to the corresponding column of $\frac{W^*}{Z}$ (where $Z$ is a constant whose value will be determined). For the remaining coordinates, we set the $(d+t)$-th entry of column $k$ of $M$ to $\frac{d_t}{ZD}$ if $k \in Y_t$ and to 0 otherwise.

We choose the value of $Z$ such that $\|M\| = 1$ and hence,

$$1 = \|M\|^2 = \frac{1}{Z^2}\left(\|W^*\|^2 + \frac{\Delta^2}{D^2}\right).$$

This gives us,

$$Z = \sqrt{1 + \frac{\Delta^2}{D^2}}.$$

Let $m_i$ be the $i$-th column of $M$; then $m_i \cdot \tilde{x}_t = \frac{1}{Z}(w_i^* \cdot x_t + d_t)$ for $i \in Y_t$ and $m_j \cdot \tilde{x}_t = \frac{1}{Z} w_j^* \cdot x_t$ for $j \in \bar{Y}_t$. We now show that $M$ linearly separates all the examples with a margin of at least $\frac{\gamma}{Z}$ as follows.

$$\begin{aligned} \frac{1}{|Y_t|}\sum_{i \in Y_t} m_i \cdot \tilde{x}_t - \max_{j \in \bar{Y}_t} m_j \cdot \tilde{x}_t &= \frac{1}{Z|Y_t|}\sum_{i \in Y_t}(w_i^* \cdot x_t + d_t) - \max_{j \in \bar{Y}_t}\left\{\frac{1}{Z} w_j^* \cdot x_t\right\}\\ &= \frac{1}{Z} d_t + \frac{1}{Z}\left[\frac{1}{|Y_t|}\sum_{i \in Y_t} w_i^* \cdot x_t - \max_{j \in \bar{Y}_t} w_j^* \cdot x_t\right]\\ &\ge \frac{1}{Z}\left(\gamma - \left[\frac{1}{|Y_t|}\sum_{i \in Y_t} w_i^* \cdot x_t - \max_{j \in \bar{Y}_t} w_j^* \cdot x_t\right]\right) + \frac{1}{Z}\left[\frac{1}{|Y_t|}\sum_{i \in Y_t} w_i^* \cdot x_t - \max_{j \in \bar{Y}_t} w_j^* \cdot x_t\right]\\ &= \frac{\gamma}{Z} \end{aligned}$$

We also observe that $\|\tilde{x}_t\|^2 = \|x_t\|^2 + D^2 \le R^2 + D^2$. Thus, using Theorem 3.1, the number of mistakes made by the algorithm Avg Perceptron on the sequence $(\tilde{x}_1, Y_1), \ldots, (\tilde{x}_T, Y_T)$ is bounded above as follows.

$$m \le \frac{2Z^2}{\gamma^2} + \frac{2\left[\frac{1}{c} + 1\right]R^2 + \Delta^2}{(\gamma/Z)^2} \qquad (13)$$

Minimizing the RHS of Eq. (13) over $D$, we obtain the optimal value of $D$. Using this value of $D$ and writing $K = \frac{1}{c} + 1$, we get the mistake bound as follows.

$$m \le \frac{2Z^2}{\gamma^2} + \frac{2KR^2 + \Delta^2}{(\gamma/Z)^2}$$

Finally, to complete the proof, we need to show that classifying the original partially labeled sequence $(x_1, Y_1), \ldots, (x_T, Y_T)$ with the matrices $W^t$ is the same as classifying the extended sequence $(\tilde{x}_1, Y_1), \ldots, (\tilde{x}_T, Y_T)$ with the extended matrices $M^t$. That is, they both produce the same sequence of predictions. This can be accomplished if we can show that the following holds for all $t$.

1. The first $d$ coordinates of each column of $M^t$ are equal to the corresponding column of $W^t$.

2. The $(d+t)$-th coordinate of every column of $M^t$ is zero.

The proof of the above conditions is straightforward by induction on $t$ (by initializing $W^1$ and $M^1$ as zero matrices).

## Appendix C Proof of Theorem 4.1

###### Proof

The theorem and its proof are almost the same as Theorem 1 and its proof in the Pegasos paper [14]. The main idea in the proof is to upper bound $\|\nabla_t\|$, where $\nabla_t$ is given by Eq. (6). Thus, using the triangle inequality, we can write:

$$\|\nabla_t\| \le \lambda\|W_t\| + \|\nabla_{W_t} L\| \qquad (14)$$

We note that, due to the projection step, $\|W_t\| \le \frac{1}{\sqrt{\lambda}}$, and hence $\lambda\|W_t\| \le \sqrt{\lambda}$. From the sub-gradient used by Avg Pegasos (Eq. (4)), we get:

$$\|\nabla_{W_t} L\|^2 = \begin{cases} \|x_t\|^2 + \frac{\|x_t\|^2}{|Y_t|}, & \text{if } L > 0\\[4pt] 0, & \text{if } L = 0 \end{cases}$$

So we get,

$$\|\nabla_{W_t} L\| \le \sqrt{1 + \frac{1}{|Y_t|}}\,\|x_t\|$$

So, using the above result along with Equation 14, we can write:

$$\|\nabla_t\| \le \sqrt{\lambda} + \sqrt{1 + \frac{1}{|Y_t|}}\,\|x_t\|$$

Thus, if $|Y_t| \ge c$ and $\|x_t\| \le R$ for all $t$, we get the following bound:

$$\|\nabla_t\| \le \sqrt{\lambda} + \sqrt{1 + \frac{1}{c}}\,\|x_t\| \le \sqrt{\lambda} + \sqrt{1 + \frac{1}{c}}\,R = G$$

The rest of the proof is exactly the same as the one given in [14].

### Footnotes

1. We denote the set $[K] \setminus Y$ by $\bar{Y}$.

### References

1. Chen, Y., Patel, V.M., Chellappa, R., Phillips, P.J.: Ambiguously labeled learning using dictionaries. IEEE Transactions on Information Forensics and Security 9(12), 2076–2088 (Dec 2014)
2. Cour, T., Sapp, B., Taskar, B.: Learning from partial labels. Journal of Machine Learning Research 12, 1501–1536 (2011)
3. Cour, T., Sapp, B., Jordan, C., Taskar, B.: Learning from ambiguously labeled images. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. pp. 919–926 (2009)
4. Crammer, K., Singer, Y.: Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research 3, 951–991 (March 2003)
5. Dua, D., Graff, C.: UCI machine lea