Self-Paced Multi-Label Learning with Diversity
\jmlryear{2019} \jmlrworkshop{ACML 2019} \editors{Wee Sun Lee and Taiji Suzuki}
Abstract
The major challenge of learning from multi-label data arises from the overwhelming size of the label space, which makes the problem NP-hard. This problem can be alleviated by gradually involving easy-to-hard labels in the learning process. Besides, the utilization of a diversity-maintenance approach avoids overfitting to a subset of easy labels. In this paper, we propose self-paced multi-label learning with diversity (SPMLD), which aims to cover diverse labels with respect to its learning pace. In addition, the proposed framework is applied to an efficient correlation-based multi-label method. The non-convex objective function is optimized by an extension of the block coordinate descent algorithm. Empirical evaluations on real-world datasets with different dimensions of features and labels demonstrate the effectiveness of the proposed predictive model.
Keywords: Self-Paced Learning, Multi-Label Learning, Block Coordinate Descent, Manifold Optimization.
1 Introduction
The paradigm of multi-label learning has become a popular topic in recent years. In many real-world applications, instances are semantically associated with more than one class label (8423669). Therefore, it is more rational to map each instance to a vector of labels rather than to a single one. An effective stage in handling a multi-label problem is to learn dependencies among the labels (zhang2014review), and numerous studies have been conducted to accomplish this goal. zhang2007ml proposed a lazy learning method, derived from the conventional kNN classifier, which classifies instances using a statistical model based on the maximum a posteriori principle. However, this algorithm covers only an implicit, local notion of correlation and does not model deeper label dependencies. huang2012multi investigated an explicit view of local correlation by encoding its influence into a local code (LOC).
Another important stage that has drawn much attention is a low-rank representation of the label space, and many algorithms with this property have been proposed. yu2014large presented a large-scale low-rank structure applicable to scaled label spaces that also handles data with missing labels. The framework of xu2014learning aims to capture global label correlations by imposing a low-rank structure on the label correlation matrix and copes with the missing-label challenge by introducing a supplementary label matrix. xu2013speedup proposed a fast matrix completion algorithm with a low-rank representation that exploits side information explicitly to reduce complexity in a transductive manner. 8233207 investigated the concept of label correlation in a new manner: contrary to previous approaches that rely on a single definition of correlation, i.e., global or local, the GLOCAL framework analyzes both GLObal and loCAL correlations of labels simultaneously in a latent label representation. This method takes advantage of manifold optimization and is capable of dealing with both missing-label and full-label scenarios.
The algorithms mentioned so far have various pros and cons. Many methods in this field have studied the multi-label problem from different perspectives and have made valuable efforts. However, there is a common shortcoming: they lack a mechanism to impose a clear order on the training instances. Some instances are easy for a specific label; it is beneficial to learn that label with those instances first and then gradually learn harder ones. For example, learning the label "rabbit" from a picture of a black rabbit running on grass is easier than from a white rabbit running on snow.
Curriculum and self-paced learning (SPL) are recently proposed regimes that aim to learn from easier to more complex concepts (bengio2009curriculum; kumar2010self; meng2017theoretical). These learning frameworks are inspired by the human education system; the major difference between them arises in how the complexity level is identified. Curriculum learning needs a teacher (extra knowledge) to distinguish easy concepts from complex ones, whereas self-paced learning is like a student who learns a curriculum based on his or her own abilities.
The SPL framework has been widely applied to various fields. zhao2015self incorporated SPL into conventional matrix factorization and introduced a new matrix factorization framework with a generalization of SPL that produces soft weight values along with the original binary weights. li2017self proposed a multi-task algorithm with a self-paced regularization and optimized this learner with an efficient development of block coordinate descent. Self-paced learning is claimed to be a general framework applicable to any learning scheme whose objective function contains an empirical loss term. It has been successfully applied in various learning fields such as classification (li2017selfcnn), boost learning (pi2016self), object detection (sangineto2019self), co-saliency detection (zhang2017co), face identification (lin2018active), multi-view clustering (xu2015multi), and multi-task learning (murugesan2017self).
li2018self introduced a self-paced regularization framework for multi-label learning, one of the first attempts to tackle the multi-label problem by considering the complexity of instances for labels. However, without a diversity-maintenance approach, a self-paced regularizer may cause the learning model to be biased toward a subset of labels that are easy to learn (jiang2014self).
In this paper, a self-paced multi-label learning with diversity (SPMLD) framework is proposed. The diversity regularization term drives the model to first learn different labels that are easier, and thereby mitigates the problem of being biased toward a limited number of easy labels. Besides, this gradual learning scheme can exploit more reliable label dependencies. Finally, to present a comparable realization of the desired self-paced multi-label learning, SPMLD is applied to a host algorithm, a recent multi-label method (8233207) with acceptable performance. Empirical results supported by statistical significance tests demonstrate the effectiveness of our method against several well-known algorithms, including the host algorithm.
2 Background
Selfpaced learning provides a way for simultaneously choosing the easier patterns and reestimating the learning parameters w in the form of an iterative process kumar2010self. We presume a linear function with unknown parameter w. SPL is then given by the following objective function to be solved:
\min_{w,\, p \in [0,1]^n} \; \sum_{j=1}^{n} p_j\, \ell\big(y_j, f(x_j; w)\big) + r(w) - \lambda \sum_{j=1}^{n} p_j \qquad (1)
where r(w) is the regularization term, [0,1]^n is the domain space of p, and λ is the pace parameter which determines the complexity of the patterns admitted into training. Equation (1) has two unknowns: the learning parameter w and the pace-control parameter p (restricted to the specified domain). Equation (1) is thus a biconvex optimization problem over w and p, which can be efficiently solved by alternating minimization. The optimal w with p fixed can be obtained by any off-the-shelf solver, and the optimal p with w fixed is given by:
p_j^{*} = \begin{cases} 1, & \ell\big(y_j, f(x_j; w)\big) < \lambda \\ 0, & \text{otherwise} \end{cases} \qquad (2)
According to (2), when updating p with w fixed, easy samples, whose losses fall below the threshold λ because they have smaller prediction errors, are chosen for training (p_j = 1); the others are not chosen (p_j = 0). When updating w with p fixed, the training process operates only on the "easy" samples selected before. Small values of λ pass only "easy" samples with small losses; by gradually increasing λ, larger loss values of "complex" samples are accepted.
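As a concrete illustration, the alternating update of p in (2) reduces to simple thresholding of the current losses. The following minimal Python sketch shows one pace iteration; the `growth` ratio of λ is an assumed hyperparameter (the paper tunes the increase ratio by grid search), not a value from the text:

```python
import numpy as np

def spl_weights(losses, lam):
    """Hard self-paced weights from Eq. (2): a sample is selected
    (p_j = 1) iff its current loss is below the pace threshold lambda."""
    return (losses < lam).astype(float)

def spl_pace_step(losses, lam, growth=1.3):
    """One pace iteration: select easy samples, then relax the threshold
    so harder samples are admitted in the next round."""
    p = spl_weights(losses, lam)
    return p, lam * growth
```

In a full SPL loop, `spl_weights` would alternate with re-fitting w on the weighted samples.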
3 Self-Paced Multi-Label Learning
Suppose we are given an input matrix X ∈ R^{d×n} with a d-dimensional feature space, and let L be the finite set of l existing labels. Let D = {(x_j, y_j)}_{j=1}^{n} be multi-label data, where x_j ∈ R^d is the feature vector of the jth sample and y_j ∈ {−1, 1}^l is the interpretation of labels for x_j, with y_{ij} = 1 if the ith label is assigned to x_j and y_{ij} = −1 otherwise. Multi-label learning intends to train a predictor from the training data D, so that relevant and irrelevant labels of out-of-sample data are predicted. The general objective function for multi-label learning is:
\min_{W} \; \sum_{i=1}^{l} \sum_{j=1}^{n} \ell\big(y_{ij}, w_i^\top x_j\big) + \alpha\, \mathcal{R}(W) + \beta\, \Omega\big(C, W^\top X\big) \qquad (3)
where W = [w_1, …, w_l] is the learned matrix, with each column w_i indicating the weight vector for one independent label, and ℓ(y_{ij}, w_i^⊤ x_j) is the loss of the jth sample for the ith label. The correlation matrix C demonstrates the dependency degree between each pair of labels, and Ω(C, W^⊤X) is a correlation regularizer whose characteristic is to let labels with positive dependency degrees encourage their corresponding outputs to be closer, and vice versa.
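One common instantiation of such a correlation regularizer, sketched here as an assumption since the text leaves the regularizer abstract at this point, is the Laplacian smoother tr(F^T L F) on the prediction matrix F:

```python
import numpy as np

def correlation_regularizer(F, C):
    """Laplacian correlation regularizer tr(F^T L F), where F (l x n)
    holds predictions of l labels for n samples and L = D - C is the
    Laplacian of the label-correlation matrix C.  It equals
    0.5 * sum_ij C_ij * ||f_i - f_j||^2, so positively correlated labels
    are pushed toward similar outputs."""
    L = np.diag(C.sum(axis=1)) - C
    return float(np.trace(F.T @ L @ F))
```

With a symmetric nonnegative C, the value is zero exactly when correlated labels produce identical outputs.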
The aforementioned objective function treats all labels identically, and does the same for all samples per label. However, in many real-world cases, different labels do not necessarily have identical complexities, and samples have different complexity levels per label. Moreover, the non-convex objective function of multi-label learning has the potential to get stuck in local optima (zhao2015self), especially in the presence of bad initialization or noisy and corrupted labels. To address these defects, by defining a self-paced regularization, the model can learn a sequence of instances with respect to their degree of complexity:
\min_{W,\, P \in [0,1]^{l \times n}} \; \sum_{i=1}^{l} \sum_{j=1}^{n} p_{ij}\, \ell\big(y_{ij}, w_i^\top x_j\big) + \alpha\, \mathcal{R}(W) + \beta\, \Omega\big(C, W^\top X\big) + f(P; \lambda) \qquad (4)
Furthermore, our desired multi-label learner is expected to learn not only easy labels but also diverse labels that are sufficiently disparate within the current learning pace. To this end, instance-label weights p_{ij} are introduced. In order to accomplish the easy-to-hard strategy on diverse labels simultaneously, we propose a new self-paced regularizer in (5):
f(P; \lambda, \gamma) = -\lambda\, \|P\|_{1} - \gamma\, \|P\|_{2,1}, \qquad \|P\|_{2,1} = \sum_{i=1}^{l} \|p_i\|_2 \qquad (5)
Equation (5) consists of two terms: a negative ℓ1-norm and an adaptive negative ℓ2,1 matrix norm (with p_i the row of weights for the ith label). The first term induces a preference to select the easy instances rather than the hard ones per label. Combining this term with (3) implies that a small empirical loss on a training point ℓ(y_{ij}, w_i^⊤ x_j) drives the corresponding weight p_{ij} to be high. Hence, this optimization process corresponds well with the intuitive notion of starting with the easiest instances (the ones with the lowest empirical errors). By progressively increasing λ while learning proceeds, the self-paced weights grow higher in consequence, which leads to the gradual involvement of more complex samples in training. The matrix-norm term regulates label-wise sparsity: it favors selecting from different categories of labels in the initial steps, when diversity should be higher. As training proceeds, γ is gradually decreased and the impact of diversity shrinks. By plugging (5) into (3), we obtain the final objective function:
\min_{W,\, P \in [0,1]^{l \times n}} \; \sum_{i=1}^{l} \sum_{j=1}^{n} p_{ij}\, \ell\big(y_{ij}, w_i^\top x_j\big) + \alpha\, \mathcal{R}(W) + \beta\, \Omega\big(C, W^\top X\big) - \lambda\, \|P\|_{1} - \gamma\, \|P\|_{2,1} \qquad (6)
3.1 SPMLD with Local and Global Correlation
We give an example to motivate our self-paced learning procedure and briefly discuss how the proposed algorithm develops the learning pace of the original problem. The host method (GLOCAL) (8233207) simultaneously recovers the missing labels, trains the learner, and exploits both global and local correlations among labels, without needing any further prior knowledge, through learning a latent label representation. The objective function of GLOCAL is as follows:
\min_{U, V, W} \; \big\|J \odot (Y - UV)\big\|_F^2 + \lambda_1 \|V - WX\|_F^2 + \lambda_2\, \mathcal{R}(U, V, W) + \lambda_3 \operatorname{tr}\big(F_0^\top L_0 F_0\big) \qquad (7)
where V stands for the matrix of latent labels, capturing concepts at a higher level which are more compact and semantically abstract than the original labels, while U represents the matrix containing the interactions between the original labels and the latent labels, so that Y ≈ UV; F_0 = UWX denotes the predicted label matrix and L_0 is the Laplacian of the global label-correlation matrix. In general, labels may only be partially observed. Low-rank representation is one of the key techniques in matrix completion, and the low-rank decomposition of the observed labels yields a natural solution to recover missing labels (let J be the indicator matrix of the observed entries in Y).
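The masked low-rank reconstruction term can be sketched as follows; this is a minimal illustration of the completion idea, not the full GLOCAL objective:

```python
import numpy as np

def masked_recon_loss(Y, U, V, J):
    """||J . (Y - UV)||_F^2: squared error of the low-rank factorization
    Y ~ UV, counted only on entries observed according to the binary mask
    J (J_ij = 1 iff y_ij is observed)."""
    R = J * (Y - U @ V)
    return float(np.sum(R * R))
```

Entries with J_ij = 0 contribute nothing, so corrupted or missing labels do not distort the fit; the learned factors U and V then impute those entries.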
Label correlations may not take the same values in different categories, so we define a local manifold regularization. Assume that the dataset X is partitioned into b groups X_1, …, X_b, where X_m has n_m instances. This partitioning can be obtained by clustering. If Y_m is the label submatrix of Y corresponding to X_m, then C_m is the local correlation matrix of group m. Similar to the global label correlations, we force the outputs to be similar or dissimilar on the positively or negatively correlated labels, and optimize tr(F_m^⊤ L_m F_m), where L_m is the Laplacian of C_m that stores our knowledge about the relationships of the labels. It is defined as L_m = D_m − C_m, where D_m is a diagonal matrix with (D_m)_{ii} = Σ_j (C_m)_{ij}, and F_m = UWX_m is the predicted submatrix for group m.
Finally, we propose the following objective function, which considers the diversity of labels in a unified setting:
\min_{U, V, W, P, \{Z_m\}} \; \big\|\sqrt{P} \odot J \odot (Y - UV)\big\|_F^2 + \lambda_1 \|V - WX\|_F^2 + \lambda_2\, \mathcal{R}(U, V, W) + \lambda_3 \sum_{m=0}^{b} \operatorname{tr}\big(F_m^\top Z_m^\top Z_m F_m\big) - \lambda\, \|P\|_{1} - \gamma\, \|P\|_{2,1}
\text{s.t.} \quad P \in [0,1]^{l \times n}, \; \|z_{m,j}\|_2 \le 1 \qquad (8)
where √P denotes the element-wise square root of P, ⊙ is the Hadamard (element-wise) product of matrices, and R(U, V, W) is the regularization term that guarantees generalization ability and numerical stability.
The details of self-paced multi-label learning with diversity (SPMLD) are summarized in Algorithm 1. The implementation is available in a GitHub repository: http://github.com/amjadseyedi/SPMLD.
3.2 Optimization
In this section, we discuss how to solve (8) by alternating minimization, which gives us the capability to tune the variables iteratively; it is difficult to find a globally optimal answer for this non-convex objective function. To achieve reliable self-paced weights, we extend a block coordinate descent optimizer. For solving the block P with the blocks U, V, W, and Z fixed, the optimization problem can be decomposed into independent subproblems, one per label. Thus, the objective function for the ith label is given by:
\min_{p_i \in [0,1]^n} \; \sum_{j=1}^{n} p_{ij}\, \ell_{ij} - \lambda \sum_{j=1}^{n} p_{ij} - \gamma\, \|p_i\|_2 \qquad (9)
In order to solve (9), we first set ℓ_ij = [J ⊙ (Y − UV)]_{ij}^2 and assume the losses of the ith label are sorted in ascending order, ℓ_{i,(1)} ≤ … ≤ ℓ_{i,(n)}. Following the closed form of the self-paced regularizer with diversity (jiang2014self), the sample at sorted position j is compared against a rank-dependent threshold that tightens as j grows. Then, the optimal p_i is given by
p_{i,(j)}^{*} = \begin{cases} 1, & \ell_{i,(j)} < \lambda + \dfrac{\gamma}{\sqrt{j} + \sqrt{j-1}} \\ 0, & \text{otherwise} \end{cases} \qquad (10)
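The sorted-threshold rule can be sketched as follows; it follows the SPLD-style closed form of jiang2014self, so the exact thresholds are an assumption rather than a verbatim transcription of the paper's derivation:

```python
import numpy as np

def spld_weights(losses, lam, gamma):
    """SPLD-style weight update for one label: sort that label's sample
    losses ascending; the sample at sorted position j (1-based) is
    selected iff its loss is below lam + gamma / (sqrt(j) + sqrt(j-1)).
    Later positions face a tighter threshold, which spreads the selected
    samples instead of concentrating them."""
    p = np.zeros_like(losses, dtype=float)
    for rank, idx in enumerate(np.argsort(losses), start=1):
        thresh = lam + gamma / (np.sqrt(rank) + np.sqrt(rank - 1))
        p[idx] = float(losses[idx] < thresh)
    return p
```

Note that the first-ranked sample sees the loosest threshold λ + γ, recovering plain SPL when γ = 0.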
Then, we discuss the update procedures for U, V, W, and Z. To optimize these variables with the gradient descent method, we utilize the MANOPT toolbox (boumal2014manopt, http://www.manopt.org) with line search on the Euclidean and manifold spaces.
Updating Z_m's: With U, V, W, and P fixed, the problem reduces to minimizing λ_3 tr(F_m^⊤ Z_m^⊤ Z_m F_m) for each m. Due to the constraint on the rows of Z_m, it has no closed-form solution, and we solve it with projected gradient descent. The gradient of the objective w.r.t. Z_m is
\nabla_{Z_m} = 2\lambda_3\, Z_m F_m F_m^\top \qquad (11)
To satisfy the constraint ‖z_{m,j}‖_2 ≤ 1, we project each row of Z_m onto the unit norm ball after each update, where z_{m,j} is the jth row of Z_m.
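The row-wise projection step can be sketched as below; it assumes the constraint is the unit l2-norm ball, as the text states, so rows already inside the ball are left untouched:

```python
import numpy as np

def project_rows_unit_ball(Z):
    """Project each row of Z onto the unit l2-norm ball: rows whose norm
    exceeds 1 are rescaled to unit norm; shorter rows stay unchanged."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1.0)
```

Calling this after every gradient step keeps the iterates feasible throughout the projected gradient descent.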
Updating V: With U, W, Z_m's, and P fixed, the gradient of the objective in (8) w.r.t. V is
\nabla_{V} = 2\, U^\top \big(P \odot J \odot (UV - Y)\big) + 2\lambda_1 (V - WX) + 2\lambda_2 V \qquad (12)
Updating U: With V, W, Z_m's, and P fixed, we again use gradient descent; the gradient w.r.t. U is
\nabla_{U} = 2 \big(P \odot J \odot (UV - Y)\big) V^\top + 2\lambda_2 U + 2\lambda_3 \sum_{m=0}^{b} Z_m^\top Z_m\, U W X_m X_m^\top W^\top \qquad (13)
Updating W: With U, V, Z_m's, and P fixed, the gradient w.r.t. W is
\nabla_{W} = 2\lambda_1 (WX - V) X^\top + 2\lambda_2 W + 2\lambda_3 \sum_{m=0}^{b} U^\top Z_m^\top Z_m\, U W X_m X_m^\top \qquad (14)
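The overall alternating scheme can be sketched generically; this toy version uses a fixed step size in place of Manopt's line search, and the block names are purely illustrative:

```python
def bcd_sweep(blocks, grads, lr=0.1):
    """One sweep of block coordinate descent: each parameter block takes
    a gradient step while all other blocks are held fixed.  `blocks`
    maps block names to values; `grads` maps names to functions that
    return the partial gradient given the current blocks."""
    for name in blocks:
        blocks[name] = blocks[name] - lr * grads[name](blocks)
    return blocks
```

Repeating such sweeps, with the self-paced update of P and the row projection of Z_m interleaved, yields the full optimizer of Algorithm 1.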
4 Experiments
In this section, empirical experiments are conducted to validate our method. Seven real-world multi-label datasets are used: the Yahoo text datasets (Business, Computers, Education, Health, Science, and Social) (ueda2003parametric), along with an image classification dataset (Corel5K) (duygulu2002object), all available at http://mulan.sourceforge.net/datasets-mlc.html.
4.1 Experimental Setting
dataset  #instances  #dimensions  #labels  labels/instance
Business (ueda2003parametric)  5,000  438  30  1.59
Computers (ueda2003parametric)  5,000  681  33  1.5
Education (ueda2003parametric)  5,000  550  33  1.46
Health (ueda2003parametric)  5,000  612  32  1.66
Science (ueda2003parametric)  5,000  743  40  1.45
Social (ueda2003parametric)  5,000  1,047  39  1.28
Corel5K (duygulu2002object)  5,000  499  374  3.52
Table 1 lists the detailed characteristics of the employed datasets. Each column sequentially represents the number of instances, the feature dimensionality, the number of labels, and the label-per-instance ratio for each dataset.
Since prediction in the presence of missing labels is a more challenging task, we perform the experiments on missing-label data. We randomly sample a portion ω of the elements in the label matrix as observed, and treat the rest as missing; ω is set to 30% and 70% revealed entries, respectively. Coverage, Ranking loss, Average AUC, Instance AUC, Macro-F1, Micro-F1, and Instance-F1 evaluation metrics are used to measure the performance of the proposed predictive model against the baselines. The above-mentioned metrics analyze different aspects of multi-label learning algorithms: the first two metrics are to be minimized and the other metrics are to be maximized, according to wu2017unified.
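As a reference point for the reported measures, a simplified version of the Ranking loss metric can be sketched as follows (ties are counted as errors here, which is a simplifying assumption relative to common implementations):

```python
import numpy as np

def ranking_loss(Y, S):
    """Simplified ranking loss: for each instance, the fraction of
    (relevant, irrelevant) label pairs in which the irrelevant label
    receives a score at least as high as the relevant one, averaged over
    instances (lower is better).  Y is the binary label matrix and S the
    score matrix, both of shape n x l."""
    per_instance = []
    for y, s in zip(Y, S):
        pos, neg = s[y == 1], s[y == 0]
        if len(pos) and len(neg):
            per_instance.append(float((pos[:, None] <= neg[None, :]).mean()))
    return float(np.mean(per_instance))
```

A perfect ranker scores 0.0 and a fully inverted ranking scores 1.0.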
To examine the effectiveness of our framework, it is compared with three state-of-the-art multi-label learning algorithms, namely LEML (yu2014large), ML-LRC (xu2014learning), and GLOCAL (8233207). These baseline methods share two traits with SPMLD, which are the main reasons we compare our framework with them: all of the above-mentioned methods learn in a latent subspace, and they all handle the missing-label challenge.

Low-rank empirical risk minimization for multi-label learning (LEML) uses a linear low-rank structure to train a model mapping from the instance space to the label space, and utilizes an implicit concept of global label dependency.

Learning low-rank label correlations for multi-label classification (ML-LRC) learns and exploits low-rank global label correlations for multi-label classification.

Multi-label learning with global and local label correlation (GLOCAL) learns label correlations in a new manner, considering both local and global label correlations in a latent label representation.
Moreover, to statistically measure the significance of performance differences, pairwise t-tests at the 95% significance level are conducted between SPMLD and each of the baseline algorithms. In each statistical peer test, the performance of an algorithm is boldfaced when it statistically outperforms the other one; when there is no significant difference between SPMLD and one or more of the baselines on a dataset for a specific evaluation metric, their results are shown in italics. Furthermore, the self-paced regularization parameter λ controls the complexity of the admitted instances and labels: it first considers the smallest losses, corresponding to the easiest samples, and then gradually involves harder ones through the iterations. λ and its increase ratio are searched by grid search. The diversity regularization parameter γ controls the selection of less similar instances and labels specified as easy, and needs to be higher in the initial iterations; γ and its decrease ratio are tuned using a grid search, with γ in the range [1, 10]. The parameters of the host algorithm (GLOCAL) and of the competing methods are all set as recommended in the corresponding literature.
4.2 Results on RealWorld Datasets
All the results reported in tables and charts are averaged over 10 independent runs. Tables 2-7 exhibit the label prediction results of the multi-label algorithms on six of the mentioned datasets, respectively. Overall, the results of all algorithms improve with 70% of the samples observed. Each table reports three metrics; with the two settings of ω, there are six probable cases. On five datasets, namely Business, Education, Health, Science, and Social, SPMLD has significantly better performance regarding all the measures, and for both observation settings, as reported in Tables 2 and 4-7, respectively. On the Computers dataset, whose results are shown in Table 3, SPMLD reports significantly better results regarding the Rkl measure for both settings. Similarly, it shows statistically better AUC and COV values with 30% of entries revealed, while in the 70% case LEML, ML-LRC, and SPMLD show no significantly different AUC values compared to each other, but they are jointly better than GLOCAL. Again, for the COV values with 70% of entries, the proposed method shows no significant difference compared with ML-LRC, but it obtains statistically better results than the two other models. Equivalently, on the Social dataset, SPMLD shows significantly better performance for both percentages of observation on the Rkl and AUC metrics, and also on COV with ω = 30%; only for the proportion of 70% does it perform statistically equal to GLOCAL, while they jointly obtain better results than the remaining two algorithms and jointly stand in first place. In summary, SPMLD ranks first in 91.66% of cases according to the statistical significance test. Moreover, it shows statistically equal performance in 8.34% of cases, where it still ranks first, jointly with one or more of the baselines.
Method  ω  Rkl  AUC  COV
LEML  30%  0.063±0.0057  0.928±0.0052  3.954±0.2751
      70%  0.058±0.0048  0.942±0.0057  3.303±0.2706
ML-LRC  30%  0.061±0.0024  0.937±0.0055  3.279±0.0666
      70%  0.046±0.0019  0.950±0.0050  2.580±0.0593
GLOCAL  30%  0.054±0.0025  0.937±0.0036  2.863±0.1711
      70%  0.046±0.0022  0.952±0.0031  2.579±0.1658
SPMLD  30%  0.044±0.0021  0.956±0.0032  2.361±0.1688
      70%  0.043±0.0018  0.958±0.0029  2.347±0.1625
Method  ω  Rkl  AUC  COV
LEML  30%  0.179±0.0072  0.880±0.0064  7.392±0.2181
      70%  0.141±0.0058  0.894±0.0069  6.306±0.2653
ML-LRC  30%  0.152±0.0044  0.873±0.0057  6.052±0.1426
      70%  0.115±0.0026  0.895±0.0058  5.000±0.6228
GLOCAL  30%  0.132±0.0037  0.876±0.0034  5.647±0.7823
      70%  0.123±0.0028  0.884±0.0062  5.440±0.5340
SPMLD  30%  0.104±0.0030  0.903±0.0047  4.600±0.5359
      70%  0.113±0.0015  0.894±0.0053  5.018±0.5712
Method  ω  Rkl  AUC  COV
LEML  30%  0.176±0.0084  0.817±0.0075  9.672±0.4461
      70%  0.151±0.0077  0.842±0.0082  7.595±0.5138
ML-LRC  30%  0.144±0.0034  0.845±0.0068  6.350±0.2042
      70%  0.113±0.0028  0.860±0.0063  5.075±0.1866
GLOCAL  30%  0.125±0.0026  0.875±0.0057  5.741±0.2312
      70%  0.122±0.0035  0.878±0.0064  5.784±0.2450
SPMLD  30%  0.096±0.0019  0.904±0.0048  4.246±0.1372
      70%  0.093±0.0021  0.907±0.0052  4.162±0.1524
Method  ω  Rkl  AUC  COV
LEML  30%  0.095±0.0029  0.896±0.0038  6.248±0.1640
      70%  0.074±0.0033  0.920±0.0045  5.167±0.1827
ML-LRC  30%  0.085±0.0041  0.907±0.0082  4.924±0.1351
      70%  0.071±0.0036  0.920±0.0093  3.960±0.1825
GLOCAL  30%  0.0828±0.0018  0.919±0.0057  4.438±0.1357
      70%  0.0795±0.0023  0.923±0.0078  4.472±0.1414
SPMLD  30%  0.064±0.0009  0.938±0.0063  3.355±0.1182
      70%  0.059±0.0011  0.943±0.0068  3.253±0.1200
Method  ω  Rkl  AUC  COV
LEML  30%  0.203±0.0052  0.827±0.0053  10.587±0.2011
      70%  0.174±0.0056  0.849±0.0064  9.501±0.2548
ML-LRC  30%  0.169±0.0027  0.830±0.0039  8.794±0.1254
      70%  0.134±0.0024  0.850±0.0033  6.900±0.1273
GLOCAL  30%  0.154±0.0029  0.840±0.0108  7.949±0.1371
      70%  0.134±0.0034  0.866±0.0113  7.106±0.1365
SPMLD  30%  0.129±0.0024  0.871±0.0095  6.640±0.1156
      70%  0.124±0.0028  0.876±0.0087  6.432±0.1283
Method  ω  Rkl  AUC  COV
LEML  30%  0.128±0.0067  0.872±0.0065  5.459±0.3084
      70%  0.081±0.0061  0.919±0.0069  3.824±0.3011
ML-LRC  30%  0.123±0.0059  0.877±0.0057  5.167±0.1034
      70%  0.073±0.0052  0.928±0.0046  3.608±0.0975
GLOCAL  30%  0.102±0.0054  0.898±0.0058  4.496±0.2627
      70%  0.073±0.0049  0.929±0.0052  3.442±0.2583
SPMLD  30%  0.068±0.0055  0.931±0.0052  3.563±0.2554
      70%  0.065±0.0048  0.934±0.0044  3.456±0.2505
The results obtained on the Corel5K image dataset are analyzed through a radar plot, which enables one to compare models against multiple metrics. In this straightforward analysis, SPMLD is compared to the baselines in the high-dimensional label space of Corel5K in terms of six evaluation metrics, for 30% and 70% revealed data, respectively; the results are shown in Figure 1. Note that the proposed method endeavors to cover all labels fairly in its learning process. Thus, it is able to distinguish the positive and negative labels of an instance by simultaneously making a larger label-wise margin and preserving the instance-wise margin. Hence, it can be clearly seen that SPMLD is far better on the label-wise metrics (Micro-F1 and Instance-F1) and obtains comparable results on the other, instance-wise metrics such as Macro-F1.
4.3 Parameter Analysis
In this subsection, the influence of the parameters on the proposed model is analyzed. According to (8), SPMLD has two parameters, namely λ and γ, which correspond to the self-paced and the diversity regularization terms, respectively. It must be mentioned that the parameters of the other regularization terms, which belong to the base algorithm, are analyzed in 8233207. Thus, to make a thorough study of the two mentioned parameters, we evaluate them on all datasets through a grid-search strategy and report the results based on three measures. Each graph consists of 50 evaluations, referring to 5 different values of λ and 10 values of γ. According to Figure 2, a bar next to each graph indicates the highest and lowest values obtained on the corresponding dataset for each measure, shown using a spectrum of colors: light colors (e.g., orange to yellow) represent high values and dark colors (e.g., blue to dark blue) represent low values of the measures.
It can be seen that, for each value of λ, changing γ from 1 to 10 makes a significant difference, except for the λ setting that lies on the top row of graphs, which stays unchanged as γ increases.
5 Conclusion
In this paper, we propose a novel self-paced framework for multi-label learning. This framework incorporates the complexity of both instances and labels, and trains its predictive model with gradual involvement of harder samples. It also utilizes an efficient diversity-maintenance mechanism to avoid bias toward a limited subset of labels. The diverse easy-to-hard learning strategy also has an implicit positive effect on the exploited correlations. SPMLD is applied to a correlation-based multi-label learner as a host algorithm. Experiments on real-world datasets verify the effectiveness of SPMLD compared to the host algorithm itself and two other state-of-the-art methods. For future studies, it is desirable to investigate the direct impact of self-paced regularization on correlation exploitation, and we intend to study and analyze the effect of diversity on local dependencies.