The major challenge of learning from multi-label data arises from the overwhelming size of the label space, which makes the problem NP-hard. This problem can be alleviated by gradually introducing labels into the learning process, from easy to hard. In addition, a diversity maintenance approach avoids overfitting on a subset of easy labels. In this paper, we propose self-paced multi-label learning with diversity (SPMLD), a framework that aims to cover diverse labels with respect to its learning pace. The proposed framework is applied to an efficient correlation-based multi-label method. The non-convex objective function is optimized by an extension of the block coordinate descent algorithm. Empirical evaluations on real-world datasets with different feature and label dimensions indicate the effectiveness of the proposed predictive model.
Self-Paced Multi-Label Learning with Diversity. ACML 2019, volume 101, 2019. Editors: Wee Sun Lee and Taiji Suzuki.
Keywords: Self-Paced Learning, Multi-Label Learning, Block Coordinate Descent, Manifold Optimization.
Multi-label learning has become a popular topic in recent years. In many real-world applications, instances are semantically associated with more than one class label 8423669. It is therefore more reasonable to map each instance to a vector of labels rather than to a single one. An effective step in handling a multi-label problem is to learn the dependencies among the labels zhang2014review, and numerous studies have pursued this goal. zhang2007ml proposed a lazy learning method, derived from the conventional kNN classifier, which classifies instances using a statistical model based on the maximum a posteriori principle. However, this algorithm only implicitly captures a local notion of correlation, and later work has modeled label dependency more directly. huang2012multi investigated an explicit view of local correlation by encoding the influence of local correlations into a local code (LOC).
Another important line of work that has drawn much attention is the low-rank representation of the label space, and many algorithms with this property have been proposed. yu2014large presented a large-scale low-rank structure that is applicable to large label spaces while also handling data with missing labels. The framework of xu2014learning aims to capture global label correlations by imposing a low-rank structure on the label correlation matrix, and copes with the missing-label challenge by introducing a supplementary label matrix. xu2013speedup proposed a fast matrix completion algorithm with a low-rank representation, which exploits side information explicitly to reduce complexity in a transductive manner. 8233207 investigated the concept of label correlation in a new manner. Contrary to previous approaches that rely on a single definition of correlation, i.e., global or local, the GLOCAL framework analyzes both GLObal and loCAL correlations of labels simultaneously in a latent label representation. This method takes advantage of manifold optimization and can deal with both missing-label and full-label scenarios.
The methods mentioned so far have studied the multi-label problem from different perspectives and have made valuable contributions. However, they share a common shortcoming: they lack a mechanism for imposing a clear order on the training instances. Some instances are easy for a specific label. It is beneficial to learn that label from those instances first and then gradually learn from harder ones. For example, learning the label "rabbit" from a picture of a black rabbit running on grass is easier than from a picture of a white rabbit running on snow.
Curriculum learning and self-paced learning (SPL) are recently proposed regimes that aim to learn from easier to more complex concepts bengio2009curriculum; kumar2010self; meng2017theoretical. Both regimes are inspired by the human education system, and they differ mainly in how the complexity level is identified. Curriculum learning needs a teacher (extra knowledge) to distinguish easy concepts from complex ones, whereas self-paced learning is like a student who learns a curriculum based on their own abilities.
The SPL framework has been widely applied in various fields. zhao2015self combined SPL with conventional matrix factorization and introduced a new matrix factorization framework with a generalization of SPL that produces soft weight values in addition to the original binary weights. li2017self proposed a multi-task algorithm with a self-paced regularization and optimized this learner with an efficient variant of block coordinate descent. Self-paced learning is claimed to be a general framework applicable to any learning framework whose objective function contains an empirical loss. It has been successfully applied in various learning fields such as classification li2017selfcnn, boosting pi2016self, object detection sangineto2019self, co-saliency detection zhang2017co, face identification lin2018active, multi-view clustering xu2015multi, and multi-task learning murugesan2017self.
li2018self introduced a self-paced regularization framework for multi-label learning. It is one of the first attempts to tackle the multi-label problem by considering the complexity of instances for labels. However, without a diversity maintenance approach, a self-paced regularizer may bias the learning model toward a subset of labels that are easy to learn jiang2014self.
In this paper, a self-paced multi-label learning with diversity (SPMLD) framework is proposed. The diversity regularization term drives the model to first learn easy instances drawn from different labels, which mitigates the bias toward a limited number of easy labels. In addition, this gradual learning scheme can exploit more reliable label dependencies. Finally, to present a concrete realization of the desired self-paced multi-label learning, SPMLD is applied to a host algorithm, a recent multi-label method 8233207 with acceptable performance. Empirical results supported by statistical significance tests demonstrate the effectiveness of our method against several well-known algorithms, including the host algorithm.
Self-paced learning provides a way for simultaneously choosing the easier patterns and re-estimating the learning parameters w in the form of an iterative process kumar2010self. We presume a linear function with unknown parameter w. SPL is then given by the following objective function to be solved:
In (1), the regularization term acts on the pace-control parameter p, which is restricted to a specified domain space, and the regularization parameter determines the complexity of the patterns admitted into training. Equation (1) has two unknowns: the learning parameter w and the pace-control parameter p. It is a bi-convex optimization problem over w and p, which can be efficiently solved by alternating minimization. The optimal w for a fixed p can be obtained by any off-the-shelf solver, and the optimal p for a fixed w can be obtained by:
According to (2), when p is updated for a fixed w, easy samples, whose losses fall below the threshold because they incur smaller prediction errors, are chosen for training (weight 1), and the remaining samples are not chosen (weight 0). When w is updated for a fixed p, the learning model is trained only on the "easy" samples selected before. A small value of the regularization parameter passes only "easy" samples with small losses. As this parameter is gradually increased, the larger losses of "complex" samples are accepted.
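The alternating procedure described above can be sketched in a few lines. This is a minimal illustration rather than the formulation used in this paper: it assumes a one-dimensional linear model y ≈ w·x with squared loss, binary weights, and a multiplicative schedule for the threshold. The names `spl_fit`, `lam`, and `growth` are hypothetical.

```python
def spl_fit(xs, ys, lam, growth=1.5, rounds=5):
    """Fit y ~ w * x by self-paced alternating minimization (illustrative)."""
    # Initialise w by ordinary least squares on all samples (an assumption).
    w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    p = [0] * len(xs)
    for _ in range(rounds):
        # Fix w, update p: keep samples whose squared loss is below lam.
        p = [1 if (y - w * x) ** 2 < lam else 0 for x, y in zip(xs, ys)]
        # Fix p, update w: least squares restricted to the selected samples.
        denom = sum(pi * x * x for pi, x in zip(p, xs))
        if denom > 0:
            w = sum(pi * x * y for pi, x, y in zip(p, xs, ys)) / denom
        # Raise the threshold so that harder samples can enter later rounds.
        lam *= growth
    return w, p


# Four samples follow y = 2x; the fifth carries a corrupted target.
w, p = spl_fit([1, 2, 3, 4, 5], [2, 4, 6, 8, -30], lam=20.0)
```

In this toy run the corrupted sample keeps a loss far above the threshold, so it is never selected and the fit recovers the slope of the clean samples.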
3 Self-Paced Multi-Label Learning
Suppose we are given an input matrix over a d-dimensional feature space and a finite set of l labels. The multi-label training data consist of pairs of a d-dimensional feature vector, one for each sample, and a label vector whose ith entry indicates whether the ith label is assigned to the jth sample. Multi-label learning trains a predictor from the training data so that the relevant and irrelevant labels of out-of-sample data can be predicted. The general objective function for multi-label learning is:
where the learned matrix has one column per label, each column being the weight vector of that label, and the loss is computed for the jth sample on the ith label. The correlation matrix C expresses the degree of dependency between each pair of labels, and the correlation regularizer encourages the outputs of positively dependent labels to be closer and those of negatively dependent labels to be farther apart.
The aforementioned objective function treats all labels identically and treats all samples of a label identically. In many real-world cases, however, different labels do not have identical complexities, and samples have different complexity levels for different labels. Moreover, the non-convex objective function of multi-label learning is prone to getting stuck in local optima zhao2015self, especially in the presence of bad initialization or noisy and corrupted labels. To address these defects, we define a self-paced regularization so that the model learns from a sequence of instances ordered by their degree of complexity.
Furthermore, our desirable multi-label learning is expected to learn not only easy but also diverse labels that are sufficiently disparate from the current learning pace. To this end, instance-label weights are introduced. In order to accomplish the easy-to-hard strategy on diverse labels simultaneously, we propose a new self-paced regularizer in (5):
Equation (5) consists of two terms, a negative ℓ1-norm and an adaptive ℓ2,1-norm of a matrix. The first term induces a preference for selecting the easy instances rather than the hard ones for each label. Combined with (3), it implies that a small empirical loss on a training data point drives the corresponding weight to be high. Hence, this optimization process corresponds well to the intuitive notion of starting with the easiest instances (the ones that have the lowest empirical errors). As the self-paced regularization parameter is progressively increased while the learning proceeds, the self-paced weights consequently grow higher. This leads to the gradual involvement of more complex samples in training. The ℓ2,1 matrix norm term leads to label-wise sparsity. It favors selecting from different categories of labels in the initial steps, when diversity is weighted more heavily. As the training proceeds and the diversity regularization parameter is gradually decreased, the impact of diversity decreases. By plugging (5) into (3), we obtain the final objective function:
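To see why a group-wise norm rewards diversity, the following sketch evaluates a regularizer of the form used by SPLD in jiang2014self, namely a negative ℓ1-norm plus a negative ℓ2,1-norm taken over label-wise rows. This form is an assumption made for illustration, and the exact weighting and signs in (5) may differ. The function name and the toy weight matrices are hypothetical.

```python
import math


def spld_regularizer(P, lam, gamma):
    """Return -lam * ||P||_1 - gamma * ||P||_{2,1}, with one row of P per label."""
    l1 = sum(abs(v) for row in P for v in row)
    l21 = sum(math.sqrt(sum(v * v for v in row)) for row in P)
    return -lam * l1 - gamma * l21


# Both matrices select four instance-label pairs in total.
concentrated = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
spread = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

value_concentrated = spld_regularizer(concentrated, lam=1.0, gamma=1.0)
value_spread = spld_regularizer(spread, lam=1.0, gamma=1.0)
```

Under this assumed form, both selections have the same ℓ1-norm, but the selection spread over four labels attains the lower value (-8 versus -6), so a minimizer prefers it.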
3.1 SPMLD with Local and Global Correlation
We give an example to motivate our self-paced learning procedure and briefly discuss how the proposed algorithm develops the learning pace of the original problem. The host method (GLOCAL) 8233207 simultaneously recovers the missing labels, trains the learner, and exploits both global and local correlations among labels, without needing any further prior knowledge, by learning a latent label representation. The objective function of GLOCAL is as follows:
where V stands for the matrix of latent labels, which capture higher-level concepts that are more compact and semantically abstract than the original labels, while U represents the matrix containing the interactions between the original labels and the latent labels. In general, labels may only be partially observed. Low-rank representation is one of the key techniques in matrix completion, and the low-rank decomposition of the observed labels yields a natural way to recover missing labels. Let J be the indicator matrix of the observed labels in Y.
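The role of the indicator matrix J can be illustrated with a small sketch of a masked reconstruction loss, in which only observed entries of Y contribute to the squared error between Y and the product UV. The labels-by-instances orientation, the ±1 encoding with 0 for missing entries, and the helper names are assumptions made for this example.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def masked_reconstruction_loss(Y, U, V, J):
    """Sum of squared errors between Y and U @ V over entries where J == 1."""
    R = matmul(U, V)
    return sum(
        J[i][j] * (Y[i][j] - R[i][j]) ** 2
        for i in range(len(Y))
        for j in range(len(Y[0]))
    )


# Two labels, three instances, one latent label (toy values).
U = [[1], [-1]]
V = [[1, -1, 1]]
Y = [[1, -1, 0], [-1, 0, -1]]           # 0 marks a missing entry
J = [[1, 1, 0], [1, 0, 1]]              # 1 marks an observed entry
all_observed = [[1, 1, 1], [1, 1, 1]]

loss_masked = masked_reconstruction_loss(Y, U, V, J)
loss_unmasked = masked_reconstruction_loss(Y, U, V, all_observed)
```

With the mask, the factorization fits every observed entry exactly and the loss is zero, whereas treating the missing entries as observed zeros would penalize the same factorization.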
Label correlations may not take the same values in different categories, so we define a local manifold regularization. Assume that the dataset X is partitioned into b groups, each containing its own number of instances. This partitioning can be obtained by clustering. For each group, the label submatrix of Y corresponding to that group yields the local correlation of the group. Similar to global label correlations, we encourage the outputs to be similar on positively correlated labels and dissimilar on negatively correlated ones by minimizing a manifold regularizer on the predicted submatrix of each group. This regularizer uses the Laplacian of the group's correlation matrix, which stores our knowledge about the relationships among the labels. The Laplacian is defined as the difference between a diagonal degree matrix and the correlation matrix, where the diagonal of the degree matrix holds the row sums of the correlation matrix.
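A short sketch of the standard graph-Laplacian construction shows how such a regularizer penalizes disagreement between correlated labels. It assumes a given symmetric correlation matrix S and a prediction matrix F with one row per label. Note that the host algorithm optimizes its Laplacians during training rather than fixing them in advance, so this only illustrates the penalty itself. All names and values are hypothetical.

```python
def laplacian(S):
    """Graph Laplacian L = D - S, where D is diagonal with the row sums of S."""
    n = len(S)
    return [
        [(sum(S[i]) if i == j else 0.0) - S[i][j] for j in range(n)]
        for i in range(n)
    ]


def manifold_penalty(F, L):
    """tr(F^T L F) for F with one row of predictions per label."""
    n = len(L)
    return sum(
        L[i][j] * sum(a * b for a, b in zip(F[i], F[j]))
        for i in range(n)
        for j in range(n)
    )


# Labels 0 and 1 are positively correlated; label 2 is unrelated.
S = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
L = laplacian(S)

F_agree = [[1, 0], [1, 0], [0, 1]]      # labels 0 and 1 predict alike
F_disagree = [[1, 0], [0, 1], [0, 1]]   # labels 0 and 1 predict differently

penalty_agree = manifold_penalty(F_agree, L)
penalty_disagree = manifold_penalty(F_disagree, L)
```

For symmetric S the penalty equals half the sum of S[i][j] times the squared distance between the prediction rows of labels i and j, so it vanishes when correlated labels agree.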
Finally, we propose the following objective function, which considers the diversity of labels in a unified setting:
Here, the square root of P is taken element-wise, the Hadamard (element-wise) product of matrices is used, and the regularization term guarantees generalization ability and numerical stability.
In this section, we discuss how to solve (3.1) by alternating minimization, which allows the variables to be tuned iteratively. It is difficult to find a global optimum for this non-convex objective function. To obtain reliable self-paced weights, we extend a block coordinate descent optimizer. For solving the block of self-paced weights with the remaining blocks fixed, the optimization problem can be decomposed into k sub-problems, one for each of the k latent labels. Thus, the objective function of the i-th label is given by:
In order to solve (9), we first assume . Let and . For each i and arbitrary we define , , and for later computation:
3. be the smallest j such that .
4. be the largest j such that .
Let , and be obtained by optimizing the following objective function:
Then, the optimal is given by
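For intuition, the binary selection rule of SPLD in jiang2014self, a simpler relative of the update derived here, can be sketched as follows. Within each group, samples are ranked by ascending loss, and the sample of rank r is selected if its loss is below lam + gamma / (sqrt(r) + sqrt(r - 1)). Treating each label as a group is an assumption made for this illustration, and the function name and toy losses are hypothetical.

```python
import math


def spld_select(losses_per_label, lam, gamma):
    """Binary weights from the rank-dependent SPLD threshold, one row per label."""
    weights = []
    for losses in losses_per_label:
        # Indices of this label's samples, ordered from smallest to largest loss.
        order = sorted(range(len(losses)), key=lambda j: losses[j])
        row = [0] * len(losses)
        for rank, j in enumerate(order, start=1):
            threshold = lam + gamma / (math.sqrt(rank) + math.sqrt(rank - 1))
            if losses[j] < threshold:
                row[j] = 1
        weights.append(row)
    return weights


# Label 0 is easy (small losses); label 1 is hard (large losses).
losses = [[0.1, 0.2, 0.3], [0.9, 1.2, 1.5]]
with_diversity = spld_select(losses, lam=0.25, gamma=1.0)
without_diversity = spld_select(losses, lam=0.25, gamma=0.0)
```

The threshold is most generous for the top-ranked sample of each group, so the hard label contributes its easiest sample when gamma is positive, while a plain threshold (gamma = 0) selects from the easy label only.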
Next, we discuss the update procedures for U, V, W, and Z. To optimize these variables with gradient descent, we use the MANOPT toolbox (http://www.manopt.org) boumal2014manopt with line search on the Euclidean and manifold spaces.
Updating ’s: With U, V, W and ’s fixed: for each . Due to the constraint , it has no closed-form solution, and we solve it with projected gradient descent. The gradient of the objective w.r.t. is
To satisfy the constraint, we project each row of the updated matrix onto the unit norm ball after each update.
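The projection step can be sketched as follows. This is a generic projected gradient step with a fixed step size, not the line-search procedure of the toolbox: rows whose Euclidean norm exceeds one are rescaled to unit norm and the other rows are left unchanged. The function names, the step size, and the toy matrix are assumptions.

```python
import math


def project_rows_to_unit_ball(Z):
    """Rescale any row whose Euclidean norm exceeds 1; keep the other rows."""
    projected = []
    for row in Z:
        norm = math.sqrt(sum(v * v for v in row))
        if norm > 1.0:
            projected.append([v / norm for v in row])
        else:
            projected.append(list(row))
    return projected


def projected_gradient_step(Z, grad, step):
    """One gradient step on Z followed by the row-wise projection."""
    moved = [[z - step * g for z, g in zip(z_row, g_row)] for z_row, g_row in zip(Z, grad)]
    return project_rows_to_unit_ball(moved)


Z_new = project_rows_to_unit_ball([[3.0, 4.0], [0.3, 0.4]])
```

Here the first row has norm 5 and is scaled to approximately [0.6, 0.8], while the second row already lies inside the ball and is returned unchanged.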
Updating V: With U, W, ’s and ’s fixed: The gradient of the objective in (3.1) w.r.t. V is
Updating U: With V, W, ’s and ’s fixed: Again, we use gradient descent and the gradient w.r.t. U is:
Updating W: With U, V, ’s and ’s fixed: The gradient w.r.t. W is:
In this section, experiments are conducted to validate our method. Seven real-world multi-label datasets are used (http://mulan.sourceforge.net/datasets-mlc.html): six Yahoo text datasets (Business, Computers, Education, Health, Science, and Social) ueda2003parametric and an image classification dataset (Corel5K) duygulu2002object.
4.1 Experimental Setting
Table 1 lists detailed characteristics of the employed datasets. Each column sequentially represents the number of features, number of instances, number of labels and label per instance ratio for each dataset.
Since prediction in the presence of missing labels is a more challenging task, we perform the experiments on missing-label data. We randomly sample a fraction of the elements in the label matrix as observed and treat the rest as missing. This fraction is set to 30% and 70%, respectively. Coverage, Ranking loss, Average AUC, Instance AUC, MacroF1, MicroF1, and InstanceF1 are used to measure the performance of the proposed predictive model against the baselines. These metrics analyze different aspects of multi-label learning algorithms. The first two metrics are to be minimized and the others are to be maximized wu2017unified.
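The missing-label protocol described above can be sketched as follows: a fixed fraction of the entries of the label matrix is drawn uniformly at random and marked as observed. The function name, the fixed seed, and the matrix shape are assumptions for this example, and the exact sampling code used in the experiments is not specified in the text.

```python
import random


def sample_observed_mask(num_labels, num_instances, observed_ratio, seed=0):
    """Return a 0/1 indicator matrix with the given fraction of observed entries."""
    rng = random.Random(seed)
    total = num_labels * num_instances
    # Draw the flat indices of the observed entries without replacement.
    observed = set(rng.sample(range(total), round(observed_ratio * total)))
    return [
        [1 if i * num_instances + j in observed else 0 for j in range(num_instances)]
        for i in range(num_labels)
    ]


mask_30 = sample_observed_mask(10, 20, 0.30)
mask_70 = sample_observed_mask(10, 20, 0.70)
```

The resulting indicator matrix plays the role of the observed-entry mask when the training loss is restricted to revealed labels.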
To examine the effectiveness of our framework, it is compared with three state-of-the-art multi-label learning algorithms, namely LEML yu2014large, ML-LRC xu2014learning, and GLOCAL 8233207. These baseline methods share two traits with SPMLD, which are the main reasons for comparing our framework with them: all of them learn in a latent subspace, and all of them handle the missing-label challenge.
Low-rank empirical risk minimization for multi-label learning (LEML) has a linear low-rank structure to train a model for mapping from instance to label space and utilizes an implicit concept of global label dependency.
Learning low-rank label correlations for multi-label classification (ML-LRC) learns and exploits low-rank global label correlations for multi-label classification.
Multi-label learning with global and local label correlation (GLOCAL) learns label correlations in a new manner. It considers both local and global label correlations in latent label representation.
Moreover, to measure the statistical significance of performance differences, pairwise t-tests at the 5% significance level (95% confidence) are conducted between SPMLD and each of the baseline algorithms. In each table, a bold-faced result denotes that the algorithm statistically outperforms the others. When there is no significant difference between SPMLD and one or more of the baselines on a dataset for a specific evaluation metric, their results are shown in italic. The self-paced regularization parameter controls the complexity of the instances and labels admitted into training. It first admits the smallest losses, corresponding to the easiest samples, and then gradually involves harder ones through the iterations. This parameter and its increase ratio are tuned by search. The diversity regularization parameter encourages the selection of less similar instances and labels among those identified as easy, and it needs to be higher in the initial iterations. It is tuned by grid search in the range [1, 10], together with its decrease ratio. The parameters of the host algorithm (GLOCAL) and of the competing methods are all set as recommended in the corresponding literature.
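A paired t-test over the per-run scores of two methods can be sketched as follows. The sketch assumes a two-sided test on 10 paired runs, for which the critical value of Student's t with 9 degrees of freedom at the 5% level is 2.262. The scores are made-up placeholders, not results from this paper, and the function name is hypothetical.

```python
import math
import statistics

# Two-sided critical value of Student's t, 9 degrees of freedom, alpha = 0.05.
T_CRIT_DF9 = 2.262


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences between two equally long score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean / (sd / math.sqrt(len(diffs)))


# Placeholder per-run scores for two methods over 10 runs.
method_a = [0.900 + 0.001 * i for i in range(10)]
method_b = [0.880 + 0.0012 * i for i in range(10)]

t_value = paired_t_statistic(method_a, method_b)
significant = abs(t_value) > T_CRIT_DF9
```

When the paired differences are consistently positive with little spread, as in this placeholder data, the statistic far exceeds the critical value and the difference is declared significant.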
4.2 Results on Real-World Datasets
All the results reported in tables and charts are averaged over 10 independent runs. Tables 2-7 present the label prediction results of the multi-label algorithms on six of the mentioned datasets, respectively. Overall, the results of all algorithms improve when 70% of the entries are observed. Each table reports three metrics under two observation settings, giving six cases per dataset. On four datasets (Business, Education, Health, and Science), SPMLD performs significantly better on all measures and for both observation settings, as reported in Tables 2 and 4-6, respectively. On the Computers dataset, whose results are shown in Table 3, SPMLD achieves significantly better Rkl for both settings. It also shows statistically better AUC and COV values when 30% of the entries are revealed. With 70% revealed, LEML, ML-LRC, and SPMLD show no significant difference in AUC among themselves, but all three are better than GLOCAL. For COV with 70% revealed, the proposed method shows no significant difference from ML-LRC but is statistically better than the other two models. Similarly, on the Social dataset, SPMLD performs significantly better for both observation ratios on Rkl and AUC, and on COV with 30% revealed. For COV with 70% revealed, it is statistically equal to GLOCAL, and the two jointly rank first, ahead of the remaining two algorithms. In summary, SPMLD ranks first alone in 33 of the 36 cases (91.67%) according to the statistical significance tests. In the remaining 3 cases (8.33%), it still ranks first, but jointly with one or more of the baselines.
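Table 2: Results on Business (mean ± standard deviation).

| Method | Observed | Rkl | AUC | COV |
|---|---|---|---|---|
| LEML | 30% | 0.063 ± 0.0057 | 0.928 ± 0.0052 | 3.954 ± 0.2751 |
| LEML | 70% | 0.058 ± 0.0048 | 0.942 ± 0.0057 | 3.303 ± 0.2706 |
| ML-LRC | 30% | 0.061 ± 0.0024 | 0.937 ± 0.0055 | 3.279 ± 0.0666 |
| ML-LRC | 70% | 0.046 ± 0.0019 | 0.950 ± 0.0050 | 2.580 ± 0.0593 |
| GLOCAL | 30% | 0.054 ± 0.0025 | 0.937 ± 0.0036 | 2.863 ± 0.1711 |
| GLOCAL | 70% | 0.046 ± 0.0022 | 0.952 ± 0.0031 | 2.579 ± 0.1658 |
| SPMLD | 30% | 0.044 ± 0.0021 | 0.956 ± 0.0032 | 2.361 ± 0.1688 |
| SPMLD | 70% | 0.043 ± 0.0018 | 0.958 ± 0.0029 | 2.347 ± 0.1625 |

Table 3: Results on Computers (mean ± standard deviation).

| Method | Observed | Rkl | AUC | COV |
|---|---|---|---|---|
| LEML | 30% | 0.179 ± 0.0072 | 0.880 ± 0.0064 | 7.392 ± 0.2181 |
| LEML | 70% | 0.141 ± 0.0058 | 0.894 ± 0.0069 | 6.306 ± 0.2653 |
| ML-LRC | 30% | 0.152 ± 0.0044 | 0.873 ± 0.0057 | 6.052 ± 0.1426 |
| ML-LRC | 70% | 0.115 ± 0.0026 | 0.895 ± 0.0058 | 5.000 ± 0.6228 |
| GLOCAL | 30% | 0.132 ± 0.0037 | 0.876 ± 0.0034 | 5.647 ± 0.7823 |
| GLOCAL | 70% | 0.123 ± 0.0028 | 0.884 ± 0.0062 | 5.440 ± 0.5340 |
| SPMLD | 30% | 0.104 ± 0.0030 | 0.903 ± 0.0047 | 4.600 ± 0.5359 |
| SPMLD | 70% | 0.113 ± 0.0015 | 0.894 ± 0.0053 | 5.018 ± 0.5712 |

Table 4: Results on Education (mean ± standard deviation).

| Method | Observed | Rkl | AUC | COV |
|---|---|---|---|---|
| LEML | 30% | 0.176 ± 0.0084 | 0.817 ± 0.0075 | 9.672 ± 0.4461 |
| LEML | 70% | 0.151 ± 0.0077 | 0.842 ± 0.0082 | 7.595 ± 0.5138 |
| ML-LRC | 30% | 0.144 ± 0.0034 | 0.845 ± 0.0068 | 6.350 ± 0.2042 |
| ML-LRC | 70% | 0.113 ± 0.0028 | 0.860 ± 0.0063 | 5.075 ± 0.1866 |
| GLOCAL | 30% | 0.125 ± 0.0026 | 0.875 ± 0.0057 | 5.741 ± 0.2312 |
| GLOCAL | 70% | 0.122 ± 0.0035 | 0.878 ± 0.0064 | 5.784 ± 0.2450 |
| SPMLD | 30% | 0.096 ± 0.0019 | 0.904 ± 0.0048 | 4.246 ± 0.1372 |
| SPMLD | 70% | 0.093 ± 0.0021 | 0.907 ± 0.0052 | 4.162 ± 0.1524 |

Table 5: Results on Health (mean ± standard deviation).

| Method | Observed | Rkl | AUC | COV |
|---|---|---|---|---|
| LEML | 30% | 0.095 ± 0.0029 | 0.896 ± 0.0038 | 6.248 ± 0.1640 |
| LEML | 70% | 0.074 ± 0.0033 | 0.920 ± 0.0045 | 5.167 ± 0.1827 |
| ML-LRC | 30% | 0.085 ± 0.0041 | 0.907 ± 0.0082 | 4.924 ± 0.1351 |
| ML-LRC | 70% | 0.071 ± 0.0036 | 0.920 ± 0.0093 | 3.960 ± 0.1825 |
| GLOCAL | 30% | 0.0828 ± 0.0018 | 0.919 ± 0.0057 | 4.438 ± 0.1357 |
| GLOCAL | 70% | 0.0795 ± 0.0023 | 0.923 ± 0.0078 | 4.472 ± 0.1414 |
| SPMLD | 30% | 0.064 ± 0.0009 | 0.938 ± 0.0063 | 3.355 ± 0.1182 |
| SPMLD | 70% | 0.059 ± 0.0011 | 0.943 ± 0.0068 | 3.253 ± 0.1200 |

Table 6: Results on Science (mean ± standard deviation).

| Method | Observed | Rkl | AUC | COV |
|---|---|---|---|---|
| LEML | 30% | 0.203 ± 0.0052 | 0.827 ± 0.0053 | 10.587 ± 0.2011 |
| LEML | 70% | 0.174 ± 0.0056 | 0.849 ± 0.0064 | 9.501 ± 0.2548 |
| ML-LRC | 30% | 0.169 ± 0.0027 | 0.830 ± 0.0039 | 8.794 ± 0.1254 |
| ML-LRC | 70% | 0.134 ± 0.0024 | 0.850 ± 0.0033 | 6.900 ± 0.1273 |
| GLOCAL | 30% | 0.154 ± 0.0029 | 0.840 ± 0.0108 | 7.949 ± 0.1371 |
| GLOCAL | 70% | 0.134 ± 0.0034 | 0.866 ± 0.0113 | 7.106 ± 0.1365 |
| SPMLD | 30% | 0.129 ± 0.0024 | 0.871 ± 0.0095 | 6.640 ± 0.1156 |
| SPMLD | 70% | 0.124 ± 0.0028 | 0.876 ± 0.0087 | 6.432 ± 0.1283 |

Table 7: Results on Social (mean ± standard deviation).

| Method | Observed | Rkl | AUC | COV |
|---|---|---|---|---|
| LEML | 30% | 0.128 ± 0.0067 | 0.872 ± 0.0065 | 5.459 ± 0.3084 |
| LEML | 70% | 0.081 ± 0.0061 | 0.919 ± 0.0069 | 3.824 ± 0.3011 |
| ML-LRC | 30% | 0.123 ± 0.0059 | 0.877 ± 0.0057 | 5.167 ± 0.1034 |
| ML-LRC | 70% | 0.073 ± 0.0052 | 0.928 ± 0.0046 | 3.608 ± 0.0975 |
| GLOCAL | 30% | 0.102 ± 0.0054 | 0.898 ± 0.0058 | 4.496 ± 0.2627 |
| GLOCAL | 70% | 0.073 ± 0.0049 | 0.929 ± 0.0052 | 3.442 ± 0.2583 |
| SPMLD | 30% | 0.068 ± 0.0055 | 0.931 ± 0.0052 | 3.563 ± 0.2554 |
| SPMLD | 70% | 0.065 ± 0.0048 | 0.934 ± 0.0044 | 3.456 ± 0.2505 |
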
Results obtained on the Corel5K image dataset are analyzed through a radar plot, which allows models to be compared against multiple metrics. SPMLD is compared with the baselines in the high-dimensional label space of Corel5K in terms of six evaluation metrics for 30% and 70% revealed data, and the results are shown in Figure 1. Note that the proposed method endeavors to cover all labels fairly in its learning process. Thus, it is able to distinguish positive and negative labels of an instance by simultaneously enlarging the label-wise margin and preserving the instance-wise margin. Accordingly, SPMLD is far better on label-wise metrics (MicroF1 and InstanceF1) and obtains comparable results on the other instance-wise metrics such as MacroF1.
4.3 Parameter Analysis
In this subsection, the influence of parameters on the proposed model is analyzed. According to (3.1), SPMLD has two parameters, which correspond to the self-paced and the diversity regularization terms, respectively. The parameters of the other regularization terms belong to the base algorithm and are analyzed in 8233207. To study the two parameters thoroughly, we evaluated them on all datasets through a grid-search strategy and report the results based on three measures. Each graph consists of 50 evaluations, referring to 5 values of the self-paced parameter and 10 values of the diversity parameter. In Figure 2, the bar next to each graph indicates the highest and the lowest values obtained on the corresponding dataset for each measure, shown using a spectrum of colors. Light colors (e.g., "orange" to "yellow") represent high values and dark colors (e.g., "blue" to "dark-blue") represent low values of the measures.
It can be seen that, for each value of the self-paced parameter, changing the diversity parameter from 1 to 10 makes a significant difference. The exception is one value of the self-paced parameter, which lies on the top row of the graphs and for which the results stay unchanged as the diversity parameter increases.
In this paper, we propose a novel self-paced framework for multi-label learning. This framework incorporates the complexity of both instances and labels, and trains its predictive model with gradual involvement of harder samples. It also uses an efficient diversity maintenance mechanism to avoid bias toward a limited subset of labels. The diverse easy-to-hard learning strategy also has an implicit positive effect on the exploited correlations. SPMLD is applied to correlation-based multi-label learning as a host algorithm. Experiments on real-world datasets verify the effectiveness of SPMLD compared with the host algorithm itself and two other state-of-the-art methods. For future studies, it is desirable to investigate the direct impact of self-paced regularization on correlation exploitation, and we intend to study and analyze the effect of diversity on local dependencies.