Multi-Label Learning with Global and Local Label Correlation
Abstract
It is well-known that exploiting label correlations is important to multi-label learning. Existing approaches either assume that the label correlations are global and shared by all instances, or that they are local and shared only by a subset of the data. In fact, in real-world applications, both cases may occur: some label correlations are globally applicable, while others are shared only within a local group of instances. Moreover, it is often the case that only some of the labels are observed, which makes the exploitation of label correlations much more difficult: it is hard to estimate label correlations when many labels are absent. In this paper, we propose a new multi-label approach, GLOCAL, which deals with both the full-label and missing-label cases and exploits global and local label correlations simultaneously, by learning a latent label representation and optimizing label manifolds. Extensive experimental studies validate the effectiveness of our approach on both full-label and missing-label data.
National Key Laboratory for Novel Software Technology
Nanjing University, Nanjing 210093, China
Email: {zhuy, zhouzh}@lamda.nju.edu.cn
the Department of Computer Science and Engineering
Hong Kong University of Science and Technology, Hong Kong
Email: jamesk@cse.ust.hk
Keywords: global and local label correlations, label manifold, missing labels, multi-label learning.
1 Introduction
In real-world classification applications, an instance is often associated with more than one class label. For example, a scene image can be annotated with several tags [boutell2004learning], a document may belong to multiple topics [ueda2002parametric], and a piece of music may be associated with different genres [turnbull2008semantic]. Thus, multi-label learning has attracted a lot of attention in recent years [zhang2014review].
Current studies on multi-label learning try to incorporate label correlations of different orders [zhang2014review]. However, existing approaches mostly focus on global label correlations shared by all instances [furnkranz2008multilabel; ji2008extracting; read2011classifier]. For example, labels "fish" and "ocean" are highly correlated, and so are "stock" and "finance". On the other hand, certain label correlations are shared only by a local data subset [huang2012]. For example, "apple" is related to "fruit" in gourmet magazines, but to "digital devices" in technology magazines. Previous studies exploit either global or local label correlations; considering both is clearly more beneficial and desirable.
Another problem with label correlations is that they are usually difficult to specify manually. As label correlations may vary with the context and there is no unified measure for specifying appropriate correlations, they are usually estimated from the observed data. Some approaches learn label hierarchies by hierarchical clustering [Punera2005Automatically] or Bayesian network structure learning [zhang2010multi]. However, a hierarchical structure may not exist in some applications. For example, labels such as "desert", "mountains", "sea", "sunset" and "trees" do not have any natural hierarchical correlations, and label hierarchies may not be useful. Others estimate label correlations from the co-occurrence of labels in the training data [NIPS2011_4239]. However, this may cause overfitting. Moreover, co-occurrence is less meaningful for labels with very few positive instances.
In multi-label learning, some labels may be missing from the training set. For example, human labelers may ignore object classes they do not know or have little interest in. Recently, multi-label learning with missing labels has become a hot topic. Xu et al. [xu2013speedup] and Yu et al. [Yu2014] considered using the low-rank structure of the instance-label mapping. A more direct approach to modeling the label dependency approximates the label matrix as a product of two low-rank matrices [goldberg2010transduction]. This leads to simpler recovery of the missing labels, and produces a latent representation of the label matrix.
When labels are missing, estimation of label correlations becomes even more difficult, as the observed label distribution differs from the true one. As a result, the aforementioned methods (based, for example, on hierarchical clustering and co-occurrence) will produce biased estimates of the label correlations.
In this paper, we propose a new approach called "Multi-Label Learning with GLObal and loCAL Correlation" (GLOCAL), which simultaneously recovers the missing labels, trains the linear classifiers, and exploits both global and local label correlations. It learns a latent label representation; classifier outputs are encouraged to be similar on highly positively correlated labels, and dissimilar on highly negatively correlated ones. We do not assume the presence of external knowledge sources specifying the label correlations. Instead, these correlations are learned simultaneously with the latent label representation and the instance-label mapping.
The rest of the paper is organized as follows. In Section 2, related work on multi-label learning with label correlations is reviewed. In Section 3, the problem formulation and the GLOCAL approach are presented. Experimental results are reported in Section 4. Finally, Section 5 concludes the work.
Notations. For a matrix $A$, $A^\top$ denotes its transpose, $\mathrm{tr}(A)$ is its trace, $\|A\|_F$ is its Frobenius norm, and $\mathrm{diag}(A)$ returns a vector containing the diagonal elements of $A$. For two matrices $A$ and $B$, $A \circ B$ denotes the Hadamard (elementwise) product. For a vector $x$, $\|x\|$ is its $\ell_2$-norm, and $\mathrm{Diag}(x)$ returns a diagonal matrix with $x$ on the diagonal.
2 Related Work
Multi-label learning has been widely studied in recent years. Based on the order of label correlations used, existing approaches can be divided into three categories [zhang2014review]: (i) first-order; (ii) second-order; and (iii) high-order. In the first-order strategy, label correlations are not considered, and the multi-label problem is transformed into multiple independent binary classification problems. For example, BR [boutell2004learning] trains a classifier for each label independently. In the second-order strategy, pairwise label relations are considered. For example, CLR [furnkranz2008multilabel] transforms the multi-label learning problem into a pairwise label ranking problem. In the high-order strategy, the influence of all other labels on each label is taken into account. For example, CC [read2011classifier] transforms the multi-label learning problem into a chain of binary classification problems, with the ground-truth labels encoded into the features.
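As a concrete reference point for the first-order strategy, the sketch below implements a binary-relevance style baseline in numpy. It uses regularized least-squares scorers in place of the linear SVMs used by BR; the function names, data sizes, and regularization strength are illustrative, not from the paper.

```python
import numpy as np

def br_train(X, Y, reg=1e-3):
    """Binary relevance: fit one independent regularized least-squares
    scorer per label. X: n x d; Y: n x l with entries in {-1, +1}."""
    d = X.shape[1]
    A = X.T @ X + reg * np.eye(d)       # Gram matrix, shared across labels
    return np.linalg.solve(A, X.T @ Y)  # d x l: one weight column per label

def br_predict(X, W):
    return np.sign(X @ W)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
W_true = rng.standard_normal((5, 3))
Y = np.sign(X @ W_true)                 # three labels, each linearly separable
W = br_train(X, Y)
train_acc = np.mean(br_predict(X, W) == Y)
```

Because each label is fitted independently, nothing in this baseline can exploit correlations between the three label columns, which is exactly the limitation the second- and high-order strategies address.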
Most previous studies focus on global label correlations. However, MLLOC [huang2012] demonstrates that sometimes label correlations are shared only by a local data subset. Specifically, it enhances the feature representation of each instance by embedding a code into the feature space, which encodes the influence of the instance's labels on the local label correlations. This has some limitations. First, when the dimensionality of the feature space is large, the code is less discriminative and will be dominated by the original features. Second, MLLOC considers only the local label correlations, but not the global ones. Third, MLLOC cannot learn with missing labels.
In some real-world applications, labels are only partially observed, and multi-label learning with missing labels has attracted much attention. MAXIDE [xu2013speedup] is based on fast low-rank matrix completion, and has strong theoretical guarantees. However, it only works in the transductive setting. Moreover, the label correlation matrix has to be specified manually. LEML [Yu2014] also relies on a low-rank structure, and works in the inductive setting. However, it only implicitly uses global label correlations. MLLRC [xu2014learning] adopts a low-rank structure to capture global label correlations, and addresses missing labels by introducing a supplementary label matrix. However, only global label correlations are taken into account. Obviously, it would be more desirable to learn both global and local label correlations simultaneously.
Manifold regularization [belkin2006manifold] exploits instance similarity by forcing the predicted values on similar instances to be similar. A similar idea can be adapted to the label manifold, so that the predicted values on correlated labels are similar. However, the Laplacian matrix is based on some label similarity or correlation matrix, which can be hard to specify, as discussed in Section 1.
3 The Proposed Approach
In multi-label learning, an instance can be associated with multiple class labels. Let $\mathcal{C} = \{c_1, \dots, c_l\}$ be the class label set of $l$ labels. We denote the feature vector of an instance by $x \in \mathbb{R}^d$, and the ground-truth label vector by $y \in \{-1, 1\}^l$, where $y_j = 1$ if $x$ is with class label $c_j$, and $y_j = -1$ otherwise. As mentioned in Section 1, instances in the training data may be partially labeled, i.e., some labels may be missing. We adopt the general setting that both positive and negative labels can be missing [goldberg2010transduction; xu2013speedup; Yu2014]. The observed label vector is denoted $\tilde{y} \in \{-1, 0, 1\}^l$, where $\tilde{y}_j = 0$ if class label $c_j$ is not labeled (i.e., it is missing), and $\tilde{y}_j = y_j$ otherwise. Given the training data $\{(x_i, \tilde{y}_i)\}_{i=1}^{n}$, our goal is to learn a mapping function $f : \mathbb{R}^d \to \{-1, 1\}^l$.
In this paper, we propose the GLOCAL algorithm, which learns and exploits both global and local label correlations via label manifolds. To recover the missing labels, learning of the latent label representation and training of the classifiers are performed simultaneously.
3.1 Basic Model
Let $Y = [y_1, \dots, y_n] \in \{-1, 1\}^{l \times n}$ be the ground-truth label matrix, where each $y_i$ is the label vector for instance $x_i$. As discussed in Section 1, $Y$ is low-rank. Let its rank be $k \ll \min(l, n)$. Thus, $Y$ can be written as the low-rank decomposition $UV$, where $U \in \mathbb{R}^{l \times k}$ and $V \in \mathbb{R}^{k \times n}$. Intuitively, $V$ represents the latent labels, which are more compact and more semantically abstract than the original labels, while the matrix $U$ maps the latent labels back to the original label space.
In general, the labels are only partially observed. Let the observed label matrix be $\tilde{Y} = [\tilde{y}_1, \dots, \tilde{y}_n]$, and $\Omega$ be the set containing the indices of the observed labels in $\tilde{Y}$ (i.e., the indices of the nonzero elements of $\tilde{Y}$). We focus on minimizing the reconstruction error on the observed labels, i.e., $\|\Pi_\Omega(\tilde{Y} - UV)\|_F^2$, where $[\Pi_\Omega(A)]_{ij} = A_{ij}$ if $(i, j) \in \Omega$, and 0 otherwise. Moreover, we use a linear mapping $W \in \mathbb{R}^{d \times k}$ to map instances to the latent labels. This is learned by minimizing $\|V - W^\top X\|_F^2$, where $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ is the instance matrix. Combining these two, we obtain the following optimization problem:
(1) $\min_{U, V, W} \; \|\Pi_\Omega(\tilde{Y} - UV)\|_F^2 + \lambda \|V - W^\top X\|_F^2 + \lambda_2 R(U, V, W),$
where $R(U, V, W)$ is a regularizer and $\lambda$, $\lambda_2$ are tradeoff parameters. While the square loss has been used in Eqn (1), it can be replaced by any differentiable loss function. The prediction on an instance $x$ is $\mathrm{sign}(f(x))$, where $f(x) = U W^\top x$. Let $u_i$ be the $i$th row of $U$; then $f_i(x) = u_i W^\top x$ denotes the predicted value on the $i$th label for $x$. We concatenate all the $f_i$'s, denoted by $f$, thus $f(x) = [f_1(x), \dots, f_l(x)]^\top = U W^\top x$.
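The basic objective can be sketched directly in numpy. Below, the function name is illustrative, and $R(U, V, W)$ is assumed (as one common choice) to be the sum of squared Frobenius norms; the sketch only evaluates the objective, it does not optimize it.

```python
import numpy as np

def glocal_basic_objective(Ytil, J, X, U, V, W, lam, lam2):
    """Objective of Eqn (1): masked reconstruction + latent-label regression.
    Ytil: l x n observed labels; J: l x n 0/1 mask of observed entries;
    X: d x n; U: l x k; V: k x n; W: d x k."""
    rec = np.linalg.norm(J * (Ytil - U @ V), 'fro') ** 2   # observed entries only
    fit = lam * np.linalg.norm(V - W.T @ X, 'fro') ** 2    # V ~ W^T X
    reg = lam2 * sum(np.linalg.norm(M, 'fro') ** 2 for M in (U, V, W))
    return rec + fit + reg

rng = np.random.default_rng(1)
l, n, d, k = 6, 20, 8, 3
X = rng.standard_normal((d, n))
U, W = rng.standard_normal((l, k)), rng.standard_normal((d, k))
V = W.T @ X                                   # latent labels consistent with W
Ytil = np.sign(U @ V)
J = (rng.random((l, n)) < 0.7).astype(float)  # roughly 30% of labels missing
obj = glocal_basic_objective(Ytil, J, X, U, V, W, lam=1.0, lam2=0.0)
```

With $V = W^\top X$ and $\lambda_2 = 0$, the regression and regularization terms vanish, so the objective equals the masked reconstruction error alone.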
3.2 Global and Local Manifold Regularizers
Exploiting label correlations is an essential ingredient in multi-label learning. Here, we use label correlations to regularize the model. Intuitively, the more positively correlated two labels are, the closer the corresponding classifier outputs should be, and vice versa. Let $S_0 \in \mathbb{R}^{l \times l}$ be the global label correlation matrix. The manifold regularizer $\frac{1}{2} \sum_{i,j=1}^{l} [S_0]_{ij} \|\tilde{f}_i - \tilde{f}_j\|^2$ should then have a small value [melacci2011primallapsvm]. Here, $\tilde{f}_i$, the $i$th row of $F_0 = U W^\top X$, is the vector of classifier outputs for the $i$th label on the $n$ samples. Let $D_0$ be the diagonal matrix with diagonal $S_0 \mathbf{1}$, where $\mathbf{1}$ is the vector of ones. The manifold regularizer can be equivalently written as $\mathrm{tr}(F_0^\top L_0 F_0)$ [luo2009non], where $L_0 = D_0 - S_0$ is the Laplacian matrix of $S_0$.
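The equivalence between the pairwise form of the manifold regularizer and the trace form can be checked numerically. The sketch below uses random illustrative sizes; it holds even for signed correlation matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
l, n = 5, 12
F = rng.standard_normal((l, n))        # row i: outputs of label i on n samples
S = rng.standard_normal((l, l))
S = (S + S.T) / 2                      # signed, symmetric label correlations
D = np.diag(S @ np.ones(l))            # degree matrix: diagonal is S * 1
L = D - S                              # Laplacian of S

# Pairwise form: (1/2) * sum_ij S_ij * ||f_i - f_j||^2
pairwise = 0.5 * sum(S[i, j] * np.sum((F[i] - F[j]) ** 2)
                     for i in range(l) for j in range(l))
trace_form = np.trace(F.T @ L @ F)     # equivalent form tr(F^T L F)
```

The two quantities agree because $\frac{1}{2}\sum_{ij} S_{ij}\|f_i - f_j\|^2 = \mathrm{tr}(F^\top D F) - \mathrm{tr}(F^\top S F)$ when $S$ is symmetric.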
As discussed in Section 1, label correlations may vary from one local region to another. Assume that the data is partitioned into $g$ groups $\{G_1, \dots, G_g\}$, where $G_m$ has size $n_m$. This partitioning can be obtained by domain knowledge (e.g., gene pathways [subramanian2005gene] and networks [chuang2007network] in bioinformatics applications) or clustering. Let $X_m$ be the instance submatrix corresponding to $G_m$, $\tilde{Y}_m$ the label submatrix in $\tilde{Y}$ corresponding to $G_m$, and $S_m$ the local label correlation matrix of group $G_m$. Similar to the global label correlations, to encourage the classifier outputs to be similar on the positively correlated labels and dissimilar on the negatively correlated ones, we minimize $\mathrm{tr}(F_m^\top L_m F_m)$, where $L_m$ is the Laplacian matrix of $S_m$, and $F_m = U W^\top X_m$ is the classifier output matrix for group $G_m$.
Combining global and local label correlations with Eqn. (1), we have the following optimization problem:
(2) $\min_{U, V, W} \; \|\Pi_\Omega(\tilde{Y} - UV)\|_F^2 + \lambda \|V - W^\top X\|_F^2 + \lambda_2 R(U, V, W) + \lambda_3\, \mathrm{tr}(F_0^\top L_0 F_0) + \lambda_4 \sum_{m=1}^{g} \mathrm{tr}(F_m^\top L_m F_m),$
where $\lambda_3$, $\lambda_4$ are tradeoff parameters.
Intuitively, a large local group contributes more to the global label correlations. In particular, the following Lemma shows that when the cosine similarity is used to compute $S_0$ and the $S_m$'s, we have $S_0 = \sum_{m=1}^{g} \frac{n_m}{n} S_m$.
Lemma 1
Let $[S_0]_{ij} = \frac{\langle y^i, y^j \rangle}{\|y^i\| \|y^j\|}$ and $[S_m]_{ij} = \frac{\langle y_m^i, y_m^j \rangle}{\|y_m^i\| \|y_m^j\|}$, where $y^i \in \{-1, 1\}^n$ is the $i$th row of $Y$, and $y_m^i \in \{-1, 1\}^{n_m}$ is the $i$th row of $Y_m$. Then, $S_0 = \sum_{m=1}^{g} \frac{n_m}{n} S_m$.
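Lemma 1 can be checked numerically: with labels in $\{-1, 1\}$, every label row of $Y$ has norm $\sqrt{n}$ and every row of $Y_m$ has norm $\sqrt{n_m}$, so the group cosine similarities combine with weights $n_m / n$. The sketch below uses illustrative group sizes:

```python
import numpy as np

def cosine_label_corr(Y):
    """Cosine similarity between the label rows of Y (l x n, entries in {-1, +1})."""
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    return (Y @ Y.T) / (norms @ norms.T)

rng = np.random.default_rng(3)
l, sizes = 4, [7, 11, 5]               # three local groups of different sizes
n = sum(sizes)
Y = rng.choice([-1.0, 1.0], size=(l, n))
groups = np.split(Y, np.cumsum(sizes)[:-1], axis=1)

S0 = cosine_label_corr(Y)                                   # global correlations
S_mix = sum((Ym.shape[1] / n) * cosine_label_corr(Ym)       # size-weighted local
            for Ym in groups)
```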
In general, when the global label correlation matrix is a linear combination of the local label correlation matrices, the following Proposition shows that the global label Laplacian matrix is also a linear combination of the local label Laplacian matrices with the same combination coefficients.
Proposition 1
If $S_0 = \sum_{m=1}^{g} \alpha_m S_m$, then $L_0 = \sum_{m=1}^{g} \alpha_m L_m$.
The success of label manifold regularization hinges on a good correlation matrix (or, equivalently, a good Laplacian matrix). In multi-label learning, one rudimentary approach is to compute the correlation between two labels by the cosine similarity of their label vectors [wang2009image]. However, this can be noisy, since some labels may have very few positive instances in the training data. When labels can be missing, this computation may even become misleading, since the distribution of the observed labels can be very different from the ground-truth label distribution.
In this paper, instead of specifying any correlation metric or label correlation matrix, we learn the Laplacian matrices directly. Note that the Laplacian matrices are symmetric and positive semidefinite. Thus, for $m = 0, 1, \dots, g$, we decompose $L_m$ as $Z_m Z_m^\top$, where $Z_m \in \mathbb{R}^{l \times k}$. For simplicity, the number of columns of $Z_m$ is set to $k$, the dimensionality of the latent representation. As a result, learning the Laplacian matrices is transformed to learning $\{Z_0, Z_1, \dots, Z_g\}$. Note that optimization w.r.t. $Z_m$ may lead to the trivial solution $Z_m = 0$. To avoid this problem, we add the constraint that the diagonal entries of $Z_m Z_m^\top$ are 1, for $m = 0, 1, \dots, g$. This constraint also enables us to obtain a normalized Laplacian matrix [chung1997spectral] of $S_m$.
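The decomposition with the unit-diagonal constraint can be illustrated in a few lines of numpy (sizes here are illustrative). Normalizing each row of $Z_m$ enforces $\mathrm{diag}(Z_m Z_m^\top) = \mathbf{1}$ and rules out the trivial all-zero solution, while $Z_m Z_m^\top$ stays positive semidefinite by construction:

```python
import numpy as np

rng = np.random.default_rng(4)
l, k = 6, 3
Z = rng.standard_normal((l, k))

# Enforce diag(Z Z^T) = 1 by giving each row of Z unit norm; this also
# excludes the trivial solution Z = 0.
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
L = Z @ Z.T                  # candidate Laplacian: symmetric PSD, unit diagonal

eigvals = np.linalg.eigvalsh(L)
```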
Let $J$ be the indicator matrix with $J_{ij} = 1$ if $(i, j) \in \Omega$, and 0 otherwise. $\Pi_\Omega(\tilde{Y} - UV)$ can then be rewritten as the Hadamard product $J \circ (\tilde{Y} - UV)$. Combining the decomposition of the Laplacian matrices and the diagonal constraints on $Z_m Z_m^\top$, we obtain the optimization problem:
(4) $\min_{U, V, W, \{Z_m\}} \; \|J \circ (\tilde{Y} - UV)\|_F^2 + \lambda \|V - W^\top X\|_F^2 + \lambda_2 R(U, V, W) + \lambda_3\, \mathrm{tr}(F_0^\top Z_0 Z_0^\top F_0) + \lambda_4 \sum_{m=1}^{g} \mathrm{tr}(F_m^\top Z_m Z_m^\top F_m)$
s.t. $\mathrm{diag}(Z_m Z_m^\top) = \mathbf{1}, \quad m = 0, 1, \dots, g.$
Moreover, we will use $R(U, V, W) = \|U\|_F^2 + \|V\|_F^2 + \|W\|_F^2$.
3.3 Learning by Alternating Minimization
Problem (4) can be solved by alternating minimization (Algorithm 1). In each iteration, we update one of the variables in $\{U, V, W, Z_0, \dots, Z_g\}$ by gradient descent, with the others fixed. Specifically, the MANOPT toolbox [manopt] is utilized to implement gradient descent with line search: on the Euclidean space for the updates of $U$, $V$ and $W$, and on the manifold of matrices with unit-norm rows for the updates of the $Z_m$'s.
3.3.1 Updating $Z_m$
With $U$, $V$ and $W$ fixed, problem (4) reduces (dropping the constant multiplier $\lambda_3$ or $\lambda_4$) to
(5) $\min_{Z_m} \; \mathrm{tr}(F_m^\top Z_m Z_m^\top F_m)$
s.t. $\mathrm{diag}(Z_m Z_m^\top) = \mathbf{1}$
for each $m \in \{0, 1, \dots, g\}$. Due to the constraint $\mathrm{diag}(Z_m Z_m^\top) = \mathbf{1}$, it has no closed-form solution, and we solve it with projected gradient descent. The gradient of the objective w.r.t. $Z_m$ is
$2 F_m F_m^\top Z_m.$
To satisfy the constraint $\mathrm{diag}(Z_m Z_m^\top) = \mathbf{1}$, we project each row of $Z_m$ onto the unit norm ball after each update:
$z_m^i \leftarrow z_m^i / \|z_m^i\|,$
where $z_m^i$ is the $i$th row of $Z_m$.
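A minimal numpy sketch of this projected-gradient update follows (it is not the paper's MANOPT implementation; the step size, problem sizes, and iteration count are illustrative):

```python
import numpy as np

def update_Z(Z, F, step=0.005):
    """One projected-gradient step for problem (5):
    minimize tr(F^T Z Z^T F) subject to unit-norm rows of Z.
    Z: l x k; F: l x n classifier outputs of one group."""
    grad = 2.0 * (F @ F.T) @ Z                   # gradient of tr(F^T Z Z^T F)
    Z = Z - step * grad                          # gradient step
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)  # project rows back

rng = np.random.default_rng(5)
l, k, n = 5, 3, 9
F = rng.standard_normal((l, n))
Z = rng.standard_normal((l, k))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)    # feasible initialization

objective = lambda Z: np.trace(F.T @ Z @ Z.T @ F)
before = objective(Z)
for _ in range(100):
    Z = update_Z(Z, F)
after = objective(Z)
```

With a small enough step size, the iterates stay feasible (unit-norm rows) and the objective decreases from its random initialization.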
3.3.2 Updating $V$
With $U$, $W$ and the $Z_m$'s fixed, problem (4) reduces to
(6) $\min_{V} \; \|J \circ (\tilde{Y} - UV)\|_F^2 + \lambda \|V - W^\top X\|_F^2 + \lambda_2 \|V\|_F^2.$
Notice that the columns of $V$ are independent of each other, and thus problem (6) can be solved column by column. Let $v_i$, $\tilde{y}_i$ and $j_i$ be the $i$th columns of $V$, $\tilde{Y}$ and $J$, respectively. The optimization problem for $v_i$ can be written as:
$\min_{v_i} \; \|\mathrm{Diag}(j_i)(\tilde{y}_i - U v_i)\|^2 + \lambda \|v_i - W^\top x_i\|^2 + \lambda_2 \|v_i\|^2.$
Setting the gradient w.r.t. $v_i$ to 0, we obtain the following closed-form solution for $v_i$:
$v_i = \left( U^\top \mathrm{Diag}(j_i)\, U + (\lambda + \lambda_2) I \right)^{-1} \left( U^\top \mathrm{Diag}(j_i)\, \tilde{y}_i + \lambda W^\top x_i \right).$
This involves computing a matrix inverse for each $v_i$. If this is expensive, we can use gradient descent instead. The gradient of the objective in (6) w.r.t. $V$ is
$2 \left[ U^\top \left( J \circ (UV - \tilde{Y}) \right) + \lambda (V - W^\top X) + \lambda_2 V \right].$
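The closed-form column update can be sanity-checked by verifying that the gradient of the per-column objective vanishes at the solution. A sketch with illustrative sizes:

```python
import numpy as np

def update_v(U, W, x, ytil, j, lam, lam2):
    """Closed-form update for one column v_i of V:
    minimize ||Diag(j)(ytil - U v)||^2 + lam ||v - W^T x||^2 + lam2 ||v||^2."""
    k = U.shape[1]
    Dj = np.diag(j)                       # 0/1 mask of observed labels
    A = U.T @ Dj @ U + (lam + lam2) * np.eye(k)
    b = U.T @ Dj @ ytil + lam * (W.T @ x)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(6)
l, d, k = 8, 6, 3
U = rng.standard_normal((l, k))
W = rng.standard_normal((d, k))
x = rng.standard_normal(d)
ytil = rng.choice([-1.0, 1.0], size=l)
j = rng.choice([0.0, 1.0], size=l, p=[0.3, 0.7])   # ~30% labels missing
lam, lam2 = 0.5, 0.1
v = update_v(U, W, x, ytil, j, lam, lam2)

# Gradient of the column objective; it should vanish at the solution.
grad = 2 * (U.T @ np.diag(j) @ (U @ v - ytil) + lam * (v - W.T @ x) + lam2 * v)
```

Note that the system matrix is always invertible, since $(\lambda + \lambda_2) I$ makes it positive definite even when all labels of the column are missing.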
3.3.3 Updating $U$
With $V$, $W$ and the $Z_m$'s fixed, problem (4) reduces to
(7) $\min_{U} \; \|J \circ (\tilde{Y} - UV)\|_F^2 + \lambda_2 \|U\|_F^2 + \lambda_3\, \mathrm{tr}(F_0^\top Z_0 Z_0^\top F_0) + \lambda_4 \sum_{m=1}^{g} \mathrm{tr}(F_m^\top Z_m Z_m^\top F_m).$
Again, we use gradient descent, and the gradient w.r.t. $U$ is:
$2 \left[ \left( J \circ (UV - \tilde{Y}) \right) V^\top + \lambda_2 U + \lambda_3 Z_0 Z_0^\top U W^\top X X^\top W + \lambda_4 \sum_{m=1}^{g} Z_m Z_m^\top U W^\top X_m X_m^\top W \right].$
3.3.4 Updating $W$
With $U$, $V$ and the $Z_m$'s fixed, problem (4) reduces to
$\min_{W} \; \lambda \|V - W^\top X\|_F^2 + \lambda_2 \|W\|_F^2 + \lambda_3\, \mathrm{tr}(F_0^\top Z_0 Z_0^\top F_0) + \lambda_4 \sum_{m=1}^{g} \mathrm{tr}(F_m^\top Z_m Z_m^\top F_m).$
Again, we use gradient descent, and the gradient w.r.t. $W$ is:
$2 \left[ \lambda X (X^\top W - V^\top) + \lambda_2 W + \lambda_3 X X^\top W U^\top Z_0 Z_0^\top U + \lambda_4 \sum_{m=1}^{g} X_m X_m^\top W U^\top Z_m Z_m^\top U \right].$
Table 1: Statistics of the datasets used in the experiments.

Dataset  #instance  #dim  #label  #label/instance  |  Dataset  #instance  #dim  #label  #label/instance
Arts  5,000  462  26  1.64  |  Business  5,000  438  30  1.59
Computers  5,000  681  33  1.51  |  Education  5,000  550  33  1.46
Entertainment  5,000  640  21  1.42  |  Health  5,000  612  32  1.66
Recreation  5,000  606  22  1.42  |  Reference  5,000  793  33  1.17
Science  5,000  743  40  1.45  |  Social  5,000  1,047  39  1.28
Society  5,000  636  27  1.69  |  Enron  1,702  1,001  53  3.37
Corel5k  5,000  499  374  3.52  |  Image  2,000  294  5  1.24
4 Experiments
In this section, extensive experiments are performed on text and image datasets. Performance in both the full-label and missing-label cases is discussed.
4.1 Setup
4.1.1 Data sets
On text, the eleven Yahoo datasets (Arts, Business, Computers, Education, Entertainment, Health, Recreation, Reference, Science, Social and Society; available at http://www.kecl.ntt.co.jp/as/members/ueda/yahoo.tar) and the Enron dataset (http://mulan.sourceforge.net/datasets-mlc.html) are used. On images, the Corel5k and Image (http://cse.seu.edu.cn/people/zhangml/files/Image.rar) datasets are used. In the sequel, each dataset is denoted by its first three letters ("Society" is denoted "Soci", so as to distinguish it from "Social"). Detailed information on the datasets is shown in Table 1. For each dataset, we randomly select a portion of the instances for training, and use the rest for testing.
4.1.2 Baselines
In the GLOCAL algorithm, we use the k-means clustering algorithm to partition the data into local groups. The solution of Eqn. (1) is used to warm-start $U$, $V$ and $W$. The $Z_m$'s are randomly initialized. GLOCAL is compared with the following state-of-the-art multi-label learning algorithms:

BR [boutell2004learning], which trains a binary linear SVM (using the LIBLINEAR package [REF08a]) for each label independently;

MLLOC [huang2012], which exploits local label correlations by encoding them into the instance's feature representation;

LEML [Yu2014], which learns a linear instance-to-label mapping with low-rank structure, and implicitly takes advantage of global label correlations;

MLLRC [xu2014learning], which learns and exploits low-rank global label correlations for multi-label classification with missing labels.
Note that BR does not take label correlations into account. MLLOC considers only local label correlations; LEML implicitly uses global label correlations, whereas MLLRC models global label correlations directly. As for the ability to handle missing labels, BR and MLLOC can only learn with full labels.
For simplicity, some of the tradeoff parameters in GLOCAL are tied to each other. The remaining parameters, as well as those of the baseline methods, are selected via 5-fold cross-validation on the training set. All the algorithms are implemented in Matlab (with some C++ code for LEML).
4.1.3 Performance Evaluation
Let $n_t$ be the number of test instances, $Y_i^+$ and $Y_i^-$ be the sets of positive and negative labels associated with the $i$th instance, and $X_j^+$ and $X_j^-$ be the sets of positive and negative instances belonging to the $j$th label. Given an input $x$, let $\mathrm{rank}(x, j)$ be the rank of label $j$ in the predicted label ranking (sorted in descending order of the predicted values). For performance evaluation, we use the following popular metrics in multi-label learning [zhang2014review]:

Ranking loss (Rkl): This is the fraction of (positive, negative) label pairs in which the negative label is ranked higher than the positive label. For instance $x_i$, define $S_i = \{(p, q) \mid f_p(x_i) \le f_q(x_i),\; (p, q) \in Y_i^+ \times Y_i^-\}$. Then,
$\mathrm{Rkl} = \frac{1}{n_t} \sum_{i=1}^{n_t} \frac{|S_i|}{|Y_i^+|\,|Y_i^-|}.$

Average AUC (Auc): This is the fraction of (positive, negative) instance pairs in which the positive instance is ranked higher than the negative instance, averaged over all labels. Specifically, for label $j$, define $S_j' = \{(a, b) \mid f_j(x_a) \ge f_j(x_b),\; (x_a, x_b) \in X_j^+ \times X_j^-\}$. Then,
$\mathrm{Auc} = \frac{1}{l} \sum_{j=1}^{l} \frac{|S_j'|}{|X_j^+|\,|X_j^-|}.$

Coverage (Cvg): This counts how many steps are needed, on average, to move down the predicted label ranking so as to cover all the positive labels of an instance:
$\mathrm{Cvg} = \frac{1}{n_t} \sum_{i=1}^{n_t} \left( \max_{p \in Y_i^+} \mathrm{rank}(x_i, p) - 1 \right).$

Average precision (Ap): This is the average fraction of positive labels ranked higher than a particular positive label. For instance $x_i$ and a positive label $p \in Y_i^+$, define $S_{ip} = \{q \in Y_i^+ \mid \mathrm{rank}(x_i, q) \le \mathrm{rank}(x_i, p)\}$. Then,
$\mathrm{Ap} = \frac{1}{n_t} \sum_{i=1}^{n_t} \frac{1}{|Y_i^+|} \sum_{p \in Y_i^+} \frac{|S_{ip}|}{\mathrm{rank}(x_i, p)}.$
For Auc and Ap, the higher the better; whereas for Rkl and Cvg, the lower the better. To reduce statistical variability, results are averaged over 10 independent repetitions.
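To make the first two metrics concrete, the sketch below implements ranking loss and average AUC in numpy (an independent illustration, not the paper's evaluation code; ties are counted as misorderings for Rkl and as correct orderings for Auc, matching the definitions above):

```python
import numpy as np

def ranking_loss(F, Y):
    """Fraction of (positive, negative) label pairs ordered wrongly, per instance.
    F: n x l predicted scores; Y: n x l ground truth in {-1, +1}."""
    losses = []
    for f, y in zip(F, Y):
        pos, neg = f[y > 0], f[y < 0]
        if len(pos) and len(neg):
            losses.append(np.mean(pos[:, None] <= neg[None, :]))
    return float(np.mean(losses))

def average_auc(F, Y):
    """Fraction of (positive, negative) instance pairs ordered correctly, per label."""
    aucs = []
    for f, y in zip(F.T, Y.T):   # iterate over labels
        pos, neg = f[y > 0], f[y < 0]
        if len(pos) and len(neg):
            aucs.append(np.mean(pos[:, None] >= neg[None, :]))
    return float(np.mean(aucs))

# Tiny example: two instances, three labels, perfectly ranked scores.
Y = np.array([[1, -1, -1], [1, 1, -1]])
F_good = np.array([[0.9, 0.1, 0.2], [0.8, 0.7, 0.1]])
rkl, auc = ranking_loss(F_good, Y), average_auc(F_good, Y)
```

Labels with no positive (or no negative) instances are skipped, which mirrors the observation in the paper that such metrics are ill-defined for labels with very few positive instances.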
Table 2: Performance (mean±std) on the test sets with full labels.

Dataset  Measure  BR  MLLOC  LEML  MLLRC  GLOCAL

Arts  Rkl (↓)  0.201±0.005  0.177±0.013  0.170±0.005  0.157±0.002  0.138±0.002
  Auc (↑)  0.799±0.006  0.823±0.013  0.833±0.005  0.843±0.001  0.846±0.005
  Cvg (↓)  7.347±0.196  6.762±0.344  6.337±0.243  5.529±0.037  5.347±0.146
  Ap (↑)  0.594±0.006  0.606±0.006  0.590±0.005  0.600±0.007  0.619±0.005
Business  Rkl (↓)  0.072±0.005  0.055±0.009  0.056±0.005  0.044±0.002  0.044±0.002
  Auc (↑)  0.928±0.005  0.944±0.008  0.945±0.005  0.950±0.005  0.955±0.003
  Cvg (↓)  4.087±0.268  3.265±0.464  3.187±0.270  2.560±0.059  2.559±0.169
  Ap (↑)  0.861±0.007  0.878±0.011  0.867±0.007  0.870±0.005  0.883±0.004
Computers  Rkl (↓)  0.146±0.007  0.134±0.014  0.138±0.004  0.107±0.002  0.107±0.002
  Auc (↑)  0.854±0.007  0.866±0.014  0.895±0.002  0.894±0.002  0.895±0.002
  Cvg (↓)  6.654±0.236  6.224±0.480  6.148±0.183  4.893±0.142  4.889±0.058
  Ap (↑)  0.680±0.007  0.689±0.009  0.669±0.007  0.689±0.005  0.698±0.004
Education  Rkl (↓)  0.203±0.010  0.158±0.021  0.145±0.008  0.099±0.002  0.095±0.002
  Auc (↑)  0.797±0.102  0.842±0.022  0.859±0.008  0.868±0.006  0.878±0.006
  Cvg (↓)  8.979±0.487  7.381±0.765  6.711±0.364  4.531±0.104  4.529±0.206
  Ap (↑)  0.580±0.010  0.613±0.004  0.596±0.009  0.600±0.007  0.628±0.009
Entertainment  Rkl (↓)  0.185±0.006  0.146±0.013  0.154±0.005  0.130±0.005  0.108±0.004
  Auc (↑)  0.815±0.006  0.854±0.013  0.852±0.005  0.871±0.003  0.874±0.005
  Cvg (↓)  5.006±0.160  4.293±0.344  4.193±0.139  3.505±0.125  3.114±0.110
  Ap (↑)  0.662±0.009  0.670±0.005  0.647±0.007  0.661±0.012  0.681±0.008
Health  Rkl (↓)  0.113±0.001  0.093±0.005  0.091±0.003  0.071±0.003  0.065±0.002
  Auc (↑)  0.886±0.003  0.907±0.005  0.913±0.004  0.929±0.009  0.923±0.007
  Cvg (↓)  6.193±0.059  5.403±0.157  5.063±0.128  3.751±0.128  3.858±0.131
  Ap (↑)  0.763±0.002  0.777±0.004  0.750±0.003  0.755±0.006  0.782±0.001
Recreation  Rkl (↓)  0.197±0.003  0.184±0.015  0.185±0.001  0.170±0.004  0.155±0.002
  Auc (↑)  0.802±0.003  0.816±0.015  0.822±0.002  0.833±0.004  0.840±0.000
  Cvg (↓)  5.506±0.089  5.268±0.333  5.110±0.040  4.515±0.045  4.431±0.048
  Ap (↑)  0.609±0.005  0.620±0.004  0.595±0.004  0.604±0.003  0.625±0.004
Reference  Rkl (↓)  0.155±0.005  0.138±0.008  0.137±0.004  0.092±0.003  0.086±0.003
  Auc (↑)  0.845±0.005  0.862±0.008  0.872±0.004  0.900±0.006  0.894±0.004
  Cvg (↓)  6.171±0.219  5.514±0.309  5.277±0.171  3.438±0.133  3.387±0.118
  Ap (↑)  0.685±0.005  0.688±0.003  0.667±0.003  0.667±0.007  0.688±0.007
Science  Rkl (↓)  0.197±0.009  0.166±0.017  0.170±0.005  0.131±0.002  0.118±0.003
  Auc (↑)  0.802±0.010  0.834±0.018  0.834±0.005  0.860±0.003  0.853±0.010
  Cvg (↓)  10.189±0.435  8.867±0.751  8.885±0.197  6.704±0.122  6.434±0.137
  Ap (↑)  0.568±0.012  0.581±0.009  0.551±0.008  0.561±0.009  0.580±0.009
Social  Rkl (↓)  0.112±0.001  0.094±0.013  0.106±0.006  0.075±0.005  0.075±0.005
  Auc (↑)  0.888±0.002  0.906±0.013  0.894±0.006  0.917±0.005  0.915±0.005
  Cvg (↓)  6.036±0.125  5.147±0.401  5.521±0.301  4.651±0.102  4.537±0.258
  Ap (↑)  0.724±0.005  0.764±0.008  0.731±0.005  0.719±0.003  0.758±0.008
Society  Rkl (↓)  0.204±0.004  0.182±0.006  0.182±0.007  0.142±0.002  0.136±0.005
  Auc (↑)  0.796±0.005  0.818±0.006  0.822±0.008  0.840±0.006  0.844±0.006
  Cvg (↓)  8.048±0.108  7.392±0.216  7.438±0.162  5.973±0.108  5.852±0.194
  Ap (↑)  0.610±0.007  0.623±0.004  0.599±0.006  0.605±0.006  0.633±0.009
Enron  Rkl (↓)  0.194±0.006  0.169±0.012  0.159±0.005  0.133±0.004  0.125±0.004
  Auc (↑)  0.806±0.006  0.831±0.009  0.851±0.006  0.869±0.004  0.877±0.005
  Cvg (↓)  23.618±0.450  21.724±0.950  18.531±0.707  16.654±0.198  16.737±0.622
  Ap (↑)  0.575±0.006  0.586±0.009  0.600±0.004  0.591±0.004  0.647±0.006
Corel5k  Rkl (↓)  0.271±0.006  0.230±0.012  0.246±0.004  0.170±0.002  0.173±0.005
  Auc (↑)  0.699±0.006  0.757±0.012  0.754±0.005  0.825±0.005  0.827±0.005
  Cvg (↓)  261.99±3.15  201.80±6.71  184.58±1.72  137.31±2.49  136.91±3.21
  Ap (↑)  0.153±0.001  0.182±0.005  0.188±0.004  0.198±0.003  0.200±0.004
Image  Rkl (↓)  0.181±0.011  0.180±0.008  0.181±0.012  0.180±0.009  0.179±0.004
  Auc (↑)  0.812±0.011  0.810±0.012  0.786±0.005  0.748±0.010  0.819±0.009
  Cvg (↓)  1.004±0.050  0.975±0.060  1.000±0.027  1.000±0.019  0.975±0.054
  Ap (↑)  0.788±0.008  0.794±0.010  0.790±0.008  0.790±0.010  0.795±0.007
4.2 Learning with Full Labels
In this experiment, all elements in the training label matrix are observed. Performance on the test data is shown in Table 2. As expected, BR is the worst, since it treats each label independently without considering label correlations. MLLOC considers only local label correlations, and LEML only makes use of the low-rank structure. Though MLLRC takes advantage of both the low-rank structure and label correlations, only global label correlations are considered. GLOCAL, which models both global and local label correlations, is the best overall.
To show example correlations learned by GLOCAL, we use two local groups extracted from the Image dataset. Figure 1 shows that local label correlations do vary from group to group, and differ from the global correlations. For group 1, "sunset" is highly correlated with "desert" and "sea" (Figure 1(c)). This can also be seen from the images in Figure 1(a). Moreover, "trees" sometimes co-occurs with "desert" (first and last images in Figure 1(a)). However, in group 2 (Figure 1(d)), "mountain" and "sea" often occur together, and "trees" occurs less often with "desert" (Figure 1(b)). Figure 1(e) shows the learned global label correlations: "sea" and "sunset", "mountain" and "trees" are positively correlated, whereas "desert" and "sea", "desert" and "trees" are negatively correlated. All these correlations are consistent with intuition.
To further validate the effectiveness of global and local label correlations, we study two degenerate versions of GLOCAL: (i) GLObal, which uses only global label correlations; and (ii) loCAL, which uses only local label correlations. Note that the local groups obtained by clustering are not of equal sizes. For some datasets, the largest cluster contains most of the instances, while the small ones each contain only a small fraction. The global correlations are then dominated by the local correlation matrix of the largest cluster (Proposition 1), making the performance difference on the whole test set obscure. Hence, we focus on the performance on the small clusters. As can be seen from Table 3, using only global or only local correlations may be good enough on some datasets (such as Health). On the other hand, considering both types of correlations, as in GLOCAL, achieves comparable or even better performance.
Table 3: Performance (mean±std) of GLOCAL and its degenerate versions on the small clusters.

Dataset  Measure  GLObal  loCAL  GLOCAL  |  Dataset  Measure  GLObal  loCAL  GLOCAL

Art  Rkl (↓)  0.137±0.003  0.137±0.002  0.130±0.005  |  Bus  Rkl (↓)  0.040±0.002  0.040±0.002  0.040±0.003
  Auc (↑)  0.863±0.003  0.863±0.002  0.870±0.005  |  Auc (↑)  0.958±0.003  0.958±0.003  0.958±0.003
  Cvg (↓)  5.286±0.046  5.286±0.046  5.197±0.065  |  Cvg (↓)  2.529±0.035  2.528±0.040  2.528±0.040
  Ap (↑)  0.602±0.013  0.602±0.010  0.631±0.011  |  Ap (↑)  0.882±0.002  0.882±0.002  0.886±0.003
Com  Rkl (↓)  0.095±0.002  0.095±0.002  0.092±0.002  |  Edu  Rkl (↓)  0.101±0.002  0.101±0.002  0.097±0.002