Distribution-based Label Space Transformation for Multi-label Learning
Abstract
Multi-label learning problems have manifested themselves in various machine learning applications. The key to successful multi-label learning algorithms lies in the exploration of inter-label correlations, which usually incurs great computational cost. Another notable factor in multi-label learning is that the label vectors are usually extremely sparse, especially when the candidate label vocabulary is very large and only a few instances are assigned to each category. Recently, a label space transformation (LST) framework has been proposed to target these challenges. However, current methods based on LST usually suffer from information loss in the label space dimension reduction process and fail to address the sparsity problem effectively. In this paper, we propose a distribution-based label space transformation (DLST) model. By defining a distribution based on the similarity of label vectors, a more comprehensive label structure can be captured. Then, by minimizing the KL-divergence between the two distributions, the information of the original label space can be approximately preserved in the latent space. Consequently, a multi-label classifier trained on the dense latent codes yields better performance. The leverage of the distribution enables DLST to fill in additional information about the label correlations, which endows DLST with the capability to handle label set sparsity and training data sparsity in multi-label learning problems. With the optimal latent code, a kernel logistic regression function is learned for the mapping from the feature space to the latent space. Then ML-KNN is employed to recover the original label vector from the transformed latent code. Extensive experiments on several benchmark datasets demonstrate that DLST not only achieves high classification performance but is also computationally more efficient.
Multi-label, label propagation, semi-supervised, distribution, KL-divergence.
1 Introduction
Multi-label learning naturally arises in various machine learning tasks such as text mining [30], image classification [4] and annotation [8, 40], and bioinformatics analysis [2]. For example, a document may be annotated with multiple diverse tags, an image may contain multiple object categories, and a gene is usually multi-functional. As a generalization of multi-class learning, multi-label learning allows each instance to be assigned a set of labels rather than a single label. Given its potential applications in a variety of real-world problems such as keyword suggestion [1] and video segmentation [31], multi-label learning, especially multi-label classification, has been extensively studied and is attracting increasing research attention.
Well-established multi-label classification methods roughly follow two lines of research [33, 23, 48], namely Algorithm Adaptation and Problem Transformation. Algorithm adaptation methods adopt certain existing algorithms and adapt them to solve the multi-label classification problem. Representative members of this group include Rank-SVM [13], ML-KNN [47] and instance-based logistic regression [10].
Problem transformation methods, on the other hand, reformulate the multi-label classification problem into one or more sub-learning tasks. Typical examples include Binary Relevance (BR), Classifier Chains (CC) [29, 26], Label Ranking (LR) [15], Label Powerset (LP) [22] and its variants such as Random k-Label Sets [37] and Pruned Problem Transformation [28]. BR decomposes multi-label classification into separate single-label binary classification tasks, one for each label. CC takes label dependency into consideration by constructing a chain of binary classifiers, where each classifier additionally leverages the previous labels as input features. LR reformulates the multi-label classification problem as the task of ranking the candidate labels by relevance and determining a relevance threshold. LP reduces multi-label classification to multi-class classification by treating each observed label set as a distinct multi-class label. The problem transformation approaches are more flexible, since any algorithm for the transformed task can be used to solve multi-label classification problems.
More recently, research on multi-label classification generally falls into two learning paradigms [33, 23]. The first is the structured output learning paradigm, which focuses on modeling the label structure and exploiting inter-label correlations, then using them to predict the label vector for test instances [16, 23]. Label correlations are typically encoded in a graph structure, such as a Chow-Liu tree [11] or a maximum spanning tree [24], and the conditional label structure can be approximately learned via structured support vector machines and Markov random fields. To further assist the label structure learning, [19, 21, 23] explicitly incorporate feature information into the learning process. However, such label structure learning methods are computationally expensive.
The second paradigm employs label space reduction [49], which encodes the original label space into a low-dimensional latent space either through random projection [51], canonical correlation analysis (CCA) based projection, or by directly learning the projected codes [25]. Prediction is then performed in the low-dimensional latent space, and the results are translated back to the original label space via a decoding process, so that the original labels for test instances can be recovered. Moreover, algorithms with both label space and feature space dimension reduction have been proposed, such as conditional principal label space transformation [9]. In addition, [20, 45] take a more direct approach by formulating label prediction as learning a low-rank linear mapping from the feature space to the label space. However, these methods usually suffer from information loss and depend heavily on the reduced dimension of the latent space.
In recent years, the proliferation of labels poses a great challenge to existing multi-label learning methods. Due to the large label vocabulary, the label vectors are usually characterized by high dimensionality and remarkable sparsity. The sparsity originates from two sources: (1) for each instance, only a small number of labels are present, namely, the label vector has small support (Sparsity I); (2) for a certain set of labels, very few training instances are assigned to it (Sparsity II). In the rest of this paper, we refer to these two types of sparsity as label set sparsity and training data sparsity, respectively. Although label space dimension reduction approaches aim to address this problem, the performance obtained in the reduced latent space is not satisfactory in terms of prediction accuracy and computational cost. The reason is that the dimension reduction process in these methods may incur information loss, so that the original inter-label correlations are not fully preserved in the latent space.
Following the general label space transformation framework proposed in [33], we propose a novel distribution-based label space transformation (DLST) model. By aligning the distribution of the original label space with that of the latent space, an optimal transformed code can be learned for each label vector. The advantages of employing distribution alignment in label space transformation lie in two aspects: (1) in contrast to conventional approaches, which lose information, extra information beyond the original label vector can actually be filled into the latent code according to the distribution, so that more complicated label correlations can be captured and approximated in the latent code; (2) in the face of label set sparsity and training data sparsity, the proposed model is still able to recover the whole distribution. It has been empirically verified that the number of labels per instance required for DLST to obtain its highest score is far less than that of the competing baselines. A similar phenomenon is also observed for the number of training instances per class required to achieve the best performance.
As shown in Figure 1, the latent codes derived by the proposed method are much denser than the original label vectors while preserving the distribution of the original label space. Therefore, it can be expected that the classification performance using the transformed latent codes will be significantly better than that using the original sparse labels.
In the training phase, a regression function is learned to map the original data in the feature space to the transformed codes in the latent space. For each test instance, the corresponding latent code can be computed by applying the regression function. Then, ML-KNN [47] is employed as the decoder to recover the original label vector from the latent code of each instance. The proposed model can also be extended with kernel tricks to deal with non-linear regression from the original features to the latent codes.
The proposed model is capable of tackling both label set sparsity and training data sparsity, whose real-world manifestations are missing labels and limited training data. The performance of DLST is relatively stable across varying missing ratios and training data ratios. This can be attributed to the distribution used in DLST: rather than being limited to the given label vectors, DLST captures the whole distribution of the label space by fitting the observed label vectors with a distribution of maximal variance, and transmits the distribution to the latent space.
The contributions of this paper are:

The proposed method takes advantage of the distribution to capture more comprehensive inter-label correlations in the original label space and transmits them to the latent space.

The dense latent code learned by DLST successfully addresses label set sparsity by distinguishing concurrent label patterns from most other unrelated labels.

The proposed model effectively alleviates the requirement on training data size to achieve high multilabel classification performance.
2 A Distribution-based Multi-Label Learning Framework
2.1 Preliminaries
Let $\mathcal{D}=\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^{n}$ denote the set of labeled training data, where each instance $\mathbf{x}_i\in\mathbb{R}^{d}$ is associated with a subset of the $K$ possible labels represented by a binary vector $\mathbf{y}_i\in\{0,1\}^{K}$; $y_{ij}=1$ when $\mathbf{x}_i$ is assigned the $j$th label and $y_{ij}=0$ otherwise. The goal of multi-label classification is to learn a mapping from the feature space to the label space for predicting the label vector of each test instance. For convenience of notation, the feature vectors and label vectors of the training instances are arranged in rows to form the input feature matrix $X\in\mathbb{R}^{n\times d}$ and the label matrix $Y\in\{0,1\}^{n\times K}$.
Traditional multi-label classification methods aim to learn a binary classifier $f_j$ for each dimension $j\in\{1,\dots,K\}$ of the label vector. When the number of labels is large, it becomes computationally prohibitive for these methods to predict the label vector.
To tackle this challenge, a novel label space transformation learning framework was proposed, where each label vector $\mathbf{y}_i$ is first encoded into a point $\mathbf{z}_i$ in an $m$-dimensional latent space through an encoding process. The latent codes are likewise stacked in rows to form a matrix $Z\in\mathbb{R}^{n\times m}$. Then, the multi-label classification problem on $Y$ turns into a multi-dimensional regression problem on $Z$. After obtaining a regression function with high prediction accuracy on $Z$, the framework maps $Z$ back to the original label space via some decoder. The whole learning process is illustrated in Figure 2. Note that, with the transformed latent codes $Z$, the regression step is open to be replaced by any mapping algorithm from data features to multi-label vectors.
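To make the encode-regress-decode pipeline concrete, the sketch below instantiates it with a truncated-SVD encoder in the spirit of PLST [33] and a ridge regressor; these stand-ins, the toy data and all names are illustrative assumptions only, and DLST replaces the encoder and regressor with the distribution-based encoding of Section 2.2 and the kernel logistic regression of Section 2.3.

```python
import numpy as np

def encode(Y, r):
    """Encode: project label vectors onto their top r principal directions (PLST-style stand-in)."""
    y_mean = Y.mean(0)
    _, _, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
    V = Vt[:r].T                                    # K x r projection matrix
    return (Y - y_mean) @ V, V, y_mean

def fit_regressor(X, Z, lam=1e-2):
    """Regress: map features to latent codes with ridge regression (simple stand-in)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Z)

def decode(Z_pred, V, y_mean, thr=0.5):
    """Decode: map predicted latent codes back to the label space and threshold."""
    return (Z_pred @ V.T + y_mean > thr).astype(int)

# Toy usage of the encode -> regress -> decode flow.
rng = np.random.RandomState(0)
X = rng.randn(100, 20)                              # feature matrix (n x d)
Y = (rng.rand(100, 8) > 0.8).astype(int)            # sparse label matrix (n x K)
Z, V, y_mean = encode(Y, r=4)
W = fit_regressor(X, Z)
print(decode(X[:5] @ W, V, y_mean).shape)           # (5, 8): recovered label vectors
```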
2.2 Label Space Transformation
In this section, we propose an implicit encoding [25] module that aligns the distribution of the label space with that of the latent space. Traditional multi-label learning methods explicitly model the inter-label correlations [3, 5, 24, 34, 46, 52, 41], but the order of the correlations they investigate is no more than three, limited by computational complexity. Moreover, the correlations captured in existing models are either local or global, and are biased towards similarity between labels while dissimilarity is mostly ignored.
In contrast, our model provides a more comprehensive description of the correlations between labels by employing a distribution. First, the similarities in the label space are transformed into a probability distribution, which is then approximated by the distribution of the latent codes in a reduced space. Based on this distribution alignment, a new label representation can be learned which fills in the original label distribution information. As a result, the proposed model is not limited by the number of given label vectors or their completeness, which greatly expands its flexibility.
To derive the distribution of label vectors for the labeled training instances, first define $p_{ij}$ as the probability of observing the similarity between label vectors $\mathbf{y}_i$ and $\mathbf{y}_j$ among all pairs of labeled training instances. Following t-SNE [38], we utilize a Student t-distribution with one degree of freedom to transform the Euclidean distances into probabilities, as shown below
$$p_{ij} = \frac{\left(1+\|\mathbf{y}_i-\mathbf{y}_j\|^{2}\right)^{-1}}{\sum_{k\neq l}\left(1+\|\mathbf{y}_k-\mathbf{y}_l\|^{2}\right)^{-1}} \qquad (1)$$
Let $q_{ij}$ denote the distribution of instances with to-be-learnt representations $\mathbf{z}_i\in\mathbb{R}^{m}$ in the latent space; then $q_{ij}$ can be calculated as follows
$$q_{ij} = \frac{\left(1+\|\mathbf{z}_i-\mathbf{z}_j\|^{2}\right)^{-1}}{\sum_{k\neq l}\left(1+\|\mathbf{z}_k-\mathbf{z}_l\|^{2}\right)^{-1}} \qquad (2)$$
where $m$ is the dimension of the latent space.
To approximate the label-space distribution $P=\{p_{ij}\}$ with the latent-space distribution $Q=\{q_{ij}\}$ in the reduced space, we adopt the Kullback-Leibler divergence to measure the discrepancy between the two distributions, which can be formulated as
$$\min_{Z}\; KL(P\,\|\,Q) = \sum_{i}\sum_{j} p_{ij}\log\frac{p_{ij}}{q_{ij}} \qquad (3)$$
This KL-divergence-based distribution alignment technique considers the comprehensive correlations among the label vectors. The underlying assumption is that instances with highly correlated label vectors tend to have high similarity in the input data space. Therefore, instances with the same labels tend to be drawn much closer in the latent space.
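To make the encoding objective concrete, the following sketch computes the label-space distribution of Eq. (1), the latent-space distribution of Eq. (2), and the KL-divergence of Eq. (3). It is a minimal NumPy illustration; the function names, the toy data and the exclusion of self-pairs from the normalization are our own choices rather than details of the authors' implementation.

```python
import numpy as np

def pairwise_student_t(V):
    """Student-t (one degree of freedom) affinities normalized over all pairs, as in Eqs. (1)-(2)."""
    sq_dists = np.square(V[:, None, :] - V[None, :, :]).sum(-1)   # pairwise squared Euclidean distances
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)                                    # self-pairs excluded from the normalization
    return inv / inv.sum()

def kl_objective(Y, Z, eps=1e-12):
    """KL(P || Q) between label-space distribution P and latent-space distribution Q (Eq. (3))."""
    P = pairwise_student_t(Y)   # distribution over label-vector pairs
    Q = pairwise_student_t(Z)   # distribution over latent-code pairs
    mask = P > 0
    return float(np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))))

# Toy usage: 5 instances, 4 candidate labels, 2-dimensional latent codes.
Y = np.array([[1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]], dtype=float)
Z = np.random.RandomState(0).randn(5, 2)
print(kl_objective(Y, Z))
```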
2.3 Feature Regression
With the optimal latent representation $Z$, the original multi-label classification problem on $Y$ converts to a multi-dimensional regression problem on $Z$. The mapping function is open to any effective multi-label prediction model, such as linear regression, ridge regression or logistic regression. More generally, any algorithm that learns a mapping from data features to multi-label vectors can be exploited here, with boosted performance compared to the original algorithm.
In this paper, we use kernel logistic regression to learn the mapping from features to latent codes. The reasons are twofold: on one hand, logistic regression outputs not only the predictions but also the corresponding probabilities; on the other hand, it can easily be extended to a kernelized version in which non-linear mappings are included.
In kernel logistic regression, each instance $\mathbf{x}_i$ is mapped into a Reproducing Kernel Hilbert Space (RKHS) as $\phi(\mathbf{x}_i)$, which together form a kernel feature matrix $\Phi$. In the RKHS, the inner product between kernel features can be efficiently calculated with the kernel trick $\langle\phi(\mathbf{x}_i),\phi(\mathbf{x}_j)\rangle=\kappa(\mathbf{x}_i,\mathbf{x}_j)$, where $\kappa(\cdot,\cdot)$ is the introduced kernel function. With non-linear kernel functions, the linear mappings from kernel features to latent codes are actually non-linear mappings from the original feature space to the latent space. In this paper, we treat each dimension of the latent code separately and learn a linear mapping $\mathbf{w}_k$ in the RKHS for the $k$th dimension. The objective function of kernel logistic regression is as follows
$$\min_{\mathbf{w}_k}\;\sum_{i=1}^{n}\log\left(1+\exp\left(-z_{ik}\,\mathbf{w}_k^{\top}\phi(\mathbf{x}_i)\right)\right)+\lambda\|\mathbf{w}_k\|^{2} \qquad (4)$$
where $z_{ik}$ is the $k$th entry of $\mathbf{z}_i$, and $\lambda$ is a weighting parameter.
Following the common practice in the literature, let $\mathbf{w}_k$ fall in the span of the kernel features of the training instances, i.e., $\mathbf{w}_k=\sum_{i=1}^{n}a_{ik}\phi(\mathbf{x}_i)$ with $a_{ik}$ as the spanning coefficients. Then, in Eq. (4), $\mathbf{w}_k^{\top}\phi(\mathbf{x}_i)=\sum_{j=1}^{n}a_{jk}\kappa(\mathbf{x}_j,\mathbf{x}_i)$. It can be seen that the training cost of kernel logistic regression grows with the training set size $n$, which is undesirable for large-scale datasets.
Note that not all training instances are required to form the span, as redundancy may exist between the kernel features of training instances. Therefore, we only sample a small subset of them to build the kernel feature matrix and use it as the basis to span the $k$th mapping. Hence, $\mathbf{w}_k$ is expressed as a combination of the sampled kernel features, where only the coefficients over the sampled instances need to be learned and the sampling size is much smaller than $n$. The training cost of kernel logistic regression can thus be greatly reduced, making it more efficient for training as well as prediction. The specific sampling strategy can be random sampling or another more sophisticated method.
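The sampled-basis idea can be sketched as follows. To keep the sketch in closed form, a squared (ridge) loss is substituted for the paper's logistic loss; the RBF kernel, the random sampling of the basis instances and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.square(A[:, None, :] - B[None, :, :]).sum(-1)
    return np.exp(-gamma * sq)

def fit_sampled_basis_regression(X, Z, n_basis=50, lam=1e-2, gamma=1.0, seed=0):
    """
    Learn a kernelized mapping from features X (n x d) to latent codes Z (n x r)
    using only n_basis randomly sampled training instances as the basis of the span.
    Minimizes ||K_nb A - Z||^2 + lam * tr(A^T K_bb A), which has a closed form.
    """
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=min(n_basis, len(X)), replace=False)
    B = X[idx]                               # sampled basis instances
    K_nb = rbf_kernel(X, B, gamma)           # n x n_basis kernel features
    K_bb = rbf_kernel(B, B, gamma)           # regularizer in the RKHS norm
    A = np.linalg.solve(K_nb.T @ K_nb + lam * K_bb + 1e-8 * np.eye(len(B)), K_nb.T @ Z)
    return B, A

def predict_latent(X_test, B, A, gamma=1.0):
    """Map test features to latent codes through the sampled kernel basis."""
    return rbf_kernel(X_test, B, gamma) @ A

# Toy usage: the fitting cost scales with the basis size instead of the full training set.
X = np.random.RandomState(1).randn(200, 10)
Z = np.random.RandomState(2).randn(200, 3)
B, A = fit_sampled_basis_regression(X, Z, n_basis=30)
print(predict_latent(X[:5], B, A).shape)     # (5, 3)
```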
2.4 MultiLabel Prediction
For a test instance $\mathbf{x}_t$, based on the learned regression function, the $k$th dimension of the latent code for $\mathbf{x}_t$ can be predicted. In addition, the corresponding probability can be obtained as follows
$$P\left(z_{tk}\mid\mathbf{x}_t\right)=\frac{1}{1+\exp\left(-\mathbf{w}_k^{\top}\phi(\mathbf{x}_t)\right)} \qquad (5)$$
To obtain the label vector for each test instance $\mathbf{x}_t$, the latent code learned in the previous step needs to be further mapped back to the original label space through some decoder. In this paper, ML-KNN is employed to recover the original labels for test instances.
For each test instance $\mathbf{x}_t$, ML-KNN first identifies its nearest neighbors in the training set. Then, based on the label sets of these neighbors, the label vector for $\mathbf{x}_t$ can be determined using the following maximum a posteriori principle
$$\mathbf{y}_t^{j}=\arg\max_{b\in\{0,1\}} P\!\left(H_b^{j}\mid E_{C_j}^{j}\right) \qquad (6)$$
where $H_1^{j}$ indicates that instance $\mathbf{x}_t$ has the $j$th label, while $H_0^{j}$ denotes that $\mathbf{x}_t$ is not assigned the $j$th label. $E_{C_j}^{j}$ denotes the event that, among the nearest neighbors of $\mathbf{x}_t$, there are exactly $C_j$ instances assigned the $j$th label, where $C_j$ is obtained by counting the labels of the neighbors. Using the Bayes rule, Eq. (6) is equivalent to the following objective function
$$\mathbf{y}_t^{j}=\arg\max_{b\in\{0,1\}} P\!\left(H_b^{j}\right)P\!\left(E_{C_j}^{j}\mid H_b^{j}\right) \qquad (7)$$
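A self-contained sketch of the ML-KNN decoder described by Eqs. (6) and (7) is given below, assuming a Euclidean nearest-neighbor search over the latent codes and Laplace smoothing with constant s; the variable names, the neighborhood size and the toy data are illustrative choices of ours.

```python
import numpy as np

def mlknn_fit(Z_train, Y_train, k=10, s=1.0):
    """Estimate the ML-KNN priors P(H_b) and likelihoods P(E_c | H_b) with Laplace smoothing s."""
    n, L = Y_train.shape
    prior1 = (s + Y_train.sum(0)) / (2.0 * s + n)                  # P(H_1) per label
    d = np.square(Z_train[:, None, :] - Z_train[None, :, :]).sum(-1)
    np.fill_diagonal(d, np.inf)                                    # exclude each instance from its own neighbors
    nn = np.argsort(d, axis=1)[:, :k]
    counts = Y_train[nn].sum(1).astype(int)                        # neighbor counts per instance and label
    like1 = np.zeros((L, k + 1)); like0 = np.zeros((L, k + 1))
    for l in range(L):
        n_pos = Y_train[:, l].sum()
        for c in range(k + 1):
            like1[l, c] = (s + np.sum((Y_train[:, l] == 1) & (counts[:, l] == c))) / (s * (k + 1) + n_pos)
            like0[l, c] = (s + np.sum((Y_train[:, l] == 0) & (counts[:, l] == c))) / (s * (k + 1) + n - n_pos)
    return dict(Z=Z_train, Y=Y_train, k=k, prior1=prior1, like1=like1, like0=like0)

def mlknn_predict(model, z):
    """MAP decision of Eq. (7) for one test latent code z."""
    d = np.square(model["Z"] - z).sum(-1)
    nn = np.argsort(d)[: model["k"]]
    c = model["Y"][nn].sum(0).astype(int)                          # neighbor label counts for the test instance
    L = model["Y"].shape[1]
    post1 = model["prior1"] * model["like1"][np.arange(L), c]
    post0 = (1 - model["prior1"]) * model["like0"][np.arange(L), c]
    return (post1 >= post0).astype(int)

# Toy usage: decode a latent code back to a binary label vector.
rng = np.random.RandomState(0)
Z_tr, Y_tr = rng.randn(50, 3), (rng.rand(50, 4) > 0.7).astype(int)
model = mlknn_fit(Z_tr, Y_tr, k=5)
print(mlknn_predict(model, rng.randn(3)))
```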
2.5 Optimization
The objective function of problem (3) is non-convex, thus only a local optimum can be obtained. Since problem (3) is an unconstrained optimization problem, to learn a locally optimal $Z$ we exploit gradient-descent-based optimization methods. From Eq. (3), we have
$$KL(P\,\|\,Q)=\sum_{i}\sum_{j}p_{ij}\log p_{ij}-\sum_{i}\sum_{j}p_{ij}\log q_{ij} \qquad (8)$$
Since $P$ solely depends on the labels of the training data and remains fixed during the optimization procedure, problem (8) can be reduced to
$$\min_{Z}\;-\sum_{i}\sum_{j}p_{ij}\log q_{ij} \qquad (9)$$
whose gradient with respect to each latent code $\mathbf{z}_i$ is
$$\frac{\partial KL(P\,\|\,Q)}{\partial\mathbf{z}_i}=4\sum_{j}\left(p_{ij}-q_{ij}\right)\left(\mathbf{z}_i-\mathbf{z}_j\right)\left(1+\|\mathbf{z}_i-\mathbf{z}_j\|^{2}\right)^{-1} \qquad (10)$$
With the gradients calculated in Eq. (10), effective gradient-descent-based optimization methods can be applied to derive the optimal $Z$. The update strategy of $Z$ is as follows
$$Z^{(t)}=Z^{(t-1)}-\eta\,\frac{\partial KL(P\,\|\,Q)}{\partial Z}+\alpha(t)\left(Z^{(t-1)}-Z^{(t-2)}\right) \qquad (11)$$
where $Z^{(t)}$ denotes the solution at the $t$th iteration, $\eta$ is the learning rate, and $\alpha(t)$ is the momentum at the $t$th iteration. The algorithm stops when the change in the objective falls below a tolerance or when a maximum iteration number is reached. The details of the encoding algorithm are presented in Algorithm 1, and Algorithm 2 summarizes the whole procedure of DLST. The complexity of the proposed algorithm depends on the number of labeled training data $n$ and on the sampling size used for kernel logistic regression.
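The momentum update of Eq. (11) can be sketched as follows, in the spirit of Algorithm 1; the learning rate, momentum coefficient, initialization scale and stopping tolerance are illustrative values of ours, not the settings used in the paper.

```python
import numpy as np

def student_t_q(Z):
    """Unnormalized and normalized Student-t affinities of the latent codes (Eq. (2))."""
    sq = np.square(Z[:, None, :] - Z[None, :, :]).sum(-1)
    inv = 1.0 / (1.0 + sq)
    np.fill_diagonal(inv, 0.0)
    return inv, inv / inv.sum()

def encode_labels(P, r=2, eta=10.0, alpha=0.5, max_iter=500, tol=1e-7, seed=0):
    """
    Gradient descent with momentum for min_Z KL(P || Q), following Eqs. (10)-(11).
    P is the fixed label-space distribution; r is the latent dimension.
    """
    n = P.shape[0]
    Z = 1e-2 * np.random.RandomState(seed).randn(n, r)
    velocity, prev_kl = np.zeros_like(Z), np.inf
    for _ in range(max_iter):
        inv, Q = student_t_q(Z)
        W = (P - Q) * inv
        grad = 4.0 * ((np.diag(W.sum(1)) - W) @ Z)   # Laplacian form of sum_j w_ij (z_i - z_j)
        velocity = alpha * velocity - eta * grad     # momentum step of Eq. (11)
        Z = Z + velocity
        kl = np.sum(P[P > 0] * np.log(P[P > 0] / np.maximum(Q[P > 0], 1e-12)))
        if abs(prev_kl - kl) < tol:
            break
        prev_kl = kl
    return Z

# Toy usage: build P from a few label vectors with the same Student-t construction.
Y = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 1]], dtype=float)
_, P = student_t_q(Y)
print(encode_labels(P, r=2).shape)   # (4, 2)
```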
Dataset  type  n (#instances)  d (#features)  K (#labels)  card (label cardinality)
Scene  image  2,407  294  6  1.074 
Emotions  music  593  72  6  1.869 
Yeast  biology  2,417  103  14  4.237 
Mediamill  video  43,907  120  101  4.376 
MSRC  image  591  512  23  2.508 
SUN attribute  image  14,240  512  102  15.526
3 Experiments
In this section, we demonstrate the effectiveness of the proposed algorithm on six benchmark multi-label datasets: Scene [4] with scene classes such as mountain, beach and field; Emotions [35] with music emotion labels; Yeast [13] with gene functional categories (e.g., metabolism, energy); Mediamill [32] with semantic concept labels (e.g., military, desert and basketball); MSRC [43] with image class labels; and SUN attribute [27] with image class labels.
The data for the first four datasets can be directly downloaded from the Mulan website.
The statistical information of the datasets used in the experiments is summarized in Table 1. From Table 1, we can observe that both label set sparsity (indicated by the label cardinality) and training data sparsity are significant in these datasets. For example, each instance in the Scene dataset has an average of only 1.074 labels out of all 6 candidate labels, and each observed set of labels in the Mediamill dataset is occupied by only a tiny fraction of the total number of training instances.
Experimental results show that the proposed approach outperforms state-of-the-art methods on all six datasets and manifests strong generalization ability across different types of labels. Analysis of the transformed latent codes demonstrates that our approach can effectively preserve the distribution of the original label vectors while alleviating the sparsity problem.
3.1 Compared Methods and Evaluation Metrics
Compared Methods. To validate the performance of our proposed DLST, we compare it with the following representative and related multi-label learning algorithms:

BR [36]: Binary Relevance.

CPLST [9]: Conditional Principal Label Space Transformation.

FAIE [25]: Feature-aware Implicit Label Space Encoding.

ML-LOC [18]: Multi-Label Learning using Local Correlation.

MC [6]: Matrix Completion.

MIML [39]: Mutual Information for Multi-Label Classification.

SLRM [20]: Semi-supervised Low-Rank Mapping.

MRV [50]: Manifold Regularized Vector-valued Multi-Label Learning.

LEML [45]: Large Scale Empirical Risk Minimization Method with Missing Labels.
BR is the baseline method, where each label is treated as an independent binary classification problem. CPLST, FAIE and ML-LOC are label space reduction methods, which only use labeled instances as the training set. MC, MIML and SLRM utilize both labeled and unlabeled instances for training. Meanwhile, SLRM, MRV and LEML are specifically designed for multi-label learning with missing labels. In the experiments, we adopt LIBSVM [7] as the binary classifier for BR. In the learning stage, both CPLST and FAIE are coupled with linear regression for multi-label prediction. Unless otherwise specified, we set the parameters of the comparing methods to the values suggested by their authors in the original papers or codes. As for DLST, its parameter is tuned by 10-fold cross-validation over a range of candidate values; DLST yields stable performance around the selected value, which is used in the following experiments.
Evaluation Metrics. Performance evaluation for multi-label classification can be complicated, since each instance is associated with a set of labels rather than a single one. Various metrics have been proposed based on the prediction likelihood with respect to each label, among which we adopt three widely-used evaluation metrics, Average Precision, Micro F1 and Macro F1, to quantitatively compare the performance of the multi-label classification methods.
Average Precision (AP) evaluates the average fraction of relevant labels ranked ahead of a particular label. The larger the value of AP, the better the performance. Its formal definition can be found in [44].
Micro F1 and Macro F1 evaluate the micro-average and macro-average of the harmonic mean of precision and recall, respectively. As micro-averaging and macro-averaging require binary indicator vectors, we consider the labels corresponding to the $k$ largest entries of the predicted vector as the predicted labels of each instance, where $k$ is set to the average number of labels per instance, i.e., the label cardinality of each dataset reported in Table 1. The larger the values of Micro F1 and Macro F1, the better the performance. Their formal definitions can be found in [37].
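The top-$k$ binarization used before computing Micro F1 and Macro F1 can be sketched as below; the use of scikit-learn's f1_score and the toy scores are our own illustrative choices.

```python
import numpy as np
from sklearn.metrics import f1_score

def topk_binarize(scores, k):
    """Keep the k largest entries of each predicted score vector as the positive labels."""
    Y_pred = np.zeros_like(scores, dtype=int)
    top = np.argsort(-scores, axis=1)[:, :k]
    np.put_along_axis(Y_pred, top, 1, axis=1)
    return Y_pred

# Toy usage: k would be set from the label cardinality of the dataset (Table 1).
scores = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]])
Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = topk_binarize(scores, k=2)
print(f1_score(Y_true, Y_pred, average="micro"), f1_score(Y_true, Y_pred, average="macro"))
```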
Dataset  BR  CPLST  FAIE  ML-LOC  MC  MIML  SLRM  MRV  LEML  DLST

Average Precision  
Scene  0.4306  0.4492  0.4501  0.4327  0.4580  0.4888  0.5082  0.5125  0.4718  0.5365 
Emotions  0.2734  0.2958  0.3012  0.3146  0.3068  0.3235  0.3487  0.3574  0.3128  0.3864 
Yeast  0.3225  0.3364  0.3425  0.3389  0.3567  0.3842  0.4005  0.4082  0.3620  0.4312 
Mediamill  0.4086  0.4264  0.4265  0.4326  0.4509  0.4465  0.4691  0.4653  0.4324  0.4980 
MSRC  0.3145  0.3281  0.3346  0.2070  0.2353  0.2801  0.3864  0.3725  0.3376  0.4016 
SUN attribute  0.2876  0.3009  0.3387  0.2876  0.3052  0.2954  0.3286  0.3124  0.3092  0.3584
Micro F1  
Scene  0.5987  0.6496  0.6528  0.6630  0.6282  0.6713  0.7022  0.7324  0.6825  0.7642 
Emotions  0.3450  0.3633  0.3428  0.3596  0.3088  0.3694  0.4012  0.5424  0.5280  0.5642 
Yeast  0.4435  0.4520  0.4631  0.4552  0.4328  0.4784  0.4950  0.6413  0.6086  0.6971 
Mediamill  0.4234  0.5785  0.6422  0.6381  0.6273  0.6412  0.6476  0.5283  0.5562  0.6632 
MSRC  0.4383  0.5109  0.5357  0.3692  0.4196  0.5538  0.5890  0.5726  0.3981  0.6235 
SUN attribute  0.4425  0.4605  0.4936  0.4441  0.4670  0.4521  0.5043  0.4631  0.4430  0.4926
Macro F1  
Scene  0.3153  0.3264  0.3358  0.3125  0.3562  0.3458  0.3745  0.3964  0.3692  0.4235 
Emotions  0.1928  0.2034  0.2135  0.2234  0.2580  0.2542  0.2718  0.2826  0.2984  0.3260 
Yeast  0.2436  0.2580  0.2624  0.2578  0.2842  0.2673  0.3016  0.3245  0.3326  0.3794 
Mediamill  0.1150  0.0982  0.1302  0.1399  0.1269  0.1298  0.1413  0.1526  0.1254  0.1738 
MSRC  0.3562  0.3317  0.3467  0.1048  0.2541  0.4083  0.4481  0.4468  0.3575  0.4738 
SUN attribute  0.1842  0.2196  0.2630  0.1923  0.2507  0.2852  0.2687  0.2716  0.2535  0.3283
3.2 Experimental Results
Quantitative results on all six datasets under the three evaluation metrics are presented in Table 2. From Table 2, we can see that the proposed DLST performs better than, or comparably to, the other nine state-of-the-art and baseline methods across all 18 configurations (6 datasets × 3 evaluation metrics). The superior performance of DLST across all three evaluation measures justifies our motivation of exploiting label distribution preservation during label transformation. In the following, we present a more detailed comparison between DLST and the other three categories of multi-label learning methods.
DLST outperforms the label space reduction methods (CPLST, FAIE, ML-LOC) by a clear margin on the six datasets measured by average precision. This advantage demonstrates that DLST learns a higher-quality latent code than the other three baselines in terms of approximating the original label space. Moreover, the latent space learned by DLST improves upon the original label space by revealing comprehensive label correlations, thus alleviating the sparsity problem present in multi-label classification.
Moreover, DLST shows better performance than the semi-supervised multi-label classification methods (MC, MIML, SLRM). The three semi-supervised baselines utilize abundant unlabeled data in the training process, which is believed to boost performance. However, these methods require a large amount of training data to perform well. In contrast, DLST demonstrates comparable or even better performance with only 10% of the training data used by the comparing semi-supervised baselines. The performance gain can be explained by the distribution employed in DLST, which can estimate and fill in the comprehensive label correlations given only a limited number of labeled training instances.
Compared with the baselines specifically targeting missing labels (SLRM, MRV, LEML), DLST almost always outperforms them on the six datasets with respect to the average precision criterion. These results corroborate the effectiveness of DLST in exploiting inter-label correlations. In subsection 3.5, we further compare the performance of these methods under a varying number of labels for each instance.
Dataset  type  n (#instances)  d (#features)  K (#labels)  card (label cardinality)

Pascal VOC  image  9,963  2,048  20  1.560 
NUS-WIDE  image  269,648  500  81  1.869
3.3 The Benefit of Latent Code of DLST
In this subsection, to further study the superiority of DLST in learning a dense latent code, we conduct another set of experiments on two variants of DLST: one directly learns a regression function from the feature space to the original label space, while the other predicts the original labels of test instances by applying ML-KNN to the feature vectors of the training instances. The performance of these methods is evaluated on three datasets: Scene, Emotions and Yeast. As in the previous experimental settings, the training and test subsets provided with each dataset are adopted. The mean value and standard deviation of DLST and its two variants under the three evaluation metrics are recorded in Table 4.
From Table 4, we can see that DLST outperforms the two variants by clear margins on the Scene dataset under average precision. This result verifies that sparsity in the original label space significantly deteriorates multi-label learning performance. It also suggests that the dense latent code learned by DLST is more effective in capturing inter-label correlations and more informative for predicting the labels of multi-label instances.
Figure 4 shows the regression labels and nearest-neighbor labels for images in the MSRC dataset. Regression labels are produced by the regression variant of DLST, while nearest-neighbor labels are obtained with ML-KNN. The difference between the two kinds of nearest-neighbor labels lies in the space where the nearest-neighbor search takes place: the former is performed in the original feature space, while the latter occurs in the transformed latent space. From Figure 4, it can be observed that DLST produces the most comprehensive label sets for multi-label images. The reason is that the latent space derived by DLST captures the whole distribution of relative distances between any pair of labels, so the probability of co-occurrence between labels can be predicted more delicately.
Methods  Scene  Emotions  Yeast
Average Precision
DLST variant 1  0.446 ± 0.005  0.297 ± 0.006  0.334
DLST variant 2  0.453 ± 0.020  0.304 ± 0.013  0.352 ± 0.009
DLST  0.536 ± 0.001  0.386 ± 0.002  0.431 ± 0.004
Micro F1
DLST variant 1  0.661 ± 0.008  0.465 ± 0.006  0.525 ± 0.003
DLST variant 2  0.690 ± 0.016  0.480 ± 0.014  0.594 ± 0.008
DLST  0.763 ± 0.004  0.564 ± 0.003  0.697 ± 0.002
Macro F1
DLST variant 1  0.306 ± 0.004  0.204 ± 0.008  0.253 ± 0.005
DLST variant 2  0.314 ± 0.012  0.218 ± 0.005  0.264 ± 0.008
DLST  0.423 ± 0.002  0.326 ± 0.001  0.379 ± 0.003
Methods  plane  bicycle  bird  boat  bottle  bus  car  cat  chair  cow  table  dog  horse  motor  person  plant  sheep  sofa  train  tv  mAP 
labels  1.2  1.9  1.1  1.4  2.4  2.0  1.7  1.4  2.5  1.4  2.8  1.6  1.9  1.9  2.0  2.3  1.3  2.4  1.3  2.2   
samples  445  505  622  364  502  380  1536  676  1117  273  510  863  573  482  4192  527  195  727  522  534   
HCP-1000C  95.1  90.1  92.8  89.9  51.5  80.0  91.7  91.6  57.7  77.8  70.9  89.3  89.3  85.2  93.0  64.0  85.7  62.7  94.4  78.3  81.5
CNN-RNN  96.7  83.1  94.2  92.8  61.2  82.1  89.1  94.2  64.2  83.6  70.0  92.4  91.7  84.2  93.7  59.8  93.2  75.3  99.7  78.6  84.0
DLST (variant)  95.5  93.1  92.4  91.8  90.2  83.4  73.6  85.4  67.5  85.3  84.2  80.4  84.6  84.2  40.8  85.2  84.7  83.2  82.8  74.6  79.6
DLST  96.8  95.2  93.4  96.3  95.1  96.4  84.2  93.3  88.4  97.2  94.8  91.8  94.3  94.8  51.5  94.8  98.2  93.0  94.2  94.6  85.7
3.4 Large Scale Datasets
To evaluate the performance of DLST on large-scale datasets, we additionally employ two multi-label datasets: Pascal VOC 2007 [14] and NUS-WIDE [12].
The Pascal Visual Object Classes Challenge (VOC) datasets have been widely used as benchmarks for multi-label classification. The VOC 2007 dataset contains 9,963 images with 20 labels. Each image in this dataset is represented by a 2,048-dimensional deep CNN feature, generated by ResNet-50 [17] pretrained on the ImageNet database. Images in the train and validation subsets are used as training data, while images in the test subset are utilized as testing data, which gives a training/test split of 5,011/4,952 images.
The NUS-WIDE dataset is a web image dataset containing 269,648 images and their associated tags collected from Flickr. After removing noisy and rare tags, a reduced tag vocabulary remains. The images are further manually annotated with 81 concept groups, e.g., sunset, clouds, beach, mountain and animal, as shown in Figure 3. Bag-of-Words features based on SIFT descriptors are adopted to represent each image in this dataset. A subset of the images is utilized for training and another disjoint subset is employed for testing; all experiments are conducted over 10 random training/test splits of the data, and the average performance is recorded. Table 3 summarizes more detailed characteristics of these two datasets, using the same notation as Table 1.
Since the other baselines cannot deal with such large amounts of data (they run out of memory), we compare the proposed method with two state-of-the-art deep learning methods, HCP-1000C [42] and CNN-RNN [41], as well as a variant of our method. The precision and recall of the predicted labels are employed as evaluation metrics. For each image, the precision is the number of correctly annotated labels divided by the number of generated labels, while the recall is the number of correctly annotated labels divided by the number of ground-truth labels. We additionally compute the per-class precision and mAP for both datasets.
From Table 5, we can observe that DLST achieves higher average precision than the comparing methods on most label classes and attains the highest mAP. Moreover, it is interesting to note that DLST performs best on the class sheep and worst on the class person, which correlates negatively with the number of training instances for each class.
This can be explained from two aspects. On one hand, classes with more training samples tend to have noisy correlations with other labels; as shown in Figure 5, images of person and car are often occluded or dominated by other objects. Therefore, the distances between such a label and all the other labels are relatively similar, leading to an approximately uniform distribution according to DLST, and the discriminative information that helps to identify the class is overwhelmed by the noise present in the large volume of diverse training instances. On the other hand, classes with fewer training samples are likely to develop simple and clear relationships with a limited number of other labels; e.g., sheep and cow almost always relate to plant, whereas they seldom relate to boat. Thus, according to the distribution learned by DLST, these labels have a prominently higher probability of co-occurring with a small group of specific labels, which dramatically reduces the difficulty of recognizing them in various images. This relationship holds as long as a relatively equal number of labels per instance is assigned to each class, which can be observed in the Pascal VOC dataset (the first row in Table 5). This observation further justifies the advantage of leveraging the distribution to tackle sparsity in multi-label classification.
Figure 6 shows the per-class precision and recall of DLST on the NUS-WIDE dataset. It can be seen that DLST achieves high precision and low recall on classes with few labels; representative classes include computer, protest and wedding. The reason is that DLST tends to stop predicting more labels for sparsely correlated labels. On other classes such as map, book and rainbow, which have larger label cardinality, DLST instead achieves low precision and high recall, which may be caused by insufficient training data for these classes. There are also some classes that obtain comparable precision and recall, such as clouds, grass and person; notably, these concepts are ubiquitous among the images in the NUS-WIDE dataset. Moreover, the mean precision and recall of DLST averaged over all classes are both higher than those of the state-of-the-art CNN-RNN method. However, it is worth noting that this result is obtained by randomly selecting subsets of images for training and testing, rather than employing the whole dataset.
3.5 The Advantage of DLST in Tackling Sparsity
Sparsity I: Label Set Sparsity for Instances
We investigate the performance of DLST in handling label set sparsity by conducting experiments with missing labels. This setting also facilitates comparison with other baselines. The experimental data are generated on the MSRC and SUN datasets, with 10% of the data from each class used for training and the remainder for testing. The experiments are conducted with missing labels on the labeled training instances: for each missing ratio, we randomly drop the corresponding fraction of the observed labels. To avoid empty classes or instances with no positive labels, at least one instance is kept for each class and at least one positive label is kept for each instance. The label vector of each training instance is then reset according to the protocol: $y_{ij}=1$ if the $i$th instance is assigned the $j$th label, and $y_{ij}=0$ otherwise. Note that $y_{ij}=0$ may indicate either a missing label or a negative label.
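The label-dropping protocol can be sketched as follows; for brevity, the sketch only enforces the at-least-one-positive-label-per-instance constraint (the per-class constraint is omitted), and the function name and random seed are illustrative.

```python
import numpy as np

def drop_labels(Y, missing_ratio, seed=0):
    """Randomly reset a fraction of the observed (positive) labels to 0, which then
    becomes indistinguishable from a negative label, as in the protocol above."""
    rng = np.random.RandomState(seed)
    Y_miss = Y.copy()
    pos_i, pos_j = np.nonzero(Y)
    n_drop = int(missing_ratio * len(pos_i))
    dropped = 0
    for idx in rng.permutation(len(pos_i)):
        if dropped >= n_drop:
            break
        i, j = pos_i[idx], pos_j[idx]
        if Y_miss[i].sum() > 1:            # keep at least one positive label per instance
            Y_miss[i, j] = 0
            dropped += 1
    return Y_miss

# Toy usage: drop 50% of the observed labels of a small label matrix.
Y = np.array([[1, 1, 0, 1], [0, 1, 1, 0], [1, 0, 0, 1]])
print(drop_labels(Y, missing_ratio=0.5))
```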
The proposed method is compared with FAIE, MC, SLRM, MRV and LEML, and the results are shown in Figure 11. We can observe that DLST shows a notable performance gain over the other methods across all numbers of labels assigned to each instance. Also, the performance of DLST increases relatively less than that of the comparing methods as the average number of labels per instance increases. This is because the label correlations become more prominent with an increasing number of labels per instance, which benefits most methods significantly. DLST, however, performs relatively stably and depends less on the extra given labels, since it can capture the comprehensive relationships between labels by using distribution alignment. As shown in Figure 11, the minimum number of labels per instance that DLST needs in order to obtain sufficient information (i.e., a negligible variation in AP) is much smaller than that of the comparing baselines on both the MSRC and SUN datasets.
Sparsity II: Training Data Sparsity for Labels
We additionally investigate the effectiveness of DLST in tackling training data sparsity by varying the number of training instances for each class over a range of ratios. For a given percentage, the corresponding number of training instances is randomly sampled 10 times, and the resulting average precision is recorded. The experimental results of CPLST, MC, SLRM, FAIE and MIML are shown in Figure 14. Although the performance of all the methods degrades with a decreasing number of labeled training instances, DLST achieves relatively stable performance across all training ratios and consistently outperforms the comparing baselines. This can be explained by the capability of DLST to exploit the inter-label correlations encoded in the distribution. The comparing methods usually need a large number of training instances to saturate, whereas DLST requires only a small number of training instances per class to reach its highest classification performance (with negligible performance variation). This verifies that, by using the distribution, the label correlations can be exploited more effectively and efficiently, thus remarkably reducing the requirement on training set size.
3.6 Further Analysis
The effectiveness of our approach has been quantitatively evaluated in Table 2. We further present a qualitative analysis of the learned latent code of DLST to illustrate its capability of revealing label correlations as well as addressing the sparsity problems. In Figure 15, we show the average precision for each class on the Mediamill dataset.
To investigate the capability of DLST in tackling sparsity in the original label space, we study the improvement in Average Precision for classes with different numbers of concurrent labels. From Figure 15, we can see that the AP improvement is more prominent for classes with a small average number of concurrent labels. Typical examples include classes such as car, grass, beach and waterscape, where clear AP improvements can be observed. Although these labels are sparsely correlated with other labels for each instance, by learning from the distribution over all possible labels, DLST is able to predict the corresponding labels with high precision. This result justifies our proposition that the distribution reveals more comprehensive information about the label correlations, which can be utilized to enhance multi-label classification performance.
4 Conclusion
To tackle label sparsity and exploit label correlations for multi-label classification, in this paper a distribution-based label space transformation method is proposed. By introducing the concept of distribution, more comprehensive relationships among the labels of training instances can be captured. A much denser latent code is learned, enabling the proposed model to cope with both label set sparsity and training data sparsity, where most multi-label classification methods fail to work effectively. The proposed model is especially successful in capturing a set of distinctive concurrent label patterns from a large label vocabulary, which offers significant benefits to multi-label classification. Extensive experimental results demonstrate that DLST is superior to state-of-the-art multi-label learning methods under various percentages of labeled data and missing labels.
Footnotes
 http://mulan.sourceforge.net/datasets-mlc.html
 http://research.microsoft.com/enus/projects/ObjectClassRecognition
 https://cs.brown.edu/gen/sunattributes.html
References
 R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma, “Multi-label learning with millions of labels: recommending advertiser bid phrases for web pages,” in WWW, 2013, pp. 13–24.
 Z. Barutçuoglu, R. E. Schapire, and O. G. Troyanskaya, “Hierarchical multi-label prediction of gene function,” Bioinformatics, vol. 22, no. 7, pp. 830–836, 2006.
 W. Bi and J. T. Kwok, “Multi-label classification with label correlations and missing labels,” in AAAI, 2014, pp. 1680–1686.
 M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” PR, vol. 37, no. 9, pp. 1757–1771, 2004.
 J. K. Bradley and C. Guestrin, “Learning tree conditional random fields,” in ICML, 2010, pp. 127–134.
 R. S. Cabral, F. D. la Torre, J. P. Costeira, and A. Bernardino, “Matrix completion for weakly-supervised multi-label image classification,” TPAMI, vol. 37, no. 1, pp. 121–135, 2015.
 C. Chang and C. Lin, “LIBSVM: A library for support vector machines,” ACM TIST, vol. 2, no. 3, pp. 27:1–27:27, 2011.
 M. Chen, A. X. Zheng, and K. Q. Weinberger, “Fast image tagging,” in ICML, 2013, pp. 1274–1282.
 Y. Chen and H. Lin, “Feature-aware label space dimension reduction for multi-label classification,” in NIPS, 2012, pp. 1538–1546.
 W. Cheng and E. Hüllermeier, “Combining instance-based learning and logistic regression for multi-label classification,” ML, vol. 76, no. 2-3, pp. 211–225, 2009.
 C. K. Chow and C. N. Liu, “Approximating discrete probability distributions with dependence trees,” IEEE Trans. Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
 T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from National University of Singapore,” in CIVR, 2009.
 A. Elisseeff and J. Weston, “A kernel method for multi-labelled classification,” in NIPS, 2001, pp. 681–687.
 M. Everingham, L. Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” Int. J. Comput. Vision, vol. 88, no. 2, pp. 303–338, 2010.
 J. Fürnkranz, E. Hüllermeier, E. Loza Mencía, and K. Brinker, “Multi-label classification via calibrated label ranking,” ML, vol. 73, no. 2, pp. 133–153, 2008.
 B. Hariharan, L. Zelnik-Manor, S. V. N. Vishwanathan, and M. Varma, “Large scale max-margin multi-label classification with priors,” in ICML, 2010, pp. 423–430.
 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
 S. Huang and Z. Zhou, “Multi-label learning by exploiting label correlations locally,” in AAAI, 2012.
 J. Jiang, “Multi-label learning on tensor product graph,” in AAAI, 2012, pp. 956–962.
 L. Jing, L. Yang, J. Yu, and M. K. Ng, “Semi-supervised low-rank mapping learning for multi-label classification,” in CVPR, 2015, pp. 1483–1491.
 X. Kong, B. Cao, and P. S. Yu, “Multi-label classification by mining label and instance correlations from heterogeneous information networks,” in KDD, 2013, pp. 614–622.
 X. Kong, M. K. Ng, and Z. Zhou, “Transductive multi-label learning via label set propagation,” TKDE, vol. 25, no. 3, pp. 704–719, 2013.
 Q. Li, M. Qiao, W. Bian, and D. Tao, “Conditional graphical lasso for multi-label image classification,” in CVPR, 2016, pp. 2977–2986.
 X. Li, F. Zhao, and Y. Guo, “Multi-label image classification with a probabilistic label enhancement model,” in UAI, 2014, pp. 430–439.
 Z. Lin, G. Ding, M. Hu, and J. Wang, “Multi-label classification via feature-aware implicit label space encoding,” in ICML, 2014, pp. 325–333.
 W. Liu and I. W. Tsang, “On the optimality of classifier chain for multi-label classification,” in NIPS, 2015, pp. 712–720.
 G. Patterson, C. Xu, H. Su, and J. Hays, “The SUN attribute database: Beyond categories for deeper scene understanding,” International Journal of Computer Vision, vol. 108, no. 12, pp. 59–81, 2014.
 J. Read, “A pruned problem transformation method for multi-label classification,” in Proc. New Zealand Computer Science Research Student Conference, 2008, pp. 143–150.
 J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” ML, vol. 85, no. 3, pp. 333–359, 2011.
 R. E. Schapire and Y. Singer, “BoosTexter: A boosting-based system for text categorization,” Machine Learning, vol. 39, no. 2/3, pp. 135–168, 2000.
 C. Snoek, M. Worring, J. C. van Gemert, J. Geusebroek, and A. W. M. Smeulders, “The challenge problem for automated detection of 101 semantic concepts in multimedia,” in ACM MM, 2006, pp. 421–430.
 C. G. M. Snoek, M. Worring, J. C. van Gemert, J.M. Geusebroek, and A. W. M. Smeulders, “The challenge problem for automated detection of 101 semantic concepts in multimedia,” in ACM MM, 2006, pp. 421–430.
 F. Tai and H.-T. Lin, “Multi-label classification with principal label space transformation,” Neural Computation, vol. 24, no. 9, pp. 2508–2542, 2012.
 M. Tan, Q. Shi, A. van den Hengel, C. Shen, J. Gao, F. Hu, and Z. Zhang, “Learning graph structure for multi-label image classification via clique generation,” in CVPR, 2015, pp. 4100–4109.
 K. Trohidis, G. Tsoumakas, G. Kalliris, and I. Vlahavas, “Multi-label classification of music by emotion,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 4, no. 1, pp. 325–330, 2008.
 G. Tsoumakas, I. Katakis, and I. P. Vlahavas, “Mining multi-label data,” in Data Mining and Knowledge Discovery Handbook, 2nd ed., 2010, pp. 667–685.
 G. Tsoumakas and I. P. Vlahavas, “Random k-labelsets: An ensemble method for multi-label classification,” in ECML, 2007, pp. 406–417.
 L. van der Maaten and G. E. Hinton, “Visualizing high-dimensional data using t-SNE,” JMLR, vol. 9, pp. 2579–2605, 2008.
 D. Vasisht, A. C. Damianou, M. Varma, and A. Kapoor, “Active learning for sparse Bayesian multi-label classification,” in KDD, 2014, pp. 472–481.
 C. Wang, S. Yan, L. Zhang, and H. J. Zhang, “Multi-label sparse coding for automatic image annotation,” in CVPR, 2009, pp. 1643–1650.
 J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, “CNN-RNN: A unified framework for multi-label image classification,” in CVPR, 2016, pp. 2285–2294.
 Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “CNN: single-label to multi-label,” CoRR, vol. abs/1406.5726, 2014.
 J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned universal visual dictionary,” in ICCV 2005, pp. 1800–1807.
 X. Wu and Z. Zhou, “A unified view of multi-label performance measures,” CoRR, vol. abs/1609.00288, 2016.
 H. Yu, P. Jain, P. Kar, and I. S. Dhillon, “Large-scale multi-label learning with missing labels,” in ICML, 2014, pp. 593–601.
 M. Zhang and K. Zhang, “Multi-label learning by exploiting label dependency,” in ACM SIGKDD, 2010, pp. 999–1008.
 M. Zhang and Z. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Pattern Recognition, vol. 40, no. 7, pp. 2038–2048, 2007.
 M. L. Zhang and Z. H. Zhou, “A review on multi-label learning algorithms,” TKDE, vol. 26, no. 8, pp. 1819–1837, 2014.
 Y. Zhang and J. G. Schneider, “Maximum margin output coding,” in ICML, 2012, pp. 1575–1582.
 F. Zhao and Y. Guo, “Semi-supervised multi-label learning with incomplete labels,” in IJCAI, 2015, pp. 4062–4068.
 T. Zhou, D. Tao, and X. Wu, “Compressed labeling on distilled labelsets for multi-label learning,” ML, vol. 88, no. 1-2, pp. 69–126, 2012.
 F. Zhu, H. Li, W. Ouyang, N. Yu, and X. Wang, “Learning spatial regularization with image-level supervisions for multi-label image classification,” in CVPR, 2017, pp. 5513–5522.