Distribution-based Label Space Transformation for Multi-label Learning


Multi-label learning problems arise in a wide range of machine learning applications. The key to successful multi-label learning algorithms lies in exploiting inter-label correlations, which usually incurs great computational cost. Another notable factor in multi-label learning is that the label vectors are usually extremely sparse, especially when the candidate label vocabulary is very large and only a few instances are assigned to each category. Recently, a label space transformation (LST) framework has been proposed to address these challenges. However, current LST-based methods usually suffer from information loss in the label space dimension reduction process and fail to address the sparsity problem effectively. In this paper, we propose a distribution-based label space transformation (DLST) model. By defining a distribution based on the similarity of label vectors, a more comprehensive label structure can be captured. Then, by minimizing the KL-divergence between the distributions of the label space and the latent space, the information of the original label space can be approximately preserved in the latent space. Consequently, a multi-label classifier trained on the dense latent codes yields better performance. Using a distribution enables DLST to fill in additional information about the label correlations, which endows DLST with the capability to handle label set sparsity and training data sparsity in multi-label learning problems. With the optimal latent code, a kernel logistic regression function is learned for the mapping from the feature space to the latent space. Then ML-KNN is employed to recover the original label vector from the transformed latent code. Extensive experiments on several benchmark datasets demonstrate that DLST not only achieves high classification performance but is also computationally more efficient.


Multi-label, label propagation, semi-supervised, distribution, KL-divergence.


1 Introduction


Multi-label learning naturally arises in various machine learning tasks such as text mining [30], image classification [4] and annotation [8, 40], and bioinformatic analysis [2]. For example, a document may be annotated with multiple diverse tags, an image may contain multiple object categories, and a gene is usually multifunctional. As a generalization of multi-class learning, multi-label learning allows each instance to be assigned to a set of labels rather than a single label. Given its potential applications in a variety of real-world problems such as keyword suggestion [1] and video segmentation [31], multi-label learning, especially multi-label classification, has been extensively studied and is attracting growing research attention.

Figure 1: An illustration of label transformation of the proposed method via distribution alignment. Most methods fail when the input label vectors are very sparse, whether because of a large label vocabulary or because of missing labels. By employing a distribution based on label similarity, the proposed method exploits comprehensive label correlations. For example, flower seldom co-occurs with building, dog, books, etc., in the training images; nonetheless, its association with these labels can still be well captured by the distribution defined on pairwise label similarity. The distribution information is then encoded in the dense latent label representation.

Well-established multi-label classification methods roughly follow two lines of research [33, 23, 48], namely Algorithm Adaptation and Problem Transformation. Algorithm adaptation methods adopt certain existing algorithms and adapt them to solve the multi-label classification problem. Representative members of this group include Rank-SVM [13], ML-KNN [47] and Instance based Logistic Regression [10].

Problem transformation methods, on the other hand, reformulate the multi-label classification problem into one or more sub learning tasks. Typical examples include Binary Relevance (BR), Classifier Chains (CC) [29, 26], Label Ranking (LR) [15], Label Powerset (LP) [22] and its variants such as Random k-Label Sets [37] and Pruned Problem Transformation [28]. BR decomposes multi-label classification into many separate single-label binary classification tasks, one for each label. CC takes label dependency into consideration by constructing a chain of binary classifiers, where each classifier additionally uses the previous labels as input features. LR reformulates the multi-label classification problem as a task of ranking the available labels by relevance and determining a threshold on the relevance. LP reduces multi-label classification to multi-class classification by treating each observed label set as a distinct multi-class label. The problem transformation approaches are more advantageous since any algorithm for the transformed task can be used to solve multi-label classification problems.
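
To make the BR and LP reductions concrete, the following Python sketch builds the training targets each of them would produce from a small label matrix. The function names and the toy matrix are illustrative assumptions and are not taken from the cited works.

```python
from typing import Dict, List, Tuple


def binary_relevance_targets(Y: List[List[int]]) -> List[List[int]]:
    """BR: one binary target vector per label (column j of the label matrix)."""
    n_labels = len(Y[0])
    return [[row[j] for row in Y] for j in range(n_labels)]


def label_powerset_targets(
    Y: List[List[int]],
) -> Tuple[List[int], Dict[Tuple[int, ...], int]]:
    """LP: every distinct observed label set becomes one multi-class id."""
    class_of: Dict[Tuple[int, ...], int] = {}
    targets: List[int] = []
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)  # ids are assigned in order of appearance
        targets.append(class_of[key])
    return targets, class_of


# Toy label matrix (hypothetical): 4 instances, 3 labels.
Y = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 1],
     [1, 1, 0]]

br_targets = binary_relevance_targets(Y)            # 3 binary problems
lp_targets, lp_classes = label_powerset_targets(Y)  # 1 multi-class problem
```

BR yields as many binary problems as there are labels, while LP yields a single problem whose number of classes equals the number of distinct label sets, which can grow quickly when label sets are diverse.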

More recently, research on multi-label classification generally falls into two learning paradigms [33, 23]. The first is the structured output learning paradigm, which focuses on modeling label structure and exploiting inter-label correlations, then using them to predict the label vector for test instances [16, 23]. Label correlations are typically encoded in a graph structure, such as a Chow-Liu Tree [11] or a Maximum Spanning Tree [24], and the conditional label structure can be approximately learned via Structured Support Vector Machines and Markov Random Fields. To further assist label structure learning, the methods in [19, 21, 23] explicitly incorporate feature information into the learning process. However, such label structure learning methods are computationally expensive.

The second paradigm employs label space reduction [49], which encodes the original label space into a low-dimensional latent space through random projection [51], canonical correlation analysis (CCA) based projection, or by directly learning the projected codes [25]. Subsequently, prediction is performed in the low-dimensional latent space, and the results are translated back to the original label space via a decoding process, so that the original labels for test instances can be recovered. Moreover, algorithms with both label space and feature space dimension reduction have been proposed, such as conditional principal label space transformation [9]. In addition, [20, 45] take a more direct approach by formulating the label prediction problem as learning a low-rank linear mapping from feature space to label space. However, these methods usually suffer from information loss and depend on the reduced dimension of the latent space.

In recent years, the proliferation of labels has posed a great challenge to existing multi-label learning methods. Due to the large label vocabulary, the label vectors are usually characterized by high dimensionality and remarkable sparsity. The sparsity originates from two sources: (1) For each instance, only a small number of labels are present, namely, the label vector has little support (Sparsity I). (2) For a certain set of labels, very few training instances are assigned to it (Sparsity II). In the remainder of this paper, we refer to these two types of sparsity as label set sparsity and training data sparsity respectively. Although label space dimension reduction approaches aim to address this problem, the performance obtained in the reduced latent space is not satisfactory in terms of prediction accuracy and computational complexity. The reason is that the dimension reduction process in these methods may incur information loss, so that the original inter-label correlations are not fully preserved in the latent space.

Following the general label space transformation framework proposed in [33], we propose a novel distribution-based label space transformation model (DLST). By aligning the distribution between the original label space and the latent space, an optimal transformed code can be learned for each label vector. The advantages of employing distribution alignment in label space transformation lie in two aspects: (1) In contrast to conventional approaches, which lose information, extra information beyond the original label vector can actually be filled into the latent code according to the distribution. As a result, more complicated label correlations can be captured and approximated in the latent code. (2) In the face of label set sparsity and training data sparsity, the proposed model is still able to recover the whole distribution. It has been empirically verified that the number of labels per instance required for DLST to obtain its highest score is far smaller than that required by the compared baselines. A similar phenomenon is also observed for the number of training instances per class required to achieve the best performance.

As shown in Figure 1, the latent codes derived by the proposed method are much denser than the original label vectors while still preserving the distribution of the original label space. Therefore, it can be expected that classification using the transformed latent codes will perform significantly better than classification using the original sparse labels.

In the training phase, a regression function is learned to map the original data in the feature space to the transformed code in the latent space. For each test instance, the corresponding latent code can be computed by applying the regression function. Then, ML-KNN [47] is employed as the decoder to recover the original label vector from the latent code of each instance. The proposed model can also be extended with kernel tricks to deal with nonlinear regression from original feature to the latent code.

The proposed model is capable of tackling both label set sparsity and training data sparsity. Real-world examples of these two types of sparsity are missing data and limited training data. The performance of DLST is relatively stable across varying missing ratios or training data ratios. This can be attributed to the distribution used in DLST. Rather than being limited to the given label vectors, DLST captures the whole distribution of the label space by fitting the observed label vectors using a distribution with maximal variance, and transmits the distribution to the latent space.

The contributions of this paper are:

  • The proposed method takes advantage of distribution to capture more comprehensive inter-label correlations in the original label space and transmits it to the latent space.

  • The dense latent code learned by DLST successfully addresses label set sparsity by distinguishing concurrent label patterns from most other unrelated labels.

  • The proposed model effectively alleviates the requirement on training data size to achieve high multi-label classification performance.

2 A Distribution based Multi-Label Learning Framework

Figure 2: Schematic illustration of DLST. In the training phase: Label space transformation and Feature regression. Label space transformation converts sparse label vector into dense latent code while preserving the distribution information of the original label space. Feature regression learns a kernel logistic regression function to map image features to latent codes. In the testing phase, the latent code of test instance is first learned by applying the regression function, and then ML-KNN prediction is performed to recover the original label vector.

2.1 Preliminaries

Let denote the set of labeled training data, where each instance is associated with a subset of possible labels represented by a binary vector . when is assigned the -th label and 0 otherwise. The goal of multi-label classification is to learn a mapping for predicting the label vector of each test instance. For convenience of notation, the feature vectors and label vectors of training instances are arranged in row to form the input feature matrix and label matrix .

Traditional multi-label classification methods aim to learn a binary classifier for each dimension of the label vector, where indicates the -th dimension of the original label space. In the case of numerous labels, it will become computationally prohibitive for these methods to predict the label vector.

To tackle this challenge, a novel label space transformation learning framework was proposed, where each label vector is firstly encoded into a point in the latent space with an encoding process . Similarly, the latent codes are also stacked in row to form a matrix . Then, the multi-label classification problem on turns to a multi-dimensional regression problem on . After obtaining the regression function that has a high prediction accuracy on , the framework will then map back to the original label space via some decoder . The whole learning process is illustrated in Figure 2. Note that with the transformed latent codes , the regression process is open to be replaced by any mapping algorithm from data feature to multi-label vectors.

2.2 Label Space Transformation

In this section, we propose an implicit encoding [25] module by aligning the distribution of label space and that of latent space . Although traditional multi-label learning methods explicitly model the inter-label correlations [3, 5, 24, 34, 46, 52, 41], the correlations they investigate are of no more than third order, owing to the computational complexity. Moreover, the correlations captured in the existing models are either local or global, and have a bias towards similarity between labels while dissimilarity is mostly ignored.

In contrast, our model provides a more comprehensive description of the correlations between labels by employing distribution. Firstly, the similarities in the label space are transformed into probability distribution, and approximated by the distribution of latent code in a reduced space. Based on this distribution alignment, a new label representation can be learned which fills out the original label distribution information. As a result, the proposed model is not limited to the number of given label vectors and their completeness, which greatly expands its flexibility.

To derive the distribution of label vectors for the labeled training instances, first define as the probability of observing the similarity between label vectors and among all pairs of labeled training instances. Following t-SNE [38], we utilize a Student t-distribution with one degree of freedom to transform the Euclidean distances into probabilities, as shown below


Let denote the distribution of instances with to-be-learnt representations in the latent space, then can be calculated as follows


where is the dimension of the latent space.
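
As a rough illustration of how pairwise distances between label vectors can be turned into a probability distribution, the following Python sketch applies a Student t kernel with one degree of freedom and normalizes over all pairs, in the style of t-SNE [38]. The function name `student_t_joint`, the joint normalization over all ordered pairs, and the toy label matrix are assumptions made for illustration. In particular, the way the latent-space distribution depends on the latent dimension is not reproduced here.

```python
from typing import List


def student_t_joint(points: List[List[float]]) -> List[List[float]]:
    """Pairwise similarity distribution from a Student t kernel (1 d.o.f.).

    Assumed form: weight(i, j) = 1 / (1 + ||v_i - v_j||^2), normalised so that
    all off-diagonal entries sum to one; diagonal entries are zero.
    """
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            weights[i][j] = 1.0 / (1.0 + sq_dist)
            total += weights[i][j]
    return [[w / total for w in row] for row in weights]


# Toy label matrix (hypothetical): instances 0 and 1 share a label set,
# instance 2 has a disjoint one.
Y = [[1, 0, 1, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]
P = student_t_joint(Y)
```

In this toy case the pair with identical label vectors receives five times the probability mass of a pair whose label vectors differ in all four positions, since the kernel weights are 1 and 1/(1+4) respectively.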

To approximate the distribution with that of latent code in the reduced space, we adopt the Kullback-Leibler divergence to measure the distribution discrepancy between , which can be formulated by


This KL-divergence based distribution alignment technique considers the comprehensive correlations among the label vectors. The underlying assumption is that instances with highly correlated label vectors tend to have high similarity in the input data space. Therefore, instances with the same labels tend to be drawn much closer in the latent space.
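
The discrepancy measure can be sketched in a few lines of Python. The example assumes that both distributions are given as square matrices of pairwise probabilities with zero diagonals that each sum to one; the function name `kl_divergence`, the `eps` guard and the toy matrices are illustrative assumptions.

```python
import math
from typing import List


def kl_divergence(P: List[List[float]], Q: List[List[float]],
                  eps: float = 1e-12) -> float:
    """KL(P || Q) = sum_ij p_ij * log(p_ij / q_ij) over pairwise probabilities."""
    total = 0.0
    for p_row, q_row in zip(P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0.0:  # terms with p_ij = 0 contribute nothing
                total += p * math.log(p / max(q, eps))
    return total


# Toy label-space distribution (hypothetical): instances 0 and 1 are similar.
P = [[0.00, 0.40, 0.05],
     [0.40, 0.00, 0.05],
     [0.05, 0.05, 0.00]]
# A latent distribution that treats every pair alike.
Q_uniform = [[0.0 if i == j else 1.0 / 6.0 for j in range(3)] for i in range(3)]

kl_same = kl_divergence(P, P)             # zero: the distributions coincide
kl_uniform = kl_divergence(P, Q_uniform)  # positive: structure is not preserved
```

A latent code whose pairwise distribution ignores the similarity structure of the labels is penalized, while a perfect match yields zero divergence.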

2.3 Feature Regression

With the optimal latent representation , the original multi-label classification problem on converts to a multi-dimensional regression problem on . The mapping function is actually open for any effective multi-label prediction models, such as linear regression, ridge regression and logistic regression. More generally, any algorithm that learns a mapping from data features to multi-label vectors can be used here, with improved performance over the original algorithm.

In this paper, we use kernel logistic regression to learn the mapping from features to latent codes. The reasons are twofold: on the one hand, logistic regression outputs not only the predictions but also the corresponding probabilities. On the other hand, it can easily be extended to a kernelized version where nonlinear mappings are included.

In kernel logistic regression, each instance is mapped to the Reproducing Kernel Hilbert Space (RKHS) as , and these together form a kernel feature matrix . In RKHS, the inner product between kernel features can be efficiently calculated by applying the kernel trick , where is the introduced kernel function. With non-linear kernel functions, the linear mapping from kernel features to the latent codes is in fact a non-linear mapping from the original feature space to the latent space. In this paper, we treat each dimension of the latent code separately and learn a linear mapping in RKHS for the -th dimension. The objective function of kernel logistic regression is as follows


where is the -th entry in , and is a weighting parameter.

Following the common practice in the literature, let fall in the span of the kernel features for training instances, i.e. with as the spanning coefficients. Then in Eq. (4), . It can be seen that the training cost of kernel logistic regression is positively related to the training set size , which is undesirable for large-scale datasets.

Note that not all training instances are required to form the span, as redundancy may exist between kernel features of training instances. Therefore, we only sample a small part of them for building the kernel feature matrix and use it as the basis to span the -th mapping . Hence, where are the coefficients that need to be learned and denotes the sampling size. Then the training cost of kernel logistic regression can be greatly reduced, making it more efficient for training as well as predicting. The specific sampling strategy can be either random sampling or other more sophisticated methods.
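
The reduced-basis idea can be sketched as follows for a single output dimension. This is a minimal illustration under several assumptions that are not stated in the text: an RBF kernel, binary targets in {0, 1}, an added bias term, a plain squared penalty on the coefficients instead of an RKHS norm, full-batch gradient descent, and a deterministic "every other instance" sampling rule. All function names and hyperparameter values are hypothetical.

```python
import math
from typing import List, Tuple


def rbf_kernel(a: List[float], b: List[float], gamma: float) -> float:
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def kernel_features(X, basis, gamma: float) -> List[List[float]]:
    """Kernel values between every instance and the sampled basis points."""
    return [[rbf_kernel(x, b, gamma) for b in basis] for x in X]


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def fit_klr(X, y, basis, gamma: float = 0.5, lam: float = 1e-3,
            lr: float = 0.5, n_iter: int = 500) -> Tuple[List[float], float]:
    """Learn one coefficient per basis point by gradient descent."""
    Phi = kernel_features(X, basis, gamma)
    n, m = len(X), len(basis)
    alpha, bias = [0.0] * m, 0.0
    for _ in range(n_iter):
        grad_alpha = [2.0 * lam * a for a in alpha]  # penalty gradient
        grad_bias = 0.0
        for phi, target in zip(Phi, y):
            err = sigmoid(bias + sum(a * f for a, f in zip(alpha, phi))) - target
            grad_bias += err / n
            for s in range(m):
                grad_alpha[s] += err * phi[s] / n
        alpha = [a - lr * g for a, g in zip(alpha, grad_alpha)]
        bias -= lr * grad_bias
    return alpha, bias


def predict_proba(X, basis, alpha, bias, gamma: float = 0.5) -> List[float]:
    Phi = kernel_features(X, basis, gamma)
    return [sigmoid(bias + sum(a * f for a, f in zip(alpha, phi))) for phi in Phi]


# Toy 1-D data (hypothetical): negatives on the left, positives on the right.
X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 1, 1]
basis = X[::2]  # keep 3 of the 6 training instances as the basis
alpha, bias = fit_klr(X, y, basis)
probs = predict_proba(X, basis, alpha, bias)
```

Only as many coefficients as basis points are learned, so the per-iteration cost scales with the sampling size rather than with the square of the training set size.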

2.4 Multi-Label Prediction

For a test instance , based on the learned regression function , the -th dimension of the latent code for can be predicted. In addition, the probabilities of can be obtained as follows


To obtain the label vector for each test instance , the latent code learned in the previous step needs to be further mapped back to the original label space through some decoder . In this paper, ML-KNN is employed to recover the original label for test instances.

For each test instance , ML-KNN first identifies its nearest neighbors in the training set. Then, based on the label sets of these neighbors, the label vector for can be determined using the following maximum a posteriori principle


where indicates that instance has label , while denotes that is not assigned label . denotes that among the nearest neighbors of , there are exactly instances which are assigned the -th label, can be calculated by . Using Bayes' rule, Eq. (6) is equivalent to the following objective function


As shown in Eq. (7), in order to determine the label vector , all the information needed is the prior probabilities and the posterior probabilities , which can all be directly estimated from the training instances. Problem (7) can be similarly solved as in [47].
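
The maximum a posteriori decision can be sketched in Python as follows. This is a compact illustration in the spirit of ML-KNN [47], assuming that neighbors are searched among the latent codes of the training instances with squared Euclidean distance and that counts are smoothed with a constant `s`. The function names, the value `s = 1` and the toy data are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Model = Tuple[List[float], List[List[float]], List[List[float]]]


def sq_dist(a, b) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def neighbours(Z, query, k: int, exclude: Optional[int] = None) -> List[int]:
    """Indices of the k training codes closest to `query` (ties broken by index)."""
    candidates = [i for i in range(len(Z)) if i != exclude]
    return sorted(candidates, key=lambda i: (sq_dist(Z[i], query), i))[:k]


def fit_mlknn(Z, Y, k: int, s: float = 1.0) -> Model:
    """Estimate smoothed priors and neighbour-count likelihoods per label."""
    n, n_labels = len(Y), len(Y[0])
    prior = [(s + sum(row[lab] for row in Y)) / (2 * s + n) for lab in range(n_labels)]
    has = [[0] * (k + 1) for _ in range(n_labels)]    # instance has the label
    lacks = [[0] * (k + 1) for _ in range(n_labels)]  # instance lacks the label
    for i in range(n):
        nbrs = neighbours(Z, Z[i], k, exclude=i)
        for lab in range(n_labels):
            count = sum(Y[t][lab] for t in nbrs)
            (has if Y[i][lab] == 1 else lacks)[lab][count] += 1
    cond_has = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in has]
    cond_lacks = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in lacks]
    return prior, cond_has, cond_lacks


def predict_mlknn(model: Model, Z, Y, k: int, query) -> List[int]:
    """Pick, for every label, the more probable of 'has' and 'lacks'."""
    prior, cond_has, cond_lacks = model
    nbrs = neighbours(Z, query, k)
    labels = []
    for lab in range(len(prior)):
        count = sum(Y[t][lab] for t in nbrs)
        score_has = prior[lab] * cond_has[lab][count]
        score_lacks = (1.0 - prior[lab]) * cond_lacks[lab][count]
        labels.append(1 if score_has > score_lacks else 0)
    return labels


# Toy latent codes (hypothetical): two well-separated groups, one label each.
Z = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
Y = [[1, 0]] * 4 + [[0, 1]] * 4
model = fit_mlknn(Z, Y, k=2)
pred_a = predict_mlknn(model, Z, Y, 2, [0.1, 0.1])
pred_b = predict_mlknn(model, Z, Y, 2, [5.2, 4.9])
```

All quantities used at prediction time are frequency counts gathered from the training set, which is why the decoder adds little training cost.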

Algorithm 1 The latent representation optimization for problem (3)
Input: Data matrix , label matrix , .
Optimization parameters: learning rate , momentum , number of iterations .
Compute the probability distribution of the original label space according to Eq. (1)
while  do
      Update the distribution according to Eq. (2)
      Update gradient according to Eq. (10)
      Update according to Eq. (11)
end while

Algorithm 2 Multi-label propagation for problem (7)
Input: Original data matrix , original label matrix , .
Reduce the original label matrix to a latent space with code matrix via Algorithm 1.
Learn a kernel logistic regression from to by solving problem (4).
For each test instance, derive the latent code .
Map to label space, and recover the label vector of test instances according to Eq. (7).
return

2.5 Optimization

The objective function of problem (3) is non-convex, so only a local optimum can be obtained. Since problem (3) is an unconstrained optimization problem, to learn a locally optimal , we propose to exploit gradient descent based optimization methods. From Eq. (3), we have


Since solely depends on the labels of training data and remains fixed during the optimization procedure, problem (8) can be reduced to


Combining Eq. (1) and  (2), the gradient of Eq. (9) w.r.t. can be derived as follows


With gradients calculated in Eq. (10), effective gradient descent based optimization methods can be further applied to derive optimal . The update strategy of is as follows


where denotes the optimal at -th iteration, is the learning rate, is the momentum at -th iteration. The stopping criterion for the algorithm is , with a maximum iteration number . The details of the encoding algorithm are presented in Algorithm 1. Algorithm 2 summarizes the whole procedure of DLST. The complexity of the proposed algorithm is , where is the number of labeled training data, and is the sampling size for kernel logistic regression.
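
The encoding loop can be sketched in Python as gradient descent with momentum on the KL-divergence. The sketch assumes a Student t kernel with one degree of freedom in both spaces, for which the gradient takes the standard t-SNE form [38]; the gradient and distributions actually used by the method may differ (for example through the latent dimension). The function names, the fixed iteration count used instead of a stopping criterion, and the hyperparameter values are illustrative assumptions.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def student_t_joint(points: Matrix) -> Matrix:
    """Normalised pairwise weights 1 / (1 + squared distance), zero diagonal."""
    n = len(points)
    w = [[0.0 if i == j else
          1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
          for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]


def kl_divergence(P: Matrix, Q: Matrix) -> float:
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0.0)


def kl_gradient(P: Matrix, Z: Matrix) -> Matrix:
    """t-SNE gradient: 4 * sum_j (p_ij - q_ij) (z_i - z_j) / (1 + ||z_i - z_j||^2)."""
    Q = student_t_joint(Z)
    n, dim = len(Z), len(Z[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sq = sum((a - b) ** 2 for a, b in zip(Z[i], Z[j]))
            coeff = 4.0 * (P[i][j] - Q[i][j]) / (1.0 + sq)
            for c in range(dim):
                grad[i][c] += coeff * (Z[i][c] - Z[j][c])
    return grad


def optimise_latent_codes(Y: Matrix, dim: int = 2, lr: float = 0.2,
                          momentum: float = 0.5, n_iter: int = 500,
                          seed: int = 0) -> Tuple[Matrix, List[float]]:
    rng = random.Random(seed)
    P = student_t_joint(Y)  # fixed throughout the optimisation
    Z = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in Y]
    velocity = [[0.0] * dim for _ in Y]
    history = [kl_divergence(P, student_t_joint(Z))]
    for _ in range(n_iter):
        grad = kl_gradient(P, Z)
        for i in range(len(Z)):
            for c in range(dim):
                # velocity holds the previous step, i.e. the momentum term
                velocity[i][c] = momentum * velocity[i][c] - lr * grad[i][c]
                Z[i][c] += velocity[i][c]
        history.append(kl_divergence(P, student_t_joint(Z)))
    return Z, history


# Toy label matrix (hypothetical): two groups of three identical label vectors.
Y = [[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3
Z, history = optimise_latent_codes(Y)
```

Because the label-space distribution is computed once and only the latent distribution is refreshed inside the loop, each iteration costs on the order of the squared number of training instances times the latent dimension.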

Dataset        type      n        d     K     card
Scene          image     2,407    294   6     1.074
Emotions       music     593      72    6     1.869
Yeast          biology   2,417    103   14    4.237
Mediamill      video     43,907   120   101   4.376
MSRC           image     591      512   23    2.508
SUNattribute   image     14,240   512   102   15.526
Table 1: Dataset statistics used in the experiments. n is the number of instances; d is the dimensionality of instances; K is the number of possible labels; card is the average number of labels per instance.
Figure 3: Exemplar images with corresponding labels from the MSRC, Pascal VOC and NUS-WIDE datasets respectively.

3 Experiments

In this section, we demonstrate the effectiveness of the proposed algorithm on six benchmark multi-label datasets including: Scene [4] with scene classes such as mountain, beach and field; Emotions [35] with music emotion labels; Yeast [13] with gene functional categories (e.g., metabolism, energy); Mediamill [32] with semantic concept labels (e.g., military, desert, and basketball); MSRC [43] with image class labels and SUNattribute [27] with image class labels.

The data for the first four datasets can be directly downloaded from the Mulan website, where the corresponding data features have been extracted. We also follow the training and test subsets provided with the releases of the four datasets. MSRC is a Microsoft research labeled image dataset with images. Each image is represented by bag-of-words features on sampled patches. The SUNattribute dataset contains images with GIST features. For MSRC and SUNattribute datasets, of the data from each class are used for training, while the rest are for testing. All experiments are repeated over 10 random training/test splits, and the average results are reported.

The statistical information of these datasets used for experiments is summarized in Table 1. From Table 1 we can observe that both label set sparsity (indicated by cardinality) and training data sparsity (estimated by ) are significant in these datasets. For example, each instance in the Scene dataset only has an average of labels out of all 6 candidate labels. And each set of labels for the Mediamill dataset is only occupied by as few as instances of the total number of training data.
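
The two sparsity indicators can be computed from a label matrix with a few lines of Python. Label cardinality follows the definition given with Table 1; measuring training data sparsity by the fraction of instances that share each distinct label set is an assumption made for this sketch, as are the function names and the toy matrix.

```python
from collections import Counter
from typing import Dict, List, Tuple


def label_cardinality(Y: List[List[int]]) -> float:
    """Average number of labels per instance."""
    return sum(sum(row) for row in Y) / len(Y)


def label_set_fractions(Y: List[List[int]]) -> Dict[Tuple[int, ...], float]:
    """Fraction of instances assigned to each distinct label set."""
    counts = Counter(tuple(row) for row in Y)
    return {labels: c / len(Y) for labels, c in counts.items()}


# Toy label matrix (hypothetical): 4 instances, 4 candidate labels.
Y = [[1, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]

card = label_cardinality(Y)        # label set sparsity indicator
density = card / len(Y[0])         # cardinality relative to the vocabulary size
fractions = label_set_fractions(Y)
rarest = min(fractions.values())   # share of data held by the rarest label set
```

A low cardinality relative to the number of candidate labels signals label set sparsity, while a small share for the rarest label sets signals training data sparsity.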

Experimental results show that the proposed approach outperforms state-of-the-art methods on all six datasets and manifests strong generalization ability across different types of labels. Analysis of the transformed latent code demonstrates that our approach can effectively preserve the distribution of the original label vectors while alleviating the sparsity problem.

3.1 Compared Methods and Evaluation Metrics

Compared Methods. To validate the performance of our proposed DLST, we compare it with the following representative and related multi-label learning algorithms:

  • BR [36]: Binary Relevance.

  • CPLST [9]: Conditional Principal Label Space Transformation.

  • FAIE [25]: Feature-aware Implicit Label Space Encoding.

  • MLLOC [18]: Multi-Label Learning using Local Correlation.

  • MC [6]: Matrix Completion.

  • MIML [39]: Mutual Information for Multi-Label Classification.

  • SLRM [20]: Semi-supervised Low-Rank Mapping.

  • MRV [50]: Manifold Regularized Vector-valued Multi-Label Learning.

  • LEML [45]: Large Scale Empirical Risk Minimization Method with Missing Labels.

BR is the baseline method, where each label is treated as an independent binary classification problem. CPLST, FAIE and MLLOC are label space reduction methods, which only use labeled instances as the training set. MC, MIML and SLRM utilize both labeled and unlabeled instances for training. Meanwhile, SLRM, MRV and LEML are specifically designed for multi-label learning with missing labels. In the experiments, we adopt LibSVM [7] as the binary classifier for BR. In the learning stage, both CPLST and FAIE are coupled with linear regression for multi-label prediction. Unless otherwise specified, we set the parameters of the compared methods according to what the authors suggested in the original papers or codes. As for DLST, a 10-fold cross-validation is performed by varying from to with a stepsize of . Results show that DLST yields stable performance around , which is used for DLST in the following experiments.

Evaluation Metrics. Performance evaluation for multi-label classification can be complicated since each instance is associated with a set of labels rather than a single one. Various metrics have been proposed based on the prediction likelihood with respect to each label, among which we adopt three widely-used evaluation metrics Average Precision, Micro F1 and Macro F1 to quantitatively compare the performance of these multi-label classification methods.

Average Precision (AP) evaluates the average fraction of relevant labels ranked ahead of a particular label. The larger the value of AP, the better the performance. Its formal definition can be found in [44].
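
For intuition, the ranking-based Average Precision commonly used in multi-label evaluation can be sketched as follows. The formal definition used in the experiments is the one in [44]; this sketch assumes the usual form in which, for every relevant label, the fraction of relevant labels ranked at or above it is averaged. The function name, the tie-breaking rule and the toy data are assumptions.

```python
from typing import List


def average_precision(Y_true: List[List[int]], scores: List[List[float]]) -> float:
    """Mean over instances of the average precision at each relevant label."""
    per_instance = []
    for truth, row in zip(Y_true, scores):
        # Rank labels by decreasing score; ties are broken by label index.
        order = sorted(range(len(row)), key=lambda j: (-row[j], j))
        rank = {label: pos + 1 for pos, label in enumerate(order)}
        relevant = [j for j, t in enumerate(truth) if t == 1]
        if not relevant:
            continue  # instances without relevant labels are skipped
        precisions = [
            sum(1 for r in relevant if rank[r] <= rank[j]) / rank[j]
            for j in relevant
        ]
        per_instance.append(sum(precisions) / len(precisions))
    return sum(per_instance) / len(per_instance)


# Toy example (hypothetical): 2 instances, 3 labels.
Y_true = [[1, 0, 1],
          [0, 1, 0]]
scores = [[0.9, 0.2, 0.4],
          [0.1, 0.3, 0.8]]
ap = average_precision(Y_true, scores)
```

In the toy example the first instance ranks both relevant labels on top (score 1.0) and the second ranks its only relevant label second (score 0.5), giving a mean of 0.75.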

Micro F1 and Macro F1 evaluate the micro average and macro average of the harmonic mean of precision and recall, respectively. As microaveraging and macroaveraging require binary indicator vectors, we consider the labels corresponding to the largest entries of the predicted vector as the predicted labels of each instance, where is set to be the average number of labels per instance. Therefore, from Table 1, for Scene, Emotions, Yeast, Mediamill, MSRC and SUNattribute is and respectively. The bigger the value of Micro F1 and Macro F1, the better the performance. Their formal definitions can be found in [37].
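
The binarization step and the two F1 averages can be sketched in Python as follows. The sketch assumes that the number of predicted labels per instance is a fixed integer (how the average cardinality is rounded is not specified here) and that ties between scores are broken by label index; the function names and toy data are illustrative.

```python
from typing import List, Tuple


def top_k_binarize(scores: List[List[float]], k: int) -> List[List[int]]:
    """Mark the k highest-scoring labels of each instance as predicted."""
    out = []
    for row in scores:
        top = set(sorted(range(len(row)), key=lambda j: (-row[j], j))[:k])
        out.append([1 if j in top else 0 for j in range(len(row))])
    return out


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(Y_true: List[List[int]],
                   Y_pred: List[List[int]]) -> Tuple[float, float]:
    n_labels = len(Y_true[0])
    counts = []  # (tp, fp, fn) per label
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(Y_true, Y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts over labels, then compute a single F1.
    micro = f1(sum(c[0] for c in counts), sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    # Macro: compute F1 per label, then average.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro


# Toy example (hypothetical): 2 instances, 3 labels, one predicted label each.
Y_true = [[1, 0, 1],
          [0, 1, 0]]
scores = [[0.9, 0.2, 0.4],
          [0.1, 0.3, 0.8]]
Y_pred = top_k_binarize(scores, k=1)
micro, macro = micro_macro_f1(Y_true, Y_pred)
```

Micro F1 is dominated by frequent labels because counts are pooled, whereas Macro F1 weights every label equally and is therefore more sensitive to rare labels.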

Average Precision
Scene 0.4306 0.4492 0.4501 0.4327 0.4580 0.4888 0.5082 0.5125 0.4718 0.5365
Emotions 0.2734 0.2958 0.3012 0.3146 0.3068 0.3235 0.3487 0.3574 0.3128 0.3864
Yeast 0.3225 0.3364 0.3425 0.3389 0.3567 0.3842 0.4005 0.4082 0.3620 0.4312
Mediamill 0.4086 0.4264 0.4265 0.4326 0.4509 0.4465 0.4691 0.4653 0.4324 0.4980
MSRC 0.3145 0.3281 0.3346 0.2070 0.2353 0.2801 0.3864 0.3725 0.3376 0.4016
SUNattribute 0.2876 0.3009 0.3387 0.2876 0.3052 0.2954 0.3286 0.3124 0.3092 0.3584
Micro F1
Scene 0.5987 0.6496 0.6528 0.6630 0.6282 0.6713 0.7022 0.7324 0.6825 0.7642
Emotions 0.3450 0.3633 0.3428 0.3596 0.3088 0.3694 0.4012 0.5424 0.5280 0.5642
Yeast 0.4435 0.4520 0.4631 0.4552 0.4328 0.4784 0.4950 0.6413 0.6086 0.6971
Mediamill 0.4234 0.5785 0.6422 0.6381 0.6273 0.6412 0.6476 0.5283 0.5562 0.6632
MSRC 0.4383 0.5109 0.5357 0.3692 0.4196 0.5538 0.5890 0.5726 0.3981 0.6235
SUNattribute 0.4425 0.4605 0.4936 0.4441 0.4670 0.4521 0.5043 0.4631 0.4430 0.4926
Macro F1
Scene 0.3153 0.3264 0.3358 0.3125 0.3562 0.3458 0.3745 0.3964 0.3692 0.4235
Emotions 0.1928 0.2034 0.2135 0.2234 0.2580 0.2542 0.2718 0.2826 0.2984 0.3260
Yeast 0.2436 0.2580 0.2624 0.2578 0.2842 0.2673 0.3016 0.3245 0.3326 0.3794
Mediamill 0.1150 0.0982 0.1302 0.1399 0.1269 0.1298 0.1413 0.1526 0.1254 0.1738
MSRC 0.3562 0.3317 0.3467 0.1048 0.2541 0.4083 0.4481 0.4468 0.3575 0.4738
SUNattribute 0.1842 0.2196 0.2630 0.1923 0.2507 0.2852 0.2687 0.2716 0.2535 0.3283
Table 2: Performance comparison for multi-label learning approaches on six datasets under different evaluation metrics. means the bigger the value, the better the performance.

3.2 Experimental Results

Quantitative results on all six datasets under three evaluation metrics are presented in Table 2. From Table 2, we can see that the proposed DLST performs better than or comparably to the other nine state-of-the-art and baseline methods across all 18 configurations (6 datasets × 3 evaluation metrics). The superior performance of DLST across all three evaluation measures justifies our motivation of exploiting label distribution preservation during label transformation. In the following, we present a more detailed comparison between DLST and the other three categories of multi-label learning methods.

DLST outperforms label space reduction methods (CPLST, FAIE, MLLOC) by as much as on the six datasets measured by average precision. This advantage demonstrates that DLST learns a higher quality latent code than the other three baselines in terms of approximating the original label space. Moreover, the latent space learned by DLST improves the original label space by revealing the comprehensive label correlations, thus alleviating the sparsity problem presented in multi-label classification.

Moreover, DLST shows better performance than semi-supervised multi-label classification methods (MC, MIML, SLRM). The three semi-supervised baselines utilize abundant unlabeled data in the training process, which is believed to boost performance. However, these methods require a large amount of training data to perform well. In contrast, DLST demonstrates comparable or even better performance with only 10% of the training data used by the compared semi-supervised baselines. The performance gain can be explained by the distribution employed in DLST, which can estimate and fill in the comprehensive label correlations given only a limited number of labeled training data.

Compared with baselines specifically targeting missing labels (SLRM, MRV, LEML), DLST almost always outperforms them by on the six datasets with respect to average precision criterion. These results corroborate the effectiveness of DLST in exploiting inter-label correlations. In subsection 3.5, we will further compare the performance of these methods under varying number of labels for each instance.

Dataset      type    n         d       K    card
Pascal VOC   image   9,963     2,048   20   1.560
NUS-WIDE     image   269,648   500     81   1.869
Table 3: Statistics of two large scale datasets used in the experiments.

3.3 The Benefit of Latent Code of DLST

In this subsection, to further study the superiority of DLST in learning a dense latent code, we conduct another set of experiments on two variants of DLST: DLST and DLST. DLST directly learns a regression function from the feature space to the original label space, while DLST predicts the original label of test instances based on ML-KNN using the feature vector of training instances. The performance of these methods is evaluated on three datasets: Scene, Emotions and Yeast. Similar to previous experimental settings, the training and test subsets provided along with each dataset are adopted. The mean value and standard deviation of DLST and its two variants under the three evaluation metrics are recorded in Table 4.

From Table 4, we can see that DLST outperforms the two variants by and respectively on the Scene dataset under average precision. This result verifies that sparsity in the original label space significantly deteriorates multi-label learning performance. It also suggests that the dense latent code learned by DLST is more effective in capturing inter-label correlations and more informative in predicting labels for multi-label instances.

Figure 4 shows the regression labels and nearest neighbor labels for images on MSRC dataset. Regression labels are produced by applying DLST, while nearest neighbor labels are obtained by utilizing DLST. The difference between Nearest Neighbor labels and NN labels-T lies in the space where the nearest neighbor search takes place: the former occurs in the original feature space, while the latter occurs in the transformed latent space. From Figure 4, it can be observed that DLST produces the most comprehensive label sets for multi-label images. The rationale is that the latent space derived by DLST captures the whole distribution of relative distances of any label pairs. Thus the probability of co-occurrence between labels can be predicted more accurately.

Figure 4: Regression labels and nearest neighbor labels obtained using methods DLST, DLST and DLST for images on MSRC dataset.
Methods   Scene           Emotions        Yeast
Average Precision
DLST      0.446 ± 0.005   0.297 ± 0.006   0.334
DLST      0.453 ± 0.020   0.304 ± 0.013   0.352 ± 0.009
DLST      0.536 ± 0.001   0.386 ± 0.002   0.431 ± 0.004
Micro F1
DLST      0.661 ± 0.008   0.465 ± 0.006   0.525 ± 0.003
DLST      0.690 ± 0.016   0.480 ± 0.014   0.594 ± 0.008
DLST      0.763 ± 0.004   0.564 ± 0.003   0.697 ± 0.002
Macro F1
DLST      0.306 ± 0.004   0.204 ± 0.008   0.253 ± 0.005
DLST      0.314 ± 0.012   0.218 ± 0.005   0.264 ± 0.008
DLST      0.423 ± 0.002   0.326 ± 0.001   0.379 ± 0.003
Table 4: Experimental results (mean ± std) of DLST and its two variants on Scene, Emotions and Yeast datasets across three evaluation metrics.
Figure 5: Exemplar images from classes that have (top) the highest per-class precision and (bottom) the lowest per-class precision on the Pascal VOC dataset.
Figure 6: The per-class precision and recall of DLST on NUS-WIDE dataset.
Methods plane bicycle bird boat bottle bus car cat chair cow table dog horse motor person plant sheep sofa train tv mAP
labels 1.2 1.9 1.1 1.4 2.4 2.0 1.7 1.4 2.5 1.4 2.8 1.6 1.9 1.9 2.0 2.3 1.3 2.4 1.3 2.2 -
samples 445 505 622 364 502 380 1536 676 1117 273 510 863 573 482 4192 527 195 727 522 534 -
HCP-1000C 95.1 90.1 92.8 89.9 51.5 80.0 91.7 91.6 57.7 77.8 70.9 89.3 89.3 85.2 93.0 64.0 85.7 62.7 94.4 78.3 81.5
CNN-RNN 96.7 83.1 94.2 92.8 61.2 82.1 89.1 94.2 64.2 83.6 70.0 92.4 91.7 84.2 93.7 59.8 93.2 75.3 99.7 78.6 84.0
DLST 95.5 93.1 92.4 91.8 90.2 83.4 73.6 85.4 67.5 85.3 84.2 80.4 84.6 84.2 40.8 85.2 84.7 83.2 82.8 74.6 79.6
DLST 96.8 95.2 93.4 96.3 95.1 96.4 84.2 93.3 88.4 97.2 94.8 91.8 94.3 94.8 51.5 94.8 98.2 93.0 94.2 94.6 85.7
Table 5: The per-class precision and mAP of DLST and compared methods on Pascal VOC dataset. The biggest and smallest number of labels-per-instance as well as samples-per-class are labeled in italic and underlined. The highest and lowest average precision-per-class are shown in boldface and italic.

3.4 Large Scale Datasets

To evaluate the performance of DLST on large scale datasets, we additionally employ two multi-label datasets: Pascal VOC2007 [14] and NUS-WIDE [12].

The Pascal Visual Object Classes (VOC) Challenge datasets have been widely used as benchmarks for multi-label classification. The VOC 2007 dataset contains images with labels. Each image in this dataset is represented by a -dimensional deep CNN feature generated by ResNet-50 [17] pretrained on the ImageNet database. Images in the train and validation subsets are used as training data, while images in the test subset are used as testing data. Therefore, the training/test split adopted in the experiment is images.

The NUS-WIDE dataset is a web image dataset that contains images and tags collected from Flickr. There are tags after removing noisy and rare tags. These images are further manually annotated into 81 concept groups, e.g., sunset, clouds, beach, mountain and animal, as shown in Figure 3. Bag-of-Words features based on SIFT descriptors are adopted to represent each image in this dataset. Among the images, images are utilized for training, and images are employed for testing. All experiments are conducted over 10 random training/test splits of the data, and the average performance is recorded. Table 3 summarizes more detailed characteristics of these two datasets, using the same notation as in Table 1.

Since the other baselines cannot deal with such large amounts of data (they run out of memory), we compare the proposed method with two state-of-the-art deep learning methods, HCP-1000C [42] and CNN-RNN [41], as well as a variant of our method, DLST. The precision and recall of the predicted labels are employed as evaluation metrics. For each image, precision is the number of correctly annotated labels divided by the number of generated labels, while recall is the number of correctly annotated labels divided by the number of ground-truth labels. We additionally compute the per-class precision and mAP for both datasets.
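The per-image precision and recall defined above can be illustrated with a minimal Python sketch. The function names, the toy label sets and the handling of empty sets are assumptions made for illustration. In addition, the per-class score shown here is a simple set-based precision, which may differ from the ranking-based average precision behind the mAP values reported in Table 5.

```python
from typing import Dict, List, Set, Tuple


def per_image_precision_recall(predicted: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Precision and recall for one image, following the definitions in the text."""
    correct = len(predicted & truth)
    # Assumption: an empty prediction or ground-truth set yields 0.0 instead of an error.
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


def per_class_precision(predictions: List[Set[str]], truths: List[Set[str]]) -> Dict[str, float]:
    """For each predicted label, the fraction of its predictions that are correct."""
    predicted_count: Dict[str, int] = {}
    correct_count: Dict[str, int] = {}
    for predicted, truth in zip(predictions, truths):
        for label in predicted:
            predicted_count[label] = predicted_count.get(label, 0) + 1
            if label in truth:
                correct_count[label] = correct_count.get(label, 0) + 1
    return {label: correct_count.get(label, 0) / n for label, n in predicted_count.items()}


# Toy data (hypothetical, not taken from the experiments).
predictions = [{"person", "car"}, {"sheep"}, {"person", "dog", "sofa"}]
truths = [{"person"}, {"sheep", "plant"}, {"dog", "sofa"}]

image_scores = [per_image_precision_recall(p, t) for p, t in zip(predictions, truths)]
class_scores = per_class_precision(predictions, truths)
print(image_scores)
print(class_scores)
```

In the first toy image, one of the two generated labels is correct and the single ground-truth label is recovered, so precision is 0.5 and recall is 1.0.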

From Table 5, we can observe that DLST achieves the highest mAP among the compared methods and the highest precision on most label classes. Moreover, it is interesting to note that DLST performs best on class sheep and worst on class person, which are respectively the classes with the fewest and the most training samples; per-class precision thus correlates negatively with the number of training instances for each class.

This can be explained from two aspects. On the one hand, classes with more training samples tend to have noisy correlations with other labels: as shown in Figure 5, images for person and car are either occluded or dominated by other objects in the image. As a result, the distances between such a label and all the other labels are roughly equal, leading to an approximately uniform distribution under DLST, and the discriminative information that helps to identify the class is overwhelmed by the noise present in a large volume of diverse training instances. On the other hand, classes with fewer training samples are likely to develop simple and clear relationships with a limited number of other labels, e.g., sheep and cow almost always co-occur with plant but seldom with boat. Thus, according to the distribution proposed by DLST, these labels have a prominently higher probability of co-occurring with a small group of specific labels, which dramatically reduces the difficulty of recognizing them in various images. In fact, this relationship holds as long as a roughly equal number of labels per instance is assigned to each class, as can be observed in the Pascal VOC dataset (the first row in Table 5). This observation further justifies the advantage of leveraging distribution to tackle sparsity in multi-label classification.
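As a rough sanity check of the negative relationship between training-set size and per-class precision, the sketch below computes a Pearson correlation from the numbers printed in Table 5 (the samples row and the last DLST row). The choice of Pearson correlation is an assumption made here for illustration and is not part of the reported evaluation; with only 20 classes and one dominant class (person), the coefficient should be read as indicative only.

```python
from math import sqrt

# Values copied from Table 5, in the table's column order:
# training samples per class and the per-class precision of the last DLST row.
samples = [445, 505, 622, 364, 502, 380, 1536, 676, 1117, 273,
           510, 863, 573, 482, 4192, 527, 195, 727, 522, 534]
precision = [96.8, 95.2, 93.4, 96.3, 95.1, 96.4, 84.2, 93.3, 88.4, 97.2,
             94.8, 91.8, 94.3, 94.8, 51.5, 94.8, 98.2, 93.0, 94.2, 94.6]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(samples, precision)
print(f"Pearson correlation between samples-per-class and precision: {r:.3f}")
```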

Figure 6 shows the per-class precision and recall of DLST on the NUS-WIDE dataset. It can be seen that DLST achieves high precision and low recall on classes with few labels; representative classes include computer, protest and wedding. The reason is that DLST tends to stop predicting more labels for sparsely correlated labels. On other classes with larger label cardinality, such as map, book and rainbow, DLST achieves low precision and high recall, which may be caused by the insufficient standard training data for these classes. There are also some classes that obtain comparable precision and recall, such as clouds, grass and person. Notably, these concepts are ubiquitous among the images in the NUS-WIDE dataset. Moreover, the mean precision and recall of DLST averaged over all classes are and respectively, which is and higher than the state-of-the-art CNN-RNN method. However, it is worth noting that this result is obtained by randomly selecting images for training and for testing, rather than using the whole dataset.

3.5 The Advantage of DLST in Tackling Sparsity

Sparsity I: Label Set Sparsity for Instances



Figure 7: AP on MSRC
Figure 8: AP on SUN
Figure 9: Macro F1 on MSRC
Figure 10: Macro F1 on SUN



Figure 11: Performance comparison of six methods with different proportions of missing labels on MSRC and SUN.

We investigate the performance of DLST in handling label set sparsity by conducting experiments with missing labels. This setting also facilitates comparison with other baselines. The experimental data are generated from the MSRC and SUN datasets, with 10% of the data from each class used for training and the remainder for testing. Labels are then removed from the labeled training instances. For each missing ratio, we randomly drop of the observed labels. To avoid empty classes or instances with no positive labels, at least one instance is kept for each class and at least one positive label is kept for each instance. Then the label vector for each training instance is reset according to the protocol: if the -th instance is assigned the -th label, and otherwise. Note that may indicate either a missing label or a negative label.
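The label-dropping protocol described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `drop_labels`, the 0/1 encoding (with a dropped label stored as 0, the same as a negative label), the shuffling strategy and the toy matrix are all assumptions.

```python
import random
from typing import List


def drop_labels(labels: List[List[int]], ratio: float, seed: int = 0) -> List[List[int]]:
    """Randomly hide a fraction of the positive entries of a 0/1 label matrix.

    A positive entry is dropped only if its instance (row) and its class
    (column) each keep at least one other positive entry.
    """
    rng = random.Random(seed)
    result = [row[:] for row in labels]
    positives = [(i, j) for i, row in enumerate(result) for j, v in enumerate(row) if v == 1]
    row_count = [sum(row) for row in result]
    col_count = [sum(col) for col in zip(*result)]
    target = int(ratio * len(positives))  # number of observed labels to drop
    rng.shuffle(positives)
    dropped = 0
    for i, j in positives:
        if dropped == target:
            break
        if row_count[i] > 1 and col_count[j] > 1:
            result[i][j] = 0  # assumption: a missing label is encoded like a negative one
            row_count[i] -= 1
            col_count[j] -= 1
            dropped += 1
    return result


# Toy label matrix (hypothetical): 4 instances, 4 classes.
full = [
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
]
observed = drop_labels(full, ratio=0.4, seed=42)
print(observed)
```

Because of the two constraints, the number of labels actually dropped can fall short of the requested ratio on very sparse matrices.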

The proposed method is compared with FAIE, MC, SLRM, MRV and LEML, and the results are shown in Figure 11. We can observe that DLST shows a notable performance gain over the other methods across all numbers of labels assigned to each instance. Also, the performance of DLST increases less than that of the compared methods as the average number of labels per instance increases. This is because label correlations become more prominent with an increasing number of labels per instance, which benefits most methods significantly. DLST, however, remains relatively stable and depends less on the extra given labels, since it captures the comprehensive relationship between labels through distribution alignment. As shown in Figure 11, the minimum number of labels per instance for DLST to obtain sufficient information (i.e., performance variance less than in AP) is and for MSRC and SUN datasets respectively, which is much less than for the compared baselines.

Sparsity II: Training Data Sparsity for Labels



Figure 12: AP on Mediamill
Figure 13: Macro F1 on Mediamill



Figure 14: Performance comparison of six methods with varying number of labeled training data on Mediamill.

We additionally investigate the effectiveness of DLST in tackling training data sparsity by varying the number of training instances for each class from to with a stepsize of . For a given percentage, the corresponding number of training instances is randomly sampled 10 times, and the resulting average precision is recorded. The experimental results of CPLST, MC, SLAM, FAIE and MIML are shown in Figure 14. Although the performance of all the methods degrades with a decreasing number of labeled training instances, DLST achieves relatively stable performance across all training ratios and consistently outperforms the compared baselines. This can be explained by the capability of DLST to exploit inter-label correlations encoded in the distribution. The compared methods usually need training instances to saturate; in contrast, DLST only requires as few as training instances per class to reach its highest classification performance (with less than performance variance). This verifies that by using distribution, the label correlations can be more effectively and efficiently exploited, thus remarkably reducing the requirement on training size.
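The repeated-sampling protocol above can be sketched in a few lines of Python. Everything named here is hypothetical: `averaged_score`, the `evaluate` callback (a stand-in for training a model on the chosen instances and returning a score such as average precision) and the toy class index are illustrative assumptions rather than the code behind Figure 14.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence


def averaged_score(
    instances_by_class: Dict[str, List[int]],
    per_class_sizes: Sequence[int],
    evaluate: Callable[[List[int]], float],
    repeats: int = 10,
    seed: int = 0,
) -> Dict[int, float]:
    """Average a score over repeated random draws of training instances per class."""
    rng = random.Random(seed)
    results: Dict[int, float] = {}
    for size in per_class_sizes:
        scores = []
        for _ in range(repeats):
            chosen: List[int] = []
            for ids in instances_by_class.values():
                # Never request more instances than a class actually has.
                chosen.extend(rng.sample(ids, min(size, len(ids))))
            # An instance may carry several labels, so remove duplicates.
            scores.append(evaluate(sorted(set(chosen))))
        results[size] = mean(scores)
    return results


# Toy index (hypothetical): three classes with four instance ids each.
toy_index = {"a": [0, 1, 2, 3], "b": [4, 5, 6, 7], "c": [8, 9, 10, 11]}
# Placeholder score: the fraction of all 12 instances used for training.
curve = averaged_score(toy_index, [1, 2, 4], evaluate=lambda ids: len(ids) / 12)
print(curve)
```

Plotting the returned dictionary against the per-class training size would give one curve of the kind compared across methods.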

3.6 Further Analysis

The effectiveness of our approach has been quantitatively evaluated in Table 2. We further present a qualitative analysis of the latent code learned by DLST to illustrate its capability of revealing label correlations as well as addressing the sparsity problems. In Figure 15, we show the average precision improvement for each class on the Mediamill dataset.

Figure 15: Top: The average precision improvement for each class on the Mediamill dataset. Bottom: Average number of concurrent labels for true positive instances for each label class. All sorted according to number of concurrent labels in ascending order.

To investigate the capability of DLST in tackling sparsity in the original label space, we study the average precision (AP) improvement for classes with different numbers of concurrent labels. From Figure 15 we can see that the AP improvement is more prominent for classes with a small average number of concurrent labels. Typical examples include classes such as car, grass, beach and waterscape, where an AP improvement of can be observed respectively. Although these labels are sparsely correlated with other labels for each instance, DLST is able to predict them with high precision by learning from the distribution over all possible labels. The result justifies our proposition that the distribution reveals more comprehensive information about label correlations, which can be utilized to enhance multi-label classification performance.
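The quantity used to sort the classes in Figure 15, the average number of concurrent labels per class, can be computed as in the sketch below. This is a simplified illustration under stated assumptions: it averages over all ground-truth positive instances of a class (the figure caption refers to true positive instances), it excludes the class itself from the count, and the function name and toy matrix are hypothetical.

```python
from typing import List


def average_concurrent_labels(labels: List[List[int]]) -> List[float]:
    """For each class, the average number of other labels on its positive instances."""
    n_classes = len(labels[0]) if labels else 0
    totals = [0] * n_classes
    counts = [0] * n_classes
    for row in labels:
        n_positive = sum(row)
        for j, value in enumerate(row):
            if value == 1:
                totals[j] += n_positive - 1  # exclude the class itself
                counts[j] += 1
    # Classes without any positive instance get 0.0 by convention (an assumption).
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Toy 0/1 label matrix (hypothetical): 4 instances, 3 classes.
toy = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
averages = average_concurrent_labels(toy)
# Class indices sorted in ascending order of concurrent labels.
order = sorted(range(len(averages)), key=lambda j: averages[j])
print(averages, order)
```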

4 Conclusion

To tackle label sparsity and exploit label correlations in multi-label classification, this paper proposes a distribution-based label space transformation method. By introducing the concept of distribution, more comprehensive relationships among the labels of training instances can be captured. A much denser latent code is learned, enabling the proposed model to cope with both label set sparsity and training data sparsity, where most multi-label classification methods fail to work effectively. The proposed model is especially successful in capturing a set of distinctive concurrent label patterns from a large label vocabulary, which offers significant benefits to multi-label classification. Extensive experimental results demonstrate that DLST is superior to state-of-the-art multi-label learning methods under various percentages of labeled data and missing labels.


  1. http://mulan.sourceforge.net/datasets-mlc.html
  2. http://research.microsoft.com/en-us/projects/ObjectClassRecognition
  3. https://cs.brown.edu/gen/sunattributes.html


  1. R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma, “Multi-label learning with millions of labels: recommending advertiser bid phrases for web pages,” in WWW, 2013, pp. 13–24.
  2. Z. Barutçuoglu, R. E. Schapire, and O. G. Troyanskaya, “Hierarchical multi-label prediction of gene function,” Bioinformatics, vol. 22, no. 7, pp. 830–836, 2006.
  3. W. Bi and J. T. Kwok, “Multilabel classification with label correlations and missing labels,” in AAAI, 2014, pp. 1680–1686.
  4. M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” PR, vol. 37, no. 9, pp. 1757–1771, 2004.
  5. J. K. Bradley and C. Guestrin, “Learning tree conditional random fields,” in ICML, 2010, pp. 127–134.
  6. R. S. Cabral, F. D. la Torre, J. P. Costeira, and A. Bernardino, “Matrix completion for weakly-supervised multi-label image classification,” TPAMI, vol. 37, no. 1, pp. 121–135, 2015.
  7. C. Chang and C. Lin, “LIBSVM: A library for support vector machines,” ACM TIST, vol. 2, no. 3, pp. 27:1–27:27, 2011.
  8. M. Chen, A. X. Zheng, and K. Q. Weinberger, “Fast image tagging,” in ICML, 2013, pp. 1274–1282.
  9. Y. Chen and H. Lin, “Feature-aware label space dimension reduction for multi-label classification,” in NIPS, 2012, pp. 1538–1546.
  10. W. Cheng and E. Hüllermeier, “Combining instance-based learning and logistic regression for multilabel classification,” ML, vol. 76, no. 2-3, pp. 211–225, 2009.
  11. C. K. Chow and C. N. Liu, “Approximating discrete probability distributions with dependence trees,” IEEE Trans. Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
  12. T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from national university of singapore,” in CIVR.
  13. A. Elisseeff and J. Weston, “A kernel method for multi-labelled classification,” in NIPS, 2001, pp. 681–687.
  14. M. Everingham, L. Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” Int. J. Comput. Vision, vol. 88, no. 2, pp. 303–338, 2010.
  15. J. Fürnkranz, E. Hüllermeier, E. Loza Mencía, and K. Brinker, “Multilabel classification via calibrated label ranking,” ML, vol. 73, no. 2, pp. 133–153, 2008.
  16. B. Hariharan, L. Zelnik-Manor, S. V. N. Vishwanathan, and M. Varma, “Large scale max-margin multi-label classification with priors,” in ICML, 2010, pp. 423–430.
  17. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  18. S. Huang and Z. Zhou, “Multi-label learning by exploiting label correlations locally,” in AAAI, 2012.
  19. J. Jiang, “Multi-label learning on tensor product graph,” in AAAI, 2012, pp. 956–962.
  20. L. Jing, L. Yang, J. Yu, and M. K. Ng, “Semi-supervised low-rank mapping learning for multi-label classification,” in CVPR, 2015, pp. 1483–1491.
  21. X. Kong, B. Cao, and P. S. Yu, “Multi-label classification by mining label and instance correlations from heterogeneous information networks,” in KDD, 2013, pp. 614–622.
  22. X. Kong, M. K. Ng, and Z. Zhou, “Transductive multilabel learning via label set propagation,” TKDE, vol. 25, no. 3, pp. 704–719, 2013.
  23. Q. Li, M. Qiao, W. Bian, and D. Tao, “Conditional graphical lasso for multi-label image classification,” in CVPR, 2016, pp. 2977–2986.
  24. X. Li, F. Zhao, and Y. Guo, “Multi-label image classification with A probabilistic label enhancement model,” in UAI, 2014, pp. 430–439.
  25. Z. Lin, G. Ding, M. Hu, and J. Wang, “Multi-label classification via feature-aware implicit label space encoding,” in ICML, 2014, pp. 325–333.
  26. W. Liu and I. W. Tsang, “On the optimality of classifier chain for multi-label classification,” in NIPS, 2015, pp. 712–720.
  27. G. Patterson, C. Xu, H. Su, and J. Hays, “The SUN attribute database: Beyond categories for deeper scene understanding,” International Journal of Computer Vision, vol. 108, no. 1-2, pp. 59–81, 2014.
  28. J. Read, “A pruned problem transformation method for multi-label classification,” in Proc. New Zealand Computer Science Research Student Conference, 2008, pp. 143–150.
  29. J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” ML, vol. 85, no. 3, pp. 333–359, 2011.
  30. R. E. Schapire and Y. Singer, “Boostexter: A boosting-based system for text categorization,” Machine Learning, vol. 39, no. 2/3, pp. 135–168, 2000.
  31. C. Snoek, M. Worring, J. C. van Gemert, J. Geusebroek, and A. W. M. Smeulders, “The challenge problem for automated detection of 101 semantic concepts in multimedia,” in ACM MM, 2006, pp. 421–430.
  32. C. G. M. Snoek, M. Worring, J. C. van Gemert, J.-M. Geusebroek, and A. W. M. Smeulders, “The challenge problem for automated detection of 101 semantic concepts in multimedia,” in ACM MM, 2006, pp. 421–430.
  33. F. Tai and H.-T. Lin, “Multilabel classification with principal label space transformation,” Neural Computation, vol. 24, no. 9, pp. 2508–2542, 2012.
  34. M. Tan, Q. Shi, A. van den Hengel, C. Shen, J. Gao, F. Hu, and Z. Zhang, “Learning graph structure for multi-label image classification via clique generation,” in CVPR, 2015, pp. 4100–4109.
  35. K. Trohidis, G. Tsoumakas, G. Kalliris, and I. Vlahavas, “Multi-label classification of music by emotion,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 4, no. 1, pp. 325–330, 2008.
  36. G. Tsoumakas, I. Katakis, and I. P. Vlahavas, “Mining multi-label data,” in Data Mining and Knowledge Discovery Handbook, 2nd ed., 2010, pp. 667–685.
  37. G. Tsoumakas and I. P. Vlahavas, “Random k -labelsets: An ensemble method for multilabel classification,” in ECML, 2007, pp. 406–417.
  38. L. van der Maaten and G. E. Hinton, “Visualizing high-dimensional data using t-sne,” JMLR, vol. 9, pp. 2579–2605, 2008.
  39. D. Vasisht, A. C. Damianou, M. Varma, and A. Kapoor, “Active learning for sparse bayesian multilabel classification,” in KDD, 2014, pp. 472–481.
  40. C. Wang, S. Yan, L. Zhang, and H. J. Zhang, “Multi-label sparse coding for automatic image annotation,” in CVPR, 2009, pp. 1643–1650.
  41. J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, “Cnn-rnn: A unified framework for multi-label image classification,” in CVPR, 2016, pp. 2285–2294.
  42. Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “CNN: single-label to multi-label,” CoRR, vol. abs/1406.5726, 2014.
  43. J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned universal visual dictionary,” in ICCV 2005, pp. 1800–1807.
  44. X. Wu and Z. Zhou, “A unified view of multi-label performance measures,” CoRR, vol. abs/1609.00288, 2016.
  45. H. Yu, P. Jain, P. Kar, and I. S. Dhillon, “Large-scale multi-label learning with missing labels,” in ICML, 2014, pp. 593–601.
  46. M. Zhang and K. Zhang, “Multi-label learning by exploiting label dependency,” in ACM SIGKDD, 2010, pp. 999–1008.
  47. M. Zhang and Z. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Pattern Recognition, vol. 40, no. 7, pp. 2038–2048, 2007.
  48. M. L. Zhang and Z. H. Zhou, “A review on multi-label learning algorithms,” TKDE, vol. 26, no. 8, pp. 1819–1837, 2014.
  49. Y. Zhang and J. G. Schneider, “Maximum margin output coding,” in ICML, 2012, pp. 1575–1582.
  50. F. Zhao and Y. Guo, “Semi-supervised multi-label learning with incomplete labels,” in IJCAI, 2015, pp. 4062–4068.
  51. T. Zhou, D. Tao, and X. Wu, “Compressed labeling on distilled labelsets for multi-label learning,” ML, vol. 88, no. 1-2, pp. 69–126, 2012.
  52. F. Zhu, H. Li, W. Ouyang, N. Yu, and X. Wang, “Learning spatial regularization with image-level supervisions for multi-label image classification,” in CVPR, 2017, pp. 5513–5522.