Deep Learning for Multi-label Classification

Jesse Read, Fernando Perez-Cruz

In multi-label classification, the main focus has been to develop ways of learning the underlying dependencies between labels, and to take advantage of this at classification time. Developing better feature-space representations has been predominantly employed to reduce complexity, e.g., by eliminating non-helpful feature attributes from the input space prior to (or during) training. This is an important task, since many multi-label methods typically create many different copies or views of the same input data as they transform it, and considerable memory can be saved by taking advantage of redundancy. In this paper, we show that a proper development of the feature space can make labels less interdependent and easier to model and predict at inference time. For this task we use a deep learning approach with restricted Boltzmann machines. We present a deep network that, in an empirical evaluation, outperforms a number of competitive methods from the literature.

I Introduction

Multi-label classification is the supervised learning problem where an instance may be associated with multiple labels. This is opposed to the traditional task of single-label classification (i.e., multi-class, or binary) where each instance is only associated with a single class label. The multi-label context is receiving increased attention and is applicable to a wide variety of domains, including text, audio data, still images and video, and bioinformatics; see [12, 22, 23] and the references therein.

The most well-known approach to multi-label classification is to simply train an independent classifier for each label. This is usually known in the literature as the binary relevance (BR) transformation, e.g., [22, 15]. Essentially, a multi-label problem is transformed into one binary problem for each label and any off-the-shelf binary classifier is applied to each of these problems individually. Practically all the multi-label literature identifies that this method is limited by the fact that dependencies between labels are not explicitly modelled and proposes algorithms to take these dependencies into account.

To date, many successful multi-label algorithms have been obtained by the so-called problem transformation methods (where the multi-label problem is transformed into several multi-class or binary problems), for example, [2, 5, 14, 24, 4]. These methods make many copies of the feature space in memory (or make many passes over it). Most of the highest performing methods also use ensembles, for example with support vector machines (SVMs) [14, 24], decision trees [18], probabilistic methods [26, 28] or boosting [17, 25].

That is to say, most competitive methods in the literature could benefit tremendously from more concise representations of the feature space, relatively much more so than in the single-label context: the initial investment in reducing the number of feature variables in a multi-label problem is much more likely to offer considerable speed-ups during learning and classification. However, relatively little work in the multi-label literature has considered this approach.

Using the raw instance data to construct a model makes the implicit assumption that the labels originate from this data and that they can be recovered directly from it. Usually, however, both the labels and the feature variables originate from particular abstract concepts. For example, we generally think of an image as being labelled beach, not because its pixel-data vector is beach-like, but rather because the image itself meets some criteria of our abstract idea of what a beach is. Ideally then, a feature set would include (for example) variables for a grainy surface such as sand or pebbles, and for being adjacent to a (significant) body of water. Hence, it is highly desirable to recover the hidden dependencies and structure from the original concepts behind the learning task. A good representation of these dependencies makes the problem easier to learn.

A Restricted Boltzmann Machine (RBM) [9] learns a layer of hidden features in an unsupervised fashion. This hidden layer can capture complex dependencies and structure from the input space, and represent it more compactly (whenever the number of hidden units is smaller than the number of original feature attributes). The methods we detail in this paper using RBMs offer some interesting benefits to multi-label classification in a variety of domains:

  • The predictive performance of existing state-of-the-art methods is generally improved.

  • Many classification paradigms previously relatively uncompetitive in multi-label learning can often obtain much higher predictive performance and become competitive and thus now offer their respective advantages to this context, such as better posterior-probability estimates, lower memory consumption, faster performance, easier implementation, and incremental learning.

  • The output feature space can be updated incrementally. This not only makes incremental learning feasible, but also means that cost savings are magnified for batch-learners that need to be retrained at intervals on new data.

  • The model can be built using unlabeled examples, which are typically obtained much more cheaply than labelled examples; especially in multi-label contexts, since examples are assigned multiple labels.

We also stack several RBMs to create two varieties of Deep Belief Networks (DBNs). We look at two approaches using DBNs. In a first approach, we learn the final layer together with the labels and use an existing multi-label classifier. In a second approach, we use back-propagation to fine-tune the weights of our neural network for discriminative prediction, and augment this with a second multi-label predictive layer.

We develop a framework to experiment with RBMs and DBNs in a variety of multi-label classification contexts. Within this framework we carry out an empirical evaluation with many different methods from the literature, on a collection of real-world datasets from diverse domains (to the best of our knowledge, this is also the largest and most varied collection of datasets analysed with an RBM framework). The results indicate the benefits of this style of learning for multi-label classification.

II Prior Work

Multi-label datasets and classification methods have rapidly become more numerous in recent years, and classification performance has steadily improved. An overview of the most well known and influential work in this area is provided in [22, 12].

The binary relevance approach (BR) does not obtain high predictive performance because it does not model dependencies between labels. A number of methods have improved on this predictive performance by modelling label dependence.

A well-known alternative is the label powerset (LP) method [23], which transforms the multi-label problem into a single-label problem with a single class variable whose possible values are the powerset of the labels (i.e., all possible label combinations). In LP, label dependencies are modelled directly and predictive performance is greater than that of BR, but computational complexity is too high for most practical applications. The complexity issue has been addressed in works such as [24] and [13]. The former presents RAkEL (RAndom k-labEL sets), an ensemble method that selects subsets of labels and uses LP to learn each of these subproblems.
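As a rough illustration of the LP transformation, the following Python sketch maps each distinct label vector to a single class value and back. The function names are illustrative and are not taken from [23] or from any library. The sketch assumes that only the label combinations actually observed in the training data receive a class value.

```python
from typing import Dict, List, Sequence, Tuple

Labelset = Tuple[int, ...]


def label_powerset_encode(Y: Sequence[Sequence[int]]) -> Tuple[List[int], Dict[Labelset, int]]:
    """Map every distinct label vector in Y to one class id (illustrative helper)."""
    class_of: Dict[Labelset, int] = {}
    encoded: List[int] = []
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            # A new label combination becomes a new class value.
            class_of[key] = len(class_of)
        encoded.append(class_of[key])
    return encoded, class_of


def label_powerset_decode(class_id: int, class_of: Dict[Labelset, int]) -> List[int]:
    """Recover the label vector that a predicted class id stands for."""
    for labelset, cid in class_of.items():
        if cid == class_id:
            return list(labelset)
    raise KeyError(class_id)


Y = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
classes, class_of = label_powerset_encode(Y)
```

Any single-label multi-class learner can then be trained on `classes`; a predicted class id is turned back into a label vector with `label_powerset_decode`.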

The classifier chain approach (CC) [15] has received recent attention, for example in [3] and [26]. This method employs one classifier for each label, like BR, but the classifiers are not independent. Rather, each classifier predicts the binary relevance of each label given the input space plus the predictions of the previous classifiers (hence the chain).
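The chaining idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation of [15]: the base learner `train_1nn` is a stand-in (a 1-nearest-neighbour rule) chosen only to keep the example self-contained, and all function names are hypothetical. During training each link sees the true values of the earlier labels; at prediction time it sees the earlier predictions.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Classifier = Callable[[Vector], int]


def train_1nn(X: List[Vector], y: List[int]) -> Classifier:
    """Stand-in binary base learner (1-nearest neighbour); an assumption of this sketch."""
    def predict(x: Vector) -> int:
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
        return y[dists.index(min(dists))]
    return predict


def train_chain(X: List[Vector], Y: List[List[int]], order: Sequence[int],
                train: Callable[[List[Vector], List[int]], Classifier] = train_1nn) -> List[Classifier]:
    """Train one classifier per label, following the given label order."""
    chain = []
    for pos, j in enumerate(order):
        # Inputs for label j: the original features plus the true values of earlier labels.
        X_aug = [list(x) + [float(y[k]) for k in order[:pos]] for x, y in zip(X, Y)]
        chain.append(train(X_aug, [y[j] for y in Y]))
    return chain


def predict_chain(chain: List[Classifier], order: Sequence[int], x: Vector) -> List[int]:
    """Predict labels one at a time, feeding each prediction to the later links."""
    y_hat = [0] * len(order)
    x_aug = list(x)
    for clf, j in zip(chain, order):
        y_hat[j] = clf(x_aug)
        x_aug.append(float(y_hat[j]))
    return y_hat


X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]
order = [1, 0]  # label 1 is predicted first, then label 0
chain = train_chain(X, Y, order)
prediction = predict_chain(chain, order, [1.0, 0.0])
```

An ensemble of chains would repeat this with a different random `order` for each member and combine the predictions by voting.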

Another type of binary-classification approach is the pairwise transformation method (PW), where a binary model is trained for each pair of labels. The predictions result more naturally in a set of pairwise preferences than a multi-label prediction (thus becoming popular in ranking schemes), but PW methods can be adapted to make multi-label predictions, for example [5]. These methods perform well in several domains, although their application can easily be prohibitive on many datasets due to their quadratic complexity.

An alternative to problem transformation is algorithm adaptation, where a specific single-label method is adapted directly for multi-label classification. MLkNN [30] is a k-nearest neighbours method adapted for multi-label learning by voting from the labels found in the neighbours. IBLR [4] is a related method that also incorporates a second layer of logistic regression. BPMLL [29] is a back-propagation neural network adapted for multi-label classification by having multiple binary outputs as the label variables.

Processing the feature space of multi-label data has already been studied in the literature. [20] presents an overview of the main techniques with respect to problem transformation methods. In [27] a clustering-based supervised approach is used to obtain label-specific features for each label. The advantages of this method are reduced where label-relevances are not trained separately, for example in LP methods (which learn all labels together as a single multi-class meta label). In any case, this is a meta technique that can easily be applied independently of other preprocessing and learning techniques, such as the one we describe in this paper.

In [25] redundancy is eliminated from the learning space of the BR method by taking random subsets of the training space across an ensemble. This work centers on the fact that a standard BR approach considers the full input space for each label, even though only a subset of the variables may be relevant to any particular label. Compressive sensing techniques have also been used in the literature for reducing the complexity of multi-label data by taking advantage of label sparsity [21, 11].

These methods are mainly motivated by reducing an algorithm’s running-time by reducing the number of feature variables in the input space, rather than learning or modelling the dependencies between them. More examples of feature-space reduction for multi-label classification are reviewed in [22].

The authors of [7] use a fully-connected network closely related to a Boltzmann machine for multi-label classification, using Gibbs sampling for inference. They use this network to model dependencies in the label space for prediction, rather than to improve the feature space. Since this is a fully connected network, it is tractable only for problems with a relatively small number of labels.

Figure 1 roughly illustrates the way some of the different classifiers model correlations among attributes and labels, assuming a linear base classifier.

Fig. 1: A network view of various classifiers; the connections among features and labels. Panels: (a) BR, (b) CC, (c) LP, (d) PW, CDN.

III Deep Learning with Restricted Boltzmann Machines

A well-known approach to deep learning is to model each layer of higher level features in a restricted Boltzmann machine [9]. We base our approaches on this strategy.

III-A Preliminaries

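In all that follows, $\mathcal{X}$ is the input domain of all possible feature values. An instance is represented as a vector of $d$ feature values $\mathbf{x} = [x_1, \ldots, x_d] \in \mathcal{X}$. The set $\mathcal{L} = \{1, \ldots, L\}$ is the output domain of $L$ possible labels. Each instance $\mathbf{x}$ is associated with a subset of these labels, typically represented by a binary vector $\mathbf{y} = [y_1, \ldots, y_L]$, where $y_j \in \{0, 1\}$; i.e., $y_j = 1$ if and only if the $j$th label is associated with instance $\mathbf{x}$, and $y_j = 0$ otherwise.

We assume a set of training data of $N$ labelled examples $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$; $\mathbf{y}^{(i)}$ is the label vector (labelset) assignment of the $i$th example; $y_j^{(i)}$ is the relevance of the $j$th label to the $i$th example.

In the BR context, for example, $L$ binary classifiers $h_1, \ldots, h_L$ are trained, where each $h_j$ models the binary problem relating to the $j$th label, such that

$$\mathbf{h}(\tilde{\mathbf{x}}) = [h_1(\tilde{\mathbf{x}}), \ldots, h_L(\tilde{\mathbf{x}})]$$

outputs prediction vector $\hat{\mathbf{y}}$ for any test instance $\tilde{\mathbf{x}}$.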
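The BR transformation amounts to splitting the label matrix column by column. The Python sketch below shows only this data transformation; the function name is illustrative, and training the per-label binary classifiers is left to whatever base learner is preferred.

```python
from typing import List, Sequence, Tuple


def binary_relevance_split(
    X: Sequence[Sequence[float]], Y: Sequence[Sequence[int]]
) -> List[Tuple[List[Sequence[float]], List[int]]]:
    """Turn one multi-label dataset into one binary dataset per label."""
    num_labels = len(Y[0])
    problems = []
    for j in range(num_labels):
        # Every binary problem sees the full feature matrix; only the target column changes.
        targets = [row[j] for row in Y]
        problems.append((list(X), targets))
    return problems


X = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
Y = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
problems = binary_relevance_split(X, Y)
```

Note that the feature matrix is repeated once per label, which is the copying of the input space that the introduction refers to.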

III-B Restricted Boltzmann Machines

A Boltzmann machine is a type of fully-connected neural network that can be used to discover the underlying regularities of the (observed) training data [1]. When many features are involved, this type of network is only tractable in the restricted Boltzmann machine setting [9], where units are fully connected between layers, but are unconnected within layers.

An RBM learns a layer of $H$ hidden feature variables from the original $d$ feature variables of a training set (usually $H < d$). These hidden variables can provide a compact representation of the underlying patterns and structure of the input. In fact, an RBM can capture $2^H$ input space regions, whereas standard clustering requires $O(2^H)$ parameters and examples to capture this much complexity.

Figure 2 shows an RBM as a graphical model with two sets of nodes: visible ($x$-variables, shaded) and hidden ($z$-variables). Each $x_j$ is connected to all $z_k$ by weight $W_{jk}$ (the same for both directions).

Fig. 2: An RBM with 5 input units and 3 hidden units. Each edge is associated with a weight $W_{jk}$, which together make up weight matrix $\mathbf{W}$.

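RBMs are energy-based models, where the joint probability of the visible units $\mathbf{x}$ and hidden units $\mathbf{z}$ is determined by the energy between them:

$$p(\mathbf{x}, \mathbf{z}) = \frac{e^{-E(\mathbf{x}, \mathbf{z})}}{Z}, \qquad Z = \sum_{\mathbf{x}', \mathbf{z}'} e^{-E(\mathbf{x}', \mathbf{z}')}$$

Hence, by manipulating the energy we can in turn generate the probability $p(\mathbf{x}, \mathbf{z})$. Specifically, we minimize the energy

$$E(\mathbf{x}, \mathbf{z}) = -\mathbf{x}^\top \mathbf{W} \mathbf{z}$$

(bias terms omitted) by learning the weight matrix $\mathbf{W}$ to find low energy states. Contrastive divergence [8] is typically used for this task.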
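To make the energy and the contrastive divergence update more concrete, here is a minimal pure-Python sketch. It assumes binary units and the common bias-free energy E(x, z) = -x^T W z, and it performs a single CD-1 step on one example; momentum and weight cost, which the experiments mention, are deliberately left out. The function names are illustrative and do not come from the MEKA implementation.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def energy(x: List[float], z: List[float], W: Matrix) -> float:
    """Bias-free RBM energy: E(x, z) = -sum_j sum_k x_j * W_jk * z_k."""
    return -sum(x[j] * W[j][k] * z[k] for j in range(len(x)) for k in range(len(z)))


def hidden_probs(x: List[float], W: Matrix) -> List[float]:
    """p(z_k = 1 | x) = sigmoid(sum_j x_j * W_jk)."""
    return [sigmoid(sum(x[j] * W[j][k] for j in range(len(x)))) for k in range(len(W[0]))]


def visible_probs(z: List[float], W: Matrix) -> List[float]:
    """p(x_j = 1 | z) = sigmoid(sum_k W_jk * z_k)."""
    return [sigmoid(sum(W[j][k] * z[k] for k in range(len(z)))) for j in range(len(W))]


def cd1_update(x: List[float], W: Matrix, lr: float, rng: random.Random) -> None:
    """One CD-1 step on a single example; W is updated in place."""
    h0 = hidden_probs(x, W)
    z0 = [1.0 if rng.random() < p else 0.0 for p in h0]  # sample hidden states
    x1 = visible_probs(z0, W)                            # reconstruction (probabilities)
    h1 = hidden_probs(x1, W)
    for j in range(len(x)):
        for k in range(len(h0)):
            # Data-driven (positive) statistics minus reconstruction (negative) statistics.
            W[j][k] += lr * (x[j] * h0[k] - x1[j] * h1[k])


W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 visible units, 2 hidden units
cd1_update([1.0, 1.0, 0.0], W, lr=0.1, rng=random.Random(0))
```

After this one step the weights attached to the two active inputs have increased and those attached to the inactive input have decreased, which lowers the energy of configurations resembling the training example.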

III-C Deep Belief Networks

RBMs can be stacked to form so-called DBNs [9]. The RBMs are trained greedily: the first RBM takes the input space $\mathbf{x}$ and produces output $\mathbf{z}^{(1)}$, then the second RBM treats $\mathbf{z}^{(1)}$ as if it were the input space, and produces $\mathbf{z}^{(2)}$, and so on.
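The greedy layer-wise procedure can be sketched as a simple loop. In the Python sketch below, `train_rbm` is a hypothetical callable standing in for any RBM trainer, and `zero_rbm` is a placeholder that returns zero weights of the right shape so that the example runs on its own; neither is part of the original work.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]
Data = List[List[float]]


def transform(data: Data, W: Matrix) -> Data:
    """Hidden activation probabilities of one layer for every row of data."""
    return [
        [1.0 / (1.0 + math.exp(-sum(x[j] * W[j][k] for j in range(len(x)))))
         for k in range(len(W[0]))]
        for x in data
    ]


def train_stack(data: Data, hidden_sizes: List[int],
                train_rbm: Callable[[Data, int], Matrix]) -> List[Matrix]:
    """Greedy stacking: each layer is trained on the output of the layer below."""
    layers: List[Matrix] = []
    current = data
    for size in hidden_sizes:
        W = train_rbm(current, size)     # unsupervised: labels are never used here
        layers.append(W)
        current = transform(current, W)  # the output of this layer feeds the next one
    return layers


def zero_rbm(data: Data, size: int) -> Matrix:
    """Placeholder trainer (an assumption of this sketch): zero weights of the right shape."""
    return [[0.0] * size for _ in data[0]]


data = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
layers = train_stack(data, [2, 2], zero_rbm)
```

With a real trainer in place of `zero_rbm`, the final call to `transform` yields the top-level representation that is handed to the classification stage.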

When used for single-label classification, the final output layer is typically a softmax function (which is appropriate where only one of the output units should be on, to indicate one of $K$ classes). In the following section we outline our approach, creating DBNs suitable for multi-label classification.

IV Deep Belief Networks (DBNs) for Multi-label Classification

Ideally, an RBM would produce hidden variables that correspond directly to the label variables, and thus we could recover the label vector directly given any input vector; i.e., $\mathbf{z} = \mathbf{y}$, or $\mathbf{z}$ deterministically mappable to $\mathbf{y}$. Unfortunately, this is seldom the case, because the abstract hidden variables do not need to correspond directly to the labels. However, we should expect the hidden layer of data to be more closely related to the labels than the original data, and thus it makes sense to use it as a feature space to classify instances.

Hence, by using the hidden space created by the RBM, we would expect any multi-label classifier to obtain better performance (than when using the original feature space). We do this simply by using the hidden representation $\mathbf{z}^{(i)}$ of each instance as the input feature space, and associating it with the labels to create the training set $\{(\mathbf{z}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$. We can then train any multi-label classifier $h$ on this dataset. To evaluate a test instance $\tilde{\mathbf{x}}$, we feed it through the RBM and obtain $\tilde{\mathbf{z}}$ from the upper layer, and then acquire a prediction $\hat{\mathbf{y}} = h(\tilde{\mathbf{z}})$, and so on for each test instance.

From here we take two approaches. Since the sub-optimality produced by greedy learning is not necessarily harmful to many discriminative supervised methods [10], we can treat the final hidden layer variables as the feature input variables, and train any off-the-shelf multi-label model $h$ that can predict

$$\hat{\mathbf{y}} = h(\tilde{\mathbf{z}})$$

where $\tilde{\mathbf{z}}$ is produced by the RBM for some test instance $\tilde{\mathbf{x}}$; see Figure 3(a).

In a second approach, we add a final layer of weights on top; see Figure 3(b). Now, the structure is similar to the neural network of BPMLL [29], except that we create the layers and initialize the weights using RBMs. Later we will show that our method performs much better. We can employ back propagation to fine-tune the network in a supervised fashion (with respect to label assignments) as in, for example, [9] (for single-label classification). For a number of epochs, each training instance is propagated forward (upward) through the network and output as the prediction $\hat{\mathbf{y}}$. The errors are then propagated backward through the network, updating the weights (previously initialized by the RBMs). Due to the initialisation with RBMs, far fewer epochs are required than would usually be typical for back propagation (and we actually observed that more than around epochs tends to result in overfitting).
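The supervised update at the top of such a network can be sketched as follows. This is a deliberately reduced Python illustration and not the full procedure: it updates only the final weight layer, it assumes sigmoid output units with a cross-entropy error (the text does not state which error function is used), and the function names are hypothetical. Full back propagation would additionally pass the error down to the RBM-initialised layers.

```python
import math
from typing import List

Matrix = List[List[float]]


def forward(z: List[float], V: Matrix) -> List[float]:
    """Sigmoid outputs, one per label, from the top hidden representation z."""
    return [
        1.0 / (1.0 + math.exp(-sum(z[h] * V[h][l] for h in range(len(z)))))
        for l in range(len(V[0]))
    ]


def update_output_layer(z: List[float], y: List[int], V: Matrix, lr: float) -> List[float]:
    """One gradient step on the final layer only; V is updated in place."""
    y_hat = forward(z, V)
    for h in range(len(z)):
        for l in range(len(y)):
            # For sigmoid outputs with cross-entropy error, the gradient with
            # respect to V[h][l] is (y_hat[l] - y[l]) * z[h].
            V[h][l] += lr * (y[l] - y_hat[l]) * z[h]
    return y_hat


V = [[0.0, 0.0], [0.0, 0.0]]  # 2 hidden units, 2 labels
before = update_output_layer([1.0, 0.0], [1, 0], V, lr=0.5)
after = forward([1.0, 0.0], V)
```

After the step, the output for the relevant label moves towards 1 and the output for the irrelevant label moves towards 0.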

In both these approaches it is possible to add more depth by including an additional classification layer. In the multi-label context, this has previously been done to the basic BR method in [6], where a second BR is trained on the outputs of the first (a stacking approach). A related technique in the neural network context, often called a “skip layer”, has been used in, e.g., [19, 16]. In our case we allow for generic classifiers. This helps add some further discriminative power for taking into account the dependencies in the label space.

Fig. 3: DBNs for multi-label classification. In (a), the output space (second hidden layer) can be trained with the label space by any multi-label classifier. In (b), the labels are predicted directly in a third hidden layer.

Note that we have also experimented with a DBN that models the instance space and label space together generatively, $p(\mathbf{x}, \mathbf{y})$. In the multi-label setting this complicates the inference, since there are $2^L$ possible $\mathbf{y}$. We tried using Gibbs sampling, but could not obtain competitive results from this model in the multi-label setting compared to our other approaches (even after reducing $\mathbf{x}$ in an RBM first). However, this seems like an interesting direction, and we intend to follow this idea further in future work.

V Experiments

We carry out an empirical evaluation to gauge the effectiveness and efficiency of RBMs and DBNs in a number of different multi-label classification scenarios, using different learning algorithms and a wide collection of datasets. We have implemented these methods in the MEKA framework, an open-source Java-based framework with a number of important benchmark multi-label methods. In this framework RBMs can easily be used in a wide variety of multi-label schemes. The source code of our implementations will be made available as part of the MEKA framework.

We selected commonly-used datasets from a variety of domains, listed in Table I along with some basic statistics about them. The datasets vary considerably with respect to the type of data, and their dimensions (the number of labels, features, and examples). In Music, instances of music are associated with emotions; in Scene, images belong to categories; in Yeast proteins may be associated with multiple biological functions, and in Genbase gene sequences. Medical, Enron and Reuters are text datasets where text documents are associated with categories. These datasets are described in greater detail in [12].

Dataset   Examples  Labels  Features  LC    Type
Music          593       6        72  1.87  audio
Scene         2407       6       294  1.07  image
Yeast         2417      14       103  4.24  biology
Genbase        661      27      1185  1.25  biology
Medical        978      45      1449  1.25  medical/text
Enron         1702      53      1001  3.38  e-mail/text
Reuters       6000     103       500  1.46  news/text
TABLE I: A collection of multi-label datasets and associated statistics, where LC is label cardinality: the average number of labels relevant to each example.

V-A RBM performance

We first evaluate the effect of introducing an RBM, trained blindly (i.e., without label information), to reduce the input dimension, and then apply three of the common paradigms in multi-label classification (namely BR, LP and PW) to test the improvements offered by this feature extraction algorithm. The RBM will improve the performance of the multi-label classification paradigms if the extracted features are relevant for describing the task at hand, and will have a neutral or negative effect if the blindly extracted features do not correspond to features relevant for assigning labels.

The RBM has several parameters that need to be fine-tuned (i.e. number of hidden units, learning rate and momentum) and we use three-fold cross validation to set them. We considered the number of hidden units , the learning rate , and momentum . We used weight costs of and epochs throughout.

V-A1 Ensemble of Classifier Chains

CC is a competitive BR method that uses the chain rule to improve the prediction for each potential label. As it is unclear what the best ordering should be, we use an ensemble of 50 CC models (ECC), in which the labels are randomly ordered in each realization (as in [15]). In Table II(a), we report the performance of our multi-label classifiers in terms of accuracy, as defined in [23, 6, 15, 13]. A variety of multi-label evaluation measures are used in the literature, and [22] provides an overview of some of the most popular; accuracy provides a good balance to gauge the overall predictive performance of multi-label methods [12, 15]. It is defined as

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\mathbf{y}^{(i)} \wedge \hat{\mathbf{y}}^{(i)}|}{|\mathbf{y}^{(i)} \vee \hat{\mathbf{y}}^{(i)}|}$$

where $\mathbf{y}^{(i)}$ and $\hat{\mathbf{y}}^{(i)}$ are the true and predicted label vectors of the $i$th of $N$ test examples, and $\wedge$ and $\vee$ are the bitwise AND and OR functions, respectively, for $\{0, 1\}^L$.
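The accuracy measure can be computed directly from binary label vectors. The Python sketch below assumes the usual reading of the definition, namely the ratio of the number of labels in the bitwise AND to the number in the bitwise OR of the true and predicted vectors, averaged over examples. Scoring an example as 1 when both vectors are empty is an assumption of this sketch, since the text does not cover that case.

```python
from typing import Sequence


def multilabel_accuracy(Y_true: Sequence[Sequence[int]], Y_pred: Sequence[Sequence[int]]) -> float:
    """Average over examples of |y AND y_hat| / |y OR y_hat|."""
    total = 0.0
    for y, y_hat in zip(Y_true, Y_pred):
        both = sum(1 for a, b in zip(y, y_hat) if a == 1 and b == 1)
        either = sum(1 for a, b in zip(y, y_hat) if a == 1 or b == 1)
        # Assumption: an example with no true and no predicted labels counts as correct.
        total += 1.0 if either == 0 else both / either
    return total / len(Y_true)


Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0]]
score = multilabel_accuracy(Y_true, Y_pred)
```

In this example the first prediction recovers one of two relevant labels (ratio 1/2) and the second is exact (ratio 1), so the score is 0.75.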

Dataset   SVM (RBM)  SVM (original)  Log-Reg (RBM)  Log-Reg (original)
Music     0.581      0.576           0.558          0.504
Scene     0.731      0.710           0.709          0.554
Yeast     0.532      0.535           0.513          0.504
Genbase   0.979      0.981           0.971          0.977
Medical   0.695      0.770           0.449          0.706
Enron     0.469      0.454           0.451          0.355
Reuters   0.459      0.461           0.408          0.376
(a) We report the accuracy for SVM and logistic regression based multi-label classifiers, with RBM-generated features (RBM) and with the original input space (original).

          SVMs                                    Log. Reg.
Dataset   Learning rate  Momentum  Hidden units   Learning rate  Momentum  Hidden units
Music     0.1            0.2       120            0.1            0.8       30
Scene     0.1            0.8       240            0.1            0.8       60
Yeast     0.01           0.2       120            0.01           0.2       30
Genbase   0.1            0.8       120            0.1            0.4       60
Medical   0.1            0.6       120            0.1            0.6       120
Enron     0.1            0.6       120            0.1            0.6       120
Reuters   0.1            0.6       120            0.1            0.6       120
(b) The parameters chosen for ECC on the first of the two folds (using an internal train/test set of the training set). Parameters for the second fold of each dataset were invariably similar or identical.
TABLE II: We compare ECC with and without feature extraction using RBMs.

In Table II(a), we report the accuracy of ECC with the RBM-generated features and with the original input space. We have used two different base classifiers, a nonlinear SVM and logistic regression (a linear classifier), both trained with the default parameters in WEKA. For the logistic regression classifier, the accuracy achieved with the RBM-generated features is significantly better on the Music, Scene, Enron and Reuters datasets; it underperforms only on the Medical dataset, and the two are comparable on the Yeast and Genbase datasets. The RBM not only reduces the dimensionality of the input space for the classifier, but also makes the features suitable for linear classifiers, which allows us to interpret the RBM features and understand how each of them participates in the prediction for each label.

For the SVM-based ECC classifiers there is no significant difference between using the RBM-processed features and using the raw data directly, as the RBF kernel in the SVM can compensate for the preprocessing done by the RBM. In this case, almost all the results are comparable, except for Scene and Medical, on which, respectively, ECC with RBM features and ECC on the original features performs better. We should remark that linear logistic regression is as good as the nonlinear SVM in most cases, so it seems that using the RBM features reduces the input dimension and makes the classification problem easier, as a linear classifier performs as well as a state-of-the-art nonlinear classifier.

In Figure 4 we show the accuracy on the seven datasets for ECC with and without RBM features, using an SVM base classifier, as a function of the number of hidden units of the RBM. In this plot, it can be seen that once we have enough features, using the RBM is comparable to not using it. It is also clear that for the Medical dataset the number of features is too small, and we would have needed to increase the number of extracted features to achieve the same performance as the SVM does on the original features. (We did not do so, to keep the experimental setting uniform for all proposed methods, as we think it is important that hyper-parameter settings be general and not finely tuned for each application.)

Fig. 4: The number of hidden units (horizontal axis) and corresponding accuracy as compared to accuracy with the same methods on the original feature space (horizontal lines). For , .
Fig. 5: The difference in accuracy (shown here on Music and Medical datasets) between baseline BR (dashed lines) and more-advanced CC (solid lines) – both built on RBM-produced outputs – decreases with more hidden units (horizontal axis). For , .
Learning rate  Momentum  Accuracy
0.001          0.2       0.707
0.001          0.4       0.705
0.001          0.8       0.705
0.01           0.2       0.710
0.01           0.4       0.714
0.01           0.8       0.720
0.1            0.2       0.726
0.1            0.4       0.727
0.1            0.8       0.726
TABLE III: The accuracy of ECC, with an SVM base classifier, for a fixed number of hidden units, and for varying learning rate and momentum.

Finally, in Table III we show the accuracy for the SVM-based classifier for the Scene dataset for all the tested combinations of the learning rate and the momentum, in which the number of hidden units is fixed to . The accuracy for the ECC (without RBM generated features) is and in this case any combination of learning rate and momentum does better, which indicates that with a sufficient number of hidden units, the RBM learning is quite robust and not overly sensitive to hyperparameter settings.

V-A2 RAndom K labEL subsets

RAkEL is a truncated powerset method in which we model all combinations of 3 labels at a time, and we report results for an ensemble of such classifiers. We use the same hyperparameter settings as for ECC, reported in Table II(b), to make the results comparable across multi-label classification paradigms, and we report the accuracy in Table IV.
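The mechanics of a random k-labelset ensemble can be sketched as follows. This Python sketch covers only the drawing of label subsets, the projection of the label matrix onto a subset, and the combination of votes; the LP learner trained on each subset is omitted. The function names and the voting threshold of 0.5 are assumptions of the sketch, not details taken from [24].

```python
import math
import random
from typing import List, Sequence, Tuple


def draw_labelsets(num_labels: int, k: int, m: int, rng: random.Random) -> List[Tuple[int, ...]]:
    """Draw m distinct random subsets of k labels, one per ensemble member."""
    if m > math.comb(num_labels, k):
        raise ValueError("more subsets requested than exist")
    seen = set()
    while len(seen) < m:
        seen.add(tuple(sorted(rng.sample(range(num_labels), k))))
    return sorted(seen)


def project(Y: Sequence[Sequence[int]], subset: Sequence[int]) -> List[Tuple[int, ...]]:
    """Restrict every label vector to the labels in one subset (the LP targets)."""
    return [tuple(row[j] for j in subset) for row in Y]


def vote(predictions: Sequence[Tuple[Sequence[int], Sequence[int]]],
         num_labels: int, threshold: float = 0.5) -> List[int]:
    """Average the per-label votes of all members that cover each label."""
    votes = [0] * num_labels
    counts = [0] * num_labels
    for subset, bits in predictions:
        for j, bit in zip(subset, bits):
            votes[j] += bit
            counts[j] += 1
    # Labels not covered by any subset default to 0.
    return [1 if counts[j] and votes[j] / counts[j] >= threshold else 0 for j in range(num_labels)]


subsets = draw_labelsets(num_labels=6, k=3, m=4, rng=random.Random(1))
targets = project([[1, 0, 1, 0, 0, 1]], (0, 2, 5))
combined = vote([((0, 1, 2), (1, 0, 1)), ((1, 2, 3), (1, 1, 0))], num_labels=4)
```

In the final call, label 1 receives one positive and one negative vote (average 0.5) and is therefore predicted relevant under the assumed threshold.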

Dataset   SVM (RBM)  SVM (original)  Log-Reg (RBM)  Log-Reg (original)
Music     0.581      0.579           0.538          0.465
Scene     0.712      0.684           0.663          0.469
Yeast     0.537      0.537           0.497          DNF
Genbase   0.984      0.984           0.968          0.976
Medical   0.652      0.743           0.494          0.639
Enron     0.452      0.413           0.376          0.273
Reuters   0.342      0.337           0.285          DNF
TABLE IV: We report the accuracy for RAkEL with and without feature extraction using RBMs, using SVM and logistic regression based multi-label classifiers.

The results for this paradigm are similar to the ones that we reported for the ECC in the previous section. For logistic regression (a linear classifier), the RBM-generated features lend themselves to accurate predictions when compared with the unprocessed features with the same base classifier, and the results are comparable to those achieved with the nonlinear SVM classifier. After processing the features with an RBM we might not need to rely on a nonlinear classifier. For the SVM, using the RBM-generated features does not help, but it does not hurt either, in terms of accuracy, as the SVM is versatile enough to learn any nonlinear mapping.

V-A3 Pairwise Classification

We implemented a pairwise approach, namely the Four-class pairWise classifier (FW), in which we build $L(L-1)/2$ models to learn four classes for each label pair $(j, k)$, dividing each into votes for the individual labels $y_j$ and $y_k$ and using a threshold at classification time. We find that overall it obtains better predictive performance than the pairwise methods that create decision boundaries between labels (where $y_j \neq y_k$), as in [5], for example, especially with SVMs. We report the accuracy in Table V, using the same hyperparameters as for ECC, reported in Table II(b), to make the results comparable across multi-label classification paradigms.
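A four-class pairwise scheme of this kind can be sketched in Python as below. The encoding of the four classes (two bits per label pair), the function names, and the threshold of 0.5 are all assumptions made for illustration; the per-pair classifiers themselves are omitted.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]


def pair_class(y_j: int, y_k: int) -> int:
    """Encode the joint value of two labels as one of four classes (0 to 3)."""
    return 2 * y_j + y_k


def class_to_votes(c: int) -> Tuple[int, int]:
    """Split a four-class prediction back into one vote per label."""
    return c // 2, c % 2


def fw_targets(Y: Sequence[Sequence[int]]) -> Dict[Pair, List[int]]:
    """Build one four-class target vector for every label pair."""
    num_labels = len(Y[0])
    return {
        (j, k): [pair_class(row[j], row[k]) for row in Y]
        for j, k in combinations(range(num_labels), 2)
    }


def fw_predict(pair_predictions: Dict[Pair, int], num_labels: int, threshold: float = 0.5) -> List[int]:
    """Combine the per-pair class predictions into a label vector by voting."""
    votes = [0] * num_labels
    for (j, k), c in pair_predictions.items():
        v_j, v_k = class_to_votes(c)
        votes[j] += v_j
        votes[k] += v_k
    # Each label takes part in (num_labels - 1) pairs.
    return [1 if v / (num_labels - 1) >= threshold else 0 for v in votes]


targets = fw_targets([[1, 0, 1], [0, 1, 1]])
y_hat = fw_predict({(0, 1): 2, (0, 2): 3, (1, 2): 1}, num_labels=3)
```

With three labels there are three pair models; in the example all of them agree that labels 0 and 2 are relevant and label 1 is not.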

Dataset   SVM (RBM)  SVM (original)  Log-Reg (RBM)  Log-Reg (original)
Music     0.578      0.573           0.549          0.492
Scene     0.694      0.649           0.660          0.490
Yeast     0.537      0.538           0.507          0.495
Genbase   0.985      0.985           0.949          0.975
Medical   0.571      0.748           0.492          DNF
Enron     0.463      0.408           0.376          DNF
TABLE V: We report the accuracy for FW with and without feature extraction using RBMs, using SVM and logistic regression based multi-label classifiers.

The conclusions are similar to those for the other two paradigms. The linear classifier (logistic regression) does significantly better with the RBM-generated features than with the original input space, while the nonlinear SVM classifier is versatile enough to provide accurate predictions with or without RBM-generated features. Fortunately, the linear classifier with RBM-generated features is quite close to the SVM-based classifier and allows us to interpret which RBM features contribute to each label. Hence we can provide intuitive interpretations for each RBM feature, while it is hard to get such an interpretation from the SVM nonlinear mapping.

V-B DBN performance

After analyzing the performance of the RBM generated features, we focus on two DBN structures for multi-label classification:

  • DBN: a network of two hidden layers, the final of which is united with the labels in a new dataset and trained with ECC (see Figure 3(a))

  • DBN: a network of three hidden layers where the final layer represents the labels; fine-tuned with back propagation (see Figure 3(b))

Both setups can be visualised in Figure 6, where in the case of DBN.

Fig. 6: A deep learning setup for multi-label classification.

We use hidden units, 1000 RBM epochs, 100 BP epochs (on DBN), and the best of either and on a 67:33 percent internal train/test validation (taking advantage of the fact, as we explained earlier, that the choice of learning rate and momentum is fairly robust given enough hidden units).

In Table VI, we compare the accuracy of the proposed DBN structures and the previously proposed methods. We have also added MLkNN, BPMLL, and IBLR (see Section II for details). In this table we can see that the DBN is either the best classifier or close to the best, which suggests that the features generated by the second layer improve on those of the first layer. For example, on the only dataset (Medical) on which ECC with RBM-generated features was not good enough compared to ECC on the original features, the two DBNs now do almost as well as ECC, and the performance on the other datasets is also improved (or not degraded). This structure seems to be amenable to multi-label classification and competitive with all the proposed paradigms in the literature.

Music 0.577 0.581 0.542 0.545 0.581 0.576 0.579 0.573 0.533
Scene 0.731 0.742 0.696 0.697 0.731 0.710 0.684 0.649 0.552
Yeast 0.529 0.531 0.537 0.539 0.532 0.535 0.537 0.538 0.491
Genbase 0.984 0.985 0.950 0.918 0.979 0.981 0.984 0.985 0.049
Medical 0.746 0.742 0.596 0.494 0.695 0.770 0.743 0.748 0.053
Enron 0.442 0.480 0.353 0.363 0.469 0.454 0.413 0.408 0.144
Reuters 0.410 0.451 0.408 0.357 0.459 0.461 0.337 DNF 0.004
TABLE VI: Comparing multi-label methods under accuracy. Highest results are set in boldface.

VI Conclusions

Our empirical evaluation over a variety of multi-label datasets shows that a selection of high-performing multi-label methods from the literature can be improved upon by using an RBM-processed feature space. The labels become easier to model at training time, and to predict at inference time. We obtained an improvement of up to percentage points in accuracy over using the original feature space directly. Our study showed that important improvements can be obtained in multi-label classification with respect to both scalability and predictive performance when using deep learning. As a result, we recommend that multi-label practitioners focus more on feature modelling, rather than solely on modelling dependencies between the output labels. Our multi-label DBN models achieved the best predictive performance overall compared with seven competing methods from the multi-label literature.


  • [1] David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147–169, 1985.
  • [2] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
  • [3] Weiwei Cheng, Krzysztof Dembczyński, and Eyke Hüllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML’10: 27th International Conference on Machine Learning, Haifa, Israel, June 2010. Omnipress.
  • [4] Weiwei Cheng and Eyke Hüllermeier. Combining instance-based learning and logistic regression for multilabel classification. Machine Learning, 76(2-3):211–225, 2009.
  • [5] Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker. Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153, November 2008.
  • [6] Shantanu Godbole and Sunita Sarawagi. Discriminative methods for multi-labeled classification. In PAKDD ’04: Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 22–30. Springer, 2004.
  • [7] Yuhong Guo and Suicheng Gu. Multi-label classification using conditional dependency networks. In IJCAI ’11: 24th International Conference on Artificial Intelligence, pages 1300–1305. IJCAI/AAAI, 2011.
  • [8] Geoffrey Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
  • [9] Geoffrey Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [10] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
  • [11] Daniel Hsu, Sham M. Kakade, John Langford, and Tong Zhang. Multi-label prediction via compressed sensing. In NIPS ’09: Neural Information Processing Systems 2009, 2009.
  • [12] Jesse Read. Scalable Multi-label Classification. PhD thesis, University of Waikato, 2010.
  • [13] Jesse Read, Bernhard Pfahringer, and Geoff Holmes. Multi-label classification using ensembles of pruned sets. In ICDM ’08: Eighth IEEE International Conference on Data Mining, pages 995–1000. IEEE, 2008.
  • [14] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. In ECML ’09: 20th European Conference on Machine Learning, pages 254–269. Springer, 2009.
  • [15] Jesse Read, Bernhard Pfahringer, Geoffrey Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
  • [16] B. D. Ripley. Neural networks and related methods for classification. Journal of the Royal Statistical Society. Series B (Methodological), 56(3):409–456, 1994.
  • [17] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
  • [18] Leander Schietgat, Celine Vens, Jan Struyf, Hendrik Blockeel, Dragi Kocev, and Saso Dzeroski. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics, 11:2, 2010.
  • [19] Eduardo D. Sontag. Feedforward nets for interpolation and classification. J. Comp. Syst. Sci, 45:20–48, 1992.
  • [20] Newton Spolaôr, Everton Alvares Cherman, Maria Carolina Monard, and Huei Diana Lee. A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science, 292:135–151, 2013. Proceedings of the XXXVIII Latin American Conference in Informatics (CLEI).
  • [21] Farbound Tai and Hsuan-Tien Lin. Multi-label classification with principal label space transformation. In Workshop Proceedings of Learning from Multi-Label Data, Haifa, Israel, June 2010.
  • [22] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In O. Maimon and L. Rokach, editors, Data Mining and Knowledge Discovery Handbook. 2nd edition, Springer, 2010.
  • [23] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.
  • [24] Grigorios Tsoumakas and Ioannis P. Vlahavas. Random k-labelsets: An ensemble method for multilabel classification. In ECML ’07: 18th European Conference on Machine Learning, pages 406–417. Springer, 2007.
  • [25] Rong Yan, Jelena Tesic, and John R. Smith. Model-shared subspace boosting for multi-label classification. In KDD ’07: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data mining, pages 834–843. ACM, 2007.
  • [26] Julio H. Zaragoza, Luis Enrique Sucar, Eduardo F. Morales, Concha Bielza, and Pedro Larrañaga. Bayesian chain classifiers for multidimensional classification. In IJCAI’11: 24th International Joint Conference on Artificial Intelligence, pages 2192–2197, 2011.
  • [27] Min-Ling Zhang. LIFT: Multi-label learning with label-specific features. In IJCAI’11: 24th International Joint Conference on Artificial Intelligence, pages 1609–1614, 2011.
  • [28] Min-Ling Zhang and Kun Zhang. Multi-label learning by exploiting label dependency. In KDD ’10: 16th ACM SIGKDD International conference on Knowledge Discovery and Data mining, pages 999–1008, New York, NY, USA, 2010. ACM.
  • [29] Min-Ling Zhang and Zhi-Hua Zhou. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10):1338–1351, 2006.
  • [30] Min-Ling Zhang and Zhi-Hua Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.