Dynamic classifier chains for multi-label learning
In this paper, we deal with the task of building a dynamic ensemble of chain classifiers for multi-label classification. To do so, we propose two classifier chain algorithms that are able to change the label order of the chain without rebuilding the entire model. Such models allow an instance-specific chain order to be determined without a significant increase in computational burden. The proposed chain models are built using the Naive Bayes classifier and the nearest neighbour approach as base single-label classifiers. To take advantage of the proposed algorithms, we developed a simple heuristic that allows the system to find a relatively good label order. The heuristic sorts labels according to the label-specific classification quality measured during the validation phase, and in this way tries to minimise error propagation along the chain. The experimental results showed that the proposed model based on the Naive Bayes classifier and the above-mentioned heuristic is an efficient tool for building dynamic chain classifiers.
Keywords: multi-label, classifier chains, naive Bayes, dynamic chains, nearest neighbour
Under the well-known single-label classification framework, an object is assigned to only one class, which provides a full description of the object. However, many real-world datasets contain objects that are assigned to several categories at the same time. All of these categories together constitute a full description of the object, and omitting any one of them induces a loss of information. A classification process that involves this kind of data is called multi-label classification Gibaja2014 (). A good example of a multi-label dataset is a gallery of tagged photos. Each photo may be described using tags such as mountains, sea, forest, beach, sunset, etc. Multi-label classification is a relatively new idea that has been explored extensively over the last two decades. As a consequence, it has been employed in a wide range of practical applications including text classification Jiang2012 (), multimedia classification Sanden2011 () and bioinformatics Wu2014 (), to name a few.
Multi-label classification algorithms can be broadly partitioned into two main groups i.e. dataset transformation algorithms and algorithm adaptation approaches Gibaja2014 ().
Methods that belong to the group of algorithm adaptation approaches generalise an existing multi-class algorithm so that it can solve a multi-label classification problem directly. The best-known approaches from this group include the multi-label KNN algorithm Jiang2012 (), the Structured SVM approach Diez2014 () and deep-learning-based algorithms Wei2015 ().
In this paper, we investigate only dataset transformation algorithms that decompose a multi-label problem into a set of single-label classification tasks. To reconstruct a multi-label response, during the inference phase, outputs of the underlying single-label classifiers are combined in order to create a multi-label prediction.
Let us focus on one of the simplest decomposition methods, the binary relevance (BR) approach, which decomposes a multi-label classification task into a set of one-vs-rest binary classification problems AlvaresCherman2010 (). This approach assumes that labels are conditionally independent. Although the assumption does not hold in most real-life recognition problems, the BR framework is one of the most widespread multi-label classification methods Tsoumakas_Katakis_Vlahavas_2008 (). This is due to its excellent scalability and acceptable classification quality Luaces2012 ().
To preserve the scalability of BR systems while providing a model of inter-label relations, Read et al. Read2009 (); Read2011 () proposed the Classifier Chain (CC) model, which establishes a linked chain of modified one-vs-rest binary classifiers. The modification consists of extending the input space of the single-label classifiers along the chain sequence. More precisely, for a given label sequence, the feature space of each classifier along the chain is extended with a set of binary variables corresponding to the labels that precede the given one. During the training phase, the input space of a given classifier is extended using the ground-truth labels extracted from the training set. During the inference step, the ground-truth labels are unavailable, so we employ the binary labels predicted by the preceding classifiers. The inference is done in a greedy way that makes the best decision for each of the considered labels in turn. The information passed along the chain allows CC to take inter-label relations into account, at the cost of allowing label-prediction errors to propagate along the chain Read2011 (). This is the major drawback of the CC system: as a consequence of the greedy inference, the performance of a chain classifier strongly depends on the chain configuration Senge2013 (). To overcome these effects, the authors suggested generating an ensemble of chain classifiers (ECC). The ensemble consists of classifiers trained using different label sequences Read2009 ().
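The difference between the training and inference phases of a classifier chain can be illustrated with a minimal sketch. The base learner below (`fit_1nn`, a 1-nearest-neighbour rule) and all function names are illustrative assumptions and are not taken from the cited works; any binary classifier could be substituted.

```python
from typing import Callable, List

Vector = List[float]


def fit_1nn(X: List[Vector], y: List[int]) -> Callable[[Vector], int]:
    """Toy base learner (illustrative assumption): 1-nearest-neighbour."""
    def predict(x: Vector) -> int:
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X]
        return y[dists.index(min(dists))]
    return predict


def train_chain(X: List[Vector], Y: List[List[int]], order: List[int]):
    """Train one binary model per label, following the chain order."""
    models = []
    for pos, label in enumerate(order):
        preceding = order[:pos]
        # Training: extend the features with GROUND-TRUTH preceding labels.
        X_ext = [x + [float(y[k]) for k in preceding] for x, y in zip(X, Y)]
        models.append(fit_1nn(X_ext, [y[label] for y in Y]))
    return models


def predict_chain(models, order: List[int], x: Vector) -> List[int]:
    """Greedy inference: extend the features with PREDICTED preceding labels."""
    y_hat = [0] * len(order)
    predicted: Vector = []
    for model, label in zip(models, order):
        y_hat[label] = model(x + predicted)
        predicted.append(float(y_hat[label]))
    return y_hat


X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]
order = [1, 0]  # label 1 is predicted first, then label 0
models = train_chain(X, Y, order)
print(predict_chain(models, order, [2.0]))
```

Note that changing `order` in this sketch requires calling `train_chain` again, which is exactly the cost that a dynamic chain tries to avoid.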
The originally proposed ECC ensemble uses randomly generated label orders. Additionally, each chain classifier is built using a resampled dataset, which introduces additional diversity into the ensemble. This simple yet effective approach significantly improves the classification quality in comparison to a single chain classifier. However, intuition suggests that there is still room for improvement when a more data-driven approach is employed.
Indeed, later research shows that the members of the ensemble may be chosen in such a way that provides further improvement of classification quality. That is, Read et al. proposed a strategy which uses Monte Carlo sampling to explore the label sequence space in order to find a classifier chain that offers the highest classification quality Read2014 (). Another approach was proposed by Liu et al. Liu2010 (). They introduced a method that builds a model of inter-label relations as a directed acyclic graph (DAG). The weights of the graph are calculated using confusion and support for each pair of labels. Then, the ensemble is generated using topological sorting of the graph. Chen et al. Chen2016 () proposed a method that makes clusters of labels. Then, for each cluster of labels, an undirected graph describing inter-label relations is built. Then, a minimum spanning tree is created for the graph. After that, breadth-first search algorithm determines sequences for a cluster-specific ensemble of CC classifiers. A similar approach was proposed by Huang et al Huang2015 (). They proposed to build the clusters using a meta-space that mixes input space and label space. Then inter-label relations are modeled using correlation. The model is expressed using DAG structure. Finally, the CC classifier is built for each cluster. The chain structure may also be induced using Bayesian Network approach Zhang2014 ().
The chain sequence can also be found using a meta-heuristic approach. Goncalves et al. developed a strategy that utilises a genetic algorithm (GA) to find a good chain structure for the entire dataset Goncalves2013 (); Gonalves2015 (). The proposed method is a wrapper-based approach in which each chromosome encodes a different chain order. To evaluate those label orders, each corresponding classifier must be built and evaluated using a validation set. A similar approach was also used by Trajdos and Kurzynski, who proposed a multi-objective genetic algorithm that optimises classification quality and chain diversity simultaneously Trajdos2017 (). Although those methods are rather time-consuming, they provide a significant improvement in terms of classification quality.
Another way of dealing with error propagation is to build a classifier that combines the CC algorithm with a BR-based approach Montaes2014 (). The authors proposed a stacking-based architecture to combine the above-mentioned classifiers. The first layer is a simple BR classifier that predicts each label separately. The attribute set of the classifiers from the second layer is extended using all labels except the predicted one. During the training phase, both layers are trained separately. During the prediction phase, on the other hand, classifiers from the second layer mix the outcomes of the BR classifiers with the outcomes provided by the preceding classifiers. That is, the first classifier of the chain structure has its attribute space extended by the outcomes of the BR classifier. The second one uses the prediction of the first one, and the remaining attributes are taken from the prediction of the BR classifier. Finally, the last classifier along the chain has its attribute space extended using only labels predicted by the preceding classifiers. Another way of combining the CC classifier with the BR classifier is described in Madjarov2012 ().
The previously cited methods build the ensemble structure during the training procedure. Consequently, throughout this paper, methods of this kind will be called static methods. Dynamic chain classifiers, on the other hand, determine the best label order at the prediction phase daSilva2014 (). The above-mentioned classifier produces a set of randomly generated label sequences and then validates the corresponding chain classifiers. During the validation phase, each point from the validation set is assigned the label order that produces the most accurate output vector for this point. As the experimental research shows, dynamic methods of building a label order may achieve better classification quality daSilva2014 ().
We observed that building a dynamic chain classifier requires multiple chain classifiers to be learned. These classifiers are built using the same training set and differ only in chain order. As a consequence, the computational burden of the algorithm may be reduced if there exists a classifier that is trained once and whose label sequence can be changed without rebuilding the model. To address this issue, we built two models, based on the Naive Bayes approach Hand2001 () and the nearest neighbour approach Cover1967 (), that have the above-mentioned property.
Additionally, we proposed a dynamic method of determining the chain order based on classification quality for each label separately.
A part of this paper was previously published in Trajdos2017b (). This paper is an extended version of the previously published work. The main elements that have been changed or extended are:
The literature review has been extended.
We have added the results of the experimental comparison of the BR and CC versions of different base classifiers.
We have proposed a new model of the dynamic chain classifier. That is, we introduce the CC model based on the nearest neighbour approach.
New experimental results have been provided.
The rest of the paper is organised as follows. Section 2 provides a formal description of the multi-label classification problem and describes the developed algorithms. Section 3 contains a description of the conducted experiments. The results are presented and discussed in Section 4. Finally, Section 5 concludes the paper.
2 Proposed Methods
In this section, we introduce a formal notation of multi-label classification problem and provide a description of the proposed method.
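Under the multi-label (ML) formalism, an object $x \in \mathcal{X}$ is assigned to a set of labels indicated by a binary vector of length $L$: $y = [y_1, y_2, \dots, y_L] \in \mathcal{Y} = \{0,1\}^L$, where $L$ denotes the number of labels.
In this paper, we follow a statistical classification framework. As a consequence, it is assumed that the object $x$ and its set of labels $y$ are realisations of the corresponding random vectors $\mathbf{X}$ and $\mathbf{Y}$, and that the joint probability distribution $P(\mathbf{X}, \mathbf{Y})$ on $\mathcal{X} \times \mathcal{Y}$ is known.
Because the above-mentioned assumption is never met in the real world, in this study we assume that the multi-label classifier $h: \mathcal{X} \mapsto \mathcal{Y}$, which maps the feature space $\mathcal{X}$ to the set $\mathcal{Y}$, is built in a supervised learning procedure using the training set $\mathcal{T}$ containing $N$ pairs of feature vectors $x$ and corresponding class labels $y$:
$$\mathcal{T} = \left\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)}) \right\} \qquad (1)$$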
2.2 Naive Bayes Classifier for Dynamic Classifier Chains
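In this paper, we consider ML classifiers built according to the chain rule. That is, the classifier $h$ is an ensemble of $L$ single-label classifiers $h_{\pi(1)}, \dots, h_{\pi(L)}$ that constitute a linked chain built according to a permutation $\pi$ of the label sequence $\{1, 2, \dots, L\}$. As mentioned earlier, in this paper we follow the statistical classification framework. Consequently, each single-label classifier along the chain makes its decision according to the following rule:
$$h_{\pi(l)}(x) = \arg\max_{y \in \{0,1\}} P\left(Y_{\pi(l)} = y \mid B_{\pi(l)}(x)\right) \qquad (2)$$
where $B_{\pi(l)}(x)$ is a random event defined below:
$$B_{\pi(l)}(x) = \left\{ \mathbf{X} = x,\; Y_{\pi(1)} = h_{\pi(1)}(x),\; \dots,\; Y_{\pi(l-1)} = h_{\pi(l-1)}(x) \right\}$$
Conditioning on the random event $B_{\pi(l)}(x)$ instead of $\mathbf{X} = x$ allows the chain to take inter-label dependencies into account. The above-mentioned classification rule is a greedy rule that calculates the probability in (2) using the predictions of the preceding classifiers. The optimal prediction under the chaining rule may be found using the PCC approach cheng2010bayes (). However, that approach requires a number of calculations that grows exponentially with the number of labels.
The probability defined in (2) is then computed using the Bayes rule:
$$P\left(Y_{\pi(l)} = y \mid B_{\pi(l)}(x)\right) = \frac{P\left(Y_{\pi(l)} = y\right) P\left(B_{\pi(l)}(x) \mid Y_{\pi(l)} = y\right)}{P\left(B_{\pi(l)}(x)\right)}$$
The term $P(B_{\pi(l)}(x))$ does not depend on the event $Y_{\pi(l)} = y$. Consequently, the decision rule (2) is rewritten:
$$h_{\pi(l)}(x) = \arg\max_{y \in \{0,1\}} P\left(Y_{\pi(l)} = y\right) P\left(B_{\pi(l)}(x) \mid Y_{\pi(l)} = y\right)$$
Now, to improve readability, we simplify the notation:
$$P(y_{\pi(l)}) = P\left(Y_{\pi(l)} = y\right), \qquad P\left(B_{\pi(l)}(x) \mid y_{\pi(l)}\right) = P\left(B_{\pi(l)}(x) \mid Y_{\pi(l)} = y\right)$$
Then, following the Naive Bayes rule, we assume that all random variables that constitute $B_{\pi(l)}(x)$ are conditionally independent given $Y_{\pi(l)} = y$. Consequently, $P(B_{\pi(l)}(x) \mid y_{\pi(l)})$ is defined using the following formula, in which $d$ denotes the number of input features:
$$P\left(B_{\pi(l)}(x) \mid y_{\pi(l)}\right) = \prod_{j=1}^{d} P\left(x_j \mid y_{\pi(l)}\right) \prod_{k=1}^{l-1} P\left(h_{\pi(k)}(x) \mid y_{\pi(l)}\right)$$
Now, it is easy to see that the term $\prod_{k=1}^{l-1} P(h_{\pi(k)}(x) \mid y_{\pi(l)})$, contrary to $\prod_{j=1}^{d} P(x_j \mid y_{\pi(l)})$, depends on the chain structure. Furthermore, all probability distributions used in the above-mentioned terms can be estimated during the training phase, when the chain structure is unknown.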
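Algorithm 1: training procedure of the Naive Bayes dynamic chain classifier.
Input: the training set.
Split the training set into the actual training set and the validation set.
Using the actual training set, build estimators of the distributions used by the classification rule.
Algorithm 2: inference procedure of the Naive Bayes dynamic chain classifier.
Input: the input instance; the validation set.
Query the BR models.
Determine the label permutation for the input instance using the validation set (Section 2.4).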
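The idea of training once and choosing the label order only at prediction time can be sketched as follows. This is a minimal illustration under stated assumptions: features are binary (a Bernoulli model with Laplace smoothing `alpha`), and the class and attribute names are hypothetical. The estimators actually used for continuous attributes may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple


class DynamicNaiveBayesChain:
    """Bernoulli Naive Bayes chain; the label order is given at prediction time."""

    def fit(self, X: List[List[int]], Y: List[List[int]], alpha: float = 1.0):
        n, self.d, self.L = len(X), len(X[0]), len(Y[0])
        self.prior: Dict[Tuple[int, int], float] = {}      # P(Y_l = y)
        self.feat: Dict[Tuple[int, int, int], float] = {}  # P(X_j = 1 | Y_l = y)
        self.pair: Dict[Tuple[int, int, int], float] = {}  # P(Y_k = 1 | Y_l = y)
        for l in range(self.L):
            for y in (0, 1):
                rows = [i for i in range(n) if Y[i][l] == y]
                m = len(rows)
                self.prior[(l, y)] = (m + alpha) / (n + 2 * alpha)
                for j in range(self.d):
                    ones = sum(X[i][j] for i in rows)
                    self.feat[(l, y, j)] = (ones + alpha) / (m + 2 * alpha)
                # Estimated for EVERY ordered pair, so any order can be served.
                for k in range(self.L):
                    if k != l:
                        ones = sum(Y[i][k] for i in rows)
                        self.pair[(l, y, k)] = (ones + alpha) / (m + 2 * alpha)
        return self

    @staticmethod
    def _log_bernoulli(p: float, value: int) -> float:
        return math.log(p if value == 1 else 1.0 - p)

    def predict(self, x: Sequence[int], order: Sequence[int]) -> List[int]:
        y_hat: Dict[int, int] = {}
        for l in order:
            scores = []
            for y in (0, 1):
                s = math.log(self.prior[(l, y)])
                s += sum(self._log_bernoulli(self.feat[(l, y, j)], x[j])
                         for j in range(self.d))
                # Chain-dependent term: only already predicted labels contribute.
                s += sum(self._log_bernoulli(self.pair[(l, y, k)], v)
                         for k, v in y_hat.items())
                scores.append(s)
            y_hat[l] = 1 if scores[1] > scores[0] else 0
        return [y_hat[l] for l in range(self.L)]


X = [[1, 0], [1, 1], [0, 0], [0, 1]]
Y = [[1, 1], [1, 1], [0, 0], [0, 0]]
model = DynamicNaiveBayesChain().fit(X, Y)
print(model.predict([1, 0], order=[0, 1]))
print(model.predict([1, 0], order=[1, 0]))  # same model, different chain order
```

The second call reuses the fitted estimators unchanged; only the set of pairwise terms entering each score depends on `order`.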
2.2.1 Computational complexity
In this section, we assess the increase in computational complexity that the proposed algorithm causes.
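First of all, it is easy to see that for both the original and the proposed algorithm the number of estimators that must be built to assess the conditional distributions of the input features given a label is the same: $d \times L$, where $d$ denotes the number of input features.
The number of estimators of the label prior probabilities that must be built is also the same for both classifiers: $L$.
The key difference is in the number of estimators of the pairwise conditional label distributions that must be built. The original CC classifier, whose chain order is fixed, builds $L(L-1)/2$ such estimators. Our method, on the other hand, builds $L(L-1)$ estimators, one for each ordered pair of distinct labels.
At the inference phase, the only additional calculations are those performed to determine the permutation of labels. Since the validation set is involved in this process, the number of calculations is proportional to the size of the validation set.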
2.3 KNN Classifier for Dynamic Classifier Chains
In this section, we define a dynamic classifier chain algorithm based on the nearest neighbours approach. The nearest neighbour algorithm is an instance-based classifier that does not build an explicit model of the mapping between the feature space and the label space. Instead, the classifier performs classification in a lazy manner. That is, the R nearest instances are found and then the class is predicted using the labels of these neighbouring instances.
Let’s begin with the definition of a distance function that depends on label permutation and the position along the chain. The distance function is defined in the extended feature space that combines the input space and the label space. For the first position, the distance is a simple Euclidean distance in the input space:
For the other positions, the distance function uses both the input space and the label space:
The distance function defined in this way allows us to make predictions using the chaining rule, since the distance is modified to fit the chain structure. During the inference phase, the extended distance is calculated using the labels predicted at the preceding steps of the procedure. The above-defined distance function is used to build the neighbourhood of a given point in the extended feature space. The neighbourhood contains the R closest instances selected from the training set according to the distance function.
Given the neighbourhood, the probability is estimated as follows:
The label is also predicted using rule (2).
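Algorithm 3: training procedure of the nearest neighbour dynamic chain classifier.
Input: the training set.
Split the training set into the actual training set and the validation set.
Save the training set.
Algorithm 4: inference procedure of the nearest neighbour dynamic chain classifier.
Input: the input instance; the validation set; the training set.
Determine the label permutation for the input instance using the validation set (Section 2.4).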
The training procedure is described in Algorithm 3. The procedure is very simple: it splits the original training set into the actual training set and the validation set.
The inference procedure begins by assigning undefined values to the prediction vector. Then the predictions are updated sequentially according to the label permutation. The procedure is shown in Algorithm 4.
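The inference procedure described above can be sketched in a few lines. The exact form of the extended distance is an assumption here: the squared differences of the already predicted labels are simply added to the squared Euclidean distance in the input space. The estimate of the label probability as the fraction of relevant neighbours and the 0.5 decision threshold are likewise illustrative choices, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def extended_distance(x, x_train, y_train, y_hat, preceding) -> float:
    """Euclidean distance over the features plus the already predicted labels."""
    feat = sum((a - b) ** 2 for a, b in zip(x, x_train))
    lab = sum((y_hat[k] - y_train[k]) ** 2 for k in preceding)
    return math.sqrt(feat + lab)


def knn_chain_predict(X: List[List[float]], Y: List[List[int]],
                      x: Sequence[float], order: List[int], R: int = 3) -> List[int]:
    # Start with undefined predictions and fill them in following the order.
    y_hat: List[Optional[int]] = [None] * len(order)
    for pos, label in enumerate(order):
        preceding = order[:pos]
        ranked = sorted(
            range(len(X)),
            key=lambda i: extended_distance(x, X[i], Y[i], y_hat, preceding),
        )
        neighbours = ranked[:R]
        # Fraction of neighbours for which the label is relevant.
        p_relevant = sum(Y[i][label] for i in neighbours) / len(neighbours)
        y_hat[label] = 1 if p_relevant > 0.5 else 0
    return y_hat


X = [[0.0], [0.1], [1.0], [1.1]]
Y = [[0, 0], [0, 0], [1, 1], [1, 1]]
print(knn_chain_predict(X, Y, [0.05], order=[0, 1], R=2))
print(knn_chain_predict(X, Y, [1.05], order=[1, 0], R=2))
```

Because the stored training set is the whole model, changing `order` costs nothing beyond recomputing the distances.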
2.4 Dynamic Chain order
In this subsection, we define a local measure of classification quality. To do so, we employed a modified version of the well-known $F_1$ measure.
First of all, we defined a fuzzy neighbourhood in the input space. The neighbourhood of an instance is defined using the following fuzzy set Zadeh1965 ():
where each triplet defines a fuzzy set together with its membership coefficient. The membership function is defined using the Gaussian potential function:
The distance function is the simple Euclidean distance, and the coefficient of the potential function is tuned during the experiments.
Then, for a given label, we define the set of points that belong to the label and the set of points that are classified as the label:
The above-mentioned classifier responses are related to the binary relevance classifier that can be built without knowing the order of the chain. The classifier is defined using the following classification rule:
Since the neighbourhood of a given instance is defined as a fuzzy set, the above-mentioned sets are, for consistency, also defined as fuzzy sets. However, these sets are fuzzy singletons. A visualisation of the aforementioned sets is provided in Figure 1.
Using the above-mentioned sets we define local True Positive rate, False Positive rate, False Negative rate respectively:
where $|\cdot|$ is the cardinality of a fuzzy set Dhar2013 (). Then, we define the local measure of classification quality:
Finally, the label order is chosen so that the following inequalities are met:
That is, labels for which the classification quality is higher precede the other labels in the chain structure. In other words, this simple heuristic is aimed at dealing with error propagation in the chain structure by employing the most accurate models at the beginning of the chain.
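The ordering heuristic can be sketched as follows. Several details are assumptions made for illustration: the Gaussian membership is taken as `exp(-beta * distance**2)` with a hypothetical coefficient `beta`, the cardinality of a fuzzy set is taken as the sum of its membership values, the local quality is computed as $2TP / (2TP + FP + FN)$, and a label with an empty denominator receives quality 0.

```python
import math
from typing import List, Sequence


def local_f1_order(x: Sequence[float],
                   V_X: List[List[float]],
                   V_Y: List[List[int]],
                   V_pred: List[List[int]],
                   beta: float = 1.0) -> List[int]:
    """Return label indices sorted by decreasing local quality around x.

    V_X, V_Y : validation instances and their ground-truth label vectors.
    V_pred   : label vectors predicted for V_X by the binary relevance models.
    """
    # Fuzzy neighbourhood: one membership value per validation instance.
    weights = [
        math.exp(-beta * sum((a - b) ** 2 for a, b in zip(x, v))) for v in V_X
    ]
    n_labels = len(V_Y[0])
    scores = []
    for l in range(n_labels):
        tp = sum(w for w, y, p in zip(weights, V_Y, V_pred) if y[l] == 1 and p[l] == 1)
        fp = sum(w for w, y, p in zip(weights, V_Y, V_pred) if y[l] == 0 and p[l] == 1)
        fn = sum(w for w, y, p in zip(weights, V_Y, V_pred) if y[l] == 1 and p[l] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    # Labels predicted most reliably near x go first in the chain.
    return sorted(range(n_labels), key=lambda l: -scores[l])


V_X = [[0.0], [0.1], [5.0]]
V_Y = [[1, 1], [1, 1], [1, 0]]
V_pred = [[1, 0], [1, 0], [0, 0]]  # label 0 is predicted well near x = 0
print(local_f1_order([0.0], V_X, V_Y, V_pred))
```

The returned list can be passed directly as the chain order to a classifier that accepts the permutation at prediction time.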
2.5 The Ensemble Classifier
Now, let us define an ML classifier ensemble. The ensemble is built using the classifier chain algorithms defined in the previous sections. Each member of the ensemble is built using a subset of the original dataset. The size of the subset is a fixed fraction of the original training set.
The BR transformation may produce imbalanced single-label datasets. To prevent the classifier from learning from a highly imbalanced dataset, we applied the random undersampling technique Garca2012 (). The majority class is undersampled when the imbalance ratio is higher than 20. The goal of undersampling is to keep the imbalance ratio at the level of 20.
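A minimal sketch of this undersampling step is given below. The function name, the fixed `seed`, and the decision to leave a dataset untouched when one class is absent are illustrative assumptions.

```python
import random
from typing import List, Tuple


def undersample(X: List[list], y: List[int],
                max_ratio: float = 20.0, seed: int = 0) -> Tuple[List[list], List[int]]:
    """Randomly drop majority-class rows until majority/minority <= max_ratio."""
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [i for i, v in enumerate(y) if v == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Nothing to do if a class is absent or the ratio is already acceptable.
    if not minority or len(majority) <= max_ratio * len(minority):
        return list(X), list(y)
    rng = random.Random(seed)
    keep = int(max_ratio * len(minority))
    kept = sorted(minority + rng.sample(majority, keep))
    return [X[i] for i in kept], [y[i] for i in kept]


y = [1] * 2 + [0] * 100
X = [[float(i)] for i in range(len(y))]
X_s, y_s = undersample(X, y)
print(len(y_s), sum(y_s))
```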
Research on the application of the Naive Bayes algorithm under the CC framework shows that the Naive Bayes classifier may not perform well when the number of features in the input space is significantly higher than the number of labels daSilva2014 (). To prevent the proposed system from being affected by this phenomenon, we applied a feature selection procedure for each label separately. That is, the attributes are selected in order to improve the classification quality for the given label. The feature selection removes only attributes related to the original input space. Features related to labels are passed through the chain without selection. We employed a selection procedure based on correlation. In other words, we select attributes that are highly correlated with the predicted label and whose inter-correlations are low Hall1999 (). Additionally, if the number of selected features is higher than 300, we select 300 random features from the set of previously selected features.
The final prediction vector of the ensemble is obtained via a simple averaging of the response vectors of the base classifiers of the ensemble, followed by the thresholding procedure:
where $[\![\cdot]\!]$ is the Iverson bracket.
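The combination step can be sketched as follows. The threshold value of 0.5 and the strict inequality are assumptions made for illustration; the function name is hypothetical.

```python
from typing import List


def ensemble_predict(member_outputs: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Average the members' binary response vectors, then threshold each label."""
    n_members = len(member_outputs)
    n_labels = len(member_outputs[0])
    averaged = [
        sum(out[l] for out in member_outputs) / n_members for l in range(n_labels)
    ]
    # int(condition) plays the role of the Iverson bracket.
    return [int(a > threshold) for a in averaged]


print(ensemble_predict([[1, 0, 1], [1, 0, 0], [0, 0, 1]]))
```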
3 Experimental Setup
The experimental study is divided into three main sections. The first one assesses the impact of employing the chaining approach. In this section, we compare binary relevance and classifier chain algorithms built using the following base classifiers:
J48 Classifier (C4.5 algorithm implemented in Weka) Quinlan1993 ().
SVM classifier.
Naive Bayes Classifier Hand2001 ().
Nearest Neighbour classifier Cover1967 ().
In this section, we compare BR and CC ensembles built using a genetic algorithm tailored to optimise the macro-averaged $F_1$ loss. For each ensemble, the size of the committee was set to . For the algorithm based on the genetic algorithm, the initial size of the committee was set to . Each numeric attribute in the training and validation datasets was also standardised. After the standardisation, the mean value of the attribute is 0 and its standard deviation is 1.
During the experimental study, the parameters of the SVM classifier (, ) were tuned using grid search and 3-fold cross validation. The number of nearest neighbours was also tuned using 3-fold cross validation. The number of neighbours was chosen among the following values .
In the two remaining sections, the experimental study provides an empirical evaluation of the classification quality of the proposed methods and compares it to that of reference methods. Namely, we conducted our experiments using the following algorithms for building a CC ensemble:
The proposed approach (Section 2.4).
Static ensemble generated using a genetic algorithm Trajdos2017 (). The ensemble is tuned to optimise the macro-averaged $F_1$ measure.
ECC ensemble with randomly generated chain orders Read2009 ().
OOCC dynamic method proposed by da Silva et al. daSilva2014 (). The ensemble is tuned to optimise the example-based $F_1$ measure. Additionally, the reference method uses a single split into training and validation sets.
The above-mentioned methods of building CC systems were evaluated using the Naive Bayes and the nearest neighbour algorithms as base classifiers. Systems built using different base classifiers are investigated in two separate sections. In these sections, we refer to the investigated algorithms using the numbers given above.
The reference algorithm also uses Naive Bayes/nearest neighbour algorithm with data preprocessing procedures described in Section 2.5.
The extraction of training and test datasets was performed using fold cross-validation. For each ensemble, the proportion of the training set was fixed at of the original training set (see Algorithm 1). For each ensemble, the size of the committee was set to . For the algorithm based on the genetic algorithm, the initial size of the committee was set to . Each numeric attribute in the training and validation datasets was also standardised. After the standardisation, the mean value of the attribute is 0 and its standard deviation is 1.
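The standardisation step can be sketched as follows. Estimating the mean and standard deviation on the training part only and reusing them for the validation part is an assumption of this sketch, as are the use of the population standard deviation and the mapping of constant attributes to 0.

```python
import statistics
from typing import List, Tuple


def fit_standardiser(X_train: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Estimate the per-attribute mean and standard deviation."""
    columns = list(zip(*X_train))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return means, stds


def standardise(X: List[List[float]], means: List[float], stds: List[float]) -> List[List[float]]:
    """Shift and scale each attribute; constant attributes are mapped to 0."""
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in X
    ]


X_train = [[1.0, 5.0], [3.0, 5.0]]
means, stds = fit_standardiser(X_train)
print(standardise(X_train, means, stds))
```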
The coefficient was tuned during the training procedure using 3-fold cross-validation. The best value among is chosen.
The experiments were conducted using 30 multi-label benchmark sets. The main characteristics of the datasets are summarized in Table 1. We used datasets from the sources abbreviated as follows: A – Charte2015 (); B – meka (); M – Tsoumakas2011_mulan (); W – Wu2014 (); X – Xu2013 (); Z – Zhou2012 (); T – Tomas2014 (); O – thorsten1998 (). Some of the employed sets needed preprocessing. That is, we used multi-label multi-instance Zhou2012 () sets (sources Z and W), which were transformed into single-instance multi-label datasets according to the suggestion made by Zhou et al. Zhou2012 (). Multi-target regression sets (No. 9 and 30) were binarised using a simple thresholding strategy: if the response is greater than 0, the resulting label is set as relevant. Two of the used datasets are synthetic (source T) and were generated using the algorithm described in Tomas2014 (). To reduce the computational burden, we used only a subset of the original Tmc2007 and IMDB sets. Additionally, the number of labels in the Stackex datasets was reduced to 15.
The algorithms were compared in terms of 11 different quality criteria coming from three groups Luaces2012 (): instance-based (Hamming, Zero-One, $F_1$, False Discovery Rate, False Negative Rate), label-based macro-averaged, and label-based micro-averaged. The macro-averaged group contains the False Discovery Rate (FDR, 1 − Precision), the False Negative Rate (FNR, 1 − Recall) and $F_1$; the micro-averaged group contains the micro-averaged versions of the same criteria.
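The distinction between the groups of criteria can be made concrete with a short sketch. The function names are hypothetical, and the convention that a label with no relevant and no predicted instances gets a loss of 0 is an assumption.

```python
from typing import List, Tuple

Labels = List[List[int]]


def hamming_loss(Y_true: Labels, Y_pred: Labels) -> float:
    """Fraction of individual label assignments that are wrong."""
    wrong = sum(t != p for yt, yp in zip(Y_true, Y_pred) for t, p in zip(yt, yp))
    return wrong / (len(Y_true) * len(Y_true[0]))


def zero_one_loss(Y_true: Labels, Y_pred: Labels) -> float:
    """Fraction of instances whose label vector is not a perfect match."""
    return sum(yt != yp for yt, yp in zip(Y_true, Y_pred)) / len(Y_true)


def counts(Y_true: Labels, Y_pred: Labels, label: int) -> Tuple[int, int, int]:
    """True positives, false positives and false negatives for one label."""
    tp = sum(yt[label] == 1 and yp[label] == 1 for yt, yp in zip(Y_true, Y_pred))
    fp = sum(yt[label] == 0 and yp[label] == 1 for yt, yp in zip(Y_true, Y_pred))
    fn = sum(yt[label] == 1 and yp[label] == 0 for yt, yp in zip(Y_true, Y_pred))
    return tp, fp, fn


def f1_loss(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 1.0 - (2 * tp / denom if denom else 1.0)


def macro_f1_loss(Y_true: Labels, Y_pred: Labels) -> float:
    """Compute the loss per label, then average over labels."""
    n_labels = len(Y_true[0])
    return sum(f1_loss(*counts(Y_true, Y_pred, l)) for l in range(n_labels)) / n_labels


def micro_f1_loss(Y_true: Labels, Y_pred: Labels) -> float:
    """Pool the counts over all labels, then compute a single loss."""
    per_label = [counts(Y_true, Y_pred, l) for l in range(len(Y_true[0]))]
    tp, fp, fn = (sum(c[i] for c in per_label) for i in range(3))
    return f1_loss(tp, fp, fn)


Y_true = [[1, 0], [1, 1]]
Y_pred = [[1, 0], [0, 1]]
print(hamming_loss(Y_true, Y_pred), zero_one_loss(Y_true, Y_pred))
print(macro_f1_loss(Y_true, Y_pred), micro_f1_loss(Y_true, Y_pred))
```

The FDR and FNR variants follow the same pattern, with `fp / (tp + fp)` and `fn / (tp + fn)` in place of the $F_1$ expression.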
Statistical evaluation of the results was performed using the Wilcoxon signed-rank test demsar2006 (); wilcoxon1945 () and the family-wise error rates were controlled using the Holm procedure demsar2006 (); holm1979 (). For all statistical tests, the significance level was set to . Additionally, we also applied the Friedman Friedman1940 () test followed by the Nemenyi post-hoc procedure demsar2006 ().
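The Holm step-down correction applied to a family of pairwise test p-values can be sketched as follows. The p-values themselves would come from the Wilcoxon signed-rank test; the function name and the example values are illustrative only.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next one by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


print(holm_adjust([0.01, 0.04, 0.03]))
```

A hypothesis is rejected when its adjusted p-value falls below the chosen significance level.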
4 Results and Discussion
4.1 Assessing the impact of chaining approach
In this section, we assess the consequences of employing chaining approach. That is, we compare binary relevance ensembles with classifier chain ensembles built using the same base classifier. The results are shown in Figure 2 and Table 2. Full results are presented in the appendix in Tables 5, 6 and 7. The compared algorithms are numbered as follows:
BR ensemble built using J48 algorithm.
CC ensemble built using J48 algorithm.
BR ensemble built using SVM algorithm.
CC ensemble built using SVM algorithm.
BR ensemble built using NB algorithm.
CC ensemble built using NB algorithm.
BR ensemble built using KNN algorithm.
CC ensemble built using KNN algorithm.
The analysis of the results clearly shows that there is a noticeable difference between two groups of measures. That is, the differences between BR-based and CC-based algorithms are greater in terms of example-based criteria. On the other hand, the differences in mean ranks are lower for label-based measures.
For the example-based measures, the average ranks achieved by CC-based algorithms are lower than those of BR-based algorithms. However, the differences in example-based FDR, FNR and $F_1$ are significant only for the algorithms based on the J48 classifier. A similar trend is observed for the zero-one loss. In this case, only the differences for the nearest neighbour classifier are insignificant. This means that CC-based classifiers obtain a higher number of 'perfect match' results.
On the other hand, for the label-based measures and Hamming loss, almost no significant differences are observed. However, the average ranks suggest that for this group of measures, the classification quality may deteriorate.
The results clearly show that although label-specific quality measures do not change in a significant way, the prediction of the entire label-vector improves. This is an expected result since the CC-based approach incorporates the inter-label relations. This is a well-known fact that has been reported by authors that have previously compared both approaches Madjarov2012 ().
The results also show that there are almost no significant differences between J48, NB and KNN based algorithms. In contrast, the SVM algorithm tends to outperform the remaining ones in terms of example-based criteria, Hamming loss and zero-one loss. This means that although the J48 algorithm takes the biggest advantage of employing the chain rule, NB and KNN based classifiers are comparable to J48-based ensembles.
[Table 2: results of the statistical evaluation, reported per criterion: Hamming, Zero-One, EX FDR, EX FNR, EX F1, Macro FDR, Macro FNR, Macro F1, Micro FDR, Micro FNR, Micro F1.]
4.2 Naive Bayes Classifier
The results of the experimental study are presented in Table 3 and Figure 3(a). Tables 8, 9 and 10 show the full results of the experiment. Table 3 provides the results of the statistical evaluation of the experiments. Figure 3(a) visualises the average ranks and provides a view of the Nemenyi post-hoc procedure.
First, let us analyse the differences between the proposed heuristic and the simple ECC ensemble. The proposed method is tailored to optimise the macro-averaged $F_1$ loss, so we begin by investigating macro-averaged measures. It is easy to see that both methods are comparable in terms of recall but the proposed one is significantly better in terms of precision. This means that the proposed method makes significantly fewer false positive predictions. Consequently, under the macro-averaged $F_1$ loss the proposed method outperforms the ECC ensemble. The same pattern is also present in the results related to micro-averaged measures. However, the difference for the micro-averaged $F_1$ measure is not significant. In contrast, under example-based measures, except for the Hamming loss, there are no significant differences between the investigated methods.
The results show that the proposed heuristic provides an effective way of improving classification quality for classifier chain ensembles. Moving the best performing label-specific models to the beginning of the chain reduces the error that propagates along the chain. What is more, the experimental study also showed that the Naive Bayes classifier combined with proper data preprocessing may be effectively employed in classifier chain ensembles.
Now, let us compare the proposed method to the other algorithm based on the dynamic chain approach. When we investigate the example-based criteria, it is easy to see that the OOCC algorithm outperforms the proposed one in terms of FDR and Hamming loss. Those results, combined with the results achieved in terms of macro and micro averaged measures, show that the OOCC method seems to be too conservative. That is, it tends to make many false negative predictions in comparison to the other methods. The outstanding results for the Hamming loss are a consequence of the imbalanced nature of multi-label data. That is, the presence of labels is relatively rare, and a prediction that contains many false negatives may achieve inadequately high performance under the Hamming loss Luaces2012 ().
On the other hand, the average ranks clearly show that the method based on the genetic algorithm achieves the best results in comparison to the other investigated methods. The main reason is that the GA-based approach optimises the entire ensemble structure, whereas the investigated dynamic chain methods choose the best label order for a single classifier chain and only then combine the locally chosen chains into an ensemble. This gives us an important clue: when we consider an algorithm for dynamic chain order selection, we should think about both the single chain and the global structure of the entire ensemble.
[Table 3: results of the statistical evaluation, reported per criterion: Hamming, Zero-One, EX FDR, EX FNR, EX F1, Macro FDR, Macro FNR, Macro F1, Micro FDR, Micro FNR, Micro F1.]
4.3 Nearest Neighbour Classifier
The results of the experimental study are presented in Table 4 and Figure 3(b). Tables 11, 12 and 13 show the full results of the experiment. Table 4 provides the results of the statistical evaluation of the experiments. Figure 3(b) visualises the average ranks and provides a view of the Nemenyi post-hoc procedure.
The results show that for the group of example based measures and the zero-one loss, there are no significant differences in classification quality between all investigated algorithms.
For macro and micro averaged measures, the best performing algorithm is an ensemble optimised using the genetic algorithm. The proposed nearest-neighbour-based classifier does not differ significantly from ECC OOCC algorithms. However, it tends to be more conservative because it achieves lower FDR and higher FDR. In other words, the classifier tends to decrease the false positive rate at the cost of decreasing the true positive rate. This phenomenon causes the highest classification quality in terms of the Hamming loss. The reason is that for the multi-label set with low label density, it is easy to obtain high classification by setting all possible labels as irrelevant.
The results confirm the findings described in Section 4.1. The nearest-neighbour-based CC algorithm is unable to take full advantage of the chaining approach. On the other hand, the method is still comparable to chains built using different base classifiers. The main goal of this paper was to propose a model that can change the chain structure without retraining, and this goal was achieved.
[Table: results per quality criterion; table body not recovered. Column headers: Hamming, Zero-One, EX FDR, EX FNR, EX, Macro FDR, Macro FNR, Macro, Micro FDR, Micro FNR, Micro]
The main goal of this research was to provide an effective chain classifier that allows the label order to be changed at a relatively low computational cost. We achieved it using a classifier based on the Naive Bayes approach. To show that the proposed method handles inter-label relations efficiently, we proposed a simple heuristic that determines a label order intended to minimise error propagation along the chain. Indeed, the experimental results showed that the proposed method is able to produce a good chain structure at a low computational cost. However, the proposed method of building a dynamic ensemble does not outperform the static system that optimises the entire ensemble structure. The obtained results are nevertheless promising, and we believe that there is still room for improvement. In our opinion, the performance of the system may be improved by a better heuristic that optimises the entire ensemble in a dynamic way. The proposed dynamic classifier is a first step in the process of investigating dynamic classifier chain ensembles.
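The idea behind the ordering heuristic, sorting labels by their label-specific quality measured on validation data, can be sketched as follows. This is only an illustration under stated assumptions: the per-label F1 score as the quality measure, the best-first direction of the sort, the function names, and the toy validation data are all hypothetical and may differ from the exact procedure used in the experiments.

```python
def f1_per_label(y_true, y_pred):
    """Per-label F1 computed on validation data (0.0 when undefined)."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def chain_order(label_quality):
    """Place the most reliably predicted labels first in the chain, so that
    later links condition on predictions that are less likely to be wrong.
    The sort is stable, so ties keep the original label index order."""
    return sorted(range(len(label_quality)), key=lambda j: -label_quality[j])


# Hypothetical validation data: 4 instances, 3 labels.
val_true = [[0, 1, 1], [1, 0, 1], [1, 1, 0], [0, 0, 1]]
val_pred = [[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 1]]

quality = f1_per_label(val_true, val_pred)
order = chain_order(quality)
print(quality)  # per-label F1 on the validation set
print(order)    # label indices in chain order
```

Because the chain classifier does not need to be retrained when the order changes, an ordering of this kind could in principle be recomputed whenever new validation information becomes available.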
Another way of improving this idea is to build different classifiers that would be able to change the chain structure without retraining the entire model.
The work was supported by the statutory funds of the Department of Systems and Computer Networks, Wroclaw University of Science and Technology, under agreement 0401/0159/16.
- (1) Alvares Cherman, E., Metz, J., Monard, M.C.: A simple approach to incorporate label dependency in multi-label classification. In: Advances in Soft Computing, pp. 33–43. Springer Berlin Heidelberg (2010). DOI 10.1007/978-3-642-16773-7_3
- (2) Chang, C.C., Lin, C.J.: LIBSVM. ACM Transactions on Intelligent Systems and Technology 2(3), 1–27 (2011). DOI 10.1145/1961189.1961199
- (3) Charte, F., Rivera, A., del Jesus, M.J., Herrera, F.: Concurrence among imbalanced labels and its influence on multilabel resampling algorithms. In: Lecture Notes in Computer Science, pp. 110–121. Springer International Publishing (2014). DOI 10.1007/978-3-319-07617-1_10
- (4) Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Quinta: A question tagging assistant to improve the answering ratio in electronic forums. In: IEEE EUROCON 2015 - International Conference on Computer as a Tool (EUROCON). IEEE (2015). DOI 10.1109/eurocon.2015.7313677
- (5) Chen, B., Li, W., Zhang, Y., Hu, J.: Enhancing multi-label classification based on local label constraints and classifier chains. In: 2016 International Joint Conference on Neural Networks (IJCNN). IEEE (2016). DOI 10.1109/ijcnn.2016.7727370
- (6) Cheng, W., Hüllermeier, E., Dembczynski, K.J.: Bayes optimal multilabel classification via probabilistic classifier chains. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp. 279–286 (2010)
- (7) Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995). DOI 10.1007/bf00994018
- (8) Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1), 21–27 (1967). DOI 10.1109/tit.1967.1053964
- (9) Díez, J., Luaces, O., del Coz, J.J., Bahamonde, A.: Optimizing different loss functions in multilabel classifications. Progress in Artificial Intelligence 3(2), 107–118 (2014). DOI 10.1007/s13748-014-0060-7
- (10) Demšar, J.: Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7, 1–30 (2006)
- (11) Dhar, M.: On cardinality of fuzzy sets. International Journal of Intelligent Systems and Applications 5(6), 47–52 (2013). DOI 10.5815/ijisa.2013.06.06
- (12) Friedman, M.: A comparison of alternative tests of significance for the problem of rankings. The Annals of Mathematical Statistics 11(1), 86–92 (1940). DOI 10.1214/aoms/1177731944
- (13) García, V., Sánchez, J., Mollineda, R.: On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowledge-Based Systems 25(1), 13–21 (2012). DOI 10.1016/j.knosys.2011.06.013
- (14) Gibaja, E., Ventura, S.: Multi-label learning: a review of the state of the art and ongoing research. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4(6), 411–444 (2014). DOI 10.1002/widm.1139
- (15) Gonçalves, E.C., Plastino, A., Freitas, A.A.: Simpler is better. In: Proceedings of the 2015 on Genetic and Evolutionary Computation Conference - GECCO ’15. ACM Press (2015). DOI 10.1145/2739480.2754650
- (16) Gonçalves, E.C., Plastino, A., Freitas, A.A.: A genetic algorithm for optimizing the label ordering in multi-label classifier chains. In: 2013 IEEE 25th International Conference on Tools with Artificial Intelligence. IEEE (2013). DOI 10.1109/ictai.2013.76
- (17) Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software. ACM SIGKDD Explorations Newsletter 11(1), 10 (2009). DOI 10.1145/1656274.1656278
- (18) Hall, M.A.: Correlation-based feature selection for machine learning. Ph.D. thesis, The University of Waikato (1999)
- (19) Hand, D.J., Yu, K.: Idiot’s bayes: Not so stupid after all? International Statistical Review / Revue Internationale de Statistique 69(3), 385 (2001). DOI 10.2307/1403452
- (20) Holm, S.: A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 6(2), 65–70 (1979). DOI 10.2307/4615733
- (21) Huang, J., Li, G., Wang, S., Zhang, W., Huang, Q.: Group sensitive classifier chains for multi-label classification. In: 2015 IEEE International Conference on Multimedia and Expo (ICME). IEEE (2015). DOI 10.1109/icme.2015.7177400
- (22) Jiang, J.Y., Tsai, S.C., Lee, S.J.: Fsknn: Multi-label text categorization based on fuzzy similarity and k nearest neighbors. Expert Systems with Applications 39(3), 2813–2821 (2012). DOI 10.1016/j.eswa.2011.08.141
- (23) Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Proc. 10th European Conference on Machine Learning, pp. 137–142 (1998)
- (24) Liu, X., Shi, Z., Li, Z., Wang, X., Shi, Z.: Sorted label classifier chains for learning images with multi-label. In: Proceedings of the international conference on Multimedia - MM ’10. ACM Press (2010). DOI 10.1145/1873951.1874121
- (25) Luaces, O., Díez, J., Barranquero, J., del Coz, J.J., Bahamonde, A.: Binary relevance efficacy for multilabel classification. Progress in Artificial Intelligence 1(4), 303–313 (2012). DOI 10.1007/s13748-012-0030-x
- (26) Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S.: An extensive experimental comparison of methods for multi-label learning. Pattern Recognition 45(9), 3084–3104 (2012). DOI 10.1016/j.patcog.2012.03.004
- (27) Montañes, E., Senge, R., Barranquero, J., Ramón Quevedo, J., José del Coz, J., Hüllermeier, E.: Dependent binary relevance models for multi-label classification. Pattern Recognition 47(3), 1494–1508 (2014). DOI 10.1016/j.patcog.2013.09.029
- (28) Quinlan, J.R.: C4.5 : Programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)
- (29) Read, J., Martino, L., Luengo, D.: Efficient monte carlo methods for multi-dimensional learning with classifier chains. Pattern Recognition 47(3), 1535–1546 (2014). DOI 10.1016/j.patcog.2013.10.006
- (30) Read, J., Peter, R.: Meka (2017). URL http://meka.sourceforge.net/
- (31) Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Machine Learning and Knowledge Discovery in Databases, pp. 254–269. Springer Berlin Heidelberg (2009). DOI 10.1007/978-3-642-04174-7_17
- (32) Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Machine Learning 85(3), 333–359 (2011). DOI 10.1007/s10994-011-5256-5
- (33) Sanden, C., Zhang, J.Z.: Enhancing multi-label music genre classification through ensemble techniques. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information - SIGIR ’11. ACM Press (2011). DOI 10.1145/2009916.2010011
- (34) Senge, R., del Coz, J.J., Hüllermeier, E.: On the problem of error propagation in classifier chains for multi-label classification. In: Studies in Classification, Data Analysis, and Knowledge Organization, pp. 163–170. Springer International Publishing (2013). DOI 10.1007/978-3-319-01595-8_18
- (35) da Silva, P.N., Gonçalves, E.C., Plastino, A., Freitas, A.A.: Distinct chains for different instances: An effective strategy for multi-label classifier chains. In: Machine Learning and Knowledge Discovery in Databases, pp. 453–468. Springer Berlin Heidelberg (2014). DOI 10.1007/978-3-662-44851-9_29
- (36) Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., Vlahavas, I.: Multi-target regression via input space expansion: treating targets as inputs. Machine Learning 104(1), 55–98 (2016). DOI 10.1007/s10994-016-5546-z
- (38) Tomás, J.T., Spolaôr, N., Cherman, E.A., Monard, M.C.: A framework to generate synthetic multi-label datasets. Electronic Notes in Theoretical Computer Science 302, 155–176 (2014). DOI 10.1016/j.entcs.2014.01.025
- (39) Trajdos, P., Kurzynski, M.: Naive bayes classifier for dynamic chaining approach in multi-label learning. International Journal of Education and Learning Systems 2, 133–142 (2017). URL http://www.iaras.org/iaras/filedownloads/ijels/2017/002-0019(2017).pdf
- (40) Trajdos, P., Kurzynski, M.: Permutation-based diversity measure for classifier-chain approach. In: Advances in Intelligent Systems and Computing, pp. 412–422. Springer International Publishing (2017). DOI 10.1007/978-3-319-59162-9_43
- (41) Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and efficient multilabel classification in domains with large number of labels, pp. 30–44 (2008)
- (42) Wei, Y., Xia, W., Lin, M., Huang, J., Ni, B., Dong, J., Zhao, Y., Yan, S.: Hcp: A flexible cnn framework for multi-label image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(9), 1901–1907 (2016). DOI 10.1109/tpami.2015.2491929
- (43) Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80 (1945). DOI 10.2307/3001968
- (44) Wu, J.S., Huang, S.J., Zhou, Z.H.: Genome-wide protein function prediction through multi-instance multi-label learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 11(5), 891–902 (2014). DOI 10.1109/tcbb.2014.2323058
- (45) Xu, J.: Fast multi-label core vector machine. Pattern Recognition 46(3), 885–898 (2013). DOI 10.1016/j.patcog.2012.09.003
- (46) Zadeh, L.: Fuzzy sets. Information and Control 8(3), 338–353 (1965). DOI 10.1016/s0019-9958(65)90241-x
- (47) Zhang, P., Yang, Y., Zhu, X.: Approaching multi-dimensional classification by using bayesian network chain classifiers. In: 2014 Sixth International Conference on Intelligent Human-Machine Systems and Cybernetics. IEEE (2014). DOI 10.1109/ihmsc.2014.129
- (48) Zhou, Z.H., Zhang, M.L., Huang, S.J., Li, Y.F.: Multi-instance multi-label learning. Artificial Intelligence 176(1), 2291–2320 (2012). DOI 10.1016/j.artint.2011.10.002
Appendix A Full results
[Tables of full results; table bodies not recovered. Surviving column headers: Macro FDR, Macro FNR, Macro, Hamming]