Weighting Scheme for a Pairwise Multi-label Classifier Based on the Fuzzy Confusion Matrix.
In this work we addressed the issue of applying a stochastic classifier and a local, fuzzy confusion matrix under the framework of multi-label classification. We proposed a novel solution to the problem of correcting label pairwise ensembles. The main step of the correction procedure is to compute classifier-specific competence and cross-competence measures, which estimate the error pattern of the underlying classifier. At the fusion phase we employed two weighting approaches based on information theory. The classifier weights promote base classifiers which are the most susceptible to the correction based on the fuzzy confusion matrix. During the experimental study, the proposed approach was compared against two reference methods. The comparison was made in terms of six different quality criteria. The conducted experiments reveal that the proposed approach eliminates one of the main drawbacks of the original FCM-based approach, i.e. its vulnerability to an imbalanced class/label distribution. What is more, the obtained results show that the introduced method achieves satisfying classification quality under all considered quality criteria. Additionally, the impact of fluctuations of data set characteristics is reduced.
Keywords: multi-label classification, label pairwise transformation, random reference classifier, confusion matrix, information theory, entropy
1 Introduction
Many real-world datasets describe objects that are assigned to multiple categories at the same time. All of these concepts constitute a full description of the object, and omitting one of these tags induces a loss of information. A classification process that involves this kind of data is called multi-label classification Gibaja2014 ().
Unfortunately, single-label classification methods are not able to solve the aforementioned task directly. The main reason is that single-label classifiers are built under the assumption that an object is assigned to only one class. A solution to this issue is to provide dedicated multi-label classification procedures that are able to handle multi-label data directly.
This study aims to assess how an information-theory-based competence measure can improve the classification quality obtained by label-pairwise (LPW) multi-label classifiers. In particular, we focus on the impact of this measure on a classifier that is corrected using a procedure based on the fuzzy confusion matrix and the Random Reference Classifier (RRC). The procedure corrects the predictions of the classifiers constituting the LPW ensemble. The outcome of each LPW member is individually modified according to the confusion pattern obtained during the validation stage. The corrected outcomes are then combined using a combination method driven by the information-theoretic competence measure.
This paper is organized as follows. The next section presents the work related to the issue considered in this paper. Section 3 provides the formal notation used throughout this article, and introduces the FCM correction algorithm and its weighted version. Section 4 describes the experimental setup. In Section 5 the experimental results are presented and discussed. Section 6 concludes the paper.
2 Related Work
Multi-label classification algorithms can be broadly partitioned into two main groups, i.e. set transformation algorithms and algorithm adaptation approaches Gibaja2014 ().
A method that belongs to the group of algorithm adaptation approaches provides a generalisation of an existing multi-class algorithm. The generalised algorithm is able to solve the multi-label classification problem in a direct way. Among others, the best-known approaches from this group are the multi-label KNN algorithm Jiang2012 (), the ML Hoeffding trees Read2012 (), the Structured SVM approach Diez2014 () and deep-learning-based algorithms Wei2015 ().
On the other hand, methods from the former group decompose a multi-label problem into a set of single-label classification tasks. During the inference phase, the outputs of the underlying single-label classifiers are combined in order to create a multi-label prediction.
The decomposition technique studied in depth throughout this article is the label-pairwise (LPW) scheme, which decomposes the multi-label classification task into a set of binary problems. Under this framework, a binary classifier is assigned to each pair of labels. The outcome of the classifier is interpreted as an expression of pairwise preference in a label ranking Hllermeier2010 ().
The concept of the fuzzy confusion matrix (FCM) was first introduced in studies related to the task of hand gesture recognition Kurzynski2015 (); Trajdos2016 (). The proposed system exploits two main advantages of the FCM approach. The first is its ability to correct the output of a classifier that makes systematic errors. The second is the possibility of handling imprecise class assignments.
The above-mentioned approach was also employed under the multi-label classification framework Trajdos2015 (). Namely, it was used to improve the quality of Binary Relevance (BR) classifiers. Experiments confirmed the validity of its use, but also showed its sensitivity to an unbalanced class distribution in a binary problem. In this study, we focus on addressing this issue by employing the LPW technique, which produces more balanced single-label problems than the BR approach.
During the prediction phase, we decided to employ a weight function based on information theory. The main motivation was that information-theoretic measures hold a few properties which make them very reliable indicators of the competence of an FCM-corrected classifier. To be more precise, previously conducted research showed that although the FCM model is able to correct a randomly guessing classifier, the correction is most effective when the underlying base classifier makes a systematic error Trajdos2016 (). The information-theoretic competence criterion allows us to detect such a situation and to put more weight on classifiers whose correction ability is higher.
3 Proposed method
Under the multi-label formalism an object is assigned to a set of labels indicated by a binary vector whose length equals the number of labels. Each element of the vector is related to a single label. In this study we assume that the multi-label classifier is built in a supervised learning procedure using a training set containing pairs of feature vectors and corresponding label vectors.
Additionally, throughout this paper we follow the statistical classification framework, so feature vectors and label vectors are treated as realisations of the corresponding random vectors.
3.2 Pairwise Transformation
The label-pairwise (LPW) transformation builds the multi-label classifier using an ensemble of binary classifiers, with a single binary classifier assigned to each pair of labels.
During the training phase of a binary classifier, only learning objects belonging to exactly one of the two labels of the pair are used. Examples that appear in both classes are ignored. Instances assigned only to other labels are also ignored because they hold no information that can be used by the binary classifier Hllermeier2010 ().
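The selection of training instances described above can be illustrated with a minimal sketch. The function name `pairwise_training_sets` and the list-based data layout are assumptions made for this illustration and are not part of the original implementation, which was built on MULAN and WEKA.

```python
from itertools import combinations


def pairwise_training_sets(X, Y):
    """Split a multi-label training set into one binary problem per label pair.

    X: list of feature vectors; Y: list of 0/1 label vectors of equal length.
    Returns {(i, j): (X_ij, t_ij)}, where t is 0 for "label i" and 1 for "label j".
    (Hypothetical helper, written only to illustrate the LPW selection rule.)
    """
    n_labels = len(Y[0])
    subsets = {}
    for i, j in combinations(range(n_labels), 2):
        X_ij, t_ij = [], []
        for x, y in zip(X, Y):
            # Keep the instance only if exactly one of the two labels is relevant;
            # instances carrying both labels or neither of them are ignored.
            if y[i] != y[j]:
                X_ij.append(x)
                t_ij.append(0 if y[i] == 1 else 1)
        subsets[(i, j)] = (X_ij, t_ij)
    return subsets
```

For a problem with three labels the sketch produces three binary training sets, one per label pair, each containing only the instances that discriminate between the two labels.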
During the inference stage, at the continuous-valued output level, each binary classifier produces a 2-dimensional vector of label supports, whose values are interpreted as the supports for the hypotheses that the first and the second label of the pair, respectively, are relevant for the object. Without loss of generality we assume that the output vector is normalised, that is, its elements sum to one.
All binary classifiers in the LPW ensemble contribute to the final decision through combining their continuous-valued outputs. That is, the final support for a given label is calculated as a weighted average of the soft outputs of the binary classifiers built for the pairs that contain this label.
Each weight is assigned to a pair-specific binary classifier and is calculated in a dynamic way for the given input vector.
The final multi-label classification, i.e. the response of the multi-label classifier, is obtained by applying a thresholding procedure to the soft outputs of the above-defined multi-label classifier:
where the threshold is usually set to .
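A minimal sketch of the fusion and thresholding steps is given below. It assumes that the weighted average is normalised by the sum of the weights of the classifiers involving a given label, and it uses a default threshold of 0.5; both choices, as well as the name `fuse_pairwise`, are assumptions made for illustration only.

```python
def fuse_pairwise(supports, weights, n_labels, threshold=0.5):
    """Combine pairwise soft outputs into label supports and a binary response.

    supports[(i, j)] = (d_i, d_j): normalised supports for labels i and j.
    weights[(i, j)]  = non-negative weight of the pair classifier for this input.
    The threshold value 0.5 is an assumed default, not taken from the paper.
    """
    num = [0.0] * n_labels
    den = [0.0] * n_labels
    for (i, j), (d_i, d_j) in supports.items():
        w = weights[(i, j)]
        num[i] += w * d_i
        num[j] += w * d_j
        den[i] += w
        den[j] += w
    # Weighted average of the supports; a label with zero total weight gets 0.
    fused = [n / d if d > 0 else 0.0 for n, d in zip(num, den)]
    return fused, [1 if s > threshold else 0 for s in fused]
```

Setting all weights to one reduces the sketch to a plain average of the pairwise supports, which corresponds to an unweighted LPW ensemble.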
3.3 Proposed Correction Method
The proposed correction method is based on an assessment of the probability that the binary classifier assigns an object to a given class. Such an approach requires a probabilistic model which assumes that the result of classification of an object by the binary classifier, the true label and the feature vector are observed values of random variables. The randomness of the true label and of the feature vector is a simple consequence of the probabilistic model presented in the previous subsection.
The randomness of the classification result for a given feature vector means that the binary classifier is a randomized classifier, which is defined by the conditional probabilities of its outcomes given the feature vector Berger1985 ().
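The Bayesian model allows us to define the posterior probability of a label via the law of total probability, i.e. as a sum over the possible classifier outcomes of the probability of the outcome multiplied by the probability that an object belongs to the class given that outcome.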
Unfortunately, the core of the proposed method rests on the rather impractical assumption that the classifier assigns a label in a stochastic manner. We dealt with this issue by harnessing deterministic binary classifiers whose statistical properties were modelled using the RRC procedure Woloszynski2011 (). The RRC model calculates the probability that the underlying classifier assigns an instance to a given class.
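The correction step can be sketched as follows, assuming that the outcome probabilities of the randomized classifier (here supplied by the caller rather than by an actual RRC model) and the conditional probabilities of the true class given each outcome are already available. The function and argument names are hypothetical.

```python
def corrected_posterior(p_pred, p_true_given_pred):
    """Correct a binary prediction using the law of total probability.

    p_pred[s]               : probability that the randomized classifier outputs s.
    p_true_given_pred[s][m] : probability that the true class is m given output s.
    Returns the corrected probability of each true class m.
    """
    n_classes = len(p_true_given_pred[0])
    return [
        sum(p_pred[s] * p_true_given_pred[s][m] for s in range(len(p_pred)))
        for m in range(n_classes)
    ]
```

For example, a classifier that outputs class 0 with probability 0.9 but is systematically wrong (the true class is 1 in most cases regardless of the output) ends up with most of the corrected probability mass on class 1.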
3.4 Confusion Matrix
During the inference process of the proposed approach, the probability that an object belongs to a class given the classifier outcome is estimated using a local, fuzzy confusion matrix. An example of such a matrix for a binary classification task is given in Table 1. The rows of the matrix correspond to the ground-truth classes, whereas the columns match the outcome of a classifier. The fuzzy nature of the confusion matrix arises directly from the fact that a stochastic model has been employed. We expressed the decision regions of the random classifier in terms of the fuzzy set formalism Zadeh1965 (). To provide an accurate estimation, we have also defined our confusion matrix as local, which means that the matrix is built using the points neighbouring the classified instance.
The local fuzzy confusion matrix is estimated using a validation set, each element of which consists of the description of an instance and the corresponding vector indicating its label assignment. On the basis of this set we define, respectively, pairwise subsets of the validation set, the fuzzy decision regions of the binary classifier and the set of neighbours of the classified instance.
Each element of these fuzzy sets pairs a validation instance with its fuzzy membership value. For a decision region, the membership value indicates the degree to which the instance belongs to the fuzzy decision region of the stochastic classifier. The membership function of the fuzzy neighbourhood of the classified instance was defined using a Gaussian potential function.
The above-defined fuzzy sets are employed to approximate the entries of the local confusion matrix, where each entry is computed as the cardinality of a fuzzy set Dhar2013 (). Finally, the approximation of the sought probability is calculated from these entries.
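To make the estimation procedure more concrete, the following is a minimal sketch rather than the exact formulation used in this paper. It assumes that the intersection of fuzzy sets is computed with the product, that the cardinality of a fuzzy set is the sum of its membership values, and that the Gaussian potential has the form exp(-beta * squared distance) with a hypothetical parameter `beta`. All function names are illustrative.

```python
import math


def gaussian_neighbourhood(x, v, beta=1.0):
    """Membership of validation point v in the fuzzy neighbourhood of x.
    The form exp(-beta * ||x - v||^2) and the parameter beta are assumptions."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-beta * dist2)


def local_fuzzy_confusion(x, val_X, val_true, val_pred_prob, beta=1.0):
    """Estimate the 2x2 local fuzzy confusion matrix around x.

    val_true[k]      : true class (0 or 1) of validation instance k.
    val_pred_prob[k] : [P(output 0), P(output 1)] of the randomized classifier.
    Rows of the result are true classes, columns are classifier outcomes.
    """
    eps = [[0.0, 0.0], [0.0, 0.0]]
    for v, m, probs in zip(val_X, val_true, val_pred_prob):
        mu_n = gaussian_neighbourhood(x, v, beta)
        for s in (0, 1):
            # Product of the decision-region and neighbourhood memberships,
            # summed over validation points whose true class is m.
            eps[m][s] += probs[s] * mu_n
    return eps


def true_given_pred(eps):
    """Column-normalise eps to approximate P(true class m | outcome s, x).
    Returns out[s][m]; an empty column falls back to 0.5 (an assumption)."""
    out = []
    for s in (0, 1):
        col = eps[0][s] + eps[1][s]
        out.append([eps[m][s] / col if col > 0 else 0.5 for m in (0, 1)])
    return out
```

The output of `true_given_pred` has the shape expected by a correction step that multiplies these conditional probabilities by the outcome probabilities of the randomized classifier.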
3.5 Weighting Scheme
In this section, we define a weighting approach that is used during the prediction phase to promote base classifiers for which the correction ability of the FCM model is most effective.
We compute the mutual information and the joint entropy of the random variables corresponding to the randomized classifier prediction and the true label assignment. Finally, the classifier-specific weight is defined as the normalised mutual information Cahill2010 ().
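A minimal sketch of such a weight is shown below. It assumes that the joint distribution of the prediction and the true label is obtained by normalising the entries of the confusion matrix, and that the mutual information is normalised by the joint entropy; other normalisations exist, so this particular choice and the name `nmi_weight` should be read as assumptions.

```python
import math


def nmi_weight(eps):
    """Mutual information of (true label, prediction) divided by joint entropy.

    eps[m][s] is a (possibly fuzzy) confusion matrix with non-negative entries.
    Returns a value in [0, 1]; 0 is returned when the joint entropy is zero.
    """
    total = sum(sum(row) for row in eps)
    if total <= 0:
        return 0.0
    joint = [[e / total for e in row] for row in eps]
    p_true = [sum(row) for row in joint]
    p_pred = [sum(joint[m][s] for m in range(len(joint)))
              for s in range(len(joint[0]))]
    mi = 0.0
    h_joint = 0.0
    for m, row in enumerate(joint):
        for s, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (p_true[m] * p_pred[s]))
                h_joint -= p * math.log2(p)
    return mi / h_joint if h_joint > 0 else 0.0
```

Under this definition a classifier that is always wrong receives the same weight as a classifier that is always right, because in both cases the prediction fully determines the true label. This matches the motivation given above: a systematic error is exactly the kind of error that the FCM correction can reverse, whereas a prediction that is independent of the true label receives a weight of zero.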
4 Experimental Setup
The conducted experimental study provides an empirical evaluation of the classification quality of the proposed method and compares it against reference methods. Namely, we conducted our experiments using the following algorithms:
- (1) Unmodified LPW classifier Hllermeier2010 (),
- (2) LPW classifier corrected using the confusion matrix specific to balanced label distributions,
- (3) LPW classifier corrected using FCM with fusion performed using the information-theoretic weight.
In the following sections of this paper, we will refer to the investigated algorithms using the above numbers.
All base single-label classifiers were implemented using the Naïve Bayes classifier Hand2001 () combined with the Random Subspace technique TinKamHo1998 (). We utilized the Naïve Bayes implementation available in the WEKA framework Hall2009 (). The classifier parameters were set to their defaults. For the Random Subspace we set the number of attributes to the of the original number of attributes, and the number of repetitions was set to . All multi-label algorithms were implemented using the MULAN Tsoumakas2011_mulan () framework.
The experiments were conducted using 29 multi-label benchmark sets. The main characteristics of the datasets are summarized in Table 2.
The extraction of training and test datasets was performed using fold cross-validation. The proportion of the training set was fixed at of the original training set . Some of the employed sets needed preprocessing. That is, multi-label regression sets (No. 9, 10, 28) were binarised using a thresholding procedure. To be more precise, when the value of the output variable for a given object is greater than zero, the corresponding label is set to be relevant to the object. We also used multi-label multi-instance Zhou2012 () sets (No. 2, 4, 5, 12, 13, 18, 20, 21), which were transformed to single-instance multi-label datasets according to the suggestion made by Zhou et al. Zhou2012 (). Two of the used datasets are synthetic (No. 23, 24) and were generated using the algorithm described in Tomas2014 (). To reduce the computational burden we used only two subsets for each of the IMDB and Tmc2007 sets.
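The binarisation of the multi-label regression sets follows directly from the rule stated above and can be sketched in a few lines; the function name `binarise_targets` is hypothetical.

```python
def binarise_targets(Y_real):
    """Turn real-valued target vectors into 0/1 label vectors.

    A label is relevant (1) when the corresponding output value is greater
    than zero, and irrelevant (0) otherwise.
    """
    return [[1 if value > 0 else 0 for value in row] for row in Y_real]
```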
The algorithms were compared in terms of six different quality criteria coming from three groups: ranking-based, instance-based and label-based (including micro-averaged and macro-averaged) Luaces2012 ().
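To illustrate how criteria from these groups differ, the sketch below computes the Hamming loss, the zero-one loss, and macro- and micro-averaged scores. The use of the F1 score as the base measure for the label-based criteria is an assumption made for this example, as is the convention of returning 0 for a label with no true or predicted positives.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label assignments that are wrong."""
    n, n_labels = len(Y_true), len(Y_true[0])
    wrong = sum(t != p for yt, yp in zip(Y_true, Y_pred) for t, p in zip(yt, yp))
    return wrong / (n * n_labels)


def zero_one_loss(Y_true, Y_pred):
    """Fraction of instances whose predicted label vector is not an exact match."""
    return sum(yt != yp for yt, yp in zip(Y_true, Y_pred)) / len(Y_true)


def _f1(tp, fp, fn):
    # Assumed convention: F1 is 0 when there are no true or predicted positives.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def macro_micro_f1(Y_true, Y_pred):
    """Return (macro-averaged F1, micro-averaged F1) over labels."""
    counts = []
    for j in range(len(Y_true[0])):
        tp = sum(yt[j] == 1 and yp[j] == 1 for yt, yp in zip(Y_true, Y_pred))
        fp = sum(yt[j] == 0 and yp[j] == 1 for yt, yp in zip(Y_true, Y_pred))
        fn = sum(yt[j] == 1 and yp[j] == 0 for yt, yp in zip(Y_true, Y_pred))
        counts.append((tp, fp, fn))
    # Macro: average the per-label scores, so every label counts equally.
    macro = sum(_f1(*c) for c in counts) / len(counts)
    # Micro: pool the counts first, so frequent labels dominate.
    micro = _f1(sum(c[0] for c in counts),
                sum(c[1] for c in counts),
                sum(c[2] for c in counts))
    return macro, micro
```

When a rare label is always missed while a frequent label is always predicted correctly, the macro-averaged score drops sharply and the micro-averaged score stays high, which is why macro-averaged criteria are read here as indicators of quality for rare labels.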
Statistical evaluation of the results was performed using the Wilcoxon signed-rank test demsar2006 () and the family-wise error rates were controlled using the Bergmann-Hommel’s procedure demsar2006 (). For all statistical tests, the significance level was set to .
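As an illustration of the pairwise comparison, the sketch below computes the Wilcoxon signed-rank statistic for two algorithms evaluated on the same datasets. It uses the normal approximation without tie or continuity corrections, which is a simplification, and it does not implement the Bergmann-Hommel procedure. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def wilcoxon_signed_rank(a, b):
    """Return (W, two-sided p-value) for paired samples a and b.

    Zero differences are discarded; the p-value comes from the normal
    approximation, which is only reasonable for a moderate number of pairs.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    return w, math.erfc(abs(z) / math.sqrt(2))
```

Here `a` and `b` would hold the per-dataset values of one quality criterion for two of the compared algorithms.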
To provide a more detailed look at the properties of the proposed approach, the relations between the classification quality obtained by the investigated algorithms and chosen dataset characteristics were also analysed. This assessment allows us to determine how the investigated classifiers respond to changes in vital properties of datasets. In order to assess the relations in a quantitative way, we used the Spearman correlation coefficient Spearman1904 (). The significance of the obtained correlations was tested using a two-tailed t-test Hollander_2013_book (). As in the experiments related to classification quality, the significance level was also set to and we employed the Holm method to adjust p-values demsar2006 ().
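The correlation analysis can be sketched as follows. The example computes the Spearman coefficient as the Pearson correlation of average ranks, the t statistic used to test its significance, and Holm-adjusted p-values. Converting the t statistic into a p-value requires the Student t distribution, which is left out of this sketch; all function names are hypothetical.

```python
import math


def _ranks(values):
    """Ascending ranks with ties replaced by their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def t_statistic(rho, n):
    """t statistic with n - 2 degrees of freedom for a correlation rho."""
    if abs(rho) >= 1.0:
        return math.copysign(math.inf, rho)
    return rho * math.sqrt((n - 2) / (1.0 - rho * rho))


def holm_adjust(p_values):
    """Holm step-down adjustment; returns p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, k in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[k]))
        adjusted[k] = running_max
    return adjusted
```

In this setting `x` would hold a dataset characteristic (for example label density) and `y` the quality obtained by one algorithm on the same datasets.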
5 Results and Discussion
This section presents the results obtained during the conducted experimental study. The following subsections describe, respectively, the outcomes related to classification quality and the dependencies between the results obtained by the investigated algorithms and the set characteristics.
5.1 Classification quality
The summarised results related to the classification quality, which is analysed from different points of view using appropriate quality criteria, are presented in Table 3 and Figure 1. Additionally, the full results are presented in Table 4.
First of all, it is worth noting that the results reveal that the proposed algorithm does not perform significantly worse than the reference methods in terms of any quality criterion. What is more, the weighted algorithm outperforms the unweighted FCM approach in terms of the macro-averaged measure. Although the weighted FCM does not provide a significant improvement over the original label pairwise ensemble, this result indicates that the proposed weighting scheme allows the FCM classifier to achieve better performance for rare labels. This phenomenon can be explained by the fact that the weighting scheme assigns lower weights to the FCM classifiers that are biased towards the majority class, since those classifiers cannot be successfully corrected using the FCM approach. As a consequence, the outcome for a given label is produced using base classifiers that were built for more balanced binary sub-problems. The reported property reduces the tendency of the original FCM algorithm to increase the bias towards the majority class Trajdos2015 () and allows the FCM-based algorithms to be successfully employed in the task of imbalanced multi-label classification.
What is more, the classification quality expressed using the micro-averaged criterion does not differ significantly between FCM and its weighted version. This demonstrates that the increase of classification quality for rare labels is not followed by a deterioration of classification quality for frequent labels. The weighting procedure also causes no significant loss of classification quality under the example-based loss. Moreover, in the case of micro-averaged and example-based measures, the approaches based on the idea of the fuzzy confusion matrix significantly outperform the base label pairwise algorithm. On the other hand, the lack of significant improvement for frequent labels shows that the proposed method offers almost no improvement when the LPW ensemble is built using label-balanced datasets, since for those datasets the base binary classifiers are rather competent. However, those competent classifiers tend to commit systematic errors. As a consequence, the utilisation of the FCM-based approach improves classification quality for frequent labels in comparison with the uncorrected label pairwise ensemble.
The proposed algorithm significantly improves on the unweighted one in terms of the zero-one quality criterion. A significant improvement under this criterion shows that the proposed method achieves the greatest number of exact-match results among the investigated procedures. Combining these results with the performance achieved under the macro-averaged loss, we can conclude that the increase in the perfect match ratio is a consequence of improved classification of rare labels. However, the increase in the perfect match ratio is not followed by an overall improvement in classification.
The experiments show that the assessed classifiers do not differ in a significant way when we consider their ability to produce a label ranking instead of a simple binary response.
5.2 Impact of dataset properties
In this section we assess relations between the classification quality obtained by a classifier employed on a given multi-label set and the properties of this set. At the beginning of the correlation analysis, it is worth mentioning that, in general, the lack of a significant correlation between multi-label set characteristics and the classification quality obtained by an algorithm can, under specific circumstances, be interpreted as an advantage of the classifier. That is, the algorithm is more elastic, as it can be employed to solve multi-label classification problems for data sets which differ significantly in their characteristics. However, the classifier can be said to be elastic only when it offers acceptable classification quality for a wide range of data sets. Achieving satisfactory quality is an important condition, since it is easy to build a classifier which is completely independent of set characteristics and achieves low classification quality.
In general, we can observe that if label density (LD) increases, the classification quality increases. What is more, in most cases, the correlations are significant. This strong correlation is a result of employing the label pairwise decomposition of the multi-label task. That is, when LD is high, the instances are better utilised during the training and validation phases. In other words, an instance that is relevant to many categories simultaneously more often becomes a member of a training or validation set. As a consequence, each of the underlying binary classifiers is built using a larger number of training instances. The main exception to this rule is the ranking loss criterion. This result shows that the considered classification methods cannot produce a more relevant label ranking even if the base classifiers are more competent.
It can also be seen that, in general, the classification quality decreases when the imbalance ratio increases. This is, however, a widely known observation in machine learning in general Lopez2013 () and under the multi-label classification framework in particular Charte2014 (). Exceptions to this trend are the results obtained in terms of ranking loss and Hamming loss. However, for those loss functions, the change of the correlation sign cannot be considered significant.
On the other hand, no consistent tendency for the average Scumble measure can be observed. What is more, for quality criteria other than zero-one loss the obtained correlations are not significant.
Now, let us investigate each classification quality criterion separately.
First of all, we analyse the macro-averaged measure. For this quality criterion, only the introduced algorithm does not demonstrate a significant correlation with label density, although the corresponding p-value is very close to the assumed significance level. This observation supports the previously made claim that the proposed weighting approach can eliminate from the ensemble those classifiers that offer no possibility of successful correction using the FCM approach, including classifiers that are built using too few training instances. As a consequence, the relation between classification quality for rare labels and label density can be interpreted as insignificant.
On the other hand, correlations between LD and classification quality measured in terms of micro-averaged and example-based measures are significant. This result shows that although the proposed approach can reduce the impact of label density for rare labels, the classification quality for frequent labels is still affected by LD. It also shows that the classification quality improves when the number of instances grows. However, for rare labels, the proposed method prevents the quality from dropping too low.
In contrast to the results related to the macro-averaged measure, for the Hamming loss the correlation between the classification quality of the proposed method and label density is far from being significant, whereas the correlations obtained for the remaining methods are significant. A possible explanation of this result is the impact of the classification quality of rare labels, which is described above.
Although the considered algorithms are rather insensitive to changes in the scumble value, under the zero-one loss the original label pairwise ensemble shows a significant correlation with the scumble coefficient. The classification quality of the above-mentioned method decreases when scumble increases. What is more, the algorithm achieves the highest rank in terms of this measure. This result shows that the FCM-based correction eliminates the quality loss when rare labels co-occur with frequent ones.
[Table: correlations between dataset characteristics and classification quality, reported with p-values for the Hamming, Zero-one, Ranking, Macro, Micro and Example criteria.]
6 Conclusions
During this study, we successfully tackled the issue of eliminating drawbacks of the previously proposed correction algorithm based on the fuzzy confusion matrix. To reach this goal, we proposed an information-theoretic competence measure that assesses whether the base binary classifier can benefit from correction based on the FCM model.
During the experimental study, we obtained interesting results. That is, the proposed approach is able to improve classification quality for rare labels (macro-averaged loss) and under zero-one loss. What is more, the proposed weighting scheme does not achieve significantly lower quality in terms of any criterion. In addition, the approach reduces the impact of changing set-specific characteristics. As a consequence, the improved version of the FCM-based algorithm is recommended for use instead of the original one.
Since the obtained results are promising, we are willing to continue the development of FCM-based algorithms.
The work was supported by the statutory funds of the Department of Systems and Computer Networks, Wroclaw University of Science and Technology. Computational resources were provided by PL-Grid Infrastructure.
- (1) E. Gibaja, S. Ventura, Multi-label learning: a review of the state of the art and ongoing research, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4 (6) (2014) 411–444. doi:10.1002/widm.1139.
- (2) J.-Y. Jiang, S.-C. Tsai, S.-J. Lee, FSKNN: Multi-label text categorization based on fuzzy similarity and k nearest neighbors, Expert Systems with Applications 39 (3) (2012) 2813–2821. doi:10.1016/j.eswa.2011.08.141.
- (3) J. Read, A. Bifet, G. Holmes, B. Pfahringer, Scalable and efficient multi-label classification for evolving data streams, Machine Learning 88 (1-2) (2012) 243–272. doi:10.1007/s10994-012-5279-6.
- (4) J. Díez, O. Luaces, J. J. del Coz, A. Bahamonde, Optimizing different loss functions in multilabel classifications, Progress in Artificial Intelligence 3 (2) (2014) 107–118. doi:10.1007/s13748-014-0060-7.
- (5) Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, S. Yan, HCP: A flexible CNN framework for multi-label image classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (9) (2016) 1901–1907. doi:10.1109/tpami.2015.2491929.
- (6) E. Hüllermeier, J. Fürnkranz, On predictive accuracy and risk minimization in pairwise label ranking, Journal of Computer and System Sciences 76 (1) (2010) 49–62. doi:10.1016/j.jcss.2009.05.005.
- (7) M. Kurzynski, M. Krysmann, P. Trajdos, A. Wolczowski, Multiclassifier system with hybrid learning applied to the control of bioprosthetic hand, Computers in Biology and Medicine 69 (2016) 286–297. doi:10.1016/j.compbiomed.2015.04.023.
- (8) P. Trajdos, M. Kurzynski, A dynamic model of classifier competence based on the local fuzzy confusion matrix and the random reference classifier, International Journal of Applied Mathematics and Computer Science 26 (1) (2016). doi:10.1515/amcs-2016-0012.
- (9) P. Trajdos, M. Kurzynski, An extension of multi-label binary relevance models based on randomized reference classifier and local fuzzy confusion matrix, in: Intelligent Data Engineering and Automated Learning – IDEAL 2015, Springer International Publishing, 2015, pp. 69–76. doi:10.1007/978-3-319-24834-9_9.
- (10) J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer New York, 1985. doi:10.1007/978-1-4757-4286-2.
- (11) T. Woloszynski, M. Kurzynski, A probabilistic model of classifier competence for dynamic ensemble selection, Pattern Recognition 44 (10-11) (2011) 2656–2668. doi:10.1016/j.patcog.2011.03.020.
- (12) L. Zadeh, Fuzzy sets, Information and Control 8 (3) (1965) 338–353. doi:10.1016/s0019-9958(65)90241-x.
- (13) M. Dhar, On cardinality of fuzzy sets, International Journal of Intelligent Systems and Applications 5 (6) (2013) 47–52. doi:10.5815/ijisa.2013.06.06.
- (14) N. D. Cahill, Normalized measures of mutual information with general definitions of entropy for multimodal image registration, in: Biomedical Image Registration, Springer Berlin Heidelberg, 2010, pp. 258–268. doi:10.1007/978-3-642-14366-3_23.
- (15) D. J. Hand, K. Yu, Idiot’s Bayes: Not so stupid after all?, International Statistical Review / Revue Internationale de Statistique 69 (3) (2001) 385. doi:10.2307/1403452.
- (16) T. K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8) (1998) 832–844. doi:10.1109/34.709601.
- (17) M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, The WEKA data mining software, ACM SIGKDD Explorations Newsletter 11 (1) (2009) 10. doi:10.1145/1656274.1656278.
- (18) E. Spyromitros-Xioufis, G. Tsoumakas, W. Groves, I. Vlahavas, Multi-target regression via input space expansion: treating targets as inputs, Machine Learning 104 (1) (2016) 55–98. doi:10.1007/s10994-016-5546-z.
- (19) Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, Y.-F. Li, Multi-instance multi-label learning, Artificial Intelligence 176 (1) (2012) 2291–2320. doi:10.1016/j.artint.2011.10.002.
- (20) J. T. Tomás, N. Spolaôr, E. A. Cherman, M. C. Monard, A framework to generate synthetic multi-label datasets, Electronic Notes in Theoretical Computer Science 302 (2014) 155–176. doi:10.1016/j.entcs.2014.01.025.
- (21) J.-S. Wu, S.-J. Huang, Z.-H. Zhou, Genome-wide protein function prediction through multi-instance multi-label learning, IEEE/ACM Transactions on Computational Biology and Bioinformatics 11 (5) (2014) 891–902. doi:10.1109/tcbb.2014.2323058.
- (22) J. Xu, Fast multi-label core vector machine, Pattern Recognition 46 (3) (2013) 885–898. doi:10.1016/j.patcog.2012.09.003.
- (23) J. Read, R. Peter,
- (24) O. Luaces, J. Díez, J. Barranquero, J. J. del Coz, A. Bahamonde, Binary relevance efficacy for multilabel classification, Progress in Artificial Intelligence 1 (4) (2012) 303–313. doi:10.1007/s13748-012-0030-x.
- (25) J. Demšar, Statistical comparisons of classifiers over multiple data sets, The Journal of Machine Learning Research 7 (2006) 1–30.
- (26) C. Spearman, The proof and measurement of association between two things, The American Journal of Psychology 15 (1) (1904) 72. doi:10.2307/1412159.
- (27) M. Hollander, D. A. Wolfe, E. Chicken, Nonparametric Statistical Methods, John Wiley & Sons, Inc., 2015. doi:10.1002/9781119196037.
- (28) F. Charte, A. Rivera, M. J. del Jesus, F. Herrera, Concurrence among imbalanced labels and its influence on multilabel resampling algorithms, in: Lecture Notes in Computer Science, Springer International Publishing, 2014, pp. 110–121. doi:10.1007/978-3-319-07617-1_10.
- (29) V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, Information Sciences 250 (2013) 113–141. doi:10.1016/j.ins.2013.07.007.