# Sub-Classifier Construction for Error Correcting Output Code Using Minimum Weight Perfect Matching

Patoomsiri Songsiri and Boonserm Kijsirikul are with the Department of Computer Engineering, Chulalongkorn University, Bangkok, Thailand (email: patoomsiri.s@student.chula.ac.th and boonserm.k@chula.ac.th). Thimaporn Phetkaew is with the School of Informatics, Walailak University, Nakhon Si Thammarat, Thailand (email: pthimapo@wu.ac.th). Ryutaro Ichise is with the National Institute of Informatics, Tokyo, Japan (email: ichise@nii.ac.jp).

###### Abstract

Multi-class classification is essential in many real-world problems, and Error Correcting Output Code (ECOC) is one of the promising techniques for solving it. We propose a method for constructing the Error Correcting Output Code that finds a suitable combination of positive and negative classes to encode each binary classifier. The minimum weight perfect matching algorithm is applied to find the optimal pairs of subsets of classes, using the generalization performance as a weighting criterion. Based on our method, each subset of classes with positive and negative labels is appropriately combined for learning the binary classifiers. Experimental results show that our technique gives significantly higher performance than traditional methods, including the dense random code and the sparse random code, both in terms of accuracy and classification time. Moreover, our method requires a significantly smaller number of binary classifiers than One-Versus-One while maintaining accuracy.

## I Introduction

Error Correcting Output Code (ECOC) [1, 2] is one of the well-known techniques for solving multiclass classification. In this framework, an unknown-class instance is classified by all binary classifiers corresponding to the designed columns of the code matrix, and the class with the closest codeword is assigned as the final output class. A binary function that covers a large number of classes has high capability as a shared classifier. However, as the number of classes increases, the complexity of creating the separating hyperplane also increases. Finding a suitable combination of classes, with the proper number of classes for constructing each model, is still a challenging issue in obtaining an effective classifier.

Several classic designs have been applied to the code matrix, such as One-Versus-One (OVO) [3], One-Versus-All (OVA) [4], the dense random code, and the sparse random code [1, 2]; for an $N$-class problem, they provide $N(N-1)/2$, $N$, $\lceil 10\log_2 N \rceil$, and $\lceil 15\log_2 N \rceil$ binary models, respectively. Moreover, some approaches using genetic algorithms have been proposed [5, 6]. However, due to the complexity of the problem, in which solutions are searched from a large space including all $2^{N-1}-1$ possible columns [1] in the case of the dense code and $(3^N - 2^{N+1} + 1)/2$ columns [7] in the case of the sparse code, the design of the code matrix is still ongoing research.

This research aims to find a suitable combination of classes for creating the binary models in the ECOC framework, providing both good classification accuracy and a small number of classifiers. Our method is based on the minimum weight perfect matching algorithm, using the relation between pairs of subsets of classes, defined by the generalization performance, as the criterion for constructing the code matrix. We study this multiclass classification using Support Vector Machines [8, 9] as base learners. We also empirically evaluate our technique by comparing it with traditional methods on ten datasets from the UCI Machine Learning Repository [10].

## II Error Correcting Output Codes

Error Correcting Output Code (ECOC) was introduced by Dietterich and Bakiri [1] as a combining framework for multiclass classification. For a given code matrix with $N$ rows (one per class), each element contains either '1' or '-1'. This code matrix is designed to represent a set of binary learning problems for the $N$ classes. Each class is defined by a unique bit string called a codeword, and each sub-problem is indicated by the combination of positive and negative classes corresponding to the elements of a column of the code matrix. Moreover, in order to allow a binary model to be learned without considering some particular classes, Allwein et al. [2] extended the coding method by adding the third symbol '0' as a "don't care" bit. Unlike the previous method, the number of classes used for training a binary classifier can then vary from 2 to $N$ classes.

Several classic coding designs have been proposed, e.g., One-Versus-All (OVA) [4], dense random codes, sparse random codes [2], and One-Versus-One (OVO) [3]; the first two techniques are binary strategies and the last two are ternary strategies. One-Versus-All codes, consisting of $N$ columns, are designed by labeling one class as the positive class and the remaining $N-1$ classes as the negative class in each column. Dense random codes and sparse random codes were introduced by randomizing sets of $\lceil 10\log_2 N \rceil$ and $\lceil 15\log_2 N \rceil$ binary functions, respectively. In the case of One-Versus-One codes, each column is defined by labeling '1' and '-1' to only two out of the $N$ classes, and therefore there are $N(N-1)/2$ possible columns.

In the decoding process, an instance with unknown class is classified by all binary functions corresponding to the designed columns of the code matrix. The resulting output vector is compared to each row of the code matrix, and the class corresponding to the row with the highest similarity is assigned as the final output class. Several similarity measures have been proposed, such as Hamming distance [1], Euclidean distance [11], extended strategies based on these two methods [11], and the loss-based technique [2].
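As an illustration, Hamming-distance decoding can be sketched as follows; the code matrix and the classifier outputs here are hypothetical, not taken from the paper:

```python
# Sketch of ECOC decoding with Hamming distance: the predicted class is the
# row of the code matrix whose codeword differs from the classifier outputs
# in the fewest positions.

def hamming_decode(code_matrix, outputs):
    """Return the row index whose codeword is closest to `outputs`."""
    def dist(row):
        return sum(1 for r, o in zip(row, outputs) if r != o)
    return min(range(len(code_matrix)), key=lambda i: dist(code_matrix[i]))

# Hypothetical 4-class, 6-column binary code matrix.
M = [
    [ 1,  1,  1, -1, -1, -1],   # codeword of class 0
    [ 1, -1, -1,  1,  1, -1],   # codeword of class 1
    [-1,  1, -1,  1, -1,  1],   # codeword of class 2
    [-1, -1,  1, -1,  1,  1],   # codeword of class 3
]
outputs = [1, -1, -1, 1, -1, -1]   # predictions of the 6 binary classifiers
print(hamming_decode(M, outputs))  # → 1 (only one bit differs from class 1)
```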

## III Proposed Method

We aim to construct a code matrix in which each column is a suitable combination of positive and negative classes encoded to represent a binary model. As mentioned before, the best code matrix can only be found by searching the space of all possible columns, which grows exponentially with the number of classes $N$, and thus the search becomes comparatively difficult as $N$ increases. Our objective is to construct a code matrix providing a high level of compression (a low number of binary classifiers) together with high classification accuracy. Although the highest compression is possible using only $\lceil \log_2 N \rceil$ binary classifiers, compression without a proper combination of classes may suffer from the complexity of classifier construction. To design a code matrix, for any classes $i$ and $j$, some binary classifier(s) must be selected that constructs a pairing between a set of classes containing class $i$ and another set containing class $j$. We believe that the most important pairings affecting the classification accuracy are the pairs of classes that are hard to separate. These pairings cannot be avoided, as they must be included in some combination of classes in order to distinguish the classes from each other. The number of classes combined in each classifier varies from 2 to $N$. If we do not construct a classifier to separate a pair of difficult classes directly, we still have to build a classifier that uses other classes together with these two classes to distinguish them, which increases the complexity of classifier construction. For example, Fig. 1 shows that if we build a model with a linear function to distinguish classes 1 vs 2, which are hard-to-separate classes, by setting classes 1,3,4,7,8 as positive classes and classes 2,5,6 as negative classes, it will not be easy to learn a good hyperplane.

For the above reason, we carefully design the code matrix by considering the pairs of classes with hard separation as the first-priority pairings; these pairings also indicate high similarity between the classes. Intuitively, a pair of classes with difficult separation is allowed to combine with only a low number of other classes in order to avoid the situation mentioned above, while we expect that easier pairings can be combined with a large number of classes without affecting the classification ability much. Based on this idea, classes with high similarity are grouped together under the same class label, and classes with low similarity are separated. For example, consider the two-dimensional artificial data shown in Fig. 1; classes 1,2,3,4 have high similarity and should be assigned the same label (e.g., the positive class), and classes 5,6,7,8 also have high similarity and should be assigned the same label (e.g., the negative class). Additionally, these two groups have low similarity by observation, and thus if we learn a binary model with a linear function, we will obtain a completely separating hyperplane.

In order to obtain a code matrix satisfying the above requirements, we employ the minimum weight perfect matching algorithm [13] to find the optimal pairs of subsets of classes, using the generalization performance [12] as a weighting criterion. Our method, called the optimal matching of subsets of classes algorithm, is described in Algorithm 1.

For solving the optimal matching problem, let $G = (V, E)$ be a graph with node set $V$ and edge set $E$. Each node in $V$ denotes one subset of classes, and each edge in $E$ indicates one binary classifier whose generalization performance is estimated by Algorithm 2 (see Fig. 2(a)). The output of the matching algorithm for graph $G$ is a subset of edges with the minimum sum of generalization performances over all its edges, such that each node in $V$ is met by exactly one edge in the subset (see Fig. 2(b)).

Given a real weight $c_e$ (the estimated generalization performance) for each edge $e$ of $G$, the problem can be solved by the minimum weight perfect matching algorithm, which finds a perfect matching $M$ of minimum weight $\sum_{e \in M} c_e$.

For $S \subseteq V$, let $\gamma(S)$ be the set of edges with both endpoints in $S$. The set of edges incident to node $v$ is denoted by $\delta(v)$. The convex hull of perfect matchings on a graph $G = (V, E)$ with $|V|$ even is given by

a) $x_e \geq 0$ for all $e \in E$

b) $x(\delta(v)) = 1$ for all $v \in V$

c) $x(\gamma(S)) \leq (|S|-1)/2$ for all odd sets $S \subseteq V$ with $|S| \geq 3$

where $x(F) = \sum_{e \in F} x_e$ for $F \subseteq E$, and $x_e = 1$ ($x_e = 0$) means that $e$ is (is not) in the matching. Hence, the minimum weight of a perfect matching (mp) is at least as large as the following value:

$$ mp = \min \Big\{ \sum_{e \in E} c_e x_e : x \text{ satisfies (a), (b), and (c)} \Big\} \qquad (1) $$

where $x$ is additionally required to be integral. Therefore, the matching problem can be solved by the integer program in Eq. (1).
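For small graphs, the matching that Eq. (1) describes can be found by exhaustive search. The sketch below is for illustration only, with hypothetical generalization-performance weights; the paper uses the solver of Cook and Rohe [13]:

```python
# Brute-force minimum-weight perfect matching on a small complete graph.
# The first node is paired with each possible partner in turn, and the
# remainder is matched recursively; the cheapest completion wins.
from itertools import combinations  # noqa: F401  (combinations available for extensions)

def min_weight_perfect_matching(nodes, weight):
    """Return (cost, matching) minimizing the sum of edge weights.
    `weight[(u, v)]` must be defined for u < v; len(nodes) must be even."""
    nodes = sorted(nodes)
    if not nodes:
        return 0.0, []
    u = nodes[0]
    best = (float("inf"), None)
    for v in nodes[1:]:
        rest = [n for n in nodes if n not in (u, v)]
        cost, match = min_weight_perfect_matching(rest, weight)
        cost += weight[(u, v)]
        if cost < best[0]:
            best = (cost, [(u, v)] + match)
    return best

# Hypothetical generalization-performance weights for 4 subsets of classes.
w = {(1, 2): 0.10, (1, 3): 0.40, (1, 4): 0.35,
     (2, 3): 0.45, (2, 4): 0.30, (3, 4): 0.15}
cost, matching = min_weight_perfect_matching([1, 2, 3, 4], w)
print(matching)  # → [(1, 2), (3, 4)] with total weight 0.25
```

Exhaustive search is exponential in the number of nodes, which is exactly why a polynomial-time solver such as Cook and Rohe's is used in practice.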

On each round of Algorithm 1, we consider a sub-problem including the current subsets of classes, and calculate optimal pairs of subsets of classes by using the minimum weight perfect matching algorithm. The cost function employed in the matching is the sum of generalization performances calculated from Algorithm 2 [14]. The obtained pairs of subsets are then used as columns of the code matrix. Consider the data in Fig. 1 and the designed code matrix in Fig. 3. At the first round, each element of $V$ contains a set of only one class, and the size of $V$ is 8. After applying the minimum weight perfect matching algorithm, we get four optimal pairings, i.e., 1 vs 2, 3 vs 4, 5 vs 6, and 7 vs 8. These four pairs, as binary classifiers, are added as the first four columns of the code matrix (see Fig. 3). The set $V$ is re-assigned to the empty set; the classes from each pairing are combined into the same subset and then added into $V$ as new members. Now the size of $V$ is 4, so we continue to the next round. After applying the minimum weight perfect matching algorithm again, we get two optimal pairings, i.e., 1,2 vs 3,4 and 5,6 vs 7,8. These two pairs, as binary classifiers, are added as the next two columns of the code matrix. The set $V$ is re-assigned to the empty set once more, and the classes from each pairing are combined into the same subset and added into $V$ as new members. Now the size of $V$ is 2, so we exit the loop. After that, the last remaining two subsets (the first including classes 1,2,3,4 and the second including classes 5,6,7,8) are added as the last column of the code matrix, i.e., the final binary classifier.
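The round-by-round construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `pair_weight` function is a hypothetical stand-in for the generalization performance estimated by Algorithm 2, chosen so that classes 1-4 and 5-8 form the two similar groups of Fig. 1:

```python
# Sketch of the code-matrix construction loop (Algorithm 1): pair subsets,
# record each pair as a column, merge the paired subsets, and repeat until
# only two subsets remain; those form the last column.

def brute_matching(subsets, pair_weight):
    """Exhaustive minimum-weight perfect matching over the given subsets."""
    if not subsets:
        return []
    first, best = subsets[0], None
    for other in subsets[1:]:
        rest = [s for s in subsets if s not in (first, other)]
        cand = [(first, other)] + brute_matching(rest, pair_weight)
        if best is None or sum(map(pair_weight, cand)) < sum(map(pair_weight, best)):
            best = cand
    return best

def build_code_columns(classes, pair_weight):
    subsets = [frozenset([c]) for c in classes]
    columns = []                      # each column: (positive subset, negative subset)
    while len(subsets) > 2:
        pairs = brute_matching(subsets, pair_weight)
        columns.extend(pairs)
        subsets = [a | b for a, b in pairs]   # merge each pair for the next round
    columns.append(tuple(subsets))    # final column separates the last two subsets
    return columns

# Hypothetical weight: pairings within a similar group (1-4 or 5-8) are cheap.
def pair_weight(pair):
    a, b = pair
    same_group = (max(a) <= 4) == (max(b) <= 4)
    return 0.1 if same_group else 0.9

cols = build_code_columns(range(1, 9), pair_weight)
print(len(cols))  # → 7 binary classifiers for the 8-class example (N - 1)
```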

In our methodology, a pair of classes that have already been paired will not be considered again. In later rounds, the number of members in each subset of $V$ increases; for example, at the last round the two subsets are {1,2,3,4} and {5,6,7,8}. It may seem that this increases the complexity of calculating the hyperplane that separates them. However, the remaining subsets that can still be paired tend to allow easier separation, because the classes combined in the same subset have high similarity (in Fig. 1, classes 1,2,3,4 are very close, and classes 5,6,7,8 are also close), while the sixteen possible pairings 1 vs 5, 1 vs 6, 1 vs 7, 1 vs 8, 2 vs 5, ..., 4 vs 8 contain classes with high dissimilarity. These sixteen pairings are encoded in only one binary classifier, whereas One-Versus-One would require sixteen binary classifiers. This example also confirms that, with our technique, although the number of classes used to construct a binary model increases, each subset of positive and negative classes is combined appropriately, which leads to effective separation.

## IV Experiments

In this section, we design experiments to evaluate the performance of the proposed method. We compare our method with the classic codes, i.e., the dense random code, the sparse random code [1, 2], and One-Versus-One [3]. This section is divided into two parts: experimental settings, and results & discussions.

### IV-A Experimental Settings

We ran experiments on ten datasets from the UCI Machine Learning Repository [10]. For datasets containing both training data and test data, we merged them into one set and used 5-fold cross-validation for evaluating the classification accuracy.

Table I: The descriptions of the datasets.

Datasets | #Classes | #Features | #Cases
---|---|---|---
Segment | 7 | 18 | 2,310
Arrhyth | 9 | 255 | 438
Mfeat-factor | 10 | 216 | 2,000
Optdigit | 10 | 62 | 5,620
Vowel | 11 | 10 | 990
Primary Tumor | 13 | 15 | 315
Libras Movement | 15 | 90 | 360
Soybean | 15 | 35 | 630
Abalone | 16 | 8 | 4,098
Spectrometer | 21 | 101 | 475

Table II: Classification accuracy (%) of the dense random code and the proposed method.

Data sets | Dense random code | Proposed method
---|---|---
Segment | 91.018 ± 0.112 | 92.857 ± 0.159*
Arrhyth | 64.626 ± 0.246* | 60.731 ± 0.298
Mfeat-factor | 97.073 ± 0.228 | 97.050 ± 0.188
Optdigit | 99.034 ± 0.056 | 99.093 ± 0.075
Vowel | 59.632 ± 0.980 | 80.808 ± 0.479*
Primary Tumor | 43.931 ± 0.567 | 44.127 ± 1.085
Libras Movement | 82.507 ± 0.372 | 84.722 ± 0.621*
Soybean | 93.273 ± 0.418 | 93.651 ± 0.251
Abalone | 23.126 ± 0.012 | 26.940 ± 0.152*
Spectrometer | 50.428 ± 0.420 | 57.474 ± 1.080*

Table III: Classification accuracy (%) of the sparse random code and the proposed method.

Data sets | Sparse random code | Proposed method
---|---|---
Segment | 91.937 ± 0.105 | 92.857 ± 0.159*
Arrhyth | 62.112 ± 0.225* | 60.731 ± 0.298
Mfeat-factor | 97.048 ± 0.212 | 97.050 ± 0.188
Optdigit | 99.072 ± 0.059 | 99.093 ± 0.075
Vowel | 61.986 ± 0.951 | 80.808 ± 0.479*
Primary Tumor | 44.770 ± 0.864 | 44.127 ± 1.085
Libras Movement | 83.421 ± 0.311 | 84.722 ± 0.621
Soybean | 93.115 ± 0.421 | 93.651 ± 0.251
Abalone | 24.311 ± 0.023 | 26.940 ± 0.152*
Spectrometer | 52.670 ± 0.433 | 57.474 ± 1.080*

Table IV: Classification accuracy (%) of One-Versus-One and the proposed method.

Data sets | One-Versus-One | Proposed method
---|---|---
Segment | 92.741 ± 0.174 | 92.857 ± 0.159
Arrhyth | 60.502 ± 0.161 | 60.731 ± 0.298
Mfeat-factor | 97.183 ± 0.189 | 97.050 ± 0.188
Optdigit | 99.170 ± 0.062 | 99.093 ± 0.075
Primary Tumor | 44.550 ± 0.831 | 44.127 ± 1.085
Vowel | 80.909 ± 0.258 | 80.808 ± 0.479
Libras Movement | 84.352 ± 0.317 | 84.722 ± 0.621
Soybean | 92.989 ± 0.422 | 93.651 ± 0.251
Abalone | 26.326 ± 0.063 | 26.940 ± 0.152
Spectrometer | 57.193 ± 0.775 | 57.474 ± 1.080

Table V: The number of binary classifiers required by each method.

Data sets | #classes | Dense random code | Sparse random code | One-Versus-One | Proposed method
---|---|---|---|---|---
Segment | 7 | 29 | 43 | 21 | 6
Arrhyth | 9 | 32 | 48 | 36 | 8
Mfeat-factor | 10 | 34 | 50 | 45 | 9
Optdigit | 10 | 34 | 50 | 45 | 9
Vowel | 11 | 35 | 52 | 55 | 10
Primary Tumor | 13 | 38 | 56 | 78 | 12
Libras Movement | 15 | 40 | 59 | 105 | 14
Soybean | 15 | 40 | 59 | 105 | 14
Abalone | 16 | 40 | 60 | 120 | 15
Spectrometer | 21 | 44 | 66 | 210 | 20

In these experiments, we scaled the data to be in [-1,1]. In the training phase, we used the SVM^light software package version 6.02 [15, 16] to create the binary classifiers. The regularization parameter, which trades off the error of the SVM on the training data against margin maximization, was set to 1 for model construction. We employed the RBF kernel $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$ and applied the same kernel parameter to all datasets. For the dense random code and the sparse random code, ten code matrices were randomly generated using a pseudo-random tool [17]. In the case of sparse random codes, the probabilities 1/2 and 1/4 were applied to generate the bit '0' and each of the other bits, i.e., '1' and '-1', respectively. For the decoding process, we employed the attenuated Euclidean decoding [11], $d(x, y^i) = \sqrt{\sum_{j} |y_j^i| \, (x_j - y_j^i)^2}$, where $x$ is the binary output vector and $y^i$ is the codeword belonging to class $i$.
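The attenuated Euclidean decoding can be sketched as follows; the codewords and outputs below are hypothetical, and the point is that the $|y_j^i|$ factor makes '0' ("don't care") positions contribute nothing to the distance:

```python
# Attenuated Euclidean decoding for ternary ECOC: each bit of the distance
# is weighted by |y_j| so that zero ('don't care') positions are ignored.
from math import sqrt

def attenuated_euclidean(x, y):
    return sqrt(sum(abs(yj) * (xj - yj) ** 2 for xj, yj in zip(x, y)))

codewords = {"A": [1, -1, 0], "B": [-1, 1, 1]}   # hypothetical ternary rows
x = [1, -1, -1]                                  # binary classifier outputs
pred = min(codewords, key=lambda c: attenuated_euclidean(x, codewords[c]))
print(pred)  # → "A": the third (zero) bit of A's codeword does not count
```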

### IV-B Results & Discussions

We compared the proposed method with three traditional techniques, i.e., the dense random code, the sparse random code, and One-Versus-One. The comparison results are shown in Table II to Table IV, in which the datasets are sorted in ascending order by their number of classes. For each dataset, the symbol '*' indicates that the corresponding method gives significantly higher accuracy at the 95% confidence level under a one-tailed paired t-test.

The experimental results in Table II show that the proposed method yields the highest accuracy on almost all datasets. The results also show that, at the 95% confidence level, the proposed technique performs statistically better than the dense random code on five datasets, and there is only one dataset, the Arrhyth, on which the proposed method does not achieve the better result. The experimental results in Table III illustrate that the proposed method gives higher accuracy than the sparse random code on almost all datasets. The results also show that, at the 95% confidence level, the proposed technique performs statistically better than the sparse random code on four datasets, and again the Arrhyth is the only dataset on which the proposed method does not provide the higher accuracy. A combination method with few binary classifiers may be weaker than one with a large number of binary classifiers. Our proposed method is carefully designed to find a proper combination of classes and to create binary classifiers with high efficiency; however, since our method uses few binary classifiers, this situation can still occur, as found with the Arrhyth dataset. Moreover, the classification performance also relates to the structure of the code matrix. Consider the size of the search space, which grows exponentially with the number of classes $N$. The dense random code and the sparse random code generate the code matrix independently of the structure of the problem, while our method tries to find a specific structure of the code matrix according to the relation of the generalization performance of the classes. For a low number of classes, with a small search space, the random techniques may obtain a good solution, while for a higher number of classes, with a bigger search space, their probability of reaching an expected solution decreases as $N$ grows.
In this respect, our method, which uses the generalization performance as a guideline to find a solution, is not affected. However, for the same reason, in the case of a small number of classes the code matrix generated by a random technique with a larger number of binary models may lead to better classification performance, as in the Arrhyth dataset mentioned before. Finally, the classification results in Table IV show that the proposed method gives slightly better results than One-Versus-One on several datasets, and there is no significant difference between the two methods at the 95% confidence level.

Consider the characteristics of the binary models obtained by our technique. Some binary classifiers include a large number of classes, which would seem to make it complex to construct binary models with high accuracy (the first important issue). For example, in the case of the Spectrometer dataset with 21 classes, twenty binary classifiers are obtained, and the number of classes contained in a binary classifier varies from 2 to 21. For hard-separation pairings, our method allows binary classifiers with a low number of classes, to avoid the situation mentioned in Section III (the second important issue). The effects of these two issues on the designed code matrix can be observed via the classification accuracies. Among all algorithms on this dataset, our code matrix gives higher classification accuracies than the dense random code and the sparse random code, while providing slightly better results than One-Versus-One. Moreover, the experimental results confirm that our algorithm generally provides a better code matrix, in which the proper subsets of classes are combined.

We also compare the classification time (the number of binary classifiers employed) of the proposed method with the traditional works, as shown in Table V. The results illustrate that the proposed technique requires comparatively low running time on all datasets, especially when the number of classes is relatively large. Our technique greatly reduces the classification time compared to all previous works; the dense random code, the sparse random code, and One-Versus-One require $\lceil 10\log_2 N \rceil$, $\lceil 15\log_2 N \rceil$, and $N(N-1)/2$ binary classifiers, respectively, while our framework needs only $N-1$ for an $N$-class problem. Our code matrix construction requires the calculation of class matchings using the minimum weight perfect matching algorithm, in which estimates of the generalization performance are employed as the weighting criterion. Although this process needs additional computation to estimate the weight values using cross-validation, the task is conducted in the training phase and does not affect the performance in the classification phase.
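The classifier counts in Table V can be reproduced from these formulas; a quick check (the helper name `counts` is ours, not from the paper):

```python
# Number of binary classifiers per method for an N-class problem:
# dense random ~ ceil(10 log2 N), sparse random ~ ceil(15 log2 N),
# One-Versus-One = N(N-1)/2, proposed method = N - 1.
from math import ceil, log2

def counts(n):
    return ceil(10 * log2(n)), ceil(15 * log2(n)), n * (n - 1) // 2, n - 1

for n in (7, 16, 21):
    print(n, counts(n))
# → 7 (29, 43, 21, 6); 16 (40, 60, 120, 15); 21 (44, 66, 210, 20),
#   matching the Segment, Abalone, and Spectrometer rows of Table V.
```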

## V Conclusion

We proposed an algorithm to find a suitable combination of classes for creating the binary models in the ECOC framework, using the generalization performance as a relation measure among subsets of classes. This measure is applied to obtain the set of the closest pairs of subsets of classes via the minimum weight perfect matching algorithm, in order to generate the columns of the code matrix. The proposed method gives higher performance, both in terms of accuracy and classification time, than the traditional methods, i.e., the dense random code and the sparse random code. Moreover, our approach requires a significantly smaller number of binary classifiers than One-Versus-One while maintaining accuracy. However, the expected matchings of subsets of classes may not be available in some cases, because the relations among the subsets may force inappropriate subsets of classes to be combined, which may lead to misclassification. We will further analyze and address this situation in future work.

## VI Acknowledgment

This research is supported by the Thailand Research Fund, Thailand.

## References

- [1] T. G. Dietterich and G. Bakiri, "Solving multiclass learning problems via error-correcting output codes," Journal of Artificial Intelligence Research, vol. 2, pp. 263-286, 1995.
- [2] E. L. Allwein, R. E. Schapire, and Y. Singer, "Reducing multiclass to binary: a unifying approach for margin classifiers," Journal of Machine Learning Research, vol. 1, pp. 113-141, 2000.
- [3] T. Hastie and R. Tibshirani, "Classification by pairwise coupling," The Annals of Statistics, vol. 26, pp. 451-471, 1998.
- [4] V. Vapnik, Statistical Learning Theory, New York: Wiley, 1998.
- [5] L. I. Kuncheva, "Using diversity measures for generating error-correcting output codes in classifier ensembles," Pattern Recognition Letters, vol. 26, pp. 83-90, 2005.
- [6] A. C. Lorena and A. C. P. L. F. Carvalho, "Evaluation functions for the evolutionary design of multiclass support vector machines," International Journal of Computational Intelligence and Applications, vol. 8, pp. 53-68, 2009.
- [7] M. A. Bagheri, G. Montazer, and E. Kabir, "A subspace approach to error correcting output codes," Pattern Recognition Letters, vol. 34, pp. 176-184, 2013.
- [8] V. Vapnik, The Nature of Statistical Learning Theory, London, UK: Springer-Verlag, 1995.
- [9] V. Vapnik, "An overview of statistical learning theory," IEEE Transactions on Neural Networks, vol. 10, pp. 988-999, 1999.
- [10] C. Blake, E. Keogh, and C. Merz, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, 1998.
- [11] S. Escalera, O. Pujol, and P. Radeva, "On the decoding process in ternary error-correcting output codes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 120-134, 2010.
- [12] P. Bartlett and J. Shawe-Taylor, "Generalization performance of support vector machines and other pattern classifiers," in Advances in Kernel Methods - Support Vector Learning, MIT Press, pp. 43-54, 1998.
- [13] W. Cook and A. Rohe, "Computing minimum-weight perfect matchings," INFORMS Journal on Computing, vol. 11, pp. 138-148, 1999.
- [14] T. Mitchell, Machine Learning, McGraw Hill, 1997.
- [15] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods - Support Vector Learning, MIT Press, pp. 169-184, 1998.
- [16] T. Joachims, SVM^light, http://ais.gmd.de/~thorsten/svm_light, 1999.
- [17] M. A. Bagheri, Q. Gao, and S. Escalera, "Efficient pairwise classification using Local Cross Off strategy," in Proceedings of the 25th Canadian Conference on Artificial Intelligence, pp. 25-36, 2012.