Optimal arrangements of hyperplanes for multiclass classification
Abstract.
In this paper, we present a novel approach to construct multiclass classifiers by means of arrangements of hyperplanes. We propose different mixed integer non linear programming formulations for the problem by using extensions of widely used measures for misclassifying observations. We prove that kernel tools can be extended to these models. Some strategies are detailed that help to solve the associated mathematical programming problems more efficiently. An extensive battery of experiments has been run which reveals the power of our proposal in contrast to other previously proposed methods.
Key words and phrases:
Multiclass Support Vector Machines, Mixed Integer Non Linear Programming, Classification, Hyperplanes
2010 Mathematics Subject Classification:
62H30, 90C11, 68T05, 32S22.
1. Introduction
The Support Vector Machine (SVM) is a widely used methodology in supervised binary classification, first proposed by Cortes and Vapnik [6]. Given a set of data together with a label, the general idea behind SVM methodologies is to find a partition of the feature space and an assignment rule from data to each of the cells in the partition that maximizes the separation between the classes of a training sample and minimizes a certain measure of the misclassifying errors. At that point, convex optimization tools come into play, and the shape of the resulting dual problem allows one to project the data onto a higher dimensional space where the classes can be separated more adequately, while the problem can still be solved with the same computational effort as the original one. This fact is the so-called kernel trick, and it has motivated the successful use of this tool in a wide range of applications [2, 14, 10, 20, 25].
Most of the SVM proposals and extensions concern instances with only two different classes. Some extensions have been proposed for this case by choosing different measures for the separation between classes [12, 13, 5], incorporating feature selection tasks [19], regularization strategies [18], etc. However, the analysis of SVM-based methods for instances with more than two classes has been, from our point of view, only partially investigated. To construct a label classification rule for , one is provided with a training sample of observations and labels for each of the observations in such a sample, . The goal is to find a decision rule which is able to classify out-of-sample data into a single class by learning from the training data.
The most common approaches to construct multiclass classifiers are based on the extension of the methodologies for the binary case, such as Deep Learning tools [1], Nearest Neighbors [7, 26] or Naïve Bayes [16]. A few approaches have also been proposed for multiclass classification using the power of binary SVMs. In particular, the most popular are the one-versus-all (OVA) and one-versus-one (OVO) methods. The first, OVA, computes, for each class , a binary SVM classifier by labeling the observations as if the observation is in class and otherwise. Then, it is repeated for all the classes ( times), and each observation is classified into the class whose constructed hyperplane is furthest from it. In the OVO approach, each class is separated from any other by computing hyperplanes (one for each pair of classes). Although OVA and OVO share the advantages that they can be efficiently solved and that they inherit all the good properties of binary SVM, they are not able to correctly classify many multiclass instances, such as datasets with different and separated clouds of observations which belong to the same class. Some attempts have been made to construct multiclass SVMs by solving a compact optimization problem which involves all the classes at the same time, such as WW [28], CS [8] or LLW [15], where the authors consider different choices for the hinge loss in multicategory classification. Some of them (OVO, CS and WW) are implemented in some of the most popular software used in machine learning, such as R [23] or Python [24]. However, as far as we know, there are no multiclass classification methods that keep the essence of binary SVM, which stems from finding a globally optimal partition of the feature space.
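The two decomposition schemes described above can be illustrated with a minimal sketch. The function names `ovo_pairs` and `ova_labels` are hypothetical, and the +1/-1 relabeling is the usual binary SVM convention, assumed here because the exact values are not recoverable from the text.

```python
from itertools import combinations


def ovo_pairs(classes):
    """Return every unordered pair of classes; OVO trains one binary SVM per pair."""
    return list(combinations(sorted(classes), 2))


def ova_labels(y, target):
    """Relabel a multiclass label vector for the OVA subproblem of class `target`.

    Assumption: +1 for observations in `target`, -1 for all the others.
    """
    return [1 if label == target else -1 for label in y]


pairs = ovo_pairs([1, 2, 3, 4])
print(len(pairs))                          # number of OVO hyperplanes for 4 classes
print(ova_labels([1, 2, 3, 1], target=1))  # binary labels for the class-1 subproblem
```

The pair count grows quadratically with the number of classes, which agrees with the OVO column reported later for the experimental datasets (for instance, 21 hyperplanes for 7 classes).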
Here, we propose a novel approach to handle multiclass classification extending the paradigm of binary SVM classifiers. In particular, our method will find a polyhedral partition of the feature space and an assignment of classes to cells of the partition, by maximizing the separation between classes and minimizing the misclassifying errors. For biclass instances, and using a single separating hyperplane, the method coincides with the classical SVM, although different new alternatives appear even for biclass datasets when more than one hyperplane is allowed to separate the data. We propose different mathematical programming formulations for the problem. These models share the same modelling idea and they allow us to consider different measures for the misclassifying errors (hinge or ramp-based losses). The models belong to the family of Mixed Integer Non Linear Programming (MINLP) problems, in which the nonlinearities come from the representation of the margin distances between classes, which can be modeled as a set of second order cone constraints [4]. This type of constraint can be handled by any of the available off-the-shelf optimization solvers (CPLEX, Gurobi, XPress, SCIP, …). However, the number of binary variables in the model may become an inconvenience when trying to compute classifiers for medium to large size instances. For the sake of obtaining the classifiers in lower computational times, we detail some strategies which can be applied to reduce the dimensionality of the problem. Recently, a few new approaches have been proposed for different classification problems using discrete optimization tools. For instance, in [27] the authors construct classification hyperboxes for multiclass classification, and in [9, 19, 21] new mixed integer linear programming tools are provided for feature selection in SVM.
In case the data are, by nature, nonlinearly separable, in classical SVM one can apply the so-called kernel trick to project the data onto a higher dimensional space where linear separation performs better. The key point is that one does not need to know which specific transformation is performed on the data, and that the decision space of the mathematical programming problem to be solved is the same as the original one. Here, we will show that the kernel trick can be extended to our framework, which will also allow us to find nonlinear classifiers with our methodology.
Finally, we successfully apply our method to some well-known datasets in multiclass classification, and compare the results with those obtained with the best SVM-based classifiers (OVO, CS and WW).
The paper is organized as follows. In Sections 2 and 3 we describe and set up the elements of the problem, and afterwards we introduce the MINLP formulation of our model. We also present a linear version, in which the margin is measured with the ℓ1 norm. Moreover, we point out that a Ramp Loss version of the model can be easily derived with very few modifications. In Section 3.2 we show that this model admits the use of kernels, as the binary SVM does. In Section 4 we introduce some heuristics that we have developed to obtain an initial solution of the MINLP. Finally, Section 5 contains computational results on real data sets and their comparison with the available methods mentioned above.
2. Multiclass Support Vector Machines
In this section, we introduce the problem under study and set the notation used throughout this paper.
Given a training sample the goal of supervised classification is to find a separation rule to assign labels () to data (), in order to be applied to out-of-sample data. We assume that a given number, , of linear separators have to be built to obtain a partition of into polyhedral cells. The linear separators are nothing but hyperplanes in , , in the form for . Each of the subdivisions obtained with such an arrangement of hyperplanes will then be assigned to a label in . In Figure 1 we show the framework of our approach. In the left side, a of cells-to-classes. In this case such a subdivision and assignment reaches a perfect classification of the given training data.
Under these settings, our goal is to construct such an arrangement of hyperplanes, , induced by (the first component of each vector accounts for the intercept) and to assign a single label to each one of the cells in the subdivision it induces. First, observe that each cell in the subdivision can be uniquely identified with a vector in , each of the signs representing on which side of the hyperplane is . Hence, a suitable assignment will be a function , which maps cells (equivalently sign-patterns) to labels in . Hence, for a given , which belongs to a cell, we will identify it with its sign-pattern , where for . Then, the classification rule is defined as , the predicted label of . The goodness of the fit will be based on comparing predictions and actual labels on a training sample, but also on maximally separating the classes in order to find good predictions and avoid undesired overfitting.
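The classification rule just described (compute the sign-pattern of a point with respect to every hyperplane, then look up the label assigned to that cell) can be sketched as follows. All names are illustrative and not the authors' notation; each hyperplane is stored as a pair (intercept, coefficients), and points lying exactly on a hyperplane are sent to the positive side, which is an assumption made only for this sketch.

```python
def sign_pattern(x, hyperplanes):
    """Return the tuple of signs (+1 or -1) of x with respect to each hyperplane.

    hyperplanes: list of (w0, w) pairs, with intercept w0 and coefficient list w.
    """
    pattern = []
    for w0, w in hyperplanes:
        value = w0 + sum(wj * xj for wj, xj in zip(w, x))
        pattern.append(1 if value >= 0 else -1)  # tie-breaking is an assumption
    return tuple(pattern)


def classify(x, hyperplanes, cell_to_label):
    """Predict the label of x; returns None if its cell has no assigned class."""
    return cell_to_label.get(sign_pattern(x, hyperplanes))


# Two hyperplanes in the plane (the coordinate axes) and an XOR-like assignment:
# opposite quadrants share a class, which no single hyperplane can separate.
hyperplanes = [(0.0, [1.0, 0.0]), (0.0, [0.0, 1.0])]
cell_to_label = {(1, 1): "red", (-1, -1): "red", (1, -1): "blue", (-1, 1): "blue"}

print(classify((2.0, 3.0), hyperplanes, cell_to_label))
print(classify((-1.0, 2.0), hyperplanes, cell_to_label))
```

Storing the assignment as a dictionary keyed by sign-patterns makes explicit that several cells may share one class, which is what allows separated clouds of the same class to be handled.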
In binary classification datasets, SVM is a particular case of our approach if , i.e., a single hyperplane to partition the feature space is used. In such a case, signs are in and classes in , so whenever there are observations in both classes, the assignment is one-to-one. However, even for biclass instances, if more than one hyperplane is used, one may find better classifiers. In Figure 2 we draw the same dataset of labeled (red and blue) observations and the result of applying a classical SVM (left) and our approach with two hyperplanes (right). In that picture one may see that not only are the misclassifying errors smaller with two hyperplanes, as expected, but also the separation between classes is larger, improving the predictive power of the classifier.
This approach is particularly useful for datasets in which there are several separated “clouds” of observations that belong to the same class. In Figure 3, we show two different instances in which, again, the colors indicate the class of the observations. The classes in both instances cannot be appropriately separated using any of the linear SVM-based methods, while we were able to perfectly separate the classes using 5 hyperplanes.
In Figure 4 we compare our approach and the One-versus-One (OVO) approach in an instance with observations. In the left picture we show the result of separating the classes with four hyperplanes, reaching a perfect classification of the training sample. In the right side we show the best linear OVO classifier, in which only of the data were correctly classified. We would also like to highlight that, although nonlinear SVM approaches may separate the data more conveniently, our approach may help to avoid using kernels and ease the interpretation of the results.
Different choices are possible to compute multiclass classifiers under the proposed framework. We will consider two different models which share the same paradigm but differ in the way they account for misclassifying errors. Recall that in SVM-based methods, two functions are simultaneously optimized when constructing a classifier: (1) a measure of the good performance of the separation rule on out-of-sample observations, based on finding a maximum separation between classes; and (2) a measure of misclassifying errors for the training set of observations. Both criteria are adequately weighted in order to find a good compromise between the goals.
In what follows we describe the way we account for the two criteria in our multiclass classification framework.
2.1. Separation between classes
Concerning the first criterion, we measure such a separation between classes as usual in SVM-based methods. Let be the coefficients and intercepts of a set of hyperplanes. The Euclidean distance between the shifted hyperplanes and is given by , where is the Euclidean norm in (see [22]).
Hence, in order to find globally optimal hyperplanes with maximum separation, we maximize the minimum separation between classes, that is . This measure will conveniently keep the minimum separation between classes as large as possible. Observe that finding the maximum min-separation is equivalent to minimizing . For a given arrangement of hyperplanes, , we will denote by .
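As a small numerical illustration of the separation measure, the sketch below assumes the standard SVM convention in which the two shifted hyperplanes sit at levels +1 and -1, so that their Euclidean distance is 2 divided by the norm of the coefficient vector. The function names are hypothetical.

```python
import math


def margin_width(w):
    """Distance between the hyperplanes w.z + w0 = 1 and w.z + w0 = -1 (assumed convention)."""
    return 2.0 / math.sqrt(sum(c * c for c in w))


def min_separation(coefficient_vectors):
    """Smallest margin width over all the hyperplanes of an arrangement."""
    return min(margin_width(w) for w in coefficient_vectors)


arrangement = [[3.0, 4.0], [0.0, 2.0]]        # coefficient vectors only; intercepts do not matter
print([margin_width(w) for w in arrangement])  # widths 2/5 and 2/2
print(min_separation(arrangement))             # governed by the vector with the largest norm
```

Since the width decreases as the norm grows, the minimum separation is always attained by the hyperplane with the largest coefficient norm, so maximizing the minimum separation amounts to minimizing the largest norm in the arrangement.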
Here, we observe that different criteria could be used to model the separation between classes. For instance, one may consider maximizing the sum of all separations, namely . However, although mathematically possible, this approach does not capture the original concept in classical SVM and we leave it to be developed by the interested reader.
2.2. Misclassifying errors
The performance of the classifier on the training set is usually measured with some type of misclassifying errors. Classical SVMs with hinge-loss errors use, for observations that are not well-classified, a penalty proportional to the distance to the side in which they would be well-classified. Then the overall sum of these errors is minimized. We extend the notion of hinge-loss errors to the multiclass setting as follows:
Definition 2.1 (Multiclass Hinge-Losses).
Let be an arrangement of hyperplanes and an observation/label, with the sign-patterns of with respect to the hyperplanes in . Let a function that assigns to each cell induced by a class. Let the signs of the closest cell whose assigned class by is .

is said to be incorrectly classified with respect to if ; otherwise it is said that is well-classified.

The multiclass in-margin hinge-loss for with respect to the hyperplane is defined as:

The multiclass out-margin hinge-loss for with respect to the hyperplane is defined as:
The losses and account for misclassifying errors due to different causes. On the one hand, models the errors due to observations that, although adequately classified with respect to , belong to the margin between the shifted hyperplanes and . On the other hand, measures, for incorrectly classified observations, how far is from being well-classified. Note that if an observation, aside from being wrongly classified, belongs to the margin between and , then only should be accounted for. In Figure 5 we illustrate the differences between the two types of losses.
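The following sketch gives one possible reading of the two losses for a single hyperplane. It is reconstructed from the verbal description above and from the standard binary hinge loss, not from the formulas of Definition 2.1, which are not legible in this version of the text. The argument `target_sign` is a hypothetical name for the sign, with respect to this hyperplane, of the closest cell assigned to the true class of the observation.

```python
def hinge_losses(x, w0, w, target_sign):
    """Return (in_margin, out_margin) for observation x and hyperplane (w0, w).

    Assumed reading: slack = 1 - target_sign * (w0 + w.x), as in binary SVM.
    - x on the target side: only the in-margin loss can be positive (at most 1).
    - x on the wrong side: only the out-margin loss is charged (larger than 1).
    """
    value = w0 + sum(wj * xj for wj, xj in zip(w, x))
    slack = 1.0 - target_sign * value
    if target_sign * value >= 0:       # x lies on the side of its target cell
        return (max(0.0, slack), 0.0)  # positive only inside the margin strip
    return (0.0, slack)                # distance-like penalty for the wrong side


w0, w = 0.0, [1.0, 0.0]                            # hyperplane x1 = 0
print(hinge_losses((3.0, 0.0), w0, w, target_sign=1))   # beyond the margin
print(hinge_losses((0.5, 0.0), w0, w, target_sign=1))   # right side, inside the margin
print(hinge_losses((-2.0, 0.0), w0, w, target_sign=1))  # wrong side
```

Under this reading at most one of the two losses is nonzero for each observation and hyperplane, in line with the remark that a wrongly classified observation lying in the margin is only charged the out-margin loss.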
3. Mixed Integer Non Linear Programming Formulations
In this section we describe the two mathematical optimization models that we propose for the multiclass separation problem. With the above notation, the problem can be mathematically stated as follows:
(1)  
s.t. 
where and are constants which model the cost of misclassified and strip-related errors. Usually these constants will be considered equal; nevertheless, in practice, studying different values for them might lead to better results on predictions. A case of interest results considering .
Observe that the problem above consists of finding the arrangement of hyperplanes minimizing a combination of the three quality measures described in the previous section: the maximum margin between classes and the overall sums of the in-margin and out-margin misclassifying errors. In what follows, we describe how the above problem can be rewritten as a mixed integer non linear programming problem by means of adequate decision variables and constraints. Furthermore, the proposed model will consist of a set of continuous and binary variables, a linear objective function, a set of linear constraints and a set of second order cone constraints. It will allow us to pass the model to a commercial solver in order to solve, at least, small to medium instances.
First, we describe the variables and constraints needed to model the first term in the objective function. We consider the continuous variables and to represent the coefficients and intercept of hyperplane , for . Since there is no distinction between hyperplanes, we can assume, without loss of generality, that they are nondecreasingly sorted with respect to the norms of their coefficients, i.e., . Then, it is straightforward to see that the term can be replaced in the objective function by , and the following set of constraints allows us to model the desired term:
(2) 
For the second term, each of the in-margin misclassifying errors, , will be identified with the continuous variables , for , . Observe that to determine each of these errors, one has to first determine whether the observation is well-classified or not with respect to the th hyperplane. First, we consider the following two sets of binary variables:
for , , . The variables model the sign-patterns of the observations, while the variables give the allocation profile observations-classes. As mentioned above, the classification rule is based on assigning sign-patterns to classes.
The adequate definition of the variables is assured with the following constraints:
(3)  
(4) 
where is a big enough constant. Observe that can be easily and accurately estimated based on the data set under consideration.
The following constraints assure the adequate relationships between the variables:
(5)  
(6) 
Observe that (5) enforces that a single class is assigned to each observation, while (6) assures that the assignments of two observations must coincide if their sign-patterns are the same. Also, the set of variables automatically determines whether an observation is incorrectly classified through the amount (where is the binary encoding of the class of the th observation  if and otherwise, which is part of the input data). Observe that equals zero if and only if the predicted and the actual class coincide.
Now, we will model whether the th observation is well-classified or not with respect to the th hyperplane. Observe that the measure of how far an incorrectly classified observation is from being well-classified needs further analysis. One may have an incorrectly classified observation and several training observations in its same class. We assume that the error for this observation is the misclassifying error with respect to the closest cell for which there are well-classified observations in its class. Thus, we need to model the decision on the well-classified representative observation for an incorrectly classified observation. We consider the following set of binary variables:
These variables are correctly defined by imposing the following constraints:
(7)  
(8)  
(9) 
The first set of constraints, (7), imposes a single assignment between observations belonging to the same class. Constraints (8) avoid choosing incorrectly classified representative observations. The set of constraints (9) avoids self-assignments for incorrectly classified data, and also enforces well-classified observations to be represented by themselves.
With these variables, we can model the inmargin errors by means of the following constraints:
(10)  
(11) 
These constraints model, by using the sign-patterns given by , that, . Note that the constraints are activated if either , i.e., if the well-classified observation is the representative observation for and both are in the positive side of the th hyperplane; or and , i.e., if the well-classified observation is the representative observation for and both are in the negative side of the th hyperplane. Thus, constraints (10) and (11) adequately model the in-margin errors for all observations . Furthermore, because of (3) and (4), and those described above, the variables always take values smaller than or equal to .
Finally, the third addend, the out-margin errors, will be modeled through the continuous variables , for , . With the set of variables described above, the out-margin misclassifying errors can be adequately modeled through the following constraints:
(12)  
(13) 
There, the constraints are active only in case and , that is, if is a well-classified observation in the positive side of while is incorrectly classified in the negative side of , being the representative observation for (note that in case is well-classified then by (9) and then, the constraint cannot be activated). The main difference with respect to (10) and (11) is that the constraints are activated only in case is incorrectly classified. The second set of constraints, namely (13), can be analogously justified in terms of the negative side of .
According to the above constraints, a misclassified observation is penalized in two ways with respect to each hyperplane . In case that is well-classified with respect to , but it belongs to the margins, then and (). Otherwise, if is wrongly classified with respect to , then and ().
We illustrate the convenience of the proposed constraints on the data drawn in Figure 6. Observe that A is not correctly classified since it lies within a cell to which the blue class is not assigned. Suppose that B, a well-classified observation, is the representative of A (); then the model would have to penalize two types of errors. Regarding , if we suppose , then , leading to an activation of constraint (12) being . On the other hand, even though A is well-classified with respect to , we also have to penalize its margin violation. Again, if we assume , then , which would activate constraint (10) being .
The above comments can be summarized in the following mathematical programming formulation for Problem (1):
()  
s.t.  
() is a mixed integer non linear programming model, whose nonlinear terms come from the norm minimization in the objective function, so they are second order cone representable. Therefore, the model is suitable to be solved using any of the available commercial solvers, such as Gurobi, CPLEX, etc. The main bottleneck of the formulation stems from the number of binary variables, which is of the order .
3.1. Building the classification rule
Recall that the main goal of multiclass classification is to determine a decision rule such that, given any observation, it is able to assign it a class. Hence, once the solution of () is obtained, the decision rule has to be derived. Given , two different situations are possible: (a) belongs to a cell with an assigned class; and (b) belongs to a cell with no training observations inside, and hence with no assigned class. In the first case, is assigned to its cell’s class. In the second case, different strategies to determine a class for are possible. We propose the following assignment rule based on the same allocation methods used in (): observations are assigned to their closest well-classified representatives. More specifically, let be the sign-patterns of with respect to the optimal arrangement obtained from (), and let (here stands for the optimal vector obtained by solving ()). Then, among all the well-classified observations in the training sample, , we assign to the class of the one whose cell is, on average, the closest (least separated from ). Such a classification of can be performed by enumerating all the possible assignments, and computing the distance measure over all of them. Equivalently, one can solve the following mathematical programming problem:
s.t.  
where . The integrality condition in the problem above can be relaxed, since the constraint matrix is totally unimodular (T.U.), and thus the problem is a linear programming problem. Clearly, the solution of the above problem gives the optimal labelling of with respect to existing cells in the arrangement.
One could also consider other robust measures for such an assignment following the same paradigm, as minmax error or the like.
Remark 3.1 (Ramp Loss Misclassifying errors).
An alternative measure of misclassifying training errors is the ramp loss. The ramp loss version of the model is interesting for certain instances since it improves robustness against potential outliers. Instead of using out-margin hinge-loss errors , the ramp-loss measure consists of penalizing wrongly classified observations by a constant, independently of how far they are from being well-classified. Given an observation/label , the ramp-loss is defined as:
Note that, for our training sample, the ramp-loss is represented in our model through the variables. More specifically, for all . In order to do that, we just need to make the following few modifications to the MINLP problem:
()  
s.t.  
3.2. Nonlinear Multiclass Classification
Finally, we analyze a crucial question in any SVM-based methodology, which is whether one can apply the Theory of Kernels in our framework. Using kernels means being able to apply transformations of the features, , into a higher dimensional space, where the separation of the data is more adequately performed. In case the desired transformation, , is known, one could transform the data and solve the problem () with a higher number of variables. However, in binary SVMs, formulating the dual of the classification problem, one can observe that it only depends on the observations via the inner products of each pair of observations (originally in ), i.e., through the amounts for . If the transformation is applied to the data, the observations only appear in the problem as for . Thus, kernels are defined as generalized inner products as for each , and they can be provided using any of the well-known families of kernel functions (see e.g., [11]). Moreover, Mercer’s theorem gives sufficient conditions for a function to be a kernel function (one which is constructed as the inner product of a transformation of the features), which allows one to construct kernel measures that induce transformations. The main advantage of using kernels, apart from a probably better separation in the projected space, is that in binary SVM, the complexity of the transformed problem is the same as that of the original problem. More specifically, the dual problems have the same structure and the same number of variables.
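To make the idea of a kernel as a generalized inner product concrete, the sketch below checks numerically that the homogeneous quadratic kernel on the plane coincides with the ordinary inner product after an explicit feature map into three dimensions. The kernel and the map `phi` are a textbook example chosen for illustration; they are not taken from the paper.

```python
import math


def dot(a, b):
    """Ordinary inner product of two vectors of equal length."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit feature map R^2 -> R^3 associated with the kernel (x.z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def quadratic_kernel(x, z):
    """Kernel evaluated in the original space, without building phi."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
print(quadratic_kernel(x, z))  # computed from the original 2-dimensional data
print(dot(phi(x), phi(z)))     # same value, up to rounding, in the transformed space
```

Any formulation that accesses the data only through pairwise inner products can therefore replace them by kernel evaluations without ever constructing the transformed features.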
Although problem () is a MINLP, and then, duality results do not hold, one can apply decomposition techniques to separate the binary and the continuous variables and then, iterate over the binary variables by recursively solving certain continuous and easier problems (see e.g. Benders…). The following result, whose proof is detailed in the Appendix, states that our approach also allows us to find nonlinear classifiers via the kernel tools.
Theorem 3.1.
Let be a transformation of the feature space. Then, one can obtain a multiclass classifier which only depends on the original data by means of the inner products , for .
Proof.
See Appendix.∎
4. A Math-Heuristic Algorithm
As mentioned above, the computational burden for solving (), which is a mixed integer non linear programming problem (in which the nonlinearities come from the norm minimization in the objective function), is the combination of the discrete aspects and the nonlinearities in the model. In this section we provide some heuristic strategies that allow us to cut down the computational effort by fixing some of the variables. They also provide good-quality initial feasible solutions when solving () exactly using a commercial solver. Two different strategies are provided. The first one consists of applying a variable fixing strategy to reduce the number of variables in the model. Note that in principle, variables of this type are considered in the model. The second approach consists of fixing to zero some of the variables. These variables model assignments between observations and classes. The proposed approach is a math-heuristic approach, since after applying the adequate dimensionality reductions, Problem () (or ()) has to be solved. Also, although our strategies do not ensure any kind of optimality measure, they perform very well, as will be shown in our computational experiments. Observe that when classifying data sets, the efficiency of a decision rule such as ours is usually measured by means of the accuracy of the classification on out-of-sample data, and the objective value of the proposed model is just an approximate measure of such an accuracy, which cannot be computed only with the training data.
4.1. Reducing the variables
Our first strategy comes from the fact that for a given observation , there may be several possible choices for to be one with the same final result. Recall that could be equal to one whenever is a well-classified observation in the same class as . The errors and are then computed by using the class of but not the observation itself. Thus, if a set of observations of the same class is close enough, being then well-classified, a single observation of them can be taken to act as the representative element of the group. In order to illustrate the procedure, we show in Figure 7 (left) a classes and points instance in which the classes are easily identified by applying any clustering strategy. In such a case () has variables, but if we allow only to take value at a single point in each cluster, we obtain the same result but reducing to the number of variables. In the right picture we show the clusters computed from the data, and a (random) selection of a unique point in each cluster for which the values are allowed to be one.
This strategy is summarized in Algorithm 1.
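A rough sketch of the reduction idea follows, under assumptions that should be stated clearly: Algorithm 1 is not reproduced in this version of the text, so this is not the authors' procedure. The cluster of each observation is taken as given (from any clustering method), one representative is drawn at random per cluster, and the pair (i, i) is always kept so that an observation can still be represented by itself. All function and variable names are hypothetical.

```python
import random


def pick_representatives(cluster_ids, seed=0):
    """Pick one (random) observation index per cluster."""
    rng = random.Random(seed)
    members = {}
    for index, cluster in enumerate(cluster_ids):
        members.setdefault(cluster, []).append(index)
    return {cluster: rng.choice(indices) for cluster, indices in members.items()}


def allowed_pairs(y, representatives):
    """Pairs (i, j) whose representative variable is kept in the reduced model.

    Kept: j is a chosen representative in the same class as i, or j == i.
    """
    reps = set(representatives.values())
    pairs = set()
    for i, label in enumerate(y):
        pairs.add((i, i))  # self-representation stays available (assumption)
        for j in reps:
            if y[j] == label:
                pairs.add((i, j))
    return sorted(pairs)


y = [0, 0, 0, 1, 1, 1]                        # class of each observation
cluster_ids = ["a", "a", "a", "b", "b", "b"]  # cluster of each observation
reps = pick_representatives(cluster_ids)
reduced = allowed_pairs(y, reps)
full = [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] == y[j]]
print(len(full), len(reduced))                # same-class pairs before and after
```

In this toy instance the number of candidate pairs drops from 18 to 10 whichever representatives are drawn; with larger and tighter clusters the saving grows accordingly.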
4.2. Reducing the variables
The second strategy consists of fixing to zero some of the point-to-class assignments (variables). In the dataset drawn in Figure 8 one may observe that points in the black class will not be assigned to the red class because of proximity. Following this idea, we derive a procedure to set some of the variables to zero. The strategy is described in Algorithm 2. In that figure one can observe that for the red cluster we obtain the following set of distances: . Such a set of distances is reduced to because we would not take into account the distance to the green cluster on the very right (). Thus, we would fix to zero all variables that relate each cluster with the maximum of their minimum distance set, that is, in this case we fix to zero the variables associated to the black cluster with the red cluster ( and ).
This strategy fixes a number of variables making the problem easier to solve.
5. Experiments
5.1. Real Datasets
In this section we report the results of our computational experience. We have run a series of experiments to analyze the performance of our model in some real widely used datasets in the classification literature, and that are available in the UCI machine learning repository [17]. The summarized information about the datasets is detailed in Table 1. In such a table we report, for each dataset the number of observations considered in the training sample () and test sample (), the number of features (), the number of classes (), the number of hyperplanes used in our separation (), and the number of hyperplanes that the OVO methodology needs to obtain the classifier ().
Dataset  Training obs.  Test obs.  Features  Classes  Hyperplanes (ours)  Hyperplanes (OVO) 

Forest  75  448  28  4  3  6 
Glass  75  139  10  6  6  15 
Iris  75  75  4  3  2  3 
Seeds  75  135  7  3  2  3 
Wine  75  103  13  3  2  3 
Zoo  75  26  17  7  4  21 
For these datasets, we have run both the hinge-loss () and the ramp-loss () models, using as margin distances those based on the ℓ1 and the ℓ2 norm. We have performed a cross validation scheme to test each of the approaches. Thus, the data sets were partitioned into 5 train-test random partitions. Then, the models were solved for the training sample and the resulting classification rule was applied to the test sample. We report the average accuracy, in percentage, of the models over the repetitions of the experiment on the test sample, which is defined as:
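The reported measure can be sketched as follows, assuming the usual definition of accuracy: the percentage of test observations whose predicted label coincides with the actual one, averaged over the train-test partitions. The function names and the toy labels are illustrative only and are not experimental data.

```python
def accuracy(y_true, y_pred):
    """Percentage of observations whose predicted label equals the actual label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return 100.0 * hits / len(y_true)


def mean_accuracy(partitions):
    """Average accuracy over a list of (y_true, y_pred) test partitions."""
    return sum(accuracy(t, p) for t, p in partitions) / len(partitions)


# Toy labels for two test partitions (not taken from the experiments).
partitions = [([1, 2, 3, 1], [1, 2, 3, 2]), ([1, 1], [1, 1])]
print([accuracy(t, p) for t, p in partitions])
print(mean_accuracy(partitions))
```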
The parameters of our models were also chosen after applying a gridbased cross validation scheme. In particular, we move the parameters (number of hyperplanes to locate) and the misclassification costs and in:
For hingeloss models , meanwhile for ramploss models we consider to give a hight penalty to badly wronglyclassified observations. As a result we obtain a misclassification error for each grid point, and we select the parameters that provide the lower error to make adjustments on the test set. The same methodology was applied to the classical methods, OVO, WestonWatkins (WW) and CrammerSinger (CS), by moving the single misclassifying cost in .
The mathematical programming models were coded in Python 3.6 and solved using Gurobi 7.5.2 on a PC with an Intel Core i7-7700 processor at 2.81 GHz and 16GB of RAM.
In Table 2 we report the average accuracies obtained with our 4 models and compare them with the ones obtained with OVO, WW and CS. The first two columns ($\ell_1$ RL and $\ell_1$ HL) provide the average accuracies of our two approaches (Ramp Loss, RL, and Hinge Loss, HL) using the $\ell_1$ norm as the distance measure. The third and fourth columns ($\ell_2$ RL and $\ell_2$ HL) provide the same results for the $\ell_2$ norm. In the last three columns, we report the average accuracies obtained with the classical methodologies (OVO, WW and CS). The best accuracies obtained for each of the datasets are boldfaced in the table. One can observe that, for every dataset, the best of our models replicates or improves the scores of the former models, as expected. When comparing the results obtained for our approaches with the different norms, the Euclidean norm seems to provide slightly better results than the $\ell_1$ norm. However, the models for the $\ell_1$ norm are mixed integer linear programming models, while the $\ell_2$-norm based models are mixed integer nonlinear programming problems, which may imply a higher computational difficulty when solving large-size instances.
Dataset  $\ell_1$ RL  $\ell_1$ HL  $\ell_2$ RL  $\ell_2$ HL  OVO  WW  CS

Forest  80.66  80.12  82.30  81.62  82.10  78.40  78.60 
Glass  64.92  64.92  65.32  65.32  58.76  56.25  59.26 
Iris  95.08  95.40  96.44  96.66  93.80  96.44  96.44 
Seeds  93.66  93.66  93.52  93.52  91.02  93.52  93.52 
Wine  95.20  95.20  96.82  96.82  96.34  96.09  96.17 
Zoo  89.75  89.75  89.75  89.75  87.44  87.68  87.68 
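The comparison with the classical methods can be checked directly from the reported figures. The following Python sketch copies the accuracies of Table 2 and computes, for each dataset, the gap between the best of the four proposed models (first four columns) and the best classical method (last three columns). The names `table2`, `best_gap` and `gaps` are illustrative.

```python
# Average test accuracies (%) copied from Table 2.
# First four entries: the proposed models; last three entries: OVO, WW, CS.
table2 = {
    "Forest": [80.66, 80.12, 82.30, 81.62, 82.10, 78.40, 78.60],
    "Glass": [64.92, 64.92, 65.32, 65.32, 58.76, 56.25, 59.26],
    "Iris": [95.08, 95.40, 96.44, 96.66, 93.80, 96.44, 96.44],
    "Seeds": [93.66, 93.66, 93.52, 93.52, 91.02, 93.52, 93.52],
    "Wine": [95.20, 95.20, 96.82, 96.82, 96.34, 96.09, 96.17],
    "Zoo": [89.75, 89.75, 89.75, 89.75, 87.44, 87.68, 87.68],
}


def best_gap(row):
    """Best proposed accuracy minus best classical accuracy, in percentage points."""
    proposed, classical = row[:4], row[4:]
    return round(max(proposed) - max(classical), 2)


gaps = {name: best_gap(row) for name, row in table2.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} percentage points")
```

All gaps are nonnegative, which is the sense in which the proposed models replicate or improve the classical scores; the largest gaps appear on Glass and Zoo, the two datasets with the most classes.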
6. Conclusions
In this paper we propose a novel modeling framework for multiclass classification based on the Support Vector Machine paradigm, in which the different classes are linearly separated and the separation between classes is maximized. We propose two approaches whose goal is to compute an arrangement of hyperplanes that subdivides the space into cells, where each cell is assigned to a class based on the training data. The models result in Mixed Integer (Non Linear) Programming problems. Some strategies are presented to help solvers find the optimal solution of the problem. We also prove that the kernel trick can be extended to this framework. The power of the approach is illustrated on some well-known datasets from the multicategory classification literature as well as on some small synthetic examples.
Acknowledgements
The authors were partially supported by the research project MTM2016-74983-C2-1-R (MINECO, Spain). The first author has also been supported by project PP2016-PIP06 (Universidad de Granada) and the research group SEJ-534 (Junta de Andalucía).
Proof of Theorem 3.1
Note that once the binary variables of our model are fixed, the problem becomes polynomial-time solvable and reduces to finding the coefficients and intercepts of the hyperplanes and the different misclassification errors. In particular, it is clear that the MINLP formulation for the problem is equivalent to:
s.t.  
where is the evaluation of the margin and hinge-loss errors for any assignment provided by the binary variables. That is,
The above problem would be separable provided that the first constraints (2) were relaxed. For the sake of simplicity in the notation, we consider the following functions, for , defined as:
for . Note that , , and that , .
Based on the separability mentioned above, we introduce another instrumental family of problems for all , namely,
where for simplifying the notation we have introduced the auxiliary variables , , and , for and .
Observe that , apart from the first constraint, only considers variables associated to the th hyperplane.
Moreover, we need another problem that accounts for the first part of .
s.t.  