A Quadratic Loss Multi-Class SVM

Emmanuel Monfrini (UMR 7503, UHP) and Yann Guermeur (UMR 7503, CNRS)

July 6, 2019
Abstract: Using a support vector machine requires setting two types of hyperparameters: the soft margin parameter and the parameters of the kernel. To perform this model selection task, the method of choice is cross-validation. Its leave-one-out variant is known to produce an estimator of the generalization error which is almost unbiased. Its major drawback rests in its time requirement. To overcome this difficulty, several upper bounds on the leave-one-out error of the pattern recognition SVM have been derived. Among those bounds, the most popular one is probably the radius-margin bound. It applies to the hard margin pattern recognition SVM, and by extension to the 2-norm SVM. In this report, we introduce a quadratic loss multi-class SVM, the M-SVM², as a direct extension of the 2-norm SVM to the multi-class case. For this machine, a generalized radius-margin bound is then established.
Keywords: M-SVMs, model selection, leave-one-out error, radius-margin bound.
1 Introduction
Using a support vector machine (SVM) [2, 4] requires setting two types of hyperparameters: the soft margin parameter and the parameters of the kernel. To perform this model selection task, several approaches are available (see for instance [9, 12]). The solution of choice consists in applying a cross-validation procedure. Among those procedures, the leave-one-out one appears especially attractive, since it is known to produce an estimator of the generalization error which is almost unbiased [11]. The seamy side of things is that it is highly time consuming. This is the reason why, in recent years, a number of upper bounds on the leave-one-out error of pattern recognition SVMs have been proposed in the literature (see [3] for a survey). Among those bounds, the tightest one is the span bound [16]. However, the results of Chapelle and coworkers presented in [3] show that another bound, the radius-margin one [15], achieves equivalent performance for model selection while being far simpler to compute. This is the reason why it is currently the most popular bound. It applies to the hard margin machine and, by extension, to the 2-norm SVM (see for instance Chapter 7 in [13]).
In this report, a multi-class extension of the 2-norm SVM is introduced. This machine, named the M-SVM², is a quadratic loss multi-class SVM, i.e., a multi-class SVM (M-SVM) in which the linear penalty on the vector of slack variables has been replaced with a quadratic form. The standard M-SVM on which it is based is the one of Lee, Lin and Wahba [10]. As for the 2-norm SVM, its training algorithm is equivalent to the training algorithm of a hard margin machine obtained by a simple change of kernel. We then establish a generalized radius-margin bound on the leave-one-out error of the hard margin version of the M-SVM of Lee, Lin and Wahba.
The organization of this paper is as follows. Section 2 presents the multi-class SVMs, by describing their common architecture and the general form taken by their different training algorithms. It focuses on the M-SVM of Lee, Lin and Wahba. In Section 3, the M-SVM² is introduced as a particular case of quadratic loss M-SVM. Its connection with the hard margin version of the M-SVM of Lee, Lin and Wahba is highlighted, as well as the fact that it constitutes a multi-class generalization of the 2-norm SVM. Section 4 is devoted to the formulation and proof of the corresponding multi-class radius-margin bound. Finally, we draw conclusions and outline our ongoing research in Section 5.
2 Multi-Class SVMs
2.1 Formalization of the learning problem
We are interested here in multi-class pattern recognition problems. Formally, we consider the case of $Q$-category classification problems with $Q \geq 3$, but our results extend to the case of dichotomies. Each object is represented by its description $x \in \mathcal{X}$ and the set of the categories can be identified with the set of the indexes of the categories: $[\![1, Q]\!]$. We assume that the link between objects and categories can be described by an unknown probability measure $P$ on the product space $\mathcal{X} \times [\![1, Q]\!]$. The aim of the learning problem consists in selecting, in a set $\mathcal{G}$ of functions $g = (g_k)_{1 \le k \le Q}$ from $\mathcal{X}$ into $\mathbb{R}^Q$, a function classifying data in an optimal way. The criterion of optimality must be specified. The function $g$ assigns $x$ to the category $k$ if and only if $g_k(x) > \max_{l \neq k} g_l(x)$. In case of ex aequo, $x$ is assigned to a dummy category denoted by $\ast$. Let $f$ be the decision function (from $\mathcal{X}$ into $[\![1, Q]\!] \cup \{\ast\}$) associated with $g$. With these definitions at hand, the objective function to be minimized is the probability of error $P(f(X) \neq Y)$. The optimization process, called training, is based on empirical data. More precisely, we assume that there exists a random pair $(X, Y)$ distributed according to $P$, and we are provided with an $m$-sample $D_m = ((X_i, Y_i))_{1 \le i \le m}$ of independent copies of $(X, Y)$.
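The decision rule just described (a category wins only if its score strictly dominates all the others, ties going to the dummy category) can be sketched as follows; the function name and the tie tolerance are illustrative, not from the report:

```python
import numpy as np

def decision(g_values, tol=0.0):
    """Map a score vector g(x) in R^Q to a category in {1, ..., Q} or '*'.

    Category k is returned iff g_k(x) strictly exceeds every other score;
    ties ("ex aequo") are mapped to the dummy category '*'.
    """
    g_values = np.asarray(g_values, dtype=float)
    k = int(np.argmax(g_values))
    # Check strict dominance of the best score over all the others.
    others = np.delete(g_values, k)
    if g_values[k] > others.max() + tol:
        return k + 1  # categories are indexed from 1
    return "*"

print(decision([0.2, 0.9, 0.1]))  # prints 2
print(decision([0.5, 0.5, 0.1]))  # prints *
```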
There are two questions raised by such problems: how to properly choose the class of functions $\mathcal{G}$ and how to determine the best candidate in this class, using only $D_m$. This report addresses the first question, named model selection, in the particular case when the model considered is an M-SVM. The second question, named function selection, is addressed for instance in [8].
2.2 Architecture and training algorithms
M-SVMs, like all SVMs, belong to the family of kernel machines. As such, they operate on a class of functions induced by a positive semidefinite (Mercer) kernel. This calls for the formulation of some definitions and propositions.
Definition 1 (Positive semidefinite kernel)
A positive semidefinite kernel $\kappa$ on the set $\mathcal{X}$ is a continuous and symmetric function $\kappa: \mathcal{X}^2 \to \mathbb{R}$ verifying:

$$\forall n \in \mathbb{N}^*, \ \forall (x_i)_{1 \le i \le n} \in \mathcal{X}^n, \ \forall (a_i)_{1 \le i \le n} \in \mathbb{R}^n, \quad \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \kappa(x_i, x_j) \ge 0.$$
Definition 2 (Reproducing kernel Hilbert space [1])
Let $(H, \langle \cdot, \cdot \rangle_H)$ be a Hilbert space of functions on $\mathcal{X}$ ($H \subset \mathbb{R}^{\mathcal{X}}$). A function $\kappa: \mathcal{X}^2 \to \mathbb{R}$ is a reproducing kernel of $H$ if and only if:

1. $\forall x \in \mathcal{X}, \ \kappa_x = \kappa(x, \cdot) \in H$;

2. $\forall x \in \mathcal{X}, \ \forall h \in H, \ \langle h, \kappa_x \rangle_H = h(x)$ (reproducing property).
A Hilbert space of functions which possesses a reproducing kernel is called a reproducing kernel Hilbert space (RKHS).
Proposition 1
Let $(H, \langle \cdot, \cdot \rangle_H)$ be a RKHS of functions on $\mathcal{X}$ with reproducing kernel $\kappa$. Then, there exists a map $\Phi$ from $\mathcal{X}$ into a Hilbert space $(E_\Phi, \langle \cdot, \cdot \rangle)$ such that:

$$\forall (x, x') \in \mathcal{X}^2, \quad \kappa(x, x') = \langle \Phi(x), \Phi(x') \rangle. \tag{1}$$

$\Phi$ is called a feature map and $E_\Phi$ a feature space.
The connection between positive semidefinite kernels and RKHS is the following.
Proposition 2
If $\kappa$ is a positive semidefinite kernel on $\mathcal{X}$, then there exists a RKHS $(H_\kappa, \langle \cdot, \cdot \rangle_{H_\kappa})$ of functions on $\mathcal{X}$ such that $\kappa$ is a reproducing kernel of $H_\kappa$.
Let $\kappa$ be a positive semidefinite kernel on $\mathcal{X}$ and let $H_\kappa$ be the RKHS spanned by $\kappa$. Let $\mathcal{H} = (H_\kappa)^Q$ and let $\bar{\mathcal{H}} = (H_\kappa \oplus \{1\})^Q$. By construction, $\bar{\mathcal{H}}$ is the class of vector-valued functions $\bar{h} = (\bar{h}_k)_{1 \le k \le Q}$ on $\mathcal{X}$ such that

$$\bar{h}_k(\cdot) = \sum_{i} \beta_{ik} \kappa(x_{ik}, \cdot) + b_k,$$

where the $x_{ik}$ are elements of $\mathcal{X}$, as well as the limits of these functions when the sets $\{x_{ik}\}$ become dense in $\mathcal{X}$ in the norm induced by the dot product (see for instance [17]). Due to Equation 1, $\bar{\mathcal{H}}$ can be seen as a multivariate affine model on $\Phi(\mathcal{X})$. Functions $\bar{h}$ can then be rewritten as:

$$\bar{h}(\cdot) = \left( \langle w_k, \Phi(\cdot) \rangle + b_k \right)_{1 \le k \le Q},$$

where the vectors $w_k$ are elements of $E_\Phi$. They are thus described by the pair $(\mathbf{w}, \mathbf{b})$ with $\mathbf{w} = (w_k)_{1 \le k \le Q} \in E_\Phi^Q$ and $\mathbf{b} = (b_k)_{1 \le k \le Q} \in \mathbb{R}^Q$. As a consequence, $\mathcal{H}$ can be seen as a multivariate linear model on $\Phi(\mathcal{X})$, endowed with a norm $\|\cdot\|_{\mathcal{H}}$ given by:

$$\|h\|_{\mathcal{H}} = \sqrt{\sum_{k=1}^{Q} \|w_k\|^2},$$

where $\|\cdot\|$ is the norm induced by the dot product of $E_\Phi$. With these definitions and propositions at hand, a generic definition of the M-SVMs can be formulated as follows.
Definition 3 (M-SVM, Definition 42 in [8])
Let $((x_i, y_i))_{1 \le i \le m} \in (\mathcal{X} \times [\![1, Q]\!])^m$ and $\lambda \in \mathbb{R}_+^*$. A $Q$-category M-SVM is a large margin discriminant model obtained by minimizing over the hyperplane $\sum_{k=1}^{Q} \bar{h}_k = 0$ of $\bar{\mathcal{H}}$ a penalized risk of the form:

$$J(\bar{h}) = \sum_{i=1}^{m} \ell\left(y_i, \bar{h}(x_i)\right) + \lambda \|h\|_{\mathcal{H}}^2,$$

where the data fit component involves a loss function $\ell$ which is convex.
Three main models of M-SVMs can be found in the literature. The oldest one is the model of Weston and Watkins [19], which corresponds to the loss function $\ell_{WW}$ given by:

$$\ell_{WW}\left(y, \bar{h}(x)\right) = \sum_{k \neq y} \left( 1 - \bar{h}_y(x) + \bar{h}_k(x) \right)_+,$$

where the hinge loss function $(\cdot)_+$ is the function $t \mapsto \max(0, t)$. The second one is due to Crammer and Singer [5] and corresponds to the loss function $\ell_{CS}$ given by:

$$\ell_{CS}\left(y, \bar{h}(x)\right) = \left( 1 - \bar{h}_y(x) + \max_{k \neq y} \bar{h}_k(x) \right)_+.$$

The most recent model is the one of Lee, Lin and Wahba [10], which corresponds to the loss function $\ell_{LLW}$ given by:

$$\ell_{LLW}\left(y, \bar{h}(x)\right) = \sum_{k \neq y} \left( \bar{h}_k(x) + \frac{1}{Q-1} \right)_+. \tag{2}$$
Among the three models, the M-SVM of Lee, Lin and Wahba is the only one that asymptotically implements the Bayes decision rule, i.e., the only one that is Fisher consistent [20, 14].
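For concreteness, the three loss functions can be sketched as follows; the function names are ours, and the score vector is assumed to follow the conventions of each model (in particular the sum-to-zero constraint for the loss of Lee, Lin and Wahba):

```python
import numpy as np

def hinge(t):
    """(t)_+ = max(0, t), applied componentwise."""
    return np.maximum(0.0, t)

def loss_ww(y, h):
    """Weston-Watkins: sum over k != y of (1 - h_y(x) + h_k(x))_+ ."""
    h = np.asarray(h, dtype=float)
    others = np.delete(h, y)
    return hinge(1.0 - h[y] + others).sum()

def loss_cs(y, h):
    """Crammer-Singer: (1 - h_y(x) + max_{k != y} h_k(x))_+ ."""
    h = np.asarray(h, dtype=float)
    others = np.delete(h, y)
    return hinge(1.0 - h[y] + others.max())

def loss_llw(y, h):
    """Lee-Lin-Wahba: sum over k != y of (h_k(x) + 1/(Q-1))_+ ,
    with Q inferred from the length of the score vector."""
    h = np.asarray(h, dtype=float)
    Q = len(h)
    others = np.delete(h, y)
    return hinge(others + 1.0 / (Q - 1)).sum()

h = [0.2, 0.3, -0.5]  # a score vector summing to zero, category y = 0
print(loss_ww(0, h), loss_cs(0, h), loss_llw(0, h))  # 1.4 1.1 0.8
```

Note that all three losses vanish on a well-classified example with sufficient margin, e.g. $h(x) = (1, -1/2, -1/2)$ with $y = 1$.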
2.3 The M-SVM of Lee, Lin and Wahba
The substitution in Definition 3 of $\ell$ with the expression of the loss function $\ell_{LLW}$ given by Equation 2 provides us with the expressions of the quadratic programming (QP) problems corresponding to the training algorithms of the hard margin and soft margin versions of the M-SVM of Lee, Lin and Wahba.

Problem 1 (Hard margin M-SVM)

$$\min_{\mathbf{w}, \mathbf{b}} \ J(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2$$

subject to

$$\begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \le -\dfrac{1}{Q-1}, & (1 \le i \le m), \ (1 \le k \le Q), \ k \neq y_i \\ \sum_{k=1}^{Q} w_k = 0, \quad \sum_{k=1}^{Q} b_k = 0 \end{cases}$$

Problem 2 (Soft margin M-SVM)

$$\min_{\mathbf{w}, \mathbf{b}, \xi} \ J(\mathbf{w}, \xi) = \frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + C \sum_{i=1}^{m} \sum_{k \neq y_i} \xi_{ik}$$

subject to

$$\begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \le -\dfrac{1}{Q-1} + \xi_{ik}, & (1 \le i \le m), \ k \neq y_i \\ \xi_{ik} \ge 0, & (1 \le i \le m), \ k \neq y_i \\ \sum_{k=1}^{Q} w_k = 0, \quad \sum_{k=1}^{Q} b_k = 0 \end{cases}$$
In Problem 2, the $\xi_{ik}$ are slack variables introduced in order to relax the constraints of correct classification. The coefficient $C$, which characterizes the trade-off between prediction accuracy on the training set and smoothness of the solution, can be expressed as a function of the regularization coefficient $\lambda$ of Definition 3. It is called the soft margin parameter. Instead of directly solving Problems 1 and 2, one usually solves their Wolfe dual [6]. We now derive the dual problem of Problem 1. Giving the details of the implementation of the Lagrangian duality will provide us with partial results which will prove useful in the sequel.
Let $\alpha = (\alpha_{ik})_{1 \le i \le m, \, 1 \le k \le Q} \in \mathbb{R}^{Qm}$ be the vector of Lagrange multipliers associated with the constraints of good classification. It is for convenience of notation that this vector is expressed with a double subscript and that the dummy variables $\alpha_{iy_i}$, all equal to $0$, are introduced. Let $\eta \in E_\Phi$ be the Lagrange multiplier associated with the constraint $\sum_{k=1}^{Q} w_k = 0$ and $\mu \in \mathbb{R}$ the Lagrange multiplier associated with the constraint $\sum_{k=1}^{Q} b_k = 0$. The Lagrangian function of Problem 1 is given by:

$$L(\mathbf{w}, \mathbf{b}, \alpha, \eta, \mu) = \frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + \sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik} \left( \langle w_k, \Phi(x_i) \rangle + b_k + \frac{1}{Q-1} \right) + \left\langle \eta, \sum_{k=1}^{Q} w_k \right\rangle + \mu \sum_{k=1}^{Q} b_k. \tag{3}$$

Setting the gradient of the Lagrangian function with respect to $w_k$ equal to the null vector provides us with an expression for the optimal value of this vector:

$$\forall k \in [\![1, Q]\!], \quad w_k^* = -\sum_{i=1}^{m} \alpha_{ik} \Phi(x_i) - \eta. \tag{4}$$

Since by hypothesis $\sum_{k=1}^{Q} w_k = 0$, summing over the index $k$ provides us with the expression of $\eta$ as a function of the dual variables only:

$$\eta = -\frac{1}{Q} \sum_{i=1}^{m} \left( \sum_{k=1}^{Q} \alpha_{ik} \right) \Phi(x_i). \tag{5}$$

By substitution into (4), we get the expression of the vectors $w_k^*$ at the optimum:

$$w_k^* = \sum_{i=1}^{m} \left( \frac{1}{Q} \sum_{l=1}^{Q} \alpha_{il} - \alpha_{ik} \right) \Phi(x_i),$$

which can also be written as

$$w_k^* = -\sum_{i=1}^{m} \sum_{l=1}^{Q} \left( \delta_{k,l} - \frac{1}{Q} \right) \alpha_{il} \Phi(x_i), \tag{6}$$

where $\delta_{k,l}$ is the Kronecker symbol.

Let us now set the gradient of (3) with respect to $\mathbf{b}$ equal to the null vector. It comes:

$$\forall k \in [\![1, Q]\!], \quad \sum_{i=1}^{m} \alpha_{ik} + \mu = 0,$$

and thus

$$\forall (k, l) \in [\![1, Q]\!]^2, \quad \sum_{i=1}^{m} \alpha_{ik} = \sum_{i=1}^{m} \alpha_{il}.$$

Given the constraint $\sum_{k=1}^{Q} b_k = 0$, this implies that:

$$\sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik} b_k = -\mu \sum_{k=1}^{Q} b_k = 0. \tag{7}$$

By application of (6),

$$\sum_{k=1}^{Q} \|w_k^*\|^2 = \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{Q} \sum_{l=1}^{Q} \left( \delta_{k,l} - \frac{1}{Q} \right) \alpha_{ik} \alpha_{jl} \kappa(x_i, x_j). \tag{8}$$

Still by application of (6),

$$\sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik} \langle w_k^*, \Phi(x_i) \rangle = -\sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{Q} \sum_{l=1}^{Q} \left( \delta_{k,l} - \frac{1}{Q} \right) \alpha_{ik} \alpha_{jl} \kappa(x_i, x_j), \tag{9}$$

and thus

$$\sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik} \langle w_k^*, \Phi(x_i) \rangle = -\sum_{k=1}^{Q} \|w_k^*\|^2. \tag{10}$$

In what follows, we use the notation $\mathbf{1}_n$ to designate the vector of $\mathbb{R}^n$ such that all its components are equal to $1$. Let $H$ be the matrix of $\mathcal{M}_{Qm,Qm}(\mathbb{R})$ of general term:

$$h_{ik,jl} = \left( \delta_{k,l} - \frac{1}{Q} \right) \kappa(x_i, x_j).$$

With these notations at hand, reporting (7) and (10) in (3) provides us with the algebraic expression of the Lagrangian function at the optimum:

$$L(\alpha) = -\frac{1}{2} \alpha^T H \alpha + \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha.$$

This eventually provides us with the Wolfe dual formulation of Problem 1:

Problem 3 (Hard margin M-SVM, dual formulation)

$$\max_{\alpha} \ J_d(\alpha) = -\frac{1}{2} \alpha^T H \alpha + \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha$$

subject to

$$\begin{cases} \alpha_{ik} \ge 0, & (1 \le i \le m), \ (1 \le k \le Q), \ k \neq y_i \\ \alpha_{iy_i} = 0, & (1 \le i \le m) \\ \sum_{i=1}^{m} \alpha_{ik} = \sum_{i=1}^{m} \alpha_{il}, & (1 \le k < l \le Q) \end{cases}$$

with the general term of the Hessian matrix being the coefficient $h_{ik,jl}$ defined above.
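As an illustration, a Hessian with general term $(\delta_{k,l} - 1/Q)\,\kappa(x_i, x_j)$ can be assembled as a Kronecker product, which also makes its positive semidefiniteness apparent (the Kronecker product of two positive semidefinite matrices is positive semidefinite). A minimal sketch under these assumptions, with an arbitrary linear-kernel Gram matrix:

```python
import numpy as np

def build_hessian(K, Q):
    """Hessian with general term h_{(i,k),(j,l)} = (delta_{k,l} - 1/Q) kappa(x_i, x_j),
    written as the Kronecker product K (x) (I_Q - (1/Q) 1 1^T).
    Row/column (i, k) of H corresponds to flat index i*Q + k."""
    centering = np.eye(Q) - np.full((Q, Q), 1.0 / Q)
    return np.kron(K, centering)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
K = X @ X.T                      # linear-kernel Gram matrix (positive semidefinite)
H = build_hessian(K, Q=3)
# Both factors are positive semidefinite, hence so is H.
print(np.linalg.eigvalsh(H).min() >= -1e-9)  # prints True
```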
2.4 Geometrical margins
From a geometrical point of view, the algorithms described above tend to construct a set of hyperplanes that globally maximize the margins between the different categories. Although these margins are defined as in the biclass case, their analytical expression is more complex.
Definition 4 (Geometrical margins, Definition 7 in [7])
Let us consider a $Q$-category M-SVM (a function $\bar{h}$ of $\bar{\mathcal{H}}$) classifying the examples of its training set without error. Its margin $\gamma_{kl}$ between categories $k$ and $l$ is defined as the smallest distance of a point either in category $k$ or in category $l$ to the hyperplane separating those categories, i.e., the hyperplane of equation $\bar{h}_k(x) - \bar{h}_l(x) = 0$.
Given the constraints of Problem 1, the expression of $\gamma_{kl}$ corresponding to the M-SVM of Lee, Lin and Wahba is:
Remark 1
The values of the margins $\gamma_{kl}$ are known as soon as the pair $(\mathbf{w}, \mathbf{b})$ is known.
The connection between the geometrical margins and the penalizer is given by the following equation:
(11) 
the proof of which can be found for instance in Chapter 2 of [7]. We now introduce a result needed in the proof of the master theorem of this report.
Proposition 3
For the hard margin M-SVM of Lee, Lin and Wahba, we have:

$$\sum_{k=1}^{Q} \|w_k^*\|^2 = \alpha^{*T} H \alpha^*.$$

Proof
This is a direct consequence of Equation 10 and the definition of matrix $H$.
3 The M-SVM²
3.1 Quadratic loss multi-class SVMs: motivation and principle
The M-SVMs presented in Section 2.2 share a common feature with the standard pattern recognition SVM: the contribution of the slack variables to their objective functions is linear. Let $\xi$ be the vector of these variables. In the cases of the M-SVMs of Weston and Watkins and of Lee, Lin and Wahba, $\xi = (\xi_{ik})$ belongs to $\mathbb{R}^{Qm}$ (with dummy variables $\xi_{iy_i} = 0$), and in the case of the model of Crammer and Singer, it is simply an element of $\mathbb{R}^m$. In both cases, the contribution to the objective function is $C \|\xi\|_1$.
In the biclass case, there exists a variant of the standard SVM which is known as the 2-norm SVM since, for this machine, the empirical contribution to the objective function is $C \|\xi\|_2^2$. Its main advantage, underlined for instance in Chapter 7 of [13], is that its training algorithm can be expressed, after an appropriate change of kernel, as the training algorithm of a hard margin machine. As a consequence, its leave-one-out error can be upper bounded thanks to the radius-margin bound.
Unfortunately, a naive extension of the 2-norm SVM to the multi-class case, resulting from substituting $\|\xi\|_2^2$ for $\|\xi\|_1$ in the objective function of any of the three M-SVMs, does not preserve this property. Section 2.4.1.4 of [7] gives detailed explanations about that point. The strategy that we propose to exhibit interesting multi-class generalizations of the 2-norm SVM consists in studying the class of quadratic loss M-SVMs, i.e., the class of extensions of the M-SVMs such that the contribution of the slack variables to the objective function is a quadratic form:

$$C \, \xi^T M \xi,$$

where $M$ is a symmetric positive semidefinite matrix.
3.2 The M-SVM² as a multi-class generalization of the 2-norm SVM
In this section, we establish that the idea introduced above provides us with a solution to the problem of interest when the M-SVM used is the one of Lee, Lin and Wahba and the general term of the matrix $M$ is $m_{ik,jl} = \delta_{i,j} \left( \delta_{k,l} + 1 \right)$ (for $k \neq y_i$ and $l \neq y_j$). The corresponding machine, named M-SVM², generalizes the 2-norm SVM to an arbitrary (but finite) number $Q$ of categories.
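Numerically, one can check the key algebraic property behind such a choice of $M$: if the diagonal block of $M$ on the indices $k \neq y_i$ is assumed to have general term $\delta_{k,l} + 1$, then its inverse has general term $\delta_{k,l} - 1/Q$, i.e., the block appearing in the Hessian of the dual problem. A minimal sketch under that assumption:

```python
import numpy as np

Q = 4  # number of categories; any Q >= 2 works
# Diagonal block of M on the Q-1 indices k != y_i: delta_{k,l} + 1.
A = np.eye(Q - 1) + np.ones((Q - 1, Q - 1))
# Candidate inverse: delta_{k,l} - 1/Q, restricted to the same indices.
B = np.eye(Q - 1) - np.full((Q - 1, Q - 1), 1.0 / Q)
# (I + 1 1^T)(I - (1/Q) 1 1^T) = I + (1 - 1/Q - (Q-1)/Q) 1 1^T = I.
print(np.allclose(A @ B, np.eye(Q - 1)))  # prints True
```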
Problem 4 (M-SVM²)

$$\min_{\mathbf{w}, \mathbf{b}, \xi} \ J(\mathbf{w}, \xi) = \frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + C \, \xi^T M \xi$$

subject to

$$\begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \le -\dfrac{1}{Q-1} + \xi_{ik}, & (1 \le i \le m), \ k \neq y_i \\ \sum_{k=1}^{Q} w_k = 0, \quad \sum_{k=1}^{Q} b_k = 0 \end{cases}$$
Note that, as in the biclass case, it is useless to introduce nonnegativity constraints for the slack variables. The Lagrangian function associated with Problem 4 is thus

$$L(\mathbf{w}, \mathbf{b}, \xi, \alpha, \eta, \mu) = \frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + C \xi^T M \xi + \sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik} \left( \langle w_k, \Phi(x_i) \rangle + b_k + \frac{1}{Q-1} - \xi_{ik} \right) + \left\langle \eta, \sum_{k=1}^{Q} w_k \right\rangle + \mu \sum_{k=1}^{Q} b_k, \tag{12}$$

where the multipliers $\alpha$, $\eta$ and $\mu$ are defined as for Problem 1. Setting the gradient of $L$ with respect to $\xi$ equal to the null vector gives

$$2 C M \xi = \alpha, \tag{13}$$

which has for immediate consequence that

$$C \xi^T M \xi - \alpha^T \xi = -C \xi^T M \xi. \tag{14}$$

Using the same reasoning that we used to derive the objective function of Problem 3 and (14), at the optimum, (12) simplifies into:

$$L(\alpha, \xi) = -\frac{1}{2} \sum_{k=1}^{Q} \|w_k^*\|^2 - C \xi^T M \xi + \frac{1}{Q-1} \sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik}. \tag{15}$$

Besides, using (13),

$$\forall i, \ \forall k \neq y_i, \quad 2C \sum_{l \neq y_i} \left( \delta_{k,l} + 1 \right) \xi_{il} = \alpha_{ik},$$

and thus

$$\forall i, \ \forall k \neq y_i, \quad \xi_{ik} + \sum_{l \neq y_i} \xi_{il} = \frac{1}{2C} \alpha_{ik}.$$

By a summation over the index $k$, we have:

$$\forall i, \quad Q \sum_{k \neq y_i} \xi_{ik} = \frac{1}{2C} \sum_{k=1}^{Q} \alpha_{ik}.$$

Since

$$\sum_{l \neq y_i} \xi_{il} = \frac{1}{2CQ} \sum_{l=1}^{Q} \alpha_{il},$$

this simplifies into

$$\forall i, \ \forall k \neq y_i, \quad \xi_{ik} = \frac{1}{2C} \left( \alpha_{ik} - \frac{1}{Q} \sum_{l=1}^{Q} \alpha_{il} \right).$$

Finally, a double summation over the indexes $i$ and $k$ implies that

$$C \xi^T M \xi = \frac{1}{2} \alpha^T \xi = \frac{1}{4C} \sum_{i=1}^{m} \sum_{k=1}^{Q} \sum_{l=1}^{Q} \left( \delta_{k,l} - \frac{1}{Q} \right) \alpha_{ik} \alpha_{il}.$$

A substitution into (15) provides us with:

$$L(\alpha) = -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{Q} \sum_{l=1}^{Q} \left( \delta_{k,l} - \frac{1}{Q} \right) \left( \kappa(x_i, x_j) + \frac{\delta_{i,j}}{2C} \right) \alpha_{ik} \alpha_{jl} + \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha.$$

As in the case of the hard margin version of the M-SVM of Lee, Lin and Wahba, setting the gradient of (12) with respect to $\mathbf{b}$ equal to the null vector gives:

$$\forall (k, l) \in [\![1, Q]\!]^2, \quad \sum_{i=1}^{m} \alpha_{ik} = \sum_{i=1}^{m} \alpha_{il}.$$

Putting things together, we obtain the following expression for the dual problem of Problem 4:

Problem 5 (M-SVM², dual formulation)

$$\max_{\alpha} \ J_d(\alpha) = -\frac{1}{2} \alpha^T \tilde{H} \alpha + \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha$$

subject to

$$\begin{cases} \alpha_{ik} \ge 0, & (1 \le i \le m), \ (1 \le k \le Q), \ k \neq y_i \\ \alpha_{iy_i} = 0, & (1 \le i \le m) \\ \sum_{i=1}^{m} \alpha_{ik} = \sum_{i=1}^{m} \alpha_{il}, & (1 \le k < l \le Q) \end{cases}$$

with the general term of the Hessian matrix $\tilde{H}$ being $\tilde{h}_{ik,jl} = \left( \delta_{k,l} - \frac{1}{Q} \right) \left( \kappa(x_i, x_j) + \frac{\delta_{i,j}}{2C} \right)$. Due to the definitions of the matrices $M$ and $\tilde{H}$, this is precisely Problem 3 with the kernel $\kappa$ replaced by a kernel $\tilde{\kappa}$ such that:

$$\forall (i, j) \in [\![1, m]\!]^2, \quad \tilde{\kappa}(x_i, x_j) = \kappa(x_i, x_j) + \frac{\delta_{i,j}}{2C}.$$
When $Q = 2$, the M-SVM of Lee, Lin and Wahba, like the two other ones, is equivalent to the standard biclass SVM (see for instance [7]). Furthermore, in that case, the change of kernel above is precisely the one characterizing the 2-norm SVM. The M-SVM² is thus equivalent to the 2-norm SVM.
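Computationally, such a change of kernel amounts to adding a ridge term on the diagonal of the Gram matrix, which is all a dual solver needs. The sketch below illustrates this on a toy Gram matrix; the constant $1/(2C)$ follows the derivation above and the function name is ours, not from the report:

```python
import numpy as np

def modified_gram(K, C):
    """Gram matrix of the modified kernel
    kappa~(x_i, x_j) = kappa(x_i, x_j) + delta_{i,j} / (2C):
    a diagonal ridge turning the soft margin problem into a
    hard margin one for the modified kernel."""
    return K + np.eye(K.shape[0]) / (2.0 * C)

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))
K = X @ X.T           # rank-deficient linear-kernel Gram matrix
K_tilde = modified_gram(K, C=10.0)
# The ridge strictly improves conditioning: K_tilde is positive definite.
print(np.linalg.eigvalsh(K_tilde).min() > 0)  # prints True
```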
4 Multi-Class Radius-Margin Bound on the Leave-One-Out Error of the M-SVM²
To begin with, we must recall Vapnik’s initial biclass theorem (see Chapter 10 of [15]), which is based on an intermediate result of central importance known as the “key lemma”.
4.1 Biclass radius-margin bound
Lemma 1 (Biclass key lemma)
Let us consider a hard margin biclass SVM on a domain $\mathcal{X}$. Suppose that it is trained on a set $D_m = ((x_i, y_i))_{1 \le i \le m}$ of couples of $\mathcal{X} \times \{-1, 1\}$ (the points of which it separates without error). Consider now the same machine, trained on $D_m \setminus \{(x_p, y_p)\}$. If it makes an error on $(x_p, y_p)$, then the inequality

$$\alpha_p \ge \frac{1}{D_{SV}^2}$$

holds, where $\alpha_p$ is the Lagrange multiplier associated with $(x_p, y_p)$ in the initial training and $D_{SV}$ is the diameter of the smallest sphere containing the images by the feature map of the support vectors of the initial machine.
Theorem 1 (Biclass radius-margin bound)
Let $\gamma$ be the geometrical margin of the hard margin SVM defined in Lemma 1, when trained on $D_m$. Let also $L_m$ be the number of errors resulting from applying a leave-one-out cross-validation procedure to this machine. We have:

$$L_m \le \frac{D_{SV}^2}{\gamma^2}.$$
The multi-class radius-margin bound that we propose in this report is a direct generalization of the one proposed by Vapnik. The first step of the proof consists in establishing a "multi-class key lemma". This is the subject of the following subsection.
4.2 Multi-class key lemma
Lemma 2 (Multi-class key lemma)
Let us consider a $Q$-category hard margin M-SVM of Lee, Lin and Wahba on a domain $\mathcal{X}$. Let $D_m = ((x_i, y_i))_{1 \le i \le m}$ be its training set. Consider now the same machine trained on $D_m \setminus \{(x_p, y_p)\}$. If it makes an error on $(x_p, y_p)$, then the following inequality holds, where $D$ is the diameter of the smallest sphere of the feature space containing the set $\{\Phi(x_i) : 1 \le i \le m\}$:
Proof Let $(\mathbf{w}^*, \mathbf{b}^*)$ be the couple characterizing the optimal hyperplanes when the machine is trained on $D_m$, and let $\alpha^*$ be the corresponding vector of dual variables, an element of $\mathbb{R}^{Qm}$ with $\alpha^*_{iy_i} = 0$ for all $i$. This representation is used to characterize the second M-SVM directly with respect to the first one. Indeed, the vector of dual variables of the machine trained on $D_m \setminus \{(x_p, y_p)\}$ is an optimal solution of Problem 3 under the additional constraint that the multipliers associated with $(x_p, y_p)$ vanish. Let us define two more vectors in $\mathbb{R}^{Qm}$, the first of which satisfies additional properties ensuring that its difference with $\alpha^*$ is a feasible solution of Problem 3 under the same additional constraint, i.e., satisfies the same constraints as the vector of dual variables of the second machine. We have
We deduce from the equality constraints of Problem 3 that:
To sum up, this vector satisfies the following constraints: