
A Quadratic Loss Multi-Class SVM

Emmanuel Monfrini (UMR 7503-UHP), Yann Guermeur (UMR 7503-CNRS)

July 6, 2019

Abstract: Using a support vector machine requires setting two types of hyperparameters: the soft margin parameter and the parameters of the kernel. To perform this model selection task, the method of choice is cross-validation. Its leave-one-out variant is known to produce an estimator of the generalization error which is almost unbiased. Its major drawback rests in its time requirement. To overcome this difficulty, several upper bounds on the leave-one-out error of the pattern recognition SVM have been derived. Among those bounds, the most popular one is probably the radius-margin bound. It applies to the hard margin pattern recognition SVM, and by extension to the 2-norm SVM. In this report, we introduce a quadratic loss M-SVM, the M-SVM², as a direct extension of the 2-norm SVM to the multi-class case. For this machine, a generalized radius-margin bound is then established.

Key-words: M-SVMs, model selection, leave-one-out error, radius-margin bound.

A Quadratic Loss Multi-Class SVM

Abstract: Implementing a support vector machine requires determining the values of two types of hyperparameters: the soft margin parameter and the parameters of the kernel. The method of choice to perform this model selection task is cross-validation. Its leave-one-out variant is known to provide an almost unbiased estimator of the generalization error. Its main drawback rests in the computation time it requires. To overcome this difficulty, several upper bounds on the leave-one-out error of the SVM computing dichotomies have been proposed. The most popular of these upper bounds is probably the radius-margin bound. It applies to the hard margin version of the machine, and by extension to the so-called 2-norm variant. This report introduces a quadratic loss M-SVM, the M-SVM², as a direct extension of the 2-norm SVM to the multi-class case. For this machine, a generalized radius-margin bound is then established.

Keywords: M-SVM, model selection, leave-one-out error, radius-margin bound.

## 1 Introduction

Using a support vector machine (SVM) [2, 4] requires setting two types of hyperparameters: the soft margin parameter and the parameters of the kernel. To perform this model selection task, several approaches are available (see for instance [9, 12]). The solution of choice consists in applying a cross-validation procedure. Among those procedures, the leave-one-out one appears especially attractive, since it is known to produce an estimator of the generalization error which is almost unbiased [11]. Its downside is that it is highly time consuming. This is the reason why, in recent years, a number of upper bounds on the leave-one-out error of pattern recognition SVMs have been proposed in the literature (see [3] for a survey). Among those bounds, the tightest one is the span bound [16]. However, the results of Chapelle and co-workers presented in [3] show that another bound, the radius-margin one [15], achieves equivalent performance for model selection while being far simpler to compute. This is the reason why it is currently the most popular bound. It applies to the hard margin machine and, by extension, to the 2-norm SVM (see for instance Chapter 7 in [13]).

In this report, a multi-class extension of the 2-norm SVM is introduced. This machine, named M-SVM², is a quadratic loss multi-class SVM, i.e., a multi-class SVM (M-SVM) in which the 1-norm on the vector of slack variables has been replaced with a quadratic form. The standard M-SVM on which it is based is the one of Lee, Lin and Wahba [10]. As for the 2-norm SVM, its training algorithm is equivalent to the training algorithm of a hard margin machine obtained by a simple change of kernel. We then establish a generalized radius-margin bound on the leave-one-out error of the hard margin version of the M-SVM of Lee, Lin and Wahba.

The organization of this paper is as follows. Section 2 presents the multi-class SVMs, by describing their common architecture and the general form taken by their different training algorithms. It focuses on the M-SVM of Lee, Lin and Wahba. In Section 3, the M-SVM² is introduced as a particular case of quadratic loss M-SVM. Its connection with the hard margin version of the M-SVM of Lee, Lin and Wahba is highlighted, as well as the fact that it constitutes a multi-class generalization of the 2-norm SVM. Section 4 is devoted to the formulation and proof of the corresponding multi-class radius-margin bound. Finally, we draw conclusions and outline our ongoing research in Section 5.

## 2 Multi-Class SVMs

### 2.1 Formalization of the learning problem

We are interested here in multi-class pattern recognition problems. Formally, we consider the case of $Q$-category classification problems with $Q \ge 3$, but our results extend to the case of dichotomies. Each object is represented by its description $x \in \mathcal{X}$ and the set of the categories can be identified with the set of the indexes of the categories: $[\![1, Q]\!]$. We assume that the link between objects and categories can be described by an unknown probability measure $P$ on the product space $\mathcal{X} \times [\![1, Q]\!]$. The aim of the learning problem consists in selecting, in a set $\mathcal{H}$ of functions $h = (h_k)_{1 \le k \le Q}$ from $\mathcal{X}$ into $\mathbb{R}^Q$, a function classifying data in an optimal way. The criterion of optimality must be specified. The function $h$ assigns $x$ to the category $k$ if and only if $h_k(x) > \max_{l \ne k} h_l(x)$. In case of ex aequo, $x$ is assigned to a dummy category denoted by $\ast$. Let $f$ be the decision function (from $\mathcal{X}$ into $[\![1, Q]\!] \cup \{\ast\}$) associated with $h$. With these definitions at hand, the objective function to be minimized is the probability of error $P(f(X) \ne Y)$. The optimization process, called training, is based on empirical data. More precisely, we assume that there exists a random pair $(X, Y)$, distributed according to $P$, and we are provided with an $m$-sample $s_m = ((X_i, Y_i))_{1 \le i \le m}$ of independent copies of $(X, Y)$.

There are two questions raised by such problems: how to properly choose the class of functions $\mathcal{H}$ and how to determine the best candidate in this class, using only $s_m$. This report addresses the first question, named model selection, in the particular case when the model considered is an M-SVM. The second question, named function selection, is addressed for instance in [8].

### 2.2 Architecture and training algorithms

M-SVMs, like all the SVMs, belong to the family of kernel machines. As such, they operate on a class of functions induced by a positive semidefinite (Mercer) kernel. This calls for the formulation of some definitions and propositions.

###### Definition 1 (Positive semidefinite kernel)

A positive semidefinite kernel $\kappa$ on the set $\mathcal{X}$ is a continuous and symmetric function $\kappa: \mathcal{X}^2 \to \mathbb{R}$ verifying:

$$\forall n \in \mathbb{N}^*, \; \forall (x_i)_{1 \le i \le n} \in \mathcal{X}^n, \; \forall (a_i)_{1 \le i \le n} \in \mathbb{R}^n, \quad \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \kappa(x_i, x_j) \ge 0.$$
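This condition can be checked numerically on any finite sample: it amounts to positive semidefiniteness of the induced Gram matrix. A minimal sketch in Python, where the Gaussian kernel and the random sample are illustrative choices of ours, not taken from the report:

```python
import numpy as np

# Illustrative kernel: the Gaussian (RBF) kernel, which is positive semidefinite.
def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 arbitrary points in R^3

# Gram matrix K with general term kappa(x_i, x_j).
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# Definition 1 on this sample: all eigenvalues of the symmetric Gram
# matrix must be nonnegative, up to numerical tolerance.
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() >= -1e-10
```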
###### Definition 2 (Reproducing kernel Hilbert space [1])

Let $(H, \langle \cdot, \cdot \rangle_H)$ be a Hilbert space of functions on $\mathcal{X}$ ($H \subset \mathbb{R}^{\mathcal{X}}$). A function $\kappa: \mathcal{X}^2 \to \mathbb{R}$ is a reproducing kernel of $H$ if and only if:

1. $\forall x \in \mathcal{X}, \; \kappa_x = \kappa(x, \cdot) \in H$;

2. $\forall x \in \mathcal{X}, \; \forall h \in H, \; \langle h, \kappa_x \rangle_H = h(x)$ (reproducing property).

A Hilbert space of functions which possesses a reproducing kernel is called a reproducing kernel Hilbert space (RKHS).

###### Proposition 1

Let $(H, \langle \cdot, \cdot \rangle_H)$ be a RKHS of functions on $\mathcal{X}$ with reproducing kernel $\kappa$. Then, there exists a map $\Phi$ from $\mathcal{X}$ into a Hilbert space $(E_\kappa, \langle \cdot, \cdot \rangle)$ such that:

$$\forall (x, x') \in \mathcal{X}^2, \quad \kappa(x, x') = \langle \Phi(x), \Phi(x') \rangle. \qquad (1)$$

$\Phi$ is called a feature map and $E_\kappa$ a feature space.

The connection between positive semidefinite kernels and RKHS is the following.

###### Proposition 2

If $\kappa$ is a positive semidefinite kernel on $\mathcal{X}$, then there exists a RKHS $H_\kappa$ of functions on $\mathcal{X}$ such that $\kappa$ is a reproducing kernel of $H_\kappa$.

Let $\kappa$ be a positive semidefinite kernel on $\mathcal{X}$ and let $H_\kappa$ be the RKHS spanned by $\kappa$. Let $\bar{\mathcal{H}} = H_\kappa^Q$ and let $\mathcal{H} = (H_\kappa + \{1\})^Q$. By construction, $\mathcal{H}$ is the class of vector-valued functions $h$ on $\mathcal{X}$ such that

$$h(\cdot) = \left( \sum_{i=1}^{m_k} \beta_{ik} \kappa(x_{ik}, \cdot) + b_k \right)_{1 \le k \le Q}$$

where the $x_{ik}$ are elements of $\mathcal{X}$, as well as the limits of these functions when the sets $\{x_{ik}\}$ become dense in $\mathcal{X}$ in the norm induced by the dot product (see for instance [17]). Due to Equation 1, $\mathcal{H}$ can be seen as a multivariate affine model on $E_\kappa$. Functions $h$ can then be rewritten as:

$$h(\cdot) = \left( \langle w_k, \Phi(\cdot) \rangle + b_k \right)_{1 \le k \le Q}$$

where the vectors $w_k$ are elements of $E_\kappa$. The functions $h$ are thus described by the pair $(w, b)$ with $w = (w_k)_{1 \le k \le Q} \in E_\kappa^Q$ and $b = (b_k)_{1 \le k \le Q} \in \mathbb{R}^Q$. As a consequence, $\bar{\mathcal{H}}$ can be seen as a multivariate linear model on $E_\kappa$, endowed with a norm given by:

$$\forall \bar{h} \in \bar{\mathcal{H}}, \quad \left\| \bar{h} \right\|_{\bar{\mathcal{H}}} = \sqrt{\sum_{k=1}^{Q} \|w_k\|^2} = \|w\|,$$

where $\bar{h}(\cdot) = (\langle w_k, \Phi(\cdot) \rangle)_{1 \le k \le Q}$. With these definitions and propositions at hand, a generic definition of the M-SVMs can be formulated as follows.

###### Definition 3 (M-SVM, Definition 42 in [8])

Let $\kappa$ be a positive semidefinite kernel on $\mathcal{X}$ and $\lambda \in \mathbb{R}_+^*$. A $Q$-category M-SVM is a large margin discriminant model obtained by minimizing over the hyperplane $\sum_{k=1}^{Q} h_k = 0$ of $\mathcal{H}$ a penalized risk of the form:

$$J_{\text{M-SVM}}(h) = \sum_{i=1}^{m} \ell_{\text{M-SVM}}(y_i, h(x_i)) + \lambda \left\| \bar{h} \right\|_{\bar{\mathcal{H}}}^2$$

where the data fit component involves a loss function $\ell_{\text{M-SVM}}$ which is convex.

Three main models of M-SVMs can be found in literature. The oldest one is the model of Weston and Watkins [19], which corresponds to the loss function given by:

$$\ell_{\text{WW}}(y, h(x)) = \sum_{k \ne y} \left( 1 - h_y(x) + h_k(x) \right)_+,$$

where the hinge loss is the function $t \mapsto (t)_+ = \max(0, t)$. The second one is due to Crammer and Singer [5] and corresponds to the loss function $\ell_{\text{CS}}$ given by:

$$\ell_{\text{CS}}(y, \bar{h}(x)) = \left( 1 - \bar{h}_y(x) + \max_{k \ne y} \bar{h}_k(x) \right)_+.$$

The most recent model is the one of Lee, Lin and Wahba [10], which corresponds to the loss function $\ell_{\text{LLW}}$ given by:

$$\ell_{\text{LLW}}(y, h(x)) = \sum_{k \ne y} \left( h_k(x) + \frac{1}{Q-1} \right)_+. \qquad (2)$$

Among the three models, the M-SVM of Lee, Lin and Wahba is the only one that asymptotically implements the Bayes decision rule: it is Fisher consistent [20, 14].
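The three loss functions above are straightforward to evaluate on a score vector $h(x) \in \mathbb{R}^Q$. A minimal sketch (the function names and the test scores are ours):

```python
import numpy as np

def hinge(t):
    """The hinge function t -> (t)_+."""
    return np.maximum(t, 0.0)

def loss_ww(y, h):
    """Weston and Watkins: sum of pairwise hinge terms."""
    return sum(hinge(1.0 - h[y] + h[k]) for k in range(len(h)) if k != y)

def loss_cs(y, h):
    """Crammer and Singer: hinge on the largest competing score."""
    competing = np.delete(h, y)
    return hinge(1.0 - h[y] + competing.max())

def loss_llw(y, h):
    """Lee, Lin and Wahba: penalizes wrong-category scores above -1/(Q-1)."""
    Q = len(h)
    return sum(hinge(h[k] + 1.0 / (Q - 1)) for k in range(Q) if k != y)

# Q = 3 categories, true category y = 0, comfortably separated scores.
h = np.array([0.9, -0.4, -0.5])
print(loss_ww(0, h), loss_cs(0, h), loss_llw(0, h))
```

Note that on this example the WW and CS losses vanish while the LLW loss does not: the latter penalizes any wrong-category score above $-1/(Q-1)$, regardless of the score of the correct category.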

### 2.3 The M-SVM of Lee, Lin and Wahba

The substitution in Definition 3 of $\ell_{\text{M-SVM}}$ with the expression of the loss function $\ell_{\text{LLW}}$ given by Equation 2 provides us with the expressions of the quadratic programming (QP) problems corresponding to the training algorithms of the hard margin and soft margin versions of the M-SVM of Lee, Lin and Wahba.

###### Problem 1 (Hard margin M-SVM)

$$\min_{w, b} J_{\text{HM}}(w, b)$$

$$\text{s.t.} \quad \begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \le -\frac{1}{Q-1}, & (1 \le i \le m), \; (1 \le k \ne y_i \le Q) \\ \sum_{k=1}^{Q} w_k = 0 \\ \sum_{k=1}^{Q} b_k = 0 \end{cases}$$

where

$$J_{\text{HM}}(w, b) = \frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2.$$
###### Problem 2 (Soft margin M-SVM)

$$\min_{w, b, \xi} J_{\text{SM}}(w, b, \xi)$$

$$\text{s.t.} \quad \begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \le -\frac{1}{Q-1} + \xi_{ik}, & (1 \le i \le m), \; (1 \le k \ne y_i \le Q) \\ \xi_{ik} \ge 0, & (1 \le i \le m), \; (1 \le k \ne y_i \le Q) \\ \sum_{k=1}^{Q} w_k = 0 \\ \sum_{k=1}^{Q} b_k = 0 \end{cases}$$

where

$$J_{\text{SM}}(w, b, \xi) = \frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + C \sum_{i=1}^{m} \sum_{k \ne y_i} \xi_{ik}.$$

In Problem 2, the $\xi_{ik}$ are slack variables introduced in order to relax the constraints of correct classification. The coefficient $C$, which characterizes the trade-off between prediction accuracy on the training set and smoothness of the solution, can be expressed in terms of the regularization coefficient $\lambda$. It is called the soft margin parameter. Instead of directly solving Problems 1 and 2, one usually solves their Wolfe dual [6]. We now derive the dual problem of Problem 1. Giving the details of the implementation of the Lagrangian duality will provide us with partial results which will prove useful in the sequel.

Let $\alpha = (\alpha_{ik})_{1 \le i \le m, \, 1 \le k \le Q}$ be the vector of Lagrange multipliers associated with the constraints of good classification. It is for convenience of notation that this vector is expressed with double subscripts and that the dummy variables $\alpha_{iy_i}$, all equal to $0$, are introduced. Let $\beta$ be the Lagrange multiplier associated with the constraint $\sum_{k=1}^{Q} b_k = 0$ and $\delta \in E_\kappa$ the Lagrange multiplier associated with the constraint $\sum_{k=1}^{Q} w_k = 0$. The Lagrangian function of Problem 1 is given by:

$$L(w, b, \alpha, \beta, \delta) = \frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 - \left\langle \delta, \sum_{k=1}^{Q} w_k \right\rangle - \beta \sum_{k=1}^{Q} b_k + \sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik} \left( \langle w_k, \Phi(x_i) \rangle + b_k + \frac{1}{Q-1} \right). \qquad (3)$$

Setting the gradient of the Lagrangian function with respect to $w_k$ equal to the null vector provides us with alternative expressions for the optimal value of the vector $\delta$:

$$\delta^* = w_k^* + \sum_{i=1}^{m} \alpha_{ik}^* \Phi(x_i), \quad (1 \le k \le Q). \qquad (4)$$

Since, by hypothesis, $\sum_{k=1}^{Q} w_k = 0$, summing over the index $k$ provides us with the expression of $\delta^*$ as a function of the dual variables only:

$$\delta^* = \frac{1}{Q} \sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik}^* \Phi(x_i). \qquad (5)$$

By substitution into (4), we get the expression of the vectors $w_k^*$ at the optimum:

$$w_k^* = \frac{1}{Q} \sum_{i=1}^{m} \sum_{l=1}^{Q} \alpha_{il}^* \Phi(x_i) - \sum_{i=1}^{m} \alpha_{ik}^* \Phi(x_i), \quad (1 \le k \le Q)$$

which can also be written as

$$w_k^* = \sum_{i=1}^{m} \sum_{l=1}^{Q} \alpha_{il}^* \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i), \quad (1 \le k \le Q) \qquad (6)$$

where $\delta_{k,l}$ is the Kronecker symbol.

Let us now set the gradient of (3) with respect to $b$ equal to the null vector. This yields:

$$\beta^* = \sum_{i=1}^{m} \alpha_{ik}^*, \quad (1 \le k \le Q)$$

and thus

$$\sum_{i=1}^{m} \sum_{l=1}^{Q} \alpha_{il}^* \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, \quad (1 \le k \le Q).$$

Given the constraint $\sum_{k=1}^{Q} b_k^* = 0$, this implies that:

$$\sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik}^* b_k^* = \beta^* \sum_{k=1}^{Q} b_k^* = 0. \qquad (7)$$

By application of (6),

$$\sum_{k=1}^{Q} \left\| w_k^* \right\|^2 = \sum_{k=1}^{Q} \left\langle \sum_{i=1}^{m} \sum_{l=1}^{Q} \alpha_{il}^* \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i), \sum_{j=1}^{m} \sum_{n=1}^{Q} \alpha_{jn}^* \left( \frac{1}{Q} - \delta_{k,n} \right) \Phi(x_j) \right\rangle$$

$$= \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{l=1}^{Q} \sum_{n=1}^{Q} \alpha_{il}^* \alpha_{jn}^* \langle \Phi(x_i), \Phi(x_j) \rangle \sum_{k=1}^{Q} \left( \frac{1}{Q} - \delta_{k,l} \right) \left( \frac{1}{Q} - \delta_{k,n} \right)$$

$$= \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{l=1}^{Q} \sum_{n=1}^{Q} \alpha_{il}^* \alpha_{jn}^* \left( \delta_{l,n} - \frac{1}{Q} \right) \kappa(x_i, x_j). \qquad (8)$$

Still by application of (6),

$$\sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik}^* \langle w_k^*, \Phi(x_i) \rangle = \sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik}^* \left\langle \sum_{j=1}^{m} \sum_{l=1}^{Q} \alpha_{jl}^* \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_j), \Phi(x_i) \right\rangle$$

$$= \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{Q} \sum_{l=1}^{Q} \alpha_{ik}^* \alpha_{jl}^* \left( \frac{1}{Q} - \delta_{k,l} \right) \kappa(x_i, x_j). \qquad (9)$$

Combining (8) and (9) gives:

$$\frac{1}{2} \sum_{k=1}^{Q} \left\| w_k^* \right\|^2 + \sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik}^* \langle w_k^*, \Phi(x_i) \rangle = -\frac{1}{2} \sum_{k=1}^{Q} \left\| w_k^* \right\|^2 = -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{Q} \sum_{l=1}^{Q} \alpha_{ik}^* \alpha_{jl}^* \left( \delta_{k,l} - \frac{1}{Q} \right) \kappa(x_i, x_j). \qquad (10)$$

In what follows, we use the notation $\mathbf{1}_n$ to designate the vector of $\mathbb{R}^n$ such that all its components are equal to $1$. Let $H$ be the matrix of $\mathcal{M}_{Qm,Qm}(\mathbb{R})$ of general term:

$$h_{ik,jl} = \left( \delta_{k,l} - \frac{1}{Q} \right) \kappa(x_i, x_j).$$

With these notations at hand, reporting (7) and (10) in (3) provides us with the algebraic expression of the Lagrangian function at the optimum:

$$L(\alpha^*) = -\frac{1}{2} \alpha^{*T} H \alpha^* + \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha^*.$$

This eventually provides us with the Wolfe dual formulation of Problem 1:

###### Problem 3 (Hard margin M-SVM, dual formulation)

$$\max_{\alpha} J_{\text{LLW,d}}(\alpha)$$

$$\text{s.t.} \quad \begin{cases} \alpha_{ik} \ge 0, & (1 \le i \le m), \; (1 \le k \ne y_i \le Q) \\ \sum_{i=1}^{m} \sum_{l=1}^{Q} \alpha_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, & (1 \le k \le Q) \end{cases}$$

where

$$J_{\text{LLW,d}}(\alpha) = -\frac{1}{2} \alpha^T H \alpha + \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha,$$

with the general term of the Hessian matrix $H$ being

$$h_{ik,jl} = \left( \delta_{k,l} - \frac{1}{Q} \right) \kappa(x_i, x_j).$$

Let the couple $(w^0, b^0)$ denote the optimal solution of Problem 1 and, equivalently, let $\alpha^0$ be the optimal solution of Problem 3. According to (6), the expression of $w_k^0$ is then:

$$w_k^0 = \sum_{i=1}^{m} \sum_{l=1}^{Q} \alpha_{il}^0 \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i).$$
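The Hessian of Problem 3 is easy to assemble in practice. With the ordering $\alpha = (\alpha_{11}, \ldots, \alpha_{1Q}, \alpha_{21}, \ldots, \alpha_{mQ})^T$, its general term $(\delta_{k,l} - 1/Q)\,\kappa(x_i, x_j)$ is the Kronecker product of the Gram matrix with the centering matrix $I_Q - (1/Q)\mathbf{1}\mathbf{1}^T$. A sketch, where the linear kernel and the sample are illustrative choices:

```python
import numpy as np

def msvm_hessian(K, Q):
    """Hessian H of the dual problem: general term (delta_kl - 1/Q) kappa(x_i, x_j),
    i.e. the Kronecker product K ⊗ J with J = I_Q - (1/Q) 1 1^T."""
    J = np.eye(Q) - np.full((Q, Q), 1.0 / Q)
    return np.kron(K, J)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
K = X @ X.T                       # linear-kernel Gram matrix (illustrative)
Q = 3
H = msvm_hessian(K, Q)

# H is symmetric positive semidefinite, as a dual Hessian must be:
# the Kronecker product of two PSD matrices is PSD.
assert np.allclose(H, H.T)
assert np.linalg.eigvalsh(H).min() >= -1e-10
```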

### 2.4 Geometrical margins

From a geometrical point of view, the algorithms described above tend to construct a set of hyperplanes that globally maximize the margins between the different categories. Although these margins are defined as in the bi-class case, their analytical expression is more complex.

###### Definition 4 (Geometrical margins, Definition 7 in [7])

Let us consider a $Q$-category M-SVM (a function $h$ of $\mathcal{H}$) classifying the examples of its training set without error. $\gamma_{kl}$, its margin between categories $k$ and $l$, is defined as the smallest distance of a point either in $k$ or $l$ to the hyperplane separating those categories. Let us denote

$$d_{\text{M-SVM}} = \min_{1 \le k < l \le Q} \min \left[ \min_{i: y_i = k} \left( h_k(x_i) - h_l(x_i) \right), \; \min_{j: y_j = l} \left( h_l(x_j) - h_k(x_j) \right) \right]$$

and for $1 \le k < l \le Q$, let $d_{\text{M-SVM},kl}$ be:

$$d_{\text{M-SVM},kl} = \frac{1}{d_{\text{M-SVM}}} \min \left[ \min_{i: y_i = k} \left( h_k(x_i) - h_l(x_i) - d_{\text{M-SVM}} \right), \; \min_{j: y_j = l} \left( h_l(x_j) - h_k(x_j) - d_{\text{M-SVM}} \right) \right].$$

Then we have:

$$\gamma_{kl} = d_{\text{M-SVM}} \, \frac{1 + d_{\text{M-SVM},kl}}{\|w_k - w_l\|}.$$

Given the constraints of Problem 1, the expression of $d_{\text{M-SVM}}$ corresponding to the M-SVM of Lee, Lin and Wahba is:

$$d_{\text{LLW}} = \frac{Q}{Q-1}.$$
###### Remark 1

The values of the parameters $d_{\text{M-SVM},kl}$ (or $d_{\text{LLW},kl}$ in the case of interest) are known as soon as the pair $(w, b)$ is known.

The connection between the geometrical margins and the penalizer of $J_{\text{M-SVM}}$ is given by the following equation:

$$\sum_{k<l} \left( \frac{1 + d_{\text{M-SVM},kl}}{\gamma_{kl}} \right)^2 = \frac{Q}{d_{\text{M-SVM}}^2} \, \|w\|^2, \qquad (11)$$

the proof of which can for instance be found in Chapter 2 of [7]. We now introduce a result needed in the proof of the master theorem of this report.
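The algebraic identity behind this connection rests on the sum-to-zero constraint: when $\sum_k w_k = 0$, one has $\sum_{k<l} \|w_k - w_l\|^2 = Q \sum_k \|w_k\|^2$. A quick numeric check of this identity (the dimensions are arbitrary):

```python
import numpy as np
from itertools import combinations

# If sum_k w_k = 0, then sum_{k<l} ||w_k - w_l||^2 = Q * sum_k ||w_k||^2.
rng = np.random.default_rng(2)
Q, d = 4, 6
W = rng.normal(size=(Q, d))       # rows are the vectors w_k
W -= W.mean(axis=0)               # enforce the sum-to-zero constraint

lhs = sum(np.sum((W[k] - W[l]) ** 2) for k, l in combinations(range(Q), 2))
rhs = Q * np.sum(W ** 2)
assert np.isclose(lhs, rhs)
```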

###### Proposition 3

For the hard margin M-SVM of Lee, Lin and Wahba, we have:

$$\frac{Q}{(Q-1)^2} \sum_{k<l} \gamma_{kl}^{-2} \le \left\| w^0 \right\|^2 = \alpha^{0T} H \alpha^0 = \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha^0.$$

Proof

• The inequality is a direct consequence of Definition 4 and Equation 11.

• The first equality is a direct consequence of Equation 10 and the definition of the matrix $H$.

• To obtain the second equality, note that one of the Kuhn-Tucker optimality conditions is:

$$\alpha_{ik}^0 \left( \langle w_k^0, \Phi(x_i) \rangle + b_k^0 + \frac{1}{Q-1} \right) = 0, \quad (1 \le i \le m), \; (1 \le k \ne y_i \le Q),$$

and thus:

$$\sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik}^0 \left( \langle w_k^0, \Phi(x_i) \rangle + b_k^0 + \frac{1}{Q-1} \right) = 0.$$

By application of (7), this simplifies into

$$\sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik}^0 \langle w_k^0, \Phi(x_i) \rangle + \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha^0 = 0.$$

Since

$$\sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik}^0 \langle w_k^0, \Phi(x_i) \rangle = -\alpha^{0T} H \alpha^0$$

is a direct consequence of (10), this concludes the proof.

## 3 The M-SVM²

### 3.1 Quadratic loss multi-class SVMs: motivation and principle

The M-SVMs presented in Section 2.2 share a common feature with the standard pattern recognition SVM: the contribution of the slack variables to their objective functions is linear. Let $\xi$ be the vector of these variables. In the cases of the M-SVMs of Weston and Watkins and of Lee, Lin and Wahba, we have $\xi = (\xi_{ik})_{1 \le i \le m, \, 1 \le k \le Q} \in \mathbb{R}^{Qm}$ with $\xi_{iy_i} = 0$, and in the case of the model of Crammer and Singer, it is simply $\xi = (\xi_i)_{1 \le i \le m} \in \mathbb{R}^m$. In both cases, the contribution to the objective function is $C \, \|\xi\|_1$.

In the bi-class case, there exists a variant of the standard SVM which is known as the 2-norm SVM since, for this machine, the empirical contribution to the objective function is $C \, \|\xi\|_2^2$. Its main advantage, underlined for instance in Chapter 7 of [13], is that its training algorithm can be expressed, after an appropriate change of kernel, as the training algorithm of a hard margin machine. As a consequence, its leave-one-out error can be upper bounded thanks to the radius-margin bound.

Unfortunately, a naive extension of the 2-norm SVM to the multi-class case, resulting from substituting $\|\xi\|_1$ with $\|\xi\|_2^2$ in the objective function of any of the three M-SVMs, does not preserve this property. Section 2.4.1.4 of [7] gives detailed explanations about that point. The strategy that we propose to exhibit interesting multi-class generalizations of the 2-norm SVM consists in studying the class of quadratic loss M-SVMs, i.e., the class of extensions of the M-SVMs such that the contribution of the slack variables is a quadratic form:

$$C \xi^T M \xi = C \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{Q} \sum_{l=1}^{Q} m_{ik,jl} \, \xi_{ik} \xi_{jl}$$

where $M = (m_{ik,jl})$ is a symmetric positive semidefinite matrix.

### 3.2 The M-SVM² as a multi-class generalization of the 2-norm SVM

In this section, we establish that the idea introduced above provides us with a solution to the problem of interest when the M-SVM used is the one of Lee, Lin and Wahba and the general term of the matrix $M$ is $m_{ik,jl} = \left( \delta_{k,l} - \frac{1}{Q} \right) \delta_{i,j}$. The corresponding machine, named M-SVM², generalizes the 2-norm SVM to an arbitrary (but finite) number of categories.

###### Problem 4 (M-SVM²)

$$\min_{w, b, \xi} J_{\text{M-SVM}^2}(w, b, \xi)$$

$$\text{s.t.} \quad \begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \le -\frac{1}{Q-1} + \xi_{ik}, & (1 \le i \le m), \; (1 \le k \ne y_i \le Q) \\ \sum_{k=1}^{Q} w_k = 0 \\ \sum_{k=1}^{Q} b_k = 0 \end{cases}$$

where

$$J_{\text{M-SVM}^2}(w, b, \xi) = \frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + C \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{Q} \sum_{l=1}^{Q} \left( \delta_{k,l} - \frac{1}{Q} \right) \delta_{i,j} \, \xi_{ik} \xi_{jl}.$$
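With this choice of general term, the matrix $M$ is block diagonal, each diagonal block being the centering matrix $J = I_Q - (1/Q)\mathbf{1}\mathbf{1}^T$, so the slack penalty $C \xi^T M \xi$ is indeed a positive semidefinite quadratic form. A sketch (the dimensions and the slack vector are illustrative):

```python
import numpy as np

def slack_matrix(m, Q):
    """Matrix M of the M-SVM^2 penalty: general term (delta_kl - 1/Q) delta_ij,
    i.e. m identical diagonal blocks J = I_Q - (1/Q) 1 1^T."""
    J = np.eye(Q) - np.full((Q, Q), 1.0 / Q)
    return np.kron(np.eye(m), J)

m, Q, C = 4, 3, 1.0
M = slack_matrix(m, Q)
xi = np.random.default_rng(3).uniform(size=m * Q)

# The penalty is nonnegative for any slack vector, since J is a projection.
penalty = C * xi @ M @ xi
assert penalty >= 0.0
assert np.linalg.eigvalsh(M).min() >= -1e-10
```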

Note that as in the bi-class case, it is useless to introduce nonnegativity constraints for the slack variables. The Lagrangian function associated with Problem 4 is thus

$$L(w, b, \xi, \alpha, \beta, \delta) = \frac{1}{2} \sum_{k=1}^{Q} \|w_k\|^2 + C \xi^T M \xi - \left\langle \delta, \sum_{k=1}^{Q} w_k \right\rangle - \beta \sum_{k=1}^{Q} b_k + \sum_{i=1}^{m} \sum_{k=1}^{Q} \alpha_{ik} \left( \langle w_k, \Phi(x_i) \rangle + b_k + \frac{1}{Q-1} - \xi_{ik} \right). \qquad (12)$$

Setting the gradient of $L$ with respect to $\xi$ equal to the null vector gives

$$2 C M \xi^* = \alpha^* \qquad (13)$$

which has for immediate consequence that

$$C \xi^{*T} M \xi^* - \alpha^{*T} \xi^* = -C \xi^{*T} M \xi^*. \qquad (14)$$

Using the same reasoning as the one used to derive the objective function of Problem 3, together with (14), at the optimum, (12) simplifies into:

$$L(\xi^*, \alpha^*) = -\frac{1}{2} \alpha^{*T} H \alpha^* - C \xi^{*T} M \xi^* + \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha^*. \qquad (15)$$

Besides, using (13),

$$\alpha_{in}^* \alpha_{ip}^* = 4 C^2 \sum_{k=1}^{Q} \left( \delta_{k,n} - \frac{1}{Q} \right) \xi_{ik}^* \sum_{l=1}^{Q} \left( \delta_{l,p} - \frac{1}{Q} \right) \xi_{il}^*$$

and thus

$$\alpha_{in}^* \alpha_{ip}^* = 4 C^2 \sum_{k=1}^{Q} \sum_{l=1}^{Q} \xi_{ik}^* \xi_{il}^* \left( \delta_{k,n} \delta_{l,p} - \left( \delta_{k,n} + \delta_{l,p} \right) \frac{1}{Q} + \frac{1}{Q^2} \right).$$

By a double summation over $n$ and $p$, we have:

$$\sum_{n=1}^{Q} \sum_{p=1}^{Q} \alpha_{in}^* \alpha_{ip}^* \left( \delta_{n,p} - \frac{1}{Q} \right) = 4 C^2 \sum_{k=1}^{Q} \sum_{l=1}^{Q} \xi_{ik}^* \xi_{il}^* \sum_{n=1}^{Q} \sum_{p=1}^{Q} \left( \delta_{k,n} \delta_{l,p} - \left( \delta_{k,n} + \delta_{l,p} \right) \frac{1}{Q} + \frac{1}{Q^2} \right) \left( \delta_{n,p} - \frac{1}{Q} \right).$$

Since

$$\sum_{n=1}^{Q} \sum_{p=1}^{Q} \left( \delta_{k,n} \delta_{l,p} - \left( \delta_{k,n} + \delta_{l,p} \right) \frac{1}{Q} + \frac{1}{Q^2} \right) \left( \delta_{n,p} - \frac{1}{Q} \right) = \delta_{k,l} - \frac{1}{Q},$$

this simplifies into

$$\sum_{n=1}^{Q} \sum_{p=1}^{Q} \alpha_{in}^* \alpha_{ip}^* \left( \delta_{n,p} - \frac{1}{Q} \right) = 4 C^2 \sum_{k=1}^{Q} \sum_{l=1}^{Q} \left( \delta_{k,l} - \frac{1}{Q} \right) \xi_{ik}^* \xi_{il}^*.$$

Finally, a summation over the index $i$ implies that

$$\alpha^{*T} M \alpha^* = 4 C^2 \xi^{*T} M \xi^*.$$

A substitution into (15) provides us with:

$$L(\alpha^*) = -\frac{1}{2} \alpha^{*T} \left( H + \frac{1}{2C} M \right) \alpha^* + \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha^*.$$
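The chain of simplifications above can also be seen at a glance: the diagonal blocks $J = I_Q - (1/Q)\mathbf{1}\mathbf{1}^T$ of $M$ are orthogonal projections ($J^2 = J$), so $M$ is symmetric idempotent and the stationarity condition (13) propagates through the quadratic form. A numeric check (dimensions are illustrative):

```python
import numpy as np

# M is block diagonal with idempotent blocks J, hence M @ M = M.
m, Q, C = 3, 4, 2.0
J = np.eye(Q) - np.full((Q, Q), 1.0 / Q)
M = np.kron(np.eye(m), J)
assert np.allclose(M @ M, M)

# Stationarity condition (13): alpha = 2 C M xi. Then
# alpha^T M alpha = 4 C^2 xi^T M^3 xi = 4 C^2 xi^T M xi.
xi = np.random.default_rng(4).normal(size=m * Q)
alpha = 2 * C * M @ xi
assert np.isclose(alpha @ M @ alpha, 4 * C**2 * xi @ M @ xi)
```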

As in the case of the hard margin version of the M-SVM of Lee, Lin and Wahba, setting the gradient of (12) with respect to $b$ equal to the null vector gives:

$$\sum_{i=1}^{m} \sum_{l=1}^{Q} \alpha_{il}^* \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, \quad (1 \le k \le Q).$$

Putting things together, we obtain the following expression for the dual problem of Problem 4:

###### Problem 5 (M-SVM², dual formulation)

$$\max_{\alpha} J_{\text{M-SVM}^2\text{,d}}(\alpha)$$

$$\text{s.t.} \quad \begin{cases} \alpha_{ik} \ge 0, & (1 \le i \le m), \; (1 \le k \ne y_i \le Q) \\ \sum_{i=1}^{m} \sum_{l=1}^{Q} \alpha_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, & (1 \le k \le Q) \end{cases}$$

where

$$J_{\text{M-SVM}^2\text{,d}}(\alpha) = -\frac{1}{2} \alpha^T \left( H + \frac{1}{2C} M \right) \alpha + \frac{1}{Q-1} \mathbf{1}_{Qm}^T \alpha.$$

Due to the definitions of the matrices $H$ and $M$, this is precisely Problem 3 with the kernel $\kappa$ replaced by a kernel $\kappa'$ such that:

$$\kappa'(x_i, x_j) = \kappa(x_i, x_j) + \frac{1}{2C} \delta_{i,j}, \quad (1 \le i, j \le m).$$

When $Q = 2$, the M-SVM of Lee, Lin and Wahba, like the two other ones, is equivalent to the standard bi-class SVM (see for instance [7]). Furthermore, in that case, the change of kernel above coincides with the one of the 2-norm SVM. The M-SVM² is thus equivalent to the 2-norm SVM.
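In practice, the change of kernel only affects the training Gram matrix: it adds $1/(2C)$ to its diagonal. A sketch (the sample and the value of $C$ are illustrative):

```python
import numpy as np

def modified_gram(K, C):
    """Gram matrix of kappa' on the training set:
    kappa'(x_i, x_j) = kappa(x_i, x_j) + delta_ij / (2 C)."""
    return K + (1.0 / (2.0 * C)) * np.eye(K.shape[0])

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 2))
K = X @ X.T                       # linear-kernel Gram matrix (illustrative)
C = 10.0
K_prime = modified_gram(K, C)

# The diagonal shift makes the Gram matrix strictly positive definite,
# so the modified problem is a genuine hard margin problem.
assert np.linalg.eigvalsh(K_prime).min() > 0
```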

## 4 Multi-Class Radius-Margin Bound on the Leave-One-Out Error of the M-SVM²

To begin with, we must recall Vapnik’s initial bi-class theorem (see Chapter 10 of [15]), which is based on an intermediate result of central importance known as the “key lemma”.

### 4.1 Bi-class radius-margin bound

###### Lemma 1 (Bi-class key lemma)

Let us consider a hard margin bi-class SVM on a domain $\mathcal{X}$. Suppose that it is trained on a set $s_m = ((x_i, y_i))_{1 \le i \le m}$ of couples of $\mathcal{X} \times \{-1, 1\}$ (the points of which it separates without error). Consider now the same machine, trained on $s_m \setminus \{(x_p, y_p)\}$. If it makes an error on $(x_p, y_p)$, then the inequality

$$\alpha_p^0 \ge \frac{1}{D_m^2}$$

holds, where $D_m$ is the diameter of the smallest sphere containing the images by the feature map of the support vectors of the initial machine.

###### Theorem 1 (Bi-class radius-margin bound)

Let $\gamma$ be the geometrical margin of the hard margin SVM defined in Lemma 1, when trained on $s_m$. Let also $L_m$ be the number of errors resulting from applying a leave-one-out cross-validation procedure to this machine. We have:

$$L_m \le \frac{D_m^2}{\gamma^2}.$$
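Both quantities entering the bound can be evaluated from the Gram matrix alone, since $\|\Phi(x_i) - \Phi(x_j)\|^2 = \kappa(x_i, x_i) + \kappa(x_j, x_j) - 2\kappa(x_i, x_j)$. A sketch using the maximum pairwise feature-space distance as a simple proxy for the diameter $D_m$ (the exact $D_m$ solves a small quadratic program, not shown here):

```python
import numpy as np

def pairwise_sq_dists(K):
    """Squared feature-space distances from a Gram matrix, via the kernel trick."""
    d = np.diag(K)
    return d[:, None] + d[None, :] - 2.0 * K

def radius_margin_bound(K, gamma):
    """D_m^2 / gamma^2, with the maximum pairwise squared distance
    used as a (lower) proxy for the squared diameter D_m^2."""
    D2 = pairwise_sq_dists(K).max()
    return D2 / gamma**2

rng = np.random.default_rng(6)
X = rng.normal(size=(8, 2))
K = X @ X.T                       # linear-kernel Gram matrix (illustrative)
print(radius_margin_bound(K, gamma=0.5))
```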

The multi-class radius-margin bound that we propose in this report is a direct generalization of the one proposed by Vapnik. The first step of the proof consists in establishing a “multi-class key lemma”. This is the subject of the following subsection.

### 4.2 Multi-class key lemma

###### Lemma 2 (Multi-class key lemma)

Let us consider a $Q$-category hard margin M-SVM of Lee, Lin and Wahba on a domain $\mathcal{X}$. Let $s_m = ((x_i, y_i))_{1 \le i \le m}$ be its training set. Consider now the same machine trained on $s_m \setminus \{(x_p, y_p)\}$. If it makes an error on $(x_p, y_p)$, then the inequality

holds, where $D_m$ is the diameter of the smallest sphere of the feature space containing the set $\{\Phi(x_i) : 1 \le i \le m\}$.

Proof  Let $(w^p, b^p)$ be the couple characterizing the optimal hyperplanes when the machine is trained on $s_m \setminus \{(x_p, y_p)\}$. Let

$$\alpha^p = \left( \alpha_{11}^p, \ldots, \alpha_{(p-1)Q}^p, 0, \ldots, 0, \alpha_{(p+1)1}^p, \ldots, \alpha_{mQ}^p \right)^T$$

be the corresponding vector of dual variables. $\alpha^p$ belongs to $\mathbb{R}^{Qm}$, with $\alpha_{pk}^p = 0$ for $1 \le k \le Q$. This representation is used to characterize directly the second M-SVM with respect to the first one. Indeed, $\alpha^p$ is an optimal solution of Problem 3 under the additional constraint $\alpha_{pk} = 0$, $(1 \le k \le Q)$. Let us define two more vectors in $\mathbb{R}^{Qm}$, $\lambda^p$ and $\mu^p = \alpha^0 - \lambda^p$. $\lambda^p$ satisfies additional properties so that the vector $\mu^p$ is a feasible solution of Problem 3 under the additional constraint that $\mu_{pk}^p = 0$, $(1 \le k \le Q)$, i.e., $\mu^p$ satisfies the same constraints as $\alpha^p$. We have

$$\forall i \ne p, \; \forall k \ne y_i, \quad \alpha_{ik}^0 - \lambda_{ik}^p \ge 0 \iff \lambda_{ik}^p \le \alpha_{ik}^0.$$

We deduce from the equality constraints of Problem 3 that:

$$\forall k, \quad \sum_{i=1}^{m} \sum_{l=1}^{Q} \left( \alpha_{il}^0 - \lambda_{il}^p \right) \left( \frac{1}{Q} - \delta_{k,l} \right) = 0 \iff \sum_{i=1}^{m} \sum_{l=1}^{Q} \lambda_{il}^p \left( \frac{1}{Q} - \delta_{k,l} \right) = 0.$$

To sum up, the vector $\lambda^p$ satisfies the following constraints:

$$\begin{cases} \forall k, \; \lambda_{pk}^p = \alpha_{pk}^0 \\ \forall i \ne p, \; \forall k, \; 0 \le \lambda_{ik}^p \le \alpha_{ik}^0 \\ \sum_{i=1}^{m} \sum_{l=1}^{Q} \lambda_{il}^p \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, \; (1 \le k \le Q) \end{cases}$$