Laplacian Support Vector MachinesTrained in the Primal

# Laplacian Support Vector Machines Trained in the Primal

\nameStefano Melacci \emailmela@dii.unisi.it
University of Siena
53100, Siena, ITALY \AND\nameMikhail Belkin \emailmbelkin@cse.ohio-state.edu
\addrDepartment of Computer Science and Engineering
Ohio State University
Columbus, OH 43210, USA
###### Abstract

In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi–supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. Whereas training a LapSVM in the dual requires two steps, using the primal form allows us to collapse training to a single step. Moreover, the computational complexity of the training algorithm is reduced from to using preconditioned conjugate gradient, where is the combined number of labeled and unlabeled examples. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large datasets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach.

Laplacian Support Vector Machines Trained in the Primal Stefano Melacci mela@dii.unisi.it
Department of Information Engineering
University of Siena
53100, Siena, ITALY
Mikhail Belkin mbelkin@cse.ohio-state.edu
Department of Computer Science and Engineering
Ohio State University
Columbus, OH 43210, USA

TECHNICAL REPORT

Keywords: Laplacian Support Vector Machines, Manifold Regularization, Semi–Supervised Learning, Classification, Optimization.

## Section 1 Introduction

In semi-supervised learning one estimates a target classification/regression function from a few labeled examples together with a large collection of unlabeled data. In the last few years there has been a growing interest in the semi–supervised learning in the scientific community. Many algorithms for exploiting unlabeled data in order to enhance the quality of classifiers have been recently proposed, see, e.g., (Chapelle et al., 2006) and (Zhu and Goldberg, 2009). The general principle underlying semi-supervised learning is that the marginal distribution, which can be estimated from data alone, may suggest a suitable way to adjust the target function. The two commons assumption on such distribution that, explicitly or implicitly, are made by many of semi–supervised learning algorithms are the cluster assumption (Chapelle et al., 2003) and the manifold assumption (Belkin et al., 2006). The cluster assumption states that two points are likely to have the same class label if they can be connected by a curve through a high density region. Consequently, the separation boundary between classes should lie in the lower density region of the space. For example, this intuition underlies the Transductive Support Vector Machines (Vapnik, 2000) and in its different implementations, such as TSVM in (Joachims, 1999) or SVM (Demiriz and Bennett, 2000; Chapelle et al., 2008). The manifold assumption states that the marginal probability distribution underlying the data is supported on or near a low–dimensional manifold, and that the target function should change smoothly along the tangent direction. Many graph based methods have been proposed in this direction, but the most of them only perform transductive inference (Joachims, 2003; Belkin and Niyogi, 2003; Zhu et al., 2003), that is classify the unlabeled data given in training. Laplacian Vector Machines (LapSVM) (Belkin et al., 2006) provide a natural out–of–sample extension, so that they can classify data that becomes available after the training process, without having to retrain the classifier or resort to various heuristics.

In this paper, we focus on the LapSVM algorithm, that has shown to achieve the state of the art performances in semi–supervised classification. The original approach used to train LapSVM in Belkin et al. (2006) is based on the dual formulation of the problem, in a traditional SVM–like fashion. This dual problem is defined only on a number of dual variables equal to , the number of labeled points, and the the relationship between the variables and the final coefficients is given by a linear system of equations and variables, where is the total number of training points, both labeled and unlabeled. The overall cost of this “two step” process is .

Motivated by the recent interest in solving the SVM problem in the primal (Chapelle, 2007; Joachims, 2006; Shalev-Shwartz et al., 2007), we present a way to solve the primal LapSVM problem that can significantly reduce training times and overcome some issues of the original training algorithm. Specifically, the contributions of this paper are the following:

1. We propose two methods for solving the LapSVM problem in the primal form (not limited to the linear case), following the ideas presented in (Chapelle, 2007) for SVMs. Our Matlab library can be downloaded from http://www.dii.unisi.it/ ̃ melacci/lapsvmp/. The solution can now be compactly computed in a “single step” on the whole variable set. We show how to solve the problem by Newton’s method, comparing it with the supervised case. From this comparison it turns out that the real advantages of the Newton’s method for the SVM problem are lost in LapSVM due to the intrisic norm regularizer, and the complexity of this solution is still , same as in the original dual formulation. On the other hand, preconditioned conjugate gradient can be directly applied. Preconditioning by the kernel matrix come at no additional cost, and convergence can be achieved with only a small number of iterations. Complexity can be further reduced if the kernel matrix is sparse, increasing the scalability of the algorithm.

2. An approximate solution of the dual form and the resulting approximation of the target optimal function are not directly related due to the change of variables while switching to the dual problem. Training LapSVMs in the primal overcomes this issue, and it allows us to directly compute approximate solutions by controlling the number conjugate gradient iterations.

3. An approximation of the target function with roughly the same classification accuracy as the optimal one can be achieved with a small number of iterations due to the effects of the intrinsic norm regularizer of LapSVMs on the training process. We investigate those effects, showing that they make common stopping conditions for iterative gradient based algorithms hard to tune, often leading to either a premature stopping of the algorithm or to the execution of a large amount of iterations without improvements to the classification accuracy. We suggest to use a criterion built upon the output of the classifier on the available training data for terminating the iteration of the algorithm. Specifically, the stability of the prediction on the unlabeled data, or the classification accuracy on validation data (if available) can be exploited. A number of experiments on several datasets support these types of criteria, showing that accuracy similar to that of the optimal solution can be obtained in with significatly reduced training time.

4. The primal solution of the LapSVM problem is based on an hinge loss, that establishes a direct connection to the Laplacian Regularized Least Square Classifier (LapRLSC) (Belkin et al., 2006). We discuss the similarities between primal LapSVM and LapRLSC and we show that the proposed fast solution can be trivially applied also to LapRLSC.

The rest of the paper is organized as follows. In Section 2 the basic principles behind manifold regularization are resumed. Section 2.1 describes the LapSVM algorithm in its original formulation whereas Section 3 discusses the proposed solutions of the primal form and their details. The quality of an approximate solution and the data based early stopping criterion are the key contents of Section 4. In Section 5 a parallel with the primal solution of LapSVM and the one of LapRLSC is drawn, describing some possible future work. An extensive experimental analysis is presented in Section 6, and, finally, Section 7 concludes the paper.

## Section 2 Manifold Regularization

First, we introduce some notation that will be used in this Section and in the rest of the paper. We take to be the number of dimensional training examples , collected in . Examples are ordered so that the first ones are labeled, with label , and the remaining points are unlabeled. We put , where is the labeled data set and is the unlabeled data set. Labeled examples are generated accordingly to the distribution on , whereas unlabeled examples are drawn according to the marginal distribution of . Labels are obtained from the conditional probability distribution . is the graph Laplacian associated to , given by , where is the adjacency matrix of the data graph (the entry in position is indicated with ) and is the diagonal matrix with the degree of each node (i.e. the element from is ). Laplacian can be expressed in the normalized form, , and iterated to a degree greater that one. By we denote the Gram matrix associated to the points of and the –th entry of such matrix is the evaluation of the kernel function , . The unknown target function that the learning algorithm must estimate is indicated with , where is the vector of the values of on training data, . In a classification problem, the decision function that discriminates between classes is indicated with , where we overloaded the use of to denote such function.

Manifold regularization approach (Belkin et al., 2006) exploits the geometry of the marginal distribution . The support of the probability distribution of data is assumed to have the geometric structure of a Riemannian manifold . The labels of two points that are close in the intrinsic geometry of (i.e. with respect to geodesic distances on ) should be the same or similar in sense that the conditional probability distribution should change little between two such points. This constraint is enforced in the learning process by an intrinsic regularizer that is empirically estimated from the point cloud of labeled and unlabeled data using the graph Laplacian associated to them, since is truly unknown. In particular, choosing exponential weights for the adjacency matrix leads to convergence of the graph Laplacian to the Laplace–Beltrami operator on the manifold (Belkin and Niyogi, 2008). As a result, we have

 ∥f∥2I=n∑i=1n∑j=iwij(f(\boldmathxi)−f(\boldmathxj))2=\boldmathfTL\boldmathf. (1)

Consider that, in general, several natural choices of exist (Belkin et al., 2006).

In the established regularization framework for function learning, given a kernel function , its associated Reproducing Kernel Hilber Space (RKHS) of functions with corresponding norm , we estimate the target function by minimizing

 f∗=\operatornamewithlimitsargminf∈Hkl∑i=1V(\boldmathxi,yi,f)+γA∥f∥2A+γI∥f∥2I (2)

where is some loss function and is the weight of the norm of the function in the RKHS (or ambient norm), that enforces a smoothness condition on the possible solutions, and is the weight of the norm of the function in the low dimensional manifold (or intrinsic norm), that enforces smoothness along the sampled . For simplicity, we removed every normalization factor of the weights of each term in the summation. The ambient regularizer makes the problem well–posed, and its presence can be really helpful from a practical point of view when the manifold assumption holds at a lesser degree.

It has been shown in Belkin et al. (2006) that admits an expansion in terms of the points of ,

 f∗(\boldmathx)=n∑i=1α∗ik(\boldmathxi,\boldmathx). (3)

The decision function that discriminates between class and is . Figure 1 shows the effect of the intrinsic regularizer on the “clock” toy dataset. The supervised approach defines the classification hyperplane just by considering the two labeled examples, and it does not benefit from unlabeled data (Figure 1(b)). With manifold regularization, the classification appears more natural with respect to the geometry of the marginal distribution (Figure 1(c)).

The intrinsic norm of Eq. 1 actually performs a transduction along the manifold that enforces the values of in nearby points with respect to geodesic distances on to be the “same”. From a merely practical point of view, the intrinsic regularizer can be excessively strict in some situations. Since the decision function relies only on the sign of the target function , if has the same sign on nearby points along then the graph transduction is actually complete. Requiring that assumes exactly the same value on a pair of nearby points could be considered as over constraining the problem.

This intuition is closely related to the ideas explored in Sindhwani (2007); Sindhwani and Rosenberg (2008); Abernethy et al. (2008). In particular, in some restricted function spaces the intrinsic regularizer could degenerate to the ambient one as it is not able to model some underlying geometries of the given data. The Manifold Co-Regularization (MCR) framework (Sindhwani and Rosenberg, 2008) has been proposed to overcome such issue using multi–view learning. It has been shown that MCR corresponds to adding some extra slack variables in the objective function of Eq. 2 to better fit the intrinsic regularizer. The slack variables of MCR could be seen as a way to relax the regularizer. Similarly, Abernethy et al. (2008) uses a slack based formulation to improve the flexibility of the graph regularizer of their spam detector. This problem has been addressed also by Tsang and Kwok (2007), where the intrinsic regularizer is an –insensitive loss. We will use these considerations in Section 4 to early stop the training algorithm.

### 2.1 Laplacian Support Vector Machines

LapSVMs follow the principles behind manifold regularization (Eq. 2), where the loss function is the linear hinge loss (Vapnik, 2000), or loss. The interesting property of such function is that well classified labeled examples are not penalized by , independently by the value of .

In order to train a LapSVM classifier, the following problem must be solved

 minf∈Hkl∑i=1max(1−yif(\boldmathxi),0)+γA∥f∥2A+γI∥f∥2I. (4)

The function admits the expansion of Eq. 3, where an unregularized bias term can be added as in many SVM formulations.

The solution of LapSVM problem proposed by Belkin et al. (2006) is based on the dual form. By introducing the slack variables , the unconstrained primal problem can be written as a constrained one:

 \omit\span\omitmin\boldmathα∈IRn,\boldmathξ∈IRl∑li=1ξi+γA\boldmathαTK\boldmathα+γI\boldmathαTKLK\boldmathα\operatornamewithlimitssubject to: yi(∑nj=1αik(% \boldmathxi,\boldmathxj)+b)≥1−ξi,   i=1,…,lξi≥0,   i=1,…,l

After the introduction of two sets of multipliers , , the Lagrangian associated to the problem is:

 Lg(\boldmathα,\boldmathξ,b,% \boldmathβ,\boldmathς) = l∑i=1ξi+12\boldmathαT(2γAK+2γIKLK)\boldmathα− −l∑i=1βi(yi(n∑j=1αik(% \boldmathxi,\boldmathxj)+b)−1+ξi)−l∑i=1ςiξi

In order to recover the dual representation we need to set:

 ∂Lg∂b=0 ⟹ l∑i=1βiyi=0 ∂Lg∂ξi=0 ⟹ 1−βi−ςi=0  ⟹  0≤βi≤1

where the bounds on consider that , since they are Lagrange multipliers. Using the above identities, we can rewrite the Lagrangian as a function of and only. Assuming (as stated in Section 2) that the points in are ordered such that the first are labeled and the remaining are unlabeled, we define with the matrix where is the identity matrix and is a rectangular matrix with all zeros. Moreover, is a diagonal matrix composed by the labels . The Lagrangian becomes

 Lg(\boldmathα,\boldmathβ) = 12\boldmathαT(2γAK+2γIKLK)\boldmathα−l∑i=1βi(yi(n∑j=1αik(\boldmathxi,\boldmathxj)+b)−1)= = 12\boldmathαT(2γAK+2γIKLK)\boldmathα−\boldmathαTKJTLY\boldmathβ+l∑i=1βi.

Setting to zero the derivative with respect to establishes a direct relationships between the coefficients and the ones:

 ∂Lg∂\boldmathα=0  ⟹  (2γAK+2γIKLK)\boldmathα−KJTLY\boldmathβ=0
 ⟹  \boldmathα=(2γAI+2γIKL)−1JTLY\boldmathβ (5)

After substituting back in the Lagrangian expression, we get the dual problem whose solution leads to the optimal :

 \omit\span\omitmax\boldmathβ∈IRl∑li=1βi−12\boldmathβTQ% \boldmathβ\operatornamewithlimitssubject to: ∑li=1βiyi=00≤βi≤1,   i=1,…,l

where

 Q=YJTLK(2γAI+2γIKL)−1JTLY. (6)

Training the LapSVM classifier requires to optimize this variable problem, for example using a standard quadratic SVM solver, and then to solve the linear system of equations and variables of Eq. 5 in order to get the coefficients that define the target function .

The overall complexity of this “two step” solution is , due to the matrix inversion of Eq. 5 (and 6). Even if the coefficients are sparse, since they come from a SVM–like dual problem, the expansion of will generally involves all coefficients .

## Section 3 Training in the Primal

In this Section we analyze the optimization of the primal form of the non linear LapSVM problem, following the growing interest in training SVMs in the primal of the last few years (Joachims, 2006; Chapelle, 2007; Shalev-Shwartz et al., 2007). Primal optimization of a SVM has strong similarities with the dual strategy (Chapelle, 2007), and its implementation does not require any particularly complex optimization libraries. The focus of researchers has been mainly on the solution of the linear SVM primal problem, showing how it can be solved fast and efficiently (Joachims, 2006; Shalev-Shwartz et al., 2007). Most of the existing results can be directly extended to the non linear case by reparametrizing the linear output function with and introducing the Gram matrix . However this may result in a loss of efficiency. In Chapelle (2007); Keerthi et al. (2006) the authors investigated efficient solutions for the non linear SVM case.

Primal and dual optimization are two ways different of solving the same problem, neither of which can in general be considered a “better” approach. Therefore why should a solution of the primal problem be useful in the case of LapSVM? There are three primary reasons why such a solution may be preferable. First, it allows us to efficiently solve a single problem without the need of a two step solution. Second, it allows us to very quickly compute good approximate solutions, while the exact relation between approximate solutions of the dual and original problems may be involved. Third, since it allows us to directly “manipulate” the coefficients of without passing through the ones, greedy techniques for incremental building of the LapSVM classifier are easier to manage (Sindhwani, 2007). We believe that studying the primal LapSVM problem is the basis for future investigations and improvements of this classifier.

We rewrite the primal LapSVM problem of Eq. 4 by considering the representation of of Eq. 3, the intrinsic regularized of Eq. 1, and by indicating with the -th column of the matrix

 min\boldmathα∈IRn,b∈IRl∑i=1V(xi,yi,\boldmathkTi\boldmathα+b)+γA\boldmathαTK\boldmathα+γI\boldmathαTKLK\boldmathα.

Note that, for completeness, we included the bias in the expansion of . Such bias does not affect the intrinsic norm that is actually a sum of squared differences of evaluated on pair of points111If the Laplacian is normalized then the expression of the intrinsic norm changes. This must be taken into account when computing the bias.. We use a squared hinge loss, or loss, for the labeled examples, following Chapelle (2007) (see Figure 2). loss makes the LapSVM problem continuous and differentiable in and so in . The optimization problem after adding the scaling constant becomes

 min\boldmathα∈IRn,b∈IR12(l∑i=1max(1−yi(\boldmathkTi\boldmathα+b),0)2+γA\boldmathαTK\boldmathα+γI\boldmathαTKLK\boldmathα). (7)

We solved such convex problem by Newton’s method and by preconditioned conjugate gradient, comparing their complexities and the complexity of the original LapSVM solution, and showing a parallel with the SVM case. The two solution strategies are analyzed in the following Subsections, while a large set of experimental results are collected in Section 6.

### 3.1 Newton’s Method

The problem of Eq. 7 is piecewise quadratic and the Newton’s method appears a natural choice for an efficient minimization, since it builds a quadratic approximation of the function. After indicating with the vector , each Newton’s step consists of the following update

 \boldmathzt=\boldmathzt−1−sH−1∇ (8)

where is the iteration number, is the step size, and and are the gradient and the Hessian of Eq. 7 with respect to . We will use the symbols and to indicate the gradient with respect to and to .

Before continuing, we introduce the further concept of error vectors (Chapelle, 2007). The set of error vectors is the subset of with the points that generate a hinge loss value greater than zero. The classifier does not penalize all the remaining labeled points, since the function on that points produces outputs with the same sign of the corresponding label and with absolute value greater then or equal to it. In the classic SVM framework, error vectors correspond to support vectors at the optimal solution. In the case of LapSVM, all points are support vectors at optimum in the sense that they all generally contribute to the expansion of .

We have

 ∇=[∇b∇\boldmathα]=(∑li=1yi(yi(% \boldmathki\boldmathα+b)−1)∑li=1\boldmathkiyi(yi(\boldmathki% \boldmathα+b)−1)+γAK\boldmathα+γIKLK\boldmathα)==(\boldmath1TIE(K% \boldmathα+\boldmath1b)−\boldmath1IE\boldmathyKIE(K\boldmathα+\boldmath1b)−KIE\boldmathy+γAK\boldmathα+γIKLK% \boldmathα) (9)

where is the vector on elements equal to and is the vector that collects the labels of the labeled training points and a set of zeros. The matrix is a diagonal matrix where the only elements different from 0 (and equal to 1) along the main diagonal are in positions corresponding to points of that belong to at the current iteration.

The Hessian is

 H=⎛⎜⎝∇2b∇b(∇\boldmathα)∇\boldmathα(∇b)∇2\boldmathα⎞⎟⎠=(\boldmath1TIE\boldmath1\boldmath1TIEKKIE\boldmath1KIEK+γAK+γIKLK)==(−γA\boldmath1T0K)(0\boldmath1TIE\boldmath1IEK+γAI+γILK)

Note that the criterion function of Eq. 7 is not twice differentiable everywhere, so that is the generalized Hessian where the subdifferential in the breakpoint of the hinge function is set to 0. This leaves intact the least square nature of the problem, as in the Modified Newton’s method proposed by Keerthi and DeCoste (2006) for linear SVMs. In other words, the contribute to the Hessian of the hinge loss is the same as the one of a squared loss applied to error vectors only.

Combining the last two expressions we can write as

 ∇=H\boldmathz−(\boldmath1TK)IE\boldmathy. (10)

From Eq. 10, the Newton’s update of Eq. 8 becomes

 \boldmathzt=\boldmathzt−1−s%\boldmath$z$t−1+sH−1(\boldmath1TK)IE\boldmathy==(1−s)\boldmathzt−1+s(0\boldmath1TIE\boldmath1IEK+γAI+γILK)−1(−γA\boldmath1% T0K)−1(\boldmath1TK)IE\boldmathy==(1−s)\boldmathzt−1+s(0\boldmath1TIE\boldmath1IEK+γAI+γILK)−1(0IE\boldmathy). (11)

Looking at the update rule of Eq. 11 the analogies and differences with the solution of the linear system of Eq. 5 can be clearly appreciated. In particular, Eq. 5 relates the dual variables with the ones using the information on the ambient and intrinsic regularizers. The contribute of the labeled data has already been collected in the variables, by solving the dual problem. Differently, in the update rule of of Eq. 11 the information of the loss is represented by the term.

The step size must be computed by solving the one–dimensional minimization of Eq. 7 restricted on the ray from to , with exact line search or backtracking (Boyd and Vandenberghe, 2004). Convergence is declared when the set of error vectors does not change between two consecutive iterations of the algorithm. We can see that when , Eq. 11 shows that the vector of the previous iteration is not explicitly included in the update rule of . The only variable element that defines the new is , i.e. the set of error vectors . Exactly like in the case of primal SVMs (Chapelle, 2007), in our experiments setting did not result in any convergence problems.

#### 3.1.1 Complexity Analysis

Updating the coefficients with the Newton’s method costs , due to the matrix inversion in the update rule. Convergence is usually achieved in a tiny number of iterations, no more than 5 in our experiments (see Section 6). In order to reduce the cost of each iteration, a Cholesky factorization of the Hessian can be computed before performing the first matrix inversion, and it can be updated using a rank–1 scheme during the following iterations, with cost for each update (Seeger, 2008). On the other hand, this does not allow us to simplify in Eq. 11, otherwise the resulting matrix to be inverted will not be symmetric. Since a lot of time is wasted in the product by (that is usually dense), using the update of Cholesky factorization may not necessarily lead to a reduction of the overall training time.

Solving the primal problem using the Newton’s method has the same complexity of the original LapSVM solution based on the dual problem discussed in Section 7. The only benefit of solving the primal problem with Netwon’s method relies on the compact and simple formulation that does not requires the “two step” approach and a quadratic SVM solver as in the original dual formulation.

It is interesting to compare the training of SVMs in the primal with the one of LapSVMs for a better insight in the Newton’s method based solution. SVMs can benefit from the inversion of only a portion of the whole Hessian matrix, that reduces the complexity of each iteration to . Exploiting this useful aspect, the training algorithm can be run incrementally, reducing the complexity of the whole training process. In detail, an initial run on small portion of the available data is used to compute an approximate solution. Then the remaining training points, or some of them, are added. Due to the hinge loss and the currently estimated separating hyperplane, many of them will probably not belong to so that its maximum cardinality during the whole training process will reasonably be smaller than . Moreover, if we fix the step size the components of that are not associated to an error vector will become zeros after the update, so that the Newton’s method encourages sparser solutions.

In the case of LapSVM those benefits are lost due to the presence of the intrinsic norm . As a matter of fact and independently by the set , the constraints make the Hessian a full matrix, avoiding the described useful block inversion of SVMs. If the classifier is build incrementally, the addiction of a new non–error vector point makes the current solution no more optimal. Following the considerations of Section 2 on the norm, this suggests that a different regularizer may help the LapSVM solution with Newton’s method to gain the benefits of the SVM one. Some steps in this direction has been moved by Tsang and Kwok (2007), and we will investigate a similar approach, but based on the primal problem, in future work.

Finally, we are assuming that and the matrix to invert on Eq. 11 are non singular, otherwise the final expansion of will not be unique, even if the optimal value of the criterion function of Eq. 7 will be.

Instead of performing a costly Newton’s step, the solution of the system can be computed by conjugate gradient descent. In particular if we look at Eq. 9, we can write the system as as ,

 H\boldmathz=\boldmathc⟹(\boldmath1TIE\boldmath1\boldmath1TIEKKIE\boldmath1KIEK+γAK+γIKLK)\boldmathz=(% \boldmath1TIE\boldmathyKIE\boldmathy). (12)

The convergence rate of conjugate gradient is related to the condition number of (Shewchuk, 1994). In the most general case, the presence of the terms and leads to a not so well conditioned system and to a slow convergence rate. Fortunately the general fix investigated by Chapelle (2007) can be applied also in the case of LapSVMs, due to the quadratic form of the intrinsic regularizer. Eq. 12 can be factorized as

 (1\boldmath0T\boldmath0K)(% \boldmath1TIE\boldmath1\boldmath1TIEKIE\boldmath1IEK+γAI+γILK)\boldmathz=(1% \boldmath0T\boldmath0K)(\boldmath% 1TIE\boldmathyIE\boldmathy).

For instance, we can precondition the system of Eq. 12 with the symmetic matrix

 P=(1\boldmath0T\boldmath0K)

so that the condition number of the original system is sensibly decreased. In the preconditioned gradient the two previously described terms are reduced to and . Moreover, preconditioning is generally useful when such product can be efficiently computed and in our problem it comes at no additional computational cost. As in the Newton’s method, we are assuming that is non singular, otherwise a small ridge can be added to fix it.

Classic rules for the update of the conjugate direction at each step are resumed by Shewchuk (1994). After some iterations the conjugacy of the descent directions tends to get lost due to roundoff floating point error, so a restart of the preconditioned conjugate gradient algorithm is required. The Fletcher–Reeves (FR) update is commonly used in linear optimization. Due to the piecewise nature of the problem, defined by the matrix, we exploited the Pollak–Ribier (PR) formula, where restart can be automatically performed when the update term becomes negative (Shewchuk, 1994)222Note that in the linear case FR and PR are equivalent.. We experimentally evaluated that for the LapSVM problem such formula is generally the best choice, both for convergence speed and numerical stability. The iterative solution of LapSVM problem using preconditioned conjugate gradient (PCG) is reported in Algorithm 1. The first iteration is actually a steepest descent one, and so it is after each restart of PCG, i.e. when becomes zero in Algorithm 1.

Convergence is usually declared when the norm of the preconditioned gradient falls below a given threshold (Chapelle, 2007), or when the current preconditioned gradient is roughly orthogonal with the real gradient (Shewchuk, 1994). We will investigate these conditions in Section 4.

#### 3.2.1 Line Search

The optimal step length on the current direction of the PCG algorithm must be computed by backtracking or exact line search. At a generic iteration we have to solve

 s∗=\operatornamewithlimitsargmins≥0obj(\boldmathzt−1+s\boldmathdt−1) (13)

where is the objective function of Eq. 7.

The accuracy of the line search is crucial for the performance of PCG. When minimizing a quadratic form that leads to a linear expression of the gradient, line search can be computed in closed form. In our case, we have to deal with the variations of the set (and of ) for different values of , so that a closed form solution cannot be derived, and we have to compute the optimal in an iterative way.

Due to the quadratic nature of Eq. 13, the 1–dimensional Newton’s method can be directly used, but the average number of line search iterations per PCG step can be very large, even if the cost of each of them is negligible with respect to the of a PCG iteration. We can efficiently solve the line search problem analytically, as suggested by Keerthi and DeCoste (2006) for SVMs.

In order to simplify the notation, we discard the iteration index in the following description. Given the PCG direction , we compute for each point , being it an error vector or not, the step length for which its state switches. The state of a given error vector switches when it leaves the set, whether the state of a point initially not in switches when it becomes an error vector. We refer to the set of the former points with while the latter is , with . The derivative of Eq. 13, , is piecewise linear, and are the break points of such function.

Let us consider, for simplicity, that are in a non decreasing order, discarding the negative ones. Starting from , they define a set of intervals where is linear and the set does not change. We indicate with the linear portion of in the –th interval. Starting with , if the value for which crosses zero is within such interval, then it is the optimal step size , otherwise the following interval must be checked. The convergence of the process is guaranteed by the convexity of the function .

The zero crossing of is given by , where the two points and determine the line . We indicate with the function whose coefficients are in , i.e. , and we have

 ψj(0)=∑\boldmathxi∈Ej(f(\boldmathxi)−yi)fd(\boldmathxi)+γA\boldmathαTK\boldmathdα+γI% \boldmathαTKLK\boldmathdαψj(1)=∑\boldmathxi∈Ej(f(% \boldmathxi)+fd(\boldmathxi)−yi)fd(\boldmathxi)+γA(\boldmathα+\boldmathdα)TK\boldmathdα+γI(\boldmathα+% \boldmathdα)TKLK\boldmathdα

where is the set of error vectors for the –th interval.

Given and , their successive values for increasing can be easily computed considering that only one point (that we indicate with ) switches status moving from an interval to the following one. From this consideration we derived the following update rules

 ψj+1(0)=ψj(0)+νj(f(\boldmathxj)−yj)fd(\boldmathxj)ψj+1(1)=ψj(1)+νj(f(\boldmathxj)+fd(% \boldmathxj)−yi)fd(\boldmathxj)

where is if and it is if .

#### 3.2.2 Complexity Analysis

Each PCG iteration requires to compute the product, leading to a complexity of to update the coefficients. The term can then be computed efficiently from , since the matrix is generally sparse. Note that, differently from the Newton’s method and from the original dual solution of the LapSVM problem, we never explicitly compute the product, whereas we always compute matrix by vector products. Even if is sparse, when the number of training point increases or is iterated many times, a large amount of time may be wasted in such matrix by matrix product, as we will show in Section 6. Moreover, if the kernel matrix is sparse, the complexity drops to , where is the maximum number of non null elements between and .

Convergence of the conjugate gradient algorithm is theoretically declared in steps, but a solution very close to the optimal one can be computed with far less iterations. The convergence speed is related to the condition number of the Hessian (Shewchuk, 1994), that it is composed by a sum of three contributes (Eq. 12). As a consequence, their condition numbers and weighting coefficients (, ) have a direct influence in the convergence speed, and in particular the condition number of the matrix. For example, using a bandwidth of a Gaussian kernel that lead to a matrix close to the identity allows the algorithm to converge very quickly, but the accuracy of the classifier may not be sufficient.

Finally, PCG can be efficiently seeded with an initial rough estimate of the solution. This can be crucial for an efficient incremental building of the classifier with reduced complexity, following the one proposed for SVMs by Keerthi et al. (2006).

## Section 4 Approximating the Optimal Solution

In order to reduce the training times, we want the PCG to converge as fast as possible to a good approximation of the optimal solution. By appropriately selecting the goal condition of Algorithm 1, we can discard iterations that may not lead to significant improvement in the classifier quality.

The common goal conditions for the PCG algorithm and, more generally, for gradient based iterative algorithms, rely on the norm of the gradient (Boyd and Vandenberghe, 2004), of the preconditioned gradient (Chapelle, 2007), on the mixed product (Shewchuk, 1994). These values are usually normalized by the first estimate of each of them. The value of the objective function or its relative decrement between two consecutive iterations can also be checked, requiring some additional computations since the PCG algorithm never explicitly computes it. When one of such “stopping” values falls below the chosen threshold associated to it, the algorithm terminates333Thresholds associated to different conditions are obviously different, but, for simplicity in the description, we will refer to a generic threshold .. Moreover, a maximum number of iterations is generally specified. Tuning these parameters is crucial both for the time spent running the algorithm and the quality of the resulting solution.

It is really hard to find a trade–off between good approximation and low number of iterations, since and are strictly problem dependent. As an example, consider that the surface of , the objective function of Eq. 7, varies among different choices of its parameters. Increasing or decreasing the values of and can lead to a less flat or a more flat region around the optimal point. Fixing in advance the values of and may cause an early stop too far from the optimal solution, or it may result in the execution of a large number of iterations without a significant improvement on the classification accuracy.

The latter situation can be particularly frequent for LapSVMs. As described in Section 2 the choice of the intrinsic norm introduces the soft constraint for nearby points , along the underlying manifold. This allows the algorithm to perform a graph transduction and diffuse the labels from points in to the unlabeled data .

When the diffusion is somewhat complete and the classification hyperplane has assumed a quite stable shape around the available training data, similar to the optimal one, the intrinsic norm will keep contributing to the gradient until a balance with respect to the ambient norm (and to the loss on error vectors) is found. Due to the strictness of this constraint, it will still require some iterations (sometimes many) to achieve the optimal solution with , even if the decision function will remain substantially the same. The described common goal conditions do not “directly” take into account the decision of the classifier, so that they do not appear appropriate to early stop the PCG algorithm for LapSVMs.

We investigate our intuition on the “two moons” dataset of Figure 3(a), where we compare the decision boundary after each PCG iteration (Figure 3(b)-(e)) with the optimal solution (computed by Newton’s method, Figure 3(f)). Starting with , the first iteration exploits only the gradient of the loss on labeled points, since both the regularizing norms are zero. In the following iterations we can observe the label diffusion process along the manifold. After only 4 iterations we get a perfect classification of the dataset and a separating boundary not far from the optimal one. All the remaining iterations until complete convergence are used to slightly asses the coherence along the manifold required by the intrinsic norm and the balancing with the smoothness of the function, as can be observed by looking at the function values after 25 iterations. The most of changes influences regions far from the support of , and it is clear that an early stop after 4 PCG steps would be enough to roughly approximate the accuracy of optimal solution.

In Figure 4 we can observe the values of the previously described general stopping criterion for PCG. After 4 iterations they are still sensibly decreasing, without reflecting real improvements in the classifier quality. The value of the objective function starts to become more stable only after, say, 16 iterations, but it is still slightly decreasing even if it appears quite horizontal on the graph, due to its scale. It is clear that fixing in advance the parameters and is random guessing and it will probably result in a bad trade–off between training time and accuracy.

### 4.1 Early Stopping Conditions

Following these considerations, we propose to early stop the PCG algorithm exploiting the predictions of the classifier on the available data. Due to the high amount of unlabeled training points in the semi–supervised learning framework, the stability of the decision , , can be used as a reference to early stop the gradient descent (stability check). Moreover, if labeled validation data (set ) is available for classifier parameters tuning, we can formulate a good stopping condition based on the classification accuracy on it (validation check), that can be eventually merged to the previous one (mixed check).

In detail, when becomes quite stable between consecutive iterations or when , the error rate on , is not decreasing anymore, then the PCG algorithm should be stopped. Due to their heuristic nature, it is generally better to compare the predictions every iterations and within a certain tolerance . As a matter of fact, may slightly change also when we are very close to the optimal solution, and is not necessarily an always decreasing function. Moreover, labeled validation data in the semi–supervised setting is usually small with respect to the whole training data, labeled and unlabeled, and it may not be enough to represent the structure of the dataset.

We propose very simple implementations of such conditions, that we used to achieve the results of Section 6. Starting from these, many different and more efficient variants can be formulated, but it goes beyond the scope of this paper. They are sketched in Algorithms 2 and 3.

We computed the classifier decision every iterations and we required the classifier to improve by one correctly classifier example at every check, due to the usually small size of . Sometimes this can also help to avoid a slight overfitting of the classifier.

Generating the decision on unlabeled data does not require heavy additional machinery, since the product must be necessarily computed to perform every PCG iteration. Its overall cost is . Differently, computing the accuracy on validation data requires the evaluation of the kernel function on validation points against the training ones, and products, that is negligible with respect to the cost of a PCG iteration.

Finally, please note that even if these are generally early stopping conditions, sometimes they can help in the opposite situation. For instance they can also detect that the classifier needs to move some more steps toward the optimal solution than the ones limited by the selected .

## Section 5 Laplacian Regularized Least Squares

Laplacian Regularized Least Square Classifier (LapRLSC) has many analogies with the proposed hinge loss based LapSVMs. LapRLSC uses a squared loss function to penalize wrongly classified examples, leading to the following objective function

 minf∈Hkl∑i=1(yi−f(\boldmathxi))2+γA∥f∥2A+γI∥f∥2I. (14)

The optimal coefficients and the optimal bias , collected in the vector , can be obtained by solving the linear system

 (|L|\boldmath1TILKKIL\boldmath1KILK+γAK+γIKLK)\boldmathz=(% \boldmath1T\boldmathyK\boldmathy) (15)

where is the diagonal matrix with the first elements equal to 1 and the remaining elements equal to zero.

Following the notation used for LapSVMs, in LapRLSCs we have a set of error vectors that is actually fixed and equal to . As a matter of fact a LapRLSC requires the estimated function to interpolate the given targets in order to not incur in a penalty. In a hypothetic situation where all the labeled examples always belong to during the training of a LapSVM classifier in the primal, then the solution will be the same of LapRLSC.

Solving the least squares problem of LapRLSC can be performed by matrix inversion, after factoring and simplifying the previously defined matrix in Eq. 15. Otherwise the proposed PCG approach and the early stopping conditions can be directly used. In this case the classic instruments for linear optimization apply, and the required line search of Eq. 13 can be computed in closed form without the need of an iterative process,

 s∗=−∇T\boldmathd\boldmathdTH% \boldmathd

where and are no more functions of .

As shown by Belkin et al. (2006); Sindhwani and Rosenberg (2008) and in the experimental Section of this paper, LapRLSC, LapSVM and primal LapSVM allow us to achieve similar classification performances. The interesting property of the LapSVM problem is that the effect of the regularization terms at a given iteration can be decoupled by the one of the loss function on labeled points, since the gradient of the loss function for correctly classified points is zero and do not disturb classifier design. This characteristic can be useful as a starting point for the study of some alternative formulations of the intrinsic norm regularizer.

## Section 6 Experimental results

We ran a wide set of experiments to analyze the proposed solution strategies of the primal LapSVM problem. In this Section we describe the selected datasets, our experimental protocol and the details on the parameter selection strategy. Then we show the main result of the proposed approach, very fast training of the LapSVM classifier with reduced complexity by means of early stopped PCG. We compare the quality of the hinge loss LapSVMs trained in the primal by Newton’s method with respect to the hinge loss dual formulation and LapRLSCs. Finally, we describe the convergence speed and the impact on performances of our early stopping conditions.

As a baseline reference for the performances in the supervised setting, we selected two popular regularized classifiers, Support Vector Machines (SVMs) and Regularized Least Square Classifiers (RLSCs). We implemented and tested all the algorithms using Matlab 7.6 on a 2.33Ghz machine with 6GB of memory. The dual problem of LapSVM has been solved using the latest version of Libsvm (Fan et al., 2005). Multiclass classification has been performed using the one–against–all approach.

### 6.1 Datasets

We selected eight popular datasets for our experiments. Most of them datasets has been already used in previous works to evaluate several semi–supervised classification algorithms (Sindhwani et al., 2005; Belkin et al., 2006; Sindhwani and Rosenberg, 2008), and all of them are available on the Web. G50C444It can be downloaded from http://people.cs.uchicago.edu/~ vikass/manifoldregularization.html. is an artificial dataset generated from two unit covariance normal distributions with equal probabilities. The class means are adjusted so that the Bayes error is . The COIL20 dataset is a collection of pictures of 20 different objects from the Columbia University. Each object has been placed on a turntable and at every 5 degrees of rotation a 32x32 gray scale image was acquired. The USPST dataset is a collection of handwritten digits form the USPS postal system. Images are acquired at the resolution of 16x16 pixels. USPST refers to the test split of the original dataset. We analyzed the COIL20 and USPST dataset in their original 20 and 10–class versions and also in their 2–class versions, to discard the effects on performances of the selected multiclass strategy. COIL20(B) discriminates between the first 10 and the last 10 objects, whereas USPST(B) from the first 5 digits and the remaining ones. PCMAC is a two–class dataset generated from the famous 20–Newsgroups collection, that collects posts on Windows and Macintosh systems. MNIST3VS8 is the binary version of the MNIST dataset, a collection of 28x28 gray scale handwritten digit images from NIST. The goal is to separate digit 3 from digit 8. Finally, the FACEMIT dataset of the Center for Biological and Computational Learning at MIT contains 19x19 gray scale, PGM format, images of faces and non–faces. The details of the described datasets are resumed in Table 1.

### 6.2 Experimental protocol

All presented results has been obtained by averaging them on different splits of the available data. In particular, a 4–fold cross–validation has been performed, randomizing the fold generation process for 3 times, for a total of 12 splits. Each fold contains the same number of per class examples as in the complete dataset. For each split, we have 3 folds that are used for training the classifier and the remaining one that constitutes the test set (). Training data has been divided in labeled (), unlabeled () and validation sets (), where the last one is only used to tune the classifier parameters. The labeled and validation sets have been randomly selected from the training data such that at least one example per class is assured to be present on each of them, without any additional balancing constraints. A small number of labeled points has been generally selected, in order to simulate a semi–supervised scenario where labeling data has a large cost. The MNIST3VS8 and FACEMIT dataset are already divided in training and test data, so that the 4–fold generation process was not necessary, and just the random subdivision of training data has been performed. In particular, on the FACEMIT dataset we exchanged the original training and test sets, since, as a matter of fact, the latter is sensibly larger that the former. In this case our goal is just to show how we were able to handle a high amount of training data using the proposed primal solution with PCG, whereas it was not possible to do it with the original dual formulation of LapSVM. Due to the high unbalancing of such dataset, we report the macro error rates for it (, where and are the rates of true positives and true negatives). Details are collected in Table 2.

### 6.3 Parameters

We selected a Gaussian kernel function in the form for each experiment, with the exception of the MNIST3VS8 where a polynomial kernel of degree 9 was used, as suggest by Decoste and Schölkopf (2002). The other parameters were selected by cross–validating them on the set. In order to speedup this step, the values of the Gaussian kernel width and of the parameters required to build the graph Laplacian (the number of neighbors, , and the degree, ) for the first six datasets were fixed as specified by Sindhwani and Rosenberg (2008). For details on the selection of such parameters please refer to Sindhwani and Rosenberg (2008); Sindhwani et al. (2005). The graph Laplacian was computed by using its normalized expression. The optimal weights of the ambient and intrinsic norms, , , were determined by varying them on the grid and chosen with respect to validation error. For the FACEMIT dataset also the value was considered, due to the high amount of training points. The selected parameter values are reported in Table 8 of Appendix A for reproducibility of the experiments.

### 6.4 Results

Before going further into details, the training times of LapSVMs using the original dual formulation and the primal one are reported in Table 3, to empathize our main result555For a fair comparison of the training algorithms, the Gram matrix and the Laplacian were precomputed.. The last column refers to LapSVMs trained using the best (in terms of accuracy) of the proposed stopping heuristics for each specific dataset. As expected, training in the primal by the Newton’s method requires training times similar to the ones of the dual formulation. On the other hand, training by PCG with the proposed early stopping conditions shows an appreciable reduction of them on all datasets. As the size of labeled and unlabeled points increases, the improvement becomes very evident. In the MNIST3VS8 dataset we drop from roughly half an hour to two minutes. Both in the dual formulation of LapSVMs and in the primal one solved by means of Newton’s method, a lot of time is spent in computing the matrix product. Even if is sparse, as its size increases or when it is iterated the cost of this product becomes quite high. It is also the case of the PCMAC dataset, where the training time drops from 15 seconds to only 2 seconds when solving with PCG. Finally, also the memory requirements are reduced, since there is no need to explicitly compute, store and invert the Hessian when PCG is used. As an example, we trained the classifier on the FACEMIT dataset only using PCG. The high memory requirements of dual LapSVM and primal LapSVM solved with Newton’s method, coupled with the high computational cost and slow training times, made the problem intractable for such techniques on our machine.

We investigate now the details of the solution of the primal LapSVM problem. In order to compare the effects of the different loss functions of LapRLSCs, LapSVMs trained in the dual, and LapSVMs trained in the primal, in Table 4 the classification errors of the described techniques are reported. For this comparison, the optimal solution of primal LapSVMs is computed by means of the Newton’s method. The manifold regularization based techniques lead to comparable results, and, as expected, all semi–supervised approaches show a sensible improvement over classic supervised classification algorithms. The error rates of primal LapSVMs and LapRLSCs are really close, due to the described relationship of the hinge loss and the squared loss. We collected the average number of Newton’s steps required to compute the optimal solution in Table 5. In all our experiments we always declared convergence in less than 6 steps.

In Figure 5-12 we compared the error rates of LapSVMs trained in the primal by Newton’s method with ones of PCG training, in function of the number of gradient steps . For this comparison, and were selected by cross–validating with the former (see Appendix A). The horizontal line on each graph represents the error rate of the optimal solution computed with the Newton’s method. The number of iterations required to converge to a solution with the same accuracy of the optimal one is sensibly smaller than . Convergence is achieved really fast, and only in the COIL20 dataset we experienced a relatively slower rate with respect to the other datasets. The error surface of each binary classifier is quite flat around optimum with the selected and , leading to some round–off errors in gradient descent based techniques, stressed by the large number of classes and the one–against–all approach. Moreover labeled training examples are highly unbalanced. As a matter of fact, in the COIL20(B) dataset we did not experience this behavior. Finally, in the FACEMIT dataset the algorithm perfectly converges in a few iterations, showing that in this dataset the most of information is contained in the labeled data (even if it is very small), and the intrinsic constraint is easily fulfilled.