Boosting as a Kernel-Based Method

Aleksandr Y. Aravkin (saravkin@uw.edu)
Department of Applied Mathematics, University of Washington, Seattle, WA 98195-4322, USA

Giulio Bottegal (giulio.bottegal@gmail.com)
Department of Electrical Engineering, TU Eindhoven, Eindhoven, MB 5600, The Netherlands

Gianluigi Pillonetto (giapi@dei.unipd.it)
Department of Information Engineering, University of Padova, Padova, 35131, Italy
Abstract

Boosting combines weak (biased) learners to obtain effective learning algorithms for classification and prediction. In this paper, we show a connection between boosting and kernel-based methods, highlighting both theoretical and practical applications. In the context of boosting, we start with a weak linear learner defined by a kernel. We show that boosting with this learner is equivalent to estimation with a special boosting kernel that depends on the original kernel, as well as on the regression matrix, noise variance, and hyperparameters. The number of boosting iterations is modeled as a continuous hyperparameter, and is fit along with the other parameters using standard techniques.
We then generalize the boosting kernel to a broad new class of boosting approaches for more general weak learners, including those based on the ℓ1, hinge, and Vapnik losses. The approach allows fast hyperparameter tuning for this general class and has a wide range of applications, including robust regression and classification. We illustrate some of these applications with numerical examples on synthetic and real data.


Keywords: boosting; weak learners; kernel-based methods; reproducing kernel Hilbert spaces; robust estimation

1 Introduction

Boosting is a popular technique to construct learning algorithms (Schapire, 2003). The basic idea is that any weak learner, i.e. an algorithm that performs only slightly better than random guessing, can be used to build an effective learning mechanism that achieves high accuracy. Since the introduction of boosting in Schapire’s seminal work (Schapire, 1990), numerous variants have been proposed for regression, classification, and specific applications including semantic learning and computer vision (Schapire and Freund, 2012; Viola and Jones, 2001; Temlyakov, 2000; Tokarczyk et al., 2015; Bissacco et al., 2007; Cao et al., 2014). In particular, in the context of classification, LPBoost, LogitBoost (Friedman et al., 2000), Bagging and Boosting (Lemmens and Croux, 2006) and AdaBoost (Freund and Schapire, 1997) have become standard tools, the latter having been recognized as the best off-the-shelf binary classification method (Breiman, 1998; Zhu et al., 2009). Applications of the boosting principle are also found in decision tree learning (Tu, 2005) and distributed learning (Fan et al., 1999). For a survey on applications of boosting in classification tasks see the work of Freund et al. (1999). For regression problems, AdaBoost.RT (Solomatine and Shrestha, 2004; Avnimelech and Intrator, 1999) and L2Boost (Bühlmann and Yu, 2003; Tutz and Binder, 2007; Champion et al., 2014) are the most prominent boosting algorithms. In particular, in L2Boost the weak learner often corresponds to a kernel-based estimator with a heavily weighted regularization term. The fit on the training set is then measured using the quadratic loss and increases at each iteration. Hence, the procedure can lead to overfitting if it continues too long (Bühlmann and Hothorn, 2007). To avoid this, several stopping criteria based on model complexity arguments have been developed: Hurvich et al. (1998) propose a modified version of Akaike’s information criterion (AIC), Hansen and Yu (2001) use the principle of minimum description length (MDL), and Bühlmann and Yu (2003) suggest five-fold cross validation.
In this paper, we focus on L2 boosting and consider linear weak learners defined by the combination of a quadratic loss and a regularizer induced by a kernel. We show that the resulting boosting estimator is equivalent to estimation with a special boosting kernel that depends on that kernel, as well as on the regression matrix, noise variance, and hyperparameters. This viewpoint leads to both greater generality and better computational efficiency. In particular, the number of boosting iterations becomes a continuous hyperparameter of the boosting kernel, and can be tuned by standard fast hyperparameter selection techniques including SURE, generalized cross validation, and marginal likelihood (Hastie et al., 2001a). In Section 5, we show that this tuning is far more efficient than running boosting iterations, and that non-integer values of the iteration parameter can improve performance.
We then generalize the boosting kernel to a wider class of problems, including robust regression, by combining the boosting kernel with piecewise linear-quadratic (PLQ) loss functions (e.g. ℓ1, Vapnik, Huber). The computational burden of standard boosting is high for general loss functions, since the estimator at each iteration is no longer a linear function of the data. The boosting kernel makes the general approach tractable. We also use the boosting kernel in the context of regularization problems in reproducing kernel Hilbert spaces (RKHSs), e.g. to solve classification formulations that use the hinge loss.
The organization of the paper is as follows. After a brief overview of boosting in regression and classification, we develop the main connection between boosting and kernel-based methods in the context of finite-dimensional inverse problems in Section 2. Consequences of this connection are presented in Section 3. In Section 4 we combine the boosting kernel with PLQ penalties to develop a new class of boosting algorithms. We also consider regression and classification in RKHSs. In Section 5 we show numerical results for several experiments involving the boosting kernel. We end with discussion and conclusions in Section 6.

2 Boosting as a kernel-based method

In this section, we give a basic overview of boosting, and present the boosting kernel.

2.1 Boosting: notation and overview

Assume we are given a model linking some observed data to an unknown parameter vector. Suppose our estimator for this parameter minimizes some objective that balances variance with bias. In the boosting context, the objective is designed to provide a weak estimator, i.e. one with low variance in comparison to the bias.

Given a loss function and a kernel matrix, the weak estimator can be defined by minimizing the regularized formulation

(1)

where the regularization parameter is large and leads to over-smoothing. Boosting uses this weak estimator iteratively, as detailed below, refitting it to the residuals between the observed data and the current predicted output.

Boosting scheme:

  1. Set the iteration counter to one and obtain the initial parameter estimate and predicted output using (1);

  2. Solve (1) using the current residuals (observed data minus the current predicted output) as the data vector, and add the resulting correction to the current predicted output;

  3. Increase the iteration counter by 1 and repeat step 2 for a prescribed number of iterations. A numerical sketch of this scheme with a quadratic-loss weak learner follows.
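As a concrete illustration, the following sketch (Python/NumPy) runs the scheme above with the quadratic-loss weak learner of Section 2.2. The symbols U, K, gam and the toy data are placeholders of ours, not notation from the paper; the weak learner is deliberately over-regularized so that each pass only partially fits the residuals.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
U = rng.standard_normal((n, p))                 # regression matrix (placeholder)
theta_true = rng.standard_normal(p)
y = U @ theta_true + 0.1 * rng.standard_normal(n)

K = np.eye(p)                                   # kernel of the weak learner
gam = 100.0                                     # large regularization: a weak (over-smoothed) learner

def weak_learner(d):
    # One application of (1) with quadratic loss: argmin ||d - U t||^2 + gam * t' K^{-1} t
    return np.linalg.solve(U.T @ U + gam * np.linalg.inv(K), U.T @ d)

theta_hat = weak_learner(y)                     # step 1
y_hat = U @ theta_hat
for _ in range(19):                             # steps 2-3, repeated
    delta = weak_learner(y - y_hat)             # refit the weak learner on the current residuals
    theta_hat = theta_hat + delta
    y_hat = U @ theta_hat
print("training fit:", 1 - np.linalg.norm(y - y_hat) / np.linalg.norm(y - y.mean()))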

2.2 Using regularized least squares as weak learner

Suppose data are generated according to

(2)

where the regression matrix is known and has full column rank. The components of the noise vector are independent random variables with mean zero and common variance.

We now use a quadratic loss to define the regularized weak learner. Introducing a kernel scale factor, (1) becomes

(3)

We obtain the following expression for the predicted data:

(4)

where

(5)

The matrix defined in (5) is assumed invertible for the moment; this assumption will be relaxed later on.

The following well-known connection (Wahba, 1990) between (3) and Bayesian estimation is useful for theoretical development. Assume that the parameter vector and the noise are independent Gaussian random vectors with priors

Then, (3) and (4) provide the minimum variance estimates of the parameter and of the noiseless output conditional on the data. In view of this, we refer to the diagonal values of the induced output covariance as the prior variances of the noiseless output.
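The correspondence can be checked numerically. The sketch below assumes the standard parameterization in which the parameter prior is zero-mean Gaussian with covariance given by the scaled kernel, the noise is Gaussian with variance sigma2, and the regularization weight in (3) equals their ratio; the symbols are again our own placeholders.

import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 8
U = rng.standard_normal((n, p))
K, lam, sigma2 = np.eye(p), 2.0, 0.5
theta = rng.multivariate_normal(np.zeros(p), lam * K)
y = U @ theta + np.sqrt(sigma2) * rng.standard_normal(n)

# Regularized least squares (3), with regularization weight sigma2 / lam
theta_rls = np.linalg.solve(U.T @ U + (sigma2 / lam) * np.linalg.inv(K), U.T @ y)

# Minimum variance (posterior mean) estimate under the Gaussian model
P = lam * K
theta_bayes = P @ U.T @ np.linalg.solve(U @ P @ U.T + sigma2 * np.eye(n), y)

print(np.allclose(theta_rls, theta_bayes))      # True: the two estimates coincide
print((U @ theta_rls)[:3])                      # corresponding predicted data, as in (4)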

2.3 The boosting kernel

Define

(6)

Fixing a small kernel scale factor, the predicted data obtained by the weak kernel-based learner can be expressed in terms of (6) and of the number of boosting iterations. According to the scheme specified in Section 2.1, as this number increases, boosting refines the estimate as follows:

(7)

We now show that the boosting estimates are kernel-based estimators associated with a special boosting kernel, which plays a key role in the subsequent developments.

Proposition 1

The boosting estimate in (7) is a kernel-based estimator

where the boosting kernel is defined by

(8)

Proof  First note that the boosting estimate satisfies

(9)

This follows from (6) by adding and subtracting the appropriate term and simplifying. Next, plugging the expression (8) for the boosting kernel into the right-hand side of (9), we obtain

exactly as required by (7).  
In Bayesian terms, for a given number of boosting iterations, the above result also shows that boosting returns the minimum variance estimate of the noiseless output conditional on the data if the noiseless output and the noise are independent Gaussian random vectors with priors

(10)
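As a numerical check of Proposition 1, the sketch below uses one consistent form of the statement: with S the linear smoother of the quadratic-loss weak learner and sigma2 the noise variance, k boosting iterations return (I - (I - S)^k) y, which coincides with the kernel-based estimate obtained from the boosting kernel Kb = sigma2 * ((I - S)^(-k) - I). All symbol names and constants are ours, and the closed form for Kb is a reconstruction consistent with the result rather than a quotation of (8).

import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 6
U = rng.standard_normal((n, p))
gam, sigma2, k = 50.0, 1.0, 7
y = rng.standard_normal(n)
I = np.eye(n)

S = U @ np.linalg.solve(U.T @ U + gam * np.eye(p), U.T)               # weak-learner smoother

y_boost = (I - np.linalg.matrix_power(I - S, k)) @ y                  # k boosting iterations, as in (7)

Kb = sigma2 * (np.linalg.inv(np.linalg.matrix_power(I - S, k)) - I)   # reconstructed boosting kernel
y_kernel = Kb @ np.linalg.solve(Kb + sigma2 * I, y)                   # single kernel-based estimate

print(np.allclose(y_boost, y_kernel))                                 # True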

3 Consequences

In this section, we use Proposition 1 to gain new insights on boosting and a new perspective on hyperparameter tuning.

3.1 Insights on the nature of boosting

We first derive a new representation of the boosting kernel via a change of coordinates. Consider the SVD of the regression matrix. Then, we obtain

(11)

and the predicted output can be rewritten as

In the rotated coordinates, the estimate of each component of the output is

(12)

and corresponds to the regularized least squares estimate induced by a diagonal kernel with entries

(13)

In Bayesian terms, (13) is the prior variance assigned by boosting to the corresponding component of the noiseless output.

Eq. (13) shows that boosting builds a kernel on the basis of the output signal-to-noise ratios along the singular directions, which then enter the prior variances. All diagonal kernel elements associated with nonzero singular values grow without bound as the number of boosting iterations increases; asymptotically, the data will therefore be perfectly interpolated, but with growth rates controlled by the signal-to-noise ratios. If the ratio along a direction is large, the prior variance increases quickly and after a few iterations the estimator is essentially unbiased along that direction. If the ratio is close to zero, that direction is treated as though affected by ill-conditioning, and a large number of iterations is needed to remove the regularization along it.

This perspective makes it clear when boosting can be effective. In the context of inverse problems (deconvolution), the unknown parameter in (2) represents the input to a linear system whose impulse response defines the regression matrix. For simplicity, assume that the kernel is set to the identity matrix, so that the weak learner (3) becomes ridge regression and the signal-to-noise ratios in (13) reflect the power content of the impulse response at different frequencies. Then, boosting can outperform standard ridge regression if the system impulse response and input share a similar power spectrum. Under this condition, boosting can inflate the prior variances (13) along the right directions. For instance, if the impulse response energy is located at low frequencies, then as the number of iterations increases boosting will amplify the low pass nature of the regularizer. This can significantly improve the estimate if the input is also low pass.
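The growth of the prior variances can be visualized directly from the singular values. The sketch below uses the ridge case (identity kernel) and the same reconstruction of the boosting kernel as above, so that the diagonal entry along the i-th singular direction is sigma2 * ((1 - s_i)^(-k) - 1) with s_i = d_i^2 / (d_i^2 + gam); these formulas are our reconstruction of (13), and all numbers are placeholders.

import numpy as np

rng = np.random.default_rng(3)
U = rng.standard_normal((30, 6))
gam, sigma2 = 50.0, 1.0

d = np.linalg.svd(U, compute_uv=False)          # singular values: large d means high output SNR
s = d**2 / (d**2 + gam)                         # per-direction shrinkage of the ridge weak learner

for k in [1, 2.5, 10, 50]:                      # k need not be an integer
    prior_var = sigma2 * ((1.0 - s)**(-k) - 1.0)
    print(f"k = {k:5}:", np.round(prior_var, 3))
# High-SNR directions see their prior variance inflate quickly (little bias after a few
# iterations); low-SNR directions remain heavily regularized unless k becomes very large.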

3.2 Hyperparameter estimation

In the classical scheme described in Section 2.1, the number of boosting iterations is an integer-valued counter, and the boosting scheme is sequential: to obtain the estimate after a given number of iterations, one has to solve as many optimization problems. Using (8) and (11), we can instead interpret this number as a kernel hyperparameter and let it take real values. In the following we estimate both the scale factor and the iteration parameter from the data, restricting the latter to a suitable range.

The resulting boosting approach estimates the hyperparameters by minimizing fit measures such as cross validation or SURE (Hastie et al., 2001a). In particular, this accelerates the tuning procedure, as it requires solving a single problem instead of running multiple boosting iterations. Consider estimating the hyperparameters using the SURE method. Given the noise variance (e.g. obtained from an unbiased estimator), choose

(14)

Straightforward computations show that, for the cost of a single SVD, problem (14) simplifies to

(15)

which is a smooth 2-variable problem over a box, and can be easily optimized.
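A possible implementation of (14)-(15) is sketched below: after a thin SVD of the regression matrix, the SURE objective depends only on the kernel scale and the real-valued iteration parameter, and can be minimized over a box with a quasi-Newton method. The parameterization (scale c, iteration parameter k, shrinkage factors 1 - (1 - s_i)^k) follows our reconstruction above and is not taken verbatim from the paper.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, p, sigma2 = 60, 12, 0.5
U = rng.standard_normal((n, p))
y = U @ rng.standard_normal(p) + np.sqrt(sigma2) * rng.standard_normal(n)

W, d, _ = np.linalg.svd(U, full_matrices=False)       # thin SVD of the regression matrix
z = W.T @ y                                           # data in the rotated coordinates

def sure(par):
    c, k = par                                        # kernel scale and real-valued iteration parameter
    s = c * d**2 / (c * d**2 + sigma2)                # weak-learner shrinkage per direction
    h = 1.0 - (1.0 - s)**k                            # boosting shrinkage after "k iterations"
    rss = np.sum((z - h * z)**2) + np.sum(y**2) - np.sum(z**2)   # residual sum of squares
    return rss + 2.0 * sigma2 * np.sum(h)             # SURE: fit term + degrees-of-freedom penalty

res = minimize(sure, x0=[1.0, 1.0], bounds=[(1e-6, 1e3), (1.0, 500.0)], method="L-BFGS-B")
print("estimated scale and iteration parameter:", res.x)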

We can also extract some useful information about the nature of the optimization problem (15). In fact, differentiating its objective with respect to the iteration parameter, we have

(16)

where the auxiliary quantities are defined in terms of the SVD representation (11). Simple considerations on the sign of this derivative then show that

  • if

    (17)

    then the SURE-optimal iteration parameter is pushed to infinity. This means that we have chosen a learner so weak that SURE suggests an infinite number of boosting iterations as the optimal solution;

  • if

    (18)

    then the SURE-optimal iteration parameter sits at the lower end of its range. This means that the weak learner is instead so strong that SURE suggests not performing any boosting iterations.

3.3 Numerical illustration

Figure 1: True signal (thick red line), Ridge estimate (solid blue) and Boosting estimate (dashed black) obtained in the first Monte Carlo run. The system impulse response is a low pass signal.
Figure 2: Boxplot of the percentage fits obtained by Ridge regression and Boosting, using SURE to estimate hyperparameters; system impulse response is white noise (left) and low pass (right).

We illustrate our insights using a numerical experiment. Consider (2), where the unknown parameter represents the input to a discrete-time linear system. In particular, the signal is taken from Wahba (1990) and displayed in Fig. 1 (thick red line). The system is represented by a regression matrix whose components are realizations of either white noise or low pass filtered white Gaussian noise with a prescribed normalized band. The measurement noise is white and Gaussian, with variance assumed known and set to the variance of the noiseless output divided by 10.

We use a Monte Carlo study of 100 runs to compare the following two estimators:

  • Boosting: the boosting estimator with the kernel set to the identity matrix and with the scale factor and iteration parameter estimated using the SURE strategy (14).

  • Ridge: ridge regression (which corresponds to boosting with the iteration parameter fixed to 1).

Fig. 2 displays the box plots of the 100 percentage fits of the unknown signal obtained by Boosting and Ridge. When the entries of the regression matrix are white noise (left panel), the two estimators have similar performance. When the entries are filtered white noise (right panel), Boosting performs significantly better than Ridge. Furthermore, 36 out of the 100 fits achieved by Boosting under the white noise scenario are lower than those obtained with a low pass regression matrix, which is surprising since the conditioning of the latter problem is much worse. The reasons are those previously described. The unknown parameter represents a smooth signal. In Bayesian terms, setting the kernel to the identity matrix corresponds to modeling it as white noise, which is a poor prior. If the regression matrix is low pass, the output signal-to-noise ratios are more concentrated at low frequencies. So, as the number of iterations increases, Boosting can inflate the prior variances associated with the low-frequency components of the signal. The prior variances associated with high frequencies correspond to low signal-to-noise ratios, so they increase slowly with the number of iterations. This does not happen in the white noise case, since the signal-to-noise ratios have similar distributions across directions. Hence, the original white noise prior can be significantly refined only in the low pass context: it is reshaped so as to form a regularizer inducing more smoothness. Fig. 1 shows this effect by plotting estimates from Ridge and Boosting in a Monte Carlo run where the regression matrix is low pass.
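A stripped-down version of this Monte Carlo comparison is sketched below. The smooth test signal, the moving-average filter used to obtain "low pass" regressors, and all constants are stand-ins of ours; Ridge corresponds to fixing the iteration parameter to one, while Boosting tunes both hyperparameters by SURE as above.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 50)
theta = np.sin(2 * np.pi * t) * np.exp(-2 * t)            # smooth stand-in for the true signal

def estimate(U, y, sigma2, k_fixed=None):
    W, d, Vt = np.linalg.svd(U, full_matrices=False)
    z = W.T @ y
    def sure(par):
        c, k = (par[0], k_fixed) if k_fixed else par
        h = 1 - (1 - c * d**2 / (c * d**2 + sigma2))**k
        return np.sum((z - h * z)**2) + 2 * sigma2 * np.sum(h)
    bnds = [(1e-6, 1e3)] if k_fixed else [(1e-6, 1e3), (1.0, 500.0)]
    par = minimize(sure, [1.0] * len(bnds), bounds=bnds, method="L-BFGS-B").x
    c, k = (par[0], k_fixed) if k_fixed else par
    h = 1 - (1 - c * d**2 / (c * d**2 + sigma2))**k
    return Vt.T @ np.divide(h * z, d, out=np.zeros_like(z), where=d > 1e-12)   # estimate of theta

fit = lambda th: 100 * (1 - np.linalg.norm(theta - th) / np.linalg.norm(theta - theta.mean()))
for low_pass in (False, True):
    E = rng.standard_normal((200, 50))
    if low_pass:                                          # crude low-pass filtering of each regressor row
        E = np.apply_along_axis(lambda r: np.convolve(r, np.ones(8) / 8, mode="same"), 1, E)
    noiseless = E @ theta
    sigma2 = np.var(noiseless) / 10
    y = noiseless + np.sqrt(sigma2) * rng.standard_normal(200)
    print("low pass" if low_pass else "white   ",
          "Boosting %.1f%%  Ridge %.1f%%" % (fit(estimate(E, y, sigma2)),
                                             fit(estimate(E, y, sigma2, k_fixed=1.0))))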

4 Boosting algorithms for general loss functions and RKHSs

In this section, we combine the boosting kernel with piecewise linear-quadratic (PLQ) losses to obtain tractable algorithms for more general regression and classification problems. We also consider estimation in reproducing kernel Hilbert spaces (RKHSs).

4.1 Boosting kernel-based estimation with general loss functions

Figure 3: Six common piecewise linear-quadratic losses: (a) quadratic, (b) Huber, (d) hinge, (e) quantile, (f) Vapnik, (h) elastic net.

In the previous sections, the boosting kernel was derived using regularized least squares (3) as the weak learner. The sequence of resulting linear estimators then led to a closed form expression for the boosting estimate. Now, we consider a kernel-based weak learner (1) based on a general (convex) penalty. Important examples include Vapnik’s epsilon-insensitive loss (Fig. 3f), used in support vector regression (Vapnik, 1998; Hastie et al., 2001b; Schölkopf et al., 2000; Schölkopf and Smola, 2001); the hinge loss (Fig. 3d), used for classification (Evgeniou et al., 2000; Pontil and Verri, 1998; Schölkopf et al., 2000); the Huber and quantile Huber losses (Fig. 3b,e), used for robust regression (Huber, 2004; Maronna et al., 2006; Bube and Nemeth, 2007; Zou and Yuan, 2008; Koenker and Geling, 2001; Koenker, 2005; A. Aravkin et al., 2014); and the elastic net (Fig. 3h), a sparse regularizer that also finds correlated predictors (Zou and Hastie, 2005a,b; Li and Lin, 2010; De Mol et al., 2009). The resulting boosting scheme is computationally expensive: it requires solving a sequence of optimization problems, each of which must itself be solved iteratively. In addition, since the estimators are no longer linear, deriving a boosting kernel is no longer straightforward.

We combine a general loss with the regularizer induced by the boosting kernel from the linear case to define a new class of kernel-based boosting algorithms. More specifically, given a kernel, consider again the SVD representation (11). If the boosting kernel is invertible, the boosting output estimate is obtained from

(19)

where the last line is obtained using (11). Note, here and also in the reformulations below, that the solution depends on the kernel scale factor and the noise variance only through their ratio.
If the boosting kernel is not invertible, the following two strategies can be adopted.

Approach I:

We use (11) to obtain the factorization

where the factor has full column rank and contains the columns of the matrix

associated with the strictly positive entries. Then, the output estimate is obtained from

(20)

The estimate of the parameter vector is then recovered by applying the pseudo-inverse of the regression matrix to the output estimate. One advantage of the formulation (20) is that its evaluation for different values of the hyperparameters is efficient.

Approach II:

Define the matrix

Then, it is easy to see that another representation for the output estimate is obtained from

(21)

The new class of boosting kernel-based estimators defined by (20) or (21) keeps the advantages of boosting in the quadratic case. In particular, the kernel structure can decrease bias along directions less exposed to noise. The use of a general loss allows a range of applications, with, e.g., penalties such as Vapnik and Huber guarding against outliers in the training set. Finally, the algorithm has clear computational advantages over the classic scheme described in Section 2.1. Whereas the classic approach requires solving as many optimization problems as there are boosting iterations, in the new approach, given any positive values of the hyperparameters, the prediction is obtained by solving the single convex optimization problem (19). This is illustrated in Section 5.
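The sketch below illustrates Approach I for a robust loss. It reuses our earlier reconstruction of the boosting kernel, factors it through its eigendecomposition, and solves a problem of the form (20) with a Huber loss standing in for a generic PLQ penalty; the relative weighting of loss and penalty is fixed arbitrarily, since the paper's exact scaling is not reproduced here.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, p = 60, 10
U = rng.standard_normal((n, p))
theta = rng.standard_normal(p)
sigma2 = 0.25
y = U @ theta + np.sqrt(sigma2) * rng.standard_normal(n)
y[::15] += 8.0                                         # a few outliers motivate the robust loss

gam, k = 20.0, 4.5                                     # note: k may be non-integer
S = U @ np.linalg.solve(U.T @ U + gam * np.eye(p), U.T)
e, Q = np.linalg.eigh((S + S.T) / 2)
kb = sigma2 * ((1.0 - np.clip(e, 0.0, 1.0 - 1e-9))**(-k) - 1.0)   # boosting-kernel eigenvalues
pos = kb > 1e-10
B = Q[:, pos] * np.sqrt(kb[pos])                       # full-column-rank factor, Kb = B B'

def huber(r, delta=1.0):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta)).sum()

# Approach I, roughly as in (20): minimize loss(y - B c) + ||c||^2 over c
obj = lambda c: huber(y - B @ c) / sigma2 + c @ c
c_hat = minimize(obj, np.zeros(B.shape[1]), method="L-BFGS-B").x
y_hat = B @ c_hat                                      # predicted output
theta_hat = np.linalg.pinv(U) @ y_hat                  # parameter estimate via the pseudo-inverse
print("parameter fit:", 1 - np.linalg.norm(theta - theta_hat) / np.linalg.norm(theta))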

4.2 New boosting algorithms in RKHSs

We now show how the new class of boosting algorithms can be extended to the context of regularization in RKHSs. We start with L2Boost in RKHSs.

Assume that we want to reconstruct a function from sparse and noisy data collected at input locations taking values in some input space. Our aim now is to allow the function estimator to take values in infinite-dimensional spaces, introducing suitable regularization to circumvent ill-posedness, e.g. in terms of function smoothness. For this purpose, we introduce a kernel function which captures smoothness properties of the unknown function. We can then use L2Boost, with weak learner

(22)

where the loss is a generic convex function and the hypothesis space is the RKHS induced by the kernel, equipped with its norm. From the representer theorem of Schölkopf et al. (2001), the solution of (22) is a finite combination of kernel sections whose coefficients are the components of the column vector

(23)

where the kernel (Gram) matrix collects the kernel evaluated at all pairs of input locations. Using (23), we extend the boosting scheme from Section 2.1 with (22) as the weak learner. In particular, repeated applications of the representer theorem ensure that, for any value of the iteration counter, the corresponding function estimate belongs to the subspace spanned by the kernel sections centered at the input locations. Hence, boosting in RKHSs can be summarized as follows.

Boosting scheme in RKHS:

  1. Solve (23) on the original data to obtain the initial coefficient vector and the corresponding function estimate.

  2. Solve (23) with the current residuals (data minus the current function estimate evaluated at the inputs) as the data vector, and add the resulting correction to the current function estimate.

  3. Increase the iteration counter by 1 and repeat step 2 for a prescribed number of iterations.

There is a fundamental computational drawback to this scheme, which we have already encountered in the previous sections: to obtain the estimate after a given number of iterations we need to solve as many optimization problems, each of them requiring an iterative procedure. Now, we define a new, computationally efficient class of regularized estimators in RKHSs. The idea is to obtain the expansion coefficients of the function estimate through the new boosting kernel. Starting from the kernel (Gram) matrix of the weak learner (23), define the boosting kernel as in (8). Then, we can first solve

(24)

with the overall loss defined as the sum of the individual losses. Then, we compute

and the estimated function becomes

Note that the resulting weights coincide with those of classical boosting only when the losses are quadratic. Nevertheless, given any loss, (24) preserves all the advantages of boosting outlined in the linear case. Furthermore, as in the finite-dimensional case, given any value of the boosting parameter and of the kernel hyperparameters, the estimator (24) is computed by solving a single problem, rather than by iterating the boosting scheme.
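A minimal sketch of (24) follows, under the same reconstruction of the boosting kernel used earlier (now built from the Gram matrix of a Gaussian kernel) and with a Huber loss standing in for a generic robust loss; the toy data, kernel width, and weighting constants are all assumptions of ours.

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

rng = np.random.default_rng(7)
n = 60
x = rng.uniform(-3, 3, size=(n, 1))
f_true = np.sin(2 * x[:, 0])
sigma2 = 0.1
y = f_true + np.sqrt(sigma2) * rng.standard_normal(n)
y[::12] += 5.0                                            # occasional outliers

G = np.exp(-cdist(x, x, "sqeuclidean") / 2.0)             # Gaussian Gram (kernel) matrix
gam, k = 5.0, 3.0
e, Q = np.linalg.eigh(np.linalg.solve(G + gam * np.eye(n), G))   # smoother of the weak learner (23)
kb = sigma2 * ((1.0 - np.clip(e, 0.0, 1.0 - 1e-9))**(-k) - 1.0)
Kb = (Q * kb) @ Q.T                                       # boosting kernel built from the Gram matrix

def huber(r, delta=1.0):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta)).sum()

# (24), roughly: minimize sum_i V(y_i - (Kb c)_i) + c' Kb c over the coefficient vector c
obj = lambda c: huber(y - Kb @ c) / sigma2 + c @ (Kb @ c)
c_hat = minimize(obj, np.zeros(n), method="L-BFGS-B").x
f_hat = Kb @ c_hat                                        # fitted function values at the training inputs
print("fit:", 1 - np.linalg.norm(f_true - f_hat) / np.linalg.norm(f_true))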

Classification with the hinge loss.

Another advantage of the boosting kernel over the classical boosting scheme arises in the classification context. Classification tries to predict one of two output values, e.g. 1 and -1, as a function of the input. L2Boost could be applied using the residual as misfit, e.g. equipping the weak learner (22) with the quadratic or the ℓ1 loss. However, in this context one often prefers to use the margin on an example to measure how well the available data are classified. For this purpose, support vector classification is widely used (Schölkopf and Smola, 2002). It relies on the hinge loss

which gives a linear penalty when the margin is smaller than one. Note that this loss assumes binary labels. However, the classical boosting scheme applies the weak learner (22) repeatedly to residuals, and residuals are no longer binary after the first iteration. This means that L2Boost cannot be used with the hinge loss.

This limitation does not affect the new class of boosting-kernel based estimators: support vector classification can be boosted by plugging the hinge loss into (24):

(25)

where the hinge loss is applied componentwise to the products of the labels and the predicted values.
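A sketch of (25) under the same reconstruction: the hinge loss is applied to the products of labels and fitted values, with a quadratic penalty in the boosting-kernel norm, and the problem is solved by plain subgradient descent (a minimal choice of ours, not an algorithm from the paper).

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(8)
n = 100
x = rng.standard_normal((n, 2))
lab = np.where(x[:, 0] + x[:, 1] + 0.3 * rng.standard_normal(n) > 0, 1.0, -1.0)   # labels in {-1, +1}

G = np.exp(-cdist(x, x, "sqeuclidean") / 2.0)
gam, k, sigma2, lam = 10.0, 2.0, 1.0, 1.0
e, Q = np.linalg.eigh(np.linalg.solve(G + gam * np.eye(n), G))
kb = sigma2 * ((1.0 - np.clip(e, 0.0, 1.0 - 1e-9))**(-k) - 1.0)
Kb = (Q * kb) @ Q.T                                       # boosting kernel, as in the regression sketch

# (25), roughly: minimize sum_i hinge(lab_i * (Kb c)_i) + lam * c' Kb c, by subgradient descent
c = np.zeros(n)
step0 = 1.0 / (2.0 * lam * kb.max() + 1.0)                # conservative base step size
for t in range(1, 3001):
    f = Kb @ c
    active = (1.0 - lab * f) > 0                          # examples violating the margin
    grad = -(Kb[active].T @ lab[active]) + 2.0 * lam * (Kb @ c)
    c -= (step0 / np.sqrt(t)) * grad
print("training accuracy:", np.mean(np.sign(Kb @ c) == lab))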

5 Numerical Experiments

5.1 Boosting kernel regression: temperature prediction with real data

Figure 4: Left: prediction fits obtained by the stable spline estimator (SS) and by Boosting equipped with the stable spline kernel (Boosting SS). Right: 30-min ahead temperature prediction from Boosting SS on a portion of the test set.

To test boosting on real data, we use a case study in thermodynamic modeling of buildings. Eight temperature sensors produced by Moteiv Inc. were placed in two rooms of a small two-floor residential building. The experiment lasted for 8 days starting from February 24th, 2011; samples were taken every 5 minutes. A thermostat controlled the heating system, and the reference temperature was manually set every day depending upon occupancy and other needs. The goal of the experiment is to assess the predictive capability of models built using kernel-based estimators.

We consider multiple-input single-output (MISO) models. The temperature from the first node is the output and the other seven represent the inputs. The measurements are split into a training set and a test set; the test set is used to assess the ability of our estimator to predict future data. Data are normalized to zero mean and unit variance before identification is performed.

The model predictive power is measured in terms of the multi-step-ahead prediction fit on the test set, i.e.

We consider ARX models of the form

where the model involves discrete-time convolutions between the regressors and 8 unknown one-step-ahead predictor impulse responses, each of length 50. Note that when such impulse responses are known, one can use them in an iterative fashion to obtain any multi-step-ahead prediction. We can stack all the impulse responses in a single vector and form the regression matrix from the past outputs and the inputs, so that the model takes the linear form (2). Then, we consider the following estimators:

  • Boosting SS: this estimator regularizes each impulse response by introducing information on its smoothness and exponential decay through the stable spline kernel (Pillonetto and De Nicolao, 2010). The impulse responses are recovered by the boosting scheme (20), with the loss set to the quadratic loss. Note that the estimator contains three unknown hyperparameters: the stable spline kernel parameter, the scale factor, and the boosting parameter. To estimate them, the training set is divided in half and hold-out cross validation is used.

  • Classical Boosting SS: the same as above except that the boosting parameter can assume only integer values.

  • SS: this is the stable spline estimator described in Pillonetto and De Nicolao (2010) (and corresponds to Boosting SS with the boosting parameter fixed to 1), with hyperparameters obtained via marginal likelihood optimization.

For Boosting SS, the boosting parameter estimated by cross validation is not an integer. For Classical Boosting SS, the estimated (integer) boosting parameter gives, in practice, the same results achieved by SS, so that our discussion below just compares the performance of Boosting SS and SS.

The left panel of Fig. 4 shows the prediction fits, as a function of the prediction horizon, obtained by Boosting SS and SS. Note that the non-integer boosting parameter gives an improvement in performance. This means that, in this experiment, using a continuous boosting parameter also improves over classical boosting. The right panel of Fig. 4 shows sample trajectories of the half-hour-ahead boosting prediction on a part of the test set.

5.2 Boosting kernel regression using the ℓ1 loss: real-data water tank system identification

Figure 5: Left: training set. Right: test set simulation from Boosting SS with the ℓ1 loss.

We test our new class of boosting algorithms on another real data set obtained from a water tank system (see also Bottegal et al. (2016)). In this example, a tank is fed with water by an electric pump. The water is drawn from a lower basin, and then flows back through a hole in the bottom of the tank. The system input is the applied voltage, while the output is the water level in the tank, measured by a pressure sensor at the bottom of the tank. The setup represents a typical control engineering scenario, where the experimenter is interested in building a mathematical model of the system in order to predict its behavior and design a control algorithm (Ljung, 1999). To this end, input/output samples are collected every second, comprising almost 1000 pairs that are divided into a training and a test set. The signals are de-trended by removing their means. The training and test outputs are shown in the left and right panels of Fig. 5. One can see that the second part of the training data is corrupted by outliers caused by pressure perturbations in the tank; these are due to air occasionally being blown into the tank. Our aim is to understand the predictive capability of the boosting kernel even in the presence of outliers.

We consider a FIR model of the form

where the unknown vector contains the impulse response coefficients. It is estimated using a variation of the Boosting SS estimator described in the previous section: while the stable spline kernel is still employed to define the regularizer, the key difference is that the loss in (20) is now set to the robust ℓ1 loss. The hyperparameters are estimated using hold-out cross validation. The right panel of Fig. 5 shows the boosting simulation of the test set. The estimate from Boosting SS predicts the test set well; replacing the ℓ1 loss with the quadratic loss decreases the test set fit.

5.3 Boosting in RKHSs: Classification problem

Consider the problem described in Section 2 of Hastie et al. (2001a). Two classes are introduced, each defined by a mixture of ten Gaussian clusters, with the cluster means of the two classes drawn from two different Gaussian distributions with identity covariance. Class labels corresponding to the two classes are generated randomly with probability 1/2. Observations for a given label are generated by picking one of the ten means of the corresponding class with uniform probability 1/10, and drawing an input location from a Gaussian centered at that mean. A Monte Carlo study of 100 runs is designed. At each run, a new data set of size 500 is generated and split into training, validation, and test subsets. The validation set is used to estimate, through hold-out cross-validation, the unknown hyperparameters, in particular the boosting parameter. Performance for a given run is quantified by computing the percentage of data correctly classified.

We compare the performance of the following two estimators:

  • Boosting+ℓ1 loss: this is the boosting scheme in RKHSs illustrated in the previous section (the iteration counter may assume only integer values), with the weak learner (22) defined by a Gaussian kernel and each loss set to the ℓ1 loss.

  • Boosting kernel+ℓ1 loss: this is the estimator using the new boosting kernel, defined from the kernel matrix built with the same Gaussian kernel reported above. The function estimate is obtained by solving (24) using the ℓ1 loss.

Note that the two estimators contain only one unknown parameter, the boosting parameter, which is estimated by the cross validation strategy described above. The top left panel of Fig. 6 compares their performance. Interestingly, the results are very similar; see also Table 1. This supports the fact that the boosting kernel can include classical boosting features in the estimation process. In this example, the difference between the two methods lies mainly in their computational complexity. In particular, the top right panel of Fig. 6 reports some cross validation scores as a function of the boosting iteration counter for the classical boosting scheme. The score is linearly interpolated, since the counter can assume only integer values. On average over the 100 Monte Carlo runs, reaching the optimal value of the counter requires solving problem (22) 340 times. After selecting the boosting parameter, another 340 problems must be solved to obtain the function estimate using the union of the training and validation data.

In contrast, the boosting kernel used in (24) does not require repeated optimization of the weak learner. Using a golden section search, estimating the boosting parameter by cross validation on average requires solving 20 problems of the form (24). Once this parameter is found, only one additional optimization problem must be solved to obtain the function estimate. Summarizing, in this example the boosting kernel obtains results similar to those achieved by classical boosting, but requires solving only around 20 optimization problems rather than nearly 700. The computational times of the two approaches are reported in the bottom panel of Fig. 6.
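For completeness, a self-contained golden-section search is sketched below; cv_score is a placeholder that, in the experiment, would refit (24) on the training split for the given value of the boosting parameter and return the hold-out validation error (here a toy unimodal curve stands in). About 20 evaluations suffice on the range used here.

import numpy as np

def golden_section(f, a, b, tol=1e-2):
    """Minimize a unimodal scalar function f on [a, b] by golden-section search."""
    invphi = (np.sqrt(5.0) - 1.0) / 2.0
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = f(c), f(d)
    evals = 2
    while b - a > tol:
        if fc < fd:                              # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                                    # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
        evals += 1
    return 0.5 * (a + b), evals

cv_score = lambda k: (k - 7.3)**2 + 1.0          # placeholder for the hold-out error at parameter k
k_hat, evals = golden_section(cv_score, 1.0, 50.0)
print("selected boosting parameter:", round(k_hat, 2), "after", evals, "evaluations")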

Table 1 also shows the average fit obtained by two other estimators. The first estimator is denoted by Boosting SVC: it coincides with Boosting kernel+ℓ1 loss, except that the hinge loss replaces the ℓ1 loss in (24). The other one is SVC and corresponds to the classical support vector classifier. It uses the same Gaussian kernel defined above, with the regularization parameter determined via cross validation on a grid containing 20 logarithmically spaced values. One can see that the best results are obtained by boosted support vector classification. Recall also that the hinge loss cannot be adopted within the classical boosting scheme, as discussed at the end of the previous section.

Boosting+ℓ1   Boosting kernel+ℓ1   Boosting SVC   SVC
78.91 %       79.15 %              79.73 %        78.12 %
Table 1: Average percentage classification fit.
Figure 6: Classification problem. Top left: fits obtained by the new boosting kernel (x-axis) vs. fits obtained by the classical boosting scheme (y-axis); both estimators use the ℓ1 loss. Top right: some cross validation scores computed using the classical boosting scheme equipped with the ℓ1 loss, as a function of the boosting iteration counter; each curve corresponds to a different run. Bottom: computational times needed to solve a classification problem by the new boosting kernel (x-axis) and by the classical boosting scheme (y-axis).

5.4 Boosting in RKHSs: Regression problem

Consider now a regression problem where only smoothness information is available to reconstruct the unknown function from sparse and noisy data. As in the previous example, our aim is to illustrate how the new class of proposed boosting algorithms can solve this problem in an RKHS with a great computational advantage over the traditional scheme. For this purpose, we consider a classical benchmark problem where the unknown map is Franke's bivariate test function, given by a weighted sum of four exponentials (Wahba, 1990). The data set size is 1000 and is generated as follows. First, 1000 input locations are drawn from a uniform distribution on the unit square. The data are divided in the same way as described in the classification problem. The outputs in the training and validation data are

where the errors are independent, with distribution given by the mixture of Gaussians

The test outputs are instead given by the noiseless function values. A Monte Carlo study of 100 runs is considered, where a new data set is generated at each run. The test fit is computed as

where the fit compares the test set prediction with the noiseless test outputs.

Note that the mixture noise models the effect of outliers, which affect, on average, 1 out of 10 outputs. This motivates the use of the robust ℓ1 loss. Hence, the function is still reconstructed by Boosting+ℓ1 loss and Boosting kernel+ℓ1 loss, which are implemented exactly as previously described. Fig. 7 displays the results with the same rationale adopted in Fig. 6. The fits are close to each other but, at each run, the classical boosting scheme requires solving hundreds of optimization problems, while the boosting kernel-based approach needs to solve only around 15 problems on average. The computational times of the two approaches are reported in the bottom panel of Fig. 7.

Finally, Table 2 reports the average fits, including the one achieved by Gaussian kernel+ℓ1 loss, which is implemented as the SVC estimator described in the previous section except that the hinge loss is replaced by the ℓ1 loss. The best results are achieved by the boosting kernel with the ℓ1 loss.

Boosting+ℓ1   Boosting kernel+ℓ1   Gaussian kernel+ℓ1
76.62 %       76.75 %              75.19 %
Table 2: Average percentage regression fit.
Figure 7: Regression problem. Top left: fits obtained by the new boosting kernel (x-axis) vs. fits obtained by the classical boosting scheme (y-axis); both estimators use the ℓ1 loss. Top right: some cross validation scores computed using the classical boosting scheme equipped with the ℓ1 loss, as a function of the boosting iteration counter. Bottom: computational times needed to solve a regression problem by the new boosting kernel (x-axis) and by the classical boosting scheme (y-axis).

6 Conclusion

In this paper, we presented a connection between boosting and kernel-based methods. We showed that, in the context of regularized least squares, boosting with a weak learner can be interpreted using a boosting kernel. This connection was used for three main purposes: (1) providing insight into boosting estimators and when they can be effective; (2) deriving schemes for hyperparameter estimation based on the kernel connection; and (3) proposing a more general class of boosting schemes for general misfit measures, including ℓ1, Huber and Vapnik, which can also use RKHSs as hypothesis spaces.

The proposed approach combines generality with computational efficiency. In contrast to the classic boosting scheme, treating the number of boosting iterations as a continuous hyperparameter may improve prediction capability. Real data support the use of these generalized schemes in practice: in some real experiments we obtained a non-integer estimate of the boosting parameter, improving on the classic scheme. In addition, this new viewpoint avoids sequential solutions. This turns out to be a particularly strong advantage for boosting with general losses, as each boosting iteration would itself require an iterative algorithm. This was also illustrated in the RKHS setting: the boosting kernel allows us to obtain results similar to (or even better than) those of the classical boosting scheme while dramatically reducing the computational cost.


References

  • A. Aravkin et al. (2014) A. Aravkin, P. Kambadur, A.C. Lozano, and R. Luss. Orthogonal matching pursuit for sparse quantile regression. In Data Mining (ICDM), International Conference on, pages 11–19. IEEE, 2014.
  • Avnimelech and Intrator (1999) R. Avnimelech and N. Intrator. Boosting regression estimators. Neural computation, 11(2):499–520, 1999.
  • Bissacco et al. (2007) A. Bissacco, M.-H. Yang, and S. Soatto. Fast human pose estimation using appearance and motion via multi-dimensional boosting regression. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
  • Bottegal et al. (2016) G. Bottegal, A.Y. Aravkin, H. Hjalmarsson, and G. Pillonetto. Robust EM kernel-based methods for linear system identification. Automatica, 67:114–126, 2016.
  • Breiman (1998) L. Breiman. Arcing classifier (with discussion and a rejoinder by the author). The annals of statistics, 26(3):801–849, 1998.
  • Bube and Nemeth (2007) K.P. Bube and T. Nemeth. Fast line searches for the robust solution of linear systems in the hybrid and huber norms. Geophysics, 72(2):A13–A17, 2007.
  • Bühlmann and Hothorn (2007) P. Bühlmann and T. Hothorn. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, pages 477–505, 2007.
  • Bühlmann and Yu (2003) P. Bühlmann and B. Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.
  • Cao et al. (2014) X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2):177–190, 2014.
  • Champion et al. (2014) M. Champion, C. Cierco-Ayrolles, S. Gadat, and M. Vignes. Sparse regression and support recovery with L2-boosting algorithms. Journal of Statistical Planning and Inference, 155:19–41, 2014.
  • De Mol et al. (2009) C. De Mol, E. De Vito, and L. Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201–230, 2009.
  • Evgeniou et al. (2000) T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1–150, 2000.
  • Fan et al. (1999) W. Fan, S.J. Stolfo, and J. Zhang. The application of adaboost for distributed, scalable and on-line learning. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 362–366. ACM, 1999.
  • Freund and Schapire (1997) Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • Freund et al. (1999) Y. Freund, R. Schapire, and N. Abe. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771-780):1612, 1999.
  • Friedman et al. (2000) J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000.
  • Hansen and Yu (2001) M.H. Hansen and B. Yu. Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96(454):746–774, 2001.
  • Hastie et al. (2001a) T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning, volume 1. Springer series in statistics Springer, Berlin, 2001a.
  • Hastie et al. (2001b) T.J. Hastie, R.J. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer, Canada, 2001b.
  • Huber (2004) P. J. Huber. Robust Statistics. John Wiley and Sons, 2004.
  • Hurvich et al. (1998) C.M. Hurvich, J.S. Simonoff, and C.-L. Tsai. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(2):271–293, 1998.
  • Koenker (2005) R. Koenker. Quantile Regression. Cambridge University Press, 2005.
  • Koenker and Geling (2001) R. Koenker and O. Geling. Reappraising medfly longevity: A quantile regression survival analysis. Journal of the American Statistical Association, 96:458–468, 2001.
  • Lemmens and Croux (2006) A. Lemmens and C. Croux. Bagging and boosting classification trees to predict churn. Journal of Marketing Research, 43(2):276–286, 2006.
  • Li and Lin (2010) Q. Li and N. Lin. The bayesian elastic net. Bayesian Analysis, 5(1):151–170, 2010.
  • Ljung (1999) L. Ljung. System Identification, Theory for the User. Prentice Hall, 1999.
  • Maronna et al. (2006) R.A. Maronna, D. Martin, and V.J. Yohai. Robust Statistics. Wiley Series in Probability and Statistics. Wiley, 2006.
  • Pillonetto and De Nicolao (2010) G. Pillonetto and G. De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81–93, 2010.
  • Pontil and Verri (1998) M. Pontil and A. Verri. Properties of support vector machines. Neural Computation, 10:955–974, 1998.
  • Schapire (1990) R.E. Schapire. The strength of weak learnability. Machine learning, 5(2):197–227, 1990.
  • Schapire (2003) R.E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149–171. Springer, 2003.
  • Schapire and Freund (2012) R.E. Schapire and Y. Freund. Boosting: Foundations and algorithms. MIT press, 2012.
  • Schölkopf and Smola (2001) B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. (Adaptive Computation and Machine Learning). MIT Press, 2001.
  • Schölkopf and Smola (2002) B. Schölkopf and A.J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
  • Schölkopf et al. (2000) B. Schölkopf, A.J. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.
  • Schölkopf et al. (2001) B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. Neural Networks and Computational Learning Theory, 81:416–426, 2001.
  • Solomatine and Shrestha (2004) D.P. Solomatine and D.L. Shrestha. AdaBoost. RT: a boosting algorithm for regression problems. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, volume 2, pages 1163–1168. IEEE, 2004.
  • Temlyakov (2000) V.N. Temlyakov. Weak greedy algorithms. Advances in Computational Mathematics, 12(2-3):213–227, 2000.
  • Tokarczyk et al. (2015) P. Tokarczyk, J.D. Wegner, S. Walk, and K. Schindler. Features, color spaces, and boosting: New insights on semantic classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 53(1):280–295, 2015.
  • Tu (2005) Z. Tu. Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1589–1596. IEEE, 2005.
  • Tutz and Binder (2007) G. Tutz and H. Binder. Boosting ridge regression. Computational Statistics & Data Analysis, 51(12):6044–6059, 2007.
  • Vapnik (1998) V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, USA, 1998.
  • Viola and Jones (2001) P. Viola and M. Jones. Fast and robust classification using asymmetric adaboost and a detector cascade. Advances in Neural Information Processing System, 14, 2001.
  • Wahba (1990) G. Wahba. Spline models for observational data. SIAM, Philadelphia, 1990.
  • Zhu et al. (2009) J. Zhu, H. Zou, S. Rosset, and T. Hastie. Multi-class adaboost. Statistics and its Interface, 2(3):349–360, 2009.
  • Zou and Hastie (2005a) H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005a.
  • Zou and Hastie (2005b) H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005b.
  • Zou and Yuan (2008) H. Zou and M. Yuan. Regularized simultaneous model selection in multiple quantiles regression. Computational Statistics & Data Analysis, 52(12):5296–5304, 2008.