Gradient and Newton Boosting for Classification and Regression

Fabio Sigrist
Lucerne University of Applied Sciences and Arts
Email: fabio.sigrist@hslu.ch. Address: Lucerne University of Applied Sciences and Arts, Grafenauweg 10, 6302 Zug, Switzerland.
Abstract

Boosting algorithms enjoy large popularity due to their high predictive accuracy on a wide array of datasets. In this article, we argue that it is important to distinguish between three types of statistical boosting algorithms: gradient and Newton boosting as well as a hybrid variant of the two. To date, both researchers and practitioners often do not discriminate between these boosting variants. We compare the different boosting algorithms on a wide range of real and simulated datasets for various choices of loss functions using trees as base learners. In addition, we introduce a novel tuning parameter for Newton boosting. We find that Newton boosting performs substantially better than the other boosting variants for classification, and that the novel tuning parameter is important for predictive accuracy.


Keywords: Statistical boosting, supervised learning, ensembles, trees, prediction

1 Introduction

Boosting refers to a class of classification and regression algorithms that enjoys great popularity due to its excellent predictive accuracy on a wide range of datasets. The first boosting algorithms for classification, including the well-known AdaBoost algorithm, were introduced by Schapire [1990], Freund and Schapire [1995], and Freund et al. [1996]. Later, several authors [Breiman, 1998, 1999, Friedman et al., 2000, Mason et al., 2000, Friedman, 2001] introduced the statistical view of boosting as a stagewise optimization approach. In particular, Friedman et al. [2000] first introduced boosting algorithms which iteratively optimize Bernoulli and multinomial likelihoods for binary and multiclass classification using Newton updates. Further, Friedman [2001] presented gradient descent based boosting algorithms for both regression and classification tasks with general loss functions. See Schapire [2003], Bühlmann and Hothorn [2007], Schapire and Freund [2012], Mayr et al. [2014a], and Mayr et al. [2014b] for reviews on boosting algorithms in both the machine learning and statistical literature.

As we argue in Section 2, there exist broadly speaking two different approaches of statistical boosting for iteratively finding a minimizer in a function space: one that is based on stagewise first-order gradient descent updates and one that is based on a second-order Newton-Raphson updates. We denote these two approaches by gradient boosting and Newton boosting. In addition, one can combine gradient with Newton updates by learning a part of the parameters with a gradient step and the remaining parameters with a Newton step. In particular, if trees are used as base learners, one can learn the structure of the trees using a gradient step and the leaves using Newton updates. This approach is denoted by hybrid gradient-Newton boosting in the following.

In both research and practice, the distinction between gradient and Newton boosting is often not made, thereby implicitly assuming that the difference is not important. For instance, several recent articles [e.g., Ke et al., 2017, Ponomareva et al., 2017] do not distinguish between gradient and Newton boosting, and one has to assume that gradient boosting with potentially Newton updates for tree leaves is used. To the best of our knowledge, a systematic comparison between gradient, Newton, and hybrid gradient-Newton boosting has not been done so far. The $L_K$_TreeBoost algorithm of Friedman [2001] is compared in Friedman [2001] with $K$-class LogitBoost of Friedman et al. [2000] for classification with five classes in a simulation study for one type of random functions. In our terminology, $L_K$_TreeBoost is a version of hybrid gradient-Newton boosting and $K$-class LogitBoost corresponds to Newton boosting for the Bernoulli likelihood. Friedman [2001] finds that the algorithms perform "nearly the same" with "LogitBoost perhaps having a slight advantage". In addition, it is mentioned that "it is likely that when the shrinkage parameter is carefully tuned for each of the three methods [$L_K$_TreeBoost, $K$-class LogitBoost, AdaBoost], there would be little performance differential between them." Similarly, Bühlmann and Hothorn [2007] state that for gradient boosting "an additional line search … seems unnecessary for achieving a good estimator." For trees as base learners, the additional line search is often done for each leaf separately by using a Newton step [Friedman, 2001]. This means that the line search then corresponds to what we denote as hybrid gradient-Newton boosting; see Section 2.1.3 for more details. In this article, we present evidence that is not in line with these statements. In particular, we find that for classification, Newton boosting performs substantially better than both gradient boosting and hybrid gradient-Newton boosting with trees as base learners. Further, we also find that hybrid gradient-Newton boosting usually has better predictive accuracy than gradient boosting.

The main goal of this article is to clarify the distinction between gradient and Newton boosting and to show that the difference between these approaches matters. In addition, we introduce a novel tuning parameter for Newton boosting with trees as base learners in Section 2. We argue that this equivalent number of weighted samples per leaf parameter is a natural tuning parameter and show that it is important for predictive accuracy. We first present the different boosting algorithms in a unified framework in Section 2, and then empirically compare their performance on a large set of popular real datasets in Section 3 and in a simulation study in Section 4. In doing so, we use regression trees [Breiman et al., 1984] as base learners as these are the most widely adopted base learners for prediction tasks in practice [Ridgeway, 2017, Pedregosa et al., 2011, Chen and Guestrin, 2016, Meng et al., 2016, Ke et al., 2017, Ponomareva et al., 2017].

We note that, originally, boosting and, in particular, AdaBoost was motivated differently from the currently adopted statistical view of boosting as a stagewise optimization procedure. Although there is some debate on whether the statistical view of boosting helps to understand the success of AdaBoost [Mease and Wyner, 2008, Wyner et al., 2017], we focus on statistical boosting in this article mainly because it provides a unified framework which allows for generalizing boosting to the regression setting or any other suitable loss function.

2 The statistical view of boosting: three approaches for stagewise optimization

In this section, we present the statistical view of boosting as finding the minimizer of a risk functional in a function space using a stagewise, or greedy, optimization approach. We distinguish between gradient and Newton boosting as well as a hybrid version.

Beforehand, we note that for some loss functions, such as the squared error, the least absolute deviation (LAD) or other quantile regression loss functions, and the Huber loss, there is no difference between these three approaches since the second derivative is either zero or constant. Further, if a loss function is not P-almost everywhere twice differentiable in its second argument, Newton boosting and hybrid gradient-Newton boosting are not applicable. However, the majority of commonly used loss functions are twice differentiable. See Appendix A for a selection of loss functions that we consider in this article.

2.1 Population versions

We assume that there is a response variable $Y$ and a vector of predictor variables $X \in \mathbb{R}^p$. Our goal is to predict the response variable using the predictor variables. We assume that $(Y, X)$ are random variables where both the distribution of $X$ and the conditional distribution of $Y$ given $X$ are absolutely continuous with respect to either the Lebesgue measure, a counting measure, or a mixture of both. In particular, this covers both regression and classification tasks or mixtures of the two such as Tobit regression [Sigrist and Hirnschall, 2017].

The goal of boosting is to find a minimizer $F^*(\cdot)$ of the risk $R(F)$, which is defined as the expected loss

$$R(F) = E\left[L\left(Y, F(X)\right)\right], \qquad (1)$$

where $L(y, F)$ is an appropriately chosen loss function. The first major assumption of boosting is that $F^*$ lies in the span of a set $\mathcal{S}$ of so-called weak or base learners. In other words, $F^* \in \operatorname{span}(\mathcal{S})$, where $\mathcal{S}$ is a set of base learners $f: \mathbb{R}^p \to \mathbb{R}$. This means that $F^*$ is given by

$$F^* = \operatorname*{argmin}_{F \in \operatorname{span}(\mathcal{S})} R(F). \qquad (2)$$

For notational simplicity, we often denote a function $F(\cdot)$ shortly by $F$ in the following. We further assume that $\operatorname{span}(\mathcal{S})$ is a subspace of a Hilbert space with inner product given by

$$\langle F, G \rangle = E\left[F(X) G(X)\right].$$

If the risk $R(F)$ is convex in $F$, then (2) is a convex optimization problem since $\operatorname{span}(\mathcal{S})$ is also convex.

The second major assumption of boosting is that $F^*$ can be found in a stagewise way by sequentially adding an update $f_m$ to the current estimate $F_{m-1}$,

$$F_m(x) = F_{m-1}(x) + f_m(x), \quad f_m \in \mathcal{S}, \qquad (3)$$

such that the risk is minimized:

$$f_m = \operatorname*{argmin}_{f \in \mathcal{S}} R\left(F_{m-1} + f\right). \qquad (4)$$

This minimization can often not be done analytically and, consequently, an approximation has to be used.

Different boosting algorithms vary in the way the minimization in (4) is done, the loss function $L(y, F)$ used in (1), and in the choice of base learners $\mathcal{S}$. Concerning loss functions, popular choices include, for instance, the squared loss for regression with $Y \in \mathbb{R}$ or the negative log-likelihood of a binomial model with a logistic link function for binary classification with $Y \in \{0, 1\}$. Under appropriate regularity assumptions, one can use the negative log-likelihood of any statistical model as loss function:

$$L\left(y, F(x)\right) = -\log p\left(y; F(x), \xi\right),$$

where $p(y; F(x), \xi)$ is the density of $Y$ given $X = x$ with respect to some reference measure, $F(x)$ is linked to a possibly transformed parameter of this density, and $\xi$ are additional parameters. This can also easily be extended to let more than one parameter depend on several different functions. See Appendix A for various examples of loss functions obtained in this way.

Concerning base learners, regression trees [Breiman et al., 1984] are the most frequently used choice. Other potential base learners include splines or linear functions [Bühlmann and Hothorn, 2007]. In this article, we focus on trees:

$$f(x) = f\left(x; w, R\right) = \sum_{j=1}^{J} w_j \mathbf{1}_{\{x \in R_j\}},$$

where $w = (w_1, \dots, w_J)^T \in \mathbb{R}^J$, $R = (R_1, \dots, R_J)$ is a partition of $\mathbb{R}^p$, and $J$ denotes the number of terminal nodes, or leaves, of the tree. The multivalued indicator function $x \mapsto \left(\mathbf{1}_{\{x \in R_1\}}, \dots, \mathbf{1}_{\{x \in R_J\}}\right)$ represents the structure of the tree, i.e., the partition $R$ of the space $\mathbb{R}^p$, and $w$ contains the values of the leaves. As in Breiman et al. [1984], we assume that the partition of the space made by $R$ corresponds to a binary tree where each cell $R_j$ in the partition is a rectangle with sides aligned with the coordinate axes.

For finding an update in (3), either a form of gradient descent, the Newton method [Saberian et al., 2011], or a hybrid variant is used to obtain an approximate solution to the minimization problem in (4). In the following, we briefly describe these approaches.

2.1.1 Gradient boosting

Assuming that the risk $R(F)$ is Gâteau differentiable for all $F \in \operatorname{span}(\mathcal{S})$, we denote the Gâteau derivative by

$$dR(F; f) = \left.\frac{\partial}{\partial t} R\left(F + t f\right)\right|_{t = 0}.$$

Gradient boosting then works by choosing $f_m$ as the minimizer of a first-order Taylor approximation with a penalty on the norm of the base learner:

$$f_m = \operatorname*{argmin}_{f \in \mathcal{S}} \; dR\left(F_{m-1}; f\right) + \frac{1}{2}\left\|f\right\|^2. \qquad (5)$$

Note that we add the penalty $\frac{1}{2}\|f\|^2$ since the functions $f \in \mathcal{S}$ are not necessarily normed and $\|f\|$ is not constant.

If we assume that $L(y, F)$ is differentiable in $F$ for P-almost all $(x, y)$ and that this derivative is integrable with respect to the measure of $(X, Y)$, then, due to Lebesgue's dominated convergence theorem, $dR(F_{m-1}; f)$ is given by

$$dR\left(F_{m-1}; f\right) = E\left[g_m(X, Y) f(X)\right],$$

where $g_m(x, y)$ denotes the gradient of the loss function with respect to $F$ at the current estimate $F_{m-1}(x)$:

$$g_m(x, y) = \left.\frac{\partial}{\partial F} L(y, F)\right|_{F = F_{m-1}(x)}. \qquad (6)$$

Consequently, (5) can be written as

$$f_m = \operatorname*{argmin}_{f \in \mathcal{S}} \; E\left[g_m(X, Y) f(X)\right] + \frac{1}{2}\left\|f\right\|^2 = \operatorname*{argmin}_{f \in \mathcal{S}} \; E\left[\left(-g_m(X, Y) - f(X)\right)^2\right]. \qquad (7)$$

This shows that $f_m$ is the least squares approximation in $\mathcal{S}$ to the negative gradient of the loss function with respect to $F$ evaluated at the current estimate $F_{m-1}$.

If the following expression is well defined for P-almost all $x$, then the minimization in (7) can also be done point-wise:

$$f_m(x) = -E\left[g_m(X, Y) \mid X = x\right].$$

2.1.2 Newton boosting

For Newton boosting, we assume that the risk $R(F)$ is two times Gâteau differentiable and denote the second Gâteau derivative by

$$d^2R(F; f) = \left.\frac{\partial^2}{\partial t^2} R\left(F + t f\right)\right|_{t = 0}.$$

Newton boosting chooses $f_m$ as the minimizer of a second-order Taylor approximation:

$$f_m = \operatorname*{argmin}_{f \in \mathcal{S}} \; dR\left(F_{m-1}; f\right) + \frac{1}{2} d^2R\left(F_{m-1}; f\right). \qquad (8)$$

If we assume the P-almost everywhere existence and integrability of the second derivative of $L(y, F)$ with respect to $F$, then (8) can be written as

$$f_m = \operatorname*{argmin}_{f \in \mathcal{S}} \; E\left[g_m(X, Y) f(X) + \frac{1}{2} h_m(X, Y) f(X)^2\right] = \operatorname*{argmin}_{f \in \mathcal{S}} \; E\left[h_m(X, Y)\left(-\frac{g_m(X, Y)}{h_m(X, Y)} - f(X)\right)^2\right], \qquad (9)$$

where the gradient $g_m(x, y)$ is defined in (6) and $h_m(x, y)$ is the Hessian of $L(y, F)$ with respect to $F$ at $F_{m-1}(x)$:

$$h_m(x, y) = \left.\frac{\partial^2}{\partial F^2} L(y, F)\right|_{F = F_{m-1}(x)}. \qquad (10)$$

The last expression in Equation (9) shows that $f_m$ is the weighted least squares approximation to the negative ratio of the gradient over the Hessian, where the weights are given by the second derivative $h_m(X, Y)$.

If the following expression is well defined for P-almost all $x$, we can again calculate the point-wise minimizer of (9) as

$$f_m(x) = -\frac{E\left[g_m(X, Y) \mid X = x\right]}{E\left[h_m(X, Y) \mid X = x\right]}.$$

2.1.3 Hybrid gradient-Newton boosting

A hybrid variant of gradient and Newton boosting is obtained by first learning part of the parameters of the base learners using a gradient step and the remaining part using a Newton update. For instance, for trees as base learners, the structure $R_m$ of a tree is learned using a gradient update:

$$R_m = \operatorname*{argmin}_{R} \min_{w} E\left[\left(-g_m(X, Y) - f(X; w, R)\right)^2\right],$$

and then, conditional on this, one finds the weights $w_m$ using a Newton step:

$$w_m = \operatorname*{argmin}_{w} E\left[h_m(X, Y)\left(-\frac{g_m(X, Y)}{h_m(X, Y)} - f(X; w, R_m)\right)^2\right].$$

Note that the update step in (3) is sometimes presented in the form $F_m(x) = F_{m-1}(x) + \rho_m f_m(x)$ with $\|f_m\| = 1$, where $\rho_m$ is found by doing an additional line search $\rho_m = \operatorname*{argmin}_{\rho} R\left(F_{m-1} + \rho f_m\right)$. We are not considering this approach explicitly here since, first, we assume that the set of base learners $\mathcal{S}$ is rich enough to include not just normalized base learners but base learners of any norm and, second, the line search often cannot be done analytically and a second-order Taylor approximation is used instead, which then corresponds to a Newton step. I.e., this case is essentially a version of hybrid gradient-Newton or Newton boosting.

2.2 Sample versions

In the following, we assume that we observe a sample of data points $(x_i, y_i)$, $i = 1, \dots, n$, and approximate the risk $R(F)$ in (1) with the empirical risk $R_n(F)$:

$$R_n(F) = \frac{1}{n}\sum_{i=1}^{n} L\left(y_i, F(x_i)\right). \qquad (11)$$

For gradient boosting, the sample version of (7) can be written as

$$f_m = \operatorname*{argmin}_{f \in \mathcal{S}} \sum_{i=1}^{n} \left(-g_{i,m} - f(x_i)\right)^2, \qquad (12)$$

where $g_{i,m}$ is the gradient of the loss function for observation $i$:

$$g_{i,m} = \left.\frac{\partial}{\partial F} L(y_i, F)\right|_{F = F_{m-1}(x_i)}.$$

This means that the stagewise minimizer $f_m$ can be found as the least squares approximation to the negative gradients $-g_{i,m}$.

Similarly, the sample version of the Newton update in (9) is given by

$$f_m = \operatorname*{argmin}_{f \in \mathcal{S}} \sum_{i=1}^{n} h_{i,m}\left(-\frac{g_{i,m}}{h_{i,m}} - f(x_i)\right)^2, \qquad (13)$$

where $h_{i,m}$ is the Hessian of the loss function for observation $i$:

$$h_{i,m} = \left.\frac{\partial^2}{\partial F^2} L(y_i, F)\right|_{F = F_{m-1}(x_i)}.$$

I.e., $f_m$ can be found as the weighted least squares approximation to the ratios of the negative gradients over the Hessians, $-g_{i,m}/h_{i,m}$, with weights given by $h_{i,m}$.

The sample version of the hybrid gradient-Newton algorithm first finds the structure $R_m$ of a tree using a gradient step:

$$R_m = \operatorname*{argmin}_{R} \min_{w} \sum_{i=1}^{n} \left(-g_{i,m} - f(x_i; w, R)\right)^2,$$

and then determines the weights $w_m$ using a Newton step:

$$w_m = \operatorname*{argmin}_{w} \sum_{i=1}^{n} h_{i,m}\left(-\frac{g_{i,m}}{h_{i,m}} - f(x_i; w, R_m)\right)^2.$$
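
To make the three sample versions concrete, the following is a minimal sketch of a single boosting iteration with regression trees as base learners, using the Bernoulli log-loss with a logistic link as an assumed example loss. Function names such as 'one_boosting_update' are illustrative and not part of any particular package, and the numerical safeguards of Section 2.5 are omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bernoulli_grad_hess(y, F):
    """Gradient g_{i,m} and Hessian h_{i,m} of the Bernoulli log-loss w.r.t. F
    (the log-odds), assuming labels y in {0, 1} and a logistic link."""
    p = 1.0 / (1.0 + np.exp(-F))
    return p - y, p * (1.0 - p)

def one_boosting_update(X, y, F, variant="newton", nu=0.1, max_depth=5):
    """One stagewise update F_m = F_{m-1} + nu * f_m for the three variants."""
    g, h = bernoulli_grad_hess(y, F)
    tree = DecisionTreeRegressor(max_depth=max_depth)
    if variant == "gradient":
        # Least squares fit to the negative gradients, cf. Eq. (12).
        tree.fit(X, -g)
        f = tree.predict(X)
    elif variant == "newton":
        # Weighted least squares fit to -g/h with weights h, cf. Eq. (13).
        tree.fit(X, -g / h, sample_weight=h)
        f = tree.predict(X)
    else:
        # Hybrid: tree structure from a gradient step, then a Newton step per leaf.
        tree.fit(X, -g)
        leaves = tree.apply(X)
        f = np.zeros_like(F, dtype=float)
        for j in np.unique(leaves):
            idx = leaves == j
            f[idx] = -g[idx].sum() / h[idx].sum()
    return F + nu * f  # shrinkage nu, cf. Section 2.3
```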

2.3 Tuning parameters and regularization

It has been empirically observed that damping the update in (3) results in increased predictive accuracy. This means that the update in (3) is replaced with

$$F_m(x) = F_{m-1}(x) + \nu f_m(x),$$

where $\nu \in (0, 1]$ is a shrinkage parameter, also called the learning rate. The main tuning parameters of boosting algorithms are then the number of boosting iterations $M$ and the shrinkage parameter $\nu$. These tuning parameters and also the ones for the base learners presented in the following can be chosen by minimizing a performance measure on a validation dataset, by cross-validation, or using an appropriate model selection criterion.

Depending on the choice of base learners, there are additional tuning parameters. For instance, if trees are used as base learners, the depth of the trees and the minimal number of samples per leaf are further tuning parameters. Since Newton boosting solves the weighted least squares problem in (13) in each update step, the raw number of samples per leaf is not meaningful, and we argue that instead one should consider what we denote as the equivalent number of weighted samples per leaf. As we show below on real and simulated data, this parameter is important for predictive accuracy. In more detail, we propose the following approach.

For gradient boosting or hybrid gradient-Newton boosting, every data point has a weight of one when learning the structure of a tree. Motivated by this, we first normalize the weights $h_{i,m}$,

$$\tilde{h}_{i,m} = \frac{n \, h_{i,m}}{\sum_{l=1}^{n} h_{l,m}},$$

such that the sum of all normalized weights equals the number of data points $n$. We then interpret the sum of all normalized weights in a leaf as the equivalent number of weighted data points per leaf, and require that this is larger than a certain constant $c$:

$$\sum_{i:\, x_i \in R_j} \tilde{h}_{i,m} \geq c \quad \text{for all leaves } j = 1, \dots, J. \qquad (14)$$

The constant $c$ is considered as a tuning parameter analogous to the minimum number of samples per leaf in gradient boosting.
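
As an illustration, a minimal sketch of the normalization and the check of constraint (14) for a candidate leaf is given below; the function and parameter names ('leaf_satisfies_constraint', 'min_equivalent_samples') are illustrative and not the interface of any particular implementation.

```python
import numpy as np

def leaf_satisfies_constraint(h, leaf_mask, min_equivalent_samples=10.0):
    """Check the equivalent-number-of-weighted-samples-per-leaf constraint (14).

    h         : array of Hessians h_{i,m} of all n training points
    leaf_mask : boolean mask of the points falling into the candidate leaf
    """
    n = len(h)
    h_tilde = n * h / h.sum()          # normalized weights, summing to n
    return h_tilde[leaf_mask].sum() >= min_equivalent_samples

# For a loss with constant second derivative, h_tilde equals one for every
# point and the constraint reduces to the usual minimum number of samples per leaf.
```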

To the best of our knowledge, the only other software that implements Newton boosting is XGBoost [Chen and Guestrin, 2016]. XGBoost handles this tuning parameter differently by requiring that the sum of all raw weights per leaf is larger than a certain constant, which is by default one (this constant is denoted by 'min_child_weight' in XGBoost, as of July 30, 2018). According to the authors, the motivation for this is that for linear regression, "this simply corresponds to minimum number of instances needed to be in each node". Unfortunately, this is not documented in the corresponding companion article [Chen and Guestrin, 2016]; we gather this information from the online documentation at https://xgboost.readthedocs.io/en/latest/parameter.html (retrieved on July 30, 2018). We argue that this is not an optimal choice for the following two reasons.

First, the second derivative of the loss function of a linear regression model with Gaussian noise only equals one if the noise variance is one. Otherwise, the Hessian equals $1/\sigma^2$, where $\sigma^2$ denotes the noise variance. I.e., the analogy to the linear regression case does not hold true in general. In contrast, our proposed normalized weights do indeed equal one for the linear regression case no matter what the noise variance is, and thus the sum of normalized weights equals the number of samples per leaf for the linear regression model. Second, as we show in our empirical study on both real and simulated datasets, the minimal sum of raw weights is a parameter that is difficult to tune in practice, and we obtain inferior predictive accuracy for the large majority of datasets. Related to this, the sum of normalized weights can be interpreted as the equivalent number of weighted samples per leaf, and one has good intuition concerning reasonable choices for this, which do not depend on the size of the data $n$. For the sum of raw weights, this is not the case.

In addition to the tuning parameters presented above, which all perform a form of regularization to varying degrees, one can add further regularization parameters such as L1 and/or L2 penalties on the tree weights, or an L0 penalty on the number of leaves. Finally, boosting algorithms can also be made stochastic [Friedman, 2002] by (sub-)sampling data points in each boosting iteration and variables in the tree algorithm, as is done for random forests. For the sake of simplicity, we are not considering these additional regularization options in this article.

2.4 Software implementations

The methodology presented in this article, i.e., gradient, Newton, and hybrid gradient-Newton boosting for the loss functions listed in Appendix A, is implemented in Python and openly available on GitHub as a fork of scikit-learn. It is available in the branch 'newtonboost' on https://github.com/fabsig/scikit-learn.git and can be installed, for instance, using the command 'pip install git+https://github.com/fabsig/scikit-learn.git@newtonboost'. The parameter 'update_step' of the functions 'GradientBoostingClassifier' and 'GradientBoostingRegressor' takes as arguments 'gradient', 'hybrid', or 'newton'. E.g., for Newton boosting, one needs to choose the option update_step='newton'.
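
For illustration, a short usage sketch of this fork follows. Apart from the 'update_step' argument documented above, the remaining estimator arguments and the example dataset are standard scikit-learn functionality that we assume carries over unchanged to the fork.

```python
# Sketch: Newton boosting with the scikit-learn fork described above.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(
    update_step="newton",   # 'gradient', 'hybrid', or 'newton'
    n_estimators=1000,      # standard scikit-learn parameters below
    learning_rate=0.1,
    max_leaf_nodes=8,
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```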

Current statistical and machine learning software implementations of boosting use algorithms with either gradient descent updates, Newton updates, or a hybrid form of the two minimization methods. For instance, the R package 'gbm' [Ridgeway, 2007, 2017] and the Python library 'scikit-learn' [Pedregosa et al., 2011] follow the approach of Friedman [2001] and use gradient descent steps for finding the structures of trees with Newton updates for the tree leaves for loss functions that are two times differentiable. Despite the name 'eXtreme Gradient Boosting', 'XGBoost' [Chen and Guestrin, 2016] uses Newton boosting with Newton steps for finding both the tree structure and the tree leaves. The R package 'mboost' [Hothorn et al., 2010] uses gradient boosting. In addition to trees, it also supports linear functions and splines as base learners. Other recent implementations, such as 'LightGBM' [Ke et al., 2017], 'TF Boosted Trees' [Ponomareva et al., 2017], and 'Spark MLlib' [Meng et al., 2016], do not explicitly mention in their companion articles [Ke et al., 2017, Ponomareva et al., 2017] or in their online documentation (https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-trees-gbts, retrieved on July 30, 2018) whether gradient descent or Newton updates are used in the stagewise boosting updates. Since Friedman [2001] is referenced in Ke et al. [2017] and Ponomareva et al. [2017], we assume that they use the hybrid gradient-Newton boosting approach of Friedman [2001] with Newton updates for the leaves.

2.5 Numerical stability and computational cost

As was already observed by Friedman et al. [2000] for the LogitBoost algorithm, numerical stability can be an issue for Newton boosting. For LogitBoost, i.e., Newton boosting for a Bernoulli likelihood with a logistic link function, Friedman et al. [2000] propose two steps for coping with this. First, they bound the ratio $-g_{i,m}/h_{i,m}$ such that it lies within a fixed interval around zero. And second, they enforce a lower threshold on the second derivatives such that they are always strictly positive. In our experience, the first step is not necessary since, in cases where $-g_{i,m}/h_{i,m}$ is indeed large, this only happens in the first few iterations and, subsequently, as the algorithm converges, these values are small. The second point, on the other hand, is important, and we also enforce a small lower bound on the second derivatives $h_{i,m}$ in our implementation of Newton boosting. We have found that the exact value of this lower bound is not important as long as it is small enough.
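
A sketch of the second safeguard, assuming a simple element-wise clipping of the Hessians at a small constant (the constant shown is an illustrative placeholder, not the value used in the reference implementation):

```python
import numpy as np

def stabilize_hessians(h, lower_bound=1e-12):
    """Enforce a small, strictly positive lower bound on the Hessians h_{i,m}
    before they are used as weights in the Newton update (13)."""
    return np.maximum(h, lower_bound)
```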

Concerning computational cost, the main cost of a boosting algorithm with trees as base learners results from growing the regression trees [Ke et al., 2017]. Consequently, the differences in computational times are marginal for the three versions of boosting presented in this article. Tree boosting implementations that are designed to scale to large data use some computational tricks when growing trees; see, e.g., Chen and Guestrin [2016].

3 Application and comparison on real data

In the following, we compare the three different boosting algorithms presented in the previous section on several real datasets. In addition to Newton boosting with our novel equivalent number of weighted samples per leaf tuning parameter, as described in Equation (14), we also use Newton boosting as implemented in XGBoost, for which the sum of Hessians in each leaf acts as tuning parameter (the so-called 'min_child_weight' parameter; we use XGBoost version 0.7 in Python). See Section 2.3 for a discussion on this.

We consider the following datasets: insurance, adult, bank, (breast) cancer, ijcnn, ionosphere, titanic, sonar, car, covtype, digits, glass, letter, satimage, smartphone, usps. Poisson regression is used for the insurance dataset, and for all other datasets, binary or multiclass classification is used. The insurance dataset is obtained from Kaggle (https://www.kaggle.com/apex51/poisson-regression). The covtype, ijcnn, and usps datasets are LIBSVM datasets (obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). All other datasets are obtained from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/). A summary of the datasets can be found in Table 1. If a dataset contains categorical predictor variables, these are converted to binary dummy variables using one-hot encoding.

Data Type / nb. classes Nb. samples Nb. features
adult 2 48842 108
bank 2 41188 62
cancer 2 699 9
ijcnn 2 141691 22
ionosphere 2 351 34
sonar 2 208 60
car 4 1728 21
covtype 7 581012 54
digits 10 5620 64
glass 7 214 9
letter 26 20000 16
satimage 6 6438 36
smartphone 6 10299 561
usps 10 9298 256
insurance Poisson regr. 50999 117
Table 1: Summary of datasets.

In order to quantify variability in the results, we use the bootstrap [Efron and Tibshirani, 1994] by drawing random samples with replacement from the original data. We then split the data into three equally sized datasets: training, validation, and test data. Learning is done on the training data, tuning parameters are chosen on the validation data, and model comparison is done on the test data. For the two largest datasets (ijcnn and covtype), we limit the size of the training, validation, and test data to 20000 data points. This is done for computational reasons. We note that there are various strategies for making tree-based boosting scale to large data, see, e.g., Chen and Guestrin [2016] or Ke et al. [2017], but this is not the goal of this article. For the four small datasets (cancer, ionosphere, sonar, glass), with numbers of observations between 208 and 699, we assign two thirds of the data to the training set and use the remaining data as both validation and test data. This helps to reduce the variability in the results. The number of bootstrap iterations is 100 for datasets with fewer than 1500 samples (fewer than 500 training points), 20 for datasets with a size between 1500 and 7500 (number of training points between 500 and 2500), and 10 for datasets with more than 7500 samples (more than 2500 training points).
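
A minimal sketch of one such bootstrap iteration, under the assumption of a simple equal three-way split of the resampled data (the special treatment of the small datasets described above is not reproduced):

```python
import numpy as np

def bootstrap_split(X, y, rng):
    """Draw a bootstrap sample with replacement and split it into three
    equally sized training, validation, and test sets."""
    n = len(y)
    idx = rng.choice(n, size=n, replace=True)
    train_idx, val_idx, test_idx = np.array_split(idx, 3)
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Usage: rng = np.random.RandomState(0); train, val, test = bootstrap_split(X, y, rng)
```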

Concerning tuning parameters, we select the number of boosting iterations, the learning rate, and the minimum number of samples per leaf from pre-specified candidate grids. For Newton boosting with our novel tuning parameter, the latter corresponds to the equivalent number of weighted samples in Equation (14), and for the XGBoost implementation, the minimum sum of Hessians per leaf is used. Tuning parameters are chosen in each bootstrap iteration such that they minimize the out-of-sample negative log-likelihood on the validation data. The maximal tree size is fixed to the same value for all methods. Having one tuning parameter fewer means less variability in the results, and fixing the maximal tree size allows for a fairer comparison since all methods then use the same degree of interaction. However, we have also tried other values for this parameter (results not reported), and the findings that we report in the following are robust to changing this parameter.
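
As an illustration of this tuning procedure, here is a minimal sketch of validation-based selection over a hypothetical grid, using the scikit-learn fork from Section 2.4; the grid values shown are placeholders and not the grids used in the experiments.

```python
import itertools
import numpy as np
from sklearn.metrics import log_loss
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder grids; the grids used in the paper are not reproduced here.
param_grid = {
    "n_estimators": [100, 1000],
    "learning_rate": [0.01, 0.1],
    "min_samples_leaf": [1, 10, 100],
}

def tune_on_validation(X_train, y_train, X_val, y_val, update_step="newton"):
    """Select tuning parameters by minimizing the validation negative log-likelihood."""
    best_score, best_params = np.inf, None
    keys = list(param_grid)
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(keys, values))
        model = GradientBoostingClassifier(update_step=update_step, **params)
        model.fit(X_train, y_train)
        score = log_loss(y_val, model.predict_proba(X_val))  # negative log-likelihood
        if score < best_score:
            best_score, best_params = score, params
    return best_params, best_score
```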

Figure 1: Comparison of boosting methods on real datasets using out-of-sample negative log-likelihoods. The red rhombi represent means.

Figure 2: Comparison of boosting methods on real datasets using out-of-sample negative log-likelihoods relative to the best method in each run. The red rhombi represent means.
Data CompareAll NewtonVsGrad NewtonVsHybrid NewtonVsXGBoost
adult 0 2.49e-10 0.0206 1.84e-11
bank 6.66e-16 7.56e-06 0.539 2.62e-07
cancer 0 0 9.42e-08 0
ijcnn 0 1.23e-10 6.43e-06 7.6e-07
ionosphere 0 0 2.22e-16 0
sonar 0 0 8.08e-14 0
car 0 0 0.0438 0
covtype 0 8.03e-11 2.44e-05 4.11e-14
digits 0 0 4.44e-16 4.06e-14
glass 0 0 3.33e-15 0
letter 0 1.04e-12 1.76e-10 2.04e-11
satimage 0 1.48e-11 1.59e-06 1.47e-13
smartphone 3.35e-12 3.06e-08 4.9e-05 5.2e-06
usps 2.22e-16 2.17e-09 4.08e-07 1.88e-07
insurance 8.28e-13 4.13e-07 0.872 0.00306
Table 2: Comparison of different boosting methods on real datasets using negative log-likelihoods on the test sets. The column ’CompareAll’ contains p-values of F-tests that compare differences among all methods. The other columns contain p-values of t-tests comparing Newton boosting with our novel tuning parameter to the other boosting variants.

In Figure 1, we show the results. We use the out-of-sample negative log-likelihood on the test data to compare the different methods. The reason why we use the log-likelihood and not the error rate for classification tasks is that the log-likelihood contains more information, e.g., on margins, and it is less affected by small changes in predicted probabilities in cases where the predicted probabilities are close to equal for several classes. However, the results are very similar when using error rates; see below and Appendix C, where we also report the results when using error rates. The red rhombi in Figure 1 represent the means over all bootstrap simulations. In addition, we use boxplots to visualize variability in the results over the different bootstrap iterations. In Figure 2, we also show log-likelihoods relative to the best method for each bootstrap run.

We observe substantial differences in the performance of the different methods. In particular, we find that Newton boosting is the best method for all classification datasets. The differences between the methods are particularly striking for several multiclass classification datasets. In addition, Newton boosting with the new equivalent number of weighted samples parameter performs substantially better than the XGBoost variant of Newton boosting with a minimal sum of Hessians parameter. Interestingly, we find that the XGBoost implementation also performs worse than both gradient and hybrid gradient-Newton boosting for the majority of datasets. The latter no longer holds true when we are not tuning the minimum number of samples per leaf parameter; see below and Appendix B. For the Poisson regression dataset (insurance), we observe that hybrid as well as Newton boosting perform equally well and clearly better than gradient boosting.

To investigate whether the differences are significant, we additionally report in Table 2 p-values for comparing the four methods. The column 'CompareAll' contains p-values from F-tests that compare differences among all four approaches. The other columns contain p-values from t-tests comparing Newton boosting with our novel tuning parameter to the other boosting variants as well as to the XGBoost implementation of Newton boosting, which uses the minimum sum of Hessians per leaf as parameter. All tests are done using a regression model with a random effect at the bootstrap iteration level to account for correlation among the results for different bootstrap iterations. For all datasets, we find significant differences among the different boosting methods. Further, for thirteen out of fourteen classification datasets, we observe that Newton boosting is significantly better at the 5% level than hybrid gradient-Newton boosting, which is the next best competitor. These results are all the more remarkable given the relatively small number of bootstrap iterations and the resulting low power of the tests.

In Appendix B, in Figures 5 and 6 and Table 4, we also report the results when the minimum number of samples per leaf parameter is not tuned and simply set to the default value. I.e., for gradient and hybrid gradient-Newton boosting, the minimum number of samples per leaf is one, for Newton boosting with our proposed choice in (14), we set the minimum equivalent number of weighted samples per leaf to one, and for the XGBoost implementation, we set the minimum sum of Hessians to its default value, i.e., also one. The overall picture is very similar: Newton boosting is generally the best method also when the default value is used for the minimum number of samples, or the equivalent number of weighted samples, parameter. Overall, the performance gain of Newton boosting compared to gradient and hybrid gradient-Newton boosting is even slightly larger than when also tuning the minimum number of samples per leaf parameter. Further, Newton boosting with our novel minimum number of weighted samples per leaf parameter at the default value also clearly outperforms XGBoost with the sum of Hessians parameter at the default value.

As mentioned, we also report in Appendix C, in Figures 7 and 8 and Table 5, the results when using error rates on the test sets instead of the negative log-likelihood for the classification datasets. The results are very similar to those obtained when using the negative log-likelihoods. For several datasets (e.g., ijcnn, digits, letter, smartphone, usps), the average relative accuracy gain of Newton boosting compared to the other boosting variants is quite large, in the range of 10% to 20%.

4 Simulation study

In the following, we also compare the performance of the different boosting approaches on simulated data for both classification and regression. Concerning regression, we consider two extensions of generalized linear models [McCullagh and Nelder, 1989]: boosted Poisson and Gamma regression. For classification, we consider both binary and multiclass classification. In addition, we consider the boosted Tobit model, which can be considered as a hybrid regression-classification model; see Sigrist and Hirnschall [2017]. See Appendix A for more details on these models.

For Poisson and Gamma regression as well as the boosted Tobit model, we use the functions 'make_friedman1' and 'make_friedman3' available in scikit-learn and introduced in Friedman [1991] and Breiman [1996] for simulating mean functions. They are given by

$$F(x) = 10\sin\left(\pi x_1 x_2\right) + 20\left(x_3 - 0.5\right)^2 + 10 x_4 + 5 x_5,$$

where the predictor variables $x_j$ are independently uniformly distributed on $[0, 1]$, and

$$F(x) = \arctan\left(\frac{x_2 x_3 - \frac{1}{x_2 x_4}}{x_1}\right),$$

where $x_1 \sim \text{Unif}(0, 100)$, $x_2 \sim \text{Unif}(40\pi, 560\pi)$, $x_3 \sim \text{Unif}(0, 1)$, and $x_4 \sim \text{Unif}(1, 11)$ independently.

In contrast to the original function of Friedman [1991], we multiply the Friedman#3 function by a constant factor and add a positive offset. The former is done in order that the function also contains larger values and the latter in order that all values are positive. We denote these two functions by the suffixes '_f1' and '_f3' in the following. In addition, we also consider the following mean function introduced by Ridgeway [1999]:

We denote this function by the suffix '_r' in the following.

For Gamma regression, we choose a fixed value for the shape parameter and consider it as known. We note that XGBoost only supports Gamma regression for a specific value of this parameter. However, this slight misspecification seems to have no detrimental impact, as our results below show. For the Tobit model, we fix the variance of the latent variable and also consider it as a known parameter. Further, we set the lower and upper censoring thresholds in such a way that approximately one third of all data points are lower and upper censored. Tobit regression is currently not supported in XGBoost and, consequently, no comparison can be done in this case.
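
To illustrate the data-generating setup, the following is a small sketch of how simulated Poisson data can be produced with the scikit-learn helper named above. The log link, the noise-free use of 'make_friedman1', and the rescaling of the mean function are illustrative assumptions rather than the exact settings of the study.

```python
import numpy as np
from sklearn.datasets import make_friedman1

rng = np.random.RandomState(0)

# Simulate predictors and the Friedman #1 mean function (without additive noise).
X, F = make_friedman1(n_samples=15000, n_features=10, noise=0.0, random_state=0)

# Illustrative choice: rescale the mean function and use a log link so that
# the Poisson means stay in a moderate range; the scaling used in the paper
# is not reproduced here.
y = rng.poisson(lam=np.exp(F / 10.0))
```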

For classification, we use the scikit-learn function 'make_classification', which simulates from an algorithm that is adapted from Guyon [2003] and was designed to generate the 'Madelon' dataset. We use this for simulating both binary data and multiclass data. We use only informative features and no redundant or repeated features; see Guyon [2003] for more details. These two datasets are denoted by 'bin_classif' and 'multi_classif' in the following. In addition, we simulate binary data according to the following specification introduced by Friedman et al. [2000]:

I.e., $X$ follows a ten-dimensional multivariate Gaussian distribution with mean zero and identity covariance matrix. This data is denoted by 'bin_classif_fht' in the following. Finally, we also simulate multiclass data with 5 classes according to the following specification used by Friedman et al. [2000]:

where the thresholds are chosen such that the labels are approximately equally distributed among the different classes. We denote this simulated data by ’multi_classif_fht’.

We run 10 simulation repetitions, each with a dataset of 15000 samples. In each run, 5000 samples are used as training, validation, and test data, respectively. The results are reported in Figures 3 and 4. In Table 3, we report p-values of tests in order to check whether we find significant differences among the different boosting approaches and whether Newton boosting performs better than the other versions. See Section 3 for more details on the plots and tests. For classification, we again find that Newton boosting performs substantially better than gradient and hybrid gradient-Newton boosting for all simulated datasets. Further, Newton boosting with the new equivalent sample size per leaf tuning parameter performs better than XGBoost. For Poisson and Gamma regression as well as the boosted Tobit model, we observe only small differences among the different boosting methods, and there is no method that is consistently better than the other approaches.

We also redo the simulations using smaller datasets. I.e., we run 100 repetitions in which 500 independent data points are sampled as training, validation, and test data, respectively, from the same data-generating processes as described above. The results are reported in Appendix D in Figures 9 and 10 as well as Table 6. In this case, we also find that Newton boosting has significantly better predictive accuracy on all classification datasets. For the regression tasks, there are again only small differences among the different boosting methods, but Newton boosting performs best in six out of nine datasets.

Figure 3: Comparison of boosting methods on simulated data using out-of-sample error rates for classification and negative log-likelihoods for regression. The red rhombi represent means.
Figure 4: Comparison of boosting methods on simulated data using out-of-sample error rates for classification and negative log-likelihoods for regression relative to the best method in each run. The red rhombi represent means.
Data CompareAll NewtonVsGrad NewtonVsHybrid NewtonVsXGBoost
bin_classif 0 6.27e-08 6.51e-06 1.49e-07
bin_classif_fht 0 5.89e-10 5.67e-07 3.09e-08
multi_classif 0 4e-10 2.4e-06 8.08e-12
multi_classif_fht 0 3.98e-13 8.06e-11 7.94e-09
poisson_r 1.28e-07 6.83e-05 0.00134 0.0687
poisson_f1 5.18e-08 0.000108 0.352 0.000518
poisson_f3 1.44e-07 1.33e-05 0.00883 0.000231
gamma_r 1.82e-07 0.000195 0.264 0.0766
gamma_f1 0.00594 0.0305 0.289 0.991
gamma_f3 0.01 0.07 3.97e-06 0.00478
tobit_r 0.637 0.308 0.993
tobit_f1 0.618 0.477 0.838
tobit_f3 2.35e-06 0.000169 0.000161
Table 3: Comparison of different boosting methods on simulated datasets using negative log-likelihoods on the test sets. The column ’CompareAll’ contains p-values of F-tests that compare differences among all methods. The other columns contain p-values of t-tests comparing Newton boosting with our novel tuning parameter to the other boosting variants.

5 Conclusions

We compare gradient and Newton boosting as well as a hybrid variant of the two with trees as base learners on a wide range of real and simulated datasets. In addition, we introduce a novel tuning parameter for Newton boosting, which arises naturally as the analog of the minimum number of samples per leaf in gradient boosting with trees as base learners. Our results show that Newton boosting consistently and significantly outperforms gradient and hybrid gradient-Newton boosting for both binary and multiclass classification. In addition, Newton boosting with our newly proposed tuning parameter substantially outperforms Newton boosting as implemented in XGBoost, which uses a different tuning parameter for the minimum number of samples per leaf. For the regression datasets considered, we find only relatively small differences among the methods. We must note, though, that our study on regression models with loss functions with non-trivial second derivatives, such as Poisson and Gamma regression, is less comprehensive than the one for classification since there exist relatively few public datasets and established simulation settings with sufficiently complex mean functions. It remains to be investigated whether the three different boosting approaches do indeed show the same predictive accuracy in general for regression models. Further, future research should also shed light on the reason why Newton boosting performs better than gradient and hybrid gradient-Newton boosting with trees as base learners.

References

  • Breiman [1996] L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
  • Breiman [1998] L. Breiman. Arcing classifiers. Annals of Statistics, pages 801–824, 1998.
  • Breiman [1999] L. Breiman. Prediction games and arcing algorithms. Neural computation, 11(7):1493–1517, 1999.
  • Breiman et al. [1984] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC press, 1984.
  • Bühlmann and Hothorn [2007] P. Bühlmann and T. Hothorn. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, pages 477–505, 2007.
  • Chen and Guestrin [2016] T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.
  • Efron and Tibshirani [1994] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC press, 1994.
  • Freund and Schapire [1995] Y. Freund and R. E. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory, pages 23–37. Springer, 1995.
  • Freund et al. [1996] Y. Freund, R. E. Schapire, et al. Experiments with a new boosting algorithm. In Icml, volume 96, pages 148–156. Bari, Italy, 1996.
  • Friedman et al. [2000] J. Friedman, T. Hastie, R. Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000.
  • Friedman [1991] J. H. Friedman. Multivariate adaptive regression splines. The annals of statistics, pages 1–67, 1991.
  • Friedman [2001] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
  • Friedman [2002] J. H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
  • Guyon [2003] I. Guyon. Design of experiments of the nips 2003 variable selection benchmark, 2003.
  • Hothorn et al. [2010] T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. Model-based boosting 2.0. Journal of Machine Learning Research, 11(Aug):2109–2113, 2010.
  • Ke et al. [2017] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3149–3157, 2017.
  • Mason et al. [2000] L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In Advances in neural information processing systems, pages 512–518, 2000.
  • Mayr et al. [2014a] A. Mayr, H. Binder, O. Gefeller, and M. Schmid. The evolution of boosting algorithms. Methods of information in medicine, 53(06):419–427, 2014a.
  • Mayr et al. [2014b] A. Mayr, H. Binder, O. Gefeller, and M. Schmid. Extending statistical boosting. Methods of information in medicine, 53(06):428–435, 2014b.
  • McCullagh and Nelder [1989] P. McCullagh and J. A. Nelder. Generalized linear models, volume 37. CRC press, 1989.
  • Mease and Wyner [2008] D. Mease and A. Wyner. Evidence contrary to the statistical view of boosting. Journal of Machine Learning Research, 9(Feb):131–156, 2008.
  • Meng et al. [2016] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1):1235–1241, 2016.
  • Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Ponomareva et al. [2017] N. Ponomareva, S. Radpour, G. Hendry, S. Haykal, T. Colthurst, P. Mitrichev, and A. Grushetsky. Tf boosted trees: A scalable tensorflow based framework for gradient boosting. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 423–427. Springer, 2017.
  • Ridgeway [2007] G. Ridgeway. Generalized boosted models: A guide to the gbm package. Update, 1(1):2007, 2007.
  • Ridgeway [2017] G. Ridgeway. gbm: Generalized Boosted Regression Models, 2017. URL https://CRAN.R-project.org/package=gbm. R package version 2.1.3.
  • Ridgeway [1999] G. K. Ridgeway. Generalization of boosting algorithms and applications of bayesian inference for massive datasets. PhD thesis, 1999.
  • Saberian et al. [2011] M. J. Saberian, H. Masnadi-Shirazi, and N. Vasconcelos. Taylorboost: First and second-order boosting algorithms with explicit margin control. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2929–2934. IEEE, 2011.
  • Schapire [1990] R. E. Schapire. The strength of weak learnability. Machine learning, 5(2):197–227, 1990.
  • Schapire [2003] R. E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149–171. Springer, 2003.
  • Schapire and Freund [2012] R. E. Schapire and Y. Freund. Boosting: Foundations and algorithms. MIT press, 2012.
  • Sigrist and Hirnschall [2017] F. Sigrist and C. Hirnschall. Grabit: Gradient tree boosted tobit models for default prediction. arXiv preprint arXiv:1711.08695, 2017.
  • Wyner et al. [2017] A. J. Wyner, M. Olson, J. Bleich, and D. Mease. Explaining the success of adaboost and random forests as interpolating classifiers. Journal of Machine Learning Research, 18(48):1–33, 2017.

Appendix A Loss functions for regression and classification tasks

In the following, we list the loss functions and corresponding gradients and second derivatives that we consider in this article. A short illustrative code sketch for the binary classification and Poisson regression cases is given after the list.

  • Binary classification

    Loss:
    Gradient:
    Hessian:

  • Multiclass classification

    Loss:
    Gradient:
    Hessian:
    As in Friedman et al. [2000], we use for simplicity.

  • Poisson regression

    Loss:
    Gradient:
    Hessian:

  • Gamma regression
    with shape and rate ,
    Loss:
    Gradient:
    Hessian:

  • Tobit model
    , with mean , , and variance of the latent variable and lower and upper censoring thresholds and
    Loss:


    Gradient:


    Hessian:
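
As mentioned at the beginning of this appendix, the following is a short sketch of the gradients and Hessians for two of these losses, assuming labels $y \in \{0, 1\}$ with a logistic link for binary classification and a log link for the Poisson mean. These are standard parametrizations and are stated here as assumptions.

```python
import numpy as np

def binary_logistic_grad_hess(y, F):
    """Bernoulli log-loss with logistic link; F is the log-odds, y in {0, 1}.
    Loss: log(1 + exp(F)) - y * F."""
    p = 1.0 / (1.0 + np.exp(-F))
    return p - y, p * (1.0 - p)

def poisson_grad_hess(y, F):
    """Poisson negative log-likelihood with log link; F is the log-mean.
    Loss (up to a constant): exp(F) - y * F."""
    mu = np.exp(F)
    return mu - y, mu
```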

Appendix B Results for real datasets when the default value for the minimum number of (weighted) samples per leaf parameter is used

Figure 5: Comparison of boosting methods on real classification datasets using out-of-sample error rates for classification and negative log-likelihoods for regression. The minimum number of (weighted) samples parameter is set to the default value and no tuning is done for this parameter. The red rhombi represent means.

Figure 6: Comparison of boosting methods on real classification datasets using error rates for classification and negative log-likelihoods for regression relative to the best method in each run. The minimum number of (weighted) samples parameter is set to the default value and no tuning is done for this parameter. The red rhombi represent means.
Data CompareAll NewtonVsGrad NewtonVsHybrid NewtonVsXGBoost
adult 1.45e-12 0.143 0.0003 3.21e-08
bank 4.76e-11 0.00417 0.391 2.45e-05
cancer 0 0 3.61e-11 0
ijcnn 0 6e-11 2.9e-07 2.96e-07
ionosphere 0 0 7.83e-09 0
sonar 0 4.88e-15 1.11e-06 0.00748
car 0 0 0 0
covtype 0 4.45e-10 9.42e-06 8.79e-10
digits 0 0 1.98e-14 7.77e-15
glass 0 0 1.55e-15 0
letter 0 1.37e-13 6.27e-10 8.25e-12
satimage 0 2.69e-11 3.02e-06 2.13e-14
smartphone 0 2.15e-09 3.24e-08 1.01e-08
usps 0 2.15e-11 2.05e-11 1.25e-10
insurance 2.8e-14 3.84e-07 0.825 3.44e-06
Table 4: Comparison of different boosting methods on real datasets using negative log-likelihoods on the test sets. The minimum number of (weighted) samples per leaf parameter is set to the default value and no tuning is done for this parameter. The column ’CompareAll’ contains p-values of F-tests that compare differences among all methods. The other columns contain p-values of t-tests comparing Newton boosting with our novel tuning parameter to the other boosting variants.

Appendix C Results for real classification datasets using out-of-sample error rate

Figure 7: Comparison of boosting methods on real classification datasets using out-of-sample error rate. The red rhombi represent means.
Figure 8: Comparison of boosting methods on real classification datasets using out-of-sample error rate relative to the best method in each run. The red rhombi represent means.
Data CompareAll NewtonVsGrad NewtonVsHybrid NewtonVsXGBoost
adult 1.29e-07 0.251 0.0253 0.00211
bank 6.07e-05 0.000418 0.477 0.284
cancer 0 1.8e-09 0.373 3.7e-12
ijcnn 5.92e-12 6.81e-06 5.7e-05 0.0124
ionosphere 0 4.01e-12 0.00165 9.92e-13
sonar 0 9.99e-15 0.00149 7.77e-15
car 0 4.27e-13 0.288 0
covtype 1.41e-06 1.89e-05 0.0105 0.000174
digits 0 5.23e-10 0.0013 2.25e-12
glass 1.67e-15 4.38e-11 1.96e-05 1.81e-09
letter 1.33e-15 7.34e-09 1.03e-05 7.68e-06
satimage 2.38e-07 1.5e-06 0.00408 0.00307
smartphone 1.24e-06 0.00086 0.135 0.000105
usps 3.63e-07 0.000663 0.00131 1.05e-05
Table 5: Comparison of different boosting methods on real datasets using error rates on the test sets for classification. The column ’CompareAll’ contains p-values of F-tests that compare differences among all methods. The other columns contain p-values of t-tests comparing Newton boosting with our novel tuning parameter to the other boosting variants.

Appendix D Results for simulated datasets using 500 data points as training, validation, and test data

Figure 9: Comparison of boosting methods on simulated data using error rates for classification and negative log-likelihoods for regression. 500 data points are simulated 100 times as training, validation, and test data. The red rhombi represent means.
Figure 10: Comparison of boosting methods on simulated data using error rates for classification and negative log-likelihoods for regression relative to the best method in each run. 500 data points are simulated 100 times as training, validation, and test data. The red rhombi represent means.
Data CompareAll NewtonVsGrad NewtonVsHybrid NewtonVsXGBoost
bin_classif 0 0 0 0
bin_classif_fht 0 0 0 0
multi_classif 0 0 0 0
multi_classif_fht 0 0 6.89e-11 0
poisson_r 0.000818 0.039 0.183 0.0639
poisson_f1 3.36e-06 0.0511 5.07e-05 0.701
poisson_f3 0 2.22e-16 0.0504 0
gamma_r 0 3.4e-11 0 3.15e-14
gamma_f1 2.32e-11 0.00433 0.0126 4.73e-05
gamma_f3 1.27e-12 5.53e-05 0.52 4.07e-09
tobit_r 0.0311 0.0266 0.478
tobit_f1 0.000233 3.7e-05 0.022
tobit_f3 0.015 0.0841 0.386
Table 6: Comparison of different boosting methods on simulated datasets using negative log-likelihoods on the test sets. The column ’CompareAll’ contains p-values of F-tests that compare differences among all methods. The other columns contain p-values of t-tests comparing Newton boosting with our novel tuning parameter to the other boosting variants.