Online Gradient Boosting

Abstract

We extend the theory of boosting for regression problems to the online learning setting. Generalizing from the batch setting for boosting, the notion of a weak learning algorithm is modeled as an online learning algorithm with linear loss functions that competes with a base class of regression functions, while a strong learning algorithm is an online learning algorithm with smooth convex loss functions that competes with a larger class of regression functions. Our main result is an online gradient boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the linear span of the base class. We also give a simpler boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the convex hull of the base class, and prove its optimality.

1 Introduction

Boosting algorithms [21] are ensemble methods that convert a learning algorithm for a base class of models with weak predictive power, such as decision trees, into a learning algorithm for a class of models with stronger predictive power, such as a weighted majority vote over base models in the case of classification, or a linear combination of base models in the case of regression.

Boosting methods such as AdaBoost [9] and Gradient Boosting [10] have found tremendous practical application, especially using decision trees as the base class of models. These algorithms were developed in the batch setting, where training is done over a fixed batch of sample data. However, with the recent explosion of huge data sets which do not fit in main memory, training in the batch setting is infeasible, and online learning techniques which train a model in one pass over the data have proven extremely useful.

A natural goal therefore is to extend boosting algorithms to the online learning setting. Indeed, there has already been some work on online boosting for classification problems [20]. Of these works, the one by [4] provided the first theoretical study of online boosting for classification, which was later generalized by [2] to obtain optimal and adaptive online boosting algorithms.

However, extending boosting algorithms for regression to the online setting has been elusive and has escaped theoretical guarantees thus far. In this paper, we rigorously formalize the setting of online boosting for regression and then extend the very commonly used gradient boosting methods [10] to the online setting, providing theoretical guarantees on their performance.

The main result of this paper is an online boosting algorithm that competes with any linear combination of the base functions, given an online linear learning algorithm over the base class. This algorithm is the online analogue of the batch boosting algorithm of [24], and in fact our algorithmic technique, when specialized to the batch boosting setting, provides exponentially better convergence guarantees.

We also give an online boosting algorithm that competes with the best convex combination of base functions. This is a simpler algorithm which is analyzed along the lines of the Frank-Wolfe algorithm [8]. While the algorithm has weaker theoretical guarantees, it can still be useful in practice. We also prove that this algorithm obtains the optimal regret bound (up to constant factors) for this setting.

Finally, we conduct some proof-of-concept experiments which show that our online boosting algorithms do obtain performance improvements over different classes of base learners.

1.1 Related Work

While the theory of boosting for classification in the batch setting is well-developed (see [21]), the theory of boosting for regression is comparatively sparse. The foundational theory of boosting for regression can be found in the statistics literature [14], where boosting is understood as a greedy stagewise algorithm for fitting of additive models. The goal is to achieve the performance of linear combinations of base models, and to prove convergence to the performance of the best such linear combination.

While the earliest works on boosting for regression such as [10] do not have such convergence proofs, later works such as [19] do have convergence proofs but without a bound on the speed of convergence. Bounds on the speed of convergence have been obtained by [7] relying on a somewhat strong assumption on the performance of the base learning algorithm. A different approach to boosting for regression was taken by [9], who give an algorithm that reduces the regression problem to classification and then applies AdaBoost; the corresponding proof of convergence relies on an assumption on the induced classification problem which may be hard to satisfy in practice. The strongest result is that of [24], who prove convergence to the performance of the best linear combination of base functions, along with a bound on the rate of convergence, making essentially no assumptions on the performance of the base learning algorithm. [22] proves similar results for logistic (or similar) loss using a slightly simpler boosting algorithm.

The results in this paper are a generalization of the results of [24] to the online setting. However, we emphasize that this generalization is nontrivial and requires different algorithmic ideas and proof techniques. Indeed, we were not able to directly generalize the analysis in [24] by simply adapting the techniques used in recent online boosting work [4]; instead, we make use of the classical Frank-Wolfe algorithm [8]. On the other hand, while an important part of the convergence analysis for the batch setting is to show statistical consistency of the algorithms [24], in the online setting we only need to study the empirical convergence (that is, the regret), which makes our analysis much more concise.

2 Setup

Examples are chosen from a feature space $\mathcal{X}$, and the prediction space is $\mathbb{R}^d$. Let $\|\cdot\|$ denote some norm in $\mathbb{R}^d$. In the setting for online regression, in each round $t = 1, 2, \ldots, T$, an adversary selects an example $\mathbf{x}_t \in \mathcal{X}$ and a loss function $\ell_t: \mathbb{R}^d \to \mathbb{R}$, and presents $\mathbf{x}_t$ to the online learner. The online learner outputs a prediction $\mathbf{y}_t \in \mathbb{R}^d$, obtains the loss function $\ell_t$, and incurs loss $\ell_t(\mathbf{y}_t)$.

Let $\mathcal{F}$ denote a reference class of regression functions $f: \mathcal{X} \to \mathbb{R}^d$, and let $\mathcal{C}$ denote a class of loss functions $\ell: \mathbb{R}^d \to \mathbb{R}$. Also, let $R: \mathbb{N} \to \mathbb{R}_+$ be a non-decreasing function. We say that the function class $\mathcal{F}$ is online learnable for losses in $\mathcal{C}$ with regret $R$ if there is an online learning algorithm $\mathcal{A}$ that, for every $T \in \mathbb{N}$ and every sequence $(\mathbf{x}_t, \ell_t) \in \mathcal{X} \times \mathcal{C}$ for $t = 1, 2, \ldots, T$ chosen by the adversary, generates predictions $\mathbf{y}_t$ such that

$$\sum_{t=1}^{T} \ell_t(\mathbf{y}_t) \;\le\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell_t(f(\mathbf{x}_t)) + R(T). \qquad (1)$$

If the online learning algorithm is randomized, we require the above bound to hold with high probability.

The above definition is simply the online generalization of standard empirical risk minimization (ERM) in the batch setting. A concrete example is 1-dimensional regression, i.e. the prediction space is $\mathbb{R}$. For a labeled data point $(\mathbf{x}, y^*)$, the loss for the prediction $y$ is given by $\ell(y^*, y)$, where $\ell$ is a fixed loss function that is convex in the second argument (such as squared loss, logistic loss, etc.). Given a batch of labeled data points $(\mathbf{x}_t, y^*_t)$ for $t = 1, 2, \ldots, T$ and a base class of regression functions $\mathcal{F}$ (say, the set of bounded-norm linear regressors), an ERM algorithm finds the function $f \in \mathcal{F}$ that minimizes $\sum_{t=1}^T \ell(y^*_t, f(\mathbf{x}_t))$.

In the online setting, the adversary reveals the data in an online fashion, only presenting the true label $y^*_t$ after the online learner has chosen a prediction $y_t$. Thus, setting $\ell_t(y) = \ell(y^*_t, y)$, we observe that if $\mathcal{A}$ satisfies the regret bound (Equation 1), then it makes predictions with total loss almost as small as that of the empirical risk minimizer, up to the regret term. If $\mathcal{F}$ is the set of all bounded-norm linear regressors, for example, the algorithm $\mathcal{A}$ could be online gradient descent [25] or Online Newton Step [16].
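To make the protocol concrete, the following is a minimal sketch of this setting, assuming 1-dimensional regression with squared loss and online gradient descent over bounded-norm linear regressors as the learner $\mathcal{A}$. The class name, norm bound, and learning rate are illustrative choices, not from the paper.

```python
import numpy as np

class OGDRegressor:
    def __init__(self, dim, norm_bound=1.0, lr=0.1):
        self.w = np.zeros(dim)
        self.B = norm_bound
        self.lr = lr

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y_star):
        # gradient of the squared loss (y - y*)^2 / 2 with respect to w
        g = (self.predict(x) - y_star) * x
        self.w -= self.lr * g
        # project w back onto the norm ball {w : ||w|| <= B}
        n = np.linalg.norm(self.w)
        if n > self.B:
            self.w *= self.B / n

rng = np.random.default_rng(0)
learner, total_loss = OGDRegressor(dim=5), 0.0
for t in range(1000):
    x = rng.normal(size=5)
    y = learner.predict(x)         # predict before the label is revealed
    y_star = float(np.tanh(x[0]))  # nature/adversary reveals the label
    total_loss += 0.5 * (y - y_star) ** 2
    learner.update(x, y_star)
```

The running total of losses, compared against that of the best fixed regressor in hindsight, is exactly the regret quantity bounded in (Equation 1).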

At a high level, in the batch setting, "boosting" is understood as a procedure that, given a batch of data and access to an ERM algorithm for a function class (this is called a "weak" learner), obtains an approximate ERM algorithm for a richer function class (this is called a "strong" learner). Generally, the richer class is the set of finite linear combinations of functions in the base class. The efficiency of boosting is measured by how many times, $N$, the base ERM algorithm needs to be called (i.e., the number of boosting steps) to obtain an ERM algorithm for the richer function class within the desired approximation tolerance. Convergence rates [24] give bounds on how quickly the approximation error goes to $0$ as the number of boosting stages $N$ goes to $\infty$.

We now extend this notion of boosting to the online setting in the natural manner. To capture the full generality of the techniques, we also specify a class of loss functions that the online learning algorithm can work with. Informally, an online boosting algorithm is a reduction that, given access to an online learning algorithm $\mathcal{A}$ for a function class $\mathcal{F}$ and loss function class $\mathcal{C}$ with regret $R$, and a bound $N$ on the total number of calls made in each iteration to copies of $\mathcal{A}$, obtains an online learning algorithm for a richer function class $\mathcal{F}'$, a richer loss function class $\mathcal{C}'$, and (possibly larger) regret $R'$. The bound $N$ on the total number of calls made to all the copies of $\mathcal{A}$ corresponds to the number of boosting stages in the batch setting, and in the online setting it may be viewed as a resource constraint on the algorithm. The efficacy of the reduction is measured by $R'$, which is a function of $R$, $N$, and certain parameters of the comparator class $\mathcal{F}'$ and loss function class $\mathcal{C}'$. We desire online boosting algorithms such that $R'(T)/T \to 0$ quickly as $N \to \infty$ and $T \to \infty$. We make the notions of richness in the above informal description more precise now.
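The reduction only needs a small interface from the base algorithm. The following sketch (our naming, not the paper's) records the contract assumed of an online linear learning algorithm: predict before the loss is revealed, then accept a linear loss as feedback.

```python
from typing import Protocol
import numpy as np

class OnlineLinearLearner(Protocol):
    def predict(self, x: np.ndarray) -> np.ndarray:
        """Return a prediction for example x, before the loss is revealed."""
        ...

    def update(self, x: np.ndarray, w: np.ndarray) -> None:
        """Receive the linear loss y -> w . y chosen by the adversary."""
        ...
```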

Comparator function classes. A given function class $\mathcal{F}$ is said to be $D$-bounded if for all $\mathbf{x} \in \mathcal{X}$ and all $f \in \mathcal{F}$, we have $\|f(\mathbf{x})\| \le D$. Throughout this paper, we assume that $\mathcal{F}$ is symmetric: i.e. if $f \in \mathcal{F}$, then $-f \in \mathcal{F}$; and that it contains the constant zero function, which we denote, with some abuse of notation, by $0$.

Given $\mathcal{F}$, we define two richer function classes: the convex hull of $\mathcal{F}$, denoted $\mathrm{CH}(\mathcal{F})$, is the set of convex combinations of a finite number of functions in $\mathcal{F}$, and the span of $\mathcal{F}$, denoted $\mathrm{span}(\mathcal{F})$, is the set of linear combinations of finitely many functions in $\mathcal{F}$. For any $f \in \mathrm{span}(\mathcal{F})$, define $\|f\|_1 := \inf\{\sum_{g \in S} |w_g| : f = \sum_{g \in S} w_g g,\ S \subseteq \mathcal{F} \text{ finite}\}$. Since functions in $\mathrm{span}(\mathcal{F})$ are not bounded, it is not possible to obtain a uniform regret bound for all functions in $\mathrm{span}(\mathcal{F})$: rather, the regret of an online learning algorithm for $\mathrm{span}(\mathcal{F})$ is specified in terms of regret bounds $R'_f$ for individual comparator functions $f \in \mathrm{span}(\mathcal{F})$, viz.

$$\sum_{t=1}^{T} \ell_t(\mathbf{y}_t) \;\le\; \sum_{t=1}^{T} \ell_t(f(\mathbf{x}_t)) + R'_f(T) \quad \text{for every } f \in \mathrm{span}(\mathcal{F}).$$

Loss function classes. The base loss function class we consider is the set of all linear loss functions $\mathbf{y} \mapsto \mathbf{w} \cdot \mathbf{y}$ with Lipschitz constant (with respect to $\|\cdot\|$) bounded by $1$. A function class that is online learnable with this loss function class is called online linear learnable for short. The richer loss function class we consider is denoted by $\mathcal{C}$ and is a set of convex loss functions satisfying some regularity conditions, specified in terms of certain parameters described below.

We define a few parameters of the class $\mathcal{C}$. For any $D > 0$, let $\mathcal{B}(D) := \{\mathbf{y} \in \mathbb{R}^d : \|\mathbf{y}\| \le D\}$ be the ball of radius $D$. The class $\mathcal{C}$ is said to have Lipschitz constant $L_D$ on $\mathcal{B}(D)$ if for all $\ell \in \mathcal{C}$ and all $\mathbf{y} \in \mathcal{B}(D)$ there is an efficiently computable subgradient $\nabla \ell(\mathbf{y})$ with norm at most $L_D$. Next, $\mathcal{C}$ is said to be $\beta_D$-smooth on $\mathcal{B}(D)$ if for all $\ell \in \mathcal{C}$ and all $\mathbf{y}, \mathbf{y}' \in \mathcal{B}(D)$ we have

$$\ell(\mathbf{y}') \;\le\; \ell(\mathbf{y}) + \nabla \ell(\mathbf{y}) \cdot (\mathbf{y}' - \mathbf{y}) + \frac{\beta_D}{2}\|\mathbf{y}' - \mathbf{y}\|^2.$$

Next, define the projection operator $\Pi_D: \mathbb{R}^d \to \mathcal{B}(D)$ as $\Pi_D(\mathbf{y}) := \arg\min_{\mathbf{y}' \in \mathcal{B}(D)} \|\mathbf{y} - \mathbf{y}'\|$, and define $\epsilon_D$ to be a bound on the excess loss incurred by projection, i.e. a constant such that $\ell(\Pi_D(\mathbf{y})) \le \ell(\mathbf{y}) + \epsilon_D$ for all $\ell \in \mathcal{C}$ and all points $\mathbf{y}$ encountered by the algorithm.

3 Online Boosting Algorithms

The setup is that we are given a $D$-bounded reference class of functions $\mathcal{F}$ with an online linear learning algorithm $\mathcal{A}$ with regret bound $R$. For normalization, we also assume that the output of $\mathcal{A}$ at any time is bounded in norm by $D$, i.e. $\|\mathcal{A}(\mathbf{x}_t)\| \le D$ for all $t$. We further assume that for every $D' > 0$, we can compute a Lipschitz constant $L_{D'}$, a smoothness parameter $\beta_{D'}$, and the parameter $\epsilon_{D'}$ for the class $\mathcal{C}$ over $\mathcal{B}(D')$. Furthermore, the online boosting algorithm may make up to $N$ calls per iteration to any copies of $\mathcal{A}$ it maintains, for a given budget parameter $N$.

Given this setup, our main result is an online gradient boosting algorithm, Algorithm 1, competing with $\mathrm{span}(\mathcal{F})$. The algorithm maintains $N$ copies of $\mathcal{A}$, denoted $\mathcal{A}^1, \mathcal{A}^2, \ldots, \mathcal{A}^N$. Each copy corresponds to one stage in boosting. When it receives a new example $\mathbf{x}_t$, it passes it to each $\mathcal{A}^i$ and obtains their predictions $\mathcal{A}^i(\mathbf{x}_t)$, which it then combines into a prediction for $\mathbf{y}_t$ using a linear combination. At the most basic level, this linear combination is simply the sum of all the predictions scaled by a step size parameter $\eta$. Two tweaks are made to this sum in step 8 to facilitate the analysis:

  1. While constructing the sum, the partial sum $\mathbf{y}_t^{i-1}$ is multiplied by a shrinkage factor $\sigma_t^i$. This shrinkage term is tuned using an online gradient descent algorithm in step 14. The goal of the tuning is to induce the partial sums to be aligned with a descent direction for the loss functions, as measured by the inner product $\nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot \mathbf{y}_t^{i-1}$.

  2. The partial sums are made to lie in $\mathcal{B}(D')$, for some parameter $D' \ge D$, by using the projection operator $\Pi_{D'}$. This is done to ensure that the Lipschitz constant and smoothness of the loss function are suitably bounded.

Once the boosting algorithm makes the prediction $\mathbf{y}_t$ and obtains the loss function $\ell_t$, each $\mathcal{A}^i$ is updated using a suitably scaled linear approximation to the loss function at the partial sum $\mathbf{y}_t^{i-1}$, i.e. the linear loss function $\mathbf{y} \mapsto \frac{1}{L_{D'}}\nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot \mathbf{y}$. This forces $\mathcal{A}^i$ to produce predictions that are aligned with a descent direction for the loss function.
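The following is a minimal sketch of this prediction/update loop for scalar predictions. The weak-learner class, the step size $\eta$, the shrinkage learning rate, and the radius $D'$ are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

class LinearWeakLearner:
    # Minimal base learner: online gradient descent on a linear model, fed
    # linear losses y -> g * y (so the gradient with respect to w is g * x).
    def __init__(self, dim, lr=0.05, bound=1.0):
        self.w, self.lr, self.bound = np.zeros(dim), lr, bound

    def predict(self, x):
        return float(np.clip(self.w @ x, -self.bound, self.bound))

    def update(self, x, g):
        self.w -= self.lr * g * x

class OnlineGradientBooster:
    # Sketch of the scheme described above, for scalar predictions.
    def __init__(self, base_learners, eta=0.1, D_prime=2.0, sigma_lr=0.01):
        self.learners = base_learners             # copies A^1, ..., A^N
        self.sigma = np.ones(len(base_learners))  # shrinkage factors in [0, 1]
        self.eta, self.D, self.sigma_lr = eta, D_prime, sigma_lr

    def predict(self, x):
        self.partial = [0.0]                      # partial sums y^0, ..., y^N
        y = 0.0
        for i, A in enumerate(self.learners):
            y = self.sigma[i] * y + self.eta * A.predict(x)
            y = float(np.clip(y, -self.D, self.D))  # projection onto B(D')
            self.partial.append(y)
        return y

    def update(self, x, grad_fn):
        # grad_fn(y) returns a subgradient of the current round's loss at y
        for i, A in enumerate(self.learners):
            g = grad_fn(self.partial[i])
            A.update(x, g)                        # linear-loss feedback
            # online gradient descent step on the shrinkage factor
            self.sigma[i] = float(np.clip(
                self.sigma[i] - self.sigma_lr * g * self.partial[i], 0.0, 1.0))

rng = np.random.default_rng(1)
booster = OnlineGradientBooster([LinearWeakLearner(3) for _ in range(20)])
for t in range(500):
    x = rng.normal(size=3)
    y = booster.predict(x)
    y_star = float(x[0] - 0.5 * x[1])
    booster.update(x, lambda v: v - y_star)  # gradient of (v - y*)^2 / 2
```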

We provide the analysis of the algorithm in Section 4.2, which yields the regret bound for the algorithm stated in Theorem 1.

The regret bound in this theorem depends on several parameters, such as $D'$, $L_{D'}$, $\beta_{D'}$, and $\epsilon_{D'}$. In applications of the algorithm for 1-dimensional regression with commonly used loss functions, however, these parameters are essentially modest constants; see Section 3.1 for calculations of the parameters for various loss functions. Furthermore, if the step size $\eta$ is appropriately set, then the average regret $R'_f(T)/T$ clearly converges to $0$ as $N \to \infty$ and $T \to \infty$. While the requirement that $N \to \infty$ may raise concerns about computational efficiency, this is in fact analogous to the guarantee in the batch setting: the algorithms converge only when the number of boosting stages goes to infinity. Moreover, our lower bound (Theorem 2) shows that this is indeed necessary.

We also present a simpler boosting algorithm, Algorithm 2, that competes with $\mathrm{CH}(\mathcal{F})$. Algorithm 2 is similar to Algorithm 1, with some simplifications: the final prediction is simply a convex combination of the predictions of the base learners, with no projections or shrinkage necessary. While Algorithm 1 is more general, Algorithm 2 may still be useful in practice when a bound on the norm of the comparator function is known in advance, using the observations in Section 5.2. Furthermore, its analysis is cleaner and easier to understand for readers who are familiar with the Frank-Wolfe method, and it serves as a foundation for the analysis of Algorithm 1. Algorithm 2 has an optimal (up to constant factors) regret bound, as given in Theorem 2, proved in Section 4.1. The upper bound in this theorem is proved along the lines of the Frank-Wolfe [8] algorithm, and the lower bound using information-theoretic arguments.
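A sketch of this simpler booster follows, using the same weak-learner interface as the previous sketch. The Frank-Wolfe style step sizes $\eta_i = 2/(i+1)$ follow our reconstruction of the analysis and are an assumption.

```python
import numpy as np

class OnlineConvexHullBooster:
    # Sketch: the prediction is a convex combination of base predictions,
    # built with Frank-Wolfe style step sizes eta_i = 2 / (i + 1).
    def __init__(self, base_learners):
        self.learners = base_learners

    def predict(self, x):
        self.partial = [0.0]
        y = 0.0
        for i, A in enumerate(self.learners, start=1):
            eta = 2.0 / (i + 1)
            y = (1 - eta) * y + eta * A.predict(x)  # stays in the convex hull
            self.partial.append(y)
        return y

    def update(self, x, grad_fn):
        for i, A in enumerate(self.learners):
            A.update(x, grad_fn(self.partial[i]))   # linearized feedback
```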

The dependence of the regret bound on the base regret $R(T)$ is unimprovable without additional assumptions: otherwise, Algorithm 2 would itself be an online linear learning algorithm over $\mathcal{F}$ with better than $R(T)$ regret.

Using a deterministic base online linear learning algorithm. If the base online linear learning algorithm $\mathcal{A}$ is deterministic, then our results can be improved, because our online boosting algorithms are also deterministic, and using a standard simple reduction, we can now allow $\mathcal{C}$ to be any set of convex loss functions (smooth or not) with a computable Lipschitz constant $L_{D'}$ over the domain $\mathcal{B}(D')$ for any $D' > 0$.

This reduction converts arbitrary convex loss functions into linear functions: viz. if $\mathbf{y}_t$ is the output of the online boosting algorithm, then the loss function provided to the boosting algorithm as feedback is the linear function $\mathbf{y} \mapsto \nabla \ell_t(\mathbf{y}_t) \cdot \mathbf{y}$. This reduction immediately implies that the base online linear learning algorithm $\mathcal{A}$, when fed the scaled loss functions $\mathbf{y} \mapsto \frac{1}{L_D}\nabla \ell_t(\mathbf{y}_t) \cdot \mathbf{y}$, is already an online learning algorithm for $\mathcal{F}$ with losses in $\mathcal{C}$ with the regret bound $L_D R(T)$.
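A sketch of this reduction is below. Each round supplies the example, the convex loss, and a subgradient oracle; the function names and the scaling by the Lipschitz constant (to keep the linear feedback 1-Lipschitz) are our assumptions.

```python
def run_with_convex_losses(learner, rounds, lipschitz):
    # rounds: iterable of (x, loss, grad) triples, where loss and grad are
    # the round's convex loss and a subgradient oracle for it.
    total = 0.0
    for x, loss, grad in rounds:
        y = learner.predict(x)                   # predict first
        total += loss(y)                         # suffer the convex loss
        learner.update(x, grad(y) / lipschitz)   # 1-Lipschitz linear feedback
    return total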

As for competing with $\mathrm{span}(\mathcal{F})$, since linear loss functions are $0$-smooth, we obtain an easy corollary of Theorem 1 for this setting.

3.1 The parameters for several basic loss functions

In this section we consider the application of our results to 1-dimensional regression, where we assume, for normalization, that the true labels of the examples and the predictions of the functions in the class $\mathcal{F}$ are in $[-1, 1]$. In this case, $\|\cdot\|$ denotes the absolute value norm. Thus, in each round, the adversary chooses a labeled data point $(\mathbf{x}_t, y^*_t) \in \mathcal{X} \times [-1, 1]$, and the loss for the prediction $y_t$ is given by $\ell_t(y_t) = \ell(y^*_t, y_t)$, where $\ell$ is a fixed loss function that is convex in the second argument. Note that $D = 1$ in this setting. We give examples of several such loss functions below, and compute the parameters $L_D$ and $\beta_D$ for every $D \ge 1$; the parameter $\epsilon_D$ can be computed similarly from its definition. A code sketch of these losses is given after the list.

  1. Linear loss: $\ell(y^*, y) = -y^* y$. We have $L_D = 1$ and $\beta_D = 0$.

  2. $p$-norm loss, for some $p \ge 2$: $\ell(y^*, y) = |y - y^*|^p$. On $\mathcal{B}(D)$ we have $L_D = p(D+1)^{p-1}$ and $\beta_D = p(p-1)(D+1)^{p-2}$.

  3. Modified least squares: $\ell(y^*, y) = \frac{1}{2}\max(0,\, 1 - y^* y)^2$. We have $L_D = 1 + D$ and $\beta_D = 1$.

  4. Logistic loss: $\ell(y^*, y) = \ln(1 + \exp(-y^* y))$. We have $L_D = 1$ and $\beta_D = \frac{1}{4}$.
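For concreteness, the following sketch codes up these four losses together with their derivatives in the prediction; the stated Lipschitz and smoothness parameters can be checked by bounding the derivatives on $[-D, D]$ with $|y^*| \le 1$. The formulas follow our reconstruction of the definitions above.

```python
import numpy as np

losses = {
    "linear":   (lambda ys, y: -ys * y,
                 lambda ys, y: -ys),
    "p_norm":   (lambda ys, y, p=3: abs(y - ys) ** p,
                 lambda ys, y, p=3: p * abs(y - ys) ** (p - 1) * np.sign(y - ys)),
    "mod_lsq":  (lambda ys, y: 0.5 * max(0.0, 1.0 - ys * y) ** 2,
                 lambda ys, y: -ys * max(0.0, 1.0 - ys * y)),
    "logistic": (lambda ys, y: float(np.log1p(np.exp(-ys * y))),
                 lambda ys, y: -ys / (1.0 + float(np.exp(ys * y)))),
}

for name, (loss, dloss) in losses.items():
    print(name, loss(0.5, 0.2), dloss(0.5, 0.2))
```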

4 Analysis

In this section, we analyze Algorithms 1 and 2.

4.1 Competing with convex combinations of the base functions

We give the analysis of Algorithm 2 before that of Algorithm 1 since it is easier to understand and provides the foundation for the analysis of Algorithm 1.

First, note that for any $i$, since $\mathbf{y} \mapsto \nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot \mathbf{y}$ is a linear function, we have

$$\inf_{g \in \mathrm{CH}(\mathcal{F})} \sum_{t=1}^{T} \nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot g(\mathbf{x}_t) \;=\; \inf_{g \in \mathcal{F}} \sum_{t=1}^{T} \nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot g(\mathbf{x}_t).$$

Let $f$ be any function in $\mathrm{CH}(\mathcal{F})$. The equality above and the fact that $\mathcal{A}^i$ is an online learning algorithm for $\mathcal{F}$ with regret bound $R$ for the $1$-Lipschitz linear loss functions imply (after scaling by $L_D$) that

$$\sum_{t=1}^{T} \nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot \mathcal{A}^i(\mathbf{x}_t) \;\le\; \sum_{t=1}^{T} \nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot f(\mathbf{x}_t) + L_D R(T).$$

Now define, for $i = 0, 1, \ldots, N$, $\Delta_i := \sum_{t=1}^T \ell_t(\mathbf{y}_t^i) - \sum_{t=1}^T \ell_t(f(\mathbf{x}_t))$. Using the $\beta_D$-smoothness of the losses, the update $\mathbf{y}_t^i = (1 - \eta_i)\,\mathbf{y}_t^{i-1} + \eta_i\,\mathcal{A}^i(\mathbf{x}_t)$ with step size $\eta_i = \frac{2}{i+1}$, the regret bound above, and the convexity of $\ell_t$, we have

$$\Delta_i \;\le\; (1 - \eta_i)\,\Delta_{i-1} + 2\eta_i^2\,\beta_D D^2 T + \eta_i\,L_D R(T).$$

For $i = 1$, since $\eta_1 = 1$, the above bound implies that $\Delta_1 \le 2\beta_D D^2 T + L_D R(T)$. Starting from this base case, an easy induction on $i$ proves that $\Delta_i \le \frac{8\beta_D D^2 T}{i+1} + L_D R(T)$. Applying this bound for $i = N$ completes the proof.
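For completeness, the induction step can be verified directly; the following display assumes our reconstructed recursion with $C := 2\beta_D D^2 T$ and step sizes $\eta_i = \frac{2}{i+1}$.

```latex
% Induction step for the claim \Delta_i <= 4C/(i+1) + L_D R(T),
% with C := 2 \beta_D D^2 T and \eta_i = 2/(i+1) (our reconstruction).
\begin{align*}
\Delta_i &\le (1-\eta_i)\,\Delta_{i-1} + \eta_i^2\,C + \eta_i\,L_D R(T) \\
         &\le \frac{i-1}{i+1}\cdot\frac{4C}{i} + \frac{4C}{(i+1)^2}
              + \left(\frac{i-1}{i+1} + \frac{2}{i+1}\right) L_D R(T) \\
         &=   \frac{4C}{i+1}\left(\frac{i-1}{i} + \frac{1}{i+1}\right) + L_D R(T)
          \;\le\; \frac{4C}{i+1} + L_D R(T),
\end{align*}
% using (i-1)/i + 1/(i+1) <= 1.
```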

We now show that the dependence of the regret bound of Algorithm 2 on the parameter $N$ is optimal up to constant factors.

Consider the following construction. At a high level, the setting is 1-dimensional regression with $\mathcal{C}$ corresponding to squared loss. The predictions and true labels of examples are in $[-1, 1]$.

Define $p := \frac{1}{2} + \varepsilon$ and $q := \frac{1}{2} - \varepsilon$, where $\varepsilon$ is a parameter of order $\frac{1}{\sqrt{N}}$, and let $P$ and $Q$ be two distributions over $\{0, 1\}^N$ where each bit is a Bernoulli random variable with parameter $p$ and $q$ respectively, chosen independently of the other bits. Consider a sequence of examples generated as follows: the example $\mathbf{x}_t$ is simply the round index $t$, and the label $y^*_t$ is chosen from $\{-1, 1\}$ uniformly at random in each round.

Let $b_t \in \{0, 1\}$, for $t = 1, 2, \ldots, T$, be independent uniform random bits. The function class $\mathcal{F}$ consists of a large number, $M$, of functions $f_j$, for $j = 1, 2, \ldots, M$. For each $j$ and $t$, we set $f_j(\mathbf{x}_t) = y^*_t$ w.p. $p$ if $b_t = 1$ (resp. w.p. $q$ if $b_t = 0$), and $f_j(\mathbf{x}_t) = -y^*_t$ otherwise, independently of all other values of $j$ and $t$.

The base online linear learning algorithm $\mathcal{A}$ is simply Hedge over the $M$ functions in $\mathcal{F}$. In each round, the Hedge algorithm selects one of the functions in $\mathcal{F}$ and uses that to predict the label, and for any sequence of examples, with high probability, incurs regret $O(\sqrt{T \log M})$.
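A minimal sketch of such a Hedge base learner follows (standard multiplicative-weights updates; the learning rate is illustrative and would be tuned to obtain the stated regret).

```python
import numpy as np

class Hedge:
    def __init__(self, M, eta=0.1, seed=0):
        self.logw = np.zeros(M)
        self.eta = eta
        self.rng = np.random.default_rng(seed)

    def select(self):
        # sample a function with probability proportional to exp(logw)
        p = np.exp(self.logw - self.logw.max())
        p /= p.sum()
        return int(self.rng.choice(len(p), p=p))

    def update(self, losses):
        # losses[j] = linear loss incurred by function j this round
        self.logw -= self.eta * np.asarray(losses)
```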

We set $\mathcal{C}$ to be the set of squared loss functions, i.e. functions of the form $y \mapsto (y - y^*)^2$ for $y^* \in [-1, 1]$. Note that these loss functions are $2$-smooth, and their Lipschitz constant on $\mathcal{B}(1)$ is at most $4$. In round $t$, the loss function is $\ell_t(y) = (y - y^*_t)^2$.

Consider the function $f := \frac{1}{M}\sum_{j=1}^{M} f_j$, which is in $\mathrm{CH}(\mathcal{F})$. Given any input sequence, it is easy to calculate the expected loss of $f$, and since the examples and the predictions of the functions on the examples are independent across iterations, a simple application of the multiplicative Chernoff bound implies that if $M$ is large enough, then with high probability the total loss $\sum_t \ell_t(f(\mathbf{x}_t))$ is close to its expectation.

Now suppose there is an online boosting algorithm, making at most $N$ calls total to all copies of $\mathcal{A}$ in each iteration, that for any large enough $T$ and for any sequence $(\mathbf{x}_t, \ell_t)$ for $t = 1, 2, \ldots, T$, outputs predictions $y_t$ whose regret to $f$ is, with high probability, smaller than the claimed lower bound. Then by a union bound, with constant probability, both this regret bound and the concentration for $f$ hold. By Markov's inequality and a union bound, with constant probability, for a uniform random time $t$, we have

$$|y_t - f(\mathbf{x}_t)| \;<\; |f(\mathbf{x}_t)|, \qquad (3)$$

or in other words, $y_t$ is on the same side of $0$ as $f(\mathbf{x}_t)$, and thus can be used to identify the bit $b_t$. In the rest of the proof, we will use this fact, along with the fact that the total variation distance between $P$ and $Q$, denoted $\|P - Q\|_{TV}$, is small, to derive a contradiction.

Define the random variable $Z$ as follows. For any bit string $b \in \{0, 1\}^N$, choose a random round $t \in \{1, 2, \ldots, T\}$, and simulate the online boosting process until round $t$ by sampling the labels $y^*_\tau$ and the outputs of the functions $f_j$ for all $j$ and $\tau < t$ from the appropriate distributions. In round $t$, let $g_1, g_2, \ldots$ be the functions that are obtained from the at most $N$ calls to copies of $\mathcal{A}$ (there could be repetitions). Assign their outputs on $\mathbf{x}_t$ using the bits of $b$ (being careful to repeat outputs for repeated functions), run the booster with these outputs to obtain $y_t$, and set $Z := \mathrm{sign}(y_t)$. Let $\Pr[\cdot \mid b]$ denote probability of events in this process for generating $Z$ given $b$.

Let $\mathbb{E}_P$ and $\mathbb{E}_Q$ denote the expectation of a random variable when $b$ is drawn from $P$ and $Q$ respectively, and let $\mathbb{E}_b$ denote the expectation of a random variable when $Z$ is sampled from the above process given $b$. The above analysis (inequality (3)) implies that $Z$ agrees with the sign of $f(\mathbf{x}_t)$, and hence identifies whether $b$ was drawn from $P$ or $Q$, with probability bounded away from $\frac{1}{2}$.

Now define a random variable $W := \mathbb{E}_b[Z]$, the conditional expectation of $Z$ given $b$. Since $|W| \le 1$, we have

$$\left|\mathbb{E}_P[W] - \mathbb{E}_Q[W]\right| \;\le\; 2\,\|P - Q\|_{TV},$$

and we conclude, using the above bound, that $\|P - Q\|_{TV}$ is bounded below by a positive constant. This is a contradiction, since for $\varepsilon$ a small enough constant times $\frac{1}{\sqrt{N}}$, we have $\|P - Q\|_{TV} = O(\varepsilon\sqrt{N})$, which can be made an arbitrarily small constant; the bound on $\|P - Q\|_{TV}$ is standard, see e.g. [15]. This gives us the desired contradiction.

The above result can be easily extended to any given parameters $D$ and $\beta$ such that the class $\mathcal{F}$ is $D$-bounded and $\mathcal{C}$ is $\beta$-smooth on $\mathcal{B}(D)$, giving a correspondingly scaled lower bound on the regret of an online boosting algorithm for $\mathrm{CH}(\mathcal{F})$ with losses in $\mathcal{C}$: we simply scale all function and label values by $D$, and consider the loss functions $\ell_t(y) = \frac{\beta}{2}(y - y^*_t)^2$. If there were an online boosting algorithm for $\mathrm{CH}(\mathcal{F})$ with these loss functions with smaller regret, then by scaling down the predictions by $D$, we would obtain an online boosting algorithm for exactly the setting in the proof of Theorem 2 with a smaller regret bound, which is a contradiction.

4.2 Competing with the span of the base functions

In this section we show that Algorithm 1 satisfies the regret bound claimed in Theorem 1.

Let $f = \sum_{j} w_j g_j$, for some finite set of functions $g_j \in \mathcal{F}$, where the $w_j$ are real coefficients. Since $\mathcal{F}$ is symmetric, we may assume that all $w_j \ge 0$, and let $\|w\|_1 := \sum_j w_j$. Furthermore, we may assume that the zero function $0 \in \mathcal{F}$ appears in this sum with an appropriate weight, so that $\|w\|_1$ can be taken to be any sufficiently large value. Note that $\|f\|_1$ is exactly the infimum of $\|w\|_1$ over all such ways of expressing $f$ as a finite weighted sum of functions in $\mathcal{F}$. We now prove that the bound stated in the theorem holds with $\|f\|_1$ replaced by $\|w\|_1$; the theorem then follows simply by taking the infimum of the bound over all such ways of expressing $f$.

Now, for each $i$, the update in line 14 of Algorithm 1 is exactly online gradient descent [25] on the domain $[0, 1]$ with the linear loss functions $\sigma \mapsto \sigma\,(\nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot \mathbf{y}_t^{i-1})$. Note that the derivative of this loss function is bounded as follows: $|\nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot \mathbf{y}_t^{i-1}| \le L_{D'} D'$. Since $\sigma_t^i \in [0, 1]$, the standard analysis of online gradient descent then implies that the sequence $\sigma_t^i$ for $t = 1, 2, \ldots, T$ satisfies

$$\sum_{t=1}^{T} \sigma_t^i\,(\nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot \mathbf{y}_t^{i-1}) \;\le\; \min_{\sigma \in [0,1]} \sum_{t=1}^{T} \sigma\,(\nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot \mathbf{y}_t^{i-1}) + O(L_{D'} D' \sqrt{T}).$$

Next, since $f = \sum_j w_j g_j$ with $w_j \ge 0$ and $\sum_j w_j = \|w\|_1$, we have

$$\inf_{g \in \mathcal{F}} \sum_{t=1}^{T} \nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot g(\mathbf{x}_t) \;\le\; \frac{1}{\|w\|_1} \sum_{t=1}^{T} \nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot f(\mathbf{x}_t). \qquad (5)$$

Let $f$ and $w$ be as above. Since $\mathcal{A}^i$ is an online learning algorithm for $\mathcal{F}$ with regret bound $R$ for the $1$-Lipschitz linear loss functions $\mathbf{y} \mapsto \frac{1}{L_{D'}}\nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot \mathbf{y}$, and $\|\mathcal{A}^i(\mathbf{x}_t)\| \le D$, multiplying the regret bound (Equation 1) by $L_{D'}$ we have

$$\sum_{t=1}^{T} \nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot \mathcal{A}^i(\mathbf{x}_t) \;\le\; \frac{1}{\|w\|_1} \sum_{t=1}^{T} \nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot f(\mathbf{x}_t) + L_{D'} R(T)$$

by (Equation 5). Now, we analyze how much excess loss is potentially introduced due to the projection in line 8. First, note that if $\|\sigma_t^i \mathbf{y}_t^{i-1} + \eta\,\mathcal{A}^i(\mathbf{x}_t)\| \le D'$, then the projection has no effect, and in this case $\ell_t(\mathbf{y}_t^i) = \ell_t(\sigma_t^i \mathbf{y}_t^{i-1} + \eta\,\mathcal{A}^i(\mathbf{x}_t))$. If $\|\sigma_t^i \mathbf{y}_t^{i-1} + \eta\,\mathcal{A}^i(\mathbf{x}_t)\| > D'$, then by the definition of $\epsilon_{D'}$, projecting onto $\mathcal{B}(D')$ increases the loss by at most $\epsilon_{D'}$.

In either case, we have

$$\ell_t(\mathbf{y}_t^i) \;\le\; \ell_t(\sigma_t^i \mathbf{y}_t^{i-1} + \eta\,\mathcal{A}^i(\mathbf{x}_t)) + \epsilon_{D'}.$$

We now move to the main part of the analysis. Define, for $i = 0, 1, \ldots, N$, $\Delta_i := \sum_{t=1}^T \ell_t(\mathbf{y}_t^i) - \sum_{t=1}^T \ell_t(f(\mathbf{x}_t))$. Combining the $\beta_{D'}$-smoothness of the losses with the bounds established above, we obtain a recursion of the form

$$\Delta_i \;\le\; \left(1 - \frac{\eta}{\|w\|_1}\right)\Delta_{i-1} + \eta\,L_{D'} R(T) + \frac{\beta_{D'}\,\eta^2 D^2}{2}\,T + \epsilon_{D'}\,T + O(L_{D'} D' \sqrt{T}),$$

since, by convexity of $\ell_t$, we have $\nabla \ell_t(\mathbf{y}_t^{i-1}) \cdot (f(\mathbf{x}_t) - \mathbf{y}_t^{i-1}) \le \ell_t(f(\mathbf{x}_t)) - \ell_t(\mathbf{y}_t^{i-1})$. Applying the above bound iteratively, we get

$$\Delta_N \;\le\; \left(1 - \frac{\eta}{\|w\|_1}\right)^{N} \Delta_0 + \|w\|_1 L_{D'} R(T) + \frac{\|w\|_1\,\beta_{D'}\,\eta D^2 T}{2} + \frac{\|w\|_1}{\eta}\left(\epsilon_{D'}\,T + O(L_{D'} D' \sqrt{T})\right).$$

This completes the proof.

5 Variants of the boosting algorithms

Our boosting algorithms and their analysis are considerably flexible: it is easy to modify the algorithms to work with a different (and perhaps more natural) kind of base learner that does greedy fitting, or to incorporate a scaling of the base functions that improves performance. Also, when specialized to the batch setting, our algorithms provide better convergence rates than previous work.

5.1 Fitting to actual loss functions

The choice of an online linear learning algorithm over the base function class in our algorithms was made to ease the analysis. In practice, it is more common to have an online algorithm which produces predictions with accuracy comparable to that of the best function in hindsight for the actual sequence of loss functions. In particular, a common heuristic in boosting algorithms, such as the original gradient boosting algorithm by [10] or the matching pursuit algorithm of [18], is to build a linear combination of base functions by iteratively augmenting the current linear combination via greedily choosing a base function and a step size for it that minimizes the loss with respect to the residual label. Indeed, the boosting algorithm of [24] also uses this kind of greedy fitting algorithm as the base learner.

In the online setting, we can model greedy fitting as follows. We first fix a step size $\eta$ in advance. Then, in each round $t$, the base learner receives not only the example $\mathbf{x}_t$, but also an offset $\mathbf{o}_t$ for the prediction, and produces a prediction $f_t(\mathbf{x}_t)$, after which it receives the loss function $\ell_t$ and suffers loss $\ell_t(\mathbf{o}_t + \eta f_t(\mathbf{x}_t))$. The predictions of the base learner satisfy

$$\sum_{t=1}^{T} \ell_t(\mathbf{o}_t + \eta f_t(\mathbf{x}_t)) \;\le\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell_t(\mathbf{o}_t + \eta f(\mathbf{x}_t)) + R(T),$$

where $R(T)$ is the regret. Our algorithms can be made to work with this kind of base learner as well. The details can be found in Appendix A.1 of the supplementary material.
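A sketch of a base learner implementing this protocol follows, using a linear model trained by gradient descent on the offset loss. The interface and all names are our assumptions for illustration.

```python
import numpy as np

class GreedyFitLearner:
    # Receives an offset along with the example and is judged on the loss
    # of offset + eta * (its prediction).
    def __init__(self, dim, eta=0.1, lr=0.05):
        self.w, self.eta, self.lr = np.zeros(dim), eta, lr

    def predict(self, x, offset):
        return offset + self.eta * float(self.w @ x)

    def update(self, x, grad_at_pred):
        # chain rule: d/dw loss(offset + eta * w.x) = eta * loss'(.) * x
        self.w -= self.lr * self.eta * grad_at_pred * x
```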

5.2 Improving the regret bound via scaling

Given an online linear learning algorithm $\mathcal{A}$ over the function class $\mathcal{F}$ with regret $R(T)$, for any scaling parameter $\lambda > 0$ we trivially obtain an online linear learning algorithm, denoted $\lambda\mathcal{A}$, over a $\lambda$-scaling of $\mathcal{F}$, viz. $\lambda\mathcal{F} := \{\lambda f : f \in \mathcal{F}\}$, simply by multiplying the predictions of $\mathcal{A}$ by $\lambda$. The corresponding regret scales by $\lambda$ as well, i.e. it becomes $\lambda R(T)$.

The performance of Algorithm 1 can be improved by using such an online linear learning algorithm over $\lambda\mathcal{F}$ for a suitably chosen scaling $\lambda$ of the function class $\mathcal{F}$. The regret bound from Theorem 1 improves because the $1$-norm of $f$ measured with respect to $\lambda\mathcal{F}$, i.e. $\|f\|_1/\lambda$, is smaller than $\|f\|_1$ when $\lambda > 1$, but degrades because the bound $D$ for $\lambda\mathcal{F}$ is larger than that for $\mathcal{F}$. But, as detailed in Appendix A.2 of the supplementary material, in many situations the improvement due to the former compensates for the degradation due to the latter, and overall we can get improved regret bounds using a suitable value of $\lambda$.
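This wrapper is a one-liner in code; a sketch (our naming) is below.

```python
class ScaledLearner:
    # Wrap a base online linear learner so that it plays over lambda * F;
    # its regret scales by lambda as well.
    def __init__(self, base, lam):
        self.base, self.lam = base, lam

    def predict(self, x):
        return self.lam * self.base.predict(x)

    def update(self, x, g):
        # the linear loss g * (lam * y) looks like (g * lam) * y to the base
        self.base.update(x, g * self.lam)
```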

5.3 Improvements for batch boosting

Our algorithmic technique can be easily specialized and modified to the standard batch setting with a fixed batch of training examples and a base learning algorithm operating over the batch, exactly as in [24]. The main difference compared to the algorithm of [24] is the use of the shrinkage variables $\sigma^i$ to scale the coefficients of the weak hypotheses appropriately. While a seemingly innocuous tweak, this allows us to derive bounds on the optimization error analogous to those of [24], which show that our boosting algorithm converges exponentially faster. A detailed comparison can be found in Appendix A.3 of the supplementary material.

6 Experimental Results

Is it possible to boost in an online fashion in practice with real base learners? To study this question, we implemented and evaluated Algorithms 1 and 2 within the Vowpal Wabbit (VW) open source machine learning system [23]. The three online base learners used were VW's default linear learner (a variant of stochastic gradient descent), two-layer sigmoidal neural networks with 10 hidden units, and regression stumps.

Regression stumps were implemented by doing stochastic gradient descent on each individual feature, and predicting with the best-performing non-zero valued feature in the current example.
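A sketch of such a regression-stump base learner is given below; the loss used for scoring features and the learning rate are our assumptions about the unstated details.

```python
import numpy as np

class OnlineRegressionStumps:
    # Per-feature SGD; predict with the best-performing non-zero feature
    # of the current example.
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)         # one scalar predictor per feature
        self.cum_loss = np.zeros(dim)  # running squared loss per feature

    def predict(self, x):
        active = np.flatnonzero(x)
        if active.size == 0:
            return 0.0
        j = active[np.argmin(self.cum_loss[active])]
        return float(self.w[j] * x[j])

    def update(self, x, y_star):
        preds = self.w * x
        self.cum_loss += (preds - y_star) ** 2
        self.w -= self.lr * (preds - y_star) * x  # per-feature SGD step
```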

All experiments were done on a collection of 14 publicly available regression and classification datasets (described in Appendix B of the supplementary material) using squared loss. The only parameters tuned were the learning rate and the number of weak learners, as well as the step size parameter $\eta$ for Algorithm 1. Parameters were tuned based on progressive validation loss on half of the dataset; we report progressive validation loss on the remaining half. Progressive validation is a standard online validation technique, where each training example is used for testing before it is used for updating the model [3].
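A sketch of the progressive validation loop, assuming a learner with a predict/update interface:

```python
def progressive_validation(learner, stream):
    # Each example is used for testing before it is used for updating,
    # so the running average squared loss is an honest online estimate.
    total, n = 0.0, 0
    for x, y_star in stream:
        y = learner.predict(x)     # test on the example first...
        total += (y - y_star) ** 2
        n += 1
        learner.update(x, y_star)  # ...then train on it
    return total / max(n, 1)
```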

The following table reports the average and the median, over the datasets, of the relative improvement in squared loss over the respective base learner. Detailed results can be found in Appendix B of the supplementary material.

Base learner      | Algorithm 1 (avg) | Algorithm 1 (median) | Algorithm 2 (avg) | Algorithm 2 (median)
SGD               | 1.65%             | 1.33%                | 0.03%             | 0.29%
Regression stumps | 20.22%            | 15.90%               | 10.45%            | 13.69%
Neural networks   | 7.88%             | 0.72%                | 0.72%             | 0.33%

Note that both SGD (stochastic gradient descent) and neural networks are already very strong learners. Naturally, boosting is much more effective for regression stumps, which are genuinely weak base learners.

7 Conclusions and Future Work

In this paper we generalized the theory of boosting for regression problems to the online setting and provided online boosting algorithms with theoretical convergence guarantees. Our algorithmic technique also improves convergence guarantees for batch boosting algorithms. We also provide experimental evidence that our boosting algorithms do improve prediction accuracy over commonly used base learners in practice, with greater improvements for weaker base learners. The main remaining open question is whether the boosting algorithm for competing with the span of the base functions is optimal in any sense, similar to our proof of optimality for the boosting algorithm for competing with the convex hull of the base functions.

Supplementary material for “Online Gradient Boosting”


A Variants of the boosting algorithms

In this section we provide the omitted details of two variants of our boosting algorithms: (a) a variant that works with a different kind of base learner which does greedy fitting, and (b) a variant that incorporates a scaling of the base functions to improve performance. We also show how our algorithmic technique can be used to improve the convergence speed for batch boosting.

A.1 Fitting to actual loss functions

The choice of an online linear learning algorithm over the base function class in our algorithms was made to ease the analysis. In practice, it is more common to have an online algorithm which produces predictions with accuracy comparable to that of the best function in hindsight for the actual sequence of loss functions. In particular, a common heuristic in boosting algorithms, such as the original gradient boosting algorithm by [10] or the matching pursuit algorithm of [18], is to build a linear combination of base functions by iteratively augmenting the current linear combination by greedily choosing a base function and a step size for it that minimizes the loss with respect to the residual label. Indeed, the boosting algorithm of [24] also uses this kind of greedy fitting algorithm as the base learner.

In the online setting, we can model greedy fitting as follows. We first fix a step size $\eta$ in advance. Then, in each round $t$, the base learner receives not only the example $\mathbf{x}_t$, but also an offset $\mathbf{o}_t$ for the prediction, and produces a prediction $f_t(\mathbf{x}_t)$, after which it receives the loss function $\ell_t$ and suffers loss $\ell_t(\mathbf{o}_t + \eta f_t(\mathbf{x}_t))$. The predictions of the base learner satisfy

$$\sum_{t=1}^{T} \ell_t(\mathbf{o}_t + \eta f_t(\mathbf{x}_t)) \;\le\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell_t(\mathbf{o}_t + \eta f(\mathbf{x}_t)) + R(T),$$

where $R(T)$ is the regret. We now describe how our algorithms can be made to work with this kind of base learner as well.

Assume that, for some known parameter $D_0$, we have $\|\mathbf{o}_t\| \le D_0$ for all $t$. Let $D' := D_0 + \eta D$, and assume that the loss functions are Lipschitz and smooth on $\mathcal{B}(D')$. Then, using the convexity and smoothness of the loss functions, we can bound the linearized losses in terms of the actual losses, and vice versa. Plugging these bounds into the above regret bound we get, for any $f \in \mathcal{F}$,