Online Gradient Boosting

Alina Beygelzimer
Yahoo Labs
New York, NY 10036
beygel@yahoo-inc.com
   Elad Hazan
Princeton University
Princeton, NJ 08540
ehazan@cs.princeton.edu
   Satyen Kale
Yahoo Labs
New York, NY 10036
satyen@yahoo-inc.com
   Haipeng Luo
Princeton University
Princeton, NJ 08540
haipengl@cs.princeton.edu
Abstract

We extend the theory of boosting for regression problems to the online learning setting. Generalizing from the batch setting for boosting, the notion of a weak learning algorithm is modeled as an online learning algorithm with linear loss functions that competes with a base class of regression functions, while a strong learning algorithm is an online learning algorithm with smooth convex loss functions that competes with a larger class of regression functions. Our main result is an online gradient boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the linear span of the base class. We also give a simpler boosting algorithm that converts a weak online learning algorithm into a strong one where the larger class of functions is the convex hull of the base class, and prove its optimality.

1 Introduction

Boosting algorithms [21] are ensemble methods that convert a learning algorithm for a base class of models with weak predictive power, such as decision trees, into a learning algorithm for a class of models with stronger predictive power, such as a weighted majority vote over base models in the case of classification, or a linear combination of base models in the case of regression.

Boosting methods such as AdaBoost [9] and Gradient Boosting [10] have found tremendous practical application, especially using decision trees as the base class of models. These algorithms were developed in the batch setting, where training is done over a fixed batch of sample data. However, with the recent explosion of huge data sets which do not fit in main memory, training in the batch setting is infeasible, and online learning techniques which train a model in one pass over the data have proven extremely useful.

A natural goal therefore is to extend boosting algorithms to the online learning setting. Indeed, there has already been some work on online boosting for classification problems [20, 11, 17, 12, 4, 5, 2]. Of these, the work by Chen et al. [4] provided the first theoretical study of online boosting for classification, which was later generalized by Beygelzimer et al. [2] to obtain optimal and adaptive online boosting algorithms.

However, extending boosting algorithms for regression to the online setting has been elusive and escaped theoretical guarantees thus far. In this paper, we rigorously formalize the setting of online boosting for regression and then extend the very commonly used gradient boosting methods [10, 19] to the online setting, providing theoretical guarantees on their performance.

The main result of this paper is an online boosting algorithm that competes with any linear combination of the base functions, given an online linear learning algorithm over the base class. This algorithm is the online analogue of the batch boosting algorithm of Zhang and Yu [24], and in fact our algorithmic technique, when specialized to the batch boosting setting, provides exponentially better convergence guarantees.

We also give an online boosting algorithm that competes with the best convex combination of base functions. This is a simpler algorithm which is analyzed along the lines of the Frank-Wolfe algorithm [8]. While the algorithm has weaker theoretical guarantees, it can still be useful in practice. We also prove that this algorithm obtains the optimal regret bound (up to constant factors) for this setting.

Finally, we conduct some proof-of-concept experiments which show that our online boosting algorithms do obtain performance improvements over different classes of base learners.

1.1 Related Work

While the theory of boosting for classification in the batch setting is well-developed (see [21]), the theory of boosting for regression is comparatively sparse. The foundational theory of boosting for regression can be found in the statistics literature [14, 13], where boosting is understood as a greedy stagewise algorithm for fitting additive models. The goal is to achieve the performance of linear combinations of base models, and to prove convergence to the performance of the best such linear combination.

While the earliest works on boosting for regression such as [10] do not have such convergence proofs, later works such as [19, 6] do have convergence proofs but without a bound on the speed of convergence. Bounds on the speed of convergence have been obtained by Duffy and Helmbold [7] relying on a somewhat strong assumption on the performance of the base learning algorithm. A different approach to boosting for regression was taken by Freund and Schapire [9], who give an algorithm that reduces the regression problem to classification and then applies AdaBoost; the corresponding proof of convergence relies on an assumption on the induced classification problem which may be hard to satisfy in practice. The strongest result is that of Zhang and Yu [24], who prove convergence to the performance of the best linear combination of base functions, along with a bound on the rate of convergence, making essentially no assumptions on the performance of the base learning algorithm. Telgarsky [22] proves similar results for logistic (or similar) loss using a slightly simpler boosting algorithm.

The results in this paper are a generalization of the results of Zhang and Yu [24] to the online setting. However, we emphasize that this generalization is nontrivial and requires different algorithmic ideas and proof techniques. Indeed, we were not able to directly generalize the analysis in [24] by simply adapting the techniques used in recent online boosting work [4, 2], and instead made use of the classical Frank-Wolfe algorithm [8]. On the other hand, while an important part of the convergence analysis for the batch setting is to show statistical consistency of the algorithms [24, 1, 22], in the online setting we only need to study the empirical convergence (that is, the regret), which makes our analysis much more concise.

2 Setup

Examples are chosen from a feature space $\mathcal{X}$, and the prediction space is $\mathbb{R}^d$. Let $\|\cdot\|$ denote some norm in $\mathbb{R}^d$. In the setting for online regression, in each round $t$, for $t = 1, 2, \ldots, T$, an adversary selects an example $\mathbf{x}_t \in \mathcal{X}$ and a loss function $\ell_t : \mathbb{R}^d \to \mathbb{R}$, and presents $\mathbf{x}_t$ to the online learner. The online learner outputs a prediction $\mathbf{y}_t \in \mathbb{R}^d$, obtains the loss function $\ell_t$, and incurs loss $\ell_t(\mathbf{y}_t)$.

Let $\mathcal{F}$ denote a reference class of regression functions $f : \mathcal{X} \to \mathbb{R}^d$, and let $\mathcal{C}$ denote a class of loss functions $\ell : \mathbb{R}^d \to \mathbb{R}$. Also, let $R : \mathbb{N} \to \mathbb{R}_+$ be a non-decreasing function. We say that the function class $\mathcal{F}$ is online learnable for losses in $\mathcal{C}$ with regret $R$ if there is an online learning algorithm $\mathcal{A}$ that, for every $T$ and every sequence $(\mathbf{x}_t, \ell_t) \in \mathcal{X} \times \mathcal{C}$ for $t = 1, 2, \ldots, T$ chosen by the adversary, generates predictions $\mathcal{A}(\mathbf{x}_t)$ (with a slight abuse of notation: $\mathcal{A}(\mathbf{x}_t)$ is not a fixed function of $\mathbf{x}_t$, but rather the output of the online learning algorithm computed on the given example using its internal state) such that

$$\sum_{t=1}^{T} \ell_t(\mathcal{A}(\mathbf{x}_t)) \;\le\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell_t(f(\mathbf{x}_t)) + R(T). \qquad (1)$$

If the online learning algorithm is randomized, we require the above bound to hold with high probability.

The above definition is simply the online generalization of standard empirical risk minimization (ERM) in the batch setting. A concrete example is $1$-dimensional regression, i.e. the prediction space is $\mathbb{R}$. For a labeled data point $(\mathbf{x}, y^\star)$, the loss for the prediction $y$ is given by $\ell(y^\star, y)$, where $\ell$ is a fixed loss function that is convex in the second argument (such as squared loss, logistic loss, etc.). Given a batch of labeled data points $(\mathbf{x}_t, y^\star_t)$ for $t = 1, 2, \ldots, T$ and a base class $\mathcal{F}$ of regression functions (say, the set of bounded-norm linear regressors), an ERM algorithm finds the function $f \in \mathcal{F}$ that minimizes $\sum_{t=1}^{T} \ell(y^\star_t, f(\mathbf{x}_t))$.

In the online setting, the adversary reveals the data in an online fashion, only presenting the true label $y^\star_t$ after the online learner has chosen a prediction $y_t$. Thus, setting $\ell_t(y) = \ell(y^\star_t, y)$, we observe that if $\mathcal{A}$ satisfies the regret bound (1), then it makes predictions with total loss almost as small as that of the empirical risk minimizer, up to the regret term. If $\mathcal{F}$ is the set of all bounded-norm linear regressors, for example, the algorithm $\mathcal{A}$ could be online gradient descent [25] or Online Newton Step [16].
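To make the base-learner abstraction concrete, the following is a minimal sketch (not taken from the paper) of an online linear learning algorithm over bounded-norm linear regressors, implemented as projected online gradient descent. The class name, the $1/\sqrt{t}$ step-size schedule, and the representation of linear losses by a single coefficient are illustrative assumptions.

```python
import numpy as np

class OGDLinearLearner:
    """Online gradient descent over {x -> w.x : ||w||_2 <= radius}.

    The learner is fed linear loss functions y -> g * y, represented by the
    scalar coefficient g, which is all that the boosting reductions require.
    """

    def __init__(self, dim, radius=1.0, lr_scale=1.0):
        self.w = np.zeros(dim)
        self.radius = radius
        self.lr_scale = lr_scale
        self.t = 0
        self._last_x = None

    def predict(self, x):
        self._last_x = np.asarray(x, dtype=float)
        return float(self.w @ self._last_x)

    def update(self, g):
        """Gradient step for the linear loss y -> g * y at the last example."""
        self.t += 1
        lr = self.lr_scale / np.sqrt(self.t)
        # gradient of g * (w . x) with respect to w is g * x
        self.w -= lr * g * self._last_x
        # project w back onto the ball of the given radius
        norm = np.linalg.norm(self.w)
        if norm > self.radius:
            self.w *= self.radius / norm
```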

At a high level, in the batch setting, “boosting” is understood as a procedure that, given a batch of data and access to an ERM algorithm for a function class $\mathcal{F}$ (this is called a “weak” learner), obtains an approximate ERM algorithm for a richer function class $\mathcal{F}'$ (this is called a “strong” learner). Generally, $\mathcal{F}'$ is the set of finite linear combinations of functions in $\mathcal{F}$. The efficiency of boosting is measured by how many times, $N$, the base ERM algorithm needs to be called (i.e., the number of boosting steps) to obtain an ERM algorithm for the richer function class within the desired approximation tolerance. Convergence rates [24] give bounds on how quickly the approximation error goes to $0$ as $N \to \infty$.

We now extend this notion of boosting to the online setting in the natural manner. To capture the full generality of the techniques, we also specify a class of loss functions that the online learning algorithm can work with. Informally, an online boosting algorithm is a reduction that, given access to an online learning algorithm $\mathcal{A}$ for a function class $\mathcal{F}$ and loss function class $\mathcal{C}$ with regret $R$, and a bound $N$ on the total number of calls made in each iteration to copies of $\mathcal{A}$, obtains an online learning algorithm for a richer function class $\mathcal{F}'$, a richer loss function class $\mathcal{C}'$, and (possibly larger) regret $R'$. The bound $N$ on the total number of calls made to all the copies of $\mathcal{A}$ corresponds to the number of boosting stages in the batch setting, and in the online setting it may be viewed as a resource constraint on the algorithm. The efficacy of the reduction is measured by the regret $R'$, which is a function of $N$, $R$, and certain parameters of the comparator class $\mathcal{F}'$ and the loss function class $\mathcal{C}'$. We desire online boosting algorithms such that the average regret $R'(T)/T \to 0$ quickly as $N \to \infty$ and $T \to \infty$. We make the notions of richness in the above informal description more precise now.
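In code, the reduction only needs the base algorithm through a minimal interface: produce a prediction for the current example, then consume the (linear) loss feedback for that round. The `OnlineLearner` protocol and the factory signature below are illustrative assumptions about how such an interface might look; the boosting sketches later in this section are written against it.

```python
from typing import Callable, Protocol, Sequence

class OnlineLearner(Protocol):
    """Minimal interface assumed of a base online (linear) learner."""

    def predict(self, x: Sequence[float]) -> float:
        """Return a prediction for the current example."""
        ...

    def update(self, g: float) -> None:
        """Receive the linear loss y -> g * y for the round just played."""
        ...

# An online boosting reduction is then a procedure of roughly this shape:
# given a way to spawn copies of the base learner and a budget N of calls
# per round, it returns a new online learner for the richer class.
Booster = Callable[[Callable[[], OnlineLearner], int], OnlineLearner]
```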

Comparator function classes.

A given function class $\mathcal{F}$ is said to be $D$-bounded if for all $\mathbf{x} \in \mathcal{X}$ and all $f \in \mathcal{F}$, we have $\|f(\mathbf{x})\| \le D$. Throughout this paper, we assume that $\mathcal{F}$ is symmetric, i.e. if $f \in \mathcal{F}$, then $-f \in \mathcal{F}$, and that it contains the constant zero function, which we denote, with some abuse of notation, by $0$. (This is without loss of generality; as will be seen momentarily, our base assumption only requires an online learning algorithm for $\mathcal{F}$ for linear losses. By running the Hedge algorithm on two copies of the base algorithm, one of which receives the actual loss functions and the other receives their negations, we get an algorithm which competes with negations of functions in $\mathcal{F}$ and the constant zero function as well. Furthermore, since the loss functions are convex (indeed, linear), this can be made into a deterministic reduction by choosing the convex combination of the outputs of the two copies with mixing weights given by the Hedge algorithm.)

Given $\mathcal{F}$, we define two richer function classes: the convex hull of $\mathcal{F}$, denoted $\mathrm{CH}(\mathcal{F})$, is the set of convex combinations of a finite number of functions in $\mathcal{F}$, and the span of $\mathcal{F}$, denoted $\mathrm{span}(\mathcal{F})$, is the set of linear combinations of finitely many functions in $\mathcal{F}$. For any $f = \sum_{g \in S} w_g\, g \in \mathrm{span}(\mathcal{F})$, where $S$ is a finite subset of $\mathcal{F}$ and the $w_g$ are real coefficients, define $\|f\|_1$ to be the infimum of $\sum_{g \in S} |w_g|$ over all such representations of $f$. Since functions in $\mathrm{span}(\mathcal{F})$ are not uniformly bounded, it is not possible to obtain a uniform regret bound for all functions in $\mathrm{span}(\mathcal{F})$: rather, the regret of an online learning algorithm for $\mathrm{span}(\mathcal{F})$ is specified in terms of regret bounds for individual comparator functions $f \in \mathrm{span}(\mathcal{F})$, viz. bounds of the form $\sum_{t=1}^T \ell_t(\mathbf{y}_t) \le \sum_{t=1}^T \ell_t(f(\mathbf{x}_t)) + R_f(T)$.

Loss function classes.

The base loss function class we consider is the class of linear loss functions: the set of all linear functions $\mathbf{y} \mapsto \mathbf{w} \cdot \mathbf{y}$ with Lipschitz constant bounded by $1$. A function class that is online learnable with this loss function class is called online linear learnable for short. The richer loss function class we consider is denoted by $\mathcal{C}$ and is a set of convex loss functions satisfying some regularity conditions specified in terms of certain parameters described below.

We define a few parameters of the class $\mathcal{C}$. For any $D > 0$, let $\mathbb{B}(D) = \{\mathbf{y} \in \mathbb{R}^d : \|\mathbf{y}\| \le D\}$ be the ball of radius $D$. The class $\mathcal{C}$ is said to have Lipschitz constant $L_D$ on $\mathbb{B}(D)$ if for all $\ell \in \mathcal{C}$ and all $\mathbf{y} \in \mathbb{B}(D)$ there is an efficiently computable subgradient $\nabla\ell(\mathbf{y})$ with norm at most $L_D$. Next, $\mathcal{C}$ is said to be $\beta_D$-smooth on $\mathbb{B}(D)$ if for all $\ell \in \mathcal{C}$ and all $\mathbf{y}, \mathbf{y}' \in \mathbb{B}(D)$ we have
$$\ell(\mathbf{y}') \;\le\; \ell(\mathbf{y}) + \nabla\ell(\mathbf{y}) \cdot (\mathbf{y}' - \mathbf{y}) + \frac{\beta_D}{2}\|\mathbf{y}' - \mathbf{y}\|^2 .$$
Next, define the projection operator $\Pi_D$ onto $\mathbb{B}(D)$ as $\Pi_D(\mathbf{y}) := \arg\min_{\mathbf{y}' \in \mathbb{B}(D)} \|\mathbf{y} - \mathbf{y}'\|$, and define $\epsilon_D$ to be the corresponding parameter of $\mathcal{C}$ over $\mathbb{B}(D)$ bounding the excess loss incurred when a prediction is replaced by its projection onto $\mathbb{B}(D)$ (this parameter is used in the analysis in Section 4.2).
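For the Euclidean norm, the projection $\Pi_D$ is just a rescaling of vectors that lie outside the ball; a small illustrative helper (not from the paper) is given below.

```python
import numpy as np

def project_to_ball(y, D):
    """Euclidean projection of y onto the ball of radius D centered at the origin."""
    y = np.asarray(y, dtype=float)
    norm = np.linalg.norm(y)
    if norm <= D:
        return y
    return (D / norm) * y
```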

3 Online Boosting Algorithms

The setup is that we are given a $D$-bounded reference class of functions $\mathcal{F}$ with an online linear learning algorithm $\mathcal{A}$ with regret bound $R$. For normalization, we also assume that the output of $\mathcal{A}$ at any time is bounded in norm by $D$, i.e. $\|\mathcal{A}(\mathbf{x}_t)\| \le D$ for all $t$. We further assume that for every radius $D' > 0$, we can compute (it suffices to compute upper bounds on these parameters) a Lipschitz constant $L_{D'}$, a smoothness parameter $\beta_{D'}$, and the parameter $\epsilon_{D'}$ for the class $\mathcal{C}$ over $\mathbb{B}(D')$. Furthermore, the online boosting algorithm may make up to $N$ calls per iteration to any copies of $\mathcal{A}$ it maintains, for a given budget parameter $N$.

0:  Number of weak learners $N$, step size parameter $\eta$, projection radius $D'$.
1:  Let $\Pi_{D'}$ denote the projection onto $\mathbb{B}(D')$.
2:  Maintain $N$ copies of the algorithm $\mathcal{A}$, denoted $\mathcal{A}^i$ for $i = 1, 2, \ldots, N$.
3:  For each $i$, initialize a shrinkage parameter $\sigma^i_1 \in [0, 1]$.
4:  for $t = 1$ to $T$ do
5:     Receive example $\mathbf{x}_t$.
6:     Define $\mathbf{y}^0_t = \mathbf{0}$.
7:     for $i = 1$ to $N$ do
8:        Define $\mathbf{y}^i_t$ by shrinking the partial sum $\mathbf{y}^{i-1}_t$ by the factor determined by $\sigma^i_t$, adding $\eta\,\mathcal{A}^i(\mathbf{x}_t)$, and projecting the result onto $\mathbb{B}(D')$ via $\Pi_{D'}$.
9:     end for
10:     Predict $\mathbf{y}_t := \mathbf{y}^N_t$.
11:     Obtain loss function $\ell_t$ and suffer loss $\ell_t(\mathbf{y}_t)$.
12:     for $i = 1$ to $N$ do
13:        Pass the scaled linear loss function $\mathbf{y} \mapsto \frac{1}{L_{D'}}\nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot \mathbf{y}$ to $\mathcal{A}^i$.
14:        Set $\sigma^i_{t+1}$ by a projected online gradient descent step on $[0, 1]$, using the inner product $\nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot \mathbf{y}^{i-1}_t$ as the loss gradient, with step size decreasing as $1/\sqrt{t}$.
15:     end for
16:  end for
Algorithm 1 Online Gradient Boosting for $\mathrm{span}(\mathcal{F})$

Given this setup, our main result is an online boosting algorithm, Algorithm 1, competing with $\mathrm{span}(\mathcal{F})$. The algorithm maintains $N$ copies of $\mathcal{A}$, denoted $\mathcal{A}^i$ for $i = 1, 2, \ldots, N$. Each copy corresponds to one stage in boosting. When it receives a new example $\mathbf{x}_t$, it passes it to each $\mathcal{A}^i$ and obtains their predictions $\mathcal{A}^i(\mathbf{x}_t)$, which it then combines into a prediction for $\mathbf{x}_t$ using a linear combination. At the most basic level, this linear combination is simply the sum of all the predictions scaled by a step size parameter $\eta$. Two tweaks are made to this sum in step 8 to facilitate the analysis:

  1. While constructing the sum, the partial sum $\mathbf{y}^{i-1}_t$ is multiplied by a shrinkage factor. This shrinkage term is tuned using an online gradient descent algorithm in step 14. The goal of the tuning is to induce the partial sums to be aligned with a descent direction for the loss functions, as measured by the inner product $\nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot \mathbf{y}^{i-1}_t$.

  2. The partial sums are made to lie in $\mathbb{B}(D')$, for some parameter $D'$, by using the projection operator $\Pi_{D'}$. This is done to ensure that the Lipschitz constant and smoothness of the loss function are suitably bounded.

Once the boosting algorithm makes the prediction $\mathbf{y}_t$ and obtains the loss function $\ell_t$, each $\mathcal{A}^i$ is updated using a suitably scaled linear approximation to the loss function at the partial sum $\mathbf{y}^{i-1}_t$, i.e. the linear loss function $\mathbf{y} \mapsto \frac{1}{L_{D'}}\nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot \mathbf{y}$. This forces $\mathcal{A}^i$ to produce predictions that are aligned with a descent direction for the loss function.
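The following Python sketch illustrates the structure just described for $1$-dimensional predictions: partial sums built up with a step size $\eta$, a per-stage shrinkage parameter tuned by projected online gradient descent on $[0, 1]$, projection of the partial sums onto $[-D', D']$, and scaled linear-loss feedback to each base learner. It is a minimal illustration consistent with the description above rather than the paper's exact pseudocode; in particular, the shrinkage parameterization, the $1/\sqrt{t}$ tuning step size, and the $1/L$ feedback scaling are assumptions.

```python
import numpy as np

class OnlineGradientBooster:
    """Sketch of an online gradient booster for 1-d regression.

    base_learners: N online learners with predict(x) -> float and update(g),
        where g is the coefficient of a linear loss y -> g * y.
    eta: step size used to add each base prediction to the partial sum.
    D: radius of the interval [-D, D] onto which partial sums are projected.
    L: bound on |loss gradient| over [-D, D], used to scale the feedback.
    """

    def __init__(self, base_learners, eta, D, L):
        self.learners = list(base_learners)
        self.N = len(self.learners)
        self.eta = eta
        self.D = D
        self.L = L
        self.sigma = np.zeros(self.N)  # per-stage shrinkage parameters in [0, 1]
        self.t = 0

    def predict(self, x):
        self.t += 1
        self._partial = [0.0]  # y^0 = 0
        y = 0.0
        for i, learner in enumerate(self.learners):
            # shrink the current partial sum, add the scaled base prediction,
            # and project the result back onto [-D, D]
            y = (1.0 - self.sigma[i] * self.eta) * y + self.eta * learner.predict(x)
            y = float(np.clip(y, -self.D, self.D))
            self._partial.append(y)
        return y

    def update(self, grad_loss):
        """grad_loss(y) returns the derivative of this round's loss at y."""
        step = 1.0 / np.sqrt(self.t)
        for i, learner in enumerate(self.learners):
            g = grad_loss(self._partial[i])  # gradient at the i-th partial sum
            # feedback to the base learner: the scaled linear loss y -> (g / L) * y
            learner.update(g / self.L)
            # tune the shrinkage parameter by a projected gradient step on [0, 1];
            # increasing sigma[i] shrinks the partial sum, so the approximate
            # derivative of the loss with respect to sigma[i] is -eta * g * partial
            self.sigma[i] = float(np.clip(
                self.sigma[i] + step * self.eta * g * self._partial[i], 0.0, 1.0))
```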

We provide the analysis of the algorithm in Section 4.2. The analysis yields the following regret bound for the algorithm:

Theorem 1.

Let $B$ be a given parameter, and let the projection radius $D'$ be set accordingly. Algorithm 1 is an online learning algorithm for $\mathrm{span}(\mathcal{F})$ and losses in $\mathcal{C}$ with the following regret bound for any $f \in \mathrm{span}(\mathcal{F})$:

where .

The regret bound in this theorem depends on several parameters, such as $L_{D'}$, $\beta_{D'}$, and $\|f\|_1$. In applications of the algorithm for $1$-dimensional regression with commonly used loss functions, however, these parameters are essentially modest constants; see Section 3.1 for calculations of the parameters for various loss functions. Furthermore, if the step size $\eta$ is appropriately set, then the average regret clearly converges to $0$ as $N \to \infty$ and $T \to \infty$. While the requirement that $N \to \infty$ may raise concerns about computational efficiency, this is in fact analogous to the guarantee in the batch setting: the algorithms converge only when the number of boosting stages goes to infinity. Moreover, our lower bound (Theorem 3) shows that this is indeed necessary.

We also present a simpler boosting algorithm, Algorithm 2, that competes with $\mathrm{CH}(\mathcal{F})$. Algorithm 2 is similar to Algorithm 1, with some simplifications: the final prediction is simply a convex combination of the predictions of the base learners, with no projections or shrinkage necessary. While Algorithm 1 is more general, Algorithm 2 may still be useful in practice when a bound on the norm of the comparator function is known in advance, using the observations in Section 5.2. Furthermore, its analysis is cleaner and easier to understand for readers who are familiar with the Frank-Wolfe method, and it serves as a foundation for the analysis of Algorithm 1. This algorithm has an optimal (up to constant factors) regret bound, as given in the following theorem, proved in Section 4.1. The upper bound in this theorem is proved along the lines of the analysis of the Frank-Wolfe algorithm [8], and the lower bound using information-theoretic arguments.

Theorem 2.

Algorithm 2 is an online learning algorithm for $\mathrm{CH}(\mathcal{F})$ for losses in $\mathcal{C}$ with the regret bound

Furthermore, the dependence of this regret bound on $N$ is optimal up to constant factors.

The dependence of the regret bound on $R(T)$ is unimprovable without additional assumptions: otherwise, Algorithm 2 would be an online linear learning algorithm over $\mathcal{F}$ with regret better than $R(T)$.

1:  Maintain $N$ copies of the algorithm $\mathcal{A}$, denoted $\mathcal{A}^i$ for $i = 1, 2, \ldots, N$, and let $\eta_i = \frac{2}{i+1}$ for $i = 1, 2, \ldots, N$.
2:  for $t = 1$ to $T$ do
3:     Receive example $\mathbf{x}_t$.
4:     Define $\mathbf{y}^0_t = \mathbf{0}$.
5:     for $i = 1$ to $N$ do
6:        Define $\mathbf{y}^i_t = (1 - \eta_i)\,\mathbf{y}^{i-1}_t + \eta_i\,\mathcal{A}^i(\mathbf{x}_t)$.
7:     end for
8:     Predict $\mathbf{y}_t := \mathbf{y}^N_t$.
9:     Obtain loss function $\ell_t$ and suffer loss $\ell_t(\mathbf{y}_t)$.
10:     for $i = 1$ to $N$ do
11:        Pass the scaled linear loss function $\mathbf{y} \mapsto \frac{1}{L_D}\nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot \mathbf{y}$ to $\mathcal{A}^i$.
12:     end for
13:  end for
Algorithm 2 Online Gradient Boosting for $\mathrm{CH}(\mathcal{F})$
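A sketch of the convex-combination booster, written against the same base-learner interface as before, is given below. The $\eta_i = 2/(i+1)$ schedule and the $1/L$ scaling of the gradient feedback are the natural Frank-Wolfe-style choices suggested by the analysis in Section 4.1, stated here as assumptions rather than as the paper's exact constants.

```python
import numpy as np

class OnlineConvexHullBooster:
    """Sketch of online boosting over the convex hull of the base class (1-d).

    Predictions are convex combinations of the base learners' predictions,
    built Frank-Wolfe style with step sizes eta_i = 2 / (i + 1).
    """

    def __init__(self, base_learners, L):
        self.learners = list(base_learners)
        self.N = len(self.learners)
        self.L = L  # bound on |loss gradient| over the prediction range
        self.etas = [2.0 / (i + 2) for i in range(self.N)]  # eta_i = 2/(i+1), i from 1

    def predict(self, x):
        self._partial = [0.0]  # y^0 = 0
        y = 0.0
        for eta, learner in zip(self.etas, self.learners):
            # move a fraction eta of the way from the current partial sum
            # toward the base learner's prediction (a convex combination)
            y = (1.0 - eta) * y + eta * learner.predict(x)
            self._partial.append(y)
        return y

    def update(self, grad_loss):
        """grad_loss(y) returns the derivative of this round's loss at y."""
        for i, learner in enumerate(self.learners):
            g = grad_loss(self._partial[i])
            # feedback: the scaled linear loss y -> (g / L) * y
            learner.update(g / self.L)
```

In both sketches, `grad_loss` is the derivative of the current round's loss at a given prediction; for squared loss against a label `y_star`, for instance, one would pass `lambda y: 2 * (y - y_star)`.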

Using a deterministic base online linear learning algorithm.

If the base online linear learning algorithm $\mathcal{A}$ is deterministic, then our results can be improved, because our online boosting algorithms are then also deterministic, and using a standard simple reduction, we can now allow $\mathcal{C}$ to be any set of convex loss functions (smooth or not) with a computable Lipschitz constant $L_D$ over the domain $\mathbb{B}(D)$ for any $D > 0$.

This reduction converts arbitrary convex loss functions into linear functions: viz. if $\mathbf{y}_t$ is the output of the online boosting algorithm, then the loss function provided to the boosting algorithm as feedback is the linear function $\mathbf{y} \mapsto \nabla\ell_t(\mathbf{y}_t) \cdot \mathbf{y}$. This reduction immediately implies that the base online linear learning algorithm $\mathcal{A}$, when fed these linearized loss functions, is already an online learning algorithm for $\mathrm{CH}(\mathcal{F})$ with losses in $\mathcal{C}$ with the regret bound $L_D R(T)$.
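In code, this reduction is a thin wrapper around the feedback step: evaluate a subgradient of the convex loss at the point that was actually played, and pass the corresponding linear loss to the learner. The helper below is an illustrative sketch (the names and the squared-loss example are assumptions, not the paper's notation).

```python
def linearized_feedback(learner, y_played, grad_loss):
    """Convert convex-loss feedback into the linear loss y -> g * y.

    learner: an online learner expecting linear-loss feedback (a coefficient g).
    y_played: the prediction that was actually output this round.
    grad_loss: function returning a (sub)gradient of this round's convex loss.
    """
    g = grad_loss(y_played)  # subgradient of the convex loss at the played point
    learner.update(g)        # the learner only ever sees the linear loss y -> g * y

# example usage with squared loss against a label y_star:
#   linearized_feedback(learner, y_t, lambda y: 2 * (y - y_star))
```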

As for competing with $\mathrm{span}(\mathcal{F})$, since linear loss functions are $0$-smooth, we obtain the following easy corollary of Theorem 1:

Corollary 1.

Let $B$ be a given parameter, and set the projection radius $D'$ accordingly. Algorithm 1 is an online learning algorithm for $\mathrm{span}(\mathcal{F})$ for losses in $\mathcal{C}$ with the following regret bound for any $f \in \mathrm{span}(\mathcal{F})$:

where .

3.1 The parameters for several basic loss functions

In this section we consider the application of our results to $1$-dimensional regression, where we assume, for normalization, that the true labels of the examples and the predictions of the functions in the class $\mathcal{F}$ are in $[-1, 1]$. In this case $\|\cdot\|$ denotes the absolute value norm. Thus, in each round, the adversary chooses a labeled data point $(\mathbf{x}_t, y^\star_t) \in \mathcal{X} \times [-1, 1]$, and the loss for the prediction $y$ is given by $\ell_t(y) = \ell(y^\star_t, y)$, where $\ell$ is a fixed loss function that is convex in the second argument. Note that $D = 1$ in this setting. We give examples of several such loss functions below (a code sketch of these losses and their derivatives follows the list), and compute the parameters $L_{D'}$ and $\beta_{D'}$ for every $D' \ge 1$, as well as the remaining quantities appearing in Theorem 1.

  1. Linear loss: $\ell(y^\star, y) = -y^\star y$. We have $L_{D'} = 1$ and $\beta_{D'} = 0$.

  2. $p$-norm loss, for some $p \ge 2$: $\ell(y^\star, y) = |y - y^\star|^p$. We have $L_{D'} \le p(D'+1)^{p-1}$ and $\beta_{D'} \le p(p-1)(D'+1)^{p-2}$.

  3. Modified least squares: $\ell(y^\star, y) = \frac{1}{2}\max\{0,\, 1 - y^\star y\}^2$. We have $L_{D'} \le D' + 1$ and $\beta_{D'} \le 1$.

  4. Logistic loss: $\ell(y^\star, y) = \ln(1 + \exp(-y^\star y))$. We have $L_{D'} \le 1$ and $\beta_{D'} \le 1/4$.
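For concreteness, here is a sketch of these one-dimensional losses and their derivatives in the forms stated above; the particular normalizations are common conventions and should be read as assumptions rather than as the paper's exact definitions.

```python
import numpy as np

def linear_loss(y_star, y):
    return -y_star * y

def linear_loss_grad(y_star, y):
    return -y_star

def p_norm_loss(y_star, y, p=2):
    return abs(y - y_star) ** p

def p_norm_loss_grad(y_star, y, p=2):
    return p * abs(y - y_star) ** (p - 1) * np.sign(y - y_star)

def modified_least_squares(y_star, y):
    return 0.5 * max(0.0, 1.0 - y_star * y) ** 2

def modified_least_squares_grad(y_star, y):
    return -y_star * max(0.0, 1.0 - y_star * y)

def logistic_loss(y_star, y):
    return float(np.log1p(np.exp(-y_star * y)))

def logistic_loss_grad(y_star, y):
    return -y_star / (1.0 + np.exp(y_star * y))
```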

4 Analysis

In this section, we analyze Algorithms 1 and 2.

4.1 Competing with convex combinations of the base functions

We give the analysis of Algorithm 2 before that of Algorithm 1 since it is easier to understand and provides the foundation for the analysis of Algorithm 1.

Proof of Theorem 2.

First, note that for any $i$, since $\mathbf{y} \mapsto \nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot \mathbf{y}$ is a linear function, its infimum over $\mathrm{CH}(\mathcal{F})$ is attained at some function in $\mathcal{F}$, so we have
$$\inf_{g \in \mathcal{F}} \sum_{t=1}^T \nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot g(\mathbf{x}_t) \;=\; \inf_{h \in \mathrm{CH}(\mathcal{F})} \sum_{t=1}^T \nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot h(\mathbf{x}_t).$$

Let $f$ be any function in $\mathrm{CH}(\mathcal{F})$. The equality above and the fact that $\mathcal{A}^i$ is an online learning algorithm for $\mathcal{F}$ with regret bound $R$ for the ($1$-Lipschitz, after scaling by $1/L_D$) linear loss functions passed to it imply that

$$\sum_{t=1}^T \nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot \mathcal{A}^i(\mathbf{x}_t) \;\le\; \sum_{t=1}^T \nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot f(\mathbf{x}_t) + L_D R(T). \qquad (2)$$

Now define, for $i = 0, 1, \ldots, N$, $\Delta_i := \sum_{t=1}^T \ell_t(\mathbf{y}^i_t) - \sum_{t=1}^T \ell_t(f(\mathbf{x}_t))$. By the $\beta_D$-smoothness of the losses, the update $\mathbf{y}^i_t = (1-\eta_i)\,\mathbf{y}^{i-1}_t + \eta_i\,\mathcal{A}^i(\mathbf{x}_t)$, the bound (2), and the convexity of $\ell_t$, we have
$$\Delta_i \;\le\; (1 - \eta_i)\Delta_{i-1} + \eta_i L_D R(T) + \frac{\beta_D \eta_i^2}{2} \sum_{t=1}^T \|\mathcal{A}^i(\mathbf{x}_t) - \mathbf{y}^{i-1}_t\|^2 \;\le\; (1 - \eta_i)\Delta_{i-1} + \eta_i L_D R(T) + 2\beta_D D^2 \eta_i^2 T.$$

For $i = 1$, since $\eta_1 = 1$, the above bound implies that $\Delta_1 \le L_D R(T) + 2\beta_D D^2 T$. Starting from this base case, an easy induction on $i$ proves the claimed decay of $\Delta_i$, and applying this bound for $i = N$ completes the proof. ∎
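The induction in the last step is an instance of the following standard Frank-Wolfe-style recursion bound, stated here with generic constants $C \ge 0$ and $\varepsilon \ge 0$ as a supplement to the argument above (it is not quoted from the original proof):
$$\Delta_i \;\le\; (1-\eta_i)\,\Delta_{i-1} + \eta_i\,\varepsilon + C\,\eta_i^2, \qquad \eta_i = \tfrac{2}{i+1}, \qquad i = 1, \ldots, N \quad\Longrightarrow\quad \Delta_N \;\le\; \frac{4C}{N+2} + 2\varepsilon,$$
which follows by induction on $i$: the case $i = 1$ uses $\eta_1 = 1$, and the inductive step uses $\tfrac{i}{(i+1)^2} \le \tfrac{1}{i+2}$.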

We now show that the dependence of the regret bound of Algorithm 2 on the parameter $N$ is optimal up to constant factors.

Theorem 3.

Let $N$ be any specified bound on the total number of calls in each iteration to all copies of the base online linear learning algorithm. Then there is a setting of $1$-dimensional prediction with a $1$-bounded comparator function class $\mathcal{F}$, an online linear optimization algorithm over $\mathcal{F}$, and a class $\mathcal{C}$ of loss functions that is $2$-smooth on $\mathbb{B}(1)$, such that any online boosting algorithm for $\mathrm{CH}(\mathcal{F})$ with losses in $\mathcal{C}$ respecting the bound $N$ has regret at least $\Omega(T/N)$.

Proof.

Consider the following construction. At a high level, the setting is 1-dimensional regression with corresponding to squared loss. The domain and true labels of examples are in .

Define and , where , and let and be two distributions over where each bit is Bernoulli random variable with parameter and respectively, chosen independently of the other bits. Consider a sequence of examples generated as follows: , and the label is chosen from uniformly at random in each round.

Let for . The function class consists of a large number, , of functions , . For each and , we set w.p. , and w.p. , independently of all other values of and .

The base online linear learning algorithm is simply Hedge over the functions. In each round, the Hedge algorithm selects one of the functions in and uses that to predict the label, and for any sequence of examples, with high probability, incurs regret .

We set $\mathcal{C}$ to be the set of squared loss functions, i.e. functions of the form $\ell(y) = (y - y^\star)^2$ for $y^\star \in [-1, 1]$. Note that these loss functions are $2$-smooth on any ball $\mathbb{B}(D)$. In round $t$, the loss function is $\ell_t(y) = (y - y^\star_t)^2$.

Consider the function , which is in . Given any input sequence for it is easy to calculate that , and since the examples and predictions of functions on the examples are independent across iterations, a simple application of the multiplicative Chernoff bound implies that if , then with probability at least , we have .

Now suppose there is an online boosting algorithm making at most calls total to all copies of in each iteration, that for any large enough and for any sequence for , outputs predictions such that with high probability, say at least , we have . Then by a union bound, with probability at least , we have . By Markov’s inequality and a union bound, with probability at least , for a uniform random time , we have

(3)

or in other words, is on the same side of as , and thus can be used to identify . In the rest of the proof, we will use this fact, along with the fact that the total variation distance between and , denoted , is small, to derive a contradiction.

Define the random variable as follows. For any bit string , choose a random round , and simulate the online boosting process until round by sampling ’s and the outputs of for all and from the appropriate distributions. In round , let be the functions that are obtained from the at most calls to copies of (there could be repetitions). Assign for (being careful with repeated functions and repeating outputs appropriately), and run the booster with these outputs to obtain , and set . Let denote the probability of events in this process for generating given .

Let and denote expectation of a random variable when is drawn from and respectively, and let denote expectation of a random variable when is chosen from uniformly at random and then is sampled from . The above analysis (inequality (3)) implies that

Now define a random variable as . Since

we conclude, using the above bound, that . This is a contradiction since, because , we have

where the bound on is standard; see e.g. [15]. This gives us the desired contradiction. ∎

The above result can be easily extended to any given parameters $D$ and $\beta$ so that the class $\mathcal{F}$ is $D$-bounded and $\mathcal{C}$ is $\beta$-smooth on $\mathbb{B}(D)$, giving a lower bound of $\Omega(\beta D^2 T / N)$ on the regret of an online boosting algorithm for $\mathrm{CH}(\mathcal{F})$ with losses in $\mathcal{C}$: we simply scale all function and label values by $D$, and consider correspondingly scaled squared loss functions. If there were an online boosting algorithm for $\mathrm{CH}(\mathcal{F})$ with these loss functions with regret $o(\beta D^2 T / N)$, then by scaling down the predictions by $D$, we would obtain an online boosting algorithm for exactly the setting in the proof of Theorem 3 with a regret bound of $o(T/N)$, which is a contradiction.

4.2 Competing with the span of the base functions

In this section we show that Algorithm 1 satisfies the regret bound claimed in Theorem 1.

Proof of Theorem 1.

Let $f = \sum_{g \in S} w_g\, g$ for some finite subset $S$ of $\mathcal{F}$, where the $w_g$ are real coefficients. Since $\mathcal{F}$ is symmetric, we may assume that all $w_g \ge 0$, and let $W := \sum_{g \in S} w_g$. Furthermore, we may assume that the zero function is in $S$ with an appropriate nonnegative weight, so that $W$ may be taken to be any desired value that is at least the total weight on the nonzero functions. Note that $\|f\|_1$ is exactly the infimum of $W$ over all such ways of expressing $f$ as a finite weighted sum of functions in $\mathcal{F}$. We now prove that the bound stated in the theorem holds with $\|f\|_1$ replaced by $W$; the theorem then follows simply by taking the infimum of the bound over all such ways of expressing $f$.

Now, for each $i$, the update in line 14 of Algorithm 1 is exactly online gradient descent [25] on the domain $[0, 1]$ with linear loss functions whose gradients are the inner products $\nabla\ell_t(\mathbf{y}^{i-1}_t) \cdot \mathbf{y}^{i-1}_t$. Note that the derivative of this loss function is bounded in magnitude by $L_{D'} D'$, since $\|\mathbf{y}^{i-1}_t\| \le D'$ and subgradients of $\ell_t$ on $\mathbb{B}(D')$ have norm at most $L_{D'}$. Since the step size decreases as $1/\sqrt{t}$, the standard analysis of online gradient descent then implies that the sequence $\sigma^i_t$ for $t = 1, 2, \ldots, T$ satisfies

(4)

Next, since with , we have

(5)

Let . Since is an online learning algorithm for with regret bound for the -Lipschitz linear loss functions , and , multiplying the regret bound (1) by we have

(6)

by (5). Now, we analyze how much excess loss is potentially introduced due to the projection in line 8. First, note that if , then the projection has no effect since , and in this case . If , then by the definition of , , and since and , we have

In either case, we have

(7)

We now move to the main part of the analysis. Define, for $i = 0, 1, \ldots, N$, $\Delta_i := \sum_{t=1}^T \ell_t(\mathbf{y}^i_t) - \sum_{t=1}^T \ell_t(f(\mathbf{x}_t))$. We have