Optimal and Adaptive Algorithms for Online Boosting
We study online boosting, the task of converting any weak online learner into a strong online learner. Based on a novel and natural definition of weak online learnability, we develop two online boosting algorithms. The first algorithm is an online version of boost-by-majority. By proving a matching lower bound, we show that this algorithm is essentially optimal in terms of the number of weak learners and the sample complexity needed to achieve a specified accuracy. This optimal algorithm is not adaptive, however. Using tools from online loss minimization, we derive an adaptive online boosting algorithm that is also parameter-free, but not optimal. Both algorithms work with base learners that can handle example importance weights directly, as well as by rejection sampling examples with probability defined by the booster. Results are complemented with an experimental study.
We study online boosting, the task of boosting the accuracy of any weak online learning algorithm. The theory of boosting in the batch setting has been studied extensively in the literature and has led to great practical success. See the book by Schapire and Freund (2012) for a thorough discussion.
Online learning algorithms receive examples one by one, updating the predictor immediately after seeing each new example. In contrast to the batch setting, online learning algorithms typically don’t make any stochastic assumptions about the data they observe. They are often much faster, more memory-efficient, and apply to situations where the best predictor changes over time as new examples keep coming in.
Given the success of boosting in batch learning, it is natural to ask about the possibility of applying boosting to online learning. Indeed, there has already been some work on online boosting (Oza and Russell, 2001; Grabner and Bischof, 2006; Liu and Yu, 2007; Grabner et al., 2008; Chen et al., 2012, 2014).
From a theoretical viewpoint, recent work by Chen et al. (2012) is perhaps most interesting. They generalized the batch weak learning assumption to the online setting, and made a connection between online boosting and batch boosting that produces smooth distributions over the training examples. The resulting algorithm is guaranteed to achieve an arbitrarily small error rate as long as the number of weak learners and the number of examples are sufficiently large. No assumptions need to be made about how the data is generated. Indeed, the data can even be generated by an adversary.
We present a new online boosting algorithm, based on the boost-by-majority (BBM) algorithm of Freund (1995). This algorithm, called Online BBM, improves upon the work of Chen et al. (2012) in several respects:
our assumption on online weak learners is weaker and can be seen as a direct online analogue of the weak learning assumption in standard batch boosting,
our algorithm is optimal in the sense that no online boosting algorithm can achieve the same error rate with fewer weak learners or fewer examples asymptotically (see the lower bounds in Section 3.2).
A quantitative comparison of our results with those of Chen et al. (2012) appears in Table 1, where $N$ and $T_\epsilon$ represent the number of weak learners and examples needed to achieve error rate $\epsilon$, and $\gamma$ stands for a similar concept of the “edge” of the weak learning oracle as in the batch setting (smaller $\gamma$ means more inaccurate weak learners). In this paper, we use the $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ notation to suppress dependence on polylogarithmic factors in the natural parameters.
A clear drawback of all the algorithms mentioned above is the lack of adaptivity. A simple interpretation of this drawback is that all these algorithms require using $\gamma$, an unknown quantity, as a parameter. More importantly, this also means that the algorithm treats each weak learner equally and ignores the fact that some weak learners are actually doing better than the others. The best example of an adaptive boosting algorithm is the well-known parameter-free AdaBoost algorithm (Freund and Schapire, 1997), where each weak learner is naturally weighted by how accurate it is. In fact, adaptivity is known to be one of the key features that lead to the practical success of AdaBoost, and therefore should also be essential to the performance of online boosting algorithms. In Section 4, we thus propose AdaBoost.OL, an adaptive and parameter-free online boosting algorithm. As shown in Table 1, AdaBoost.OL is theoretically suboptimal in terms of $N$ and $T_\epsilon$. However, empirically it generally outperforms OSBoost and sometimes even beats the optimal algorithm Online BBM (see Section 5).
Our techniques are also very different from those of Chen et al. (2012), which rely on the smooth boosting algorithm of Servedio (2003). As far as we know, all other work on smooth boosting (Bshouty and Gavinsky, 2003; Bradley and Schapire, 2008; Barak et al., 2009) cannot be easily generalized to the online setting, necessitating completely different methods that do not rely on smooth distributions. Our Online BBM algorithm builds on top of a potential-based family that arises naturally in the batch setting as approximate minimax optimal algorithms for so-called drifting games (Schapire, 2001; Luo and Schapire, 2014). The decomposition of each example in that framework naturally allows us to generalize it to the online setting, where examples arrive one by one. On the other hand, AdaBoost.OL is derived by viewing boosting from a different angle: loss minimization (Mason et al., 2000; Schapire and Freund, 2012). The theory of online loss minimization is the key tool for developing AdaBoost.OL.
Finally, in Section 5, experiments on benchmark data are conducted to show that our new algorithms indeed improve over previous work.
2 Setup and Assumptions
We describe the formal setup of the task of online classification by boosting. At each time step $t = 1, 2, \ldots, T$, an adversary chooses an example $(\mathbf{x}_t, y_t) \in \mathcal{X} \times \{-1, 1\}$, where $\mathcal{X}$ is the domain, and reveals $\mathbf{x}_t$ to the online learner. The learner makes a prediction $\hat{y}_t \in \{-1, 1\}$ on its label, and suffers the 0-1 loss $\mathbf{1}\{\hat{y}_t \neq y_t\}$. As is usual with online algorithms, this prediction may be randomized.
For parameters $\gamma \in (0, \frac{1}{2})$, $\delta \in (0, 1)$, and a constant $S > 0$, the learner is said to be a weak online learner with edge $\gamma$ and excess loss $S$ if, for any $T$ and for any input sequence of examples $(\mathbf{x}_t, y_t)$ for $t = 1, 2, \ldots, T$ chosen adaptively, it generates predictions $\hat{y}_t$ such that with probability at least $1 - \delta$,
$$\sum_{t=1}^{T} \mathbf{1}\{\hat{y}_t \neq y_t\} \le \left(\tfrac{1}{2} - \gamma\right) T + S. \tag{1}$$
The excess loss requirement is necessary since an online learner can’t be expected to predict with any accuracy with too few examples. Essentially, the excess loss $S$ yields a kind of sample complexity bound: the weak learner starts obtaining a distinct edge of $\Omega(\gamma)$ over random guessing when $T \gg \frac{S}{\gamma}$. Typically, the dependence of the high probability bound on $\delta$ is polylogarithmic in $\frac{1}{\delta}$; thus in the following we will avoid explicitly mentioning $\delta$.
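To make the sample-complexity reading concrete, here is a short worked calculation. It assumes that inequality (1) has the form $\sum_{t} \mathbf{1}\{\hat{y}_t \neq y_t\} \le (\frac{1}{2} - \gamma) T + S$, with $\gamma$ the edge and $S$ the excess loss; the choice of "half the nominal edge" as the target is purely illustrative.

```latex
\left(\tfrac{1}{2} - \gamma\right) T + S \;\le\; \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right) T
\quad\Longleftrightarrow\quad
S \;\le\; \tfrac{\gamma}{2}\, T
\quad\Longleftrightarrow\quad
T \;\ge\; \frac{2S}{\gamma}.
```

Under that assumption, once $T \ge 2S/\gamma$ the fraction of mistakes is at most $\frac{1}{2} - \frac{\gamma}{2}$, so the learner retains at least half of its nominal edge over random guessing.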
For a given parameter $\epsilon > 0$, the learner is said to be a strong online learner with error rate $\epsilon$ if it satisfies the same conditions as a weak online learner except that its edge is $\frac{1}{2} - \epsilon$, or in other words, the fraction of mistakes made, asymptotically, is $\epsilon$. Just as for the weak learner, the excess loss $S$ yields a sample complexity bound: the fraction of mistakes made by the strong learner becomes $O(\epsilon)$ when $T \gg \frac{S}{\epsilon}$.
Our main theorem is the following:
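Given a weak online learning algorithm with edge $\gamma$ and excess loss $S$ and any target error rate $\epsilon > 0$, there is a strong online learning algorithm with error rate $\epsilon$ which uses $O(\frac{1}{\gamma^2} \ln \frac{1}{\epsilon})$ copies of the weak online learner, and has excess loss $\tilde{O}(\frac{S}{\gamma} + \frac{1}{\gamma^2})$; thus its sample complexity is $\tilde{O}(\frac{1}{\epsilon}(\frac{S}{\gamma} + \frac{1}{\gamma^2}))$. Furthermore, if $S \ge \tilde{\Omega}(\frac{1}{\gamma})$, then the number of weak online learners is optimal up to constant factors, and the sample complexity is optimal up to polylogarithmic factors.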
The requirement that in the lower bound is not very stringent; this is precisely the excess loss one obtains when using standard online learning algorithms with regret bound , as is explained in the discussion following Lemma 2. Furthermore, since we require the bound (1) to hold with high probability, typical analyses of online learning algorithms will have an deviation term, which also leads to .
As the theorem indicates, the strong online learner (hereafter referred to as “booster”) works by maintaining $N$ copies of the weak online learner, for some positive integer $N$ to be specified later. Denote the weak online learners $WL^i$ for $i = 1, 2, \ldots, N$. At time step $t$, the prediction of the $i$-th weak online learner is given by $WL^i(\mathbf{x}_t) \in \{-1, 1\}$. Note the slight abuse of notation here: $WL^i$ is not a function, rather it is an algorithm with an internal state that is updated as it is fed training examples. Thus, the prediction $WL^i(\mathbf{x}_t)$ depends on the internal state of $WL^i$, and for notational convenience we avoid reference to the internal state.
In each round $t$, the booster works by taking a weighted majority vote of the weak learners’ predictions. Specifically, the booster maintains weights $\alpha_t^i \in \mathbb{R}$ for $i = 1, \ldots, N$ corresponding to each weak learner, and its final prediction will then be $\hat{y}_t = \mathrm{sign}(\sum_{i=1}^{N} \alpha_t^i WL^i(\mathbf{x}_t))$, where $\mathrm{sign}(\cdot)$ is $1$ if the argument is nonnegative and $-1$ otherwise. (In Section 4 a slightly different final prediction will be used.) After making the prediction, the true label $y_t$ is revealed by the environment. The booster then updates $WL^i$ by passing the training example $(\mathbf{x}_t, y_t)$ to $WL^i$ with a carefully chosen sampling probability $p_t^i$ (and not passing the example with the remaining probability). The sampling probability $p_t^i$ is obtained by computing a weight $w_t^i$ and setting $p_t^i = w_t^i / \|\mathbf{w}^i\|_\infty$, where $\mathbf{w}^i = \langle w_1^i, w_2^i, \ldots, w_T^i \rangle$. (In the algorithm we simply use a tight-enough upper bound on $\|\mathbf{w}^i\|_\infty$, such as the bound from Lemma 4, to compute the values $p_t^i$; we abuse notation here and use $\|\mathbf{w}^i\|_\infty$ to also denote this upper bound.) At the same time the booster updates $\alpha_t^i$ as well, and then it is ready to make a prediction for the next round.
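The round structure just described can be sketched as follows. This is a minimal illustration, not the paper's pseudocode: the `MajorityLabelLearner` class is a hypothetical stand-in for a weak online learner, the per-learner weights and their upper bounds are passed in precomputed (how they are chosen depends on the specific boosting algorithm), and the update of the voting weights is omitted.

```python
import random

def sign(v):
    """Return 1 if v is nonnegative and -1 otherwise, as in the text."""
    return 1 if v >= 0 else -1

class MajorityLabelLearner:
    """Hypothetical weak online learner: predicts the majority label seen so far."""
    def __init__(self):
        self.balance = 0       # (# of +1 labels seen) - (# of -1 labels seen)
        self.num_updates = 0   # how many examples were actually passed in

    def predict(self, x):
        return sign(self.balance)

    def update(self, x, y):
        self.balance += y
        self.num_updates += 1

def booster_round(learners, alphas, x, y, weights, weight_bounds, rng):
    """Predict by weighted majority vote, then pass (x, y) to each learner
    with sampling probability weight / bound (rejection sampling)."""
    preds = [wl.predict(x) for wl in learners]
    y_hat = sign(sum(a * p for a, p in zip(alphas, preds)))
    for wl, w, bound in zip(learners, weights, weight_bounds):
        p = min(1.0, w / bound)     # sampling probability for this learner
        if rng.random() < p:        # keep the example with probability p
            wl.update(x, y)
    return y_hat

# Toy round: the first two learners always receive the example, the third never.
rng = random.Random(0)
learners = [MajorityLabelLearner() for _ in range(3)]
y_hat = booster_round(learners, [1.0, 1.0, 1.0], x=None, y=-1,
                      weights=[0.5, 0.5, 0.0], weight_bounds=[0.5, 0.5, 0.5],
                      rng=rng)
```

With an importance-weight-aware learner, the same loop could instead call `update` on every round and pass `p` as the example's weight.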
We introduce some more notation to ease the presentation. Let and with . Define . Finally, a martingale concentration bound using (1) yields the following bound (proof deferred to Appendix A). The bound can be seen as a weighted version of (1) which is necessary for the rest of the analysis.
There is a constant such that for any , with high probability, for every weak learner we have
2.1 Handling Importance Weights
Typical online learning algorithms can handle importance weighted examples: each example comes with a weight , and the loss on that example is scaled by , i.e. the loss for predicting is . Consider the following natural extension to the definition of online weak learners which incorporates importance weighted examples: we now require that for any sequence of weighted examples with weight for , the online learner generates predictions such that with probability at least ,
Having access to such a weak learner makes the boosting algorithm simpler: we now simply pass every example to every weak learner using the probability as importance weights. The advantage is that the bound (2) immediately implies the following inequality for any weak learner , which can be seen as a (stronger) analogue of Lemma 1.
Since the analysis depends only on the bound in Lemma 1, if we use the importance-weighted version of the boosting algorithm, then we can simply use inequality (3) instead in the analysis, which gives a slightly tighter version of Theorem 1, viz. the excess loss can now be bounded by .
In the rest of the paper, for simplicity of exposition we assume that the ’s are used as sampling probabilities rather than importance weights, and give the analysis using the bound from Lemma 1. In experiments, however, using the ’s as importance weights rather than sampling probabilities led to better performance.
2.2 Discussion of Weak Online Learning Assumption
We now justify our definition of weak online learning, viz. inequality (1). In the standard batch boosting case, the corresponding weak learning assumption (see for example Schapire and Freund (2012)) made is that there is an algorithm which, given a training set of examples and an arbitrary distribution on it, generates a hypothesis that has error at most on the training data under the given distribution. This statement can be interpreted as making the following two implicit assumptions:
(Richness.) Given an edge parameter , there is a set of hypotheses, , such that given any training set (possibly, a multiset) of examples , there is some hypothesis with error at most , i.e.
(Agnostic Learnability.) For any , there is an algorithm which, given any training set (possibly, a multiset) of examples , can compute a nearly optimal hypothesis , i.e.
Our weak online learning assumption can be seen as arising from a direct generalization of the above two assumptions to the online setting. Namely, the richness assumption stays the same, whereas the agnostic learnability assumption on $\mathcal{H}$ is replaced by an agnostic online learnability assumption on $\mathcal{H}$ (cf. Ben-David et al. (2009)). That is, there is an online learning algorithm which, given any sequence of examples, for , generates predictions such that
where is the regret, a non-decreasing, sublinear function of the number of prediction periods . Since online learning algorithms are typically randomized, we assume the above bound holds with high probability. The following lemma shows that richness and agnostic online learnability immediately imply our online weak learning assumption (1):
Suppose the sequence of examples is obtained from a data set for which there exists a hypothesis class that is both rich for edge parameter and agnostically online learnable with regret . Then, the agnostic online learning algorithm for satisfies the weak learning assumption (1), with edge and excess loss .
For the given sequence of examples for , the richness with edge parameter and agnostic online learnability assumptions on imply that with high probability, the predictions generated by the agnostic online learning algorithm for satisfy
It only remains to show that
or equivalently, , which is true by the definition of . This concludes the proof. ∎
Various agnostic online learning algorithms are known that have a regret bound of ; for example, a standard experts algorithm on a finite hypothesis space such as Hedge. If we use such an online learning algorithm as a weak online learner, then a simple calculation implies, via Lemma 2, that it has excess loss . Thus, by Theorem 1, we obtain an online boosting algorithm with near-optimal sample complexity.
3 An Optimal Algorithm
In this section, we generalize a family of potential-based batch boosting algorithms to the online setting. With a specific potential, an online version of boost-by-majority is developed with an optimal number of weak learners and near-optimal sample complexity. Matching lower bounds are shown at the end of the section.
3.1 A Potential Based Family and Boost-By-Majority
In the batch setting, many boosting algorithms can be understood in a unified framework called drifting games (Schapire, 2001). Here, we generalize the analysis and propose a potential based family of online boosting algorithms.
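Pick a sequence of $N + 1$ non-increasing potential functions $\Phi_i(s)$, $i = 0, 1, \ldots, N$, such that
$$\Phi_N(s) \ge \mathbf{1}\{s \le 0\}, \qquad \Phi_{i-1}(s) \ge \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right) \Phi_i(s - 1) + \left(\tfrac{1}{2} + \tfrac{\gamma}{2}\right) \Phi_i(s + 1). \tag{4}$$
Then the algorithm is simply to set $\alpha_t^i = 1$ and $w_t^i = \frac{1}{2}\left(\Phi_i(s_t^{i-1} - 1) - \Phi_i(s_t^{i-1} + 1)\right)$. The following theorem states the error rate bound of this general scheme.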
For any and , with high probability, the number of mistakes made by the algorithm described above is bounded as follows:
The key property of the algorithm is that for any fixed and , one can verify the following:
by plugging in the formula for $w_t^i$, realizing that $z_t^i$ can only be $-1$ or $1$, and using the recurrence in Eq. (4). Summing over $t = 1$ to $T$, we get
Using Lemma 1, we get
which relates the sums of all examples’ potential for two successive weak learners. We can therefore apply this inequality iteratively to arrive at:
The proof is completed by noting that
since by definition. ∎
Note that the term becomes a penalty for the final error rate. Therefore, we naturally want this penalty term to be relatively small. This is not necessarily true for any choice of the potential function. For example, if is the exponential potential that leads to a variant of AdaBoost in the batch setting (see Schapire and Freund, 2012, Chap. 13), then the weight could be exponentially large.
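Fortunately, there is indeed a set of potential functions that produces small weights, which, in the batch setting, corresponds to an algorithm called boost-by-majority (BBM) (Freund, 1995). All we need to do is to let Eq. (4) hold with equality, and direct calculation shows:
$$\Phi_i(s) = \sum_{k=0}^{\lfloor (N-i-s)/2 \rfloor} \binom{N-i}{k} \left(\tfrac{1}{2} + \tfrac{\gamma}{2}\right)^{k} \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right)^{N-i-k}, \qquad w_t^i = \frac{1}{2} \binom{N-i}{k_t^i} \left(\tfrac{1}{2} + \tfrac{\gamma}{2}\right)^{k_t^i} \left(\tfrac{1}{2} - \tfrac{\gamma}{2}\right)^{N-i-k_t^i}, \tag{5}$$
where $k_t^i = \lfloor \frac{N - i - s_t^{i-1} + 1}{2} \rfloor$ and $\binom{n}{k}$ is defined to be $0$ if $k < 0$ or $k > n$. In other words, imagine flipping a biased coin whose probability of heads is $\frac{1}{2} + \frac{\gamma}{2}$ for $N - i$ times. Then $\Phi_i(s)$ is exactly the probability of seeing at most $\lfloor (N-i-s)/2 \rfloor$ heads and $w_t^i$ is half of the probability of seeing $k_t^i$ heads. We call this algorithm Online BBM. The pseudocode is given in Algorithm 1.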
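The coin-flip description above can be turned into a small numerical sketch. It assumes the following reading: with $N$ weak learners and edge parameter $\gamma$, the potential of learner $i$ at margin $s$ is the probability of at most $\lfloor (N-i-s)/2 \rfloor$ heads in $N-i$ flips of a coin with heads probability $\frac{1}{2}+\frac{\gamma}{2}$, and the weight is half the probability of exactly $\lfloor (N-i-s+1)/2 \rfloor$ heads. The function names are hypothetical and the margin `s_prev` is assumed to be an integer (unit voting weights).

```python
from math import comb

def binom_pmf(n, k, p):
    """P(exactly k heads in n flips); 0 if k < 0 or k > n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def potential(N, gamma, i, s):
    """P(at most floor((N - i - s) / 2) heads in N - i biased flips)."""
    n = N - i
    p = 0.5 + gamma / 2
    k_max = (n - s) // 2          # floor division also handles negative values
    return sum(binom_pmf(n, k, p) for k in range(0, k_max + 1))

def bbm_weight(N, gamma, i, s_prev):
    """Half the probability of exactly floor((N - i - s_prev + 1) / 2) heads."""
    n = N - i
    k = (n - s_prev + 1) // 2
    return 0.5 * binom_pmf(n, k, 0.5 + gamma / 2)

# Example: weight for the first of N = 5 learners at margin 0 with gamma = 0.2.
w = bbm_weight(N=5, gamma=0.2, i=1, s_prev=0)
```

Under this reading the weight equals half the difference of the potential at `s_prev - 1` and `s_prev + 1`, which is a convenient consistency check when implementing the update.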
One can see that the weights produced by this algorithm are small since trivially . To get a better result, however, we need a better estimate of stated in the following lemma.
If $w_t^i$ is defined as in Eq. (5), then we have $w_t^i = O(1/\sqrt{N - i})$ for any $i < N$.
This lemma was essentially proven before by Freund (1993, Lemma 2.3.10). We give an alternative and simpler proof in Appendix B in the supplementary material by using the Berry-Esseen theorem directly. We are now ready to state the main results for Online BBM.
For any and , with high probability, the number of mistakes made by the Online BBM algorithm is bounded as follows:
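Thus, in order to achieve error rate $\epsilon$, it suffices to use $N = \frac{2}{\gamma^2} \ln \frac{1}{\epsilon}$ weak learners, which gives an excess loss bound of $\tilde{O}(\frac{S}{\gamma} + \frac{1}{\gamma^2})$.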
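As a sanity check on the number of weak learners, the following calculation bounds the coin-flip probability from Section 3.1 with Hoeffding's inequality. It assumes that the leading term of the mistake bound is $\Phi_0(0)\,T$, where $\Phi_0(0)$ is the probability of seeing at most $N/2$ heads in $N$ flips of a coin with heads probability $\frac{1}{2} + \frac{\gamma}{2}$; here $X$ denotes the number of heads, so $\mathbb{E}[X] = N(\frac{1}{2} + \frac{\gamma}{2})$.

```latex
\Phi_0(0) = \Pr\!\left[X \le \tfrac{N}{2}\right]
          = \Pr\!\left[X - \mathbb{E}[X] \le -\tfrac{\gamma N}{2}\right]
          \le \exp\!\left(-\frac{2\,(\gamma N/2)^2}{N}\right)
          = \exp\!\left(-\frac{\gamma^2 N}{2}\right),
\qquad\text{so}\qquad
N \ge \frac{2}{\gamma^2} \ln\frac{1}{\epsilon}
\;\Longrightarrow\;
\Phi_0(0) \le \epsilon .
```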
3.2 Matching Lower Bounds
We give lower bounds for the number of weak learners and the sample complexity in this section that show that our Online BBM algorithm is optimal up to logarithmic factors.
For any , , and , there is a weak online learning algorithm with edge and excess loss satisfying (1) with probability at least , such that to achieve error rate , an online boosting algorithm needs at least weak learners and a sample complexity of .
The proofs of both lower bounds use a similar construction. In either case, all examples’ labels are generated uniformly at random from , and in time period , each weak learner outputs the correct label independently of all other weak learners and other examples with a certain probability to be specified later. Thus, for any , by the Azuma-Hoeffding inequality, with probability at least , the predictions made by the weak learner satisfy
For the lower bound on the number of weak learners, we set , so that inequality (7) implies that with probability at least , the predictions made by the weak learner satisfy
Thus, the weak online learner has edge with excess loss . In this case, the Bayes optimal output of a booster using weak learners is to simply take a majority vote of all the weak learners (see for instance Schapire and Freund, 2012, Chap. 13.2.6), and the probability that the majority vote is incorrect is . Setting this error to and solving for gives the desired lower bound.
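To illustrate how the error of a plain majority vote decays with the number of weak learners in a construction of this kind, here is a minimal sketch that computes it exactly. It assumes an odd number `n` of independent weak learners, each correct with the same probability `p`; the function name and the numeric values are illustrative and not taken from the paper.

```python
from math import comb

def majority_vote_error(n, p):
    """P(majority of n independent voters is wrong), each correct w.p. p; n odd."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    # The vote is wrong exactly when at most (n - 1) / 2 voters are correct.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n - 1) // 2 + 1))

# Error of the vote for 1, 11 and 101 voters that are each right 60% of the time.
errors = [majority_vote_error(n, 0.6) for n in (1, 11, 101)]
```

The error shrinks as `n` grows, and inverting this relationship for a target error is what yields a lower bound on the number of weak learners.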
Now we turn to the lower bound on the sample complexity. We divide the whole process into two phases: for , we set , and for , we set . Now, if , inequality (7) implies that with probability at least , the predictions made by the weak learner satisfy
using the fact that and . Next, if , let , and again inequality (7) implies that with probability at least , the predictions made by the weak learner satisfy
However, in the first phase (i.e. ), since the predictions of the weak learners are uncorrelated with the true labels, it is clear that no matter what the booster does, it makes a mistake with probability . Thus, it will make mistakes with high probability in the first phase, and thus to achieve error rate, it needs at least examples. ∎
4 An Adaptive Algorithm
Although the Online BBM algorithm is optimal, it is unfortunately not adaptive since it requires the knowledge of $\gamma$ as a parameter, which is unknown ahead of time. As discussed in the introduction, adaptivity is essential to the practical performance of boosting algorithms such as AdaBoost.
In this section we thus study adaptive online boosting algorithms using the theory of online loss minimization as the main tool. It is known that boosting can be viewed as trying to find a linear combination of weak hypotheses to minimize the total loss of the training examples, usually using functional gradient descent (for details, see Schapire and Freund, 2012, Chap. 7). AdaBoost, for instance, minimizes the exponential loss. Here, as discussed before, we intuitively want to avoid using exponential loss since it could lead to large weights. Instead, we will consider logistic loss $\ell(s) = \ln(1 + \exp(-s))$, which results in an algorithm called AdaBoost.L in the batch setting (Schapire and Freund, 2012, Chap. 7).
In the online setting, we conceptually define $N$ different “experts” giving advice on what to predict on the current example $\mathbf{x}_t$. In round $t$, expert $i$ predicts by combining the first $i$ weak learners: $\hat{y}_t^i = \mathrm{sign}(\sum_{j=1}^{i} \alpha_t^j WL^j(\mathbf{x}_t))$. Now as in AdaBoost.L, the weight $w_t^i$ for $WL^i$ is obtained by computing the logistic loss of the prediction of expert $i - 1$, i.e. $\ell(s_t^{i-1})$, and then setting $w_t^i$ to be the negative derivative of the loss:
$$w_t^i = -\ell'(s_t^{i-1}) = \frac{1}{1 + \exp(s_t^{i-1})} \in [0, 1].$$
In terms of the weight of , i.e. , ideally we wish to mimic AdaBoost.L and use a fixed for all such that the total logistic loss is minimized: . Of course this is not possible because depends on the future unknown examples. Nevertheless, it turns out that we can almost achieve that using tools from online learning theory. Indeed, one of the fundamental topics in online learning is exactly how to perform almost as well as the best fixed choice () in hindsight.
Specifically, it turns out that it suffices to restrict to the feasible set . Then consider the following simple one dimensional online learning problem: on each round , algorithm predicts from a feasible set ; the environment then reveals loss function and the algorithm suffers loss . There are many so-called “low-regret” algorithms in the literature (see the survey by Shalev-Shwartz (2011)) for this problem ensuring
where is sublinear in so that on average it goes to when is large and the algorithm is thus doing almost as well as the best constant choice . The simplest low-regret algorithm in this case is perhaps online gradient descent (Zinkevich, 2003):
where is a time-varying learning rate and represents projection onto the set , i.e., . Since the loss function is actually -Lipschitz (), if we set to be , then standard analysis shows .
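A single online gradient descent step for this one-dimensional problem can be sketched as below. The sketch assumes the per-round loss is the logistic loss $\ln(1 + \exp(-(s + \alpha z)))$, where $s$ is the margin accumulated by the previous weak learners and $z \in \{-1, 1\}$ indicates whether the current weak learner was correct; the feasible interval $[-2, 2]$ and the step size (interval diameter divided by $\sqrt{t}$, using a Lipschitz constant of 1) are assumptions of this sketch, and the function names are hypothetical.

```python
import math

def logistic_loss(s):
    """Logistic loss ln(1 + exp(-s)) of a margin s."""
    return math.log1p(math.exp(-s))

def logistic_weight(s):
    """Negative derivative of the logistic loss at s; always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(s))

def ogd_step(alpha, s_prev, z, t, radius=2.0):
    """One projected gradient step on f(alpha) = logistic_loss(s_prev + alpha * z).

    radius: half-width of the feasible interval [-radius, radius] (assumed).
    t:      1-based round index, used for the decaying step size.
    """
    grad = -z * logistic_weight(s_prev + alpha * z)    # f'(alpha)
    eta = 2.0 * radius / math.sqrt(t)                  # diameter / sqrt(t)
    stepped = alpha - eta * grad
    return max(-radius, min(radius, stepped))          # project onto the interval

# A correct weak learner (z = +1) pushes its voting weight up.
alpha_next = ogd_step(alpha=0.0, s_prev=0.0, z=1, t=16)
```

A learner that is wrong more often than right receives `z = -1` more often, so the same update drives its voting weight negative.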
Finally, it remains to specify the algorithm’s final prediction . In Online BBM, we simply used the advice of expert . Unfortunately the algorithm described in this section cannot guarantee that expert will always make highly accurate predictions. However, as we will show in the proof of Theorem 4, the algorithm does ensure that at least one of the experts will have high accuracy. Therefore, what we really need to do is to decide which expert to follow on each round, and try to predict almost as well as the best fixed expert in hindsight. This is again another classic online learning problem (called the expert or hedge problem), and can be solved, for instance, by the well-known Hedge algorithm (Littlestone and Warmuth, 1994; Freund and Schapire, 1997). The idea is to pick an expert on each round randomly with different importance weights according to their previous performance.
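The expert-selection idea can be sketched with a standard Hedge-style update. This is an illustrative sketch rather than the paper's pseudocode: the 0-1 loss per expert, the learning rate `eta = 1.0`, and the toy loss sequence are assumptions, and the function names are hypothetical.

```python
import math
import random

def hedge_pick(v, rng):
    """Sample an expert index with probability proportional to its weight."""
    return rng.choices(range(len(v)), weights=v, k=1)[0]

def hedge_update(v, mistakes, eta=1.0):
    """Multiply each weight by exp(-eta * loss), where loss is 1 on a mistake."""
    return [vi * math.exp(-eta * m) for vi, m in zip(v, mistakes)]

rng = random.Random(0)
v = [1.0, 1.0, 1.0]                  # one weight per expert
for _ in range(20):
    chosen = hedge_pick(v, rng)      # expert whose prediction is followed
    v = hedge_update(v, [1, 0, 1])   # toy losses: only expert 1 is always right
probs = [vi / sum(v) for vi in v]    # selection probabilities after 20 rounds
```

After a handful of rounds almost all of the probability mass sits on the expert that has made the fewest mistakes, which is the behaviour the final prediction rule relies on.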
We call the final resulting algorithm AdaBoost.OL (O stands for online and L stands for logistic loss), and summarize it in Algorithm 2. Note that as promised, AdaBoost.OL is an adaptive online boosting algorithm and does not require knowing in advance. In fact, in the analysis we do not even assume that the weak learners satisfy the bound (1). Instead, define the quantities for each weak learner . This can be interpreted as the (weighted) edge over random guessing that obtains. Note that may even be negative, which means flipping the sign of ’s predictions performs better than random guessing. Nevertheless, the algorithm can still make accurate predictions even with negative since it will end up choosing negative weights in that case. The performance of AdaBoost.OL is provided below.
For any and , with high probability, the number of mistakes made by AdaBoost.OL is bounded by
Let the number of mistakes made by expert be , also define for convenience. Note that AdaBoost.OL is using a variant of the Hedge algorithm with being the loss of expert on round (Line 7 and 15). So by standard analysis (see e.g. Cesa-Bianchi and Lugosi, 2006, Corollary 2.3), and the Azuma-Hoeffding inequality, we have with high probability
Now, whenever expert makes a mistake (i.e. ), we have and therefore
Note that Eq. (11) holds even for by the definition of . We now bound the difference between the logistic loss of two successive experts, . Online gradient descent (Line 13) ensures that
as discussed previously. On the other hand, direct calculation shows . With , we thus have
Summing over and rearranging gives
which implies that
since for all , for all and . Using this bound in inequality (10), we get
where the last inequality follows from the bound , where is the hidden factor in the term, using the arithmetic mean-geometric mean inequality. ∎
For the case when the weak learners do satisfy the bound (1), we get the following bound on the number of errors:
If the weak learners satisfy (1), then for any and , with high probability, the number of mistakes made by AdaBoost.OL is bounded by
Thus, in order to achieve error rate , it suffices to use weak learners, which gives an excess loss bound of .
The proof is along the same lines as that of Theorem 4. The only change is that in inequality (14), we use the bound which follows from Lemma 1 using the fact that implies for non-negative and , and the fact that . This leads to the following change in inequality (15):
Continuing using this bound in the proof and simplifying, we get the stated bound on the number of errors. ∎
The following lemma is a simple calculation:
For any ,
It suffices to prove the bound for ; the bound for follows by simply using the bound for . For , setting gives
since for . For , setting