# A Bayesian Approach for Online Classifier Ensemble

## Abstract

We propose a Bayesian approach for recursively estimating the classifier weights in online learning of a classifier ensemble. In contrast with past methods, such as stochastic gradient descent or online boosting, our approach estimates the weights by recursively updating their posterior distribution. For a specified class of loss functions, we show that it is possible to formulate a suitably defined likelihood function and hence use the posterior distribution as an approximation to the global empirical loss minimizer. If the stream of training data is sampled from a stationary process, we also show that our approach admits a faster rate of convergence to the expected loss minimizer than is possible with standard stochastic gradient descent. In experiments with real-world datasets, our formulation often performs better than state-of-the-art stochastic gradient descent and online boosting algorithms.

Bai, Lam and Sclaroff

Keywords: Online learning, classifier ensembles, Bayesian methods.

## 1 Introduction

The basic idea of classifier ensembles is to enhance the performance of individual classifiers by combining them. In the offline setting, a popular approach to obtain the ensemble weights is to minimize the training error, or a surrogate risk function that approximates the training error. Solving this optimization problem usually calls for various sorts of gradient descent methods. For example, the most successful and popular ensemble technique, boosting, can be viewed in such a way (Freund and Schapire, 1995; Mason et al., 1999; Friedman, 2001; Telgarsky, 2012). Given the success of these ensemble techniques in a variety of batch learning tasks, it is natural to consider extending this idea to the online setting, where the labeled sample pairs are presented to and processed by the algorithm sequentially, one at a time.

Indeed, online versions of ensemble methods have been proposed from a spectrum of perspectives. Some of these works focus on close approximation of offline ensemble schemes, such as boosting (Oza and Russell, 2001; Pelossof et al., 2009). Other methods are based on stochastic gradient descent (Babenko et al., 2009b; Leistner et al., 2009; Grbovic and Vucetic, 2011). Recently, Chen et al. (2012) formulated a smoothed boosting algorithm based on the analysis of regret from offline benchmarks. Despite their success in many applications (Grabner and Bischof, 2006; Babenko et al., 2009a), however, there are some common drawbacks of these online ensemble methods, including the lack of a universal framework for theoretical analysis and comparison, and the ad hoc tuning of learning parameters such as step size.

In this work, we propose an online ensemble classification method that is not based on boosting or gradient descent. The main idea is to recursively estimate a posterior distribution of the ensemble weights in a Bayesian manner. We show that, for a given class of loss functions, we can define a likelihood function on the ensemble weights and, with an appropriately formulated prior distribution, we can generate a posterior mean that closely approximates the empirical loss minimizer. If the stream of training data is sampled from a stationary process, this posterior mean converges to the expected loss minimizer.

Let us briefly explain the rationale for this scheme, which should be contrasted with the usual Bayesian setup where the likelihood is chosen to closely describe the generating process of the training data. In our framework, we view Bayesian updating as a loss minimization procedure: it provides an approximation to the minimizer of a well-defined risk function. More precisely, this risk minimization interpretation comes from two results in statistical asymptotic theory. First, under mild regularity conditions, a Bayesian posterior distribution tends to peak at the maximum likelihood estimate (MLE) of the same likelihood function, as a consequence of the so-called Laplace method (MacKay, 2003). Second, the MLE can be viewed as a risk minimizer, where the risk is defined precisely as the expected negative log-likelihood. Therefore, given a user-defined loss function, one can choose a suitable log-likelihood as a pure artifact, and apply a corresponding Bayesian update to minimize the risk. We will develop the theoretical foundation that justifies the above rationale.

Our proposed online ensemble classifier learning scheme is straightforward, but powerful in two respects. First, whenever our scheme is applicable, it can approximate the global optimal solution, in contrast with local methods such as stochastic gradient descent (SGD). Second, assuming the training data is sampled from a stationary process, our proposed scheme possesses a rate of convergence to the expected loss minimizer that is at least as fast as standard SGD. In fact, our rate is faster unless the SGD step size is chosen optimally, which cannot be done a priori in the online setting. Furthermore, we also found that our method performs better in experiments with finite datasets compared with the averaging schemes in SGD (Polyak and Juditsky, 1992; Schmidt et al., 2013) that have the same optimal theoretical convergence rate as our method.

In addition to providing a theoretical analysis of our formulation, we also tested our approach on real-world datasets and compared with individual classifiers, a baseline stochastic gradient descent method for learning classifier ensembles, and their averaging variants, as well as state-of-the-art online boosting methods. We found that our scheme consistently achieves superior performance over the baselines and often performs better than state-of-the-art online boosting algorithms, further demonstrating the validity of our theoretical analysis.

In summary, our contributions are:

1. We propose a Bayesian approach to estimate the classifier weights with closed-form updates for online learning of classifier ensembles.

2. We provide theoretical analyses of both the convergence guarantee and the bound on prediction error.

3. We compare the asymptotic convergence rate of the proposed framework with that of previous gradient descent frameworks, thereby demonstrating the advantage of the proposed framework.

This paper is organized as follows. We first briefly discuss related work in Section 2. We then state our approach in detail and provide theoretical guarantees in Section 3. A specific example for solving the online ensemble problem is provided in Section 4, and numerical experiments are reported in Section 5. We discuss the use of other loss functions for online ensemble learning in Section 6 and conclude the paper in Section 7 with future work. Some technical proofs are left to the Appendix.

## 2 Related work

There is considerable past work on online ensemble learning. Many past works have focused on online learning with concept drift (Wang et al., 2003; Kolter and Maloof, 2005, 2007; Minku, 2011), where dynamic strategies of pruning and rebuilding ensemble members are usually considered. Given the technical difficulty, theoretical analysis for concept drift seems to be underdeveloped. Kolter and Maloof (2005) proved error bounds for their proposed method, which appears to be the first such theoretical analysis, yet such analysis is not easily generalized to other methods in this category. Other works, such as Schapire (2001), and Cesa-Bianchi and Lugosi (2003), obtained performance bounds from the perspective of iterative games.

Our work is more closely related to methods that operate in a stationary environment, most notably some online boosting methods. One of the first methods was proposed by Oza and Russell (2001), who showed asymptotic convergence to batch boosting under certain conditions. However, the convergence result only holds for some simple “lossless” weak classifiers (Oza, 2001), such as Naïve Bayes. Other variants of online boosting have been proposed, such as methods that employ feature selection (Grabner and Bischof, 2006; Liu and Yu, 2007), semi-supervised learning (Grabner et al., 2008), multiple instance learning (Babenko et al., 2009a), and multi-class learning (Saffari et al., 2010). However, most of these works consider the design and update of weak classifiers beyond that of Oza (2001) and, thus, do not bear the convergence guarantee therein. Other methods employ the gradient descent framework, such as Online GradientBoost (Leistner et al., 2009), Online Stochastic Boosting (Babenko et al., 2009b) and Incremental Boosting (Grbovic and Vucetic, 2011). There are convergence results given for many of these, which provide a basis for comparison with our framework. In fact, we show that our method compares favorably to gradient descent in terms of asymptotic convergence rate.

Other recent online boosting methods (Chen et al., 2012; Beygelzimer et al., 2015) generalize the weak learning assumption to online learning, and can offer theoretical guarantees on the error rate of the learned strong classifier if certain performance assumptions are satisfied for the weak learners. Our work differs from these approaches, in that our formulation and theoretical analysis focuses on the classes of loss functions, rather than imposing assumptions on the set of weak learners. In particular, we show that the ensemble weights in our algorithm converge asymptotically at an optimal rate to the minimizer of the expected loss.

Our proposed optimization scheme is related to two other lines of work. The first is the so-called model-based method for global optimization (Zlochin et al., 2004; Rubinstein and Kroese, 2004; Hu et al., 2007). This method iteratively generates an approximately optimal solution as the summary statistic for an evolving probability distribution. It is primarily designed for deterministic optimization, in contrast to the stochastic optimization setting that we consider. Second, our approach is, at least superficially, related to Bayesian model averaging (BMA) (Hoeting et al., 1999). While BMA is motivated from a model selection viewpoint and aims to combine several candidate models for better description of the data, our approach does not impose any model but instead targets loss minimization.

The present work builds on an earlier conference paper (Bai et al., 2014). We make several generalizations here. First, we remove a restrictive, non-standard requirement on the loss function (which requires the loss function to satisfy a certain integral equality; Assumption 2 in Bai et al., 2014). Second, we conduct experiments that compare our formulation with two variants of the SGD baseline in Bai et al. (2014), where the ensemble weights are estimated via two averaging schemes of SGD, namely Polyak-Juditsky averaging (Polyak and Juditsky, 1992) and Stochastic Average Gradient (Schmidt et al., 2013). Third, we evaluate two additional loss functions for ensemble learning and compare them with the loss function proposed in Bai et al. (2014).

## 3 Bayesian Recursive Ensemble

We denote the input feature by $x$ and its classification label by $y$ ($1$ or $-1$). We assume that we are given $m$ binary weak classifiers, and our goal is to find the best ensemble weights $\lambda = (\lambda_1, \ldots, \lambda_m)$, where $\lambda_i \geq 0$, to construct an ensemble classifier. For now, we do not impose a particular form of ensemble method (we defer this until Section 4), although one example form is a weighted vote of the weak classifiers' outputs. We focus on online learning, where training data comes in sequentially, one at a time at $t = 1, 2, 3, \ldots$.

### 3.1 Loss Minimization Formulation

We formulate the online ensemble learning problem as a stochastic loss minimization problem. We first introduce a loss function at the weak classifier level. Given a training pair $(x, y)$ and an arbitrary weak classifier, we denote by $g$ a non-negative loss function that compares the classifier's prediction on $x$ with the label $y$. Popular choices of $g$ include the logistic loss, hinge loss, ramp loss, zero-one loss, etc. If the classifier is the $i$-th of the given weak classifiers, we denote its loss as $g_i(x, y)$, or simply $g_i$ for ease of notation. Furthermore, we define $g_i^t$ as the loss incurred at time $t$ by the updated $i$-th weak classifier on the training sample $(x^t, y^t)$. To simplify notation, we use $g = (g_1, \ldots, g_m)$ to denote the vector of losses for the weak classifiers, $g^t = (g_1^t, \ldots, g_m^t)$ to denote the losses at time $t$, and $g^{1:T} = (g^1, \ldots, g^T)$ to denote the losses up to time $T$.

With the above notation, we let $\ell(\lambda; g^t)$ be some ensemble loss function at time $t$, which depends on the ensemble weights and the individual loss of each weak classifier. Then, ideally, the optimal ensemble weight vector $\lambda^*$ should minimize the expected loss $E[\ell(\lambda; g)]$, where the expectation is taken with respect to the underlying distribution of the training data $(x, y)$. Since this data distribution is unknown, we use the empirical loss as a surrogate:

$$L_T(\lambda; g^{1:T}) = \ell_0(\lambda) + \sum_{t=1}^{T} \ell(\lambda; g^t) \tag{1}$$

where $\ell_0(\lambda)$ can be regarded as an initial loss and can be omitted.

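We make a set of assumptions on $L_T$ that are adapted from Chen (1985):

###### Assumption 1 (Regularity Conditions)

Assume that for each $T$, there exists a $\lambda_T^*$ that minimizes (1), and

1. “local optimality”: for each $T$, $\nabla L_T(\lambda_T^*; g^{1:T}) = 0$ and $\nabla^2 L_T(\lambda_T^*; g^{1:T})$ is positive definite,

2. “steepness”: the minimum eigenvalue of $\nabla^2 L_T(\lambda_T^*; g^{1:T})$ approaches $\infty$ as $T \to \infty$,

3. “smoothness”: for any $\epsilon > 0$, there exist a positive integer $N$ and $\delta > 0$ such that for any $T > N$ and any $\theta$ within distance $\delta$ of $\lambda_T^*$, $\nabla^2 L_T(\theta; g^{1:T})$ exists and satisfies

$$I - A(\epsilon) \leq \nabla^2 L_T(\theta; g^{1:T}) \left(\nabla^2 L_T(\lambda_T^*; g^{1:T})\right)^{-1} \leq I + A(\epsilon)$$

for some positive semidefinite symmetric matrix $A(\epsilon)$ whose largest eigenvalue tends to 0 as $\epsilon \to 0$, where the inequalities above are matrix inequalities,

4. “concentration”: for any $\delta > 0$, there exist a positive integer $N$ and constants $c, p > 0$ such that for any $T > N$ and any $\theta$ farther than $\delta$ from $\lambda_T^*$, we have

$$L_T(\theta; g^{1:T}) - L_T(\lambda_T^*; g^{1:T}) < c \left( (\theta - \lambda_T^*)' \, \nabla^2 L_T(\lambda_T^*; g^{1:T}) \, (\theta - \lambda_T^*) \right)^p,$$

5. “integrability”:

$$\int e^{-L_T(\lambda; g^{1:T})} \, d\lambda < \infty.$$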

In the situation where $\ell$ is separable in terms of each component of $\lambda$, i.e. $\ell(\lambda; g)$ and $\ell_0(\lambda)$ are each a sum over $i$ of twice differentiable functions of the single component $\lambda_i$, the assumptions above depend only on the corresponding one-dimensional functions for each $i$. For example, Condition 3 in Assumption 1 reduces to merely checking uniform continuity of the second derivative of each such function.

Condition 1 in Assumption 1 can be interpreted as the standard first and second order conditions for the optimality of $\lambda_T^*$, whereas Condition 3 in essence requires continuity of the Hessian matrix. Conditions 2 and 4 are needed for the use of the Laplace method (MacKay, 2003), which, as we will show later, stipulates that the posterior distribution peaks near the optimal solution of the empirical loss (1).

### 3.2 A Bayesian Approach

We state our procedure in Algorithm 1. We define $p(g|\lambda) = e^{-\ell(\lambda; g)}$ and $p(\lambda) = e^{-\ell_0(\lambda)}$.

Algorithm 1 requires some further explanation:

1. Our updated estimate for $\lambda$ at each step $t$ is the “posterior mean”, given by

$$\frac{\int \lambda \prod_{s=1}^{t} p(g^s|\lambda)\, p(\lambda)\, d\lambda}{\int \prod_{s=1}^{t} p(g^s|\lambda)\, p(\lambda)\, d\lambda}$$

2. When the loss function $\ell$ satisfies

$$\int e^{-\ell(\lambda; w)}\, dw = 1 \tag{2}$$

and $\ell_0$ satisfies

$$\int e^{-\ell_0(w)}\, dw = 1$$

then $p(g|\lambda)$ is a valid likelihood function and $p(\lambda)$ a valid prior distribution, so that the distribution updated in Algorithm 1 is indeed a posterior distribution for $\lambda$ (i.e. the quotation marks around “posterior distribution” in the algorithm can be removed). In this context, a good choice of $p(\lambda)$, e.g. as a conjugate prior for the likelihood $p(g|\lambda)$, can greatly reduce the computational effort at each step. On the other hand, such a likelihood interpretation is not a necessary requirement for Algorithm 1 to work, since its convergence analysis relies on the Laplace method, which is non-probabilistic in nature.
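As a rough illustration of the recursion described above, the following Python sketch updates an unnormalized “posterior” $\prod_s p(g^s|\lambda)\,p(\lambda)$ on a one-dimensional grid and reports its mean after each observation. It is a minimal sketch and not the paper's Algorithm 1: the function names, the grid, and the scalar example losses (a single weight with $\ell(\lambda; g) = \theta\lambda g - \log\lambda$ and a Gamma-shaped $\ell_0$) are assumptions chosen only so that the output can be compared with a known closed form.

```python
import numpy as np

def recursive_posterior_means(observations, loss, loss0, grid):
    """Return the grid-approximated 'posterior mean' after each observation.

    loss(grid, g) plays the role of l(lambda; g) and loss0(grid) of l_0(lambda);
    both are hypothetical callables supplied by the user.
    """
    log_post = -loss0(grid)                      # log p(lambda) = -l_0(lambda)
    means = []
    for g in observations:
        log_post = log_post - loss(grid, g)      # multiply by p(g | lambda)
        w = np.exp(log_post - log_post.max())    # stabilise before exponentiating
        means.append(float(np.sum(grid * w) / np.sum(w)))
    return means

# Illustrative scalar losses (assumed for this example only).
def example_loss(lam, g, theta=1.0):
    return theta * lam * g - np.log(lam)

def example_loss0(lam, alpha=2.0, beta=1.0):
    return beta * lam - (alpha - 1.0) * np.log(lam)

grid = np.linspace(1e-3, 20.0, 20001)
observed = [0.5, 0.2, 0.8]
means = recursive_posterior_means(observed, example_loss, example_loss0, grid)

# Closed-form Gamma posterior mean for the same losses, for comparison.
closed_form = [(2.0 + t) / (1.0 + sum(observed[:t])) for t in (1, 2, 3)]
print(means)
print(closed_form)
```

For these example losses the grid-based means should agree closely with the Gamma posterior mean, which is the kind of conjugate shortcut the second remark above alludes to. A grid is only practical for very low-dimensional weights.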

Algorithm 1 offers the desirable properties characterized by the following theorem.

###### Theorem 1

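Under Assumption 1, the sequence of random vectors $\lambda_T$ with the posterior distributions produced by Algorithm 1 satisfies the asymptotic normality property

$$\left(\nabla^2 L_T(\lambda_T^*; g^{1:T})\right)^{1/2} (\lambda_T - \lambda_T^*) \xrightarrow{d} N(0, 1) \tag{3}$$

where $\lambda_T$ is interpreted as a random variable distributed according to the posterior at time $T$, and $\xrightarrow{d}$ denotes convergence in distribution. Furthermore, under a uniform integrability condition on the normalized sequence in (3), we have

$$\left| E_{\lambda_T | g^{1:T}}[\lambda_T] - \lambda_T^* \right| = o\left(\frac{1}{\sigma_T^{1/2}}\right) \tag{4}$$

where $E_{\lambda_T | g^{1:T}}[\cdot]$ denotes the posterior mean and $\sigma_T$ is the minimum eigenvalue of the matrix $\nabla^2 L_T(\lambda_T^*; g^{1:T})$.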

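Proof.

Let

$$\tilde{L}_T(\lambda; g^{1:T}) = L_T(\lambda; g^{1:T}) + \log \int e^{-L_T(\lambda; g^{1:T})}\, d\lambda$$

which is well-defined by Condition 5 in Assumption 1. Note that $e^{-\tilde{L}_T(\lambda; g^{1:T})}$ is a valid probability density in $\lambda$ by definition. Moreover, Conditions 1–4 in Assumption 1 all hold when $L_T$ is replaced by $\tilde{L}_T$ (since they all depend only on the derivatives of $L_T$ with respect to $\lambda$ or the difference $L_T(\theta; g^{1:T}) - L_T(\lambda_T^*; g^{1:T})$, which are unaffected by the additive constant).

The convergence in (3) then follows from Theorem 2.1 in Chen (1985) applied to the sequence of densities $e^{-\tilde{L}_T(\lambda; g^{1:T})}$ for $T = 1, 2, \ldots$. Condition 1 in Assumption 1 is equivalent to conditions (P1) and (P2) therein, while Conditions 2 and 3 in Assumption 1 correspond to (C1) and (C2) in Chen (1985). Condition 4 is equivalent to (C3.1), which implies (C3) there, so that Theorem 2.1 can be invoked to conclude (3).

To show the bound (4), we take the expectation in (3) to get

$$\left(\nabla^2 L_T(\lambda_T^*; g^{1:T})\right)^{1/2} \left( E_{\lambda_T | g^{1:T}}[\lambda_T] - \lambda_T^* \right) \to 0, \tag{5}$$

which is valid because of the uniform integrability condition (Durrett, 2010). Therefore, $E_{\lambda_T | g^{1:T}}[\lambda_T] - \lambda_T^* = \left(\nabla^2 L_T(\lambda_T^*; g^{1:T})\right)^{-1/2} w_T$ where $w_T = o(1)$ by (5). But then

$$\left\| \left(\nabla^2 L_T(\lambda_T^*; g^{1:T})\right)^{-1/2} w_T \right\|_1 \leq \left\| \left(\nabla^2 L_T(\lambda_T^*; g^{1:T})\right)^{-1/2} \right\|_1 \|w_T\|_1 \leq \frac{C}{\sigma_T^{1/2}} \|w_T\|_1 = o\left(\frac{1}{\sigma_T^{1/2}}\right)$$

for some constant $C$, where $\|\cdot\|_1$ when applied to a matrix is the induced 1-norm. This shows (4).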

The idea behind (3) comes from classical Bayesian asymptotics and is an application of the so-called Laplace method (MacKay, 2003). Theorem 1 states that, given a loss structure satisfying Assumption 1, the posterior distribution of $\lambda$ under our update scheme provides an approximation to the minimizer $\lambda_T^*$ of the cumulative loss at time $T$, as $T$ increases, by tending to a normal distribution peaked at $\lambda_T^*$ with shrinking variance ($\lambda_T^*$ here can be interpreted as the maximum a posteriori (MAP) estimate). The bound (4) states that this posterior distribution can be summarized using the posterior mean to give a point estimate of $\lambda_T^*$. Moreover, note that $\lambda_T^*$ is the global, not merely local, minimizer of the cumulative loss. This approximation of the global optimum highlights a key advantage of the proposed Bayesian scheme over other methods such as stochastic gradient descent (SGD), which only find a local optimum.

The next theorem states another benefit of our Bayesian scheme over standard SGD. Suppose that SGD does indeed converge to the global optimum. Even so, it turns out that our proposed Bayesian scheme converges faster than standard SGD under the assumption of i.i.d. training samples.

###### Theorem 2

Suppose Assumption 1 holds. Assume also that the $g^t$ are i.i.d. and satisfy suitable moment conditions. The Bayesian posterior mean produced by Algorithm 1 converges to $\lambda^* = \arg\min_\lambda E[\ell(\lambda; g)]$ strictly faster than standard SGD (supposing it converges to the global minimum), given by

$$\lambda_{T+1} \leftarrow \lambda_T - \epsilon_T K \nabla \ell(\lambda_T; g^T) \tag{6}$$

in terms of the asymptotic variance, except when the step size $\epsilon_T$ and the matrix $K$ are chosen optimally.

In Theorem 2, by asymptotic variance we mean the following: both the sequence of posterior means and the update sequence from SGD possess versions of the central limit theorem, in the form $\sqrt{T}(\lambda_T - \lambda^*) \xrightarrow{d} N(0, \Sigma)$ where $\lambda^* = \arg\min_\lambda E[\ell(\lambda; g)]$. Our comparison is on the asymptotic covariance matrix $\Sigma$ with respect to matrix inequality: for two update schemes with corresponding asymptotic covariance matrices $\Sigma_1$ and $\Sigma_2$, Scheme 1 converges faster than Scheme 2 if $\Sigma_2 - \Sigma_1$ is positive definite.

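Proof.

The proof follows by combining (4) with established central limit theorems for sample average approximation (Pasupathy and Kim, 2011) and stochastic gradient descent (SGD) algorithms. First, let $z(\lambda) = E[\ell(\lambda; g)]$, and $\lambda^* = \arg\min_\lambda z(\lambda)$. Note that the quantity $\lambda_T^*$ is the minimizer of $\frac{1}{T} L_T(\lambda; g^{1:T}) = \frac{1}{T}\ell_0(\lambda) + \frac{1}{T}\sum_{t=1}^{T} \ell(\lambda; g^t)$. Then, together with the fact that $\ell_0(\lambda)/T$ is asymptotically negligible, Theorem 5.9 in Pasupathy and Kim (2011) stipulates that $\sqrt{T}(\lambda_T^* - \lambda^*) \xrightarrow{d} N(0, \Sigma)$, where

$$\Sigma = \left(\nabla^2 z(\lambda)\right)^{-1} \mathrm{Var}\left(\nabla \ell(\lambda; g)\right) \left(\nabla^2 z(\lambda)\right)^{-1} \tag{7}$$

is evaluated at $\lambda = \lambda^*$ and $\mathrm{Var}(\cdot)$ denotes the covariance matrix.

Now since $\lambda_T^* \to \lambda^*$ and $\frac{1}{T}\nabla^2 L_T(\lambda; g^{1:T}) \to \nabla^2 z(\lambda)$ by the law of large numbers (Durrett, 2010), the minimum eigenvalue $\sigma_T$ grows linearly in $T$. Then the bound in (4) implies that $\left| E_{\lambda_T | g^{1:T}}[\lambda_T] - \lambda_T^* \right| = o(1/\sqrt{T})$. In other words, the difference between the posterior mean and $\lambda_T^*$ is of smaller scale than $1/\sqrt{T}$. By Slutsky's theorem (Serfling, 2009), this implies that $\sqrt{T}\left(E_{\lambda_T | g^{1:T}}[\lambda_T] - \lambda^*\right) \xrightarrow{d} N(0, \Sigma)$ also.

On the other hand, for SGD (6), it is known (e.g. Asmussen and Glynn, 2007) that the optimal step size parameter value is $\epsilon_T = 1/T$ and $K = \left(\nabla^2 z(\lambda^*)\right)^{-1}$, in which case the central limit theorem for the update $\lambda_T$ will be given by $\sqrt{T}(\lambda_T - \lambda^*) \xrightarrow{d} N(0, \Sigma)$ where $\Sigma$ is exactly (7). For other choices of step size, either the convergence rate is slower than order $1/\sqrt{T}$ or the asymptotic variance, denoted by $\tilde{\Sigma}$, is such that $\tilde{\Sigma} - \Sigma$ is positive definite. Therefore, by comparing the asymptotic variance, the posterior mean always has a faster convergence unless the step size in SGD is chosen optimally.

To give some intuition from a statistical viewpoint, Theorem 2 arises from two layers of approximation of our posterior mean to $\lambda^*$. First, thanks to (4), the difference between the posterior mean and the minimizer $\lambda_T^*$ of the cumulative loss (which can be interpreted as the MAP) decreases at a rate faster than $1/\sqrt{T}$. Second, $\lambda_T^*$ converges to $\lambda^*$ at a rate of order $1/\sqrt{T}$ with the optimal multiplicative constant. This is equivalent to the observation that the MAP, much like the maximum likelihood estimator (MLE), is asymptotically efficient as a statistical estimator.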

Putting things in perspective, compared with local methods such as SGD, we have made an apparently stronger set of assumptions (i.e. Assumption 1), which pays off by allowing for stronger theoretical guarantees (Theorems 1 and 2). In the next section we describe an example where a meaningful loss function precisely fits into our framework.

## 4 A Specific Example

We now discuss in depth a simple and natural choice of loss function and its corresponding likelihood function and prior, which are also used in our experiments in Section 5. Consider

$$\ell(\lambda; g) = \theta \sum_{i=1}^{m} \lambda_i g_i - \sum_{i=1}^{m} \log \lambda_i \tag{8}$$

The motivation for (8) is straightforward: it is the sum of individual losses, each weighted by $\lambda_i$. The extra term $-\sum_{i=1}^{m} \log \lambda_i$ prevents $\lambda_i$ from approaching zero, the trivial minimizer for the first term. The parameter $\theta$ specifies the trade-off between the importance of the first and the second term. This loss function satisfies Assumption 1. In particular, the Hessian of $\ell$ turns out not to depend on $g$, so all conditions of Assumption 1 can be verified easily.
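To make the role of the logarithmic term concrete, the expected loss under (8) can be minimized in closed form. The short derivation below is a worked step added for illustration; it assumes $E[g_i] > 0$ for every $i$ and writes $E[g_i]$ for the expected loss of the $i$-th weak classifier.

$$\begin{aligned}
E[\ell(\lambda; g)] &= \sum_{i=1}^{m} \left( \theta \lambda_i E[g_i] - \log \lambda_i \right), \\
\frac{\partial}{\partial \lambda_i} E[\ell(\lambda; g)] &= \theta E[g_i] - \frac{1}{\lambda_i} = 0
\quad\Longrightarrow\quad \lambda_i^* = \frac{1}{\theta E[g_i]}, \\
\frac{\partial^2}{\partial \lambda_i^2} \ell(\lambda; g) &= \frac{1}{\lambda_i^2} > 0 .
\end{aligned}$$

The second derivative is positive and free of $g$, so the objective is strictly convex in each $\lambda_i$ and the stationary point is the unique minimizer. Weak classifiers with a smaller expected loss therefore receive a larger weight.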

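Using the discussion in Section 3.2, we choose the exponential likelihood (note that in this definition we add an extra constant term $-m \log \theta$ to (8), which does not affect the minimization in any way)

$$p(g|\lambda) = \prod_{i=1}^{m} (\theta \lambda_i)\, e^{-\theta \lambda_i g_i}. \tag{9}$$

To facilitate computation, we employ the Gamma prior:

$$p(\lambda) \propto \prod_{i=1}^{m} \lambda_i^{\alpha - 1} e^{-\beta \lambda_i} \tag{10}$$

where $\alpha$ and $\beta$ are the shape and rate hyperparameters. Correspondingly, we pick $\ell_0(\lambda) = \beta \sum_{i=1}^{m} \lambda_i - (\alpha - 1) \sum_{i=1}^{m} \log \lambda_i$. To be concrete, the cumulative loss in (1) (disregarding the constant terms) is

$$\beta \sum_{i=1}^{m} \lambda_i - (\alpha - 1) \sum_{i=1}^{m} \log \lambda_i + \sum_{t=1}^{T} \left( \theta \sum_{i=1}^{m} \lambda_i g_i^t - \sum_{i=1}^{m} \log \lambda_i \right).$$

Now, under conjugacy of (9) and (10), the posterior distribution of $\lambda$ after $t$ steps is given by the Gamma distribution

$$p(\lambda | g^{1:t}) \propto \prod_{i=1}^{m} \lambda_i^{\alpha + t - 1} e^{-\left(\beta + \theta \sum_{s=1}^{t} g_i^s\right) \lambda_i}.$$

Therefore the posterior mean for each $\lambda_i$ is

$$\frac{\alpha + t}{\beta + \theta \sum_{s=1}^{t} g_i^s}. \tag{11}$$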

We use the following prediction rule at each step:

$$y = \begin{cases} 1 & \text{if } \sum_{i=1}^{m} \lambda_i g_i(x, 1) \leq \sum_{i=1}^{m} \lambda_i g_i(x, -1) \\ -1 & \text{otherwise} \end{cases} \tag{12}$$

where each $\lambda_i$ is the posterior mean given by (11). For this setup, Algorithm 1 can be cast as Algorithm 2, which is implemented in Section 5.
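The closed-form update (11) and the prediction rule (12) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the paper's Algorithm 2: the class name `BayesianEnsemble`, its method names, and the default hyperparameter values are hypothetical, and the caller is assumed to supply the per-classifier loss vectors $g(x, 1)$ and $g(x, -1)$.

```python
import numpy as np

class BayesianEnsemble:
    """Gamma-posterior ensemble weights for the loss in (8) (illustrative sketch)."""

    def __init__(self, m, alpha=1.0, beta=1.0, theta=1.0):
        # alpha, beta: Gamma prior shape and rate; theta: trade-off in (8).
        # The default values here are placeholders, not the paper's settings.
        self.alpha, self.beta, self.theta = alpha, beta, theta
        self.t = 0                      # number of samples seen so far
        self.loss_sum = np.zeros(m)     # running sum of g_i^s for each classifier

    def weights(self):
        # Posterior mean (11): (alpha + t) / (beta + theta * sum_s g_i^s)
        return (self.alpha + self.t) / (self.beta + self.theta * self.loss_sum)

    def predict(self, g_pos, g_neg):
        # Rule (12): g_pos[i] = g_i(x, 1), g_neg[i] = g_i(x, -1)
        lam = self.weights()
        score_pos = float(lam @ np.asarray(g_pos, dtype=float))
        score_neg = float(lam @ np.asarray(g_neg, dtype=float))
        return 1 if score_pos <= score_neg else -1

    def update(self, g):
        # g[i] = g_i(x, y) evaluated at the revealed true label y
        self.loss_sum += np.asarray(g, dtype=float)
        self.t += 1

ens = BayesianEnsemble(m=2)
ens.update([0.5, 1.0])                  # classifier 0 incurred the smaller loss
print(ens.weights())
print(ens.predict(g_pos=[0.0, 1.0], g_neg=[1.0, 0.0]))
```

Each update costs $O(m)$ time and memory, since only the running loss sums and the sample count need to be stored.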

The following bound provides further understanding of the loss function (8) and the prediction rule (12), by relating their use with a guarantee on the prediction error:

###### Theorem 3

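Suppose that the $g^t$ are i.i.d., so that the posterior mean (11) converges to $\lambda^* = \arg\min_\lambda E[\ell(\lambda; g)]$ for $\ell$ defined in (8). The prediction error using rule (12) with $\lambda^*$ is bounded by

$$P_{(x,y)}(\text{error}) \leq m^{\frac{1}{p}} \left( E_{(x,y)} \left[ \left( \sum_{i=1}^{m} \frac{g_i(x, -y)}{E[g_i(x, y)]} \right)^{-\frac{1}{p-1}} \right] \right)^{\frac{p-1}{p}} \tag{13}$$

for any $p > 1$.

To make sense of this result, note that the quantity $\sum_{i=1}^{m} g_i(x, -y) / E[g_i(x, y)]$ can be interpreted as a performance indicator of the weak classifiers, i.e. the larger each term is, the better the weak classifier is, since a good classifier should have a small loss $E[g_i(x, y)]$ and correspondingly a large $g_i(x, -y)$. As long as there exist some good weak classifiers among the choices, this quantity will be large, which leads to a small error bound in (13).

Proof.

Suppose $\lambda$ is used in the strong classifier (12). Denote by $I(\cdot)$ the indicator function. Consider

$$\begin{aligned}
E_{(x,y)}\left[ \sum_{i=1}^{m} \lambda_i g_i(x, y) \right]
&= \int \left( \sum_{i=1}^{m} \lambda_i g_i(x, 1) P(y = 1 | x) + \sum_{i=1}^{m} \lambda_i g_i(x, -1) P(y = -1 | x) \right) dP(x) \\
&\geq \int \Bigg( I\left( \sum_{i=1}^{m} \lambda_i g_i(x, 1) > \sum_{i=1}^{m} \lambda_i g_i(x, -1) \right) \cdot \sum_{i=1}^{m} \lambda_i g_i(x, 1) P(y = 1 | x) \\
&\qquad + I\left( \sum_{i=1}^{m} \lambda_i g_i(x, 1) \leq \sum_{i=1}^{m} \lambda_i g_i(x, -1) \right) \cdot \sum_{i=1}^{m} \lambda_i g_i(x, -1) P(y = -1 | x) \Bigg) dP(x) \\
&\geq \int \Bigg( I\left( \sum_{i=1}^{m} \lambda_i g_i(x, 1) > \sum_{i=1}^{m} \lambda_i g_i(x, -1) \right) \cdot \sum_{i=1}^{m} \lambda_i g_i(x, -1) P(y = 1 | x) \\
&\qquad + I\left( \sum_{i=1}^{m} \lambda_i g_i(x, 1) \leq \sum_{i=1}^{m} \lambda_i g_i(x, -1) \right) \cdot \sum_{i=1}^{m} \lambda_i g_i(x, 1) P(y = -1 | x) \Bigg) dP(x) \\
&= E_{(x,y)}\left[ I(\text{error}) \sum_{i=1}^{m} \lambda_i g_i(x, -y) \right] \\
&\geq P(\text{error})^{p} \left( E_{(x,y)}\left[ \left( \sum_{i=1}^{m} \lambda_i g_i(x, -y) \right)^{-\frac{1}{p-1}} \right] \right)^{-(p-1)}
\end{aligned}$$

where the last inequality holds by the reverse Hölder inequality (Hardy et al., 1952). So

$$P(\text{error}) \leq \left( E_{(x,y)}\left[ \sum_{i=1}^{m} \lambda_i g_i(x, y) \right] \right)^{\frac{1}{p}} \cdot \left( E_{(x,y)}\left[ \left( \sum_{i=1}^{m} \lambda_i g_i(x, -y) \right)^{-\frac{1}{p-1}} \right] \right)^{\frac{p-1}{p}}$$

and the result (13) follows by plugging in $\lambda_i = 1/(\theta E[g_i(x, y)])$ for each $i$, the minimizer of $E[\ell(\lambda; g)]$, which can be solved directly when $\ell$ is in the form (8).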
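To get a feel for the size of the bound in (13), its right-hand side can be estimated by replacing both expectations with sample averages. The sketch below is a plug-in estimate under stated assumptions, not a procedure from the paper: the function name `error_bound` and the array layout (one row per sample, one column per weak classifier) are hypothetical, and all losses are assumed strictly positive so that the ratios and negative powers are well defined.

```python
import numpy as np

def error_bound(g_true, g_flipped, p=2.0):
    """Plug-in estimate of the right-hand side of (13).

    g_true[n, i]    = g_i(x_n,  y_n)   loss at the true label
    g_flipped[n, i] = g_i(x_n, -y_n)   loss at the opposite label
    """
    if p <= 1.0:
        raise ValueError("the bound requires p > 1")
    g_true = np.asarray(g_true, dtype=float)
    g_flipped = np.asarray(g_flipped, dtype=float)
    m = g_true.shape[1]
    mean_loss = g_true.mean(axis=0)                    # estimate of E[g_i(x, y)]
    ratio = (g_flipped / mean_loss).sum(axis=1)        # sum_i g_i(x,-y) / E[g_i(x,y)]
    inner = np.mean(ratio ** (-1.0 / (p - 1.0)))       # sample average of the inner term
    return m ** (1.0 / p) * inner ** ((p - 1.0) / p)

# Two weak classifiers with loss 0.5 at the true label and 1.0 at the flipped label.
bound = error_bound(np.full((4, 2), 0.5), np.full((4, 2), 1.0), p=2.0)
print(bound)
```

Since the bound holds for any $p > 1$, one could evaluate it over a range of $p$ and keep the smallest value.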

Finally, in correspondence to Theorem 2, the standard SGD for (8) is written as

$$\lambda_i^{t+1} = \lambda_i^t - \frac{\gamma}{t} \left( \theta g_i^t - \frac{1}{\lambda_i^t} \right) \tag{14}$$

where $\gamma$ is a parameter that controls the step size. The following result is a consequence of Theorem 2 (we give another proof here that reveals more specific details).
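For comparison with the closed-form Bayesian update, one step of the SGD recursion for the loss (8), with step size $\gamma / t$, might be written as follows. This is an illustrative sketch: the function name `sgd_step` is hypothetical, and the clipping to a small positive `floor` is an added safeguard (an assumption, not part of the stated update) that keeps $1/\lambda_i$ finite if a large step overshoots zero.

```python
import numpy as np

def sgd_step(lam, g, t, gamma, theta=1.0, floor=1e-8):
    """One SGD update of the ensemble weights for the loss in (8).

    lam: current weights, g: loss vector g^t, t: 1-based step index,
    gamma: step-size parameter, so the step size is gamma / t.
    """
    lam = np.asarray(lam, dtype=float)
    g = np.asarray(g, dtype=float)
    grad = theta * g - 1.0 / lam            # gradient of theta*lam*g - log(lam)
    new_lam = lam - (gamma / t) * grad
    return np.maximum(new_lam, floor)       # assumed projection onto lam > 0

lam = np.array([2.0, 1.0])
lam = sgd_step(lam, g=[1.0, 1.0], t=1, gamma=1.0)
print(lam)
```

Unlike the posterior mean, this update needs a value of `gamma` to be fixed in advance, which is the tuning issue the surrounding analysis addresses.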

###### Theorem 4

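Suppose that the $g^t$ are i.i.d. and satisfy suitable moment conditions. For each $i$, the posterior mean given by (11) always has a rate of convergence at least as fast as the SGD update (14) in terms of asymptotic variance. In fact, it is strictly better in all situations except when the step size parameter $\gamma$ in (14) is set optimally a priori.

Proof.

Since for each $i$ the $g_i^t$ are i.i.d., the sample mean $\frac{1}{T}\sum_{t=1}^{T} g_i^t$ follows a central limit theorem. It can be argued using the delta method (Serfling, 2009) that the posterior mean (11) satisfies

$$\sqrt{T} \left( \frac{\alpha + T}{\beta + \theta \sum_{t=1}^{T} g_i^t} - \frac{1}{\theta E[g_i(x, y)]} \right) \xrightarrow{d} N\left( 0, \frac{\mathrm{Var}(g_i(x, y))}{\theta^2 (E[g_i(x, y)])^4} \right) \tag{15}$$

For the stochastic gradient descent scheme (14), it is useful to cast the objective function as $z_i(\lambda_i) = E[\theta \lambda_i g_i(x, y) - \log \lambda_i] = \theta \lambda_i E[g_i(x, y)] - \log \lambda_i$. Let $\lambda_i^* = \arg\min_{\lambda_i} z_i(\lambda_i)$, which can be directly solved as $\lambda_i^* = 1/(\theta E[g_i(x, y)])$. Then $z_i''(\lambda_i^*) = 1/(\lambda_i^*)^2 = \theta^2 (E[g_i(x, y)])^2$. If the step size parameter satisfies $\gamma > 1/(2 z_i''(\lambda_i^*))$, the update scheme (14) will generate $\lambda_i^T$ that satisfies the following central limit theorem (Asmussen and Glynn, 2007; Kushner and Yin, 2003)

$$\sqrt{T} (\lambda_i^T - \lambda_i^*) \xrightarrow{d} N(0, \sigma_i^2) \tag{16}$$

where

$$\sigma_i^2 = \int_0^\infty e^{(1 - 2\gamma z_i''(\lambda_i^*)) s}\, \gamma^2\, \mathrm{Var}\left( \theta g_i(x, y) - \frac{1}{\lambda_i^*} \right) ds \tag{17}$$

and $\theta g_i(x, y) - 1/\lambda_i^*$ is the unbiased estimate of the gradient at the point $\lambda_i^*$. On the other hand, if $\gamma \leq 1/(2 z_i''(\lambda_i^*))$, the convergence is slower than (16) asymptotically and so we can disregard this case (Asmussen and Glynn, 2007). Now substitute $z_i''(\lambda_i^*) = 1/(\lambda_i^*)^2$ into (17) to obtain

$$\sigma_i^2 = \theta^2 \gamma^2\, \mathrm{Var}(g_i(x, y)) \int_0^\infty e^{(1 - 2\gamma / (\lambda_i^*)^2) s}\, ds = \frac{\theta^2 \gamma^2\, \mathrm{Var}(g_i(x, y))}{2\gamma / (\lambda_i^*)^2 - 1} = \frac{\theta^2 \gamma^2\, \mathrm{Var}(g_i(x, y))}{2\gamma \theta^2 (E[g_i(x, y)])^2 - 1}$$

and letting $\tilde{\gamma} = \theta^2 \gamma$, we get

$$\sigma_i^2 = \frac{\tilde{\gamma}^2\, \mathrm{Var}(g_i(x, y))}{\theta^2 \left( 2\tilde{\gamma} (E[g_i(x, y)])^2 - 1 \right)} \tag{18}$$

if $\tilde{\gamma} > 1 / \left( 2 (E[g_i(x, y)])^2 \right)$.

We are now ready to compare the asymptotic variance in (15) and (18), and show that for all admissible $\tilde{\gamma}$, the one in (15) is no larger. Note that this is equivalent to showing that

$$\frac{\mathrm{Var}(g_i(x, y))}{\theta^2 (E[g_i(x, y)])^4} \leq \frac{\tilde{\gamma}^2\, \mathrm{Var}(g_i(x, y))}{\theta^2 \left( 2\tilde{\gamma} (E[g_i(x, y)])^2 - 1 \right)}$$

Eliminating the common factors, we have

$$\frac{1}{(E[g_i(x, y)])^2} \leq \frac{\tilde{\gamma}^2}{2\tilde{\gamma} - 1/(E[g_i(x, y)])^2}$$

and by re-arranging the terms, we have

$$(E[g_i(x, y)])^2 \left( \tilde{\gamma} - \frac{1}{(E[g_i(x, y)])^2} \right)^2 \geq 0$$

which is always true. Equality holds iff $\tilde{\gamma} = 1/(E[g_i(x, y)])^2$, which corresponds to $\gamma = 1/\left(\theta^2 (E[g_i(x, y)])^2\right) = 1/z_i''(\lambda_i^*)$. Therefore, the asymptotic variance in (15) is always smaller than (18), unless the step size is chosen optimally.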
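The comparison between (15) and (18) is easy to check numerically. The following sketch simply evaluates the two asymptotic variance expressions for given values of $E[g_i]$, $\mathrm{Var}(g_i)$, $\theta$ and $\gamma$; the function names and the example numbers are assumptions introduced for illustration.

```python
def bayes_asymptotic_var(mean_g, var_g, theta):
    """Asymptotic variance of the posterior mean, as in (15)."""
    return var_g / (theta ** 2 * mean_g ** 4)

def sgd_asymptotic_var(mean_g, var_g, theta, gamma):
    """Asymptotic variance of SGD with step size gamma / t, as in (18)."""
    gamma_tilde = theta ** 2 * gamma
    denom = 2.0 * gamma_tilde * mean_g ** 2 - 1.0
    if denom <= 0.0:
        # Below this threshold the sqrt(T)-rate central limit theorem does not apply.
        raise ValueError("gamma is too small for the central limit theorem (16)")
    return gamma_tilde ** 2 * var_g / (theta ** 2 * denom)

mean_g, var_g, theta = 0.5, 0.1, 1.0
optimal_gamma = 1.0 / (theta ** 2 * mean_g ** 2)   # equality case of the proof

print(bayes_asymptotic_var(mean_g, var_g, theta))
for gamma in (3.0, optimal_gamma, 6.0):
    print(gamma, sgd_asymptotic_var(mean_g, var_g, theta, gamma))
```

In this example the SGD variance matches the Bayesian one only at `optimal_gamma` and is larger on either side of it, in line with the equality condition derived above.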

## 5 Experiments

We report two sets of binary classification experiments in the online learning setting. In the first set of experiments, we evaluate our scheme's performance against five baseline methods: a single baseline classifier, a uniform voting ensemble, and three SGD-based online ensemble learning methods. In the second set of experiments, we compare with three leading online boosting methods: GradientBoost (Leistner et al., 2009), Smooth-Boost (Chen et al., 2012), and the online boosting method of Oza and Russell (2001).

In all experiments, we follow the experimental setup in Chen et al. (2012). Data arrives as a sequence of examples . At each step the online learner predicts the class label for , then the true label is revealed and used to update the classifier online. We report the averaged error rate for each evaluated method over five trials of different random orderings of each dataset. The experiments are conducted for two different choices of weak classifiers: Perceptron and Naïve Bayes.
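The predict-then-update protocol described above can be sketched as a short evaluation loop. This is a generic illustration rather than the authors' experimental code: the function names, the `(predict, update)` learner interface, and the fixed seed are assumptions made for the example.

```python
import random

def prequential_error(stream, predict, update):
    """Predict each label first, then reveal it and update the learner."""
    mistakes = 0
    n = 0
    for x, y in stream:
        if predict(x) != y:
            mistakes += 1
        update(x, y)          # the true label is only used after the prediction
        n += 1
    return mistakes / n if n else 0.0

def averaged_error(data, make_learner, n_trials=5, seed=0):
    """Average the online error rate over several random orderings of the data."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_trials):
        order = list(data)
        rng.shuffle(order)
        predict, update = make_learner()   # fresh learner for every trial
        errors.append(prequential_error(order, predict, update))
    return sum(errors) / len(errors)

# Toy usage: a learner that always predicts +1 and ignores updates.
def make_constant_learner():
    return (lambda x: 1), (lambda x, y: None)

toy_data = [(0, 1), (1, -1), (2, 1), (3, 1)]
print(averaged_error(toy_data, make_constant_learner))
```

Creating a fresh learner per trial keeps the orderings independent, so differences between trials reflect only the ordering of the sequence.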

In all experiments, we choose the loss function of our method to be the ramp loss, and keep the hyperparameters $\alpha$, $\beta$ and $\theta$ of our method fixed. From the expression of the posterior mean (11), the prediction rule (12) is unrelated to the values of $\alpha$, $\beta$ and $\theta$ in the long term. We have observed that the classification performance of our method is not very sensitive to changes in the settings of these parameters. However, the stochastic gradient descent baseline (SGD) (14) is sensitive to the setting of $\theta$, and so we use for our method the value of $\theta$ that works best for SGD.

### 5.1 Comparison with Baseline Methods

In the experimental evaluation, we compare our online ensemble method with five baseline methods:

1. a single weak classifier (Perceptron or Naïve Bayes),

2. a uniform ensemble of weak classifiers (Voting),

3. an ensemble of weak classifiers where the ensemble weights are estimated via standard stochastic gradient descent (SGD),

4. a variant of (3.) where the ensemble weights are estimated via Polyak averaging (Polyak and Juditsky, 1992) (SGD-avg), and

5. another variant of (3.) where the ensemble weights are estimated via the Stochastic Average Gradient method of Schmidt et al. (2013) (SAG).
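To illustrate the averaging baseline in item 4, the sketch below runs SGD on the loss (8) and maintains the running average of the iterates, which is the quantity Polyak averaging reports. It is a minimal sketch under stated assumptions: the function name, the step-size schedule $\gamma t^{-\text{decay}}$ with `decay=0.75` (averaging schemes typically use steps that decay more slowly than $1/t$), and the positivity `floor` are choices made for this example and are not taken from the experiments reported here.

```python
import numpy as np

def sgd_with_polyak_average(loss_stream, lam0, gamma, theta=1.0, decay=0.75, floor=1e-8):
    """Run SGD on the loss in (8) and return (last iterate, averaged iterate)."""
    lam = np.asarray(lam0, dtype=float).copy()
    avg = np.zeros_like(lam)
    for t, g in enumerate(loss_stream, start=1):
        g = np.asarray(g, dtype=float)
        step = gamma * t ** (-decay)                    # assumed step-size schedule
        lam = np.maximum(lam - step * (theta * g - 1.0 / lam), floor)
        avg += (lam - avg) / t                          # running mean of the iterates
    return lam, avg

rng = np.random.default_rng(0)
stream = rng.uniform(0.2, 0.8, size=(2000, 3))          # synthetic losses in (0, 1)
last, averaged = sgd_with_polyak_average(stream, lam0=np.ones(3), gamma=0.5)
print(last)
print(averaged)
```

The averaged iterate is the one that would be used as the ensemble weight vector at prediction time. The early iterates carry equal weight in the average, which is one reason such schemes can be slow to reach their asymptotic regime.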

We use ten binary classification benchmark datasets obtained from the LIBSVM repository. Each dataset is split into training and testing sets for each random trial, where the training set is limited to a small fraction of the total amount of data available for that particular benchmark. For each experimental trial, the ordering of items in the testing sequence is selected at random, and each online classifier ensemble learning method is presented with the same testing data sequence for that trial.

In each experimental trial, for all ensemble learning methods, we utilize a set of 100 pre-trained weak classifiers that are kept static during the online learning process. The training set is used in learning these 100 weak classifiers. The same weak classifiers are then shared by all of the ensemble methods, including our method. In order to make weak classifiers divergent, each weak classifier uses a randomly sampled subset of data features as input for both training and testing. The first baseline (single classifier) is learned using all the features.

For all of the benchmarks we observed that the error rate varies with different orderings of the dataset. Therefore, following Chen et al. (2012), we report the average error rate over five random trials of different orders of each sequence. In fact, while the error rate may vary according to different orderings of a dataset, it was observed throughout all our experiments that the ranking of performance among different methods is usually consistent.

Classification error rates for this experiment are shown in Tables 1 and 2. Our proposed method consistently performs the best for all datasets. Its superior performance against Voting is consistent with the asymptotic convergence analysis in Theorem 1. Its superior performance against the SGD baseline is consistent with the convergence rate analysis in Theorem 4. Polyak averaging (SGD-avg) does not improve the performance of basic SGD in general; this is consistent with the analysis in Xu (2011) which showed that, despite its optimal asymptotic convergence rate, a huge number of samples may be needed for Polyak averaging to reach its asymptotic region for a randomly chosen step size. SAG (Schmidt et al., 2013) is a close runner-up to our approach, but it has two limitations: 1) it requires knowing the length of the testing sequence a priori, and 2) as noted in Schmidt et al. (2013), the step size suggested in the theoretical analysis does not usually give the best result in practice, and thus the authors suggest a larger step size instead. In our experiments, we also found that the improvement of Schmidt et al. (2013) over the SGD baseline relies on tuning the step size to a value that is greater than that given in the theory. The performance of SAG reported here has taken advantage of these two points.

Fig. 1 shows plots of the convergence of online learning for three of the benchmark datasets. Plots for the other benchmark datasets are provided in the supplementary material. Each plot reports the classification error curves of our method, the SGD baseline, Polyak averaging SGD-avg (Polyak and Juditsky, 1992), and Stochastic Average Gradient SAG (Schmidt et al., 2013). Overall, for all methods, the error rate generally tends to decrease as the online learning process considers more and more samples. As is evident in the graphs, our method tends to attain lowest error rates overall, throughout each training sequence, for the compared methods for these benchmarks. Ideally, as an algorithm converges, the rate of cumulative error should tend to decrease as more samples are processed, approaching the minimal error rate that is achievable for the given set of pre-trained weak classifiers. Yet given the finite size of training sample set, and the randomness caused by different orderings of the sequences, we may not see the ideal monotonic curves. But in general, the trend of curves obtained by our method is consistent with the convergence analysis of Theorem 1. The online learning algorithm that converges faster should result in curves that go down more quickly in general. Again, given finite samples and different orderings, there is variance, but still, consistent with Theorem 2, the consistently better performance of our formulation vs. the compared methods is evident.

Fig. 2 and Fig. 3 show plots for studying the sensitivity of our method to its parameter settings. It is clear from the expression of the posterior mean (11) that the numerator containing $\alpha$ will be cancelled out in the prediction rule (12), so we only need to study the effect of $\beta$ and $\theta$. We select a short sequence, “Heart”, and a long sequence, “Mushrooms”, as two representative datasets. We plot the classification error rates of our method under different settings of these two parameters (Fig. 2 and Fig. 3), averaged over five random trials. It can be observed that the performance of our method is not very sensitive to changes in the settings of $\beta$ and $\theta$, even for a short sequence like “Heart” (270 samples), and the performance is more stable with respect to these parameters for a longer sequence like “Mushrooms” (8124 samples). This observation is consistent with the asymptotic property of our prediction rule (12). We observed similar behavior for all the other benchmark datasets we tested.

### 5.2 Comparison with Online Boosting Methods

We further compare our method with a single Perceptron/Naïve Bayes classifier that is updated online, and three representative online boosting methods reported in Chen et al. (2012): OzaBoost is the method proposed by Oza and Russell (2001), OGBoost is the online GradientBoost method proposed by Leistner et al. (2009), and OSBoost is the online Smooth-Boost method proposed by Chen et al. (2012). Ours-r is our proposed Bayesian ensemble method for online updated weak classifiers. All methods are trained and compared following the setup of Chen et al. (2012), where for each experimental trial, a set of 100 weak classifiers are initialized and updated online.

We use ten binary classification benchmark datasets that are also used by Chen et al. (2012). We discard the “Ijcnn1” and “Web Page” datasets from the tables of Chen et al. (2012), because they are highly imbalanced, with only small proportions of positive samples, so that even a naïve “always negative” classifier attains comparably top performance.