# Online Multiclass Boosting with Bandit Feedback

Daniel Zhang
Department of Computer Science
University of Michigan
dtzhang@umich.edu
Young Hun Jung
Department of Statistics
University of Michigan
yhjung@umich.edu
Ambuj Tewari
Department of Statistics
University of Michigan
tewaria@umich.edu
###### Abstract

We present online boosting algorithms for multiclass classification with bandit feedback, where the learner only receives feedback about the correctness of its prediction. We propose an unbiased estimate of the loss using a randomized prediction, allowing the model to update its weak learners with limited information. Using the unbiased estimate, we extend two full information boosting algorithms (Jung et al., 2017) to the bandit setting. We prove that the asymptotic error bounds of the bandit algorithms exactly match their full information counterparts. The cost of restricted feedback is reflected in the larger sample complexity. Experimental results also support our theoretical findings, and the performance of the proposed models is comparable to that of an existing bandit boosting algorithm, which is limited to binary weak learners.

## 1 Introduction

We study the online multiclass classification problem with bandit feedback. In this setting, the data instances arrive sequentially, and the learner has to predict the label among a finite, but perhaps large, set of candidates. In certain practical settings, such as when the labels are ads or product recommendations on the web, the learner does not receive the correct label as feedback. Instead, it only receives feedback about whether its predicted label was correct (e.g., the user clicked on the ad or recommendation) or not (e.g., user did not click). However, training machine learning models under such partial feedback is challenging. A common approach is to convert a full information algorithm into a bandit version without incurring too much performance loss (see, for example, Kakade et al. (2008) and Beygelzimer et al. (2017) for work using the perceptron algorithm).

In this paper, we design online algorithms for multiclass classification under bandit feedback by building on recent work on online boosting in the full-information setting. Online boosting algorithms combine the predictions of multiple online weak learners to improve prediction performance. Classical boosting algorithms were designed for the batch setting. Chen et al. (2012) and Beygelzimer et al. (2015) first developed a theory of online boosting for binary classification. Then Jung et al. (2017) and Jung and Tewari (2018) extended the theory to the multiclass classification and the multilabel ranking problems. These works prove that the boosting algorithm’s asymptotic error converges to zero if the number of weak learners, whose predictions are slightly better than random guessing, gets larger.

Designing a boosting algorithm with bandit feedback is particularly difficult as it is not clear how to update the weak learners. For example, suppose that one weak learner predicts, say, label 1, another learner predicts label 2, and the boosting algorithm predicts label 1, which turns out to be incorrect. We cannot even tell whether the second learner's prediction is correct. To the best of our knowledge, Chen et al. (2014) are the only ones who have proposed a boosting algorithm in the multiclass bandit setting. However, their algorithm is restricted to binary weak learners and only updates a subset of them, viz. the ones that can get full feedback. In contrast, our algorithms use multiclass weak learners and update every learner at each round.

To derive our algorithms and guarantees, we extend the work of Jung et al. (2017) to the bandit setting. Instead of making a deterministic prediction, our algorithms randomize them. This allows them to estimate the loss using the distribution over labels, and this estimate is used to update the weak learners. Similar to the full information work, we propose a computationally expensive algorithm, BanditBBM, with an optimal error bound and a more practical algorithm, AdaBandit, with a suboptimal bound. AdaBandit is the first adaptive boosting algorithm in the bandit setting that does not assume weak learners’ edge over random is known beforehand. Interestingly, our algorithms’ asymptotic error bounds match the full information counterparts with increased sample complexity, which can be interpreted as the cost of bandit feedback.

## 2 Preliminaries

We begin by introducing notation. We denote the indicator function by $\mathbb{I}(\cdot)$. We write the $i$-th standard basis vector as $e_i$, the vector of ones as $\mathbf{1}$, and the vector of zeros as $\mathbf{0}$. The notation $[n]$ stands for the set $\{1, \dots, n\}$, and the family of distributions over this set is denoted by $\Delta[n]$.

### 2.1 Problem Setting

We first describe the online multiclass classification problem with bandit feedback. It is a sequential game between two players: learner and adversary. The set of possible labels is known to both players. At each round , the adversary selects a labeled example (where is some domain) and sends only to the learner. The learner then tries to guess its label and sends its prediction back to the adversary. As we are in the bandit setting, the adversary only reveals whether the prediction is correct by sending to the learner. The learner’s goal is to minimize the number of incorrect predictions. In other words, the learner’s performance is evaluated by the zero-one loss (see (1) for definition).

To tackle this problem, we use the online multiclass boosting setup of Jung et al. (2017). In this setting, the learner further splits into online weak learners, , as well as a booster that handles the weak learners. When the booster receives an unlabeled instance , it shares this information with the weak learners and then aggregates their predictions to produce the final prediction. Here we assume that each weak learner predicts a label in . Once the booster gets the feedback from the adversary, it computes a cost vector for to incur the loss and update its prediction rule. It should be noted that even though our boosting algorithms are designed for the bandit setting, the weak learners observe full cost vectors , which are constructed from bandit feedback by the booster.

### 2.2 Unbiased Estimate of the Zero-One Loss

Even though we have not specified how to compute the cost vectors $c^i_t$, it is natural to expect that they should depend on the final zero-one loss vector:

$$l^{0-1}_t = \mathbf{1} - e_{y_t} \in \mathbb{R}^k. \tag{1}$$

As we are in the bandit setting, the booster only has limited information about this vector. In particular, unless its final prediction is correct, only a single entry of $l^{0-1}_t$ is available.

A popular approach for algorithm design in the partial information setting is to obtain an unbiased estimate of the loss. To do so, many bandit algorithms randomize their prediction. In our setting, instead of making a deterministic prediction $\hat{y}_t$, the algorithm designs a sampling distribution $p_t$ as follows:

$$p_{t,i} = \begin{cases} 1 - \rho & \text{if } i = \hat{y}_t \\ \dfrac{\rho}{k-1} & \text{if } i \neq \hat{y}_t, \end{cases} \tag{2}$$

where $\rho$ is a parameter that controls the exploration rate. This distribution puts a large weight on the label $\hat{y}_t$ and evenly distributes the remaining weight over the rest. The algorithm draws a final prediction $\tilde{y}_t$ based on $p_t$. In this way, the algorithm can build an estimator using the known sampling distribution. The simplest unbiased estimate of the zero-one loss is

$$\hat{l}^{0-1}_t = \frac{\mathbb{I}(\tilde{y}_t = y_t)}{p_{t,\tilde{y}_t}} (\mathbf{1} - e_{\tilde{y}_t}) \in \mathbb{R}^k. \tag{3}$$

It is easy to check that this is indeed unbiased. However, it is not necessarily the best because it becomes a zero vector when the booster makes a mistake. As the zero loss vector does not provide any useful information, the weak learners cannot update at this round. Therefore, it would be hard for the booster to escape the early training stage using the simple estimate.
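The unbiasedness of (3) can be checked by enumerating every possible draw. The following is a minimal sketch, not the authors' implementation: labels are 0-indexed, exact rational arithmetic is used so the check is exact, and the function names are hypothetical.

```python
from fractions import Fraction


def sampling_distribution(y_hat, k, rho):
    """Eq. (2): mass 1 - rho on y_hat, rho / (k - 1) on every other label."""
    return [1 - rho if i == y_hat else rho / (k - 1) for i in range(k)]


def simple_loss_estimate(y_tilde, correct, p):
    """Eq. (3): (1 - e_{y_tilde}) / p[y_tilde] if the draw was correct, else zeros."""
    if not correct:
        return [0] * len(p)
    return [0 if i == y_tilde else 1 / p[y_tilde] for i in range(len(p))]


def expected_estimate(y, p):
    """Exact expectation of Eq. (3) over y_tilde ~ p when the true label is y."""
    k = len(p)
    total = [0] * k
    for y_tilde in range(k):
        est = simple_loss_estimate(y_tilde, y_tilde == y, p)
        total = [acc + p[y_tilde] * e for acc, e in zip(total, est)]
    return total


k, rho = 4, Fraction(1, 10)
p = sampling_distribution(y_hat=2, k=k, rho=rho)
zero_one = [0 if i == 1 else 1 for i in range(k)]  # 1 - e_y for true label y = 1
```

Only the draw that hits the true label contributes a non-zero vector, which is why the estimate carries no information on mistake rounds.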

As an alternative, we propose a new estimator

 (4)

We first emphasize that this quantity can be computed only using the bandit feedback. The proof that it is actually unbiased appears in Appendix A.1.

This estimator resolves the main issue with the estimator in (3), viz. that the learner cannot update during a mistake round. Furthermore, the algorithms using this estimator empirically performed much better than ones using the estimator in (3). For these reasons, we will stick to the estimate in (4) from now on.

To apply concentration inequalities, we need to control the variance of estimators. We say a random vector $Y$ is $b$-bounded if

$$\|Y - \mathbb{E}Y\|_\infty \le b \quad \text{almost surely}.$$

Note that this definition also applies to random variables (i.e., scalars), in which case the norm above simply becomes the absolute value. It is easy to check our estimator is -bounded.

Now suppose that a cost vector $c^i_t$ (to be fed into the $i$-th weak learner at time $t$) requires the knowledge of the true label $y_t$. Since the label is usually unavailable, we also need to estimate the cost vector. We first compute a matrix $C^i_t \in \mathbb{R}^{k \times k}$, whose $r$-th column is the cost vector assuming $r$ is the correct label. Then we will use the following random cost vector:

$$\hat{c}^i_t = C^i_t \cdot (\mathbf{1} - \hat{l}^{0-1}_t). \tag{5}$$

Since $C^i_t$ is a deterministic matrix, we can compute

$$\mathbb{E}_{\tilde{y}_t} \hat{c}^i_t = C^i_t \cdot (\mathbf{1} - l^{0-1}_t) = C^i_t \cdot e_{y_t},$$

which is the $y_t$-th column of $C^i_t$. This shows that $\hat{c}^i_t$ is an unbiased estimate of $c^i_t$.
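The construction in (5) can be illustrated on a toy cost matrix. This sketch makes two assumptions that are not from the paper: it plugs in the simple estimator (3) as a stand-in for the loss estimate (the expectation argument only needs unbiasedness), and the matrix entries are arbitrary illustrative numbers.

```python
from fractions import Fraction


def random_cost_vector(C, loss_estimate):
    """Eq. (5): C . (1 - l_hat), where C[l][r] is the cost of label l if r is correct."""
    k = len(C)
    return [sum(C[l][r] * (1 - loss_estimate[r]) for r in range(k)) for l in range(k)]


def expected_cost_vector(C, y, p):
    """Exact expectation of Eq. (5) over y_tilde ~ p, using the simple estimator (3)."""
    k = len(C)
    total = [0] * k
    for y_tilde in range(k):
        if y_tilde == y:
            l_hat = [0 if i == y_tilde else 1 / p[y_tilde] for i in range(k)]
        else:
            l_hat = [0] * k  # estimator (3) is the zero vector on a mistake
        c_hat = random_cost_vector(C, l_hat)
        total = [acc + p[y_tilde] * c for acc, c in zip(total, c_hat)]
    return total


C = [[0, 2, 1],
     [3, 0, 4],
     [5, 6, 0]]  # toy cost matrix; column r assumes r is the true label
p = [Fraction(8, 10), Fraction(1, 10), Fraction(1, 10)]
```

For true label 2, the expectation recovers column 2 of `C`, that is `[1, 4, 0]`, even though individual draws can produce quite different vectors.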

## 3 Algorithms

We introduce two different online boosting algorithms and provide their theoretical error bounds. As the booster’s performance depends on the weak learners’ predictive power, we need a way to quantify the latter. Firstly, we define the edge of a weak learner over random guessing and assume every weak learner has a positive edge $\gamma$. This edge is closely related to the one defined in the full information setting, and hence we can easily compare the error bounds between the two settings. This idea leads to the algorithm BanditBBM, which has a very strong error bound. Secondly, instead of making an additional assumption on the weak learners, we measure the empirical edges of the learners and use these quantities to bound the number of mistakes. We call this algorithm AdaBandit, and since it enjoys theoretical guarantees under fewer assumptions, it is more practical.

### 3.1 Algorithm Template

Our boosting algorithms share a template, which we discuss here. As it adopts the cost vector framework from Jung et al. (2017) and Jung and Tewari (2018), the template is very similar except for the additional step of estimating the loss.

Algorithm 1 summarizes the template. To keep it general, we do not specify certain steps, which will be finalized later for each algorithm. Also, we do not restrict weak learners in any way except requiring that each predicts a label , receives a full loss vector , and suffers the loss according to their prediction.

The booster keeps updating the learner weights $\alpha^i_t$ and constructs experts. There are $N$ experts, where the $i$-th expert tracks the weighted cumulative votes among the first $i$ weak learners: $s^i_t = \sum_{j=1}^{i} \alpha^j_t e_{h^j_t}$. We also track $\hat{y}^i_t = \arg\max_l s^i_t[l]$, where ties are broken arbitrarily. Then the booster chooses an expert index $i_t$ at each round and sets the intermediate prediction to $\hat{y}_t = \hat{y}^{i_t}_t$. In other words, it takes a weighted majority vote among the first $i_t$ learners. BanditBBM fixes $i_t$ to be $N$, while AdaBandit draws it randomly using a calibrated distribution. Using $\hat{y}_t$ and the exploration rate $\rho$, the booster computes the sampling distribution $p_t$ as in (2). A random label $\tilde{y}_t$ drawn from $p_t$ is the final prediction, and the booster receives feedback on whether it is correct. Then it constructs the unbiased estimate of the zero-one loss in (4) and updates the learner weights $\alpha^i_t$. Finally, it computes cost vectors $\hat{c}^i_t$ for the weak learners and lets them update their parameters.
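The expert bookkeeping in the template can be sketched as follows. This is an illustrative reading of the paragraph above rather than the paper's Algorithm 1: labels are 0-indexed, ties are broken toward the smallest label, and the function name is hypothetical.

```python
def expert_predictions(weights, learner_preds, k):
    """Return, for each expert i, its cumulative votes and its predicted label.

    weights[i] is the weight of weak learner i and learner_preds[i] is the
    label it predicted this round. Expert i aggregates learners 0..i.
    """
    votes = [0.0] * k
    all_votes, preds = [], []
    for alpha, h in zip(weights, learner_preds):
        votes[h] += alpha  # s^i = s^{i-1} + alpha^i * e_{h^i}
        all_votes.append(list(votes))
        # max() returns the first maximizer, so ties go to the smallest label
        preds.append(max(range(k), key=lambda l: votes[l]))
    return all_votes, preds


all_votes, preds = expert_predictions([1.0, 1.0, 1.0], [2, 0, 0], k=3)
```

With unit weights, the last expert's prediction is a plain majority vote over all weak learners, which is the choice BanditBBM makes.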

### 3.2 An Optimal Algorithm

Our first algorithm, BanditBBM (Bandit Boost-by-Majority), assumes the bandit weak learning condition, which states that weak learners are better than random guessing. The algorithm is optimal in that it requires the minimal number of weak learners, up to a constant factor, to attain a certain accuracy.

#### 3.2.1 Bandit Weak Learning Condition

This section proposes a bandit weak learning condition which requires weak learners to do better than random guessing, having only observed unbiased estimates of cost vectors. At a high level, what the weak learning condition says is that, as far as the random cost vectors provided by the booster satisfy certain conditions, the weak learner can perform better than random guessing when graded against the expected cost vectors. As a baseline, we define to be almost a uniform distribution that puts more weight on the label . As an example, . The intuition is that if a learner predicts a label based on at each round, then its accuracy would be better than random guessing by the edge .

The booster’s goal is to minimize the number of incorrect predictions and hence it wants to put the minimal cost on the correct label. In this regard, Jung et al. (2017) constrain the choice of cost vectors to

$$\mathcal{C}^{eor}_1 = \{c \in \mathbb{R}^k_+ \mid c_y = 0 \text{ and } \|c\|_1 = 1\}, \tag{6}$$

where $y$ is the correct label. (In fact, the authors constrain the choice of cost matrices, but it suffices to choose a specific row to get a cost vector.) We allow a sample weight that can be multiplied by a cost vector to include scaled cost vectors. One remark is that we can always subtract a common number from every entry of the cost vector as we are interested in the relative loss. This means that as long as the booster puts the minimal cost on the correct label, one can transform the cost vector so that it is represented as $wc$ for some weight $w$ and $c \in \mathcal{C}^{eor}_1$. Meanwhile, we are in the bandit setting, where the true label is often unavailable to the booster. Therefore, we allow our booster to compute a random vector $\hat{c}_t$, whose expectation lies in $\mathcal{C}^{eor}_1$. Once the exploration rate is specified, we can additionally ensure that the random cost vectors are $b$-bounded.
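To make the baseline concrete, the sketch below assumes the form of the baseline distribution used by Jung et al. (2017): mass $(1-\gamma)/k + \gamma$ on the favored label and $(1-\gamma)/k$ on every other label. That form is an assumption here since the example in the text did not survive extraction. Under it, any cost vector in the set (6) has expected cost $(1-\gamma)/k$, which beats uniform guessing by $\gamma/k$.

```python
from fractions import Fraction


def baseline_distribution(label, k, gamma):
    """Assumed u_gamma^label: uniform mass (1 - gamma) / k, plus gamma on `label`."""
    base = (1 - gamma) / k
    return [base + gamma if i == label else base for i in range(k)]


def expected_cost(c, dist):
    """Inner product c . dist: expected cost of predicting a label drawn from dist."""
    return sum(ci * di for ci, di in zip(c, dist))


k, gamma, y = 4, Fraction(1, 5), 0
u = baseline_distribution(y, k, gamma)
c = [0, Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # in the set (6): c_y = 0, sums to 1
uniform = [Fraction(1, k)] * k
```

Here `expected_cost(c, uniform)` is $1/4$ while `expected_cost(c, u)` is $1/5$, a gap of exactly $\gamma/k$.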

It would be theoretically most sound if the bandit weak learning condition can be closely related to its full information counterpart. To do so, we present two weak learning conditions together. The settings are almost identical except the full information version observes a deterministic cost vector while the bandit version only observes a randomized vector such that . Recall that the entire cost vector is shown to the learner even in the bandit setting. For both conditions, the time horizon is , labeled data are chosen adaptively, and the parameters , and the sample weights lie in .

###### Definition 3.1 (OnlineWLC from Jung et al. (2017)).

A pair of a learner and an adversary satisfies OnlineWLC() if the learner can generate predictions such that we have with probability ,

$$\sum_{t=1}^{T} w_t c_{t,\hat{y}_t} \le \sum_{t=1}^{T} w_t c_t \cdot u^{y_t}_\gamma + S.$$

###### Definition 3.2 (BanditWLC).

Suppose the random cost vectors are -bounded for some . A pair of learner and adversary satisfies BanditWLC() if the learner can generate predictions , observing the random cost vectors , such that we have with probability ,

$$\sum_{t=1}^{T} w_t c_{t,\hat{y}_t} \le \sum_{t=1}^{T} w_t c_t \cdot u^{y_t}_\gamma + S,$$

where $c_t = \mathbb{E}\hat{c}_t$ for all $t$.

Here $S$ is called the excess loss. OnlineWLC is a special case of BanditWLC where the bound is $b = 0$. In fact, we can show more intrinsic relations between the two.

Suppose there is a fixed hypothesis class and an online learner makes a prediction at time by choosing a hypothesis . Obviously, this setting does not cover all online learners, but the most widely used learners can be interpreted in this manner. Jung et al. (2017) showed that the OnlineWLC can be derived from the following two assumptions:

• (Online Richness Condition) For any sequence of cost vectors , there is a hypothesis such that

$$\sum_{t=1}^{T} w_t c_{t,h(x_t)} \le \sum_{t=1}^{T} w_t c_t \cdot u^{y_t}_\gamma.$$
• (Online Agnostic Learnability Condition) For any sequence of (bounded) loss vectors , there is an online algorithm which can generate predictions such that with probability ,

$$\sum_{t=1}^{T} l_{t,\hat{y}_t} \le \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} l_{t,h(x_t)} + R_\delta(T),$$

where $R_\delta(T)$ is a sublinear regret.

Note that the online learnability condition only assumes a bounded loss instead of . This condition holds, for example, if the space has a finite Littlestone dimension. Interested readers can refer to the paper by Daniely et al. (2015). We show that these two conditions also imply the BanditWLC. The proof appears in Appendix A.2.

###### Theorem 3.1.

Suppose a pair of weak learning space $\mathcal{H}$ and adversary satisfies the richness condition with edge $2\gamma$ and the agnostic learnability condition with regret $R_\delta(T)$. Additionally, we assume that $w_t \ge m$ for all $t$. Then the online learner based on $\mathcal{H}$ satisfies BanditWLC() with

$$S = \sup_T -\frac{\gamma}{k} m T + b \sqrt{2T \log \frac{1}{\delta}} + R_\delta(T),$$

where $b$ is the bound of the random cost vectors.

The extra condition is acceptable because if for some , then we can simply ignore this round because any prediction does not incur a loss. The excess loss is always finite due to the sublinear regret . Furthermore, a smaller would require a larger . The excess loss in the BanditWLC is larger than the one in the OnlineWLC due to the term . This is intuitive in that the learner needs more samples if only bandit feedback is available. Finally, the exploration rate also affects because is equal to . This provides the following rough bound:

$$S = \tilde{O}\left(\frac{k}{\rho}\right), \tag{7}$$

where suppresses dependence on .

#### 3.2.2 BanditBBM Details

Throughout the section, we assume the weak learners satisfy BanditWLC(). BanditBBM is a modification of OnlineMBBM from Jung et al. (2017) by incorporating the unbiased estimate of the loss in (4).

We use the potential function $\phi^y_i(s)$, discussed thoroughly in relation to boosting by Mukherjee and Schapire (2013), to design the cost vectors. The potential function takes the current cumulative votes $s \in \mathbb{R}^k$ as an input and estimates the booster’s loss when the true label is $y$ and there are $i$ weak learners left until the final prediction. In particular, it can be recursively defined as follows:

$$\begin{aligned} \phi^y_0(s) &= \mathbb{I}(\arg\max_i s_i \neq y) \\ \phi^y_{i+1}(s) &= \mathbb{E}_{l \sim u^y_\gamma} \phi^y_i(s + e_l). \end{aligned}$$

Unfortunately, there is no closed form for the potential. Since potential functions are the main ingredient in designing cost vectors, their computation becomes a bottleneck when running the algorithm. This is a weakness of BanditBBM despite its strong mistake bound. As the potential can be computed recursively, one can use Monte Carlo simulations to approximate its value.
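Unrolling the recursion shows that the potential with $i$ learners left is the probability that the top-voted label differs from $y$ after adding $i$ independent votes drawn from the baseline distribution. The following Monte Carlo sketch relies on assumptions that are not from the paper: the baseline distribution is taken to be a mixture (label $y$ with probability $\gamma$, otherwise uniform over all $k$ labels), ties in the argmax go to the smallest label, and the simulation count and seed are arbitrary.

```python
import random


def draw_label(y, k, gamma, rng):
    """Sample from the assumed u_gamma^y: y w.p. gamma, else uniform over k labels."""
    return y if rng.random() < gamma else rng.randrange(k)


def potential_mc(s, y, n_left, gamma, n_sims=2000, seed=0):
    """Estimate phi^y_{n_left}(s) = P(argmax(s + n_left random votes) != y)."""
    rng = random.Random(seed)
    k = len(s)
    mistakes = 0
    for _sim in range(n_sims):
        votes = list(s)
        for _vote in range(n_left):
            votes[draw_label(y, k, gamma, rng)] += 1
        top = max(range(k), key=lambda i: votes[i])  # ties go to the smallest label
        mistakes += top != y
    return mistakes / n_sims
```

A call such as `potential_mc([0, 0, 0], y=1, n_left=20, gamma=0.2)` gives a noisy estimate whose standard error shrinks like $1/\sqrt{\text{n\_sims}}$. The cost matrix in (9) needs one such estimate per pair $(l, r)$, which is where the computational burden comes from.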

Returning to our algorithm, we essentially want to set the cost vector to

$$c^i_{t,l} = \phi^{y_t}_{N-i}(s^{i-1}_t + e_l). \tag{8}$$

Jung et al. (2017) prove that this cost vector puts the minimal cost on the correct label and thus it is a valid choice. The booster in our setting, however, cannot compute this vector as it requires the knowledge of the true label $y_t$. As an alternative, we create the following cost matrix

$$C^i_t[l, r] = \phi^r_{N-i}(s^{i-1}_t + e_l) \tag{9}$$

and use (5) to compute a random cost vector $\hat{c}^i_t$, which is an unbiased estimate of $c^i_t$ in (8).

The rest of the algorithm is straightforward. We set all weights to be one and always choose the last expert: $i_t = N$. This means that the intermediate prediction is a simple majority vote among all the weak learners. The reasoning behind this is that the booster wants to include all learners as they are strictly better than random, and all weak learners are equivalent in that they share the same edge $\gamma$. Algorithm 2 summarizes the specifications.

#### 3.2.3 Mistake Bound of BanditBBM

We still assume that our weak learners satisfy BanditWLC(). From observation (7), it is reasonable to assume . Upon these assumptions, we can bound the number of mistakes made by BanditBBM. The proof appears in Appendix A.3.

###### Theorem 3.2 (Mistake Bound of BanditBBM).

For any , satisfying , the number of mistakes made by BanditBBM satisfies the following inequality with probability at least :

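$$\sum_{t=1}^{T} \mathbb{I}(\tilde{y}_t \neq y_t) \le (k-1) e^{-\frac{\gamma^2 N}{2}} T + 2\rho T + \tilde{O}\left(\frac{k^{7/2} \sqrt{N}}{\rho}\right),$$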

where suppresses dependence on .

If we set the exploration rate , then the bound becomes

$$(k-1) e^{-\frac{\gamma^2 N}{2}} T + \tilde{O}\left(k^{7/4} N^{1/4} \sqrt{T}\right).$$

Dividing by $T$, we can infer that $(k-1) e^{-\gamma^2 N / 2}$ is the asymptotic error bound of the algorithm. This bound matches the bound of the full information counterpart, OnlineMBBM. Since it decays exponentially in $N$, BanditBBM does not require too many weak learners to obtain a desired accuracy. Jung et al. (2017) also provide a lower bound in the full information setting, which shows that the exponential decay is the fastest rate one can expect for the asymptotic error bound. This result applies to our bandit setting as it is harder.
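As a rough worked example of how these quantities interact, the exploration rate can be chosen by balancing the two $\rho$-dependent terms of Theorem 3.2. The calculation below ignores the constants and logarithmic factors hidden in $\tilde{O}$, so the resulting constants are only indicative:

$$\frac{d}{d\rho}\left(2\rho T + \frac{k^{7/2}\sqrt{N}}{\rho}\right) = 0 \;\Longrightarrow\; \rho^\ast = \frac{k^{7/4} N^{1/4}}{\sqrt{2T}}, \qquad 2\rho^\ast T + \frac{k^{7/2}\sqrt{N}}{\rho^\ast} = 2\sqrt{2}\, k^{7/4} N^{1/4} \sqrt{T}.$$

Similarly, requiring the asymptotic error to be at most a target $\epsilon$ (a symbol introduced here for illustration) gives

$$(k-1) e^{-\gamma^2 N / 2} \le \epsilon \;\Longleftrightarrow\; N \ge \frac{2}{\gamma^2} \ln \frac{k-1}{\epsilon}.$$

For instance, with $k = 10$, $\gamma = 0.1$, and $\epsilon = 0.01$, this gives $N \ge 200 \ln 900 \approx 1360.5$, so about 1361 weak learners suffice under the stated edge assumption.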

### 3.3 An Adaptive Algorithm

While BanditBBM is theoretically sound, in real applications it has a number of drawbacks. Firstly, it is hard to identify the edge of each weak learner, leading to incorrect computations of the potential function. Also, each learner may have a different edge, and assuming a common edge can underestimate some weak learners’ predictive power. Finally, as pointed out in the previous section, evaluating the potential function is computationally expensive, which makes BanditBBM less useful in practice. To address these issues, we propose an adaptive algorithm, AdaBandit, based on the full information adaptive algorithm, Adaboost.OLM by Jung et al. (2017). Using the idea of improper learning, Foster et al. (2018) proposed another adaptive boosting algorithm that has a tighter sample complexity than Adaboost.OLM. However, we stick to Adaboost.OLM as it has a competitive asymptotic error bound, which is of primary interest in this paper.

#### 3.3.1 Logistic Loss and Empirical Edges

Instead of directly minimizing the zero-one loss, the adaptive algorithm tries to minimize a surrogate loss. As in Adaboost.OLM, we choose the following logistic loss $l^{\log}_y$:

$$l^{\log}_y(s) = \sum_{l=1}^{k} \log(1 + \exp(s_l - s_y)),$$

where $s$ denotes the cumulative votes of a chosen expert.

As for the zero-one loss, computing the loss requires knowledge of the true label, and we again use the idea in (5) to estimate the loss. We want to emphasize that the logistic loss only plays an intermediate role in training, and the learner’s predictions are still evaluated by the zero-one loss.
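The surrogate can be written in a few lines. This is a minimal sketch of the formula as printed above (summing over all $k$ labels, so the term for the true label contributes a constant $\log 2$); labels are 0-indexed and the function name is hypothetical.

```python
import math


def logistic_loss(s, y):
    """l^log_y(s): sum over all labels l of log(1 + exp(s_l - s_y))."""
    # log1p(exp(x)) is fine for moderate vote gaps; very large gaps would overflow exp
    return sum(math.log1p(math.exp(s_l - s[y])) for s_l in s)


flat = logistic_loss([0.0, 0.0, 0.0], y=1)
confident = logistic_loss([0.0, 2.0, 0.0], y=1)
```

With no votes cast, the loss equals $k \log 2$, and it decreases as the votes for the true label grow relative to the others.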

Essentially, we want to set the cost vector . Since this depends on the true label , we build a cost matrix as below:

 (10)

Note that each column also puts the minimal cost on the correct label . Moreover, the sum of entries equals zero. Using the idea described in (5), we can compute , which is an unbiased estimate of .

Even though the adaptive algorithm does not assume the BanditWLC, we still need to measure the weak learners’ predictive powers to analyze the booster’s performance. As in the full information case, we use the following empirical edge of :

$$\gamma_i = \frac{\sum_{t=1}^{T} c^i_{t,h^i_t}}{\sum_{t=1}^{T} c^i_{t,y_t}}.$$

Having the same empirical edge as Adaboost.OLM allows us to precisely evaluate the cost of bandit feedback. Based on our design of the cost vector $c^i_t$, we can check that $\gamma_i$ is in $[-1, 1]$ and a larger value implies a better accuracy. The empirical edge is unavailable to the learner as it requires the true cost vector $c^i_t$. This is fine because we only use this value to provide the mistake bound. Running AdaBandit does not require the knowledge of empirical edges.
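For intuition, the empirical edge can be computed after the fact from logged true cost vectors. The sketch below uses toy cost vectors chosen only to satisfy the two stated properties (minimal cost on the correct label, entries summing to zero); they are not produced by the cost matrix of the paper.

```python
def empirical_edge(cost_vectors, predictions, labels):
    """Ratio of the cost at the learner's predictions to the cost at the true labels."""
    num = sum(c[h] for c, h in zip(cost_vectors, predictions))
    den = sum(c[y] for c, y in zip(cost_vectors, labels))
    return num / den


costs = [[1.0, -2.0, 1.0], [1.0, -2.0, 1.0]]  # true label is 1 in both rounds
labels = [1, 1]
always_right = empirical_edge(costs, [1, 1], labels)
half_right = empirical_edge(costs, [1, 0], labels)
```

A learner that is always correct attains edge 1, and the edge drops as it places predictions on labels with positive cost.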

#### 3.3.2 AdaBandit Details

Now we describe the details of AdaBandit (see Algorithm 3). The choice of cost vectors is already discussed in the previous section. As this is an adaptive algorithm, we update the learner weights to give more influence to high-performing learners. We also allow negative weights in case a weak learner is worse than random.

As AdaBandit incorporates the logistic loss as a surrogate, we want to pick $\alpha^i_t$ to minimize

$$\sum_{t=1}^{T} f^i_t(\alpha^i_t) \quad \text{where} \quad f^i_t(\alpha) = l^{\log}_{y_t}(s^{i-1}_t + \alpha e_{h^i_t}),$$

where only the following unbiased estimator is available to the learner:

$$\hat{f}^i_t(\alpha) = \sum_{j=1}^{k} l^{\log}_j(s^{i-1}_t + \alpha e_{h^i_t}) \cdot (1 - \hat{l}^{0-1}_{t,j}).$$

Since the logistic loss is convex, this is a classical online convex optimization problem, and we can use stochastic gradient descent (see Zinkevich (2003) and Shalev-Shwartz and Ben-David (2014)). Following the convention in Adaboost.OLM, we use the feasible set $[-2, 2]$ and the projection function $\Pi(\cdot)$ onto it to update $\alpha^i_t$:

$$\alpha^i_{t+1} = \Pi\left(\alpha^i_t - \eta_t \hat{f}^{i\prime}_t(\alpha^i_t)\right), \tag{11}$$

where is a learning rate. As the gradient of the logistic loss is universally bounded by and is -bounded, we can check that almost surely. From this, if we set , then a standard result in online stochastic gradient descent (see Shalev-Shwartz and Ben-David (2014), Chapter 14) provides with probability ,

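$$\sum_{t=1}^{T} f^i_t(\alpha^i_t) \le \min_{\alpha \in [-2,2]} \sum_{t=1}^{T} f^i_t(\alpha) + \tilde{O}\left(\frac{k^2}{\rho} \sqrt{T}\right), \tag{12}$$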

where suppresses dependence on .
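The update in (11) is a projected gradient step on a scalar. A minimal sketch, assuming the gradient estimate and learning rate are supplied by the caller (their specific values are not reproduced here):

```python
def project(alpha, lo=-2.0, hi=2.0):
    """Projection onto the feasible interval [-2, 2]."""
    return max(lo, min(hi, alpha))


def sgd_step(alpha, grad_estimate, eta):
    """One step of Eq. (11): move against the estimated gradient, then project."""
    return project(alpha - eta * grad_estimate)


inside = sgd_step(0.0, grad_estimate=1.0, eta=0.5)
clipped = sgd_step(1.9, grad_estimate=-5.0, eta=0.1)
```

The projection keeps every weight in $[-2, 2]$ no matter how large a single noisy gradient estimate is, which matters because the estimates scale inversely with the exploration rate.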

We cannot prove that the last expert is the best because our weak learners do not adhere to the weak learning condition. Instead, we will show that at least one expert is reliable. To identify this expert, we use the Hedge algorithm from Littlestone and Warmuth (1994) and Freund and Schapire (1997). This algorithm generally receives the zero-one loss of each expert. Since that is no longer available, we will feed , which in expectation reflects the true zero-one loss. As the exploration rate controls the variance of the loss estimate, we can combine the analysis of the Hedge algorithm with the concentration inequality to obtain a similar result.
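For reference, a generic Hedge (exponential weights) update over the experts might look like the following. This is a textbook sketch rather than the exact procedure of Algorithm 3: the learning rate `eta` and the per-expert loss values are placeholders supplied by the caller, and weights are kept in log space for numerical stability.

```python
import math
import random


def hedge_update(log_weights, losses, eta):
    """Multiply each expert's weight by exp(-eta * loss), in log space."""
    return [lw - eta * loss for lw, loss in zip(log_weights, losses)]


def hedge_distribution(log_weights):
    """Normalize the weights into a probability distribution over experts."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [x / total for x in w]


def draw_expert(log_weights, rng):
    """Draw an expert index in proportion to the current weights."""
    dist = hedge_distribution(log_weights)
    return rng.choices(range(len(dist)), weights=dist)[0]


lw = hedge_update([0.0, 0.0, 0.0], losses=[1.0, 0.0, 0.0], eta=0.5)
dist = hedge_distribution(lw)
```

Experts that accumulate larger (estimated) losses are drawn less often, so the booster gradually concentrates on a reliable expert without knowing in advance which one it is.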

#### 3.3.3 Mistake Bound of AdaBandit

As mentioned earlier, we bound the number of mistakes made by the adaptive algorithm using the weak learners’ empirical edges. We emphasize again that these empirical edges are defined in exactly the same manner as those used in the full information bound.

###### Theorem 3.3 (Mistake Bound of AdaBandit).

For any , satisfying , the number of mistakes made by AdaBandit satisfies the following inequality with probability at least :

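$$\sum_{t=1}^{T} \mathbb{I}(\tilde{y}_t \neq y_t) \le \frac{8k}{\sum_{i=1}^{N} \gamma_i^2} T + 2\rho T + \tilde{O}\left(\frac{k^3 N^2}{\rho^2 \sum_{i=1}^{N} \gamma_i^2}\right),$$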

where suppresses dependence on .

If we set the exploration rate , then the bound becomes

$$\frac{8k}{\sum_{i=1}^{N} \gamma_i^2} T + \tilde{O}\left(\frac{k N^{2/3}}{\left(\sum_{i=1}^{N} \gamma_i^2\right)^{1/3}} T^{2/3}\right).$$

This implies that $\frac{8k}{\sum_{i=1}^{N} \gamma_i^2}$ becomes the asymptotic error bound of AdaBandit, which matches the bound of Adaboost.OLM. Jung et al. (2017) observe that with high probability if the learner has edge . Therefore, if our weak learners satisfy BanditWLC() as for BanditBBM, then the asymptotic bound becomes roughly . The bound depends polynomially on $N$, which is suboptimal. However, AdaBandit resolves the aforementioned issues of BanditBBM and actually shows comparable results on real data sets.

## 4 Experiments

We run various boosting algorithms on benchmark data sets to compare their performance. The models include our proposed algorithms, BanditBBM and AdaBandit, their full information versions, OnlineMBBM and Adaboost.OLM from Jung et al. (2017), and BanditBoost from Chen et al. (2014). For readability, we refer to them as OptBandit, AdaBandit, OptFull, AdaFull, and BinBandit, respectively, based on their characteristics. The first four models require multiclass weak learners, whereas BinBandit needs binary learners. For every model, we use online decision trees proposed by Domingos and Hulten (2000) as weak learners.

We examine several data sets from the UCI data repository (Blake and Merz, 1998; Higuera et al., 2015; Ugulino et al., 2012) that were tested by Jung et al. (2017). We follow the authors’ data preprocessing to provide a consistent comparison. However, the bandit algorithms need more samples to reach their asymptotic performance. Because these data sets often have insufficient examples to yield this asymptotic performance, we duplicate and shuffle the data sets a number of times before feeding them to the algorithm. The amount of duplication done to each data set is chosen to suggest the asymptotic performance of each algorithm. Table 1 contains a summary of the data sets that are examined. The number of actual data points sent to each model is noted under the column StreamCnt.

We optimize the number of weak learners for each bandit algorithm and data set, with granularity down to multiples of . As BinBandit only takes binary weak learners, it needs more of them for data sets with large . Thus, we use weak learners for BinBandit on each data set. Recall that OptBandit and OptFull require the knowledge of the edge from their weak learning condition. Since one cannot identify this value in practice, we also do not optimize this value and select to be fixed. Lastly, the three bandit algorithms have the exploration rate , which we optimize through the grid search and record the best results. A more detailed description of the experiment setting appears in Appendix B.

### 4.1 Asymptotic Performance

Since the theoretical asymptotic error bounds of the proposed algorithms match their full information counterparts, we first compare the models’ empirical asymptotic performance. To do so, we feed the first of the data without counting mistakes and compute the average accuracy on the last of the data. Table 1 summarizes the results. The accuracy is averaged over 20 rounds for all data sets except Isolet and Movement, which we ran 10 times. These runs were computed with shuffling from 20 random seeds, a predetermined subset of which were used for Isolet and Movement.

The full information algorithms exhibit very strong performance due to data duplication. Despite this, OptBandit and AdaBandit stay competitive against them across all data sets, supporting our theoretical findings. Their performance is also comparable with BinBandit, showing that our algorithms successfully combine the multiclass weak learners. A noteworthy aspect is the superiority of AdaBandit over OptBandit on all of the data sets, displaying the power of adaptive weight algorithms.

### 4.2 Analyzing Learning Curves

Even though our bandit algorithms have the same asymptotic error bounds as their full information counterparts, the cost of bandit feedback is reflected in larger sample complexities. To investigate this cost, we compute approximate learning curves for these algorithms by recording the moving average accuracy across the latest (total rounds to be run) data instances. For data sets of varying this illustrates the hardness of the bandit problem: as increases, learning an appropriately performing hypothesis takes longer, but is achievable nonetheless.
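A moving-average learning curve of this kind can be computed in a single pass. In this sketch the window length is a free parameter (the fraction of total rounds used in the experiments is not reproduced here), and the function name is hypothetical.

```python
from collections import deque


def moving_average_accuracy(correct_flags, window):
    """Accuracy over the most recent `window` rounds, reported after every round."""
    recent, running, curve = deque(), 0, []
    for flag in correct_flags:
        recent.append(int(flag))
        running += int(flag)
        if len(recent) > window:
            running -= recent.popleft()  # drop the round that left the window
        curve.append(running / len(recent))
    return curve


curve = moving_average_accuracy([1, 0, 1, 1], window=2)
```

During the first `window` rounds the average is taken over the rounds seen so far, so the early part of the curve is noisier than the rest.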

Figure 1 shows the learning curves on Car and Isolet data for our two bandit algorithms as compared with their full information counterparts. The curves for other data sets can be found in Appendix B. On Car data, where $k$ is small, AdaBandit does particularly well, even outperforming OptFull and AdaFull by the end. This competitiveness with full information algorithms is reflected in the other learning curves in the appendix. On Isolet data, with large $k$, our bandit methods lose some of their competitiveness. However, given that the exploration rate is set to (whereas the theory would have it converge to $0$) and that the bandit algorithms have not fully plateaued, the performance is still reasonable.

#### Acknowledgements

We acknowledge support from NSF CAREER grant IIS-1452099, a Sloan Research Fellowship, and the LSA Associate Professor Support Fund.

## References

• Beygelzimer et al. (2015) Beygelzimer, A., Kale, S., and Luo, H. (2015). Optimal and adaptive algorithms for online boosting. In Proceedings of the International Conference on Machine Learning.
• Beygelzimer et al. (2017) Beygelzimer, A., Orabona, F., and Zhang, C. (2017). Efficient online bandit multiclass learning with $\tilde{O}(\sqrt{T})$ regret. In Proceedings of the International Conference on Machine Learning.
• Blake and Merz (1998) Blake, C. and Merz, C. (1998). UCI machine learning repository.
• Cesa-Bianchi and Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge university press.
• Chen et al. (2012) Chen, S.-T., Lin, H.-T., and Lu, C.-J. (2012). An online boosting algorithm with theoretical justifications. In Proceedings of the International Conference on Machine Learning.
• Chen et al. (2014) Chen, S.-T., Lin, H.-T., and Lu, C.-J. (2014). Boosting with online binary learners for the multiclass bandit problem. In Proceedings of the International Conference on Machine Learning.
• Daniely et al. (2015) Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz, S. (2015). Multiclass learnability and the ERM principle. The Journal of Machine Learning Research.
• Domingos and Hulten (2000) Domingos, P. and Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining.
• Foster et al. (2018) Foster, D. J., Kale, S., Luo, H., Mohri, M., and Sridharan, K. (2018). Logistic regression: The importance of being improper. Proceedings of Machine Learning Research.
• Freund and Schapire (1997) Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences.
• Higuera et al. (2015) Higuera, C., Gardiner, K. J., and Cios, K. J. (2015). Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PLoS ONE.
• Jung et al. (2017) Jung, Y. H., Goetz, J., and Tewari, A. (2017). Online multiclass boosting. In Advances in Neural Information Processing Systems.
• Jung and Tewari (2018) Jung, Y. H. and Tewari, A. (2018). Online boosting algorithms for multi-label ranking. In International Conference on Artificial Intelligence and Statistics.
• Kakade et al. (2008) Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. (2008). Efficient bandit algorithms for online multiclass prediction. In Proceedings of the International Conference on Machine Learning. ACM.
• Littlestone and Warmuth (1994) Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm. Information and computation.
• Mukherjee and Schapire (2013) Mukherjee, I. and Schapire, R. E. (2013). A theory of multiclass boosting. Journal of Machine Learning Research.
• Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press.
• Ugulino et al. (2012) Ugulino, W., Cardador, D., Vega, K., Velloso, E., Milidiú, R., and Fuks, H. (2012). Wearable computing: Accelerometers' data classification of body postures and movements. In Advances in Artificial Intelligence-SBIA.
• Zinkevich (2003) Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the International Conference on Machine Learning.

## Appendix A Detailed Proofs

In this section, we include the full proofs that are omitted in the main manuscript.

### A.1 Unbiased Estimate of the Zero-One Loss

We prove the unbiasedness of our loss estimator presented in (4).

###### Lemma A.1.

The estimator in (4) is an unbiased estimator of the zero-one loss $l^{0-1}_t$:

$$\mathbb{E}_{\tilde{y}_t \sim p_t}\, \hat{l}^{0-1}_t = l^{0-1}_t.$$
###### Proof.

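Since $\tilde{y}_t$ is drawn with respect to $p_t$, we can write

$$
\begin{aligned}
\mathbb{E}_{\tilde{y}_t \sim p_t}\, \hat{l}^{0-1}_{t,i}
&= I(y_t \neq i)\, I(\hat{y}_t \neq i) + I(\hat{y}_t \neq y_t)\, I(\hat{y}_t = i) \\
&= I(y_t \neq i)\, I(\hat{y}_t \neq i) + I(i \neq y_t)\, I(\hat{y}_t = i) \\
&= I(y_t \neq i)\, \big(I(\hat{y}_t \neq i) + I(\hat{y}_t = i)\big) \\
&= I(y_t \neq i),
\end{aligned}
$$

where the last term is $l^{0-1}_{t,i}$, which completes the proof. ∎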
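The unbiasedness claim can be checked numerically by enumerating every possible draw. The sketch below is illustrative only: estimator (4) and the sampling distribution $p_t$ are defined outside this appendix, so the forms used here (mass $1-\rho$ on $\hat{y}_t$ with $\rho$ spread evenly over the other labels, and an importance-weighted estimate whose two cases match the two terms in the proof) are assumptions, and all function names are hypothetical.

```python
import numpy as np

def sampling_distribution(y_hat, k, rho):
    """Assumed exploration scheme: mass 1 - rho on y_hat, rho spread evenly elsewhere."""
    p = np.full(k, rho / (k - 1))
    p[y_hat] = 1.0 - rho
    return p

def loss_estimate(y_tilde, y, y_hat, p):
    """Assumed form of the estimator; it uses only the feedback bit I(y_tilde == y)."""
    correct = bool(y_tilde == y)  # the only information the learner receives
    est = np.empty(len(p))
    for i in range(len(p)):
        hit = float(y_tilde == i) / p[i]  # importance weight for label i
        # Label y_hat is charged only on an observed mistake; every other
        # label starts at loss 1 and is cleared only on an observed success.
        est[i] = hit * (not correct) if i == y_hat else 1.0 - hit * correct
    return est

def expected_estimate(y, y_hat, k, rho):
    """Exact expectation over y_tilde ~ p, obtained by summing over all k draws."""
    p = sampling_distribution(y_hat, k, rho)
    return sum(p[j] * loss_estimate(j, y, y_hat, p) for j in range(k))
```

For any true label `y` and any `y_hat`, `expected_estimate` should coincide with the zero-one loss vector, whose entries are 1 everywhere except at index `y`.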

### A.2 Proof of Theorem 3.1

Since the agnostic learnability condition is given with deterministic cost vectors, we need to bridge the deterministic costs with the randomized ones. Observe that $\hat{c}_t$ relies solely on the random draw of $\tilde{y}_t$ at each round. Therefore, the partial sums of the random vectors $\hat{c}_t - c_t$ have the martingale property. We can then prove the following lemma using the Azuma-Hoeffding inequality.

###### Lemma A.2.

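Suppose the random cost vectors $\hat{c}_t$ are $b$-bounded. Let $p_t$ be a probability vector over the $k$ labels. Then the following inequality holds with probability $1-\delta$:

$$\left|\sum_{t=1}^{T} (\hat{c}_t - c_t) \cdot p_t\right| \le b\sqrt{2T \log\frac{2}{\delta}}.$$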
###### Proof.

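Since $\hat{c}_t$ is $b$-bounded and unbiased, we have

$$\left|(\hat{c}_t - c_t) \cdot p_t\right| \le b \ \text{ a.s.} \quad \text{and} \quad \mathbb{E}\,(\hat{c}_t - c_t) \cdot p_t = 0.$$

Therefore, the Azuma-Hoeffding inequality implies

$$P\left(\left|\sum_{t=1}^{T} (\hat{c}_t - c_t) \cdot p_t\right| \ge \epsilon\right) \le 2 e^{-\frac{\epsilon^2}{2 b^2 T}}.$$

Putting $\epsilon = b\sqrt{2T \log\frac{2}{\delta}}$ finishes the proof. ∎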
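The last step of the proof is a direct substitution: plugging the stated deviation level into the Azuma-Hoeffding tail should return exactly $\delta$. A minimal sketch of that arithmetic follows; the function names are hypothetical and the inputs are arbitrary illustrative values.

```python
import math

def azuma_radius(b, T, delta):
    """Deviation level from Lemma A.2: b * sqrt(2 T log(2 / delta))."""
    return b * math.sqrt(2.0 * T * math.log(2.0 / delta))

def azuma_tail(eps, b, T):
    """Two-sided Azuma-Hoeffding tail for T increments bounded by b in absolute value."""
    return 2.0 * math.exp(-eps ** 2 / (2.0 * b ** 2 * T))

# Substituting the radius into the tail recovers delta (up to rounding).
recovered_delta = azuma_tail(azuma_radius(b=3.0, T=1000, delta=0.05), b=3.0, T=1000)
```

The radius grows like $\sqrt{T}$, which is why it can be dominated by a term linear in $T$ later in the argument.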

We now go into the main proof of Theorem 3.1.

###### Proof.

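Fix the sequence of cost vectors $\{c_t\}_{t=1}^{T}$. From the richness condition with edge $2\gamma$, we know

$$\inf_{h \in \mathcal{H}} \sum_{t=1}^{T} w_t c_{t,h(x_t)} \le \sum_{t=1}^{T} w_t c_t \cdot u^{y_t}_{2\gamma}.$$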

By applying Lemma A.2 with , we get with probability ,

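$$\inf_{h \in \mathcal{H}} \sum_{t=1}^{T} w_t \hat{c}_{t,h(x_t)} \le \sum_{t=1}^{T} w_t c_t \cdot u^{y_t}_{2\gamma} + b\sqrt{2T \log\frac{2}{\delta}}. \tag{13}$$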

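Then by the online learnability condition, the online learner based on can generate predictions $\hat{y}_t$ that satisfy the following inequality with probability $1-\delta$:

$$\sum_{t=1}^{T} w_t \hat{c}_{t,\hat{y}_t} \le \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} w_t \hat{c}_{t,h(x_t)} + R_\delta(T).$$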

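Using (13) and the union bound, we have with probability $1-2\delta$,

$$\sum_{t=1}^{T} w_t \hat{c}_{t,\hat{y}_t} \le \sum_{t=1}^{T} w_t c_t \cdot u^{y_t}_{2\gamma} + b\sqrt{2T \log\frac{2}{\delta}} + R_\delta(T).$$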

Then by the definition of in (6), we can compute

$$c_t \cdot u^{y}_{\gamma} = \frac{1-\gamma}{k}.$$

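Therefore, using the assumption $w_t \ge m$ for all $t$, we can bound

$$\sum_{t=1}^{T} w_t c_t \cdot \left(u^{y_t}_{\gamma} - u^{y_t}_{2\gamma}\right) = \frac{\gamma}{k} \sum_{t=1}^{T} w_t \ge \frac{\gamma}{k} m T.$$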

Then by taking

$$S = \sup_{T} \left( -\frac{\gamma}{k} m T + b\sqrt{2T \log\frac{2}{\delta}} + R_\delta(T) \right),$$

we prove that with probability ,

$$\sum_{t=1}^{T} w_t \hat{c}_{t,\hat{y}_t} \le \sum_{t=1}^{T} w_t c_t \cdot u^{y_t}_{\gamma} + S,$$

which shows the learner and the adversary satisfy BanditWLC(). ∎
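The two identities used in this proof, $c_t \cdot u^{y}_{\gamma} = \frac{1-\gamma}{k}$ and $c_t \cdot (u^{y_t}_{\gamma} - u^{y_t}_{2\gamma}) = \frac{\gamma}{k}$, can be sanity-checked on a small example. The sketch below rests on two assumptions that are not stated in this appendix: that $u^{y}_{\gamma}$ is the uniform distribution with extra mass $\gamma$ on label $y$ (the edge-over-random baseline of Jung et al. (2017)), and that the cost vector puts zero cost on the true label with its remaining entries summing to one. The function name and the example vector are hypothetical.

```python
import numpy as np

def edge_over_random(y, gamma, k):
    """Assumed baseline: (1 - gamma) / k on every label, plus gamma on label y."""
    u = np.full(k, (1.0 - gamma) / k)
    u[y] += gamma
    return u

k, y, gamma = 5, 2, 0.1
# Hypothetical cost vector: zero on the true label, entries sum to one.
c = np.array([0.4, 0.1, 0.0, 0.3, 0.2])

baseline_cost = c @ edge_over_random(y, gamma, k)  # expected (1 - gamma) / k
edge_gap = c @ (edge_over_random(y, gamma, k)
                - edge_over_random(y, 2 * gamma, k))  # expected gamma / k
```

Under these assumptions the gap between the two baselines is exactly $\gamma/k$ per unit of weight, which is the slack that absorbs the $O(\sqrt{T})$ deviation and regret terms into the constant $S$.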

### A.3 Proof of Theorem 3.2

Note that the cost vectors defined in (8) do not put zero cost on the correct label. In order to apply the BanditWLC, we transform the cost vector. Since the zero-one loss vector has the minimal loss on the true label, we can inductively check that $c^i_{t,y_t} = \min_l c^i_{t,l}$. Then we define $d^i_t$ as below:

$$d^i_{t,l} = c^i_{t,l} - c^i_{t,y_t}.$$

The minimal entry of is zero. Let , which plays a similar role of the sample weight in that . We also define .

We bound the cumulative potential functions by following the modified proof of Theorem 2 from Jung et al. (2017).

###### Lemma A.3.

With probability , we have

$$\sum_{t=1}^{T} \phi^{y_t}_{0}(s^{N}_{t}) \le \phi^{1}_{N}(0) \cdot T + S \sum_{i=1}^{N} w^{i*}.$$
###### Proof.

In the proof of Theorem 2 by Jung et al. (2017), the authors write

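$$\sum_{t=1}^{T} \phi^{y_t}_{N-i+1}(s^{i-1}_{t}) = \frac{1-\gamma}{k} \sum_{t=1}^{T} w^{i}_{t} - \sum_{t=1}^{T} d^{i}_{t,h^{i}_{t}} + \sum_{t=1}^{T} \phi^{y_t}_{N-i}(s^{i}_{t}).$$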

Using the fact that our weak learners satisfy BanditWLC(), we have with probability

$$\frac{1}{w^{i*}} \sum_{t=1}^{T} d^{i}_{t,h^{i}_{t}} \le \frac{1-\gamma}{k\, w^{i*}} \sum_{t=1}^{T} w^{i}_{t} + S,$$

from which we deduce

$$\sum_{t=1}^{T} \phi^{y_t}_{N-i+1}(s^{i-1}_{t}) + w^{i*} S \ge \sum_{t=1}^{T} \phi^{y_t}_{N-i}(s^{i}_{t}).$$

Summing this over $i = 1, \dots, N$ and using the union bound, we have with probability ,

$$\sum_{t=1}^{T} \phi^{y_t}_{0}(s^{N}_{t}) \le \sum_{t=1}^{T} \phi^{y_t}_{N}(0) + S \sum_{i=1}^{N} w^{i*}.$$

By symmetry, we can check $\phi^{l}_{N}(0) = \phi^{1}_{N}(0)$ for any label $l$, which completes the proof. ∎
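The passage from the potential identity to the per-learner inequality is a short rearrangement. The following sketch spells it out, using only the two displays in the proof and assuming $w^{i*} > 0$ so that the BanditWLC guarantee can be multiplied through by $w^{i*}$:

```latex
\begin{aligned}
\sum_{t=1}^{T} d^{i}_{t,h^{i}_{t}}
  &\le \frac{1-\gamma}{k} \sum_{t=1}^{T} w^{i}_{t} + w^{i*} S
  && \text{(BanditWLC, multiplied by } w^{i*}\text{)} \\
\sum_{t=1}^{T} \phi^{y_t}_{N-i+1}(s^{i-1}_{t})
  &= \frac{1-\gamma}{k} \sum_{t=1}^{T} w^{i}_{t}
     - \sum_{t=1}^{T} d^{i}_{t,h^{i}_{t}}
     + \sum_{t=1}^{T} \phi^{y_t}_{N-i}(s^{i}_{t}) \\
  &\ge -\, w^{i*} S + \sum_{t=1}^{T} \phi^{y_t}_{N-i}(s^{i}_{t}).
\end{aligned}
```

Summing the last line over $i = 1, \dots, N$ telescopes the potentials: the intermediate terms cancel, leaving $\sum_t \phi^{y_t}_{N}(s^{0}_{t})$ with $s^{0}_{t} = 0$ on one side and $\sum_t \phi^{y_t}_{0}(s^{N}_{t})$ on the other.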

We now prove Theorem 3.2.

###### Proof.

Since , we obtain

$$\phi^{y_t}_{0}(s^{N}_{t}) = I(\hat{y}_t \neq y_t).$$

Furthermore, Jung et al. (2017) bound the terms that appear in the previous lemma:

$$\phi^{l}_{N}(0) \le (k-1)\, e^{-\frac{\gamma^2 N}{2}}, \qquad \sum_{i=1}^{N} w^{i*} = O\!\left(k^{5/2} \sqrt{N}\right).$$

Combining these, we get with probability ,

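$$
\begin{aligned}
\sum_{t=1}^{T} I(\hat{y}_t \neq y_t)
&\le (k-1)\, e^{-\frac{\gamma^2 N}{2}}\, T + O\!\left(k^{5/2} \sqrt{N}\, S\right) \\
&\le (k-1)\, e^{-\frac{\gamma^2 N}{2}}\, T + \tilde{O}\!\left(k^{7/2} \frac{\sqrt{N}}{\rho}\right),
\end{aligned}
$$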

where the last inequality holds by (7).

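To bound the booster's loss $\sum_{t=1}^{T} I(\tilde{y}_t \neq y_t)$, observe

$$\mathbb{E}_{\tilde{y}_t}\, I(\tilde{y}_t \neq y_t) \le I(\hat{y}_t \neq y_t) + \rho.$$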

Using the concentration inequality, we have with probability ,

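$$
\begin{aligned}
\sum_{t=1}^{T} I(\tilde{y}_t \neq y_t)
&\le (k-1)\, e^{-\frac{\gamma^2 N}{2}}\, T + \tilde{O}\!\left(k^{7/2} \frac{\sqrt{N}}{\rho}\right) + \rho T + \sqrt{T \log\frac{1}{\delta}} \\
&\le (k-1)\, e^{-\frac{\gamma^2 N}{2}}\, T + 2\rho T + \tilde{O}\!\left(k^{7/2} \frac{\sqrt{N}}{\rho}\right),
\end{aligned}
$$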

where we use the relation to absorb the term . This proves the main theorem. ∎
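Two quantities in this bound are easy to compute directly: the per-round price of exploration and the number of weak learners needed to push the leading term $(k-1)e^{-\gamma^2 N/2}$ below a target. The sketch below is illustrative; it assumes the exploration scheme places mass $1-\rho$ on $\hat{y}_t$ and spreads $\rho$ evenly over the remaining labels (the definition of $p_t$ lies outside this appendix), and both function names are hypothetical.

```python
import math

def exploration_mistake_prob(y_hat, y, k, rho):
    """P(y_tilde != y) under the assumed scheme: 1 - rho on y_hat, rho / (k - 1) elsewhere."""
    p_y = 1.0 - rho if y_hat == y else rho / (k - 1)
    return 1.0 - p_y

def learners_needed(k, gamma, eps):
    """Smallest N with (k - 1) * exp(-gamma^2 * N / 2) <= eps, by solving for N."""
    if k - 1 <= eps:
        return 0
    return math.ceil(2.0 * math.log((k - 1) / eps) / gamma ** 2)
```

For instance, `learners_needed(4, 0.1, 0.01)` gives the number of weak learners with edge 0.1 needed on a 4-class problem to bring the asymptotic error term below 0.01; the count scales as $1/\gamma^2$ and only logarithmically in $k$.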

### A.4 Proof of Theorem 3.3

We first recall a lemma from Jung et al. (2017) to aid the proof.

###### Lemma A.4 (Jung et al. (2017), Lemma 11).

Suppose $A, B \ge 0$, $B - A = \gamma \in [-1, 1]$, and $A + B \le 1$. Then we have

$$\min_{\alpha \in [-2,2]} A\,(e^{\alpha} - 1) + B\,(e^{-\alpha} - 1) \le -\frac{\gamma^2}{2}.$$
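The inequality of the lemma can be explored numerically. The sketch below assumes the hypotheses $A, B \ge 0$, $B - A = \gamma$, and $A + B \le 1$, which are our reading of the conditions in Jung et al. (2017) rather than something established in this appendix. Because the objective is convex in $\alpha$, its minimizer on $[-2, 2]$ is the unconstrained minimizer $\frac{1}{2}\log(B/A)$ clipped to the interval. The function names are hypothetical.

```python
import math

def best_alpha(A, B, lo=-2.0, hi=2.0):
    """Minimizer of the convex objective on [lo, hi]: clip (1/2) * log(B / A)."""
    if A == 0.0 and B == 0.0:
        return 0.0   # objective is identically zero
    if A == 0.0:
        return hi    # objective is decreasing in alpha
    if B == 0.0:
        return lo    # objective is increasing in alpha
    return min(hi, max(lo, 0.5 * math.log(B / A)))

def objective(A, B, alpha):
    """The quantity minimized in the lemma."""
    return A * (math.exp(alpha) - 1.0) + B * (math.exp(-alpha) - 1.0)

def lemma_holds(A, B, tol=1e-12):
    """Check min over alpha <= -gamma^2 / 2 with gamma = B - A."""
    gamma = B - A
    return objective(A, B, best_alpha(A, B)) <= -gamma ** 2 / 2.0 + tol
```

Sweeping `lemma_holds` over a grid of admissible pairs $(A, B)$ gives a quick check that the bound is not violated, with near-equality when $A$ and $B$ are both close to $1/2$.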

Now we proceed with a bound of the zero-one loss of AdaBandit. The main structure of the proof results from the mistake bound of Adaboost.OLM by Jung et al. (2017).

###### Proof.

We let denote the number of mistakes made by expert : . We also let for convenience. As the booster uses the estimate to run the Hedge algorithm, we define so that . If we write , then by the Azuma-Hoeffding inequality and the fact that is -bounded, we have with probability ,

$$\min_i \hat{M}_i \le \hat{M}_{i^*} \le \min_i M_i + \tilde{O}\!\left(\frac{k}{\rho} \sqrt{T}\right),$$

where suppresses dependence on .

Then a standard analysis of the Hedge algorithm (see Corollary 2.3 by Cesa-Bianchi and Lugosi (2006)) and the Azuma-Hoeffding inequality provide that with probability ,

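$$
\begin{aligned}
\sum_{t=1}^{T} I(\hat{y}_t \neq y_t)
&\le \sum_{t=1}^{T} \hat{l}^{0-1}_{t,\hat{y}_t} + \tilde{O}\!\left(\frac{k}{\rho} \sqrt{T}\right) \\
&\le 2 \min_i \hat{M}_i + 2 \log N + \tilde{O}\!\left(\frac{k}{\rho} \sqrt{T}\right) \\
&\le 2 \min_i M_i + 2 \log N + \tilde{O}\!\left(\frac{k}{\rho} \sqrt{T}\right).
\end{aligned}
\tag{14}
$$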

Now define . If the expert makes a mistake at round , there is such that . According to (10), this implies that . From this, we can deduce that

$$w^{i} \ge \frac{M_{i-1}}{2}. \tag{15}$$

By our convention , the above inequality still holds for .

Next we define the difference in the cumulative logistic loss between two consecutive experts as

 Δi =T∑t=1llog<