# New Insights into Bootstrapping for Bandits

###### Abstract

We investigate the use of bootstrapping in the bandit setting. We first show that the commonly used non-parametric bootstrapping (NPB) procedure can be provably inefficient and establish a near-linear lower bound on the regret incurred by it under the bandit model with Bernoulli rewards. We show that NPB with an appropriate amount of forced exploration can result in sub-linear albeit sub-optimal regret. As an alternative to NPB, we propose a weighted bootstrapping (WB) procedure. For Bernoulli rewards, WB with multiplicative exponential weights is mathematically equivalent to Thompson sampling (TS) and results in near-optimal regret bounds. Similarly, in the bandit setting with Gaussian rewards, we show that WB with additive Gaussian weights achieves near-optimal regret. Beyond these special cases, we show that WB leads to better empirical performance than TS for several reward distributions bounded on $[0, 1]$. For the contextual bandit setting, we give practical guidelines that make bootstrapping simple and efficient to implement and result in good empirical performance on real-world datasets.

Sharan Vaswani University of British Columbia sharanv@cs.ubc.ca Branislav Kveton Adobe Research kveton@adobe.com Zheng Wen Adobe Research zwen@adobe.com Anup Rao Adobe Research anuprao@adobe.com Mark Schmidt University of British Columbia schmidtm@cs.ubc.ca Yasin Abbasi-Yadkori Adobe Research abbasiya@adobe.com

Preprint. Work in progress.

## 1 Introduction

The multi-armed bandit framework Lai and Robbins (1985); Bubeck et al. (2012); Auer (2002); Auer et al. (2002) is a classic approach for sequential decision-making under uncertainty. The basic framework consists of $K$ independent arms that correspond to different choices or actions. These may be different treatments in a clinical trial or different products that can be recommended to the users of an online service. Each arm has an associated expected reward or utility. Typically, we do not have prior information about the utility of the available choices and the agent learns to make “good” decisions via repeated interaction in a trial-and-error fashion. Under the bandit setting, in each interaction or round, the agent selects an arm and observes a reward only for the selected arm. The objective of the agent is to maximize the reward accumulated across multiple rounds. This results in an exploration-exploitation trade-off: exploration means choosing an arm to gain more information about it, while exploitation corresponds to choosing the arm with the highest estimated reward so far. The contextual bandit setting Wang et al. (2005); Pandey et al. (2007); Kakade et al. (2008); Dani et al. (2008); Li et al. (2010); Agrawal and Goyal (2013b) is a generalization of the bandit framework and assumes that we have additional information in the form of a feature vector or “context” at each round. A context might be used to encode the medical data of a patient in a clinical trial or the demographics of an online user of a recommender system. In this case, the expected reward for an arm is an unknown function of the context at that particular round (we typically assume a parametric form for this function and infer the corresponding parameters from observations). For example, for linear bandits Rusmevichientong and Tsitsiklis (2010); Dani et al. (2008); Abbasi-Yadkori et al. (2011), this function is assumed to be linear, implying that the expected reward can be expressed as an inner product between the context vector and an (unknown) parameter to be learned from observations.

In both the bandit and contextual bandit settings, there are three main strategies for addressing the exploration-exploitation tradeoff: (i) $\epsilon$-greedy Langford and Zhang (2008), (ii) optimism-in-the-face-of-uncertainty Auer (2002); Abbasi-Yadkori et al. (2011) (OFU) and (iii) Thompson sampling Agrawal and Goyal (2013b). Though $\epsilon$-greedy (EG) is simple to implement and is widely used in practice, it results in sub-optimal performance from a theoretical standpoint. In practice, its performance heavily relies on choosing the right exploration parameter and the strategy for annealing it. Strategies based on optimism under uncertainty rely on constructing confidence sets and are statistically optimal and computationally efficient in the bandit Auer et al. (2002) and linear bandit Abbasi-Yadkori et al. (2011) settings. However, for non-linear feature-reward mappings, we can construct only approximate confidence sets Filippi et al. (2010); Li et al. (2017); Zhang et al. (2016); Jun et al. (2017) that result in over-conservative uncertainty estimates Filippi et al. (2010) and consequently in worse empirical performance. Given a prior distribution over the rewards or parameters being inferred, Thompson sampling (TS) uses the observed rewards to compute a posterior distribution. It then uses samples from the posterior to make decisions. TS is computationally efficient when we have a closed-form posterior, as in the case of Bernoulli or Gaussian rewards. For reward distributions beyond those admitting conjugate priors or for complex non-linear feature-reward mappings, it is not possible to have a closed-form posterior or obtain exact samples from it. In these cases, we have to rely on computationally expensive approximate sampling techniques Riquelme et al. (2018).

To address the above difficulties, bootstrapping Efron (1992) has been used in the bandit Baransi et al. (2014); Eckles and Kaptein (2014), contextual bandit Tang et al. (2015); McNellis et al. (2017) and deep reinforcement learning Osband and Van Roy (2015); Osband et al. (2016) settings. All previous work uses non-parametric bootstrapping (explained in Section 3.1) as an approximation to TS. As opposed to maintaining the entire posterior distribution for TS, bootstrapping requires computing only point-estimates (such as the maximum likelihood estimator). Bootstrapping thus has two major advantages over other existing strategies: (i) Unlike OFU and TS, it is simple to implement and does not require designing problem-specific confidence sets or efficient sampling algorithms. (ii) Unlike EG, it is not sensitive to hyper-parameter tuning. In spite of its advantages and good empirical performance, bootstrapping for bandits is not well understood theoretically, even under special settings of the bandit problem. Indeed, to the best of our knowledge, McNellis et al. (2017) is the only work that attempts to theoretically analyze the non-parametric bootstrapping (referred to as NPB) procedure. For the bandit setting with Bernoulli rewards and a Beta prior (henceforth referred to as the Bernoulli bandit setting), they prove that both TS and NPB will take similar actions as the number of rounds increases. However, this does not have any implication on the regret for NPB.

In this work, we first show that the NPB procedure used in the previous work is provably inefficient in the Bernoulli bandit setting (Section 3.2). In particular, we establish a near-linear lower bound on the incurred regret. In Section 3.3, we show that NPB with an appropriate amount of forced exploration (done in practice in McNellis et al. (2017); Tang et al. (2015)) can result in a sub-linear though sub-optimal upper bound on the regret. As an alternative to NPB, we propose the weighted bootstrapping (abbreviated as WB) procedure. For Bernoulli (or more generally categorical) rewards, we show that WB with multiplicative exponential weights is mathematically equivalent to TS and thus results in near-optimal regret. Similarly, for Gaussian rewards, WB with additive Gaussian weights is equivalent to TS with an uninformative prior and also attains near-optimal regret.

In Section 5, we empirically show that for several reward distributions on $[0, 1]$, WB outperforms TS with a randomized rounding procedure proposed in Agrawal and Goyal (2013b). In the contextual bandit setting, we give two implementation guidelines. To improve the computational efficiency of bootstrapping, prior work Eckles and Kaptein (2014); McNellis et al. (2017); Tang et al. (2015) approximated it by an ensemble of models, which requires additional hyperparameter tuning, such as choosing the size of the ensemble, or by problem-specific heuristics; for example, McNellis et al. (2017) use a lazy update procedure specific to decision trees. We find that with appropriate stochastic optimization, bootstrapping (without any approximation) is computationally efficient and simple to implement. Our second guideline is for the initialization of the bootstrapping procedure. Prior work McNellis et al. (2017); Tang et al. (2015) used forced exploration at the beginning of bootstrapping, by pulling each arm some number of times or by adding pseudo-examples. This involves tuning additional hyper-parameters; for example, McNellis et al. (2017) pull each arm a fixed number of times before bootstrapping. Similarly, the number of pseudo-examples or the procedure for generating them is rather arbitrary. We propose a simple method for generating such examples and experimentally validate that using $O(d)$ pseudo-examples, where $d$ is the dimension of the context vector, leads to consistently good performance. These contributions result in a simple and efficient implementation of the bootstrapping procedure. We experimentally evaluate bootstrapping with several parametric models and real-world datasets.

## 2 Background

We describe the framework for the contextual bandit problem in Section 2.1. In Section 2.2, we give the necessary background on bootstrapping and then explain its adaptation to bandits in Section 2.3.

### 2.1 Bandits Framework

The bandit setting consists of $K$ arms where each arm $j$ has an underlying (unknown) reward distribution. The protocol for a bandit problem is as follows: in each round $t = 1, \dots, T$, the bandit algorithm selects an arm $j_t$. It then receives a reward $r_t$ sampled from the underlying reward distribution for the selected arm $j_t$. The best or optimal arm is defined as the one with the highest expected reward. The aim of the bandit algorithm is to maximize the expected cumulative reward, or alternatively, to minimize the expected cumulative regret. The cumulative regret is the cumulative loss in the reward across $T$ rounds because of the lack of knowledge of the optimal arm.

In the contextual bandit setting Langford and Zhang (2008); Li et al. (2017); Chu et al. (2011), the expected reward at round $t$ depends on the context or feature vector $x_t \in \mathbb{R}^d$. Specifically, each arm $j$ is parametrized by the (unknown) vector $\theta^*_j$ and its expected reward at round $t$ is given by $m(x_t, \theta^*_j)$, i.e. $\mathbb{E}[r_t \mid j_t = j] = m(x_t, \theta^*_j)$. Here the function $m(\cdot, \cdot)$ is referred to as the model class. Given these definitions, the expected cumulative regret is defined as follows:

$$\mathbb{E}[R(T)] = \sum_{t=1}^{T} \mathbb{E}\left[ \max_{j} m(x_t, \theta^*_j) - m(x_t, \theta^*_{j_t}) \right] \tag{1}$$

The standard bandit setting is a special case of the above framework. To see this, if $\mu_j$ denotes the expected reward of arm $j$ of the $K$-arm bandit, then it can be obtained by setting $d = 1$, $x_t = 1$ for all $t$ and $\theta^*_j = \mu_j$ for all $j$, so that $m(x_t, \theta^*_j) = \mu_j$. Assuming that arm $1$ is the optimal arm, i.e. $\mu_1 = \max_j \mu_j$, the expected cumulative regret is defined as: $\mathbb{E}[R(T)] = T \mu_1 - \sum_{t=1}^{T} \mathbb{E}[\mu_{j_t}]$. Throughout this paper, we describe our algorithm under the general contextual bandit framework, but develop our theoretical results under the simpler bandit setting.
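As a small illustration of the regret in the standard bandit setting, the sketch below sums the per-round gap between the best mean and the mean of the pulled arm for one realized sequence of pulls. The function name and the example means are hypothetical; averaging this quantity over the randomness of the algorithm would give the expected cumulative regret.

```python
def cumulative_regret(means, pulls):
    """Regret of one sequence of pulls in a K-armed bandit.

    means: expected rewards mu_j, one per arm (illustrative values).
    pulls: arm indices j_t chosen at rounds t = 1..T.
    """
    best = max(means)  # mean of the optimal arm
    return sum(best - means[j] for j in pulls)


# Two arms; the second (suboptimal) arm is pulled twice out of four rounds.
means = [0.5, 0.25]
pulls = [0, 1, 1, 0]
regret = cumulative_regret(means, pulls)
```

Each pull of the suboptimal arm contributes its gap to the total, so an algorithm that keeps pulling a suboptimal arm for a constant fraction of the rounds incurs linear regret.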

### 2.2 Bootstrapping

In this section, we set up some notation and describe the bootstrapping procedure in the offline setting. Assume we have a set of $n$ data-points denoted by $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$. Here, $x_i$ and $y_i$ refer to the feature vector and observation (alternatively label) for the $i^{\text{th}}$ point. We assume a parametric generative model (parametrized by $\theta$) from the features $x$ to the observations $y$. Given $\mathcal{D}$, the log-likelihood of observing the data is given by $\mathcal{L}(\theta) = \sum_{i \in \mathcal{D}} \log P(y_i \mid x_i, \theta)$, where $P(y_i \mid x_i, \theta)$ is the probability of observing label $y_i$ given the feature vector $x_i$, under the model parameters $\theta$. In the absence of features, the probability of observing $y_i$ (for all $i \in [n]$) is given by $P(y_i \mid \theta)$. The maximum likelihood estimator (MLE) for the observed data is defined as $\hat{\theta} \in \arg\max_{\theta} \mathcal{L}(\theta)$. In this paper, we mostly focus on Bernoulli observations without features, in which case $\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y_i$.

Bootstrapping is typically used to obtain uncertainty estimates for a model fit to data. The general bootstrapping procedure consists of two steps: (i) Formulate a bootstrapping log-likelihood function $\tilde{\mathcal{L}}(\theta, Z)$ by injecting stochasticity into $\mathcal{L}(\theta)$ via the random variable $Z$ such that $\mathbb{E}_Z[\tilde{\mathcal{L}}(\theta, Z)] = \mathcal{L}(\theta)$. (ii) Given $Z = z$, generate a bootstrap sample $\tilde{\theta}$ as: $\tilde{\theta} \in \arg\max_{\theta} \tilde{\mathcal{L}}(\theta, z)$. In the offline setting Friedman et al. (2001), these steps are repeated $B$ times to obtain the set $\{\tilde{\theta}^1, \tilde{\theta}^2, \dots, \tilde{\theta}^B\}$. The variance of these samples is then used to estimate the uncertainty in the model parameters $\theta$. Unlike a Bayesian approach that requires characterizing the entire posterior distribution in order to compute uncertainty estimates, bootstrapping only requires computing point-estimates (maximizers of the bootstrapped log-likelihood functions). In Sections 3 and 4, we discuss two specific bootstrapping procedures.
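The two-step procedure can be sketched for Bernoulli observations without features. This is a minimal illustration, assuming that the injected randomness is resampling with replacement (the choice discussed in Section 3.1); the function names, the data and the number of repetitions are hypothetical.

```python
import random
import statistics


def mle_bernoulli(labels):
    """MLE of a Bernoulli mean: the fraction of positive labels."""
    return sum(labels) / len(labels)


def bootstrap_samples(labels, num_samples, rng):
    """Repeat steps (i)-(ii): perturb the data, then maximize the likelihood."""
    n = len(labels)
    samples = []
    for _ in range(num_samples):
        resampled = rng.choices(labels, k=n)       # step (i): draw the randomness
        samples.append(mle_bernoulli(resampled))   # step (ii): point estimate
    return samples


rng = random.Random(0)
labels = [1] * 30 + [0] * 70          # 100 observations, 30 positive
samples = bootstrap_samples(labels, 2000, rng)
spread = statistics.pstdev(samples)   # uncertainty estimate for the mean
```

Only point estimates are computed here; the spread of the bootstrap samples plays the role that the posterior standard deviation would play in a Bayesian treatment.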

### 2.3 Bootstrapping for Bandits

In the bandit setting, the work in Eckles and Kaptein (2014); Tang et al. (2015); McNellis et al. (2017) uses bootstrapping as an approximation to Thompson sampling (TS). The basic idea is to compute one bootstrap sample and treat it as a sample from an underlying posterior distribution in order to emulate TS. In Algorithm 1, we describe the procedure for the contextual bandit setting. At every round $t$, the set $\mathcal{D}_j$ consists of the features and observations obtained on pulling arm $j$ in the previous rounds. The algorithm uses the set $\mathcal{D}_j$ to compute a bootstrap sample $\tilde{\theta}_j$ for each arm $j$. Given the bootstrap sample for each arm, the algorithm (similar to TS) selects the arm $j_t$ maximizing the reward conditioned on this bootstrap sample. After obtaining the observation, the algorithm updates the set of observations for the selected arm. In the subsequent sections, we instantiate the procedures for generating the bootstrap sample $\tilde{\theta}_j$ and analyze the performance of the algorithm in these settings.
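The loop described above can be sketched for a Bernoulli bandit without features. This is an illustration in the spirit of Algorithm 1, not a transcription of it: the sampler interface `sample_fn`, the simulated arm means and the example sampler (resampling with one positive and one negative pseudo-example) are assumptions made for the sketch.

```python
import random


def run_bootstrap_bandit(means, horizon, sample_fn, rng):
    """Bootstrapped bandit loop for Bernoulli rewards (no features)."""
    num_arms = len(means)
    history = [[] for _ in range(num_arms)]  # D_j: rewards observed for arm j
    pulls = []
    for _ in range(horizon):
        # One bootstrap sample per arm, treated like a posterior sample in TS.
        theta = [sample_fn(history[j], rng) for j in range(num_arms)]
        arm = max(range(num_arms), key=lambda j: theta[j])
        # Simulated environment: Bernoulli reward for the selected arm only.
        reward = 1 if rng.random() < means[arm] else 0
        history[arm].append(reward)
        pulls.append(arm)
    return pulls, history


def resample_mean(rewards, rng):
    """Hypothetical sampler: resample rewards plus two pseudo-examples."""
    data = rewards + [1, 0]
    return sum(rng.choices(data, k=len(data))) / len(data)


rng = random.Random(1)
pulls, history = run_bootstrap_bandit([0.7, 0.3], 500, resample_mean, rng)
```

Keeping the sampler as an argument makes it easy to swap in a different bootstrapping procedure while leaving the arm-selection loop unchanged.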

## 3 Non-parametric Bootstrapping

We first describe the non-parametric bootstrapping (NPB) procedure in Section 3.1. We show that NPB used in conjunction with Algorithm 1 can be provably inefficient and establish a near-linear lower bound on the regret incurred by it in the Bernoulli bandit setting (Section 3.2). In Section 3.3, we show that NPB with an appropriate amount of forced exploration can result in sub-linear regret in this setting.

### 3.1 Procedure

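In order to construct the bootstrap sample $\tilde{\theta}_j$ in Algorithm 1, we first create a new dataset $\tilde{\mathcal{D}}_j$ by sampling with replacement $|\mathcal{D}_j|$ points from $\mathcal{D}_j$. The bootstrapped log-likelihood is equal to the log-likelihood of observing $\tilde{\mathcal{D}}_j$. Formally,

$$\tilde{\mathcal{L}}(\theta) = \sum_{i \in \tilde{\mathcal{D}}_j} \log P(y_i \mid x_i, \theta) \tag{2}$$

The bootstrap sample is computed as $\tilde{\theta}_j \in \arg\max_{\theta} \tilde{\mathcal{L}}(\theta)$. Observe that the sampling with replacement procedure is the source of randomness for bootstrapping and that $\mathbb{E}[\tilde{\mathcal{L}}(\theta)] = \mathcal{L}(\theta)$.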

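For the special case of Bernoulli rewards without features, a common practice is to use Laplace smoothing where we generate positive ($1$) or negative ($0$) pseudo-examples to be used in addition to the observed labels. Laplace smoothing is associated with two non-negative integers $\alpha_0, \beta_0$, where $\alpha_0$ (and $\beta_0$) is the pseudo-count, equal to the number of positive (or negative) pseudo-examples. These pseudo-counts are used to “simulate” the prior distribution $\text{Beta}(\alpha_0, \beta_0)$. For the NPB procedure with Bernoulli rewards, generating $\tilde{\theta}_j$ is equivalent to sampling from a Binomial distribution $\text{Bin}(N, p)$ where $N = n + \alpha_0 + \beta_0$ with $n = |\mathcal{D}_j|$, and the success probability $p$ is equal to the fraction of positive observations in $\mathcal{D}_j$ (counting the pseudo-examples). Formally, if the number of positive observations in $\mathcal{D}_j$ is equal to $\alpha$, then

$$\tilde{\theta}_j \sim \frac{1}{n + \alpha_0 + \beta_0} \, \text{Bin}\left(n + \alpha_0 + \beta_0, \; \frac{\alpha + \alpha_0}{n + \alpha_0 + \beta_0}\right) \tag{3}$$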
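The Binomial form of the NPB sample can be sketched as follows. The sketch assumes that the pseudo-examples are resampled together with the observed labels, so that the number of trials is $n + \alpha_0 + \beta_0$ and the success probability is $(\alpha + \alpha_0) / (n + \alpha_0 + \beta_0)$; the function name and the example counts are hypothetical.

```python
import random


def npb_sample(num_pos, num_neg, alpha0, beta0, rng):
    """One NPB sample for Bernoulli rewards with Laplace smoothing.

    num_pos, num_neg: observed positive and negative rewards for the arm.
    alpha0, beta0: positive and negative pseudo-counts.
    Requires at least one (pseudo-)observation in total.
    """
    total = num_pos + num_neg + alpha0 + beta0
    p = (num_pos + alpha0) / total
    # Binomial(total, p) draw, written as a sum of Bernoulli draws.
    successes = sum(1 for _ in range(total) if rng.random() < p)
    return successes / total


rng = random.Random(0)
draws = [npb_sample(3, 7, 1, 1, rng) for _ in range(5000)]
mean_draw = sum(draws) / len(draws)

# Without pseudo-counts, an arm that has only produced zeros is stuck at 0.
stuck = npb_sample(0, 5, 0, 0, rng)
```

The last line hints at the failure mode analyzed in the next subsection: when the observed rewards contain few or no positives, resampling cannot produce an optimistic estimate.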

### 3.2 Inefficiency of Non-Parametric Bootstrapping

In this subsection, we formally show that Algorithm 1 used with NPB might lead to an $\Omega(T^{\gamma})$ regret with $\gamma$ arbitrarily close to $1$. Specifically, we consider a simple $2$-arm bandit setting, where at each round $t$, the reward of arm $1$ is independently drawn from a Bernoulli distribution with mean $\mu_1$, and the reward of arm $2$ is deterministic and equal to $\mu_2 < \mu_1$. Furthermore, we assume that the agent knows the deterministic reward of arm $2$, but not the mean reward for arm $1$. Notice that this case is simpler than the standard two-arm Bernoulli bandit setting, in the sense that the agent also knows the reward of arm $2$. Observe that if $\tilde{\theta}_1$ is a bootstrap sample for arm $1$ (obtained according to equation 3), then arm $1$ is selected if $\tilde{\theta}_1 \geq \mu_2$. Under this setting, we prove the following lower bound:

###### Theorem 1.

If the NPB procedure is used in the above-described case with pseudo-counts for arm , then for any and any , we obtain

###### Proof.

Please refer to Appendix A for the detailed proof of Theorem 1. It is based on a binomial tail bound (Proposition 2) and uses the following observation: under a “bad history”, in which NPB has pulled arm $1$ several times and all of these pulls have resulted in a reward of $0$, NPB pulls arm $1$ again only with a small probability (Lemma 1). Hence, the number of times NPB pulls the suboptimal arm before it pulls arm $1$ again or reaches the end of the $T$ time steps follows a “truncated geometric distribution”, whose expected value is bounded in Lemma 2. Combining Lemma 2 with the probability of this bad history gives the regret lower bound of Lemma 3, from which Theorem 1 follows by an appropriate choice of the length of the bad history. ∎

Theorem 1 shows that, when $T$ is large enough, the NPB procedure used in previous work Eckles and Kaptein (2014); Tang et al. (2015); McNellis et al. (2017) incurs an expected cumulative regret arbitrarily close to linear in $T$. It is straightforward to prove a variant of this lower bound with any constant (in terms of $T$) number of pseudo-examples. Next, we show that NPB with appropriate forced exploration can result in sub-linear regret.
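The “bad history” intuition behind the lower bound can be illustrated numerically. The sketch below computes the exact probability that an NPB sample for the Bernoulli arm reaches a given threshold after that arm has returned only zeros. The pseudo-counts (one positive, one negative), the threshold of 0.25 standing in for the known reward of the deterministic arm, and the selection rule “sample at least the threshold” are all assumptions chosen for illustration, not values taken from the theorem.

```python
from math import ceil, comb


def prob_pull_after_bad_history(num_zero_pulls, threshold, alpha0=1, beta0=1):
    """P(NPB sample >= threshold) when every observed reward was 0."""
    total = num_zero_pulls + alpha0 + beta0
    p = alpha0 / total                 # only pseudo-examples are positive
    k_min = ceil(threshold * total)    # successes needed to reach the threshold
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(k_min, total + 1)
    )


# Probability of returning to the Bernoulli arm after 2, 10 and 50 zero pulls.
probs = [prob_pull_after_bad_history(s, 0.25) for s in (2, 10, 50)]
```

The probability shrinks quickly as the run of zeros gets longer, so once such a history occurs the algorithm can keep pulling the suboptimal arm for a very long stretch.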

### 3.3 Forced Exploration

In this subsection, we show that NPB, when coupled with an appropriate amount of forced exploration, can result in sub-linear regret in the Bernoulli bandit setting. In order to force exploration, we pull each arm $m$ times before starting Algorithm 1. The following theorem shows that for an appropriate value of $m$, this strategy can result in a sub-linear upper bound on the regret.

###### Theorem 2.

In any -armed bandit setting, if each arm is initially pulled times before starting Algorithm 1, then

###### Proof.

The claim is proved in Appendix B based on the following observation: If the gap of the suboptimal arm is large, the prescribed steps are sufficient to guarantee that the bootstrap sample of the optimal arm is higher than that of the suboptimal arm with a high probability at any round . On the other hand, if the gap of the suboptimal arm is small, no algorithm can have high regret. ∎

Although we can remedy the NPB procedure using this strategy, it results in a sub-optimal regret bound. In the next section, we consider a weighted bootstrapping approach as an alternative to NPB.

## 4 Weighted Bootstrapping

In this section, we propose weighted bootstrapping (WB) as an alternative to the non-parametric bootstrap. We first describe the weighted bootstrapping procedure in Section 4.1. For the bandit setting with Bernoulli rewards, we show the mathematical equivalence between WB and TS, hence proving that WB attains near-optimal regret (Section 4.2).

### 4.1 Procedure

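In order to formulate the bootstrapped log-likelihood, we use a random transformation of the labels in the corresponding log-likelihood function. First, consider the case of Bernoulli observations where the labels $y_i \in \{0, 1\}$. In this case, the log-likelihood function is given by:

$$\mathcal{L}(\theta) = \sum_{i \in \mathcal{D}_j} \left[ y_i \log g(\langle x_i, \theta \rangle) + (1 - y_i) \log\left(1 - g(\langle x_i, \theta \rangle)\right) \right]$$

where the function $g(\cdot)$ is the inverse-link function. For each observation $i$, we sample a random weight $w_i$ from an exponential distribution, specifically, for all $i \in \mathcal{D}_j$, $w_i \sim \text{Exp}(1)$. We use the following transformation of the labels: $y_i \to w_i \cdot y_i$ and $(1 - y_i) \to w_i \cdot (1 - y_i)$. Since we transform the labels by multiplying them with exponential weights, we refer to this case as WB with multiplicative exponential weights. Observe that this transformation procedure extends the domain for the labels from values in $\{0, 1\}$ to those in $[0, \infty)$ and does not result in a valid probability mass function. However, below, we describe several advantages of using this transformation.

Given this transformation, the bootstrapped log-likelihood function is defined as:

$$\tilde{\mathcal{L}}(\theta) = \sum_{i \in \mathcal{D}_j} w_i \left[ y_i \log g(\langle x_i, \theta \rangle) + (1 - y_i) \log\left(1 - g(\langle x_i, \theta \rangle)\right) \right] = \sum_{i \in \mathcal{D}_j} w_i \, \ell_i(\theta) \tag{4}$$

Here $\ell_i(\theta)$ is the log-likelihood of observing point $i$. As before, the bootstrap sample is computed as: $\tilde{\theta}_j \in \arg\max_{\theta} \tilde{\mathcal{L}}(\theta)$. Note that in WB, the randomness for bootstrapping is induced by the weights $w$ and that $\mathbb{E}_w[\tilde{\mathcal{L}}(\theta)] = \mathcal{L}(\theta)$. As a special case, in the absence of features, when $g(\langle x_i, \theta \rangle) = \theta$ for all $i$, assuming $\alpha_0$ positive and $\beta_0$ negative pseudo-counts and denoting $n = |\mathcal{D}_j|$, we obtain the following closed-form expression for computing the bootstrap sample:

$$\tilde{\theta} = \frac{\sum_{i=1}^{n} w_i \, y_i + \sum_{i=n+1}^{n+\alpha_0} w_i}{\sum_{i=1}^{n+\alpha_0+\beta_0} w_i} \tag{5}$$

where the weights with indices $n+1, \dots, n+\alpha_0$ belong to the positive pseudo-examples and the remaining $\beta_0$ weights to the negative pseudo-examples.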

Using the above transformation has the following advantages: (i) Using equation 4, we can interpret as a random re-weighting (by the weights ) of the observations. This formulation is equivalent to the weighted likelihood bootstrapping procedure proposed and proven to be asymptotically consistent in the offline case in Newton and Raftery (1994). (ii) From an implementation perspective, computing involves solving a weighted maximum likelihood estimation problem. It thus has the same computational complexity as NPB and can be solved by using black-box optimization routines. (iii) In the next section, we show that using WB with multiplicative exponential weights has good theoretical properties in the bandit setting. Furthermore, such a procedure of randomly transforming the labels lends itself naturally to the Gaussian case and in Appendix C.2.1, we show that WB with an additive transformation using Gaussian weights is equivalent to TS.
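For Bernoulli rewards without features, the weighted bootstrap sample reduces to a ratio of sums of exponential weights. The sketch below assumes that each observed reward and each pseudo-example receives its own independent Exp(1) weight, and that the sample is the weighted fraction of positives; the function name and example counts are hypothetical.

```python
import random


def wb_sample(rewards, alpha0, beta0, rng):
    """One WB sample with multiplicative Exp(1) weights (no features).

    rewards: observed 0/1 rewards for the arm.
    alpha0, beta0: positive and negative pseudo-counts.
    Requires at least one (pseudo-)observation in total.
    """
    # Total weight on positive and negative observations.
    pos = sum(rng.expovariate(1.0) for y in rewards if y == 1)
    neg = sum(rng.expovariate(1.0) for y in rewards if y == 0)
    # Pseudo-examples are weighted in the same way.
    pos += sum(rng.expovariate(1.0) for _ in range(alpha0))
    neg += sum(rng.expovariate(1.0) for _ in range(beta0))
    # Weighted MLE of the Bernoulli mean.
    return pos / (pos + neg)


rng = random.Random(0)
rewards = [1] * 3 + [0] * 7
draws = [wb_sample(rewards, 1, 1, rng) for _ in range(5000)]
mean_draw = sum(draws) / len(draws)
```

Unlike resampling, the weighted estimate is a continuous random variable, so even an arm whose observed rewards are all zero keeps a non-degenerate distribution as long as there is a positive pseudo-example.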

### 4.2 Equivalence to Thompson sampling

We now analyze the theoretical performance of WB in the Bernoulli bandit setting. In the following proposition, proved in Appendix C.1.1, we show that WB with multiplicative exponential weights is equivalent to TS.

###### Proposition 1.

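If the rewards $y_i \sim \text{Ber}(\theta^*)$, then weighted bootstrapping using the estimator in equation 5 results in $\tilde{\theta} \sim \text{Beta}(\alpha + \alpha_0, \beta + \beta_0)$, where $\alpha$ and $\beta$ are the numbers of positive and negative observations respectively; $\alpha_0$ and $\beta_0$ are the positive and negative pseudo-counts. In this case, WB is equivalent to Thompson sampling under the $\text{Beta}(\alpha_0, \beta_0)$ prior.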

Since WB is mathematically equivalent to TS, the bounds in Agrawal and Goyal (2013a) imply near-optimal regret for WB in the Bernoulli bandit setting.
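The equivalence can be traced to a standard fact about Gamma random variables. The following is only a sketch of that argument, assuming independent $\text{Exp}(1)$ weights, integer pseudo-counts, and $\alpha + \alpha_0 \geq 1$, $\beta + \beta_0 \geq 1$; the symbols $G_1$ and $G_2$ are introduced here for illustration, and the full proof is the one in Appendix C.1.1.

$$
\begin{aligned}
G_1 &= \text{total weight on positive (pseudo-)observations} \sim \text{Gamma}(\alpha + \alpha_0, 1), \\
G_2 &= \text{total weight on negative (pseudo-)observations} \sim \text{Gamma}(\beta + \beta_0, 1), \\
\tilde{\theta} &= \frac{G_1}{G_1 + G_2} \sim \text{Beta}(\alpha + \alpha_0, \beta + \beta_0).
\end{aligned}
$$

The first two lines use the fact that a sum of $k$ independent $\text{Exp}(1)$ variables is $\text{Gamma}(k, 1)$, and the last line uses the fact that for independent $G_1 \sim \text{Gamma}(a, 1)$ and $G_2 \sim \text{Gamma}(b, 1)$, the ratio $G_1 / (G_1 + G_2)$ is $\text{Beta}(a, b)$. The resulting distribution is the TS posterior after $\alpha$ positive and $\beta$ negative observations under the $\text{Beta}(\alpha_0, \beta_0)$ prior.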

In Appendix C.1.2, we show that this equivalence extends to the more general categorical (with $C$ categories) reward distribution, i.e. for $y_i \in \{1, \dots, C\}$. In Appendix C.2.1, we prove that for Gaussian rewards, WB with additive Gaussian weights, i.e. $w_i \sim \mathcal{N}(0, 1)$ and using the additive transformation $y_i \to y_i + w_i$, is equivalent to TS under an uninformative prior. Furthermore, this equivalence holds even in the presence of features, i.e. in the linear bandit case. Using the results in Agrawal and Goyal (2013b), this implies that for Gaussian rewards, WB with additive Gaussian weights achieves near-optimal regret.
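The Gaussian case without features can be illustrated in a few lines. The sketch assumes unit-variance Gaussian rewards and standard normal additive weights, so that the perturbed sample mean has mean equal to the empirical mean and variance $1/n$, which matches a Gaussian posterior under a flat prior. The function name and the reward values are hypothetical.

```python
import random
import statistics


def wb_gaussian_sample(rewards, rng):
    """One WB sample with additive N(0, 1) weights (no features)."""
    perturbed = [y + rng.gauss(0.0, 1.0) for y in rewards]
    # The MLE of a Gaussian mean is the average of the (perturbed) labels.
    return sum(perturbed) / len(perturbed)


rng = random.Random(0)
rewards = [0.2, 0.4, 0.6, 0.8]        # n = 4, empirical mean 0.5
draws = [wb_gaussian_sample(rewards, rng) for _ in range(20000)]
draw_mean = statistics.mean(draws)    # close to the empirical mean
draw_std = statistics.pstdev(draws)   # close to 1 / sqrt(n) = 0.5
```

The spread of the samples shrinks at the rate $1/\sqrt{n}$ as more rewards are observed, which is the behaviour expected from posterior sampling in this setting.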

## 5 Experiments

In Section 5.1, we first compare the empirical performance of bootstrapping and Thompson sampling in the bandit setting. In Section 5.2, we describe the experimental setup for the contextual bandit setting and compare the performance of different algorithms under different feature-reward mappings.

### 5.1 Bandit setting

We consider $K$ arms (refer to Appendix D for results with other values of $K$), a horizon of $T$ rounds, and average our results across multiple independent runs. We perform experiments for four different reward distributions: Bernoulli, Truncated Normal, Beta and Triangular, all bounded on the $[0, 1]$ interval. In each run and for each arm $j$, we choose the expected reward $\mu_j$ (mean of the corresponding distribution) to be a uniformly distributed random number in $[0, 1]$. For the Truncated Normal distribution, we use a fixed standard deviation, whereas for the Beta distribution, the shape parameters of arm $j$ are chosen so that the distribution has mean $\mu_j$. We use a Beta prior for TS. In order to use TS on distributions other than Bernoulli, we follow the procedure proposed in Agrawal and Goyal (2013a): for a reward in $[0, 1]$ we flip a coin with the probability of obtaining $1$ equal to the reward, resulting in a binary “pseudo-reward”. This pseudo-reward is then used to update the Beta posterior as in the Bernoulli case. For NPB and WB, we use the estimators in equations 3 and 5 respectively. For both of these, we use fixed pseudo-counts $(\alpha_0, \beta_0)$.
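The randomized rounding step used to run TS on non-Bernoulli rewards can be sketched as a single posterior update. The function name and the Beta parameters in the example are hypothetical; the only assumption about the reward is that it lies in $[0, 1]$.

```python
import random


def ts_update_with_rounding(alpha, beta, reward, rng):
    """Update Beta(alpha, beta) with a reward in [0, 1] via a coin flip."""
    # Binary pseudo-reward: 1 with probability equal to the observed reward.
    pseudo_reward = 1 if rng.random() < reward else 0
    return alpha + pseudo_reward, beta + 1 - pseudo_reward


rng = random.Random(0)
alpha, beta = ts_update_with_rounding(1, 1, 0.3, rng)
```

The coin flip keeps the posterior in the Beta family but discards part of the information in the real-valued reward, which is one plausible reason the bootstrapping estimators can do better on such distributions.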

In the Bernoulli case, NPB obtains a higher regret as compared to both TS and WB which are equivalent. For the other distributions, we observe that both WB and NPB (with WB resulting in consistently better performance) obtain lower cumulative regret than the modified TS procedure. This shows that for distributions that do not admit a conjugate prior, WB (and NPB) can be directly used and results in good empirical performance as compared to making modifications to the TS procedure.

### 5.2 Contextual bandit setting

We adopt the one-versus-all multi-class classification setting for evaluating contextual bandits Agarwal et al. (2014); McNellis et al. (2017). Each arm corresponds to a class. In each round, the algorithm receives a reward of one if the context vector belongs to the class corresponding to the selected arm and zero otherwise. Each arm maintains an independent set of sufficient statistics that map the context vector to the observed binary reward. We use two multi-class datasets: CoverType ($K = 7$ and $d = 54$) and MNIST ($K = 10$ and $d = 784$). We run each experiment for $T$ rounds and average results over independent runs. We experiment with LinUCB Abbasi-Yadkori et al. (2011), which we call UCB, linear Thompson sampling (TS) Agrawal and Goyal (2013b), $\epsilon$-greedy (EG) Langford and Zhang (2008), non-parametric bootstrapping (NPB), and weighted bootstrapping (WB). For EG, NPB and WB, we consider three model classes: linear regression (suffix “-lin” in plots), logistic regression (suffix “-log” in plots), and a fully-connected neural network with a single hidden layer (suffix “-nn” in plots). Since we compare various bandit algorithms and model classes, we use the expected per-step reward, $\frac{1}{T} \mathbb{E}\left[\sum_{t=1}^{T} r_t\right]$, as our performance metric.

For EG, we experimented extensively with many different exploration schedules and use the one that led to the best performance on both of our datasets. In practice, it is not possible to do such tuning on a new problem. Therefore, the EG results in this paper should be viewed as a proxy for the “best” attainable performance. As alluded to in Section 1, we implement bootstrapping using stochastic optimization with warm-start, in contrast to approximating it as in McNellis et al. (2017); Tang et al. (2015). Specifically, we use stochastic gradient descent to compute the MLE for the bootstrapped log-likelihood and warm-start the optimization at round $t$ by the solution from the previous round $t - 1$. For linear and logistic regression, we optimize until we reach a fixed error threshold. For the neural network, we take one pass over the dataset in each round. To ensure that the results do not depend on our specific choice of optimization, we use scikit-learn Buitinck et al. (2013) with stochastic optimization and default optimization options for both linear and logistic regression. For the neural network, we use the Keras library Chollet (2015) with the ReLU non-linearity for the hidden layer and sigmoid in the output layer, along with SGD and its default configuration. Preliminary experiments suggested that our procedure leads to better runtime as compared to McNellis et al. (2017) and better performance than the approximation proposed in Tang et al. (2015), while also alleviating the need to tune any hyper-parameters.
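The warm-started optimization can be sketched in plain Python for weighted logistic regression. This is a stand-in for illustration only, not the scikit-learn or Keras setup used in the experiments: the helper names, the step size, the toy data and the number of rounds are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_pass(theta, data, weights, step_size, rng):
    """One SGD pass over the weighted logistic log-likelihood, starting at theta."""
    theta = list(theta)
    order = list(range(len(data)))
    rng.shuffle(order)
    for i in order:
        x, y = data[i]
        pred = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        # Gradient of w_i * [y log g + (1 - y) log(1 - g)] with respect to theta.
        scale = weights[i] * (y - pred)
        theta = [t + step_size * scale * xi for t, xi in zip(theta, x)]
    return theta


rng = random.Random(0)
# Toy data for one arm: (context with a bias term, binary reward).
data = [([1.0, 2.0], 1), ([1.0, -2.0], 0)] * 20
theta = [0.0, 0.0]
for _ in range(5):  # five bandit rounds
    # Fresh Exp(1) weights define a new bootstrapped objective each round.
    weights = [rng.expovariate(1.0) for _ in data]
    # Warm start: continue from the solution of the previous round.
    theta = sgd_pass(theta, data, weights, 0.1, rng)
```

Because consecutive rounds differ by one observation and a fresh set of weights, the previous solution is typically a good starting point, which is what keeps the per-round cost low.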

In the prior work on bootstrapping McNellis et al. (2017); Tang et al. (2015) for contextual bandits, the algorithm was initialized through forced exploration, where each arm is explored $m$ times at the beginning, or equivalently assigned $m$ pseudo-examples that are randomly sampled context vectors. Such a procedure introduces yet another tunable parameter $m$. Therefore, we propose the following parameter-free procedure. Let $v_1, \dots, v_d$ be the eigenvectors of the covariance matrix of the context vectors, and $\lambda_1, \dots, \lambda_d$ be the corresponding eigenvalues. For each arm, we add $4d$ pseudo-examples: for all $j \in [d]$, we include the vectors $\sqrt{\lambda_j} v_j$ and $-\sqrt{\lambda_j} v_j$, each with both $0$ and $1$ labels. Since $\sqrt{\lambda_j}$ is the standard deviation of features in the direction of $v_j$, this procedure ensures that we maintain enough variance in the directions where the contexts lie. In the absence of any prior information about the contexts, we recommend using samples from an isotropic multivariate Gaussian and validate that it led to comparable performance on the two datasets.
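The pseudo-example construction can be sketched as follows, under the reading that each eigen-direction contributes the two scaled vectors $\pm\sqrt{\lambda_j} v_j$, each with both labels, for $4d$ examples per arm. The function name is hypothetical, and the eigenpairs are passed in as plain lists; in practice they would come from an eigendecomposition of the context covariance matrix.

```python
import math


def make_pseudo_examples(eigenvalues, eigenvectors):
    """Build 4d (context, label) pseudo-examples from d eigenpairs."""
    examples = []
    for lam, vec in zip(eigenvalues, eigenvectors):
        scale = math.sqrt(lam)  # standard deviation along this direction
        for sign in (1.0, -1.0):
            x = [sign * scale * v for v in vec]
            for label in (0, 1):
                examples.append((x, label))
    return examples


# Toy covariance with eigenvalues 4 and 1 along the coordinate axes (d = 2).
pseudo = make_pseudo_examples([4.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
```

Adding both labels for every direction keeps the initial data uninformative about the reward while still spanning the subspace in which the contexts vary.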

We plot the expected per-step reward of all compared methods on the CoverType and MNIST datasets in figures 2(a) and 2(b), respectively. In figure 2(a), we observe that EG, NPB, and WB with logistic regression have the best performance in all rounds. The linear methods (EG, UCB, and bootstrapping) perform similarly, and slightly worse than logistic regression, whereas TS has the worst performance. Neural networks perform similarly to logistic regression and we do not plot them here. This experiment shows that even for a relatively simple dataset, like CoverType, a more expressive non-linear model can lead to better performance. This effect is more pronounced in figure 2(b). For this dataset, we only show the best performing linear method, UCB. The performance of other linear methods, including those with bootstrapping, is comparable to or worse than UCB. We observe that non-linear models yield a much higher per-step reward, with the neural network performing the best. For both logistic regression and neural networks, the two bootstrapping methods perform similarly to each other and only slightly worse than a tuned EG method. Both NPB and WB are computationally efficient: we measured their average runtime per round with logistic regression on the CoverType dataset, and with both logistic regression and a neural network on the MNIST dataset.

## 6 Discussion

We showed that the commonly used non-parametric bootstrapping procedure can be provably inefficient. As an alternative, we proposed the weighted bootstrapping procedure, special cases of which become equivalent to TS for common reward distributions such as Bernoulli and Gaussian. On the empirical side, we showed that the WB procedure has better performance than a modified TS scheme for several bounded distributions in the bandit setting. In the contextual bandit setting, we provided guidelines to make bootstrapping simple and efficient to implement and showed that non-linear versions of bootstrapping have good empirical performance. Our work raises several open questions: does bootstrapping result in near-optimal regret for generalized linear models? Under what assumptions or modifications can NPB be shown to have good performance? On the empirical side, evaluating bootstrapping across multiple datasets and comparing it against TS with approximate sampling is an important future direction.

## References

- Abbasi-Yadkori et al. (2011) Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
- Agarwal et al. (2014) A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.
- Agrawal and Goyal (2013a) S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013a.
- Agrawal and Goyal (2013b) S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 2013b.
- Arratia and Gordon (1989) R. Arratia and L. Gordon. Tutorial on large deviations for the binomial distribution. Bulletin of Mathematical Biology, 51(1):125–131, Jan 1989. ISSN 1522-9602.
- Auer (2002) P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- Baransi et al. (2014) A. Baransi, O.-A. Maillard, and S. Mannor. Sub-sampling for multi-armed bandits. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 115–131. Springer, 2014.
- Boucheron et al. (2013) S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
- Bubeck et al. (2012) S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Buitinck et al. (2013) L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.
- Chollet (2015) F. Chollet. keras. https://github.com/fchollet/keras, 2015.
- Chu et al. (2011) W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, volume 15, pages 208–214, 2011.
- Dani et al. (2008) V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.
- Eckles and Kaptein (2014) D. Eckles and M. Kaptein. Thompson sampling with the online bootstrap. arXiv preprint arXiv:1410.4009, 2014.
- Efron (1992) B. Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pages 569–593. Springer, 1992.
- Filippi et al. (2010) S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.
- Friedman et al. (2001) J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
- Jun et al. (2017) K.-S. Jun, A. Bhargava, R. Nowak, and R. Willett. Scalable generalized linear bandits: Online computation and hashing. arXiv preprint arXiv:1706.00136, 2017.
- Kakade et al. (2008) S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Efficient bandit algorithms for online multiclass prediction. In Proceedings of the 25th international conference on Machine learning, pages 440–447. ACM, 2008.
- Lai and Robbins (1985) T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
- Langford and Zhang (2008) J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pages 817–824, 2008.
- Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
- Li et al. (2017) L. Li, Y. Lu, and D. Zhou. Provable optimal algorithms for generalized linear contextual bandits. arXiv preprint arXiv:1703.00048, 2017.
- McNellis et al. (2017) R. McNellis, A. N. Elmachtoub, S. Oh, and M. Petrik. A practical method for solving contextual bandit problems using decision trees. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017, 2017.
- Newton and Raftery (1994) M. A. Newton and A. E. Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society. Series B (Methodological), pages 3–48, 1994.
- Osband and Van Roy (2015) I. Osband and B. Van Roy. Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.
- Osband et al. (2016) I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.
- Pandey et al. (2007) S. Pandey, D. Chakrabarti, and D. Agarwal. Multi-Armed Bandit Problems with Dependent Arms. In ICML ’07: Proceedings of the 24th international conference on Machine learning, pages 721–728, New York, NY, USA, 2007. ACM.
- Riquelme et al. (2018) C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
- Rusmevichientong and Tsitsiklis (2010) P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
- Tang et al. (2015) L. Tang, Y. Jiang, L. Li, C. Zeng, and T. Li. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 323–332. ACM, 2015.
- Wang et al. (2005) C.-c. Wang, S. R. Kulkarni, and H. V. Poor. Bandit problems with side observations. IEEE Transactions on Automatic Control, 50:338–355, 2005.
- Zhang et al. (2016) L. Zhang, T. Yang, R. Jin, Y. Xiao, and Z. Zhou. Online stochastic linear optimization under one-bit feedback. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 392–401, 2016.

## Appendix A Proof for Theorem 1

We prove Theorem 1 in this section. First, we have the following tail bound for Binomial random variables:

###### Proposition 2 (Binomial Tail Bound).

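Assume that the random variable $X \sim \mathrm{Bin}(n, p)$. Then for any $k$ s.t. $np < k < n$, we have

$$\mathbb{P}(X \geq k) \leq \exp\left(-n \, D\left(\tfrac{k}{n} \,\middle\|\, p\right)\right),$$

where $D\left(\tfrac{k}{n} \,\middle\|\, p\right)$ is the KL-divergence between $\tfrac{k}{n}$ and $p$.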
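As a sanity check, the following minimal sketch compares the exact upper tail of a Binomial random variable against a KL-based bound. It assumes the standard Chernoff form of the bound, $\mathbb{P}(X \geq k) \leq \exp(-n D(k/n \| p))$ for $np < k < n$; the function names and the parameter values are illustrative and are not taken from the paper.

```python
import math


def kl_bernoulli(q, p):
    """KL divergence D(q || p) between Bernoulli(q) and Bernoulli(p), 0 < p < 1."""
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total


def binomial_upper_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def kl_tail_bound(n, p, k):
    """Assumed form of the bound: exp(-n * D(k/n || p))."""
    return math.exp(-n * kl_bernoulli(k / n, p))


if __name__ == "__main__":
    n, p = 40, 0.25  # illustrative values; here n * p = 10
    for k in (11, 15, 20, 30, 39):
        print(k, binomial_upper_tail(n, p, k), kl_tail_bound(n, p, k))
```

The exact tail stays below the bound for every $k$ in the admissible range, and the gap narrows as $k$ approaches $n$.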

Notice that in the case we consider, the “observation history” of the agent at the beginning of time is completely characterized by a triple , where is the number of times arm has been pulled from time to and the realized reward is , plus the pseudo count ; similarly, is the number of times arm has been pulled from time to and the realized reward is , plus the pseudo count . Moreover, conditioning on this history , the probability that the agent will pull arm under NPB depends only on . To simplify the exposition, we use to denote this conditional probability. The following lemma bounds this probability in a “bad” history:

###### Lemma 1.

Consider a “bad” history with and for some integer . Then we have

###### Proof.

Recall that by definition, we have

(6) |

where (a) follows from the NPB procedure in this case, and (b) follows from Proposition 2. Specifically, recall that , and for . Thus, the conditions of Proposition 2 hold in this case. Furthermore, we have

(7) |

where (c) follows from the fact that for . Thus we have

(8) |

∎

The following technical lemma derives the expected value of a truncated geometric random variable, as well as a lower bound on it, which will be used in the subsequent analysis:

###### Lemma 2 (Expected Value of Truncated Geometric R.V.).

Assume that is a truncated geometric r.v. with parameter and integer . Specifically, the domain of is , and for and . Then we have

###### Proof.

Notice that by definition, we have

Define the shorthand notation , we have

(9) |

Recall that , we have proved that .

Now we prove the lower bound. First, we prove that

(10) |

always holds by induction on . Notice that when , the LHS of equation (10) is , and the RHS of equation (10) is . Hence, this inequality trivially holds in the base case. Now assume that equation (10) holds for ; we prove that it also holds for . Notice that

(11) |

where (a) follows from the induction hypothesis. Thus equation (10) holds for all and . Notice that equation (10) implies that

We now complete the proof of the lower bound. Notice that for any , is an increasing function of . Thus, for , we have

On the other hand, if , we have

Combining the above results, we have proved the lower bound on . ∎
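The expectation in Lemma 2 can be checked numerically. The sketch below assumes a specific form of the truncated geometric distribution, since the formulas are not reproduced here: $X$ takes values in $\{0, 1, \ldots, l\}$ with $\mathbb{P}(X = i) = (1-p)^i p$ for $i < l$ and $\mathbb{P}(X = l) = (1-p)^l$, that is, the number of failures before the first success, capped at $l$. Under that assumption the tail-sum formula gives $\mathbb{E}[X] = \sum_{i=1}^{l} (1-p)^i = \left(\tfrac{1}{p} - 1\right)\left(1 - (1-p)^l\right)$. The function names are hypothetical.

```python
def truncated_geometric_pmf(p, l):
    """Assumed pmf on {0, ..., l}: (1-p)^i * p for i < l, and (1-p)^l at i = l."""
    pmf = [(1 - p) ** i * p for i in range(l)]
    pmf.append((1 - p) ** l)
    return pmf


def expected_value_direct(p, l):
    """E[X] computed by summing i * P(X = i) over the support."""
    return sum(i * prob for i, prob in enumerate(truncated_geometric_pmf(p, l)))


def expected_value_closed_form(p, l):
    """E[X] = (1/p - 1) * (1 - (1-p)^l), derived under the assumed pmf."""
    return (1 / p - 1) * (1 - (1 - p) ** l)


if __name__ == "__main__":
    for p, l in [(0.1, 5), (0.5, 10), (0.01, 200)]:
        print(p, l, expected_value_direct(p, l), expected_value_closed_form(p, l))
```

For small $p$ and moderate $l$ the expectation is close to $l$, while for large $l$ it approaches $\tfrac{1}{p} - 1$, which matches the two regimes a lower bound on it has to cover.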

We then prove the following lemma:

###### Lemma 3 (Regret Bound Based on ).

When NPB is applied in the considered case, for any integer and time horizon satisfying , we have

###### Proof.

We start by defining the bad event as

Thus, we have . Since for all , with probability , the agent will pull arm infinitely often. Moreover, the event only depends on the outcomes of the first pulls of arm . Thus we have . Furthermore, conditioning on , we define the stopping time as

Then we have

(12) |

Notice that conditioning on event , in the first steps, the agent either pulls arm or pulls arm but receives a reward . Thus, by definition of , we have

On the other hand, if , notice that for any time with history s.t. , the agent will pull arm conditionally independently with probability . Thus, conditioning on , the number of times the agent will pull arm before it pulls arm again follows the truncated geometric distribution with parameter and . From Lemma 2, for any , we have

(13) |

Notice that a factor of in inequality (a) is due to the reward gap. Inequality (b) follows from the fact that ; inequality (c) follows from Lemma 1, which states that for , we have ; inequality (d) follows from the fact that for , we have

Finally, notice that

Thus, combining everything together, we have

(14) |

where the last equality follows from the fact that for . This concludes the proof. ∎

Finally, we prove Theorem 1.

###### Proof.

For any given , we choose . Since

we have

thus, Lemma 3 is applicable. Notice that

Furthermore, we have

where the first inequality follows from . On the other hand, we have

where the last inequality follows from the fact that , since . Notice that we have

where the first inequality follows from the fact that , and the second inequality follows from . Putting it together, we have

This concludes the proof for Theorem 1. ∎

## Appendix B Proof for Theorem 2

For simplicity of exposition, we consider arms with means . Let . Let be the mean of the history of arm at time and be the mean of the bootstrap sample of arm at time . Note that both are random variables. Each arm is initially explored times. Since and are estimated from random samples of size at least , we get from Hoeffding’s inequality (Theorem 2.8 in Boucheron et al. (2013)) that

for any and time . The first two inequalities hold for any and . The last two hold for any and , and therefore also in expectation over their random realizations. Let be the event that the above inequalities hold jointly at all times and be the complement of event . Then by the union bound,

By the design of the algorithm, the expected -step regret is bounded from above as

where the last inequality follows from the definition of event and the observation that the maximum -step regret is . Let

where is a tunable parameter that determines the number of exploration steps per arm. From the definition of and , and the fact that when , we have that

Finally, note that and we choose that optimizes the upper bound.
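The procedure analyzed in this appendix can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact algorithm: rewards are assumed Bernoulli, the forced exploration is done round-robin over the first rounds, no pseudo-counts are used, and the function name `npb_forced_exploration` and its parameters (`means`, `horizon`, `m`, `seed`) are hypothetical.

```python
import random


def npb_forced_exploration(means, horizon, m, seed=0):
    """Non-parametric bootstrapping with m forced pulls per arm (Bernoulli rewards).

    Returns the number of pulls of each arm and the cumulative pseudo-regret.
    """
    if m < 1:
        raise ValueError("each arm needs at least one forced pull")
    rng = random.Random(seed)
    num_arms = len(means)
    history = [[] for _ in range(num_arms)]  # observed rewards per arm
    pulls = [0] * num_arms
    best_mean = max(means)
    regret = 0.0
    for t in range(horizon):
        if t < m * num_arms:
            # Forced exploration: cycle through the arms until each has m pulls.
            arm = t % num_arms
        else:
            # Resample each arm's history with replacement and act greedily
            # on the means of the bootstrap samples.
            boot_means = []
            for rewards in history:
                sample = rng.choices(rewards, k=len(rewards))
                boot_means.append(sum(sample) / len(sample))
            arm = max(range(num_arms), key=boot_means.__getitem__)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        history[arm].append(reward)
        pulls[arm] += 1
        regret += best_mean - means[arm]
    return pulls, regret


if __name__ == "__main__":
    print(npb_forced_exploration([0.3, 0.7], horizon=2000, m=50))
```

A larger `m` makes it less likely that a good arm is abandoned after a few unlucky draws, but every forced pull of a sub-optimal arm adds its gap to the regret, which is the trade-off the choice of the exploration parameter balances.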

## Appendix C Weighted bootstrapping and equivalence to TS

In this section, we prove that for common reward distributions, WB becomes equivalent to TS for specific choices of the weight distribution and the transformation function.

### C.1 Using multiplicative exponential weights

In this subsection, we consider multiplicative exponential weights, implying that and . We show that in this setting WB is mathematically equivalent to TS for Bernoulli and more generally categorical rewards.

#### C.1.1 Proof for Proposition 1

###### Proof.

Recall that the bootstrap sample is given as:

To characterize the distribution of , let us define and as the sum of weights for the positive and negative examples respectively. Formally,

The sample can then be rewritten as:

Observe that (and ) is the sum of (and respectively) exponentially distributed random variables. Hence, and . This implies that .

When using the prior for TS, the corresponding posterior distribution on observing positive examples and negative examples is . Hence computing according to WB is equivalent to sampling from the Beta posterior. Hence, WB with multiplicative exponential weights is mathematically equivalent to TS. ∎
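The equivalence can also be checked by simulation. The sketch below draws weighted-bootstrap samples with independent Exp(1) multiplicative weights and compares their first two moments with draws from a Beta distribution. It assumes that the relevant posterior is Beta(a, b), where a and b are the numbers of positive and negative examples (including any pseudo-examples); this relies on the standard fact that if $G_1 \sim \mathrm{Gamma}(a, 1)$ and $G_0 \sim \mathrm{Gamma}(b, 1)$ are independent, then $G_1 / (G_1 + G_0) \sim \mathrm{Beta}(a, b)$. The function names and counts are illustrative.

```python
import random
import statistics


def wb_sample(rewards, rng):
    """One weighted-bootstrap estimate: the Exp(1)-weighted mean of 0/1 rewards."""
    weights = [rng.expovariate(1.0) for _ in rewards]
    return sum(w * y for w, y in zip(weights, rewards)) / sum(weights)


def compare_wb_to_beta(num_pos, num_neg, num_draws=20000, seed=0):
    """Return (mean, variance) of WB samples and of Beta(num_pos, num_neg) draws."""
    rng = random.Random(seed)
    rewards = [1] * num_pos + [0] * num_neg
    wb = [wb_sample(rewards, rng) for _ in range(num_draws)]
    beta = [rng.betavariate(num_pos, num_neg) for _ in range(num_draws)]
    return (
        (statistics.mean(wb), statistics.variance(wb)),
        (statistics.mean(beta), statistics.variance(beta)),
    )


if __name__ == "__main__":
    wb_stats, beta_stats = compare_wb_to_beta(num_pos=3, num_neg=5)
    print("WB:  ", wb_stats)
    print("Beta:", beta_stats)
```

With 3 positive and 5 negative examples, both sets of samples should have mean close to $3/8$ and variance close to $\tfrac{3 \cdot 5}{8^2 \cdot 9}$, up to Monte Carlo error.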

#### C.1.2 Categorical reward distribution

###### Proposition 3.

Let the rewards where is the number of categories and is the probability of an example belonging to category . In this case, weighted bootstrapping with and the transformation results in where