Generalized Thompson Sampling for
Contextual Bandits
Abstract
Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to demonstrate state-of-the-art performance. This empirical success has led to great interest in a theoretical understanding of the heuristic. In this paper, we approach the problem in a way very different from existing efforts. In particular, motivated by the connection between Thompson Sampling and exponentiated updates, we propose a new family of algorithms called Generalized Thompson Sampling in the expert-learning framework, which includes Thompson Sampling as a special case. Similar to most expert-learning algorithms, Generalized Thompson Sampling uses a loss function to adjust the experts’ weights. General regret bounds are derived and then instantiated for two important loss functions: square loss and logarithmic loss. In contrast to existing bounds, our results apply to quite general contextual bandits. More importantly, they quantify the effect of the “prior” distribution on the regret bounds.
Lihong Li Microsoft Research Redmond, WA 98052 lihongli@microsoft.com
1 Introduction
Thompson Sampling [18], one of the oldest heuristics for solving stochastic multi-armed bandits, embodies the principle of probability matching. Given a prior distribution over the underlying, unknown reward-generating process as well as past observations of rewards, one can maintain a posterior distribution of which arm is optimal. Thompson Sampling then selects arms randomly according to the current posterior distribution.
Although unpopular for decades, this algorithm was recently shown to be state-of-the-art in empirical studies, and has found success in important applications such as news recommendation and online advertising [16, 10, 7, 14]. In addition, it has other advantages, such as robustness to observation delay [7] and simplicity of implementation, compared to the dominant strategies based on upper confidence bounds (UCB).
Despite the empirical success, theoretical understanding of the finite-time performance of Thompson Sampling was limited until very recently. The first such result was provided by [2] for non-contextual $K$-armed bandits; the authors prove a nontrivial problem-dependent regret bound when the prior of an arm’s expected reward is a Beta distribution. Improved bounds were later found for the same setting [11, 3], which match the asymptotic regret lower bound [12].
For contextual bandits [13], only two pieces of work are available, to the best of our knowledge. [4] analyze linear bandits, where a Gaussian prior is used on the weight vector space, and a Gaussian likelihood function is assumed for the reward function. The authors are able to show the regret grows on the order of , which is only a factor away from a known matching lower bound [8]. In contrast, [15] establish an interesting connection between UCB-style analysis and the Bayes risk of Thompson Sampling, based on the probability-matching property. This observation allows the authors to obtain a Bayes risk bound based on a novel metric, known as margin dimension, of an arbitrary function class that essentially measures how fast upper confidence bounds decay.
All the existing work above relies critically either on advanced properties of the assumed prior distribution (such as in the case of Beta distributions), or on the assumption that the prior is correct (in the analysis of Bayes risk of [15]). Such analysis, although very interesting and important for better understanding Thompson Sampling, seems hard to generalize to general (possibly nonlinear) contextual bandits. Furthermore, none of the existing theory is able to quantify the role the prior plays in controlling the regret, although in practice domain knowledge is often available to construct good priors that should “accelerate” learning.
This paper attempts to address the limitations of prior work from a very different angle. Based on a connection between Thompson Sampling and exponentiated update rules, we propose a family of contextual-bandit algorithms called Generalized Thompson Sampling in the expert-learning framework [6], where each expert corresponds to a contextual policy for arm selection. Similar to Thompson Sampling, Generalized Thompson Sampling is a randomized strategy, following an expert’s policy more often if the expert is more likely to be optimal. Different from Thompson Sampling, it uses a loss function to update the experts’ weights; Thompson Sampling is a special case of Generalized Thompson Sampling when the logarithmic loss is used. (Footnote 1: It should be emphasized that, in this paper, we use the loss function to measure how well an expert predicts the average reward, given the context and the selected arm. In general, the loss function and the reward may be completely unrelated. Details are given later.)
Regret bounds are then derived under certain conditions. The proof relies critically on a novel application of a “self-boundedness” property of loss functions in competitive analysis. The results are instantiated for the square and logarithmic losses, two important loss functions. Not only do these bounds apply to quite general sets of experts, but they also quantify the impact of the prior distribution on regret. These benefits come at the cost of a worse dependence on the number of steps. However, we believe it is possible to close the gap with a more involved analysis, and the connection between (Generalized) Thompson Sampling and expert learning will likely lead to further interesting insights and algorithms in future work.
2 Preliminaries
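Contextual bandits can be formulated as the following game between the learner and a stochastic environment. Let $\mathcal{X}$ and $\mathcal{A}$ be the sets of contexts and arms, and let $K = |\mathcal{A}|$. At step $t = 1, 2, \ldots, T$:

Learner observes the context $x_t \in \mathcal{X}$, where $x_t$ can be chosen by an adversary.

Learner selects arm $a_t \in \mathcal{A}$, and receives reward $r_t \in \{0, 1\}$, with expectation $\mu(x_t, a_t)$.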
Note that the setup above allows the contexts to be chosen by an adversary, which is a more general setting than typical contextual bandits [13]. The reader may notice that we require the reward to be binary, instead of being in $[0, 1]$. This choice makes our exposition simpler, without loss of generality. Indeed, as also suggested by [2], if a reward $r \in [0, 1]$ is received, one can convert it into a binary pseudo-reward $\tilde{r}$ as follows: let $\tilde{r}$ be $1$ with probability $r$, and $0$ otherwise. Clearly, the bandit process remains essentially the same, with the same optimal expert and regrets.
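The conversion from a real-valued reward to a binary pseudo-reward can be sketched as follows. This is a minimal illustration; the function name `binarize_reward` and the seeded generator are assumptions made for the example, not part of the paper.

```python
import random


def binarize_reward(r, rng):
    """Convert a reward r in [0, 1] into a binary pseudo-reward.

    Returns 1 with probability r and 0 otherwise, so the expected
    pseudo-reward equals r.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("reward must lie in [0, 1]")
    return 1 if rng.random() < r else 0


# Hypothetical usage: the empirical mean of the pseudo-rewards approaches r.
rng = random.Random(0)
samples = [binarize_reward(0.3, rng) for _ in range(100_000)]
empirical_mean = sum(samples) / len(samples)
```

Because the pseudo-reward has the same conditional mean as the original reward, every expert's prediction target is unchanged by the conversion.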
Motivated by prior work on Thompson Sampling with parametric function classes [7], we allow the learner to have access to a set of experts, $\mathcal{E} = \{E_1, \ldots, E_N\}$, each of which makes predictions about the average reward $\mu(x, a)$. Let $f_i$ be the associated prediction function of expert $E_i$. Its arm-selection policy in context $x$ is simply the greedy policy with respect to the reward predictions: $E_i(x) = \arg\max_{a \in \mathcal{A}} f_i(x, a)$. This setting can naturally be used to capture the use of parametric function classes: for example, when generalized linear models are used to predict $\mu$ [10, 7], each weight vector is an expert. The only difference is that our framework works with a discrete set of experts. Using a covering device, however, it is possible to approximate a continuous function class by a finite set of cardinality $N$, where $N$ is the covering number.
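As an illustration of how a prediction function induces an arm-selection policy, the sketch below builds a greedy policy from a toy logistic expert. The helper names `greedy_policy` and `make_logistic_expert`, and the particular weights, are hypothetical and chosen only for the example.

```python
import math


def greedy_policy(predict, arms):
    """Return the policy that picks an arm with the highest predicted reward.

    `predict(x, a)` is an expert's estimate of the average reward of arm `a`
    in context `x`. Ties are broken by the order of `arms`.
    """
    def policy(x):
        return max(arms, key=lambda a: predict(x, a))
    return policy


def make_logistic_expert(weights):
    """Hypothetical expert: predicts sigmoid(weights[a] . x) for arm a."""
    def predict(x, a):
        z = sum(w * xi for w, xi in zip(weights[a], x))
        return 1.0 / (1.0 + math.exp(-z))
    return predict


arms = [0, 1]
expert = make_logistic_expert({0: [1.0, -1.0], 1: [-1.0, 1.0]})
policy = greedy_policy(expert, arms)
chosen = [policy([1.0, 0.0]), policy([0.0, 1.0])]
```

Each weight vector of such a model plays the role of one expert; a finite set of experts would correspond to a finite grid or cover of weight vectors.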
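We define the $T$-step average regret of the learner by
$$R(T) = \max_{E \in \mathcal{E}} \mathbb{E}\left[ \sum_{t=1}^{T} \mu(x_t, E(x_t)) - \sum_{t=1}^{T} \mu(x_t, a_t) \right], \tag{1}$$
where the expectation refers to the possible randomization of the learner in selecting $a_t$. As in all existing analyses of Thompson Sampling, we make the realizability assumption that one of the experts, $E^*$, correctly predicts the average reward. Without loss of generality, let $E_1$ be this expert; in other words, $f_1 = \mu$. Clearly, $E_1$ is the reward-maximizing expert, so $R(T) = \mathbb{E}\left[ \sum_{t=1}^{T} \left( \mu(x_t, E_1(x_t)) - \mu(x_t, a_t) \right) \right]$.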
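With the notation above, Thompson Sampling can be described as follows. It requires as input a “prior” distribution $p = (p_1, \ldots, p_N)$ over the experts, where $p_i \ge 0$ and $\sum_{i=1}^{N} p_i = 1$. Intuitively, $p_i$ may be interpreted as the prior probability that $E_i$ is the reward-maximizing expert. The algorithm starts with the first “posterior” distribution $w_1 = (w_{1,1}, \ldots, w_{N,1})$, where $w_{i,1} = p_i$. At step $t$, the algorithm samples an expert based on the posterior distribution $w_t$ and follows that expert’s policy to choose an action. Upon receiving the reward, the weights are updated by $w_{i,t+1} \propto w_{i,t} \exp(-\ell(f_i(x_t, a_t), r_t))$, where $\ell(\hat{r}, r) = -r \ln \hat{r} - (1 - r) \ln(1 - \hat{r})$ is the negative log-likelihood.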
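The weight update described above can be checked on a small example: multiplying each weight by the exponential of the negative log-likelihood and normalizing gives exactly the Bayes posterior over experts. The sketch below is illustrative only; the function names and the three-expert numbers are assumptions.

```python
import math


def log_loss(pred, r):
    """Negative log-likelihood of binary reward r under Bernoulli(pred)."""
    return -math.log(pred) if r == 1 else -math.log(1.0 - pred)


def bayes_update(weights, preds, r):
    """Posterior over experts after observing r, via Bayes' rule."""
    likelihoods = [p if r == 1 else 1.0 - p for p in preds]
    unnorm = [w * lik for w, lik in zip(weights, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def exponentiated_update(weights, preds, r):
    """The same update written as w_i * exp(-log_loss), then normalized."""
    unnorm = [w * math.exp(-log_loss(p, r)) for w, p in zip(weights, preds)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


prior = [0.5, 0.3, 0.2]
preds = [0.9, 0.5, 0.1]  # each expert's predicted reward for the chosen arm
posterior_bayes = bayes_update(prior, preds, r=1)
posterior_exp = exponentiated_update(prior, preds, r=1)
```

The two posteriors agree up to floating-point error, since $\exp(-\ell(\hat{r}, r))$ is the Bernoulli likelihood of $r$ under prediction $\hat{r}$.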
Finally, one can assume the optimal expert, $E^*$, is drawn from an unknown prior distribution, $q$. The expected $T$-step Bayes regret can then be defined as the expectation of the regret over the draw of $E^*$ from $q$. It should be noted that the Bayes risk considered by other authors [15] is just the special case $q = p$, where $p$ is the prior used by Thompson Sampling. In general, the true prior is unknown, so $p \neq q$. We believe the Bayes risk defined with respect to $q$ is more reasonable in light of the almost inevitable misspecification of priors in practice.
3 Generalized Thompson Sampling
An observation about Thompson Sampling from the previous section is that its Bayes update rule can be viewed as an exponentiated update with the logarithmic loss (see also [6]). After receiving a reward, each expert is penalized for the mismatch between its prediction ($f_i(x_t, a_t)$) and the observed reward, and the penalty happens to be the logarithmic loss in Thompson Sampling. Therefore, in principle, one can use other loss functions to get a more general family of algorithms. In fact, none of the existing regret analyses [2, 3, 4, 11] relies on the interpretation that the weights $w_t$ are meant to be Bayesian posteriors, and yet they manage to show strong regret bounds for Thompson Sampling. (Footnote 2: The analysis of [15] is different since the metric (Bayes risk) is defined with respect to the prior.) These observations suggest that the promising performance of Thompson Sampling is not due to its Bayesian nature, and also motivate us to develop a more general family of algorithms known as Generalized Thompson Sampling.
We denote by $\ell(\hat{r}, r)$ the loss incurred by reward prediction $\hat{r}$ when the observed reward is $r$. Generalized Thompson Sampling performs exponentiated updates to adjust the experts’ weights, and follows a randomly selected expert when making decisions, similar to Thompson Sampling. In addition, the algorithm allows mixing of the exponentially weighted distribution and a uniform distribution, controlled by $\gamma$. The pseudocode is given in Algorithm 1.
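Clearly, Generalized Thompson Sampling includes Thompson Sampling as a special case, by setting the learning rate $\eta = 1$, the mixing parameter $\gamma = 0$, and $\ell$ to be the logarithmic loss: $\ell(\hat{r}, r) = -r \ln \hat{r} - (1 - r) \ln(1 - \hat{r})$. Another loss function considered in this paper is the square loss: $\ell(\hat{r}, r) = (\hat{r} - r)^2$.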
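The following is a minimal sketch of the kind of procedure described in this section, written under stated assumptions rather than as a transcription of Algorithm 1. The class name, the parameter names `eta` and `gamma`, and the choice to mix in a uniform distribution over arms (rather than over experts) are assumptions made for the example.

```python
import math
import random


def square_loss(pred, r):
    return (pred - r) ** 2


def log_loss(pred, r):
    return -math.log(pred) if r == 1 else -math.log(1.0 - pred)


class GeneralizedThompsonSampling:
    """Exponentiated-weight expert sampling with uniform mixing (a sketch).

    Assumptions, not taken from the source: `gamma` mixes in a uniform
    distribution over arms, and `eta` scales the loss in the exponent.
    """

    def __init__(self, experts, arms, prior, loss, eta=1.0, gamma=0.0, seed=0):
        self.experts = experts      # list of predict(x, a) -> value in [0, 1]
        self.arms = list(arms)
        self.weights = list(prior)  # initial weights equal the prior
        self.loss = loss
        self.eta = eta
        self.gamma = gamma
        self.rng = random.Random(seed)

    def arm_probabilities(self, x):
        """Weighted votes of the experts' greedy arms, mixed with uniform."""
        total = sum(self.weights)
        probs = {a: self.gamma / len(self.arms) for a in self.arms}
        for w, f in zip(self.weights, self.experts):
            greedy_arm = max(self.arms, key=lambda a: f(x, a))
            probs[greedy_arm] += (1.0 - self.gamma) * w / total
        return probs

    def select_arm(self, x):
        probs = self.arm_probabilities(x)
        return self.rng.choices(self.arms, weights=[probs[a] for a in self.arms])[0]

    def update(self, x, a, r):
        """Multiply each weight by exp(-eta * loss(prediction, reward))."""
        self.weights = [
            w * math.exp(-self.eta * self.loss(f(x, a), r))
            for w, f in zip(self.weights, self.experts)
        ]


# Hypothetical usage with two context-independent experts.
experts = [lambda x, a: 0.8 if a == 0 else 0.2,
           lambda x, a: 0.3 if a == 0 else 0.6]
learner = GeneralizedThompsonSampling(experts, arms=[0, 1], prior=[0.5, 0.5],
                                      loss=log_loss, eta=1.0, gamma=0.1)
probs = learner.arm_probabilities(x=None)
arm = learner.select_arm(x=None)
learner.update(x=None, a=0, r=1)
```

With `loss=log_loss`, `eta=1.0`, and `gamma=0.0`, the normalized weights coincide with the Bayes posterior over experts. Over long runs the unnormalized weights shrink toward zero, so a practical implementation would renormalize them or work in log space.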
4 Analysis
For convenience, the analysis here uses the following shorthand notation:

The history of the learner up to step is .

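The immediate regret of expert $E_i$ in context $x$ is $\Delta_i(x) = \mu(x, E_1(x)) - \mu(x, E_i(x))$.

The normalized weight of expert $E_i$ at step $t$ is $\bar{w}_{i,t} = w_{i,t} / \sum_{j=1}^{N} w_{j,t}$.

The shifted loss incurred by expert $E_i$ in triple $(x, a, r)$ is denoted by $\hat{\ell}_i(x, a, r) = \ell(f_i(x, a), r) - \ell(f_1(x, a), r)$. In particular, define $\hat{\ell}_{i,t} = \hat{\ell}_i(x_t, a_t, r_t)$. In other words, $\hat{\ell}_{i,t}$ is the loss relative to the best expert ($E_1$), and can be negative.

The average shifted loss at step $t$ is $\bar{\ell}_t = \sum_{i=1}^{N} \bar{w}_{i,t} \hat{\ell}_{i,t}$.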
4.1 Main Theorem
Clearly, conditions are needed to relate the loss function to the regret. Our results need the following assumptions:

C1 (Consistency) For all , .

C2 (Informativeness) There exists a constant such that .

C3 (Boundedness) The shifted loss assumes values in .

C4 (Self-boundedness) There exists a constant such that, for all , ; namely, the second moment is bounded, up to a constant, by the first moment of the shifted loss.
Proof. The expected step regret may be rewritten more explicitly, and then bounded, as follows:
Now the question becomes one of bounding the expected total shifted loss, . This problem is tackled by the following key lemma, which makes use of the self-boundedness property of the loss function. The lemma may be of interest on its own. Similar properties were used in [1] in a very different way.
Lemma 1
Proof. First, observe that if the shifted loss is used in Generalized Thompson Sampling to replace the loss , the algorithm behaves identically. The rest of the proof uses this fact, pretending Generalized Thompson Sampling uses for weight updates.
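The invariance used at the start of this proof can be illustrated numerically. Subtracting the best expert's loss from every expert's loss multiplies all unnormalized weights by the same factor, which cancels on normalization, so the sampling distribution is unchanged. The sketch below uses made-up losses and a hypothetical helper `normalized_weights`.

```python
import math


def normalized_weights(prior, losses, eta):
    """Normalized exponentiated weights after one update."""
    unnorm = [p * math.exp(-eta * l) for p, l in zip(prior, losses)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


prior = [0.5, 0.3, 0.2]
losses = [0.10, 0.40, 0.90]                # raw losses of three experts on one triple
shifted = [l - losses[0] for l in losses]  # losses relative to the first expert

w_raw = normalized_weights(prior, losses, eta=0.5)
w_shifted = normalized_weights(prior, shifted, eta=0.5)
```

The lists `w_raw` and `w_shifted` agree up to floating-point error, which is why the analysis may work with shifted losses even though the algorithm never needs to know which expert is best.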
For any step , the weight sum changes according to
where the first inequality is due to Condition C3 and the inequality for ; the second inequality is due to the inequality for .
Conditioned on the observed context and selected arm at step , we take expectation of the above expressions, with respect to the randomization in observed reward, leading to
Condition C4 then implies
Setting gives
The above inequality holds for any , so also holds in expectation if is randomized:
Finally, summing the left-hand side over gives
which implies
The last inequality above follows from the observation that , and that .
Corollary 1
The next corollary considers the Bayes regret, , with an unknown, true prior :
Corollary 2
If the optimal expert is sampled from distribution , the Bayes regret is at most
where and are the standard entropy and KL-divergence.
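One way to see how entropy and KL-divergence terms can arise together is the standard identity that the expectation of $\ln(1/p_i)$ under $i \sim q$ equals $H(q) + \mathrm{KL}(q \,\|\, p)$. Whether the bound in this corollary takes exactly this form is an assumption here; the identity itself is elementary, and the sketch below checks it on arbitrary example distributions.

```python
import math


def entropy(q):
    """Shannon entropy of q in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)


def kl_divergence(q, p):
    """KL(q || p) in nats, assuming p_i > 0 wherever q_i > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def expected_log_inverse_prior(q, p):
    """Expectation of ln(1 / p_i) when i is drawn from q."""
    return sum(qi * math.log(1.0 / pi) for qi, pi in zip(q, p) if qi > 0)


q = [0.7, 0.2, 0.1]        # example "true" prior over three experts
p = [1 / 3, 1 / 3, 1 / 3]  # example prior supplied to the algorithm
lhs = expected_log_inverse_prior(q, p)
rhs = entropy(q) + kl_divergence(q, p)
```

Under this reading, a prior $p$ close to $q$ keeps the KL term small, while a concentrated $q$ keeps the entropy term small.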
4.2 Square Loss
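We start with the simpler case of square loss. It clearly satisfies Condition C3. Condition C1 holds because of the following well-known fact: for a binary reward $r$ with mean $\mu$ and any prediction $\hat{r}$,
$$\mathbb{E}\left[ (\hat{r} - r)^2 \right] - \mathbb{E}\left[ (\mu - r)^2 \right] = (\hat{r} - \mu)^2 \ge 0.$$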
Conditions C2 and C4 are also satisfied with and , from prior work [1]. Plugging these values in Corollary 1 and choosing , we obtain the regret bound of , and the Bayes regret bound of .
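For a Bernoulli reward, the moments of the shifted square loss can be computed exactly, which gives a quick sanity check of consistency and self-boundedness. The sketch below is illustrative; the function name is hypothetical, and the constant 4 used in the check comes from an elementary calculation (the shifted loss factors as $(\hat{r} - \mu)(\hat{r} + \mu - 2r)$, and the second factor lies in $[-2, 2]$), not necessarily the constant used in the paper.

```python
def shifted_square_loss_moments(pred, mu):
    """First and second moments of (pred - r)^2 - (mu - r)^2, r ~ Bernoulli(mu)."""
    outcomes = [(1, mu), (0, 1.0 - mu)]  # (reward, probability)
    values = [((pred - r) ** 2 - (mu - r) ** 2, prob) for r, prob in outcomes]
    first = sum(v * prob for v, prob in values)
    second = sum(v * v * prob for v, prob in values)
    return first, second


grid = [i / 20 for i in range(21)]
checks = []
for mu in grid:
    for pred in grid:
        first, second = shifted_square_loss_moments(pred, mu)
        checks.append((pred, mu, first, second))
```

Each entry of `checks` can be compared against `(pred - mu) ** 2` for the first moment and against `4 * first` for the second moment.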
4.3 Logarithmic Loss
For logarithmic loss, we assume the shifted loss of every expert is bounded in for some constant , so that one can normalize the shifted logarithmic loss to the range of by defining:
(2) 
This assumption can usually be satisfied in practice, and seems necessary to derive finitetime guarantees. Note that this assumption is slightly weaker than the more common assumption that the logarithmic loss itself is bounded (e.g., [9]).
We now verify all necessary conditions. Condition C1 follows from the well-known fact that the expectation of logarithmic loss between the true expert and another is their KL-divergence,
(3) 
which is in turn nonnegative.
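The fact used for Condition C1 can be verified directly for Bernoulli rewards. The sketch below works with the unnormalized logarithmic loss, ignoring the scaling introduced in Equation (2), and uses hypothetical function names and example values.

```python
import math


def log_loss(pred, r):
    """Negative log-likelihood of binary reward r under Bernoulli(pred)."""
    return -math.log(pred) if r == 1 else -math.log(1.0 - pred)


def bernoulli_kl(p, q):
    """KL divergence from Bernoulli(p) to Bernoulli(q), with 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def expected_shifted_log_loss(true_mean, pred):
    """Expectation over r ~ Bernoulli(true_mean) of the loss gap to the true expert."""
    return sum(
        prob * (log_loss(pred, r) - log_loss(true_mean, r))
        for r, prob in [(1, true_mean), (0, 1.0 - true_mean)]
    )


true_mean, pred = 0.7, 0.4
gap = expected_shifted_log_loss(true_mean, pred)
kl = bernoulli_kl(true_mean, pred)
```

Here `gap` and `kl` agree up to floating-point error, and the gap is zero when the prediction equals the true mean.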
Condition C2 is verified in the following lemma:
Lemma 2
For the loss function defined in Equation (2), one has
Proof. We have the following:
where the first inequality is due to the triangle inequality; the second inequality is due to Pinsker’s inequality; the fourth inequality is due to Jensen’s inequality; the fifth inequality is from the fact that each arm is selected with probability at least ; the last equality is from Equation (3).
Condition C3 is immediately satisfied by the normalization of in the definition of above.
Condition C4 is the most difficult one to verify. To the best of our knowledge, such a result for logarithmic loss is not found in the literature and can be of independent interest. For example, it implies that the analysis of [1] for square loss also applies to the logarithmic loss. The following lemma states the result more formally. Its proof, which is rather technical, is left to the appendix.
Lemma 3
With all four conditions verified, we can apply results in Section 4.1 to reach the regret bound of and the Bayes regret bound of
5 Discussions
In this paper, we propose a new family of algorithms, Generalized Thompson Sampling, and analyze its regret in the expert-learning framework. Our regret analysis provides a promising alternative route to understanding the strong performance of Thompson Sampling, an interesting and pressing research problem raised by its recent empirical success. Compared to existing analyses in the literature, it has the following benefits. First, the results apply more generally to a set of experts, rather than making specific modeling assumptions about the prior and likelihood. Second, the analysis quantifies how the (not necessarily correct) prior $p$ affects the regret bound, as well as the Bayes regret when optimal experts are drawn from an unknown prior $q$. Similar to PAC-Bayes bounds, these results combine the benefits of good priors and the robustness of frequentist approaches.
Our proof for Generalized Thompson Sampling is inspired by the online-learning literature [6]. However, a new technique is needed to prove the critical Lemma 1, which relies on self-boundedness of a loss function. A similar property is shown by [1] for square loss only, and is used in a very different way. The self-boundedness of logarithmic loss (Lemma 3) appears new, to the best of our knowledge, and may be of independent interest.
Generalized Thompson Sampling bears some similarities to the Regressor Elimination (RE) algorithm [1]. A crucial difference is that RE requires a computationally expensive operation of computing a “balanced” distribution over experts, in order to control variance in the elimination process. In contrast, our algorithm is computationally much cheaper. The operations of Generalized Thompson Sampling are also related to EXP4 [5], which uses unbiased, importance-weighted reward estimates to do exponentiated updates of expert weights. In practice, it seems more natural to use the prediction loss of an expert to adjust its weight, rather than using the reward signals directly [10, 7].
While we have focused on the case of finitely many experts, the setting is motivated by the more realistic case when the set of experts is continuous [10, 7, 4]. The discrete case considered here may be thought of as an approximation to the continuous case, using a covering device. We expect similar results to hold with $N$ replaced by the covering number of the class.
This work suggests a few interesting directions for future work. The first is to close the gap between the current bound and the best problem-independent bound for contextual bandits. The second is to extend the analysis here to continuous expert classes, and more importantly to the agnostic (non-realizable) case. Finally, it is interesting to use the regret analysis of (Generalized) Thompson Sampling to obtain performance guarantees for its reinforcement-learning analogues (e.g., [17]).
Appendix A Proof of Lemma 3: Self-boundedness of Logarithmic Loss
This section proves Lemma 3, regarding self-boundedness of logarithmic loss, in the sense described in Condition C4. The analysis here does not involve step $t$ and the corresponding context and selected arm. We therefore simplify notation as follows: the true expert predicts , and the other expert predicts . The binary reward is then a Bernoulli random variable with success rate . The shifted logarithmic loss of is given by
The first two moments of the random variable are given by:
Define
the ratio between variance and expectation of , as a function of . Our goal is to show that is bounded by a constant, independent of and . It will then follow that is also bounded by a constant since .
Taking the derivative of , one obtains
for some function . It can be verified, by rather tedious calculations, that there exists some such that for and for . So is maximized by making close to either or . It then follows, again by rather tedious calculations, that , using the assumption that the log-ratios (that is, shifted loss) are bounded by .
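The self-boundedness claim can be explored numerically by scanning pairs of Bernoulli parameters and recording the largest ratio between the second and first moments of the shifted logarithmic loss. This is only an empirical sketch, not a substitute for the proof: the names `p` and `q` for the two predictions, the grid, and the bound of 1.0 on the absolute log-ratios are all assumptions made for the example, and the normalization of Equation (2) is ignored.

```python
import math


def shifted_log_loss_moments(p, q):
    """Moments of X = ln(P_true(r) / P_other(r)) for r ~ Bernoulli(p).

    Returns the first moment, the second moment, and the largest |X|.
    """
    x1 = math.log(p / q)              # value of X when r = 1
    x0 = math.log((1 - p) / (1 - q))  # value of X when r = 0
    first = p * x1 + (1 - p) * x0     # KL divergence between the two Bernoullis
    second = p * x1 ** 2 + (1 - p) * x0 ** 2
    return first, second, max(abs(x1), abs(x0))


bound = 1.0  # assumed bound on the absolute log-ratios
grid = [i / 100 for i in range(1, 100)]
worst_ratio = 0.0
for p in grid:
    for q in grid:
        if p == q:
            continue  # both moments vanish; the ratio is undefined
        first, second, largest = shifted_log_loss_moments(p, q)
        if largest <= bound:
            worst_ratio = max(worst_ratio, second / first)
```

Repeating the scan with different values of `bound` shows how the worst-case ratio depends on the assumed bound on the log-ratios.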
References
 [1] Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert E. Schapire. Contextual bandit learning under the realizability assumption. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), 2012.
 [2] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the Twenty-Fifth Annual Conference on Learning Theory (COLT-12), pages 39.1–39.26, 2012.
 [3] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS-13), 2013.
 [4] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the Thirtieth International Conference on Machine Learning (ICML-13), 2013.
 [5] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
 [6] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
 [7] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24 (NIPS-11), pages 2249–2257, 2012.
 [8] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), pages 208–214, 2011.
 [9] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23 (NIPS-10), pages 586–594, 2011.
 [10] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML-10), pages 13–20, 2010.
 [11] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of the Twenty-Third International Conference on Algorithmic Learning Theory (ALT-12), pages 199–213, 2012.
 [12] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
 [13] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems 20, pages 1096–1103, 2008.
 [14] Benedict C. May, Nathan Korda, Anthony Lee, and David S. Leslie. Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13:2069–2106, 2012.
 [15] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling, 2013. arXiv:1301.2609.
 [16] Steven L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639–658, 2010.
 [17] Malcolm J. A. Strens. A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-00), pages 943–950, 2000.
 [18] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933.