An Information-Theoretic Analysis of
Thompson Sampling for Large Action Spaces
Abstract
Information-theoretic Bayesian regret bounds of Russo and Van Roy [8] capture the dependence of regret on prior uncertainty. However, this dependence is through entropy, which can become arbitrarily large as the number of actions increases. We establish new bounds that depend instead on a notion of rate-distortion. Among other things, this allows us to recover through information-theoretic arguments a near-optimal bound for the linear bandit. We also offer a bound for the logistic bandit that dramatically improves on the best previously available, though this bound depends on an information-theoretic statistic that we have only been able to quantify via computation.
Shi Dong Stanford University sdong15@stanford.edu Benjamin Van Roy Stanford University bvr@stanford.edu
Preprint. Work in progress.
1 Introduction
Thompson sampling [11] has proved to be an effective heuristic across a broad range of online decision problems [2, 10]. Russo and Van Roy [8] provided an information-theoretic analysis that yields insight into the algorithm’s broad applicability and establishes a bound of $\sqrt{\overline{\Gamma}\, H(A^*)\, T}$ on cumulative expected regret over $T$ time periods of any algorithm and online decision problem. The information ratio $\overline{\Gamma}$ is a statistic that captures the manner in which an algorithm trades off between immediate reward and information acquisition; Russo and Van Roy [8] bound the information ratio of Thompson sampling for particular classes of problems. The entropy $H(A^*)$ of the optimal action $A^*$ quantifies the agent’s initial uncertainty.
If the prior distribution of $A^*$ is uniform, the entropy $H(A^*)$ is the logarithm of the number of actions. As such, $H(A^*)$ grows arbitrarily large with the number of actions. On the other hand, even for problems with infinite action sets, like the linear bandit with a polytopic action set, Thompson sampling is known to obey gracious regret bounds [6]. This suggests that the dependence on entropy leaves room for improvement.
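The growth of the entropy term under a uniform prior can be checked directly. The following is a minimal sketch; the helper name `entropy` and the action counts are illustrative choices, not part of the original analysis.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    # Under a uniform prior the entropy equals log(number of actions),
    # so it grows without bound as actions are added.
    for num_actions in (10, 1_000, 100_000):
        uniform = [1.0 / num_actions] * num_actions
        print(num_actions, entropy(uniform), math.log(num_actions))
```

The two printed columns agree up to rounding, which illustrates why an entropy-based bound degrades as the action set grows even when the actions are highly correlated.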
In this paper, we establish bounds that depend on a notion of rate-distortion instead of entropy. Our new line of analysis is inspired by rate-distortion theory, which is a branch of information theory that quantifies the amount of information required to learn an approximation [3]. This concept was also leveraged in recent work of Russo and Van Roy [9], which develops an alternative to Thompson sampling that aims to learn satisficing actions. An important difference is that the results of this paper apply to Thompson sampling itself.
We apply our analysis to linear and generalized linear bandits and establish Bayesian regret bounds that remain sharp with large action spaces. For the $d$-dimensional linear bandit setting, our bound is $O(d\sqrt{T \log T})$, which is tighter than the $O(d\sqrt{T}\log T)$ bound of [7]. Our bound also improves on the previous information-theoretic bound of [8] since it does not depend on the number of actions. Our Bayesian regret bound is within a factor of $O(\sqrt{\log T})$ of the $\Omega(d\sqrt{T})$ worst-case regret lower bound of [4].
For the logistic bandit, previous bounds for Thompson sampling [7] and upper-confidence-bound algorithms [5] scale linearly with $\sup_x \phi'(x) / \inf_x \phi'(x)$, where $\phi$ is the logistic function $\phi(x) = e^{\beta x}/(1 + e^{\beta x})$. These bounds explode as $\beta \to \infty$ since this ratio diverges. This is counterintuitive because, as $\beta$ grows, each action's reward becomes a deterministic binary value, which should simplify learning. Our analysis addresses this gap in understanding by establishing a bound that decays as $\beta$ becomes large, converging to a finite limit for any fixed $T$. However, this analysis relies on a conjecture about the information ratio of Thompson sampling for the logistic bandit, which we only support through computational results.
2 Problem Formulation
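We consider an online decision problem in which over each time period $t = 1, 2, \ldots$, an agent selects an action $A_t$ from a finite action set $\mathcal{A}$ and observes an outcome $Y_t \in \mathcal{Y}$, where $\mathcal{Y}$ denotes the set of possible outcomes. A fixed and known system function $g$ associates outcomes with actions according to
$$Y_t = g(A_t, \theta^*, W_t),$$
where $A_t$ is the action, $W_t$ is an exogenous noise term, and $\theta^*$ is the “true” model unknown to the agent. Here we adopt the Bayesian setting, in which $\theta^*$ is a random variable taking values in a space of parameters $\Theta$. The randomness of $\theta^*$ stems from the prior uncertainty of the agent. To keep the notation succinct and avoid measure-theoretic issues, we assume that $\Theta$ is a finite set, although our analysis can be extended to the cases where both $\mathcal{A}$ and $\Theta$ are infinite.
The reward function $R : \mathcal{Y} \to \mathbb{R}$ assigns a real-valued reward to each outcome. We assume that the reward is bounded, i.e.
$$\sup_{y \in \mathcal{Y}} R(y) - \inf_{y \in \mathcal{Y}} R(y) \le 1.$$
Further, as a shorthand we define
$$\mu(a, \theta) = E\left[R\big(g(a, \theta, W_t)\big)\right], \quad a \in \mathcal{A},\ \theta \in \Theta.$$
Simply stated, $\mu(a, \theta)$ is the expected reward of action $a$ when the true model is $\theta$. In addition, for each parameter $\theta$, let $\alpha(\theta)$ be the optimal action under model $\theta$, i.e.
$$\alpha(\theta) = \arg\max_{a \in \mathcal{A}} \mu(a, \theta).$$
Note that the ties induced by $\arg\max$ can be circumvented by expanding $\Theta$ with identical elements. Let $A^* = \alpha(\theta^*)$ be the “true” optimal action and let $R^* = \mu(A^*, \theta^*)$ be the corresponding maximum reward.
Before making her decision at the beginning of period $t$, the agent has access to the history up to time $t-1$, which we denote by
$$H_{t-1} = (A_1, Y_1, \ldots, A_{t-1}, Y_{t-1}).$$
A policy $\pi = (\pi_1, \pi_2, \ldots)$ is defined as a sequence of functions mapping histories and exogenous noise to actions, which can be written as
$$A_t = \pi_t(H_{t-1}, \xi_t),$$
where $\xi_t$ is a random variable which characterizes the algorithmic randomness. The performance of policy $\pi$ is evaluated by the finite-horizon Bayesian regret, defined by
$$\mathrm{BayesRegret}(T; \pi) = E\left[\sum_{t=1}^{T} \big(R^* - R(Y_t)\big)\right],$$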
where the actions are chosen by policy , and the expectation is taken over the randomness in both and .
3 Thompson Sampling and Information Ratio
The Thompson sampling policy is defined such that at each period, the agent samples the next action according to her posterior belief of the optimal action, i.e.
An equivalent definition, which we use throughout our analysis, is that over period the agent samples a parameter from the posterior of the true parameter , and plays the action . The history available to the agent is thus
The information ratio, first proposed in [8], quantifies the tradeoff between exploration and exploitation. Here we adopt the simplified definition in [9], which integrates over all randomness. Let be two valued random variables. Over period , the information ratio of with respect to is defined by
(1) 
We can interpret as a benchmark model parameter that the agent wants to learn and as the model parameter that she selects. When is small, the agent would only incur large regret over period if she was expected to learn a lot of information about . We restate a result proven in [6], which proposes a bound for the regret of any policy in terms of the worstcase information ratio.
Proposition 1.
For all and policy , let be such that for each , then
where is the entropy of and
The bound given by Proposition 1 is loose in the sense that it depends implicitly on the cardinality of . When is large, knowing exactly what is requires a lot of information. Nevertheless, because of the correlation between actions, it suffices for the agent to learn a “blurry” version of , which conveys far less information, to achieve low regret. In the following section we concretize this argument.
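The parameter-sampling form of Thompson sampling described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the parameter set is finite and indexed by integers, rewards are Bernoulli, and `mean[a][j]` (a hypothetical name) holds the success probability of action `a` under parameter `j`. None of these modelling choices come from the paper.

```python
import random


def thompson_sampling(mean, prior, true_index, horizon, rng):
    """Run Thompson sampling on a finite parameter set with Bernoulli rewards.

    mean[a][j] is the success probability of action a under parameter j.
    Returns the cumulative expected regret and the final posterior.
    """
    num_actions, num_params = len(mean), len(prior)
    posterior = list(prior)
    best_mean = max(mean[a][true_index] for a in range(num_actions))
    regret = 0.0
    for _ in range(horizon):
        # Sample a parameter from the posterior and play its optimal action.
        sampled = rng.choices(range(num_params), weights=posterior)[0]
        action = max(range(num_actions), key=lambda a: mean[a][sampled])
        # The true parameter generates a Bernoulli reward.
        reward = 1 if rng.random() < mean[action][true_index] else 0
        regret += best_mean - mean[action][true_index]
        # Bayes rule: reweight each parameter by the likelihood of the reward.
        weights = [
            p * (mean[action][k] if reward else 1.0 - mean[action][k])
            for k, p in enumerate(posterior)
        ]
        total = sum(weights)
        posterior = [w / total for w in weights]
    return regret, posterior


if __name__ == "__main__":
    mean = [[0.9, 0.1], [0.1, 0.9]]  # two actions, two candidate parameters
    regret, posterior = thompson_sampling(
        mean, [0.5, 0.5], true_index=0, horizon=200, rng=random.Random(0)
    )
    print(regret, posterior)
```

The sketch requires the prior to put positive mass on the true parameter, otherwise the posterior normalization can fail.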
4 A Rate-Distortion Analysis of Thompson Sampling
In this section we develop a sharper bound for Thompson sampling. At a high level, the argument relies on three observations:

A summary statistic of that is less informative than exists;

In each period, if the agent aims to learn the summary statistic instead of the regret incurred can be bounded in terms of the information gained about the summary statistic; we refer to this approximate learning as “compressed Thompson sampling”;

The summary statistic can be chosen such that the regret of Thompson sampling is close to that of the compressed Thompson sampling, and at the same time, compressed Thompson sampling yields no more information about the summary statistic than Thompson sampling.
Following the above line of analysis, we can bound the regret of Thompson sampling by the mutual information between the summary statistic and . Since the summary statistic is chosen to be far less informative than , we will arrive at a significantly tighter bound.
To develop the argument, we first quantify the amount of distortion that we incur if we replace one parameter with another. For two parameters , the distortion of with respect to is defined as
(2) 
In other words, the distortion is the price we pay if we deem to be the true parameter while the actual true parameter is . Notice that from the definition of , we always have . Let be a partition of , i.e. and such that
(3) 
where is a positive distortion tolerance. Let be the random variable taking values in that records the index of the partition in which lies, i.e.
Then we have . If the structure of allows for a small number of partitions, would have much less information than . Let subscript denote corresponding values under the posterior measure . In other words, and are random variables that are functions of . We claim the following.
Proposition 2.
According to Proposition 2, over period if the agent deviates from her original Thompson sampling scheme and applies a “one-step” compressed Thompson sampling to learn by sampling , she would not incur much more regret (as is guaranteed by condition 2). Meanwhile, from conditions 1 and 3 and the data-processing inequality, we have that
(4) 
which implies that the information gain of the compressed Thompson sampling will not exceed that of the original Thompson sampling towards . Therefore, the regret of the original Thompson sampling can be bounded in terms of the total information gain towards and the worst-case information ratio of the one-step compressed Thompson sampling. Formally, we have the following.
Theorem 1.
Let be any partition of such that for any and , . Let and satisfy the conditions in Proposition 2. We have
(5) 
where
Proof. We have that
(6)  
where the successive steps of (6) follow, respectively, from condition 2 of Proposition 2; from (4); from the Cauchy-Schwarz inequality; from the chain rule for mutual information; and from the fact that
where we use the fact that is independent of , conditioned on . Hence we arrive at our desired result.∎
Remark. The bound given in Theorem 1 dramatically improves on the previous bound in Proposition 1 since in general is much smaller than . The new bound also characterizes the trade-off between the preserved information and the distortion tolerance , which is the essence of rate-distortion theory. In fact, we can define the distortion between and as
where and depend on through Proposition 2. By taking the infimum over all possible choices of , the bound (5) can be written as
(7) 
where
s.t. 
is the rate-distortion function with respect to the distortion .
To obtain explicit bounds for specific problem instances, we use the fact that . In the following section we introduce a broad range of problems in which both and can be effectively bounded.
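The partition used in this section can be built greedily for a finite problem. The sketch below is one possible construction, not the one used in the proofs. It assumes the distortion of parameter `j` with respect to parameter `k` is the reward lost by playing the action that is optimal for `j` when the truth is `k`, which is how the text describes it; `mean[a][j]`, `distortion_matrix`, and `greedy_partition` are hypothetical names.

```python
import math


def distortion_matrix(mean):
    """d[j][k]: reward lost by acting optimally for parameter j when the truth is k."""
    num_actions, num_params = len(mean), len(mean[0])
    opt = [max(range(num_actions), key=lambda a: mean[a][j]) for j in range(num_params)]
    return [
        [mean[opt[k]][k] - mean[opt[j]][k] for k in range(num_params)]
        for j in range(num_params)
    ]


def greedy_partition(d, eps):
    """Group parameters so that every pair in a cell has distortion at most eps."""
    cells = []
    for j in range(len(d)):
        for cell in cells:
            # Join the first cell whose members are all within eps in both directions.
            if all(d[j][k] <= eps and d[k][j] <= eps for k in cell):
                cell.append(j)
                break
        else:
            cells.append([j])
    return cells


if __name__ == "__main__":
    # Rows are actions, columns are parameters (illustrative numbers).
    mean = [[0.9, 0.85, 0.1], [0.1, 0.8, 0.9]]
    cells = greedy_partition(distortion_matrix(mean), eps=0.1)
    # The entropy of the cell index is at most log(number of cells).
    print(cells, math.log(len(cells)))
```

A greedy pass does not minimize the number of cells in general; it only guarantees the pairwise tolerance, which is the property the analysis needs.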
5 Main Results
We now apply the analysis in Section 4 to common bandit settings and show that our bounds are significantly sharper than the previous bounds. In these models, the observation of the agent is the received reward. Hence we can let be the identity function and use as a shorthand for .
5.1 Linear Bandits
Linear bandits are a class of problems in which each action is parametrized by a finite-dimensional feature vector, and the mean reward of playing each action is the inner product between the feature vector and the model parameter vector. Formally, let , where , and . The reward of playing action satisfies
Note that we apply a normalizing factor to make the setting consistent with our assumption that .
A similar line of analysis as in [8] allows us to bound the information ratio of the one-step compressed Thompson sampling.
Proposition 3.
Under the linear bandit setting, for each , letting and satisfy the conditions in Proposition 2, we have
At the same time, with the help of a covering argument, we can also bound the number of partitions that is required to achieve distortion tolerance .
Proposition 4.
Under the linear bandit setting, suppose that , where is the $d$-dimensional closed Euclidean unit ball. Then for any there exists a partition of such that for all and , we have and
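To give a feel for how a covering argument controls the number of cells, the sketch below evaluates the textbook volumetric bound: the Euclidean unit ball in dimension `dim` admits an `eps`-cover with at most `(1 + 2/eps)**dim` points. This standard bound is an assumption made for illustration; the constant in Proposition 4 may differ.

```python
import math


def log_covering_bound(dim, eps):
    """Log of the volumetric bound (1 + 2/eps)**dim on the eps-covering number."""
    return dim * math.log(1.0 + 2.0 / eps)


if __name__ == "__main__":
    # The log-count grows linearly in the dimension and logarithmically in 1/eps,
    # and does not depend on how many actions the problem has.
    for dim in (2, 10, 50):
        for eps in (0.1, 0.01):
            print(dim, eps, log_covering_bound(dim, eps))
```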
Theorem 2.
Under the linear bandit setting, if , then
This bound is the first information-theoretic bound that holds for arbitrarily large action sets and any distribution of the reward. It significantly improves the bound in [8] and the bound in [1] in that it drops the dependence on the cardinality of the action set and imposes no assumption on the reward distribution. Compared with the confidence-level-based analysis in [7], which results in the bound $O(d\sqrt{T}\log T)$, our argument is much simpler and cleaner and yields a tighter bound. This bound also demonstrates the near-optimality of Thompson sampling in that it exceeds the action-space-independent lower bound proposed in [4] by only a $\sqrt{\log T}$ factor.
5.2 Generalized Linear Bandits with iid Noise
In generalized linear models, there is a fixed and strictly increasing link function , such that
Let
We make the following assumptions.
Assumption 1.
The reward noise is iid, i.e.
where is a zero-mean noise term with a fixed and known distribution for all .
Assumption 2.
The link function is continuously differentiable in , with
Under these assumptions, both the information ratio of the compressed Thompson sampling and the number of partitions can be bounded.
Proposition 5.
Proposition 6.
Under the generalized linear bandit setting and Assumption 2, suppose that . Then for any there exists a partition of such that for each and we have and
5.3 Logistic Bandits
Logistic bandits are special cases of generalized linear bandits, in which the agent only observes binary rewards, i.e. . The link function is given by , where is a fixed and known parameter. Conditioned on , the reward of playing action is Bernoulli distributed with parameter .
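The effect of the scale parameter on the logistic link can be checked numerically. The sketch below assumes the link is $\phi_\beta(x) = 1/(1 + e^{-\beta x})$ and that its argument ranges over $[-1, 1]$; both are assumptions made for illustration, and `slope_ratio` is a hypothetical helper. It reports the ratio between the largest and smallest slope of the link on that interval.

```python
import math


def slope(x, beta):
    """Derivative of the logistic link 1 / (1 + exp(-beta * x)), in a stable form."""
    e = math.exp(-beta * abs(x))
    return beta * e / (1.0 + e) ** 2


def slope_ratio(beta, grid_size=2001):
    """Ratio of the largest to the smallest slope over a grid on [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (grid_size - 1) for i in range(grid_size)]
    slopes = [slope(x, beta) for x in xs]
    return max(slopes) / min(slopes)


if __name__ == "__main__":
    for beta in (1.0, 5.0, 10.0, 20.0):
        # Success probability at x = 1 approaches 1 while the slope ratio blows up.
        prob_at_one = 1.0 / (1.0 + math.exp(-beta))
        print(beta, slope_ratio(beta), prob_at_one)
```

Under these assumptions the ratio grows roughly like $e^{\beta}/4$, while the rewards at the extremes become nearly deterministic.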
The preexisting upper bounds on logistic bandit problems all scale linearly with
which explodes when . However, when is large, the rewards of actions are clearly bifurcated by a hyperplane and we expect Thompson sampling to perform better. The regret bound given by our analysis addresses this point and has a finite limit as increases. Since the logistic bandit setting is incompatible with Assumption 1, we propose the following conjecture, which is supported with numerical evidence.
Conjecture 1.
Under the logistic bandit setting, let the link function be , and for each , let and satisfy the conditions in Proposition 2. Then for all ,
To provide evidence for Conjecture 1, for each and , we randomly generate 100 actions and parameters and compute the exact information ratio under a randomly selected distribution over the parameters. The result is given in Figure 1. As the figure shows, the simulated information ratio is always smaller than the conjectured upper bound . We suspect that for every link function , there exists an upper bound for the information ratio that depends only on and and is independent of the cardinality of the parameter space. This opens an interesting topic for future research.
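A computation in the spirit of this experiment can be sketched as follows. The code is not the authors' script, and it evaluates a one-period information ratio for ordinary (uncompressed) Thompson sampling, taken here to be the squared expected regret divided by the mutual information between the parameter and the (action, reward) pair. That definition, the unit-norm sampling of actions and parameters, and all function names are assumptions made for illustration.

```python
import math
import random


def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def information_ratio(mean, prior):
    """Squared expected regret over information gain for one Thompson sampling step.

    mean[a][j] is the Bernoulli reward mean of action a under parameter j.
    """
    num_actions, num_params = len(mean), len(prior)
    opt = [max(range(num_actions), key=lambda a: mean[a][j]) for j in range(num_params)]
    # Probability that Thompson sampling plays each action.
    play = [0.0] * num_actions
    for j in range(num_params):
        play[opt[j]] += prior[j]
    best = sum(prior[j] * mean[opt[j]][j] for j in range(num_params))
    earned = sum(
        play[a] * prior[j] * mean[a][j]
        for a in range(num_actions)
        for j in range(num_params)
    )
    regret = best - earned
    # The action is drawn independently of the parameter, so the information
    # gain is the play-weighted average of I(parameter; reward of action a).
    info = 0.0
    for a in range(num_actions):
        if play[a] == 0.0:
            continue
        marginal = sum(prior[j] * mean[a][j] for j in range(num_params))
        conditional = sum(prior[j] * binary_entropy(mean[a][j]) for j in range(num_params))
        info += play[a] * (binary_entropy(marginal) - conditional)
    # Zero information gain implies zero regret here; report 0 by convention.
    return regret ** 2 / info if info > 0.0 else 0.0


def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def logistic_instance(dim, beta, num_actions, num_params, rng):
    """Random logistic bandit: reward means and a random prior over parameters."""
    actions = [random_unit_vector(dim, rng) for _ in range(num_actions)]
    params = [random_unit_vector(dim, rng) for _ in range(num_params)]
    mean = [
        [1.0 / (1.0 + math.exp(-beta * sum(x * y for x, y in zip(a, th)))) for th in params]
        for a in actions
    ]
    weights = [rng.random() for _ in range(num_params)]
    total = sum(weights)
    return mean, [w / total for w in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    for beta in (0.5, 2.0, 8.0):
        mean, prior = logistic_instance(3, beta, 100, 100, rng)
        print(beta, information_ratio(mean, prior))
```

Because the conjecture concerns the compressed variant, numbers from this sketch are only a rough companion to Figure 1, not a reproduction of it.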
We further make the following assumption, which suggests that each parameter in the parameter set is not “too bad,” in the sense that the optimal expected reward conditioned on each parameter being the true model parameter is bounded below from .
Assumption 3.
We have that . Equivalently, we have that
The following theorem proposes the bound for the logistic bandit.
Theorem 4.
6 Conclusion
Through an analysis based on a notion of rate-distortion, we established a new information-theoretic regret bound for Thompson sampling that scales gracefully to large action spaces. Our analysis yields an $O(d\sqrt{T \log T})$ regret bound for the linear bandit problem, which strengthens state-of-the-art bounds. The same regret bound applies also to the logistic bandit problem if a conjecture about the information ratio that agrees with computational results holds. We expect that our new line of analysis applies to a wide range of online decision algorithms.
References
 [1] Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for Thompson sampling. Journal of the ACM (JACM), 64(5):30, 2017.
 [2] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
 [3] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 [4] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008.
 [5] Lihong Li, Yu Lu, and Dengyong Zhou. Provable optimal algorithms for generalized linear contextual bandits. arXiv preprint arXiv:1703.00048, 2017.
 [6] Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pages 1583–1591, 2014.
 [7] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
 [8] Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.
 [9] Daniel Russo and Benjamin Van Roy. Satisficing in time-sensitive bandit learning. arXiv preprint arXiv:1803.02855, 2018.
 [10] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, and Ian Osband. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038, 2017.
 [11] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
Appendix A Proof of Proposition 2
We first show the following lemma.
Lemma 1.
Let and be two sequences of real numbers, where . Let be such that for all and . Then there exist indices (possibly ) and such that
and
Proof. We prove the lemma by induction over . The result is trivial when . Assume that the result holds when . In the following we show the case where . Let and .
Suppose there exists index such that and , then by choosing , there is
Suppose there exists index such that and . Without loss of generality we can assume . If , the result becomes trivial by choosing . Hence we only consider . Let for , then . Applying our assumption to , and , we can find and such that
and
Notice that
and similarly . Therefore by choosing and , we arrive at the result.
Consequently, we only have to consider the case where for each , either or . Without loss of generality, let be the index such that and . Suppose the result is false, then for any and , the following set of inequalities
has no solution for . Since and , this can only happen when
Rearranging, the above inequality is equivalent to
(A1) 
Let , and . Multiplying both sides of (A1) by and , and summing over and , we have that
which is a contradiction. Therefore the result holds for . ∎
To show Proposition 2, for each we construct that satisfies 1, 2 and 3. Notice that, for each , there is
(A2)  
and
(A3)  
where we used the fact that is independent of and .
According to Lemma 1, at stage , for each , there exist two parameters and , such that
(A4) 
and
(A5) 
Let be a random variable such that
(A6) 
and let be an iid copy of . Since the value of only depends on , 1 is satisfied. Also we have that
(A7)  
where two of the steps in (A7) follow from the fact that both and are independent of , conditioned on , and the remaining step follows from (A5). Therefore condition 3 is satisfied.
To show condition 2, note that by construction we have, at each stage ,
Hence there is
(A8)  
Therefore we arrive at
(A9)  