# An Information-Theoretic Analysis of Thompson Sampling for Large Action Spaces

Shi Dong
Stanford University
sdong15@stanford.edu
Benjamin Van Roy
Stanford University
bvr@stanford.edu
###### Abstract

Information-theoretic Bayesian regret bounds of Russo and Van Roy [8] capture the dependence of regret on prior uncertainty. However, this dependence is through entropy, which can become arbitrarily large as the number of actions increases. We establish new bounds that depend instead on a notion of rate-distortion. Among other things, this allows us to recover through information-theoretic arguments a near-optimal bound for the linear bandit. We also offer a bound for the logistic bandit that dramatically improves on the best previously available, though this bound depends on an information-theoretic statistic that we have only been able to quantify via computation.

Preprint. Work in progress.

## 1 Introduction

Thompson sampling [11] has proved to be an effective heuristic across a broad range of online decision problems [2, 10]. Russo and Van Roy [8] provided an information-theoretic analysis that yields insight into the algorithm’s broad applicability and establishes a bound of $\sqrt{\overline{\Gamma}\cdot H(A^*)\cdot T}$ on the cumulative expected regret over $T$ time periods of any algorithm and online decision problem. The information ratio $\overline{\Gamma}$ is a statistic that captures the manner in which an algorithm trades off between immediate reward and information acquisition; Russo and Van Roy [8] bound the information ratio of Thompson sampling for particular classes of problems. The entropy $H(A^*)$ of the optimal action $A^*$ quantifies the agent’s initial uncertainty.

If the prior distribution of $A^*$ is uniform, the entropy $H(A^*)$ is the logarithm of the number of actions. As such, $H(A^*)$ grows arbitrarily large with the number of actions. On the other hand, even for problems with infinite action sets, like the linear bandit with a polytopic action set, Thompson sampling is known to obey favorable regret bounds [6]. This suggests that the dependence on entropy leaves room for improvement.
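As a quick numeric illustration of this growth, the following minimal sketch evaluates the entropy of a uniform prior for a few action counts. The helper name `entropy_nats` and the chosen counts are illustrative assumptions, not part of the original analysis.

```python
import math

def entropy_nats(probs):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# For a uniform prior over n actions the entropy equals log(n),
# so it grows without bound as the action set grows.
for n in (10, 1_000, 100_000):
    uniform = [1.0 / n] * n
    print(n, round(entropy_nats(uniform), 4), round(math.log(n), 4))
```

The two printed columns agree for every `n`, which is the sense in which an entropy-based bound degrades for large action sets.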

In this paper, we establish bounds that depend on a notion of rate-distortion instead of entropy. Our new line of analysis is inspired by rate-distortion theory, which is a branch of information theory that quantifies the amount of information required to learn an approximation [3]. This concept was also leveraged in recent work of Russo and Van Roy [9], which develops an alternative to Thompson sampling that aims to learn satisficing actions. An important difference is that the results of this paper apply to Thompson sampling itself.

We apply our analysis to linear and generalized linear bandits and establish Bayesian regret bounds that remain sharp with large action spaces. For the $d$-dimensional linear bandit setting, our bound is $O(d\sqrt{T\log T})$, which is tighter than the $O(d\sqrt{T}\log T)$ bound of [7]. Our bound also improves on the previous information-theoretic bound of [8] since it does not depend on the number of actions. Our Bayesian regret bound is within a factor of $O(\sqrt{\log T})$ of the $\Omega(d\sqrt{T})$ worst-case regret lower bound of [4].

For the logistic bandit, previous bounds for Thompson sampling [7] and upper-confidence-bound algorithms [5] scale linearly with $\sup_x\phi'(x)/\inf_x\phi'(x)$, where $\phi$ is the logistic function $\phi(x)=e^{\beta x}/(1+e^{\beta x})$. These bounds explode as $\beta\to\infty$ since $\lim_{\beta\to\infty}\sup_x\phi'(x)=\infty$. This does not make sense because, as $\beta$ grows, the reward of each action approaches a deterministic binary value, which should simplify learning. Our analysis addresses this gap in understanding by establishing a bound that decays as $\beta$ becomes large, converging to $2d\sqrt{T\log 3}$ for any fixed $T$. However, this analysis relies on a conjecture about the information ratio of Thompson sampling for the logistic bandit, which we only support through computational results.

## 2 Problem Formulation

We consider an online decision problem in which over each time period $t=1,2,\dots$, an agent selects an action $A_t$ from a finite action set $\mathcal{A}$ and observes an outcome $Y_{A_t}\in\mathcal{Y}$, where $\mathcal{Y}$ denotes the set of possible outcomes. A fixed and known system function $g$ associates outcomes with actions according to

$$
Y_a=g(a,\theta^*,W),
$$

where $a\in\mathcal{A}$ is the action, $W$ is an exogenous noise term, and $\theta^*$ is the “true” model unknown to the agent. Here we adopt the Bayesian setting, in which $\theta^*$ is a random variable taking values in a space of parameters $\Theta$. The randomness of $\theta^*$ stems from the prior uncertainty of the agent. To keep the notation succinct and avoid measure-theoretic issues, we assume that $\Theta$ is a finite set, although our analysis can be extended to the cases where both $\mathcal{A}$ and $\Theta$ are infinite.

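The reward function $R:\mathcal{Y}\to\mathbb{R}$ assigns a real-valued reward to each outcome. We assume that the reward is bounded, i.e.

$$
\sup_{y\in\mathcal{Y}}R(y)-\inf_{y\in\mathcal{Y}}R(y)\le 1.
$$

Further, as a shorthand we define

$$
\mu(a,\theta)=E\left[R(Y_a)\mid\theta^*=\theta\right],\quad\forall a\in\mathcal{A},\ \theta\in\Theta.
$$

Simply stated, $\mu(a,\theta)$ is the expected reward of action $a$ when the true model is $\theta$. In addition, for each parameter $\theta$, let $\alpha(\theta)$ be the optimal action under model $\theta$, i.e.

$$
\alpha(\theta)=\arg\max_{a\in\mathcal{A}}\mu(a,\theta).
$$

Note that the ties induced by $\arg\max$ can be circumvented by expanding $\Theta$ with identical elements. Let $A^*=\alpha(\theta^*)$ be the “true” optimal action and let $R^*=R(Y_{A^*})$ be the corresponding maximum reward.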

Before making her decision at the beginning of period $t$, the agent has access to the history up to time $t-1$, which we denote by

$$
\mathcal{H}_{t-1}=(A_1,Y_{A_1},\dots,A_{t-1},Y_{A_{t-1}}).
$$

A policy $\pi=(\pi_1,\pi_2,\dots)$ is defined as a sequence of functions mapping histories and exogenous noise to actions, which can be written as

$$
A_t=\pi_t(\mathcal{H}_{t-1},\xi_t),\quad t=1,2,\dots,
$$

where $\xi_t$ is a random variable which characterizes the algorithmic randomness. The performance of policy $\pi$ is evaluated by the finite horizon Bayesian regret, defined by

$$
\mathrm{BayesRegret}(T;\pi)=E\left[\sum_{t=1}^{T}\left(R^*-R(Y_{A_t})\right)\right],
$$

where the actions are chosen by policy , and the expectation is taken over the randomness in both and .

## 3 Thompson Sampling and Information Ratio

The Thompson sampling policy is defined such that at each period, the agent samples the next action according to her posterior belief of the optimal action, i.e.

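$$
P\left(\pi^{\mathrm{TS}}_t(\mathcal{H}_{t-1},\xi_t)=a\mid\mathcal{H}_{t-1}\right)=P\left(A^*=a\mid\mathcal{H}_{t-1}\right),\quad\forall a\in\mathcal{A},\ t=1,2,\dots.
$$

An equivalent definition, which we use throughout our analysis, is that over period $t$ the agent samples a parameter $\theta_t$ from the posterior of the true parameter $\theta^*$, and plays the action $\alpha(\theta_t)$. The history available to the agent is thus

$$
\tilde{\mathcal{H}}_t=(\theta_1,Y_{\alpha(\theta_1)},\dots,\theta_t,Y_{\alpha(\theta_t)}).
$$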
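The sample-then-act form of Thompson sampling can be sketched for a small finite problem as follows. This is a minimal illustration under assumptions that are not part of the paper: outcomes are Bernoulli rewards, `mu[a][i]` stores the success probability of action `a` under the `i`-th parameter, and the function name, the 3-by-3 table, and the horizon are all hypothetical.

```python
import random

def thompson_sampling(mu, prior, theta_star, T, seed=0):
    """Run T periods of Thompson sampling on a finite Bernoulli problem.

    mu[a][i]   : assumed success probability of action a under parameter i
    prior[i]   : prior probability of parameter i
    theta_star : index of the true parameter
    Returns the list of per-period expected regrets.
    """
    rng = random.Random(seed)
    n_actions, n_params = len(mu), len(prior)
    # alpha[i] is the optimal action under parameter i.
    alpha = [max(range(n_actions), key=lambda a: mu[a][i]) for i in range(n_params)]
    posterior = list(prior)
    regrets = []
    for _ in range(T):
        # Sample a parameter from the posterior and play its optimal action.
        theta_t = rng.choices(range(n_params), weights=posterior)[0]
        a = alpha[theta_t]
        # Observe a Bernoulli outcome generated by the true parameter.
        y = rng.random() < mu[a][theta_star]
        # Bayes update: multiply by the likelihood of y under each parameter.
        likelihood = [mu[a][i] if y else 1.0 - mu[a][i] for i in range(n_params)]
        posterior = [p * l for p, l in zip(posterior, likelihood)]
        total = sum(posterior)
        posterior = [p / total for p in posterior]
        regrets.append(mu[alpha[theta_star]][theta_star] - mu[a][theta_star])
    return regrets

mu = [[0.9, 0.2, 0.3],
      [0.1, 0.8, 0.4],
      [0.2, 0.3, 0.7]]
regrets = thompson_sampling(mu, prior=[1 / 3, 1 / 3, 1 / 3], theta_star=1, T=200)
print(sum(regrets))
```

Averaging `sum(regrets)` over draws of `theta_star` from the prior (and over seeds) would give a Monte Carlo estimate of the Bayesian regret defined in Section 2.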

The information ratio, first proposed in [8], quantifies the trade-off between exploration and exploitation. Here we adopt the simplified definition in [9], which integrates over all randomness. Let $\theta,\theta'$ be two $\Theta$-valued random variables. Over period $t$, the information ratio of $\theta'$ with respect to $\theta$ is defined by

$$
\Gamma_t(\theta;\theta')=\frac{E\left[R(Y_{\alpha(\theta)})-R(Y_{\alpha(\theta')})\right]^2}{I\left(\theta;(\theta',Y_{\alpha(\theta')})\mid\tilde{\mathcal{H}}_{t-1}\right)}.\tag{1}
$$

We can interpret $\theta$ as a benchmark model parameter that the agent wants to learn and $\theta'$ as the model parameter that she selects. When $\Gamma_t(\theta;\theta')$ is small, the agent would only incur large regret over period $t$ if she was expected to learn a lot of information about $\theta$. We restate a result proven in [6], which proposes a bound for the regret of any policy in terms of the worst-case information ratio.

###### Proposition 1.

For all $T$ and policy $\pi$, let $\theta_t$ be such that $\alpha(\theta_t)=A_t$ for each $t=1,\dots,T$, then

$$
\mathrm{BayesRegret}(T;\pi)\le\sqrt{\overline{\Gamma}_T\cdot H(\theta^*)\cdot T},
$$

where $H(\theta^*)$ is the entropy of $\theta^*$ and

$$
\overline{\Gamma}_T=\max_{1\le t\le T}\Gamma_t(\theta^*;\theta_t).
$$

The bound given by Proposition 1 is loose in the sense that it depends implicitly on the cardinality of $\Theta$. When $\Theta$ is large, knowing exactly what $\theta^*$ is requires a lot of information. Nevertheless, because of the correlation between actions, it suffices for the agent to learn a “blurry” version of $\theta^*$, which conveys far less information, to achieve low regret. In the following section we make this argument concrete.
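To make definition (1) concrete, the sketch below evaluates the first-period information ratio of Thompson sampling for a small finite problem. It rests on assumptions that are not in the paper: outcomes are Bernoulli rewards with success probabilities `mu[a][i]`, the sampled parameter is an independent copy of $\theta^*$ drawn from the current belief, and mutual information is measured in nats. Under that independence, $I(\theta^*;(\theta',Y_{\alpha(\theta')}))$ reduces to the belief-weighted average of $I(\theta^*;Y_a)$ over the actions $a=\alpha(\theta')$.

```python
import math

def binary_entropy(q):
    """Entropy (nats) of a Bernoulli(q) random variable."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)

def information_ratio(mu, belief):
    """One-step information ratio of Thompson sampling under `belief`.

    mu[a][i]  : assumed Bernoulli success probability of action a under parameter i
    belief[i] : current probability that parameter i is the true one
    """
    n_actions, n_params = len(mu), len(belief)
    alpha = [max(range(n_actions), key=lambda a: mu[a][i]) for i in range(n_params)]
    # Numerator: squared expected regret, E[R*] - E[R(Y_{alpha(theta')})].
    best = sum(belief[i] * mu[alpha[i]][i] for i in range(n_params))
    played = sum(belief[i] * belief[j] * mu[alpha[j]][i]
                 for i in range(n_params) for j in range(n_params))
    regret = best - played
    # Denominator: sum_j P(theta' = j) * I(theta*; Y_{alpha(j)}).
    info = 0.0
    for j in range(n_params):
        a = alpha[j]
        marginal = sum(belief[i] * mu[a][i] for i in range(n_params))
        conditional = sum(belief[i] * binary_entropy(mu[a][i]) for i in range(n_params))
        info += belief[j] * (binary_entropy(marginal) - conditional)
    return regret ** 2 / info

mu = [[0.9, 0.2, 0.3],
      [0.1, 0.8, 0.4],
      [0.2, 0.3, 0.7]]
gamma = information_ratio(mu, [1 / 3, 1 / 3, 1 / 3])
print(gamma)
```

A small value of `gamma` means that any regret incurred in the period is paid for by a correspondingly large expected information gain about $\theta^*$.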

## 4 A Rate-Distortion Analysis of Thompson Sampling

In this section we develop a sharper bound for Thompson sampling. At a high level, the argument relies on three observations:

1. A summary statistic of $\theta^*$ that is less informative than $\theta^*$ exists;

2. In each period, if the agent aims to learn the summary statistic instead of $\theta^*$, the regret incurred can be bounded in terms of the information gained about the summary statistic; we refer to this approximate learning as “compressed Thompson sampling”;

3. The summary statistic can be chosen such that the regret of Thompson sampling is close to that of the compressed Thompson sampling, and at the same time, compressed Thompson sampling yields no more information about the summary statistic than Thompson sampling.

Following the above line of analysis, we can bound the regret of Thompson sampling by the mutual information between the summary statistic and $\theta^*$. Since the summary statistic is chosen to be far less informative than $\theta^*$, we will arrive at a significantly tighter bound.

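To develop the argument, we first quantify the amount of distortion that we incur if we replace one parameter with another. For two parameters $\theta,\theta'\in\Theta$, the distortion of $\theta$ with respect to $\theta'$ is defined as

$$
d(\theta,\theta')=\mu(\alpha(\theta'),\theta')-\mu(\alpha(\theta),\theta').\tag{2}
$$

In other words, the distortion is the price we pay if we deem $\theta$ to be the true parameter while the actual true parameter is $\theta'$. Notice that from the definition of $\alpha$, we always have $d(\theta,\theta')\ge 0$. Let $\{\Theta_k\}_{k=1}^K$ be a partition of $\Theta$, i.e. $\bigcup_{k=1}^K\Theta_k=\Theta$ and $\Theta_i\cap\Theta_j=\emptyset$ for all $i\ne j$, such that

$$
d(\theta,\theta')\le\epsilon,\quad\forall\theta,\theta'\in\Theta_k,\ k=1,\dots,K,\tag{3}
$$

where $\epsilon>0$ is a positive distortion tolerance. Let $\psi$ be the random variable taking values in $\{1,\dots,K\}$ that records the index of the partition in which $\theta^*$ lies, i.e.

$$
\psi=k\iff\theta^*\in\Theta_k.
$$

Then we have $H(\psi)\le\log K$. If the structure of $\Theta$ allows for a small number of partitions, $\psi$ would have much less information than $\theta^*$. Let subscript $t-1$ denote corresponding values under the posterior measure $P_{t-1}(\cdot)=P(\cdot\mid\tilde{\mathcal{H}}_{t-1})$. In other words, $E_{t-1}[\cdot]$ and $I_{t-1}(\cdot;\cdot)$ are random variables that are functions of $\tilde{\mathcal{H}}_{t-1}$. We claim the following.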
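The construction of a partition satisfying (3) can be sketched with a simple greedy procedure. Everything below is an illustrative assumption rather than the paper's method: the greedy rule, the function names, and the toy problem in which 64 actions and 64 parameters are unit vectors on a circle with mean reward $\frac{1}{2}a^\top\theta$.

```python
import math

def distortion(mu, alpha, i, j):
    """d(theta_i, theta_j) = mu(alpha(theta_j), theta_j) - mu(alpha(theta_i), theta_j)."""
    return mu[alpha[j]][j] - mu[alpha[i]][j]

def greedy_partition(mu, eps):
    """Group parameter indices so that all within-cell distortions are at most eps."""
    n_actions, n_params = len(mu), len(mu[0])
    alpha = [max(range(n_actions), key=lambda a: mu[a][i]) for i in range(n_params)]
    cells = []
    for i in range(n_params):
        for cell in cells:
            # Join the first cell whose members are all eps-close in both directions.
            if all(distortion(mu, alpha, i, j) <= eps and
                   distortion(mu, alpha, j, i) <= eps for j in cell):
                cell.append(i)
                break
        else:
            cells.append([i])
    return cells

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy problem (assumed): unit vectors on a circle, mu(a, theta) = 0.5 * <a, theta>.
n = 64
angles = [2.0 * math.pi * i / n for i in range(n)]
mu = [[0.5 * math.cos(angles[a] - angles[i]) for i in range(n)] for a in range(n)]

cells = greedy_partition(mu, eps=0.05)
K = len(cells)
h_theta = math.log(n)                               # entropy of a uniform prior on theta*
h_psi = entropy([len(cell) / n for cell in cells])  # entropy of the cell index psi
print(K, round(h_psi, 3), round(math.log(K), 3), round(h_theta, 3))
```

With a uniform prior, the entropy of the cell index never exceeds $\log K$ and is strictly below the entropy of $\theta^*$ whenever several parameters share a cell.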

###### Proposition 2.

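For each $t=1,2,\dots$, there exists a $\Theta$-valued random variable $\tilde\theta^*_t$ that satisfies the following:

1. $\tilde\theta^*_t$ is independent of $\theta^*$, conditioned on $\psi$.

2. $E_{t-1}\left[R^*-R(Y_{\alpha(\theta_t)})\right]-E_{t-1}\left[R(Y_{\alpha(\tilde\theta^*_t)})-R(Y_{\alpha(\tilde\theta_t)})\right]\le\epsilon$, a.s.

3. $I_{t-1}\left(\psi;(\tilde\theta_t,Y_{\alpha(\tilde\theta_t)})\right)\le I_{t-1}\left(\psi;(\theta_t,Y_{\alpha(\theta_t)})\right)$, a.s.

where in 2 and 3, $\tilde\theta_t$ is an iid copy of $\tilde\theta^*_t$.

According to Proposition 2, over period $t$ if the agent deviates from her original Thompson sampling scheme and applies a “one-step” compressed Thompson sampling to learn $\tilde\theta^*_t$ by sampling $\tilde\theta_t$, she would not incur much more regret (as is guaranteed by 2). Meanwhile, from 1, 3 and the data-processing inequality, we have that

$$
I_{t-1}\left(\tilde\theta^*_t;(\tilde\theta_t,Y_{\alpha(\tilde\theta_t)})\right)\le I_{t-1}\left(\psi;(\tilde\theta_t,Y_{\alpha(\tilde\theta_t)})\right)\le I_{t-1}\left(\psi;(\theta_t,Y_{\alpha(\theta_t)})\right),\quad\text{a.s.}\tag{4}
$$

which implies that the information gain of the compressed Thompson sampling will not exceed that of the original Thompson sampling towards $\psi$. Therefore, the regret of the original Thompson sampling can be bounded in terms of the total information gain towards $\psi$ and the worst-case information ratio of the one-step compressed Thompson sampling. Formally, we have the following.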

###### Theorem 1.

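Let $\{\Theta_k\}_{k=1}^K$ be any partition of $\Theta$ such that for any $k$ and $\theta,\theta'\in\Theta_k$, $d(\theta,\theta')\le\epsilon$. Let $\tilde\theta^*_t$ and $\tilde\theta_t$ satisfy the conditions in Proposition 2. We have

$$
\mathrm{BayesRegret}(T;\pi^{\mathrm{TS}})\le\sqrt{\overline{\Gamma}\cdot I(\theta^*;\psi)\cdot T}+\epsilon\cdot T,\tag{5}
$$

where

$$
\overline{\Gamma}=\max_{1\le t\le T}\Gamma_t(\tilde\theta^*_t;\tilde\theta_t).
$$

Proof. We have that

$$
\begin{aligned}
\mathrm{BayesRegret}(T;\pi^{\mathrm{TS}})
&=\sum_{t=1}^{T}E\left[R^*-R(Y_{A_t})\right]\\
&=\sum_{t=1}^{T}E\left\{E_{t-1}\left[R^*-R(Y_{A_t})\right]\right\}\\
&\overset{(a)}{\le}\sum_{t=1}^{T}E\left[R(Y_{\alpha(\tilde\theta^*_t)})-R(Y_{\alpha(\tilde\theta_t)})\right]+\epsilon\cdot T\\
&=\sum_{t=1}^{T}\sqrt{\Gamma_t(\tilde\theta^*_t;\tilde\theta_t)\cdot I\left(\tilde\theta^*_t;(\tilde\theta_t,Y_{\alpha(\tilde\theta_t)})\mid\tilde{\mathcal{H}}_{t-1}\right)}+\epsilon\cdot T\\
&\overset{(b)}{\le}\sum_{t=1}^{T}\sqrt{\overline{\Gamma}\cdot I\left(\psi;(\theta_t,Y_{\alpha(\theta_t)})\mid\tilde{\mathcal{H}}_{t-1}\right)}+\epsilon\cdot T\\
&\overset{(c)}{\le}\sqrt{\overline{\Gamma}\cdot T\cdot\sum_{t=1}^{T}I\left(\psi;(\theta_t,Y_{\alpha(\theta_t)})\mid\tilde{\mathcal{H}}_{t-1}\right)}+\epsilon\cdot T\\
&\overset{(d)}{=}\sqrt{\overline{\Gamma}\cdot T\cdot I(\psi;\tilde{\mathcal{H}}_T)}+\epsilon\cdot T\\
&\overset{(e)}{\le}\sqrt{\overline{\Gamma}\cdot I(\theta^*;\psi)\cdot T}+\epsilon\cdot T,
\end{aligned}\tag{6}
$$

where (a) follows from item 2 of Proposition 2; (b) follows from (4); (c) results from the Cauchy-Schwarz inequality; (d) is the chain rule for mutual information; and (e) comes from the fact that

$$
I(\psi;\tilde{\mathcal{H}}_T)\le I\left(\psi;(\theta^*,\tilde{\mathcal{H}}_T)\right)=I(\psi;\theta^*)+I(\psi;\tilde{\mathcal{H}}_T\mid\theta^*)=I(\psi;\theta^*),
$$

where we use the fact that $\psi$ is independent of $\tilde{\mathcal{H}}_T$, conditioned on $\theta^*$. Thus we arrive at our desired result.∎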

Remark. The bound given in Theorem 1 dramatically improves the previous bound in Proposition 1 since in general $I(\theta^*;\psi)$ is much smaller than $H(\theta^*)$. The new bound also characterizes the tradeoff between the preserved information $I(\theta^*;\psi)$ and the distortion tolerance $\epsilon$, which is the essence of rate-distortion theory. In fact, we can define the distortion between $\theta^*$ and $\psi$ as

where $\tilde\theta^*_t$ and $\tilde\theta_t$ depend on $\psi$ through Proposition 2. By taking the infimum over all possible choices of $\psi$, the bound (5) can be written as

$$
\mathrm{BayesRegret}(T;\pi^{\mathrm{TS}})\le\sqrt{\overline{\Gamma}\cdot\rho(\epsilon)\cdot T}+\epsilon\cdot T,\quad\forall\epsilon>0,\tag{7}
$$

where

$$
\rho(\epsilon)=\min_{\psi}\ I(\theta^*;\psi)\quad\text{s.t.}\quad D(\theta^*,\psi)\le\epsilon
$$

is the rate-distortion function with respect to the distortion $D$.

To obtain explicit bounds for specific problem instances, we use the fact that $I(\theta^*;\psi)\le H(\psi)\le\log K$. In the following section we introduce a broad range of problems in which both $\overline{\Gamma}$ and $K$ can be effectively bounded.

## 5 Main Results

We now apply the analysis in Section 4 to common bandit settings and show that our bounds are significantly sharper than the previous bounds. In these models, the observation of the agent is the received reward. Hence we can let $R$ be the identity function and use $R_a$ as a shorthand for $R(Y_a)$.

### 5.1 Linear Bandits

Linear bandits are a class of problems in which each action is parametrized by a finite-dimensional feature vector, and the mean reward of playing each action is the inner product between the feature vector and the model parameter vector. Formally, let , where , and . The reward of playing action satisfies

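$$
E\left[R_a\mid\theta^*=\theta\right]=\mu(a,\theta)=\frac{1}{2}a^\top\theta,\quad\forall a\in\mathcal{A},\ \theta\in\Theta.
$$

Note that we apply a normalizing factor of $\frac{1}{2}$ to make the setting consistent with our assumption that $\sup_{y}R(y)-\inf_{y}R(y)\le 1$.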

A similar line of analysis as in [8] allows us to bound the information ratio of the one-step compressed Thompson sampling.

###### Proposition 3.

Under the linear bandit setting, for each $t$, letting $\tilde\theta^*_t$ and $\tilde\theta_t$ satisfy the conditions in Proposition 2, we have

$$
\Gamma_t(\tilde\theta^*_t;\tilde\theta_t)\le\frac{d}{2}.
$$

At the same time, with the help of a covering argument, we can also bound the number of partitions that is required to achieve distortion tolerance $\epsilon$.

###### Proposition 4.

Under the linear bandit setting, suppose that $\mathcal{A},\Theta\subseteq\mathbb{B}_d$, where $\mathbb{B}_d$ is the $d$-dimensional closed Euclidean unit ball. Then for any $\epsilon>0$ there exists a partition $\{\Theta_k\}_{k=1}^K$ of $\Theta$ such that for all $k$ and $\theta,\theta'\in\Theta_k$, we have $d(\theta,\theta')\le\epsilon$ and

$$
K\le\left(\frac{1}{\epsilon}+1\right)^d.
$$

Combining Theorem 1, Propositions 3 and 4, we arrive at the following bound.

###### Theorem 2.

Under the linear bandit setting, if $\mathcal{A},\Theta\subseteq\mathbb{B}_d$, then

$$
\mathrm{BayesRegret}(T;\pi^{\mathrm{TS}})\le d\sqrt{T\log\left(3+\frac{3\sqrt{2T}}{d}\right)}.
$$

This bound is the first information-theoretic bound that holds for arbitrarily large action sets and any distribution of the reward. It significantly improves the bound in [8] and the bound in [1] in that it drops the dependence on the cardinality of the action set and imposes no assumption on the reward distribution. Compared with the confidence-level-based analysis in [7], which results in the bound $O(d\sqrt{T}\log T)$, our argument is much simpler and cleaner and yields a tighter bound. This bound also demonstrates the near-optimality of Thompson sampling in that it exceeds the action-space independent lower bound proposed in [4] by only a $\sqrt{\log T}$ factor.
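The way Theorem 1 and Propositions 3 and 4 combine can be checked numerically. The sketch below plugs $\overline{\Gamma}\le d/2$ and $I(\theta^*;\psi)\le\log K\le d\log(1/\epsilon+1)$ into (5) and compares the result with the closed form of Theorem 2. The tolerance $\epsilon=d/\sqrt{2T}$ is an assumed choice made here for illustration (the text does not state which value is used), and the function names are hypothetical.

```python
import math

def theorem2_bound(d, T):
    """Closed-form bound of Theorem 2: d * sqrt(T * log(3 + 3 * sqrt(2T) / d))."""
    return d * math.sqrt(T * math.log(3.0 + 3.0 * math.sqrt(2.0 * T) / d))

def rate_distortion_bound(d, T, eps):
    """Right-hand side of (5) with the linear-bandit estimates substituted."""
    gamma_bar = d / 2.0                    # Proposition 3
    log_K = d * math.log(1.0 / eps + 1.0)  # Proposition 4, using I(theta*; psi) <= log K
    return math.sqrt(gamma_bar * log_K * T) + eps * T

for d, T in ((2, 10**3), (10, 10**4), (50, 10**6)):
    eps = d / math.sqrt(2.0 * T)           # assumed choice of distortion tolerance
    print(d, T,
          round(rate_distortion_bound(d, T, eps), 1),
          round(theorem2_bound(d, T), 1))
```

For this choice of `eps` the substituted bound never exceeds the closed form, which follows from $(x+y)^2\le 2(x^2+y^2)$ and $e<3$.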

### 5.2 Generalized Linear Bandits with iid Noise

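In generalized linear models, there is a fixed and strictly increasing link function $\phi$, such that

$$
E\left[R_a\mid\theta^*=\theta\right]=\mu(a,\theta)=\phi(a^\top\theta).
$$

Let

$$
\underline{L}=\inf_{a\in\mathcal{A},\theta\in\Theta}a^\top\theta,\qquad\overline{L}=\sup_{a\in\mathcal{A},\theta\in\Theta}a^\top\theta.
$$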

We make the following assumptions.

###### Assumption 1.

The reward noise is iid, i.e.

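$$
R_a=\mu(a,\theta^*)+W_a=\phi(a^\top\theta^*)+W_a,\quad\forall a\in\mathcal{A},
$$

where $W_a$ is a zero-mean noise term with a fixed and known distribution for all $a\in\mathcal{A}$.

###### Assumption 2.

The link function $\phi$ is continuously differentiable in $[\underline{L},\overline{L}]$, with

$$
C(\phi)=\sup_{x\in[\underline{L},\overline{L}]}\phi'(x)<\infty.
$$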

Under these assumptions, both the information ratio of the compressed Thompson sampling and the number of partitions can be bounded.

###### Proposition 5.

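Under the generalized linear bandit setting and Assumptions 1 and 2, for each $t$, letting $\tilde\theta^*_t$ and $\tilde\theta_t$ satisfy the conditions in Proposition 2, we have

$$
\Gamma_t(\tilde\theta^*_t;\tilde\theta_t)\le 2C(\phi)^2d.
$$

###### Proposition 6.

Under the generalized linear bandit setting and Assumption 2, suppose that $\mathcal{A},\Theta\subseteq\mathbb{B}_d$. Then for any $\epsilon>0$ there exists a partition $\{\Theta_k\}_{k=1}^K$ of $\Theta$ such that for each $k$ and $\theta,\theta'\in\Theta_k$ we have $d(\theta,\theta')\le\epsilon$ and

$$
K\le\left(\frac{2C(\phi)}{\epsilon}+1\right)^d.
$$

Combining Theorem 1, Propositions 5 and 6, we have the following.

###### Theorem 3.

Under the generalized linear bandit setting and Assumptions 1 and 2, if $\mathcal{A},\Theta\subseteq\mathbb{B}_d$, then

$$
\mathrm{BayesRegret}(T;\pi^{\mathrm{TS}})\le 2C(\phi)\cdot d\sqrt{T\log\left(3+\frac{3\sqrt{2T}}{d}\right)}.
$$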

Note that the optimism-based algorithm in [5] achieves regret, and the bound of Thompson sampling given in [7] is , where . Theorem 3 apparently yields a sharper bound.

### 5.3 Logistic Bandits

Logistic bandits are special cases of generalized linear bandits, in which the agent only observes binary rewards, i.e. $\mathcal{Y}=\{0,1\}$. The link function is given by $\phi^L(x)=e^{\beta x}/(1+e^{\beta x})$, where $\beta>0$ is a fixed and known parameter. Conditioned on $\theta^*=\theta$, the reward of playing action $a$ is Bernoulli distributed with parameter $\phi^L(a^\top\theta)$.

The preexisting upper bounds on logistic bandit problems all scale linearly with

$$
r=\frac{\sup_x(\phi^L)'(x)}{\inf_x(\phi^L)'(x)},
$$

which explodes when $\beta\to\infty$. However, when $\beta$ is large, the rewards of actions are clearly bifurcated by a hyperplane and we expect Thompson sampling to perform better. The regret bound given by our analysis addresses this point and has a finite limit as $\beta$ increases. Since the logistic bandit setting is incompatible with Assumption 1, we propose the following conjecture, which is supported with numerical evidence.

###### Conjecture 1.

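Under the logistic bandit setting, let the link function be $\phi^L(x)=e^{\beta x}/(1+e^{\beta x})$, and for each $t$, let $\tilde\theta^*_t$ and $\tilde\theta_t$ satisfy the conditions in Proposition 2. Then for all $\beta>0$,

$$
\Gamma_t(\tilde\theta^*_t;\tilde\theta_t)\le\frac{d}{2}.
$$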

To provide evidence for Conjecture 1, for each and , we randomly generate 100 actions and parameters and compute the exact information ratio under a randomly selected distribution over the parameters. The result is given in Figure 1. As the figure shows, the simulated information ratio is always smaller than the conjectured upper bound . We suspect that for every link function , there exists an upper bound for the information ratio that depends only on and and is independent of the cardinality of the parameter space. This opens an interesting topic for future research.

We further make the following assumption, which suggests that each parameter in the parameter set is not “too bad,” in the sense that the optimal expected reward conditioned on each parameter being the true model parameter is bounded away from $\frac{1}{2}$ from below.

###### Assumption 3.

We have that $\inf_{\theta\in\Theta}\mu(\alpha(\theta),\theta)>\frac{1}{2}$. Equivalently, we have that

$$
\inf_{\theta\in\Theta}\alpha(\theta)^\top\theta>0.
$$

The following theorem proposes the bound for the logistic bandit.

###### Theorem 4.

Under the logistic bandit setting where , for all , if the link function is given by , Assumption 3 holds with , and Conjecture 1 holds, then for all sufficiently large ,

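$$
\mathrm{BayesRegret}(T;\pi^{\mathrm{TS}})\le 2d\sqrt{T\log\left(3+\frac{6\sqrt{2T}}{d}\cdot\frac{\beta e^{\beta\delta}}{(1+e^{\beta\delta})^2}\right)}.\tag{8}
$$

For fixed $d$ and $T$, when $\beta\to\infty$ the right-hand side of (8) converges to $2d\sqrt{T\log 3}$. Thus (8) is substantially sharper than previous bounds when $\beta$ is large.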
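The limiting behaviour of (8) can be seen by evaluating its right-hand side for increasing $\beta$. The sketch below is purely numerical: the values of `d`, `T`, and `delta` are arbitrary illustrative choices, the function name is hypothetical, and the bound itself holds only under Conjecture 1 and for sufficiently large $T$.

```python
import math

def logistic_bound(d, T, beta, delta):
    """Right-hand side of (8)."""
    x = beta * delta
    # beta * e^x / (1 + e^x)^2 rewritten in terms of e^{-x} to avoid overflow for large x.
    slope = beta * math.exp(-x) / (1.0 + math.exp(-x)) ** 2
    return 2.0 * d * math.sqrt(T * math.log(3.0 + 6.0 * math.sqrt(2.0 * T) / d * slope))

d, T, delta = 10, 10_000, 0.1              # assumed illustrative values
limit = 2.0 * d * math.sqrt(T * math.log(3.0))
for beta in (1.0, 10.0, 100.0, 1000.0, 10000.0):
    print(beta, round(logistic_bound(d, T, beta, delta), 2), round(limit, 2))
```

Because $\beta e^{\beta\delta}/(1+e^{\beta\delta})^2$ decays exponentially in $\beta$ for any fixed $\delta>0$, the printed bound approaches the limit $2d\sqrt{T\log 3}$ rather than growing with $\beta$.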

## 6 Conclusion

Through an analysis based on a notion of rate-distortion, we established a new information-theoretic regret bound for Thompson sampling that scales gracefully to large action spaces. Our analysis yields an $O(d\sqrt{T\log T})$ regret bound for the linear bandit problem, which strengthens state-of-the-art bounds. The same regret bound also applies to the logistic bandit problem if a conjecture about the information ratio that agrees with computational results holds. We expect that our new line of analysis applies to a wide range of online decision algorithms.

## References

• [1] Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for Thompson sampling. Journal of the ACM (JACM), 64(5):30, 2017.
• [2] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
• [3] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
• [4] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008.
• [5] Lihong Li, Yu Lu, and Dengyong Zhou. Provable optimal algorithms for generalized linear contextual bandits. arXiv preprint arXiv:1703.00048, 2017.
• [6] Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pages 1583–1591, 2014.
• [7] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
• [8] Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.
• [9] Daniel Russo and Benjamin Van Roy. Satisficing in time-sensitive bandit learning. arXiv preprint arXiv:1803.02855, 2018.
• [10] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, and Ian Osband. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038, 2017.
• [11] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

## Appendix A Proof of Proposition 2

We first show the following lemma.

###### Lemma 1.

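Let $\{a_m\}_{m=1}^N$ and $\{b_m\}_{m=1}^N$ be two sequences of real numbers, where $N<\infty$. Let $\{p_m\}_{m=1}^N$ be such that $p_m\ge 0$ for all $m$ and $\sum_{m=1}^N p_m=1$. Then there exist indices $j,k\in\{1,\dots,N\}$ (possibly $j=k$) and $p\in[0,1]$ such that

$$
pa_j+(1-p)a_k\le\sum_{m=1}^{N}a_mp_m
$$

and

$$
pb_j+(1-p)b_k\le\sum_{m=1}^{N}b_mp_m.
$$

Proof. We prove the lemma by induction over $N$. The result is trivial when $N=1$. Assume that the result holds when $N=n$. In the following we show the case where $N=n+1$. Let $A=\sum_{m=1}^{n+1}a_mp_m$ and $B=\sum_{m=1}^{n+1}b_mp_m$.

Suppose there exists an index $t$ such that $a_t\le A$ and $b_t\le B$. Then by choosing $j=k=t$, we have

$$
pa_j+(1-p)a_k=a_t\le A\quad\text{and}\quad pb_j+(1-p)b_k=b_t\le B.
$$

Suppose there exists an index $t$ such that $a_t\ge A$ and $b_t\ge B$. Without loss of generality we can assume $t=n+1$. If $p_{n+1}=1$, the result becomes trivial by choosing $j=k=n+1$. Hence we only consider $p_{n+1}<1$. Let $p'_m=p_m/(1-p_{n+1})$ for $m=1,\dots,n$, then $\sum_{m=1}^{n}p'_m=1$. Applying the induction hypothesis to $\{a_m\}_{m=1}^n$, $\{b_m\}_{m=1}^n$ and $\{p'_m\}_{m=1}^n$, we can find indices $j',k'$ and $p'\in[0,1]$ such that

$$
p'a_{j'}+(1-p')a_{k'}\le\sum_{m=1}^{n}a_mp'_m
$$

and

$$
p'b_{j'}+(1-p')b_{k'}\le\sum_{m=1}^{n}b_mp'_m.
$$

Notice that

$$
\sum_{m=1}^{n}a_mp'_m=\sum_{m=1}^{n}\frac{a_mp_m}{1-p_{n+1}}\le\sum_{m=1}^{n+1}a_mp_m=A,
$$

and similarly $\sum_{m=1}^{n}b_mp'_m\le B$. Therefore by choosing $j=j'$, $k=k'$ and $p=p'$, we arrive at the result.

Consequently, we only have to consider the case where for each $m$, either $a_m>A,\ b_m<B$ or $a_m<A,\ b_m>B$. Without loss of generality, let $s$ be the index such that $a_m>A,\ b_m<B$ for $m\le s$ and $a_m<A,\ b_m>B$ for $m>s$. Suppose the result is false, then for any $\ell\le s$ and $h>s$, the following set of inequalities

$$
\begin{cases}
pa_\ell+(1-p)a_h\le A\\
pb_\ell+(1-p)b_h\le B
\end{cases}
$$

has no solution for $p\in[0,1]$. Since $a_\ell>A>a_h$ and $b_\ell<B<b_h$, this can only happen when

$$
\frac{A-a_h}{a_\ell-a_h}<\frac{B-b_h}{b_\ell-b_h}.
$$

Rearranging, the above inequality is equivalent to

$$
b_hA-b_\ell A+a_\ell B-a_hB+a_hb_\ell-a_\ell b_h<0.\tag{A1}
$$

Let $A'=\sum_{\ell=1}^{s}a_\ell p_\ell$, $B'=\sum_{\ell=1}^{s}b_\ell p_\ell$ and $P'=\sum_{\ell=1}^{s}p_\ell$. Multiplying both sides of (A1) by $p_\ell$ and $p_h$, and summing over $\ell$ and $h$, we have that

$$
\begin{aligned}
0&>\sum_{\ell=1}^{s}\sum_{h=s+1}^{n+1}\left(b_hp_hp_\ell A-b_\ell p_\ell p_hA+a_\ell p_\ell p_hB-a_hp_hp_\ell B+a_hp_hb_\ell p_\ell-a_\ell p_\ell b_hp_h\right)\\
&=(B-B')P'A-B'(1-P')A+A'(1-P')B-(A-A')P'B+(A-A')B'-A'(B-B')\\
&=0,
\end{aligned}
$$

which is a contradiction. Therefore the result holds for $N=n+1$. ∎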
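Lemma 1 can be checked numerically by searching over index pairs. The sketch below is an illustrative brute-force check rather than the inductive argument of the proof; the function name, the tolerance `tol`, and the random test data are assumptions. For each pair $(j,k)$, both inequalities are linear in the mixing weight, so the feasible weights form an interval.

```python
import random

def find_two_point_mixture(a, b, p, tol=1e-9):
    """Search for (j, k, q) with q*a[j] + (1-q)*a[k] <= A and q*b[j] + (1-q)*b[k] <= B,
    where A and B are the p-weighted averages of a and b. Returns None if none is found."""
    A = sum(x * w for x, w in zip(a, p))
    B = sum(y * w for y, w in zip(b, p))
    n = len(a)
    for j in range(n):
        for k in range(n):
            lo, hi = 0.0, 1.0
            feasible = True
            # Each inequality has the form c * q <= r in the mixing weight q.
            for c, r in ((a[j] - a[k], A - a[k]), (b[j] - b[k], B - b[k])):
                if c > 0:
                    hi = min(hi, r / c)
                elif c < 0:
                    lo = max(lo, r / c)
                elif r < -tol:
                    feasible = False
            if feasible and lo <= hi + tol:
                q = min(1.0, max(0.0, 0.5 * (lo + hi)))
                return j, k, q
    return None

rng = random.Random(0)
a = [rng.uniform(-1.0, 1.0) for _ in range(6)]
b = [rng.uniform(-1.0, 1.0) for _ in range(6)]
weights = [rng.random() for _ in range(6)]
p = [w / sum(weights) for w in weights]
print(find_two_point_mixture(a, b, p))
```

Geometrically, the pair $(A,B)$ lies in the convex hull of the points $(a_m,b_m)$, and the lemma says that some segment between two of those points contains a point that is no larger in either coordinate.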

To show Proposition 2, for each we construct that satisfies 1, 2 and 3. Notice that, for each , there is

 Et−1[μ(α(θt),θ∗)∣∣θt∈Θk] = (A2) =

and

 It−1(ψ;Yα(θt)∣∣θt∈Θk) = (A3) =

where we used the fact that is independent of and .

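According to Lemma 1, at stage $t$, for each $k$, there exist two parameters $\theta^{k,t}_1,\theta^{k,t}_2\in\Theta_k$ and $r_{k,t}\in[0,1]$, such that

$$
r_{k,t}\cdot E_{t-1}\left[\mu(\alpha(\theta^{k,t}_1),\theta^*)\right]+(1-r_{k,t})\cdot E_{t-1}\left[\mu(\alpha(\theta^{k,t}_2),\theta^*)\right]\le E_{t-1}\left[\mu(\alpha(\theta_t),\theta^*)\mid\theta_t\in\Theta_k\right],\tag{A4}
$$

and

$$
r_{k,t}\cdot I_{t-1}\left(\psi;Y_{\alpha(\theta^{k,t}_1)}\right)+(1-r_{k,t})\cdot I_{t-1}\left(\psi;Y_{\alpha(\theta^{k,t}_2)}\right)\le I_{t-1}\left(\psi;Y_{\alpha(\theta_t)}\mid\theta_t\in\Theta_k\right).\tag{A5}
$$

Let $\tilde\theta^*_t$ be a random variable such that

$$
P_{t-1}\left(\tilde\theta^*_t=\theta^{k,t}_1\mid\psi=k\right)=r_{k,t},\qquad P_{t-1}\left(\tilde\theta^*_t=\theta^{k,t}_2\mid\psi=k\right)=1-r_{k,t},\tag{A6}
$$

and let $\tilde\theta_t$ be an iid copy of $\tilde\theta^*_t$. Since the value of $\tilde\theta^*_t$ only depends on $\psi$, 1 is satisfied. Also we have that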

 It−1(ψ;(~θt,Yα(~θt))) = It−1(ψ;~θt)+It−1(ψ;Yα(~θt)∣∣~θt) (A7) (f)= It−1(ψ;Yα(~θt)∣∣~θt) = K∑k=1∑i=1,2P(~θt=θk,ti∣∣θt∈Θk)⋅P(θt∈Θk)It−1(ψ;Yα(θk,ti)) = K∑k=1[rk,t⋅It−1(ψ;Yα(θk,t1))+(1−rk,t)⋅It−1(ψ;Yα(θk,t2))]⋅P(θt∈Θk) (g)≤ It−1(ψ;Yα(θt)∣∣θt∈Θk)⋅P(θt∈Θk) = It−1(ψ;Yα(θt)) (h)=

where (f) and (h) follow from the fact that both $\tilde\theta_t$ and $\theta_t$ are independent of $\psi$, conditioned on $\tilde{\mathcal{H}}_{t-1}$, and (g) follows from (A5). Therefore 3 is satisfied.

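To show 2, note that by construction, at each stage $t$,

$$
D_t=r_{k,t}\cdot E_{t-1}\left[\mu(\alpha(\theta^{k,t}_1),\theta^*)\right]+(1-r_{k,t})\cdot E_{t-1}\left[\mu(\alpha(\theta^{k,t}_2),\theta^*)\right]-E_{t-1}\left[\mu(\alpha(\theta_t),\theta^*)\mid\theta_t\in\Theta_k\right]\le 0.
$$

Hence we have

$$
\begin{aligned}
E_{t-1}\left[R(Y_{\alpha(\tilde\theta_t)})-R(Y_{\alpha(\theta_t)})\right]
&=E_{t-1}\left[\mu(\alpha(\tilde\theta_t),\theta^*)-\mu(\alpha(\theta_t),\theta^*)\right]\\
&=\sum_{k=1}^{K}P(\theta_t\in\Theta_k)\cdot E_{t-1}\left[\mu(\alpha(\tilde\theta_t),\theta^*)-\mu(\alpha(\theta_t),\theta^*)\mid\theta_t\in\Theta_k\right]\\
&=\sum_{k=1}^{K}P(\theta_t\in\Theta_k)\cdot D_t\le 0.
\end{aligned}\tag{A8}
$$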

Therefore we arrive at

 Et−1[R∗−R(Yα(θt))]−Et−1[R(Yα(~θ∗t))−R(Yα(~θt))] (A9) = Et−1[R(Yα(θ∗))−R(Yα