# MOTS: Minimax Optimal Thompson Sampling

## Abstract

Thompson sampling is one of the most widely used algorithms for many online decision problems, due to its simplicity in implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can achieve the minimax optimal regret $O(\sqrt{KT})$ for $K$-armed bandit problems, where $T$ is the total time horizon. In this paper, we solve this long-standing open problem by proposing a new Thompson sampling algorithm called MOTS that adaptively truncates the sampling result of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$ and also the asymptotic optimal regret bound when $T$ grows to infinity. This is the first time that the minimax optimality of multi-armed bandit problems has been attained by Thompson sampling type algorithms.

## 1 Introduction

The Multi-Armed Bandit (MAB) problem models the exploration and exploitation tradeoff in sequential decision processes and is typically described as a game between the agent and the environment with $K$ arms. The game proceeds in $T$ time steps. In each time step $t = 1, \ldots, T$, the agent plays an arm $A_t \in \{1, 2, \cdots, K\}$ based on the observation of the previous $t-1$ time steps, and then observes a reward $r_t$ that is independently generated from a 1-subGaussian distribution with mean value $\mu_{A_t}$, where $\mu_1, \ldots, \mu_K$ are unknown. The goal of the agent is to maximize the cumulative reward over $T$ time steps. The performance of a strategy for MAB is measured by the expected cumulative difference over $T$ time steps between playing the best arm and playing the arm according to the strategy, which is also called the regret of a bandit strategy. Formally, the regret is defined as follows

$$R_\mu(T) = T\max_{i\in\{1,2,\cdots,K\}}\mu_i - \mathbb{E}_\mu\Big[\sum_{t=1}^{T} r_t\Big]. \tag{1}$$
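As a small illustration of definition (1), the sketch below evaluates the regret through the equivalent decomposition $\sum_i \Delta_i T_i(T)$ that the analysis uses later in (8). The function name `regret_from_pulls` and its arguments are hypothetical names chosen for this example; the arm means are assumed known here only so that the quantity can be computed.

```python
def regret_from_pulls(means, pull_counts):
    """Return sum_i Delta_i * T_i(T) for known arm means and observed pull counts.

    `means` and `pull_counts` are illustrative inputs, not notation from the paper.
    """
    best_mean = max(means)
    return sum((best_mean - mu) * n for mu, n in zip(means, pull_counts))


# Example: three arms, the best arm is played 70 times out of 100.
example_regret = regret_from_pulls([1.0, 0.5, 0.0], [70, 20, 10])
```

Averaging this quantity over independent runs gives a Monte Carlo estimate of $R_\mu(T)$.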

For a fixed time horizon $T$, the problem-independent lower bound (Auer et al., 2002b) states that any strategy has at least a regret in the order of $\Omega(\sqrt{KT})$, which is called the minimax-optimal regret or worst-case optimal regret. On the other hand, for a fixed model (i.e., $\mu_1, \ldots, \mu_K$ are fixed), Lai and Robbins (1985); Katehakis and Robbins (1995) proved the asymptotic lower bound that any strategy must have at least $C(\mu)\log(T)(1-o(1))$ regret when the horizon $T$ approaches infinity, where $C(\mu)$ is a constant depending on the model. A strategy with a regret upper-bounded by $C(\mu)\log(T)(1+o(1))$ is called asymptotically optimal.

In this paper we aim at achieving the asymptotic optimality and minimax optimality for the earliest bandit strategy, Thompson Sampling (TS) (Thompson, 1933). It has been observed in practice that Thompson Sampling can achieve a better performance than many upper confidence bound (UCB)-based algorithms (Chapelle and Li, 2011; Wang and Chen, 2018). In addition, TS is natural, simple and easy to implement. Despite the aforementioned advantages, the theoretical analysis of TS was not established until the past decade. In particular, Agrawal and Goyal (2012) and Kaufmann et al. (2012) proved the first regret bound of TS and showed that it is asymptotically optimal. Later, Agrawal and Goyal (2017) showed that TS using Beta distribution as the prior achieves an $O(\sqrt{KT\log T})$ problem-independent regret bound while maintaining the asymptotic optimality as well. Moreover, Agrawal and Goyal (2017) also proved that TS with Gaussian prior can achieve an improved regret bound $O(\sqrt{KT\log K})$. Meanwhile, Agrawal and Goyal (2017) proved that the vanilla TS strategy with Gaussian prior has a problem-independent bound at least in the order of $\Omega(\sqrt{KT\log K})$.

It remains an open problem (Li and Chapelle, 2012) whether Thompson Sampling type algorithms can achieve the minimax optimal regret bound $O(\sqrt{KT})$ for MAB problems.

Main Contributions. In this paper, we solve this open problem by proposing a new Thompson Sampling algorithm called Minimax Optimal Thompson Sampling (MOTS), which clips the sampling results for each arm based on the history of pulls for the arm. We prove that our proposed MOTS algorithm achieves the asymptotic optimal and minimax optimal regret simultaneously. This is the first TS type algorithm that achieves the minimax optimal regret bound $O(\sqrt{KT})$. Our result also conveys the important message that the lower bound for vanilla TS strategy with Gaussian priors in Agrawal and Goyal (2017) may not hold in more general cases. Our experimental results also demonstrate the superiority of MOTS over state-of-the-art bandit algorithms such as UCB (Auer et al., 2002a), MOSS (Audibert and Bubeck, 2009) and TS.

Notations. A random variable $X$ is said to follow a 1-subGaussian distribution if it holds that $\mathbb{E}[\exp(\lambda X - \lambda\mathbb{E}[X])] \le \exp(\lambda^2/2)$ for all $\lambda \in \mathbb{R}$. We reserve the notation $c$ to represent universal positive constants that are independent of problem parameters. The specific value of $c$ can be different line by line. We use $T$ for the total number of time steps, $K$ for the number of arms and $[K]$ for the set $\{1, 2, \cdots, K\}$. Without loss of generality, we assume $\mu_1 = \max_{i\in[K]}\mu_i$ throughout this paper. We use $\Delta_i$ to denote the gap between arm $1$ and arm $i$, i.e., $\Delta_i = \mu_1 - \mu_i$, $i \in [K]$. We denote by $T_i(t)$ the number of times that arm $i$ has been played at time step $t$ and by $\hat\mu_i(t)$ the average reward for pulling arm $i$ up to time $t$, where $r_t$ is the reward received by the algorithm at time $t$.

## 2 Minimax Optimal Thompson Sampling Algorithm

In this section, we propose a Minimax Optimal Thompson Sampling (MOTS) algorithm, whose details are displayed in Algorithm 1.

Specifically, MOTS maintains a distribution $D_i(t)$ for each arm $i \in [K]$ at time step $t$ during execution, where the distribution is initialized as the standard Gaussian distribution. At the $t$-th iteration of Algorithm 1, it samples instances $\theta_i(t)$ independently from distribution $D_i(t)$ for all $i \in [K]$. Then the agent plays the arm $A_t = \arg\max_{i\in[K]}\theta_i(t)$ and receives a reward $r_t$. The average reward $\hat\mu_i(t)$ and the number of pulls $T_i(t)$ for each arm are updated accordingly.

The main difference between MOTS and vanilla Thompson Sampling in Agrawal and Goyal (2017) is the choice of distribution $D_i(t)$. In Agrawal and Goyal (2017), $D_i(t)$ is chosen as the Gaussian distribution $N(\hat\mu_i(t), 1/T_i(t))$. In contrast, we define $D_i(t)$ as a clipped version of the Gaussian distribution $N(\hat\mu_i(t), 1/(\delta T_i(t)))$, where $\delta \in (0,1)$ is an arbitrary constant. We describe the detailed procedure of sampling from $D_i(t)$ of MOTS as follows.

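Sampling from a clipped Gaussian distribution: At time step $t$, for every arm $i \in [K]$, we denote the following range

$$R = \Big(-\infty,\ \hat\mu_i(t) + \sqrt{\frac{4}{T_i(t)}\log^+\Big(\frac{T}{K T_i(t)}\Big)}\Big), \tag{2}$$

where $\log^+(x)$ is defined as $\max\{0, \log x\}$. For arm $i$, we first sample an instance from the Gaussian distribution $N(\hat\mu_i(t), 1/(\delta T_i(t)))$. If the instance lies in $R$, we return it as a sample from $D_i(t)$; otherwise we return the right endpoint of $R$ as a sample from $D_i(t)$.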
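The clipped sampling step and the surrounding play loop can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the default `delta=0.5`, the unit-variance Gaussian reward simulator, and the choice to initialize by pulling every arm once are all assumptions made for this example. The sampling variance $1/(\delta T_i(t))$ follows the form used in the analysis.

```python
import math
import random


def log_plus(x):
    """log^+(x) = max{0, log x}, for x > 0."""
    return max(0.0, math.log(x))


def clip_threshold(mu_hat, pulls, horizon, n_arms):
    """Right endpoint of the range R in (2)."""
    return mu_hat + math.sqrt(4.0 / pulls * log_plus(horizon / (n_arms * pulls)))


def sample_clipped(mu_hat, pulls, horizon, n_arms, delta, rng):
    """Draw from N(mu_hat, 1/(delta * pulls)) and clip at the right endpoint of R."""
    raw = rng.gauss(mu_hat, 1.0 / math.sqrt(delta * pulls))
    return min(raw, clip_threshold(mu_hat, pulls, horizon, n_arms))


def run_mots(means, horizon, delta=0.5, seed=0):
    """Simulate the clipped sampling loop; returns the pull count of each arm.

    Assumptions for illustration: rewards are N(mean, 1), and every arm is
    pulled once before the sampling loop starts (requires horizon >= len(means)).
    """
    rng = random.Random(seed)
    n_arms = len(means)
    pulls = [0] * n_arms
    mu_hat = [0.0] * n_arms

    def play(arm):
        reward = rng.gauss(means[arm], 1.0)
        pulls[arm] += 1
        # Incremental update of the empirical mean of the played arm.
        mu_hat[arm] += (reward - mu_hat[arm]) / pulls[arm]

    for arm in range(n_arms):
        play(arm)
    for _ in range(n_arms, horizon):
        samples = [
            sample_clipped(mu_hat[i], pulls[i], horizon, n_arms, delta, rng)
            for i in range(n_arms)
        ]
        play(max(range(n_arms), key=lambda i: samples[i]))
    return pulls
```

Calling `run_mots([1.0, 0.0], horizon=2000)` returns the number of pulls per arm, which can be combined with the gaps $\Delta_i$ to estimate the regret of a run.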

###### Remark 1.

We would like to point out that the right endpoint in (2) resembles the upper confidence bound in MOSS (Audibert and Bubeck, 2009). Apart from the difference that MOTS is TS-type and MOSS is UCB-type, we claim that they are also very different from a theoretical perspective. Under the definition of the range $R$ in (2), we will prove in the next section that MOTS is both asymptotically optimal and minimax optimal. However, MOSS is only minimax optimal (Audibert and Bubeck, 2009). The improvement of MOSS to achieve asymptotic optimality was only recently developed in the KL-UCB++ algorithm (Ménard and Garivier, 2017) and the AdaUCB algorithm (Lattimore, 2018), which can be seen as variants of MOSS. Both KL-UCB++ and AdaUCB need to reduce the constant factor 4 in the right endpoint of $R$ defined in (2) to 2, which essentially decreases the exploration rate. Moreover, KL-UCB++ utilizes a more complicated upper confidence bound with an additional term and AdaUCB only works for Gaussian reward distributions.

In contrast, it is easy to verify that for MOTS the constant 4 in (2) can be replaced by any constant larger than 4 while maintaining the asymptotic optimality and minimax optimality. Therefore, MOTS is more robust in the choice of hyperparameter. It will be more suitable to design better algorithms based on MOTS, e.g., achieving instance-dependent optimality (see Lattimore (2018) for detail) while keeping the asymptotic optimality.

## 3 Main Theory

In this section, we present our main theory of MOTS.

###### Theorem 1 (Minimax Optimality).

For any fixed $\delta \in (0,1)$, there exists a universal constant $c > 0$ such that the regret of Algorithm 1 with 1-subGaussian rewards satisfies

$$R_\mu(T) \le c\sqrt{KT} + \sum_{i=1}^{K}\Delta_i. \tag{3}$$

The second term on the right-hand side of (3) is due to the fact that each arm must be pulled at least once. Following the convention in the literature (Audibert and Bubeck, 2009; Agrawal and Goyal, 2017), we only need to consider the case when $\sum_{i=1}^{K}\Delta_i$ is dominated by $\sqrt{KT}$.

###### Remark 2.

Compared with the results in Agrawal and Goyal (2017), the regret bound of MOTS improves that of TS by a factor of $\sqrt{\log T}$ and improves that of TS with Gaussian priors by a factor of $\sqrt{\log K}$. This is the first time that a Thompson Sampling type algorithm achieves the minimax optimal regret for multi-armed bandit problems (Auer et al., 2002a), which also answers the open problem in Li and Chapelle (2012), where it is conjectured that Thompson sampling's regret actually matches the lower bound and is indeed optimal.

###### Theorem 2 (Asymptotic Optimality).

For any fixed $\delta \in (0,1)$, the regret of Algorithm 1 with 1-subGaussian rewards satisfies

$$\lim_{T\to\infty}\frac{R_\mu(T)}{\log(T)} = \sum_{i:\Delta_i>0}\frac{2}{\delta\Delta_i}. \tag{4}$$

###### Remark 3.

Theorem 2 indicates that the asymptotic regret rate of MOTS matches the asymptotic optimal rate up to a multiplicative factor $1/\delta$, where $\delta \in (0,1)$ is arbitrarily fixed. This is the same as that of vanilla TS in Agrawal and Goyal (2017), where the authors proved an asymptotic regret rate that matches the asymptotic optimal rate up to a multiplicative factor $1+\epsilon$, where $\epsilon > 0$ is a fixed constant.
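To make the two rates concrete, the following sketch evaluates the asymptotic constant in (4), read as $\sum_{i:\Delta_i>0} 2/(\delta\Delta_i)$, next to the minimax rate $\sqrt{KT}$ of Theorem 1. The function names are hypothetical, and the arm means are illustrative inputs. Setting `delta=1.0` recovers $\sum_i 2/\Delta_i$, which is the Lai and Robbins rate for unit-variance Gaussian rewards.

```python
import math


def asymptotic_constant(means, delta):
    """Coefficient of log(T) in (4): sum over suboptimal arms of 2 / (delta * Delta_i)."""
    best_mean = max(means)
    return sum(2.0 / (delta * (best_mean - mu)) for mu in means if best_mean - mu > 0)


def minimax_rate(n_arms, horizon):
    """The sqrt(K * T) rate appearing in Theorem 1, without the universal constant."""
    return math.sqrt(n_arms * horizon)


means = [1.0, 0.5, 0.0]
mots_rate = asymptotic_constant(means, delta=0.5)   # 2/(0.5*0.5) + 2/(0.5*1.0)
lower_bound_rate = asymptotic_constant(means, delta=1.0)
```

The ratio `mots_rate / lower_bound_rate` equals $1/\delta$, which is the multiplicative factor discussed in Remark 3.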

So far, we have assumed the reward follows an unknown subGaussian distribution. In the next theorem, we present a variant of MOTS that achieves the minimax optimality and asymptotic optimality for Gaussian reward distributions.

###### Theorem 3.

If the reward of each arm $i$ follows a Gaussian distribution $N(\mu_i, 1)$, and the right endpoint of range $R$ in (2) is replaced by

 (5)

then Theorem 1 and Theorem 2 still hold.

### 3.1 Proof of the Minimax Optimality

The following lemma will be frequently used throughout our analysis, which characterises the concentration property of subGaussian random variables.

###### Lemma 1 (Lemma 9.3 in Lattimore and Szepesvári (2020)).

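Let $X_1, X_2, \ldots$ be independent and 1-subGaussian with zero mean. Denote $\hat\beta_s = \frac{1}{s}\sum_{j=1}^{s}X_j$. Then for any $\Delta > 0$,

$$\mathbb{P}\Big(\exists\, s\ge 1: \hat\beta_s + \sqrt{\frac{4}{s}\log^+\Big(\frac{T}{sK}\Big)} + \Delta \le 0\Big) \le \frac{cK}{T\Delta^2}, \tag{6}$$

where $c$ is a universal constant.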

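Let $\hat\mu_{is}$ be the average reward of arm $i$ when it has been played $s$ times. Define

$$\Delta = \mu_1 - \min_{s\le T}\Big\{\hat\mu_{1s} + \sqrt{\frac{4}{s}\log^+\Big(\frac{T}{sK}\Big)}\Big\}. \tag{7}$$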

The regret of Algorithm 1 can be decomposed as follows.

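$$\begin{aligned}
R_\mu(T) = \sum_{i:\Delta_i>0}\Delta_i\mathbb{E}[T_i(T)] &\le \mathbb{E}[2T\Delta] + \mathbb{E}\Big[\sum_{i:\Delta_i>2\Delta}\mathbb{E}[\Delta_i T_i(T)]\Big] \\
&\le \mathbb{E}[2T\Delta] + 8\sqrt{KT} + \mathbb{E}\Big[\sum_{i:\Delta_i>\max\{2\Delta,\, 8\sqrt{K/T}\}}\Delta_i T_i(T)\Big].
\end{aligned} \tag{8}$$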

The first term in (8) can be bounded as:

 (9)

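where $c$ is a universal constant and the inequality comes from Lemma 1 since

$$\begin{aligned}
\mathbb{P}\Big(\mu_1 - \min_{s\le T}\Big\{\hat\mu_{1s} + \sqrt{\frac{4}{s}\log^+\Big(\frac{T}{sK}\Big)}\Big\} \ge x\Big) &= \mathbb{P}\Big(\exists\, 1\le s\le T: \mu_1 - \hat\mu_{1s} - \sqrt{\frac{4}{s}\log^+\Big(\frac{T}{sK}\Big)} - x \ge 0\Big) \\
&\le \frac{cK}{x^2 T}.
\end{aligned} \tag{10}$$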

Now we focus on the last term in (8). Note that the update rules of Algorithm 1 ensure $D_i(t+1) = D_i(t)$ whenever $A_t \ne i$. Hence, we can define $D_{is}$ as the prior distribution of arm $i$ when it has been played $s$ times and obtain the following lemma.

###### Lemma 2 (Theorem 36.2 in Lattimore and Szepesvári (2020)).

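Let $\epsilon > 0$ be an arbitrary constant. Then the expected number of times that Algorithm 1 plays arm $i$ is bounded by

\begin{align}
\mathbb{E}[T_i(T)] &= \mathbb{E}\Big[\sum_{t=1}^{T}\mathbb{1}\{A_t=i, E_i(t)\}\Big] + \mathbb{E}\Big[\sum_{t=1}^{T}\mathbb{1}\{A_t=i, E_i^c(t)\}\Big] \notag\\
&\le 1 + \mathbb{E}\Big[\sum_{s=0}^{T-1}\Big(\frac{1}{G_{1s}(\epsilon)}-1\Big)\Big] + \mathbb{E}\Big[\sum_{t=0}^{T-1}\mathbb{1}\{A_t=i, E_i^c(t)\}\Big] \tag{11}\\
&\le 1 + \mathbb{E}\Big[\sum_{s=0}^{T-1}\Big(\frac{1}{G_{1s}(\epsilon)}-1\Big)\Big] + \mathbb{E}\Big[\sum_{s=0}^{T-1}\mathbb{1}\{G_{is}(\epsilon)>1/T\}\Big], \tag{12}
\end{align}

where $G_{is}(\epsilon) = 1 - F_{is}(\mu_1-\epsilon)$, $F_{is}$ is the CDF of the distribution of arm $i$ after it has been played $s$ times, and $E_i(t) = \{\theta_i(t) \le \mu_1-\epsilon\}$ with $\theta_i(t)$ the sample drawn for arm $i$ at time $t$.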

The above lemma was first proved by Agrawal and Goyal (2017) and here we use an improved version presented in Lattimore and Szepesvári (2020). We define

$$m_{is} = \hat\mu_{is} + \sqrt{\frac{4}{s}\log^+\Big(\frac{T}{sK}\Big)}. \tag{13}$$

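By the definition in (2), we know that $m_{is}$ is the right endpoint of range $R$ when $T_i(t) = s$. From the definition of $\Delta$ in (7) and noting that $\Delta_i > 2\Delta$, we obtain

$$m_{1s} = \hat\mu_{1s} + \sqrt{\frac{4}{s}\log^+\Big(\frac{T}{sK}\Big)} \ge \mu_1 - \Delta \ge \mu_1 - \frac{\Delta_i}{2}. \tag{14}$$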

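Consider a sample from the clipped distribution of arm $1$ after $s$ pulls. Recall the clipped sampling procedure in Section 2: we first draw an instance from the Gaussian distribution $N(\hat\mu_{1s}, 1/(\delta s))$; if the instance is below $m_{1s}$, we return it, and otherwise we return $m_{1s}$. Combining with (14), which gives $m_{1s} \ge \mu_1 - \Delta_i/2$, we know that clipping does not change the probability that the sample exceeds $\mu_1 - \Delta_i/2$.

Let $F'_{is}$ be the CDF of $N(\hat\mu_{is}, 1/(\delta s))$ for $s \ge 1$ and the CDF of the initial distribution for $s = 0$. Let $G'_{is}(\epsilon) = 1 - F'_{is}(\mu_1-\epsilon)$. Thus $G_{1s}(\Delta_i/2) = G'_{1s}(\Delta_i/2)$. Using (11) of Lemma 2 and setting $\epsilon = \Delta_i/2$, we have

$$\begin{aligned}
\mathbb{E}[\Delta_i T_i(T)] &\le \Delta_i + \Delta_i\cdot\mathbb{E}\Big[\sum_{s=0}^{T-1}\Big(\frac{1}{G_{1s}(\Delta_i/2)}-1\Big)\Big] + \Delta_i\cdot\mathbb{E}\Big[\sum_{t=0}^{T-1}\mathbb{1}\{A_t=i, E_i^c(t)\}\Big] \\
&= \Delta_i + \underbrace{\Delta_i\cdot\mathbb{E}\Big[\sum_{t=0}^{T-1}\mathbb{1}\{A_t=i, E_i^c(t)\}\Big]}_{I_1} + \underbrace{\Delta_i\cdot\mathbb{E}\Big[\sum_{s=0}^{T-1}\Big(\frac{1}{G'_{1s}(\Delta_i/2)}-1\Big)\Big]}_{I_2}.
\end{aligned} \tag{15}$$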

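Bounding term $I_1$: Note that

$$E_i^c(t) = \Big\{\theta_i(t) > \mu_1 - \frac{\Delta_i}{2}\Big\} \subseteq \Big\{\hat\mu_i(t) + \sqrt{\frac{4}{T_i(t)}\log^+\Big(\frac{T}{K T_i(t)}\Big)} > \mu_1 - \frac{\Delta_i}{2}\Big\}.$$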

We define the following notation.

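$$\kappa_i = \sum_{s=1}^{T}\mathbb{1}\Big\{\hat\mu_{is} + \sqrt{\frac{4}{s}\log^+\Big(\frac{T}{sK}\Big)} > \mu_1 - \frac{\Delta_i}{2}\Big\}, \tag{16}$$

which immediately implies

$$I_1 = \Delta_i\cdot\mathbb{E}\Big[\sum_{t=0}^{T-1}\mathbb{1}\{A_t=i, E_i^c(t)\}\Big] \le \Delta_i\mathbb{E}[\kappa_i]. \tag{17}$$

The following lemma characterizes the bound for $\mathbb{E}[\kappa_i]$.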

###### Lemma 3.

Let be a constant and be 1-subGaussian random variables with zero means. Denote . Then for any ,

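$$\sum_{n=1}^{T}\mathbb{P}\Big(\hat\mu_n + \sqrt{\frac{4}{n}\log^+\Big(\frac{N}{n}\Big)} \ge \omega\Big) \le 1 + \frac{4\log^+(N\omega^2)}{\omega^2} + \frac{2}{\omega^2} + \frac{\sqrt{8\pi\log^+(N\omega^2)}}{\omega^2}. \tag{18}$$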

Since the proof of Lemma 3 is quite standard, we defer it to the appendix. Now we continue our proof of the minimax optimality of MOTS. Applying Lemma 3 to (17) yields

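$$\begin{aligned}
\Delta_i\mathbb{E}[\kappa_i] &= \Delta_i\,\mathbb{E}\Big[\sum_{s=1}^{T}\mathbb{1}\Big\{\hat\mu_{is} + \sqrt{\frac{4}{s}\log^+\Big(\frac{T}{sK}\Big)} > \mu_1 - \frac{\Delta_i}{2}\Big\}\Big] \\
&\le \Delta_i\sum_{s=1}^{T}\mathbb{P}\Big\{\hat\mu_{is} - \mu_i + \sqrt{\frac{4}{s}\log^+\Big(\frac{T}{sK}\Big)} > \frac{\Delta_i}{2}\Big\} \\
&\le \Delta_i + \frac{8}{\Delta_i} + \frac{16}{\Delta_i}\Big(\log^+\Big(\frac{T\Delta_i^2}{4K}\Big) + \sqrt{2\pi\log^+\Big(\frac{T\Delta_i^2}{4K}\Big)}\Big),
\end{aligned} \tag{19}$$

where the first inequality is due to the fact that $\mu_1 - \Delta_i/2 = \mu_i + \Delta_i/2$. It is easy to verify that $\log^+(ax^2)/x$ is monotonically decreasing for $x \ge e/\sqrt{a}$ and any $a > 0$. Since $\Delta_i > 8\sqrt{K/T}$, every term on the right-hand side of (19) other than $\Delta_i$ is at most a constant multiple of $\sqrt{T/K}$. Plugging this fact into (19), we have that $I_1 \le \Delta_i + c\sqrt{T/K}$, where $c$ is a universal constant.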

Bounding term $I_2$: We first prove the following lemma.

###### Lemma 4.

There exists a universal constant $c > 0$ such that:

$$\mathbb{E}\Big[\sum_{s=0}^{T-1}\Big(\frac{1}{G'_{1s}(\epsilon)}-1\Big)\Big] \le \frac{c}{\epsilon^2}. \tag{20}$$

###### Proof of Lemma 4.

We decompose the proof of Lemma 4 into the proof of the following two statements: (i) there exists a universal constant $c$ such that

$$\mathbb{E}\Big[\frac{1}{G'_{1s}(\epsilon)}-1\Big] \le c, \quad \forall s, \tag{21}$$

and (ii) for , it holds that

 E[T∑s=L(1G′1s(ϵ)−1)]≤4e2(1+16ϵ2). (22)

For , and . For , let and be the random variable denoting the number of consecutive independent trials until a sample of becomes greater than . Note that , where is sampled from . Hence we have

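$$\mathbb{E}\Big[\frac{1}{G'_{1s}(\epsilon)}-1\Big] = \mathbb{E}[Y_s]. \tag{23}$$

Consider an integer $r \ge 1$. Let $z = \sqrt{2\delta'\log r}$, where $\delta'$ will be determined later. Let random variable $M_r$ be the maximum of $r$ independent samples from $N(\hat\mu_{1s}, 1/(\delta s))$. Define $\mathcal{F}_s$ to be the filtration consisting of the history of plays of Algorithm 1 up to the $s$-th pull of arm $1$. Then it holds that

$$\begin{aligned}
\mathbb{P}(Y_s < r) \ge \mathbb{P}(M_r > \mu_1-\epsilon) &\ge \mathbb{E}\Big[\mathbb{E}\Big[\mathbb{1}\Big\{M_r > \hat\mu_{1s} + \frac{z}{\sqrt{\delta s}},\ \hat\mu_{1s} + \frac{z}{\sqrt{\delta s}} \ge \mu_1-\epsilon\Big\}\,\Big|\,\mathcal{F}_s\Big]\Big] \\
&= \mathbb{E}\Big[\mathbb{1}\Big\{\hat\mu_{1s} + \frac{z}{\sqrt{\delta s}} \ge \mu_1-\epsilon\Big\}\cdot\mathbb{P}\Big(M_r > \hat\mu_{1s} + \frac{z}{\sqrt{\delta s}}\,\Big|\,\mathcal{F}_s\Big)\Big].
\end{aligned} \tag{24}$$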

For a random variable $Z \sim N(\mu, \sigma^2)$, it holds by Formula 7.1.13 from Abramowitz and Stegun (1965) that

$$\mathbb{P}(Z > \mu + x\sigma) \ge \frac{1}{\sqrt{2\pi}}\frac{x}{x^2+1}e^{-\frac{x^2}{2}}. \tag{25}$$
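The tail bound (25) is easy to sanity-check numerically for a standard Gaussian, whose upper tail can be written with the complementary error function. The sketch below uses only the Python standard library; the function names and the grid of test points are illustrative choices.

```python
import math


def gaussian_upper_tail(x):
    """P(Z > x) for a standard Gaussian Z, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def tail_lower_bound(x):
    """Right-hand side of (25): x / (x^2 + 1) * exp(-x^2 / 2) / sqrt(2 * pi)."""
    return x / (x * x + 1.0) * math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


# Compare both sides of (25) on a few positive points.
comparison = [(x, gaussian_upper_tail(x), tail_lower_bound(x)) for x in (0.1, 0.5, 1.0, 2.0, 3.0, 5.0)]
```

For large `x` the two sides become close, which is why the bound is tight enough for the maximum-of-samples argument that follows.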

Therefore, it holds that

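$$\begin{aligned}
\mathbb{P}\Big(M_r > \hat\mu_{1s} + \frac{z}{\sqrt{\delta s}}\Big) &\ge 1 - \Big(1 - \frac{1}{\sqrt{2\pi}}\frac{z}{z^2+1}e^{-z^2/2}\Big)^r \\
&= 1 - \Big(1 - \frac{r^{-\delta'}}{\sqrt{2\pi}}\frac{\sqrt{2\delta'\log r}}{2\delta'\log r+1}\Big)^r \\
&\ge 1 - \exp\Big(-\frac{r^{1-\delta'}}{\sqrt{8\pi\log r}}\Big),
\end{aligned} \tag{26}$$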

where the last inequality is due to , and . Let , then

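$$\exp\Big(-\frac{r^{1-\delta'}}{\sqrt{8\pi\log r}}\Big) \le \frac{1}{r^2} \iff \exp((1-\delta')x) \ge 2\sqrt{8\pi}\,x^{\frac{3}{2}}, \quad \text{where } x = \log r.$$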

It is easy to verify that for , . Hence, if , we have . Thus, we have

$$\mathbb{P}\Big(M_r > \hat\mu_{1s} + \frac{z}{\sqrt{\delta s}}\Big) \ge 1 - \frac{1}{r^2}. \tag{27}$$

For any , it holds that

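$$\begin{aligned}
\mathbb{P}\Big(\hat\mu_{1s} + \frac{z}{\sqrt{\delta s}} \ge \mu_1-\epsilon\Big) &\ge \mathbb{P}\Big(\hat\mu_{1s} + \frac{z}{\sqrt{\delta s}} \ge \mu_1\Big) \\
&\ge 1 - \exp\big(-z^2/(2\delta)\big) \\
&= 1 - \exp\big(-(\delta'/\delta)\log r\big) \\
&= 1 - r^{-\delta'/\delta},
\end{aligned} \tag{28}$$

where the second inequality is due to Lemma 8. Therefore, for $r$ large enough that (27) holds, substituting (27) and (28) into (24) yields

$$\mathbb{P}(Y_s < r) \ge \Big(1 - \frac{1}{r^2}\Big)\big(1 - r^{-\delta'/\delta}\big) \ge 1 - r^{-2} - r^{-\delta'/\delta}. \tag{29}$$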

For any , this gives rise to

 E[Ys] =T∑r=0P(Ys≥r) ≤e10⋅exp[1(1−δ′)2]+∑r≥11r2+∑r≥1r−δ′δ ≤e10⋅exp[1(1−δ′)2]+2+1+∫∞x=1x−δ′δdx ≤2e10⋅exp[1(1−δ′)2]+1(1−δ)−(1−δ′),

where is a universal constant. Let . We further obtain

 E[1G′1s(ϵ)−1]≤2e10⋅exp[4(1−δ)2]+21−δ. (30)

Since $\delta$ is fixed, there exists a universal constant $c > 0$ such that

$$\mathbb{E}\Big[\frac{1}{G'_{1s}(\epsilon)}-1\Big] \le c. \tag{31}$$

Now, we turn to proving (22). Let be the event that . Let . Let is distributed random variable. Under the event , using the upper bound of Lemma 7 with , we obtain

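$$\mathbb{P}(X_{1s} > \mu_1-\epsilon) \ge \mathbb{P}(X_{1s} > \hat\mu_{1s} - \epsilon/2) \ge 1 - \frac{1}{2}\exp(-s\delta\epsilon^2/8). \tag{32}$$

Then, we have

$$\begin{aligned}
\mathbb{E}\Big[\frac{1}{G'_{1s}(\epsilon)}-1\Big] &= \mathbb{E}\Big[\frac{1}{\mathbb{P}(X_{1s} > \mu_1-\epsilon)}-1\Big] \\
&\le \mathbb{E}\Big[\frac{1}{\mathbb{P}(X_{1s} > \mu_1-\epsilon \mid E_s)\cdot\mathbb{P}(E_s)}-1\Big] \\
&\le \mathbb{E}\Big[\frac{1}{\big(1 - \frac{1}{2}\exp(-s\delta\epsilon^2/8)\big)\mathbb{P}(E_s)}-1\Big].
\end{aligned} \tag{33}$$

Applying Lemma 8, we have

$$\mathbb{P}(E_s) = \mathbb{P}\Big(\hat\mu_{1s} \ge \mu_1 - \frac{\epsilon}{2}\Big) \ge 1 - \exp\Big(-\frac{s\epsilon^2}{8}\Big) \ge 1 - \exp(-s\delta\epsilon^2/8). \tag{34}$$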

Substituting the above inequality into (33) yields

 E[T∑s=L(1G′1s(ϵ)−1)] ≤T∑s=L[1(1−exp(−sδϵ2/8))2−1] ≤T∑s=L4exp(−sϵ216) ≤4e2(1+16ϵ2).

The second inequality follows since , for and . We complete the proof of Lemma 4 by combining (21) and (22). ∎

From Lemma 4, we immediately obtain

$$I_2 = \Delta_i\mathbb{E}\Big[\sum_{s=0}^{T-1}\Big(\frac{1}{G'_{1s}(\Delta_i/2)}-1\Big)\Big] \le c\sqrt{\frac{T}{K}}. \tag{35}$$

Substituting (9), (15), (19) and (35) into (8), we complete the proof of Theorem 1.

### 3.2 Proof of the Asymptotic Optimality

We first prove the following technical lemma.

###### Lemma 5.

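For any $\epsilon_T > 0$ and $\epsilon > 0$ that satisfy $\epsilon + \epsilon_T < \Delta_i$, it holds that

$$\mathbb{E}\Big[\sum_{s=0}^{T-1}\mathbb{1}\{G'_{is}(\epsilon) > 1/T\}\Big] \le 1 + \frac{2}{\epsilon_T^2} + \frac{2\log T}{\delta(\Delta_i-\epsilon-\epsilon_T)^2}.$$

###### Proof of Lemma 5.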

For sufficiently small such that , which also implies . Applying Lemma 8, we have . Furthermore,

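$$\sum_{s=1}^{\infty}\exp\Big(-\frac{s\epsilon_T^2}{2}\Big) \le \frac{1}{\exp(\epsilon_T^2/2)-1} \le \frac{2}{\epsilon_T^2}, \tag{36}$$

where the last inequality is due to the fact that $1 + x \le e^x$ for all $x$. Define $L_i = \frac{2\log T}{\delta(\Delta_i-\epsilon-\epsilon_T)^2}$. For $s \ge L_i$ and $X_{is}$ sampled from $N(\hat\mu_{is}, 1/(\delta s))$, if $\hat\mu_{is} \le \mu_i + \epsilon_T$, then using the Gaussian tail bound in Lemma 7, we obtain

$$\begin{aligned}
\mathbb{P}(X_{is} \ge \mu_1-\epsilon) &\le \frac{1}{2}\exp\Big(-\frac{\delta s(\hat\mu_{is}-\mu_1+\epsilon)^2}{2}\Big) \\
&\le \frac{1}{2}\exp\Big(-\frac{\delta s(\mu_1-\epsilon-\mu_i-\epsilon_T)^2}{2}\Big) \\
&= \frac{1}{2}\exp\Big(-\frac{\delta s(\Delta_i-\epsilon-\epsilon_T)^2}{2}\Big) \le \frac{1}{T}.
\end{aligned} \tag{37}$$

Let $Y_i$ be the event that $\hat\mu_{is} \le \mu_i + \epsilon_T$ holds for all $s \ge 1$, and let $Y_i^c$ denote its complement. We further obtain

$$\begin{aligned}
\mathbb{E}\Big[\sum_{s=0}^{T-1}\mathbb{1}\{G'_{is}(\epsilon) > 1/T\}\Big] &\le \mathbb{E}\Big[\sum_{s=0}^{T-1}\mathbb{1}\{G'_{is}(\epsilon) > 1/T\}\,\Big|\,Y_i\Big] + \mathbb{P}[Y_i^c] \\
&\le \sum_{s=\lceil L_i\rceil}^{T}\mathbb{E}\Big[\mathbb{1}\{\mathbb{P}(X_{is} > \mu_1-\epsilon) > 1/T\}\,\Big|\,Y_i\Big] + \lceil L_i\rceil + \mathbb{P}[Y_i^c] \\
&\le \lceil L_i\rceil + \mathbb{P}[Y_i^c] \le 1 + \frac{2}{\epsilon_T^2} + \frac{2\log T}{\delta(\Delta_i-\epsilon-\epsilon_T)^2},
\end{aligned} \tag{38}$$

where the third inequality is from (37) and the last inequality is from (36). ∎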
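For completeness, the geometric-series step behind (36) can be spelled out as follows. This is a short worked derivation, reading the summand as $\exp(-s\epsilon_T^2/2)$ and writing $a = \epsilon_T^2/2 > 0$ as shorthand introduced only for this derivation.

```latex
\begin{aligned}
\sum_{s=1}^{\infty} e^{-sa}
  &= \frac{e^{-a}}{1 - e^{-a}}
   = \frac{1}{e^{a} - 1}
   && \text{(geometric series with ratio } e^{-a} < 1\text{)} \\
  &\le \frac{1}{a}
   = \frac{2}{\epsilon_T^2}
   && \text{(since } e^{a} \ge 1 + a \text{ implies } e^{a} - 1 \ge a\text{)}.
\end{aligned}
```

The first step is in fact an equality, so the only loss in (36) comes from the elementary bound $e^a \ge 1 + a$.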

Now we prove the asymptotic optimality of MOTS.

###### Proof of Theorem 2.

Let be the following event

 Z(ϵ)={∀s

For any arm , we have

 E[Ti(T)] ≤E[Ti(T)|Z(ϵ)]P(Z(ϵ))+T(1−P[Z(ϵ)]) ≤1+E[(T−1∑s=0(1G1s(ϵ)−1))∣∣∣Z(ϵ)]+T(1−P[Z(ϵ)])+E[T−1∑s=0\mathds1{Gis(ϵ