Analysis of Thompson Sampling for the multi-armed bandit problem

Shipra Agrawal
Microsoft Research India
shipra@microsoft.com
   Navin Goyal
Microsoft Research India
navingo@microsoft.com
Abstract

The multi-armed bandit problem is a popular model for studying the exploration/exploitation trade-off in sequential decision problems. Many algorithms are now available for this well-studied problem. One of the earliest algorithms, given by W. R. Thompson, dates back to 1933. This algorithm, referred to as Thompson Sampling, is a natural Bayesian algorithm. The basic idea is to choose an arm to play according to its probability of being the best arm. The Thompson Sampling algorithm has experimentally been shown to be close to optimal. In addition, it is efficient to implement and exhibits several desirable properties such as small regret for delayed feedback. However, theoretical understanding of this algorithm was quite limited. In this paper, for the first time, we show that the Thompson Sampling algorithm achieves logarithmic expected regret for the stochastic multi-armed bandit problem. More precisely, for the stochastic two-armed bandit problem, the expected regret in time T is O((ln T)/Δ + 1/Δ³). And, for the stochastic N-armed bandit problem, the expected regret in time T is O((Σ_{i=2}^N 1/Δ_i²)² ln T). Our bounds are optimal but for the dependence on the Δ_i and the constant factors in big-Oh.

1 Introduction

The multi-armed bandit (MAB) problem models the exploration/exploitation trade-off inherent in sequential decision problems. Many versions and generalizations of the multi-armed bandit problem have been studied in the literature; in this paper we will consider a basic and well-studied version of this problem: the stochastic multi-armed bandit problem. Among the many algorithms available for the stochastic bandit problem, some popular ones include the Upper Confidence Bound (UCB) family of algorithms (e.g., [9, 1], and more recently [3], [10], [8]), which have good theoretical guarantees, and the algorithm by [4], which gives the optimal strategy in the Bayesian setting with known priors and geometric time-discounted rewards. In one of the earliest works on stochastic bandit problems, [14] proposed a natural randomized Bayesian algorithm to minimize regret. The basic idea is to assume a simple prior distribution on the parameters of the reward distribution of every arm, and at any time step, play an arm according to its posterior probability of being the best arm. This algorithm is known as Thompson Sampling (TS), and it is a member of the family of randomized probability matching algorithms. We emphasize that although the TS algorithm is a Bayesian approach, the description of the algorithm and our analysis apply to the prior-free stochastic multi-armed bandit model, where the parameters of the reward distribution of every arm are fixed, though unknown (refer to Section 1.1). One could think of the "assumed" Bayesian priors as a tool employed by the TS algorithm to encode its current knowledge about the arms. Thus, our regret bounds for Thompson Sampling are directly comparable to the regret bounds for the UCB family of algorithms, which are a frequentist approach to the same problem.

Recently, TS has attracted considerable attention. Several studies (e.g., [6, 13, 2, 12]) have empirically demonstrated the efficacy of Thompson Sampling: [13] provides a detailed discussion of probability matching techniques in many general settings along with favorable empirical comparisons with other techniques. [2] demonstrate that empirically TS achieves regret comparable to the lower bound of [9]; and in applications like display advertising and news article recommendation, it is competitive with or better than popular methods such as UCB. In their experiments, TS is also more robust to delayed or batched feedback (delayed feedback means that the result of a play of an arm may become available only after some time delay, but we are required to make immediate decisions about which arm to play next) than the other methods. A possible explanation may be that TS is a randomized algorithm and so it is unlikely to get trapped in an early bad decision during the delay. Microsoft's adPredictor ([5]) for CTR prediction of search ads on Bing uses the idea of Thompson Sampling.

It has been suggested ([2]) that, despite being easy to implement and competitive with state-of-the-art methods, the reason TS is not very popular in the literature could be its lack of a strong theoretical analysis. Existing theoretical analyses in [6, 11] provide only weak guarantees, namely, a bound of o(T) on the expected regret in time T. In this paper, for the first time, we provide a logarithmic bound on the expected regret of the TS algorithm in time T that is close to the lower bound of [9]. Before stating our results, we describe the MAB problem and the TS algorithm formally.

1.1 The multi-armed bandit problem

We consider the stochastic multi-armed bandit (MAB) problem: We are given a slot machine with N arms; at each time step t = 1, 2, 3, …, one of the N arms must be chosen to be played. Each arm i, when played, yields a random real-valued reward according to some fixed (unknown) distribution with support in [0, 1]. The random rewards obtained from playing an arm repeatedly are i.i.d. and independent of the plays of the other arms. The reward is observed immediately after playing the arm.

An algorithm for the MAB problem must decide which arm to play at each time step t, based on the outcomes of the previous t − 1 plays. Let μ_i denote the (unknown) expected reward for arm i. A popular goal is to maximize the expected total reward in time T, i.e., E[Σ_{t=1}^T μ_{i(t)}], where i(t) is the arm played in step t, and the expectation is over the random choices of i(t) made by the algorithm. It is more convenient to work with the equivalent measure of expected total regret: the amount we lose because of not playing the optimal arm in each step. To formally define regret, let us introduce some notation. Let μ* := max_i μ_i, and Δ_i := μ* − μ_i. Also, let k_i(t) denote the number of times arm i has been played up to step t. Then the expected total regret in time T is given by

E[R(T)] = E[ Σ_{t=1}^T (μ* − μ_{i(t)}) ] = Σ_i Δ_i · E[k_i(T)].
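As a purely illustrative numerical instance of this definition (the means and play counts below are assumptions, not values from the paper): with two Bernoulli arms of means μ₁ = 0.9 and μ₂ = 0.5, we have Δ₂ = 0.4, and an algorithm that plays arm 2 an expected twenty times within the horizon T incurs

\[
\mathbb{E}[R(T)] \;=\; \Delta_2 \,\mathbb{E}[k_2(T)] \;=\; 0.4 \times 20 \;=\; 8 .
\]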

Other performance measures include PAC-style guarantees; we do not consider those measures here.

1.2 Thompson Sampling

For simplicity of discussion, we first provide the details of the Thompson Sampling algorithm for the Bernoulli bandit problem, i.e. when the rewards are either 0 or 1, and for arm i the probability of success (reward = 1) is μ_i. This description of Thompson Sampling follows closely that of [2]. Next, we propose a simple new extension of this algorithm to general reward distributions with support [0, 1], which will allow us to seamlessly extend our analysis for Bernoulli bandits to the general stochastic bandit problem.

The algorithm for Bernoulli bandits maintains Bayesian priors on the Bernoulli means μ_i. The beta distribution turns out to be a very convenient choice of priors for Bernoulli rewards. Let us briefly recall that beta distributions form a family of continuous probability distributions on the interval (0, 1). The pdf of Beta(α, β), the beta distribution with parameters α > 0, β > 0, is given by f(x; α, β) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}. The mean of Beta(α, β) is α/(α + β); and, as is apparent from the pdf, the larger α + β, the tighter the concentration of Beta(α, β) around the mean. The beta distribution is useful for Bernoulli rewards because if the prior is a Beta(α, β) distribution, then after observing a Bernoulli trial, the posterior distribution is simply Beta(α + 1, β) or Beta(α, β + 1), depending on whether the trial resulted in a success or a failure, respectively.
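For completeness, here is the standard one-line Bayes-rule check of this conjugacy (not specific to this paper): if the prior on a Bernoulli mean μ is Beta(α, β) and a play yields reward r ∈ {0, 1}, then

\[
p(\mu \mid r) \;\propto\; \mu^{r}(1-\mu)^{1-r}\cdot \mu^{\alpha-1}(1-\mu)^{\beta-1}
\;=\; \mu^{(\alpha+r)-1}(1-\mu)^{(\beta+1-r)-1},
\]

which is the density of Beta(α + r, β + 1 − r): Beta(α + 1, β) after a success and Beta(α, β + 1) after a failure.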

The Thompson Sampling algorithm initially assumes arm i to have prior Beta(1, 1) on μ_i, which is natural because Beta(1, 1) is the uniform distribution on (0, 1). At time t, having observed S_i(t) successes (reward = 1) and F_i(t) failures (reward = 0) in S_i(t) + F_i(t) plays of arm i, the algorithm updates the distribution on μ_i to Beta(S_i(t) + 1, F_i(t) + 1). The algorithm then samples from these posterior distributions of the μ_i's, and plays an arm according to the probability of its mean being the largest. We summarize the Thompson Sampling algorithm below.

For each arm i = 1, …, N, set S_i = 0, F_i = 0.
foreach t = 1, 2, … do
       For each arm i = 1, …, N, sample θ_i(t) from the Beta(S_i + 1, F_i + 1) distribution.
       Play arm i(t) := arg max_i θ_i(t) and observe reward r_t.
       If r_t = 1, then S_{i(t)} := S_{i(t)} + 1, else F_{i(t)} := F_{i(t)} + 1.
end foreach
Algorithm 1 Thompson Sampling for Bernoulli bandits
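For concreteness, here is a minimal Python sketch of Algorithm 1. It is only an illustrative reading of the pseudocode above, and the arm means and horizon in the usage comment are assumptions, not values from the paper.

import random

def thompson_sampling_bernoulli(mu, T, seed=0):
    """Minimal sketch of Algorithm 1 (Thompson Sampling for Bernoulli bandits).
    mu: list of true success probabilities (unknown to the algorithm).
    T: time horizon. Returns the total reward collected."""
    rng = random.Random(seed)
    N = len(mu)
    S = [0] * N  # observed successes per arm
    F = [0] * N  # observed failures per arm
    total_reward = 0
    for _ in range(T):
        # Sample theta_i from the Beta(S_i + 1, F_i + 1) posterior of each arm.
        theta = [rng.betavariate(S[i] + 1, F[i] + 1) for i in range(N)]
        i = max(range(N), key=lambda a: theta[a])   # play the arm with the largest sample
        r = 1 if rng.random() < mu[i] else 0        # Bernoulli reward with mean mu[i]
        if r == 1:
            S[i] += 1
        else:
            F[i] += 1
        total_reward += r
    return total_reward

# Illustrative usage with assumed means:
# print(thompson_sampling_bernoulli([0.9, 0.5], T=10000))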

We adapt the Bernoulli Thompson Sampling algorithm to the general stochastic bandit case, i.e. when the rewards for arm i are generated from an arbitrary unknown distribution with support [0, 1] and mean μ_i, in a way that allows us to reuse our analysis of the Bernoulli case. To our knowledge, this adaptation is new. We modify TS so that, after observing the reward r_t at time t, it performs a Bernoulli trial with success probability r_t. Let the random variable r̃_t denote the outcome of this Bernoulli trial, and let S_i(t) and F_i(t) denote the number of successes and failures, respectively, in these Bernoulli trials for arm i until time t. The remaining algorithm is the same as for Bernoulli bandits. Algorithm 2 gives the precise description of this algorithm.

We observe that the probability of observing a success (i.e., r̃_t = 1) in the Bernoulli trial after playing an arm i in the new generalized algorithm is equal to the mean reward μ_i. Let f_i denote the (unknown) pdf of the reward distribution for arm i. Then, on playing arm i,

Pr(r̃_t = 1) = ∫_0^1 Pr(r̃_t = 1 | r_t = r) f_i(r) dr = ∫_0^1 r f_i(r) dr = μ_i.

Thus, the probability of observing r̃_t = 1 is the same, and S_i(t), F_i(t) evolve in exactly the same way, as in the case of Bernoulli bandits with means μ_i. Therefore, the analysis of TS for the Bernoulli setting applies to this modified TS for the general setting. This allows us to replace, for the purpose of analysis, the general stochastic bandit problem with a Bernoulli bandit problem with the same means. We use this observation to confine the proofs in this paper to the case of Bernoulli bandits only.

For each arm i = 1, …, N, set S_i = 0, F_i = 0.
foreach t = 1, 2, … do
       For each arm i = 1, …, N, sample θ_i(t) from the Beta(S_i + 1, F_i + 1) distribution.
       Play arm i(t) := arg max_i θ_i(t) and observe reward r_t.
       Perform a Bernoulli trial with success probability r_t and observe output r̃_t.
       If r̃_t = 1, then S_{i(t)} := S_{i(t)} + 1, else F_{i(t)} := F_{i(t)} + 1.
end foreach
Algorithm 2 Thompson Sampling for general stochastic bandits
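Similarly, a minimal Python sketch of Algorithm 2; the only change from the previous sketch is the extra Bernoulli trial on the observed reward. The reward distributions in the usage comment are assumptions chosen just to exercise that step.

import random

def thompson_sampling_general(sample_reward, N, T, seed=0):
    """Minimal sketch of Algorithm 2 (Thompson Sampling for general rewards in [0, 1]).
    sample_reward(i) returns a random reward in [0, 1] for arm i."""
    rng = random.Random(seed)
    S = [0] * N
    F = [0] * N
    for _ in range(T):
        theta = [rng.betavariate(S[i] + 1, F[i] + 1) for i in range(N)]
        i = max(range(N), key=lambda a: theta[a])
        r = sample_reward(i)                        # real-valued reward in [0, 1]
        r_tilde = 1 if rng.random() < r else 0      # Bernoulli trial with success probability r
        if r_tilde == 1:
            S[i] += 1
        else:
            F[i] += 1
    return S, F

# Illustrative usage with assumed Beta-distributed rewards:
# arms = [(5, 2), (2, 5)]  # arm 0 has mean 5/7, arm 1 has mean 2/7
# print(thompson_sampling_general(lambda i: random.betavariate(*arms[i]), N=2, T=10000))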

1.3 Our results

In this article, we bound the finite-time expected regret of Thompson Sampling. From now on, we will assume that the first arm is the unique optimal arm, i.e., μ* = μ₁ > max_{i≠1} μ_i. Assuming that the first arm is an optimal arm is a matter of convenience for stating the results and for the analysis. The assumption of a unique optimal arm is also without loss of generality, since adding more arms with Δ_i = 0 can only decrease the expected regret; details of this argument are provided in Appendix A.

Theorem 1.

For the two-armed stochastic bandit problem (N = 2), the Thompson Sampling algorithm has expected regret

E[R(T)] = O( (ln T)/Δ + 1/Δ³ )

in time T, where Δ := μ₁ − μ₂.

Theorem 2.

For the N-armed stochastic bandit problem, the Thompson Sampling algorithm has expected regret

E[R(T)] = O( ( Σ_{a=2}^{N} 1/Δ_a² )² ln T )

in time T, where Δ_a := μ₁ − μ_a.

Remark 1.

For the N-armed bandit problem, we can obtain an alternate bound, with a better dependence on N than in Theorem 2 but a worse dependence on the gaps Δ_a, by a slight modification to the proof; that bound is stated in terms of Δ_min := min_{a≠1} Δ_a and Δ_max := max_{a≠1} Δ_a.

In the interest of readability, we use big-Oh notation (for two functions f(n), g(n), we write f(n) = O(g(n)) if there exist constants n₀ and c such that f(n) ≤ c·g(n) for all n ≥ n₀) to state our results. The exact constants are provided in the proofs of the above theorems. Let us contrast our bounds with previous work. [9] proved the following lower bound on the regret of any bandit algorithm:

E[R(T)] ≥ [ Σ_{i: μ_i < μ*} Δ_i / D(μ_i ‖ μ*) + o(1) ] ln T,

where D(μ_i ‖ μ*) denotes the KL divergence. They also gave algorithms asymptotically achieving this guarantee, though unfortunately their algorithms are not efficient. [1] gave the UCB1 algorithm, which is efficient and achieves the following bound:

E[R(T)] ≤ [ 8 Σ_{i: μ_i < μ*} (ln T)/Δ_i ] + (1 + π²/3) Σ_{i=1}^{N} Δ_i.

For many settings of the parameters, the bound of Auer et al. is not far from the lower bound of Lai and Robbins. Our bounds are optimal in terms of the dependence on T, but inferior in terms of the constant factors and the dependence on the Δ_i's. We note that for the two-armed case our bound closely matches the bound of [1]. For the N-armed setting, the exponent of the 1/Δ_i's in our bound is essentially 4, compared to the exponent 1 for UCB1.

More recently, [8] gave the Bayes-UCB algorithm, which achieves regret bounds close to the lower bound of [9] for Bernoulli rewards. Bayes-UCB is a UCB-like algorithm, where the upper confidence bounds are based on the quantiles of Beta posterior distributions. Interestingly, these upper confidence bounds turn out to be similar to those used by the algorithms in [3] and [10]. Bayes-UCB can be seen as a hybrid of TS and UCB. However, the general structure of the arguments used in [8] is similar to [1]; for the analysis of Thompson Sampling we need to deal with additional difficulties, as discussed in the next section.

2 Proof Techniques

In this section, we give an informal description of the techniques involved in our analysis. We hope that this will aid in reading the proofs, though this section is not essential for the sequel. We assume that all arms are Bernoulli arms, and that the first arm is the unique optimal arm. As explained in the previous sections, these assumptions are without loss of generality.

Main technical difficulties.

Thompson Sampling is a randomized algorithm which achieves exploration by playing the arm with the best sampled mean, among samples generated from beta distributions around the respective empirical means. The beta distribution becomes more and more concentrated around the empirical mean as the number of plays of an arm increases. This randomized setting is unlike the algorithms in the UCB family, which achieve exploration by adding to the observed empirical means a deterministic, non-negative bias that decreases with the number of plays. The analysis of TS poses difficulties that seem to require new ideas.

For example, the following general line of reasoning is used to analyze the regret of UCB-like algorithms in the two-armed setting (for example, in [1]): once the second arm has been played a sufficient number of times, its empirical mean is tightly concentrated around its actual mean. If the first arm has been played a sufficiently large number of times by then, it will have an empirical mean close to its actual mean and larger than that of the second arm. Otherwise, if it has been played a small number of times, its non-negative bias term will be large. Consequently, once the second arm has been played a sufficient number of times, it will be played with very small probability (inverse polynomial in time), regardless of the number of times the first arm has been played so far.
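For concreteness, the deterministic bias referred to here is, in the UCB1 algorithm of [1], the standard confidence radius added to the empirical mean (quoted for reference; \hat{\mu}_i(t) denotes the empirical mean of arm i and k_i(t) its number of plays so far):

\[
\text{UCB1 index of arm } i \text{ at time } t:\qquad \hat{\mu}_i(t) \;+\; \sqrt{\frac{2\ln t}{k_i(t)}} .
\]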

However, for Thompson Sampling, if the number of previous plays of the first arm is small, then the probability of playing the second arm could be as large as a constant, even if the second arm has already been played a large number of times. For instance, if the first arm has not been played at all, then θ₁(t) is a uniform random variable on [0, 1], and thus θ₁(t) ≤ μ₂ with probability μ₂. As a result, in our analysis we need to carefully consider the distribution of the number of previous plays of the first arm, in order to bound the probability of playing the second arm.

The observation just mentioned also points to a challenge in extending the analysis of TS for the two-armed bandit to the general N-armed bandit setting. One might consider analyzing the regret in the N-armed case by considering only two arms at a time, namely the first arm and one of the suboptimal arms. We could use the observation that the probability of playing a suboptimal arm is bounded by the probability of its sampled mean exceeding that of the first arm. However, this probability also depends on the number of previous plays of the two arms, which in turn depends on the plays of the other arms. Again, [1], in their analysis of the UCB algorithm, overcome this difficulty by bounding this probability for all possible numbers of previous plays of the first arm, and for large enough numbers of plays of the suboptimal arm. For Thompson Sampling, due to the observation made earlier, the (distribution of the) number of previous plays of the first arm needs to be carefully accounted for, which in turn requires considering all the arms at the same time, thereby leading to a more involved analysis.

Proof outline for the two-armed setting.

Let us first consider the special case of two arms, which is simpler than the general N-armed case. Firstly, we note that it is sufficient to bound the regret incurred during the time steps after the second arm has been played L times, where L is of order (ln T)/Δ². The expected regret before this event is bounded by LΔ = O((ln T)/Δ), because only the plays of the second arm produce an expected regret of Δ; the regret is 0 when the first arm is played. Next, we observe that after the second arm has been played L times, the following happens with high probability: the empirical average reward of the second arm is very close to its actual expected reward μ₂, and its beta distribution is tightly concentrated around μ₂. This means that, thereafter, the first arm is played at time t whenever θ₁(t) turns out to be greater than (roughly) μ₂. This observation allows us to model the number of steps between two consecutive plays of the first arm as a geometric random variable with parameter close to Pr[θ₁(t) > μ₂]. To be more precise, given that there have been j plays of the first arm with s(j) successes and j − s(j) failures, we want to estimate the expected number of steps before the first arm is played again (not including the steps in which the first arm is played). This is modeled by a geometric random variable X(j, s(j), y), where θ₁ is distributed as Beta(s(j)+1, j−s(j)+1), the threshold y is slightly larger than μ₂, and the success probability is Pr[θ₁ > y]. To bound the overall expected number of steps between the j-th and (j+1)-th plays of the first arm, we need to take into account the distribution of the number of successes s(j). For large j, we use Chernoff–Hoeffding bounds to say that with high probability s(j) is close to jμ₁, and moreover θ₁ is concentrated around its mean, and thus we get a good estimate of E[X(j, s(j), y)]. However, for small j we do not have such concentration, and it requires a delicate computation to get a bound on E[X(j, s(j), y)]. The resulting bound on the expected number of steps between consecutive plays of the first arm bounds the expected number of plays of the second arm, to yield a good bound on the regret for the two-armed setting.

Proof outline for the N-armed setting.

At any step t, we divide the set of suboptimal arms into two subsets: saturated and unsaturated. The set of saturated arms at time t consists of arms a that have already been played a sufficient number L_a of times, where L_a is of order (ln T)/Δ_a², so that with high probability θ_a(t) is tightly concentrated around μ_a. As earlier, we try to estimate the number of steps between two consecutive plays of the first arm. After the j-th play, the (j+1)-th play of the first arm occurs at the earliest time t at which θ₁(t) exceeds the sampled means of all the other arms. The number of steps before θ₁(t) is greater than θ_a(t) of all saturated arms a can be closely approximated using a geometric random variable with parameter close to Pr[θ₁(t) > max_a μ_a], as before. However, even if θ₁(t) is greater than the θ_a(t) of all saturated arms a, the first arm may not get played, due to the play of an unsaturated arm u with a greater θ_u(t). Call this event an "interruption" by unsaturated arms. We show that if there have been j plays of the first arm with s(j) successes, the expected number of steps until the (j+1)-th play can be upper bounded by the product of the expected value of a geometric random variable similar to X(j, s(j), y) defined earlier, and the number of interruptions by the unsaturated arms. Now, the total number of interruptions by unsaturated arms is bounded by Σ_{a≠1} L_a (since an arm a becomes saturated after L_a plays). The actual number of interruptions is hard to analyze due to the high variability in the parameters of the unsaturated arms. We derive our bound assuming the worst-case allocation of these interruptions. This step in the analysis is the main source of the high exponent of the 1/Δ_a's in our regret bound for the N-armed case compared to the two-armed case.

3 Regret bound for the two-armed bandit problem

In this section, we present a proof of Theorem 1, our result for the two-armed bandit problem. Recall our assumption that all arms have Bernoulli distribution on rewards, and that the first arm is the unique optimal arm.

Let random variable j₀ denote the number of plays of the first arm until L plays of the second arm, for a threshold L of order (ln T)/Δ². Let random variable t_j denote the time step at which the j-th play of the first arm happens (we define t₀ := 0). Also, let random variable Y_j measure the number of time steps between the j-th and (j+1)-th plays of the first arm (not counting the steps in which the j-th and (j+1)-th plays happened), and let s(j) denote the number of successes in the first j plays of the first arm. Then the expected number of plays of the second arm in time T is bounded by

E[k₂(T)] ≤ L + E[ Σ_{j=j₀}^{T−1} Y_j ].

To understand the expectation of Y_j, it will be useful to define another random variable X(j, s, y) as follows. We perform the following experiment until it succeeds: check whether a Beta(s+1, j−s+1) distributed random variable exceeds a threshold y. For each experiment, we generate the beta-distributed r.v. independently of the previous ones. Now define X(j, s, y) to be the number of trials before the experiment succeeds. Thus, X(j, s, y) takes non-negative integer values, and is a geometric random variable with parameter (success probability) 1 − F^{beta}_{s+1, j−s+1}(y). Here F^{beta}_{α, β}(·) denotes the cdf of the beta distribution with parameters α, β. Also, let F^B_{n, p}(·) denote the cdf of the binomial distribution with parameters n, p.
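As a sanity check on this construction, the short Python sketch below simulates X(j, s, y) and compares its empirical mean with the closed form 1/F^B_{j+1,y}(s) − 1 given in Lemma 1 below; the particular values of j, s, y and the number of samples are arbitrary assumptions.

import math
import random

def binom_cdf(n, p, s):
    """F^B_{n,p}(s) = Pr[Binomial(n, p) <= s], computed directly from the pmf."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s + 1))

def sample_X(j, s, y, rng):
    """One sample of X(j, s, y): the number of independent Beta(s+1, j-s+1)
    draws that fail to exceed y before one finally does."""
    failures = 0
    while rng.betavariate(s + 1, j - s + 1) <= y:
        failures += 1
    return failures

rng = random.Random(0)
j, s, y, n_samples = 20, 15, 0.6, 100000          # assumed illustrative values
empirical = sum(sample_X(j, s, y, rng) for _ in range(n_samples)) / n_samples
closed_form = 1.0 / binom_cdf(j + 1, y, s) - 1.0  # formula from Lemma 1
print(empirical, closed_form)                     # the two numbers should be close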

We will relate X(j, s, y) and Y_j shortly. The following lemma provides a handle on the expectation of X(j, s, y).

Lemma 1.

For all non-negative integers j, s with s ≤ j, and for all y ∈ [0, 1],

E[X(j, s, y)] = 1/F^B_{j+1, y}(s) − 1,

where F^B_{j+1, y}(·) denotes the cdf of the binomial distribution with parameters j + 1 and y.

Proof.

By the well-known formula for the expectation of a geometric random variable and the definition of X(j, s, y), we have E[X(j, s, y)] = 1/(1 − F^{beta}_{s+1, j−s+1}(y)) − 1. (The additive −1 is there because we do not count the final step, in which the beta r.v. is greater than y.) The lemma then follows from Fact 1 in Appendix B. ∎
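For reference, the relation between the beta and binomial cdfs used in this step is the standard identity for integer parameters (Fact 1 in Appendix B states a version of it):

\[
F^{beta}_{s+1,\,j-s+1}(y) \;=\; 1 - F^{B}_{j+1,\,y}(s)
\qquad \text{for all integers } 0 \le s \le j \text{ and } y \in [0,1],
\]

so the success probability of the geometric variable X(j, s, y) is exactly F^B_{j+1, y}(s).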

Recall that Y_j was defined as the number of steps between the j-th and (j+1)-th plays of the first arm. Now, consider the number of steps before the event θ₁(t) > y happens for the first time after the j-th play of the first arm, for a threshold y ∈ (μ₂, μ₁) to be chosen later. Given s(j) = s, this number has the same distribution as X(j, s, y). However, Y_j can be larger than this number if (and only if) at some time step t between the j-th and (j+1)-th plays of the first arm, θ₂(t) > y. In that case we use the fact that Y_j is always bounded by T. Thus, for any j, we can bound E[Y_j] as

E[Y_j] ≤ E[ X(j, s(j), y) ] + E[ T · 1( θ₂(t) > y for some t between the j-th and (j+1)-th plays of the first arm ) ].

Here the notation 1(E) is the indicator for event E, i.e., its value is 1 if event E happens and 0 otherwise. In the first term of the RHS, the expectation is over the distribution of s(j) as well as over the distribution of the geometric variable X(j, s(j), y). Since we are interested only in the terms with j ≥ j₀, we will instead use the similarly obtained bound on E[Y_j · 1(j ≥ j₀)],

This gives

E[ Σ_{j=j₀}^{T−1} Y_j ] ≤ Σ_{j=0}^{T−1} E[ X(j, s(j), y) · 1(j ≥ j₀) ] + Σ_{j=0}^{T−1} E[ T · 1(j ≥ j₀) · 1( θ₂(t) > y for some t between the j-th and (j+1)-th plays ) ]
                        ≤ Σ_{j=0}^{T−1} E[ X(j, s(j), y) · 1(j ≥ j₀) ] + Σ_{t=1}^{T} T · Pr( k₂(t) ≥ L, θ₂(t) > y ).

The last inequality holds because, for any time step t between the j-th and (j+1)-th plays of the first arm with j ≥ j₀, by definition k₂(t) ≥ L. We denote by E(t) the event that k₂(t) < L or θ₂(t) ≤ y. In words, this is the event that if a sufficient number of plays of the second arm have happened until time t, then θ₂(t) is not much larger than μ₂; intuitively, we expect this to be a high-probability event, as we will show. Its complement ¬E(t) is the event used in the above equation. Next, we bound E[ Σ_j X(j, s(j), y) · 1(j ≥ j₀) ] and Pr(¬E(t)).

Lemma 2.
Proof.

Refer to Appendix C.1. ∎

Lemma 3.

Consider any positive y < μ₁, and let Δ′ := μ₁ − y. Also, let R := μ₁(1 − y)/(y(1 − μ₁)), and let D denote the KL divergence between y and μ₁, i.e.

D := y ln(y/μ₁) + (1 − y) ln((1 − y)/(1 − μ₁)),

where the outer expectation is taken over s(j) distributed as Binomial(j, μ₁).

Proof.

The complete proof of this lemma is included in Appendix C.2; here we provide some high level ideas.

Using Lemma 1, the expected value of X(j, s(j), y) for any given s(j) is

E[ X(j, s(j), y) | s(j) ] = 1/F^B_{j+1, y}(s(j)) − 1.

For large j, we use Chernoff–Hoeffding bounds to argue that, with high probability, s(j) will be greater than jy. And, for such values of s(j), we can show that the probability F^B_{j+1, y}(s(j)) is close to 1, again using Chernoff–Hoeffding bounds. These observations allow us to derive that E[X(j, s(j), y)] is very small for large j.

For small j, the argument is more delicate. In this case, s(j) could be small with a significant probability. More precisely, s(j) could take a value s smaller than ⌈(j+1)y⌉ with binomial probability f^B_{j, μ₁}(s), the probability mass that Binomial(j, μ₁) puts on s. For such s, we use a lower bound on F^B_{j+1, y}(s), and then bound the ratio f^B_{j, μ₁}(s)/F^B_{j+1, y}(s) in terms of R and the KL divergence D. For s ≥ ⌈(j+1)y⌉, we use the observation that, since ⌈(j+1)y⌉ is greater than or equal to the median of Binomial(j+1, y) (see [7]), we have F^B_{j+1, y}(s) ≥ 1/2. After some algebraic manipulations, we get the result of the lemma. ∎
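For reference, the Chernoff–Hoeffding bound invoked in this sketch is the standard one for sums of independent [0, 1]-valued random variables: if \hat{\mu}_j denotes the empirical mean of j i.i.d. rewards with mean μ, then for any δ > 0,

\[
\Pr\bigl(\hat{\mu}_j \le \mu - \delta\bigr) \;\le\; e^{-2j\delta^{2}},
\qquad
\Pr\bigl(\hat{\mu}_j \ge \mu + \delta\bigr) \;\le\; e^{-2j\delta^{2}} .
\]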

Using Lemma 2, and Lemma 3 with the choices y = μ₂ + Δ/2 and Δ′ = Δ/2, we can bound the expected number of plays of the second arm as:

(1)

where the last inequality is obtained after some algebraic manipulations; details are provided in Appendix C.3.

This gives a regret bound of E[R(T)] = Δ · E[k₂(T)] = O( (ln T)/Δ + 1/Δ³ ), which proves Theorem 1.

4 Regret bound for the N-armed bandit problem

In this section, we prove Theorem 2, our result for the N-armed bandit problem. Again, we assume that all arms have Bernoulli distributions on rewards, and that the first arm is the unique optimal arm.

At every time step t, we divide the set of suboptimal arms into saturated and unsaturated arms. We say that an arm a is in the saturated set at time t if it has been played at least L_a times before time t, where L_a is a threshold of order (ln T)/Δ_a². We bound the regret due to playing unsaturated and saturated suboptimal arms separately. The former is easily bounded, as we will see; most of the work is in bounding the latter. For this, we bound the number of plays of saturated arms between two consecutive plays of the first arm.
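One compact way to write this definition (the symbol C(t) for the saturated set is our shorthand here, introduced only for readability; L_a is the per-arm threshold mentioned above):

\[
C(t) \;:=\; \bigl\{\, a \neq 1 \;:\; k_a(t) \ge L_a \,\bigr\},
\qquad L_a = \Theta\!\left(\frac{\ln T}{\Delta_a^{2}}\right),
\]

so an arm is saturated at time t once it has been played at least L_a times, and unsaturated otherwise.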

Figure 1: Interval

In the following, by an interval of time we mean a set of contiguous time steps. Let r.v. denote the interval between (and excluding) the and plays of the first arm. We say that event holds at time , if exceeds of all the saturated arms, i.e.,

(2)

For such that is empty, we define to hold trivially.

Let r.v. denote the number of occurrences of event in interval :

(3)

Events divide into sub-intervals in a natural way: For to , let r.v. denote the sub-interval of between the and occurrences of event in (excluding the time steps in which event occurs). We also define and : If then denotes the sub-interval in before the first occurrence of event in ; and denotes the sub-interval in after the last occurrence of event in . For we have .

Figure 1 shows an example of interval along with sub-intervals ; in this figure .

Observe that since a saturated arm can be played at step only if is greater than , saturated arm can be played at a time step (i.e., at a time step where holds) only if . Let us define event as

Then, the number of plays of saturated arms in interval is at most

In words, denotes the event that all saturated arms have tightly concentrated around their means. Intuitively, from the definition of saturated arms, should hold with high probability; we prove this in Lemma 4.

We are interested in bounding regret due to playing saturated arms, which depends not only on the number of plays, but also on which saturated arm is played at each time step. Let denote the number of steps in , for which is the best saturated arm, i.e.

(4)

(resolve the ties for best saturated arm using an arbitrary, but fixed, ordering on arms). In Figure 1, we illustrate this notation by showing steps for interval . In the example shown, we assume that , and that the suboptimal arms got added to the saturated set in order , so that initially is the best saturated arm, then is the best saturated arm, and finally is the best saturated arm.

Recall that holds trivially for all such that is empty. Therefore, there is at least one saturated arm at all , and hence are well defined and cover the interval ,

Next, we will show that the regret due to playing a saturated arm at a time step in one of the steps is at most . The idea is that if all saturated arms have their tightly concentrated around their means , then either the arm with the highest mean (i.e., the best saturated arm ) or an arm with mean very close to will be chosen to be played during these steps. That is, if a saturated arm is played at a time among one of the steps, then, either is violated, i.e. for some saturated arm is not close to its mean, or

which implies that

(5)

Therefore, regret due to play of a saturated arm at a time in one of the steps is at most . With slight abuse of notation let us use to indicate that is one of the steps in . Then, the expected regret due to playing saturated arms in interval is bounded as

(6)

The following lemma will be useful for bounding the second term on the right hand side in the above equation (as shown in the complete proof in Appendix D).

Lemma 4.

For all ,

Also, for all , and ,

Proof.

Refer to Appendix C.4. ∎

The stronger bound given by the second statement of lemma above will be useful later in bounding the first term on the rhs of (6). For bounding that term, we establish the following lemma.

Lemma 5.

For all ,

(7)
Proof.

The key observation used in proving this lemma is that given a fixed value of , the random variable is stochastically dominated by random variable (defined earlier as a geometric variable denoting the number of trials before an independent sample from distribution exceeds ). A technical difficulty in deriving the inequality above is that the random variables and are not independent in general (both depend on the values taken by over the interval). This issue is handled through careful conditioning of the random variables on history. The details of the proof are provided in Appendix C.5. ∎
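For reference, the notion of stochastic dominance used in this argument is the standard one: a nonnegative random variable V is stochastically dominated by W if

\[
\Pr(V > x) \;\le\; \Pr(W > x) \quad \text{for all } x \ge 0,
\qquad\text{which implies}\qquad \mathbb{E}[V] \;\le\; \mathbb{E}[W],
\]

so controlling the expectation of the dominating geometric variable (as in Lemma 1) controls the expected length in question.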

Now using the above lemma the first term in (6) can be bounded by

We next show how to bound the first term in this equation; the second term will be dealt with in the complete proof in Appendix D.

Recall that denotes the number of occurrences of event in interval , i.e. the number of times in interval , was greater than of all saturated arms , and yet the first arm was not played. The only reasons the first arm would not be played at a time despite of are that either was violated, i.e. some saturated arm whose was not close to its mean was played instead; or some unsaturated arm with highest was played. Therefore, the random variables satisfy

Using Lemma 4, and the fact that an unsaturated arm a can be played at most L_a times before it becomes saturated, we obtain that

(8)

Note that is a r.v. (because of random ), and the above bound applies for all instantiations of this r.v.

Let . Then,

where

Note that is a random variable, which is completely determined by the instantiation of random sequence . Now, for the first term in above,