# A General Framework to Analyze Stochastic Linear Bandit

## Abstract

In this paper we study the well-known stochastic linear bandit problem where a decision-maker sequentially chooses among a set of given actions in $\mathbb{R}^d$, observes their noisy reward, and aims to maximize her cumulative expected reward over a horizon of length $T$. We introduce a general family of algorithms for the problem and prove that they are rate optimal. We also show that several well-known algorithms for the problem, such as optimism in the face of uncertainty linear bandit (OFUL) and Thompson sampling (TS), are special cases of our family of algorithms. Therefore, we obtain a unified proof of rate optimality for both of these algorithms. Our results cover both adversarial action sets (when actions are potentially selected by an adversary) and stochastic action sets (when actions are independently drawn from an unknown distribution). In terms of regret, our results apply to both Bayesian and worst-case regret settings. Our new unified analysis technique also yields a number of new results and solves two open problems known in the literature. Most notably, (1) we show that TS can incur a linear worst-case regret, unless it uses inflated (by a factor of $\sqrt{d}$) posterior variances at each step. This shows that the best known worst-case regret bound for TS, which is given by [4, 3] and is worse (by a factor of $\sqrt{d}$) than the best known Bayesian regret bound given by [16] for TS, is tight. This settles an open problem stated in [15]. (2) Our proof also shows that TS can incur a linear Bayesian regret if it does not use the correct prior or noise distribution. (3) Under a generalized gap assumption and a margin condition, as in [9], we obtain a poly-logarithmic (in $T$) regret bound for OFUL and TS in the stochastic setting. The result for OFUL resolves an open problem from [7].

## 1 Introduction

In a bandit problem, a decision-maker sequentially chooses actions from given action sets and receives rewards corresponding to the selected actions. The goal of the decision-maker, also known as the policy, is to maximize the cumulative reward by utilizing the history of previous observations. This paper considers a variant of this problem, called stochastic linear bandit, in which all actions are elements of $\mathbb{R}^d$ for some integer $d$ and the expected values of the rewards depend on the actions through a linear function. We also let the action sets change over time. The classical multi-armed bandit (MAB) and the $k$-armed contextual multi-armed bandit are special cases of this problem.

Since its introduction by [2], the linear bandit problem has attracted a great deal of attention. Several algorithms based on the idea of upper confidence bound (UCB), due to [11], have been proposed and analysed (notable examples are [5, 7, 14, 1]). The best known regret bound for these algorithms is which matches the existing lower bounds up to logarithmic factors [7, 14, 18, 13, 12]. The best known algorithm in this family is the optimism in the face of uncertainty linear bandit (OFUL) algorithm by [1].

A different line of research examines the performance of Thompson sampling (TS), a Bayesian heuristic due to [19] that employs the posterior distribution of the reward function to balance exploration and exploitation. TS is also known as posterior sampling. [16, 17] and [8] proved an upper bound for the Bayesian regret of TS, thereby indicating its near-optimality. The best known worst-case regret bound for TS, however, is given by [4, 3] and is worse than the previous bounds by a factor of $\sqrt{d}$. As stated in Section 8.2.1 of [15], it is an open question whether this extra factor can be eliminated by a more careful analysis.

In addition, when there is a gap between the expected rewards of the top two actions, OFUL and TS are shown to have a regret with a dependence in instead of . According to [7], it remains an open problem whether this result extends to the cases when can be . We defer to [12], and references therein, for a more thorough discussion.

On the other hand, in a subclass of the linear bandit problem known as linear $k$-armed contextual bandit, [9] considered a more general version of the gap assumption, a certain type of margin condition for the action set, and proposed a novel extension of the $\epsilon$-greedy algorithm. Their OLS Bandit algorithm explicitly allocates a fraction of rounds to each arm, and uses these forced samples to discard obviously sub-optimal arms. They show that this filtering approach can lead to a near-optimal regret bound that grows logarithmically in $T$. [6] adapted this idea to the setting with very large , and [10] extended that further to when both the number of arms and are large. In this paper, we demonstrate that a major generalization of this idea yields a unifying technique for analyzing all of the above algorithms, one that not only recovers known results in the literature but also yields a number of new results and, notably, solves the aforementioned two open problems. To be explicit, the main contributions of this paper are:

1. We propose a general family of algorithms, called Two-phase Bandit, for the stochastic linear bandit problem and prove that they are rate optimal. We also show that TS, OFUL, and OLS Bandit are special cases of this family. Therefore, we obtain a universal proof of rate optimality for all of these algorithms, in both Bayesian and worst-case regret settings.

2. We consider the same generalized gap assumption as in [9], that with positive probability, and obtain a poly-logarithmic (in ) gap-dependent regret bound for all of the above algorithms, when the action sets are independently drawn from an unknown distribution. To the best of our knowledge, this result is new for OFUL and TS.

3. We show that TS can incur a linear worst-case regret, unless it inflates the variance of the posterior by a factor of $\sqrt{d}$ at each step. Therefore, the best known worst-case regret bound for TS, given by [4, 3], is tight. This resolves the aforementioned open problem in [15].

4. Our proof also shows that TS is vulnerable (can incur a linear Bayesian and worst-case regret) if it uses an incorrect prior distribution for the unknown parameter vector of the linear reward function, or when it uses an incorrect noise distribution.

5. As a byproduct of our analyses in 3-4, we obtain a set of conditions under which (a) TS is rate optimal and (b) we can shrink the confidence sets of OFUL by a factor , without impacting its regret.

Organization. We start by introducing the notation and main assumptions in §2. In §3 we introduce the Two-phase Bandit algorithm and prove that it is rate optimal. In §4 we introduce the ROFUL algorithm, a special case of the Two-phase Bandit algorithm, and in §5 we show that OFUL and TS are special cases of the ROFUL algorithm. Finally, in §6, we prove that TS can incur linear regret in the worst-case or when it does not have correct information on the prior or the noise distribution.

## 2 Setting and notations

For any positive integer , we denote by . Letting be a positive semi-definite matrix, by we mean for any vector of suitable size. By a grouped linear bandit (GLB) problem, we mean a tuple where:

1. is the prior distribution of a parameter vector on .

2. consists of orthogonal -dimensional subspaces of . By abuse of notation, we write to also denote the projection matrix from onto .

3. are random compact subsets of .

4. are random objects passed to (randomized) policies to function.

5. is a sequence of independent mean-zero -sub-Gaussian random variables.

The main difference between our model and the common linear bandit formulation (for example, as defined in [1]) is the introduction of ’s. Loosely speaking, we can consider each as a copy of ; thus, . In this case, our assumption on the action sets requires each action to have non-zero entries in only one of the copies of . Our problem can be regarded as a -dimensional instance of the ordinary (un-grouped) linear bandit; however, we will see in the next sections that this additional structure lets us improve the regret bound by a factor of in the gap-dependent setting and a factor of in the gap-independent setting. Three interesting special cases of this model are:

1. When and for all , our problem reduces to a simple multi-armed bandit problem.

2. When , and each action set contains exactly copies of a vector (one in each ), the problem is called -armed contextual bandit.

3. When , we get the ordinary stochastic linear bandit problem.

The optimal and selected actions at time are denoted by and respectively. The corresponding reward is then revealed to the policy where . We denote the history of observations up to time by . More precisely, we define

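$$\mathcal{H}_0 := \emptyset \quad\text{and}\quad \mathcal{H}_t := \mathcal{H}_{t-1} \cup \big\{(\mathcal{X}_t, \widetilde{X}_t, Z_t, r_t)\big\}.$$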

In this model, a policy is formally defined as a deterministic function that maps to an element of . We emphasize that our definition of policies includes randomized policies, as well. Indeed, the random objects ’s are the source of randomness used by a policy.

The performance measure for evaluating the policies is the standard cumulative Bayesian regret defined as

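$$\operatorname{Regret}(T, \pi, P_{\Theta^\star}) := \sum_{t=1}^{T} \mathbb{E}\Big[\sup_{X \in \mathcal{X}_t} \langle \Theta^\star, X \rangle - \langle \Theta^\star, \widetilde{X}_t \rangle\Big].$$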

The expectation is taken with respect to the entire randomness in our model, including the prior distribution. Although we describe our setting in a Bayesian fashion, our results also cover the setting of a fixed parameter vector by defining the prior distribution to be the one with a point mass at .
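To make the regret definition concrete, the following minimal sketch accumulates the per-round expected regret of a policy for a fixed parameter vector (the point-mass prior case). The names `run`, `first_action_policy`, and `oracle_policy`, the bounded uniform noise, and the omission of the auxiliary random objects passed to randomized policies are illustrative assumptions, not part of the paper.

```python
import random

def inner(u, v):
    """Euclidean inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def run(theta_star, action_sets, policy, noise=0.1, seed=0):
    """Play `policy` on the given action sets and return its cumulative expected regret."""
    rng = random.Random(seed)
    history = []   # simplified history: (action set, chosen action, reward)
    regret = 0.0
    for actions in action_sets:
        chosen = policy(history, actions)
        # Noisy reward revealed to the policy (bounded noise is sub-Gaussian).
        reward = inner(theta_star, chosen) + rng.uniform(-noise, noise)
        history.append((actions, chosen, reward))
        best = max(inner(theta_star, x) for x in actions)
        regret += best - inner(theta_star, chosen)
    return regret

theta_star = (1.0, 0.0)                                   # hypothetical parameter
action_sets = [[(0.0, 1.0), (1.0, 0.0)] for _ in range(10)]

def first_action_policy(history, actions):
    return actions[0]                                     # always picks (0.0, 1.0)

def oracle_policy(history, actions):
    return max(actions, key=lambda x: inner(theta_star, x))

print(run(theta_star, action_sets, first_action_policy))  # 10.0
print(run(theta_star, action_sets, oracle_policy))        # 0.0
```

The first policy loses one unit of expected reward in each of the ten rounds, while the oracle that knows the parameter incurs no regret.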

### 2.1 Action sets

In the next sections, we derive our regret bounds for various types of action sets, thereby dedicating this subsection to the definitions and notations we will use for the action sets. We start off by defining the extremal points of an action set. {defn}[Extremal points] For an action set , define its extremal points to be all for which there are no and satisfying

$$X = \sum_{i=1}^{n} c_i Y_i \quad\text{and}\quad \sum_{i=1}^{n} c_i = 1.$$

The importance of this definition is that all the algorithms studied in this paper only choose extremal points in action sets. This observation implies that the rewards attained by any of these algorithms, and an action set , belong to the reward profile of defined by

$$\Pi_{\mathcal{X}} := \big\{\langle X, \Theta^\star \rangle : X \in \mathcal{E}_{\mathcal{X}}\big\}.$$

The maximum attainable reward and gap of an action set for the parameter vector are defined respectively as

For any , write

$$\mathcal{X}^{z} := \big\{X \in \mathcal{X} : \langle \Theta^\star, X \rangle \ge M_{\mathcal{X}} - z\big\}.$$
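As an illustration of the near-optimal subset just defined, here is a small sketch for a finite action set. The function names and the numerical values are hypothetical; the maximum attainable reward is taken to be the supremum of the expected reward over the action set.

```python
def inner(u, v):
    """Euclidean inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def max_reward(actions, theta_star):
    """Largest expected reward attainable in the action set."""
    return max(inner(theta_star, x) for x in actions)

def near_optimal(actions, theta_star, z):
    """Actions whose expected reward is within z of the maximum."""
    m = max_reward(actions, theta_star)
    return [x for x in actions if inner(theta_star, x) >= m - z]

theta_star = (1.0, 0.0)                          # hypothetical parameter
actions = [(1.0, 0.0), (0.9, 0.3), (0.2, 0.9)]   # hypothetical finite action set

print(near_optimal(actions, theta_star, 0.15))   # [(1.0, 0.0), (0.9, 0.3)]
print(near_optimal(actions, theta_star, 0.0))    # [(1.0, 0.0)]
```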

In the above notations, for the sake of simplicity, we may use subscript to refer to . For instance, by we mean . We now define a gapped problem as follows: {defn}[Gapped problem] We call a GLB problem gapped if for some the following inequality holds:

$$\mathbb{P}(\Delta_t \ge \delta) \ge q \qquad \text{for all } t \in [T]. \tag{1}$$

Moreover, for a fixed gap level , we let be the indicator of the event . {rem} Note that all problems are gapped for all and . This observation will help us obtain gap-independent bounds. {rem} Our notion of gap is more general than the well-known gap assumption in the literature (e.g., as in [1]) which always holds which means would be . We conclude this section by defining the near-optimal space, followed by the diversity condition and the margin condition. The two conditions will enable us to improve a term in our regret bound, which will appear in Eq. (6), to an expression that grows sub-linearly in . {defn}[Near-optimal space] Let be the smallest number such that there exists with and

$$\mathbb{P}\Big(\mathcal{X}_t^{\delta} \subseteq \bigoplus_{j \in I} V_j\Big) = 1 \qquad \text{for all } t \in [T].$$

Let us denote , and as before, we also treat as the projection of onto the subspace . {rem} The main purpose of this notion is to handle sub-optimal arms in the special case of -armed contextual bandit. One might harmlessly assume that (or equal to the identity function if viewed as an operator) and follow the rest of the paper. {defn}[Diversity condition] We say that a GLB problem satisfies the diversity condition with parameter if is independent of and

 λmin(W⊤\IEX⋆tX⋆t⊤⋅GtW)≥γmax    for all t∈[T]. (2)

{defn}[Margin condition] In a GLB problem, the margin condition holds if

$$\mathbb{P}(\Delta_t \le z) \le c_0 z^{\alpha} \qquad \text{for all } t \in [T] \text{ and } 0 \le z \le \delta, \tag{3}$$

where are two constants.

## 3 Two-phase bandit algorithm

In this section, we describe the Two-phase Bandit algorithm, which extends the OLS Bandit algorithm, introduced by [9] for the special case of the $k$-armed contextual bandit setting, to our more general grouped linear bandit problem. The Two-phase Bandit algorithm (presented in Algorithm 1) has two separate phases to deal with the exploration-exploitation trade-off. At each time , a forced-sampling rule either determines which arm to pull (i.e., when is an element of ) or refuses to pick one, in which case the best arm should be chosen by exploiting the information gathered thus far (i.e., when ). The forced-sampling rule is allowed to depend on the history as well as the current action set . The notation expresses this dependence explicitly.
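The control flow of one round can be sketched as follows: the forced-sampling rule is consulted first, and only when it declines does the exploitation phase run, where a coarse (blurry) filter feeds a final (vivid) choice. This is a toy instantiation for a two-armed bandit under illustrative assumptions: the helper names, the round-robin forced-sampling schedule, the margin of 0.5, and the bounded uniform noise are all hypothetical and are not Algorithm 1 itself.

```python
import random

def two_phase_step(history, arms, forced_rule, blurry, vivid):
    """One round: forced sampling if the rule fires, otherwise blurry then vivid."""
    forced = forced_rule(history, arms)
    if forced is not None:                 # forced-sampling phase
        return forced
    candidates = blurry(history, arms)     # coarse filter of clearly bad arms
    return vivid(history, candidates)      # final choice among the candidates

def mean_estimates(history, arms):
    """Empirical mean reward of each listed arm (0.0 if never pulled)."""
    sums = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    for arm, reward in history:
        if arm in sums:
            sums[arm] += reward
            counts[arm] += 1
    return {a: (sums[a] / counts[a] if counts[a] else 0.0) for a in arms}

def forced_rule(history, arms, rounds_per_arm=5):
    """Hypothetical rule: round-robin over arms in the first few rounds, then decline."""
    t = len(history)
    return arms[t % len(arms)] if t < rounds_per_arm * len(arms) else None

def blurry(history, arms, margin=0.5):
    """Keep arms whose estimate is within a constant margin of the best estimate."""
    est = mean_estimates(history, arms)
    best = max(est.values())
    return [a for a in arms if est[a] >= best - margin]

def vivid(history, candidates):
    """Greedy choice among the surviving candidates."""
    est = mean_estimates(history, candidates)
    return max(candidates, key=lambda a: est[a])

def simulate(true_means, horizon, seed=0):
    rng = random.Random(seed)
    arms = sorted(true_means)
    history, picks = [], []
    for _ in range(horizon):
        arm = two_phase_step(history, arms, forced_rule, blurry, vivid)
        reward = true_means[arm] + rng.uniform(-0.1, 0.1)   # bounded noise
        history.append((arm, reward))
        picks.append(arm)
    return picks

picks = simulate({0: 0.1, 1: 0.9}, horizon=50)
print(picks[:10])   # forced sampling alternates between the two arms
print(picks[10:])   # exploitation phase
```

With noise bounded by 0.1, the estimates after forced sampling separate the two arms by more than the margin, so the blurry step discards arm 0 and every exploitation round pulls arm 1.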

In the exploitation phase, two selectors are used to decide which action to choose. The blurry selector first selects a candidate set which is later passed to the vivid selector to pick a single action from. The idea behind this architecture is that the blurry selector eliminates all the actions that are suboptimal with a constant margin with high probability. The name “blurry” indicates the low accuracy of this selector. The vivid selector, on the other hand, should become more accurate as grows with high probability. The diversity condition is the assumption that plays a crucial role in proving this type of result. We now state and briefly discuss the assumptions that need to be met such that our results are valid. {assume}[Boundedness] There exist constants such that and for all and for all almost surely. The next assumption concerns how much regret the forced-sampling rule incurs to assure that the blurry selector works properly. {assume}[Forced-sampling cost] Let be the indicator function for the event . Then there exists some such that

$$\mathbb{E}\Big[\sum_{t \in [T]} \big(M_t - \langle \Theta^\star, \widetilde{X}_t \rangle\big) \cdot F_t\Big] \le c_1. \tag{4}$$

We are now ready to formalize what it means for the blurry selector to work properly. {assume}[Blurry selector bound] Let . There exists some such that for all , we have that

$$\mathbb{P}\big(\widetilde{\mathcal{X}}_t \nsubseteq \mathcal{X}_t^{\delta}\big) \le \frac{c_2}{t^2}. \tag{5}$$

The above assumptions are sufficient to prove a bound which scales with as . Furthermore, we will discuss in §4, that can be tuned such that the regret grows as . Furthermore, we can obtain even sharper regret bounds, under additional assumptions that will be stated below. For example, the vivid selector should satisfy certain properties. In order to highlight the main idea, we restrict our attention to a rather concrete scenario in which these properties hold and defer the most general cases to a longer version of the paper. More specifically, we assume that the vivid selector is a greedy selector defined as follows. {defn}[Greedy selector] Let be an estimator for . By the greedy selector with respect to , we mean a selector given by

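$$\mathsf{V}(\widetilde{\mathcal{X}}_t \mid \mathcal{H}_{t-1}) \in \arg\max_{X \in \widetilde{\mathcal{X}}_t} \langle \widetilde{\Theta}_{t-1}, X \rangle.$$

{defn}[Reasonableness] Let be fixed. For an estimator , define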

$$\Sigma_t := \Big(\lambda I + \sum_{j=1}^{t} \widetilde{X}_j \widetilde{X}_j^{\top}\Big)^{-1}$$

and

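$$\mathcal{C}_t := \Big\{\Theta \in \mathbb{R}^{kd} : \max_{i \in [k]} \big\|V_i(\Theta - \widetilde{\Theta}_t)\big\|_{\Sigma_t^{-1}} \le \rho\Big\}.$$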

The estimator is called reasonable if

$$\mathbb{P}\big(\Theta^\star \notin \mathcal{C}_t \ \text{ and } \ F_t = 0\big) \le \frac{c_3}{t^3},$$

for some .
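The quantities in this definition can be computed directly in a small case. The sketch below assumes a single group and two-dimensional actions, so that the confidence set is one ellipsoid and the matrix inverse has a closed form; the function names and the numbers are hypothetical.

```python
import math

def precision_2d(lam, chosen_actions):
    """lam*I + sum_j x_j x_j^T (the inverse of Sigma_t), stored as (a, b, c) for [[a, b], [b, c]]."""
    a, b, c = lam, 0.0, lam
    for x0, x1 in chosen_actions:
        a += x0 * x0
        b += x0 * x1
        c += x1 * x1
    return a, b, c

def sigma_2d(lam, chosen_actions):
    """Sigma_t, obtained with the closed-form inverse of a 2x2 matrix."""
    a, b, c = precision_2d(lam, chosen_actions)
    det = a * c - b * b
    return ((c / det, -b / det), (-b / det, a / det))

def in_confidence_set(theta, theta_hat, lam, chosen_actions, rho):
    """Check whether the Sigma_t^{-1}-weighted norm of theta - theta_hat is at most rho."""
    a, b, c = precision_2d(lam, chosen_actions)
    d0, d1 = theta[0] - theta_hat[0], theta[1] - theta_hat[1]
    norm_sq = a * d0 * d0 + 2 * b * d0 * d1 + c * d1 * d1
    return math.sqrt(norm_sq) <= rho

chosen = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # hypothetical past actions
print(sigma_2d(1.0, chosen))
print(in_confidence_set((1.0, 0.0), (0.0, 0.0), 1.0, chosen, rho=2.0))  # True
print(in_confidence_set((1.0, 0.0), (0.0, 0.0), 1.0, chosen, rho=1.5))  # False
```

Here the weighted norm equals the square root of 3, so the parameter lies inside the set for a radius of 2.0 but outside it for a radius of 1.5.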

{assume} The vivid selector is a greedy selector for a reasonable estimator for all .

Our last assumption demands the selected actions to be diverse in the near-optimal space. This condition is not a mere property of a model or a policy; it is a characteristic of a policy in combination with a model. {assume}[Linear expansion] We say that linear expansion holds if

$$\mathbb{P}\Big(\big\|W^{\top} \Sigma_t W\big\|_{\mathrm{op}} \ge \frac{c_5^2}{t}\Big) \le \frac{c_4}{t^2}$$

for some constants and all . Now, with all these assumptions, we are ready to state our main regret bound, which covers both the gap-independent and gap-dependent settings. The result also applies when the action sets are selected by an adversary. {thm} If Assumptions 3-3 hold, the cumulative regret of Algorithm 1 (denoted by policy ) satisfies the following inequality:

$$\operatorname{Regret}(T, \pi^{\mathrm{TP}}, P_{\Theta^\star}) \le c_1 + 4x\theta c_2 + T\delta(1 - q). \tag{6}$$

Furthermore, under Assumptions 3-3 and the margin condition (3), we have

$$\operatorname{Regret}(T, \pi^{\mathrm{TP}}, P_{\Theta^\star}) \le c_1 + 4x\theta c_2 + 2\delta(c_3 + c_4) + c_0 (2x\rho c_5)^{\alpha+1} \Big(1 + \int_1^T t^{-\frac{\alpha+1}{2}}\, dt\Big). \tag{7}$$

{rem} Theorem 3 provides a gap-independent bound, Eq. (6), as well as a gap-dependent bound, Eq. (7). For the special case of OLS Bandit, when , the latter produces an bound on the regret. Note that we can follow the same peeling argument as in [9, 6] and obtain an for OLS Bandit. However, this peeling argument may not apply to the more general algorithms that we analyze in the next sections.
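The last factor in the gap-dependent bound is what drives the dependence on the horizon. The sketch below evaluates it in closed form and compares it with the sum it dominates in the proof; the function names are hypothetical. For a margin exponent of 1 the integral equals the logarithm of the horizon, for an exponent of 0 it grows like the square root of the horizon, and for exponents above 1 it stays bounded.

```python
import math

def integral_term(T, alpha):
    """1 + integral from 1 to T of t^{-(alpha+1)/2} dt, in closed form."""
    s = (alpha + 1) / 2
    if s == 1:
        return 1 + math.log(T)
    return 1 + (T ** (1 - s) - 1) / (1 - s)

def sum_term(T, alpha):
    """sum_{t=1}^{T} t^{-(alpha+1)/2}, the quantity bounded by integral_term."""
    s = (alpha + 1) / 2
    return sum(t ** (-s) for t in range(1, T + 1))

for alpha in (0, 0.5, 1, 2):
    print(alpha, round(sum_term(1000, alpha), 3), round(integral_term(1000, alpha), 3))
```

Since the summand is decreasing in the round index, the sum from the second term onward never exceeds the integral, which is why the additive 1 suffices.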

###### Proof of Theorem 3.

We split the regret of the algorithm into the following three cases. We will then bound each term separately.

1. Forced-sampling phase ,

2. When and , where is the indicator function for ,

3. When and .

By and , we denote the regrets of the above items up to time . Clearly, we have that

$$\operatorname{Regret}(T, \pi^{\mathrm{TP}}, P_{\Theta^\star}) = R^{(a)}_T + R^{(b)}_T + R^{(c)}_T.$$

Assumption 3 gives us an upper bound for

$$R^{(a)}_T \le c_1.$$

Next, notice that the maximum regret that can occur in each round is bounded above by

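$$\sup_{X, X' \in \mathcal{X}_t} \big|\langle X - X', \Theta^\star \rangle\big| \le 2 \|\Theta^\star\|_2 \cdot \sup_{X \in \mathcal{X}_t} \|X\|_2 \le 2x\theta.$$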

It follows from the definition of and the inequality (5) in Assumption 3 that the number of times that and is controlled by

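$$\mathbb{E}\Big[\sum_{t=1}^{T} (1 - F_t) B_t^{\mathsf{B}}\Big] = \sum_{t=1}^{T} \mathbb{P}\big(B_t^{\mathsf{B}} = 1 \text{ and } F_t = 0\big) \le \sum_{t=1}^{T} \frac{c_2}{t^2} \le 2c_2.$$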

It thus entails that

$$R^{(b)}_T \le 2x\theta\, \mathbb{E}\Big[\sum_{t=1}^{T} (1 - F_t) B_t^{\mathsf{B}}\Big] \le 4x\theta c_2.$$

From the definition of , we can infer that whenever and , the regret at time cannot exceed , and in the case that the action set is gapped (at this level ), the regret is equal to zero. Using this observation, we get that

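$$\begin{aligned}
R^{(c)}_T &= \mathbb{E}\Big[\sum_{t=1}^{T} (M_t - r_t) \cdot (1 - B_t^{\mathsf{B}})(1 - F_t)\Big] \\
&= \mathbb{E}\Big[\sum_{t=1}^{T} (M_t - r_t) \cdot (1 - B_t^{\mathsf{B}})(1 - F_t)(1 - G_t)\Big] \\
&\le \delta \sum_{t=1}^{T} \mathbb{E}\big[(1 - B_t^{\mathsf{B}})(1 - F_t)(1 - G_t)\big] \\
&\le \delta \sum_{t=1}^{T} \mathbb{E}\big[1 - G_t\big] \\
&= \delta \sum_{t=1}^{T} \mathbb{P}(G_t = 0) \\
&\le T\delta(1 - q),
\end{aligned}$$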

which completes the proof of (6).

In order to prove (7), we use Assumptions 3-3 and the margin condition to bound in a different way. Define

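$$B_t^{\mathsf{V}} := \begin{cases} 1 & \text{if } \Theta^\star \notin \mathcal{C}_t \text{ or } \big\|W^{\top} \Sigma_t W\big\|_{\mathrm{op}} \ge \frac{c_5^2}{t}, \\ 0 & \text{otherwise.} \end{cases}$$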

The key idea in bounding is that under , ; thereby, we get, with probability 1,

$$\widetilde{\mathcal{X}}_t \subseteq W.$$

Hence, we have that

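$$\begin{aligned}
R^{(c)}_T &= \mathbb{E}\Big[\sum_{t=1}^{T} \langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle \cdot (1 - B_t^{\mathsf{B}})(1 - F_t)\Big] \\
&= \mathbb{E}\Big[\sum_{t=1}^{T} \langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle \cdot (1 - B_t^{\mathsf{B}})(1 - F_t) \cdot \big[B_t^{\mathsf{V}} + (1 - B_t^{\mathsf{V}})\big]\Big] \\
&\le \mathbb{E}\Big[\sum_{t=1}^{T} \langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle \cdot (1 - B_t^{\mathsf{B}})(1 - B_t^{\mathsf{V}})\Big] + \mathbb{E}\Big[\sum_{t=1}^{T} \delta \cdot (1 - F_t) B_t^{\mathsf{V}}\Big] \\
&\le \mathbb{E}\Big[\sum_{t=1}^{T} \langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle \cdot (1 - B_t^{\mathsf{B}})(1 - B_t^{\mathsf{V}})\Big] + \delta \sum_{t=1}^{T} \mathbb{P}\big(B_t^{\mathsf{V}} = 1, F_t = 0\big) \\
&\le \mathbb{E}\Big[\sum_{t=1}^{T} \langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle \cdot (1 - B_t^{\mathsf{B}})(1 - B_t^{\mathsf{V}})\Big] + 2\delta(c_3 + c_4).
\end{aligned}$$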

Note that, whenever and , we have

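$$\begin{aligned}
\langle X_t^\star - \widetilde{X}_t, \Theta^\star \rangle &= \langle X_t^\star, \Theta^\star - \widetilde{\Theta}_t^{\mathsf{V}} \rangle + \langle X_t^\star - \widetilde{X}_t, \widetilde{\Theta}_t^{\mathsf{V}} \rangle + \langle \widetilde{X}_t, \widetilde{\Theta}_t^{\mathsf{V}} - \Theta^\star \rangle \\
&\le \langle X_t^\star, \Theta^\star - \widetilde{\Theta}_t^{\mathsf{V}} \rangle + \langle \widetilde{X}_t, \widetilde{\Theta}_t^{\mathsf{V}} - \Theta^\star \rangle \\
&\le \big(\|X_t^\star\|_2 + \|\widetilde{X}_t\|_2\big) \cdot \big\|\Theta^\star - \widetilde{\Theta}_t^{\mathsf{V}}\big\|_2 \\
&\le 2x \cdot \big\|\Theta^\star - \widetilde{\Theta}_t^{\mathsf{V}}\big\|_2 \\
&\le 2x c_5 \sqrt{\tfrac{1}{t}} \cdot \big\|\Theta^\star - \widetilde{\Theta}_t^{\mathsf{V}}\big\|_{\Sigma_t^{-1}} \\
&\le 2x\rho c_5 \sqrt{\tfrac{1}{t}}.
\end{aligned}$$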

As in the proof of (6), the regret would be zero if is larger than the above. This, in turn, implies that

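$$\begin{aligned}
R^{(c)}_T &\le \sum_{t=1}^{T} 2x\rho c_5 \sqrt{\tfrac{1}{t}} \cdot \mathbb{P}\Big(\Delta_t \le 2x\rho c_5 \sqrt{\tfrac{1}{t}}\Big) + 2\delta(c_3 + c_4) \\
&\le \sum_{t=1}^{T} 2x\rho c_5 \sqrt{\tfrac{1}{t}} \cdot c_0 \Big(2x\rho c_5 \sqrt{\tfrac{1}{t}}\Big)^{\alpha} + 2\delta(c_3 + c_4) \\
&= c_0 (2x\rho c_5)^{\alpha+1} \sum_{t=1}^{T} t^{-\frac{\alpha+1}{2}} + 2\delta(c_3 + c_4) \\
&\le c_0 (2x\rho c_5)^{\alpha+1} \Big(1 + \int_1^T t^{-\frac{\alpha+1}{2}}\, dt\Big) + 2\delta(c_3 + c_4),
\end{aligned}$$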

which is the desired result. ∎

## 4 Randomized OFUL

In this section, we present an extension of the OFUL algorithm of [1], and prove that under mild conditions it enjoys the same regret bound as the original OFUL. We call this extension Randomized OFUL (or ROFUL) and present its pseudo-code in Algorithm 2. It receives an arbitrary estimator , and at each round, it makes a greedy decision using this estimator. We require this estimator to be reasonable (Definition 3) and optimistic (Definition 4), and in Theorem 4, we use these assumptions to provide our regret bounds for this algorithm. We now define optimism. Recall that we defined

$$M_t = \sup_{X \in \mathcal{X}_t} \langle X, \Theta^\star \rangle.$$

Similarly, we write

{defn}[Optimism] We say that the estimator is optimistic if for some we have

$$\mathbb{P}\Big(\widetilde{M}_t \ge M_t - \frac{\delta}{4} \,\Big|\, \mathcal{X}_t, \mathcal{H}_{t-1}, T_t = 1\Big) \ge p \tag{8}$$

with probability at least where is a fixed constant and is the indicator function for the typical event .

{thm} For any reasonable (Definition 3) and optimistic (Definition 4) estimator, the corresponding ROFUL algorithm admits the following regret bound:

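$$\operatorname{Regret}(T, \pi^{\mathrm{ROFUL}}, P_{\Theta^\star}) \le \frac{128 \rho^2 k d}{\delta p} \log\Big(\lambda + \frac{T x^2}{d}\Big) + T\delta(1 - q) + 4x\theta(c_3 + c_6). \tag{9}$$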

Furthermore, if the diversity condition (2) and the margin condition (3) also hold, we have the following bound:

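$$\operatorname{Regret}(T, \pi^{\mathrm{ROFUL}}, P_{\Theta^\star}) \le \frac{128 \rho^2 k d\, x\theta}{\delta^2 p} \log\Big(\lambda + \frac{T x^2}{d}\Big) + \frac{4 (x\theta)^2 (c_3 + c_6)}{\delta} + c_0 \Big(\frac{30 x \rho}{\sqrt{\gamma_{\min}}}\Big)^{\alpha+1} \Big(1 + \int_1^T t^{-\frac{\alpha+1}{2}}\, dt\Big). \tag{10}$$

###### Proof.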

We introduce some additional notation to keep the expressions in this proof short and readable. Define:

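1. Well-posed action set indicator:

 $$W_t := \mathbb{I}\Big(\mathbb{P}\big(\widetilde{M}_t \ge M_t - \delta/4 \,\big|\, \mathcal{X}_t, \mathcal{H}_{t-1}, T_t = 1\big) \ge p\Big).$$

2. Upper and lower confidence bounds:

 $$U_t(X) := \sup_{\Theta \in \mathcal{C}_{t-1}} \langle X, \Theta \rangle, \qquad L_t(X) := \inf_{\Theta \in \mathcal{C}_{t-1}} \langle X, \Theta \rangle.$$

3. Acceptance threshold:

 $$A_t := \sup_{X \in \mathcal{X}_t} L_t(X).$$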
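For intuition, the confidence bounds and the acceptance threshold have a closed form when the confidence set is a single ellipsoid. The sketch below assumes one group and two-dimensional actions, in which case the supremum and infimum of the inner product over the ellipsoid equal the plug-in value plus or minus the radius times the covariance-weighted norm of the action (by the Cauchy-Schwarz inequality). All names and numbers are hypothetical.

```python
import math

def inner(u, v):
    """Euclidean inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def sigma_norm(x, sigma):
    """sqrt(x^T Sigma x) for a 2x2 matrix given as nested tuples."""
    (s00, s01), (s10, s11) = sigma
    return math.sqrt(x[0] * (s00 * x[0] + s01 * x[1]) + x[1] * (s10 * x[0] + s11 * x[1]))

def confidence_bounds(x, theta_hat, sigma, rho):
    """Upper and lower bounds of <x, theta> over the ellipsoid centred at theta_hat."""
    center = inner(x, theta_hat)
    width = rho * sigma_norm(x, sigma)
    return center + width, center - width

def acceptance_threshold(actions, theta_hat, sigma, rho):
    """Largest lower confidence bound over the action set."""
    return max(confidence_bounds(x, theta_hat, sigma, rho)[1] for x in actions)

sigma = ((0.25, 0.0), (0.0, 1.0))     # hypothetical covariance-like matrix
theta_hat = (1.0, 0.0)                # hypothetical estimate
actions = [(1.0, 0.0), (0.0, 1.0)]

print(confidence_bounds((1.0, 0.0), theta_hat, sigma, rho=2.0))   # (2.0, 0.0)
print(confidence_bounds((0.0, 1.0), theta_hat, sigma, rho=2.0))   # (2.0, -2.0)
print(acceptance_threshold(actions, theta_hat, sigma, rho=2.0))   # 0.0
```

The well-explored direction has a narrow interval, while the poorly explored one has a wide interval with the same upper bound; the acceptance threshold is set by the better lower bound.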

Our strategy is to first represent ROFUL as an instance of Two-phase Bandit, and then, verify the assumptions of Theorem 3. In order to do so, let be given by

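$$\mathsf{F}(\mathcal{X}_t \mid \mathcal{H}_{t-1}) := \begin{cases} \widetilde{X}_t & \text{if } T_t = 1,\ W_t = 1, \text{ and } \langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle \ge \delta, \\ \text{Null} & \text{otherwise.} \end{cases}$$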

The blurry selector is also defined as

$$\mathsf{B}(\mathcal{X}_t \mid \mathcal{H}_{t-1}) := \mathcal{X}_t^{\delta} \cup \{\widetilde{X}_t\}.$$

We also set the vivid selector to be the greedy selector with respect to the estimator . It is straightforward to verify that the Two-phase Bandit algorithm with as defined above is equivalent to Algorithm 2. We thus need to show that the assumptions of Theorem 3 hold. We begin by computing the forced-sampling cost in Assumption 3.

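$$\mathbb{E}\big[\langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle \cdot F_t\big] \le \frac{1}{\delta}\, \mathbb{E}\big[\langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle^2 \cdot F_t\big] \le \frac{1}{\delta}\, \mathbb{E}\Big[\big(2 \langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle^2 - \delta^2\big) \cdot F_t\Big]. \tag{11}$$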

It follows from the definition of that implies

$$\langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle = M_t - \langle \Theta^\star, \widetilde{X}_t \rangle \le M_t - L_t(\widetilde{X}_t).$$

This gives us

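$$\mathbb{E}\big[\langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle^2 \cdot F_t\big] \le \mathbb{E}\big[(M_t - L_t(\widetilde{X}_t))^2 \cdot F_t\big] \le \mathbb{E}\Big[2\big((M_t - A_t)^2 + (A_t - L_t(\widetilde{X}_t))^2\big) \cdot F_t\Big],$$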

which in combination with (11) leads to

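$$\mathbb{E}\big[\langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle \cdot F_t\big] \le \frac{1}{\delta}\, \mathbb{E}\Big[\big(4 (M_t - A_t)^2 + 4 (A_t - L_t(\widetilde{X}_t))^2 - \delta^2\big) \cdot F_t\Big]. \tag{12}$$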

Next, letting , we get

$$(M_t - A_t)^2 \le \frac{\delta^2}{4} + 4 \Big(M_t - \frac{\delta}{4} - A_t\Big)^2 \cdot U_t,$$

which in turn yields

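$$\begin{aligned}
\mathbb{E}\big[(M_t - A_t)^2 \cdot F_t\big] &\le \mathbb{E}\Big[\frac{\delta^2}{4} \cdot F_t\Big] + \mathbb{E}\Big[4 \Big(M_t - \frac{\delta}{4} - A_t\Big)^2 \cdot U_t T_t W_t\Big] \\
&= \mathbb{E}\Big[\frac{\delta^2}{4} \cdot F_t\Big] + \mathbb{E}\Big[\mathbb{E}\Big[4 \Big(M_t - \frac{\delta}{4} - A_t\Big)^2 \cdot U_t T_t W_t \,\Big|\, \mathcal{X}_t, \mathcal{H}_{t-1}, T_t\Big]\Big].
\end{aligned}$$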

We deduce from optimism (8) that

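$$0 \le \mathbb{E}\Big[\Big\{\frac{1}{p} (\widetilde{M}_t - A_t)^2 - \Big(M_t - \frac{\delta}{4} - A_t\Big)^2\Big\} \cdot U_t W_t T_t \,\Big|\, \mathcal{X}_t, \mathcal{H}_{t-1}, T_t\Big],$$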

almost surely. Hence, we have

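$$\mathbb{E}\big[(M_t - A_t)^2 \cdot F_t\big] \le \mathbb{E}\Big[\frac{\delta^2}{4} \cdot F_t\Big] + \mathbb{E}\Big[\frac{4}{p} (\widetilde{M}_t - A_t)^2 \cdot U_t T_t W_t\Big].$$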

Substituting the above inequality into (12), we obtain

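$$\begin{aligned}
\mathbb{E}\big[\langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle \cdot F_t\big] &\le \frac{1}{\delta}\, \mathbb{E}\Big[\Big(\frac{16}{p} (U_t(\widetilde{X}_t) - A_t)^2 + 4 (A_t - L_t(\widetilde{X}_t))^2\Big) \cdot F_t\Big] \\
&\le \frac{16}{\delta p}\, \mathbb{E}\Big[(U_t(\widetilde{X}_t) - A_t)^2 + (A_t - L_t(\widetilde{X}_t))^2\Big] \\
&\le \frac{16}{\delta p}\, \mathbb{E}\Big[(U_t(\widetilde{X}_t) - L_t(\widetilde{X}_t))^2\Big] \\
&= \frac{64}{\delta p}\, \mathbb{E}\Big[(U_t(\widetilde{X}_t) - \widetilde{M}_t)^2\Big] \\
&= \frac{64}{\delta p}\, \mathbb{E}\Big[\sup_{\Theta \in \mathcal{C}_{t-1}} \langle \widetilde{X}_t, \Theta - \widetilde{\Theta}_t \rangle^2\Big].
\end{aligned}$$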

Recall we assumed that for each , there exists such that . Now, let be such that . We have that

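$$\mathbb{E}\big[\langle \Theta^\star, X_t^\star - \widetilde{X}_t \rangle \cdot F_t\big] \le \frac{64}{\delta p}\, \mathbb{E}\Big[\sup_{\Theta \in \mathcal{C}_{t-1}} \big\langle V_{j_t} \widetilde{X}_t, V_{j_t}(\Theta - \widetilde{\Theta}_t) \big\rangle^2\Big] = \frac{64 \rho^2}{\delta p}\, \mathbb{E}\Big[\big\|V_{j_t} \widetilde{X}_t\big\|_{\Sigma_t}^2\Big].$$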

Finally, we apply Lemma 10 and Lemma 11 in [1] for each separately:

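$$\sum_{t=1}^{T} \big\|V_{j_t} \widetilde{X}_t\big\|_{\Sigma_t}^2 \cdot \mathbb{I}(j_t = i) \le 2 \log\bigg(\frac{\det V_i^{\top} \Sigma_T^{-1} V_i}{\det V_i^{\top} \Sigma_0^{-1} V_i}\bigg) \le 2d \log\Big(\lambda + \frac{T x^2}{d}\Big).$$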
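The inequality above can be checked numerically in a small case. The sketch below assumes a single group, two-dimensional actions of norm at most 1, and a regularization parameter equal to 1; the function name and the randomly generated actions are hypothetical. It accumulates the weighted squared norms of the chosen actions and compares them with twice the log-determinant ratio.

```python
import math
import random

def potential_and_logdet(actions, lam):
    """Return (sum_t x_t^T Sigma_t x_t, 2 * log(det Sigma_T^{-1} / det Sigma_0^{-1})) in dimension 2."""
    a, b, c = lam, 0.0, lam            # Sigma_0^{-1} = lam * I, stored as [[a, b], [b, c]]
    det0 = a * c - b * b
    det = det0
    total = 0.0
    for x0, x1 in actions:
        a += x0 * x0                   # rank-one update of Sigma_t^{-1}
        b += x0 * x1
        c += x1 * x1
        det = a * c - b * b
        # x^T Sigma_t x using the closed-form inverse of a 2x2 matrix.
        total += (c * x0 * x0 - 2 * b * x0 * x1 + a * x1 * x1) / det
    return total, 2 * math.log(det / det0)

rng = random.Random(0)
T = 200
actions = []
for _ in range(T):
    angle = rng.uniform(0.0, 2.0 * math.pi)
    radius = rng.uniform(0.0, 1.0)     # norm at most x = 1
    actions.append((radius * math.cos(angle), radius * math.sin(angle)))

total, two_logdet = potential_and_logdet(actions, lam=1.0)
final_bound = 2 * 2 * math.log(1.0 + T * 1.0 / 2)   # 2 d log(lam + T x^2 / d) with d = 2
print(total, two_logdet, final_bound)
```

Each printed value should be no larger than the next, mirroring the two inequalities in the display.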

Therefore, we have

 c1 =T∑t=1\IE\inner[]Θ⋆,X⋆t−˜Xt⋅Ft ≤64ρ2δp\IET∑t=1\norm[]Vjt˜Xt2Σt ≤64ρ2δp\IEk∑i=1T∑t=1\norm[]Vjt˜Xt2Σt⋅I(jt=i) ≤64ρ2δp⋅k∑i=