A General Framework to Analyze Stochastic Linear Bandit
Abstract
In this paper we study the well-known stochastic linear bandit problem, in which a decision-maker sequentially chooses among a set of given actions in $\mathbb{R}^d$, observes their noisy rewards, and aims to maximize her cumulative expected reward over a horizon of length $T$. We introduce a general family of algorithms for the problem and prove that they are rate optimal. We also show that several well-known algorithms for the problem, such as optimism in the face of uncertainty linear bandit (OFUL) and Thompson sampling (TS), are special cases of our family of algorithms; therefore, we obtain a unified proof of rate optimality for both of these algorithms. Our results include both adversarial action sets (when actions are potentially selected by an adversary) and stochastic action sets (when actions are independently drawn from an unknown distribution), and they apply to both the Bayesian and the worst-case regret settings. Our new unified analysis technique also yields a number of new results and solves two open problems known in the literature. Most notably, (1) we show that TS can incur a linear worst-case regret, unless it uses posterior variances inflated (by a factor of $\sqrt{d}$) at each step. This shows that the best known worst-case regret bound for TS, the $\widetilde{O}(d^{3/2}\sqrt{T})$ bound given by [4, 3], which is worse (by a factor of $\sqrt{d}$) than the best known Bayesian regret bound of $\widetilde{O}(d\sqrt{T})$ given by [16] for TS, is tight. This settles an open problem stated in [15]. (2) Our proof also shows that TS can incur a linear Bayesian regret if it does not use the correct prior or noise distribution. (3) Under a generalized gap assumption and a margin condition, as in [9], we obtain a polylogarithmic (in $T$) regret bound for OFUL and TS in the stochastic setting. The result for OFUL resolves an open problem from [7].
1 Introduction
In a bandit problem, a decision-maker sequentially chooses actions from given action sets and receives rewards corresponding to the selected actions. The goal of the decision-maker, also referred to as the policy, is to maximize the cumulative reward by utilizing the history of previous observations. This paper considers a variant of this problem, called the stochastic linear bandit, in which all actions are elements of $\mathbb{R}^d$ for some integer $d$ and the expected values of the rewards depend on the actions through a linear function. We also let the action sets change over time. The classical multi-armed bandit (MAB) and the $K$-armed contextual multi-armed bandit are special cases of this problem.
Since its introduction by [2], the linear bandit problem has attracted a great deal of attention. Several algorithms based on the idea of the upper confidence bound (UCB), due to [11], have been proposed and analysed (notable examples are [5, 7, 14, 1]). The best known regret bound for these algorithms is $\widetilde{O}(d\sqrt{T})$, which matches the existing lower bounds up to logarithmic factors [7, 14, 18, 13, 12]. The best known algorithm in this family is the optimism in the face of uncertainty linear bandit (OFUL) algorithm of [1].
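The optimistic action choice that underlies these UCB-style algorithms can be sketched as follows; this is a minimal illustration with an arbitrary confidence-width parameter `beta` and toy data, not the tuned quantities from [1].

```python
import numpy as np

def oful_choice(actions, V, b, beta):
    """Pick the action maximizing the upper confidence bound over the
    ellipsoid {theta : ||theta - theta_hat||_V^2 <= beta}, which has the
    closed form <a, theta_hat> + sqrt(beta) * ||a||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b  # regularized least-squares estimate
    widths = np.sqrt(beta * np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    ucb = actions @ theta_hat + widths
    return int(np.argmax(ucb)), ucb

# Toy example with d = 2: V = I (ridge regularizer with lambda = 1 and no
# data yet) and b = sum_s a_s * r_s accumulated so far.
actions = np.array([[1.0, 0.0], [0.0, 1.0]])
idx, ucb = oful_choice(actions, V=np.eye(2), b=np.array([0.5, 0.0]), beta=1.0)
```

Here the first action has the larger estimated reward and both share the same confidence width, so the optimistic choice is the first action.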
A different line of research examines the performance of Thompson sampling (TS), a Bayesian heuristic due to [19] that employs the posterior distribution of the reward function to balance exploration and exploitation. TS is also known as posterior sampling. [16, 17] and [8] proved an $\widetilde{O}(d\sqrt{T})$ upper bound for the Bayesian regret of TS, thereby indicating its near-optimality. The best worst-case regret bound known thus far for TS, however, is the $\widetilde{O}(d^{3/2}\sqrt{T})$ bound given by [4, 3], which is worse than the previous bounds by a factor of $\sqrt{d}$. As stated in Section 8.2.1 of [15], it is an open question whether this extra factor can be eliminated by a more careful analysis.
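For comparison, linear TS with a Gaussian prior and Gaussian noise can be sketched as below; the `inflation` argument is an illustrative knob for the posterior-variance inflation discussed in this paper, and the toy inputs are ours.

```python
import numpy as np

def ts_choice(actions, V, b, rng, inflation=1.0):
    """Sample theta_tilde from the (possibly inflated) Gaussian posterior
    N(theta_hat, inflation * V^{-1}) and act greedily with respect to it.
    inflation = 1.0 corresponds to vanilla posterior sampling."""
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    theta_tilde = rng.multivariate_normal(theta_hat, inflation * V_inv)
    return int(np.argmax(actions @ theta_tilde))

# With a very concentrated posterior (a lot of data), TS behaves greedily:
rng = np.random.default_rng(0)
actions = np.array([[1.0, 0.0], [0.0, 1.0]])
choice = ts_choice(actions, V=1e6 * np.eye(2), b=1e6 * np.array([1.0, 0.0]), rng=rng)
```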
In addition, when there is a gap between the expected rewards of the top two actions, OFUL and TS are known to have a regret with a $\log T$ dependence in $T$ instead of $\sqrt{T}$. According to [7], it remains an open problem whether this result extends to the case where the action set can change over time. We refer to [12], and references therein, for a more thorough discussion.
On the other hand, in a subclass of the linear bandit problem known as the linear $K$-armed contextual bandit, [9] considered a more general version of the gap assumption together with a certain type of margin condition on the action set, and proposed a novel extension of the greedy algorithm. Their OLS Bandit algorithm explicitly allocates a fraction of rounds to each arm, and uses these forced samples to discard obviously suboptimal arms. They show that this filtering approach can lead to a near-optimal regret bound that grows logarithmically in $T$. [6] adapted this idea to the setting with very large $d$, and [10] extended it further to the case where both the number of arms and $d$ are large. In this paper, we demonstrate that a major generalization of this idea yields a unifying technique for analyzing all of the above algorithms that not only recovers known results in the literature, but also yields a number of new results and, notably, solves the two aforementioned open problems. To be explicit, the main contributions of this paper are:

We propose a general family of algorithms, called Two-phase Bandit, for the stochastic linear bandit problem and prove that they are rate optimal. We also show that TS, OFUL, and OLS Bandit are special cases of this family. Therefore, we obtain a universal proof of rate optimality for all of these algorithms, in both the Bayesian and worst-case regret settings.

We consider the same generalized gap assumption as in [9], namely that the action set has a positive gap with positive probability, and obtain a polylogarithmic (in $T$) gap-dependent regret bound for all of the above algorithms when the action sets are independently drawn from an unknown distribution. To the best of our knowledge, this result is new for OFUL and TS.

Our proof also shows that TS is vulnerable (it can incur a linear Bayesian and worst-case regret) if it uses an incorrect prior distribution for the unknown parameter vector of the linear reward function, or an incorrect noise distribution.

As a byproduct of our analyses in §3–4, we obtain a set of conditions under which (a) TS is rate optimal and (b) we can shrink the confidence sets of OFUL by a factor of $\sqrt{d}$ without impacting its regret.
Organization. We start by introducing the notation and main assumptions in §2. In §3 we introduce the Two-phase Bandit algorithm and prove that it is rate optimal. In §4 we introduce the ROFUL algorithm, a special case of the Two-phase Bandit algorithm, and in §5 we show that OFUL and TS are special cases of the ROFUL algorithm. Finally, in §6, we prove that TS can incur linear regret in the worst case, or when it does not have correct information about the prior or the noise distribution.
2 Setting and notations
For any positive integer $n$, we denote $[n] := \{1, 2, \ldots, n\}$. For a positive semidefinite matrix $M$, by $\|x\|_M$ we mean $\sqrt{x^\top M x}$ for any vector $x$ of suitable size. By a grouped linear bandit (GLB) problem, we mean a tuple with the following components:

the prior distribution of the parameter vector $\theta$ on $\mathbb{R}^d$;

$W_1, \ldots, W_K$, a collection of $K$ mutually orthogonal subspaces of $\mathbb{R}^d$. By abuse of notation, we write $W_i$ to also denote the projection matrix from $\mathbb{R}^d$ onto $W_i$;

the action sets, which are random compact subsets of $\mathbb{R}^d$;

random objects passed to (randomized) policies as their source of randomness;

a sequence of independent mean-zero sub-Gaussian random variables (the noise terms).
The main difference between our model and the common linear bandit formulation (for example, as defined in [1]) is the introduction of the subspaces $W_i$. Loosely speaking, we can consider each $W_i$ as a copy of $\mathbb{R}^{d/K}$; thus, $\mathbb{R}^d = W_1 \oplus \cdots \oplus W_K$. In this case, our assumption on the action sets demands that each action have nonzero entries in only one of the copies of $\mathbb{R}^{d/K}$. Our problem can be regarded as a $d$-dimensional instance of the ordinary (ungrouped) linear bandit; however, we will see in the next sections that this additional structure lets us improve the regret bound in both the gap-dependent and the gap-independent settings. Three interesting special cases of this model are:

When each $W_i$ is one-dimensional and the action set is the set of standard basis vectors $\{e_1, \ldots, e_K\}$ for all $t$, our problem reduces to a simple multi-armed bandit problem.

When each action set contains exactly $K$ copies of a vector (one in each $W_i$), the problem is called the $K$-armed contextual bandit.

When $K = 1$, we get the ordinary stochastic linear bandit problem.
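To make the grouped structure concrete, the following sketch (with a hypothetical helper name) embeds one round of a $K$-armed contextual problem into this format: arm $i$ carries the context vector in the $i$-th block and zeros elsewhere, so each action lies in exactly one subspace $W_i$.

```python
import numpy as np

def contextual_to_glb_actions(context, num_arms):
    """Build the GLB action set for one K-armed contextual round: action i
    equals the context vector placed in block i, so each action has
    nonzero entries in exactly one of the K orthogonal blocks."""
    d = len(context)
    actions = np.zeros((num_arms, num_arms * d))
    for i in range(num_arms):
        actions[i, i * d : (i + 1) * d] = context
    return actions

A = contextual_to_glb_actions(np.array([1.0, 2.0]), num_arms=3)
```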
The optimal and selected actions at time $t$ are denoted by $a_t^\star$ and $a_t$, respectively. The corresponding reward $r_t = \langle a_t, \theta \rangle + \xi_t$ (the inner product of the selected action with the parameter vector $\theta$, plus the noise $\xi_t$) is then revealed to the policy. We denote the history of observations up to time $t$ by $\mathcal{H}_t$; it consists of the action sets, selected actions, and rewards observed up to and including time $t$.
In this model, a policy is formally defined as a deterministic function that maps the observed history, the current action set, and the current random object to an element of the action set. We emphasize that our definition of policies includes randomized policies as well. Indeed, the random objects are precisely the source of randomness used by a policy.
The performance measure for evaluating policies is the standard cumulative Bayesian regret, defined as $\mathbb{E}\big[\sum_{t=1}^{T} \langle a_t^\star - a_t, \theta \rangle\big]$, where $a_t^\star$ and $a_t$ denote the optimal and selected actions at time $t$.
The expectation is taken with respect to all the randomness in our model, including the prior distribution. Although we describe our setting in a Bayesian fashion, our results cover the fixed-parameter setting as well, by defining the prior distribution to be a point mass at the true parameter.
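In simulation, the quantity inside this expectation can be computed per sample path; averaging over draws of the parameter from the prior and over the policy's randomness then estimates the Bayesian regret. A minimal sketch with illustrative names:

```python
import numpy as np

def path_regret(theta, action_sets, chosen):
    """Realized pseudo-regret of one sample path:
    sum_t ( max_{a in A_t} <a, theta> - <a_t, theta> )."""
    total = 0.0
    for actions, a_t in zip(action_sets, chosen):
        total += np.max(actions @ theta) - a_t @ theta
    return total

theta = np.array([1.0, 0.0])
action_sets = [np.eye(2), np.eye(2)]                   # A_t = {e_1, e_2}
chosen = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]  # wrong arm, then right
r = path_regret(theta, action_sets, chosen)
```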
2.1 Action sets
In the next sections, we derive our regret bounds for various types of action sets; we therefore dedicate this subsection to the definitions and notation we will use for action sets. We start off by defining the extremal points of an action set. {defn}[Extremal points] For an action set $\mathcal{A}$, define its extremal points to be all $a \in \mathcal{A}$ for which there are no $b, c \in \mathcal{A} \setminus \{a\}$ and $\lambda \in (0, 1)$ satisfying $a = \lambda b + (1 - \lambda) c$.
The importance of this definition is that all the algorithms studied in this paper only choose extremal points of action sets. This observation implies that the rewards attained by any of these algorithms on an action set $\mathcal{A}$ belong to the reward profile of $\mathcal{A}$, defined as the set of values $\langle a, \theta \rangle$ over the extremal points $a$ of $\mathcal{A}$.
The maximum attainable reward and the gap of an action set $\mathcal{A}$ for the parameter vector $\theta$ are defined, respectively, as the largest value in the reward profile of $\mathcal{A}$ and the difference between its largest and second-largest values.
For any , write
In the above notation, for the sake of simplicity, we may use the subscript $t$ to refer to the quantities associated with the action set at time $t$. We now define a gapped problem as follows: {defn}[Gapped problem] We call a GLB problem gapped if, for some $\Delta > 0$, the following inequality holds:
(1) 
Moreover, for a fixed gap level $\Delta$, we define the indicator of the event that the action set at time $t$ is gapped at this level. {rem} Note that all problems are gapped at level $\Delta = 0$. This observation will help us obtain gap-independent bounds. {rem} Our notion of gap is more general than the well-known gap assumption in the literature (e.g., as in [1]), in which the gap condition always holds, so that the probability in (1) would be 1. We conclude this section by defining the near-optimal space, followed by the diversity condition and the margin condition. The two conditions will enable us to improve a term in our regret bound, which will appear in Eq. (6), to an expression that grows sublinearly in $T$. {defn}[Near-optimal space] Let $d_0$ be the smallest number such that there exists a subspace of dimension $d_0$ satisfying
As before, we also identify the near-optimal space with the projection matrix onto it. {rem} The main purpose of this notion is to handle suboptimal arms in the special case of the $K$-armed contextual bandit. One might harmlessly assume that the near-optimal space is all of $\mathbb{R}^d$ (or equal to the identity if viewed as an operator) and follow the rest of the paper. {defn}[Diversity condition] We say that a GLB problem satisfies the diversity condition with a given parameter if the action set at each time is independent of the past and
(2) 
[Margin condition] In a GLB problem, the margin condition holds if
(3) 
for two positive constants.
3 Two-phase bandit algorithm
In this section, we describe the Two-phase Bandit algorithm, an extension to our more general grouped linear bandit problem of the OLS Bandit algorithm, which was introduced by [9] for the special case of the $K$-armed contextual bandit setting. The Two-phase Bandit algorithm (presented in Algorithm 1) uses two separate phases to deal with the exploration-exploitation trade-off. At each time $t$, a forced-sampling rule either determines which arm to pull (when it returns an element of the current action set) or refuses to pick one, which implies that the best arm should be chosen by exploiting the information gathered thus far. The forced-sampling rule is allowed to depend on the history as well as the current action set, and the notation for the rule expresses this dependence explicitly.
In the exploitation phase, two selectors are used to decide which action to choose. The blurry selector first selects a candidate set, which is then passed to the vivid selector to pick a single action from. The idea behind this architecture is that, with high probability, the blurry selector eliminates all actions that are suboptimal by a constant margin. The name “blurry” indicates the low accuracy of this selector. The vivid selector, on the other hand, should become more accurate, with high probability, as $t$ grows. The diversity condition is the assumption that plays a crucial role in proving this type of result. We now state and briefly discuss the assumptions that need to be met for our results to be valid. {assume}[Boundedness] There exist constants bounding the norms of all actions and of the parameter vector, for all $t$, almost surely. The next assumption concerns how much regret the forced-sampling rule incurs in order to ensure that the blurry selector works properly. {assume}[Forced-sampling cost] Let the indicator of the forced-sampling event be given. Then there exists some constant such that
(4) 
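The two-phase control flow described above can be sketched as follows; the three callables are placeholders for a concrete forced-sampling rule and the two selectors, not the actual components of Algorithm 1.

```python
def two_phase_step(t, history, action_set, forced_rule, blurry, vivid):
    """One round of the Two-phase skeleton: the forced-sampling rule either
    dictates an action (exploration) or returns None, in which case the
    blurry selector prunes the action set and the vivid selector picks a
    single action from the remaining candidates (exploitation)."""
    forced = forced_rule(t, history, action_set)
    if forced is not None:
        return forced
    candidates = blurry(history, action_set)
    return vivid(history, candidates)

# Toy components: force the first action for t < 2, then prune-and-pick.
forced_rule = lambda t, h, A: A[0] if t < 2 else None
blurry = lambda h, A: [a for a in A if a >= 0.5]   # drop clearly bad actions
vivid = lambda h, C: max(C)                        # greedy among candidates
A = [0.2, 0.9, 0.6]
early = two_phase_step(0, [], A, forced_rule, blurry, vivid)
late = two_phase_step(5, [], A, forced_rule, blurry, vivid)
```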
We are now ready to formalize what it means for the blurry selector to work properly. {assume}[Blurry selector bound] Let the gap level $\Delta$ be fixed. There exists a constant such that, for all $t$, we have that
(5) 
The above assumptions are sufficient to prove a regret bound that grows with $T$ as $\sqrt{T}$, up to logarithmic factors. We will further discuss in §4 how the parameters can be tuned so that the resulting bound is rate optimal. Moreover, we can obtain even sharper regret bounds under additional assumptions, stated below. For example, the vivid selector should satisfy certain properties. In order to highlight the main idea, we restrict our attention to a rather concrete scenario in which these properties hold and defer the most general case to a longer version of the paper. More specifically, we assume that the vivid selector is a greedy selector, defined as follows. {defn}[Greedy selector] Let an estimator of the parameter vector be given. By the greedy selector with respect to this estimator, we mean the selector given by
[Reasonableness] Let be fixed. For an estimator , define
and
The estimator is called reasonable if
for some .
The vivid selector is a greedy selector with respect to a reasonable estimator, for all $t$.
Our last assumption requires the selected actions to be diverse in the near-optimal space. This condition is not a mere property of a model or a policy; it is a characteristic of a policy in combination with a model. {assume}[Linear expansion] We say that linear expansion holds if
for some constants and all $t$. Now, with all these assumptions in place, we are ready to state our main regret bound, which yields results in both the gap-independent and gap-dependent settings. The result is also applicable to the case where the action sets are selected by an adversary. {thm} If the boundedness, forced-sampling cost, and blurry selector assumptions hold, the cumulative regret of Algorithm 1 satisfies the following inequality:
(6) 
Furthermore, under the full set of assumptions above together with the margin condition (3), we have
(7) 
Theorem 3 provides a gap-independent bound, Eq. (6), as well as a gap-dependent bound, Eq. (7). For the special case of OLS Bandit, the latter produces a polylogarithmic (in $T$) bound on the regret. Note that we can follow the same peeling argument as in [9, 6] and obtain an $O(\log T)$ bound for OLS Bandit. However, this peeling argument may not apply to the more general algorithms that we analyze in the next sections.
Proof of Theorem 3.
We split the regret of the algorithm into the following three cases. We will then bound each term separately.

Forced-sampling phase,

When and , where is the indicator function for ,

When and .
By and , we denote the regrets of the above items up to time . Clearly, we have that
Assumption 3 gives us an upper bound for
Next, notice that the maximum regret that can occur in each round is bounded above by
It follows from the definition of and the inequality (5) in Assumption 3 that the number of times that and is controlled by
It then follows that
From the definition of , we can infer that whenever and , the regret at time $t$ cannot exceed , and in the case that the action set is gapped (at this level), the regret is equal to zero. Using this observation, we get that
which completes the proof of (6).
In order to prove (7), we use Assumptions 33 and the margin condition to bound in a different way. Define
The key idea in bounding is that under , ; thereby, we get, with probability 1,
Hence, we have that
Note that, whenever and , we have
As in the proof of (6), the regret would be zero if is larger than the above. This, in turn, implies that
which is the desired result. ∎
4 Randomized OFUL
In this section, we present an extension of the OFUL algorithm of [1] and prove that, under mild conditions, it enjoys the same regret bound as the original OFUL. We call this extension Randomized OFUL (ROFUL) and present its pseudocode in Algorithm 2. It receives an arbitrary estimator and, at each round, makes a greedy decision using this estimator. We require this estimator to be reasonable (Definition 3) and optimistic (Definition 4), and in Theorem 4 we use these assumptions to provide our regret bounds for this algorithm. We now define optimism. Recall that we defined
Similarly, we write
[Optimism] We say that the estimator is optimistic if for some we have
(8) 
with probability at least where is a fixed constant and is the indicator function for the typical event .
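As a numeric illustration of how confidence bounds can prune a candidate set in this spirit (the function name and parameter values are ours, not taken from Algorithm 2): keep every action whose upper confidence bound is at least the best lower confidence bound.

```python
import numpy as np

def ucb_lcb_prune(actions, theta_hat, V_inv, beta):
    """Return the actions whose UCB beats the largest LCB; any action
    failing this test is suboptimal on the event that the confidence
    ellipsoid contains the true parameter."""
    widths = np.sqrt(beta * np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    means = actions @ theta_hat
    keep = means + widths >= np.max(means - widths)
    return actions[keep]

actions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
kept = ucb_lcb_prune(actions, theta_hat=np.array([0.5, 0.0]),
                     V_inv=np.eye(2), beta=0.01)
```

With these toy values, every action has confidence width 0.1, so only the first action survives: its lower bound (0.4) exceeds the upper bounds of the other two.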
For any reasonable (Definition 3) and optimistic (Definition 4) estimator, the corresponding ROFUL algorithm admits the following regret bound:
(9) 
Furthermore, if the diversity condition (2) and the margin condition (3) also hold, we have the following bound:
(10) 
Proof.
For simplicity, we introduce some new notation to keep the expressions in this proof shorter, thereby increasing readability. Define:

Well-posed action set indicator:

Upper and lower confidence bounds:

Acceptance threshold:
Our strategy is to first represent ROFUL as an instance of the Two-phase Bandit algorithm and then verify the assumptions of Theorem 3. In order to do so, let the forced-sampling rule be given by
The blurry selector is also defined as
We also set the vivid selector to be the greedy selector with respect to the given estimator. It is straightforward to verify that the Two-phase Bandit algorithm with the components defined above is equivalent to Algorithm 2. We thus need to show that the assumptions of Theorem 3 hold. We begin by computing the forced-sampling cost in Assumption 3.
(11) 
It follows from the definition of that implies
This gives us
which in combination with (11) leads to
(12) 
Next, letting , we get
which in turn yields
We deduce from optimism (8) that
almost surely. Hence, we have
Substituting the above inequality into (12), we obtain
Recall that we assumed that for each , there exists such that . Now, let be such that . We have that
Finally, we apply Lemma 10 and Lemma 11 in [1] for each separately:
Therefore, we have