\OneAndAHalfSpacedXI
\TheoremsNumberedThrough
\ECRepeatTheorems
\EquationsNumberedThrough

\RUNTITLE{Exploration-Free Contextual Bandits}
\TITLE{Mostly Exploration-Free Algorithms for Contextual Bandits}

\ARTICLEAUTHORS{%
\AUTHOR{Hamsa Bastani}
\AFF{Wharton School, \EMAIL{hamsab@wharton.upenn.edu}}
\AUTHOR{Mohsen Bayati}
\AFF{Stanford Graduate School of Business, \EMAIL{bayati@stanford.edu}}
\AUTHOR{Khashayar Khosravi}
\AFF{Stanford University Electrical Engineering, \EMAIL{khosravi@stanford.edu}}
}

\ABSTRACT{%
The contextual bandit literature has traditionally focused on algorithms that address the exploration-exploitation tradeoff. In particular, greedy algorithms that exploit current estimates without any exploration may be sub-optimal in general. However, exploration-free greedy algorithms are desirable in practical settings where exploration may be costly or unethical (e.g., clinical trials). Surprisingly, we find that a simple greedy algorithm can be rate-optimal (achieves asymptotically optimal regret) if there is sufficient randomness in the observed contexts (covariates). We prove that this is always the case for a two-armed bandit under a general class of context distributions that satisfy a condition we term covariate diversity. Furthermore, even absent this condition, we show that a greedy algorithm can be rate optimal with positive probability. Thus, standard bandit algorithms may unnecessarily explore. Motivated by these results, we introduce Greedy-First, a new algorithm that uses only observed contexts and rewards to determine whether to follow a greedy algorithm or to explore. We prove that this algorithm is rate-optimal without any additional assumptions on the context distribution or the number of arms. Extensive simulations demonstrate that Greedy-First successfully reduces exploration and outperforms existing (exploration-based) contextual bandit algorithms such as Thompson sampling or upper confidence bound (UCB).
}

\KEYWORDS{sequential decision-making, contextual bandit, greedy algorithm, exploration-exploitation}

1 Introduction

Service providers across a variety of domains are increasingly interested in personalizing decisions based on customer characteristics. For instance, a website may wish to tailor content based on an Internet user’s web history (Li et al. 2010), or a medical decision-maker may wish to choose treatments for patients based on their medical records (Kim et al. 2011). In these examples, the costs and benefits of each decision depend on the individual customer or patient, as well as their specific context (web history or medical records respectively). Thus, in order to make optimal decisions, the decision-maker must learn a model predicting individual-specific rewards for each decision based on the individual’s observed contextual information. This problem is often formulated as a contextual bandit (Auer 2003, Langford and Zhang 2008, Li et al. 2010), which generalizes the classical multi-armed bandit problem (Thompson 1933, Lai and Robbins 1985).

In this setting, the decision-maker has access to $K$ possible decisions (arms) with uncertain rewards. Each arm $i$ is associated with an unknown parameter $\beta_i \in \mathbb{R}^d$ that is predictive of its individual-specific rewards. At each time $t$, the decision-maker observes an individual with an associated context vector $X_t \in \mathbb{R}^d$. Upon choosing arm $i$, she realizes a (linear) reward of

\[
Y_{i,t} \;=\; X_t^\top \beta_i + \varepsilon_{i,t} \,, \tag{1}
\]

where $\varepsilon_{i,t}$ are idiosyncratic shocks. One can also consider nonlinear rewards given by generalized linear models (e.g., logistic, probit, and Poisson regression); in this case, (1) is replaced with

\[
\mathbb{E}\left[\, Y_{i,t} \mid X_t \,\right] \;=\; \mu\!\left( X_t^\top \beta_i \right) , \tag{2}
\]

where $\mu$ is a suitable inverse link function (Filippi et al. 2010, Li et al. 2017). The decision-maker’s goal is to maximize the cumulative reward over $T$ different individuals by gradually learning the arm parameters. Devising an optimal policy for this setting is often computationally intractable, and thus, the literature has focused on effective heuristics that are asymptotically optimal, including UCB (Dani et al. 2008, Abbasi-Yadkori et al. 2011), Thompson sampling (Agrawal and Goyal 2013, Russo and Van Roy 2014b), information-directed sampling (Russo and Van Roy 2014a), and algorithms inspired by $\varepsilon$-greedy methods (Goldenshluger and Zeevi 2013, Bastani and Bayati 2015).
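To make the reward models concrete, the following sketch simulates rewards from the linear model (1) and from a logistic instance of the GLM model (2). It is only an illustration: the dimension, number of arms, parameter values, and noise level are our own assumptions rather than quantities from the paper.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d, K = 3, 2                                   # context dimension and number of arms (illustrative)
beta = rng.normal(size=(K, d))                # unknown arm parameters beta_i

def linear_reward(x, i, sigma=0.5):
    """Reward model (1): X_t^T beta_i plus sigma-subgaussian noise."""
    return x @ beta[i] + rng.normal(scale=sigma)

def logistic_reward(x, i):
    """GLM reward model (2) with inverse link mu(z) = 1 / (1 + exp(-z))."""
    p = 1.0 / (1.0 + np.exp(-(x @ beta[i])))
    return rng.binomial(1, p)                 # binary reward with mean mu(X_t^T beta_i)

x_t = rng.uniform(-1, 1, size=d)              # an observed context
print(linear_reward(x_t, 0), logistic_reward(x_t, 1))
\end{verbatim}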

The key ingredient in designing these algorithms is addressing the exploration-exploitation tradeoff. On one hand, the decision-maker must explore or sample each decision for random individuals to improve her estimate of the unknown arm parameters $\beta_i$; this information can be used to improve decisions for future individuals. Yet, on the other hand, the decision-maker also wishes to exploit her current estimates to make the estimated best decision for the current individual in order to maximize cumulative reward. The decision-maker must therefore carefully balance both exploration and exploitation to achieve good performance. In general, algorithms that fail to explore sufficiently may fail to learn the true arm parameters, yielding poor performance.

However, exploration may be prohibitively costly or infeasible in a variety of practical environments (Bird et al. 2016). In medical decision-making, choosing a treatment that is not the estimated-best choice for a specific patient may be unethical; in marketing applications, testing out an inappropriate ad on a potential customer may result in the costly, permanent loss of the customer. Such concerns may deter decision-makers from deploying bandit algorithms in practice.

In this paper, we analyze the performance of exploration-free greedy algorithms. Surprisingly, we find that a simple greedy algorithm can achieve the same state-of-the-art asymptotic performance guarantees as standard bandit algorithms if there is sufficient randomness in the observed contexts (thereby creating natural exploration). In particular, we prove that the greedy algorithm is near-optimal for a two-armed bandit when the context distribution satisfies a condition we term covariate diversity; this property requires that the covariance matrix of the observed contexts conditioned on any half space is positive definite. We show that covariate diversity is satisfied by a natural class of continuous and discrete context distributions. Furthermore, even absent covariate diversity, we show that a greedy approach provably converges to the optimal policy with some probability that depends on the problem parameters. Our results hold for arm rewards given by both linear and generalized linear models. Thus, exploration may not be necessary at all in a general class of problem instances, and is only sometimes necessary in other problem instances.

Unfortunately, one may not know a priori when a greedy algorithm will converge, since its convergence depends on unknown problem parameters. For instance, the decision-maker may not know if the context distribution satisfies covariate diversity; if covariate diversity is not satisfied, the greedy algorithm may be undesirable since it may achieve linear regret some fraction of the time (i.e., it fails to converge to the optimal policy with positive probability). To address this concern, we present Greedy-First, a new algorithm that seeks to reduce exploration when possible by starting with a greedy approach, and incorporating exploration only when it is confident that the greedy algorithm is failing with high probability. In particular, we formulate a simple hypothesis test using observed contexts and rewards to verify (with high probability) if the greedy arm parameter estimates are converging at the asymptotically optimal rate. If not, our algorithm transitions to a standard exploration-based contextual bandit algorithm.

Greedy-First satisfies the same asymptotic guarantees as standard contextual bandit algorithms without our additional assumptions on covariate diversity or any restriction on the number of arms. More importantly, Greedy-First does not perform any exploration (i.e., remains greedy) with high probability if the covariate diversity condition is met. Furthermore, even when covariate diversity is not met, Greedy-First provably reduces the expected amount of forced exploration compared to standard bandit algorithms. This occurs because the vanilla greedy algorithm provably converges to the optimal policy with some probability even for problem instances without covariate diversity; however, it achieves linear regret on average since it may fail a positive fraction of the time. Greedy-First leverages this observation by following a purely greedy algorithm until it detects that this approach has failed. Thus, in any bandit problem, the Greedy-First policy explores less on average than standard algorithms that always explore. Simulations confirm our theoretical results, and demonstrate that Greedy-First outperforms existing contextual bandit algorithms even when covariate diversity is not met.

Finally, Greedy-First provides decision-makers with a natural interpretation for exploration. The hypothesis test for adopting exploration only triggers when an arm has not received sufficiently diverse samples; at this point, the decision-maker can choose to explore that arm by assigning it random individuals, or to discard it based on current estimates and continue with a greedy approach. In this way, Greedy-First reduces the opaque nature of experimentation, which we believe can be valuable for aiding the adoption of bandit algorithms in practice.

1.1 Related Literature

We study sequential decision-making algorithms under the classic linear contextual bandit framework, which has been extensively studied in the computer science, operations, and statistics literature (see Chapter 4 of Bubeck and Cesa-Bianchi (2012) for an informative review). A key feature of this setting is the presence of bandit feedback, i.e., the decision-maker only observes feedback for her chosen decision and does not observe counterfactual feedback from other decisions she could have made; this obstacle inspires the exploration-exploitation tradeoff in bandit problems.

The contextual bandit setting was first introduced by Auer (2003) through the LinRel algorithm and was subsequently improved through the ConfidenceBall algorithm by Dani et al. (2008) and the LinUCB algorithm by Chu et al. (2011). More recently, Abbasi-Yadkori et al. (2011) proved an upper bound of $\tilde{O}(d\sqrt{T})$ regret after $T$ time periods when contexts are $d$-dimensional. While this literature often allows for arbitrary (adversarial) context sequences, we consider the more restricted setting where contexts are generated i.i.d. from some unknown distribution. This additional structure is well-suited to certain applications (e.g., clinical trials on treatments for a non-infectious disease) and allows for improved regret bounds that are logarithmic in $T$ (see Goldenshluger and Zeevi 2013, who prove an upper bound of $O(\log T)$ regret), and more importantly, allows us to delve into the performance of exploration-free policies, which have not been analyzed previously.

Recent work has applied contextual bandit techniques for personalization in a variety of applications such as healthcare (Bastani and Bayati 2015, Tewari and Murphy 2017, Mintz et al. 2017, Kallus and Zhou 2018, Chick et al. 2018, Zhou et al. 2019), recommendation systems (Chu et al. 2011, Kallus and Udell 2016, Agrawal et al. 2017, Bastani et al. 2018), and dynamic pricing (Cohen et al. 2016, Qiang and Bayati 2016, Javanmard and Nazerzadeh 2019, Ban and Keskin 2018, Bastani et al. 2019). However, this substantial literature requires exploration. Exploration-free greedy policies are desirable in practical settings where exploration may be costly or unethical.

Greedy Algorithms.

A related literature studies greedy (but not exploration-free) algorithms in discounted Bayesian multi-armed bandit problems. The seminal paper by Gittins (1979) showed that greedily applying an index policy is optimal for a classical multi-armed bandit in Bayesian regret (with a known prior over the unknown parameters). Woodroofe (1979) and Sarkar (1991) extend this result to a Bayesian one-armed bandit with a single i.i.d. covariate when the discount factor approaches 1, and Wang et al. (2005a, b) generalize this result with a single covariate and two arms. Mersereau et al. (2009) further model known structure between arm rewards. However, these policies are not exploration-free; in particular, the Gittins index of an arm is not simply the arm parameter estimate, but includes an additional factor that implicitly captures the value of exploration for under-sampled arms. Recent work has shown a sharp equivalence between the UCB policy (which incorporates exploration) and the Gittins index policy as the discount factor approaches one (Russo 2019). In contrast, we consider a greedy policy with respect to unbiased arm parameter estimates, i.e., without incorporating any exploration. It is surprising that such a policy can be effective; in fact, we show that it is not rate optimal in general, but is rate optimal for the linear contextual bandit if there is sufficient randomness in the context distribution.

It is also worth noting that, unlike the literature above, we consider undiscounted minimax regret with unknown and deterministic arm parameters. Gutin and Farias (2016) show that the Gittins analysis does not succeed in minimizing Bayesian regret over all sufficiently large horizons, and propose “optimistic” Gittins indices (which incorporate additional exploration) to solve the undiscounted Bayesian multi-armed bandit.

There are also technical parallels between our work and the analysis of greedy policies in the dynamic pricing literature (Lattimore and Munos 2014, Broder and Rusmevichientong 2012). When there is no context, the greedy algorithm provably converges to a suboptimal price with nonzero probability (den Boer and Zwart 2013, Keskin and Zeevi 2014, 2015). However, in the presence of contexts, Qiang and Bayati (2016) show that changes in the demand environment can induce natural exploration for an exploration-free greedy algorithm, thereby ensuring asymptotically optimal performance. Our work significantly differs from this line of analysis since we need to learn multiple reward functions (for each arm) simultaneously. Specifically, in dynamic pricing, the decision-maker always receives feedback from the true demand function; in contrast, in the contextual bandit, we only receive feedback from a decision if we choose it, thereby complicating the analysis. As a result, the greedy policy is always rate-optimal in the setting of Qiang and Bayati (2016), but only rate-optimal in the presence of covariate diversity in our setting.

Covariate Diversity.

The adaptive control theory literature has studied “persistent excitation”: for linear models, if the sample path of the system satisfies this condition, then the minimum eigenvalue of the covariance matrix grows at a suitable rate, implying that the parameter estimates converge over time (Narendra and Annaswamy 1987, Nguyen 2018). Thus, if persistent excitation holds for each arm, we will eventually recover the true arm rewards. However, the problem remains to derive policies that ensure that such a condition holds for each (optimal) arm; classical bandit algorithms achieve this goal with high probability by incorporating exploration for under-sampled arms. Importantly, a greedy policy that does not incorporate exploration may not satisfy this condition, e.g., the greedy policy may “drop” an arm. The covariate diversity assumption ensures that there is sufficient randomness in the observed contexts, thereby exogenously ensuring that persistent excitation holds for each arm regardless of the sample path taken by the bandit algorithm.

Conservative Bandits.

Our approach is also related to recent literature on designing conservative bandit algorithms (Wu et al. 2016, Kazerouni et al. 2016) that operate within a safety margin, i.e., the regret is constrained to stay below a certain threshold that is determined by a baseline policy. This literature proposes algorithms that restrict the amount of exploration (similar to the present work) in order to satisfy a safety constraint. Wu et al. (2016) studies the classical multi-armed bandit, and Kazerouni et al. (2016) generalizes these results to the contextual linear bandit.

Additional Related Work.

Since the first draft of this paper appeared online, there have been two follow-up papers that cite our work and provide additional theoretical and empirical validation for our results. Kannan et al. (2018) consider the case where an adversary selects the observed contexts, but these contexts are then perturbed by white noise; they find that the greedy algorithm can be rate optimal in this setting even for small perturbations. Bietti et al. (2018) perform an extensive empirical study of contextual bandit algorithms on datasets that are publicly available on the OpenML platform. These datasets arise from a variety of applications including medicine, natural language, and sensors. Bietti et al. (2018) find that the greedy algorithm outperforms a wide range of bandit algorithms in cumulative regret on a large number of these datasets. This study provides strong empirical validation of our theoretical findings.

1.2 Main Contributions and Organization of the Paper

We begin by studying conditions under which the greedy algorithm performs well. In §2, we introduce the covariate diversity condition (Assumption 2.1), and show that it holds for a general class of continuous and discrete context distributions. In §3, we show that when covariate diversity holds, the greedy policy is asymptotically optimal for a two-armed contextual bandit with linear rewards (Theorem 3.2); this result is extended to rewards given by generalized linear models in Proposition 3.4. For problem instances with more than two arms or where covariate diversity does not hold, we prove that the greedy algorithm is asymptotically optimal with some probability, and we provide a lower bound on this probability (Theorem 3.5).

Building on these results, in §4, we introduce the Greedy-First algorithm that uses observed contexts and rewards to determine whether the greedy algorithm is failing or not via a hypothesis test. If the test detects that the greedy steps are not receiving sufficient exploration, the algorithm switches to a standard exploration-based algorithm. We show that Greedy-First achieves rate optimal regret bounds without our additional assumptions on covariate diversity or number of arms. More importantly, we prove that Greedy-First remains purely greedy (while achieving asymptotically optimal regret) for almost all problem instances for which a pure greedy algorithm is sufficient (Theorem 4.2). Finally, for problem instances with more than two arms or where covariate diversity does not hold, we prove that Greedy-First remains exploration-free and rate optimal with some probability, and we provide a lower bound on this probability (Theorem 4.3). This result implies that Greedy-First reduces exploration on average compared to standard bandit algorithms.

Finally, in §5, we run several simulations on synthetic and real datasets to verify our theoretical results. We find that the greedy algorithm outperforms standard bandit algorithms when covariate diversity holds, but can perform poorly when this assumption does not hold. However, Greedy-First outperforms standard bandit algorithms even in the absence of covariate diversity, while remaining competitive with the greedy algorithm in the presence of covariate diversity. Thus, Greedy-First provides a desirable compromise between avoiding exploration and learning the true policy.

2 Problem Formulation

We consider a $K$-armed contextual bandit for $T$ time steps, where $T$ is unknown. Each arm $i \in [K]$ is associated with an unknown parameter $\beta_i \in \mathbb{R}^d$. For any integer $n$, let $[n]$ denote the set $\{1, 2, \dots, n\}$. At each time $t$, we observe a new individual with context vector $X_t \in \mathbb{R}^d$. We assume that $\{X_t\}_{t \ge 1}$ is a sequence of i.i.d. samples from some unknown distribution that admits a probability density with respect to the Lebesgue measure. If we pull arm $i$, we observe a stochastic linear reward (in §3.4, we discuss how our results can be extended to generalized linear models)

\[
Y_{i,t} \;=\; X_t^\top \beta_i + \varepsilon_{i,t} \,,
\]

where $\varepsilon_{i,t}$ are independent $\sigma$-subgaussian random variables (see Definition 2 below). {definition} A random variable $Z$ is $\sigma$-subgaussian if for all $\lambda \in \mathbb{R}$ we have $\mathbb{E}\left[ e^{\lambda Z} \right] \le e^{\sigma^2 \lambda^2 / 2}$. We seek to construct a sequential decision-making policy $\pi$ that learns the arm parameters over time in order to maximize expected reward for each individual.

We measure the performance of a policy $\pi$ by its cumulative expected regret, which is the standard metric in the analysis of bandit algorithms (Lai and Robbins 1985, Auer 2003). In particular, we compare ourselves to an oracle policy $\pi^*$, which knows the arm parameters in advance. Upon observing context $X_t$, the oracle will always choose the best expected arm $i^*_t = \arg\max_{i \in [K]} X_t^\top \beta_i$. Thus, if we choose an arm $\pi_t$ at time $t$, we incur instantaneous expected regret

\[
r_t \;=\; \mathbb{E}\left[\, X_t^\top \beta_{i^*_t} - X_t^\top \beta_{\pi_t} \,\right] ,
\]

which is simply the expected difference in reward between the oracle’s choice and our choice. We seek to minimize the cumulative expected regret $R_T = \sum_{t=1}^{T} r_t$. In other words, we seek to mimic the oracle’s performance by gradually learning the arm parameters.
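The regret metric above is straightforward to compute in simulation. The sketch below, under the same hypothetical setup as the earlier sketch (illustrative dimensions and parameters of our own choosing), evaluates an arbitrary policy against the oracle that knows the arm parameters; a uniformly random policy is included as a baseline whose regret grows linearly in $T$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 3, 2, 1000                      # illustrative sizes, not values from the paper
beta = rng.normal(size=(K, d))

def oracle_arm(x):
    """Arm the oracle would pick: argmax_i x^T beta_i."""
    return int(np.argmax(beta @ x))

def cumulative_regret(policy):
    """Sum of expected reward gaps between the oracle's arm and the policy's arm."""
    regret = 0.0
    for t in range(T):
        x = rng.uniform(-1, 1, size=d)
        chosen = policy(x, t)
        regret += beta[oracle_arm(x)] @ x - beta[chosen] @ x
    return regret

# Example: a policy that picks arms uniformly at random incurs linear regret.
print(cumulative_regret(lambda x, t: rng.integers(K)))
\end{verbatim}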

Additional Notation:

Let $B_r$ be the closed ball of radius $r$ around the origin in $\mathbb{R}^d$, defined as $B_r = \{x \in \mathbb{R}^d : \|x\|_2 \le r\}$, and let the volume of a set $A \subseteq \mathbb{R}^d$ be $\mathrm{vol}(A) = \int_A dx$.

2.1 Assumptions

We now describe the assumptions required for our regret analysis. Some assumptions will be relaxed in later sections of the paper as noted below.

Our first assumption is that the contexts as well as the arm parameters are bounded. This ensures that the maximum regret at any time step is bounded. This is a standard assumption made in the bandit literature (see e.g., Dani et al. 2008). {assumption}[Parameter Set] There exists a positive constant $x_{\max}$ such that the context probability density has no support outside the ball of radius $x_{\max}$, i.e., $\|X_t\|_2 \le x_{\max}$ almost surely. There also exists a positive constant $b$ such that $\|\beta_i\|_2 \le b$ for all $i \in [K]$.

Second, we assume that the context probability density satisfies a margin condition, which comes from the classification literature (Tsybakov 2004). We do not require this assumption to prove convergence of the greedy algorithm, but the rate of convergence differs depending on whether it holds. In particular, Goldenshluger and Zeevi (2009) prove matching upper and lower bounds demonstrating that all bandit algorithms incur regret at least logarithmic in $T$ when the margin condition holds, but their regret can grow polynomially in $T$ when this condition is violated. We can obtain analogous results for the simple greedy algorithm as well (see Appendix 11.2 for details). This is because the margin condition rules out unusual context distributions that become unbounded near the decision boundary (which has zero measure), thereby making learning difficult. {assumption}[Margin Condition] There exists a constant $C_0 > 0$ such that for each pair of arms $i \ne j$ and all $\kappa > 0$:
\[
\mathbb{P}\left(\, 0 < \left| X^\top (\beta_i - \beta_j) \right| \le \kappa \,\right) \;\le\; C_0 \, \kappa .
\]

Thus far, we have made generic assumptions that are standard in the bandit literature. Our third assumption introduces the covariate diversity condition, which is essential for proving that the greedy algorithm always converges to the optimal policy. This condition guarantees that no matter what our arm parameter estimates are at time $t$, there is a diverse set of possible contexts (supported by the context probability density) under which each arm may be chosen.

{assumption}

[Covariate Diversity] There exists a positive constant $\lambda_0$ such that for each vector $w \in \mathbb{R}^d$, the minimum eigenvalue of $\mathbb{E}\left[ X X^\top \, \mathbb{1}\{X^\top w \ge 0\} \right]$ is at least $\lambda_0$, i.e.,
\[
\lambda_{\min}\!\left( \mathbb{E}\left[\, X X^\top \, \mathbb{1}\{X^\top w \ge 0\} \,\right] \right) \;\ge\; \lambda_0 .
\]

Assumption 2.1 holds for a general class of distributions. For instance, if the context probability density is bounded below by a nonzero constant in an open set around the origin, then it satisfies covariate diversity. This includes common distributions such as the uniform or truncated Gaussian distributions. Furthermore, discrete distributions such as the classic Rademacher distribution on binary random variables also satisfy covariate diversity.

{remark}

As discussed in the related literature, the adaptive control theory literature has studied “persistent excitation,” which is reminiscent of the covariate diversity condition without the indicator function $\mathbb{1}\{X^\top w \ge 0\}$. If persistent excitation holds for each arm in a given sample path, then the minimum eigenvalue of the corresponding covariance matrix grows at a suitable rate, and the arm parameter estimate converges over time. However, a greedy policy that does not incorporate exploration may not satisfy this condition, e.g., the greedy policy may “drop” an arm. Assumption 2.1 ensures that there is sufficient randomness in the observed contexts, thereby exogenously ensuring that persistent excitation holds for each arm (see Lemma 3.3), regardless of the sample path taken by the bandit algorithm.

2.2 Examples of Distributions Satisfying Assumptions 2.1-2.1

While Assumptions 2.1-2.1 are generic, it is not straightforward to verify Assumption 2.1. The following lemma provides sufficient conditions (that are easier to check) that guarantee Assumption 2.1. {lemma} If there exists a set that satisfies conditions (a), (b), and (c) given below, then the context distribution satisfies Assumption 2.1.

  • The set is symmetric around the origin; i.e., if $x$ belongs to the set, then so does $-x$.

  • There exist positive constants such that for all , .

  • There exists a positive constant such that . For discrete distributions, the integral is replaced with a sum.

We now use Lemma 2.2 to demonstrate that covariate diversity holds for a wide range of continuous and discrete context distributions, and we explicitly provide the corresponding constants. It is straightforward to verify that these examples (and any product of their distributions) also satisfy Assumptions 2.1 and 2.1.

  1. Uniform Distribution. Consider the uniform distribution over an arbitrary bounded set that contains the origin. Then, there exists some such that . Taking , we note that conditions (a) and (b) of Lemma 2.2 follow immediately. We now check condition (c) by first stating the following lemma (see Appendix 7 for proof): {lemma} for any . By definition, for all , and . Applying Lemma 1, we see that condition (c) of Lemma 2.2 holds with constant .

  2. Truncated Multivariate Gaussian Distribution. Let be a multivariate Gaussian distribution , truncated to for all . The density after renormalization is

    Taking , conditions (a) and (b) of Lemma 2.2 follow immediately. Condition (c) of Lemma 2.2 holds with constant

    as shown in Lemma 7 in Appendix 7.

  3. Gibbs Distributions with Positive Covariance. Consider the set equipped with a discrete probability density , which satisfies

    for any . Here, are (deterministic) parameters, and is a normalization term known as the partition function in the statistical physics literature. We define , satisfying conditions (a) and (b) of Lemma 2.2. Furthermore, condition (c) follows by definition since the covariance of the distribution is positive-definite. This class of distributions includes the well-known Rademacher distribution (by setting all ).

A special case under which the conditions in Lemma 2.2 hold is when the chosen set is the entire support of the distribution; this is the case in the Gaussian and Gibbs examples above, where the set is the truncated support and $\{-1,+1\}^d$, respectively. Now, let the first random vector satisfy this special case and have mean zero. Let a second vector be independent of the first and satisfy the general form of Lemma 2.2. Then it is easy to see that the concatenation of the two vectors also satisfies the conditions in Lemma 2.2: parts (a) and (b) clearly hold; to see why (c) holds, note that the off-diagonal blocks of the corresponding matrix are zero since the first vector has mean zero. This construction illustrates how covariate diversity works for distributions that contain a mixture of discrete and continuous components.
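As an informal numerical illustration of Assumption 2.1, the sketch below estimates $\lambda_{\min}\big(\mathbb{E}[X X^\top \mathbb{1}\{X^\top w \ge 0\}]\big)$ by Monte Carlo over random directions $w$, for a uniform distribution on a cube and for Rademacher coordinates; both estimates stay bounded away from zero, consistent with the examples above. The sample sizes and distributions here are illustrative choices of ours, not part of the paper's analysis.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 200_000                         # dimension and Monte Carlo sample size (illustrative)

def min_eig_over_halfspaces(sample_contexts, n_directions=50):
    """Estimate min over random w of lambda_min(E[X X^T 1{X^T w >= 0}])."""
    worst = np.inf
    for _ in range(n_directions):
        w = rng.normal(size=d)
        mask = sample_contexts @ w >= 0                     # contexts in the half space
        X = sample_contexts[mask]
        M = X.T @ X / len(sample_contexts)                  # empirical E[X X^T 1{X^T w >= 0}]
        worst = min(worst, np.linalg.eigvalsh(M)[0])
    return worst

uniform_contexts = rng.uniform(-1, 1, size=(n, d))          # uniform on [-1, 1]^d
rademacher_contexts = rng.choice([-1.0, 1.0], size=(n, d))  # Rademacher coordinates
print(min_eig_over_halfspaces(uniform_contexts))            # bounded away from zero
print(min_eig_over_halfspaces(rademacher_contexts))         # also bounded away from zero
\end{verbatim}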

3 Greedy Bandit

Notation. Let the design matrix $\mathbf{X}$ be the $t \times d$ matrix whose rows are $X_1, \dots, X_t$. Similarly, for each arm $i$, let $\mathbf{Y}_i$ be the length-$t$ vector of potential outcomes $(Y_{i,1}, \dots, Y_{i,t})$. Since we only obtain feedback when arm $i$ is played, entries of $\mathbf{Y}_i$ may be missing. For any $t$, let $\mathcal{S}_{i,t}$ be the set of times when arm $i$ was played within the first $t$ time steps. For a set of time periods $\mathcal{S}$, we use the notation $\mathbf{X}(\mathcal{S})$, $\mathbf{Y}_i(\mathcal{S})$, and $\boldsymbol{\varepsilon}_i(\mathcal{S})$ to refer to the design matrix, the outcome vector, and the vector of idiosyncratic shocks, respectively, for observations restricted to time periods in $\mathcal{S}$. We estimate $\beta_i$ at time $t$ based on $\mathbf{X}(\mathcal{S}_{i,t})$ and $\mathbf{Y}_i(\mathcal{S}_{i,t})$, using ordinary least squares (OLS) regression that is defined below. We denote this estimator $\hat{\beta}\big(\mathbf{X}(\mathcal{S}_{i,t}), \mathbf{Y}_i(\mathcal{S}_{i,t})\big)$, or $\hat{\beta}_{i,t}$ for short.

{definition}

[OLS Estimator] For any design matrix $\mathbf{X}(\mathcal{S})$ and outcome vector $\mathbf{Y}(\mathcal{S})$, the OLS estimator is $\hat{\beta}\big(\mathbf{X}(\mathcal{S}), \mathbf{Y}(\mathcal{S})\big) \in \arg\min_{\beta} \big\| \mathbf{Y}(\mathcal{S}) - \mathbf{X}(\mathcal{S}) \beta \big\|_2^2$, which is equal to $\big( \mathbf{X}(\mathcal{S})^\top \mathbf{X}(\mathcal{S}) \big)^{-1} \mathbf{X}(\mathcal{S})^\top \mathbf{Y}(\mathcal{S})$ when $\mathbf{X}(\mathcal{S})^\top \mathbf{X}(\mathcal{S})$ is invertible.
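A minimal sketch of this estimator, assuming an arm's past contexts and rewards are stored as NumPy arrays; the singularity check mirrors the invertibility condition in the definition, and the function name is our own.

\begin{verbatim}
import numpy as np

def ols_estimate(X, y):
    """OLS estimate (X^T X)^{-1} X^T y; returns None while X^T X is singular."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    if X.ndim != 2 or len(y) == 0:
        return None
    G = X.T @ X
    if np.linalg.matrix_rank(G) < X.shape[1]:   # covariance matrix not yet invertible
        return None
    return np.linalg.solve(G, X.T @ y)

# Example: with fewer than d linearly independent contexts, the estimate is withheld.
print(ols_estimate(np.array([[1.0, 0.0]]), np.array([0.3])))                 # None
print(ols_estimate(np.array([[1.0, 0.0], [0.0, 2.0]]), np.array([0.3, 1.0])))
\end{verbatim}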

We now describe the greedy algorithm and its performance guarantees under covariate diversity.

3.1 Algorithm

At each time step, we observe a new context $X_t$ and use the current arm estimates to play the arm with the highest estimated reward, i.e., $\pi_t = \arg\max_{i \in [K]} X_t^\top \hat{\beta}_{i,t}$. Upon playing arm $\pi_t$, a reward $Y_{\pi_t,t}$ is observed. We then update our estimate for arm $\pi_t$, but we need not update the arm parameter estimates for the other arms, since $\mathcal{S}_{i,t+1} = \mathcal{S}_{i,t}$ for $i \ne \pi_t$. The update formula is given by

\[
\hat{\beta}_{\pi_t, t+1} \;=\; \big( \mathbf{X}(\mathcal{S}_{\pi_t,t+1})^\top \mathbf{X}(\mathcal{S}_{\pi_t,t+1}) \big)^{-1} \mathbf{X}(\mathcal{S}_{\pi_t,t+1})^\top \mathbf{Y}_{\pi_t}(\mathcal{S}_{\pi_t,t+1}) .
\]

We do not update the parameter of arm $\pi_t$ if $\mathbf{X}(\mathcal{S}_{\pi_t,t+1})^\top \mathbf{X}(\mathcal{S}_{\pi_t,t+1})$ is not invertible (see Remark 3.1 below for alternative choices). The pseudo-code for the algorithm is given in Algorithm 1.

\SingleSpacedXI
Initialize $\hat{\beta}_{i,0}$ (arbitrarily) for all $i \in [K]$
for $t = 1, 2, \dots$ do
     Observe $X_t$
     $\pi_t \leftarrow \arg\max_{i \in [K]} X_t^\top \hat{\beta}_{i,t}$ (break ties randomly)
     Play arm $\pi_t$, observe $Y_{\pi_t,t}$
     If $\mathbf{X}(\mathcal{S}_{\pi_t,t+1})^\top \mathbf{X}(\mathcal{S}_{\pi_t,t+1})$ is invertible, update the arm parameter via
     $\hat{\beta}_{\pi_t, t+1} \leftarrow \big( \mathbf{X}(\mathcal{S}_{\pi_t,t+1})^\top \mathbf{X}(\mathcal{S}_{\pi_t,t+1}) \big)^{-1} \mathbf{X}(\mathcal{S}_{\pi_t,t+1})^\top \mathbf{Y}_{\pi_t}(\mathcal{S}_{\pi_t,t+1})$
end for
Algorithm 1 Greedy Bandit
{remark}

In Algorithm 1, we only update the arm parameter from its (arbitrary) initial value when the covariance matrix is invertible. However, one can alternatively update the parameter using ridge regression or a pseudo-inverse to improve empirical performance. Our theoretical analysis is unaffected by this choice: as we will show in Lemma 3.3, no matter what estimator we use, covariate diversity ensures that the probability that these covariance matrices are singular decays rapidly in $t$, thereby contributing at most an additive constant factor to the cumulative regret (the second term in Lemma 3.3).
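The following sketch puts the pieces together into a runnable version of Algorithm 1 for the linear model, maintaining each arm's Gram matrix and updating the OLS estimate only when that matrix is invertible. The uniform context distribution (which satisfies covariate diversity), the problem sizes, and the noise level are illustrative assumptions of ours, not prescriptions from the paper.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
d, K, T, sigma = 3, 2, 5000, 0.5              # illustrative problem sizes
beta = rng.normal(size=(K, d))                # true (unknown) arm parameters

# Per-arm sufficient statistics: Gram matrix X^T X and moment vector X^T y.
gram = [np.zeros((d, d)) for _ in range(K)]
moment = [np.zeros(d) for _ in range(K)]
beta_hat = np.zeros((K, d))                   # arbitrary initial estimates

regret = 0.0
for t in range(T):
    x = rng.uniform(-1, 1, size=d)            # context with covariate diversity (uniform cube)
    scores = beta_hat @ x
    arm = int(rng.choice(np.flatnonzero(scores == scores.max())))  # greedy, ties broken randomly
    y = x @ beta[arm] + rng.normal(scale=sigma)
    # Update sufficient statistics and, if the Gram matrix is invertible, the OLS estimate.
    gram[arm] += np.outer(x, x)
    moment[arm] += y * x
    if np.linalg.matrix_rank(gram[arm]) == d:
        beta_hat[arm] = np.linalg.solve(gram[arm], moment[arm])
    regret += (beta @ x).max() - beta[arm] @ x
print("cumulative regret:", regret)
\end{verbatim}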

3.2 Performance of Greedy Bandit with Covariate Diversity

We now establish a finite-sample upper bound on the cumulative expected regret of the Greedy Bandit for the two-armed contextual bandit when covariate diversity is satisfied. {theorem} If $K = 2$ and Assumptions 2.1-2.1 are satisfied, the cumulative expected regret of the Greedy Bandit at time $T$ is at most

(3)

where the constant is defined in Assumption 2.1 and

(4)

We prove an analogous result for the greedy algorithm in the case where arm rewards are given by generalized linear models (see §3.4 and Proposition 3.4 for details).

Goldenshluger and Zeevi (2013) established a lower bound of $\Omega(\log T)$ regret for any algorithm in a two-armed contextual bandit. While they do not make Assumption 2.1, the distribution used in their proof satisfies Assumption 2.1; thus their result applies to our setting. Combined with our upper bound (Theorem 3.2), we conclude that the Greedy Bandit is rate optimal.

{remark}

Our upper bound in Theorem 3.2 scales as in the context dimension . This is because the term scales as for standard distributions satisfying covariate diversity (e.g., truncated multivariate Gaussian or uniform distribution). Thus, our upper bound for the Greedy Bandit is slightly worse (by a factor of ) than the upper bound of established in Bastani and Bayati (2015) for the OLS Bandit.

3.3 Proof of Theorem 3.2

Notation. Let denote the true set of contexts where arm is optimal. Then, let denote the estimated set of contexts at time where arm appears optimal; in other words, if the context , then the greedy policy will choose arm at time . (since we assume without loss of generality that ties are broken randomly as selected by and thus, and partition the context space .)

For any $t$, consider the $\sigma$-algebra containing all observed information up to time $t$ before taking an action; our policy $\pi_t$ is measurable with respect to this $\sigma$-algebra. Furthermore, consider the $\sigma$-algebra containing all observed information before time $t+1$.

Define as the sample covariance matrix for observations from arm up to time . We may compare this to the expected covariance matrix for arm under the greedy policy, defined as .

Proof Strategy. Intuitively, covariate diversity (Assumption 2.1) guarantees that there is sufficient randomness in the observed contexts, which creates natural “exploration.” In particular, no matter what our current arm parameter estimates are at time , each arm will be chosen by the greedy policy with at least some constant probability (with respect to ) depending on the observed context. We formalize this intuition in the following lemma. {lemma} Given Assumptions 2.1 and 2.1, the following holds for any :

Proof.

Proof. For any observed context , note that by Assumption 2.1. Re-stating Assumption 2.1 for each , we can write

since the indicator function and are both nonnegative. \Halmos

Taking , Lemma 3.3 implies that arm 1 will be pulled with probability at least at each time ; the claim holds analogously for arm 2. Thus, each arm will be played at least times in expectation. However, this is not sufficient to guarantee that each arm parameter estimate converges to the true parameter . In Lemma 3.3, we establish a sufficient condition for convergence.

First, we show that covariate diversity guarantees that the minimum eigenvalue of each arm’s expected covariance matrix under the greedy policy grows linearly with $t$. This result implies that not only does each arm receive a sufficient number of observations under the greedy policy, but also that these observations are sufficiently diverse (in expectation). Next, we apply a standard matrix concentration inequality (see Lemma 8 in Appendix 8) to show that the minimum eigenvalue of each arm’s sample covariance matrix also grows linearly with $t$. This will guarantee the convergence of our regression estimates for each arm parameter.
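As an informal check of this behavior, the sketch below reuses the greedy loop from the earlier sketch and prints $\lambda_{\min}$ of each arm's sample covariance matrix divided by $t$; under a context distribution with covariate diversity this ratio should settle at a positive constant, consistent with the linear growth described here. All quantities are illustrative assumptions, not the paper's constants.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
d, K, T, sigma = 3, 2, 4000, 0.5
beta = rng.normal(size=(K, d))
gram = [np.zeros((d, d)) for _ in range(K)]
moment = [np.zeros(d) for _ in range(K)]
beta_hat = np.zeros((K, d))

for t in range(1, T + 1):
    x = rng.uniform(-1, 1, size=d)            # context distribution with covariate diversity
    arm = int(np.argmax(beta_hat @ x))        # purely greedy choice
    y = x @ beta[arm] + rng.normal(scale=sigma)
    gram[arm] += np.outer(x, x)
    moment[arm] += y * x
    if np.linalg.matrix_rank(gram[arm]) == d:
        beta_hat[arm] = np.linalg.solve(gram[arm], moment[arm])
    if t % 1000 == 0:
        # lambda_min of each arm's sample covariance matrix, normalized by t:
        # under covariate diversity this ratio stays bounded away from zero.
        ratios = [np.linalg.eigvalsh(gram[i])[0] / t for i in range(K)]
        print(t, [round(r, 3) for r in ratios])
\end{verbatim}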

{lemma}

Take . Given Assumptions 2.1 and 2.1, the following holds for the minimum eigenvalue of the empirical covariance matrix of each arm :

Proof.

Proof. Without loss of generality, take . For any , let ; by the greedy policy, we pull arm 1 if and arm 2 if (ties are broken randomly using a fair coin flip ). Thus, the estimated set of optimal contexts for arm 1 is

First, we seek to bound the minimum eigenvalue of the expected covariance matrix . Expanding one term in the sum, we can write

where the last line follows from Assumption 2.1. Since the minimum eigenvalue function is concave over positive semi-definite matrices, we can write

Next, we seek to use matrix concentration inequalities (Lemma 8 in Appendix 8) to bound the minimum eigenvalue of the sample covariance matrix . To apply the concentration inequality, we also need to show an upper bound on the maximum eigenvalue of ; this follows trivially from Assumption 2.1 using the Cauchy-Schwarz inequality:

We can now apply Lemma 8, taking the finite adapted sequence to be , so that and . We also take and . Thus, we have

using the fact . As we showed earlier, . This proves the result. \Halmos

Next, Lemma 3.3 guarantees with high probability that each arm’s parameter estimate has small error with respect to the true parameter if the minimum eigenvalue of the sample covariance matrix has a positive lower bound. Note that we cannot directly use results on the convergence of the OLS estimator since the set of samples from arm $i$ at time $t$ are not i.i.d. (we use the arm estimate to decide whether to play arm $i$ at time $t$; thus, the samples in $\mathcal{S}_{i,t}$ are correlated). Instead, we use a Bernstein concentration inequality to guarantee convergence with adaptive observations. In the following lemma, note that is any deterministic upper bound on the total number of times that arm is pulled until time . In the proof of Lemma 3.3, we will take ; however, we state the lemma for general for later use in our probabilistic guarantees.

{lemma}

Taking and , we have for all ,

Proof.

Proof of Lemma 3.3.

We begin by noting that if the event holds, then

As a result, we can write

where denotes the column of . We can expand

For simplicity, define . First, note that is -subgaussian, since is -subgaussian and . Next, note that and are both measurable; taking the expectation gives . Thus, the sequence is a martingale difference sequence adapted to the filtration . Applying a standard Bernstein concentration inequality (see Lemma 8 in Appendix 8), we can write

where is an upper bound on the number of nonzero terms in the above sum, i.e., an upper bound on . This yields the desired result. \Halmos

To summarize, Lemma 3.3 provides a lower bound (with high probability) on the minimum eigenvalue of the sample covariance matrix. Lemma 3.3 states that if such a bound holds on the minimum eigenvalue of the sample covariance matrix, then the estimated parameter is close to the true parameter (with high probability). Having established convergence of the arm parameters under the Greedy Bandit, one can use a standard peeling argument (as in Goldenshluger and Zeevi (2013)) to bound the instantaneous expected regret of the Greedy Bandit algorithm. {lemma} Define . Then, the instantaneous expected regret of the Greedy Bandit at time $t$ satisfies

where , is defined in Assumption 2.1, and is defined in Theorem 3.2. Note that can be upper bounded using Lemma 3.3. Substituting this in the upper bound derived on in Lemma 3.3, and using finishes the proof of Theorem 3.2.

3.4 Generalized Linear Rewards

In this section, we discuss how our results generalize when the arm rewards are given by a generalized linear model (GLM). Now, upon playing arm $i$ after observing context $X_t$, the decision-maker realizes a reward with expectation $\mu(X_t^\top \beta_i)$, where $\mu$ is the inverse link function. For instance, in logistic regression, this would correspond to a binary reward with $\mu(z) = 1/(1+e^{-z})$; in Poisson regression, this would correspond to an integer-valued reward with $\mu(z) = e^z$; in linear regression, this would correspond to $\mu(z) = z$.

In order to describe the greedy policy in this setting, we give a brief overview of the exponential family, generalized linear model, and maximum likelihood estimation.

Exponential family.

A univariate probability distribution belongs to the canonical exponential family if its density with respect to a reference measure (e.g., Lebesgue measure) is given by

\[
p(y \,;\, \theta) \;=\; \exp\!\big( y\,\theta - m(\theta) + h(y) \big) , \tag{5}
\]

where $\theta$ is the underlying real-valued parameter, $m$ and $h$ are real-valued functions, and $m$ is assumed to be twice continuously differentiable. For simplicity, we assume the reference measure is the Lebesgue measure. It is well known that if $Y$ is distributed according to the above canonical exponential family, then it satisfies $\mathbb{E}[Y] = m'(\theta)$ and $\mathrm{Var}(Y) = m''(\theta)$, where $m'$ and $m''$ denote the first and second derivatives of the function $m$ with respect to $\theta$, and $m$ is strictly convex (see e.g., Lehmann and Casella 1998).

Generalized linear model (GLM).

The natural connection between exponential families and GLMs is provided by assuming that the density of $Y_{i,t}$ for the context $X_t$ and arm $i$ is given by (5) with parameter $\theta = X_t^\top \beta_i$. In other words, the reward upon playing arm $i$ for context $X_t$ is $Y_{i,t}$ with density

\[
p\big( y \,;\, X_t^\top \beta_i \big) \;=\; \exp\!\big( y\, X_t^\top \beta_i - m(X_t^\top \beta_i) + h(y) \big) .
\]

Using the aforementioned properties of the exponential family, $\mathbb{E}[Y_{i,t} \mid X_t] = m'(X_t^\top \beta_i)$, i.e., the inverse link function is $\mu = m'$. This implies that $\mu$ is continuously differentiable and its derivative is $m''$. Thus, $\mu$ is strictly increasing since $m$ is strictly convex.

Maximum likelihood estimation.

Suppose that we have samples $\{(X_s, Y_s)\}_{s \in \mathcal{S}}$ from a distribution with density given by the GLM above. The maximum likelihood estimator of $\beta$ based on this sample is given by

\[
\hat{\beta} \;\in\; \arg\max_{\beta} \; \sum_{s \in \mathcal{S}} \Big[ Y_s \, X_s^\top \beta \,-\, m\big( X_s^\top \beta \big) \Big] . \tag{6}
\]

Since $m$ is strictly convex (so the log-likelihood is strictly concave), the solution to (6) can be obtained efficiently (see e.g., McCullagh and Nelder 1989). It is not hard to see that whenever $\sum_{s \in \mathcal{S}} X_s X_s^\top$ is positive definite, this solution is unique (see Appendix 11.1 for a proof). We denote this unique solution by $\hat{\beta}\big(\mathbf{X}(\mathcal{S}), \mathbf{Y}(\mathcal{S})\big)$.
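For the logistic case, the maximum likelihood problem in (6) can be solved with an off-the-shelf optimizer, since the log-likelihood is concave. The sketch below uses SciPy's BFGS routine on synthetic data; the data-generating parameters are illustrative assumptions, and the log-partition function $m(z) = \log(1 + e^z)$ corresponds to binary rewards.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
d, n = 3, 500
beta_true = rng.normal(size=d)
X = rng.uniform(-1, 1, size=(n, d))
p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, p)                        # logistic GLM rewards

def neg_log_likelihood(b):
    """Negative of sum_s [ y_s x_s^T b - m(x_s^T b) ], with m(z) = log(1 + e^z)."""
    z = X @ b
    return -(y @ z - np.logaddexp(0, z).sum())

beta_mle = minimize(neg_log_likelihood, np.zeros(d), method="BFGS").x
print(beta_mle)
print(beta_true)
\end{verbatim}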

Now we are ready to generalize the Greedy Bandit algorithm when the arm rewards are given by a GLM. Using similar notation as in the linear reward case, given the estimates $\hat{\beta}_{i,t}$ at time $t$, the greedy policy plays the arm that maximizes the expected estimated reward, i.e.,

\[
\pi_t \;=\; \arg\max_{i \in [K]} \mu\big( X_t^\top \hat{\beta}_{i,t} \big) .
\]

Since $\mu$ is a strictly increasing function, this translates to $\pi_t = \arg\max_{i \in [K]} X_t^\top \hat{\beta}_{i,t}$.

\SingleSpacedXI
Input parameters: inverse link function $\mu$
Initialize $\hat{\beta}_{i,0}$ (arbitrarily) for all $i \in [K]$
for $t = 1, 2, \dots$ do
     Observe $X_t$
     $\pi_t \leftarrow \arg\max_{i \in [K]} X_t^\top \hat{\beta}_{i,t}$ (break ties randomly)
     Play arm $\pi_t$, observe $Y_{\pi_t,t}$
     Update $\hat{\beta}_{\pi_t, t+1}$ as the solution to the maximum likelihood estimation in Equation (6) over the observations in $\mathcal{S}_{\pi_t, t+1}$
end for
Algorithm 2 Greedy Bandit for Generalized Linear Models
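A runnable sketch of Algorithm 2 for logistic rewards is given below; it refits each arm's maximum likelihood estimate from (6) once the arm has a handful of observations, adding a tiny ridge penalty of our own to keep early fits well-posed (in the spirit of the alternative estimators mentioned in Remark 3.1). The problem sizes, refit threshold, and context distribution are illustrative choices rather than prescriptions from the paper.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
d, K, T = 3, 2, 500
beta = rng.normal(size=(K, d))
data = [([], []) for _ in range(K)]           # per-arm (contexts, rewards)
beta_hat = np.zeros((K, d))

def fit_logistic_mle(X, y):
    """Maximize the logistic log-likelihood of (6); small ridge term for stability."""
    X, y = np.asarray(X), np.asarray(y)
    nll = lambda b: -(y @ (X @ b) - np.logaddexp(0, X @ b).sum()) + 1e-3 * (b @ b)
    return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

for t in range(T):
    x = rng.uniform(-1, 1, size=d)
    arm = int(np.argmax(beta_hat @ x))        # mu is increasing, so compare X_t^T beta_hat_i directly
    p = 1.0 / (1.0 + np.exp(-(x @ beta[arm])))
    y = rng.binomial(1, p)
    data[arm][0].append(x)
    data[arm][1].append(y)
    if len(data[arm][1]) >= 10:               # wait for enough observations before refitting
        beta_hat[arm] = fit_logistic_mle(data[arm][0], data[arm][1])

print(beta_hat)
print(beta)
\end{verbatim}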

Next, we state the following result (proved in Appendix 11.1) that Algorithm 2 achieves logarithmic regret when $K = 2$ and the covariate diversity assumption holds. {proposition} Consider arm rewards given by a GLM with $\sigma$-subgaussian noise $\varepsilon_{i,t}$. Define . If $K = 2$ and Assumptions 2.1-2.1 are satisfied, the cumulative expected regret of Algorithm 2 at time $T$ is at most